APPLICATIONS OF BIG DATA IN HEALTHCARE
Theory and Practice

Edited by
ASHISH KHANNA Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, Guru Gobind Singh Indraprastha University, Delhi, India
DEEPAK GUPTA Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, Guru Gobind Singh Indraprastha University, Delhi, India
NILANJAN DEY Department of Computer Science and Engineering, JIS University, Kolkata
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2021 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-820203-6

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Mara Conner
Acquisitions Editor: Chris Katsaropoulos
Editorial Project Manager: Fernanda A. Oliveira
Production Project Manager: Maria Bernard
Cover Designer: Miles Hitchen
Typeset by MPS Limited, Chennai, India
Contents

List of contributors
About the authors
Preface

1 Big Data classification: techniques and tools
Pijush Kanti Dutta Pramanik, Saurabh Pal, Moutan Mukhopadhyay and Simar Preet Singh
1.1 Introduction
1.2 Big Data classification
  1.2.1 Definition of classification
  1.2.2 Need for classification in Big Data
  1.2.3 Challenges in Big Data classification
  1.2.4 Types of classification
  1.2.5 Big Data classification approaches
  1.2.6 Phases of classification
  1.2.7 Classification pattern
1.3 Big Data classification techniques
  1.3.1 Traditional learning techniques
  1.3.2 Evolutionary techniques
  1.3.3 Advanced learning techniques
1.4 Big Data classification tools and platforms
  1.4.1 Shogun
  1.4.2 Scikit-learn
  1.4.3 TensorFlow
  1.4.4 Pattern
  1.4.5 Weka
  1.4.6 BigML
  1.4.7 DataRobot
  1.4.8 Google Cloud AutoML
  1.4.9 IBM Watson Studio
  1.4.10 MLJAR
  1.4.11 RapidMiner
  1.4.12 Tableau
  1.4.13 Azure Machine Learning Studio
  1.4.14 H2O Driverless AI
  1.4.15 Apache Mahout
  1.4.16 Apache Spark (MLlib)
  1.4.17 Apache Storm
1.5 Conclusion
References

2 Big Data Analytics for healthcare: theory and applications
Shivam Bachhety, Shivani Kapania and Rachna Jain
2.1 Introduction to Big Data
  2.1.1 Motivation
2.2 Big Data Analytics
  2.2.1 Techniques and technologies
  2.2.2 How Big Data Analytics work
  2.2.3 Uses and challenges
2.3 Big Data in healthcare sector
2.4 Medical imaging
2.5 Methodology
2.6 Big Data Analytics: platforms and tools
  2.6.1 Cloud storage
  2.6.2 NoSQL databases
  2.6.3 Hadoop
  2.6.4 Hive
  2.6.5 Pig
  2.6.6 Cassandra
2.7 Opportunities for Big Data in healthcare
  2.7.1 Quality of treatment
  2.7.2 Early disease detection
  2.7.3 Data accessibility and decision-making
  2.7.4 Cost reduction
2.8 Challenges to Big Data Analytics in healthcare
  2.8.1 Data acquisition and modeling
  2.8.2 Data storage and transfer
  2.8.3 Data security and risk
  2.8.4 Querying and reporting
  2.8.5 Technology incorporation and miscommunication gaps
2.9 Applications of Big Data in healthcare industry
  2.9.1 Advanced patient monitoring and alerts
  2.9.2 Management and operational efficiency
  2.9.3 Fraud and error prevention
  2.9.4 Enhanced patient engagement
  2.9.5 Smart healthcare intelligence
2.10 Future of Big Data in healthcare
References

3 Application of tools and techniques of Big data analytics for healthcare system
Samarth Chugh, Shubham Kumaram and Deepak Kumar Sharma
3.1 Introduction
3.2 Need and past work
  3.2.1 Importance and motivation
  3.2.2 Background
3.3 Methods of application
  3.3.1 Feature extraction
  3.3.2 Imputation
3.4 Result domains
  3.4.1 Bioinformatics
  3.4.2 Neuroinformatics
  3.4.3 Clinical informatics
  3.4.4 MRI data for prediction
  3.4.5 ICU readmission and mortality rates
  3.4.6 Analyzing real-time data streams for diagnosis and prognosis
  3.4.7 Public health informatics
  3.4.8 Search query data
  3.4.9 Social media analytics
3.5 Discussion
  3.5.1 Past shortcomings
3.6 Conclusion
References

4 Healthcare and medical Big Data analytics
Blagoj Ristevski and Snezana Savoska
4.1 Introduction
4.2 Medical and healthcare Big Data
  4.2.1 Exposome data
4.3 Big Data Analytics
  4.3.1 Unsupervised learning
  4.3.2 Supervised learning
  4.3.3 Semisupervised learning
4.4 Healthcare and medical data coding and taxonomy
4.5 Medical and healthcare data interchange standards
4.6 Framework for healthcare information system based on Big Data
4.7 Big Data security, privacy, and governance
4.8 Discussion and further work
References

5 Big Data analytics in medical imaging
Siddhant Bagga, Sarthak Gupta and Deepak Kumar Sharma
5.1 Introduction
  5.1.1 Medical imaging
  5.1.2 Challenges in medical imaging
5.2 Big Data analytics in medical imaging
  5.2.1 Analytical methods
  5.2.2 Collection, sharing, and compression
5.3 Artificial intelligence for analytics of medical images
5.4 Tools and frameworks
  5.4.1 MapReduce
  5.4.2 Hadoop
  5.4.3 Yet Another Resource Negotiator
  5.4.4 Spark
5.5 Conclusion
References

6 Big Data analytics and artificial intelligence in mental healthcare
Ariel Rosenfeld, David Benrimoh, Caitrin Armstrong, Nykan Mirchi, Timothe Langlois-Therrien, Colleen Rollins, Myriam Tanguay-Sela, Joseph Mehltretter, Robert Fratila, Sonia Israel, Emily Snook, Kelly Perlman, Akiva Kleinerman, Bechara Saab, Mark Thoburn, Cheryl Gabbay and Amit Yaniv-Rosenfeld
6.1 Introduction
6.2 What makes mental healthcare complex?
6.3 Opportunities and limitations for artificial intelligence and big data in mental health
  6.3.1 Diagnosis
  6.3.2 Prognosis
  6.3.3 Treatment selection
  6.3.4 Treatment delivery
  6.3.5 Monitoring
  6.3.6 Ethical considerations
6.4 Conclusions
Acknowledgments
References

7 Big Data based breast cancer prediction using kernel support vector machine with the Gray Wolf Optimization algorithm
T. Jayasankar, N.B. Prakash and G.R. Hemalakshmi
7.1 Introduction
7.2 Literature survey
7.3 Proposed methodology
  7.3.1 Preprocessing
  7.3.2 Feature selection
  7.3.3 Kernel based support vector machine with Gray Wolf Optimization
  7.3.4 Dataset description
7.4 Result and discussion
  7.4.1 Comparison measures
7.5 Conclusion
References

8 Big Data based medical data classification using oppositional Gray Wolf Optimization with kernel ridge regression
N. Krishnaraj, Sujatha Krishamoorthy, S. Venkata Lakshmi, C. Sharon Roji Priya, Vandna Dahiya and K. Shankar
8.1 Introduction
8.2 Literature survey
8.3 Proposed methodology
  8.3.1 Feature reduction
  8.3.2 Feature selection
  8.3.3 Classification using OGWOKRRG
8.4 Result and discussion
  8.4.1 Classification accuracy
  8.4.2 Sensitivity
  8.4.3 Specificity
  8.4.4 Performance evaluation
  8.4.5 Comparative analysis
8.5 Conclusion
References

9 An analytical hierarchical process evaluation on parameters Apps-based Data Analytics for healthcare services
Monika Arora, Radhika Adholeya and Swati Sharan
9.1 Introduction
9.2 Review of literature
9.3 Research methodology
  9.3.1 Analytic hierarchy processing model
  9.3.2 Analytic hierarchy processing technique
9.4 Proposed analytical hierarchy processing model of successful healthcare
  9.4.1 Hospital/lab (C2)
  9.4.2 Analytic hierarchy processing model description
9.5 Conclusion
Appendix 1: Big data analytics for healthcare
References

10 Firefly—Binary Cuckoo Search Technique based heart disease prediction in Big Data Analytics
G. Manjula, R. Gopi, S. Sheeba Rani, Shiva Shankar Reddy and E. Dhiravida Chelvi
10.1 Introduction
10.2 Literature survey
10.3 Proposed methodology
  10.3.1 Preprocessing
  10.3.2 Optimal feature selection using bacterial foraging optimization
  10.3.3 Optimization by using firefly—Binary Cuckoo Search
  10.3.4 Dataset description
10.4 Result and discussion
  10.4.1 Comparative analysis
10.5 Conclusion
References
Further reading

11 Hybrid technique for heart diseases diagnosis based on convolution neural network and long short-term memory
Abdelmegeid Amin Ali, Hassan Shaban Hassan, Eman M. Anwar and Ashish Khanna
11.1 Introduction
  11.1.1 Heart disease
  11.1.2 Traditional ways
  11.1.3 The classification techniques
11.2 Literature review
11.3 The proposed technique
  11.3.1 Preprocessing data
  11.3.2 Building classifier model
11.4 Experimental results and discussion
  11.4.1 Evaluation criteria
11.5 Results analysis and discussion
  11.5.1 Scenario 1
11.6 Conclusion
References
Further reading

Index
List of contributors

Radhika Adholeya, Uniworld Care, Dwarka, New Delhi
Abdelmegeid Amin Ali, Faculty of Computer and Information, Department of Computer Science, Minia University, Egypt
Eman M. Anwar, Faculty of Computer and Information, Department of Information System, Minia University, Egypt
Caitrin Armstrong, Aifred Health, Montréal, Canada
Monika Arora, Apeejay School of Management, Dwarka, New Delhi
Shivam Bachhety, Department of Computer Science and Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India
Siddhant Bagga, Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
David Benrimoh, McGill University, Montréal, Canada
E. Dhiravida Chelvi, Department of Electronics and Communication Engineering, Mohamed Sathak A.J. College of Engineering, Chennai, India
Samarth Chugh, Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
Vandna Dahiya, Department of Education, Government of National Capital Territory of Delhi, Bangalore, India
Robert Fratila, Aifred Health, Montréal, Canada
Cheryl Gabbay, McGill University, Montréal, Canada
R. Gopi, Department of Computer Science and Engineering, Dhanalakshmi Srinivasan Engineering College, Perambalur, India
Sarthak Gupta, Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
Hassan Shaban Hassan, Faculty of Computer and Information, Department of Computer Science, Minia University, Egypt
G.R. Hemalakshmi, Department of Computer Science and Engineering, National Engineering College, Kovilpatti, India
Sonia Israel, Aifred Health, Montréal, Canada
Rachna Jain, Department of Computer Science and Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India
T. Jayasankar, Electronics and Communication Engineering Department, University College of Engineering, BIT Campus, Anna University, Tiruchirappalli, India
Shivani Kapania, Department of Computer Science and Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India
Ashish Khanna, Maharaja Agrasen Institute of Technology
Akiva Kleinerman, Bar-Ilan University, Ramat-Gan, Israel
Sujatha Krishamoorthy, Department of Computer Science, Wenzhou-Kean University, Wenzhou, P.R. China
N. Krishnaraj, School of Computing, SRM Institute of Science and Technology, Kattankulathur, India
Shubham Kumaram, Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
Timothe Langlois-Therrien, Aifred Health, Montréal, Canada
G. Manjula, Department of Information Science & Engineering, Dayananda Sagar Academy of Technology & Management, Bengaluru, India
Joseph Mehltretter, Aifred Health, Montréal, Canada
Nykan Mirchi, Aifred Health, Montréal, Canada
Moutan Mukhopadhyay, Bengal Institute of Technology, Kolkata, India
Saurabh Pal, Bengal Institute of Technology, Kolkata, India
Kelly Perlman, Aifred Health, Montréal, Canada
N.B. Prakash, Department of Electrical and Electronics Engineering, National Engineering College, Kovilpatti, India
Pijush Kanti Dutta Pramanik, National Institute of Technology, Durgapur, India
S. Sheeba Rani, Department of Electrical and Electronics Engineering, Sri Krishna College of Engineering and Technology, Coimbatore, India
Shiva Shankar Reddy, Department of Computer Science and Engineering, SRKR Engineering College, Bhimavaram, India
Blagoj Ristevski, Faculty of Information and Communication Technologies Bitola, University "St. Kliment Ohridski" - Bitola, Republic of Macedonia
Colleen Rollins, Aifred Health, Montréal, Canada
Ariel Rosenfeld, Bar-Ilan University, Ramat-Gan, Israel
Bechara Saab, Mobio Interactive, Toronto, Canada
Snezana Savoska, Faculty of Information and Communication Technologies Bitola, University "St. Kliment Ohridski" - Bitola, Republic of Macedonia
K. Shankar, Department of Computer Applications, Alagappa University, Karaikudi, India
Swati Sharan, Uniworld Care, Dwarka, New Delhi
Deepak Kumar Sharma, Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
C. Sharon Roji Priya, Computer Science and Engineering Department, Sri Sairam College of Engineering, Bangalore, India
Simar Preet Singh, Chandigarh Engineering College, Landran, India
Emily Snook, Aifred Health, Montréal, Canada
Myriam Tanguay-Sela, Aifred Health, Montréal, Canada
Mark Thoburn, Mobio Interactive, Toronto, Canada
S. Venkata Lakshmi, CSE, Panimalar Institute of Technology, Chennai, India
Amit Yaniv-Rosenfeld, Tel-Aviv University, Tel-Aviv, Israel; Shalvata Mental Health Center, Hod Hasharon, Israel
About the authors

Dr. Ashish Khanna has 16 years of expertise in teaching, entrepreneurship, and research and development. He received his PhD degree from the National Institute of Technology, Kurukshetra, India, in March 2017, and completed his M.Tech. and B.Tech. from GGSIPU, Delhi, India. He completed his postdoctoral fellowship at the Internet of Things Lab at Inatel, Brazil, and has around 110 research papers and book chapters to his credit, including more than 50 papers in SCI-indexed journals with a cumulative impact factor above 125. Additionally, he has authored or edited 25 books. Furthermore, he has served the research field as a keynote speaker, session chair, reviewer, TPC member, guest editor, and in many more positions at various conferences and journals. His research interests include image processing, distributed systems and its variants, and machine learning. He is currently working in the Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, Delhi, India, and is the convener and organizer of the ICICC and ICDAM Springer conference series.

Dr. Deepak Gupta is an eminent academician who juggles versatile roles and responsibilities spanning lectures, research, publications, consultancy, community service, and PhD and postdoctoral supervision. With 12 years of rich expertise in teaching and 2 years in industry, he focuses on rational and practical learning. He has contributed extensive literature in the fields of human-computer interaction, intelligent data analysis, nature-inspired computing, machine learning, and soft computing. He is working as an Assistant Professor at Maharaja Agrasen Institute of Technology (GGSIPU), Delhi, India. He has served as Editor-in-Chief, Guest Editor, and Associate Editor for SCI and various other reputed journals. He has authored or edited 36 books with national and international publishers (Elsevier, Springer, Wiley, and Katson) and has published 128 scientific research publications in reputed international journals and conferences, including 61 SCI-indexed journals of IEEE, Elsevier, Springer, Wiley, and many more.

Dr. Nilanjan Dey is Associate Professor in the Department of Computer Science and Engineering, JIS University, Kolkata, India. He is a visiting fellow of the University of Reading,
United Kingdom. Previously, he held an honorary position of Visiting Scientist at Global Biomedical Technologies Inc., CA, United States (2012-15). He was awarded his PhD by Jadavpur University in 2015. He has authored or edited more than 90 books with Elsevier, Wiley, CRC Press, and Springer and has published more than 300 papers. Furthermore, he is the Editor-in-Chief of the International Journal of Ambient Computing and Intelligence (IGI Global) and Associate Editor of IEEE Access and the International Journal of Information Technology (Springer). He is the Series Co-Editor of Springer Tracts in Nature-Inspired Computing (Springer), Series Co-Editor of Advances in Ubiquitous Sensing Applications for Healthcare (Elsevier), and Series Editor of Computational Intelligence in Engineering Problem-Solving and Intelligent Signal Processing and Data Analysis (CRC Press). His main research interests include medical imaging, machine learning, computer-aided diagnosis, and data mining. He is the Indian Ambassador of the International Federation for Information Processing - Young ICT Group and a Senior Member of IEEE.
Preface
This book begins with the basics of Big Data analysis and introduces the tools, processes, and procedures associated with it. It unites healthcare with a leading technology, Big Data analysis, and uses the advantages of the latter to solve the problems faced by the former. The book starts with the basics of Big Data and progresses toward the challenges faced by the healthcare industry, including capturing, storing, searching, sharing, and analyzing data. The book highlights the reasons for the growing abundance and complexity of data in this sector. The applications of Big Data have grown tremendously within the past few years, and this growth can be attributed not only to its competence in handling large data sizes but also to its ability to find insights from complex, noisy, heterogeneous, longitudinal, and voluminous data. This helps Big Data answer previously unanswered questions, and this is precisely what helps it find its applications in the healthcare industry. Big Data is nowadays a requirement of almost all technologies and applications, and there is a separate and special need to address its association with healthcare. The main objective of Big Data in this sector is to come up with ways to provide personalized healthcare to patients by taking into account the enormous amount of already existing data. The book further illustrates the possible challenges in its applications and suggests ways to overcome them. The topic is vast; hence, every technique and/or solution cannot be discussed in detail. The primary emphasis of this book is to introduce healthcare data repositories, challenges, and concepts to data scientists, students, and academicians at large.
Objective of the book

The main aim of this book is to provide a detailed understanding of Big Data and focus on its applications in the field of healthcare. The ultimate goal is to bridge the data mining and medical informatics communities to foster interdisciplinary works that can be put to good use.
Organization of the book

The book is organized into 11 chapters, briefly described as follows:

1. Big Data Classification: Techniques and Tools
An enormous volume of data of varied properties, known as Big Data, is continuously being generated from several sources. For efficient and consequential use of this huge amount of data, automated and correct categorization is very important. This chapter attempts to discuss the various technicalities of Big Data classification comprehensively.

2. Big Data Analytics for Healthcare: Theory and Applications
This chapter discusses the procedure of Big Data analytics in the healthcare sector, some practical applications, and the associated challenges. The work concludes with a discussion on potential opportunities for analytics in the healthcare sector.

3. Application of Tools and Techniques for Big Data Analytics of Healthcare System
In the past, various data-analysis tools and methods have been adopted to improve the services provided in a plethora of areas. This chapter highlights the improvements in terms of the effectiveness of predictions and inferences drawn so that future usage may be eased.

4. Healthcare and Medical Big Data Analytics
This chapter discusses effective data analysis, suitable classification and standardization of Big Data in medicine and healthcare, as well as the proper design and implementation of healthcare information systems.

5. Big Data Analytics in Medical Imaging
This chapter discusses the various medical image processing tools and frameworks, Hadoop, MapReduce, YARN, Spark, and Hive, used to solve the purpose. Machine-learning and deep-learning techniques are extensively used for carrying out the required analytics, and genetic algorithms and association rule learning techniques are also considerably used for the purpose.

6. Big Data Analytics and Artificial Intelligence in Mental Healthcare
In this chapter, the authors discuss the major opportunities, limitations, and techniques for improving mental healthcare through AI and Big Data. They explore the computational, clinical, and ethical considerations and best practices, and lay out the major research directions for the near future.
7. Big Data Based Breast Cancer Prediction Using Kernel Support Vector Machine With the Gray Wolf Optimization Algorithm
Today, Big Data in healthcare is often used to predict disease. Breast cancer is one of the most common cancers affecting women, and recognizing the disease at an early stage gives a greater chance of recovery. In this chapter, optimal features are selected using Oppositional Grasshopper Optimization (OGHO); these features are then processed in the training phase using the kernel support vector machine with the Gray Wolf Optimization algorithm (KSVMGWO) to predict breast cancer.

8. Big Data Based Medical Data Classification Using Oppositional Gray Wolf Optimization With Kernel Ridge Regression
The classification of medical data is an important data mining issue that has been discussed for nearly a decade and has attracted numerous researchers around the world. Selection procedures provide the pathologist with valuable information for diagnosing and treating diseases. In this chapter, the authors aim to develop machine-learning algorithms to effectively predict the outbreak of chronic disease in general communities.

9. An Analytical Hierarchical Process Evaluation on Parameters Apps-Based Big Data Analytics for Healthcare Services
Any healthcare management system can be studied in terms of access, integration, privacy and security, confidentiality, sharing, assurance/relevancy, reliability, and cost involvement for the data/documents in the system, all of which concern healthcare centers. Accessibility is a complex concept, and at least four aspects, that is, availability, utilization, relevance, and equity of access, require evaluation. These parameters are discussed and evaluated in this chapter.

10. Firefly—Binary Cuckoo Search Technique Based Heart Disease Prediction in Big Data Analytics
Nowadays, Big Data analysis receives more attention in complex healthcare settings. Fetal growth curves, a classic case of big health data, are used to predict coronary artery disease (CAD). This work aims to predict the risk of CAD using machine-learning algorithms such as Firefly—Binary Cuckoo Search (FFBCS). The authors also present a preliminary analysis of the performance of the framework.
11. Hybrid Technique for Heart Diseases Diagnosis Based on Convolution Neural Network and Long Short-Term Memory
In this chapter, a hybrid deep neural network takes a dataset with 14 features as input and is trained using a convolution neural network (CNN) and long short-term memory (LSTM) hybrid algorithm to predict the presence or absence of disease in patients, with accuracy reaching 0.937. The results of the study showed that the CNN-LSTM hybrid model had the best results in accuracy, recall, precision, F1 score, and AUC compared to other techniques.
1 Big Data classification: techniques and tools

Pijush Kanti Dutta Pramanik(1), Saurabh Pal(2), Moutan Mukhopadhyay(2) and Simar Preet Singh(3)
(1) National Institute of Technology, Durgapur, India; (2) Bengal Institute of Technology, Kolkata, India; (3) Chandigarh Engineering College, Landran, India
Abstract
An enormous volume of data of varied properties, known as Big Data, is continuously being generated from several sources. For efficient and consequential use of this huge amount of data, automated and correct categorization is very important. Precise categorization can reveal correlations, hidden patterns, and other valuable insights. The process of categorizing mixed heterogeneous data is known as data classification and is done based on some predefined features. Various algorithms and techniques have been proposed for Big Data classification. This chapter attempts to discuss the various technicalities of Big Data classification comprehensively. To start with, the basics of Big Data classification, such as its need, types, patterns, phases, and approaches, are explained. Different classification techniques, including traditional, evolutionary, and advanced machine-learning techniques, are discussed with suitable examples, along with their advantages and disadvantages. Finally, a survey of various open-source and commercial libraries, platforms, and tools for Big Data classification is presented.

Keywords: Big Data; machine learning; classification techniques; evolutionary algorithms; classification tools; Big Data platforms
1.1 Introduction

The modern digitized and smart world is continuously generating data of enormous volume from various sources, such as smart healthcare [1,2], smart cities [3,4], smart agriculture [5,6], smart buildings [7,8], smart learning [9,10], modern industries [11,12], social media [13,14], autonomous vehicles [15,16], cognitive systems [17,18], and so on. These data are generated and transmitted in various forms, volumes, and speeds to different sinks. To make actionable use of these data, they must be mined to extract information and knowledge. In traditional data mining, we generally use approaches like clustering, classification, and association rules. In Big Data, too, we have to mine useful and valuable information, but on extremely large data sets. The two most used techniques in the case of Big Data are Big Data classification [19] and Big Data clustering [20]. Classification is supervised learning, where the classification algorithm needs training in order to achieve accurate classification, whereas clustering is unsupervised learning and does not need pretraining.

Classification allows grouping information by common attributes and comparing the groups for similarities and differences. It helps in identifying and segregating data, which allows appropriate data tagging, making the data easy to locate and retrieve, or rather searchable. This enables separating relevant data from the irrelevant and identifying duplications. The tagging of data makes searching fast and data access easy [21]. We can say classification techniques mold scattered data into shape, and this shaped data provides quality and confidence in the outcome [22]. Conclusively, classification helps the user in gathering knowledge (knowledge discovery) and future planning. Without proper classification, Big Data is bound to fail at drawing any valuable inferences.

Big Data classification has found many significant real-life applications, such as predicting epidemic outbreaks, drug discovery, providing and managing healthcare services [23,24], weather forecasting, product and service recommendation [25], sentiment analysis and opinion mining [14], user profiling, predicting the next possible crime in a city, network traffic classification and intrusion prediction, etc.

Since classification involves supervised learning, a classification model is built under supervision, using a training data set. By comprehending the data patterns of the training set, the model infers how to classify similar unknown data into a class. Specific strategies need to be sorted out on how to use the vast data. Typically, before the data can be used, they are preprocessed [26,27]. Preprocessing cleans the data, which may contain null values, missing values, and inconsistencies. Not all features in the data set are useful; the required and suitable features are extracted using appropriate algorithms. The data are generally ready to be used after preprocessing and feature extraction and selection [28]. The performance accuracy of the model increases if the model is trained with good data. In classification, different techniques such as probabilistic models [29], decision trees [30], neural networks [31], support vector machines (SVMs) [32], etc. are used. Most of these techniques are also used in usual data mining, but since the volume of data is huge in Big Data, the old data mining techniques need to be tweaked to work for Big Data.

This chapter presents a comprehensive discussion of the different techniques and tools used in Big Data classification. The rest of the chapter is organized as follows. Section 1.2 presents the preliminaries of Big Data classification, including the need, challenges, types, patterns, phases, and approaches of Big Data classification. Various algorithms and techniques of Big Data classification are discussed in detail in Section 1.3. Recent and popular tools, libraries, and platforms for Big Data classification are discussed in Section 1.4. Section 1.5 summarizes the chapter.
1.2 Big Data classification

Since the term Big Data defines data sets that are too large or complex, the classification of these data sets is important for a deeper understanding of the data [33]. Using Big Data classification, different analyses can be performed on the data for accurate prediction. In this section, we will learn about classification, why classification is required, and the various types of classification. The phases of classification, classification patterns, and the challenges are also covered subsequently.
1.2.1 Definition of classification

The process of classifying data into different categories on the basis of some attributes or features is known as classification [34]. In data mining, it is a data analysis process that involves predicting the class of newly observed data through modeling [35]. Classification is a crucial function of Big Data processing and is directly entailed in knowledge discovery. Fig. 1.1 shows the different phases of Big Data processing, including Big Data classification.
Figure 1.1 Phases of Big Data processing.
Figure 1.2 Different aspects of the need for classification of Big Data: convert unstructured data into useful information; find meaningful and accurate data; understand the relationship between different groups; specifically identify any feature in the data sets; convenient study of data sets; ease of data access.
1.2.2 Need for classification in Big Data

Data classification helps in knowledge discovery and intelligent decision making [35]; thus, it plays a vital role in Big Data. Big Data is so huge and complex that, without proper classification, finding information in Big Data would be like finding a needle in a haystack. Classification is required to systematically analyze this massive amount of data by organizing it into suitable classes. This helps in developing a precise model or description for each defined class using the data features of that class. The different aspects of the need for classification of Big Data are shown in Fig. 1.2.
1.2.3 Challenges in Big Data classification

Big Data involves complexity, thanks to its multifaceted properties; hence, handling Big Data is not trivial. The same applies to Big Data classification, which is hugely challenging due to several factors, most of which are inherent to Big Data. Fig. 1.3 lists some of the crucial challenges in Big Data classification.

Figure 1.3 Big Data classification challenges: heterogeneous data; high-volume data; high-speed data; garbage mining; data visualization; computing model; algorithmic efficiency; imbalanced classification; accuracy, trust, and provenance; data uncertainty and incompleteness; and data having low value density and meaning diversity.
1.2.4 Types of classification

Classification can be categorized into the following three types (a short code sketch after this list makes the three target-label shapes concrete):
1. Binary classification: Binary classification is a classification method in which new data are classified into two possible classes as an outcome; that is, it categorizes items into two groups [36]. An example of binary classification is gender classification, which has two possible outcomes: male or female. Similarly, it can be used to classify the state of a machine as faulty or good. Other application areas where binary classification can be used are medical diagnosis, spam detection, etc.
2. Multiclass classification: As the name suggests, multiclass classification has more than two classification classes as an outcome. Multiclass or multinomial classification is a technique of assigning items to one of N classes, where N is greater than two. Examples of multiclass classification are segregating emails into the appropriate folder, gene expression categorization, etc. In this classification, one target label is assigned to each sample, and a sample cannot have two or more labels at the same time [36]. For example, an animal can be a dog or a cat, not both at the same time [37].
3. Multilabel classification: The multilabel classification approach calls for a classification task where each sample of the data set is mapped to a set of target labels, that is, more than one class [37]. An example of multilabel classification is a news article, which can describe sports, a location, or a person at the same time [38].
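The distinction among the three types shows up concretely in the shape of the target variable. Below is a minimal sketch using scikit-learn (surveyed in Section 1.4.2); the feature matrix, labels, and choice of model are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[0.2, 1.1], [1.4, 0.3], [0.6, 0.9], [1.8, 0.2]])

# Binary: one target column with exactly two possible values
# (e.g., machine state: 0 = good, 1 = faulty).
y_binary = np.array([0, 1, 0, 1])
LogisticRegression().fit(X, y_binary)

# Multiclass: one target column with N > 2 mutually exclusive values
# (e.g., which folder an email belongs to).
y_multiclass = np.array([0, 1, 2, 1])
LogisticRegression().fit(X, y_multiclass)

# Multilabel: each sample may carry several labels simultaneously
# (e.g., a news article tagged with sports, location, and person);
# the target is a binary indicator matrix, one column per label.
y_multilabel = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
OneVsRestClassifier(LogisticRegression()).fit(X, y_multilabel)
```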
1.2.5 Big Data classification approaches

Typically, the following two approaches are followed for Big Data classification (the sketch after this list contrasts the two):
1. Supervised: The supervised classification approach learns classification logic under directed supervision by understanding the patterns in a known, labeled data set. This approach takes a large volume of data, called the training data set, as input. To learn the classification rules for categorizing data into the given classes, data analysis is performed on this data set. The classification rules learnt in this process are based on the features or attributes of each element of a class [39]. Since in this approach the set of possible classes is known in advance, it is also known as directed or predictive classification.
2. Unsupervised: The unsupervised classification approach learns classification logic from a set of unlabeled data; that is, the class labels are unknown, and the training sets do not come with predefined class labels. In other words, the set of possible classes to which a datum will be assigned is not known in advance; after classification, the classes are given names. Hence, this approach is often known as clustering. The classification is carried out by drawing comparisons between the features of the data. This approach is useful when features or attributes are unknown; therefore, unsupervised classification techniques are often said to be descriptive or undirected. An example of this technique is arranging a bucket full of fruits, where all the fruits are in jumbled order and the aim is to classify the fruits into groups [39].
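The difference between the two approaches can be made concrete in a few lines of code. The following is a rough sketch, assuming scikit-learn; the fruit measurements and class names are hypothetical stand-ins for the bucket-of-fruits example above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical fruit measurements: [weight in grams, diameter in cm].
X = np.array([[150, 7.0], [160, 7.4], [120, 6.1], [1100, 18.0], [990, 17.2]])

# Supervised: class labels are known in advance (0 = apple, 1 = melon),
# so the classifier is trained to predict those predefined classes.
y = np.array([0, 0, 0, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[140, 6.8]]))  # -> [0], a known class label

# Unsupervised: no labels are given; the algorithm discovers groups
# (clusters) by comparing features, and the groups are named afterward.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster indices, not predefined class names
```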
1.2.6 Phases of classification

The classification process generally involves three phases, as shown in Fig. 1.4. Each phase is described below [40-42]:
1. Data preparation phase: For the purpose of classification, appropriate training of the classifier is required. This could be achieved by the availability of good data, but in practice the available data are often not suitable for training. The data preparation phase involves preparing the data to make them suitable for training. This phase involves the following processes:
Figure 1.4 Phases of classification.
• Data selection: As the data are huge and not all of them may be useful for learning, selecting an appropriate subset of the data can lead to optimized training of the classifier over a data set.
• Data preprocessing: The selected data may be unstructured and may contain incorrect, inconsistent, missing, and skewed information. Data preprocessing includes formatting, cleaning, and sampling before the data can be used. The collected data may not be in a suitable format; formatting transforms the data into a form that is suitable to work with. The selected data may have missing values and inconsistent or incorrect information; cleaning the data is absolutely necessary, as such data may distort the training outcome. The data collected may be far more than what is required for training: the more data used, the more learning time it takes and the more memory is consumed. It is therefore better to choose a suitable sample of the data for training, which leads to optimized training time and lower memory consumption. The sample selection process appropriately selects a suitable sample of the data required for training.
• Data transformation: This step transforms data into a usable metric that can be suitably used in algorithms and the application problem. It includes (a) scaling, (b) decomposition, and (c) aggregation. The data being used for training are often expressed in different measures and values, which may not be suitable for training and may require scaling the values to an acceptable form. Decomposition is splitting complex data into simple, atomic constituent values. Some data are complex; decomposing them into simpler constituent values may be more useful for training. Often, splitting the information yields some useful and some not-so-useful features; discarding the less significant features can reduce the learning complexity and time. Opposite to this approach, aggregation may help in combining simple atomic data into concrete information, which may help the classifier get more insight into the data. Data transformation is also referred to as feature engineering, since the transformed data are represented as features. These data features help in better understanding the different patterns in the data set. The transformed data at the end of this phase are conveniently split into two segments (60% and 40%) called the training set and the testing set.
2. Learning phase: Learning is the initial classification phase, in which the classifier learns how to classify a kind of data pattern. This phase features model construction for predicting the class of the data in question. An algorithm is trained with a set of labeled or unlabeled data samples, called the training set, to develop a mathematical model. The model learns patterns in the training data set by considering the various features or attributes of the data and learns how to identify the class based on these features. The model developed as the outcome of this phase could be a set of rules or a mathematical formula.
3. Evaluation phase: This phase follows the learning phase and characterizes the testing and evaluation of the model obtained from the learning phase. The constructed model is tested and evaluated on an unknown sample data set called the test data set. Based on the evaluation result, the model is recalibrated by adjusting its parameters through a tuning process; the parameters here are the learning rate and the hyperparameters. The evaluation phase outputs an accurate classifier model. The sketch below walks through all three phases.
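These three phases correspond to a standard machine-learning workflow. The following is a minimal sketch, assuming scikit-learn and a small in-memory data set; the 60%/40% split mirrors the proportion suggested above, and the specific scaler, model, and data set are illustrative choices rather than prescriptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data preparation phase: select the data and split it 60/40 into
# a training set and a testing set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42)

# Transformation (scaling): fit the scaler on the training data only,
# then apply the same scaling to the test data.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Learning phase: the classifier learns patterns from the training set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluation phase: test on unseen data; a poor score would prompt
# recalibration of the hyperparameters before another training round.
print(accuracy_score(y_test, model.predict(X_test)))
```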
1.2.7 Classification pattern
In classification, the pair of a collection of features or observations along with their concept is termed a pattern. The classification pattern deals with the problem of identifying which set of
classes or categories belong to which observations [43–45]. This is carried out with the use of training data that contain observations of known categories. Classification patterns are subdivided further into two categories, as described below and shown in Fig. 1.5.
1. Atomic patterns: Atomic patterns provide information regarding the data, that is, how data are processed, accessed, stored, and consumed for recurring problems [46,47]. They can also be used for identifying the required components. Atomic patterns might require different approaches for performing their tasks. These types of patterns do not require any sequence or layering [48,49]. Examples of atomic patterns are shown in Fig. 1.6.
2. Composite patterns: Atomic patterns form composite patterns when they work together. Composite patterns are classified on the basis of end-to-end solutions. In this approach, one or more dimensions are considered for each composite pattern [50,51]. Many variations are considered in the cases applicable to each composite pattern. For solving business problems, composite patterns are mapped to one or more atomic patterns [52].
1.3 Big Data classification techniques
Various classification techniques are being used in data classification [53]. The major classification techniques that are used in Big Data classification are covered in this section.
1.3.1 Traditional learning techniques
This section discusses conventional and prevalent data classification techniques.
1.3.1.1 Logistic regression
Logistic regression is a binary classification algorithm, a supervised learning technique for modeling predictions. The algorithm learns from a set of labeled data and predicts the probability that a target data point belongs to the class under consideration.
Figure 1.5 Classification patterns categories.
Figure 1.6 Atomic patterns example (examples grouped under processing, access, consumption, and storage patterns).
This regression is similar to linear regression, except that it predicts whether a data point belongs to a true or a false class: it produces categorical rather than continuous values. The logistic regression algorithm maps the input to a probability by using the sigmoid function, which translates the continuous output of a linear equation into a probability between 0 and 1. The sigmoid function is represented as:

$$S(Z) = \frac{1}{1 + e^{-Z}}$$
S(Z) returns a probability score in the range 0 to 1. Taking 0.5 as the threshold value, all inputs with S(Z) ≥ 0.5 are assigned to class 1 and those with S(Z) < 0.5 to class 0, as defined in Fig. 1.7. How the probabilistic score classifies the input data (x) into the two classes, 1 or 0, is illustrated by the graph in Fig. 1.8. Z is a linear function represented as $Z = w_0 + w_1 x$, where the parameters or weights $w_0$ and $w_1$ are learned from the sample training data.
Figure 1.7 Decision boundary definition for logistic regression.
Figure 1.8 Probabilistic score.
Table 1.1 Logistic regression: advantages and disadvantages.
Advantages:
• Performs well on a linearly separable dataset.
• Less prone to over-fitting, although it can overfit in high-dimensional datasets.
• Easier to interpret, easy to implement, and more efficient to train.
Disadvantages:
• The assumption of linearity between the dependent and independent variables is the main limitation of this type of regression.
• May lead to overfitting when the number of observations is less than the number of features.
Primarily, logistic regression is considered a binary classifier, but it may also be applied in multinomial or ordinal form. Application avenues where logistic regression has been tried successfully include geographic image processing, handwriting recognition, image segmentation and categorization, and healthcare. The advantages and disadvantages of logistic regression are listed in Table 1.1. A Python code sample for logistic regression usage is depicted in Fig. 1.9.
Figure 1.9 Sample Python code for logistic regression.
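The code in Fig. 1.9 is reproduced in the source only as an image. A minimal sketch of comparable code follows; the use of scikit-learn, the synthetic dataset, and the 60/40 split are assumptions of this sketch, not details read from the figure.

```python
# A minimal sketch only: scikit-learn, the synthetic data, and the 60/40
# split are assumptions, since the original figure is an image.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for a real data set
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)  # 60% train / 40% test, as in the text

model = LogisticRegression(max_iter=1000)  # sigmoid-based binary classifier
model.fit(X_train, y_train)                # learn the weights from labeled data
y_pred = model.predict(X_test)             # class 1 if S(Z) >= 0.5, else class 0
print("Accuracy:", accuracy_score(y_test, y_pred))
```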
Figure 1.10 Hyperplane in support vector machine.
1.3.1.2 Support vector machine
SVM is a supervised classification algorithm. It is also a binary classifier, assigning data to one of two classes. Unlike other classifiers, SVM increases the confidence of classification by maximizing the margin around the decision surface. This allows a clear separation of the data in the data space and thus allows assigning data precisely to one class or the other [54]. SVM is very useful when the number of features is high, as it maps the data into the feature space with far less computation than other classification algorithms. SVM defines a margin or a hyperplane for segregating the two classes, where the hyperplane is a subspace with one dimension less than that of the space in which the data are projected. The hyperplane is linear when there are two features. Fig. 1.10 shows the hyperplane formed from the support vectors, which uniquely partitions the two classes of data in space. The margin of separation is wide enough to provide the boundary of separation and, thus, confidence in the classification [55]. In a high-dimensional space, linearly separating the classes may not be possible. Appropriate kernel functions are used to define such a hyperplane that separates the classes nonlinearly, which otherwise is a nontrivial job. The SVM algorithm can thus be considered as a set of kernel functions. The kernel function transforms the input data into the desired output form. It maps linearly nonseparable data into linearly
Figure 1.11 Kernel functions applied in support vector machine [57]: polynomial kernel, Gaussian kernel, Gaussian radial basis function, Laplace RBF kernel, hyperbolic tangent kernel, sigmoid kernel, Bessel function of the first kind kernel, linear spline kernel (in 1D), and ANOVA radial basis kernel.
Table 1.2 Advantages and disadvantages of support vector machine.
Advantages:
• Works well on unstructured or semistructured data.
• Supports high-dimensional data.
• Prone to less over-fitting, as the algorithm generalizes well in practice.
• Offers a choice of different kernel functions.
• In comparison to ANN, it often performs better.
Disadvantages:
• Takes a long training time on a large data set.
• Choosing a kernel function is difficult.
• Performance is poor when the number of features is greater than the number of samples.
• Does not provide probability estimates.
separable data [56]. The different kernel functions applied in SVM are listed in Fig. 1.11. The application areas of SVM include protein structure prediction, medical imaging, image interpolation, financial analysis, handwriting recognition, text classification, breast cancer diagnosis, and almost all other applications where an artificial neural network (ANN) is applied. The advantages and disadvantages of SVM are listed in Table 1.2, and a Python code sample for the SVM algorithm is shown in Fig. 1.12.
1.3.1.3 Decision tree
A decision tree is a supervised learning algorithm, also called CART (classification and regression tree), as it solves both classification and regression problems.
Figure 1.12 Sample Python code for support vector machine.
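Fig. 1.12 likewise appears only as an image in the source. A minimal sketch of SVM classification code, assuming scikit-learn's SVC and a synthetic dataset (the RBF kernel choice is an illustrative assumption), might look like this:

```python
# Illustrative sketch; SVC with an RBF kernel is an assumed choice.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# The kernel is the key design choice; "rbf" (Gaussian radial basis function)
# is one of the kernels listed in Fig. 1.11, and "linear", "poly", and
# "sigmoid" are alternatives.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```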
Figure 1.13 Decision tree structure model.
Due to its inherent suitability for classification, however, it is mostly used in classification problems. A decision tree leads to a predictive model with high accuracy and stability, and it maps not only linear relationships very well but also nonlinear ones [58]. A decision tree can be of the following two types:
1. Categorical variable decision tree: Predicts the output as a class value.
2. Continuous variable decision tree: Predicts a continuous target variable.
The inherent structure of a decision tree is a tree-structured flowchart, as shown in Fig. 1.13. The root node represents the entire sample, which is further categorized into two or more homogeneous sets. Each subnode of the tree divides the available data into further child subnodes. A splitting node that performs decision making on the data split is called a decision node, while the terminal or leaf nodes represent class labels. The classification rules are depicted by the paths from the root to the leaf nodes. The root node is assigned the best-selected feature of the data set. The most popular attribute selection measures are the Gini index and information gain. Information gain measures the reduction in uncertainty, whereas the Gini index measures how often a randomly chosen element would be incorrectly identified.
Table 1.3 Decision tree: advantages and disadvantages.
Advantages:
• Easy to understand, and rules are easy to generate.
• Can handle both continuous and categorical data.
• Requires less effort in data preparation than other algorithms.
• Tree performance is unaffected by nonlinear relationships between parameters.
• Does not require normalization or scaling of data.
Disadvantages:
• May suffer from overfitting.
• Less efficient where continuous numerical variables are involved, as information is lost when the variables are binned into categories.
• Calculations can become complex when there are many class labels.
• Prone to errors in classification problems with many classes and relatively few training examples.
• Training is relatively expensive, as the time taken and complexity are higher.
• Inadequate at handling continuous values.
Figure 1.14 Sample Python code for decision trees.
Attributes are chosen for the root node and the other subnodes so as to maximize information gain and keep the Gini index low [58]. The data splitting and attribute selection continue until leaf nodes with predicted class values are found [59]. The advantages and disadvantages of a decision tree are listed in Table 1.3, and the Python sample code for implementing the decision tree algorithm is shown in Fig. 1.14.
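Fig. 1.14 is an image in the source; the sketch below assumes scikit-learn's DecisionTreeClassifier and the bundled Iris dataset, which are illustrative choices rather than the figure's actual contents:

```python
# Illustrative sketch; the Iris data and hyperparameters are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# criterion="gini" splits on the Gini index; criterion="entropy" would
# split on information gain instead
tree = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```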
1.3.1.4 Naïve Bayes algorithm
This algorithm is based on Bayes' theorem; it is a probabilistic machine learning algorithm for the classification task. It is a family of algorithms that share a common principle: the presence of a particular feature is unrelated to the presence of any other feature, that is, all features are independent of one another, and all contribute equally to the prediction. Naïve Bayes classifiers come in three main variants: multinomial, Bernoulli, and Gaussian. The algorithm is primarily used for
classification of text, where high-dimensional training datasets are involved. The Naïve Bayes algorithm works on Bayes' theorem, which states that the probability of an event can be found from the probabilities of other events that have already occurred [60]. Bayes' theorem is defined as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where
• P(A|B) is the posterior probability of the class (A, target) given the predictor (B, attributes);
• P(A) is the prior probability of the target;
• P(B|A) is the likelihood, the probability of the predictor given the class;
• P(B) is the prior probability of the predictor.
The naïve assumption of feature independence is applied to Bayes' theorem to obtain the Naïve Bayes algorithm, which predicts the class y from the conditional probability P(x_i|y) of each independent feature x_i as [61]:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
This algorithm is best suited for applications that involve real-time situations like spam filtering and document classification [34]. Other applications of Naïve Bayes are sentiment analysis, recommendation systems, and classifying news articles. The advantages and disadvantages of the Naïve Bayes algorithm are listed in Table 1.4. A sample Python code for Naïve Bayes algorithm usage is depicted in Fig. 1.15.
Table 1.4 Naïve Bayes algorithm: advantages and disadvantages.
Advantages:
• Faster than other, more sophisticated methods.
• Very efficient, giving easy and fast prediction.
• Works very well in a large feature space.
Disadvantages:
• Assumes that all features are mutually independent, but in a real-life scenario it is almost impossible to get a dataset whose features are completely independent.
• Needs training data for identifying the parameters involved.
• Suffers from the zero conditional probability problem: a missing categorical value is assigned zero probability for that feature, and since multiplication by zero turns the total estimate to zero, prediction becomes impossible.
Figure 1.15 Sample Python code for Naı¨ve Bayes algorithm.
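Fig. 1.15 is also an image in the source. A minimal sketch, assuming scikit-learn's GaussianNB (the Gaussian variant mentioned above) and the Iris dataset as illustrative choices:

```python
# Illustrative sketch; GaussianNB is the Gaussian variant mentioned above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # MultinomialNB and BernoulliNB are the other variants

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2)

nb = GaussianNB()          # treats features as independent and Gaussian-distributed
nb.fit(X_train, y_train)
print("Predicted classes:", nb.predict(X_test[:5]))
print("Posterior P(y|x):", nb.predict_proba(X_test[:5]).round(3))
```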
Figure 1.16 Similarity mapping for unclassified data in the feature space of labeled data.
1.3.1.5 K-nearest neighbor
K-nearest neighbor (KNN) is a very simple, easy, and popular supervised machine learning algorithm. KNN is popular because it is easy to interpret and requires less calculation time than many other machine learning algorithms (such as decision trees, logistic regression, and random forests). KNN may be used for both regression and classification problems; however, it is more widely used for classification in various fields [62]. KNN is also known as a lazy learning algorithm: it does not learn a model during the training phase but simply stores the data points and uses the training data itself to classify new queries. The algorithm does little work in the training phase and more work in the testing phase to make a class prediction [63]. The KNN classifier works on how similar data are in terms of their features. The input (unclassified) data point is mapped into the feature space to find the nearest, most similar items and thereby determine its class membership, as shown in Fig. 1.16. The similarity between the new (unclassified) data point and the other data in the feature space is assessed by measuring the distances between them, using measures such as Euclidean, Manhattan, Minkowski, and Hamming distance. The class occurring most often among the nearest data points is taken as the predicted class of the new data point.
Table 1.5 K-nearest neighbor algorithm: advantages and disadvantages.
Advantages:
• Easy to implement and interpret.
• Low-cost development.
• A very flexible classification algorithm that does not require data preprocessing.
• Well suited for both classification and regression.
Disadvantages:
• Efficiency declines very fast as the data set grows.
• Known as a slow algorithm.
• Computationally expensive, as it must store all training examples and needs time to compute the distance to every example.
• Unable to handle missing data.
• Works efficiently with a small number of input variables, but as the number of variables grows it becomes difficult for KNN to predict the class of an unseen data point.
Figure 1.17 Sample Python code for K-nearest neighbor algorithm.
The advantages and disadvantages of the KNN algorithm are listed in Table 1.5. A sample Python code for KNN algorithm usage is depicted in Fig. 1.17.
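Fig. 1.17 is an image in the source; a minimal sketch, assuming scikit-learn's KNeighborsClassifier with k = 5 and the Euclidean metric as illustrative choices:

```python
# Illustrative sketch; k=5 and the Euclidean metric are assumed defaults.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=3)

# Minkowski distance with p=2 is the Euclidean distance mentioned above
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)  # "lazy" learning: fit simply stores the training points
print("Test accuracy:", knn.score(X_test, y_test))
```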
1.3.1.6 Random forest
This is a meta-estimator classification technique that ensembles learning over decision tree models [64]. It is popularly used for classification and regression. The technique works by fitting a number of decision trees on numerous subsamples of the data set and then averaging them to improve the predictive accuracy of the model [65]; in this way, the algorithm controls overfitting. The subsample size is taken to be the same as the input sample size, and the samples are drawn with replacement. The advantages and disadvantages of random forest are listed in Table 1.6. The Python code of random forest algorithm usage is mentioned in Fig. 1.18.
1.3.1.7 Matrix factorization
This is a collaborative-filtering-based method, primarily used to simplify the computation of complex matrix operations by reducing a matrix into its constituent parts.
Table 1.6 Random forest: advantages and disadvantages.
Advantages:
• More accurate, with reduced overfitting in comparison to decision trees in many cases.
• Used for both classification and regression problems.
• Can handle missing values very efficiently.
• Perceived to be a very stable algorithm.
Disadvantages:
• Slow in real-time prediction cases.
• Forms a complex algorithm.
• Difficult to implement.
Figure 1.18 Sample Python code for random forest.
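Fig. 1.18 is an image in the source; a minimal sketch, assuming scikit-learn's RandomForestClassifier and a synthetic dataset (both illustrative assumptions):

```python
# Illustrative sketch; the dataset and forest size are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

# 100 trees, each fit on a bootstrap subsample drawn with replacement;
# the ensemble aggregates the trees' predictions
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=4)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```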
Table 1.7 Advantages and disadvantages of matrix factorization.
Advantages:
• Makes complex matrix operations easier.
• Utilizes fewer storage resources.
Disadvantages:
• Involves more computations.
This technique is also referred to as the matrix decomposition method. Matrix factorization decomposes a matrix A of order m × n into factor matrices X and Y such that multiplying the factors reproduces the original matrix: A = XY, where X is an m × r matrix and Y is an r × n matrix [66]. This decomposition into factors is applied to dimensionality reduction. Matrix factorization is used in image recognition and movie recommendation systems. The advantages and disadvantages of matrix factorization are listed in Table 1.7. The example portrayed in Fig. 1.19 computes a matrix factorization of a 3 × 2 matrix and then reconstructs the original matrix from the constituent parts [67].
Figure 1.19 Sample Python code for matrix factorization.
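Fig. 1.19 is an image in the source. The sketch below mirrors its description (factorize a 3 × 2 matrix and reconstruct it); singular value decomposition is assumed as the factorization method, since the figure's actual method is not visible:

```python
# Illustrative sketch: SVD is assumed as the factorization method.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # a 3 x 2 matrix, as described for Fig. 1.19

# Singular value decomposition: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Group the factors as X (m x r) and Y (r x n) so that A = X @ Y
X = U @ np.diag(s)
Y = Vt
print(np.allclose(A, X @ Y))  # True: the factors reproduce the original matrix
```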
1.3.2 Evolutionary techniques
Evolutionary computation techniques are inspired by biology and nature. Biological and natural phenomena in living beings, such as genetic inheritance and natural selection, have helped computer scientists mimic these mechanisms to solve complex computational problems. The main pillar of evolutionary computation techniques is the evolutionary algorithm. These algorithms typically follow a heuristic-based approach and provide good approximate solutions to problems that cannot be solved with ease using other techniques. Evolutionary algorithms are very useful for classical NP-hard problems that cannot be solved in polynomial time; they are also suitable for problems with high running-time complexity. Since the Big Data classification process also falls in this category, evolutionary techniques play a crucial role in it. This section discusses various evolutionary techniques that can be useful in Big Data classification. Table 1.8 summarizes a comparison between the discussed evolutionary techniques.
1.3.2.1 Swarm intelligence
Swarm intelligence is the collaborative behavior of self-organized systems, where "swarm" connotes collaboration or togetherness and the systems are self-organized or self-adaptive. Various swarm intelligence algorithms are used in Big Data; a couple of them are discussed in this section.
1.3.2.1.1 Particle swarm optimization
Particle swarm optimization (PSO) is a metaheuristic optimization algorithm. The algorithm requires a search space and a population of candidate solutions [72]. It is an iterative algorithm that computes a local best and the global best in each of its iterations.
Table 1.8 Comparative analysis of different evolutionary techniques [68–71].
• Swarm intelligence: accuracy good; computation high; performance: performs very well on large-scale and dynamic data and can be used to solve such problems more rapidly and effectively.
• Genetic programming: accuracy good; computation high; performance: execution time is long, and it is usually not necessary to preprocess the data with any explicit feature selection algorithm.
• Genetic algorithm: accuracy good; computation high; performance: tackles the curse of dimensionality successfully and is used in various search spaces.
• ANN: accuracy very good; computation high; performance: robust, fault-tolerant, predicts new output from prior experience, and is used in developing predictive models.
• Coevolutionary programming: accuracy very good; computation high; performance: used in finding fuzzy classification rules; performance depends on selection and evolution.
The local best and global best computed in the last iteration yield the optimal solution of the acceptance function [73]. PSO can reduce complexity and improve the accuracy of classification problems related to Big Data; a minimal PSO sketch is given at the end of this subsection.
1.3.2.1.2 Ant colony optimization
Ant colony optimization (ACO) is an optimization algorithm that employs a probabilistic technique and is used for solving computational problems and finding optimal paths with the help of graphs. The ants in this algorithm act as multiple agents that walk along the edges of the graph (the paths), spreading pheromone, which other ants use as a tracking component. Guided by the pheromone composition of the paths, the ants find their shortest or optimal path from the source (nest) to the destination (food). The ants here are called artificial ants. As the algorithm works through the collaboration of ants and their self-selection of the optimal path, ACO is considered a swarm intelligence approach.
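As referenced above, the following is a minimal, illustrative PSO sketch; the sphere objective, swarm size, and coefficient values are all assumptions chosen for the toy example:

```python
# Toy PSO minimizing the sphere function; all constants are illustrative.
import numpy as np

def objective(x):
    return np.sum(x ** 2, axis=1)  # sphere function, minimum at the origin

rng = np.random.default_rng(0)
n_particles, dim, iterations = 30, 2, 100
w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients

pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), objective(pos)  # per-particle (local) bests
gbest = pbest[np.argmin(pbest_val)]            # global best

for _ in range(iterations):
    r1 = rng.random((n_particles, dim))
    r2 = rng.random((n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    val = objective(pos)
    improved = val < pbest_val                 # update local bests
    pbest[improved], pbest_val[improved] = pos[improved], val[improved]
    gbest = pbest[np.argmin(pbest_val)]        # update global best

print("Best solution found:", gbest)
```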
1.3.2.2 Genetic programming
This is an open-ended search technique that produces attributes in different combinations. It encodes computer programs as a set of genes, which are then evolved or modified using evolutionary algorithms. The results of this technique are computer programs that solve the predefined task for which the approach was set up. Genetic programming is very useful for tasks requiring classification and prediction.
1.3.2.3 Genetic algorithm
This algorithm is a heuristic search approach inspired by Charles Darwin's theory of natural evolution. Based on the natural selection process, it is represented by data structures (chromosomes) that are recursively combined by the search operators. A genetic algorithm rests on five basic principles, as listed below and sketched in code after this list.
1. Initialization: The set of all population sample points marks the initialization of the genetic algorithm.
2. Selection: A subset of the previous step is selected to categorize the data.
3. Crossover/recombination: Creates logical relations between sets and reduces the degree of randomness among the various sets.
4. Mutation: Generates genetic diversity.
5. Acceptance: Generates new offspring after mutation; elimination also takes place in this stage.
The basic flowchart of the genetic algorithm is shown in Fig. 1.20.
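The sketch below walks through the five principles on a toy "onemax" problem (maximize the number of 1s in a bit string); the problem, population size, and mutation rate are illustrative assumptions:

```python
# Toy genetic algorithm on the "onemax" problem (maximize the count of 1s);
# chromosome length, population size, and mutation rate are illustrative.
import random

LENGTH = 20

def fitness(chrom):
    return sum(chrom)

def crossover(a, b):
    point = random.randint(1, LENGTH - 1)  # single-point recombination
    return a[:point] + b[point:]

def mutate(chrom, rate=0.05):
    return [1 - gene if random.random() < rate else gene for gene in chrom]

# 1. Initialization: a random population of bit-string chromosomes
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(50)]

for generation in range(100):
    # 2. Selection: keep the fitter half as parents
    population.sort(key=fitness, reverse=True)
    parents = population[:25]
    # 3-4. Crossover and mutation produce new offspring
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(25)]
    # 5. Acceptance: offspring replace the eliminated half
    population = parents + offspring

print("Best fitness:", fitness(max(population, key=fitness)))
```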
1.3.2.4 Artificial neural network
An ANN is an information processing system for data classification. The structure and behavior of an ANN are inspired by the biological neural networks that constitute the human brain. The building blocks of an ANN are artificial neurons, which are functionally arranged in single or multiple layers. In a single-layer neural network, the weighted input signals are summed and passed through the activation or transfer function to predict the output class, as shown in Fig. 1.21. A multilayer ANN has one or more hidden intermediate layers between the input and output layers, which process the data layer by layer, as shown in Fig. 1.22.
Figure 1.20 Flowchart of a genetic algorithm.
Figure 1.21 Single-layer artificial neural network model.
Figure 1.22 Multilayer artificial neural network model.
The output of each layer is fed to the following layer as input, and so on. For an ANN to work substantially accurately, the activation or transfer function is crucial: it is the deciding function that takes the weighted data feature values as input and predicts the output class. Various types of activation function are used in ANNs, such as step, tanh, ReLU, sigmoid, and softmax [74]. For an ANN solving a binary classification problem, the sigmoid transfer function is popularly used with a single output neuron. This function predicts an output value in the range 0–1, with 0.5 set as the threshold: any predicted output less than 0.5 is taken as 0 (false, negative), and any output greater than 0.5 as 1 (true, positive). For a multiclass classification problem, multiple neurons can be present in the output layer, each representing one class, and the softmax activation function is used to predict the output probabilities. ANN is used quite popularly in
different fields like speech recognition, computer vision, machine translation, healthcare, telemedicine, marketing and predicting credit scores, etc.
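To make the single-neuron case concrete, here is a minimal sketch of a sigmoid-activated forward pass; all weights and inputs are hypothetical values chosen for illustration:

```python
# All weights and inputs here are hypothetical values for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([0.4, -0.2, 0.1])  # one weight per input feature
bias = 0.3

x = np.array([1.0, 2.0, 0.5])         # a single input sample
z = np.dot(weights, x) + bias         # weighted sum of the inputs
probability = sigmoid(z)              # squashed into the 0-1 range

# Thresholding at 0.5 turns the probability into a binary class
predicted_class = 1 if probability >= 0.5 else 0
print(probability, predicted_class)
```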
1.3.2.5 Coevolutionary programming
In coevolutionary genetic programming, two populations evolve in tandem. The coevolutionary algorithm is a rule-based technique that is often preferred over other classification techniques [50,51]. The approach involves two tightly coupled coevolutionary processes. The coevolutionary programming model is broadly classified into two categories: competitive coevolution and cooperative coevolution. In the competitive approach, the populations compete with each other to gain advantage; in the cooperative approach, the populations cooperate with each other by exchanging information. Coevolutionary algorithms are a special type of evolutionary algorithm in which the fitness of each individual depends on the fitness of other individuals. Such algorithms are suitable for problems where formulating an explicit fitness function is difficult.
1.3.3 Advanced learning techniques
Various recent machine learning methods that promise to be effective in Big Data classification [75] are discussed in this section. The learning methods of this category and their characteristics are listed in Fig. 1.23.
1.3.3.1 Representation learning
This learning method represents data in such a way that useful information can be extracted easily to build classifiers, predictors, and regressors, as shown in Fig. 1.24. Representation learning can eliminate the need for large labeled datasets [31], opening up new domains to machine learning and transforming the practice of data science. The technique replaces labor-intensive feature engineering: it allows the machine to learn the features and then use them to perform a specific task such as classification. Representation learning helps remove unnecessary features (feature selection) and transform raw data into features (feature extraction). In this regard, features can be extracted from unlabeled data by training a neural network to solve a supervised learning task. The method also provides a single way to perform unsupervised and semisupervised learning.
Figure 1.23 Advanced learning techniques and their characteristics [75].
Figure 1.24 The representation learning model for representing data.
A good representation of data is very important to maintain the efficacy of predictive models, and it makes learning easier.
1.3.3.2 Deep learning
Many data analysis systems face huge challenges in this era of big data, as the volume of raw data keeps increasing. It is very difficult to extract any meaningful pattern from huge, uncategorized raw data. The data produced are not only massive but also have features such as multiple sources, dynamic values, and sparsity. These features make
Big Data analysis harder for conventional machine learning methods [76]. Deep learning provides a degree of simplification and accuracy for Big Data analytics tasks that is not possible with traditional methods [77]. Deep learning algorithms are quite beneficial for extracting useful patterns in high-dimensional, complex data representations, that is, features at high levels of abstraction. Deep learning methods, using supervised or unsupervised strategies, can learn better representations automatically with deep structures [76]. They allow complex abstraction and the extraction of high-level knowledge through a hierarchical learning process [78]. The sheer amount of data in Big Data makes information retrieval a challenging task, and deep learning algorithms can address this challenge very efficiently. Some applications of the deep learning approach in Big Data analytics are:
• Semantic image and video tagging [79,80].
• Extracting meaningful patterns and mining from large data sets, as in semantic indexing [77].
• Speech recognition and computer vision [81,82].
• Healthcare and medical fields [83,84].
• Autonomous vehicles [85,86].
1.3.3.3 Distributed and parallel learning
As mentioned earlier, the huge volume of Big Data poses a serious challenge to classification. Traditional learning techniques that follow a centralized processing scheme may be insufficient to carry out classification on such voluminous data in a time-bound manner, which is a must for many Big Data applications. Distributed and parallel learning frameworks allow learning algorithms to execute in parallel over a number of distributed processing stations [87]. In distributed learning, a collection of independent, autonomous computers, known as nodes, communicate over a network; each node has its private memory and its own CPU cores, and the nodes interact with one another to achieve a common goal. Decision rules [88], distributed boosting [47], stacked generalization [89], and meta-learning [90] are examples of proposed distributed learning approaches. In parallel learning, a computation job is generally split into several similar subtasks that can be processed independently, and the results are combined afterwards. To scale up traditional learning algorithms,
parallel machine learning algorithms are used, exploiting the power of multicore processors [75,91,92]. A small parallel-evaluation sketch follows.
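As referenced above, the sketch below shows parallelism on a single multicore machine, assuming the joblib library (an assumption; true distributed learning across cluster nodes would require frameworks such as Spark, discussed later in this chapter):

```python
# A sketch of parallelism on one multicore machine, assuming the joblib
# library; candidate values and data are illustrative.
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=5)

def evaluate(C):
    """Cross-validate one candidate model; each call runs in its own worker."""
    score = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5)
    return C, score.mean()

# Four similar subtasks processed independently, results combined afterwards
results = Parallel(n_jobs=4)(delayed(evaluate)(C) for C in [0.01, 0.1, 1.0, 10.0])
print(results)
```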
1.3.3.4 Transfer learning
This is a machine learning technique in which knowledge is transferred from a previous task to a target task. Most machine learning methods work well under a common assumption: the training data and test data are drawn from the same feature space and the same distribution. When the distribution changes, many models need to be rebuilt from scratch, treating the newly collected data as the initial dataset. In real-world applications, recollecting the data needed for training and rebuilding the models can be expensive or even impossible, so a good approach is to reduce the need and effort of recollecting training data. In such cases, transfer learning between task domains is advantageous. Transfer learning can be defined as follows: given a source domain DS with learning task TS, and a target domain DT with learning task TT, transfer learning aims to improve the learning of the target predictive function fT in DT using the knowledge in DS and TS, where DS ≠ DT or TS ≠ TT. The study of this type of learning is inspired by the fact that humans can intelligently and easily apply previously learned knowledge to solve new problems faster and with better solutions; for example, learning to recognize apples might help in recognizing pears [93]. Fig. 1.25 represents the general scenario of transfer learning.
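A toy sketch of the idea follows, assuming scikit-learn's SGDClassifier and two synthetic datasets standing in for the source and target domains (all assumptions); it illustrates reusing learned weights rather than the formal definition in full:

```python
# Toy sketch of weight reuse: SGDClassifier and the two synthetic datasets
# (standing in for source domain DS and target domain DT) are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Source domain DS: plentiful labeled data
X_src, y_src = make_classification(n_samples=5000, n_features=20, random_state=6)
# Target domain DT: a related task with only a small labeled sample
X_tgt, y_tgt = make_classification(n_samples=100, n_features=20, random_state=7)

# loss="log_loss" gives logistic regression (named "log" in older scikit-learn)
clf = SGDClassifier(loss="log_loss", random_state=6)
clf.partial_fit(X_src, y_src, classes=np.unique(y_src))  # learn the source task

# "Transfer": continue training the same weights on the small target sample
# instead of rebuilding a model from scratch
clf.partial_fit(X_tgt, y_tgt)
print("Target accuracy:", clf.score(X_tgt, y_tgt))
```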
1.3.3.5 Active learning
Active learning is a semisupervised, iterative training technique which aims to reduce the amount of labeled data required to train a model [94,95]. In the beginning, a training iteration consists of feeding labeled data to the inputs, assessing the output (prediction or inference), and correcting the model's weights and biases; the iteration is then repeated with the next labeled data sample. After a number of iterations, the model is evaluated by feeding it unlabeled data prepared beforehand alongside the training data pool, and the most informative of these points are selected for labeling. The cycle is repeated until a satisfactory learning model is obtained. Fig. 1.26 represents the model of active learning. This approach of iterative training over
Figure 1.25 The transfer learning scenario.
Figure 1.26 Active learning model.
fewer data, followed by hyperparameter tuning, reduces the amount of data required to train a model. A minimal uncertainty-sampling sketch of such a loop follows.
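As referenced above, a minimal uncertainty-sampling sketch; the query strategy (picking the point whose predicted probability is closest to 0.5) is one common choice among several, and the pool sizes are illustrative assumptions:

```python
# Minimal uncertainty-sampling loop; pool sizes and the number of query
# rounds are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=8)
labeled = list(range(20))               # small initial labeled pool
unlabeled = list(range(20, len(X)))     # pool the learner may query from

model = LogisticRegression(max_iter=1000)
for _ in range(10):                     # one query per training iteration
    model.fit(X[labeled], y[labeled])
    # Query the unlabeled point the model is least certain about
    proba = model.predict_proba(X[unlabeled])[:, 1]
    query = unlabeled[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)               # an oracle supplies the label y[query]
    unlabeled.remove(query)

print("Labels used:", len(labeled))
```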
1.3.3.6 Kernel-based learning
The task of kernel-based learning is the construction of an embedding space, implicitly using similarity measures between the objects to be classified. For the purposes of classification and regression, this method provides
the basis for combining heterogeneous information such as images, time series, objects, or strings [96–98]. This learning method offers an efficient mapping of the original space into a potentially infinite-dimensional feature space, where the kernel function allows the inner products to be calculated directly [75,99].
1.4 Big Data classification tools and platforms
1.4.1 Shogun
Shogun is a free and open-source library for machine learning. The Shogun library is written in C++, with interfaces to many languages, such as Octave, Python, R, Java, Ruby, and C#, for developing machine learning programs. Shogun supports algorithms such as SVMs, dimensionality reduction algorithms (like principal component analysis, PCA), clustering algorithms (k-means and GMM), hidden Markov models, KNN, linear discriminant analysis, and kernel perceptrons. Shogun was initially developed for bioinformatics applications but can be applied to other domains as well. Shogun can process a data set of 10 million samples; with this capability of processing huge amounts of data, it may be used as an implementation tool for Big Data classification.
1.4.2 Scikit-learn
Scikit-learn (sklearn) is a free Python library that is commonly used for machine learning. It evolved from a Google Summer of Code project in 2007. The library is developed in Python, Cython, C, and C++. It supports many machine learning algorithms, deals with regression, classification, and clustering problems, and also provides dimensionality reduction, model selection, and preprocessing. The library is built upon the SciPy (Scientific Python) stack, which includes SciPy, NumPy, IPython, Matplotlib, Pandas, and SymPy [100]. As a Python library package, it is used commercially in various Big Data classification and analysis applications like sentiment analysis, video and image analysis, and filtering and recommendation.
1 http://shogun-toolbox.org/
2 https://scikit-learn.org/stable/
Scikit-learn is quite popularly used across academia and industry; commercially, more than a hundred companies use the library for machine learning applications [101].
1.4.3 TensorFlow
TensorFlow is an open-source library used for training various machine learning models. It was developed by Google and published under the Apache License 2.0, and the library is still used by Google for research and production purposes. The library is based on deep neural networks (DNNs) and is best suited for neural network applications (like classification). TensorFlow is quite resource-intensive and is available for a variety of platforms. With giant infrastructure support from Google, TensorFlow is used in various critical machine learning applications involving Big Data classification, like medical image analysis and natural language processing for chat applications.
1.4.4 Pattern
Pattern is a library designed for the Python language at the Computational Linguistics & Psycholinguistics Research Center. It has many tools for web and data mining, network analysis, natural language processing, and machine learning. The machine learning tools include k-means clustering, KNN, Naïve Bayes, the vector space model, and SVM classifiers [102]. Pattern is used for natural language processing in web-based applications and in clinical audio analysis involving Big Data.
1.4.5 Weka
Weka (Waikato Environment for Knowledge Analysis) is a machine learning tool developed in the Java language at the University of Waikato, New Zealand. It is free software designed for data analysis and predictive modeling, and it provides a graphical user interface for easy access and visualization. The Weka tool is specifically used for data processing, clustering, regression, classification, and visualization [103]. The different algorithms supported are logistic regression, linear regression, Naïve Bayes, decision tree, random tree, random forest, multilayer perceptron (neural network), decision rule, and Gaussian process [104].
3 https://www.tensorflow.org/
4 https://www.clips.uantwerpen.be/pattern
5 https://www.cs.waikato.ac.nz/ml/weka/
Weka's support for the Knowledge Flow interface, Hadoop, and Spark allows mining in Big Data. Big Data mining functionality, like data preprocessing, classification, clustering, and visualization, allows Weka to meet demanding real-world applications [105].
1.4.6 BigML
BigML is a platform for machine learning and applications. It provides highly scalable machine learning as a service (MLaaS), analogous to Software as a Service (SaaS), through the cloud. BigML performs high-performance processing of Big Data in a fast, scalable, and dynamic manner. The platform supports classification, cluster analysis, modeling, anomaly detection, and time series forecasting and prediction. BigML provides an API, bindings, and libraries for many popular languages, like Python, Ruby, and Java. It is a comprehensive machine learning platform which helps one develop models without bothering about the complexities of different library sets. It covers time series forecasting; supervised learning like classification and regression (trees, ensembles, linear regressions, logistic regressions, deepnets); and unsupervised learning like anomaly detection, cluster analysis, association discovery, topic modeling, and PCA [106]. BigML provides interactive visualization and explanation for all predictive models developed on the platform, which makes data interpretation quite easy. A model developed on the platform is fully exportable through JSON (JavaScript Object Notation) to all popular programming languages, which allows connecting the model to cloud, desktop, and mobile applications.
1.4.7 DataRobot
DataRobot is a machine learning and modeling platform available as a cloud service for enterprises. DataRobot empowers people who have little or no knowledge of machine learning to build prediction models. It supports building highly accurate machine learning models and preparing data sets. DataRobot simplifies Big Data processing, like data preparation, clustering, classifying, and visualization, at high speed, thereby speeding up the decision-making process [107].
6 https://bigml.com/features
7 https://www.datarobot.com/
Through DataRobot, businesses can deploy real-time predictive analytics in just a few steps.
1.4.8 Google Cloud AutoML
Google Cloud AutoML is a cloud-based service by Google for machine learning models. The service allows people with no machine learning expertise to build models easily, providing a GUI application for training, evaluating, and deploying them. The platform also provides a human labeling service for annotating and cleansing data. AutoML works with various kinds of data, like text, image, audio, and video [108]. The giant infrastructure of the Google Cloud service gives Cloud AutoML the power and speed to connect to and analyze Big Data very rapidly and precisely [109]. AutoML's predictive analytics on Big Data, with techniques like classification, clustering, and pattern recognition, helps in industry-leading applications.
1.4.9 IBM Watson Studio
Watson Studio is an integrated environment for machine-learning-powered model design, development, training, and deployment. The environment is available on the web as a software service on the IBM Cloud. It provides tools for cleaning and shaping data, capturing streaming data, modeling, training models, and visualizing data and results, with enhanced features for model development, evaluation, deployment, and management [110]. The environment provides extended capability for deep learning and access to pretrained machine learning models. Further, the GUI features of the studio allow one to visualize data insights and perform enhanced visual modeling [111]. Watson Studio, with its huge infrastructure, provides a platform for Big Data processing and analytics, and its artificial intelligence (AI) prepares and processes Big Data for functionality like classification, categorization, decision making, and pattern recognition in an easy and user-friendly manner.
8 https://cloud.google.com/automl/
9 https://www.ibm.com/cloud/watson-studio
1.4.10 MLJAR
MLJAR is a platform for rapid machine learning modeling, training, testing, and deployment. Running machine learning algorithms is a time-consuming process; MLJAR reduces the processing time by running the algorithms in parallel on multiple machines. It supports Python, Scikit-learn, TensorFlow, and regularized greedy forest (RGF) libraries and APIs. Extreme gradient boosting, Extra Trees, LightGBM, KNN, logistic regression, random forest, RGF, neural networks, and ensembles are some of the machine learning algorithms available in MLJAR. For a naive user, it provides an easy-to-use, cloud-based solution that allows them to handle complex business data such as Big Data and obtain meaningful insights.
1.4.11 Rapidminer
Rapidminer is a tool for rapid machine learning modeling, training, validation, deployment, and data statistics. It features various data analyses and the preprocessing jobs of modeling, like data sampling, transformation, partitioning, binding, and attribute generation. Rapidminer can acquire both structured and unstructured data from sources like files (CSV, HTML, PDF, etc.) and various databases. It is equipped with various machine learning algorithms and modeling techniques, like clustering, market basket analysis, decision trees, rule induction, Bayesian modeling, regression, neural networks, SVM, memory-based reasoning, and model ensembles. An expert can customize and extend Rapidminer with R and Python code and libraries. Other available functions can validate the models to their highest accuracy. It is easy to use and work with the platform, even for naive users. For quicker decision making in business, Rapidminer leverages machine learning and AI to process Big Data; using Apache Spark, it connects to Big Data and processes it quickly and accurately to obtain the needed insights.
1.4.12 Tableau
Tableau is a data visualization tool, mainly used for business data; it helps those with little or no technical knowledge to find data patterns and hidden insights. Tableau features real-time data analysis and data collaboration.
10 https://mljar.com/automl/
11 https://rapidminer.com/
12 https://www.tableau.com/
Tableau can extract data from various sources (e.g., heterogeneous files, the cloud, etc.) and databases. Its interactive interfaces display data in various graphical forms, like charts and graphs, to depict trends, variations, data density, and so on [112]. The software is free for academicians and researchers. It gives researchers and business organizations a quick glance at the various data patterns hidden in huge heterogeneous data like Big Data, thus allowing quicker insight into business data.
1.4.13 Azure Machine Learning Studio
Azure Machine Learning Studio, developed by Microsoft, is a platform for data science, data analysis, and prediction modeling. It allows one to develop and deploy machine learning models without any coding knowledge, providing an interactive visual interface with a drag-and-drop facility for easy interaction and enhanced visualization. For experts, it supports the Python and R languages for customized prediction modeling and data analysis. It is a cloud-based service, available on the Azure Cloud platform. Azure provides machine learning algorithms for regression, classification, and clustering; some of the algorithms are Bayes point machine, logistic regression, decision tree, neural network, decision forest, SVM, and decision jungle. The Azure Cloud ecosystem provides tools which, besides building, testing, and deploying models, also support data ingestion, preprocessing, feature reduction and selection, and data partitioning. The cloud service allows the platform to apply machine learning functionality quite well to Big Data; commercially, it is applied in facial expression recognition from video, a critical example of Big Data classification.
1.4.14 H2O Driverless AI
H2O Driverless AI is an AI platform designed for training and building machine learning models. Driverless AI does not require any expert-level knowledge for developing models. The platform is very useful in handling huge amounts of data like Big Data; it takes data from various sources, like Amazon S3 and the Hadoop Distributed File System (HDFS), as well as local data sources.
13 https://azure.microsoft.com/en-in/services/machine-learning-studio/
14 https://www.h2o.ai/products/h2o-driverless-ai/
Driverless AI can be deployed in all major clouds, like Microsoft Azure, AWS, and Google Cloud. It automates machine learning workflows such as tuning, model validation, feature engineering, selection, and deployment. Data visualization and machine learning interpretability are important features of Driverless AI: it provides statistical data plotting to help users get a quick perception of the data before starting the machine learning modeling process, and it describes the modeling results in a human-readable format. Driverless AI is integrated with TensorFlow, which provides a deep learning approach to solving different types of real-life problems like sentiment analysis, classifying documents, finding customers' purchasing patterns, and content tagging. H2O Driverless AI provides a more user-friendly solution than the other platforms available in the market.
1.4.15 Apache Mahout
Apache Mahout is an open-source data mining framework. The framework breaks a computational job into multiple tasks which run on multiple computers to analyze the data quickly, returning high-performance, efficient results. The Mahout framework is coupled with the Apache Hadoop infrastructure for storing and processing big data in a distributed environment. The framework supports machine learning algorithms for clustering, classification, and collaborative filtering. Several algorithms, like k-means, fuzzy k-means, mean shift, Canopy, and Dirichlet, are supported for clustering; these are implemented on top of Hadoop's map-reduce paradigm. Mahout mainly comprises a Java library for statistical data operations. It provides high scalability by supporting any large-scale data set in a distributed computing environment [113].
1.4.16 Apache Spark (MLlib)
MLib is a scalable machine learning library, provided by Apache Spark. MLib contains many algorithms, like traditional classification techniques. This Apache Spark’s library in conjunction with Hadoop leverages machine learning for Big Data clustering, classification and decision making. It provides fast 15 16
15 https://mahout.apache.org/
16 https://spark.apache.org/mllib/
It provides fast machine learning implementations in a distributed manner, including, but not limited to, linear regression, Naïve Bayes, logistic regression, decision trees, gradient-boosted trees, ensemble algorithms, k-means clustering, latent Dirichlet allocation (LDA), and PCA. Additionally, the library provides utility tools for convex optimization, distributed linear algebra, statistical analysis, standardization, normalization, hashing, and feature extraction [114]. MLlib also addresses the problem of pipelining algorithms in a distributed manner.
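As an illustration, here is a minimal MLlib sketch using the pyspark DataFrame-based API; the tiny inline dataset stands in for a large, distributed one:

```python
# Minimal MLlib sketch; the tiny inline DataFrame stands in for a large,
# distributed data set.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.5, 0.8)),
     (1.0, Vectors.dense(2.5, 1.3))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)  # training runs distributed across the Spark cluster
print(model.coefficients, model.intercept)
spark.stop()
```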
1.4.17 Apache Storm
Apache Storm is an open platform used for machine learning on streaming data and for run-time data analysis [115]. It is one of the best platforms for real-time Big Data analytics. Storm supports many programming languages, which gives programmers an edge over other platforms for customized machine learning and data analysis. Its data processing for analysis and prediction is much faster than Hadoop's and Spark's: data are processed within milliseconds through distributed computing. The reliability and performance of Storm are much higher than those of contemporary data analytics tools.
1.5 Conclusion
The power of Big Data has revolutionized many sectors, and its proper treatment and utilization can provide startling benefits. For that, it is important to process Big Data methodically, and Big Data classification is one of the crucial aspects of this. Using classification techniques, Big Data are classified into different classes or segments according to different features of the acquired data. The classified data are easier to handle and manipulate in retrieving information and discovering knowledge, which otherwise would be difficult to attain. We have seen in this chapter that different types of classification have been proposed, following different approaches. Machine learning is the soul of data classification techniques, and various learning techniques are being used successfully for data classification. To cope with the complexities of Big Data, several advanced learning techniques have also been proposed. Due to the wide application scenarios, a number of open-source and commercial libraries, tools, and platforms have emerged which are good options for performing Big Data classification. Among them, the prominent ones are discussed in this chapter.
17 https://storm.apache.org/
References
[1] S. Dash, S.K. Shakyawar, M. Sharma, S. Kaushik, Big data in healthcare: management, analysis and future prospects, J. Big Data 6 (2019). Article number: 54.
[2] P.K.D. Pramanik, B. Upadhyaya, S. Pal, T. Pal, Internet of Things, smart sensors, and pervasive systems: enabling the connected and pervasive health care, in: N. Dey, A. Ashour, S.J. Fong, C. Bhatt (Eds.), Healthcare Data Analytics and Management, Academic Press, 2018, pp. 1–58.
[3] I.A.T. Hashem, V. Chang, N.B. Anuar, K. Adewole, I. Yaqoob, A. Gani, et al., The role of big data in smart city, Int. J. Inf. Manag. 36 (5) (2016) 748–758.
[4] E.A. Nuaimi, H.A. Neyadi, N. Mohamed, J. Al-Jaroodi, Applications of big data to smart cities, J. Internet Serv. Appl. 6 (2015). Article number: 25.
[5] M.N.I. Sarker, M. Wu, B. Chanthamith, S. Yusufzada, D. Li, J. Zhang, Big Data driven smart agriculture: pathway for sustainable development, in 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 2019.
[6] S. Wolfert, L. Ge, C. Verdouw, M.-J. Bogaardt, Big Data in smart farming – a review, Agric. Syst. 153 (2017) 69–80.
[7] P.K.D. Pramanik, B. Mukherjee, S. Pal, T. Pal, S.P. Singh, Green smart building: requisites, architecture, challenges, and use cases, in: A. Solanki, A. Nayyar (Eds.), Green Building Management and Smart Automation, IGI Global, 2019, pp. 1–50.
[8] B. Qolomany, A. Al-Fuqaha, A. Gupta, D. Benhaddou, S. Alwajidi, J. Qadir, et al., Leveraging machine learning and Big Data for smart buildings: a comprehensive survey, IEEE Access 7 (2019) 90316–90356.
[9] S. Pal, P.K.D. Pramanik, P. Choudhury, A step towards smart learning: designing an interactive video-based M-learning system for educational institutes, Int. J. Web-Based Learn. Teach. Technol. 14 (4) (2019) 26–48.
[10] M. Anshari, Y. Alas, L.S. Guan, Developing online learning resources: Big data, social networks, and cloud computing to support pervasive knowledge, Educ. Inf. Technol. 21 (2016) 1663–1677.
[11] P.K.D. Pramanik, B. Mukherjee, S. Pal, B.K. Upadhyaya, S. Dutta, Ubiquitous manufacturing in the age of industry 4.0: a state-of-the-art primer, in: A. Nayyar, A. Kumar (Eds.), A Roadmap to Industry 4.0: Smart Production, Sharp Business and Sustainable Development, Springer, Cham, 2019, pp. 73–112.
[12] L.D. Xu, L. Duan, Big data for cyber physical systems in industry 4.0: a survey, Enterp. Inf. Syst. 13 (2) (2019) 148–169.
[13] G. Bello-Orgaz, J.J. Jung, D. Camacho, Social big data: recent achievements and new challenges, Inf. Fusion 28 (2016) 45–59.
[14] B. Sarkar, N. Sinhababu, M. Roy, P.K.D. Pramanik, P. Choudhury, Mining multilingual and multiscript twitter data: unleashing the language and script barrier, Int. J. Bus. Intell. Data Min. 16 (1) (2019) 107–127.
[15] C. Zhang, K. Ota, J. Jia, M. Dong, Breaking the blockage for big data transmission: gigabit road communication in autonomous vehicles, IEEE Commun. Mag. 56 (6) (2018) 152–157.
[16] A. Daniel, K. Subburathinam, A. Paul, N. Rajkumar, S. Rho, Big autonomous vehicular data classifications: towards procuring intelligence in ITS, Vehicular Commun. 9 (2017) 306–312.
[17] P.K.D. Pramanik, S. Pal, P. Choudhury, Beyond automation: the cognitive IoT. Artificial intelligence brings sense to the Internet of Things, in: A.K. Sangaiah, A. Thangavelu, V.M. Sundaram (Eds.), Cognitive Computing for Big Data Systems Over IoT: Frameworks, Tools and Application, Springer, 2018, pp. 1–37.
[18] S. Gupta, A.K. Kar, A. Baabdullah, W.A. Al-Khowaiter, Big data with cognitive computing: a review for the future, Int. J. Inf. Manag. 42 (2018) 78–89.
[19] J. Han, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers Inc, San Francisco, 2005.
[20] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comp. Surv. 31 (3) (1999) 264–323.
[21] J.D. Groot, What is Data Classification? A Data Classification Definition, 3 January 2019. [Online]. Available: https://digitalguardian.com/blog/what-data-classification-data-classification-definition (accessed 28.02.19).
[22] P. Balas, Big Data and Classification, 28 February 2015. [Online]. Available: https://www.datascienceassn.org/sites/default/files/Big%20Data%20and%20Classification%20%20by%20Paul%20Balas%20-%20Slides.pdf (accessed 28.02.19).
[23] P.K.D. Pramanik, S. Pal, M. Mukhopadhyay, Healthcare Big Data: a comprehensive overview, in: N. Bouchemal (Ed.), Intelligent Systems for Healthcare Management and Delivery, IGI Global, 2018, pp. 72–100.
[24] P.K.D. Pramanik, S. Pal, M. Mukhopadhyay, Big Data & Big Data analytics for improved healthcare service and management, Int. J. Priv. Health Inf. Manag. 7 (2) (2021). In press.
[25] P.K. Singh, P.K.D. Pramanik, A.K. Dey, P. Choudhury, Recommender systems: an overview, research trends and future direction, Int. J. Bus. Syst. Res. (2021). Available from: https://doi.org/10.1504/IJBSR.2021.10033303. In press.
[26] S. García, S. Ramírez-Gallego, J. Luengo, J.M. Benítez, F. Herrera, Big data preprocessing: methods and prospects, Big Data Anal. 1 (9) (2016).
[27] J. Hariharakrishnan, S. Mohanavalli, M. Srividya, K.B.S. Kumar, Survey of preprocessing techniques for mining big data, in International Conference on Computer, Communication and Signal Processing (ICCCSP), Chennai, India, 2017.
[28] J. Tang, S. Alelyani, H. Liu, Feature selection for classification: a review, Data Classification: Algorithms and Applications, CRC Press, 2014.
[29] H. Deng, Y. Sun, Y. Chang, J. Han, Probabilistic models for classification, Data Classification: Algorithms and Applications, CRC Press, 2014, pp. 65–86.
[30] S. Safavian, D. Landgrebe, A survey of decision tree classifier methodology, IEEE Trans. Syst. Man Cyber 21 (3) (1991) 660–674.
[31] A. Biem, Neural networks: a review, Data Classification: Algorithms and Applications, CRC Press, 2014, pp. 205–244.
[32] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, 1992.
Chapter 1 Big Data classification: techniques and tools
[33] R. Lodha, H. Jain, L. Kurup, Big Data challenges: data analysis perspective, Int. J. Curr. Eng. Technol. 4 (5) (2014) 32863289. [34] P. Pandey, M. Kumar, P. Srivastava, Classification techniques for big data: a survey, in 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 2016. [35] P. Koturwar, S. Girase, D. Mukhopadhyay, A survey of classification techniques in the area of Big Data, Int. J. Adv. Found. Res. Computer 1 (11) (2014). [36] A. Oussous, F.-Z. Benjelloun, A.A. Lahcen, S. Belfkih, Big Data technologies: a survey, J. King Saud. Univ. Comp. Inf. Sci. 30 (4) (2018) 431448. [37] B. Krawczyk, M. MikelGalar, H. Wo´zniak, et al., Dynamic ensemble selection for multi-class classification with one-class classifiers, Pattern Recognit. 83 (2018) 3451. [38] R. Babbar, B. Scho¨lkopf, DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification, in Tenth ACM International Conference on Web Search and Data Mining, Cambridge, 2016. [39] Z. Wang, L. Qu, J. Xin, H. Yang, X. Gao, A unified distributed ELM framework with supervised, semi-supervised and unsupervised big data learning, Memetic Comput. (2018) 111. [40] D. Levinger, V. Dev, Six steps to master machine learning with data preparation, KDnuggets, December 2018. [Online]. Available: https://www. kdnuggets.com/2018/12/six-steps-master-machine-learning-datapreparation.html (accessed 26.08.19). [41] J. Brownlee, How to prepare data for machine learning, Machine Learning Mastery, 25 December 2013. [Online]. Available: https:// machinelearningmastery.com/how-to-prepare-data-for-machine-learning/ (accessed 26.08.19). [42] G. Yufeng, The 7 steps of machine learning, Towards Data Science, 1 September 2017. [Online]. Available: https://towardsdatascience.com/the7-steps-of-machine-learning-2877d7e5548e (accessed 26.08.19). [43] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., Wiley, 2000. [44] L. Breiman, J. Friedman, C.J. Stone, R.A. Olshen, Classification and Regression Trees, Chapman Hall/CRC, 1984. [45] D. Garcı´a-Gil, J. Luengo, S. Garcı´a, F. Herrera, Enabling smart data: noise filtering in big data classification, Inf. Sci. 479 (2019) 135152. [46] P. Moeck, On classification approaches for crystallographic symmetries of noisy 2D periodic patterns, arXiv preprint, no. arXiv:1902.04155, 2019. [47] P.M. Vecsei, K. Choo, J. Chang, T. Neupert, Neural network based classification of crystal symmetries from x-ray diffraction patterns, Phys. Rev. B 99 (24) (2019). [48] C.-H. Liu, Y. Tao, D. Hsu, Q. Du, S.J.L. Billinge, Using a machine learning approach to determine the space group of a structure from the atomic pair distribution function, Acta Crystallogr. Sect. A: Found. Adv. 75 (4) (2019) 633643. [49] R. Batra, H.D. Tran, C. Kim, J. Chapman, L. Chen, A. Chandrasekaran, et al., A general atomic neighborhood fingerprint for machine learning based methods, J. Phys. Chem. C. 123 (25) (2019) 1585915866. [50] S.Y. Kim, W.-C. Lee, Classification consistency and accuracy for mixedformat tests, Appl. Meas. Educ. 32 (2) (2019) 97115. [51] E.C. Knight, S.P. Hernandez, E.M. Bayne, V. Bulitko, B.V. Tucker, Preprocessing spectrogram parameters improve the accuracy of bioacoustic classification using convolutional neural networks, Bioacoustics (2019) 119.
39
40
Chapter 1 Big Data classification: techniques and tools
[52] R.G. Hussain, M.A. Ghazanfar, M.A. Azam, U. Naeem, S.U. Rehman, A performance comparison of machine learning classification approaches for robust activity of daily living recognition, Artif. Intell. Rev. 52 (1) (2019) 357379. [53] C.C. Aggarwal, An introduction to data classification, Data Classification: Algorithms and Applications, CRC Press, 2014, pp. 136. [54] P.-W. Wang, C.-J. Lin, Support vector machines, Data Classification: Algorithms and Applications, CRC Press, New York, USA, 2014, pp. 187204. [55] DataFlair, SVM Support Vector Machine Tutorial for Beginners, DataFlair, 19 November 2018. [Online]. Available: https://data-flair.training/blogs/ svm-support-vector-machine-tutorial/ (accessed 26.08.19). [56] T. Afonja, Kernel Functions, Towards Data Science, 2 January 2017. [Online]. Available: https://towardsdatascience.com/kernel-function6f1d2be6091 (accessed 26.08.19). [57] DataFlair, Kernel Functions-Introduction to SVM Kernel & Examples, Data Flair, 16 November 2018. [Online]. Available: https://data-flair.training/ blogs/svm-kernel-functions/ (accessed 26.08.19). [58] Analytics Vidhya, A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python), Analytics Vidhya, 12 April 2016. [Online]. Available: https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-treebased-modeling-scratch-in-python/ (accessed 26.08.19). [59] R. Saxena, How Decision Tree Algorithm Works, Dataaspirant, 30 January 2017. [Online]. Available: https://dataaspirant.com/2017/01/30/howdecision-tree-algorithm-works/ (accessed 26.08.19). [60] GeeksforGeeks, Naive Bayes Classifiers, GeeksforGeeks, 2017. [Online]. Available: https://www.geeksforgeeks.org/naive-bayes-classifiers/ (accessed 26.08.19). [61] J. McGonagle, Naive Bayes Classifier, Brilliant, 2019. [Online]. Available: https://brilliant.org/wiki/naive-bayes-classifier/ (accessed 26.08.19). [62] T. Srivastava, Introduction to k-Nearest Neighbors: A powerful Machine Learning Algorithm (with implementation in Python & R), Analytics Vidhya, 26 March 2018. [Online]. Available: https://www.analyticsvidhya.com/blog/ 2018/03/introduction-k-neighbours-algorithm-clustering/ (accessed 26.08.19). [63] A. Navlani, KNN Classification using Scikit-learn, DataComp, 2 August 2018. [Online]. Available: https://www.datacamp.com/community/ tutorials/k-nearest-neighbor-classification-scikit-learn (accessed 26.08.19). [64] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 532. [65] R. Genuer, J.-M. Poggi, C. Tuleau-Malot, N. Villa-Vialaneix, Random forests for Big Data, Big Data Res. 9 (2017) 2846. [66] K. Liao, Prototyping a Recommender System Step by Step Part 2: Alternating Least Square (ALS) Matrix Factorisation in Collaborative Filtering, Towards Data Science, 17 November 2018. [Online]. Available: https://towardsdatascience.com/prototyping-a-recommender-system-stepby-step-part-2-alternating-least-square-als-matrix-4a76c58714a1 (accessed 26.08.19). [67] M. Agarwal, R. Mehra, Review of matrix decomposition techniques for signal processing applications, Int. J. Eng. Res. Appl. 4 (1) (2014) 9093. [68] N. Khan, M.S. Husain, M.R. Beg, Big Data Classification using Evolutionary Techniques: A Survey, in IEEE International Conference on Engineering and Technology (ICETECH), Coimbatore, India, 2015.
Chapter 1 Big Data classification: techniques and tools
[69] S. Cheng, Y. Shi, Q. Qin, R. Bai, Swarm intelligence in Big Data analytics, Lecture Notes Comput. Sci. 8206 (2013) 417426. [70] M. Castelli, L. Vanneschi, L. Manzoni, A. Popoviˇc, Semantic genetic programming for fast and accurate data knowledge discovery, Swarm Evolut. Comput. 26 (2016) 17. [71] V. Stanovov, C. Brester, M. Kolehmainen, O. Semenkina, Why don’t you use Evolutionary Algorithms in Big Data?, in IOP Conference Series: Materials Science and Engineering, vol. 173, pp. 19, 2017. [72] N. Jatanaa, B. Surib, Particle swarm and genetic algorithm applied to mutation testing for test data generation: A comparative evaluation, J. King Saud. Univ. Comput. Inf. Sci. (2019). [73] W. Lin, Z. Lian, X. Gu, B. Jiao, A local and global search combined particle swarm optimization algorithm and its convergence analysis, Math. Probl. Eng. (2014). [74] C. Gallo, Artificial neural networks: tutorial, Encyclopedia of Information Science and Technology, IGI Global, USA, 2015, pp. 179189. [75] J. Qiu, Q. Wu, G. Ding, Y. Xu, S. Feng, A survey of machine learning for big data processing, EURASIP J. Adv. Signal. Process. vol. 67 (2016). [76] J. Xie, Z. Song, Y. Li, Y. Zhang, H. Yu, J. Zhan, et al., A survey on machine learning-based mobile big data analysis: challenges and applications, Wirel. Commun. Mob. Comput. 2018 (2018) 19. [77] M.M. Najafabadi, F. Villanustre, T.M. Khoshgoftaar, N. Seliya, R. Wald, E. Muharemagic, Deep learning applications and challenges in big data analytics, J. Big Data 2 (2015) 121. [78] J.L. Torrecilla, J. Romo, Data learning from big data, Stat. Prob. Lett. 136 (2018) 1519. [79] Z. Wu, T. Yao, Y. Fu, Y.-G. Jiang, Deep learning for video classification and captioning, Frontiers of Multimedia Research, Association for Computing Machinery and Morgan & Claypool, New York, NY, USA, 2018, pp. 329. [80] F. Lateef, Y. Ruichek, Survey on semantic segmentation using deep learning techniques, Neurocomputing 338 (2019) 321348. [81] N. Cummins, A. Baird, B.W. Schuller, Speech analysis for health: current state-of-the-art and the increasing impact of deep learning, Methods 151 (2018) 4154. [82] A. Brunetti, D. Buongiorno, G.F. Trotta, V. Bevilacqua, Computer vision and deep learning techniques for pedestrian detection and tracking: a survey, Neurocomputing 300 (2018) 1733. [83] S. Purushotham, C. Meng, Z. Che, Y. Liu, Benchmarking deep learning models on large healthcare datasets, J. Biomed. Inform. 83 (2018) 112134. [84] H.-C. Yang, M.M. Islam, Y.-C. Li, Potentiality of deep learning application in healthcare, Comput. Methods Prog. Biomed. 161 (2018) a1. [85] C. You, J. Lu, D. Filev, P. Tsiotras, Advanced planning for autonomous vehicles using reinforcement learning and deep inverse reinforcement learning, Robot. Autonomous Syst. 114 (2019) 118. [86] S.T. Mohammed, A. Bytyn, G. Ascheid, G. Dartmann, Reinforcement learning and deep neural network for autonomous driving, Big Data Analytics for Cyber-Physical Systems, Elsevier, 2019, pp. 187213. [87] H. Zheng, S.R. Kulkarni, H.V. Poor, Attribute-distributed learning: models, limits, and algorithms, IEEE Trans. Signal. Process. 59 (1) (2011) 386398. [88] H. Chen, T. Li, C. Luo, S.J. Horng, G. Wang, A rough set-based method for updating decision rules on attribute values’ coarsening and refining, IEEE Trans. Knowl. Data Eng. 26 (12) (2014) 28862899.
41
42
Chapter 1 Big Data classification: techniques and tools
[89] J. Chen, C. Wang, R. Wang, Using stacked generalisation to combine SVMs in magnitude and shape feature spaces for classification of hyperspectral data, IEEE Trans. Geosci. Remote. 47 (7) (2009) 21932205. [90] E. Leyva, A. Gonza´lez, R. Pe´rez, A set of complexity measures designed for applying meta-learning to instance selection, IEEE Trans. Knowl. Data Eng. 27 (2) (2014) 354367. [91] H. Tong, Big Data Classification, Data Classification: Algorithms and Applications, CRC Press, New York, USA, 2014, pp. 275286. [92] S.R. Upadhyaya, Parallel approaches to machine learning - a comprehensive survey, J. Parallel Distr Com. 73 (3) (2013) 284292. [93] S.J. Pan, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 13451359. [94] S. Hosein, Active Learning: Curious AI Algorithms, DataCamp, 9 Feburary 2018. [Online]. Available: https://www.datacamp.com/community/ tutorials/active-learning (accessed 26.08.19). [95] C.C. Aggarwal, X. Kong, Q. Gu, J. Han, P.S. Yu, Active Learning: A Survey, Data Classification: Algorithms and Applications, CRC Press, 2014, pp. 571606. ´ lvarez, M. [96] G. Camps-Valls, L. Go´mez-Chova, J. Mun˜oz-Marı´, J.L. Rojo-A Martı´nez-Ramo´n, Kernel-based framework for multitemporal and multisource remote sensing data classification and change detection, IEEE Trans. Geosci. Remote. Sens. 46 (6) (2008) 18221835. [97] B. Scho¨lkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002. [98] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, U.K, 2004. [99] C. Li, M. Georgiopoulos, G. Anagnostopoulos, A unifying framework for typical multitask multiple kernel learning problems, IEEE Trans. Neur Net. Lear Syst. 25 (7) (2014) 12871297. [100] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, et al., Scikit-learn, in European Conference on Machine Learning and Principles and Practices of Knowledge Discovery in Databases, Prague, 2013. [101] J. Brownlee, A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library, Machine Learning Mastery, 16 April 2014. [Online]. Available: https://machinelearningmastery.com/a-gentle-introductionto-scikit-learn-a-python-machine-learning-library/. [Accessed 19 May 2019]. [102] T.D. Smedt, W. Daelemans, Pattern for Python, J. Mach. Learn. Res. 13 (2012) 20632067. [103] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, The WEKA workbench, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2016, pp. 7126. [104] M. Hall, Classifiers, Hitachi Vantara, 1 November 2016. [Online]. Available: https://wiki.pentaho.com/display/DATAMINING/Classifiers (accessed 19.05.19). [105] R. Janoˇscova´, Mining Big Data in WEKA, in International Workshop on Knowledge Management (IWKM), Bratislava, 2016. [106] Cloud Academy, BigML: Machine Learning Made Easy, Cloud Academy, 2019. [Online]. Available: https://cloudacademy.com/blog/bigml-machinelearning/ (accessed 19.05.19). [107] Amazon Web Services, DataRobot on AWS, Amazon Web Services, 2019. [Online]. Available: https://aws.amazon.com/solutionspace/ datarobot_on_aws/ (accessed 19.05.19).
Chapter 1 Big Data classification: techniques and tools
[108] F.-F. Li, J. Li, Cloud AutoML: Making AI accessible to every business, Google, 17 January 2018. [Online]. Available: https://www.blog.google/ products/google-cloud/cloud-automl-making-ai-accessible-everybusiness/ (accessed 19.05.19). [109] R. Thomas, Google’s AutoML: Cutting Through the Hype, fast.ai, 23 July 2018. [Online]. Available: https://www.fast.ai/2018/07/23/auto-ml-3/ (accessed 19.05.19). [110] IBM, Watson Studio overview, IBM, 10 May 2019. [Online]. Available: https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/ overview-ws.html (accessed 19.05.19). [111] C. Shao, IBM Watson Studio: Build and train AI models all in one integrated environment, IBM, 20 March 2018. [Online]. Available: https:// www.ibm.com/cloud/blog/announcements/watson-studio-announcement (accessed 19.05.19). [112] Intellipaat, What is Tableau?, Intellipaat, 2017. [Online]. Available: https:// intellipaat.com/blog/what-is-tableau/ (accessed 19.05.19). [113] Technopedia, Apache Mahout, Technopedia, 2019. [Online]. Available: https://www.techopedia.com/definition/30301/apache-mahout (accessed 19.05.19). [114] Apache Spark, Machine Learning Library (MLlib) Guide, 2018. [Online]. Available: https://spark.apache.org/docs/latest/ml-guide.html (accessed 19.05.19). [115] IntelliPaat, What is Apache Storm?, IntelliPaat, 2017. [Online]. Available: https://intellipaat.com/blog/what-is-apache-storm/ (accessed 19.05.19).
43
2 Big Data Analytics for healthcare: theory and applications
Shivam Bachhety, Shivani Kapania and Rachna Jain, Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Abstract
Over the past 10 years, the healthcare industry has been growing at a remarkable rate, generating enormous amounts of data in terms of volume, velocity, and variety. Big Data methodologies in healthcare can not only increase business value but also improve healthcare services. Several techniques can be implemented to develop early disease diagnosis systems and improve treatment procedures through detailed analysis over time. In such a situation, Big Data Analytics connects intricate databases to obtain more useful results. In this chapter, we discuss the procedure of big data analytics in the healthcare sector, some practical applications, and the associated challenges. We also look at the various big data techniques and the tools used to implement them. We conclude the chapter with a discussion of potential opportunities for analytics in the healthcare sector.
Keywords: Big Data; healthcare; Big Data Analytics; Hadoop
2.1 Introduction to Big Data
The amount of information that can be processed by our systems is limited, yet data continues to be generated across the planet at an unmatched rate. Today's businesses and modern organizations owe a big part of their success to making sense of all the data being generated, etching out new patterns and connections that hold huge rewards. At the core level, the necessary elements for working
with big data are similar to those for an arbitrary dataset of any size. Looking beyond this, the processing speed, scale of operation, and the various aspects of the data that must be handled at each step of the process exhibit a unique set of challenges. The purpose of all big data systems is to generate actionable information [1] from vast volumes of complex data that may be structured or unstructured in format, and this is not achievable by applying traditional techniques. For this goal, the programs must run together on several machines, and each program must know which component of the data to process. After a program completes its processing, the results from each machine must be compiled together, using special programming tools, to generate useful insights from the massive amount of data. Since it is usually much quicker for programs to obtain data stored locally on a machine than over a network [2], it is crucial to consider how those machines are networked together and how the data is distributed across the cluster when building a scalable solution. Computer clusters are a feasible option to address the need for high computation power and large storage capacity. The cluster combines the resources of all the participating machines on the network to provide the following benefits:
• Resource Pooling: The cluster software combines not only the available storage of all the machines but also their computation capacity (CPU) and memory. This is essential for processing massive datasets, as they require large quantities of all the above resources.
• High Availability: A major benefit offered by computer clustering is that a hardware/software failure has no serious consequences for processing, which is critical for real-time analytics.
• Easy Scalability: It is easy to add more machines to the network; thus resource requirements can be met without increasing the physical capacity of any single machine.
On the other hand, a solution must be implemented for coordinating between machines for resource sharing, scheduling tasks on the various nodes, and managing cluster membership. This can be done using software such as Apache Mesos or Hadoop's YARN. However, with the high reward, Big Data attracts a lot of interest. There is a need to process this large volume of data into business intelligence (BI) that organizations can tap into to make better decisions, creating a level playing field that
provides organizations improved ways to strategize regardless of their size, geography, market share, customer hold, and other categorizations. The key to the future's market lies with the firms that can make sense of all this data at extremely high volume and speed.
2.1.1 Motivation
Big Data has altered the way we acquire, analyze, and leverage data in any domain. Healthcare is one of the most promising areas where it can drive change. Healthcare analytics has the potential to cut treatment costs, forecast epidemic outbreaks, avoid preventable diseases, and advance the quality of life of an individual. The average human lifespan is growing along with the world population, which poses new challenges to treatment delivery approaches in today's scenario. Health specialists, just like business leaders, are able to collect enormous amounts of data and look for the best ways to use these numbers. The chapter outlines opportunities for data analysts to provide more evidence-based decisions to physicians, so that they can harness and trust abundant pathways of research and clinical data as opposed to relying exclusively on their own knowledge. As in many other businesses, data collection and administration are moving to scale in healthcare. The chapter addresses the higher demand for big data analytics in healthcare facilities than ever before and some methodologies to deal with it. The major motivations of this chapter are to:
• improve healthcare quality and management;
• reduce healthcare expenses and preventable overuse;
• provide support for new payment structures; and
• enable real-time health tracking and the prevention of human errors.
The organization of this chapter is as follows: Section 2.1 introduces Big Data as a core technology to process extensive data in terms of volume, variety, and velocity. It further focuses on the concept of clusters in Big Data and their advantages. Section 2.2 presents the techniques and technologies available to harness Big Data, covering the working analytics model, use cases, and its types. Section 2.3 defines the core usage of Big Data in the healthcare sector. Section 2.4 elaborates on medical imaging, one of the core healthcare domains, with emphasis on the use of big data. Section 2.5 describes the methodology and steps involved in Big Data Analytics. Section 2.6 lists various platforms and tools available in the market for Big Data Analytics. It also discusses the advantages and disadvantages of
each in detail. Section 2.7 presents the opportunities of Big Data in healthcare. Section 2.8 focuses on the particular challenges faced by the medical industry in processing big data. Section 2.9 lists the core applications of Big Data in the healthcare industry. The last section concludes with the future aspects of Big Data in healthcare, mentioning the key areas of further development.
2.2 Big Data Analytics
Big Data Analytics is an advanced form of analytics. As a consequence, it incorporates complex applications with elements such as predictive models, statistical algorithms, and what-if analysis powered by high-end analytic operations. Applications that make use of these tools enable big data analysts, data scientists, predictive modelers, statisticians, and other professionals in the field to examine growing volumes of structured [3] trade data, as well as to investigate forms of data that are often left untouched by traditional BI and other analytic programs. These span a variety of semistructured and unstructured data: consider, for instance, internet click-stream data, social media content, text from customer emails and survey responses, web server logs, and machine data obtained by sensors on the internet of things. Considering all of its functions, Big Data Analytics offers numerous advantages for a firm, including new revenue streams, improved marketing strategy, improved operational efficiency, and better customer service, which together make up a superior advantage over rivals. Data analytics can be thought of as the extraction of insights and actionable knowledge [4] from big data. Insights from data can be achieved by formulating hypotheses based on inferences gathered from experience and on striking correlations among variables. There are four kinds of data analytics:
1. Descriptive Analytics: In this type, all prior data is accumulated and organized handily as bar graphs, charts, pie graphs, maps, scatter diagrams, etc., to give a simple visualization that provides insight into what the data implies, generally known as a dashboard (a minimal sketch follows this list). A routine example is the presentation of population censuses that summarize data about people across a country by education, age range, gender, income, population density, and comparable variables.
2. Predictive Analytics: As the name suggests, the system predicts, based on the information gathered, what is expected to happen next by extrapolating the available data. The tools used for extrapolation include time series analysis using statistical methods, machine learning algorithms, and neural networks. A fundamental application of this sort is in marketing, which works by comprehending clients' needs and preferences; a typical example is the advertisement for blades that appears when you buy a shaving kit from an online retail shop.
3. Discovery or Exploratory Analytics: This kind of analytics is used to discover unexpected correlations between parameters in a compilation of big data and provides an additional chance for making serendipitous discoveries and accumulating insights from the collection of data. One application is companies finding patterns in consumers' habits using their feedback, tweets, and blogs. This makes it easier to predict a customer's next action, and the organization then has the chance to come up with an attractive offer to try to change the customer's likely response.
4. Prescriptive Analytics: Depending on the data accumulated, this identifies chances to optimize solutions for existing issues. This analysis informs the concerned customer what needs to be done to accomplish the goal, finding its best application in airline companies' allotment of seat prices based on historical statistics of travel routines, popular origins and destinations, significant events or holidays, etc., to boost profits for the company.
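To make the descriptive case concrete, here is a minimal sketch in Python with pandas; the columns and values are invented for illustration and do not come from any dataset discussed in this chapter.

```python
import pandas as pd

# Hypothetical census-style records (columns and values invented)
df = pd.DataFrame({
    "age_group": ["0-18", "19-40", "41-65", "65+", "19-40", "41-65"],
    "gender":    ["F", "M", "F", "M", "F", "M"],
    "income":    [0, 42000, 55000, 21000, 38000, 61000],
})

# Descriptive analytics: summarize what the data already says
print(df.describe())                             # basic statistics per numeric column
print(df.groupby("age_group")["income"].mean())  # mean income by age group
```

Such summaries are exactly what a dashboard visualizes; the later kinds of analytics build models on top of them.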
2.2.1 Techniques and technologies
These provide methods for examining all the gathered datasets and drawing information from them, which helps firms make better and more educated business decisions, whereas a simple BI query answers only a limited set of questions about business performance and operations. There are several methods [5] that can be utilized when handling a big data project. However, the type of data being examined plays a major role in selecting an appropriate method; the research questions and the technology at hand also play a key role in deciding the type of technique. Some of the most regularly used techniques and tools are as follows:
• Data Mining: Ref. [6] described data mining as blending approaches from machine learning and statistics
into database management. This is done so that the system can accurately point toward patterns even in a very large dataset. Ref. [7] considers it one of the most important tools for data-driven decision-making and explains it as searching or digging into a data file for information to understand a particular phenomenon in a better way.
• Cluster Analysis: A kind of data mining that divides a large group of data into smaller groups of similar objects. A distinctive characteristic of this method is that the groups and their features of similarity are not known in advance (a small clustering sketch appears after Fig. 2.1).
• Machine Learning: Machine learning consists of creating algorithms that allow applications to evolve based on empirical data. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data [6]. Miller (2011/12) gives the illustration of the U.S. Department of Homeland Security, which uses machine learning to spot patterns in mobile and mail traffic and in additional sources of security dangers, and is thus better prepared to tackle them.
• Association Rule Learning: A way of finding relationships among the variables of a dataset, frequently employed in data mining. This technique underlies the recommender systems typically used at places such as Netflix and Amazon Prime.
• Text Analytics: A large amount of the accessible data, such as emails, online messages, and company records, is in text format. Text analytics can be used to extract targets, mine opinions, answer questions, and model topics.
• Crowdsourcing: Crowdsourcing collects data from a large group of people through an open call, usually via a Web 2.0 tool.
These are just a few of the many techniques used in Big Data Analytics. Conventional data warehouses based on relational databases may work on structured data, but they are not suitable for unstructured or semistructured data. The high processing demands of big data require frequent updates at regular intervals, for example, when capturing website visitors' online activities or a mobile application's performance, making a traditional data warehouse inappropriate. As a consequence, most companies that gather, process, and analyze big data have switched to NoSQL databases, Hadoop, or other tools such as those specified in Fig. 2.1.
Figure 2.1 A few analysis tools.
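As a minimal illustration of the cluster analysis technique listed above, this Python sketch partitions a handful of synthetic records with scikit-learn's k-means; both features and all values are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-feature dataset (e.g., age and treatment cost; values invented)
X = np.array([[25, 200], [30, 250], [28, 230],
              [62, 900], [65, 950], [70, 1000]])

# Partition the records into two groups of similar objects
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each record
print(kmeans.cluster_centers_)  # centroids of the discovered groups
```

Note that, as the bullet above says, the groups are not known in advance: k-means discovers them from the similarity of the records.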
2.2.2 How Big Data Analytics works
For performing analytics, NoSQL databases and Hadoop clusters are used as staging areas for the massive volume of heterogeneous data before it is loaded into an analytical database or a data warehouse. Most systems implement the Hadoop data lake technique, in which the cluster serves as the fundamental repository for all kinds of incoming raw data; in such an architecture, the Hadoop cluster is also used to analyze the data directly. After successful data warehousing, effectively managing the data is perhaps the most crucial step in any big data analytics technique. The data can subsequently be analyzed once it has been organized, partitioned, and configured in the Hadoop file system. Software capable of data mining, that is, finding actionable information in large datasets, is used to perform this analysis. Other forms of analysis include that of customer behavior [8] and of possible future developments using predictive analysis. Machine learning and deep learning might also be incorporated into these systems. Apart from the tools mentioned earlier, many insights can be gained by performing statistical analysis and text mining on the data, as has been proven in most BI software and data visualization tools. MapReduce is used to write the queries in analytics and ETL applications, and standard programming
languages such as Structured Query Language (SQL), Python, R, and Scala are also supported by SQL-on-Hadoop tools.
2.2.3 Uses and challenges
Data from external as well as internal sources is combined in the applications used for Big Data Analytics; for example, in weather prediction software, weather data consolidated by a third-party provider is merged with the system's own data. Streaming analytics [9] apps are gaining popularity as users wish to obtain real-time analytics of the data fed into their Hadoop systems in big data environments through stream processing engines such as Storm and Spark. With the introduction of various cloud platforms such as Microsoft Azure and Amazon Web Services (AWS), it has become much simpler and more convenient to manage Hadoop clusters in the cloud. Many Hadoop suppliers such as Hortonworks and Cloudera also support their frameworks on these cloud platforms. Thus users can now create Hadoop clusters in the cloud, run them according to their demand and convenience, and, with usage-based pricing, easily take them offline as well. Earlier, large organizations had to set up big data systems on premises, which incurred huge costs. The biggest obstacle for Big Data Analytics continues to be the lack of internal skills and the correspondingly high costs of hiring experienced data scientists and data engineers to meet the demands of organizations. The level of progress in machine learning and AI has now enabled vendors to design and produce software for big data analysis that is easier to work with, especially for the growing citizen data scientist population. The quantity of data typically involved, and its variety, can cause data management issues in areas including data quality, consistency, and governance, and the use of different data stores and platforms can result in data silos. Additionally, integrating Hadoop, Spark, and other tools into a coherent architecture that matches a firm's big data demands is a difficult proposition for many IT and analytics teams that need to find the right mix of technologies and then put the pieces together.
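As a sketch of the stream-processing style mentioned above, the following Python snippet uses Spark Structured Streaming to maintain running word counts over text arriving on a local socket; the host, port, and application name are placeholders, and the example mirrors the canonical introductory pattern rather than any specific healthcare pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

# Read a text stream from a local socket (host/port are placeholders)
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Split each incoming line into words and keep a running count
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```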
2.3 Big Data in the healthcare sector
With most organizations rapidly shifting to maintaining their health records digitally, there is a considerable increase in the quantity and quality of clinical data available in an electronic format. On the other hand, clinical analytics has been used to
gain new insights and reduce healthcare expenditure in the United States. Techniques used in Big Data have advanced considerably in the past few years. Yet currently, the application of Big Data in healthcare is at an experimental stage, and the potential impact is enormous. The ability to perform analytics on medical images and clinical reports vastly enhances our understanding and the scope for improvement of clinical care. With the advent of new technologies and medical devices, patient data collection is becoming remarkably less complicated. But recognizing diseases requires a composite technique that combines structured and unstructured data from multiple nonclinical and clinical modalities to gain perspective on the patient's condition and predict diseases. As of now, Big Data is mostly used for analyzing trends in populations and turning those trends into valuable knowledge. This helps outline paradigms for healthcare systems, anticipate disease outbreaks, and curb them. A wide range of healthcare providers can benefit from digitizing and efficiently leveraging big data. Inherent benefits include early-stage disease detection to provide simple and effective treatments, personalized medicines, and being able to detect medical fraud more promptly. Big data analytics can unravel answers to various questions related to healthcare. It can help predict outcomes or developments in the disease life cycle based on a massive volume of historical data with parameters such as complications, elective surgery, disease progression, patients unlikely to benefit from surgery, length of stay, and patients at risk for methicillin-resistant Staphylococcus aureus (MRSA), sepsis, or other illnesses possibly acquired in the hospital. According to McKinsey, three areas that have vast potential for cost savings are Research & Development, Clinical Operations, and Public Health.
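To illustrate the kind of outcome prediction described above, here is a minimal, hypothetical sketch using logistic regression from scikit-learn; the features, labels, and values are invented and far too small for real modeling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features per patient: [age, length_of_stay, prior_admissions]
X = np.array([[45, 3, 0], [67, 10, 2], [52, 5, 1],
              [71, 14, 3], [39, 2, 0], [80, 9, 4]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = readmitted within 30 days (hypothetical label)

# Fit a simple classifier on the historical records
model = LogisticRegression().fit(X, y)

# Estimate the readmission risk of a new patient
print(model.predict_proba([[60, 7, 1]]))
```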
2.4 Medical imaging
One of the most important sources of data for healthcare systems is medical images, which are frequently used for diagnosis. Imaging techniques such as echocardiography, magnetic resonance imaging (MRI), X-ray, ultrasound, and computed tomography (CT) are widely utilized for analysis in the clinical setting. With the broad array of available techniques, the size of medical image data varies from a few megabytes to hundreds of megabytes per study. Consequently, this requires massive storage volume and speed- and storage-optimized algorithms to perform the decision-making analytics.
Medical imaging produces valuable information about the functioning and anatomy of organs and possible disease states. It can be used for a variety of purposes, such as monitoring the developing fetus during pregnancy, detecting aneurysms, artery stenosis, and spinal deformities, and identifying tumor-affected organs. Various image processing techniques such as dilation, contour detection, denoising, and segmentation must be used together with machine learning approaches to perform these diagnoses (a minimal segmentation sketch follows). With an increase in the dimensionality of the data, learning the dependencies and relationships within it and creating effective techniques for analysis become a challenge. By integrating medical images with pathological reports, other types of Electronic Health Record data, and genomic data, the accuracy of diagnosis can be substantially improved.
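As a minimal illustration of the segmentation step mentioned above, the following Python sketch applies Otsu thresholding with scikit-image to a sample grayscale image standing in for a medical scan; real pipelines would use clinically validated methods.

```python
from skimage import data, filters, measure

# Sample grayscale image standing in for a medical scan
image = data.camera()

# Otsu thresholding: a simple, automatic segmentation step
threshold = filters.threshold_otsu(image)
mask = image > threshold

# Label connected regions in the binary mask and report their sizes
labels = measure.label(mask)
regions = measure.regionprops(labels)
print(len(regions), "regions; largest area:", max(r.area for r in regions))
```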
2.5 Methodology
Big Data methodology proceeds in stages, from capturing data and applying a prediction model to obtaining useful results. The complete iterative process is shown in Fig. 2.2.
Figure 2.2 Big Data processing pipeline.
The main stages of Big Data Analytics are:
1. Data Acquisition and Cleaning: This consists of capturing data from different internal and external sources such as government health organizations, hospitals, and medical
institutions. This data is available in different formats for different applications. The next task involves the cleansing of the data, which makes it valid, suitable for analysis, and structured via filtering and normalization [10].
2. Data Integration, Analysis, and Interpretation: After the first stage, the data is loaded into a suitable database and integrated with the warehouse. Here the data is transformed so that it can be analyzed via different algorithms. This step takes time, as models are trained and tested on the complete dataset, and many Big Data tools and platforms are used to make the computation faster. The results of this step are interpreted to obtain final results in the form of a report, making them easy to understand and act on.
3. Querying, Reporting, and Visualization: This is the last, but not compulsory, step in the analytics process. It involves querying the model to predict results based on new values for the input parameters. Reports are generated as the result of the querying process, and results are often displayed in the form of charts, plots, graphs, etc. Visualization makes it easier to understand and infer from the results obtained.
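A minimal sketch of the cleansing and normalization described in stage 1, using pandas; the records, columns, and values are invented.

```python
import pandas as pd

# Hypothetical raw records from two sources with inconsistent formatting
raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "glucose":    ["110", "95", "95", None],
})

clean = (raw.drop_duplicates(subset="patient_id")  # filtering: remove duplicate records
            .dropna(subset=["glucose"]))           # drop records with missing values
clean["glucose"] = clean["glucose"].astype(float)  # coerce text fields to numbers

# Min-max normalization, a common step before analysis
g = clean["glucose"]
clean["glucose_norm"] = (g - g.min()) / (g.max() - g.min())
print(clean)
```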
2.6 Big Data Analytics: platforms and tools
Big Data Analytics platforms help in the meaningful extraction of medical data, which is next to impossible using traditional methods of evaluation. These platforms enable faster data processing, more in-depth insights into data, and advanced functionality for querying and manipulating databases. They reduce the complex challenges of medical data interpretation and provide a mechanism to predict outcomes for diseases, health patterns, etc. Various tools make the data easily accessible and visualize correlations in the data patterns. These technologies have the power to reveal the real potential of big data in the healthcare sector [11]. The capabilities and processing inside a big data platform are shown in an architecture diagram in Fig. 2.3.
Figure 2.3 Big Data architecture for the healthcare sector. Adapted from E. Mezghani, E. Exposito, K. Drira, M. Da Silveira, C. Pruski, A semantic big data platform for integrating heterogeneous wearable data in healthcare, J. Med. Syst. 39 (12) (2015) 185 [12].
The various platforms and their capabilities for data processing are listed in the following sections:
2.6.1 Cloud storage
Cloud storage simply means storing data on remote servers that are capable of communicating with each other. The storage providers offer a range of services that help in storing, managing,
and processing data [13]. Its benefits include faster processing with less resource utilization and lower cost. Common cloud service providers include Google Cloud Platform and AWS, and cloud storage can be easily integrated with Hadoop and Spark as well.
Advantages:
1. Better usability and accessibility.
2. Cost savings and easy recovery.
3. Customized storage and scalability.
Disadvantages:
1. Higher latency and lower bandwidth.
2. Less control; expertise required.
3. Vulnerability and privacy concerns.
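As a sketch of the storage services mentioned above, the following uses AWS S3 through boto3; the bucket and object names are placeholders, and credentials are assumed to be configured in the environment.

```python
import boto3

# Placeholder bucket/key names; AWS credentials assumed to be configured
s3 = boto3.client("s3")
s3.upload_file("records.csv", "hospital-data-bucket", "raw/records.csv")

# Retrieve the object later for processing
obj = s3.get_object(Bucket="hospital-data-bucket", Key="raw/records.csv")
print(obj["ContentLength"], "bytes stored")
```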
2.6.2 NoSQL databases
NoSQL stands for Not Only SQL and provides different data storage and retrieval mechanisms. It differs from relational databases, which rely on tables, and is thus capable of processing vast volumes of data [14]. It can easily handle structured, semistructured, and even unstructured data.
Advantages:
1. Easily scalable due to effective load distribution.
2. Good tolerance for processing Big Data.
3. Easy management without the need for an administrator.
4. Economical and flexible.
Disadvantages:
1. Less mature compared with RDBMSs (Relational Database Management Systems).
2. Few professional experts in the market.
3. Less support, being open source.
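To illustrate the schemaless flexibility described above, here is a minimal sketch using MongoDB (one example of a NoSQL document store, chosen here for illustration) via pymongo; the connection string, database, and fields are invented.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (connection string is a placeholder)
client = MongoClient("mongodb://localhost:27017")
db = client["healthcare"]

# Schemaless inserts: records need not share the same fields
db.patients.insert_one({"name": "A. Patient", "age": 54, "scans": ["MRI"]})
db.patients.insert_one({"name": "B. Patient", "allergies": ["penicillin"]})

# Query without any predeclared table structure
print(db.patients.count_documents({"age": {"$gt": 50}}))
```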
2.6.3 Hadoop
Hadoop is an open-source platform based on MapReduce technology and designed to handle big data. It consists of data clusters with massive power for data processing and is used for log processing, analytics, recommendation systems, etc. It provides object storage for unstructured data and metadata [15] and is connected to file servers that support concurrency, replication, and distribution.
Advantages:
1. High computing power for big data.
2. Flexible and fault-tolerant.
3. Low cost and scalable with little administration.
Disadvantages:
1. MapReduce is file-intensive and not efficient in every scenario.
2. Broadly recognized talent gap.
3. Lack of standardization tools and data security.
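To show what the MapReduce programming model behind Hadoop looks like, here is a pure-Python simulation of the map, shuffle, and reduce phases; it runs on one machine and is only a sketch of the model, not of Hadoop itself.

```python
from collections import defaultdict

records = ["fever cough", "cough fatigue", "fever fever"]

# Map phase: emit (key, 1) pairs; on Hadoop this runs in parallel across nodes
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group intermediate pairs by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group; on Hadoop each group may go to a different node
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'fever': 3, 'cough': 2, 'fatigue': 1}
```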
2.6.4 Hive
Hive is built on the Hadoop architecture and supports SQL. The platform provides easy data summarization and query processing to analyze massive data volumes. It is efficient for batch processing of jobs over immutable data and supports all SQL-style query operations with the capabilities of MapReduce.
Advantages:
1. Simple querying, as in SQL.
2. Provides indexing of queries for faster processing.
3. Simultaneous multiuser architecture with easy format conversion.
Disadvantages:
1. Problems with OLTP (Online Transactional Processing) workloads.
2. No delete and update operations.
3. No support for subqueries.
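As a sketch of how Hive is typically queried from Python, the following uses the third-party PyHive client; the host, port, database, and table are placeholders for a real HiveServer2 deployment, which the snippet assumes is available.

```python
from pyhive import hive  # third-party client; assumed to be installed

# Connection parameters are placeholders for a real HiveServer2 endpoint
conn = hive.Connection(host="hive-server", port=10000, database="healthcare")
cursor = conn.cursor()

# HiveQL looks like ordinary SQL; Hive compiles it into batch jobs
cursor.execute("""
    SELECT diagnosis, COUNT(*) AS n
    FROM admissions
    GROUP BY diagnosis
""")
for diagnosis, n in cursor.fetchall():
    print(diagnosis, n)
```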
2.6.5 Pig
Pig is an abstraction over MapReduce for representing big datasets as data flows. It provides the Pig Latin language, which is similar to SQL, and is a high-level language that practices a multiquery approach. Pig's data model is nested relational, and the platform performs query optimization. It can successfully process time-sensitive data loads and supports data operations, diagnostic functions, and file commands.
Advantages:
1. Provides all Hadoop features, parallelization, and fault tolerance.
2. Procedural nature that reduces development time.
3. Effective for converting an unstructured database into a structured one.
Disadvantages:
1. Less mature, as it is still in development.
2. The data schema is imposed implicitly rather than explicitly.
3. Less support available, and increased iteration between debugging and issue resolution.
2.6.6 Cassandra
Cassandra is a distributed data system designed to handle big data across multiple servers. It implements its own Cassandra Query Language. It is a NoSQL system with a masterless, peer-to-peer arrangement: each node on the network has equal access rights and can send or receive data at any point, making the system highly scalable. Being decentralized, it has no single point of failure.
Advantages:
1. Peer-to-peer architecture.
2. High availability and tunable consistency.
3. Supports replication and simplified data management.
Disadvantages:
1. Not compatible with existing applications.
2. No room for unanticipated queries, and less efficient for multiple queries.
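A minimal sketch of talking to Cassandra from Python with the DataStax driver; the contact point, keyspace, and table schema are assumptions made for illustration.

```python
from datetime import datetime
from cassandra.cluster import Cluster  # DataStax driver; assumed to be installed

# Any node can accept the request; the contact point is a placeholder
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("healthcare")  # keyspace assumed to exist

# CQL resembles SQL but is organized around partition keys
session.execute(
    "INSERT INTO vitals (patient_id, ts, heart_rate) VALUES (%s, %s, %s)",
    (42, datetime(2021, 1, 1, 10, 0), 78),
)
rows = session.execute("SELECT heart_rate FROM vitals WHERE patient_id = 42")
for row in rows:
    print(row.heart_rate)
```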
2.7 Opportunities for Big Data in healthcare
The advanced analytics provided by Big Data opens up enormous opportunities for all the stakeholders: the patient, the healthcare provider, and the payer, despite all the challenges presented by Big Data in healthcare [16]. Opportunities for big data occur in several domains, as specified in Fig. 2.4, such as early disease detection, improved quality, accessibility and structure of data, better decision-making abilities, reduction in healthcare costs, personalized medicine, detecting healthcare fraud, management of population health, a better quality of care for patients, and detection of any threats to patient health. Here, we discuss the top four domains.
Figure 2.4 Various opportunities for Big Data in healthcare.
2.7.1 Quality of treatment
With the introduction of big data in healthcare systems, we can tap into the hidden potential of historical data. Big Data provides us with the ability to anticipate the effects [17] of the
medicine and treatment provided to the patients. The overall quality of care can be improved by introducing big data analytics to determine the current patient condition rather than depending only on clinical reports, medical images, and patient history. These systems may also offer proof of benefit in the form of an explanation of their reasoning, which could help change instituted standards of care. By providing patients with technology-equipped solutions, these systems can also help them commit to personalized medicine. Another factor that improves the quality of treatment is the reduction in wastage of information held in clinical reports: all the relevant parameters play an important role in predicting outcomes, and discarding them introduces inefficiencies that big data systems can reduce. As a result of improving the quality of treatment offered, rates of patient readmission may be reduced, operational effectiveness increased, and overall staff performance improved.
2.7.2 Early disease detection
One of the essential elements of clinical healthcare is early detection of a disease state, which helps in providing appropriate treatment to patients promptly and preventing serious illness. It allows healthcare providers to monitor [18] the health-related behavior of their patients, especially when faced with age-linked illnesses or global health concerns such as cardiovascular disease.
2.7.3 Data accessibility and decision-making
Big data allows prompt capture of medical data records and their translation from an unstructured format into structured and meaningful information. Relevant insights [11] can then be obtained from this newly generated knowledge system, facilitating reuse of the data. Beyond that, it is also possible to maintain the quality of the data by reducing irrelevant information using Big Data analytics. Big Data also supports the healthcare system with more rational decision-making through the application of evidence-based medicine, which in turn leads to an improved quality of treatment and patient-centric care. If we provide the system with the latest information in a timely fashion concerning treatment procedures and new therapy solutions, it can significantly optimize the decision-making process. Substituting or assisting humans with big data analytics
enables the decision-making process to become quicker, more accurate, and more straightforward.
2.7.4 Cost reduction
Big data analytics facilitates storage- and speed-optimized algorithms [19], which lead to a substantial decrease in the cost of storage components and computing power. The effect of these savings can be seen across the entire spectrum of healthcare providers, and they come to light in the form of better patient monitoring and cost-efficient treatments.
2.8 Challenges to Big Data Analytics in healthcare
Big Data Analytics is becoming quite popular in the healthcare industry due to the enormous amounts of data produced by the healthcare sector daily. Still, there is a need for the integration of technology with all clinical and operational processes [20]. Since big data is complex and extracting meaningful insights from it requires high computational power, the healthcare sector faces significant challenges, as shown in Fig. 2.5.
Figure 2.5 Major challenges to Big Data in healthcare.
Some of the
major problems faced by the healthcare industry are described in the following sections:
2.8.1 Data acquisition and modeling
Right from the first stage of capturing data to the later stages of cleaning and processing it, this is the most challenging issue faced by the healthcare sector. The data usually comes from government health units and is often not clean, accurate, or properly formatted. The data types are not prioritized according to clinical measurements and are thus incomplete. Big data is heterogeneous, which makes it complex to find suitable data labels for successful classification. Searching for specific information in a large pool of data is a tedious task. With this, the job of data modeling becomes complex [21].
2.8.2 Data storage and transfer
Cloud storage seems to be the preferable option for storing big data, but at the same time, it needs sufficient space and high transfer speeds for upload. Storing graphical information such as X-ray, MRI, and CT scans, to be processed later by prediction systems, is still a challenging task: the images must be clearly interpretable for decisions to be made at a later stage. Though the cloud provides flexibility through hybrid infrastructure, sharing data between systems and simultaneous communication pose problems here. Most of the time, a few organizations retain more substantial control over data access, since the central system is not able to distribute the load equally.
2.8.3 Data security and risk
In the cyber age of hacking, ransomware, and data breaches, security is the top priority for most health organizations. Medical data is powerful and valuable and is vulnerable to attack. There is always an urgent need to introduce integrity and access rights, secure protocols, authenticated transmission, and firewall support. Encrypting data before communication is also a crucial measure for securing the information (a minimal encryption sketch follows). Data centers can even be compromised manually, with the intention of stealing personal data, including by staff members. The involvement of third parties is also seen as a data-breach risk.
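A minimal sketch of encrypting a record before transmission, using symmetric encryption from the Python cryptography package; the record contents are invented, and real deployments would pair this with proper key management.

```python
from cryptography.fernet import Fernet

# Symmetric key; in practice it would be managed by a key-management service
key = Fernet.generate_key()
f = Fernet(key)

# Encrypt a record before transmission or storage
token = f.encrypt(b"patient_id=42;diagnosis=hypertension")

# Only holders of the key can recover the plaintext
print(f.decrypt(token).decode())
```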
2.8.4 Querying and reporting
Querying is the process of fetching results from the database in response to particular inquiries. This process is not always robust: lack of standardization and quality leads to improper results from the large pool of data, which in turn leads to inconsistency in analytics. A report is a short and concise form of presenting results that is accessible to all; accuracy and integrity are the main aspects of report generation. A report states the conclusion of the process conducted, and bad results often lead to wrong reports. Since medical data is dynamic and changes as time passes, there needs to be a mechanism to update the datasets manually or via automation. An organization should also ensure that there is no duplication of records in clinical data [22] (a small deduplication sketch follows).
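A small sketch of the duplicate-record check suggested above, using pandas; the records are invented.

```python
import pandas as pd

records = pd.DataFrame({
    "patient_id": [101, 102, 101, 103],
    "visit_date": ["2021-03-01", "2021-03-02", "2021-03-01", "2021-03-05"],
    "diagnosis":  ["flu", "asthma", "flu", "flu"],
})

# Flag and remove exact duplicate clinical records before reporting
print(records.duplicated().sum(), "duplicate rows found")
deduplicated = records.drop_duplicates()

# A simple summary report over the cleaned records
print(deduplicated.groupby("diagnosis").size())
```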
2.8.5 Technology incorporation and miscommunication gaps
Adoption of technology in the medical sector is still a slow process. Most of the time, sufficient data is not available for systems to make decisions, and a lack of proper technical support makes big data results unavailable. Fragmented data sources across vendors, stakeholders, organizations, and health institutions lead to fragmentation and the unavailability of a central warehouse. The medical industry also lacks experts in data science and statistics who can deal logically with big data. The absence of a central database pool of patients' histories wastes doctors' time. Thus, lack of standardization and low confidence in predicted information lead to a miscommunication gap between data scientists and users.
2.9 Applications of Big Data in the healthcare industry
Big Data is vast, and so are its applications in the medical industry. Big Data Analytics can produce fruitful results in almost any sector of healthcare, provided quality data is supplied. Healthcare organizations are investing heavily to earn profits and reach new heights in advanced development and research. Some of the main practical applications are shown in Fig. 2.6. The main areas of big data application in healthcare are as follows:
Figure 2.6 Practical applications of Big Data in healthcare.
2.9.1 Advanced patient monitoring and alerts
Substantial results can be obtained from data if patients are monitored regularly. This data can be collected via different clinical machines and apparatus, and, accumulated over time, it can turn into big data. Such historical records can prove useful for predicting disease patterns in future generations [23]. We can also get regular updates on a patient's health, such as heart rate and blood pressure, via wearable gadgets. In case of any deterioration in health, the doctor can even send timely alerts to the patient (a minimal alerting sketch follows). GPS trackers can even help in getting accurate locations for the same purpose.
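A minimal, hypothetical sketch of threshold-based alerting on wearable readings; the vital names, limits, and values are invented and are not clinical guidance.

```python
# Hypothetical vital-sign thresholds; real limits come from clinical guidance
LIMITS = {"heart_rate": (50, 120), "systolic_bp": (90, 160)}

def check_reading(vital, value):
    """Return an alert message when a wearable reading leaves its safe range."""
    low, high = LIMITS[vital]
    if value < low or value > high:
        return f"ALERT: {vital}={value} outside safe range {low}-{high}"
    return None

# Simulated stream of readings from a wearable device
for vital, value in [("heart_rate", 88), ("heart_rate", 134), ("systolic_bp", 170)]:
    alert = check_reading(vital, value)
    if alert:
        print(alert)  # in practice this would notify the doctor or patient
```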
2.9.2 Management and operational efficiency
If proper management is not followed in hospitals, it can lead to emergencies. Too often, a patient dies due to mismanagement, lack of beds in a hospital, nonavailability of the right equipment, or even negligence. The history of patients administered in the hospital can help predict the count of patients at a particular time of day or year, and the administration can then make more staff members available to prevent accidents.
2.9.3 Fraud and error prevention
Many instances of wrong dosage, wrong prescription, and other human errors are seen in the medical industry. Fraud cases are found mostly in the health insurance sector. Big data can help detect such fraud and be useful to insurance companies: automated systems can be developed from the data available on past fraudulent cases and can prevent further losses in the insurance sector.
2.9.4 Enhanced patient engagement
Over time, the use of wearables has increased among people. Such devices can be developed to monitor a person's health statistics. They can even simplify the physician's job and reduce emergency cases. People can also self-monitor and set up alerts for timely medicine or routine treatment.
2.9.5 Smart healthcare intelligence
Certain companies are now developing smart healthcare systems to help determine the future health of the patient. The machines can make decisions on their own based on the current health situation and the medical data available. Prior work based on human genetic data can help discover new drugs and also predict diseases such as cancer in their early stages. More common problems can be treated even more simply, at lower cost.
2.10 Future of Big Data in healthcare
Big data is a boon not only for healthcare but also for every other industry. It can prove useful not only from an economic point of view but also in saving the lives of millions of people. Doctors work hard on the diagnosis of complex diseases and later on their treatment. Data warehousing can ease the process of finding patterns in historical data and lead to some vital results. Predictive analysis [24] can help detect diseases in early stages based on symptoms and suggest precautionary measures too. The key areas that require future development in the healthcare sector are shown in Fig. 2.7.
Figure 2.7 Key areas of future development in healthcare.
Based on the current situation of healthcare institutions, such analysis can help distribute funds appropriately, shape business models to save costs, and direct investment only to the required areas. Datasets can be used to predict staff allocation based on patient occupancy at a particular time of the year, and more focus can be placed on new research and on the availability of beds, medicines, equipment, and so on. Big Data Analytics can be applied to data on all the citizens of a country to identify the age groups most vulnerable to a disease. Historical data can also reveal the parts of the country that registered the most deaths from a particular disease, so that more medical facilities can be provided in those areas; if such measures are taken early enough, certain conditions can even be eradicated. Thus big data can help build a healthy and happy nation. For better results, however, there must be checks that the data being collected are authentic and complete, stronger security measures to prevent data theft, and technical support to help record medical data in the right manner.
References
[1] S. John Walker, Big data: a revolution that will transform how we live, work, and think, Int. J. Advert. 33 (1) (2014) 181–183.
[2] P. Zikopoulos, C. Eaton, et al., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Osborne Media, 2011.
[3] D. Maltby, Big data analytics, in: 74th Annual Meeting of the Association for Information Science and Technology (ASIST), 2011, pp. 1–6.
[4] X. Wu, X. Zhu, G.-Q. Wu, W. Ding, Data mining with big data, IEEE Trans. Knowl. Data Eng. 26 (1) (2014) 97–107.
[5] A. Gandomi, M. Haider, Beyond the hype: big data concepts, methods, and analytics, Int. J. Inf. Manag. 35 (2) (2015) 137–144.
[6] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A. Byers, Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute, 2011.
[7] A.G. Picciano, The evolution of big data and learning analytics in American higher education, J. Asynchronous Learn. Netw. 16 (3) (2012) 9–20.
[8] A. McAfee, E. Brynjolfsson, T.H. Davenport, D. Patil, D. Barton, Big data: the management revolution, Harvard Bus. Rev. 90 (10) (2012) 60–68.
[9] M. Chen, S. Mao, Y. Liu, Big data: a survey, Mob. Netw. Appl. 19 (2) (2014) 171–209.
[10] M.M.S. Eastaff, et al., Application of big data analytics in health care, Int. J. Eng. Res. Appl. 6 (12) (2016) 01–04.
[11] W. Raghupathi, V. Raghupathi, Big data analytics in healthcare: promise and potential, Health Inf. Sci. Syst. 2 (1) (2014) 3.
[12] E. Mezghani, E. Exposito, K. Drira, M. Da Silveira, C. Pruski, A semantic big data platform for integrating heterogeneous wearable data in healthcare, J. Med. Syst. 39 (12) (2015) 185.
[13] L. Benhlima, et al., Big data management for healthcare systems: architecture, requirements, and implementation, Adv. Bioinform. (2018).
[14] S. Senthilkumar, B.K. Rai, A.A. Meshram, A. Gunasekaran, S. Chandrakumarmangalam, Big data in healthcare management: a review of literature, Am. J. Theor. Appl. Bus. 4 (2) (2018) 57–69.
[15] D. Sekar, Big Data Analytics for Healthcare Informatics, 2018. https://www.pinkelephantasia.com/big-data-analytics-for-healthcare-informatics/.
[16] R. Nambiar, R. Bhardwaj, A. Sethi, R. Vargheese, A look at challenges and opportunities of big data analytics in healthcare, in: Big Data, 2013 IEEE International Conference on, IEEE, 2013, pp. 17–22.
[17] D.W. Bates, S. Saria, L. Ohno-Machado, A. Shah, G. Escobar, Big data in health care: using analytics to identify and manage high-risk and high-cost patients, Health Aff. 33 (7) (2014) 1123–1131.
[18] J. Andreu-Perez, C.C. Poon, R.D. Merrifield, S.T. Wong, G.-Z. Yang, Big data for health, IEEE J. Biomed. Health Inf. 19 (4) (2015) 1193–1208.
[19] A. Belle, R. Thiagarajan, S. Soroushmehr, F. Navidi, D.A. Beard, K. Najarian, Big data analytics in healthcare, BioMed. Res. Int. 2015 (2015).
[20] D. Thara, B. Premasudha, V.R. Ram, R. Suma, Impact of big data in healthcare: a survey, in: Contemporary Computing and Informatics (IC3I), 2016 2nd International Conference on, IEEE, 2016, pp. 729–735.
[21] B. Ristevski, M. Chen, Big data analytics in medicine and healthcare, J. Integr. Bioinform. 15 (2015).
[22] M. Viceconti, P.J. Hunter, R.D. Hose, Big data, big knowledge: big data for personalized healthcare, IEEE J. Biomed. Health Inform. 19 (4) (2015) 1209–1215.
[23] P.-Y. Wu, C.-W. Cheng, C.D. Kaddi, J. Venugopalan, R. Hoffman, M.D. Wang, Omic and electronic health record big data analytics for precision medicine, IEEE Trans. Biomed. Eng. 64 (2) (2017) 263–273.
[24] J. Lillo-Castellano, I. Mora-Jimenez, R. Santiago-Mozos, F. Chavarria-Asso, A. Cano-González, A. García-Alberola, et al., Symmetrical compression distance for arrhythmia discrimination in cloud-based big-data services, IEEE J. Biomed. Health Inform. 19 (4) (2015) 1253–1263.
3 Application of tools and techniques of Big Data analytics for healthcare system
Samarth Chugh, Shubham Kumaram and Deepak Kumar Sharma Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
Abstract The concept of data analysis is essential in making decisions that may affect the way a sector of society or an industry functions. Healthcare data availability has grown manifold over the years, and an immense amount of knowledge can be extracted from it and used effectively. Furthermore, as with other specialized areas, healthcare requires the exploitation of prior domain knowledge to assist and enhance decision making. In the past, various data analysis tools and methods have been adopted to improve the services provided in a plethora of areas, the improvements lying in the effectiveness of the predictions and inferences drawn. This chapter is organized in the following way: it begins with an introduction to the effect and extent of the changes that data analysis is bringing to healthcare services, followed by a discussion of the importance of and motivation for pursuing this work. Next, the application of data analysis is described and various techniques, including machine learning, are examined. The scope of this work is not limited to exploring and refining past work; it also speculates on future possibilities and provides readers with substance to further the work that has been performed. Keywords: Scope; feature extraction; imputation; healthcare; past shortcomings
3.1 Introduction
Big data [1–3] is a term used for datasets that comprise numerous variables, have structures more complex than regular tables and records, and require techniques beyond simple computation. The three components that help define big data are variety, velocity, and volume.
• "Variety" refers to the nonuniform format of the data, since it can be complex and unstructured.
• "Velocity" means that the data are not available at a uniform rate. The data in question are based on real-world patients, and hence can only be obtained in spurts rather than as a continuous stream.
• Finally, "Volume" indicates the vast expanse of data that is present and needs to be handled. For instance, the human genome code of a single person can take up to 4 gigabytes of storage!
With the help of this data, industries and sectors of society can shape the way they make decisions and have a significant impact on the targeted or affected populace. The rapid growth of awareness of big data has had a remarkable influence on the healthcare sector over the years. With the increasing availability of such data, the need to merge healthcare and computer science has given rise to many new jobs in bioinformatics [4,5]. These roles involve analyzing and visualizing data with healthcare expertise, which provides the right mindset for capitalizing on the implementation and execution of decisions. There is a plethora of sources generating this data: the ever-increasing number of patients across the world and digitized medical equipment with automatic data logging. The data are not limited to vital information about the patient; they also include nonvital and seemingly unimportant factors such as healthcare and socioeconomic elements [6]. The analysis of the background a patient hails from, and decisions ranging from the diet advised to the medicines prescribed, lie at the heart of the application goals of big data analytics in healthcare. These can result in improved experiences and quality of care for patients over a long period of time. The apex of big data analytics lies in enabling the prediction of epidemics and the reduction of preventable deaths. To achieve this, a variety of statistical techniques involving machine learning and data mining is required. There are multifold improvements possible for healthcare organizations, and benefits which would result in saving
billions or possibly even trillions of USD (United States Dollars) if successfully implemented all over the world. The data needed to perform such analysis are already present; with some regulations in place, proper formatting of the data could even help predict and mitigate future diseases, or help develop cures for them faster. This chapter aims to bring out the importance of such analysis, shed some light on the steps that have already been taken, and pave the way for larger future strides. While the potential of this field is almost limitless, the path to its fulfillment is fraught with challenges of various kinds: economic, social, organizational, and policy barriers among them. Organizations may resist the introduction of digitized technologies that would compel them to modify their plans of action, and the economic angle encompasses the costs involved for stakeholders in the healthcare industry. However, the two most prominent challenges are handling the technology itself and addressing security and privacy concerns [7,8]. The sections hereafter contain the following information: the need for big data analytics in the healthcare system is discussed along with the progress that has already been made; next, a brief insight into the various techniques for carrying out the analytics is given. Finally, the shortcomings and drawbacks faced in the past are considered and the extent of the possibilities in the field is explored.
3.2 Need and past work
3.2.1 Importance and motivation
A lot of data is being generated worldwide for healthcare research and clinical work, in the form of Electronic Health Records (EHR) [9,10], MRI data, research data, and so on. It was reported that stored healthcare data exceeded 150 exabytes in 2011 [11], and the volume is still growing at an exponential rate. If this data is to be of any use whatsoever, data analytic techniques need to be applied to glean useful information. Healthcare data, besides being huge in size, also satisfies the 5Vs: volume, velocity, variety, veracity, and value [11]. The volume property is self-explanatory, considering the huge amounts of data generated every day from machines, gene data, clinical trials, and so on. Machines such as MRI scanners generate a lot of real-time data in comparatively short time frames, hence satisfying the velocity property. Healthcare
data also has a lot of variety, as the collected data can be structured as well as unstructured and come from various sources such as manual data entry, machine generation, historical records, or even social media. Such data has varying levels of veracity, or trustworthiness: machine-generated data can generally be assumed to be fairly accurate, whereas data gathered from patient surveys needs to be taken with a grain of salt. Another factor is value; data collected from different sources may not be of the same value. We can see that healthcare data displays the properties of big data and is suitable for the application of big data analytical methods. A study [12] was conducted to examine the potential benefits big data analytics can provide to the field of healthcare. It found that big data analytics can help improve the following capabilities of healthcare organizations:
1. Analytical capabilities for patterns of care;
2. Unstructured data analytical capability;
3. Decision support capability;
4. Predictive capability;
5. Traceability.
Apart from improving research and care capabilities, such analytics can provide the following benefits for healthcare organizations:
1. IT infrastructure benefits;
2. Operational benefits;
3. Organizational benefits;
4. Managerial benefits;
5. Strategic benefits.
It is estimated that data mining could save up to $450 billion per year in the United States alone [11]. Thus, big data analytics would not only help advance the state of healthcare research and patient care but also bring about substantial financial benefits.
3.2.2 Background
The term big data was introduced by Cox and Ellsworth in Ref. [13]. It was later expanded in Ref. [14] to identify such data using the principle of the 5Vs described in the previous section. Today, big data analytics is used in various fields of healthcare, such as bioinformatics, image informatics, clinical informatics, public health informatics, and even translational bioinformatics [15]. The field of bioinformatics deals with molecular-level data. It is used in domains such as genetic studies, medicinal research,
and so on. Neuroinformatics is generally concerned with the brain and uses tissue-level data, such as those collected using histology and brain scans. Patient-level data falls under the domain of clinical informatics and is the most common form of generated data. It is concerned with the documentation and analysis of data generated through patient interviews, MRI scans, pathological tests, and the like. Another field where big data really shines is public health informatics, where efforts are made to collect data from the internet, usually from social media, search queries, and medical websites, to study epidemiological patterns and public health issues.
3.3 Methods of application
3.3.1 Feature extraction
In the vast amount of data resulting from the boost in EHR, there is a need to categorize and classify the data into different features based on a plethora of vitals, such as heart rate and urea and creatinine levels, and qualitative data such as text-based documents and demographics. This growth has largely been the result of funding received from the Health Information Technology for Economic and Clinical Health Act of 2009 [16]. The challenge now is to summarize the large data into a set of features, and the process of doing this is called feature extraction. There are rules and signs that define a particular feature. After the different features have been realized, they can be used as inputs for the machine learning algorithms that perform the predictive analysis. The most common types of features, used for distinguishing various medical activities [17], are given below:
• Time-domain features: mean, variance, root mean square (RMS) and its error;
• Frequency-domain features: spectral energy, spectral entropy;
• Time-frequency domain features: wavelet coefficients;
• Heuristic features: signal magnitude area, signal vector magnitude, interaxis correlation;
• Domain-specific features: time-domain gait detection.
Going into the details of each and every feature is beyond the scope of this text. However, as may be evident, there are a lot of features, and it is imperative to employ feature selection methods. These methods help choose the features that have the most significant and dominant effect on the overall predictive values.
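To make the feature types above concrete, the following is a minimal sketch (not from the chapter) that computes a few time- and frequency-domain features from a sampled signal with NumPy; the synthetic signal and the 100 Hz sampling rate are assumptions for illustration.

```python
# Minimal sketch: a few of the feature types listed above, computed
# from a synthetic physiological signal. Real pipelines would read
# sensor data instead of simulating it.
import numpy as np

rng = np.random.default_rng(42)
fs = 100                                   # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.standard_normal(t.size)

# Time-domain features
mean = signal.mean()
variance = signal.var()
rms = np.sqrt(np.mean(signal ** 2))        # root mean square

# Frequency-domain features from the power spectrum
power = np.abs(np.fft.rfft(signal)) ** 2
spectral_energy = power.sum()
p = power / power.sum()                    # normalized spectrum
spectral_entropy = -np.sum(p * np.log2(p + 1e-12))

features = [mean, variance, rms, spectral_energy, spectral_entropy]
print(features)  # feature vector ready to feed a classifier
```

Such a vector, computed per window of the signal, is exactly the kind of input the feature selection methods below would then filter.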
• Support Vector Machine (SVM)-based feature selection: while SVMs are extremely powerful classification methods, the presence of irrelevant and insignificant features can hinder them. To improve the performance of the system, the features most relevant according to the SVM can be selected. This is done on an empirical basis rather than by a fixed filtration method.
• K-means clustering: clustering is defined in Ref. [18] as "a method to uncover structure in a set of samples by grouping them according to a distance metric." Samples exhibiting similar features are grouped together, and in this way outliers can be easily removed.
Feature extraction is applied in different modes for different cases, owing to the inherent variations among the abundant clinical routines. Some routines require sensors that measure at a particular moment, as when recording body temperature and blood pressure, while others provide continuous signals, as with respiration and ECG. We discuss here the feature extraction involved in ECG, since it is one of the most widely researched topics due to its importance [19]. For detecting abnormalities of the heart, heart rate is one of the primary features extracted from an ECG: immediate relief measures can be advised to bring the heart rate back to a normal level so the heart can continue its standard operation. The example discussed here has three main parts: ECG preprocessing, wavelet transformation, and ECG feature extraction. The preprocessing step prepares the data before further steps are performed; it involves filters such as the notch filter or moving-average filter, which help eliminate movement artifacts. The wavelet transformation step applies wavelet algorithms to decompose the signal; this is performed under expert supervision, or at least requires it when the models are prepared. Lastly, the extraction step applies algorithms to extract the P–R interval, Q–T interval, S–T segment, QRS area, and QRS energy. A low-resource algorithm enabling faster computation uses a threshold value based on the entirety of the received ECG input signal. Another advantage of this algorithm is its reusability, for instance in the steps for scanning the ECG or calculating the threshold value.
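A rough sketch of that threshold-based idea might look as follows; the simulated ECG, sampling rate, and threshold factor are all illustrative assumptions, and a real pipeline would first apply the preprocessing filters mentioned above.

```python
# Hedged sketch: derive heart rate from an ECG by detecting R peaks
# above a threshold computed from the whole input signal.
import numpy as np

fs = 250                                  # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
# Crude synthetic ECG: a narrow spike every 0.8 s (75 beats per minute)
ecg = np.zeros_like(t)
ecg[(t % 0.8) < (1 / fs)] = 1.0
ecg += 0.05 * np.random.default_rng(0).standard_normal(t.size)

threshold = 0.6 * ecg.max()               # threshold from the entire signal
above = ecg > threshold
# Rising edges of the thresholded signal mark candidate R peaks
r_peaks = np.flatnonzero(above[1:] & ~above[:-1]) + 1

rr_intervals = np.diff(r_peaks) / fs      # R-R intervals in seconds
heart_rate = 60.0 / rr_intervals.mean()   # beats per minute
print(f"estimated heart rate: {heart_rate:.1f} bpm")
```

Because the threshold is computed once over the whole signal, the scan itself is a single cheap pass, which is what makes this style of algorithm attractive on low-resource devices.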
3.3.2 Imputation
While we defined big data as an amalgamation of a large variety of data combined with a great volume, it so happens that some
data might be missing in the datasets that are created [20]. Such gaps are often unavoidable and can be categorized into the following types:
• Missing completely at random: there is no perceivable pattern to the missing values. For instance, automated sphygmomanometer breakdowns can result in missing blood pressure readings.
• Missing at random: differences between missing and nonmissing data can be explained by differences in the situations and the observed data.
• Missing not at random: systematic differences between missing and observed values remain even after accounting for the observed data.
Missing data can seriously undermine the results of research and predictive analysis. Researchers may choose to present complete data by removing the records of individuals with missing values entirely, but the cumulative effect may be the exclusion of an entire category of individuals! To overcome this problem, various statistical methods have been developed to deal with absent data, ranging from imputing missing values based on means, to flagging them with a missing-value indicator, to carrying forward the last recorded value. If only a single such method is used, it fails to account for the uncertainty about the values used as replacements. Multiple imputation shines as a solution here: the unreliability is acknowledged by creating multiple plausible datasets whose results are combined into an average. The two stages involved in this method are as follows:
1. Create multiple copies of the data with different imputed values, based on a predictive distribution along the lines of the Bayesian approach. One can only hope that the assumptions emulate the variance of the true data.
2. Fit the statistical methods required for the analysis to each of the datasets obtained in the stage above. This yields multiple predictions, which can then be averaged, and an estimation error can also be calculated.
While this may seem ideal, there is still some scope for error, arising from the following:
• Omission of the outcome variable from the imputation procedure, on the assumption that it is to be calculated from the predictors, which is not always the case.
• Nonnormal distribution of data can introduce bias; highly skewed data can hamper predictions intended for normally distributed data.
• The credibility of the assumption that a data category is missing at random; it is at times difficult to detect systematic differences merely by inspecting the missing and observed values.
• The computational requirements of multiple imputation are very high, and many implementations introduce approximations. The degree of missingness affects the running time of the algorithms involved, and their design may diverge from their practical application.
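As an illustration of the two-stage procedure, the sketch below uses scikit-learn's IterativeImputer with posterior sampling to create several plausible completed datasets and pool a simple estimate; the vitals dataset is synthetic, and the choice of five imputations is an arbitrary assumption.

```python
# A minimal sketch of multiple imputation on a small synthetic dataset
# of vitals with values missing completely at random.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
# Hypothetical vitals: [heart_rate, systolic_bp, temperature]
data = rng.normal([75, 120, 36.8], [8, 12, 0.4], size=(100, 3))
mask = rng.random(data.shape) < 0.15      # ~15% missing completely at random
data[mask] = np.nan

# Stage 1: create several plausible completed datasets by sampling
# from the predictive distribution (sample_posterior=True).
completed = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(data)
    for s in range(5)
]

# Stage 2: fit the analysis to each completed dataset and pool the
# results; here the "analysis" is simply the column means.
estimates = np.array([c.mean(axis=0) for c in completed])
print("pooled estimate:", estimates.mean(axis=0))
print("between-imputation variance:", estimates.var(axis=0))
```

The between-imputation variance is exactly the uncertainty that a single-imputation method would silently discard.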
3.4 Result domains
This section provides an overview of how Big Data analytical techniques can help researchers in various healthcare domains. Recent research efforts are presented as case studies that illustrate the use of data analytics in the respective domains.
3.4.1 Bioinformatics
As mentioned earlier, bioinformatics deals with molecular-level data, such as genetic research. The human genome consists of a large number of genes, which can affect an individual's susceptibility to diseases such as cancer. The total number of genes that need to be studied to predict such diseases can be drastically reduced using feature selection techniques such as gene probing. Salazar et al. created a gene expression classifier to predict relapse rates for colorectal cancer (CRC) [21]. Their study group comprised 188 patients for training and 206 for validation. The researchers identified 33,834 genes that varied among the patients in the training group, amounting to 33,834 × 188 ≈ 6.3 million data points. They used gene probe selection to isolate genes highly correlated with 5-year distant metastasis free survival (DMFS) using leave-one-out cross-validation, and finally arrived at a set of 18 gene probes for study. For classification, a Nearest Centroid-Based Classifier (NCBC) called ColoPrint was used, classifying each patient as either high risk or low risk for relapse. ColoPrint classified the validation set with a hazard ratio of 2.69, a confidence interval of 95%, and a P-value of .003. Here, the hazard ratio is the ratio of relapses between predicted-relapse patients and predicted-relapse-free survivors. Another significant study was done by Haferlach et al., who designed a gene expression profiling classifier for myeloid and lymphoid leukemia [22]. The patients were to be classified into one of 18 different classes of leukemia. A study group of 3334 patients was used, with two-thirds used for training and the rest
for validation. 54,360 gene probe samples were taken from each patient. The DQN (Difference of Quantile Normalized Values) technique, a pairwise classification technique, was used for classification (for more details, see Ref. [23]). Since there were 18 classes, 153 class pairs were created, with SVMs used as binary classifiers in each pair to finally place each patient in a single class. The model was first tested on the training set using a 30-fold cross-validation strategy and was found to have a mean specificity of 99.7% and a mean sensitivity of 92.2%. On the testing set, the model achieved a median specificity of 99.8% and a median sensitivity of 95.6%. In cases of discrepancy, the methods proposed in this work achieved better results than traditional diagnostic methods in about 57% of cases. The research by Haferlach et al. showed that microarray gene expression data can reliably be used for the diagnosis of leukemia. However, the results could have been even better had feature selection methods like those in the former work been used. Further work using Big Data in bioinformatics should be undertaken to develop better models for diagnosing diseases based on gene expression data.
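The pairwise strategy can be illustrated with scikit-learn, which builds the 18 × 17 / 2 = 153 binary SVMs internally when the one-vs-one decision function is requested; the gene-probe matrix below is a random stand-in, not real expression data.

```python
# Sketch of pairwise (one-vs-one) SVM classification over 18 classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_patients, n_probes, n_classes = 360, 500, 18
X = rng.standard_normal((n_patients, n_probes))    # gene probe intensities
y = rng.integers(0, n_classes, size=n_patients)    # leukemia subclass labels

clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print(clf.decision_function(X[:1]).shape)          # (1, 153) pairwise scores
```

Each of the 153 scores is the margin of one class-pair SVM; the final class is the one that wins the most pairwise votes.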
3.4.2 Neuroinformatics
A major project underlining the utility of Big Data in modern neuroinformatics is the Human Connectome Project (HCP), led by researchers at the WU-Minn-Oxford consortium and the MGH/Harvard-UCLA consortium, which aims to create a comprehensive connectivity map (connectome) of the human brain using MRI data. While the project is not yet complete as of 2019, the methods developed and the data collected have been made freely available and can be used by researchers around the world for neurological studies. The connectome is expected to provide valuable insights into the links between neural pathways and various cognitive and behavioral diseases. Studies using HCP methods are already under way to create connectomes of patients with Alzheimer's disease, epilepsy, anxiety and depression, and psychosis, and to compare them with connectomes of healthy adults to better understand the underlying causes of such ailments.
3.4.3 Clinical informatics
Big Data analytical methods can be used to answer clinical level questions with the help of patient data, generated using both machines such as MRI and other clinical features.
3.4.4 MRI data for prediction
Yoshida et al. have proposed a novel method [24] that combines patients' clinical features with MRI data to search for correlations between different areas of the brain and diseases. Their proposed method, known as Radial Basis Function-sparse Partial Least Squares (RBF-sPLS), can perform feature selection and dimensionality reduction simultaneously. They applied the method to a dataset of 102 chronic kidney disease patients, each with 73 clinical features and approximately 2.1 million voxels from MRI scans. They found a strong correlation between chronic kidney disease and the bilateral temporal lobe area of the brain. This area was also found to be related to aging and arterial stiffness, whereas the occipital lobe appeared to be related to anemia. While the results are only preliminary, they showcase a novel use of Big Data analytical methods to make predictions about different diseases using MRI data.
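RBF-sPLS itself is not available in common libraries such as scikit-learn; as a rough, non-sparse stand-in from the same family, the sketch below fits plain partial least squares to relate high-dimensional stand-in "imaging" features to clinical variables. All names and data here are illustrative placeholders, not the authors' implementation.

```python
# Sketch: plain PLS as a non-sparse relative of RBF-sPLS.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_patients = 102
X = rng.standard_normal((n_patients, 2000))   # stand-in for voxel features
Y = rng.standard_normal((n_patients, 5))      # stand-in for clinical features

pls = PLSRegression(n_components=2).fit(X, Y)
# Loadings with large magnitude hint at which voxels co-vary with the
# clinical variables; a sparse PLS would drive most loadings to zero,
# performing the feature selection the text describes.
print(np.abs(pls.x_loadings_).max(axis=0))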
3.4.5 ICU readmission and mortality rates
Big Data analysis has also been used by multiple researchers to predict ICU readmission and mortality rates, that is, how likely it is that a patient will die or have to be readmitted to the ICU after being discharged. The results of such work can provide great insight into the factors leading to readmission or death, and into deciding whether someone's ICU stay should be extended. The work in Ref. [25] showed that readmission and mortality rates drop drastically between day one and day seven after discharge. This suggests that keeping patients a little longer can greatly improve survival rates; but since beds are limited, more specific results about patient survivability are needed. The data can also be used to determine whether a particularly harsh treatment is worthwhile for a patient whose chances of survival are low, or whether palliative care is more appropriate.
3.4.6 Analyzing real-time data streams for diagnosis and prognosis
Big Data can also be utilized to analyze real-time data streams generated by various medical equipment in order to perform real-time diagnostic and prognostic predictions. IBM developed one such system [26], which was made up of three parts:
1. physiological stream processing;
2. offline analysis; and
3. online analysis.
A correlation technique is used in physiological stream processing to correlate sensor data and fill in missing values. During offline analysis, domain knowledge is used to train a learning model with a method known as Locally Supervised Metric Learning (LSML). Finally, during online analysis, the system provides a prognosis for a new patient by finding similar data in the database using a regression method. Zhang et al. [27] proposed a novel method for performing prognosis and diagnosis on data streams that requires less computing power than the IBM method. They propose the Very Fast Decision Tree (VFDT). VFDT can offer true real-time diagnosis and prognosis, whereas LSML could not provide real-time results. Moreover, Zhang's method was also found to provide better accuracy.
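A minimal sketch of stream-based prediction with a Hoeffding tree, the model family behind VFDT, is shown below; it assumes the third-party river library (pip install river), and the vitals stream and toy labeling rule are invented for illustration.

```python
# Sketch: incremental (streaming) classification with a Hoeffding tree.
import random
from river import tree

model = tree.HoeffdingTreeClassifier()
rng = random.Random(0)

correct = 0
for _ in range(10_000):
    x = {"heart_rate": rng.gauss(80, 15), "spo2": rng.gauss(96, 2)}
    # Toy label: flag tachycardia combined with low oxygen saturation
    y = x["heart_rate"] > 100 and x["spo2"] < 94
    if model.predict_one(x) == y:      # predict first (prequential eval)...
        correct += 1
    model.learn_one(x, y)              # ...then update the tree incrementally

print(f"prequential accuracy: {correct / 10_000:.3f}")
```

Unlike a batch learner, the tree never revisits old samples, which is what keeps the memory and compute footprint small enough for real-time streams.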
3.4.7 Public health informatics
Public health informatics is the most recent addition to the health informatics field, brought about by the explosion in the number of internet users. Work in this subfield of medical research tries to glean useful medical data from social media posts, search queries, and forums such as bulletin board systems. Applying data mining techniques to such data can help in the early identification of epidemics and the affected areas, knowledge that can help medical professionals and hospitals stop an epidemic before it has had time to spread. In the United States, the Centers for Disease Control and Prevention (CDC) publishes data about the spread of influenza-like illnesses (ILI) every year, but with a latency of about a week or two.
3.4.8 Search query data
Ginsberg et al. [28] devised a method to analyze Google search queries that can predict an ILI epidemic before the CDC can report it. They used historical logs over a 5-year period and trained a linear model of the relationship between ILI-related search queries and ILI-related physician visits. The model achieved a mean correlation of 0.90 across all CDC regions on the training data set, and a correlation of 0.97 against the CDC-reported percentages during validation. Moreover, the model could predict the visits a week or two before the CDC reports, hence proving that such an approach
is indeed feasible. Yuan et al. [29] did similar work using Baidu and data from the Chinese Ministry of Health, improving upon the work of Ginsberg et al.
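The core idea, regressing reported ILI physician visits on the weekly fraction of ILI-related queries, can be sketched in a few lines; both series below are synthetic, and the published model actually works on log-odds rather than the raw percentages used here.

```python
# Sketch: linear model from search-query fraction to ILI visit percentage.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weeks = 260                                   # five years of weekly data
query_fraction = np.clip(rng.normal(0.01, 0.004, weeks), 1e-4, None)
ili_visit_pct = 150 * query_fraction + rng.normal(0, 0.2, weeks)

model = LinearRegression().fit(query_fraction.reshape(-1, 1), ili_visit_pct)
corr = np.corrcoef(model.predict(query_fraction.reshape(-1, 1)),
                   ili_visit_pct)[0, 1]
print(f"correlation with reported ILI visits: {corr:.2f}")
```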
3.4.9 Social media analytics
Not only search queries but also public social media such as Twitter can be used for this purpose. One such work is Ref. [30], which used Twitter data from the United States together with CDC reports about ILI epidemics to identify regions affected by the H1N1 epidemic. Tweets were monitored for a set of words related to the disease, and weekly statistics were used to estimate the epidemic status using a generalization of the SVM to regression, called Support Vector Regression. The model produced good results, with an error of 0.28% and a standard deviation of 0.23%. Another system using Twitter data for ILI epidemics was proposed by Achrekar et al. as the SNEFT (Social Network Enabled Flu Tracking) system [31]. Before making any predictions, the system first uses a text classification method to decide whether a tweet is about flu, reducing the noise of the medium. Once a proper set of tweets is found, the system uses logistic autoregression with exogenous inputs (ARX) for prediction. The results achieved by the above experiments show that social media and the internet can serve as great indicators for detecting the spread of epidemics. Even though social media and search data provide a lot of volume, velocity, and variety, they are lacking in veracity and value: there is simply no way of knowing whether a poster is actually sick, and owing to imperfect classification, many of the posts used as input might not even be about flu. The data collected from social networks are also often noisy, and hence not of much value, and the privacy of patients' medical data remains a concern that needs to be addressed. Hence, even though such methods have shown a lot of promise, the field is not yet mature, and much work remains to address these issues before usable medical data can be obtained.
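A hedged sketch of this kind of pipeline, counting flu-related keywords in a week's tweets and regressing the ILI level on that rate with Support Vector Regression, is given below; the keyword set, tweets, and ILI values are all invented for illustration.

```python
# Sketch: keyword-rate feature plus SVR, loosely following Ref. [30].
import numpy as np
from sklearn.svm import SVR

KEYWORDS = {"flu", "fever", "cough", "influenza"}

def weekly_keyword_rate(tweets):
    """Fraction of tweets mentioning at least one flu-related keyword."""
    hits = sum(any(w in t.lower().split() for w in KEYWORDS) for t in tweets)
    return hits / max(len(tweets), 1)

rng = np.random.default_rng(1)
# Synthetic training data: one keyword rate and one ILI percentage per week
rates = np.clip(rng.normal(0.02, 0.01, size=(52, 1)), 0, None)
ili = 40 * rates[:, 0] + rng.normal(0, 0.1, size=52)

model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(rates[:40], ili[:40])
print("held-out MAE:", np.abs(model.predict(rates[40:]) - ili[40:]).mean())
```

The noise terms in this toy data stand in for exactly the veracity problems the text describes: many matching tweets will not be from sick users at all.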
3.5 Discussion
This chapter showcased the importance of Big Data analytical methods in the field of healthcare. As more and more hospitals and care centers around the world get online, an exponential increase in the amount of medical data is to be expected. We
have seen the capabilities and benefits that such an approach brings to the table, and why it makes scientific as well as financial sense to incorporate Big Data in healthcare. The methods presented in this chapter should serve as a basis for further research into devising better models for the many unanswered problems in the subfields of medical research.
3.5.1 Past shortcomings
The first and foremost challenge in the application of Big Data analysis is handling the enormous amount of data that has become available with the acceptance of EHR. The next challenge is shaping policies for public health and care delivery while trying to analyze genomic-scale data in a timely manner; in essence, turning the idealistic expectations for big data into clinical-level practice remains the biggest challenge. To better understand another challenge, that of integrating datasets, consider the example of the VHA's database VISTA. It is a set of interlinked systems that can include different types of data, which restricts the queries that can be made; this is a clear indication of the lack of standardized protocols across laboratories. Security and privacy concerns have also been a huge hurdle. The anonymity of individuals whose data could potentially be compromised, and of individuals who may not wish to share their personal information, has to be catered for; since such delicate information is at play, the security of all individuals is at risk once their records are digitized. All things considered, with adequate care, the correct usage of big data analytics in healthcare could save more than $300 billion every year in the United States alone, according to an industry report by McKinsey and Company [32].
3.6 Conclusion
Big data analytics has already begun shaping the way healthcare is managed and is bound to have a manifold greater reach in the coming years [33]. The expected advantages range from improvements in the standard of clinical trials to minimized expenditures for stakeholders as well as patients. Additionally, it may become possible to successfully predict disease epidemics and plan to limit them. The primary problems faced in healthcare today stem from patients suffering from chronic diseases. It follows that our focus should be first on preventive care and then on
overall wellness. A direct requirement is to promote personalization and to produce medicines not in bulk but for targeted populations. Big data helps with profiling individuals through the genomic and proteomic data of similar patients and recording their responses to the prescribed medications. The discussions in the results section have shown how such methods are already being used in the real world and are producing great results. Along with machine learning [34], they should serve as an inspiration for future work in the field and advance the available knowledge about diseases, epidemics, and general public health.
References
[1] M. Devgan, D.K. Sharma, Large-scale MMBD management and retrieval, in: S. Tanwar, S. Tyagi, N. Kumar (Eds.), Multimedia Big Data Computing for IoT Applications, Springer, Singapore, 2019, pp. 247–267.
[2] M. Devgan, D.K. Sharma, MMBD sharing on data analytics platform, in: S. Tanwar, S. Tyagi, N. Kumar (Eds.), Multimedia Big Data Computing for IoT Applications, Springer, Singapore, 2019, pp. 343–366.
[3] Y. Bhatt, C. Bhatt, Internet of Things in healthcare, in: C. Bhatt, N. Dey, A. Ashour (Eds.), Internet of Things and Big Data Technologies for Next Generation Healthcare, Studies in Big Data, vol. 23, Springer, Cham, 2017.
[4] K.K. Bhardwaj, S. Banyal, D.K. Sharma, Artificial intelligence based diagnostics, therapeutics and applications in biomedical engineering and bioinformatics, in: Internet of Things in Biomedical Engineering, Academic Press, Elsevier, 2019, pp. 161–187.
[5] S. Bagga, S. Gupta, D.K. Sharma, Computer-assisted anthropology, in: Internet of Things in Biomedical Engineering, Academic Press, Elsevier, 2019, pp. 21–47.
[6] T. Sethi, A. Mittal, S. Maheshwari, S. Chugh, Learning to address health inequality in the United States with a Bayesian Decision Network, in: Proceedings of the AAAI Conference on Artificial Intelligence 33, 2019, pp. 710–717. Available from: https://doi.org/10.1609/aaai.v33i01.3301710.
[7] A. Kankanhalli, J. Hahn, S. Tan, G. Gao, Big data and analytics in healthcare: introduction to the special section, Inf. Syst. Front. 18 (2016) 233–235. Available from: https://doi.org/10.1007/s10796-016-9641-2.
[8] H. Chen, R.H.L. Chiang, V.C. Storey, Business intelligence and analytics: from big data to big impact, MIS Q. 36 (2012) 1165. Available from: https://doi.org/10.2307/41703503.
[9] A. Khera, D. Singh, D.K. Sharma, Application design for privacy and security in healthcare, in: Security and Privacy of Electronic Healthcare Records: Concepts, Paradigms and Solutions, IET, 2019, pp. 93–130.
[10] A. Khera, D. Singh, D.K. Sharma, Information security and privacy in healthcare records: threat analysis, classification, and solutions, in: Security and Privacy of Electronic Healthcare Records: Concepts, Paradigms and Solutions, IET, 2019, pp. 223–247.
[11] M. Herland, T.M. Khoshgoftaar, R. Wald, A review of data mining using big data in health informatics, J. Big Data 1 (2014) 2. Available from: https://doi.org/10.1186/2196-1115-1-2.
[12] Y. Wang, L. Kung, T.A. Byrd, Big data analytics: understanding its capabilities and potential benefits for healthcare organizations, Technol. Forecast. Soc. Change 126 (2018) 3–13. Available from: https://doi.org/10.1016/j.techfore.2015.12.019.
[13] M. Cox, D. Ellsworth, Application-controlled demand paging for out-of-core visualization, in: Proceedings, Visualization '97 (Cat. No. 97CB36155), IEEE, Phoenix, AZ, USA, 1997, pp. 235–244. Available from: https://doi.org/10.1109/VISUAL.1997.663888.
[14] Y. Demchenko, Z. Zhao, P. Grosso, A. Wibisono, C. de Laat, Addressing Big Data challenges for scientific data infrastructure, in: 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, 2012, pp. 614–617. Available from: https://doi.org/10.1109/CloudCom.2012.6427494.
[15] K. Lan, D. Wang, S. Fong, et al., A survey of data mining and deep learning in bioinformatics, J. Med. Syst. 42 (2018) 139. Available from: https://doi.org/10.1007/s10916-018-1003-9.
[16] T.B. Murdoch, A.S. Detsky, The inevitable application of Big Data to health care, JAMA 309 (2013) 1351. Available from: https://doi.org/10.1001/jama.2013.393.
[17] A. Avci, S. Bosch, M. Marin-Perianu, R. Marin-Perianu, P. Havinga, Activity recognition using inertial sensing for healthcare, wellbeing and sports applications: a survey, in: 23rd International Conference on Architecture of Computing Systems, ARCS 2010, Hannover, Germany, 2010, pp. 167–176.
[18] T. Huynh, B. Schiele, Analyzing features for activity recognition, in: Proceedings of the 2005 Joint Conference on Smart Objects and Ambient Intelligence: Innovative Context-Aware Services: Usages and Technologies, sOc-EUSAI '05, ACM, New York, NY, USA, 2005, pp. 159–163. Available from: https://doi.org/10.1145/1107548.1107591.
[19] T.N. Gia, M. Jiang, A.-M. Rahmani, T. Westerlund, P. Liljeberg, H. Tenhunen, Fog computing in healthcare Internet of Things: a case study on ECG feature extraction, in: 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing (CIT/IUCC/DASC/PICOM), IEEE, Liverpool, United Kingdom, 2015, pp. 356–363. Available from: https://doi.org/10.1109/CIT/IUCC/DASC/PICOM.2015.51.
[20] J.A.C. Sterne, I.R. White, J.B. Carlin, M. Spratt, P. Royston, M.G. Kenward, et al., Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls, BMJ 338 (2009) b2393. Available from: https://doi.org/10.1136/bmj.b2393.
[21] R. Salazar, P. Roepman, G. Capella, V. Moreno, I. Simon, C. Dreezen, et al., Gene expression signature to improve prognosis prediction of stage II and III colorectal cancer, J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol. 29 (2011) 17–24. Available from: https://doi.org/10.1200/JCO.2010.30.1077.
[22] T. Haferlach, A. Kohlmann, L. Wieczorek, G. Basso, G.T. Kronnie, M.-C. Béné, et al., Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the International Microarray Innovations in Leukemia Study Group, J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol. 28 (2010) 2529–2537. Available from: https://doi.org/10.1200/JCO.2009.23.4732.
[23] W. Liu, R. Li, J.Z. Sun, J. Wang, J. Tsai, W. Wen, et al., PQN and DQN: algorithms for expression microarrays, J. Theor. Biol. 243 (2006) 273–278. Available from: https://doi.org/10.1016/j.jtbi.2006.06.017.
[24] H. Yoshida, A. Kawaguchi, K. Tsuruya, Radial basis function-sparse partial least squares for application to brain imaging data, Comput. Math. Methods Med. 2013 (2013) 1–7. Available from: https://doi.org/10.1155/2013/591032.
[25] I. Ouanes, C. Schwebel, A. Français, C. Bruel, F. Philippart, A. Vesin, et al., A model to predict short-term death or readmission after intensive care unit discharge, J. Crit. Care 27 (2012) 422.e1–422.e9. Available from: https://doi.org/10.1016/j.jcrc.2011.08.003.
[26] J. Sun, D. Sow, J. Hu, S. Ebadollahi, A system for mining temporal physiological data streams for advanced prognostic decision support, in: 2010 IEEE 10th International Conference on Data Mining (ICDM), IEEE, Sydney, Australia, 2010, pp. 1061–1066. Available from: https://doi.org/10.1109/ICDM.2010.102.
[27] Y. Zhang, S. Fong, J. Fiaidhi, S. Mohammed, Real-time clinical decision support system with data stream mining, J. Biomed. Biotechnol. 2012 (2012) 1–8. Available from: https://doi.org/10.1155/2012/580186.
[28] J. Ginsberg, M.H. Mohebbi, R.S. Patel, L. Brammer, M.S. Smolinski, L. Brilliant, Detecting influenza epidemics using search engine query data, Nature 457 (2009) 1012–1014. Available from: https://doi.org/10.1038/nature07634.
[29] Q. Yuan, E.O. Nsoesie, B. Lv, G. Peng, R. Chunara, J.S. Brownstein, Monitoring influenza epidemics in China with search query from Baidu, PLoS ONE 8 (2013) e64323. Available from: https://doi.org/10.1371/journal.pone.0064323.
[30] A. Signorini, A.M. Segre, P.M. Polgreen, The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic, PLoS ONE 6 (2011) e19467. Available from: https://doi.org/10.1371/journal.pone.0019467.
[31] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, B. Liu, Twitter improves seasonal influenza prediction, in: Proceedings of the International Conference on Health Informatics, SciTePress - Science and Technology Publications, Vilamoura, Algarve, Portugal, 2012, pp. 61–70. Available from: https://doi.org/10.5220/0003780600610070.
[32] J. Luo, M. Wu, D. Gopukumar, Y. Zhao, Big Data application in biomedical research and health care: a literature review, Biomed. Inform. Insights 8 (2016) BII.S31559. Available from: https://doi.org/10.4137/BII.S31559.
[33] R. Nambiar, R. Bhardwaj, A. Sethi, R. Vargheese, A look at challenges and opportunities of Big Data analytics in healthcare, in: 2013 IEEE International Conference on Big Data, IEEE, Silicon Valley, CA, USA, 2013, pp. 17–22. Available from: https://doi.org/10.1109/BigData.2013.6691753.
[34] U. Sinha, A. Singh, D.K. Sharma, Machine learning in the medical industry, in: A. Solanki, S. Kumar, A. Nayyar (Eds.), Handbook of Research on Emerging Trends and Applications of Machine Learning, IGI Global, 2020, pp. 403–424.
4 Healthcare and medical Big Data analytics
Blagoj Ristevski and Snezana Savoska Faculty of Information and Communication Technologies - Bitola, University "St. Kliment Ohridski" - Bitola, Republic of Macedonia
Abstract In the era of big data, a huge volume of heterogeneous healthcare and medical data is generated daily. These heterogeneous data, stored in diverse formats, have to be integrated and stored in a standard way to allow efficient and effective data analysis and visualization. The data, generated from sources such as mobile devices, sensors, lab tests, clinical notes, social media, demographic data, and diverse omics data, can be structured, semistructured, or unstructured. This variety of data structures requires big data to be stored not only in standard relational databases but also in NoSQL databases. Effective data analysis calls for suitable classification and standardization of big data in medicine and healthcare, as well as excellent design and implementation of healthcare information systems. Regarding the security and privacy of patients' data, we suggest employing suitable data governance policies. Additionally, we suggest choosing proper software development frameworks, tools, databases, in-database analytics, stream computing, and data mining algorithms (supervised, unsupervised, and semisupervised) to reveal valuable knowledge and insights from these healthcare and medical big data. Ultimately, we propose the development of healthcare information systems that are not only patient-oriented but also decision- and population-centric. Keywords: Big Data; medical and healthcare Big Data; Big Data Analytics; databases; healthcare information systems
4.1 Introduction
Nowadays, the use of numerous diverse digital devices that generate a massive volume of heterogeneous structured, semistructured, and unstructured data results in the explosive growth of
many different types of data, enabling the extraction of new information and of insights inherent in the data. Such an enormous variety of data, collected from heterogeneous sources in healthcare and medicine, can provide valuable understanding for patients, clinicians, hospitals, pharmacies, insurance companies, and other involved parties. Additionally, when data from social media, banks and credit cards, census records, and other sources of varying quality, such as Internet of Things (IoT) data measuring vital signs, are attached to the healthcare and medical data, a holistic view of a patient, including the environmental factors that might influence the patient's health, can be obtained. These linked data can be very sensitive and of differing quality, but it is very useful to discover connections from electronic health records (EHR) and coding systems in order to establish common criteria that benefit big data analysis and visualization for different stakeholders.
As healthcare data sources, healthcare institutions are critical data producers that demand a large architecture for storing a wide variety of data connected to patients, healthcare institutions, government and municipality activities associated with healthcare, and the many activities of a wide range of healthcare stakeholders. A large volume of healthcare data is generated in hospitals during clinical treatment, lab, and administrative procedures. The many healthcare big data sources have different attributes that have to be taken into account, and the enormous volume of data collected from this sector has to be analyzed to obtain specific knowledge for all stakeholders in healthcare. Because of their large volume, veracity, variety, value, variability, and velocity, healthcare data have big data properties. Usually, these data are stored as patients' EHR and medical data records, coded with well-known medical and pharmaceutical coding systems such as ICD10 and SNOMED.
The first characteristic of healthcare and medical big data is complexity. It comes from the wide range of activities connected with patients, physicians, hospitals and clinicians, healthcare providers, healthcare insurance companies, medical instruments and medical terminology, national regulations, pharmaceutical companies, healthcare research groups, healthcare IoT appliances, and the needs and directions of the WHO. These data are also connected with living conditions, including environmental data, transportation and communications, and social media and advertising data.
All these complex data sometimes have to be associated with specific conditions in the context of healthcare big data. Healthcare big data can also be produced by many types of data sources: social media and markets, scientific instruments, mobile devices and services, technological and network services, hospital medical devices, EHR, physicians' notes, and medical and pharmaceutical research.
Certain ethical healthcare obligations demand a high quality of services for patients. Medical data storage and the documents created aim to support quality management in healthcare institutions, so historical disease data for each patient (medical audit), healthcare quality system monitoring, specific clinical insights, and epidemiological data should also be provided. These could be very useful for gaining knowledge that can be used in the further training and education of healthcare professionals, as well as in assessing medical students. The data have to be stored in high-capacity distributed repositories and used by the other stakeholders for various purposes.
Thereafter, suitable data mining techniques are applied to the medical and healthcare data sets. The specific purpose of healthcare data mining procedures should be to support anonymized patient data exchange among healthcare staff and institutions, and to support external demands such as the law, reimbursement procedures, and documentation for planning and control of healthcare services. Additionally, the applied data mining techniques should support scientific research by enabling patient analysis as well as statistical data analysis, clinical exploration of the data, epidemiological data analysis, and information about critical insights using appropriate case studies.
Many efforts are being made to bring healthcare data into a unique system that codes important medical and healthcare data in general [1]. Because the focus of medical data is the medical and health care of patients, these data are typically clinical data containing disease history, symptoms, clinical notes, diagnoses, therapies, and predictions or prognoses of the patient's health condition. The data also have to be connected with nursing, labs, medical knowledge, epidemiological information, and other relevant healthcare information. Clinical data management systems have to use technical language for classifying healthcare data and use nomenclatures, creating a taxonomy of medical and healthcare big data. On the other hand, the evolving healthcare standards focus on healthcare data interchange, allowing basic communication among different healthcare information systems and their components.
The design and implementation of healthcare information systems based on suitable big data should provide patients with more efficient and cheaper healthcare services, give managers in healthcare institutions and insurance companies an enhanced knowledge base for decision making, and benefit the involved stakeholders. Additionally, the security and privacy of patients' data, which play a central role in healthcare information systems, must be assured. The analysis of medical and healthcare big data gathered according to the described coding systems provides a tailored analysis of specific groups of medical data. Nowadays such analysis is more than needed to detect the best manners of treatment and potential treatment anomalies, as well as the influence of different factors on each patient.
The rest of this chapter is organized as follows. Section 4.2 describes the properties of big data with a focus on their usage in medicine and healthcare, and the various data sources that form a base for big data analysis. The next section depicts all stages of big data analytics, from data creation to visualization, as well as commonly used data mining algorithms. The coding systems and taxonomy of healthcare and medical data are shown in the subsequent section. Section 4.5 focuses on medical and healthcare data interchange standards. In the following section, as a methodology for the development of healthcare information systems, we describe frameworks for patient-oriented information systems for data analysis based on medical and healthcare data, with a highlight on their necessary components and functions. In Section 4.7, we describe the concerns about data privacy, security, and governance in medicine and healthcare and give some directions for handling these issues. Concluding remarks and directions for future work, such as choosing proper software development frameworks, tools, databases, in-database analytics, stream computing, and data mining algorithms, as well as directions towards the development of patient-, decision-, and population-centric healthcare information systems, are given in the last section.
4.2 Medical and healthcare Big Data
The use of numerous diverse digital devices that generate a large volume of heterogeneous data results in an explosion and significant growth of voluminous complex data [2]. These data
enable the extraction of new inherent insights in many disciplines. The aim of the enormous variety of data collected from heterogeneous sources is to provide valuable understanding for patients, clinicians, hospitals, and pharmacies, which makes medical and healthcare data analysis an extremely challenging task [3]. Healthcare analytics tools have to integrate these data generated from numerous heterogeneous sources, providing valuable information and a base for healthcare researchers to improve current healthcare software solutions.
Although the term big data was introduced in the 1990s, it seriously affected database development in 2001, when the Meta Group articulated the 3V's (volume, velocity, variety) [4], defining big data as "data-intensive technologies for data collection, data storage, data analysis, reasoning with data and discovering new data" [5]. In the following years, the big data concept absorbed all emerging technologies that influence human life and needs, and the definition was extended to take into account first 5V's and thereafter 6V's. Big data proved especially useful for science and technology, business, healthcare, and education. When the IoT appeared, many sensors were connected to enhance human life, producing a huge amount of structured and unstructured data. Big data then became just one part of the wider concept of data science, which includes many other methods and techniques for big data analysis and visualization, usually subsumed under the wider business intelligence and analytics (BIA) concept to create a big impact from big data [6].
Nowadays, big data usually refers to the following properties, denoted as the 6 "V's": volume, velocity, variety, value, variability, and veracity of the generated data, which are difficult to analyze using standard data processing methods and platforms [7,8]. Volume characterizes the huge amount of created data, while velocity denotes data in motion as well as the frequency and speed of the creation, processing, and analysis of data. The complexity and heterogeneity of multiple data sets refer to variety, while value refers to coherent data analysis that should benefit clients, customers, managers, organizations, corporations, and other stakeholders. Variability refers to data consistency, while veracity refers to data quality, relevance, uncertainty, reliability, and predictive value. In medicine and healthcare in particular, many handheld, wearable, smart, and capturing devices, remote monitors, and wearable sensors are used to obtain more medical-related data and improve disease diagnostics, along with data generated by many novel omics technologies.
Current advances in patients' EHR and their fusion with social, behavioral, and diverse biological data have led to novel healthcare models that support personalized and precision medicine [3]. Personalized healthcare services provide individually tailored diagnoses, drugs, and treatments based on psycho-physiological and spatial-temporal circumstances. The main challenge in using healthcare big data effectively is to identify the potential sources of healthcare information and to highlight the value of linking these data together [9]. Healthcare data collection is the main data stream in healthcare big data analytics. Data sources can be curative, preventive, or other types of healthcare and medical sources. Curative data are medical records, lab tests, referral, and prescription data. Preventive data can be taken from growth cards, maternal and child healthcare (MCH) cards, school healthcare cards, and family registration cards. Other data sources can deliver comprehensive contents, record filling (patient-retained), layout (self-explanatory), production forms, and various environmental data. Demographic surveillance should also be taken into account in healthcare data collection because of the need for information such as cause-specific mortality data, sex, age, age-specific fertility, perceived data on mortality and disability, household healthcare expenditure, practices, service quality, and the costs covered by healthcare insurance [10]. System medicine, which combines clinical decision support systems (DSS) and EHR systems, aims toward individualized disease prognosis and treatment of patients. These prognoses and treatments have to be based on various large amounts of data, including phenotype data, omics data, and the individual preferences of the patients [11]. Data used and mutually combined in system medicine can be categorized into three main groups: personal data (behavioral data, demographic data), clinical data (examination data, laboratory data, imaging data), and omics data (e.g., genomics, proteomics, metabolomics, transcriptomics, lipidomics, epigenomics, microbiomics, immunomics, and exposomics data) [11]. Data used in healthcare systems and applications are categorized as unstructured, semistructured, and structured data. Structured data have a defined data type, structure, and format. Such data in healthcare systems are laboratory results, hierarchical terminologies of different diseases and their symptoms, information about diagnosis and prognosis, patients' data, and drug and billing information [3].
Semistructured data, which are usually generated by various sensor devices for monitoring patients' conditions and behavior, are organized with minimal structural properties. Besides these two categories, unstructured data have no inherent structure; they usually comprise doctors' prescriptions written in natural language, biomedical literature, clinical notes, and images. Many researchers pay attention to the symbiosis between the data types collected from healthcare services and the structure of these data. Weber et al. mapped the tapestry of healthcare data sources, wherever they are collected or stored and regardless of the coding and classification system used, as systematized in Table 4.1 [9]. EHR data fall into two main categories: electronic medical records (EMR) and sensor data. EMR data usually consist of patients' medical history, medical features (e.g., diagnoses, lab tests, medication, procedures, unstructured text data, image data), and socio-demographic information about patients [12]. Sensor data are collected from various sensors and originate from a huge number of users and devices, generating enormously large amounts of real-time data streams. The main characteristics of EMR data are high dimensionality, missing values, sparsity, irregularity, intrinsic noise, and bias [12]. Problems with missing values and data sparsity are usually solved by removal or imputation methods. Medical imaging is a rich source of phenotypic data appropriate for further data analysis, personalized medicine, predictive analytics, and artificial intelligence [13]. Medical imaging data are generated by imaging techniques such as X-ray, mammography, computer tomography (CT), ultrasonography, fluoroscopy, photoacoustic imaging, magnetic resonance imaging (MRI), histology, positron emission tomography (PET), radiography, nuclear medicine, tactile imaging, echocardiography, angiography, and elastography [7,14]. Prescription data contained in the patient EHR, physician-generated clinical notes, and medical research reports are examples of data that contain natural language terms, mathematical symbols, and graphs [15]. These incompatible data structures and formats, along with inconsistent data semantics and huge data volumes, make healthcare big data analytics a challenging and demanding task. Large-scale omics data can help clarify the molecular basis of particular diseases and disease risks. To model and analyze the complex interactions that occur between entities in biology, medicine, and neuroscience, networks are fundamental and highly suitable tools. The nodes of such complex networks represent dynamic entities such as genes, microRNAs, proteins, and metabolites, whereas the edges represent the links and interactions among them [16].
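As a concrete illustration of the removal and imputation methods mentioned above, the following minimal sketch (the three-column lab-value matrix is a hypothetical example, not real EMR data) shows column-wise mean imputation and the alternative row-removal strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical EMR feature matrix: rows are patients, columns are lab values
# (e.g., HbA1c, creatinine, sodium); np.nan marks missing measurements.
X = np.array([
    [7.2, np.nan, 140.0],
    [6.1, 1.1, np.nan],
    [np.nan, 0.9, 138.0],
])

# Imputation: fill each missing value with the column mean
# (median or k-NN imputation are common alternatives).
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Removal: simply drop every row that contains a missing value.
X_complete_rows = X[~np.isnan(X).any(axis=1)]
```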
Table 4.1 Simplified information sources related to individual healthcare and data classification [9]. The table organizes data types (medication, demographics, encounters, diagnosis, procedures, diagnostics, genetics, social history, family history, symptoms, lifestyle, socioeconomics, social network, and environment) against structured and unstructured sources. Structured examples include prescribed and filled medications with dose/route and NDC/RxNorm codes, electronic pill dispenser records, HL7 visit type and time records, SNOMED and ICD10 diagnoses, CPT/ICD10 procedure codes, LOINC-coded pathology, histology, radiology, lab, and vital-sign data, SNP arrays, employment sick days, death records, PHR entries, census and public health records, and GIS/EPA health database maps. Unstructured examples include over-the-counter (OTC) medications and purchases (indirect), medication instructions, medications actually taken, allergies and out-of-pocket expenses, diaries, herbal remedies and alternative therapies, home treatment, tests and monitoring, differential diagnostics, reports, imaging, digital clinical notes and physical examinations, blogs, Facebook postings and tweets, fitness club memberships, credit card records, police and other records, climate and weather data, and news feeds.
4.2.1
Exposome data
Creating an individual exposure model can be very important for human beings. When wider integration of healthcare and biomedical data with environmental data is required, the term exposome is introduced as a novel concept that tends to delineate biotechnical approaches and to systematically quantify a massive subset of the environmental exposures of an individual over the whole lifespan [18]. Some data in this concept have genetic or clinical backgrounds, and some are associated with the integration of genotype-phenotype data and environmental risk factors at the individual level [17]. This concept can be essential for understanding the biological basis of diseases, taking into consideration the influence of ecosystems on each person; the authors in Ref. [17] state that most diseases result from the complex interplay between genetic and environmental factors. Such analyses demand huge healthcare big data sets and complex data mining tools with diverse analytical focuses, to make important insights available based on the EHR, known coding systems, and ontologies of the exposome [17]. The authors call an expotype a particular set of exposome features of an individual gathered over a certain time and/or space [17], and state that it is important to develop a template-driven model for identifying exposome concepts from the Unified Medical Language System (UMLS) and for creating expotypes. They also define exposotype terms as the metabolomic profile of an individual that reflects an occurrence of exposure. One reason for the emergence of this coined word is the partitioning of the landscape of disciplines interested in exposome characterization from different points of view, such as environment, health, exposure, toxicology, and health services. There are many interdisciplinary subbranches of the exposome, which lead to further coined words and terms, such as urban exposome, occupational exposome, public health exposome in epidemiology, socioexposome, nanoexposome, infectoexposome, drugexposome, psychoexposome, etc. [17]. Besides omics data, exposure data in the wide sense have to include certain nongenetic patient data, such as patient behavior and habits, social determinants of health, and physicochemical exposures. These can be taken from various sources, such as biomonitoring data, exposure to particular environmental agents like smoke, geographic information systems (GIS), environmental sensors, etc. Also, EHR data, digital health sensor
data, mobile applications, and consumer behavior data can be considered exposome data. They can be used to create new dimensions and multi-omics data models [17]. Many databases have been created for this purpose, such as the US National Health and Nutrition Examination Survey (NHANES), which treats the exposome theory and has achieved successful results. Since 2013, the Institute of Medicine (IOM) has reported that capturing social and behavioral domains and data in the EHR is important because it increases clinical awareness of the patient's state, broadly considered, and links clinical, public health, and general public resources [19].
4.3
Big Data analytics
The big data concept requires real-time data processing and the development of real-time predictive models. Traditional relational databases cannot efficiently store, preprocess, process, and analyze such rapidly growing amounts of data [2]. To perform efficient multisite and multivariable searching and querying, extensive indexing and efficient data lookup of the big data sets are required. Another stage is preprocessing of the raw data, which might be inconsistent, inaccurate, erroneous, and/or incomplete. To make big data analysis more reliable, data from heterogeneous sources must be integrated into a common structured form. To improve the quality of data collected from various sensors and devices and to obtain more reliable analytics results, incomplete, inaccurate, and irrelevant data should be identified and removed. Such unreliable data are usually replaced with interpolated values. For these reasons, when we talk about healthcare big data analysis and visualization, as the most desirable techniques for creating rapid information and deep insights, we have to analyze who will use the results of data analysis and visualization and for which purposes. After defining the users (stakeholders), we have to analyze all necessary data sources and data formats and determine how to extract information and knowledge from these healthcare and medical data. This demands careful design of the healthcare information system, data preparation for further analysis and visualization, and the choice of suitable analytics and visualization tools and platforms.
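To make the interpolation step concrete, here is a minimal sketch (the heart-rate series, its timestamps, and the gap pattern are hypothetical) that replaces missing or removed sensor readings with time-aware linear interpolation.

```python
import pandas as pd

# Hypothetical one-minute heart-rate stream with two unreliable readings
# already removed (shown as None / NaN).
hr = pd.Series(
    [72, 74, None, None, 80, 79],
    index=pd.date_range("2021-01-01 08:00", periods=6, freq="min"),
)

# Linear interpolation weighted by the time distance between samples.
hr_clean = hr.interpolate(method="time")
print(hr_clean)
```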
Big data analytics acquires an enormous variety of data from former and current customers to obtain valuable knowledge, improve decision making, predict customer habits and behavior, and deliver real-time customer-tailored offers. Big data analytics in healthcare and medicine aims to bridge the gap between costs and outcomes in healthcare, a gap that results from poor management of insights from research and poor resource management [20]. These analytics goals are achieved by prediction, modeling, and inference. Moreover, data created by omics technologies can significantly improve the prediction of diabetes, heart diseases, cancer, and other diseases [21]. The main barriers in computer science related to the integration of omics data into clinical systems are the development of models of cellular processes that cover noncorresponding omics data types, the limitations of storing and organizing heterogeneous data sets generated by diverse high-throughput omics technologies, and the lack of suitable multidisciplinary data scientists with broad knowledge in computer science, biology, medicine, bioinformatics, and data mining [21]. To increase computing performance and scalability for diverse omics data, 3D memory and scalable methodologies have been developed [7]. Data acquisition is followed by the data cleansing step, which detects and removes the aforementioned data anomalies. In the subsequent stage, raw data must be transformed by normalization and aggregation. During data transformation, scaling, cleaning, splitting, translating, merging, sorting, indexing, and validating of data are performed as substeps to make data consistent and easily accessible for further analysis stages [22]. Data transformation is important for obtaining valuable data to support evidence-based decision making in healthcare organizations [23]. After data transformation, integration into a unified standard form is required. Sensor data are generated by diverse medical devices at different sampling rates, which makes data integration an important step [12]. Data transmission is a technique that transfers integrated data into data storage centers, systems, or distributed cloud storage for further analysis [2]. Data reduction reduces the size of large data sets so that suitable data mining techniques and applications can be applied. Moreover, some data mining techniques require discretization of continuous attribute intervals, which means discretization techniques should be applied to the data before further analysis, interpretation, and visualization. Since physicians produce voluminous clinical notes and unstructured text data, searching and indexing tools are needed to perform optimized full-text exploration of medical data. These tools
are employed for worthwhile distributed text data management and indexing of huge amounts of data. The healthcare industry employs machine learning techniques to convert plentiful medical data into applicable knowledge by performing predictive and prescriptive analytics. Recent advances in healthcare sensor devices allow data processing from diverse data sources to be achieved in real time. To discover correlations, patterns, and previously unknown knowledge in large data sets and databases, appropriate data mining techniques should be applied. Data mining algorithms in medicine and healthcare enable the analysis, detection, and prediction of specific diseases and hence aid practitioners in making decisions about early disease detection [24]. Such early disease prediction helps to decide on and use the most suitable treatment considering the symptoms, the patient's EHR, and the treatment history. Data mining algorithms can be supervised, semisupervised, or unsupervised learning algorithms. Unsupervised algorithms aim to find patterns or groups (clusters) of entries within unlabeled data. Supervised learning, which uses training sets of classified data, aims to predict a known target and to classify or make inferences about test data sets. Semisupervised learning algorithms use small sets of annotated data and larger unlabeled data sets.
4.3.1
Unsupervised learning
Clustering is an unsupervised learning approach that finds groups of data entries (clusters) by using distance metrics. Clustering aims to minimize the intra-cluster distances of data entries belonging to the same cluster and, at the same time, to maximize the inter-cluster distances among data entries that belong to different clusters [25]. Besides the commonly used distance-based clustering algorithms, density-based clustering aims to find areas (clusters) with a higher density of entries than the remainder of the data set. Clustering algorithms include hierarchical clustering, k-means clustering, fuzzy c-means clustering, self-organizing maps, and principle-based clustering, as well as biclustering, where rows and columns of the data set matrices are clustered simultaneously, and triclustering, which uses three-dimensional data analysis to discover coherent three-dimensional subspaces (triclusters).
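As a concrete illustration of distance-based clustering, the following minimal sketch (the patient features and the choice of two clusters are hypothetical) applies k-means, which iteratively minimizes the intra-cluster distances described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical patient features: [age, BMI, systolic blood pressure].
patients = np.array([
    [25, 22.0, 118], [31, 24.5, 121], [58, 31.2, 145],
    [63, 29.8, 150], [45, 27.0, 135], [29, 23.1, 119],
])

# Standardize so the Euclidean distance metric weighs features comparably.
X = StandardScaler().fit_transform(patients)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster assignment per patient
print(km.inertia_)  # sum of squared intra-cluster distances
```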
4.3.2
Supervised learning
Classification is a commonly used method for supervised learning, that is, predictive modeling where the output
values are categorical [20]. The aim of classification is to create rules for assigning and organizing data entries into a preidentified set of classes so that the data are used most efficiently and effectively. Commonly used classification algorithms are decision trees, Bayesian networks, neural networks, support vector machines, boosting, logistic regression, and naïve Bayesian classifiers. Classification of medical and healthcare data sets is used to develop DSS for diagnosis as well as predictive models for prognosis based on big data analysis. Linear regression is a statistical analysis technique for representing trends in the data by quantifying the relationship between dependent and independent variables.
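The following minimal sketch (the glucose/BMI features and diagnosis labels are invented for illustration, not real patient data) trains a decision tree, one of the classification algorithms listed above, and evaluates it on held-out entries.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [plasma glucose, BMI]; labels: 1 = diabetic, 0 = not.
X = np.array([[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1],
              [137, 43.1], [116, 25.6], [78, 31.0], [115, 35.3]])
y = np.array([1, 0, 1, 0, 1, 0, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(clf.predict(X_test))        # predicted classes for unseen patients
print(clf.score(X_test, y_test))  # classification accuracy
```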
4.3.3
Semisupervised learning
Semisupervised learning algorithms lie between unsupervised and supervised algorithms: they use a large amount of unlabeled data together with a small amount of labeled data. Data visualization aims to provide a graphical representation that enables users to interact with data to extract useful information and knowledge [26]. Data visualization tools in healthcare aid in detecting patterns, tendencies, outliers, and clusters, in analyzing time-series data, and in improving clinical healthcare services and public health policy [3]. Visualization generates outputs such as visualization reports (e.g., charts, interactive dashboards), real-time reporting information (e.g., diverse notifications, alerts, and key performance indicators), and clinical summaries (e.g., historical reports, statistical analyses, time-series comparisons) [22]. For healthcare and medical big data analysis, interoperability specifications are very important, as are the coding systems used [27], especially when healthcare clinical data, omics data, and sensor data are combined [28]. These data also have to be related to the patient and prescription data. Usually, hospitals and clinics have integrated information systems that produce a huge volume of data suitable for data analysis, but with restricted possibilities for generating data for decision support for their stakeholders and for defining interactions and workflow systems that provide healthcare quality, patient data privacy, and optimization of healthcare costs. They also produce an enormous quantity of reports with statistical data analysis and periodical reports. But when data have to be analyzed at a higher level, the system has to be designed in advance to store data in formats suitable for further analysis [27].
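Returning to the semisupervised setting defined at the start of this section, the following minimal sketch (hypothetical two-feature data; unlabeled entries are marked with -1, scikit-learn's convention) shows self-training, a common semisupervised scheme in which a base classifier fitted on the labeled entries iteratively pseudo-labels the rest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Hypothetical data: four labeled and four unlabeled samples (-1 = no label).
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 9.0], [7.8, 9.2],
              [1.1, 2.2], [8.1, 8.8], [1.3, 2.1], [7.9, 9.1]])
y = np.array([0, 0, 1, 1, -1, -1, -1, -1])

# Fit on the labeled data; confidently predicted unlabeled samples are then
# added to the training set over successive iterations.
model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(model.predict(X))
```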
The motivation of healthcare stakeholders such as university hospital centres is very high because they have to deal with large amounts of data. For example, a single hospital may have to deal with several million documents containing a huge amount of different data, such as clinical notes and lab results. They also produce a lot of paper documents, which are difficult to analyze, and hence numerous data are generated. These data have to be connected with the patients' living conditions [29]. In Ref. [30], the authors proposed the concept of a four-level healthcare system in which patients, care teams, organizations, and the environment play key roles. Suitable systems engineering tools should be employed to handle healthcare issues. First, systems design tools are needed for concurrent engineering, quality function deployment, human factors analysis, and failure analysis [30]. Subsequently, particular systems analysis tools are required for the following activities: modeling and simulation (e.g., queuing methods), discrete event simulation, supply chain management, game theory and contracts, system dynamics modeling, and productivity measuring and monitoring. For financial engineering and risk analysis, tools are required for stochastic analysis, value at risk, optimization of individual decision making, and distributed decision-making market models. Systems control tools are used for statistical process control and scheduling. These analysis tools have to work together to achieve synergy and the best analysis results in healthcare and medicine. The most important case for data analysis is delivering relevant information for decision making associated with disease diagnoses, treatments, and interventions. There are also large amounts of data related to the need for communication among medical practitioners from different healthcare institutions and to the EHR between two diagnostic episodes. Additionally, treatment plans, stored data about patients' lab tests, plans for creating appropriate healthcare services, and medical documents are required by the national healthcare bodies, which is important when they have to perform statistical analyses and create clinical reports. The second important reason and motivation for storing huge data sets is administrative healthcare data that support reimbursement procedures, considering the data collected from various healthcare services, interventions, and treatments as well as the expenses for these services. Healthcare institutions' managers have to plan and control the working processes, and these data should aid them in increasing transparency
and enhancing the decision-making processes for managing the available resources. Many implications arise from the need for evidence-based medicine and patient care when healthcare legislation and law are considered; there are many legal procedures in which insufficient documentation can have negative implications for healthcare institutions.
4.4
Healthcare and medical data coding and taxonomy
Many efforts have been made to bring healthcare data into a unique system that codes important medical and healthcare data in general [1]. The need for medical and healthcare information grows over time. Healthcare stakeholders know that healthcare documentation has to provide information that is complete, free of noise, on time, without missing values and outliers, and in the format required by the healthcare authorities. Moreover, the information has to be comprehensible and logically structured to support the desired knowledge in medicine and healthcare services [1]. Because the focus of medical data is the medical care and health of patients, these data are typically clinical data containing disease history, symptoms, clinical notes, diagnoses, therapies, and predictions or prognoses of the patient's health condition. The data also have to be connected with nursing, medical knowledge, epidemiological information, and other relevant information. Clinical data management systems have to use technical language for classifying healthcare data and have to use nomenclatures. The data can be classified according to data description and data mining demands. According to the subjects, classification can be done by the following criteria [15]:
• clinical information;
• medical knowledge that abstracts individual patients' insights into diseases; and
• attributes of healthcare systems, such as data for institutions, accidents, etc.
For instance, CDMS1 is a classification that considers the meaning of data in five classes: (1) contains primarily clinical facts; (2) contains primarily medical knowledge; (3) contains healthcare attributes; (4) contains a balanced mix of many types of information; (5) does not belong to any of the previously mentioned classes. The CDMS2 classification has the following classes: (1) data oriented to a patient or patient groups;
(2) data according to the level of standardization (standardized or unstandardized data). Other classification criteria take into account horizontal or vertical medical documentation (CDMS3), patients, diseases, and interventions (CDMS4), and the use of IT tools or conventional documentation (CDMS5) [27]. Medical coding systems are used to reduce problems when data analysis has to be performed and to provide shorter and more accurate data entries, less memory space, and the possibility of grouping data according to codes and groups. The coding language is based on a medical conceptual coding system and the use of a thesaurus (lexicon). The current healthcare coding systems are accepted and announced by the WHO. ICD10 (International Classification of Diseases) is the most significant coding system; it contains data for death statistics, healthcare quality control, and the international register of causes of death. The list of causes of death has been compiled since 1893, beginning with Bertillon [27], and later, in 1964, the ISI (International Statistical Institute) produced the document International Classification of Diseases and Causes of Death [27]. Nowadays, the current version is the tenth revision, ICD10, with a code of 3–5 alphanumeric characters: the first code character is a letter, while the remaining 2–4 characters are digits, and the fourth code character is separated from the first three by a decimal point. ICD10 contains 21 chapters for diseases [27,31]. For instance, chapter 4 classifies endocrine, nutritional, and metabolic diseases into 261 groups of diseases (e.g., E10 to E14 are diabetes mellitus), 2000 three-character classes (e.g., E10 is the class of insulin-dependent diabetes mellitus), and 12,000 four-character classes of diseases (e.g., E10.1 is insulin-dependent diabetes mellitus with ketoacidosis). Special codes from U50 to U99 are reserved for research purposes. The classes are created mainly according to statistical criteria (e.g., diabetes prevalence); there are no semantic features in the ICD10 classification. Some extensions, such as ICD10-CM (clinical modification), achieve higher granularity of groups to meet specific organizational and terminological demands of healthcare systems. Other classifications are the ICPM (International Classification of Procedures in Medicine), created by the WHO for research purposes, and ICD-9-CM, created by the US National Center for Health Statistics (NCHS) and the Health Care Financing Administration (HCFA). The code consists of four digits: two for the procedure group and two for the specific procedure and its specification.
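To make the code structure concrete, the following sketch checks codes against a simplified, illustrative ICD10-style pattern (a letter, two digits, and an optional decimal part); it is not a validator of the complete official code set.

```python
import re

# Simplified ICD10-style pattern: letter + two digits + optional ".digits".
ICD10_PATTERN = re.compile(r"^[A-Z]\d{2}(\.\d{1,2})?$")

for code in ["E10", "E10.1", "U50", "10.1", "E1"]:
    print(code, bool(ICD10_PATTERN.match(code)))  # the last two do not match
```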
The last digit has topological meaning (e.g., 30234 is a code of an operation for the respiratory system) [31]. The next important coding system in healthcare and medicine is the Systematized Nomenclature of Medicine, SNOMED (Systematized Nomenclature of Human and Veterinary Medicine, 1975/1979/1993, from the College of American Pathologists, CAP), followed in 2000 by SNOMED-RT (reference terminology) and SNOMED-CT (clinical terms). SNOMED-RT is a multidimensional nomenclature with a two-layer alphanumeric notation: two characters for the base hierarchy (T for topology, P for procedures) and second-layer characters for concept identification [27]. Joint research efforts by CAP and the NHS (National Health Service), announced in 1999, resulted in SNOMED-CT [27,32]. It integrates SNOMED-RT and the NHS's version of clinical terms. This unified terminology solved many data compatibility problems and provided the basic building blocks for worldwide clinical communication. The classification of malignant tumors (TNM) provides a consistent classification of the anatomic spread of malignant tumor diseases, the phases of diseases, and particular common names. This oncological classification provides topological and morphological unification on top of ICD10. Since 1953, the Union for International Cancer Control (UICC) and the international commission for the presentation of cancer statements and treatment results have maintained this mainstream method of classifying malignant tumours [33], which is used to this day. The system takes into account the tumor spread (T0–T4), the stage of regional lymph node metastasis (N0–N3), and the presence of distant metastases (M0–M1). For instance, the code T2N1M0 means that the malignant tumour is in the second phase of spreading, the lymph node metastasis is in stage N1, and there are no distant metastases (M0). The next coding system in healthcare is MeSH (Medical Subject Headings) from the US National Library of Medicine [34]. Its aim is to code the subjects of medical and healthcare literature according to a poly-hierarchical conceptual system. UMLS is another NLM (National Library of Medicine) product that brings together clinical codes and literature in a metathesaurus (lexicon) to provide automated linkage across clinical case studies and the available healthcare literature.
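As a small worked example of the TNM notation, the following sketch (illustrative only) parses codes of the form TxNyMz as described above.

```python
import re

# Minimal TNM parser: T0-T4 tumor spread, N0-N3 nodal stage, M0/M1 metastases.
TNM = re.compile(r"^T([0-4])N([0-3])M([01])$")

m = TNM.match("T2N1M0")
if m:
    t, n, metastasis = m.groups()
    status = ("distant metastases present" if metastasis == "1"
              else "no distant metastases")
    print(f"tumor spread T{t}, nodal stage N{n}, {status}")
```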
4.5
Medical and healthcare data interchange standards
The emerging healthcare standards focus on the possibility of healthcare data interchange, to enable basic communication between
different healthcare information systems. Examples of data interchange standards that affect healthcare are the Health Level 7 (HL7) standards, the National Council for Prescription Drug Programs (NCPDP), Digital Imaging and Communications in Medicine (DICOM), and the ANSI X12N standards. HL7 was developed as a messaging standard to allow interoperability among healthcare applications [35]. The HL7 organization is involved in other standards activities, but its messaging standard is commonly denoted simply as HL7. Other commonly used HL7 standards are the Clinical Context Management (CCM) specifications, the Arden Syntax for Medical Logic Systems, and the Electronic Health Record functional model. DICOM is a standard that supports the communication of digital image data regardless of the device producer, and it deals with picture archiving and communication systems (PACS) [35,36]. It was first published by the American College of Radiology and the National Electrical Manufacturers Association in 1985. NCPDP is an ANSI-accredited Standards Developing Organization (SDO) that established a standard for the electronic submission of third-party (intermediary) drug claims [37]. ASC X12N is the subcommittee of ASC X12 that deals with EDI (electronic data interchange) for the insurance sector. Its healthcare task group, X12N/TG2, develops and maintains standards for healthcare EDI [38]. Health record content standards are associated with creating functional standards for EHR content. Examples are the HL7 EHR functional model and the ASTM (American Society for Testing and Materials) healthcare informatics subcommittee's Continuity of Care Record (CCR) standard. The HL7 EHR functional model was adopted in 2004 as the second draft and contains three components: direct care, supportive, and information infrastructure, as shown in Fig. 4.1. The CCR has been developed by the ASTM Healthcare Informatics subcommittee and aims to provide a core data set of the most relevant and timely facts about a patient's healthcare. It is completed when a patient is transferred to another healthcare provider. The third version includes nine elements grouped into three main data groups, as shown in Fig. 4.2. HIPAA (Health Insurance Portability and Accountability Act) standards also influence electronic transaction standards for healthcare, mainly based on ASC X12N. The standard HIPAA code sets comprise ICD-10, CDT (Code on Dental Procedures and Nomenclature), HCPCS, and CPT-4 [35,39]. HIPAA also identifies designated standards organizations to develop, maintain, and adapt relevant EDI standards [40]. These comprise ASC X12, the Dental Content Committee of the ADA (American Dental Association),
Figure 4.1 Levels of healthcare information system management and base coding systems.
Figure 4.2 Specific clinical information care documentation and electronic health records.
HL7, NCPDP, the NUBC (National Uniform Billing Committee), and the NUCC (National Uniform Claim Committee). The above-mentioned standards and coding systems function within a given healthcare infrastructure such as the NHII (National Health Information Infrastructure). The NHII is neither a simple set of health information standards, nor a government agency, nor a centralized database of medical records; it is a complete knowledge-based network of interoperable systems of clinical, public health, and personal health information [41]. This is the basis from which data are taken for further big data analytics and visualization [35]. However, these coding systems, classifications, terminologies, and standards in healthcare have to support big data analysis in certain respects. The biggest problems are related to the 6V's big data concept, which means that data might not be structured, might be in various formats, and are created at huge volume and velocity, with a variety of data sets with different metadata.
4.6
Framework for healthcare information system based on Big Data
Developing healthcare information systems based on healthcare and medical big data has to take into account all stakeholders involved in healthcare and medical research. Patients, who pay for health insurance and play a central role in healthcare systems, expect healthcare institutions and hospitals to deliver a wide assortment of high-quality healthcare services at a reasonable cost. Besides physicians' diagnoses, patients can gain more medical and healthcare knowledge, such as information on symptoms, hospitalization, and medicaments, through social networks, forums, etc. [3]. Moreover, the use of different healthcare sensors and wearable devices provides an opportunity for telemedicine, which results in the creation of a huge amount of data. Medical personnel are key stakeholders who generate various data, such as medical imaging data, CT scans, laboratory results, and clinical notes. The medical staff make a diagnosis based on these data and symptoms. The collected data, once integrated into the big data repository, help physicians to make the right diagnosis, prescribe the appropriate medicaments, and monitor patients' health conditions. Hospital strategic operators should use available big data to strengthen the relationship between patient satisfaction and the
offered services and to optimize the use of healthcare departments and resources [3]. Pharmaceutical research based on available big omics data helps in understanding drugs and the biological processes that lead to successful drug design. Additionally, medicament prescriptions and recommendations for a particular disease, dosages, consumption quantities, and sales history from a specific pharmacy should be included in big data analytics. Clinical researchers can benefit from various clinical reports generated by big data analytics tools and from data contained in patients' EHRs. Healthcare big data offer healthcare insurance companies and organizations opportunities to generate reports, appropriate health plans, and trends for frequently occurring diseases in a particular geographic region. Such reports and plans can enable healthcare insurance funds, organizations, and institutions to predict and detect patterns of realistic claims and uncommon outliers in order to minimize the costs of financial misuse [3]. Furthermore, analyzing patients' behavioral big data taken in real time enables these insurers to introduce novel business models such as usage-oriented insurance, depending on the particular country's laws and regulations. Healthcare software developers play a crucial role in the logical and physical design and development of healthcare information systems. These computer science specialists have to have very wide interdisciplinary knowledge of computer science, data science, data mining, bioinformatics, information systems, healthcare, medicine, and biomedical engineering. The emergence of big data and their usage in medicine and healthcare has driven the development of numerous mobile healthcare services and applications that can employ and integrate data from heterogeneous sources such as biosignals (e.g., electroencephalograms, EEG; electrocardiograms, ECG), data from wearable sensor devices, laboratory data, etc. These data should be integrated, along with pharmaceutical and regulatory data, into high-level models in a cloud computing environment to address interoperability, availability, and sharing among different stakeholders such as physicians, patients, healthcare insurers, and pharmaceutical companies [3]. Most of the proposed healthcare information system frameworks are structured into the following layers [1,3]:
• data connection layer;
• data storage layer;
• big data analytics layer; and
• presentation layer.
The role of the data connection layer is to identify, extract, and integrate medical and healthcare data, while relational, nonrelational, and cloud-based data are stored in the data storage layer. The big data analytics layer provides diverse analytics such as descriptive, predictive, and prescriptive analytics. The presentation layer provides the development of graphical workflows, dashboards, and various kinds of data visualizations. Besides these layers, the frameworks have to address the privacy and security issues of the system on several tiers. The sensitivity tier protects patient information such as the disease name and its status, the patient's mental health, and biometric identifiers. The security tier authenticates patient data such as name, date of birth, doctor's name, etc. To ensure the privacy and security of patient data, the system should adopt a two-level security mechanism. The first security level is associated with the authorization of users to retrieve patient data in the clinics by providing provisional user and patient identifiers [3]. To access patient data at the inter- and intra-clinic levels, an OTP (one-time password) based security mechanism should be employed. Depending on the purpose of the analysis and the data types, big data analysis is divided into three parts: Hadoop MapReduce, stream computing, and in-database analytics [42]. As a result of the voluminous big data and various data formats, new NoSQL (not only SQL) database management systems are required to integrate and retrieve data sets and to enable data transfer from standard into new operating systems. These NoSQL databases, used for big data storage, are classified into the following four categories (some of them overlapping): column databases (e.g., HBase, Cassandra), document-oriented databases (e.g., MongoDB, OrientDB, Apache CouchDB, Couchbase), graph databases (e.g., Neo4j, Apache Giraph, AllegroGraph), and key-value databases (e.g., Redis, Riak, Oracle NoSQL Database, Apache Ignite). Hadoop MapReduce is a programming model that can process large data sets on a Hadoop cluster by providing parallelization, distribution, and scheduling services. MapReduce allows analysis of structured, semistructured, and unstructured data in a massively parallel processing (MPP) environment. Apache Hive is a relational model for querying, searching, and analyzing huge data sets stored in the Hadoop Distributed File System (HDFS). It uses HiveQL, a query language that transforms typical SQL queries into MapReduce tasks [43]. To store data in distributed and scalable databases, Apache HBase is a suitable system.
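As an illustration of the MapReduce programming model described above, the following minimal Hadoop Streaming sketch counts patient records per diagnosis code; the tab-separated input format, with an ICD code in the third field, is a hypothetical example.

```python
# mapper.py -- emits "icd_code<TAB>1" for every input record.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 3:
        print(f"{fields[2]}\t1")
```

Hadoop sorts the mapper output by key before it reaches the reducer, which sums the counts:

```python
# reducer.py -- sums the counts per ICD code (input arrives sorted by key).
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

Such scripts would typically be submitted with the hadoop-streaming jar, passed as the -mapper and -reducer options.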
Stream computing supports high-performance big data processing in real time or near real time. Stream-computing analysis of healthcare big data can respond to unexpected events as they occur, such as misuse of a customer account, and quickly determine the most appropriate actions. Suitable non-Hadoop tools for streaming data processing are Spark, Hive, Storm, and GraphLab [44]. In-database analytics provides high-speed parallel processing, scalability, and optimization while ensuring a safe environment for sensitive data. Results of in-database analysis are not real-time; in healthcare, this kind of analysis supports preventive healthcare practice and evidence-based medicine.
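As an illustration of the stream computing approach, the following minimal Spark Structured Streaming sketch (the socket source and the "patient_id,heart_rate" line format are assumptions made for illustration) maintains a continuously updated average heart rate per patient.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("VitalSignsStream").getOrCreate()

# Read raw "patient_id,heart_rate" lines from a socket source.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

vitals = lines.select(
    split(col("value"), ",").getItem(0).alias("patient_id"),
    split(col("value"), ",").getItem(1).cast("double").alias("heart_rate"),
)

# Continuously updated average heart rate per patient, printed to the console.
query = (vitals.groupBy("patient_id").avg("heart_rate")
         .writeStream.outputMode("complete").format("console").start())
query.awaitTermination()
```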
4.7
Big Data security, privacy, and governance
Medical and healthcare data are generated from multiple diverse sources, which makes patients' data security a big concern, along with their privacy, due to the major risk of data leakage arising from massive use of third-party infrastructures and services. Cloud storage of these data can be vulnerable to malicious outsiders who can access the cloud platform and, for instance, mount man-in-the-middle attacks. The primary goal is to provide confidentiality, availability, and integrity of patients' data while achieving security in healthcare systems [45,46]. Confidentiality of the data can be attained by protecting data from accidental and unauthorized access. When big data are stored in databases, encryption methods for data protection can be categorized into table encryption, disk encryption, and data encryption [2]. Other major legal and ethical issues are related to the ownership of data and of the developed applications that employ patients' data; in particular, whether patients' data, which are used for the development and validation of applications and analytics models, can be reused, shared, and/or even sold [13]. Concerns are especially raised when an application that employs patients' data for development and validation is to be sold for profit [13]. Patients, whose data are crucial for application development and validation, should be assured in advance that the data cannot be reused for other purposes and will not be misused [13]. Data recycling, data repurposing, data recontextualization, data sharing, and data portability are the most common forms of data reuse, along with "the right to be forgotten" [47].
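To make the "data encryption" category above concrete, the following minimal sketch uses symmetric (Fernet) encryption from the Python cryptography library for field-level protection of a sensitive value; key management (secure storage, rotation, and access control) is deliberately out of scope here.

```python
from cryptography.fernet import Fernet

# The key itself must be stored securely (e.g., in a key management service).
key = Fernet.generate_key()
cipher = Fernet(key)

diagnosis = "E10.1 insulin-dependent diabetes mellitus with ketoacidosis"
token = cipher.encrypt(diagnosis.encode("utf-8"))  # safe to store or transmit
print(cipher.decrypt(token).decode("utf-8"))       # authorized read restores it
```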
Data recycling covers using the same data, in the same manner, more than once. When a patient chooses another medical practitioner or health insurance company, the previous ones should not be allowed to keep using that patient's data. When patients' data are used for a purpose different from the original one, this is categorized as data repurposing. Interpreting big data in a context different from the one in which they were originally collected (for example, the same data may be interpreted differently, or carry a different meaning, for physicians and for health insurance companies) is classified as data recontextualization. Data sharing of medical data is sharing or disclosing them in a specific context for particular purposes to other people or institutions, while data portability is the capability for patients to reuse their data across different devices and services. The "right to be forgotten" is the right of the data owner to revoke or block secondary use of his/her data. It is essential for a healthcare organization to secure personal data and to address the risks and legal responsibilities associated with personal data processing according to the valid national and international laws, policies, and regulations for data privacy [48]. Storing healthcare big data in a public cloud, which is a cost-saving alternative, requires solving security risks and patient privacy control, since data access is controlled by third parties. In contrast, storing data in a private cloud, which keeps the sensitive data in-house, is a more secure but more expensive option. This means that healthcare managers have to make a trade-off between the project budget and the security and privacy of sensitive patient data. Data governance is also a very important part of a healthcare system framework. Typically, it is composed of master data management, data life-cycle management, and data privacy and security management [42]. Master data management refers to the governance, policies, standards, and tools for data management, while life-cycle management manages the business data life cycle, from acquiring and warehousing data to testing and provisioning diverse application systems. Data security and privacy management delivers activities related to discovery, monitoring, configuration appraisal, auditing, and protection of healthcare big data.
4.8
Discussion and further work
The issues discussed in this chapter cannot cover all the needs of healthcare and medical data for global users, but they
intend to give a holistic view of the problems of healthcare big data analysis and visualization. The human exposome, as big data, will enlarge the healthcare and medical data repositories, and IT will have to deal with the analysis of these data; we are convinced that this will be a major challenge in the following years. The purpose is evident: to produce data for personal health risk assessment and for better living conditions for every human being. Taking into consideration previous work in many countries and the current organization of healthcare information systems, as well as the need for integration of cross-border healthcare systems, which is addressed in a couple of ongoing European Union projects [49,50] such as cloud-oriented cross-border solutions, healthcare big data analysis and visualization have to be considered a global challenge. It is necessary to create a wide-ranging taxonomy for healthcare and medical big data analytics and to create standards for the use of big data by different stakeholders, with maximal security and privacy of patients' data as one of the priorities. Then, we have to take into account all big data methods and tools to create the most suitable tools for analysis and visualization, tailored to the various stakeholders' needs. As further work, due to the huge amounts of generated data, we suggest that more effort be directed toward big data governance, which manages the rules and control over data, their integrity, and standardization. More effort is also needed to improve the quality of patients' EHR, sensor, and omics data, which is still a demanding task in big data analytics. These data have to be integrated into a unique clinical system, which would reduce the waste of resources and therefore provide patients with more efficient and cheaper healthcare services. The newest concept of the Personal Health Record (PHR), where the patient is the data owner and plays a key role in data collection, enables a new vision for personalized medicine and healthcare due to the data ownership. Combining the PHR with the new concept of the exposome and with IoT data provides a wider horizon for using big data analytics methods for patients' healthcare risk assessment, as well as for predicting the risk of some diseases and proposing appropriate preventive actions. It is a challenging task for the next decade to engage a wide community of medical practitioners, computer scientists, and other specialists. As a suitable software framework for application development, we suggest Hadoop MapReduce, which can process large data sets on a Hadoop cluster by providing
parallelization, distribution, and scheduling services. MapReduce allows analysis of structured, semistructured, and unstructured data in the MPP environment. Because many of the generated data are streaming data, stream computing principles should be considered; stream computing supports high-performance big data processing in real time or near real time using suitable tools such as Spark, Hive, Storm, and GraphLab [44]. Because the gathering of health and medical big data grows exponentially, the development of decision-centric information systems is needed to improve clinical decision making. Besides patient- and decision-centric healthcare information systems, nowadays, when huge amounts of infectious disease data are being generated, there is an urgent need for the development of population-oriented information systems. These information systems have to integrate GIS, particularly at a worldwide level, to detect patterns of disease spreading and to stop the propagation of a particular disease, and hence to help public healthcare institutions handle novel diseases.
References
[1] K. Beaver, Healthcare Information Systems, Auerbach Publications, 2002.
[2] A. Siddiqa, I.A.T. Hashem, I. Yaqoob, M. Marjani, S. Shamshirband, A. Gani, et al., A survey of big data management: taxonomy and state-of-the-art, J. Netw. Comput. Appl. 71 (2016) 151–166.
[3] V. Palanisamy, R. Thirunavukarasu, Implications of big data analytics in developing healthcare frameworks – a review, J. King Saud Univ. Comput. Inf. Sci. (2017).
[4] S. Mukherjee, Ovum Decision Matrix: Selecting a Business Intelligence Solution, 2014–15, Ovum, July 2014, Product code: IT0014-002923, 2014.
[5] T.A. Keahey, Using Visualization to Understand Big Data, IBM Business Analytics Advanced Visualisation, 2013.
[6] H. Chen, R.H. Chiang, V.C. Storey, Business intelligence and analytics: from big data to big impact, MIS Q. 36 (4) (2012).
[7] G. Manogaran, C. Thota, D. Lopez, V. Vijayakumar, K.M. Abbas, R. Sundarsekar, Big data knowledge system in healthcare, Internet of Things and Big Data Technologies for Next Generation Healthcare, Springer, Cham, 2017, pp. 133–157.
[8] B. Ristevski, M. Chen, Big data analytics in medicine and healthcare, J. Integr. Bioinform. 15 (3) (2018).
[9] G.M. Weber, K.D. Mandl, I.S. Kohane, Finding the missing link for big biomedical data, JAMA 311 (24) (2014) 2479–2480.
[10] J.J. Baker, Activity-Based Costing and Activity-Based Management for Health Care, Jones & Bartlett Learning, 1998.
[11] M. Gietzelt, M. Löpprich, C. Karmen, M. Ganzinger, Models and data sources used in systems medicine, Methods Inf. Med. 55 (02) (2016) 107–113.
[12] C. Lee, Z. Luo, K.Y. Ngiam, M. Zhang, K. Zheng, G. Chen, et al., Big healthcare data analytics: challenges and applications, Handbook of Large-Scale Distributed Computing in Smart Healthcare, Springer, Cham, 2017, pp. 11–41.
[13] P. Balthazar, P. Harri, A. Prater, N.M. Safdar, Protecting your patients' interests in the era of big data, artificial intelligence, and predictive analytics, J. Am. Coll. Radiol. 15 (3) (2018) 580–586.
[14] L. Hong, M. Luo, R. Wang, P. Lu, W. Lu, L. Lu, Big data in health care: applications and challenges, Data Inf. Manag. 2 (3) (2018) 175–197.
[15] K. Wan, V. Alagar, Characteristics and classification of big data in health care sector, in: 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pp. 1439–1446, IEEE, 2016.
[16] B. Ristevski, S. Savoska, P. Mitrevski, Complex network analysis and big omics data, in: Web Proceedings of the 11th ICT Innovations Conference 2019, Ohrid, Macedonia, 2019.
[17] M. Househ, A.W. Kushniruk, E.M. Borycki, Big Data, Big Challenges: A Healthcare Perspective: Background, Issues, Solutions and Research Directions, Springer, 2019.
[18] S. Savoska, B. Ristevski, N. Blazheska-Tabakovska, I. Jolevski, Towards integration of exposome data and personal health records in the age of IoT, in: Web Proceedings of the 11th ICT Innovations Conference 2019, Ohrid, Macedonia, 2019.
[19] Institute of Medicine (US), Committee on the Recommended Social and Behavioral Domains and Measures for Electronic Health Records, Capturing Social and Behavioral Domains and Measures in Electronic Health Records: Phase 2, National Academies Press, 2014.
[20] C.H. Lee, H.J. Yoon, Medical big data: promise and challenges, Kidney Res. Clin. Pract. 36 (1) (2017) 3.
[21] E.S. Boja, C.R. Kinsinger, H. Rodriguez, P. Srinivas, Integration of omics sciences to advance biology and medicine, 2014.
[22] Y. Wang, N. Hajli, Exploring the path to big data analytics success in healthcare, J. Bus. Res. 70 (2017) 287–299.
[23] Y. Wang, L. Kung, W.Y.C. Wang, C.G. Cegielski, An integrated big data analytics-enabled transformation model: application to health care, Inf. Manag. 55 (1) (2018) 64–79.
[24] B.M. Bai, B.M. Nalini, J. Majumdar, Analysis and detection of diabetes using data mining techniques—a big data application in health care, Emerging Research in Computing, Information, Communication and Applications, Springer, Singapore, 2019, pp. 443–455.
[25] B. Ristevski, S. Loskovska, A survey of clustering algorithms of microarray gene expression data analysis, in: Proceedings of the 10th International Multiconference Information Society IS 2007, pp. 52–55, Ljubljana, Slovenia, 2007.
[26] J.F. Rodrigues Jr, F.V. Paulovich, M.C. de Oliveira, O.N. de Oliveira Jr, On the convergence of nanotechnology and Big Data analysis for computer-aided diagnosis, Nanomedicine 11 (8) (2016) 959–982.
[27] J. Tan (Ed.), Healthcare Information Systems and Informatics: Research and Practices, IGI Global, 2008.
[28] D. Schaefer, A. Chandramouly, B.D.D. Owner, I.B. Carmack, K. Kesavamurthy, Delivering Self-Service BI, Data Visualization, and Big Data Analytics, Intel IT: Business Intelligence, 2013.
[29] Big Digital Data, Analytic Visualization and the Opportunity of Digital Intelligence, white paper, SAS Institute Inc., 2014.
[30] G. Fanjiang, J.H. Grossman, W.D. Compton, P.P. Reid (Eds.), Building a Better Delivery System: A New Engineering/Health Care Partnership, National Academies Press, 2005.
[31] https://www.icd10data.com/ICD10CM/Codes
[32] K.A. Spackman, K.E. Campbell, R.A. Côté, SNOMED RT: a reference terminology for health care, in: Proceedings of the AMIA Annual Fall Symposium, p. 640, American Medical Informatics Association, 1997.
[33] https://www.uicc.org/resources/tnm
[34] https://www.nlm.nih.gov/mesh/meshhome.html
[35] K.A. Wager, F.W. Lee, J.P. Glaser, Health Care Information Systems: A Practical Approach for Health Care Management, John Wiley & Sons, 2017.
[36] M. Mustra, K. Delac, M. Grgic, Overview of the DICOM standard, in: 2008 50th International Symposium ELMAR, Vol. 1, pp. 39–44, IEEE, 2008.
[37] https://ncpdp.org/Standards-Development/Standards-Information
[38] www.x12.org/x12org/subcommittees/X12N/N0200_X12N_TG2Charter.pdf
[39] http://www.hipaasurvivalguide.com/hipaa-standards.php
[40] https://www.edibasics.com/edi-resources/document-standards/hipaa/
[41] Final Report NHII Information for Health: A Strategy for Building the National Health Information Infrastructure, 2001. https://ncvhs.hhs.gov/reports/reports
[42] Y. Wang, L. Kung, T.A. Byrd, Big data analytics: understanding its capabilities and potential benefits for healthcare organizations, Technol. Forecast. Soc. Change 126 (2018) 3–13.
[43] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, S. Sakr, Big data 2.0 processing systems: taxonomy and open challenges, J. Grid Comput. 14 (3) (2016) 379–405.
[44] N. Mehta, A. Pandit, Concurrence of big data analytics and healthcare: a systematic review, Int. J. Med. Inform. 114 (2018) 57–65.
[45] A. Sajid, H. Abbas, Data privacy in cloud-assisted healthcare systems: state of the art and future challenges, J. Med. Syst. 40 (6) (2016) 155.
[46] N.M. Shrestha, A. Alsadoon, P.W.C. Prasad, L. Hourany, A. Elchouemi, Enhanced e-health framework for security and privacy in healthcare system, in: 2016 Sixth International Conference on Digital Information Processing and Communications (ICDIPC), pp. 75–79, IEEE, 2016.
[47] B. Custers, H. Uršič, Big data and data reuse: a taxonomy of data reuse for balancing big data benefits and personal data protection, Int. Data Priv. Law 6 (1) (2016) 4–15.
[48] K. Abouelmehdi, A. Beni-Hessane, H. Khaloufi, Big healthcare data: preserving security and privacy, J. Big Data 5 (1) (2018) 1.
[49] Z. Savoski, S. Savoska, E-health, need, reality or myth for R. of Macedonia, in: Proceedings of the 8th International Conference on Applied Internet and Information Technologies, Vol. 8, No. 1, pp. 56–59, 2018.
[50] S. Savoska, I. Jolevski, Architectural model of e-health system for support of the integrated cross-border services, in: Proceedings of the 12th Information Systems and Grid Technologies Conference ISGT 2018, Sofia, Bulgaria, pp. 42–49, 2018.
5 Big Data analytics in medical imaging
Siddhant Bagga, Sarthak Gupta and Deepak Kumar Sharma Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
Abstract Big Data refers to huge amounts of data that can be used for computation and predictive analysis based on trends in the data. In recent times, there has been a notable increase in the use of big data analytics in healthcare, with medical imaging being an important aspect of it. Big Data Analytics platforms are being leveraged to a great extent to handle the diverse medical image data obtained from X-rays, CT scans, MRI, etc., and to attain better insights for diagnosis. Disease surveillance can be effectively improved using Big Data Analytics. Unstructured medical image datasets can be evaluated with great efficiency to create a better discernment of a disease and the requisite prevention and curing methodologies, leading to much better critical decision making. The medical image dataset of CLEF (Cross Language Evaluation Forum) contained approximately 66,000 images in 2007, which increased to 300,000 by 2013, with images varying greatly in dimensions, resolution, and modalities. To handle this huge amount of image data, dedicated analytics platforms are required for analyzing these big datasets in a distributed environment. Medical images reveal information about organs and the internal functioning of the body, which is required to identify tumors, diabetic retinopathy, artery stenosis, etc. Data storage, automatic extraction, and advanced analytics of this medical image data using Big Data Analytics platforms have resulted in much faster diagnosis and prediction of treatment plans in advance. Parallel programming and cloud computation have also played a significant role in overcoming the challenges of huge amounts of data computation.
Medical image processing is based on extracting features from the images and detecting patterns in the extracted data. Various tools and frameworks, such as Hadoop, MapReduce, YARN, Spark, and Hive, are used for this purpose. Machine learning and deep learning techniques are extensively used for carrying out the required analytics, and genetic algorithms and association rule learning techniques are also widely employed. Keywords: Big Data Analytics; medical imaging; Hadoop; MapReduce; YARN; Spark; Big data analytics in medical imaging
5.1 Introduction
The objective of this chapter is to explore the pivotal role of Big Data Analytics in the field of healthcare and medical imaging and the kind of research being carried out in order to put big data analytics platforms to optimum use. From handling the storage of enormous numbers of diverse medical images to carrying out extensive analytics for attaining diagnostic insights, big data analytics is being utilized very significantly in the healthcare sector. In this chapter, we discuss the latest research being carried out for the storage and analytics of huge amounts of medical image data. We also explore the latest techniques being developed in the domain of artificial intelligence (AI) for carrying out predictive analytics. In the end, we discuss the most prominent tools employed in the sphere of Big Data Analytics. This section explains medical imaging, the challenges in medical imaging, and how big data and multimedia big data analytics [1,2] are being employed in this field.
5.1.1 Medical imaging
Medical imaging is the process of visualizing the components of the bodies of living individuals. These parts include organs, bones, skeletons, tissues, etc. The objective of the visualization and the further analysis is to diagnose diseases in patients and ensure an effective medical treatment. The following techniques [3] are employed for the medical imaging process:
1. Radiology: This involves the use of methodologies such as CT scanning, X-rays, MRI, ultrasound, etc. Internal body parts are observed using these techniques.
2. Nuclear medicine: Methodologies like PET (Positron Emission Tomography) are used for the observation of processes inside the body involving metabolic activities.
3. Optical imaging: Techniques like OCT (Optical Coherence Tomography) are used for visualizing hollow body parts down to the cellular scale.
5.1.2 Challenges in medical imaging
Following are the main challenges in the field of medical imaging [4]:
1. Development of efficient image analytics techniques that can be applied to multiple medical applications.
2. There is a need for validation of the results through GT (ground truth) annotations. The objective is to achieve high accuracy, so a significant number of images are required in the medical image datasets for the requisite validation.
3. Since the field of medicine requires the use of a considerable number of varying and heterogeneous images, there is a need to develop algorithms that can work on such diverse images.
4. Development of accurate anatomical models based on the organs and body parts of the patient is a very significant challenge.
5. Since images occupy a large amount of space in memory, efficient compression techniques need to be applied prior to the storage of such enormous amounts of data.
6. Effective security techniques need to be applied to ensure data confidentiality and data integrity.
7. Scalable frameworks need to be developed to cater to the increasing amount of data.
8. Preprocessing of the data needs to be done prior to the analysis. The images are affected by noise, missing information, etc.; therefore, reduction of noise in the images, handling of the missing data, and optimum enhancement of the contrast need to be done (a minimal preprocessing sketch is given at the end of this section).
In Section 5.2, certain big data analytics methods that have been used in medical imaging are discussed. Section 5.3 discusses how medical imaging has improved thanks to various artificial intelligence techniques. Finally, Section 5.4 showcases various tools that are used for big data analytics, like MapReduce, Hadoop, YARN, and Spark.
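As an illustration of the preprocessing in challenge 8, the following minimal Python sketch denoises and contrast-enhances a grayscale scan with OpenCV. The file names and filter settings are illustrative assumptions, not prescriptions from the literature surveyed here.

    import cv2

    # Load a grayscale scan (the file name is hypothetical).
    img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

    # Noise reduction: a median filter is robust to impulse noise.
    denoised = cv2.medianBlur(img, 5)

    # Contrast enhancement: CLAHE (Contrast Limited Adaptive Histogram
    # Equalization) boosts local contrast without over-amplifying noise.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(denoised)

    cv2.imwrite("scan_preprocessed.png", enhanced)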
5.2 Big Data analytics in medical imaging
5.2.1 Analytical methods
The most significant framework widely used in the field of medical imaging is Hadoop, which makes use of the programming utility called MapReduce. MapReduce is used in the following ways:
1. It is used for computing the optimized parameters for carrying out the classification of the lungs based on texture. An SVM (Support Vector Machine) is used for this purpose (a minimal sketch of such a parameter search is given after Fig. 5.1).
2. It is used for the indexing of images on the basis of their content.
3. Analysis of the images based on texture is carried out using wavelet analysis.
In Ref. [5], a hybrid ML (machine learning) algorithm has been designed for the classification of schizophrenia patients using data in the form of fMRI images along with SNP (Single Nucleotide Polymorphism) data. In Ref. [6], a fully automated methodology is proposed for carrying out organ segmentation using the VISCERAL (Visual Concept Extraction Challenge in Radiology) dataset. The parameters involved in the requisite analysis include the local contrast of the image along with its intensity attribute. The method also uses probabilistic data for finding the labels of the required structures, and atlas-based segmentation is used for the image segmentation. In Ref. [7], a clinical decision support system has been developed using similarity search and CBR (case-based reasoning). Discriminative distance learning is used in this system, which is a better alternative to conventional distance functions when similarity search analysis is being done: prediction accuracy is much higher with discriminative distance learning, along with a significant reduction in computational complexity, which makes the system much more scalable. A Random Forest (RF) algorithm has been used for carrying out the predictions, along with advanced visualization techniques including neighborhood maps, tree-maps, and heat-maps. In Ref. [8], a nonintrusive technique has been developed for the prediction of intracranial pressure. The image attributes used include the midline-shift values and the texture features retrieved from CT scans, along with other parameters such as age, weight, etc. Furthermore, not all of the acquired feature information is used; rather, a feature selection algorithm is applied for selecting
the most significant features. An SVM is used for training on the data, and Rapidminer is used for cross validation. In Ref. [9], molecular imaging is carried out for the detection of cancer. The proposed methodology involves the integration of molecular data with physiological and anatomical data. In Ref. [10], an HDOC (Hybrid Digital Optical Correlator) is developed with the purpose of improving the speed of image correlation. When coordinate matching is not available, HDOC can be used for the comparison of images. The storage of the images is done in volume holographic memory. The HDOC system has 7500 channels, and the objective is to correlate the target image with all the possible channels, computing the measurement in terms of rotation as well as translation. Fig. 5.1 describes the generalized workflow of big data analytics in the field of healthcare [5]. In Ref. [11], the proposed work focuses on the use of 3-D invariant attributes of images, indexing them effectively in order to compute nearest-neighbor (NN) feature matches in O(log N) computational complexity. In Ref. [12], the proposed work focuses on the querying of unstructured information. The process is carried out in two steps: initially, the structured data is used for filtering the clinical data; next, the modules that carry out feature extraction are run on the unstructured information in a distributed way through Hadoop. CNNs (Convolutional Neural Networks) have been proposed for the classification of skin cancer.
Figure 5.1 Generalized workflow of Big Data Analytics in Healthcare.
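The lung-texture classification use case above pairs an SVM with a search for optimized parameters. The following minimal scikit-learn sketch shows such a grid search on synthetic stand-in features; the dataset, parameter grid, and splits are illustrative assumptions rather than the setup of the cited work. In the MapReduce setting, each parameter combination could be evaluated by a separate map task and the best score aggregated in the reduce step.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    # Synthetic stand-in for texture feature vectors from lung images.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Exhaustive search over SVM hyperparameters with 5-fold cross validation.
    param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)

    print(search.best_params_, search.score(X_test, y_test))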
The input attributes only include the pixels along with the labels of the diseases. A lot of research has been going on into the use of machine learning and deep learning to detect diabetic retinopathy [13]. In Ref. [14], the authors have proposed a fuzzy C-means clustering algorithm which, when applied to fundus images, detects the vessels while efficiently handling junctions and forks in the angiograms. In Ref. [15], the proposed work focuses on detection of diabetic retinopathy using a back propagation neural network. The neural network was trained to classify images into one of four classes: normal retinas without blood vessels, retinas with normal blood vessels, retinas with exudates, or retinas with hemorrhages. In Ref. [16], CNNs have been used to improve existing Computer Aided Detection (CAD) systems for the detection of colonic polyps, sclerotic spine metastases, and enlarged lymph nodes. In Ref. [17], the proposed work focuses on detection of cerebral microbleeds (CMBs) using 3-D CNNs (see Fig. 5.2) on MRI scans. CMBs are small hemorrhages near the blood vessels and have been identified as important markers for various diseases. Fig. 5.2 shows the architecture of the 3-D CNN [17].
Figure 5.2 Architecture of 3-D CNN [17].
In Ref. [17], an artificial neural network-based predictor is proposed that can predict the occurrence of hypotensive episodes for intensive care unit patients. To train the network, time-series blood pressure and heart rate data from the MIMIC-II database [18] have been used. Electroencephalogram (EEG) signals are the representation of electric activity in the brain in response to certain external stimuli. The study of EEG signals is extremely vital since they may be used for the diagnosis of many neurological diseases. Various methods have been suggested for the classification of EEG signals, such as recurrent neural networks [19], adaptive autoregressive models [20], neuro-fuzzy systems [21], SVMs [22], and wavelet transforms [23-25] (a minimal wavelet feature-extraction sketch follows below). Methods have also been proposed to study and visualize the spread of diseases, such as [26], in which the movement and interaction of individuals have been analyzed to study the spread of pandemic diseases. Elshazly et al. [27] propose a hybrid framework for the diagnosis of lymphatic diseases, which uses a combination of a Genetic Algorithm (GA) and RF.
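To make the wavelet-transform approach concrete, the following minimal Python sketch (using the PyWavelets library) turns an EEG epoch into a small feature vector of sub-band energies, which could then feed a classifier such as an SVM. The simulated signal, wavelet choice, and decomposition depth are illustrative assumptions.

    import numpy as np
    import pywt

    # Simulated 1-second EEG epoch sampled at 256 Hz (stand-in for real data).
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(256)

    # Multilevel discrete wavelet decomposition into sub-bands.
    coeffs = pywt.wavedec(signal, "db4", level=4)

    # A simple feature vector: the energy of each sub-band.
    features = np.array([np.sum(c ** 2) for c in coeffs])
    print(features)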
5.2.2 Collection, sharing, and compression
1. Integrating Data for Analysis, Anonymization, and Sharing (iDASH) [28] is widely used in biomedical imaging. The focus is on techniques and methodologies for
Figure 5.3 Architecture of Integrating Data for Analysis Anonymization and Sharing.
sharing the data in a confidential manner while preserving data integrity. Fig. 5.3 explains the architecture of iDASH.
2. In Ref. [29], a new platform has been developed for the storage and exchange of healthcare data. This system is based on Hadoop: MIFAS (Medical Image File Accessing System) is developed, based on the co-allocation method in the cloud. Integration of Hadoop and the co-allocation methodology for the cloud is done for the
Figure 5.4 Architecture of Medical Image File Accessing System.
Figure 5.5 Workflow of Digital Imaging and Communications in Medicine system.
purpose of sharing medical images between various healthcare organizations (Fig. 5.4).
3. The patient's information is collected using sensors [30], and the data is eventually transmitted to the cloud for storage and further distribution or processing.
4. A DICOM (Digital Imaging and Communications in Medicine) server [31] is responsible for carrying out the storage and query requests. There is an image indexer that parses the image metadata and stores the information in an Azure SQL Database. Fig. 5.5 describes the workflow of a DICOM system (a minimal metadata-reading sketch is given below).
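As a small illustration of the kind of metadata such an image indexer parses, the following Python sketch reads a DICOM file with the pydicom library. The file name is hypothetical, and a real indexing pipeline would persist these fields to a database rather than print them.

    import pydicom

    # Read a DICOM file (the file name is hypothetical).
    ds = pydicom.dcmread("image.dcm")

    # Standard DICOM attributes an indexer might extract,
    # assuming the file contains them.
    print(ds.PatientID, ds.Modality, ds.StudyDate)
    print(ds.Rows, ds.Columns)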
5.3 Artificial intelligence for analytics of medical images
Artificial intelligence has a major impact in the field of healthcare [32,33] and is being widely used for carrying out analytics on medical images. As a matter of fact, an enormous amount of research is going on to employ AI in the analytics of medical images. The following points describe research work that has been carried out for employing AI in medical imaging analysis:
1. A Hopfield Neural Network along with a penalized fuzzy C-means method [34] is applied for carrying out medical image segmentation. First- and second-order moments of the pixels are constructed from the nearest neighbors, and the resulting training vectors are mapped to a 2-D Hopfield neural network.
2. An automated hybrid model [35] is presented that integrates EM (Expectation Maximization) and a PCNN (Pulse Coupled Neural Network) for the segmentation of brain MRI images. An adaptive methodology is used for tuning the parameters of the neural network. The workflow is shown below in Fig. 5.6.
3. A two-layered Hopfield NN called CHEFNN (competitive edge-finding neural network) is designed [36] for the detection of edges in CT scan and MRI images. A one-layered 2-D NN is extended to a two-layered 3-D neural network for detecting the edges. The context of a pixel can be analyzed and used for labeling pixels; hence, noise reduction can be carried out effectively.
4. An NED (Neural Edge Detector) [37] is used for the extraction of left ventricular contours from ventriculograms.
Figure 5.6 Workflow of the system.
A multilayered modified BPN (Backpropagation Network) is used in a supervised way.
5. A computer-aided system for detecting microcalcification clusters in an automated manner is designed in Ref. [38], using a CNN.
6. An RBF (Radial Basis Function) Neural Network [39] is used for detecting circumscribed masses in mammograms. There is a nonlinear mapping of the distance measure between the neuron weights and the input vector; the Cauchy probability density function is used for implementing the nonlinear operator. In order to compute the weights of the hidden layer, the K-means technique is used in an unsupervised way. Wiener filter theory is used for minimizing the error and for computing the weights. Fig. 5.7 describes the architecture of an RBF neural network.
Figure 5.7 RBF neural network.
7. For carrying out predictive analytics of various symptoms of breast cancer, multivariate models [40] are used that involve PCA (principal component analysis) along with a PNN (Probabilistic Neural Network).
8. For carrying out image-guided radiation therapy [41], artificial neural networks are used. The aim is to track the position of the tumor and thereafter compensate for the observed movement.
9. For the suppression of the ribs in chest radiographs, an MTANN (Massive Training Artificial Neural Network) [42] is used. A dual-energy subtraction technique is used for obtaining the bone images for training the neural network. The output of the network is 'bone-like' images of the ribs. Fig. 5.8 shows the working of the untrained and trained MTANN (a minimal CNN sketch follows this list).
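Several of the detection systems above (e.g., item 5) are built on convolutional networks. The following minimal Keras sketch defines a small binary-classification CNN for grayscale image patches; the input shape, layer sizes, and "lesion vs. no lesion" target are illustrative assumptions, not the architectures of the cited papers.

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(64, 64, 1)),          # grayscale image patch
        layers.Conv2D(16, 3, activation="relu"),  # learn local texture features
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),    # e.g., lesion vs. no lesion
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.summary()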
5.4 Tools and frameworks
The term Big Data describes datasets that are too large to be processed by traditional software programs. Making sense of such large quantities of data can help organizations make better decisions. Earlier, organizations primarily used Relational Database Management Systems for data storage and analytics. However, in the past few decades,
Figure 5.8 (A) Multiresolution of MTANN; (B) Multiresolution of trained MTANN.
data growth has been exponential, especially the amount of unstructured data. This calls for more robust and reliable techniques that are equipped to handle huge volumes of data at a time. There are numerous Big Data Analysis tools available, a few of which are discussed in this section.
5.4.1 MapReduce
In the recent past, numerous algorithms have been developed for the analysis of raw data. Most of these algorithms are conceptually straightforward. However, these algorithms take a
huge amount of data as input, and various techniques from distributed computing systems need to be applied in order to ensure that the system functions as expected. Taking care of so many details makes the implementations messy and time consuming. To address these problems, Google came up with a framework called MapReduce [43,44], which aims to provide an interface that inherently takes care of parallelization, load balancing, data distribution, and fault tolerance. The main idea behind MapReduce is to abstract away the details of parallelization and allow users to focus on the data analysis strategies and algorithms. MapReduce, as the name suggests, consists of two main submodules—Map and Reduce. Map part: This part of the framework takes as input key-value pairs (K, V), does some processing on these pairs, and outputs another set of key-value pairs (K', V'). Reduce part: This part of the framework takes the generated intermediate key-value pairs (K', V') and aggregates them based on the equality of the keys (K', list(V')). Then, some processing is done on these pairs, and aggregated results are generated. At Google, they use their own distributed file system (Google File System) [43], which handles fault tolerance by data replication and partitioning. An important point to be noted is that MapReduce only provides a very general framework, and users are free to define Map and Reduce functions based on their requirement specifications. Fig. 5.9 shows the overall execution of MapReduce. An overview of the steps involved in the execution of the MapReduce algorithm (see Fig. 5.9) is presented below [43]:
1. MapReduce works in a distributed way. The Map function splits the input files into M pieces, while the Reduce function uses a partitioning function to split the intermediate key space into R pieces. Each split can be processed in parallel, in a distributed manner.
2. Then, many copies of the program are started. One of the copies is the "master" program, while the rest are "workers." The master assigns the M Map tasks and R Reduce tasks to idle workers.
3. The map task involves reading the key-value pairs from one of the M input splits and passing these through a user-defined Map function. The generated intermediate key-value pairs are stored on the local disk, which is divided into R splits according to the partitioning function.
4. The Map worker processes notify the master about the location of the intermediate key-value pairs, which it then passes to the Reduce worker processes.
Figure 5.9 MapReduce execution overview.
5. The Reduce workers read the data from the local disk of the Map workers using Remote Procedure Calls (RPC), and then the data is sorted on the basis of keys.
6. From the sorted key-value pairs, a key and its corresponding list of values is passed through a user-defined Reduce function. The output thus generated is included in the final output.
7. Finally, the MapReduce function returns control to the user's program.
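The following minimal Python sketch simulates the Map/Reduce contract described above on a single machine, with a word count as the user-defined logic. It is an illustrative toy, not the Hadoop or Google API; in a real system the shuffle phase would also route each key to one of R reducers, e.g., by hash(key) % R.

    from collections import defaultdict

    def map_fn(doc_id, text):
        # Emit an intermediate (word, 1) pair for every word.
        for word in text.split():
            yield word.lower(), 1

    def reduce_fn(key, values):
        # Aggregate all values emitted for one key.
        yield key, sum(values)

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # "Shuffle" phase: group intermediate pairs by key.
        intermediate = defaultdict(list)
        for k, v in inputs:
            for k2, v2 in map_fn(k, v):
                intermediate[k2].append(v2)
        results = []
        for k2, values in sorted(intermediate.items()):
            results.extend(reduce_fn(k2, values))
        return results

    docs = [("d1", "big data in healthcare"), ("d2", "big data analytics")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('analytics', 1), ('big', 2), ('data', 2), ('healthcare', 1), ('in', 1)]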
5.4.2 Hadoop
Apache Hadoop [44,45] is the most popular open-source implementation of the MapReduce framework. Hadoop consists of four modules:
1. Hadoop Common: This module consists of the commonly used Java libraries and functions. It provides some basic functionality to the other three modules.
2. Hadoop YARN: Yet Another Resource Negotiator (YARN) provides cluster management and job scheduling algorithms.
Figure 5.10 Hadoop distributed file system architecture.
3. Hadoop Distributed File System (HDFS) [46,47]: This is a fault-tolerant, high-throughput, distributed file system with a master-slave architecture. Fig. 5.10 shows the HDFS architecture. There are two types of nodes (see Fig. 5.10):
a. Namenodes: Each HDFS cluster has one Namenode. It maintains metadata like block locations and namespace hierarchy. These nodes perform operations like renaming, closing, and opening directories and files.
b. Datanodes: Files are divided into blocks, which are stored on these nodes. Each node of an HDFS cluster usually has one Datanode. They serve requests made by clients of the file system. Datanodes also receive commands from Namenodes that might instruct them to create, delete, or replicate blocks.
4. Hadoop MapReduce: This module contains the YARN-based implementation of the MapReduce algorithm. As already discussed, MapReduce provides parallel, distributed execution, which is extremely helpful for analyzing and processing large datasets. Fig. 5.11 shows the Hadoop architecture. The architecture of Hadoop (see Fig. 5.11) is similar to the original MapReduce architecture, and is explained below.
Figure 5.11 Hadoop architecture.
Before starting a Map task, the input is divided into blocks. To ensure fault tolerance, three copies of each block are created. The number of Map worker processes is equal to the number of blocks thus generated. Each block is assigned to a Map worker, which applies the Map() function, generates intermediate key-value pairs, and sorts the output locally. Before sending the data to reducers, partitioning is done by a hash function, for example hash(key) mod R, where R is the number of Reduce worker processes. The output of the reducers is triplicated again for fault tolerance and then stored in HDFS. Many researchers argue that techniques like MapReduce and Hadoop compare poorly to traditional parallel Database Management Systems [48,49]. In Ref. [49], it has been shown that parallel DBMSs are 2-50 times faster than Hadoop (except for data loading). In Ref. [50], the authors have shown that although Hadoop achieves good scalability, it does so at the cost of very low efficiency per node. Based on these studies, a trade-off between fault tolerance and efficiency can be noticed. A few of the advantages and limitations of Hadoop and MapReduce are listed below [44].
Advantages:
1. Simplicity: Programmers only have to concern themselves with the algorithm. Hadoop abstracts the distributed and parallel implementation details; the programmer only needs to worry about two simple functions: Map() and Reduce().
2. Flexibility: Handling unstructured data is easier with Hadoop, since there is no well-defined schema or model for the data.
3. Storage independence: MapReduce can be used with any data storage layer, without affecting the rest of the algorithm.
4. Fault tolerance: MapReduce and Hadoop are extremely fault-tolerant, due to their distributed nature and data replication.
5. Scalability: Hadoop is also very scalable, since new nodes/clusters can be added quite easily due to the distributed nature.
Limitations:
1. No high-level language: MapReduce does not provide any support for high-level languages like SQL, which allow easy data retrieval using queries. Programmers are forced to follow the Map/Reduce paradigm.
2. No schema: The benefits of data modeling cannot be utilized.
3. Rigid data flow: A lot of algorithms are difficult to implement using only the Map and Reduce primitives.
4. Low efficiency: Hadoop's focus on fault tolerance and scalability has resulted in low efficiency per node. To ensure a high degree of fault tolerance, frequent I/O operations are required, which significantly reduce efficiency. Moreover, Map and Reduce functions are blocking in nature; that is, data cannot be transferred to the reducers until all mappers have completed their local processing.
5.4.3 Yet Another Resource Negotiator
The initial design of Hadoop only focused on MapReduce. This posed two major shortcomings [51]:
1. The programming model was tightly coupled to the resource management system. With the increasing rate of adoption of Hadoop as a de-facto Big Data Analysis tool by organizations, there was a need for some level of independence between the programming model and the resource management infrastructure.
Figure 5.12 Yet another resource negotiator architecture.
2. In MapReduce, control flow for various jobs is centralized, thus creating scalability issues.
To overcome these issues, a new tool called YARN was introduced. YARN separates the programming model from the resource management system, opening doors for more varied and flexible usage of Big Data Analytics tools. It can be used as the underlying resource manager (RM) for many programming models other than Hadoop, like Spark [52], REEF [53], Dryad [54], etc. Fig. 5.12 shows the architecture of YARN. YARN implements a resource management platform layer and leaves the execution plan coordination to the programming models. The YARN architecture has three main components [55] (see Fig. 5.12):
1. RM: There is one Resource Manager for each cluster, which performs various tasks like monitoring the cluster's resource usage, node liveness, etc. Another task is arbitrating system resources among competing applications. The Resource Manager has two parts:
a. Scheduler: The Scheduler allocates resources to running applications.
b. Application Manager: This part of the RM manages the Application Masters (AMs). It is responsible for starting and monitoring applications, and restarting them on another node in case of failure.
2. Node Manager (NM): Each node has an NM, which manages containers, tracks node health, and manages user processes running on that node.
3. AM: There is one AM for each application. It negotiates system resources from the Resource Manager, after which it contacts the NM to start the application.
5.4.4 Spark
Apache Spark [56,57] is a distributed cluster-computing tool. Unlike Hadoop’s two-stage MapReduce computation engine, Spark uses an in-memory computation engine. Since it is in-memory as
Figure 5.13 Spark architecture.
compared to Hadoop’s disk-based storage, Spark usually exhibits better performance. Spark has a Master-Slave architecture (see Fig. 5.13). There is a driver that communicates with the Master or cluster manager. The cluster manager communicates with executors running on various worker/slave processes. Fig. 5.13 shows the architecture of Spark. The driver program negotiates for system resources from the cluster manager. The cluster manager, in turn, starts executor agents on worker nodes. Executors are distributed in nature and are responsible for running tasks. Data items are stored inmemory on the worker processes in the form of Resilient Distributed Datasets (RDDs). RDDs in Spark support functionality for two types of operations: Actions and Transformations. These operations are represented using a Directed Acyclic Graph (DAG) in which each node represents an RDD partition. Representing the sequence of operations as a DAG eliminates the two-stage model of Hadoop MapReduce, and thus provides performance improvements over Hadoop.
5.5 Conclusion
Big Data Analytics has a pivotal role in the field of medical imaging and in maintaining healthcare records [58,59]. As a matter of fact, a huge amount of research work is being carried out to improve upon existing methodologies in order to cater to the multifarious requirements in medical imaging, ranging from storage of data to performing predictive analytics. Various frameworks like Hadoop are being employed for the storage of enormous amounts of information in the form of datasets, which is eventually used for diagnosis and detection of diseases based on the appropriate analysis. Manual analysis of such images is extremely time consuming. With the amount of medical imaging scans being produced every day, it has become imperative to develop models that can leverage this Big Data and make reasonable deductions and predictions, thereby saving the time of doctors as well as patients. Moreover, the integration of medical images with other modalities and medical data opens up a new realm of research to produce reliable big data analytics models. The objective of medical image processing is to extract features from the medical images and recognize patterns based on which similar diseases can be detected in the future. Various machine learning and deep learning techniques are being used along with GAs, fuzzy logic, and association learning
techniques for efficient analytics. The medical images obtained from X-rays, CT scans, etc. have varying dimensions, resolutions, and modalities, and hence it is essential for Big Data frameworks to be effective for all kinds of heterogeneous images.
References
[1] M. Devgan, D.K. Sharma, Large-scale MMBD management and retrieval, in: S. Tanwar, S. Tyagi, N. Kumar (Eds.), Multimedia Big Data Computing for IoT Applications, Springer, Singapore, 2019, pp. 247-267.
[2] M. Devgan, D.K. Sharma, MMBD sharing on data analytics platform, in: S. Tanwar, S. Tyagi, N. Kumar (Eds.), Multimedia Big Data Computing for IoT Applications, Springer, Singapore, 2019, pp. 343-366.
[3] https://www.vumc.com/branch/imagingcenter/economic_impact/medical_imaging/.
[4] J. Weese, C. Lorenz, Four challenges in medical image analysis from an industrial perspective, Med. Image Anal. 33 (2016) 44-49.
[5] H. Yang, J. Liu, J. Sui, G. Pearlson, V.D. Calhoun, A hybrid machine learning method for fusing fMRI and genetic data: combining both improves classification of schizophrenia, Front. Hum. Neurosci. (2010).
[6] O. Jiménez del Toro, H. Müller, Multi atlas-based segmentation with data driven refinement, in: IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), 2014.
[7] A. Tsymbal, E. Meissner, M. Kelm, M. Kramer, Towards cloud-based image-integrated similarity search in big data, in: Proceedings of the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI '14), pp. 593-596, IEEE, Valencia, Spain, June 2014.
[8] W. Chen, C. Cockrell, K.R. Ward, K. Najarian, Intracranial pressure level prediction in traumatic brain injury by extracting features from multiple sources and using machine learning methods, in: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010.
[9] R. Weissleder, Molecular imaging in cancer, Science 312 (5777) (2006) 1168-1171.
[10] T. Zheng, L. Cao, Q. He, G. Jin, Full-range in-plane rotation measurement for image recognition with hybrid digital-optical correlator, Optical Eng. 53 (1) (2013) 011003.
[11] M. Toews, C. Wachinger, R. San José Estépar, W.M. Wells, A feature-based approach to big data analysis of medical imaging, in: S. Ourselin, D.C. Alexander, C.-F. Westin, J.M. Cardoso (Eds.), Proceedings of the Conference on Information Processing in Medical Imaging, 2015, 24, pp. 339-350.
[12] S. Istephan, M.-R. Siadat, Unstructured medical image query using big data - an epilepsy case study, J. Biomed. Inform. 59 (2016) 218-226. Available from: 10.1016/j.jbi.2015.12.005.
[13] U. Sinha, A. Singh, D.K. Sharma, Machine learning in the medical industry, in: A. Solanki, S. Kumar, A. Nayyar (Eds.), Handbook of Research on Emerging Trends and Applications of Machine Learning, IGI Global, 2020, pp. 403-424.
[14] Y.A. Tolias, S.M. Panas, A fuzzy vessel tracking algorithm for retinal images based on fuzzy clustering, IEEE Trans. Med. Imaging 17 (2) (1998) 263-273.
[15] G.G. Gardner, D. Keating, T.H. Williamson, A.T. Elliott, Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool, Br. J. Ophthalmol. 80 (11) (1996) 940-944.
[16] H.R. Roth, L. Lu, J. Liu, J. Yao, A. Seff, K. Cherry, et al., Improving computer-aided detection using convolutional neural networks and random view aggregation, IEEE Trans. Med. Imaging 35 (5) (2016) 1170-1181.
[17] J. Lee, R.G. Mark, A hypotensive episode predictor for intensive care based on heart rate and blood pressure time series, Comput. Cardiology 2010 (2010) 81-84.
[18] M. Saeed, M. Villarroel, A.T. Reisner, G. Clifford, L.W. Lehman, G. Moody, et al., Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database, Crit. Care Med. 39 (5) (2011) 952.
[19] N.F. Güler, E.D. Übeyli, I. Güler, Recurrent neural networks employing Lyapunov exponents for EEG signals classification, Expert Syst. Appl. 29 (3) (2005) 506-514.
[20] G. Pfurtscheller, C. Neuper, A. Schlogl, K. Lugger, Separability of EEG signals recorded during right and left motor imagery using adaptive autoregressive parameters, IEEE Trans. Rehabil. Eng. 6 (3) (1998) 316-325.
[21] I. Güler, E.D. Übeyli, Adaptive neuro-fuzzy inference system for classification of EEG signals using wavelet coefficients, J. Neurosci. Methods 148 (2) (2005) 113-121.
[22] I. Güler, E.D. Übeyli, Multiclass support vector machines for EEG-signals classification, IEEE Trans. Inf. Technol. Biomed. 11 (2) (2007) 117-126.
[23] M. Akin, Comparison of wavelet transform and FFT methods in the analysis of EEG signals, J. Med. Syst. 26 (3) (2002) 241-247.
[24] N. Hazarika, J.Z. Chen, A.C. Tsoi, A. Sergejew, Classification of EEG signals using the wavelet transform, in: 13th International Conference on Digital Signal Processing Proceedings, 1997, Vol. 1, pp. 89-92, IEEE, 1997.
[25] A. Subasi, EEG signal classification using wavelet feature extraction and a mixture of expert model, Expert Syst. Appl. 32 (4) (2007) 1084-1093.
[26] D. Guo, Visual analytics of spatial interaction patterns for pandemic decision support, Int. J. Geographical Inf. Sci. 21 (8) (2007) 859-877.
[27] H. Elshazly, A.T. Azar, A. El-Korany, A.E. Hassanien, Hybrid system for lymphatic diseases diagnosis, in: 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 343-347, IEEE, 2013.
[28] L. Ohno-Machado, V. Bafna, A.A. Boxwala, B.E. Chapman, W.W. Chapman, K. Chaudhuri, iDASH: integrating data for analysis, anonymization, and sharing, J. Am. Med. Inform. Assoc. (2012).
[29] C.-T. Yang, L.-T. Chen, W.-L. Chou, K.-C. Wang, Implementation of a medical image file accessing system on cloud computing, in: 2010 13th IEEE International Conference on Computational Science and Engineering, 2010.
[30] C.O. Rolim, F.L. Koch, C.B. Westphall, J. Werner, A. Fracalossi, G.S. Salvador, A cloud computing solution for patient's data collection in health care institutions, in: 2010 Second International Conference on eHealth, Telemedicine, and Social Medicine, 2010. doi:10.1109/etelemed.2010.19.
[31] C.-C. Teng, J. Mitchell, C. Walker, A. Swan, C. Davila, D. Howard, et al., A medical image archive solution in the cloud, in: 2010 IEEE International Conference on Software Engineering and Service Sciences, 2010.
[32] K.K. Bhardwaj, S. Banyal, D.K. Sharma, Artificial intelligence based diagnostics, therapeutics and applications in biomedical engineering and bioinformatics, Internet of Things in Biomedical Engineering, Academic Press, Elsevier, 2019, pp. 161-187.
[33] S. Bagga, S. Gupta, D.K. Sharma, Computer-assisted anthropology, Internet of Things in Biomedical Engineering, Academic Press, Elsevier, 2019, pp. 21-47.
[34] J. Lin, Segmentation of medical images through a penalized fuzzy Hopfield network with moments preservation, J. Chin. Inst. Eng. 23 (5) (2000) 633-643.
[35] J.C. Fu, C.C. Chen, J.W. Chai, S.T.C. Wong, I.C. Li, Image segmentation by EM-based adaptive pulse coupled neural networks in brain magnetic resonance imaging, Comput. Med. Imaging Graph. 34 (4) (2010) 308-320.
[36] C.-Y. Chang, Two-layer competitive based Hopfield neural network for medical image edge detection, Optical Eng. 39 (3) (2000) 695-703.
[37] K. Suzuki, I. Horiba, N. Sugie, M. Nanki, Extraction of left ventricular contours from left ventriculograms by means of a neural edge detector, IEEE Trans. Med. Imaging 23 (3) (2004) 330-339.
[38] J. Ge, B. Sahiner, L.M. Hadjiiski, H.-P. Chan, J. Wei, M.A. Helvie, et al., Computer aided detection of clusters of microcalcifications on full field digital mammograms, Med. Phys. 33 (8) (2006) 2975-2988.
[39] I. Christoyianni, E. Dermatas, G. Kokkinakis, Fast detection of masses in computer-aided mammography, IEEE Signal Process. Mag. 17 (1) (2000) 54-64.
[40] T.F. Bathen, L.R. Jensen, B. Sitter, H.E. Fjøsne, J. Halgunset, D.E. Axelson, et al., MR-determined metabolic phenotype of breast cancer in prediction of lymphatic spread, grade, and hormone status, Breast Cancer Res. Treat. 104 (2) (2006) 181-189.
[41] J.H. Goodband, O.C.L. Haas, J.A. Mills, A comparison of neural network approaches for on-line prediction in IGRT, Med. Phys. 35 (3) (2008) 1113-1122.
[42] K. Suzuki, H. Abe, H. MacMahon, K. Doi, Image-processing technique for suppressing ribs in chest radiographs by means of massive training artificial neural network (MTANN), IEEE Trans. Med. Imaging 25 (4) (2006) 406-416.
[43] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107-113.
[44] K.H. Lee, Y.J. Lee, H. Choi, Y.D. Chung, B. Moon, Parallel data processing with MapReduce: a survey, ACM SIGMOD Rec. 40 (4) (2012) 11-20.
[45] R.C. Taylor, An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, BMC Bioinforma. 11 (12) (2010) S1.
[46] D. Borthakur, The Hadoop distributed file system: architecture and design, Hadoop Proj. Website 11 (2007) 21.
[47] K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop distributed file system, in: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10, IEEE, 2010.
[48] D. DeWitt, M. Stonebraker, MapReduce: a major step backwards, Database Column. 1 (2008) 23.
[49] A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. DeWitt, S. Madden, et al., A comparison of approaches to large-scale data analysis, in: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 165-178, ACM, 2009.
[50] E. Anderson, J. Tucek, Efficiency matters!, ACM SIGOPS Operating Syst. Rev. 44 (1) (2010) 40-45.
[51] V.K. Vavilapalli, A.C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, et al., Apache Hadoop YARN: yet another resource negotiator, in: Proceedings of the 4th Annual Symposium on Cloud Computing, p. 5, ACM, 2013.
[52] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: cluster computing with working sets, HotCloud 10 (10-10) (2010) 95.
[53] M. Weimer, Y. Chen, B.G. Chun, T. Condie, C. Curino, C. Douglas, et al., REEF: retainable evaluator execution framework, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1343-1355, ACM, 2015.
[54] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, Dryad: distributed data-parallel programs from sequential building blocks, ACM SIGOPS Operating Syst. Rev. 41 (3) (2007) 59-72.
[55] https://data-flair.training/blogs/hadoop-yarn-tutorial/.
[56] https://jaceklaskowski.gitbooks.io/mastering-apache-spark/sparkarchitecture.html.
[57] M. Zaharia, R.S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, et al., Apache Spark: a unified engine for big data processing, Commun. ACM 59 (11) (2016) 56-65.
[58] A. Khera, D. Singh, D.K. Sharma, Application design for privacy and security in healthcare, Security and Privacy of Electronic Healthcare Records: Concepts, Paradigms and Solutions (Healthcare Technologies), IET, 2019, pp. 93-130.
[59] A. Khera, D. Singh, D.K. Sharma, Information security and privacy in healthcare records: threat analysis, classification, and solutions, Security and Privacy of Electronic Healthcare Records: Concepts, Paradigms and Solutions (Healthcare Technologies), IET, 2019, pp. 223-247.
6 Big Data analytics and artificial intelligence in mental healthcare
Ariel Rosenfeld1,*, David Benrimoh2,*, Caitrin Armstrong3, Nykan Mirchi3, Timothe Langlois-Therrien3, Colleen Rollins3, Myriam Tanguay-Sela3, Joseph Mehltretter3, Robert Fratila3, Sonia Israel3, Emily Snook3, Kelly Perlman3, Akiva Kleinerman1, Bechara Saab4, Mark Thoburn4, Cheryl Gabbay2 and Amit Yaniv-Rosenfeld5,6
1Bar-Ilan University, Ramat-Gan, Israel; 2McGill University, Montréal, Canada; 3Aifred Health, Montréal, Canada; 4Mobio Interactive, Toronto, Canada; 5Tel-Aviv University, Tel-Aviv, Israel; 6Shalvata Mental Health Center, Hod Hasharon, Israel
*Both authors contributed equally to this manuscript
Abstract Mental health conditions cause a great deal of distress or impairment; depression alone will affect 11% of the world's population. The application of Artificial Intelligence (AI) and big-data technologies to mental health has great potential for personalizing treatment selection, prognosticating, monitoring for relapse, detecting and helping to prevent mental health conditions before they reach clinical-level symptomatology, and even delivering some treatments. However, unlike similar applications in other fields of medicine, there are several unique challenges in mental health applications that currently pose barriers to the implementation of these technologies. Specifically, there are very few widely used or validated biomarkers in mental health, leading to a heavy reliance on patient- and clinician-derived questionnaire data as well as interpretation of new signals such as digital phenotyping. In addition, diagnosis lacks the objective "gold standard" available in other conditions such as oncology, where clinicians and researchers can often rely on pathological analysis for confirmation of diagnosis. In this chapter, we discuss the major opportunities, limitations, and techniques used for improving mental healthcare through AI and big data. We explore
the computational, clinical, and ethical considerations and best practices, and lay out the major research directions for the near future. Keywords: Big Data; artificial intelligence; mental healthcare; psychiatry
6.1 Introduction
The conceptualization, diagnosis, treatment, and prevention of mental disorders is limited by existing options for collecting, organizing, and analyzing information. Big data and machine learning/artificial intelligence (ML/AI) can be applied to the development of tools that could help patients, providers, and systems overcome these limitations. Nearly one in five adults live with a mental illness. Mental disorders affect individuals' abilities to function, engage meaningfully in daily activities, and maintain relationships. They cause significant suffering to individuals and their families and are a significant source of socioeconomic burden [1]. Many mental disorders are also risk factors for suicide, which occurs at an alarming rate globally [2]. These are disorders that often strike young and otherwise healthy people, a socially and economically critical segment of the population. As such, improving the detection, treatment, and monitoring of mental illness is crucial. Designing tools for practical use cases in mental healthcare requires a deep understanding of psychiatric illness, the current mental healthcare system, and medical ethics. We begin with an introduction to mental illness and healthcare, and proceed to discuss their complexities from a clinical and data-driven perspective, before discussing specific use cases and applications of big data and machine-learning approaches. To the reader from engineering and computer science: perhaps the most important conclusion from this chapter is that close collaboration with domain experts and clinicians will invariably be required in order to successfully build safe, effective, and useful mental healthcare applications. Mental illnesses are a group of diverse conditions with varying severity, complexity, and duration. In considering deviations from normal thought, feeling, and behavior, characteristic of mental illness, it is critical to recognize the extent to which they lead to functional impairment. To be classified as a disorder, a set of symptoms must cause significant suffering or interference with daily functions or life goals [3]. This means that the treatment of mental illnesses has largely the same objective as other
branches of medicine: the alleviation of suffering, the improvement of function and quality of life, and the reduction of morbidity (the incidence of new diseases or impairments) and mortality (the rate of death, primarily from suicide or reduced life expectancy because of impaired self-care). These, then, are the objectives of the clinical professionals who treat patients with mental illness. Family doctors are the primary providers of mental healthcare [4] and are accompanied by many other healthcare workers and specialists such as psychiatrists, psychologists, nurses, social workers, case managers, occupational therapists, pharmacists, and counselors. Approaches to measurement and treatment within mental healthcare differ significantly from other areas of healthcare. An example of an approach that is widely used in medicine but not useful in mental healthcare is diagnosis confirmation through pathological examination (i.e., via examination of patient tissues). We will highlight other such differences throughout this chapter. There is a diversity of mental illnesses with varying clinical presentations, time-courses, and causes. Autism, schizophrenia, work-related burnout, dementia, attention-deficit hyperactivity disorder (ADHD), eating disorders, and addictions, along with many others, are all included under the banner of mental illnesses. As such, just as there are many different versions of cancer whose causes, genetics, disease courses, and prognoses are very different from each other (and which can have different severities and types even within the same disease), mental illnesses present a kaleidoscope of different conditions. An in-depth discussion of these different disorders and their presentations is beyond the scope of this chapter. There is also a considerable amount of individual variation within disorders. For example, people with Major Depressive Disorder (MDD), otherwise known as depression, can present with sadness, guilt, feelings of inadequacy, lack of sleep, and a profound lack of energy. Other patients with depression can present with lethargy, overeating, oversleeping, poor concentration, and thoughts of suicide [3]. Mild cases of depression can often respond well to exercise or psychotherapy, whereas more severe cases may only respond when medication or even electroconvulsive therapy is included in the treatment plan [5] (Section 1.10). Some patients present with recurrent episodes of depression while others present with a single depressive episode. Some depressive episodes are correlated with the experience of a triggering or traumatic event such as the death of a loved one or a conflict at work, while other depressive episodes do not seem to be correlated with any particular life event and begin seemingly "out-of-the-blue" [6]. The possible causes for these different presentations, and their effect on
machine-learning and big data approaches to improving mental healthcare, will be further discussed below. Another important concept is comorbidity, the existence in a single patient of more than one disorder. For example, a patient can present with both depression and ADHD. The incidence of comorbidity further complicates an already complex problem. First, there is the challenge of diagnosing disorders with often overlapping symptomatology. Then there is the question of whether causality can be inferred or should be investigated. For example, depression can manifest as a reaction to an individual's difficulty coping with a preexisting condition, or it could be an entirely separate disease process. Given the diversity in mental health and the current state of technology, it is highly unlikely that one application or machine-learning model will be able to provide a tool that would be useful across all of these conditions. Favoring development aimed at specific use cases, informed by an understanding of a particular disorder, will likely result in more fruitful efforts. How are these diverse conditions treated? The most important thing to understand is that there is no singular treatment that is effective in all presentations of a disorder or appropriate for every individual. Treatment involves more than just diagnosis, drug prescription, and/or psychotherapy. Perhaps more than in other branches of medicine, it is critical in mental health provision to dedicate the time to form an alliance or partnership with the patient and to understand the patient's life situation, goals, social network, belief system, and personal and community resources. Only when one understands a patient in this way and has built a therapeutic alliance [7], a trusting professional relationship, can appropriate treatments or interventions be implemented successfully. As such, over time, big data must evolve to capture these elements of patient care, and machine-learning analyses must incorporate them as critical elements to model. Because of the need to understand and support the patient as just described, treating severe mental illness is often not practical unless a team of professionals is involved [8]. For example, a psychiatrist may manage medications, while a nurse monitors medication blood levels and side effects. Furthermore, a social worker may help the patient manage finances, while a psychologist works with the patient in therapy and a family doctor manages nonpsychiatric conditions such as diabetes. As such, any clinically focused applications or big data collection initiatives should consider these many sources of information and the multiple agents that are involved in the clinical decision-making pathway. Furthermore, both the patient and (sometimes) their family are active participants in planning,
choosing, arranging, and prioritizing the care they receive. An understanding and assessment of individual choices and preferences, and how these interact with the clinical team's decision-making process, is another important piece of the puzzle. The journey experienced by a patient—what might be termed the "user story" in other application contexts—may vary wildly as a function of individual resources and of the specific disorder the patient has. Let us take as an example the journey of a patient who develops schizophrenia (referred to as "she"), and contrast this with one who develops depression (referred to as "he")—both to highlight differences and similarities between the two and to demonstrate how the mental healthcare system interacts with patients, as both are relevant to how and where big data and machine-learning powered methods may be integrated into practice. The first patient is a university student, with an unremarkable family history and a stable home environment. At the age of 22, she begins experiencing strange things—whispers that come out of nowhere, fleeting thoughts that her friends wish her harm, and difficulty focusing on coursework. She begins to isolate herself, and her friends start growing concerned after she stops coming to class. After 6 months of this progressively worsening situation, she is brought to the hospital by the police, because she was yelling on the campus square about an imminent attack by some undefined group. She is brought to an emergency room and sedated, and in the morning she is assessed by a psychiatrist and started on an antipsychotic medication; no firm diagnosis is given, only the acknowledgment that she had a psychotic episode and that no drugs seemed to be involved. Once the patient has returned, somewhat, to her usual self, she is referred to an outpatient program for further evaluation. In this program she is seen by another psychiatrist, a nurse, and a case manager—all of whom ask many questions in long interviews. Still no firm diagnosis is made, as it is too early to say anything definitive. The patient is recommended to continue taking her antipsychotic medication, but she refuses because she is afraid of weight gain (a common side effect). The patient eventually has another psychotic episode 4 months later, spends 2 months on a locked inpatient unit recovering, and is diagnosed formally with schizophrenia. Because of concerns that the patient will stop taking the medication again, she is recommended, and agrees to, a long-acting injectable form of the drug that helps her stabilize and return to her studies. In this example, the patient presents to the emergency room, the outpatient program, and the inpatient program; she initially does not take her medication or adhere to follow-up, but eventually improves with the recommended treatment.
The second patient, who suffers from depression, may face a very different experience. At the age of 30, his performance at work begins to lag, and his boss suggests that he go and speak to the company counselor. After a brief conversation, the counselor suggests that the patient is experiencing more than burnout and suggests he visit his doctor. The family doctor diagnoses the patient with depression, and notes that the abuse the patient suffered in his childhood serves as a significant risk factor for the disorder. The patient agrees to try taking antidepressants, but stops taking the drug, with his doctor's agreement, after 4 weeks because it does not seem to be working and causes sexual side effects. The patient then tries psychotherapy, but because of the long wait before the start of therapy (owing to a long waiting list) and a poor therapeutic alliance with the therapist, the patient worsens. He returns to his family doctor and decides to try one more medication, and this one begins to work; within 3 weeks the patient is feeling better and in 2 months he is back at work. Here, the patient never even saw a psychiatrist and was treated only by a family doctor and a psychologist; he had to go through several treatments, even though he adhered to each one. The two patients discussed above had very different journeys and challenges. As a result, any computational model or application developed to support them would need to be targeted at the appropriate part(s) of their prospective journeys. The remainder of this chapter is structured as follows: in Section 6.2, we examine further details of what makes mental healthcare complex from clinical and data-driven perspectives. In Section 6.3, we deconstruct the patient journey down to its most common elements or steps and discuss the challenges, opportunities, and use cases for each of these, as well as important ethical considerations. Last, in Section 6.4, we discuss and summarize the content of this chapter.
6.2 What makes mental healthcare complex?
In this section, we will discuss some of the challenges inherent in applying machine learning and big data approaches in mental health. A fundamental challenge to treating mental health problems is the lack of a mechanistic model for essentially any psychiatric disorder [9,10]. Specifically, in contrast to some other branches of medicine, like cardiology, we do not understand the mechanisms that lead to and sustain states such as depression or psychosis. While a wealth of results from
psychological, genetic, environmental, metabolomic, epidemiological, and neuroimaging approaches have advanced our understanding of the causes of psychopathology, it remains the case that we do not have a clear model that describes the development of a given mental illness, unlike the way we understand how plaque builds up in arteries and leads to a heart attack. This is not due to a lack of effort or data; rather, it can be attributed to the extraordinary complexity of understanding the human brain as it develops, processes information, and interacts with a continually changing environment, which we collectively shape through high-level processes such as culture and social hierarchies. In addition, while many studies have evaluated, for instance, the link between genetic markers and the etiology of schizophrenia [11] or the link between childhood abuse and the development of depression [12], it remains a challenge to integrate results across levels of investigation: from neurotransmitters and variations in genes, to brain systems, to psychological states or behaviors, and to the societal and cultural systems in which individuals are embedded.
To express the problem in terms familiar to any computer or data scientist, we are attempting to interpret the functioning of the hidden layers of an extraordinarily complex neural network whose inputs and outputs are not easy to quantify and are subject to a great deal of noise; in some respects, this is similar to the classic credit assignment problem [13], in which one must determine how much of the success or failure of a system is due to the various contributions of the system's components. Different branches of science (e.g., psychology, neuroscience, and artificial intelligence) make different theoretical assumptions about the nature of mental processes and take different stances on the mind-body problem [14] (Section 1). It appears that, so far, most data and evidence collected have been of insufficient quantity and/or quality, or have been of the wrong kind, to afford the construction of mechanistic models.
This brings us to two practical challenges raised by the lack of mechanistic models: choosing which data to record and how they should be represented in order to provide useful insights. Specifically, the lack of mechanistic models often leads to the analysis of large datasets in a "blind" manner, that is, by training simple classifiers or related data-driven algorithms in an attempt to find some model that explains the data. While recent deep learning advances [15] can capture the implicit relationships between features in these models, they do not provide one with the desired mechanistic model. Furthermore, in most cases, these models turn out to be extremely difficult to interpret
beyond a simple understanding of which features were most important in the model [16]. That is, the complex moderation and mediation interactions these models discover are not always accessible to human researchers and cannot readily be translated into practical insights. Without a mechanistic model to test, we must rely on generating models from data. However, different datasets may produce different best-fit models, even when these datasets are large and the outcome being predicted is similar. A model that is significant in one dataset may or may not advance our understanding of the underlying disease unless it replicates in other datasets and coheres with existing findings in the literature. Given the concern that bias engendered by training on nonrepresentative datasets could creep into clinical applications, ensuring that models are generalizable is essential; this, again, is very challenging.
Let us turn to the challenge of measurement, or, more precisely, of knowing what to measure in a cost-effective manner. This is critical because having "big data" is only useful if the dataset actually contains informative variables. As discussed above, for our purposes, the type of data one expects to be informative depends on underlying theoretical assumptions. In addition, because there are so many possible types of data to collect, and because of the often high cost of data collection, in most cases it will not be feasible to overcome this challenge by simply "collecting everything." In making decisions about the type of data to collect, there are also practical considerations. For example, when designing big datasets, should the budget be spent on a collection of extensive neuroimaging data, which is difficult to apply in clinical settings, or on a set of very simple measures that can be more easily collected and applied within the clinical setting but that may lack explanatory power?
Furthermore, there is also the challenge of operationalizing the constructs to be measured. Should we attempt to work with written case records via natural language processing, and if so, how do we deal with differing terminologies, writing styles, and the fact that, often, two healthcare professionals with the same training will disagree on the diagnosis or treatment plan for the same patient [17]? How do we combine existing datasets in a valid manner, understanding that they were all collected for different purposes and often using different tools and questionnaires? These are among some of the questions that must be addressed.
Unlike diagnoses in most other areas of medicine, diagnoses of mental illnesses are based entirely on their phenomenology, that is, on descriptions of
symptom clusters and patient presentations amalgamated over time by the psychiatric and psychological community [18]. Sometimes diagnostic categories can seem arbitrary or overly rigid. For example, while anxiety symptoms are very common in depression, anxiety is not one of the nine symptoms included in the official diagnosis of depression [3], even though anxiety seems to be both neurobiologically related to depression [19] and an important predictor of whether or not people will respond to antidepressant treatment [20]. In order to be of clinical use, diagnostic tools based on big data and machine learning should be validated against these admittedly imperfect diagnostic criteria. One caveat is that this need for validation against diagnostic criteria may lead to a situation in which a tool that, for example, tracks user anxiety and sleep might be more reflective of the putative underlying neurobiology of depression, but may or may not have acceptable performance when detecting depression according to the formal criteria; this would depend on how well anxiety correlates with the other official symptoms.
In terms of outcome measurement, in mental health this is usually accomplished by means of validated questionnaires such as the Quick Inventory of Depressive Symptomatology (QIDS) [21] or the Patient Health Questionnaire (PHQ-9) [22]. These are also often used for screening, determination of illness severity, and, in some cases, diagnosis. These questionnaires are themselves imperfect, with their accuracy varying depending on their length, whether they are patient self-report or clinician-rated, and, in the latter case, the level of training of the clinician. In addition, they are often validated by coherence with older questionnaires that measure the same construct. Other, "harder" outcomes can also be collected, such as employment, service utilization costs, and suicide. These, however, may be confounded by external factors: a patient may or may not find employment because of prevailing economic conditions, for example. This problem is worsened by the prevalence of comorbidity and the intersection of mental illness with social, cultural, and economic realities [23], making it very difficult to understand the trajectories of many of the more severe patients in the mental healthcare system. It is also important to note that although the patient is the expert on his or her own experience, the information that the patient can provide is limited, especially in cases where executive functions such as attention and memory are impaired, as is often the case in mental illness [24].
These difficulties are compounded by several factors. Firstly, many patients see multiple providers [25] and can have chaotic or irregular
contact with the health system. Furthermore, it can be difficult to accurately measure a patient's social situation, both because of the challenge of operationalizing a patient's social network and because of the limitations of available technologies. For example, even with the advent of social media monitoring, it is not clear whether these techniques capture important social connections [26]. Furthermore, missing data are ubiquitous and can introduce bias into a dataset.
Given the advent of big data (large volumes of heterogeneous variables) and improvements in processing power, machine learning (see Ref. [27]) presents itself as a promising avenue for addressing some of the aforementioned complexities inherent to treating mental health problems [28,29]. However, this promise comes hand in hand with the challenges of applying big data analytics to mental health problems. Though the heterogeneity of clinical, sociodemographic, neuroimaging, genomic, immune, and other measures is advantageous from a research standpoint, it is challenging to deal with the impact of high dimensionality, especially when the number of features exceeds the number of subjects. On the other hand, much of the data collected may not be informative for making a diagnosis or predicting treatment response, drastically reducing the number of features that are actually needed; the challenge is to separate the useful from the less useful features. It is equally important to consider the generalizability of machine-learning models outside the training sample. Insufficient sample sizes and the underrepresentation of minority groups also make it difficult to interpret machine-learning models for some populations. An additional important challenge is that many machine-learning models cannot be easily understood by humans, commonly functioning as "black boxes."
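To make the dimensionality problem concrete, the following minimal sketch (entirely synthetic data, and one common approach rather than a method used by any study cited here) shows how an L1-regularized, or sparse, logistic regression can separate useful from less useful features when features outnumber subjects: the penalty drives the coefficients of uninformative features to exactly zero, and cross-validation gives a first check on generalizability.

```python
# Minimal sketch: sparse feature selection when features outnumber subjects.
# Synthetic data only; not drawn from any dataset discussed in this chapter.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_features = 120, 500          # p >> n, as is common in psychiatry research
X = rng.normal(size=(n_subjects, n_features))
# Assume only 5 features truly carry signal about the (binary) outcome.
signal = X[:, :5] @ np.array([1.0, -0.8, 0.6, 0.5, -0.4])
y = (signal + rng.normal(size=n_subjects) > 0).astype(int)

# The L1 penalty encourages sparsity: most coefficients shrink to exactly zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])
print(f"{len(selected)} of {n_features} features retained:", selected[:10])

# Generalizability check: cross-validation rather than training accuracy.
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(2))
```

Whether the retained features would replicate in an independent sample is, of course, exactly the generalizability question raised above.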
6.3 Opportunities and limitations for artificial intelligence and big data in mental health
Understanding the complexities inherent in the conceptualization, diagnosis, and treatment of mental illness, as well as how these complexities affect the use of machine learning and big data in this space, helps contextualize a discussion of potential and current use cases for AI and big data in mental health. In the following section, we have grouped these use cases to reflect a patient's clinical course. The first step in this trajectory is diagnosis. Mental illnesses are best treated if diagnosed early [30], and many cases of mental
illness could even be prevented with the right interventions (including societal and population-level interventions) [2]. Here, big data and machine learning can help us better understand which people in the general population are at risk of developing mental illness, helping us deliver interventions that could save lives and costs by preventing the illness from fully manifesting or intensifying. On a population level, machine learning could help us identify population trends and variables that could be targeted by social programs to reduce the incidence of mental illness.
Once a person is diagnosed or considered to be at risk, they often want to know what their chances of recovery are, and many clinicians want to know about their risk of suicide or violence; this understanding of the likely course a patient's illness will take is called the prognosis. Here, predictive tools could help patients better understand their illness and help their families plan for different clinical courses. Clinicians do not perform well when trying to predict which patients are at risk of suicide, so better tools for this have the potential to fill a clinical gap and save lives.
Simply knowing a patient's prognosis does not necessarily make it possible to help them manage their illness. For this, an optimal patient management strategy must be selected. This is where treatment selection tools come in; these are tools aimed at using patient information to select the optimal treatment or intervention from a range of psychiatric interventions that often do not separate by efficacy at the group level. Once a treatment is selected, one must choose how to deliver it, and here AI can help by providing virtual therapists or personalizing patient experiences in digital therapeutics applications. While these are unlikely to replace medications and traditional psychotherapy, they may prove to be a powerful ally in augmenting traditional therapies, improving access, and acting as low-intensity interventions that can be delivered in a preventative manner, before a patient requires more skilled or advanced care.
Next, the monitoring of patients' condition and symptoms is needed in order for clinicians to get a deeper understanding of the patient's illness. Typically, nonhospitalized patients do not see their clinician very frequently, and therefore treatment is mostly based on the patient's (sometimes biased) report and condition during appointments. Automated or semi-automated monitoring can mitigate this limitation. Finally, it is crucial to discuss the ethical considerations: each of these use cases must be pursued with respect for patient welfare, dignity, rights, and current medical ethics, while endeavoring to address the more novel ethical considerations that come with the use of AI.
6.3.1 Diagnosis
Diagnosis is an important initial step in the treatment of mental illness and relies heavily on nosology, the study of the categorization and explanation of disease. Three main nosological systems can be used [31]:
1. an etiological system [i.e., defined based on the cause(s) of a disease];
2. a pathophysiological or mechanistic system (e.g., type 1 diabetes is defined through the absence of insulin production); and
3. a symptom-based system (i.e., defined by clusters of symptoms).
Psychiatry previously relied on an etiological nosological system centered around psychoanalytic theories. However, as discussed before, the field of psychiatry has moved to adopt the current "atheoretical" symptom-based model, as present in the Diagnostic and Statistical Manual of Mental Disorders [3] from its third edition onward. This radical paradigm shift ushered in issues of validity. The primary issue was, and remains, the fact that a diagnosis in a symptom-based nosological model cannot be incorrect. That is, there is no other test or gold-standard diagnostic procedure; if the symptoms are present and meet criteria, the diagnosis is valid even if there may be reasons to doubt this conclusion. Furthermore, heterogeneity in symptom presentation and high levels of comorbidity obscure the boundaries of the disease categories, leading to high false positive and false negative rates [32]. In response to these limitations, researchers have begun to identify biomarkers and underlying neurobiological mechanisms to guide mental health diagnosis and treatment, though this work is still in the exploratory phase [9].
Considering the difficulties facing psychiatric diagnosis, machine learning and big data offer a unique opportunity to improve our understanding of and ability to diagnose diseases. We have identified three areas that could benefit from the right application of machine learning and big data: namely, improved data collection, a better understanding of symptom clusters, and a redefinition of diseases with respect to function and quality of life.
The popularization of AI and big data approaches, together with the development of data collection technology, is encouraging the collection and analysis of unprecedented amounts of data, from a wider range of sources than ever before. Examples of these new data sources and analysis efforts in our context include:
• The spread of electronic medical records, allowing for more access to healthcare data.
• Social media, which offer the opportunity to mine data to both inform diagnosis and examine the impact of disease. Internet
users with various mental illnesses can be characterized by their social media use, text generation, and other online behaviors (e.g., Refs. [33–35]). Social media analysis can also serve to inform our understanding of the sometimes difficult changes in self-perception and identity that can occur after a diagnosis (see Ref. [36]).
• Passive sensing of movement, location, social media, calling, and text message use through mobile phones [37] (e.g., the Beiwe platform, http://wiki.beiwe.org). These data may help us gain a deep understanding of the patterns, and changes in patterns, associated with mental illness, suicidality, and response to treatment.
• Ambulatory assessment, which includes wearable sensor technologies that collect momentary data that do not depend on self-reports, context sensors (e.g., noise level, pollution), and biobehavioral sensors (which measure physical activity, sleep quality, blood pressure, alcohol intoxication, and more) [38]. Much like passive sensing from mobile phones, this kind of highly personalized data offers insights into realms of human behavior and functioning, such as sleep and movement, which are directly relevant to diagnosis.
Moreover, by accumulating all this information on patients, symptoms like "mood" or "appetite," which often mean different things to different patients, could be more accurately assessed within a patient-specific model trained on one patient's past data to predict outcomes for them.
AI may also facilitate the deconstruction of clinical labels into more biologically grounded transdiagnostic features. As we mentioned, the research community is struggling to understand underlying mechanisms specific to the currently labeled mental disorders. This is most likely explained by a many-to-many relationship between neurobiological processes and syndromes, or even symptoms. In other words, a single neurobiological alteration could potentially give rise to multiple symptoms, and many symptoms could result from more than one underlying process. Machine learning offers the chance to predict the different dimensions of symptomatology that a patient might experience. Meaningful clusters can then be found in this multidimensional landscape, and these might be stable across patients, allowing new symptom clusters to be identified. For example, clustering techniques have been able to reveal important clusters linking brain regions and some transdiagnostic features or dimensions of mental disorders, like mood, psychosis, disruptive behavior, or anhedonia [39,40]. In these applications, agglomerative hierarchical
clustering techniques [41] were used to devise clusters of symptom scores, showcasing the power of machine learning for psychiatry. This is in line with recent theoretical attempts to move from symptom-based categorization toward a more pathophysiological nosology, such as symptom network theory [42]. As such, AI (supported by access to big data) can help realign psychiatric diagnosis with biology.
Finally, machine learning can offer a functional perspective on disease. Indeed, some would argue that attempting to isolate clinically relevant neurobiological mechanisms from clusters of symptoms is unrealistic, as the interaction between trauma, neurobiological alterations, and symptom experience is neither unidirectional nor direct [43]. Instead, a diverse set of bodily, cognitive, social, and cultural influences mediates these interactions at different timescales to maintain clinical suffering. AI and big data technologies could help us monitor patients' ability to perform daily activities and enjoy daily life, and model the individual factors relevant to each person's quality of life. As such, it may no longer be necessary to categorize patients in a strict sense, and the concept of psychiatric diagnosis could become obsolete. Instead, an understanding of each patient's genetic vulnerabilities, neurobiological states, clinical presentation, personal narratives, and life trajectories, and their interactions with others, as captured by an integrative AI model, would help clinicians target treatment and interventions without the need to label the patient with a diagnosis. This would create a new nosological system based on functionality and quality of life, which would more closely match patient needs and concerns [44].
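To illustrate the kind of clustering analysis described earlier in this section, the following minimal sketch applies agglomerative hierarchical clustering (Ward linkage) to a matrix of symptom scores. The data are synthetic, and the actual pipelines of Refs. [39–41] are considerably more elaborate; this only shows the shape of the computation.

```python
# Minimal sketch of agglomerative hierarchical clustering on symptom scores.
# Synthetic data; the pipelines of Refs. [39-41] are more elaborate.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# 200 hypothetical patients scored on 6 symptom dimensions
# (e.g., mood, anxiety, psychosis, disruptive behavior, anhedonia, sleep).
scores = np.vstack([
    rng.normal(loc=[2, 2, 0, 0, 1, 1], size=(100, 6)),  # mood/anxiety-dominant group
    rng.normal(loc=[0, 1, 2, 2, 0, 1], size=(100, 6)),  # psychosis/disruptive group
])

# Ward linkage merges patients so as to minimize within-cluster variance.
tree = linkage(scores, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 clusters

for k in (1, 2):
    print(f"cluster {k}: n={np.sum(labels == k)}, "
          f"mean symptom profile={scores[labels == k].mean(axis=0).round(1)}")
```

In practice, the number of clusters and their stability across samples are themselves empirical questions, which is why replication across datasets matters so much here.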
6.3.2 Prognosis
One of the most complex steps in mental healthcare concerns prognosis, an understanding of which is important both at the individual patient level and from a public health policy standpoint. A prognosis is a prediction about the likely outcome of a patient's current disease (i.e., the risk or chance of recovery, or death). Prognosis for mental illness is often fairly unpredictable due to the following:
1. Our rudimentary understanding of disease pathophysiology.
2. The lack of well-documented, standardized longitudinal data on patients.
3. Insufficient follow-up and dosage modulation by physicians.
During recovery from mental illness, symptom improvement is not directly proportional to the time elapsed during treatment (i.e., it is nonlinear). However, by knowing the likely course a disease will take early on in the treatment process,
including the likelihood of relapses and recurrences, we can better decide which treatment to provide. For example, if we can predict that a given patient is likely to experience a relapse within a given time frame, it may be worth administering prophylactic treatment or scheduling more regular follow-ups with the clinician. Knowledge of the most likely course of a disease can also increase the efficiency of resource allocation and the delivery of mental healthcare services. In particular, a patient's clinical team can better structure a long-term, integrative, and multifaceted care plan. Using AI, we can create personalized, preventative mental health treatment based on these predictions.
AI in mental illness prognosis can have a drastic impact not only on individuals but also on society as a whole. Many countries, especially low-income and middle-income countries, treat mental health as a low priority [45]. Developing countries tend to prioritize the control of infectious diseases and reproductive health, which makes sense given their urgency. That being said, in order to understand the public health impact of mental illness, we must consider that most mental disorders are diagnosed at a young age, with 75% having an onset before the age of 24 [45]. This is a vital point in an individual's life, both academically and socially, as this is the stage when young adults begin their careers and develop romantic relationships and lifelong friendships. However, mental illness can have a drastic impact on these seemingly normal steps in social development [46]. Additionally, several studies have correlated poor mental health with lower educational achievement [47]. Currently, a poor understanding of prognosis in psychiatry means that it may take several years for a patient to be properly treated [47]. On a population level, this leads to large groups of people missing the opportunity to pursue higher education and develop the social skills needed to become impactful members of society. In developing countries, this impact may be even greater, as the future of the country relies on the development of an educated population who can contribute to the country's evolution. Hence, it would be worthwhile for developing countries to invest in the implementation of these AI-based technologies to help clinicians understand prognosis, thereby allowing patients to be properly treated at a younger age and to continue their pursuit of academic, professional, and social success. A nonequitable implementation of AI-based care globally would further perpetuate the cycle of inequality facing persons in the developing world.
Furthermore, the high comorbidity rate between mental illness and other diseases indicates that improved prognosis in
mental health could have a drastic impact even beyond the realm of psychiatry, also affecting diseases such as cardiovascular illness and diabetes. Again, it is important to consider the complex relationships between mental illness and other diseases. It is possible that shifting a country's public health priorities toward ones that include mental health may have a wider impact on other health conditions that are considered a priority. In terms of cost, a 2007 study discusses the economic impact of a phase-specific intervention program built for teens affected by psychotic disorders [47]. Importantly, the researchers found that this method proved to be more cost-effective, as costs shifted from inpatient services to community care. Employing AI to improve prognosis in mental health could allow for the development of improved, phase-specific interventions in depression as well. The use of AI technologies offers the potential to gain greater insight into disease progression, potentially having a significant impact on healthcare delivery as a whole. The implementation of AI systems can help shift the delivery of mental healthcare from a reactive response to a proactive one.
6.3.3 Treatment selection
While determining prognosis can be helpful, it is often claimed that a prediction model is only as good as the system that uses it [48]. In other words, instead of simply determining which patients will or will not improve, models should be evaluated on how well they assign patients to the treatments that are most likely to be effective for them. This is often referred to as personalized, or precision, medicine.
Let us examine the treatment selection problem more closely. In mental health, many kinds of treatments exist: a range of pharmacotherapies, a range of psychotherapies, and neuromodulation techniques such as repetitive transcranial magnetic stimulation and electroconvulsive therapy. In addition, "lifestyle" interventions, such as exercise, mindfulness, and meditation, have also been found to be effective for certain milder disorders or as adjuncts to medication or psychotherapy. What is striking is that most of the nonlifestyle interventions have been found to be roughly equally effective, despite disparate mechanisms and routes of administration [49] (with the exception of electroconvulsive therapy, which for many conditions has superior efficacy but is far more resource intensive than most other treatment options [50]). In addition, we must face the reality of resource restriction and the need to match patients to the right
treatment intensity. This is exemplified by the "stepped care" approach [51], in which patients are given access to the level of intervention they require and are "stepped up" to more intense (and costly) services as needed. For example, in the UK's adult Improving Access to Psychological Therapies (IAPT) program [52], patients are streamed toward "low-intensity" treatment (online resources and infrequent visits) or "high-intensity" treatment (weekly visits with a therapist). Patients are either streamed into low-intensity treatment first and then moved up to high-intensity treatment if the low-intensity treatment fails, or they can be streamed directly into high-intensity services based on diagnosis or symptom severity. Finally, because of the way mental health services are organized, it is often necessary to decide whether a patient should be put on the waitlist for psychotherapy, should be started on medication, or should pursue medication and therapy concurrently.
In addition, when training a machine-learning model aimed at improving treatment selection, one must decide on the "success" criteria. One could aim for the greatest amount of symptom reduction, though this may not always correlate with function or fully represent patient goals. One might aim for a reduction in suicide, though very large datasets may be needed in order to reliably detect such an effect given the low incidence of suicide. One might try to optimize cost-effectiveness, though this might lead to slightly worse outcomes for many patients. Finally, when pursuing treatment selection, it must be understood whether one is building a general model that can be applied to any patient population with a similar sociodemographic profile to the training set, or optimizing a model that will help make treatment decisions within a specific healthcare system or institution.
As such, it is clear that the treatment selection problem can be carved up in many different ways, and important progress is now starting to be made on several different fronts using a number of approaches. This progress, which we will now discuss, increases the hope that the right kind of big data, combined with a clear understanding of the treatment selection problem at hand and algorithms appropriate for solving it, can help significantly improve mental healthcare, reducing the time it takes for patients to find a treatment that works for them while reducing systemic costs by avoiding failed treatment courses.
Recent work has investigated several different perspectives on treatment selection prediction in depression. The Leeds Risk Index (LRI) identifies pretreatment variables to predict treatment outcome and allows patients to be stratified into groups of low, moderate, or high risk of poor treatment response based
on these LRI scores [53]. The baseline variables used to predict depression outcomes include demographic and clinical information, such as measures of age, employment status, disability status, and intellectual functioning. This type of risk index is proposed to have clinical relevance in directing patients toward treatment options of different intensity levels, based on the predicted advantage for patients grouped by LRI scores. For example, the authors describe the value of this index in psychiatric treatment systems with discrete steps of treatment intensity, such as the IAPT program; that is, lower intensity treatment options such as providing support and teaching strategies based on Cognitive Behavioral Therapy (CBT), and higher intensity treatment options such as depression counseling or CBT sessions. The work suggests that low-intensity interventions may be the most cost-efficient approach for depressed patients with low LRI scores, while patients with high LRI scores may benefit more from high-intensity treatment options and should avoid lower intensity alternatives due to predicted higher dropout rates. This predictive insight has the potential to improve patient care at an individual level, by assisting physicians in guiding patients toward the most effective treatment option, and at a population level, by distributing patients between differing treatment intensities based on predicted response patterns. This predictive approach to treatment selection has the potential to improve the efficiency of mental health treatment systems; in this case, by providing informed predictions to help patients and physicians navigate systems with discrete levels of treatment intensity.
Neuroimaging data can also be used for differential treatment prediction. For example, positron emission tomography (PET) imaging can be used to measure pretreatment brain glucose metabolism in patients with depression receiving either escitalopram or CBT, in order to predict response to treatment [54]. The treatment-specific neuroimaging biomarker described by McGrath et al. suggests that distinct, observable, physiological differences can be measured using existing neuroimaging techniques to predict patient outcomes under different treatments. The LRI, on the other hand, suggests that stratification of patients with depression based on baseline demographic and clinical variables can predict the most beneficial treatment intensity level in a stepwise treatment system. These approaches, rather than competing, may provide complementary information regarding both the type of treatment and the intensity of treatment that is most effective in individual cases.
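As a toy illustration of the stratification idea behind risk indices such as the LRI (not the published LRI itself, whose variables, weights, and validation differ), the sketch below fits a logistic model of poor treatment response on synthetic baseline variables and cuts the resulting risk scores into low, moderate, and high tertiles.

```python
# Toy sketch of risk-index stratification (not the published LRI).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 300
# Hypothetical baseline variables: age, employed (0/1), disability (0/1).
X = np.column_stack([
    rng.normal(45, 12, n),
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
])
# Synthetic "poor response" outcome loosely driven by the baseline variables.
logit = 0.02 * (X[:, 0] - 45) - 0.8 * X[:, 1] + 1.0 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

risk = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
cuts = np.quantile(risk, [1 / 3, 2 / 3])  # tertile boundaries
group = np.digitize(risk, cuts)           # 0 = low, 1 = moderate, 2 = high risk

for g, name in enumerate(["low", "moderate", "high"]):
    print(f"{name}: n={np.sum(group == g)}, "
          f"observed poor-response rate={y[group == g].mean():.2f}")
```

In a stepped-care system, the high-risk group would be the natural candidate for direct streaming into high-intensity treatment.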
In related work, the authors of Ref. [55] created the Personalized Advantage Index (PAI). The PAI can be used to identify both the most effective treatment and the magnitude of this benefit for an individual patient; it identifies individuals who would benefit differentially from different treatments. Of the 154 patients in their study, 60% displayed a clinically meaningful PAI score for one treatment compared to the other, meaning these patients were predicted to respond better to one of the two treatments. The PAI score was calculated by predicting symptom severity after treatment for each patient, for paroxetine and CBT separately, and then comparing the two estimates to determine the more beneficial treatment option. Clinically, identifying which patients would benefit differentially between treatments would allow the more effective treatment option to be prescribed for these individuals, and would also allow patients without any discernible advantage for a specific treatment to select treatments based on personal values and potentially choose more cost-efficient options. An important distinction made by the authors of this work is between prognostic variables, which predict nonspecific treatment outcome, and prescriptive variables, which predict differential treatment outcomes.
Overall, the above three approaches to differential treatment selection provide intriguing insight into the potential power of statistical analysis and modeling of data to more effectively guide treatment selection in the context of depression. From these different methodologies, it is clear that the future holds many exciting paths for improving treatment selection in mental healthcare on the foundations of big data and rapidly advancing analytical and predictive tools.
A recent successful application of machine learning to treatment selection is Ref. [56]. The system, known as Aifred, offers a neural network model that allows for differential prediction between four different antidepressant drug categories. The model is capable of determining the overall likelihood of remission given each drug category. Using an extensive evaluation protocol, the system was found to provide a significant advantage over random drug allocation. The major contribution of this model is the extension of a differential-benefit treatment selection process to more than two treatments, which is key because of the large number of treatments available. A significant limitation of this analysis was the class imbalance in the training data: there were far more patients being treated with citalopram than with any other drug. As such, we expect analyses of more balanced datasets to yield larger differential treatment prediction effect sizes.
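Returning to the PAI described above, its core logic can be summarized in a few lines. The sketch below is a schematic reconstruction under simplifying assumptions (linear outcome models and synthetic data), not the exact procedure of Ref. [55]: fit one outcome model per treatment arm, predict each patient's post-treatment severity under both, and take the difference as that patient's predicted advantage.

```python
# Schematic PAI computation: synthetic data, simplified linear models.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, p = 154, 5
X = rng.normal(size=(n, p))                # baseline patient features
treatment = rng.integers(0, 2, n)          # 0 = paroxetine, 1 = CBT (hypothetical coding)
# Synthetic post-treatment severity with a treatment-by-feature interaction.
severity = 10 + X @ rng.normal(size=p) + treatment * (1.5 * X[:, 0]) + rng.normal(size=n)

# One outcome model per treatment arm.
m_parox = LinearRegression().fit(X[treatment == 0], severity[treatment == 0])
m_cbt = LinearRegression().fit(X[treatment == 1], severity[treatment == 1])

# PAI: predicted severity under one treatment minus the other (lower severity is better).
pai = m_parox.predict(X) - m_cbt.predict(X)
prefers_cbt = pai > 0
print(f"{prefers_cbt.mean():.0%} predicted to do better on CBT; "
      f"median |PAI| = {np.median(np.abs(pai)):.2f}")
```

The clinically meaningful question is then whether a patient's |PAI| exceeds some threshold of practical significance, which is how Ref. [55] arrives at its 60% figure.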
6.3.4 Treatment delivery
Psychotherapy is an ancient form of healthcare that persists as an effective treatment option for a variety of mental illnesses, particularly affective disorders. However, psychotherapy delivery is hampered by two human-derived issues:
1. Limited access to appropriately qualified healthcare professionals due to cost and other logistical issues.
2. Variability in the quality of care.
These limitations are now being addressed with machine-learning-based natural language processing and big data techniques. A prominent example is Woebot (https://woebot.io/) [57], a commercially available psychotherapist, primarily powered by AI, with demonstrated short-term efficacy in reducing PHQ-9 scores among college students with self-identified symptoms of depression and anxiety. Users interact with Woebot via an instant messaging app, and these conversations are reviewed (typically at a later time) by a trained psychologist. Woebot's natural language is modeled after social discourse, and its response-function decision tree is trained in CBT using three clinical sources [58–60]. Six key process-oriented treatment features are prioritized: empathy, personalization, goal-setting, accountability, motivation, and reflection. Interestingly, some Woebot users who participated in the RCT reported in Ref. [57] cited a feeling of "real person concern" as a "most favoured feature" during the intervention review, indicating that natural language processing is approaching the level of sophistication needed for elements of psychotherapy. Not all post-RCT reviews were positive, of course, with complaints that Woebot "got a little repetitive" and that the conversations were inflexible and unnatural.
Another example is reSET (Pear Therapeutics Inc., https://peartherapeutics.com/), a mobile app adjunct therapy for substance use disorder in patients abusing alcohol, cocaine, marijuana, or stimulants. While reSET does not appear to directly utilize big data or machine learning for therapy delivery, it does collect the types of patient data that could be leveraged to improve treatment prognosis [61]. The more important aspect of reSET in the context of this chapter, however, is that it became the first FDA-approved digital therapeutic [62].
In recent years, mindfulness meditation [63] has surfaced as a popular resilience training technique, and there are hundreds of
readily available wellness apps claiming to be able to train users in mindfulness meditation. Unfortunately, the explosion of mindfulness apps in recent years has not been matched by a parallel explosion in RCTs examining mindfulness app efficacy, and it is still debated whether these products can deliver tangible benefit, given mixed early results. To the authors' knowledge, to date only a handful of RCTs using active controls have interrogated commercially available mindfulness apps, most recently Ref. [64]. These RCTs show differing outcomes. In an examination of the app Headspace (https://www.headspace.com/) as a 6-week intervention in undergraduate students, no benefits to well-being, affect, cognitive function, or mindfulness abilities were revealed, either through within-subject pre-post analysis or in comparison to the active control. In contrast, in an examination of the app Wildflowers (Mobio Interactive Inc., http://www.midigitaltherapeutics.com/, of which BS and MT are a part) as a 3-week intervention in undergraduate students, benefits to well-being and stress-resilience were revealed both in within-subject analysis and in comparison to the active control.
Interestingly, a unique aspect of Wildflowers and its sister products developed by Mobio Interactive is the use of big data and machine learning. For example, Wildflowers leverages computer vision to extract heart-rate variability through photoplethysmographic imaging [65,66] of user selfie videos. Since heart-rate variability is negatively correlated with cognitive stress [67], this technology has the potential to objectively quantify stress. According to the Mobio Interactive website, over 100,000 pairs of selfie videos and self-assessments of mood and stress from users throughout the world are currently being leveraged to train deep neural networks that both objectively and remotely predict stress changes in the end user, and then personalize psychotherapy accordingly. It remains to be determined whether the use of big data and machine learning in this context contributes to clinical efficacy, but given how effectively big data and machine learning have been applied in the various contexts described throughout this chapter, it seems likely that such practices will ultimately give rise to more efficacious digital therapeutics for the patient.
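As a concrete illustration of the heart-rate-variability idea (a generic textbook computation, not Mobio Interactive's proprietary pipeline), the sketch below computes RMSSD, a standard time-domain HRV measure, from a series of interbeat (RR) intervals such as those recoverable from photoplethysmography; lower RMSSD is commonly associated with higher stress.

```python
# Generic RMSSD computation from interbeat (RR) intervals, in milliseconds.
# Illustrative only; commercial pipelines must first extract RR intervals from video.
import numpy as np

def rmssd(rr_ms: np.ndarray) -> float:
    """Root mean square of successive differences of RR intervals."""
    diffs = np.diff(rr_ms)
    return float(np.sqrt(np.mean(diffs ** 2)))

# Hypothetical RR series: 60 beats around 800 ms each (roughly 75 bpm).
rng = np.random.default_rng(4)
rr = rng.normal(loc=800, scale=30, size=60)

print(f"mean HR ~ {60000 / rr.mean():.0f} bpm, RMSSD = {rmssd(rr):.1f} ms")
```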
6.3.4.1 Special opportunities
6.3.4.1.1 Real-world validation
With almost as many cell phone subscriptions as there are humans, and with penetration rates in developing countries
averaging 90%, the potential to use mobile devices for health-related data generation is unsurpassed by any previously available data collection method in history. With these real-world data comes the potential for verifying real-world efficacy, that is, obtaining real-world validation. Real-world validation offers an unprecedented opportunity for transparency in healthcare. Digital therapeutics like the ones mentioned above, and others, may one day soon stream live, anonymous, and objective data on stress and well-being, continually monitoring the real-world efficacy of each product in real time.
6.3.4.1.2 Big Data Loop
Without question, the greatest promise of combining big data and machine learning with digital therapeutics is the generation of a "Big Data Loop" that enables seamless feedback circuitry to continuously refine therapy and prognosis in real time as data stream in for analysis. In this context, the real-time collection and analysis of real-world data from digital therapeutics play a central role in redefining the relationship between large numbers of patients and healthcare providers, in some cases providing the first, or even the only, point of contact between healthcare systems and individuals challenged with mental health conditions.
6.3.4.2 Specific challenges
6.3.4.2.1 Public acceptance and adoption
Pharmaceuticals have dominated healthcare for about a century and became a first-line treatment option for mental illness beginning in the 1970s. At present, it is well accepted that small-molecule pharmaceuticals have definitive biological effects, often of a net-positive nature. The same patient confidence does not yet extend to the digital forms of therapy that are guided by AI or required to gather critical patient data. This lingering skepticism is likely to slow the adoption of digital products until sufficient evidence of real-world efficacy permeates the public consciousness.
6.3.4.2.2 Differentiation
Given that much of the public already demonstrates clear confusion between approved clinical practices and medical hoaxes (e.g., homeopathy), it should be expected that at least as much of the public will find it difficult to differentiate digital therapeutics that are backed by scientific evidence from
the large number that are not, or that have even failed RCTs [64]. The creation of a consumer-friendly and tightly controlled cross-border e-commerce site (i.e., a "medical app store") for public-facing digital therapeutics may be one viable solution.
While these are still early (albeit exciting) days, AI-powered and/or big-data-collecting psychotherapeutic interventions like Woebot, reSET, and Wildflowers are likely to have a massive positive impact on the treatment of mental health globally. People in all countries and from all walks of life use their mobile devices every day, and these interfaces may soon deliver affordable and effective mental healthcare.
6.3.5 Monitoring
Continuous monitoring of patients plays a critical role in mental healthcare, for several reasons. First, the mental health clinician (psychiatrist, therapist, etc.) receives a very partial view of the full condition of the patient: typically, nonhospitalized patients do not see their clinician very frequently, and therefore treatment is based only on the patient's report at appointments. Second, symptoms can change significantly over a short period and thus necessitate more immediate intervention. In addition, mental illness is often episodic, particularly for patients suffering from depression or schizophrenia, in whom recurrences or relapses are common. Such patients would benefit from their clinician receiving regular updates on their symptoms.
A naïve approach to monitoring would be to manually contact the patient frequently in order to receive reports of mood and symptoms. However, this approach is not efficient, since it requires great effort from both the patient and the clinician; in addition, self-reporting suffers from a lack of accuracy. Therefore, over the past two decades, many different methods and applications have been developed for the automatic monitoring of mental health. Advances in technology enable the effective collection of monitored data through several forms and means. Specifically, smartphones are an extremely valuable tool for monitoring mentally ill patients, since they have become very common in the overall population in Western countries, and specifically among mental health patients, and are carried by patients throughout most of the day [68]. In addition, smartphones are continuously improving in memory storage and processor capabilities for recording and processing information. Monitored data can also be collected
through computers and designated wearable devices. Automatically monitored data can be classified into two types:
1. Subjective data, such as a patient's self-report of mood and symptoms in response to a mobile application's daily inquiry.
2. Objective data, such as behavioral data (e.g., activity, phone usage), physiological data (e.g., heart rate, body temperature), or environmental information (e.g., location, outdoor exposure).
Each of these types of data has advantages and limitations. Objective measures are naturally more accurate and can capture a large amount of information without interrupting the patient's daily routine. However, these measures are limited since, in many mental illnesses, the diagnosis depends to a great extent on the patient's description of feelings and mood. Subjective measures, on the other hand, are often affected by the context of their assessment and biased by the mood of the patient.
Research in the field of automatic monitoring of mental health has focused mainly on the feasibility of data collection and on finding associations between the monitored data and the patient's mood and symptoms. However, almost none of these studies have attempted to create an application that would utilize the data to assist in real-time decision-making regarding treatment. Some studies have introduced systems that perform interventions in treatment; however, the interventions are relatively simple, and to the best of our knowledge, no application includes advanced AI tools such as autonomous agents [69]. In the following, we describe existing research and applications for monitoring mental health, and we lay out future directions for research.
6.3.5.1 Symptom monitoring
Many of the studies in the field of symptom monitoring have explored the association of clinical states, or transitions in mood states, with data that can be automatically and passively monitored. Some studies have found that physiological measures can predict clinical states. For example, Lanata et al. [70] introduce PSYCHE, a personalized wearable monitoring system designed to improve the management of patients suffering from mental disorders, specifically bipolar disorder. PSYCHE is a t-shirt with embedded sensors that monitor heart-rate variability. The authors demonstrate that PSYCHE is successful in assessing transitions from pathological mood states.
Other studies have shown that the phone usage patterns of patients can indicate their mood state [71–73]. For example,
Grünerbl et al. [72] found that the duration and frequency of phone calls increase for individuals with mild depressive disorders compared to individuals with severe depression or a normal mood state. In addition, physical activity has been shown to be linked to affective states [72,74–77]. In general, individuals with affective disorders tend to have lower activity energy and acceleration compared to healthy individuals [76]. Furthermore, studies have found a relationship between emotional state and the following categories of data monitored by smartphones: voice features [73,74,77], light exposure [71,74], and location changes [71,72,74,78].
Monitored data can also be useful for predicting and preventing relapse. Relapse prevention is an especially important issue among patients diagnosed, hospitalized, and treated for schizophrenia, since up to 40% of those discharged may relapse within a year (even with appropriate treatment) [79]. In Ref. [79], Barnett et al. identified statistically significant anomalies in patient behavior, as measured through smartphone use, in the days prior to a relapse.
A major part of the research in the field has focused on the feasibility of using monitoring devices and on patients' adherence to user guidelines. Many studies that evaluated smartphone-based monitoring have reported patient compliance as a limitation [74,75,78,80], as some patients did not carry their phones with them all of the time or occasionally turned their phones off. In one study, patients noted that they would be interested in using the monitoring smartphone more regularly if transparency concerning the recorded data was guaranteed [80].
Only a small portion of the studies in this field have investigated the clinical outcomes of automated monitoring for mental health. In Ref. [81], Saunders et al. conducted a longitudinal study in which individuals with bipolar disorder monitored their mood daily using a smartphone application for 12 weeks. In a follow-up interview, half of the participants noted that their mood had improved because they were able to better recognize their feelings, and half also reported a change in behavior (e.g., increased exercise levels). In Ref. [82], Wu et al. evaluated, in a 6-month study, an automated telephone assessment system that monitored patients with depression and type 2 diabetes by calling them regularly. The call contents were individually determined by an algorithm that scanned patient medical records and call histories to determine applicable questions. The system also alerted emergency responders to immediately contact patients who exhibited suicidal ideation. The automated system significantly increased both depression remission and patient satisfaction compared to the control group.
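To make the anomaly-detection idea concrete, the following minimal sketch (synthetic data; the actual method of Barnett et al. [79] is more sophisticated) flags days on which a passively sensed behavioral feature, here a daily outgoing-call count, deviates strongly from a patient's own recent baseline.

```python
# Minimal sketch: flag behavioral anomalies against a patient's own baseline.
# Synthetic data; the method of Barnett et al. [79] is more sophisticated.
import numpy as np

rng = np.random.default_rng(5)
calls = rng.poisson(lam=8, size=60).astype(float)  # 60 days of daily call counts
calls[50:] = rng.poisson(lam=2, size=10)           # simulated pre-relapse social withdrawal

window = 14  # personal baseline: the preceding two weeks
for day in range(window, len(calls)):
    baseline = calls[day - window:day]
    z = (calls[day] - baseline.mean()) / (baseline.std() + 1e-9)
    if abs(z) > 2.5:  # flag strong deviations from the patient's own norm
        print(f"day {day}: {calls[day]:.0f} calls (z = {z:+.1f}) -> possible anomaly")
```

The key design choice, shared with the published work, is that each patient is compared against their own history rather than a population norm.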
6.3.5.2 Monitoring compliance with treatment
Another form of monitoring that has been investigated is the monitoring of patient compliance with treatment. Nonadherence to psychotropic medication is a significant issue in mental health treatment. For example, in bipolar disorders, estimates of nonadherence range between 20% and 60%, with nonadherence often leading to negative outcomes [83]. Some studies have tested the effect of medication adherence telemonitoring systems that record the date and time of medication bottle openings. For example, Frangou et al. [84] evaluated a system that included an electronic dispenser that fits on the medicine bottle cap and records the date of each bottle opening. The data were automatically transmitted online to clinicians, who received alerts if adherence dropped below 50%. They tested the system's efficacy among individuals with schizophrenia and found that it significantly improved medication adherence and psychotic symptoms in comparison to the control group.
Bickmore et al. [85] tested an automated system, consisting of an animated agent, which conducts simulated conversations in order to promote medication adherence in individuals with schizophrenia by establishing an emotional relationship with the patient and providing consistent social support. They conducted a 1-month pilot study in which individuals with schizophrenia were instructed to have daily interactions with the agent. During the interaction, the agent inquired about medication adherence and provided tips and suggestions for solving adherence problems. They found that the agent was successful in increasing the rate of adherence among the participants.
An additional problem regarding compliance with treatment, and one associated with poor therapeutic outcomes, is nonattendance of psychotherapy sessions. In Ref. [86], Bruehlman-Senecal et al. found that automated mood-monitoring text messages can be used as a predictor of psychotherapy attendance.
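A dispenser-based adherence alert of the kind just described reduces to a simple computation. The sketch below is an illustrative reconstruction: the 50% alert threshold comes from the study above, but the data format and helper function are invented for the example.

```python
# Illustrative adherence computation from bottle-opening timestamps.
# The 50% alert threshold follows Frangou et al. [84]; the data are invented.
from datetime import date, timedelta

def adherence(openings: list[date], start: date, days: int, doses_per_day: int = 1) -> float:
    """Fraction of expected doses with a recorded bottle opening."""
    window = {start + timedelta(d) for d in range(days)}
    taken = sum(1 for day in openings if day in window)
    return min(taken / (days * doses_per_day), 1.0)

start = date(2021, 3, 1)
# Hypothetical log: the patient opened the bottle on only 12 of 30 days.
openings = [start + timedelta(d) for d in range(0, 24, 2)]

rate = adherence(openings, start, days=30)
print(f"adherence = {rate:.0%}")
if rate < 0.5:
    print("ALERT: adherence below 50%; notify clinician")
```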
6.3.6 Ethical considerations
Psychiatry faces unique challenges in addition to those common to all healthcare disciplines. While we have discussed how AI can offer promising solutions to those challenges, from diagnosis to treatment selection, such implementations create important ethical considerations that need to be addressed if the benefits are to be realized without causing undue harm. This section focuses on ethical issues specific to the use of AI and big data approaches in mental health.
AI algorithms are only as good as the data they are trained on. A widely discussed concern with the use of AI in high-stakes applications like healthcare involves the quality and quantity of data needed and the possible replication or amplification of biases present in the data [87]. This issue is particularly concerning in the mental health sector. First, mental health conditions will probably require more, and more diverse, data than physical conditions to accurately capture their complexity. Indeed, mental disorders are multisystem disorders, affecting mood, perception, cognition, and volition, and are caused by a complex interaction of more proximal biological causes and more distal environmental causes [88] (see also Chapter 2, Big Data Analytics for Healthcare: Theory and Applications). The absence of any accurate etiological or pathophysiological model prevents us from preselecting relevant features, as we do not know how they interact. While it is hoped that AI will help us uncover such interactions, extensive data from all levels or dimensions are deemed necessary to avoid biases. Second, many factors also threaten the quality of mental health data: patients with mental disorders are more subject to treatment noncompliance and high dropout rates in clinical studies [89]. Also, the prominent stigmatization around mental health can threaten the reliability of patients' reports. Moreover, the lack of mental illness biomarkers renders mental health data less quantifiable in general compared to other health conditions. Psychiatric terminology often involves concepts that are subjective and can be interpreted in many ways, sometimes making it difficult to compare the stories of individual patients. All of this increases the risk that an algorithm will give erroneous advice in a mental health context. Social or governmental action to improve mental health services in order to ensure optimal data accessibility and quality may help avoid added discrimination (i.e., via biased algorithm results, or lack of access to useful algorithms because of a paucity of quality data) in an already heavily stigmatized population.
Another well-acknowledged issue in AI relates to the interpretability or transparency of an algorithm [87], that is, the ability to understand the step-by-step path taken by the algorithm to arrive at its conclusion. Transparency seems necessary in order to avoid conflicts of interest and potential malicious uses of Clinical Decision-Support Systems (CDSS), which are aimed at helping clinicians and patients make decisions. For example, a treatment selection CDSS could be programmed to output certain drugs more often in order to generate higher profits for its designers instead of prioritizing clinical outcomes [90]. This is particularly relevant in mental healthcare, where many lines of treatment exist but there is no systematic procedure for
selecting between them (as discussed before). Moreover, since the comprehensibility of an algorithmic decision cannot be achieved without full knowledge of all the features that are input, transparency runs against other ethical ideals, like the privacy of data subjects. Medical confidentiality is a prime principle of medical care, of special importance in the psychiatric setting considering the high level of stigma around mental health. If any third party is in charge of interpreting and explaining a CDSS output to the physician or patient, this could jeopardize medical confidentiality, as they would potentially have access to the patient's sensitive features. To prevent such concerns, efforts and resources should be directed to training physicians and patients on the AI tools and ensuring their autonomy in CDSS usage and interpretation, or to ensuring that interpretability reports generated by third parties are produced without those parties having access to identifying patient information.

Third, a trusting patient–physician relationship represents a central component of mental healthcare. AI products will undoubtedly reshape this relationship's dynamics, whether in the form of a CDSS used only by the physician, an interface to improve daily communication between patients and their physician, or an automated conversational agent used only by the patient [57]. A related concern is the responsibility and liability associated with algorithmic decisions [91]. It seems collectively understood that AI should not replace clinical judgment and that physician input should remain critical at every stage of the clinical process. While relying on physician judgment seems justified for now, as algorithms are still limited and laden with potential biases [92], this will become less evident as the technology continues to develop and improves in accuracy and quality. Would a physician be liable if they disregarded the advice of a high-quality CDSS and this resulted in harm coming to the patient? There are also more subtle ways in which AI can be detrimental to the patient–physician relationship. The AI "narrative," that is, the way of talking about AI and the terms used by the physician in the clinical encounter with the patient, will most certainly have dramatic effects on patient well-being. The placebo effect and physician and patient expectations have long been recognized as playing a significant role in treatment efficacy [93]. This is particularly true in the field of psychiatry [94], where we do not know exactly how most treatments work to relieve symptoms. Considering the hype around AI on the one hand, and the distorted views of its potential risks presented to the public on the other, it will be imperative to regulate the place of AI in the patient–physician relationship and educate clinicians
and patients on AI, big data, and their limitations, to avoid deception or blind compliance of the patient or clinician with AI recommendations.
6.4
Conclusions
As reviewed in this chapter, the significance of AI and big data in mental healthcare can hardly be overstated. These technologies will facilitate diagnosis, prognosis prediction, treatment selection and delivery, and disease monitoring, and will optimize the allocation of healthcare resources, all of which can be utilized to inform public health policy. Making use of technological innovations in mental healthcare is essential in the quest to tackle the current inefficiency of the system, especially considering that the expected burden of mental illness is rising over time [95]. Indeed, medical professionals increasingly recognize the importance of harnessing big data to rectify the dysfunction inherent in the current system, which is one of the reasons that clinicians are adopting the framework set forth by the National Institute of Mental Health (NIMH), the Research Domain Criteria (RDoC) for mental health classification. The data-driven RDoC is "an attempt to create a new kind of taxonomy for mental disorders by bringing the power of modern research approaches in genetics, neuroscience, and behavioral science to the problem of mental illness" [96]. This quantitative system, focused on biology while still making use of symptom tracking and other subjective metrics, plays a new and important role by filling in pieces that have been missing from the complex puzzle of mental illness diagnosis and treatment.

While the introduction of big data into mental healthcare will bring about widespread social and economic benefits, it will also generate its own unique and unprecedented challenges. The most salient of these will be to enforce the ethical development and equitable delivery of AI solutions. Additionally, any robust AI model must be built with comprehensive data that encompass all possible treatment types; importantly, data from people of all races, ethnicities, and socioeconomic backgrounds must be used in model training to avoid bias.

Using big data and machine learning to capture and quantify the heterogeneity within patient diagnosis and treatment response can also elucidate the biological mechanisms underlying the diseases themselves. In other words, the data collected to feed an AI model will inevitably prompt research questions and hypotheses by
highlighting particular variables that are salient in the model’s decision-making processes. For example, if the AI finds that a series of immune markers were heavily weighted in comparing predictions of treatment outcome between two antidepressant medications, then this implies that one medication may be targeting some sort of immune dysregulation, adding weight to the hypothesized link between depression and the immune system. AI is therefore necessary in mental health research in order to disentangle the nonlinear relationships between potential predictive factors and distill individual factors with robust predictive power. In short, data from basic research will feed the AI, and results from the AI will also feed basic research. While this is an exciting time for mental healthcare, a technological reform of such a scale must be implemented in a proactive, careful, and deliberate manner. One must remember that data points in a machine-learning model are representative of real people who are suffering from real mental illnesses. Therein lies the true value of big data in mental health—bringing personalized treatment to a field of medicine in which it is so desperately lacking.
Acknowledgments This work was supported by an Era-PerMed 2020 Grant. The Israeli authors were funded by the Chief Scientist Office, Israeli Ministry of Health (CSO-MOH, IL) as part of grant #3-000015730 within Era-PerMed.
References
[1] National Institute of Mental Health, Any mental illness (AMI) among adults. <http://www.nimh.nih.gov/health/statistics/prevalence/any-mental-illness-ami-among-adults.shtml>, 2017 (Online; accessed 09.10.18).
[2] World Health Organization, Suicide prevention. <http://www.who.int/mental_health/suicide-prevention/en/>, 2018 (Online; accessed 09.10.18).
[3] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (DSM-5), American Psychiatric Publishing, 2013.
[4] M.-J. Fleury, A. Imboua, D. Aubé, L. Farand, Y. Lambert, General practitioners' management of mental disorders: a rewarding practice with considerable obstacles, BMC Family Pract. 13 (1) (2012) 19.
[5] National Institute for Health and Care Excellence, Depression in adults: recognition and management. <https://www.nice.org.uk/guidance/cg90/chapter/1-Guidance>, 2018 (Online; accessed 09.10.18).
[6] K. Malki, R. Keers, M.G. Tosto, A. Lourdusamy, L. Carboni, E. Domenici, et al., The endogenous and reactive depression subtypes revisited: integrative animal and human studies implicate multiple distinct molecular mechanisms underlying major depressive disorder, BMC Med. 12 (1) (2014) 73.
[7] R.B. Ardito, D. Rabellino, Therapeutic alliance and outcome of psychotherapy: historical excursus, measurements, and prospects for research, Front. Psychol. 2 (2011) 270.
[8] G.R. Bond, R.E. Drake, The critical ingredients of assertive community treatment, World Psychiatry 14 (2) (2015) 240–242.
[9] K.S. Kendler, Explanatory models for psychiatric illness, Am. J. Psychiatry 165 (6) (2008) 695–702.
[10] Q.J. Huys, T.V. Maia, M.J. Frank, Computational psychiatry as a bridge from neuroscience to clinical applications, Nat. Neurosci. 19 (3) (2016) 404.
[11] C.R. Marshall, D.P. Howrigan, D. Merico, B. Thiruvahindrapuram, W. Wu, D.S. Greer, et al., Contribution of copy number variants to schizophrenia from a genome-wide study of 41,321 subjects, Nat. Genet. 49 (1) (2017) 27.
[12] M.R. Infurna, C. Reichl, P. Parzer, A. Schimmenti, A. Bifulco, M. Kaess, Associations between depression and specific childhood experiences of abuse and neglect: a meta-analysis, J. Affect. Disord. 190 (2016) 47–55.
[13] M. Minsky, Steps toward artificial intelligence, Proc. IRE 49 (1) (1961) 8–30.
[14] H. Robinson, Dualism, in: E.N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy, fall 2017 Edition, Metaphysics Research Lab, Stanford University, 2017.
[15] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
[16] D. Castelvecchi, Can we open the black box of AI? Nature 538 (7623) (2016) 20.
[17] A. Aboraya, E. Rankin, C. France, A. El-Missiry, C. John, The reliability of psychiatric diagnosis revisited: the clinician's guide to improve the reliability of psychiatric diagnosis, Psychiatry (Edgmont) 3 (1) (2006) 41.
[18] S.E. Hyman, Diagnosing the DSM: diagnostic classification needs fundamental reform, Cerebrum: The Dana Forum on Brain Science, vol. 2011, Dana Foundation, 2011.
[19] J.W. Tiller, Depression and anxiety, Med. J. Aust. 199 (6) (2013) 28–31.
[20] R. Saveanu, A. Etkin, A.-M. Duchemin, A. Goldstein-Piekarski, A. Gyurak, C. Debattista, et al., The international study to predict optimized treatment in depression (iSPOT-D): outcomes from the acute phase of antidepressant treatment, J. Psychiatr. Res. 61 (2015) 1–12.
[21] A.J. Rush, M.H. Trivedi, H.M. Ibrahim, T.J. Carmody, B. Arnow, D.N. Klein, et al., The 16-item quick inventory of depressive symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): a psychometric evaluation in patients with chronic major depression, Biol. Psychiatry 54 (5) (2003) 573–583.
[22] K. Kroenke, R.L. Spitzer, J.B. Williams, The PHQ-9: validity of a brief depression severity measure, J. Gen. Intern. Med. 16 (9) (2001) 606–613.
[23] S. Stewart-Brown, P.C. Samaraweera, F. Taggart, N.-B. Kandala, S. Stranges, Socioeconomic gradients and mental health: implications for public health, Br. J. Psychiatry 206 (6) (2015) 461–465.
[24] M.W. Musso, A.S. Cohen, T.L. Auster, J.E. McGovern, Investigation of the Montreal cognitive assessment (MoCA) as a cognitive screener in severe mental illness, Psychiatry Res. 220 (1–2) (2014) 664–668.
[25] E.H. Rubin, C.F. Zorumski, Perspective: upcoming paradigm shifts for psychiatry in clinical care, research, and education, Acad. Med. 87 (3) (2012) 261–265.
[26] Z.I. Santini, A. Koyanagi, S. Tyrovolas, C. Mason, J.M. Haro, The association between social relationships and depression: a systematic review, J. Affect. Disord. 175 (2015) 53–65.
[27] S. Shalev-Shwartz, S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
[28] R. Iniesta, D. Stahl, P. McGuffin, Machine learning, statistical learning and the future of biological research in psychiatry, Psychol. Med. 46 (12) (2016) 2455–2465.
[29] I.C. Passos, B. Mwangi, F. Kapczinski, Big data analytics and machine learning: 2015 and beyond, Lancet Psychiatry 3 (1) (2016) 13–15.
[30] V. Bird, P. Premkumar, T. Kendall, C. Whittington, J. Mitchell, E. Kuipers, Early intervention services, cognitive behavioural therapy and family intervention in early psychosis: systematic review, Br. J. Psychiatry 197 (5) (2010) 350–356.
[31] K. Kendler, Introduction: why does psychiatry need philosophy, Philosophical Issues in Psychiatry: Explanation, Phenomenology, and Nosology, 2008, pp. 1–16.
[32] M.S. Klinkman, J.C. Coyne, S. Gallo, T.L. Schwenk, False positives, false negatives, and the validity of the diagnosis of major depression in primary care, Arch. Family Med. 7 (5) (1998) 451.
[33] M.L. Birnbaum, S.K. Ernala, A.F. Rizvi, M. De Choudhury, J.M. Kane, A collaborative approach to identifying social media markers of schizophrenia by employing machine learning and clinical appraisals, J. Med. Internet Res. 19 (8) (2017) e289.
[34] M. De Choudhury, S. Counts, E. Horvitz, Predicting postpartum changes in emotion and behavior via social media, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2013, pp. 3267–3276.
[35] B. Saha, T. Nguyen, D. Phung, S. Venkatesh, A framework for classifying online mental health-related communities with an interest in depression, IEEE J. Biomed. Health Inform. 20 (4) (2016) 1008–1015.
[36] M. Conway, D. O'Connor, Social media, big data, and mental health: current advances and ethical implications, Curr. Opin. Psychol. 9 (2016) 77–82.
[37] R. Wang, M.S. Aung, S. Abdullah, R. Brian, A.T. Campbell, T. Choudhury, et al., Crosscheck: toward passive sensing and detection of mental health changes in people with schizophrenia, in: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ACM, 2016, pp. 886–897.
[38] T.J. Trull, U. Ebner-Priemer, Ambulatory assessment, Annu. Rev. Clin. Psychol. 9 (2013) 151–276.
[39] C.H. Xia, Z. Ma, R. Ciric, S. Gu, R.F. Betzel, A.N. Kaczkurkin, et al., Linked dimensions of psychopathology and connectivity in functional brain networks, Nat. Commun. 9 (1) (2018) 3003.
[40] K.A. Grisanzio, A.N. Goldstein-Piekarski, M.Y. Wang, A.P.R. Ahmed, Z. Samara, L.M. Williams, Transdiagnostic symptom clusters and associations with brain, behavior, and daily function in mood, anxiety, and trauma disorders, JAMA Psychiatry 75 (2) (2018) 201–209.
[41] W.H. Day, H. Edelsbrunner, Efficient algorithms for agglomerative hierarchical clustering methods, J. Class. 1 (1) (1984) 7–24.
[42] D. Borsboom, A.O. Cramer, Network analysis: an integrative approach to the structure of psychopathology, Annu. Rev. Clin. Psychol. 9 (2013) 91–121.
[43] D.E. Hinton, L.J. Kirmayer, Local responses to trauma: symptom, affect, and healing (2013).
[44] S.M. Robertson, Neurodiversity, quality of life, and autistic adults: shifting research and professional focuses onto real-life challenges, Disabil. Stud. Q. 30 (1) (2009).
[45] M. Prince, V. Patel, S. Saxena, M. Maj, J. Maselko, M.R. Phillips, et al., No health without mental health, Lancet 370 (9590) (2007) 859–877.
[46] A. Mezulis, R.H. Salk, J.S. Hyde, H.A. Priess-Groben, J.L. Simonson, Affective, biological, and cognitive predictors of depressive symptom trajectories in adolescence, J. Abnorm. Child. Psychol. 42 (4) (2014) 539–550.
[47] V. Patel, A.J. Flisher, S. Hetrick, P. McGorry, Mental health of young people: a global public-health challenge, Lancet 369 (9569) (2007) 1302–1313.
[48] A. Rosenfeld, S. Kraus, Predicting human decision-making: from prediction to action, Synth. Lect. Artif. Intell. Mach. Learn. 12 (1) (2018) 1–150.
[49] M. Bares, M. Kopecek, T. Novak, P. Stopkova, P. Sos, J. Kozeny, et al., Low frequency (1-Hz), right prefrontal repetitive transcranial magnetic stimulation (rTMS) compared with venlafaxine ER in the treatment of resistant depression: a double-blind, single-centre, randomized study, J. Affect. Disord. 118 (1–3) (2009) 94–100.
[50] H.A. Sackeim, Modern electroconvulsive therapy: vastly improved yet greatly underused, JAMA Psychiatry 74 (8) (2017) 779–780.
[51] P. Bower, S. Gilbody, Stepped care in psychological therapies: access, effectiveness and efficiency: narrative literature review, Br. J. Psychiatry 186 (1) (2005) 11–17.
[52] D.M. Clark, Implementing NICE guidelines for the psychological treatment of depression and anxiety disorders: the IAPT experience, Int. Rev. Psychiatry 23 (4) (2011) 318–327.
[53] J. Delgadillo, O. Moreea, W. Lutz, Different people respond differently to therapy: a demonstration using patient profiling and risk stratification, Behav. Res. Ther. 79 (2016) 15–22.
[54] C.L. McGrath, M.E. Kelley, P.E. Holtzheimer, B.W. Dunlop, W.E. Craighead, A.R. Franco, et al., Toward a neuroimaging treatment selection biomarker for major depressive disorder, JAMA Psychiatry 70 (8) (2013) 821–829.
[55] R.J. DeRubeis, Z.D. Cohen, N.R. Forand, J.C. Fournier, L.A. Gelfand, L. Lorenzo-Luaces, The personalized advantage index: translating research on prediction into individualized treatment recommendations. A demonstration, PLoS One 9 (1) (2014) e83875.
[56] D. Benrimoh, R. Fratila, S. Israel, K. Perlman, N. Mirchi, S. Desai, et al., Aifred health, a deep learning powered clinical decision support system for mental health, The NIPS '17 Competition: Building Intelligent Systems, The Springer Series on Challenges in Machine Learning (2018) 251–287. Available from: https://doi.org/10.1007/978-3-319-94042-7_13.
[57] K.K. Fitzpatrick, A. Darcy, M. Vierhile, Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial, JMIR Ment. Health 4 (2) (2017) e19. Available from: https://doi.org/10.2196/mental.7785.
[58] D.D. Burns, When Panic Attacks: The New, Drug-Free Anxiety Therapy That Can Change Your Life, Morgan Road Books, 2007.
[59] D.D. Burns, Feeling Good: The New Mood Therapy, Harper, 2009.
[60] J. Towery, The Anti-Depressant Book: A Practical Guide for Teens and Young Adults to Overcome Depression and Stay Healthy, Jacob Towery, 2016.
[61] M. Bradway, R. Joakimsen, A. Grøttland, E. Årsand, The potential use of patient-gathered data from mhealth tools: suggestions based on an RCT study, Int. J. Integr. Care 16 (5) (2016) 8.
[62] C. Kennedy, Pear approval signals FDA readiness for digital treatments, Nat. Biotechnol. 36 (6) (2018) 481.
[63] R.J. Davidson, J. Kabat-Zinn, J. Schumacher, M. Rosenkranz, D. Muller, S.F. Santorelli, et al., Alterations in brain and immune function produced by mindfulness meditation, Psychosom. Med. 65 (4) (2003) 564–570.
[64] C. Noone, M.J. Hogan, Improvements in critical thinking performance following mindfulness meditation depend on thinking dispositions, Mindfulness 9 (2) (2018) 461–473.
[65] B. Chwyl, A.G. Chung, R. Amelard, J. Deglint, D.A. Clausi, A. Wong, Sapphire: stochastically acquired photoplethysmogram for heart rate inference in realistic environments, in: Image Processing (ICIP), 2016 IEEE International Conference on, IEEE, 2016, pp. 1230–1234.
[66] B. Chwyl, A.G. Chung, R. Amelard, J. Deglint, D.A. Clausi, A. Wong, Time-frequency domain analysis via pulselets for non-contact heart rate estimation from remotely acquired photoplethysmograms, in: Computer and Robot Vision (CRV), 2016 13th Conference on, IEEE, 2016, pp. 201–207.
[67] J.F. Thayer, F. Åhs, M. Fredrikson, J.J. Sollers III, T.D. Wager, A meta-analysis of heart rate variability and neuroimaging studies: implications for heart rate variability as a marker of stress and health, Neurosci. Biobehav. Rev. 36 (2) (2012) 747–756.
[68] J. Firth, J. Cotter, J. Torous, S. Bucci, J.A. Firth, A.R. Yung, Mobile phone ownership and endorsement of "mhealth" among people with psychosis: a meta-analysis of cross-sectional studies, Schizophrenia Bull. 42 (2) (2015) 448–455.
[69] S.J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Pearson Education Limited, Malaysia, 2016.
[70] A. Lanata, G. Valenza, M. Nardelli, C. Gentili, E.P. Scilingo, Complexity index from a personalized wearable monitoring system for assessing remission in mental health, IEEE J. Biomed. Health Inform. 19 (1) (2015) 132–139.
[71] M.N. Burns, M. Begale, J. Duffecy, D. Gergle, C.J. Karr, E. Giangrande, et al., Harnessing context sensing to develop a mobile intervention for depression, J. Med. Internet Res. 13 (3) (2011) e55.
[72] A. Grünerbl, P. Oleksy, G. Bahle, C. Haring, J. Weppner, P. Lukowicz, Towards smart phone based monitoring of bipolar disorder, in: Proceedings of the Second ACM Workshop on Mobile Systems, Applications, and Services for HealthCare, ACM, 2012, p. 3.
[73] A. Muaremi, F. Gravenhorst, A. Grünerbl, B. Arnrich, G. Tröster, Assessing bipolar episodes using speech cues derived from phone calls, in: International Symposium on Pervasive Computing Paradigms for Mental Health, Springer, 2014, pp. 103–114.
[74] S. Abdullah, M. Matthews, E. Frank, G. Doherty, G. Gay, T. Choudhury, Automatic detection of social rhythms in bipolar disorder, J. Am. Med. Inform. Assoc. 23 (3) (2016) 538–543.
[75] T. Beiwinkel, S. Kindermann, A. Maier, C. Kerl, J. Moock, G. Barbian, et al., Using smartphones to monitor bipolar disorder symptoms: a pilot study, JMIR Ment. Health 3 (1) (2016) e2.
[76] M. Faurholt-Jepsen, S. Brage, M. Vinberg, E.M. Christensen, U. Knorr, H.M. Jensen, et al., Differences in psychomotor activity in patients suffering from unipolar and bipolar affective disorder in the remitted or mild/moderate depressive state, J. Affect. Disord. 141 (2–3) (2012) 457–463.
[77] R.F. Dickerson, E.I. Gorlin, J.A. Stankovic, Empath: a continuous remote emotional health monitoring system for depressive illness, in: Proceedings of the 2nd Conference on Wireless Health, ACM, 2011, p. 5.
[78] A. Gruenerbl, V. Osmani, G. Bahle, J.C. Carrasco, S. Oehler, O. Mayora, et al., Using smart phone mobility traces for the diagnosis of depressive and manic episodes in bipolar patients, in: Proceedings of the 5th Augmented Human International Conference, ACM, 2014, p. 38.
[79] I. Barnett, J. Torous, P. Staples, L. Sandoval, M. Keshavan, J.-P. Onnela, Relapse prediction in schizophrenia through digital phenotyping: a pilot study, Neuropsychopharmacology (2018) 1.
[80] M. Dang, C. Mielke, A. Diehl, R. Haux, Accompanying depression with FINE, a smartphone-based approach, in: MIE, 2016, pp. 195–199.
[81] K. Saunders, A. Bilderbeck, P. Panchal, L. Atkinson, J. Geddes, G. Goodwin, Experiences of remote mood and activity monitoring in bipolar disorder: a qualitative study, Eur. Psychiatry 41 (2017) 115–121.
[82] S. Wu, K. Ell, H. Jin, I. Vidyanti, C.-P. Chou, P.-J. Lee, et al., Comparative effectiveness of a technology-facilitated depression care management model in safety-net primary care patients with type 2 diabetes: 6-month outcomes of a large clinical trial, J. Med. Internet Res. 20 (4) (2018) e147.
[79] I. Barnett, J. Torous, P. Staples, L. Sandoval, M. Keshavan, J.-P. Onnela, Relapse prediction in schizophrenia through digital phenotyping: a pilot study, Neuropsychopharmacology (2018) 1. [80] M. Dang, C. Mielke, A. Diehl, R. Haux, Accompanying depression with fine-a smartphone-based approach., in: MIE, 2016, pp. 195 199. [81] K. Saunders, A. Bilderbeck, P. Panchal, L. Atkinson, J. Geddes, G. Goodwin, Experiences of remote mood and activity monitoring in bipolar disorder: a qualitative study, Eur. Psychiatry 41 (2017) 115 121. [82] S. Wu, K. Ell, H. Jin, I. Vidyanti, C.-P. Chou, P.-J. Lee, et al., Comparative effectiveness of a technology-facilitated depression care management model in safety-net primary care patients with type 2 diabetes: 6-month outcomes of a large clinical trial, J. Med. Internet Res. 20 (4) (2018) e147. [83] J.B. Levin, J. Sams, C. Tatsuoka, K.A. Cassidy, M. Sajatovic, Use of automated medication adherence monitoring in bipolar disorder research: pitfalls, pragmatics, and possibilities, Therapeutic Adv. Psychopharmacol. 5 (2) (2015) 76 87. [84] S. Frangou, I. Sachpazidis, A. Stassinakis, G. Sakas, Telemonitoring of medication adherence in patients with schizophrenia, Telemed. J. E-Health 11 (6) (2005) 675 683. [85] T.W. Bickmore, K. Puskar, E.A. Schlenk, L.M. Pfeifer, S.M. Sereika, Maintaining reality: relational agents for antipsychotic medication adherence, Interact. Comput. 22 (4) (2010) 276 288. [86] E. Bruehlman-Senecal, A. Aguilera, S.M. Schueller, Mobile phone-based mood ratings prospectively predict psychotherapy attendance, Behav. Ther. 48 (5) (2017) 614 623. [87] B.D. Mittelstadt, P. Allo, M. Taddeo, S. Wachter, L. Floridi, The ethics of algorithms: Mapping the debate, Big Data Soc. 3 (2) (2016). [88] K.F. Schaffner, Etiological models in psychiatry: reductive and nonreductive approaches. [89] A. Chen, Noncompliance in community psychiatry: a review of clinical interventions, Psychiatr. Serv. 42 (3) (1991) 282 287. [90] D.S. Char, N.H. Shah, D. Magnus, Implementing machine learning in health care-addressing ethical challenges, N. Engl. J. Med. 378 (11) (2018) 981. [91] Ordre National des me´decins, Doctors and patients in the world of data, algorithms and artificial intelligence. [92] B. Friedman, H. Nissenbaum, Bias in computer systems, ACM Trans. Inf. Syst. (TOIS) 14 (3) (1996) 330 347. [93] H. Benson, M.D. Epstein, The placebo effect: a neglected asset in the care of patients, JAMA 232 (12) (1975) 1225 1227. [94] I. Kirsch, Antidepressants and the placebo effect, Z. fu¨r Psychologie 222 (3) (2014) 128. [95] M. Thyloth, H. Singh, V. Subramanian, et al., Increasing burden of mental illnesses across the globe: current status, Indian. J. Soc. Psychiatry 32 (3) (2016) 254. [96] T.R. Insel, J.A. Lieberman, Dsm-5 and rdoc: shared interests. ,http:// publichealthunited.org/pressreleases/DSM5andRDoCSharedInterests.pdf., 2013.
Big Data based breast cancer prediction using kernel support vector machine with the Gray Wolf Optimization algorithm
7
T. Jayasankar1, N.B. Prakash2 and G.R. Hemalakshmi3
1Electronics and Communication Engineering Department, University College of Engineering, BIT Campus, Anna University, Tiruchirappalli, India; 2Department of Electrical and Electronics Engineering, National Engineering College, Kovilpatti, India; 3Department of Computer Science and Engineering, National Engineering College, Kovilpatti, India
Abstract Big data is a term used to indicate collections of data of very large size and continuous exponential growth over time, including unstructured and semistructured data. Today, big data in healthcare is often used to predict disease. Breast cancer is one of the main cancers occurring in women and is the second leading cause of cancer death among women in the United States and in Asian countries. If the disease is recognized at an early stage, there is a greater chance of recovery. In this work, optimal features are selected using oppositional grasshopper optimization (OGHO). These features are then processed in the training phase by the kernel support vector machine with the Gray Wolf Optimization algorithm (KSVMGWO), which is used to predict breast cancer. The approach is evaluated on the Wisconsin Breast Cancer Database (Original) from the UCI machine learning repository.
Keywords: Support vector machine; breast cancer; big data; Gray Wolf Optimization algorithm; Wisconsin Cancer Database
7.1
Introduction
Malignancy is a collection of diseases caused by the uncontrolled development of cells in the human body. From a clinical point of view, it is a harmful neoplasm that aggregates around it an
overwhelming mass, known as a tumor, which carries the risk of not being curable. Tumors are basically of two types, benign and malignant [1]. Breast cancer is the most widely recognized cancer in women and at some point affects about 10% of women in their lifetime. It is the second most common cause of women's cancer deaths after lung cancer; breast cancer accounts for 25% of all malignancies in women, including 12% of all new cases. Big data has grown in value because it has been used to obtain business intelligence, to analyze company data, and in data mining to obtain reports and evaluate results [2]. A malignant tumor is created when the cells of the breast tissue divide and grow without the normal controls on cell growth and division. Although malignant breast cancer is the second leading cause of cancer death in women, survival after early detection remains high [3]. Breast cancer is a leading cause of death for women and the second most dangerous malignancy after lung disease; according to reports from the World Cancer Research Fund, over 2 million new cases were recorded in 2018, of which 626,679 were estimated to be fatal [4]. One of the fundamental factors linked to long-term mortality is the size of the tumor, and it is suggested to use advances in screening to identify breast cancers while they are smaller than 2 cm. Breast cancer subtypes vary considerably in their pathological phenotypes and have distinct recommended treatment plans. Although the onset rates of all breast cancer subtypes increase with age, they usually have distinct tumor sizes: the Luminal A subtype generally has a smaller tumor size than the Luminal HER2 subtype [5]. The motivation behind these predictions is to distinguish situations in which the malignancy has not recurred from situations in which the disease has recurred at some point. In this way, breast cancer detection and prediction are essentially well-characterized classification problems, which have attracted many analysts in the areas of computational intelligence, data mining, and statistics [6]. Data mining has found significant application in the clinical area; the principle of using this technology is to transform raw data into increasingly meaningful information, such as a person's likelihood of developing malignant breast growth [7]. Big data analysis offers a good answer for collecting, storing, and analyzing countless mammographic images. The big data life cycle has a predictive structure at its core, which can improve the handling of clinical conditions by driving the predictive structure, evaluating statistical
improvements, and discriminating among the various clinical pictures. It also offers better solutions for monitoring health information [8]. With the emergence of these tools and the use of machine learning (ML) strategies in cancer studies, predictions have become progressively more accurate, depending on the discovery of new and improved information on data sourcing, aggregation, prediction, and processing. Our work applies ML strategies to the prediction of survival in malignant breast cancer [9]. In bioinformatics, data mining can be applied in many settings, for example gene analysis, protein identification, biomarker pattern recognition, protein function inference, disease assessment, infection assessment, development of drugs against diseases, recovery of functional networks of proteins and genes, data cleaning, and prediction of subcellular protein locations [10]. Most strategies treat this task as a binary classification problem, which, on the whole, has moderately high false negatives (FN) and false positives (FP); normally an FN classifies mitosis as nonmitosis, while an FP classifies nonmitosis as mitosis [11]. For disease prediction, data mining systems are applied together to build a strategy for detecting the presence of malignancy in a specific patient. When starting on a data mining problem, it is essential to accumulate all the collected information into a series of cases; the assimilation of information from various sources presents many difficulties [12]. In existing frameworks, all information is handled manually, implying that records must be stored and entered by hand, which makes it difficult to visualize the information accurately. In addition, current practice takes longer to predict whether a patient has breast cancer, the specific stage of the disease cannot be anticipated early, and the delay presents a high risk [13]. The big data life cycle has an automated predictive structure, which can improve clinical decision support by refreshing the forecasting structure, evaluating factual updates, and discriminating among different types of disease [14]. The vast majority of disease severity evaluations, like tumor grading, inspect only the region that contains the suspicious tumor [15].
7.2
Literature survey
Padmapriya et al. [16] observed that classification algorithms are regularly used to organize the different
Chapter 7 Big Data based breast cancer prediction
types of information accessible into various archives that have applications in reality. The main objective of this exploration work is to discover the presentation of group calculations in the examination of information on malignant breast growth through the search for mammographic images as indicated by its qualities. Some estimates of the characteristics of mammography images affected by malignancy are currently under review. Patient’s diets, patient’s age, lifestyle, vocation, problems identified with disease, and other data are taken into consideration for the agreement. Finally, the presentation of the J48, classification and regression trees (CART), and ADTree grouping calculations is provided with its precision. The accuracy of the calculations performed is estimated by various estimates, such as clarifications, feasibility, and kappa measurements. Malignancy is a serious problem worldwide. It is a disease that is serious most of the time and has affected the lives of many people and will continue to affect the lives of many others. Malignant breast cancer is now the leading cause of disease death in women and has become the best-known malignant growth in women in both created and emerging countries. Early identification is the best method for reducing the passage of breast disease. However, early identification requires a precise and solid indicative methodology that allows specialists to separate lovable breast tumors from harmful tumors without relying on a careful biopsy. Then, create a premonitory model using data mining strategies to distinguish the wrapped information and build a model interface for breast disease that assists social security experts in their demonstration choices and organization of activities treatment. Ayele et al. [17] proposed a six-step hybrid knowledge discovery process model, due to the idea of the problem and the properties of the dataset. The characterization procedure, for example, the tree of choice J48, Naive Bayes, and the registration of the CART rule was used to create the models. The execution of the model is analyzed using precision, true positive rate (TPR), true negative rate (TNR), and the area under the elbow ROC. The chosen J48 shaft is the most efficient with an accuracy of 94.82%. Malignant breast growth is an important reason for a given death, unlike any other single disease. Breast malignancy has become the most dangerous type of disease in things around the world. The early localization of breast cancer is essential to reduce the loss of rapidly human. This record presented a correlation of the different mining classifiers from the Wisconsin Breast Cancer (WBC) breast malignancy database, using the precision performed. Ravi Kumar [18] presented a comparative
study of different data mining techniques on the WBC dataset, dividing the data into a training set of 499 records and a test set of 200 records. Six classification procedures were reviewed in the Weka software, and the experimental results showed that the Support Vector Machine (SVM) has higher prediction accuracy than the other techniques. Different techniques for identifying breast pathology were studied and contrasted, along with their accuracy; from these results we deduce that SVMs are best suited to solving the breast cancer prediction classification problem, and we suggest the use of these methodologies in comparable classification problems. To the extent that science carries out significant life-saving work, healthcare also finds significant help in it. Breast cancer is among the most commonly diagnosed cancers in women, accounting for an estimated 627,000 deaths, and its final stages result in a high mortality rate. As a potential contribution of technological improvement, data mining finds several applications in breast cancer prediction. Vivek Kumar et al. [19] focused on various classification procedures for the prediction of malignant and benign breast disease. The Wisconsin breast cancer dataset from the University of California, Irvine (UCI) repository was used as the test data, while clump thickness was used as the classification class. Twelve algorithms were studied on this dataset: Ada Boost M1, decision table, J-Rip, J48, Lazy IBK, Lazy K-star, logistic regression, Multiclass Classifier, Multilayer Perceptron, Naive Bayes, Random Forest, and Random Tree. Breast cancer is a common disease that systematically affects large numbers of women; it is among the best-known cancers and a leading cause of cancer death among women all over the world. Classification and data mining are practical methods for organizing information, especially in the clinical setting, where such strategies are routinely used to support diagnosis and decision-making. Asri et al. [20] proposed a performance comparison between several machine-learning algorithms, SVM, Decision Tree (C4.5), Naive Bayes (NB), and k-Nearest Neighbors (k-NN), on the original WBC dataset. The main objective was to assess the correctness of each algorithm in classifying the data, in terms of efficiency and effectiveness measured by accuracy, precision, sensitivity, and specificity. The experimental results showed that SVM gives the highest accuracy (97.13%) with the lowest error rate. All experiments were executed within a simulation environment in the WEKA data mining tool.
The rapid development of big data analytics is a fundamental engine of the knowledge exchange that healthcare services have yet to exploit fully. It has provided apparatus to collect, manage, examine, and assimilate the large volumes of heterogeneous and ambiguous data created by health facilities, and big data analysis has recently been used to aid disease diagnosis and the research process. Nonetheless, improving recognition and research still requires answers to the problems posed by the enormous scale of the data. Sakthidharan et al. [21] improved ML formulations for the prediction of chronic disease outbreaks in disease-frequent communities, presenting a new convolutional neural network based multimodal disease risk prediction (CNN-MDRP) algorithm that uses structured and unstructured hospital clinical data. Bellaachia et al. [22] proposed a study of survival prediction for breast cancer patients using data mining systems, on the publicly available SEER data. The data comprised 151,886 records across 16 fields of the SEER database. Three data mining techniques were inspected: Naive Bayes, the back-propagated neural network, and the C4.5 decision tree algorithm. Several experiments were conducted with these algorithms; the prediction performance achieved is comparable to existing procedures, but the C4.5 algorithm was found to far outperform the other two strategies.
7.3
Proposed methodology
Breast cancer is a deadly form of cancer that mainly affects women around the world. This section explores the concepts of big data and predictive analytics for breast cancer prediction. The overall methodology is organized on three levels: the conceptual model is described by the three stages of preprocessing, feature extraction, and classification. After preprocessing, the essential required features are extracted with the purpose of generating meaningful descriptors from the recorded dataset. In the feature extraction phase, the features are used to classify a cancer in the dataset as benign or malignant. Feature reduction also takes place together with feature extraction: repetitions or information not required in the data are removed or filtered out. After feature extraction and reduction, the next phase involves
[Figure 7.1 (Proposed big data based breast cancer prediction) depicts the pipeline: the input big data set undergoes preprocessing and feature extraction, optimal features are selected using OGHO, and the training and testing phases feed the KSVMGWO classifier, which yields the predictive model.]
classification. The overall process of the proposed model is shown in Fig. 7.1.
7.3.1
Preprocessing
The preprocessing phase is employed to remove unwanted data, which would otherwise cause confusion and introduce unnecessary information. The preliminary preparation of the data made up a large part of this study: at this stage, all records are checked and stored so that sufficient information can later be extracted from the preprocessed data.
7.3.2
Feature selection
Feature selection is simply the technique of removing features from the given dataset that are irrelevant or redundant with one another. It is used in various areas such as pattern recognition,
data mining, and machine learning. When the dataset contains a large number of features, it can be difficult to handle so much information in certain circumstances; therefore, most specialists use feature selection systems. The main goal of this method is to reduce complexity and increase accuracy by expelling random and unnecessary features; evacuating irrelevant information in this way also reduces the time complexity. Here, the OGHO approach is used.
7.3.2.1 Oppositional grasshopper optimization algorithm
We use an oppositional grasshopper optimization algorithm for the feature selection process. The grasshopper is one of the most abundant insects on the planet; although it damages agricultural and horticultural production, its swarming behavior is instructive. Grasshoppers occur naturally in swarms that are probably the largest of all creatures, and the size of a swarm can be devastating for farmers. Swarming is observed both in the nymph and in the adult stage: the millions of nymphs jump and move like rolling cylinders, eating all the vegetation on their way, while adults form well-structured swarms in the air and can migrate over large distances. In the larval phase, slow movement in small steps is the main characteristic of the swarm; in adulthood, abrupt movements over long ranges are its essential feature. Another basic aspect of the swarm is the search for food sources. Nature-inspired algorithms divide the search procedure into two phases, exploration and exploitation: in exploration, search agents are encouraged to move abruptly, while in exploitation they move locally. Grasshoppers perform both of these functions, as well as target seeking, naturally, so a new nature-inspired algorithm can be structured by modeling this behavior mathematically. A set of training samples is provided during this process, with each sample assigned to a specific class.
Step 1: The mathematical model of the swarming behavior of grasshoppers is as follows:

$$G_i = S_i + F_i + W_i \tag{7.1}$$

where $G_i$ defines the position of the ith grasshopper, $S_i$ is the social interaction, $F_i$ is the gravity force on the ith grasshopper, and $W_i$ shows the wind advection that drives the movement.
Step 2: To modify the conventional grasshopper algorithm, an opposition strategy is introduced. Following the opposition-based learning (OBL) scheme presented by Tizhoosh in 2005, the current agent and its opposite are considered simultaneously to obtain a candidate that is better than the agent's current position; an opposite solution is believed to be closer to the global optimum than a random solution. The positions of the opposite population $OG_m$ are characterized by Eq. (7.2):

$$OG_m = \left(og_m^1, og_m^2, \ldots, og_m^d\right) \tag{7.2}$$

where $og_m^d = Low_m + Ug_m - g_m^d$, with $og_m^d \in [Low_m, Ug_m]$, is the position of the mth opposite solution in the dth dimension.
Step 3: To introduce random behavior, the model can be written as $g_i = r_1 S_i + r_2 F_i + r_3 W_i$, where $r_1$, $r_2$, and $r_3$ are random numbers in [0, 1]. The social component is

$$S_i = \sum_{\substack{j=1 \\ j \neq i}}^{N} s(d_{ij})\,\hat{d}_{ij} \tag{7.3}$$

where $d_{ij}$ is the distance between the ith and the jth grasshopper, computed as $d_{ij} = |g_j - g_i|$, s is a function defining the strength of the social forces, and $\hat{d}_{ij} = \frac{g_j - g_i}{d_{ij}}$ is a unit vector from the ith grasshopper to the jth grasshopper.
Step 4: The function s that defines the social forces is computed as

$$s(r) = A\,e^{-r/\ell} - e^{-r} \tag{7.4}$$

where A indicates the intensity of attraction and $\ell$ is the attractive length scale. The function s shapes the social interaction (attraction and repulsion) of the grasshoppers.
Step 5: The gravity component F in Eq. (7.1) is computed as

$$F_i = -h\,\hat{e}_h \tag{7.5}$$

where h is the gravitational constant and $\hat{e}_h$ is a unit vector toward the center of the earth.
Step 6: The wind component W in Eq. (7.1) is computed as

$$W_i = c\,\hat{e}_v \tag{7.6}$$

where c is a constant drift and $\hat{e}_v$ is a unit vector in the direction of the wind.
Step 7: Nymph grasshoppers have no wings, so their movements are strongly correlated with the direction of the wind. Substituting S, F, and W into Eq. (7.1), the model can be expanded as

$$G_i = \sum_{\substack{j=1 \\ j \neq i}}^{N} s\left(|g_j - g_i|\right)\frac{g_j - g_i}{d_{ij}} - h\,\hat{e}_h + c\,\hat{e}_v \tag{7.7}$$

where $s(r) = A\,e^{-r/\ell} - e^{-r}$ and N is the number of grasshoppers. Since nymph grasshoppers land on the ground, their position should not drop below a threshold. This form, however, is not used directly in swarm simulation and optimization, since it prevents the algorithm from exploring and exploiting the search space around a solution; the swarm here is considered in free space. Eq. (7.7) is valuable for reproducing grasshopper interaction, and the optimization form is

$$G_i^d = C\left(\sum_{\substack{j=1 \\ j \neq i}}^{N} C\,\frac{ub_d - lb_d}{2}\, s\left(\left|g_j^d - g_i^d\right|\right)\frac{g_j - g_i}{d_{ij}}\right) + \hat{T}_d \tag{7.8}$$

where $ub_d$ is the upper bound in the dth dimension, $lb_d$ is the lower bound in the dth dimension, $\hat{T}_d$ is the value of the dth dimension of the target (the best solution found so far), and C is a decreasing coefficient that shrinks the comfort, repulsion, and attraction zones. Note that the summation term is similar to the S component in Eq. (7.1); however, gravity is not considered (no F component), and the wind direction (W component) is assumed to point always toward the target $\hat{T}_d$. The next position of a grasshopper is thus defined by its current position, the target position, and the positions of all the other grasshoppers, as shown in Eq. (7.8); the states of all grasshoppers are used to characterize the positions of the search agents around the target. The coefficient C is updated as

$$C = c_{max} - l\,\frac{c_{max} - c_{min}}{L} \tag{7.9}$$

where $c_{max}$ is the maximum value, $c_{min}$ is the minimum value, l indicates the current iteration, and L is the maximum number of iterations. The best target position obtained so far is updated in each iteration; furthermore, the factor C of Eq. (7.9) and the distances between grasshoppers are normalized in each iteration. The position update is performed iteratively until the last iteration is reached, and the position and fitness of the best target are finally returned as the best approximation of the global optimum. The simulations and discussion above show the adequacy of the GOA in finding the global optimum in a search space.
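To make the update concrete, the following Python sketch implements the opposition-based initialization of Eq. (7.2) and the GOA position update of Eqs. (7.8) and (7.9). The objective function, bounds, and parameter values are illustrative assumptions, not the authors' exact configuration; the objective is minimized here, so for feature selection one could minimize, for example, the negative classification accuracy of a feature subset.

```python
import numpy as np

def s(r, A=0.5, l=1.5):
    """Social-force function s(r) = A*exp(-r/l) - exp(-r) of Eq. (7.4)."""
    return A * np.exp(-r / l) - np.exp(-r)

def ogho(objective, lb, ub, n=30, L=100, c_max=1.0, c_min=1e-4, seed=0):
    """Oppositional GOA sketch: OBL initialization plus the Eq. (7.8) update."""
    rng = np.random.default_rng(seed)
    dim = len(lb)
    G = lb + rng.random((n, dim)) * (ub - lb)
    G = np.vstack([G, lb + ub - G])                    # opposite solutions, Eq. (7.2)
    G = G[np.argsort([objective(g) for g in G])[:n]]   # keep the n best (minimization)
    target = G[0].copy()                               # best solution found so far
    for t in range(L):
        c = c_max - t * (c_max - c_min) / L            # decreasing coefficient, Eq. (7.9)
        for i in range(n):
            step = np.zeros(dim)
            for j in range(n):
                if j == i:
                    continue
                d = np.linalg.norm(G[j] - G[i]) + 1e-12
                step += c * (ub - lb) / 2 * s(d) * (G[j] - G[i]) / d
            G[i] = np.clip(c * step + target, lb, ub)  # position update, Eq. (7.8)
        best = min(G, key=objective)
        if objective(best) < objective(target):
            target = best.copy()
    return target
```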
7.3.3
Kernel based support vector machine with Gray Wolf Optimization
The initially proposed training and test database is the breast cancer database, which depends on big data. The extensive use of this database is attributable to its characteristics: it contains a large number of accessible samples and interesting patient features, and only a few attribute values are missing from it. The proposed model is first trained and then tested on this existing database. In this sense, it is proposed to build a database that meets the requirements, improving the size and accuracy of the data collection. The process involved in the SVM model is depicted in Fig. 7.2.
Training phase: The output of the feature selection stage is given as the input to the classification stage. The samples may not be linearly separable in the input space, so at each stage all possible separating boundaries are considered. In the Lagrangian formulation, the normal vector of the separating hyperplane appears only through inner products, so a kernel function can be applied. A kernel denotes a function that operates on pairs of samples under some feature mapping; however, mapping the samples explicitly into a space of much higher dimensionality could entail an excessive evaluation time and immense memory requirements. As a result, a kernel function is employed with which the inner product in the high-dimensional space can be computed efficiently. The kernel is defined as follows:

$$K(P, Q) = \varphi(P)^{T} \varphi(Q) \tag{7.10}$$
The available kernels include the linear kernel, the polynomial kernel, the quadratic kernel, the sigmoid kernel, and the radial basis function kernel. The expressions for the different kernels are given below.
[Figure 7.2 (Proposed support vector machine): the selected features are fed to the SVM classifier, which assigns each sample to a class.]
For the linear kernel:

$$k_{lin}(P, Q) = p^{T} q + c \tag{7.11}$$

where $p^{T} q$ is the inner product of the two input vectors and c is a constant.
For the quadratic kernel:

$$k_{quad}(P, Q) = 1 - \frac{\|p - q\|^{2}}{\|p - q\|^{2} + c} \tag{7.12}$$

where p and q are vectors in the input space.
For the polynomial kernel:

$$k_{poly}(P, Q) = \left(\lambda\, p^{T} q + c\right)^{e}, \quad \lambda > 0 \tag{7.13}$$

For the sigmoid kernel:

$$k_{sig}(P, Q) = \tanh\left(\lambda\, p^{T} q + c\right), \quad \lambda > 0 \tag{7.14}$$
The SVM's effectiveness depends strongly on choosing a suitable kernel. If the feature space is not linearly separable, the radial basis function kernel can map it so that the problem becomes linearly separable. Similarly, consolidating two kernel functions can exceed the accuracy achieved using a single kernel. In the proposed strategy, a single KSVM is used for classification, and two kernel functions, the linear and the quadratic kernel, are combined to capture complementary relationships. The combined kernel $avg_k(P, Q)$ is the average of Eqs. (7.11) and (7.12), given below:

$$avg_k(P, Q) = \frac{1}{2}\left(k_{lin}(p, q) + k_{quad}(p, q)\right) \tag{7.15}$$

$$avg_k(P, Q) = \frac{1}{2}\left(\left(p^{T} q + c\right) + \left(1 - \frac{\|p - q\|^{2}}{\|p - q\|^{2} + c}\right)\right) \tag{7.16}$$
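A minimal sketch of this averaged linear plus quadratic kernel, written as a callable kernel for scikit-learn's SVC, is shown below; the constant c is an assumed hyperparameter, not a value given in the text.

```python
import numpy as np
from sklearn.svm import SVC

def avg_kernel(P, Q, c=1.0):
    """Averaged linear + quadratic kernel of Eqs. (7.15)-(7.16)."""
    lin = P @ Q.T + c                        # linear kernel, Eq. (7.11)
    sq = (np.sum(P**2, axis=1)[:, None]      # pairwise squared distances ||p - q||^2
          - 2 * P @ Q.T
          + np.sum(Q**2, axis=1)[None, :])
    quad = 1.0 - sq / (sq + c)               # quadratic kernel, Eq. (7.12)
    return 0.5 * (lin + quad)                # average, Eq. (7.15)

clf = SVC(kernel=avg_kernel)  # fit/predict as with any scikit-learn classifier
```

In the full KSVMGWO model, quantities such as c and the SVM penalty parameter are the kind of values that could be tuned by the Gray Wolf Optimizer described next.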
The kernel SVM thus uses two kernels, linear and quadratic, to represent the classification decision rules; by consolidating the two results, the resulting rule is strengthened and further developed for classification. The Gray Wolf Optimizer (GWO) is another technique that can be suitably applied to solve the present problem. The GWO imitates the social organization and hunting of gray wolves. Gray wolves belong to the Canidae family and are considered apex predators occupying the top of the food chain. In general, they have a social hierarchy
suited to living and hunting as a group. The pack leaders are a male and a female, called alphas, who decide on matters such as the hunting territory and the time to wake; the alpha's orders are followed by the pack. Beta wolves occupy the level just below the alphas: they are subordinate wolves that support the alpha's decisions and relay them to the rest of the pack. The omega is the lowest-ranking member of the gray wolf pack and has to submit to all the other dominant wolves on almost every occasion; after a pack kill, the omega is the last allowed to eat. A wolf is called a subordinate, or delta, when it does not fall into the alpha, beta, or omega categories; delta wolves have to obey alphas and betas, but they dominate omegas. In our process, alpha (α) is considered the fittest solution in the mathematical model of the hunting behavior adopted by the GWO; the second- and third-best solutions are beta (β) and delta (δ), and the remaining candidate solutions are considered omega (ω). The optimization in the GWO system is thus guided by α, β, δ, and ω. The flowchart of the GWO algorithm is given in Fig. 7.3. The step-by-step GWO procedure is as follows:
• Initialization process: Initialize the solution population as well as a, A, and C, the coefficient vectors.
• Fitness evaluation: Evaluate each solution on the training set according to Eq. (7.17) and select the best result:

$$Fit_i = \max(\mathrm{accuracy}) \tag{7.17}$$

Sort the solutions by fitness: let $d_\alpha$ denote the solution with the best fitness, $d_\beta$ the second best, and $d_\delta$ the third best.
• Encircling prey: Guided by α, β, and δ, with ω following, the three leading candidates delimit a target, and the pack encircles the prey during the hunt. The encircling behavior is modeled as

$$\vec{d}(t+1) = \vec{d}_p(t) - \vec{A}\cdot\vec{K} \tag{7.18}$$

$$\vec{K} = \left|\vec{C}\cdot\vec{d}_p(t) - \vec{d}(t)\right| \tag{7.19}$$
[Figure 7.3 (Proposed Gray Wolf Optimization) shows the flowchart: initialize the solution; evaluate the fitness; find the first, second, and third best; update the positions; calculate the fitness of the new solutions; store the best; and repeat until the maximum iteration is reached, then stop.]
$$\vec{A} = 2\vec{a}\cdot\vec{r}_1 - \vec{a}, \qquad \vec{C} = 2\vec{r}_2 \tag{7.20}$$
where t is the iteration number, $\vec{d}_p(t)$ corresponds to the position of the prey, $\vec{A}$ and $\vec{C}$ are the coefficient vectors, $\vec{a}$ is decreased linearly from 2 to 0, and $\vec{r}_1$ and $\vec{r}_2$ are random vectors in [0, 1].
• Hunting: We assume that alpha (the best candidate solution), beta, and delta have better knowledge about the potential location of the prey, and this knowledge is used to accurately
guide the movement of the remaining wolves. As a result, we save the three best results obtained so far and oblige the other search agents (including the omegas) to update their positions according to those of the best search agents. The new position $\vec{d}(t+1)$ is obtained from the following formulas:

$$\vec{K}_\alpha = \left|\vec{C}_1\cdot\vec{d}_\alpha - \vec{d}\right|, \qquad \vec{K}_\beta = \left|\vec{C}_2\cdot\vec{d}_\beta - \vec{d}\right|, \qquad \vec{K}_\delta = \left|\vec{C}_3\cdot\vec{d}_\delta - \vec{d}\right| \tag{7.21}$$

$$\vec{d}_1 = \vec{d}_\alpha - \vec{A}_1\cdot\vec{K}_\alpha, \qquad \vec{d}_2 = \vec{d}_\beta - \vec{A}_2\cdot\vec{K}_\beta, \qquad \vec{d}_3 = \vec{d}_\delta - \vec{A}_3\cdot\vec{K}_\delta \tag{7.22}$$

$$\vec{d}(t+1) = \frac{\vec{d}_1 + \vec{d}_2 + \vec{d}_3}{3} \tag{7.23}$$
It can be seen that the final position lies at a random place within a circle defined by the positions of alpha, beta, and delta in the search space. In other words, alpha, beta, and delta estimate the position of the prey, and the other wolves update their positions randomly around it.
• Attacking prey (exploitation) and search for prey (exploration): Exploration and exploitation are guaranteed by the adaptive values of a and A. These adaptive values allow the GWO to transition smoothly between exploration and exploitation: as a decreases, half of the iterations are devoted to exploration (|A| ≥ 1) and the other half to exploitation (|A| < 1). The GWO contains only two main adaptive parameters (a and C), and we consider it an advantage that the algorithm has the smallest possible number of parameters to adapt. The procedure is continued until the required accuracy is achieved; finally, the best solution is selected and passed to the subsequent stage.
Testing phase: After the training stage, the learned classification decision is applied in the test phase, and the output indicates the presence or absence of disease.
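The hunting equations above translate into a compact loop. The sketch below is a generic GWO in Python, under the assumption that fitness returns the classification accuracy of a KSVM trained with the candidate parameters (higher is better); the population size and iteration count are illustrative, not the authors' settings.

```python
import numpy as np

def gwo(fitness, lb, ub, n_wolves=20, max_iter=50, seed=0):
    """Generic GWO sketch implementing Eqs. (7.18)-(7.23)."""
    rng = np.random.default_rng(seed)
    dim = len(lb)
    X = lb + rng.random((n_wolves, dim)) * (ub - lb)      # initialize wolves
    for t in range(max_iter):
        scores = np.array([fitness(x) for x in X])
        leaders = X[np.argsort(scores)[::-1][:3]].copy()  # alpha, beta, delta
        a = 2 - 2 * t / max_iter          # a decreases linearly from 2 to 0
        for i in range(n_wolves):
            new_pos = np.zeros(dim)
            for leader in leaders:
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a        # Eq. (7.20)
                C = 2 * r2
                K = np.abs(C * leader - X[i])   # Eq. (7.21)
                new_pos += leader - A * K       # Eq. (7.22)
            X[i] = np.clip(new_pos / 3, lb, ub)  # average of d1, d2, d3, Eq. (7.23)
    scores = np.array([fitness(x) for x in X])
    return X[np.argmax(scores)]           # best parameters found
```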
7.3.4 Dataset description
The experiments use the Wisconsin Breast Cancer (WBC) dataset from the UCI Machine Learning Repository, a freely available collection for breast cancer assessment. The dataset is a multivariate collection of
information on fine needle aspirates (FNA) of breast masses obtained from University of Wisconsin hospitals. Before describing the data, recall the setting: in recent times, huge amounts of unstructured, semistructured, and structured information have been collected. By collecting, examining, reviewing, and separating this information, an organization can acquire a great deal of useful knowledge about individual patients. Such information is generally called Big Data because of its volume, the speed with which it arrives, and the variety of forms it takes. It not only answers the organization's own questions but can also provide services to other organizations if it is stored on a Big Data platform. Big Data is characterized by three attributes, known as the 3Vs:
• Volume (the volumes of information are huge and cannot be processed with the usual methods),
• Velocity (data arrives quickly and must be acquired and processed quickly), and
• Variety (the information comes in many types: structured, semistructured, and unstructured).
Cancer data meets all these prerequisites and can therefore be analyzed effectively with Big Data techniques. The available Wisconsin collections include:
• WBC dataset (original);
• Wisconsin Diagnostic Breast Cancer (WDBC) dataset;
• Wisconsin Prognostic Breast Cancer (WPBC) dataset;
• breast tissue datasets.
This chapter uses only the original WBC dataset. It contains 699 instances described by 11 attributes, of which 458 are benign and 241 malignant; 16 instances have missing values, leaving 683 complete records.
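As an illustration of the data handling implied here, the sketch below loads the original WBC file with pandas and drops the 16 records whose bare nuclei attribute is missing, leaving the 683 complete instances mentioned above. The UCI URL and column names are quoted from memory of the repository's documentation, so treat them as assumptions to verify.

```python
import pandas as pd

# Assumed UCI mirror of the original Wisconsin Breast Cancer dataset
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
COLS = ["id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
        "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
        "bland_chromatin", "normal_nucleoli", "mitoses", "class"]

df = pd.read_csv(URL, names=COLS, na_values="?")      # missing values are coded as '?'
print(len(df), df["class"].value_counts().to_dict())  # 699 rows; class 2 = benign, 4 = malignant
df = df.dropna()                                      # removes the 16 incomplete records
X = df.drop(columns=["id", "class"]).astype(int).to_numpy()  # nine features, each in 1..10
y = (df["class"] == 4).astype(int).to_numpy()         # 1 = malignant, 0 = benign
print(len(df))                                        # expected: 683 complete instances
```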
7.4 Results and discussion
The performance of the proposed approach was tested on a general-purpose machine with 8 GB of RAM. The proposed algorithm was implemented in MATLAB R2016b, which is well suited to processing large amounts of information in various applications. The overall procedure is divided into two scenarios, training and testing. For the evaluation, we first denote TP, FP, TN, and FN as the true positives, false positives, true negatives, and false negatives, respectively. From these we obtain four measurements, accuracy, precision, recall, and the F1 measure, computed as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 · (Precision · Recall) / (Precision + Recall)
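For concreteness, these four measures translate into a few lines of Python; the counts would come from comparing the classifier's predictions with the test labels.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Example with made-up counts (not the chapter's results)
print(classification_metrics(tp=120, fp=5, tn=95, fn=8))
```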
The dataset used is a breast cancer Big Data collection. The proposed model was implemented with the KSVMGWO classifier, using the MATLAB programming language. The database was created by the University of Wisconsin and was used here for testing. Several features are available in the database, among them the clump thickness, the uniformity of the cell size and shape, and the bare nuclei present. The values associated with these attributes range from 1 to 10. There are two distinct classes, indicating whether a case is benign or malignant. The results attained by the presented model in terms of the different measures are given in Table 7.1 and Fig. 7.4.
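The chapter does not list the kernel SVM settings that GWO tunes, so as a rough, hypothetical illustration of the baseline, here is a scikit-learn RBF-kernel SVM cross-validated on the X and y arrays prepared in the loading sketch of Section 7.3.4; the C and gamma values stand in for the hyperparameters the optimizer would search.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X, y as prepared from the WBC file in the earlier sketch
svm = SVC(kernel="rbf", C=1.0, gamma="scale")   # C and gamma are what GWO would tune
scores = cross_val_score(svm, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```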
7.4.1 Comparison measures
The purpose of this section is to justify the choice of KSVMGWO against several other classifiers. The classifiers used for comparison are SVM and GWO.
Table 7.1 Proposed KSVMGWO evaluation measures.

Run   Accuracy   Precision   Recall   F measure
1     95.746     97.248      95.864   93.574
2     96.687     96.464      94.575   94.2364
3     93.686     95.574      95.673   93.999
4     94.264     94.675      93.574   92.868
5     91.785     93.684      92.667   91.685
[Figure 7.4 Proposed breast cancer prediction: bar chart of the accuracy, precision, recall, and F measure obtained in the five test runs.]
Table 7.2 Comparison of proposed and existing accuracy measures.

Run   KSVMGWO   SVM      GWO
1     97.248    8.686    90.675
2     96.464    90.675   89.675
3     95.574    88.798   85.574
4     94.675    90.574   87.737
5     93.684    88.575   89.464
The performance of the proposed breast cancer predictor is investigated by changing the classifier, and the results obtained are described in Table 7.2. They confirm the learning ability of KSVMGWO and its faster prediction of breast cancer disease relative to the other classifiers. The proposed work therefore uses KSVMGWO for disease prognosis; the remainder of this section compares the effectiveness of the proposed approach with the existing techniques, and the comparisons are displayed in Figs. 7.5–7.7. The experimental results show that the proposed breast cancer prognosis with the KSVMGWO classifier outperforms the comparative classifiers. KSVMGWO's learning ability and efficiency lead to better accuracy, precision, and recall. High accuracy and precision rates determine the feasibility of the technique for predicting breast cancer: the prognosis is very reliable, and the rates obtained with respect to the FP and FN counts have improved. The detailed measures of the different classifiers for breast cancer prognosis are given in Tables 7.2–7.4.
[Figure 7.5 Graphical representation of proposed and existing accuracy measures (KSVMGWO, SVM, and GWO over the five test runs).]
[Figure 7.6 Graphical representation of proposed and existing precision measures (KSVMGWO, SVM, and GWO over the five test runs).]
[Figure 7.7 Graphical representation of proposed and existing recall measures (KSVMGWO, SVM, and GWO over the five test runs).]
Table 7.3 Comparison of proposed and existing precision measures.

Run   KSVMGWO   SVM      GWO
1     97.248    88.575   91.655
2     96.464    90.375   89.385
3     95.574    89.798   84.594
4     94.675    90.575   85.937
5     93.684    87.594   89.494
Table 7.4 Comparison of proposed and existing recall measures.

Run   KSVMGWO   SVM      GWO
1     95.864    88.575   91.655
2     94.575    90.375   89.385
3     95.673    89.798   84.594
4     93.574    90.575   85.937
5     92.667    87.594   89.494
7.5 Conclusion
This chapter presented the KSVMGWO method for classifying breast tumors as benign or malignant. Feature selection was applied to the dataset to eliminate duplicate and irrelevant features, using a symmetric estimation of the uncertainty of the attributes. The proposed approach was evaluated on the WBC dataset. Experimental results show that the accuracy, precision, recall, and F measure achieved by the proposed method improve on those of the other models. In the future, we will work on further feature selection techniques to improve the accuracy of the model.
8 Big Data based medical data classification using oppositional Gray Wolf Optimization with kernel ridge regression

N. Krishnaraj1, Sujatha Krishamoorthy2, S. Venkata Lakshmi3, C. Sharon Roji Priya4, Vandna Dahiya5 and K. Shankar6
1School of Computing, SRM Institute of Science and Technology, Kattankulathur, India; 2Department of Computer Science, Wenzhou-Kean University, Wenzhou, P.R. China; 3CSE, Panimalar Institute of Technology, Chennai, India; 4Computer Science and Engineering Department, Sri Sairam College of Engineering, Bangalore, India; 5Department of Education, Government of National Capital Territory of Delhi, Bangalore, India; 6Department of Computer Applications, Alagappa University, Karaikudi, India
Abstract
The classification of medical data is an important data mining problem that has been studied for nearly a decade and has attracted numerous researchers around the world. Classification procedures provide the pathologist with valuable information for diagnosing and treating diseases. With the development of Big Data in the biomedicine and healthcare industries, careful analysis of clinical data can benefit early diagnosis, patient care, and community services. However, the accuracy of the analysis decreases when the quality of the clinical data is incomplete. In addition, many regions exhibit unique characteristics of certain regional diseases, which may weaken outbreak forecasts. In this study, we develop machine learning algorithms to effectively predict outbreaks of chronic disease in general communities. The oppositional fruit fly (OFF) technique is proposed to select the optimal features in Big Data based clinical datasets, and oppositional Gray Wolf Optimization with
kernel ridge regression (OGWOKRR) is used for classification. The literature in this area shows that OFF performs better than particle swarm optimization (PSO), although its computational complexity is higher than that of PSO.
Keywords: Medical data; machine learning algorithms; oppositional fruit fly; oppositional Gray Wolf Optimization with kernel ridge regression; PSO
8.1 Introduction
The healthcare sector generates a great deal of information from record keeping, compliance, regulatory requirements, and patient care. Although most of this information has historically been kept on paper, the current trend is to rapidly digitize this huge volume of data. This enormous amount of information (known as "Big Data") promises to strengthen a range of clinical and public health capabilities, driven by mandatory requirements and the potential to improve the quality of health services while reducing costs, supporting decision-making, disease surveillance, and population health management [1]. In today's societies, the number of older individuals has increased dramatically since the beginning of this century, which implies growth in health and public expenditure [2]. Cloud computing is one of the fastest evolving technologies today, used in a number of applications including healthcare, monitoring dashboards, weather forecasting, and drug production frameworks. Health-related questions are increasingly being addressed this way: remote patient analysis is now feasible with modern data and information and communication technology (ICT) approaches [3]. The continual advances in communication technology and their implementation in the clinical domain have improved health outcomes and effectively changed ways of life. Patients with sickle cell disease (SCD), for example, should be fully analyzed at the start of treatment; more and more SCD patients have moved from hospital care to outpatient care, which depends completely on information technology (IT) [4]. Big Data is a collection of complex and voluminous information that is difficult to process with conventional database tools or normal data processing applications. It matters because companies can collect, store, archive, monitor, and edit information at the right time. Data mining techniques for medical services play an important role in the prediction and diagnosis of diseases. There are various applications of
data mining in clinical areas, for example in the field of clinical devices, in pharmaceutical companies, and in hospital administration [5]. These methods can be used to analyze clinical information and extract important health data that help patients and specialists make better decisions. Both industry and academia are currently leveraging advances in data science to analyze clinical information; given the rapid growth of the data generated by the clinical sector, data science will play an important role in the coming years [6]. To classify clinical Big Data, an integrated clustering and classification system has been proposed that combines the K-means clustering technique with the Random Forest (RF) classification strategy. K-means clustering outperforms other clustering algorithms because it scales to collections of high-dimensional data, and RF is a competent learning strategy that is easy to apply even when the data cannot be described parametrically [7]. Other work first examined the requirements associated with huge health data and characterized its qualities in the health segment, then presented a basic MapReduce-based methodology to reduce the records transmitted between healthcare administrations and facilities [8]. With the advancement of Big Data technologies, more attention has been paid to disease prediction from the perspective of large-scale data analysis; various studies have selected features directly from large amounts of data to improve the accuracy of risk classification, rather than relying on previously selected attributes [9]. Big Data refers to the devices, procedures, and systems with which an organization can create, manipulate, and manage huge collections of information and storage structures [10]. The digitization of clinical information is expanding the size of the data and the importance of analyzing it; Big Data has the attributes of volume, velocity, veracity, variability, and value, and conventional data analysis techniques are not adequate for it [11]. An algorithmic extension of the support vector machine (SVM) classification method beyond traditional binary classification has also been shown: as the corresponding experiments illustrate, the way to improve a combined classifier is to improve the coupled classification, and a precise assessment enables us to understand
the parallel classification in terms of global learning more easily [12]. Clinical information is collected in camps and separate health centers and is managed and stored by hospitals or administrative centers; with Big Data it can actually be consolidated, which reduces expensive overheads and supports competent administration [13]. Conventional outlier detection algorithms, finally, are not able to monitor this enormous health information effectively, and they struggle with poor and skewed distributions [14].
8.2 Literature survey
Diabetes is a serious medical problem worldwide. According to the International Diabetes Federation, there are currently 425 million people living with diabetes worldwide, and another 300 million people will be at higher risk of diabetes by 2030. There is therefore a critical clinical need for early identification and counseling of diabetes and its complications. Mallika et al. [15] proposed intelligent methods combining artificial intelligence and data mining, including the SVM, for diabetes management. The SVM is a widely used supervised classifier that healthcare and biomedical specialists can use to uncover hidden patterns in data that gradually accumulates into huge volumes. Because such a database is a private resource, sensitive data must be protected without reducing its utility. Archenaa et al. [16] proposed an approach to find more value in the information generated by healthcare services and government. These organizations produce a lot of heterogeneous information, but without adequate analytics techniques this information remains unused; Big Data analysis using Hadoop takes on the task of gradually producing important analyses from the vast amount of information and predicting crisis circumstances before they occur. Gene classification from microarray data remains one of the most important research areas in bioinformatics, machine intelligence, and pattern classification [17]. Two kernel ridge regression (KRR) variants, namely wavelet kernel ridge regression (WKRR) and radial basis kernel ridge regression (RKRR), have been proposed for microarray data. Clinical microarray datasets suffer from irrelevant and redundant attributes, high dimensionality, and small sample sizes. To mitigate the curse of dimensionality of microarray data collections, modified cat swarm
optimization (MCSO), an evolutionary feature-reduction technique, is used to select the meaningful attributes from the datasets. The classifiers were assessed on binary-class and multiclass clinical microarray datasets: breast cancer, prostate cancer, colon cancer, and leukemia, as well as leukemia1, leukemia2, SRBCT, and brain tumor1. Different measures, such as accuracy, G-mean, F-score, and the area under the receiver operating characteristic (ROC) curve, were used to judge the feasibility of the models. Various alternatives, such as basic ridge regression (RR), online sequential ridge regression (OSRR), support vector machine with radial basis function (SVMRBF), support vector machine with polynomial kernel (SVMPoly), and RF, were compared; the experimental results show that KRR outperformed the other models across the datasets, with WKRR leading RKRR. Privacy and security are also a major concern: the tremendous amount of information must be understood, governed, and protected in ways that professionals can rely on. Islam et al. [18] reviewed the healthcare data mining literature in a systematic survey: a database search covering the period from 2005 to 2016 was carried out following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The surveyed healthcare studies were characterized by their analytics methods, data mining techniques, types of analysis, data, and data sources. The authors observe that the accessible literature concentrates on clinical and administrative decision support; from a clinical point of view, the use of human-generated information reflects the general adoption of electronic medical records. Data-driven investigations of websites and online social data have received little attention so far, and there is a lack of prescriptive analytics and of research driven by integrated health data. The health sector has made incredible progress since the advancement of new information technologies, which have led to the creation of an increasing volume of clinical data and opened various research areas. Great efforts have been made to manage the explosion of clinical data
on the one hand and to extract valuable knowledge from it on the other. This has prompted analysts to apply technical advances such as large-scale data analytics, predictive analysis, artificial intelligence, and learning algorithms to extract valuable information and support decisions. With the promise of predictive analysis of enormous information and the use of machine learning algorithms, predicting outcomes, especially for treatments, need no longer be a problematic task, since recovery from an illness can be anticipated. Boukenze et al. [19] proposed a framework for exploiting Big Data in healthcare and applied a learning algorithm to a clinical data domain, using the decision tree algorithm (C4.5) to predict chronic kidney disease. The rapid, continuous growth of the digital information age and the rapid improvement of software engineering allow us to derive new knowledge from extensive data in various domains, including the Internet and finance. The amount of information collected and stored digitally is immense and growing rapidly, and data management and analysis aim to enable organizations to transform this immense resource into data and knowledge that help them achieve their goals; computer scientists coined the term Big Data to describe this development. Healthcare is one of the most promising areas where such detailed information can be used for change: the life expectancy of the population is increasing, which creates new difficulties for current treatment techniques, and the use of Big Data strategies and methods in clinical medicine and healthcare organizations is developing rapidly. Health analytics can potentially reduce treatment costs, anticipate epidemics, prevent preventable diseases, and improve overall quality of life. Garapati et al. [20] proposed to address issues related to the huge amount of medical and social-security information. Big Data is a collection of datasets that are abundant and complex; they contain structured and unstructured information that grows so rapidly that it cannot be handled comfortably with classical relational database frameworks or current analysis tools, and full control of the information cannot be achieved directly. At the moment, Big Data is extremely useful for supporting decisions, and there is
always value in the presentation of such information; the approach also addresses India's particular problems and helps link information forums. Medical services are the support or improvement of well-being through the prevention, diagnosis, and clinical treatment of disease, illness, injury, and other physical and mental impairments in humans. Healthcare is delivered by health professionals: physicians, teachers, specialists, midwives, nurses, pharmacists, analysts, and other health experts. Das et al. [21] focused on the transmission of knowledge in the area of Big Data analytics and its application in the clinical domain, covering the introduction, potential problems and concerns, the main tools used, specific details, applications, industrial applications, and future applications.
8.3 Proposed methodology
Our proposed investigation focuses on predicting disease: a classification procedure predicts whether the disease progression is normal or abnormal. In the preprocessing phase, the critical data are first separated from the dataset with the help of anisotropic scattering components; the subsequent output is treated as the feature extraction stage. The proposed investigation then uses the oppositional fruit fly (OFF) method to select the best features. We classify with the weighted oppositional Gray Wolf Optimization with kernel ridge regression (OGWOKRR) technique, which labels the disease progression as routine or peculiar. The proposed system is implemented in MATLAB and evaluated on several malignancy datasets. Furthermore, our proposed work is contrasted with existing procedures and techniques to show its merits (Fig. 8.1).
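Before the individual components are detailed in the following subsections, the sketch below mirrors how the stages of Fig. 8.1 chain together. It is only a toy stand-in: a variance filter replaces OFFA, and scikit-learn's off-the-shelf KernelRidge replaces the OGWO-tuned kernel ridge regression of Section 8.3.3, so it reproduces the pipeline's shape rather than the authors' method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.kernel_ridge import KernelRidge

def pipeline(X, y, k=5, lam=1.0):
    """Toy mirror of Fig. 8.1: select features, train, test, and evaluate."""
    keep = np.argsort(X.var(axis=0))[-k:]               # stand-in for OFFA feature selection
    X_tr, X_te, y_tr, y_te = train_test_split(X[:, keep], y, random_state=0)
    model = KernelRidge(alpha=lam, kernel="rbf").fit(X_tr, y_tr)
    y_hat = (model.predict(X_te) > 0.5).astype(int)     # threshold the regression scores
    tp = int(((y_hat == 1) & (y_te == 1)).sum()); tn = int(((y_hat == 0) & (y_te == 0)).sum())
    fp = int(((y_hat == 1) & (y_te == 0)).sum()); fn = int(((y_hat == 0) & (y_te == 1)).sum())
    return {"accuracy": (tp + tn) / len(y_te),          # metrics defined in Section 8.4
            "sensitivity": tp / max(tp + fn, 1),
            "specificity": tn / max(tn + fp, 1)}
```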
8.3.1 Feature reduction
Reducing the feature set simplifies the classification of Big Data medical datasets and improves its accuracy. The important task of a medical data classifier is to produce and justify an accurate prediction of the disease. However, traditional feature selection techniques are not very robust, and classification strategies take a long time. To overcome these challenges, our proposed technique uses the OGWOKRR algorithm.
[Figure 8.1 Proposed Big Data based medical data classification: input Big Data; feature extraction; optimal feature selection using OFFA; training and testing phases; classification using OGWOKRR; performance evaluation of the classifier.]
8.3.2 Feature selection
Feature selection is one of the most important research areas in machine learning. Its main advantage is finding the candidate features that help improve classification accuracy while reducing the overall cost, computation, and space requirements. The most informative features are selected so that the user can interpret the connection between features and classes. The use of the OFF approach is examined in the following subsections. The clinical datasets here contain both relevant and irrelevant features; to remove the insignificant attributes, the oppositional fruit fly algorithm is applied. After the features are selected, the optimal feature subset is split into training and test records, whose outputs are finally passed to the classifiers.
8.3.2.1 Oppositional fruit fly algorithm
The proposed method uses the oppositional fruit fly algorithm (OFFA) to select the optimal attributes. When
we select the main features, the selected features are combined. The fruit fly optimization algorithm (FOA) mimics the foraging behavior of fruit flies and is a recent method for global optimization. It was first inspired by the swarming behavior of fruit flies searching for food: the fruit fly is a superb food tracker with acute senses of smell and vision. It locates a food source by sensing an assortment of odors in the air and flies toward the corresponding place; as it approaches the food, it uses its sensitive vision to find the food or to reach that particular location. In the FOA, food sources correspond to optima, and the food search is repeated through the iterative search for the optima. The improved, opposition-based variant of the fruit fly algorithm is the OFA, which offers a better version of the basic fruit fly computation.
Data: Initial positions of the low-variance blocks.
Result: The best block area.
Step 1: Parameter initialization. The basic parameters of the FOA are the total number of iterations and the positions of the low-variance blocks. In our proposed strategy, a fruit fly encodes the location of a low-variance block; the block positions are initialized randomly as (FX_axis, FY_axis).
Step 2: To improve the normal fruit fly computation, the opposition strategy is introduced. Following the opposition-based learning (OBL) presented by Tizhoosh, for each current agent its opposite agent is also taken into consideration in order to obtain a better estimate for the current solution. It is assumed that an opposite-agent solution is, on average, closer to the global optimum than a random agent solution. The positions of the oppositional low-variance blocks are defined componentwise as:

OF_m = (oF_m^1, oF_m^2, ..., oF_m^d), where oF_m = Low_m + UF_m − F_m, with oF_m ∈ [Low_m, UF_m]    (8.1)

where OF_m is the position of the m-th low-variance block in the d-th dimension of the oppositional blocks.
Step 3: Random search and determination of the low-variance blocks. Here F_m is the m-th position of the low-variance blocks:

F_m(x, y) = (FX_m, FY_m)^T    (8.2)
FX_m = FX_axis + RandomValue    (8.3)

FY_m = FY_axis + RandomValue    (8.4)

Step 4: Assess the fitness of each position of the proposed system:

BF_m = EC    (8.5)

Step 5: Find the low-variance block position with the best fitness value:

best_block = function(min BP_m)    (8.6)

Step 6: Find the best places among the low-variance blocks:

[Excellent_block, Excellent_index] = min(error)    (8.7)

Step 7: Keep the best block position found so far as the incumbent and update the x, y coordinates; the fruit fly swarm uses vision to fly along this path:

selected_block = min(error)    (8.8)

FX_axis = FX(Excellent_index)    (8.9)

FY_axis = FY(Excellent_index)    (8.10)

Step 8: Repeat steps 3 to 6 in the iterative optimization, then check whether the new low-variance block position is better than the previous one; if so, go to step 7. After feature selection, the optimal feature subset is split into training and test records, which are finally passed to the classifiers, where modified kernel ridge regression (MKRR) is used.
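The steps above reduce to a compact loop. The following is a minimal Python sketch of opposition-enhanced fruit fly feature selection under simplifying assumptions of ours: a [0, 1] search cube, a correlation-based stand-in for the fitness of Eq. (8.5), and a Gaussian random walk for the smell-based search phase. It illustrates the structure, not the authors' exact implementation.

```python
import numpy as np

def offa_select(X, y, fitness, n_flies=20, max_iter=50, seed=0):
    """Oppositional fruit fly feature selection sketch.
    `fitness(mask)` scores a boolean feature mask, e.g. by classifier accuracy."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = rng.random((n_flies, d))                 # Step 1: random swarm initialization
    best_pos, best_fit = None, -np.inf
    for _ in range(max_iter):
        # Step 2: opposition-based learning, OF = low + up - F on [0, 1] bounds (Eq. 8.1)
        candidates = np.vstack([pos, 1.0 - pos])
        masks = candidates > 0.5                   # Step 3: positions decoded as feature masks
        fits = np.array([fitness(m) for m in masks])   # Steps 4-6: fitness and best position
        i = int(np.argmax(fits))
        if fits[i] > best_fit:                     # Step 7: keep the best food source so far
            best_fit, best_pos = fits[i], candidates[i].copy()
        # Step 8: flies wander randomly around the current best location and repeat
        pos = np.clip(best_pos + 0.1 * rng.standard_normal((n_flies, d)), 0.0, 1.0)
    return best_pos > 0.5                          # final boolean feature mask

# Tiny demo on synthetic data in which only feature 0 carries the signal
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8)); y = (X[:, 0] > 0).astype(int)
fit = lambda m: abs(np.corrcoef(X[:, m].mean(axis=1), y)[0, 1]) if m.any() else 0.0
print(offa_select(X, y, fit).nonzero()[0])
```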
8.3.3 Classification using OGWOKRR
This algorithm applies oppositional learning to the gray wolf optimizer to perform classification. It is a recent technique built on the social relationships within the wolf pack: the Gray Wolf Optimizer (GWO) imitates the leadership hierarchy and hunting mechanism of gray wolves. Gray wolves belong
to the Canidae family and are apex predators, a property mirrored in how the optimizer views the search. As a rule, they live in packs with a strict social hierarchy. The leaders are a male and a female, the alphas, who are responsible for decisions on hunting, resting places, and so on; the decisions taken by the alpha are imposed on the pack. The betas are subordinate wolves that help the alpha in decision-making and are the best candidates to replace the alpha and reinforce its commands. The omega is the lowest-ranking segment of the gray wolf pack; omegas must submit to all the other dominant wolves, and after a great feast by the leading wolves only the leftovers remain for them. A wolf is classed as subordinate (or delta) when it does not fall into the alpha, beta, or omega ranks; delta wolves must obey the alphas and betas but, unlike the omegas, dominate the lower ranks. In our strategy, alpha (α) denotes the best solution of the wolf pack simulated by the GWO. The second and third best solutions are beta (β) and delta (δ), and the remaining candidate solutions are treated as omega (ω). The optimization in the GWO methodology is thus guided by α, β, δ, and ω. The step-by-step gray wolf optimization technique is explained as follows:
8.3.3.1 Initialization process
Here we initialize the preprocessed output data as well as a, A, and C as the coefficient vectors.
8.3.3.2 Fitness evaluation
Evaluate the training set using Eq. (8.11) and select the best result:

Fit_i = max(accuracy)    (8.11)

8.3.3.3 Separate the solutions based on fitness
The solutions are ranked according to their fitness values. Let the best fitness result be F_α, the second best F_β, and the third best F_δ.
8.3.3.4 Encircling prey
The prey selected as the target is encircled by α, β, and δ, with the ω wolves following; the encircling behavior during the hunt is modeled as:

f(t + 1) = f_p(t) − A · K    (8.12)

K = |C · f_p(t) − f(t)|    (8.13)

A = 2a · r1 − a and C = 2 · r2    (8.14)

where t is the iteration number, f_p(t) refers to the position of the prey, A and C are the coefficient vectors, a decreases linearly from 2 to 0, and r1 and r2 are random vectors in [0, 1].
8.3.3.5 Hunting
We assume that alpha (the best candidate solution), beta, and delta have better knowledge of the probable position of the prey, with the specific goal of accurately guiding the accompanying wolves. Therefore, the three best results obtained so far are saved, and the other search agents (including the omegas) are obliged to update their positions according to the best agents. The new position f(t + 1) is obtained from the equations below:

K_α = |C_1 · f_α − f|, K_β = |C_2 · f_β − f|, K_δ = |C_3 · f_δ − f|    (8.15)

f_1 = f_α − A_1 · (K_α), f_2 = f_β − A_2 · (K_β), f_3 = f_δ − A_3 · (K_δ)    (8.16)

f(t + 1) = (f_1 + f_2 + f_3) / 3    (8.17)
With these equations, a search agent updates its position to a random point within a region delimited by the alpha, beta, and delta positions in the search space. In other words, alpha, beta, and delta estimate the position of the prey, and an ever-growing number of wolves update their positions randomly around it.
8.3.3.6 Attacking prey (exploitation) and search for prey (exploration)
Exploration and exploitation are ensured by the adaptive values of a and A, which let the GWO transition smoothly between the two: by decreasing A, half of the iterations are devoted to exploration (|A| ≥ 1) and the other half to exploitation (|A| < 1). The GWO contains only two main control parameters (a and C), so the number of parameters that must be tuned is kept as small as possible. The search continues until the required precision is achieved; finally, the best solution is selected and passed to the next stage.
The kernel ridge regression component is described next; toy polynomial data can be generated and randomly split into training and validation sets to exercise it. We assume that our hypothesis j(x) is a linear combination of some basis functions:

j(x) = ω^T φ(x)    (8.18)
We must determine ω. As in ordinary regression, we assume that the target variable Γ differs from the deterministic function by additive noise ζ:

Γ = ω^T φ(x) + ζ    (8.19)
Under a Gaussian assumption on the noise, the parameters of Γ can be estimated by maximum likelihood; equivalently, the optimal value of the vector ω is obtained by minimizing the following cost function:

C(ω) = Σ_{i=1}^{N} (Γ_i − ω^T φ(x_i))^2 + (λ/2) ω^T ω    (8.20)
It is not difficult to show that the estimate of ω that minimizes Eq. (8.20) is given by:

ω = (λI + Σ_{i=1}^{N} φ(x_i) φ(x_i)^T)^{−1} Σ_{i=1}^{N} Γ_i φ(x_i)    (8.21)
If the dimension of φ(x) is very large, it may be necessary, for several reasons, to express Eq. (8.21) in kernelized form. Using
Lagrange multipliers, it can be shown that Eq. (8.21) can be expressed in the following form:

j(x) = k(x)^T (K + λI)^{−1} Γ    (8.22)

where k(x, z) = φ^T(x) φ(z) denotes the kernel. The vector Γ = [Γ_1, Γ_2, ..., Γ_N]^T collects the targets of the N training samples. K is the Gram matrix with entries (K)_{ij} = k(x_i, x_j), built from the explanatory variables of the training data, and k(x) is the vector whose i-th component is (k(x))_i = φ^T(x) φ(x_i), where x_i is the i-th training example. Finally, the classifier labels the records of the dataset as normal or abnormal. The performance of the classifiers is assessed with different measures, for example accuracy, receiver operating characteristic (ROC), sensitivity, and specificity.
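To ground Eq. (8.22), here is a self-contained NumPy sketch of dual-form kernel ridge regression. The RBF kernel choice and the toy data are our assumptions for illustration, not the chapter's configuration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """One possible kernel: k(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, t, lam=1e-2, gamma=1.0):
    """Dual-form solution of Eq. (8.22): j(x) = k(x)^T (K + lam*I)^-1 * t."""
    K = rbf_kernel(X, X, gamma)
    coef = np.linalg.solve(K + lam * np.eye(len(X)), t)   # (K + lam*I)^-1 t
    return lambda Xq: rbf_kernel(Xq, X, gamma) @ coef

# Toy usage: fit a noisy sine wave and predict at two query points
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
predict = krr_fit(X, t, lam=0.1, gamma=0.5)
print(predict(np.array([[0.0], [1.5]])))
```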
8.4 Results and discussion
This section shows the results of the proposed OGWOKRR approach for medical data classification by providing a comparative analysis, in which the performance of OGWOKRR is compared with existing classification techniques such as the KRR and NN methods. The proposed algorithm is implemented on the MATLAB platform.
8.4.1 Classification accuracy
Classification accuracy is a basic assessment criterion for a standard classification framework:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

8.4.2 Sensitivity
Sensitivity is also called the TPR (true positive rate) or recall. It measures the ability of a model to recognize abnormal records: the proportion of positive samples correctly identified among the total number of positive samples. The higher the sensitivity, the better the classifier.

Sensitivity = TP / (TP + FN)
8.4.3 Specificity
Specificity is also known as the TNR (true negative rate). It quantifies the ability of the proposed technique to recognize typical (normal) cases: the proportion of negative samples correctly identified among the total number of negative samples. A classifier is better when the specificity is as high as the sensitivity, with only a small difference (around 1% or less) between them.

Specificity = TN / (TN + FP)
8.4.4 Performance evaluation
To evaluate the proposed strategy, classification is performed with the OGWOKRR method, and the performance is reported in terms of sensitivity, specificity, and accuracy. Table 8.1 shows the performance estimates of our proposed study for the breast, gene, leukemia, lung, lymphoma, and ovarian datasets (Fig. 8.2). From the values in Tables 8.1 and 8.2, we can see that the clinical datasets, covering chronic-disease information on breast, gene, leukemia, lung, lymphoma, and ovarian cases, are classified well: averaged over the individual clinical datasets, the method attains an accuracy of 94.21%, a sensitivity of 92.09%, and a specificity of 94.21%.
Table 8.1 Evaluation measures for our proposed study.

Input dataset   Accuracy   Sensitivity   Specificity
Breast          96.345     92.673        84.374
Gene            93.483     91.354        89.674
Leukemia        94.274     92.564        85.733
Lung            95.463     93.847        88.675
Lymphoma        93.353     90.746        88.784
Ovarian         92.374     91.386        89.628
[Figure 8.2 Graph of the accuracy, sensitivity, and specificity measures for the proposed study across the breast, gene, leukemia, lung, lymphoma, and ovarian datasets.]
Table 8.2 Accuracy comparison for proposed and existing research.

Disease     Proposed OGWOKRR   Existing KRR   Existing NN
Breast      96.345             90.657         89.674
Gene        93.483             89.784         90.885
Leukemia    94.274             90.673         91.674
Lung        95.463             91.753         89.685
Lymphoma    93.353             88.783         89.253
Ovarian     92.374             87.463         89.242
In this way, the clinical data collected can be classified as normal or abnormal using the proposed work.
8.4.5 Comparative analysis
The medical data classification performance of the proposed OGWOKRR algorithm is compared with the KRR and NN techniques. The correlation between the proposed and existing approaches is shown in Tables 8.2–8.4 (Figs. 8.3–8.5). Applying the various classification algorithms to the datasets, Tables 8.2–8.4 clearly show that classification based on the OGWOKRR algorithm outperforms all the other algorithms in terms of performance measures such as the accuracy, sensitivity, and specificity of the classification. OGWOKRR produces higher sensitivity and specificity compared with the other models.
Table 8.3 Sensitivity comparison for proposed and existing research.

Disease     Proposed OGWOKRR   Existing KRR   Existing NN
Breast      92.673             87.436         86.375
Gene        91.354             89.336         85.363
Leukemia    92.564             89.533         88.574
Lung        93.847             90.564         89.564
Lymphoma    90.746             88.453         86.483
Ovarian     91.386             89.452         87.453
Table 8.4 Specificity comparison for proposed and existing research.

Disease     Proposed OGWOKRR   Existing KRR   Existing NN
Breast      84.374             82.563         83.793
Gene        89.674             86.564         85.785
Leukemia    85.733             83.564         81.674
Lung        88.675             85.847         84.894
Lymphoma    88.784             84.785         83.793
Ovarian     89.628             87.975         86.264
[Figure 8.3 Graph for the comparison of proposed and existing accuracy measures across the six datasets.]
[Figure 8.4 Graph for the sensitivity comparison of proposed and existing research across the six datasets.]
[Figure 8.5 Graph for the specificity comparison of proposed and existing research across the six datasets.]

8.5 Conclusion
In our proposed scheme for organizing medical information based on Big Data, we examined Big Data classification built on the MKRR technique. For the classification of medical data, features are selected using the oppositional fruit fly algorithm. The proposed work was analyzed on the basis of evaluation parameters such as sensitivity, specificity, and accuracy. Comparing the evaluation measures of the proposed OGWOKRR with the current KRR and NN procedures shows that the proposed technique leads to better results. The outcome of the study suggests that the proposed work outperforms the existing approaches to clinical data classification. In this way, the proposed work stimulates future research to investigate new classification algorithms.
References
[1] W. Raghupathi, V. Raghupathi, Big data analytics in healthcare: promise and potential, in: Proceeding of Health Information Science and Systems, 2014, pp. 1–11.
[2] L. Chen, X. Li, Y. Yang, H. Kurniawati, Q.Z. Sheng, H.-Y. Hu, et al., Personal health indexing based on medical examinations: a data mining approach, in: Proceedings of Decision Support Systems, 29 October 2015, pp. 1–36.
[3] A. Jindal, A. Dua, N. Kumar, A.V. Vasilakos, J.J.P.C. Rodrigues, An efficient fuzzy rule-based big data analytics scheme for providing healthcare-as-a-service, in: 2017 IEEE International Conference on Communications (ICC), Paris, 2017, pp. 1–6.
[4] M. Khalaf, A.J. Hussain, D. Al-Jumeily, R. Keenan, P. Fergus, I.O. Idowu, Robust approach for medical data classification and deploying self-care management system for sickle cell disease, in: 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, Liverpool, 2015, pp. 575–580.
[5] M. Durairaj, V. Ranjani, Data mining applications in healthcare sector: a study, in: Proceedings of International Journal of Scientific & Technology Research, 2, 2277-8616, October 2013, pp. 29–35.
[6] N.G. Maity, S. Das, Machine learning for improved diagnosis and prognosis in healthcare, in: 2017 IEEE Aerospace Conference, Big Sky, MT, 2017, pp. 1–9.
[7] R.S. Kumar, P. Manikandan, Medical Big Data classification using a combination of random forest classifier and k-means clustering, in: Proceeding of Modern Education and Computer Science, November 2018, pp. 11–19.
[8] J. Ni, Y. Chen, J. Sha, M. Zhang, Hadoop-based distributed computing algorithms for healthcare and clinic data processing, in: Proceedings of Eighth International Conference on Internet Computing for Science and Engineering, 2015, pp. 188–193.
[9] M. Chen, Y. Hao, K. Hwang, L. Wang, L. Wang, Disease prediction by machine learning over big data from healthcare communities, in: IEEE Access, vol. 5, pp. 8869–8879, 2017.
[10] J. Sun, C.K. Reddy, Big Data analytics for healthcare, in: Proceedings of Healthcare, 2013, pp. 11–12.
[11] P. Saranya, P. Asha, Survey on Big Data Analytics in health care, in: 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2019, pp. 46–51.
[12] R.D. Sah, J. Sheetalani, Review of medical disease symptoms prediction using data mining technique, in: Proceeding of IOSR Journal of Computer Engineering, 19, 2278-8727, June 2017, pp. 59–70.
[13] Q.K. Fatt, A. Ramadas, The usefulness and challenges of Big Data in healthcare, in: Proceeding of Journal of Healthcare Communications, 3, 1654-2472, 2018, pp. 1–4.
[14] K. Yan, X. You, X. Ji, G. Yin, F. Yang, A hybrid outlier detection method for health care Big Data, in: 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), Atlanta, GA, 2016, pp. 157–162.
[15] C. Mallika, S. Selvamuthukumaran, Privacy protected medical data classification in precision medicine using an ontology-based support vector machine in the diabetes management system, in: Proceeding of International Journal of Innovative Technology and Exploring Engineering, 9, 2278-3075, November 2019, pp. 334–342.
[16] J. Archenaa, E.A.M. Anita, A survey of Big Data Analytics in healthcare and government, in: Proceeding of International Symposium on Big Data and Cloud Computing, 2015, pp. 408–413.
[17] P. Mohapatra, S. Chakravarty, P.K. Dash, Microarray medical data classification using kernel ridge regression and modified cat swarm optimization based gene selection system, in: Proceeding of Elsevier, 2016, pp. 1–17.
[18] M.S. Islam, M.M. Hasan, X. Wang, H.D. Germack, M. Noor-E-Alam, A systematic review on healthcare analytics: application and theoretical perspective of data mining, in: Proceeding of Healthcare, 23 May 2018, pp. 1–43.
[19] B. Boukenze, H. Mousannif, A. Haqiq, Predictive analytics in healthcare system using data mining techniques, in: Proceedings of Computer Science & Information Technology, 2016, pp. 1–9.
[20] S.L. Garapati, S. Garapati, Application of Big Data Analytics: an innovation in health care, in: Proceedings of International Journal of Computational Intelligence Research, 14, 0973-1873, 2018, pp. 15–27.
[21] L. Das, S.S. Rautaray, M. Pandey, Big Data Analytics for medical applications, in: Proceedings of I.J. Modern Education and Computer Science, 2018, pp. 35–42.
9 An analytical hierarchical process evaluation on parameters of Apps-based Data Analytics for healthcare services

Monika Arora1, Radhika Adholeya2 and Swati Sharan2
1Apeejay School of Management, Dwarka, New Delhi; 2Uniworld Care, Dwarka, New Delhi
Abstract
The healthcare system cares for people's health through prevention, early diagnosis, timely treatment, and the management of complications. According to the WHO, it also includes long-term care to reassure the patient. Any healthcare management system can be studied in terms of access, integration, privacy and security, confidentiality, sharing, assurance/relevancy, reliability, and the costs involved for the data and documents in the system; these can be concerns for healthcare centers. Accessibility is a complex concept, and at least four of its aspects require evaluation: availability, utilization, relevance, and equity of access. These parameters are discussed and evaluated in this study. The healthcare system is driven by groups of people such as patients, community resource providers, and healthcare teams, all working to serve the patient. The healthcare team provides different types of care to the patient and his or her family, covering the roles and responsibilities of doctors, nurses, physicians, and many more. Patients should be aware of all the latest updates in their areas; the use of WhatsApp on smartphones, as well as SMS, helps them communicate and stay informed about everything. The evaluation method, which weights categories and selects the best option, is implemented with the analytical hierarchical process (AHP) technique. This will be used in the development of electronic health record systems for healthcare services.
Keywords: Healthcare; analytics; AHP evaluation; parameters
9.1 Introduction
The health sector generates huge volumes of data through record keeping, compliance, and patient data, which are now being digitized. These data need to be analyzed effectively, which will minimize healthcare charges as well as revitalize quality in healthcare. The government likewise generates large datasets every day, and datasets from various hospitals and healthcare providers need to be collated to provide effective treatment for a patient. India, after independence in 1947, established its own Ministry of Health and made health a priority in its 5-year plans; the health budget is determined for each state for the next 5 years. The National Health Policy was passed in 1983 with the aim of achieving universal health coverage by 2000, and the policy was updated in 2002. The Constitution of India makes each state responsible for providing health facilities to its people. To improve rural healthcare, the National Rural Health Mission (NRHM) was started in 2005 by the Indian Government to provide medical facilities in rural areas. The objective of the mission was to improve healthcare services by providing medical resources and facilities in rural areas and in states with weak healthcare systems. Since the public healthcare system is not able to cater to the needs of the people, the private sector comes into play, and the quality and cost of the public and private sectors are a major concern. People living in rural areas and poor states such as Bihar do not have access to proper healthcare facilities owing to the lack of skilled healthcare providers in those areas compared with more prosperous states. Healthcare services and health education are provided by state governments, while administrative and technical services are provided by the central government in India. Since very few people in India are insured, especially by their employers, private health services are not accessible to them, and out-of-pocket expenditure is very high because of the high costs in the private sector. Still, India is a preferred destination for medical tourists because of the high quality and low cost of private healthcare providers compared with other developed countries. Based on the number of health apps downloaded worldwide, there is clear growth in the use of mHealth apps: over the 5 years from 2013 to 2017, downloads grew from 1.7 billion to 3.7 billion. The uptake of health services over the phone and the introduction of technology into healthcare are positive signs for the growth of healthcare worldwide, as shown in Fig. 9.1.
[Figure 9.1 No of mHealth app downloads in billions (bar chart, 2013 to 2017; bar values 1.7, 2.3, 3, 3.2, and 3.7).]
Healthcare providers are now saving patients' lives and improving quality of life through various technological means. This shift to technology has also changed many medical processes and the way the medical fraternity works. The government has taken various initiatives in the health sector to raise the standard of patient care, such as: (1) Medicare penalties: hospitals with high readmission rates for patients suffering heart failure, heart attack, and similar conditions are penalized through Medicare; (2) BRAIN initiative: new treatments and medicines are being discovered to address brain disorders; (3) Heritage Health Prize: implementing an algorithm for calculating the number of days a patient will stay in a hospital in the next year; (4) health challenges: developing algorithms, analytical tools, biomarkers, and other technologies for the diagnosis of mild traumatic injuries; and (5) technologies related to healthcare analytics. Many new technologies are now affecting the healthcare industry, among them: (1) Artificial intelligence (AI): many applications are designed on AI, and information must be managed well to make AI scalable and sustainable. (2) Blockchain: blockchain technology has been used for cryptocurrency transactions, and according to a Fintech report, experts now envision health and life insurers using blockchains to maintain health records, execute transactions, and interact with stakeholders. Companies such as Deloitte and IBM are confident that blockchains could help create a broader, more secure, and interoperable health system. Today every healthcare stakeholder, such as health systems, physicians, and diagnostic information service providers, keeps a separate copy of health data, so accessing this information is time-consuming and costly; with blockchain in place, tracking provider information and reducing administrative cost become much easier. (3) Cloud: healthcare IT
has now shifted its investment to cloud-based systems, and every third healthcare IT expert believes this technology will receive huge investment. (4) Consumer-facing technology: West Monroe Partners, a business and technology consulting firm, states that digital communication and the availability of patient data will help the healthcare industry deliver better patient care. (5) Internet of Things: connecting medical devices and other applications to the internet marks a positive shift toward the Internet of Things. (6) Disease management technology: healthcare organizations are devising new strategies for disease treatment.
Electronic health records (EHRs): By replacing paper records with EHRs, healthcare professionals gain enormous benefits, and EHRs are now used at every stage of care. Clinical information, reports, and measurements such as weight are entered into the system by nurses and technicians, while administrative tasks, such as scheduling appointments, updating patient records with diagnostic codes, and submitting claims, are handled by medical billers and coders through EHRs. However, owing to the lack of interoperability and the heavy documentation load, physicians now experience a burden from these systems. To provide a secure and efficient way to exchange data across healthcare centers, HIPAA (the Health Insurance Portability and Accountability Act) was passed. Some of the benefits of using EHRs are: (1) Enhanced patient care: for an unconscious patient, the EHR gives doctors access to the patient's history and alerts them to any allergies or intolerances to medicines. (2) Improved public health: with EHR data, new treatments can be developed and medical knowledge enhanced; any viral or bacterial outbreak can be quickly identified and preventive measures taken at the right time. (3) Ease of workflow: the number of medical codes has increased from 13,600 to 69,000 in recent times, but with EHRs in place, medical coders and billers can easily enter the data into the computer system. This reduces errors in patient data and financial details, and data entry takes less time than with paper-based records; productivity and efficiency also increase because the records can be accessed through portable devices at any time. (4) Lower healthcare costs: with the shift to EHR systems, outpatient costs have decreased by 3%, according to research published by the University of Michigan; this saving is equivalent to $5.14 per patient each month.
Telemedicine: Another big technological achievement is the telemedicine system. It provides two-way video consultation between doctor and patient and refers to the remote
monitoring of patient data such as cardiovascular or other signs and symptoms. Remote ultrasound technology is also being developed, which will broaden the career options for various healthcare jobs. With expensive health plans and value-based demands, the adoption of telemedicine will increase. The advantages of telemedicine are: (1) decreased waiting time; (2) increased accessibility for rural people; and (3) increased efficiency. The use of the digitized health space up to 2020 by various sections of the market is shown in Fig. 9.2. Based on these data, wireless health is expected to capture the market over mobile health, telehealth, and EHR/EMR; mobile health will also maintain its presence and, as the dataset suggests, keep growing over the next few years.
Mobile healthcare (m-health): When healthcare or medical information is supported through mobile technology, it is referred to as m-health. According to research, mobile-based medical applications are used by 80% of physicians, and 25% of them use such apps for patient care. At present, some 100,000 health applications are available, of which 300 thousand paid apps are being downloaded. New applications are developed every day for both doctors and patients. The cost associated with m-health technology is very low, yet it provides quality services to patients and gives flexibility to doctors and administrators at the same time. The technology can be used to create health awareness and to support communication between doctors and patients. These applications can be used in the following areas: chronic care management (diabetes, blood pressure, cancer, etc.); women's health (pregnancy, calendar, feeding, etc.); fitness and weight loss; wellness apps; cardio apps (healthy lifestyle, trackers, exercises, etc.); medication management (tracking patient medication); mental health (disorders, stress relief);
[Figure 9.2 near here: series for EHR/EMR, Telehealth, Mobile Health, and Wireless Health, plotted 2015 to 2020.]
Figure 9.2 Digitized Health Space till 2020 by various sections of the industry (in billion$).
doctors-on-demand apps (doctor booking, online consultation, doctor appointments); medical reference; diagnostics; patient medical education; sleep and meditation apps (relaxing music and sounds); clinical assistance mobile apps (PHR, LIMS, HIMS, EHR); e-prescription mobile applications; and nursing apps (scheduling, tracking, medical records). The benefits of using m-health technology are: (1) work can be completed from remote locations; (2) costs decrease through reduced paper use, less postal mail, and less time spent on phone calls; and (3) communication by medical billers improves, as they can send alert messages related to payments and bills. Alongside these benefits, mobile health has certain disadvantages: the devices are prone to hacking and can be stolen.
Information on the cloud: Information in huge amounts and in various forms is generated at healthcare facilities, and it must be refined, transformed, or analyzed before further use [1]. Every day the healthcare industry generates huge volumes of data, and with digital data the system needs to be cost-efficient and expandable and to have a proper storage facility. Today 270 million people own a smartphone and are online. This continuously generated data is administered through cloud computing, which uses the internet to provide storage services. By connecting their devices to the internet, patients and healthcare providers can access their data [2]. The technology does not require additional hardware or servers to store data, and it provides a strong recovery facility. It helps establish links among healthcare professionals and between providers and patients, and it is very useful in rural areas or wherever healthcare facilities are lacking [3]. It has been used for email, smartphones, webcams, telemedicine, and telemonitoring systems, and also for diagnostics, management, counseling, education, and support.
Healthcare data analytics: When this huge amount of data is analyzed, it adds value to the healthcare sector. With such data, the health industry can predict trends, normalize patient data, and detect loopholes in patient care, and as medical practice grows, the data will grow with it. The Wested group in NYC, for example, substantially increased its number of doctors; its revenue grew to $285 million with 250,000 patients. Using big data, the group analyzed its system and was able to improve patient satisfaction by streamlining workflows, reducing the burden on doctors, and cutting unnecessary tests.
Big data analytics can help the health system in many ways [4,5], such as:
• Performance evaluation: with the help of a database, hospital administrators can detect ineffectiveness in their services, and strong actions based on data analysis can reduce medical errors while improving patient care.
• Financial planning: data analysis can help allocate funds effectively and maintain transparency in the system.
• Patient satisfaction: a healthcare facility can reduce the cost of treatment by maintaining a database of patient history, which helps decrease medical errors, improve patient safety, and increase the quality of care.
• Healthcare management: with data analytics, preventive care and disease management can help detect diseases in a particular population group; analytics also supports operations management and regulatory compliance.
• Quality scores and outcome analysis: data analytics reduces the communication gap between consultants by acting as a medium where a patient's data, history, and progress can be maintained and viewed by the various healthcare practitioners working on a particular case.
For healthcare administrators and managers, analytics supports better patient care and the improvement of existing systems. Doctors can use analytics in various forms:
• Prescriptive analytics: to provide personalized treatment for each patient.
• Predictive analytics: data mining and models such as simulation or learning models help forecast future disease patterns.
• Descriptive analytics: analytics helps comprehend the causes of a problem and drill down into it.
The use of big data has increased tremendously over the years, and the use of analytics in healthcare has become a mandatory requirement: it helps cure disease, prevent epidemics, and cut costs. The data generated by applications (apps) will be used by, and be helpful to, everyone, and this will motivate researchers and developers to recognize the importance of big data and its role in healthcare analytics. This chapter helps developers and researchers focus on the factors that matter most for building a successful model in the healthcare sector; developers will be able to identify the important factors in the healthcare industry, and the global weights derived here will be useful for the future development of apps by health companies.
The chapter starts with this introduction, which covers the basics of analytics and big data analytics as well as the motivation for and need of the study. Section 2 reviews the literature: it surveys the papers studied in this regard, identifies the factors and subfactors related to this study, and summarizes prior work in the area. Section 3 presents the research methodology, discussing the analytic hierarchy process (AHP) in detail, including the calculation of weights for this multicriteria decision-modeling technique. Section 4 discusses the proposed AHP model of successful healthcare that has been designed and implemented: AHP is applied to the healthcare parameters, and the weights are calculated and finally determined. The chapter ends with the conclusions.
9.2 Review of literature
The healthcare industry has generated information in huge quantities since its beginnings. Earlier these records were paper-based, but the information is now being digitized [6]. The data come from various sources and are produced and collected at very high speed [7,8]. The sources of healthcare data are numerous: genomics, EHRs, monitoring devices, health-related mobile applications, social media, and so on. It has been estimated that by 2020 these data will reach 25,000 petabytes [8]. With this huge amount of digital information, the cost of medical treatment for the patient increases [5]. Data analytics has a major part to play here: it helps extract knowledge from huge datasets. The term is used across business sectors, and healthcare can also utilize its full potential [8]. Big data is characterized by four terms: (1) the amount of data generated, that is, volume; (2) the speed of online data generation and analysis, that is, velocity; (3) the various types of data generated, that is, variety; and (4) veracity, the quality assurance of the data. Analytics uses techniques such as descriptive or predictive analysis on organized and unorganized healthcare information [5]. It helps process the huge volume, variety, and velocity of data across healthcare organizations and supports evidence-based decisions [9]. In this chapter, 15 parameters were shortlisted from an extensive literature review and reduced to 10 after discussion with experts in the area of healthcare. The parameters are data access, data integration, privacy and
security of data, confidentiality of data, data sharing, data assurance, data relevancy, reliability, and cost involvement.
Data access: Healthcare data contain sensitive information about individuals and have many stakeholders, including administrative staff, caregivers, paramedic staff, patients, and other parties such as researchers, pharmaceutical companies, and government [10]. The National Health Service (NHS) maintains user protocols and regulatory measures for restricted access to health data. Researchers note that although cloud computing offers a solution for the storage of, and access to, healthcare data, extra security measures need to be taken because the service is provided by an entity external to healthcare [11,12].
Data integration: Healthcare is a rapidly changing field in which an individual needs continuous care from various healthcare professionals and institutions. Information is generated from various resources, such as clinical information and other databases; beyond this, behavioral and socioeconomic information captured through various social platforms is also integrated into the patient record [13].
Data protection: Data privacy and security are the most important prerequisites, as the data contain vital information about an individual. Many methods are used for maintaining the confidentiality and surveillance of healthcare information, such as data encryption, password protection, and secure data transmission [12]. Laws have been passed to safeguard health data, such as HIPAA and, in India, DISHA (the Digital Information Security in Healthcare Act).
Confidentiality: In the world of data analytics, genomic data can predict the health of an individual, so measures need to be designed to maintain patient confidentiality [14]. Patients may be unwilling to take treatment if there has been a data breach or if they are not sure their data are confidential.
Data sharing: Every health system has its own relational database and data model, which inhibits data sharing among health institutions [15]. Because of the lack of standardization in healthcare and the availability of data in different formats, data acquisition and data cleansing need to be performed before data can be transmitted between healthcare and research organizations [8].
Data assurance: Since data come from various sources, the goal is to have error-free, quality data [6]. The data coming from sources such as EHRs, wearable devices,
integrated sensors, and continuous monitoring may be in structured, semistructured, or unstructured form [8].
Reliability: Reliability, or data stability, is an important criterion for measuring the quality of health data: the data need to be the same across healthcare services and over time [16]. The huge amount of data needs to be balanced so that more reliable data can be analyzed [9].
Cost involvement: Data analytics tools can manage large, multiscale data of various types and transform them into intelligence. This supports better decision making, better health services, and cost-effectiveness of those services [13]. The analyzed data help healthcare providers and other stakeholders arrive at better diagnosis and treatment of the patient [6]. The details are summarized in Table 9.1.
Many studies have been done by researchers using many methods. AHP lets users make pairwise evaluations and uses its fundamental scale for validation, which suits researchers who want decisions grounded in a scientific process. AHP rankings vary widely from individual to individual, and an AHP grading is created for every participant from the judgments they provide, rather than relying on nonparametric tests. For this study, AHP was used to obtain better results.
Table 9.1 Synthetic view of predominantly ergonomic evaluations.

Feature                     Definition                                                                                              References
Data access                 Various stakeholders of the data need to be given access through protocols and regulatory measures      [10,12]
Data assimilation           Data from various sources needs to be integrated into the patient record                                [13]
Data privacy and security   Methods and laws to safeguard patient information                                                       [12]
Confidentiality             Data confidentiality needs to be maintained to avoid any breach of data                                 [14]
Data sharing                Owing to lack of standardization and data in various formats, data sharing is an issue                  [8,15,17]
Data assurance/relevancy    To have error-free data coming from various sources                                                     [6,8]
Reliability                 Data coming from various sources needs to be reliable                                                   [11,16]
Cost involvement            Data analytics helps in better decision making, better health services, and cost-effectiveness          [6,18]
9.3 Research methodology
9.3.1 Analytic hierarchy processing model
This section illustrates the main ideas behind using AHP, validating the pairwise comparison process and the fundamental scale used in AHP. The Saaty compatibility index is used to show how close the derived priorities in the validation examples are to the actual values against which they are compared, standardized to relative form by dividing by their sum. Such examples build confidence in the validity of the AHP and its supermatrix as applied to retrieval problems. Several multiple-criteria decision-making (MCDM) methods have been developed to help researchers in this regard.
Validation of the AHP modeling technique: Studies of decision problems motivate the behavior examined in the experiment, in line with the techniques of experimental methodology, and help ground decisions in reality. The validation methods can be described as follows:
1. Theoretical validation: Kornyshova and Salinesi [19] argued that there is no universally perfect selection approach; the selection depends on the characteristics of the problem and on the information available.
2. Experimental validation with verifiable objective results: Saaty showed that quantities such as the areas of geometric figures, the volume of drink consumption in a country, or the distances between cities can be estimated either directly or indirectly through pairwise comparisons (as in AHP); the pairwise approach appears to provide more accurate results [20].
AHP is a multidecision tool that helps the decision-maker reach a decision: it advises which of the available options is best. The consistency of AHP is very high; it detects priorities and orders them from highest to lowest. AHP has proved to be an adequate tool for weighing all the issues across multiple choices and branches; validation selects the best of all options available, and the best check of the measures is to verify the results. Many samples have been considered for validation in AHP, which is regarded as a smart measuring system. A limitation of this study is the data supporting the technique, restricted to only 100 participants, but it still offers ideas for a decision-making approach. It proposes a novel way of analyzing judgment-based measurement, deriving it from relative
measurements/values. The AHP is thus descriptive rather than prescriptive; it is a predictive tool that meets the scientific requirement for theory and is reliable in terms of data.
9.3.2 Analytic hierarchy processing technique
In any application domain, AHP (a decision approach) can be applied to solve complex multiple-criteria problems. Given the uncertainty and complexity of decision situations, it is used to decompose problems into micro tasks, and the hierarchy can be prepared in a means-ends fashion; the critical factors are then studied for performance. AHP is simple and robust and easily accommodates both tangibles and intangibles in a decision, which makes it an important aid in decision-making. This chapter proposes the factors and subfactors affecting successful healthcare. Using this technique, the decision maker assesses each criterion through pairwise comparisons, and the outcome of AHP is a weight for each decision alternative. AHP follows three primary steps: (1) constructing hierarchies; (2) comparative judgment of the comparisons; and (3) synthesis of weights. This helps the decision maker do justice to the decision. Structuring the hierarchy allows a complex problem to be handled in a structured way: the hierarchy is applied in descending order from the overall objective, through "criteria" and "subcriteria," down to the lowest level. The objective of the study sits at the top level of the hierarchy; the criteria and subcriteria contributing to the decision are represented at the intermediate and lower levels; and the decision traits/alternatives come last, laid down at the bottom level of the hierarchy. Constructing the hierarchy draws on inventive thinking, memory, and the perspective of the users making the judgments; there is no fixed procedure for generating the levels of the hierarchy or its structure [20]. The structure depends on the decision and on the nature and type of work undertaken by the owners or managers involved, and the number of hierarchical levels and units depends on the complexity of the problem. The analyst includes whatever detail of the problem is required to solve it, so the hierarchy representation and the detail required may vary from one person to another. Comparative judgments are then applied to the structured task, and priorities are given to the elements at each level; every member of the hierarchy has to be included and involved in the decision, according to Saaty [21].
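To make the first step concrete, the hierarchy of Fig. 9.3 can be written down as a simple nested structure. The following is a minimal sketch in Python, not part of the original study; the names merely mirror the goal, criteria, and subfactors used in this chapter.

```python
# Minimal sketch of the AHP hierarchy used in this chapter (goal -> criteria -> subfactors).
# The structure mirrors Fig. 9.3; AHP itself does not prescribe any particular representation.

SUBFACTORS = [
    "data access/integration",
    "data privacy and security",
    "confidentiality",
    "data sharing",
    "data assurance/relevancy",
    "reliability",
    "cost involvement",
]

hierarchy = {
    "goal": "Successful parameters in healthcare",
    "criteria": {
        "user/patient": SUBFACTORS,   # subfactors C11..C17
        "hospital/lab": SUBFACTORS,   # subfactors C21..C27
    },
}

for criterion, subs in hierarchy["criteria"].items():
    print(criterion, "->", ", ".join(subs))
```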
The comparison matrices of all elements at one level of the hierarchy also take the immediately higher level into account. Comparative judgments are constructed and prioritized into ratio-scale measurements via pairwise comparisons, with preferences expressed on a nine-point scale [20]. A pairwise comparison of, for example, "X" with "Y" asks how much more important "X" is than "Y". The consistency of these uniformity and precedence measurements is carried through the pairwise comparisons, which generate a matrix of relative rankings. This takes place for all levels of the hierarchy from top to bottom; matrices are prepared at each level and linked to the level above, with the order of each matrix depending on its connections. After all the matrices have been developed, the eigenvectors, that is, the relative weights, are obtained; the relative degrees of importance feed the global weights, and the eigenvalues are calculated. An important validating parameter in AHP, the $\lambda_{\max}$ value, is calculated; this principal eigenvalue serves as the reference index for calculating the consistency ratio (CR) of the estimated vector. To validate a pairwise comparison matrix as acceptably consistent, the CR is calculated in AHP as follows (a short computational sketch is given after Table 9.2):
1. The relative weights and $\lambda_{\max}$ for each matrix of order $n$ are calculated.
2. The consistency index (CI) for each matrix of order $n$ is computed as
$$\mathrm{CI} = \frac{\lambda_{\max} - n}{n - 1}$$
3. The consistency ratio is then
$$\mathrm{CR} = \frac{\mathrm{CI}}{\mathrm{RI}}$$
where the random consistency index (RI) is obtained from a large number of simulation runs and varies with the order of the matrix (see Table 9.2). If CR ≤ 0.10, the degree of consistency is satisfactory and the matrix is accepted; if CR > 0.10, there are serious inconsistencies, the AHP may not give meaningful results, and the evaluation process has to be reviewed and improved (see also Table 9.3).
Table 9.2 Random index values.

Order of matrix (OoM):  2     3      4     5      6      7      8      9      10
Random index (RI):      0     0.58   0.9   1.12   1.24   1.32   1.41   1.45   1.51
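As a computational sketch of steps 1 to 3 above, the following Python fragment derives the priority vector and checks consistency for an illustrative judgment matrix. The matrix values are invented for illustration; only the RI values come from Table 9.2.

```python
import numpy as np

# Random consistency index (RI) by matrix order, from Table 9.2.
RI = {2: 0.0, 3: 0.58, 4: 0.9, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.51}

def consistency_ratio(A):
    """Return (weights, lambda_max, CI, CR) for a pairwise comparison matrix A."""
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)              # principal eigenvalue = lambda_max
    lam_max = eigvals[k].real
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                          # normalized priority vector (the weights)
    if n < 3:                                # a 2x2 reciprocal matrix is always consistent
        return w, lam_max, 0.0, 0.0
    ci = (lam_max - n) / (n - 1)             # CI = (lambda_max - n)/(n - 1)
    cr = ci / RI[n]                          # CR = CI/RI, for orders 3..10 per Table 9.2
    return w, lam_max, ci, cr

# Illustrative 3x3 judgment matrix on the 1-9 scale (not the study's data).
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 3.0],
              [1/5, 1/3, 1.0]])
w, lam, ci, cr = consistency_ratio(A)
print("weights:", np.round(w, 3))
print("lambda_max: %.3f  CI: %.3f  CR: %.3f" % (lam, ci, cr))
print("consistent" if cr <= 0.10 else "inconsistent: review the judgments")
```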
Table 9.3 Scales for pairwise comparisons.

Verbal scale (important, likely, or preferred)        Numerical value
Equal                                                 1
Moderately more                                       3
Strongly                                              5
Very strongly                                         7
Extremely                                             9
Intermediate (compromise) values                      2, 4, 6, 8
Inverse comparisons                                   Reciprocals
In the present study, the pairwise comparisons use actual data collected from the survey. The pairwise comparison question is posed as: "if the ith and jth are two elements, how many times is the ith preferred to the jth?" If $w_i$ and $w_j$ are the values for criteria $i$ and $j$, respectively, then the preference of criterion $i$ over criterion $j$ equals $w_i/w_j$. The pairwise comparison matrix is therefore
$$\begin{pmatrix} w_1/w_1 & w_1/w_2 & \cdots & w_1/w_n \\ w_2/w_1 & w_2/w_2 & \cdots & w_2/w_n \\ \vdots & \vdots & \ddots & \vdots \\ w_n/w_1 & w_n/w_2 & \cdots & w_n/w_n \end{pmatrix}$$
This is the relative normalized method, and a matrix of this form is consistent. The weight of each element is calculated as
$$\text{weight of the } i\text{th element} = \frac{w_i}{\sum_{i=1}^{n} w_i}$$
For negative criteria, such as risk, the priority of criterion $i$ over $j$ is $w_j/w_i$. The pairwise comparison matrix is therefore
$$\begin{pmatrix} w_1/w_1 & w_2/w_1 & \cdots & w_n/w_1 \\ w_1/w_2 & w_2/w_2 & \cdots & w_n/w_2 \\ \vdots & \vdots & \ddots & \vdots \\ w_1/w_n & w_2/w_n & \cdots & w_n/w_n \end{pmatrix}$$
This matrix is also consistent [20], and the weight of the $i$th element is given by
$$\frac{1/w_i}{\sum_{i=1}^{n} 1/w_i}$$
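These two formulas can be verified numerically. The sketch below, with made-up criterion values, builds the consistent matrix with entries $w_i/w_j$ and shows that normalizing the values directly and extracting the principal eigenvector give the same weights.

```python
import numpy as np

def comparison_matrix(values, negative=False):
    """Build the consistent pairwise matrix a_ij = w_i/w_j (or w_j/w_i for negative criteria)."""
    w = np.asarray(values, dtype=float)
    A = np.outer(w, 1.0 / w)           # a_ij = w_i / w_j
    return A.T if negative else A      # transposing flips to a_ij = w_j / w_i

def weights(values, negative=False):
    """Weights per the text: w_i/sum(w) for positive criteria, (1/w_i)/sum(1/w) for negative."""
    w = np.asarray(values, dtype=float)
    if negative:
        w = 1.0 / w
    return w / w.sum()

vals = [4.0, 2.0, 1.0]                 # illustrative criterion values, not survey data
A = comparison_matrix(vals)
print(np.round(weights(vals), 4))      # direct normalization: [0.5714 0.2857 0.1429]

# The principal eigenvector of a consistent matrix recovers the same weights.
eigvals, eigvecs = np.linalg.eig(A)
v = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
print(np.round(v / v.sum(), 4))        # matches the direct normalization
```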
The data collected were analyzed with the analytic hierarchy process (AHP) technique to arrive at the priorities/weights.
9.4 Proposed analytical hierarchy processing model of successful healthcare
Implementing search for any healthcare app requires awareness of the information related to the problem. In the case of healthcare, the influential factors for implementation play an important role, while some factors are not important for healthcare at all; each influential factor carries a weight according to its importance, which is the reason for using an AHP framework to find the importance of the influential factors. AHP has been used extensively as a tool for analytical decisions related to healthcare; its contribution is significant, and it has been considered an effectiveness-measurement model for the healthcare industry [22]. The significant factors were determined through extensive consultation with doctors and consultants, as well as industry experts and the owners/developers/users in the healthcare sector, that is, the hospitals and labs covered by the present study. The literature review is synthesized from Ref. [22]; based on the opinions of the experts, there are two main factors, User and Hospital/Labs, and from these, 7 of the 10 factors are used for the implementation of healthcare, as described in Fig. 9.3.
User (C1): It is important for users or patients to benefit from the capture and storage of data, for both usage and search, in the implementation of healthcare. The process starts with the users in the healthcare system, and all the users in the system contribute to the growth of research in the field of healthcare.
[Figure 9.3 near here: AHP hierarchy with the goal "Successful parameters in healthcare" decomposed into the criteria User and Lab/Hospital, each with the subfactors data access/integration, data privacy and security, data confidentiality, data sharing, data assurance/relevancy, reliability, and cost involvement.]
Figure 9.3 Analytical hierarchical process model for parameters in healthcare.
The processes used for collecting data, capturing, storing, and retrieving it, are becoming a challenge as the volume of data grows every second. To meet the challenge, it is important to work on the integration of the system, which is associated with the data inputs and their integration for future purposes. Accordingly, the user/patient factor has been decomposed into seven subfactors. From the investigation of the factors associated with the user/patient, the seven subfactors of C1 are: C11, data access/integration; C12, data privacy and security; C13, confidentiality; C14, data sharing; C15, data assurance/relevancy; C16, reliability; and C17, cost involvement.
9.4.1 Hospital/lab (C2)
Data retrieval in healthcare is most important for labs and hospitals: it involves assessing documents based on their needs and identifying them on the basis of users or patients for the hospital and laboratory. The hospital/lab factor has likewise been decomposed into seven subfactors. From the investigation of the associated hospital or pathology labs, the seven subfactors of C2 (hospital and labs) are: C21, data access/integration; C22, data privacy and security; C23, confidentiality; C24, data sharing; C25, data assurance/relevancy; C26, reliability; and C27, cost involvement.
Methodology used: The AHP methodology derives the weights (priorities) from pairwise comparisons within each pair of factors. To determine the relative weights, users/hospitals/labs can be asked to make pairwise comparisons using a 1 to 9 preference scale [20]. In the present study, however, the pairwise comparisons rely on actual data, that is, data extracted from the questionnaire survey; the consistency of the dataset is an advantage of using actual (quantitative) data over a preference scale for pairwise comparison. The quantitative data were obtained from a questionnaire survey of healthcare users of different age groups, working as employees of small and medium-sized enterprises associated with the healthcare sector [23]. The survey of web users was conducted online for users and hospitals (see Appendix 1). The parameters were identified at the category and subcategory levels for the analysis of the problem, and a Likert scale was used to collect the data in terms of system data and user, with further subcategories of retrieval, relevance, layout, and ranking.
Table 9.4 Scale weights.

80% to 100%    Very significant
60% to 80%     Significant
40% to 60%     Neutral
20% to 40%     Insignificant
0% to 20%      Very insignificant
The questions use a 5-point scale spanning 1% to 100%, as shown in Table 9.4. The details are as follows: the 5-point Likert questionnaire consisted of 15 questions divided into two sections, user/patient and hospital/lab. The questionnaire survey form agreed on for the pairwise judgment in the AHP methodology is discussed here; the survey was open to users in any area of India. First, the average of the 100 responses (preferences on the 5-point Likert scale) obtained for each question was calculated; these averages describe the central location of the entire distribution of responses. Then, for each category, the composite preference value (out of $n$, where $n = 5$) was calculated. The composite preference value (CPF) is the ratio of the calculated value (CV) to the maximum value (MV):
$$\text{CPF} = \frac{\text{CV}}{\text{MV}}$$
where CV is the sum of the average values in a category, and MV is the sum of the highest possible values a respondent could give in that category.
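A minimal sketch of this aggregation, using invented Likert responses rather than the study's survey data, might look as follows.

```python
import numpy as np

def composite_preference_value(responses_by_question, max_score=5):
    """CPF = CV/MV, where CV is the sum of per-question averages in a category
    and MV is the sum of the highest possible scores (here, 5 per question)."""
    averages = [np.mean(r) for r in responses_by_question]  # central location per question
    cv = sum(averages)                                      # calculated value
    mv = max_score * len(responses_by_question)             # maximum value
    return cv / mv

# Illustrative category with three questions, each answered by four respondents
# on the 5-point scale (1 = very insignificant ... 5 = very significant).
category = [
    [5, 4, 4, 5],
    [3, 4, 4, 3],
    [5, 5, 4, 4],
]
print(round(composite_preference_value(category), 3))  # 0.833 for these invented responses
```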
9.4.2 Analytic hierarchy processing model description
The judgment matrices shown in Exhibit 9.1, measuring the individual attributes and calculating their global scores, show the relative significance with respect to the overall objective; the other objectives are also summarized. For the pairwise comparison technique used in AHP, the attributes and subattributes rely on inputs obtained from the survey sent to the relevant participants in that domain. Two main attributes have been considered important for healthcare, namely user/patient and lab/hospital; the pairwise assessment for healthcare is shown in Exhibit 9.1.
Exhibit 9.1 Pairwise comparison with respect to healthcare.

Successful healthcare   User        Lab/hospital   Weights
User                    0.8333333   0.833333333    1.67
Lab/hospital            0.1666667   0.166666667    0.33
From the pairwise assessment for healthcare, users/patients (83%) are more important than hospitals/labs (17%); it is therefore significant to consider users/patients, and the benefits to them, ahead of hospitals/labs. According to the analytical hierarchical model developed in the present study, the user/patient factor is further decomposed into data access/integration, data privacy and security, confidentiality, data sharing, data assurance/relevancy, reliability, and cost involvement, in order to capture reality. On pairwise comparison of the user subfactors with respect to a successful healthcare system, data access/integration (7.820), data privacy and security (7.859), confidentiality (7.884), data sharing (7.976), data assurance/relevancy (7.503), reliability (7.416), and cost involvement (7.079) weigh almost equally. Higher weights matter more than lower ones, but since the differences are very small, all of these parameters are important to the development of healthcare systems; the user/patient side can be read off clearly from Exhibit 9.2. The hospital/lab factor is decomposed into the same subfactors for apprehending certainty. On pairwise judgment of the hospital/lab subfactors with respect to the development of a successful healthcare system, data access/integration (6.371), data privacy and security (7.808), confidentiality (7.794), data sharing (7.934), data assurance/relevancy (8.071), reliability (8.188), and cost involvement (7.844) again weigh almost equally; the corresponding values are shown in Exhibit 9.3.
Exhibit 9.2 Pairwise comparison with respect to the user/patient factor: successful factors in healthcare.
(Columns: DAI = data access/integration; DPS = data privacy and security; CON = confidentiality; DSH = data sharing; DAR = data assurance/relevancy; REL = reliability; CIN = cost involvement.)

User                        DAI     DPS     CON     DSH     DAR     REL     CIN     Average   Sum     Weighted sum
Data access/integration     0.084   0.066   0.053   0.078   0.086   0.122   0.150   0.091     0.714   7.820
Data privacy and security   0.169   0.134   0.107   0.089   0.112   0.115   0.199   0.132     1.040   7.859
Confidentiality             0.206   0.164   0.130   0.083   0.122   0.133   0.243   0.154     1.216   7.884
Data sharing                0.154   0.164   0.226   0.144   0.113   0.111   0.243   0.165     1.315   7.976
Data assurance/relevancy    0.198   0.233   0.218   0.258   0.203   0.173   0.075   0.194     1.456   7.503
Reliability                 0.177   0.225   0.254   0.335   0.305   0.259   0.066   0.232     1.718   7.416
Cost involvement            0.012   0.015   0.012   0.013   0.060   0.086   0.022   0.031     0.223   7.079
Exhibit 9.3 Pairwise comparison with respect to the hospitals/labs factor: successful factors in healthcare.
(Columns as in Exhibit 9.2.)

Hospital/lab                DAI     DPS     CON     DSH     DAR     REL     CIN     Average   Sum     Weighted sum
Data access/integration     0.057   0.020   0.022   0.018   0.057   0.087   0.099   0.051     0.327   6.371
Data privacy and security   0.150   0.052   0.046   0.035   0.027   0.028   0.037   0.054     0.418   7.808
Confidentiality             0.184   0.064   0.057   0.034   0.032   0.034   0.064   0.067     0.523   7.794
Data sharing                0.107   0.085   0.097   0.057   0.033   0.022   0.020   0.060     0.476   7.934
Data assurance/relevancy    0.138   0.091   0.101   0.100   0.057   0.046   0.145   0.097     0.781   8.071
Reliability                 0.110   0.087   0.096   0.152   0.070   0.057   0.199   0.110     0.902   8.188
Cost involvement            0.253   0.603   0.390   1.275   0.173   0.126   0.437   0.465     3.650   7.844
Exhibit 9.4 Overall weight of user and hospital subfactors with respect to healthcare.

Subfactor                   User    Lab/hospital   Priority   Rank
Data access/integration     7.82    6.37           15.16      6
Data privacy and security   7.86    7.81           15.70      3
Confidentiality             7.88    7.79           15.74      2
Data sharing                7.98    7.93           15.94      1
Data assurance/relevancy    7.50    8.07           15.20      4
Reliability                 7.42    8.19           15.09      5
Cost involvement            7.08    7.84           14.41      7
Based on the global parameters, data assurance/relevancy and reliability carry high weights on the hospital/lab side and are therefore considered more important than the other subfactors there. Further, combining the user/patient and labs/hospital sides as a whole, the global weights are: data access/integration (15.16), data privacy and security (15.70), confidentiality (15.74), data sharing (15.94), data assurance/relevancy (15.20), reliability (15.09), and cost involvement (14.41); the key to healthcare thus depends on considering the overall weights of both user/patient and labs/hospital, as shown in Exhibit 9.4. In other words, owners/developers of healthcare systems need to prioritize their efforts across data access/integration, data privacy and security, confidentiality, data sharing, data assurance/relevancy, reliability, and cost involvement, essentially in that order. To summarize, the data sharing aspect leads at Rank 1 for healthcare: healthcare management software and apps must account for it as carrying the highest global weight in the healthcare industry, and decisions are made on the basis of these weights. For the hospital/lab side of healthcare, the main concern is reliability, followed by data assurance/relevancy, across different aspects of healthcare; data correctness and relevance in this area are important. For the user side, the main concern is reliability, followed by data sharing and confidentiality of data. The other areas are also important and have the weights shown in Exhibit 9.4, with Ranks 1 to 7 alongside. The weights thus determined show what the study's respondents consider important in defining the success of healthcare.
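For reference, the standard AHP synthesis multiplies each subfactor's local weight by the weight of its parent criterion and then ranks the results. The sketch below applies that rule using the criterion weights from Exhibit 9.1 and, for three subfactors, the average local weights from Exhibits 9.2 and 9.3; it illustrates the mechanics only and is not a reproduction of the Exhibit 9.4 priorities, which the chapter computes differently.

```python
# Generic AHP global-weight synthesis: global = criterion weight * local subfactor weight.
# Criterion weights from Exhibit 9.1; local weights are the "Average" values of
# Exhibits 9.2 and 9.3 for three of the seven subfactors.

criterion_w = {"user/patient": 0.833, "hospital/lab": 0.167}

local_w = {
    "user/patient": {"data sharing": 0.165, "reliability": 0.232, "confidentiality": 0.154},
    "hospital/lab": {"data sharing": 0.060, "reliability": 0.110, "confidentiality": 0.067},
}

global_w = {}
for crit, subs in local_w.items():
    for sub, w in subs.items():
        global_w[sub] = global_w.get(sub, 0.0) + criterion_w[crit] * w

for rank, (sub, w) in enumerate(sorted(global_w.items(), key=lambda kv: -kv[1]), start=1):
    print(rank, sub, round(w, 3))
```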
Finally, from this representation it emerges that, globally, data sharing, data assurance/relevancy, and reliability are the most important aspects, capturing the relevance of the main attributes for healthcare analytics. Layout factors should be prioritized for the retrieval mechanism and also for storing and organizing the data.
9.5 Conclusion
Users of healthcare would therefore be well advised to reinforce healthcare processes, especially those that plan, organize, implement, and monitor the layout features of app and web development for search, so as to enhance web features that can be plugged in; bots and agents also work toward better outcomes. Successful healthcare needs web-based development covering industry-wide reuse of documents in analytics, now and in the future; patient-centered documents for retrieval will be the new venture of this technology in app-based form. A number of points emerge from the discussion of successful healthcare. It is significant to generate relevant data/documents about healthcare for its own benefit. This chapter discusses appropriate data relevancy, which helps in healthcare data storage and retrieval. Hospital owners have to be concerned with the technical implementation of data, including the ranking involved in healthcare, and search-engine owners should work on new techniques for storing data efficiently so as to satisfy the user. The factors and subfactors studied here can be used as a mechanism in the healthcare sector; the ideas and circumstances for successful usage can be identified and applied in implementation, and the observations can be tested and properly explored. The emphasis is on the provision of the success factors required for building the analytic hierarchy process. Effective parameters such as data sharing, data assurance/relevancy, and reliability have been incorporated in the AHP implementation for analysis. To make the proposed work more revealing, the applicability of these parameters has been explored, with further focus on the proposed model describing the interaction and interrelation between hospitals and users, as presented in this chapter. Based on the findings, the parameters are identified, and the technique can be recommended and applied in the process model of application (app) development.
Appendix 1 Big data analytics for healthcare
The parameters were identified at the category and subcategory levels for the analysis of the problem. A Likert scale was used to collect the data in terms of system data and user, with further subcategories of retrieval, relevance, layout, and ranking. The questions are on a 5-point scale interpreted as: 80% to 100% very significant, 60% to 80% significant, 40% to 60% neutral, 20% to 40% insignificant, and 0% to 20% very insignificant. The questionnaire consisted of 15 questions on the 5-point Likert scale:

No   Parameter                                                                           Description
1    Your name                                                                           Basic information
2    Your email                                                                          Basic information
3    Gender                                                                              Basic information
4    Age                                                                                 Basic information
5    State                                                                               Basic information
6    Rate the priority/rank for healthcare services [1 (high) to 7 (low)]
7    How do you rate the importance of data access/integration in healthcare services
8    How do you rate the importance of privacy and security of data in healthcare services
9    How do you rate the importance of confidentiality of data in healthcare services
10   How do you rate the importance of data sharing in healthcare services
11   How do you rate the importance of data assurance/relevancy in healthcare services
12   How do you rate the importance of reliability in healthcare services
13   How do you rate the importance of cost involvement in healthcare services
14   Are you a paramedical staff member
15   Select your profession (doctor/paramedic/associated with healthcare/other)

There were 100 responses (preferences on the 5-point Likert scale) for each question, and the average value for each question was calculated to describe the central location of the entire distribution of responses.
References
[1] A. Gandomi, M. Haider, Beyond the hype: big data concepts, methods, and analytics, Int. J. Inf. Manag. 35 (2014) 137–144.
[2] R. Gamache, H. Kharrazi, J.P. Weiner, Public and population health informatics: the bridging of big data to benefit communities, Yearb. Med. Inform. (2018) 199–206.
[3] K. Abou El Mehdi, A. Beni-Hssane, H. Khaloufia, M. Saadi, Big data security and privacy in healthcare: a review, Procedia Comput. Sci. 117 (2017) 73–80.
[4] J. Archenaa, E. Anita, A survey of big data analytics in healthcare and government, Procedia Comput. Sci. 50 (2015) 408–413.
[5] Y. Wang, L. Kung, T.A. Byrd, Big data analytics: understanding its capabilities and potential benefits, Technol. Forecast. Soc. Change 126 (2018) 3–13.
[6] W. Raghupathi, V. Raghupathi, Big data analytics in healthcare: promise and potential, Health Inf. Sci. Syst. 2 (3) (2014) 1–10.
[7] J. Archenaa, E.A. Mary, A survey of big data analytics in healthcare and government, Procedia Comput. Sci. 50 (2015) 408–413.
[8] C. Kruse, R. Goswamy, Y. Raval, Challenges and opportunities of big data in health care: a systematic review, JMIR Med. Inform. 4 (4) (2016) 1.
[9] Y. Wanga, L. Kung, W.Y. Wang, C.G. Cegielski, An integrated big data analytics-enabled transformation model: application to health care, Inf. Manag. (2018) 64–79.
[10] Y. Khan, M. Saleem, M. Mehdi, A. Hogan, Q. Mehmood, D. Rebholz-Schuhmann, et al., SAFE: SPARQL Federation over RDF data cubes with access control, J. Biomed. Semant. 8 (5) (2017) 1–22.
[11] I.D. Dinov, Methodological challenges and analytic opportunities for modeling and interpreting big healthcare data, GigaScience 5 (12) (2016) 1–15.
[12] H.Y. Karen, G. Dongliang, M.H. Max, Big data analytics for genomic medicine, Int. J. Mol. Sci. 18 (412) (2017) 1–18.
[13] R.D. Badinelli, D. Sarno, Integrating the Internet of Things and big data analytics into decision support models for healthcare management (2017) 1–18.
[14] I.Y. Choi, T.-M. Kim, M. Shin, S.K. Mun, Y.-J. Chung, Perspectives on clinical informatics: integrating large-scale clinical, genomic, and health information for clinical care, Genom. Inform. 11 (4) (2013) 186–190.
[15] A. Belle, R. Thiagarajan, S.R. Soroushmehr, F. Navidi, D.A. Beard, K. Najarian, Big data analytics in healthcare, BioMed. Res. Int. (2015) 1–6.
[16] M. Greiver, J. Barnsley, R.H. Glazier, B.J. Harvey, M. Rahim, Measuring data reliability for preventive services in electronic medical records, BMC Health Serv. Res. 12 (16) (2012) 1–9.
[17] M.-J. Sepulveda, From worker health to citizen health: moving upstream, Occup. Environ. Med. (2014) 1–13.
[18] C. Ho Lee, H.-J. Yoon, Medical big data: promise and challenges, Kidney Res. Clin. Pract. 38 (2017) 3–11.
[19] E. Kornyshova, C. Salinesi, Introducing multicriteria decision making into software engineering, INSIGHT - International Council on Systems Engineering (INCOSE), Wiley, 11 (3) (2008) 24–26. hal-00707300.
[20] T.L. Saaty, Fundamentals of Decision Making and Priority Theory with the AHP, RWS, 2000.
[21] M. Arora, U. Kanjilal, D. Varshney, Successful efficient and intelligent data retrieval: using analytic hierarchy process (AHP), in: Handbook of Management and Behavioural Science, Wisdom Publication, New Delhi, 2011, p. 1.
[22] E. Cheng, H. Li, Analytic hierarchy process: an approach to determine measures for business performance, Measur. Bus. Excell. 5 (3) (2001) 30–36.
[23] M. Arora, U. Kanjilal, D. Varshney, Hierarchical model for successful (efficient and intelligent) data retrieval, Int. J. Auton. Agents Multi Agent Syst. (2012) 331–343.
Chapter 10 Firefly—Binary Cuckoo Search Technique based heart disease prediction in Big Data Analytics
G. Manjula (1), R. Gopi (2), S. Sheeba Rani (3), Shiva Shankar Reddy (4) and E. Dhiravida Chelvi (5)
(1) Department of Information Science & Engineering, Dayananda Sagar Academy of Technology & Management, Bengaluru, India; (2) Department of Computer Science and Engineering, Dhanalakshmi Srinivasan Engineering College, Perambalur, India; (3) Department of Electrical and Electronics Engineering, Sri Krishna College of Engineering and Technology, Coimbatore, India; (4) Department of Computer Science and Engineering, SRKR Engineering College, Bhimavaram, India; (5) Department of Electronics and Communication Engineering, Mohamed Sathak A.J. College of Engineering, Chennai, India
Abstract
Nowadays, big data analysis is receiving increasing attention in complex healthcare settings. Fetal growth curves, the classic case of big health data, are one example; here the target is the prediction of coronary heart disease. The proposed framework introduces the idea of summarizing large big data inputs in multidimensional scenarios, to which known data mining methods such as preprocessing, optimal feature selection, and forecasting are applied. The dataset contains many random and variable values that can lead to incorrect results, so the utmost care is needed when dealing with these values in order to obtain the best performance. Accordingly, the data are processed with bacterial foraging optimization (BFO) before optimal feature selection and model building. Overall, this defines a multidimensional mining approach that addresses complex healthcare environments. The work aims to predict the risk of coronary artery disease (CAD) using machine learning algorithms, specifically Firefly—Binary Cuckoo Search (FFBCS). A preliminary analysis of the framework's performance is also presented.
Keywords: Coronary heart disease; preprocessing; bacterial foraging; Firefly; Binary Cuckoo Search
Applications of Big Data in Healthcare. DOI: https://doi.org/10.1016/B978-0-12-820203-6.00007-2 © 2021 Elsevier Inc. All rights reserved.
10.1 Introduction
Data mining draws information out of stored databases. Mined information is of immeasurable value in exploration, since a large body of information contains mostly nonessential data; data mining separates the information and looks for hidden patterns that can be converted into major models [1]. Big data plays an important role in protecting patients' wellbeing and preventing sudden cardiac death. Apollo Hospitals and AliveCor Inc. have made it possible to use a portable EKG (electrocardiogram) device that enables mobile phones to screen for arrhythmias (irregular heartbeats). The device mounts on the mobile phone and reads the patient's signal when positioned on the chest; the patient's health data are then loaded directly, as an electrocardiogram, onto the mobile phone and into the patient database [2]. Acute myocardial infarction (AMI), commonly known as a heart attack, is one of the most common cardiovascular conditions. AMI occurs when the flow of blood to the heart muscle is obstructed and the muscle is damaged; the main cause of most heart attacks is a blockage in one of the coronary arteries that keeps blood from reaching the heart muscle [3]. Coronary artery disease today represents one of the main dangers to health: as verified by the WHO, coronary insufficiency is among the world's leading causes of death (80%). In this context, the accessibility of information and the procedures for information extraction, especially artificial intelligence (AI), support the early recognition of coronary artery conditions in a patient. We rely on data mining systems to predict coronary artery disease: data mining separates useful data from a larger body of information [4]. Through data mining, a mass of information becomes knowledge: hidden, previously opaque patterns, connections, and facts that are difficult to uncover with standard statistical methods [5]. Extracting information is the task of pulling essential, actionable data from a group of records, whether labeled or not; data acquired without mining remain unexamined [6]. The idea behind personal control of information is to cover the path from the raw data index to its connections and knowledge. In such cases, the large-scale investigation of the information, and a personal grasp of what it means for
making sensible decisions, has been carefully considered; neglecting it has resulted in certain losses for organizations. This section shows the extraction of risk information from the coronary heart disease database: the database provides information for the screening of patients with heart disease, and it was preprocessed to make the mining process productive [7]. Mining big data in healthcare yields a number of benefits, such as wellbeing models, disease relationships, and relationship identifiers for individuals. Various investigations have examined the relationships between health conditions and cardiovascular diseases (CVDs) and have analyzed structures for the diagnosis and prediction of coronary heart disease using artificial neural systems, data mining, and association rules (i.e., boosting models) to uncover important patterns among distinct clinical treatment records [8]. Most classification algorithms for big data take only structured information into account; unstructured information is generally handled by combining structured and unstructured data, which improves the prediction of coronary heart disease risk [9]. This proposal holds that information display and investigative devices, for example data mining, can create an information-based environment that helps improve the quality of clinical decisions [10]. A large number of medical records were available for dissecting and training the classification model on attributes such as age, gender, physical examination findings, clinical tests, diagnostic imaging results, and outcomes [11]. Clinical profiles such as age, sex, blood pressure, and glucose can be used to predict the probability of coronary heart disease, providing important information such as models of the connections between clinical variables identified with coronary artery disease. It is a simple-to-use, versatile, solid, and extensible web application [12]. In this study, we analyze and evaluate the use of different machine learning algorithms in the prediction of heart disease. This work aims to predict the risk of coronary heart disease using machine learning algorithms such as bacterial foraging optimization (BFO) and Firefly with Binary Cuckoo Search. In addition, a comparative study of these methods is performed based on prediction accuracy; the results show that these models can efficiently predict the risk of heart disease. The rest of the chapter is organized as follows: Section 2 describes the related work. In Section 3, the proposed work is described,
Section 10.3 describes the proposed work, including preprocessing, analytics and modeling, and the dataset description. Section 10.4 presents the experimental analysis, describing the performance measurement parameters for the different algorithms and comparing their performance, and Section 10.5 concludes the chapter.
10.2 Literature survey
The growing volume of data in healthcare services makes it possible to anticipate diseases quickly and efficiently and to ensure proper patient care. Dhanushree et al. [13] observed that, with varied attributes for various ailments across fields, epidemic prediction is often imprecise; they therefore refined AI algorithms to feasibly anticipate chronic diseases and to address the problem of incomplete data, testing the modified estimation models on real clinical information. Nowadays many segments of the population are affected by coronary heart disease, caused mainly by poor livelihood and lifestyle and aggravated by poor forecasts from specialists and people's limited awareness of their own bodies [14]. Predictions can be improved by recurrent neural networks (RNNs), which can anticipate outcomes based on a patient's medical history; predicting coronary heart disease in this way can reduce diagnosis time and improve the accuracy of symptom interpretation. New advances such as AI and big data analytics have proven to be promising answers for biomedical networks, medical problems, and patient care. They also help anticipate disorders early by interpreting clinical information: disease management can be further improved by detecting the early signs of an illness, and early prediction helps monitor symptoms and choose satisfactory treatment. AI approaches can be used to predict chronic diseases such as kidney and coronary artery disease by building classification models. Krishnani et al. [15] proposed a comprehensive preprocessing method to address coronary heart disease (CHD). The methodology includes the elimination of invalid values, resampling, standardization, normalization, grouping, and estimation. That work anticipates the risk of CHD using AI algorithms such as random forest, decision trees (DTs), and k-nearest neighbors, and a comparative relationship between these algorithms is drawn based on prediction accuracy.
In addition, k-fold cross-validation is used to handle data irregularities. The algorithms were tested against the 4240-record Framingham Heart Study (FHS) dataset: random forest, decision tree, and k-nearest neighbor achieved accuracies of 96.8%, 92.7%, and 92.89%, respectively, and in combination with the preprocessing steps the random forest produced more accurate results than the other AI algorithms. The healthcare sector has grown enormously in the past century; medicine has developed since the days of the Vedas, advancing and expanding its way of managing understanding and treatment. Yadwad et al. [16] proposed the use of data mining strategies to predict coronary events, describing how to select stages and algorithms for mining big data. They propose a parallel prediction scheme that uses classification strategies such as the support vector machine (SVM), Naive Bayes (NB), and k-nearest neighbor, with the parallel classification built on a MapReduce algorithm for predicting coronary artery disease. The performance of the proposed algorithm is far superior to the NB classifier and the sequential SVM classifier, and MapReduce is implemented on Hadoop, a distributed computing system. Comprehensive investigation of data in complex healthcare situations is a strong focus today. Fetal development curves, an excellent example of big health data, are used in prenatal medicine to identify potential fetal development problems early, evaluate perinatal outcomes, and treat potential complications immediately; the curves currently in use and the associated analysis systems have been criticized for low accuracy, and new strategies based on customized development curves have been proposed in the literature. From this point of view, Bochicchio et al. [17] address the problem of creating modified or adapted fetal development curves with big data systems, proposing a framework that can synthesize the colossal dimensions of input big data from multidimensional perspectives to which notable information-extraction techniques, such as clustering and classification, are applied. In principle, this characterizes a multidimensional extraction approach suited to complex healthcare conditions, and a first study of the adequacy of the system is also presented. The WHO estimates that CVD is the leading source of death worldwide and in India; CVDs are caused by disorders of the heart and blood vessels and include coronary heart disease (cardiac failure).
Information extraction plays an important role in creating intelligent prediction models for heart disease (HD) detection systems that use patient datasets to help clinicians reduce coronary heart disease mortality. Numerous tests have been conducted to create models independently or by consolidating data mining with data innovation, including the DT, NB, meta-heuristic methodologies, trained neural networks (NNs), artificial intelligence, and standalone learning algorithms such as KNN and SVM. A huge number of clinical records is used as input to the proposed framework, and the MapReduce system extracts the necessary data about cardiovascular patients from this clinical dataset. Nagamani et al. [18] evaluated the use of the MapReduce algorithm in parallel and distributed frameworks on the Cleveland dataset and contrasted it with the established ANN technique. The test results state that the planned strategy achieved an average prediction accuracy of 98%, which is higher than the conventional recurrent fuzzy neural network; moreover, this MapReduce procedure worked better than previous techniques, whose forecast accuracy lay somewhere between 95% and 98%. These results suggest that the MapReduce method can be used to accurately predict the danger of HD within the framework. Hypertension is related to increased morbidity and mortality from coronary heart disease (CHD); despite this, the risk factors for developing coronary heart disease in hypertensive patients remain nebulous. Chen et al. [19] therefore considered conventional and unusual risk factors for CHD in the hypertensive population. The information was extracted from the regional clinical big data of Shenzhen, a huge city in China. The examination included 3395 hypertension patients aged 30 to 79 years; of these, 1153 had coronary events within 3 years of their first visit. A logistic regression model was used to measure hazard factors and predict coronary events at 3 years. The results showed that conventional risk factors for coronary artery disease, such as age, body mass index, diabetes, hyperlipemia, and chronic kidney disease, were still present in hypertensive patients. Furthermore, the odds ratio (95% confidence interval) for CAD was 1.54 (1.19 to 2.01) for emotional or mental disorders and 1.69 (1.21 to 2.34) for sleep problems. In addition, the model showed good discrimination with an AUC of 0.839, offering hypertensive patients another insight: maintaining psychological wellbeing and high-quality sleep could help prevent CAD.
(Fig. 10.1 flowchart: input big data set → preprocessing phase → feature selection → optimal feature selection using BFO → disease prediction using FFBCS.)
Given the high risk of coronary artery disorder in patients with hypertension, controlling these risk elements can have a critical preventive impact on coronary artery disease (Fig. 10.1).
10.3 Proposed methodology
New technologies, such as machine learning and big data analysis, provide tangible solutions to biomedical communities, health problems, and patient care. They also contribute to the early prognosis of diseases through accurate interpretation of clinical data. Disease management strategies can be further improved by identifying the first signs of illness; this prognosis is useful for controlling symptoms and treating the disease properly. By developing classification models, machine learning approaches can be used to predict chronic diseases such as kidney and heart disease. In this chapter, we propose a comprehensive preprocessing approach for predicting coronary heart disease (CHD); it involves handling null values, transformation, reorganization, standardization, normalization, and classification, with optimal feature selection performed using BFO.
Figure 10.1 Proposed big data based heart disease prediction.
This work aims to predict the risk of CHD using machine learning algorithms, namely the FFBCS. In addition, a comparative study between these methods is performed based on the accuracy of the prediction.
10.3.1 Preprocessing
Preprocessing is a strategy for obtaining complete, well-founded, and interpretable data. The quality of the data influences the mining decisions made by the machine learning algorithms: low-quality data leads to low-quality results. The FHS dataset is therefore prepared with the associated preprocessing steps. Irrelevant features can reduce the productivity of the model and lower the learning rate, so feature selection plays an important role in preprocessing, selecting those features that contribute most to predicting the desired results. Using optimal feature selection, unimportant attributes are also eliminated from the FHS dataset, which improves the subsequent classification performance. A minimal sketch of this stage is given below.
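To make this concrete, here is a minimal preprocessing sketch in Python, assuming the publicly available FHS extract is stored as framingham.csv with a TenYearCHD target column; both names are assumptions for illustration, not details given in the chapter.

```python
# Minimal preprocessing sketch for an FHS-style CSV (assumed file and column names).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("framingham.csv")      # assumed file name
df = df.dropna()                        # eliminate records with invalid/null values

X = df.drop(columns=["TenYearCHD"])     # assumed target column
y = df["TenYearCHD"]

# Normalize every feature to the [0, 1] range before feature selection.
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled.shape, y.value_counts().to_dict())
```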
10.3.2 Optimal feature selection using bacterial foraging optimization
Data mining methods can be used to preprocess data and feed machine learning algorithms, with distributed computing used where needed. Well-known machine learning algorithms already help assess coronary heart disease risk and help professionals anticipate it. Data mining extracts the important information from large volumes of data, and data preparation is an essential process in both data mining and machine learning. Dimensionality reduction is an important component of data reduction; its basic techniques are feature selection and feature extraction. Feature selection is a technique for choosing a subset of the relevant features, and feature selection procedures can be seen as a subset of the broader feature extraction field. The BFO algorithm is a metaheuristic: a population-based optimization process developed by mimicking the foraging behavior of Escherichia coli bacteria. Its essential elements are explained briefly below. Chemotaxis (the detection, pursuit, and ingestion of food): during this process an E. coli bacterium uses its flagella to move toward a feeding position by swimming and tumbling. When it swims, it moves in a fixed direction; when it tumbles, the direction of the bacterium's search changes.
Both strategies alternate continuously, so the bacterium follows quasi-random paths in search of favorable nutrient gradients, and these movements are repeated throughout its lifetime. Swarming: in this mechanism, a bacterium that has found a good path to the food source attracts other microorganisms with an attraction signal after advancing toward the best food. The cell-to-cell signaling among E. coli bacteria is expressed by the following equations:

$$ S\big(\theta, D(j,k,l)\big) = \sum_{i=1}^{N} S_{cc}\big(\theta, \theta^{i}(j,k,l)\big) = A + B \qquad (10.1) $$

$$ A = \sum_{i=1}^{N} \left[ -d_{attract} \exp\left( -W_{attract} \sum_{m=1}^{D} \big(\theta_m - \theta_m^{i}\big)^2 \right) \right] \qquad (10.2) $$

$$ B = \sum_{i=1}^{N} \left[ h_{repell} \exp\left( -W_{repell} \sum_{m=1}^{D} \big(\theta_m - \theta_m^{i}\big)^2 \right) \right] \qquad (10.3) $$
where θ is the position of a bacterium in the search space at the jth chemotactic, kth reproduction, and lth elimination-dispersal step, and θ_m^i is the mth parameter of the ith bacterium. S(θ, D(j,k,l)) is the value added to the objective function, N is the total number of microorganisms, and D is the number of parameters to be optimized. Among the remaining parameters, d_attract is the depth of the attractant signal released by a bacterium and W_attract is the width of that signal, while h_repell and W_repell are the height and width of the repellent signal (attraction signals a food source; repulsion signals unsafe proximity). Reproduction: during the swarming movement, microbes form groups with a positive tendency to aggregate, which can increase the bacterial concentration. After grouping, the bacteria are sorted in descending order of their health (accumulated fitness): the unhealthy bacteria die, and each microorganism with a sufficiently high health value duplicates itself in order to keep the population size stable. Elimination-dispersal: the population of a bacterial colony can change gradually or unexpectedly with ecological conditions, for example temperature changes, dangerous situations, and changing access to food.
A group of microbes trapped in a limited area (a local optimum) can therefore be killed off, or the group can be dispersed to another nutrient region of the D-dimensional search space. Dispersal may set back the progress of chemotaxis, but it can also occasionally place microorganisms close to a good nutrient source and strengthen the subsequent chemotaxis in locating other nutrient sources. The above procedures are repeated until optimized solutions are achieved. A compact sketch of this loop structure follows.
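To make the three nested phases concrete, the following is a minimal, self-contained BFO sketch over a stand-in objective function; the population size, step length, phase counts, dispersal probability, and the sphere objective are all illustrative assumptions rather than values taken from the chapter.

```python
# Compact BFO sketch: chemotaxis (tumble + swim), reproduction, elimination-dispersal.
import numpy as np

rng = np.random.default_rng(0)

def sphere(theta):                      # stand-in objective; lower cost is better
    return float(np.sum(theta ** 2))

N, D = 20, 10                           # bacteria, dimensions (assumed)
n_chem, n_swim, n_repro, n_elim = 10, 4, 4, 2
step, p_elim = 0.1, 0.25

theta = rng.uniform(-1.0, 1.0, size=(N, D))
health = np.zeros(N)

for _ in range(n_elim):                              # elimination-dispersal loop
    for _ in range(n_repro):                         # reproduction loop
        health[:] = 0.0
        for _ in range(n_chem):                      # chemotaxis loop
            for i in range(N):
                cost = sphere(theta[i])
                direction = rng.uniform(-1, 1, D)    # tumble: pick a random direction
                direction /= np.linalg.norm(direction)
                for _ in range(n_swim):              # swim while the position improves
                    candidate = theta[i] + step * direction
                    if sphere(candidate) < cost:
                        theta[i], cost = candidate, sphere(candidate)
                    else:
                        break
                health[i] += cost                    # accumulate cost as (inverse) health
        # Reproduction: the healthier half splits; the worse half dies.
        order = np.argsort(health)
        theta = np.concatenate([theta[order[: N // 2]]] * 2)
    # Elimination-dispersal: randomly relocate a fraction of the bacteria.
    mask = rng.random(N) < p_elim
    theta[mask] = rng.uniform(-1.0, 1.0, size=(int(mask.sum()), D))

best = min(theta, key=sphere)
print("best cost:", sphere(best))
```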
10.3.3 Optimization by using Firefly—Binary Cuckoo Search
Predictive classification is commonly used in the healthcare services industry. Classification, one of the data mining techniques used to anticipate and group data into predefined classes, has been applied with several techniques recommended by researchers; data extraction techniques have been used to predict coronary heart disease, and the accuracy of each algorithm has been verified and reported. This section explains the supervised machine learning algorithms used in this work and shows the logic and internal procedures of the FFBCS methodology for developing a prediction model. The Firefly algorithm (FA) is a metaheuristic; in the proposed hybrid, the attributes refreshed by the Firefly step are fed as input to the Cuckoo Search step, exploiting the exploitative behavior of the Firefly algorithm (Fig. 10.2).
10.3.3.1 Solution representation
One of the biggest difficulties is representing the solution when selecting the best attributes for building the selection tree. A solution here encodes the effect of the Firefly algorithm: each firefly is a possible solution in the population. The Firefly algorithm starts by randomly generating the population of fireflies. The base population P is defined as

$$ P = A_d, \quad d = 1, 2, \ldots, n \qquad (10.4) $$

where n is the number of fireflies. The initial continuous position values are generated using Eq. (10.5):

$$ u_k = u_{min} + (u_{max} - u_{min}) \cdot r \qquad (10.5) $$

where u_min = 0, u_max = 1, and r is a uniform random number between 0 and 1.
Figure 10.2 Proposed Firefly with Binary Cuckoo Search (flowchart: evaluation phase → fitness evaluation → firefly updation → generating new cuckoo phase → fitness evaluation → updation phase → Binary Cuckoo Search updation → reject worst nest → stop).
10.3.3.2 Fitness evaluation
The fitness function is defined according to the purpose of this study. Here the objective is formulated as the minimization problem in Eq. (10.6):

$$ W(p) = \min \sum_{i=1}^{m} w(p_i)\, H_x(p_i) \qquad (10.6) $$

where H_x(p_i) is the entropy of each characteristic and w(p_i) is the weight by which that entropy is scaled.
10.3.3.3 Firefly updation
A firefly p is moved toward a more attractive (brighter) firefly q using Eq. (10.7) below; these are the positions the Firefly step updates:

$$ F_p' = F_p + \gamma(r)\,(F_q - F_p) + \varphi\,\Big(rand - \tfrac{1}{2}\Big) \qquad (10.7) $$

The second term is due to the attraction, and the third term is randomization with φ as a random scaling parameter; rand is a random number drawn uniformly from the 0 to 1 range.

$$ \text{Attractiveness:}\quad \gamma(r) = \gamma_0\, e^{-\theta r^{m}}, \quad m \ge 1 \qquad (10.8) $$

where r is the distance between the two fireflies, γ0 is the base attractiveness, and θ is the light absorption coefficient of the fireflies.

$$ \text{Distance:}\quad r_{pq} = \lVert F_p - F_q \rVert = \sqrt{\sum_{s=1}^{d} \big(F_{p,s} - F_{q,s}\big)^2} \qquad (10.9) $$

where F_{p,s} is the sth component of the spatial position of the pth firefly and d is the total number of dimensions. The index q ∈ {1, 2, ..., F_n} is chosen randomly; although q is picked at random, it must differ from p. Here F_n refers to the number of fireflies. To combine the two techniques, the positions refreshed by the Firefly step are handed to the Cuckoo Search strategy. The Cuckoo Search algorithm is a metaheuristic inspired by the brood parasitism of cuckoos, which lay their eggs in the nests of other birds. Each egg in a nest represents a solution, and a cuckoo egg represents a new solution; the goal is to replace the weaker eggs in the nests with new and potentially better eggs. In the simplest form, each nest holds one egg, and a fraction of the worst nests is abandoned in favor of new solutions, which strengthens the updating performance of the Cuckoo Search algorithm. A small sketch of the firefly move follows.
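As a concrete reading of Eqs. (10.7)–(10.9), a single firefly move could be sketched as follows; the parameter values (γ0, θ, m, φ) are assumed defaults, not the chapter's settings.

```python
# Sketch of one firefly move: firefly p moves toward the brighter firefly q.
import numpy as np

rng = np.random.default_rng(1)

def firefly_move(F_p, F_q, gamma0=1.0, theta_abs=1.0, m=2.0, phi=0.5):
    r = np.linalg.norm(F_p - F_q)                    # distance, Eq. (10.9)
    gamma_r = gamma0 * np.exp(-theta_abs * r ** m)   # attractiveness, Eq. (10.8)
    rand = rng.random(F_p.shape)
    return F_p + gamma_r * (F_q - F_p) + phi * (rand - 0.5)  # update, Eq. (10.7)

p, q = rng.random(5), rng.random(5)
print(firefly_move(p, q))
```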
10.3.3.4 Initialization phase
The population of host nests (m_i, where i = 1, 2, ..., n) is initialized randomly. Generating new cuckoo phase: a Lévy flight is used to select a cuckoo at random and to create new solutions; each new solution is then evaluated to determine its quality relative to the nests' current contents.
Fitness evaluation phase: assess the fitness function based on the objective, then choose the best solution:

$$ \text{fitness} = \max_i f(m_i) \qquad (10.10) $$
Updation phase: the base solution is changed by the Lévy flight. The quality of the new solution is examined, and a nest is selected at random; if the new solution is better than the previous solution in the selected nest, it replaces it, and otherwise the better previous solution is retained. The Lévy flight for the regular Cuckoo Search is:

$$ C_i(t+1) = C_i(t) + \alpha \oplus \mathrm{Levy}(\lambda) \qquad (10.11) $$

where t is the iteration index and α > 0 is the step-size scaling factor. The entry-wise product ⊕ is comparable to those used elsewhere, and x_i(t+1) represents the (t+1)th egg (feature) at nest (solution) i, with i = 1, 2, ..., C and t = 1, 2, ..., d. Because Lévy flights take occasional long steps, the CS algorithm is increasingly able to explore the search space. Here the solutions updated by the Firefly step serve as the input. With the typical CS, solutions move through continuous positions in the search space; the difference in the Binary Cuckoo Search is that the search space is modeled as a d-dimensional Boolean lattice, in which the solutions lie on the corners of a hypercube. Since the problem is to select or not select a given feature, a solution is represented as a binary vector, where 1 indicates that a feature is selected to compose the new dataset and 0 otherwise. To restrict the new solutions to binary values on the Boolean lattice, Eq. (10.12) is used:

$$ S\big(x_i(t+1)\big) = \frac{1}{1 + e^{-x_i(t)}} \qquad (10.12) $$

$$ x_i(t+1) = \begin{cases} 0 & \text{if } S < rand \\ 1 & \text{if } S > rand \end{cases} $$

Reject worst nest phase: the worst nests are discarded at this stage, and new ones are generated according to their fitness values. In this way the solutions are ranked, and the best solutions found so far are kept as the current optimal arrangement.
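A minimal sketch of this binary update is given below; it assumes Mantegna's algorithm for generating the Lévy steps and illustrative values for α and the Lévy exponent β, neither of which is specified in the chapter.

```python
# Binary Cuckoo Search update sketch: Levy flight plus sigmoid binarization.
import numpy as np
from math import gamma, sin, pi

rng = np.random.default_rng(2)

def levy_step(d, beta=1.5):
    """Mantegna's algorithm for Levy-distributed steps (assumed choice)."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, d)
    v = rng.normal(0.0, 1.0, d)
    return u / np.abs(v) ** (1 / beta)

def binary_cuckoo_update(x, alpha=0.01):
    x_new = x + alpha * levy_step(x.size)          # Levy flight, Eq. (10.11)
    s = 1.0 / (1.0 + np.exp(-x_new))               # sigmoid, Eq. (10.12)
    return (s > rng.random(x.size)).astype(int)    # 1 = feature selected, 0 = dropped

x = rng.normal(size=16)   # continuous scores for 16 candidate features, as in the FHS set
print(binary_cuckoo_update(x))
```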
Stopping criterion phase: this process is repeated until the maximum number of iterations is reached. The best solution found is then used as the selected feature subset for the classifier.
10.3.4 Dataset description
We prepared a subset of the FHS dataset, which can be accessed for free through the Framingham Heart Institute. The accessible section of the FHS dataset used here contains 4240 member records. The dataset is derived from a longitudinal survey of the residents of Framingham, Massachusetts; the study investigates the causes of CVD and is a cornerstone of public health epidemiology. The FHS focuses primarily on differentiating the risk factors that influence whether a person develops coronary heart disease. The dataset contains 16 unique attributes that influence coronary heart disease.
10.4 Result and discussion
True positive (TP): the model correctly predicts the positive class. True negative (TN): the model correctly predicts the negative class. False positive (FP): the model falsely predicts the positive class. False negative (FN): the model falsely predicts the negative class. Accuracy: the ratio between the number of correct predictions of the model and the total number of instances:

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} $$

Precision: the proportion of the people predicted to be at risk of developing CHD who truly are at risk:

$$ \mathrm{Precision} = \frac{TP}{TP + FP} $$

Recall/Sensitivity: in this chapter, recall measures the percentage of the individuals who were at risk of developing CHD and whom the algorithm predicted to be at risk:

$$ \mathrm{Recall} = \frac{TP}{TP + FN} $$
Specificity: measures the proportion of the people who are not at risk of coronary heart disease and whom the model predicts as not at risk:

$$ \mathrm{Specificity} = \frac{TN}{TN + FP} $$
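These four measures can be computed from a confusion matrix as in the short sketch below; the label vectors are placeholders, not the chapter's actual outputs.

```python
# Computing accuracy, precision, recall, and specificity from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # placeholder ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # placeholder predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
print(accuracy, precision, recall, specificity)
```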
The test results of the proposed work are presented as follows. The applicability of the technique to heart disease prediction is compared with other methods, and the results are presented below; the accompanying table and figure show the performance of the proposed work (Table 10.1; Fig. 10.3).
10.4.1 Comparative analysis
The FFBCS algorithm is run on the features of the FHS dataset and compared against the CNN and RF techniques.
Table 10.1 Proposed FFBCS evaluation measures.

Run | Accuracy | Precision | Recall | Specificity
1   | 96.676   | 98.675    | 94.353 | 98.575
2   | 95.674   | 97.667    | 89.564 | 97.454
3   | 93.685   | 98.996    | 88.464 | 98.564
4   | 96.275   | 96.886    | 88.464 | 96.565
5   | 95.243   | 98.564    | 89.906 | 96.667
Figure 10.3 Graphical representation of FFBCS evaluation measures (accuracy, precision, recall, and specificity across the five runs).
The proposed FFBCS treatment of the FHS dataset is analyzed against the alternative classifiers, and the recorded results are shown in Figs. 10.4–10.7.

Figure 10.4 Graph for comparison of proposed and existing accuracy measures (proposed FFBCS vs. CNN vs. RF).
Figure 10.5 Graph for precision comparison of the proposed and existing methods (proposed FFBCS vs. CNN vs. RF).
Figure 10.6 Graph for recall measures comparison of the proposed and existing methods (proposed FFBCS vs. CNN vs. RF).
The evaluation measures of the various classifiers for predicting CVDs are shown in Tables 10.2–10.5. The experimental results compare the proposed FFBCS classifier for CVD prognosis against the existing classifiers. The learning ability and selectivity of FFBCS yield better specificity, accuracy, and recall. The high accuracy rates establish the feasibility of the technique for heart disease, and the resulting predictions are very precise and accurate.

Figure 10.7 Graph for specificity comparison of the proposed and existing methods (proposed FFBCS vs. CNN vs. RF).
Table 10.2 Comparison of proposed and existing accuracy measures.

Run | Proposed FFBCS | CNN    | RF
1   | 96.676         | 90.675 | 88.674
2   | 95.674         | 89.786 | 89.363
3   | 93.685         | 88.786 | 87.676
4   | 96.275         | 89.574 | 90.786
5   | 95.243         | 90.786 | 89.678
Table 10.3 Comparison of proposed and existing precision measures.

Run | Proposed FFBCS | CNN    | RF
1   | 98.675         | 90.674 | 86.667
2   | 97.667         | 91.674 | 88.675
3   | 98.996         | 90.673 | 89.385
4   | 96.886         | 90.556 | 90.674
5   | 98.564         | 91.445 | 90.563
Table 10.4 Comparison of proposed and existing recall measures.

Run | Proposed FFBCS | CNN    | RF
1   | 94.353         | 88.676 | 87.786
2   | 89.564         | 83.786 | 82.778
3   | 88.464         | 80.676 | 81.445
4   | 88.464         | 83.897 | 82.675
5   | 89.906         | 80.675 | 81.897
Table 10.5 Comparison of proposed and existing specificity measures.

Run | Proposed FFBCS | CNN    | RF
1   | 98.575         | 90.675 | 93.786
2   | 97.454         | 91.667 | 93.996
3   | 98.564         | 90.559 | 93.786
4   | 96.565         | 92.575 | 93.786
5   | 96.667         | 92.786 | 94.786
The prediction of heart disease is therefore very reliable, and the accuracy and correctness percentages with respect to the FP and FN rates improved.
10.5 Conclusion
This chapter proposes a new classification system based on the FFBCS (Firefly—Binary Cuckoo Search) method for diagnosing heart disease. The main novelty lies in the proposed approach: a combination of the FFBCS and BFO methods for the efficient and fast classification of heart disease problems. The FFBCS classification system comprises two subsystems: the BFO feature selection subsystem and the classification subsystem. The FHS (heart) database was selected from the machine learning repository to test the system. FFBCS outperforms the compared general classifiers in terms of accuracy, sensitivity, and specificity, and the efficiency of the proposed system is superior to the procedures in the literature. Based on the empirical analyses, the results of the proposed classification system can serve as a promising alternative instrument in clinical decision-making for the diagnosis of heart disease.
Other machine learning techniques, such as deep learning, association rules, and genetic mechanisms, will be investigated in future work to further improve accuracy under the defined performance parameters.
References
[1] K. Gomathi, D. Shanmugapriyaa, Heart disease prediction using data mining classification, International Journal for Research in Applied Science & Engineering Technology, Vol. 4, February 2016, pp. 59–63.
[2] S. Suguna, S. Sakunthala, S. Sanjana, S.S. Sanjhana, A survey on prediction of heart diseases using big data algorithms, International Journal of Advanced Research in Computer Engineering & Technology, Vol. 6, March 2017, pp. 371–378.
[3] C.A. Alexander, L. Wang, Big data analytics in heart attack prediction, Journal of Nursing & Care, Vol. 6, 2017, pp. 1–9.
[4] H.A. Esfahani, M. Ghazanfari, Cardiovascular disease detection using a new ensemble classifier, in: 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, 2017, pp. 1011–1014.
[5] J. Patel, T. Upadhyay, S. Patel, Heart disease prediction using machine learning and data mining technique, IJCSC, Vol. 7, March 2016, pp. 129–137.
[6] H. Benjamin, F. David, S. Antony Belcy, Heart disease prediction using data mining techniques, ICTACT Journal on Soft Computing, Vol. 9, October 2018, pp. 1817–1823.
[7] K. Purushottam, K. Saxena, R. Sharma, Efficient heart disease prediction system using decision tree, in: International Conference on Computing, Communication & Automation, Noida, 2015, pp. 72–77.
[8] S.H. Han, K.O. Kim, E.J. Cha, K.A. Kim, H.S. Shon, System framework for cardiovascular disease prediction based on big data technology, MDPI, 2017, pp. 1–10.
[9] G. Thangarasu, K. Subramanian, P.D.D. Dominic, An integrated architecture for prediction of heart disease from the medical database, in: 2018 4th International Conference on Computer and Information Sciences (ICCOINS), Kuala Lumpur, 2018, pp. 1–5.
[10] N. Singh, S. Jindal, Heart disease prediction using classification and feature selection techniques, International Journal of Advance Research, Ideas and Innovations in Technology, Vol. 4, 2018, pp. 1124–1127.
[11] S. Xu, Z. Zhang, D. Wang, J. Hu, X. Duan, T. Zhu, Cardiovascular risk prediction method based on CFS subset evaluation and random forest classification framework, in: 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), Beijing, 2017, pp. 228–232.
[12] B.V. Baiju, R.J. Remy Janet, A survey on heart disease diagnosis and prediction using Naive Bayes in data mining, International Journal of Current Engineering and Technology, Vol. 5, April 2015, pp. 1034–1038.
[13] Y. Dhanushree, N. Indrani, N. Santosh, R. Vidya Vani, J. Rajeshwari, Heart disease prediction by machine learning, International Journal of Engineering Research and Development, Vol. 14, June 2018, pp. 1–6.
[14] P. Anandajayam, C. Krishnakoumar, S. Vikneshvaran, B. Suryanaraynan, Coronary heart disease predictive decision scheme using big data and RNN, in: 2019 IEEE International Conference on System, Computation, Automation and Networking (ICSCAN), Pondicherry, India, 2019, pp. 1–6.
[15] D. Krishnani, A. Kumari, A. Dewangan, A. Singh, N.S. Naik, Prediction of coronary heart disease using supervised machine learning algorithms, in: TENCON 2019 – 2019 IEEE Region 10 Conference (TENCON), Kochi, India, 2019, pp. 367–372.
[16] S.A. Yadwad, P. Praveen Kumar, Prediction of heart disease using Hadoop MapReduce, International Journal of Computer Application, Vol. 6, December 2016, pp. 1–8.
[17] M. Bochicchio, A. Cuzzocrea, L. Vaira, A big data analytics framework for supporting multidimensional mining over big healthcare data, in: 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, CA, 2016, pp. 508–513.
[18] T. Nagamani, S. Logeswari, B. Gomathy, Heart disease prediction using data mining with MapReduce algorithm, International Journal of Innovative Technology and Exploring Engineering, Vol. 8, January 2019, pp. 137–140.
[19] R. Chen, et al., 3-year risk prediction of coronary heart disease in hypertension patients: a preliminary study, in: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Seogwipo, 2017, pp. 1182–1185.
11 Hybrid technique for heart diseases diagnosis based on convolution neural network and long short-term memory
Abdelmegeid Amin Ali 1, Hassan Shaban Hassan 1, Eman M. Anwar 2 and Ashish Khanna 3
1 Faculty of Computer and Information, Department of Computer Science, Minia University, Egypt; 2 Faculty of Computer and Information, Department of Information System, Minia University, Egypt; 3 Maharaja Agrasen Institute of Technology
Abstract
Heart failure-related malfunctioning is a leading cause of death worldwide, since it is very difficult to determine the cause of heart malfunction from symptoms alone; as far as medical science is concerned, its detection requires a great deal of experience and knowledge. Therefore, in the presented work a technique is suggested that predicts cardiac malfunctioning. Owing to advances in data science, scientists and medical professionals are strongly interested in developing an automated cardiac malfunction prediction system, as it can be highly accurate, efficient, cost-effective, and very helpful in early diagnosis. In this study, a hybrid deep neural network takes a dataset with 14 features as input and is trained with a hybrid of convolution neural network (CNN) and long short-term memory (LSTM) algorithms to predict the presence or absence of disease in patients, with the highest accuracy reaching 0.937 (93.7%). The results of the study showed that the CNN–LSTM hybrid model had the best accuracy, recall, precision, F1 score, and AUC compared with the other techniques.
Keywords: Classification; CNN; RNN; deep learning; heart diseases
11.1 Introduction
11.1.1 Heart disease
Heart disease is one of the most serious diseases: a huge number of people affected by it have died. Because of the significant risk of cardiac disease, avoidable medical risk should be eliminated. Even with improvements in medical treatment, cardiac failure mortality certainly remains huge [1,2]; many aspects of health are unpredictable, so the danger of death, cardiac arrest, heart attack, and stroke should be detected before it is too late. According to the Centers for Disease Control and Prevention (CDC), cardiac failure ranks first on the list of causes of death, while stroke ranks fourth.
11.1.2 Traditional ways
It is common practice for physicians to check blood pressure, body temperature changes, and heart rate at the time of a blood test [3]. Data therefore need to be collected much more frequently to form a more accurate representation of a person's health. Extracting valuable information from such collections in the context of medical knowledge has become difficult because of the size of the data. The manual approach is known to be inefficient compared with artificial intelligence, making AI the best strategy for predictive medical studies [4,5]. There is a persistent need for a highly accurate system that serves as an analysis tool to detect the hidden patterns of heart disease in medical data and to predict heart attacks before they occur. This is expected to lead to better management of heart attacks.
11.1.3 The classification techniques
For predicting cardiac disease, different classification techniques are used on, for example, the Cleveland dataset: the backpropagation neural network (BPNN) and logistic regression (LR) [6,7]. Other techniques such as Naive Bayes (NB), decision tree (DT), and k-nearest neighbor (K-NN) have also been used to predict cardiac diseases [8]. Several experimental methods for treating and predicting cardiac disease have recently been established. The hybrid method consists of two principal phases: the first phase is feature selection, which chooses a subset of features that is then used in the second phase as the training set for building a classification algorithm.
For example, in Ref. [9], a cardiovascular disease diagnostic system was designed by combining rough set-based attribute reduction with interval type-2 fuzzy logic. Furthermore, a hybrid decision tree and ANN technique for predicting cardiac disease was developed in Ref. [10]. Nevertheless, the large amounts of historical data, as well as the continuous streams of data produced by healthcare providers, have become an enormous challenge to process, store, and analyze using conventional database storage and machine learning methods alone (Fig. 11.1). Machine learning consists of four types. Supervised learning aims to learn patterns from a set of inputs and target outputs so that input data points can be automatically mapped to their correct target output. Unsupervised learning attempts to discover discriminating features automatically, without any knowledge of what the inputs are. Semisupervised learning sits between supervised and unsupervised learning, because it uses both labeled and unlabeled data to train, usually a large amount of unlabeled data and a small amount of labeled data. Reinforcement learning is a tool for an agent engaging with a complex world in which it has to achieve a particular objective, such as playing a game against an opponent or driving a car [11].
Figure 11.1 Machine learning types.
Previous cardiac prediction studies centered primarily on predicting cardiac disease from previously stored data with conventional machine learning techniques. In this study, the problem of predicting cardiac disease is addressed using data collected from the UCI machine learning repository. The main aim is to create a model using the Cleveland cardiac disease dataset that achieves high precision and then to use this model to identify a cardiac indication as normal or not. Among feature selection algorithms, linear discriminant analysis (LDA) feature selection was used to select features for traditional machine learning techniques such as DT, support vector machine (SVM), random forest (RF), NN, and LR. Deep learning methods such as long short-term memory (LSTM), convolution neural network (CNN), and the hybrid CNN–LSTM select features automatically; they were applied to all features and to subsets of the selected features, and k-fold cross-validation was used to increase accuracy. The remainder of this chapter is structured as follows: related work is reviewed in Section 11.2; the proposed system structure and methodology are described in Section 11.3; experimental results and the comparison between classification methods are discussed in Section 11.4; and the chapter is concluded in Section 11.5.
11.2 Literature review
Several methods and algorithms have been employed to predict cardiac disease. Producing results at speed is especially important when dealing with large datasets as they occur in real applications such as medicine. Genetic programming (GP) was initially developed as an evolutionary tool for breeding programs; an efficient algorithm has been established for identifying introns in linear genetic programs, and eliminating introns before any fitness test results in a substantial reduction in runtime [12]. A classification method using a multilayer perceptron with the backpropagation learning algorithm on the UCI dataset reached 80.99% accuracy with eight features [13]. Two methods for cardiac disease diagnosis are the ANN and ANFIS, with the ANN achieving an average accuracy of 87.04% [14]. NNs are widely used in a variety of areas, including neuroscience, diagnostics, and forecasting, and supporting medical decision making is one field of growing research interest. For each classifier, the confusion matrix is used to identify the types of errors it produces.
A distinction between supervised (MLP/RBF) classifiers and unsupervised (SOM) classifiers helps to define more suitable classifications for patients [15]. ANNs have also been applied to summarize and classify the arrhythmia forms visible in the ECG; early and reliable detection of arrhythmia is important for identifying cardiac disease and selecting suitable care, and multiple classifiers are available for ECG characterization [16]. AdaBoost and CNN are used for training on cardiac phonocardiogram (PCG) cycles, which are decomposed into four frequency bands; an ensemble combining the AdaBoost and CNN classifiers labels heart sounds as normal or abnormal [17]. Myocardial infarction detection from conventional 12-lead ECG data relies on a CNN operating with an end-to-end architecture; an adaptable DL model was developed that can distinguish MI forms across all lead signals [18]. Two major deep learning methods for diagnosing cardiac disease are multiple kernel learning with an adaptive neuro-fuzzy inference system (MKL with ANFIS). The MKL procedure is used to differentiate parameters between cardiovascular disease patients and normal people before the ANFIS classifier is applied; results are obtained from the MKL method to classify cardiac disease patients and healthy people [19]. The online sequential extreme learning machine (OSELM) classification technique has been applied in a recognition system for detecting and classifying the pulses in ECG signals, with the pulses characterized using the wavelet transform (WT) [20]. Diagnosis of normal sinus rhythm, atrial premature beat (APB), right bundle branch block (RBBB), left bundle branch block (LBBB), and premature ventricular contraction (PVC) on ECG signals has been performed with a hybrid CNN–LSTM [21]. Chen et al. [22] introduced an approach to predict cardiac disease using vector quantization, one of the AI techniques used for characterization and forecasting. Artificial neural networks (ANNs) are used for the classification; the NN is trained using backpropagation to test the method for predicting cardiac disease. The system's cardiac disease prediction on the test set achieves approximately 80% accuracy.
11.3 The proposed technique
The experimental method was created to distinguish people with cardiac disease from healthy people.
The features were used to test the efficiency of various predictive ML techniques for the diagnosis of cardiac disease, and validity and efficiency assessment metrics were calculated for the model. The proposed technique for classifying cardiac disease uses the hybrid of CNN and LSTM, as shown in Fig. 11.2. First, the cardiac disease data are preprocessed to identify them and segment them into smaller parts. The proposed (deep learning) classification technique is provided with features extracted from those segments, and the classification is either normal or abnormal, which helps detect the heart condition of a patient. The steps of the proposed approach are explained in detail as follows.
11.3.1 Preprocessing data
First, the data are split into two parts: the independent values, which are considered the features X, and a dependent value, which is the target Y. A categorical encoder (dummy variables) then converts string values into integer values. Feature scaling standardizes the different magnitudes and dynamic ranges of the various values in the dataset and has been shown to increase classification efficiency; we shifted and scaled the data to between 0 and 1 for all records. Features are selected by the deep learning methods themselves, so no handcrafted features have to be extracted or selected, because DL methods extract the relevant features automatically.
Figure 11.2 The proposed technique (CNN_LSTM) methodology.
The dataset is separated into training and testing data by cross-validation (k-fold); ten-fold cross-validation has been implemented for the classifiers. A small sketch of this routine follows.
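A small sketch of this split/scale/cross-validate routine might look as follows; the array shapes are placeholders, and the use of StratifiedKFold is one reasonable choice rather than the authors' stated implementation.

```python
# Sketch: scale features to [0, 1] and split with ten-fold cross-validation.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(303, 14)           # placeholder: 303 records, 14 features
y = np.random.randint(0, 2, 303)      # placeholder target (0 = normal, 1 = disease)

X = MinMaxScaler().fit_transform(X)   # shift and scale all records to [0, 1]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # ten-fold CV
for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ... fit the chosen classifier on (X_train, y_train) and score on (X_test, y_test)
```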
11.3.2 Building classifier model
In this section, the classifier algorithms used for comparison with the proposed method are introduced. SVM uses a set of mathematical functions known as kernels; the kernel function takes data as input and transforms it into the required form, and SVMs can be used for both regression and classification challenges [23]. K-NN is used to solve regression and classification problems; in classification, the output is determined by the classes of the nearest neighbors [24]. DT divides the training data into smaller parts to obtain patterns used for classification; the knowledge is then encoded in a tree form that can be easily understood [25]. RF is made up of a collection of tree-structured classifiers that vote for the most popular class at input x after a large number of trees are formed; the generalization error of a forest of tree classifiers depends on the strength of the individual trees and the correlation between them [26]. NB uses Bayes' formula to measure a class's posterior probability; the model's goal is to make predictions by assigning each object to a class [27]. ANN is a large collection of simple interconnected units that execute a larger general task in parallel; these units have a learning mechanism that automatically updates network parameters in response to a potentially changing input context, and the units are simplified models of the biological neurons found in the animal brain [28]. Two further concepts are significant: hierarchical knowledge collection by the CNN. A CNN essentially looks at one input region at a time, maps it to some output, and repeats this process for each input region. By stacking a sequence of convolutions one after another, the network learns hierarchies: each subsequent layer is a convolution over the values of the previous layer, so features of the later layers in a CNN capture increasingly higher-level structure. Recurrent neural networks capture sequential order. One RNN variant is the LSTM, a technique that selectively decides what to "remember" in a sequence and what to "forget," that is, which feedback will influence the neuron state and which will not; the neuron learns a sequence using this "memory" and estimates the corresponding values. Combining CNN and RNN in our model therefore brings the benefits of both. A baseline comparison of the traditional classifiers can be set up as in the sketch below.
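As a sketch of such a baseline comparison using scikit-learn with ten-fold cross-validated accuracy (X and y stand in for the preprocessed feature matrix and target of Section 11.3.1; the hyperparameters shown are assumptions):

```python
# Baseline comparison of the traditional classifiers named above.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(303, 14)           # placeholder feature matrix
y = np.random.randint(0, 2, 303)      # placeholder target

models = {
    "SVM": SVC(kernel="rbf"),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "NB": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```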
Pseudocode of the LSTM algorithm:
Step 1: Identify the data parameters.
Step 2: Build the model for the parameters.
Step 3: Upload and save the data.
Step 4: Clean the data, handle missing values, encode categorical variables, and scale the features.
Step 5: Perform cross-validation on the data using k-fold cross-validation.
Step 6: Train the data for the classifier.
Step 7: Apply the LSTM classifier.
Step 8: Repeat steps 5 to 7 until the end of the training data.

Pseudocode of the CNN algorithm:
Step 1: Determine the parameters to be used.
Step 2: Build the model for the parameters.
Step 3: Upload and save the dataset.
Step 4: Clean the data, handle missing values, encode categorical variables, and scale the features.
Step 5: Perform cross-validation on the data using k-fold cross-validation.
Step 6: Train the data for the classifier.
Step 7: Apply the CNN method.
Step 8: Repeat steps 5 to 7 until the end of the training data.

A model is constructed as a new hybrid classifier combining the CNN and LSTM classification techniques, called "CNN_LSTM." The algorithm was trained for at least 100 epochs on 227 records with a batch size of 10, and early stopping was applied when the validation loss did not improve over 100 epochs. Fig. 11.3 illustrates the architecture of the convolution neural network and long short-term memory network. The pseudocode of the hybrid CNN_LSTM algorithm comprises the following steps:
Step 1: Determine the parameters to be used.
Step 2: Build the model for the parameters.
Step 3: Upload and save the dataset.
Step 4: Clean the data, handle missing values, explore the data, encode categorical variables, and scale the features.
Step 5: Apply k-fold cross-validation to the dataset.
Step 6: Train the data for the classifier.
Step 7: Apply the CNN method for classifying the data.
Step 8: Apply the LSTM algorithm as a hidden layer for classifying the data.
Step 9: The dense layer classifies the output results using the sigmoid function to enhance the results.
Step 10: Repeat steps 5 to 9 until the end of the training data (Fig. 11.4).

Figure 11.3 CNN_LSTM architecture (input: data of heart diseases).
Figure 11.4 Flowchart steps of the hybrid CNN_LSTM algorithm.
The hybrid CNN_LSTM structure applies CNN layers with a max-pooling layer at the front end and then flattens the output to feed it into the LSTM layers. The model has two hidden LSTM layers followed by a dense layer to collect the data; the rectifier ("relu") activation function is used on the hidden layers, and the sigmoid function in the output layer keeps the network output between 0 and 1. The first layer has nine neurons and expects 16 input variables; the other hidden layers also have nine neurons. Finally, there is one neuron in the output layer to determine the class (cardiac disease onset or not). The model is compiled for a binary classification problem with the "binary cross-entropy" loss, and the efficient gradient descent algorithm "Adam" is used.
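A minimal Keras sketch of this layout is shown below. The layer sizes follow the text where stated (two hidden LSTM layers of nine units, a nine-unit dense layer with relu, a single sigmoid output, Adam with binary cross-entropy, batch size 10, early stopping on validation loss); the convolution filter count, kernel size, and pooling size are assumptions, and the pooled feature maps are passed directly into the LSTM stack, since an LSTM consumes the sequence without an explicit flatten.

```python
# Hedged sketch of a CNN_LSTM hybrid, not the authors' exact configuration.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_features = 16                                   # input variables, as stated in the text

model = Sequential([
    Conv1D(32, kernel_size=3, activation="relu",
           padding="same", input_shape=(n_features, 1)),   # CNN front end (assumed sizes)
    MaxPooling1D(pool_size=2),                              # max pooling
    LSTM(9, return_sequences=True),                         # first hidden LSTM layer
    LSTM(9),                                                # second hidden LSTM layer
    Dense(9, activation="relu"),                            # dense layer with relu
    Dense(1, activation="sigmoid"),                         # one output neuron in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(227, n_features, 1)            # placeholder data (227 records)
y = np.random.randint(0, 2, 227)
stop = EarlyStopping(monitor="val_loss", patience=100)
model.fit(X, y, epochs=100, batch_size=10, validation_split=0.2,
          callbacks=[stop], verbose=0)
```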
11.4 Experimental results and discussion
This section explains the parameter settings, the definition of the datasets, the results of the study, and the discussion. Dataset description: the first cardiac dataset was chosen from the UCI machine learning repository ("https://www.kaggle.com/ronitf/heart-disease-uci"). This database has 14 features. The primary target attribute indicates whether the diagnosed patient has heart disease, as an integer value (0 or 1): the value 0 refers to a disease-free person, while the value 1 refers to a person having the disease. In this heart dataset of 303 patients, 138 negative diagnosis cases and 165 positive diagnosis cases were recorded. The second dataset was extracted from the UCI machine learning repository (processed.cleveland.data), via the same website; it includes 303 patient records with 14 features, and the integer target ranges from 0 (normal) to 1 (abnormal). The negative diagnoses numbered 164, while the positive diagnoses numbered 139. All classifiers were trained on a workstation with a 4.0 GHz Core i3 CPU and 4 GB of memory using Python (Table 11.1). In the datasets, patients in the age group 29–79 were selected. A sex value of 1 denotes male patients, and a sex value of 0 denotes female patients. Four types of chest pain can be seen as indicators of cardiac disease. Typical angina (type 1) is caused by decreased blood flow to the cardiac muscles because of narrowed coronary arteries. Atypical angina (type 2) is a pain in the chest that happens during mental or emotional stress. Non-anginal chest pain (type 3) may occur for various reasons and may not always be the result of actual cardiac failure.
Table 11.1 Description of the dataset attributes.

Variable name | Description
Age      | Age in years
Sex      | Sex: 1 = male, 0 = female
CP       | Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
Trestbps | Resting blood pressure
Chol     | Serum cholesterol in mg/dl
Fbs      | Fasting blood sugar larger than 120 mg/dl (1 = true)
Restecg  | Resting electrocardiographic results (1 = abnormality, 0 = normal)
Thalach  | Maximum heart rate achieved
Exang    | Exercise-induced angina (1 = yes)
Oldpeak  | ST depression induced by exercise relative to rest
Slope    | The slope of the peak exercise ST segment
CA       | Number of major vessels
Thal     | No explanation provided, but probably thalassemia
Num      | Diagnosis of heart disease (angiographic disease status): 0 (<50% diameter narrowing), 1 (>50% diameter narrowing)
The fourth, asymptomatic type may not be a sign of cardiac disease at all. The next feature, trestbps, is the resting blood pressure measurement; chol is the cholesterol level. FBS is the fasting blood sugar: the value is recorded as 1 if the fasting blood sugar exceeds 120 mg/dl and 0 otherwise. Restecg is the resting electrocardiographic result, thalach is the maximum heart rate, and exang, exercise-induced angina, is recorded as 1 when pain occurs and 0 when it does not. Oldpeak is the ST depression induced by exercise, slope is the slope of the peak exercise ST segment, and CA is the number of major vessels colored by fluoroscopy. Thal is described as the stress test duration in minutes, and num is the class feature, rated 0 for patients diagnosed as normal and 1 for those diagnosed with cardiac disease (Figs. 11.5–11.8).
11.4.1 Evaluation criteria
Evaluation is an important step used to assess the performance of the trained model on the testing dataset. The model was evaluated with different metrics: accuracy, recall, precision, F-measure, ROC, and AUC.
Figure 11.5 Heart diseases distribution.
Figure 11.6 Heart diseases by chest pain type.
Figure 11.7 Heart diseases by gender.
Figure 11.8 Heart diseases distribution by age.
Accuracy (Acc) is one of the classification performance measures most frequently used. It is described as the percentage of the samples accurately classified to the total sample number, as indicated in Eq. (11.1). Accuracy 5
TN 1 TP FP 1 FN 1 TP 1 TN
ð11:1Þ
where TP, FP, TN, and FN denote the numbers of true positive, false positive, true negative, and false negative samples, respectively. Recall is defined as the percentage of correctly classified positive samples out of the total number of positive samples, as shown in Eq. (11.2).

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11.2}$$
Precision is defined as the percentage of correctly classified positive samples out of the total number of predicted positive samples, as shown in Eq. (11.3).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11.3}$$
F-measure, also referred to as the F1-score, is the harmonic mean of precision and recall, as in Eq. (11.4). Its value ranges from zero to one, and high F-measure values imply high classification efficiency.

$$F\text{-}\mathrm{measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{11.4}$$
The receiver operating characteristics (ROC) curve is a two-dimensional diagram in which the y-axis is the recall (true positive rate) and the x-axis is the false positive rate (FPR). The ROC curve has been used to evaluate numerous systems, including diagnosis and treatment systems for medical decision-making, and machine learning [24]. The area under the ROC curve (AUC) measures the area below the curve; its value always lies between zero and one, and an AUC below 0.5 indicates a classifier of no practical value.
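The sketch below shows how the metrics of Eqs. (11.1)–(11.4) and the ROC/AUC can be computed with scikit-learn. The label and score vectors here are hypothetical placeholders, not values from the study.

```python
# A minimal sketch of the evaluation metrics described above.
# y_true and y_score are hypothetical test labels and predicted probabilities.
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score, roc_curve)

y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_score = [0.1, 0.6, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2, 0.95, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))   # Eq. (11.1)
print("Recall   :", recall_score(y_true, y_pred))     # Eq. (11.2)
print("Precision:", precision_score(y_true, y_pred))  # Eq. (11.3)
print("F-measure:", f1_score(y_true, y_pred))         # Eq. (11.4)
print("AUC      :", roc_auc_score(y_true, y_score))   # area under the ROC curve
fpr, tpr, _ = roc_curve(y_true, y_score)              # points for plotting the ROC
```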
11.5 Results analysis and discussion
To evaluate the efficiency of the classification algorithms, we compared them using the accuracy, recall, precision, F-measure, ROC, and AUC metrics.
Table 11.2 Results comparison between CNN, LSTM, and CNN_LSTM algorithms on dataset 1.

Algorithm   Accuracy  Recall  Precision  F1 score  AUC
CNN         0.881     0.95    0.82       0.88      0.88
LSTM        0.92      0.94    0.89       0.92      0.92
CNN_LSTM    0.937     0.97    0.91       0.94      0.936
Table 11.3 Comparison between CNN, LSTM, and hyper CNN_LSTM algorithms on dataset 2.

Algorithm   Accuracy  Recall  Precision  F1 score  AUC
CNN         0.868     0.882   0.882      0.882     0.87
LSTM        0.885     0.94    0.864      0.90      0.88
CNN_LSTM    0.889     0.91    0.885      0.89      0.88
The results obtained from the comparison are reported and discussed below for each dataset.
11.5.1 Scenario 1
11.5.1.1 Dataset 1
Dataset 1 was used to obtain the accuracy, recall, precision, F1 score, and AUC percentages for predicting heart disease, as shown in Table 11.2. These metrics were computed for three algorithms: CNN, LSTM, and the proposed CNN_LSTM. The new CNN_LSTM achieved the highest accuracy at 0.937, while CNN scored the lowest accuracy at 0.881, as shown in Table 11.2 and Fig. 11.9. The following figures are ROC graphs for the positive class under the three algorithms. CNN_LSTM achieved the highest value at 0.936, generates a very low error rate, and shows a good trade-off between FPR and TPR.
ROC curves: (a) ROC_AUC CNN; (b) ROC_AUC LSTM; (c) ROC_AUC CNN_LSTM.
Figure 11.9 Accuracy, recall, precision, F1 score, and AUC of the CNN, LSTM, and proposed CNN_LSTM algorithms (dataset 1).
Figure 11.10 Accuracy, recall, precision, F1 score, and AUC of the CNN, LSTM, and hyper CNN_LSTM algorithms (dataset 2).
11.5.1.2 Dataset 2
Using dataset 2, Table 11.3 and Fig. 11.10 compare the accuracy, recall, precision, F1 score, and AUC recorded by the CNN, LSTM, and hyper CNN_LSTM algorithms. The results show that CNN_LSTM achieved the highest accuracy at 0.889, while CNN scored the lowest accuracy at 0.868. The following figures are ROC graphs for the positive class under the three algorithms; CNN_LSTM and LSTM achieved the highest value at 0.88, with a very low error rate and a good trade-off between FPR and TPR.
ROC curves: (a) ROC_AUC CNN_LSTM; (b) ROC_AUC CNN; (c) ROC_AUC LSTM.
On the second dataset, the criteria are again accuracy, recall, precision, F1 score, and AUC. CNN scored 0.868, 0.882, 0.882, 0.882, and 0.87, respectively; LSTM scored 0.885, 0.94, 0.864, 0.90, and 0.88; and hyper CNN_LSTM scored 0.889, 0.91, 0.885, 0.89, and 0.88. The hyper CNN_LSTM results are therefore the best of the three methods.
11.5.1.3 Scenario 2
This scenario compares the K-NN, SVM, DT, RF, NB, and ANN algorithms on the 14-feature dataset, reporting the accuracy, recall, precision, F1 score, and AUC percentages for predicting heart disease, as shown in Table 11.4. The comparison indicates that ANN achieved the highest accuracy at 90% and DT registered the lowest at 86%, while K-NN, SVM, RF, and NB scored similar accuracies at 88%, 89%, 88%, and 89%, respectively, as shown in Table 11.4 and Fig. 11.11. An implementation sketch of this comparison follows the table.

Table 11.4 Comparison results between different algorithms.

Method  Accuracy  Recall  Precision  F1 score  AUC
K-NN    0.88      0.95    0.84       0.89      0.876
SVM     0.89      0.95    0.86       0.90      0.89
DT      0.86      0.87    0.87       0.87      0.868
RF      0.88      0.92    0.85       0.85      0.88
NB      0.89      0.95    0.86       0.90      0.89
ANN     0.90      0.91    0.88       0.86      0.90
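A hedged sketch of this Scenario 2 comparison in scikit-learn is shown below. The file name heart.csv, the train/test split, and all hyperparameters are assumptions; the chapter does not specify its exact settings, so reported numbers will differ.

```python
# A sketch of comparing K-NN, SVM, DT, RF, NB, and ANN on the 14-feature
# heart dataset. "heart.csv" and the hyperparameters are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(probability=True),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "NB": GaussianNB(),
    "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model)  # scaling matters for K-NN, SVM, ANN
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
```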
Figure 11.11 Comparison of accuracy, recall, precision, F1 score, and AUC across the K-NN, SVM, decision tree, random forest, naïve Bayes, and ANN methods.
The following figure shows the ROC graphs for the positive class under all six algorithms. ANN achieved the highest ROC value at 0.90, generates a lower error rate, and shows a good trade-off between FPR and TPR.
ROC curves: (a) ROC_AUC SVM; (b) ROC_AUC RF; (c) ROC_AUC NB; (d) ROC_AUC K-NN; (e) ROC_AUC DT; (f) ROC_AUC ANN.
Classification methods such as CNN, LSTM, K-NN, SVM, DT, RF, and NB each produce reasonably accurate results on their own. A hybrid algorithm combining CNN and LSTM was therefore proposed to improve the accuracy, recall, precision, and AUC, as illustrated in Table 11.2 for Scenario 1; an illustrative sketch of such a hybrid appears below. Considering the scenarios on datasets 1 and 2 together with the previous work shown in Table 11.5, and comparing the accuracy, recall, precision, F1 score, and AUC of the nine algorithms (CNN, LSTM, CNN_LSTM, K-NN, SVM, DT, RF, NB, and ANN), the proposed hyper CNN_LSTM obtained the highest accuracy at 0.937. In contrast, DT scored the lowest accuracy at 0.86. Hyper CNN_LSTM also gives the best results for F1 score at 0.94, precision at 0.91, and recall at 0.97 compared to the other algorithms. The comparison between the two datasets and the previous study shows that the first dataset gave better results than the second.
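For orientation, the following Keras sketch shows one way such a hybrid CNN_LSTM binary classifier can be assembled. This is an illustration under stated assumptions, not the authors' exact architecture: the layer sizes are assumptions, and the 13 input features are treated as a length-13 sequence so that Conv1D layers extract local patterns and an LSTM models their interactions, with binary cross-entropy as the loss, as in the chapter.

```python
# An illustrative hybrid CNN_LSTM sketch (assumed layer sizes, not the
# authors' exact model): Conv1D feature extraction, LSTM sequence modeling,
# and a sigmoid output giving the heart-disease probability.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(13, 1)),            # 13 features, one channel
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.LSTM(64),                        # sequential modeling of CNN feature maps
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(heart disease)
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",   # binary cross-entropy, as in the chapter
              metrics=["accuracy", tf.keras.metrics.AUC()])
model.summary()
# Training would reshape the feature matrix to (n_samples, 13, 1), e.g.:
# model.fit(X_train.reshape(-1, 13, 1), y_train, epochs=100, validation_split=0.1)
```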
11.6 Conclusion
Early diagnosis of heart disease is important to reduce the risk of death. An accurate diagnosis system supports early diagnosis, so patients can receive early treatment. In this study, a predictive machine-learning system for diagnosing cardiac disease was proposed.
Table 11.5 Comparison results.

Publication                    Compared algorithms                              Best accuracy
Elshazly et al. [29]           Genetic algorithm (GA)                           83.1%
Milan and Sunila [30]          ANN, decision tree, and SVM                      84.12%
Banu and Gomathy [31]          K-means                                          89%
Syed Umar Amin et al. [32]     SVM                                              89%
Vadicherla and Sonawane [33]   RIPPER, decision tree, ANNs, SVM                 84.12%
Taneja [34]                    Naïve Bayes, decision tree (DT), neural network  89%
Liu et al. [35]                RFRS                                             92.59%
Abushariah et al. [36]         ANN, ANFIS                                       87.04%
Proposed                       Hyper CNN_LSTM                                   93.70%
The models were trained on the Cleveland cardiac disease dataset and the heart dataset using classifier algorithms such as SVM, K-NN, NB, RF, LSTM, CNN, and DT. This chapter proposed a system that uses a combination of CNN and LSTM techniques for the detection of heart disease. The proposed technique achieves higher classification accuracy, recall, precision, F1 score, and AUC than the other techniques. Future studies will focus on using metaheuristic algorithms to improve the results further.
References
[1] A. Fukushima, G.D. Lopaschuk, Acetylation control of cardiac fatty acid β-oxidation and energy metabolism in obesity, diabetes, and heart failure, Biochim. Biophys. Acta 1862 (12) (2016) 2211–2220.
[2] V. Jayaraman, H. Parveen Sultana, Artificial gravitational cuckoo search algorithm along with particle bee optimized associative memory neural network for feature selection in heart disease classification, J. Ambient. Intell. Human. Comput. (2019) 1–10.
[3] S. Mishra, et al., Management protocols for chronic heart failure in India, Indian Heart J. 70 (1) (2018) 105–127.
[4] K. Burse, et al., Various preprocessing methods for neural network based heart disease prediction, Smart Innovations in Communication and Computational Sciences, Springer, Singapore, 2019, pp. 55–65.
[5] H. Pratt, F. Coenen, D.M. Broadbent, S.P. Harding, Y. Zheng, Convolutional neural networks for diabetic retinopathy, Procedia Comput. Sci. 90 (2016) 200–205.
[6] S.D. Desai, S. Giraddi, P. Narayankar, N.R. Pudakalakatti, S. Sulegaon, Back-propagation neural network versus logistic regression in heart disease classification, Advanced Computing and Communication Technologies, Springer, 2019, pp. 133–144.
[7] K. Burse, V.P.S. Kirar, A. Burse, R. Burse, Various preprocessing methods for neural network-based heart disease prediction, Smart Innovations in Communication and Computational Sciences, Springer, 2019, pp. 55–65.
[8] I.K.A. Enriko, M. Suryanegara, D. Gunawan, Heart disease prediction system using a k-nearest neighbor algorithm with simplified patient's health parameters, J. Telecommun. Electron. Computer Eng. 8 (2016) 59–65.
[9] T. Nguyen, A. Khosravi, D. Creighton, S. Nahavandi, Classification of healthcare data using the genetic fuzzy logic system and wavelets, Expert. Syst. Appl. 42 (2015) 2184–2197.
[10] S. Maji, S. Arora, Decision tree algorithms for prediction of heart disease, Information and Communication Technology for Competitive Strategies, Springer, 2019, pp. 447–454.
[11] B.M. Tay, Machine learning-based approaches for intelligent adaptation and prediction in banking business processes, Diss., Lebanese American University, 2018.
[12] M.B.W. Banzhaf, A comparison of linear genetic programming and neural networks in medical data mining, Fachbereich Informatik, University of Dortmund, 44221 Dortmund, Germany.
[13] A. Khemphila, V. Boonjing, Heart disease classification using neural network and feature selection, International Conference on Systems Engineering, Las Vegas, NV, USA, 94, 2011.
[14] M.A.M. Abushariah, A.A.M. Alqudah, O.Y. Adwan, R.M.M. Yousef, Automatic heart disease diagnosis system based on artificial neural network (ANN) and adaptive neuro-fuzzy inference systems (ANFIS) approaches, J. Softw. Eng. Appl. 7 (12) (2014) 1055–1064.
[15] T.T. Nguyen, D.N. Davis, Predicting cardiovascular risk using the neural net techniques, University of Hull.
[16] S.H. Jambukia, V.K. Dabhi, H.B. Prajapati, Classification of ECG signals using machine learning techniques: a survey, in: 2015 International Conference on Advances in Computer Engineering and Applications (ICACEA), IEEE, 2015.
[17] C. Potes, et al., Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds, in: Computing in Cardiology Conference (CinC), IEEE, 2016.
[18] U.B. Baloglu, et al., Classification of myocardial infarction with multi-lead ECG signals and deep CNN, Pattern Recognit. Lett. 122 (2019) 23–30.
[19] G. Manogaran, R. Varatharajan, M.K. Priyan, Hybrid recommendation system for heart disease diagnosis based on multiple kernel learning with adaptive neuro-fuzzy inference system, Multimed. Tools Appl. 77 (4) (2018) 4379–4399.
[20] Ö. Yildirim, ECG beat detection and classification system using wavelet transform and online sequential ELM, J. Mech. Med. Biol. 19 (01) (2019) 1940008.
[21] S.L. Oh, et al., Automated diagnosis of arrhythmia using a combination of CNN and LSTM techniques with variable length heart beats, Comput. Biol. Med. 102 (2018) 278–287.
[22] J.S. Sonawane, D. Patil, Prediction of heart disease using learning vector quantization algorithm, in: IT in Business, Industry, and Government (CSIBIG), 2014 Conference on, 2014, pp. 1–5.
[23] Q. Fan, Z. Wang, D. Li, D. Gao, H. Zha, Entropy-based fuzzy support vector machine for imbalanced datasets, Knowl. Syst. 115 (2017) 87–99.
[24] J. Brownlee, K-Nearest Neighbors for Machine Learning. Understand Machine Learning Algorithms, 2016.
[25] X. Zhang, P.M. Treitz, D. Chen, C. Quan, L. Shi, X. Li, Mapping mangrove forests using multi-tidal remotely-sensed data and a decision-tree-based procedure, Int. J. Appl. Earth Observ. Geoinf. 62 (2017) 201–214.
[26] M.M. Islam, J. Kim, S.A. Khan, J.M. Kim, Reliable bearing fault diagnosis using Bayesian inference-based multi-class support vector machines, J. Acoustical Soc. Am. 141 (2) (2017) EL89–EL95.
[27] U. Narayanan, et al., A survey on various supervised classification algorithms, in: International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), 2017, pp. 2118–2124.
[28] G. Dileep, S.N. Singh, Application of soft computing techniques for maximum power point tracking of the SPV system, Sol. Energy 141 (2017) 182–202.
[29] H.I. Elshazly, M. Elkorany, A.E. Assanien, Lymph diseases diagnosis approach based on support vector machines with different kernel functions, in: Computer Engineering & Systems 9th International Conference (ICCES), Cairo, 2014, pp. 198–203.
[30] M. Kumari, S. Godara, Comparative study of data mining classification methods in cardiovascular disease prediction, Int. J. Comput. Sci. Technol. 2 (2) (2011). ISSN: 0976-8491 (online).
[31] M.A.N. Banu, B. Gomathy, Disease predicting system using data mining techniques, Int. J. Technical Res. Appl. 1 (5) (2013) 41–45. e-ISSN: 2320-8163.
[32] S.U. Amin, K. Agarwal, R. Beg, Genetic neural network based data mining in prediction of heart disease using risk factors, in: Proceedings of 2013 IEEE Conference on Information and Communication Technologies (ICT 2013), 2013.
[33] D. Vadicherla, S. Sonawane, Decision support system for heart disease based on sequential minimal optimization in support, Int. J. Eng. Sci. Emerg. Technol. 4 (2) (2013) 19–26.
[34] A. Taneja, Heart disease prediction system using data mining techniques, Orient. J. Comput. Sci. Technol. 6 (4) (2013). ISSN: 0974-6471.
[35] X. Liu, Q. Wang, M. Su, Y. Zhang, Q. Zhu, Wang, et al., A hybrid classification system for heart disease diagnosis based on the RFRS method, Comput. Math. Methods Med. (2017) 11.
[36] M.A.M. Abushariah, A.A.M. Alqudah, O.Y. Adwan, R.M.M. Yousef, Automatic heart disease diagnosis system based on artificial neural network (ANN) and adaptive neuro-fuzzy inference systems (ANFIS) approaches, J. Softw. Eng. Appl. 7 (12) (2014) 1055–1064.
Further reading
R. Kannan, V. Vasanthi, Machine learning algorithms with ROC curve for predicting and diagnosing the heart disease, Soft Computing and Medical Bioinformatics, Springer, Singapore, 2019, pp. 63–72.
Index Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively.
A Accuracy, 273 Active learning, 27 28 Acute myocardial infarction (AMI), 242 AdaBoost, 265 Ada Boost M1, 177 ADTree grouping calculations, 175 176 Advanced learning techniques, 24 29 active learning, 27 28 deep learning, 25 26 distributed and parallel learning, 26 27 kernel-based learning, 28 29 representation learning, 24 25 transfer learning, 27 Aifred, 155 Alpha, 185 187, 204 206 Analytical hierarchical process (AHP) evaluation proposed AHP model of successful healthcare, 229 236 description, 231 236 hospital/lab (C2), 230 231 research methodology AHP model, 225 226 AHP technique, 226 229 review of literature, 222 224 ANFSI, 264 265 Ant colony optimization (ACO), 21 Apache Hadoop, 116, 120 121, 126 129, 127f, 128f Apache Mahout, 35 Apache Spark (MLib), 35 36, 131 132, 131f
Apache Storm, 36 Artificial intelligence (AI), 217 218, 244 245 Artificial neural network (ANN), 13, 22 24, 123, 262 265, 267 Association Rule Learning, 50 Atheoretical symptom-based model, 148 Atomic patterns, 9 Attacking prey (exploitation), 187 Azure Machine Learning Studio, 34
B Back-Propagation learning algorithm, 264 265 Backpropagation network (BPNN), 262 263 Bacterial foraging optimization (BFO), 243, 248 249, 258 259 optimal feature selection using, 248 250 Beta, 185 187, 204 206 Big data analytics, 45 52, 94 99, 221 applications of big data in healthcare industry, 63 65 advanced patient monitoring and alerts, 64 enhanced patient engagement, 65 fraud and error prevention, 65 management and operational efficiency, 64 smart healthcare intelligence, 65
challenges to big data analytics in healthcare, 61 63 data acquisition and modeling, 62 data security and risk, 62 data storage and transfer, 62 querying and reporting, 63 technology incorporation and miscommunication gaps, 63 future of big data in healthcare, 65 66 healthcare sector, big data in, 52 53 medical imaging, 53 54 methodology, 54 55 motivation, 47 48 opportunities for big data in healthcare, 59 61 cost reduction, 61 data accessibility and decision-making, 60 61 early disease detection, 60 quality of treatment, 59 60 platforms and tools, 55 59 Cassandra, 59 Cloud storage, 55 57 Hadoop, 57 58 Hive, 58 NoSQL databases, 57 Pig, 58 semisupervised learning, 97 99 supervised learning, 96 97 techniques and technologies, 49 50 unsupervised learning, 96 uses and challenges, 52 working, 51 52
281
282
Index
Big Data Loop, 158 Big information, 196 197 BigML, 31 Binary classification, 4 5 Binary cross-entropy, 269 270 Bioinformatics, 76 77 Block chain technology, 217 218 BPN (Backpropogation Network), 121 122 Brain initiative, 217 Breast Cancer Experimental Test Index, 187 188 Breast cancer prediction, 173 174 comparison measures, 189 191 early identification, 176 literature survey, 175 178 proposed methodology, 178 188, 179f dataset description, 187 188 feature selection, 179 183 kernel based support vector machine with Gray Wolf Optimization, 183 187 preprocessing, 179
C C4.5 calculation, 178 Calculated value (CV), 231 CART (classification and regression tree), 13 14, 175 176 Cassandra, 59 Categorical variable decision tree, 14 CBR (case based reasoning), 116 117 Centre for Disease Control (CDC), 79 Cerebral microbleeds (CMBs), 117 118 CHEFNN (Competitive edge finding neural network), 121 Classification, Big Data, 3 9 advanced learning techniques, 24 29 active learning, 27 28
deep learning, 25 26 distributed and parallel learning, 26 27 kernel-based learning, 28 29 representation learning, 24 25 transfer learning, 27 approaches, 6 challenges in, 4 definition of, 3 evolutionary techniques, 20 24 artificial neural network (ANN), 22 24 coevolutionary programming, 24 genetic algorithm, 22 genetic programming, 21 swarm intelligence, 20 21 need in Big Data, 4 pattern, 8 9 atomic patterns, 9 composite patterns, 9 phases of, 6 8 data preparation phase, 6 8 evaluation phase, 8 learning phase, 8 tools and platforms, 29 36 Apache Mahout, 35 Apache Spark (MLib), 35 36 Apache Storm, 36 Azure Machine Learning Studio, 34 BigML, 31 DataRobot, 31 32 Google Cloud AutoML, 32 H2O Driverless AI, 34 35 IBM Watson Studio, 32 MLJAR, 33 Pattern, 30 Rapidminer, 33 Scikit-learn, 29 30 Shogun, 29 Tableau, 33 34 TensorFlow, 30 Weka, 30 31
traditional learning techniques, 9 19 decision tree, 13 15 K-nearest neighbor (KNN), 17 18 logistic regression, 9 11 matrix factorization, 18 19 Naı¨ve Bayes algorithm, 15 16 random forest, 18 support vector machine, 12 13 types of, 4 5 Classification techniques, for predicting cardiac disease, 262 264 Classifier model, building, 267 270 Cleveland dataset, 262 263 Clinical Decision-Support Systems (CDSS), 163 164 Clinical informatics, 77 Clinical researchers, 105 Cloud, 217 218, 220 Cloud storage, 55 57 Cluster Analysis, 50 Clustering, 74, 96 Clustering k-mean, 196 197 CNN LSTM hyper, 265, 268 CNN’s (Convolutional Neural Networks), 117 118 3-D CNN, 117 118, 118f Coevolutionary programming, 24 Cognitive Behavioral Therapy (CBT), 153 154 ColoPrint, 76 Colorectal cancer (CRC), 76 Comparative judgments, 226 227 Composite patterns, 9 Composite preference value (CPF), 231 Computer Aided Detection (CAD) systems, 117 118 Confidentiality, 223 Consistency index (CI), 227 Consistency ratio (CR), 227 Consumer-facing technology, 217 218
Index
Continuous variable decision tree, 14 Convolutional neural systems, 178 Convolution neural network (CNN), 264 265, 267 268 Coronary episode, 242 Coronary heart disease, 243 244, 255 Cost-effectiveness, 224 Cost reduction, 61 Crowdsourcing, 50
D Data access, 223 Data accessibility and decision-making, 60 61 Data acquisition, 95 and modeling, 62 Data assurance, 223 224 Data cleansing, 95 Data extraction techniques, 250 Data governance, 106, 108 Data integration, 223 Data mining techniques, 49 50, 96, 242, 248 Datanode, 127 Data portability, 108 Data preparation phase, 6 8 data preprocessing, 7 data selection, 7 data transformation, 7 8 Data privacy, 107 Data protection, 223 Data recontextualization, 108 Data recycling, 108 Data reduction, 95 Data repurposing, 108 DataRobot, 31 32 Data security, 107 Data security and risk, 62 Data sharing, 108, 223 Data stability, 224 Data storage and transfer, 62 Data visualization, 97 Decision table, 177 Decision tree (DT), 13 15, 262 264
advantages and disadvantages, 15t Decision tree hybrid technique, 262 263 Deep learning, 25 26, 264 265 Deep neural network (DNN), 30 Deloitte, 217 218 Delta, 186 187 Digital Imaging and Communications in Medicine (DICOM), 102, 121 Digitized health space, 219 Directed Acyclic Graph (DAG), 132 Discriminative distance learning, 116 117 Disease management technology, 217 218 Distant metastasis free survival (DMFS), 76 Distributed and parallel learning, 26 27 DQN (Difference of Quantile Normalized Values) technique, 76 77
E Early disease detection, 60 Eigenvalue, 227 Electroencephalogram (EEG) signals, 119 Electronic Health Records (EHRs), 71, 81, 218 Electronic medical records (EMR), 91 EM (Expectation Maximization), 121 Evaluation phase, 8 Evolutionary techniques, 20 24 artificial neural network (ANN), 22 24 coevolutionary programming, 24 genetic algorithm, 22 genetic programming, 21 swarm intelligence, 20 21 ant colony optimization (ACO), 21
283
particle swarm optimization (PSO), 20 21 Experimental validation with verifiable objective results, 225
F False Negative (FN), 254 False Positive (FP), 254 False positive probability (FPR), 273 Feature extraction, 73 74 Feature selection, 179 183, 262 263 Fine needle aspirate (FNA), 187 188 Firefly calculation, 250 Firefly—Binary Cuckoo Search (FFBCS) literature survey, 244 247 proposed methodology, 247 254 dataset description, 254 optimal feature selection using bacterial foraging optimization, 248 250 optimization by using FFBCS, 250 254 preprocessing, 248 result and discussion, 254 258 comparative analysis, 255 258 F-measure, 273 Framingham Heart Study (FHS), 244 245 Fraud and error prevention, 65 Fruit fly calculation, 202 203 Fuzzy C-means clustering algorithm, 117 118
G Gaussian theory, 207 Genetic Algorithm (GA), 22, 119 Genetic programming (GP), 21, 264 Google Cloud AutoML, 32 Grasshoppers, 180 183 Gray Wolf Optimization, 183 187
284
Index
Gray Wolf Optimization Agent (GWO), 204 205 Grey wolf optimizer (GWO), 184 185 Grouping calculations, 175 176
H H2O Driverless AI, 34 35 Hadoop, 57 58, 116, 120 121, 126 129, 127f, 128f Hadoop Distributed File System (HDFS), 127 Hadoop MapReduce, 106 HDOC (Hybrid Digital Optical Correlator), 116 117 Healthcare and medical big bata analytics, 85 88 big data analytics, 94 99 semisupervised learning, 97 99 supervised learning, 96 97 unsupervised learning, 96 big data security, privacy, and governance, 107 108 framework for healthcare information system based on big data, 104 107 healthcare and medical data coding and taxonomy, 99 101 medical and healthcare big data, 88 94 exposome data, 93 94 medical and healthcare data interchange standards, 101 104 Healthcare data analytics, 220 Healthcare software developers, 105 Health challenge, 217 Health Information Technology for Economic and Clinical Health Act of 2009, 73 Heart disease, 262 Heritage health prize, 217 HIPAA (Health Insurance Portability and
Accountability Act), 102 104, 218 Hive, 58 Hopfield Neural Network, 121 Hospital strategic operators, 104 105 Human Connectome Project (HCP), 77 Hunting, 186 187 Hybrid technique for heart diseases diagnosis classification techniques, 262 264 experimental results and discussion, 270 273 evaluation criteria, 271 273 literature review, 264 265 proposed technique, 265 270 building classifier model, 267 270 preprocessing data, 266 267 results analysis and discussion, 273 277 traditional ways, 262 Hyper CNN LSTM, 264, 269 270 Hypertension, 246 247
I IBM, 217 218 IBM Watson Studio, 32 ICU readmission and mortality rates, 78 Improving Access to Psychological Therapies (IAPT) program, 152 153 In-database analytics, 107 Influenza like illnesses (ILI), 79 80 Information exploration, 174 175 Information extraction, 245 246 Information mining, 242 243 Information preparation, 248 Integrating Data for Analysis Anonymization and Sharing (iDASH), 119 120
J J48, 175 177 J-Rip, 177
K K closer, 245 Kernel-based learning, 28 29 Kernel based support vector machine with Gray Wolf Optimization (KSVMGWO), 183 187, 189 190 testing phase, 187 training phase, 183 Kernel ridge regression (KRR), 198 199 Kernels, 267 Kernel support vector machine (KSVM), 184 K-Means Clustering, 74 K-Nearest Neighbor (K-NN), 17 18, 262 263, 267 advantages and disadvantages, 18t
L Lagrangian model, 183 Lazy IBK, 177 Lazy K-star, 177 Learning phase, 8 Leeds Risk Index (LRI), 153 154 Likert scale, 230, 237 Linear discriminant analysis (LDA), 264 Linear Kernel, 183 184 Locally Supervised Metric Learning (LSML), 79 Logistic Autoregression with exogenous inputs (ARX), 80 Logistic regression (LR), 9 11, 262 264 advantages and disadvantages, 11t Long short-term memory (LSTM), 264, 267 268 Luminal A subtype, 173 174
Index
M Machine learning (ML), 50, 174 175, 247 248, 263 264, 263f Machine learning as a service (MLaaS), 31 Major Depressive Disorder (MDD), 139 140 Malignancy, defined, 173 174 Management and operational efficiency, 64 MapReduce, 116, 124 126, 126f, 245 246 MATLAB R2016b, 188 Matlab programming language, 189 Matrix factorization, 18 19 advantages and disadvantages of, 18 19 Maximum value (MV), 231 Medical imaging, 91, 113 136 artificial intelligence for analytics of medical images, 121 123 big data analytics in, 116 121, 117f analytical methods, 116 119 collection, sharing, and compression, 119 121 challenges in, 115 3-D CNN, 117 118, 118f tools and frameworks, 123 132 Hadoop, 126 129, 127f, 128f MapReduce, 124 126, 126f Spark, 131 132, 131f Yet Another Resource Negotiator (YARN), 129 131, 130f Medical personnel, 104 Medicare penalties, 217 Mental health, artificial intelligence and big data in, 146 165 diagnosis, 148 150 ethical considerations, 162 165 monitoring, 159 162 monitoring compliance to treatment, 162
symptom monitoring, 160 161 prognosis, 150 152 treatment delivery, 156 159 Big Data Loop, 158 differentiation, 158 159 public acceptance and adoption, 158 real-world validation, 157 158 treatment selection, 152 155 Mental healthcare, 142 146 MHealth app, 216, 217f MIFAS (Medical Image File Accessing System), 120 121 MLJAR, 33 Mobile Healthcare (m-healthcare), 219 220 Modified cat swarm optimization (MCSO), 198 199 MRI data for prediction, 78 MTANN (Massive training artificial neural network), 123, 124f Multiclass classification, 5 Multiclass Classifier, 177 Multilabel classification, 5 Multilayer Perceptron, 177 Multiple-criteria decision methods (MCDM), 225 Multiple kernel learning with adaptive neuro-fuzzy inference system, 265
N Naive Bayes (NB), 176 178, 262 263 advantages and disadvantages, 16t Namenode, 127 National Health Policy, India, 216 Naı¨ve Bayes (NB) algorithm, 15 16, 245 Nearest Centroid-Based Classifier (NCBC), 76 NED (Neural Edge Detector), 121 122
285
Neuroinformatic, 77 NN (nearest neighbors) features, 117 118, 264 NoSQL databases, 57 management systems, 106
O Objective data, 160 Omega, 185, 205 206 Online sequential extreme learning machine (OSELM) classification technique, 265 Online sequential ridge regression (OSRR), 198 199 Oppositional firefly (OFF), 201 Oppositional fruit fly algorithm (OFFA), 202 204 Oppositional grasshopper optimization (OGHO) algorithm, 180 183 Oppositional Gray Wolf Optimization with Kernel Ridge Regression (OGWOKRR), 195 214 classification accuracy, 208 classification using, 204 208 attacking prey and search for prey, 207 208 encircling prey, 206 fitness evaluation, 205 hunting, 206 initialization process, 205 separate the solution based on the fitness, 205 comparative analysis, 210 211 feature reduction, 201 feature selection, 202 204 oppositional fruit fly algorithm (OFFA), 202 204 literature survey, 198 201 performance evaluation, 209 210 sensitivity, 208 specificity, 209 Opposition based learning (OBL), 181
286
Index
Optimal feature selection using bacterial foraging optimization, 248 250 firefly updation, 251 252 fitness function, 251 initialization phase, 252 254 solution representation, 250 Optimization by FFBCS, 250 254 Out-of-pocket expenditure, 216
P Particle swarm optimization (PSO), 20 21 Patient engagement, enhanced, 65 Patient Health Questionnaire (PHQ-9), 145 146 Patient monitoring and alerts, 64 Pattern, 30 PCA (principal component analysis), 123 PCNN (Pulse Coupled Neural Network), 121 Personalized Advantage Index (PAI), 155 Pharmaceutical research, 105 Pig, 58 PNN (Probabilistic Neural Network), 123 Polynomial Kernel, 183 184 Portable EKG 4, 242 Positron emission tomography (PET) imaging, 154 Precision, 273 Predominantly ergonomic evaluations, 224t Prescription data, 91 Prognosis, 147 Properties command model, 243 PSYCHE, 160 Public health informatics, 79 Python code, 11
Q Quadratic Kernel, 183 184 Querying and reporting, 63 Quick Inventory of Depressive Symptomatology (QIDS), 145 146
S Saaty compatibility index, 225 Schizophrenia, 141 Scikit-learn, 29 30 Search for prey (exploration), 187 Search query data, 79 80 Sectioned database, 189 SEER information, 178 Semistructured data, 91 Semisupervised learning, 96 99, 263 264 Sensitivity, 208 Sensor data, 91 Shogun, 29 Sickle cell disease (SCD), 196 Sigmoid function, 10 Sigmoid Kernel, 183 184 Smart healthcare intelligence, 65 SNEFT (Social Network Enabled Flu Tracking) system, 80 SNP data (Single Nucleotide Polymorphism), 116 117 Social media analytics, 80 Software as a Service (SaaS), 31 Spark, 131 132, 131f Specificity, 209, 255 Strategic recurrence, 177 Stream computing, 107 Streaming analytics apps, 52 Structural hierarchy, 226 Structured data, 90 Subjective data, 160 Supervised classification approach, 6 Supervised learning, 96 97, 263 264 Support machine vector, 197 198 Support vector machine (SVM), 12 13, 74, 76 77, 116 117, 176 177, 245, 264, 267 advantages and disadvantages of, 13t kernel based SVM with Gray Wolf Optimization, 183 187
R Radial Basis Function—sparse Partial Least Squares (RBF-sPLS), 78 Radial basis kernel ridge regression (RKRR), 198 199 Random consistency index (RI), 227 Random Forest (RF) algorithm, 18, 116 117, 119, 177, 196 197, 264 advantages and disadvantages, 19t Random Tree, 177 Rapidminer, 33, 116 117 RBF Neural Network, 122, 123f Real-world validation, 157 158 Recall, 273 Receiver operating characteristics (ROC) curve, 198 199, 273 Recurrent neural networks (RNN), 244, 267 268 Reinforcement learning, 263 264 Reliability, 224 Remote Procedure Calls (RPC), 126 Remote ultrasound technology, 218 219 Representation learning, 24 25 Research Domain Criteria (RDoC), 165 ReSET, 156 Resilient Distributed Datasets (RDDs), 132 Reverse proliferated neural system, 178 RNN, 267 268
Index
Support vector machine polynomial (SVMPoly), 198 199 Support vector machine radial basis function (SVMRBF), 198 199 Support Vector Regression, 80 Swarming, 249 Swarming insects, 180 Swarm intelligence, 20 21 ant colony optimization (ACO), 21 particle swarm optimization (PSO), 20 21 Symptom monitoring, 160 161
T Tableau, 33 34 Technology incorporation and miscommunication gaps, 63 Telemedicine, 218 219 TensorFlow, 30 Theoretical validation, 225 3V, 188 TNR (true negative rate), 209 Tools and techniques of big data analytics for healthcare system, 69 84 background, 72 73 importance and motivation, 71 72 methods of application, 73 76 feature extraction, 73 74 imputation, 74 76 past shortcomings, 81 result domains, 76 80
287
analyzing real-time data streams for diagnosis and prognosis, 78 79 bioinformatics, 76 77 clinical informatics, 77 ICU readmission and mortality rates, 78 MRI data for prediction, 78 neuroinformatic, 77 public health informatics, 79 search query data, 79 80 social media analytics, 80 TPR (true positive rate), 208 209 Traditional learning techniques, 9 19 decision tree, 13 15 K-nearest neighbor, 17 18 logistic regression, 9 11 matrix factorization, 18 19 Naı¨ve Bayes algorithm, 15 16 random forest, 18 support vector machine, 12 13 Transfer learning, 27 Treatment quality, 59 60 True Negative (TN), 254 True Positive (TP), 254 2-layered Hopfield NN, 121
U UCI ML standard, 187 188 Unstructured data, 91 Unstructured information, 243 Unsupervised classification approach, 6 Unsupervised learning, 96, 263 264
V Validation of AHP modeling technique, 225 Value, 71 72 Variety, 70 72, 188 Vector quantization, 265 Velocity, 70 72, 188 Very Fast Decision Tree (VFDT), 79 Volume, 70 72, 188
W Wavelet kernel ridge regression (WKRR), 198 199 Wavelet transformation (WT), 265 Weka programming, 30 31, 176 177 Winser Filter Theory, 122 Wireless health, 219 Wisconsin Breast Cancer (WBC) breast malignancy database, 176 177 Wisconsin Breast Disease dataset Index, 187 188 Woebot, 156
Y Yet Another Resource Negotiator (YARN), 126, 129 131, 130f
Z Zhang’s method, 79