Biomedical Data Mining for Information Retrieval
Scrivener Publishing 100 Cummings Center, Suite 541J Beverly, MA 01915-6106
Artificial Intelligence and Soft Computing for Industrial Transformation Series Editor: Dr S. Balamurugan ([email protected])
Scope: Artificial Intelligence and Soft Computing Techniques play an impeccable role in industrial transformation. The topics to be covered in this book series include Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks, Fuzzy Logic, Genetic Algorithms, Particle Swarm Optimization, Evolutionary Algorithms, Nature Inspired Algorithms, Simulated Annealing, Metaheuristics, Cuckoo Search, Firefly Optimization, Bio-inspired Algorithms, Ant Colony Optimization, Heuristic Search Techniques, Reinforcement Learning, Inductive Learning, Statistical Learning, Supervised and Unsupervised Learning, Association Learning and Clustering, Reasoning, Support Vector Machine, Differential Evolution Algorithms, Expert Systems, Neuro Fuzzy Hybrid Systems, Genetic Neuro Hybrid Systems, Genetic Fuzzy Hybrid Systems and other Hybridized Soft Computing Techniques and their applications for Industrial Transformation. The book series aims to provide comprehensive handbooks and reference books for the benefit of scientists, research scholars, students and industry professionals working towards next generation industrial transformation.
Publishers at Scrivener Martin Scrivener ([email protected]) Phillip Carmical ([email protected])
Biomedical Data Mining for Information Retrieval Methodologies, Techniques and Applications
Edited by
Sujata Dash, Subhendu Kumar Pani, S. Balamurugan and Ajith Abraham
This edition first published 2021 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA
© 2021 Scrivener Publishing LLC
For more information about Scrivener publications please visit www.scrivenerpublishing.com.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

Wiley Global Headquarters
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

Library of Congress Cataloging-in-Publication Data
ISBN 978-1-119-71124-7

Cover image: Pixabay.Com
Cover design by Russell Richardson

Set in size of 11pt and Minion Pro by Manila Typesetting Company, Makati, Philippines

Printed in the USA

10 9 8 7 6 5 4 3 2 1
Contents

Preface

1 Mortality Prediction of ICU Patients Using Machine Learning Techniques
  Babita Majhi, Aarti Kashyap and Ritanjali Majhi
  1.1 Introduction
  1.2 Review of Literature
  1.3 Materials and Methods
    1.3.1 Dataset
    1.3.2 Data Pre-Processing
    1.3.3 Normalization
    1.3.4 Mortality Prediction
    1.3.5 Model Description and Development
  1.4 Result and Discussion
  1.5 Conclusion
  1.6 Future Work
  References

2 Artificial Intelligence in Bioinformatics
  V. Samuel Raj, Anjali Priyadarshini, Manoj Kumar Yadav, Ramendra Pati Pandey, Archana Gupta and Arpana Vibhuti
  2.1 Introduction
  2.2 Recent Trends in the Field of AI in Bioinformatics
    2.2.1 DNA Sequencing and Gene Prediction Using Deep Learning
  2.3 Data Management and Information Extraction
  2.4 Gene Expression Analysis
    2.4.1 Approaches for Analysis of Gene Expression
    2.4.2 Applications of Gene Expression Analysis
  2.5 Role of Computation in Protein Structure Prediction
  2.6 Application in Protein Folding Prediction
  2.7 Role of Artificial Intelligence in Computer-Aided Drug Design
  2.8 Conclusions
  References

3 Predictive Analysis in Healthcare Using Feature Selection
  Aneri Acharya, Jitali Patel and Jigna Patel
  3.1 Introduction
    3.1.1 Overview and Statistics About the Disease
      3.1.1.1 Diabetes
      3.1.1.2 Hepatitis
    3.1.2 Overview of the Experiment Carried Out
  3.2 Literature Review
    3.2.1 Summary
    3.2.2 Comparison of Papers for Diabetes and Hepatitis Dataset
  3.3 Dataset Description
    3.3.1 Diabetes Dataset
    3.3.2 Hepatitis Dataset
  3.4 Feature Selection
    3.4.1 Importance of Feature Selection
    3.4.2 Difference Between Feature Selection, Feature Extraction and Dimensionality Reduction
    3.4.3 Why Do Traditional Feature Selection Techniques Still Hold True?
    3.4.4 Advantages and Disadvantages of Feature Selection Technique
      3.4.4.1 Advantages
      3.4.4.2 Disadvantages
  3.5 Feature Selection Methods
    3.5.1 Filter Method
      3.5.1.1 Basic Filter Methods
      3.5.1.2 Correlation Filter Methods
      3.5.1.3 Statistical & Ranking Filter Methods
      3.5.1.4 Advantages and Disadvantages of Filter Method
    3.5.2 Wrapper Method
      3.5.2.1 Advantages and Disadvantages of Wrapper Method
      3.5.2.2 Difference Between Filter Method and Wrapper Method
  3.6 Methodology
    3.6.1 Steps Performed
    3.6.2 Flowchart
  3.7 Experimental Results and Analysis
    3.7.1 Task 1—Application of Four Machine Learning Models
    3.7.2 Task 2—Applying Ensemble Learning Algorithms
    3.7.3 Task 3—Applying Feature Selection Techniques
    3.7.4 Task 4—Applying Data Balancing Technique
  3.8 Conclusion
  References

4 Healthcare 4.0: An Insight of Architecture, Security Requirements, Pillars and Applications
  Deepanshu Bajaj, Bharat Bhushan and Divya Yadav
  4.1 Introduction
  4.2 Basic Architecture and Components of e-Health Architecture
    4.2.1 Front End Layer
    4.2.2 Communication Layer
    4.2.3 Back End Layer
  4.3 Security Requirements in Healthcare 4.0
    4.3.1 Mutual Authentication
    4.3.2 Anonymity
    4.3.3 Un-Traceability
    4.3.4 Perfect Forward Secrecy
    4.3.5 Attack Resistance
      4.3.5.1 Replay Attack
      4.3.5.2 Spoofing Attack
      4.3.5.3 Modification Attack
      4.3.5.4 MITM Attack
      4.3.5.5 Impersonation Attack
  4.4 ICT Pillars Associated With HC4.0
    4.4.1 IoT in Healthcare 4.0
    4.4.2 Cloud Computing (CC) in Healthcare 4.0
    4.4.3 Fog Computing (FC) in Healthcare 4.0
    4.4.4 BigData (BD) in Healthcare 4.0
    4.4.5 Machine Learning (ML) in Healthcare 4.0
    4.4.6 Blockchain (BC) in Healthcare 4.0
  4.5 Healthcare 4.0's Application Scenarios
    4.5.1 Monitoring of Physical and Pathological Signals
    4.5.2 Self-Management, Wellbeing Monitoring and Precaution
    4.5.3 Medication Consumption Monitoring and Smart Pharmaceutics
    4.5.4 Personalized (or Customized) Healthcare
    4.5.5 Cloud-Related Medical Information Systems
    4.5.6 Rehabilitation
  4.6 Conclusion
  References

5 Improved Social Media Data Mining for Analyzing Medical Trends
  Minakshi Sharma and Sunil Sharma
  5.1 Introduction
    5.1.1 Data Mining
    5.1.2 Major Components of Data Mining
    5.1.3 Social Media Mining
    5.1.4 Clustering in Data Mining
  5.2 Literature Survey
  5.3 Basic Data Mining Clustering Technique
    5.3.1 Classifiers and Their Algorithms in Data Mining
  5.4 Research Methodology
  5.5 Results and Discussion
    5.5.1 Tool Description
    5.5.2 Implementation Results
    5.5.3 Comparison Graphs: Performance Comparison
  5.6 Conclusion & Future Scope
  References

6 Bioinformatics: An Important Tool in Oncology
  Gaganpreet Kaur, Saurabh Gupta, Gagandeep Kaur, Manju Verma and Pawandeep Kaur
  6.1 Introduction
  6.2 Cancer—A Brief Introduction
    6.2.1 Types of Cancer
    6.2.2 Development of Cancer
    6.2.3 Properties of Cancer Cells
    6.2.4 Causes of Cancer
  6.3 Bioinformatics—A Brief Introduction
  6.4 Bioinformatics—A Boon for Cancer Research
  6.5 Applications of Bioinformatics Approaches in Cancer
    6.5.1 Biomarkers: A Paramount Tool for Cancer Research
    6.5.2 Comparative Genomic Hybridization for Cancer Research
    6.5.3 Next-Generation Sequencing
    6.5.4 miRNA
    6.5.5 Microarray Technology
    6.5.6 Proteomics-Based Bioinformatics Techniques
    6.5.7 Expressed Sequence Tags (EST) and Serial Analysis of Gene Expression (SAGE)
  6.6 Bioinformatics: A New Hope for Cancer Therapeutics
  6.7 Conclusion
  References

7 Biomedical Big Data Analytics Using IoT in Health Informatics
  Pawan Singh Gangwar and Yasha Hasija
  7.1 Introduction
  7.2 Biomedical Big Data
    7.2.1 Big EHR Data
    7.2.2 Medical Imaging Data
    7.2.3 Clinical Text Mining Data
    7.2.4 Big OMICs Data
  7.3 Healthcare Internet of Things (IoT)
    7.3.1 IoT Architecture
    7.3.2 IoT Data Source
      7.3.2.1 IoT Hardware
      7.3.2.2 IoT Middleware
      7.3.2.3 IoT Presentation
      7.3.2.4 IoT Software
      7.3.2.5 IoT Protocols
  7.4 Studies Related to Big Data Analytics in Healthcare IoT
  7.5 Challenges for Medical IoT & Big Data in Healthcare
  7.6 Conclusion
  References

8 Statistical Image Analysis of Drying Bovine Serum Albumin Droplets in Phosphate Buffered Saline
  Anusuya Pal, Amalesh Gope and Germano S. Iannacchione
  8.1 Introduction
  8.2 Experimental Methods
  8.3 Results
    8.3.1 Temporal Study of the Drying Droplets
    8.3.2 FOS Characterization of the Drying Evolution
    8.3.3 GLCM Characterization of the Drying Evolution
  8.4 Discussions
    8.4.1 Qualitative Analysis of the Drying Droplets and the Dried Films
    8.4.2 Quantitative Analysis of the Drying Droplets and the Dried Films
  8.5 Conclusions
  Acknowledgments
  References

9 Introduction to Deep Learning in Health Informatics
  Monika Jyotiyana and Nishtha Kesswani
  9.1 Introduction
    9.1.1 Machine Learning vs. Deep Learning
    9.1.2 Neural Networks and Deep Learning
    9.1.3 Deep Learning Architecture
      9.1.3.1 Deep Neural Networks
      9.1.3.2 Convolutional Neural Networks
      9.1.3.3 Deep Belief Networks
      9.1.3.4 Recurrent Neural Networks
      9.1.3.5 Deep Auto-Encoder
    9.1.4 Applications
  9.2 Deep Learning in Health Informatics
    9.2.1 Medical Imaging
      9.2.1.1 CNN vs. Medical Imaging
      9.2.1.2 Tissue Classification
      9.2.1.3 Cell Clustering
      9.2.1.4 Tumor Detection
      9.2.1.5 Brain Tissue Classification
      9.2.1.6 Organ Segmentation
      9.2.1.7 Alzheimer's and Other NDD Diagnosis
  9.3 Medical Informatics
    9.3.1 Data Mining
    9.3.2 Prediction of Disease
    9.3.3 Human Behavior Monitoring
  9.4 Bioinformatics
    9.4.1 Cancer Diagnosis
    9.4.2 Gene Variants
    9.4.3 Gene Classification or Gene Selection
    9.4.4 Compound–Protein Interaction
    9.4.5 DNA–RNA Sequences
    9.4.6 Drug Designing
  9.5 Pervasive Sensing
    9.5.1 Human Activity Monitoring
    9.5.2 Anomaly Detection
    9.5.3 Biological Parameter Monitoring
    9.5.4 Hand Gesture Recognition
    9.5.5 Sign Language Recognition
    9.5.6 Food Intake
    9.5.7 Energy Expenditure
    9.5.8 Obstacle Detection
  9.6 Public Health
    9.6.1 Lifestyle Diseases
    9.6.2 Predicting Demographic Information
    9.6.3 Air Pollutant Prediction
    9.6.4 Infectious Disease Epidemics
  9.7 Deep Learning Limitations and Challenges in Health Informatics
  References

10 Data Mining Techniques and Algorithms in Psychiatric Health: A Systematic Review
  Shikha Gupta, Nitish Mehndiratta, Swarnim Sinha, Sangana Chaturvedi and Mehak Singla
  10.1 Introduction
  10.2 Techniques and Algorithms Applied
  10.3 Analysis of Major Health Disorders Through Different Techniques
    10.3.1 Alzheimer's
    10.3.2 Dementia
    10.3.3 Depression
    10.3.4 Schizophrenia and Bipolar Disorders
  10.4 Conclusion
  References

11 Deep Learning Applications in Medical Image Analysis
  Ananya Singha, Rini Smita Thakur and Tushar Patel
  11.1 Introduction
    11.1.1 Medical Imaging
    11.1.2 Artificial Intelligence and Deep Learning
    11.1.3 Processing in Medical Images
  11.2 Deep Learning Models and Their Classification
    11.2.1 Supervised Learning
      11.2.1.1 RNN (Recurrent Neural Network)
    11.2.2 Unsupervised Learning
      11.2.2.1 Stacked Auto Encoder (SAE)
      11.2.2.2 Deep Belief Network (DBN)
      11.2.2.3 Deep Boltzmann Machine (DBM)
      11.2.2.4 Generative Adversarial Network (GAN)
  11.3 Convolutional Neural Networks (CNN)—A Popular Supervised Deep Model
    11.3.1 Architecture of CNN
    11.3.2 Learning of CNNs
    11.3.3 Medical Image Denoising Using CNNs
    11.3.4 Medical Image Classification Using CNN
  11.4 Deep Learning Advancements—A Biological Overview
    11.4.1 Sub-Cellular Level
    11.4.2 Cellular Level
    11.4.3 Tissue Level
    11.4.4 Organ Level
      11.4.4.1 The Brain and Neural System
      11.4.4.2 Sensory Organs—The Eye and Ear
      11.4.4.3 Thoracic Cavity
      11.4.4.4 Abdomen and Gastrointestinal (GI) Tract
      11.4.4.5 Other Miscellaneous Applications
  11.5 Conclusion and Discussion
  References

12 Role of Medical Image Analysis in Oncology
  Gaganpreet Kaur, Hardik Garg, Kumari Heena, Lakhvir Singh, Navroz Kaur, Shubham Kumar and Shadab Alam
  12.1 Introduction
  12.2 Cancer
    12.2.1 Types of Cancer
    12.2.2 Causes of Cancer
    12.2.3 Stages of Cancer
    12.2.4 Prognosis
  12.3 Medical Imaging
    12.3.1 Anatomical Imaging
    12.3.2 Functional Imaging
    12.3.3 Molecular Imaging
  12.4 Diagnostic Approaches for Cancer
    12.4.1 Conventional Approaches
      12.4.1.1 Laboratory Diagnostic Techniques
      12.4.1.2 Tumor Biopsies
      12.4.1.3 Endoscopic Exams
    12.4.2 Modern Approaches
      12.4.2.1 Image Processing
      12.4.2.2 Implications of Advanced Techniques
      12.4.2.3 Imaging Techniques
  12.5 Conclusion
  References

13 A Comparative Analysis of Classifiers Using Particle Swarm Optimization-Based Feature Selection
  Chandra Sekhar Biswal, Subhendu Kumar Pani and Sujata Dash
  13.1 Introduction
  13.2 Feature Selection for Classification
    13.2.1 An Overview: Data Mining
    13.2.2 Classification Prediction
    13.2.3 Dimensionality Reduction
    13.2.4 Techniques of Feature Selection
    13.2.5 Feature Selection: A Survey
    13.2.6 Summary
  13.3 Use of WEKA Tool
    13.3.1 WEKA Tool
    13.3.2 Classifier Selection
    13.3.3 Feature Selection Algorithms in WEKA
    13.3.4 Performance Measure
    13.3.5 Dataset Description
    13.3.6 Experiment Design
    13.3.7 Results Analysis
    13.3.8 Summary
  13.4 Conclusion and Future Work
    13.4.1 Summary of the Work
    13.4.2 Research Challenges
    13.4.3 Future Work
  References

Index
Preface

Introduction

Biomedical Data Mining for Information Retrieval comprehensively covers the topic of mining biomedical text, images and visual features towards information retrieval, which is an emerging research field at the intersection of information science and computer science. Biomedical and health informatics is another emerging field of research at the intersection of information science, computer science and healthcare. This new era of healthcare informatics and analytics brings with it tremendous opportunities and challenges based on the abundance of biomedical data easily available for further analysis. The aim of healthcare informatics is to ensure high-quality, efficient healthcare and better treatment and quality of life by efficiently analyzing biomedical and healthcare data, including patients' data, electronic health records (EHRs) and lifestyle. Earlier, it was commonly required to have a domain expert develop a model for biomedical or healthcare data; however, recent advancements in representation learning algorithms allow automatic learning of the pattern and representation of given data for the development of such a model.

Biomedical image mining is a novel research area brought about by the large number of biomedical images increasingly being generated and stored digitally. These images are mainly generated by computed tomography (CT), X-ray, nuclear medicine imaging (PET, SPECT), magnetic resonance imaging (MRI) and ultrasound. Patients' biomedical images can be digitized using data mining techniques and may help in answering several critical questions related to their healthcare. Image mining in medicine can help to uncover new relationships between data and reveal new useful information that can aid doctors in treating their patients.

Information retrieval (IR) methods have multiple levels of representation in which the system learns raw to higher abstract level representation at each level. An essential issue in medical IR is the variety of users of different services. In general, they will have changeable categories of information
needs, varying levels of medical knowledge and varying language skills. The various categories of users of medical IR systems have multiple levels of medical knowledge, with the medical knowledge of many individuals falling within a category that varies greatly. This influences the way in which individuals present search queries to systems and also the level of complexity of information that should be returned to them, or the type of support to consider when deciding which retrieved material should be provided. These methods have shown significant success in dealing with massive data for a large number of applications due to their capability of extracting complex hidden features and learning efficient representation in an unsupervised setting.

This book covers the latest advances and developments in health informatics, data mining, machine learning and artificial intelligence, fields which to a great extent will play a vital role in improving human life. It also covers the IR-based models for biomedical and health informatics which have recently emerged in the still-developing field of research in biomedicine and healthcare. All researchers and practitioners working in the fields of biomedicine, health informatics, and information retrieval will find the book highly beneficial. Since it is a good collection of state-of-the-art approaches for data-mining-based biomedical and health-related applications, it will also be very beneficial for new researchers and practitioners working in the field in order to quickly learn what the best performing methods are. With this book they will be able to compare different approaches in order to carry forward their research in the most important areas of research, which directly impacts the betterment of human life and health. No other book on the market provides such a good collection of state-of-the-art methods for mining biomedical text, images and visual features towards information retrieval.
Organization of the Book

The 13 chapters of this book present scientific concepts, frameworks and ideas on biomedical data analytics and information retrieval from different biomedical domains. The Editorial Advisory Board and expert reviewers have ensured the high caliber of the chapters through careful refereeing of the submitted papers. For the purpose of coherence, we have organized the chapters with respect to similarity of topics addressed, covering issues pertaining to the internet of things for biomedical engineering and health informatics, computational intelligence for medical image processing, and biomedical natural language processing.

In Chapter 1, "Mortality Prediction of ICU Patients Using Machine Learning Techniques," Babita Majhi, Aarti Kashyap and Ritanjali Majhi
present mortality prediction using machine learning techniques. Since the intensive care unit (ICU) admits very ill patients, facilitating their care requires serious attention and treatment using ventilators and other sophisticated medical equipment. This equipment is very costly; hence, its optimized use is necessary. ICUs require a higher number of staff in comparison to the number of patients admitted for regular monitoring. In brief, ICUs involve a larger budget compared to other sections of any hospital. Therefore, to help doctors determine which patient is more at risk, mortality prediction is an important area of research. In data mining, mortality prediction is a binary classification problem, i.e., die or survive. As a result, this has attracted machine learning groups to apply algorithms to mortality prediction. In this chapter, six different machine learning methods, functional link artificial neural network (FLANN), support vector machine (SVM), discriminant analysis (DA), decision tree (DT), naïve Bayesian network and K-nearest neighbors (KNN), are used to develop a model for mortality prediction using data collected from the PhysioNet Challenge 2012, and their performance is analyzed.

In Chapter 2, "Artificial Intelligence in Bioinformatics," V. Samuel Raj, Anjali Priyadarshini, Manoj Kumar Yadav, Ramendra Pati Pandey, Archana Gupta and Arpana Vibhuti emphasize the various smart tools available in the field of biomedical and health informatics. They also analyze recently introduced state-of-the-art bioinformatics tools that use complex AI algorithms.

In Chapter 3, "Predictive Analysis in Healthcare Using Feature Selection," Aneri Acharya, Jitali Patel and Jigna Patel describe various methods to enhance the performance of machine learning models used in predictive analysis. The chronic diseases of diabetes and hepatitis are explored in this chapter with an experiment carried out in four tasks.

In Chapter 4, "Healthcare 4.0: An Insight of Architecture, Security Requirements, Pillars and Applications," Deepanshu Bajaj, Bharat Bhushan and Divya Yadav present the idea of Industry 4.0, which is massively evolving as it is essential for the medical sector, including the internet of things (IoT), big data (BD) and blockchain (BC), the combination of which is modernizing the overall framework of e-health. They analyze the implementation of I4.0 (Industry 4.0) technology in the medical sector, which has revolutionized the best available approaches and improved the entire framework.

In Chapter 5, "Improved Social Media Data Mining for Analyzing Medical Trends," Minakshi Sharma and Sunil Sharma discuss social media health records. Nowadays, social media has become a prominent method of sharing and viewing news among the general population. It has become an inseparable part of our lives, with people spending most of their time
on social media instead of on other activities. People on media, such as Twitter, Facebook or blogs, share their health records, medication history and personal views. For social media resources to be useful, noise must be filtered out and only the important content must be captured, excluding the irrelevant data. However, even after filtering the content, it may contain irrelevant information, so the information should be prioritized based on its estimated importance. Importance can be estimated with the help of three factors: media focus (MF), user attention (UA) and user interaction (UI). The first factor, media focus, is the temporal popularity of a topic in the news. For the second factor, the temporal popularity of a topic on Twitter indicates its user attention. The third factor, the interaction between social media users on a topic, is referred to as user interaction; it indicates the strength of a topic in social media. Hence, these three factors form the basis of ranking news topics and thus improve the quality and variety of ranked news.

In Chapter 6, "Bioinformatics: An Important Tool in Oncology," Gaganpreet Kaur, Saurabh Gupta, Gagandeep Kaur, Manju Verma and Pawandeep Kaur provide comprehensive details of the beginning, development and future perspectives of bioinformatics in the field of oncology.

In Chapter 7, "Biomedical Big Data Analytics Using IoT in Health Informatics," Pawan Singh Gangwar and Yasha Hasija present a review of healthcare big data analytics and the biomedical IoT. Wearable devices play a major role in various environmental conditions, such as daily continuous health monitoring of people, weather forecasting and traffic management on roads. Such mobile apps and devices are presently used progressively and are interconnected with telehealth and telemedicine through the healthcare IoT. Enormous quantities of data are consistently generated by such devices and are stored on cloud platforms. Such large amounts of biomedical data are periodically gathered by intelligent sensors and transmitted for remote medical diagnostics.

In Chapter 8, "Statistical Image Analysis of Drying Bovine Serum Albumin Droplets in Phosphate Buffered Saline," Anusuya Pal, Amalesh Gope and Germano S. Iannacchione have an important discussion about how statistical image data are monitored and analyzed. It is revealed that image processing techniques can be used to understand and quantify the textural features that emerge during the drying process. The image processing methodology adopted in this chapter is certainly useful in quantifying the textural changes of the patterns at different saline concentrations that dictate the ubiquitous stages of the drying process.
In Chapter 9, "Introduction to Deep Learning in Health Informatics," Monika Jyotiyana and Nishtha Kesswani discuss deep learning applications in biomedical data. Because of the vital role played by biomedical data, this is an emergent field in the health sector. These days, health industries focus on the correct and on-time treatment provided to the subject for their betterment while avoiding any negative aspects. The huge amount of data brings enormous opportunities as well as challenges. Deep learning and AI techniques provide a sustainable environment and enhancement over machine learning and other state-of-the-art theories.

In Chapter 10, "Data Mining Techniques and Algorithms in Psychiatric Health: A Systematic Review," Shikha Gupta, Nitish Mehndiratta, Swarnim Sinha, Sangana Chaturvedi and Mehak Singla review the latest literature belonging to the intercessions for data mining in mental health, covering many techniques and algorithms linked with data mining in the most prevalent diseases such as Alzheimer's, dementia, depression, schizophrenia and bipolar disorder. Some of the academic databases used for this literature review are Google Scholar, IEEE Xplore and ResearchGate, which include a handful of e-journals for study and research-based materials.

In Chapter 11, "Deep Learning Applications in Medical Image Analysis," Ananya Singha, Rini Smita Thakur and Tushar Patel present detailed information about deep learning and its recent advancements in aiding medical image analysis. Also discussed are the variations that have evolved across different deep learning techniques in response to challenges in specific fields; emphasis is placed on one extensively used tool, the convolutional neural network (CNN), in medical image analysis.

In Chapter 12, "Role of Medical Image Analysis in Oncology," Gaganpreet Kaur, Hardik Garg, Kumari Heena, Lakhvir Singh, Navroz Kaur, Shubham Kumar and Shadab Alam give deep insight into the cancer studies used traditionally and the use of modern practices in medical image analysis. Cancer is a disease caused by the uncontrolled division of cells other than normal body cells in any part of the body. It is one of the most dreadful diseases affecting the whole world, and the number of people suffering from this fatal disease is increasing day by day.

In Chapter 13, "A Comparative Analysis of Classifiers Using Particle Swarm Optimization-Based Feature Selection," Chandra Sekhar Biswal, Subhendu Kumar Pani and Sujata Dash analyze the performance of classifiers using particle swarm optimization-based feature selection. Medical science researchers can collect several patients' data and build an effective model by feature selection methods for better prediction of disease cure rate. In other words, the data acts just as an input into some kind of
competitive decision-making mechanism that might place the company ahead of its rivals.
Concluding Remarks

The chapters of this book were written by eminent professors, researchers and those involved in the industry from different countries. The chapters were initially peer reviewed by the editorial board members, reviewers, and those in the industry, who themselves span many countries. The chapters are arranged so that they all have the basic introductory topics and advancements as well as future research directions, which enable budding researchers and engineers to pursue their work in this area.

Biomedical data mining for information retrieval is so diversified that it cannot be covered in a single book. However, with the encouraging research contributed by the researchers in this book, we (contributors), editorial board members, and reviewers tried to sum up the latest research domains, developments in the data analytics field, and applicable areas.

First and foremost, we express our heartfelt appreciation to all the authors. We thank them all for considering and trusting this edited book as the platform for publishing their valuable work. We also thank all the authors for their kind co-operation extended during the various stages of processing of the manuscript. This edited book will serve as a motivating factor for researchers who have spent years working as crime analysts, data analysts, and statisticians, as well as for budding researchers.

Dr. Sujata Dash
Department of Computer Science and Application
North Orissa University, Baripada, Mayurbhanj, India

Dr. Subhendu Kumar Pani
Principal
Krupajal Computer Academy, BPUT, Odisha, India

Dr. S. Balamurugan
Director of Research and Development
Intelligent Research Consultancy Service (iRCS), Coimbatore, Tamil Nadu, India

Dr. Ajith Abraham
Director
MIR Labs, USA

May 2021
1
Mortality Prediction of ICU Patients Using Machine Learning Techniques

Babita Majhi1*, Aarti Kashyap1 and Ritanjali Majhi2

1Dept. of CSIT, Guru Ghasidas Vishwavidyalaya, Central University, Bilaspur, India
2School of Management, National Institute of Technology Karnataka, Surathkal, India
Abstract
The intensive care unit (ICU) admits highly ill patients and provides them with serious attention and treatment using ventilators and other sophisticated medical equipment. This equipment is very costly, hence its optimized use is necessary. ICUs require a higher number of staff in comparison to the number of patients admitted, for regular monitoring of the patients. In brief, ICUs involve a larger budget in comparison to other sections of any hospital. Therefore, to help doctors find out which patients are more at risk, mortality prediction is an important area of research. In data mining, mortality prediction is a binary classification problem, i.e. die or survive. As a result, it has attracted the machine learning community to apply its algorithms to mortality prediction. In this chapter six different machine learning methods, Functional Link Artificial Neural Network (FLANN), Support Vector Machine (SVM), Discriminant Analysis (DA), Decision Tree (DT), Naïve Bayesian Network and K-Nearest Neighbors (KNN), are used to develop models for mortality prediction using data collected from the PhysioNet Challenge 2012, and their performance is analyzed. There are three separate datasets in the PhysioNet Challenge 2012, each with 4,000 records; this chapter uses dataset A, containing 4,000 records of different patients. The simulation study reveals that the decision tree based model outperforms the other five models with an accuracy of 97.95% during testing. It is followed by the FA-FLANN model in second rank, with an accuracy of 87.60%.

Keywords: Mortality prediction, ICU patients, PhysioNet 2012 data, machine learning techniques

*Corresponding author: [email protected]
1.1 Introduction

Healthcare is the maintenance or improvement of wellbeing by means of the prevention, diagnosis, treatment, recovery or cure of sickness, disease, injury and other physical and mental impairments in individuals [1]. Hospitals are subject to various pressures, including limited funds and healthcare resources. Mortality prediction for ICU patients is broadly critical: the quicker and more precise the decisions taken by intensivists, the greater the benefit for both patients and healthcare resources. An ICU is for patients with the most serious illnesses or injuries. Most of these patients need support from equipment such as mechanical ventilators to maintain normal body functions, and must be constantly and closely monitored. For many years, the number of ICUs has experienced a worldwide increase [2]. During the ICU stay, diverse physiological parameters are measured and examined every day. Those parameters are used in scoring systems to quantify the severity of the patients' condition. ICUs are responsible for an increasing share of the healthcare budget, and consequently are a significant target in the effort to contain healthcare costs [3]. Consequently, there is an increasing need, given resource availability constraints, to ensure that additional intensive care resources are allocated to those who are likely to benefit most from them. Critical decisions include withholding life-support treatments and issuing do-not-resuscitate orders when intensive care is deemed futile. In this setting, mortality assessment is an essential task, used not only to predict the final clinical outcome but also to evaluate ICU effectiveness and to allocate resources. Over the course of recent decades, several severity scoring systems and machine learning mortality prediction models have been developed [4]. Traditional scoring techniques such as the Acute Physiology and Chronic Health Evaluation (APACHE) [4], Simplified Acute Physiology Score (SAPS) [4], Sequential Organ Failure Assessment (SOFA) [4] and Mortality Probability Model (MPM) [4], and data mining techniques like the Artificial Neural Network (ANN) [5], Support Vector Machine (SVM) [5], Decision Tree (DT) [5] and Logistic Regression (LR) [5], have been used in previous research. Mortality prediction is still an open challenge in the Intensive Care Unit. The objective of this chapter is to develop models to predict whether a patient in an ICU will survive in hospital or not, using different methods such as Discriminant Analysis (DA), Decision Tree (DT), K-Nearest Neighbor (KNN), Naive Bayesian, Support Vector Machine (SVM) and the Functional
Link Artificial Neural Network (FLANN), a low complexity neural network, and to compare them. The dataset has been collected from the PhysioNet Challenge 2012 [6], which consists of 4,000 records of patients admitted to the ICU. There are 41 variables recorded during the first 48 h after admission of patients to the ICU, of which 5 variables are general descriptors (age, gender, height, ICU type and initial weight) and 36 are time series variables, from which 15 variables (Temp, HR, Urine, pH, RespRate, GCS, FiO2, PaCO2, MAP, SysABP, DiasABP, NIMAP, NIDiasABP, MechVent, NISysABP) are taken as input. Five outcome descriptors are available: SAPS-I score, SOFA score, length of stay in days (LOS), length of survival, and in-hospital death (0 for survival and 1 for death in hospital), the last of which is used to predict the survival of patients. The rest of the chapter is organized as follows: Section 1.2 describes previous studies of mortality prediction. Materials and methods are presented in Section 1.3, where the data collection, data pre-processing and model description are properly described. Section 1.4 presents the obtained results. Section 1.5 briefly discusses the work with a conclusion, and finally Section 1.6 gives the future work.
1.2 Review of Literature

Many researchers have applied different models to the PhysioNet Challenge 2012 dataset and obtained different accuracy results. Silva et al. [7] developed a method for the prediction of in-hospital death (0 denoting a survivor and 1 denoting death in hospital). They collected the data from the PhysioNet website and performed the challenges. The dataset consists of three sets, A, B and C, each with 4,000 records. The challenge comprises two events: event I measures the performance of a binary classifier and event II that of a risk estimator. For event I the scoring criteria are sensitivity and positive predictive value, and for event II the Hosmer–Lemeshow statistic [8] is used. A baseline algorithm (SAPS-I) obtained scores of 0.3125 and 68.58 for events I and II respectively, while the final scores obtained for events I and II were 0.5353 and 17.58. In Ref. [9] Johnson et al. describe a novel Bayesian ensemble algorithm for mortality prediction. Artifacts and erroneous recordings are removed during data pre-processing. The model is trained using the 4,000 records of training set A and also with the two datasets B and C. A jack-knifing method is performed to estimate the performance of the model. The model obtained values of 0.5310 and 0.5353 as score 1 on the hidden datasets, and the Hosmer–Lemeshow statistic gave 26.44 and 29.86 as score 2. The model was
re-developed and obtained 0.5374 and 18.20 for scores 1 and 2 on dataset C. Overall, the proposed model performs better than the traditional SAPS model and has some additional advantages, such as the handling of missing data. An improved version of a model to estimate in-hospital mortality in the ICU using 37 time series variables is presented in Ref. [10]. The authors estimated the performance of various models using 10-fold cross validation. In clinical data it is common to have missing values; these are imputed using the mean value for the patient's age and gender. A logistic regression model is used and trained on the dataset. The performance of the model is evaluated by two events: event 1 for accuracy, using minimum sensitivity and positive predictive value, and event 2 for calibration, using the Hosmer–Lemeshow H statistic. Their model scored 0.516 and 14.4 for events 1 and 2 on test set B, and 0.482 and 51.7 on test set C; its performance is better than that of the existing SAPS model. Another work, Ref. [11], developed an algorithm to predict the in-hospital death of ICU patients for event 1 and probability estimation for event 2. Here the missing values are imputed by zero and the data are normalized. Six support vector machine (SVM) classifiers are used for training; for each SVM, the positive examples and one sixth of the negative examples are taken as the training set. The obtained scores for events 1 and 2 are 0.5345 and 17.88 respectively. An artificial neural network model was developed for the prediction of in-hospital death of ICU patients from observations within 48 h of admission [12]. Missing values are handled using an artificial value based on an assumption. From the full feature set, 26 features are selected for further processing. For classification, a two-layered neural network with 15 neurons in the hidden layer is used. The model uses 100 voting classifiers, and the output produced is the average of the 100 outputs. The model is trained and tested using 5-fold cross validation, and a fuzzy threshold is used to determine the output of the neural network. The model achieved a 0.5088 score for event 1 and 82.211 for event 2 on the test dataset. Ref. [13] presented an approach that identifies time series motifs to predict in-hospital mortality of ICU patients, segmenting the variables into low, medium and high measurements. The method outperformed the existing scoring systems SAPS-II, APACHE-II and SOFA, obtaining a 0.46 score for event 1 and 56.45 for event 2. An improved mortality prediction using logistic regression and a Hidden Markov model was developed for in-hospital death in Ref. [14]. The model is trained using the 4,000 patient records of set A and validated on other sets of unseen data of 4,000 records each. Two different events are used: event 1 for minimum sensitivity and positive predictive value, and event 2 for the Hosmer–Lemeshow H statistic.
The model gave 0.50 and 0.50 for event 1 and 15.18 and 78.9 for event 2, compared to SAPS-I, whose event 1 scores are 0.3170 and 0.312 and whose event 2 scores are 66.03 and 68.58 respectively. An effective framework for predicting in-hospital mortality during the ICU stay was suggested in Ref. [15]. Feature extraction is done by data interpolation and histogram analysis. To reduce the complexity of feature extraction, the feature vector is reduced by evaluating the measurement value of each variable. Finally, a cascaded AdaBoost learning model is applied as the mortality classifier, obtaining a 0.806 score for event 1 and 24.00 for event 2 on dataset A. On dataset B the model obtained 0.379 and 5331.15 for events 1 and 2. A decision support application for mortality risk prediction is reported in Ref. [16]. For the clinical rules the authors used fuzzy rule based systems, with a genetic algorithm optimizer generating the coefficients of the final solutions. Their FIS model achieves a 0.39 score for event 1 and 94 for event 2. To predict mortality in the ICU, a new method is proposed in Ref. [17]. The method, Simple Correspondence Analysis (SCA), is based on both clinical and laboratory data together with the two previous models APACHE-II and SAPS-II. It uses the data of the PhysioNet Challenge 2012, a total of 12,000 records across sets A, B and C, with 37 time series variables recorded. SCA is applied to select variables and combines them with the traditional APACHE and SAPS methods to predict whether the patient will survive or not. The model obtained score 1 values of 43.50% for set A, 42.25% for set B and 42.73% for set C. The Naive Bayesian classifier is used in Ref. [18] to predict mortality in the ICU, aiming for a high S1 and a small S2, where S1 is defined by sensitivity and positive predictive value and S2 by the Hosmer–Lemeshow H statistic. Missing values are replaced by NaN (Not-a-Number) if a variable is not measured. The model achieves 0.475 for S1, the eighth best solution, and 12.820 for S2, the first best solution, on set B. On set C, the model achieved a 0.4928 score for event 1 (fourth best solution) and 0.247 for event 2 (third best solution). Di Marco et al. [19] proposed a new algorithm for mortality prediction with better accuracy using data collected from the first 48 h of admission to the ICU. A binary classifier model is applied to obtain the result for event 1. Set A is selected, which contains 41 variables for 4,000 patients. For feature selection, forward sequential selection with a logistic cost function is used. For classification, a logistic regression model is used, which obtained a 54.9% score on set A and 44.0% on test set B. To predict the mortality rate, Ref. [20] developed a model based on the Support Vector Machine, a machine learning algorithm which tries to minimize error and find the best hyperplane of maximum margin. The two classes represent 0 as survivor and 1 as
died in-hospital. For training they read 3,000 records and for testing 1,000 records. They observed over-fitting of the SVM on set A, obtaining a 0.8158 score for event 1 and 0.3045 for event 2. In phase 2 they set out to improve the training strategies of the SVM and reduced its over-fitting. The final score obtained for event 1 is 0.530, with 0.350 for set B and 0.333 for set C. An algorithm based on an artificial neural network was employed to predict patients' in-hospital mortality in Ref. [21]. Features are extracted from the PhysioNet data, and a method originally used to detect solar 'nanoflares' is applied, owing to the similarity between solar and clinical time series data. Data pre-processing is done to remove outliers, and missing values are replaced by the mean value for each patient. The trained model yields a 22.83 score for event 2 on set B and 38.23 on set C. A logistic regression model is suggested in Ref. [22] for the same purpose. It follows three phases. Phase 1 selects derived variables on set A, calculating each variable's first value, average, minimum value, maximum value, total time, first difference and last value. Phase 2 applies a logistic regression model to predict patients' in-hospital death (0 for survivor, 1 for died) on set A. The third phase applies the logistic regression model to obtain the event 1 and event 2 scores; the results obtained are 0.4116 for score 1 and 8.843 for score 2. The paper [23] also reports a logistic regression model for the prediction of mortality. The experiment uses the 4,000 ICU patients of set A for training and the 4,000 patients of set B for testing. During the filtering process, 30 variables are identified for building the model. The results obtained are a 0.451 score for event 1 and 45.010 for event 2. A novel cluster analysis technique is used in Ref. [24] to test the similarities between time series data for mortality prediction. For data pre-processing it uses a segmentation based approach to divide variables into several segments; the maximal and minimal values are used to maintain their statistical features. Weighted Euclidean distance based clustering and rule based classification are used. The average result obtained for death prediction is 22.77 to 33.08% and for survival prediction 75 to 86%. In Ref. [25], the main goal is to improve the mortality prediction of ICU patients using the PhysioNet Challenge 2012 dataset. Three objectives are accomplished: (i) reduction of dimensions, (ii) reduction of uncontrolled variance and (iii) less dependency on the training set. Feature reduction techniques such as Principal Component Analysis, Spectral Clustering, Factor Analysis and Tukey's HSD test are used. Classification is done using an SVM, which achieved a better accuracy, 0.73, than the previous work. The authors of Ref. [26] extracted 61,533 records from MIMIC-III v1.4, excluding patients whose age is less than 16, patients who stayed less than 4 h and patients whose data are not present in the flow
sheet. Finally, 50,488 ICU stays form the cohort used for the experiments. Features are extracted using a window of fixed length. The machine learning models used are Logistic Regression (LR), LR with an L1 regularization penalty using the Least Absolute Shrinkage and Selection Operator (LASSO), LR with an L2 regularization penalty, and Gradient Boosting Decision Trees. Severity of illness is calculated using different scores such as APS III, SOFA, SAPS, LODS, SAPS II and OASIS. Two types of experiments are conducted, i.e. a benchmarking experiment and a real-time experiment. Among the compared models, the Gradient Boosting algorithm obtained the highest AUROC, 0.920. Prediction of hospital mortality through time series analysis of intensive care unit patients at an early stage, during admission, using different data mining techniques is carried out in Ref. [27]. Traditional scoring systems such as APACHE, SAPS and SOFA are used to obtain scores. 4,000 ICU patients are selected from the MIMIC database and 37 time series variables are selected from the first 48 h of admission. The Synthetic Minority Oversampling Technique (SMOTE) (original and smote) is used to modify the datasets: missing data are handled by replacing them with the mean (rep1), then SMOTE (rep1 and smote) is applied; after replacing missing data, the EM-Imputation (rep2) algorithm is applied. Finally, results are obtained using different classifiers: Random Forest (RF), Partial Decision Tree (PART) and Bayesian Network (BN). Among these three classifiers, Random Forest obtained the best results, with an AUROC of 0.83 ± 0.03 at 48 h on rep1, an AUROC of 0.82 ± 0.03 on original, rep1 and smote at 40 h, and an AUROC of 0.82 ± 0.03 on rep2 and smote at 48 h. Sepsis is one of the causes of high mortality rates and must be treated quickly, because sepsis [28] increases the risk of death even after discharge from hospital. The objective of that paper is to develop a model for one-year mortality prediction. 5,650 admitted patients with sepsis were selected from the MIMIC-III database and divided into 70% of patients for training and 30% for testing. A Stochastic Gradient Boosting method is used to develop the one-year mortality prediction model. Variables are selected using the Least Absolute Shrinkage and Selection Operator (LASSO) and the AUROC is calculated; an AUROC of 0.8039 (95% confidence interval [0.8033–0.8045]) is obtained on the testing set. Finally, it is observed that the Stochastic Gradient Boosting ensemble algorithm is more accurate for one-year mortality prediction than the traditional scoring systems SAPS, OASIS, MPM and SOFA. Deep learning has been successfully applied to various large and complex datasets. It is one of the newer techniques and has outperformed traditional techniques. A multi-scale deep convolutional neural network (ConvNet) model for mortality prediction is proposed in Ref. [29]. The dataset is
taken from the MIMIC-III database, and 22 different variables are extracted from the first 48 h of measurements for each patient. A ConvNet is a multilayer neural network in which a discrete convolution operation is applied. The Convolutional Neural Network models were developed using Python packages, i.e. Keras with a TensorFlow backend. The proposed model gives a better result, a ROC AUC of 0.8735 ± 0.0025, which matches the state of the art for deep learning models.
1.3 Materials and Methods

1.3.1 Dataset

The dataset is collected from the PhysioNet Challenge 2012, which consists of three sets A, B and C [6], for a total of 12,000 patient records. Each set consists of 4,000 records of patients, of which only set A (4,000 records) is used in this chapter for simulation. There are 41 variables recorded in the dataset: five of these (age, gender, height, ICU type and initial weight) are general descriptors and 36 are time series variables, as described in Table 1.1. From the 36 time series variables, only 15 are selected for mortality prediction; these are listed in Table 1.2. From these 15 variables, the first value, last value, highest value, lowest value and median value are calculated for nine variables and taken as features; only the first and last values are taken for four variables. For dataset A, five outcome-related descriptors (SAPS score, SOFA score, length of stay, length of survival and in-hospital death) are available, from which in-hospital death (0 representing a survivor and 1 representing death in hospital) is taken as the target value.
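As a concrete illustration of this feature construction, the sketch below computes the per-variable summary statistics for one patient record. The long-format layout and column names are assumptions for illustration only, not the actual PhysioNet file format.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format record for one patient: one row per measurement,
# roughly mirroring the (time, variable, value) structure of the challenge files.
record = pd.DataFrame({
    "variable": ["HR", "HR", "HR", "GCS", "GCS"],
    "value":    [85.0, 92.0, 78.0, 14.0, 15.0],
})

def summary_features(values):
    """First, last, highest, lowest and median value of one time series."""
    v = np.asarray(values, dtype=float)
    return [v[0], v[-1], v.max(), v.min(), np.median(v)]

features = {
    name: summary_features(group["value"].to_numpy())
    for name, group in record.groupby("variable", sort=False)
}
print(features)  # e.g. {'HR': [85.0, 78.0, 92.0, 78.0, 85.0], 'GCS': [...]}
```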
1.3.2 Data Pre-Processing

Data pre-processing is one of the techniques used to filter and remove noisy data. Forty-one variables are given in the dataset; among them, 15 variables are selected, some of which were not carefully collected and have missing values. In this chapter, missing data are replaced by zeros.
1.3.3 Normalization

All the variables in the dataset have different ranges and scales, so the raw values of the data cannot be used directly for classification. If all the variables have values in comparable ranges and scales, classifiers will work in a better way. A standard approach, the z-score normalization method, is used to normalize the variables.
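A minimal sketch of these two pre-processing steps, replacing missing values with zeros and then applying z-score normalization, assuming the extracted features have already been assembled into a patients-by-features matrix:

```python
import numpy as np

# Feature matrix: rows are patients, columns are extracted features;
# NaN marks a measurement that was never recorded.
X = np.array([[36.6, np.nan, 80.0],
              [38.2, 120.0, np.nan],
              [37.1, 110.0, 95.0]])

X = np.nan_to_num(X, nan=0.0)      # missing values replaced by zeros

mu = X.mean(axis=0)                # per-feature mean
sigma = X.std(axis=0)              # per-feature standard deviation
sigma[sigma == 0] = 1.0            # guard against constant columns
X_norm = (X - mu) / sigma          # z-score: zero mean, unit variance
```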
Table 1.1 Time series variables with description and physical units recorded in the ICU [6].

S. no. | Variable | Description | Physical units
1 | Albumin | Albumin | g/dL
2 | ALP | Alkaline phosphatase | IU/L
3 | ALT | Alanine transaminase | IU/L
4 | AST | Aspartate transaminase | IU/L
5 | Bilirubin | Bilirubin | mg/dL
6 | BUN | Blood urea nitrogen | mg/dL
7 | Cholesterol | Cholesterol | mg/dL
8 | Creatinine | Creatinine | mg/dL
9 | DiasABP | Invasive diastolic arterial blood pressure | mmHg
10 | FiO2 | Fractional inspired oxygen | [0–1]
11 | GCS | Glasgow Coma Score | [3–15]
12 | Glucose | Serum glucose | mg/dL
13 | HCO3 | Serum bicarbonate | mmol/L
14 | HCT | Hematocrit | %
15 | HR | Heart rate | bpm
16 | K | Serum potassium | mEq/L
17 | Lactate | Lactate | mmol/L
18 | Mg | Serum magnesium | mmol/L
19 | MAP | Invasive mean arterial blood pressure | mmHg
20 | MechVent | Mechanical respiration ventilation | 0/1 (true/false)
21 | Na | Serum sodium | mEq/L
22 | NIDiasABP | Non-invasive diastolic arterial blood pressure | mmHg
23 | NIMAP | Non-invasive mean arterial blood pressure | mmHg
24 | NISysABP | Non-invasive systolic arterial blood pressure | mmHg
25 | PaCO2 | Partial pressure of arterial carbon dioxide | mmHg
26 | PaO2 | Partial pressure of arterial oxygen | mmHg
27 | pH | Arterial pH | [0–14]
28 | Platelets | Platelets | cells/nL
29 | RespRate | Respiration rate | bpm
30 | SaO2 | O2 saturation in hemoglobin | %
31 | SysABP | Invasive systolic arterial blood pressure | mmHg
32 | Temp | Temperature | °C
33 | TropI | Troponin-I | µg/L
34 | TropT | Troponin-T | µg/L
35 | Urine | Urine output | mL
36 | WBC | White blood cells count | cells/nL
1.3.4 Mortality Prediction

After data pre-processing, normalization, feature extraction and feature reduction, different models are employed to predict the patient's mortality during the in-hospital stay, and their accuracy is calculated. The models predict whether the patient will survive or die. This is determined using classification techniques, since mortality prediction is a binary classification problem. The process is carried out step by step as shown in Figure 1.1.
Table 1.2 Time series variables with physical units [30].

S. no. | Variable | Physical units
1 | Temperature | Celsius
2 | Heart Rate | bpm
3 | Urine Output | mL
4 | pH | [0–14]
5 | Respiration Rate | bpm
6 | GCS (Glasgow Coma Index) | [3–15]
7 | FiO2 (Fractional Inspired Oxygen) | [0–1]
8 | PaCO2 (Partial Pressure of Carbon Dioxide) | mmHg
9 | MAP (Invasive mean arterial blood pressure) | mmHg
10 | SysABP (Invasive systolic arterial blood pressure) | mmHg
11 | DiasABP (Invasive diastolic arterial blood pressure) | mmHg
12 | NIMAP (Non-invasive mean arterial blood pressure) | mmHg
13 | NIDiasABP (Non-invasive diastolic arterial blood pressure) | mmHg
14 | Mechanical ventilation respiration | [yes/no]
15 | NISysABP (Non-invasive systolic arterial blood pressure) | mmHg
1.3.5 Model Description and Development

Different models are developed in this chapter to estimate mortality prediction performance, and a comparison between them is also made. Models based on FLANN, Discriminant Analysis, Decision Tree, KNN, Naive Bayesian and Support Vector Machine are applied to develop different classifiers. Out of the 4,000 records of dataset A, 3,000 records are taken as the training set and the remaining 1,000 records are used for validation or testing of the models. First of all, Factor Analysis (FA) is applied to the selected variables to reduce the features. Factor analysis is one of the feature reduction techniques, used to reduce high-dimensional features to a lower dimension [31].
[Figure 1.1 Step-by-step process for mortality prediction: dataset A → data pre-processing (replacing missing values) → time series data → normalization using z-score → feature reduction → FLANN with trigonometric expansion, alongside DA, DT, KNN, Naïve Bayesian and SVM → results in terms of accuracy.]
The 58 features of the dataset are reduced to 49 using FA. The steps of factor analysis are as follows (a minimal numerical sketch follows the list):

1. Normalize the data matrix $Y$ using the z-score method.
2. Calculate the correlation matrix $R$:

   $R = \frac{Y \times Y'}{n - 1}$   (1.1)

3. Calculate the eigenvectors $U$ and eigenvalues $\lambda$ of $R$:

   $R \times U = \lambda \times U$   (1.2)

4. Rearrange the eigenvectors and eigenvalues in descending order of eigenvalue.
5. Calculate the factor loading matrix $A$:

   $A = U \times \sqrt{\lambda}$   (1.3)

6. Calculate the score matrix $B$:

   $B = R^{-1} \times A$   (1.4)

7. Calculate the factor scores $F$:

   $F = B \times Y$   (1.5)
After reducing the features, the FLANN model [32] is used to predict the patient's survival or in-hospital death, and the overall performance is evaluated. The FLANN-based mortality prediction model is shown in Figure 1.2. To design the FLANN model, the 4,000 patient records of dataset A are used: 3,000 records are selected for training and the remaining 1,000 for testing the model. During the training process, each of the 3,000 records, with its 49 features, is taken as input. Each feature is then expanded trigonometrically into five terms, mapping the data to a nonlinear format. The outputs of the functional expansion are multiplied by the corresponding weight values and summed together to generate an output known as the actual output. The actual output is then compared with the desired output, either 0.1 (for class 0) or 0.9 (for class 1). Any difference between the actual and desired output generates an error signal, on the basis of which the weights and biases are updated using the Least Mean Square (LMS) [33] algorithm.
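A simplified sketch of this expansion and update rule is given below. The exact set of trigonometric basis functions, the output nonlinearity and the initialization are assumptions, since the chapter does not fully specify them:

```python
import numpy as np

def trig_expand(x):
    # Five terms per feature: the feature itself plus two sine/cosine pairs.
    # The exact basis used by the authors is not specified; this is an assumption.
    return np.concatenate([x,
                           np.sin(np.pi * x), np.cos(np.pi * x),
                           np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])

def train_flann(X, d, lr=0.1, epochs=100):
    """Simplified LMS training of a single-output FLANN.

    X: (n_records, n_features) normalized inputs; d: targets (0.1 or 0.9).
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=5 * X.shape[1] + 1)   # weights plus bias
    for _ in range(epochs):
        for x, target in zip(X, d):
            phi = np.append(trig_expand(x), 1.0)   # expanded input + bias term
            y = 1.0 / (1.0 + np.exp(-w @ phi))     # actual output (logistic assumed)
            e = target - y                          # error signal
            w += lr * e * phi                       # simplified LMS weight update
    return w

X = np.random.default_rng(1).normal(size=(20, 49))   # toy stand-in data
d = np.where(np.random.default_rng(2).random(20) < 0.5, 0.1, 0.9)
w = train_flann(X, d)
```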
[Figure 1.2 The FLANN-based mortality prediction model: the input pattern Xk = (x1(k), …, xn(k)) is expanded trigonometrically into ϕ1(Xk), …, ϕn(Xk), multiplied by the weights w1(k), …, wn(k) and summed to give s(k); the output y(k) is compared with the desired output d(k), and the error e(k) drives the Least Mean Square weight update.]
This process is repeated until all training patterns have been used. The experiment is continued for 3,000 iterations, with the learning parameter set to 0.1. The mean square error (MSE) value for each iteration is stored and plotted to show the convergence characteristics, as given in Figure 1.3. Once training is over and the model is ready for prediction, the 1,000 records kept aside for testing are given to the model, with the weights and biases fixed at the values obtained at the end of the training process. For each input pattern the output class label is calculated and compared with the target class label. Similarly, the other models, Discriminant Analysis (DA), Decision Tree (DT), K-Nearest Neighbor (KNN), Naive Bayesian and Support Vector Machine (SVM), are applied to predict in-hospital mortality, each following its own principle as briefed below.

Discriminant analysis [34] is a statistical tool used to classify individuals into a number of groups. To separate two groups, Discriminant Function Analysis (DFA) is used, and to separate more than two groups, Canonical Variates Analysis (CVA) is used. A discriminant analysis has two potential goals: finding a predictive equation for classifying new individuals, or interpreting the predictive equation to better understand the relationships that may exist among the variables.

Decision Tree [35] is a tree-like structure used for classification and regression. It is a supervised machine learning algorithm used in decision making. The objective of using a DT is to create a training model
[Figure 1.3 Convergence characteristics of the FA-FLANN based mortality prediction model: mean squared error (log scale, 10⁻² to 10²) versus iterations (0 to 3,000).]
that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (the training information). In a DT, to predict a class label for a record one starts from the root of the tree, compares the value of the root attribute with the record's attribute and, based on the comparison, follows the corresponding branch and jumps to the next node.

KNN [35] is also a supervised machine learning algorithm used for both classification and regression. It is a simple and easy-to-implement algorithm. KNN finds the nearest neighbors by calculating the distance between data points, typically the Euclidean distance.

A Naive Bayes classifier [35] is a probabilistic model used for classification tasks. The Bayes formula is given as

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$   (1.6)

Using Bayes' theorem, it finds the probability of the hypothesis A, given that the evidence B has occurred. The assumption made here is that the predictors/features are independent, i.e., the presence of one particular feature does not affect another; hence the name Naïve.

Support Vector Machine [35] is a supervised machine learning algorithm which aims to find a separating hyperplane in N-dimensional space; the hyperplane with the maximum margin is chosen. Support vectors are the data points closest to the hyperplane; they influence its position and orientation, and deleting them would change the position of the hyperplane. Using these support vectors, the margin of the classifier is maximized.
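For illustration, the five non-FLANN classifiers can be trained and compared with scikit-learn as sketched below. The data here are synthetic stand-ins for the reduced factor scores, and the default hyperparameters are assumptions, as the chapter does not report the settings used:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Stand-ins for the reduced factor scores: 3,000 training and 1,000 test
# records with 49 features each, and binary survival labels (synthetic here).
X_train, X_test = rng.normal(size=(3000, 49)), rng.normal(size=(1000, 49))
y_train, y_test = rng.integers(0, 2, 3000), rng.integers(0, 2, 1000)

models = {
    "DA": LinearDiscriminantAnalysis(),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayesian": GaussianNB(),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.4f}")
```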
1.4 Result and Discussion

The results of all the models on the testing set containing 1,000 records are shown in Table 1.3. As shown in the table, DT outperformed the other five models with an accuracy of 97.95%. The FA-FLANN model secured the 2nd rank with an accuracy of 87.60%. The DA, KNN and SVM models give almost the same results, with accuracies of 86.05%, 86.60% and 86.15% respectively. The worst result is reported for the Naïve Bayesian based model, with an accuracy of 54.80%.
Table 1.3 Comparison of different models during testing.

| S. no. | Model name | Error during testing (value) | Error during testing (%) | Accuracy | Rank |
|--------|------------|------------------------------|--------------------------|----------|------|
| 1. | FA-FLANN | 0.1240 | 12.40% | 87.60% | 2 |
| 2. | DA | 0.1395 | 13.95% | 86.05% | 5 |
| 3. | DT | 0.0205 | 2.05% | 97.95% | 1 |
| 4. | KNN | 0.1340 | 13.40% | 86.60% | 4 |
| 5. | Naive Bayesian | 0.4520 | 45.20% | 54.80% | 6 |
| 6. | SVM | 0.1385 | 13.85% | 86.15% | 3 |
1.5 Conclusion

In this chapter, different algorithms are presented to predict in-hospital mortality based on information collected during the first 48 h of ICU observation. The data are taken from the PhysioNet Challenge 2012 and used to predict in-hospital death. 4,000 patient records were selected from set A, of which 3,000 were used for training and the remaining 1,000 kept for testing. 15 time series variables were selected out of 41 features for model development. Missing values were handled by imputing zeros. Six different models were developed for mortality prediction and compared. The comparison shows that the decision tree obtained the best accuracy among the six models used in this simulation study.
1.6 Future Work

Many authors have taken up the PhysioNet Challenge 2012, published papers and reported improved accuracy results, yet predicting a patient's in-hospital mortality remains a challenging task. Research is ongoing to develop further models, other methods of handling missing data and new strategies for mortality prediction. The performance of other algorithms, such as extreme learning machines, convolutional neural networks and deep learning, can also be explored for this purpose in future.
References

1. https://en.wikipedia.org/wiki/Health_care
2. Hanson, C. and Marshall, B., Artificial intelligence applications in the intensive care unit. Crit. Care Med., 29, 2, 1–9, 2001.
3. Halpern, N.A. and Pastores, S.M., Critical care medicine in the United States 2000–2005: An analysis of bed numbers, occupancy rates, payer mix and costs. Crit. Care Med., 38, 1, 65–71, 2010.
4. Awad, A., Bader-El-Den, M., McNicholas, J., Briggs, J., Early Hospital Mortality Prediction of Intensive Care Unit Patients Using an Ensemble Learning Approach. Int. J. Med. Inf., vol. 108, pp. 185–195, 2017.
5. Kim, S., Kim, W., Woong Park, R., A Comparison of Intensive Care Unit Mortality Prediction Models through the Use of Data Mining Techniques. The Korean Society of Medical Informatics. Healthc. Inform. Res., 17, 4, 232–243, December 2011.
6. https://physionet.org/content/challenge-2012/1.0.0
7. Silva, I., Moody, G., Scott, D.J., Celi, L.A., Mark, R.G., Predicting In-Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 245–248, 2012.
8. Hosmer, D.W. and Lemeshow, S., Applied Logistic Regression, 2nd Edition, Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., 2000.
9. Johnson, D., Nic, M., Louis, T., Athanasios, K., Andrew, A., Clifford, G.D., Patient Specific Predictions in the Intensive Care Unit Using a Bayesian Ensemble. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 249–252, 2012.
10. Lee, C.H., Arzeno, N.M., Ho, J.C., Vikalo, H., Ghosh, J., An Imputation-Enhanced Algorithm for ICU Mortality Prediction. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 253–256, 2012.
11. Citi, L. and Barbieri, R., PhysioNet 2012 Challenge: Predicting Mortality of ICU Patients Using a Cascaded SVM-GLM Paradigm. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 257–260, 2012.
12. Xia, H., Daley, B.J., Petrie, A., Zhao, X., A Neural Network Model for Mortality Prediction in ICU. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 261–264, 2012.
13. McMillan, S., Chia, C.-C., Van Esbroeck, A., Runinfield, I., Syed, Z., ICU Mortality Prediction Using Time Series Motifs. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 265–268, 2012.
14. Vairavan, S., Eshelman, L., Haider, S., Flower, A., Seiver, A., Prediction of Mortality in an Intensive Care Unit Using Logistic Regression and a Hidden Markov Model. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 393–396, 2012.
15. Yi, C., Sun, Y., Tian, Y., CinC Challenge: Predicting In-Hospital Mortality in the Intensive Care Unit by Analyzing Histograms of Medical Variables under Cascaded AdaBoost Model. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 397–400, 2012.
16. Kranjnak, M., Xue, J., Kaiser, W., Balloni, W., Combining Machine Learning and Clinical Rules to Build an Algorithm for Predicting ICU Mortality Risk. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 401–404, 2012.
17. Severeyn, E., Altuve, M., Ng, F., Lollett, C., Wong, S., Towards the Prediction of Mortality in Intensive Care Units Patients: A Simple Correspondence Analysis Approach. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 469–472, 2012.
18. Macas, M., Kuzilek, J., Odstrcilik, T., Huptych, M., Linear Bayes Classification for Mortality Prediction. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 473–476, 2012.
19. Di Marco, L.Y., Bojarnejad, M., King, S.T., Duan, W., Di Maria, C., Zheng, D., Murray, A., Langley, P., Robust Prediction of Patient Mortality from 48 Hour Intensive Care Unit Data. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 477–480, 2012.
20. Bosnjak, A. and Montilla, G., Predicting Mortality of ICU Patients Using Statistics of Physiological Variables and Support Vector Machines. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 481–484, 2012.
21. Pollard, T.J., Harra, L., Williams, D., Harris, S., Martinez, D., Fong, K., PhysioNet Challenge: An Artificial Neural Network to Predict Mortality in ICU Patients and Application of Solar Physics Analysis Methods. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 485–488, 2012.
22. Hamilton, S.L. and Hamilton, J.R., Predicting In-Hospital-Death and Mortality Percentage Using Logistic Regression. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 489–492, 2012.
23. Bera, D. and Manjnath Nayak, M., Mortality Risk for ICU Patients Using Logistic Regression. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 493–496, 2012.
24. Xu, J., Li, D., Zhang, Y., Djulovic, A., Li, Y., Zeng, Y., CinC Challenge: Cluster Analysis of Multi-Granular Time-Series Data for Mortality Rate Prediction. Computing in Cardiology Conference (CinC), vol. 39, IEEE, pp. 497–500, 2012.
25. Monterio, F., Meloni, F., Baranauskas, J.A., Alaniz Macedo, A., Prediction of Mortality in Intensive Care Units: A Multivariate Feature Selection. J. Biomed. Inf., Elsevier, 107, 103456, pp. 1–11, 2020.
26. Johnson, A.E.W., Real-Time Mortality Prediction in the Intensive Care Unit. AMIA Annual Symposium Proceedings Archive, pp. 994–1003, 2018.
27. Awad, A., Bader-El-Den, M., McNicholas, J., Briggs, J., El-Sonbaty, Y., Predicting Hospital Mortality for Intensive Care Unit Patients: Time Series Analysis. Health Inf. J., vol. 26(2), pp. 1043–1059, 2019.
28. Garcia-Gallo, J.E., Fonseca-Ruiz, N.J., Celi, L.A., Duitama-Munoz, J.F., A Machine Learning-Based Model for 1-Year Mortality Prediction in Patients Admitted to an Intensive Care Unit with a Diagnosis of Sepsis. Med. Intensiva, Elsevier, 44, 3, 160–170, 2018.
29. Caicedo-Torres, W. and Gutierrez, J., ISeeU: Visually Interpretable Deep Learning for Mortality Prediction Inside the ICU. J. Biomed. Inform., Elsevier, 98, 103269, pp. 1–16, 2019.
30. Ma, X., Si, Y., Wang, Z., Wang, Y., Length of Stay Prediction for ICU Patients Using Individualized Single Classification Algorithm. Comput. Methods Programs Biomed., 186, 105224, pp. 1–11, 2020.
31. Schönrock-Adema, J., Heijne-Penninga, M., van Hell, E.A., Cohen-Schotanus, J., Necessary Steps in Factor Analysis: Enhancing Validation Studies of Educational Instruments. The PHEEM Applied to Clerks as an Example. Med. Teach., 31, 6, e226–e232, 2009.
32. Majhi, R., Panda, G., Sahoo, G., Development and Performance Evaluation of FLANN Based Model for Forecasting of Stock Markets. Expert Syst. Appl., Elsevier, 36, 3, 6800–6808, 2009.
33. Widrow, B., Adaptive Signal Processing, Prentice Hall, New Jersey, 1985.
34. https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Discriminant_Analysis.pdf
35. Han, J., Kamber, M., Pei, J., Data Mining: Concepts and Techniques, Third Edition, Elsevier, India, 2012.
2
Artificial Intelligence in Bioinformatics

V. Samuel Raj, Anjali Priyadarshini*, Manoj Kumar Yadav, Ramendra Pati Pandey, Archana Gupta and Arpana Vibhuti

SRM University, Delhi-NCR, Rajiv Gandhi Education City, Sonepat, India

*Corresponding author: [email protected]
Abstract
Artificial intelligence tries to replace human intelligence with machine intelligence to solve diverse biological problems. Recent developments in Artificial Intelligence (AI) are set to play an essential role in the bioinformatics domain. Machine learning and deep learning, the emerging fields with respect to biological science, have created a lot of excitement, as research communities want to harness their robustness in the fields of biomedicine and health informatics. In this book chapter, we look at the recently introduced state of the art in the field of bioinformatics using complex Artificial Intelligence algorithms. With various intelligent methods available, the most common problem is the selection of the best method to apply to a specific data set. Researchers need tools which present the data in a comprehensible fashion, annotated with context, estimates of accuracy and explanation. Thus the various smart tools available, with their advantages and disadvantages, are the major focus of this chapter.

Keywords: AI, bioinformatics, protein prediction, drug discovery, gene sequence, deep learning in bioinformatics, gene expression
2.1 Introduction

Computational biology is contributing to some of the most important bioinformatics advances in the fields of medicine and biology. The field is expanding and enhancing our knowledge with the help of tools of artificial intelligence, which are inspired by the way in which nature solves
the problems it faces. This chapter deals with biology, bioinformatics and the complexities of search and optimisation, equipping the reader with the knowledge needed to tackle a biological problem with the aid of computational tools. It also contains links to software and information available on the internet, in academic journals and beyond, making it a useful reference for all natural scientists and bioinformaticians with large data sets to analyze. We are aware that "one medicine for all" is no longer valid, owing to genetic variation across ethnic populations and to mutations. It therefore becomes pertinent to develop personalized medicine, and artificial intelligence (AI), often referred to as the core of the fourth revolution of science and technology, provides an opportunity to achieve this for precision public health [1, 2]. Medical AI enables an all-round improvement of medical services, including accurate image interpretation, fast data processing, improved workflow and reduced medical errors in the healthcare system [3]. Owing to improved medical facilities worldwide, the geriatric population has increased. Advancing age is associated with multiple ailments that compromise quality of life and carry a high morbidity of chronic diseases [4, 5]. Elderly people therefore have a higher demand for AI, because their demand for medical services increases and a more rapid, accessible and cost-efficient medical model is needed. Various AI-aided services, such as AI mobile platforms for monitoring medication adherence, early intelligent detection of health issues, and medical interventions among home-dwelling patients [6, 7], have the potential to meet such needs.
2.2 Recent Trends in the Field of AI in Bioinformatics

The basic conception of machine learning, as an important element of the continuing big-data revolution, is transforming biomedicine and healthcare. One of the most successful families of machine learning techniques is deep learning, which has re-modeled many subfields of AI over the last decade. DNA sequencing has given researchers the power to "read" the genetic blueprint that directs all the activities of a living organism. The central dogma of molecular biology, the pathway from DNA to protein via RNA, is the epitome of this flow of sequence information. DNA is composed of base pairs built from four elementary units called nucleotides (A, T, G and C), whereby A pairs with T through two hydrogen bonds and G pairs with
C through three hydrogen bonds. The DNA is condensed into chromosomes, which are formed from segments of DNA called genes that encode proteins. This active DNA is the key area of attention in genomics research and industry. Genomics is closely related to precision medicine. The field of precision medicine, also called personalized medicine, is an approach to patient care that encompasses biology, behavior and environment, with a vision of implementing patient- or population-specific treatment interventions, in contrast to a one-size-fits-all approach. For example, blood types are matched beforehand to reduce the risk of complications. Currently, the two barriers to wider implementation of precision medicine are high costs and technology limitations. Here machine learning helps, by supporting the collection and analysis of huge quantities of patient data economically. Machine learning is enabling researchers to identify patterns within high-volume genetic data sets. These patterns are then translated into computer models, which can facilitate the prediction of a person's risk of developing a given disease or help in devising potential therapies. Whole-genome sequencing (WGS) has intrigued everybody in medical genomics; researchers can now sequence a whole human genome in about a day. This has been made possible by next-generation sequencing, a culmination of all modern DNA sequencing techniques. Deep genomics uses machine learning to help researchers interpret genetic variation. Specifically, patterns are identified in large genetic data sets and translated into computer models, and algorithms are then designed to help users interpret how genetic variation affects crucial cellular processes. Metabolism, DNA repair and cell growth are a few of these cellular processes; disruption of their normal functioning can potentially cause diseases such as cancer. Recent applications of deep learning in biomedicine have already demonstrated superior performance compared with other machine learning methods on several medical problems [8], as well as in drug discovery and repurposing [9, 10]. The enormous growth in the volume of data, together with major progress in computing, including the use of powerful graphical processing units that are especially well suited to developing deep learning models, is regarded as the cause of the remarkable success of deep learning models in various tasks.
The previous scores typically serve to predict the functionality and deleteriousness of single variants. However, many complex traits and disorders (e.g., metabolic syndrome) are defined by the contributions of numerous variants, which can be represented in a combined score. These variants, usually identified through genome-wide association studies, are included in genetic risk scores. Such scores are often computed as a weighted sum of allele counts, the weights being given by log odds ratios or regression coefficients from univariate regression analyses of the originating population genotyping studies [11]; a small sketch of this computation follows this paragraph. Concisely, several options are in use to train models that predict the effects of genetic variation in coding and non-coding regions of the genome. The output is expressed in scores used for ranking and prioritizing candidate variants for further investigation, or in genetic risk scores that summarize effects. The reliable identification of structural variants through short-read sequencing remains a challenge [12]. For the purpose of investigating small as well as large deletions and insertions, several algorithms have already been developed (https://omictools.com/structural-variant-detection-category; date last accessed April 4, 2018). The fascination of deep learning for high-throughput biology is apparent: it permits better exploitation of the often large and high-dimensional data sets by advanced networks with multiple layers that capture their internal structure [13]. The prediction of sequence specificity of DNA- and RNA-binding proteins, of enhancers and cis-regulatory regions, of methylation status, and of splicing regulation are among the main applications of deep learning in genomics; applications in base calling and population genetics have appeared more recently. DL has emerged as a sturdy tool for making accurate predictions from complex data such as images, texts or videos. Careful tuning of hyperparameter values is crucial to avoid overfitting.
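As a minimal sketch of such a genetic risk score (the weights and genotypes below are purely illustrative, not taken from any study):

```python
import numpy as np

def genetic_risk_score(genotypes, log_odds):
    """Risk score as a weighted sum of allele counts.

    genotypes: (n_individuals, n_variants) counts of the risk allele (0, 1 or 2)
    log_odds:  per-variant weights, e.g. log odds ratios from GWAS
    """
    return genotypes @ log_odds

genotypes = np.array([[0, 1, 2],
                      [1, 1, 0]])
log_odds = np.array([0.10, 0.25, 0.05])     # invented weights, for illustration
print(genetic_risk_score(genotypes, log_odds))
```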
2.2.1 DNA Sequencing and Gene Prediction Using Deep Learning

Genomic prediction has historically been supported by genotyping arrays, but with the arrival of NGS in recent times, the use of whole-genome sequence for genomic prediction has become possible, or at least feasible. In theory, NGS data offer various benefits over SNP arrays alone: the causative mutations should be
present in the data, and the association between causative SNPs and traits would not decay with time, avoiding the need to recalibrate the model every few generations [14]. However, both simulation and empirical studies have not shown a major gain of sequence data over high-density SNP arrays [15, 16]. The versatile algorithms of DL have led to success in various areas (e.g., analysis of images, videos, voice, texts, and protein folding). These algorithms have already been applied to a wide variety of genomic problems, such as structural variant calling [17] and prediction of the clinical impact of mutations [18] or of transcription patterns [19]. With their aim of predicting new data as accurately as possible, DL methods can be less restrictive, and their ability to learn without model assumptions is among their most prominent advantages for genomic prediction. Their application does not require specifying whether the phenotype exhibits dominance or epistasis. Furthermore, DL can capture nonlinear relationships, because it admits various nonlinear activation functions. If sufficient data are provided, it should be possible to find the best DL architecture, which can be learned by itself regardless of the underlying genetic architecture. "Standard" quantitative or binary phenotypes have been used for genomic predictions and in various applications of DL so far. Evidence, although limited, indicates that dramatic improvements from DL in this field should not be expected. CNNs appear to be the most promising predictive tool for these sorts of phenotypes, partly because convolutional filters may capture some functional sequence motifs. The complexity of cell signaling and of cellular interactions with the environment can affect the biological course of an illness, and responses to therapeutic interventions may likewise be affected by the complexity of genomic changes. The simultaneous interrogation of multiple features, together with sensitive and precise processes, is needed for the evaluation of such changes. All the same, biomarker development is generally one-dimensional and qualitative, and does not account for the complex signaling and cellular networks of tumor cells and/or tissues. Automated, AI-based extraction of multiple sub-visual morphometric features on ordinary hematoxylin and eosin (H&E)-stained preparations remains constrained by sampling problems and tumor heterogeneity, but it can help to overcome the limitations of subjective visual assessment and to combine multiple measurements to capture the complexity of tissue architecture.
These histopathological features may eventually be employed together with other imaging, genomic, and proteomic measurements to deliver a more objective, multi-dimensional, and functionally relevant diagnostic output. Thus, AI-based approaches are just beginning to alleviate some of the challenges faced by oncologists and pathologists.
2.3 Data Management and Information Extraction

Information Extraction (IE) is an important and growing field, in part because of the development of ubiquitous social media, networking millions of people and producing huge collections of textual information. Mined information is being used in a wide array of application areas, from targeted marketing of products to intelligence gathering for military and security needs. IE has its roots in AI (Artificial Intelligence) fields including machine learning, logic and search algorithms, computational linguistics, and pattern recognition. IE can be used to extract useful information from unstructured or semi-structured data. Nowadays, data are pouring in at a rate that makes information extraction extremely difficult; such big data gives rise to unstructured, often multi-dimensional data, which further complicates the problem. Computational capabilities equipped with the tools of AI are therefore acting as a game changer, helping to deal with large amounts of unstructured data and improving on traditional IE systems. In this context, neural and adaptive computing may play a very important role; these are discussed in the later part of the chapter.
2.4 Gene Expression Analysis

Gene expression consists of the transcription and translation of a particular gene from its coding region. If there is any change in the coding sequence, the structure and function of the gene product change accordingly. A variety of techniques exist to analyze gene expression qualitatively and quantitatively. To get a better understanding of genes, gene pathways, gene-associated signaling pathways and the biological functions of genes, analysis of gene expression has to be done; this is very useful in research, pharmaceutical and clinical fields. Gene
expression analysis converts the static information of the genome sequence into a dynamic functional view of an organism's biology. Various approaches, such as real-time PCR, serial analysis of gene expression (SAGE), microarrays and next-generation sequencing (NGS), are used to study and analyze gene expression patterns. These techniques have proven to be very effective tools for gene expression analysis.
2.4.1 Approaches for Analysis of Gene Expression

The following methods and high-throughput approaches have been used for the analysis of gene expression.

1. Microarrays: Microarrays are a very effective tool for the analysis of gene expression on a large scale. They are used to compare the same set of genes under different conditions, in different cells, or in the same cells at different times, and the study is usually comparative: tens of thousands of target genes can be compared at one time. When the same set of genes is expressed differently under different conditions, a microarray can reveal the differential expression, giving an idea of whether a particular set of genes is up-regulated or down-regulated compared with a standard. Relative expression levels between two populations can therefore be calculated. This high-throughput approach allows large-scale screening of gene pathways or disease-related gene families and provides a useful approach in disease prognosis and diagnosis studies. It is also a very effective method for determining the effects of chemicals or drugs on biological processes in pharmaceutical research. Microarrays can be used to analyze large numbers of genes, whether previously recorded or from new samples. The technique is very sensitive: it can detect even a single nucleotide change in a given gene. This highly precise detection of single nucleotide changes, or SNPs (single nucleotide polymorphisms), makes the approach applicable to identifying strains of viruses and mutations in cancer cells, and subsequently to facilitating the treatment of disease.

2. Serial Analysis of Gene Expression (SAGE): SAGE is an important quantitative technique for determining gene expression. Its principle is based on counting the number of tags for a particular gene. The total number of gene tags gives a strong indication of how much a gene is expressed, and hence of how abundant the gene product
will be in the cell.

3. Next-Generation Sequencing (NGS): NGS is another technology used for gene expression analysis. RNA-Seq is an efficient technology in which millions of reads at random positions can be measured and compared. The data can be mapped and aligned to each gene, so NGS provides an analysis of gene expression at a remarkable level of detail.

4. Real-Time Reverse Transcriptase PCR (RT-qPCR): Real-time reverse transcriptase PCR is another powerful approach for high-throughput gene expression analysis and for the analysis of moderate numbers of genes. It can detect accurate relative, and in some cases absolute, quantities of cDNA in a sample. RT-qPCR is used for both qualitative and quantitative interpretation of gene expression and is the gold-standard method for gene expression analysis. Its results depend on the experimental design, the overall workflow and the analysis techniques. A number of models, software programs and calculation approaches exist for approaching 100% PCR efficiency, and results may vary depending on the number and type of reference genes used for normalization and on the calculation methods. Once relative expression levels have been calculated, appropriate statistical analysis is used to ensure that any conclusions drawn from the data are biologically relevant. A common calculation approach is sketched at the end of this subsection.

Tasks that require human intelligence can be aided by artificial intelligence (AI) installed in the software and hardware of a computer system. Advances in AI software and hardware, especially deep learning algorithms and the graphics processing units (GPUs) that power their training, have led to a recent and rapidly increasing interest in medical and clinical applications. In clinical diagnostics, AI-based computer vision approaches are poised to revolutionize image-based diagnostics, while other AI subtypes have begun to show similar promise in other diagnostic modalities. In clinical genomics, a specific type of AI algorithm known as deep learning is used to process large and complex genomic data sets to predict certain outcomes. These analyses draw on amounts of data beyond human capability, thus helping in prognosis, diagnosis and therapeutics.
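As a small illustration of the relative expression calculation mentioned above, the widely used 2^(-ΔΔCt) method can be sketched as follows. The chapter does not prescribe this particular model, the method assumes approximately 100% PCR efficiency and a single reference gene, and the Ct values below are invented:

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change by the 2^-ddCt method.

    Ct values are cycle thresholds from RT-qPCR; the target gene is
    normalized against a reference gene in both conditions.
    """
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)

# Target gene crosses the threshold 2 cycles earlier in treated cells:
print(relative_expression(22.0, 18.0, 24.0, 18.0))   # 4.0-fold up-regulation
```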
2.4.2 Applications of Gene Expression Analysis

Applications of gene expression analysis chiefly involve comparative analysis: the relative expression of the same set of genes under different conditions is the main application of the high-throughput approaches. Important and useful comparative analyses include:

a) comparing the expression pattern of the same set of genes in mutant and wild type;
b) analyzing gene expression in disease versus control;
c) time-point comparisons of the same set of genes during drug treatment or during development;
d) comparing the expression of the same set of genes in different tissues or organs;
e) determining drug efficacy by relative comparison of the same set of genes in control samples and samples treated with a particular drug.

In medical and clinical diagnostics, the study of gene expression plays a very important role, as any change, be it under-expression, over-expression or loss of function, plays a role in the etiology of various diseases. It is therefore important to equip clinicians, pathologists and researchers with advanced computing devices so that they can reach valid and informed conclusions about disease conditions. Such systems interpret health data arising from large sets of unstructured data, for example for the identification or forecasting of a disease state. Clinical AI interpretation tasks can be grouped into several classes, including computer vision, time series analysis, speech recognition, and natural language processing. Each of these problem classes is well suited to specific types of clinical diagnostic tasks [20]:

a) computer vision is useful for the interpretation of radiological images, while time series analysis is useful for the analysis of continuously streaming health data such as those provided by an electrocardiogram [21];
b) speech-recognition techniques can be used for the detection of neurological disorders [22];
c) AI-based natural language processing can help extract meaningful information from electronic health record (EHR) data [23];
d) these techniques also aid in analyzing less obvious areas, such as the regulation of the genome.
AI-aided systems can identify functional regulatory elements in the human genome by detecting recurrent motifs in DNA sequences, in a manner analogous to the way pixel patterns are detected in images by convolutional neural networks [24]. The AI algorithm family known as deep learning is able to learn features from large and complex data sets using deep neural network architectures. Neural networks are computational systems of artificial neurons (also called "nodes") that transmit signals to one another, often in interconnected layers, much as neurons in the human body do. Layers that are neither the input nor the output layer are known as hidden layers; a deep neural network consists of many hidden layers of artificial neurons. Neural networks often take as input the fundamental unit of data that they are trained to interpret: for example, pixel intensity in images; diagnostic, prescription, and procedure codes in EHR data; or nucleotide sequence data in genomic applications [25]. A multitude of these simple features are combined in successive layers of the neural network in many ways, as designed by the human neural network architect, to represent more sophisticated concepts or features of the input health data. Ultimately, the output of the neural network is the interpretation task that the network has been trained to execute. For example, successive layers of a computer vision algorithm might learn to detect edges in an image, then patterns of edges that represent shapes, then collections of shapes that represent certain objects, and so on. Thus, AI systems synthesize simple features into more complex concepts to derive conclusions about health data in a manner analogous to human interpretation, although the complex concepts used by AI systems are not necessarily recognizable or obvious to humans.
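A minimal sketch of such a layered network is given below, purely to illustrate the flow of simple input features through hidden layers to an output; the layer sizes are invented and the weights are random (untrained):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 64 input features (e.g. pixel intensities or encoded
# EHR codes), two hidden layers, one diagnostic output.
W1, b1 = rng.normal(size=(64, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)) * 0.1, np.zeros(16)
W3, b3 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)

def forward(x):
    h1 = relu(x @ W1 + b1)        # first hidden layer: simple feature combinations
    h2 = relu(h1 @ W2 + b2)       # second hidden layer: more abstract concepts
    return sigmoid(h2 @ W3 + b3)  # output: e.g. probability of a disease state

x = rng.normal(size=(1, 64))      # one input record
print(forward(x))
```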
2.5 Role of Computation in Protein Structure Prediction

Protein structure underpins various critical processes and products, such as personalized medicine, gene pathways, the determination of organ function, gene therapy, and vaccine and drug development. Nowadays bioinformatics makes extensive use of artificial intelligence, including software and programs for predicting the structure of a protein; even so, finding the structure of a protein remains difficult. The two most powerful experimental approaches for determining protein structure are Nuclear Magnetic Resonance and X-ray crystallography, but both are expensive and time consuming, which are the main disadvantages associated with these techniques.
As a recent advance for obtaining precise, fine-grained protein structures, a powerful technique named cryo-electron microscopy (cryo-EM) has been introduced. This revolutionary technique yields high-resolution, large-scale molecular structures, and machine learning and artificial intelligence are extensively used for the interpretation of cryo-EM maps [26–29]. Many proteins in solution cannot be crystallized, yet crystallization is mandatory for crystallographic structure determination; AI offers a remedy, predicting the structure of a protein from its sequence without crystallization. Artificial intelligence encompasses numerous programs trained to provide rich information on atomic features of proteins, such as bond angles, bond lengths, bond types, physico-chemical properties, bond energies, amino acid interactions and potential energies. Artificial intelligence is also used for image recognition [30, 31] and helps to deliver precise, broad and accurate structures for thousands of proteins [32, 33]. The outputs of these prediction models can then be compared to the known crystal structures. Several events are organized to benchmark protein structure prediction models. The Critical Assessment of Structure Prediction (CASP) is a gathering, held every two years, at which researchers from around the world submit predicted protein structures; the predictions are compared to assess the quality of each model and find the most accurate one, making CASP an important milestone for protein structure prediction and its many applications. In systems such as MULTICOM, deep learning (machine learning) has been applied to protein structure prediction with the help of protein contact and distance prediction, and professionals analyze the performance of these methods [34] to decide on the best models.
2.6 Application in Protein Folding Prediction

Understanding protein folding is inherent to understanding a protein's function and its heterogeneous nature. Cellular function is incomplete without proteins, be it in replication, transcription or translation; thus prediction of the 3D, folded protein structure becomes very important for addressing various questions of molecular biology. Earlier, various molecular biology techniques were used to determine protein folding, which was time consuming. The discovery of new protein sequences has been accelerated by next-generation sequencing techniques, these methods being rapid
and economical. Computational prediction methods that can accurately classify unknown protein sequences into specific fold categories in the shortest time possible are today's requirement. Computational recognition of protein folds therefore holds great importance in bioinformatics and computational biology. A number of efforts have produced a variety of computational prediction methods, and artificial intelligence (AI) and machine learning (ML) have shown great promise. In this chapter, available AI and ML methods and features are explored, and novel methods based on reinforcement learning are discussed. Prediction of protein structure happens at four levels:

i) 1-D prediction of structural features along the primary sequence of amino acids linked by peptide bonds;
ii) 2-D prediction of the spatial relationships between amino acids, that is, alpha helices, beta sheets and beta turns facilitated by hydrogen bonds;
iii) 3-D prediction of the tertiary structure of a protein, fibrous or globular, involving multiple interactions facilitated by hydrogen bonds, van der Waals forces and hydrophobic interactions;
iv) 4-D prediction of the quaternary structure of a multiprotein complex made up of more than one peptide chain, involving, for example, the formation of disulfide bridges.

The development of models that allow the flexibility of bond formation and help to predict a stable and functional protein structure has thus been facilitated to a great degree by AI and ML. Prediction of protein structure is a complex, multi-fold problem associated with various levels of organization, and it calls for smart computational techniques. AI is a great tool which, when used with computational biology, facilitates such prediction. Apart from determining the structure, AI also aids in predicting protein structures crucial for drug development, as well as in understanding the biochemical effects and ultimately the function. A protein can be broadly described as a polymer in which the individual amino acids are the monomers, or building blocks, arranged in a linear chain and joined together by peptide bonds. The primary structure, as described earlier, is represented by a sequence of letters which represent the amino acids. In its native environment, the chain of amino acids of a protein folds into local secondary structures including alpha helices, beta strands, and non-regular coils [35, 36]. The secondary structure elements are further packed to form a tertiary structure, depending on
hydrophobic forces and side-chain interactions, such as hydrogen bonding, between amino acids [37–39]. The tertiary structure is described by the x, y and z coordinates of all the atoms of a protein or, in a coarser description, by the coordinates of the backbone atoms (Figure 2.1). The quaternary structure is formed by more than one protein chain interacting or assembling together into a complex structure. These protein complexes interact with each other and with other biological macromolecules, such as DNA, RNA and certain metabolites, in a cell. Such interactions are required to carry out various types of biological functions, from enzymatic catalysis (a protein complex can interact with a metal or non-metal cofactor, referred to as a coenzyme) to gene regulation (interaction of transcription factors with DNA sequences), control of growth and differentiation (protein–protein interactions in which ligand binding to a receptor triggers a signaling cascade) and transmission of nerve impulses [40]. A protein's function and its structure are dependent on each other [37, 38, 41, 42]; therefore, accurate determination or prediction of protein structure holds the key to determining its function. Since the inception of this field, the most effective methods for finding protein structure have been Nuclear Magnetic Resonance and X-ray crystallography, which have the disadvantage of being time consuming and expensive. A recent advance has been the introduction of cryo-electron microscopy (cryo-EM), which produces high-resolution, large-scale molecular structures very efficiently; cryo-EM density maps make use of machine learning and artificial intelligence for prediction [43–46]. The crystallographic experiments need a protein crystal, which is the most problematic requirement of these methods, because many proteins in solution do not crystallize. Artificial intelligence comes to our aid here as a possible better pathway for determining the structures of these proteins [47, 48], given its proven efficacy and accuracy in different fields such as business [49] and image recognition; it can accurately and efficiently predict thousands of possible structures in the shortest time by analyzing big data where other methods have failed to deliver accurate and useful information.
[Figure 2.1 The different levels of organization of a protein: amino acids, alpha helix, pleated sheet, tertiary folded structure, quaternary structure.]
Most models are inaccurate and do not produce predicted proteins that contain useful information, so with artificial intelligence, programs are trained on many numerically represented atomic features from the models (such as bond lengths, bond angles, residue–residue interactions, physico-chemical properties, and potential energy properties). Comparison of the prediction model outputs to the known crystal structures then helps to assess the quality of each model and find the most accurate one. Models for prediction and prediction analysis are compared in one main gathering, the Critical Assessment of Structure Prediction (CASP). Every two years researchers from around the world submit machine learning methods designed for protein structure prediction [50]; the latest advances have come with the help of protein contact distance prediction [51] and the addition of a quality assessment (QA) category in CASP7 (2006) [51, 52]. AI, which is time and resource efficient, allows more accurate prognosis and diagnosis of structures, because computers can analyze data with exact calculations and deep attention to detail. While these accuracies may be very close to those of traditional approaches, they are still slightly stronger, allowing confidence in the results. AI would also help in cost reduction; it is not an agent to replace researchers but rather works in conjunction with them. Artificial intelligence is an exciting field which offers solutions to problems in finding the structures of proteins, which is crucial to drug development and the understanding of biochemical effects. A protein's function is determined by its structure [53–56], as evidenced in many biochemical reactions; therefore elucidating a protein's structure, with the help of the resources summarized in Table 2.1, is key to understanding its function. Function determination is in turn essential for any related biological, biotechnological, medical or pharmaceutical application, which is much needed in today's time of increased antimicrobial resistance and threats from unknown biological agents.

Table 2.1 Summary of database sources of protein structure classification.

| Database sources | Websites | References |
|------------------|----------|------------|
| PDB | http://www.rcsb.org/pdb/ | [57] |
| UniProt | http://www.uniprot.org/ | [58] |
| DSSP | http://swift.cmbi.ru.nl/gv/dssp/ | [59] |
| SCOP | http://scop.mrc-lmb.cam.ac.uk/ | [60] |
| SCOP2 | http://scop2.mrc-lmb.cam.ac.uk/ | [61] |
| CATH | http://www.cathdb.info/ | [62] |
The various predictive models for protein structure prediction include hidden Markov models, neural networks, support vector machines, Bayesian methods, and clustering methods.

Hidden Markov Models for Prediction
HMMs are among the most important techniques for protein fold recognition. In the HMM version of profile–profile methods, the HMM for the query is aligned with the prebuilt HMMs of the template library; this form of profile–profile alignment is also computed using standard dynamic programming methods. Earlier HMM approaches, such as SAM [63] and HMMer [64], built an HMM for a query from its homologous sequences and then used this HMM to score sequences with known structures in the PDB using the Viterbi algorithm, an instance of dynamic programming methods (a minimal sketch of this recursion is given after the neural network discussion below). This can be viewed as a form of profile–sequence alignment. More recently, profile–profile methods have been shown to significantly improve the sensitivity of fold recognition over profile–sequence, or sequence–sequence, methods [65].

Neural Networks (NNs)
It is very challenging to determine the structure of a protein from its sequence alone, which makes function determination more difficult: because a functional protein involves many molecular interactions and several levels of folding, a simple input of sequence will not by itself yield the desired output. Deep learning methods, a rapidly evolving field for capturing complex relationships between input features and desired outputs, have been put to great use in structure prediction. Various deep neural network architectures resembling the neural networks of a human have been proposed, including deep feed-forward neural networks, recurrent neural networks, neural Turing machines and memory networks. Such advances are making the field more competitive and accurate; a comparison can be made with the human brain, which receives much information as input but is able to analyze it and come to a logical conclusion. Pattern recognition and classification are important capabilities of NNs. Examples of early NN methods that are still widely used today are PHD [66, 67], PSIPRED [68] and JPred [69], though the field has advanced a great deal: deep neural network (DNN) models have been shown to have a performance advantage in image- and language-based problems [70], and this has been seen to extend to some specific CASP areas such as residue–residue contact prediction and direct use for accurate tertiary structure generation [71–75].
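As a minimal sketch of the Viterbi recursion mentioned above (generic HMM decoding, not the SAM or HMMer implementation; the toy parameters are invented):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable hidden-state path for an observation sequence.

    pi:  initial state probabilities (n_states,)
    A:   state transition matrix (n_states, n_states)
    B:   emission matrix (n_states, n_symbols)
    obs: sequence of symbol indices
    Log probabilities are used to avoid numerical underflow.
    """
    logpi, logA, logB = np.log(pi), np.log(A), np.log(B)
    V = np.zeros((len(obs), len(pi)))           # best log score per state
    ptr = np.zeros((len(obs), len(pi)), dtype=int)
    V[0] = logpi + logB[:, obs[0]]
    for t in range(1, len(obs)):
        scores = V[t - 1][:, None] + logA       # scores via each predecessor
        ptr[t] = scores.argmax(axis=0)          # best predecessor per state
        V[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(V[-1].argmax())]                # backtrack from the best end state
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(ptr[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], pi, A, B))
```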
Support Vector Machines (SVMs)
The Support Vector Machine (SVM) is a supervised machine learning technique that has been used to rank protein models [76] and has been applied to pattern classification problems in biology. The SVM method can be performed on a database derived from SCOP, in which protein domains are classified based on:

i. known structures of proteins in the data bank;
ii. evolutionary relationships of the predicted protein;
iii. the various principles of bond formation governing the 3-D structure of proteins.

The advantages of SVMs include very effective avoidance of over-fitting, which is a weakness of several other methods, the ability to manage large feature spaces, and the condensation of large amounts of information.

Bayesian Methods
The most successful methods for determining secondary structure from primary structure use machine learning approaches that are quite accurate, but they do not directly incorporate structural information. There is a need to determine higher-order protein structure, which can provide a better and deeper understanding of a protein's function in the cell, as structure and function are strongly related. Various computational methods have been developed for predicting secondary structure from the primary amino acid sequence; one such method is the Bayesian method. The knob-socket model of protein packing in secondary structure forms the basis of the Bayesian model. Packing may bring together residues that are close in space but distant in the primary sequence [77, 78], which is not taken into account by several other methods; the Bayesian model considers the packing influence of residues on secondary structure determination. This method therefore has the advantage over other methods of directly including and predicting the secondary states of coil and turn, whereas other secondary structure prediction methods are indirect and do not make direct predictions of coil structure alongside alpha helices and beta sheets. Secondary folding depends heavily on the surrounding environment (aqueous or non-aqueous), since much hydrogen bonding and hydrophobic interaction is involved; this method thus helps develop an understanding of the environment responsible for secondary structure formation.
Clustering Methods
A protein rarely performs its function in isolation; various kinds of interactions are needed for it to function [79], as discussed earlier in this chapter in the context of quaternary structure. Protein–protein interactions are thus fundamental to almost all biological processes [80], and it is important to understand this phenomenon. The increasing availability of large-scale protein–protein interaction data has made it possible to understand the basic components and organization of the cell's machinery at the network level, in terms of the interactions taking place. Protein–protein interactions can be studied by advanced high-throughput technologies such as yeast two-hybrid screens, mass spectrometry, and protein chip technologies, which make available huge data sets of such interactions [81] that can be put to great use in structure prediction. In computational analysis, such protein–protein interaction data are naturally represented in the form of networks. This network representation provides an initial global picture of protein interactions on a genomic scale and helps to build an understanding of the basic components and organization of cell machinery. In clustering methods, the protein interaction network is represented as an interaction graph, with proteins as vertices (or nodes) and interactions as edges. This representation has been used to study the surface or topological properties of protein interaction networks, including the network diameter, the distribution of vertex degree and the clustering coefficient, showing that such networks are scale-free [82–85] and exhibit small-world effects [86, 87]. It has been observed that clustering protein interaction networks is an effective systems biology approach for understanding the relationship between the organization of a network and its function [88], making it a very effective tool. Proteins are grouped into sets (clusters) demonstrating greater similarity among proteins in the same cluster than in different clusters. The clusters are of two types: protein complexes and functional modules. Protein complexes are groups of proteins that interact with each other at the same time and place, forming a single multimolecular structure, as evident in the RNA splicing and polyadenylation machinery and in protein export and transport complexes, to name a few [89]. The difference between a protein complex and a functional module is that a functional module consists of proteins that bind each other at different times and places while participating in a common cellular process; examples include the yeast pheromone response pathway and MAP signaling cascades [90], which begin with an extracellular signal leading to a signaling cascade that results in gene activation and other processes. A small sketch of such network measures on a toy interaction graph follows.
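This sketch uses the networkx library; the proteins and edges below are invented purely for illustration:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy interaction graph: proteins as nodes, interactions as edges.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"),   # a densely connected triangle
                  ("C", "D"), ("D", "E")])              # a sparser tail

print(nx.clustering(G))            # per-node clustering coefficient
print(nx.average_clustering(G))    # network-level clustering coefficient
print(nx.diameter(G))              # network diameter (graph must be connected)

# Simple module detection: greedy modularity communities as candidate clusters.
print(list(greedy_modularity_communities(G)))
```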
2.7 Role of Artificial Intelligence in Computer-Aided Drug Design

High-throughput screening (HTS) is a set of techniques capable of identifying biologically active molecules with desired properties from compound databases of billions of compounds. Predicting and identifying active compounds with high accuracy is crucial to reducing the time taken to discover potent drugs. Medicinal chemistry companies use screening techniques to identify active compounds from drug databases in significantly less time. The decrease in search space, or targeted search, reduces the overall cost of the drug discovery process. The critical problem is how to establish a relationship between the 3D structure of the lead molecule and its biological activity. QSAR is a technique that can predict the activity of a set of compounds using equations derived from a set of known compounds [91]. In QSPR (quantitative structure–property relationships), by contrast, one predicts biological activity using the physicochemical properties of known compounds as the response variable. Accurate prediction of the activity of chemical molecules is still a persistent issue in drug discovery. It is a general observation in structural bioinformatics that if two protein structures share structural similarities, their functions may also be the same. Nevertheless, this does not always hold for chemical structures, where minute structural differences between a pair of compounds can change their activity against the same target receptor. This is known as the activity cliff problem, a hot topic of debate among computational and medicinal scientists [92, 93]. The lock-and-key hypothesis and the induced-fit model deal with the biochemistry of ligand binding at the receptor. In general, a ligand–receptor complex comprises a smaller ligand that attaches to the functional cavity of the receptor. The 3D structural information of both ligand and receptor is essential in order to understand their functional roles. The 3D conformation of the receptor protein changes upon ligand binding at the active site, leading to a change in its functional state. X-ray crystallography, nuclear magnetic resonance (NMR), and electron microscopy are the currently available experimental techniques for determining the 3D structure of proteins. Since there is a considerable gap between available protein sequences and their 3D structures, one can harness bioinformatics techniques, namely molecular modeling, to predict 3D structures in less time with comparable accuracy. Molecular docking is a technique that can be used to predict the binding mode of a ligand at the receptor when their 3D information is available.
It is most commonly used for pose prediction of a ligand at the active site of the receptor. The approach of identifying lead compounds using the 3D structural information of the receptor protein is known as structure-based drug design (SBDD). Nowadays, the process of identifying, predicting and optimising the activity of small molecules against a biological target falls under the SBDD domain [94–96]. Ligand-based drug design (LBDD) is another drug design approach, applicable when 3D structural information of the receptor is unavailable. LBDD relies mainly on pre-existing knowledge of compounds that are known to bind to the receptor. The physicochemical properties of known ligands are used to predict their activity and to develop structure–activity relationships (SAR) for screening unknown compounds [97]. Although artificial intelligence can be applied in both SBDD and LBDD approaches to automate the drug discovery process, its implementation in LBDD approaches is more common these days. Some recent methods, like proteochemometric modeling (PCM), extract individual descriptor information from both the ligands and the receptors, as well as combined interaction information [98]. Machine learning classifiers use the individual descriptors, as well as cross-descriptor information, to predict bioactivity relations. Biological activity is a broad term that relates to the ability of a compound/target to achieve a desired effect [99]. Bioactivity may be divided into the activity of the receptor (functionality) and the activity of compounds. In pharmacology, biological activity is replaced by pharmacological activity, which usually represents the beneficial or adverse effect of drugs on biological systems. A compound must possess both activity against the target and permissible physicochemical properties in order to be established as an ideal drug candidate. The absorption, distribution, metabolism, excretion and toxicity (ADMET) profile of a compound is required to predict the bioavailability, biodegradability and toxicity of drugs. Initially, simple descriptor-based statistical models were created for predicting the bioactivity of drug compounds. Later, the target specificity and selectivity of compounds were increased many-fold by the inclusion of machine learning-based models [100]. Machine learning classifiers may be built and trained on pre-existing knowledge of either molecular descriptors or substructure mining in order to classify new compounds. One can train the classifiers, and classify new compounds, considering either a single parameter or a combination of parameters: activity (active/non-active),
drug-likeness, pharmacodynamics, and pharmacokinetics or toxicity profiles of known compounds [91]. Nowadays, many open-source as well as commercial applications are available for predicting the skin sensitisation, hepatotoxicity, or carcinogenicity of compounds [101]. Apart from this, several expert systems are in use for finding the toxicity of unknown compounds using knowledge-base information [102, 103]. These are artificial intelligence-enabled expert systems that use human knowledge (or intelligence) to reason about problems or to make predictions. They can make qualitative judgements based on the qualitative, quantitative, statistical and other evidence provided to them as input. For instance, DEREK and StAR use knowledge-based information to derive new rules that better describe the relationship between chemical structure and toxicity [102]. DEREK uses a data-driven approach to predict the toxicity of the novel set of compounds given in the training dataset and compares the predictions to given biological assay results in order to refine the prediction rules. Toxtree is an open-source platform for detecting the toxicity potential of chemicals. It uses a classification model based on the Decision Tree (DT) machine learning algorithm to estimate toxicity. Toxicological data of chemicals, derived from their structural information, is used as input to the model [104]. Besides expert systems, there are also other automated prediction methods, such as Bayesian methods, neural networks, and support vector machines. Bayesian Inference Networks (BINs) are among the important methods that allow a straightforward representation of the uncertainties involved in different medical domains, including diagnosis, treatment selection, prediction of prognosis and screening of compounds [105]. Nowadays, doctors are using these BIN models in prognosis and diagnosis. The use of BIN models in the ligand-based virtual screening domain demonstrates their successful application in the field of drug discovery. A comparative study was carried out to find the efficiency of three models: Tanimoto Coefficient networks (TAN), conventional BINs, and the BIN Reweighting Factor (BINRF), for screening billions of drug compounds based on structural similarity information [106]. All three models use the MDL Drug Data Report (MDDR) database for training as well as testing purposes. Ligand-based virtual screening utilizing the BINRF model not only significantly improved the search strategy, it also identified active molecules with less structural similarity, compared to the TAN- and BIN-based approaches. Thus, this is an era of integrative approaches for achieving higher accuracy in drug or drug-target prediction. Bayesian ANalysis to determine Drug Interaction Targets (BANDIT) uses a Bayesian approach to integrate varied data types in an unbiased manner.
It also provides a platform that allows the integration of newly available data types [107]. BANDIT has the potential to expedite the drug development process, as it spans the entire drug search space, from new target identification and validation to clinical candidate development and drug repurposing. The Support Vector Machine (SVM) is a supervised machine learning technique most often used in knowledge-based drug design [108]. The selection of an appropriate kernel function and optimal parameters is the most challenging part of the problem modelling, as both are problem-dependent. Later, a more specific kernel function was designed that can control the complexity of subtrees through parameter adjustments. The SVM model integrated with this newly designed kernel function successfully classifies and cross-validates small molecules having anti-cancer properties [109]. Graph kernel-based learning algorithms are widely used in SVMs, and they can directly utilise graph information to classify compounds. Graph kernel-based SVMs are employed to classify diverse compounds, to predict their biological activity and to rank them in screening assays. Artificial neural networks (ANNs), deep learning algorithms that mimic the human neural system, also have applications in the drug discovery process. The robustness of the SVM and ANN algorithms was compared in terms of their ability to classify drug and non-drug compounds [110]. The result is in favor of the SVM, as it can classify compounds with higher accuracy and robustness than the ANN. Other machine learning algorithms, such as decision trees, random forests, logistic regression and recursive partitioning, have also been successfully applied to classify compounds using the relationship between their chemical structures and toxicity profiles [111]. A comparative study of ML algorithms shows that non-linear and ensemble-based classification algorithms are more successful in classifying compounds using ADMET properties. Random forest algorithms can also be used for ligand pose prediction, finding receptor–ligand interactions and predicting the efficiency of docking simulations [112]. Nowadays, deep learning (DL) methods are achieving remarkable success in pharmaceutical research, from biological-image analysis, de novo molecule design and ligand–receptor interaction to biological activity prediction [113]. Continuous improvements in machine learning and deep learning algorithms will therefore help achieve the desired results with higher prediction accuracy in the drug design field. Multiple descriptors represent molecular data in terms of structural and physicochemical features. These descriptors are responsible for the diverse bioactivity of compounds [114]. Apart from descriptor-based
bioactivity prediction of chemicals, substructure mining is also an established technique in the field of drug discovery. Substructure mining is a data-driven approach that uses a combination of algorithms to detect the most frequently occurring substructures in a large set of known ligands [115]. There are two ways to use substructure mining. The first is to use a predefined list of candidate scaffolds: the substructure mining algorithm identifies and extracts all the candidate scaffolds present in the known compounds of a given database. The second approach adaptively learns the substructures from the compounds themselves. Both approaches are capable of extracting all the significant 2D substructures from a chemical database [116]. The popularity of substructure mining approaches has helped establish a common consensus among medicinal chemists, who have come to treat chemical compounds as collections of their sub-structural parts. Applying the approach to establish structure–activity relationships builds further confidence in the statement that the biological properties of molecules depend on their structural properties. Over time, several substructure mining algorithms have been developed to accommodate the needs of an ever-changing drug discovery process [117]. The subgraph mining approach is unique in that it is free from arbitrary assumptions, compared to other approaches. In other words, current subgraph mining techniques can retrieve all frequently occurring subgraphs from a given database of chemical compounds in significantly less time with minimum support [118]. Furthermore, as described above, the idea behind these techniques is to enable us to find the most significant subgraphs out of all possible subgraphs. In the near future, the use of artificial intelligence-based techniques in medicinal chemistry will become more complex, owing to the increasing availability of huge repositories of chemical, biological, genetic and structural data. Implementing complex algorithms on ever-increasing data volumes in the search for new, safer and more effective drug candidates motivates the use of quantum computing and high-performance computing. In summary, we believe that these techniques will become a much more significant part of drug discovery endeavours within a very short time.
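As a small illustration of the similarity reasoning that underlies TAN-style screening and substructure-based comparison, the following sketch computes Tanimoto coefficients over binary fingerprints represented as Python sets of "on" bits. The fingerprints and compound names are invented for the example, not derived from real structures.

```python
# Similarity-based virtual screening sketch: rank a toy compound library
# against a query ligand using the Tanimoto coefficient.
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient = |A intersect B| / |A union B| for binary fingerprints."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

query = {1, 4, 7, 9, 12}                       # bits set for the query ligand
library = {
    "cmpd_A": {1, 4, 7, 9, 13},
    "cmpd_B": {2, 3, 5, 8},
    "cmpd_C": {1, 4, 9, 12, 15, 16},
}

# Rank library compounds by similarity to the query, highest first.
ranked = sorted(library.items(), key=lambda kv: tanimoto(query, kv[1]), reverse=True)
for name, fp in ranked:
    print(name, round(tanimoto(query, fp), 3))
```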
2.8 Conclusions

AI and ML have emerged as powerful tools for structure prediction, but these techniques rely to a great extent on collections of phenotype data rather than genomic data, which may be a disadvantage. Genome researchers have learned that much of the variation between individuals is the result of a number of discrete, single-base changes in the human genome, known as single nucleotide polymorphisms (SNPs), which affect the phenotype. ML can be applied to SNP data much as it is applied to microarray data: supervised learning can identify differences in SNP patterns between people who respond well to a particular drug and those who respond poorly. It can likewise be used to identify SNP patterns predictive of disease. If highly predictive SNPs appear within genes, this may indicate that those genes are important for conferring disease resistance or susceptibility, or that the proteins they encode are potential drug targets, an important finding for doctors and researchers alike. Constructing models of biological pathways, or even of an entire cell in silico, is a goal of systems biology that may become possible using advanced computational techniques. Machine learning has revolutionized the fields of biology and medicine, where researchers have employed it to make gene chips more practical and useful. Data that might once have taken years to collect now takes a week. Biologists are greatly aided by the supervised and unsupervised learning methods that many are using to make sense of the large amounts of data now available to them. As a result, there has been a rapid increase in the rate at which biologists are able to understand the molecular processes that underlie and govern the function of biological systems, knowledge that can be used for a variety of important medical applications such as diagnosis, prognosis, and drug response prediction. As our vast stores of genomic and similar types of data continue to grow, the role of computational techniques, especially machine learning, will grow with them. These algorithms will enable us to analyze this data and yield valuable insight into the biological systems that surround us and the diseases that affect us.
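The SNP application sketched above can be illustrated as follows, with simulated genotypes; encoding each SNP by its minor-allele count (0/1/2) is a common convention, and the phenotype construction here is purely synthetic.

```python
# Supervised learning on SNP data: classify drug responders vs non-responders
# from a simulated 0/1/2 genotype matrix.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_people, n_snps = 300, 50
X = rng.integers(0, 3, size=(n_people, n_snps))   # genotype matrix (0/1/2)
# Simulate a phenotype influenced by a handful of SNPs plus noise.
y = (X[:, 0] + X[:, 7] + rng.normal(0, 1, n_people) > 2).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Highly predictive SNPs (high feature importances) may point to genes of interest.
clf.fit(X, y)
top = np.argsort(clf.feature_importances_)[::-1][:5]
print("most predictive SNP indices:", top)
```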
References

1. Lancet, T., Artificial intelligence in healthcare: Within touching distance. Lancet, 390, 10114, 2739, 2018.
2. Kantarjian, H. and Yu, P.P., Artificial Intelligence, Big Data, and Cancer. JAMA Oncol., 1, 5, 573–574, 2015.
3. Topol, E.J., High-performance medicine: The convergence of human and artificial intelligence. Nat. Med., 25, 1, 44–56, 2019.
4. Kanasi, E., Ayilavarapu, S., Jones, J., The aging population: Demographics and the biology of aging. Periodontol. 2000, 72, 1, 13–18, 2016.
5. Naughton, M.J., Brunner, R.L., Hogan, P.E., Danhauer, S.C., Brenes, G.A., Bowen, D.J. et al., Global quality of life among WHI women aged 80 years and older. J. Gerontol. A Biol. Sci. Med. Sci., 71, Suppl. 1, S72–8, 2016.
6. Cohen, C., Kampel, T., Verloo, H., Acceptability among community healthcare nurses of intelligent wireless sensor-system technology for the rapid detection of health issues in home-dwelling older adults. Open Nurs. J., 11, 54–63, 2017.
7. Labovitz, D.L., Shafner, L., Reyes, G.M., Virmani, D., Hanina, A., Using artificial intelligence to reduce the risk of nonadherence in patients on anticoagulation therapy. Stroke, 48, 5, 1416–1419, 2017.
8. Ching, T., Himmelstein, D.S., Beaulieu-Jones, B.K. et al., Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface, 15, 141, pii:20170387, 2018.
9. Goh, G.B., Hodas, N.O., Vishnu, A., Deep learning for computational chemistry. J. Comput. Chem., 38, 16, 1291–1307, 2017.
10. Ramsundar, B., Liu, B., Wu, Z. et al., Is multitask deep learning practical for pharma? J. Chem. Inf. Model., 57, 8, 2068–2076, 2017.
11. So, H.C. and Sham, P.C., Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Sci. Rep., 7, 41262, 2017.
12. English, A.C., Salerno, W.J., Hampton, O.A., Gonzaga-Jauregui, C., Ambreth, S., Ritter, D.I., Beck, C.R., Davis, C.F., Dahdouli, M., Ma, S. et al., Assessing structural variation in a personal genome—Towards a human reference diploid genome. BMC Genomics, 16, 286, 2015.
13. Angermueller, C., Parnamaa, T., Parts, L., Stegle, O., Deep learning for computational biology. Mol. Syst. Biol., 12, 878, 2016.
14. Meuwissen, T. and Goddard, M., Accurate Prediction of Genetic Values for Complex Traits by Whole-Genome Resequencing. Genetics, 185, 623–631, 2010.
15. Pérez-Enciso, M., Rincón, J.C., Legarra, A., Sequence- vs. chip-assisted genomic selection: Accurate biological information is advised. Genet. Sel. Evol., 47, 1–14, 2015.
16. Heidaritabar, M., Calus, M.P.L., Megens, H.-J., Vereijken, A., Groenen, M.A.M., Bastiaansen, J.W.M., Accuracy of genomic prediction using imputed whole-genome sequence data in white layers. J. Anim. Breed. Genet., 133, 167–179, 2016.
17. Ainscough, B.J., Barnell, E.K., Ronning, P., Campbell, K.M., Wagner, A.H., Fehniger, T.A., Dunn, G.P., Uppaluri, R., Govindan, R., Rohan, T.E. et al., A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data. Nat. Genet., 50, 1735–1743, 2018.
18. Sundaram, L., Gao, H., Padigepati, S.R., McRae, J.F., Li, Y., Kosmicki, J.A., Fritzilas, N., Hakenberg, J., Dutta, A., Shon, J. et al., Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet., 50, 1161–1170, 2018.
19. Zhou, J., Theesfeld, C.L., Yao, K., Chen, K.M., Wong, A.K., Troyanskaya, O.G., Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet., 50, 1171–1179, 2018.
20. Torkamani, A., Andersen, K.G., Steinhubl, S.R., Topol, E.J., High-definition medicine. Cell, 170, 828–843, 2017.
21. Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K. et al., A guide to deep learning in healthcare. Nat. Med., 25, 24–29, 2019.
22. Fraser, K.C., Meltzer, J.A., Rudzicz, F., Linguistic features identify Alzheimer's disease in narrative speech. J. Alzheimers Dis., 49, 407–422, 2016.
23. Rajkomar, A., Oren, E., Chen, K., Dai, A.M., Hajaj, N., Liu, P.J. et al., Scalable and accurate deep learning for electronic health records. NPJ Digit. Med., 1, 18, 2018, https://doi.org/10.1038/s41746-018-0029-1.
24. Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A., Telenti, A., A primer on deep learning in genomics. Nat. Genet., 51, 12–18, 2019.
25. Eraslan, G., Avsec, Ž., Gagneur, J., Theis, F.J., Deep learning: New computational modelling techniques for genomics. Nat. Rev. Genet., 20, 389–403, 2019.
26. Yang, J., Cao, R., Si, D., EMNets: A Convolutional Autoencoder for Protein Surface Retrieval Based on Cryo-Electron Microscopy Imaging, in: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics—BCB '18, Washington, DC, USA, pp. 639–644, 2018, https://doi.org/10.1101/561027.
27. Ng, A. and Si, D., Beta-Barrel Detection for Medium Resolution Cryo-Electron Microscopy Density Maps Using Genetic Algorithms and Ray Tracing. J. Comput. Biol., 25, 6, 326–336, 2018.
28. Li, R., Si, D., Zeng, T., Ji, S., He, J., Deep Convolutional Neural Networks for Detecting Secondary Structures in Protein Density Maps from Cryo-Electron Microscopy. Proceedings, pp. 41–46, 2016.
29. Si, D., Ji, S., Nasr, K.A., He, J., A machine learning approach for the identification of protein secondary structure elements from electron cryo-microscopy density maps. Biopolymers, 97, 9, 698–708, 2012.
30. Huang, Q., Zhang, P., Wu, D., Zhang, L., Turbo Learning for CaptionBot and DrawingBot, in: Advances in Neural Information Processing Systems, vol. 31, pp. 6456–6466, Curran Associates Inc., USA, 2018.
31. Xu, T. et al., AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
32. Kosylo, N. et al., Artificial Intelligence on Job-Hopping Forecasting: AI on Job-Hopping, in: 2018 Portland International Conference on Management of Engineering and Technology (PICMET), 2018.
33. Keasar, C. et al., An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12. Sci. Rep., 8, 1, 9939, 2018.
34. Hou, J., Wu, T., Cao, R., Cheng, J., Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. bioRxiv, Open Access 552422, 15 April 2019, https://doi.org/10.1002/prot.25697.
35. Pauling, L. and Corey, R.B., The pleated sheet, a new layer configuration of the polypeptide chain. Proc. Natl. Acad. Sci., 37, 251–256, 1951.
36. Pauling, L., Corey, R.B., Branson, H.R., The structure of proteins: Two hydrogen bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci., 37, 205–211, 1951.
37. Kendrew, J.C., Dickerson, R.E., Strandberg, B.E., Hart, R.J., Davies, D.R., Phillips, D.C., Shore, V.C., Structure of myoglobin: A three-dimensional Fourier synthesis at 2 Å resolution. Nature, 185, 422–427, 1960.
38. Perutz, M.F., Rossmann, M.G., Cullis, A.F., Muirhead, G., Will, G., North, A.T., Structure of haemoglobin: A three-dimensional Fourier synthesis at 5.5 Angstrom resolution, obtained by X-ray analysis. Nature, 185, 416–422, 1960.
39. Dill, K.A., Dominant forces in protein folding. Biochemistry, 31, 7134–7155, 1990.
40. Laskowski, R.A., Watson, J.D., Thornton, J.M., From protein structure to biochemical function? J. Struct. Funct. Genomics, 4, 167–177, 2003.
41. Travers, A., DNA conformation and protein binding. Annu. Rev. Biochem., 58, 427–452, 1989.
42. Bjorkman, P.J. and Parham, P., Structure, function and diversity of class I major histocompatibility complex molecules. Annu. Rev. Biochem., 59, 253–288, 1990.
43. Yang, J., Cao, R., Si, D., EMNets: A Convolutional Autoencoder for Protein Surface Retrieval Based on Cryo-Electron Microscopy Imaging, in: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics—BCB '18, Washington, DC, USA, pp. 639–644, 2018.
44. Ng, A. and Si, D., Beta-Barrel Detection for Medium Resolution Cryo-Electron Microscopy Density Maps Using Genetic Algorithms and Ray Tracing. J. Comput. Biol., 25, 3, 326–336, Mar. 2018.
45. Li, R., Si, D., Zeng, T., Ji, S., He, J., Deep Convolutional Neural Networks for Detecting Secondary Structures in Protein Density Maps from Cryo-Electron Microscopy. Proceedings, 2016, 41–46, Dec. 2016.
46. Si, D., Ji, S., Nasr, K.A., He, J., A machine learning approach for the identification of protein secondary structure elements from electron cryo-microscopy density maps. Biopolymers, 97, 9, 698–708, Sep. 2012.
47. Huang, Q., Zhang, P., Wu, D., Zhang, L., Turbo Learning for CaptionBot and DrawingBot, in: Advances in Neural Information Processing Systems, vol. 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), pp. 6456–6466, Curran Associates, Inc., USA, 2018.
48. Xu, T. et al., AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
49. Kosylo, N. et al., Artificial Intelligence on Job-Hopping Forecasting: AI on Job-Hopping, in: 2018 Portland International Conference on Management of Engineering and Technology (PICMET), 2018.
50. Keasar, C. et al., An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12. Sci. Rep., 8, 1, 9939, Jul. 2018.
51. Hou, J., Wu, T., Cao, R., Cheng, J., Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. bioRxiv, Open Access 552422, 15 April 2019, https://doi.org/10.1002/prot.25697.
52. Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T., Tramontano, A., Critical assessment of methods of protein structure prediction (CASP)-Round XII. Proteins, 86, Suppl. 1, 7–15, Mar. 2018.
53. Kendrew, J.C., Dickerson, R.E., Strandberg, B.E., Hart, R.J., Davies, D.R., Phillips, D.C., Shore, V.C., Structure of myoglobin: A three-dimensional Fourier synthesis at 2 Å resolution. Nature, 185, 422–427, 1960.
54. Perutz, M.F., Rossmann, M.G., Cullis, A.F., Muirhead, G., Will, G., North, A.T., Structure of haemoglobin: A three-dimensional Fourier synthesis at 5.5 Angstrom resolution, obtained by X-ray analysis. Nature, 185, 416–422, 1960.
55. Travers, A., DNA conformation and protein binding. Annu. Rev. Biochem., 58, 427–452, 1989.
56. Bjorkman, P.J. and Parham, P., Structure, function and diversity of class I major histocompatibility complex molecules. Annu. Rev. Biochem., 59, 253–288, 1990.
57. Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., Tasumi, M., The protein data bank. Eur. J. Biochem., 80, 319–324, 1977.
58. Consortium, U., The universal protein resource (UniProt). Nucleic Acids Res., 36, D190–D195, 2008.
59. Kabsch, W. and Sander, C., Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637, 1983.
60. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., Scop: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540, 1995.
61. Andreeva, A., Howorth, D., Chothia, C., Kulesha, E., Murzin, A.G., SCOP2 prototype: A new approach to protein structure mining. Nucleic Acids Res., 42, 310–314, 2014.
62. Sillitoe, I., Lewis, T.E., Cuff, A., Das, S., Ashford, P., Dawson, N.L., Furnham, N., Laskowski, R.A., Lee, D., Lees, J.G., Cath: Comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res., 43, 376–381, 2015.
63. Karplus, K., Barrett, C., Hughey, R., Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 10, 846–856, 1998.
64. Eddy, S.R., Profile hidden Markov models. Bioinformatics, 14, 755–763, 1998.
65. Soeding, J., Protein homology detection by HMM–HMM comparison. Bioinformatics, 21, 951–960, 2005.
66. Rost, B. and Sander, C., Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 2, 584–599, 1993, https://doi.org/10.1006/jmbi.1993.1413.
67. Rost, B., PHD: Predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol., 266, 525–539, 1996, https://doi.org/10.1016/s0076-6879(96)66033-9.
68. Jones, D.T., Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 2, 195–202, 1999.
69. Cuff, J.A., Clamp, M.E., Siddiqui, A.S., Finlay, M., Barton, G.J., JPred: A consensus secondary structure prediction server. Bioinformatics, 14, 10, 892–893, 1998.
70. LeCun, Y., Bengio, Y., Hinton, G., Deep learning. Nature, 521, 7553, 436–444, 2015.
71. Zhu, J., Wang, S., Bu, D., Xu, J., Protein threading using residue covariation and deep learning. Bioinformatics, 34, 13, i263–i273, 2018.
72. Xu, J. and Wang, S., Analysis of distance-based protein structure prediction by deep learning in CASP13. Proteins, 87, 12, 1069–1081, 2019, https://doi.org/10.1002/prot.25810.
73. Xu, J., Distance-based protein folding powered by deep learning. Proc. Natl. Acad. Sci. U.S.A., 116, 34, 16856–16865, 2019.
74. Greener, J.G., Kandathil, S.M., Jones, D.T., Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat. Commun., 10, 3977, 2019.
75. Senior, A.W., Evans, R., Jumper, J. et al., Protein structure prediction using multiple deep neural networks in CASP13. Proteins, 87, 12, 1041–1048, 2019, https://doi.org/10.1002/prot.25834.
76. Qiu, J., Sheffler, W., Baker, D., Noble, W.S., Ranking predicted protein structures with support vector regression. Proteins, 71, 1175–1182, 2007.
77. Joo, H. and Tsai, J., An amino acid code for β-sheet packing structure. Proteins, 82, 9, 2014.
78. Crick, F.H., The packing of α-helices: simple coiled-coils. Acta Crystallogr., 6, 689–697, 1953.
79. von Mering, C., Krause, R., Snel, B. et al., Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417, 6887, 399–403, 2002.
80. Hakes, L., Lovell, S.C., Oliver, S.G. et al., Specificity in protein interactions and its relationship with sequence diversity and coevolution. PNAS, 104, 19, 7999–8004, 2007.
81. Hartwell, L.H., Hopfield, J.J., Leibler, S., Murray, A.W., From molecular to modular cell biology. Nature, 402, c47–c52, 1999.
82. Jeong, H., Mason, S., Barabási, A.L. et al., Lethality and centrality in protein networks. Nature, 411, 6833, 41–42, 2001.
83. Giot, L. et al., A protein interaction map of Drosophila melanogaster. Science, 302, 1727–1736, 2003.
84. Li, S., Armstrong, C., Bertin, N., A map of the interactome network of the metazoan. Science, 303, 5657, 540–543, 2004.
85. Wuchty, S., Scale-free behavior in protein domain networks. Mol. Biol. Evol., 18, 9, 1694–1702, 2001.
86. del Sol, A. and O'Meara, P., Small-world network approach to identify key residues in protein–protein interaction. Proteins, 58, 3, 672–682, 2004.
87. del Sol, A., Fujihashi, H., O'Meara, P., Topology of small-world networks of protein–protein complex structures. Bioinformatics, 21, 8, 1311–1315, 2005.
88. Brohée, S. and van Helden, J., Evaluation of clustering algorithms for protein–protein interaction networks. BMC Bioinf., 7, 48, 2006.
89. Spirin, V. and Mirny, L.A., Protein complexes and functional modules in molecular networks. PNAS, 100, 12123–12128, 2003.
90. Bu, D., Zhao, Y., Cai, L. et al., Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res., 31, 9, 2443–2450, 2003.
91. Nicolas, J., Artificial intelligence and bioinformatics. 2018, https://doi.org/10.1007/978-3-030-06170-8_7.
92. Dimova, D. and Bajorath, J., Advances in activity cliff research. Mol. Inf., 35, 5, 181–191, 2016.
93. Stumpfe, D., Hu, H., Bajorath, J., Evolving Concept of Activity Cliffs. ACS Omega, 4, 11, 14360–14368, 2019.
94. Kitchen, D.B., Decornez, H., Furr, J.R., Bajorath, J., Docking and scoring in virtual screening for drug discovery: Methods and applications. Nat. Rev. Drug Discovery, 3, 11, 935, 2004.
95. Ferreira, L.G., dos Santos, R.N., Oliva, G., Andricopulo, A.D., Molecular docking and structure-based drug design strategies. Molecules, 20, 7, 13384–13421, 2015.
96. Dos Santos, R.N., Ferreira, L.G., Andricopulo, A.D., Practices in Molecular Docking and Structure-Based Virtual Screening. Methods Mol. Biol., 1762, 31–50, 2018, https://doi.org/10.1007/978-1-4939-7756-7_3.
97. Brown, J.B., Niijima, S., Okuno, Y., Compound–Protein Interaction Prediction Within Chemogenomics: Theoretical Concepts, Practical Usage, and Future Directions. Mol. Inf., 32, 906–921, 2013.
98. Qiu, T., Qiu, J., Feng, J., Wu, D., Yang, Y., Tang, K., Cao, Z., Zhu, R., The recent progress in proteochemometric modelling: Focusing on target descriptors, cross-term descriptors and application scope. Brief Bioinform., 18, 1, 125–136, 2017.
99. Jackson, M.J., Esnouf, M.P., Winzor, D., Duewer, D., Defining and measuring biological activity: Applying the principles of metrology. Accredit. Qual. Assur., 12, 6, 283–294, 2007, https://doi.org/10.1007/s00769-006-0254-1.
100. Vamathevan, J., Clark, D., Czodrowski, P., Dunham, I., Ferran, E., Lee, G., Li, B., Madabhushi, A., Shah, P., Spitzer, M., Zhao, S., Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discovery, 18, 6, 463–477, 2019, https://doi.org/10.1038/s41573-019-0024-5.
101. Sidey-Gibbons, J. and Sidey-Gibbons, C.J., Machine learning in medicine: A practical introduction. BMC Med. Res. Method., 19, 1, 64, 2019, https://doi.org/10.1186/s12874-019-0681-4.
102. Greene, N., Judson, P.N., Langowski, J.J., Marchant, C.A., Knowledge-based expert systems for toxicity and metabolism prediction: DEREK, StAR and METEOR. SAR QSAR Environ. Res., 10, 2–3, 299–314, 1999.
103. Raies, A.B. and Bajic, V.B., In silico toxicology: computational methods for the prediction of chemical toxicity. Wiley Interdiscip. Rev. Comput. Mol. Sci., 6, 2, 147–172, 2016, https://doi.org/10.1002/wcms.1240.
104. Patlewicz, G., Jeliazkova, N., Safford, R.J., Worth, A.P., Aleksiev, B., An evaluation of the implementation of the Cramer classification scheme in the Toxtree software. SAR QSAR Environ. Res., 19, 5–6, 495–524, 2008.
105. Agrahari, R., Foroushani, A., Docking, T.R. et al., Applications of Bayesian network models in predicting types of hematological malignancies. Sci. Rep., 8, 6951, 2018, https://doi.org/10.1038/s41598-018-24758-5.
106. Ahmed, A., Abdo, A., Salim, N., Ligand-based virtual screening using Bayesian inference network and reweighted fragments. Sci. World J., 2012, 410914, 2012.
107. Madhukar, N.S., Khade, P.K., Huang, L. et al., A Bayesian machine learning approach for drug target identification using diverse data types. Nat. Commun., 10, 5221, 2019, https://doi.org/10.1038/s41467-019-12928-6.
108. Hinselmann, G., Rosenbaum, L., Jahn, A., Fechner, N., Ostermann, C., Zell, A., Large-scale learning of structure–activity relationships using a linear support vector machine and problem-specific metrics. J. Chem. Inf. Model., 51, 2, 203–213, 2011.
109. Mahé, P. and Vert, J., Graph kernels based on tree patterns for molecules. Mach. Learn., 75, 3–35, 2009, https://doi.org/10.1007/s10994-008-5086-2.
110. Byvatov, E., Fechner, U., Sadowski, J., Schneider, G., Comparison of support vector machine and artificial neural network systems for drug/non-drug classification. J. Chem. Inf. Comput. Sci., 43, 6, 1882–1889, 2003, https://doi.org/10.1021/ci0341161.
111. Sakiyama, Y., Yuki, H., Moriya, T. et al., Predicting human liver microsomal stability with machine learning techniques. J. Mol. Graph. Model., 26, 6, 907–915, 2008.
112. Wang, C. and Zhang, Y., Improving scoring-docking-screening powers of protein–ligand scoring functions using random forest. J. Comput. Chem., 38, 3, 169–177, 2017.
113. Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., Blaschke, T., The rise of deep learning in drug discovery. Drug Discovery Today, 23, 6, 1241–1250, 2018.
114. Marini, F., Roncaglioni, A., Novic, M., Variable selection and interpretation in structure-affinity correlation modeling of estrogen receptor binders. J. Chem. Inf. Model., 45, 6, 1507–1519, 2005.
115. Kazius, J., Nijssen, S., Kok, J.N., Bäck, T., IJzerman, A.P., Substructure Mining Using Elaborate Chemical Representation. J. Chem. Inf. Model., 46, 2, 597–605, 2006.
116. Raschka, S., Scott, A.M., Huertas, M., Li, W., Kuhn, L.A., Automated Inference of Chemical Discriminants of Biological Activity. Methods Mol. Biol., 1762, 307–338, 2018.
117. Ramraj, T. and Prabhakar, R., Frequent Subgraph Mining Algorithms—A Survey. Proc. Comput. Sci., 47, 197–204, 2015, https://doi.org/10.1016/j.procs.2015.03.198.
118. Mrzic, A., Meysman, P., Bittremieux, W. et al., Grasping frequent subgraph mining for bioinformatics applications. BioData Min., 11, 20, 2018.
3

Predictive Analysis in Healthcare Using Feature Selection

Aneri Acharya, Jitali Patel* and Jigna Patel

Computer Science and Engineering Department, Institute of Technology, Nirma University, Ahmedabad, India
Abstract
Diagnosis of chronic disease is essential in the healthcare domain, as these diseases are very lethal and persist for a long time. It is beneficial if these diseases are predicted at an early stage. With the advent of AI, various ML algorithms are useful in the early prediction of diseases, but the dataset may be unbalanced, which increases the false positive rate, and irrelevant features present in the dataset lead to poor performance of the models. This chapter aims to apply various methods to enhance the performance of machine learning models used in predictive analysis. Two chronic diseases, diabetes and hepatitis, are explored in this chapter. The experiment has been carried out in four tasks. In the first task, four machine learning models are applied; in the second, three ensemble learning techniques are implemented. The third task involves the application of 11 feature selection techniques. In the last task, four data balancing techniques are implemented: random sampling, SMOTE analysis, ADASYN, and borderline SMOTE. On the diabetes dataset, the highest accuracy of 81% is obtained by applying SMOTE analysis with the random forest model. On the hepatitis dataset, the highest accuracy of 94% is obtained by applying random sampling with the random forest algorithm.

Keywords: Healthcare, chronic diseases, predictive analysis, feature selection, imbalance dataset, machine learning
*Corresponding author: [email protected] Sujata Dash, Subhendu Kumar Pani, S. Balamurugan and Ajith Abraham (eds.) Biomedical Data Mining for Information Retrieval: Methodologies, Techniques and Applications, (53–102) © 2021 Scrivener Publishing LLC
3.1 Introduction

Healthcare is one of the most prominent sectors of any country and needs more attention and care than any other sector. The success of any country is often mapped to how advanced and powerful its healthcare sector is. This era is the era of technology, automation, and creativity, and we are all aware of the wonders that can be done by using technology in the right way at the right place. India is a country of sages, with strong roots in a traditional healthcare system invented in India itself: Ayurveda. With the advent of science and technology, scientists and pharmacists have successfully invented vaccines for many deadly diseases. We have established many sophisticated laboratories and advanced hospitals that can treat patients in bulk, provide the best quality of diagnostic treatment, and have successfully improved mortality rates across the globe. But was that the end of the technological era and its wonders? No, it was just the beginning: all these advanced tools, technologies, and vaccines can cure patients only after a disease has been diagnosed. With the increase in population, it is sometimes difficult to diagnose patients accurately before they reach a critical stage. It would be a boon to mankind if we were able to predict the possibility of such lethal diseases in advance, based on understanding the patterns and probabilities of their occurrence [36]. With the advent of machine learning, we are able to perform predictive analysis based on past collected patient data [1]. The leading chronic diseases include diabetes, cancer, arthritis, cardiovascular disease, and hepatitis. These chronic diseases are very lethal and need proper diagnosis at an early stage; once a patient enters the final or critical stage, it is difficult to save their life. Hence, early detection of such lethal diseases can help doctors take proper precautionary steps and save lives.
3.1.1 Overview and Statistics About the Disease

3.1.1.1 Diabetes

Diabetes is caused by an increase in the sugar level in our body. Glucose is the main source of energy, which we get from the food we eat. Insulin is a hormone secreted by our pancreas.
Insulin helps in the absorption of glucose from our blood. Our cells cannot absorb energy directly from the food we eat. The food must be converted into a form that the blood cells can absorb; this process is called digestion and is facilitated by various chemical messengers called hormones. After digestion and assimilation, the glucose level in our blood increases. This glucose is the main agent that provides energy to the body. If the glucose level in the blood rises above a certain threshold, a signal is sent to the cells of the pancreas (known as beta cells) to secrete the chemical substance called insulin. Insulin then signals the blood cells to absorb glucose from the bloodstream. Insulin thereby helps regulate glucose in our body. If the secretion of insulin is too low, the sugar level in the body increases; this is called hyperglycemia. If the secretion of insulin is too high, the sugar level decreases; this is called hypoglycemia. Type 1 diabetes [2] is caused when our own immune system destroys the beta cells (from which insulin is secreted), and is mainly attributed to environmental and genetic factors. Type 2 diabetes is the most commonly occurring form, as it is driven by changes in lifestyle: it is mainly caused by overweight, obesity, and physical inactivity. Another type is gestational diabetes, which occurs during pregnancy because of hormonal changes. According to WHO [3] reports, the number of people suffering from diabetes rose from 108 million in 1980 to 422 million in 2014. In 2016, an estimated 1.6 million deaths were caused by this disease, and WHO declared diabetes the seventh leading cause of death that year. According to the International Diabetes Federation (IDF), 371 million people across the globe are affected by this chronic disease, and surprisingly 187 million of them do not know it. This is an alarming situation that needs greater attention. These statistics provide the motivation to carry out research on diagnosing the disease before it manifests, by studying the various attributes that make people more vulnerable to it.
3.1.1.2 Hepatitis

Hepatitis is a viral disease that affects our liver. There are five types of hepatitis: hepatitis A, B, C, D, and E.
Hepatitis A is mainly caused by the ingestion of contaminated water or food. Hepatitis B and C are sexually transmitted diseases, transmitted chiefly via body fluids such as blood or semen. Hepatitis D is a very rare form of the disease and is mainly transmitted by direct contact with infected blood; it cannot grow without the presence of the hepatitis B virus and is rarely found in the population. Hepatitis E is a waterborne disease, mainly found in areas of poor sanitation, and is caused by the ingestion of water contaminated with fecal matter. Non-infectious causes include the excessive consumption of alcohol, which can cause liver inflammation, and autoimmune responses. The major symptoms are fatigue, dark urine, pale stool, abdominal pain, loss of appetite, and unexpected weight loss. According to the WHO report of 2015 [4], 257 million people were diagnosed with this disease, and an estimated 887,000 people died of it worldwide. Hepatitis is reported to be most common among the people of the WHO Western Pacific Region and the WHO African Region, where about 6.2 and 6.1% of the adult population, respectively, suffer from the disease [5]. Among all the hepatitis forms, hepatitis B is the most common, affecting 2 billion people all over the world [6]. It is estimated that around 2.3 billion people are affected by hepatitis globally; viral hepatitis causes 90% of the resulting deaths, which amount to 1.4 million deaths each year.
3.1.2 Overview of the Experiment Carried Out

Predictive analysis is about predicting unknown future values based on past historical data. Models are first trained on the collected data, then tested against a testing dataset, and the errors are computed. Based on the errors, the hyper-parameters of the models are tuned and penalized to obtain a more accurate model. Once the model is trained properly, it can be used for predicting unknown values. This is the basic theory behind any predictive analysis. Predictive analysis is carried out in many domains, for example in social media analysis, advertisement and marketing, stock market analysis, the healthcare domain, and many more. It has emerged as one of the most beneficial techniques for predicting future risk in risk management, weather forecasting, analyzing
market trends, analyzing and predicting customer behavior, and many more. In this chapter, we aim to learn how predictive analysis is applied in the healthcare sector along with various feature selection techniques. Two chronic diseases are explored here, using the diabetes and hepatitis datasets. Four machine learning algorithms are implemented for predictive analysis: logistic regression, SVM, ANN, and decision tree. It is always better to compare the performance of one model with others, so alongside the four machine learning algorithms, an ensemble learning approach has also been applied to the selected datasets. Ensemble learning allows us to combine more than one model; it helps enhance the performance of the model and also reduces errors. Here the random forest algorithm, the AdaBoost algorithm, and a bagging strategy have been applied to the selected datasets. With the advent of technology and machine learning, many experiments have been carried out and many models have been trained that give fairly good accuracy. But accuracy is not the best performance measure for judgment, especially in this domain, because the cost of a false negative prediction for a patient is very high and can cause severe harm. Hence we need to explore techniques such as feature selection and dataset balancing. The dataset on which the models are trained proves to be the most important asset for predictive analysis. Irrelevant features can lead to poor model performance, decrease accuracy, cause over-fitting, and increase the complexity and training time of the model. Hence, to achieve better proficiency in predictive analysis, we apply feature selection techniques to select the best features in the given dataset and to remove unwanted and irrelevant parameters. Two types of feature selection techniques are discussed in this chapter (a brief sketch of both appears below). The first is the filter method, in which features are ranked based on certain performance measures; filter methods give fast and efficient results for voluminous data. The second is the wrapper method, which uses a greedy approach, evaluating combinations of features and selecting the most optimal feature subset for a particular model. By doing so, we can increase awareness among people and help prevent such diseases beforehand. It will also help doctors monitor patients on these features.
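The following minimal sketch contrasts the two feature-selection families using scikit-learn: a filter method that scores each feature independently, and a wrapper method that searches feature subsets with a model in the loop. The data are synthetic stand-ins for a diabetes-style table, not the datasets used in this chapter.

```python
# Filter vs wrapper feature selection on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

# Filter method: rank features by a statistical score (fast, model-agnostic).
filt = SelectKBest(score_func=f_classif, k=4).fit(X, y)
print("filter keeps features:", filt.get_support(indices=True))

# Wrapper method: greedily eliminate features using the model itself.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("wrapper keeps features:", wrap.get_support(indices=True))
```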
Another major problem faced when applying predictive analytics in the healthcare domain is that the datasets are highly imbalanced. The proportion of people suffering from any disease is always smaller than the proportion of healthy people, so the tuples representing diseased patients form a minority class with fewer data points. Most classifiers [7] tend to be biased towards the majority class and, as a result, classify every tuple in favor of the majority class, which is one of the major issues to address: even a single patient who has the disease but is predicted as healthy is dangerous. Various techniques can help balance an imbalanced dataset. Random sampling, SMOTE analysis, the ADASYN technique, and borderline SMOTE are implemented on the selected datasets. With the help of these sampling techniques, we can increase the number of minority class data points and make the imbalanced dataset balanced.
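As a minimal sketch of the balancing step, the following uses the imbalanced-learn library, which exposes SMOTE, ADASYN and BorderlineSMOTE through the same fit_resample interface. The data are synthetic, and resampling only the training split is a standard precaution rather than a detail taken from this chapter's experiments.

```python
# Over-sample the minority class with SMOTE; swap in ADASYN or
# BorderlineSMOTE for the other techniques named above.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # also: ADASYN, BorderlineSMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

print("before:", Counter(y_tr))
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
print("after: ", Counter(y_res))   # classes are now balanced in the training set
```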
3.2 Literature Review

3.2.1 Summary

The aim of this paper [1] is to survey the application of different feature selection techniques to the diabetes dataset. The authors survey different work proposed for the prediction of chronic diseases using different methodologies. They explain three feature selection methods: the filter method, the wrapper method, and the embedded method. They list the possible techniques within each of these feature selection methods along with their merits and demerits. They also compare the work carried out by different researchers in the field of healthcare under each feature selection method, and list work done using hybrids of the three feature selection techniques. In this paper, a comparison is shown between adaptive classification systems and traditional classification systems for the prediction of chronic diseases. The major chronic diseases and datasets discussed in this paper are:

1. Diabetes—Pima Indians Diabetes Dataset
2. Kidney—Chronic Kidney Disease Dataset
3. Cardiovascular disease—Statlog (Heart) Data Set
4. Breast Cancer—Breast Cancer Wisconsin (Diagnostic) Data Set
5. Arrhythmia—Arrhythmia Data Set
6. Hepatitis—Hepatitis Data Set
7. Lung Cancer—Lung Cancer Data Set
8. Parkinson's—Parkinson's Data Set.
This paper [8] discusses various tools and technologies that can be used for performing predictive analysis in the healthcare domain, along with applications of machine learning techniques in different domains. It summarizes the role of machine learning algorithms and their application in carrying out predictive analysis in healthcare, explains the significance of predictive analysis and the methodology for performing it, and briefly describes the three main classes of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. It lists the machine learning algorithms and techniques belonging to each class that can be used for predictive analysis, and surveys the machine learning tools and libraries available for the purpose. Machine learning is being used in a wide range of application domains, and a survey of the following domains is shown in this paper:

1. Financial Services and Economics
2. Administration in Government Organizations
3. Healthcare Industry
4. Promotion, Sales and Marketing
5. Shipping and Transportation.

Narrowing the domain further, a survey of predictive analysis in the healthcare industry is shown for the following diseases:

1. Predictions on Cardiovascular Diseases
2. Diabetes Predictions
3. Hepatitis Disease Prediction
4. Cancer Predictions Using Machine Learning.
This paper [9] describes the application of data mining techniques for the detection of diabetes. The experiment is performed on the PIMA diabetes dataset, with 50% of the data taken for training and the rest used for testing. The following feature selection techniques are then applied with ranker search on the given dataset:

1. Chi-squared
2. Gain Ratio
3. Information Gain
4. One Attribute Evaluation
5. Relief Attribute Evaluation
6. Symmetric Uncert
7. SVM.
The accuracy increased from 65 to 76%, and an accuracy of 81.72% is obtained for the Dagging classifier. This paper [10] shows different machine learning techniques used for text mining and extraction, and the use of NLP in predictive analysis. Two tasks are carried out: the first extracts all the information from medical papers related to diseases and treatments, and the second extracts just the relevant information. The authors use a combination of NLP and ML packages to extract the relevant information. After applying a stemming algorithm, they extract semantic relations from the data by applying Multinomial Naïve Bayes along with Apriori association rule mining. They then implement a finer-grained classification of these sentences according to the semantic relations that exist between diseases and treatments. They conclude by displaying F-measure, precision, and recall bar charts for cancer abstracts using NB, SVM, Decision Tree, and an Adaptive Learning Classifier, obtaining maximum values for the decision tree and SVM. Data mining algorithms such as apriori and predictive apriori, as used in Ref. [11], can also be applied to generate the most confident association rules. In that paper, the authors used the PIMA diabetes dataset and applied the apriori and predictive apriori data mining algorithms; the outcome was 10 association rules with 99% confidence. In Ref. [12], pre-processing is done using the Java programming language in NetBeans. In addition, two WEKA algorithms (K-means and MLP) are implemented on a hepatitis dataset collected from three big cities of the Kingdom of Saudi Arabia, and the complete KDD process is applied. In [13], the authors apply data cleaning and data smoothing methods of data mining to the clinical dataset. They then
apply a backpropagation neural network for classification and test the classifier using the hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository, obtaining accuracies of 97.3, 98.6, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. In big data analytics, feature selection techniques can be very beneficial, as the data are large in volume and have a large number of features. The problem [14], however, is the type of data: most real-time data is either semi-structured or unstructured, so the challenge is to apply feature selection techniques to unstructured or semi-structured data. In Ref. [15], the author used a cloud-enabled big data analytics platform to analyze the semi-structured data generated in the healthcare domain; inter- and intra-cluster correlation techniques were used for analyzing the dataset, and 98% accuracy was then obtained by applying the FHCP algorithm for predicting future healthcare conditions. The problem of handling the large amount of data generated in the healthcare sector can be addressed using big data analytics. In Ref. [16], the authors used Hue 3.7.1, an open web interface, along with the Hortonworks Data Platform, to carry out analytical operations on data stored in Apache Hadoop; the medical dataset is displayed using the Pig editor. Most of the time, feature selection techniques select an optimal global feature subset that is applied over all regions of the sampled subspace. The paper [17] uses a localized feature selection technique that adapts optimally to the local variation present in the dataset; the methodology is based on linear programming and hence provides the advantages of convexity and efficient implementation.
3.2.2 Comparison of Papers for Diabetes and Hepatitis Dataset Table 3.1 compares different research papers that use the diabetes dataset, contrasting the ML algorithms used in each paper; accuracy is used as the measure of comparison. Table 3.2 gives the same comparison for research papers that use the hepatitis dataset.
Table 3.1 Comparison of research papers for the diabetes dataset.

Reference [18]
Dataset used: Pima Indians Diabetes Dataset, collected from the National Institute of Diabetes and Digestive and Kidney Diseases; 768 instances in total.
Methodology: (1) preprocess the dataset using WEKA; (2) split the dataset into a 70–30% train–test partition; (3) evaluate the performance matrix.
Algorithms discussed: Naïve Bayes, Simple CART, Random Forest, Support Vector Machine.
Results (accuracy): 79%, 77%, 76.5% and 76.5%, with SVM achieving the highest accuracy of 79%.
Conclusions drawn: (1) SVM is the best algorithm in terms of accuracy; (2) the training time of Naïve Bayes is the lowest; (3) the training time of Simple CART is the highest, so this algorithm can simply be discarded; (4) the precision of SVM (0.784) is the highest and that of Random Forest (0.756) the lowest; (5) the F-measure of SVM (0.782) is also the highest and that of Simple CART (0.446) the lowest.

Reference [19]
Dataset used: Pima Indians Diabetes dataset.
Methodology: using the Enthought Canopy tool, (1) preprocess the data and fill in the missing values; (2) run the feature selection process (8 features selected); (3) evaluate the performance matrix.
Algorithms discussed: Naïve Bayes, K-Nearest Neighbor, SVM, Decision Tree, Logistic Regression, Random Forest.
Results (accuracy): 74%, 77%, 77%, 71%, 71% and 74% across the six algorithms, with SVM and KNN achieving the highest accuracy of 77%.
Conclusions drawn: (1) SVM and KNN have the highest accuracy; (2) of the 8 main attributes selected by feature selection, "Plasma glucose concentration" is the most significant, "Body mass index" and "Age" have the second highest importance, and "2-Hour serum insulin (mu U/ml)" has the lowest importance and can therefore be discarded.

Reference [20]
Dataset used: Pima Indians Diabetes dataset.
Methodology: using the WEKA tool, (1) preprocess the dataset; (2) feature selection (8 attributes used); (3) split into train and test sets; (4) build the classifier model; (5) test the model using the testing dataset; (6) evaluate the performance matrix.
Algorithms discussed: SVM, Naïve Bayes, Decision Tree.
Results (accuracy): SVM 65.10%, Naïve Bayes 76.30%, Decision Tree 73.82%.
Conclusions drawn: (1) Naïve Bayes has the maximum accuracy; (2) Naïve Bayes has the highest precision (0.759), then Decision Tree (0.735), while SVM has the lowest (0.424); (3) Naïve Bayes has the highest recall (0.763), then Decision Tree (0.738), while SVM has the lowest (0.651); (4) Naïve Bayes has the highest F-measure (0.760), then Decision Tree (0.736), while SVM has the lowest (0.513); (5) Naïve Bayes has the highest ROC value (0.819). Overall, Naïve Bayes gives the best result.

Reference [21]
Dataset used: Pima Indians Diabetes dataset.
Methodology: the classification is done using the AdaBoost algorithm.
Algorithms discussed: Decision Tree, Naïve Bayes, SVM, Decision Stump.
Results (accuracy): Decision Tree 77.6%, Naïve Bayes 79.68%, SVM 79.68%, Decision Stump 80.70%.
Conclusions drawn: (1) Decision Stump provides the best accuracy; (2) SVM has the highest sensitivity (90.7%) and Decision Tree the lowest (85.38%); (3) Decision Stump has the highest specificity (64.5%) and SVM the lowest (56.4%); (4) Decision Stump has the lowest error rate (19.27%) and Decision Tree the highest (22.39%).

Reference [22]
Dataset used: Pima Indians Diabetes dataset.
Methodology: min–max scaling applied to the 8 most important attributes.
Algorithms discussed: SVM, KNN, Gaussian Naïve Bayes, Artificial Neural Network.
Results (accuracy): SVM 78.05%, KNN 75.5%, Gaussian Naïve Bayes 79.3%, ANN 82.35%.
Conclusions drawn: ANN provides the highest accuracy with min–max scaling and was therefore used to develop a web application for the prediction of diabetes, using PHP as the backend language, JavaScript frameworks for the frontend and TensorFlow.js for the implementation of the machine learning model.

Table 3.2 Comparison of research papers on the hepatitis dataset.

Reference [23]
Dataset used: data collected from different hospitals and checked by a gastroenterologist; the dataset contains 25 attributes.
Methodology: (1) pre-process the data based on the mode of transmission detected; (2) apply the ML algorithms and evaluate the performance matrix.
Algorithms discussed: Naïve Bayes, Random Forest, K-Nearest Neighbor.
Results (accuracy): 93.2%, 98.6% and 95.8%.
Conclusions drawn: (1) the highest accuracy is achieved by K-Nearest Neighbor; (2) the minimum error is recorded for Random Forest and the maximum error for Naïve Bayes; (3) the highest TPR and FPR are seen in Random Forest, which is therefore more preferable than K-Nearest Neighbor.

Reference [24]
Dataset used: Indian liver patients dataset from the network repository.
Methodology: (1) scale the dataset for normalizing; (2) apply rule-based induction; (3) analyse the rules for diagnosing; (4) transform the robust multilinear mixed model using the Box–Cox transformation; (5) evaluate the performance matrix.
Algorithms discussed: Robust Box–Cox Transformation (RBCT) combined with RF, NN, DT, KNN and SVM.
Results (accuracy): 98.06%, 98.07%, 82.58%, 94.44% and 83.78% across the five algorithms, with RBCT RF and RBCT NN at roughly 98%.
Conclusions drawn: (1) the highest accuracy is observed for RBCT RF and RBCT NN; (2) the highest positive predictive value is obtained by RBCT SVM; (3) the highest sensitivity is also seen for RBCT SVM, hence it can be more preferred.

Reference [25]
Dataset used: clinical data collected from different hospitals.
Methodology: an artificial neural network and a rule-based system are explored in this paper.
Algorithms discussed: ANN.
Results/conclusions drawn: the model was trained using 1,218 facts in 87 training runs, and the RMSE value decreased to 0.079.

Reference [26]
Dataset used: Hepatitis Dataset from the UCI repository.
Methodology: pre-processed through WEKA; four algorithms applied.
Algorithms discussed: C4.5 Decision Tree.
Results (accuracy): 85.81%.
Conclusions drawn: Bilirubin and Varices are significant attributes.
3.3 Dataset Description 3.3.1 Diabetes Dataset • It is part of a larger dataset held by the National Institute of Diabetes and Digestive and Kidney Diseases. The patients belong to the Pima Indian heritage. • It holds the data of 768 female patients, collected over 8 attributes. • None of the attributes contain null values. All the attributes of the PIMA diabetes dataset are described in Table 3.3, along with the range of values of each attribute. Figure 3.1 shows the class distribution of the diabetes dataset: it has two class labels, and the figure shows the number of tuples present in each class.
Total instances = 768 Class label 0—person not having diabetes = 500 Class label 1—person having diabetes = 268
Table 3.3 PIMA diabetes dataset description.
Attribute | Description | Range value
Pregnancies | Number of pregnancies | [0–17]
Glucose | Plasma glucose concentration in an oral glucose tolerance test | [0–199]
Blood Pressure | Diastolic blood pressure | [0–122]
Skin Thickness | Triceps skin fold thickness | [0–99]
Insulin | 2-h serum insulin | [0–846]
BMI | Body mass index | [0–67]
Diabetes Pedigree Function | Diabetes pedigree function | [0–2.45]
Age | Age of the patient | [21–81]
Outcome | 0: person not having diabetes; 1: person having diabetes | [0, 1]
Outcome value counts: 0 → 500 tuples (0.651042); 1 → 268 tuples (0.348958).
Figure 3.1 Diabetes dataset class distribution.
About 65% of the tuples belong to class 0 and only 35% belong to class 1; hence the data is imbalanced.
3.3.2 Hepatitis Dataset • The hepatitis dataset was taken from the UCI machine learning repository. • The dataset contains information about patients who are living and those who died due to the hepatitis disease. • This dataset was found on OpenML (hepatitis). • It contains 155 tuples and 19 different attributes along with one class label. • The class has two labels: live and die. • The hepatitis dataset has a total of 167 missing values. Various techniques can be applied to fill the missing values; here I have filled them using the next data point available (a short pandas sketch of this step follows the class counts below). Table 3.4 shows the description of the attributes present in the hepatitis dataset, along with the range of values of each attribute. Figure 3.2 shows the class distribution of the hepatitis dataset: it has two class labels, live and die, indicating whether the patient lived or died, and the figure shows the number of tuples present in each class. Table 3.4 lists the name of each attribute, its data type, how many missing values the attribute has and the range of data the attribute holds.
Total tuples present in hepatitis dataset = 155 Number of tuples with class label “live” = 123 Number of tuples with class label “die” = 32
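The backward-fill step described above can be sketched in a few lines of pandas. This is a minimal illustration, not the chapter's exact code; the file name hepatitis.csv and the column name class are assumptions.

```python
# Minimal sketch: load the hepatitis data and fill missing values with the
# next available data point (backward fill), as described in the text.
import pandas as pd

df = pd.read_csv("hepatitis.csv")          # assumed file name

print(df.isnull().sum())                   # per-attribute missing counts (cf. Table 3.4)

# Fill each missing entry with the immediately following observation; a
# trailing NaN has no "next" value, so forward-fill is used as a fallback.
df = df.bfill().ffill()

print(df["class"].value_counts())          # class distribution: live vs. die
```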
Table 3.4 Description of hepatitis dataset's attributes.
Sr. no. | Attribute | dtype | Missing values | Range
1 | Age | int | 0 | 10–80
2 | Sex | object | 0 | male/female
3 | Steroid | object | 1 | true/false
4 | Antivirals | boolean | 0 | true/false
5 | Fatigue | object | 1 | true/false
6 | Malaise | object | 1 | true/false
7 | Anorexia | object | 1 | true/false
8 | Liver big | object | 10 | true/false
9 | Liver firm | object | 11 | true/false
10 | Spleen palpable | object | 5 | true/false
11 | Spiders | object | 5 | true/false
12 | Ascites | object | 5 | true/false
13 | Varices | object | 5 | true/false
14 | Bilirubin | float | 6 | 0.39–4
15 | Alk phosphate | float | 29 | 33–250
16 | Sgot | float | 4 | 13–500
17 | Albumin | float | 16 | 2.1–6
18 | Protime | float | 67 | 10–90
19 | Histology | boolean | 0 | true/false
20 | Class | object | 0 | live/die
Class value counts: live → 123 tuples (0.793548); die → 32 tuples (0.206452).
Figure 3.2 Class distribution of hepatitis dataset.
Here we can clearly see the imbalanced class distribution: 79% of the tuples belong to the class "live" and only 21% belong to the class "die", hence the data is imbalanced.
3.4 Feature Selection Feature selection is one of the most important parts of data pre-processing. It is the technique of selecting the most significant and relevant features from a given dataset. It is also called attribute or variable selection, and it combines knowledge from the machine learning domain and statistics. With the advent of technology and automation, the key ingredient required for any experiment is the historical records or the past dataset collected: the larger and richer the historical data, the more accurately the model can be trained. Hence the most important asset for any data scientist today is an appropriate dataset, and to make any dataset appropriate, feature selection plays a vital role in removing the irrelevant and unwanted features. Feature selection techniques give us the most significant set of attributes, one that makes a fruitful contribution to the predictive analysis, while removing the irrelevant attributes.
3.4.1 Importance of Feature Selection What are the irrelevant features in a dataset, and how do they affect the prediction process? Often, in a high-dimensional dataset, many features are redundant, are highly dependent on another feature, or make no meaningful contribution to the prediction of the target variable; such features are called irrelevant features. These [27] irrelevant features negatively influence the performance of the model: the more misleading the dataset, the poorer the accuracy. The various problems faced due to these irrelevant features are: • The redundant and highly dependent attributes add no meaningful information to the predictive modeling. • They increase the chances of overfitting. • They increase the complexity and the training time of the model. For example, in a medical dataset, attributes like age and date of birth are highly dependent on each other and only increase the complexity and training time. • If the dataset is noisy, then it will mislead the model and degrade its overall performance. • For data scientists, it is mandatory to understand the dataset thoroughly before carrying out any experiment, but the presence of such irrelevant attributes makes it hard to visualize the dataset properly and to learn the significant patterns from it. • Space and resources are unnecessarily wasted on such irrelevant attributes. These are some of the major hurdles faced while doing predictive analysis. Hence, to overcome such problems, it is important to remove the irrelevant features and select the most significant set of attributes, which helps in improving the performance of the model.
3.4.2 Difference Between Feature Selection, Feature Extraction and Dimensionality Reduction To overcome the curse of dimensionality we require a dimensionality reduction procedure. Dimensions here refer to variables/attributes, and dimensionality reduction means reducing the number of features of a high-dimensional dataset. Dimensionality reduction can be divided into two parts: feature selection and feature extraction.
Feature selection methods aim to remove the unwanted and irrelevant features from the original dataset. They rank the features based on some criterion and then select the k most important features based on the calculated rank and importance. The various feature selection methods are:
• Filter Method
• Wrapper Method
• Ensemble Method
Feature extraction methods aim to reduce the dimensionality by creating a new set of significant features from the original dataset and then simply discarding the original features [28]. A feature selection method does not create a new set of features; it simply discards the irrelevant features and keeps the important original features as they are. Feature extraction works on the initial raw data and generates a new set of features by transforming the raw data into features suitable for modeling. Different feature extraction methods are:
• Principal Component Analysis
• Independent Component Analysis
• Linear Discriminant Analysis
• Locally Linear Embedding
• t-distributed Stochastic Neighbor Embedding (t-SNE)
• Autoencoders
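The distinction can be made concrete with scikit-learn. The sketch below is illustrative only; the dataset is a stand-in, not one used in this chapter. Feature selection keeps a subset of the original columns unchanged, while an extraction method such as PCA manufactures entirely new components.

```python
# Feature selection vs. feature extraction on an example dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)

# Selection: keeps 5 of the original columns exactly as they are.
X_sel = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Extraction: builds 5 brand-new components combining all columns.
X_ext = PCA(n_components=5).fit_transform(X)

print(X_sel.shape, X_ext.shape)   # both (n_samples, 5), different meaning
```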
3.4.3 Why Traditional Feature Selection Techniques Still Hold True? Deep learning has evolved as one of the most powerful subdomains of machine learning. There are various neural network architectures, like CNNs, that are quite capable of extracting the most significant features from the data, but they have limitations too. If the dataset is not very large, then using deep learning models is not a wise decision. When the resources as well as the data are limited, using deep learning architectures would be a complete waste; in such scenarios the traditional feature selection methods remain beneficial.
3.4.4 Advantages and Disadvantages of Feature Selection Technique 3.4.4.1 Advantages 1. It [29] helps in reducing the chances of overfitting, as eliminating the redundant features reduces the errors made due to noise. 2. It helps in enhancing the overall performance of the model, as it eliminates the misleading data from the dataset. 3. It helps in reducing the training time of the model, as there is no point in wasting time training on features that make no significant contribution to predicting the target variable. 4. It reduces the complexity of the model, which makes interpretation easier. 5. By eliminating the noisy and misleading data, it makes the model less prone to errors.
3.4.4.2 Disadvantages 1. If the number of features in the dataset is already small, then eliminating more features increases the risk of overfitting and makes the model less generalizable. 2. If the number of features is large, then feature selection itself increases the computation time and complexity.
3.5 Feature Selection Methods The feature selection techniques are divided into three classes: the filter method, the wrapper method and the embedded method. We can also use a hybrid method, which combines two or more methods from the above-mentioned classes. In this chapter two of these classes of feature selection techniques are explored. The feature selection techniques are applied to the training dataset only, not to the entire dataset.
3.5.1 Filter Method The filter method uses various criteria to rank the features. This method depends solely on the characteristics of the attributes: the features are filtered out before any model is trained, independently of any machine learning algorithm. It acts as a filter that removes the attributes with low scores, treating them as irrelevant. After filtering the attributes, the models are trained using only the filtered, or selected, attributes. Based on the criteria and the scoring techniques used, the filter method is classified into three categories.
3.5.1.1 Basic Filter Methods These methods don't score the attributes; they are simple data cleaning methods that help in removing unwanted noise and irrelevant attributes from the dataset (a short scikit-learn sketch follows this list). 1. Removing Constant Features Constant [30] features don't provide any significant information that helps ML models predict the data. Hence they are irrelevant, and keeping them in the training dataset will only increase the training time and model complexity. 2. Quasi-Constant Features These are features for which a single value occupies the majority of the tuples in the dataset. Hence they are close to constant features, depicting a nearly constant value because the majority of the tuples share it. 3. Duplicated Features Before training any model it is necessary to identify duplicate and redundant attributes, as they can lead to overfitting of the model and increase its complexity while giving no significant information for predicting the target values.
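The three basic filters above can be sketched with scikit-learn and pandas. This is a minimal illustration under stated assumptions: X_train is a pandas DataFrame of training features, and the 0.01 variance threshold for quasi-constant features is an illustrative choice rather than one fixed by the chapter.

```python
# Basic filter steps: constant, quasi-constant and duplicated features.
from sklearn.feature_selection import VarianceThreshold

# Constant features have zero variance; a small positive threshold also
# drops quasi-constant features whose value barely varies across tuples.
selector = VarianceThreshold(threshold=0.01)
selector.fit(X_train)
kept_columns = X_train.columns[selector.get_support()]
X_train = X_train[kept_columns]

# Duplicated features show up as duplicated rows once the frame is
# transposed; keep only the first occurrence of each column.
X_train = X_train.loc[:, ~X_train.T.duplicated()]
```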
3.5.1.2 Correlation Filter Methods The correlation matrix determines the relationship between two attributes. The [32] attributes that are highly correlated are highly dependent on each other, and we can predict one from the other; for example, the height and weight of a person are highly correlated attributes, and as the height of the person increases, the weight typically also increases. Hence, if two features are highly correlated, then they are somewhat redundant, as we can predict the target class value from just one of them. We need to remove the highly correlated attributes, as they add no significant information for the prediction of the target class and only increase the dimensionality of, and noise in, the dataset (a short sketch of this filter is given below). For the two chronic disease datasets used here, the correlated features are: diabetes dataset: Insulin, Age, BMI; hepatitis dataset: histology, albumin, protime.
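A hedged sketch of the correlation filter follows; X_train is again an assumed DataFrame, and the 0.8 cut-off is an illustrative threshold, not one stated in the chapter.

```python
# Drop one feature from every highly correlated pair.
import numpy as np

corr = X_train.corr().abs()

# Keep only the upper triangle so each pair is inspected exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
X_train = X_train.drop(columns=to_drop)
print("Removed correlated features:", to_drop)
```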
3.5.1.3 Statistical & Ranking Filter Methods These methods evaluate the attributes by performing statistical tests and give each attribute the score it obtained in the test. The attributes are then ranked based on their scores: the attributes with higher scores are the most significant, and those with lower scores are the least significant. Based on the statistical test used, the following methods are distinguished: 1. Chi-Squared Score This method is used for categorical attributes only. It calculates the expected values from the contingency table and then the difference between the observed and expected values. If the calculated test statistic is sufficiently large, then we can say that the variables are dependent on each other.

Chi-square value = Σ (Observed value − Expected value)² / Expected value
2. Based on Mutual Information Mutual information determines the mutual dependency between two variables: the amount of information about one variable that can be determined by observing the other. It seems similar to the correlation method, but it is more generic. The amount of information present is determined by calculating the entropy. If two attributes are completely independent, then their mutual information value will be zero; a high mutual information value indicates a large reduction in uncertainty, and a low value indicates a small reduction. 3. ANOVA Univariate Test ANOVA stands for analysis of variance; here the variance between the variables is compared. It is normally used when the distribution is normal. It assumes that there is a linear relationship between the attributes and the target variable and that the variables are normally distributed. It calculates the variance within each group and between the groups, and the F value is computed as:

F = Variance between the groups / Variance within the groups

If the F statistic is larger than the critical value, then we say that the attributes are dependent on each other. 4. Univariate ROC-AUC/RMSE This method uses any machine learning model. It is useful for any kind of variable and requires no assumption about the relationship between the attributes and the target variable. Here a decision tree classifier is used as the machine learning model to compute the ROC-AUC. It calculates the ROC-AUC and ranks the attributes according to their respective scores; the features with higher scores are more significant. 5. F Score This is a statistical test that compares the attributes and checks whether the differences between them are significant. It creates a constant attribute A and then computes the least-squares errors between the created constant attribute A and the attributes of our dataset. The F test checks whether the difference is significant or merely created by chance. One drawback of this test is that it only captures linear relationships between features and labels; the features that are highly correlated with the label are given higher scores. 6. Extra Tree Classifier This uses the decision tree and random forest algorithms to calculate the importance of the attributes and scores them accordingly. It is also known as the Extremely Randomized Trees Classifier. Here the Gini index is used as the parameter for the construction of the forest, and all the features are ordered in descending order of the computed Gini value: the higher the Gini value, the more important the attribute.
7. Pearson's Correlation Coefficient This is also called the f_regression method, where the correlation value between each regressor and the target is calculated by the formula below:

Cor(X[:, i], y) = ((X[:, i] − mean(X[:, i])) * (y − mean(y))) / (std(X[:, i]) * std(y))

The value is then converted into an F score and then into a p-value. A short sketch of applying several of these ranking filters with scikit-learn is given below.
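The ranking filters above are all available in scikit-learn. The following is a minimal sketch under stated assumptions: X_train and y_train come from the PIMA training split (so chi-square is applicable, since all PIMA attributes are non-negative), and k=5 is an arbitrary illustrative target.

```python
# Score and rank attributes with several statistical ranking filters.
from sklearn.feature_selection import (SelectKBest, chi2, f_classif,
                                       mutual_info_classif)
from sklearn.ensemble import ExtraTreesClassifier

# Chi-square, ANOVA F-value and mutual information, each via SelectKBest.
for score_func in (chi2, f_classif, mutual_info_classif):
    skb = SelectKBest(score_func=score_func, k=5).fit(X_train, y_train)
    ranked = sorted(zip(skb.scores_, X_train.columns), reverse=True)
    print(score_func.__name__, [name for _, name in ranked])

# Tree-based ranking: Gini importance from an Extra Trees Classifier.
etc = ExtraTreesClassifier(n_estimators=100, random_state=0)
etc.fit(X_train, y_train)
print(sorted(zip(etc.feature_importances_, X_train.columns), reverse=True))
```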
3.5.1.4 Advantages and Disadvantages of Filter Method Advantages 1. They are comparatively faster than the other feature selection methods. 2. They can be applied when the dataset is large. 3. They are generally the starting point, the initial step performed to understand the features of a given dataset easily. 4. This method is not algorithm specific: the features selected using this technique are applicable as input to any machine learning model. Disadvantages 1. They [32] do not consider the relationships between individual features, hence there is a chance of selecting redundant features. 2. They do not guarantee the most optimal set of features every time.
3.5.2 Wrapper Method Filter methods are applied as a filter on the entire training dataset before any model is trained. A filter method is generic: it ranks the attributes based on the scores they obtain from a statistical test, but it is not specific to any machine learning algorithm. Many times we need to use a specific algorithm for a dataset, and in such cases the filter method does not guarantee an optimal result; we then use the wrapper method instead. This method uses a specific machine learning algorithm and selects the most significant attributes based on the performance criteria it evaluates: it searches for the feature subset best suited to that specific machine learning algorithm and computes the performance matrix for that algorithm. The [33] wrapper method uses a greedy approach, as it compares many possible combinations of the attributes and selects the combination that produces the best result. For implementing the wrapper methods, the MLxtend library is used here (a short sketch is given after this list). 1. Forward Selection In this method, initially all the features are evaluated individually by the specified machine learning algorithm, and the performance of each feature is computed. The feature that shows the best performance is selected and added to an initially empty list. In the next step, the selected feature is paired with each of the remaining features, and the pair that shows the best performance is selected. After that, combinations of the selected pair with the rest of the attributes are evaluated, and the best set of three features is selected. The process continues until the specified number of features has been selected. Here I have implemented this method using a Random Forest Regressor. 2. Backward Elimination Initially, this method starts with the complete set comprising all the features. In [34] the first step, it removes one feature at a time in round-robin fashion and keeps the subset that gives the best performance. In the next iteration, another feature is eliminated in round-robin fashion, and the performance of every combination lacking two features is evaluated. The process continues until the specified number of features remains. Here I have implemented this method using a Random Forest Regressor. 3. Recursive Feature Elimination This is the reverse of forward selection. Initially, it takes the complete set comprising all the features, evaluates the performance using the specified model and assigns weights to the features. The least important features are eliminated from the current set of features, and the procedure is repeated until we get the desired number of features. It then ranks all the features based on the order of their elimination.
4. Exhaustive Feature Selection This is the greediest strategy among all the wrapper methods discussed above. It evaluates all possible combinations of the features present in the dataset and selects the best subset, having the number of features specified by the user, that gives the best performance among all the subsets. We [35] need to specify the minimum and maximum number of features as parameters, and this method gives us the best subset within the specified range. The drawback of this method is that it takes more time than the other methods, as it evaluates all combinations of the features.
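The sketch below shows forward/backward selection with the MLxtend library named above, plus scikit-learn's recursive feature elimination. It is illustrative only: the chapter mentions a Random Forest Regressor, but since both datasets here pose classification problems the sketch substitutes a RandomForestClassifier, and k_features=5 is an arbitrary target.

```python
# Wrapper feature selection: MLxtend sequential search and sklearn RFE.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# forward=True gives forward selection; forward=False, backward elimination.
sfs = SFS(rf, k_features=5, forward=True, scoring="accuracy", cv=5)
sfs = sfs.fit(X_train, y_train)
print("Selected features:", sfs.k_feature_names_)

# Recursive feature elimination: repeatedly drops the weakest feature.
rfe = RFE(rf, n_features_to_select=5).fit(X_train, y_train)
print("RFE ranking:", dict(zip(X_train.columns, rfe.ranking_)))
```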
3.5.2.1 Advantages and Disadvantages of Wrapper Method Advantages 1. Wrapper methods give the most optimal set of features for the specific machine learning algorithm. 2. They evaluate all the possible combinations of features, hence they yield the most optimal result. 3. Unlike the filter method, they also consider the interactions between the features. Disadvantages 1. They give the optimal set of features for that specific algorithm only, which cannot be reused with any other machine learning algorithm. 2. They evaluate all possible combinations, so if the dataset has a large number of features, calculating every possible feature subset is very compute-intensive.
3.5.2.2 Difference Between Filter Method and Wrapper Method Table 3.5 summarizes the differences between the filter and wrapper feature selection methods.
Table 3.5 Difference between filter method and wrapper method.
1. Filter method: not algorithm specific; the output obtained can be applied as input to any machine learning algorithm. Wrapper method: specific to one machine learning algorithm; its output set of features cannot be fed as input to any other machine learning algorithm.
2. Filter method: selects the features based on the scores they obtain in statistical tests. Wrapper method: selects the features based on their performance as evaluated by a specific machine learning algorithm.
3. Filter method: does not consider the relationships between individual features. Wrapper method: considers the relationships between individual features.
4. Filter method: does not evaluate all possible combinations of features while selecting. Wrapper method: evaluates all possible combinations of features while making the selection.
5. Filter method: faster [13]. Wrapper method: a bit slower than the filter method, because it evaluates all possible combinations of the features [13].
6. Filter method: does not guarantee an optimal set of features. Wrapper method: gives the optimal set of features for the specific machine learning algorithm applied.
7. Filter method: less compute intensive. Wrapper method: more compute intensive; cannot be used when the dataset has a large number of features.
8. Filter method examples: correlation test, F-score, chi-square test. Wrapper method examples: forward selection, backward elimination.
3.6 Methodology 3.6.1 Steps Performed Step 1: Data Cleaning The first step required for any experiment is data cleaning and data transformation. The diabetes dataset did not require any cleaning or transformation, but the hepatitis dataset contains many missing values; hence I have filled the missing values with the immediately following data point available. The "sex" attribute of the hepatitis dataset has categorical values, so I transformed them to numeric values using a label encoder. Step 2: Splitting Data For the entire experiment, I have split the complete dataset in the ratio 7:3: 70% of the data is used as the training dataset for training the models, and 30% is used as testing data for evaluating the models and computing the performance matrix (a short sketch of these two steps is given below). Step 3: Performing the Tasks I have completed the entire experiment in four tasks. Task 1: implementing four machine learning models, namely Logistic Regression, Decision Tree, ANN and SVM. Task 2: applying three ensemble learning methods, namely Random Forest, the AdaBoost algorithm and the bagging technique. Task 3: applying the 11 filter feature selection methods and 3 wrapper selection methods discussed in Section 3.5. Task 4: applying various balancing techniques to balance the imbalanced data. Step 4: Evaluating the Performance Matrix After performing the above four tasks, I have noted the accuracy obtained in each task and compared them.
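Steps 1 and 2 can be sketched as follows. This is a minimal illustration assuming the hepatitis data already sits in a DataFrame df with a categorical "sex" column and a "class" label (column names follow Table 3.4).

```python
# Step 1: data cleaning and transformation; Step 2: 70-30 split.
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = df.bfill().ffill()                                 # fill missing values
df["sex"] = LabelEncoder().fit_transform(df["sex"])     # categorical -> numeric

X, y = df.drop(columns=["class"]), df["class"]

# 70% training, 30% testing, as used throughout the experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```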
3.6.2 Flowchart Figure 3.3 is the flow chart of the implementation steps of the experiments/tasks performed in this chapter. It shows the sequence of tasks performed throughout the experiment. • The diabetes and hepatitis datasets are in CSV file format. • Each dataset is split into training and testing subsets in the ratio 7:3 (i.e., 70% of the dataset for training and the rest for testing).
Figure 3.3 Flow chart of the tasks carried out in this chapter: input data in CSV format → data cleaning and transformation → split the dataset into training and testing data in the ratio 7:3 → perform the tasks (Task 1: apply four machine learning algorithms; Task 2: apply three ensemble learning methods; Task 3: apply the feature selection methods; Task 4: apply the data balancing techniques) → use the testing data to evaluate the performance matrix → compare the accuracy evaluated from each task.
• Then the tasks mentioned in Section 3.6.1 are performed. • The accuracy is used as the performance evaluation measure. • The accuracy obtained in each task is tabulated and shown in Section 3.7 of this chapter.
3.7 Experimental Results and Analysis Here are the results obtained in each task performed. I have taken accuracy as the performance measure and tabulated the accuracy obtained in each task mentioned in the previous section.
3.7.1 Task 1—Application of Four Machine Learning Models Table 3.6 shows the results obtained in Task 1 for the diabetes dataset: the accuracy obtained by each ML model. Table 3.7 shows the corresponding results for the hepatitis dataset.
Table 3.6 Accuracy obtained in Task 1 for diabetes dataset.
Model | Accuracy
Logistic Regression | 0.78
Decision Tree | 0.77
ANN | 0.73
SVM | 0.78
Table 3.7 Accuracy obtained in Task 1 for hepatitis dataset.
Model | Accuracy
Logistic Regression | 0.83
Decision Tree | 0.85
ANN | 0.81
SVM | 0.83
Observation • In the diabetes dataset the highest accuracy, 78%, is obtained by logistic regression and SVM. • In the hepatitis dataset the highest accuracy, 85%, is obtained by the decision tree.
3.7.2 Task 2—Applying Ensemble Learning Algorithms Table 3.8 shows the results obtained in Task 2 for the diabetes dataset: the accuracy obtained by each ensemble learning model. Table 3.9 shows the corresponding results for the hepatitis dataset.
Table 3.8 Accuracy obtained in Task 2 in diabetes dataset.
Model | Accuracy
Random Forest | 0.77
AdaBoost | 0.77
Bagging | 0.79
Table 3.9 Accuracy obtained in Task 2 in hepatitis dataset.
Model | Accuracy
Random Forest | 0.85
AdaBoost | 0.8936
Bagging | 0.8511
Observation • In the diabetes dataset we obtained a maximum accuracy of 78% in Task 1; here we get 79% accuracy with the bagging method. • In the hepatitis dataset the AdaBoost algorithm gives 89% accuracy, which exceeds the maximum accuracy obtained in Task 1. A sketch of the three ensemble learners is given below.
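The following is a minimal sketch of the three Task 2 ensemble learners, reusing the train/test split from Step 2. Hyperparameters are scikit-learn defaults, which the chapter does not specify.

```python
# Task 2: train and score the three ensemble learning methods.
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              BaggingClassifier)
from sklearn.metrics import accuracy_score

for model in (RandomForestClassifier(random_state=0),
              AdaBoostClassifier(random_state=0),
              BaggingClassifier(random_state=0)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(acc, 4))
```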
3.7.3 Task 3—Applying Feature Selection Techniques In this task I applied the different feature selection techniques, removed the least significant features they identified one by one, and noted the accuracy obtained after each removal. The tables show the attributes removed and the accuracy obtained in each case. Basic Filter Methods In the diabetes and hepatitis datasets there are no duplicate, quasi-constant or redundant features. Filter and Wrapper Methods 1 Diabetes Dataset Table 3.10 shows the results obtained in Task 3 for the diabetes dataset: the accuracy obtained by each of the ML models from Task 1 and the ensemble learning models from Task 2 under the different filter feature selection techniques, where for each technique the least significant attributes are removed one by one and the accuracy is noted after running each model. Table 3.11 shows the corresponding accuracies under the different wrapper feature selection techniques.
Table 3.10 Accuracy obtained by filter feature selection methods in diabetes dataset. (For each of the seven models, namely Logistic Regression, Decision Tree, ANN, SVM, Random Forest, the AdaBoost algorithm and Bagging, the table reports the accuracy after removing, one by one, the least significant attributes ranked by each filter method: correlation, chi-square, Extra Trees Classifier, F-score, mutual information, ANOVA F-value and Pearson's correlation coefficient; the removed attributes include Blood Pressure, Skin Thickness, Insulin, Diabetes Pedigree Function, BMI, Age and Pregnancies. The key values are summarized in the observations below.)
Table 3.11 Accuracy obtained by wrapper feature selection methods in diabetes dataset. (The table reports the accuracies of the seven models under forward selection, backward selection and recursive feature elimination, after the least significant attributes are removed one by one; the key values are summarized in the observations below.)
Observation: • The results of the univariate ROC-AUC and Pearson's correlation coefficient methods are the same. • In ANN the baseline accuracy is 73%; by using the F-score method we can achieve 77% accuracy by removing the least significant features scored by this method (Blood Pressure, Skin Thickness, Insulin). • In SVM we get an accuracy of 78%, but by using the recursive feature elimination technique, if we remove Skin Thickness, Insulin and Pregnancies, then a maximum of 80% accuracy can be obtained. • In the Random Forest classifier we normally achieve accuracy up to 77%, but the accuracy is increased up to 79% using the following methods and removing the features mentioned in the brackets one by one: 1. chi-square (Diabetes Pedigree Function, Blood Pressure, Skin Thickness); 2. Extra Tree Classifier (Skin Thickness, Insulin, Blood Pressure); 3. Pearson's correlation coefficient (Blood Pressure, Diabetes Pedigree Function); 4. recursive feature elimination (Skin Thickness, Insulin, Pregnancies). • In the AdaBoost algorithm we normally get an accuracy of 77%, but with feature selection the accuracy is improved by removing the attributes mentioned in the brackets: 1. chi-square test: 78% (Diabetes Pedigree Function, Blood Pressure, Skin Thickness); 2. Extra Tree Classifier: 78% (Skin Thickness); 3. backward selection: 78% (Skin Thickness, Insulin); 4. recursive feature elimination: 79% (Skin Thickness, Insulin, Pregnancies). • In the bagging method we normally get an accuracy of 78%, but we can achieve 79% accuracy by removing the attributes mentioned against each method: 1. forward selection (Skin Thickness); 2. backward selection (Skin Thickness, Insulin); 3. recursive feature elimination (Skin Thickness, Insulin, Pregnancies); and 80% accuracy in the case of the Extra Tree Classifier by removing Skin Thickness and Insulin.
2 Hepatitis Dataset Table 3.12 shows the results obtained in Task 3 for the hepatitis dataset: the accuracy obtained by each of the ML models from Task 1 and the ensemble learning models from Task 2 under the different filter feature selection techniques, where for each technique the least significant attributes are removed one by one and the accuracy is noted after each removal. Table 3.13 shows the corresponding accuracies under the different wrapper feature selection techniques; the observations follow the two tables.
Table 3.12 Accuracy obtained by filter feature selection methods in hepatitis dataset. (For each of the seven models, the table reports the accuracy after removing, one by one, the least significant attributes ranked by each filter method: chi-square, correlation, Extra Trees Classifier, F-score, mutual information, ANOVA F-value, Pearson's coefficient and univariate ROC-AUC; the removed attributes include steroid, sex, antivirals, fatigue, liver_big, liver_firm, alk_phosphate, sgot, varices and anorexia. The key values are summarized in the observations below.)
Table 3.13 Accuracy obtained by wrapper feature selection methods in the hepatitis dataset. (The table reports the accuracies of the seven models under forward selection, removing sgot, histology and protime; backward selection, removing histology, protime and albumin; and recursive feature elimination, removing sex, antivirals and varices. The key values are summarized in the observations below.)
Observation: • In logistic regression we obtain an accuracy of 83%, but we can increase it up to 85% by using the following methods and removing the features mentioned in the brackets one by one: 1. mutual information (steroid, sgot, varices, anorexia); 2. univariate ROC-AUC (anorexia, sgot); 3. forward selection (sgot, histology, protime). • In the decision tree an accuracy of 85% is normally obtained, but by removing the least significant attributes mentioned in the brackets one by one, the following rises in accuracy are observed: 1. 89%: chi-square (liver_big, liver_firm, steroid); 2. 89%: F-score (liver_firm); 3. 89%: mutual information (steroid, sgot, varices, liver_firm); 4. 89%: ANOVA F-value (liver_firm, liver_big); 5. 89%: Pearson's correlation (liver_firm, liver_big); 6. 87%: recursive feature elimination (sex, antivirals, varices). • In ANN we obtained an accuracy of 81%, but a rise in accuracy is seen when removing the least significant features by the following methods: 1. 83%: correlation matrix (histology, albumin, protime); 2. 87%: F-score (liver_big, liver_firm); 3. 87%: mutual information (steroid, sgot, varices); 4. 85%: ANOVA F-value (liver_big, liver_firm, alk_phosphate); 5. 85%: Pearson's correlation (liver_big, liver_firm, alk_phosphate); 6. 89%: univariate ROC-AUC (anorexia, sgot, alk_phosphate, antivirals). • In SVM we normally get an accuracy of 83%, but a rise in accuracy is seen when removing the least significant features by the following methods: 1. 85%: Extra Tree Classifier (sex, antivirals, fatigue); 2. 85%: mutual information (steroid, sgot, varices, liver_firm); 3. 85%: univariate ROC-AUC (anorexia, sgot, alk_phosphate, antivirals); 4. 85%: forward selection (sgot, histology, protime); 5. 85%: backward selection (histology, protime); 6. 85%: recursive feature elimination (sex, antivirals, varices). • In the Random Forest classifier we normally get an accuracy of 85%, but a rise in accuracy is seen when removing the least significant features by the following methods: 1. 87%: chi-square (liver_firm, liver_big, steroid); 2. 89%: Extra Tree Classifier (sex, antivirals); 3. 89%: F-score (liver_firm, liver_big, alk_phosphate, sgot); 4. 91%: mutual information (steroid, sgot, varices, liver_firm); 5. 89%: ANOVA F-value (liver_firm, liver_big, alk_phosphate); 6. 89%: Pearson's correlation (liver_firm, liver_big, alk_phosphate); 7. 91%: univariate ROC-AUC (anorexia, sgot, alk_phosphate); 8. 89%: forward selection (sgot, histology). • In the AdaBoost algorithm we normally get an accuracy of 89%, but a rise in accuracy is seen when removing the least significant features by the following methods: 1. 91%: mutual information (steroid, sgot, varices, anorexia); 2. 91%: forward selection (sgot, histology).
3.7.4 Task 4—Applying Data Balancing Techniques Here I have applied various balancing techniques on the training dataset, then trained the seven models above on the resampled training data and evaluated the accuracy using the test dataset.
Table 3.14 shows the results obtained in Task 4 for the diabetes dataset: the accuracy obtained by each model after applying the different data balancing techniques. Table 3.15 shows the corresponding results for the hepatitis dataset. Observation: after applying the balancing techniques, we are able to increase the accuracy.
Table 3.14 Accuracy obtained in Task 4 for diabetes dataset.
Models | Random sampling | SMOTE | ADASYN | Borderline SMOTE
Logistic Regression | 0.78 | 0.78 | 0.78 | 0.78
Decision Tree | 0.70 | 0.74 | 0.68 | 0.73
ANN | 0.69 | 0.68 | 0.67 | 0.71
SVM | 0.76 | 0.76 | 0.76 | 0.76
Random Forest | 0.78 | 0.81 | 0.80 | 0.78
AdaBoost Algorithm | 0.77 | 0.7835 | 0.77 | 0.77
Bagging | 0.74 | 0.75 | 0.74 | 0.75
Table 3.15 Accuracy obtained in Task 4 for hepatitis dataset.
Models | Random sampling | SMOTE | ADASYN | Borderline SMOTE
Logistic Regression | 0.81 | 0.8723 | 0.85 | 0.85
Decision Tree | 0.79 | 0.81 | 0.81 | 0.81
ANN | 0.81 | 0.89 | 0.83 | 0.79
SVM | 0.83 | 0.83 | 0.83 | 0.83
Random Forest | 0.94 | 0.85 | 0.8723 | 0.89
AdaBoost Algorithm | 0.8936 | 0.9149 | 0.8936 | 0.8936
Bagging | 0.8511 | 0.8723 | 0.8723 | 0.85
Observation (diabetes dataset) • Normally, random forest obtained an accuracy of 77%, but it achieved 81% accuracy after applying SMOTE, 80% after applying ADASYN, and 78% after applying the remaining methods. • In the case of the AdaBoost algorithm the accuracy also increased, up to 78.35%. Observation (hepatitis dataset) • Logistic regression gave an accuracy of 83%; it increased up to 87% after applying SMOTE and 85% after applying ADASYN and borderline SMOTE. • ANN gave 81% accuracy; after applying SMOTE it increased up to 89%. • Random forest gave 85% accuracy; after applying random sampling it increased up to 94%, and it reached 87 and 89% in the cases of ADASYN and borderline SMOTE, respectively. • In the case of the AdaBoost algorithm the accuracy increased from 89 to 91% after applying SMOTE. • In the case of the bagging method the accuracy increased from 85 to 87% after applying the SMOTE and ADASYN techniques.
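The four balancing techniques can be sketched with the imbalanced-learn package. This is an assumption about tooling, since the chapter does not name its implementation, and whether "random sampling" means over- or under-sampling is not stated, so random over-sampling is shown.

```python
# Task 4: resample the training split only; the test split stays untouched.
from collections import Counter
from imblearn.over_sampling import (RandomOverSampler, SMOTE, ADASYN,
                                    BorderlineSMOTE)

samplers = {
    "Random sampling": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "Borderline SMOTE": BorderlineSMOTE(random_state=0),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, Counter(y_res))   # class counts after balancing
```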
3.8 Conclusion This chapter aims to compare the performance of different pre-processing approaches that can enhance the performance of the predictive analysis process in the healthcare domain. The healthcare domain is one of the most prominent fields for researchers: predicting the possibility of a chronic disease helps healthcare workers to take precautionary steps beforehand. I have explored two such chronic diseases in this chapter, diabetes and hepatitis. As Task 1, I applied the logistic regression, decision tree, ANN and SVM models of machine learning. • A maximum accuracy of 78% is obtained by SVM in the diabetes dataset and of 85% by the decision tree in the hepatitis dataset. Then, in Task 2, I applied three ensemble learning algorithms: the random forest algorithm, the AdaBoost algorithm and the bagging method.
• In the diabetes dataset the accuracy increased from 78 to 79% by applying the bagging method, and in the hepatitis dataset the accuracy increased from 85 to 89% in the case of the AdaBoost algorithm. Feature selection techniques remove unwanted and irrelevant features, which enhances the performance of the model. In Task 3 I applied 11 filter feature selection methods and 3 wrapper feature selection methods, which enhanced the performance; the accuracy increased in almost all cases. • For the diabetes dataset the accuracy increased from 79 to 80% by using: 1. the Extra Tree Classifier (removing skin thickness and insulin) with the bagging method; 2. the recursive feature elimination technique (removing Skin Thickness, Insulin and Pregnancies) with SVM. • In the hepatitis dataset the accuracy increased from 89 to 91% by applying: 1. mutual information (steroid, sgot, varices, liver_firm) in the random forest and AdaBoost algorithms; 2. univariate ROC-AUC (anorexia, sgot, alk_phosphate) in the random forest; 3. forward selection (sgot, histology) in the AdaBoost algorithm. Most of the medical datasets we get are highly imbalanced, as the proportion of people suffering from a disease will always be small in any population; hence we need to balance the dataset. In Task 4 I applied four data balancing techniques: random sampling, SMOTE, ADASYN and borderline SMOTE. In the diabetes dataset the accuracy increased up to 81% after applying SMOTE and up to 80% after applying ADASYN in the random forest. In the hepatitis dataset, the random forest gave 85% accuracy; after applying random sampling it increased up to 94%, and it reached 87 and 89% in the cases of ADASYN and borderline SMOTE, respectively. In the case of the AdaBoost algorithm the accuracy increased from 89 to 91% after applying SMOTE.
Conclusion Table: Diabetes Dataset For the diabetes dataset the conclusions of all four tasks are shown in Table 3.16. The maximum accuracy obtained in each task is noted along with its respective approach.
Table 3.16 Conclusion table for diabetes dataset.
Task | Maximum accuracy | Method/model
Task 1 | 78% | Logistic Regression and SVM
Task 2 | 79% | Bagging
Task 3 | 80% | SVM using recursive feature elimination (removing Skin Thickness, Insulin and Pregnancies)
Task 4 | 81% | Random forest with SMOTE
Conclusion Table: Hepatitis Dataset For the hepatitis dataset the conclusions of all four tasks are shown in Table 3.17. The maximum accuracy obtained in each task is noted along with its respective approach.
Table 3.17 Conclusion table for hepatitis dataset.
Task | Maximum accuracy | Models and techniques
Task 1 | 85% | Decision Tree
Task 2 | 89% | AdaBoost Algorithm
Task 3 | 91% | Random forest with 1. mutual information (steroid, sgot, varices, liver_firm) or 2. univariate ROC-AUC (anorexia, sgot, alk_phosphate); AdaBoost Algorithm with 1. mutual information (steroid, sgot, varices, anorexia) or 2. forward selection (sgot, histology)
Task 4 | 94% | Random forest with random sampling; 91% with the AdaBoost algorithm and SMOTE
References
1. Singh, D.J., Feature Selection and Classification Systems for Chronic disease prediction: A Review. Egypt. Inform. J., 11, 179–189, 2018, Retrieved 06 03, 2020.
2. Sneha, N. and Gangil, T., Analysis of diabetes mellitus for early prediction using optimal features selection. J. Big Data, 6, 13, 2019, https://doi.org/10.1186/s40537-019-0175-6.
3. World Health Organization, Diabetes, 2020, Retrieved 06 03, 2020, from https://www.who.int/health-topics/diabetes#tab=tab_1.
4. World Health Organization, Hepatitis B, 2019, 07 18, Retrieved 06 03, 2020, from https://www.who.int/news-room/fact-sheets/detail/hepatitis-b.
5. Trishna, T.I., Emon, S.U., Ema, R.R., Sajal, G.I.H., Kundu, S., Islam, T., Detection of Hepatitis (A, B, C and E) Viruses Based on Random Forest, K-nearest and Naïve Bayes Classifier. 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kanpur, India, pp. 1–7, 2019.
6. Cases, W.J., Update on global epidemiology of viral hepatitis and preventive strategies. World J. Clin. Cases, 6, 589–599, 2018.
7. He, H. and Garcia, E.A., Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng., 21, 9, 1263–1284, Sept. 2009.
8. Nithya, B. and Ilango, V., Predictive analytics in healthcare using machine learning tools and techniques. 2017 International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, pp. 492–499, 2017.
9. Selvakuberan, K., Kayathiri, D., Harini, B., Devi, M.I., An efficient feature selection method for classification in healthcare systems using machine learning techniques. 2011 3rd International Conference on Electronics Computer Technology, Kanyakumari, pp. 223–226, 2011.
10. Shailaja, K., Seetharamulu, B., Jabbar, M.A., Machine Learning in Healthcare: A Review. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, pp. 910–914, 2018.
11. Patil, B., Joshi, R., Toshniwal, D., Classification of type-2 diabetic patients by using Apriori and predictive Apriori. Int. J. Comput. Vis. Robot., 2, 254–265, 2011, 10.1504/IJCVR.2011.042842.
12. Al-Hagery, M., Alfaiz, A., Alorini, F., Saleh, M., Knowledge Discovery in the Data Sets of Hepatitis Disease for Diagnosis and Prediction to Support and Serve Community. 4, 118–125, 2015.
13. Nahato, K.B., Harichandran, K.N., Arputharaj, K., Knowledge mining from clinical datasets using rough sets and backpropagation neural network. Comput. Math. Methods Med., 2015, 460189, 2015.
14. Rong, M., Gong, D., Gao, X., Feature Selection and Its Use in Big Data: Challenges, Methods, and Trends. IEEE Access, 7, 19709–19725, 2019.
15. Sahoo, P.K., Mohapatra, S.K., Wu, S., Analyzing Healthcare Big Data With Prediction for Future Health Condition. IEEE Access, 4, 9786–9799, 2016.
16. Reddy, A.R. and Kumar, P.S., Predictive Big Data Analytics in Healthcare. 2016 Second International Conference on Computational Intelligence & Communication Technology (CICT), Ghaziabad, pp. 623–626, 2016.
17. Armanfard, N., Reilly, J.P., Komeili, M., Local Feature Selection for Data Classification. IEEE Trans. Pattern Anal. Mach. Intell., 38, 6, 1217–1227, 1 June 2016.
18. Mir, A. and Dhage, S.N., Diabetes Disease Prediction Using Machine Learning on Big Data of Healthcare. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, pp. 1–6, 2018.
19. Sarwar, M.A., Kamal, N., Hamid, W., Shah, M.A., Prediction of Diabetes Using Machine Learning Algorithms in Healthcare. 2018 24th International Conference on Automation and Computing (ICAC), Newcastle upon Tyne, United Kingdom, pp. 1–6, 2018.
20. Sisodia, D. and Singh Sisodia, D., Prediction of Diabetes using Classification Algorithms, in: International Conference on Computational Intelligence and Data Science (ICCIDS 2018).
21. Veena Vijayan, V. and Anjali, C., Prediction and Diagnosis of Diabetes Mellitus—A Machine Learning Approach. 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS), Trivandrum, 10–12 December 2015.
22. Dey, S.K., Hossain, A., Rahman, M.M., Implementation of a Web Application to Predict Diabetes Disease: An Approach Using Machine Learning Algorithm. 2018 21st International Conference of Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 1–5, 2018.
23. Trishna, T.I., Emon, S.U., Ema, R.R., Sajal, G.I.H., Kundu, S., Islam, T., Detection of Hepatitis (A, B, C and E) Viruses Based on Random Forest, K-nearest and Naïve Bayes Classifier. 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kanpur, India, pp. 1–7, 2019.
24. Pushpalatha, S. and Pandya, J.G., Designing a framework for diagnosing hepatitis disease using data mining techniques. 2017 International Conference on Algorithms, Methodology, Models and Applications in Emerging Technologies (ICAMMAET), Chennai, pp. 1–6, 2017.
25. Jajoo, R., Mital, D., Haque, S., Srinivasan, S., Prediction of hepatitis C using artificial neural network. 7th International Conference on Control, Automation, Robotics and Vision, 2002, vol. 3, ICARCV 2002, Singapore, pp. 1545–1550, 2002.
26. Shankar sowmien, V. et al., Diagnosis of Hepatitis using Decision tree algorithm. Int. J. Eng. Technol. (IJET), Vol 8, pp. 1414–1419, 2016.
27. Shroff, K.P. and Maheta, H.H., A comparative study of various feature selection techniques in high-dimensional data set to improve classification accuracy. 2015 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, pp. 1–6, 2015.
28. Suto, J., Oniga, S., Sitar, P.P., Comparison of wrapper and filter feature selection algorithms on human activity recognition. 2016 6th International Conference on Computers Communications and Control (ICCCC), Oradea, pp. 124–129, 2016.
29. Brownlee, J., Feature Selection For Machine Learning in Python, 2016, 05 20, Retrieved 06 03, 2020, from https://machinelearningmastery.com/feature-selection-machine-learning-python/.
30. Charfaoui, Y., Hands-on with Feature Selection Techniques: Filter Methods, 2020, Retrieved 06 03, 2020, from https://heartbeat.fritz.ai/hands-on-with-feature-selection-techniques-filter-methods-f248e0436ce5.
31. Shetye, A., Feature Selection with sklearn and Pandas, 2019, 02 11, Retrieved 06 03, 2020, from https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b.
32. Khandelwal, R., Feature selection in Python using the Filter method, 2019, 08 24, Retrieved 06 02, 2020, from https://towardsdatascience.com/feature-selection-in-python-using-filter-method-7ae5cbc4ee05.
33. Luhaniwal, V., Feature selection using Wrapper methods in Python, 2019, 10 04, Retrieved 06 03, 2020, from https://towardsdatascience.com/feature-selection-using-wrapper-methods-in-python-f0d352b346f.
34. Srinidhi, S., Backward Elimination for Feature Selection in Machine Learning, 2019, 11 15, Retrieved 06 03, 2020, from https://towardsdatascience.com/backward-elimination-for-feature-selection-in-machine-learning-c6a3a8f8cef4.
35. Raschka, S., Exhaustive Feature Selector, 2014, Retrieved 06 03, 2020, from http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector.
36. Mulani, J., Heda, S., Tumdi, K., Patel, J., Chhinkaniwala, H., Patel, J., Deep Reinforcement Learning Based Personalized Health Recommendations, in: Deep Learning Techniques for Biomedical and Health Informatics, pp. 231–255, Springer, Cham, 2020.
4 Healthcare 4.0: An Insight of Architecture, Security Requirements, Pillars and Applications

Deepanshu Bajaj1, Bharat Bhushan2* and Divya Yadav1

1HMR Institute of Technology and Management, New Delhi, India
2School of Engineering and Technology, Sharda University, Greater Noida, India
Abstract
Motivated by Industry 4.0, Healthcare 4.0 was established with a remarkable vision focused on bringing customization and virtualization to various industrial sectors. Strategically, it empowers industrial sector development from producers to consumers, focusing on developing measures of customization provided as a service to the customer. It is a patient-based framework which depends on information exchange among the different participants of the network, resulting in an enhanced healthcare exchange network. The idea of Industry 4.0 is evolving massively and is essential for the medical sector, including the Internet of Things (IoT), Big Data (BD) and Blockchain (BC), a combination of which is modernizing e-Health and its overall framework. In this paper we analyze the implementation of I4.0 (Industry 4.0) technology in the medical sector, which has revolutionized the best available approaches and improved the entire framework. The main aim of this paper is to discuss: the architectural design and components of e-Health; assurance and security; the ICT (Information and Communication Technology) pillars and advancements provided by I4.0, including IoT, cloud and fog computing, machine learning, big data and blockchain; and the major application scenarios associated with HC4.0 (Healthcare 4.0). Keywords: Industry 4.0, Healthcare 4.0, Internet of Things, big data, machine learning, cloud computing
*Corresponding author: [email protected] Sujata Dash, Subhendu Kumar Pani, S. Balamurugan and Ajith Abraham (eds.) Biomedical Data Mining for Information Retrieval: Methodologies, Techniques and Applications, (103–130) © 2021 Scrivener Publishing LLC
4.1 Introduction

Population growth and the increasing desire for effective treatments, together with the expectation of a good quality of life, place a heavy burden on healthcare. Healthcare therefore remains one of the most significant social and financial challenges worldwide, demanding modern and further developed resources supported by science and innovation [1]. The healthcare sector helps to save and extend the lives of sufferers by keeping individuals in good health, thanks to the development of medical examination procedures and the adoption of the latest innovations by medical experts. Accordingly, since the mid-1990s, Information and Communication Technologies (ICTs) have affected the access, proficiency and nature of many procedures associated with healthcare [2]. Over the last few decades, several patterns have appeared in industrial norms and accomplishments. For instance, I1.0 (Industry 1.0) concentrated on automated engineering and mechanization; it was followed by I2.0 (Industry 2.0), based on electrical power, and then by I3.0 (Industry 3.0), in which media-based transmission and ICTs are the main segments. Consequently, e-Health emerged from the application of ICTs to the medical sector for everyday usage. Industry 4.0, in turn, depends on the deployment of smart gadgets and their utilization within the IoT. The evidence of currently growing innovative developments connected globally, together with increasing efforts made by governments, leads to the conclusion that the medical industry is confronting the effects of Industry 4.0, successfully passing from e-Health to HC4.0. A few presentations and reviews exist concerning the advancements associated with Industry 4.0, or individual ICT techniques for e-Health [3, 4], but there is an absence of modern work concentrating on the aspects of greatest importance to Healthcare 4.0 [5, 6]. Along these lines, this paper targets the innovative aspects of the continuous technological advancements as employed in, or appropriate to, the medical sector, to clarify their utilization and extend their comprehension. The major contributions of the paper are as follows:
• The paper presents a comprehensive survey of protection and security issues associated with Healthcare 4.0. It depicts
the background, the architecture of e-Health, and the importance of protection and security in Healthcare 4.0. The various techniques used for protection and security in Healthcare 4.0 are categorized into five categories (as shown in Section 4.3).
• The paper highlights the need for protection and security in Healthcare 4.0, along with the advantages and disadvantages of the same. The issues, tools, technologies and frameworks used to maintain protection and security in Healthcare 4.0 are discussed. Moreover, we discuss the various ICT pillars associated with Healthcare 4.0 and explain their prerequisites.
• The paper throws light on making the environment eco- and user-friendly by prioritizing the healthcare sector's requirements regarding the available services. The paper also explains some of the application scenarios in Healthcare 4.0.
• The paper further highlights CC as a tie-breaker along with FC (FC significantly rectifies the problems of the CC prototype). FC is then described in terms of low latency, privacy, network failure and inevitability, and a comprehensive view of the other technologies, i.e. blockchain, machine learning, big data and IoT, is given.
The remainder of the paper is organized as follows. Section 4.2 reviews the basic architecture and components of the e-Health architecture, with a comprehensive description of its layers. Section 4.3 focuses on protection and security necessities in Healthcare 4.0 and briefly explains each prerequisite. Section 4.4 discusses the various ICT pillars associated with Healthcare 4.0. Application scenarios associated with Healthcare 4.0, with brief descriptions of some scenarios, are discussed in Section 4.5, followed by the conclusion in Section 4.6.
4.2 Basic Architecture and Components of e-Health Architecture

This paper presents a three-layered e-Health architecture built around individual patients and caregivers, in which the three layers are the front end, the communication layer and the back end.
[Figure 4.1 Basic architecture and components of e-health architecture: body temperature, BP, ECG and PO sensors connect through an access point and gateway (front end layer) and over the Internet (communication layer) to the medical information database, tele-pharmacy server, emergency services and doctor (back end layer).]
Many devices are used as part of the front end, among them sensors, medical devices and wearable devices. The handling of data gathered from the front end is done by the communication layer, while centralized control of patients is provided by a high-computing data center in the back end. A comprehensive description of these layers, illustrated in Figure 4.1, follows.
4.2.1 Front End Layer

A wide group of IoT-related healthcare devices, such as Wearable Devices (WDs), sensors and Medical Devices (MDs), is used for monitoring the real-time health condition of individual patients [7]. The monitored condition is verified and stored for additional processing within the health cloud. In the complete e-Health system layout, the first layer comprises wearables and physical sensors that gather ongoing medical information from specific patients. These sensors transmit the health data to medical devices using the communication protocols stated above, and links to the Internet allow the information to be stored within cloud storage. Furthermore, medical devices can be treated as either physical (concrete) or virtual sensors. A concrete sensor plays a crucial role
in monitoring the patient's health and tracing their wellness in real time, while virtual sensors gather health data from remote diagnostics for remote monitoring, PHR and remote consultation [8]. Some of the sensors involved are sensors for Heart Rate (HR), Pulse Oximetry (PO), Respiration Rate (RR), Blood Pressure (BP) and Body Temperature (BT), along with smartwatches for activity observation, fitness bands, etc.
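To make the front end concrete, a minimal sketch of one such reading is given below in Python; the field names, the patient identifier and the JSON serialization are illustrative assumptions, not a standard drawn from this chapter.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class VitalsReading:
    """One front end sample gathered from body-worn sensors (fields are illustrative)."""
    patient_id: str
    heart_rate_bpm: int        # Heart Rate (HR) sensor
    spo2_percent: float        # Pulse-Oximeter (PO) sensor
    respiration_rate: int      # Respiration Rate (RR) sensor
    systolic_mmhg: int         # Blood Pressure (BP) sensor
    diastolic_mmhg: int
    body_temp_c: float         # Body Temperature (BT) sensor

def to_payload(reading: VitalsReading) -> bytes:
    """Serialize a reading, stamped with UTC time, for transmission to the gateway."""
    record = asdict(reading)
    record["timestamp"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(record).encode("utf-8")

payload = to_payload(VitalsReading("patient-042", 78, 97.5, 16, 120, 80, 36.8))
```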
4.2.2 Communication Layer

The communication layer handles the data accumulated from the front end gadgets and is responsible for sending the assembled information to the remote gateway via the Internet. Routine traffic is sent via the cloud gateway, while emergency traffic is sent through a fog gateway: for further investigation, this layer essentially consists of energy-efficient, low-powered Fog Computing (FC) nodes, assembled with an abundant variety of WDs and sensors, to handle the data faultlessly [9]. Fog computing settles the issues of network latency and of information security for healthcare-related information [8]. Otherwise, a significant number of cloud servers must examine a patient's instantaneous data and process it over a wide geographical region, which can be very tedious; for emergency facilities, these issues can be resolved with the help of fog computing. Within this layer, fog nodes can carry out accumulation, packing, separating and organizing of the gathered information. Moreover, to protect this valuable information, access control, authentication and encryption instruments are utilized by analysts; these operations make the layer more time-consuming and hence slower.
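The split between routine traffic (cloud gateway) and emergency traffic (fog gateway) can be pictured with a minimal sketch; the thresholds below are illustrative placeholders, not clinically validated rules.

```python
def is_emergency(reading: dict) -> bool:
    """Crude screening of a vitals record; thresholds are illustrative only."""
    return (reading["heart_rate_bpm"] < 40 or reading["heart_rate_bpm"] > 140
            or reading["spo2_percent"] < 90.0
            or reading["body_temp_c"] > 39.5)

def route(reading: dict) -> str:
    """Emergency samples go to a nearby fog node for low-latency handling;
    routine samples go to the cloud gateway for storage and batch analysis."""
    return "fog-node" if is_emergency(reading) else "cloud-gateway"

sample = {"heart_rate_bpm": 150, "spo2_percent": 96.0, "body_temp_c": 37.0}
print(route(sample))   # -> fog-node
```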
4.2.3 Back End Layer

This layer includes the high-computing data center that provides centralized control of the patient. The data center permits complex, long-interval examination of actions and of the relationships among patients' information. The layer also includes cloud servers responsible for taking decisions dynamically. Additionally, it is utilized for collecting information and gives extra memory to store patients' clinical records. All specialists and patients can access
such records and the particular pharmaceutical sections (for reporting or billing purposes) to obtain insights. Patients can even retrieve their past and present medical records or bills by employing a portable application or a web interface. The information accumulated from different sources is incorporated into the EHRs, numerous e-solution sites and many web sources. Thus, doctors and patients are permitted to access the information whenever and wherever it is required. Further, the layer provides a pushing facility so that a warning is received when an individual patient transfers or receives any medical data. All the layers discussed so far incorporate more than one innovation used to raise the standard of patient care. Patients' medical information is shared across the Internet, which is an open channel, so there is a greater likelihood of assaults such as DDoS, replay, MIM, privileged-insider and impersonation attacks. Such assaults can be controlled and handled by utilizing security arrangements and various traditional cryptographic systems.
4.3 Security Requirements in Healthcare 4.0

Nowadays, privacy and security are the primary concerns of the healthcare industries (or medical fields), because a massive quantity of health-related information is retrieved and passed via the Internet. The Internet can be viewed as an open channel for transmission; consequently, there is a likelihood of system assaults on health information. Numerous cryptography-based algorithms are utilized by researchers to tackle such attacks. For security, many steps are taken to control the retrieval of patients' data and shield it from unapproved clients; in general, this is practiced with operational controls inside a hidden entity. The majority of nations utilize Personal Health Information (PHI) that is kept, and can be sent, via computerized frameworks. For health information the foremost priority, privacy, can be summarized as keeping one's healthcare information secured from unapproved clients; it can be accomplished through the implementation of numerous schemes and guidelines. Privacy means that only authorized users are allowed to retrieve the health information related to specific patients, and it governs the circumstances in which patient information may be retrieved, used and unveiled to an outsider. For instance, the HIPAA act ensures the protection of medical data identified with the patient. There is also a history behind privacy and security. The alterations and enhancements of IT
structures in the healthcare sector brought about the establishment of HC1.0. These structures began with clinical imaging, which gives specialists better insights into patients' medical conditions. The following period corresponds to HC2.0, which required a quick capacity to retrieve information; the use of EHRs to retrieve information from wearable devices then resulted in the establishment of HC3.0. At that point, all the advancements with instantaneous information collection were joined into HC4.0, which expands the employment of Artificial Intelligence and other user interfaces (UIs). HC4.0 places more emphasis on organized work, combination and consistency, making it the most personalized and predictive stage. HC2.0 began in the mid-2000s as a subset of healthcare systems built on the broad utilization of Web 2.0 (W2.0); it empowers patients to have conspicuous authority over their data and may likewise lessen their clinical traveling. It underscores innovation as the most authorizing delegate for patient concerns, and includes m-Health, connected care and computerized care. Web 3.0 (W3.0) is utilized by HC3.0, in which the customer interfaces are accessible on the web, with data and information personalized to update their understanding. Healthcare facilities may utilize web-related Social Media (SM), wearable devices and IDs related to tele-healthcare frameworks to improve interaction between patients and caregivers. HC4.0 (Healthcare 4.0) enables customization in favor of health management, which can be achieved with the utilization of cloud computing and portable transmissions [10]. Thus, the utilization of virtualization enables the continuous assessment of clinical pictures with more exactness and accuracy. In any case, the security and protection prerequisites of the cloud-related healthcare framework can be complex: mutual authentication, un-traceability, session key agreement, user anonymity, perfect forward secrecy and attack resistance are all needed to make sure of information security and protection. These are illustrated in Figure 4.2 and detailed as follows.
4.3.1 Mutual Authentication

Mutual authentication mainly refers to two-way authentication: a process or technology that ensures that both elements in a communication link confirm one another. In the network environment, the client authenticates or validates the server, and the server likewise authenticates the client.
[Figure 4.2 Healthcare 4.0 protection and security necessities: mutual authentication, anonymity, un-traceability, perfect forward secrecy and attack resistance.]
In this manner, network clients can be assured that they are working with genuine entities, and servers can likewise be sure that all clients attempting to gain access do so for legitimate purposes. MA (Mutual Authentication) can be practiced with the use of certain authentication conventions such as Kerberos validation. Work on Secure Socket Layer (SSL) and Transport Layer Security (TLS) shows that these ensure ease of interaction but cannot by themselves verify every client or specialized gadget in a communication; this can be confirmed through MA [11]. Likewise, MA permits only the approved client to access the data on the particular server [12].
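As a concrete instance, mutual authentication can be configured with TLS in Python's standard ssl module: the server demands and verifies a client certificate while presenting its own. The certificate and CA file names below are hypothetical; this is a minimal sketch, not a hardened deployment.

```python
import socket
import ssl

# Requiring a client certificate is what makes the TLS handshake *mutual*.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.verify_mode = ssl.CERT_REQUIRED                # reject clients without a valid certificate
ctx.load_cert_chain("server.crt", "server.key")    # hypothetical server credentials
ctx.load_verify_locations("hospital_ca.pem")       # hypothetical CA that issued client certs

with socket.create_server(("0.0.0.0", 8443)) as srv:
    with ctx.wrap_socket(srv, server_side=True) as tls_srv:
        conn, addr = tls_srv.accept()              # handshake authenticates both parties
        print("authenticated client:", conn.getpeercert()["subject"])
```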
4.3.2 Anonymity

Anonymity generally refers to a situation in which the involved person's name is kept unknown. If an attacker manages to obtain the information of any client, the patient's privacy can be undermined, which is especially harmful for older individuals. Anonymity (or namelessness) is therefore considered a key component of the security necessities. The associated patient's and physician's identification is required and shown at the login request stage [11]. Moreover, it is quite hard to uncover the identities of doctors and patients, as they are encoded using symmetric encryption algorithms, for example DES.
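A minimal sketch of hiding identities with symmetric encryption follows. The chapter mentions DES, but DES is no longer considered secure, so an AES-based scheme (Fernet, from the third-party cryptography package) is substituted here as an assumption.

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()     # shared secret held only by authorized parties
cipher = Fernet(key)

# Identities travel and rest as opaque ciphertext, preserving anonymity.
token = cipher.encrypt(b"patient: Jane Doe; physician: Dr. Smith")
print(token)                    # unintelligible to an eavesdropper
print(cipher.decrypt(token))    # only key holders recover the identities
```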
4.3.3 Un-Traceability

Un-traceability refers to the property of maintaining routes that are unknown to both internal and external attackers. If an assailant can observe the communication activities of certain clients, that person can guess the genuine identity of the corresponding patients with greater likelihood: an attacker who is an intelligent and knowledgeable individual, capable of identifying the communication activities of definite clients without their knowledge, might predict the real patient's identity with high accuracy, leading to an infringement of the client's security. Un-traceability ensures that an assailant cannot determine the transmission practices or activities of definite clients.
4.3.4 Perfect Forward Secrecy

PFS (Perfect Forward Secrecy) is a property of an encryption system that automatically and frequently changes the keys used to encrypt and decrypt data, so that if the current key is compromised, only a small section of the user's sensitive data is exposed. PFS is utilized in key agreement, where it protects previous sessions against future compromise of private keys or passwords by creating a fresh key for every single period (or session). An assailant cannot access the periodic keys created in earlier sessions; even if somebody obtains the client's long-term private key, past traffic remains unaffected, since each periodic key is itself protected with cryptographic algorithms.
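A minimal sketch of the idea, assuming the third-party cryptography package: each session derives its key from a fresh ephemeral X25519 exchange, so compromising any long-term key reveals nothing about earlier sessions.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def session_key(my_ephemeral: X25519PrivateKey, peer_public) -> bytes:
    """Derive a per-session key from an ephemeral Diffie-Hellman exchange."""
    shared = my_ephemeral.exchange(peer_public)
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"hc4.0-session").derive(shared)

# Fresh ephemeral keys for this session only; they are discarded afterwards.
alice, bob = X25519PrivateKey.generate(), X25519PrivateKey.generate()
k_alice = session_key(alice, bob.public_key())
k_bob = session_key(bob, alice.public_key())
assert k_alice == k_bob         # both ends hold the same short-lived session key
```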
4.3.5 Attack Resistance

Attack resistance generally means fighting back against a client who has targeted or assaulted the framework. The framework must withstand numerous attacks, namely replay, spoofing, modification, MIM and impersonation attacks.
4.3.5.1 Replay Attack

The replay attack is a type of cyber-attack in which a malicious entity intercepts a valid data transmission going through a network and repeats it. The replayed message owes its validity to the original data, which typically comes
from an authorized user; hence the network's security protocol treats it as if it were the original data, and the malicious entity passes itself off as valid. Replay attacks are commonly used to gain access to information stored on a protected network, and they can also be used to defraud financial institutions into duplicating transactions.
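A common countermeasure, sketched below under illustrative parameters, is to tag each message with a fresh nonce and a timestamp and to reject anything already seen or too old.

```python
import time

class ReplayGuard:
    """Reject messages whose nonce was already seen or whose timestamp is stale."""
    def __init__(self, max_age_s: float = 30.0):
        self.seen: set = set()
        self.max_age_s = max_age_s

    def accept(self, nonce: str, sent_at: float) -> bool:
        if nonce in self.seen or time.time() - sent_at > self.max_age_s:
            return False               # replayed, or outside the freshness window
        self.seen.add(nonce)
        return True

guard = ReplayGuard()
now = time.time()
assert guard.accept("nonce-1", now) is True
assert guard.accept("nonce-1", now) is False   # the identical message is rejected
```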
4.3.5.2 Spoofing Attack

A spoofing attack deceives systems, individuals and organizations into misinterpreting something, in an attempt to steal data, steal money or spread malware. The spoofer initiates communication with the target sender, victim or system from an unknown source while disguising itself as an authentic and safe sender. By means of a malicious link, scammers can send the victim to a malware download or to a fake login page disguised under a familiar logo and a spoofed URL, in order to capture a username and password.
4.3.5.3 Modification Attack

A data modification attack is not interested in taking the data, but instead makes subtle, stealthy modifications to data for some malicious gain, which can be as harmful to organizations as fraud or theft. Modification attacks do not always result in a tangible financial gain; these types of attacks are commonly carried out by insiders with malicious intentions.
4.3.5.4 MITM Attack

A Man-In-The-Middle (MITM) attack is a form of eavesdropping on data passing through a network. In this attack the malicious actor inserts himself into a conversation between two parties and intercepts the data through a compromised, often trusted, system. The target is frequently fiduciary information, and the attacker may also use malware to open the channel of communication and build vast networks of compromised systems. Often, organizations are unaware that their data has been tampered with until it is too late. The MITM attack builds a route between a user and an entity and attempts to conceal the breach and information theft.
4.3.5.5 Impersonation Attack

An impersonation attack is a form of cyber-attack in which attackers send emails that impersonate an individual or company in order to gain access to
confidential and restricted information. Impersonation attacks use manipulation to access information, and attackers usually do background research on the probable victim. The attack includes targeting the victim, where background research on the probable victim is done; trust-building, where the attacker impersonates someone known to the victim; and finally deploying the actual attack once a fiduciary relationship is established.
4.4 ICT Pillars Associated With HC4.0

The current world is being modified by the presence of anytime, anywhere connectivity. The unusually universal presence of wireless and portable technologies, including in developing nations, small wireless sensors offered at affordable cost, and the financially efficient facilities provided by new automated frameworks (for instance, immense-scale datacenters utilizing virtualization) have now enabled a set of new medical facilities, and new degrees of quality and cost-effectiveness in established ones. Examples range from the growing accessibility and properties of clinical software and mobile applications, essentially driven by the integration of mobile gadgets into clinical practice, to supporting people experiencing obesity or other persistent illnesses, or population aging, by analyzing huge digitalized information on a wide scale. The radical changes in medicine sanctioned by these most recent innovations can be recognized in the idea of P4 medicine: preventive, participatory, predictive and personalized. This methodology depends on a thorough comprehension of every patient's own biology rather than grouping patients into particular treatment categories, and is used to minimize worldwide medical accounts, for instance by minimizing hospitalization and decreasing unnecessary usage of medications and systems [13]. Such developments have emerged from the wide field of ICTs, and we concentrate on three key enabling facilitators, dubbed the pillars of Healthcare 4.0 for their critical significance: IoT, BD, and CC together with FC, elaborating the usage of these advances in the field of HC4.0 and representing their advancement in terms of scientific interest and production. Other angles, both technological and not, may be considered as part of Healthcare 4.0; however, on examination they can be considered either secondary or enabled by
advancements at a lower maturation level. Amidst all of these, the 5G environment, whose details and specialized solutions are still being clarified and are under investigation, is playing an important role. The benefits of these developing technologies (for instance, modern Quality-of-Service potentials, almost zero latency, and information rates in the order of Gbps) are expected to bring various advantages to medical-related solutions [14].
4.4.1 IoT in Healthcare 4.0

A broad vision for IoT is described by the ITU (International Telecommunication Union): beyond anytime, anywhere interaction among people, and onward to links between anything, it concentrates on digital identification and Machine-to-Machine (M2M) transmission. The elements associated with the IoT admit an immense range of interpretations, including RFID and Wireless Sensor Networks (WSNs), and they must respect precise constraints, for example power utilization, size and processing capacity. Of particular interest for the medical sector are Wireless Body Area Networks (WBANs), formed from numerous wireless gadgets (such as sensors and actuators) embedded in the human body. Because of such complex and heterogeneous scenarios, topics related to IoT are often organized by reference to layers [15], from bottom to top: (a) the perception layer (formed by sensors and actuators); (b) the transmission layer (used to pass the detected information to the upper layers); (c) the computation layer (responsible for making choices and preparing information); (d) the application layer (which makes use of the IoT infrastructure to accomplish high-level objectives such as transport, manufacturing, healthcare, home automation, etc.); a minimal sketch of these layers appears after this paragraph. Practically all modern research on IoT has paid most attention to the transmission layer and its communication conventions; the planning and usage of a lower-power, more consistent, Internet-enabled transmission stack is usually agreed to be a chief necessity, although the IoT definition can be fuzzy for certain areas. The recent vision of IoT, when applied to some sectors (for example, manufacturing processes), considerably overlaps with Industry 4.0: one moves towards IoT either by extending the referenced mechanisms with manufacturing-related particulars and logistics [16], or by appending IoT innovations to existing automated processes, with numerous challenges as an outcome. Healthcare has proven one of the best sectors for IoT [17].
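The four layers can be caricatured in a few lines of Python; the queue standing in for the transmission layer and the fever threshold are purely illustrative assumptions.

```python
from queue import Queue

uplink: Queue = Queue()            # stand-in for the transmission layer

def perceive() -> dict:
    """Perception layer: a body temperature (BT) sensor produces a raw sample."""
    return {"sensor": "BT", "value_c": 38.4}

def transmit(sample: dict) -> None:
    """Transmission layer: convey the detected information upward."""
    uplink.put(sample)

def compute(sample: dict) -> str:
    """Computation layer: prepare the information and make a choice."""
    return "fever-alert" if sample["value_c"] >= 38.0 else "normal"

# Application layer: use the decision for a high-level healthcare objective.
transmit(perceive())
print(compute(uplink.get()))       # -> fever-alert
```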
This type of prototype is building the latest healthcare, offering innovative and social possibilities: IoT can be considered the principal alternative technology for the medical sector, providing a remarkable contribution to decreasing overall medical cost while improving medical outcomes, even though behavioral changes from the stakeholders in the systems are required. Advancements in wireless technologies steadily support continuous checking of physiological parameters, which helps in controlling persistent illness, permits early detection and manages clinical emergencies. Further, diagnostic, clinical and imaging sensor gadgets are essentials of IoT in the medical sector, although a large number of applications also take advantage of common-purpose smart devices (smartphones, PDAs, tablets). IoT also provides frameworks where intelligent gadgets can communicate with other intelligent gadgets to acquire the latest knowledge of clients and the environment for better decisions [18]. Further, the Wearable Internet of Things (W-IoT) plans to deliver telehealth to achieve an environment for automated mediation. Moreover, the IoHT (Internet of Health Things) is established as the blend of mobile applications, wearables and other connected tools, holding context-aware, always-on smart sensors for clinical devices [19]. Some of the other related technologies are the IoNT (Internet of Nano Things), the IoMT (Internet of Mobile Things), etc.
4.4.2 Cloud Computing (CC) in Healthcare 4.0

The term Cloud Computing (basically, "the Cloud") refers to a model that ensures "utility computing", that is, the hiring of computing assets (computational force, memory and the related communication assets) on demand, with little interaction with the supplier. In this manner the Cloud makes operation simple, since the client need not bother about sizing and predicting the required assets, enabling pay-per-use billing on minimal terms, without a long-term commitment by the client. The consumer of the cloud takes advantage of effectively unbounded assets on command, delivered as a service: the significant basic facilities are Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS), joined by further varieties such as Function as a Service (FaaS), also called "serverless computing" [20]. Specifically, the Cloud is essential to satisfy various necessities arising from IoT; indeed, in a portion of the literature it is planned as the top layer of IoT [21]. The movement towards cloud facilities has been powered by a drift over previous decades: that is, the expansion of installed functionalities
in field gadgets, which has endowed them with increasingly more intelligence and greater adaptability, henceforth permitting them to take on tasks of the Cloud, with responsive and variable gains. In addition, numerous problems associated with the CC model have become very obvious over the years; they are generally connected with the connection joining the end gadget and the cloud facilities hosted by the datacenter: cost, latency, bandwidth, and the presence of links, all of which restrict various functions of CC. The escalation of powerful phone gadgets has additionally intensified these phenomena, greatly testing the Cloud model. For several reasons, therefore, the Cloud cannot satisfy all the necessities of various applications related to healthcare, which surfaces the necessity for an alternative model; among the different ideas, terms and statements giving the required result, the following is suggested. FC shifts several CC facilities towards the edge of the networking system, that is, near the client gadgets, perhaps partly relying on the clients' machine assets, hence distributing work among end gadgets and conventional cloud datacenters, and bringing low latency rates, long-period security and quicker responses, while supporting improvements in the adaptability of the entire framework.
4.4.3 Fog Computing (FC) in Healthcare 4.0

To overcome today's problems in the medical sector, for instance the delays encountered in the supervision of patients, fundamental cloud computing alone is not the only method; the medical services and their uses of cloud computing do not accomplish the requirements of HC4.0, since CC suffers disadvantages such as poor continuous reaction and delays. In the medical sector, a minor delay can cost a patient's life; consequently, to upgrade facilities and their utilization, fog computing plays a significant role. FC ensures on-time facility transfer with the highest consistency while overcoming issues such as delay, jitter and the expense of passing information to a Cloud. It can be considered a distributed, tiered design that improves the computational, memory and networking assets available with cloud computing. Fog computing provides three major advantages over cloud computing: privacy, Low Latency (LL) and resiliency. The OpenFog consortium characterized fog computing in the form SCALE: S = Security, C = Cognition, A = Agility, L = Latency, E = Efficiency. Fog computing shifts several cloud computing
facilities towards the edge of the networking system, nearer to client gadgets and somewhat dependent on the clients' machine assets, so dispersing work among end gadgets and conventional cloud datacenters, giving low latency rates, long-period security and quicker responses, while encouraging advancements in the adaptability of the entire framework. A taxonomy of FC can be explained as follows; it includes medical data, considered one of the primary segments of the existing scientific classification, which is clarified thoroughly. FC has numerous uses in HC4.0 conditions and has seen increased demand among scientists: compared with preceding years, numerous works exist on finding, examining, observing, detecting and representing infections. Hence this taxonomy gives an organizational view of the existing works in the healthcare sector using fog computing. It consists of Data Collection: steady medical observation with the help of Wireless Body Area Networks (WBANs) and Implantable and Wearable Medical Devices (I&WMDs), which is the major developing edge in the medical sector. Rapid changes in bio-clinical sensors, low-cost electrical gadgets, low power and wireless systems have carried it to a new extent, although serious issues and difficulties remain to be considered. Beyond the steady checking of human wellbeing, the sensing gadgets that gather all information from a person have become primarily important. The taxonomy also includes Data Analysis. An extensive gap lies between the memory and power necessities of long-term continuous observation frameworks and the proficiencies of the proposed gadgets. Models for sample aggregation, compressive sensing and anomaly-driven communication bring down the overheads related to wireless communication, information recording, authentication and encryption of information (a sketch of anomaly-driven transmission follows this paragraph). To screen fitness progressively with the help of universal medical-observation frameworks, the structure can accumulate biological signals from sensor-based gadgets and forward the detected data to portals by the utilization of particular wireless communication protocols such as WiFi. The instantaneous information is then transmitted to a distant cloud server for visualization, preparation and analysis.
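A minimal sketch of anomaly-driven communication follows; the baseline and tolerance are illustrative, the point being that only departures from the patient's normal range are uplinked, cutting radio, storage and encryption overhead.

```python
def should_transmit(value: float, baseline: float, tolerance: float = 0.1) -> bool:
    """Send a sample only when it departs from the patient's baseline."""
    return abs(value - baseline) > tolerance * baseline

heart_rate_stream = [72, 73, 71, 118, 72]   # one anomalous heart-rate sample
uplinked = [s for s in heart_rate_stream if should_transmit(s, baseline=72)]
print(uplinked)                              # -> [118]: only the anomaly is sent
```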
4.4.4 Big Data (BD) in Healthcare 4.0

BD is a much more extensively reviewed field and definition. Over time, its central focus has moved from the qualities of datasets relative to
the technologies of the time (data records that could not be captured, managed and handled by general PCs within an admissible scope, as expressed by the definition given by Apache Hadoop) to the particular technologies that can economically extract value from extremely large information by enabling high-speed capture, discovery and analysis. The most generally agreed set of properties related to BD is the 5V's: (a) Volume (increase in data scale); (b) Velocity (gathering and examination are exposed to time limits); (c) Variety (information is made of different kinds, that is, structured, unstructured and semi-structured information); (d) Veracity (information has varying degrees of reliability, with respect to provenance, handling and management); (e) Value (the entire design is focused on affordable value production). Another current source of BD playing a crucial role in I4.0 is corporate (or enterprise) information. Enterprises already produce and deal with large amounts of information: in addition to internal communication, accounting and employee information, there is information required by standards. This likewise applies to medical organizations, since this sort of information (for instance scheduling data, administrative data and billing-related data), although not precisely clinical, improves the catalog of potential resources, enabling work that is not restricted to clinical and biological views. Moreover, in I4.0 this is only expected to ramp up, owing to the concentration on extensive exploitation of streams of information, enhanced with more information resources and with metadata on the actions themselves. To this will be attached outer information (that is, external to the enterprise) originating from clients, sold items, colleagues and providers, calling for many applications and further development of BD techniques.
4.4.5 Machine Learning (ML) in Healthcare 4.0

Beyond the technical difficulties of working with EHR information, and although scientists have yet to exploit the universe of EHR-derived variables accessible for predictive design, we observe many energizing possibilities for ML to strengthen medicine and the provision of medical services. Designs that stratify patients into risk groups to inform practice administration have a tremendous potential effect on medical value, and strategies capable of foreseeing results for specific patients bring medical practice one step closer to precision medicine. Determining significant
expenses and identifying patients at high risk, in order to focus medication, is becoming progressively important as healthcare vendors take on the budgetary risk of treating their patients. ML techniques have already been utilized to describe and predict numerous medical dangers. Ongoing work uses penalized logistic regression to identify patients with undiagnosed PVD (Peripheral Vascular Disease) and predict their mortality risk, a methodology comparable to stepwise logistic regression as far as precision, calibration and net reclassification are concerned. These predictive designs were executed in clinical work to obtain more productive and higher-quality care; as an instance, a predictive design stratifying infants' risk of sepsis reduced the medicines prescribed by 33 to 60% [22]. Current projects to acquire knowledge of heart disease and pulmonary rates through EHR information have reduced the rate of cardiac alarms in the emergency unit thanks to the acquired knowledge in monitoring, thus decreasing alarm fatigue. ML is also applicable to medical and practice administration, to perform operations efficiently and to obtain the best outcomes for patients. As an instance, designs were created to forecast the demand for casualty wards and voluntary medical procedures, to inform health workers' choices. The Veterans Medical Organization records medical information on some 20 million sufferers, which was used for modeling the risks of hospitalization and the fatality rates of sufferers, with AUC values between 0.81 and 0.87. Risk scores determined by these designs are presented to the Patient Aligned care teams and consulted by 1,200 or more health workers each month in their normal practice. Finally, ML is not a cure for all ills, and predicting what cannot be acted upon is of little use. As an instance, we might have the option to precisely predict progression from stage 3 to stage 4 CKD (chronic kidney disease) [23]; without powerful treatment choices, other than dialysis and kidney transplants, such an expectation does not do much to rectify the condition of all sufferers. Alternatively, assume a design predicting a specific tumor (consider lung cancer) is inadequately related to the genotypic forms of the cancer: can we acquire therapies depending upon these predictions? Alongside these extraordinary circumstances, there are concrete grounds where the utilization of predictive design in the medical sector will provide more dynamic therapies, progressive utilization of assets, and the best care to patients at cheaper expense. Different sectors, for example finance, retail, airlines and web-based businesses, have switched to ML and are subsequently using its designs or models to achieve fluent workflow in their organizations. This
can be considered an ideal opportunity for healthcare to switch to ML to develop more significant changes.
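A minimal sketch of risk stratification with penalized logistic regression, in the spirit of the PVD work cited above, is shown below using scikit-learn on synthetic data standing in for EHR-derived features; the feature count, class imbalance and decision threshold are illustrative assumptions.

```python
# pip install scikit-learn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for EHR-derived predictors and outcomes.
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)  # L1-penalized
model.fit(X_tr, y_tr)

risk = model.predict_proba(X_te)[:, 1]          # per-patient risk scores
print("AUC:", round(roc_auc_score(y_te, risk), 3))

high_risk = risk > 0.5                          # stratify patients for outreach
print("patients flagged:", int(high_risk.sum()))
```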
4.4.6 Blockchain (BC) in Healthcare 4.0

BC is considered the combination of two older innovations: (i) cryptography, and (ii) peer-to-peer (P2P) transmission. The consequent bitcoin BC was formed in response to the worldwide monetary emergency of 2008 and has motivated the extensive utilization of BC across the world [24]. The promoters of BC consider it a substitute for the most centralized techniques for activities across divisions beyond the economy, frequently featuring speed, reduced expense, safety, fewer faults, fault tolerance, and the elimination of a core organizational part subject to attack or failure [24]. On the other hand, difficulties that can restrict its scalability have arisen [25]. BC is a class of Distributed Ledger Technologies (DLTs): a collective journal with a growing, ordered inventory of information saved and preserved in "massive PC data records". These records are maintained by a number of inter-related gadgets (telephones, PCs or embedded frameworks) not limited by geography [26]. It does not require the network members to trust each other, as it holds cryptographically authorized, built-in protective mechanisms [27]. Every entry is a single block (made of texts and exchanges), connected and time-stamped via a cryptographic hash and approved by system peers. The role of BC can now be described as follows. Since the release of bitcoin about ten years prior, several variations of BC have been presented, and BC is presently being employed on advanced resources beyond monetary exchanges. Among all sectors, the health-related sector is receiving most of the attention and possibilities; this is exhibited in the unexpected peaks in worldwide Google trends. Given that customary HIEs and PHR-related exchanges have neglected to deliver their promise of sharing coordinated EHRs, competing interests and numerous other variables keep on uncovering the trust deficit inherent in conventional HIE interventions. Protection guidelines and instances of information breaches have heightened this distrust. Partners are subsequently reluctant to coordinate or work together to the degree necessary for mutual worth; as a result, medical expenses increase and wellbeing results decrease. This might clarify the continued consideration of HIEs over the most recent twenty years represented in Google trends. If this pattern gives a signal of current worldwide concerns, it then
implies that such difficulties persist despite ten-year-old technological advancements. Altogether, we can propose that the trust deficit is a major factor answerable for the unavailability of advancement. Scientists are presently turning to BC for assistance in addressing some segments of such trust-related difficulties. These and numerous others have fueled the modern wave of interest in BC in wellness (or medical) programs, and more comprehensive exploration and development has been authorized.
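The hash-linked, time-stamped ledger described above can be illustrated in a few lines; the record contents are hypothetical, and real deployments add consensus and peer validation on top of this linkage.

```python
import hashlib
import json
import time

def make_block(records: list, prev_hash: str) -> dict:
    """One ledger entry: exchange records, a timestamp, and a cryptographic
    hash linking it to the previous block."""
    block = {"records": records, "timestamp": time.time(), "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

genesis = make_block(["ledger created"], prev_hash="0" * 64)
block1 = make_block(["EHR consent: patient-042 -> clinic-7"], genesis["hash"])

# Tampering with genesis changes its hash and breaks block1's prev_hash link,
# which any peer can detect when validating the chain.
assert block1["prev_hash"] == genesis["hash"]
```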
4.5 Healthcare 4.0's Application Scenarios

Market patterns and the scientific literature show the role of medical care as a driver for the significant pillars carrying the Industry 4.0 viewpoint. Considering each pillar, IoT is leveraged for remote monitoring in all its facets, ensuring HC4.0's presence in a vast assortment of settings, ranging from long-period elder care and home surveillance to critical medical rehabilitation frameworks. As these frameworks may create a bigger and bigger measure of a wide assortment of information, enabling high-velocity capture, discovery and investigation, BD innovations and their designs are expected to accumulate value from them. This can furthermore push the shift to cloud architectures, required for more secure and reliable handling, and for both the processing and storage prerequisites needed to test and examine these bulky quantities of data. This is summarized in Table 4.1 and detailed as follows.
4.5.1 Monitoring Physical and Pathological Signals

Reports represent how the IoT prototype, carried by the advancement achieved in mobile communication technologies, wearables and other sensing gadgets frequently presented in WBANs and WSNs, together with the presence of on-call cloud and fog assets and BD technologies, can provide a precious structure supporting pervasive-monitoring applications. Moreover, the arriving structure holds up the gathering of wellbeing records, possibly supporting the generation of statistical information on health conditions [28], and the delivery of new types of cloud services, able to complement or replace the existing hospital information systems. Such techniques also promise to considerably reduce the risk of introducing errors compared with techniques requiring manual intervention.
Table 4.1 Healthcare 4.0's application scenarios.

Application: Physical-Pathological Signal Monitoring
Aim: To create a precious framework to sustain monitoring operations as well as supporting the collection of medical data, delivering novel cloud services and providing statistical data.
Requirements: Data collection hardware and mechanisms to gather data; communication hardware and processes to transmit data; data analysis processes to obtain relevant data.

Application: Self-Management
Aim: Solutions for self-management issues through big data analytics and other mechanisms.
Requirements: Suggestion of measured data and temporary storage of data; provision of effective and reliable feedback.

Application: Smart-Pharmaceuticals
Aim: To provide real-time constrained drug distribution in order to achieve optimal efficacy and minimal adverse effects through various wireless or Internet based technologies.
Requirements: Ingestible or wearable sensors; integrated IoT intelligence and interaction; big data analytics operations; collection of micro-level and macro-level metadata.

Application: Personalized Healthcare
Aim: To take patient-specific decisions to offer highly specific services relying majorly on genetic data of the concerned individual.
Requirements: Big data analytics operations; collection of data from numerous sources.

Application: Cloud-based Medical Databases
Aim: To empower and simplify the architecture, development and distribution of data systems for obtaining, analyzing and sharing hospital accounts data, clinical data or medical photographs.
Requirements: Designing systems focusing on security and privacy; big data analytics operations; collection of data from numerous sources.

Application: Rehabilitation
Aim: To implement effective value saving techniques for healthcare frameworks and appreciable quality of patients' lives.
Requirements: WBAN technologies; multiple sensor data fusion; implementation of virtual reality; real-time feedback and biofeedback.
The customized structure used for patient monitoring consists of three fundamental segments: first, sensing and data-gathering hardware, used to collect physiological and movement information; second, communication hardware and software, used to transmit the data to a remote center; and third, scientifically validated data-examination techniques, used to extract information from the physiological and movement data.
4.5.2 Self-Management, Wellbeing Monitoring and Prevention

Healthcare 4.0 notably supports solutions for self-management. BD technology permits a move mostly from healing to anticipation, which is also the significant peculiarity established by P4 medicine. Analysts have likewise researched how to structure smart facilities beyond straightforward capacities such as suggesting measured data and storing information provisionally, so as to give increasingly effective feedback to people. In that case, a portion of these arrangements can execute algorithms that help to forestall illnesses by perceiving modifiable hazard factors and organizing interventions for behavioral change in fitness [29, 30]. For the case of managing and preventing obesity and diabetes, these systems can be ready to give proposals, mostly for teaching and formally enabling best nutritional practices and personal wellbeing schemes.
4.5.3 Medication Consumption Monitoring and Smart-Pharmaceutics

Poor medication adherence is progressively common in elderly and chronically ill subjects and is worsened by cognitive impairment; observing pharmaceutical utilization therefore addresses pressing problems. Additionally, these frameworks may become an important apparatus for clinicians overseeing illness management, because they give a quantitative method for evaluating treatment adequacy. Earlier paradigms structured for the elderly combined RFID and sensor-networking systems. Given the importance of timing in medication delivery for achieving ideal efficacy and thereby limiting unfavorable impacts, a considerable number of mobile applications are currently accessible that comprise features like scheduling reminders,
drug-consumption tracking and prescription reminders [30]. Modern solutions (for example, comprising ingestible or wearable sensors, and consolidated IoT connectivity and intelligence) are already being shown. In such cases, smart pharmaceutics are described as electronic packages, transporting structures or pills giving intelligent added value. Smart pharmaceutics of the future are believed to gather numerous micro-level and macro-level metadata, which can capably serve the latest insight into sickness, assistance-based service structures, and ease of customized medical care.
4.5.4 Personalized (or Customized) Healthcare

Customized healthcare is intended to be client-driven: it concentrates on taking extremely specific choices for individual patients (rather than merely distinguishing patients within classical therapy groups) [31]. Collecting information from multiple sources (for example, from both the patients and the surroundings) is essential, as the connected data examination eases wellbeing and social supervision in decision-taking and delivery. Exemplary assets are WDs (in other words, embedded micro- and nano-advancements) accompanied by sensors or treatment-delivering gadgets, such as fall detectors, defibrillator vests, and so on. Such features are in general in the Industry 4.0 vision; they heavily characterize Healthcare 4.0 and are emphasized when observed under the P4 medicine prototype, which strongly relies on the genetic data of every person (customized omics) and permits results on pre-disposition, diagnosis, pharmacogenomics, screening, prognosis and monitoring. Also, BD analytics is essential for executing customized medical care, at both the population level and the individual level [30].
4.5.5 Cloud-Related Medical Information Systems
Cloud-related architectures have in many cases been adopted to strengthen and facilitate the design, development, and deployment of information systems for acquiring, processing, and distributing medical data [32], hospital administrative information, or clinical images. These architectures help streamline data collection; for example, the involved parties are frequently given mobile client interfaces to cloud services for gathering and managing medical information.
Data sharing across different medical structures, or between clinics and patients, is likewise profitable, because in most cases these frameworks also focus on integrating data held in different formats. Concerns about system performance are considered in only a few cases, whereas the designs frequently concentrate on security and privacy, both of which are treated as critical.
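A minimal sketch of the mobile-client-to-cloud pattern described above, sending one vital-sign reading to a cloud store; the endpoint URL, payload fields, and bearer-token scheme are placeholders rather than any actual hospital API:

import requests  # widely used third-party HTTP client

CLOUD_API = "https://example-hospital-cloud/api/v1/observations"  # placeholder URL

def push_observation(patient_id, kind, value, unit, token):
    """Upload one reading from a mobile client to the cloud medical-data service."""
    record = {
        "patient_id": patient_id,
        "kind": kind,      # e.g. "heart_rate"
        "value": value,
        "unit": unit,      # e.g. "bpm"
    }
    resp = requests.post(
        CLOUD_API,
        json=record,
        headers={"Authorization": f"Bearer {token}"},  # per-user access control
        timeout=5,
    )
    resp.raise_for_status()   # surface transport or authorization errors
    return resp.json()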
4.5.6 Rehabilitation
Beyond assisted living, home-based rehabilitation is expected to yield substantial cost savings for healthcare systems while giving patients a better quality of life. WBAN technologies are among the main tools permitting detection and tracking of human movement in rehabilitation practice. Compared with generic assisted-living solutions, home-based rehabilitation is characterized by more constraints and requirements, and the related solutions include multi-sensor data fusion, virtual-reality integration, and real-time feedback to the patient. A significant feature offered by WBANs in home rehabilitation is biofeedback: physiological measurements and other parameters are fed back to the users themselves. This practice effectively helps patients understand and adjust their physiological behavior, with the final objective of improving their wellbeing and performance. Table 4.1 summarizes the application scenarios of Healthcare 4.0.
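As a sketch of the biofeedback loop just described, assuming heart rate as the monitored signal, the class below smooths recent readings and feeds a message back to the user; the window size, target range, and messages are illustrative assumptions:

from collections import deque

class BiofeedbackMonitor:
    """Feed a smoothed physiological signal (here, heart rate) back to the user."""
    def __init__(self, window=10, low=60, high=100):
        self.samples = deque(maxlen=window)   # sliding window of recent readings
        self.low, self.high = low, high       # assumed target range in bpm

    def update(self, bpm):
        self.samples.append(bpm)
        avg = sum(self.samples) / len(self.samples)
        if avg < self.low:
            return f"avg {avg:.0f} bpm: below target, raise exercise intensity"
        if avg > self.high:
            return f"avg {avg:.0f} bpm: above target, slow down and breathe"
        return f"avg {avg:.0f} bpm: within target range"

monitor = BiofeedbackMonitor(window=3)
for reading in (72, 130, 140, 150):   # invented exercise readings
    print(monitor.update(reading))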
4.6 Conclusion
In this chapter we have given readers insight into Healthcare 4.0 (HC4.0), which extends the idea of Industry 4.0 (I4.0). The aim is to help scientists and physicians working with ICT bring their expertise to the needs of the medical domain, and to help practitioners in medical information systems and automation successfully confront the evolving ideas and methods originating from computer science that make up the envisioned HC4.0. We have presented a comprehensive discussion of I4.0 and, subsequently, HC4.0: the idea of I4.0 is spreading worldwide, and this holds for the medical sector as well, where IoT, fog computing, cloud computing, machine learning, Big Data, and blockchain are modernizing e-Health and its entire environment, shifting it toward HC4.0 and opening new lines of research. We have also discussed the background
of HC4.0, its general architecture, and the components identified with e-Health. Moreover, the reviewed ICTs not only upgrade conventional procedures and frameworks, such as self-management, wellbeing monitoring of physiological and pathological signals, drug consumption, and exercise, but also motivate and enable new, previously unexpected methodologies and applications, for example cloud-related medical information systems, home-based rehabilitation, and customized healthcare. In the future, many of the open problems and research challenges identified with HC4.0 can be addressed, and further studies are expected on smart technologies such as machine learning, Big Data, and blockchain.
References 1. Omanović-Mikličanin, E., Maksimović, M., Vujović, V., The future of healthcare: Nanomedicine and internet of nano things. Folia Med. Fac. Med. Univ. Saraev., 50, 1, 23–28, 2015. 2. Aceto, G., Persico, V., Pescapé, A., The role of Information and Communication Technologies in healthcare: Taxonomies, perspectives, and challenges. J. Netw. Comput. Appl., 107, 125–154, 2018. 3. Laplante, P.A. and Laplante, N.L., A Structured approach for describing healthcare applications for the Internet of Things. 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT), 2015. 4. Laplante, P.A. and Laplante, N., The Internet of Things in Healthcare: Potential Applications and Challenges. IT Prof., 18, 3, 2–4, 2016. 5. Islam, S.M.R., Kwak, D., Kabir, M.H., Hossain, M., Kwak, K.-S., The Internet of Things for Healthcare: A Comprehensive Survey. IEEE Access, 3, 678–708, 2015. 6. Archenaa, J. and Anita, E.M., A Survey of Big Data Analytics in Healthcare and Government. Proc. Comput. Sci., 50, 408–413, 2015. 7. Kumari, A., Tanwar, S., Tyagi, S., Kumar, N., Verification and validation techniques for streaming big data analytics in internet of things environment. IET Networks, 8, 3, 155–163, 2019. 8. Kumari, A., Tanwar, S., Tyagi, S., Kumar, N., Fog computing for Healthcare 4.0 environment: Opportunities and challenges. Comput. Electr. Eng., 72, 1–13, 2018. 9. Vora, J., Kaneriya, S., Tanwar, S., Tyagi, S., Kumar, N., Obaidat, M., TILAA: Tactile Internet-based Ambient Assistant Living in fog environment. Future Gener. Comput. Syst., 98, 635–649, 2019.
10. Gupta, R., Tanwar, S., Tyagi, S., Kumar, N., Tactile internet and its applications in 5G era: A comprehensive review. Int. J. Commun. Syst., 32, 14, 1–14, 2019. 11. He, D. and Zeadally, S., Authentication protocol for an ambient assisted living system. IEEE Commun. Mag., 53, 1, 71–77, 2015. 12. Mutlag, A.A., Ghani, M.K.A., Arunkumar, N., Mohammed, M.A., Mohd, O., Enabling technologies for fog computing in healthcare IoT systems. Future Gener. Comput. Syst., 90, 62–78, 2019. 13. Nice, E.C., From proteomics to personalized medicine: The road ahead. Expert Rev. Proteomics, 13, 4, 341–343, 2016. 14. Mattos, W.D.D. and Gondim, P.R., M-Health Solutions Using 5G Networks and M2M Communications. IT Prof., 18, 3, 24–29, 2016. 15. Trappey, A.J., Trappey, C.V., Govindarajan, U.H., Chuang, A.C., Sun, J.J., A review of essential standards and patent landscapes for the Internet of Things: A key enabler for Industry 4.0. Adv. Eng. Inf., 33, 208–229, 2017. 16. Weyrich, M. and Ebert, C., Reference Architectures for the Internet of Things. IEEE Software, 33, 1, 112–116, 2016. 17. Botta, A., Donato, W.D., Persico, V., Pescapé, A., Integration of Cloud computing and Internet of Things: A survey. Future Gener. Comput. Syst., 56, 684–700, 2016. 18. Santos, J., Rodrigues, J.J., Silva, B.M., Casal, J., Saleem, K., Denisov, V., An IoT-based mobile gateway for intelligent personal assistants on mobile health environments. J. Netw. Comput. Appl., 71, 194–204, 2016. 19. Khamparia, A., Gupta, D., de Albuquerque, V.H., Sangaiah, A.K., Jhaveri, R.H., Internet of health things driven deep learning system for detection and classification of cervical cells using transfer learning. J. Supercomput., 76, 11, 8590–8608, 2020. 20. Hendrickson, S., Sturdevant, S., Harter, T., Venkataramani, V., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H., Serverless computation with OpenLambda. Elastic, 60, 33–39, 2016. 21. Atzori, L., Iera, A., Morabito, G., Understanding the Internet of Things: Definition, potentials, and societal role of a fast evolving paradigm. Ad Hoc Netw., 56, 122–140, 2017. 22. Kaur, R., Kaur, K., Khamparia, A., Anand, D., An Improved and Adaptive Approach in ANFIS to Predict Knee Diseases. Int. J. Healthc. Inf. Syst. Inform. (IJHISI), IGI Global, 15, 2, 22–37, 2020. 23. Khamparia, A., Saini, G., Pandey, B., Tiwari, S., Gupta, D., Khanna, A., KDSAE: Chronic kidney disease classification with multimedia data learning using deep stacked autoencoder network, in: Multimedia Tools and Applications, https://doi.org/10.1007/s11042-019-07839-z, 2019. 24. Tapscott, D., Blockchain revolution, how the technology behind bitcoin is changing money, business and the world, Portfolio Penguin, Portfolio, 2016.
25. Puthal, D., Malik, N., Mohanty, S.P., Kougianos, E., Das, G., Everything You Wanted to Know About the Blockchain: Its Promise, Components, Processes, and Problems. IEEE Consum. Electron. Mag., 7, 4, 6–14, 2018. 26. Crosby, M., Nachiappan, Pattanayak, P., Verma, S., Kalyanaraman, V., Blockchain Technology Beyond Bitcoin. Applied Innovation Review (AIR), 1, 2, 6–19, 2015. 27. Burniske, C., Vaughn, E., Shelton, J., Cahana, A., How Blockchain Technology Can Enhance EHR Operability, Gem | Ark Invest Res., New York, 2016. 28. Habib, C., Makhoul, A., Darazi, R., Salim, C., Self-Adaptive Data Collection and Fusion for Health Monitoring Based on Body Sensor Networks. IEEE Trans. Ind. Inf., 12, 6, 2342–2352, 2016. 29. Khamparia, A., Singh, A., Anand, D., Gupta, D., Khanna, A., Arun Kumar, N., Tan, J., A Novel deep learning based multi-model ensemble methods for prediction of neuromuscular disorders. Neural Comput. Appl., 32, 15, 11083–11095, 2018, https://doi.org/10.1007/s00521-018-3896-0. 30. Chouhan, V., Kumar Singh, S., Khamparia, A., Gupta, D., Tiwari, P., Moreira, C., Damaševičius, R., de Albuquerque, V.H.C., A Novel Transfer Learning Based Approach for Pneumonia Detection in Chest X-ray Images. Appl. Sci., 10, 2, 559, 2020. 31. Silva, B.M., Rodrigues, J.J., Díez, I.D.L.T., López-Coronado, M., Saleem, K., Mobile-health: A review of current state in 2015. J. Biomed. Inf., 56, 265–272, 2015. 31. Viceconti, M., Hunter, P., Hose, R., Big Data, Big Knowledge: Big Data for Personalized Healthcare. IEEE J. Biomed. Health Inf., 19, 4, 1209–1215, 2015. 32. Chen, D., Chen, Y., Brownlow, B.N., Kanjamala, P.P., Arredondo, C.A.G., Radspinner, B.L., Raveling, M.A., Real-Time or Near Real-Time Persisting Daily Healthcare Data Into HDFS and Elastic Search Index Inside a Big Data Platform. IEEE Trans. Ind. Inf., 13, 2, 595–606, 2017.
5 Improved Social Media Data Mining for Analyzing Medical Trends
Minakshi Sharma1* and Sunil Sharma2
1Department of Computer Engineering, National Institute of Technology, Kurukshetra, India
2Department of Electronics and Communication, National Institute of Technology, Kurukshetra, India
Abstract
Nowadays, social media has become a prominent means of sharing and viewing news among the general population. People also spend more of their time on social media than on many other activities; it has become an inseparable part of our lives. On media such as Twitter, Facebook, or blogs, people share their health records, medication history, drug information, and personal views. For social media resources to be useful, noise must be filtered out and only the important content captured, excluding data irrelevant to the topics of interest. Even after filtering, the content may still contain unimportant items, so information should be prioritized by its estimated importance. Importance can be estimated from three factors: Media Focus (MF), User Attention (UA), and User Interaction (UI). Media Focus is the temporal popularity of a topic in the news; the temporal popularity of a topic on Twitter indicates its User Attention; and the interaction between social media users on a topic, which indicates the topic's strength in social media, is its User Interaction. These three factors form the basis for ranking news topics and thus improve the quality and variety of the ranked news. The objective of this chapter is to survey the different data mining methods that can be applied to the medical field. Finally, a novel method is proposed for classifying important news with improved accuracy.
*Corresponding author: [email protected] Sujata Dash, Subhendu Kumar Pani, S. Balamurugan and Ajith Abraham (eds.) Biomedical Data Mining for Information Retrieval: Methodologies, Techniques and Applications, (131–162) © 2021 Scrivener Publishing LLC
Keywords: Medical trends, data mining, social media, Media Focus (MF), User Attention (UA), User Interaction (UI)
5.1 Introduction
5.1.1 Data Mining
Data mining is the technique by which hidden patterns residing in huge datasets can be discovered. It is an essential technique today because of the vast amount of data in existence, and it has been widely used in many fields, such as medicine and the military. Analyzing and organizing data as it accumulates is essential so that it does not become unmanageable later. Social media is a major source of such large data collections in almost every field, and exploiting it is a route to power and success. Storing the many different types of data produced requires mass digital storage, and these massive collections have driven new technologies for managing them; database management systems (DBMS) are the established means of doing so. Data mining builds on this infrastructure whenever useful information must be extracted from vast amounts of data, making the work of sorting information easier and more efficient [9]. The main objective of the data mining technique is to bring order to data that exists in an unorganized form, extracting the useful information so that it, rather than the raw data, can be put to use. Figure 5.1 gives an overview of the steps involved in the data mining process.
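To make the cleaning and selection steps of Figure 5.1 concrete, here is a small pandas sketch; the toy patient records and column names are invented for illustration:

import pandas as pd

# Toy records standing in for a raw, partly dirty data source.
raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "age":        [54, 61, 61, None, 47],
    "systolic":   [138, 150, 150, 129, None],
})

clean = (
    raw.drop_duplicates("patient_id")   # data cleaning: remove repeated records
       .dropna()                        # data cleaning: drop incomplete records
)
task_relevant = clean[["age", "systolic"]]   # data selection: keep only mining attributes
print(task_relevant)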
5.1.2 Major Components of Data Mining
The major components of the data mining process are given below (see Figure 5.2):
1. Database, data warehouse, or other storage: The principal input of the data mining process is information, which may reside in databases or data warehouses. Data cleaning and data integration are performed at this stage.
2. Database or data warehouse server: The server is the main storage medium where the database or data warehouse resides; it handles user requests.
Figure 5.1 Steps involved in Data mining process. http://www.lastnightstudy.com/ Show?id=34/Knowledge-Discovery-Process-(KDP).
Figure 5.2 Components of data mining system. https://www.ques10.com/p/9209/ explain-data-mining-as-a-step-in-kdd-give-the-arch/.
3. Knowledge base: Domain knowledge is used to guide the search and to evaluate the interestingness of the patterns obtained. Concept hierarchies within it allow attributes and their values to be organized at different levels of abstraction.
4. Data mining engine: This is the core component of a data mining system. The mining engine performs tasks such as characterization, association-rule mining, and classification.
5. Pattern evaluation: This component identifies the truly interesting patterns among those generated.
6. Graphical user interface: This module is the intermediary between the user and the data mining system; through it, the user can pose queries to find interesting patterns.
5.1.3 Social Media Mining
Nowadays social media mining is popular because social media is affordable for most people; the major social sites in use are shown in Figure 5.3. Social media plays an important role in estimating user behavior [10]: how do people think about a given topic? People give their reviews on different topics, which is helpful to others. With the heavy usage of social media, however, the data generated is also very large. Social media data can be represented as a graph, which is made up of nodes and links as shown in Figure 5.4.
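As a minimal illustration of the node-and-link representation, the sketch below builds an adjacency list from invented user interactions and uses node degree as a crude signal of how connected (and hence potentially influential) a user is:

# Each pair records an interaction between two users (invented examples).
interactions = [
    ("alice", "bob"), ("bob", "carol"), ("alice", "carol"), ("carol", "dave"),
]

graph = {}
for a, b in interactions:
    graph.setdefault(a, set()).add(b)   # undirected edge: store both directions
    graph.setdefault(b, set()).add(a)

for user, neighbours in sorted(graph.items()):   # degree = number of connections
    print(user, len(neighbours))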
5.1.4 Clustering in Data Mining
Clustering is the process of separating data into groups of similar objects, as shown in Figure 5.5 [13, 14].
Figure 5.3 Major Social media sites. https://www.securitymagazine.com/ articles/87597-facebook-is-most-popular-social-media-platform.
Cluster 1 Cluster 3 Cluster 2
income
Figure 5.5 An example of clustering. https://www.analyticsvidhya.com/blog/2013/11/ getting-clustering-right/.
If in the system, there is less number of clusters then the simplification level can be achieved easily. But due to these less number of clusters, there is loss some required information. Therefore, with the help of clusters data must be modeled. Clusters are referred as the hidden patterns as per the machine learning, these clusters are searched in the unsupervised manner. The concept of data is defined by the system used as an outcome. On the basis of studied clustering mechanism it is assumed that it is not the process of one step. The clustering process is divided into following steps on the basis of text on the cluster analysis given below: a. Data Collection: It is the process of collecting or extraction of the data objects from the present data source called as data collection process. The classification of data objects is done on the basis of their respective values for some attributes.
136 Biomedical Data Mining for Information Retrieval b. Initial Screening: It is the process in which the extracted data from different sources data first pass through the some evaluation or examinations. In the data warehousing this process is also executed. c. Representation: It is the step in which data is rearranges as per the requirements in order to process the clustering algorithm. The selection of same measurement and the data different characteristics and dimensions are evaluated here. d. Clustering Tendency: The clustering of the data is tested in this step, but this step is terminated if the large data sets are present. e. Clustering Strategy: The selection of clustering algorithms and the initial parameters is done in this process. f. Validation: This functionality is performs after the experiments are done manually and visually. g. Interpretation: With the other technologies, obtained results are combined in case of classification using which results and its analysis is provided.
5.2 Literature Survey
Adhikari et al. [1] observed that the heart is among the most important organs: it drives oxygen circulation and the metabolic activity of the body, passes essential nutrients to the different body parts, and helps remove wastes. Even a minor malfunction of the heart can therefore affect the whole organism. Large amounts of clinical data and extensive data analysis can help doctors detect disease at an early stage, since disease can be predicted by analyzing data on different health problems. The authors analyzed data from 1,094 patients and developed a model that captures the necessary requirements and predicts whether a patient is likely to suffer a heart attack. Such a model makes the situation clearer between doctor and patient, supports decision making, and lets the doctor treat patients efficiently. Accuracy alone does not make a model effective; parameters such as the true positive rate and false negative rate, along with the AUC-ROC, must also be taken into account, and these parameters guide the construction of the algorithm inside the model.
Burse et al. [2] noted that as heart disease becomes more prevalent, the mortality rate from it rises accordingly. In most cases diagnosis rests on the doctor's experience and intuition; a medical degree alone is not enough, and the doctor must be highly skilled and experienced. It is therefore necessary to develop more methods by which useful information can be extracted from data. The authors examined L1/2 logistic regularization, Lasso, Group Lasso regularization, and Elastic Net, techniques used to select feature subsets for predicting heart disease. The different regularization techniques selected different subsets of features, and the paper compared them on their predictive performance. Mane et al. [3, 36] reported that, according to a World Health Organization survey, heart disease causes 15 million deaths every year, and treated heart-disease prediction as a big data problem. The Hadoop MapReduce platform was used to handle the large data volume, K-means [36] for clustering, and a decision tree algorithm for classification. Smith et al. [4] considered monitoring bio-signals such as heart rate (HR) with interactive voice technologies in order to infer the psychophysiological state of the user; detecting HR from the voice is attractive because no additional sensors are required. They used the SRI BioFrustration Corpus to predict HR from speech, taking continuous spontaneous speech as input, unlike earlier studies. The results obtained with random forests showed noteworthy effects on HR prediction. Karayilan et al. [5] noted that many people currently suffer from this deadly disease, so its detection and prevention are major requirements. Detection at an early stage is essential so that preventive measures can be taken, but diagnosis requires proper care, monitoring, and human intervention, which makes it complicated; accurate detection and prevention is therefore a central concern of the medical field [5]. The paper presents an improved machine learning approach: an artificial neural network trained with the backpropagation algorithm to detect heart disease. Pahwa et al. [6] presented a method that uses random forests and Naïve Bayes to
discover hidden patterns in large amounts of data. The evaluations indicate that the method's performance can still be improved; it already shows high accuracy and shorter computation time compared to other algorithms. Chen et al. [7] presented an improved convolutional neural network for classifying both structured and unstructured data. Compared with existing algorithms, the proposed one showed higher prediction accuracy and a higher convergence rate. Babu et al. [8] presented a method by which important information can be extracted easily. In medical data mining the dataset itself is crucial: data is collected in a standardized form so that it can serve as patterns for clinical diagnosis. K-means [35] and decision tree classification are used in the subsequent steps and have shown real improvement in accuracy. Jabbar et al. [17] presented a novel approach for classification and prediction, the Hidden Naïve Bayes (HNB). Their experiments indicate that HNB provides better accuracy than competing methods. Princy et al. noted that data mining is used in almost every field, including medicine [11]. The KNN and ID3 algorithms were used for disease detection; the experiments show that the number of attributes is reduced and the prediction accuracy improves significantly. Rajathi et al. proposed an approach combining two techniques, Ant Colony Optimization (ACO) and k-Nearest Neighbor (kNN) [12]. The technique has two phases: in the first, kNN classifies the test data; ACO is then used to initialize the population and search for the optimized solution. Experiments showed a better error rate and improved accuracy. Rajalakshmi et al. proposed a method for fast-growing data: vast amounts of data are generated daily in this field, and handling them is not easy [15]. The K-means algorithm was used to analyze the various existing methods; the proposed method reduces human effort and cost. Gandhi et al. presented an approach for handling raw information of large quantity but poor knowledge content. Healthcare currently lacks successful analysis methods by which connections and patterns can be identified easily [16]; data mining techniques serve as a remedy in these circumstances and are
widely used in the medical field. Several classification algorithms were surveyed: Naïve Bayes, neural networks, and decision trees. Chakraborty et al. presented clustering as a powerful tool for forecasting. Their incremental K-means clustering methodology makes it easy to forecast weather conditions; the main objective of the paper is the analysis of air pollution, for which a West Bengal dataset was used. They applied K-means clustering to the air-pollution dataset, built a weather-category list from the clusters' peak mean values, and assigned a weather category to each cluster. Incremental K-means was then used to place new data into the existing clusters, and the approach can also be used to predict future weather. The experiments on the West Bengal dataset show a reduction in the consequences of air pollution [18]; with such modeled computations, forecasting and predicting weather events becomes easier. Sundar et al. [19] used real and artificial datasets with K-means clustering to check its accuracy in predicting and diagnosing heart disease. In cluster analysis the data is partitioned into k clusters, each cluster holding the observations nearest to its own mean. The first step is random initialization of the whole dataset, after which each observation is assigned to a cluster; the technique divides the data into the k groups while minimizing the sum of squared distances [19]. To perform this task, the Euclidean distance between each data point and the cluster centroid is used. Their evaluation of the proposed clustering integration scheme showed that the results prove the accuracy and robustness of the method. Kaur et al. presented the use of clustering to divide data into groups of similar objects: similar objects are placed in the same group, while dissimilar objects fall into other groups. The K-means algorithm is widely used in almost every field for clustering, but the process is expensive, and the quality of its final result is determined by the initial centroid selection [20]. The main objective is to overcome these drawbacks and make the algorithm more effective and efficient. Sivagowry et al. present an approach originally used for text mining [21]. It is shown that in classification, better performance is sometimes given by decision trees and sometimes by neural networks and
Naïve Bayes; every technique has its own advantages and disadvantages. Fuzzy logic is used for attribute reduction. Bellaachia et al. presented data mining techniques that can help analyze data on breast cancer patients in order to predict their survivability rate [24]. The paper studies the SEER public datasets, which contain huge numbers of records across many fields, and investigates three data mining methods. The proposed algorithm is compared with an existing approach, and the results show better performance. Qasem et al. [25] observed that analyzing past investigations can help in predicting future data. The paper offers stock market investors better timing for buying and selling stocks based on previously available data, using a decision tree classifier, known to be one of the best approaches for this purpose. The results outperform those achieved by existing methods. Oyelade et al. studied how a student's learning ability can be determined from the student's performance [26]. A clustering algorithm is used to analyze the students' results: scores are arranged by level of performance with the help of a standard statistical algorithm, and the student's performance is analyzed by combining this model with a deterministic model.
5.3 Basic Data Mining Clustering Techniques
a. Partition-based clustering: In this technique, objects of high similarity are placed inside the same cluster and objects of high dissimilarity in different clusters, as shown in Figure 5.6. These are also called distance-based methods. Partitioning clustering divides the objects into smaller parts and uses an iterative relocation technique that moves objects from one group to another to improve the partition. Such methods make it easy to find spherical-shaped clusters in medium-sized datasets.
b. Hierarchical clustering: This is a type of unsupervised learning that builds a hierarchy of clusters, as shown in Figure 5.7. Cluster formation can proceed in two ways, agglomerative (bottom-up) or divisive (top-down), explained below:
Figure 5.6 Partition clustering [18].
Figure 5.7 Hierarchical clustering [19].
• Agglomerative algorithms: All objects start as individual clusters, which are gradually merged on the basis of the distance measured between them. The process stops when all objects form a single group, or earlier if the user so demands.
• Divisive algorithms: These work in the reverse fashion. The process starts with all objects in one group; the group is then repeatedly split into smaller groups until every object falls into its own cluster, or into the cluster desired. At each step, the divisive approach divides the data objects into disjoint groups.
c. Density-based clustering: This is a form of unsupervised learning in which data is separated on the basis of density: high-density regions are clustered together, while points in low-density regions fall into different clusters or are treated as noise. DBSCAN is the most popular algorithm in this category.
DBSCAN Algorithm Technique
The density-based clustering algorithm operates on density parameters: dense regions form clusters distinct from the thin, sparse regions, and a threshold is used to identify a cluster in a given region [29]. DBSCAN stands for density-based spatial clustering of applications with noise. Two parameters, Eps (a radius) and MinPts (a minimum number of points, i.e., a threshold), measure the density around a specific point; runnable sketches of DBSCAN and of K-means are given at the end of this section.
d. Grid-based clustering [13]: Here the whole space is divided into a grid of cells, and cells are gradually merged to form clusters, starting from small clusters and forming bigger ones on each iteration. The main objective of these algorithms is to quantize the dataset into a number of cells and then work with the objects belonging to each cell. Points are never relocated; pre-defined parameters decide the partitioning, and the membership of points induces a partitioning of the data that follows from the partitioning of the space.
Figure 5.8 Obstacle in Constraint-Based Clustering [42].
e. Model-based clustering [22]: This method searches for the model parameters that are optimal and best suited to the data. The model is refined in order to recognize the partitioning, which can be partitional or hierarchical; the number of clusters is fixed initially to start the process.
f. Constraint-based clustering [23]: Various types of constraints, application-oriented or user-defined, are used in the clustering process; their values come from preprocessing parameters or external cluster parameters. Each cluster carries its own classification with limitations for the individual clusters, as shown in Figure 5.8. New methods are needed by which all such constraints can be handled easily.
K-Means Clustering
K-means is a type of unsupervised learning. The given dataset is divided into K clusters, where K is a positive integer: cluster means are assigned randomly at first and refined on each iteration [40, 41]. The grouping is completed by minimizing the sum of squared distances between the data points and the cluster centroids.
Algorithm (runnable sketches of DBSCAN and of this procedure follow below):
1. Initialize the cluster centers.
2. Assign each data point to the cluster whose center is nearest.
3. Recompute each cluster mean on every iteration.
4. Repeat steps 2–3 until all data is classified and the cluster centers no longer change [37].
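As an illustration of the DBSCAN parameters described above, the following sketch runs scikit-learn's implementation on synthetic data; the data and the eps (Eps) and min_samples (MinPts) values are our own illustrative choices, not from the chapter:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense = rng.normal(0, 0.3, (60, 2))    # a high-density region
sparse = rng.uniform(-4, 4, (10, 2))   # scattered low-density points
X = np.vstack([dense, sparse])

# eps plays the role of Eps (radius); min_samples the role of MinPts (threshold).
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}))   # label -1 marks noise points
print("noise points:", list(labels).count(-1))

Likewise, here is a minimal from-scratch sketch of the four K-means steps above, on invented toy data (in practice a library version such as sklearn.cluster.KMeans would normally be preferred):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # step 1: initialize centers
    for _ in range(iters):
        # step 2: assign each point to the nearest center (squared Euclidean distance)
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # step 3: recompute each center as the mean of its assigned points
        new = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):                   # step 4: stop when stable
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])  # two toy blobs
labels, centers = kmeans(X, k=2)
print(centers.round(2))   # one center near (0, 0), the other near (6, 6)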
5.3.1 Classifiers and Their Algorithms in Data Mining
a. Decision Trees: A decision tree is a type of supervised learning method. A decision tree is formed with the help of test
90% accuracy and