Explainable Machine Learning in Medicine 9783031448768, 9783031448775

This book covers a variety of advanced communications technologies that can be used to analyze medical data and can be u


English Pages [92] Year 2024

Table of contents :
Preface
Contents
List of Figures
List of Tables
1 Introduction
1.1 Medical Data Types
1.2 Explainable Artificial Intelligence (XAI)
1.3 The Roles of Explainable AI (XAI) in Medical Data Analysis
1.4 Fundamentals of Machine Learning
1.5 Supervised Learning
1.6 Unsupervised Learning
1.7 Semi-supervised Learning
1.8 Reinforcement Learning
1.9 Model Selection and Evaluation
1.10 Feature Engineering
1.11 Model Training and Optimization
1.12 Model Deployment and Monitoring
2 Medical Tabular Data
2.1 Data Types and Applications
2.2 Explainable Methods for Tabular Data
2.2.1 Decision Trees
2.2.2 Data Intensive Computing for Extracting (DICE)
2.2.3 Additive Linear Explanations (ALE)
2.3 Decision Trees Used for Covid19 Symptoms Influence
2.4 Heart Attack Data Intensive Computing for Extracting
2.5 Diabetes Prediction Explanation with ALE Method
3 Natural Language Processing for Medical Data Analysis
3.1 Data Types and Applications
3.2 Explainable Methods for Text Data
3.2.1 Example Driven
3.2.2 Long Short-Term Memory Network (LSTM) Explained with Gating Functions
3.2.3 Bidirectional Encoder Representation from Transformers Explained Using the Attention Mechanism
3.3 Text Generation Explained
3.3.1 Example Driven
3.3.2 BERT
4 Computer Vision for Medical Data Analysis
4.1 Data Types and Applications
4.2 Explainable Methods for Images
4.2.1 Gradient Class Activation Map (GradCAM)
4.2.2 Local Interpretable Model Agnostic Explanations (LIME)
4.2.3 Shapley Additive Explanations (SHAP)
4.3 Skin Moles Classification Explained
4.3.1 LIME Explanation
4.3.2 SHAP Explanation
4.3.3 GradCAM Explanation
5 Time Series Data Used for Diseases Recognition and Anomaly Detection
5.1 Data Types and Applications
5.2 Explainable Methods for Time Series
5.2.1 Symbolic Aggregate Approximation (SAX)
5.2.2 Shapelets
5.2.3 Learning from Aggregate Xtreme Categories with Automated Transformation (LAXCAT)
5.3 Heart Diseases Recognition
5.3.1 Database
5.3.2 Model
5.3.3 Explanation with Shapelets
5.3.4 SAX Explanation
6 Summary

Synthesis Lectures on Engineering, Science, and Technology

Karol Przystalski · Rohit M. Thanki

Explainable Machine Learning in Medicine

Synthesis Lectures on Engineering, Science, and Technology

The focus of this series is general topics, and applications about, and for, engineers and scientists on a wide array of applications, methods and advances. Most titles cover subjects such as professional development, education, and study skills, as well as basic introductory undergraduate material and other topics appropriate for a broader and less technical audience.

Karol Przystalski, Department of Information Technologies, Jagiellonian University, Kraków, Poland

Rohit M. Thanki, Krian Software GmbH, Wolfsburg, Niedersachsen, Germany

ISSN 2690-0300
ISSN 2690-0327 (electronic)
Synthesis Lectures on Engineering, Science, and Technology
ISBN 978-3-031-44876-8
ISBN 978-3-031-44877-5 (eBook)
https://doi.org/10.1007/978-3-031-44877-5

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.

Preface

Machine learning has revolutionized various industries, including medicine, by providing powerful tools for analyzing vast amounts of data and making predictions. In the field of healthcare, the integration of machine learning algorithms has tremendous potential for enhancing diagnostic accuracy, treatment recommendations, and patient outcomes. However, as machine learning models become increasingly complex, concerns regarding their interpretability and transparency have arisen. The need for explainable machine learning in medicine has become crucial for building trust, ensuring ethical decision-making, and enabling effective collaboration between healthcare professionals and intelligent systems.

“Explainable Machine Learning in Medicine” is a comprehensive exploration of the emerging field of explainable artificial intelligence (XAI) in healthcare. This book aims to bridge the gap between machine learning techniques and the need for understandable and interpretable models in the medical domain. Authored by experts in both machine learning and healthcare, this book delves into the challenges, methods, and applications of explainable machine learning in medicine.

In this book, we cover the fundamental concepts of machine learning and its applications in healthcare. We highlight the importance of transparency, interpretability, and explainability in medical AI systems, emphasizing the need to demystify the black-box nature of machine learning models. We explore various techniques and methods for explainable machine learning, such as rule-based models, surrogate models, and model-agnostic approaches. Through case studies and examples, we demonstrate how explainable AI has been successfully applied in clinical decision support, disease diagnosis, treatment recommendation systems, and personalized medicine.
We also address the unique challenges and considerations that arise when applying machine learning in the medical field, such as data quality, bias, privacy concerns, and regulatory compliance. With practical guidelines and best practices, we aim to provide healthcare practitioners, researchers, and policymakers with tools and knowledge to navigate the integration of machine learning models in medicine responsibly and ethically.

This book is not only a resource for understanding the concepts and techniques of explainable machine learning but also a call to action. We believe that the responsible deployment of machine learning in healthcare requires transparency, interpretability, and the involvement of medical experts. By demystifying machine learning models and providing insights into their decision-making processes, trust, acceptance, and collaboration between intelligent systems and healthcare professionals can be fostered.

We hope that “Explainable Machine Learning in Medicine” will serve as a valuable guide for unlocking the potential of machine learning in healthcare while ensuring that the decisions made by these intelligent systems are explainable, understandable, and aligned with the best interests of patients. Together, we can shape the future of medicine by combining the power of machine learning with the wisdom and expertise of medical professionals to provide safer, more effective, and more transparent healthcare solutions.

Kraków, Poland: Dr. Karol Przystalski
Wolfsburg, Germany: Dr. Rohit M. Thanki
May 2023

List of Figures

Fig. 2.1 An example of a decision tree for melanoma recognition using the ABCD scoring method
Fig. 2.2 Decision tree built using the scikit-learn implementation on the Covid19 dataset
Fig. 2.3 Covid19 feature importance plots
Fig. 2.4 ALE pregnancy and glucose tolerance features influence
Fig. 2.5 Glucose tolerance and body mass index influences on the prediction explained using the ALE method
Fig. 3.1 Tokens relationship on the model level
Fig. 3.2 Tokens relationship by attention heads
Fig. 3.3 Embedding projector with Uniform Manifold Approximation and Projection method used
Fig. 4.1 LIME explanation results. The first image shows the original image, the second the mask generated by the LIME method, and the last the mask combined with the original image
Fig. 4.2 SHAP values for a random HAM10k skin mole image
Fig. 4.3 Heatmap merged with the original skin mole image
Fig. 5.1 ECG recordings of the healthy and sick patients
Fig. 5.2 Shapelets of the heart disease data set plot

List of Tables

Table 2.1 Heart disease patients features
Table 2.2 Heart disease patients counterfactuals

1 Introduction

1.1 Medical Data Types

Medical data refers to any type of information related to a patient’s health status, medical history, treatments, and outcomes. This data can be collected from various sources, such as electronic health records (EHRs), medical imaging, genomic testing, clinical trials, and public health surveillance. Here are some examples of medical data and their applications:

• Electronic Health Records (EHRs): EHRs contain a patient’s medical history, diagnoses, medications, and laboratory results, among other information. EHRs can be used by healthcare providers to manage patient care, improve patient safety, and reduce medical errors. For example, a doctor can use a patient’s EHR to check for any allergies or drug interactions before prescribing a new medication.
• Medical Imaging Data: Medical imaging data includes X-rays, CT scans, MRIs, and other types of images used to diagnose and monitor diseases and injuries. Medical imaging data can be used to detect and monitor the progression of diseases, such as cancer or heart disease. For example, a radiologist can use medical imaging data to identify the location and size of a tumor and monitor its response to treatment.
• Genomic Data: Genomic data includes information about a person’s genetic makeup, such as DNA sequence, gene expression, and epigenetic modifications. Genomic data can be used to personalize medical treatments and develop new therapies for genetic disorders. For example, genomic testing can help identify the specific mutations that cause certain types of cancer and guide the development of targeted therapies.
• Clinical Trial Data: Clinical trial data includes information from randomized controlled trials of new drugs or medical interventions. Clinical trial data can be used to evaluate the safety and efficacy of new treatments and to inform clinical practice guidelines. For example, clinical trial data can be used to determine the optimal dose and duration of a new medication, or to identify potential side effects.
• Public Health Data: Public health data includes information about the health of populations, such as disease incidence and prevalence, mortality rates, and risk factors. Public health data can be used to monitor disease outbreaks, identify emerging health threats, and inform public health policies. For example, public health data can be used to track the spread of infectious diseases, such as COVID-19, and inform public health interventions to contain the spread.
• Wearable and Mobile Health Data: Wearable and mobile health data includes information collected from devices such as fitness trackers, smartwatches, and mobile apps. Wearable and mobile health data can be used to monitor and manage chronic diseases, such as diabetes or hypertension, and promote healthy behaviors. For example, a patient with diabetes can use a wearable device to monitor their blood glucose levels and adjust their insulin dosage accordingly.

Overall, medical data has a wide range of applications in healthcare, from improving patient care and outcomes, to advancing medical research and informing public health policy. The analysis and integration of multiple types of medical data can provide a more comprehensive understanding of health and disease, leading to more effective and personalized medical treatments.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Przystalski and R. M. Thanki, Explainable Machine Learning in Medicine, Synthesis Lectures on Engineering, Science, and Technology, https://doi.org/10.1007/978-3-031-44877-5_1

1.2 Explainable Artificial Intelligence (XAI)

Explainable Artificial Intelligence (XAI) refers to the development and deployment of machine learning and artificial intelligence (AI) systems that can provide understandable and transparent explanations for their decisions or outputs. The goal of XAI is to bridge the gap between the complex, black-box nature of many AI algorithms and the need for human comprehensibility, interpretability, and trust.

Traditional AI algorithms, such as deep neural networks, often operate as opaque models, making it challenging to understand how they arrive at their decisions. This lack of transparency can hinder the adoption and acceptance of AI systems, especially in critical domains like healthcare, finance, and autonomous systems, where explainability is crucial.

Explainable AI seeks to address these concerns by providing insights into the decision-making process of AI models. It allows users, including domain experts, regulators, and end-users, to understand why a particular decision was made and how the AI system arrived at its conclusion. XAI provides a human-understandable rationale behind the predictions, recommendations, or actions of AI systems, promoting transparency, accountability, and trust.

There are different approaches and techniques within XAI to enhance interpretability and explainability. Some common methods include:

• Rule-Based Models: These models use explicit rules to make decisions based on predefined conditions or logical statements, enabling straightforward interpretation and justification of decisions.
• Surrogate Models: Surrogate models are simpler, more interpretable models that approximate the behavior of complex black-box models. They provide insights into how the original model makes predictions without sacrificing too much accuracy.
• Model-Agnostic Approaches: Model-agnostic methods focus on interpreting AI models without relying on their internal structures. Techniques like feature importance analysis, partial dependence plots, and SHAP (Shapley Additive Explanations) values provide insights into the contribution of different features or variables to the model’s predictions.
• Visualizations: Visualizations help present complex AI models and their decision-making processes in a more understandable and intuitive manner. They can include diagrams, heatmaps, or interactive interfaces that depict the model’s internal workings or illustrate the impact of input features on the output.

Explainable AI is not only about post hoc explanations but also about designing AI models and algorithms that inherently prioritize interpretability and transparency. This involves developing models that incorporate human-understandable features, using interpretable algorithms and architectures, and enforcing constraints on the decision-making process.

The benefits of XAI extend beyond addressing interpretability concerns. Explainable AI can assist in identifying biases in data or algorithms, enhancing human-AI collaboration, enabling regulatory compliance, facilitating debugging and error analysis, and fostering trust among users and stakeholders. As AI continues to advance and becomes more integrated into critical decision-making processes, the development and deployment of explainable AI systems become increasingly important.

By making AI more transparent and interpretable, XAI aims to unlock the full potential of AI while ensuring that its decisions align with human values, ethical principles, and regulatory requirements.
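To make the model-agnostic idea above concrete, the following is a minimal sketch of permutation feature importance, one common form of feature importance analysis. It is an illustration under stated assumptions and is not taken from the book: the `black_box` function, its weights, and the feature names (glucose, BMI, a noise column) are hypothetical, and the data is synthetic.

```python
import random

def black_box(row):
    """Hypothetical opaque model: only its predictions are used below."""
    glucose, bmi, _noise = row
    return 1 if 0.03 * glucose + 0.05 * bmi > 5.0 else 0

def accuracy(model, rows, labels):
    """Fraction of rows for which the model reproduces the label."""
    hits = sum(model(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return drops

# Synthetic data: (glucose, bmi, noise); labels come from the model itself,
# so the baseline accuracy is 1.0 by construction.
data_rng = random.Random(42)
rows = [(data_rng.uniform(70, 200), data_rng.uniform(18, 40), data_rng.uniform(0, 1))
        for _ in range(200)]
labels = [black_box(r) for r in rows]

importances = permutation_importance(black_box, rows, labels, n_features=3)
print(importances)
```

Because the procedure only calls the model and never inspects its internals, it applies equally to any classifier. The noise column, which the hypothetical model ignores, receives an importance of zero.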

1.3 The Roles of Explainable AI (XAI) in Medical Data Analysis

The role of explainable AI (XAI) in medical data analysis is to provide healthcare professionals with a clear understanding of how an AI model arrived at a particular diagnosis or treatment recommendation. XAI methods help to promote transparency, accountability, and trust in AI models by providing a clear explanation of how these models make decisions. Some of the key roles of XAI in medical data analysis include:

• Transparency: XAI methods help to make the decision-making process of an AI model transparent and understandable. This can help to build trust in AI systems and ensure that healthcare professionals and patients have confidence in the decisions made by these models.

• Trust: XAI methods provide a clear and understandable explanation of how AI models make decisions. This transparency can help build trust in AI systems and ensure that healthcare professionals and patients have confidence in the decisions made by these models.
• Accountability: XAI methods make it possible to trace back the reasoning behind AI model decisions, making it easier to identify and address errors or biases in the model.
• Improved Patient Outcomes: XAI methods can help healthcare professionals better understand the reasoning behind an AI model’s decisions, leading to more accurate diagnoses and treatment recommendations. This can ultimately lead to improved patient outcomes.
• Error Detection and Correction: XAI methods can help identify errors and biases in AI models. This can help to ensure that these models are accurate and provide reliable diagnoses and treatment recommendations.
• Model Improvement: XAI methods can be used to identify areas where an AI model may need improvement. For example, if an AI model is consistently making incorrect diagnoses, XAI methods can help to identify the features that the model is focusing on and suggest modifications to improve its accuracy.
• Regulatory Compliance: Many countries have regulatory requirements for the transparency and interpretability of medical AI models. XAI methods can help to ensure compliance with these regulations.
• Education: XAI methods can help educate healthcare professionals on the capabilities and limitations of AI models. This can help promote wider adoption of AI in healthcare and encourage collaboration between healthcare professionals and data scientists.

1.4 Fundamentals of Machine Learning

The fundamentals of machine learning encompass the core concepts, techniques, and processes involved in building and training models that can learn from data and make predictions or take actions without being explicitly programmed. Here are the key components of machine learning:

• Data: Machine learning algorithms rely on data as their primary source of information. The data consists of input features (also known as variables or attributes) and corresponding output labels or target values. High-quality and representative data is essential for effective model training and evaluation.
• Supervised Learning: In supervised learning, the algorithm learns from labeled data, where the input features are associated with known output labels. The goal is to learn a mapping function that can predict the correct labels for new, unseen data. Examples of supervised learning algorithms include linear regression, decision trees, random forests, and support vector machines.
• Unsupervised Learning: Unsupervised learning involves learning patterns or structures in the data without explicit output labels. The algorithm discovers hidden relationships or clusters within the data. Clustering algorithms, such as k-means and hierarchical clustering, and dimensionality reduction techniques like principal component analysis (PCA) are common examples of unsupervised learning.
• Semi-supervised Learning: This type of learning falls between supervised and unsupervised learning. It utilizes a combination of labeled and unlabeled data to improve model performance. Semi-supervised learning is often employed when labeled data is limited or expensive to obtain.
• Reinforcement Learning: Reinforcement learning focuses on training agents to make decisions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, receiving feedback in the form of rewards or penalties for its actions. Reinforcement learning is applied in scenarios such as game playing, robotics, and autonomous systems.
• Model Selection and Evaluation: Machine learning involves selecting an appropriate model or algorithm that suits the problem at hand. Model selection depends on factors such as the type and size of the dataset, the desired prediction task, and the available computational resources. Evaluation metrics, such as accuracy, precision, recall, and F1 score, are used to assess the performance of the models and compare their effectiveness.
• Feature Engineering: Feature engineering involves selecting, transforming, and creating relevant features from the raw data to improve the performance of machine learning models. This process requires domain expertise and an understanding of the data characteristics and the problem at hand. Feature selection techniques, dimensionality reduction, and data normalization are common feature engineering practices.
• Model Training and Optimization: Training a machine learning model involves feeding the algorithm with labeled data and iteratively adjusting the model’s parameters to minimize the prediction error. Optimization algorithms, such as gradient descent, are employed to find the optimal parameter values. Overfitting (model overly complex) and underfitting (model too simplistic) are common challenges during training, requiring techniques such as regularization and cross-validation for better generalization.
• Model Deployment and Monitoring: Once a model is trained and evaluated, it can be deployed to make predictions on new, unseen data. Deployment may involve integrating the model into an application or system for real-time predictions. It is crucial to continuously monitor the model’s performance, detect any drift or degradation, and update or retrain the model as needed to ensure its accuracy and reliability.

Understanding these fundamentals of machine learning provides a solid foundation for building and applying various machine learning models and algorithms in different domains and problem scenarios.
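The evaluation metrics named above (accuracy, precision, recall, and F1 score) can be computed directly from the counts of true and false positives and negatives. The sketch below is illustrative only: the function name `classification_metrics` and the toy labels are assumptions made for this example, not material from the book.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example: 1 = disease present, 0 = disease absent (hypothetical labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are three true positives, one false positive, one false negative, and five true negatives. In a medical setting recall is often watched closely, since a false negative corresponds to a missed diagnosis.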

1.5 Supervised Learning

Supervised learning is a type of machine learning where an algorithm learns from labeled data to make predictions or decisions. In this approach, the training data consists of input features (also known as independent variables) and corresponding output labels (also known as dependent variables or targets). The goal is to learn a mapping function that can accurately predict the output labels for new, unseen data. Here are the key components and steps involved in supervised learning:

• Labeled Training Data: The supervised learning process begins with a dataset that contains examples of input features along with their corresponding output labels. This data is labeled because the correct answers are provided for each input instance. The quality and representativeness of the training data are crucial for building an effective model.
• Input Features: The input features, also referred to as independent variables, are the measurable characteristics or attributes of the data. They can be numerical, categorical, or even text based. Examples of input features in various domains include age, temperature, gender, text descriptions, or pixel values in an image.
• Output Labels: The output labels, also known as dependent variables or targets, represent the desired prediction or decision the model should make. The labels can be categorical (e.g., class labels like “spam” or “not spam”) or continuous (e.g., predicting house prices). The type of output labels determines the specific supervised learning problem, such as classification or regression.
• Model Selection: The next step is to select an appropriate model or algorithm that suits the specific problem and data characteristics. Different supervised learning algorithms have varying assumptions, strengths, and weaknesses. Some common algorithms include linear regression, logistic regression, decision trees, random forests, support vector machines, and neural networks.
• Model Training: Once the model is selected, the training phase begins. During training, the algorithm learns from the labeled training data by adjusting its internal parameters or weights. The goal is to minimize the difference between the predicted outputs and the true labels. This process involves an optimization algorithm, such as gradient descent, which iteratively updates the model’s parameters.
• Model Evaluation: After training, the model’s performance is evaluated using evaluation metrics that are appropriate for the specific problem. For classification tasks, metrics like accuracy, precision, recall, and F1 score can be used. Regression tasks may employ metrics such as mean squared error (MSE) or mean absolute error (MAE). Evaluation helps assess how well the model generalizes to new, unseen data and provides insights into its strengths and weaknesses.
• Prediction and Inference: Once the model is trained and evaluated, it can be used to make predictions or decisions on new, unseen data. The model takes the input features and produces the predicted output labels. This inference step allows the model to provide valuable insights, make predictions, or assist in decision-making for real-world applications.

Supervised learning is widely used in various domains, including healthcare, finance, natural language processing, computer vision, and many others. By learning from labeled data, supervised learning algorithms can effectively solve classification, regression, and other predictive tasks.
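The steps listed in this section can be sketched in a few lines of plain Python. This is a minimal illustration and not a method taken from this book: the nearest-centroid classifier, the toy features (age and resting heart rate), the risk labels, and all function names are assumptions chosen only to show training, inference, and evaluation on held-out data.

```python
# Minimal supervised-learning sketch (illustrative toy data, not from the book).
from math import dist


def fit_centroids(features, labels):
    """Training: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Inference: return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: dist(centroids[label], x))


def accuracy(y_true, y_pred):
    """Evaluation: fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Labeled training data: (age, resting heart rate) -> hypothetical risk label.
X_train = [[25, 62], [31, 65], [28, 60], [67, 88], [72, 91], [64, 85]]
y_train = ["low", "low", "low", "high", "high", "high"]

# Held-out data that the model has not seen during training.
X_test = [[30, 63], [70, 90], [26, 66], [66, 87]]
y_test = ["low", "high", "low", "high"]

centroids = fit_centroids(X_train, y_train)
y_pred = [predict(centroids, x) for x in X_test]
test_accuracy = accuracy(y_test, y_pred)
```

On real data, the features would normally be scaled first, since a distance-based model is otherwise dominated by the feature with the largest numeric range.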

1.6

Unsupervised Learning

Unsupervised learning is a type of machine learning where an algorithm learns patterns, structures, or relationships in data without explicit labels or target values. Unlike supervised learning, unsupervised learning does not have predefined output labels to guide the learning process. Instead, it focuses on finding inherent patterns or structures within the data itself. Here are the key components and approaches in unsupervised learning:

• Unlabeled Data: In unsupervised learning, the algorithm is provided with a dataset that consists of input features but does not include corresponding output labels. This data is unlabeled, meaning there is no predefined information about the desired output or target values.
• Clustering: Clustering is a common technique used in unsupervised learning. It aims to group similar instances together based on the inherent patterns or similarities in the input features. The algorithm automatically identifies clusters in the data, allowing for the discovery of hidden structures or segments.
  – K-means Clustering: This algorithm partitions the data into a predetermined number of clusters, where each data point is assigned to the nearest centroid (representative point) of a cluster.
  – Hierarchical Clustering: This algorithm creates a hierarchy of clusters. In its agglomerative form, each data point starts in its own cluster and is successively merged with other clusters based on similarity measures.
• Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the number of input features while preserving the most important information in the data. This can be beneficial for visualizing high-dimensional data or reducing computational complexity.
  – Principal Component Analysis (PCA): PCA identifies the directions (principal components) in which the data varies the most and projects the data onto these components, effectively reducing the dimensionality.
  – t-SNE: t-SNE (t-distributed stochastic neighbor embedding) is a technique commonly used for visualizing high-dimensional data by mapping them into a lower-dimensional space while preserving local similarities.
• Anomaly Detection: Anomaly detection is concerned with identifying instances that deviate significantly from the normal patterns in the data. Unsupervised learning algorithms can learn the normal patterns in the data and flag instances that are unusual or anomalous.
  – Density-Based Approaches: Algorithms such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identify dense regions of data and consider instances outside those regions as anomalies.
  – Outlier Detection: Various statistical techniques and algorithms can be used to detect outliers or anomalies based on deviation from a normal distribution or other statistical measures.
• Association Rules: Association rule learning focuses on discovering interesting relationships or associations among items or variables in the data. This is often applied in market basket analysis, where the goal is to identify items frequently purchased together.
  – Apriori Algorithm: This algorithm is commonly used to discover frequent item sets by iteratively generating candidate item sets and pruning those that do not meet minimum support thresholds.

Unsupervised learning has various applications, including customer segmentation, recommendation systems, anomaly detection, data preprocessing, and exploratory data analysis. By uncovering hidden patterns or structures in the data, unsupervised learning provides valuable insights and aids decision-making processes.
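As an illustration of the clustering idea described above, the following is a minimal k-means sketch in plain Python. It is not code from this book: the blood pressure readings are invented, and initializing the centroids with the first k points is an assumption made only to keep the example deterministic (practical implementations usually use random or k-means++ initialization).

```python
# Minimal k-means sketch in plain Python (toy data; not the book's code).
from math import dist


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: initialise centroids with the first k points for determinism.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Two visibly separated groups of (systolic, diastolic) readings, unlabeled.
readings = [[118, 76], [165, 102], [121, 79], [115, 74], [160, 98], [170, 105]]
centroids, assignments = kmeans(readings, k=2)
```

No labels are passed to `kmeans`; the grouping comes only from the distances between the readings, which is the defining property of unsupervised learning.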

1.7

Semi-supervised Learning

Semi-supervised learning is a type of machine learning that combines labeled and unlabeled data to improve the performance of models. In this approach, a limited amount of labeled data is available alongside a larger amount of unlabeled data. By leveraging the unlabeled data, semi-supervised learning algorithms aim to learn more accurate and robust models. Here are the key components and approaches in semi-supervised learning:

• Labeled Data: Similar to supervised learning, semi-supervised learning begins with a dataset that contains examples of input features with their corresponding output labels. The labeled data is typically limited and can be costly to obtain since it requires human experts to annotate or label the data.
• Unlabeled Data: In semi-supervised learning, a significant portion of the data consists of unlabeled instances. These instances have input features but lack the corresponding output labels. Unlabeled data is often easier to obtain in large quantities, as it does not require manual labeling.
• Co-training: Co-training is a popular approach in semi-supervised learning. It involves training multiple models or classifiers on different views of the available labeled data (for example, different subsets of the features) and then using the unlabeled data to improve their performance. Each model labels the unlabeled instances it is most confident about, and these instances are added to the training data of the other models, so the models exchange information and learn to make more accurate predictions.
• Self-training: Self-training is another common technique in semi-supervised learning. It starts by training a model on the labeled data. This trained model is then used to predict labels for the unlabeled data. The most confident predictions are added to the labeled data, and the model is retrained on this expanded labeled dataset. This process iterates until convergence or a predefined stopping criterion.
• Generative Models: Generative models, such as generative adversarial networks (GANs) or autoencoders, can be employed in semi-supervised learning. These models learn the underlying distribution of the data and generate realistic samples. The generated data can be combined with the labeled data to enhance the model’s training.
• Manifold Regularization: Manifold regularization is a regularization technique used in semi-supervised learning to encourage smoothness and consistency of predictions on neighboring data points. It assumes that nearby points in the input space have similar output labels, helping improve generalization.

Semi-supervised learning is particularly useful when labeled data is limited or expensive to obtain, but there is an abundance of unlabeled data. It can provide a cost-effective way to improve model performance by leveraging the additional unlabeled information. Applications of semi-supervised learning include text classification, sentiment analysis, speech recognition, and image recognition, among others.
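The self-training loop described above can be sketched as follows. This is an illustrative example under stated assumptions, not the book's implementation: the one-dimensional measurements, the two class names, the class-mean model, and the confidence threshold of 2.0 are all hypothetical, and confidence is defined here simply as the margin between the distances to the two class means.

```python
# Self-training sketch in plain Python (toy 1-D data; thresholds are assumptions).

def class_means(values, labels):
    """Fit a tiny model: the mean feature value of each class."""
    means = {}
    for label in sorted(set(labels)):
        members = [v for v, l in zip(values, labels) if l == label]
        means[label] = sum(members) / len(members)
    return means


def predict_with_confidence(means, value):
    """Return (label, confidence) for a two-class model; the confidence is the
    margin between the distances to both class means (near 0 when ambiguous)."""
    ranked = sorted(means, key=lambda label: abs(value - means[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(value - means[runner_up]) - abs(value - means[best])
    return best, margin


def self_train(labeled, unlabeled, threshold=2.0, max_rounds=10):
    values = [v for v, _ in labeled]
    labels = [l for _, l in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        means = class_means(values, labels)
        scored = [(v, *predict_with_confidence(means, v)) for v in pool]
        confident = [(v, label) for v, label, conf in scored if conf >= threshold]
        if not confident:  # stopping criterion: no confident predictions left
            break
        # Add the pseudo-labeled points to the labeled set, then retrain.
        for v, label in confident:
            values.append(v)
            labels.append(label)
        pool = [v for v, _, conf in scored if conf < threshold]
    return class_means(values, labels), pool


# Two labeled measurements and several unlabeled ones (hypothetical values).
labeled = [(5.0, "normal"), (9.0, "elevated")]
unlabeled = [4.6, 5.2, 5.5, 8.4, 9.5, 10.1, 7.0]
final_means, still_unlabeled = self_train(labeled, unlabeled)
```

The value 7.0 lies between both class means, so it never reaches the confidence threshold and remains unlabeled; only confident predictions are allowed to influence the retrained model.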

1.8

Reinforcement Learning

Reinforcement learning is a branch of machine learning that focuses on training agents to make decisions and take actions in an environment to maximize cumulative rewards. It involves an agent interacting with an environment, learning through trial and error, and receiving feedback in the form of rewards or penalties for its actions. The goal of reinforcement learning is to find an optimal policy that guides the agent to make the best possible decisions in a given context. Here are the key components and concepts in reinforcement learning:

• Agent: The agent is the learning entity that interacts with the environment. It takes actions based on the current state of the environment and receives feedback in the form of rewards or penalties.
• Environment: The environment is the external system or context with which the agent interacts. It can be a simulated environment, a physical environment, or a computer program. The environment presents the agent with different states, and the agent’s actions influence the subsequent states.
• State: A state represents the current configuration or observation of the environment at a particular time. It captures the relevant information needed for the agent to make decisions. States can be fully observable or partially observable, depending on the information available to the agent.
• Action: An action is a decision or choice made by the agent based on the current state. Actions can have immediate consequences and can lead the agent to transition to a new state.
• Reward: The reward is the feedback signal that the agent receives from the environment after taking an action. It indicates the desirability or quality of the agent’s action in a given state. The agent’s objective is to maximize cumulative rewards over time.
• Policy: A policy is a strategy or set of rules that determines the agent’s action selection at each state. It maps states to actions and guides the agent’s decision-making process. The policy can be deterministic (always choosing the same action for a given state) or stochastic (selecting actions based on probabilities).
• Value Function: The value function estimates the expected cumulative rewards that an agent can achieve from a given state or state-action pair. It helps the agent evaluate the long-term consequences of its actions and make informed decisions. The value function can be estimated using different techniques, such as Q-learning or Monte Carlo methods.
• Exploration and Exploitation: Balancing exploration and exploitation is crucial in reinforcement learning. Exploration involves trying out different actions to gather more information about the environment, while exploitation focuses on choosing actions that maximize rewards based on current knowledge. Striking the right balance between exploration and exploitation is essential for the agent to learn effectively.
• Q-learning and Policy Gradient Methods: Q-learning is a widely used algorithm in reinforcement learning that learns an action-value function called the Q-function. It iteratively updates Q-values based on the observed rewards and allows the agent to make informed decisions. Policy gradient methods, on the other hand, directly learn the policy by optimizing it to maximize rewards using gradient-based optimization techniques.

Reinforcement learning has applications in various domains, including robotics, game playing, autonomous systems, recommendation systems, and control systems. By learning from interactions with the environment and optimizing actions based on rewards, reinforcement learning enables agents to learn complex behaviors and make adaptive decisions in dynamic and uncertain environments.
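To make the interplay of agent, environment, reward, and Q-function concrete, here is a small tabular Q-learning sketch in plain Python. Everything in it is an assumption made for illustration: the environment is a hypothetical corridor of five states in which only reaching the last state is rewarded, and the learning rate, discount factor, exploration rate, and number of episodes are arbitrary choices.

```python
# Tabular Q-learning sketch on a hypothetical 5-state corridor (plain Python).
import random

N_STATES = 5          # states 0..4; state 4 is the rewarding terminal state
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # assumed learning rate, discount, exploration


def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)   # exploration: random action
            else:                              # exploitation, ties broken randomly
                action = max(ACTIONS, key=lambda a: (q[state][a], rng.random()))
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward reward + discounted best next value.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy policy: the action with the highest Q-value in each non-terminal state.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right in every state, because the Q-values for moving right approach the discounted reward of reaching the final state.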

1.9

Model Selection and Evaluation

Model selection and evaluation are critical steps in machine learning, aiming to choose the most appropriate model for a given problem and assess its performance. These steps involve comparing different models, tuning their hyperparameters, and evaluating their predictive capabilities. Here’s an overview of the model selection and evaluation process:

• Splitting the Data: The available dataset is typically divided into three subsets: training set, validation set, and test set. The training set is used to train the models, the validation set helps in hyperparameter tuning and model selection, and the test set is used for final evaluation. Common splitting ratios are 60–70% for training, 15–20% for validation, and 15–20% for testing, although they can vary depending on the dataset size.
• Selecting Model Candidates: Based on the problem at hand, various models or algorithms are considered as potential candidates. These may include decision trees, random forests, support vector machines, neural networks, or other models suitable for the task (e.g., linear regression for regression problems or convolutional neural networks for image classification).
• Training and Cross-Validation: The model candidates are trained on the training set using the chosen algorithm. During training, the models learn from the input features and corresponding output labels. Cross-validation techniques, such as k-fold cross-validation, can be used to assess model performance and mitigate overfitting. Cross-validation involves dividing the training set into k subsets, training the model on k-1 subsets, and evaluating its performance on the remaining subset. This process is repeated k times, with each subset serving as the validation set once.
• Hyperparameter Tuning: Models often have hyperparameters that need to be set before training, such as learning rates, regularization strengths, or the number of hidden layers. Hyperparameter tuning involves searching for the optimal combination of hyperparameter values that yield the best performance. Techniques like grid search, random search, or more advanced methods like Bayesian optimization can be used for this purpose. The validation set is used to evaluate model performance with different hyperparameter settings and choose the best configuration.
• Performance Evaluation: After selecting the best-performing model based on the validation set, its performance is assessed on the independent test set. The test set provides an unbiased estimate of the model’s performance on unseen data. Evaluation metrics depend on the specific problem and can include accuracy, precision, recall, F1 score, mean squared error, or others suitable for the task. It is essential to choose metrics that align with the problem’s requirements and domain-specific considerations.
• Iterative Refinement: Model selection and evaluation are iterative processes. If the performance of the chosen model is not satisfactory, the steps of model selection, hyperparameter tuning, and evaluation can be repeated with different models or hyperparameter settings. This iterative refinement helps improve the model’s performance and achieve better results.

The process of model selection and evaluation heavily depends on the availability and quality of data, the problem complexity, and the computational resources. The goal is to choose a model that performs well on unseen data, exhibits good generalization, and aligns with the specific requirements of the problem domain.
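Two building blocks of this process, k-fold splitting and classification metrics, can be sketched in plain Python as shown below. The function names and the example labels are assumptions made for illustration; in practice the sample indices would usually be shuffled before splitting, and a library implementation would normally be preferred.

```python
# k-fold cross-validation and classification metrics in plain Python (illustrative).

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 score for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Each of the 10 samples is used for validation exactly once across the 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))

# Hypothetical true labels and predictions: 3 true positives, 1 false positive,
# and 1 false negative.
precision, recall, f1 = precision_recall_f1(
    y_true=[1, 1, 1, 0, 0, 0, 0, 1],
    y_pred=[1, 1, 0, 1, 0, 0, 0, 1],
)
```

In a medical setting, recall on the positive class is often weighted more heavily than accuracy, because a missed diagnosis (false negative) is usually more costly than a false alarm.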

1.10

Feature Engineering

Feature engineering is the process of selecting, transforming, and creating relevant features from raw data to improve the performance and effectiveness of machine learning models. It involves extracting meaningful information and representations from the data that can enhance the model’s ability to learn patterns and make accurate predictions. Feature engineering is a crucial step in the machine learning pipeline and requires domain knowledge, creativity, and an understanding of the data and the problem at hand. Here are some key techniques and considerations in feature engineering:

• Feature Selection: Feature selection involves identifying the most informative and relevant features from the available set. It helps reduce the dimensionality of the data, improve model interpretability, and prevent overfitting. Feature selection methods can be based on statistical measures (e.g., correlation, mutual information), model-based techniques (e.g., coefficients in linear regression), or recursive elimination strategies (e.g., recursive feature elimination).
• Feature Transformation: Feature transformation aims to transform the data into a more suitable representation that aligns with the assumptions of the machine learning algorithms or enhances its characteristics. Common transformations include scaling features to a common range (e.g., normalization, standardization), handling skewed distributions (e.g., log transformations), or applying mathematical functions (e.g., square root, exponentiation).
• Handling Missing Data: Missing data is a common challenge in real-world datasets. Feature engineering techniques can involve strategies to handle missing values, such as imputation (filling missing values with estimated values), flagging missingness indicators, or creating additional features to capture the absence of data.
• Encoding Categorical Variables: Categorical variables need to be appropriately encoded for machine learning models. One-hot encoding converts categorical variables into binary vectors, with each category represented by a separate binary feature. Label encoding assigns a unique numeric label to each category. Target encoding encodes categorical variables based on the target variable’s statistical properties within each category.
• Creating Interaction or Polynomial Features: Interaction features capture the relationship between multiple input features by combining or multiplying them. Polynomial features involve creating new features as powers or interactions of existing features. These techniques can help model non-linear relationships between features and improve the model’s ability to capture complex patterns.
• Time-Based Features: When dealing with time-series data, it is often beneficial to create features that capture temporal patterns and trends. These can include lagged features (e.g., previous values), rolling statistics (e.g., moving averages), or seasonality indicators (e.g., day of the week, month).
• Domain-Specific Knowledge: Leveraging domain expertise and knowledge can help identify relevant features that are specific to the problem domain. Understanding the underlying factors and relationships in the data can guide the creation of informative features.
• Iterative Refinement: Feature engineering is an iterative process. It involves building, evaluating, and refining models based on the engineered features. The process may require experimenting with different transformations, interactions, or combinations of features to find the most effective representation.

Feature engineering is highly dependent on the specific dataset, problem domain, and the characteristics of the machine learning models being used. Well-engineered features can significantly improve model performance, interpretability, and generalization capabilities, ultimately leading to better predictions and insights from the data.
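Several of the techniques above fit in a few lines each. The sketch below uses only the Python standard library; the helper names and the example values (ages, blood groups, glucose readings) are hypothetical and serve only to illustrate mean imputation with a missingness indicator, standardization, one-hot encoding, and a lagged feature.

```python
# Feature-engineering helpers in plain Python (illustrative; names are assumptions).
from statistics import mean, pstdev


def impute_with_indicator(values):
    """Replace None with the mean of observed values and add a missingness flag."""
    observed_mean = mean(v for v in values if v is not None)
    filled = [observed_mean if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(categories):
    """Map each category to a binary vector, one position per distinct category."""
    vocabulary = sorted(set(categories))
    return [[1 if c == v else 0 for v in vocabulary] for c in categories]


def lagged(series, lag=1):
    """Time-based feature: the value observed `lag` steps earlier (lag >= 1)."""
    return [None] * lag + series[:-lag]


ages, age_missing = impute_with_indicator([40, None, 60, 50])
scaled_ages = standardize(ages)
blood_group = one_hot(["A", "B", "O", "A"])
previous_glucose = lagged([5.1, 5.4, 6.0, 5.8])
```

To avoid information leaking from the validation or test data, statistics such as the imputation mean and the scaling parameters should be computed on the training set only and then reused unchanged on the other subsets.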

1.11

Model Training and Optimization

Model training and optimization are crucial steps in machine learning that involve adjusting the model’s parameters or weights to minimize the prediction error and improve its performance. The goal is to find the best set of parameter values that allow the model to generalize well to unseen data. Here’s an overview of the model training and optimization process:

• Loss Function: A loss function, also known as an objective function or cost function, quantifies the error between the model’s predictions and the true values or labels in the training data. The choice of loss function depends on the specific problem, such as mean squared error (MSE) for regression tasks or cross-entropy loss for classification tasks.
• Optimization Algorithm: An optimization algorithm is used to minimize the loss function by adjusting the model’s parameters. Gradient-based optimization algorithms are commonly employed, such as stochastic gradient descent (SGD), Adam, or RMSprop. These algorithms compute the gradients of the loss function with respect to the model parameters and update the parameters in a way that gradually reduces the loss.
• Training Data: The model is trained on a training dataset, which consists of input features and corresponding output labels or target values. The training data is used to update the model’s parameters iteratively. It is crucial to have diverse and representative training data that covers the range of possible inputs and outputs to achieve good generalization.
• Mini-Batch Training: To improve computational efficiency, training is often performed on mini-batches of data rather than the entire dataset at once. This approach is known as mini-batch training. The model parameters are updated after processing each mini-batch, resulting in faster convergence and better utilization of computational resources.
• Backpropagation: Backpropagation is a key technique used for calculating gradients in neural networks. It efficiently computes the gradient of the loss function with respect to each parameter in the model. The gradients are propagated backward through the layers of the network, allowing the optimization algorithm to adjust the parameters accordingly.
• Regularization: Regularization techniques are used to prevent overfitting, where the model becomes overly complex and fails to generalize well to new data. Common regularization methods include L1 and L2 regularization, which add a penalty term to the loss function to control the complexity of the model. Regularization helps to balance between fitting the training data and avoiding excessive complexity.
• Hyperparameter Tuning: Hyperparameters are parameters that are set before the training process and determine the behavior and characteristics of the model. Examples include learning rate, regularization strength, batch size, and network architecture. Hyperparameter tuning involves selecting the optimal combination of hyperparameter values that result in the best model performance. Techniques such as grid search, random search, or Bayesian optimization can be employed for hyperparameter tuning.
• Early Stopping: Early stopping is a technique used to prevent overfitting and improve generalization. It involves monitoring the model’s performance on a validation set during training and stopping the training process when the validation performance starts to deteriorate. This prevents the model from continuously improving on the training data at the expense of generalization to unseen data.
• Evaluation on Validation and Test Sets: The trained model’s performance is evaluated on a separate validation set. This helps assess its generalization capabilities and select the best-performing model for deployment. Additionally, the model should be evaluated on an independent test set that has not been used during training or validation to obtain an unbiased estimate of its performance.

The model training and optimization process is iterative, involving multiple cycles of training, evaluation, and refinement. It requires careful monitoring, experimentation, and fine-tuning to find the optimal parameter settings that lead to the best-performing model.
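The following sketch combines several of the ideas above (an MSE loss, mini-batch gradient descent, an L2 penalty, and early stopping on a validation set) for a one-variable linear model. It is a plain-Python illustration under stated assumptions rather than a reference implementation: the data are synthetic points on the line y = 2x + 1, and the learning rate, penalty strength, batch size, and patience are arbitrary values.

```python
# Mini-batch gradient descent with L2 regularization and early stopping.
# Plain-Python sketch on synthetic data; all hyperparameter values are assumptions.
import random


def mse(w, b, data):
    """Loss function: mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(train_set, val_set, lr=0.1, l2=1e-4, batch_size=4,
          max_epochs=500, patience=20, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    best = (mse(w, b, val_set), w, b)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        batch_order = train_set[:]
        rng.shuffle(batch_order)
        for start in range(0, len(batch_order), batch_size):
            batch = batch_order[start:start + batch_size]
            # Gradients of the batch MSE plus the L2 penalty l2 * w**2.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch) + 2 * l2 * w
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
        val_loss = mse(w, b, val_set)
        if val_loss < best[0]:
            best, epochs_without_improvement = (val_loss, w, b), 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:  # early stopping
                break
    return best  # (best validation loss, w, b)


# Synthetic data from y = 2x + 1; every fifth point is held out for validation.
points = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
val_set = points[2::5]
train_set = [p for i, p in enumerate(points) if i % 5 != 2]
best_val_loss, w, b = train(train_set, val_set)
```

The function returns the parameters with the lowest validation loss seen so far rather than the last ones, which is a common way to combine early stopping with model selection.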

1.12

Model Deployment and Monitoring

Model deployment and monitoring are essential steps in the machine learning lifecycle to ensure that trained models are effectively deployed into production systems and continue to perform optimally over time. These steps involve integrating the model into a production environment, monitoring its performance, and updating or retraining the model as needed. Here’s an overview of the model deployment and monitoring process:

• Integration with Production Systems: The trained model needs to be integrated into the production environment or system where it will be used to make predictions or decisions. This integration may involve developing APIs, microservices, or other interfaces that allow the system to communicate with the model and send input data for inference.
• Data Preprocessing and Input Handling: Input data from the production system needs to be pre-processed and prepared in a format suitable for the model’s input requirements. This may involve data normalization, encoding categorical variables, handling missing values, or any other preprocessing steps necessary to align the input data with the model’s expectations.
• Real-Time Prediction or Batch Processing: Depending on the requirements of the application, models can be deployed for real-time prediction, where predictions are made on the fly as new data arrives, or batch processing, where predictions are made in bulk on a set of data. The deployment setup should be designed to handle the expected workload and provide predictions efficiently.
• Performance Monitoring: Once the model is deployed, it is important to monitor its performance in the production environment. Performance monitoring involves tracking various metrics, such as prediction accuracy, latency, throughput, and resource utilization. Monitoring helps detect any degradation in performance, identify anomalies, and trigger alerts if performance falls below predefined thresholds.
• Error Analysis and Feedback Loop: In a production environment, monitoring can uncover errors or discrepancies between the model’s predictions and ground truth. Error analysis and investigation should be conducted to identify the root causes of errors and understand the potential sources of bias, data drift, or other issues affecting model performance. This analysis can provide insights for model improvement and guide subsequent iterations of the machine learning pipeline.
• Model Maintenance and Updates: Over time, the model’s performance may degrade due to changes in the underlying data distribution, shifts in user behavior, or other factors. To maintain optimal performance, it is necessary to update or retrain the model periodically. This can involve collecting new labeled data, retraining the model with updated algorithms or hyperparameters, and deploying the updated model to replace the existing one.
• Versioning and Rollbacks: It is important to keep track of model versions and maintain a history of changes made to the deployed models. Versioning allows for easy rollback to a previous version if issues or regressions are encountered after deploying an updated model. It also provides a record of model improvements and allows for comparisons and analysis of different versions.
• Security and Privacy Considerations: Model deployment should address security and privacy concerns. This includes protecting sensitive user data, ensuring secure data transmission, implementing access controls to prevent unauthorized access to the model, and adhering to privacy regulations and guidelines.
• Collaboration and Documentation: Effective collaboration between data scientists, software engineers, and stakeholders is crucial for successful model deployment and monitoring. Proper documentation of the deployed model, its dependencies, configurations, and performance monitoring procedures ensures that all relevant parties have access to the necessary information.

Model deployment and monitoring are ongoing processes that require continuous attention and proactive maintenance to ensure that models remain reliable, accurate, and aligned with the evolving needs of the production system and the changing data landscape.
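A very small monitoring check, of the kind described under performance monitoring and error analysis, might look like the sketch below. It is only an illustration: the function names, the accuracy threshold of 0.8, the drift threshold of 2.0, and the example values are assumptions, and the drift measure (shift of the feature mean in units of the baseline standard deviation) is one simple choice among many.

```python
# Production-monitoring sketch in plain Python (thresholds and names are assumptions).
from statistics import mean, pstdev


def accuracy(y_true, y_pred):
    """Fraction of production predictions that match the collected ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def drift_score(baseline, recent):
    """Shift of the recent feature mean, in units of the baseline standard deviation."""
    return abs(mean(recent) - mean(baseline)) / pstdev(baseline)


def check_model_health(y_true, y_pred, baseline, recent,
                       min_accuracy=0.8, max_drift=2.0):
    """Return a list of alert messages; an empty list means no threshold was crossed."""
    alerts = []
    acc = accuracy(y_true, y_pred)
    if acc < min_accuracy:
        alerts.append(f"accuracy {acc:.2f} below {min_accuracy:.2f}")
    drift = drift_score(baseline, recent)
    if drift > max_drift:
        alerts.append(f"feature drift {drift:.2f} above {max_drift:.2f}")
    return alerts


# Baseline: feature values seen at training time; recent: values seen in production.
training_ages = [40, 45, 50, 55, 60]
production_ages = [70, 72, 75, 78, 80]
alerts = check_model_health(
    y_true=[1, 0, 1, 1, 0], y_pred=[1, 0, 0, 0, 0],
    baseline=training_ages, recent=production_ages,
)
```

In this hypothetical case both alerts fire: the model is being applied to a markedly older population than it was trained on, and its accuracy on recent cases has dropped, which would be a signal to investigate and possibly retrain.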

2

Medical Tabular Data

Tabular data are the most common type of data in machine learning challenges. A set of body mass index (BMI) measurements for a group of people is a simple example of this type of data: for each person, we have the weight, age, and height values. Such data might be too simple to be used with a deep neural network or even most shallow methods, but we can use methods such as linear regression to see the correlation between these metrics. Tabular data consist of numeric and text values organized by features and are usually saved in a comma-separated values (CSV) or similar file format. Each value is separated by a comma, semicolon, or other delimiter, making it easy to load the data into a NumPy array or a pandas DataFrame. In this chapter, we show what kinds of tabular data exist in the medical industry and where such data are used together with machine learning. We show a few applications of machine learning methods used on tabular data sets. Decision trees are used on a Covid19 data set to show the influence of symptoms. The Diverse Counterfactual Explanations (DiCE) method is shown on a heart attack data set. The third use case is based on a diabetes data set, where the prediction is explained using the Accumulated Local Effects (ALE) method.
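As a small illustration of the BMI example, the sketch below reads semicolon-separated records and derives the BMI for each person. It uses only the Python standard library instead of NumPy or pandas so that it runs without extra packages; the four records, the column names, and the `pearson` helper are hypothetical.

```python
# Loading semicolon-separated tabular data and deriving BMI (plain Python sketch).
import csv
import io
from math import sqrt

# Hypothetical records; in practice this text would come from a CSV file on disk.
raw = """person;age;weight_kg;height_m
A;34;70.0;1.75
B;51;82.5;1.80
C;29;54.0;1.60
D;62;95.0;1.70
"""

rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))
weights = [float(r["weight_kg"]) for r in rows]
heights = [float(r["height_m"]) for r in rows]

# BMI = weight (kg) divided by height (m) squared.
bmi = [w / h ** 2 for w, h in zip(weights, heights)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


weight_bmi_correlation = pearson(weights, bmi)
```

Each row is one person and each column one feature, which is the layout assumed by the explainability methods discussed in this chapter.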

2.1

Data Types and Applications

Tabular data is the most popular data type in machine learning, and most of the first challenges were solved using tabular data. One of the reasons is that such data is usually less complex than other types of data. In medicine, such data are used in several areas. One such area is drug development, where this kind of data is used to invent new molecules, as Exscientia (https://www.exscientia.ai/) or Atomwise (https://www.atomwise.com/) do. Genomics is another branch where new molecules are developed, such as in Tempus (https://www.tempus.com/), Sophia Genetics (https://www.sophiagenetics.com/), Dynotx (https://www.dynotx.com/), or Immunai (https://www.immunai.com/platform/).

Electronic health records (EHR) contain different types of data such as images, laboratory data, or typical patient interview responses. Buoy Health (https://www.buoyhealth.com/) uses the patient’s interview information to predict illnesses. It is a symptom checker, a kind of tool that is becoming more and more popular for different types of diseases. Some models use EHR and extract data, such as text or images. The EHR text is used in the models built by DigitalOwl (https://www.digitalowl.com/). Abtrace (https://www.abtrace.co/our-approach/) uses machine learning to organize the data, including EHR data from almost the whole UK population.

Fraud can be detected using time series or just tabular data. Both types of data are used to detect fraud in healthcare by Health@Scale Technologies (https://www.healthatscale.com/precision-fwa-detection). Similarly, but not just for fraud, Codoxo investigates clinical audits and other operations (https://www.codoxo.com/). As in fraud detection, in other areas tabular data is also mixed with other types of data as input for neural network-based models.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Przystalski and R. M. Thanki, Explainable Machine Learning in Medicine, Synthesis Lectures on Engineering, Science, and Technology, https://doi.org/10.1007/978-3-031-44877-5_2

2.2

Explainable Methods for Tabular Data

In this chapter, we first explain decision trees, which are a typical machine learning method. This method is somewhat special, as it can be used to identify the importance of features during model training, and the resulting models are explainable by design. We use it on Covid data sets to find the most important features and explain the influence of each. The Diverse Counterfactual Explanations (DiCE) method is explained next using the heart attack tabular data. The last explainability method is Accumulated Local Effects (ALE). We use it to explain the diabetes predictions and show the correlations between features.

2.2.1

Decision Trees

Decision trees are an interpretable and effective method for medical data analysis. They work by recursively partitioning the data into smaller subsets based on the most significant features. Decision trees can be applied to medical data analysis as follows:

• Feature Extraction: Before building a decision tree, it is necessary to extract features from the medical data. These features serve as the input for the decision tree algorithm.


• Building the Decision Tree: Once the features have been extracted, the decision tree algorithm can be applied to the data. The algorithm works by finding the feature that best separates the data into different classes (e.g., positive or negative cases). This process is repeated recursively for each subset of the data until the decision tree is fully grown.
• Interpreting the Decision Tree: Once the decision tree has been built, it can be interpreted to understand how the model arrived at its decision. Each node in the tree represents a decision based on a feature, and each branch represents one of the possible outcomes. By following the decision path of the tree, it is possible to understand which features were most significant in the classification task.
• Pruning the Decision Tree: Decision trees can sometimes overfit the data, resulting in poor generalization to new data. To avoid overfitting, it is common to prune the decision tree by removing branches that do not significantly contribute to the classification task.
• Visualizing the Decision Tree: Decision trees can be visualized to provide a more intuitive understanding of the model’s decision-making process. The tree can be represented as a diagram with nodes and branches, where the size of each node represents the number of data points that reach that node.

Once trained, the tree (Breiman et al. 1984) decides on one feature at each node, which means that each decision boundary is perpendicular to the axis of that feature. For example, the split $x_1 = 4$ is a line parallel to the $x_2$ axis that crosses the $x_1$ axis at 4. It divides the data set into two smaller ones: one on the left where $x_1 < 4$ and one on the right where $x_1 \geq 4$. We can now take both parts of the main data set and divide each again and again, until we reach subsets in which all objects have the same label. Training a decision tree means finding these division rules. One example of such a tree is shown in Fig. 2.1. It is a simplified tree for melanoma recognition. Medical doctors use different scoring methods to distinguish between a malignant and a benign mole. The tree shown is based on the ABCD method, where asymmetry, border sharpness, and number of colors are the most important features. The decision is made by answering a question at each tree level. Benign cases are marked in blue, suspicious ones in orange, and malignant ones in red.

Fig. 2.1 An example of a decision tree for melanoma recognition using the ABCD scoring method

The decision tree is a popular machine learning method used for classification and regression tasks. It works by partitioning the input space into a set of rectangles or boxes, with each box corresponding to a decision rule based on the values of the input features. Let $X$ be a feature matrix representing the input data, with $n$ rows and $m$ columns, where $n$ is the number of observations and $m$ is the number of features. Let $y$ be a vector representing the target variable, such as a class label or a numerical value, for each observation. A decision tree is built recursively by partitioning the input space into subsets based on the values of one of the features. At each node of the tree, a decision rule is applied to determine which subset of the input space should be further partitioned. The decision rule is typically based on a simple threshold function of the form: if $x_i \leq t$ then go left, else go right, where $x_i$ is the value of the $i$th feature, $t$ is a threshold value, and “go left” and “go right” refer to the two possible outcomes of the decision rule. The partitioning process continues until a stopping criterion is met, such as a maximum depth of the tree, a minimum number of observations in each box, or a minimum reduction in the impurity of the target variable (in the case of classification tasks). To make a prediction for a new observation $x_0$, the decision tree follows the decision rules from the root node to a leaf node, which corresponds to a subset of the input space.
The prediction is then based on the majority class or the mean value of the target variable in that subset. The decision tree method can also be used to explain the predictions by tracing the path from the root node to the leaf node corresponding to a specific prediction. The decision rules along the path provide a natural language explanation of how the prediction was made based on the input features.

In Listing 2.1, we first load the medical data from a CSV file and split it into features (X) and target (y) variables. The data is then split into training and testing sets using the train_test_split function. Next, we create a decision tree classifier using the DecisionTreeClassifier class and train it on the training data using the fit method. The trained classifier then predicts the target values for the testing data using the predict method. Finally, we compute the accuracy of the classifier using the accuracy_score function from scikit-learn’s metrics module.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the medical data
data = pd.read_csv("medical_data.csv")

# Split the data into features and target
X = data.drop("target", axis=1)
y = data["target"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Predict the target values for the testing data
y_pred = clf.predict(X_test)

# Compute the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)
```

Listing 2.1 Decision tree model classification methods

Decision trees are an effective and interpretable method for medical data analysis. They can be used to identify the most significant features in the data and provide a clear visualization of the decision-making process. However, it is important not to overfit the data and to prune the tree appropriately. In summary, the decision tree method partitions the input space into rectangles or boxes based on simple threshold functions of the input features and makes predictions by following the decision rules from the root node to a leaf node. The method can be used for both classification and regression tasks and explains its predictions through natural language descriptions of the decision rules.
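The path tracing described in this section can be sketched with scikit-learn as follows. This is a minimal illustration, not the book’s own code: the data is synthetic (make_classification), the feature names are hypothetical placeholders, and the helper explain is introduced here only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for a medical table; the feature names are hypothetical.
feature_names = ["age", "blood_pressure", "cholesterol", "glucose"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The complete rule set of the tree as indented text.
rules = export_text(clf, feature_names=feature_names)
print(rules)


def explain(clf, x, feature_names):
    """Return the threshold tests applied to one sample, from root to leaf."""
    tree = clf.tree_
    node_ids = clf.decision_path(x.reshape(1, -1)).indices
    steps = []
    for node in node_ids:
        if tree.children_left[node] == tree.children_right[node]:
            continue  # leaf node: no test is applied here
        f = tree.feature[node]
        t = tree.threshold[node]
        # scikit-learn compares float32 inputs against the threshold
        op = "<=" if np.float32(x[f]) <= t else ">"
        steps.append(f"{feature_names[f]} {op} {t:.3f}")
    return steps


path_steps = explain(clf, X[0], feature_names)
prediction = clf.predict(X[:1])[0]
print(" AND ".join(path_steps), "=> class", prediction)
```

Each entry in path_steps is one threshold rule of the form used above ($x_i \leq t$ or $x_i > t$), so the joined string reads as a plain-language justification of a single prediction.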

2.2.2

Diverse Counterfactual Explanations (DiCE)

The Diverse Counterfactual Explanations (DiCE) method (Mothilal et al. 2020) finds alternative feature values that would or would not change the final prediction. What distinguishes DiCE from other methods is that its result is a list of proposed objects with changed feature values. It can generate alternatives for the same label or for the opposite one. These proposed alternatives are called counterfactuals. The idea is to explain the model not by showing which feature has the greatest impact on the prediction, but by understanding the prediction through a set of alternative options. Comparing this set with the current prediction is usually easy and gives a better understanding of the prediction. The counterfactuals are defined as:

$$c = \arg\min_{c} \; \mathrm{yloss}(f(c), y) + |x - c|, \tag{2.1}$$

where $x$ are the input features, $c$ is the counterfactual, and $y$ is the label. The $\mathrm{yloss}$ function that is proposed in the original paper is based on the hinge loss function, defined as follows:

$$\mathrm{hingeloss} = \max\bigl(0,\; 1 - z \cdot \mathrm{logit}(f(c))\bigr), \tag{2.2}$$

where $z$ is $-1$ when $y = 0$ and $1$ when $y = 1$, and $\mathrm{logit}(f(c))$ is the unscaled output from the model. The DiCE method is divided into a few steps, of which the most important are the diversity calculation, the proximity calculation, and the counterfactual optimization. Finally, we obtain the most appropriate counterfactuals by optimizing the generated candidates. The first step is calculated as follows:

$$\mathrm{dpp\_diversity} = \det(K), \tag{2.3}$$

where $K_{i,j} = \frac{1}{1 + \mathrm{dist}(c_i, c_j)}$ and $\mathrm{dist}(c_i, c_j)$ denotes the distance between two counterfactuals. The distance is calculated using one of two functions, depending on the type of data. For continuous data, we can calculate the distance as follows:

$$\mathrm{dist\_cont}(c, x) = \frac{1}{d_{cont}} \sum_{p=1}^{d_{cont}} \frac{|c^{p} - x^{p}|}{MAD_{p}}. \tag{2.4}$$

The distance for categorical features is calculated as follows:

$$\mathrm{dist\_cat}(c, x) = \frac{1}{d_{cat}} \sum_{p=1}^{d_{cat}} I(c^{p} \neq x^{p}), \tag{2.5}$$

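where $d_{cat}$ is the number of categorical features, $d_{cont}$ is the number of continuous features, $c^{p}$ and $x^{p}$ are the values of the $p$th feature, and $MAD_{p}$ is the median absolute deviation of the $p$th continuous feature. The second step is the proximity, which is the negative mean distance between the original input and the counterfactual features. It can be calculated as follows:

$$\mathrm{Proximity} := -\frac{1}{k} \sum_{i=1}^{k} \mathrm{dist}(c_i, x). \tag{2.6}$$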

It is important to mention that, from the user’s perspective, changing a small number of features is a better solution than changing many. That is why we want to find counterfactuals with the lowest number of features to change. The counterfactuals chosen for the final set can be calculated as follows:


$$C(x) = \arg\min_{c_1, \ldots, c_k} \; \frac{1}{k} \sum_{i=1}^{k} \mathrm{yloss}(f(c_i), y) + \frac{\lambda_1}{k} \sum_{i=1}^{k} \mathrm{dist}(c_i, x) - \lambda_2 \, \mathrm{dpp\_diversity}(c_1, \ldots, c_k), \tag{2.7}$$

where $k$ is the number of counterfactuals, $c_i$ are the counterfactuals, $f()$ is the model, and $\lambda_1$ and $\lambda_2$ are weights that balance the proximity and diversity terms.

```python
import numpy as np

# Define two medical data samples as binary arrays
sample1 = np.array([0, 1, 1, 0, 0, 1, 1, 1])
sample2 = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Compute the intersection of the two samples
intersection = np.sum(sample1 * sample2)

# Compute the sum of the two samples
sum_samples = np.sum(sample1) + np.sum(sample2)

# Compute the Dice coefficient
dice = (2 * intersection) / sum_samples

# Print the Dice coefficient
print("DICE coefficient: ", dice)
```

Listing 2.2 Dice similarity coefficient example

In Listing 2.2, we first define two medical data samples as binary arrays (sample1 and sample2). We then compute the intersection of the two samples by taking the element-wise product of the two arrays and summing the result with np.sum. We also compute the sum of the two samples by summing each array individually and adding the results together. Finally, we compute the Dice coefficient using the formula (2 * intersection) / sum_samples and print the result; for the two samples above this gives 6/9 ≈ 0.67. Note that this coefficient is the Sørensen-Dice overlap measure between two binary samples. It shares only its name with the DiCE counterfactual method defined in Eqs. (2.1)-(2.7) and does not generate counterfactuals.
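The distance, proximity, and diversity terms of Eqs. (2.3)-(2.6) can be sketched directly in NumPy. This is a hand-written illustration under stated assumptions, not the dice_ml implementation: the function names are hypothetical, the five reference rows are the age, trtbps, and chol values of the patients in Table 2.1 as far as they can be read there, and the categorical codes are made up for the example.

```python
import numpy as np


def mad(column):
    """Median absolute deviation of one continuous feature."""
    return np.median(np.abs(column - np.median(column)))


def dist_cont(c, x, mads):
    """Eq. (2.4): mean MAD-scaled absolute difference (assumes all MADs > 0)."""
    return np.mean(np.abs(c - x) / mads)


def dist_cat(c, x):
    """Eq. (2.5): fraction of categorical features that differ."""
    return np.mean(c != x)


def proximity(cfs, x, mads):
    """Eq. (2.6): negative mean distance of the counterfactuals to x."""
    return -np.mean([dist_cont(c, x, mads) for c in cfs])


def dpp_diversity(cfs, mads):
    """Eq. (2.3): determinant of the kernel K_ij = 1 / (1 + dist(c_i, c_j))."""
    k = len(cfs)
    K = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            K[i, j] = 1.0 / (1.0 + dist_cont(cfs[i], cfs[j], mads))
    return np.linalg.det(K)


# Continuous columns (age, trtbps, chol) used to estimate the MADs.
reference = np.array([[57, 150, 276], [59, 170, 288], [57, 150, 126],
                      [56, 134, 409], [71, 110, 265]], dtype=float)
mads = np.array([mad(reference[:, j]) for j in range(reference.shape[1])])

x = np.array([57.0, 150.0, 276.0])         # original patient
cfs = np.array([[57.0, 150.0, 186.0],      # counterfactual with lower chol
                [57.0, 170.0, 276.0]])     # hypothetical second candidate

d_first = dist_cont(cfs[0], x, mads)
d_cat = dist_cat(np.array([1, 2, 0]), np.array([1, 0, 0]))  # made-up codes
prox = proximity(cfs, x, mads)
div_distinct = dpp_diversity(cfs, mads)
div_identical = dpp_diversity(np.array([cfs[0], cfs[0]]), mads)
print(mads, d_first, d_cat, prox, div_distinct, div_identical)
```

Two identical counterfactuals give a kernel matrix of ones and therefore a diversity of zero, which is why the $-\lambda_2$ term in Eq. (2.7) pushes the optimizer toward counterfactuals that differ from each other.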

2.2.3

Accumulated Local Effects (ALE)

ALE (Accumulated Local Effects) (Apley and Zhu 2020) is a method for explaining the predictions of machine learning models, including those used for medical data analysis. ALE provides a way to interpret the contribution of each feature to the final prediction made by the model. ALE works as follows:

• Model Training: First, a machine learning model is trained on the data. This could be a classification model.
• Feature Extraction: Next, features are extracted from the data. These features serve as the input for the ALE method.
• Prediction: The trained model is then used to make predictions on the data. For each prediction, the ALE method is applied to explain the contribution of each feature to the prediction.
• ALE Calculation: ALE works by computing the difference in the model prediction for each feature at two different values of that feature, while keeping all other features constant. This calculation is performed for each feature, and the results are combined to provide an explanation for the model prediction.
• Visualization: Finally, the ALE results can be visualized to provide a more intuitive understanding of the contribution of each feature to the model prediction. This visualization could take the form of a bar chart or a heatmap, where the features are ranked by their contribution to the prediction.

ALE is a useful method for understanding the inner workings of machine learning models used for medical data analysis. It explains the contribution of each feature to the model’s prediction, which can be useful for identifying biases, improving model performance, or communicating the model’s decision-making process to stakeholders.

Let $X$ be a feature matrix representing the input data, with $n$ rows and $m$ columns, where $n$ is the number of observations and $m$ is the number of features. Let $y$ be a vector representing the target variable for each observation. The ALE method models the relationship between the features and the target variable using an additive model of the form:

$$f(x) = \beta_0 + \sum_{i=1}^{m} f_i(x_i), \tag{2.8}$$

where $\beta_0$ is the intercept term, $f_i(x_i)$ is the contribution of the $i$th feature to the prediction for observation $x$, and $x_i$ is the value of the $i$th feature for observation $x$. The function $f_i(x_i)$ is modeled as a piecewise linear function, where the breakpoints are determined by the values of the feature in the training data. To explain the prediction for a specific observation $x_0$, the ALE method computes the difference in the predicted value between two values of a given feature $i$, while holding all other features constant. Let $x_{i0}$ be the value of feature $i$ for observation $x_0$, and let $x_{ih}$ and $x_{il}$ be two values of feature $i$ in the training data such that $x_{il} \leq x_{i0} \leq x_{ih}$. Then, the contribution of the $i$th feature to the prediction for observation $x_0$ can be estimated as:

$$f_i(x_{i0}) \approx \frac{f(x_0 \mid x_i = x_{ih}) - f(x_0 \mid x_i = x_{il})}{x_{ih} - x_{il}}, \tag{2.9}$$

where $f(x_0 \mid x_i = x_{ih})$ and $f(x_0 \mid x_i = x_{il})$ are the predicted values for observation $x_0$ when the value of feature $i$ is set to $x_{ih}$ and $x_{il}$, respectively. The ALE method can also be used to visualize the contribution of each feature to the prediction. For example, the contributions can be plotted as a function of the value of the feature, with each feature represented as a separate line or curve.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Load medical data
data = pd.read_csv("medical_data.csv")

# Define features and target variable
features = ["age", "blood_pressure", "cholesterol"]
target = "heart_disease"

# Split data into training and test sets
train_data = data.sample(frac=0.8, random_state=42)
test_data = data.drop(train_data.index)

# Fit a random forest regressor model to the training data
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(train_data[features], train_data[target])

# Plot the effect of each feature. scikit-learn draws partial dependence
# curves here; plot_partial_dependence was removed in scikit-learn 1.2 in
# favor of PartialDependenceDisplay.from_estimator.
fig, ax = plt.subplots(figsize=(8, 6))
PartialDependenceDisplay.from_estimator(
    model, train_data[features], features, ax=ax)
plt.show()
```

Listing 2.3 ALE simplified example

In Listing 2.3, we first load the medical data from a CSV file using the pd.read_csv function. We then define the features and target variable for the analysis. Next, we split the data into training and test sets using the sample and drop methods from pandas. We then fit a random forest regressor model to the training data using the RandomForestRegressor class from scikit-learn. Finally, we plot the effect of each feature using scikit-learn’s partial dependence plotting utility (plot_partial_dependence in older releases, replaced by PartialDependenceDisplay.from_estimator and removed in scikit-learn 1.2) together with plt.subplots and plt.show from matplotlib. Strictly speaking, this utility draws partial dependence curves, which average the prediction over the whole data set for each feature value; ALE instead accumulates local prediction differences within small intervals of the feature and is therefore less affected by correlated features. The resulting plot shows how the target variable changes as each feature is varied.
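Because scikit-learn itself draws partial dependence rather than ALE curves, a first-order ALE curve can be sketched by hand in NumPy. The function ale_1d below is a hypothetical helper written for this illustration (it is not a library call), the data is synthetic, and the model is a known linear function so that the result can be checked by eye.

```python
import numpy as np


def ale_1d(predict, X, feature, n_bins=10):
    """First-order ALE of one feature for a model given as predict(X)."""
    x = X[:, feature]
    # Quantile bin edges, so every bin holds a similar number of rows.
    edges = np.unique(np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)))
    n_intervals = len(edges) - 1
    # Interval index of every row.
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1,
                  0, n_intervals - 1)
    local = np.zeros(n_intervals)
    counts = np.zeros(n_intervals)
    for b in range(n_intervals):
        rows = X[idx == b]
        if len(rows) == 0:
            continue
        lower, upper = rows.copy(), rows.copy()
        lower[:, feature] = edges[b]      # move rows to the left edge
        upper[:, feature] = edges[b + 1]  # move rows to the right edge
        # Local effect: mean prediction change across this interval only.
        local[b] = np.mean(predict(upper) - predict(lower))
        counts[b] = len(rows)
    # Accumulate the local effects, then center the curve around zero.
    ale = np.concatenate([[0.0], np.cumsum(local)])
    mid = (ale[:-1] + ale[1:]) / 2.0
    ale -= np.sum(mid * counts) / np.sum(counts)
    return edges, ale


def predict(data):
    """Toy model with known effects: slope 2 for feature 0, 3 for feature 1."""
    return 2.0 * data[:, 0] + 3.0 * data[:, 1]


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
edges, ale = ale_1d(predict, X, feature=0, n_bins=10)
print(np.round(np.diff(ale) / np.diff(edges), 3))  # slope per interval
```

For the toy model every interval recovers the slope of feature 0, as expected. With a fitted estimator, model.predict can be passed as the predict argument instead.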

2.3

Decision Trees Used for Covid19 Symptoms Influence

The Covid-19 pandemic was deadly for many people. It came as a shock, because no human infections with this coronavirus had been observed before. We had to learn about the illness quickly to be able to react fast and reduce the number of infections and of cases with serious breathing problems. Since then, a lot of research has been carried out, and several symptoms and parameters that can have an important impact on severity have been found.

Dataset

For this example we use the Covid-19 symptom data set prepared by Bilal Hungund that is hosted by Kaggle (https://www.kaggle.com/datasets/iamhungundji/covid19-symptoms-checker). It is an artificial dataset. To automatically download the data, we need to set the Kaggle username and access key as environment variables.

```shell
mkdir -p brain
export KAGGLE_USERNAME=user
export KAGGLE_KEY=password
cd brain
kaggle datasets download iamhungundji/covid19-symptoms-checker
unzip covid19-symptoms-checker.zip
```

Listing 2.4 Dataset download

If you do not have a Kaggle key, you can download the data manually and extract it to the ‘covid19-symptoms-checker’ directory. The data set consists of symptoms such as fever, tiredness, cough, sore throat, pain, difficulties in breathing, nasal congestion, runny nose, and many others. Apart from the symptoms, other features were collected. One of them is age, which is divided into groups: 0–9, 10–19, 20–24, 25–59, and 60+. Other non-symptom features are the gender and the patient’s country of residence. Some patients might not have any symptoms, and this is also noted in the set. The last two groups of features are the severity and whether the patient had contact with another infected person. The set consists of 27 columns, where the four severity features can be used as labels. We have 5,068,800 cases.

Even though the data set is already cleaned up, some features should still be modified. The severity features should be combined into one, so that we can use it as a label. To do so, we use the four severity features and set the label to one if the severity is severe. We can just replace the Condition feature with Severity_Severe or, in case we want to extend it to other severities, set a proper value for each specific severity as shown below. After we create the Condition feature, the severity features can be removed from the set.

```python
import numpy as np

df["Condition"] = 0
df["Condition"] = np.where(df["Severity_Mild"] == 1, 0, df["Condition"])
# Moderate cases are mapped to 0 here (assumed intent); set another value
# to treat them as a separate or positive class.
df["Condition"] = np.where(df["Severity_Moderate"] == 1, 0, df["Condition"])
df["Condition"] = np.where(df["Severity_Severe"] == 1, 1, df["Condition"])
```

Listing 2.5 Preprocessing

The countries are text values. We should transform them into numbers by assigning a number to each country and then replacing the column with these numbers. We can use the LabelEncoder for this (see Listing 2.6).

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df["Country"] = le.fit_transform(df["Country"])
```

Listing 2.6 Country preprocessing

Similarly, other features need some adjustment. Age is represented by a set of features that can be merged into a single Age feature. As with severity, we go through each age-group feature and assign an integer code to each range. Gender is represented by three features that can easily be combined into one. Contact with an infected person can also be represented by one feature instead of three. After the cleanup, the final set has 16 features instead of 27, which makes the calculation faster.
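Merging a group of indicator columns into one integer-coded feature can be sketched with pandas as follows. The toy frame and its column names are assumptions chosen to mimic the one-hot layout described above, and merge_one_hot is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Toy frame mimicking the one-hot layout; the column names are assumed.
df = pd.DataFrame({
    "Age_0-9":            [1, 0, 0],
    "Age_10-19":          [0, 0, 0],
    "Age_20-24":          [0, 1, 0],
    "Age_25-59":          [0, 0, 0],
    "Age_60+":            [0, 0, 1],
    "Gender_Female":      [1, 0, 0],
    "Gender_Male":        [0, 1, 0],
    "Gender_Transgender": [0, 0, 1],
})


def merge_one_hot(frame, columns, new_name):
    """Replace indicator columns by one integer code (position of the 1)."""
    codes = frame[columns].to_numpy().argmax(axis=1)
    out = frame.drop(columns=columns)
    out[new_name] = codes
    return out


age_cols = [c for c in df.columns if c.startswith("Age_")]
gender_cols = [c for c in df.columns if c.startswith("Gender_")]

merged = merge_one_hot(df, age_cols, "Age")
merged = merge_one_hot(merged, gender_cols, "Gender")
print(merged)
```

The code of each group is simply the position of the active indicator in the column list, so the order of age_cols determines which integer stands for which age range.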

```python
from sklearn.model_selection import train_test_split

y = df["Condition"]
df.drop(["Condition"], axis=1, inplace=True)
X = df

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=15)
```

Listing 2.7 Testing/training sets preparation

Before we build a model, we should divide the set into training and testing sets. To do so, we store the label y as a separate series, remove it from the feature set, and then split both with train_test_split into two parts with a ratio of 90% training to 10% testing.

Model

The libraries that we are about to use are divided into three groups: data-related libraries for data manipulation and preparation, classification methods to build the model, and graphical libraries that are used to display the data and classification results. In the first group we use Numpy and Pandas for data cleanup, presentation, and checks. The preprocessing module from the scikit-learn package is used to convert the values into a more model-readable format. The second group is used to build the model and use it for prediction; it also includes the accuracy measure. In this case we use the decision tree implemented in the scikit-learn library. The last group consists of libraries related to charts and dendrograms. Matplotlib is used to show the five most important features and graphviz to show the tree node structure.

```python
from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)
```

Listing 2.8 Scikit-learn decision tree model training and prediction

The training and prediction are done using the DecisionTreeClassifier from scikit-learn. The accuracy is good enough to analyze the explanation. The tree node structure can be plotted using the plot_tree function that is part of the scikit-learn package, which draws a dendrogram with the root of the tree at the top. The export_graphviz function additionally writes the tree in Graphviz DOT format, which can be rendered to a PNG file with the Graphviz dot tool (see Listing 2.9).

```python
tree.plot_tree(clf)
# export_graphviz writes Graphviz DOT source, not an image
tree.export_graphviz(clf, out_file="tree.dot")
```

Listing 2.9 Decision tree dendrogram plotting method

Explanation

Decision trees by design can be used to measure the importance of features. At each split, the method needs to decide which feature to use to divide the node into child nodes. There are a few criteria that can be used to decide how to divide a node. In Fig. 2.2 we see that the Gini index is used in this case. We can also see that each node (rectangle) that is divided further carries information about the feature that was used for the split. In this case, the splits are binary (two child nodes). Which features were used for the splits, and how often, is one measure of feature importance: the more often a feature is used for a split, the more important it is. Features that were not used even once should be considered useless for the classification. It is important to mention that the features selected for the splits depend on the criterion used to measure the class distribution, such as the Gini index.

Fig. 2.2 Decision tree built using the scikit-learn implementation on Covid19 dataset

The feature importance is known after the model is built and, in the implementation used here, is stored in the feature_importances_ attribute. The five most important features can be plotted using Pandas as in Listing 2.10. It turns out that the intuition holds. Runny nose, age, and breathing difficulties were already proven to be important indicators of a severe course of Covid-19 (Fig. 2.2). The country feature is the most important. This may be because the pandemic started in China, where it had the largest quantitative impact on the population. The plot of the five most important features is shown in Fig. 2.3a.

```python
feature_importance = clf.feature_importances_

feat_importances = pd.Series(clf.feature_importances_, index=df.columns)
feat_importances.nlargest(5).plot(kind="barh")
```

Listing 2.10 Five most important features plotting method


Fig. 2.3 Covid19 feature importance plots

```python
tree_importance = pd.Series(feature_importance, index=df.columns)

fig, ax = plt.subplots()
# A single tree has no spread across estimators, so no error bars are drawn
tree_importance.plot.bar(ax=ax)
ax.set_title("Feature importances")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
```

Listing 2.11 Feature importance plotting method

If we plot all features as in Listing 2.11 (Fig. 2.3b), we can conclude that gender, sore throat, dry cough, and pains do not influence the severity of the disease course. All features are shown in Fig. 2.3.
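The impurity-based importance discussed above rests on the Gini index and on how much a split reduces it. The following NumPy sketch computes both by hand on a made-up six-sample example; the helper names and the toy values are assumptions for illustration, and scikit-learn additionally sums these weighted decreases per feature over all nodes and normalizes them to obtain feature_importances_.

```python
import numpy as np


def gini(labels):
    """Gini index of a set of class labels: 1 - sum of squared class shares."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def impurity_decrease(feature, labels, threshold):
    """Parent Gini minus the size-weighted Gini of the two child nodes."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Made-up example: three healthy (0) and three severe (1) cases.
labels = np.array([0, 0, 0, 1, 1, 1])
informative = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # separates classes
noisy = np.array([1.0, 6.0, 2.0, 5.0, 3.0, 4.0])        # mixes classes

gain_informative = impurity_decrease(informative, labels, threshold=3.5)
gain_noisy = impurity_decrease(noisy, labels, threshold=3.5)
print(gini(labels), gain_informative, gain_noisy)
```

The informative feature removes all impurity at the split, while the noisy feature removes very little, which is the kind of difference that ends up in the bar heights of Fig. 2.3.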

2.4

Heart Attack Diverse Counterfactual Explanations (DiCE)

According to the U.S. Department of Health and Human Services, someone in the US has a heart attack every 40 seconds. About 805,000 people in the United States have a heart attack each year. Building solutions that support medical doctors can decrease the number of heart attack occurrences and potentially also reduce the number of resulting deaths.


Dataset

For this example we use the heart attack analysis and prediction data set prepared by Rashik Rahman that is hosted by Kaggle (https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset). To automatically download the data, we need to set the Kaggle username and access key as environment variables, as in Listing 2.4.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart/heart.csv")
df.head()

# Keep a copy that still contains the label for the DiCE explainer
df_exp = df.copy()
y = df["output"]
df.drop(["output"], axis=1, inplace=True)
X = df

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```

Listing 2.12 Heart disease data set loading

The dataset consists of 165 positive and 138 negative cases. It has 13 features with the following meaning:

• age: patient’s age,
• sex: patient’s gender,
• cp: the chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic),
• trtbps: resting blood pressure (in mm Hg),
• chol: cholesterol in mg/dl fetched via BMI sensor,
• fbs: fasting blood sugar > 120 mg/dl (1: true, 0: false),
• restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality, 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria),
• thalachh: maximum heart rate achieved,
• exng: a binary value of the exercise-induced angina (1: yes, 0: no),
• oldpeak: ST depression in the electrocardiogram induced by exercise relative to rest,
• slp: the slope of the peak exercise ST segment (0: upsloping, 1: flat, 2: downsloping),
• caa: the number of major vessels (0–3),
• thall: thallium stress test result (a categorical code).

We also have the label column named ‘output’. It is a binary value where 0 means a lower chance of heart attack and 1 means a higher chance of heart attack. To make the prediction possible, we divide the set into features and labels and then into training and testing data sets using the scikit-learn train_test_split function (see Listing 2.12).


Model

The prediction is quite straightforward. We use the AdaBoost method from scikit-learn with 10 classifiers used internally, as shown in Listing 2.13.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

clf = AdaBoostClassifier(n_estimators=10)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)
```

Listing 2.13 Training the model for the DiCE heart disease example

The accuracy is about 80%, which is high enough to use the explainability methods. To increase it, you can raise the number of estimators or switch to a more suitable classifier.

Explanation

The DiCE implementation from the dice_ml library allows us to train a model within the function or to use an already trained model. In our case we use the second option. The model is passed together with the framework or library that was used to build it. As our model was built using AdaBoost from the scikit-learn library, we set the backend to sklearn. Next, the data that we want to use for the explanation is given. The continuous features are set to age, blood pressure, and cholesterol. The label column is set through the outcome_name value. The explainer is then returned and saved in the exp variable. The Python code to build the explainer is shown in Listing 2.14.

```python
import dice_ml

model = dice_ml.Model(model=clf, backend="sklearn")
d = dice_ml.Data(dataframe=df_exp,
                 continuous_features=["age", "trtbps", "chol"],
                 outcome_name="output")
exp = dice_ml.Dice(d, model)
```

Listing 2.14 Building the DiCE explainer for the tested patients

The explainer can now generate the counterfactuals. Three parameters should be set: the cases for which we want to generate counterfactuals, the number of counterfactuals, and the desired class of the counterfactuals. In our case we generate 5 counterfactuals for each case, which makes 75 counterfactuals. We set the desired class to the opposite, which means that the counterfactuals show the changes that turn a case into a case of the opposite class. If the patient has a high probability of heart disease, the counterfactuals show us which features have to change, and how, for the patient to be classified as healthy. The patients are listed in Table 2.1, and the counterfactuals for this testing set of patients are given in Table 2.2. In the counterfactuals for the first patient, we see that the cholesterol and chest pain are features changed 3 times out of 5. We can conclude that even if the cholesterol level stays high, a change of the chest pain type increases the possibility of heart disease. We get the same effect if the cholesterol level stays high and the maximum heart rate increases. For the third patient, who is marked as sick, an increase in oldpeak (ST depression) results in a better predicted condition of the patient. In two cases, surprisingly, increasing the cholesterol level might change the patient’s condition.

Table 2.1 Heart disease patients features

| Patient id | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 57 | 1 | 0 | 150 | 276 | 0 | 0 | 112 | 1 | 0.6 | 1 | 1 | 1 | 0 |
| 2 | 59 | 1 | 3 | 170 | 288 | 0 | 0 | 159 | 0 | 0.2 | 1 | 0 | 3 | 0 |
| 3 | 57 | 1 | 2 | 150 | 126 | 1 | 1 | 173 | 0 | 0.2 | 2 | 1 | 3 | 1 |
| 4 | 56 | 0 | 0 | 134 | 409 | 0 | 0 | 150 | 1 | 1.9 | 1 | 2 | 3 | 0 |
| 5 | 71 | 0 | 2 | 110 | 265 | 1 | 0 | 130 | 0 | 0.0 | 2 | 1 | 2 | 1 |
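Counting which features a set of counterfactuals changes, as done informally above, can be sketched with pandas. The values below are those of the first patient in Table 2.1 and of that patient’s five counterfactuals in Table 2.2, as far as they can be read from the printed tables; the variable names are chosen for this illustration only.

```python
import pandas as pd

columns = ["age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
           "thalachh", "exng", "oldpeak", "slp", "caa", "thall"]

# Patient 1 from Table 2.1 (features only, without the output column).
original = pd.Series([57, 1, 0, 150, 276, 0, 0, 112, 1, 0.6, 1, 1, 1],
                     index=columns, dtype=float)

# The five counterfactuals listed for patient 1 in Table 2.2.
counterfactuals = pd.DataFrame([
    [57, 1, 0, 150, 186, 0, 0, 180, 1, 0.6, 1, 1, 1],
    [57, 1, 2, 150, 276, 0, 0, 112, 1, 0.6, 2, 1, 1],
    [57, 1, 0, 150, 276, 0, 0, 179, 1, 0.6, 1, 0, 1],
    [57, 1, 1, 150, 276, 0, 0, 112, 1, 0.6, 1, 0, 1],
    [57, 0, 2, 150, 276, 0, 0, 112, 1, 0.6, 1, 1, 1],
], columns=columns, dtype=float)

# True wherever a counterfactual differs from the original patient.
changed = counterfactuals.ne(original, axis=1)

# How often each feature was changed across the five counterfactuals.
change_counts = (changed.sum()
                 .loc[lambda s: s > 0]
                 .sort_values(ascending=False))

# The features changed by each single counterfactual.
changed_per_row = [list(changed.columns[row]) for row in changed.to_numpy()]

print(change_counts)
print(changed_per_row)
```

Counterfactuals that change only one or two features are the easiest to act on, which matches the preference for sparse changes stated in the description of the method.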

2.5

Diabetes Prediction Explanation with ALE Method

Diabetes is a chronic disease that may cause other diseases if not threated. It is caused by the hyperglycemia and became a civilization disease. The main reason for its formation is the lack or limited secretion of insulin. There are different types of the diabetes, but in this research we focus on prediction if the patient has any or does not have it.

Dataset For this example we use the Diabetes data set prepared by UCI Machine Learning and is hosted by Kaggle: https://www.kaggle.com/datasets/uciml/pima-indians-diabetesdatabase. To automatically download the data, we need to set the Kaggle username and access key as environment variables, same as in Listing 2.4. The one change that needs to be applied is the dataset name: uciml/pima-indians-diabetes-database. Compared to other data sets, in this cases we just need to load it as a dataframe as below. We see 8 features named in columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age. The features meaning is as follows: • Pregnancies is the number of times pregnant, • Glucose is a diagnostic test that measure the two hourplasma glucose concentration after 75 g anhydrous glucose in mg/dl, • Blood Pressure is the blood Pressure measured in mmHg, • Skin Thickness is the triceps skin fold thickness measured in mm, • Insulin is a diagnostic test that measures the 2 h serum insulin in mU/ml, • BMI is the Body Mass Index measured in kg/.m2 ,


Table 2.2 Heart disease patients counterfactuals

Patient    age   sex  cp   trtbps  chol   fbs  restecg  thalachh  exng  oldpeak  slp  caa  thall  output
Patient 1  57.0  1.0  0.0  150.0   186    0.0  0.0      180       1.0   0.6      1.0  1.0  1.0    1
           57.0  1.0  2    150.0   276.0  0.0  0.0      112.0     1.0   0.6      2    1.0  1.0    1
           57.0  1.0  0.0  150.0   276.0  0.0  0.0      179       1.0   0.6      1.0  0    1.0    1
           57.0  1.0  1    150.0   276.0  0.0  0.0      112.0     1.0   0.6      1.0  0    1.0    1
           57.0  0    2    150.0   276.0  0.0  0.0      112.0     1.0   0.6      1.0  1.0  1.0    1
Patient 2  59.0  1.0  3.0  170.0   288.0  0.0  0.0      159.0     0.0   0.2      0    0.0  0      1
           59.0  0    3.0  170.0   276    0.0  0.0      159.0     0.0   0.2      1.0  0.0  3.0    1
           57.0  1.0  3.0  170.0   226    0.0  0.0      159.0     0.0   0.2      1.0  0.0  3.0    1
           59.0  1.0  3.0  170.0   203    0.0  0.0      159.0     0.0   0.2      1.0  0.0  3.0    1
           59.0  1.0  3.0  170.0   288.0  0    0.0      159.0     0.0   0.2      1.0  0.0  2      1
Patient 3  56.0  0.0  2    134.0   409.0  0.0  0.0      150.0     1.0   1.9      1.0  0    3.0    1
           56.0  0.0  0.0  134.0   409.0  0.0  0.0      192       1.0   1.9      1.0  2.0  1      1
           56.0  0.0  3    134.0   409.0  0.0  0.0      150.0     1.0   1.9      1.0  0    3.0    1
           56.0  0.0  1    134.0   409.0  0.0  0.0      150.0     1.0   1.9      1.0  2.0  0      1
           56.0  0.0  2    134.0   409.0  0.0  0.0      150.0     1.0   1.9      1.0  2.0  2      1
Patient 4  57.0  1.0  2.0  180.0   126.0  1.0  1.0      173.0     0.0   1.8      2.0  1.0  3.0    0
           57.0  1.0  0    150.0   126.0  1.0  1.0      173.0     1     0.2      2.0  1.0  3.0    0
           57.0  1.0  2.0  150.0   409    1.0  1.0      173.0     1     0.2      2.0  1.0  3.0    0
           57.0  1.0  2.0  150.0   186    1.0  1.0      173.0     0.0   4.2      2.0  1.0  3.0    0
           57.0  1.0  0    150.0   126.0  1.0  1.0      173.0     0.0   0.2      2.0  1    3.0    0
Patient 5  71.0  0.0  2.0  110.0   265.0  1.0  0.0      130.0     0.0   0.0      2    1.0  3      0
           71.0  0.0  0    110.0   265.0  1.0  0.0      130.0     0.0   0.0      2.0  1.0  0      0
           71.0  0.0  2.0  110.0   265.0  1.0  0.0      130.0     0.0   3.8      2.0  1.0  2.0    0
           71.0  0    2.0  110.0   265.0  1.0  0.0      130.0     0.0   2.5      2.0  1.0  2.0    0
           71.0  0.0  2.0  110.0   265.0  1.0  0.0      130.0     0.0   6.2      2.0  1.0  2.0    0

• DiabetesPedigreeFunction is a function that represents how likely the patient is to get the disease, extrapolated from the history of their ancestors,
• Age is the age of the patient measured in years.


We also have the label, named Outcome. The set consists of 768 cases, with 500 negative and 268 positive cases.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("diabetes/diabetes.csv")
    df.head()

    # After the in-place drop, df (and therefore X) holds only the 8 features
    y = df['Outcome']
    df.drop(['Outcome'], axis=1, inplace=True)
    X = df

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

Listing 2.15 Data loading for the ALE diabetes example

To make the prediction possible, we divide the set into features and labels, and then split it into training and testing sets using the scikit-learn train_test_split function.

Model The model is built using the Support Vector Machine classifier provided by the scikit-learn library. We use the RBF kernel with the regularization parameter C set to 5.0. The accuracy is not very high, but the data is imbalanced and the model itself is not the goal of this example.

    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    clf = SVC(kernel='rbf', C=5.0)

    clf.fit(X_train, y_train)

    predictions = clf.predict(X_test)

    accuracy_score(y_test, predictions)

Listing 2.16 SVM model training
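To put the reported accuracy in context, it helps to know what a trivial classifier would achieve on imbalanced data. The short calculation below uses only the class counts stated above (500 negative and 268 positive cases); the variable names are illustrative. It is a sketch of the baseline idea and does not reproduce any result of the SVM model.

```python
# Class counts as stated in the text.
negative, positive = 500, 268
total = negative + positive

# A trivial "model" that always predicts the majority class is right for
# every majority-class case and wrong for every minority-class case.
baseline_accuracy = max(negative, positive) / total
positive_share = positive / total

print(f"total cases:       {total}")
print(f"positive share:    {positive_share:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting "no diabetes" is already correct for roughly 65% of the cases, so an accuracy close to that value says little about how well the positive cases are recognized.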

Explanation SVM is a binary classifier and a shallow method, which usually also means that it is easier to explain than deep neural networks (Samuel et al. 2021). Even so, the explanation is mostly done on the output and the influence of the features rather than on the meaning of the support vectors. The ALE method can easily be combined with the SVM model to obtain such feature influence details. In the code below, we use the number of pregnancies as the indicator of the influence on the prediction. Most patients had never been pregnant or had been pregnant only once. Based on the results shown in Fig. 2.4a, we can conclude that the predicted risk of diabetes increases with each pregnancy.


Fig. 2.4 ALE pregnancy and glucose tolerance features influence

    import matplotlib.pyplot as plt
    # Assumption: ale() comes from the PyALE package, whose signature matches the calls below
    from PyALE import ale

    # ALE of the number of pregnancies (discrete feature), Fig. 2.4a
    ale_eff = ale(X=X[df.columns], model=clf, feature=["Pregnancies"],
                  feature_type="discrete", grid_size=50, include_CI=False)

    # ALE of the glucose test result, Fig. 2.4b
    ale_eff = ale(X=X[df.columns], model=clf, feature=["Glucose"],
                  grid_size=50, include_CI=False)
    plt.savefig('ale_glucose.png', dpi=300, bbox_inches='tight')

    # Two-dimensional ALE of glucose and BMI, Fig. 2.5
    ale_eff = ale(X=X[df.columns], model=clf, feature=["Glucose", "BMI"], grid_size=100)

Listing 2.17 Feature influence ALE plots generation method


Fig. 2.5 Glucose tolerance and body mass index influences on the prediction explained using the ALE method

Another feature, shown in Fig. 2.4b, is based on the analysis of the Glucose test. It can be explained even without the ALE method, as intuition tells us that a higher value of this test means a higher probability of the disease. The ALE method can also be used on more than one feature. In Fig. 2.5 we see a two-dimensional heatmap where a darker color indicates a lower influence and a brighter color indicates a higher influence on the final prediction. In this case, the combination of a higher BMI and a higher Glucose value has the highest impact on the results.
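To make the idea behind the ALE plots more concrete, the following is a minimal sketch of a first-order ALE computation in plain Python. It is a simplified illustration and not the implementation used to produce Figs. 2.4 and 2.5: the function name `ale_1d`, the toy linear model, the toy rows, and the bin edges are all assumptions made for this example, and the centering step is a simplification of the one described by Apley and Zhu (2020).

```python
def ale_1d(predict, rows, feature, edges):
    """Simplified first-order ALE of `feature`.

    predict: function mapping a dict of feature values to a number
    rows:    list of dicts (the data set)
    edges:   increasing bin edges covering the observed feature values
    Returns the centered ALE value at each edge.
    """
    accumulated = [0.0]
    counts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Rows whose feature value falls in (lo, hi]; the first bin also takes lo.
        in_bin = [r for r in rows
                  if lo < r[feature] <= hi or (lo == edges[0] and r[feature] == lo)]
        # Local effect: average change of the prediction when the feature moves
        # from the lower to the upper edge, other features kept as observed.
        diffs = [predict({**r, feature: hi}) - predict({**r, feature: lo})
                 for r in in_bin]
        local = sum(diffs) / len(diffs) if diffs else 0.0
        accumulated.append(accumulated[-1] + local)
        counts.append(len(in_bin))
    # Center the curve so that the count-weighted mean effect is zero
    # (each row is assigned the value at the upper edge of its bin).
    mean = sum(a * c for a, c in zip(accumulated[1:], counts)) / sum(counts)
    return [a - mean for a in accumulated]


# Hypothetical toy model and data, NOT the SVM or the data set from the text.
def toy_model(row):
    return 2.0 * row["Glucose"] + 3.0 * row["BMI"]


toy_rows = [{"Glucose": g, "BMI": b}
            for g, b in [(80, 22), (95, 30), (110, 27), (130, 35), (150, 31), (170, 40)]]
toy_edges = [80, 110, 140, 170]

ale_values = ale_1d(toy_model, toy_rows, "Glucose", toy_edges)
print(ale_values)
```

For the toy linear model, the ALE values grow by 60 from one edge to the next, which is the Glucose coefficient (2.0) times the bin width (30); the BMI term cancels out because only Glucose is moved within each bin.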

References

Apley DW, Zhu J (2020) Visualizing the effects of predictor variables in black box supervised learning models. J R Stat Soc Ser B (Stat Methodol) 82(4):1059–1086
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman and Hall/CRC
Mothilal RK, Sharma A, Tan C (2020) Explaining machine learning classifiers through diverse counterfactual explanations. In: Proceedings of the 2020 conference on fairness, accountability, and transparency, pp 607–617
Samuel SS, Abdullah NNB, Raj A (2021) Interpretation of SVM to build an explainable AI via granular computing. In: Interpretable artificial intelligence: a perspective of granular computing, pp 119–152

3

Natural Language Processing for Medical Data Analysis

In medicine, as in every other discipline, language is one of the main forms of communication. Among other difficulties, processing medical text requires domain knowledge and deals with terms that, in the case of medicine, can be in Latin. This means that most of the publicly available language models are not trained to recognize medical terms. Most NLP models are developed for English, and unified medical terms datasets are already publicly available for the English language (Bodenreider 2004). Other challenges are the jargon and the formatting of medical text. Current research is focused on developing sophisticated models for specific medical specialties, such as cardiology or orthopedics. This approach makes sense because it is easier to focus on a limited number of medical terms. It speeds up the computation and allows a proper model to be provided in a shorter period of time. On the other hand, such models are in most cases dedicated to solving only one specific task. This limits the usage of NLP methods in medical procedures, but makes the models more robust.
The data types are limited compared to those discussed in other chapters. The text can be better or worse formatted, but it is generally obtained from three sources: the patient, the physician, and software. The data from patients usually come from the patient's description of the illness or problem that he or she wants to discuss with the doctor. Physicians write different types of reports, such as radiology image descriptions or patient medical records. The software used by physicians generates different types of reports, limited to what the physicians or patients request. This makes analysis easier, as software usually generates well-formatted text. An important fact to note is that text can often be combined with other data sources, such as images (Seenivasan et al. 2022; Li et al. 2020).
In this chapter, we first describe the different types of data together with several applications of NLP in the medical field. The next section explains three explainable methods: attention explained on the BERT model, LSTM gating signals as an explanation method, and the example-driven explanation of text-based models. The next three sections are dedicated to the implementation of the three explanation methods on typical medical text data sets.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Przystalski and R. M. Thanki, Explainable Machine Learning in Medicine, Synthesis Lectures on Engineering, Science, and Technology, https://doi.org/10.1007/978-3-031-44877-5_3

3.1

Data Types and Applications

From the NLP point of view, we can process text, generate it, or understand it. Text processing is a crucial part of both text generation and understanding. We see three main areas where companies use machine learning models for text processing in the medical field. The first are chatbots of any kind. One such solution is Wysa (https://www.wysa.io/meet-wysa), a chatbot that helps patients with their mental health. More recently, symptom checker applications have become popular. A chatbot that offers a symptom checker as one of its features is Sensely (https://sensely.com/solutions/). A solution in which the chatbot supports health care management and communication within the organization using text or voice is Orbita (https://orbita.ai/). As there are already several chat platforms such as WhatsApp or Viber, a chatbot can simply be plugged into them. BotMD is a solution that can be integrated with a few chat platforms and can be used by nurses or physicians, for example, to monitor patients (https://www.botmd.io/en/home.html).
A second group of NLP solutions is text analysis, such as medical or scientific text understanding. Eleos (https://eleos.health/science/) analyzes sessions with psychotherapists and shows the progress of the treatment. A solution similar to Wysa, but based on text without the chatbot part, is implemented in Gyant (https://gyant.com/how-does-an-ai-symptom-checker-work/). An interesting approach is presented in Abridge (https://www.abridge.com/), where one of the goals is to summarize clinical notes or other sources of text. The analysis can also be combined with structured data like the data shown in the previous chapter. OccamzRazor (https://www.occamzrazor.com/work#our-approach) helps identify medical terms (Pittala et al. 2020) by extracting knowledge, and combines it with graphs for easier recognition.
Radiology is another branch where NLP can be helpful. RadAI Omni (https://www.radai.com/omni) generates the image reports made by radiologists. Synapsica SpindleX (https://synapsica.com/spindlex/) is a tool that takes an X-ray image as input and, based on the image, generates a report. This is an example of mixing images with text using architectures like VisualBERT and similar ones. We expect the solutions dedicated to radiologists to grow in the coming years, as in our opinion text generation based on images is a great benefit for MDs and patients, and there is still room for such solutions on the market.

3.2

Explainable Methods for Text Data


ChatGPT is now a trending technology that many find useful. It is based on a large language model (LLM). It can be useful for general medical applications, but it is not detailed enough to be used in specific medical cases. Many applications do not need such a complex solution, and for them the example-driven approach is an easier and more efficient way. This is the first method of text analysis explained in this chapter. The second and third are neural networks that were introduced before ChatGPT and still present great value in many applications. The LSTM and BERT architectures are much simpler than GPT and can even be built locally. In both cases, the idea is to train a model on a medical dataset and then explain it: the LSTM using the gating signals, and the BERT model using the attention.

3.2.1

Example Driven

Example-driven methods for medical text data analysis involve using labeled examples to train machine learning models to automatically classify, extract, or generate text data. These methods are widely used in natural language processing (NLP) tasks such as sentiment analysis, named entity recognition, text classification, and text summarization. The general process for example-driven medical text data analysis involves the following steps:
• Data Collection: Collect a large dataset of text examples relevant to the task at hand. For example, if the task is sentiment analysis, collect a dataset of reviews, tweets, or social media posts labeled with positive or negative sentiment.
• Data Preprocessing: Preprocess the text data by removing stop words, stemming, or lemmatizing the text, and converting the text to a numerical representation such as bag-of-words or word embeddings.
• Model Training: Train a machine learning model using the labeled examples to learn patterns in the text data. Common machine learning models used in text data analysis include decision trees, random forests, logistic regression, support vector machines, and neural networks.
• Model Evaluation: Evaluate the performance of the model on a separate dataset of labeled examples that were not used for training. Common evaluation metrics for text data analysis include accuracy, precision, recall, and F1-score.
Here are some examples of example-driven methods for medical text data analysis:
• Sentiment Analysis: Sentiment analysis involves automatically classifying the sentiment of medical text as positive, negative, or neutral. Example-driven methods for sentiment analysis involve training a machine learning model on a large dataset of labeled examples of text with corresponding sentiment labels.


• Named Entity Recognition: Named entity recognition involves automatically identifying and classifying entities in text, such as people, organizations, and locations. Example-driven methods for named entity recognition involve training a machine learning model on a large dataset of labeled examples of text with corresponding entity labels.
• Text Classification: Text classification involves automatically classifying the text into predefined categories. Example-driven methods for text classification involve training a machine learning model on a large dataset of labeled examples of text with corresponding category labels.
• Text Summarization: Text summarization involves automatically generating a concise summary of a longer text document. Example-driven methods for text summarization involve training a machine learning model on a large dataset of text examples with the corresponding summary labels.
In general, example-driven methods for medical text data analysis provide a powerful and efficient approach for automatically processing large volumes of medical text data in a wide range of applications.
More formally, example-driven methods for medical text data analysis use a mathematical model to learn patterns in labeled text examples and use this learned knowledge to automatically classify, extract, or generate new text data. The general process can be formulated mathematically as follows:
• Data Collection: Collect a dataset of $N$ labeled text examples $(x_i, y_i)$, where $x_i$ represents the input text data and $y_i$ represents the corresponding output label.
• Data Preprocessing: Preprocess the text data by converting the text to a numerical representation such as bag-of-words or word embeddings.
• Model Training: Train a mathematical model $f(x; \theta)$ to learn the underlying patterns in the labeled text examples using a supervised learning algorithm such as gradient descent. The model takes an input text $x$ and a set of learnable parameters $\theta$, and outputs a predicted label $y_{pred} = f(x; \theta)$.
• Model Evaluation: Evaluate the performance of the model on a separate dataset of labeled examples $(x_j, y_j)$ that were not used for training. The evaluation metrics typically used in text data analysis include accuracy, precision, recall, and F1-score.
The mathematical model $f(x; \theta)$ used in example-driven text data analysis can take various forms depending on the task at hand. Common models used in text data analysis include decision trees, random forests, logistic regression, support vector machines, and neural networks. The model is trained by minimizing a loss function $L(y_{pred}, y)$ that measures the difference between the predicted label $y_{pred}$ and the true label $y$ for each example in the training dataset. Once the model is trained, it can be used to automatically process new, unlabeled text data by applying the learned knowledge to classify, extract, or generate new text data.
The example-driven method is a generic concept, and its implementation varies depending on the specific application and dataset. However, the steps involved in applying it to medical data analysis can be outlined as follows:
• Define the problem and select the relevant dataset.
• Identify specific examples of interest (e.g., cases with a particular disease or condition).
• Preprocess the data, including cleaning, feature extraction, and normalization.
• Split the dataset into training and testing sets.
• Train a machine learning model using the training set.
• Evaluate the model performance on the testing set.
• Analyze the model output to identify important features and patterns.
• Use the model to make predictions on new data.
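The steps above can be sketched with a very small nearest-example classifier, in which a prediction is explained by pointing at the most similar labeled example. This is only one possible instance of the example-driven idea, written with the Python standard library. The four labeled sentences, the function names, and the choice of bag-of-words with cosine similarity are assumptions made for this illustration; there is no separate training step because the labeled examples themselves play the role of the model.

```python
import math
from collections import Counter

# Hypothetical labeled examples, invented for illustration only.
EXAMPLES = [
    ("patient reports chest pain and shortness of breath", "cardiology"),
    ("irregular heartbeat and chest tightness", "cardiology"),
    ("knee pain after a fall and swelling of the joint", "orthopedics"),
    ("fracture of the left wrist after a fall", "orthopedics"),
]


def bag_of_words(text):
    """Preprocessing: lowercase, split on whitespace, count the words."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_with_example(text, examples=EXAMPLES):
    """Return the label of the most similar example and the example itself."""
    query = bag_of_words(text)
    best_text, best_label = max(
        examples, key=lambda ex: cosine_similarity(query, bag_of_words(ex[0])))
    return best_label, best_text


label, evidence = predict_with_example("chest pain during exercise")
print(label, "<-", evidence)
```

The returned example serves as the explanation: the new sentence is assigned to cardiology because it shares the words "chest" and "pain" with a known cardiology case.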

The specific Python libraries and functions used for the above steps depend on the type of data and the machine learning model used for the analysis. For example, in Rdusseeun and Kaufman (1987) a solution based on the k-means clustering method is proposed. In Kim et al. (2016) the example-driven approach is based on criticism and the Maximum Mean Discrepancy (MMD) measure. It is defined as:

\[ \mathrm{MMD}^2 = \frac{1}{m^2}\sum_{i,j=1}^{m} k(z_i, z_j) - \frac{2}{mn}\sum_{i,j=1}^{m,n} k(z_i, x_j) + \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j), \tag{3.1} \]

where $k$ is a kernel function, $z_1, \ldots, z_m$ are the prototypes, and $x_1, \ldots, x_n$ are the data points. MMD compares two probability distributions $P$ and $Q$, here the distribution of the prototypes and the distribution of the data, through the difference between their expectations. In other words, the goal is to find the prototype that is closest to the object we are evaluating. In text analysis we can simply compare two sentences, the prototype sentence and the generated one.
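As a worked illustration of Eq. (3.1), the sketch below computes the squared MMD between a set of prototypes and a data set. The RBF kernel, its `gamma` value, the function names, and the toy two-dimensional points are assumptions chosen for this example; they are not taken from Kim et al. (2016).

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mmd_squared(prototypes, data, kernel=rbf_kernel):
    """Squared MMD between m prototypes z and n data points x, as in Eq. (3.1)."""
    m, n = len(prototypes), len(data)
    zz = sum(kernel(zi, zj) for zi in prototypes for zj in prototypes) / m ** 2
    zx = sum(kernel(zi, xj) for zi in prototypes for xj in data) / (m * n)
    xx = sum(kernel(xi, xj) for xi in data for xj in data) / n ** 2
    return zz - 2 * zx + xx


# Toy data with two clusters (hypothetical values).
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_prototypes = [(0.0, 0.5), (5.0, 5.5)]  # one prototype per cluster
poor_prototypes = [(0.0, 0.5), (0.0, 0.6)]  # both in the same cluster

print(mmd_squared(good_prototypes, data))
print(mmd_squared(poor_prototypes, data))
```

Prototypes that cover both clusters give a smaller MMD² than prototypes that sit in only one of them, which is the property used to select representative examples.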

3.2.2

Long Short-Term Memory Network (LSTM) Explained with Gating Functions

Long Short-Term Memory Network (LSTM) (Arras et al. 2019; Liu et al. 2022) is a type of recurrent neural network (RNN) that is designed to overcome the vanishing gradient problem of traditional RNNs. LSTMs use gating functions to control the flow of information through the network and allow it to selectively remember or forget past inputs. An LSTM cell consists of three main components: an input gate, a forget gate, and an output gate. These gates are implemented as sigmoid activation functions that output values between 0 and 1, which are then multiplied with the cell state and the input or output at each time step. The input gate determines how much of the input should be let into the cell state.


It takes the input $x_t$ and the previous hidden state $h_{t-1}$ as input and outputs a value between 0 and 1. This is achieved by applying a sigmoid activation function to a linear transformation of the concatenation of the input and hidden state, followed by a point-wise multiplication with a candidate input value. The forget gate determines how much of the previous cell state should be retained. It takes the input $x_t$ and the previous hidden state $h_{t-1}$ as input and outputs a value between 0 and 1. This is achieved in the same way, followed by a point-wise multiplication with the previous cell state. The output gate determines how much of the current cell state should be output as the hidden state. It also takes the input $x_t$ and the previous hidden state $h_{t-1}$ as input and outputs a value between 0 and 1, which is then multiplied point-wise with the hyperbolic tangent of the updated cell state. The updated cell state is calculated by first applying a hyperbolic tangent activation function to a linear transformation of the concatenation of the input and hidden state, which gives the new candidate value. Then, the forget gate is applied to the previous cell state and the input gate is applied to the new candidate value. The resulting values are added together to update the cell state. Finally, the hyperbolic tangent of the updated cell state is multiplied by the output gate to produce the new hidden state, which is the output of the LSTM cell. In summary, LSTMs use gating functions to control the flow of information through the network and selectively remember or forget past inputs. This makes them particularly effective for processing sequential data such as natural language text or time series data.

More formally, and counting the candidate cell state, an LSTM cell consists of four main components: an input gate, a forget gate, a candidate cell state, and an output gate. These components are implemented using sigmoid and hyperbolic tangent activation functions, which enable the network to selectively control the flow of information through the cell. Let us assume we have an input sequence $x = (x_1, x_2, \ldots, x_T)$, where each $x_t$ is a vector representing the $t$-th element of the input sequence. For each element of the input sequence, the LSTM network updates its hidden state $h_t$ and cell state $c_t$ according to the following equations:

Input gate: \[ i_t = \operatorname{sigmoid}(W_i [x_t, h_{t-1}] + b_i) \tag{3.2} \]

Forget gate: \[ f_t = \operatorname{sigmoid}(W_f [x_t, h_{t-1}] + b_f) \tag{3.3} \]

Candidate cell state: \[ \tilde{c}_t = \tanh(W_c [x_t, h_{t-1}] + b_c) \tag{3.4} \]

New cell state: \[ c_t = f_t \ast c_{t-1} + i_t \ast \tilde{c}_t \tag{3.5} \]

Output gate: \[ o_t = \operatorname{sigmoid}(W_o [x_t, h_{t-1}] + b_o) \tag{3.6} \]

Hidden state: \[ h_t = o_t \ast \tanh(c_t) \tag{3.7} \]


Here, $i_t$ is the input gate, $f_t$ is the forget gate, $\tilde{c}_t$ is the candidate cell state, $c_t$ is the new cell state, $o_t$ is the output gate, and $h_t$ is the hidden state. $W_i$, $W_f$, $W_c$, and $W_o$ are weight matrices, $b_i$, $b_f$, $b_c$, and $b_o$ are bias terms, $[x_t, h_{t-1}]$ denotes the concatenation of vectors, and $\ast$ denotes element-wise multiplication. The input gate $i_t$ controls how much of the candidate cell state $\tilde{c}_t$ is added to the cell state $c_t$. The forget gate $f_t$ controls how much of the previous cell state $c_{t-1}$ is retained in the new cell state $c_t$. The output gate $o_t$ controls how much of the cell state $c_t$ is output as the hidden state $h_t$. The candidate cell state $\tilde{c}_t$ is the new information that will be added to the cell state $c_t$. By using these gating functions, the LSTM network can selectively control the flow of information through the network and retain important information over long periods of time, making it effective for processing sequential data such as natural language text or time series data.

    import numpy as np
    import pandas as pd
    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, LSTM

    # Load the dataset
    data = pd.read_csv("medical_data.csv")

    # Preprocess the data
    X = data.iloc[:, :-1].values  # Input features
    y = data.iloc[:, -1].values   # Target variable

    # Split the data into training and testing sets
    split = int(0.8 * len(data))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # Reshape the input data for LSTM: (samples, time steps, features)
    X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
    X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))

    # Define the LSTM model
    model = Sequential()
    model.add(LSTM(50, input_shape=(1, X.shape[1])))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

    # Train the model
    model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

    # Make predictions on new data (this sample assumes 4 input features)
    new_data = np.array([[0.1, 0.2, 0.3, 0.4]])
    new_data = np.reshape(new_data, (new_data.shape[0], 1, new_data.shape[1]))
    prediction = model.predict(new_data)

    print("Prediction: ", prediction)

Listing 3.1 LSTM simplified example

In Listing 3.1, we first load the medical data and split it into training and testing sets. Then, we reshape the input data to be compatible with the LSTM model. We define an LSTM model with one LSTM layer and a sigmoid activation function in the output layer, and compile it using binary cross-entropy loss and the Adam optimizer. We train the model on the training set and evaluate its performance on the testing set, which is passed as validation data. Finally, we make predictions on new data using the trained model. Note that the specific hyperparameters and preprocessing steps used in this example code may not be optimal for every medical dataset. It is important to carefully tune the model and preprocessing steps to achieve the best possible performance on the specific dataset of interest.
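Because the explanation of an LSTM relies on the gate values, it can help to see Eqs. (3.2)-(3.7) computed for a single time step. The sketch below does this in plain Python and returns the gate activations next to the new states. The function names, the parameter dictionary layout, and the all-zero parameters are assumptions made for this illustration; a trained Keras model such as the one in Listing 3.1 stores its weights in a different layout.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, v, b):
    """Compute W v + b for a weight matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. (3.2)-(3.7)."""
    concat = x_t + h_prev  # list concatenation, i.e. [x_t, h_{t-1}]
    i_t = [sigmoid(v) for v in affine(p["W_i"], concat, p["b_i"])]          # (3.2)
    f_t = [sigmoid(v) for v in affine(p["W_f"], concat, p["b_f"])]          # (3.3)
    c_tilde = [math.tanh(v) for v in affine(p["W_c"], concat, p["b_c"])]    # (3.4)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]  # (3.5)
    o_t = [sigmoid(v) for v in affine(p["W_o"], concat, p["b_o"])]          # (3.6)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                      # (3.7)
    gates = {"input": i_t, "forget": f_t, "output": o_t}
    return h_t, c_t, gates


# Hypothetical parameters: 2 inputs, 2 hidden units, all weights and biases zero.
zeros = [[0.0] * 4 for _ in range(2)]
params = {"W_i": zeros, "W_f": zeros, "W_c": zeros, "W_o": zeros,
          "b_i": [0.0, 0.0], "b_f": [0.0, 0.0], "b_c": [0.0, 0.0], "b_o": [0.0, 0.0]}

h_t, c_t, gates = lstm_step([0.5, -1.0], [0.0, 0.0], [1.0, -2.0], params)
print(gates)  # every gate equals 0.5 because all pre-activations are zero
print(c_t)    # half of the previous cell state is retained
```

With zero parameters every gate is exactly 0.5, so the forget gate keeps half of the previous cell state and the candidate adds nothing. In a trained network, inspecting how far the forget and input gates move away from such neutral values for each token is what the gating-based explanation builds on.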

3.2.3

Bidirectional Encoder Representation from Transformers Explained Using the Attention Mechanism

The Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al. 2018; Clark et al. 2019) is a pre-trained neural network that is used for natural language processing (NLP) tasks, such as text classification, question answering, and language translation. BERT is based on the Transformer architecture, which uses attention mechanisms to selectively focus on important parts of the input sequence. The BERT model is bidirectional, meaning that for every token it takes into account the context on both sides (to the left and to the right) at the same time. This allows the model to capture context from both preceding and following tokens, which helps it to understand the meaning of each word in the context of the whole sentence. The BERT model consists of an encoder stack that is made up of multiple layers of self-attention and feed-forward networks. Each layer of the encoder stack has its own set of parameters and can be fine-tuned for specific NLP tasks. The self-attention mechanism in BERT allows the model to selectively attend to different parts of the input sequence. Each word in the input sequence is represented as a vector, and these vectors are used to compute attention scores between each pair of words in the sequence. The attention scores indicate how much each word should contribute to the representation of every other word in the sequence. The attention mechanism works by computing a weighted sum of the vectors of all words in the input sequence, where the weights are determined by the attention scores. The resulting vector is then passed through a feed-forward network to generate a new representation of the input sequence. This process is repeated for each layer of the encoder stack, allowing the model to capture increasingly complex patterns in the input sequence.
The BERT model is trained on a large corpus of text using a masked language modeling (MLM) task and a next sentence prediction (NSP) task. In the MLM task, a certain percentage of the input tokens are randomly masked and the model is trained to predict the masked tokens based on the context provided by the surrounding tokens. In the NSP task, the model is trained to predict whether a given sentence follows another sentence in the input sequence. After pre-training, the BERT model can be fine-tuned for specific NLP tasks by adding a task-specific output layer on top of the encoder stack and training the model on a labeled dataset for that task. This allows the model to adapt its representations to the specific task at hand, achieving state-of-the-art performance on a wide range of NLP tasks.
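The masking step of the MLM task can be illustrated with a few lines of Python. This is a deliberately simplified sketch: the function name `mask_tokens` is hypothetical, the default masking rate of 0.15 is the value reported for BERT by Devlin et al. (the text above only says "a certain percentage"), and the additional replacement rules used in the original BERT pre-training are left out.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a share of the tokens with a mask token.

    Returns the masked sequence and a dict {position: original token}
    holding the prediction targets for the masked positions.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


tokens = "the patient reports chest pain and shortness of breath after mild exercise".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

During pre-training the model receives the masked sequence and is trained to recover the tokens stored in `targets` from the surrounding context.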


More formally, let us assume we have an input sequence $X$ consisting of $n$ words, $X = (x_1, x_2, \ldots, x_n)$, where each word $x_i$ is represented as a $d$-dimensional vector. The BERT model consists of an encoder stack that is made up of $L$ layers of self-attention and feed-forward networks. In the first layer of the encoder stack, the input sequence $X$ is fed into a self-attention mechanism, which computes attention scores between each pair of words in the sequence. The attention scores are computed as follows. First, compute the query matrix $Q$, the key matrix $K$, and the value matrix $V$ from the input sequence:

\[ Q = X W_q, \quad K = X W_k, \quad V = X W_v \tag{3.8} \]

Here, $W_q$, $W_k$, and $W_v$ are learned weight matrices that project the input sequence $X$ into the query, key, and value spaces, respectively. Next, compute the attention scores $A$ between each pair of words in the sequence using the dot product between the query and key matrices:

\[ A = \operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) \tag{3.9} \]

Here, $\operatorname{softmax}$ is the softmax function that normalizes the attention scores for each word, and $\sqrt{d}$ is a scaling factor that helps prevent the attention scores from becoming too large. Finally, compute the weighted sum of the value vectors using the attention scores as weights, which is the matrix product of $A$ and $V$:

\[ Z = A V \tag{3.10} \]

Here, $Z$ is a matrix of weighted value vectors that represent the contextualized representation of each word in the input sequence. The output of the first layer of the encoder stack is the matrix $Z$, which is passed through a feed-forward network to generate a new representation of the input sequence. This process is repeated for each layer of the encoder stack, allowing the model to capture increasingly complex patterns in the input sequence. During pre-training, a certain percentage of the input tokens are randomly masked, which allows the model to learn to predict the masked tokens based on the context provided by the surrounding tokens.
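Equations (3.8)-(3.10) can be followed step by step in a small, dependency-free sketch of a single self-attention head. The helper names, the identity projection matrices, and the three toy word vectors are assumptions made for this illustration; BERT itself uses learned projections, several heads per layer, and much larger dimensions.

```python
import math


def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention, Eqs. (3.8)-(3.10)."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)  # (3.8)
    d = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    A = [softmax([s / math.sqrt(d) for s in row]) for row in scores]  # (3.9)
    Z = matmul(A, V)  # (3.10)
    return A, Z


# Three toy "word" vectors with d = 2 and identity projections (hypothetical).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

A, Z = self_attention(X, identity, identity, identity)
for row in A:
    print([round(weight, 3) for weight in row])
```

Each row of `A` sums to one and shows how strongly a word attends to every word in the sequence. These weights are what attention-based explanations visualize: in the toy example the first word attends more to itself and to the third word than to the second, because it shares a component with both.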

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
# AdamW is taken from PyTorch; recent transformers releases no longer provide it
from torch.optim import AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import (BertTokenizer, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

# Load data
df = pd.read_csv("medical_data.csv")
texts = df["text"].values.tolist()
labels = df["label"].values.tolist()

# Split data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# Load pre-trained BERT model and tokenizer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# Tokenize input texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

# Convert tokenized input to PyTorch tensors
train_dataset = TensorDataset(torch.tensor(train_encodings["input_ids"]),
                              torch.tensor(train_encodings["attention_mask"]),
                              torch.tensor(train_labels))
val_dataset = TensorDataset(torch.tensor(val_encodings["input_ids"]),
                            torch.tensor(val_encodings["attention_mask"]),
                            torch.tensor(val_labels))

# Set batch size and create data loaders
batch_size = 32
train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset),
                          batch_size=batch_size)
val_loader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset),
                        batch_size=batch_size)

# Set optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
epochs = 5
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=len(train_loader) * epochs)

# Train model
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

for epoch in range(epochs):
    train_loss = 0.0
    val_loss = 0.0

    # Training pass: update the weights on every batch
    model.train()
    for batch in train_loader:
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        labels = batch[2].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()

    # Validation pass: no gradients, only the loss is accumulated
    model.eval()
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch[0].to(device)
            attention_mask = batch[1].to(device)
            labels = batch[2].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()

    avg_train_loss = train_loss / len(train_loader)
    avg_val_loss = val_loss / len(val_loader)

    print(f"Epoch {epoch + 1}: Train Loss = {avg_train_loss}, Val Loss = {avg_val_loss}")

# Save model
model.save_pretrained("bert_model")
tokenizer.save_pretrained("bert_tokenizer")

Listing 3.2 A BERT example implementation

Listing 3.2 shows example Python code for fine-tuning BERT for binary classification on a medical dataset using the Hugging Face Transformers library. It is divided into:
• the tokenizer, which converts the text into tokens,
• the BERT model that is loaded (bert-base-uncased),
• the training phase, which reads the data from the CSV file and fine-tunes the loaded model.
As part of the training, the attention masks are passed to the model in the inner training loop. The final model is then saved as bert_model. In the next section, we show how to use it to explain the medical text.

3.3

Text Generation Explained

For this exercise, we use OpenAI's ChatGPT to generate sample sentences. We focus on two of the methods mentioned previously. The example-driven method is not limited to text data, and in this case it allows even a non-technical person to compare a generated sample with a real one. Although the LSTM architecture is well known, the number of its applications has been decreasing since the attention mechanism was introduced, in favor of architectures such as BERT or GPT.

3.3.1

Example Driven

The example-driven approach is about comparing a sentence to prototypes. A typical radiology image description consists of several sentences or sections. To simplify the example, we use just one randomly chosen sentence. The prototypes can be taken from a real-world radiology image description or, as in our example, can be generated using the GPT model. The ten randomly generated sentences might look as follows:
• The chest X-ray shows a consolidation in the right middle lobe, suggestive of pneumonia.
• The abdominal ultrasound reveals a hypoechoic mass in the liver, consistent with a possible tumor.
• The MRI of the brain demonstrates multiple hyperintense lesions in the periventricular white matter, indicative of demyelinating disease.
• The CT scan of the spine exhibits a herniated disc at the L5-S1 level, resulting in compression of the nerve roots.
• The mammogram reveals a suspicious mass with irregular margins and microcalcifications, requiring further evaluation.
• The bone scan shows increased radiotracer uptake in the left femur, suggestive of a possible stress fracture.
• The renal ultrasound demonstrates bilateral hydronephrosis, indicating obstruction of the urinary tract.
• The PET-CT scan reveals increased metabolic activity in the right lung nodule, raising suspicion for malignancy.
• The musculoskeletal X-ray shows joint space narrowing, subchondral sclerosis, and osteophyte formation, consistent with osteoarthritis.
• The Doppler ultrasound of the lower extremities reveals deep vein thrombosis in the left popliteal vein, necessitating anticoagulation therapy.

In most sentences typical medical terms are used. For this example we assume these are real radiology image descriptions. Let us assume that the sentence "We see on the lung CT image a cancer" was generated. We can see that the GPT sentences are impersonal, but in both cases the sentences are related to medical images. The medical terms in the GPT sentences are less likely to be understood by someone who is not familiar with the topic, whereas the proposed sentence consists of words that are easier to interpret.
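The comparison of a sentence to prototypes can also be done programmatically. The sketch below is only one possible way to do it: the choice of Jaccard similarity on word sets, the small stopword list, and the use of only three of the prototypes are assumptions made for this illustration, not part of the method described above.

```python
import re

# Illustrative stopword list (an assumption, not a standard resource)
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "for", "we", "with"}

def tokens(sentence):
    """Lower-cased set of words in the sentence, without stopwords."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower())) - STOPWORDS

def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

prototypes = [
    "The chest X-ray shows a consolidation in the right middle lobe, "
    "suggestive of pneumonia.",
    "The CT scan of the spine exhibits a herniated disc at the L5-S1 level, "
    "resulting in compression of the nerve roots.",
    "The PET-CT scan reveals increased metabolic activity in the right lung "
    "nodule, raising suspicion for malignancy.",
]
query = "We see on the lung CT image a cancer"

scores = [jaccard(query, p) for p in prototypes]
nearest = max(range(len(prototypes)), key=scores.__getitem__)
print(prototypes[nearest])
```

With these three prototypes, the closest one is the PET-CT sentence, because it shares both "lung" and "CT" with the query. A simple word-overlap measure like this ignores synonyms such as "cancer" and "malignancy", which is one reason why embedding-based comparisons are often preferred.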

3.3.2

BERT

BERT is one of the best open-source architectures (https://research.aimultiple.com/gpt/) for text generation. In this example we use the UCSD-VA-health/RadBERT-RoBERTa-4m model (Yan et al. 2022). A few medical data sets were used to retrain the BERT model, including a set of 3996 radiology reports (Harzig et al. 2019) and a chest radiology set (Demner-Fushman et al. 2015). The trained model can be downloaded from the Hugging Face repository. Using the transformers library and the AutoModel class, we can easily download and use pretrained models (see Listing 3.3).

from transformers import AutoTokenizer, AutoModel, AutoConfig

# Download the configuration, tokenizer, and weights of the RadBERT model
config = AutoConfig.from_pretrained('UCSD-VA-health/RadBERT-RoBERTa-4m')
tokenizer = AutoTokenizer.from_pretrained('UCSD-VA-health/RadBERT-RoBERTa-4m')
model = AutoModel.from_pretrained('UCSD-VA-health/RadBERT-RoBERTa-4m', config=config)

# Encode the example sentence as a tensor of token ids
inputs = tokenizer.encode("We see on the lung CT image a cancer", return_tensors="pt")

# Run the model and also return the attention weights of every layer
outputs = model(inputs, output_attentions=True)

Listing 3.3 Model downloaded from Hugging Face

Fig. 3.1 Tokens relationship on the model level

The tokenizer can next be used to analyze the input sentence. In our example it is only one sentence, but a longer text can be used here as well. There are at least a few ways to explain the sentence. We use the bertviz library (Vig 2019). One of the ways is to view the relations between tokens (words). In Listing 3.4 we use the model_view method. The results are shown in Fig. 3.1. It is a grid divided by layers (rows) and heads (columns), and each cell shows the relationship between words for the given layer and head. The black square in the bottom right part gives an example of the relationships for the 8th layer and 7th head. We can easily recognize that there is a strong relation between the token pairs see-image and lung-CT.

from bertviz import model_view

attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
model_view(attention, tokens)

from bertviz import head_view

head_view(attention, tokens)

Listing 3.4 Text explanation using the BERT model and the attentions
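Besides the visual inspection, the attention weights can be examined numerically. In the Hugging Face Transformers library, the attentions are returned as one tensor per layer with the shape (batch, heads, tokens, tokens). The sketch below uses a small synthetic NumPy array instead of real model output; the array layout (layers, heads, tokens, tokens), the helper name `strongest_pairs`, and the numbers are assumptions made only for this illustration.

```python
import numpy as np

def strongest_pairs(attention, tokens, layer, head, k=3):
    """Return the k (query token, key token, weight) triples with the largest weights."""
    weights = attention[layer, head]                     # shape: (tokens, tokens)
    flat = np.argsort(weights, axis=None)[::-1][:k]      # flat indices of the k largest entries
    rows, cols = np.unravel_index(flat, weights.shape)
    return [(tokens[r], tokens[c], float(weights[r, c])) for r, c in zip(rows, cols)]

# Synthetic stand-in for real attentions: 1 layer, 1 head, 3 tokens.
# Every row sums to 1, as it would after the softmax.
tokens = ["lung", "CT", "image"]
attention = np.array([[[[0.10, 0.70, 0.20],
                        [0.60, 0.10, 0.30],
                        [0.25, 0.50, 0.25]]]])

pairs = strongest_pairs(attention, tokens, layer=0, head=0, k=2)
print(pairs)
```

In this toy matrix the strongest relation is from "lung" to "CT", followed by "CT" to "lung". Applied to real attentions, such a ranking gives a textual counterpart to the lines drawn in the bertviz views.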

The token relationship seen on the level of the whole model is still not explainable enough. We can look at the level of individual heads using the head_view method. This method generates an image as shown in Fig. 3.2. We can see a visible relationship between CT and two other tokens: image and lung. Similarly to the previous image, it gives a better understanding of the relations captured by the model trained on medical tokens.

Fig. 3.2 Tokens relationship by attention heads

A different approach to explainability can be achieved using the LIT tool (https://github.com/PAIR-code/lit). It is designed to explain NLP models (Tenney et al. 2020). It offers a few important explainability features, such as counterfactuals and local explanations, and it uses the LIME method as one of the explanation methods for salience maps. To run the server you can use the following command:


python -m lit_nlp.examples.lm_demo --models=bert-base-uncased \
    --port=5432

In the web interface you can add new models or datasets. In Fig. 3.3 an example of an embedding projector is shown.

Fig. 3.3 Embedding projector with Uniform Manifold Approximation and Projection method used

NLP models usually represent text as vectors with many variables. Such high-dimensional data cannot, in many cases, be drawn directly. Two dimensionality reduction methods are available in LIT. In Fig. 3.3 the Uniform Manifold Approximation and Projection (UMAP) is used for dimensionality reduction. The marked sentence is the one generated by the GPT model that we have used here as an example; it is marked to show where it is placed in the feature space. It can next be compared to the other sentences (green dots) in this space to see how far or close they are. This gives a better understanding of the relations between sentences and helps to explain the prediction.
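The idea of projecting sentence vectors into two dimensions and comparing distances can be sketched without LIT. The example below deliberately uses PCA (computed with an SVD) as a simple stand-in for UMAP, and random vectors instead of real sentence embeddings; the vector size of 768 and the number of sentences are assumptions chosen for illustration.

```python
import numpy as np

def project_2d(embeddings):
    """Project row vectors onto their first two principal components (PCA via SVD)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)                 # arbitrary seed
embeddings = rng.normal(size=(11, 768))        # synthetic: 11 sentences, 768-dimensional vectors
points = project_2d(embeddings)

# Distance of every sentence to the first one (the marked sentence) in the 2D space
distances = np.linalg.norm(points - points[0], axis=1)
closest = int(distances.argsort()[1])          # index 0 is the marked sentence itself
print(points.shape, closest)
```

PCA is linear and preserves global variance, while UMAP is non-linear and focuses on local neighborhoods, so the two projections can place the same sentences quite differently. The sketch only shows how distances in the projected space are read.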


References

Arras L, Arjona-Medina J, Widrich M, Montavon G, Gillhofer M, Müller KR, Hochreiter S, Samek W (2019) Explaining and interpreting LSTMs. In: Explainable AI: interpreting, explaining and visualizing deep learning. Springer, pp 211–238

Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 32:D267–D270. https://doi.org/10.1093/nar/gkh061

Clark K, Khandelwal U, Levy O, Manning CD (2019) What does BERT look at? An analysis of BERT's attention. arXiv:1906.04341

Demner-Fushman D, Kohli M, Rosenman M, Shooshan S, Rodriguez L, Antani S, Thoma G, Mcdonald C (2015) Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inf Assoc JAMIA 23. https://doi.org/10.1093/jamia/ocv080

Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

Harzig P, Chen YY, Chen F, Lienhart R (2019) Addressing data bias problems for chest X-ray image report generation. arXiv:1908.02123

Kim B, Khanna R, Koyejo OO (2016) Examples are not enough, learn to criticize! criticism for interpretability. In: Advances in neural information processing systems, vol 29

Li Y, Wang H, Luo Y (2020) A comparison of pre-trained vision-and-language models for multimodal representation learning across medical images and reports

Liu S, Wang X, Xiang Y, Xu H, Wang H, Tang B (2022) Multi-channel fusion LSTM for medical event prediction using EHRs. J Biomed Inform 127:104011

Pittala S, Koehler W, Deans J, Salinas D, Bringmann M, Volz KS, Kapicioglu B (2020) Relation-weighted link prediction for disease gene identification. arXiv:2011.05138

Rdusseeun L, Kaufman P (1987) Clustering by means of medoids. In: Proceedings of the statistical data analysis based on the L1 norm conference, Neuchatel, Switzerland, vol 31

Seenivasan L, Islam M, Krishna A (2022) Surgical-VQA: visual question answering in surgical scenes using transformer. https://doi.org/10.48550/arXiv.2206.11053

Tenney I, Wexler J, Bastings J, Bolukbasi T, Coenen A, Gehrmann S, Jiang E, Pushkarna M, Radebaugh C, Reif E, Yuan A (2020) The language interpretability tool: extensible, interactive visualizations and analysis for NLP models. https://www.aclweb.org/anthology/2020.emnlp-demos.15

Vig J (2019) A multiscale visualization of attention in the transformer model. In: Proceedings of the 57th annual meeting of the association for computational linguistics: system demonstrations, association for computational linguistics, Florence, Italy, pp 37–42. https://doi.org/10.18653/v1/P19-3007. https://www.aclweb.org/anthology/P19-3007

Yan A, McAuley J, Lu X, Du J, Chang E, Gentili A, Hsu CN (2022) RadBERT: adapting transformer-based language models to radiology. Radiol Artif Intell 4. https://doi.org/10.1148/ryai.210258

4

Computer Vision for Medical Data Analysis

Images are widely used in medical diagnostics. Most people who have visited any kind of physician have encountered one of the devices that are used in medical diagnostics on a daily basis. The most typical are X-ray and ultrasound scans. Some general practitioners have a dermatoscope to take pictures of the skin. Similarly, thermography can be used to take heatmap-like images of the body. If the diagnosis of a disease is more complex, magnetic resonance imaging (MRI), computed tomography (CT), or even positron emission tomography (PET) is used before surgery. A specific type of X-ray is mammography (MMG), which is used to investigate breast diseases. Somewhat more invasive devices are endoscopes, which can be used to diagnose an illness while introduced into the body, e.g., through the mouth. In some cases, when surgery is done, cells must be investigated to cure the disease. This procedure is performed on a histopathological image, where the physician checks the image at high magnification.

In the next sections, we discuss different types of images that are used in the medical diagnostic process together with machine learning, and how explainability can help to understand the models used. To understand the way in which the images are processed, we first show the different ways in which the images are stored. A detailed overview of the most typical explainable methods used for images is given in the next part. As a first example, images of skin cancer are shown. This type of image is one of the most basic ones used in medicine and is typically stored as a JPEG file. The second example is based on brain magnetic resonance images. This type of image is much more complex to investigate. The last example presented is a coronary angiography video, which we consider as a set of images.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Przystalski and R. M. Thanki, Explainable Machine Learning in Medicine, Synthesis Lectures on Engineering, Science, and Technology, https://doi.org/10.1007/978-3-031-44877-5_4


4.1


Data Types and Applications

There are different image types, depending on the procedure that is used; the most popular are DICOM, PNG, and MP4 files. These images are used in different diagnostic processes where several types of hardware are used. MRI, PET, or CT scans return a set of DICOM files as output. Endoscopy, coronary angiography, or ultrasound can return either the images the physician took or the entire video, which is usually saved as a DICOM file that includes the set of images. It can also be exported to a more user-friendly format, such as MP4. The DICOM format is well known in medical diagnostics and stores a high-quality image or a set of images of a specific body scan. Unlike other image formats, such as JPG or PNG, a DICOM file requires a DICOM viewer to be opened. An exceptional case are augmented reality images that are generated based on the input images. One example of such a use case is surgical robotics, such as Activ Surgical (https://www.activsurgical.com/).

DICOM stands for Digital Imaging and Communications in Medicine and was introduced in the 1980s. The idea was to standardize the way medical images are stored. Today, most images are stored in this format. It has different modalities depending on the type of scan that is performed. Currently, there are about 40 different modalities. From a machine learning point of view, the modality is important when the DICOM image is exported to a different format, such as PNG. Different modalities might need different export functions related to contrast and brightness.

The Covid-19 pandemic changed the mindset of many. One effect of these changes is the approach to automation, including in medicine. Together with the current overall hype around artificial intelligence, we can recently see many startups and products that use machine learning to help and support physicians, or to automate processes that were previously done manually or in a semi-automated way.
This has changed in recent years, and today there are many machine learning applications in medical imaging. One of the simplest uses of machine learning is a typical OCR used for the recognition of medical data in images. This kind of application is developed by Mendel (https://www.mendel.ai/retina), where a typical OCR is extended to focus on medical terms. DermSpot from Digital Diagnostics (https://www.digitaldiagnostics.com/products/skin-disease/dermspot/) uses camera images for skin diseases. It is a mobile application that helps to find skin illnesses including melanoma, BCC, or SCC. Another use case of camera images is shown by Orcam (https://www.orcam.com/), a solution for blind or partially sighted people. It recognizes the environment surrounding the person using Orcam and translates it into speech.

Thermographs are images recorded using infrared light waves. Thermalytix by Niramai (https://www.niramai.com/) and AI Talos (https://aitalos.com/) use these images for the detection of breast cancer.

Annalise CXR (https://annalise.ai/) uses X-ray images to recognize lung diseases. A visit to the dentist often ends with a dental X-ray. Overjet (https://www.overjet.ai/) uses these images to detect and outline decay and to quantify bone loss. Bone fractures can also be


recognized with Gleamer (https://www.gleamer.ai/) and AZMed (https://azmed.co/). Besides the bones, the lungs are among the body parts most often imaged with X-rays. Qure.ai qXR (https://qure.ai/product/qxr/) uses lung X-ray images to find anomalies in the lungs, bones, or other structures visible in such images.

Mammography is a type of X-ray used for breast investigation. There are several tools available on the market for the analysis of breast MMG images, such as Curemetrix (https://curemetrix.com/), Whiterabbit.ai (https://www.whiterabbit.ai/), and PrognicaMMG (http://www.prognica.com/products/).

Aidoc (https://www.aidoc.com/) and Lunit (https://www.lunit.io/) use CT images for various analyses, including lung/chest, brain, or bone/backbone images. The brain analysis includes CT angiography images.

Positron emission tomography (PET) scans can also be used as input for machine learning. One such example is PaIRe by Incepto Medical (https://incepto-medical.com/en/solutions/paire), where the software uses different machine learning methods to recognize vitals and their typical measurements.

Prostate cancer can be diagnosed using the Bot Image solution (https://www.botimageai.com/). It is a cloud-based application that receives an MRI scan as input and returns a full report with colored areas of interest. Based on MRI scans, Rapid AI (https://www.rapidai.com/rapid-mri) performs a diffusion and perfusion image analysis and checks for salvageable parts of the brain. Machine learning is not only used to diagnose disease on magnetic resonance scans. An example of a product where the focus is on the quality of the images is Airs Medical (https://en.airsmed.com/). The goal is to reduce the scan time by reducing the noise in the images using deep learning models.

Some companies, such as Arterys (https://www.arterys.com/), provide products that use different types of images for breast, lung, or heart disease. These are different products that use MMG, CT, and MRI, respectively.

Pathomorphologists use cell images made under microscopes with a proper zoom. Such images are usually just JPG or PNG files. Companies such as PathAI (https://www.pathai.com/), Paige.ai (https://paige.ai/) and Inveox (https://inveox.com/) provide solutions where machine learning models are used to simplify and automate the process of finding specific types of cells, such as cancer cells. An interesting approach is shown by Sightdx (https://sightdx.com/), where blood cells are accurately classified and enumerated.

One of the most common types of examination worldwide is performed using ultrasound. One reason is that it is a cheaper device compared to MR and CT. At the same time, it has a wide range of cases in which it can be used. Ultrasonography (USG) can also be used to record movies, which we treat as a set of images and analyze in this chapter. It can also generate sounds, which are covered in the next chapter. The heart is one of the organs that can be investigated using USG. EchoGo by Ultromics (https://www.ultromics.com/) helps physicians by providing several metrics that are automatically calculated, such as the ejection fraction. The anesthesiologist can use USG to find the right nerve.


To simplify the process, Nerveblox (https://nerveblox.com/) uses machine learning to find the appropriate nerves faster using USG.

Other types of video come from different types of endoscopies. Gastroscopy supported by an ML model is used by AIM (https://www.ai-ms.com/). Endogenie implements a mobile solution for dentists (https://www.endogenie.ai/how-it-works). Endofotonics (http://www.endofotonics.com/) provides an endoscopy solution with a built-in cancer detection model.

4.2

Explainable Methods for Images

These days most image-related models are based on deep neural networks. One of the reasons is the complexity of the images and the successful application of convolutional neural networks. In this chapter we focus on methods that can be used to explain models based on such networks. The first method is GradCAM, which uses one specific convolutional layer of the network to explain the whole network's prediction. The second method, LIME, can be used on any model, as it treats the model as a black box. This is a totally different approach compared to GradCAM. The third method presented in this chapter is Deep SHAP. It is designed for deep neural networks and is based on game theory.

4.2.1

Gradient Class Activation Map (GradCAM)

GradCAM stands for gradient-weighted class activation mapping (Selvaraju et al. 2017). The goal here is to use heatmaps to show the regions of an image that are the most relevant to the model for a specific classification. In other words, if we classify dogs in an image, the hot regions should be the ears, chest, tail, and other parts that are important for recognizing a dog. It can be compared to how humans recognize objects; for each object we have different patterns that are characteristic of it. The idea of GradCAM as an explainability method is to show the regions that have the greatest impact on the classification as a heatmap. This explains why the object in an image was classified as a specific label/class. The explanation is shown on the input image, but the information used to build the heatmap is the gradient of the class score with respect to the feature maps of one of the last convolutional layers of the neural network. The neuron importance weights can be calculated as follows:

$$\alpha_k^c = \overbrace{\frac{1}{Z} \sum_i \sum_j}^{\text{global average pooling}} \; \underbrace{\frac{\partial y^c}{\partial A_{ij}^k}}_{\text{gradients via backprop}}, \tag{4.1}$$


where $A^k$ is the $k$-th feature activation map, $y^c$ is the prediction for class $c$, and $Z$ is the number of pixels in the feature map. In Fig. ?? a the last 10 layers of the Inception network architecture are shown. The red-marked layer is the one that is used to generate the heatmap. The heatmap can be obtained as follows:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right). \tag{4.2}$$

A ReLU activation function is applied to the linear combination of the maps, so that only the regions with a positive influence on $y^c$ are kept. The more important pixels have a higher value in the heatmap.
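Equations (4.1) and (4.2) translate almost directly into array operations. The following NumPy sketch assumes that the feature maps and the gradients of the class score have already been extracted from the network for a single image; here both are replaced by random arrays, and the shape (5, 7, 16), meaning height, width, and number of maps, is an arbitrary choice for illustration.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute Grad-CAM weights and heatmap for one image and one class.

    activations[i, j, k] is A^k_ij and gradients[i, j, k] is d y^c / d A^k_ij.
    """
    # Eq. (4.1): global average pooling of the gradients over the spatial positions
    alpha = gradients.mean(axis=(0, 1))
    # Eq. (4.2): weighted sum of the feature maps followed by ReLU
    heatmap = np.maximum((activations * alpha).sum(axis=-1), 0.0)
    return alpha, heatmap

rng = np.random.default_rng(0)              # arbitrary seed
A = rng.random((5, 7, 16))                  # synthetic feature maps
dY_dA = rng.normal(size=(5, 7, 16))         # synthetic gradients of the class score

alpha, heatmap = grad_cam(A, dY_dA)
print(alpha.shape, heatmap.shape)
```

The resulting heatmap has the spatial size of the feature maps, not of the input image, so in practice it is upsampled to the image size before being overlaid on it.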

4.2.2

Local Interpretable Model Agnostic Explanations (LIME)

Local Interpretable Model-Agnostic Explanations (LIME) (Ribeiro et al. 2016) is a method whose main idea is to provide a qualitative understanding of the relationship between the patches in an image and the model's prediction. It creates an alternative model that is easier to explain and that can be of a totally different type than the original one. Such models are only an approximation and work locally; they are also called local surrogate models. LIME uses the original model as a black box and tests it by preparing different variations of the input data and measuring their influence on the prediction. The explanation produced by the method can be obtained as follows:

$$\xi(x) = \arg\min_{g \in G} L(f, g, \pi_x) + \Omega(g), \tag{4.3}$$

where $f$ is the original model, $g$ is the explanation model taken from the family $G$ of interpretable models, $L$ is the fidelity function, $\Omega$ is the complexity measure, and $\pi_x$ defines the locality around $x$. The method can be divided into a few steps:

• get the local scope of interest,
• prepare a data set that is a perturbation of the original one,
• train the new model based on the new set,
• based on the prediction, set weights for the new data set objects,
• explain the prediction.

The goal is to keep the complexity $\Omega(g)$ low so that the model remains understandable. The method looks for data in the local neighborhood and uses the newly built model to explain the original one.
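The steps above can be sketched for a generic feature vector. This is a simplified illustration, not the implementation of the lime library: the perturbation switches components on and off at random, the locality $\pi_x$ is an exponential kernel with an arbitrarily chosen width, and a weighted least-squares fit stands in for the sparse linear model that LIME normally uses. The function name `lime_explain` and the toy black-box model are hypothetical.

```python
import numpy as np

def lime_explain(f, x, n_samples=500, kernel_width=0.75, seed=0):
    """Fit a weighted linear surrogate around x for the black-box function f."""
    rng = np.random.default_rng(seed)
    d = len(x)
    # Perturbation: each component is randomly kept (1) or switched off (0)
    masks = rng.integers(0, 2, size=(n_samples, d))
    masks[0] = 1                                    # keep the unperturbed instance
    # Query the black box on the perturbed inputs
    preds = np.array([f(x * m) for m in masks])
    # Locality: samples with fewer components switched off get larger weights
    distance = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Weighted least squares for the surrogate g(z') = intercept + effects . z'
    design = np.hstack([np.ones((n_samples, 1)), masks])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], preds * sw, rcond=None)
    return coef[0], coef[1:]

# Toy black box (hypothetical): a linear function of three features
def black_box(z):
    return 3.0 * z[0] - 2.0 * z[1] + 0.5 * z[2]

x = np.array([1.0, 1.0, 1.0])
intercept, effects = lime_explain(black_box, x)
print(intercept, effects)
```

Because the toy black box is itself linear, the surrogate recovers its coefficients almost exactly. For an image classifier the components would be superpixels, and switching a component off would mean replacing that patch with a neutral color.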

4.2.3

Shapley Additive Explanation (SHAP)

The Shapley Additive Explanation (SHAP) method (Lundberg and Lee 2017) assigns each feature an importance value for a given prediction. It is based on game theory and the Shapley values (Shapley et al. 1953; Shapley 1997) proposed by Lloyd Shapley in 1953. The idea of the Shapley values is to find a fair way of dividing the profit among cooperating players. Each player receives a share of the profit that reflects their contribution to the result. Based on the classic Shapley value estimation method, an explanation method should satisfy three properties. Local accuracy requires the explanation model $g$ to match the output of $f$ for the simplified input $x'$:

$$f(x) = g(x') = \phi_0 + \sum_{i=1}^{M} \phi_i x'_i, \tag{4.4}$$

where $\phi_i$ is the effect of feature $i$, $M$ is the number of simplified input features, and $f(x)$ is the original model. The missingness property means that if a feature is missing from the simplified input, it has no impact, i.e., its effect is equal to 0. It can be written as follows:

$$x'_j = 0 \Rightarrow \phi_j = 0. \tag{4.5}$$

The consistency property concerns changes in the contribution of a simplified input. If the model changes so that the contribution of an input increases or stays the same regardless of the other inputs, the attribution of that input should not decrease. For two models $f$ and $f'$, it can be written as follows:

$$f'_x(z') - f'_x(z' \setminus i) \ge f_x(z') - f_x(z' \setminus i), \tag{4.6}$$

where $z' \setminus i$ denotes setting $z'_i = 0$. If (4.6) holds for all inputs $z'$, then $\phi_i(f', x) \ge \phi_i(f, x)$. The SHAP method relies on the result that only one explanation model satisfies all three properties. It is defined as:

$$\phi_i(f, x) = \sum_{z' \subseteq x'} \frac{|z'|!\,(M - |z'| - 1)!}{M!} \left[ f_x(z') - f_x(z' \setminus i) \right], \tag{4.7}$$

where $|z'|$ is the number of non-zero entries in $z'$, $z' \subseteq x'$ denotes all vectors $z'$ whose non-zero entries are a subset of the non-zero entries of $x'$, and the $\phi_i$ are the Shapley values. The SHAP values are not easy to compute, so an approximation method is needed. There are several approximation methods, including Kernel SHAP, Tree SHAP, and Deep SHAP. Some of the approximation methods can be used on any model. One such method is Kernel SHAP, which is a combination of the LIME method and the Shapley values. The other group is dedicated to specific types of models. In our exercise we use a neural network for image classification, which is why the Deep SHAP method fits best in this case.
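For a handful of features, the Shapley values can be computed exactly by enumerating all feature subsets, which also shows why approximations are needed: the number of subsets doubles with every added feature. The sketch below uses the classic subset form of the Shapley value, in which the sum runs over the subsets $S$ that do not contain feature $i$ and the bracket is the marginal contribution of $i$ to $S$. The function name `shapley_values` and the toy value function are assumptions made for this illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, M):
    """Exact Shapley values for a set function `value` over the features 0..M-1."""
    phi = [0.0] * M
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            # Weight of every subset of this size: |S|! (M - |S| - 1)! / M!
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset S
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Toy value function (hypothetical): features 0 and 1 contribute alone and
# jointly, feature 2 has no influence at all.
def toy_value(subset):
    total = 0.0
    if 0 in subset:
        total += 10.0
    if 1 in subset:
        total += 20.0
    if 0 in subset and 1 in subset:
        total += 30.0
    return total

phi = shapley_values(toy_value, M=3)
print(phi)
```

In this toy game the joint bonus of 30 is split equally between features 0 and 1, and the feature without influence receives a value of zero. The values also sum to the difference between the value of all features and the value of the empty set, which corresponds to the local accuracy property.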

4.3

Skin Moles Classification Explained

Melanoma is one of the deadliest cancers. Luckily it is a rare disease, but there are also other skin cancer types and diseases. In 2016 the International Skin Imaging Collaboration (ISIC) challenges were introduced. The organizers provide data sets that consist of different skin lesions. The sets are very useful for researchers to build better machine learning models that recognize not only the disease, but also specific patterns of each skin lesion type. In this section we recognize the skin lesion types using a convolutional neural network.

Dataset

The ISIC2018 dataset consists of 10,000 images divided into training and testing sets. It contains benign and malignant cases, divided into seven different types of diseases. The dataset can be downloaded from https://challenge2018.isic-archive.com/. Apart from melanoma cases and many benign nevus cases, we can also find basal cell carcinoma, actinic keratosis, squamous cell carcinoma, lentigo, and a few other skin mole types. The full image gallery is available at: https://www.isic-archive.com/#!/topWithHeader/onlyHeaderTop/gallery?filter=%5B%5D. In this research we focus on a binary classification between benign and malignant ones.

import os

cwd = os.getcwd()

epochs = 50
batch_size = 10

input_shape = (150, 200, 3)

Listing 4.1 Model training configuration

Before training the model, some variables need to be set (see Listing 4.1). After the data is downloaded, we need to set the paths to the metadata file and to the images. The number of epochs is set to 50 and should be increased for higher accuracy. The batch size is set to 10. The shape of the input images is reduced compared to the original ones to simplify the calculations. Finally, the images are loaded using the OpenCV library (see Listing 4.2).

import cv2
import numpy as np
import pandas as pd

truth = 'HAM10000_metadata.csv'
images_path = '/ham10k'

# Assumption: the metadata table is read from the CSV file named above
metadata = pd.read_csv(truth)

dataset = metadata[['image_id', 'dx']]

# Encode the diagnosis names as integer class labels
lookupTable, indexed_dataSet = np.unique(metadata['dx'], return_inverse=True)
metadata['label'] = indexed_dataSet

dataset = metadata[['image_id', 'label']]

labels = metadata['label']

input_shape = (150, 200, 3)

_images = np.empty((len(dataset), *input_shape), dtype=np.uint8)

def load_image(filename):
    # OpenCV reads images in BGR order, so convert to RGB
    img = cv2.imread(filename)
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# cv2.resize expects (width, height), which gives arrays of shape (150, 200, 3)
for i, image in enumerate(images_path + "/" + dataset['image_id'] + ".jpg"):
    img = cv2.resize(load_image(image), (200, 150), interpolation=cv2.INTER_AREA)
    _images[i] = img

Listing 4.2 Data loading and preprocessing for the Inception network


The data preprocessing is shown in Listing 4.2. For simplification, we load the images and resize each to 150 by 200 pixels (height by width). The data is saved as a NumPy array. The labels are loaded from the CSV metadata file.
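The label encoding used in Listing 4.2 relies on `np.unique` with `return_inverse=True`, which returns both the sorted unique class names and, for every sample, the index of its class. The short example below shows the effect on a few diagnosis codes; the five values are made up for illustration.

```python
import numpy as np

# Hypothetical diagnosis codes for five images
dx = np.array(["nv", "mel", "bcc", "nv", "mel"])

lookup_table, labels = np.unique(dx, return_inverse=True)
print(lookup_table)   # sorted unique class names
print(labels)         # integer class index of every sample

# The lookup table maps the integer labels back to the class names
decoded = lookup_table[labels]
```

Keeping the lookup table is useful later, because the explanations are produced for integer class indices and have to be translated back into diagnosis names.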

Model In Listing 4.3 the model is built using a Keras implementation of the Inception network architecture. We apply transfer learning: the model starts from the ImageNet weights and is then trained on the lesion images.

# Assumed imports; the original listing does not show them
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow import keras
from keras.utils import np_utils
from sklearn.model_selection import train_test_split

# Inception is assumed to be an Inception class from keras.applications imported under this name
inception = Inception(include_top=False, input_shape=input_shape, classes=7)
model_inc = keras.models.Sequential()

model_inc.add(inception)
model_inc.add(keras.layers.GlobalAveragePooling2D())
model_inc.add(keras.layers.Dropout(0.1))
model_inc.add(keras.layers.Flatten())
model_inc.add(keras.layers.Dense(15, activation='relu'))
model_inc.add(keras.layers.Dense(7, activation='softmax'))
model_inc.summary()

# Early stopping to monitor the validation loss
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50, restore_best_weights=True)

for i in range(10):
    # First split: training vs. the rest; second split: testing vs. validation
    X_train, X_test, y_train, y_test = train_test_split(_images, labels, train_size=0.5, random_state=6, stratify=labels)
    X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, train_size=0.5, random_state=6, stratify=y_test)

    y_train = np_utils.to_categorical(y_train)
    y_val = np_utils.to_categorical(y_val)

    model_inc.compile(optimizer='adam', loss='categorical_crossentropy',
                      metrics=['accuracy',
                               tfa.metrics.F1Score(num_classes=7, average="micro", name="inc_f1"),
                               tf.keras.metrics.AUC(multi_label=True, name="inc_auc"),
                               tf.keras.metrics.Precision(name="inc_precision"),
                               tf.keras.metrics.Recall(name="inc_recall")])

    callbacks = [early_stop]

    history = model_inc.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=epochs,
                            batch_size=batch_size, verbose=2, callbacks=callbacks,
                            shuffle=True)  # class_weight={0: 1, 1: 8})

    print("Saving model...")
    model_inc.save(f"{cwd}/inc_final_model{i}.h5")

Listing 4.3 Keras Inception network training

The data is split twice. First, it is split into training and testing sets with a ratio of 50%. Next, the previously obtained testing set is split again in the same ratio into testing and validation sets. One callback, early stopping, is added. It can be turned off, but it is recommended to keep it, especially when the number of epochs is greater (see Listing 4.3).
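To make the effect of the two consecutive splits in Listing 4.3 concrete, here is a minimal sketch that applies a 50% split twice to a list of indices. It uses only the standard library and is not stratified, unlike train_test_split with the stratify argument. The helper split_half and the count of 10,000 items are illustrative assumptions.

```python
import random

def split_half(items, seed):
    """Shuffle a copy of items and cut it into two halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]

indices = range(10_000)                  # stand-in for the image indices
train, rest = split_half(indices, seed=6)
test, val = split_half(rest, seed=6)     # the second split works on the held-out half
print(len(train), len(test), len(val))   # 5000 2500 2500
```

With these ratios, half of the images are used for training and a quarter each for testing and validation.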

4.3.1

LIME Explanation

The LIME explainer first needs to be initialized as shown in Listing 4.4. The function make_prediction is used by the explainer to get the prediction. We take one randomly chosen image from the validation set. In the last step we create a mask to get the pixels that have the greatest impact on the prediction.

from lime import lime_image

def make_prediction(image):
    return model_inc.predict(image)

explainer = lime_image.LimeImageExplainer(random_state=123)

image_id = 17

explanation = explainer.explain_instance(X_val[image_id].astype('double'), make_prediction,
                                         random_seed=123, top_labels=3)

# The label passed here must be one of the top_labels explained above
img, mask = explanation.get_image_and_mask(y_test[0], positive_only=False,
                                           negative_only=True, hide_rest=True)

Listing 4.4 Lime explainer

Fig. 4.1 LIME explanation results. The first image shows the original image, the second the mask generated by the LIME method, and the last a combination of the mask and the original image

The final image is a combination of the original image and the generated mask. It is generated with the function plot_comparison shown in Listing 4.5. For better understanding, the images are displayed side by side. The borders of the mask are marked in green. In the image shown in Fig. 4.1 we can see three regions that are marked as important.

import matplotlib.pyplot as plt
from skimage.segmentation import mark_boundaries

def plot_comparison(main_image, img, mask):
    fig = plt.figure(figsize=(15, 5))

    ax = fig.add_subplot(141)
    ax.imshow(main_image, cmap="gray")
    ax.set_title("Original Image")
    ax = fig.add_subplot(142)
    ax.imshow(img)
    ax.set_title("Image")
    ax = fig.add_subplot(143)
    ax.imshow(mask)
    ax.set_title("Mask")
    ax = fig.add_subplot(144)
    # Draw the mask borders in green on top of the image
    ax.imshow(mark_boundaries(img, mask, color=(0, 1, 0)))
    ax.set_title("Image+Mask Combined")

plot_comparison(X_test[0], img, mask)

Listing 4.5 Combining the mask with the image showing the parts of the image with highest influence on the prediction

As it is a benign lesion, the most important parts are the smooth skin regions and the flat, one color lesion region in the center.
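The mask used above marks the superpixels that influence the prediction most. The sketch below illustrates that idea only: given a superpixel segmentation and one weight per superpixel, it keeps the segments with the largest positive weights. The function mask_from_weights, the 4 by 4 segmentation, and the weights are hypothetical and only loosely mirror what the LIME library does internally.

```python
def mask_from_weights(segments, weights, num_features=2):
    """Mark pixels that belong to the num_features superpixels with the largest positive weights."""
    positive = [(w, seg) for seg, w in weights.items() if w > 0]
    keep = {seg for w, seg in sorted(positive, reverse=True)[:num_features]}
    return [[1 if seg in keep else 0 for seg in row] for row in segments]

# Hypothetical 4x4 image divided into four superpixels (0..3)
segments = [[0, 0, 1, 1],
            [0, 0, 1, 1],
            [2, 2, 3, 3],
            [2, 2, 3, 3]]
# Hypothetical weights of the local linear model, one per superpixel
weights = {0: 0.40, 1: -0.25, 2: 0.05, 3: 0.30}

mask = mask_from_weights(segments, weights)
for row in mask:
    print(row)
```

Here superpixels 0 and 3 are selected, so the top-left and bottom-right blocks of the mask are set to 1.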

4.3.2

SHAP Explanation

The SHAP method is implemented in the shap Python library. Five simple steps are needed to get the SHAP values (see Listing 4.6):
• set the masker for the image,
• set the classes for which we want to calculate the SHAP values,
• set the SHAP explainer,
• calculate the SHAP values,
• plot the images with the SHAP value masks.

The masker masks out partitions of the image using an inpainting method (Telea 2004). We use the full size of the input shape to get the highest possible granularity. In this example we use five diseases, referred to in the list by their abbreviations. The explainer then calculates the SHAP values for the image with index image_id.

import shap

masker = shap.maskers.Image("inpaint_telea", input_shape)

class_labels = ('bcc', 'bkl', 'mel', 'nv', 'vasc')

explainer = shap.Explainer(model_inc, masker, output_names=class_labels)

shap_values = explainer(X_val[image_id:image_id+1], outputs=shap.Explanation.argsort.flip[:5])

shap.image_plot(shap_values)

Listing 4.6 SHAP


The plot of the skin mole image together with the SHAP value masks is shown in Fig. 4.2. The blue squares indicate negative SHAP values and the red ones positive values. We can see that most of the red squares appear in the nevus image. The first conclusion that can be drawn is that the mole should be classified as a nevus. The second conclusion is that the two most probable diagnoses are nevus and vascular moles. The third is that the most important parts of the image are the surroundings of the bottom and the top part of the mole.

Fig. 4.2 SHAP values for a random HAM10k skin mole image
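To illustrate where positive and negative SHAP values come from, the following sketch computes exact Shapley values for a toy game with three image regions. The shap library uses approximations for real images. The function shapley_values, the scoring function score, and the region names are hypothetical and serve only as an illustration.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def score(regions):
    """Hypothetical model output when only the given image regions are visible."""
    s = 0.0
    if "top" in regions:
        s += 0.5
    if "bottom" in regions:
        s += 0.3
    if "left" in regions:
        s -= 0.2
    if "top" in regions and "bottom" in regions:
        s += 0.1  # interaction between the two regions
    return s

phi = shapley_values(["top", "bottom", "left"], score)
print(phi)
```

In this toy example the top and bottom regions receive positive values (they would be drawn in red), the left region receives a negative value (blue), and the values sum to the score of the full image.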

4.3.3

GradCAM Explanation

GradCAM can be implemented directly on the Keras layers. The following implementation is only a modification of the Keras GradCAM implementation available at https://keras.io/examples/vision/grad_cam/. It is divided into two methods: make_gradcam_heatmap and save_and_display_gradcam. In Listing 4.7 we take the image, the model, and the name of the last convolution layer to generate the heatmap. The gradients of the predicted class are computed with respect to the output of the last convolution layer for the image. The method returns the heatmap as a NumPy array.

import tensorflow as tf

def make_gradcam_heatmap(img_array, model, last_conv_layer_name, pred_index=None):
    # Model that maps the input to the last conv layer output and the predictions
    grad_model = tf.keras.models.Model(
        [model.inputs], [model.get_layer(last_conv_layer_name).output, model.output]
    )

    with tf.GradientTape() as tape:
        last_conv_layer_output, preds = grad_model(img_array)
        if pred_index is None:
            pred_index = tf.argmax(preds[0])
        class_channel = preds[:, pred_index]

    # Gradient of the class score with respect to the conv layer output
    grads = tape.gradient(class_channel, last_conv_layer_output)

    # Mean gradient per feature map channel
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))

    last_conv_layer_output = last_conv_layer_output[0]
    heatmap = last_conv_layer_output @ pooled_grads[..., tf.newaxis]
    heatmap = tf.squeeze(heatmap)

    # Keep positive values only and normalize to the range 0..1
    heatmap = tf.maximum(heatmap, 0) / tf.math.reduce_max(heatmap)
    return heatmap.numpy()

Listing 4.7 GradCAM heatmap generation method
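The arithmetic at the end of Listing 4.7 can be checked on small arrays without a trained model. The sketch below assumes the feature maps and gradients are already available as NumPy arrays of shape (H, W, C). The function gradcam_from_arrays and the toy values are illustrative assumptions, and the guard against a zero maximum is an addition that the listing does not contain.

```python
import numpy as np

def gradcam_from_arrays(feature_maps, grads):
    """Combine feature maps and gradients of shape (H, W, C) into a normalized heatmap."""
    weights = grads.mean(axis=(0, 1))   # one weight per channel (pooled gradients)
    heatmap = feature_maps @ weights    # weighted sum over channels -> shape (H, W)
    heatmap = np.maximum(heatmap, 0)    # keep only positive evidence
    peak = heatmap.max()
    return heatmap / peak if peak > 0 else heatmap

# Toy feature maps with 3 channels and constant per-channel gradients
feature_maps = np.arange(24, dtype=float).reshape(2, 4, 3)
grads = np.ones((2, 4, 3)) * np.array([1.0, -0.5, 0.25])

heatmap = gradcam_from_arrays(feature_maps, grads)
print(heatmap.shape)  # (2, 4)
print(heatmap.round(3))
```

Channels with a negative pooled gradient lower the heatmap, while those with a positive gradient raise it, which is why only some regions end up highlighted.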

The second method (Listing 4.8) combines the heatmap with the original image and saves the result as a JPG image file. The typical heatmap colormap is jet, but it can be replaced here to get a different color palette.

import numpy as np
import matplotlib.cm as cm
from IPython.display import Image, display

def save_and_display_gradcam(img, heatmap, cam_path="cam.jpg", alpha=0.4):
    # The original image is passed in as the array img

    # Rescale heatmap to a range 0-255
    heatmap = np.uint8(255 * heatmap)

    # Use jet colormap to colorize heatmap
    jet = cm.get_cmap("jet")

    # Use RGB values of the colormap
    jet_colors = jet(np.arange(256))[:, :3]
    jet_heatmap = jet_colors[heatmap]

    # Create an image with RGB colorized heatmap
    jet_heatmap = keras.utils.array_to_img(jet_heatmap)
    jet_heatmap = jet_heatmap.resize((img.shape[1], img.shape[0]))
    jet_heatmap = keras.utils.img_to_array(jet_heatmap)

    # Superimpose the heatmap on original image
    superimposed_img = jet_heatmap * alpha + img
    superimposed_img = keras.utils.array_to_img(superimposed_img)

    # Save the superimposed image
    superimposed_img.save(cam_path)

    # Display Grad CAM
    display(Image(cam_path))

save_and_display_gradcam(X_val[0], heatmap)

Listing 4.8 GradCAM


In Listing 4.9 the model is loaded, then the heatmap is generated for the model, and the combined image is displayed. The model used here is a MobileNet model.

model = keras.models.load_model('mn2_final_model0.h5')

heatmap = make_gradcam_heatmap(X_val[0:1], model, "conv2d_178")

save_and_display_gradcam(X_val[0], heatmap)

Listing 4.9 GradCAM

The heatmap shows that the most important regions of the image for the classification are those in the top part of the mole (marked in red), while the regions on the left and at the bottom of the mole have the lowest impact (Fig. 4.3).

Fig. 4.3 Heatmap merged with the original skin mole image

References
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Advances in neural information processing systems, p 30
Ribeiro MT, Singh S, Guestrin C (2016) "Why should i trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1135–1144
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626
Shapley LS, et al. (1953) A value for n-person games
Shapley LS (1997) A value for n-person games. In: Classics in game theory, p 69
Telea A (2004) An image inpainting technique based on the fast marching method. J Graph Tools 9. https://doi.org/10.1080/10867651.2004.10487596

5

Time Series Data Used for Diseases Recognition and Anomaly Detection

So far, the data we used were static. There is also a group of data sets in which the data change over time. One such example is sound. When we take a CT scan, we receive a set of images, but each image captures only a small piece (slice) of the body. It is even possible to combine MRI slices into a 3D model, as the distance between the slices is fixed. A different approach is used in videos, such as ultrasound videos, in which we see changes in an organ, usually over a short period of time. Time series usually do not rely on images only but, in most cases, on sets of tabular data. We can observe how the observed organ changes its behavior over time, which makes it possible to recognize different types of anomalies or diseases than those visible in data captured at a single point in time, such as typical MRI scans. In this chapter, we analyze three applications: heart, brain, and human movement.

The time-series data that are typical in medicine are described in the following section, together with examples of commercial applications that use such data with machine learning methods. Next, three main explainable methods used for time series data are explained: the Symbolic Aggregate Approximation (SAX), Shapelets, and LAXCAT methods. In the next section, we follow up with an analysis of movement monitoring data sets, where we use the SAX method. It is followed by the heart disease recognition case, where the shapelets method is used to explain the results. The last example is an anomaly detection model for brain activity. Here, we use the LAXCAT method to explain the anomalies.

5.1

Data Types and Applications

Time-series data in medicine can take the form of different types of sounds captured, for example, using ultrasound, or of signals from different sensors such as EEG. BrainQ (https://brainqtech.com/our-technology) uses electrophysiology measurements and analyzes them. Other sensor-based data is related to electrocardiogram (ECG) records, wearables, or dedicated sensors. One of the dedicated sensors whose data is analyzed using machine learning methods is the Oli PPH by Baymatob (https://www.baymatob.com/index.php/clinicalproduct-information). Data whose source is the heart can be obtained using special hardware such as that made by Eko Devices (https://ekohealth.com/), which provides stethoscopes to collect the data; the data are then analyzed by a separate machine learning-based solution. A similar product, but for the recognition of respiratory diseases, is developed by StethoMe (https://www.stethome.com/pl/). Recently, more companies have been relying on the cloud for data storage and analysis. One such example is Cardiomatics (https://cardiomatics.com/), which focuses on ECG data storage in the cloud and also provides several methods to analyze those data. XOresearch provides an ECG monitoring and annotation solution (https://xoresearch.com/#/products). Models that analyze the ECG are also used on mobile devices, such as the Magicardio application (http://www.magicardio.com/features/).

In addition to ECG, ultrasound is one of the technologies that can provide time-series data. Novasignal (https://www.novasignal.com/product/novaguide-2) provides real-time blood flow data that is then analyzed using machine learning models. A larger device for pulmonology was developed by Artiq (https://www.artiq.eu/whitepapers/artificial-intelligence-in-pulmonology-threat-or-opportunity/).

A larger group of applications of time-series data is related to sound. We can divide it into two groups: one that is about translating speech into text, and a second that is dedicated to sound analysis for mental health. Suki (https://www.suki.ai/) has a voice assistant for patients and physicians. A similar solution, but dedicated to orthopedists, is provided by Robin Healthcare (https://www.robinhealthcare.com/). DeepScribe (https://www.deepscribe.ai/) records the doctor's and patient's speech and keeps the study details in the EHR.

Mental health applications measure different metrics based on what is recorded from each study and analyze each of them. There are several examples, such as Winterlight Labs (https://winterlightlabs.com/), Ellipsis Health (https://www.ellipsishealth.com/), and Aural Analytics (https://auralanalytics.com/).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Przystalski and R. M. Thanki, Explainable Machine Learning in Medicine, Synthesis Lectures on Engineering, Science, and Technology, https://doi.org/10.1007/978-3-031-44877-5_5

5.2

Explainable Methods for Time Series

Time series medical data is a type of medical data that is collected over time, such as heart rate, blood pressure, and EEG signals. Explainable methods for time series medical data aim to provide transparency and interpretability of the models used to analyze this data, allowing clinicians and researchers to understand how the models make predictions and identify potential issues. One popular explainable method for time series medical data is the use of interpretable machine learning models, such as decision trees, that can be used to analyze and visualize the relationships between different variables in the data. Decision trees can be used to identify important features in the data, such as changes in heart rate or blood pressure, and to understand how these features affect the outcome of interest, such as the likelihood of a heart attack or stroke.


Another approach is the use of explainable deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), that have been modified to include attention mechanisms or interpretability techniques. Attention mechanisms can be used to identify important regions of the time series data that are relevant for making a prediction, while interpretability techniques such as layer-wise relevance propagation (LRP) can be used to understand how the model makes decisions.

Finally, rule-based models such as fuzzy logic and decision rules can be used to analyze time series medical data. Fuzzy logic models can be used to capture uncertainty in the data and can be used to make decisions based on the degree of membership of the data in different classes or categories. Decision rules can be used to identify patterns and relationships between different variables in the data, and to generate rules that can be used to make predictions based on the presence or absence of specific features in the data.

Overall, explainable methods for time series medical data are important for ensuring that machine learning models are transparent, interpretable, and can be trusted by clinicians and researchers. These methods can help to identify potential issues and biases in the models, and to improve the accuracy and reliability of predictions made using time series medical data.
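The degree of membership used by fuzzy logic models can be illustrated with a triangular membership function. This is a minimal sketch: the function triangular and the three heart-rate sets with their boundaries are hypothetical and are not clinical thresholds.

```python
def triangular(x, left, peak, right):
    """Degree of membership of x in a triangular fuzzy set (0 outside, 1 at the peak)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical, non-clinical fuzzy sets for heart rate in beats per minute
fuzzy_sets = {
    "low": (30, 50, 70),
    "normal": (50, 75, 100),
    "high": (80, 110, 140),
}

heart_rate = 90
memberships = {name: triangular(heart_rate, *abc) for name, abc in fuzzy_sets.items()}
print(memberships)
```

A reading of 90 belongs partly to "normal" (0.4) and partly to "high" (about 0.33), which a rule-based model can report directly as part of its explanation.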

5.2.1

Symbolic Aggregate Approximation (SAX)

Symbolic Aggregate Approximation (SAX) (Lundberg and Lee 2017) is a method for summarizing and analyzing medical time series data that was originally developed for speech recognition applications. SAX is based on the idea of transforming a continuous time series signal into a sequence of discrete symbols, which can then be analyzed using standard pattern recognition techniques.

In medical applications, SAX has been used to analyze electrocardiogram (ECG) and electroencephalogram (EEG) signals. For example, SAX has been used to identify abnormal patterns in ECG signals that are indicative of cardiac arrhythmia or other cardiovascular diseases. SAX has also been used to analyze EEG signals to identify abnormal patterns that are indicative of epilepsy or other neurological disorders.

Overall, SAX is a simple and effective method for summarizing and analyzing medical time series data and can be used to identify patterns and trends that are indicative of specific medical conditions or diseases. However, SAX does not capture the full complexity of the original time series data and may miss important details that are relevant for accurate diagnosis and treatment. Therefore, SAX should be used in conjunction with other more complex and accurate methods for analyzing medical time series data.

The SAX algorithm works as follows:
• Segment the time series data into a fixed number of equal-length subsequences.
• Compute the mean and standard deviation of each subsequence.
• Map each subsequence to a symbol using a pre-defined alphabet based on the standard deviation of the subsequence.
• Concatenate the symbols to form a string representation of the time series data.

Segment the time series data into a fixed number of equal-length subsequences. Let $X = [x_1, x_2, \ldots, x_n]$ be the time series data of length $n$, and let $w$ be the length of the subsequences. Then we can define $k = \frac{n}{w}$, the number of subsequences.

Compute the mean and standard deviation of each subsequence. Let $m(i)$ and $s(i)$ be the mean and standard deviation of the $i$th subsequence, where $i = 1, 2, \ldots, k$. Then we can compute these values as follows:

\[ m(i) = \frac{1}{w} \sum_{j=w(i-1)+1}^{wi} x_j \tag{5.1} \]

\[ s(i) = \sqrt{\frac{1}{w} \sum_{j=w(i-1)+1}^{wi} \left(x_j - m(i)\right)^2} \tag{5.2} \]

Map each subsequence to a symbol using a pre-defined alphabet based on the standard deviation of the subsequence. Let $A$ be the pre-defined alphabet of size $a$, and let $B = [b_1, b_2, \ldots, b_{a-1}]$ be the breakpoints that define the intervals for each symbol, with $B(j)$ denoting $b_j$. Then we can define a mapping function $f$ that maps the standard deviation $s(i)$ of each subsequence to a symbol in $A$ as follows:

\[ f(s(i)) = a \quad \text{if } s(i) \ge B(a-1), \tag{5.3} \]

\[ f(s(i)) = j \quad \text{if } B(j-1) \le s(i) < B(j), \quad j = 2, 3, \ldots, a-1, \tag{5.4} \]

\[ f(s(i)) = 1 \quad \text{if } s(i) < B(1). \tag{5.5} \]

Concatenate the symbols to form a string representation of the time series data. Let $S = [s_1, s_2, \ldots, s_k]$ be the sequence of symbols corresponding to the $k$ subsequences. Then the SAX representation of the time series data $X$ is given by:

\[ SAX(X) = [f(s(1)), f(s(2)), \ldots, f(s(k))]. \tag{5.6} \]

By using a fixed-length segmentation and a pre-defined alphabet, the SAX algorithm produces a compact and computationally efficient representation of the original time series data, which can be analyzed using standard pattern recognition techniques, such as clustering or classification. The size of the alphabet a and the number of breakpoints b can be tuned to balance the trade-off between the complexity of the symbolic representation and the accuracy of the analysis.
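The steps described in this section can be sketched in a few lines of plain Python. This sketch follows the description in the text, where each segment is mapped to a symbol according to its standard deviation. Note that many SAX implementations instead discretize the segment means of a z-normalized series using Gaussian breakpoints. The function sax_like, the breakpoints, and the sample series are illustrative assumptions.

```python
from bisect import bisect_right
from statistics import mean, pstdev

def sax_like(x, w, breakpoints, alphabet="abcd"):
    """Return one symbol per length-w segment, chosen from the segment's standard deviation."""
    k = len(x) // w                          # number of segments
    symbols = []
    for i in range(k):
        segment = x[i * w:(i + 1) * w]
        m_i = mean(segment)                  # segment mean, Eq. (5.1)
        s_i = pstdev(segment, m_i)           # segment standard deviation, Eq. (5.2)
        # bisect_right counts the breakpoints that are <= s_i, which selects the symbol
        symbols.append(alphabet[bisect_right(breakpoints, s_i)])
    return "".join(symbols)

# Hypothetical series with a flat, a mildly varying, and a strongly varying segment
x = [1, 1, 1, 1, 0, 2, 0, 2, 0, 10, 0, 10]
word = sax_like(x, w=4, breakpoints=[0.5, 2.0, 4.0])
print(word)  # abd
```

The three segments have standard deviations 0, 1, and 5, so they fall below the first breakpoint, between the first and second, and above the last one.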

from pyts.approximation import SymbolicAggregateApproximation
import pandas as pd

# Load time series data
data = pd.read_csv('time_series.csv', header=None)

# Initialize the SAX transformer
n_bins = 8
sax = SymbolicAggregateApproximation(n_bins=n_bins, strategy='uniform')

# Transform the time series data into SAX representation
X_sax = sax.fit_transform(data)

# Print the SAX representation of the first time series
print(X_sax[0])

Listing 5.1 SAX simplified example

In this code, we first load the time series data from a CSV file. We then initialize the SAX transformer with the desired number of bins and strategy. The fit_transform method is used to transform the time series data into its SAX representation. Finally, we print the SAX representation of the first time series in the data.
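The 'uniform' strategy used above assigns symbols from bins of equal width. The sketch below shows this idea on one series. It is an illustration only, and the exact handling of bin edges in the pyts library may differ. The function uniform_symbols and the sample series are assumptions.

```python
def uniform_symbols(series, n_bins, alphabet="abcdefgh"):
    """Assign each value a symbol from n_bins equal-width bins between min and max."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins
    symbols = []
    for value in series:
        if width == 0:
            index = 0                                          # constant series: single bin
        else:
            index = min(int((value - lo) / width), n_bins - 1)  # the maximum goes to the last bin
        symbols.append(alphabet[index])
    return symbols

series = [0.0, 1.0, 2.0, 3.0, 4.0, 8.0]
symbols = uniform_symbols(series, n_bins=4)
print(symbols)  # ['a', 'a', 'b', 'b', 'c', 'd']
```

Because the bins have equal width, a single large value such as 8.0 stretches the range and leaves fewer symbols for the remaining values.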

5.2.2

Shapelets

Shapelets (Ye and Keogh 2011) are a type of feature extraction method for time series data that has been widely used in the analysis of medical time series data. The basic idea behind Shapelets is to identify small, discriminative sub-sequences (called shapelets) that capture the distinctive features of a particular class of time series data and use these shapelets as features for classification or clustering tasks. The Shapelet algorithm works as follows:
• Define a set of candidate shapelets of varying lengths. These candidate shapelets can be generated by randomly selecting sub-sequences from the time series data, or by using a more sophisticated approach such as a genetic algorithm or a heuristic search.
• Compute the distance between each candidate shapelet and every sub-sequence in the training data. The distance between a candidate shapelet and a sub-sequence is defined as the minimum distance between the shapelet and any sub-sequence of the same length in the original time series data. This can be computed using a distance measure such as the Euclidean distance or the Dynamic Time Warping distance.
• Rank the candidate shapelets according to their discriminative power. The discriminative power of a shapelet is defined as the difference in mean distance between the shapelet and the sub-sequences of different classes. The top k shapelets with the highest discriminative power are selected as the final set of shapelets.
• Use the selected shapelets as features for classification or clustering tasks. Each time series is represented as a vector of distances to each selected shapelet, and these vectors are used as input to a machine learning algorithm such as a support vector machine or a k-nearest neighbors classifier.
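The distance used in the second step above can be sketched as a sliding-window search. This minimal example uses the Euclidean distance. The function shapelet_distance and the two toy series are illustrative assumptions.

```python
import math

def shapelet_distance(shapelet, series):
    """Minimum Euclidean distance between the shapelet and any window of the same length."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(shapelet, window)))
        best = min(best, d)
    return best

shapelet = [0, 3, 0]                 # a short peak
with_peak = [1, 1, 0, 3, 0, 1]       # contains the peak exactly
flat = [1, 1, 1, 1, 1, 1]            # contains no peak

d_peak = shapelet_distance(shapelet, with_peak)
d_flat = shapelet_distance(shapelet, flat)
print(d_peak, d_flat)
```

The series that contains the peak has distance 0, while the flat series has a clearly larger distance. This difference is what makes a shapelet useful as a feature.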


In other words, shapelets divide the time series into sets for each class. A few methods that are also used in decision trees can be used to make such a division. One is the entropy, defined as:

\[ H(D) = -p(A)\log(p(A)) - p(B)\log(p(B)), \tag{5.7} \]

where $D$ is the time series dataset, and $p(A)$ and $p(B)$ are the probabilities of the two classes. Another method is the information gain, defined as:

\[ Gain(sp) = H(D) - \hat{H}(D), \tag{5.8} \]

\[ Gain(sp) = H(D) - (f(D_1)H(D_1) + f(D_2)H(D_2)), \tag{5.9} \]

where $sp$ is the split strategy that divides $D$ into the subsets $D_1$ and $D_2$, and $f(D_1)$ and $f(D_2)$ are the fractions of time series that fall into each subset. Using the two methods can generate a large number of potential candidates for the two classes. To find the right one, we limit the number of possible generated candidates. We can get the number of candidates as follows:

\[ \sum_{l=MINLEN}^{MAXLEN} \sum_{T_i \in D} (m_i - l + 1), \tag{5.10} \]

where $T_i$ is a time series, $l$ is the fixed candidate length, $m_i$ is the length of the time series, and $MAXLEN$ and $MINLEN$ are the maximum and minimum lengths of the candidates.

The advantage of Shapelets is that it can capture the most important features of the time series data in a concise and interpretable way, which can help to identify the key factors that differentiate one class of time series data from another. In medical applications, Shapelets have been used to identify early warning signs of disease progression, to predict patient outcomes, and to classify different types of medical images and signals. However, Shapelets have some limitations, such as the sensitivity to the choice of candidate shapelets and the difficulty in handling time series data with variable lengths or missing values. Therefore, Shapelets should be used in conjunction with other more robust and flexible methods for analyzing medical time series data.
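The entropy, the information gain, and the candidate count defined in Eqs. (5.7), (5.9), and (5.10) can be checked with a short sketch. The function names and the toy labels are illustrative assumptions, and the natural logarithm is used here.

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, Eq. (5.7), using the natural logarithm."""
    n = len(labels)
    probabilities = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log(p) for p in probabilities)

def information_gain(labels, left, right):
    """Gain of splitting labels into the subsets left and right, Eq. (5.9)."""
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

def n_candidates(lengths, min_len, max_len):
    """Number of candidate shapelets for time series of the given lengths, Eq. (5.10)."""
    return sum(m - l + 1 for l in range(min_len, max_len + 1) for m in lengths)

labels = ["A", "A", "B", "B"]
gain = information_gain(labels, ["A", "A"], ["B", "B"])   # a perfect split
count = n_candidates([10, 12], min_len=3, max_len=4)
print(gain, count)
```

A perfect split removes all uncertainty, so the gain equals the entropy of the full set (log 2 for two balanced classes). Even two short series already yield 34 candidates, which shows why the candidate lengths need to be limited.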

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from tslearn.piecewise import SymbolicAggregateApproximation
# ShapeletModel is the class name used in the text; recent tslearn releases
# are assumed to expose the same model as LearningShapelets
from tslearn.shapelets import ShapeletModel, grabocka_params_to_shapelet_size_dict

# Load data
data = pd.read_csv("data.csv", header=None)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], data.iloc[:, -1], test_size=0.2)

# Convert time series to symbolic representation using SAX
n_paa_segments = 8
n_sax_symbols = 8
sax = SymbolicAggregateApproximation(n_segments=n_paa_segments, alphabet_size_avg=n_sax_symbols)

X_train_sax = sax.fit_transform(X_train)
X_test_sax = sax.transform(X_test)

# Set shapelet parameters
n_shapelets = 10
shapelet_sizes = grabocka_params_to_shapelet_size_dict(n_ts=X_train_sax.shape[0],
                                                       ts_sz=X_train_sax.shape[1],
                                                       n_classes=len(np.unique(y_train)),
                                                       l=0.1, r=2)

# Train shapelet model with n_shapelets shapelets for each heuristic size
shapelet_model = ShapeletModel(n_shapelets_per_size={size: n_shapelets for size in shapelet_sizes},
                               batch_size=1, verbose=0)
shapelet_model.fit(X_train_sax, y_train)

# Visualize top shapelets
shapelets = shapelet_model.shapelets_as_time_series_
n_shapelets_to_plot = 3
fig, axs = plt.subplots(n_shapelets_to_plot, figsize=(10, 10))
for i in range(n_shapelets_to_plot):
    axs[i].plot(shapelets[i].ravel())
plt.show()

# Evaluate model
y_pred = shapelet_model.predict(X_test_sax)

accuracy = np.mean(y_pred == y_test)
print("Accuracy: ", accuracy)

Listing 5.2 Shapelets simplified example

In this example, we first load the time series dataset using Pandas. We then split the data into training and test sets using train_test_split from Scikit-learn. Next, we use the SymbolicAggregateApproximation class from the tslearn library to convert the time series data to a symbolic representation using SAX. We then set the parameters for the shapelet model. We choose to use 10 shapelets and set the shapelet sizes based on the grabocka_params_to_shapelet_size_dict function, which is a heuristic approach that depends on the number of time series in the training set, the length of the time series, and the number of classes. We train the shapelet model using the ShapeletModel class from the tslearn library. The model is fitted on the SAX-transformed training data and the corresponding labels; the SAX-transformed test data is used for evaluation. After training the model, we visualize the top shapelets using Matplotlib. Finally, we evaluate the model on the test set by predicting the labels for the SAX-transformed test data and computing the accuracy.

5.2.3

Learning from Aggregate Xtreme Categories with Automated Transformation (LAXCAT)

LAXCAT (Learning from Aggregate Xtreme Categories with Automated Transformation) (Hsieh et al. 2021) is a feature extraction method for time series data that has been used in medical applications. It is designed to identify extreme cases or outliers in the data and use them to create features that can improve classification or clustering performance. The LAXCAT algorithm works as follows:


• Identify the extreme cases or outliers in the time series data using a statistical measure such as the interquartile range or the median absolute deviation.
• Divide the time series data into segments of fixed length or using a sliding window approach.
• For each segment, compute a set of aggregate features such as the mean, variance, skewness, and kurtosis.
• Transform the aggregate features using a non-linear function such as a sigmoid or a tanh function to create a new set of features that capture the extreme values in the data.
• Use the transformed features as input to a machine learning algorithm such as a support vector machine or a random forest for classification or clustering tasks.

The advantage of LAXCAT is that it can capture the extreme values in the time series data, which can provide important information about the underlying physiology or pathology of the medical condition being studied. In medical applications, LAXCAT has been used to identify patients at high risk for adverse events, to predict the response to treatment, and to classify different types of medical images and signals.

However, LAXCAT has some limitations, such as the difficulty in handling time series data with variable lengths or missing values, and the sensitivity to the choice of the statistical measure used to identify the extreme cases. Therefore, LAXCAT should be used in conjunction with other, more robust and flexible methods for analyzing medical time series data.

LAXCAT (Learning from Aggregate Xtreme Categories with Automated Transformation) is a feature extraction method for time series data that uses extreme values to create new features. The algorithm can be described mathematically as follows. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a time series dataset with $n$ data points.

• Step 1: Identify the extreme values in the dataset using a statistical measure such as the median absolute deviation (MAD) or the interquartile range (IQR). The extreme values are defined as those points that lie more than $k$ times the MAD away from the median, or more than $k$ times the IQR outside the interquartile range.
• Step 2: Divide the time series data into segments of length $w$ or using a sliding window approach. Let $X_i = \{x_i, x_{i+1}, \ldots, x_{i+w-1}\}$ be a segment of the dataset.
• Step 3: Compute a set of aggregate features for each segment $X_i$. Let $A(X_i) = \{a_1, a_2, \ldots, a_m\}$ be the set of aggregate features for $X_i$, where $m$ is the number of aggregate features. The aggregate features can include the mean, variance, skewness, kurtosis, and other statistical measures.
• Step 4: Transform the aggregate features using a non-linear function such as a sigmoid or a tanh function. Let $T(A(X_i)) = \{t_1, t_2, \ldots, t_m\}$ be the set of transformed features for $X_i$.
• Step 5: Combine the transformed features for all segments to create a new feature matrix $F$. Let $F = [T(A(X_1)), T(A(X_2)), \ldots, T(A(X_{n-w+1}))]$ be the feature matrix, where each row corresponds to a segment of the time series data.


• Step 6: Use the feature matrix F as input to a machine learning algorithm such as a support vector machine or a random forest for classification or clustering tasks.

import numpy as np
from scipy.spatial.distance import euclidean  # imported in the original listing, not used below
from saxpy import SAX  # requires a saxpy variant that provides a SAX class


def laxcat(ts, w, a, d):
    """
    Computes LAXCAT representation of time series

    Parameters:
        ts (numpy array): Time series data
        w (int): Window size for computing mean and standard deviation
        a (int): Alphabet size
        d (int): Dimensionality of LAXCAT representation

    Returns:
        numpy array: LAXCAT representation of time series
    """
    # Compute mean and standard deviation for each window
    # (+ 1 so that the last full window is included)
    means = []
    stds = []
    for i in range(0, len(ts) - w + 1, w):
        window = ts[i:i + w]
        means.append(np.mean(window))
        stds.append(np.std(window))

    # Apply SAX transformation; to_letter_rep is assumed to return
    # a (word, window indices) pair, of which only the word is used
    sax = SAX(wordSize=d, alphabetSize=a)
    means, _ = sax.to_letter_rep(means)
    stds, _ = sax.to_letter_rep(stds)

    # Compute LAXCAT representation
    laxcat_rep = []
    for i in range(d):
        for j in range(a):
            laxcat_rep.append(means[i] + stds[i] + chr(ord('a') + j))
    return np.array(laxcat_rep)


# Example usage
ts = np.random.rand(500)
w = 50
a = 8
d = 5
laxcat_rep = laxcat(ts, w, a, d)
print(laxcat_rep)

Listing 5.3 Simplified LAXCAT example

In Listing 5.3, we first import the necessary libraries: numpy, scipy, and SAX. Then we define the laxcat function, which takes a time series, a window size, an alphabet size, and the dimensionality of the LAXCAT representation as input parameters. Within the function, we first compute the mean and standard deviation of each window of the time series using the provided window size. Then we apply the SAX transformation to the means and standard deviations using the provided alphabet size and dimensionality. Finally, we compute the LAXCAT representation by concatenating, for each of the d positions, the mean symbol, the standard deviation symbol, and each letter of the alphabet. The representation is returned as a numpy array.

In the example usage section, we generate a random time series of length 500 and call the laxcat function with a window size of 50, an alphabet size of 8, and a dimensionality of 5. We then print the resulting LAXCAT representation.
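The six steps listed earlier in this section can also be sketched directly with NumPy and SciPy, without the SAX encoding used in Listing 5.3. The following is only a minimal illustration under stated assumptions, not a reference implementation: the function names, the threshold k = 3, the choice of four aggregates, and tanh as the transformation are all illustrative choices.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def extreme_mask(x, k=3.0):
    """Step 1: flag points farther than k * MAD from the median."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # Degenerate case: no spread, so nothing is flagged.
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - med) > k * mad


def window_features(x, w):
    """Steps 2-5: sliding windows -> aggregates -> tanh -> feature matrix."""
    x = np.asarray(x, dtype=float)
    # One row per window X_i, giving n - w + 1 rows of length w.
    windows = np.lib.stride_tricks.sliding_window_view(x, w)
    aggregates = np.column_stack([
        windows.mean(axis=1),
        windows.var(axis=1),
        skew(windows, axis=1),
        kurtosis(windows, axis=1),
    ])
    # Non-linear transformation T applied element-wise.
    return np.tanh(aggregates)


rng = np.random.default_rng(0)
ts = rng.normal(size=200)
ts[50] = 25.0  # injected extreme value (synthetic)

mask = extreme_mask(ts)
F = window_features(ts, w=20)
print("extreme points:", np.flatnonzero(mask))
print("feature matrix shape:", F.shape)
```

The matrix F can then be passed to any classifier or clustering algorithm (Step 6). In this sketch the outlier mask of Step 1 is computed separately and is not fed into the later steps, since the text does not specify how the two are combined.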


5.3 Heart Diseases Recognition

Heart disease recognition using explainable AI involves using machine learning models to analyze medical data related to heart health and providing explanations for the model’s predictions or recommendations. This type of analysis has several applications in healthcare, including early detection of heart disease and personalized treatment recommendations.

5.3.1 Database

One example of heart disease recognition using explainable AI is using electrocardiogram (ECG) data to predict the presence of heart disease. The dataset (https://physionet.org/content/mvtdb/1.0/) is a set of 135 recordings of abnormal and normal ECG. Machine learning models can be trained on ECG data to classify abnormal heart rhythms, such as atrial fibrillation or ventricular tachycardia, and provide explanations for their predictions. This can help healthcare providers to diagnose heart disease early and provide appropriate treatment.

Another example is using medical imaging data, such as computed tomography (CT) or magnetic resonance imaging (MRI), to predict the presence of heart disease. Machine learning models can be trained on these images to detect structural abnormalities or calcifications in the heart and provide explanations for their predictions. This can help healthcare providers to identify patients at high risk for heart disease and develop personalized treatment plans.

The recordings of the spontaneous ventricular tachyarrhythmia database are loaded using the numpy loadtxt method. The healthy and abnormal sets are loaded in the same way. Listing 5.4 shows the loading of the healthy set.

import os

import numpy as np

heathly_directory = os.getcwd() + "/spontaneous-ventricular-tachyarrhythmia-database-1.0/mr/"

heathly = []

for i in sorted(os.listdir(heathly_directory)):
    try:
        # Keep only the first 800 intervals of each recording
        heathly.append(np.loadtxt(heathly_directory + i)[0:800])
    except Exception:
        # Report files that could not be parsed
        print(i)

# `sick` is loaded the same way from the directory with the abnormal recordings
X = np.concatenate((np.array(sick), np.array(heathly)), axis=0)

# Label 1 for the sick recordings (the first rows of X), 0 for the healthy ones
y = np.ones(len(heathly) + len(sick))
y[len(sick):] = 0

Listing 5.4 ECG time series loading implementation

Each recording has up to 1024 intervals, but because not every recording is that long, the example limits each one to its first 800 intervals. The results for both sets are shown in Fig. 5.1, where all recordings are plotted.
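The truncation to a common length and the construction of the label vector can be illustrated on synthetic data. The sketch below is an assumption-laden stand-in for the real files: the recording lengths are made up, the helper name stack_recordings is hypothetical, and recordings shorter than the target length are simply dropped so that the result is a rectangular array.

```python
import numpy as np


def stack_recordings(recordings, length=800):
    """Keep recordings with at least `length` samples, truncated to `length`."""
    kept = [np.asarray(r, dtype=float)[:length]
            for r in recordings if len(r) >= length]
    return np.vstack(kept)


rng = np.random.default_rng(1)
# Synthetic stand-ins for the loaded files (the lengths are made up).
sick = stack_recordings([rng.random(n) for n in (1024, 900, 700)])
healthy = stack_recordings([rng.random(n) for n in (1024, 800)])

# Sick recordings first, then healthy ones, as in Listing 5.4.
X = np.concatenate((sick, healthy), axis=0)
# Label 1 for sick, 0 for healthy, in the same order as the rows of X.
y = np.concatenate((np.ones(len(sick)), np.zeros(len(healthy))))
print(X.shape, y)
```

Building y by concatenation keeps the labels aligned with the row order of X without any index arithmetic.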


Fig. 5.1 ECG recordings of the healthy and sick patients

5.3.2 Model

The model training is shown in Listing 5.5. In this example, we look for two shapelets of a fixed length (line 12). The data is normalized before it is used for training.

from tslearn.shapelets import LearningShapelets
from tslearn.preprocessing import TimeSeriesScalerMinMax
from sklearn.model_selection import train_test_split
import tensorflow as tf

model = LearningShapelets(n_shapelets_per_size={3: 2})  # replaced by the configured model below

X = TimeSeriesScalerMinMax().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=6)

shapelet_sizes = {80: 2}  # shapelet length -> number of shapelets

model = LearningShapelets(n_shapelets_per_size=shapelet_sizes,
                          optimizer=tf.optimizers.Adam(.01), batch_size=16,
                          weight_regularizer=.01, max_iter=200,
                          random_state=42, verbose=0)

model.fit(X_train, y_train)
train_distances = model.transform(X_train)

Listing 5.5 Shapelets model learning
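The normalization step rescales every series on its own to a fixed range. A plain NumPy sketch of per-series min-max scaling is given below for illustration. It assumes a 2-D array with one series per row and a target range of [0, 1]; the function name minmax_per_series is hypothetical and this is not the tslearn implementation.

```python
import numpy as np


def minmax_per_series(X, low=0.0, high=1.0):
    """Rescale every row of X independently to the range [low, high]."""
    X = np.asarray(X, dtype=float)
    mn = X.min(axis=1, keepdims=True)
    mx = X.max(axis=1, keepdims=True)
    # Guard against constant series, which would cause a division by zero.
    span = np.where(mx > mn, mx - mn, 1.0)
    return low + (X - mn) / span * (high - low)


rng = np.random.default_rng(2)
# Two synthetic series with very different amplitudes.
raw = np.vstack([rng.normal(600, 50, size=800), rng.normal(900, 5, size=800)])
scaled = minmax_per_series(raw)
print(scaled.min(axis=1), scaled.max(axis=1))
```

Scaling each series separately removes amplitude differences between patients, so the learned shapelets describe the form of a pattern rather than its absolute level.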


The shapelet distances can next be used in clustering to identify the differences between classes.
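To make the notion of a shapelet distance concrete, the sketch below computes the distance between one series and one shapelet as the smallest mean squared difference over all subsequences of the series, following the definition used in (Grabocka et al. 2014). The function name and the synthetic data are assumptions made for illustration; the exact scaling applied by a given library may differ.

```python
import numpy as np


def shapelet_distance(ts, shapelet):
    """Return (minimum mean squared difference, start index of best match)."""
    ts = np.asarray(ts, dtype=float)
    shapelet = np.asarray(shapelet, dtype=float)
    # All subsequences of ts that have the same length as the shapelet.
    windows = np.lib.stride_tricks.sliding_window_view(ts, len(shapelet))
    dists = ((windows - shapelet) ** 2).mean(axis=1)
    pos = int(np.argmin(dists))
    return float(dists[pos]), pos


rng = np.random.default_rng(3)
series = rng.normal(size=100)
pattern = series[30:40].copy()  # a pattern that occurs exactly in the series

dist, pos = shapelet_distance(series, pattern)
print(dist, pos)
```

A small distance means that the pattern occurs somewhere in the series. With several shapelets, each series is described by one distance per shapelet, and these vectors are what a clustering algorithm would operate on.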

from tslearn.shapelets import grabocka_params_to_shapelet_size_dict

n_ts, ts_sz = X.shape[:2]
n_classes = len(set(y))

shapelet_sizes = grabocka_params_to_shapelet_size_dict(n_ts=n_ts, ts_sz=ts_sz, n_classes=n_classes, l=0.1, r=1)

Listing 5.6 Estimation of the number and length of shapelets

Fixing the number of shapelets and their length by hand is not the best approach. There are different ways to find these values; one of them is proposed in (Grabocka et al. 2014). In our example, we get 5 shapelets with a length of 80 intervals (Listing 5.6).
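The heuristic can be written out in a few lines. The sketch below is an assumption about how the helper derives its values (base length l times the series length, r length scales, and a logarithmic rule for the number of shapelets); it should be checked against the tslearn documentation before being relied on. The count of 200 series is hypothetical and was chosen only for illustration.

```python
import math


def shapelet_size_dict(n_ts, ts_sz, n_classes, l, r):
    """Sketch of the heuristic: shapelet length -> number of shapelets."""
    sizes = {}
    base_length = int(l * ts_sz)
    for scale in range(1, r + 1):
        length = base_length * scale
        # Number of candidate subsequences, scaled by the number of classes.
        n_candidates = n_ts * (ts_sz - length + 1) * (n_classes - 1)
        sizes[length] = int(math.log10(n_candidates))
    return sizes


# Hypothetical data set: 200 series of 800 intervals, two classes.
print(shapelet_size_dict(n_ts=200, ts_sz=800, n_classes=2, l=0.1, r=1))
```

For these hypothetical inputs the sketch gives five shapelets of length 80, which is consistent with the values reported in the text. Because the count grows only logarithmically, a much larger data set would add just a few shapelets.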

5.3.3 Explanation with Shapelets

The explanation of a time series using the shapelets can be drawn as shown in Listing 5.7. It uses the shapelet_sizes dictionary, which holds the number of shapelets and their length.

import matplotlib.pyplot as plt
from tslearn.utils import ts_size

plt.figure()
# One subplot per shapelet length
for i, sz in enumerate(shapelet_sizes.keys()):
    plt.subplot(len(shapelet_sizes), 1, i + 1)
    plt.title("%d shapelets of size %d" % (shapelet_sizes[sz], sz))
    # Draw every learned shapelet of the current length
    for shp in model.shapelets_:
        if ts_size(shp) == sz:
            plt.plot(shp.ravel())
    plt.xlim([0, max(shapelet_sizes.keys()) - 1])

plt.tight_layout()
plt.show()

Listing 5.7 Shapelets plot for the ECG data set

The two-shapelet example results in the plot shown in Fig. 5.2. There is a visible difference in the part between intervals 0 and 50.

5.3.4 SAX Explanation

The SAX method, unlike many other explainability methods, is used directly on the time series. In the spontaneous ventricular tachyarrhythmia set, SAX finds anomalies in the series of both healthy and sick patients (Listing 5.8).

from saxpy.hotsax import find_discords_hotsax

# Search for discords (anomalous subsequences) in one series of each set
discords_heathly = find_discords_hotsax(heathly[1])
discords_sick = find_discords_hotsax(sick[1])

Listing 5.8 SAX anomaly detection


Fig. 5.2 Shapelets of the heart disease data set plot

By default, find_discords_hotsax finds two anomalies. For the first (healthy) series, the anomalies are found at intervals 553 and 127. For the abnormal series, the anomalies are at intervals 189 and 14. For a better understanding, a greater number of series might be required.
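The idea of a discord can be illustrated without the library. A discord is the subsequence whose nearest non-overlapping neighbour is farthest away. The sketch below is a brute-force search on a synthetic signal; it is not the HOT SAX algorithm itself, which uses SAX words to order the same search more efficiently. The function names, the window size, and the test signal are assumptions made for illustration.

```python
import numpy as np


def znorm(x, eps=1e-8):
    """Z-normalize a subsequence; nearly constant ones are only centred."""
    sd = x.std()
    return (x - x.mean()) / sd if sd > eps else x - x.mean()


def brute_force_discord(series, win_size):
    """Return (start index, distance) of the most anomalous subsequence."""
    series = np.asarray(series, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(series, win_size)
    subs = np.array([znorm(w) for w in windows])
    n = len(subs)
    best_pos, best_dist = -1, -np.inf
    for i in range(n):
        dists = np.linalg.norm(subs - subs[i], axis=1)
        # Ignore trivial matches, i.e. windows that overlap window i.
        dists[max(0, i - win_size + 1):min(n, i + win_size)] = np.inf
        nearest = dists.min()
        if np.isfinite(nearest) and nearest > best_dist:
            best_pos, best_dist = i, nearest
    return best_pos, best_dist


t = np.arange(400)
signal = np.sin(2 * np.pi * t / 40)  # periodic synthetic "rhythm"
signal[200:210] += 2.0               # injected anomaly

pos, dist = brute_force_discord(signal, win_size=40)
print(pos, dist)
```

Every regular window of the sine wave has an almost identical copy one period away, so only windows that overlap the injected bump have a distant nearest neighbour. The reported position therefore falls on a window that covers part of the bump.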

References

Grabocka J, Schilling N, Wistuba M, Schmidt-Thieme L (2014) Learning time-series shapelets. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 392–401
Hsieh TY, Wang S, Sun Y, Honavar V (2021) Explainable multivariate time series classification: a deep neural network which learns to attend to important variables as well as time intervals. In: Proceedings of the 14th ACM international conference on web search and data mining, pp 607–615
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Advances in neural information processing systems, vol 30
Ye L, Keogh E (2011) Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Min Knowl Disc 22(1):149–182

6 Summary

“Explainable Machine Learning in Medicine” is a comprehensive and insightful book that delves into the emerging field of explainable artificial intelligence (XAI) and its applications in the medical domain. This book provides a thorough exploration of the challenges and opportunities that arise when integrating machine learning models into medical decision-making processes.

This book begins by introducing the fundamental concepts of machine learning and its growing impact on medicine. It highlights the need for transparency and interpretability in medical AI systems to foster trust and acceptance among healthcare professionals and patients. The authors emphasize the critical role of explainability in ensuring the effective and ethical deployment of machine learning algorithms in clinical practice.

Throughout this book, various techniques and methods for explainable machine learning are discussed in detail. The authors present a range of interpretability approaches, including rule-based models, surrogate models, and model-agnostic methods. They also explore the concepts of feature importance, local explanations, and model visualization, which enable healthcare practitioners to understand the inner workings of complex machine learning models.

One key aspect of the book is its focus on specific challenges and considerations when applying explainable machine learning in medicine. The authors address issues such as data quality, bias, and privacy concerns, which are particularly relevant in healthcare settings. They offer practical guidelines and best practices for addressing these challenges, ensuring that the implementation of machine learning models aligns with the unique requirements of the medical field.
Furthermore, “Explainable Machine Learning in Medicine” showcases real-world case studies and examples, illustrating how explainable AI has been successfully applied to clinical decision support, disease diagnosis, treatment recommendation systems, and personalized medicine. These case studies provide valuable insights into the potential benefits of incorporating explainability into medical AI systems, thereby enhancing their reliability, interpretability, and overall impact on patient care.

In summary, “Explainable Machine Learning in Medicine” is an indispensable resource for researchers, practitioners, and policymakers involved in the intersection between machine learning and healthcare. It presents a comprehensive overview of the principles, techniques, and challenges of explainable machine learning with a specific focus on its application and implications in the medical field. By shedding light on the black box of machine learning models, this book paves the way for the responsible and transparent adoption of AI technology in healthcare, ultimately leading to improved patient outcomes.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Przystalski and R. M. Thanki, Explainable Machine Learning in Medicine, Synthesis Lectures on Engineering, Science, and Technology, https://doi.org/10.1007/978-3-031-44877-5_6