Artificial Intelligence in Medicine: Technical Basis and Clinical Applications 9780128212592, 0128212594

Artificial Intelligence in Medicine: Technical Basis and Clinical Applications presents a comprehensive overview of the field.


English Pages 570 [545] Year 2020


Table of contents:
Artificial Intelligence in Medicine
Copyright
Dedication
Contents
List of contributors
Foreword
References
Preface
Acknowledgments
1 Artificial intelligence in medicine: past, present, and future
1.1 Introduction
1.2 A brief history of artificial intelligence and its applications in medicine
1.3 How intelligent is artificial intelligence?
1.4 Artificial intelligence, machine learning, and precision medicine
1.5 Algorithms and models
1.6 Health data sources and types
1.7 The promise
1.8 The challenges
1.8.1 Quality and completeness of training data
1.8.2 Trust and performance: the case for model interpretability
1.8.3 Beyond performance and interpretability: causality
1.8.4 Defining the question, measuring real-world impact
1.8.5 Maximizing information gain across modalities, tasks, populations, and time
1.8.6 Quality assessment and expert supervision
1.9 Making it a reality: integrating artificial intelligence into the human workforce of a learning health system
References
2 Artificial intelligence in medicine: Technical basis and clinical applications
2.1 Introduction
2.2 Technology used in clinical artificial intelligence tools
2.2.1 Elements of artificial intelligence algorithms
2.2.1.1 Activation functions
2.2.1.2 Fully connected layer
2.2.1.3 Dropout
2.2.1.4 Residual blocks
2.2.1.5 Initialization
2.2.1.6 Convolution and transposed convolution
2.2.1.7 Inception layers
2.2.2 Popular artificial intelligence software architectures
2.2.2.1 Neural networks and fully connected networks
2.2.2.2 Convolutional neural networks
2.2.2.3 U-Nets and V-Nets
2.2.2.4 DenseNets
2.2.2.5 Generative adversarial networks
2.2.2.6 Hybrid generative adversarial network designs
2.3 Clinical applications
2.3.1 Applications of regression
2.3.1.1 Bone age
2.3.1.2 Brain age
2.3.2 Applications of segmentation
2.3.3 Applications of classification
2.3.3.1 Detection of disease
2.3.3.2 Diagnosis of disease class
2.3.3.3 Prediction of molecular markers
2.3.3.4 Prediction of outcome and survival
2.3.4 Deep learning for improved image reconstruction
2.4 Future directions
2.4.1 Understanding what artificial intelligence “sees”
2.4.2 Workflow
2.5 Conclusion
References
3 Deep learning for biomedical videos: perspective and recommendations
3.1 Introduction
3.2 Video datasets
3.3 Semantic segmentation
3.4 Object detection and tracking
3.5 Motion classification
3.6 Future directions and conclusion
References
4 Biomedical imaging and analysis through deep learning
4.1 Introduction
4.2 Tomographic image reconstruction
4.2.1 Foundation
4.2.2 Computed tomography
4.2.3 Magnetic resonance imaging
4.2.4 Other imaging modalities
4.3 Image segmentation
4.3.1 Introduction
4.3.2 Localization versus segmentation
4.3.3 Fully convolutional networks
4.3.4 Regions with convolutional neural network features
4.3.5 A priori information
4.3.6 Manual labeling
4.3.7 Semisupervised and unsupervised approaches
4.4 Image registration
4.4.1 Single-modality image registration
4.4.2 Multimodality image registration
4.5 Deep-learning-based radiomics
4.5.1 Detection
4.5.2 Characterization and diagnosis
4.5.3 Prognosis
4.5.4 Assessment and prediction of response to treatment
4.5.5 Assessment of risk of future cancer
4.6 Summary and outlook
References
5 Expert systems in medicine
5.1 Introduction
5.2 A brief history
5.3 Methods
5.3.1 Expert system architecture
5.3.2 Knowledge representation and management
5.3.3 Uncertainty, probabilistic reasoning, fuzzy logic
5.3.3.1 Uncertainty
5.3.3.2 Probabilistic reasoning
5.3.3.3 Fuzzy logic
5.4 Applications
5.4.1 Computer-assisted diagnosis
5.4.2 Computer-assisted therapy
5.4.3 Medication alert systems
5.4.4 Reminder systems
5.5 Challenges
5.5.1 Workflow integration
5.5.2 Clinician acceptance and alert fatigue
5.5.3 Knowledge maintenance
5.5.4 Standard, transferability, and interoperability
5.6 Future directions
References
6 Privacy-preserving collaborative deep learning methods for multiinstitutional training without sharing patient data
6.1 Introduction
6.2 Variants of distributed learning
6.2.1 Model ensembling
6.2.2 Cyclical weight transfer
6.2.3 Federated learning
6.2.4 Split learning
6.3 Handling data heterogeneity
6.4 Protecting patient privacy
6.5 Publicly available software
6.6 Conclusion
References
7 Analytics methods and tools for integration of biomedical data in medicine
7.1 The rise of multimodal data in biology and medicine
7.1.1 The emergence of various sequencing techniques
7.1.1.1 Bulk sequencing
7.1.1.2 Single-cell sequencing
7.1.2 The increasing need for combining images and omics in clinical applications
7.1.2.1 Various modalities of images in clinics
7.1.2.2 The rise of radiomics: combine medical images with omics
7.1.3 The availability of large-scale public health data
7.2 The challenges in multimodal data—problems with learning from multiple sources of data
7.2.1 The imperfect generation of single-cell data
7.2.1.1 The complementariness of various sources of data
7.2.2 The issues of generalizability of machine learning
7.3 Machine learning algorithms in integrating medical and biological data
7.3.1 Genome-wide data integration with machine learning
7.3.1.1 How to integrate various omics for cancer subtyping
7.3.1.2 How to integrate single-cell multiomics for precision medicine
7.3.2 Data integration beyond omics—an example with cardiovascular diseases
7.3.2.1 How to integrate various image modalities such as magnetic resonance imaging and computed tomography scans
7.3.2.2 How to better the diagnosis by linking images with electrocardiograms
7.3.3 Multimodal decision-making in clinical settings
7.4 Future directions
References
8 Electronic health record data mining for artificial intelligence healthcare
8.1 Introduction
8.2 Overview of the electronic health record
8.2.1 History of the electronic health record
8.2.2 Core functions of an electronic health record
8.2.3 Electronic health record ontologies and data standards
8.3 Clinical decision support
8.3.1 Healthcare primed for clinical decision support
8.4 Areas of artificial intelligence augmentation for electronic health records
8.4.1 Artificial intelligence to improve data entry and extraction
8.4.2 Optimizing care
8.4.3 Predictions
8.4.4 Hospital outcomes
8.4.5 Sepsis and infections
8.4.6 Oncology
8.5 Limitations of artificial intelligence and next steps
References
9 Roles of artificial intelligence in wellness, healthy living, and healthy status sensing
9.1 Introduction
9.2 Diet
9.3 Fitness and physical activity
9.4 Sleep
9.5 Sexual and reproductive health
9.6 Mental health
9.7 Behavioral factors
9.8 Environmental and social determinants of health
9.9 Remote screening tools
9.10 Conclusion
References
10 The growing significance of smartphone apps in data-driven clinical decision-making: Challenges and pitfalls
10.1 Introduction
10.2 Distribution of apps in the field of medicine
10.3 Distribution of apps over different locations
10.4 Reporting applications development approaches
10.5 Decision-support modalities
10.6 Camera-based apps
10.7 Guideline/algorithm applications
10.8 Predictive modeling applications
10.9 Sensor-linked apps
10.10 Discussion
10.11 Summary
References
11 Artificial intelligence for pathology
11.1 Introduction
11.2 Deep neural networks
11.2.1 Convolutional neural networks
11.2.2 Fully convolutional networks
11.2.3 Generative adversarial networks
11.2.4 Stacked autoencoders
11.2.5 Recurrent neural networks
11.3 Deep learning in pathological image analysis
11.3.1 Image classification
11.3.1.1 Image-level classification
11.3.1.2 Object-level classification
11.3.2 Object detection
11.3.2.1 Detection of particular types of objects
11.3.2.2 Detection of objects without category labeling
11.3.2.3 Detection of objects with category labeling
11.3.3 Image segmentation
11.3.3.1 Nucleus/cell segmentation
11.3.3.2 Gland segmentation
11.3.3.3 Segmentation of other biological structures or tissues
11.3.4 Stain normalization
11.3.5 Image superresolution
11.3.6 Computer-aided diagnosis
11.3.7 Others
11.4 Summary
11.4.1 Open challenges and future directions of deep learning in pathology image analysis
11.4.1.1 Quality control
11.4.1.2 High image dimension
11.4.1.3 Object crowding
11.4.1.4 Data annotation issues
11.4.1.5 Integration of different types of input data
11.4.2 Outlook of clinical adoption of artificial intelligence
11.4.2.1 Potential applications
11.4.2.2 Barriers to clinical adoption
11.4.2.2.1 Lagging adoption of digital pathology
11.4.2.2.2 Lack of standards for interfacing AI to clinical systems
11.4.2.2.3 Regulatory concerns
11.4.2.2.4 Computational requirements
11.4.2.2.5 Algorithm explainability
11.4.2.2.6 Pathologists’ skepticism
References
12 The potential of deep learning for gastrointestinal endoscopy—a disruptive new technology
12.1 Introduction
12.2 Applications of artificial intelligence in video capsule endoscopy
12.2.1 Introduction
12.2.2 Decreasing read time
12.2.3 Anatomical landmark identification
12.2.4 Improving sensitivity
12.2.5 Recent developments
12.3 Applications of artificial intelligence in upper endoscopy
12.3.1 Introduction
12.3.2 Esophageal cancer
12.3.3 Gastric cancer
12.3.4 Upper endoscopy quality
12.3.5 Future directions
12.4 Applications of artificial intelligence in colonoscopy
12.4.1 Introduction
12.4.2 Cecal intubation rate and cecal intubation time
12.4.3 Withdrawal time
12.4.4 Boston Bowel Prep Scoring
12.4.5 Polyp detection
12.4.6 Polyp size
12.4.7 Polyp morphology
12.4.8 Polyp pathology
12.4.9 Tools
12.4.10 Mayo endoscopic subscore
12.5 Conclusion
12.6 Future directions
References
13 Lessons learnt from harnessing deep learning for real-world clinical applications in ophthalmology: detecting diabetic retinopathy from retinal fundus photographs
13.1 Introduction
13.2 Historical artificial intelligence for diabetic retinopathy
13.3 Deep learning era
13.4 Lessons from interpreting and evaluating studies
13.5 Important factors for real-world usage
13.6 Regulatory approvals and further validation
13.7 Toward patient impact and beyond
13.8 Summary
Conflict of interest
References
14 Artificial intelligence in radiology
14.1 Introduction
14.2 Thoracic applications
14.2.1 Pulmonary analysis in chest X-ray
14.2.2 Pulmonary analysis in computerized tomography
14.2.2.1 Lung, lobe, and airway segmentation
14.2.2.2 Interstitial lung disease pattern recognition
14.3 Abdominal applications
14.3.1 Pancreatic cancer analysis in computerized tomography and magnetic resonance imaging
14.3.1.1 Pancreas segmentation in computerized tomography and magnetic resonance imaging
14.3.1.2 Pancreatic tumor segmentation and detection in computerized tomography and magnetic resonance imaging
14.3.1.3 Prediction and prognosis with pancreatic cancer imaging
14.3.2 AI in other abdominal imaging
14.4 Pelvic applications
14.5 Universal lesion analysis
14.5.1 DeepLesion dataset
14.5.2 Lesion detection and classification
14.5.3 Lesion segmentation and quantification
14.5.4 Lesion retrieval and mining
14.6 Conclusion
References
15 Artificial intelligence and interpretations in breast cancer imaging
15.1 Introduction
15.2 Artificial intelligence in decision support
15.3 Artificial intelligence in breast cancer screening
15.4 Artificial intelligence in breast cancer risk assessment: density and parenchymal pattern
15.5 Artificial intelligence in breast cancer diagnosis and prognosis
15.6 Artificial intelligence for treatment response, risk of recurrence, and cancer discovery
15.7 Conclusion and discussion
References
16 Prospect and adversity of artificial intelligence in urology
16.1 Introduction
16.2 Basic examinations in urology
16.2.1 Urinalysis and urine cytology
16.2.2 Ultrasound examination
16.3 Urological endoscopy
16.3.1 Cystoscopy and transurethral resection of the bladder
16.3.2 Ureterorenoscopy
16.4 Andrology
16.5 Diagnostic imaging
16.5.1 Prostate
16.5.2 Kidney
16.5.3 Ureter and bladder
16.6 Robotic surgery
16.6.1 Preoperative preparation
16.6.2 Navigation
16.6.3 Automated maneuver
16.7 Risk prediction
16.8 Future direction
References
17 Meaningful incorporation of artificial intelligence for personalized patient management during cancer: Quantitative imaging, risk assessment, and therapeutic outcomes
17.1 Introduction
17.1.1 Workflow
17.1.1.1 Data acquisition
17.1.1.2 Preprocessing
17.1.1.3 Model building and evaluation
17.1.1.4 Inference
17.1.2 Meaningful incorporation of machine learning
17.2 Quantitative imaging
17.2.1 Brief overview of the physics of imaging modalities
17.2.2 Use of artificial intelligence in different stages of a quantitative imaging workflow
17.3 Risk assessment in cancer
17.4 Therapeutic outcome prediction
17.4.1 Chemotherapy
17.4.2 Radiation therapy
17.5 Using artificial intelligence meaningfully
17.6 Summary
References
18 Artificial intelligence in oncology
Abbreviations
18.1 Introduction
18.2 Electronic health records and clinical data warehouse
18.2.1 Data reuse for research purposes
18.2.2 Data reuse and artificial intelligence
18.2.3 Data reuse for patient care
18.3 Artificial intelligence applications for imaging in oncology
18.3.1 Applications in oncology for diagnosis and prediction
18.3.1.1 Computer vision and image analysis
18.3.1.2 Radiomics: data-driven biomarker discovery
18.3.1.3 Artificial intelligence–assisted diagnosis and monitoring in oncology
18.3.1.4 Treatment outcome assessment and prediction
18.3.2 Applications in oncology to improve exam quality and workflow
18.3.2.1 Improvement of image acquisition
18.3.2.2 Image segmentation
18.3.2.3 Improved workflow
18.3.2.4 Interventional radiology
18.4 Artificial intelligence applications for radiation oncology
18.4.1 Treatment planning
18.4.1.1 Segmentation
18.4.1.1.1 Brain
18.4.1.1.2 Head and neck
18.4.1.1.3 Lung
18.4.1.1.4 Abdomen
18.4.1.1.5 Pelvis
18.4.1.2 Dosimetry
18.4.2 Outcome prediction
18.4.2.1 Treatment response
18.4.2.1.1 Brain
18.4.2.1.2 Head and neck
18.4.2.1.3 Lung
18.4.2.1.4 Esophagus
18.4.2.1.5 Rectum
18.4.2.2 Toxicity
18.5 Future directions
References
19 Artificial intelligence in cardiovascular imaging
19.1 Introduction
19.2 Types of machine learning
19.3 Deep learning
19.4 Role of artificial intelligence in echocardiography
19.5 Role of artificial intelligence computed tomography
19.6 Role of artificial intelligence in nuclear cardiology
19.7 Role of artificial intelligence in cardiac magnetic resonance imaging
19.8 Role of artificial intelligence in electrocardiogram
19.9 The role of artificial intelligence in large databases
19.10 Our views on machine learning
19.11 Conclusion
References
20 Artificial intelligence as applied to clinical neurological conditions
20.1 Introduction to artificial intelligence in neurology
20.2 Integration with clinical workflow
20.2.1 Diagnosis
20.2.2 Risk prognostication
20.2.3 Surgical planning
20.2.4 Intraoperative guidance and enhancement
20.2.5 Neurophysiological monitoring
20.2.6 Clinical decision support
20.2.7 Theoretical neurological artificial intelligence research
20.3 Currently adopted methods in clinical use
20.4 Challenges
20.4.1 Data volume
20.4.2 Data quality
20.4.3 Generalizability
20.4.4 Interpretability
20.4.5 Legal
20.4.6 Ethical
20.5 Conclusion
References
21 Harnessing the potential of artificial neural networks for pediatric patient management
21.1 Introduction
21.2 Applications of artificial intelligence in diagnosis and prognosis
21.2.1 Prematurity
21.2.2 Childhood brain tumors
21.2.3 Epilepsy and seizure disorders
21.2.4 Autism spectrum disorder
21.2.5 Mood disorders and psychoses
21.2.6 Hydrocephalus
21.2.7 Traumatic brain injury
21.2.8 Molecular mechanisms of disease
21.2.9 Other disease entities
21.3 Transition to treatment decision-making using artificial intelligence
21.4 Future directions
References
22 Artificial intelligence–enabled public health surveillance—from local detection to global epidemic monitoring and control
22.1 Introduction
22.2 Artificial intelligence–enhanced data analysis for outbreak detection and early warning
22.2.1 Analyzing data collected from the physical world
22.2.2 Analyzing data from the cyberspace
22.2.3 From syndromic to pre-syndromic disease surveillance: A safety net for public health
22.3 Artificial intelligence–enhanced prediction in support of public health surveillance
22.3.1 Time series prediction based on dependent variables
22.3.2 Time series prediction based on dependent and independent variables
22.4 Artificial intelligence–based infectious disease transmission modeling and response assessment
22.4.1 Modeling disease transmission dynamics based on machine learning and complex networks
22.4.2 Modeling disease transmission dynamics based on multiagent modeling
22.5 Internet-based surveillance systems for global epidemic monitoring
22.6 Conclusion
References
23 Regulatory, social, ethical, and legal issues of artificial intelligence in medicine
23.1 Introduction
23.2 Ethical issues in data acquisition
23.2.1 Ethical issues arising from each type of data source
23.2.1.1 Ethical issues common to all data sources: Privacy and confidentiality
23.2.1.2 Ethical issues unique to each data source: Issues of consent
23.2.1.2.1 Issues of consent with data from research repositories
23.2.1.2.2 Return of results from research repositories
23.2.1.2.3 Issues of consent with clinical or public health data
23.2.1.2.4 Incidental or secondary findings in clinical or public health data
23.2.1.2.5 Issues of consent with nonclinically collected data
23.2.2 Future directions: Toward a new model of data stewardship
23.3 Application problems: Problems with learning from the data
23.3.1 Values embedded in algorithm design
23.3.2 Biases in the data themselves
23.3.3 Biases in the society in which the data occurs
23.3.4 Issues of implementation
23.3.5 Summary
23.4 Issues in regulation
23.4.1 Challenges to existing regulatory frameworks
23.4.2 Challenges in oversight and regulation of artificial intelligence used in healthcare
23.4.3 Regulation of safety and efficacy
23.4.4 Privacy and data protection
23.4.5 Transparency, liability, responsibility, and trust
23.5 Implications for the ethos of medicine
23.6 Future directions
References
24 Industry perspectives and commercial opportunities of artificial intelligence in medicine
24.1 Introduction
24.2 Exciting growth of artificial intelligence in medicine
24.3 A framework on development of artificial intelligence in medicine
24.3.1 The power of public attention and funding
24.3.2 Technology relies on continuous innovation
24.3.3 Practical applications bring the innovation to the real world
24.3.4 Market adoption defines the success
24.3.5 Apply the framework to the current and future market
24.3.6 Patient privacy
24.3.7 Approving a moving target
24.3.8 Accountability and transparency
24.4 Business opportunity of artificial intelligence in medicine
References
25 Outlook of the future landscape of artificial intelligence in medicine and new challenges
25.1 Overview of artificial intelligence in health care
25.1.1 Models dealing with input and output data from the same domain
25.1.2 Deep learning as applied to problems with input and output related by physical/mathematical law
25.1.3 Models with input and output data domains related by empirical evidence or measurements
25.1.4 Applications beyond traditional indications
25.2 Challenges ahead and issues relevant to the practical implementation of artificial intelligence in medicine
25.2.1 Technical challenges
25.2.2 Data, data curation, and sharing
25.2.3 Data and potential bias in artificial intelligence
25.2.4 Workflow and practical implementation
25.2.5 Clinical tests
25.2.6 Economical, political, social, ethical, and legal aspects
25.2.7 Education and training
25.3 Future directions and opportunities
25.4 Summary and outlook
References
Index

ARTIFICIAL INTELLIGENCE IN MEDICINE

ARTIFICIAL INTELLIGENCE IN MEDICINE
Technical Basis and Clinical Applications

Edited by

LEI XING Department of Radiation Oncology, Stanford University, Stanford, CA, United States

MARYELLEN L. GIGER Committee on Medical Physics, The University of Chicago, Chicago, IL, United States

JAMES K. MIN Cleerly Inc., New York, NY, United States

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2021 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-821259-2

For Information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Stacy Masucci
Senior Acquisitions Editor: Rafael E. Teixeira
Editorial Project Manager: Mona Zahir
Production Project Manager: Niranjan Bhaskaran
Senior Cover Designer: Miles Hitchen
Typeset by MPS Limited, Chennai, India

Dedication

To my mentors, Dr. George T.Y. Chen and Dr. Arthur L. Boyer, who taught me the art of medical physics. Their guidance, encouragement, support, and friendship have greatly enriched the neural networks in my brain!
—Lei Xing

To all the medical physicists and their collaborators who have pushed the envelope of AI in medical imaging and enabled its translation through rigorous evaluations.
—Maryellen L. Giger

To my collaborators and mentors in data science, computer vision, clinical trials, and cardiology, who have graciously included me as a participant in this AI medical revolution that we are about to experience.
—James K. Min

Contents

List of contributors xiii
Foreword xvii
Preface xxi
Acknowledgments xxiii

I Introduction

1 Artificial intelligence in medicine: past, present, and future 3
Efstathios D. Gennatas and Jonathan H. Chen

2 Artificial intelligence in medicine: Technical basis and clinical applications 19
Bradley J. Erickson

II Technical basis 35

3 Deep learning for biomedical videos: perspective and recommendations 37
David Ouyang, Zhenqin Wu, Bryan He and James Zou

4 Biomedical imaging and analysis through deep learning 49
Karen Drukker, Pingkun Yan, Adam Sibley and Ge Wang

5 Expert systems in medicine 75
Li Zhou and Margarita Sordo

6 Privacy-preserving collaborative deep learning methods for multiinstitutional training without sharing patient data 101
Ken Chang, Praveer Singh, Praneeth Vepakomma, Maarten G. Poirot, Ramesh Raskar, Daniel L. Rubin and Jayashree Kalpathy-Cramer

7 Analytics methods and tools for integration of biomedical data in medicine 113
Lin Zhang, Mehran Karimzadeh, Mattea Welch, Chris McIntosh and Bo Wang

III Clinical applications 131

8 Electronic health record data mining for artificial intelligence healthcare 133
Anthony L. Lin, William C. Chen and Julian C. Hong

9 Roles of artificial intelligence in wellness, healthy living, and healthy status sensing 151
Peter Jaeho Cho, Karnika Singh and Jessilyn Dunn

10 The growing significance of smartphone apps in data-driven clinical decision-making: Challenges and pitfalls 173
Iva Halilaj, Yvonka van Wijk, Arthur Jochems and Philippe Lambin

11 Artificial intelligence for pathology 183
Fuyong Xing, Xuhong Zhang and Toby C. Cornish

12 The potential of deep learning for gastrointestinal endoscopy—a disruptive new technology 223
Robin Zachariah, Christopher Rombaoa, Jason Samarasena, Duminda Suraweera, Kimberly Wong and William Karnes

13 Lessons learnt from harnessing deep learning for real-world clinical applications in ophthalmology: detecting diabetic retinopathy from retinal fundus photographs 247
Yun Liu, Lu Yang, Sonia Phene and Lily Peng

14 Artificial intelligence in radiology 265
Dakai Jin, Adam P. Harrison, Ling Zhang, Ke Yan, Yirui Wang, Jinzheng Cai, Shun Miao and Le Lu

15 Artificial intelligence and interpretations in breast cancer imaging 291
Hui Li and Maryellen L. Giger

16 Prospect and adversity of artificial intelligence in urology 309
Okyaz Eminaga and Joseph C. Liao

17 Meaningful incorporation of artificial intelligence for personalized patient management during cancer: Quantitative imaging, risk assessment, and therapeutic outcomes 339
Elisa Warner, Nicholas Wang, Joonsang Lee and Arvind Rao

18 Artificial intelligence in oncology 361
Jean-Emmanuel Bibault, Anita Burgun, Laure Fournier, André Dekker and Philippe Lambin

19 Artificial intelligence in cardiovascular imaging 383
Karthik Seetharam and James K. Min

20 Artificial intelligence as applied to clinical neurological conditions 395
Daniel L. Ranti, Aly Al-Amyn Valliani, Anthony Costa and Eric Karl Oermann

21 Harnessing the potential of artificial neural networks for pediatric patient management 415
Jennifer Quon, Michael C. Jin, Jayne Seekins and Kristen W. Yeom

22 Artificial intelligence–enabled public health surveillance—from local detection to global epidemic monitoring and control 437
Daniel Zeng, Zhidong Cao and Daniel B. Neill

IV Future outlook 455

23 Regulatory, social, ethical, and legal issues of artificial intelligence in medicine 457
Emily Shearer, Mildred Cho and David Magnus

24 Industry perspectives and commercial opportunities of artificial intelligence in medicine 479
Rebecca Y. Lin and Jeffery B. Alvarez

25 Outlook of the future landscape of artificial intelligence in medicine and new challenges 503
Lei Xing, Daniel S. Kapp, Maryellen L. Giger and James K. Min

Index 527

List of contributors

Jeffery B. Alvarez Strategy and Global Business Development, Potrero Medical, Hayward, CA, United States

Jean-Emmanuel Bibault Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, United States

Andre Dekker Department of Radiation Oncology (MAASTRO), GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands

Karen Drukker Department of Radiology, University of Chicago, Chicago, IL, United States

Jinzheng Cai Bethesda Research Lab, PAII Inc, Bethesda, MD, United States

Jessilyn Dunn Department of Biomedical Engineering, Duke University Medical Center, Durham, NC, United States

Anita Burgun Cordeliers Research Center, Paris University, Paris, France

Zhidong Cao State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China Ken Chang Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, United States Jonathan H. Chen Center for Biomedical Informatics Research and Division of Hospital Medicine, Stanford University School of Medicine, Stanford, CA, United States William C. Chen Department of Radiation Oncology, University of California San Francisco School of Medicine, San Francisco, CA, United States Mildred Cho Department of Pediatrics and Medicine, Stanford University School of Medicine, Stanford, CA, United States Peter Jaeho Cho Department of Biomedical Engineering, Duke University, Durham, NC, United States Toby C. Cornish Department of Pathology, University of Colorado School of Medicine, Aurora, CO, United States Anthony Costa Department of Neurosurgery, Mount Sinai School of Medicine, New York, NY, United States

Okyaz Eminaga Department of Urology, Stanford University School of Medicine, Stanford, CA, United States Bradley J. Erickson Department of Radiology, Mayo Clinic, Rochester, MN, United States Laure Fournier Department of Radiology, Georges Pompidou European Hospital, Assistance Publique - Hoˆpitaux de Paris, Paris University, Paris, France Sanjiv Sam Gambhir Department of Radiology, Stanford University School of Medicine, Stanford, CA, United States Efstathios D. Gennatas Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, United States Maryellen L. Giger Department of Radiology, University of Chicago, Chicago, IL, United States Iva Halilaj Department of Precision Medicine D-lab, Maastricht University, Maastricht, The Netherlands Adam P. Harrison Bethesda Research Lab, PAII Inc, Bethesda, MD, United States Bryan He Department of Computer Science, Stanford University, Stanford, CA, United States


Julian C. Hong Department of Radiation Oncology, Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, United States

Anthony L. Lin Department of Radiation Oncology, University of California San Francisco School of Medicine, San Francisco, CA, United States

Dakai Jin Bethesda Research Lab, PAII Inc, Bethesda, MD, United States

Rebecca Y. Lin Strategy and Global Business Development, Potrero Medical, Hayward, CA, United States

Michael C. Jin Stanford University School of Medicine, Stanford, CA, United States

Yun Liu Google Health, Google LLC, Palo Alto, CA, United States

Arthur Jochems Department of Precision Medicine D-lab, Maastricht University, Maastricht, The Netherlands

Le Lu Bethesda Research Lab, PAII Inc, Bethesda, MD, United States

Jayashree Kalpathy-Cramer Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, United States Daniel S. Kapp Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, United States Mehran Karimzadeh Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada William Karnes Department of Gastroenterology, University of California Irvine Medical Center, Orange, CA, United States Philippe Lambin Department of Precision Medicine, Faculty of Health, Medicine and Life Sciences, Maastricht University School for Oncology and Developmental Biology, Maastricht, The Netherlands Curtis P. Langlotz Department of Radiology and Biomedical Informatics, Stanford University School of Medicine, Stanford, CA, United States Joonsang Lee Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, United States

David Magnus Department of Medicine and Biomedical Ethics and Pediatrics and Medicine, Stanford University School of Medicine, Stanford, CA, United States Chris McIntosh Techna Institute for the Advancement of Technology for Health, University Health Network, Toronto, ON, Canada Shun Miao Bethesda Research Lab, PAII Inc, Bethesda, MD, United States James K. Min Cleerly Inc., New York, NY, United States Daniel B. Neill Courant Institute, Department of Computer Science, Wagner School of Public Service, Center for Urban Science and Progress, New York University, New York, NY, United States Eric Karl Oermann Department of Neurological Surgery, Mount Sinai Health System, New York, NY, United States David Ouyang Department of Cardiovascular Medicine, Stanford University School of Medicine, Stanford, CA, United States Lily Peng Google Health, Google LLC, Palo Alto, CA, United States Sonia Phene Google Health, Google LLC, Palo Alto, CA, United States

Hui Li Department of Radiology, University of Chicago, Chicago, IL, United States

Maarten G. Poirot Department of Radiology, Massachusetts General Hospital, Charlestown, MA, United States

Joseph C. Liao Department of Urology, Stanford University School of Medicine, Stanford, CA, United States

Jennifer L. Quon Department of Neurosurgery, Stanford University School of Medicine, Stanford, CA, United States


Daniel Ranti Department of Neurosurgery, Mount Sinai School of Medicine, New York, NY, United States Arvind Rao Department of Computational Medicine and Bioinformatics and Radiation Oncology, University of Michigan, Ann Arbor, MI, United States


Duminda Suraweera Department of Gastroenterology, University of California Irvine Medical Center, Orange, CA, United States Aly Al-Amyn Valliani Department of Neurological Surgery, Mount Sinai Health System, New York, NY, United States

Ramesh Raskar Media Lab, Massachusetts Institute of Technology, Cambridge, MA, United States

Yvonka van Wijk Department of Precision Medicine D-lab, Maastricht University, Maastricht, The Netherlands

Christopher Rombaoa Department of Gastroenterology, University of California Irvine Medical Center, Orange, CA, United States

Praneeth Vepakomma Media Lab, Massachusetts Institute of Technology, Cambridge, MA, United States

Daniel L. Rubin Department of Radiology and Biomedical Data Science, Stanford University, Stanford, CA, United States Jason Samarasena Department of Gastroenterology, University of California Irvine Medical Center, Orange, CA, United States Jayne Seekins Department of Radiology, Stanford University School of Medicine, Stanford, CA, United States Karthik Seetharam Cleerly Inc., New York, NY, United States

Bo Wang Peter Munk Cardiac Center, University Health Network, Toronto, ON, Canada

Ge Wang Biomedical Imaging Center, Rensselaer Polytechnic Institute, Troy, NY, United States

Nicholas Wang Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, United States Yirui Wang Bethesda Research Lab, PAII Inc, Bethesda, MD, United States

Emily Shearer School of Medicine, Stanford University, Stanford, CA, United States

Elisa Warner Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, United States

Adam Sibley Department of Medical Physics, University of Chicago, Chicago, IL, United States

Mattea Welch Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada

Karnika Singh Department of Biomedical Engineering, Duke University, Durham, NC, United States

Kimberly Wong Department of Gastroenterology, University of California Irvine Medical Center, Orange, CA, United States

Praveer Singh Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, United States Margarita Sordo Department of General Internal Medicine, Brigham and Women’s Hospital Harvard School of Medicine, Boston, MA, United States; MGH Institute of Health Professions School of Nursing, Boston, MA, United States

Zhenqin Wu Department of Chemistry, Stanford University, Stanford, CA, United States Fuyong Xing Department of Biostatistics and Informatics, University of Colorado School of Public Health, Aurora, CO, United States Lei Xing Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, United States


Ke Yan Bethesda Research Lab, PAII Inc, Bethesda, MD, United States Pingkun Yan Department of Biomedical Engineering, Rensselaer Polytechnic Institute, Troy, NY, United States Lu Yang Google Health, Google LLC, Palo Alto, CA, United States Kristen W. Yeom Department of Radiology, Stanford University School of Medicine, Stanford, CA, United States

Lin Zhang Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada; Peter Munk Cardiac Center, University Health Network, Toronto, ON, Canada Ling Zhang Bethesda Research Lab, PAII Inc, Bethesda, MD, United States Xuhong Zhang Department of Biostatistics and Informatics, University of Colorado School of Public Health, Aurora, CO, United States Li Zhou Department of General Internal Medicine, Brigham and Women’s Hospital Harvard School of Medicine, Boston, MA, United States

Robin Zachariah Department of Gastroenterology, University of California Irvine Medical Center, Orange, CA, United States

Daniel Zeng State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China

James Zou Department of Biomedical Data Science, Stanford University, Stanford, CA, United States

Foreword

As the Chairman and the Associate Chair for Information Systems of a large academic radiology department, we stand ready to incorporate artificial intelligence (AI) into our clinical operations. But despite the incredible potential of AI, radiologists are not knocking down our doors to request we implement a particular machine learning algorithm to support their practice—either to improve the health of their patients or to enhance the efficiency of their practice. This lack of enthusiasm suggests that we are still at the dawn of AI and its effect on clinical radiology practice. On the other hand, radiologists remain apprehensive about how AI will affect their future—particularly younger radiologists, whose future hinges on the long-term effects of this revolutionary technology.1 These two seemingly contradictory attitudes about AI, which are widely held throughout medicine, highlight a dichotomy: AI is a potentially revolutionary technology that has yet to show significant practical clinical benefits. The earliest sign of the AI revolution began in the 1980s, when computer science researchers required many years to develop a system to extract information from clinical images.2 Researchers who built such systems could easily earn a PhD for their work. But since the deep learning revolution just a few years ago, much has changed. It now takes only brief experience and a few days or weeks with the right training data to create a much more accurate informatics extraction system.3 This dramatic change represents a true technologic revolution.

As a result, deep learning and AI are now a part of nearly all medical research laboratories. Much of the recent progress in AI has occurred outside of the medical domain. ImageNet, a database of 14 million low-resolution color photographs of natural scenes, linked to labels from the WordNet ontology, drove this progress through a series of data science challenges.4 But medical images and other forms of medical data are substantially different from data elsewhere, not only in their form and format, but also in the need for privacy and security.5 Consequently, progress toward a comprehensive public data resource for medicine is slower, but we have nevertheless seen sustained advances toward clinically useful applications. We are at the threshold of a major information revolution. We should be grateful at this opportune time that Drs. Xing, Giger, and Min have created an extensive compendium of the current promises and challenges of AI in health care. Dr. Xing brings extensive experience as a physicist and pioneering engineer who has developed numerous innovations in AI, medical imaging, treatment planning, molecular imaging instrumentations, and image-guided interventions. Dr. Giger has been on the front lines of AI for several decades and is a pioneer in the development and use of AI algorithms, including algorithms in routine clinical use today. Her AI research in cancer imaging for risk assessment, diagnosis, prognosis, and therapeutic response has yielded several translational innovations, including the use of these “virtual biopsies”


in imaging-genomics association studies. Dr. Min has led a comprehensive cardiovascular institute, where he spearheaded a large prospective trial for cardiac CT that created a massive image database comprising more than 30,000 patients from over seven countries. Most recently, he has transitioned to industry, where the innovative technologies he has developed can be translated to benefit patients. These three editors, together with the eminent chapter authors they have recruited, have created an invaluable resource to help us all understand the present and future of medical AI. Their book begins with a guided journey through the settings in which AI research is having an impact on medicine. The next chapters describe the key methods of this discipline, including the fundamentals of machine learning and neural networks, highlighting how they differ from the rule-based and probabilistic methods that were first pioneered in the 1970s and 1980s. These older methods will be ascendant as we recognize the need for machine learning (ML) systems to explain themselves and to incorporate symbolic relationships into their reasoning.6 The early chapters also feature new AI methods for analyzing complex data types. For example, machine learning methods can extract information from echocardiography and ultrasound video feeds. They can also aid in the reconstruction and enhancement of three-dimensional images for faster scans, lower radiation and contrast dose, as well as more capable and less expensive imaging devices. Also emphasized are two new methods tailored for health-care analyses: (1) the power of distributed learning, which helps create privacy-preserving generalizable models through learning from multiinstitutional data, and (2) the strengths and challenges of

machine learning from multimodal data, which can combine insights from genomic information with clinical and imaging data. The heart of the book reviews the application of these new AI methods to health care. AI’s earliest effects are already being felt by diagnostic imaging, because computer vision algorithms often perform at or above the level of human experts. These imaging applications are examined in chapters covering pathology, endoscopy, ophthalmology, and radiology, highlighting how progress varies across imaging specialties due to differences in the availability of digital data and task variability. For example, these chapters contrast the need for diagnostic decision support in the developed world versus rapid screening in lower-resource countries. They also highlight the unique needle-in-a-haystack problems in histopathology, the panoply of detection, classification, and measurement problems in radiology, as well as the difficult time-sensitive detection tasks in endoscopy. In parallel with progress in imaging, key advances are occurring in the extraction of information from the electronic health record (EHR) and in combining EHR data with genomics and imaging data for applications ranging from digital phenotyping to adverse event prediction. These heterogeneous data types raise the need for AI methods that integrate multimodal data from disparate sources. All of these innovations are described in detail. A pair of chapters highlights how technology can improve health outside the clinical care environment, illustrating the essential concept of precision health, which aims to develop tailored unobtrusive disease surveillance systems and to promote healthy habits through diet and exercise. In this context, the role of mobile phones and


mobile computing is reviewed, including the sensors, camera, GPS, and other mobile functions that decision-support apps can employ to promote health. The book then shifts to focus on new methods that apply across a broad range of prediction problems in a sampling of the many medical fields likely to be affected by AI: breast cancer, urology, oncology, cardiovascular disease, neurologic conditions, and pediatrics. The concluding chapters consider the broader implications of AI research, including its impact on public health surveillance. As we envision how these amazing new technologies will affect the practice of medicine, we need to consider the regulatory, social, ethical, and legal issues that undoubtedly will play a strong role in the adoption, use, and public confidence in AI. All of those factors are considered in a chapter that also emphasizes the importance of recognizing bias in data sets. Finally, the business opportunities and barriers are discussed, including the cycles of innovation that lead to market adoption. We are fascinated to see how these profound AI technologies will affect the future of every medical specialty and every patient. As we experience that AI-enabled future,


we cannot imagine a better foundation for everyone than the material available in this comprehensive book. Enjoy!

Curtis P. Langlotz, MD, PhD and Sanjiv Gambhir, MD, PhD

References
1. Langlotz CP. Will artificial intelligence replace radiologists? Radiol Artif Intell 2019. Available from: https://pubs.rsna.org/doi/full/10.1148/ryai.2019190058.
2. Karssemeijer N, van Erning LJ, Eijkman EG. Recognition of organs in CT-image sequences: a model guided approach. Comput Biomed Res 1988;21:434–48.
3. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.
4. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115:211–52.
5. Langlotz CP, Allen B, Erickson BJ, Kalpathy-Cramer J, Bigelow K, Cook TS, et al. A roadmap for foundational research on artificial intelligence in medical imaging: from the 2018 NIH/RSNA/ACR/The Academy Workshop. Radiology 2019;190613.
6. Marcus G. Deep learning: a critical appraisal. arXiv [cs.AI] 2018. Available from: http://arxiv.org/abs/1801.00631.

Preface

Artificial intelligence (AI) is the theory and development of computer systems able to perform tasks that usually were conducted by human intelligence, including computer algorithms that automatically learn from past experience to perform a task such as prediction, detection, classification, semantic transcriptions, image reconstructions/restorations, robotic movements, or efficient workflows. Although AI in medicine (AIM) has been around for decades, it has progressed with remarkable technical and clinical innovations in recent years, as evidenced by the ever-increasing number of publications, media news, startup companies, and FDA-approved AIM products. The landscape of health care is being transformed by AI and substantial progress has been made in almost every specialty of medicine. Indeed, AI is being integrated into virtually all biomedical fields and various decision-making processes, ranging from preventive medicine, disease management, monitoring of patient status, imaging, biomarkers discovery, drug design and repurposing, healthy living, elderly care, and robotic interventions to AI-augmented telemedicine. For many biomedical problems that are either extremely tedious or too difficult to solve, AI may become a viable or even the only choice to move medicine forward. With the immense technical tools, powerful computational resources, and promising research, there is no doubt that the growing trends of AI in health care will continue in the years to come.

In this book, we provide a comprehensive overview of the fundamental principles, technical basis, clinical applications, and practical considerations of AIM. The intended readers of the book include, but are not limited to, students, teachers, researchers, medical professionals, industrial engineers, administrators, and business people in the AIM sector. It is our hope that readers will attain useful background knowledge of AIM, learn about emerging computing algorithms for various clinical problems, gain perspectives on AI applications in health care, and appreciate the challenges and opportunities of AIM. The multidisciplinary and panoramic view of AIM provided by the expert authors of 25 chapters affords valuable sources for readers to identify the trends of research and to gain perspectives and prospects of AIM.

While enormous progress has been made in AIM, a number of important issues remain to be resolved to take AIM to the next level. The next generation of AIM should be more interpretable, transparent, and trustworthy than today. It is arguable that achievement of these important characteristics should start from data curation and harmonization, which interprets existing characteristics of data and action taken on data and uses the information for subsequent data quality improvements. In modeling, novel strategies and more intelligent AI frameworks with the abovementioned features must be developed. Robust statistical evaluation methods and metrics also play critical roles in the success of data-driven machine learning and represent a forefront of AI research and development. New machine learning methods with minimal disparity between the neural network loss function and evaluation metrics should be investigated, as well as incorporation into multidisciplinary tasks, to facilitate the convergence between AI and human intelligence. While passing the Turing test in some specific tasks has become reality today, we are still far away from the important milestone for an AI to convince a Turing test jury that the AI system is an autonomous human in general clinical applications.

We emphasize that timely deployment and clinical implementation of AIM tools are of great importance to the new era of AI-powered medicine. After all, the ultimate goal of AIM is to leverage the latest AI technologies to benefit our patient care. In this regard, a thorough understanding of the clinical workflow and a clearly defined path of clinical translation of AIM research is critical. Finally, we note that advances in AIM also bring new regulatory, social, legal, and ethical challenges (see Chapter 23: Regulatory, Social, Ethical, and Legal Issues of Artificial Intelligence in Medicine, for details), which we must understand and handle to further advance the field and translate to routine clinical care.

Looking ahead to the future of AIM can be very daunting, especially after considering all of the possibilities to be made available by the latest advances in technologies such as autonomous on-device AI, machine-human interface, and quantum computing. Let us welcome the dawn of a new era of AIM and work together to advance the field!

Lei Xing1, Maryellen L. Giger2 and James K. Min3
1 Stanford University, Stanford, CA, United States
2 University of Chicago, Chicago, IL, United States
3 Cleerly Inc., New York, NY, United States

Acknowledgments

It is a great pleasure to acknowledge the generous help of the team that brought this book to life. It has been a great privilege to be associated with these outstanding professionals: people who are talented, knowledgeable, and dedicated, and who take pride in their work. We are indebted to the contributions of all the lead authors and coauthors of the chapters. While the ultimate responsibility for the content of this book is ours, their invaluable contributions made the book what it is. We wish to sincerely thank Ms. Carrie Zhang, Ms. Dania Abid, and Mr. Frank Chevz for providing tremendous administrative support throughout the editing and production of this book. We are very grateful to the Editorial Project Manager of the book, Mona Zahir, for continuous and prompt support during this project. It is hard to count how many emails we sent her for professional advice and last-minute help; we truly appreciate her professional support. We would also like to thank many other people at Elsevier, particularly Niranjan Bhaskaran, Project Manager, and Indhumathi Mani, Copyrights Coordinator, for their professional support.

LX wishes to acknowledge the contributions of current and past members of the Laboratory of Artificial Intelligence in Medicine and Biomedical Physics in the Department of Radiation Oncology, and colleagues from the Department of Radiation Oncology, the Center for Radiation Science in the Department of Radiation Oncology, the Center of Artificial Intelligence for Medicine & Imaging (AIMI), and the Human-Centered Artificial Intelligence (HAI) institute at Stanford University. LX also extends his sincere gratitude to Dr. Jacob Haimson and Dr. Sarah S. Donaldson for their continuous support of the medical physics program at Stanford and the generous gift of the endowed professorship that LX currently holds. Grant and/or gift support over the years from NIH, ACS, DOD, RSNA, Varian Medical Systems, Huiyihuiying (HY) Medical Technology Co., and Google LLC is also gratefully acknowledged. LX is an adviser and shareholder of HY Medical Technology Co., Luca Medical Systems, and MoreHealth Inc. He receives royalties from Varian Medical Systems.

MLG extends her acknowledgment to the current and past members of the Department of Radiology and the Committee on Medical Physics at the University of Chicago, and especially the current and past members and collaborators of her medical imaging computer vision/machine learning research lab at the University, who have contributed significantly to the research. Grants from NIH, the University of Chicago Comprehensive Cancer Center, and the University of Chicago Institute for Translational Medicine are gratefully acknowledged. MLG is a stockholder in R2 Technology/Hologic and a cofounder and equity holder in Quantitative Insights (now Qlarity Imaging). MLG receives royalties from Hologic, GE Medical Systems, MEDIAN Technologies, Riverain Medical, Mitsubishi, and Toshiba. It is the University of Chicago Conflict of Interest Policy that investigators disclose publicly actual or potential significant financial interests that would reasonably appear to be directly and significantly affected by the research activities.

JKM extends his appreciation to his colleagues at the Weill Cornell Medical College and at the Dalio Institute of Cardiovascular Imaging at the New York-Presbyterian Hospital, as well as to the numerous collaborators with whom he has had the privilege to work on large-scale clinical trials and registries over the last 15 years. Grants and gifts from the National Institutes of Health, the Dalio Foundation, and the Michael Wolk Foundation are gratefully acknowledged. JKM is the founder, a shareholder, and an employee of Cleerly, Inc.

Last but not least, we are greatly saddened that Sanjiv Sam Gambhir, MD, PhD, the Virginia and D.K. Ludwig Professor and Chair of the Department of Radiology at Stanford University School of Medicine, passed away while this book was in the process of being printed. Dr. Gambhir was a visionary, brilliant, and genuinely kind physician-scientist. He was internationally known as a forward-looking thinker and pioneer in biomedical imaging, medical science, and AI in health care. He was an advocate for precision health and medical AI, as reflected in part in the Foreword that he put together with Dr. Langlotz. Dr. Gambhir will be remembered fondly, and his passion for translational medicine will be a great inspiration for us to constantly push the boundaries of AI and to deliver life-changing AI-powered medicine to patients.

CHAPTER 1

Artificial intelligence in medicine: past, present, and future

Efstathios D. Gennatas and Jonathan H. Chen

Abstract
Artificial intelligence is a powerful technology that promises to vastly improve the efficiency and effectiveness of health-care delivery, usher in the era of precision medicine, and transform our everyday lives. It is helping accelerate basic biomedical research, delivering insights into disease pathophysiology, and guiding new treatment discovery. It is optimizing clinical trials and translational research, bringing us closer to new treatments faster. At a time when the health-care system is under more strain than ever, artificial intelligence promises to revolutionize health-care delivery by capitalizing on the totality of health-related data in order to optimize clinical decision-making for each individual and improve access to health care for all. To deliver on these promises, we must bring together basic and applied researchers, engineers, and clinicians to address the many outstanding challenges in a timely and responsible manner. It is the duty of all of us to strive for the safe, fair, and efficient delivery of this technology to all.

Keywords: Artificial intelligence; machine learning; precision medicine; medicine; health care

1.1 Introduction

Artificial intelligence (AI) has been through highs and lows to reclaim its place as one of the most exciting and promising technologies today. It is gaining increasing traction across fields, and the race is on for the widespread delivery of real-world applications that have the potential to transform our daily lives and society as a whole. Medicine is arguably one of the most promising, and at the same time most challenging, fields for AI adoption. AI in medicine aims to optimize clinical decision-making and health-care delivery in general by capitalizing on the increasing volume and availability of health-related data in order to provide the most informed care to each individual. Medical AI applications are still at the early stages of development but are advancing rapidly. This book offers an overview of the ongoing advances in AI across medical subfields. In this introductory chapter, we begin with a historical overview of AI and its clinical applications and a set of definitions. We then consider the promises and challenges of AI in medicine: What do we stand to gain from AI in medicine? What are the challenges we need to address before we can deliver on those promises? The coordinated work of an interdisciplinary team of health-care workers and providers, scientists, and engineers is required to fulfill the potential of AI in medicine in a safe, fair, and efficient way.

1.2 A brief history of artificial intelligence and its applications in medicine

A single formal definition of AI may not exist, but we commonly use the term to refer to a set of approaches "able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages" ("artificial intelligence, n." OED Online, Oxford University Press; December 2019, www.oed.com/view/Entry/271625 [accessed 13.12.19]). The field originated in the mid-1950s, largely within computer science but with important influences from philosophy, mathematics, economics, cognitive science, and neuroscience. Researchers' early focus was on symbolic reasoning: building high-level representations of problems to mimic, to some extent, human thinking. This paradigm is known as Symbolic AI or Symbolism and is often referred to as "good old-fashioned AI".1 Early successes were achieved using symbolic reasoning and expert systems, its main type of implementation. These systems rely largely on hard-coded rules designed by human experts to address a defined, circumscribed problem. An example of a very popular expert system widely used today is electronic tax preparation software. The designers of these systems have hard-coded a country's or state's entire tax law into their software. The program asks users a series of simple questions and follows a long list of if-then statements to calculate how much tax is owed. Such systems can be very effective in specific applications. Their main limitations are as follows:

• They are labor-intensive: a team of experts needs to manually enter, and subsequently maintain, a long, up-to-date list of rules and their relationships.
• They are generally only possible when a comprehensive set of stable rules governing a system is known. This is particularly limiting in medicine, where knowledge uncertainties abound in the setting of constantly evolving systems.2,3

Early examples of expert systems in medicine included the MYCIN system, designed to recommend appropriate antibiotic treatment for bacterial infections based on user-entered patient symptoms and information,4 the causal-associational network CASNET, applied to the management of glaucoma,5 and INTERNIST-I, a general medicine consultation system.6 While exciting, these systems failed to achieve widespread adoption.7 Expert systems, in general, largely fell out of fashion in AI research, but successful applications, such as tax preparation software, remain in use today. Instead, interest grew in purely data-driven learning procedures that eschewed the laborious manual hard-coding of rules. Machine learning (ML), also known as statistical learning, developed out of the fields of statistics and computer science, often in parallel and independently, precisely to allow a machine to learn from data without explicit programming. ML refers to a large and growing collection of algorithms that have proven highly successful in a wide range of applications. Within ML, artificial neural networks (ANNs) represent a versatile learning framework created as an attempt, however crude, to mimic the network architecture of the brain. Research in ANNs, starting with seminal work on parallel distributed processing,8 gave rise to what was later named connectionism and connectionist AI.

Connectionist and symbolic AI have largely been seen as opposing views in AI.9 The increasing predominance of ML methods in AI today has led to the two terms often being used interchangeably, even though they are not equivalent (Fig. 1.1).

FIGURE 1.1 The relationship between AI, machine learning, and deep learning. Machine learning refers to a large collection of algorithms and is the main approach used in AI today. Deep learning refers to a specific class of algorithms within machine learning that are particularly effective at handling "unstructured data": images, text, etc. AI, Artificial intelligence.
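To make the expert-system paradigm described above concrete, the following is a minimal, purely illustrative sketch of a hard-coded rule base in Python. The rules, inputs, and recommendations are invented for illustration and carry no clinical validity; they simply show the if-then structure that human experts would have to write and maintain by hand.

```python
# Toy illustration of an expert system: a fixed set of hand-written if-then rules.
# The rules and recommendations below are invented for illustration only.

def recommend_antibiotic(gram_stain, shape, penicillin_allergy):
    """Return a (recommendation, rule_fired) pair from hard-coded rules."""
    if gram_stain == "positive" and shape == "cocci":
        if penicillin_allergy:
            return "hypothetical alternative agent", "rule 1b"
        return "hypothetical penicillin-class agent", "rule 1a"
    if gram_stain == "negative" and shape == "rods":
        return "hypothetical broad-spectrum agent", "rule 2"
    return "insufficient information: refer to specialist", "default rule"

print(recommend_antibiotic("positive", "cocci", penicillin_allergy=False))
```

Every change in knowledge requires a human to edit such a rule list, which is exactly the labor-intensiveness and brittleness noted above.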

1.3 How intelligent is artificial intelligence?

Currently popular forms of AI/ML algorithms largely operate on a single circumscribed task at a time. For example, one model can be trained to estimate cardiovascular disease risk from demographic and clinical examination data; a different model can be trained to diagnose heart disease from electrocardiograms; yet another could be trained on cardiac MRIs to select from a list of possible diagnoses. This task-focused prediction is called weak or narrow AI. In contrast, artificial general intelligence (AGI), also known as hard AI, is defined as an AI system that is able to perform any number of intelligent tasks. This remains the ultimate goal for many AI researchers, but while it has been promised or predicted multiple times already, its realization remains out of immediate reach by most estimates. There is currently no way to train a "cardiologist AI" or a "general medicine AI." Researchers are focusing instead on augmented intelligence, a paradigm that aims to use AI to assist humans in tackling difficult and important tasks. This is largely where current AI approaches fit in medicine: not as a technology to replace clinicians but as a powerful tool that can process vast amounts of information and assist clinicians in making decisions, while possibly also automating some simpler tasks.


Many argue that existing machine learning algorithms (weak AI) do little more than a type of “curve fitting” or “pattern recognition” on multidimensional data10 and are, therefore, not worthy of the term “artificial intelligence,” which should be reserved for a system that possesses higher level abilities, if not general intelligence. Regardless of individual views on the matter, the term AI is widely used and recognized, and the important distinction between weak or narrow and hard AI should be clear to the reader. At the same time, there is increasing interest in bridging the gap between symbolism and connectionism. Such an approach may be the key in paving the way toward AGI. This work can also improve the interpretability, intelligibility, or “explainability” of AI and, therefore, boost its trustworthiness in critical applications such as medicine.

1.4 Artificial intelligence, machine learning, and precision medicine

Advances in ML algorithms, along with increases in computational power, allow biomedical and clinical researchers to easily analyze large and complex datasets. AI's benefits in medicine extend across the spectrum from basic biomedical research to translational research and clinical practice. In basic research, ML is used to extract insights on disease pathophysiology and guide new treatment discovery. Currently, the majority of applications are in basic research, while clinical applications are slowly being developed and tested. Precision medicine, sometimes called personalized or individualized medicine, is defined by the NIH as "an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person" (https://ghr.nlm.nih.gov/primer/precisionmedicine/definition). This approach recognizes that each person may have a unique (1) risk of developing a disease, (2) presentation when they develop the disease, and (3) response to treatment and progression of the disease. The central premise of precision medicine is, therefore, to treat individuals, not diseases. This requires the integration of all available health-related data sources to offer individualized estimates of disease risk, prevention strategies, and treatment planning.

1.5 Algorithms and models

ML includes a broad range of methods that can address many types of tasks. Hastie et al.11 provide a comprehensive overview of machine learning methods, and Koller et al.12 offer a more accessible introduction. Supervised learning is one of the most widely used approaches for labeled datasets. The input to the algorithm is a set of features (a.k.a. independent variables, predictors, and covariates) and an outcome (a.k.a. dependent variable, label). If the outcome is categorical, the procedure is called classification; if it is a continuous variable, it is called regression; and, lastly, if it represents the time to an event (e.g., time to death), it is called survival analysis. The goal is to build a mathematical mapping from the inputs to the output. An algorithm is the general procedure we use to build such a mapping. Popular algorithms include the generalized linear model, classification and regression trees, random forests, gradient boosting, and ANNs. For a specific dataset, this mapping is termed a model. For example, we may input the age, sex, weight, blood pressure, and LDL and HDL cholesterol blood levels of 10,000 cases into a decision tree to predict the risk of a heart attack. We may input a series of cardiac MRI volumes into a type of neural network known as a convolutional neural network to classify heart disease. Algorithms build and optimize models by minimizing a specified loss function. Loss functions are typically defined by the difference between the true values of the outcome being predicted and the model's estimates of those values. The goal of model training is, therefore, to produce a model whose estimated values are as close as possible to the true values on cases not seen during training (model testing).

Key terms
Precision medicine: The tailoring of clinical decision-making to the individual.
Artificial intelligence: The field of study involved in the development of computer systems able to perform tasks that normally require human intelligence; a core component supporting precision medicine.
Machine learning: A set of procedures that allow learning from data.
Deep learning: A subset of machine learning approaches particularly suited to the analysis of images, text, and voice data.

A model serves two main functions: (1) to predict outcomes of future/unseen cases and (2) to provide insights into the underlying processes that contribute to the outcome of interest. For example, we may input a list of demographic, clinical, and laboratory data into an algorithm to predict the future onset of chronic obstructive pulmonary disease. The direct value of such predictions could guide health screening, risk stratification, and resource allocation strategies. Moreover, such models may help us understand which of our input features contribute to the prediction of lung disease, and possibly how they interact, and therefore help identify potential targets for disease prevention or treatment and guide the development of new hypotheses for future research. The model may further give us some measure of the extent of the contribution of each feature to predicting the outcome, reflected in variable importance measures. Beyond pure prediction, the amount we can learn about the underlying processes, for example, disease pathophysiology, is often limited by the quality and quantity of training data. Some of the challenges in applying AI in medicine are described in Section 1.8.
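As a minimal worked sketch of these ideas, the snippet below trains a random forest (an ensemble of the decision trees mentioned above; a single decision tree could be substituted) on synthetic tabular data laid out like the hypothetical heart-attack example in the text. The data, feature names, and outcome are fabricated, and scikit-learn and NumPy are assumed to be installed.

```python
# A sketch of supervised learning on structured (tabular) data.
# All values below are synthetic stand-ins for real patient records.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X = np.column_stack([
    rng.normal(60, 12, n),    # age
    rng.integers(0, 2, n),    # sex (coded 0/1)
    rng.normal(80, 15, n),    # weight
    rng.normal(130, 20, n),   # systolic blood pressure
    rng.normal(120, 30, n),   # LDL cholesterol
    rng.normal(55, 12, n),    # HDL cholesterol
])
# Synthetic binary outcome loosely tied to age, blood pressure, and LDL.
logit = 0.04 * (X[:, 0] - 60) + 0.02 * (X[:, 3] - 130) + 0.01 * (X[:, 4] - 120) - 2.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)                 # fit the model to the training cases

prob = model.predict_proba(X_test)[:, 1]    # predicted risk on unseen cases
print("test log loss:", round(log_loss(y_test, prob), 3))
print("test AUC:", round(roc_auc_score(y_test, prob), 3))

# Variable importance: a rough measure of each feature's contribution.
names = ["age", "sex", "weight", "systolic BP", "LDL", "HDL"]
for name, importance in sorted(zip(names, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:12s} {importance:.3f}")
```

The held-out test metrics stand in for "model testing" on cases not seen during training, and the importance ranking is the kind of variable importance measure described above.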

1.6 Health data sources and types

What features, that is, types of data, are useful for AI in medicine? All health-related data are potentially useful. This includes demographic data, past medical history, family history and social history, clinical examination, laboratory tests and genomic data, and imaging and histopathology, along with lifestyle information such as nutrition and exercise. Increasingly, mobile and wearable technology is proving to be a rich and valuable source of health data.13,14 Crucially, connected devices such as smartphones and watches can be used both to monitor health (e.g., heart rate and rhythm) and to promote healthy behavior, for example, by setting exercise goals.

In ML, input data (a.k.a. features, covariates, and independent variables) can be divided into structured and unstructured. Structured data refers to tabular data, where, by convention, each row represents an individual case and each column represents a variable, for example, age, sex, height, weight, and heart rate. On the other hand, examples of unstructured data are images (e.g., X-rays, CT, MRI, and histology), text (such as clinician notes in patient charts), and audio (e.g., recordings of a patient's voice). The variables in such datasets lack one-to-one correspondence among cases. For example, a series of chest CTs may reveal lung tumors in different lobes of either side. Among patients, there will be differences in anatomy (unless they are registered to the same space) as well as differences in the size, shape, and location of tumors. A given pixel, that is, feature, on one image does not correspond to the same pixel on any other image. Similarly, comparing two sets of text, the position of words will usually be in a different order. In both cases, we want to be able to extract higher levels of information, such as the presence of a tumor anywhere in the image, regardless of the absolute position of features in the input data. Therefore, to distinguish between structured and unstructured data, we can ask the question "Do the input features fall neatly into columns holding the same information for each case?" If yes, we have structured, that is, tabular, data; otherwise, we are dealing with unstructured data.

Different groups of ML algorithms are generally used to analyze structured versus unstructured data, although there is some overlap. Linear models, additive models, support vector machines, and decision trees and their ensembles (i.e., random forests and gradient boosting) are popular algorithms for handling structured data. Unstructured data can be analyzed in two main ways: we can either convert it first to structured data and apply an algorithm that works well on structured data, or we can use a class of algorithms that can directly handle unstructured data and possibly take advantage of the total information without an intermediate transformation. ANNs represent a large family of algorithms that have been highly successful in the analysis of unstructured data types, while attracting a vast number of researchers working to expand and improve them. ANNs work by creating hierarchical representations of the raw input data that are able to extract relevant patterns predictive of the outcome of interest. Fig. 1.2 shows a selection of common medical data inputs and the algorithms that can be used on them.

FIGURE 1.2 Example of medical data inputs and algorithms that can be used to train clinical predictive models.
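As a brief sketch of the two routes just described, the snippet below turns a stack of small random arrays (standing in for images) into a handful of summary features for a tabular gradient boosting model, and separately passes the raw pixels through a tiny convolutional network. Shapes, features, and labels are illustrative only; NumPy, scikit-learn, and PyTorch are assumed to be available.

```python
# Two routes for unstructured data, using random arrays in place of real images.
import numpy as np
import torch
from torch import nn
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
images = rng.normal(size=(200, 64, 64))   # 200 synthetic grayscale "scans"
labels = rng.integers(0, 2, size=200)     # synthetic binary outcome

# Route 1: convert each image to structured features, then use a tabular model.
features = np.column_stack([
    images.mean(axis=(1, 2)),             # mean intensity
    images.std(axis=(1, 2)),              # intensity spread
    (images > 1.5).mean(axis=(1, 2)),     # fraction of "bright" pixels
])
tabular_model = GradientBoostingClassifier().fit(features, labels)

# Route 2: feed the raw pixels to a network that learns its own representations.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 2),           # two output classes
)
batch = torch.tensor(images[:8], dtype=torch.float32).unsqueeze(1)  # 8 images, 1 channel
print(cnn(batch).shape)                   # untrained forward pass: torch.Size([8, 2])
```

Route 1 depends on humans deciding which summary features to compute, whereas the network in Route 2 learns its own hierarchical representations directly from the pixels, which is the property that has made ANNs so successful on unstructured data.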

1.7 The promise

AI is a large and complex interdisciplinary field of research. While progress is being made at a fast pace, substantial work is required before AI can be widely applied in an effective and efficient, safe and fair manner across medicine. What are the expectations for AI in medicine that make this a worthwhile endeavor? The benefits of AI in medicine extend across multiple axes: (1) As already mentioned, AI can be applied across the spectrum from basic research to translational research and clinical practice. (2) Within clinical practice, AI applications extend across settings and health-care access points, from the everyday home and work environment to the family practitioner's office, the emergency room, the hospital ward, the operating room, and the intensive care unit. (3) AI benefits extend across the lifespan, from preconception and pregnancy planning to end-of-life care. In basic and translational research, AI is offering increasingly powerful ways to analyze large and complex data in order to extract insights on human disease pathophysiology15 and guide new treatment discovery.16 In clinical research, it can help optimize clinical trial design and monitoring. For example, it can guide dynamic treatment allocation in clinical trials, where patients are allocated to treatment arms by balancing prognostic factors, and it can guide adaptive treatment designs, where patients are switched between treatments in order to optimize individual outcomes as information is collected and analyzed. In basic and applied research, AI can help analyze vast and diverse datasets at the same time in a way previously impossible. This helps answer researchers' questions but can also uncover previously unrecognized relationships and suggest new research paths (Fig. 1.3).

In clinical practice, AI can help increase clinical decision accuracy. In other words, AI promises to give better diagnoses, suggest more optimal treatment strategies, and improve our ability to predict treatment response and long-term prognosis. Given sufficient training data, AI has the potential to significantly improve the treatment or management of most cases, probably even more so in rare and difficult cases that might currently be most often misdiagnosed or inadequately managed. Furthermore, AI has vast potential to dramatically increase the efficiency of health-care delivery by reducing the time required to perform clinical procedures and eliminating unnecessary procedures. The potential gains in efficiency could dramatically improve patient experience and health outcomes, as well as drastically cut health-care costs. This has the potential to redirect scarce health-care resources to serve more patients, better and faster. This is particularly crucial considering the worsening national physician shortage, exacerbated by an aging population, and the growing physician burnout epidemic.17


FIGURE 1.3 A selection of the settings and functions of AI in medicine. AI, Artificial intelligence.

This contributes to increased clinician error rates, which can lead to avoidable patient harm and liability for health systems. A major contributor to the increased workload of physicians is the increased time demanded for data entry into electronic health record (EHR) systems. Probably the most damaging consequence of the increased documentation load on clinicians is that clinician-patient interaction and communication, one of the core components of clinical medicine, is severely limited. AI can help automate most, if not all, EHR data entry and, perhaps ironically, help make medicine more humane again. This is central to the concept of augmented intelligence in medicine.18

As AI systems are deployed in clinical applications, they can help substantially improve access to health care. Even in regions where health-care services are readily available, individuals may be reluctant to seek medical help for cultural, religious, and personal reasons. The ability to consult an AI system for initial triage/consultation in a private and confidential way could dramatically improve individuals' health-seeking behavior. The impact could be particularly important in mental health and conditions associated with stigma, where individuals are less likely to seek medical attention. Since AI applications can be delivered through any number of digital devices, their deployment can potentially reach most parts of the world. In regions where health-care services are limited, AI decision support tools can be invaluable in helping a local physician or nurse assess and treat patients. Lack of critical medical equipment and drugs cannot be compensated for by software, of course, but an AI model could recommend the best course of action given the limited resources, whether that is to treat with available means or to recommend seeking specialized care at the nearest appropriate health-care facility.

Prevention remains a core goal of medicine. Disease prevention has the potential to massively benefit individuals and society as a whole by improving quality of life, saving time and money, and reducing the loss of productivity. To date, research studies often offer contradictory advice on the best preventive practices for a given disease. The high variance often reflects inhomogeneity in individuals' biology, lifestyle, and environment. AI can be used to model these covariates to produce individualized disease-prevention strategies. Digital devices such as smartphones and smartwatches, or more specialized wearable sensors, are increasingly used to monitor an individual's health and alert when a problem is detected. Carefully tuned AI algorithms can optimize the use of such devices.


Too sensitive an alarm can lead to unnecessary interventions and ultimately cause harm, while too insensitive an alarm can miss crucial opportunities to intervene and prevent harm. Some of the toughest and most crucial goals that AI will be used to tackle in medicine are (1) personalized risk reduction for disease prevention, (2) early diagnosis of presymptomatic disease (the chronic setting), and (3) prediction of critical clinical events (the acute setting).

1.8 The challenges

A number of challenges must be addressed before the transformational promises of AI in medicine can be fully realized. Many of them are technical, related to designing high-performance systems that are safe and effective,19 while others are related to the implementation of AI systems: how to integrate the new technologies into everyday clinical practice and oversee their safe and appropriate usage.

1.8.1 Quality and completeness of training data

Biomedical and clinical data are inherently challenging for predictive modeling. Small sample sizes are a common limitation. This is increasingly being addressed by the orchestrated pooling of data collected across multiple research institutions and health centers, which can place heavy demands on resources, both for the primary data collection and for ensuring privacy, security, and curation throughout the data-sharing process. Streamlining these operations to minimize time and cost will help accelerate data sharing, which is crucial for building robust predictive models. Depending on the question at hand, however, proper model training may require large sample sizes that are hard to collect even after extensive pooling. Large effects can be identified in relatively small samples, while identification of the additive and interactive effects of large numbers of variables, as seen often in genomics, for example, may require millions of data points.

Beyond raw sample sizes, a fundamental consideration for any clinical predictive model is what population it should be trained on. Pooling data across different sites does not only offer the quantitative advantage of more data points but, crucially, offers a qualitative advantage as well, by providing a more diverse population sample, which can result in better model generalizability. Race, geographical location, and socioeconomic status are but a few factors that affect generalizability. The question therefore arises: should we aim for one large inclusive model trained on as many different subpopulations as possible, or should we train individual models for particular populations? The answer, as is often the case, is empirical. Each production clinical model will have to clearly state what population it was trained, validated, and tested on before its applicability to an individual patient can be assessed. Crucially, for any given medical question there will exist rare cases and outliers. Can we ensure the fairness of our algorithms and models and guarantee that no particular subpopulation will be favored or overlooked? How do we choose among competing models?

High noise is another common problem in biomedical data, which can limit the performance of ML models. It can occur both in the features and in the outcomes. Noise results not only from measurement error, which can often be quantified, but also from inherent biological stochasticity, which may be hard to estimate. The latter is often overlooked in biomedical models but becomes crucial as we ask the key questions: How good is our model? Is our model good enough? What does it mean if a model achieves a classification accuracy of 95%? Does that mean we still have plenty of room for improvement, or have we reached our accuracy ceiling, with the remainder being irreducible error?

A fundamental problem in predictive modeling, certainly in science and undoubtedly in medicine, is that we commonly do not know the full list of features that can help predict the outcome of interest, and/or not all of those features may be available or observable. Basic and clinical research is continuously expanding our understanding of disease pathophysiology and clinical medicine, and helps build a more complete picture of the relevant features for each health condition or clinical scenario, as well as identify proxy measures for biological or other processes that are not directly measurable. Collecting all the relevant features for a given scenario can cost time, money, and resources. In practice, data is continuously collected during a patient's journey through the health-care system. An important goal for AI in health care is to be able to predict at each step what the most important piece of information to collect next is, while considering how informative it will be and how safe and costly it will be to obtain.

While basic research is built on experimentation, clinical data is largely observational in nature. This limits the types of cases available for model training and often results in the presence of hidden confounders, which can affect both model generalizability and interpretation. Important legislation has been introduced to guide the ethical treatment of patients, the conditions under which their involvement in medical research is permitted, and how it should be monitored. Legislation on human subject research only permits low-risk experimentation on humans.20 Certainly, no patients will be left untreated or undertreated to allow an ML algorithm to learn. Bioengineering and AI have a huge role to play here in simulating human cells and organs in physical and computational systems.21

Bias in medical practice is extensively documented.22,23 The challenge is to prevent ML models from learning and perpetuating such bias and instead to capitalize on AI to identify and minimize human and institutional bias and promote equity in clinical practice and health-care delivery in general. Aside from models trained to further our basic understanding of pathophysiology or clinical decision support models, there is a need for models targeted directly at identifying bias and unfairness in medicine, in the broader context of identifying subpopulations that are systematically treated suboptimally. Similar considerations are important in AI applications in fields such as criminal justice and finance.

1.8.2 Trust and performance: the case for model interpretability

Interpretability in AI/ML (a.k.a. explainability and intelligibility) is increasingly important, particularly in high-stakes applications such as medicine. Trust in a predictive model becomes greatly important, if not necessary, in many cases,24 yet model interpretability is not well defined and is difficult to quantify. Murdoch et al.25 offer a thorough introduction to the topic in an attempt to define and help advance the field.

In a classic example in medical ML, a group of researchers set out to estimate the mortality risk of patients presenting to the emergency room (ER) with pneumonia.26 While pneumonia is a common condition, usually treated successfully with antibiotics, some cases are severe and some populations are particularly susceptible, which can lead to death. The question therefore arises: can we effectively stratify patients' risk of death from pneumonia? The researchers used a number of algorithms available at the time (1997) to train different models. A black-box neural network had the best accuracy on average. However, when the researchers looked at a rule-based model, they identified a surprising pattern learned by all algorithms: patients with asthma were labeled as having a decreased risk of mortality from pneumonia. This is, of course, exactly the opposite of what is true. What was happening? Not surprisingly, patients with a history of asthma presenting to the ER with pneumonia were being treated faster and more aggressively, per hospital protocol. Similar protocols exist for a large number of medical scenarios and are updated as new evidence becomes available. Therefore, the protocol was working and the patients were being treated effectively, reducing their risk of death. The model worked as expected: it used the correlation structure of the data to predict the outcome as best it could. This example highlights a major problem with black-box models in medicine, as in any application. Out of sample and out of context, the model can be misleading and may be misused with catastrophic consequences. If the model were used to predict admission numbers, bed availability, insurance claims, etc., the "bad rule" would likely not matter, as it is, after all, helping make an accurate prediction. If, on the other hand, the model were blindly applied in the ER to decide which patient should be seen first or admitted, it would be a disaster.

Linear models and decision trees are some of the most popular models considered to be interpretable. Users can inspect them and understand how they are built and how they make predictions. They are often outperformed by ensemble techniques such as random forests and gradient boosting and by other black-box models,27 leading to the idea that interpretability comes at the expense of accuracy, the so-called accuracy-interpretability trade-off. However, algorithms that can build accurate and interpretable models do exist,28 and important work is being done in this active research field.29,30
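The sketch below illustrates this point on synthetic data that deliberately encodes the confounded pattern of the pneumonia example: asthma patients receive aggressive treatment, so their recorded mortality is lower, and a model fit to the recorded outcomes learns that asthma lowers risk. All numbers are invented; scikit-learn and NumPy are assumed.

```python
# Synthetic illustration of a misleading "asthma -> lower risk" rule and why an
# inspectable model helps catch it before deployment. All values are invented.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 5_000
age = rng.normal(65, 15, n)
asthma = rng.integers(0, 2, n)
# Asthma itself raises risk, but aggressive treatment (given preferentially to
# asthma patients, per protocol) lowers the *recorded* mortality.
aggressive_treatment = asthma | (rng.random(n) < 0.2)
logit = 0.05 * (age - 65) + 0.8 * asthma - 1.5 * aggressive_treatment - 1.0
died = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

print("observed mortality, asthma:   ", round(died[asthma == 1].mean(), 3))
print("observed mortality, no asthma:", round(died[asthma == 0].mean(), 3))

# A transparent model exposes the learned rules for expert review.
X = np.column_stack([age, asthma])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, died)
print(export_text(tree, feature_names=["age", "asthma"]))
```

A black-box model fit to the same data would encode the same association but would give reviewers no comparably direct way to notice it.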

1.8.3 Beyond performance and interpretability: causality

In medicine, our goal is often to answer questions that are causal in nature rather than merely associational. An interpretable, high-performing predictive model can deceive us into thinking we have discovered causal relationships between certain features and an outcome. Hidden confounders are extremely common in observational clinical data and make causal inference challenging. Research in causality has a long and exciting history in statistics.31,32 Recent work is helping to bring causal inference into machine learning and AI for high-dimensional data.33,34 Upon training a clinical predictive model, we may be interested in focusing on modifiable risk factors and ranking them by importance in order to identify potential targets for lifestyle changes or prophylactic treatment. Causal inference is necessary to estimate treatment effects and optimize intervention recommendations for the individual.

1.8.4 Defining the question, measuring real-world impact

We have discussed a few of the important factors affecting model performance. Let us consider the fundamental question of predictive modeling: "What are we optimizing for?", that is, "What question are we trying to answer?" and "How are we planning to use the model?" AI in medicine will serve a vast number of functions across many different settings. It is always essential, but often not trivial, to define the model objective.35 As we saw in the pneumonia example given previously, even within a given clinical context, two different models can be trained on the same input data and outcome but with different goals.

ML models' performance is commonly assessed using a set of metrics that calculates the difference between the true and estimated outcomes. For example, in a simple binary classification problem, mislabeling a case as belonging to group "A" when it actually belongs to group "B" incurs the same penalty as estimating that it belongs to "B" when in fact it belongs to "A." In real-world applications, such as medicine, one must go a step further and understand what a prediction error translates into in real life. In one of the simplest cases, consider the scenario where a treatment for disease X is readily available, cheap, and safe (free from side effects). When training a model to classify X from non-X, we might choose to maximize sensitivity (i.e., increase the number of true positives, a.k.a. hits) instead of maximizing overall accuracy. There is more to gain for every case correctly recognized as X than there is to lose for misclassifying a non-X case as X. The opposite may be true if the treatment for X were difficult to procure, expensive, and highly toxic to subjects not suffering from the disease; that is, we might prefer to maximize specificity instead. Furthermore, the potential benefit and harm of misclassification may not be fixed across cases but rather depend on patient characteristics. Understanding and customizing algorithms' loss functions to account for real-world impact is one of the processes that still require human expert input. A small sketch of this sensitivity-specificity trade-off follows below.

The duty of health-care professionals is to assess each patient and recommend the best treatment options after considering not just their unique biology but also their personal wishes and preferences. Informed consent is required for all medical procedures (with the exception of cases where a patient lacks capacity, acutely or chronically, in which case the law describes who is allowed to decide on their behalf). The patient has the final say on (1) what clinical exams and tests are acceptable during their assessment, (2) what treatments they are willing to undergo, and, crucially, (3) what they hope for and expect of the treatment outcome. Application of a clinical AI system toward explicit care recommendations must take all of these into account to be effective. While the goal of a diagnostic model is straightforward, in trying to get the correct diagnosis, the patient's wishes could limit the available information, for example, by refusing certain examinations or tests. Individual wishes are particularly important in defining what a treatment model should optimize for. As a common example, patients faced with end-of-life care decisions often must balance maximizing survival time against optimizing quality of life. Certain treatments have such serious side effects that receiving no treatment becomes an option. AI for precision medicine can, therefore, optimize clinical decisions based not only on the patient's biology, lifestyle, and environment but also on their cultural, religious, and personal needs and wishes.

1.8.5 Maximizing information gain across modalities, tasks, populations, and time

The goal of AI in medicine is to use all available information and produce the best possible recommendation at each point of care for each individual patient. Existing AI systems are limited in their ability to take advantage of multiple different data sources to train a single model. Ongoing work is focused on developing algorithms that can learn from heterogeneous health data sources, which are the norm in clinical practice. This requires the joint optimization of multimodal data to produce a model that takes into account all patient information concurrently, as opposed to creating multiple separate models. Furthermore, clinical data is collected over different timeframes. Demographic data remains constant, laboratory testing may be conducted a few times a day at most, and physiological monitoring may be continuous (time series data).

At the same time, the main limitation of weak AI is that each model is limited to addressing a single prescribed problem. Different approaches exist to take advantage of what is learned in one problem when training on another. Transfer learning, which has been particularly successful in learning from images, allows the sharing of low-level features learned on one dataset with another. For example, it can be used to capitalize on a dataset of generic images with millions of examples to "bootstrap" learning on a much smaller, specialized set of (potentially rare and expensive) medical images, resulting in a significant improvement in performance; a minimal sketch is given below. A related and particularly fascinating field of AI research is Lifelong Machine Learning.36 The goal is to generalize higher level representations among learning tasks, in a way more akin to how humans learn. This is an essential step toward AGI and will help train more robust and dependable clinical predictive models. For example, Lifelong ML may be able to help us capitalize on the knowledge captured by individual disease-predicting models toward a general medicine AI system. Such approaches will help make models more robust and trustworthy.

Federated learning (a.k.a. collaborative learning) is another subfield within ML and AI of increasing importance in the context of learning from sensitive medical data while protecting individual privacy.37 This technique allows the training of an algorithm on sensitive data, such as protected health information, present at multiple decentralized sites, without the exchange of data. For example, a number of hospitals can contribute cases toward the training of a model without the data itself ever leaving each hospital's data center. Similarly, algorithms can be securely trained on data present on consumer devices and contributed by individuals around the world, opening the door to massive and rich new datasets.

To date, the majority of AI/ML models are trained on historical datasets. In medicine, as in many other fields, a high variety of data is generated continuously at vast volumes and with enormous velocity (the three V's associated with "big data"). At the same time, environmental and lifestyle factors that affect health may shift rapidly. As a result, medical models require continuous updating and quality assessment (QA). ML techniques such as online machine learning, sequential machine learning, and incremental learning38,39 refer to learning from a continuous stream of data, such that a model is updated as new cases become available.
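The snippet below sketches the transfer learning idea described above: a network pretrained on ImageNet is reused as a frozen feature extractor, and only a new two-class head is trained for a hypothetical medical imaging task. It assumes PyTorch and a recent torchvision are installed (the pretrained-weights argument has changed across torchvision versions); the random tensors stand in for a real, much smaller medical dataset.

```python
# Transfer learning sketch: reuse generic low-level features, train a new head.
import torch
from torch import nn
from torchvision import models

# Load an ImageNet-pretrained backbone (weights are downloaded on first use).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the generic features learned from millions of everyday images.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final classification layer with a new two-class head.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a random batch standing in for real images.
images = torch.randn(4, 3, 224, 224)   # batch of 4 RGB images
labels = torch.tensor([0, 1, 0, 1])    # hypothetical class labels
optimizer.zero_grad()
loss = loss_fn(backbone(images), labels)
loss.backward()
optimizer.step()
print("loss after one step:", float(loss))
```

In a federated setting, the same kind of update step runs locally at each site, and only model parameters, not patient records, are exchanged.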

1.8.6 Quality assessment and expert supervision

How do we ensure that QA procedures keep up with continuous model updates? How should the use of such models be regulated? The Food and Drug Administration has proposed treating ML-based software as medical devices40 as it begins to shape a regulatory framework for AI technology, a framework that is bound to fundamentally change how medicine is practiced. Similar to pharmaceutical agents and medical devices, new AI/ML applications will require extensive testing prior to adoption, but unlike drugs or devices, they will need continuous QA as they are continuously updated.

Physicians and patients have their own ideas, concerns, and expectations regarding AI/ML technology in general and in medicine in particular. Some will readily trust an algorithm's predictions over a physician's; others, the opposite. A core question is: can we provide guarantees regarding a model's predictions? Can we guarantee that no patient in the ER waiting room with asthma and pneumonia will be considered low risk? Across fields, we often pit human expert performance against machine performance to decide whom to trust for a specific task. ML excels at extracting patterns from a dataset with ease but, as described previously, is still severely limited in building higher level representations and concepts to contextualize, generalize, and discriminate beyond the given dataset. Approaches such as expert-augmented machine learning41 and human-in-the-loop AI42 attempt to combine the complementary strengths of humans and machines. Such approaches can result in better generalizability to unseen cases and, crucially, allow QA of models by domain experts ahead of deployment.
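One simple form of such QA is a behavioral check, rerun automatically every time the model is updated, that encodes guarantees domain experts insist on, such as the asthma-and-pneumonia example above. The sketch below is purely illustrative: the feature layout, threshold, sentinel cases, and the stand-in model are hypothetical placeholders for whatever a deployed system actually uses.

```python
# Illustrative behavioral QA check for a clinical risk model. The feature layout,
# threshold, and sentinel cases are hypothetical placeholders.

def qa_check_asthma_pneumonia(model, low_risk_threshold=0.05):
    """Fail loudly if any expert-curated asthma-plus-pneumonia case scores as low risk."""
    # Hypothetical feature layout: [age, has_asthma, has_pneumonia, respiratory_rate]
    sentinel_cases = [
        [72, 1, 1, 28],
        [45, 1, 1, 32],
    ]
    for case in sentinel_cases:
        risk = model.predict_proba([case])[0][1]
        assert risk >= low_risk_threshold, f"QA failure: case {case} scored {risk:.3f}"
    print(f"QA check passed for {len(sentinel_cases)} sentinel cases")

class AlwaysModerateRisk:
    """Stand-in for a real model so the check can be run as a demonstration."""
    def predict_proba(self, X):
        return [[0.4, 0.6] for _ in X]

qa_check_asthma_pneumonia(AlwaysModerateRisk())
```

Such checks complement, rather than replace, statistical evaluation and expert review before each release.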

1.9 Making it a reality: integrating artificial intelligence into the human workforce of a learning health system

The best health data coupled with the best algorithms will do little good in improving patient outcomes and streamlining health-care delivery if they are not (1) trusted by both clinicians and patients and (2) integrated effectively into the human health-care workforce. As AI technologies mature and gradually become ready for adoption across health-care access points, specialized training is necessary to ensure their optimal and continued safe use. This is no different from any other medical technology, but it may be of a larger scale with more far-reaching consequences. Education at all levels and for all health-care professionals must incorporate a solid foundation of statistics, ML, and AI, covering the basics of algorithm design, model QA, and application in clinical practice. Not all implementations face the same challenges or risks. The successful introduction of AI to clinical practice will depend on orchestrating the timing of AI implementations. By successfully addressing the more manageable clinical problems first, we can build up the expertise and the trust necessary to address the more challenging ones. At each step of the way, combining human and machine resources should deliver better care to patients than could be achieved with either alone.

References 1. Pohl I. John Haugeland. Artificial intelligence: the very idea. Bradford books. The MIT Press, Cambridge, Mass., and London, 1985, ix 1 289 pp. J Symbol Logic 1988;53:659 60. 2. Tricoci P, Allen JM, Kramer JM, Califf RM, Smith SC. Scientific evidence underlying the ACC/AHA clinical practice guidelines. JAMA 2009;301:831 41. Available from: https://doi.org/10.1001/jama.2009.205. 3. Chen JH, Alagappan M, Goldstein MK, Asch SM, Altman RB. Decaying relevance of clinical data towards future decisions in data-driven inpatient clinical order sets. Int J Med Inform 2017;102:71 9. Available from: https://doi.org/10.1016/j.ijmedinf.2017.03.006.


4. Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green CC, Cohen SN. Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system. Comput Biomed Res 1975;8:303 20. Available from: https://doi.org/10.1016/0010-4809(75)90009-9. 5. Kulikowski CA, Weiss SM. Representation of expert knowledge for consultation: the CASNET and EXPERT projects. Artif Intell Med 1982;51. 6. Miller RA, Pople Jr HE, Myers JD. Internist-I, an experimental computer-based diagnostic consultant for general internal medicine. N Engl J Med 1982;307:468 76. 7. Duda RO, Shortliffe EH. Expert systems research. Science 1983;220:261 8. Available from: https://doi.org/ 10.1126/science.6340198. 8. McClelland JL, Rumelhart DE, Group PR, et al. Parallel distributed processing. Cambridge, MA: MIT Press; 1987. 9. Sun R. Artificial intelligence: connectionist and symbolic approaches; 1999. 10. Hartnett K. To build truly intelligent machines, teach them cause and effect. Quanta Magazine. ,https:// www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/.; 2018 (accessed 04.09.20). 11. Hastie TJ, Tibshirani RJ, Friedman JJH. The elements of statistical learning, data mining, inference, and prediction. 2nd ed. Springer; 2009. 12. Koller D, Friedman N, Dˇzeroski S, Sutton C, McCallum A, Pfeffer A, et al. Introduction to statistical relational learning. MIT Press; 2007. 13. Iqbal MH, Aydin A, Brunckhorst O, Dasgupta P, Ahmed K. A review of wearable technology in medicine. J R Soc Med 2016;109:372 80. Available from: https://doi.org/10.1177/0141076816663560. 14. Turakhia MP, Desai M, Hedlin H, Rajmane A, Talati N, Ferris T, et al. Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: the Apple Heart Study. Am Heart J 2019;207:66 75. Available from: https://doi.org/10.1016/j.ahj.2018.09.002. 15. Chang HY, Jung CK, Woo JI, Lee S, Cho J, Kim SW, et al. Artificial Intelligence in Pathology. J Pathol Transl Med 2019;53:1 12. Available from: https://doi.org/10.4132/jptm.2018.12.16. 16. Fleming N. How artificial intelligence is changing drug discovery. Nature 2018;557:S55. 17. Shanafelt TD, Dyrbye LN, West CP, Sinsky CA. Potential impact of burnout on the US physician workforce. Mayo Clinic Proc 2016;91:1667 8. Available from: https://doi.org/10.1016/j.mayocp.2016.08.016. 18. Topol E. Deep medicine: how artificial intelligence can make healthcare human again. Hachette UK; 2019. 19. Wang F, Preininger A. AI in health: state of the art, challenges, and future directions. Yearb Med Inform 2019;28:16 26. Available from: https://doi.org/10.1055/s-0039-1677908. 20. US Department of Health and Human Services, 1979. The Belmont Report. 21. Wu Q, Liu J, Wang X, Feng L, Wu J, Zhu X, et al. Organ-on-a-chip: recent breakthroughs and future prospects. Biomed Eng Online 2020;19:9. Available from: https://doi.org/10.1186/s12938-020-0752-0. 22. Dehon E, Weiss N, Jones J, Faulconer W, Hinton E, Sterling S. A systematic review of the impact of physician implicit racial bias on clinical decision making. Acad Emerg Med 2017;24:895 904. Available from: https://doi. org/10.1111/acem.13214. 23. Hamberg K. Gender bias in medicine. Womens Health (Lond Engl) 2008;4:237 43. Available from: https://doi. org/10.2217/17455057.4.3.237. 24. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 2019;1:206 15. 
Available from: https://doi.org/10.1038/s42256-019-0048-x. 25. Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci USA 2019;116:22071 80. Available from: https://doi.org/ 10.1073/pnas.1900654116. 26. Cooper GF, Aliferis CF, Ambrosino R, Aronis J, Buchanan BG, Caruana R, et al. An evaluation of machinelearning methods for predicting pneumonia mortality. Artif Intell Med 1997;9:107 38. Available from: https:// doi.org/10.1016/s0933-3657(96)00367-3. 27. Olson RS, Cava WL, Mustahsan Z, Varik A, Moore JH. Data-driven advice for applying machine learning to bioinformatics problems. Biocomputing 2018;23:192 203. Available from: https://doi.org/10.1142/9789813235533_0018. 28. Friedman JH, Popescu BE. Predictive learning via rule ensembles. Ann Appl Stat 2008;2:916 54. Available from: https://doi.org/10.1214/07-AOAS148. 29. Luna JM, Gennatas ED, Ungar LH, Eaton E, Diffenderfer ES, Jensen ST, et al. Building more accurate decision trees with the additive tree. Proc Natl Acad Sci USA 2019;116:19887 93. Available from: https://doi.org/ 10.1073/pnas.1816748116.


30. Caruana R. Friends don’t let friends deploy black-box models: the importance of intelligibility in machine learning. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, KDD ’19. Anchorage, AK: Association for Computing Machinery; 2019. p. 3174. Available from: https://doi. org/10.1145/3292500.3340414. 31. Tukey J, Kempthorne O, Bancroft T, Gowen J, Lush J. Causation, regression, and path analysis. Ames, IA: Iowa State College Press; 1954. p. 35 66. 32. Pearl J. Causal diagrams for empirical research. Biometrika 1995;82:669 88. Available from: https://doi.org/ 10.1093/biomet/82.4.669. 33. Laan MJ, van der Rubin D. Targeted maximum likelihood learning. Int J Biostat 2006;2.. Available from: https://doi.org/10.2202/1557-4679.1043. 34. Schuler MS, Rose S. Targeted maximum likelihood estimation for causal inference in observational studies. Am J Epidemiol 2017;185:65 73. Available from: https://doi.org/10.1093/aje/kww165. 35. Seo H, Bassenne M, Xing L. Closing the gap between deep neural network modeling and biomedical decision-making metrics in segmentation via adaptive loss functions, IEEE Trans. Med. Ima., conditionally accepted, 2020. 36. Chen Z, Liu B. Lifelong machine learning, second edition. Synth Lect Artif Intell Mach Learn 2018;12:1 207. Available from: https://doi.org/10.2200/S00832ED1V01Y201802AIM037. 37. McMahan HB, Moore E, Ramage D, Hampson S, Arcas BAy. Communication-efficient learning of deep networks from decentralized data. arXiv:1602.05629 [cs]; 2017. 38. Shalev-Shwartz S. Online learning and online convex optimization. MAL 2012;4:107 94. Available from: https://doi.org/10.1561/2200000018. 39. Bishop CM. Pattern recognition and machine learning. Springer; 2006. 40. US FDA. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ ML)-based software as a medical device (SaMD). FDA; 2019. 41. Gennatas ED, Friedman JH, Ungar LH, Pirracchio R, Eaton E, Reichmann LG, et al. Expert-augmented machine learning. Proc Natl Acad Sci USA 2020. Available from: https://doi.org/10.1073/pnas.1906831117. 42. Patel BN, Rosenberg L, Willcox G, Baltaxe D, Lyons M, Irvin J, et al. Human machine partnership with artificial intelligence for chest radiograph diagnosis. NPJ Digit Med 2019;2:1 10. Available from: https://doi.org/ 10.1038/s41746-019-0189-7.

I. Introduction

C H A P T E R

2 Artificial intelligence in medicine: Technical basis and clinical applications

Bradley J. Erickson

Abstract
There has been a revolution in the past decade as “deep learning” has begun to show high performance with robust reliability in the real world for many imaging tasks. Current deep learning technologies are being applied to medical imaging tasks with good results, which has prompted great interest in applying them broadly into clinical practice. This chapter describes the basic principles of deep learning methods and some common applications in medical imaging.

Keywords: Deep learning; convolutional neural network; fully connected network; loss function; residual block; generative adversarial network

2.1 Introduction

Artificial intelligence (AI) has a long history of development, beginning in the 1950s when Hubel and Wiesel1 began mapping the visual cortex of cats and noting the functional anatomy of how that part of the brain worked. It was not long afterward that computational models (often referred to as “artificial neural networks” or ANNs) were created, in the hope of both understanding and confirming our model of brain function, but with the additional possibility of creating artificial brains that could perform useful work. About that same time, computer scientists built rule-based systems to assist in decision-making processes, including applications in medicine. One of the earliest such systems was designed to assist in the selection of antibiotics based on gram staining and patient characteristics.2 There has been a wide variety of technologies used to aid in the diagnosis of medical images. The earliest report3 relied on the physician to identify features, which were then conveyed to an algorithm that suggested a differential diagnosis. Since then, other

methods that automatically extract features directly from the pixels of the images (whether acquired directly as digital data or captured by digitizing film) have been produced, including several FDA-cleared computer-aided diagnostic (CAD) tools.

2.2 Technology used in clinical artificial intelligence tools

There has been a revolution in the past decade as “deep learning” has begun to show high performance with robust reliability in the real world for many imaging tasks. The earlier machine learning tools were often “brittle,” meaning that they failed if the input data or images did not appear very similar to the training examples, and those training examples often had to be acquired under carefully constrained conditions, limiting their application in the real world. Current deep learning technologies can be divided into several families, largely depending on the dominant technological feature. The first point to make is that deep learning gets its name because its networks have many layers. For instance, fully connected networks (FCNs) are the prototypical neural networks, where the layers consist of multiple nodes, each “connected” to the subsequent layer by a “weight.” Convolutional neural networks (CNNs) begin with several layers that perform convolutions on the input, each of them often followed by pooling layers that combine features while reducing the resolution of an image. Recurrent neural networks (RNNs) have a connection from a later layer to an earlier layer in the network, accounting for the “recurrent” nature pointed to in the name; these are often applied to situations where the input data are repetitive. It is not possible to cover every component of modern deep learning algorithms in this chapter, so the focus here is on clinical applications. Readers interested in these other elements will find a recent review that covers them in greater depth in Ref. [4].

2.2.1 Elements of artificial intelligence algorithms

There are several common components that are often seen in different types of deep learning algorithms. This section describes these elements, and later parts of the chapter then describe how they are combined to address problems in specific ways.

2.2.1.1 Activation functions
Activation functions are based on the action potential seen in neurons. While action potentials in biological cells are binary (either they have “fired” or not), computational “neurons” (also referred to as “nodes”) can output either binary or numeric values. They can also be more flexible about how the output is computed: summing the inputs is the most common approach, but other options include using the slope (rate of change) of the inputs as well as more complex functions of the inputs. In addition, biological neurons have a limited firing rate, with a refractory period after firing, while computational neurons are limited only by the speed of the computer.


There are several properties of activation functions that are useful for learning systems. These properties include being:
1. Nonlinear. If the activation is a linear function of the inputs, then the output of the whole network can only be a linear function as well.
2. Differentiable. While not required in all cases, most learning systems require the calculation of gradients in order to update the weights as the optimal set of weights is pursued. Some special functions like rectified linear units (ReLU) are differentiable everywhere except at 0, and that case can be handled efficiently in software.
3. Controlled range. In most cases, it is desirable to limit the range of values that are output by the activation function, or else the system often becomes unstable.
It is not necessary to use the same activation function for all nodes. It is common practice in current deep learning systems to use a function like ReLU for most of the layers but to use a function like SoftMax at the output layer to produce values more akin to a probability.

2.2.1.2 Fully connected layer
Fully connected layers (sometimes also called dense layers) are the prototypical layers that compose a neural network. Each layer consists of an arbitrary number of nodes (neurons), which receive weighted inputs from the prior layer, add those values up (in rare cases, other operations like rates of change or products have been described), and output a value based on that sum. The output value is determined by the activation function. This computational model was first proposed by McCulloch and Pitts in 19435 but the concept of a complete neural network that could learn is generally attributed to Turing.6 A simple neural network consisting only of fully connected layers is shown in Fig. 2.1.

FIGURE 2.1 A simple neural network consisting of an input layer with three nodes, a hidden layer with five nodes, and an output layer with just one node.
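To make the preceding description concrete, the following is a minimal NumPy sketch (an illustration, not code from this chapter) of a forward pass through a small fully connected network, with ReLU in the hidden layer and a SoftMax-style output. The layer sizes, random weights, and the use of two output classes (so the SoftMax is meaningful) are illustrative assumptions only.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: passes positive values, zeroes out negatives
    return np.maximum(0.0, x)

def softmax(x):
    # Subtract the max for numerical stability, then normalize so values sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)

# Illustrative layer sizes: 3 inputs -> 5 hidden nodes -> 2 output classes
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # output-layer weights and biases

x = np.array([0.2, -1.3, 0.7])                  # one example with three input features

hidden = relu(W1 @ x + b1)                      # weighted sum followed by ReLU
output = softmax(W2 @ hidden + b2)              # probability-like output values

print(output, output.sum())                     # the two outputs sum to 1.0
```

In a real system the weights would be learned from data rather than drawn at random; the point here is only how each node combines weighted inputs and applies an activation function.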


2.2.1.3 Dropout
A challenge with deep neural networks is that they can overfit, that is, the network learns features that are specific to the examples in the training set rather than the general features present in that class of examples in either the training or testing sets. There are several ways to address overfitting, and one architectural approach is called dropout. This is a technique described by Hinton et al.7 and involves the exclusion of randomly selected nodes with each presentation of a training sample. It is common to use a dropout ratio of 0.5, meaning that 50% of nodes are excluded. Dropout can also be thought of as an efficient way to implement many networks (each set of dropped nodes is effectively a unique network) within one computational framework.

2.2.1.4 Residual blocks
Residual blocks8 (also known as “residual networks,” though that term should refer to complete networks that incorporate residual blocks) typically consist of two to three layers with a “skip connection” that carries the input of the block directly to its output, thus forming the “block” (see Fig. 2.2). Using residual blocks almost always improves performance because it forces each layer to learn: if a layer within the block does not do better than what it receives as input, the skip connection is effectively selected. It has been shown that residual blocks smooth the gradient space. Architectures with multiple skips to multiple layers are often referred to as DenseNets.9

FIGURE 2.2 A residual block. The output from layer 1 is fed both to layer 2 and to layer 3. If layer 2 is not able to do better than its input, layer 3 will effectively ignore it, thus reducing the number of weights in the network, and thus reducing the chance of overfitting. Note that there can be more layers between the skip connection (e.g., a layer 2a between layer 2 and layer 3) but typically there are only one or two layers plus the output layer in the residual block.
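The following is a minimal PyTorch sketch of the residual-block idea in Fig. 2.2 (an illustration only; the class name, layer widths, and use of fully connected layers are assumptions, not the published residual network design): two small layers whose output is added to the block's input through a skip connection.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two fully connected layers plus a skip connection from input to output."""

    def __init__(self, width: int):
        super().__init__()
        self.layer1 = nn.Linear(width, width)
        self.layer2 = nn.Linear(width, width)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.layer1(x))
        out = self.layer2(out)
        # If the learned layers add nothing useful, the block can still
        # pass its input through unchanged via the skip connection.
        return self.act(out + x)

block = ResidualBlock(width=64)       # illustrative width
y = block(torch.randn(8, 64))         # batch of 8 feature vectors
print(y.shape)                        # torch.Size([8, 64])
```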


2.2.1.5 Initialization
Neural networks have many weights—essentially one for every connection between each node in a layer and the subsequent layer. Therefore a typical neural network will have many thousands to millions of weights that are to be learned during training. This raises the question: what values should the weights have at the start? One might first consider initializing all weights to 0 (or any other single value), but in that case the derivative of the loss function (the summed output error) will be the same for every weight in a given layer. Therefore one cannot effectively update the weights, and the network will not learn. Random values are frequently used to initialize the weights, but it has been found that using arbitrary random values can lead to poor learning. Instead, constrained random weights seem to work best, and there are a few options that are commonly used. One simple option is to limit the range to be from −1.0 to 1.0, referred to as random uniform initialization, or to draw from a normal distribution (random normal). Somewhat better performance can be shown with He, Xavier (also known as Glorot), or LeCun initialization functions, all of which have normal and uniform distribution forms.
One special form of initialization is transfer learning. If there are networks that have already been trained on problems similar to your task, it is often more efficient to take that network (including its trained weights) and begin training from it. It is likely that the learned weights will be a better starting point than any of the initialization functions described previously. It is also common practice to “freeze” the weights of early layers, based on the assumption that the low-level features are common, if the task truly is similar. Transfer learning can be particularly useful when your training set is limited, because freezing layers reduces the number of weights to be learned.

2.2.1.6 Convolution and transposed convolution
A convolution is a core function that is used for extracting features from the input image or signal. Convolution consists of a convolutional kernel (a small matrix that matches the dimensionality of the input) that is moved across the input image; corresponding elements of the image and kernel are multiplied, and those products are added to produce an output value. For example, if the input is a 2D image, the kernel is also 2D, such as a 3 × 3 matrix. If the edges are padded, the output image will have the same dimensions as the input image. Another strategy is to not convolve the edge pixels, in which case the output image will be smaller than the input.
Increasing the matrix size of an image (effectively magnifying it) is often needed. A simple way to do this is to duplicate each pixel in each dimension, or, at the next level of sophistication, to linearly interpolate the values. However, both of these strategies can lead to undesirable effects. Transposed convolution (sometimes incorrectly referred to as deconvolution, or correctly referred to as upsampled convolution) is a technique to increase the matrix size of an image by using a kernel. In this case, as the kernel is passed over the input image, each element of the kernel is multiplied by the corresponding pixel in the image plus its input-image neighbors to produce a component the size of the window. This is repeated for each element of the kernel, and then all the components are added together, producing an upsampled image that reflects features the kernel was selected to amplify.

2.2.1.7 Inception layers
GoogLeNet contains multiple inception modules, in which multiple different filter sizes are applied to the input and their results concatenated. This multiscale processing allows the module to extract features at different levels of detail simultaneously. GoogLeNet also popularized the idea of not using fully connected layers at the end, but rather global average pooling, significantly reducing the number of model parameters.
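As a rough illustration of the inception idea described above (a simplified sketch with arbitrary channel counts, not the actual GoogLeNet module), the branches below apply different kernel sizes to the same input in parallel and concatenate their outputs along the channel dimension.

```python
import torch
import torch.nn as nn

class TinyInception(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions whose outputs are concatenated."""

    def __init__(self, in_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 8, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 8, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 8, kernel_size=5, padding=2)

    def forward(self, x):
        # Each branch extracts features at a different scale; padding keeps the
        # spatial size unchanged so the outputs can be stacked channel-wise.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

module = TinyInception(in_ch=3)
out = module(torch.randn(1, 3, 64, 64))   # one 64 x 64 RGB image
print(out.shape)                          # torch.Size([1, 24, 64, 64])
```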

2.2.2 Popular artificial intelligence software architectures

2.2.2.1 Neural networks and fully connected networks
Previous sections have described fully connected layers where each node of a layer is connected to each node of the subsequent layer by a weight. An FCN will usually include many such layers, often with ReLU activation functions for most layers (often with dropout), but with a sigmoidal-shaped function at the output, which creates output values that are more akin to a probability; this is helpful in understanding how to convert values to decisions.

2.2.2.2 Convolutional neural networks
CNNs are a popular form of deep neural network that have proven quite effective for image-focused learning. A fundamental difference between a CNN and a fully connected network is that the first few layers consist of alternating convolutions and pooling (Fig. 2.3). After these initial layers the 2D image is “flattened” into a 1D array of values, which is then usually used as input to a few fully connected layers until the final output layer is reached.

FIGURE 2.3 Simple convolutional neural network. The input image is convolved with a kernel. The convolved image is then reduced in resolution using pooling. There are usually 3–10 groups of convolution and pooling applied. The last pooled image is then “flattened,” meaning that the 2D array is converted to a 1D array that can be used as input to fully connected layers until an output is predicted.

Convolutional neural networks work well for many image problems (and also some 1D problems) where the location of an object in the input is not known, where the thing of interest consists of several parts that must be combined to be recognized, and where precise size and scale are not known. During the learning process, both the values in the convolutional kernels (also referred to as “filters”) and the weights of the fully connected layers are updated/learned. In general, the convolution kernels learn the features in the images that are important while the fully connected layers learn the best weights and combinations of those features. It should also be noted that there are many convolutional kernels at each resolution, so many different features can be found at each level of resolution. While radiological images are usually gray scale, photographic images are red/green/blue, and CNNs will also have separate kernels for each color channel. As a result, even if we specify just a 5 × 5 kernel in a certain layer, there will be many more values because of the number of color channels and the number of filters per layer. For example, if a 5 × 5 kernel is used with 3 color channels and 16 filters, we will actually have 5 × 5 × 3 × 16 = 1200 values that can be learned for that one layer.
After each convolutional layer, there is often a pooling layer, which serves to reduce the resolution of the image. Commonly the maximum pool or MaxPool function is used. It is a simple function that takes a small region of an image (such as 2 × 2) and outputs the maximum value in that region. This effectively reduces the resolution of the image, and taking the maximum (though some also use the mean value) is thought to be a good function because it rewards regions where the kernel has found a feature it matches.
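A minimal PyTorch sketch of the convolution/pooling/flatten pattern of Fig. 2.3 follows (the specific layer sizes, input size, and two-class output are illustrative assumptions, not values from the chapter); the first convolution also reproduces the kernel parameter count worked out above.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # 3 color channels in, 16 filters, 5 x 5 kernels:
    # 5 * 5 * 3 * 16 = 1200 learnable kernel values (plus 16 biases)
    nn.Conv2d(3, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 2 x 2 max pooling halves the resolution
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                     # 2D feature maps -> 1D vector
    nn.Linear(32 * 16 * 16, 2),       # fully connected layer to 2 output classes
)

x = torch.randn(1, 3, 64, 64)         # one 64 x 64 RGB image
print(model(x).shape)                 # torch.Size([1, 2])

kernel_params = model[0].weight.numel()
print(kernel_params)                  # 1200 learnable values in the first kernel set
```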


2.2.2.3 U-Nets and V-Nets
Segmentation is the assignment of labels to pixels of an image. For instance, if the image is a CT scan of the abdomen, a segmentation algorithm might identify the liver or tumors within the liver. Segmentation is a critical step in many image analysis tasks and has been the focus of decades of research. Early methods focused on finding edges or regions of similar intensity, but these were often very sensitive to how the image was acquired and also required that the item being segmented have a consistent appearance—something often not true when pathology is present.
A novel deep learning architecture designed for segmentation is the U-Net.10 It gets its name from the architecture (Fig. 2.4), wherein the first steps of the algorithm successively reduce the resolution of an image while extracting key high-level properties of the object being segmented. At the lowest resolution (the bottom of the “U”) the key features of the structure should be captured. The right-hand side of the U consists of restoring the resolution of the image while retaining focus on the structure of interest. There are “skip connections” where the corresponding resolution of the input image on the left side of the “U” is accessed by the right-hand side to assist in localizing the margins of the object of interest. A typical U-Net will use many of the components described previously, including convolutional elements as well as pooling elements that reduce resolution. It then uses transposed convolutions to increase the resolution on the right side while using “skip connections” from the convolutions on the left. The original description of the U-Net was for 2D biological images, but it has now been successfully applied to many other 2D images. It has also been extended to 3D, which is referred to as a “V-Net.”11
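The sketch below is a toy illustration of the essential U-Net pattern (it is not the published U-Net; the class name, channel counts, and single-level depth are assumptions): downsampling on the way down, a transposed convolution on the way up, and a skip connection that concatenates matching-resolution features from the encoding side.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level U-Net-style encoder/decoder with a single skip connection."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                          # left side of the "U"
        self.bottom = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(16, 8, 2, stride=2)     # right side of the "U"
        self.dec = nn.Sequential(nn.Conv2d(16, 8, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(8, 1, 1)                        # per-pixel label logits

    def forward(self, x):
        e = self.enc(x)                          # full-resolution features
        b = self.bottom(self.down(e))            # low-resolution "bottom of the U"
        u = self.up(b)                           # restore resolution
        u = torch.cat([u, e], dim=1)             # skip connection helps localize margins
        return self.out(self.dec(u))

net = TinyUNet()
mask_logits = net(torch.randn(1, 1, 64, 64))     # one grayscale image
print(mask_logits.shape)                         # torch.Size([1, 1, 64, 64])
```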


2.2.2.4 DenseNets
ANNs were inspired by what we understood to exist in the brain. In this arrangement, layers of nodes (neurons) are sequentially connected to subsequent layers of nodes. This leads to a clean and understandable computational model that has been applied to many problems. A challenge with this architecture is the vanishing gradient problem: when an error is computed at the output of a network, it can be hard to decide how the error should be used to update a given individual layer. If there are 20 layers, should the error be evenly distributed across each layer? Typically the error is apportioned according to the weights in a layer so that layers with small weights do not get a large change while layers with large weights are effectively “ignored.” Each of the neural network’s weights receives an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value.
The residual block has already been described and addresses the challenge of assuring that each layer contributes positively to the performance of the network, by forcing it to compete with the identity function. The next logical extension is to connect a given layer to all of the subsequent layers, thus forcing each layer as well as the sum of all layers to do better than identity. A network where a layer is connected to all (some might say “many”) subsequent layers is referred to as a DenseNet12 (see Fig. 2.4). The original description had convolutional layers at the start, but some have also described a DenseNet without convolutions. Because the skip connections force more efficient learning, it is common to reduce the number of nodes in each layer (making the layer “narrower”). Besides better parameter efficiency, another advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easier to train. Each layer can directly access the gradients of the loss function, which helps the training of deeper network architectures.

FIGURE 2.4 A DenseNet. This is similar to a group of residual blocks, but with the distinction that each layer has skip connections to all subsequent layers, while the prototypical residual block just skips one or two layers.

2.2.2.5 Generative adversarial networks
Generative adversarial networks (GANs) were first described by Ian Goodfellow13 and represent a distinct departure from the typical learning paradigm. While most deep learning architectures focus on making predictions about existing images based on patterns learned from other images, GANs attempt to learn about existing images in order to create new images or pieces of images. Since that original design, many useful variants have been described.
The basic GAN (Fig. 2.5) consists of a generator that typically has some source of variance such as a noise generator. (This is more frequently described as a “latent space,” and other variants of GANs control this in order to achieve desired outcomes.) The generator is tasked with learning to generate images that simulate the collection of real images. The second main component is a discriminator, essentially a classifier that tries to determine whether an image presented to it is real or fake. These two elements compete against each other (hence the term “adversarial”) with the hope that the generator will learn to create very realistic images.

2.2.2.6 Hybrid generative adversarial network designs
Since the original description of the GAN, there have been a number of variations that have proven to be useful for medical imaging. Several variations have been shown to be effective for creating images of one modality or contrast type based on a different input image.
Some examples include converting MRI to CT for purposes of attenuation correction in treatment planning or PET imaging. There are also examples converting different MR contrast types from one to another.


FIGURE 2.5 Basic GAN architecture. There is a collection of real images, which the GAN should learn to simulate. There is also a fake image generator that uses noise and feedback about its performance to generate images that should be like the real images. The discriminator tries to classify an input image as real or fake. A performance analyzer knows whether a real or fake image was presented to the discriminator and gives feedback to the discriminator about its performance. It also gives feedback to the generator about how well it is fooling the discriminator. GAN, Generative adversarial network.

It is likely that these will become more commonly used in clinical practice. It has also been demonstrated that augmenting data sets using GANs is more effective than simple geometric transformations,14,15 and this will likely put GANs in a more prominent role in medical imaging. Closely aligned with this is the ability to create specific forms of images, such as examples that are normal and others that have a known disease but which do not correspond to any specific patient image. In addition to more effective training of the AI algorithm, this can avoid challenges surrounding patient privacy. It is likely that this use of GAN technology will become more widespread.
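The following is a highly simplified PyTorch sketch of the adversarial training loop described above (the network sizes, synthetic stand-in data, and hyperparameters are placeholder assumptions): the discriminator is trained to separate real from generated samples, and the generator is trained to fool it.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64                       # illustrative sizes
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, data_dim)                    # stand-in for a batch of real images

for step in range(100):
    # --- Discriminator step: label real samples 1 and generated samples 0 ---
    z = torch.randn(32, latent_dim)                 # noise drawn from the latent space
    fake = G(z).detach()                            # detach so only D is updated here
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator step: try to make D label generated samples as real ---
    z = torch.randn(32, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Real medical-imaging GANs replace the linear layers with convolutional generators and discriminators, but the alternating update pattern is the same.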

2.3 Clinical applications

2.3.1 Applications of regression
The output of an AI algorithm can take many forms, but perhaps the simplest is to perform regression, which means that the output is a floating-point value. In this case the output layer may be as simple as a summation of all the outputs of the nodes in the prior layer.

2.3.1.1 Bone age
One of the earliest applications of AI in medical imaging was the prediction of bone age based on a hand radiograph. This particular task is used in the case where children seem to be either more advanced or delayed compared to their chronological age and


the state of maturation of the bones of the hand reflects the metabolic state of the patient. In the past, radiographic atlases were used to compare a patient’s hand radiograph against a large collection of normal hand radiographs. This was also a good task for AI because creating the gold standard was easier, in that the reporting format for pediatric hand radiographs was very stereotypical. The report typically would consist of the estimated bone age in years and months and might also include the patient’s chronological age, which could otherwise be deduced from the medical record. There are now several examples of bone age estimation by AI algorithms, in part promoted by the RSNA bone age challenge that provided a large curated data set of hand radiographs along with the bone age reported by a human expert.

2.3.1.2 Brain age
Just as hand radiographs can be used to estimate the metabolic age of a child, there are now reports using MRI to estimate the brain age of subjects.16,17 There are many large public databases of high-resolution anatomic brain images along with patient demographics (including patient age), which have been used to create algorithms that predict the age of the patient. If this can be reliably performed, looking at cases where predicted and chronological age diverge may provide insight into patients with specific diseases. One recent report also shows that this can be helpful in identifying molecular markers.16 It is also likely that, just as one may be able to predict patient age from MRI of the brain, tools to predict patient age (and variation from the patient’s chronologic age) could be applied to other images such as chest CT or abdominal CT, which might in turn identify important biomarkers of disease.
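As a rough sketch of the regression setup described above (the architecture, image sizes, and reference ages here are placeholders, not the published bone-age or brain-age models), the network ends in a single linear output node and is trained with a mean-squared-error loss against a known age.

```python
import torch
import torch.nn as nn

# Small CNN whose final layer is a single node producing a floating-point age estimate
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 1),          # regression output: predicted age in years
)

images = torch.randn(4, 1, 64, 64)       # stand-in for four radiographs or MR slices
ages = torch.tensor([[8.5], [12.0], [10.2], [6.7]])   # hypothetical reference ages

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                   # mean squared error between predicted and true age

for epoch in range(10):
    pred = model(images)
    loss = loss_fn(pred, ages)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

print(model(images).squeeze())           # predicted ages after a few toy updates
```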

2.3.2 Applications of segmentation
As noted previously, segmentation is the assignment of semantic labels to pixels of an image. This is an important step for measuring size and other properties of a structure from an image. It is also often a first step before other tasks such as regression or classification are applied to an image. Because of this, segmentation has received much attention. While it is desirable to have a segmentation algorithm be as general as possible, it is now clear that the best results are obtained when information both about the type of image and about the properties of the structure being segmented is known.
It is not possible to review all of the deep-learning-based literature on segmentation of medical images. It is worth noting that the U-Net10 was first described for biological images and is now applied not only in medicine but also widely outside of medicine and biology. A PubMed query including the term “U-Net” or (“deep learning” and “segmentation”) returns 184 publications during 2019, 352 during the prior 5 years, and only 4 additional publications before that. This shows the rapid adoption and acceleration of this technology in the field.
The U-Net has already been described, but it is worth paying attention to the error metrics used in evaluating segmentation methods. Probably the most commonly used error metric is the Dice similarity coefficient (DSC)—it is popular enough that it is built into many popular deep learning frameworks. The DSC is 1.0 when the predicted


segmentation perfectly matches the ground truth and is 0 when none of the ground truth pixels are correctly identified. This function can work well for many applications, but a few caveats should be noted:
1. In some medical applications, it may be more important to assure that one is never more than a certain distance from the correct outer margin of a structure, and in that case the Hausdorff distance is a better metric. For very large structures (e.g., the liver), one can get near-perfect DSCs yet still have a few pixels that are a great distance from the true margin. Conversely, for objects with a high surface area:volume ratio, an error in segmenting just a few pixels can result in a very poor score, even if that error is just pixels on the edge of the structure (errors that might result from partial volume effect).
2. In cases where multiple objects are being simultaneously segmented, it can be challenging to get the right weighting of the DSCs of each object. If the objects include both large round objects like the liver, which easily yield high DSCs, and small structures like the adrenals, for which high DSCs are challenging to obtain, using a performance metric that averages all the DSCs together could result in very poor liver segmentation in order to get reasonable adrenal segmentation.
Therefore it is important to understand the medical drivers of the segmentation task in order to determine the optimal error metric and acceptable performance.
The list of applications of deep learning for medical image segmentation is very long and rapidly growing. There have been several international challenges for image segmentation such as those organized by MICCAI, ISBI, and SPIE (see https://grand-challenge.org/challenges/), which have substantially accelerated the number of papers published for specific segmentation tasks. A major driver for publications and success in image segmentation is the availability of the large data sets required for training a deep learning system. As such, it should not be surprising that frequently imaged organs are among the most popular: brain, lung, heart, liver, and kidneys. In addition to these organs, good results are also being shown for segmentation of cancers of some of these organs, including tumors of the brain, lung, and liver, which approach human-level performance.
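A minimal NumPy sketch of the Dice similarity coefficient discussed above follows the standard 2·|A∩B|/(|A|+|B|) definition for binary masks; the small smoothing term is a common practical convention, not something specified in the chapter.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray, smooth: float = 1e-6) -> float:
    """Dice = 2 * |intersection| / (|pred| + |truth|) for binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + smooth) / (pred.sum() + truth.sum() + smooth)

truth = np.zeros((8, 8), dtype=int)
truth[2:6, 2:6] = 1                      # ground-truth square "organ"

perfect = truth.copy()                   # perfect prediction
shifted = np.roll(truth, 1, axis=0)      # prediction off by one row

print(dice_coefficient(perfect, truth))  # ~1.0 for a perfect match
print(dice_coefficient(shifted, truth))  # lower score for the shifted mask
```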

2.3.3 Applications of classification
Classification models attempt to predict a label or class for a given example. A model may predict the class using a continuous value that is the probability of a given example belonging to a class. If the classification problem is binary (“yes” or “no”), then one can simply set a threshold of 0.5 to make the decision. If there are multiple classes, the probabilities can be interpreted as the likelihood or confidence of a given example belonging to each class. A predicted probability can be converted into a class value by selecting the class label that has the highest probability. In binary classification, where the number of classes equals 2, cross-entropy can be calculated as:

−(y log(p) + (1 − y) log(1 − p))


If M > 2 (i.e., multiclass classification), we calculate a separate loss for each class label per observation and sum the result:

−∑_{c=1}^{M} y_{o,c} log(p_{o,c})

where M is the number of classes (e.g., normal, tumor, necrosis); log is the natural log; y_{o,c} is the binary indicator (0 or 1) of whether class label c is the correct classification for observation o; and p_{o,c} is the predicted probability that observation o is of class c.
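To make the loss formulas above concrete, here is a small NumPy sketch (the probabilities and labels are illustrative numbers only) of the binary and multiclass cross-entropy calculations for a single observation.

```python
import numpy as np

def binary_cross_entropy(y: float, p: float) -> float:
    # -(y*log(p) + (1-y)*log(1-p))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def multiclass_cross_entropy(y_onehot: np.ndarray, p: np.ndarray) -> float:
    # -sum over classes c of y_{o,c} * log(p_{o,c}) for one observation o
    return float(-np.sum(y_onehot * np.log(p)))

print(binary_cross_entropy(y=1, p=0.9))          # small loss: confident and correct
print(binary_cross_entropy(y=1, p=0.2))          # large loss: confident and wrong

# One observation, three classes (e.g., normal, tumor, necrosis); true class is "tumor"
y = np.array([0, 1, 0])
p = np.array([0.1, 0.7, 0.2])                    # predicted probabilities sum to 1
print(multiclass_cross_entropy(y, p))            # equals -log(0.7)
```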

2.3.3.1 Detection of disease
There are now hundreds of reports in the peer-reviewed medical literature on the use of deep learning methods to detect disease. The vast majority of these are for the detection of a specific disease using a specific imaging modality. The peak performance of most of these systems approaches that of human experts and in some cases surpasses it. This has sometimes led pundits to suggest that computers will replace physicians for at least some diagnostic tasks. It should be noted that while a computer can be useful for the detection of some specific diseases, there are few reports on the detection of most or all diseases in a given type of image, and that represents a fundamental challenge to the replacement of humans by computers. One recent publication demonstrates the application of AI to detect a broad range of findings in imaging.18 It shows that for most classes of findings, the AI algorithm was able to perform similarly to human experts. That group and others, however, note that the most productive model is one where the AI supplements the human expert rather than attempting to compete separately. An example of this application of AI to medicine is seen in Ref. [19].


2.3.3.2 Diagnosis of disease class
Just as an AI tool can be trained to determine whether or not a specific finding is present, it can also be trained to differentiate between two classes of findings. One common challenge in the practice of medicine is to determine whether or not a patient is responding to a specific therapy. In this case the challenge is not to diagnose the disease but rather to assess the response.
Most recent publications on the use of AI in medical imaging are binary classifiers—a specific disease or finding is present or not. These are usually narrow clinical questions, and the wide variety of possible imaging findings limits the usefulness of these tools. One good example showing broad coverage of AI findings is for detection and classification of abnormalities on chest radiographs.18 In this study of over 100,000 chest radiographs, findings were grouped into 14 categories covering most of the findings that might be observed. They then built a multiclass classifier to identify the findings. The classifier performed at a level similar to radiologists, with the exception of emphysema. Even with this broad coverage, it should be noted that not all important findings were included; for example, gas under the diaphragm and bone fractures are important, if rare, findings that were not covered.

2.3.3.3 Prediction of molecular markers
One of the most exciting applications of deep learning to medical imaging is the ability to identify important molecular markers from routine imaging. Radiology in particular has not participated in the genomics revolution to any significant degree, but deep learning could change that. There are now several reports on the ability of deep learning algorithms to detect features in routine radiological images that predict important molecular markers with high accuracy. Brain glioma markers are probably the most advanced, with greater than 90% accuracy for predicting such markers as isocitrate dehydrogenase mutation, 1p19q chromosomal deletion, and MGMT methylation status.20,21 Other reports show good performance for prediction of some lung cancer molecular markers22,23 as well as molecular markers associated with various forms of dementia.24–26 It is likely that many more molecular markers will continue to be developed over the next few years for many diseases.

2.3.3.4 Prediction of outcome and survival
While the prediction of molecular markers from radiographic imaging is revolutionary, in some respects, it may not be the best ultimate target. It is known that complex gene–gene interactions as well as host factors can alter the expression of a single molecular marker. Since imaging “sees” the phenotype of the tumor, it is possible that predicting the responsiveness to a therapy or the likely survival of a patient may be more important. Indeed, several groups have now also shown that radiomics can make reasonably accurate predictions of clinical course, which may be independent of the markers. This has been shown for diseases such as head and neck cancer,27 lung cancer,28 and polycystic kidney disease.29

2.3.4 Deep learning for improved image reconstruction
The design of imaging devices has always included mathematical formulas that produce images based on an understanding of the physics of the device. There are now many


reports on the use of deep learning methods both to create the images from the raw detected signal and to improve the quality of the image. A common element of these algorithms is the creation of “full quality” images from reduced signal. The reduced signal may be less dose in the case of X-ray-based modalities such as CT, or less RF signal in MRI, where less time is then needed to collect the image data. In most cases the learning algorithm is given the reduced signal either in the form detected, or possibly with some component of traditional reconstruction performed (e.g., Fourier transform for MRI), and the algorithm then attempts to create the “full quality” image from that limited signal input.
In the case of CT, one can train a CNN to work as a filter to reduce the noise in low-dose images such that they look similar to full-dose images,30 and these methods perform as well as or better than traditional filtering methods.31 CNN- and RNN-based image reconstruction methods are rapidly increasing, including methods that directly reconstruct MRI from limited k-space data.32,33
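A highly simplified PyTorch sketch of this idea follows: a small convolutional network is trained to map a reduced-signal (noisy) image to its full-quality counterpart using a mean-squared-error loss. The data, noise model, and architecture here are placeholder assumptions, not any published reconstruction method.

```python
import torch
import torch.nn as nn

# Small image-to-image CNN: noisy "low-dose" input in, denoised estimate out
denoiser = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

full_quality = torch.rand(8, 1, 64, 64)                              # stand-in for full-dose images
low_quality = full_quality + 0.1 * torch.randn_like(full_quality)    # simulated reduced-signal noise

optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(20):
    restored = denoiser(low_quality)
    loss = loss_fn(restored, full_quality)            # match the full-quality target
    optimizer.zero_grad(); loss.backward(); optimizer.step()

print(loss.item())                                    # training loss after a few updates
```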

2.4 Future directions

2.4.1 Understanding what artificial intelligence “sees”
It is a common misperception in the medical community that AI tools are a “black box” and that one cannot understand how they work. For traditional machine learning methods, this is particularly untrue, as humans select the features, and the relative weighting of those features is discernible. For deep learning, determining the features is more challenging, but not impossible. In the case of CNNs, one can both directly observe the values used in the kernels of the convolutions and also see the activations of the network for a particular input, and there are publicly available toolkits that allow any user to visualize these activations.34 It is becoming rather common for publications to include saliency maps, which focus on the impact of input pixels on a decision, as well as activation maps that reflect the gradients in the final layers of a network, which also highlight important parts of an image. A description of these is provided in Philbrick et al.35 Because many applications of deep learning require confidence in the decision basis, improvements in making the “black boxes” more transparent will continue.
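As a minimal illustration of the saliency-map idea (a basic input-gradient approach, not the specific toolkits cited above; the toy model and image are placeholders), one can backpropagate a class score to the input pixels and inspect the magnitude of the gradients.

```python
import torch
import torch.nn as nn

# Toy classifier standing in for a trained model
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 64 * 64, 2),
)
model.eval()

image = torch.randn(1, 1, 64, 64, requires_grad=True)   # input pixels we will attribute to

score = model(image)[0, 1]          # score for the class of interest
score.backward()                    # gradients of the score w.r.t. every input pixel

saliency = image.grad.abs().squeeze()   # large values = pixels that most influence the score
print(saliency.shape)                   # torch.Size([64, 64])
print(saliency.max().item())
```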

2.4.2 Workflow
A challenge to the implementation of AI tools into clinical practice is assuring that all required information is efficiently provided to the tool. Most AI tools today are simple and require only one image. However, AI tools generally do better with more information, and it is almost certain that AI tools of the future will provide better decisions by taking advantage of richer input data. This may include more images—ones with different contrast properties, or from different time points. It may also include more nonimage data, such as patient demographics, known diagnoses, prior and present therapies, and the times of these therapies. All of these are known to improve human performance for most medical tasks, and they are likely to improve AI performance as well. This demand for richer input data will require more sophisticated AI architectures, improved data curation


methods, and better clinical implementation environments. At present, there is limited literature on support for more complex workflows in medical imaging,36 but AI is likely to drive broader adoption and development of this type of technology.

2.5 Conclusion
AI, and deep learning methods in particular, has made significant progress that has resulted in dramatic advances for medical imaging in recent years, and the rate of adoption into clinical practice is likely to accelerate. Current challenges to broad adoption include the large, diverse, and well-curated data sets required for training these systems. Regulatory, legal, workflow, and financial constraints are also impediments to adoption in many jurisdictions. However, these tools have demonstrated the ability to significantly improve the efficiency of medical care, to extract information that humans cannot perceive, and to do this in an objective fashion. This will ensure that the demand for these tools will increase as our understanding of optimal training and implementation models improves.

References
1. Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol 1962;160:106–54.
2. Yu VL, Fagan LM, Wraith SM, et al. Antimicrobial selection by a computer. A blinded evaluation by infectious diseases experts. JAMA 1979;242:1279–82.
3. Lodwick GS, Haun CL, Smith WE, Keller RF, Robertson ED. Computer diagnosis of primary bone tumors. Radiology 1963;273–5. Available from: https://doi.org/10.1148/80.2.273.
4. Lundervold AS, Lundervold A. An overview of deep learning in medical imaging focusing on MRI. Z Med Phys 2019;29:102–27.
5. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 1943;115–33. Available from: https://doi.org/10.1007/bf02478259.
6. Turing AM. Intelligent machinery. NPL. Mathematics Division; 1948. <https://weightagnostic.github.io/papers/turing1948.pdf>.
7. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR. Improving neural networks by preventing co-adaptation of feature detectors. arXiv [cs.NE]. 2012. <http://arxiv.org/abs/1207.0580>.
8. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Bajcsy, editor. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: Conference Publishing Services; 2016. p. 770–8.
9. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. p. 4700–8.
10. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF, editors. Medical image computing and computer-assisted intervention—MICCAI 2015. Springer International Publishing; 2015. p. 234–41.
11. Milletari F, Navab N, Ahmadi S-A. V-Net: fully convolutional neural networks for volumetric medical image segmentation. arXiv [cs.CV]. 2016. <http://arxiv.org/abs/1606.04797>.
12. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. arXiv [cs.CV]. 2016. <http://arxiv.org/abs/1608.06993>.
13. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, editors. Advances in neural information processing systems 27. Curran Associates, Inc.; 2014. p. 2672–80.
14. Han C, Murao K, Satoh S, Nakayama H. Learning more with less: GAN-based medical image augmentation. arXiv [cs.CV]. 2019. <http://arxiv.org/abs/1904.00838>.


15. Frid-Adar M, Klang E, Amitai M, Goldberger J, Greenspan H. Synthetic data augmentation using GAN for improved liver lesion classification. In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). 2018. p. 289–93.
16. Jonsson BA, Bjornsdottir G, Thorgeirsson TE, et al. Deep learning based brain age prediction uncovers associated sequence variants. Nat Commun 2019. Available from: https://doi.org/10.1101/595801.
17. Cole JH, Poudel RPK, Tsagkrasoulis D, et al. Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker. Neuroimage 2017;163:115–24.
18. Rajpurkar P, Irvin J, Ball RL, et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med 2018;15:e1002686.
19. Patel BN, Rosenberg L, Willcox G, et al. Human-machine partnership with artificial intelligence for chest radiograph diagnosis. NPJ Digit Med 2019;2:111.
20. Korfiatis P, Zhou Z, Liang J, Erickson BJ. Fully automated IDH mutation prediction in MRI utilizing deep learning. In: Erickson BJ, Siegel EL, editors. Proceedings of the second conference on machine intelligence in medical imaging. 2017. p. 23.
21. Chang P, Grinband J, Weinberg BD, et al. Deep-learning convolutional neural networks accurately classify genetic mutations in gliomas. AJNR Am J Neuroradiol 2018;39:1201–7.
22. Digumarthy SR, Padole AM, Lo Gullo R, Sequist LV, Kalra MK. Can CT radiomic analysis in NSCLC predict histology and EGFR mutation status? Medicine 2019;98:e13963. Available from: https://doi.org/10.1097/md.0000000000013963.
23. Rizzo S, Petrella F, Buscarino V, et al. CT radiogenomic characterization of EGFR, K-RAS, and ALK mutations in non-small cell lung cancer. Eur Radiol 2016;26:32–42.
24. Ullah HMT, Tarek Ullah HM, Onik Z, Islam R, Nandi D. Alzheimer’s disease and dementia detection from 3D brain MRI data using deep convolutional neural networks. In: 2018 third international conference for convergence in technology (I2CT). 2018. Available from: https://doi.org/10.1109/i2ct.2018.8529808.
25. Islam J, Zhang Y. Brain MRI analysis for Alzheimer’s disease diagnosis using an ensemble system of deep convolutional neural networks. Brain Inf 2018;5:2.
26. Wang Y, Tu D, Du J, et al. Classification of subcortical vascular cognitive impairment using single MRI sequence and deep learning convolutional neural networks. Front Neurosci 2019;13:627.
27. Diamant A, Chatterjee A, Vallières M, Shenouda G, Seuntjens J. Deep learning in head & neck cancer outcome prediction. Sci Rep 2019;9:2764.
28. Hawkins SH, Korecki JN, Balagurunathan Y, et al. Predicting outcomes of nonsmall cell lung cancer using CT image features. IEEE Access 2014;2:1418–26.
29. Kline TL, Korfiatis P, Edwards ME, et al. Image texture features predict renal function decline in patients with autosomal dominant polycystic kidney disease. Kidney Int 2017. Available from: https://doi.org/10.1016/j.kint.2017.03.026.
30. Kang E, Chang W, Yoo J, Ye JC. Deep convolutional framelet denoising for low-dose CT via wavelet residual network. IEEE Trans Med Imaging 2018;37:1358–69.
31. Wu D, Kim K, El Fakhri G, Li Q. A cascaded convolutional neural network for X-ray low-dose CT image denoising. arXiv [cs.CV]. 2017. <http://arxiv.org/abs/1705.04267>.
32. Yang Y, Sun J, Li H, Xu Z. Deep ADMM-Net for compressive sensing MRI. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R, editors. Advances in neural information processing systems 29. Curran Associates, Inc.; 2016. p. 10–18.
33. Schlemper J, Caballero J, Hajnal JV, Price A, Rueckert D. A deep cascade of convolutional neural networks for MR image reconstruction. arXiv [cs.CV]. 2017. <http://arxiv.org/abs/1703.00555>.
34. Yosinski J. <http://yosinski.com/deepvis> [accessed 29.12.19].
35. Philbrick KA, Yoshida K, Inoue D, et al. What does deep learning see? Insights from a classifier trained to predict contrast enhancement phase from CT images. AJR Am J Roentgenol 2018;211:1184–93.
36. Erickson BJ, Langer SG, Blezek DJ, Ryan WJ, French TL. DEWEY: the DICOM-enabled workflow engine system. J Digit Imaging 2014;27:309–13.


3 Deep learning for biomedical videos: perspective and recommendations

David Ouyang, Zhenqin Wu, Bryan He and James Zou

Abstract
Medical videos capture dynamic information of motion, velocity, and perturbation, which can assist in the diagnosis and understanding of disease. Common examples of medical videos include cardiac ultrasound to assess cardiac motion, endoscopies to screen for gastrointestinal cancers, natural videos to track human behaviors in population health, and microscopy to understand cellular interactions. Deep learning for medical video analysis is rapidly progressing and holds tremendous potential to extract actionable insights from these rich, complex data. Here we provide an overview of deep learning approaches to perform segmentation, object tracking, and motion analysis from medical videos. Using cardiac ultrasound and cellular microscopy as case studies, we highlight the unique challenges of working with videos compared to the more standard models used on still images. We further discuss available video datasets that may serve as good training sets and benchmarks. We conclude by discussing the future directions for this field with recommendations to practitioners.

Keywords: Video; deep learning; echocardiogram; microscopy; segmentation; motion analysis

3.1 Introduction

Artificial intelligence and machine learning have seen dramatic advances in the last 10 years. While the concepts of neural networks, convolution operations, and methods to train networks have been proposed over the last 30 years,1 it is only relatively recently that graphics processing units became widely available and that researchers recognized this hardware can efficiently perform the repetitive, parallel operations that speed up the training of these complex machine learning algorithms.2 With these advances in computation, deep neural networks, in which many layers of mathematical operations are used to learn complex relationships, have been used to tackle complex tasks including genomics,3 computer vision,4 natural language processing,5 and human strategy games.5,6 This field of “deep learning” has seen some of the most exciting accomplishments in artificial intelligence.


Many of the biggest advances and best examples of the high performance of deep learning have been in computer vision, the scientific field of designing computer systems to understand images and videos.2,5,6 While computer vision had been showing steady, incremental improvement over many years, the introduction of deep convolutional neural networks to the task of image classification produced drastic improvements and highlighted the potential of deep learning compared to previous state-of-the-art techniques using feature engineering.6,7 Paralleling the progress with still images, deep learning on video datasets has been shown to produce outstanding results by incorporating elements of neural network architectures originally tailored for both still-image computer vision and time series data.8–11
Inspired by the tremendous, near-human level of accuracy in classifying images, researchers attempted to apply similar deep learning algorithms to medical imaging tasks.12,13 Ranging from pictures of skin lesions in photographs and retina images from ophthalmologists to chest X-rays and mammograms, deep learning approaches have been adapted to medical still-image datasets.12–17 While these are complex datasets, an even richer set of data exists in medical videos, which capture motion and behaviors that still images cannot detect. In one particularly salient example, the cardiovascular system has many dynamic structures, with the motion of heart muscle, heart valves, and blood providing significant diagnostic information that still images do not capture. Deep learning for medical videos is much less advanced compared to deep learning for images, though it holds tremendous potential for impact.
In this chapter, we review key advances in computer vision and deep learning on video tasks and highlight applications in the medical machine learning literature. We discuss case studies ranging from cardiac ultrasounds (echocardiograms) to microscopy videos, which highlight approaches to understand dynamic systems and techniques to tackle complex datasets.

3.2 Video datasets

A key factor in the advances of computer vision has been the machine learning community’s adoption of standardized datasets and comparison benchmarks for the evaluation of machine learning algorithms (Fig. 3.1). These datasets, which often comprise the imaging data with human annotations, dictate the scope of tasks that are being answered. In the still-image dataset realm, ImageNet is a high-profile example of a large imaging dataset used to benchmark and study the relative performance of machine learning models.7 A large dataset of natural images obtained from Google Image Search with crowdsourced classifications, ImageNet was a standardized input dataset that could be used in competitions. The drastically improved performance of deep learning algorithms on ImageNet classification was one of the first hints of the potential of convolutional neural networks.2,7
Video data are frequently used in medicine and biological sciences. Natural human videos are used to study behaviors, record interviews, and track actions. Research in mental health and neuroscience often relies on the recording of human behavior in video format. In the hospital, the diagnosis of epilepsy incorporates patient videos as well as electrical recordings of brain activity. The diagnosis of many diseases requires the observation of behavior and motion (or limitations in motion), and many physical exam


FIGURE 3.1 Examples of publicly available datasets with representative frames (natural images, medical images, natural videos, and medical videos).

maneuvers performed by physicians seek to elicit differences in behavior with perturbation or stress. Unfortunately, many of these medical visual behaviors are not consistently recorded, which makes machine learning for these tasks more difficult. As with all tasks within machine learning for health, understanding the clinical scenario and available datasets informs the machine learning pipeline and possible training tasks.
Videos have additional temporal information compared to still images, and many tasks require this information for understanding and comprehension. While any individual frame of a video can give information on location and context, many behaviors and movements can only be comprehended with temporal information. For example, while a still image might be enough to identify a door, the video is required to understand whether the action consists of “a door closing” or “a door opening.” Biological behavior is very complex, often consisting of a similar actor performing different motions or actions that dictate the task at hand. Actions such as “patting a person’s head” versus “braiding hair” can appear visually similar in a still image, but the temporal information encoded in video data can readily distinguish between the different behaviors. Datasets such as Kinetics,18 HMDB,19 and UCF10119,20 have been designed for the purpose of investigating computer vision on human behavior videos.
There are also many forms of advanced medical imaging that capture motion for disease diagnosis. An example of understanding motion for medical diagnosis is the imaging of the heart for detecting cardiovascular disease. The heart is a tremendously dynamic organ, with motion in every heartbeat and often sizable variation even beat-to-beat. While the heart can be imaged through many modalities—including ultrasound (echocardiograms), computed tomography (CT), or magnetic resonance imaging (MRI)—modalities that have lower temporal resolution often require aggregation of information, taking advantage of the cyclical nature of the cardiac cycle and each heartbeat. Thus abnormalities in the heart muscle or aberrations in heart valve function can be readily detected in multiple imaging modalities, but all modalities of cardiac imaging incorporate the temporal information.


medical video research and evaluation of domain specific architectures. By having a standardized shared dataset, direct “apples-to-apples” comparison of different machine learning models can be performed on key medical questions. Both clinically relevant and taking advantage of the information specifically encoded in video, cardiac function is a challenging benchmark clinical task that will advance machine learning in healthcare. With all medical datasets, concerns regarding patient privacy, fairness, and generalizability need to be addressed. Unlike synthetic data or natural images, medical data can come from especially vulnerable populations, be biased toward certain demographics which can cause bias to be propagated in machine learning model trained on the dataset, and might not be generalized to the population as a whole.21 Often, special efforts need to be made prior to data release to verify the balance of the dataset, avoid bias in patient selection, and removal of identifying features or markers whenever possible.22 In the case of EchoNet-Dynamic, in addition to evaluating the demographic information of the patient population, each video was manually reviewed by a trained employee of the hospital system to highlight and exclude identifying information. With video datasets, many different important clinical tasks can be performed. In the next few sections, we will use examples from healthcare and biological research to showcase cases and models used for semantic segmentation, object tracking, and motion classification.

3.3 Semantic segmentation

Semantic segmentation refers to the task of labeling each individual pixel of an image or video with a corresponding class label in order to identify regions and structures (Fig. 3.2). In the natural world, humans perform this task instinctively, seamlessly identifying objects such as cars, people, or bicycles in order to interact with the environment. Prior to taking an action (e.g., getting in the car), one needs to recognize where the car is and how it is oriented. This task is crucial for understanding the local environment and is more difficult than traditional object-identification tasks, which often only ask whether an image contains a certain object, not where the object is in the image. The machine learning community has released datasets such as Microsoft Common Objects in Context (COCO)14,23 and the Cityscapes Dataset for Semantic Urban Scene Understanding24 for studying machine understanding of local environments and the pixel-wise relationship between regions and objects. Outside of health-care applications, the potential of self-driving cars and other disruptive technologies motivates research into understanding urban traffic environments through semantic segmentation of videos.

In medicine, characterizing organ systems through medical imaging often relies on similar approaches to understanding the voxel-wise relationship between medical imaging and disease characteristics. For example, in solid organ cancers such as prostate cancer and lung cancer, radiologists spend significant time understanding the size and distribution of tumors. The tumors are often distinct and readily recognizable; however, the clinical workflow involves much subjectivity and human variation in how to measure the dimensions and characteristics of the tumor.

FIGURE 3.2 Semantic segmentation task examples in natural video and medical video. (A) The Cityscape Dataset for Semantic Urban Scene Understanding identifies common physical structures and classes in an urban commuter environment. (B) The EchoNet-Dynamic Dataset identifies left ventricular size and shape to characterize cardiac function.

Such measurement tasks are crucial for distinguishing dormant from progressive disease; however, human variation can lead to underdiagnosis of subtle changes in tumor burden and neglect of small but meaningful changes in disease state.

Many prominent and well-studied neural network architectures have been designed with semantic segmentation tasks in mind. Fully convolutional networks (FCNs),25,26 U-Net,27 and DeepLab28,29 represent the gamut of architectures and designs evaluated for semantic segmentation tasks on computer vision datasets. Common to all of these architectures is a design that aggregates both distant and local information from other pixels, collapses pixel-wise information into a smaller vector or array that more closely represents the labeled task, and then reexpands that representation into an array of the original size that carries the pixel-level labels. An example of an application-driven architecture that has subsequently been extended to other biomedical as well as nonmedical tasks is U-Net, which was originally inspired by the task of segmenting electron microscopy images to annotate cells and small local structures. This architecture extends the advances made with FCNs by aggregating input image information into smaller and smaller layers that encode higher order information (often described as the "encoding arm"), followed by gradual upscaling of layer sizes to produce an output of the same shape and size as the original input image (the "decoding arm"). Given its efficiency and high performance, U-Net has been applied to many tasks outside of biomedical imaging and has been recognized as an advance in computer vision even outside of medicine.30

Many semantic segmentation models rely heavily on individual frame-level information while discounting the additional information in temporally adjacent frames. Nonmedical datasets such as Cityscapes are often constructed with sparsely labeled frames selected to maximize differences between sampled frames, rather than providing labels for multiple consecutive frames to augment model training.
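To make the encoding/decoding-arm idea concrete, the following is a minimal, illustrative sketch of a U-Net-style encoder-decoder in PyTorch; the depth, channel counts, and single skip connection are simplifications chosen for brevity and are not those of the original U-Net or of any published medical model.

# Minimal encoder-decoder segmentation network in the spirit of U-Net.
# Channel counts, depth, and the single skip connection are illustrative only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in typical U-Net stages.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 16)            # encoding arm
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = conv_block(32, 16)                # decoding arm (after skip concat)
        self.head = nn.Conv2d(16, n_classes, kernel_size=1)  # pixel-wise class scores

    def forward(self, x):
        e1 = self.enc1(x)                 # full-resolution features
        e2 = self.enc2(self.pool(e1))     # downsampled, higher-order features
        d1 = self.up(e2)                  # upscale back to input resolution
        d1 = self.dec1(torch.cat([d1, e1], dim=1))  # skip connection from encoding arm
        return self.head(d1)              # per-pixel logits, same H x W as input

# Example: one grayscale 128x128 frame -> per-pixel logits for 2 classes.
logits = TinyUNet()(torch.randn(1, 1, 128, 128))    # shape (1, 2, 128, 128)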


In the example of echocardiogram videos, researchers showed that training on a small subset of frames generalized to the entirety of the video, including previously unannotated frames.14 In this example, opportunities for augmenting the dataset arise from knowing the constraints of the particular data. For echocardiograms, sonographers typically trace the left ventricle at its largest and its smallest, so the shape and size of the ventricle are smoothly constrained between those two examples even in unlabeled frames. This allows "fuzzy" model training by using the same training labels on adjacent frames of the video, as well as penalizing the model when there is a drastic change in prediction from frame to frame. In this way, techniques from still images can be generalized to video, and from a nonmedical domain to a medical domain.
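As an illustration of the frame-to-frame consistency idea described above, the sketch below combines a supervised loss on sparsely labeled frames with a penalty on abrupt changes between adjacent predictions. The model, the weighting term lam, and the data layout are illustrative assumptions, not the loss used in the cited study.

# Sketch of a training loss that (1) supervises only the sparsely labeled frames and
# (2) penalizes drastic changes in predictions between adjacent frames.
# `model`, the weighting `lam`, and the data layout are illustrative assumptions.
import torch
import torch.nn.functional as F

def sparse_label_loss(model, clip, labels, labeled_idx, lam=0.1):
    # clip: (T, C, H, W) consecutive frames; labels: (K, H, W) integer masks (torch.long)
    # for the K labeled frames indexed by labeled_idx.
    preds = model(clip)                                 # (T, n_classes, H, W) per-frame logits
    sup = F.cross_entropy(preds[labeled_idx], labels)   # supervised term on labeled frames only
    # Temporal smoothness: discourage large jumps between consecutive frame predictions.
    probs = preds.softmax(dim=1)
    smooth = (probs[1:] - probs[:-1]).abs().mean()
    return sup + lam * smooth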

3.4 Object detection and tracking

A common task following image segmentation is the detection, classification, and tracking of objects. Understanding the mechanisms of basic biological processes requires understanding the prevalence, location, and trajectory of cells and subcellular objects in molecular and cellular imaging. For example, our understanding of chemotaxis in immunology comes from studying the gradual movement of neutrophils and from experimental modifications that slow or impede cellular movement. Given the diversity and high numbers of cellular actors in complex biological processes, automated approaches that detect, classify, and track important objects in cellular videos are crucial for advancing our understanding of fundamental biological processes. Deep learning has revolutionized how we computationally detect and track objects and has enabled leaps in how we interpret cellular videos.

In addition to differentiating between foreground and background, computer vision tasks require an understanding of the objects represented in the image or video. In cellular microscopy, similar cell types and populations can present many morphologies, shapes, and sizes. Machine learning models must capture the intrinsic characteristics that define different cellular populations in order to detect and track cellular movement. Understanding the various projections an individual object can present from different views and perspectives is important for appropriately detecting and identifying objects in visual data. This is often made even more computationally difficult in biomedical imaging, in which a single field of view can contain many, even hundreds, of instances of the objects of interest. Appropriate object registration and detection is needed to analyze population-level motion and trajectories.

In recent developments from the deep learning community, the task of object detection and instance segmentation is solved through frameworks that involve multiple prediction stages and heads responsible for generating proposals and for identifying bounding boxes and labels of objects. Some well-known models include R-CNN,31 Mask R-CNN,32 and YOLO.33 These models are typically trained in a supervised learning framework with matched images and labels in the form of annotated bounding boxes. In biomedical data, especially cell imaging, use of these advanced neural network architectures is typically limited by the lack of human annotations, heterogeneity in human labels, and the large number of relatively homogeneous objects in images that makes human annotation tedious.


Alternative solutions have been proposed that generalize from the pixel map output of semantic segmentation and perform heuristic-based instance separation. Each segmented region can then be classified into the relevant cell type, although an important challenge in cellular imaging is segmenting large clusters of often overlapping objects. When cells are distinct from each other, the semantic segmentations can be sufficient. However, when there are overlapping cell pairs or groups of closely contacting cells, even a small segmentation error can lead to classification errors and fused cell instances. A large variety of segmentation methods have been proposed to address the challenge of separating close, overlapping objects in biomedical imaging. Some existing solutions rely on assumptions about cell shape, employing strategies such as the Laplacian of Gaussian,34 radial symmetry transformation,35 and distances between objects to distinguish individual instances. Other researchers have tried feature-based or threshold-based methods to distinguish between different objects.36 The wide range of proposed techniques highlights the challenge of cellular object detection, and each method has shown impressive results for specific imaging modalities and domain questions (Fig. 3.3).

FIGURE 3.3 Object detection and instance segmentation task examples. (A) Mask R-CNN generates bounding boxes and masks for objects in natural images. (B) Cell detection in microscopy images.


Video data provides additional temporal information that can inform object detection and allow objects to be tracked even through challenging frames that would be difficult to process in isolation. The additional information can also make object tracking more difficult—particularly when there are multiple objects in the same frame and each object must be paired with the same object in a preceding or subsequent frame to accurately generate trajectories. This matching task is usually solved as a linear assignment problem37: matching a set of objects in one frame of a video to the same objects in a subsequent frame. In this setting, a cost matrix is specified based on how likely a pair of detections in two image frames is to come from the same cell. The matrix is usually defined from factors including the distance between locations, similarity of appearance, and the surrounding environment. Though in most cases these components are empirically selected and weighted in the final cost matrix, recent work has applied neural network-based methods to refine the cost matrix composition.38,39 Further refinement can be performed after generating an initial set of trajectories from the frame-to-frame matchings.37 This step can mitigate segmentation error and account for events such as cell merging and splitting.

In addition, deep learning can help researchers investigate the complex relationship between visual phenotype and genetics. Convolutional neural networks28 provide a powerful and unbiased tool to organize and quantify the complex morphological characteristics of cells from imaging data. Morphological and morphodynamic states are often highly correlated with gene expression, and one can readily imagine a convolutional network-based featurizer extracting the relationship between visual phenotype and the gene expression and functional states of cells. Cellular videos enable a wide range of studies on time-dependent behavior in biological systems that are not possible with traditional microscopy. There is increasing recognition that cellular systems are incredibly dynamic40 throughout the cell cycle41 and that further information can be obtained from dynamic analysis of cell morphology.42 In live-cell imaging, tracking the trajectories of cells opens opportunities for detailed analysis of the dynamic state and temporal change of individual cells during development and in immunological processes.
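A minimal sketch of the frame-to-frame matching step is given below, using SciPy's linear-assignment solver with a cost matrix built only from centroid distances; real trackers typically also fold appearance and local-context terms into the cost and handle appearing or disappearing cells explicitly. The function name and distance threshold are hypothetical.

# Frame-to-frame cell matching posed as a linear assignment problem.
# Here the cost is simply centroid distance; practical trackers also include
# appearance similarity and local context, and gate implausible matches.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_cells(centroids_t, centroids_t1, max_dist=25.0):
    # centroids_t, centroids_t1: (N, 2) and (M, 2) arrays of (x, y) positions.
    cost = np.linalg.norm(centroids_t[:, None, :] - centroids_t1[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)        # minimizes total matching cost
    # Discard assignments that moved implausibly far (likely appearance/disappearance).
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

# Example: three cells tracked into the next frame.
prev_frame = np.array([[10.0, 10.0], [40.0, 12.0], [70.0, 30.0]])
next_frame = np.array([[12.0, 11.0], [43.0, 14.0], [69.0, 33.0]])
print(match_cells(prev_frame, next_frame))          # [(0, 0), (1, 1), (2, 2)]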

3.5 Motion classification

In both natural video and medical imaging machine learning, there are tasks that require understanding motion and the interplay of structures. Image-based classifiers traditionally have a tough time distinguishing between opening and closing a door, or between repetitive motions such as brushing and braiding hair. In medicine, a large range of tasks requires an understanding of biological movement. In natural videos, subtle physical motions and variation can identify important physiological states and medical diagnoses.43–45 In healthcare, video medical imaging is obtained for the diagnosis of cardiovascular disease in particular because the heart is a dynamic structure and abnormalities in the heart muscle and valves are most clearly reflected in abnormal heart motion. In this section, we present examples of machine learning applied to assessing cardiac function to exemplify the opportunities for video-based deep learning architectures.14


The "convolution" part of convolutional neural networks refers to the collections of mathematical functions that aggregate and pass information into subsequent layers of the neural network.2,7,14 For image-based computer vision tasks, convolution aggregates local pixel information and benefits from embedding an understanding of local structure and geometric shape.29 In still-image tasks, all visual information is mapped to a two-dimensional (2D) data structure, and the relevant functions are 2D convolutions that traverse the image to produce an output often of similar size and structure. While there have been sizable advances in neural network architecture design,9,28,29,46 the fundamental mathematical operation for processing still images remains the 2D convolution.

Videos contain temporal information; this richer data can be represented in many different ways, and the neural network architectures used to understand it are correspondingly more complex.8,10,18 Video data can be preprocessed with feature-tracing algorithms such as optical flow and dense optical flow to consolidate information and label regions with motion and activity prior to classification.47 Other researchers have treated the temporal information as another dimension of the data, such that a three-dimensional (3D) data structure (x, y, z) represents the input video, with the x and y axes representing the spatial information of each frame and the z axis representing temporal information across frames.10,47 In this approach, 3D convolutional kernels consolidate information from all three dimensions, and both spatial and temporal information is integrated into the model predictions. To minimize computational cost, various approaches that treat temporal and spatial information independently have also been attempted with good efficacy.10

In cardiac physiology, the vigor and speed of contraction of the heart chambers, especially the left ventricle, is quantified to understand cardiac function. Human physicians use video information from cardiovascular ultrasound (echocardiograms), CT, or MRI to examine the heart. Severe impairment of heart muscle movement captured on these medical imaging videos is considered "heart failure" and is a leading cause of hospitalization in the United States.48 Researchers have applied video-based deep learning models to predict heart function with high accuracy and precision.10,14 Human interpretations of heart motion can be subjective and vary among experts,49,50 and appropriately applied deep learning models can improve reliability in medical imaging and physician and patient trust in diagnostic testing.
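The following sketch illustrates, in PyTorch, the video-as-volume representation described above: a full 3D convolution mixes spatial and temporal information at once, while a factorized spatial-then-temporal pair treats the two dimensions more independently at lower cost. Channel counts and kernel sizes are illustrative only.

# Sketch: a video clip as a 5D tensor (batch, channels, time, height, width)
# processed with 3D convolutions that mix spatial and temporal information.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)    # 16 RGB frames of a 112x112 video

# A full 3D convolution aggregates a 3x3x3 spatiotemporal neighborhood at once.
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

# A factorized alternative: spatial 1x3x3 filtering followed by temporal 3x1x1
# mixing, which approximates the 3D kernel at lower computational cost.
conv_factorized = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # per-frame spatial filtering
    nn.ReLU(inplace=True),
    nn.Conv3d(8, 8, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # mixing across frames
)

print(conv3d(clip).shape)           # torch.Size([1, 8, 16, 112, 112])
print(conv_factorized(clip).shape)  # torch.Size([1, 8, 16, 112, 112])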

3.6 Future directions and conclusion

There have been significant advances in machine learning applied to medical imaging and medical videos. From applying conventional computer vision algorithms to standard medical imaging to proposing novel deep architectures inspired by biomedical image segmentation tasks, medical imaging applications have inspired basic machine learning research, and machine learning is on the verge of revolutionizing how medical imaging is interpreted. Many further prospective studies need to be performed, and the interplay between human elements and machine learning models needs to be understood, before deep learning can truly be applied in a clinical setting, but the future is full of opportunities for machine learning in healthcare.


The various ways humans interact with natural image and video data correspond directly to opportunities to standardize, improve, and expand access to medical imaging through machine learning. While not always explicitly framed in these categories, image segmentation, object detection, object tracking, and motion analysis are used daily by radiologists, cardiologists, and pathologists to understand and diagnose disease. Combining these discrete tasks, cardiologists identify the different chambers of the heart and track heart motion to assess cardiac function. Video-based features, such as desynchrony or impairment of motion, define cardiovascular diseases such as electrical conduction abnormalities and cardiomyopathy.

Given the gravity of medical decision-making, the first applications of video artificial intelligence (AI) models will likely replace tedious, unglamorous intermediate tasks rather than provide end-to-end predictions of a final medical diagnosis. In the example of echocardiograms, the first application of video AI models might be using AI to label the left ventricle rather than directly replacing the human input and producing a final diagnosis. In many ways this would already improve upon the current clinical workflow by using multiple cardiac beats to inform the human diagnosis of cardiac function, while also paralleling the current workflow of initial tracing by sonographers or trainees before sign-off by a physician. However, additional work is needed to study what happens when AI models become good enough but still need an observant overseer to maintain quality. In the example of self-driving cars, one can foresee a future where AI systems handle the vast majority of mundane situations but still need human intervention in difficult environments—how does such a system keep a human engaged and situationally aware while maintaining human trust in the AI system? Further work is needed to understand and assess the relationship between humans and machine learning models when applying video AI.

Important work is currently being done to create open, standardized, and shared datasets for studying and benchmarking machine learning on medical imaging. Implicit in this challenge is the desire to understand cross-institution differences in image acquisition and interpretation while also maintaining patient privacy and understanding implicit biases in the health-care system. Fairness in machine learning has become an increasingly important issue as we recognize that both implicit and explicit biases in training labels and in the sampling of examples can significantly influence machine learning model behaviors. In medicine, biases in how physicians diagnose patients from different socioeconomic backgrounds and biases in the availability of health-care tests and resources can be propagated into machine learning models if careful examination and reflection are not undertaken.

References

1. LeCun Y, et al. Backpropagation applied to handwritten zip code recognition. Neural Comput 1989;1:541–51.
2. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in neural information processing systems 25. Curran Associates, Inc.; 2012. p. 1097–105.
3. Zou J, et al. A primer on deep learning in genomics. Nat Genet 2019;51:12–18.
4. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE 1998;86:2278–324.


5. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv [cs.CL]; 2018.
6. Silver D, et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016;529:484–9.
7. Russakovsky O, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115:211–52.
8. Tran A, Cheong L-F. Two-stream flow-guided convolutional attention networks for action recognition. 2017 IEEE international conference on computer vision workshops (ICCVW); 2017. Available from: https://doi.org/10.1109/iccvw.2017.368.
9. Song L, Weng L, Wang L, Min X, Pan C. Two-stream designed 2D/3D residual networks with LSTMs for action recognition in videos. 2018 25th IEEE international conference on image processing (ICIP); 2018. Available from: https://doi.org/10.1109/icip.2018.8451662.
10. Tran D, et al. A closer look at spatiotemporal convolutions for action recognition. 2018 IEEE/CVF conference on computer vision and pattern recognition; 2018. Available from: https://doi.org/10.1109/cvpr.2018.00675.
11. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision 2015;4489–97.
12. Esteva A, et al. Corrigendum: dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;546:686.
13. McKinney SM, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577:89–94.
14. Ouyang D, He B, Ghorbani A, et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 2020;580:252–256. Available from: https://doi.org/10.1038/s41586-020-2145-8.
15. Bello GA, et al. Deep learning cardiac motion analysis for human survival prediction. arXiv [cs.LG]; 2018.
16. Poplin R, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng 2018;2:158–64.
17. Coudray N, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med 2018;24:1559–67.
18. Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition 2017;6299–308.
19. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T. HMDB: a large video database for human motion recognition. In: 2011 international conference on computer vision 2011;2556–63.
20. Soomro K, Zamir AR, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv [cs.CV]; 2012.
21. Zou J, Schiebinger L. AI can be sexist and racist—it's time to make it fair. Nature 2018;559:324–6.
22. Kim MP, Ghorbani A, Zou J. Multiaccuracy. In: Proceedings of the 2019 AAAI/ACM conference on AI, ethics, and society (AIES '19); 2019. Available from: https://doi.org/10.1145/3306618.3314287.
23. Lin T-Y, et al. Microsoft COCO: common objects in context. arXiv [cs.CV]; 2014.
24. Cordts M, et al. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition 2016.
25. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. 2015 IEEE conference on computer vision and pattern recognition (CVPR); 2015. Available from: https://doi.org/10.1109/cvpr.2015.7298965.
26. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv [cs.CV]; 2014.
27. Dong H, Yang G, Liu F, Mo Y, Guo Y. Automatic brain tumor detection and segmentation using U-Net based fully convolutional networks. Medical image understanding and analysis. Springer International Publishing; 2017. p. 506–17.
28. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. 2016 IEEE conference on computer vision and pattern recognition (CVPR); 2016. Available from: https://doi.org/10.1109/cvpr.2016.90.
29. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 2018;40:834–48.
30. Billaut V, de Rochemonteix M, Thibault M. ColorUNet: a convolutional classification approach to colorization. arXiv [cs.CV]; 2018.
31. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE conference on computer vision and pattern recognition; 2014. Available from: https://doi.org/10.1109/cvpr.2014.81.


32. He K, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. 2017 IEEE international conference on computer vision (ICCV); 2017. Available from: https://doi.org/10.1109/iccv.2017.322.
33. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. 2016 IEEE conference on computer vision and pattern recognition (CVPR); 2016. Available from: https://doi.org/10.1109/cvpr.2016.91.
34. Xu H, Lu C, Berendt R, Jha N, Mandal M. Automatic nuclei detection based on generalized Laplacian of Gaussian filters. IEEE J Biomed Health Inf 2017;21:826–37.
35. Loy G, Zelinsky A. Fast radial symmetry for detecting points of interest. IEEE Trans Pattern Anal Mach Intell 2003;25:959–73.
36. Vicar T, et al. Cell segmentation methods for label-free contrast microscopy: review and comprehensive comparison. BMC Bioinform 2019;20:360.
37. Jaqaman K, et al. Robust single-particle tracking in live-cell time-lapse sequences. Nat Methods 2008;5:695–702.
38. Sadeghian A, Alahi A, Savarese S. Tracking the untrackable: learning to track multiple cues with long-term dependencies. 2017 IEEE international conference on computer vision (ICCV); 2017. Available from: https://doi.org/10.1109/iccv.2017.41.
39. Moen E, et al. Accurate cell tracking and lineage construction in live-cell imaging experiments with deep learning. Available from: https://doi.org/10.1101/803205.
40. Kimmel JC, Chang AY, Brack AS, Marshall WF. Inferring cell state by quantitative motility analysis reveals a dynamic state system and broken detailed balance. PLoS Comput Biol 2018;14:e1005927.
41. Neumann B, et al. Phenotypic profiling of the human genome by time-lapse microscopy reveals cell division genes. Nature 2010;464:721–7.
42. Pincus Z, Theriot JA. Comparison of quantitative methods for cell-shape analysis. J Microsc 2007;227:140–56.
43. Yan BP, Lai WHS, Chan CKY, et al. High-throughput, contact-free detection of atrial fibrillation from video with deep learning. JAMA Cardiol 2020;5(1):105–107. Available from: https://doi.org/10.1001/jamacardio.2019.4004.
44. Wu H-Y, et al. Eulerian video magnification for revealing subtle changes in the world. ACM Trans Graph 2012;31:1–8.
45. Elgharib M, Hefeeda M, Durand F, Freeman WT. Video magnification in presence of large motions. In: Proceedings of the IEEE conference on computer vision and pattern recognition 2015;4119–27.
46. Chen L-C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv [cs.CV]; 2017.
47. Barron JL, Fleet DJ, Beauchemin SS, Burkitt TA. Performance of optical flow techniques. In: Proceedings 1992 IEEE computer society conference on computer vision and pattern recognition; 1992. Available from: https://doi.org/10.1109/cvpr.1992.223269.
48. Loehr LR, Rosamond WD, Chang PP, Folsom AR, Chambless LE. Heart failure incidence and survival (from the atherosclerosis risk in communities study). Am J Cardiol 2008;101:1016–22.
49. Pellikka PA, et al. Variability in ejection fraction measured by echocardiography, gated single-photon emission computed tomography, and cardiac magnetic resonance in patients with coronary artery disease and left ventricular dysfunction. JAMA Netw Open 2018;1:e181456.
50. Farsalinos KE, et al. Head-to-head comparison of global longitudinal strain measurements among nine different vendors: the EACVI/ASE inter-vendor comparison study. J Am Soc Echocardiogr 2015;28:1171–81.e2.


4. Biomedical imaging and analysis through deep learning

Karen Drukker, Pingkun Yan, Adam Sibley and Ge Wang

Abstract
The material presented here will expand upon the deep learning techniques introduced in the prior chapter to address imaging-related issues. Recently, the use of deep learning has gained tremendous popularity within the realm of medical imaging research and development. This chapter will give a general overview of artificial intelligence applications with an emphasis on the areas of tomographic image reconstruction, image segmentation, image registration, and radiomics. Given the large scope of this chapter, it mainly describes key applications at a conceptual level but not at a technical level, with many details left to the references.

Keywords: Deep learning; tomographic reconstruction; image registration; image segmentation; radiomics (detection; diagnosis; prognosis; treatment response; risk assessment)

4.1 Introduction

Artificial intelligence (AI) and machine learning applications, especially deep learning, have become popular in everyday life and can be found almost anywhere, from "smart" home devices and mobile phone apps to self-driving cars. While cars do not normally drive themselves quite yet, the emergence of AI techniques in so many aspects of daily life seems inevitable, promising the eventual routine use of deep learning in medical imaging applications. There is a lot at stake in the medical imaging field, however, since a wrong suggestion by an AI application could have serious consequences, for example, a cancer missed on a screening mammogram, delayed treatment, and potentially the loss of a life. Of course, this potential problem is associated with any imaging technique. For example, "conventional" (non-deep-learning) computer-aided detection for screening mammography is not perfect either, but it has been in routine clinical use for decades using hand-crafted computer-extracted image features rather than features extracted by a deep neural network, and it has certainly helped pave the way for better use of computer "tools" in clinical practice. The content of this chapter is limited to technical review and discussion; the regulatory and ethical issues of AI in medical imaging are certainly worthy of a separate discussion.


The key difference between deep learning AI and traditional machine learning techniques is that the former can automatically learn meaningful representations of the data, thereby eliminating the need for approximate imaging models and hand-crafted features. Drawbacks include the large amount of data required to train an AI system, a generally much higher computational burden, and reduced interpretability (more about that later). With the use of graphical processing units (GPUs), originally developed to speed up computer video games, computational times have been drastically reduced. The application of AI to medical images, rather than the "natural" images generally used in deep learning competitions, brings special challenges, however. Medical images are often of very high resolution; may be 3D, 4D, or of even higher dimension (space over time, spectral, and biological contrasts); often require extensive annotation by an expert to establish the "ground truth"; and may not be readily available in the numbers required for AI training. The task can be very difficult compared to, for example, the identification of dogs and cats in natural images. This chapter will describe both accomplishments and remaining challenges, but with 1401 peer-reviewed papers (including 95 review papers) involving deep learning in imaging indexed in PubMed in the first 8 months of 2019 alone, it is not meant to be exhaustive. Important medical imaging research areas for the application of deep learning include radiology, oncology, neurology, musculoskeletal imaging, digital pathology, cardiology, ophthalmology, gastroenterology, and dentistry. Several recent review articles on deep learning in medical imaging1–6 are recommended as additional reading materials that provide complementary views of state-of-the-art deep learning methods in medical imaging and image analysis. One should also keep in mind that the vast majority of deep learning applications discussed in this chapter are not yet approved by the US Food and Drug Administration (FDA) for clinical use. Hence, a word of caution: while many promising results have been reported, there is often a lack of testing on separate "outside" datasets, which may hamper generalizability and eventual translation to the clinic.

4.2 Tomographic image reconstruction

4.2.1 Foundation

It should be underlined that deep learning can help not only image analysis but also tomographic image formation, or image reconstruction. Tomographic reconstruction represents an important class of inverse problems in which externally measured data are linked to internal structures in a complicated way and processed to reconstruct internal features in cross sections or volumetrically. The first perspective on deep tomographic reconstruction was given in Ref. [7], in which deep learning was proposed as a new class of tomographic image reconstruction algorithms, as shown in Fig. 4.1. In that article, it is underlined that "The real power of the deep learning based reconstruction lies in the data-driven knowledge-enhancing abilities so as to promise a smarter initial guess, more relevant intermediate features, and an optimally regularized final image within an application-specific low-dimensional manifold."


FIGURE 4.1 Past, present, and future of tomographic image reconstruction, from analytic and iterative reconstruction algorithms to learning-based algorithms.7

FIGURE 4.2 Low-hanging fruits by “knocking-out/down/in” computational elements in a traditional iterative reconstruction flowchart to generate many deep-learning-aided tomographic reconstruction algorithms.7

More specifically, low-hanging fruits in the tomographic reconstruction area were suggested in Ref. [7], as shown in Fig. 4.2. Over the past several years, developments in the emerging field of learning-based tomographic reconstruction have generally been consistent with this blueprint (Fig. 4.2).7 Two good examples are the June 2018 special issue of the IEEE Transactions on Medical Imaging on the theme of machine learning for image reconstruction8 and a recent review article (https://arxiv.org/abs/1904.02816) that covers over 200 relevant papers.


FIGURE 4.3 Superiority principle for deep tomographic reconstruction: superior image quality is in principle guaranteed because whenever necessary, a deep learning method can be synergistically combined with analytic and iterative algorithms in order to outperform them.

Despite ongoing debates on whether deep learning as a "black box" should be the approach of choice for tomographic reconstruction, it is now clear that the deep learning approach has become increasingly popular. It is our belief that deep learning will dominate and outperform classic analytic and iterative algorithms in challenging cases where data are imperfect and image geometries are limited, because deep reconstruction can incorporate essential components of existing methods and empower them with extensive knowledge extracted from big data, which we call the superiority principle for deep tomographic imaging, as shown in Fig. 4.3.

4.2.2 Computed tomography

Three initial deep computed tomography (CT) reconstruction results were reported in Ref. [7] to demonstrate the feasibility and potential of the approach. In the first example, blurry CT images were converted to clearer ones using a deep network, quite similar to superresolution imaging networks used in the computer vision community. In the second example, a sinogram with missing data was inpainted using a deep network for CT metal artifact reduction. In retrospect, this network-based inpainting or interpolation is somewhat related to contrastive predictive coding (https://arxiv.org/abs/1905.09272), which has attracted major interest recently. In the third example, deep learning was for the first time applied to low-dose CT imaging, showing competitive performance. Low-dose CT denoising has been one of the hottest topics in the deep CT-imaging area. A series of excellent papers were published demonstrating gradually improved image quality, such as in the June 2018 special issue of the IEEE Transactions on Medical Imaging on the theme of machine learning for image reconstruction.8 Very recently, in a Nature Machine Intelligence paper (https://www.nature.com/articles/s42256-019-0057-9), commercial iterative reconstruction algorithms used on current clinical CT scanners made by three leading vendors were systematically compared in a double-blind fashion to the results produced by a modularized deep neural network taking direct filtered backprojection images as the input. This work was performed in collaboration among engineers and scientists at Rensselaer Polytechnic Institute and radiologists at Massachusetts General Hospital and Harvard Medical School.


FIGURE 4.4 High-level comparative study on deep learning versus iterative reconstruction. The top illustration shows the overall experimental design, and the bottom bar charts summarize the key result that deep learning is comparable or superior to iterative reconstruction in a majority of cases.

The comparative results indicate that deep learning performs as effectively as, or better than, the current iterative techniques in an overwhelming majority of cases. Furthermore, the deep learning method is also computationally much more efficient. The key message is illustrated in Fig. 4.4. It is underlined that these positive results were obtained without access to the original, or raw, data collected with the CT scanners. If the original CT datasets are made available, a more specialized deep learning algorithm should perform even better than what we have shown in Ref. [8]. Indeed, the direct mapping from raw data to tomographic images is feasible. As already argued in Ref. [7], "Either the filtered backprojection (FBP) or simultaneous algebraic reconstruction technique (SART) can be easily formulated in the form of parallel layered structures.9 Then, the straightforward path to deep imaging could be simply from raw data to an initial image through a neural network modeled after a traditional reconstruction scheme, and then from a reconstructed image to a processed image through a refinement deep network (an overlap with deep-learning-based image processing)."7
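As a toy illustration of the image-domain (postprocessing) route described above, the sketch below shows a small residual CNN that takes an FBP reconstruction as input and predicts the noise/artifact component to subtract. It is an assumption-laden teaching example, not any of the published networks cited in this section.

# Toy image-domain denoiser for low-dose CT: the network takes a filtered
# backprojection (FBP) reconstruction as input and predicts the residual
# artifact/noise, which is subtracted from the input. This mirrors the
# postprocessing mode of deep reconstruction; it is not a published network.
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    def __init__(self, channels=32, depth=5):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]   # predicted noise/artifact map
        self.body = nn.Sequential(*layers)

    def forward(self, fbp_image):
        return fbp_image - self.body(fbp_image)   # residual learning: clean = input - noise

# Training would pair low-dose FBP images with normal-dose references, e.g.:
# loss = nn.functional.mse_loss(ResidualDenoiser()(low_dose_fbp), normal_dose)
denoised = ResidualDenoiser()(torch.randn(1, 1, 512, 512))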


FIGURE 4.5 Representative reconstructions applying two direct mapping methods to 49-view sinograms from our Massachusetts General Hospital dataset. (Left) The ground truth, (middle) reconstructed with the LEARN, and (right) the counterpart reconstructed using the DNA network. The display window is [−300, 300] HU. LEARN, Learned Experts' Assessment-based Reconstruction Network.

There are already several published network-based direct mappings for tomographic image reconstruction. To our best knowledge, the first direct reconstruction deep network is LEARN (Learned Experts' Assessment-based Reconstruction Network), formed by unfolding the "fields of experts"-based reconstruction scheme for sparse-data CT.10 Also, a network called iCT-Net was recently designed for CT reconstruction when data are compromised in various ways.11 In particular, sparse-view and interior tomographic problems are solved using the iCT-Net with competitive results.11 Furthermore, a dual network architecture (DNA) was proposed for efficient few-view CT image reconstruction (https://arxiv.org/abs/1907.01262), with good pilot results shown in Fig. 4.5. An interesting feature is that the DNA network was pretrained with ImageNet data so that what the network learns is the intrinsic inverse transform, which helps avoid overfitting.

4.2.3 Magnetic resonance imaging

Similar to what we have described for CT image reconstruction, magnetic resonance imaging (MRI) reconstruction can be achieved through deep learning in at least two modes: direct mapping and postprocessing. Zhu et al. used fully connected layers to perform manifold learning so that k-space data can be directly mapped to a tomographic image, which was published as a Nature paper.12 Although fully connected layers are the most powerful, they are computationally demanding and structurally redundant when convolutional layers can be applied. Recently, much progress has been seen in the deep MRI reconstruction field. For example, a number of deep learning techniques were adapted for accelerated MRI reconstruction coupled with multicoil measurements (https://arxiv.org/abs/1904.01112). While most current deep-learning-based approaches are applied to MRI reconstruction from k-space samples to final images, in a recent proposal (https://arxiv.org/abs/1805.12006) data acquisition and image reconstruction are considered together to optimize both the pulse sequence and the reconstruction scheme seamlessly in the machine learning framework.


This approach is referred to as SPIN, for a synergized pulsing and imaging network, though it is computationally rather demanding.

4.2.4 Other imaging modalities

While deep reconstruction research has been most active for CT and MRI, neural networks are also being applied to other imaging modalities such as positron emission tomography (PET), single-photon emission CT, ultrasound, and optical imaging. For example, an autoencoder was designed for dynamic PET.13 Also, a deep-learning-based prior was applied for PET imaging,14 and a deep network was adapted for PET attenuation correction.15 In yet another study, a deep residual convolutional neural network deblurred PET images acquired with large pixelated crystals to resemble the counterparts acquired with thin pixelated crystals.16

As far as ultrasound imaging is concerned, scattering and speckling from a heterogeneous background has been troublesome. Early convolutional neural networks (CNNs) for ultrasound imaging were reported in Refs. [17,18]. Given the cost-effective nature of ultrasound imaging, it is attractive for routine application in settings from family care to space missions. To realize its full potential, inter- and intraoperator variability and strong image artifacts must be addressed, and deep learning offers great opportunities here. More recent advances in deep learning for ultrasound imaging are reviewed in Ref. [19], along with research directions. A very interesting example is a portable ultrasound transducer coupled with an iPhone (https://www.fastcompany.com/90288626/this-2000-ultrasound-scanner-makes-medical-imaging-affordable-and-portable). Actually, an X-ray system can also be made portable (https://content.iospress.com/articles/journal-of-xray-science-and-technology/xst00453). Moreover, we believe that a hybrid system can be constructed from ultrasound and X-ray imagers (Fig. 4.6). Along the same lines, a robotic arm could be added as well.

Like ultrasound imaging, optical coherence tomography (OCT) forms images based on wave propagation and reflection. Hence, antiscattering and despeckling are also important for OCT. To improve OCT images, a conditional generative adversarial network (GAN) was designed for improvement of retinal image quality.20 In another study, model-based deblurring was enhanced with a CNN to achieve super-resolution of retinal OCT images.21

The above-mentioned studies are mostly early examples. Since 2016, deep reconstruction research results have been published at an exponential rate. Convergence is expected in this field in terms of multimodality imaging in the deep learning framework, and also the end-to-end workflow from data acquisition, image reconstruction, and image analysis all the way through medical interventions such as robotic surgery and radiotherapy. We are optimistic that this field will continue growing, at least over the next ten years. For more details on the material covered in this section, the reader is referred to the first book on this topic published by IOP Publishing, entitled "Machine Learning for Tomographic Imaging" (https://iopscience.iop.org/book/978-07503-2216-4). In short, there are four parts in the book. First, basic knowledge of neural networks is reviewed in three chapters. Then, the fourth and fifth chapters describe CT imaging principles and deep CT reconstruction. With another two chapters, MRI physics and deep MRI reconstruction are covered. Finally, in the fourth part, three chapters discuss other imaging modalities and multimodality imaging, image quality assessment, and quantum computing, respectively.


FIGURE 4.6 Hybrid portable imaging system consisting of ultrasound and X-ray imagers.

4.3 Image segmentation

4.3.1 Introduction

Prior to the recent era of deep learning, most medical image segmentation methods relied on human-designed algorithms, including thresholding, clustering, region growing, partial differential equation-based methods, variational methods, atlas- and model-based methods, graph partitioning, and more. In contrast to these methods, deep-learning-based methods allow a computer to create its own algorithm for integrating and discriminating image-based data. Deep-learning-based image segmentation has certainly been an area of active research (366 peer-reviewed papers indexed on PubMed in the first 8 months of 2019), and segmentation has been a topic of interest across many different imaging modalities such as X-ray, CT, MRI, ultrasound, PET, microscopy, and OCT. Image segmentation is often an essential step within an overall image analysis pipeline and is used, for example, to identify regions of interest representing normal structures—such as abdominal organs,22 bones,23 lungs,24 blood vessels or arteries,25 individual cells26 (Fig. 4.7), the heart27—or disease—such as lung nodules,28 breast cancer,29 diabetic retinopathy,30 aneurysms31—or imaging artifacts—such as metal artifacts in CT.32

Deep convolutional neural networks (DCNNs) are the primary tool at the center of most deep learning approaches to image segmentation and classification. They are a variation on standard neural networks and use large numbers of small, learned image filters to highlight specific image features and to incorporate these features into a hierarchical representation of the image through the successive application of different image filters in many convolutional layers.


FIGURE 4.7 Segmentation and classification of T-cell nuclei and dendritic cell bodies in multichannel immunofluorescent images from biopsies of patients with inflammatory kidney disease using patch-based DCNN.26 DCNN, Deep convolutional neural network.

Krizhevsky et al.33 began the modern deep learning revolution by showing that a DCNN could reduce the error rate on the ImageNet database by nearly half. While their approach relied on using a DCNN to represent an image as a set of classification scores for different categories, much subsequent research has focused on using different DCNN architectures not only to classify the information contained in a 2D, 3D, or ND image matrix into different categories, but also to localize, segment, and classify separate instances of individual objects in an image.

4.3.2 Localization versus segmentation

Herein, localization and segmentation refer to different tasks. For example, the human body naturally contains two kidneys, and it is possible for a DCNN to segment them accurately (produce pixel-level predictions for kidneys in an image) without understanding that they are separate objects in the spatial extent of the image. The task of understanding that they are both kidneys but also separate objects is object localization. The importance of object localization is less obvious in the case of kidneys because they are well separated within the volume of the human body. Its importance is more obvious when considering cellular imagery. When imaging cells, two cells of the same type might be in very close proximity or overlapping.26,34 In this case, without a localization mechanism, the segmentation mask of the two cells would be joined into a single object. The distinction between segmentation and localization is important because most of the DCNN architectures in wide use do not have an internal mechanism for object localization. A notable exception is CapsuleNets,35 proposed by Geoffrey Hinton, which encode spatial information in their representation and have been applied to image segmentation of lung CT.36


4.3.3 Fully convolutional networks

Fully convolutional networks (FCNs)37 are an example of a popular DCNN architecture that both segments and classifies image data but lacks a specific localization mechanism. The U-Net38 architecture is another version of an FCN, with an encoder-decoder design, that has been applied to many biomedical image segmentation problems ranging from the organ to the subcellular level but likewise lacks a specific localization mechanism. In many applications of DCNNs, the segmentation mask produced by a DCNN is not accurate at a fine-grained pixel level. Conditional random fields (CRFs) are often used as a postprocessing step to produce refined object segmentations; CRFs can only be applied to individually localized objects. A 3D CRF was used for brain lesion segmentation refinement in Ref. [39].

4.3.4 Regions with convolutional neural network features

In contrast to FCNs, regions with CNN features (R-CNN)40 use regions defined by bounding boxes around objects to tackle the object-localization problem. This concept has been progressively expanded upon in Fast R-CNN,41 Faster R-CNN,42 and Mask R-CNN.43 The bounding boxes are generated using a mechanism external to the primary DCNN architecture, and approaching object localization in this way plays to the strength of DCNNs at hierarchically representing single objects as collections of features. Using region proposals has the additional advantage of often dramatically reducing the amount and dimensionality of the data the DCNN needs to process. This can be critical when dealing with the large image matrices often found in medical imaging and can significantly reduce the amount of data required for training a DCNN.

Large image matrix size in 2D or 3D medical imaging represents a significant computational challenge for deep learning approaches. 3D spatial convolutions applied to large image volumes such as CT or MRI require significantly more processing than their 2D counterparts. Often the imaging volumes being processed cannot fit within the memory of a GPU. Breaking the image volume into different parts for processing can reduce this problem, but designing distributed computing solutions for processing different pieces of a single DCNN model is difficult. Hence, older deep-learning-based image segmentation methods were frequently based on the segmentation of image patches rather than full images. Processing across multiple GPUs incurs communication overhead, and data transfer rates are often limiting. In addition, different regions within an image processed by a DCNN are highly interdependent, which further reduces parallelization possibilities.

While region proposals can help with this computational complexity, there are downsides to separating image segmentation into distinct region proposal and classification/segmentation steps. For example, in lesion detection in tissue, the important image characteristics of a lesion may extend beyond the obvious physical border of the lesion in the image. By restricting the classification step to the bounding box region proposal encompassing that border, this information becomes hidden from further processing. This can also be true for the classification of many other objects in images, including cells and organs in certain tasks. Many recent studies thus use the entire input image to exploit contextual information and reduce redundant calculations.
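A minimal sketch of the patch-based strategy mentioned above for volumes too large to fit in GPU memory: the volume is tiled into overlapping cubes, each patch is processed independently, and overlapping predictions are averaged. The patch size, stride, and the stand-in segmentation function are illustrative assumptions, and boundary handling (e.g., padding odd-sized volumes) is omitted for brevity.

# Patch-based processing of a large 3D volume (e.g., CT or MRI): tile the volume
# into overlapping cubes, segment each patch independently, and average the
# predictions where patches overlap. Patch size and stride are illustrative.
import numpy as np

def tile_and_segment(volume, segment_fn, patch=64, stride=48):
    out = np.zeros(volume.shape, dtype=np.float32)
    counts = np.zeros(volume.shape, dtype=np.float32)
    zs, ys, xs = volume.shape
    for z in range(0, max(zs - patch, 0) + 1, stride):
        for y in range(0, max(ys - patch, 0) + 1, stride):
            for x in range(0, max(xs - patch, 0) + 1, stride):
                sub = volume[z:z+patch, y:y+patch, x:x+patch]
                pred = segment_fn(sub)                 # e.g., a DCNN returning per-voxel scores
                out[z:z+patch, y:y+patch, x:x+patch] += pred
                counts[z:z+patch, y:y+patch, x:x+patch] += 1.0
    return out / np.maximum(counts, 1.0)               # average overlapping predictions

# Example with a trivial "segmenter" (thresholding stands in for a trained network).
volume = np.random.rand(112, 256, 256).astype(np.float32)
scores = tile_and_segment(volume, lambda p: (p > 0.5).astype(np.float32))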


Further refinement of deep-learning-based segmentation maps can sometimes be obtained using classic segmentation methods, such as level sets,44–47 graph cuts,48 and model-based methods.49,50 R-CNN-based methods are also known for being slow due to the necessity of evaluating thousands of candidate region proposals within an image. Single-shot object detectors, including YOLO51 and SSD,52 were designed to address this issue and reframe detection as a regression problem where object-bounding boxes are detected directly from image data in a single pass. However, in many medical imaging applications, speed is not an issue. In situations such as trauma assessment and surgery, DCNNs designed for real-time processing may be more practical.

4.3.5 A priori information

Incorporating a priori information about organ structure and location into the DCNN segmentation task is another important concept. Organ location and structure in the human body are very regular. Atlas-based techniques have exploited this regularity by building a reference atlas of the human body and then registering imagery to this reference model. How to incorporate this information into a DCNN is not as clear. Most DCNN implementations use pixel-wise error computations, which do not consider more complex information about how well a prediction matches a label. Oktay et al.53 used a stacked convolutional autoencoder for cardiac segmentation to learn a statistical shape distribution that biases the DCNN output label predictions toward specific shapes. The V-Net54 DCNN architecture for volumetric segmentation used an objective function based on the Sorensen-Dice coefficient of image similarity.
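A common way to write an overlap-based objective of this kind is the "soft Dice" loss sketched below; it follows the spirit of the V-Net objective, but the exact formulation and the smoothing constant eps are illustrative choices rather than the published implementation.

# Soft Dice objective for segmentation: maximize overlap between predicted
# foreground probabilities and binary ground-truth masks by minimizing (1 - Dice).
# `eps` is a small smoothing constant (an illustrative choice) that avoids
# division by zero on empty masks.
import torch

def soft_dice_loss(pred_probs, target, eps=1e-6):
    # pred_probs, target: tensors of shape (batch, H, W) or (batch, D, H, W).
    dims = tuple(range(1, pred_probs.ndim))
    intersection = (pred_probs * target).sum(dim=dims)
    denom = pred_probs.sum(dim=dims) + target.sum(dim=dims)
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice.mean()

# Example usage with random tensors standing in for network output and labels.
loss = soft_dice_loss(torch.rand(2, 64, 64), (torch.rand(2, 64, 64) > 0.5).float())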

4.3.6 Manual labeling

Generating manual labels for DCNN training is another important topic, since most DCNN architectures for object segmentation currently use manual human labels for supervised training. Generating manual labels for 3D image volumes and high-resolution images can be very time consuming, so a flexible interface for image labeling is important. Much work has also been done on machine-assisted image segmentation, which can greatly reduce the manual labor of outlining objects. The Deep Extreme Cut55 framework is one example; it uses a few user-specified points along an object to produce segmentation masks.

4.3.7 Semisupervised and unsupervised approaches

Since the manual labeling of images in the supervised machine learning paradigm is so laborious, there has been much investigation into semisupervised and unsupervised segmentation techniques using deep learning. These have seen less usage in medical imaging and remain generally experimental and unproven. The Learning to Segment Every Thing56 semisupervised framework uses a novel weight transfer function to learn to produce segmentation masks from a dataset in which all objects had bounding box annotations but only a few had segmentation masks.


W-Net57 requires no labeled data and uses an encoder-decoder convolutional architecture in which the reconstruction error of the autoencoder and the normalized cut produced by the encoder are minimized together during training. GANs have also been applied to fully unsupervised segmentation of images.

4.4 Image registration

Image registration is the process of transforming different images into the same coordinate system with matched imaging content. It has been applied in various clinical practices and medical studies. Depending on the medical purpose, the images to be registered may be acquired for the same subject using different modalities, or in the same modality but from different subjects, or from the same subject at different times. Registration may also be performed on images acquired over time for time series analysis or longitudinal studies. Many image registration tasks are very challenging due to the complexities of the problem in terms of both image similarity measurement and spatial transformation or even deformation. Many automatic registration algorithms have been developed over the past several decades. However, manual intervention is still required in practice because conventional methods rely on predefined rules, for example, similarity metrics and transform patterns, which may not adapt well to the targeted data.

Entering the era of deep learning, the landscape of image registration research has been undergoing fast changes in the past few years. Soon after the success of deep learning in several computer vision tasks, for example, image classification,33 object detection,58 feature extraction,59 and segmentation,38 this breakthrough technology was introduced to the field of medical image registration.60 The application of deep learning to image registration is quite different from the typical tasks at which deep learning has demonstrated superior performance. Instead of directly correlating learned image features with an image class or object location, deep learning in image registration tries to map the extracted features to spatial relationships between images, either directly or indirectly. The improved performance in image registration is derived from the powerful feature representation and learning capabilities of deep learning.

Fig. 4.8 shows the evolution of deep-learning-based medical image registration methods. As the registration techniques have evolved, the number of published papers in this area has also increased exponentially. The process is briefly illustrated as follows. Deep learning was at first used to augment the performance of classical iterative optimization-based registration by directly defining corresponding features61 or learning image similarity measures.62–64 Such methods can be readily integrated into the classical image registration framework but may exploit only a limited part of the power of deep learning. As the field moved forward, several groups investigated the use of reinforcement learning for registration,65–67 with the intuition of mimicking the human expert registration process. Later, the demand for faster registration methods motivated the development of one-step, end-to-end transformation estimation techniques that avoid the iterative optimization process.68 Such direct mapping from features to a spatial transform makes better use of the capabilities of deep learning. Unsupervised direct registration methods were then developed to tackle the difficulties in generating ground-truth transformations for training deep-learning-based registration algorithms.69–72


FIGURE 4.8 Evolution trend of deep-learning-based medical image registration methods.

FIGURE 4.9 An AIR-Net was proposed for multimodality image registration.73 AIR-Net, Adversarial image registration network.

At the time of writing this book, the domain of medical image registration continues to advance at a fast pace. In addition to the abovementioned need-driven technical evolution, new network architectures and learning strategies have been applied to medical image registration. For example, GAN-based frameworks have been used to train a transform estimator and a registration quality evaluator simultaneously and have shown promising performance.73-75 Yan et al.,73 for instance, proposed to train an image-transform estimator and a registration evaluator simultaneously, as shown in Fig. 4.9, to bypass the difficulty of defining similarity between multimodal images for registration. New medical image registration algorithms are actively being developed to meet growing clinical needs. A more dedicated literature review on deep-learning-based medical image registration techniques can be found in Ref. [76].


In the rest of this section, deep-learning-based registration methods are categorized by the imaging modalities involved into single-modality and multimodality image registration and discussed in turn.

4.4.1 Single-modality image registration

Single-modality image registration refers to applications where both the moving images and the targeted fixed images are acquired using the same imaging modality and typically in the same dimensions. The main challenge in these applications is feature mismatch caused by local structural differences and deformation, for which conventional hand-crafted features and manually defined similarity metrics may fail. Wu et al.60,61 were the first to apply deep learning to extract discriminative image features from medical image volumes for registration, using a stacked autoencoder.77 Their proposed methods learn to select features that describe complex morphological patterns in image patches to improve correspondence detection for deformable registration of 3D brain MRI images. Gradient descent optimization is used subsequently to achieve the registration by maximizing the normalized cross correlation (NCC) between the two sets of features from the fixed and moving images. This method outperformed other landmark-based deformable image registration methods, including the diffeomorphic demons78 and HAMMER79 registration techniques.

Researchers have also investigated mixing deep learning with hand-crafted image features. Blendowski and Heinrich80 proposed to combine CNN-based descriptors with manually crafted self-similarity descriptors for lung CT image registration. Although the CNN-based descriptors alone did not perform as well as the conventional image features, the optimal performance was achieved by using both sets of feature descriptors together. This indicates that deep learning can provide information complementary to hand-crafted image features.

Since conventional image similarity metrics work well when the images are already aligned reasonably closely, that is, when the deformation between the images is small, de Vos et al.71 proposed an end-to-end single-modality image registration method that uses CNNs to optimize a loss defined by a conventional similarity metric, for example, NCC. This approach directly outputs the image-registration transform after the network is fully trained, which was made possible by the implementation of a spatial transformer network.81 To further reduce the reliance on manually defined similarity metrics, Fan et al.74,75 developed a GAN-based registration method that learns a discriminator to judge the quality of image registration for both single- and multimodality image registration.
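To make the end-to-end, NCC-driven idea described above concrete, the following is a minimal sketch rather than the architecture of any cited work: a small network predicts a dense displacement field, the moving image is warped with a differentiable sampler, and the training loss combines negative NCC with a smoothness penalty. All layer sizes, hyperparameters, and the random images are illustrative assumptions.

```python
# Hedged sketch: unsupervised, NCC-driven deformable registration with a
# differentiable warping step (spatial-transformer style). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRegNet(nn.Module):
    """Predicts a dense 2D displacement field from a fixed/moving image pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),   # 2 channels: (dx, dy) in pixels
        )

    def forward(self, fixed, moving):
        return self.net(torch.cat([fixed, moving], dim=1))

def warp(moving, flow):
    """Warp `moving` (N,1,H,W) with displacement field `flow` (N,2,H,W)."""
    n, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(moving.device)   # (2,H,W)
    new = grid.unsqueeze(0) + flow                                   # sampling locations
    new_x = 2 * new[:, 0] / (w - 1) - 1                              # normalize to [-1,1]
    new_y = 2 * new[:, 1] / (h - 1) - 1
    sample_grid = torch.stack([new_x, new_y], dim=-1)                # (N,H,W,2)
    return F.grid_sample(moving, sample_grid, align_corners=True)

def ncc_loss(a, b, eps=1e-5):
    """Negative normalized cross correlation between two image batches."""
    a = a - a.mean(dim=(2, 3), keepdim=True)
    b = b - b.mean(dim=(2, 3), keepdim=True)
    num = (a * b).sum(dim=(2, 3))
    den = torch.sqrt((a ** 2).sum(dim=(2, 3)) * (b ** 2).sum(dim=(2, 3)) + eps)
    return -(num / den).mean()

def smoothness(flow):
    """L2 penalty on spatial gradients of the displacement field."""
    dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]
    dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()

# One illustrative training step on random data.
net = TinyRegNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
fixed, moving = torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64)
flow = net(fixed, moving)
loss = ncc_loss(warp(moving, flow), fixed) + 0.1 * smoothness(flow)
opt.zero_grad(); loss.backward(); opt.step()
```

At inference time only a single forward pass of the trained network is needed, which is what makes these one-step methods much faster than iterative optimization.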

4.4.2 Multimodality image registration

Multimodality image registration refers to applications where the moving and targeted fixed images are acquired using different imaging modalities and sometimes even in different dimensions. In addition to the tissue deformation challenge present in single-modality image registration, gauging image similarity between multimodality images can be difficult even when they are well aligned.


Prior to the deep learning era, mutual information was the most commonly used similarity metric for intensity-based multimodal image registration.82,83 However, its performance depends on the intensity correspondence between images; for images with complicated correspondence, for instance, MRI and ultrasound images, mutual information does not work well. In an effort to explicitly estimate image similarity between multimodal images, Simonovsky et al.62 used a CNN to learn the similarity between 3D T1- and T2-weighted brain MRI volumes. The learned metric was then plugged into a classical iterative image registration framework to complete the registration, and its performance was demonstrated to be superior to that of mutual information. For the more challenging task of registering MRI and transrectal ultrasound (TRUS) images, Haskins et al.64 trained a CNN to learn the similarity using data acquired from 679 subjects undergoing image-fusion-guided biopsy; the learned metric demonstrated superior performance compared to conventional methods, including mutual information and a state-of-the-art modality-independent neighborhood descriptor.84 A new two-stage optimization strategy, differential evolution-initialized Newton-based optimization, was proposed to speed up the optimization process by reducing the number of network computations.

A common difficulty in training deep neural networks lies in the lack of large-scale labeled datasets. This problem becomes more pronounced for multimodality deformable image registration, where the ground truth deformation field is hard to obtain and, where feasible, very time consuming to generate. To alleviate the problem, weakly supervised registration methods have been proposed, which require only loosely labeled datasets for training the networks. For example, to tackle the problem of deformable MRI and TRUS image registration, Hu et al.85 proposed to use segmentations of those images to measure the degree of alignment. Since the underlying deformation fields are not required in the training process, their method is considered weakly supervised. Compared to deformation fields, image segmentation labels can be acquired far more economically. They showed that their method outperformed various baseline methods on this challenging task. Going one step further, Sedghi et al.86 developed an unsupervised method, deep information theoretic registration, that does not require any labeled data for training the networks. It is achieved by training deep neural networks to optimize patch-wise mutual information in a feature space, which removes the implicit pixel-wise independence assumptions made by the original pixel-intensity-based mutual information. Accompanied by an explicit information-theoretic explanation, their method demonstrated significantly better performance than standard mutual information on registering T1- and T2-weighted brain MRI images.
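As a point of reference for the learned similarity metrics discussed above, the following hedged sketch computes the classical histogram-based mutual information between two aligned images; the bin count and toy inputs are illustrative assumptions, not settings from the cited studies.

```python
# Hedged sketch: histogram-based mutual information (MI) between two aligned images.
import numpy as np

def mutual_information(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    """MI of the joint intensity histogram of two images of identical shape."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()                 # joint probability p(a, b)
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal p(a)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal p(b)
    nz = p_ab > 0                              # avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

# Toy check: an image is maximally informative about itself, and less informative
# about a noisy, re-contrasted version of itself (a crude stand-in for a second modality).
rng = np.random.default_rng(0)
t1 = rng.random((128, 128))
t2_like = 1.0 - t1 + 0.1 * rng.standard_normal((128, 128))
print(mutual_information(t1, t1), mutual_information(t1, t2_like))
```

Because this score is built from pixel-wise intensity co-occurrences only, it carries exactly the independence assumption that feature-space, patch-wise formulations such as Ref. [86] aim to relax.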

4.5 Deep-learning-based radiomics

“‘Radiomics’ refers to the extraction and analysis of large amounts of advanced quantitative imaging features with high throughput from medical images.”87 Deep-learning-based radiomics studies have been reported in increasing numbers and seem to have taken the world by storm, with close to 1000 papers indexed in PubMed for the first 8 months of 2019. However, with radiomics having evolved from computer-aided detection/diagnosis, and radiomics itself being a rather newly coined term87 that has not been adopted by everyone, this reported number of publications may be an underestimate.


The intertwined nature of deep learning applications for the different radiomics “tasks” of segmentation, detection, characterization, and classification further confounds the counting of publications. In this section, we consider deep-learning-based radiomics applications in which image features are either extracted using a deep learning method, for example, using a CNN as a feature extractor, and then input into a “conventional” classifier (transfer learning), or used internally by a deep learning method. It is important to note throughout that there is no one-size-fits-all approach for either conventional radiomic machine learning methods or deep learning techniques. While deep learning is sometimes seen as a “magic black box,” one needs to carefully consider the clinical task at hand in the method design and implementation.

4.5.1 Detection

Detection of organs, other normal anatomical structures, or lesions in medical images is often an important part of a radiomics pipeline, as it allows for more targeted analysis of specific body parts or pathology. The detection of anatomical structures is closely related to the segmentation of these structures (Section 4.3). Detection of abnormalities/lesions in medical images, such as screening mammograms, is a common task for radiologists that tends to be costly, tiring, time consuming, and sometimes error prone. It is therefore not surprising that computer-aided detection methods have been in development for decades. Traditional lesion detection methods often involve long processing pipelines with many different steps88-91: preprocessing, identification of candidate locations (e.g., in the simplest way through thresholding an image), extraction of hand-crafted features, and classification of the features of the candidate locations using a traditional classifier (to distinguish those that represent actual lesions from those that do not). Deep learning approaches, on the other hand, can combine the many steps of the traditional methods but generally require much more data for training.

Like deep-learning-based segmentation, detection methods initially tended to be based on image patches (e.g., Ref. [92]) and evolved into whole-image-based approaches providing localized anatomical information (e.g., Ref. [93]). Deep-learning-based detection methods can generally be divided into classification-based and regression-based methods. The former identify structures of interest on a by-image (or image-patch) basis, while the latter provide more detailed information such as the coordinates of a lesion center. Apart from these two common deep learning approaches to detection problems, modern techniques such as reinforcement learning94 are currently being adopted as well.

Since many CNN architectures are open source and available with weights obtained by training on large datasets of natural images (such as ImageNet), early deep learning classification-based detection approaches tended to rely heavily on transfer learning (Fig. 4.10). In the most “straightforward” form of transfer learning, a pretrained CNN is directly applied to the medical images of interest, without any adjustment of the weights, purely to extract image features (e.g., 4096 features for VGG19) for input into a “conventional” classifier such as a support vector machine; a minimal code sketch of this form is given after Fig. 4.10 below. More recently, refinements in transfer learning include fine-tuning of the pretrained CNN, in which some layers of the pretrained CNN are kept fixed while the weights in other layers are adapted by retraining on the medical images of interest.95


FIGURE 4.10 Illustration of the general pipeline of “conventional” radiomics and how deep learning techniques can replace the “building blocks” of this pipeline. Note that the deep learning approaches do not necessarily explicitly perform the steps in each “building block.”
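The following hedged sketch illustrates the “straightforward” feature-extraction form of transfer learning described above: a pretrained VGG19 (from torchvision, assuming a recent version) is kept frozen and used only to produce 4096-dimensional features, which are then fed to a support vector machine. The image and label variables are placeholders, not data from any cited study.

```python
# Hedged sketch: pretrained CNN as a fixed feature extractor + conventional classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

# Pretrained VGG19; weights stay frozen (no fine-tuning in this form).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
for p in vgg.parameters():
    p.requires_grad = False

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_3chw: torch.Tensor) -> torch.Tensor:
    """Return the 4096-dim activation of VGG19's first fully connected layer."""
    x = preprocess(image_3chw).unsqueeze(0)
    with torch.no_grad():
        x = vgg.features(x)
        x = vgg.avgpool(x)
        x = torch.flatten(x, 1)
        x = vgg.classifier[:2](x)   # Linear(25088->4096) + ReLU
    return x.squeeze(0)

# Placeholder data: `images` is a list of (3,H,W) tensors, `labels` is 0/1 per image.
images = [torch.rand(3, 256, 256) for _ in range(20)]
labels = [i % 2 for i in range(20)]

features = torch.stack([extract_features(img) for img in images]).numpy()
clf = SVC(kernel="rbf").fit(features, labels)      # "conventional" classifier
print(clf.predict(features[:4]))
```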

In contrast, more recent specialized approaches rely on custom-designed CNNs, often based in part on the architecture of well-known CNNs (such as AlexNet, U-Net, or one of the VGG networks) but with improvements through the use of more advanced architectural “building blocks” specifically designed for the task at hand.92,96,97

In anatomy detection methods formulated as regression problems, objects can be detected through, for example, marginal space learning, which avoids learning the full similarity transformation space by incrementally learning representations in marginal spaces of lower dimension. Thus coordinate systems can be learned within anatomical images, and the position of the body part of interest, such as the heart,98 within that image can be estimated.99,100 Similar to organ detection methods, many lesion detection methods employ transfer learning (e.g., Ref. [101]), while some more recent methods use custom architectures (e.g., Refs. [102-105]).

An added challenge is that medical images such as CT are 3D, whereas many of the CNNs available online were developed for 2D images. While the extension to 3D is conceptually straightforward, the implementation may not be, owing to memory and training data requirements. Hence, most early deep learning lesion detection methods used 2D CNNs for image-slice analysis even if the image data were 3D (or 4D for modalities such as dynamic contrast-enhanced MRI). In other applications, orthogonal views of lesions were analyzed.104,106 More recently, 3D CNNs have been used successfully for volumetric data. A common approach is to use 3D patches extracted from each image volume in training, similar to the use of 2D patches discussed earlier. Hundreds or thousands of 3D patches can thus be extracted from each image volume, which, combined with data augmentation, makes it possible to generate enough samples to train 3D CNNs. In conventional radiomics detection methods, it is common to use “sliding window” image patches (called regions of interest in that context) but, while it is tempting to employ such a strategy here, the sheer size of volumetric medical images often makes this approach impractical or even technically impossible for deep learning. Instead, once the 3D CNN is trained on patches, the entire network can be converted into an FCN107 so that it can be efficiently applied to an input of arbitrary size, resulting in fast processing of an entire image volume.108
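The patch-to-whole-volume idea can be made concrete with the hedged sketch below: a small 3D network is kept fully convolutional (a 1×1×1 convolution in place of a fixed-size fully connected layer), which achieves the same effect as converting a trained patch classifier into an FCN. The model can be trained on small 3D patches and then applied directly to a whole volume to produce a dense score map. Layer sizes and inputs are illustrative assumptions only.

```python
# Hedged sketch: a fully convolutional 3D detector usable on patches or whole volumes.
import torch
import torch.nn as nn

class Tiny3DFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            # 1x1x1 convolution plays the role of the final classifier layer,
            # so the output size follows the input size instead of being fixed.
            nn.Conv3d(16, 1, kernel_size=1),
        )

    def forward(self, x):
        return self.body(x)   # per-location lesion score (logit)

net = Tiny3DFCN()

# Training-time input: a batch of 3D patches (N, 1, 32, 32, 32).
patches = torch.rand(8, 1, 32, 32, 32)
print(net(patches).shape)        # torch.Size([8, 1, 8, 8, 8])

# Inference-time input: one whole (placeholder) CT volume of arbitrary size.
volume = torch.rand(1, 1, 128, 256, 256)
print(net(volume).shape)         # torch.Size([1, 1, 32, 64, 64]) score map
```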


4.5.2 Characterization and diagnosis

Much like traditional computerized detection approaches, conventional computerized characterization and classification methods involve a pipeline with many steps and the extraction of hand-crafted features, often related to those used by physicians in clinical practice (such as tumor size, shape, texture, and kinetics). Again, deep learning can circumvent the design of a complicated pipeline, but it is rather unlikely that naively applying a CNN to a learning task or a medical image without careful consideration of implementation will yield “optimal” performance. For example, when regions of interest are analyzed (depicting, e.g., an imaged tumor), the choice of this region of interest will likely have an impact on performance: a fixed-size region of interest will provide different information to a deep learning system than a variable-size region of interest chosen to tightly encompass each imaged tumor, and 2D versus 3D analysis may yield different results as well.109

The task of lesion or tissue characterization, that is, their description, lends itself to conventional radiomics methods in which hand-crafted features are extracted that are related to those used by physicians in clinical practice. In contrast, the features extracted by a CNN are not intuitive or readily interpretable. Nonetheless, deep-learning-based methods have been developed and have proven promising, for example, in the characterization of lung tissue, since image patches of lung patterns may be informative of underlying disease such as that associated with inflammation.110-112 Deep learning has also shown promise in other characterization tasks, such as ultrasonic characterization of fatty liver disease.113

The task of lesion or tissue diagnosis is to provide an assessment, such as the probability of malignancy, of a finding or region of interest initially identified by the physician. The, perhaps subtle, difference between characterization and diagnosis tasks is that for the former the output of the machine learning system represents a characteristic feature of a disease, while for the latter it provides a likelihood of disease. Conventional computer-aided diagnosis methods have been around for decades,88,89 and deep learning is starting to gain substantial interest in this area. Again, due to the limited number of cases generally available within medical image datasets, transfer learning is frequently used, either purely for feature extraction or for fine-tuning by freezing the earlier layers of a pretrained CNN and training the later layers, employing a CNN pretrained either on (1) natural images or (2) medical images of a different, but related, imaging modality (such as screen-film mammograms and full-field digital mammograms114). Researchers have found that conventional radiomics diagnosis methods and CNN-based approaches often yield similar levels of diagnostic performance. Moreover, it has been shown that combining conventional and deep learning methods can result in a statistically significant improvement in performance.115,116
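A minimal sketch of the fine-tuning strategy just described follows, assuming a generic pretrained torchvision backbone rather than any specific published model: the early layers are frozen and only the last stage and a new two-class head are retrained on (placeholder) medical images.

```python
# Hedged sketch: fine-tuning by freezing early layers and retraining later ones.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze everything first ...
for p in model.parameters():
    p.requires_grad = False
# ... then unfreeze the last residual stage and replace the classifier head.
for p in model.layer4.parameters():
    p.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, 2)   # e.g., benign vs. malignant (illustrative)

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Placeholder batch standing in for a real DataLoader over medical images.
images, targets = torch.rand(8, 3, 224, 224), torch.randint(0, 2, (8,))
logits = model(images)
loss = criterion(logits, targets)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```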

4.5.3 Prognosis

Once a cancer or disease has been identified, further workup, for example through biopsy, gives information on, in the case of cancer, stage, molecular subtype, and proliferation. While it is unlikely that radiomics will replace biopsies, the advantage of radiomics is that it is noninvasive and that a tumor or diseased region can be examined in its entirety, as well as any normal-appearing surrounding tissue.


Biopsies examine only small parts of a tumor, and it is known that tumor heterogeneity plays an important role in prognosis. Moreover, genomics studies have shown that the “normal” tissue directly surrounding cancers is not “normal” at all; tissue adjacent to a tumor has characteristics that distinguish it from both healthy and tumor tissue, and at the molecular level the tissue appears to be in “a unique intermediate state.”117 Whether these molecular differences result in differences in in vivo, or even pathological, imaging characteristics is unclear to date. Hence, radiomics can help discover traits of both the tumor and its surrounding tissue that inform patient prognosis and treatment. Conventional radiomics methods have shown promise, for example, in the prognosis of breast cancer and the staging of bladder cancer.118,119 Deep learning applications in prognosis include the staging of liver fibrosis on MRI,120 the distinction between “pure” ductal carcinoma in situ and ductal carcinoma in situ with occult microinvasions,121 the staging of pulmonary nodules,122 and the detection and staging of chronic obstructive pulmonary disease and acute respiratory disease.123

4.5.4 Assessment and prediction of response to treatment

Just as image analysis can be used to extract features for diagnosis and prognosis, it can also be used to help assess and predict response to therapy and to predict overall or recurrence-free survival.124 Another advantage of using radiomics is that it is repeatable, in the sense that a tumor or disease can be repeatedly imaged and analyzed over the course of treatment with pharmaceuticals or radiation. In oncology-related research, deep-learning-based prognosis approaches have been investigated in breast cancer,125 bladder cancer,126 glioblastoma,127 rectal cancer,128 and liver toxicity after liver stereotactic body radiotherapy.129 In nononcology-related imaging, deep learning has also gained interest, with applications ranging from the early prediction of rejection of renal transplants to the prediction of response to treatment in ischemic stroke patients.130

It is possible to use image features extracted from imaging exams acquired during the course of treatment to predict long-term treatment success in terms of recurrence-free or overall survival. Ideally suited for such tasks are deep neural networks such as long short-term memory (LSTM) networks, which have the ability to retain information over arbitrary time intervals. An LSTM is a recurrent neural network that, unlike a “regular” deep learning network, has feedback connections and “forget” gates, which make it well suited to the analysis of time series for classification and prediction. Originally proposed in 1997, LSTMs have recently become very successful in commercial applications such as speech recognition. While the number of time points available for inference in medical imaging datasets is typically much smaller than in everyday applications such as weather or stock market forecasting, LSTMs have shown promise in breast lesion classification using the temporal information within single 4D dynamic contrast-enhanced MRI exams (3D image volumes for pre- and postcontrast agent injection acquisitions) to capture temporal enhancement patterns.131 Similar methods are being developed for the prediction of recurrence-free survival based on analysis of MRI exams acquired during the course of neoadjuvant treatment.
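As a hedged illustration of the LSTM-based temporal analysis described above, the sketch below classifies a short sequence of per-time-point image feature vectors (e.g., CNN features from each acquisition of a dynamic exam). The feature dimension, sequence length, and random inputs are assumptions for illustration, not details of the cited work.

```python
# Hedged sketch: LSTM over a sequence of per-time-point image feature vectors.
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, feat_dim=256, hidden=64, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):             # x: (batch, time_points, feat_dim)
        _, (h_n, _) = self.lstm(x)    # h_n: final hidden state, (1, batch, hidden)
        return self.head(h_n.squeeze(0))

model = SequenceClassifier()
# 8 patients, 5 imaging time points (e.g., pre- plus 4 postcontrast), 256 features each.
features = torch.rand(8, 5, 256)
labels = torch.randint(0, 2, (8,))    # e.g., responder vs. nonresponder (illustrative)
loss = nn.CrossEntropyLoss()(model(features), labels)
loss.backward()
```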


4.5.5 Assessment of risk of future cancer

In cancer risk assessment, computer-extracted characteristics of normal tissue are related to cancer risk factors. For example, in breast cancer imaging research, conventional radiomic texture analysis demonstrated that women at high risk for breast cancer tend to have dense breasts with parenchymal patterns that are coarse and low in contrast.132 Deep learning is currently being investigated to assess breast density133,134 and to characterize breast parenchyma, for example, to distinguish between women at normal risk of breast cancer and those at high risk based on their BRCA1/2 status.135

4.6 Summary and outlook

Despite early resistance to, and suspicion of, deep learning in medical image analysis, an overwhelming majority of researchers worldwide are now working on deep imaging. The efficacy and efficiency of deep learning in the medical imaging field are undeniable, as evidenced by a sufficiently large number of independent studies across modalities and application areas. Along with the impressive results, related practical issues are becoming pressing, involving interpretability, interoperability, robustness, and optimality in a continuous learning environment.

Many challenges remain, however, and deep learning will continue to be an area of active research within medical imaging. Issues regarding quality control, ethics, patient confidentiality, and reimbursement, for example, pose hurdles for clinical implementation that are outside the scope of this chapter.136 Another challenge is that the push for “explainable” AI may decrease performance, since “explainability” tends to be inversely correlated with performance. “Explainability” means being able to quite literally explain what is happening in human terms, which is difficult or even impossible in deep learning. Rather, perhaps the focus should be on developing “interpretable” methods, where “interpretability” is about being able to predict what is going to happen, given a change in input or algorithmic parameters, without necessarily knowing why.

But arguably the most important key to success is the sharing of algorithms and anonymized image data among researchers. The importance of testing newly developed methods on independent, preferably publicly available,137 test datasets cannot be emphasized enough. Often, database bias in the development and performance assessment of new methods is difficult to avoid entirely, and methods may not generalize well to new, unseen data, especially when dealing with different populations or different imaging protocols. Organized research challenges, such as those run by the International Society for Optics and Photonics, the National Cancer Institute, the American Association of Physicists in Medicine, the Medical Image Computing and Computer-Assisted Intervention conference, and Kaggle, can play an important role in the direct comparison of different methodologies on the same test dataset and advance the field as a whole.138 In such organized challenge competitions, deep learning methods have become dominant contenders in both number of submissions and performance.

In summary, in this chapter we have provided a snapshot of the ultrafast-moving field of deep learning in medical imaging.


We believe that many more exciting developments are on the horizon. These include graph neural networks, knowledge graphs, and semisupervised and unsupervised learning, just to name a few. We are very optimistic about the field of AI in medicine. Specifically, we are positive that many more successful applications of AI-based medical imaging techniques will be realized in the next 5-10 years, generating a huge impact on health care in general.

References 1. Sahiner B, et al. Deep learning in medical imaging and radiation therapy. Med Phys 2019;46(1):e1 e36. 2. Litjens G, et al. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60 88. 3. Sheth D, Giger ML. Artificial intelligence in the interpretation of breast cancer on MRI. J Magn Reson Imaging 2020;108:354 70. 4. Zhang Z, Sejdi´c E. Radiological images and machine learning: Trends, perspectives, and prospects. Comput Biol Med 2019;108:354 70. 5. Mazurowski MA, et al. Deep learning in radiology: An overview of the concepts and a survey of the state of the art with focus on MRI. J Magn Reson Imaging 2019;49(4):939 54. 6. Srinidhi CL, Ciga O, Martel AL.Deep neural network models for computational histopathology: a survey. 2019. 7. Wang G. A perspective on deep imaging. IEEE Access 2016;4:8914 24. 8. Wang G, et al. Image reconstruction is a new frontier of machine learning. IEEE Trans Med Imaging 2018;37 (6):1289 96. 9. Kak AC, Slaney M. Principles of computerized tomographic imaging. Classics in applied mathematics. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2001. p. 327. xiv. 10. Chen H, et al. LEARN: Learned Experts’ Assessment-Based Reconstruction Network for sparse-data CT. IEEE Trans Med Imaging 2018;37(6):1333 47. 11. Li Y, et al. Learning to reconstruct computed tomography images directly from sinogram data under a variety of data acquisition conditions. IEEE Trans Med Imaging 2019;38. 12. Zhu B, et al. Image reconstruction by domain-transform manifold learning. Nature 2018;555(7697):487 92. 13. Cui JA, et al. Deep reconstruction model for dynamic PET images. PLoS One 2017;12:9. 14. Kim K, et al. Penalized PET reconstruction using deep learning prior and local linear fitting. IEEE Trans Med Imaging 2018;37(6):1478 87. 15. Liu F, et al. A deep learning approach for F-18-FDG PET attenuation correction. EJNMMI Phys. 2018;5 16. Hong X, et al. Enhancing the image quality via transferred deep residual learning of coarse PET sinograms. IEEE Trans Med Imaging 2018;37(10):2322 32. 17. Yu HQ, et al. PCANet based nonlocal means method for speckle noise removal in ultrasound images. PLoS One 2018;13:10. 18. Chan TH, et al. PCANet: a simple deep learning baseline for image classification? IEEE Trans Image Process 2015;24(12):5017 32. 19. Brattain LJ, et al. Machine learning for medical ultrasound: status, methods, and future opportunities. Abdom Radiol (NY) 2018;43(4):786 99. 20. Ma YH, et al. Speckle noise reduction in optical coherence tomography images based on edge-sensitive cGAN. Biomed Opt Express 2018;9(11):5129 46. 21. Lian J, et al. Deblurring retinal optical coherence tomography via a convolutional neural network with anisotropic and double convolution layer. IET Comput Vis 2018;12(6):900 7. 22. Yan P, et al. Discrete deformable model guided by partial active shape model for TRUS image segmentation. IEEE Trans Biomed Eng 2010;57(5):1158 66. 23. Lindgren Belal S, et al. Deep learning for segmentation of 49 selected bones in CT scans: first step in automated PET/CT-based 3D quantification of skeletal metastases. Eur J Radiol 2019;113:89 95. 24. Park J, et al. Fully automated lung lobe segmentation in volumetric chest CT with 3D U-Net: validation with intra- and extra-datasets. J Digit Imaging 2019;33. 25. Oda M, et al. Abdominal artery segmentation method from CT volumes using fully convolutional neural network. Int J Comput Assist Radiol Surg 2019;14.


26. Liarski VM, et al. Quantifying in situ adaptive immune cell cognate interactions in humans. Nat Immunol 2019;20(4):503 13. 27. Zhuang X, et al. Evaluation of algorithms for Multi-Modality Whole Heart Segmentation: an open-access grand challenge. Med Image Anal 2019;58:101537. 28. Aresta G, et al. iW-Net: an automatic and minimalistic interactive lung nodule segmentation deep network. Sci Rep 2019;9(1):11591. 29. Men K, et al. Fully automatic and robust segmentation of the clinical target volume for radiotherapy of breast cancer using big data and deep learning. Phys Med 2018;50:13 19. 30. Kou C, et al. Microaneurysms segmentation with a U-Net based on recurrent residual convolutional neural network. J Med Imaging (Bellingham) 2019;6(2):025008. 31. Podgorsak AR, et al. Automatic radiomic feature extraction using deep learning for angiographic parametric imaging of intracranial aneurysms. J Neurointerv Surg 2019. 32. Hegazy MAA, Cho MH, Lee SY. U-net based metal segmentation on projection domain for metal artifact reduction in dental CT. Biomed Eng Lett 2019;9(3):375 85. 33. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, et al., editors. Advances in neural information processing systems. Curran Associates, Inc.; 2012. p. 1097 105. 34. Yan P, et al. Automatic segmentation of high-throughput RNAi fluorescent cellular images. IEEE Trans Inf Technol Biomed 2008;12(1):109 17. 35. Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. CoRR 2017. abs/1710.09829. 36. LaLonde R, Bagci U. Capsules for object segmentation. 2018. 37. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. CoRR 2014. abs/ 1411.4038. 38. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. Cham: Springer International Publishing; 2015. 39. Kamnitsas K, et al. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med Image Anal 2017;36:61 78. 40. Cai Z, Vasconselos N. Cascade R-CNN: delving into high quality object detection. CoRR 2017. abs/1712.00726. 41. Girshick RB. Fast R-CNN. CoRR 2015. abs/1504.08083. 42. Ren S, et al. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR 2015. abs/1506.01497. 43. He K, et al. Mask R-CNN. CoRR 2017. abs/1703.0. 44. Cha KH, et al. Urinary bladder segmentation in CT urography using deep-learning convolutional neural network and level sets. Med Phys 2016;43(4):1882. 45. Cha KH, et al. Bladder cancer segmentation in CT for treatment response assessment: application of deeplearning convolution neural network—a pilot study. Tomography 2016;2(4):421 9. 46. Avendi MR, Kheradvar A, Jafarkhani H. Automatic segmentation of the right ventricle from cardiac MRI using a learning-based approach. Magn Reson Med 2017;78(6):2439 48. 47. Ngo TA, Lu Z, Carneiro G. Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance. Med Image Anal 2017;35:159 71. 48. Lu F, et al. Automatic 3D liver location and segmentation via convolutional neural network and graph cut. Int J Comput Assist Radiol Surg 2017;12(2):171 82. 49. Liu F, et al. Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magn Reson Med 2018;79(4):2379 91. 50. Avendi MR, Kheradvar A, Jafarkhani H. 
A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Med Image Anal 2016;30:108 19. 51. Redmon J, et al. You only look once: unified, real-time object detection. CoRR 2015. abs/1506.02640. 52. Liu W, et al. SSD: Single Shot MultiBox Detector. CoRR 2015. abs/1512.02325. 53. Oktay O, et al. Anatomically constrained neural networks (ACNN): application to cardiac image enhancement and segmentation. CoRR 2017. abs/1705.08302. 54. Milletari F, Navab N, Ahmadi S. V-Net: fully convolutional neural networks for volumetric medical image segmentation. 55. Maninis, K.K., et al. Deep extreme cut: from extreme points to object segmentation.


56. Hu R, et al. Learning to segment every thing. CoRR 2017. abs/1711.10370. 57. Xia X, Kulis B. W-Net: a deep model for fully unsupervised image segmentation. CoRR 2017. abs/1711.08506. 58. Girshick R, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE conference on computer vision and pattern recognition. 2014. 59. He KM et al. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition. 2016. p. 770 8. 60. Wu G., et al. Unsupervised deep feature learning for deformable registration of MR brain images. In: International conference on medical image computing and computer-assisted intervention. Springer; 2013. 61. Wu G, et al. Scalable high-performance image registration framework by unsupervised deep feature representations learning. IEEE Trans Biomed Eng 2015;63(7):1505 16. 62. Simonovsky M, et al. A deep metric for multimodal registration. Springer International Publishing; 2016. 63. Cheng X, Zhang L, Zheng Y. Deep similarity learning for multimodal medical images. Comput Methods Biomech Biomed Eng: Imag Vis 2016;1 5. 64. Haskins G, et al. Learning deep similarity metric for 3D MR TRUS image registration. Int J Comput Assist Radiol Surg 2019;14(3):417 25. 65. Liao R, et al. An artificial agent for robust image registration. 2017. 66. Miao S, Wang ZJ, Liao R. A CNN regression approach for real-time 2D/3D registration. IEEE Trans Med Imaging 2016;35(5):1352 63. 67. Ma K, et al. Multimodal image registration with deep context reinforcement learning. In: International conference on medical image computing and computer-assisted intervention. Springer; 2017. 68. Prevost R, et al. Deep learning for sensorless 3D freehand ultrasound imaging. In: International conference on medical image computing and computer-assisted intervention. Springer, Cham; 2017. 69. Balakrishnan G, et al. An unsupervised learning model for deformable medical image registration. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. 70. de Vos BD, et al. ConvNet-based localization of anatomical structures in 3-D medical images. IEEE Trans Med Imaging 2017;36(7):1470 81. 71. de Vos BD, et al. End-to-end unsupervised deformable image registration with a convolutional neural network. arXiv:1704.06065 [cs] 2017. 72. de Vos BD, et al. A deep learning framework for unsupervised affine and deformable image registration. Med Image Anal 2019;52:128 43. 73. Yan P, et al. Adversarial image registration with application for MR and TRUS image fusion. Springer International Publishing; 2018. 74. Fan J, et al. Adversarial similarity network for evaluating image alignment in deep learning based registration. Springer International Publishing; 2018. 75. Fan J, et al. Adversarial learning for mono- or multi-modal registration. Med Image Anal 2019;101545. 76. Haskins G, Kruger U, Yan P. Deep learning in medical image registration: a survey. arXiv:1903.02026 2019. 77. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 2013;35(8):1798 828. 78. Vercauteren T, et al. Diffeomorphic demons: efficient non-parametric image registration. Neuroimage 2009;45 (1):S61 72. 79. Shen D, Davatzikos C. HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans Med Imaging 2002;21(11):1421 39. 80. Blendowski M, Heinrich MP. 
Combining MRF-based deformable registration and deep binary 3D-CNN descriptors for large lung motion estimation in COPD patients. Int J Comput Assist Radiol Surg 2019;14 (1):43 52. 81. Jaderberg M, et al. Spatial transformer networks. arXiv:1506.02025 [cs] 2015. 82. Viola P, et al. Multi-modal volume registration by maximization of mutual information. Med Image Anal 1996;1(1):35 51. 83. Maes F, et al. Multimodality image registration by maximization of mutual information. IEEE Trans Med Imaging 1997;16(2):187 98. 84. Heinrich MP, et al. MIND: modality independent neighbourhood descriptor for multi-modal deformable registration. Med Image Anal 2012;16(7):1423 35.


85. Hu Y, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Med Image Anal 2018;49:1 13. 86. Sedghi, A., et al., Deep information theoretic registration. arXiv:1901.00040 [cs, math] 2018. 87. Kumar V, et al. Radiomics: the process and the challenges. Magn Reson Imaging 2012;30(9):1234 48. 88. Giger ML, Chan HP, Boone J. Anniversary paper: history and status of CAD and quantitative image analysis: the role of medical physics and AAPM. Med Phys 2008;35(12):5799 820. 89. Giger ML, Karssemeijer N, Schnabel JA. Breast image analysis for risk assessment, detection, diagnosis, and treatment of cancer. In: Yarmush ML, editor. Annual review of biomedical engineering, vol. 15. 2013. p. 327 57. 90. Drukker K, et al. Computerized detection and classification of cancer on breast ultrasound. 91. Drukker K, Sennett CA, Giger ML. Computerized detection of breast cancer on automated breast ultrasound imaging of women with dense breasts. Med Phys 2014;41(1):012901. 92. Yang D, et al. Automated anatomical landmark detection on distal femur surface using convolutional neural network. In: 2015 IEEE 12th international symposium on biomedical imaging. 2015. 93. Lee H, et al. Pixel-level deep segmentation: artificial intelligence quantifies muscle on computed tomography for body morphometric analysis. J Digit Imaging 2017;30(4):487 98. 94. Ghesu FC, et al. An artificial agent for anatomical landmark detection in medical images. In: International conference on medical image computing and computer-assisted intervention (MICCAI). 2016. 95. Chen H, et al. Standard plane localization in fetal ultrasound via domain transferred deep neural networks. IEEE J Biomed Health Inform 2015;19(5):1627 36. 96. Kumar A, et al. Plane identification in fetal ultrasound images using saliency maps and convolutional neural networks. In: 2016 IEEE 13th international symposium on biomedical imaging. 2016. 97. Kumar A, et al. An ensemble of fine-tuned convolutional neural networks for medical image classification. IEEE J Biomed Health Inform 2017;21(1):31 40. 98. Ghesu FC, et al. Marginal space deep learning: efficient architecture for volumetric image parsing. IEEE Trans Med Imaging 2016;35(5):1217 28. 99. Yan K, Lu L, Summers RM. Unsupervised body part regression using convolutional neural network with self-organization. arXiv:1707.03891 2017. 100. Yan Z, et al. Multi-instance deep learning: discover discriminative local anatomies for bodypart recognition. IEEE Trans Med Imaging 2016;35(5):1332 43. 101. Yap MH, et al. Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J Biomed Health Inform 2018;22. 102. Albarqouni S, et al. Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans Med Imaging 2016;35(5):1313 21. 103. Orlando JI, et al. An ensemble deep learning based approach for red lesion detection in fundus images. Comput Methods Programs Biomed 2018;153:115 27. 104. Roth HR, et al. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE Trans Med Imaging 2016;35(5):1170 81. 105. Yang X, et al. Co-trained convolutional neural networks for automated detection of prostate cancer in multiparametric MRI. Med Image Anal 2017;42:212 27. 106. Setio AA, et al. Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE Trans Med Imaging 2016;35(5):1160 9. 107. Long J, Shelhamer E, Darrell T. 
Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. 108. Qi D, et al. Automatic detection of cerebral microbleeds from MR images via 3D convolutional neural networks. IEEE Trans Med Imaging 2016;35(5):1182 95. 109. Antropova N, Abe H, Giger ML. Use of clinical MRI maximum intensity projections for improved breast lesion classification with deep convolutional neural networks. J Med Imaging (Bellingham) 2018;5 (1):014503. 110. Anthimopoulos M, et al. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans Med Imaging 2016;35(5):1207 16. 111. Kim GB, et al. Comparison of shallow and deep learning methods on classifying the regional pattern of diffuse lung disease. J Digit Imaging 2018;31.


112. Christodoulidis S, et al. Multisource transfer learning with convolutional neural networks for lung pattern analysis. IEEE J Biomed Health Inform 2017;21(1):76 84. 113. Bharath R, Rajalakshmi P. Deep scattering convolution network based features for ultrasonic fatty liver tissue characterization. In: 2017 39th annual international conference of the IEEE engineering in medicine and biology society. 2017. p. 1982 5. 114. Samala RK, et al. Multi-task transfer learning deep convolutional neural network: application to computeraided diagnosis of breast cancer on mammograms. Phys Med Biol 2017;62(23):8894 908. 115. Antropova N, Huynh BQ, Giger ML. A deep feature fusion methodology for breast cancer diagnosis demonstrated on three imaging modality datasets. Med Phys 2017;44(10):5162 71. 116. Huynh BQ, Li H, Giger ML. Digital mammographic tumor classification using transfer learning from deep convolutional neural networks. J Med Imaging (Bellingham) 2016;3(3):034501. 117. Aran D, et al. Comprehensive analysis of normal adjacent to tumor transcriptomes. Nat Commun 2017;8 (1):1077. 118. Li H, et al. Quantitative MRI radiomics in the prediction of molecular classifications of breast cancer subtypes in the TCGA/TCIA data set. NPJ Breast Cancer 2016;2. 119. Garapati SS, et al. Urinary bladder cancer staging in CT urography using machine learning. Med Phys 2017;44(11):5814 23. 120. Yasaka K, et al. Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: a preliminary study. Radiology 2018;286(3):899 908. 121. Shi BB, et al. Prediction of occult invasive disease in ductal carcinoma in situ using deep learning features. J Am Coll Radiol 2018;15(3):527 34. 122. Masood A, et al. Computer-assisted decision support system in pulmonary cancer detection and stage classification on CT images. J Biomed Inform 2018;79:117 28. 123. Gonzalez G, et al. Disease staging and prognosis in smokers using deep learning in chest computed tomography. Am J Respir Crit Care Med 2018;197(2):193 203. 124. Drukker K, et al. Breast MRI radiomics for the pre-treatment prediction of response to neoadjuvant chemotherapy in node-positive breast cancer patients. In: Proceedings of the SPIE 10950 Medical Imaging. 109502N. 2019. 125. Huynh BQ, Antropova N, Giger ML. Comparison of breast DCE-MRI contrast time points for predicting response to neoadjuvant chemotherapy using deep convolutional neural network features with transfer learning. In: Armato SG, Petrick NA, editors. Proceedings of the SPIE Medical Imaging. 2017. p. 101340U. 126. Cha KH, et al. Bladder cancer treatment response assessment in CT using radiomics with deep-learning. Sci Rep 2017;7. 127. Lao J, et al. A deep learning-based radiomics model for prediction of survival in glioblastoma multiforme. Sci Rep 2017;7(1):10353. 128. Bibault JE, et al. Deep learning and radiomics predict complete response after neo-adjuvant chemoradiation for locally advanced rectal cancer. Sci Rep 2018;8(1):12611. 129. Ibragimov B, et al. Development of deep neural network for individualized hepatobiliary toxicity prediction after liver SBRT. Med Phys 2018. 130. Nielsen A, et al. Prediction of tissue outcome and assessment of treatment effect in acute ischemic stroke using deep learning. Stroke 2018;49(6):1394 401. 131. Antropova N, et al. Breast lesion classification based on dynamic contrast-enhanced magnetic resonance images sequences with long short-term memory networks. J Med Imaging (Bellingham) 2019;6(1):011002. 132. 
Li H, et al. Computerized analysis of mammographic parenchymal patterns on a large clinical dataset of full-field digital mammograms: robustness study with two high-risk datasets. J Digit Imaging 2012;25 (5):591 8. 133. Li SF, et al. Computer-aided assessment of breast density: comparison of supervised deep learning and feature-based statistical learning. Phys Med Biol 2018;63(2). 134. Lee J, Nishikawa RM. Automated mammographic breast density estimation using a fully convolutional network. Med Phys 2018;45(3):1178 90. 135. Li H, et al. Deep learning in breast cancer risk assessment: evaluation of convolutional neural networks on a clinical dataset of full-field digital mammograms. J Med Imaging (Bellingham) 2017;4(4):041304.


136. Herold CJ, et al. Imaging in the age of precision medicine: summary of the proceedings of the 10th biannual symposium of the international society for strategic studies in radiology. Radiology 2016;279(1):226 38. 137. Clark K, et al. The cancer imaging archive (TCIA): maintaining and operating a public information repository. J Digit Imaging 2013;26(6):1045 57. 138. Armato SG, et al. PROSTATEx challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images. J Med Imaging (Bellingham) 2018;5(4):044501.


5 Expert systems in medicine

Li Zhou and Margarita Sordo

Abstract

Medical expert systems (ESs) aim to apply computer technology to emulate human decision-making and provide computerized clinical decision support to clinicians, patients, and other individuals with suitable information and knowledge at appropriate times to improve the quality and safety of health care. This chapter first presents the significance and a brief history of this field. It then describes a common architecture and major components of medical ESs and introduces different knowledge representation and reasoning techniques. This chapter will also show some examples of medical ESs, including computer-assisted diagnosis systems, medication alert systems, reminder systems, and so on. Lastly, it discusses the advantages and disadvantages of existing approaches as well as major issues and challenges related to system implementation, evaluation, maintenance, and distribution, and points out some of the directions for future research and development.

Keywords: Expert systems; knowledge bases; rule-based systems; clinical decision support; decision theory; knowledge acquisition; knowledge representation; computer-assisted diagnosis; computer-assisted therapy; medication alert system; reminder systems

5.1 Introduction

In medicine and health care, many computer systems have been developed to automatically assist clinicians and patients with decision-making by providing relevant information in a timely manner. The use of artificial intelligence (AI) and expert systems (ESs) as an aid to clinicians was first recognized in the late 1950s. Medical ESs aim to apply computer technology to emulate human decision-making and to provide computerized clinical decision support (CDS) to clinicians, patients, and other individuals with pertinent information and knowledge at appropriate times to improve health and health-care delivery. Medical ESs typically consist of a knowledge base and an inference engine. The inference engine reasons upon available knowledge from the knowledge base and input data to make health-care recommendations. The user interacts with the system through a user interface, where the user can have an interactive dialog with the system or enter information if needed, and then receives recommendations and interpretations of these recommendations given by the system. The purpose of these systems is to assist users in making better analyses or decisions based on the available data and knowledge than either the user or the system could make on their own.



It is widely recognized that ESs and other CDS systems have great potential to improve health and health care. Medical errors are common and costly and can result in death and serious injury. The Institute of Medicine report To Err Is Human, issued in 1999, estimated that as many as 98,000 people died in hospitals per year from medical errors1; while many injuries that occur during hospitalization are unpredictable and unavoidable, 20%-70% may be preventable. Computer information technology, such as ESs, is believed to support optimal evidence-based care by making the best knowledge available when needed and to help reduce medical errors. ESs can aid in building a safer health system by providing more precise diagnoses and a more scientific determination of the treatment plan, checking for potential medication errors at each stage of the medication management process (e.g., ordering, dispensing, and administering), and informing clinicians of actionable interventions based on patient conditions.

In this chapter, we first present a brief history of this field and then introduce a common architecture and major components of medical ESs, as well as different knowledge representation (KR) and reasoning techniques. We will demonstrate some examples of medical ESs and discuss the advantages and disadvantages of existing approaches. We will also present major issues and challenges in this field and finally point out future directions.

5.2 A brief history

After the dawn of modern computers in the 1950s, medical researchers began experimenting with the idea of using computer technology to emulate the decision-making processes of clinical domain experts. Clinical diagnostic systems were the early forms of ESs. In the late 1950s, Ledley and Lusted proposed the use of mathematical techniques to model the reasoning process of medical diagnosis.2 In the 1970s and 1980s, the field of medical AI and ESs was explored and established.3 Some exemplary medical ESs were developed during this time; MYCIN, for example, was designed to identify bacteria causing infection and recommend antibiotics.4 However, most early diagnostic consultation systems were limited to stand-alone applications or working prototypes and were not used in routine clinical practice due to many obstacles, including integration with hospital information systems and clinician workflow, the complexity and uncertainty of the logic rules, the high costs of maintaining the knowledge base, and the requirement of manual data entry and cumbersome dialogs.5 During the late 1980s and early 1990s, excitement surrounding AI fell into what is often referred to as the “AI winter.”

Over the years, researchers in medical informatics, AI, and other related fields have strived to find solutions to address and overcome the limitations of ESs. These efforts included developing KR frameworks and common reference models, markup languages (such as Arden Syntax6 for presenting and sharing medical knowledge in an executable format), and computer-based medical coding systems (such as GALEN, which aimed to provide reusable terminology resources for clinical systems).7 To address uncertainty, which is common in medicine, fuzzy logic was integrated into languages such as fuzzy Arden Syntax.8


Evaluation studies, including randomized controlled trials, were conducted to assess provider performance, clinical outcomes, and the economic impact of ESs.9-12 It was also realized that some potential benefits, such as an improved process of care, quicker and easier access to patient data, and automatic error checks, often cannot be measured directly by clinical outcome measures.

In the late 1990s and early 2000s, with the advancement and increased use of the Internet and information technology in health care and the emphasis on evidence-based medicine (EBM) and clinical guidelines, more ESs and computerized CDS systems emerged. While the majority of early ESs focused on diagnosis, new areas, including health-care reminders and medication alerts, began to be further explored. With the rapid increase in the amount of new information available in health care, knowledge management became an important task. Efficient methods and tools were critical to allow medical domain experts, knowledge engineers (KEs), and software engineers to convert medical knowledge into machine-executable decision support rules.13 Query and expression languages, such as GELLO,14 were developed for specifying logic expressions and criteria to facilitate the representation and sharing of computerized clinical knowledge.

Electronic health record (EHR) technologies progressed rapidly and saw widespread adoption in the late 2000s. It has been widely recognized that the addition of expert knowledge that can provide computerized CDS in the EHR is critical to facilitating meaningful use of the EHR as both a patient care and population management tool, rather than as a platform primarily for billing and documentation. However, many health-care delivery institutions lacked the resources and expertise to implement advanced expert and CDS systems and realize their potential benefits. In addition, institutions often implemented these systems in different ways due to the lack of a universally adopted standard representation of the best available clinical knowledge for knowledge sharing and distribution. In the mid-2000s, a road map was proposed that recommended a series of actions to improve CDS capabilities and increase its use throughout the US health sector.15 A set of coordinated, collaborative projects was funded and conducted16 to develop approaches for representing structured and shareable CDS intervention artifacts17 and distributing the machine-executable artifacts via web services.18

Over the past five decades, despite notable successes of medical ESs and CDS, wide adoption and efficient use of such systems still face many challenges. The critical path ahead includes, but is not limited to, improving KR, interoperability, and sharing; increasing integration with workflow and effective use; and leveraging data to enhance knowledge, reasoning processes, and methods.

5.3 Methods

5.3.1 Expert system architecture

An ES is a knowledge-based system that consists of a knowledge base covering a specific, usually narrow, domain and an inference engine that reasons about that domain. An ES interacts with users through a user interface. It simulates the decision-making processes that human experts (HEs) follow to tackle specific problems.


ESs find their way into most areas of knowledge and are used in a wide variety of applications, including medical diagnosis, helping physicians identify potential diagnoses based on signs, symptoms, and patient data19,20; planning21 and scheduling in airline and cargo applications22; configuration applications with constraints23 to maximize the usability of resources24; design and manufacturing25; financial decision-making in the stock market and fraud detection26; and process management to maximize the usability of resources while managing costs.27 Despite the apparently disparate nature of these applications, what all these areas have in common is that knowledge is abstracted from the real world and represented in a concrete, unambiguous manner in the form of facts and rules with a tractable number of possible solutions.

Building an ES starts with an HE, a KE, and a user (U). The first step is to define the problem at hand to address the needs of the user, which requires the concerted effort of the HE, KE, and U. Second, based on the problem definition, the expert and the engineer work together on the knowledge acquisition process, interacting several times along the way. The purpose is for the KE to extract relevant expert knowledge from the HE to define the knowledge base (Fig. 5.1). Extracted knowledge must be concise, unambiguous, and targeted to the required purpose of the ES. It is important to remember that the ES will serve as a computer proxy for the expert; therefore, the more relevant knowledge is extracted, the more useful the ES will be. Extracted knowledge, typically obtained from the HE in the form of statements, is converted into facts and rules and integrated into the knowledge base. Communication and understanding between the expert and the engineer are vital for this task to succeed. The expert must be able to clearly communicate the knowledge (rules of thumb) used to solve problems in a specific domain, while the engineer, besides understanding such knowledge and the needs for the ES, must correctly “translate” such knowledge into a suitable representation without any loss of heuristic knowledge.

FIGURE 5.1 Schematic view of an expert system and its interactions with the expert, knowledge engineer, and user. The knowledge engineer elicits knowledge from the expert and represents it in the knowledge base; the inference engine reasons over the knowledge base and interacts with the user through the user interface.


without any loss of heuristic knowledge. Oftentimes, the expert struggles to accurately and explicitly communicate all the knowledge required to perform specialized tasks. If we add a potential lack of understanding of the domain by the engineer, this may result in suboptimal representations of knowledge, reflected in inaccurate facts and rules that do not perform as expected.

In an ES the knowledge base is separate from the execution engine. The knowledge base contains facts and rules, and the execution engine applies the rules and reasons about the facts. Reasoning is a logical and heuristic process that takes into consideration notions such as actions to be taken under given circumstances, goals to be achieved, and causality and dependencies among rules and facts. There is no room for improvisation or generalization; the engine simply processes all the rules that apply to the presented facts and produces an answer. As mentioned previously, if the gathered knowledge is ambiguous or incomplete, the results from the ES will potentially also be ambiguous and incomplete. Therefore it is essential that the knowledge base is as complete and unambiguous as possible so that the reasoning is sound. Even though knowledge acquisition can be costly and time consuming, when done correctly, the benefit is improved decision quality in terms of fast, reliable, consistent, and efficient solutions. In addition, given the nature of the decision rules, an ES provides answers along with explanations of its decision-making process. Getting an explanation for an answer is key in many domains, such as medicine, where, for example, getting a potential diagnosis for a patient could be as important as getting an explanation as to why that diagnosis is relevant.

There are two common approaches for representing knowledge in the knowledge base: frame-based systems and production rules. First proposed by Marvin Minsky in the 1970s, frame-based systems provide a natural structure to organize knowledge in terms of frames representing concepts or stereotypical situations.28 The internal structure of a frame, that is, its slots, stores all the necessary information about characteristics, attributes, usage, expectations, and so on. This means that all kinds of definitional and descriptive information can be included in a frame. Within a frame-based system, collections of frames are organized and, when relevant, interconnected through relationships. Execution of the reasoning process on these systems matches a frame to a specific situation, using default values to fill unspecified aspects/slots of the frame.

Production rules are the most common KR method, particularly in medicine, MYCIN being the most famous example,4 and in business.26,27 In rule-based ESs, knowledge is represented by production rules. Each rule consists of an antecedent, or condition, and an action, or conclusion. The general form is as follows:

IF condition THEN action(s)

During the reasoning process, all the rules whose antecedent conditions are satisfied are triggered and the respective actions executed. The inference engine orchestrates the rule execution order and resolves conflicts between rules. This process is mostly centralized and managed by rule properties that control the order in which rules are executed. This is achieved mostly through execution parameters and conditions, rule priorities, and execution flow.
When a user enters relevant facts about the case at hand, these facts are stored in a working memory. The inference engine applies one of two possible strategies, working

either forward (data-driven) or backward (goal-oriented), chaining rules from the knowledge base against the facts in the working memory. As rule conditions are satisfied and new facts are generated, these new facts are added to the working memory. This process continues until a goal state is reached and no more rules apply to the facts in working memory. These systems provide an explanation facility to explicate how the system triggered certain actions or arrived at a given recommendation. Explanations can be as simple as rule numbers or identifiers for triggered rules, or more elaborate, human-like explanations. Either way, this approach provides explicit information about the reasoning process.

In a forward (data-driven) chaining strategy, the inference process is driven by the facts stored in the working memory. The rule engine attempts to match the antecedent (condition) of each rule in the knowledge base using the available facts. Execution order is handled by the parameters explained previously, and, if needed, conflict resolution strategies are applied. This type of reasoning strategy is normally used to solve open-ended problems such as planning.29

A backward chaining strategy starts from an assumed goal or subgoal state, that is, the THEN (conclusion) part of a rule. The execution engine looks for a rule whose conclusion matches the assumed goal; if one exists, that rule's antecedent becomes the new subgoal, and the process repeats; otherwise the engine attempts to prove other goals until a goal state can be supported by the available facts. In other words, this strategy works backward: it starts with a conclusion in a rule and tries to satisfy the conditions for that rule. If this does not work, it picks another rule conclusion and tries to satisfy its conditions with the available facts. The execution engine keeps trying until it finds such a rule, or no more rules are left to evaluate. This type of reasoning strategy is best used in diagnosis applications with a limited number of well-defined conclusions that can be checked.29
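To make the mechanics concrete, the following minimal Python sketch implements a working memory and a forward-chaining loop over three production rules. The rule names, facts, and thresholds are invented for illustration only; they are not clinical guidance and are not part of any system described in this chapter.

# Minimal sketch of a production-rule system with forward chaining.
rules = [
    # (rule name, antecedent over working memory, facts added when the rule fires)
    ("febrile", lambda wm: wm.get("temp_c", 0) >= 38.0, {"fever": True}),
    ("tachycardic", lambda wm: wm.get("heart_rate", 0) > 100, {"tachycardia": True}),
    ("possible_sepsis_alert",
     lambda wm: wm.get("fever") and wm.get("tachycardia"),
     {"alert": "evaluate for possible sepsis"}),
]

def forward_chain(working_memory):
    # Fire every rule whose antecedent holds, add its conclusions to working
    # memory, and repeat until no rule adds anything new.
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, condition, conclusions in rules:
            if name not in fired and condition(working_memory):
                working_memory.update(conclusions)
                fired.add(name)
                changed = True
    return working_memory

print(forward_chain({"temp_c": 38.6, "heart_rate": 112}))

A backward chainer would instead start from the "alert" conclusion and recursively try to establish the antecedents of the rules that could produce it, which is why that strategy suits diagnosis-style problems with a bounded set of candidate conclusions.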

5.3.2 Knowledge representation and management

The term “knowledge representation” encompasses all formalisms employed to appropriately represent, encode, and execute knowledge. The processes for eliciting, representing, and encoding knowledge are normally referred to as knowledge engineering. Ontologies and reference terminologies and vocabularies play a vital role in representing knowledge in a clear, succinct, unambiguous manner.

During the modeling process, the KE must take into account all available evidence and information and transform it into unambiguous definitions. This is followed by the identification of relevant reference terminologies and vocabularies used to further describe the semantic meaning of the knowledge being encoded. This step provides further consistency to the representation effort in that the meaning of the knowledge asset being represented is reinforced and enhanced by the meaning of the reference terminologies and vocabularies. Further, once knowledge assets are linked to other knowledge assets, relationships among existing and new concepts emerge, providing richer definitions that go beyond those initially ascribed to individual concepts. The outcome of this modeling process is either a human-readable, declarative representation or a computer-interpretable representation of the initial knowledge.30 Computer-interpretable knowledge expressed as described previously can be easily incorporated into knowledge bases and reasoned upon by inference engines. Regardless of the


final representation, care must be taken to ensure that knowledge is concise, explicit, and unambiguous. This allows for an incremental approach to expanding knowledge while preserving consistency in the definitions and relationships among connected knowledge assets. An additional step to further enrich the semantics of encoded knowledge is to link knowledge assets to clinical domain ontologies. These ontologies provide additional mechanisms for inference, consistency checking, and ongoing maintenance.31 Ontologies represent collections of formalized concepts that semantically augment knowledge assets. This is advantageous in terms of modeling, since ontologies not only allow for structuring concepts and establishing relationships, but, when reasoned upon, they also provide a solid foundation for knowledge discovery.32–34 Further, structured, clear, and explicit knowledge endorsed by reference terminologies and vocabularies fosters shareability and reusability of knowledge. Even when the representation of knowledge requires local customization (not ideal), having clear, semantically augmented definitions facilitates the customization process and reduces potential misunderstandings.

Syntax is another key aspect of KR. The notation used to represent knowledge and describe concept hierarchies and maintenance mechanisms is the means for ensuring that a KR contains all the facts needed for the engine to traverse and reason upon the meaning of the knowledge. The syntax defines the elements, the structure, and the valid configurations of those elements in order to encode valid “sentences.” There is a consensus on the need for standards for KR to address syntactic issues. However, there is a lack of general agreement on a common representation and on the mechanisms for translating knowledge encoded in one formalism into another. Some of the approved standards for clinical KR that have had some traction over the years include Arden Syntax6 and GELLO.14

Arden Syntax is part of Health Level Seven International (HL7).35 It was published by HL7 in 1999. Arden Syntax is considered a hybrid KR.6,36 It supports encoding of knowledge and definition of workflows. Arden Syntax is used to define medical logic modules (MLMs) comprising variable definitions and execution flow guided by decision logic. MLMs are self-contained and can be embedded and invoked at any point in a workflow where a decision, represented in the MLM, is needed. The main limitation of Arden Syntax, which so far has precluded the shareability of MLMs, is the “curly braces problem.” Within an MLM, calls to local data repositories are defined within curly braces. The notation for retrieving the data needed for the decision logic of the MLM is specific to the institution and system. This means that not only is the call to retrieve information tied to a specific syntax, but the data itself and its structure are also institution specific. This presents a major challenge when MLMs are shared, since sharing requires extensive recoding of the content within curly braces, to the point that often it is easier to define the MLM anew. Fig. 5.2 presents a fictitious example of an MLM for screening and notification of evidence of recent myocardial infarction. The MLM is divided into three main sections: maintenance, library, and knowledge. Note the data retrieval in an institution-specific format within the curly braces.
Another approved KR standard is GELLO: an object-oriented, platform-independent expression language for data querying, reasoning, and evaluation of logic expressions.14 In 2005 GELLO was approved as an international standard by ANSI and HL7.38 GELLO is based on the Object Constraint Language (OCL) and is declarative in nature. OCL is an Object Management Group standard that can be used with any object-oriented data model.


FIGURE 5.2 Sample Arden Syntax medical logic module (MLM) for troponin screening. The MLM is divided into three sections: maintenance, library, and knowledge. Note the reference to an institution-specific data source within the curly braces. Source: Adapted from Jenders RA. Decision rules and expressions. In: Clinical decision support. Elsevier; 2014. p. 417–34.37

maintenance:
   title: Screen for positive troponin I;;
   filename: troponin;;
   version: 1.40;;
   institution: World famous medical center;;
   author: Robert A. Jenders, MD, MS;;
   specialist: ;;
   date: 2013-04-30;;
   validation: research;;
library:
   purpose: screen for evidence of recent myocardial infarction;;
   explanation: triggered by storage of troponin result; sends message if result exceeds threshold;;
   keywords: troponin; myocardial infarction;;
   citations: ;;
knowledge:
   type: data-driven;;
   data:
      troponin_storage := event {storage of troponin};
      /* get test result */
      tp := read last {select result from test_table where test_code = 'TROPONIN-I'};
      threshold := 1.5;
      /* email for research log */
      email_dest := destination {'email', 'name' = '[email protected]'};
   ;;
   evoke: troponin_storage;;
   logic:
      if (tp is not number) then conclude false; endif;
      if tp > threshold then conclude true;
      else conclude false;
      endif;
   ;;
   action:
      write "Patient may have suffered a myocardial infarction. " ||
            "Troponin I = " || tp || " at " || time of troponin
         at email_dest;
   ;;
   urgency: 50;;
end:

GELLO captures the specific features of clinical knowledge, for example, eligibility criteria required in guidelines, and expresses them in a declarative, deterministic manner. GELLO uses the HL7 Reference Information Model (RIM) as its data model. This allows the KE to build expressions that query data from a standard source in a standard representation and logically manipulate and reason upon them. GELLO also supports calculations and other formulae, thereby facilitating close integration with other HL7 standards that use the RIM. Its generic nature allows it to be integrated into multiple applications. GELLO addresses the curly braces problem by facilitating the incorporation and use of standard vocabularies and a standard data model.

FIGURE 5.3 The same clinical knowledge for troponin as presented in Fig. 5.2, expressed in GELLO. Note the succinctness and declarative nature of GELLO, and the reference to the HL7 RIM (standard data model). GELLO expressions can be embedded into clinical workflows.

let lastTroponin : Observation = Observation->select(code = ("SNOMED-CT", "102683006")).sortedBy(effectiveTime.high).last()
let threshold : PhysicalQuantity = Factory.PhysicalQuantity("1.5", "ng/dl")
let myocardialInfarction : Boolean = if lastTroponin.value.greaterThan(threshold) then true else false endif

if myocardialInfarction then send notification else send alternative notification endif

Even though the declarative nature of GELLO is more difficult to grasp than the procedural structure of Arden Syntax, authoring tools can hide this complexity, allowing authors to create computable knowledge (Fig. 5.3).

5.3.3 Uncertainty, probabilistic reasoning, fuzzy logic

5.3.3.1 Uncertainty

In classic propositional logic, a statement is either true or false, but people often treat truth as a matter of degree. As noted in a previous section, production rules are dichotomous: when applied, the antecedent (the statement embedded in the IF part of the rule) is either true or false, with nothing in between. The inference process has a prespecified goal, and the process terminates whenever its goal is achieved. In the real world, this imposes limitations on certain application areas, because the data collected do not always fit a clear yes/no scenario, and therefore, dichotomous rules cannot be applied. Many facts can only be believed to be true or false with some degree of confidence. Hence, it is advantageous for a system to be able to reason under uncertainty and to provide ways to represent this vagueness and ambiguity in a formal, well-defined language that can produce conclusions and justify them.

5.3.3.2 Probabilistic reasoning

Probabilistic reasoning arises from the need to quantify and reason under uncertainty. It is a richer, more expressive formalism that preserves the process of deductive reasoning from statements to a logically certain conclusion and incorporates probability theory to represent and handle uncertainty. It defines a natural extension of traditional dichotomous logic, where the truth value of a proposition is extended from discrete, binary values to the continuous interval [0,1], with binary logic as a special case. A conclusion is the consequence of the premises, but unlike propositional logic, the truth value is not dichotomous but a function of the conditional probabilities of the premises. Under this notion, only a conclusion under the special case will be absolutely true, while


any other potential conclusions will have truth values based on the probabilities of the premises, and the one with the highest probability will be preferred. Probabilistic reasoning interprets these probabilities as reasonable expectation39 or as quantification of a personal belief.40

The following definitions are needed in order to understand probability theory. The Boolean connectives (1) negation, (2) conjunction, and (3) disjunction are extended into probability functions over events A and B to handle the probabilities of the premises being evaluated:

1. Negation: P(¬A) = 1 − P(A), where P(¬A) is the probability of A not occurring.
2. Conjunction: P(A ∧ B) = P(A) × P(B), where A and B are independent of each other.
3. Disjunction: P(A ∨ B) = P(A) + P(B) − P(A ∧ B), where P(A ∨ B) denotes the probability of A or B (or both) occurring.

When a proposition A can only have two truth values, true or false, the a priori or unconditional probability P(A) represents the probability of that proposition A being true.41 The probabilities of all the values of a random variable X always sum to 1:

$$\sum_{x \in \{x_1, \ldots, x_n\}} P(X = x) = 1 \qquad (5.1)$$

The conditional probability of A given the prior occurrence of B, denoted by P(A|B), is defined as the ratio of the joint occurrence (conjunction) of A and B to the probability of B:

$$P(A \mid B) = \frac{P(A \wedge B)}{P(B)} \qquad (5.2)$$

Similarly, the conditional probability of B given A is denoted by:

$$P(B \mid A) = \frac{P(A \wedge B)}{P(A)} \qquad (5.3)$$

Manipulating Eqs. (5.2) and (5.3), respectively, we obtain:

$$P(A \wedge B) = P(A \mid B)\,P(B) \qquad (5.4)$$

$$P(A \wedge B) = P(B \mid A)\,P(A) \qquad (5.5)$$

By combining Eqs. (5.4) and (5.5), we derive:

$$P(B \mid A)\,P(A) = P(A \mid B)\,P(B) \qquad (5.6)$$

and

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \qquad (5.7)$$

which is Bayes’ theorem on conditional probability. For a more in-depth description of uncertain knowledge and reasoning, see Ref. [41].
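As a quick numeric illustration of Eq. (5.7), the short Python sketch below computes the probability of a disease given a positive test result from an assumed prevalence, sensitivity, and specificity. All three input values are hypothetical and are chosen only to show the arithmetic, not to describe any real test.

# Worked numeric example of Bayes' theorem, Eq. (5.7), with hypothetical values.
prevalence  = 0.01   # P(D): prior probability of the disease (assumed)
sensitivity = 0.95   # P(T|D): probability of a positive test given disease (assumed)
specificity = 0.90   # P(not T|not D): probability of a negative test given no disease (assumed)

# Total probability of a positive test, P(T)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Posterior probability of disease given a positive test, P(D|T) = P(T|D) P(D) / P(T)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(round(p_disease_given_positive, 3))   # approximately 0.088

Even with a fairly sensitive and specific test, the low prior keeps the posterior probability modest, which is why the prior (pretest) probability matters so much in diagnostic reasoning.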


FIGURE 5.4 Conditional probability P(B|A) in a directed arc connected from A to B. A and B are parent and child nodes, respectively. Node C is not connected to either A or B and therefore is conditionally independent.


FIGURE 5.5 Simplified Bayesian network for diagnosis of liver disorder. The arcs carry the conditional probabilities (1) P(Hx viral hepatitis | Blood transfusion); (2) P(Liver disorder | Gallstones); (3) P(Liver disorder | Hx viral hepatitis); (4) P(Liver disorder | Hx alcohol abuse); (5) P(Liver disorder | Total bilirubin); (6) P(Alkaline phosphatase | Liver disorder); (7) P(Direct bilirubin | Total bilirubin); (8) P(Jaundice | Total bilirubin); (9) P(Direct bilirubin | Alkaline phosphatase). Source: Adapted from Onisko A, Druzdzel MJ, Wasyluk H. A Bayesian network model for diagnosis of liver disorders. In: Proceedings of the eleventh conference on biocybernetics and biomedical engineering. CiteseerX; 1999.

A Bayesian network is a probabilistic, directed acyclic graph (DAG) model that represents a set of variables and their conditional dependencies, where the strength of these dependencies is denoted by conditional probabilities. Nodes in the graph represent variables in the Bayesian sense, and arcs depict the conditional dependencies between parent (A) and child (B) nodes. Conditionally independent nodes (C) are not connected to other nodes (see Fig. 5.4). Formally, the structure of a DAG represents a factorization of the joint probability distribution of the considered variables.42

Bayesian networks are well suited to represent the probabilistic relationships between diseases and symptoms. Given that a Bayesian network is a complete model of a set of variables and their relationships, it can be used to compute the posterior distribution of variables based on available evidence. It applies Bayes’ theorem to perform probabilistic inferences. For example, a Bayesian network could be used to compute the probability of the presence of a disease based on the presence of certain symptoms. As shown in Fig. 5.5 (simplified from Ref. [43]), multiple factors are taken into consideration for a potential diagnosis of liver disorder. Based on available evidence, for example, symptoms and findings, objective evidence observed by physicians, and laboratory test results, the applicable conditional probabilities are calculated to obtain a potential diagnosis.
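To illustrate how such a network supports inference, the following Python sketch builds a small four-node network loosely inspired by Fig. 5.5 and computes P(liver disorder | jaundice) by enumerating the joint factorization. Every probability value below is invented for demonstration and has no clinical meaning.

# Toy Bayesian network inference by enumeration (illustrative values only).
from itertools import product

p_hep = {True: 0.05, False: 0.95}          # P(Hx viral hepatitis)
p_alc = {True: 0.10, False: 0.90}          # P(Hx alcohol abuse)
p_liver = {(True, True): 0.80, (True, False): 0.40,
           (False, True): 0.30, (False, False): 0.02}   # P(Liver disorder | hep, alc)
p_jaundice = {True: 0.70, False: 0.05}     # P(Jaundice | liver disorder)

def joint(hep, alc, liver, jaundice):
    # Joint probability from the DAG factorization:
    # P(hep) P(alc) P(liver | hep, alc) P(jaundice | liver)
    p = p_hep[hep] * p_alc[alc]
    p *= p_liver[(hep, alc)] if liver else 1 - p_liver[(hep, alc)]
    p *= p_jaundice[liver] if jaundice else 1 - p_jaundice[liver]
    return p

# Posterior P(Liver disorder | Jaundice = true), summing out the unobserved parents
num = sum(joint(h, a, True, True) for h, a in product([True, False], repeat=2))
den = sum(joint(h, a, l, True) for h, a, l in product([True, False], repeat=3))
print(round(num / den, 3))

Real networks such as the one in Ref. [43] involve many more variables, and exact enumeration quickly becomes expensive, which is one reason the computational costs discussed below matter in practice.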


Compared to other probabilistic reasoning approaches, Bayesian networks are relatively efficient, though computational costs can easily escalate for complex models. In addition to the computational expense, unknown probability values and inconsistent probability assignments can hinder the applicability of this methodology.

5.3.3.3 Fuzzy logic

Fuzzy logic, proposed by Zadeh,44 relaxes one of the fundamental underlying concepts of classic propositional logic: the strict dichotomization of truth. In propositional logic, truth values are binary, and we should be able to unequivocally determine, based on the characteristics of an object, whether it belongs to a set or not; that is, an object x is either a member of a set A or not (Eq. 5.8):

$$A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases} \qquad (5.8)$$

However powerful this concept may be, the condition defining the boundaries of a set is very rigid. Fuzzy sets extend the binary membership of a conventional set into the continuous interval [0,1], where the membership value expresses the degree of compatibility of an object with the distinctive properties or characteristics of the collection it belongs to. The membership value can range from 0 (complete exclusion) to 1 (complete membership). Thus a fuzzy set A is a set of ordered pairs defined by45:

$$A = \{(x, \mu_A(x)) : x \in X\} \qquad (5.9)$$

where x is any object of a universal set X and μA(x) is the degree of membership of the element x in A.

The following example of the resting heart rate (RHR) of adult patients at risk of developing complications illustrates these concepts. Let X be the universal set of RHR measurements for adults, where an RHR x in X could have any value in the range [0,250] beats per minute, with three fuzzy sets to classify patients: low, normal, and high. As part of this fictitious system, there are alerts to warn when a patient’s heart rate deviates from normal. All possible heart rates in the interval [0,250] are members of each fuzzy set, though with different membership values in the interval [0,1]. For example, an RHR of 70 is a member of all three sets, but the membership functions μLOW(70), μNORMAL(70), and μHIGH(70) will return very different membership values for the same heart rate (Fig. 5.6).

In fuzzy systems, membership functions can be continuous, spanning a continuous spectrum of values (Fig. 5.6), or discrete, sampling values at discrete points over the whole range of possible values (Fig. 5.7). Although any continuous function of the form A: X → [0,1] can be used as the membership function of a fuzzy set, the most commonly used are triangular, trapezoidal, S-membership, and Gaussian functions.

A triangular membership function determines the degree of membership by comparing a given value x against lower and upper bounds a, b and a modal value m, as shown in Eq. (5.10). The graphic representation of this type of function is depicted in Fig. 5.8A.


FIGURE 5.6 Membership curves for the fuzzy sets low, normal, and high. The x-axis denotes the resting heart rate for adults; the y-axis denotes the degree of membership in the given fuzzy sets at different heart rates.

FIGURE 5.7 Discrete sampling of resting heart rate values at nonuniform intervals for the normal fuzzy set.

$$A(x) = \begin{cases} 0 & \text{if } x \leq a \\ \dfrac{x-a}{m-a} & \text{if } x \in [a,m] \\ \dfrac{b-x}{b-m} & \text{if } x \in [m,b] \\ 0 & \text{if } x \geq b \end{cases} \qquad (5.10)$$


FIGURE 5.8 Typical membership functions: (A) triangular; (B) trapezoidal; (C) S-membership; and (D) Gaussian.

A trapezoidal membership function is typically used to represent fuzzy linguistic conditions such as neither so high nor so low.45 In the heart rate example, this function could be used to express that a given heart rate is neither too high nor too low for values between the threshold boundaries determined by m and n, calculated by Eq. (5.11); see Fig. 5.8B.

$$A(x) = \begin{cases} 0 & \text{if } x < a \\ \dfrac{x-a}{m-a} & \text{if } x \in [a,m] \\ 1 & \text{if } x \in [m,n] \\ \dfrac{b-x}{b-n} & \text{if } x \in [n,b] \\ 0 & \text{if } x > b \end{cases} \qquad (5.11)$$

An S-membership function has a smoother slope. In the previous example, it could be used to calculate the membership for the fuzzy set high. One typical form of the S-function is depicted in Fig. 5.8C, and the curve is calculated by Eq. (5.12):

$$A(x) = \begin{cases} 0 & \text{if } x \leq a \\ 2\left(\dfrac{x-a}{b-a}\right)^{2} & \text{if } x \in [a,m] \\ 1 - 2\left(\dfrac{x-b}{b-a}\right)^{2} & \text{if } x \in [m,b] \\ 1 & \text{if } x > b \end{cases} \qquad (5.12)$$


A Gaussian membership function is widely used with fuzzy sets. In our example, this function could be used to calculate the membership for the fuzzy set normal, with a mean value of, for example, 60. To shape the curve, we can experiment with different values of the variance; the smaller the variance, the narrower and sharper the curve around the mean value. A typical Gaussian curve is depicted in Fig. 5.8D, and it is calculated by Eq. (5.13):

$$A(x) = e^{-k(x-m)^{2}}, \quad k > 0 \qquad (5.13)$$
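The following Python sketch implements simplified versions of the triangular, S-shaped, and Gaussian membership functions from Eqs. (5.10), (5.12), and (5.13) and evaluates the resting-heart-rate example. The breakpoints and the constant k are arbitrary illustration values, not clinical thresholds.

import math

def triangular(x, a, m, b):               # Eq. (5.10)
    if x <= a or x >= b:
        return 0.0
    return (x - a) / (m - a) if x <= m else (b - x) / (b - m)

def s_membership(x, a, b):                # Eq. (5.12), with crossover point m at the midpoint
    m = (a + b) / 2.0
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    if x <= m:
        return 2 * ((x - a) / (b - a)) ** 2
    return 1 - 2 * ((x - b) / (b - a)) ** 2

def gaussian(x, m, k):                    # Eq. (5.13), k > 0
    return math.exp(-k * (x - m) ** 2)

rhr = 70                                  # resting heart rate of 70 beats per minute
print(1 - s_membership(rhr, 40, 60))      # "low": complement of an S-curve rising through 40-60
print(triangular(rhr, 45, 65, 95))        # "normal" with a triangular function, about 0.83
print(gaussian(rhr, 65, 0.002))           # "normal" again with a Gaussian function, about 0.95
print(s_membership(rhr, 90, 130))         # "high": S-curve rising through 90-130

The same heart rate receives a different degree of membership from each set, which is exactly the behavior sketched in Fig. 5.6.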

5.4 Applications

Studies have shown that appropriately implemented medical ESs can enhance health-related decisions and actions by presenting the right knowledge at the right time, and can therefore reduce medical errors and improve quality of care. A wide range of tools and interventions have been developed and implemented over the past 50 years, either as stand-alone applications or as part of integrated solutions. Key CDS interventions include diagnostic support, therapy recommendations, order sets, medication alerts, and preventive care interventions and reminders. Some institutions have also used ESs for billing purposes, for example, by recommending a plan and treatment options based on the health-care needs of the patient and the financial needs of the institution. The following sections present some examples of medical ESs.

5.4.1 Computer-assisted diagnosis

An accurate diagnosis is important because it helps a physician choose an appropriate therapy for a patient and may also reduce diagnostic errors. Medical diagnosis is often challenging and involves complicated reasoning processes, as it requires the collection of various pieces of information (e.g., signs, symptoms, medical history, and lab test results) for differential diagnosis; however, the collected information may be nonspecific, causing uncertainty. Ledley and Lusted2 proposed that certain mathematical techniques, including symbolic logic, probability, and value theory, can help our understanding of the reasoning behind medical diagnosis and the choice of an optimum treatment. Computers can help clinicians collect and process relevant information during the diagnostic process and search through a collection of possible diseases to make a more precise diagnosis and a more scientific decision about the treatment plan. In the 1970s the Leeds Abdominal Pain System was built on Bayesian probability theory and used sensitivity, specificity, and disease-prevalence data for various signs, symptoms, and lab test results to calculate the probability of each candidate diagnosis of abdominal pain.46

MYCIN, one of the most influential early ESs, uses a rule-based approach and symbolic reasoning techniques rather than numerical calculations. MYCIN, developed as a thesis project by Edward Shortliffe at Stanford University, aimed to help physicians select an antimicrobial therapy for patients with infections.4,19 Clinical knowledge of bacterial infections is represented as a set of production rules, where each rule has a certainty factor


associated with it. MYCIN is goal driven; it uses backward-chaining reasoning from a goal or conclusion to the conditions that establish it. It assesses patient-specific data entered by the physician to generate conclusions and therapeutic advice and also explains its reasoning and decisions when asked to do so. MYCIN was designed to keep its knowledge base separate from the reasoning engine, so the domain-independent portions of MYCIN, also known as EMYCIN, or “Essential MYCIN,” were used to develop other ESs. For example, PUFF was a system that could process lung function data, recommend four possible diagnoses, and generate pulmonary function reports.47 The agreement between the PUFF system and two physiologists was 92%. A review of ESs built to provide consultation in diagnosing difficult and rare pulmonary diseases can be found in Ref. [48].

Other ESs built to provide consultation in medical diagnoses across different clinical settings and domains include HELP (health evaluation through logical processing),49 INTERNIST-I,50 Quick Medical Reference,51 CADUCEUS,52 and DXplain.53 DXplain, a system developed at the Laboratory of Computer Science at the Massachusetts General Hospital, generates ranked differential diagnoses based on user input of a patient’s symptoms, laboratory results, and other clinical findings, using a modified form of Bayesian logic to derive clinical interpretations.53 It has been used mainly for medical training.
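As a simplified illustration of the certainty factors attached to MYCIN-style rules, the following Python sketch uses the classic combining function for two rules that bear on the same hypothesis. The certainty values are invented, and this is a didactic sketch rather than MYCIN's actual code.

# Simplified MYCIN-style certainty-factor combination for two rules that
# support (or refute) the same hypothesis. Values range from -1 (certainly
# false) to +1 (certainly true); the example values are invented.
def combine_cf(cf1, cf2):
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

print(combine_cf(0.6, 0.4))    # 0.76: two moderately supportive rules reinforce each other
print(combine_cf(0.6, -0.4))   # about 0.33: conflicting evidence weakens the conclusion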

5.4.2 Computer-assisted therapy

Some ESs were built to provide CDS for medical therapy. For example, MYCIN recommended antibiotics for bacterial infections.19 ONCOCIN,54 an oncology protocol management system, was designed to assist physicians in chemotherapy for cancer patients. The protocols for cancer treatment were represented and coded in ONCOCIN’s knowledge base. Physicians interacted with the “interviewer” to review and enter the patient’s data and received therapy and test recommendations generated by the “reasoner.”

An antibiotic consultant program was integrated with a hospital information system, the HELP system, at LDS Hospital in Salt Lake City.55 The program accessed the patient’s medical records and used patient- and infection-specific information provided by physicians to determine the patient’s “most likely” pathogens. The likelihood of the pathogens was estimated using data from the previous 5 years and the most recent 6 months from patients with similar characteristics. The computer displayed the five antibiotic regimens most likely to be effective against all the pathogens and suggested an appropriate antibiotic regimen (e.g., based on the infection site). The program also checked patients’ allergy history and renal function using logic rules. The evaluation of the program showed 17% greater pathogen susceptibility to the antibiotic drugs suggested by the computer program than to those chosen by physicians.

Studies have also focused on dosing of medications for patients with certain clinical conditions to optimize prescribing behavior and reduce medical errors. A group of researchers at the Brigham and Women’s Hospital built an algorithm within a computerized physician order entry system for adjusting drug dose and frequency in patients with renal insufficiency.56 The researchers created a knowledge base containing a list of medications that were renally cleared and/or nephrotoxic and the optimal adjusted dose/frequency of each medication for different levels of renal insufficiency. Compared to the control, the intervention resulted in the selection of 15% fewer inappropriate doses and 24% fewer inappropriate


dose frequencies. Other drug-dosing systems assisted clinicians in maintaining therapeutic theophylline levels57–59 or in warfarin dosing during initiation and follow-up maintenance therapy.60–63 A computer-assisted antibiotic-dose monitor designed at the LDS Hospital in Salt Lake City, Utah, in 1995 was used as a daily surveillance tool to monitor for appropriate anti-infective dosages.64 The program used computer-based patient records to screen patients in the hospital who were receiving any of the studied antibiotics. The program then calculated an optimal 24-hour dosage for the prescribed antibiotic, adjusted for the patient’s renal function test results and based on clinical guidelines. The program printed out a list of patients who were possibly receiving excessive dosages for the clinical pharmacists to review. The pharmacist then contacted patients’ physicians, when needed, to discuss the possible need for a change in the dosage.

In medicine, physicians often fail to order tests or treatments (so-called corollary orders) needed to monitor or ameliorate the effects of other tests or treatments. For example, ordering Coumadin should trigger a decision to monitor prothrombin time and the international normalized ratio. A group of researchers at the Regenstrief Institute built computer-generated reminders to suggest corollary orders to physicians while they were writing their orders, to reduce errors of omission.65 They conducted a randomized, controlled clinical trial and found that intervention physicians ordered the suggested corollary orders in 46.3% of instances when they received a reminder, compared with 21.9% compliance by control physicians. This study showed that such reminder interventions can be an efficient means of reducing reliance on memory and improving adherence to practice guidelines.

5.4.3 Medication alert systems

Adverse drug events (ADEs) account for 19% of injuries in hospitalized patients. Medication alert systems focus on identifying potential medication-related adverse events at various levels of severity.66 These alerts are easily incorporated into clinical workflows and produce timely notifications to help clinicians prescribe the most appropriate medications based on the clinical status of a patient. Alerts focus on drug–drug interactions, medical condition–drug interactions, drug-allergy checking, and dosage adjustments.

These systems monitor clinical events indicative of possible ADEs by checking on patient medical conditions and notifying physicians about them so that physicians can take additional steps to address issues. For example, a patient with renal failure may require a dosage adjustment for a medication about to be ordered that is cleared by the kidneys. These systems also monitor for potential drug–drug interactions by checking the current medication list of a patient and warning the ordering physician about potential antagonistic events when a new medication to be ordered adversely interacts with a current medication (e.g., warfarin and aspirin), and for drug–food interactions by advising the physician to notify patients of foods that must be avoided while taking a medication (e.g., statins and grapefruit). They also monitor the impact of medication orders by checking for changes in laboratory tests associated with those medications, for example, drug-induced immune hemolytic anemia or age-related liver toxicity of acetaminophen.
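A minimal Python sketch of the kind of order-entry check described above is shown below. The interaction table, drug list, and problem list are invented placeholders, not real formulary or knowledge-base content.

# Illustrative order-entry checks: drug-drug interaction and condition-based dosing.
interactions = {frozenset({"warfarin", "aspirin"}): "increased bleeding risk"}   # toy table
renally_cleared = {"gentamicin", "vancomycin"}                                   # toy list

def check_new_order(new_drug, current_meds, problem_list):
    alerts = []
    for med in current_meds:
        reason = interactions.get(frozenset({new_drug, med}))
        if reason:
            alerts.append(f"Drug-drug interaction: {new_drug} + {med} ({reason})")
    if new_drug in renally_cleared and "renal failure" in problem_list:
        alerts.append(f"Dose adjustment: {new_drug} is renally cleared and the patient has renal failure")
    return alerts

print(check_new_order("aspirin", ["warfarin", "metformin"], {"renal failure"}))

Production systems add many more checks (allergies, duplicate therapy, laboratory monitoring), but the basic pattern of matching a proposed order against the patient's current record is the same.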


Bates et al. evaluated the efficacy of computerized physician order entry, and of a team intervention targeting drug administration and dispensing, in preventing serious medication errors.67 The computerized physician order entry system provides a menu of medications from a formulary, with default doses and a range of potential doses for each medication. It also suggests relevant laboratory tests at the time of ordering and conducts some drug-allergy and drug-laboratory checking. The researchers reported a 55% decrease in nonintercepted serious medication errors and a 17% decrease in preventable ADEs.

5.4.4 Reminder systems

Patient-specific reminder systems mostly fall into four categories: health maintenance, expensive medication reminders, chronic disease care/management, and therapeutic recommendations. Reminder systems identify patients for whom upcoming (almost due) and overdue maintenance screenings and laboratory procedures need to be scheduled. They also include notifications about alternative therapeutic interventions and cheaper alternative medications available in the formulary.

These systems generate notifications addressed to primary care physicians and other health-care providers alerting them about upcoming due dates for preventive and monitoring actions and procedures that need to be scheduled, for example, a woman due for a Pap smear, or a regular lab test for a diabetic patient monitored for hemoglobin A1c. Triggered reminder rules generate notifications that are either printed on a patient encounter summary, informing both the physician and the patient of upcoming preventive actions, or presented to the physician as part of the CDS embedded in care workflows. Following an on-screen notification, physicians or other assigned health-care staff follow up on the recommendation. The system also generates letters and notifications to patients informing them of the need to schedule upcoming or overdue preventive actions.
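The core of such a system is a simple comparison between the date an item was last performed and its recommended interval, as in the Python sketch below. The items and intervals are illustrative placeholders, not clinical guidelines.

# Illustrative due-date check for patient-specific reminders.
from datetime import date, timedelta

schedules = {                       # recommended repeat intervals (illustrative only)
    "hemoglobin A1c": timedelta(days=90),
    "Pap smear": timedelta(days=3 * 365),
}

def overdue_reminders(last_done_dates, today=None):
    today = today or date.today()
    due = []
    for item, interval in schedules.items():
        last_done = last_done_dates.get(item)
        if last_done is None or today - last_done > interval:
            due.append(item)
    return due

print(overdue_reminders({"hemoglobin A1c": date(2020, 1, 15)}, today=date(2020, 6, 1)))
# -> ['hemoglobin A1c', 'Pap smear']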

5.5 Challenges

Although great progress has been made, many challenges still exist across the life cycle of ESs, from knowledge acquisition and management (including the process of designing, testing, deploying, and maintaining rules), to integration with the EHR and workflow, knowledge sharing, and performance evaluation. Clinician acceptance, financial obstacles, and impact on institutional culture are also important factors to be considered. The evidence of the effectiveness of ESs in improving CDS and the quality and efficiency of care has been mixed. A 2005 systematic review reported that CDS systems improved provider performance in 64% of the studies and patient outcomes in 13% of the studies.9 Some reviews were less optimistic about the effects of CDS, calling for more evidence to demonstrate the cost-effectiveness of these systems.68

5.5.1 Workflow integration

Workflow integration refers to processes, content, or frameworks set in place to support the workflow itself. Workflow is defined as a series of tasks in which jobs and processes can be


streamlined by the seamless incorporation of tools that aid and enhance the outcome of tasks, not as discrete, isolated elements, but as components of a whole. Workflow integration connects these tasks and determines how best to optimize each step in order to improve performance.

Collaborative tools support the creation, management, and deployment of knowledge. They provide an excellent platform for consistent, trackable, reproducible knowledge-related activities aligned with best practices and recommendations. In terms of knowledge, collaborative tools allow for modularity and specialization. Knowledge assets can be customized to address specific needs; they can be self-contained, thereby facilitating integration. There are a wide variety of workflows in clinical systems. In many cases, these workflows have knowledge assets in common that could be centrally managed in a collaboration platform and deployed into those systems as required. Alternatively, execution of those assets may be triggered at a specific point during a workflow as adjuncts to assessment, diagnosis, treatment, or prevention tasks. Those assets could be plugged in at critical points of a workflow and, based on specific conditions, trigger actions relevant to the task at hand.

As seen in the previous section, computer applications provide valuable decision support at the point of care in the form of medication, therapy, treatment, and dosage recommendations and interventions. Integration of these knowledge assets requires them to be self-contained, so that, for example, a rule advising about a medication dosage, or a medication order set containing the relevant medications to treat a disease, can be plugged into a medication ordering workflow.

In order to maximize the benefits of workflow integration activities, it is necessary to have well-defined, consistent, and reproducible processes for knowledge creation and maintenance, as well as knowledge deployment mechanisms. Having all these processes in place fosters implementation and applicability of knowledge-based interventions, particularly those requiring continuity during and after care setting transitions. Further, when workflow is integrated across applications, the benefits of each application improve. Knowledge and data are seamlessly shared between systems and applications, eliminating the need for separate, disparate applications to complete a process.

5.5.2 Clinician acceptance and alert fatigue

People who are exposed to a high volume of warnings, reminders, and recommendations, especially those that do not require escalation, may become desensitized to these CDS interventions. In medicine, although some expert and CDS systems have been shown to improve the safety and quality of patient care, the high number of alerts and reminders presented to clinicians on a daily basis can result in alert fatigue. Clinicians may pay less attention to, or ignore, both important and unimportant alerts and warnings. Ignoring clinical alerts can lead to patient harm and other unintended consequences. It is reported that clinicians override 49%–96% of medication alerts and that many overrides are justifiable. The signal-to-noise ratio reveals significant problems; providers see and address more than 100 alerts to find one that could prevent one ADE.69 One study collected 10 years of drug allergy alert data from two large academic hospitals in Boston and found that the overall override rate was 84%.70 The most common override


reason for drug allergy alerts (approximately 50% of cases) was that the patient had previously tolerated the medication. In addition, clinicians were more likely to override repeated alerts that appeared two or more times than first-time alerts (89.7% vs 77.4%, respectively). While most overrides may simply indicate low-value CDS recommendations that do not provide useful decision support, many alerts and recommendations are overridden in unsafe situations, such as in patients with histories of severe adverse drug reactions or contraindicated clinical conditions.

Because not all alerts need to be interruptive, to reduce overalerting, Paterno et al. designed and implemented a three-tiered presentation of drug–drug interaction alerts based on the level of severity.71 Level 1 alerts were the most serious, considered life threatening, and required the clinician to either cancel the current order or discontinue the preexisting order. Level 2 alerts were less serious but still required action by the clinician (e.g., canceling the drug or providing an override reason). Level 3 alerts were the least serious and comprised the largest proportion of alerts; they were presented as informational only and required no clinician action. Their study showed that, at the tiered site, 100% of the most severe alerts were accepted, compared to 34% at the nontiered site; moderately severe alerts were also more likely to be accepted at the tiered site (29% vs 10%). While the tiering approach has shown promising results in reducing overalerting generally, it has not been applied to other CDS domains.

Studies have shown that to increase clinician acceptance, it is important that a system provide actionable recommendations rather than just assessments. A system should explain its reasoning and/or provide evidence and encourage the user to record a reason when not following the advice generated by the system. When feasible and appropriate, it is also important for institutions to implement CDS surveillance systems to monitor inappropriate or abnormal alerts and recommendations. Web-based graphic dashboards have been shown to be useful for monitoring CDS volume and trends over time. Poorly performing CDS alerts and recommendations should be identified for improvement. Other areas for improvement include EHR documentation and data quality, alerting mechanisms (e.g., considering contextual information, such as whether the alert was overridden in the past by the clinician), and hospital policies and guidelines.72
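The tiering idea described above can be expressed as a small piece of decision logic, sketched below in Python. The mapping of severity levels to behaviors follows the three-tier description in the text, while the function and the returned strings are purely illustrative.

# Illustrative mapping of interaction severity tiers to alert presentation.
def present_alert(severity_level):
    if severity_level == 1:     # most serious, considered life threatening
        return "interruptive: cancel the new order or discontinue the existing order"
    if severity_level == 2:     # serious, still requires clinician action
        return "interruptive: require an action or a documented override reason"
    return "informational only: displayed without requiring any action"   # level 3

for level in (1, 2, 3):
    print(level, present_alert(level))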

5.5.3 Knowledge maintenance

Despite the ever-increasing demand for computer-interpretable knowledge,73,74 the creation, testing, and maintenance of knowledge remain manual, error-prone, and resource-intensive. It is well known that creation of knowledge requires a detailed review of potential content and its translation into executable knowledge that can be integrated into clinical systems. The nature of clinical knowledge is complex, and it normally requires definitions that involve associations to other knowledge elements and reference terminologies. This creates multiple interdependencies that are difficult to track and manage. For example, a medication-related rule can easily become inoperable if the medication is removed from a formulary or is replaced by a new medication; similarly, a rule advising about a diagnostic test will no longer trigger if the diagnostic test is superseded by a newer one, or if its reference value is updated. This problem becomes even more apparent when knowledge assets


are part of collections maintained by third-party users. In this case, reusability and maintenance require a degree of customization, thereby creating a disconnect between the source and the deployed versions. Extensive customization to create the complex “localizations” required to incorporate contextual constraints that improve the specificity of an intervention13 or to adapt to workflow and staffing requirements is problematic and difficult to handle. The eagerness of institutions to implement and deploy new knowledge rapidly disappears as they realize the long-term maintenance implications.

Therefore, prior to implementation and deployment, it is important to set up well-defined processes for the creation, documentation, and maintenance of knowledge, so that subsequent changes are trackable and validated with evidence, decisions are justified and explained, and the whole process is transparent. Well-documented processes and decisions are more efficient, avoid wasting valuable time and resources, and support the evolving nature of clinical knowledge.30

5.5.4 Standards, transferability, and interoperability

Health-care knowledge is particularly complex and driven by extensive reliance on constantly evolving decision practices. This complexity arises not only from the intricate nature of medicine, but also from fragmented processes and incompatibilities in KRs and exchange mechanisms. From a knowledge engineering perspective, many of these problems can be mitigated with formal, well-defined KR methodologies and tools. Internally, organizations use several approaches to standardize knowledge repositories, practices, and procedures. However, wide exchange across teams and institutions can only be achieved through concerted collaboration and adoption of common tools and standards. Even though standards are, and have been, used for KR, the actual structure of the knowledge, and the descriptive attributes considered important, vary widely from place to place. As a result, there is no guarantee that knowledge engineered at one institution, even if aligned with standards, will be transferable without loss of consistency to another institution. Without consistency, knowledge transferability and interoperability suffer, and dissemination is diminished.

Over the past few years, standardization efforts have intensified around interoperability and exchange. The emphasis has shifted from creating standards for KR to standards that enable knowledge sharing and exchange.17 Government initiatives such as Meaningful Use have driven efforts toward developing standards for sharable knowledge. As a result, health-care institutions have been moving toward adopting and implementing systems that are more compliant with interoperability standards. Despite these efforts, challenges remain. However, it is important to acknowledge that existing standards can be used as a lingua franca for knowledge exchange. In fact, each institution could maintain its internal representations as long as its knowledge assets align with standard reference terminologies and vocabularies to preserve the semantics. Driven by semantics, mismatched knowledge can be reconciled into a common, agreed-upon standard interchange representation. Knowledge can be leveraged by its meaning, and the syntax adjusted accordingly, allowing for sustainable translations to and from interchange standards.

Alternatives for information exchange have surfaced over the past few years. The HL7 Clinical Document Architecture, release 2, was selected by the Office of the National


Coordinator as one of the data exchange standards for Meaningful Use.75 However, exchanging whole, complex documents turned out not to be an ideal approach. As a result, HL7 Fast Healthcare Interoperability Resources (FHIR)76 emerged as a simpler approach that exposes discrete health-care data elements (e.g., patients, admissions, medications, and laboratory test results) as resources that can be accessed through URLs. FHIR provides the mechanisms for exchanging data between health-care applications. It facilitates interoperability between legacy and disparate systems, eases delivery of health information to providers, and allows integration of third-party applications into existing systems.

The biggest challenge of FHIR is that it is possible for two different applications to implement different versions of FHIR with structurally different data elements and services. This means that the systems are not interoperable; they cannot exchange information because the representations are not compatible and cannot be consumed by the other system. Further, given the flexibility of representation of FHIR, it is possible to implement partial application programming interfaces, resulting in mismatched, partial, and incompatible representations. FHIR falls short of interoperability expectations due to the lack of rigor in its representation mechanisms. To mitigate its limitations, an ontology-based analysis with formal mapping for correctness is currently under development.
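To illustrate the resource-per-URL style of access described above, the following Python sketch queries a hypothetical FHIR server for a patient's most recent troponin observation. The base URL and patient identifier are placeholders, the SNOMED CT code is the one used in Fig. 5.3, and a real deployment would also need authentication, error handling, and version negotiation.

# Hypothetical FHIR search over HTTP (placeholder endpoint and identifiers).
import requests

base_url = "https://fhir.example.org/R4"         # hypothetical FHIR endpoint
params = {
    "patient": "example-patient-id",             # placeholder patient identifier
    "code": "http://snomed.info/sct|102683006",  # troponin code used in Fig. 5.3
    "_sort": "-date",
    "_count": 1,
}
response = requests.get(f"{base_url}/Observation", params=params, timeout=10)
bundle = response.json()                         # FHIR searches return a Bundle resource
for entry in bundle.get("entry", []):
    observation = entry["resource"]
    coding = observation["code"]["coding"][0]
    value = observation.get("valueQuantity", {})
    print(coding.get("display"), value.get("value"), value.get("unit"))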

5.6 Future directions

There are multiple opportunities for ES applications in clinical domains. However, important barriers mentioned in this chapter need to be overcome before wider, more efficient, and more reliable applicability can occur. Broader incorporation of sophisticated, well-designed decision rules into clinical systems and workflows requires moving away from highly customized representations that serve only a narrow niche toward high-quality interoperable assets that can be assembled into more complex, yet manageable, structures deployable in multiple systems and institutions. An integrated approach to the implementation, deployment, and execution of knowledge assets must include sound infrastructure and policies for knowledge design, creation, and maintenance that persist throughout the life cycle of the knowledge assets.

There are three key technical aspects to the development of shared knowledge. The first is a common data schema and information model that serves as a lingua franca when institutions maintain their own data structures and internal representations. The second is reference terminologies, vocabularies, and ontologies to semantically augment knowledge assets and data elements. By semantically augmenting assets, sharing knowledge and data elements goes beyond sharing syntactically well-formed assets; interoperability and exchange become a process of sharing meaning, which not only validates the syntactic definition of assets but also provides augmented meaning to ensure that mappings and “translations” are accurate to the source. The third is a common decision logic language and decision-making frameworks that provide the standard means for a consistent representation of executable knowledge.

Knowledge should be integrated into workflows but should not be embedded in procedural code. Proper invocation mechanisms should exist to insert decision knowledge at critical points of workflow execution while, at the same time, remaining independent. Clinical knowledge should be presented in a standardized format,


both human and machine interpretable, so that system developers can produce the information in a way that front-end users can readily understand, assess, and apply. Thus, a combination of a common data model, semantically augmented knowledge, and a common decision logic language constitutes the three key elements necessary if wide dissemination of consistent, reliable decision knowledge is to be achieved. Current efforts must align with these recommendations if they are to succeed in terms of true interoperability and shareability.

The rapid development and progress of data science and AI technologies can be leveraged to understand the patterns found in data. We need to understand the input-to-output behavior of data science and AI classification algorithms. However, this is a nontrivial process that requires identifying and elucidating valid, novel, and potentially useful patterns in data. In addition, we need innovative methods to improve reasoning mechanisms under challenging circumstances, as is the case in medicine. If we aim at wider applicability of these techniques, it is vital that such algorithms are capable of explaining how and why they produce an output (e.g., a recommendation) from some input data.

Modern medical practice has emphasized EBM from well-designed research and the use of clinical guidelines to optimize clinical decision-making. However, EBM is hypothesis driven, which may be influenced by various biases and constraints. In addition, there is a lag between when the research is conducted and when the evidence is properly applied. Therefore combining explanations derived from understood patterns in data with EBM may leverage both hypothesis- and data-driven methods into a more holistic approach that benefits from both.

References
1. Kohn LT, Corrigan J, Donaldson MS. To err is human: building a safer health system, 6. Washington, DC: National Academy Press; 2000.
2. Ledley RS, Lusted LB. Reasoning foundations of medical diagnosis. Science 1959;130(3366):9–21.
3. Blois MS. Clinical judgment and computers. N Engl J Med 1980;303(4):192–7.
4. Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green CC, Cohen SN. Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system. Comput Biomed Res 1975;8(4):303–20.
5. Heathfield H. The rise and ‘fall’ of expert systems in medicine. Expert Syst 1999;16(3):183–8.
6. Hripcsak G. Writing Arden Syntax medical logic modules. Comput Biol Med 1994;24(5):331–63.
7. Jennings R. The expert system language GALEN. J Electrocardiol 1988;21:S817.
8. Vetterlein T, Mandl H, Adlassnig K-P. Fuzzy Arden Syntax: a fuzzy programming language for medicine. Artif Intell Med 2010;49(1):1–10.
9. Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 2005;330(7494):765.
10. Garg AX, Adhikari NK, McDonald H, Rosas-Arellano MP, Devereaux PJ, Beyene J, et al. Effects of computerized clinical decision support systems on practitioner performance and patient outcomes: a systematic review. JAMA 2005;293(10):1223–38.
11. Hunt DL, Haynes RB, Hanna SE, Smith K. Effects of computer-based clinical decision support systems on physician performance and patient outcomes: a systematic review. JAMA 1998;280(15):1339–46.
12. Bright TJ, Wong A, Dhurjati R, Bristow E, Bastian L, Coeytaux RR, et al. Effect of clinical decision-support systems: a systematic review. Ann Intern Med 2012;157(1):29–43.
13. Zhou L, Karipineni N, Lewis J, Maviglia SM, Fairbanks A, Hongsermeier T, et al. A study of diverse clinical decision support rule authoring environments and requirements for integration. BMC Med Inform Decis Mak 2012;12(1):128.

CHAPTER 6

Privacy-preserving collaborative deep learning methods for multiinstitutional training without sharing patient data

Ken Chang, Praveer Singh, Praneeth Vepakomma, Maarten G. Poirot, Ramesh Raskar, Daniel L. Rubin and Jayashree Kalpathy-Cramer

Abstract

Although deep learning models have great promise for clinical applications, there are numerous obstacles to training effective deep learning models. There is a need for collecting large quantities of diverse data, which often can only be achieved through multiinstitutional collaborations. One approach to multiinstitutional studies is to build large central repositories, but this is hindered by concerns about data sharing, including patient privacy, data deidentification, regulation, intellectual property, and data storage. These challenges have made centrally hosting data less practical. An alternative approach is to have the data hosted locally and have the model trained in a collaborative fashion. Depending on the collaborative learning approach, model weights, model gradients, or smashed data are shared instead of raw patient data. These approaches can also reduce the communication overhead and the need to share private patient data. In this chapter, we will review and compare the current techniques for distributing learning, handling data heterogeneity, and preserving patient privacy for multiinstitutional applications.

Keywords: Deep learning; distributed computing; data security; data privacy; federated learning; split learning; model ensembling; cyclical weight transfer

6.1 Introduction

Deep learning methods have yielded state-of-the-art results in a wide range of computer vision, speech recognition, and natural language processing tasks without the need for
domain-inspired, hand-crafted imaging features.1,2 At the core of deep learning are convolutional neural networks (CNNs), a machine learning technique that can be trained on raw image data to predict the outputs of interest. This is achieved through many layers of nonlinear transforms that are capable of learning complex patterns with a high level of abstraction.1 With the advent of more powerful graphics processing units that allow for training of large-scale neural network architectures, deep learning has become the method of choice for automation of tasks within medical imaging.3 Recent studies have shown the potential of deep learning in medical fields such as dermatology, ophthalmology, and radiology for key clinical assessments, such as diagnosis, prognosis, response to treatment, and future disease progression.4–12 Integrated into the clinic or at the bedside, these models have the potential to aid with clinical decision-making, improving the efficiency, accuracy, and reliability of patient care.

Although deep learning models have great promise for clinical applications, there are numerous obstacles to training effective deep learning models. First, there is a need for enormous quantities of annotated training data, especially for diseases with subtle or diverse phenotypes. The data requirement is also increased when the individual patient data are noisy or incomplete. While the vast majority of medical diseases have no publicly available datasets, the few that do are limited in quantity (varying from a few hundred to hundreds of thousands).13,14 Comparatively, outside of the medical field, most of the state-of-the-art neural network architectures have been trained on large-scale benchmark datasets such as ImageNet, which has millions of annotated images.2 In the scenario where publicly available datasets are scarce or nonexistent for a given medical problem, algorithm developers have to rely on their own institutional datasets. However, for rare diseases15 or when studying the effect of different modes of treatment that may be hospital specific, it may be impossible to acquire sufficient quantities of training data at a single institution. In addition, these trained models might not be generic enough to perform well on outside institution datasets. Furthermore, deep learning algorithms are prone to overfitting and are brittle when evaluated on external data.16 As such, training data need to be diverse, ideally from varying acquisition settings and patient populations. Unfortunately, data from a single institution are often limited in quantity and heterogeneity, rendering the data insufficient for training robust deep learning algorithms. In such cases, multiinstitutional patient cohorts are the only avenue toward training an effective deep learning model.

One approach to multiinstitutional studies is to build a large central repository, but this is hindered by concerns about data sharing, specifically patient privacy, data deidentification, regulation, intellectual property, and data storage. First, protecting patient privacy is of utmost importance in an increasingly digital world, as the release of sensitive patient information could be harmful. Recently, studies have shown that the barrier to reidentification is quite low, requiring just a few clinical variables or a single scan, emphasizing the importance of privacy preservation.17,18 Second, it is difficult to ensure rigorous patient deidentification, and the potential of accidental data leakage is not negligible. Also, data are a valuable resource, and many hospitals prefer not to publicly share data to protect their own institutional interests. Lastly, patient data are growing in size with the increasing resolution and number of imaging modalities. As such, it would be cumbersome to commission the substantial data storage required to centrally host data. These concerns have made centrally hosting data both expensive and challenging. An alternative approach is to have the data be locally hosted and have the model be trained in a distributed fashion. Comparatively, the model is much smaller
than patient data, so the communication overhead is drastically reduced. Under a distributed learning paradigm, each institution will install a software application that links the different institutions together, allowing for collaboration and distributed computation. The result is a model that ideally performs as well as if the data had been shared, while still preserving patient privacy and protecting institutional interests. In addition, the reduced requirements for storage and deidentification mean reduced cost of collaboration and increased incentive for participation in multiinstitutional studies. However, distributed learning is not as simple as it sounds. Rather, distributed learning is an umbrella term that describes a variety of approaches. New considerations have to be made when deciding which approach to use. Specifically, there are several components that can possibly be shared in distributed training: model weights, gradients, and smashed data (the outputs of intermediate layers). In addition, each method may differ in its performance at convergence, its degree of and approach to protecting privacy, and its communication requirements, all of which will be discussed in this chapter.

6.2 Variants of distributed learning

6.2.1 Model ensembling

FIGURE 6.1 The variants of distributed learning include (A) model ensembling, (B) cyclical weight transfer, (C) federated learning, and (D) split learning.

The simplest paradigm for distributed learning is model ensembling, or having each institution train a model on their respective data and ensembling the outputs of the resulting single-institution models (Fig. 6.1).19 Indeed, model ensembling has been shown to be effective when ensembling high-performing models, especially if the outputs of the individual models are decorrelated.20 The advantage of this approach is that it is completely asynchronous—that is, training of the model at one institution does not depend on the training of the model at another institution. However, in a distributed setting, a single institution (such as a small community hospital) may not have enough data to train a high-performing model. In such cases, ensembling will not result in performance that is comparable to that of a model trained on centrally hosted data.19 In addition, if there is substantial heterogeneity between the datasets at the different institutions due to acquisition and protocol differences, patient demographic differences, and disease prevalence differences, merely ensembling the models may not account for it.
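To make the ensembling step concrete, the sketch below averages the class probabilities produced by independently trained single-institution models at inference time. It is a minimal illustration in PyTorch under the assumption that each institution's model has already been trained and loaded; the function name and calling convention are ours, not part of any cited implementation.

```python
import torch

def ensemble_predict(models, x):
    """Average the softmax outputs of independently trained single-institution models.

    models: list of torch.nn.Module in eval mode, one per institution.
    x: a batch of inputs, e.g., images of shape (N, C, H, W).
    Returns averaged class probabilities of shape (N, num_classes).
    """
    with torch.no_grad():
        probs = [torch.softmax(model(x), dim=1) for model in models]
    return torch.stack(probs, dim=0).mean(dim=0)
```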

6.2.2 Cyclical weight transfer

Another paradigm of distributed learning is cyclical weight transfer. This approach involves training a single model at one institution for a fixed number of iterations, followed by transferring the weights to the next institution. This process is repeated in a cyclical fashion across all institutions such that the model has seen all the data (multiple times) and model convergence is reached (Fig. 6.1). The performance of cyclical weight transfer improves with increasing weight transfer frequency, that is, transferring the weights to the next institution after only a few iterations of training at each institution.19 Empirically, this approach has been shown to be capable of reaching the performance of a model trained on centrally hosted data, even when no single institution has enough data to train an effective model.19 However, this comes with an increased communication cost as weights are being transferred more frequently. The main pitfall of cyclical weight transfer is that at any given point in time the model is only training at a single institution, while the computation resources at all other institutions wait idly. One way to mitigate idle computational resources is to run multiple, staggered instances of cyclical weight transfer and then to ensemble the resulting models. Another possibility is to train multiple teacher networks at different institutions and have one student network that cyclically distills common knowledge from all the different teachers.21
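A minimal single-process simulation of the basic transfer loop might look like the sketch below; the loaders, optimizer factory, and iteration counts are hypothetical placeholders, and in a real deployment the weights would be serialized and transmitted between sites rather than passed in memory.

```python
import itertools

def cyclical_weight_transfer(model, institution_loaders, make_optimizer,
                             loss_fn, iters_per_site=50, cycles=20):
    """Single-process simulation of cyclical weight transfer.

    institution_loaders: one DataLoader per institution; the data are never pooled.
    make_optimizer: callable returning a fresh optimizer for model.parameters().
    The same weights travel from site to site; only a few iterations of training
    happen at each stop, the regime reported to converge best.
    """
    for _ in range(cycles):
        for loader in institution_loaders:
            optimizer = make_optimizer(model.parameters())
            model.train()
            for x, y in itertools.islice(iter(loader), iters_per_site):
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                optimizer.step()
    return model
```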

6.2.3 Federated learning

Federated learning is a variant of distributed learning in which training among institutions is orchestrated by a central server.22 Under this paradigm, models are trained locally at each institution and either gradient updates (federated stochastic gradient descent) or model weights (federated averaging) are sent to the central server.23,24 The model in the central server is then updated, and the weights are sent back to the institutions to update their local copies of the model (Fig. 6.1). A key challenge with federated stochastic gradient descent is waiting for synchronous updates from each institution. Specifically, the model weights in the central server can only be updated and sent back to the institutions after gradient updates are received from all institutions.25 Given that each institution is likely to have different compute and communication infrastructure, this process is rate limited by the slowest institution. A workaround is to perform asynchronous stochastic gradient descent, in which each institution asynchronously grabs the most up-to-date model weights from the central server, computes gradients of the loss, and then sends the gradients back to the central server.26 The downside of asynchrony is that while a specific institution is calculating the gradient, the
model weights in the central server may be updated by another institution, resulting in the gradients from the specific institution being calculated with outdated model weights. These gradients are termed stale, and they result in convergence with worse performance.25 A compromise between fully synchronous and fully asynchronous approaches is to have partial synchrony, in which the central server waits for updates from institutions until a certain point, after which the updates from straggler institutions are discarded.25 An alternative approach is federated averaging, in which local model weights are sent to the central server after a specified number of training iterations and averaged. The averaged model weights are then sent back to the local institutions to update the local copy of the model.24 The advantage of this approach is that communication with the central server is only performed after a specified number of training iterations (which consists of many gradient updates), which can be more communication efficient, depending on the frequency of averaging. However, federated averaging faces the same synchrony challenge as federated stochastic gradient descent in terms of being rate limited by the slowest institution. Recently, federated averaging has been shown to be capable of achieving near centrally hosted performance under a simulated setting for the task of brain tumor segmentation.27
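The aggregation step of federated averaging can be sketched as follows; the per-site training routine and the weighting scheme (uniform here, or sample-count-based weights supplied by the caller) are assumptions for illustration, not a reference implementation of any particular framework.

```python
import copy

def federated_averaging_round(global_model, sites, local_train_fn, weights=None):
    """One synchronous round of federated averaging.

    sites: handles to the participating institutions (e.g., their local DataLoaders);
    raw data never leaves a site.  local_train_fn(model, site) trains a local copy
    for a fixed number of iterations and returns its state_dict.
    weights: optional per-site weights, e.g., proportional to local sample counts.
    """
    local_states = [local_train_fn(copy.deepcopy(global_model), site) for site in sites]
    if weights is None:
        weights = [1.0 / len(local_states)] * len(local_states)

    reference = global_model.state_dict()
    averaged = {}
    for key, ref in reference.items():
        total = sum(w * state[key].float() for w, state in zip(weights, local_states))
        averaged[key] = total.to(ref.dtype)  # preserve dtype of integer buffers
    global_model.load_state_dict(averaged)
    return global_model
```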

6.2.4 Split learning

Split learning is based on the idea that the layers of a neural network can be divided piecewise at specific layers (termed cut layers) between institutions and the central server.22,28,29 Raw patient data are never shared; rather, the outputs of the cut layers (termed smashed data) are shared during forward propagation and the gradients of the cut layers during backpropagation (Fig. 6.1).22 Although there are many possible configurations, the most relevant one for medical applications is U-shaped (boomerang) split learning, designed for a scenario in which both input data and labels cannot leave the institution.29 In this paradigm, each institution has its own beginning and end layers of a neural network. The intermediate layers are shared between all institutions. At each iteration of training, the image is fed through the local beginning layers, then through the intermediate layers on the central server, and lastly through the local end layers. The loss and gradient are then calculated, and the weights are updated through backpropagation, this time moving through all the layers in reverse. The main advantage of split learning is the ability to defer a portion of the computation to the central server.29,30 This decreases the computational resources needed at each institution, which would benefit institutions with limited resources, such as smaller community hospitals. Also, depending on the dataset size contributed per client, the number of clients, and the model size, the communication requirements of split learning can be more favorable than those of federated averaging.29,31 A recent study has shown the potential of split learning for achieving centrally hosted performance for health-care applications.32
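The sketch below illustrates the U-shaped split in a single process, with hypothetical layer sizes; in practice the head and tail would run at the institution and the body at the central server, with only the cut-layer activations and gradients exchanged over the network.

```python
import torch
import torch.nn as nn

class UShapedSplitModel(nn.Module):
    """Single-process sketch of U-shaped (boomerang) split learning.

    head and tail stay at the institution, so raw images and labels never leave it;
    body represents the intermediate layers hosted on the central server.  In a real
    deployment only the cut-layer activations ("smashed data") and their gradients
    cross the network; running everything in one process lets autograd stand in for
    that exchange.
    """
    def __init__(self, head, body, tail):
        super().__init__()
        self.head, self.body, self.tail = head, body, tail

    def forward(self, x):
        smashed = self.head(x)          # institution-side beginning layers
        features = self.body(smashed)   # server-side intermediate layers
        return self.tail(features)      # institution-side end layers

# Hypothetical layer sizes, purely for illustration.
model = UShapedSplitModel(
    head=nn.Sequential(nn.Linear(256, 128), nn.ReLU()),
    body=nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
    tail=nn.Linear(128, 2),
)
logits = model(torch.randn(4, 256))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()   # gradients flow tail -> body -> head, mirroring the protocol
```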

FIGURE 6.2 Various types of heterogeneity can exist in real patient data, such as (A) imbalanced labels or patient characteristics, (B) data size heterogeneity, and (C) differences in data acquisition.

6.3 Handling data heterogeneity

One critical hurdle that prevents the deployment of deep learning models in the clinical work environment is their relatively poor generalizability across institutional differences,
such as patient demographics, disease prevalence, scanners, and acquisition settings. A variety of recent deep learning studies have shown poor generalizability of deep learning models when applied to data from different institutions than the one they were trained on.33,34 Furthermore, the optimal method of distributing the process of training medical deep learning models across heterogeneous institutions has not yet been adequately studied. Indeed, many of the medical deep learning studies have been conducted on independent and identically distributed (IID) data.19,27 In this scenario the institutions have no intrainstitution correlation, and the data across institutions are identically distributed.35 More work needs to be done on dealing with dataset skew (non-IID data) across institutions, specifically when there is (1) quantity skew (e.g., a large academic hospital has significantly more data than a small community hospital), (2) feature distribution skew (e.g., one hospital uses one scanner vendor while another hospital uses a different scanner vendor), (3) label distribution skew (e.g., obesity is much more prevalent in North America than in Asia), (4) concept shift—same label, different features (e.g., eczema looks different on light vs dark skin), and (5) concept shift—same features, different labels (e.g., physicians in North America may be more conservative in calling a certain disease than physicians in Asia due to higher rates of litigation for unnecessary treatment) (Fig. 6.2).22,35 In a real-world scenario the data across institutions will contain a mixture of skew types, which makes the problem even more challenging.

One approach to dealing with data heterogeneity is to optimize training of distributed models in non-IID situations. For example, the cost function of the neural network can be modified to deal with class imbalance or data size heterogeneity across institutions.36 One hurdle is the high prevalence of Batch Normalization (BatchNorm) layers in modern neural network architectures.37–39 BatchNorm layers stabilize neural networks by channelwise normalization of intermediate inputs with the mean and standard deviation of each mini-batch, which mitigates the divergent effects of large gradient updates and smooths the optimization landscape.40,41 At inference time an estimate of the global mean and standard deviation is used. This can be problematic in a non-IID setting with federated learning because the training and validation distributions differ, resulting in differing normalization during training and validation and, thus, lower validation performance.35 Hsieh et al. provide evidence that much of the loss in performance due to BatchNorm can be partially recovered by replacing the BatchNorm layers with Group Normalization (GroupNorm)
layers.35 GroupNorm layers normalize by group, which is defined as a prespecified number of adjacent channels for each individual input (as opposed to normalizing on a mini-batch basis).42 Another hurdle is the prevalent use of momentum in neural network optimizers, which improves the convergence of networks.43,44 However, it is unclear how to incorporate momentum into distributed learning.22 Work by Yu et al. demonstrates that letting each institution have its own momentum buffer, followed by periodic global averaging of the buffers, improves the accuracy of the final model compared to resetting the buffers to 0 during each round of federated averaging.45 One critical barrier to optimizing distributed training is dealing with catastrophic forgetting, a phenomenon in which sequential training of a model on a new task results in "forgetting" of previously learned knowledge for previous tasks.46 Although the task is the same across all institutions in most health-care applications, if the dataset is non-IID across institutions, it is possible that the model may "forget" what it learned at other institutions when training at a given institution. This is of particular concern in cyclical weight transfer, in which training occurs at each institution in sequence.19 Indeed, overall model performance decreases with decreasing frequency of weight transfer.19 Catastrophic forgetting is also a concern in the context of synchronous distributed learning (such as federated averaging) if the gradient updates at one institution are antiparallel to those from another institution. Some proposed approaches to dealing with catastrophic forgetting are to slow down learning on weights that are important for other institutions when updating model weights for a given institution, to mask trainable weights differentially at each institution, or to update the model weights orthogonally for each institution.47–49 Another approach to dealing with data heterogeneity is to train "personalized" models for each institution as opposed to a single global model for all institutions, also known as domain adaptation.22 These "personalized" models can be adapted from a global model that performs reasonably well among all institutions. For example, one strategy could be to have common weights for the convolutional layers of the neural network but institution-specific BatchNorm layers.50 Another strategy would be to perform unsupervised domain adaptation of the global model to a specific institution through adversarial training.51 Zhao et al. give upper and lower bounds on the conditions required to reduce model error on a target domain when adapting from a source domain.52 The bounds are based on Jensen–Shannon divergences between the label distributions in the source and target domains, as well as between the intermediate representations learned by the deep learning network over the source and target domains. Once these "personalized" models are trained, there will be several models that can be used for inference at new institutions. Model selection strategies, based on the similarity of the data distribution of the new institution to the data distributions used to train each of the "personalized" models, can be used to select the optimal model.53,54 Alternatively, datasets at institutions can be augmented to make the data distribution more IID-like.
For example, in the presence of imbalance in the distribution of class or patient characteristics, data from the minority class or characteristic can be augmented via synthetic oversampling.55 If one institution has less data than another institution, the institution with less data can augment its data with geometric transforms, mix-up, or generative adversarial networks (GANs), assuming that such augmentation does not cause a shift in the institution's overall distribution.56–58 If scanner types or image sequences differ
across institutions, data at each institution can be augmented for acquisition diversity using supervised approaches.59–61 In summary, these sources of heterogeneity present a critical hurdle to the deployment of distributed deep learning methods. Importantly, these challenges are not unique to distributed machine learning, but are relevant for multiinstitutional machine learning as a whole.
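As a concrete illustration of the BatchNorm issue raised earlier in this section, the following sketch swaps BatchNorm2d layers for GroupNorm layers throughout a model, in the spirit of the remedy reported by Hsieh et al.; the number of groups is an illustrative hyperparameter and must divide each replaced layer's channel count.

```python
import torch.nn as nn

def replace_batchnorm_with_groupnorm(module, num_groups=8):
    """Recursively swap BatchNorm2d layers for GroupNorm layers, in place.

    GroupNorm normalizes over groups of channels within each sample, so it does not
    depend on mini-batch statistics that differ across non-IID institutions.
    num_groups must divide the channel count of every replaced layer.
    """
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            replace_batchnorm_with_groupnorm(child, num_groups)
    return module
```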

6.4 Protecting patient privacy

One of the key motivations of distributed deep learning is the protection of patient privacy. Most institutions require researchers to deidentify patient data before they are used for training. This removes obvious patient identifiers such as name, medical record number, date of birth, and date of hospital visit. It is important to note that patient information is still embedded in the clinical variables, lab tests, and medical imaging. Distributed learning provides a method of training deep learning models without sharing raw patient data. However, this is not equivalent to full protection of patient privacy, as component data are still being shared: model weights (model ensembling, cyclical weight transfer, federated averaging), gradients (federated stochastic gradient descent), and smashed data (split learning). A tech-savvy attacker might infer sensitive information about the training data or, in the worst-case scenario, reconstruct the training data itself from the shared component data.62–64 As such, additional protections need to be put into place to ensure privacy among participating institutions.

One framework for such protection is differential privacy (DP). At the core of DP is the concept of a privacy budget, which bounds the maximum increase in the risk to an individual's privacy.65 Alternatively, the privacy budget can be viewed as how much of an individual's privacy the neural network is allowed to use for training. In practice, DP optimization involves clipping the gradient followed by the addition of Gaussian noise at each training step. Training is discontinued once the privacy budget is exhausted, regardless of whether the desired level of performance is reached.66 Utilization of DP comes with an important trade-off—there is an inverse relationship between the stringency of privacy protection and model performance.30 That is, the more stringent the protection, the lower the model performance. Recently, DP has been applied to train models for health-care applications.67,68 Furthermore, the trade-off between performance and privacy protection has been demonstrated for brain tumor segmentation.69 DP can also be used to train GANs that can subsequently be shared for model training.70

Homomorphic encryption is an approach that allows mathematical operations to be performed directly on encrypted data (ciphertext).22 The allowed operations include addition and multiplication, but other operations (such as activation functions) can be approximated using higher degree polynomials, Chebyshev polynomials, and Taylor series.30 Patient data or smashed data can be homomorphically encrypted before being sent to the central server for model training or inference.22,71,72 Alternatively, model weights can be homomorphically encrypted before being sent to the central server for aggregation.22 The major drawback of homomorphic encryption is the need for specialized hardware and extensive computational resources, limiting the scalability of the method.30,71 As such, much of the work on homomorphic encryption has been focused on shallow architectures that do not represent the deep architectures used for modern health-care applications.30,71,72
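A simplified sketch of the DP training step described above (per-example gradient clipping followed by Gaussian noise) is shown below; it omits the accounting that tracks the cumulative privacy budget, and the learning rate, clipping norm, and noise multiplier are illustrative values only.

```python
import torch

def dp_sgd_step(model, batch, loss_fn, lr=0.01, clip_norm=1.0, noise_multiplier=1.1):
    """One simplified differentially private SGD step.

    Each example's gradient is clipped to clip_norm, the clipped gradients are summed,
    Gaussian noise scaled by noise_multiplier * clip_norm is added, and the noisy sum
    is averaged into an update.  Privacy-budget accounting is omitted for brevity.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    xs, ys = batch

    for x, y in zip(xs, ys):                         # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-12))  # clip to clip_norm
        for s, g in zip(summed, grads):
            s.add_(scale * g)

    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(-(lr / len(xs)) * (s + noise))
```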

One concern that is specific to split learning is the correlation between raw data and smashed data, which can result in leakage of information. To reduce this leakage, a variant of split learning, called NoPeek, utilizes a decorrelation approach based on the distance correlation between raw and smashed data as part of the loss function during optimization.22,64 This approach has been shown to protect against information leakage while maintaining high model performance.64 This is especially useful when the cut layer is very early in the neural network, where the correlation between raw and smashed data is high.64 The exact trade-off between the use of NoPeek and model performance in a variety of medical deep learning scenarios is still under investigation.
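A sketch of the distance-correlation penalty underlying this idea is shown below; the flattening of the inputs, the small numerical constant, and the weighting of the penalty against the task loss are our assumptions for illustration, not the exact NoPeek formulation.

```python
import torch

def distance_correlation(x, z, eps=1e-12):
    """Squared sample distance correlation between raw inputs x and smashed data z.

    Both tensors are flattened to (batch, features).  Adding this term to the task
    loss penalizes cut-layer activations that retain recoverable information about x.
    """
    def centered_distances(a):
        a = a.flatten(start_dim=1)
        d = torch.cdist(a, a)
        return d - d.mean(dim=0, keepdim=True) - d.mean(dim=1, keepdim=True) + d.mean()

    A, B = centered_distances(x), centered_distances(z)
    dcov2 = (A * B).mean()
    dvar_x, dvar_z = (A * A).mean(), (B * B).mean()
    return dcov2 / torch.sqrt(dvar_x * dvar_z + eps)

# Combined objective (alpha weights privacy against accuracy):
# loss = task_loss + alpha * distance_correlation(images, smashed)
```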

6.5 Publicly available software

Given the high interest in distributed learning techniques, there is emerging software for federated learning, such as TensorFlow Federated, Horovod, and NVIDIA Clara. PySyft offers tools for federated learning, split learning, and DP.73 As distributed learning techniques are evaluated and refined in real-world use cases, these software tools will continue to evolve and improve.

6.6 Conclusion

In this chapter we review various methods of distributed training of neural networks without sharing raw data. Each method provides different advantages and disadvantages that warrant further study in real-world health-care use cases. A large hurdle to such validation is dealing with the presence of data heterogeneity within and across institutions, which presents challenges in optimization, generalizability, and catastrophic forgetting. This hurdle is compounded by the need for further protection of patient privacy, which induces trade-offs in performance and computational complexity. Studies on the synergy between methods of distributing training, handling data heterogeneity, and protecting patient privacy provide an avenue for impactful future work.

References 1. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521(7553):436 44. Available from: https://doi.org/ 10.1038/nature14539. 2. Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115(3):211 52. Available from: https://doi.org/10.1007/s11263-015-0816-y. 3. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25 (1):44 56. Available from: https://doi.org/10.1038/s41591-018-0300-7. 4. Rajpurkar P, Irvin J, Ball RL, et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 2018;15(11): e1002686. Available from: https:// doi.org/10.1371/journal.pmed.1002686. 5. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542(7639):115 18. Available from: https://doi.org/10.1038/nature21056. 6. Chang K, Beers AL, Bai HX, et al. Automatic assessment of glioma burden: a deep learning algorithm for fully automated volumetric and bi-dimensional measurement. Neuro Oncol 2019;21(11):1412 22. Available from: https://doi.org/10.1093/neuonc/noz106.

7. Arcadu F, Benmansour F, Maunz A, Willis J, Haskova Z, Prunotto M. Deep learning algorithm predicts diabetic retinopathy progression in individual patients. NPJ Digit Med 2019;2. Available from: https://doi.org/ 10.1038/s41746-019-0172-3. 8. Lu MT, Ivanov A, Mayrhofer T, Hosny A, Aerts HJWL, Hoffmann U. Deep Learning to Assess Long-term Mortality From Chest Radiographs. JAMA Netw Open 2019;2(7). Available from: https://doi.org/10.1001/ jamanetworkopen.2019.7416. e197416. 9. Beers A, Brown J, Chang K, et al. DeepNeuro: an open-source deep learning toolbox for neuroimaging. Neuroinformatics 2020;1 14. Available from: https://doi.org/10.1007/s12021-020-09477-5. 10. Winzeck S, Hakim A, McKinley R, et al. ISLES 2016 and 2017-benchmarking ischemic stroke lesion outcome prediction based on multispectral MRI. Front Neurol 2018;9:679. Available from: https://doi.org/10.3389/ fneur.2018.00679. 11. Bakas S, Reyes M, Jakab A, et al. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. ,http://arxiv.org/abs/1811.02629.; November 2018 [accessed 15.12.19]. 12. Li MD, Chang K, Bearce B, et al. Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging. npj Dig Med 2020;3(1):1 9. Available from: https://doi.org/10.1038/ s41746-020-0255-1. 13. Menze BH, Jakab A, Bauer S, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans Med Imaging 2015;34(10):1993 2024. Available from: https://doi.org/10.1109/TMI.2014.2377694. 14. Irvin J, Rajpurkar P, Ko M, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. ,http://arxiv.org/abs/1901.07031.; January 2019 [accessed 30.10.19]. 15. Gurovich Y, Hanani Y, Bar O, et al. Identifying facial phenotypes of genetic disorders using deep learning. Nat Med 2019;25(1):60 4. Available from: https://doi.org/10.1038/s41591-018-0279-0. 16. Chang K, Beers AL, Brink L, et al. Multi-institutional assessment and crowdsourcing evaluation of deep learning for automated classification of breast density. J Am Coll Radiol 2020. Available from: https://doi.org/ 10.1016/j.jacr.2020.05.015. 17. Rocher L, Hendrickx JM, de Montjoye Y-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nat Commun 2019;10(1):3069. Available from: https://doi.org/10.1038/s41467-019-10933-3. 18. Schwarz CG, Kremers WK, Therneau TM, et al. Identification of anonymous MRI research participants with face-recognition software. N Engl J Med 2019;381(17):1684 6. Available from: https://doi.org/10.1056/ NEJMc1908881. 19. Chang K, Balachandar N, Lam C, et al. Distributed deep learning networks among institutions for medical imaging. J Am Med Inf Assoc 2018;5(8):945 54. Available from: https://doi.org/10.1093/jamia/ocy017. 20. Pan I, Larson D. Improving automated pediatric bone age estimation using ensembles of models from the 2017 RSNA machine learning challenge. Radiol AI 2019;1(6): e190053. Available from: https://doi.org/ 10.1148/ryai.2019190053. 21. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. ,http://arxiv.org/abs/1503.02531.; March 2015 [accessed 24.01.20]. 22. Kairouz P, McMahan HB, Avent B, et al. Advances and open problems in federated learning. ,http://arxiv.org/ abs/1912.04977.; December 2019 [accessed 24.12.19]. 23. Shokri R, Shmatikov V. Privacy-preserving deep learning. 
In: 2015 53rd annual allerton conference on communication, control, and computing, Allerton 2015; 2016. Available from: https://doi.org/10.1109/ALLERTON.2015.7447103. 24. Brendan McMahan H, Moore E, Ramage D, Hampson S, Agu¨era y Arcas B. Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th international conference on artificial intelligence and statistics, AISTATS 2017; 2017. 25. Chen J, Pan X, Monga R, Bengio S, Jozefowicz R. Revisiting distributed synchronous SGD. ,http://arxiv.org/ abs/1604.00981.; April 2016 [accessed 24.12.19]. 26. Dean J, Corrado GS, Monga R, et al. Large scale distributed deep networks. In: NIPS 2012 neural Inf Process Syst; 2012. p. 1 11. Available from: https://doi.org/10.1109/ICDAR.2011.95. 27. Sheller MJ, Reina GA, Edwards B, Martin J, Bakas S. Multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation. Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), 11383 LNCS. Springer Verlag; 2019. p. 92 104. Available from: https://doi.org/10.1007/978-3-030-11723-8_9.

28. Gupta O, Raskar R. Distributed learning of deep neural network over multiple agents. J Netw Comput Appl 2018;116:1 8. Available from: https://doi.org/10.1016/j.jnca.2018.05.003. 29. Vepakomma P, Gupta O, Swedish T, Raskar R. Split learning for health: distributed deep learning without sharing raw patient data. ,http://arxiv.org/abs/1812.00564.. December 2018 [accessed 22.07.19]. 30. Vepakomma P, Swedish T, Raskar R, Gupta O, Dubey A. No peek: a survey of private distributed deep learning. ,http://arxiv.org/abs/1812.03288.; December 2018 [accessed 01.01.20]. 31. Singh A, Vepakomma P, Gupta O, Raskar R. Detailed comparison of communication efficiency of split learning and federated learning. ,http://arxiv.org/abs/1909.09145.; September 2019. [accessed 03.01.20]. 32. Poirot MG, Vepakomma P, Chang K, Kalpathy-Cramer J, Gupta R, Raskar R. Split learning for collaborative deep learning in healthcare. ,http://arxiv.org/abs/1912.12115.; December 2019 [accessed 31.12.19]. 33. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med 2018;15(11): e1002683. Available from: https://doi.org/10.1371/journal.pmed.1002683. 34. Albadawy EA, Saha A, Mazurowski MA. Deep learning for segmentation of brain tumors: impact of crossinstitutional training and testing: Impact. Med Phys 2018;45(3):1150 8. Available from: https://doi.org/ 10.1002/mp.12752. 35. Hsieh K, Phanishayee A, Mutlu O, Gibbons PB. The non-IID data quagmire of decentralized machine learning. ,http://arxiv.org/abs/1910.00189.; September 2019 [accessed 27.12.19]. 36. Balachandar N, Chang K, Kalpathy-Cramer J, Rubin DL. Accounting for data variability in multi-institutional distributed deep learning for medical imaging. J Am Med Informatics Assoc 2020;27(5):700 8. Available from: https://doi.org/10.1093/jamia/ocaa017. 37. Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-ResNet and the impact of residual connections on learning. ,http://arxiv.org/abs/1602.07261.; February 2016 [accessed 12.08.18]. 38. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2016. p. 770 8. Available from: https://doi.org/10.1109/ CVPR.2016.90. 39. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. 2015. ,http://proceedings.mlr.press/v37/ioffe15.pdf. [accessed 12.04.17]. 40. Bjorck J, Gomes C, Selman B, Weinberger KQ. Understanding batch normalization. In: Advances in neural information processing systems. 2018. 41. Santurkar S, Tsipras D, Ilyas A, Madry A. How does batch normalization help optimization? In: Advances in neural information processing systems. 2018. 42. Wu Y, He K. Group normalization. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics); 2018. Available from: https://doi.org/10.1007/978-3-030-01261-8_1. 43. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2012;60:1 9. Available from: https://doi.org/10.1016/j.protcy.2014.09.007. 44. Sutskever I, Martens J, Dahl G, Hinton G. On the importance of initialization and momentum in deep learning. In: 30th international conference on machine learning, ICML 2013; 2013. 45. Yu H, Jin R, Yang S. 
On the linear speedup analysis of communication efficient momentum SGD for distributed nonconvex optimization. ,http://arxiv.org/abs/1905.03817.. May 2019 [accessed 30.12.19]. 46. Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. In: 2nd international conference on learning representations, ICLR 2014 conference track proceedings; 2014. 47. Kirkpatrick J, Pascanu R, Rabinowitz N, et al. Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA 2017;114. Available from: https://doi.org/10.1073/pnas.1611835114. 48. Zeng G, Chen Y, Cui B, Yu S. Continual learning of context-dependent processing in neural networks. Nat Mach Intell. 1, 2019. Available from: https://doi.org/10.1038/s42256-019-0080-x. 49. Mallya A, Davis D, Lazebnik S. Piggyback: adapting a single network to multiple tasks by learning to mask weights. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). 2018. Available from: https://doi.org/10.1007/978-3-030-01225-0_5. 50. Karani N, Chaitanya K, Baumgartner C, Konukoglu E. A lifelong learning approach to brain MR segmentation across scanners and protocols. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). 2018. Available from: https://doi.org/10.1007/978-3-030-00928-1_54.

51. Kamnitsas K, Baumgartner C, Ledig C, et al. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). 2017. Available from: https://doi.org/10.1007/978-3-319-59050-9_47. 52. Zhao H, des Combes RT, Zhang K, Gordon GJ. On learning invariant representation for domain adaptation. ,http://arxiv.org/abs/1901.09453.; January 2019 [accessed 21.01.20]. 53. Sharma V, Vepakomma P, Swedish T, Chang K, Kalpathy-Cramer J, Raskar R. ExpertMatcher: automating ML model selection for users in resource constrained countries. ,http://arxiv.org/abs/1910.02312. [accessed 09.02.20]. 54. Sharma V, Vepakomma P, Swedish T, Chang K, Kalpathy-Cramer J, Raskar R. ExpertMatcher: automating ML model selection for clients using hidden representations. ,http://arxiv.org/abs/1910.03731.; October 2019 [accessed 09.02.20]. 55. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002;16(1):321 57. Available from: https://doi.org/10.1613/jair.953. 56. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. MixUp: beyond empirical risk minimization. In: 6th international conference on learning representations, ICLR 2018 - conference track proceedings; 2018. 57. Beers A, Brown J, Chang K, et al. High-resolution medical image synthesis using progressively grown generative adversarial networks. ,http://arxiv.org/abs/1805.03144.; May 2018 [accessed 23.05.18]. 58. Lee H, Yune S, Mansouri M, et al. An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets. Nat Biomed Eng 2019;3(3):173 82. Available from: https://doi.org/ 10.1038/s41551-018-0324-9. 59. Shan H, Padole A, Homayounieh F, et al. Competitive performance of a modularized deep neural network compared to commercial algorithms for low-dose CT image reconstruction. Nat Mach Intell 2019;1:269 76. Available from: https://doi.org/10.1038/s42256-019-0057-9. 60. Sandfort V, Yan K, Pickhardt PJ, Summers RM. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Sci Rep 2019;9. Available from: https:// doi.org/10.1038/s41598-019-52737-x. 61. Zhang Y, Wu H, Liu H, Tong L, Wang MD. Improve model generalization and robustness to dataset bias with biasregularized learning and domain-guided augmentation. ,http://arxiv.org/abs/1910.06745.; October 2019 [accessed 31.12.19]. 62. Song C, Ristenpart T, Shmatikov V. Machine learning models that remember too much. In: Proceedings of the ACM conference on computer and communications security; 2017. Available from: https://doi.org/10.1145/3133956.3134077. 63. Zhu L, Liu Z, Han S. Deep leakage from gradients. ,http://arxiv.org/abs/1906.08935.; June 2019 [accessed 01.01.20]. 64. Vepakomma P, Gupta O, Dubey A, Raskar R. Reducing leakage in distributed deep learning for sensitive health data. In: ICLR AI for social good workshop 2019; 2019. 65. Wood A, Altman M, Bembenek A, et al. Differential privacy: a primer for a non-technical audience. SSRN Electron J 2019. Available from: https://doi.org/10.2139/ssrn.3338027. 66. Abadi M, McMahan HB, Chu A, et al. Deep learning with differential privacy. In: Proceedings of the ACM conference on computer and communications security; 2016. https://doi.org/10.1145/2976749.2978318. 67. Wu B, Zhao S, Sun G, et al. 
P3SGD: patient privacy preserving SGD for regularizing deep CNNs in pathological image classification. ,http://arxiv.org/abs/1905.12883.; May 2019 [accessed 01.01.20]. 68. Beaulieu-Jones BK, Yuan W, Finlayson SG, Wu ZS. Privacy-preserving distributed deep learning for clinical data. ,http://arxiv.org/abs/1812.01484.; December 2018 [accessed 01.01.20]. 69. Li W, Milletarı` F, Xu D, et al. Privacy-preserving federated brain tumour segmentation. In: Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). 2019; 11861 LNCS. p. 133 141. ,http://arxiv.org/abs/1910.00962. [accessed 19.03.20]. 70. Beaulieu-Jones BK, Wu ZS, Williams C, et al. Privacy-preserving generative deep neural networks support clinical data sharing. Circ Cardiovasc Qual Outcomes 2019;12(7): e005122. Available from: https://doi.org/ 10.1161/CIRCOUTCOMES.118.005122. 71. Al Badawi A, Chao J, Lin J, et al. The AlexNet moment for homomorphic encryption: HCNN, the first homomorphic CNN on encrypted data with GPUs. ,http://arxiv.org/abs/1811.00778.; November 2018 [accessed 01.01.20]. 72. Chao J, Badawi AA., Unnikrishnan B, et al. CaRENets: compact and resource-efficient CNN for homomorphic inference on encrypted medical images. ,http://arxiv.org/abs/1901.10074.; January 2019 [accessed 01.01.20]. 73. Ryffel T, Trask A, Dahl M, et al. A generic framework for privacy preserving deep learning. ,http://arxiv.org/ abs/1811.04017.; November 2018 [accessed 08.02.20].

CHAPTER 7

Analytics methods and tools for integration of biomedical data in medicine

Lin Zhang, Mehran Karimzadeh, Mattea Welch, Chris McIntosh and Bo Wang

Abstract

Recent technologies have enabled us to collect diverse types of genome-wide data at an unprecedented scale and in multiple dimensions. Integrative computational methods are greatly needed to combine these data to provide a comprehensive view of the underlying biology and human disease. An ideal method should be able to automatically extract relevant features by exploiting heterogeneous data across various modalities and dimensions to answer a specific biological or medical question. The key challenge in developing such methods is the design of a comprehensive model capable of harnessing noisy and high-dimensional datasets without much manual supervision. Recent advances in machine learning (ML) algorithms, especially deep learning, offer a unique opportunity to mine and provide a systematic understanding of massive heterogeneous biological datasets. In this chapter, we describe the principles of integrative genomic analysis and discuss existing ML methods. We provide examples of successful data integration in biology and medicine with a specific focus on omics (e.g., single-cell RNA-seq) and image data. Finally, we discuss current challenges in integrative methods for genomics and our perspective on the future development of the field.

Keywords: Machine learning; data integration; multimodality; genomic data; radiomics

7.1 The rise of multimodal data in biology and medicine

7.1.1 The emergence of various sequencing techniques

7.1.1.1 Bulk sequencing

We owe a significant portion of our understanding of disease etiology to advances in sequencing and array technologies. Before the advent of high-throughput sequencing, we could not even sensibly estimate the number of genes in the human genome.1

Our knowledge of the genetics and epigenetics of complex diseases such as cardiovascular disease (CVD) and cancer has considerably evolved. Currently, various technologies can identify the sequence of small fragments of DNA in a high-throughput manner. Illumina sequencers such as the HiSeq 2500 can "sequence" up to 250 nucleotides of a DNA fragment,2 while the PacBio RS II can sequence half of DNA molecules to more than 50,000 nucleotides.3 The Oxford Nanopore can sequence even up to 2 million nucleotides of a single DNA molecule.4 The different accuracy and multiplexing potential of each of these technologies make them appropriate for different experiments. Reverse transcriptase enzymes allow us to generate complementary DNA from RNA molecules, making it possible to quantify the transcriptome (RNA-seq).5 Bisulfite treatment converts unmethylated cytosine to uracil, making it possible to quantify DNA methylation (whole-genome bisulfite sequencing).6 Treating DNA with DNase-I7 or Tn58 digests or breaks the genomic regions that did not interact with DNA-associated proteins or nucleosomes, allowing a footprint of nucleosomes and transcription factors to be obtained [known as DNase-seq and the assay for transposase-accessible chromatin with sequencing (ATAC-seq)]. Chromatin immunoprecipitation with sequencing (ChIP-seq) allows for sequencing the DNA bound to a DNA-associated protein such as a transcription factor or to chemical modifications of histones.9 Chromosome conformation capture assays, such as Hi-C, assess the three-dimensional proximity of the chromatin in the cell by sequencing the ligated products of spatially proximal DNA sequences.10 Bulk sequencing essentially provides a consensus view of cell biology and dynamics within a pool of cells. It potentially masks the heterogeneity of cell populations and functions. The recent development of single-cell sequencing technologies aims to offer new biological insights at the resolution of single cells.

7.1.1.2 Single-cell sequencing

Single-cell sequencing technologies rely on using cell-specific barcodes during library preparation and pooled sequencing of the population of single cells. These technologies vary by their approach toward isolating single cells for using different bar codes (microfluidics, flow-activated cell sorting, etc.), the type of molecule they capture for assessment (RNA, open chromatin DNA, etc.), their preference for all or specific parts of the molecule of interest (5′, 3′, or full-transcript sequencing of mRNA), and the number of cells they can sequence.11 Microfluidics-based technologies such as drop-seq12 and Chromium13 use droplets to enclose each cell with unique bar codes in an automated fashion. This approach makes it possible to explore the transcriptome with single-cell RNA-seq12,13 and protein binding to DNA with single-cell ChIP-seq.14 Combinatorial indexing offers a less expensive alternative that provides measurements often (but not always) representing single cells. This approach passes the cells through two different steps of indexing, which result in most cells obtaining unique sets of bar codes.15 This approach, therefore, allows for single-cell assessment of copy number variations,16 chromatin accessibility,17 and even Hi-C.18 A more recent approach uses both combinatorial indexing and microfluidics-based technology to sequence hundreds of thousands of cells instead of a few thousand, enabling high-throughput single-cell RNA-seq19 as well as ATAC-seq.20


TABLE 7.1 Summary of sequencing assays.

RNA-seq: Uses reverse transcriptase to generate complementary DNA from RNA molecules and sequence them. (Ref. 5)
DNase-seq: Sequences DNA fragments not digested by the DNase-I enzyme activity. (Ref. 7)
ATAC-seq: Sequences DNA fragments ligated to a small DNA adapter due to Tn5 transposase activity. (Ref. 8)
ChIP-seq: Sequences DNA fragments bound to a specific protein-antibody complex. (Ref. 9)
Bisulfite-seq: Sodium bisulfite treatment of DNA prior to DNA sequencing to allow for distinguishing methylated cytosine. (Ref. 6)
Hi-C: Sequences DNA fragments ligated to each other due to spatial proximity. (Ref. 10)
NOMe-seq: Sodium bisulfite treatment after treatment with the M.CviPI methyltransferase enzyme to allow for simultaneous assessment of DNA methylation and nucleosome occupancy. (Ref. 25)

With these methods, each cell can be sequenced through one of its modalities: transcriptome, chromatin accessibility, etc. (Table 7.1). When assessing multiple single-cell measurements from biologically similar samples, identifying corresponding groups of single cells across the different assays may not always be feasible. For example, even if downstream analyses identify the same number of single-cell clusters from each assay (e.g., 10 clusters), the next challenge is mapping the clusters to each other (100 possible cluster pairings). Assays such as scNMT-seq, which assesses DNA methylation, chromatin accessibility, and transcription from each cell,21 or those that assess both chromatin accessibility and transcription of each single cell,22–24 introduce an experimental setup that avoids such hurdles of downstream analysis. An objective comparison of the yield of unimodal versus multimodal single-cell measurements, however, may help researchers decide on the best technology for each experiment. The plethora of single-cell and bulk-sequencing technologies provides us with information about complementary aspects of the genome. The goal of data integration methods is to make the most of these datasets by carefully incorporating their limitations and biases.
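When clusters have been computed separately from two assays over a shared feature space (e.g., per-cluster gene activity profiles), one way to resolve the 10 x 10 mapping problem above is to score every cluster pair by centroid similarity and solve a one-to-one assignment. The sketch below is purely illustrative under that assumption; the centroids and feature space are simulated, not taken from any specific published method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Hypothetical cluster centroids over a shared feature space (e.g., gene
# activity summarized per cluster): 10 clusters x 200 features per assay.
rna_centroids = rng.normal(size=(10, 200))
atac_centroids = rna_centroids + rng.normal(scale=0.5, size=(10, 200))  # toy correspondence
atac_centroids = atac_centroids[rng.permutation(10)]  # unknown cluster ordering

def corr_matrix(a, b):
    """Pearson correlation between every pair of rows of a and b."""
    a = (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)
    b = (b - b.mean(axis=1, keepdims=True)) / b.std(axis=1, keepdims=True)
    return a @ b.T / a.shape[1]

similarity = corr_matrix(rna_centroids, atac_centroids)

# Hungarian algorithm: best one-to-one mapping among the 100 candidate pairings.
rows, cols = linear_sum_assignment(-similarity)  # negate to maximize similarity
for r, c in zip(rows, cols):
    print(f"RNA cluster {r} <-> ATAC cluster {c} (r = {similarity[r, c]:.2f})")
```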

7.1.2 The increasing need for combining images and omics in clinical applications

7.1.2.1 Various modalities of images in clinics

Demand for the in vivo study of physiology gave rise to an accelerating era of imaging from 1990 to the present, fueled not only by the extra structural and functional data that imaging brings to the table but also by a decrease in the costs of machines and acquisitions. In cancer, in particular, the advent of image-guided radiotherapy brought with it the routine acquisition of diagnostic and treatment-planning images, creating large datasets over the past decade that until recently remained unexplored. Imaging in general is divided into noninvasive and invasive categories, with varying degrees of the latter. Invasive modalities, including X-ray, fluoroscopy, computed tomography (CT), and positron emission tomography (PET), present some risk of harm to the patient.


Typically, the risk is limited to the radiation from the imaging device, but it can also take the form of radioactive tracers injected into the patient to enhance contrast or, as in fluoroscopic angiography, the risk of death associated with inserting a catheter into a blood vessel in the groin. Noninvasive imaging, including magnetic resonance, electrocardiograms (ECGs), and ultrasound, carries little to zero risk. The distinction can get blurry as well, as with intravascular ultrasound (IVUS): the transducer is threaded into the heart via a catheter and is thus invasive, while the imaging itself, the ultrasound, is not. Along the same lines, imaging can be broadly categorized as structural or functional. Structural imaging is concerned with the geometry and topology of the patient, whereas functional imaging is often temporal and focused on blood flow, oxygen flow, and anatomical function. Functional imaging tends to have lower spatial resolution as a trade-off for temporal or functional resolution. Structural imaging includes X-ray, CT, and magnetic resonance, whereas functional modalities include functional magnetic resonance imaging (MRI), ultrasound, PET, and ECG. Note, however, that the characterization is general, not binary: ultrasound can be used to measure the size of the prostate, for example, or, like ECG, in a more function-oriented role. Ultimately, imaging requires careful consideration of the added benefit to the patient's health trajectory, the risks or harms to the patient, and its costs. As a result, the largest datasets for study are those relating to routine care, where all patients along a particular care pathway will have data available.

7.1.2.2 The rise of radiomics: combining medical images with omics

Quantification of medical imaging data was recognized as an important task for computer-aided decision support half a century ago; it was even suggested that by 2020 clinicians would be "in frequent dialog" with computers during diagnostic and prognostic tasks.26 Different methods of data processing, feature extraction, and pattern recognition for radiographic image classification were also described.27,28 Since then, there has been an abundance of research in medical image processing and analysis, exploring everything from computer-assisted diagnosis29 and image-based markers of neurological disease progression30 to the recent explosion of deep learning-based approaches.31 More recently, quantified features extracted from imaging data have demonstrated correlation with tumor grade,32 histopathology,33 and treatment response,34 but successful integration into clinics has still not been achieved. It was not until recently, with the increase in computing power and the increased collection of digitized imaging (Section 7.1.2.1) and electronic patient records, that these techniques became feasible. In 2012 the automated process of image feature quantification was rebranded as "radiomics."35 Radiomics is experiencing increasing interest from both researchers and clinicians, driven by advancements in pattern recognition, computer vision, and model building that make the utilization of radiomics for diagnostics, prognostics, and treatment decision processes more promising than ever.
Radiomic features have the potential to quantify information regarding the whole tumor as well as the various textures contained within it, thereby serving as a potential noninvasive method of quantifying disease heterogeneity that could be used in conjunction with invasive biopsies and traditional quantitative imaging approaches. Studies have found predictors of recurrence following radiotherapy in lung cancer,36 prognostic factors for patients treated with combined chemotherapy, radiotherapy, and targeted therapeutic agents,37 and signatures for tumor metabolism38 and prediction of freedom from distant metastases, local control, and survival.39


The largest study to date was published in 2014 by Aerts et al.,40 who validated a four-feature radiomic signature's prognostic performance in both lung and head and neck cancer patients. The signature was later found to be a surrogate for tumor volume,41 a well-known clinical prognostic factor. This has spurred interest in determining the robustness and repeatability of radiomic features (discussed further in Section 7.3.3), and safeguards have been suggested to minimize misinterpretation of results while maximizing reproducibility and preventing the field from being dismissed as "yet-another-omics."42
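As a rough illustration of what such "quantified features" look like in practice, the sketch below computes a few first-order radiomic features (intensity statistics and histogram entropy) from a masked region of a toy image array. Real pipelines follow standardized feature definitions (e.g., the Image Biomarker Standardisation Initiative) and validated software; the array, mask, and bin count here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy CT-like image (64 x 64) and a boolean mask standing in for a tumor contour.
image = rng.normal(loc=40.0, scale=15.0, size=(64, 64))
yy, xx = np.mgrid[:64, :64]
mask = (yy - 32) ** 2 + (xx - 32) ** 2 < 15 ** 2

voxels = image[mask]

# First-order features over the masked voxels.
features = {
    "mean": voxels.mean(),
    "std": voxels.std(),
    "skewness": ((voxels - voxels.mean()) ** 3).mean() / voxels.std() ** 3,
    "energy": np.sum(voxels ** 2),
}

# Histogram (Shannon) entropy with a fixed number of intensity bins.
hist, _ = np.histogram(voxels, bins=32)
p = hist / hist.sum()
features["entropy"] = -np.sum(p[p > 0] * np.log2(p[p > 0]))

for name, value in features.items():
    print(f"{name}: {value:.3f}")
```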

7.1.3 The availability of large-scale public health data

The advent of these technologies has brought an era of large-scale public health data: the biobanks. A biobank refers to a repository that collects and stores biosamples from humans for use in health-related research. deCODE genetics,43 a biopharmaceutical company based in Iceland, established one of the world's earliest national biobanks in the late 1990s and has collected biosamples from over 160,000 volunteers, more than half of the adult population of Iceland. Since then, large-scale national biobanks have emerged rapidly in many countries and regions as the cost of technology decreases and electronic health records are widely adopted. Table 7.2 summarizes some of the recent large-scale public health datasets, ranging from biobanks to single-cell atlas consortia.

TABLE 7.2 Some of the recent large-scale public health datasets.

UK Biobank (genetic data, imaging data, self-reported medical information): A population-based national biobank with 500,000 participants aged 40-69 years (at the time of recruitment) recruited between 2006 and 2010. UK Biobank has collected multimodal health data including genetic data, imaging data, clinical records, baseline measurements, etc. (Ref. 44)

The GTEx project (expression quantitative trait loci): Characterizes human transcriptomes within and across individuals for a wide variety of primary tissues and cell types, containing genotype, gene expression, and histological and clinical data; its latest release (V8) consists of 17,382 samples from 948 human donors across 54 tissues. (Ref. 45)

The Human Cell Atlas (single-cell multiomics): Aims to create an atlas of cells, tissues, and organs throughout the human body that captures their molecular characteristics and spatial location and organization. (Ref. 46)

The Human BioMolecular Atlas Program (spatial single-cell multiomics): An NIH-sponsored program launched in 2019 that aims to develop three-dimensional maps of tissues with unprecedented spatial and molecular resolution. The vision is to construct a comprehensive cellular atlas of the human body in health and under various disease conditions. (Ref. 47)

GTEx, Genotype-Tissue Expression.


7.2 The challenges in multimodal data—problems with learning from multiple sources of data

7.2.1 The imperfect generation of single-cell data

Noise is ubiquitous in genomic data, posing the risk of corrupting the underlying biological signal and obstructing downstream analysis. Here we discuss well-recognized types of noise and bias affecting different bulk- and single-cell sequencing technologies. Despite improvements in scRNA-seq measurement technologies, the quality of omics data varies due to multiple technical factors, including amplification bias, cell cycle effects, library size differences, and RNA capture rate.48 For example, recent droplet-based scRNA-seq technologies can profile up to millions of cells in a single experiment, but the resulting data are particularly sparse because the low RNA capture rate leads to relatively shallow sequencing. A unique challenge associated with scRNA-seq is the "dropout" event, in which an expressed gene fails to be detected, resulting in a "false" zero-count observation.49 Dropouts not only mask potentially differential genes for cell-type discovery but also make it harder for computational methods to capture useful signals because of the resulting sparsity. Traditional imputation methods for missing values may not be suitable for scRNA-seq data because distinguishing true zero counts from false zeros caused by dropouts is nontrivial.49

The sequencing assays that assess the epigenome have sample- and assay-specific limitations. Heterogeneity, the primary source of sample-specific bias in sequencing data, affects epigenomic datasets to a greater extent than genomic datasets. Heterogeneity in the DNA content of different cell types does exist,50 but epigenomics, by definition, concerns the emergence of different cell types from the same pluripotent progenitor. When analyzing mutations, we therefore expect less heterogeneity due to contamination from different cell types of the same tissue; in epigenomic analysis, however, such contaminants can introduce false-positive hits. Similarly, biological processes such as the cell cycle, response to hypoxia, inflammatory response, and immune infiltration can impact the epigenome. In addition, each of the epigenomic assays has its own limitations that must be carefully taken into account in data processing and interpretation. DNase-seq and ATAC-seq, for example, are inherently biased by the sequence preference of the DNase-I enzyme or the Tn5 transposase.51 ChIP-seq suffers from cross-reactivity of the antibodies and pull-down of spatially proximal genomic regions as a side effect of cross-linking reagents. The hydroxymethyl group, like the methyl group, protects cytosine from deamination, making it impossible to distinguish hydroxymethylcytosine from methylcytosine in conventional bisulfite-sequencing experiments.52 Most of these limitations, therefore, introduce false-positive findings.

7.2.1.1 The complementarity of various sources of data

Many biobanks focus on profiling multiomics, such as the genome, proteome, transcriptome, epigenome, and microbiome. However, each data modality has its own strengths and weaknesses, and integrating data from multiple modalities improves the interpretation of each. For example, single-cell ATAC-sequencing (scATAC-seq) can uniquely uncover enhancer regions and the regulatory landscape of the genome, but currently it may not achieve the same power for unsupervised cell-type discovery as transcriptomics53,54;


traditional genome-wide association studies, although they have identified thousands of genetic variants for complex diseases and traits, are not well suited to revealing complex interactions between genetic variants. Moreover, because of the cost of collecting data while building a reasonably large sample size, few human health datasets possess more than one major type of omics data. Therefore, machine learning (ML) algorithms specifically designed to integrate such disjoint multimodal data are much needed.

7.2.2 The issues of generalizability of machine learning

Different data modalities have different distributions, structures, and noise, posing unique challenges to the ML algorithms applied to integrate them. For example, scRNA-seq data generated with different sequencing technologies from the same sample usually contain large batch effects, in which the expression of genes in one batch differs systematically from that in another batch; such differences can mask underlying biology or introduce spurious structure in the data. ML in the genomic setting often uses models derived from a set of assumptions that are not always met in the data. Violating these assumptions can have major implications for model generalization. It is consistently found that ML models trained on one type of genomic data provide little insight into another type, even though both may reflect the same biological process.
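A minimal numerical sketch of the batch-effect problem described above: two batches of the same cell population differ by a systematic per-gene offset, and a naive analysis would separate cells by batch rather than by biology. Per-batch centering is shown only as the simplest possible correction; practical methods (e.g., mutual-nearest-neighbor or anchor-based approaches) are considerably more careful. All numbers are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes = 200, 50

# Same underlying biology in both batches...
biology = rng.normal(size=(2 * n_cells, n_genes))
batch = np.repeat([0, 1], n_cells)

# ...plus a systematic per-gene shift in batch 1 (the batch effect).
offset = rng.normal(scale=2.0, size=n_genes)
expression = biology + np.outer(batch, offset)

def batch_separation(x):
    """Distance between batch means, a crude proxy for batch-driven structure."""
    return np.linalg.norm(x[batch == 0].mean(0) - x[batch == 1].mean(0))

print("before correction:", round(batch_separation(expression), 2))

# Simplest correction: center each batch on its own mean.
corrected = expression.copy()
for b in (0, 1):
    corrected[batch == b] -= corrected[batch == b].mean(axis=0)

print("after per-batch centering:", round(batch_separation(corrected), 2))
```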

7.3 Machine learning algorithms in integrating medical and biological data

7.3.1 Genome-wide data integration with machine learning

Each omics assay investigates a specific aspect of the cell. These types of data not only provide unprecedented opportunities to better understand the biology of the genome but also ample opportunities for unraveling disease etiology. Here, we discuss how integrative analysis has improved our understanding of both genome biology and disease etiology.

Semiautomated genomic annotation reveals chromatin function

Integration of omics datasets can inform us about the biological aspects of the genome. Semiautomated genome segmentation algorithms, for example, use unsupervised probabilistic graphical models to integrate epigenomic signals from different sources. These methods segment the genome into regions according to the signal patterns of the different epigenomic assays. These segments, interestingly, match biological functions of the chromatin, corresponding to genes, exons, and transcription start sites.55,56

DNA-binding preferences of transcription factors

Transcription factors are of particular interest to disease etiology, as alteration of a single transcription factor can affect the expression of tens of genes. Predicting transcription factor binding sites has brought us valuable knowledge on the process of transcription,57 gene regulation, the role of driver noncoding mutations that alter transcription factor binding in disease,58 and the role of transcription factors as upstream regulators of various diseases.59
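As background for the sequence-preference models discussed next, the sketch below scans a DNA string with a toy position weight matrix (log-odds scores) and reports every window scoring above a threshold; applied genome-wide, such scans return far more matches than true binding sites, which motivates the integrative models described below. The motif, scores, and threshold are invented for illustration.

```python
import numpy as np

# Toy log-odds position weight matrix for a 4-nucleotide motif (rows: A, C, G, T).
pwm = np.array([
    [ 1.2, -2.0, -2.0,  1.0],   # A
    [-1.5,  1.5, -2.0, -1.0],   # C
    [-1.5, -2.0,  1.5, -1.0],   # G
    [ 1.0, -2.0, -2.0,  0.8],   # T
])
base_index = {"A": 0, "C": 1, "G": 2, "T": 3}

def scan(sequence, pwm, threshold=3.0):
    """Return (position, score) for every window scoring above the threshold."""
    width = pwm.shape[1]
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(pwm[base_index[b], j] for j, b in enumerate(window))
        if score >= threshold:
            hits.append((i, score))
    return hits

sequence = "TTACGTAACGTTAGCATACGTA"
for position, score in scan(sequence, pwm):
    print(f"match at {position}: {sequence[position:position + 4]} (score {score:.2f})")
```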


The initial models benefited from the sequence preference of the DNA-binding domain of the transcription factors.60,61 Given the 3.2 billion nucleotide search space of the human genome, however, these models suffered from the futility conjecture: the number of matches to each transcription factor's sequence preference across the genome exceeds the number of actual binding sites by orders of magnitude.62 In addition, given the cooperative binding of transcription factors and the importance of DNA shape,63,64 not all of a transcription factor's binding sites satisfy its sequence preference, resulting in a dual futility conjecture.65 Recent methods, therefore, integrate publicly available ChIP-seq data66 as well as gene expression65 to achieve better predictions.

Deep learning for genome and epigenome analysis

Some supervised deep learning models trained to predict a specific biological measurement from DNA sequence may learn biologically insightful information. Basset, for example, trains a convolutional neural network on one-hot-encoded DNA sequence to predict the chromatin accessibility signal.25 DNA sequence cannot comprehensively explain the extent of variation in chromatin accessibility (cell types in our body have similar DNA sequences but different chromatin accessibility profiles). In an in silico mutagenesis experiment, however, the trained model informs us about the potential effect of noncoding mutations on the chromatin accessibility signal. In the case of the vitiligo disease polymorphism rs4409785, for example, the model's prediction suggests that an increase in CTCF binding predisposes to the development of the disease. Other deep learning models specifically aim to identify causal mutations in disease. DeepSEA, for example, trains a convolutional network on 919 epigenomic features encompassing chromatin accessibility, histone modification, and transcription factor binding.67 Since the input data are the DNA sequence, similar to Basset, the model predicts the effect of mutations on different aspects of the epigenome. A similar model identified the role of previously unknown driver mutations in autism spectrum disorders.68 By learning from nondamaging mutations among human polymorphisms as well as the polymorphisms of other primates, such models can also distinguish pathogenic coding mutations from the plethora of mutations without harmful effects.69

7.3.1.1 How to integrate various omics for cancer subtyping

Before the advent of high-throughput genomic technologies, we had understood the role of some oncogenes and tumor suppressors, especially in familial cancers affected by highly penetrant mutations passed through multiple generations. The initial efforts to sequence cancer genomes aimed to identify driver mutations and distinguish invasive tumors based on their molecular profile. While these efforts identified a plethora of oncogenes70 and novel pathways leading to cancer,71 they also suffered from the limitations of using just one biological assay. Some patients did not harbor mutations in the coding region of any of the known oncogenes or tumor suppressors. In addition, most polymorphisms that increase the risk of disease and cancer occur in the noncoding regions of the genome. Leveraging complementary data types on a converging biological module, such as the effect on one gene at a time, has facilitated the discovery of the molecular basis of various diseases.
The effect of genetic mutation on gene expression measurements in expression


quantitative trait loci (eQTL) analyses and the effect of genetic mutation on DNA methylation in methylation quantitative trait loci (mQTL) analyses are two common frameworks applied to multiple diseases and datasets. Biological networks, particularly protein interaction networks, provide another converging framework to integrate various types of molecular aberrations. Various methods allow for integrating one type of data, such as mutations affecting the proteins in the network using a heat-diffusion model (HotNet2),72 or integrating multiple types of data to identify driver genes or mutations (OncoImpact73 and AbHAC74). These approaches identify a network module, a small number of proteins closely related to a specific biological pathway driving the disease, or aberration hubs,74 a small number of proteins not necessarily connected to each other but playing important roles in the disease etiology. Integrating multiple datasets without a converging module, however, has been challenging. Concatenating these measurements increases data dimensionality while incorporating additional noise. Independent feature selection from each experiment may miss the interactions among variables of different assays. Unsupervised and supervised ML can integrate datasets from different sources to understand native regulatory mechanisms and how they are dysregulated in cancer, to predict molecular aspects of the disease relevant to its etiology, or even to directly predict disease outcome. Curtis et al.,75 for example, identified copy number alterations impacting gene expression and used a joint latent variable framework on the expression of 1000 genes affected by such copy number variations to cluster patients into 10 groups with distinct survival outcomes. Another approach, similarity network fusion (SNF), generates sample-similarity networks from independent assays and merges them using a nonlinear combination.76 SNF can better predict the survival of cancer patients than any single-assay measurement. Yet such approaches do not provide normalized data matrices to allow for optimal feature extraction and biomarker identification. In other words, many data integration approaches use a big-box model requiring data from multiple sources that are only available in the research setting as a result of international collaboration and may not be feasible to generate for all patients entering the clinic. Another algorithm, MANCIE, assumes that similarity among samples should not vary among different measurements.77 To achieve this, MANCIE uses an approximation of full Bayesian inference to generate an adjusted matrix from the matrices of two independent genomic measurements, such as measurements of the transcriptome and epigenome. This approach, therefore, allows for identifying a small number of measurements with potential clinical use from two different assays.

7.3.1.2 How to integrate single-cell multiomics for precision medicine

The emergence of single-cell technologies enables and fosters studies that reveal the transcriptomic and epigenetic profiles of single cells. Most work has focused on methods for integrating multiple datasets from the same single-cell modality; how to integrate single-cell omics data across multiple modalities remains relatively unexplored. Several methods have been proposed to integrate omics datasets from different modalities.
SIMLR (single-cell interpretation via multikernel learning)78 identifies cell-to-cell similarities from single-cell data and enables the deconvolution of population heterogeneity. Seurat v379 uses pairs of corresponding cells shared between datasets to "anchor" datasets from different modalities and leverage the information in one data modality for the interpretation of the other.


FIGURE 7.1 An example of the application of machine learning to single-cell multiomics data integration and downstream analyses. (Panels shown: integration of scRNA-seq, integration of multimodal single-cell data, spatial data, cell heterogeneity, machine learning, and cell dynamics.)

Welch et al.80 propose LIGER, which focuses mainly on integrating scRNA-seq datasets using shared metagene factors. Fig. 7.1 shows an example of integrating single-cell data from multiple modalities using ML algorithms. Nevertheless, existing methods for multimodal omics data integration require at least partial correspondence shared between omics layers and rely heavily on the assumption of orthogonality between batch effects and biological effects. Moreover, with the blooming of spatial single-cell data, how to map spatial single-cell omics data to single-cell sequencing data remains unsolved. Combined analysis of multimodal omics will enable the accurate reconstruction of the gene regulatory and signaling networks that drive cellular identity and function. Integrative analysis of data from various modalities will shed light on molecular and cellular mechanisms and eventually benefit clinical practice and precision medicine strategies.
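To make similarity-based integration concrete, the sketch below builds a sample-by-sample similarity matrix from each of two toy modalities and fuses them with a simple average. This is only a crude stand-in for methods such as SNF, which propagate information between the networks iteratively and nonlinearly; the data and kernel width are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples = 30

# Two toy modalities (e.g., expression and methylation) for the same samples.
expression = rng.normal(size=(n_samples, 100))
methylation = rng.normal(size=(n_samples, 80))

def rbf_similarity(x, sigma=1.0):
    """Gaussian (RBF) similarity between samples from pairwise squared distances."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2 * sq.mean()))

s_expr = rbf_similarity(expression)
s_meth = rbf_similarity(methylation)

# Naive fusion: average the two similarity networks (SNF instead iterates
# nonlinear message passing between them).
fused = (s_expr + s_meth) / 2.0
print("fused similarity matrix:", fused.shape)

# The fused matrix could feed spectral clustering or survival analysis downstream.
```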

7.3.2 Data integration beyond omics—an example with cardiovascular diseases

Cardiovascular diseases (CVDs) are the number one cause of death globally; according to World Health Organization estimates, around 17.9 million people die from CVDs every year. The rapid development of technologies generates a large amount of patient data, ranging from clinical information, such as genomic-sequencing data, to individual-level biometrics, such as heart rate measured by a smart wristband, in addition to the vital imaging measurements that are crucial to physicians' therapy selection. In the meantime, patients are beginning to demand faster and more personalized care.


FIGURE 7.2 Deep learning algorithms to predict cardiovascular disease (CVD) risk factors by integrating clinical, demographic, and genomic data. (Inputs shown: MRI and CT scans, demographic factors, ECG, genomic information, and doctors' notes feeding a multimodal deep learning architecture whose output is risk predictors of CVD.)

As a result, automated solutions are in high demand among physicians and health-care systems. Beyond its successful applications in many other fields, deep learning quickly appeals to clinical diagnosis thanks to its capability to handle large amounts of chaotic data and the increasing computational power of graphical processing units (GPUs). Deep learning has established its reputation in image classification, image segmentation, natural language processing, speech recognition, and genomics, all of which have great potential to facilitate the diagnosis, prediction, and prevention of CVDs. By June 2019 only seven AI-based algorithms had been approved for cardiology by the US Food and Drug Administration. Deep learning algorithms in cardiology mainly target imaging analysis, such as echocardiography, MRI, and CT. However, most deep learning algorithms focus on single-modal data. With more biobank-collected human health data of multiple modalities, predicting CVDs using integrated multimodal data, such as demographics, biological multiomics, and vital signs, becomes plausible. Fig. 7.2 shows a pipeline for integrating multimodal data to facilitate the diagnosis of CVD.

7.3.2.1 How to integrate various image modalities such as magnetic resonance imaging and computed tomography scans

Echoing its successful applications in many other fields, the initial application of deep learning in the biomedical field came in image detection and segmentation, where deep learning algorithms have been shown to outperform board-certified doctors on specific tasks. However, existing applications of ML


on imaging focus on one modality at a time, with little integration of information from various sources. The study of Nakanishi et al.81 is one of the few that utilizes ML algorithms to integrate temporal CT data with the coronary artery calcium (CAC) score, CAC volume scores, extracardiac CAC scores, and epicardial fat volume to predict coronary heart disease (CHD). They concluded that ML algorithms integrating all data for predicting CHD events outperform those using the risk score alone (AUC = 0.765). This work suggests that there is huge potential for more accurately predicting CVDs by jointly analyzing clinical data from various modalities using ML algorithms. Integrating clinical data with omics data and demographic information will further incorporate genetic heritability and environmental factors related to the diseases.

7.3.2.2 How to better the diagnosis by linking images with electrocardiograms

The electrocardiogram (ECG) measures the electrical activity that passes through the heart. It is a noninvasive test to detect abnormal heart rhythms that may indicate heart problems. Automated ECG interpretation, an enterprise initially undertaken in the 1960s with the advent of digital ECG machines, is now almost universal. It was the first instance in which rudimentary AI (likely a rule-based expert system) effectively streamlined hospital care and cut costs. However, most existing ECG analysis and interpretation does not incorporate other clinical data. One notable application collectively analyzes ECG data with pulsatile photoplethysmographic (PPG) data. Shashikumar et al.82 applied a convolutional neural network to analyze PPG data recorded with a multichannel wrist-worn device, while a single-channel ECG was recorded simultaneously (for rhythm verification only), to predict atrial fibrillation (AF). They showed that this integration approach produced a robust and accurate algorithm for detecting AF from PPG data, one that is scalable and likely to improve in accuracy as the dataset size continues to expand.
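For orientation, the sketch below shows the general shape of a one-dimensional convolutional classifier for wearable waveform data such as PPG, producing a per-window probability of atrial fibrillation. It is not the published architecture of Shashikumar et al.; the layer sizes, window length, and channel count are placeholders.

```python
import torch
import torch.nn as nn

class WaveformAFClassifier(nn.Module):
    """Toy 1D CNN: multichannel waveform window -> probability of AF."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time axis
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, x):
        h = self.features(x).squeeze(-1)
        return torch.sigmoid(self.classifier(h))

model = WaveformAFClassifier()
batch = torch.randn(8, 3, 1024)   # 8 windows, 3 sensor channels, 1024 samples each
print(model(batch).shape)         # torch.Size([8, 1])
```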

7.3.3 Multimodal decision-making in clinical settings

Identification of quantitative imaging biomarkers is a modular series of steps that combine to create a pipeline that can be executed manually, semiautomatically, or automatically. The steps involve (1) collection of images, (2) contouring of regions of interest, (3) extraction of quantitative features, (4) building of a predictive model, and (5) integration into clinics. These are the five basic steps for identifying a radiomic biomarker or signature, but the process is much more nuanced and constantly evolving as the field matures. In addition, the proposed radiomics pipeline is similar to a traditional quantitative imaging pipeline.83 These similarities provide opportunities to improve the interpretability of radiomic features by aligning processes and investigating related mechanisms between the fields. A detailed description of these steps is beyond the scope of this chapter; we refer the reader to Welch et al.84 for a more comprehensive overview. An often overlooked step in the radiomics pipeline is image acquisition, which can have a large impact on the repeatability of radiomic features. Contrast agent selection, timing of contrast injection, scanner manufacturers, pixel sizing, and other


modality-specific image acquisition parameters (e.g., tube current and voltage in CT and pulse sequence parameters in MRI) all pose problems for the repeatability of radiomic features.85–88 Some image acquisition variations can be improved with appropriate postprocessing89; however, caution is required when selecting the postprocessing.90 In the field of traditional quantitative imaging, efforts have been made to establish guidelines and improve standardization of imaging practices through the Quantitative Imaging Biomarker Alliance of the Radiological Society of North America, the European Imaging Biomarkers Alliance, and the Quantitative Imaging Network; the field of radiomics would benefit from deeper collaboration with these groups. In the absence of standardized image acquisition, various steps are required to determine the robustness of radiomic features. Feature repeatability can be determined by imaging the same patient twice within a short period of time; this is known as test-retest imaging.91,92 Additionally, the stability of features against interobserver contour variation can be tested using multidelineation datasets.89,93 Phantom studies can also be used to determine the reproducibility of features across scanners and other image acquisition parameters,94 in addition to developing corrective calibration curves.95 Furthermore, the dependence of radiomic features on the software utilized for feature extraction96 warrants strict adherence to published guidelines.97 As a result of these considerations, external validation of results plays a key role in reducing the likelihood of overfitting and/or sensitivity to sources of bias.98 Omics-based intervention is an emerging area of research. Early works have used CT texture features to predict radiation therapy treatment plans for a variety of cancers. Most of the work in this area remains theoretical,99,100 but some groups have created clinically deliverable treatment plans that can be directly incorporated into patient care.101 Recent works have suggested adjusting the prescribed radiation dose based on radiomic signals,102 but to date the accuracy is insufficient to justify clinical implementation.
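As a small example of the test-retest analysis mentioned above, the concordance correlation coefficient (CCC) is one common way to quantify how well a radiomic feature reproduces between two scans of the same patients; poorly reproducing features are typically excluded. The feature values below are simulated, and the 0.85 cutoff is shown only as an often-cited illustration, not a universal standard.

```python
import numpy as np

def concordance_cc(x, y):
    """Lin's concordance correlation coefficient between paired measurements."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

rng = np.random.default_rng(3)
n_patients = 40

# Simulated feature values from scan 1 and a repeat scan shortly afterwards.
scan1 = rng.normal(loc=100, scale=20, size=n_patients)
stable_feature = scan1 + rng.normal(scale=5, size=n_patients)     # reproduces well
unstable_feature = scan1 + rng.normal(scale=40, size=n_patients)  # dominated by noise

for name, repeat in [("stable", stable_feature), ("unstable", unstable_feature)]:
    ccc = concordance_cc(scan1, repeat)
    decision = "keep" if ccc >= 0.85 else "discard"
    print(f"{name}: CCC = {ccc:.2f} -> {decision}")
```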

7.4 Future directions

With the advent of large-scale datasets across different fields in biology and healthcare, the scalability and effectiveness of ML algorithms have generated genuine enthusiasm for applying ML to these fields. However, current ML algorithms suffer from several computational limitations. First, existing methods lack rigorous theoretical guidance toward effective model designs given few data samples; practitioners must rely heavily on manual experimentation to determine the optimal model selections, often resulting in overfitting to specific datasets. Second, the noisy nature of the modalities across various biomedical platforms challenges many underlying assumptions of ML algorithms; with little supervision, ML algorithms can fail to capture useful signals without proper data harmonization. Last, the infrequent adoption of ML in real-world clinical practice indicates the need for improved algorithms that gain clinicians' trust. One main direction would be to expand the transferability of ML, in which models learned from one source can be utilized or adapted for a different application of interest. Another major avenue is the explainability


of AI algorithms, to gain better insight into which features are being used for prediction and whether those features are indeed reliable for the given patient.
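One widely used, model-agnostic way to probe which features a model relies on is permutation importance: shuffle one feature at a time and measure how much held-out performance drops. The sketch below uses scikit-learn on synthetic data; the feature names and dataset are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic "clinical" dataset: 500 patients, 8 features, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permute each feature on the held-out set and record the drop in accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```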

References

1. Fields C, Adams MD, White O, Venter JC. How many genes in the human genome? Nat Genet 1994;7:3456.
2. Performance specifications for the HiSeq 2500 System. 2020. <https://www.illumina.com/systems/sequencingplatforms/hiseq-2500/specifications.html> (accessed 14.07.20).
3. PacBio. Smart sequencing. <https://www.pacb.com/smrt-science/smrt-sequencing/> (accessed).
4. Payne A, Holmes N, Rakyan V, Loose M. BulkVis: a graphical viewer for Oxford Nanopore bulk FAST5 files. Bioinformatics 2019;35:21938.
5. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10:5763.
6. Frommer M, McDonald LE, Millar DS, et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci USA 1992;89:182731.
7. Song L, Crawford GE. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb Protoc 2010;2010 pdb.prot5384.
8. Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-seq: a method for assaying chromatin accessibility genome-wide. Curr Protoc Mol Biol 2015;109:21.9.121.9.9.
9. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science 2007;316:1497502.
10. Rao SS, Huntley MH, Durand NC, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014;159:166580.
11. Chen H, Lareau C, Andreani T, et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. bioRxiv 2019:739011.
12. Macosko EZ, Basu A, Satija R, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 2015;161:120214.
13. Zheng GX, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8:14049.
14. Grosselin K, Durand A, Marsolier J, et al. High-throughput single-cell ChIP-seq identifies heterogeneity of chromatin states in breast cancer. Nat Genet 2019;51:10606.
15. Adey A, Kitzman JO, Burton JN, et al. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res 2014;24:20419.
16. Amini S, Pushkarev D, Christiansen L, et al. Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet 2014;46:13439.
17. Cusanovich DA, Daza R, Adey A, et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 2015;348:91014.
18. Ramani V, Deng X, Qiu R, et al. Sci-Hi-C: A single-cell Hi-C method for mapping 3D genome organization in large number of single cells. Methods 2020;170:618.
19. Datlinger P, Rendeiro AF, Boenke T, Krausgruber T, Barreca D, Bock C. Ultra-high throughput single-cell RNA sequencing by combinatorial fluidic indexing. bioRxiv 2019. 2019.12.17.879304.
20. Lareau CA, Duarte FM, Chew JG, et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat Biotechnol 2019;37:91624.
21. Clark SJ, Argelaguet R, Kapourani CA, et al. scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat Commun 2018;9:781.
22. Zhu C, Yu M, Huang H, et al. An ultra high-throughput method for single-cell joint analysis of open chromatin and transcriptome. Nat Struct Mol Biol 2019;26:106370.
23. Chen S, Lake BB, Zhang K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat Biotechnol 2019;37:14527.
24. Cao J, Cusanovich DA, Ramani V, et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 2018;361:13805.


25. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 2016;26:9909.
26. Schwartz WB. Medicine and the computer. The promise and problems of change. N Engl J Med 1970;283:125764.
27. Hall EL, Kruger RP, Dwyer SJ, Hall DL, Mclaren RW, Lodwick GS. A survey of preprocessing and feature extraction techniques for radiographic images. IEEE Trans Comput 1971;100:103244.
28. Harlow CA, Eisenbeis SA. The analysis of radiographic images. IEEE Trans Comput 1973;100:67889.
29. Baker JA, Rosen EL, Lo JY, Gimenez EI, Walsh R, Soo MS. Computer-aided detection (CAD) in screening mammography: sensitivity of commercial CAD systems for detecting architectural distortion. AJR Am J Roentgenol 2003;181:10838.
30. Wang L, Beg F, Ratnanather T, et al. Large deformation diffeomorphism and momentum based hippocampal shape discrimination in dementia of the Alzheimer type. IEEE Trans Med Imaging 2007;26:46270.
31. Litjens G, Kooi T, Bejnordi BE, et al. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:6088.
32. Zacharaki EI, Wang S, Chawla S, et al. Classification of brain tumor type and grade using MRI texture and shape in a machine learning scheme. Magn Reson Med 2009;62:160918.
33. Earnest F, Kelly PJ, Scheithauer BW, et al. Cerebral astrocytomas: histopathologic correlation of MR and CT contrast enhancement with stereotactic biopsy. Radiology 1988;166:8237.
34. Provenzale JM, Mukundan S, Barboriak DP. Diffusion-weighted and perfusion MR imaging for brain tumor characterization and assessment of treatment response. Radiology 2006;239:63249.
35. Lambin P, Rios-Velazquez E, Leijenaar R, et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer 2012;48:4416.
36. Vaidya M, Creach KM, Frye J, Dehdashti F, Bradley JD, El Naqa I. Combined PET/CT image characteristics for radiotherapy tumor response in lung cancer. Radiother Oncol 2012;102:23945.
37. Chong Y, Kim JH, Lee HY, et al. Quantitative CT variables enabling response prediction in neoadjuvant therapy with EGFR-TKIs: are they different from those in neoadjuvant concurrent chemoradiotherapy? PLoS One 2014;9:e88598.
38. Ganeshan B, Skogen K, Pressney I, Coutroubis D, Miles K. Tumour heterogeneity in oesophageal cancer assessed by CT texture analysis: preliminary evidence of an association with tumour metabolism, stage, and survival. Clin Radiol 2012;67:15764.
39. Fried DV, Tucker SL, Zhou S, et al. Prognostic value and reproducibility of pretreatment CT texture features in stage III non-small cell lung cancer. Int J Radiat Oncol Biol Phys 2014;90:83442.
40. Aerts HJ, Velazquez ER, Leijenaar RT, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 2014;5:4006.
41. Welch ML, McIntosh C, Haibe-Kains B, et al. Vulnerabilities of radiomic signature development: the need for safeguards. Radiother Oncol 2019;130:29.
42. Welch ML, Jaffray DA. Editorial: Radiomics: the new world or another road to El Dorado? J Natl Cancer Inst 2017;109.
43. deCODE Genetics. <https://www.decode.com/> (accessed 14.07.20).
44. Sudlow C, Gallacher J, Allen N, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 2015;12 e1001779.
45. Consortium G. The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013;45:5805.
46. Regev A, Teichmann SA, Lander ES, et al. The Human Cell Atlas. Elife 2017;6.
47. Consortium H. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 2019;574:18792.
48. Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med 2018;50:96.
49. Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 2018;19:56278.
50. García-Nieto PE, Morrison AJ, Fraser HB. The somatic mutation landscape of the human body. Genome Biol 2019;20:298.
51. Karabacak Calviello A, Hirsekorn A, Wurmus R, Yusuf D, Ohler U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome Biol 2019;20:42.


52. Booth MJ, Ost TW, Beraldi D, et al. Oxidative bisulfite sequencing of 5-methylcytosine and 5-hydroxymethylcytosine. Nat Protoc 2013;8:1841.
53. Cusanovich DA, Hill AJ, Aghamirzaie D, et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 2018;174:13091324.e18.
54. Lake BB, Chen S, Sos BC, et al. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat Biotechnol 2018;36:7080.
55. Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 2012;9:21516.
56. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 2012;9:4736.
57. Gill G, Tjian R. Eukaryotic coactivators associated with the TATA box binding protein. Curr Opin Genet Dev 1992;2:23642.
58. Gloss BS, Dinger ME. Realizing the significance of noncoding functionality in clinical genomics. Exp Mol Med 2018;50:97.
59. Mazrooei P, Kron KJ, Zhu Y, et al. Cistrome partitioning reveals convergence of somatic mutations and risk variants on master transcription regulators in primary prostate tumors. Cancer Cell 2019;36:674689.e6.
60. Stormo GD, Fields DS. Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci 1998;23:10913.
61. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics 2000;16:1623.
62. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004;5:27687.
63. Samee MAH, Bruneau BG, Pollard KS. A de novo shape motif discovery algorithm reveals preferences of transcription factors for DNA shape beyond sequence motifs. Cell Syst 2019;8:2742.e6.
64. Yang L, Orenstein Y, Jolma A, et al. Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Mol Syst Biol 2017;13:910.
65. Karimzadeh M, Hoffman MM. Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome. bioRxiv 2019. 168419.
66. Schreiber J, Durham T, Bilmes J, Noble WS. Multi-scale deep tensor factorization learns a latent representation of the human epigenome. BioRxiv 2019. 364976.
67. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 2015;12:9314.
68. Zhou J, Park CY, Theesfeld CL, et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat Genet 2019;51:97380.
69. Sundaram L, Gao H, Padigepati SR, et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 2018;50:116170.
70. Repana D, Nulsen J, Dressler L, et al. The Network of Cancer Genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. Genome Biol 2019;20:1.
71. Huang M, Ye YC, Chen S, Chai JR, Lu JX, Zhoa L, et al. Use of all-trans retinoic acid in the treatment of acute promyelocytic leukemia. Blood 1988;72:56772.
72. Leiserson MD, Vandin F, Wu HT, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet 2015;47:10614.
73. Bertrand D, Chng KR, Sherbaf FG, et al. Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles. Nucleic Acids Res 2015;43:e44.
74. Karimzadeh M, Jandaghi P, Papadakis AI, et al. Aberration hubs in protein interaction networks highlight actionable targets in cancer. Oncotarget 2018;9:2516680.
75. Curtis C, Shah SP, Chin SF, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 2012;486:34652.
76. Wang B, Mezlini AM, Demir F, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 2014;11:3337.
77. Zang C, Wang T, Deng K, et al. High-dimensional genomic data bias correction and data integration using MANCIE. Nat Commun 2016;7:11305.
78. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 2017;14:41416.


79. Stuart T, Butler A, Hoffman P, et al. Comprehensive integration of single-cell data. Cell 2019;177:18881902.e21.
80. Welch JD, Kozareva V, Ferreira A, Vanderburg C, Martin C, Macosko EZ. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 2019;177:18731887.e17.
81. Nakanishi R, Dey D, Commandeur F, et al. Machine learning in predicting coronary heart disease and cardiovascular disease events: results from the multi-ethnic study of atherosclerosis (MESA). J Am Coll Cardiol 2018;71:A1483.
82. Shashikumar SP, Shah AJ, Li Q, Clifford GD, Nemati S. A deep learning approach to monitoring and detecting atrial fibrillation using wearable technology. In: 2017 IEEE EMBS international conference on Biomedical & Health Informatics (BHI). IEEE; 2017. p. 1414.
83. Jaffray DA, Chung C, Coolens C, et al. Quantitative imaging in radiation oncology: an emerging science and clinical service. Semin Radiat Oncol 2015;25:292304.
84. Welch ML, Traverso A, Chung C, Jaffray DA. Quantitative radiomics in radiation oncology. In: Van Dyk J, editor. The modern technology of radiation oncology: a compendium for medical physicists and radiation oncologists. Madison, WI: Medical Physics Publishing; 2020.
85. Mackin D, Ger R, Dodge C, et al. Effect of tube current on computed tomography radiomic features. Sci Rep 2018;8:2354.
86. Shiri I, Rahmim A, Ghaffarian P, Geramifar P, Abdollahi H, Bitarafan-Rajabi A. The impact of image reconstruction settings on 18F-FDG PET radiomic features: multi-scanner phantom and patient studies. Eur Radiol 2017;27:4498509.
87. Shafiq-Ul-Hassan M, Zhang GG, Latifi K, et al. Intrinsic dependencies of CT radiomic features on voxel size and number of gray levels. Med Phys 2017;44:105062.
88. Ger RB, Zhou S, Chi PM, et al. Comprehensive investigation on controlling for CT imaging variabilities in radiomics studies. Sci Rep 2018;8:13047.
89. Traverso A, Kazmierski M, Welch ML, et al. Sensitivity of radiomic features to inter-observer variability and image pre-processing in apparent diffusion coefficient (ADC) maps of cervix cancer patients. Radiother Oncol 2019;143:8894.
90. Mackin D, Fave X, Zhang L, et al. Harmonizing the pixel size in retrospective computed tomography radiomics studies. PLoS One 2017;12:e0178524.
91. Leijenaar RT, Carvalho S, Velazquez ER, et al. Stability of FDG-PET radiomics features: an integrated analysis of test-retest and inter-observer variability. Acta Oncol 2013;52:13917.
92. Balagurunathan Y, Kumar V, Gu Y, et al. Test-retest reproducibility analysis of lung CT image features. J Digit Imaging 2014;27:80523.
93. Liu R, Elhalawani H, Radwan Mohamed AS, et al. Stability analysis of CT radiomic features with respect to segmentation variation in oropharyngeal cancer. Clin Transl Radiat Oncol 2020;21:1118.
94. Mackin D, Fave X, Zhang L, et al. Measuring computed tomography scanner variability of radiomics features. Invest Radiol 2015;50:75765.
95. Zhovannik I, Bussink J, Fijten R, Dekker A, Monshouwer R. Learning from scanners: radiomics correction modeling. Radiotherapy and Oncology. Elsevier Ireland Ltd; 2019. p. S10345.
96. Bogowicz M, Riesterer O, Stark LS, et al. Comparison of PET and CT radiomics for prediction of local tumor control in head and neck squamous cell carcinoma. Acta Oncol 2017;56:15316.
97. Zwanenburg A, Leger S, Vallières M, Löck S. Image biomarker standardisation initiative. arXiv 2016. 1612.07003.
98. Parmar C, Leijenaar RT, Grossmann P, et al. Radiomic feature clusters and prognostic signatures specific for Lung and Head & Neck cancer. Sci Rep 2015;5:11044.
99. Babier A, Mahmood R, McNiven AL, Diamant A, Chan TCY. Knowledge-based automated planning with three-dimensional generative adversarial networks. Med Phys 2019;47(2):297306.
100. Nguyen D, Long T, Jia X, et al. A feasibility study for predicting optimal radiation therapy dose distributions of prostate cancer patients from patient anatomy using deep learning. Sci Rep 2019;9:1076.
101. McIntosh C, Welch M, McNiven A, Jaffray DA, Purdie TG. Fully automated treatment planning for head and neck radiotherapy using a voxel-based dose prediction and dose mimicking method. Phys Med Biol 2017;62:592644.
102. Lou B, Doken S, Zhuang T, et al. An image-based deep learning framework for individualising radiotherapy dose: a retrospective analysis of outcome prediction. Lancet Digital Health 2019;1:e13647.


8 Electronic health record data mining for artificial intelligence healthcare

Anthony L. Lin, William C. Chen and Julian C. Hong

Abstract

Continued advancements in both data aggregation and computer science have brought an emergence of artificial intelligence (AI) that has changed the way many industries operate in the modern era. The aviation industry uses AI for autopiloting; online retail uses it to understand consumer interest; and the auto industry has made massive strides in AI to bring forth a new age of driverless cars. Despite the widespread use of AI to improve performance and efficiency in other industries, the adoption of AI into healthcare has remained relatively slow. However, with the increasing volume and scope of data collected in electronic health records (EHRs), healthcare has become ripe with opportunity to leverage AI to improve elements of the care delivery process. In this chapter, we will discuss current implementations of AI in EHRs, EHR-specific limitations to safeguard against, and a path forward for continued integration of AI and healthcare.

Keywords: Artificial intelligence, AI; electronic health record, EHR; clinical decision support, CDS; machine learning; data mining

8.1 Introduction

In December 2017 at the Neural Information Processing Systems annual conference, one of the largest gatherings of machine learning researchers, Rahimi made a controversial statement in his acceptance speech for a career achievement award. In front of a room of researchers who had dedicated their careers to leveraging machine learning for its many applications, he argued, "machine learning has become alchemy."1 Rahimi claimed that the use of complex machine learning approaches had become so poorly understood that it resembled the ancient "science" of changing base metals into gold. His comments were met with both high praise and intense criticism and sparked a debate about the appropriateness of using machine learning before fully understanding the principles behind why it works. Advances in machine learning have made it more difficult to understand how specific algorithms work, how they should be validated, and how they should be safeguarded for


cases in which they do not work. These questions become increasingly important when using machine learning in environments such as healthcare, where the cost of failure is exceptionally high. The systematic errors brought on by computational approaches could result in injury or harm to patients. A recent example is IBM Watson Health's Watson for Oncology, built off the artificial intelligence (AI) that defeated legendary (human) champions at a game of Jeopardy! in 2011. With its ability to parse clues from sentences riddled with complex wordplay, quickly search millions of textual sources, and correctly arrive at an answer before its human competitors, IBM Watson's applications in medicine were thought to be boundless. Eight years later, at the time of the writing of this chapter, Watson for Oncology has yet to deliver on its promise. In fact, recent reports have suggested that some treatment recommendations made by Watson for Oncology have been dangerous, such as recommending the use of bevacizumab in patients with severe bleeding, an absolute contraindication.2 This example highlights just how high the stakes can be for using AI in medicine. While the debate around whether AI is ready for medicine is still ongoing, researchers and healthcare leaders have already begun to explore its utility in a variety of clinical applications. In this chapter, we will focus specifically on the growing use of electronic health records (EHRs) in healthcare, and how their digitization and storage of clinical data have catalyzed the development and implementation of AI in modern care delivery.

8.2 Overview of the electronic health record

8.2.1 History of the electronic health record

The idea of a centralized and standardized health record began long before the invention of computers. Up until the early 1900s, physicians would each keep their own ledger of patients in which information was recorded idiosyncratically, based on each physician's individual preference. In 1907 Dr. Henry Plummer designed what could be considered the first medical record for the Mayo Clinic.3 Each patient was assigned a number; each physician who visited with that patient would use that unique identifier to track their observations and interventions. Plummer also instituted a number of standards to ensure consistency; all information had to be recorded in a particular manner, even down to the ink that physicians were using. Plummer's "patient-centric" system laid the foundation for other health records.

One of the first clinically oriented EHRs (then called "clinical information systems") was the Technicon Medical Information System (TMIS), developed as a partnership between the Lockheed Corporation and the El Camino Hospital of Mountain View, California. The development of TMIS began in 1965, and TMIS was one of the first systems to support both nursing clinical documentation as well as physician computerized order entry. TMIS became operational in 1971; in the same year, Technicon Data Systems purchased it from the Lockheed Corporation and started deploying TMIS nationwide. TMIS would be centrally installed on an off-site computer and support multiple hospitals within its service area. TMIS would be connected to hospitals via dedicated telephone lines, and a switching station in each hospital would be used to route the connection to each of the patient care

III. Clinical applications

8.2 Overview of the electronic health record

135

units. Each unit would have a video display terminal (a precursor to the modern-day monitor) and printer that allowed care providers to view and print patient information. The service was so popular that by 1987, Technicon Data Systems had TMIS installed in over 85 different institutions across the United States. TMIS was eventually bought by Eclipsys, which would later merge with Allscripts, makers of an EHR that is still used by tens of thousands of hospitals and physician practices today.4 No review of EHRs is complete without discussing the Office of the National Coordinator for Health Information Technology (ONC) and the Health Information Technology for Economic and Clinical Health (HITECH) Act. Former president George W. Bush issued an executive order in 2004 to create ONC to promote and oversee the development and deployment of national health information technology. ONC would later take a much larger role in defining the evolution of EHRs in 2009, when former president Barack Obama signed the HITECH Act into legislation, charging ONC with the dissemination of EHRs as well as setting the standards for their use. Financial incentives, totaling over $29 billion, were offered to over 4800 hospitals and 450,000 physician practices to catalyze the widespread adoption of EHRs nationwide.5 As a result of this initiative, the number of nonfederal acute care hospitals with EHRs increased from 9.4% to 83.8% between 2008 and 2015.6 A brief look into the history of EHRs sheds light on their required fundamental functions. Health records standardize information entry, aggregate and preserve patient data, and improve communication between clinicians. The transition from paper to electronic records enabled designers to incorporate additional features we have now come to expect of EHRs: visualization of vast amounts of clinical data, quick and simultaneous access by multiple users, and the potential to implement clinical decision support (CDS) systems.

8.2.2 Core functions of an electronic health record

The EHR is more than just an electronic version of the historical paper records of Plummer's era. In 2003 the Institute of Medicine identified eight core functions that an EHR should be capable of performing to improve the quality, safety, and efficiency of care delivery.7 The six worth reviewing for the purposes of this chapter are as follows:

1. Health information and data. First and foremost, an EHR system must enable access to key clinical information required to make timely and appropriate treatment decisions. Much research shows that the information needs of healthcare providers are often not met, leading to inefficient or suboptimal care.8,9 EHR systems must be built with defined data structures to store and maintain patient demographics, medication lists, allergies, diagnoses, test results, clinical assessments, and treatments to ensure access to the information necessary for patient care.
2. Results management. Managing results in an EHR has many notable benefits over paper-based results. Computerized results are not constrained to a physical space and thus can be viewed by multiple providers in different locations at the time and place that they are needed. Providers can view test results in relationship to previous results to make informed decisions about how to react. In addition, EHRs make previous results easier to find, reducing redundant or unnecessary testing and potentially improving efficiency and cost of care.8,10
3. Order entry/management. Computerized provider order entry (CPOE) has numerous benefits, such as decreasing the number of lost orders, reducing ambiguities from illegible handwriting, monitoring for duplicate orders, and quickening the speed with which providers can write orders.11-13 Striking evidence in favor of CPOE revolves around medication order entry. Relatively simple CPOE systems have been shown to reduce medication errors by 80% simply by implementing "forcing functions" for medication dosages and frequency.13,14 Similar benefits have been appreciated for nearly all components of the healthcare experience, such as laboratory, radiology, microbiology, pathology, nursing, specialty consults, and ancillary services.
4. Decision support. Utilizing the wealth of information stored in the EHR, computer-based decision support systems have demonstrated utility in improving care delivery, ranging from minimizing drug-drug interactions to detecting disease.14,15 Computer-based decision support tools can be used to identify adverse events, hospital-acquired infections, and disease outbreaks.16-18 We will delve deeper into the topic of decision support systems later in this chapter, as they have been of large interest given their appropriateness for AI.
5. Electronic communication and connectivity. Effective communication around clinical care has been shown to reduce adverse events in healthcare.19,20 Patients often interact asynchronously with multiple providers in multiple settings, and electronic connectivity has greatly improved care coordination. Electronic tools have similarly facilitated patient-physician communication, allowing for greater patient participation in and ownership of their care.21
6. Reporting. It has long been known that manual chart abstraction leads to a significant number of errors.22 EHRs store clinical data with standardized terminology and in machine-readable formats. Common examples of such coding standards include the International Classification of Diseases (ICD), Logical Observation Identifiers Names and Codes (LOINC), and Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT). Research, quality improvement, and health system operations can all benefit from the improvement in accuracy and decrease in associated costs that standardized EHR data storage can provide.

8.2.3 Electronic health record ontologies and data standards

In order to perform the core functions discussed earlier, EHRs must abide by certain industry standards. We will focus on the ones that pertain to the clinical data being stored. Interoperability of EHRs, such that patient information can be easily shared across systems, has long been viewed as one of the key potential benefits of EHRs for patients. In the current US healthcare system, patients often navigate multiple healthcare providers and health systems, and the ability to seamlessly transfer information from one EHR to the next could improve continuity of care, reduce repeat studies, and certainly would be more convenient for patients. Interoperability has percolated to the forefront of the regulatory landscape, with the Centers for Medicare and Medicaid Services (CMS) renaming the EHR
incentives program, originally implemented to incentivize EHR adoption, as the "Promoting Interoperability" (PI) program. Furthermore, the ONC and the Department of Health and Human Services have, as of January 2020, proposed new rules and health IT certification criteria that contain embedded requirements for health IT systems to adhere to a common data interoperability standard and to provide application programming interfaces (APIs) allowing external applications (e.g., the Apple Health app) to interface with the EHR. The dominant data standard directly referenced in this proposed rule is the Fast Healthcare Interoperability Resources (FHIR) standard, developed by Health Level 7 (HL7), an international standards organization. FHIR defines a standard data framework with which to store healthcare data, with emphasis on ease of implementation, interoperability, flexibility (basic data structures are extensible based on the needs of a particular implementation), and use of web-based standards for ease of development of applications using FHIR. Nevertheless, the interoperability of EHRs will still rely on consistent implementation of FHIR in order to ensure that the same data element "means" the same thing across different implementations. To aid with consistency of the meaning of healthcare data elements, certain standardized healthcare ontologies are frequently used, including the International Statistical Classification of Diseases (ICD) terminologies, which contain a comprehensive mapping of diseases as well as procedures; Current Procedural Terminology (CPT), which describes a comprehensive list of healthcare procedures for billing purposes; and others such as SNOMED-CT, RxNorm, and the National Library of Medicine's Unified Medical Language System (UMLS). Thus far, the ability of organizations and researchers to interact with an EHR at large scale has been limited to what specific vendors provide. For example, EPIC, arguably the most dominant EHR company, provides its own versions of data warehouses, including EPIC Chronicles, Clarity, and Caboodle (formerly Cogito), each of which contains varying levels of granularity, with Caboodle being the leanest and most suitable for research applications. Greater interoperability through implementation of data standards or conversion of existing data into standard frameworks will allow healthcare organizations and researchers to access, pool, and analyze ever larger amounts of data. What can be done using these data will be the subject of the next several sections.
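To make the FHIR model concrete, the short sketch below queries a hypothetical FHIR server for a patient's serum creatinine observations over FHIR's standard REST interface. The base URL and patient identifier are placeholders, authentication (e.g., SMART on FHIR tokens) is omitted, and the LOINC code shown is the commonly used code for serum creatinine; this is an illustrative sketch rather than a production integration.

```python
# Minimal sketch: querying a FHIR server for a patient's laboratory observations.
# The base URL and patient ID are hypothetical; real servers require authentication
# (e.g., SMART on FHIR OAuth2 tokens), which is omitted here for brevity.
import requests

FHIR_BASE = "https://fhir.example-hospital.org/R4"   # hypothetical endpoint
PATIENT_ID = "12345"                                  # hypothetical patient

# FHIR search: all Observations for this patient coded as serum creatinine
# (LOINC 2160-0), newest first.
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": PATIENT_ID, "code": "http://loinc.org|2160-0",
            "_sort": "-date", "_count": 10},
    headers={"Accept": "application/fhir+json"},
    timeout=30,
)
resp.raise_for_status()
bundle = resp.json()  # a FHIR Bundle resource containing Observation entries

for entry in bundle.get("entry", []):
    obs = entry["resource"]
    value = obs.get("valueQuantity", {})
    print(obs.get("effectiveDateTime"), value.get("value"), value.get("unit"))
```

Because conformant servers return Observation resources in the same JSON shape, the same client code can, in principle, be pointed at any FHIR-enabled EHR.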

8.3 Clinical decision support

From the early works of the Leeds Abdominal Pain System in the 1960s to the modern-day best practice advisory (BPA) alerts for documented allergies, dose recommendations, and possible drug-drug interactions, healthcare has been exploring how best to aid clinicians in their diagnostic and treatment decisions. In diagnostics, healthcare relies on a wealth of clinical risk scores to aid in accurate diagnosis of disease and risk stratification. However, these scores are typically conceptually simple in their design. Consider, for instance, the National Early Warning Score (NEWS), a score developed to detect patients at risk of cardiac arrest, unanticipated admission to an intensive care unit, or death.23 NEWS considers seven discrete variables (respiratory rate, oxygen saturation, supplemental oxygen, temperature, systolic blood pressure, heart rate, and mental status), assigning each variable an independent score and then adding them all together to create a composite score
representing the likelihood of clinical deterioration. In calculating a risk score in such a manner, we lose an understanding of the intricate relationships between variables, clinical interventions, and their evolution over time. This understanding motivates our discussion of the evolution of CDS, and how past works have influenced current approaches to improving CDS with AI.
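To illustrate why such scores are considered conceptually simple, the sketch below computes a NEWS-like composite by banding each vital sign independently and summing the results. The thresholds are approximations of the published chart and are shown for illustration only; any clinical use requires the official Royal College of Physicians specification.

```python
# Illustrative sketch of an additive early-warning score in the spirit of NEWS.
# Thresholds approximate the published chart and are for illustration only.

def band(value, cutpoints, scores):
    """Return the score of the band that `value` falls into."""
    for cut, score in zip(cutpoints, scores):
        if value <= cut:
            return score
    return scores[-1]

def news_like_score(resp_rate, spo2, on_oxygen, temp_c, sbp, heart_rate, alert):
    total = 0
    total += band(resp_rate, [8, 11, 20, 24], [3, 1, 0, 2, 3])
    total += band(spo2, [91, 93, 95], [3, 2, 1, 0])
    total += 2 if on_oxygen else 0
    total += band(temp_c, [35.0, 36.0, 38.0, 39.0], [3, 1, 0, 1, 2])
    total += band(sbp, [90, 100, 110, 219], [3, 2, 1, 0, 3])
    total += band(heart_rate, [40, 50, 90, 110, 130], [3, 1, 0, 1, 2, 3])
    total += 0 if alert else 3          # anything other than "Alert" scores 3
    return total

# Example: a mildly tachypneic, hypotensive patient on supplemental oxygen.
print(news_like_score(resp_rate=22, spo2=93, on_oxygen=True,
                      temp_c=38.4, sbp=98, heart_rate=112, alert=True))
```

Each variable contributes independently and additively, which is exactly the simplification that discards interactions between variables and their trajectories over time.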

8.3.1 Healthcare primed for clinical decision support

We touched briefly on the HITECH Act of 2009 and its key role in the widespread adoption of EHRs across the United States. To explore the influence of this policy further, we must look at the rules of meaningful use, as well as the regulations set forth by the federal government to ensure that improvements in infrastructure led to improvements in care. The meaningful use regulations were implemented in a staged approach over Phases I-III. Phases I and II focused primarily on basic EHR adoption and interoperability. Phase III, arguably the most ambitious of the three, utilizes the capabilities of these new EHRs to incorporate CDS to improve the quality, efficacy, and safety of care delivered at qualifying sites.24 Before exploring current implementations of CDS, we will review early approaches to CDS as an overview of different paradigms of automated decision-making that continue to guide modern CDS development. One of the earliest uses for CDS came in the form of diagnostic aids. In the late 1960s to early 1970s, de Dombal et al. created the Leeds Abdominal Pain System, a CDS to calculate the probability of seven possible underlying etiologies for acute abdominal pain.25 Their system compared a new presentation of acute abdominal pain with a database of 600 prior presentations and, using Bayesian probability theory across 42 clinical attributes, computed the likelihood that the new presentation was due to one of the seven diagnoses for abdominal pain. In the evaluation of their system,26 they analyzed 304 patients who were admitted to the surgical unit at the General Infirmary at Leeds with complaints of acute abdominal pain. Details of each patient case were input into the CDS as the patients were simultaneously evaluated by the clinical team. A "real-time" diagnosis was made among one of the seven possible etiologies for acute abdominal pain (appendicitis, cholecystitis, small bowel obstruction, pancreatitis, perforated peptic ulcer, diverticular disease, and "nonspecific abdominal pain"). The computer-generated diagnoses were not revealed to the clinical team so as not to bias their decision-making. The computer-generated diagnoses were then compared to the diagnoses made by the clinical team as well as the ultimate diagnoses confirmed at surgery. The study found that while clinicians were 65%-80% correct (depending on their level of training), the Leeds Abdominal Pain System achieved an overall diagnostic accuracy of 91.8%. Most impressive was the system's accuracy for appendicitis, a diagnosis that is difficult to differentiate from other causes of acute abdominal pain and frequently overdiagnosed given its high morbidity and mortality. The Leeds Abdominal Pain System correctly classified 84 out of 85 patients with acute appendicitis while only misclassifying 6 nonspecific abdominal pain patients as appendicitis. Meanwhile, its clinician counterparts initially missed 6 patients with appendicitis (resulting in delays in surgery of >8 hours) while
subjecting 27 patients with nonspecific abdominal pain to unnecessary surgery due to an incorrect diagnosis of appendicitis. Of note, subsequent implementations of the Leeds Abdominal Pain System never achieved the same diagnostic accuracy in other settings as it did in its initial evaluation. This discrepancy is likely explained by variation in the interpretation of the 42 clinical features that the CDS relies on human clinicians to input, and it is an early example of the importance of independent testing in real-world settings. Clinical features such as "abdominal rigidity," "rebound tenderness," and "pain severity" may be interpreted and assessed differently between clinicians based on their prior training, previous experiences, and culture. Implementations of CDS and best practice alerts (BPAs) have ranged from the simple, such as drug-drug interaction alerts, to more complex systems such as automated calculation and display of risk scores and practice guidelines.27 To date, randomized trials of CDSs and BPAs have produced mixed results, with some indicating possibly clinically significant improvements in endpoints ranging from better adherence to antibiotic prescribing guidelines and rapid streptococcus testing28 to improvement in the detection of and workup/intervention for pediatric hypertension.29 A 2017 metaanalysis of trials investigating CDS for VTE prevention among surgical patients found a pooled improvement in adherence to VTE prophylaxis guidelines and, intriguingly, a reduction in VTE (odds ratio 0.78, 95% confidence interval 0.72-0.85).30 CDSs have also been described to help improve appropriate antibiotic31 and blood product use.32
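The Bayesian logic of the Leeds system described above can be approximated with a naive Bayes classifier over discrete clinical attributes, as sketched below. The toy attributes, case counts, and diagnoses here are invented for illustration; the original system used 42 attributes and a database of 600 prior presentations.

```python
# Minimal sketch of a Leeds-style Bayesian diagnostic aid. The toy "database"
# of prior cases, the three attributes, and their values are invented for
# illustration only.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Columns: pain site (0=right lower quadrant, 1=epigastric, 2=diffuse),
#          rebound tenderness (0/1), nausea (0/1)
X_train = np.array([
    [0, 1, 1], [0, 1, 0], [0, 0, 1],      # appendicitis presentations
    [1, 0, 1], [1, 0, 0],                 # perforated peptic ulcer
    [2, 0, 0], [2, 0, 1], [2, 0, 0],      # nonspecific abdominal pain
])
y_train = ["appendicitis"] * 3 + ["perforated ulcer"] * 2 + ["nonspecific"] * 3

model = CategoricalNB()          # naive Bayes over categorical features
model.fit(X_train, y_train)

# A new presentation: right-lower-quadrant pain with rebound tenderness and nausea.
new_case = np.array([[0, 1, 1]])
for diagnosis, prob in zip(model.classes_, model.predict_proba(new_case)[0]):
    print(f"{diagnosis}: {prob:.2f}")
```

The posterior probabilities depend entirely on how consistently the input attributes are recorded, which is exactly where the Leeds system faltered when transported to new settings.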

8.4 Areas of artificial intelligence augmentation for electronic health records

Healthcare has long been fraught with high costs, diagnostic errors, workflow inefficiencies, increasing administrative complexities, and diminishing time between patients and their clinicians. In 2011 US healthcare costs were estimated at nearly 18% of the nation's gross domestic product.33 At an expected growth of nearly 6% per year, projections suggest that the health share of the nation's gross domestic product will reach nearly 20% by 2020. These increasing costs are compounded by the fact that at least 1 in 20 US adults suffers from a diagnostic error,34 approximately one-half of which have the potential to lead to serious harm.35 These deficiencies have motivated healthcare leaders and the AI community to work together to explore the role that AI may play in improving care delivery and decreasing costs.

8.4.1 Artificial intelligence to improve data entry and extraction

One intuitive application of AI to the EHR is the automation of administrative and documentation tasks that currently take up a significant proportion of provider time and contribute to greater rates of burnout among physicians.36 Human variation in EHR use is also a potential source of harm.37 Automated data extraction and summarization from verbose clinical records could also increase the efficiency of clinical care, as well as research and operational efforts. For instance, prediction and streamlining of physician orders is a natural application of AI.38

Natural language processing (NLP) can be used to process raw free-form and semistructured clinical text and to "translate" this human language into a structured set of data elements.39 Machine learning and deep learning techniques can then be utilized to glean meaning, ranging from extraction of key terms (e.g., "breast cancer") that can be mapped to ontologies, to deep learning approaches that can theoretically account for complex underlying relationships between "tokenized" text. These efforts can also be more systematically consistent than human extraction, which suffers from interrater variability in domains such as acute event identification.40-42 NLP of unstructured medical text has also been shown to augment AI prediction of ICU mortality43 as well as prediction of inpatient mortality, readmissions, and length of stay (LOS) in another study that we will describe in more detail in the next section. Furthermore, NLP on speech data may offer opportunities to alleviate the documentation burden on physicians. An NLP speech analyzer could one day be used as an automated scribe44 or as part of a "chatbot"-type system for patient intake or other patient interactions. Multiple start-ups have emerged around the concept of a patient-facing chatbot system using NLP techniques to help patients look up symptoms, leveraging published clinical literature to provide possible diagnoses.
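A minimal flavor of this "free text to structured data" step is sketched below using a hand-built dictionary that maps surface phrases in a synthetic note to preferred terms and ontology codes (the ICD-10 category shown for breast cancer is the real one; the other codes are placeholders). Production pipelines rely on full terminologies such as the UMLS and on trained NLP components rather than a few regular expressions.

```python
# Minimal sketch of dictionary-based concept extraction from a clinical note and
# mapping to ontology codes. The note, the tiny lexicon, and most codes are
# illustrative placeholders for what a full terminology lookup would provide.
import re

# Toy lexicon: surface forms -> (preferred term, illustrative ontology code)
LEXICON = {
    r"\bbreast (cancer|carcinoma)\b": ("breast cancer", "ICD-10 C50"),
    r"\bshortness of breath\b|\bdyspnea\b": ("dyspnea", "SNOMED-CT (placeholder)"),
    r"\bmetformin\b": ("metformin", "RxNorm (placeholder)"),
}

note = ("62F with history of breast cancer on metformin for T2DM, "
        "presenting with worsening shortness of breath.")

def extract_concepts(text):
    found = []
    for pattern, (term, code) in LEXICON.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append((term, code))
    return found

for term, code in extract_concepts(note):
    print(f"{term:>15s} -> {code}")
```

The structured output (term plus code) is what downstream machine learning models typically consume, rather than the raw note itself.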

8.4.2 Optimizing care

AI also offers opportunities to optimize care. For instance, EHRs may facilitate the identification of low-yield, high-volume laboratory orders. One study utilized EHR data from more than 44,000 inpatients to predict the probability of a laboratory test being normal, resulting in models with an average AUROC of 0.77, and up to 0.90 for a group of 12 standalone tests including lactate dehydrogenase, creatinine, urea nitrogen, sodium, and hemoglobin.45 The implication of these results is that certain "low-yield" laboratory orders could be anticipated as such and flagged via this model, leading to a decrease in unnecessary repeat or uninformative lab orders. Other potential applications of AI in the clinical workflow include prediction of no-shows,46 automation of coding and billing, medication refills and reconciliation, and patient engagement and educational efforts, all of which promise to reduce inefficiency and costs and potentially improve patient care and experience.
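The sketch below shows how such a "will this result be normal?" task can be framed as supervised classification. The features and labels are synthetic stand-ins for real EHR-derived predictors (prior results, vitals, time since last test), so the resulting AUROC is illustrative and will not match the cited study.

```python
# Minimal sketch of low-yield laboratory order prediction as binary classification.
# All data are synthetic placeholders for real EHR features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 8))                              # synthetic EHR-derived features
logit = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=1.0, size=n)
y = (logit > 0).astype(int)                              # 1 = result turned out normal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict_proba(X_te)[:, 1]
print(f"AUROC: {roc_auc_score(y_te, pred):.2f}")

# Orders whose predicted probability of normality exceeds a chosen threshold
# could be flagged as potentially low-yield for clinician review.
flagged = (pred > 0.9).sum()
print(f"{flagged} of {len(pred)} held-out orders flagged as likely normal")
```

The clinically relevant design decision is the flagging threshold, which trades missed abnormal results against the volume of avoided low-yield orders.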

8.4.3 Predictions

The ability to predict key clinical outcomes could assist in better understanding patient clinical trajectory and prognosis. For example, accurate estimation of the risk of readmission could be used to avoid unsafe discharge plans or to ensure that more resources and ancillary services are in place to support a patient once discharged. AI has the opportunity to inform these types of clinical decisions as well as many others concerning severity of illness, diagnoses, or risk of clinical deterioration. As improvements in computing power and data storage continue to be realized, the number of publications evaluating predictive models in clinical medicine continues to grow at a staggering rate (Fig. 8.1).

FIGURE 8.1 PubMed search for ("medicine"[All Fields] OR "clinical"[All Fields] OR "health") AND ("artificial intelligence"[All Fields] OR "machine learning"[All Fields] OR "data analysis"[All Fields] OR "data science"[All Fields] OR "prediction"[All Fields]) AND ("2008/01/01"[PDAT]: "2019/12/31"[PDAT]).

In the following subsections, we will review select studies that have investigated how AI could generate meaningful results based on integration with the EHR.

8.4.4 Hospital outcomes

Some of the most important hospital metrics include inpatient mortality, 30-day readmissions (whether a patient returns to the hospital within 30 days of being discharged), and hospital LOS. These outcomes not only affect patient health but also play a large role in the reimbursement rates of health systems, so there is much incentive to track and forecast them. Many multivariable logistic regression models have been developed and studied previously, such as NEWS for mortality47 or the HOSPITAL score for 30-day readmissions.48 A recent research collaboration between the University of California, San Francisco, University of Chicago Medicine, Stanford University, and Google49 explored whether a deep learning neural network built on EHR data could produce valid predictions concerning inpatient mortality, 30-day readmissions, and hospital LOS in two different academic hospital settings. This retrospective study included a total of 216,221 hospitalizations for 114,003 unique patients, which were randomly divided into training (80%), validation (10%), and testing (10%) cohorts. Using a deep learning approach, the developers gave the model access to tens of thousands of structured (e.g., vital signs, laboratory results) and unstructured (e.g., free-text physician notes, radiology reports) predictors for each patient, resulting in over 46 billion tokens of EHR data. The model was developed to identify patterns in the training dataset that most strongly differentiated patients who experienced inpatient mortality, 30-day readmission, or a long LOS (defined as greater than or equal to 7 days) from their counterparts. For predicting inpatient mortality, the areas under the receiver operating characteristic curve (AUROCs) at 24 hours after admission were 0.95 (95% CI 0.94-0.96) and 0.93 (95% CI 0.92-0.94) in the two academic hospitals, compared to only 0.85 (95% CI 0.81-0.89) and 0.86 (95% CI 0.83-0.88) when using a logistic regression model based on NEWS. For predicting 30-day readmissions, the AUROCs at discharge were 0.77 (95% CI 0.75-0.78) and 0.76 (95% CI 0.75-0.77), significantly higher than
AUROCs of 0.70 (95% CI 0.68-0.72) and 0.68 (95% CI 0.67-0.69) obtained from the traditional HOSPITAL predictive model. For predicting long LOS, AUROCs at 24 hours after admission were 0.86 (95% CI 0.86-0.87) and 0.85 (95% CI 0.84-0.86), again significantly higher than the AUROCs of 0.76 (95% CI 0.75-0.77) and 0.74 (95% CI 0.73-0.75) generated by traditional modeling approaches. Particularly for predicting inpatient mortality, the improved AUROC translated to decreasing the false alarm rate by roughly half of what would be expected using NEWS. The results of this retrospective study demonstrated that deep learning methods might be used to accurately predict a variety of hospital outcomes. Though there are limitations with retrospective studies—many of which we will discuss at the end of this chapter—the demonstrated accuracy of their deep learning algorithm certainly warrants further exploration of the utility such methods may bring to prospective patient care. In addition to general hospital outcomes, other groups have investigated the utility of deep learning methods to predict certain potentially harmful disease states. Acute kidney injury (AKI), characterized by an abrupt decrease in kidney function leading to a buildup of waste products normally filtered from the blood, affects approximately one in five inpatient admissions in the United States.50 More importantly, AKI is associated with an over fourfold increase in mortality. Current approaches to detecting AKI rely on sudden increases in serum creatinine, a waste product of muscle metabolism that is normally excreted by the kidneys. However, changes in serum creatinine are known to lag behind the physiologic injury to the kidney. This results in a delay of treatment for a clinical scenario in which AKI is believed to be preventable with early treatment.51 A recent study52 developed a deep learning model based on EHR data to predict the likelihood of a patient developing AKI in the next 48 hours. The model was trained on over 700,000 adult patients from the US Department of Veterans Affairs, the largest integrated healthcare system in the US, spanning 172 inpatient and 1062 outpatient settings. Patients were randomly divided into training (80%), validation (5%), calibration (5%), and testing (10%) cohorts. A total of 620,000 EHR predictors were used in training the model, which resulted in approximately 6 billion tokens of information. At an operating point of two false-positive predictions for every true-positive prediction, the model was capable of predicting about 55% of inpatient AKI events within 48 hours of occurrence, corresponding to an AUROC of 0.92. When compared against a baseline model using traditional gradient-boosted trees with expert-selected features, the baseline was only able to capture 36% of all AKI events within 48 hours of occurrence at the same operating point. The question remains as to how clinically useful these operating characteristics are in real practice, as a sensitivity of 55% with an alert system that raises a true positive only 33% of the time may not be clinically actionable. Another recent study utilized a highly curated single-institution EHR data repository, including procedure and diagnosis codes, medications, and vitals data, in order to improve prediction of postsurgical complication rates.
Their models achieved AUROCs between 0.75 and 0.92 predicting various outcomes ranging from postoperative shock to genitourinary complications and outperformed the traditional American College of Surgeons (ACS) NSQIP model.53
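The confidence intervals quoted alongside these AUROCs are commonly obtained by bootstrap resampling of the held-out test set, as sketched below with synthetic labels and scores standing in for a real model's predictions.

```python
# Minimal sketch of the bootstrap procedure often used to attach 95% CIs to
# AUROCs like those quoted above. Labels and scores are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=2000)                                # held-out outcomes
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, 2000), 0, 1)   # model scores

aucs = []
for _ in range(1000):                                   # bootstrap replicates
    idx = rng.integers(0, len(y_true), len(y_true))     # resample with replacement
    if len(np.unique(y_true[idx])) < 2:                 # need both classes present
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUROC {roc_auc_score(y_true, y_score):.2f} (95% CI {lo:.2f}-{hi:.2f})")
```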

TABLE 8.1 Systemic inflammatory response syndrome (SIRS) criteria.

Temperature: <36°C or >38°C
Heart rate: >90 bpm
Respiratory rate: >20 breaths/min or PaCO2 <32 mmHg
White blood cell count: <4000/mm3 or >12,000/mm3 or >10% bands

8.4.5 Sepsis and infections

Sepsis—a dysregulated systemic response to infection that leads to a life-threatening reaction that can cause tissue damage, organ failure, and death—represents a significant burden to the healthcare system. By some national estimates, sepsis is responsible for 4% of all hospitalizations and 6% of all deaths in the US, resulting in more than $23 billion in healthcare spending across all payers.54,55 There are 1.7 million adult sepsis-related hospitalizations per year, 270,000 of which result in an inpatient death.56 Protocol-driven care bundles administered within hours of sepsis detection have been shown to significantly improve clinical outcomes, making early detection of sepsis crucial.57,58 As a clear diagnostic marker for sepsis has yet to be identified, diagnosis has predominantly relied on clinical judgment based on suspicion for infection and end-organ dysfunction. This has made automated screening tools for sepsis of particular interest to healthcare leaders and care providers. The CMS, which is beginning to hold health systems accountable for their sepsis outcomes, uses a definition of sepsis based on the systemic inflammatory response syndrome (SIRS) criteria (Table 8.1). In 2016 the definition of sepsis was revised by a consensus group59 but has been criticized for detecting sepsis too late in the clinical course due to its reliance on end-organ dysfunction.57,60,61 Another interesting recent approach to improving clinical decision-making in sepsis utilized a reinforcement learning model to generate an AI agent that learns optimal treatment choices in terms of doses of vasopressors and intravenous fluids. When compared to clinician choices, the AI agent made decisions that appeared to lead to lower 90-day mortality. Based on the calibration curve reported, the absolute improvement appeared to translate to an approximately 5% difference in mortality, which is not insignificant. The AI agent tended to treat with slightly higher doses of vasopressors and lower amounts of IV fluid. Indeed, reinforcement learning may be well suited to learning decisions made over time; the caveats to this study include the fact that the AI agent had "immediate" access to laboratory and clinical information that may not have been available to clinicians, the 4-hour time block used as a time window (which may be too long), and the use of missing data imputation.62
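As a point of contrast with the learned models above, the SIRS definition in Table 8.1 reduces to a simple rule-based screen; a minimal sketch is shown below, using the common convention that meeting two or more criteria constitutes a positive screen. Variable names and the example values are illustrative.

```python
# Minimal sketch of a rule-based screen implementing the SIRS criteria in Table 8.1;
# by convention, meeting two or more criteria is flagged as a positive screen.

def sirs_criteria_met(temp_c, heart_rate, resp_rate, paco2_mmhg, wbc_per_mm3, band_pct):
    criteria = [
        temp_c < 36.0 or temp_c > 38.0,                      # temperature
        heart_rate > 90,                                     # heart rate
        resp_rate > 20 or paco2_mmhg < 32,                   # respiratory rate / PaCO2
        wbc_per_mm3 < 4000 or wbc_per_mm3 > 12000 or band_pct > 10,  # WBC count / bands
    ]
    return sum(criteria)

n_met = sirs_criteria_met(temp_c=38.6, heart_rate=104, resp_rate=24,
                          paco2_mmhg=36, wbc_per_mm3=13500, band_pct=4)
print(f"{n_met} of 4 SIRS criteria met -> screen "
      f"{'positive' if n_met >= 2 else 'negative'}")
```

The simplicity of such a rule is also its weakness: it ignores suspicion for infection, trajectory over time, and end-organ dysfunction, which is precisely why data-driven approaches to sepsis detection are of such interest.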

8.4.6 Oncology

There is growing interest in developing predictive models for patients undergoing cancer therapy using EHR data. Patients with advanced cancer can benefit from timely implementation of palliative care, symptom management, and hospice services. Indeed,
AI could have a significant role to play in understanding and addressing challenges in oncology, including poor clinical trial enrollment,63 disparities in oncologic care,64 rising costs,65 and nonuniform access to new knowledge and multidisciplinary expertise.66 Along these lines, one recent study utilized structured EHR data extracted from EPIC's Clarity system to generate predictions of 6-month mortality among patients with cancer, reporting an AUROC of between 0.86 and 0.88 using logistic regression, random forest, or gradient-boosted trees. Interestingly, the investigators then presented patients identified as having at least a 30% risk of 6-month mortality to clinicians, of whom 58.8% were deemed appropriate for end-of-life discussions.67 Many of the predictive factors identified in this study were related to cancer type, comorbidities, and laboratory values, which could be readily translatable to clinical use. Of note, the overall 6-month mortality rate for the cohort was only 4%. Another recent study used a similar approach, combining EHR data with a linked Social Security database to train decision tree-based models predicting mortality among patients with cancer, achieving AUROCs between 0.83 and 0.86. At a sensitivity of 60%, the maximum positive predictive value of a gradient-boosted model was 53.4% for 6-month mortality.68 The addition of unstructured free-text clinical notes was shown in a study by Gensheimer et al. to result in a robust predictive model with a c-index of 0.78 in the test set and 0.74 in a separate testing cohort.69 The model was able to identify key phrases from the clinical notes of patients diagnosed with metastatic cancer that were predictive of their illness trajectory and ultimate mortality risk. An important aspect of this approach was the modeling of outcomes as survival times with censored data, rather than as binary outcomes. According to the manuscript, future trials utilizing this predictive model to help reduce futile care and improve appropriate access to and referral for palliative care are in development. Finally, other important clinical needs in high-quality cancer care delivery represent avenues for EHR-based AI solutions. Among these, minimizing acute care has been a point of emphasis for the CMS.70 A number of models have investigated the prediction of acute care visits during outpatient cancer therapy,71,72 including one that is the subject of a recently completed prospective randomized interventional study.71
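A minimal sketch of that censoring-aware framing is shown below using the lifelines library's Cox proportional hazards model; the covariates (including a hypothetical count of high-risk phrases extracted from notes) and the survival times are synthetic, so the concordance index printed will not match the published figures.

```python
# Minimal sketch of modeling mortality as a censored time-to-event outcome
# rather than a binary label, using the lifelines library. All data are synthetic.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "albumin": rng.normal(3.8, 0.5, n),
    "note_risk_term_count": rng.poisson(2, n),   # hypothetical note-derived feature
})
# Synthetic survival times that worsen with age and note-derived risk terms
hazard = np.exp(0.03 * (df["age"] - 65) + 0.2 * df["note_risk_term_count"]
                - 0.5 * (df["albumin"] - 3.8))
event_time = rng.exponential(12 / hazard)        # months until death
censor_time = rng.exponential(18, n)             # administrative censoring
df["months"] = np.minimum(event_time, censor_time)
df["died"] = (event_time <= censor_time).astype(int)   # 0 = censored

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="died")
print(f"Concordance index: {cph.concordance_index_:.2f}")
print(cph.summary[["coef", "exp(coef)", "p"]])   # hazard ratios per covariate
```

Treating patients lost to follow-up as censored, rather than discarding them or labeling them as survivors, is what allows the c-index comparisons described above to be made fairly.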

8.5 Limitations of artificial intelligence and next steps

Despite the many examples of AI approaches that use EHR data to automate processes or to predict clinical outcomes, there has been very little meaningful adoption of AI or machine learning in actual practice. Several barriers may be to blame. First, while momentum toward EHR interoperability is growing, the EHR landscape has always been, and remains today, a fragmented one; while the leading EHRs together boast large numbers of users, there is little to no ability to share data electronically from one system to another. Even within a single EHR system such as EPIC, the ability to share patient data across health institutions using the same EHR (through the Care Everywhere system) has only recently begun to grow. EHR data itself remains siloed within institutional data infrastructures, perhaps rightfully so given data
privacy concerns. Next, it remains time-consuming and costly to construct secure data infrastructures able to interface with EHRs, and this is perhaps one of the reasons why many efforts thus far have been limited to large academic centers capable of a sustained data mining effort. Given the logistical barriers to interfacing with and mining EHR data, and the risks, including accidental patient privacy breaches resulting in potentially large monetary penalties, it would only make sense for healthcare organizations to pursue EHR data mining if the benefits clearly outweighed the costs and risks. Thus far, despite the promising exploratory work described earlier, the true clinical and financial benefits of EHR data mining remain to be seen. Applications of AI in healthcare also remain limited by the lack of prospective clinical trials demonstrating their utility. To date, all randomized studies have been diagnostic in nature, with none in the EHR space.73-78 Nearly all of the abovementioned studies were retrospective in nature, utilizing fixed datasets, and often with imputation of missing data. While certain studies took care to ensure that model performance was evaluated in a way that simulated real-time use, for example, examining the performance of the AKI model 48 hours in advance, these models still need to be tested prospectively, in real-world settings. In a real-world setting, available data are likely to be more sparse, and any imputation of data can be expected to be even less reliable. How missing data are handled and filtered can introduce significant biases into datasets—nor are missing data always uninformative, as patients with fewer healthcare interactions may in fact simply be healthier.79,80 Bias can also stem from patterns of care, which is critical to consider in the development of AI algorithms based on real-world data. A recent study showed racial bias in an algorithm used to identify patients with high medical complexity.81 This adds to a growing literature on the potential for algorithms to perpetuate existing biases in any deployment setting.82,83 Moreover, EHR data are variable not only across institutions but also across time. This potential for flux, referred to as nonstationarity, can often be due to either data collection or changes in practice patterns over time.84 New treatments that have emerged in the last 5 years in oncology, for example, have dramatically improved survival for certain patients with metastatic lung or breast cancer and may alter the performance of models trained on older data. Indeed, models trained on any dataset, no matter how large, are at risk of overfitting and invariably underperform on independent datasets from separate institutions, a fact that was demonstrated early on by the Leeds Abdominal Pain System CDS. Furthermore, while many of the studies above used machine learning methods that are considered "interpretable," such as regularized logistic or Cox regression or decision tree-based methods, the resulting decision trees or nomograms are not always intuitive or clinically sound. Indeed, the promise of "personalized medicine" through EHR mining often yields the opposite, sometimes leading to an arbitrary selection of data points and thresholds in order to sort patients into buckets. These arrangements frequently oversimplify clinical scenarios and lose the nuance in clinical understanding.
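One simple way to surface nonstationarity is to compare a random train/test split with a chronological split that trains on earlier admissions and tests on later ones, as sketched below on synthetic data in which the feature-outcome relationship deliberately drifts over time.

```python
# Minimal sketch of checking for nonstationarity by comparing a random split with
# a chronological ("train on past, test on future") split. Data are synthetic,
# with a deliberate drift in the outcome-feature relationship in the final "year".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 6000
t = np.sort(rng.uniform(0, 5, n))                 # admission "year"
X = rng.normal(size=(n, 5))
drift = np.where(t < 4, 1.5, -0.5)                # relationship flips in the last year
y = ((drift * X[:, 0] + rng.normal(size=n)) > 0).astype(int)

def split_auc(train_idx, test_idx):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    return roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1])

random_idx = rng.permutation(n)
print(f"Random split AUROC:        {split_auc(random_idx[:4800], random_idx[4800:]):.2f}")
print(f"Chronological split AUROC: {split_auc(np.arange(4800), np.arange(4800, n)):.2f}")
```

A model that looks strong under a random split but degrades under the chronological split is likely to disappoint when deployed prospectively, which is the setting that ultimately matters.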
Ultimately, AI-based algorithms need to be reliable, generalizable, and clinically actionable, demonstrating utility against a measurable and meaningful endpoint that justifies their logistical cost and risk. Prospective trials utilizing algorithms with well-designed interventions are sorely needed. Finally, particular care needs to be exercised in ensuring that
systemic biases and disparities that are prevalent in society and in healthcare are not unknowingly perpetuated within algorithms.

References 1. Rahmini A. Ali Rahimi NIPS 2017 Test-of-Time Award Presentation; 2017. ,https://www.youtube.com/ watch?v 5 ORHFOnaEzPc. [accessed 07.11.19]. 2. Ross C, Swetlitz I. IBM’s Watson supercomputer recommended ‘unsafe and incorrect’ cancer treatments, internal documents show; 2018. ,https://www.statnews.com/2018/07/25/ibm-watson-recommended-unsafe-incorrecttreatments/. [accessed 17.11.19]. 3. Melton LJ. History of the Rochester Epidemiology Project. Mayo Clin Proc 1996;71(3):266 74. Available from: https://doi.org/10.4065/71.3.266. 4. Home | Allscripts. ,https://www.allscripts.com/.. [accessed 10.02.20]. 5. The Office of the National Coordinator for Health Information Technology (ONC). Federal health IT strategic plan: 2015 2020. 2014. ,https://dashboard.healthit.gov/strategic-plan/federal-health-it-strategic-plan-20152020.php. [accessed 01.11.17]. 6. Henry J, Pylypchuk Y, Searcy T, Patel V. Adoption of electronic health record systems among U.S. nonfederal acute care hospitals: 2008-2015. ONC Data Brief No 35 2016;35:1 11. 7. Institute of Medicine. Key capabilities of an electronic health record system: letter report. Washington, DC: National Academies Press; 2003. Available from: https://doi.org/10.17226/10781. 8. Bates DW, Ebell M, Gotlieb E, Zapp J, Mullins HC. A proposal for electronic medical records in U.S. Primary Care. J Am Med Inform Assoc 2003;10(1):1 10. Available from: https://doi.org/10.1197/jamia.M1097. 9. McKnight L, Stetson PD, Bakken S, Curran C, Cimino JJ. Perceived information needs and communication difficulties of inpatient physicians and nurses. Proc AMIA Symp 2001;453 7. Available from: https://doi.org/ 10.1197/jamia.m1230. 10. Shea S, Starren J, Weinstock RS, et al. Columbia University’s informatics for diabetes education and telemedicine (IDEATel) project. J Am Med Inform Assoc 2002;9(1):25 36. Available from: https://doi.org/10.1136/ jamia.2002.0090025. 11. Lepage EF, Gardner RM, Laub RM, Golubjatnikov OK. Improving blood transfusion practice: role of a computerized hospital information system. Transfusion (Paris) 1992;32(3):253 9. Available from: https://doi.org/ 10.1046/j.1537-2995.1992.32392213810.x. 12. Mekhjian HS, Kumar RR, Kuehn L, et al. Immediate benefits realized following implementation of physician order entry at an academic medical center. J Am Med Inform Assoc 2002;9(5):529 39. Available from: https:// doi.org/10.1197/jamia.M1038. 13. Sittig DF, Stead WW. Computer-based physician order entry: the state of the art. J Am Med Inform Assoc 1994;1(2):108 23. Available from: https://doi.org/10.1136/jamia.1994.95236142. 14. Bates DW, Gawande AA. Improving Safety with Information Technology. N Engl J Med 2003;348(25):2526 34. Available from: https://doi.org/10.1056/NEJMsa020847. 15. Hunt DL, Haynes RB, Hanna SE, Smith K. Effects of computer-based clinical decision support systems on physician performance and patient outcomes. JAMA 1998;280(15):1339. Available from: https://doi.org/ 10.1001/jama.280.15.1339. 16. Bates DW, Cohen M, Leape LL, Overhage JM, Shabot MM, Sheridan T. Reducing the frequency of errors in medicine using information technology. J Am Med Inform Assoc 2001;8(4):299 308. Available from: https:// doi.org/10.1136/jamia.2001.0080299. 17. Evans RS, Larsen R, Burke JP, et al. Computer surveillance of hospital-acquired infections and antibiotic use. JAMA J Am Med Assoc 1986;256(8):1007. Available from: https://doi.org/10.1001/jama.1986.03380080053027. 18. Pavlin JA. 
Investigation of disease outbreaks detected by “syndromic” surveillance systems. J Urban Health 2003;80(2 Suppl. 1):107 14. Available from: https://doi.org/10.1007/pl00022321. 19. Li P, Ali S, Tang C, Ghali WA, Stelfox HT. Review of computerized physician handoff tools for improving the quality of patient care. Vol 8; 2013. Available from: https://doi.org/10.1002/jhm.1988. 20. Schiff GD, Klass D, Peterson J, Shah G, Bates DW. Linking laboratory and pharmacy. Arch Intern Med 2003;163 (8):893. Available from: https://doi.org/10.1001/archinte.163.8.893.

21. Liederman EM, Morefield CS. Web messaging: a new tool for patient-physician communication. J Am Med Inform Assoc 2003;10(3):260 70. Available from: https://doi.org/10.1197/jamia.M1259. 22. Green J, Wintfeld N. How accurate are hospital discharge data for evaluating effectiveness of care? Med Care 1993;31(8):719 31. Available from: https://doi.org/10.1097/00005650-199308000-00005. 23. Royal College of Physicians. National Early Warning Score (NEWS): standardising the assessment of acute-illness severity in the NHS. 2012:47. 24. Blumenthal D, Tavenner M. The “meaningful use” regulation for electronic health records. N Engl J Med 2010;363(6):501 4. 25. de Dombal FT, Horrocks JC, Staniland JR, Guillou PJ. Construction and uses of a “data-base” of clinical information concerning 600 patients with acute abdominal pain. Proc R Soc Med 1971;64. 26. De Dombal FT, Leaper DJ, Staniland JR, McCann AP, Horrocks JC. Computer-aided diagnosis of acute abdominal pain. Stud Health Technol Inform 1997;36:27 31. Available from: https://doi.org/10.3233/978-160750-880-9-27. 27. Sperl-Hillen JM, Crain AL, Margolis KL, et al. Clinical decision support directed to primary care patients and providers reduces cardiovascular risk: a randomized trial. J Am Med Inform Assoc 2018;25(9):1137 46. Available from: https://doi.org/10.1093/jamia/ocy085. 28. McGinn TG, McCullagh L, Kannry J, et al. Efficacy of an evidence-based clinical decision support in primary care practices: a randomized clinical trial. JAMA Intern Med 2013;173(17):1584 91. Available from: https:// doi.org/10.1001/jamainternmed.2013.8980. 29. Kharbanda EO, Asche SE, Sinaiko AR, et al. Clinical decision support for recognition and management of hypertension: a randomized trial. Pediatrics 2018;141(2). Available from: https://doi.org/10.1542/peds.2017-2954. 30. Borab ZM, Lanni MA, Tecce MG, Pannucci CJ, Fischer JP. Use of computerized clinical decision support systems to prevent venous thromboembolism in surgical patients: a systematic review and meta-analysis. JAMA Surg 2017;152(7):638 45. Available from: https://doi.org/10.1001/jamasurg.2017.0131. 31. Pestotnik SL, Classen DC, Evans RS, Burke JP. Implementing antibiotic practice guidelines through computerassisted decision support: clinical and financial outcomes. Ann Intern Med 1996;124(10):884 90. Available from: https://doi.org/10.7326/0003-4819-124-10-199605150-00004. 32. Goodnough LT, Shieh L, Hadhazy E, Cheng N, Khari P, Maggio P. Improved blood utilization using real-time clinical decision support. Transfusion (Paris) 2014;54(5):1358 65. Available from: https://doi.org/10.1111/trf.12445. 33. Keehan SP, Sisko AM, Truffer CJ, et al. National health spending projections through 2020: Economic recovery and reform drive faster spending growth. Health Aff (Millwood) 2011;30(8):1594 605. Available from: https:// doi.org/10.1377/hlthaff.2011.0662. 34. Singh H, Meyer AND, Thomas EJ. The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Qual Saf 2014;23(9):727 31. Available from: https://doi.org/10.1136/bmjqs-2013-002627. 35. Singh H, Giardina TD, Meyer AND, Forjuoh SN, Reis MD, Thomas EJ. Types and origins of diagnostic errors in primary care settings. JAMA Intern Med 2013;173(6):418 25. Available from: https://doi.org/10.1001/ jamainternmed.2013.2777. 36. Longhurst CA, Davis T, Maneker A, et al. Local investment in training drives electronic health record user satisfaction. Appl Clin Inform 2019;10(2):331 5. 
Available from: https://doi.org/10.1055/s-0039-1688753. 37. Cohen GR, Friedman CP, Ryan AM, Richardson CR, Adler-Milstein J. Variation in physicians’ electronic health record documentation and potential patient harm from that variation. J Gen Intern Med 2019;34 (11):2355 67. Available from: https://doi.org/10.1007/s11606-019-05025-3. 38. Chen JH, Podchiyska T, Altman RB. OrderRex: clinical order decision support and outcome predictions by data-mining electronic medical records. J Am Med Inform Assoc 2016;23(2):339 48. Available from: https:// doi.org/10.1093/jamia/ocv091. 39. Yim W, Yetisgen M, Harris WP, Kwan SW. Natural language processing in oncology: a review. JAMA Oncol 2016;2(6):797 804. Available from: https://doi.org/10.1001/jamaoncol.2016.0213. 40. Miller TP, Li Y, Kavcic M, et al. Accuracy of adverse event ascertainment in clinical trials for pediatric acute myeloid leukemia. J Clin Oncol 2016;34(13):1537 43. Available from: https://doi.org/10.1200/JCO.2015.65.5860. 41. Fairchild AT, Tanksley JP, Tenenbaum JD, Palta M, Hong JC. Interrater reliability in toxicity identification: limitations of current standards [published online ahead of print, 2020 May 3]. Int J Radiat Oncol Biol Phys 2020;20(31084-1) S:0360 3016. Available from: https://doi.org/10.1016/j.ijrobp.2020.04.040.

42. Hong JC, Tanksley J, Niedzwiecki D, Palta M, Tenenbaum JD. Accuracy of a natural language processing pipeline to identify patient symptoms during radiation therapy. Int J Radiat Oncol 2019;105(1):S70. Available from: https://doi.org/10.1016/j.ijrobp.2019.06.522. 43. Marafino BJ, Park M, Davies JM, et al. Validation of prediction models for critical care outcomes using natural language processing of electronic health record data. JAMA Netw Open 2018;1(8): e185097. Available from: https://doi.org/10.1001/jamanetworkopen.2018.5097. e185097. 44. Finley G, Edwards E, Robinson A, et al. An automated medical scribe for documenting clinical encounters. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: demonstrations. New Orleans, LA: Association for Computational Linguistics; 2018. p. 11 15. Available from: https://doi.org/10.18653/v1/N18-5003. 45. Xu S, Hom J, Balasubramanian S, et al. Prevalence and predictability of low-yield inpatient laboratory diagnostic tests. JAMA Netw Open 2019;2(9):e1910967. Available from: https://doi.org/10.1001/ jamanetworkopen.2019.10967. 46. Nelson A, Herron D, Rees G, Nachev P. Predicting scheduled hospital attendance with artificial intelligence. NPJ Digit Med 2019;2(1):1 7. Available from: https://doi.org/10.1038/s41746-019-0103-3. 47. Smith GB, Prytherch DR, Meredith P, Schmidt PE, Featherstone PI. The ability of the National Early Warning Score (NEWS) to discriminate patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death. Resuscitation 2013;84(4):465 70. Available from: https://doi.org/10.1016/ j.resuscitation.2012.12.016. 48. Donze J, Aujesky D, Williams D, Schnipper JL. Potentially avoidable 30-day hospital readmissions in medical patients: derivation and validation of a prediction model. JAMA Intern Med 2013;173(8):632 8. Available from: https://doi.org/10.1001/jamainternmed.2013.3023. 49. Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med 2018;1(1):1609. Available from: https://doi.org/10.1038/s41746-018-0029-1. 50. Wang HE, Muntner P, Chertow GM, Warnock DG. Acute kidney injury and mortality in hospitalized patients. Am J Nephrol 2012;35(4):349 55. Available from: https://doi.org/10.1159/000337487. 51. MacLeod A. NCEPOD report on acute kidney injury-must do better, 374. Elsevier Limited; 2009. Available from: https://doi.org/10.1016/S0140-6736(09)61843-2. 52. Tomaˇsev N, Glorot X, Rae JW, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 2019;572:1 21. Available from: https://doi.org/10.1038/s41586-019-1390-1. 53. Corey KM, Kashyap S, Lorenzi E, et al. Development and validation of machine learning models to identify highrisk surgical patients using automatically curated electronic health record data (Pythia): a retrospective, single-site study. PLoS Med 2018;15(11): e1002701. Available from: https://doi.org/10.1371/journal.pmed.1002701. 54. Epstein L, Dantes R, Magill S, Fiore A. Varying estimates of sepsis mortality using death certificates and administrative codes—United States, 1999 2014. MMWR Morb Mortal Wkly Rep 2016;65(13):342 5. Available from: https://doi.org/10.15585/mmwr.mm6513a2. 55. Torio CM, Moore BJ. National inpatient hospital costs: the most expensive conditions by pay; 2013. 2016. p. 1 15. Available from: https://doi.org/10.1377/hlthaff.2015.1194.3. 56. Rhee C, Dantes R, Epstein L, et al. 
Incidence and trends of sepsis in US hospitals using clinical vs claims data, 2009-2014. JAMA 2017;318(13):1241. Available from: https://doi.org/10.1001/jama.2017.13836. 57. Levy MM, Dellinger RP, Townsend SR, et al. The Surviving Sepsis Campaign: results of an international guideline-based performance improvement program targeting severe sepsis. Intensive Care Med 2010;36 (2):222 31. Available from: https://doi.org/10.1007/s00134-009-1738-3. 58. Seymour CW, Gesten F, Prescott HC, et al. Time to treatment and mortality during mandated emergency care for sepsis. N Engl J Med 2017;376(23):2235 44. Available from: https://doi.org/10.1056/NEJMoa1703058. 59. Singer M, Deutschman CS, Seymour CW, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 2016;315(8):801. Available from: https://doi.org/10.1001/jama.2016.0287. 60. Cortes-Puch I, Hartog CS. Opening the debate on the new sepsis definition: change is not necessarily progress: Revision of the sepsis definition should be based on new scientific insights. Am J Respir Crit Care Med 2016;194(1):16 18. Available from: https://doi.org/10.1164/rccm.201604-0734ED. 61. Lin A, Sendak M, Bedoya A, et al. What is sepsis: investigating the heterogeneity of patient populations captured by different sepsis definitions. Am Thorac Soc Int Conf 2018;197.. Available from: https://doi.org/ 10.1164/ajrccm-conference.2018.197.1_MeetingAbstracts.A3299.

62. Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med 2018;24(11):1716 20. Available from: https://doi. org/10.1038/s41591-018-0213-5. 63. Nipp RD, Hong K, Paskett ED. Overcoming barriers to clinical trial enrollment. Am Soc Clin Oncol Educ Book 2019;39:105 14. Available from: https://doi.org/10.1200/EDBK_243729. 64. O’Keefe EB, Meltzer JP, Bethea TN. Health disparities and cancer: racial disparities in cancer mortality in the United States, 2000-2010. Front Public Health 2015;3:51. Available from: https://doi.org/10.3389/fpubh.2015.00051. 65. Tran G, Zafar SY. Financial toxicity and implications for cancer care in the era of molecular and immune therapies. Ann Transl Med 2018;6(9). Available from: https://doi.org/10.21037/atm.2018.03.28. 66. Kalra M, Karuturi M, Jankowitz R, et al. Dissemination of breast cancer knowledge and expertise from NCICCC tumor boards with community oncologists. J Clin Oncol 2018;36(15). Presented at the: ,https://ascopubs.org/doi/abs/10.1200/JCO.2018.36.15_suppl.e18575. [accessed 08.02.20]. 67. Parikh RB, Manz C, Chivers C, et al. Machine learning approaches to predict 6-month mortality among patients with cancer. JAMA Netw Open 2019;2(10): e1915997. Available from: https://doi.org/10.1001/ jamanetworkopen.2019.15997. 68. Bertsimas D, Dunn J, Pawlowski C, et al. Applied informatics decision support tool for mortality predictions in patients with cancer. JCO Clin Cancer Inform 2018;2(2):1 11. Available from: https://doi.org/10.1200/CCI.18.00003. 69. Gensheimer MF, Henry AS, Wood DJ, et al. Automated survival prediction in metastatic cancer patients using high-dimensional electronic medical record data. J Natl Cancer Inst 2019;111(6):568 74. Available from: https://doi.org/10.1093/jnci/djy178. 70. Admissions and emergency department (ED) visits for patients receiving outpatient chemotherapy. 2019. ,https:// cmit.cms.gov/CMIT_public/ViewMeasure?MeasureId 5 2929. [accessed 19.12.19]. 71. Hong JC, Niedzwiecki D, Palta M, Tenenbaum JD. Predicting emergency visits and hospital admissions during radiation and chemoradiation: an internally validated pretreatment machine learning algorithm. JCO Clin Cancer Inform 2018;2(2):1 11. Available from: https://doi.org/10.1200/CCI.18.00037. 72. Brooks GA, Uno H, Aiello Bowles EJ, et al. Hospitalization risk during chemotherapy for advanced cancer: development and validation of risk stratification models using real-world data. JCO Clin Cancer Inform 2019;3:1 10. Available from: https://doi.org/10.1200/CCI.18.00147. 73. Gong D, Wu L, Zhang J, et al. Detection of colorectal adenomas with a real-time computer-aided system (ENDOANGEL): a randomised controlled study. Lancet Gastroenterol Hepatol 2020;5(4):352 61. Available from: https://doi.org/10.1016/S2468-1253(19)30413-3. 74. Lin H, Li R, Liu Z, et al. Diagnostic efficacy and therapeutic decision-making capacity of an artificial intelligence platform for childhood cataracts in eye clinics: a multicentre randomized controlled trial. EClin Med 2019;9:52 9. Available from: https://doi.org/10.1016/j.eclinm.2019.03.001. 75. Su J-R, Li Z, Shao X-J, et al. Impact of a real-time automatic quality control system on colorectal polyp and adenoma detection: a prospective randomized controlled study (with videos). Gastrointest Endosc 2020;91(2): 415-424.e4 Available from: https://doi.org/10.1016/j.gie.2019.08.026. 76. Wang P, Berzin TM, Glissen Brown JR, et al. 
Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut 2019;68(10):1813 19. Available from: https://doi.org/10.1136/gutjnl-2018-317500. 77. Wang P, Liu X, Berzin TM, et al. Effect of a deep-learning computer-aided detection system on adenoma detection during colonoscopy (CADe-DB trial): a double-blind randomised study. Lancet Gastroenterol Hepatol 2020;5(4):343 51. https://doi.org/10.1016/S2468-1253(19)30411-X. 78. Wu L, Zhang J, Zhou W, et al. Randomised controlled trial of WISENSE, a real-time quality improving system for monitoring blind spots during esophagogastroduodenoscopy. Gut 2019;68(12):2161 9. Available from: https://doi.org/10.1136/gutjnl-2018-317366. 79. Goldstein BA, Phelan M, Pagidipati NJ, Peskoe SB. How and when informative visit processes can bias inference when using electronic health records data for clinical research. J Am Med Inform Assoc 2019;26 (12):1609 17. Available from: https://doi.org/10.1093/jamia/ocz148. 80. Weber GM, Adams WG, Bernstam EV, et al. Biases introduced by filtering electronic health records for patients with “complete data”. J Am Med Inform Assoc 2017;24(6):1134 41. Available from: https://doi.org/ 10.1093/jamia/ocx071.

81. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366(6464):447 53. Available from: https://doi.org/10.1126/science. aax2342. 82. Bolukbasi T, Chang K-W, Zou JY, Saligrama V, Kalai AT. Man is to computer programmer as woman is to homemaker? In: Debiasing word embeddings; January 2016. p. 4349 57. 83. Lum K, Isaac W. To predict and serve? Significance 2016;13(5):14 19. Available from: https://doi.org/ 10.1111/j.1740-9713.2016.00960.x. 84. Jung K, Shah NH. Implications of non-stationarity on predictive modeling using EHRs. J Biomed Inform 2015;58:168 74. Available from: https://doi.org/10.1016/j.jbi.2015.10.006.



9 Roles of artificial intelligence in wellness, healthy living, and healthy status sensing

Peter Jaeho Cho, Karnika Singh and Jessilyn Dunn

Abstract
The growing desire for better control of health outcomes and the increasing healthcare costs associated with disease treatment have led to a shift in the healthcare paradigm from reactive to proactive. Advances in artificial intelligence (AI), the study of intelligent machines that maximize their likelihood of achieving a goal, and the rise of mobile health technologies (e.g., wearable devices and smartphone applications) have enabled healthcare to take place outside of the traditional clinical setting. In this chapter, we detail how AI algorithms can improve wellness assessment, aid in personalizing intervention strategies to promote healthier lifestyle behaviors, and uncover previously unknown disease risk factors. Organized across three dimensions of wellness (physical, mental, and social), this chapter highlights studies that utilize AI to incorporate new data sources or reinterpret preexisting data sources to further advance preventative medicine.

Keywords: Preventative medicine; wellness; mobile health; lifestyle behaviors; health promotion; remote monitoring; health decision support systems

9.1 Introduction

As of the date of writing, non-communicable chronic diseases are the primary cause of morbidity and mortality around the world.1 These diseases result mainly from unhealthy behaviors and can be prevented by adopting healthier lifestyles. To address this issue, novel methods of assessing and promoting wellness are being developed. The combination of technological advancements and the large influx of health and behavior data has inspired the idea of a smart healthcare system where remote patient monitoring enables more effective resource utilization and a drive toward personalized, preventative medicine.2 Such a system would promote healthy living and would delay the onset of disease.3-11


FIGURE 9.1 Applications of AI to promote wellness using health-related measurements—different sources of information, ranging from passive monitors to active blood tests, are analyzed with AI algorithms. The asterisked algorithms were utilized by the studies presented in this chapter. The results help to improve personalized assessment, intervention strategies, and understanding of underlying mechanisms. AI, Artificial intelligence.

In recent years, there has been a shift towards promoting healthy living in healthcare.1,12 Hardware and software breakthroughs in mobile and wearable devices have been combined with computational advances in artificial intelligence (AI) to scale wellness coaching and automate promotion of health.14-19,1 Consumer wellness products and services are becoming more popular and prevalent every day due to the ubiquity of smartphones, expanding methods for personal monitoring, and the availability of commercial health products, such as 23andMe genotyping and gut microbiome tests.5,20-23 Since wellness lacks a uniform definition and is multidimensional in nature, it poses a challenge to the structured approaches of data science and AI. For the purpose of this chapter, we define AI to include supervised and unsupervised statistical and machine learning and deep learning algorithms.2 A list of common algorithms and their acronyms, which we will use throughout the chapter and as seen in Fig. 9.1, includes supervised machine learning methods such as support vector machines (SVMs), decision trees (DTs), random forest (RF), k-nearest neighbors (kNN), and linear discriminant analysis (LDA); unsupervised machine learning methods such as principal component analysis (PCA); and deep learning methods consisting of different types of neural networks (NNs), such as long short-term memory (LSTM) networks.

1. The regulatory oversight of these devices, however, is complicated and constantly evolving.13

2. A detailed explanation of AI algorithms can be found in Chapter 3, Deep Learning for Biomedical Videos: Perspective and Recommendations.


FIGURE 9.2 Percentage of wellness articles that include artificial intelligence over time—a PubMed search of wellness and artificial intelligence (MeSH terms: artificial intelligence; health) shows that the number of wellness articles involving artificial intelligence, normalized to all wellness articles, has been increasing over time.

The advancement of these AI algorithms is increasingly supporting devices and programs used for wellness. For example, chatbots, or conversation-simulating programs, have transitioned from responses generated by simpler DT-based algorithms to "interpreting" user responses with more advanced natural language processing (NLP). Even more advanced AI agents, including virtual assistants such as Amazon's Alexa, Apple's Siri, or the Google Assistant, not only incorporate text and audio chatbot features but also learn from previous conversations using deep learning techniques, such as transfer learning and LSTM networks, to provide personalized interactions. Increasing amounts of data collected to train AI algorithms will lead to more accurate prediction models, resulting in more tailored user experiences. The rise of AI-based methods in wellness, as seen in Fig. 9.2, facilitates informed health choices and behaviors.

In this chapter, we focus on AI-based methods that have been implemented across several dimensions of wellness. We also point the reader to general reviews of the use of AI in wellness or in specific aspects of wellness.2,17,24 This chapter groups applications of AI in health by wellness factors to illustrate how AI improves assessment, intervention strategies, and understanding of the mechanisms related to those factors.3 Furthermore, we highlight how the intended purpose of available technologies and data sources has been extended to advance personalized health, including wellness monitoring.15 Drawing inspiration from the literature, this chapter is divided into the physical, mental, and social aspects of wellness, with subdivisions along diet, fitness and physical activity (PA), sleep, sexual and reproductive health (SRH), mental health, behavioral factors, environmental and social determinants of health (SDOH), and remote screening tools.5,25-27

9.2 Diet

An important physical wellness factor is diet. Diet heavily influences the risk for obesity, diabetes, and poor metabolic health, which in turn affect the likelihood of all-cause mortality and

3. AI-based studies that focus on multiomics, including genomics, are available in Chapter 7, Analytics Methods and Tools for Integration of Biomedical Data in Medicine.


cardiovascular mortality.28-30 Dietary guidelines that seek to reduce the risk of these conditions propose sustainable, long-term nutritional recommendations and often detail specific foods and food categories to include and avoid (e.g., replace simple carbohydrates with complex carbohydrates). However, many consumers have difficulty adhering to complicated dietary recommendations and lack an understanding of the components of a healthy diet.31 Therefore alternative intervention strategies have utilized technology to promote dietary changes and personalize diets for individuals.32,33 The following section describes studies on diet that incorporate AI to improve adherence to healthier choices, delineate personal responses to specific food types, and enable self-monitoring.

AI algorithms can support personalized interventions to maintain a healthier diet. When following a diet plan, individuals tend to deviate.34,35 To improve diet adherence, Anselma et al.36 developed an AI framework that compensates unhealthy behaviors (e.g., eating cake today) with healthy behaviors (e.g., eating carrots tomorrow). The framework differed from standard, rigid linear meal planning systems because it provided the flexibility to reward healthy eating decisions or recover from poor eating decisions by adjusting recommendations according to the user's actual food intake and automatically supplementing future, diet-compatible meals. While the effectiveness of the framework was only assessed in a simulated population, the research team plans to test the framework in a clinical setting with obese participants. The framework aims to provide a more personalized approach to improving dietary habits and proposes to reduce the guilt or shame associated with consuming unhealthy foods, which is expected to improve long-term diet adherence.

A general understanding of nutrition has also shifted with the use of data-driven personalized diet design. The traditional thought that there are common responses to the same type of foods (i.e., that some foods are always healthy and others are always unhealthy) is changing with the increasing knowledge of differential physiological reactions to the same dietary inputs. Zeevi et al.37 implemented stochastic gradient boosting regression, using hematological parameters, dietary habits, anthropometrics, PA, and gut microbiota as inputs to predict postprandial glycemic responses (PPGRs). In this study, 800 participants logged food intake, exercise, and sleep on a smartphone-friendly website. Participants also wore a continuous glucose monitor that recorded interstitial glucose levels every 5 minutes to measure PPGR. While predictive models using only a meal's carbohydrate content had a modest correlation to PPGR (R = 0.38), models containing personal factors improved predictions dramatically (R = 0.68). This model was tested on a validation cohort (n = 100) and a similar correlation (R = 0.70) was achieved, demonstrating that the model is generalizable yet sensitive to personal dietary behaviors.

In addition to understanding dietary responses, food logging can aid in tracking dietary behaviors. However, the process can be tedious and time-consuming for users. Thus applications (or apps) that enable automatic food logging can improve usability.38 To address this, McAllister et al.39 developed a convolutional neural network (CNN) to classify food images for dietary assessment.
The deep feature extraction of the model enabled users to take a photograph of their food item and automatically log it to help maintain weight loss. These services have extended into commercial apps, such as LifeSum, Snap It, Calorie Mama, and Bitesnap.40-43 Seamless food logging may inform users about their dietary choices and aid in general nutrition education.
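To make the kind of modeling used in these diet studies concrete, the short sketch below fits a gradient-boosted regression of postprandial glycemic response on meal and personal features, loosely in the spirit of the approach of Zeevi et al.37 It is a minimal illustration only: the feature names, synthetic data, and hyperparameters are assumptions for demonstration and do not reproduce the published pipeline.

```python
# Hedged sketch: gradient-boosted regression of postprandial glycemic response
# (PPGR) on meal and personal features. All features and data are hypothetical.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 800  # roughly the cohort size reported in the study
meals = pd.DataFrame({
    "carbohydrates_g": rng.uniform(5, 120, n),
    "fat_g": rng.uniform(0, 60, n),
    "bmi": rng.normal(26, 4, n),
    "hba1c_pct": rng.normal(5.4, 0.4, n),
    "sleep_hours": rng.normal(7, 1, n),
    "microbiome_pc1": rng.normal(0, 1, n),  # e.g., a microbiome summary feature
})
# Synthetic target: PPGR (incremental area under the post-meal glucose curve)
ppgr = (0.6 * meals["carbohydrates_g"] + 8 * meals["hba1c_pct"]
        + 3 * meals["microbiome_pc1"] + rng.normal(0, 10, n))

X_train, X_test, y_train, y_test = train_test_split(meals, ppgr, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
model.fit(X_train, y_train)

# Evaluate as a correlation between predicted and observed responses
r, _ = pearsonr(model.predict(X_test), y_test)
print(f"Pearson R on held-out meals: {r:.2f}")
```

In the published work, a similar comparison of predicted versus measured PPGRs was repeated on a separate validation cohort to assess generalizability.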


Improving diet can reduce the risk of multiple chronic health conditions. The aforementioned studies present AI-based methods that are under development to promote healthy eating behaviors. Apps that personalize diet and easily record nutritional information may even be useful for recommending particular food items to individuals who are near a grocery store or restaurant.33 The popularity of self-improvement apps, like food trackers, has also led to the development of AI-based fitness apps, as we describe in the next section.

9.3 Fitness and physical activity

Another factor that affects wellness is PA. Lack of adequate PA increases the risk of metabolic disease, cardiovascular disease, and cancer.44 Individual PA levels can indicate the need for intervention and improve our understanding of the relationship between PA and adverse health outcomes. Various studies have attempted to monitor PA data obtained from wearable and smartphone devices.45-48 This section explores how AI is used to improve PA monitoring with mobile devices, assess various health conditions affected by PA levels, and promote increased PA among individuals.

Determining energy expenditures (EEs) is vital to evaluate whether an individual meets recommended PA levels. Montoye et al.49 compared the accuracy of hip, thigh, and both-wrist accelerometer data in predicting EE using NNs for 44 participants. The participants performed 14 activities of varying intensities for 90 minutes in a laboratory setting. The predicted EEs were compared with those measured by the Oxycon portable metabolic analyzer. Separate NNs were developed using data from each of the different accelerometer locations. Although all four models had high accuracy (correlation R > 0.80), the thigh accelerometers led to the most accurate models (correlation R = 0.90) for EE estimation. The study demonstrated the potential of NNs in providing accurate EE estimates from accelerometer data. Future studies can extend these calculations into free-living settings to provide tailored PA recommendations to optimize EE for specific circumstances.

Several groups have attempted to design methods that obtain clinically valid PA and physiological measurements from smartphones.50,51 Cheng et al.50 trained an SVM on smartphone accelerometer data to predict pulmonary function. For the study, 24 pulmonary patients performed at least two 6-minute walk tests4 (6MWT) and carried a smartphone with the researchers' accelerometer data collection app, MoveSense. During the first 6MWT, patients took a clinical pulmonary function test to determine their forced expiratory volume in 1 second, an indicator of disease severity. The accelerometry data associated with this severity level was then used to train the SVM. The model was tested on subsequent 6MWT accelerometry data and accurately predicted the severity level for all participants. This study suggests that accelerometry data could be used to monitor cardiopulmonary function in free-living settings and can ensure timely interventions through continuous assessment.

In addition to assessment, intervention strategies to promote PA can benefit from the application of AI. Health interventions delivered by AI agents can be effective for encouraging health behaviors.

4. The 6MWT is a standard assessment test for chronic heart and lung conditions such as chronic obstructive pulmonary disease.


AI agents are computer systems that are guided by their experiences to act on their perceived environment in order to achieve a goal.52 These characteristics make them great candidates for automated interventions. In order to mimic human health counselor experiences, AI agents provide human-like conversation to promote healthy behaviors. Stein and Brooks53 assessed user engagement and acceptability of the Lark Weight Loss Health Coach AI (HCAI), which incorporates elements of cognitive behavioral therapy to encourage weight loss and healthy diet choices. The HCAI promoted user-reported behavioral change, had high user acceptability, and was associated with an average user-reported weight loss of 2.4 kg. Ultimately, the agent-assisted weight loss method achieved results comparable to those of human coaching interventions. Future studies may improve upon these methods by automating diet and weight tracking to enhance the precision of monitoring.

Hassoon et al.54 explored AI-based interventions to promote PA among overweight and obese cancer survivors, including (1) distributing written material in the clinic; (2) using an automated text message agent to send periodic information; and (3) using a voice-based interactive AI agent delivered via the Amazon Echo speaker. A conversational AI agent called MyCoach was developed for this purpose. MyCoach used data from various sources, including the patient's wearable sensors, music library, geolocation, and the National Weather Service UV Index, to provide personalized health advice, reminders, and feedback. MyCoach was developed with the Alexa developer kit, which can be used to build new Alexa "skills" that add new functionality to Alexa's built-in capabilities.55 The rising prevalence of virtual assistants like Alexa makes these personalized interventions feasible, affordable, and pervasive.

AI algorithms can be used to promote self-management of chronic health conditions by encouraging PA. The use of smartphone and wearable devices to track and assess personal PA data is gaining popularity.56 For example, the Quantified Self movement involves individuals analyzing not only their own PA data but also other facets of their life to inform their health.46,57 Mobile health (or mHealth) interventions are effective in improving PA by delivering timely behavioral support notifications.58,59 These mHealth interventions can also help address the lack of healthcare professionals and counselors by detecting different types of PA (e.g., walking vs running). AI algorithms can help monitor PA levels and intervene to promote healthier lifestyles.
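As a rough illustration of the accelerometer-based models described earlier in this section, the sketch below derives simple features from windowed tri-axial accelerometry and trains a small neural network to estimate EE. The window length, feature set, placeholder data, and network size are illustrative assumptions rather than the published models.

```python
# Hedged sketch: estimating energy expenditure (EE) from windowed accelerometer
# features with a small neural network. Features, data, and labels are hypothetical.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def window_features(acc_xyz, fs=30, win_s=30):
    """Summarize raw tri-axial accelerometry into per-window features
    (per-axis mean and standard deviation, plus vector-magnitude percentiles)."""
    win = fs * win_s
    feats = []
    for start in range(0, len(acc_xyz) - win + 1, win):
        seg = acc_xyz[start:start + win]
        vm = np.linalg.norm(seg, axis=1)  # vector magnitude per sample
        feats.append(np.concatenate([
            seg.mean(axis=0), seg.std(axis=0),
            np.percentile(vm, [10, 50, 90]),
        ]))
    return np.array(feats)


# Placeholder inputs: 90 minutes of 30 Hz accelerometry and per-window EE labels
# (kcal/min), as would be measured by a portable metabolic analyzer in the lab.
acc = np.random.default_rng(1).normal(size=(30 * 60 * 90, 3))
X = window_features(acc)
y = np.random.default_rng(2).uniform(1, 10, size=len(X))  # placeholder EE labels

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                                   random_state=0))
model.fit(X, y)
print("Predicted EE for the first window:", model.predict(X[:1]))
```

The same feature-extraction step could, under similar assumptions, feed an SVM classifier of disease-severity level as in the 6MWT example above.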

9.4 Sleep

In addition to diet and PA, sleep is an important factor for wellness. Sleep is a complex phenomenon that is influenced by biological, behavioral, and environmental variables.60 Our understanding of sleep remains unclear despite the growing number of people suffering from sleep disorders and lack of sleep (approximately 18% of Europeans and 23% of Americans).61 Sleep is an important metric for evaluating individual wellness and potential risks for disorders. Therefore techniques to accurately understand the complex relationship between sleep and health on an individual level are essential. Health insights derived from using AI algorithms on sensor, wearable, and epidemiological data will advance the field of sleep research. Integrating big data from these sources can help uncover individual differences in


sleep dynamics and aid in personalized sleep disorder management.60 Smart health technologies enable unobtrusive and continuous tracking of sleep, which yields useful insights into sleep patterns and their effects. Various commercial wearable devices can measure the timing, quantity, and quality of sleep. For example, one study46 explored the use of wearables for quantifying diurnal patterns, assessing the effect of circadian cycles on physiological responses, and capturing circadian fluctuations in heart rate (HR) and skin temperature. This section discusses how AI algorithms aid in sleep assessment and how they may be implemented to better manage sleep disorders.

Sleep quality plays an essential role in personal health. Polysomnography, which involves monitoring multiple physiological parameters, including HR, blood oxygen levels, and respiration, is the standard recommended by the American Academy of Sleep Medicine for investigating sleep stages.62-64 This method is not feasible for long-term continuous sleep monitoring in free-living conditions as it is costly, intrusive, and requires a battery of specialized devices such as respiratory monitors, electroencephalograms, and sound/video recorders. Sadeghi et al.64 proposed a method to predict sleep quality by tracking trends in physiological signal measurements from the Empatica E4 wristband. Combining four physiological signals [HR variability (HRV), electrodermal activity, body movement, and skin temperature], the researchers classified sleep quality in caregivers for people with dementia. Using three different classifiers, including naïve Bayes, RF, and bagged trees, the best results were obtained with the RF (75% accuracy for sleep quality). For feature selection the authors used recursive feature elimination to maximize the classifier performance and found sleep efficiency (proportion of actual sleep time to the total time in bed) and skin temperature to be the most important factors in predicting sleep quality. The result is a transparent and explainable clinical decision support system for sleep quality estimation from physiological measurements, which can be extremely valuable for tracking individual sleeping patterns.

Details about sleep habits are becoming increasingly important as poor sleeping habits affect sleep quality. For example, sleep quality has been related to sleeping posture (body configuration assumed by a person during or prior to sleeping) and frequent sleeping posture changes.65 Sleep posture detection can yield useful insights for sleep assessment. Hsiao et al.66 used force and infrared sensors to detect and classify sleep postures. Using distance-weighted kNN on the sensor data, the researchers classified sleeping postures into six categories with an accuracy of 88%. These methods, in contrast to video-based methods, raise fewer privacy concerns and provide a cost-effective method for sleep posture recognition. Moreover, since the sensors are unobtrusive and simple to deploy without any inconvenience to the sleeper, they can be used to passively monitor sleeping positions effectively.

AI algorithms are also being employed to diagnose and monitor various sleeping disorders. Khandoker et al.67 employed SVMs to automatically recognize obstructive sleep apnea syndrome (OSAS) types from a set of 125 nocturnal electrocardiogram (ECG) recordings (each about 8 hours long) acquired from normal individuals (OSAS−) and individuals with OSAS (OSAS+).
Since OSAS diagnosis is expensive and requires polysomnographic evaluation in sleep laboratories, a large number of OSAS patients remain undiagnosed. While previous applications of AI to automatically recognize OSAS did not estimate its relative severity, here SVMs deployed on HRV and ECG-derived respiration data could both recognize OSAS and estimate its severity using posterior probabilities.68


The authors achieved nearly perfect accuracy with a subset of four features and an accuracy of up to 92.85% on an independent dataset. Extending this algorithm to a study population with other sleep-related breathing disorders could help evaluate the efficacy of the model in diagnosing a variety of sleep-related breathing disorders.

Using a combination of approaches to monitor multiple aspects of sleep may result in a comprehensive and personalized sleep profile that can be used to diagnose sleep-related disorders and expand our understanding of the correlation between sleep and health. Sleep science research can be accelerated with the application of AI to sleep and circadian data to provide a holistic understanding of sleep and to help determine factors leading to sleep disorders.
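The following minimal sketch illustrates the general recipe used in the sleep-quality study described above—random forest classification with recursive feature elimination—using hypothetical wearable-derived features and simulated nights; it is not the published model.

```python
# Hedged sketch: classifying good vs poor sleep quality from wearable-derived
# features with a random forest and recursive feature elimination (RFE).
# Feature names, data, and thresholds are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_nights = 200
X = pd.DataFrame({
    "hrv_rmssd_ms": rng.normal(40, 15, n_nights),
    "eda_mean_uS": rng.normal(0.4, 0.2, n_nights),
    "movement_index": rng.normal(0.2, 0.1, n_nights),
    "skin_temp_c": rng.normal(33.5, 0.8, n_nights),
    "sleep_efficiency": rng.uniform(0.6, 0.98, n_nights),
})
# Binary label: 1 = good sleep quality, 0 = poor (hypothetical ground truth,
# e.g., derived from a validated sleep questionnaire)
y = (X["sleep_efficiency"] + 0.1 * rng.standard_normal(n_nights) > 0.85).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
# RFE repeatedly drops the least important features as ranked by the forest
selector = RFE(rf, n_features_to_select=2).fit(X, y)
print("Selected features:", list(X.columns[selector.support_]))
print("CV accuracy:", cross_val_score(rf, X.loc[:, selector.support_], y, cv=5).mean())
```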

9.5 Sexual and reproductive health

Similar to the wellness domain grouping, SRH incorporates the physical, mental, and social health aspects pertaining to sexuality and the reproductive system, respectively.68,70 In the same way that providing SRH education, counseling, and contraceptive provisions for adolescents can improve their understanding of SRH, longitudinal monitoring of physiological changes that influence SRH can lead to more personalized health assessments. For example, Penders et al.71 measured four lifestyle behaviors (PA, sleep, stress, and diet and weight management) that change during pregnancy using wearable sensors. Furthermore, utilization of mobile and wearable devices has been shown to reduce negative feelings associated with SRH by providing an option for remote care.72 In this section, we focus on studies that incorporate AI algorithms to uncover SRH-related factors and provide personalized SRH assessments.

Fertility is a key aspect of reproductive health. Awareness of fertility windows can assist individuals seeking to become pregnant. Goodale et al.73 tracked nighttime physiological data (i.e., wrist skin temperature, HR, HRV, respiratory rate, and skin perfusion) with the Ava AG wearable wristband in 237 women. The research team trained an RF model to classify time windows into the follicular phase, fertile window, or luteal phase and was subsequently able to predict a 6-day fertility window with 90% accuracy. The availability of SRH-related commercial wearables supports the shift toward real-time and predictive health monitoring.

Much research has been done surrounding factors that affect male fertility.74-77 Girela et al.78 predicted male fertility (n = 100) based on environmental factors and lifestyle habits measured through questionnaires. Sperm motility and concentration were predictable using NNs with accuracies of 90% and 82%, respectively. Key factors in sperm quality included body mass index, average hours of sleep per day, frequency of warm baths, smoking, frequency of alcohol consumption, and number of hours spent sitting per day. This study demonstrates the potential of AI algorithms to discover new relationships between behaviors and health outcomes, which may be further improved with wearable sensors.

Smartphones are also useful for monitoring conditions that are difficult to assess with a single clinical snapshot. One such SRH-related condition is postpartum depression (PPD). Classification models were trained to detect the risk of PPD (n = 1397) after the first week of childbirth using socioeconomic features, clinical data, and psychiatric questionnaires.79


Among SVM, logistic regression, NN, and naïve Bayes models, the naïve Bayes model performed the best, with sensitivity, specificity, and accuracy all near 73%. This model was then implemented in a clinical decision support system app for Android smartphones, where postpartum women could fill out a one-time questionnaire and receive early intervention if they were deemed to be at risk of PPD. By developing efficient and cost-effective apps that increase communication between patients and clinicians, preventative care can be provided outside of the hospital walls.

SRH can benefit through applications of AI in remote and personalized monitoring and assessment. Many such assessments utilize AI algorithms to discover new associations between measurable physiologic, environmental, and behavioral factors and SRH, and to improve the detection and prediction of SRH-related states.
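A hedged sketch of the kind of classifier used in the fertility-window study above is shown below: a random forest assigning nights to cycle phases from nighttime physiology. The features, labels, and data are placeholders (the labels are random, so the printed metrics are meaningless); the sketch only shows the shape of the workflow, not the published model.

```python
# Hedged sketch: classifying nights into menstrual-cycle phases from nighttime
# wearable physiology with a random forest. All features and labels are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1500  # hypothetical number of recorded nights
X = np.column_stack([
    rng.normal(34.0, 0.5, n),   # wrist skin temperature (deg C)
    rng.normal(62, 8, n),       # resting heart rate (bpm)
    rng.normal(45, 15, n),      # HRV (ms)
    rng.normal(15, 2, n),       # respiratory rate (breaths/min)
    rng.normal(1.0, 0.3, n),    # skin perfusion (arbitrary units)
])
# Placeholder phase labels; real labels would come from cycle tracking or
# ovulation testing rather than random assignment.
phases = rng.choice(["follicular", "fertile_window", "luteal"], size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, phases, stratify=phases, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```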

9.6 Mental health

As one of the three main domains of wellness, mental health consists of emotional, psychological, and social well-being and is impacted by biological factors, life experiences, and family history. Psychological stress evokes physiological responses and is a key contributor to depression, anxiety, and other mental illness.80-91 Methods to track stress levels are important for accurately assessing state of mind. While stress presents interindividual differences in symptoms, there are common physiological responses to stress, such as elevated HR and sweating. Gjoreski et al.91 proposed detecting stress by combining three sources of information: (1) using the Empatica E4 wristband as the only physiological data source (HR, EDA, blood volume pulse, interbeat interval, and skin temperature); (2) detecting activity from the device's accelerometer; and (3) collecting contextual information with stress logs and ecological momentary assessment prompts carried out by smartphones. AI algorithms, including naïve Bayes, kNN, SVM, DT, and RF, were used to create context-based stress-detection methods for 20-minute intervals. The best model was a DT, which had a sensitivity of 70% and precision of 95% for detecting stress events. While there were limitations in the study's sample size (n = 5), age range (28 ± 4.3 years), and variety of device choices, future work will diversify the context of stress, personalize the features, and include other components of stress (behavioral and affective) to improve passive stress monitoring.

Another key factor of mental health is mood, which is a "relatively long lasting, and subjectively experienced state of mind, the cause of which is generally unclear".83 Assessing mood is particularly important for patients with bipolar disorder, who suffer from mood instability. However, mood is difficult to track as it changes constantly throughout the day.84 Perez Arribas et al.85 utilized a signature-based5 AI algorithm to predict individuals' mood and differentiate people into three groups: (1) healthy participants, (2) individuals with borderline personality disorder, and (3) individuals with bipolar disorder. They reanalyzed data from a longitudinal study where participants rated

5. This refers to an individual's handwritten signature, and the signature models utilize sequential, ordered information. These models have performed better than general handwriting recognition and can be generalized, in this case, with time-based mood data.86-88


their moods daily (anger, anxiety, energy, elation, irritability, and sadness) on a 7-point Likert scale, with 1 being "not at all" and 7 being "very much." An RF model was trained to classify individuals into the three groups based on a "bucket," or an aggregate score of 20 consecutive self-reported moods, as the input features. The model correctly classified 75% of participants. A distinction could be made between individuals with borderline personality disorder and healthy controls with an accuracy of 93%. Mood predictions using the same model configurations had 89%-98% accuracy in healthy participants, 82%-90% accuracy in individuals with bipolar disorder, and 70%-78% accuracy in individuals with borderline personality disorder. While the studied group consisted of a subpopulation of more stable patients, this work points to the potential of assessing mental health status and predicting real-time mood from longitudinal data.

In addition to self-reported mental health status, researchers are utilizing more passive methods of data collection, such as mining data on social media. Choudhury et al.89 gathered their study population (n = 1583) by crowdsourcing, which involves enlisting many people to perform certain tasks. The crowd workers took a clinical depression survey, which was presented as a behavioral patterns test, provided their depression history and demographics, and were given the option of sharing their Twitter username for the purpose of data mining. From the Twitter feed the researchers extracted engagement levels, emotional states, use of depression-related terms, linguistic styles, behavioral patterns associated with depression, and social graphs. PCA was used to reduce dimensionality, and SVM was implemented to predict future episodes of depression with an accuracy of 70%. Passive means of detecting depression can improve the quantification of population-scale depression patterns, and further tailoring of these models could focus on early warning signs of nonnormative behaviors to allow timely interventions.

Similarly, one study developed a preventative tool for suicide risk assessment using NLP on social media data.90 The researchers received data from OurDataHelps.org, which provides access to social media (e.g., Facebook), wearables (e.g., Fitbit), and other applications (e.g., Strava), and from public users who had mentioned suicide attempts on social media. Each user had an average of 473 social media posts, and the posts from the 6 months before the attempts (n = 197,615) were analyzed and compared against posts from a control group (n = 197,615). The researchers trained a text classification LSTM network to predict whether a single post indicated risk for a suicide attempt and developed an aggregate score from these posts to predict the individual's risk. They tested varying numbers of months prior to the attempt and generated receiver operating characteristic curves with true positive rates ranging from 70% to 85%. The models' performance suggests that signals linked to suicide risk are present in social media. The main source of bias in the design was the study population, which was predominantly female and aged 18-24. There are certainly ethical implications of employing such a screen, and the trade-off between privacy and health monitoring must be balanced.
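To illustrate the text-classification component described above, the sketch below defines a small LSTM network that scores a single, integer-encoded post and aggregates post-level probabilities into an individual-level score. The vocabulary size, architecture, and random data are assumptions for demonstration, not the published system; any real application of this kind requires careful ethical review and clinical oversight.

```python
# Hedged sketch: an LSTM classifier that scores an individual post for risk and
# aggregates per-post probabilities. Data, vocabulary, and architecture are hypothetical.
import numpy as np
import tensorflow as tf

vocab_size, max_len = 20000, 100  # hypothetical tokenizer settings

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),       # token embeddings
    tf.keras.layers.LSTM(64),                         # sequence summary
    tf.keras.layers.Dense(1, activation="sigmoid"),   # per-post risk probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

# Placeholder data: integer-encoded, padded posts and binary post-level labels
X = np.random.randint(1, vocab_size, size=(256, max_len))
y = np.random.randint(0, 2, size=(256,))
model.fit(X, y, epochs=1, batch_size=32, verbose=0)

# An individual-level score could then aggregate per-post probabilities,
# for example the mean over all posts in the months before the index date.
post_scores = model.predict(X[:20], verbose=0).ravel()
print("Aggregate risk score for one hypothetical user:", post_scores.mean())
```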
Developing effective mobile intervention strategies will provide helpful tools for addressing the lack of sufficient mental health treatment options. Factors such as convenience, access to and cost of healthcare, ease of use, and stigma have led to the rise of mobile apps associated with daily mood logging, mental wellness, meditation, and personal chatbots.92 While most applications do not report on their efficacy, two chatbots have published studies. One study93 described the use of the text-based conversational app,


Woebot, in improving the mental health of college students with anxiety and self-reported depression (n = 70). Students were assessed for depression with the Patient Health Questionnaire (PHQ-9), a 9-item questionnaire that measures the frequency and severity of depressive symptoms. The study split the cohort into an application user group (n = 34) and an information-only control group (n = 36), who were given the NIMH eBook, "Depression in College Students." The chatbot was based on DTs and employed therapeutic process-oriented features such as empathic responses and goal setting. The study revealed a significant reduction in the severity of depression in the app user group compared to the control group. Another chatbot, Wysa, was tested in a study comparing reported depression symptoms between high- (n = 108) and low- (n = 21) frequency app users.94 Users took the PHQ-9 before and after a 2-week period of app use. High frequency was defined as more than one use of the app during the 2-week period. While both groups showed a reduction in their PHQ-9 scores, a between-group comparison showed that high-frequency users had significantly greater average improvement than low-frequency users. While both apps can be improved to better understand participant responses and avoid repetition of phrases, they demonstrate the potential of AI to provide new tools for addressing mental health.

Automated and passive assessment of stress, mood, and other mental health parameters combined with the development of novel AI-based intervention strategies can provide supplementary, technology-based mental healthcare. However, since most of the mentioned studies required participant self-reporting, the subjective nature of these data will be scrutinized in the coming years and compared against continuous, real-time monitoring.95 Nevertheless, the rising mental health crisis and the increasing number of methods available for monitoring make a clear case for using AI to aid clinicians in providing comprehensive mental healthcare.

9.7 Behavioral factors

Quality of life can be heavily influenced by health-related behaviors. According to Pavel et al.,96 these behaviors conventionally refer to actions "to maintain, attain, or regain good health and prevent illness." Behavioral informatics can uncover the effects of various external factors and psychological characteristics on behavior patterns. Moreover, AI technologies and home sensors have facilitated automated ambient monitoring to detect behavior patterns, which can be useful in providing a new, low-resource model for assisted living. In this section, we discuss how AI algorithms aid in linking various factors to behavior choices and in automated monitoring of behaviors for personalized interventions.

One key behavior that improves health outcomes is adherence to therapy. For example, medication adherence is crucial for reducing mortality rates and the cost of treatment. Son et al.97 identified predictors of medication adherence in heart failure patients. They collected self-reported questionnaires about medication adherence from patients (n = 76) and recorded variables such as gender, frequency of medication intake, and New York Heart Association functional class, which classifies the severity of heart failure based on symptoms.98 The researchers implemented SVMs on all possible feature (n = 11) combinations


and selected the two best-performing models, both of which achieved an accuracy of 78%. The two models shared no predictors except medication knowledge, which is known to improve medication adherence and consists of the patient's knowledge of the purpose, name, dosage, frequency, and side effects of their medication.99,100 While this study was limited by its sample size, the findings illustrate the potential of incorporating AI algorithms to quantitatively compare factors that contribute to medication adherence.

Similarly, understanding the factors that lead to nonadherence in smoking cessation can help in the development of targeted intervention strategies. The effects of smoking (e.g., quantifying the effects of aging due to smoking) and interventions for helping smokers quit are common areas of research.101 Understanding when smokers who seek to quit have urges to smoke can help to improve intervention strategies.102 Using data from 1990 to 1995 on smokers (n = 349) seeking to quit, researchers trained naïve Bayes, DT, and discriminant analysis models using 41 parameters (e.g., "feeling tense," "day of the study," and "drinking coffee") self-reported by the smokers. The three models classified the urge rating into either high smoking urge or little to no smoking urge. By implementing a feature selection algorithm the researchers improved model performance; the naïve Bayes and discriminant analysis models had high sensitivities around 90% but low specificities around 30%. While the false alarm rate is high, it ensures that the more serious, high smoking urges are caught. Furthermore, this study confirmed that similar questionnaires can now be deployed on mobile devices to facilitate real-time assessments. For example, in a smartphone-based smoking cessation trial (n = 92), participants were provided ecological momentary assessments that notified them to fill out a questionnaire at set intervals until their quit date.103 From the 26 possible predictors gathered from the ecological momentary assessments, only 5 features6 were chosen by the elastic net's backward elimination, achieving a model fit of 93.9%. The authors noted that the predictors were largely environmental (e.g., availability of cigarettes) rather than internal conditions. In the future, AI can support tailored interventions that address these factors to prevent smoking lapses.

In addition to quantifying the significance of factors affecting behavioral intervention adherence, understanding individual behavior patterns is important for developing personalized interventions. This has been facilitated by ambient sensing in smart environments, where sensors monitor an individual in their home and models can be used to detect and predict behavioral patterns.4,104-107 Lundström et al.108 simulated a method to detect and analyze anomalous behaviors, such as falling, by generating motion data from home sensors. The researchers modeled a three-room apartment with passive infrared sensors in each section of the house, door sensors on doors, and occupancy sensors on chairs and beds. They simulated a series of normal behaviors (e.g., "going to the kitchen at night" and "opening the drinking glass cabinet") and generated abnormal behavioral patterns, such as eating breakfast during the night or falling in the bathroom, by randomly permuting events in a set behavioral pattern.
The researchers classified normal and abnormal behaviors using RF and implemented t-distributed stochastic neighbor embedding and hierarchical agglomerative clustering to reduce the dimensionality and to group series of behaviors into clusters, respectively.

6. The features were feeling irritable, being in smoking-permitted areas, the availability of cigarettes, consumption of alcohol in the past hour, and interacting with another smoker.


For example, entering the kitchen, getting a glass of water in the kitchen, and having breakfast are behaviors that are grouped together. While this study was limited by its simulated data and single living condition, it suggests the feasibility of behavioral monitoring without wearable devices. By passively monitoring behavior, ambient monitoring can support independent living for the elderly or for individuals suffering from health conditions with behavioral indicators and can be implemented in future smart homes.

In addition, video-based monitoring can help identify indicators of healthy and unhealthy behaviors for remote wellness assessment. Wize Mirror, a smart mirror, was developed to monitor health status from video recordings of the face.109-111 Facial semeiotic signs, which are physical signs and expressive features of the face that can indicate a person's health status, were analyzed to detect features linked to cardiometabolic risk. The study implemented RF for face detection. The authors recorded videos of the face to determine facial blood volume changes for extracting respiration rate and HR parameters as well as to perform HRV analysis. Ambient white light and principles of photoplethysmography were utilized to detect changes in facial color (resulting from fluctuations in reflected light with changes in blood volume in the face). HR prediction could be performed satisfactorily, but HRV analysis was inaccurate, which the authors attributed to poor video signal quality. This study also implemented fatigue analysis by assessing the duration and frequency of yawns. Stress and anxiety were assessed from head motion patterns and evaluations of eye and mouth facial cues. A breath composition analysis device, the Wize Sniffer, identified molecules in the breath that could be tied to habits known to increase cardiometabolic risk, such as smoking and alcohol consumption. A kNN model was implemented on the Wize Sniffer breath data to classify subjects based on their smoking and drinking habits. Based on the analysis, the mirror provided personalized messages related to diet, PA, smoking, and alcohol intake to foster healthier lifestyles. Such a platform can be used for ambient self-monitoring of wellness status and as a personalized recommender system for healthy behaviors.

Monitoring behavioral patterns in everyday life and learning the factors that influence behavior decisions and behavior change can aid in the development of longitudinal and targeted behavioral intervention strategies. AI can detect personalized behavioral patterns from a variety of different data sources and aid in the delivery of timely interventions.
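A minimal sketch of the ambient-monitoring analysis described above is given below: each simulated day is summarized as a vector of sensor-event counts, embedded with t-SNE, and grouped by agglomerative clustering. The sensor vocabulary and the simulated days are illustrative assumptions, not the published simulation.

```python
# Hedged sketch: embedding and clustering daily summaries of home-sensor events.
# Sensor names, event rates, and the simulated days are hypothetical.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
sensors = ["bed", "bedroom_pir", "kitchen_pir", "cabinet_door",
           "chair", "bathroom_pir", "front_door"]

# Represent each simulated day as a count vector of sensor activations.
normal_days = rng.poisson(lam=[20, 15, 30, 5, 10, 8, 2], size=(95, len(sensors)))
odd_days = rng.poisson(lam=[5, 40, 2, 1, 1, 30, 1], size=(5, len(sensors)))  # e.g., night-time wandering
days = np.vstack([normal_days, odd_days]).astype(float)

# Reduce to 2-D for visualization, then group days into clusters.
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(days)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embedding)
print("Days per cluster:", np.bincount(labels))
```

In a fuller pipeline, a supervised classifier (such as the RF mentioned above) could then be trained to flag days resembling the anomalous cluster.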

9.8 Environmental and social determinants of health

Environmental and social factors play a significant role in health outcomes. The WHO defines social and environmental determinants of health as the "full set of social and physical conditions in which people live and work, including socioeconomic, demographic, environmental, and cultural factors, along with the health system."112 Understandably, these factors influence wellness, and identifying at-risk populations based on these factors can facilitate timely interventions. Environmental factors are estimated to cause 13%-20% of all diseases in Europe.112 These factors include air and water quality, temperature, and exposure to radiation. SDOH include factors such as housing, local emergency/health services, income level, neighborhood safety, and education. Improving our understanding of


SDOH and applying that knowledge in healthcare can also help improve intervention strategies. Given the significant effect of these factors on wellness, monitoring both environmental factors and SDOH is crucial for the development of appropriate and effective interventions. This section covers how AI is being applied to environmental data to understand and manage disease risk factors and to SDOH data to understand the socioeconomic factors that affect health outcomes.

AI algorithms can improve our knowledge of the associations between environmental exposures and adverse health effects. For instance, Di et al.113 implemented NNs to evaluate the effect of exposure to ozone and atmospheric particulate matter with a diameter of less than 2.5 µm (PM2.5) on mortality of Medicare recipients. PM2.5 and ozone concentrations that were lower than the annual National Ambient Air Quality Standards were associated with an increased risk of death among this population. Such assessments can help people in high-risk areas take timely precautions and can also inform public health policy decisions, including establishing air quality standards. Several other studies focus on exploring environmental pollutants as risk factors for specific health disorders. Ren et al.114 used RF and gradient boosting to determine links between maternal exposure to air pollutants and risk of congenital heart defects (CHDs) in Beijing. The study used the Chinese birth defects surveillance system and the air pollution index to obtain CHD data and average PM10 concentrations, respectively. Both models identified PM10 as a major risk factor for CHDs. This study highlights the importance of AI algorithms in enhancing our understanding of diseases by establishing links with previously unknown environmental risk factors.

Recent research efforts have focused on designing mHealth technologies that can incorporate environmental exposure information. For instance, AirRater is an app that implements weighted LDA to notify users with respiratory conditions (e.g., hay fever, asthma, and allergic rhinitis) when environmental factors (e.g., humidity, temperature, and pollen count) are unfavorable.115 Users can create saved locations and receive notifications when these locations have increased pollen or PM2.5 levels. The app can also help users take their medications in a timely manner. The availability of information from this app can educate the general population on the impact of various environmental conditions on their health.

In addition to environmental factors, SDOH play a key role in wellness. Social factors influence health outcomes and can help identify at-risk populations for various health conditions. Ye et al.116 used AI algorithms to predict incident hypertension risk for the following year and discovered specific social factors, such as education and income levels, to be strong predictors of incident hypertension. A gradient-boosted DT prediction model was implemented on data from the Maine electronic health record, the US Census, and the US Department of Agriculture. The model classified individuals into five hypertension risk categories, ranging from very low to very high. Demographic features, such as age and gender, and clinical features, such as the presence of other chronic health conditions (like type 2 diabetes), were shown to affect incident hypertension. The model also yielded insights into the social factors associated with a high risk of hypertension.
Low education and income levels and public insurance coverage were shown to be related to an increased risk of incident hypertension during the following year. Interestingly, people residing in areas near supermarkets and far from parks were overrepresented in the very high-risk category.


Approximately 51% of the people classified as very high-risk in the evaluation population were diagnosed with hypertension in the following year. These studies demonstrate the ability of AI algorithms to uncover socioeconomic factors associated with high risk of disease. Such insights can further aid the development of policies for promoting population health.

Future integration of environmental and SDOH data using AI can lead to the development of effective monitoring and intervention strategies. This could help promote preventative medicine by establishing links between environmental and SDOH factors and health outcomes.
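The sketch below illustrates the general form of the risk model described above: a gradient-boosted tree classifier trained on mixed demographic, clinical, and social features, with predicted probabilities binned into ordinal risk categories. Column names, thresholds, and data are assumptions for demonstration and do not correspond to the published model.

```python
# Hedged sketch: gradient-boosted trees assigning individuals to hypertension-risk
# categories from demographic, clinical, and social features. Data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "female": rng.integers(0, 2, n),
    "type2_diabetes": rng.integers(0, 2, n),
    "education_years": rng.integers(6, 20, n),
    "household_income_k": rng.normal(55, 20, n),
    "public_insurance": rng.integers(0, 2, n),
})
# Hypothetical label: hypertension diagnosed during the following year
signal = 0.04 * X["age"] + 1.5 * X["type2_diabetes"] - 0.05 * X["education_years"]
y = (signal + rng.normal(0, 1, n) > signal.mean()).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = HistGradientBoostingClassifier(max_iter=200).fit(X_tr, y_tr)

# Convert predicted probabilities into ordinal risk categories.
proba = clf.predict_proba(X_te)[:, 1]
categories = pd.cut(proba, bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
                    labels=["very low", "low", "medium", "high", "very high"],
                    include_lowest=True)
print(pd.Series(categories).value_counts())
```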

9.9 Remote screening tools

The previous sections have focused on promoting factors of wellness; in this section, we outline screening tools for preventative care. Remote screening tools and point-of-care testing have gained traction with the rise of remote monitoring through wearable devices and smartphone applications. These out-of-clinic screening methods can evaluate an individual's disease risk or help assess injuries. Screening results may also indicate follow-up testing and in-person appointments. Increased access to data and the use of AI algorithms have improved these remote tools. Here we describe remote screens that assess pain and detect possible diseases.

Remote screening tools may be used to assess subjective wellness factors, such as pain. One study117 used AI algorithms to redefine pain volatility, normally defined as the difference between two consecutive, self-reported pain severity scores, and to predict users' future pain volatility levels. Researchers separated individuals with high and low variation by k-means clustering and created an RF model to predict the volatility level with an accuracy of 70%. While the model was not designed to be implemented in real time, the ability to assess and predict patients' pain volatility may aid in personalizing clinical treatments.

Similarly, AI has been used to detect experienced pain. Clinicians require a window into patients' day-to-day lives to understand the severity of their symptoms. PainCheck is a point-of-care smartphone app that assesses pain levels in nonverbal adult populations.118 The app circumvents the need for reported subjective symptoms by taking a photo of the patient's face when a fall or injury is detected. A pain assessment score is evaluated through an AI algorithm that integrates clinical domain knowledge with a data-driven approach. This combined methodology increases the usefulness of AI-based apps and expands the feasibility of pain detection that may otherwise be impossible.

In addition to facial expression recognition, smartphone cameras can also be useful for other types of image-based screening. Chuchu et al.119 reviewed the diagnostic accuracy of four AI-based smartphone apps in identifying skin lesion images as either melanoma or high-risk lesions. Across the four applications surveyed, accuracy varied, with sensitivities ranging from 7% to 73% and specificities from 37% to 94%. The authors mention several issues of bias and preselection by the studies in targeting individuals who were already scheduled for a lesion excision. In addition, the quality of the reference standard could not be adequately judged because of the studies' poor methodological quality and reporting.


However, these early studies demonstrate the potential of mobile imaging screens.

Remote screening tools that assess pain or disease risk can help triage patient populations to decrease the overall number of patients that must be processed by traditional healthcare system practices. While screening tools have primarily focused on risk assessments using measurements recorded in the clinic, several of the studies mentioned in this section expand the scope to include data collected remotely from smartphones and mobile apps. However, the low accuracy rates of many of these models highlight areas for improvement. Furthermore, regardless of the success of such models, it is important that there be sufficient oversight of claims made by companies developing screening tools to ensure their appropriate use and interpretation of results. In general, recommendations made by these nascent technologies should be followed up by a medical professional.
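As a rough illustration of the pain-volatility approach described earlier in this section, the sketch below uses k-means to define high- and low-volatility groups from simulated self-reported pain scores and a random forest to predict group membership from early follow-up summaries. All data and features here are hypothetical.

```python
# Hedged sketch: defining pain-volatility groups with k-means and predicting
# future volatility group with a random forest. Scores and features are simulated.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_users, n_days = 300, 60
# Simulated daily self-reported pain scores (0-10); each user has their own
# variability scale so that volatility differs meaningfully across users.
scales = rng.uniform(0.2, 2.5, (n_users, 1))
pain = np.clip(5 + scales * rng.standard_normal((n_users, n_days)), 0, 10)

# Volatility summary per user: mean absolute day-to-day change in pain score
volatility = np.abs(np.diff(pain, axis=1)).mean(axis=1)

# Step 1: let k-means define "high" vs "low" volatility groups (k = 2).
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    volatility.reshape(-1, 1))

# Step 2: predict the group from features available early in follow-up
# (simple summaries of the first month of self-reported scores).
first_month = pain[:, :30]
X = np.column_stack([first_month.mean(axis=1), first_month.std(axis=1),
                     np.abs(np.diff(first_month, axis=1)).mean(axis=1)])
rf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, groups, cv=5).mean())
```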

9.10 Conclusion

Wellness is a multidimensional field, with a growing body of related research incorporating AI. This chapter has provided a general overview of how AI can aid in providing assessments, promoting healthy behaviors, and preventing disease. The chapter has also highlighted several future directions for preventive and personalized medicine using data-driven approaches. As these fields expand, agency and ownership over personal health and wellness are becoming more feasible.120 The integration of information from different wellness factors, facilitated by AI, can lead to holistic health monitoring systems. Furthermore, as computational power increases, robust AI-based disease detection and analysis methods can take in data from continuous monitoring systems to supplement real-time health decision support systems.

However, with the benefits of new technologies come ethical and technical considerations. For example, we must decide who will be accountable for treatments and interventions provided by AI agents.92,121 We will also have to determine the appropriate level of compromise between privacy and prevention for a wide range of scenarios. In addition to these ethical dilemmas, there are technical issues. The inundation of health-related information requires better infrastructure for data management and security. AI models also need to be critically evaluated on factors including, but not limited to, accuracy, reliability, and interpretability, as well as how sample populations or the use of simulated data affect model outputs. With the popularization of health monitoring tools and wellness apps, these issues must be addressed to facilitate broad but appropriate AI implementation. The pervasiveness of AI in each facet of wellness highlights its potential to improve personal ownership of health and drive a shift toward preventative medicine.113-121


References


1. Chopra M, Galbraith S, Darnton-Hill I. A global response to a global problem: the epidemic of overnutrition. Bull World Health Organ 2002;80(12):952-8.
2. Kellogg RA, Dunn J, Snyder MP. Personal omics for precision health. Circulation Res 2018;122(9):1169-71. Available from: https://doi.org/10.1161/CIRCRESAHA.117.310909.
3. Athilingam P, Jenkins B. Mobile phone apps to support heart failure self-care management: integrative review. JMIR Cardio 2018;2(1):e10057. Available from: https://doi.org/10.2196/10057.
4. Cook DJ, Augusto JC, Jakkula VR. Ambient intelligence: technologies, applications, and opportunities. Pervasive Mob Comput 2009;5(4):277-98. Available from: https://doi.org/10.1016/j.pmcj.2009.04.001.
5. Dunn J, Runge R, Snyder M. Wearables and the medical revolution. Personalized Med 2018;15(5):429-48. Available from: https://doi.org/10.2217/pme-2018-0044.
6. Ermes M, Pärkkä J, Mäntyjärvi J, Korhonen I. Detection of daily activities and sports with wearable sensors in controlled and uncontrolled conditions. IEEE Trans Inf Technol Biomed 2008;12(1):20-6. Available from: https://doi.org/10.1109/TITB.2007.899496.
7. Huh J, Le T, Reeder B, Thompson HJ, Demiris G. Perspectives on wellness self-monitoring tools for older adults. Int J Med Inform 2013;82(11). Available from: https://doi.org/10.1016/j.ijmedinf.2013.08.009.
8. Kuziemsky C, Maeder AJ, John O, Gogia SB, Basu A, Meher S, et al. Role of artificial intelligence within the telehealth domain. Yearb Med Inform 2019;28(1):35-40. Available from: https://doi.org/10.1055/s-0039-1677897.
9. Shameer K, Badgeley MA, Miotto R, Glicksberg BS, Morgan JW, Dudley JT. Translational bioinformatics in the era of real-time biomedical, health care and wellness data streams. Brief Bioinforma 2017;18(1):105-24. Available from: https://doi.org/10.1093/bib/bbv118.
10. Stanford V. Biosignals offer potential for direct interfaces and health monitoring. IEEE Pervasive Comput 2004;3(1):99-103. Available from: https://doi.org/10.1109/MPRV.2004.1269140.
11. Sundaravadivel P, Kougianos E, Mohanty S, Ganapathiraju M. Everything you wanted to know about smart health care: evaluating the different technologies and components of the internet of things for better health. IEEE Consum Electron Mag 2018;7:18-28. Available from: https://doi.org/10.1109/MCE.2017.2755378.
12. Polak R, Pojednic RM, Phillips EM. Lifestyle medicine education. Am J Lifestyle Med 2015;9(5):361-7. Available from: https://doi.org/10.1177/1559827615580307.
13. Goldsack J, Coravos A, Bakker J, Bent B, Dowling AV, Fitzer-Attas C, et al. Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for biometric monitoring technologies (BioMeTs). JMIR Preprints; 2019. Available from: https://preprints.jmir.org/preprint/17264.
14. Baig MM, Gholamhosseini H. Smart health monitoring systems: an overview of design and modeling. J Med Syst 2013;37(2):9898. Available from: https://doi.org/10.1007/s10916-012-9898-z.
15. Bert F, Giacometti M, Gualano MR, Siliquini R. Smartphones and health promotion: a review of the evidence. J Med Syst 2013;38(1):9995. Available from: https://doi.org/10.1007/s10916-013-9995-7.
16. Chung AE, Griffin AC, Selezneva D, Gotz D. Health and fitness apps for hands-free voice-activated assistants: content analysis. JMIR MHealth UHealth 2018;6(9). Available from: https://doi.org/10.2196/mhealth.9705.
17. DeGregory KW, Kuiper P, DeSilvio T, Pleuss JD, Miller R, Roginski JW, et al. A review of machine learning in obesity. Obes Rev 2018;19(5):668-85. Available from: https://doi.org/10.1111/obr.12667.
A review of machine learning in obesity. Obes Rev 2018;19(5):668 85. Available from: https://doi.org/10.1111/obr.12667. 18. Hansen WB, Scheier LM. Specialized smartphone intervention apps: review of 2014 to 2018 NIH funded grants. JMIR MHealth UHealth 2019;7(7):e14655. Available from: https://doi.org/10.2196/14655. 19. Jazayeri SMHM, Jamshidnezhad A. Top mobile applications in pediatrics and children’s health: assessment and intelligent analysis tools for a systematic investigation. Malaysian J Med Sci: MJMS 2019;26(1):5 14. Available from: https://doi.org/10.21315/mjms2019.26.1.2. 20. Eriksson N, Macpherson JM, Tung JY, Hon LS, Naughton B, Saxonov S, et al. Web-based, participant-driven studies yield novel genetic associations for common traits. PLoS Genet 2010;6(6). Available from: https://doi. org/10.1371/journal.pgen.1000993. 21. Hanson MA, Barth AT, Silverman C. In home assessment and management of health and wellness with BeCloseTM ambient, artificial intelligence. Proceedings of the second conference on wireless health. 2011. p. 25:1 25:2. https://doi.org/10.1145/2077546.2077574. 22. Milani RV, Franklin NC. The role of technology in healthy living medicine. Prog Cardiovasc Dis 2017;59 (5):487 91. Available from: https://doi.org/10.1016/j.pcad.2017.02.001.

III. Clinical applications

168

9. Roles of artificial intelligence in wellness, healthy living, and healthy status sensing

23. Mohr DC, Schueller SM, Montague E, Burns MN, Rashidi P. The behavioral intervention technology model: an integrated conceptual and technological framework for eHealth and mHealth interventions. J Med Internet Res 2014;16(6). Available from: https://doi.org/10.2196/jmir.3077. 24. Kataria S, Ravindran V. Digital health: a new dimension in rheumatology patient care. Rheumatol Int 2018;38 (11):1949 57. Available from: https://doi.org/10.1007/s00296-018-4037-x. 25. Bart R, Ishak WW, Ganjian S, Jaffer KY, Abdelmesseh M, Hanna S, et al. The assessment and measurement of wellness in the clinical medical setting: a systematic review. Innov Clin Neurosci 2018;15 (09 10):14 23. 26. Kamiˇsali´c A, Fister I, Turkanovi´c M, Karakatiˇc S. Sensors and functionalities of non-invasive wrist-wearable devices: a review. Sensors 2018;18(6):1714. Available from: https://doi.org/10.3390/s18061714. 27. Witt DR, Kellogg RA, Snyder MP, Dunn J. Windows into human health through wearables data analytics. Curr Opin Biomed Eng 2019;9:28 46. Available from: https://doi.org/10.1016/j.cobme.2019.01.001. 28. Ford ES. Risks for all-cause mortality, cardiovascular disease, and diabetes associated with the metabolic syndrome: a summary of the evidence. Diabetes Care 2005;28(7):1769 78. Available from: https://doi.org/ 10.2337/diacare.28.7.1769. 29. Meldrum DR, Morris MA, Gambone JC. Obesity pandemic: causes, consequences, and solutions—but do we have the will? Fertil Steril 2017;107(4):833 9. Available from: https://doi.org/10.1016/j. fertnstert.2017.02.104. 30. Villegas R, Liu S, Gao Y-T, Yang G, Li H, Zheng W, et al. Prospective study of dietary carbohydrates, glycemic index, glycemic load, and incidence of type 2 diabetes mellitus in middle-aged Chinese women. Arch Intern Med 2007;167(21):2310 16. Available from: https://doi.org/10.1001/archinte.167.21.2310. 31. de Ridder D, Kroese F, Evers C, Adriaanse M, Gillebaart M. Healthy diet: health impact, prevalence, correlates, and interventions. Psychol Health 2017;32(8):907 41. Available from: https://doi.org/10.1080/08870446.2017.1316849. 32. Cade JE. Measuring diet in the 21st century: use of new technologies. Proc Nutr Soc 2017;76(3):276 82. Available from: https://doi.org/10.1017/S0029665116002883. 33. Dunn JP, Hadjimichael M, Isparyan Y, Manral D, Runge R. MoveIt! Smartphone application for promoting healthy living. IEEE International Conference on Biomedical and Health Informatics, 2019;1. 34. Brownell KD, Jeffery RW. Improving long-term weight loss: pushing the limits of treatment. Behav Ther 1987;18(4):353 74. Available from: https://doi.org/10.1016/S0005-7894(87)80004-7. 35. Fm K, Rw J, Jl F, Mk S. Long-term follow-up of behavioral treatment for obesity: patterns of weight regain among men and women. Int J Obes 1989;13(2):123 36. 36. Anselma L, Mazzei A, De Michieli F. An artificial intelligence framework for compensating transgressions and its application to diet management. J Biomed Inform 2017;68:58 70. Available from: https://doi.org/ 10.1016/j.jbi.2017.02.015. 37. Zeevi D, Korem T, Zmora N, Israeli D, Rothschild D, Weinberger A, et al. Personalized nutrition by prediction of glycemic responses. Cell 2015;163(5):1079 94. Available from: https://doi.org/10.1016/j. cell.2015.11.001. 38. Ferrara G, Kim J, Lin S, Hua J, Seto E. A focused review of smartphone diet-tracking apps: usability, functionality, coherence with behavior change theory, and comparative validity of nutrient intake and energy estimates. JMIR MHealth UHealth 2019;7(5). 
Available from: https://doi.org/10.2196/mhealth.9232. 39. McAllister P, Zheng H, Bond R, Moorhead A. Combining deep residual neural network features with supervised machine learning algorithms to classify diverse food image datasets. Computers Biol Med 2018;95:217 33. Available from: https://doi.org/10.1016/j.compbiomed.2018.02.008. 40. Bitesnap. Photo Food J 2020. ,https://getbitesnap.com.. 41. Calorie Mama. Food AI—food image recognition and calorie counter using deep learning. 2017. ,https://www. caloriemama.ai/.. 42. Snap ItTM. Lose It!. 2020. ,https://www.loseit.com/snapit/.. 43. Lifesum. Food Tracker App-Millions of searchable foods. Lifesum; 2019. ,https://lifesum.com/food-tracker/.. 44. Booth FW, Roberts CK, Laye MJ. Lack of exercise is a major cause of chronic diseases. Compr Physiol 2012;2 (2):1143 211. Available from: https://doi.org/10.1002/cphy.c110025. 45. Kranz M, Mo¨ller A, Hammerla N, Diewald S, Plo¨tz T, Olivier P, et al. The mobile fitness coach: towards individualized skill assessment using personalized mobile devices. Pervasive Mob Comput 2013;9(2):203 15. Available from: https://doi.org/10.1016/j.pmcj.2012.06.002.

III. Clinical applications

References

169

46. Li X, Dunn J, Salins D, Zhou G, Zhou W, Schu¨ssler-Fiorenza Rose SM, et al. Digital health: tracking physiomes and activity using wearable biosensors reveals useful health-related information. PLoS Biol 2017;15(1): e2001402. Available from: https://doi.org/10.1371/journal.pbio.2001402. 47. Rabbi M, Pfammatter A, Zhang M, Spring B, Choudhury T. Automated personalized feedback for physical activity and dietary behavior change with mobile phones: a randomized controlled trial on adults. JMIR MHealth UHealth 2015;3(2). Available from: https://doi.org/10.2196/mhealth.4160. 48. Voicu R-A, Dobre C, Bajenaru L, Ciobanu R-I. Human physical activity recognition using smartphone sensors. Sensors (Basel, Switz) 2019;19(3). Available from: https://doi.org/10.3390/s19030458. 49. Montoye AHK, Mudd LM, Biswas S, Pfeiffer KA. Energy expenditure prediction using raw accelerometer data in simulated free living. Med Sci Sports Exerc 2015;47(8):1735. Available from: https://doi.org/10.1249/ MSS.0000000000000597. 50. Cheng Q, Juen J, Bellam S, Fulara N, Close D, Silverstein JC, et al. Classification models for pulmonary function using motion analysis from phone sensors. In: AMIA annual symposium proceedings, 2016. 2017. p. 401 10. 51. Johnson LB, Sumner S, Duong T, Yan P, Bajcsy R, Abresch RT, et al. Validity and reliability of smartphone magnetometer-based goniometer evaluation of shoulder abduction a pilot study. Man Ther 2015;20 (6):777 82. Available from: https://doi.org/10.1016/j.math.2015.03.004. 52. Russell S, Norvig P. Artificial intelligence—a modern approach. Series in artificial intelligence, vol. 11. Englewood Cliffs, NJ: Prentice Hall; 1996. ,https://www.cambridge.org/core/journals/knowledge-engineering-review/ article/artificial-intelligencea-modern-approach-by-russellstuart-and-norvigpeter-prentice-hall-series-in-artificial-intelligence-englewood-cliffs-nj/65AD9B9C5853AE2595E99E26800C30CE.. 53. Stein N, Brooks K. A fully automated conversational artificial intelligence for weight loss: longitudinal observational study among overweight and obese adults. JMIR Diabetes 2017;2(2):e28. Available from: https://doi. org/10.2196/diabetes.8590. 54. Hassoon A, Schrack J, Naiman D, Lansey D, Baig Y, Stearns V, et al. Increasing physical activity amongst overweight and obese cancer survivors using an Alexa-based intelligent agent for patient coaching: protocol for the physical activity by technology help (PATH) trial. JMIR Res Protoc 2018;7(2). Available from: https:// doi.org/10.2196/resprot.9096. 55. Alexa Skills Kit. Build skills with the Alexa Skills Kit. 2019. https://developer.amazon.com/en-US/docs/alexa/ ask-overviews/build-skills-with-the-alexa-skills-kit.html. 56. Porter AK, Schwartz M. Ride report: mobile app user guide. Br J Sports Med 2018;52(18):e4. Available from: https://doi.org/10.1136/bjsports-2017-098364. 57. Lee VR. What’s happening in the “quantified self” movement?. 2014. 5. 58. Maddison R, Rawstorn JC, Shariful Islam SM, Ball K, Tighe S, Gant N, et al. MHealth interventions for exercise and risk factor modification in cardiovascular disease. Exerc Sport Sci Rev 2019;47(2):86 90. Available from: https://doi.org/10.1249/JES.0000000000000185. 59. Rawstorn JC, Gant N, Direito A, Beckmann C, Maddison R. Telehealth exercise-based cardiac rehabilitation: a systematic review and meta-analysis. Heart 2016;102(15):1183 92. Available from: https://doi.org/10.1136/ heartjnl-2015-308966. 60. Bragazzi NL, Guglielmi O, Garbarino S. SleepOMICS: how big data can revolutionize sleep science. 
Int J Environ Res Public Health 2019;16(2):291. Available from: https://doi.org/10.3390/ijerph16020291. 61. Uehli K, Mehta AJ, Miedinger D, Hug K, Schindler C, Holsboer-Trachsler E, et al. Sleep problems and work injuries: a systematic review and meta-analysis. Sleep Med Rev 2014;18(1):61 73. Available from: https://doi. org/10.1016/j.smrv.2013.01.004. 62. Douglas NJ, Thomas S, Jan MA. Clinical value of polysomnography. Lancet 1992;339(8789):347 50. Available from: https://doi.org/10.1016/0140-6736(92)91660-Z. 63. Jafari B, Mohsenin V. Polysomnography—ClinicalKey. 2010. ,https://www.clinicalkey.com/#!/content/ playContent/1-s2.0-S0272523110000286?returnurl 5 https:%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii% 2FS0272523110000286%3Fshowall%3Dtrue&referrer 5 https:%2F%2Fwww.ncbi.nlm.nih.gov%2F.. 64. Sadeghi R, Banerjee T, Hughes JC, Lawhorne LW. Sleep quality prediction in caregivers using physiological signals. Computers Biol Med 2019;110:276 88. Available from: https://doi.org/10.1016/j.compbiomed.2019.05.010. 65. De Koninck J, Gagnon P, Lallier S. Sleep positions in the young adult and their relationship with the subjective quality of sleep. Sleep 1983;6(1):52 9. Available from: https://doi.org/10.1093/sleep/6.1.52.

III. Clinical applications

170

9. Roles of artificial intelligence in wellness, healthy living, and healthy status sensing

66. Hsiao R-S, Chen T-X, Bitew MA, Kao C-H, Li T-Y. Sleeping posture recognition using fuzzy c-means algorithm. Biomed Eng OnLine 2018;17(Suppl. 2). Available from: https://doi.org/10.1186/s12938-018-0584-3. 67. Khandoker AH, Palaniswami M, Karmakar CK. Support vector machines for automated recognition of obstructive sleep apnea syndrome from ECG recordings. IEEE Trans Inf Technol Biomedicine 2009;13(1):37 48. Available from: https://doi.org/10.1109/TITB.2008.2004495. 68. Roche F, Pichot V, Sforza E, Court-Fortune I, Duverney D, Costes F, et al. Predicting sleep apnoea syndrome from heart period: a time-frequency wavelet analysis. Eur Respir J 2003;22(6):937 42. Available from: https:// doi.org/10.1183/09031936.03.00104902. 69. WHO. Defining sexual health. WHO; 2006. ,http://www.who.int/reproductivehealth/topics/sexual_health/sh_definitions/en/.. 70. WHO. Integrating poverty and gender into health programmes: a sourcebook for health professionals (sexual and reproductive health). WHO; 2008. ,https://www.who.int/gender-equity-rights/knowledge/povertygender-in-health-programmes-sexual-reproductive-health/en/. 71. Penders J, Altini M, Van Hoof C, Dy E. Wearable sensors for healthier pregnancies. Proc IEEE 2015;103 (2):179 91. Available from: https://doi.org/10.1109/JPROC.2014.2387017. 72. L’Engle KL, Mangone ER, Parcesepe AM, Agarwal S, Ippoliti NB. Mobile phone interventions for adolescent sexual and reproductive health: a systematic review. Pediatrics 2016;138(3). Available from: https://doi.org/ 10.1542/peds.2016-0884. 73. Goodale BM, Shilaih M, Falco L, Dammeier F, Hamvas G, Leeners B. Wearable sensors reveal menses-driven changes in physiology and enable prediction of the fertile window: observational study. J Med Internet Res 2019;21(4). Available from: https://doi.org/10.2196/13404. 74. Auger J, Kunstmann JM, Czyglik F, Jouannet P. Decline in semen quality among fertile men in Paris during the past 20 years. N Engl J Med 1995;332(5):281 5. Available from: https://doi.org/10.1056/NEJM199502023320501. 75. Berling S, Wo¨lner-Hanssen P. No evidence of deteriorating semen quality among men in infertile relationships during the last decade: a study of males from Southern Sweden. Hum Reprod (Oxford, Engl) 1997;12(5):1002 5. Available from: https://doi.org/10.1093/humrep/12.5.1002. 76. Splingart C, Frapsauce C, Veau S, Barthe´le´my C, Roye`re D, Gue´rif F. Semen variation in a population of fertile donors: evaluation in a French centre over a 34-year period. Int J Androl 2012;35(3):467 74. Available from: https://doi.org/10.1111/j.1365-2605.2011.01229.x. 77. Swan SH, Elkin EP, Fenster L. The question of declining sperm density revisited: an analysis of 101 studies published 1934-1996. Environ Health Perspect 2000;108(10):961 6. Available from: https://doi.org/10.1289/ ehp.00108961. 78. Girela JL, Gil D, Johnsson M, Gomez-Torres MJ, De Juan J. Semen parameters can be predicted from environmental factors and lifestyle using artificial intelligence methods. Biol Reprod 2013;88(4). Available from: https://doi.org/10.1095/biolreprod.112.104653. 79. Jime´nez-Serrano S, Tortajada S, Garcı´a-Go´mez JM. A mobile health application to predict postpartum depression based on machine learning. Telemed E-Health 2015;21(7):567 74. Available from: https://doi.org/ 10.1089/tmj.2014.0113. 80. Cacioppo JT. Social neuroscience: autonomic, neuroendocrine, and immune responses to stress. Psychophysiology 1994;31(2):113 28. Available from: https://doi.org/10.1111/j.1469-8986.1994.tb01032.x. 81. 
Schneiderman N, Ironson G, Siegel SD. Stress and health: psychological, behavioral, and biological determinants. Annu Rev Clin Psychol 2005;1:607 28. Available from: https://doi.org/10.1146/annurev.clinpsy.1.102803.144141. 82. Yang L, Zhao Y, Wang Y, Liu L, Zhang X, Li B, et al. The effects of psychological stress on depression. Curr Neuropharmacol 2015;13(4):494 504. Available from: https://doi.org/10.2174/1570159X1304150831150507. 83. Brand S. Mood and learning. In: Seel NM, editor. Encyclopedia of the Sciences of Learning. Springer US; 2012. p. 2328 30. Available from: https://doi.org/10.1007/978-1-4419-1428-6_40. 84. Holmes EA, Bonsall MB, Hales SA, Mitchell H, Renner F, Blackwell SE, et al. Applications of time-series analysis to mood fluctuations in bipolar disorder to promote treatment innovation: a case series. Transl Psychiatry 2016;6(1):e720. Available from: https://doi.org/10.1038/tp.2015.207. 85. Perez Arribas I, Goodwin GM, Geddes JR, Lyons T, Saunders KEA. A signature-based machine learning model for distinguishing bipolar disorder and borderline personality disorder. Transl Psychiatry 2018;8. Available from: https://doi.org/10.1038/s41398-018-0334-0.

III. Clinical applications

References

171

86. Lai, S., Jin, L., & Yang, W. (2017). Online signature verification using recurrent neural network and lengthnormalized path signature. ArXiv:1705.06849 [Cs]. ,http://arxiv.org/abs/1705.06849.. 87. Liu, M., Jin, L., & Xie, Z. (2017). PS-LSTM: capturing essential sequential online information with path signature and LSTM for writer identification. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 01, 664 669. Available from: https://doi.org/10.1109/ICDAR.2017.114 88. Xie Z, Sun Z, Jin L, Ni H, Lyons T. Learning spatial-semantic context with fully convolutional recurrent network for online handwritten chinese text recognition. IEEE Trans Pattern Anal Mach Intell 2018;40(8):1903 17. Available from: https://doi.org/10.1109/TPAMI.2017.2732978. 89. Choudhury, M.D., Gamon, M., Counts, S., & Horvitz, E. (2013). Predicting depression via social media. 10. 90. Coppersmith G, Leary R, Crutchley P, Fine A. Natural language processing of social media as screening for suicide risk. Biomed Inform Insights 2018;10.. Available from: https://doi.org/10.1177/1178222618792860. 91. Gjoreski M, Luˇstrek M, Gams M, Gjoreski H. Monitoring stress with a wrist device using context. J Biomed Inform 2017;73:159 70. Available from: https://doi.org/10.1016/j.jbi.2017.08.006. 92. Kretzschmar K, Tyroll H, Pavarini G, Manzini A, Singh I. Can your phone be your therapist? Young people’s ethical perspectives on the use of fully automated conversational agents (Chatbots) in mental health support. Biomed Inform Insights 2019;11. Available from: https://doi.org/10.1177/1178222619829083. 93. Fitzpatrick KK, Darcy A, Vierhile M. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial. JMIR Ment Health 2017;4(2):e19. Available from: https://doi.org/10.2196/mental.7785. 94. Inkster B, Sarda S, Subramanian V. An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: real-world data evaluation mixed-methods study. JMIR MHealth UHealth 2018;6(11). Available from: https://doi.org/10.2196/12106. 95. Malhi GS, Hamilton A, Morris G, Mannie Z, Das P, Outhred T. The promise of digital mood tracking technologies: are we heading on the right track? Evidence-Based Ment Health 2017;20(4):102 7. Available from: https://doi.org/10.1136/eb-2017-102757. 96. Pavel M, Jimison HB, Korhonen I, Gordon CM, Saranummi N. Behavioral informatics and computational modeling in support of proactive health management and care. IEEE Trans Biomed Eng 2015;62(12):2763 75. Available from: https://doi.org/10.1109/TBME.2015.2484286. 97. Son Y-J, Kim H-G, Kim E-H, Choi S, Lee S-K. Application of support vector machine for prediction of medication adherence in heart failure patients. Healthc Inform Res 2010;16(4):253 9. Available from: https://doi. org/10.4258/hir.2010.16.4.253. 98. Dolgin M, editor. Nomenclature and criteria for diagnosis of diseases of the heart and great vessels/the Criteria Committee of the New York Heart Association. 9th ed. Little, Brown; 1994. 99. Morrow DG, Weiner M, Young J, Steinley D, Deer M, Murray MD. Improving medication knowledge among older adults with heart failure: a patient-centered approach to instruction design. Gerontologist 2005;45 (4):545 52. Available from: https://doi.org/10.1093/geront/45.4.545. 100. Vlasnik JJ, Aliotta SL, DeLor B. Medication adherence: factors influencing compliance with prescribed medication plans. 
Case Manager 2005;16(2):47 51. Available from: https://doi.org/10.1016/j.casemgr.2005.01.009. 101. Mamoshina P, Kochetov K, Cortese F, Kovalchuk A, Aliper A, Putin E, et al. Blood biochemistry analysis to detect smoking status and quantify accelerated aging in smokers. Sci Rep 2019;9. Available from: https:// doi.org/10.1038/s41598-018-35704-w. 102. Dumortier A, Beckjord E, Shiffman S, Sejdi´c E. Classifying smoking urges via machine learning. Comput Methods Prog Biomed 2016;137:203 13. Available from: https://doi.org/10.1016/j.cmpb.2016.09.016. 103. Suchting R, He´bert ET, Ma P, Kendzor DE, Businelle MS. Using elastic net penalized Cox proportional hazards regression to identify predictors of imminent smoking lapse. Nicotine Tob Res 2019;21(2):173 9. Available from: https://doi.org/10.1093/ntr/ntx201. 104. Cook D, Das SK. Smart environments: technology, protocols, and applications. John Wiley & Sons; 2004. 105. Coradeschi S, Cesta A, Cortellessa G, Coraci L, Gonzalez J, Karlsson L, et al. GiraffPlus: combining social interaction and long term monitoring for promoting independent living. In: 2013 6th International conference on human system interactions (HSI). 2013. 578 585. Available from: https://doi.org/10.1109/HSI.2013.6577883. 106. de Morais WO, Lundstro¨m J, Wickstro¨m N. Active in-database processing to support ambient assisted living systems. Sensors (Basel, Switz) 2014;14(8):14765 85. Available from: https://doi.org/10.3390/s140814765.

III. Clinical applications

172

9. Roles of artificial intelligence in wellness, healthy living, and healthy status sensing

107. Uddin MZ, Khaksar W, Torresen J. Ambient sensors for elderly care and independent living: a survey. Sensors (Basel, Switz) 2018;18(7). Available from: https://doi.org/10.3390/s18072027. 108. Lundstro¨m J, Ja¨rpe E, Verikas A. Detecting and exploring deviating behaviour of smart home residents. Expert Syst Appl 2016;55:429 40. Available from: https://doi.org/10.1016/j.eswa.2016.02.030. 109. Andreu Y, Chiarugi F, Colantonio S, Giannakakis G, Giorgi D, Henriquez P, et al. Wize Mirror—a smart, multisensory cardio-metabolic risk monitoring system. Comput Vis Image Underst 2016;148:3 22. Available from: https://doi.org/10.1016/j.cviu.2016.03.018. 110. Colantonio S, Coppini G, Germanese D, Giorgi D, Magrini M, Marraccini P, et al. A smart mirror to promote a healthy lifestyle. Biosyst Eng 2015;138:33 43. Available from: https://doi.org/10.1016/j. biosystemseng.2015.06.008. 111. Henriquez P, Matuszewski BJ, Andreu-Cabedo Y, Bastiani L, Colantonio S, Coppini G, et al. Mirror on the wall. An unobtrusive intelligent multisensory mirror for well-being status self-assessment and visualization. IEEE Trans Multimed 2017;19(7):1467 81. Available from: https://doi.org/10.1109/TMM.2017.2666545. 112. World Health Organization. Social and environmental determinants of health and health inequalities in Europe: fact sheet. 2012. ,http://www.euro.who.int/__data/assets/pdf_file/0006/185217/Social-and-environmentaldeterminants-Fact-Sheet.pdf.. 113. Di Q, Wang Y, Zanobetti A, Wang Y, Koutrakis P, Choirat C, et al. Air pollution and mortality in the medicare population. N Engl J Med 2017;376(26):2513 22. Available from: https://doi.org/10.1056/ NEJMoa1702747. 114. Ren Z, Zhu J, Gao Y, Yin Q, Hu M, Dai L, et al. Maternal exposure to ambient PM10 during pregnancy increases the risk of congenital heart defects: evidence from machine learning models. Sci Total Environ 2018;630:1 10. Available from: https://doi.org/10.1016/j.scitotenv.2018.02.181. 115. Johnston FH, Wheeler AJ, Williamson GJ, Campbell SL, Jones PJ, Koolhof IS, et al. Using smartphone technology to reduce health impacts from atmospheric environmental hazards. Environ Res Lett 2018;13 (4):044019. Available from: https://doi.org/10.1088/1748-9326/aab1e6. 116. Ye C, Fu T, Hao S, Zhang Y, Wang O, Jin B, et al. Prediction of incident hypertension within the next year: prospective study using statewide electronic health records and machine learning. J Med Internet Res 2018;20 (1):e22. Available from: https://doi.org/10.2196/jmir.9268. 117. Rahman QA, Janmohamed T, Pirbaglou M, Clarke H, Ritvo P, Heffernan JM, et al. Defining and predicting pain volatility in users of the manage my pain app: analysis using data mining and machine learning methods. J Med Internet Res 2018;20(11). Available from: https://doi.org/10.2196/12001. 118. Atee M, Hoti K, Hughes JD. A technical note on the PainChekt system: a web portal and mobile medical device for assessing pain in people with dementia. Front Aging Neurosci 2018;10. Available from: https://doi. org/10.3389/fnagi.2018.00117. 119. Chuchu N, Takwoingi Y, Dinnes J, Matin RN, Bassett O, Moreau JF, et al. Smartphone applications for triaging adults with skin lesions that are suspicious for melanoma. Cochrane Database Syst Rev 2018;12. Available from: https://doi.org/10.1002/14651858.CD013192. 120. Jain R. A navigational approach to health. ArXiv:1805.05402 [Cs] 2018. ,http://arxiv.org/abs/1805.05402.. 121. Martinez-Martin N, Kreitmair K. 
Ethical issues for direct-to-consumer digital psychotherapy apps: addressing accountability, data protection, and consent. JMIR Ment Health 2018;5(2):e32. Available from: https:// doi.org/10.2196/mental.9423.

III. Clinical applications

C H A P T E R

10 The growing significance of smartphone apps in data-driven clinical decision-making: Challenges and pitfalls

Iva Halilaj, Yvonka van Wijk, Arthur Jochems and Philippe Lambin

Abstract
Smartphone applications have a growing role within the medical field, as they provide both doctors and patients with instant access to information and decision-support tools. This chapter focuses on data-driven clinical decision-making apps and explains the different categories of apps used in health care, such as predictive model apps, camera-based apps, and artificial intelligence based apps. Recent studies show the potential of clinical decision-support applications to empower patients and clinicians in decision-making and underline the contribution of technology and artificial intelligence to quality of life and health care. However, it also becomes clear that standardized clinical evaluation guidelines for these applications are needed. In conclusion, clinical trials are needed to increase the quality of these applications.

Keywords: Data-driven apps; artificial intelligence apps; decision-support tools; applications

10.1 Introduction

Smartphone apps are becoming increasingly relevant in medical practice and data-driven decision-making processes. These apps provide the means, for both doctors and patients, to access information, communicate, and coordinate care strategies, as well as provide access to learning opportunities.1-3 However, ensuring and regulating quality control in the sale and distribution of these apps, so that their effectiveness, reliability, and user-friendliness are assured, remains a problem.


FIGURE 10.1 Distribution of digital technologies used in medical phone apps according to Watson et al.4

There is now a recognized need to move beyond the stage of collegial recommendations for app usage to rigorous, peer-reviewed clinical evaluations. Indeed, apps have moved beyond their initial use of merely providing information and are now crucial in supporting the decision-making process in patient care and treatment strategies. Like all tools that influence treatment decisions, they now fall under the category of "Medical Devices" for purposes of evaluation standards and regulation.4 This chapter seeks to explore and assess ways to evaluate new decision-support apps, as well as provide insight into managing app-based trials. The focus of the chapter will be on apps developed for clinical care and on how they are influencing the clinical decision-making process. Furthermore, this chapter will pay attention to how such apps are being evaluated for accuracy and clinical impact. The most prevalent digital technologies used in medical smartphone apps, as shown in Fig. 10.1, are the camera, predictive models, digitized guidelines, metronomes, and EEG data processing.4,5 We will discuss these technologies in detail in separate sections of this chapter.

10.2 Distribution of apps in the field of medicine

The majority of medical apps available are community medicine apps. Such apps emphasize appropriate hospital referrals6,7 and interactive tools for monitoring symptoms and medications.8,9 They reflect a patient, rather than clinician, orientation, taking advantage of the nearly universal availability of smartphones to focus on promoting health education or chronic disease monitoring and management. As apps have become more sophisticated, they are emerging as a potential tool for assisting specialist decision-making. Among the apps reviewed, one-third were designed to deal with cardiovascular disease both within and outside a hospital setting. Since heart disease and strokes are known to cause more deaths than any other disease, app developers may have made this field a priority. Advances in smartphone cameras allow for diagnosis at a distance, which has led to the increasing popularity of apps in fields such as radiology and dermatology.10-12 Care of pregnant women, from conception to puerperium, requires a number of complex risk assessments; therefore this field could strongly benefit from such tools.12


10.3 Distribution of apps over different locations

The United States leads in the adoption of smartphone app technology, although its use is truly becoming global in scope. Increasing numbers of clinical trials are demonstrating the usefulness of decision-support apps in both acute settings and chronic disease management. Health concerns in the United States are dominated by heart disease, which is the leading cause of death, and opioid abuse, an emerging epidemic that saw a 345% increase in mortality between 2001 and 2016.13 For this reason, cardiovascular risk assessment and opioid abuse prediction tools have stimulated a growing number of studies addressing the accuracy and clinical effectiveness of these tools.9,14 Medical app research and development remains the domain of economically developed or rapidly developing countries. China, with the greatest potential for app usage, still largely focuses on outpatient and community health concerns, but has produced innovative apps such as retinal image analysis15 and digital skin surface reconstruction.16 Besides the United States and China, a number of Western European countries, as well as Iran and New Zealand, have produced significant studies of app usage.

10.4 Reporting application development approaches

The circumstances of app development raise significant questions about how their reliability and effectiveness are evaluated. Specifically, which responsibilities belong to the investigators, which to the software developers, and what is the overlap? App developers typically provide adequate descriptions of the creation and validation of algorithms, but conversion to smartphone applications is done without description of app design or risk assessments.17,18 Since this part of the medical field is still very new, there is a lack of guidance on the development of medical apps, which may explain this phenomenon. While regulatory authorities initially gave little response to the exponential increase in medical apps, guidelines are now available from the FDA (Food and Drug Administration) and MHRA (Medicines and Healthcare products Regulatory Agency). But the emerging processes are not without their own liabilities. Schoemans et al. argue that the quality control steps that would be necessary at each iteration of a new tool are not feasible given the dynamic nature of the development of such tools.19 Trial protocols for app validation tests have not referred to regulatory approval. Clearing houses, organizations established to review, organize, and certify health-care apps (e.g., the NHS Apps Library),20 were not referenced either. In the future, perhaps the SANA organization will provide this field with more consistency and clarity through technology and educational programs.21

10.5 Decision-support modalities

What makes smartphone apps so useful for clinical decision support is their access to functional technology. The advancement and quality of the technology within the smartphone directly influence the impact the apps have on risk assessment, and this determines the best-suited method of evaluation.

10.6 Camera-based apps

The phone's camera is the most critical feature for clinical decision support, which is why a large portion of clinical trials for clinical decision-making apps22 assess camera-based apps. In essence, camera-based apps can be used to take a picture of existing data and share it, which enables fast and remote decision-making. Studies have demonstrated the effectiveness of camera images in transmitting electrocardiograms23 or radiological images24,25 to other health professionals. In their most basic form, these types of applications are evaluated by comparing experts' interpretation of the photograph taken by the smartphone with their interpretation of the original medical images. This method of evaluation often yields good results, showing strong interoperator agreement between the conventional image and the smartphone image; however, it relies on simulated image assessment and does not evaluate clinical utility. Arguably, this could be sufficient for tools that do not directly impact clinical decisions; however, clinical trials should be considered to ensure quality for such apps.
More sophisticated camera-based apps are applied within the field of dermatology, where clip-on microscopes are used to image the skin and advanced image-analysis algorithms are applied to extract skin properties. These properties can be used to reconstruct a haptic augmented skin surface, as described by Kim in a preclinical study.16 Another field of study that uses more advanced camera-based apps is retinal imaging. Xu et al. describe a method for retinal image analysis based on smartphone images for the onsite detection of diabetic retinopathy.15 Although these methods perform well in preclinical studies, evidence that the apps perform sufficiently well in a clinical environment remains lacking. A Cochrane review22 considered diagnostic apps for the evaluation of melanomas. In this review, two types of apps were considered: artificial intelligence based apps and store-and-forward apps. The artificial intelligence based apps use algorithms trained on a database of images containing malignant and benign lesions to classify the acquired image, while store-and-forward apps forward the photograph to a trained medical expert for diagnosis. The review showed that, in general, the artificial intelligence based apps performed unreliably and were outperformed by the store-and-forward apps; however, for the latter, a large number of the images were found to be unevaluable. A significant limitation of these studies is that they only included preselected skin lesions that were going to be surgically removed, which is not an accurate portrayal of the variety of lesions that would be analyzed in clinical use. Another limitation is that the reported performance of these apps may already be outdated: smartphone camera technology is rapidly evolving, and it is likely that the improved image quality since these papers were published would result in better diagnostic performance.26 Moreover, the performance of a camera-based app relies more heavily on the technical functions of the device than on the software. Studies that evaluate the performance of these diagnostic apps need to take into account the large range in technical capabilities of smartphone devices. The rapid development of this technology is a new challenge for research procedures, as trials need to be updated as cameras and apps improve.
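The basic evaluations described above compare expert readings of the smartphone photograph with readings of the original image, and agreement between the two sets of readings is commonly summarized with a chance-corrected statistic such as Cohen's kappa. The sketch below is a minimal illustration of that computation on hypothetical paired ratings; it is not taken from any of the studies cited here.

```python
# Minimal sketch: chance-corrected agreement between readings of the original
# image and readings of a smartphone photograph of the same image.
# The ratings below are hypothetical and purely illustrative.
from sklearn.metrics import cohen_kappa_score

# 1 = abnormal, 0 = normal, as judged by an expert for 10 hypothetical cases
original_image_reading = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
smartphone_photo_reading = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(original_image_reading, smartphone_photo_reading)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```

A high kappa supports technical equivalence between the smartphone image and the conventional image, but, as noted above, it says nothing about clinical utility.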

10.7 Guideline/algorithm applications

Apps that provide clinicians with easy access to clinical guidelines are currently in use and provide benefits, especially in poorly funded settings with limited educational resources.27 Such apps do not face a hard requirement of clinical validation or formal evaluation, as they simply replicate validated guidelines; a minimal sketch of this idea follows below. When comparing the decisions made by experts using guideline-based apps to those made using conventional guidelines, studies find a strong overlap.28-30 When tested with questions on the management of regional anesthesia and thrombosis concerns, anesthesiologists using a decision-support tool significantly outperformed those in the control group not using these aids.29 It is, however, hard to say what the impact would be on patients in a clinical setting, as these results were based on theoretical situations and the questions were multiple choice.
The number of clinical studies that assess the effect of these apps on guideline fidelity is beginning to increase. The most popular method for the evaluation of guideline-based apps is the before-and-after design, often used to assess quality improvement after clinical deployment of these apps.31 One such study found a reduction in expenses through more efficient use of antibiotics after the introduction of a local antibiotic guideline app.31 Another example is the evaluation of a clinical decision-support system for the management of hypertension and diabetes, which showed significant improvements in blood pressure and fasting blood glucose.32 However, limitations remain in many of these studies, especially in determining on whom and when clinicians' personal devices were used in the trial. Such study designs have appeal because they are easy to perform locally and involve relatively few additional costs, but they must be evaluated with caution, even skepticism. The before-and-after scheme ignores possible temporal biases, and the lack of a control group leaves uncertainty about additional factors that may affect outcomes. The solution to such uncertainty is the use of randomized controlled trials. For decision-support tools for complex interventions, cluster randomized controlled trials are very common. Because the application is tested on a whole department, rather than on a random subset, there is no selection bias. In addition, this design saves time and money and provides more accurate conclusions about real clinical effectiveness.33
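To make the idea of a "digitized guideline" concrete, the fragment below encodes a single, entirely hypothetical dosing rule as a lookup function. The thresholds and doses are invented for this sketch; a real guideline app would replicate a validated guideline verbatim rather than introduce rules of its own.

```python
# Illustrative only: a hypothetical rule standing in for a digitized clinical
# guideline. Thresholds and doses are invented and must not be used clinically.
def recommended_dose_mg(weight_kg: float, renal_impairment: bool) -> float:
    base = 10.0 if weight_kg < 70 else 15.0        # hypothetical weight bands
    return base / 2 if renal_impairment else base  # hypothetical renal adjustment

print(recommended_dose_mg(65, renal_impairment=False))  # -> 10.0
print(recommended_dose_mg(82, renal_impairment=True))   # -> 7.5
```

Because such an app only restates an already validated rule, its evaluation centers on fidelity to the source guideline and on usability rather than on diagnostic accuracy.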

10.8 Predictive modeling applications

Predictive models can integrate large volumes of patient data to guide clinical decision-making. Eight app studies using predictive modeling apps are available in the literature. These apps target various subjects, such as risk of disease, the use of narcotics, or survival after cancer treatment.5,17,18,32,34-36 What these apps have in common is that they all utilize a form of prediction model in order to improve diagnosis by a medical expert or to provide prognostic information. The use of regression models, logistic or multivariable, is very popular,35,37 and some apps use a Bayesian model.38 Medical studies make extensive use of regression models to analyze the relationship between patient and disease characteristics and patient outcome. These models are often used to classify patients into different risk groups or to aid in treatment decision-making. Currently, these types of models perform very well in discriminating between high- and low-risk patient groups, but for individual predictions they remain lacking.39
The predecessors of decision-support apps are nomograms and flowcharts that visualize predictive models. These figures, however, are often confusing and complex, a problem that such apps can solve. By simplifying the visualization of risk factors and clinical characteristics, these applications can aid in informing patients and accelerate treatment decision-making. Predictive modeling aims to enhance clinical decision-making by learning from large volumes of patient data to make predictions about patients currently seen in the clinic. Due to the complex nature of medical decision-making, human bias, and the limited mental capacity of the human brain, clinicians perform suboptimally on many medical decision-making tasks.40 Doctors' judgment can be unconsciously swayed by a reluctance to give a patient bad news,36 or they may choose treatments based on a faulty assessment of risk.41,42 Predictive models may provide clinical decision support due to their objectivity.43 Furthermore, predictive models may provide a means of facilitating a more personalized approach to clinical decision-making.44
A hurdle in model development is that large volumes of patient data are required to train models and to ensure their quality. External validation of any model requires performance testing on different populations to reduce the risk of overfitting. For a model to be statistically valid, it must perform well both on the original dataset and on an independent, external validation set.39 When studies cannot be validated according to these criteria, their applicability and their usefulness as models can be questioned.18,34 Within the medical field, it is often difficult to obtain an external validation set, due to privacy concerns, monetary limitations, or lack of available data. Internal cross-validation can be a useful alternative. A downside of internal validation is that it can suffer from substantial bias, and generalizability may be limited. Guidelines for proper model building and validation exist, for example, the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement.45 Adhering to the criteria in this statement increases the likelihood of developing a model that is reliable in most medical contexts.
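As a concrete, non-authoritative illustration of the modeling and validation workflow described above, the sketch below fits a logistic regression risk model on synthetic data and estimates discrimination (area under the ROC curve) with internal cross-validation. The data and parameters are placeholders, not from any cited app; an external validation set from a different population would still be needed before any clinical use.

```python
# Minimal sketch of a logistic-regression risk model with internal
# cross-validation. The data are synthetic; this is not a clinical model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic "patients": 5 predictors, binary outcome (e.g., event / no event)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold internal cross-validation of discrimination (ROC AUC)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {auc.mean():.2f} +/- {auc.std():.2f}")

# External validation would repeat the AUC (and calibration) assessment on an
# independent dataset from a different population, per TRIPOD guidance.
```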

10.9 Sensor-linked apps

Smartphones' built-in accelerometers have useful applications in orthopedics and sports medicine. Within these fields, they are applied as digital goniometers, where they have been shown to outperform conventional goniometers and to yield very strong observer agreement.10,11,46,47 Though these initial results are promising, the clinical impact of such apps has yet to be evaluated.
In a pilot study of an app for measuring blood pressure, excellent results were found. The application was used in combination with a small pressure sensor and a mini microphone, and this new technology appears to be superior to the traditional Korotkoff auscultation method.48 Another study evaluated the impact on clinic-measured blood pressure of an app that, in combination with an ingestible sensor and a wearable sensor, measures medication intake and physical activity. The study showed significant blood pressure reductions compared to usual care and much higher medication adherence. In addition, the app allowed for more timely and personalized treatment decisions and promoted patient involvement and self-care.9
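The goniometer-style apps mentioned above typically estimate a limb or joint angle from the direction of gravity measured by the phone's accelerometer while the device is held still against the limb. The sketch below is a minimal, generic illustration of that computation with hypothetical readings; real apps add calibration, filtering of motion artifacts, and sensor fusion with the gyroscope.

```python
# Minimal sketch: inclination (tilt) angle from a static accelerometer reading.
# With the phone at rest, the accelerometer measures the gravity vector, so the
# tilt of one device axis relative to the horizontal plane follows from atan2.
import math

def tilt_deg(ax: float, ay: float, az: float) -> float:
    """Angle of the device's x axis above the horizontal plane, in degrees."""
    return math.degrees(math.atan2(ax, math.hypot(ay, az)))

# Hypothetical readings in units of g
print(round(tilt_deg(0.0, 0.0, 1.0), 1))    # 0.0  -> device lying flat
print(round(tilt_deg(0.71, 0.0, 0.71), 1))  # 45.0 -> tilted about 45 degrees
```

A range-of-motion measurement would take the difference between tilt readings in two limb positions, which is what a digital goniometer app reports.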

10.10 Discussion

This overview confirms the growing permeation of medical apps into all forms of clinical health care, from promoting research and health education to acute triage and prescribing medication. The most common decision-support applications are image based, built around clinical guideline-based instructive tools, or based on predictive models. Image-based tools aim to simplify and accelerate diagnosis, while the goal of predictive model-based tools is to use clinical features to predict outcome and support treatment decisions. Clinical guideline-based educational tools, on the other hand, seek to improve compliance with standards. Initial data indicate that these applications, while not entirely without problems, may not only provide practical alternatives to the standard of care but could also help create novel accuracy benchmarks.10,30
This chapter sought to offer a comprehensive overview of the existing evidence pertaining to these crucial and nearly universally used digital aids in the field of medicine. It has identified key areas of technical development and appropriate methods of evaluation. Apps utilizing predictive models offer real advantages in bringing rationality and personalized treatment to clinical decision-making, but both the program and the underlying model need robust external and clinical validation. While studies on diagnostic performance are numerous, clinical trials are limited. A limitation of the overview presented here stems from the early stage of development of health-care applications. Moreover, the classification of decision-support medical applications is quite diverse, and while every attempt has been made to present a complete overview of the literature, some important studies may have been overlooked. As this chapter focuses on clinical trials, it may underrepresent significant advances in artificial intelligence not yet translated into clinical research. Several interesting papers were not discussed in this chapter, for example, extensive technical accounts of the prototypes and models for measuring the efficiency of a new medication adherence application,49 a decision process for anthrax response,50 and preclinical yet real-world research on an app that links to a Doppler system for fetal heart auscultation.51 Each of them incorporated a systematic and reproducible approach, which addressed the need for clarity in the foundations of artificial intelligence decision-support applications.49


Nevertheless, the technical vocabulary required will likely present interpretative problems for health-care providers. The WHO52 recently recognized a common and structured vocabulary, which is paramount to promoting assessment and coordination across various digital health deployments. This initiative should assist researchers in the future. Challenges will, however, remain. For example, application development paths are not standardized, and there is an overall lack of quality assurance. In addition, competition in the smartphone market between Google, Apple, and other powerful players is tremendously fierce, with market shares varying from country to country, and neither the code, nor the developers' skill sets, nor the users are truly universal. Moreover, it is possible that traditional regulatory frameworks are too unwieldy to keep up with this competitive environment and may hence limit innovation.4 Governance systems, institutional partners, and doctors need to set out a proactive, yet evidence-based route for decision-making applications if they are to optimize their potential. That is clearly a difficult but commendable goal to achieve.

10.11 Summary

The use of clinical decision-support software to improve access to care is growing rapidly and has enormous potential to improve health-care quality. Conventional clinical trials struggle to keep pace with the rapidly evolving nature of app technology and its applications, and the health-care sector may need to modernize its evaluation and adoption criteria to capitalize on the opportunities offered by app technology and digitalization. As with all other clinical tools, users should take every measure to ensure safe and appropriate conditions for patients, especially now, when regulation of data-driven apps is still in its infancy.4

References

1. Shenouda JEA, Davies BS, Haq I. The role of the smartphone in the transition from medical student to foundation trainee: a qualitative interview and focus group study. BMC Med Educ 2018;18:175.
2. Payne KFB, Wharrad H, Watts K. Smartphone and medical related App use among medical students and junior doctors in the United Kingdom (UK): a regional survey. BMC Med Inform Decis Mak 2012;12.
3. Perry R, Burns R, Simon-Freeman R. A survey of mobile app use among California obstetrics and gynecology residents [8I]. Obstet Gynecol 2017;129:95S.
4. Watson HA, Tribe RM, Shennan AH. The role of medical smartphone apps in clinical decision-support: a literature review. Artif Intell Med 2019;100:101707.
5. Watson HA, Carter J, Seed PT, et al. The QUiPP App: a safe alternative to a treat-all strategy for threatened preterm labor. Ultrasound Obstet Gynecol 2017;50:342-6.
6. Hardy V, O'Connor Y, Heavin C, et al. The added value of a mobile application of community case management on referral, re-consultation and hospitalization rates of children aged under 5 years in two districts in Northern Malawi: study protocol for a pragmatic, stepped-wedge cluster-randomized controlled trial. Trials 2017;18.
7. Raihana S, Dunsmuir D, Huda T, et al. Development and internal validation of a predictive model including pulse oximetry for hospitalization of under-five children in Bangladesh. PLoS One 2015;10:e0143213.
8. Bot BM, Suver C, Neto EC, et al. The mPower study, Parkinson disease mobile data collected using ResearchKit. Sci Data 2016;3:160011.


9. Frias J, Virdi N, Raja P, et al. Effectiveness of digital medicines to improve clinical outcomes in patients with uncontrolled hypertension and type 2 diabetes: prospective, open-label, cluster-randomized pilot clinical trial. J Med Internet Res 2017;19:e246.
10. Mehta SP, Barker K, Bowman B, et al. Reliability, concurrent validity, and minimal detectable change for iPhone goniometer app in assessing knee range of motion. J Knee Surg 2017;30:577-84.
11. Fernandes HL, Albert MV, Kording KP. Measuring generalization of visuomotor perturbations in wrist movements using mobile phones. PLoS One 2011;6:e20290.
12. Lund S, Boas IM, Bedesa T, et al. Association between the safe delivery app and quality of care and perinatal survival in Ethiopia: a randomized clinical trial. JAMA Pediatr 2016;170:765-71.
13. Gomes T, Tadrous M, Mamdani MM, et al. The burden of opioid-related mortality in the United States. JAMA Netw Open 2018;1:e180217.
14. Bennett GG, Steinberg D, Askew S, et al. Effectiveness of an app and provider counseling for obesity treatment in primary care. Am J Prev Med 2018;55:777-86.
15. Xu X, Ding W, Wang X, et al. Smartphone-based accurate analysis of retinal vasculature towards point-of-care diagnostics. Sci Rep 2016;6:34603.
16. Kim K. Haptic augmented skin surface generation toward telepalpation from a mobile skin image. Skin Res Technol 2018;24:203-12.
17. Bertges DJ, Neal D, Schanzer A, et al. The vascular quality initiative cardiac risk index for prediction of myocardial infarction after vascular surgery. J Vasc Surg 2016;64:1411-1421.e4.
18. Sperduto PW, Jiang W, Brown PD, et al. Estimating survival in melanoma patients with brain metastases: an update of the graded prognostic assessment for melanoma using molecular markers (Melanoma-molGPA). Int J Radiat Oncol Biol Phys 2017;99:812-16.
19. Schoemans HM, Goris K, Van Durm R, et al. The eGVHD App has the potential to improve the accuracy of graft-versus-host disease assessment: a multicenter randomized controlled trial. Haematologica 2018;103:1698-707.
20. Boudreaux ED, Waring ME, Hayes RB, et al. Evaluating and selecting mobile health apps: strategies for healthcare providers and healthcare organizations. Transl Behav Med 2014;4:363-71.
21. Massachusetts Institute of Technology. Sana activities. Sana; n.d. <http://sana.mit.edu/> [accessed 28.02.20; retrieved 27.03.19].
22. Chuchu N, Takwoingi Y, Dinnes J, et al. Smartphone applications for triaging adults with skin lesions that are suspicious for melanoma. Cochrane Database Syst Rev 2018;12:CD013192.
23. Ouellette L, VanDePol E, Chassee T, et al. Emergency department electrocardiogram images sent through the mobile phone: feasibility and accuracy. Am J Emerg Med 2018;36:731-2.
24. Kelly A, Liu Z, Leonard S, et al. Balance in children following cochlear implantation. Cochlear Implant Int 2018;19:22-5.
25. Demaerschalk BM, Vargas JE, Channer DD, et al. Smartphone teleradiology application is successfully incorporated into a telestroke network environment. Stroke 2012;43:3098-101.
26. Bogoch II, Ame SM, Utzinger J, et al. Mobile phone microscopy for the diagnosis of soil-transmitted helminth infections: a proof-of-concept study. Am J Tropical Med Hyg 2013;88:626-9.
27. O'Reilly-Shah VN, Kitzman J, Jabaley CS, Lynde GC. Evidence for increased use of the Society of Pediatric Anesthesia Critical Events Checklist in resource-limited environments: a retrospective observational study of app data. Pediatr Anesth 2018;28:167-73.
28. Curcio A, De Rosa S, Sabatino J, et al. Clinical usefulness of a mobile application for the appropriate selection of the antiarrhythmic device in heart failure. Pacing Clin Electrophysiol 2016;39:696-702.
29. McEvoy MD, Hand WR, Stiegler MP, et al. A smartphone-based decision support tool improves test performance concerning application of the guidelines for managing regional anesthesia in the patient receiving antithrombotic or thrombolytic therapy. Obstet Anesth Dig 2016;36:197-8.
30. Mohan A, Agarwal T, Cherian TS, et al. Diagnostic ability of a smart phone app (injured tooth) in diagnosing traumatic injuries to the teeth: a multicentre analysis. Int J Paediatric Dent 2018;28:561-9.
31. Manaktala S, Claypool SR. Evaluating the impact of a computerized surveillance algorithm and decision support system on sepsis mortality. J Am Med Inf Assoc 2017;24:88-95.
32. Ajay VS, Jindal D, Roy A, et al. Development of a smartphone-enabled hypertension and diabetes mellitus management package to facilitate evidence-based care delivery in primary healthcare facilities in India: the mPower Heart Project. J Am Heart Assoc 2016;5. Available from: https://doi.org/10.1161/JAHA.116.004343.


33. Platt R, Takvorian SU, Septimus E, Hickok J, Moody J, Perlin J, et al. Cluster randomized trials in comparative effectiveness research. Med Care 2010;48:S52-7. Available from: https://doi.org/10.1097/MLR.0b013e3181dbebcf [accessed 28.02.20].
34. Kalakoti P, Hendrickson NR, Bedard NA, Pugely AJ. Opioid utilization following lumbar arthrodesis: trends and factors associated with long-term use. Spine 2018;43:1208-16.
35. Pietrantonio F, Miceli R, Rimassa L, et al. Estimating 12-week death probability in patients with refractory metastatic colorectal cancer: the Colon Life nomogram. Ann Oncol 2017;28:555-61.
36. Patterson V, Singh M, Rajbhandari H, Vishnubhatla S. Validation of a phone app for epilepsy diagnosis in India and Nepal. Seizure 2015;30:46-9.
37. Palazón-Bru A, Rizo-Baeza MM, Martínez-Segura A, et al. Screening tool to determine risk of having muscle dysmorphia symptoms in men who engage in weight training at a gym. Clin J Sport Med 2018;28:168-73.
38. Patterson V, Pant P, Gautam N, Bhandari A. A Bayesian tool for epilepsy diagnosis in the resource-poor world: development and early validation. Seizure 2014;23:567-9.
39. Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med 2000;19:453-73 [accessed 28.02.20].
40. Oberije C, Nalbantov G, Dekker A, et al. A prospective study comparing the predictions of doctors versus models for treatment outcome of lung cancer patients: a step toward individualized care and shared decision making. Radiother Oncol 2014;112:37-43.
41. Ortashi O, Virdee J, Hassan R, et al. The practice of defensive medicine among hospital doctors in the United Kingdom. BMC Med Ethics 2013;14.
42. Studdert DM, Mello MM, Sage WM, et al. Defensive medicine among high-risk specialist physicians in a volatile malpractice environment. Obstet Gynecol Surv 2005;60:718-20.
43. Khairat S, Marc D, Crosby W, Al Sanousi A. Reasons for physicians not adopting clinical decision support systems: critical analysis. JMIR Med Inform 2018;6:e24.
44. van Wijk Y, Halilaj I, van Limbergen E, et al. Decision support systems in prostate cancer treatment: an overview. Biomed Res Int 2019;2019:4961768.
45. Heus P, Damen JAAG, Pajouheshnia R, et al. Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement. BMC Med 2018;16:120.
46. Pourahmadi MR, Ebrahimi Takamjani I, Sarrafzadeh J, et al. Reliability and concurrent validity of a new iPhone goniometric application for measuring active wrist range of motion: a cross-sectional study in asymptomatic subjects. J Anat 2017;230:484-95.
47. Williams CM, Caserta AJ, Haines TP. The TiltMeter app is a novel and accurate measurement tool for the weight bearing lunge test. J Sci Med Sport 2013;16:392-5.
48. Wu H, Wang B, Zhu X, et al. A new automatic blood pressure kit auscultates for accurate reading with a smartphone. Medicine 2016;95:e4538.
49. Varshney U. Mobile health. In: Proceedings of the fourth ACM MobiHoc workshop on Pervasive wireless healthcare - MobileHealth '14. 2014.
50. Soh H, Demiris Y. Multi-reward policies for medical applications. In: Proceedings of the 13th annual conference companion on Genetic and evolutionary computation - GECCO '11. 2011.
51. Valderrama CE, Marzbanrad F, Stroux L, et al. Improving the quality of point of care diagnostics with real-time machine learning in low literacy LMIC settings. In: Proceedings of the first ACM SIGCAS conference on computing and sustainable societies (COMPASS) - COMPASS '18. 2018.
52. World Health Organization. World Health Organization quality of life-100. In: PsycTESTS dataset; 2018.

III. Clinical applications

CHAPTER 11

Artificial intelligence for pathology

Fuyong Xing, Xuhong Zhang and Toby C. Cornish

Abstract

Advances in artificial intelligence (AI), especially in deep learning, improve pathological image analysis in basic, translational, and clinical research and in routine clinical practice. Deep learning is currently the dominant technique among the best solutions for many tasks in digital pathology. This chapter provides a general overview of different applications of deep learning in pathological image analysis, such as image classification, object detection, image segmentation, stain normalization, and image superresolution. It summarizes deep learning achievements and identifies the contributions in specific tasks. In addition, it discusses open challenges and potential directions of deep learning-based pathological image computing and presents barriers to clinical adoption of AI in digital pathology.

Keywords: Artificial intelligence; deep learning; neural networks; digital pathology; image analysis; diagnosis

11.1 Introduction

Digital pathology nowadays plays an increasingly important role in basic, translational, and clinical pathology research and in routine clinical practice. With the whole-slide imaging (WSI) technique, digital pathology not only allows digital image sharing between different locations for the purposes of education, research, and/or diagnosis but also enables quantitative analyses of the entire landscape of tissue morphology.1 In particular, WSI provides a platform for the development of automated or semiautomated image analysis methods, which can improve the efficiency, objectivity, and consistency of disease characterization2–4 and thus potentially lead to early detection and targeted treatments of diseases. Computerized methods also provide reproducible measurements of pathological image characteristics for clinical follow-up in basic research and clinical practice.5 Artificial intelligence (AI), particularly machine learning (ML), has been widely applied to pathological image analysis and has provided significant support for medical research and clinical practice.6–8 Compared with nonlearning-based digital image processing, ML can infer image analysis rules from data representations and typically does not require manual algorithm adaptation to different datasets or images.9 This has greatly facilitated the applications


of AI or ML methods in digital pathology. Deep learning, a class of ML techniques, has recently attracted tremendous attention in medical imaging, including pathological image computing.10–12 Unlike traditional ML methods that rely on handcrafted features, deep learning can directly process raw image data and automatically learn multiple levels of representations for supervised or unsupervised learning tasks.13–15 Deep learning typically extracts features with a multilayer, hierarchical architecture, that is, a deep neural network, and these learned features usually provide much better performance in image analysis tasks.15,16 In digital pathology, WSI images provide very rich and complex information about tissue and disease characteristics, which requires advanced computational methods for effective image analysis. Deep learning, which learns multilevel abstract representations, can explore high-dimensional WSI data and discover complex hidden patterns or complicated relationships between image appearance and disease expression.12 Thus deep learning usually provides improved performance on various image analysis tasks compared to traditional ML methods,12,16 leading to its increasing prevalence in pathological image analysis. In addition, deep learning does not require manual feature engineering of image representations, which usually requires domain knowledge, and this has further boosted the acceptability of deep neural networks in digital pathology.17,18 Nowadays, deep learning is the dominant computational method among the best solutions for different types of pathological image computing tasks. In this chapter, we focus on recent advances of AI in digital pathology, that is, deep learning-based pathological image analysis in research and the clinic. Specifically, we first briefly explain the deep neural networks used in digital pathology, including convolutional neural networks (CNNs),14,19 fully convolutional networks (FCNs),20 generative adversarial networks (GANs),21 stacked autoencoders (SAEs),15,22 and recurrent neural networks (RNNs).23,24 Then, we introduce deep learning applications in various pathological image analysis tasks, such as image classification, object detection, image segmentation, stain normalization, and image superresolution, across different types of tissue and stain preparations. Finally, we discuss open challenges and potential future research directions of deep learning-based pathological image analysis and point out barriers to clinical adoption of AI in digital pathology.

11.2 Deep neural networks

There are mainly five types of deep neural networks currently used for pathological image analysis: CNNs, FCNs, GANs, SAEs, and RNNs. CNNs and FCNs are the dominant models used in digital pathology, and GANs have recently elicited increasing interest. SAEs and RNNs are adopted in some literature but are not the major methods for pathological image computing. Although deep belief networks (DBNs)25 and deep Boltzmann machines (DBMs)26 have been introduced for visual tasks in computer vision, there is very scarce literature reporting applications of DBNs and DBMs in digital pathology. Thus we exclude these two models from the following discussion.

11.2.1 Convolutional neural networks

A CNN is a feedforward neural network that adopts convolutional operations in one or more layers of the network.14,19 Compared with fully connected artificial neural networks,


FIGURE 11.1 An example CNN architecture. C, M, and F denote convolutional, max-pooling, and fully connected layers, respectively. An image patch is classified as a nucleus patch if its center pixel is located inside the nucleus; otherwise it is not. CNN, Convolutional neural network.

CNNs use local connections and share weights within the convolutional layers. A CNN is typically composed of a set of stacked convolution-pooling layers followed by multiple fully connected layers, as shown in Fig. 11.1. Given input images/feature maps, convolutional layers calculate output feature maps with several convolutional kernels, one per feature map, and nonlinear activation functions. Currently, CNNs often choose the rectified linear unit (ReLU),27 leaky ReLU,28 or exponential linear unit29 as the activation, which have shown better performance than the sigmoid or hyperbolic tangent functions in many applications. Pooling layers summarize the responses in locally neighboring regions30 and reduce the dimension of feature maps. Max-pooling or average pooling is commonly used in CNNs, and neither learns parameters. Instead of using pooling operations, some recent CNN architectures exploit strided convolutional layers,31 which allow for parameter learning. Fully connected layers learn higher levels of feature abstraction, which are more specific to concrete tasks. Unlike convolutional layers, where each unit is connected with a local region in the input feature map, each unit in a fully connected layer is linked to all units in the previous layer. The standard CNN architecture has been widely applied to object recognition and has achieved excellent performance,27,32 and the AI community has recently introduced several new architectures that enable inception module modeling33 and residual learning.34 By using an appropriate loss function in the last fully connected layer, a CNN can be trained with the backpropagation algorithm19 for classification, regression, or other applications. A cross-entropy loss or a hinge loss35 is typically employed for classification, while a mean squared error or a mean absolute error is often exploited for regression. Because deep CNN training usually requires a large amount of annotated data, which might not be available in some scenarios such as medical imaging, it is very common to conduct data augmentation for model training, such as image rotation, random cropping, and color shifting. In addition, it is not unusual to fine-tune CNN models trained on the large-scale ImageNet dataset36 to specific, small target datasets. In order to avoid model overfitting, the dropout technique is also often applied to CNN training.37
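As a minimal illustration (not taken from any specific system described in this chapter), the following PyTorch-style sketch mirrors the patch-classification setup of Fig. 11.1: stacked convolution/max-pooling layers, fully connected layers with ReLU activations and dropout, trained with a cross-entropy loss. The layer sizes, 32 × 32 patch size, and two-class (nucleus vs. background) output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    """Toy CNN for nucleus vs. non-nucleus patch classification (cf. Fig. 11.1)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                       # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                       # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128), nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),                     # dropout to reduce overfitting
            nn.Linear(128, num_classes),           # logits for the cross-entropy loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One illustrative training step on random stand-in data (assumed 32x32 RGB patches).
model = PatchCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
patches = torch.randn(8, 3, 32, 32)        # stand-in for augmented image patches
labels = torch.randint(0, 2, (8,))          # 1 = nucleus patch, 0 = background
loss = criterion(model(patches), labels)
loss.backward()
optimizer.step()
```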

11.2.2 Fully convolutional networks

An FCN, a variant of the CNN, is a deep network that contains no fully connected layers.20 In fact, a standard CNN can be converted into an FCN by replacing each fully connected layer with a convolutional layer whose kernel size equals the dimension of the feature map fed into that layer. In this way, FCNs do not require fixed-size image input or a sliding-window scan for dense prediction, which is typically used in CNNs and is


FIGURE 11.2 An example FCN (variant) architecture for image segmentation. A pixel is classified as a tumor pixel if it is located in a tumor region; otherwise it is not. FCN, Fully convolutional network.

computationally expensive for large images. In addition, by introducing one or multiple upsampling or transposed convolutional layers,31 FCNs can perform pixel-to-pixel learning for direct pixel-wise prediction, which is very efficient for applications such as image segmentation, as shown in Fig. 11.2. With task-specific losses, FCNs can be trained using the standard backpropagation algorithm.19 U-Net,38 which is built on FCNs, is a very popular deep network in medical imaging, including pathological image analysis. It is an encoder–decoder, U-shaped deep architecture. Similar to standard CNNs, the encoder consists of stacked convolution-pooling layers, aiming to learn feature representations from input images. The decoder contains a set of upsampling (e.g., bilinear interpolation) and convolutional layers, symmetrically mapping the learned representations back to the input space for pixel-wise prediction. One important characteristic of U-Net is the introduction of long-range skip connections between the encoder and decoder such that high-resolution (HR) information in the encoder can be effectively used for more precise object localization. By applying proper cropping or padding to the output feature maps in the decoder, U-Net can directly produce a prediction map that has the same size as the input image. In this scenario, U-Net allows fast inference for arbitrarily sized images. This is very important for digital pathology, because WSI images typically have tens of millions of pixels or more.
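The sketch below shows the encoder–decoder-with-skip idea in its smallest form: one pooling stage, one bilinear upsampling stage, and a single long-range skip connection concatenated before the decoder convolutions. The channel counts, depth, and the choice of bilinear upsampling (rather than a transposed convolution) are assumptions for illustration, not the architecture of any particular published model.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, as used in each stage of a U-Net-like network."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """One-level encoder-decoder with a long-range skip connection."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc = conv_block(in_ch, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = conv_block(64 + 32, 32)          # skip features are concatenated
        self.head = nn.Conv2d(32, num_classes, 1)   # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)                      # high-resolution encoder features
        b = self.bottleneck(self.pool(e))    # coarse, low-resolution features
        d = self.dec(torch.cat([self.up(b), e], dim=1))  # long-range skip connection
        return self.head(d)                  # same spatial size as the input

# Dense (pixel-wise) prediction for an input whose side length is divisible by 2.
logits = TinyUNet()(torch.randn(1, 3, 128, 128))   # -> shape (1, 2, 128, 128)
```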

11.2.3 Generative adversarial networks

A GAN is an unsupervised learning model that simultaneously trains two deep neural networks: a generator and a discriminator.21 Given a set of training data, the generator learns to capture the distribution of the real training data and to generate fake samples that have the same statistics as the training data, while the discriminator learns to differentiate fake samples from real data, that is, not to be fooled by the generator. These two networks are integrated into a two-player minimax game for model training. More specifically, let $p_x(x)$ and $p_z(z)$ denote the distributions of the real data and of the random noise variables, respectively, and let $D(x)$ estimate the probability that $x$ comes from the real training data rather than from the generator $G$. The GAN trains the generator $G$ by minimizing the probability of $D$ making a correct prediction on fake samples, and meanwhile trains the discriminator $D$ by maximizing the probability of $D$ assigning correct labels to both real and fake samples. The complete objective function can be written as

$$\min_{G}\max_{D}\; \mathbb{E}_{x\sim p_x(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big], \tag{11.1}$$

where $\mathbb{E}$ denotes the expectation operation. In practice, it is common to train $G$ by maximizing $\log(D(G(z)))$ instead of minimizing $\log(1 - D(G(z)))$, which is easy to saturate in the early learning stage.21
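A minimal sketch of one training iteration for the objective in Eq. (11.1) is given below, using the non-saturating generator loss mentioned above. The tiny multilayer-perceptron generator and discriminator, the data dimensions, and the learning rates are illustrative assumptions; real GANs for pathology images use convolutional or U-Net-like networks.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, data_dim)               # stand-in for real training samples
z = torch.randn(32, latent_dim)                # random noise z ~ p_z(z)

# Discriminator step: maximize log D(x) + log(1 - D(G(z))).
opt_D.zero_grad()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward()
opt_D.step()

# Generator step: the non-saturating form maximizes log D(G(z)),
# i.e., minimizes -log D(G(z)), instead of minimizing log(1 - D(G(z))).
opt_G.zero_grad()
g_loss = bce(D(G(z)), torch.ones(32, 1))
g_loss.backward()
opt_G.step()
```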


FIGURE 11.3 An example CycleGAN architecture for image translation between the source and target domains. The cycle loss ensures that the reconstructed source image is close to the original source image.43 Here the source and target domains represent Ki67 IHC stained image data from two separate institutes. For clarity, only one cycle is drawn here. IHC, Immunohistochemistry.

GANs pave a new path for deep unsupervised learning. Since the advent of the first GAN, many variants have been proposed to produce better data synthesis. DCGAN39 introduces convolutional layers into the generator and discriminator to generate more realistic samples; conditional GAN40 shows how to integrate prior information (e.g., class labels) into a conventional GAN to assist with data generation; ProGAN41 describes a progressive growing strategy to stabilize GAN training for HR image data; pix2pix42 investigates conditional GANs as a generic solution for image-to-image translation instead of image synthesis from random noise; and CycleGAN43 incorporates a cycle consistency loss into GAN training to enable bidirectional mappings for unpaired image-to-image translation. Among these GAN variants, particularly those for image-to-image translation applications, the generator is usually a deconvolutional neural network or a U-Net-like architecture, and the discriminator is a standard CNN. The applications of GANs and their variants have increased rapidly in medical imaging and have recently emerged in digital pathology,44 especially for image stain normalization. Fig. 11.3 shows a CycleGAN-based image conversion method for stain normalization of Ki67 immunohistochemistry (IHC) stained pancreatic neuroendocrine tumor images.

11.2.4 Stacked autoencoders

An autoencoder is an unsupervised learning neural network that consists of an input layer, a hidden layer, and an output layer.15,22 The output layer aims to reconstruct the input from the hidden layer, which typically has a smaller number of units than the input layer and learns a compressed representation of the input data. Alternatively, an autoencoder may be viewed as consisting of an encoder, which maps the input to an embedded feature representation, and a decoder, which learns to produce a reconstruction based on that embedded representation (see the left of Fig. 11.4). In order to capture useful features of the input data, constraints such as limiting the number of hidden units are usually placed on


FIGURE 11.4 Left: an example autoencoder. Right: an example SAE with fine-tuning. SAE, Stacked autoencoder.

the network. An autoencoder is learned by minimizing a reconstruction error, such as a mean squared error or a cross-entropy loss.45 Autoencoders can be stacked to form a deep, multilayer network, that is, an SAE, which is able to model complex data structure, with each hidden layer corresponding to a level of feature abstraction. An SAE can be trained in an unsupervised, layer-wise manner, that is, training one layer at a time and using the feature representations from lower layers as the input of higher layers. After unsupervised training of an SAE, it can be combined with an additional layer, such as a softmax layer, and further be jointly fine-tuned with backpropagation in a supervised manner for discriminative learning tasks (see the right of Fig. 11.4). In addition to traditional autoencoders, regularized autoencoders are also used to constrain the representations.22 Sparse autoencoders introduce a sparsity penalty into the network to obtain sparse representations for specific tasks,46–48 denoising autoencoders learn to reconstruct the clean input from corresponding corrupted data,45,49 and contractive autoencoders explicitly regularize the derivatives of the encoder function with a Frobenius norm.50
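A minimal sketch of the greedy layer-wise procedure described above, followed by supervised fine-tuning with an added classification layer, is shown below. The layer dimensions, number of training iterations, sigmoid activations, and two-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(256, 100)                   # stand-in for unlabeled feature vectors
y = torch.randint(0, 2, (256,))             # labels used only during fine-tuning

dims = [100, 64, 32]                        # input -> hidden layer 1 -> hidden layer 2
encoders, inputs = [], x
for d_in, d_out in zip(dims[:-1], dims[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Linear(d_out, d_in)            # reconstructs this layer's own input
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(100):                     # unsupervised, layer-wise training
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(inputs)), inputs)
        loss.backward()
        opt.step()
    inputs = enc(inputs).detach()            # feed the learned codes to the next layer
    encoders.append(enc)

# Stack the pretrained encoders, add a classification (softmax/cross-entropy) layer,
# and jointly fine-tune the whole network with backpropagation.
sae = nn.Sequential(*encoders, nn.Linear(dims[-1], 2))
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(sae(x), y)
loss.backward()
opt.step()
```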

11.2.5 Recurrent neural networks

An RNN is a type of neural network for sequential data processing,23,24 such as speech recognition and natural language processing. RNNs use hidden states to memorize the patterns of sequential data. For a vanilla RNN, the current hidden state is calculated using the current input and the previous hidden state, which implicitly contains information about past elements in the input sequence. Afterward, the output at the current time step is computed by a dot multiplication between the current state and the learned network parameters, followed by a nonlinear activation. These parameters are shared across all time steps such that RNNs can be applied to sequences of different lengths. This parameter sharing is particularly critical when a specific piece of information can appear in different locations in the sequence.15 In fact, an RNN can be unfolded in time and viewed as a deep feedforward neural network in which all the layers share the same parameters.14 In this scenario, RNNs can be trained using backpropagation in the time domain, that is, backpropagation through time (BPTT).51,52 Although BPTT can be applied to vanilla RNN training, it is difficult to learn long-term dependencies in the sequence.53 This problem can be addressed by augmenting the network with an explicit memory,14 such as the long short-term memory (LSTM).54 Another effective sequence model used in real applications is the network based on gated recurrent units,55


FIGURE 11.5 The four different sweeping directions for a two-dimensional RNN. RNN, Recurrent neural network.

where paths through time are created such that the derivatives do not vanish or explode during network training.15 RNNs designed for sequential data are not directly applicable to two-dimensional (2D) images, because there are no explicit time series in image data. To address this issue, multidimensional RNNs have been introduced to process data with more than one spatiotemporal dimension.56 The basic idea is to replace the standard single recurrent connections with multidimensional connections such that the network can produce an effective representation of the surrounding context. In the forward pass, each unit at a time step receives an external input and its own previous activations along all dimensions.56 For 2D images, one possible approach is to split the image into a set of nonoverlapping patches and organize them into a sequence. At each time step the current patch receives information from both the row and column directions. Ideally, each patch is surrounded by "previous" and "next" neighborhoods, leading to a cyclic graph.12,57 A multidimensional RNN can use a separate RNN to model the dependencies for each sweeping direction: from top-left/top-right/bottom-right/bottom-left to bottom-right/bottom-left/top-left/top-right,58 as shown in Fig. 11.5. For each time step the network calculates the hidden states of each individual RNN and then collects all the hidden states to generate whole-context information. With the integrated hidden states the network can compute the output in a similar way to that used for a standard single RNN.
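As a simplified sketch of the four-direction idea in Fig. 11.5 (an assumption-laden approximation, not a full two-dimensional recurrence: each direction here is a one-dimensional GRU run over a row-major patch sequence rather than a unit receiving both row and column predecessors), the following code splits an image into a grid of nonoverlapping patches, sweeps the grid from each of the four corners with a separate GRU, and concatenates the direction-wise hidden states into per-patch whole-context features. The image size, patch size, and hidden dimension are illustrative.

```python
import torch
import torch.nn as nn

B, C, H, W, P = 1, 3, 64, 64, 16             # batch, channels, image size, patch size
img = torch.randn(B, C, H, W)
gh, gw = H // P, W // P                       # 4 x 4 grid of nonoverlapping patches
patches = img.unfold(2, P, P).unfold(3, P, P)             # (B, C, gh, gw, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, gh, gw, C * P * P)

hidden = 32
rnns = nn.ModuleList([nn.GRU(C * P * P, hidden, batch_first=True) for _ in range(4)])
flips = [(), (1,), (2,), (1, 2)]              # flip rows and/or columns per direction

states = []
for rnn, dims in zip(rnns, flips):
    x = torch.flip(patches, dims) if dims else patches    # choose the sweep start corner
    out, _ = rnn(x.reshape(B, gh * gw, -1))   # hidden state at every patch position
    out = out.reshape(B, gh, gw, hidden)
    states.append(torch.flip(out, dims) if dims else out) # map back to grid order

context = torch.cat(states, dim=-1)           # per-patch whole-context features
print(context.shape)                          # torch.Size([1, 4, 4, 128])
```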

11.3 Deep learning in pathological image analysis

11.3.1 Image classification

Given an input, image classification aims to predict a class label, for example, benign or malignant. This might be one of the first applications of deep learning in medical imaging. Because deep neural networks have demonstrated great success in natural image categorization and object recognition,14 it is straightforward to apply deep learning algorithms to image labeling in the medical domain. For image classification applications in digital pathology, most studies perform image-level labeling of individual images cropped from WSI slides or labeling of holistic WSI images. Some studies focus on object-level image classification, which labels an image that contains a single object (e.g., a cell) at its center, that is, categorizing different types of objects without object localization. In either image-level


or object-level classification, CNNs are currently the most popular method, although other deep architectures such as SAEs are used in some of the literature.

11.3.1.1 Image-level classification

In order to deal with different fields of view in histopathology images, three separate CNNs that take image patches of different sizes as input are used to classify prostate cancer images for Gleason grading.59 A logistic regression model is used to fuse the predictions from the three CNNs for final class label determination. This method conducts data augmentation in both image and feature space to facilitate model training. Li et al.60 introduce a deep multiple instance learning (MIL) method with adaptive attention for histopathology image classification of colon and breast cancers. It first exploits a CNN to produce instance representations and then adopts a fully connected network to generate attention weights, which are used to measure the importance of each instance for bag classification. This method also presents a hard negative instance mining strategy based on the attention weights for better model training. Couture et al.61 train a CNN-based MIL model that uses predictions from small regions of the image to produce an image-level classification. By defining a quantile function for feature pooling in the CNN, it is able to aggregate instance predictions generated by an FCN for more complete representation learning. With a cross-entropy loss, the model is learned end-to-end with the standard backpropagation algorithm. This method is applied to breast cancer tissue microarray image classification and has produced very promising results. Ioannou et al.62 discuss a fast deep learning-based pipeline for image patch classification on prostate cancer WSIs. It accelerates the stain normalization process63 by using a set of optimizations, including look-up searching, multithreaded CPU execution, and Monte Carlo sampling. In addition, it designs a multithreaded I/O module for fast nonvolatile memory storage and integrates stain normalization and I/O loading/storing into a single pipeline to reduce data transfer during image analysis. Gu et al.64 discuss a deep weakly supervised representation learning method for classification of endomicroscopy image sequences. Given a set of consecutive frames, the method uses a CNN model to extract frame-based features and learns global diagnosis and local frame-based annotations with two different losses. In order to preserve the consistency between the global and local predictions, it adds an additional semantic exclusivity loss to the model learning, which uses only global diagnosis labels. It is usually time-consuming to collect large-scale datasets for supervised deep model training (e.g., CNNs) in digital pathology, because data annotation is extremely expensive. To address this problem, Huang et al.65 introduce an unsupervised domain adaptation (UDA) image classification method to transfer knowledge from an existing dataset (i.e., the source) that has data annotations available to another dataset (i.e., the target) that has no labeled data. By performing convolutional kernel adaptation between the source and target CNNs, the source CNN transfers learned experience to the target CNN, which can be trained without target data annotations, for classification of epithelium and stroma histopathology images.
Similarly, Zhang et al.66 present another deep UDA method that can adapt deep image classification models trained on the WSI data domain to the conventional microscopy domain. The method reduces domain discrepancies with adversarial learning and entropy minimization, while addressing the data imbalance problem with


sample reweighting. Without domain adaptation, Shi et al.67 propose a self-ensembling-based semisupervised deep network for image classification in lung and breast cancers. Specifically, it adopts an exponential moving average to create ensemble targets for feature and label predictions and then exploits a graph to map labeled images of each class into a cluster to boost the strength of ensemble predictions. This method learns a graph temporal ensembling-based CNN in a semisupervised fashion, by leveraging the semantic information of annotated images and exploring the information hidden in unlabeled data. Alternatively, Qi et al.68 present a CNN-based active learning approach for classification of breast cancer histopathological images. Based on the AlexNet27 pretrained on ImageNet,36 this method iteratively fine-tunes AlexNet with a dynamically updated training set, which is achieved by using entropy-based and confidence-boosting selection strategies. In this scenario, the framework is able to significantly reduce annotation costs without sacrificing classification accuracy. Instead of classifying image patches cropped from WSI slides, deep learning has also been applied to entire WSI image categorization. Lee and Paeng69 introduce a deep learning-based framework for metastasis detection and pN-stage classification in breast cancer. The framework adopts a ResNet34 to detect metastasis regions on WSI images, considering balanced patch sampling, patch augmentation, stain color augmentation, two-stage CNN fine-tuning, and overlapped tile predictions. Next, it defines a set of morphological and geometrical features based on the CNN-predicted maps and trains a random forest classifier for WSI classification, which is further aggregated to produce patient-level classification. Sun et al.70 present a deep CNN-based method for liver cancer WSI image classification. It first uses a pretrained ResNet34 to extract patch-level features and then generates a single representation for each WSI with feature pooling, sorting, and selection. Next, it trains a multilayer perceptron network with these global WSI representations for image classification. Hou et al.71 train a deep CNN to produce patch-level predictions and learn a multiclass logistic regression to aggregate patch-level predictions for image-level WSI classification. A new expectation-maximization algorithm is introduced to automatically select discriminative patches for CNN training. This decision fusion model is shown to outperform patch-level CNNs with max-pooling and voting in classification of glioma and non-small-cell lung carcinoma WSI images, and the experimental results are close to the interobserver agreement between pathologists. Wang et al.72 introduce a deep MIL framework for hematoxylin and eosin (H&E) stained gastric WSI image classification. Specifically, the framework first extracts the discriminative instances with an FCN localization network, which is built on the Inception-v4 architecture.73 Then, it designs a recalibrated MIL network, which is composed of local–global feature fusion, instance recalibration, and multi-instance pooling, to predict image-level labels. This method is able to capture instance-wise dependencies and take into consideration the instance impacts for image classification.

11.3.1.2 Object-level classification

In this section, object-level classification means classifying a single object roughly centered in an image.
One of the early studies adopts a LeNet-5-like CNN19 to differentiate human epithelial-2 (HEp-2) cells on indirect immunofluorescence microscopy images.74 This work experimentally analyzes the effects of different factors on


classification, such as model hyperparameters, data augmentation, and transfer learning. Later, Phan et al.75 use a pretrained CNN76 as a fixed feature extractor to generate cell image representations for HEp-2 cell classification. It first trains a support vector machine (SVM) classifier with selected low-layer CNN features to differentiate positive intensity images from intermediate intensity ones and then learns two additional separate SVMs with selected high-layer CNN features to classify cell images into six categories. Meng et al.77 present and compare a deep CNN with other traditional ML models, such as k-nearest neighbors (KNN) and SVMs, on label-free cell classification of bright-field microscopy images. The experiments show that the CNN performs very well in identifying multiple types of cells and benefits most from increasing data as compared to KNN and SVMs. Duggal et al.78 introduce a stain deconvolution layer into standard CNNs for white blood cell classification (i.e., malignant vs benign) of bone marrow microscopy images. This layer transforms input RGB images to an optical density space and learns optimal stain basis vectors of the cells of interest for the class labels. The experiments show that this layer can produce effective representations for image classification. Instead of training deep models with limited target data from scratch, Zhang et al.79 fine-tune a CNN model trained on ImageNet36 toward specific cervical cell image datasets for abnormal and normal cell classification. During model testing, it uses multicropping27 and random-view aggregation80 to generate multiple predictions and takes the average as the final prediction for a single input image. In order to reduce the cost of data annotation, Su et al.81 present a deep semisupervised learning method for nucleus classification. Specifically, it integrates a label propagation (LP) step into a mean teacher self-ensembling method82 such that class information can be iteratively propagated from labeled data to unlabeled samples; meanwhile, it builds a graph based on the LP predictions and defines a Siamese loss83 to enforce local and global consistencies during model learning. This method has been applied to nucleus classification in both H&E and Ki67 IHC microscopy images, and it can obtain better results than some other semisupervised methods. Shao et al.84 introduce a deep active learning method for nucleus classification in colon cancer histopathological images. It first uses a pairwise-constrained CNN, which is able to preserve the distribution of intraclass and interclass nuclei, to produce predictions for unlabeled data. Then, it selects the most informative samples for expert annotation and continues to update the learned CNN. Compared with other competitors, this method can deliver the desired nucleus classification performance with less data annotation. Recently, Wu et al.85 present a new stochastic gradient descent algorithm to regularize CNNs for glomerulus classification of pathological images, while simultaneously addressing the issues of model overfitting and privacy leakage of patient information. Technically, this algorithm dynamically controls the noisy update within each iteration. The method is evaluated on a large-scale pathological image dataset, demonstrating its effectiveness in reducing the overfitting risk and preventing privacy information leakage.
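Several of the approaches above start from an ImageNet-pretrained backbone rather than training from scratch. The sketch below shows the general recipe under stated assumptions (a torchvision ResNet-18 backbone, frozen early layers, a new two-class head, and random stand-in data); it is not the pipeline of any specific paper cited here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone and replace the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                      # freeze the generic, early features
model.fc = nn.Linear(model.fc.in_features, 2)    # new head: e.g., abnormal vs. normal cell

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One fine-tuning step on stand-in data (224x224 RGB crops, as ImageNet models expect).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```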

11.3.2 Object detection

Unlike image classification, which determines only image labels, object detection aims to locate individual objects and meanwhile determine their categories.86,87 Object detection can be grouped into two classes: detection of specific types of objects and detection of broad categories


of objects. The first group detects a specific class of object, while the second detects instances of multiple predefined categories.88,89 Nucleus or cell detection is a fundamental and important task in pathological image analysis,5 which can support many subsequent studies such as image segmentation, object tracking, and disease grading. Currently, object detection in digital pathology may be broadly classified into three groups: (1) detect a particular category of nucleus/cell, such as mitotic nuclei or white blood cells, (2) detect all the nuclei or cells without category labeling, and (3) detect individual nuclei/cells with category labeling, such as differentiation of epithelial, inflammatory, and fibroblast nuclei.

11.3.2.1 Detection of particular types of objects

One of the earliest deep learning-based object detection applications in medical imaging applies a standard CNN to mitosis detection in H&E stained breast cancer histology images.90 This method is trained with small image patches and detects mitotic nuclei by scanning testing images with a fixed-size sliding window. This sliding-window-based CNN strategy is a very straightforward application of deep learning to object detection, and it can also be found in other literature on mitosis detection.91–94 However, CNN inference with a sliding window is computationally expensive for high-dimensional images, and thus some studies apply CNNs only to patches of mitosis candidates, which are generated by using other relatively efficient methods, such as simple digital image processing techniques,95 region proposal networks,96 and FCNs.97 Recently, FCNs have been directly applied to mitotic nucleus detection in a pixel-to-pixel mapping manner, which requires only simple postprocessing of the FCNs' prediction maps,98,99 and thus FCNs are typically faster than those CNN-based approaches. In addition to mitosis localization, deep models have also been applied to detection of other specific objects in pathological images. Combined with a Voronoi diagram of clusters,100 a deep CNN101 is employed to identify individual neutrophils in H&E stained inflammatory bowel disease images. A recent study102 has investigated the effectiveness of different deep neural networks, based on FCN,20 U-Net,38 YOLO,103 and locality-constrained CNN regression,104 for lymphocyte detection in IHC stained histopathology images of breast, colon, and prostate cancer tissues. It found that U-Net provides the best performance in terms of F1 score and agreement with manual evaluation. By iteratively including the most representative negative samples in the training set, a standard CNN is trained to detect circulating tumor cells in clinical blood images and produces promising detection performance.105

11.3.2.2 Detection of objects without category labeling

By viewing different classes of nuclei/cells as a single, generic category of object, many of the deep models for detection of a particular type of object mentioned previously can be exploited to detect all nuclei/cells with binary classification, that is, nuclei/cells versus nonnuclei/noncells. For instance, multiple two-class CNNs106–108 are used to locate individual nuclei in H&E stained breast cancer, brain tumor, skeletal muscle, and Ki67 IHC stained pancreatic neuroendocrine tumor images. By converting fully connected layers into convolutional layers, a computationally efficient CNN109 is applied to nucleus


detection for area measurement in breast cancer images; by incorporating sparse kernels into a standard CNN and using parallel computing, fast nucleus detection can be achieved in WSI histopathology images that exhibit billions of pixels.110 In addition to CNN classification for nucleus detection, deep CNN regression has been proposed to regress the locations of nuclei/cells by placing a specific regressor in the last layer of a standard CNN, and this has shown better object detection performance than CNN classification approaches.111,112 For further improvement, a regression CNN recently incorporates shape prior information of nuclei into the learning process for nucleus detection in colon cancer histology images.113 Although CNNs and their variants are the dominant techniques for object detection in digital pathology, SAEs have been reported to detect nuclei in breast cancer images,114 in which the networks are trained in an unsupervised manner and then fine-tuned with target labels for nucleus and nonnucleus patch classification. Similarly, another sparse SAE exploits this pretraining-followed-by-fine-tuning strategy to detect bone marrow hematopoietic stem cells and is combined with a curvature Gaussian model in the framework.115 FCNs, which allow direct pixel-to-pixel mappings, have recently drawn considerable interest in nucleus/cell detection, which can be formulated as a pixel-wise prediction problem and naturally fits the FCN architecture. Specifically, a fully convolutional regression network has been applied to efficient detection of nuclei in various microscopy images with different tissue preparations.116 This network inserts residual blocks into a U-Net-like architecture and designs a structured regression in the last layer for robust nucleus detection. This method is also used for nucleus detection in a domain adaptation setting,117 where the network is trained with pseudo-labels of nuclei in the target dataset and produces promising performance. Later, another pixel-to-pixel network with additional multilevel information aggregation layers has been proposed to detect nuclei in H&E stained histopathology images of more than 20 types of organ tissues.118 By defining a Gaussian density map for each annotated cell, a regression FCN learns to locate individual cells by minimizing a simple mean squared error119 and has shown impressive cell detection results.

11.3.2.3 Detection of objects with category labeling

In digital pathology, it is very important to locate and classify different categories of nuclei or cells, because this is usually a prerequisite for exploration of cell distribution and characterization of tissue heterogeneity.12 However, there is a paucity of literature on deep learning-based nucleus/cell detection with category labeling, especially on single-stage deep detection models. A straightforward method is to leverage two separate, sequential neural networks, one for detection and the other for classification. This two-stage deep learning framework has been applied to nucleus classification in H&E stained histology images of colorectal adenocarcinomas,104 where a locality-constrained CNN locates individual nuclei and then another standard CNN classifies the detected nuclei into four classes. Similarly, Hagos et al.120 exploit a U-Net-like network that incorporates a cell counter subnetwork to detect all the cells and then use two sequential VGGNet-based CNNs32 to differentiate five types of cells in breast cancer multiplex IHC images.
In order to ensure efficient data annotation, a deep active learning method with uncertainty measurement121 is introduced to identify different types of red blood cells in bright-field microscopy images. Based on the Faster R-CNN detector,122 this method


interactively selects the most relevant samples for expert annotation and has shown improved cell detection accuracy with few training images. Compared with the multistage object detectors discussed previously, single-stage methods are typically faster and potentially reduce the variability of image analysis. Huang et al.123 introduce a multiclass CNN, which applies a Gabor modulation to convolutional kernels, to perform pixel-wise prediction for classification of different blood cells. Song et al.124 present a synchronized SAE, which integrates a curve-support Gaussian model into the network, to conduct simultaneous cell localization and classification in bone marrow trephine histology images. These two methods need sliding-window scanning for dense prediction, which is inefficient for large images. To address this issue, CNNs with d-regularly sparse kernels,125 which can eliminate the redundant convolutional operations in the sliding-window-based method, are introduced to conduct fast model inference126,127 for efficient cell classification in lung cancer histology images from The Cancer Genome Atlas.128 Recently, Zhou et al.129 introduce a unified FCN framework for simultaneous nucleus localization and class labeling in colon cancer images, and the framework uses the object localization to constrain the fine-grained object classification. Xing et al.130 propose a novel U-Net-like network for single-stage nucleus recognition in Ki67 IHC stained pancreatic neuroendocrine tumor images. This method consists of two branches, one for nucleus identification and the other for extraction of regions of interest (ROIs), which can assist nucleus recognition via joint task learning.
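The density-map formulation mentioned in Section 11.3.2.2 can be sketched in a few lines: point annotations are converted into a Gaussian density map, any dense-prediction network regresses that map with a mean squared error, and local maxima of the prediction give cell locations at inference time. The kernel width, image size, and the placeholder prediction tensor below are assumptions for illustration.

```python
import math
import torch
import torch.nn.functional as F

def gaussian_density_map(points, height, width, sigma=3.0):
    """Place a small Gaussian at each annotated cell center given as (row, col)."""
    ys = torch.arange(height).float().view(-1, 1)
    xs = torch.arange(width).float().view(1, -1)
    density = torch.zeros(height, width)
    for r, c in points:
        density += torch.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
    return density / (2 * math.pi * sigma ** 2)

# Toy example: three annotated cells in a 64x64 image crop.
points = [(10, 12), (30, 40), (50, 20)]
target = gaussian_density_map(points, 64, 64)

# Any dense-prediction network (e.g., an FCN or U-Net) can regress this map;
# a zero tensor stands in for the network output here.
pred = torch.zeros(1, 1, 64, 64, requires_grad=True)
loss = F.mse_loss(pred, target.view(1, 1, 64, 64))
loss.backward()
# At inference, local maxima of the predicted density map give the cell locations.
```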

11.3.3 Image segmentation

Image segmentation is the process of assigning each pixel a label such that pixels with the same label share certain attributes. This process of grouping similar pixels into summary representations can facilitate subsequent image processing, such as feature extraction and visual recognition. Image segmentation can be modeled as a pixel-wise classification problem, either grouping pixels with semantic labels (i.e., semantic segmentation) or partitioning individual objects with boundary delineation (i.e., instance segmentation).131 Image segmentation might be one of the most popular applications of deep learning in digital pathology and microscopy image analysis, serving as a basis for many further studies,5,132 such as cellular morphology computation and neural circuit reconstruction. CNNs and FCNs are currently the dominant deep neural networks used for pathological image segmentation. Standard CNNs typically label each image pixel in a sliding-window manner, while FCNs produce dense prediction via a direct pixel-to-pixel mapping. Currently, the literature on deep learning-based pathological image segmentation might be classified into three categories: (1) nucleus or cell segmentation, which locates and delineates the boundaries of individual nuclei/cells in images, (2) gland segmentation, which separates individual glands from the image background, and (3) segmentation of other biological structures or tissues.

11.3.3.1 Nucleus/cell segmentation

Nucleus/cell segmentation is a fundamental yet challenging task in microscopy and digital pathology image analysis.133 There are currently a large number of publications


focusing on deep learning-based nucleus/cell segmentation. Oda et al.134 incorporate an additional boundary-enhanced decoder into a U-Net-like architecture for ganglion cell segmentation on H&E stained pediatric intestine histopathological images. This double-decoder network concatenates feature maps from the boundary-enhanced decoder with those of the segmentation decoder and adaptively adjusts the weights in the loss function for weak cell boundaries. Zhao and Yin135 exploit pyramid-based FCNs to segment cells in phase-contrast microscopy images. High-level FCNs produce coarse segmentation masks, aiming to address the issue of low contrast between cells and background; low-level FCNs focus on cell details so as to deal with irregular cell shapes. The predictions from high- and low-level FCNs are aggregated in a cascaded refinement manner to produce the final precise segmentation masks. To reduce the cost of data labeling, Yoo et al.136 present a weakly supervised nucleus segmentation method using only point annotations instead of pixel-wise labeling. In addition to using a ResNet-based feature pyramid network137 for segmentation, this method also introduces a shallow CNN model to extract nucleus edges and provide supervisory information to the segmentation network. The auxiliary CNN is active during model training and is discarded during inference. These methods are not specifically designed for instance segmentation and thus might have difficulty in separating overlapped nuclei/cells. For touching cell segmentation, Ronneberger et al.38 introduce the first U-Net model with a weighted cross-entropy loss, which enforces the network to pay more attention to the separation borders between touching cells. Yi et al.138 present a deep learning framework combining a single-shot detector (SSD)139 and a U-Net segmentor. Specifically, it introduces a feature fusion module and an attention module into the SSD to facilitate cell localization. With the bounding boxes generated by the SSD, it uses a U-Net that shares the backbone layers with the SSD to aggregate multilevel feature maps for individual cell segmentation. This framework is evaluated on different types of microscopy image data132 and provides fairly good instance segmentation performance. To effectively handle variations in size, shape, and viewpoint of cells, Zhang et al.140 introduce a U-Net-like network with deformable convolution layers141 for red blood cell segmentation in sickle cell disease microscopy images. Unlike conventional convolutional layers, deformable convolution enables the network to learn robust representations with adaptive receptive fields. Qu et al.142 present a full-resolution FCN for nucleus segmentation. Unlike standard encoder–decoder architectures, this network does not leverage any down-sampling or upsampling layers but uses dilated convolution143 in dense blocks144 for feature learning. In addition, it introduces a variance-constrained cross-entropy loss to encourage the network to consider the spatial relationship among pixels in the same object. Mahmood et al.145 describe a GAN-based method to segment individual nuclei in multiorgan histopathological images.146 To tackle the problem of limited data annotation, it first uses a CycleGAN43 to synthesize perfectly annotated images by adding realism to randomly generated polygon masks.
Then, it designs a conditional GAN with spectral normalization and gradient penalty to directly map histopathological images to corresponding segmentation masks, which can effectively handle touching or overlapped nuclei. There are other reports for instance segmentation of nuclei/cells. Kumar et al.146 exploit a three-class CNN to locate individual nuclei and then employ a region growing algorithm to find nucleus boundaries for final segmentation in histopathological images. Luna


et al.147 use a U-Net to generate initial segmentation results and design a Siamese neural network to separate adjacent nuclei. The Siamese network learns the affinity of neighboring nuclei in training data and determines whether two neighboring instances belong to a single nucleus. For overlapped cervical cell segmentation in microscopy images, Song et al.148 adopt a multiscale CNN to separate nuclei and cytoplasm from the image background and then use energy-based discrete labeling and multitemplate deformable modeling to segment individual cells. Instead of conducting pixel-wise classification, Naylor et al.149 apply a distance map-based regression loss to FCN learning for touching nucleus segmentation in pathological images.

11.3.3.2 Gland segmentation

Glands are critical biological structures in most organ systems, and the morphology of glands provides key information for cancer grading. Manual gland segmentation is very inefficient, and many computerized methods have been proposed for automated segmentation.150 Bentaieb and Hamarneh151 introduce a topology-aware FCN model for gland segmentation in a colon adenocarcinoma image dataset.150 To take high-level shape priors into consideration, it designs a loss function to encode desired boundary smoothness priors and hierarchical relationships between regions. This method does not need postprocessing and produces better performance than the counterpart without topology encoding. Yan et al.152 recently present another shape-preserving FCN, which can jointly learn pixel-wise gland segmentation and boundary detection. The shape preservation is achieved by applying a shape similarity measurement loss153 and a weighted pixel-wise cross-entropy loss to holistically nested edge detection model learning.154,155 On the other hand, Chen et al.156,157 formulate gland segmentation as multitask learning, which uses two branches in an FCN to perform gland segmentation and boundary detection separately. Xu et al.158,159 adopt different deep networks to detect gland regions,20 boundaries,154 and locations122 separately and then leverage an FCN to fuse the results from the previous predictions for final gland segmentation. In order to counter the information loss caused by pooling operations, Graham et al.160 input the original down-sampled image at multiple levels within an FCN. During testing, random transformations are applied to input images to generate predictive distributions for uncertainty measurement, which can further enhance gland segmentation. Yang et al.161 combine an FCN and active learning for gland segmentation with less training data than others. Starting with little training data, this method iteratively trains multiple FCNs with an updated training set, which is obtained by adding highly representative images to the previous training set with suggestive annotation. It uses uncertainty and similarity estimation to select those highly representative data. Based on this framework, Xu et al.162 apply quantization to FCN learning for accurate gland segmentation. The quantization process has two steps: suggestive annotation to extract representative annotation samples to build a small, balanced training set, and network training for high accuracy with reduced overfitting. Zhang et al.163 introduce an adversarial learning framework that can use both annotated and unannotated data for gland segmentation.
The framework contains a segmentation network, which segments unannotated images, and an evaluation network, which assesses the segmentation quality. These two networks are trained in an


adversarial fashion such that the framework is able to produce consistently good segmentation performance on annotated and unannotated images. Instead of focusing on gland segmentation only on H&E stained images, Van Eycke et al.164 present a more generic FCN model with residual learning that uses only the hematoxylin color channel for model training, which makes it easy to extend to IHC stained histology image data.

11.3.3.3 Segmentation of other biological structures or tissues

Deep learning has been applied to neuronal structure segmentation, mostly on the dataset of the ISBI 2012 electron microscopy image segmentation challenge.165 Ciresan et al.166 employ a binary-classification CNN to label each image pixel and use polynomial function-based postprocessing to calibrate network outputs for image thresholding. This method averages the predictions of multiple networks to enhance robustness. Based on an FCN,20 Chen et al.167 aggregate multilevel contextual information from different layers of the network to handle scale variation in neuronal structures. More recently, Gu et al.168 exploit atrous convolution (i.e., dilated convolution)169 to encode contextual information within a U-Net-like network, which is able to capture high-level information for neuron segmentation. Other deep learning-based neuronal structure segmentation methods include a CNN combined with random walk,170 an FCN with multistage and multipoint inputs,171 and CNN-guided merging and splitting for segmentation improvement.172 Cancer or tumor region segmentation is another common application of deep learning in pathological image analysis. Within an MIL framework, Jia et al.173 introduce a weakly supervised FCN to segment cancerous regions in colon cancer histopathological images. The deep weak supervision is applied to each intermediate layer of the FCN for multiscale feature learning, and these features are merged via an information fusion layer for segmentation. In addition, area constraints are further adopted to regularize the predictions of positive instances for performance improvement. Xu et al.174 present another MIL-based deep weak supervision framework for colon cancer region segmentation. Specifically, it first splits the image into latticed instances and uses a combined MIL algorithm to generate instance labels. Next, it assigns these instance-level labels to the corresponding pixels and trains fully supervised deep models for image segmentation. Liang et al.175 discuss a reiterative learning framework with weak supervision for gastric cancer image segmentation. It first trains an FCN model with selected patches and applies the FCN with an overlapped region forecast algorithm to predictions of patches sampled from the original images. Then, it updates the training set to fine-tune the FCN and repeats this process until the desired segmentation performance is achieved. Qaiser et al.176 describe a colorectal tumor region segmentation framework with CNNs and persistent homology profiles (PHPs): it either uses a CNN to select a set of exemplar patches followed by patch classification based on the symmetrized Kullback–Leibler divergence, or feeds high-level CNN features and PHP features separately to two random forest models followed by a multistage ensemble of the prediction outputs. With adversarial image-to-image translation, Gupta et al.177 use a CycleGAN to generate virtual images from a given kidney pathology image and learn a standard U-Net with original and virtual images for glomerulus segmentation.
The experiments show that the image enrichment can boost segmentation performance compared to the counterpart using


original images only. Instead of conducting image conversion, Wang et al.178 adopt a multiscale FCN to segment muscle and messy regions in histology tissue images. This model takes as input different fields of view in the image and merges the multiscale score maps via a softmax function for pixel-wise prediction. Based on multidimensional RNNs57 and the Clockwork RNN (CW-RNN),179 Xie et al.180 introduce a spatial CW-RNN to encode global context for muscle perimysium segmentation. Specifically, it splits the image into nonoverlapping patches and uses the spatial CW-RNN to model their semantic dependencies such that global contextual information is applied to the predictions of local patches. In addition, a structured regression loss is used for effective model training via the standard BPTT algorithm.52 Chan et al.181 recently present a CNN-based framework for image semantic segmentation on the Atlas of Digital Pathology dataset,182 which covers a large range of histological tissue types (HTTs). The framework consists of four major steps: (1) it uses a CNN to generate HTT scores for image patches, (2) creates pixel-level class activation maps, (3) averages the activation maps of overlapped patches, and (4) exploits a conditional random field for postprocessing. The method provides better semantic segmentation results than some other weakly supervised approaches.
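Several of the segmentation approaches in this section modify the per-pixel loss rather than the network itself, for example the weighted cross-entropy used to emphasize separation borders between touching nuclei. The sketch below shows the generic form of such a per-pixel weighted cross-entropy; the toy weight map is an assumption (in practice it is typically derived from distance transforms of the annotation masks).

```python
import torch
import torch.nn.functional as F

def weighted_pixel_ce(logits, labels, weight_map):
    """Cross-entropy averaged with a per-pixel weight map (e.g., large near borders)."""
    per_pixel = F.cross_entropy(logits, labels, reduction="none")   # shape (B, H, W)
    return (weight_map * per_pixel).sum() / weight_map.sum()

# Stand-in tensors: 2-class logits, ground-truth labels, and a weight map that
# gives higher importance to pixels on separation borders between touching objects.
logits = torch.randn(1, 2, 64, 64, requires_grad=True)
labels = torch.randint(0, 2, (1, 64, 64))
weights = torch.ones(1, 64, 64)
weights[:, 30:34, :] = 10.0          # pretend this band is a touching-object border
loss = weighted_pixel_ce(logits, labels, weights)
loss.backward()
```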

11.3.4 Stain normalization

Stain or color normalization is a critical task in digital pathology, because color variation in pathology images can significantly challenge computerized methods.63,183 Stain normalization is a process of normalizing the color values of an image such that the color distributions of the source and target images match each other.184 It is able to compensate for color and intensity variations in pathology images from different imaging sites or batches. Previous studies mainly use methods based on stain deconvolution185 or template matching186 to normalize the image color distribution. Recently, Zanjani et al.187 integrate a deep CNN and a Gaussian mixture model into a unified framework for color normalization of H&E stained lymph node histopathology images, which can be jointly optimized in an end-to-end manner. Tellez et al.188 use a U-Net-like architecture to directly convert augmented image data to reference data and quantitatively evaluate the effects of color normalization in deep networks for downstream tasks such as image classification. This work suggests that deep stain normalization can be combined with data augmentation to improve classifier robustness. With advances in deep generative models, GAN-based stain normalization methods are emerging as a new research direction. Bentaieb and Hamarneh189 have presented a GAN-based image-to-image translation technique for stain normalization of H&E stained images with different tissue preparations, such as breast, colon, and ovary. This method trains an encoder–decoder network with adversarial learning21 and translates source images to target-style ones at the pixel level, which can be used for supervised tasks such as image classification and segmentation. Another GAN-based approach190 synthesizes realistic H&E lymph node images from random noise by using the infoGAN technique,191 which can discover structured latent factors in an unsupervised way to assist the generator with image generation. With the cycle-consistent GAN,43 Lahiani et al.192 introduce a perceptual embedding loss into the GAN framework to normalize H&E images to match fibroblast activation protein–cytokeratin IHC stained images and produce very promising

III. Clinical applications

200

11. Artificial intelligence for pathology

performance. By taking a stain color matrix of H&E images as an auxiliary input, the CycleGAN is able to stabilize the cycle consistency loss for image-to-image translation as so to help classification of images from different independent institutes.193 CycleGANs have also been applied to stain normalization for H&E stained breast cancer histology images that are acquired from different scanners, that is, Hamamatsu and Aperio scanners, and have demonstrated better results than other competitors.194
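Before the GAN-based approaches above, a common template-matching baseline simply matches per-channel color statistics between a source image and a reference image. The sketch below illustrates that classical idea in LAB color space (a Reinhard-style normalization); it is a simplified baseline under stated assumptions, not an implementation of the cited deep methods, and the file names in the usage comment are hypothetical.

```python
import numpy as np
from skimage import color, io

def reinhard_normalize(source_rgb, target_rgb):
    """Match per-channel mean/std of a source image to a reference image in LAB space.

    source_rgb, target_rgb: uint8 RGB arrays (H, W, 3). This is a classical baseline,
    not the GAN- or CNN-based methods discussed in the text.
    """
    src = color.rgb2lab(source_rgb)
    tgt = color.rgb2lab(target_rgb)
    out = np.empty_like(src)
    for c in range(3):  # L, a, b channels
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-8
        t_mean, t_std = tgt[..., c].mean(), tgt[..., c].std()
        out[..., c] = (src[..., c] - s_mean) / s_std * t_std + t_mean
    rgb = color.lab2rgb(out)                       # back to RGB in [0, 1]
    return (np.clip(rgb, 0, 1) * 255).astype(np.uint8)

# Hypothetical usage with two H&E tiles saved on disk:
# normalized = reinhard_normalize(io.imread("source_tile.png"), io.imread("reference_tile.png"))
```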

11.3.5 Image superresolution

Image superresolution is a computer vision technique that transforms a low-resolution (LR) image into a high-resolution (HR) counterpart. It has a wide range of applications in many areas such as medical imaging, satellite imaging, and surveillance. For microscopy imaging, it is often cheaper and faster to generate LR images compared to HR ones, but LR images lose details of objects of interest, such as cell shapes and texture, which are very important for tissue description or disease characterization. There are currently a large number of reports focusing on natural image superresolution195-198; however, related literature in digital pathology or microscopy imaging is scarce. With a DenseNet144,199 as the backbone architecture, a deep FCN200 has recently been presented for endomicroscopy image superresolution. Given an LR image, the network learns a nonlinear, complex transformation to recover the high-frequency cues and reconstruct texture details. In order to eliminate or alleviate checkerboard artifacts, it exploits subpixel convolution operations196 for feature map upsampling. This method is able to produce high-quality HR images with an upscaling factor of 8 and outperforms other superresolution methods in terms of peak signal-to-noise ratio and structural similarity. A similar DenseNet-based FCN201 is introduced to superresolve transmission electron microscopy images with a 4× upscaling factor. With adversarial learning, Han and Yin202 report a cascaded GAN model to generate HR phase-contrast microscopy images in a coarse-to-fine fashion. At each level of image upscaling, it adds a content loss, which uses optics-related data enhancement to emphasize different types of cell regions, to the perceptual loss function for improved superresolution. This method can provide very competitive performance on 8× upscaled images. More recently, another GAN-based image superresolution method203 has been used to superresolve H&E stained lymph node images with an upscaling factor of 4. This method integrates an autoencoder into a GAN framework, which is enforced to learn manifold representations and reduce the effects of noise and artifacts.
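Subpixel (pixel-shuffle) convolution, used in the endomicroscopy work above to avoid checkerboard artifacts, is straightforward to express. The sketch below shows a minimal upsampling block and a peak signal-to-noise ratio computation; the layer sizes are illustrative assumptions, not the architectures of the cited networks.

```python
import torch
import torch.nn as nn

class SubpixelUpsample(nn.Module):
    """Minimal subpixel-convolution block: a convolution expands channels by r*r,
    then PixelShuffle rearranges them into an r-times larger feature map."""
    def __init__(self, in_channels=64, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Three stacked blocks give an 8x upscaling path, mirroring the factor-8 setting mentioned above.
upsampler = nn.Sequential(*[SubpixelUpsample(64, 2) for _ in range(3)])
features = torch.randn(1, 64, 32, 32)
print(upsampler(features).shape)  # torch.Size([1, 64, 256, 256])
```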

11.3.6 Computer-aided diagnosis

Computer-aided diagnosis (CAD) nowadays plays a significant role in clinical research and practice. It is already a part of routine clinical work for breast cancer detection on mammograms.6,204 With the advent of WSI and AI technologies, CAD finds many potential applications in digital pathology.205,206 Conventionally, CAD systems were mostly built on traditional ML algorithms, which need domain experts to provide handcrafted features. With the nature of end-to-end learning, deep neural networks have been increasingly exploited to design CAD systems in digital pathology. Based on both traditional ML and deep learning algorithms, Sapkota et al.207 develop an AI CAD system for automatic idiopathic inflammatory myopathy diagnosis on H&E stained skeletal muscle images. It uses a logistic-boosting classifier to detect muscle fibers and trains a spatial CW-RNN with structured regression180 to annotate perimysium regions; with these detections and annotations, it automatically calculates a set of image features for content-based muscle image retrieval and classification. Ma et al.208 design a CNN-based CAD system for diagnosis of cervical diseases on optical coherence microscopy (OCM) images. Specifically, this system uses a VGGNet32 to extract OCM image features and concatenates them with patient information to form the final representations, which are fed to an SVM for cervical disease grading. Recently, Zhang et al.209 unify a deep CNN and an LSTM-based RNN into a single framework for pathology bladder cancer diagnosis. The CNN learns to extract useful features from raw image data, and the RNN produces a description of tissue characteristics and visual attention on input images to support the diagnosis. Later, this work is extended to handle WSI images with three cascaded deep networks.210 The first network resembles the U-Net38 and detects tumor regions to select diagnostically useful ROIs in the entire WSI image, the second combines an Inception-v3 CNN211 and an LSTM-based RNN212 to characterize the selected ROIs, and the last is a simple fully connected neural network that produces the diagnosis.
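The hybrid design used by Ma et al. (deep image features concatenated with patient metadata, followed by a conventional classifier) can be sketched in a few lines. The sketch below is a simplified illustration under assumed tensor shapes, a generic ImageNet-pretrained backbone, and recent torchvision; it is not the authors' released code, and the metadata fields are hypothetical.

```python
import torch
import torchvision.models as models
from sklearn.svm import SVC
import numpy as np

# Pretrained VGG backbone used purely as a fixed feature extractor
# (assumption: ImageNet weights via torchvision >= 0.13; the cited system may differ).
backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
backbone.classifier = torch.nn.Identity()   # keep only convolutional + flattened features
backbone.eval()

def image_features(batch):
    """batch: float tensor (N, 3, 224, 224) of preprocessed image patches."""
    with torch.no_grad():
        return backbone(batch).numpy()       # (N, 25088)

# Hypothetical training data: image patches plus per-patient metadata (e.g., age, a lab value).
images = torch.randn(8, 3, 224, 224)
metadata = np.random.rand(8, 2)
labels = np.random.randint(0, 2, size=8)     # binary disease grade, for illustration only

features = np.concatenate([image_features(images), metadata], axis=1)
clf = SVC(kernel="rbf").fit(features, labels)   # final grading classifier
print(clf.predict(features[:2]))
```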

11.3.7 Others

Content-based image retrieval (CBIR) is a technique for searching for images in massive databases. Given a medical image as the input query, CBIR aims to retrieve similar cases or images stored in large-scale collections such that physicians can quickly assess the likelihood of abnormality or the similarity of disease characterization and deliver better patient care.213-216 There are three key components in CBIR systems: image representation, image organization for indexing, and image similarity measurement.217 Deep learning-based hashing methods, which can learn very rich image features and enable efficient search in large-scale databases, have drawn great attention from the CBIR community. Shi et al.218 have recently proposed a deep pairwise hashing method for CBIR in a database containing skeletal muscle and lung cancer images. Specifically, this method defines a pairwise matrix to preserve intraclass relevance and interclass difference. It adds a pairwise loss to the last layer of a CNN architecture, which simultaneously learns image features and binary representations for hashing-based indexing. Instead of calculating a pairwise matrix, Sapkota et al.219 introduce a point-wise CNN hashing model to retrieve histopathological images in the same database. This approach inserts a latent binary encoding layer into a classification CNN and conducts joint learning to encourage the network to learn discriminative representations, which can reduce the semantic gap in CBIR. The point-wise learning of binary codes makes the hashing method scalable to large databases. Deep learning has also been applied to survival analysis based on pathology images. Zhu et al.220 report a deep learning-based survival analysis framework on WSI histopathological images of lung and brain cancers. This framework first extracts and groups heterogeneous patches from the WSI images into different clusters and then trains a CNN survival model221 for each cluster to select the informative clusters, which are further aggregated and fed into a survival analysis model for final predictions. In this way the framework is able to produce patient-level predictions even when the patient cohort is small. In order to capture topological features in the WSI images, Li et al.222 present a graph CNN model to perform survival prediction. By viewing patches from a WSI image as vertices, it builds a spectral graph to model their relationships and learns a deep, attention-based graph CNN for survival analysis, where an attention mechanism weights different patches during training and improves model robustness. In addition to pathological images, a two-branch CNN model223 has recently taken molecular profiling data as an additional input for survival prediction. Specifically, the method first maximizes the correlation among different modality data by learning a joint embedding space and then transfers feature hierarchies from the shared space and fine-tunes the network with a survival loss function.
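Many of the deep survival models cited above are trained with a Cox-style loss. As a worked illustration, the sketch below implements the negative log Cox partial likelihood for a mini-batch of patients, assuming no tied event times for simplicity; the variable names and example numbers are illustrative.

```python
import torch

def cox_partial_likelihood_loss(risk_scores, times, events):
    """Negative log Cox partial likelihood for a mini-batch.

    risk_scores: (N,) predicted log-risk from the network
    times:       (N,) follow-up times
    events:      (N,) 1 if the event (e.g., death) was observed, 0 if censored
    Assumes no tied event times (a simplifying assumption for this sketch).
    """
    order = torch.argsort(times, descending=True)   # longest follow-up first
    risk = risk_scores[order]
    ev = events[order]
    # For each subject, the risk set is everyone with time >= theirs, i.e.,
    # a cumulative log-sum-exp over the scores sorted by decreasing time.
    log_risk_set = torch.logcumsumexp(risk, dim=0)
    partial_ll = (risk - log_risk_set) * ev
    return -partial_ll.sum() / ev.sum().clamp(min=1.0)

# Example: scores for 5 patients from any feature extractor (cluster-level CNN, graph CNN, ...).
scores = torch.randn(5, requires_grad=True)
t = torch.tensor([12., 30., 7., 24., 18.])
e = torch.tensor([1., 0., 1., 1., 0.])
loss = cox_partial_likelihood_loss(scores, t, e)
loss.backward()
```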

11.4 Summary

Deep learning is emerging as a state-of-the-art technique for medical image computing. In particular, it has exhibited great power in digital pathology and microscopy image analysis, leading to improved performance on many tasks compared with other computerized methods.12,16 However, several open challenges remain for applying deep learning to digital pathology in both research and clinical settings.

11.4.1 Open challenges and future directions of deep learning in pathology image analysis

11.4.1.1 Quality control

In digital pathology, WSI images are typically generated through a series of steps, including tissue collection, embedding, sectioning, and imaging.2 Because of potential errors in these image acquisition steps, it is common for images to exhibit artifacts such as tissue folds, pen marks, and/or blurred regions.7 These anomalies can have unpredictable effects on computerized image quantification algorithms, so it is critical to detect, correct, and/or eliminate the image artifacts. Early studies mainly rely on basic image processing and traditional ML algorithms,224,225 and deep learning is beginning to be applied to quality control tasks such as automatic focusing226,227 and ink removal,228 although related literature is still scarce. Batch effects cause color or scale variations in digitized images due to inconsistent preparation during image generation.7,12 These variations are very common in digital pathology, especially for image data from different imaging centers or institutes. Literature that uses deep neural networks to detect or correct scale batch effects is currently very scarce, but deep learning has been applied to stain color normalization and has outperformed many non-deep-learning approaches, as discussed in Section 11.3.4.
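A useful non-deep baseline for one of the artifacts mentioned above, out-of-focus (blurred) regions, is the variance of the Laplacian computed per tile. The sketch below illustrates this simple quality-control check; the threshold value is an illustrative assumption and would need to be tuned per scanner and magnification.

```python
import cv2
import numpy as np

def blur_score(tile_bgr):
    """Variance of the Laplacian: low values suggest an out-of-focus tile."""
    gray = cv2.cvtColor(tile_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def flag_blurred_tiles(tiles, threshold=50.0):
    """Return indices of tiles whose sharpness falls below an assumed threshold."""
    return [i for i, t in enumerate(tiles) if blur_score(t) < threshold]

# Hypothetical usage on tiles already cropped from a WSI (BGR uint8 arrays).
tiles = [np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8) for _ in range(4)]
print(flag_blurred_tiles(tiles))
```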


We expect more studies, particularly GAN-based methods, to be developed for stain and batch-effect normalization in the near future.

11.4.1.2 High image dimension

A WSI image can have 50,000 × 50,000 or even more pixels and exhibits rich information about tissue and cell morphology. It is necessary to examine and analyze the entire WSI image instead of cropped smaller patches for better disease characterization.1 Because of the high image dimension, image analysis methods for other modalities (e.g., magnetic resonance imaging or computed tomography) might not be directly applicable to pathological image quantification. First, it might be difficult to load a set of holistic WSI images into graphics processing unit memory for effective model training if no distributed computing resources are available. Second, simply resizing WSI images into smaller ones would result in loss of detailed information such as cell shapes and sizes, thus degrading the performance of computerized image analysis. It is currently common to split a WSI image into a set of patches for separate analysis and then merge all the results for WSI-level quantification. However, it is an open challenge to design an efficient and effective split-and-stitch method. A WSI image can contain a large number of patches, but only a few of them are informative for whole-slide analysis tasks such as cancer grading, and it might not be easy to automatically determine and select those important patches. In addition, it is challenging to take into account the relationships between patches or the global context if each patch is analyzed independently, which affects WSI-level analysis. Although some studies have reported WSI image classification with deep learning, there is still room for improvement. Meanwhile, it would be helpful to extend these classification techniques to other applications such as object detection and segmentation.

11.4.1.3 Object crowding

Object detection and segmentation might be two of the most important tasks in pathological image analysis. Many objects such as nuclei or cells often appear in a clump such that they touch or partially overlap each other. This situation might not appear in other medical imaging modalities (e.g., radiography), where only one or a few objects of interest appear in each image. The densely clustered objects might exhibit very weak or no boundaries, which significantly challenges automated image quantification methods, including deep learning models. Nowadays many computerized methods have been developed for object detection and/or segmentation in various digital pathology and microscopy images, but most of them solve the problem in a very limited context, especially for object segmentation.229 At present, it often requires substantial effort to adapt the developed methods to new datasets, and this limits the methods' reusability in different situations. Although many deep learning-based methods have been proposed for detection of a particular type of object or localization of all objects without category labeling, there are very few studies focusing on single-stage object detection with category labeling in digital pathology,130 probably partly because this is very challenging for touching or overlapping objects. Compared with multistage object detection, one-stage methods could reduce the variability and improve the efficiency of image quantification. Pixel-to-pixel models, such as FCNs, U-Net, and their variants, provide a straightforward way to achieve this goal (a minimal sketch of this pixel-to-pixel formulation follows below).
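The sketch below shows the bare bones of such a pixel-to-pixel network: a small encoder-decoder that maps an input patch directly to a per-pixel map of class scores. The layer sizes and number of classes are illustrative assumptions, not the architectures of the cited works.

```python
import torch
import torch.nn as nn

class TinyPixelToPixelNet(nn.Module):
    """Minimal FCN-style encoder-decoder producing per-pixel class scores
    (e.g., background vs. nucleus, or one channel per nucleus category)."""
    def __init__(self, in_channels=3, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, num_classes, 1),           # per-pixel class scores
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# One training step with a per-pixel cross-entropy loss (illustrative data).
net = TinyPixelToPixelNet(num_classes=3)
patches = torch.randn(2, 3, 128, 128)
masks = torch.randint(0, 3, (2, 128, 128))            # per-pixel labels
loss = nn.CrossEntropyLoss()(net(patches), masks)
loss.backward()
```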


Alternatively, single-stage object detectors developed for natural images, such as SSD139 and YOLO,103 have also attracted increasing interest in the medical domain. For touching object segmentation, it is common to apply deep networks for ROI extraction followed by nontrivial postprocessing to separate and delineate object boundaries. More recently, several deep models145 have been designed to segment individual nuclei/cells with very limited or no postprocessing. We expect the number of such object segmentation approaches to increase in the future.

11.4.1.4 Data annotation issues

Deep supervised models such as CNNs typically require a large amount of annotated training data, which might be infeasible to obtain in some applications. Data annotation is very expensive in digital pathology, especially for object detection and segmentation, which need individual object labeling and pixel-wise annotation, respectively. Although it is common to pretrain a CNN on other large-scale image datasets like ImageNet36 and then fine-tune the model on a specific target dataset,230 it might still be difficult to collect enough annotated data for proper fine-tuning.231 Deep active learning and weakly supervised learning, which have relatively lower requirements for data annotation, are two active research areas in computer vision; they have recently been applied to pathological image classification and segmentation, showing very promising performance. We believe these techniques will continue to gain traction in digital pathology. On the other hand, deep unsupervised learning, which does not need labeled data, is drawing increasing attention. In particular, many GAN-based methods have been proposed for image segmentation, stain normalization, and image superresolution and have produced performance competitive with supervised learning methods. GANs are currently a hot research topic and will likely continue to play an important role in medical image computing, including pathological image analysis. It is not unusual for data annotations to be imbalanced in real applications. For object detection, objects of interest typically account for only a small portion of each image, most of which is background pixels. Thus learning a classification- or regression-based detector typically requires careful loss function design.116 Alternatively, sampling-based methods,232-234 which modify the imbalanced data to generate a balanced distribution, can also be considered when dealing with imbalanced learning. Inconsistent data labeling is another challenge for computerized image analysis. For instance, data annotations for the two major subtypes of nonsmall-cell lung cancer in histopathologic images might not be consistent between different physicians,235 because of the difficulty of differentiating adenocarcinoma from squamous cell carcinoma as well as differences in observer experience. It is currently an open challenge to design a deep model that takes inconsistent data labels into consideration for pathological image quantification.

11.4.1.5 Integration of different types of input data

Integration of pathological images with other types of data, such as omics and health records, is a promising avenue for applications of AI in digital pathology. Combining genomics with IHC image analysis can help elucidate the relationship between genomic and image markers,18 and integration of pathologic, radiologic, and/or proteomic measurements can help better characterize diseases.236,237


Recently, deep neural networks have been applied to mining semantic interactions between diagnostic images and radiology reports238 and to text generation for CAD in digital pathology.210 With natural language processing, text mining along with image analysis can provide an additional viewpoint for image understanding and interpretation and thus improve disease diagnosis and/or prognosis. Therefore integration of images, text, and other types of data with deep learning has recently drawn considerable interest in medical image computing, and it may play a profound role in the future.
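A common starting point for such integration is late fusion: separate encoders for each modality whose embeddings are concatenated before a prediction head. The sketch below is a minimal, generic illustration with assumed input sizes; it does not reproduce any of the cited multimodal architectures.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Concatenate an image embedding with a tabular (omics/clinical) embedding."""
    def __init__(self, image_dim=512, tabular_dim=50, num_classes=2):
        super().__init__()
        self.image_branch = nn.Sequential(nn.Linear(image_dim, 128), nn.ReLU())
        self.tabular_branch = nn.Sequential(nn.Linear(tabular_dim, 32), nn.ReLU())
        self.head = nn.Linear(128 + 32, num_classes)

    def forward(self, image_feats, tabular_feats):
        fused = torch.cat([self.image_branch(image_feats),
                           self.tabular_branch(tabular_feats)], dim=1)
        return self.head(fused)

# Illustrative inputs: patch-level CNN features and a gene-expression/clinical vector.
model = LateFusionModel()
logits = model(torch.randn(4, 512), torch.randn(4, 50))
print(logits.shape)  # torch.Size([4, 2])
```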

11.4.2 Outlook of clinical adoption of artificial intelligence

While it is possible to apply AI/deep learning to clinical pathology, widespread adoption is still a distant prospect and raises additional legal and regulatory issues.239 Deep learning is currently used in only a few clinical labs around the world, but research has shown that deep learning can be used to improve accuracy, precision, quality, and efficiency in digital pathology.240 Currently deep learning studies of diagnostic accuracy are limited by small, highly curated datasets with unproven performance outside the research environment. It is uncertain how current models would perform when exposed to novel images from a variety of histology labs and slide scanners, but a significant degradation in performance might be inevitable. Although the risks of using deep learning for general diagnosis remain too high, limited use in areas such as quality control and screening may become widespread soon.

11.4.2.1 Potential applications

Diagnostic quality control is a prime application of deep learning-based AI, and at least one company, Ibex Medical Analytics, has brought a product to market.241 Its Galen prostate cancer quality control system applies deep learning algorithms to WSIs after the pathologist has reviewed the case and flags any discrepancy in diagnosis or grading for subsequent review. The Galen system has been in clinical use for over a year and found a diagnostic discrepancy in the first week it was installed. It represents the first known deep learning application to be employed in a clinical diagnostic workflow. Deep learning may prove useful for prescreening slides for pathologists in the near future. Many screening tasks in histopathology require a pathologist to look for rare events across a large set of slides, which is a time-consuming and error-prone process. A recent open contest, the CAMELYON16 challenge, demonstrated that ML is up to a classic screening task, namely, identifying breast cancer metastasis in sentinel lymph nodes.242 The Harvard and MIT team that won the challenge used a CNN-based approach, and their updated method even exceeded the performance of the reference pathologist in the study. Since then, much work has been done on deep learning-based detection of lymph node metastasis, and this task remains an exemplar for tissue screening.243-245 Image biomarker quantitation is another potential application of deep learning in digital pathology. Deep learning can be used to conduct survival analysis directly from histology220-223 or to measure biomarkers with known prognostic implications. In addition, deep learning has also been applied to predictive biomarker quantification, which aims to measure the likelihood that a patient will respond to a specific therapy.


CNNs have recently been used to assess HER2 status246-249 and to quantify estrogen receptor expression250,251 for breast cancer in fluorescence in situ hybridization, IHC stained, or H&E stained images. CNNs have also been adopted to measure tumor infiltrating lymphocytes to provide prognostic and predictive information for different types of cancer.252-254 We expect more applications of deep learning in prognostication and prediction in the future.

11.4.2.2 Barriers to clinical adoption

Although deep neural networks have been widely applied to pathology image analysis in research, deep learning applications are currently mostly absent from the clinical laboratory. The first applications would be those with the lowest risk profile, and additional applications may appear as confidence in AI/deep learning increases. It comes as no surprise that the first application to be implemented in the lab was for diagnostic quality assurance, since this added layer of quality control only has the potential to lower risk rather than raise it. Adoption of AI directly into the diagnostic workflow, even if only in a clinical decision support (CDS) role, would bring with it significant legal, malpractice, and regulatory burdens that are not yet resolved but are currently being debated. We discuss some barriers to clinical adoption of AI in pathology below.

11.4.2.2.1 Lagging adoption of digital pathology

While WSI is standard practice in education and research, it is not yet a well-established clinical practice.255,256 Some degree of digital pathology adoption is required for efficient deployment of AI in pathology. Certain types of algorithms, such as those performing screening tasks, have negligible utility if results are not available at the time a pathologist reviews a case.257 Pathologists will also be disincentivized to run on-demand algorithms if glass slides must be physically returned to the histology lab and scanned. One can argue that other algorithms are less limited by a lack of upfront slide scanning. These are typically low-volume, high-value algorithms that perform biomarker-specific quantitation or provide targeted CDS for diagnosis, prognosis, or prediction. Overall, however, the vast majority of proposed use cases are impeded by a lack of digital pathology. Nowadays, fewer than a dozen labs in the world are known to have an entirely digital workflow, and all of these labs are located outside the United States. The Food and Drug Administration (FDA), which regulates in vitro diagnostic (IVD) devices in the United States, has approved only two digital pathology systems for primary diagnosis, one in 2017 and the other in 2019.258,259 While adoption of digital pathology has been more robust in the European Union (EU), IVD regulation there has been fundamentally altered to create a more stringent, FDA-like regulatory structure.260 All IVDs, including those currently approved in the EU, must be compliant with the new In Vitro Diagnostic Medical Device Regulation (IVDR) by May 26, 2022. The effect on the digital pathology market in Europe is uncertain but will likely slow adoption there. The cost of digital pathology is frequently cited as a barrier to adoption.256 Converting a lab to a fully digital workflow is undeniably expensive. Full digital conversion at a large academic institution might cost $2 million in initial hardware and software in addition to annual licensing, support, personnel, and WSI storage costs.261 It is not always clear what return on investment, if any, might be realized from digital pathology.


Some have theorized long-term reduction of cost at the health system level due to increased pathologist productivity, facilitation of laboratory consolidation, and reduced patient care costs through promotion of subspecialty diagnostics.262,263 Encouragingly, Memorial Sloan Kettering Cancer Center has reported actual cost savings due to a decrease in slide retrieval requests and ancillary IHC stain orders on digital cases.264 The authors project $1.3 million in savings over 5 years. It remains to be seen if AI-enabled digital pathology workflows further increase savings and return on investment.

11.4.2.2.2 Lack of standards for interfacing AI to clinical systems

Discussions of AI rarely include a detailed plan for how to integrate algorithms into clinical workflows. Key clinical systems in pathology include the laboratory information system and digital pathology system but might also include an enterprise picture archiving and communication system (PACS) or vendor neutral archive. AI is expected to interact with these systems, yet no standard model has been proposed for this integration. Vendors that are developing AI for pathology typically propose to deploy AI in siloed systems, duplicating existing functionality rather than properly integrating. The Digital Imaging and Communications in Medicine (DICOM) standard is clearly part of the solution, but adoption of DICOM in digital pathology has been slow. Over a decade ago, the DICOM Working Group 26 (WG-26) released Supplement 145, which specifies an information object definition for WSIs.265 This was mostly ignored by vendors until recently, when WG-26 coordinated a series of successful Digital Pathology Connectathons.266 Several PACS have since incorporated DICOM-based support for WSIs, and in 2019 the first slide scanner with native DICOM support was released. However, the FDA has yet to clear a digital pathology system that uses DICOM. Furthermore, serious questions remain as to whether the existing annotation and overlay models in DICOM are adequate for use with WSIs.267,268

11.4.2.2.3 Regulatory concerns

Dozens of AI-based devices have been approved by the FDA, but none of the recent approvals are in pathology.269 The FDA is working toward establishing a new paradigm for regulating software as a medical device (SaMD) and AI used in medical practice. SaMD is defined by the International Medical Device Regulators Forum as "software intended to be used for one or more medical purposes that perform these purposes without being part of a hardware medical device" and differs from software in a medical device (SiMD). The 21st Century Cures Act of 2016 modified the definition of a device, excluding certain types of SaMD from FDA regulation.270 Notably, the Cures Act does not exempt software intended to analyze or process medical images but does exempt some forms of CDS software. The FDA is currently drafting guidance to clarify the status of CDS, and the current version would only exempt AI/ML-based CDS with easily and fully explainable inputs or with trivial indications.271 Thus it appears that most AI in pathology will continue to be regulated as IVDs. The FDA has also recently published the "Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning," which proposes a new, total product life cycle regulatory approach that would allow certain types of changes to approved AI/ML-based SaMD without premarket approval.272 This approach would focus on monitoring actual performance and accommodate several modifications to algorithms, including the incorporation of new learning. While this is an encouraging direction for the FDA, until it is implemented, all AI/ML-based SaMD algorithms must be "locked" prior to marketing and require premarket review for any changes.273 In addition to the FDA, laboratory testing in the United States is regulated by the Centers for Medicare and Medicaid Services through the Clinical Laboratory Improvement Amendments (CLIA) of 1988. This additional layer of regulation makes deployment of AI in pathology distinct from other areas of medicine. For FDA-approved testing, CLIA requires a lab to perform additional testing to verify the performance reported by the manufacturer. Importantly, CLIA also permits laboratories to perform more comprehensive testing to validate non-FDA-approved testing as laboratory-developed tests, which opens another potential pathway for deploying AI. In either case, CLIA requires a lab to repeat validation or verification on a test if any change is made. If AI meets the definition of a laboratory test, the abovementioned rules apply. Finally, it is important to note that CLIA requires all clinical laboratory testing to be performed in a CLIA-licensed facility. Currently this includes all steps in a computational pipeline and has significant implications for the provision of AI software as a service by a third party. Under current rules a third party would need to maintain a CLIA license and be subject to regular inspection to maintain certification.

11.4.2.2.4 Computational requirements

ML algorithms can be computationally expensive, especially when they are run over an entire WSI. State-of-the-art slide scanners digitize around 80 slides per hour (45 seconds per slide), and multiple slide scanners running in parallel would be required to keep up with slide production in the average histology laboratory.274 If ML algorithms must keep pace with the existing workflow, this will require significant computational resources. To address a lack of computational resources in the lab, many pathology AI vendors are promoting the use of cloud-based or hosted algorithms. While this solves one problem, it may introduce additional issues. First, it requires the transfer of large amounts of data to the cloud. It also means that potentially sensitive data will be transferred to third-party servers. While attitudes about the use of cloud services are changing in health-care organizations, this may create an insurmountable barrier for some organizations. It may also create regulatory issues if the third party is not a CLIA-licensed facility.
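To make the compute requirement concrete, a rough back-of-the-envelope calculation helps; the patch size, tissue fraction, and per-patch inference throughput below are assumed values for illustration, not measured figures.

```python
# Back-of-the-envelope estimate of per-slide inference time (illustrative numbers only).
slide_pixels = 50_000 * 50_000          # a typical large WSI, as noted in Section 11.4.1.2
patch = 256                              # assumed tile size in pixels
tissue_fraction = 0.5                    # assume half of the slide area is tissue worth analyzing

patches = slide_pixels * tissue_fraction / (patch * patch)
throughput = 500.0                       # assumed GPU throughput in patches per second
seconds_per_slide = patches / throughput

print(f"{patches:,.0f} patches -> {seconds_per_slide / 60:.1f} minutes per slide")
# Roughly 19,000 patches and about 0.6 minutes per slide at the assumed throughput;
# a scanner producing a slide every 45 seconds would therefore need comparable
# sustained GPU capacity to keep inference from becoming the bottleneck.
```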

11.4.2.2.5 Algorithm explainability

In the era of evidence-based medicine, physicians have a natural distrust of "black box" methods.275 While traditional ML algorithms may use models that are explainable, deep learning models such as CNNs are notorious "black boxes." This lack of transparency is particularly concerning to pathologists, who are trained to make diagnostic decisions by evaluating human-parsable cytologic and histopathologic criteria. A possible solution to this barrier is explainable AI, which attempts to explain why an AI algorithm arrives at a specific decision.276 A variety of methods are being developed to explain the behavior of deep learning models, many of which are model-independent.277 In the absence of explainability, it remains to be seen if accuracy alone will be sufficient to convince pathologists to adopt ML algorithms.275
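One model-independent explanation technique that is easy to illustrate is occlusion sensitivity: slide a gray patch over the input and record how the predicted probability for the class of interest drops. The sketch below assumes a generic PyTorch classifier and a single preprocessed image tensor; it is an illustration of the idea, not a production explainability tool, and the `classifier` in the usage comment is hypothetical.

```python
import torch

def occlusion_map(model, image, target_class, patch=32, stride=32, fill=0.5):
    """Model-independent saliency: probability drop when each region is occluded.

    image: (3, H, W) tensor already preprocessed for `model`.
    Returns a coarse (H // stride, W // stride) map; larger values = more important region.
    """
    model.eval()
    with torch.no_grad():
        base = torch.softmax(model(image.unsqueeze(0)), dim=1)[0, target_class]
        _, H, W = image.shape
        heat = torch.zeros(H // stride, W // stride)
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = fill   # gray out one region
                prob = torch.softmax(model(occluded.unsqueeze(0)), dim=1)[0, target_class]
                heat[i, j] = base - prob                        # drop in confidence
    return heat

# Hypothetical usage with any image classifier, e.g., a tumor/normal patch model:
# heatmap = occlusion_map(classifier, patch_tensor, target_class=1)
```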


11.4.2.2.6 Pathologists’ skepticism

Many pathologists are enthusiastic about AI, while others express concerns about cost-effectiveness, the timeline to clinical adoption, and job displacement.278 Pathologists frequently raise concerns about AI replacing pathologists, and comments from the AI community have not allayed their concerns. In 2016 Geoffrey Hinton, a deep learning pioneer and Turing Award winner, famously quipped, "People should stop training radiologists now; it's just completely obvious that within five years, deep learning is going to do better than radiologists."279 While Hinton has not made any such pronouncements about the future of pathology, many consider radiologists the proverbial "canary in the coal mine" with respect to AI. If AI is to succeed in pathology, computer scientists must recognize that it will be due to the advocacy of pathologists and that considerable work remains to convince pathologists that AI is a boon and not a threat.

Acknowledgment

Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R21CA237493. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References 1. Madabhushi A, Lee G. Image analysis and machine learning in digital pathology: challenges and opportunities. Med Image Anal 2016;33:1705. 2. McCann MT, Ozolek JA, Castro CA, Parvin B, Kovacevic J. Automated histology analysis: opportunities for signal processing. IEEE Signal Process Mag 2015;32(1):7887. 3. Foran DJ, Lin Y, Wenjin C, et al. Imageminer: a software system for comparative analysis of tissue microarrays using content-based image retrieval, high-performance computing, and grid technology. J Am Med Inform Assoc 2011;8(4):40315. 4. Mills A, Gradecki S, Horton B, Blackwell R, Moskaluk C, Mandell J, et al. Diagnostic efficiency in digital pathology: A comparison of optical versus digital assessment in 510 surgical pathology cases. Am J surgical Pathol 2017;42(1):539. 5. Xing F, Yang L. Robust nucleus/cell detection and segmentation in digital pathology and microscopy images: A comprehensive review. IEEE Rev Biomed Eng 2016;9:23463. 6. Gurcan MN, Boucheron LE, Can A, Madabushi A, Rajpoot NM, Yener B. Histopathological image analysis: a review. IEEE Revews Biomed Eng 2009;2:14771. 7. Kothari S, Phan JH, Stokes TH, Wang MD. Pathology imaging informatics for quantitative analysis of wholeslide images. J Am Med Inform Assoc 2013;20(6):1099108. 8. Xing F, Yang L. Chapter 4  Machine learning and its application in microscopic image analysis. In: Wu G, Shen D, Sabuncu MR, editors. Machine learning and medical imaging, Academic Press; 2016. pp. 97127. 9. Sommer C, Gerlich DW. Machine learning in cell biology  teaching computers to recognize phenotypes. J Cell Sci 2013;126(24):552939. 10. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal 2017;42(Suppl. C):6088. 11. Shen D, Wu G, Suk H-I. Deep learning in medical image analysis. Annu Rev Biomed Eng 2017;19(1):22148. 12. Xing F, Xie Y, Su H, Liu F, Yang L. Deep learning in microscopy image analysis: a survey. IEEE Trans Neural Netw Learn Syst 2018;29(10):455068. 13. Deng L, Yu D. Deep learning: methods and applications. Found Trends Signal Process 2014;3(34):197387. 14. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521(28):43644.


15. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016. Available from: ,http://www.deeplearningbook.org.. 16. Greenspan H, van Ginneken B, Summers RM. Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE Trans Med Imaging 2016;35(5):11531159. 17. Bera K, Schalper KA, Rimm DL, Velcheti V, Madabhushi A. Artificial intelligence in digital pathology  new tools for diagnosis and precision oncology. Nat Rev Clin Oncol 2019;16:70315. 18. Niazi MKK, Parwani AV, Gurcan MN. Digital pathology and artificial intelligence. Lancet Oncol 2019;20(5): e25361. 19. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE 1998;86(11):2278324. 20. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2015. p. 343140. 21. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Proceedings of advances in neural information processing systems. 2014. p. 267280. 22. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 2013;35(8):1798828. 23. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986;323:5336. 24. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015;61:85117. 25. Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput 2006;18 (7):152754. 26. Salakhutdinov R, Hinton G. Deep Boltzmann machines. In: Proceedings of the 12th international conference on artificial intelligence and statistics. 2009. p. 44855. 27. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Proceedings of advances neural information processing systems. 2012. p. 1097105. 28. Maas AL, Hannun AY, Ng AY. Rectifier nonlinearities improve neural network acoustic models. In: ICML workshop on deep learning for audio, speech, and language processing. 2013. p. 16. 29. Clevert DA, Unterthiner T, Hochreiter S. Fast and accurate deep network learning by exponential linear units (elus). In: Proceedings of international conference on learning representations. 2016. p. 114. 30. LeCun Y, Kavukcuoglu K, Farabet C. Convolutional networks and applications in vision. In: Proceedings of 2010 IEEE international symposium on circuits and systems. 2010. p. 2536. 31. Dumoulin V, Visin F. A guide to convolution arithmetic for deep learning. arXiv:1603.07285 [stat.ML] 2016:131. 32. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of international conference on learning representations. 2015. p. 114. 33. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2015. p. 19. 34. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2016. p. 7708. 35. Tang Y. Deep learning using linear support vector machines. In: Workshop on representational learning, the 30th international conference on machine learning. 2013. p. 16. 36. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. Imagenet large scale visual recognition challenge. 
Int J Computer Vis 2015;115(3):21152. 37. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov R. Improving neural networks by preventing coadaptation of feature detectors. arXiv:1207.0580 [cs.NE] 2012:118. 38. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Proceedings of the 18th international conference on medical image computing and computer-assisted intervention, vol. 9351. 2015. p. 23441. 39. Radford A., Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of international conference on learning representations. 2016. p. 116. 40. Mirza M, Osinderoo S. Conditional generative adversarial nets. arXiv:1411.1784 [cs.LG] 2014:17. 41. Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of gans for improved quality, stability, and variation. In: Proceedings of international conference on learning representations. 2018. p. 126.


42. Isola P, Zhu J, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2017. p. 596776. 43. Zhu J, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of IEEE international conference on computer vision. 2017. p. 224251. 44. Yi X, Walia E, Babyn P. Generative adversarial network in medical imaging: a review. Med Image Anal 2019;58:101552. 45. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 2010;11:3371408. 46. Ranzato M, Poultney C, Chopra S, LeCun Y. Efficient learning of sparse representations with an energy-based model. In: Proceedings of Advances in neural information processing systems. 2007. p. 113744. 47. Ranzato M, Boureau Y, LeCun Y. Sparse feature learning for deep belief networks. In: Proceedings of advances in neural information processing systems. 2008. p. 118592. 48. Glorot X, Bordes A, Bengio Y. Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th international conference on machine learning. 2011. p. 51320. 49. Vincent P, Larochelle H, Bengio Y, Manzagol PA. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning. 2008. p. 10961103. 50. Rifai S, Vincent P, Muller X, Glorot X, Bengio Y. Contractive autoencoders: explicit invariance during feature extraction. In: Proceedings of the 28th international conference on machine learning. 2011. p. 83340. 51. Williams RJ, Zipser D. Gradient-based learning algorithms for recurrent networks and their computational complexity. L. Erlbaum Associates Inc.; 1995, p. 43386. 52. Werbos PJ. Backpropagation through time: what it does and how to do it. Proc IEEE 1990;78(10):155060. 53. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 1994;5(2):15766. 54. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computat 1997;9(8):173580. 55. Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing. 2014. p. 172434. 56. Graves A, Fernandez S, Schmidhuber J. Multi-dimensional recurrent neural networks. In: Proceedings of 17th international conference on artificial neural networks. 2007. p. 54958. 57. Graves A, Schmidhuber J. Offline handwriting recognition with multidimensional recurrent neural networks. In: Proceedings of advances in neural information processing systems. 2009. p. 54552. 58. Byeon W, Breuel TM, Raue F, Liwicki M. Scene labeling with LSTM recurrent neural networks. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2015. p. 354755. 59. Karimi D, Nir G, Fazli L, Black PC, Goldenberg L, Salcudean SE. Deep learning-based gleason grading of prostate cancer from histopathology images  role of multiscale decision aggregation and data augmentation. IEEE J Biomed Health Inform 2020;24:141326. 60. Li M, Wu L, Wiliem A, Zhao K, Zhang T, Lovell B. Deep instance-level hard negative mining model for histopathology images. 
In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 51422. 61. Couture HD, Marron JS, Perou CM, Troester MA, Niethammer M. Multiple instance learning for heterogeneous images: Training a CNN for histopathology. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 25462. 62. Ioannou N, Stanisavljevic M, Anghel A, Papandreou N, Andani S, Ruschoff JH, et al. Accelerated ml-assisted tumor detection in high-resolution histopathology images. In: Proceedings of international conference on medical image computing and computer assisted interventions. 2019. p. 40614. 63. Macenko M, Niethammer M, Marron JS, et al. A method for normalizing histology slides for quantitative analysis. In: Proceedings of IEEE international symposium on biomedical imaging: from nano to macro. 2009. p. 110710. 64. Gu Y, Vyas K, Yang J, Yang GZ. Weakly supervised representation learning for endomicroscopy image analysis. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 32634. 65. Huang Y, Zheng H, Liu C, Ding X, Rohde GK. Epithelium-stroma classification via convolutional neural networks and unsupervised domain adaptation in histopathological images. IEEE J Biomed Health Inform 2017;21 (6):162532.


66. Zhang Y, Chen H, Wei Y, Zhao P, Cao J, Fan X, et al. From whole slide imaging to microscopy: deep microscopy adaptation network for histopathology cancer image classification. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 3608. 67. Shi X, Su H, Xing F, Liang Y, Qu G, Yang L. Graph temporal ensembling based semi-supervised convolutional neural network with noisy labels for histopathology image analysis. Med Image Anal 2020;60:101624. 68. Qi Q, Li Y, Wang J, Zheng H, Huang Y, Ding X, et al. Label-efficient breast cancer histopathological image classification. IEEE J Biomed Health Inform 2019;23(5):210816. 69. Lee B, Paeng K. A robust and effective approach towards accurate metastasis detection and pN-stage classification in breast cancer. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 84150. 70. Sun C, Xu A, Liu D, Xiong Z, Zhao F, Ding W. Deep learning-based classification of liver cancer histopathology images using only global labels. IEEE J Biomed Health Inform 2020;24:164351. 71. Hou L, Samaras D, Kurc TM, Gao Y, Davis JE, Saltz JH. Patch-based convolutional neural network for whole slide tissue image classification. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). 2016. p. 242433. 72. Wang S, Zhu Y, Yu L, Chen H, Lin H, Wan X, et al. RMDL: recalibrated multi-instance deep learning for whole slide gastric image classification. Med Image Anal 2019;58:101549. 73. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, inception-ResNet and the impact of residual connections on learning. In: Proceedings of the 31st AAAI conference on artificial intelligence. 2017. p. 427884. 74. Gao Z, Wang L, Zhou L, Zhang J. HEp-2 cell image classification with deep convolutional neural networks. IEEE J Biomed Health Inform 2017;21(2):41628. 75. Phan HTH, Kumar A, Kim J, Feng D. Transfer learning of a convolutional neural network for HEp-2 cell image classification. In: Proceedings of IEEE 13th international symposium on biomedical imaging. 2016. p. 120811. 76. Chatfield K, Simonyan K, Vedaldi A, Zisserman A. Return of the devil in the details: delving deep into convolutional nets. In: Proceedings of the British Machine Vision Conference. 2014. p. 112. 77. Meng N, Lam EY, Tsia KK, So HK. Large-scale multi-class image-based cell classification with deep learning. IEEE J Biomed Health Inform 2019;23(5):20918. 78. Duggal R, Gupta A, Gupta R, Mallick P. SD-layer: stain deconvolutional layer for CNNs in medical microscopic imaging. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2017. p. 43543. 79. Zhang L, Lu Le, Nogues I, Summers RM, Liu S, Yao J. DeepPap: deep convolutional networks for cervical cell classification. IEEE J Biomed Health Inform 2017;21(6):163343. 80. Roth HR, Lu L, Liu J, Yao J, Seff A, Cherry K, et al. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE Trans Med Imaging 2016;35(5):117081. 81. Su H, Shi X, Cai J, Yang L. Local and global consistency regularized mean teacher for semi-supervised nuclei classification. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 55967. 82. Tarvainen A, Valpola H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. 
In: Proceedings of advances in neural information processing systems. 2017. p. 1195204. 83. Bromley J, Guyon I, LeCun Y, Sackinger E, Shah R. Signature verification using a “Siamese” time delay neural network. In: Proceedings of the sixth international conference on neural information processing systems. 1993. p. 73744. 84. Shao W, Sun L, Zhang D. Deep active learning for nucleus classification in pathology images. In: Proceedings of IEEE 15th international symposium on biomedical imaging. 2018. p. 199202. 85. Wu B, Zhao S, Sun G, Zhang X, Su Z, Zeng C, et al. P3SGD: patient privacy preserving SGD for regularizing deep CNNs in pathological image classification. In: Proceedings of IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2019. p. 2094103. 86. Zhao Z, Zheng P, Xu S, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst 2019;30(11):321232. 87. Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, et al. Deep learning for generic object detection: a survey. Int J Comput Vis 2020;128:261318.


88. Zhang X, Yang Y-H, Han Z, Wang H, Gao C. Object class detection: a survey. ACM Comput Surv 2013;46 (1):153. 89. Grauman K, Leibe B. Visual object recognition. Synth Lect Artif Intell Mach Learn 2011;5(2):1181. 90. Ciresan DC, Giusti A, Gambardella LM, Schmidhuber J. Mitosis detection in breast cancer histology images with deep neural networks. In: Proceedings of the 16th international conference on medical image computing and computer-assisted intervention, vol. 8150. 2013. p. 411418. 91. Albarqouni S, Baur C, Achilles F, Belagiannis V, Demirci S, Navab N. AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans Med Imaging 2016;35(5):131321. 92. Zerhouni E, Lanyi D, Viana M, Gabrani M. Wide residual networks for mitosis detection. In: Proceedings of IEEE 14th international symposium on biomedical imaging. 2017. p. 9248. 93. Bekkers EJ, Lafarge MW, Veta M, Eppenhof KAJ, Pluim JPW, Duits R. Roto-translation covariant convolutional networks for medical image analysis. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 440448. 94. Lu Y, Liu A, Chen M, Nie W, Su Y. Sequential saliency guided deep neural network for joint mitosis identification and localization in time-lapse phase contrast microscopy images. IEEE J Biomed Health Inform 2020;24:136778. 95. Shkolyar A, Gefen A, Benayahu D, Greenspan H. Automatic detection of cell divisions (mitosis) in liveimaging microscopy images using convolutional neural networks. In: Proceedings of the 37th annual international conference of the IEEE Engineering in Medicine and Biology Society. 2015. p. 7436. 96. Li C, Wang X, Liu W, Latecki LJ. DeepMitosis: mitosis detection via deep detection, verification and segmentation networks. Med Image Anal 2018;45:12133. 97. Chen H, Dou Q, Wang X, Qin J, Heng PA. Mitosis detection in breast cancer histology images via deep cascaded networks. In: Proceedings of the 30th AAAI conference on artificial intelligence. 2016. p. 11601166. 98. Chen H, Wang X, Heng PA. Automated mitosis detection with deep regression networks. In: Proceedings of IEEE 13th international symposium on biomedical imaging. 2016. p. 12047. 99. Li C, Wang X, Liu W, Latecki LJ, Wang B, Huang J. Weakly supervised mitosis detection in breast histopathology images using concentric loss. Med Image Anal 2019;53:16578. 100. Chen DZ, Huang Z, Liu Y, Xu J. On clustering induced Voronoi diagrams. SIAM J Comput 2017;46(6):1679711. 101. Wang J, MacKenzie JD, Ramachandran R, Chen DZ. Neutrophils identification by deep learning and Voronoi diagram of clusters. In: Proceedings of the 18th international conference on medical image computing and computer-assisted intervention. 2015. p. 22633. 102. Swiderska-Chadaj Z, Pinckaers H, van Rijthoven M, Balkenhol M, Melnikova M, Geessink O, et al. Learning to detect lymphocytes in immunohistochemistry with deep learning. Med Image Anal 2019;58:101547. 103. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2016. p. 77988. 104. Sirinukunwattana K, Raza SEA, Tsang YW, Snead DRJ, Cree IA, Rajpoot NM. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans Med Imaging 2016;35(5):1196206. 105. Mao Y, Yin Z, Schober JM. Iteratively training classifiers for circulating tumor cell detection. 
In: Proceedings of 2015 IEEE 12th international symposium on biomedical imaging. 2015. p. 1904. 106. Xing F, Xie Y, Yang L. An automatic learning-based framework for robust nucleus segmentation. IEEE Trans Med Imaging 2016;35(2):55066. 107. Xing F, Yang L. Fast cell segmentation using scalable sparse manifold learning and affine transformapproximated active contour. In: Proceedings of the 18th international conference on medical image computing and computer-assisted intervention, vol. 9351. 2015. p. 3329. 108. Xing F, Shi X, Zhang Z, Cai J, Xie Y, Yang L. Transfer shape modeling towards high-throughput microscopy image segmentation. In: Proceedings of international conference on medical image computing and computer-assisted intervention. 2016. p. 18390. 109. Veta M, van Diest PJ, Pluim JPW. Cutting out the middleman: measuring nuclear area in histopathology slides without segmentation. In: Proceedings of international conference on medical image computing and computerassisted intervention. 2016. p. 6329. 110. Xu Z, Huang J. Detecting 10,000 cells in one second. In: Medical image computing and computer-assisted intervention: 19th international conference. 2016. p. 67684.


111. Xie Y, Xing F, Kong X, Yang L. Beyond classification: structured regression for robust cell detection using convolutional neural network. In: Proceedings of the 18th international conference on medical image computing and computer-assisted intervention, vol. 9351. 2015. p. 358365. 112. Xie Y, Kong X, Xing F, Liu F, Su H, Yang L. Deep voting: a robust approach toward nucleus localization in microscopy images. In: Proceedings of the 18th international conference on medical image computing and computerassisted intervention, vol. 9351. 2015. p. 37482. 113. Tofighi M, Guo T, Vanamala JKP, Monga V. Prior information guided regularized deep learning for cell nucleus detection. IEEE Trans Med Imaging 2019;38(9):204758. 114. Xu J, Xiang L, Liu Q, Gilmore H, Wu J, Tang J, et al. Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Trans Med Imaging 2016;35(1):11930. 115. Song T, Sanchez V, ElDaly H, Rajpoot NM. Hybrid deep autoencoder with curvature Gaussian for detection of various types of cells in bone marrow trephine biopsy images. In: Proceedings of IEEE 14th international symposium on biomedical imaging. 2017. p. 10403. 116. Xie Y, Xing F, Shi X, Kong X, Su H, Yang L. Efficient and robust cell detection: a structured regression approach. Med Image Anal 2018;44:24554. 117. Xing F, Bennett T, Ghosh D. Adversarial domain adaptation and pseudo-labeling for cross-modality microscopy image quantification. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 7409. 118. Xing F, Xie Y, Shi X, Chen P, Zhang Z, Yang L. Towards pixel-to-pixel deep nucleus detection in microscopy images. BMC Bioinforma 2019;20:472. 119. Xie W., Noble J.A., Zisserman A. Microscopy cell counting with fully convolutional regression networks. In: MICCAI first workshop on deep learning in medical image analysis. 2015. pp. 18. 120. Hagos YB, Narayanan PL, Akarca AU, Marafioti T, Yuan Y. Concorde-net: cell count regularized convolutional neural network for cell detection in multiplex immunohistochemistry images. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 66775. 121. Sadafi A, Koehler N, Makhro A, Bogdanova A, Navab N, Marr C, et al. Multiclass deep active learning for detecting red blood cell subtypes in brightfield microscopy. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 68593. 122. Ren S, He K, Girshick R, Sun J. Faster r-CNN: towards real-time object detection with region proposal networks. In: Proceedings of advances in neural information processing systems. 2015. p. 919. 123. Huang Q, Li W, Zhang B, Li Q, Tao R, Lovell NH. Blood cell classification based on hyperspectral imaging with modulated Gabor and CNN. IEEE J Biomed Health Inform 2020;24(1):16070. 124. Song T, Sanchez V, Daly HEI, Rajpoot NM. Simultaneous cell detection and classification in bone marrow histology images. IEEE J Biomed Health Inform 2019;23(4):146976. 125. Li H, Zhao R, Wang X. Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification. arXiv:1412.4526 2014:110. 126. Wang S, Yao J, Xu Z, Huang J. Subtype cell detection with an accelerated deep convolution neural network. In: Proceedings of international conference on medical image computing and computer-assisted intervention. 2016. p. 6408. 127. Yao J, Wang S, Zhu X, Huang J. 
Imaging biomarker discovery for lung cancer survival prediction. In: Proceedings of international conference on medical image computing and computer-assisted intervention. 2016. p. 64957. 128. TTR Network. The Cancer Genome Atlas. 2020. Available from: ,http://cancergenome.nih.gov/.. 129. Zhou Y, Dou Q, Chen H, Qin J, Heng PA. SFCN-OPI: detection and fine-grained classification of nuclei using sibling FCN with objectness prior interaction. In: Proceedings of the 30th AAAI conference on artificial intelligence. 2018. pp. 26529. 130. Xing F, Cornish TC, Bennett T, Ghosh D, Yang L. Pixel-to-pixel learning with weak supervision for singlestage nucleus recognition in ki67 images. IEEE Trans Biomed Eng 2019;66(11):308897. 131. Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D. Image segmentation using deep learning: a survey. arXiv:2001.05566 [cs.CV] 2020:123. 132. Caicedo JC, Goodman A, Karhohs KW, et al. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nat Methods 2019;16:124753. 133. Meijering E. Cell segmentation: 50 years down the road. IEEE Signal Process Mag 2012;29(5):1405.

III. Clinical applications

References

215

134. Oda H, Roth HR, Chiba K, Sokolic J, Kitasaka T, Oda M, et al. BESNet: boundary-enhanced segmentation of cells in histopathological images. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 22836. 135. Zhao T, Yin Z. Pyramid-based fully convolutional networks for cell segmentation. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 67785. 136. Yoo I, Yoo D, Paeng K. Pseudoedgenet: nuclei segmentation only with point annotations. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 731739. 137. Kirillov A, Girshick R, He K, Dollar P. Panoptic feature pyramid networks. In: Proceedings of IEEE/CVF conference on computer vision and pattern recognition. 2019. p. 6392401. 138. Yi J, Wu P, Jiang M, Huang Q, Hoeppner DJ, Metaxas DN. Attentive neural cell instance segmentation. Med Image Anal 2019;55:22840. 139. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: single shot multibox detector. In: Proceedings of European conference on computer vision. 2016. p. 2137. 140. Zhang M, Li X, Xu M, Li Q. RBC semantic segmentation for sickle cell disease based on deformable U-Net. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 695702. 141. Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, et al. Deformable convolutional networks. In: Proceedings of IEEE international conference on computer vision. 2017. p. 76473. 142. Qu H, Yan Z, Riedlinger GM, De S, Metaxas DN. Improving nuclei/gland instance segmentation in histopathology images by full resolution neural network and spatial constrained loss. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 37886. 143. Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. In: Proceedings of international conference on learning representations. 2016. p. 113. 144. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2017. p. 22619. 145. Mahmood F, Borders D, Chen R, McKay GN, Salimian KJ, Baras A, et al. Deep adversarial training for multi-organ nuclei segmentation in histopathology images. IEEE Trans Med Imaging 2019. 1-1. 146. Kumar N, Verma R, Sharma S, Bhargava S, Vahadane A, Sethi A. A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Trans Med Imaging 2017;36(7):155060. 147. Luna M, Kwon M, Park SH. Precise separation of adjacent nuclei using a Siamese neural network. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. pp. 57785. 148. Song Y, Tan E, Jiang X, Cheng J, Ni D, Chen S, et al. Accurate cervical cell segmentation from overlapping clumps in pap smear images. IEEE Trans Med Imaging 2017;36(1):288300. 149. Naylor P, Lae M, Reyal F, Walter T. Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Trans Med Imaging 2019;38(2):44859. 150. Sirinukunwattana K, et al. Gland segmentation in colon histology images: the GlaS challenge contest. Med Image Anal 2017;35:489502. 151. Bentaieb A, Hamarneh G. Topology aware fully convolutional networks for histology gland segmentation. 
In: Proceedings of international conference on medical image computing and computer-assisted intervention. 2016. p. 4608. 152. Yan Z, Yang X, Cheng KTT. A deep model with shape-preserving loss for gland instance segmentation. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 13846. 153. Yan Z, Yang X, Cheng K. A skeletal similarity metric for quality evaluation of retinal vessel segmentation. IEEE Trans Med Imaging 2018;37(4):104557. 154. Xie S, Tu Z. Holistically-nested edge detection. In: Proceedings of IEEE international conference on computer vision. 2015. p. 1395403. 155. Liu Y, Cheng M, Hu X, Bian J, Zhang L, Bai X, et al. Richer convolutional features for edge detection. IEEE Trans Pattern Anal Mach Intell 2019;41(8):193946. 156. Chen H, Qi X, Yu L, Heng PA. DCAN: deep contour-aware networks for accurate gland segmentation. In: Proceedings of IEEE international conference on computer vision. 2016. p. 248796.

III. Clinical applications

216

11. Artificial intelligence for pathology

157. Chen H, Qi X, Yu L, Dou Q, Qin J, Heng P-A. DCAN: Deep contour-aware networks for object instance segmentation from histology images. Med Image Anal 2017;36:13546. 158. Xu Y, Li Y, Wang Y, Liu M, Fan Y, Lai M, et al. Gland instance segmentation using deep multichannel neural networks. IEEE Trans Biomed Eng 2017;64(12):29012912s. 159. Xu Y, Li Y, Liu M, Wang Y, Lai M, Chang EIC. Gland instance segmentation by deep multichannel side supervision. In: Proceedings of international conference on medical image computing and computer-assisted intervention. 2016. p. 496504. 160. Graham S, Chen H, Gamper J, Dou Q, Heng P-A, Snead D, et al. Mild-net: Minimal information loss dilated network for gland instance segmentation in colon histology images. Med Image Anal 2019;52:199211. 161. Yang L, Zhang Y, Chen J, Zhang S, Chen DZ. Suggestive annotation: a deep active learning framework for biomedical image segmentation. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2017. p. 399407. 162. Xu X, Lu Q, Yang L, Hu S, Chen D, Hu Y, et al. Quantization of fully convolutional networks for accurate biomedical image segmentation. In: Proceedings of IEEE/CVF conference on computer vision and pattern recognition. 2018. p. 83008. 163. Zhang Y, Yang L, Chen J, Fredericksen M, Hughes DP, Chen DZ. Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2017. p. 40816. 164. Van Eycke Y-R, Balsat C, Verset L, Debeir O, Salmon I, Decaestecker C. Segmentation of glandular epithelium in colorectal tumours to automatically compartmentalise IHC biomarker quantification: a deep learning approach. Med Image Anal 2018;49:3545. 165. Cardona A, Saalfeld S, Preibisch S, Schmid B, Cheng A, Pulokas J, et al. An integrated micro- and macroarchitectural analysis of the drosophila brain by computer-assisted serial section electron microscopy. PLoS Biol 2010;8:117. 166. Ciresan D, Giusti A, Gambardella LM, Schmidhuber J. Deep neural networks segment neuronal membranes in electron microscopy images. In: Proceedings of advances in neural information processing systems. 2012. p. 284351. 167. Chen H, Qi X, Cheng J, Heng PA. Deep contextual networks for neuronal structure segmentation. In: Proceedings of the 30th AAAI conference on artificial intelligence. 2016. p. 116773. 168. Gu Z, Cheng J, Fu H, Zhou K, Hao H, Zhao Y, et al. CE-NET: context encoder network for 2D medical image segmentation. IEEE Trans Med Imaging 2019;38(10):228192. 169. Chen L, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 2018;40(4):83448. 170. Cerrone L, Zeilmann A, Hamprecht FA. End-to-end learned random walker for seeded image segmentation. In: Proceedings of IEEE/CVF conference on computer vision and pattern recognition. 2019. p. 1255112560. 171. Shen W, Wang B, Jiang Y, Wang Y, Yuille A. Multi-stage multirecursive-input fully convolutional networks for neuronal boundary detection. In: Proceedings of IEEE international conference on computer vision. 2017. p. 24109. 172. Haehn D, Kaynig V, Tompkin J, Lichtman JW, Pfister H. Guided proofreading of automatic segmentations for connectomics. In: Proceedings of IEEE/CVF conference on computer vision and pattern recognition. 2018. p. 931928. 173. 
Jia Z, Huang X, Chang EI, Xu Y. Constrained deep weak supervision for histopathology image segmentation. IEEE Trans Med Imaging 2017;36(11):237688. 174. Xu G, Song Z, Sun Z, Ku C, Yang Z, Liu C, et al. Camel: a weakly supervised learning framework for histopathology image segmentation. In: 2019 IEEE/CVF international conference on computer vision (ICCV). 2019. p. 1068190. 175. Liang Q, Nan Y, Coppola G, Zou K, Sun W, Zhang D, et al. Weakly supervised biomedical image segmentation by reiterative learning. IEEE J Biomed Health Inform 2019;23(3):120514. 176. Qaiser T, Tsang Y-W, Taniyama D, Sakamoto N, Nakane K, Epstein D, et al. Fast and accurate tumor segmentation of histology images using persistent homology and deep convolutional features. Med Image Anal 2019;55:114. 177. Gupta L, Klinkhammer BM, Boor P, Merhof D, Gadermayr M. GAN-based image enrichment in digital pathology boosts segmentation accuracy. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 6319.

III. Clinical applications

References

217

178. Wang J, MacKenzie JD, Ramachandran R, Chen DZ. A deep learning approach for semantic segmentation in histology tissue images. In: Proceedings of international conference on medical image computing and computerassisted intervention. 2016. p. 17684. 179. Koutnik J, Greff K, Gomez F, Schmidhuber J. A clockwork RNN. In: Proceedings of the 31st international conference on machine learning, vol. 32. 2014. p. 186371. 180. Xie Y, Zhang Z, Sapkota M, Yang L. Spatial clockwork recurrent neural network for muscle perimysium segmentation. In: Proceedings of international conference on medical image computing and computer-assisted intervention, vol. 9901. 2016. pp. 18593. 181. Chan L, Hosseini M, Rowsell C, Plataniotis K, Damaskinos S. HistoSegNet: semantic segmentation of histological tissue type in whole slide images. In: Proceedings of IEEE/CVF international conference on computer vision. 2019. p. 1066170. 182. Hosseini MS, Chan L, Tse G, Tang M, Deng J, Norouzi S, et al. Atlas of digital pathology: a generalized hierarchical histological tissue type-annotated database for deep learning. In: Proceedings of IEEE/CVF conference on computer vision and pattern recognition. 2019. p. 1173948. 183. Ciompi F, Geessink O, Bejnordi BE, de Souza GS, Baidoshvili A, Litjens G, et al. The importance of stain normalization in colorectal tissue classification with convolutional networks. In: Proceedings of IEEE 14th international symposium on biomedical imaging. 2017. p. 1603. 184. Khan AM, Rajpoot N, Treanor D, Magee D. A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution. IEEE Trans Biomed Eng 2014;61 (6):172938. 185. Ruifrok AC, Johnston DA. Quantification of histochemical staining by color deconvolution. Anal Quant Cytol Histol 2001;23(4):2919. 186. Reinhard E, Adhikhmin M, Gooch B, Shirley P. Color transfer between images. IEEE Comput Graph Appl 2001;21(5):3441. 187. Zanjani FG, Zinger S, de With PHN. Deep convolutional Gaussian mixture model for stain-color normalization of histopathological images. In: Proceedings of medical image computing and computer assisted intervention. 2018. p. 27482. 188. Tellez D, Litjens G, Bandi P, Bulten W, Bokhorst J-M, Ciompi F, et al. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med Image Anal 2019;58:101544. 189. Bentaieb A, Hamarneh G. Adversarial stain transfer for histopathology image analysis. IEEE Trans Med Imaging 2018;37(3):792802. 190. Zanjani FG, Zinger S, Bejnordi BE, van der Laak JAWM, de With PHN. Stain normalization of histopathology images using generative adversarial networks. In: Proceedings of IEEE 15th international symposium on biomedical imaging. 2018. p. 5737. 191. Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In: Proceedings of the 30th international conference on neural information processing systems. 2016. p. 21808. 192. Lahiani A, Navab N, Albarqouni S, Klaiman E. Perceptual embedding consistency for seamless reconstruction of tilewise style transfer. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 56876. 193. Zhou N., Cai D, Han X, Yao J. Enhanced cycle-consistent generative adversarial network for color normalization of H&E stained images. 
In: Proceedings of international conference on medical image computing and computer assisted intervention. 2019. p. 694702. 194. Shaban MT, Baur C, Navab N, Albarqouni S. StainGAN: stain style transfer for digital histological images. In: Proceedings of IEEE 16th international symposium on biomedical imaging. 2019. p. 9536. 195. Dong C, Loy CC, He K, Tang X. Image super-resolution using deep convolutional networks. IEEE Trans Pattern Anal Mach Intell 2016;38(2):295307. 196. Shi W, Caballero J, Huszar F, Totz J, Aitken AP, Bishop R, et al. Real-time single image and video superresolution using an efficient sub-pixel convolutional neural network. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2016. p. 187483. 197. Kim J, Lee JK, Lee KM. Accurate image super-resolution using very deep convolutional networks. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2016. p. 164654.

III. Clinical applications

218

11. Artificial intelligence for pathology

198. Ledig C, Theis L, Huszaar F, Caballero J, Cunningham A, Acosta A, et al. Photo-realistic single image superresolution using a generative adversarial network. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2017. p. 10514. 199. Tong T, Li G, Liu X, Gao Q. Image super-resolution using dense skip connections. In: Proceedings of IEEE international conference on computer vision. 2017. p. 480917. 200. Izadi S, Moriarty KP, Hamarneh G. Can deep learning relax endomicroscopy hardware miniaturization requirements?. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 5764. 201. Suveer A, Gupta A, Kylberg G, Sintorn I. Super-resolution reconstruction of transmission electron microscopy images using deep learning. In: Proceedings of IEEE 16th international symposium on biomedical imaging. 2019. p. 54851. 202. Han L, Yin Z. A cascaded refinement GAN for phase contrast microscopy image super resolution. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 34755. 203. Upadhyay U, Awate SP. Robust super-resolution GAN, with manifold-based and perception loss. In: Proceedings of IEEE 16th international symposium on biomedical imaging. 2019. p. 13726. 204. Shiraishi J, Li Q, Appelbaum D, Doi K. Computer-aided diagnosis and artificial intelligence in clinical imaging. Semin Nucl Med 2011;41(6):44962. 205. Bengtsson E, Danielsen H, Treanor D, Gurcan MN, MacAulay C, Molnar B. Computer-aided diagnostics in digital pathology. Cytometry, A 2017;91(6):5514. 206. Langer L, Binenbaum Y, Gugel L, Amit M, Gil Z, Dekel S. Computer-aided diagnostics in digital pathology: automated evaluation of early-phase pancreatic cancer in mice. Int J Comput Assist Radiol Surg 2015;10:104354. 207. Sapkota M, Liu F, Xie Y, Su H, Xing F, Yang L. AIIMDs: an integrated framework of automatic idiopathic inflammatory myopathy diagnosis for muscle. IEEE J Biomed Health Inform 2018;22(3):94254. 208. Ma Y, Xu T, Huang X, Wang X, Li C, Jerwick J, et al. Computer-aided diagnosis of label-free 3-D optical coherence microscopy images of human cervical tissue. IEEE Trans Biomed Eng 2019;66 (9):244756. 209. Zhang Z, Xie Y, Xing F, McGough M, Yang L. MDNet: a semantically and visually interpretable medical image diagnosis network. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2017. p. 354957. 210. Zhang Z, Chen P, McGough M, et al. Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nat Mach Intell 2019;1:23645. 211. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2016. p. 28182826. 212. Krause J, Johnson J, Krishna R, Fei-Fei L. A hierarchical approach for generating descriptive image paragraphs. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2017. p. 33373345. 213. Zhang S, Metaxas D. Large-scale medical image analytics: recent methodologies, applications and future directions. Med Image Anal 2016;33:98101. 214. Yang L, Qi X, Xing F, Kurc T, Saltz J, Foran DJ. Parallel content-based sub-image retrieval using hierarchical searching. Bioinformatics 2013;30(7):9961002. 215. Shi X, Xing F, Xu K, Xie Y, Su H, Yang L. Supervised graph hashing for histopathology image retrieval and classification. Med Image Anal 2017;42:11728. 216. 
Zhang X, Xing F, Su H, Yang L, Zhang S. High-throughput histopathological image analysis via robust cell segmentation and hashing. Med Image Anal 2015;26(1):30615. 217. Zhou W, Li H, Tian Q. Recent advance in content-based image retrieval: a literature survey. arXiv:1706.06064 [cs.MM] 2017:122. 218. Shi X, Sapkota M, Xing F, Liu F, Cui L, Yang L. Pairwise based deep ranking hashing for histopathology image classification and retrieval. Pattern Recognit 2018;81:1422. 219. Sapkota M, Shi X, Xing F, Yang L. Deep convolutional hashing for low-dimensional binary embedding of histopathological images. IEEE J Biomed Health Inform 2019;23(2):80516. 220. Zhu X, Yao J, Zhu F, Huang J. WSISA: Making survival prediction from whole slide histopathological images. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2017. p. 685563.

III. Clinical applications

References

219

221. Zhu X, Yao J, Huang J. Deep convolutional neural network for survival analysis with pathological images. In: Proceedings of IEEE international conference on bioinformatics and biomedicine. 2016. p. 5447. 222. Li R, Yao J, Zhu X, Li Y, Huang J. Graph CNN for survival analysis on whole slide pathological images. In: Proceedings of international conference on medical image computing and computer assisted intervention. 2018. p. 17482. 223. Yao J, Zhu X, Zhu F, Huang J. Deep correlational learning for survival prediction from multi-modality data. In: Proceedings of international conference on medical image computing and computer-assisted intervention. 2017. p. 40614. 224. Kothari S, Phan JH, Osunkoya AO, Wang MD. Biological interpretation of morphological patterns in histopathological whole-slide images. In: Proceedings of the ACM conference on bioinformatics, computational biology and biomedicine. 2012. p. 21825. 225. Wu HS, Murray J, Morgello S, Fiel MI, Schiano T, Kalir T, et al. Restoration of distorted colour microscopic images from transverse chromatic aberration of imperfect lenses. J Microsc 2011;241(2):125131. 226. Dastidar TR, Ethirajan R. Whole slide imaging system using deep learning-based automated focusing. Biomed Opt Express 2019;11(1). 227. Jiang S, Liao J, Bian Z, Guo K, Zhang Y, Zheng G. Transform- and multi-domain deep learning for singleframe rapid autofocusing in whole slide imaging. Biomed Opt Express 2018;9(4):160112. 228. Ali S, Alham NK, Verrill C, Rittscher J. Ink removal from histopathology whole slide images by combining classification, detection and image generation models. In: Proceedings of IEEE 16th international symposium on biomedical imagings. 2019. p. 92832. 229. Meijering E, Carpenter AE, Peng H, Hamprecht FA, Olivo-Marin J-C. Imagining the future of bioimage analysis. Nat Biotechnol 2016;34(12):12505. 230. Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks?. In: Proceedings of advances in neural information processing systems. 2014. p. 33208. 231. Tzeng E, Hoffman J, Saenko K, Darrell T. Adversarial discriminative domain adaptation. In: Proceedings of IEEE conference on computer vision and pattern recognition. 2017. p. 296271. 232. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009;21(9):126384. 233. He H, Ma Y. Imbalanced learning: foundations, algorithms, and applications. 1st ed. Wiley-IEEE Press; 2013. 234. Lin M, Tang K, Yao X. Dynamic sampling approach to training neural networks for multiclass imbalance classification. IEEE Trans Neural Netw Learn Syst 2013;24(4):64760. 235. Paech DC, Weston AR, Pavlakis N, Gill A, Rajan N, Barraclough H, et al. A systematic review of the interobserver variability for histology in the differentiation between squamous and non-squamous non-small cell lung cancer. J Thorac Oncol 2011;6(1):5563. 236. Ward AD, Crukley C, McKenzie CA, Montreuil J, Gibson E, Romagnoli C, et al. Prostate: registration of digital histopathologic images to in vivo MR images acquired by using endorectal receive coil. Radiology 2012;263(3):85664. 237. Savage RS, Yuan Y. Predicting chemoinsensitivity in breast cancer with ’omics/digital pathology data fusion. R Soc Open Sci 2016;3(2):113. 238. Shin H-C, Lu L, Kim L, Seff A, Yao J, Summers RM. Interleaved text/image deep mining on a large-scale radiology database for automated image interpretation. J Mach Learn Res 2016;17(107):131. 239. Allen TC. Regulating artificial intelligence for a successful pathology future. 
Arch Pathol Lab Med 2019;143 (10):11759. 240. Tizhoosh H, Pantanowitz L. Artificial intelligence and digital pathology: challenges and opportunities. J Pathol Inform 2018;9(1):38. 241. Tercatin R. Israeli start-up Ibex helps detect cancer using AI. Jerusalem Post; March 2020. 242. Bejnordi BE, Veta M, Johannes van Diest P, van Ginneken B, Karssemeijer N, Litjens G, et al., and the CAMELYON16 Consortium, Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer, JAMA 2017;318(22):21992210. 243. Lin H, Chen H, Graham S, Dou Q, Rajpoot N, Heng P-A. Fast scanNet: Fast and dense analysis of multigigapixel whole-slide images for cancer metastasis detection. IEEE Trans Med Imaging 2019;38(8):194858. 244. Pham HHN, Futakuchi M, Bychkov A, Furukawa T, Kuroda K, Fukuoka J. Detection of lung cancer lymph node metastases from whole-slide histopathologic images using a two-step deep learning approach. Am J Pathol 2019;189(12):242839.

III. Clinical applications

220

11. Artificial intelligence for pathology

245. Liu Y, Kohlberger T, Norouzi M, Dahl GE, Smith JL, Mohtashamian A, et al. Artificial intelligence-based breast cancer nodal metastasis detection: Insights into the black box for pathologists. Arch Pathol Lab Med 2018;143(7):85968. 246. Zakrzewski F, et al. Automated detection of the HER2 gene amplification status in Fluorescence in situ hybridization images for the diagnostics of cancer tissues. Sci Rep 2019;9(1):112. 247. Hofener H, Homeyer A, Forster M, Drieschner N, Schildhaus H-U, Hahn HK. Automated density-based counting of FISH amplification signals for HER2 status assessment. Comput Methods Prog Biomed 2019;173:7785. 248. Khameneh FD, Razavi S, Kamasak M. Automated segmentation of cell membranes to evaluate HER2 status in whole slide images using a modified deep learning network. Comput Biol Med 2019;110:16474. 249. Vandenberghe ME, Scott MLJ, Scorer PW, Soderberg M, Balcerzak D, Barker C. Relevance of deep learning to facilitate the diagnosis of HER2 status in breast cancer. Sci Rep 2017;7(1):111. 250. Jamaluddin MF, Fauzi MFA, Abas FS, Lee JTH, Khor SY, Teoh KH, et al. Cell classification in ER-stained whole slide breast cancer images using convolutional neural network. In: Proceedings of the 40th annual international conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 2018. p. 6325. 251. Shamai G, Binenbaum Y, Slossberg R, Duek I, Gil Z, Kimmel R. Artificial intelligence algorithms to assess hormonal status from tissue microarrays in patients with breast cancer. JAMA Netw Open 2019;2(7):e197700. 252. Amgad M, Sarkar A, Srinivas C, Redman R, Ratra S, Bechert CJ, et al. Joint region and nucleus segmentation for characterization of tumor infiltrating lymphocytes in breast cancer. In: Proceedings of SPIE—the international society for optical engineering, vol. 10956. 2019. 253. Shaban M, Khurram SA, Fraz MM, Alsubaie N, Masood I, Mushtaq S, et al. A novel digital score for abundance of tumour infiltrating lymphocytes predicts disease free survival in oral squamous cell carcinoma. Sci Rep 2019;9(1):113. 254. Saltz J, et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Rep 2018;23(1):181193.e7. 255. Cornish TC, Swapp RE, Kaplan KJ. Whole-slide imaging: routine pathologic diagnosis. Adv Anat Pathol 2012;19(3):152159. 256. Zarella MD, Bowman D, Aeffner F, Farahani N, Xthona A, Absar SF, et al. A practical guide to whole slide imaging: a white paper from the digital pathology association. Arch Pathol Lab Med 2019;143(2):22234. 257. Hulsken DB. Seamless computational pathology. White Pap 2018;14. 258. Food and Drug Administration. 510(k) Substantial equivalence determination decision summary: K172174. 2017. 259. Food and Drug Administration. 510(k) Substantial equivalence determination decision summary: K190332. 2019. 260. Garcia-Rojo M, Mena DD, Muriel-Cueto P, Atienza-Cuevas L, Dominguez-Gomez M, Bueno G. New European Union regulations related to whole slide image scanners and image analysis software. J Pathol Inform 2019;10:2. 261. Isaacs M, Lennerz J, Yates S, Clermont W, Rossi J, Pfeifer J. Implementation of whole slide imaging in surgical pathology: a value added approach. J Pathol Inform 2011;2(1):39. 262. Ho J, Kuzmishin J, Montalto M, Pantanowitz L, Parwani A, Stratman C, et al. Can digital pathology result in cost savings? A financial projection for digital pathology implementation at a large integrated health care organization. J Pathol Inform 2014;5(1):33. 263. 
Baidoshvili A, Bucur A, Leeuwen J v, Laak J v d, Kluin P, Diest P J v. Evaluating the benefits of digital pathology implementation: time savings in laboratory logistics. Histopathology 2018;73(5):78494. 264. Hanna MG, Reuter VE, Samboy J, England C, Corsale L, Fine SW, et al. Implementation of digital pathology offers clinical and operational increase in efficiency and cost savings. Arch Pathol Lab Med 2019;143 (12):154555. 265. DICOM Standards Committee, Working Groups 26, Pathology. Digital imaging and communications in medicine (DICOM) supplement 145: Whole slide microscopic image IOD and SOP classes. 2010. 266. Clunie D, et al. Digital imaging and communications in medicine whole slide imaging connectathon at Digital Pathology Association Pathology Visions 2017. J Pathol Inform 2018;9(1):6. 267. Herrmann M, et al. Implementing the DICOM standard for digital pathology. J Pathol Inform 2018;9(1):37. 268. DICOM WG-26. DICOM WG-26 Pathology WSI Annotations Ah-Hoc Group. November 26, 2019. 269. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25 (1):4456.

III. Clinical applications

References

221

270. 21st Century Cures Act. 2016. Available from: ,https://www.congress.gov/114/plaws/publ255/PLAW114publ255.pdf.. 271. Food and Drug Administration. Clinical decision support software. Draft Guidance for Industry and Food and Drug Administration Staff; 2019. 272. Food and Drug Administration. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD)  discussion paper and request for feedback. 2019. 273. Harvey HB, Gowda V. How the FDA regulates AI. Acad Radiol 2020;27(1):5861. 274. McClintock DS, Lee RE, Gilbertson JR. Using computerized workflow simulations to assess the feasibility of whole slide imaging full adoption in a high-volume histology laboratory. Anal Cell Pathol 2012;35(1):5764. 275. London AJ. Artificial intelligence and black-box medical decisions: accuracy versus explainability. Hastings Cent Rep 2019;49(1):1521. 276. Holzinger A, Langs G, Denk H, Zatloukal K, Muller H. Causability and explainability of artificial intelligence in medicine. WIREs Data Min Knowl Discov 2019;9(4). 277. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?”: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. p. 113544. 278. Sarwar S, Dent A, Faust K, Richer M, Djuric U, Van Ommeren R, et al. Physician perspectives on integration of artificial intelligence into diagnostic pathology. Digital Med 2019;2(1):28. 279. Hinton G. What’s next? The research frontier. In: The conference on machine learning and the market for intelligence. Toronto, ON; 2016.

III. Clinical applications

CHAPTER 12

The potential of deep learning for gastrointestinal endoscopy—a disruptive new technology

Robin Zachariah, Christopher Rombaoa, Jason Samarasena, Duminda Suraweera, Kimberly Wong and William Karnes

Abstract

Applications of deep learning have the potential to revolutionize gastrointestinal endoscopy. When trained by experts and capable of real-time feedback, deep learning can be applied to improve disease detection; to assist interventions; and to document procedure findings, interventions, and quality measures. In colonoscopy, deep learning is already showing promise for polyp detection, polyp characterization, documentation of complete exams, calculating withdrawal times, quantifying preparation quality, and identifying tools used for intervention. Similarly, in video capsule endoscopy, deep learning is showing great potential to reduce miss rates, time-to-find, and reading times. The technical challenges of live implementation are rapidly disappearing with inexpensive high-performance graphics-processing units. We can soon expect to enjoy an “expert in the room” helping all endoscopists perform at high levels, as well as an “expert personal scribe” allowing clinicians to replace documentation time with more face time with patients.

Keywords: Deep machine learning; colorectal cancer screening; capsule endoscopy; convolutional neural networks; Barrett’s esophagus; esophageal cancer; dysplasia screening

12.1 Introduction

Artificial intelligence (AI) allows computers to simulate human cognitive reasoning. Machine learning is an AI method that enables computers to make successful predictions by learning from training datasets.1,2 Recent advances in computer performance and AI technology have led to the development of deep learning, which utilizes convolutional neural networks (CNNs) that are modeled after the visual cortex of animals and are very effective at image recognition.3 Researchers have shown that CNNs are well-suited to automatically identify and characterize features within images and videos when trained
by real-life experts. More recently, this type of machine learning has found very useful applications in clinical medicine, especially in the field of gastroenterology. The emergence of real-time computer-aided detection (CADe) techniques has proven to be a powerful and promising tool to assist endoscopists in disease detection and subsequent intervention as well as clinical documentation of findings, interventions, and quality measures. The potential for AI to make a significant clinical impact in gastroenterology is vast when one considers the varying degrees of endoscopic expertise, clinical time constraints, and the challenges related to standardization of endoscopic evaluation. For example, researchers have shown that AI used during upper endoscopy can effectively recognize anatomic locations and provide real-time feedback to improve exam completeness.4 Moreover, AI can accurately detect pathology such as Barrett’s esophagus (BE), esophageal cancer, and gastric cancer.5–7 In colonoscopy, deep learning has been shown to be effective in polyp recognition, polyp characterization, calculation of withdrawal times, measuring preparation quality, and identifying tools used for intervention. As will be discussed in this chapter, the same principles of anatomic recognition and disease detection can also be applied to video capsule endoscopy. Ultimately, AI can incorporate multiple simultaneous deep learning algorithms to create automated endoscopy reports that could include standardized quality measures. As computer performance and deep learning algorithms evolve to meet the technical challenges of live implementation, AI appears more poised to revolutionize clinical medicine. This chapter will examine the implications of AI use within the field of gastrointestinal (GI) endoscopy and its potential to improve patient care and reduce physician burnout through augmented diagnostic accuracy as well as automated data collection and documentation.
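To make the frame-level classification idea concrete, the minimal PyTorch sketch below stacks two convolutional stages and a small classification head over an RGB video frame. The layer sizes, 224-pixel input resolution, and two-class output are illustrative assumptions and do not correspond to any system cited in this chapter.

```python
# Minimal sketch of a CNN frame classifier (illustrative only; hypothetical
# layer sizes and class labels, not the architecture of any cited system).
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Two convolution/pooling stages extract visual features from a frame.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 112 -> 56
        )
        # A small fully connected head maps features to class scores
        # (e.g., "normal" vs "abnormal" frame).
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

if __name__ == "__main__":
    model = FrameClassifier(num_classes=2)
    frames = torch.randn(4, 3, 224, 224)    # a batch of 4 RGB frames
    probs = torch.softmax(model(frames), dim=1)   # per-frame class probabilities
    print(probs.shape)                       # torch.Size([4, 2])
```

Production systems use far deeper backbones trained on large expert-labeled datasets, but the input/output contract is the same: a frame in, class probabilities out.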

12.2 Applications of artificial intelligence in video capsule endoscopy

12.2.1 Introduction

Wireless video capsule endoscopy (WCE) was first approved by the FDA for the evaluation of the small bowel in 2001.8 The WCE system comprises three elements—a capsule endoscope, a capsule sensor setup (in either the form of a sensing belt or sensing pads), and a computer workstation with proprietary software. There are currently five major commercially available systems—PillCam (Medtronic, Rapid software), EndoCapsule (Olympus, EndoCapsule EC-10 software), MiroCam (IntroMedic, MiroView 4.0 software), CapsoCam (CapsoVision, CapsoView software), and OMOM (Chongqing Jinshan Science & Technology, VUE software).8,9 The capsule can either be swallowed or deployed endoscopically, and it is propelled through the GI tract via peristalsis. The PillCam is one of the most widely used systems.10 Small bowel capsule endoscopy (SBCE) has allowed for a relatively noninvasive and well-tolerated method of small bowel evaluation to investigate GI bleeding, iron deficiency anemia, Crohn’s disease, hereditary polyposis syndromes, and small bowel tumors.11,12 SBCE has a higher diagnostic yield for clinically significant findings compared to alternative small bowel imaging or procedural modalities—42% for SBCE versus 5% for small bowel follow-through, 62% for SBCE
versus 56% for double-balloon enteroscopy (DBE), and 78% for SBCE versus 22% for CT enteroclysis.13,14 Although SBCE in many ways represents an improved modality of small bowel evaluation, there remain areas in need of improvement. These include decreasing the read time for study interpretation, automating the detection of anatomical landmarks, and improving the software’s sensitivity for detecting pathology. Because Rapid Reader (the software for PillCam) is the most widely used system, we will mainly be referring to its design and features in this discussion.

12.2.2 Decreasing read time

Most studies cite an average read time for a single SBCE study of 30–120 minutes per reader, with an overall average of 50–60 minutes.15–17 Given the poor reimbursement for reads and the poor detection rates they achieve, better technologies are needed to reduce read times, improve detection rates, and reduce miss rates. The built-in Quick-View mode was intended to help reduce read time. This algorithm selects the 10% most relevant images to create a quick preview of the SBCE study. In one study of 106 capsules, the Quick-View read and the initial read (defined as an unassisted read) were not statistically significantly different with regard to sensitivity, specificity, positive predictive value, and negative predictive value (NPV). The reported mean time to review the Quick-View images was 11.6 minutes, with a sensitivity of 89% and a specificity of 85%.17 CNN systems are being utilized to further reduce SBCE reading time and enhance sensitivity. A group from the University of Tokyo compared unassisted reads to reads of only the images selected by a CNN. Mean CNN-assisted read times were statistically shorter, ranging from 3.1 minutes for experts to 5.2 minutes for trainees, and the detection rate of CNN-based reads did not significantly decrease. However, lesions were identified by experts but not by trainees in both the unassisted and CNN-assisted reads.18
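One way such a system shortens reads is by scoring every frame and surfacing only the most suspicious ones for human review. The sketch below is a minimal version of that triage step, assuming per-frame abnormality probabilities already exist; the threshold and 10% cap are invented for illustration and are not the Quick-View or any vendor algorithm.

```python
# Illustrative frame-triage sketch: shortlist high-scoring frames so a reader
# reviews minutes of material instead of hours. The threshold and the
# per-frame scores are hypothetical; this is not a vendor algorithm.
import numpy as np

def select_frames_for_review(scores: np.ndarray, threshold: float = 0.5,
                             max_fraction: float = 0.10) -> np.ndarray:
    """Return indices of frames to show the human reader.

    scores       : per-frame abnormality probabilities in [0, 1]
    threshold    : minimum score for a frame to be flagged
    max_fraction : cap on the fraction of frames shown (e.g., 10%)
    """
    flagged = np.where(scores >= threshold)[0]
    cap = max(1, int(len(scores) * max_fraction))
    if len(flagged) > cap:
        # Keep only the highest-scoring frames if too many are flagged.
        order = np.argsort(scores[flagged])[::-1]
        flagged = np.sort(flagged[order[:cap]])
    return flagged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random(50_000)              # ~50,000 frames in a capsule study
    review = select_frames_for_review(scores, threshold=0.98)
    print(f"{len(review)} of {len(scores)} frames flagged for review")
```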

12.2.3 Anatomical landmark identification

SBCE allows for the identification of pathology but cannot offer any intervention. Once a lesion is found by SBCE, intervention requires deep enteroscopy in the form of single-balloon enteroscopy or DBE to identify and treat the lesion. The location of the lesion is key to determining the most efficient technique and approach. Capsule localization currently relies on two established tools to estimate location—one is triangulation between the capsule and the receiving sensors, and the second is location relative to the user-defined landmarks of the duodenum and ileum.19 The research group at the University of California Irvine has developed a CNN that can accurately identify anatomical landmarks from SBCE images.20 Their AI was trained on a database of 53,843 unique frames in which images were labeled by anatomic location (esophagus, stomach, small bowel, or colon). During validation testing on 14 new capsules, the AI performed with high sensitivity and specificity. It identified the esophagus with 93.6% sensitivity and 99.8% specificity, the stomach with 94.2% sensitivity and 99.8% specificity, the small bowel with 99.6% sensitivity and 98.7% specificity, and
the colon with 80.7% sensitivity and 96.9% specificity.21 Incorporating landmark recognition into CNNs for SBCE reads will add a level of detail that helps localize pathology and guide therapeutic interventions.
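A minimal sketch of how per-frame organ predictions can be turned into landmark estimates is shown below; the majority-vote smoothing and the rule that the first sustained run of a label marks the organ transition are illustrative assumptions, not the published method.

```python
# Sketch: derive anatomical landmarks from per-frame organ predictions by
# finding the first sustained run of each label. The label set, smoothing
# window, and run length are illustrative assumptions.
from collections import Counter
from typing import List, Optional

def smooth_labels(labels: List[str], window: int = 5) -> List[str]:
    """Majority-vote smoothing to suppress isolated misclassified frames."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighborhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed

def first_sustained_frame(labels: List[str], organ: str, run: int = 10) -> Optional[int]:
    """Index of the first frame starting `run` consecutive frames of `organ`."""
    count = 0
    for i, lab in enumerate(labels):
        count = count + 1 if lab == organ else 0
        if count == run:
            return i - run + 1
    return None

if __name__ == "__main__":
    labels = (["esophagus"] * 30 + ["stomach"] * 300 +
              ["small_bowel"] * 2000 + ["colon"] * 500)
    labels[500] = "colon"                      # a single noisy frame
    labels = smooth_labels(labels)
    print("first small-bowel frame:", first_sustained_frame(labels, "small_bowel"))
    print("first colon frame:", first_sustained_frame(labels, "colon"))
```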

12.2.4 Improving sensitivity

Although SBCE is associated with higher diagnostic yields than many other small bowel evaluation modalities, the overall yield remains modest. One systematic review of SBCE studies from 2000 to 2008 identified a pooled detection rate of 59.4% across all diagnostic indications.22 In another systematic review restricted to iron deficiency anemia evaluations, the pooled diagnostic yield was 47%.23 For cases of either definite or possible GI bleeding, many systems offer a suspected blood indicator (SBI) algorithm. The SBI function works by identifying which images contain red-colored pixels and denoting them with a red label.24,25 However, multiple studies have demonstrated that the SBI lacks sensitivity and specificity.24,26 For potentially bleeding lesions the estimated sensitivity of the SBI ranges from only 26% to 55%, and the specificity is estimated to be 58%. This contrasts with cases of active GI bleeding, where sensitivity is much higher, ranging from 81% to 99%, but specificity is lower at 65%.25,27 A team of researchers at the University of Tokyo compared the performance of their CNN with the SBI algorithm with respect to detecting blood and found that their CNN had statistically higher sensitivity, specificity, and accuracy. There were seven discordant images that the CNN labeled as negative; the SBI correctly labeled four of these seven as containing blood.28 CNNs are also being developed to identify areas of Crohn’s disease, characterized by small intestinal ulceration. A research group based at Tel Aviv University and Sheba Medical Center created a CNN that could recognize SBCE images of ulcerations with an accuracy of more than 95%.29
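For contrast with learned detectors, the sketch below implements a naive red-pixel heuristic in the spirit of a suspected blood indicator; the color thresholds and the 2% pixel cutoff are invented for illustration, and this is not the vendor SBI algorithm.

```python
# Naive "suspected blood" heuristic: flag a frame when enough pixels are
# strongly red-dominant. Thresholds are invented for illustration; this is
# not the vendor SBI algorithm, and learned CNN detectors outperform it.
import numpy as np

def red_pixel_fraction(frame_rgb: np.ndarray) -> float:
    """frame_rgb: H x W x 3 uint8 array. Returns fraction of red-dominant pixels."""
    rgb = frame_rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # "Red-dominant": red channel clearly exceeds both green and blue.
    red_mask = (r > 120) & (r > 1.5 * g) & (r > 1.5 * b)
    return float(red_mask.mean())

def flag_suspected_blood(frame_rgb: np.ndarray, min_fraction: float = 0.02) -> bool:
    return red_pixel_fraction(frame_rgb) >= min_fraction

if __name__ == "__main__":
    frame = np.zeros((256, 256, 3), dtype=np.uint8)
    frame[..., 0] = 40                     # mostly dark, non-red frame
    frame[100:120, 100:160, 0] = 200       # a small bright-red patch
    print(red_pixel_fraction(frame), flag_suspected_blood(frame))
```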

12.2.5 Recent developments

Two groups have made significant progress toward solving the key problems with capsule endoscopy—prolonged reading time, poor sensitivity, and high miss rates. In the United States, Lui et al. trained a CNN to recognize abnormalities of any type utilizing 502 capsule videos (PillCam) and validated it on 14 new videos, reporting a sensitivity of 99.8% and specificity of 98.6% with a false negative rate of 0.8%.21 Representative examples of lesions found by the CNN but missed by the original reader are shown in Fig. 12.1. In China, Ding et al. trained a CNN on normal and abnormal images from 1970 capsule studies (Ankon) and then validated the CNN on 5000 cases. The AI achieved a superior sensitivity of 99.88% versus 74.57% for a human read alone, and read time was 5.9 minutes with the CNN versus 96.6 minutes for the human read alone.30 Even at this early stage of research, AI applied to capsule endoscopy has demonstrated a significant improvement in diagnostic yield in significantly less time. These promising results suggest that deep learning in the realm of capsule endoscopy will have clinically relevant outcomes for patient care.

FIGURE 12.1 Examples of relevant abnormalities and their predicted locations found by capsule AI only (missed by original human reader). Top row: polyps in both Col and SB. Middle row: AVMs found in SB and Col. Bottom row: bulges/masses in SB. AI, Artificial intelligence; AVMs, arteriovenous malformations; Col, colon; SB, small bowel. Source: Courtesy Docbot, Inc.
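The sensitivities, specificities, and predictive values quoted throughout this chapter are all derived from per-image confusion-matrix counts; the short helper below shows the arithmetic with made-up counts.

```python
# Helper for the evaluation metrics quoted throughout this chapter, computed
# from per-image confusion-matrix counts. The example counts are made up.
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),          # true-positive rate (recall)
        "specificity": tn / (tn + fp),          # true-negative rate
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

if __name__ == "__main__":
    # Hypothetical reading of 1000 frames: 95 true positives, 5 missed lesions,
    # 18 false alarms, 882 correctly ignored normal frames.
    metrics = diagnostic_metrics(tp=95, fp=18, tn=882, fn=5)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```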

12.3 Applications of artificial intelligence in upper endoscopy

12.3.1 Introduction

Upper GI endoscopy is a common tool used for diagnostic, screening, and surveillance purposes for a variety of pathology. However, the quality of the examination varies based on both technology and operator. These variations can translate into significantly different clinical outcomes because of the often-subtle presentations of upper GI tract disease, especially for cancerous and precancerous lesions. Often diagnosed at a later stage, upper GI cancers continue to have low 5-year survival rates that are typically less than 20%.31,32 Despite efforts to improve detection with image-enhancing technology, overall detection rates of early cancers and precancers remain poor. This is most likely because advanced image-enhancing technologies are limited by the need for specific expertise and longer examination times. Top performance is, therefore, limited to a handful of experts, most typically in academic centers. AI has the potential to introduce high-level performance for the average endoscopist by providing expert assistance and reducing procedure time. To this end, several studies have evaluated the use of CADe and classification for upper GI cancers, precancers, and other pathology.

12.3.2 Esophageal cancer

Esophageal cancer is the ninth most common cancer worldwide with over half a million cases reported in 2018.33 BE represents a change in normal esophageal squamous
epithelium to intestinal metaplasia and is a major risk factor for esophageal adenocarcinoma (EAC). The pathway toward cancer is preceded by the development of dysplasia in BE. Esophageal adenocarcinoma has a high mortality rate, primarily due to its frequent late detection34; overall 5-year survival is 25%, with 50% of patients diagnosed at stage II or later.31 One metaanalysis showed that 25% of patients with EAC were diagnosed within 1 year of endoscopy.35 This implies that early EAC or dysplasia was missed during the previous surveillance endoscopies. Even well-trained endoscopists who follow guideline surveillance recommendations of four-quadrant random biopsies for every 1 cm of BE miss these subtle lesions. Barrett’s experts who are highly attuned to subtle mucosal pattern characteristics of dysplasia and early EAC and are willing to spend adequate time can better target biopsies and reduce miss rates. AI trained by such experts to identify dysplasia and EAC could bring these lower miss rates to every endoscopist. The use of AI for endoscopic image recognition of BE has been evaluated in several studies. De Groof et al. used a CADe system to detect BE using white light endoscopy.5 The AI was trained using images from 40 neoplastic Barrett’s lesions and 20 nondysplastic BE patients, which were prospectively collected and delineated by expert reviewers. The AI was able to detect early Barrett’s neoplasia on white light images with an accuracy of 92%, sensitivity of 95%, and specificity of 85%. AI has also been used with narrow-band imaging (NBI). In 2016 Boschetto et al. published data on an AI trained to detect metaplastic regions on NBI images of the esophagus with an accuracy, sensitivity, and specificity of 83.9%, 79.2%, and 82.3%, respectively.36 The ability of AI to recognize images of overt cancer has also been studied. One group based in Germany found that, compared to human endoscopists, AI had variable sensitivity but higher specificity for the diagnosis of EAC with white light images. They tested their AI on two different datasets. In one of their datasets (the Augsburg dataset) the AI had a higher sensitivity of 97% versus the endoscopists’ sensitivity of 76%, and the AI had a higher specificity of 88% versus the endoscopists’ specificity of 80%. In their second dataset (the Medical Image Computing and Computer-Assisted Intervention data), the AI had a lower sensitivity of 92% compared to the endoscopists’ sensitivity of 99%, but the AI had a higher specificity of 100% compared to the endoscopists’ specificity of 78%.6 In another study, conducted by a group in Japan, the AI diagnosed esophageal cancer with a sensitivity of 98% in a sample of 1118 images in 27 seconds.37 However, its positive predictive value was only 40% due to misdiagnosis of shadows and normal structures. The visual detection of early esophageal neoplasia (high-grade dysplasia and T1 cancer) in BE with white light and virtual chromoendoscopy remains challenging. A simple real-time diagnosis support system would be of great help for endoscopists in the detection of Barrett’s dysplasia. Hashimoto et al. conducted a pilot study on the endoscopic detection of early esophageal neoplasia in BE using a deep learning system, with promising results. A total of 916 images from 65 patients were collected that showed histology-proven early esophageal neoplasia in BE containing high-grade dysplasia or T1 cancer. The area of neoplasia was masked using image annotation software.
A total of 919 control images of BE without high-grade dysplasia were collected. A CNN algorithm was pretrained, and an object detection algorithm was developed that drew localization boxes around regions classified as dysplasia. The CNN analyzed 458 test images (225 dysplasia/233 nondysplasia) and correctly detected early neoplasia with a sensitivity of 96.4%, specificity of 94.2%, and accuracy of 95.4%. For all images in the validation set, the object detection algorithm achieved high localization accuracy, with a mean average precision of 0.7533 at an intersection over union of 0.3.38 Fig. 12.2 shows an example of detection and localization of dysplasia in Barrett’s esophagus by this CNN.

FIGURE 12.2 Barrett’s dysplasia detected by CNN algorithm. CNN, Convolutional neural network. Source: Courtesy Docbot, Inc.

The use of AI to augment image-enhancing techniques in endoscopic screening has also been studied. Volumetric laser endomicroscopy (VLE) is an existing technology that uses near-infrared light to generate optical coherence tomography of 6 cm segments of the esophagus to a depth of 3 mm.39 This method generates 1200 cross-sectional frames for each 6 cm segment of BE.40 The amount of data generated makes this method ideal for AI assistance. Currently, VLE uses an AI termed intelligent real-time image segmentation, which highlights certain images with a colored overlay for the endoscopist if the images have certain VLE features associated with histologic dysplasia.41 A recent multicenter study published by Smith et al. found that VLE identified pathology that was not identified with white light in 59% of the procedures. The authors estimated that with VLE, the neoplasia yield exceeded the standard of care by 55%.42 High-resolution microendoscopy is a feasible, low-cost means of evaluating esophageal histopathology in real time; however, it does require that the operator is trained to identify dysplasia.43 AI may assist with this recognition. A study by Shin et al. found that their AI could detect high-grade squamous dysplasia or invasive squamous cell carcinoma with a sensitivity of 87% and a specificity of 97% in a study of 177 patients.44 The use of AI has shown great promise for detection and classification of esophageal neoplasia. Given that training and experience amongst endoscopists vary greatly in the area of precancer detection in the esophagus, a carefully crafted real-time AI algorithm has the potential to elevate the detection level of less experienced practitioners, thereby producing significant patient benefit.
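Localization performance of object detectors such as the one described above is typically scored by intersection over union (IoU) between predicted and expert-annotated boxes; the helper below shows that calculation for axis-aligned boxes with invented coordinates.

```python
# Intersection over union (IoU) for axis-aligned boxes, the overlap criterion
# used to score localization boxes against expert annotations. Example
# coordinates are invented.
def iou(box_a, box_b) -> float:
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2, in pixels."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    predicted = (100, 100, 300, 260)
    annotated = (140, 120, 340, 300)
    overlap = iou(predicted, annotated)
    print(f"IoU = {overlap:.2f}")
    print(overlap >= 0.3)   # counts as a correct localization at an IoU cutoff of 0.3
```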

12.3.3 Gastric cancer

Gastric cancer is the sixth most common cancer worldwide, with over one million new cases diagnosed in 2018.33 Although the prevalence of gastric cancer has been decreasing, it remains the third leading cause of cancer death in the world, with Asia having the highest burden of disease.45 As with esophageal cancer, the prognosis of gastric cancer is poor, with a 5-year survival rate of about 20%.32 Endoscopic surveillance for gastric cancer is limited by the subtle endoscopic presentation of early gastric cancer. Furthermore, determining the invasion depth of gastric cancer is an important criterion for surgical versus endoscopic resection but remains difficult to predict on endoscopy. In 2018 Hirasawa et al. published data on using AI to help detect gastric cancer using conventional endoscopy.46 The software was trained on 13,584 endoscopic images and then subsequently tested on 2296 images. The AI evaluated all images in 47 seconds and diagnosed gastric cancer with a sensitivity of 92.2%. However, 161 noncancerous lesions were detected as gastric cancer, giving an overall positive predictive value of 30.6%. A recently published large multicenter trial from China evaluated the use of AI for detecting upper GI cancer by endoscopy.47 The AI was trained using over 1 million endoscopic images collected from 84,424 individuals from six different hospitals in China. Only white light images were used. The AI was then validated using several internal and external sets of images. The AI’s diagnostic accuracy was 95.5% in the internal validation set and 92.7% in the external set. Diagnostic sensitivity was 94.2% compared to 94.5% for an expert endoscopist. The positive predictive value was 81.4% for the AI and 93.2% for an expert endoscopist. Recent advances in endoscopy using magnified NBI have provided increasingly granular analysis for the endoscopist. However, these new methods require expertise restricted to a limited number of endoscopists. AI-guided analysis may bridge this gap. Several studies have evaluated the role of AI when using these modalities for detection of gastric cancer. Kanesaka et al. evaluated the use of AI in the detection of early gastric cancer using magnifying NBI.48 The AI was trained using 66 images of early gastric cancer and 60 images of noncancer. It was then tested using 61 images of early gastric cancer and 20 noncancer images. The AI performed with an accuracy of 96.3%; the sensitivity and specificity were 96.7% and 95%, respectively. Another study evaluated the use of AI with magnifying endoscopy with flexible spectral imaging color enhancement and found that an AI-assisted system yielded a detection accuracy of 85.9%, sensitivity of 84.8%, and specificity of 87.0%.7 Similarly, Miyaki et al. evaluated the use of AI with blue-laser imaging. The authors tested the AI with 100 pictures of early gastric cancer, and the software was able to detect 84.6% of cancerous lesions. The use of AI in determining gastric cancer depth has also been studied. Kubota et al. used data from 344 patients who underwent gastrectomy or endoscopic tumor resection between 2001 and 2010, and their 902 endoscopic images, to train an AI to predict invasion depth. The diagnostic accuracy was 77.2% for T1, 49.1% for T2, 51.0% for T3, and 55.3% for T4 lesions. A more recent study published in 2018 also evaluated the use of AI in predicting gastric tumor depth.49 The software was trained using endoscopic images from 790 patients. It was then tested on endoscopic images from 203 patients.
The software was able to detect depth with an accuracy of 89% compared to 71.5% for the endoscopists.
It had a sensitivity of 76.5% and specificity of 95.5% compared to 87.8% and 63.31%, respectively, for the endoscopists. As with esophageal cancer, the use of AI in gastric cancer evaluation has the potential to increase detection, especially among endoscopists who lack sophisticated expertise and equipment.
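Many of the classifiers described in this chapter follow a common transfer-learning recipe: start from a CNN pretrained on natural images, replace the final layer, and fine-tune on endoscopic images. The sketch below illustrates that pattern with torchvision's ResNet-18 as an arbitrary stand-in; the class labels, learning rate, and backbone choice are assumptions, not the setup of any cited study.

```python
# Sketch of the transfer-learning recipe behind many endoscopy classifiers:
# take a CNN backbone, replace its final layer with a task-specific head, and
# fine-tune on endoscopic images. Class count, learning rate, and the ResNet-18
# backbone are illustrative choices, not those of any cited study.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3   # e.g., normal mucosa / early cancer / advanced cancer (hypothetical)

# weights=None keeps the example offline; in practice ImageNet-pretrained
# weights would be loaded and then fine-tuned.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new classification head

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch standing in for labeled
# endoscopic frames (4 RGB images at 224 x 224 with integer class labels).
images = torch.randn(4, 3, 224, 224)
labels = torch.tensor([0, 1, 2, 1])

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```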

12.3.4 Upper endoscopy quality

While the studies mentioned previously demonstrate significant progress toward AI-driven image recognition of pathology, these systems will only be as robust as the source data. Thus it is vital that the quality and completeness of upper endoscopy be assured. Several major GI societies have developed safety and quality indicators for upper endoscopy.50,51 Furthermore, standardized methods of evaluating and reporting key anatomical points have also been developed.52,53 However, these protocols are rarely fully implemented due to lack of awareness, time constraints, and lack of supervision or quality control. The use of AI for monitoring and encouraging completeness as well as quality assurance is an area of active study. A recent study published by Wu et al. evaluated the use of a deep CNN for image recognition as well as a deep reinforcement learning network to decrease the rate of blind spots in upper endoscopy.4 A total of 33,513 endoscopic images were used to train the AI to classify anatomic locations, divided into 26 sites. Patients were then randomly assigned to undergo upper endoscopy with AI assistance or without assistance. The AI would detect anatomic locations in real time and grade the images as "good," "excellent," or "perfect." The primary outcome of the study was the blind spot rate, the number of unobserved anatomical sites. Inspection time and completeness of photo-documentation were also evaluated. The AI identified specific anatomical sites with an average accuracy of 90.02%, with a sensitivity of 87.57% and specificity of 95.02%. Blind spot rates were significantly lower in the AI-assisted group, 5.86%, as compared to the unassisted group, 22.46% (P < .001). The mean inspection times were also significantly longer in the AI-assisted group (5.03 vs 4.24 minutes, P < .001). The photo-documentation generated by the AI was significantly more complete than reports generated by the endoscopist (90.64% vs 79.14%, P < .001). Thus the use of AI to provide real-time feedback to an endoscopist resulted in increased completeness, a longer inspection, and a more thorough report.
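The blind spot rate in that trial is essentially bookkeeping over a fixed list of anatomical sites; the sketch below shows the calculation, using a shortened, hypothetical site list in place of the 26 sites used in the study.

```python
# Blind-spot bookkeeping: the fraction of predefined anatomical sites never
# observed during an exam. The site list is a shortened, hypothetical stand-in
# for the 26 sites used in the cited trial.
EXPECTED_SITES = {
    "esophagus", "squamocolumnar_junction", "gastric_fundus", "gastric_body",
    "gastric_antrum", "angularis", "duodenal_bulb", "second_duodenum",
}

def blind_spot_rate(observed_site_labels) -> float:
    """observed_site_labels: per-frame site predictions from the exam video."""
    seen = set(observed_site_labels) & EXPECTED_SITES
    missed = EXPECTED_SITES - seen
    return len(missed) / len(EXPECTED_SITES)

if __name__ == "__main__":
    frame_labels = (["esophagus"] * 40 + ["gastric_body"] * 200 +
                    ["gastric_antrum"] * 150 + ["duodenal_bulb"] * 30)
    print(f"blind-spot rate: {blind_spot_rate(frame_labels):.0%}")  # 4 of 8 sites missed
```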

12.3.5 Future directions The development of CADe in upper endoscopy is in its infancy but is rapidly maturing. The rates of false positive and false negative findings are an area of much-needed improvement. As discussed, upper GI pathologies tend to be subtle and often have surrounding inflammatory changes that make them difficult to discern for both the AI and the endoscopist. Much larger volumes of expertly annotated images are needed to improve the predictive accuracy of future AI. This is especially a challenge for rare conditions and lesions with high heterogeneity. Effective AIs for homogeneous findings can be trained with a high number of video frames from relatively few unique lesions; however, effective AIs for highly heterogeneous findings require many unique lesions. The lack of training data could be addressed by pooling endoscopic images and video into a large multi-institutional data registry.

12.4 Applications of artificial intelligence in colonoscopy 12.4.1 Introduction Colorectal cancer (CRC) is currently the second leading cause of cancer deaths within the United States.54 CRCs typically start off as benign precancerous polyps, such as adenomas.55 The National Polyp Study showed that up to 90% of CRCs can be prevented with removal of polyps.56 The same cohort had a 53% decrease in CRC mortality compared to the general population per Surveillance, Epidemiology, and End Results data.55 Colonoscopy remains the gold standard for finding adenomas and, short of surgery, is the only intervention capable of removing them. Thus colonoscopy is the only approved screening mechanism that can also simultaneously prevent colon cancer.

The most common type of precancerous polyp is the adenoma.57 The percentage of colonoscopies that reveal at least one adenoma is known as the adenoma detection rate (ADR). ADR should reflect adenoma prevalence, which is estimated to be greater than 50%.58,59 Unfortunately, colonoscopists vary widely in their ADRs (9%–54%).60 In fact, some studies involving tandem colonoscopies have shown that nearly 25% of adenomatous polyps can be missed.58 Two large reports have shown that for each 1% increase in ADR, the interval CRC rate decreases by at least 3%.60,61 Success in "leaving no adenoma behind" could therefore reduce CRC risk by 90% or more. ADR has thus become an important quality measure tied to reimbursement by the Centers for Medicare and Medicaid Services.62 ADR performance is linked to other colonoscopist and procedural characteristics, such as the percentage of colonoscopies that successfully reach the uppermost portion of the colon (cecum), the quality of the colon preparation, and the time spent during the inspection (withdrawal) phase of colonoscopy.

There have been multiple advances within the field of colonoscopy with the goal of improving ADR as a means toward colon cancer prevention. AI and deep learning algorithms provide a unique opportunity to further enhance colonoscopy through improved adenoma detection and automated measures of withdrawal time, preparation quality, and cecal intubation rate. Beyond this, AI could help objectify "eye of the beholder" measures such as polyp size or Mayo endoscopic scoring for ulcerative colitis (UC).
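Since ADR is the central quality measure in the discussion that follows, a minimal sketch of its calculation from procedure-level records is shown below. The record structure and field names are hypothetical; real registries such as GIQuIC define these fields in their own schemas.

```python
# Illustrative sketch: computing an adenoma detection rate (ADR) from a list
# of colonoscopy records. Records and field names are hypothetical.

def adenoma_detection_rate(procedures):
    """ADR = fraction of screening colonoscopies with >= 1 adenoma found."""
    screening = [p for p in procedures if p["indication"] == "screening"]
    with_adenoma = sum(1 for p in screening if p["adenomas_found"] >= 1)
    return with_adenoma / len(screening) if screening else 0.0

if __name__ == "__main__":
    procedures = [
        {"indication": "screening", "adenomas_found": 2},
        {"indication": "screening", "adenomas_found": 0},
        {"indication": "screening", "adenomas_found": 1},
        {"indication": "diagnostic", "adenomas_found": 1},  # excluded from ADR
    ]
    print(f"ADR: {adenoma_detection_rate(procedures):.0%}")  # 67%
```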

12.4.2 Cecal intubation rate and cecal intubation time In order to decrease CRCs, a thorough exam of the colon is required, which means that intubation of the cecum must be confirmed. "Photo-documentation of cecal intubation" is an important GI Quality Improvement Consortium (GIQuIC) measure that assures complete examination and enables calculation of the cecal intubation rate.63 Identification of the cecum is also the first step in determining the withdrawal time, another key quality measure linked to ADR.64 In order to automatically document the cecal intubation rate, a deep learning algorithm must first be able to consistently identify the cecum.

In 2018 Karnes et al. designed a CNN to identify images as cecum or not cecum with an accuracy of up to 98%, depending on the algorithm's confidence; however, when all images were evaluated (including lower confidence images), the accuracy dropped to 88%.65 In 2019 Rombaoa et al. presented data on a refined and revised version of similar CNN technology with a cecal detection accuracy above 99% when validated on both videos and images.66

12.4.3 Withdrawal time Withdrawal time measures the time from reaching the cecum to the termination of the colonoscopy and is positively correlated with ADR. Current recommendations state that withdrawal time should be at least 6 minutes for adequate viewing of the colonic mucosa, although some studies suggest further improvements in ADR with withdrawal times greater than 6 minutes. Accordingly, withdrawal time is an important part of the GIQuIC reportable quality measures.63 In 2019 Rombaoa et al. presented the results of an AI algorithm capable of reporting withdrawal time.66 The AI-reported withdrawal time differed from the withdrawal time documented by expert video reviewers by a mean of only 26 seconds.
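A minimal sketch of how an AI system could report withdrawal time from frame-level predictions is shown below: the withdrawal time is taken as the interval between the first confident identification of the cecum and the final frame of the procedure. The per-frame confidence values, the 0.5 threshold, and the one-frame-per-second sampling are assumptions for illustration, not details of the published algorithm.

```python
# Minimal sketch: withdrawal time = time of the last frame of the procedure
# minus the time the cecum is first confidently identified. Per-frame
# predictions and the confidence threshold below are hypothetical.

def withdrawal_time_seconds(frame_times, cecum_confidences, threshold=0.5):
    """frame_times: seconds since insertion for each frame;
    cecum_confidences: per-frame probability that the frame shows the cecum."""
    cecum_frames = [t for t, c in zip(frame_times, cecum_confidences) if c >= threshold]
    if not cecum_frames:
        return None  # cecum never identified; withdrawal time undefined
    return frame_times[-1] - min(cecum_frames)

if __name__ == "__main__":
    # 1 frame per second for a 10-minute exam; cecum reached ~4 minutes in.
    times = list(range(600))
    confs = [0.0] * 240 + [0.9] * 30 + [0.1] * 330
    print(f"Withdrawal time: {withdrawal_time_seconds(times, confs) / 60:.1f} min")
```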

12.4.4 Boston Bowel Prep Scoring Proper bowel preparation is key to ensuring complete visual inspection of the colonic mucosa and detection of polyps. The Boston Bowel Prep Score (BBPS) was developed to eliminate the subjectivity in bowel prep evaluation during colonoscopy. Not unexpectedly, ADR positively correlates with BBPS,67 and BBPS is another important reportable measure per the GIQuIC quality indicators.63 AI provides the opportunity to objectify and remove ambiguity from measures of prep quality. In 2018 Karnes et al. developed a deep learning algorithm that had 97% accuracy in classifying bowel prep as inadequate (BBPS 0 or 1) or adequate (BBPS 2 or 3).68 As of April 2019, Dr. Xiuli Zuo of Shandong University is in the process of conducting the first prospective study to validate a convolutional neural network capable of determining BBPS (clinicaltrials.gov, NCT03908645). At the end of 2019, Zhou et al. published data on their deep learning algorithm, which classified the BBPS score with an accuracy of 89% on 20 colonoscopy videos and 80%–93% on colonoscopy images.69
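The following is a hypothetical sketch, not the published algorithms, of how per-frame BBPS predictions might be aggregated into per-segment scores and an overall adequate/inadequate call, mirroring the adequate (BBPS 2 or 3) versus inadequate (BBPS 0 or 1) cutoff described above.

```python
# Hypothetical sketch: aggregating per-frame BBPS predictions (0-3) into a
# per-segment score and an adequate/inadequate call. Segment names, frame
# scores, and the median aggregation rule are illustrative assumptions.

from statistics import median

def segment_score(frame_scores):
    """Summarize per-frame predicted BBPS scores (0-3) for one colon segment."""
    return int(median(frame_scores))

def prep_is_adequate(per_segment_frame_scores):
    """per_segment_frame_scores: dict of segment name -> list of frame scores."""
    segment_scores = {seg: segment_score(scores)
                      for seg, scores in per_segment_frame_scores.items()}
    return segment_scores, all(s >= 2 for s in segment_scores.values())

if __name__ == "__main__":
    frames = {
        "right colon": [3, 3, 2, 3],
        "transverse": [2, 2, 3, 2],
        "left colon": [1, 2, 1, 1],   # poorly prepped segment
    }
    scores, adequate = prep_is_adequate(frames)
    print(scores, "adequate" if adequate else "inadequate")
```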

12.4.5 Polyp detection A colonoscopist's ADR has been shown to be inversely correlated with the interval CRC rate after screening colonoscopy.61 Various attempts at increasing ADR have been made, including scope modifications/attachments that expose mucosal surfaces behind folds, such as Endocuff and AmplifEYE.70 Studies have shown that having a trained second observer in the room can also improve ADR.71 To that effect, CADe for polyps offers the potential for an automated, trained second observer in the room, potentially improving a colonoscopist's ADR.

In order to be broadly applicable, CADe technology must be able to function in real time without lag. Realistically, any apparent delay in image processing will be considered annoying and unusable by colonoscopists. The CADe technology must also be simple and cheap to implement, as any moderate increase in price will be viewed as untenable by community colonoscopists. Lastly, the system must provide a low false positive rate with near 100% sensitivity.

Computer-aided polyp detection has a long and engaging history. Using texture and color analysis, Karkanis et al. described a CADe system they called the Colorectal Lesion Detector, which had an accuracy >95% but was applicable only to static images and had high latency.72 Multiple studies since then have attempted to teach learning algorithms the shape,73 color,72 edge features,74 and spatiotemporal features75 of polyps for detection. However, a majority of these systems were limited by high latency or lag. They were also limited by another fundamental problem: the early CADe learning algorithms relied on a human programmer to teach the system the unique features of polyps, thus introducing human bias as a source of error in the software. To avoid this problem, CADe systems needed to be upgraded with deep learning algorithms built on CNNs, which learn polyp-specific features on their own without human input.

Multiple groups have published manuscripts detailing their versions of CNNs used for polyp detection, starting with Li et al., who achieved an accuracy of 86% and sensitivity of 73%.76 Shortly after, Wang et al. published data on a CNN developed with the SegNet architecture and validated on over 27,000 images; however, their system operated at 25 fps with a latency of 77 ms.77,78 Urban et al. developed a CNN algorithm capable of operating in real time at nearly 100 fps with a latency of 10.2 ms. Their published results showed an accuracy of 96% and sensitivity of 97% with a 5% false positive rate. They further validated their CNN on colonoscopy videos, achieving 100% sensitivity and a 7% false positive rate; notably, it helped expert reviewers find 20% additional polyps.79 Their study, however, was limited as a single-center retrospective study.

In early 2019, Wang et al. published the first randomized prospective trial of a CADe system utilizing their CNN mentioned previously (sensitivity of 94% and latency of 77 ms). The system required two viewing monitors: one with the live video stream viewed by the colonoscopist, and the other processed by the CNN algorithm. When the CNN detected a polyp, an alarm alerted the colonoscopist to view the CNN screen to identify the polyp. While the process was convoluted, the manuscript was the first to demonstrate an improvement in ADR (0.53 vs 0.31 adenomas per patient, P < .001).80 In the latter part of 2019, Kudo et al. published multicenter data on their development of "EndoBRAIN," capable of detecting polyps with an accuracy of 98%, sensitivity of 97%, and 95% NPV; however, their assessment was performed on endocytoscopic images.81

While there is still much work to be done on polyp detection, the gap between the future goal and the present is closing. Ultimately, demand for and use of CADe technology will depend on clear evidence of improved polyp detection and ADR in multicenter validation, financial benefit, and FDA clearance. Additional benefit may be gained if deep learning algorithms can also be proven to find residual tissue after polypectomy.
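To make the latency constraint concrete, the skeleton below shows a per-frame processing loop with a real-time budget of roughly 16.7 ms at 60 fps (or 40 ms at 25 fps). The detector here is a stand-in that returns no detections; in a real system a CNN and a video capture source would take its place.

```python
# Skeleton of a per-frame CADe loop with an explicit real-time budget.
# The detector is a placeholder returning no detections, and frames are
# represented by integers. A real system would wire in a CNN and a video source.

import time

FPS = 60
FRAME_BUDGET_S = 1.0 / FPS          # ~16.7 ms available per frame at 60 fps

def placeholder_detector(frame):
    """Stand-in for a CNN polyp detector; returns a list of bounding boxes."""
    return []

def process_stream(frames):
    """Run the detector on each frame and count frames that exceed the budget."""
    late_frames = 0
    for frame in frames:
        start = time.perf_counter()
        boxes = placeholder_detector(frame)   # inference step
        # ...overlay `boxes` on the frame and display it here...
        elapsed = time.perf_counter() - start
        if elapsed > FRAME_BUDGET_S:
            late_frames += 1                  # this frame missed real time
    return late_frames

if __name__ == "__main__":
    print("Frames over budget:", process_stream(range(600)))  # ~10 s of video
```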

12.4.6 Polyp size Polyp size plays a key role in determining the risk of CRC. In fact, as polyp size increases, the risk of in situ colorectal malignancy also increases.82 In addition, appropriate determination of surveillance intervals for colon cancer screening depends on accurate assessment of polyp size.

Despite the importance of polyp size, there is significant interobserver variability in assessing size,83 and this variability can be present even when biopsy forceps are placed adjacent to the polyp. Deep learning algorithms have been proposed as a novel way of attempting to solve the size challenge; specifically, CNNs can be trained to recognize and learn polyp sizes without being subject to interobserver bias. In 2018 Requa et al. presented data establishing their CNN as capable of determining polyp size.84 Their results are difficult to validate, as there is no objective reference standard for sizing polyps. In addition, the CNN is subject to the inherent bias in the training data, given that the polyp sizes are determined initially by a human. However, by using a single expert or a consensus of experts to make the estimates, size variability could be reduced.

12.4.7 Polyp morphology The endoscopic appearance of polyps has been shown to be an important characteristic in predicting invasion into the submucosa.85 For example, the rate of invasive growth for a depressed lesion is 61% compared to 5% for a pedunculated polyp.86,87 Thus the morphology of a polyp can help the endoscopist determine the appropriate mode of resection or surgery. The Paris classification system has been utilized internationally since 2003 to characterize colon polyp morphology. Polypoid lesions include pedunculated (Ip) and sessile (Is) types, while nonpolypoid lesions include superficial slightly elevated (IIa), flat (IIb), superficial depressed (IIc), and excavated (III) types. Prior studies have demonstrated that interobserver variation can exist between endoscopists.88,89 Therefore the use of AI has the potential to help train novice endoscopists and to standardize real-time polyp morphology assessments among experts using the Paris classification.

12.4.8 Polyp pathology During colonoscopy, physicians spend time searching for polyps and determining whether removal of a polyp is necessary and, if so, whether it can be discarded without the expense of pathology. Accordingly, an area of active interest is the development of methods to optically "biopsy" diminutive polyps (≤5 mm) with reliable outputs of precancerous adenomas versus nonprecancerous polyps (hyperplastic, lymphoid aggregates, or normal polypoid tissue). Toward this aim, the ASGE released PIVI criteria for the adoption of one of two strategies when assessing diminutive polyps. The first strategy, called "Resect and Discard," requires an optical biopsy system that achieves >90% concordance in recommended surveillance intervals when compared to true pathology. Some studies have found that this process could lead to a "savings of $25 per person screened."90 The second strategy, titled "Diagnose and Leave," requires an optical biopsy system that achieves >90% NPV for adenomas distal to the sigmoid colon. Some studies have found that this strategy could lead to a "savings of $179 per person screened."90 With both strategies in place, optical biopsy of diminutive polyps could lead to health-care savings of nearly $1 billion per year.90

There have been several attempts to achieve these PIVI strategies using optical biopsy. The "Colon Tumor NBI Interest Group" developed the NICE (NBI International Colorectal Endoscopic) criteria to separate adenomatous polyps from hyperplastic polyps on the basis of their vessels, color, and surface pattern.91

The ASGE published a metaanalysis assessing the ability of colonoscopists to use the NICE criteria to determine polyp histology based on appearance alone. Overall, they had an NPV for distal adenomas of 91% with a surveillance concordance of 89%. On subgroup analysis, the ASGE PIVI standards were achieved only when the operator was highly experienced and practicing in an academic setting.90 Evaluation of community practice physicians' ability to use the NICE criteria showed that they failed to achieve the PIVI thresholds. Other advanced imaging modalities have also been evaluated against the PIVI criteria. Endocytoscopy, which uses contact microscopy to produce images at the cellular level, was shown by Misawa et al. to have an accuracy of 90% and NPV of 82% for adenomatous lesions.92 Laser-induced autofluorescence spectroscopy, known as the "WavSTAT4" optical biopsy system, achieved an accuracy of 82% and concordance of 89%.93 Both of these advanced technologies were limited by subpar results in addition to reliance on expensive and not widely available equipment. Intrinsic subjectivity, operator skill, motivation, user training, and/or expensive technologies remain important obstacles to achieving unambiguous, reliable, and widely available optical biopsy methodology.94

Deep learning algorithms have the potential to achieve both PIVI strategies unambiguously and independently of operator skill, training, motivation, or expensive technologies. Unlike humans, deep learning algorithms do not experience fatigue or loss of concentration. Several studies using various deep learning algorithms to predict polyp histology have recently been published. In 2017 Byrne et al. created a CNN that was validated on polyps during colonoscopy.95 They were able to achieve an accuracy of 94% and NPV of 97% after the exclusion of 15% of polyps that could not be predicted.95 They did not publish data on surveillance concordance, and thus it is unknown whether they reached both PIVI thresholds. Their model also required the use of NBI for diagnosis, and their dataset was collected by only one expert endoscopist, limiting broad applicability. In 2020 Zachariah et al. published data on the validation of their CNN algorithm for evaluating polyp histology.96 Their dataset included polyps imaged with various light sources (NBI and white light) and was created by endoscopists at various training levels, from fellows to attendings who had been in practice for well over 30 years. Overall, the NPV for distal colon diminutive adenomas was 97%, while the surveillance concordance was 94%.96 No polyps were excluded in their analysis. Fig. 12.3 shows a rendering of this CNN during a live colonoscopy, in this case simultaneously predicting an adenoma and two serrated polyps. While this was a great first step, their study was retrospective and validated on static images; further prospective data on live colonoscopy videos are needed to validate this and other algorithms.
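The two PIVI thresholds discussed above can be made concrete with a small sketch: NPV for adenomatous histology among diminutive polyps optically called non-adenomas, and concordance of surveillance intervals assigned from optical diagnosis versus true pathology. The records, field names, and the simplified two-interval rule below are assumptions for illustration; real surveillance guidelines are considerably more detailed.

```python
# Illustrative sketch of the two PIVI metrics from hypothetical records:
# NPV for adenomas among polyps optically called non-adenomatous, and
# concordance of surveillance intervals derived from optical vs true pathology.

def npv_for_adenoma(polyps):
    """Among polyps optically called non-adenoma, fraction truly non-adenoma."""
    called_negative = [p for p in polyps if not p["optical_adenoma"]]
    true_negative = sum(1 for p in called_negative if not p["path_adenoma"])
    return true_negative / len(called_negative) if called_negative else None

def surveillance_interval(num_adenomas):
    """Toy rule: 10-year interval if no adenomas, otherwise 5 years."""
    return 10 if num_adenomas == 0 else 5

def interval_concordance(patients):
    """Fraction of patients whose optical- and pathology-based intervals agree."""
    agree = sum(1 for p in patients
                if surveillance_interval(p["optical_adenomas"])
                == surveillance_interval(p["path_adenomas"]))
    return agree / len(patients)

if __name__ == "__main__":
    polyps = [{"optical_adenoma": False, "path_adenoma": False}] * 9 + \
             [{"optical_adenoma": False, "path_adenoma": True}]
    patients = [{"optical_adenomas": 1, "path_adenomas": 1},
                {"optical_adenomas": 0, "path_adenomas": 1},
                {"optical_adenomas": 0, "path_adenomas": 0}]
    print(f"NPV: {npv_for_adenoma(polyps):.0%}")                    # 90%
    print(f"Interval concordance: {interval_concordance(patients):.0%}")  # 67%
```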

12.4.9 Tools A large variety of endoscopic tools are available to aid the detection and treatment of various lesions found during colonoscopy. Endoscopic caps and cuffs, which attach to the tip of the colonoscope, are simple tools to enhance the ability to inspect otherwise difficult mucosal areas on folds and flexures.97

FIGURE 12.3 Simultaneous detection and characterization of an adenoma, sessile serrated polyp, and hyperplastic polyp. Source: Image obtained courtesy of Docbot, Inc.

Popular tools for endoscopic intervention include cold forceps, which can be used to take tissue biopsies or remove small polyps, and cold snare, which has emerged as a suitable method to remove small polyps up to 9 mm.98 Endoscopic hemoclips are a common means to achieve hemostasis after polypectomy. Lastly, argon plasma coagulation is an effective, safe, and well-tolerated treatment of conditions such as radiation proctitis and colonic angioectasias.99,100 Samarasena et al. designed a CNN algorithm that was trained with over 56,000 images depicting the tools discussed previously. During colonoscopy, the CNN algorithm was able to recognize the presence of any tool with an accuracy of 0.97, area under the curve (AUC) of 0.96, sensitivity of 0.95, and specificity of 0.97. The CNN was also able to differentiate between the individual tools with accuracy ranging from 0.94 to 0.99, AUC of 0.93–0.99, sensitivity >0.96, and specificity >0.90.101 This has significant implications for automated documentation of specific interventions and associated billing codes during endoscopy.

12.4.10 Mayo endoscopic subscore In patients with UC, the severity of disease is evaluated by a combination of clinical symptoms and endoscopic appearance. The Mayo endoscopic subscore (MES) is one of the most common methods of evaluating disease activity in UC patients.102 In clinical practice, the MES is essential to help determine the treatment of UC.103 An MES of 0–1 is consistent with disease remission and is an endpoint of UC treatment associated with favorable clinical outcomes.104 A score of 2–3 is associated with more severe states of inflammation and may indicate a need for adjustments in therapy to avoid colectomy.105 Of note, previous studies have demonstrated that application of the scoring system is vulnerable to interobserver variability.103,106 Although centralized consensus reading poses a possible solution to operator dependence, it is time-consuming, expensive, and impractical in clinical practice.107

Ozawa et al. devised a computer-assisted diagnosis (CAD) system using a CNN to differentiate between the Mayo disease states. A total of 26,304 images from 444 patients with UC were used as a training set. Areas under the receiver operating characteristic curve were 0.86 (95% CI, 0.84–0.87) when differentiating Mayo 0 from Mayo 1–3 and 0.98 (95% CI, 0.97–0.98) when identifying Mayo 0–1 versus Mayo 2–3.108 Overall, this demonstrates the potential for a CAD system to provide real-time, standardized, and unambiguous Mayo scores for documentation and clinical pharmacological trials.
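As an illustration of how the two binary comparisons reported above can be derived from an ordinal Mayo prediction, the sketch below binarizes synthetic Mayo subscores at the two cutoffs (Mayo 0 vs 1–3, and Mayo 0–1 vs 2–3) and scores a hypothetical model output with AUC; the scores and probabilities are invented for illustration.

```python
# Illustrative sketch (not the published system): binarizing ordinal Mayo
# subscores at two cutoffs and scoring a synthetic model output with AUC.

from sklearn.metrics import roc_auc_score

true_mayo = [0, 0, 1, 1, 2, 2, 3, 3]
# Hypothetical model output: predicted probability of more severe disease.
p_severe = [0.05, 0.35, 0.30, 0.45, 0.60, 0.40, 0.85, 0.95]

# Task 1: Mayo 0 vs Mayo 1-3
labels_0_vs_rest = [1 if m >= 1 else 0 for m in true_mayo]
# Task 2: Mayo 0-1 vs Mayo 2-3
labels_01_vs_23 = [1 if m >= 2 else 0 for m in true_mayo]

print("AUC, Mayo 0 vs 1-3:", roc_auc_score(labels_0_vs_rest, p_severe))
print("AUC, Mayo 0-1 vs 2-3:", roc_auc_score(labels_01_vs_23, p_severe))
```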

12.5 Conclusion Within the field of colonoscopy, multiple deep learning algorithms are under development that hold promise to improve detection, report writing, quality measure documentation, billing, and the communication of findings, pathology, and appropriate return intervals with patients at the point of care. These include algorithms for BBPS,68 cecal intubation rate,65 withdrawal time,66 polyp size,84 polyp morphology, polyp pathology,96 tools used,101 and Mayo endoscopic scoring.108 The group at the University of California, Irvine, working with Docbot, has successfully combined multiple algorithms that run simultaneously during live colonoscopy. These include the detection of polyps, cecum, start and end times, and tools, as well as measures of BBPS and optical pathology. Fig. 12.4 shows a video frame of their current interface running these multiple algorithms, taken during a live colonoscopy.

FIGURE 12.4 Multiple simultaneous AI overlays during live colonoscopy. In the top left corner, IO represents the CNN's confidence that the frame is inside (vs outside) the colon, marking the start time (displayed as "In 12:57 p.m.") and end time of the procedure (not displayed because the procedure is still in progress). The cecum prediction for this frame is displayed as "C: 0.0%", representing confidence that the current frame is within the cecum. The time of the last view of the cecum is shown (Cecum 01:20 p.m.). The green bar represents a BBPS of 3. The display for the device being used is absent because no device is currently in use. Overlaying the video frame is a red rectangle entirely enclosing the detected polyp (91.8% confidence), labeled with the predicted pathology (adenoma with 83.7% confidence). AI, Artificial intelligence; BBPS, Boston Bowel Prep Score; CNN, convolutional neural network. Source: Image obtained courtesy of Docbot, Inc.

The challenge will be to bring these and other algorithms together in one form factor capable of simultaneous real-time feedback in a reliable and inexpensive manner. If efficient and widespread in its use, this novel technology could reduce the CRC burden by bringing all colonoscopists up to high detection levels, automating their quality reporting, procedure reports, and billing codes, reducing the costs associated with pathology and back-office staff, and providing more quality time with their patients at work and with their family and friends at home.90

12.6 Future directions The use of AI in healthcare is only now taking its first steps toward catching up with what is already available in business, government, and the consumer world. Successful integration of deep learning algorithms into medicine is often slowed by regulatory requirements for high-quality clinical validation studies to demonstrate efficacy and safety. Within healthcare, AI is most developed for problems that do not require a 10 ms solution and can therefore be handled by off-site servers; for example, radiological interpretations are typically acceptable within minutes. In the field of gastroenterology, however, useful AI requires real-time feedback, ideally faster than high-quality video frame rates of 60 fps. This becomes more challenging when multiple AIs must run simultaneously. Fortunately, the gaming industry has brought with it ever more powerful, fast, and affordable graphics processing units. Furthermore, the AI community continues to develop ever more efficient, accurate, specialized, and open-source neural network models. Together, these developments put us on the cusp of truly meaningful AI in the endoscopy unit that will assist our quality, efficiency, and ability to provide optimal care to our patients. Based on the current timeline, it is reasonable to predict the following within 5–10 years among AI users in endoscopy:

1. More uniform detection of pathology approaching current expertise, even among low performers. Mean ADRs will begin to approach true adenoma prevalence, resulting in much lower rates of interval CRCs, and there will be far fewer missed dysplastic lesions and early cancers in the upper GI tract, resulting in fewer deaths from gastric and esophageal cancer.
2. Optical pathology will lead to the adoption of "Resect and Discard" and "Leave Alone" strategies, and to definitive treatment of upper GI precancerous lesions and early cancers of the stomach and esophagus at the point of care without the need to await biopsy results. This will result in health-care savings well in excess of $1 billion/year.
3. AI will automate report writing and documentation of quality measures reportable to CMS.
4. AI will desubjectify prep scores, polyp size, polyp shape, and inflammatory scoring.
5. Patients will enjoy complete communication of findings, expected pathology, and surveillance intervals at the point of care, easing patient anxiety and reducing overhead related to follow-up calls and documentation. They will also enjoy fewer repeat endoscopies for conditions that currently require a diagnostic exam to determine pathology, followed by therapeutic exams after pathology results return days later.

6. AI for endoscopic scoring of inflammatory bowel disease will reduce the costs of future drugs used to treat these diseases. Current drug trials are burdened by the high costs of requiring a consensus of 2–3 expert readers to score mucosal inflammation in every case, often 2–3 times per case per study.

Once deployed nationally, the use of deep learning will generate massive datasets available to further develop and improve algorithms, and to advance research with large outcome and epidemiological studies. For example, researchers could analyze data among patients with inflammatory bowel disease to assess real-world associations between endoscopic inflammatory scores and epidemiologic factors and treatments. Another example might include assessment of real-world associations between bowel prep scores, the cleansing agent used, and epidemiological data. These data could also assist in the development of novel risk assessment tools for more personalized screening and surveillance recommendations. Research would be limited only by imagination.

Newer technologies and concepts in the field of gastroenterology often face fear and/or criticism. For example, Dr. Barry Marshall and Dr. Robin Warren's claim that Helicobacter pylori is a pathogen that causes ulcers was met with criticism before final acceptance and an eventual Nobel Prize. Similarly, colonoscopists feared stool tests (the fecal immunochemical test and multitarget stool DNA test) and virtual colonoscopy as threats to replace traditional colonoscopy. However, the use of these screening technologies has increased the demand for colonoscopy by engaging the general population in the importance of colorectal screening. The use of AI in endoscopy conjures similar fears that we will be replaced by an AI-driven "self-driving" scope. For now, however, AI algorithms are being developed to assist us to be better and more efficient, and to improve patient outcomes. As a computer vision platform, current AIs cannot see what is not on the screen. We are still required to skillfully clean and inspect all mucosal surfaces, use our judgment on any potential pathology whether detection was assisted by AI or not, determine whether and which interventions are appropriate, and then apply our skills to completing the intervention effectively and safely. We are generations away from AI and robotics replacing a skilled colonoscopist.

One day, the skills of expert gastroenterology proceduralists may become obsolete, replaced by a new technology. Imagine, for example, intelligent nanotechnology administered orally in the comfort of one's home, capable of safely and effectively diagnosing and removing colon polyps and Barrett's mucosa, with no need for fasting, sedation, imposing on someone to care for you after the procedure, or taking time off. Similarly, imagine nanoparticles that could disintegrate biliary stones, eliminating the need for endoscopic retrograde cholangiopancreatography to treat choledocholithiasis. While currently fantasy, Moore's law of doubling processing power every 2 years in shrinking form factors, together with huge advances in AI and nanotechnologies, leaves us in wonder. For now, however, AI is a friend of both the endoscopist and the patient and can be embraced.

References
1. Baştanlar Y, Özuysal M. Introduction to machine learning. Humana Press; 2014. p. 105–28.
2. Deo RC. Machine learning in medicine. Circulation 2015;132(20):1920–30.

3. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 2013;35(8):1798 828. 4. Wu L, Zhang J, Zhou W, et al. Randomised controlled trial of WISENSE, a real-time quality improving system for monitoring blind spots during esophagogastroduodenoscopy. Gut 2019;68(12):2161 9. 5. de Groof J, van der Sommen F, van der Putten J, et al. The Argos project: the development of a computeraided detection system to improve detection of Barrett’s neoplasia on white light endoscopy. United Eur Gastroenterol J 2019;7(4):538 47. 6. Ebigbo A, Mendel R, Probst A, et al. Computer-aided diagnosis using deep learning in the evaluation of early esophageal adenocarcinoma. Gut 2019;68(7):1143 5. 7. Miyaki R, Yoshida S, Tanaka S, et al. Quantitative identification of mucosal gastric cancer under magnifying endoscopy with flexible spectral imaging color enhancement. J Gastroenterol Hepatol 2013;28(5):841 7. 8. Wang A, Banerjee S, Barth BA, et al. Wireless capsule endoscopy. Gastrointest Endosc 2013;78(6):805 15. 9. Beg S, Parra-Blanco A, Ragunath K. Optimising the performance and interpretation of small bowel capsule endoscopy. Frontline Gastroenterol 2018;9(4):300 8. 10. Aasen TD, Wilhoite D, Rahman A, Devani K, Young M, Swenson J. No significant difference in clinically relevant findings between Pillcams SB3 and Pillcams SB2 capsules in a United States veteran population. World J Gastrointest Endosc 2019;11(2):124 32. 11. Spada C, Riccioni ME, Costamagna G. Rapid Access Real-Time device and Rapid Access software: new tools in the armamentarium of capsule endoscopy. Expert Rev Med Devices 2007;4(4):431 5. 12. Chong AKH, Chin BWK, Meredith CG. Clinically significant small-bowel pathology identified by doubleballoon enteroscopy but missed by capsule endoscopy. Gastrointest Endosc 2006;64(3):445 9. 13. Teshima CW, Kuipers EJ, Van Zanten SV, Mensink PBF. Double balloon enteroscopy and capsule endoscopy for obscure gastrointestinal bleeding: an updated meta-analysis. J Gastroenterol Hepatol 2011;26(5):796 801. 14. Milano A, Balatsinou C, Filippone A, et al. A prospective evaluation of iron deficiency anemia in the GI endoscopy setting: role of standard endoscopy, videocapsule endoscopy, and CT-enteroclysis. Gastrointest Endosc 2011;73(5):1002 8. 15. Hartmann D, Schmidt H, Bolz G, et al. A prospective two-center study comparing wireless capsule endoscopy with intraoperative enteroscopy in patients with obscure GI bleeding. Gastrointest Endosc 2005;61(7):826 32. 16. Lewis BS, Swain P. Capsule endoscopy in the evaluation of patients with suspected small intestinal bleeding: results of a pilot study. Gastrointest Endosc 2002;56(3):349 53. 17. Saurin JC, Lapalus MG, Cholet F, et al. Can we shorten the small-bowel capsule reading time with the “Quick-view” image detection system? Dig Liver Dis 2012;44(6):477 81. 18. Aoki T, Yamada A, Aoyama K, et al. Clinical usefulness of a deep learning-based system as the first screening on small-bowel capsule endoscopy reading. Dig Endosc 2020;32(4):585 91. 19. Slawinski PR, Obstein KL, Valdastri P. Emerging issues and future developments in capsule endoscopy. Tech Gastrointest Endosc 2015;17(1):40 6. 20. Lui F, Rusconi-Rodrigues Y, Ninh A, Requa J. Highly sensitive and specific identification of anatomical landmarks and mucosal abnormalities in video capsule endoscopy with convolutional neural networks: presidential poster award. Philadelphia, PA: ACG; 2018. 21. Lui FH, Ninh A, Rusconi Y, Requa J, Karnes WE. 
299 video validation of small bowel convolutional neural networks (CNNS) in identification of anatomical landmarks and mucosal abnormalities in video capsule endoscopy. Gastroenterology 2019;156(6):S-58 9. 22. Liao Z, Gao R, Xu C, Li Z-S. Indications and detection, completion, and retention rates of small-bowel capsule endoscopy: a systematic review. Gastrointest Endosc 2010;71(2):280 6. 23. Koulaouzidis A, Rondonotti E, Giannakou A, Plevris JN. Diagnostic yield of small-bowel capsule endoscopy in patients with iron-deficiency anemia: a systematic review. Gastrointest Endosc 2012;76(5):983 92. 24. Buscaglia JM, Giday SA, Kantsevoy SV, et al. Performance characteristics of the suspected blood indicator feature in capsule endoscopy according to indication for study. Clin Gastroenterol Hepatol 2008;6(3):298 301. 25. Liangpunsakul S, Mays L, Rex DK. Performance of Given suspected blood indicator. Am J Gastroenterol 2003;98(12):2676 8. 26. D’Halluin PN, Delvaux M, Lapalus MG, et al. Does the “Suspected Blood Indicator” improve the detection of bleeding lesions by capsule endoscopy? Gastrointest Endosc 2005;61(2):243 9. 27. Yung DE, Sykes C, Koulaouzidis A. The validity of suspected blood indicator software in capsule endoscopy: a systematic review and meta-analysis. Expert Rev Gastroenterol Hepatol 2017;11(1):43 51.

28. Aoki T, Yamada A, Kato Y, et al. Automatic detection of blood content in capsule endoscopy images based on a deep convolutional neural network. J Gastroenterol Hepatol 2019. 29. Klang E, Barash Y, Yehuda Margalit R, et al. Deep learning algorithms for automated detection of Crohn’s disease ulcers by video capsule endoscopy. Gastrointest Endosc 2020;91(3):606 613.e2. 30. Ding Z, Shi H, Zhang H, et al. Gastroenterologist-level identification of small-bowel diseases and normal variants by capsule endoscopy using a deep-learning model. Gastroenterology 2019;157(4):1044 1054.e1045. 31. Tramontano AC, Sheehan DF, Yeh JM, et al. The impact of a prior diagnosis of Barrett’s esophagus on esophageal adenocarcinoma survival. Am J Gastroenterol 2017;112(8):1256 64. 32. Karimi P, Islami F, Anandasabapathy S, Freedman ND, Kamangar F. Gastric cancer: descriptive epidemiology, risk factors, screening, and prevention. Cancer Epidemiol Biomarkers Prev 2014;23(5):700 13. 33. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018;68 (6):394 424. 34. Trindade AJ, McKinley MJ, Fan C, Leggett CL, Kahn A, Pleskow DK. Endoscopic surveillance of Barrett’s esophagus using volumetric laser endomicroscopy with artificial intelligence image enhancement. Gastroenterology 2019;157(2):303 5. 35. Visrodia K, Singh S, Krishnamoorthi R, et al. Magnitude of missed esophageal adenocarcinoma after Barrett’s esophagus diagnosis: a systematic review and meta-analysis. Gastroenterology 2016;150(3):599 607.e597 ; quiz e514-595. 36. Boschetto D, Gambaretto G, Grisan E. Automatic classification of endoscopic images for premalignant conditions of the esophagus. In: Paper presented at: SPIE medical imaging 2016. San Diego, CA; 2016. 37. Horie Y, Yoshio T, Aoyama K, et al. Diagnostic outcomes of esophageal cancer by artificial intelligence using convolutional neural networks. Gastrointest Endosc 2019;89(1):25 32. 38. Hashimoto R, Requa J, Tyler D, et al. Artificial intelligence using convolutional neural networks for real-time detection of early esophageal neoplasia in Barrett’s esophagus (with video). Gastrointest Endosc 2020;91 (6):1264 1271.e1. 39. Wolfsen HC. Volumetric laser endomicroscopy in patients with Barrett esophagus. Gastroenterol Hepatol (NY) 2016;12(11):719 22. 40. Swager AF, de Groof AJ, Meijer SL, Weusten BL, Curvers WL, Bergman JJ. Feasibility of laser marking in Barrett’s esophagus with volumetric laser endomicroscopy: first-in-man pilot study. Gastrointest Endosc 2017;86(3):464 72. 41. Swager AF, Tearney GJ, Leggett CL, et al. Identification of volumetric laser endomicroscopy features predictive for early neoplasia in Barrett’s esophagus using high-quality histological correlation. Gastrointest Endosc 2017;85(5):918 926.e917. 42. Smith MS, Cash B, Konda V, et al. Volumetric laser endomicroscopy and its application to Barrett’s esophagus: results from a 1,000 patient registry. Dis Esophagus 2019;32(9).. 43. Muldoon TJ, Anandasabapathy S, Maru D, Richards-Kortum R. High-resolution imaging in Barrett’s esophagus: a novel, low-cost endoscopic microscope. Gastrointest Endosc 2008;68(4):737 44. 44. Shin D, Protano MA, Polydorides AD, et al. Quantitative analysis of high-resolution microendoscopic images for diagnosis of esophageal squamous cell carcinoma. Clin Gastroenterol Hepatol 2015;13(2):272 279.e272. 45. Hamashima C. 
Current issues and future perspectives of gastric cancer screening. World J Gastroenterol 2014;20(38):13767 74. 46. Hirasawa T, Aoyama K, Tanimoto T, et al. Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images. Gastric Cancer 2018;21(4):653 60. 47. Luo H, Xu G, Li C, et al. Real-time artificial intelligence for detection of upper gastrointestinal cancer by endoscopy: a multicentre, case-control, diagnostic study. Lancet Oncol 2019;20(12):1645 54. 48. Kanesaka T, Lee TC, Uedo N, et al. Computer-aided diagnosis for identifying and delineating early gastric cancers in magnifying narrow-band imaging. Gastrointest Endosc 2018;87(5):1339 44. 49. Zhu Y, Wang QC, Xu MD, et al. Application of convolutional neural network in the diagnosis of the invasion depth of gastric cancer based on conventional endoscopy. Gastrointest Endosc 2019;89(4):806 815.e801. 50. Cohen J, Safdi MA, Deal SE, et al. Quality indicators for esophagogastroduodenoscopy. Gastrointest Endosc 2006;63(4 Suppl.):S10 15.

51. Bisschops R, Areia M, Coron E, et al. Performance measures for upper gastrointestinal endoscopy: a European Society of Gastrointestinal Endoscopy (ESGE) Quality Improvement Initiative. Endoscopy 2016;48 (9):843 64. 52. Yao K. The endoscopic diagnosis of early gastric cancer. Ann Gastroenterol 2013;26(1):11 22. 53. Bretthauer M, Aabakken L, Dekker E, et al. Requirements and standards facilitating quality improvement for reporting systems in gastrointestinal endoscopy: European Society of Gastrointestinal Endoscopy (ESGE) Position Statement. Endoscopy 2016;48(3):291 4. 54. Surveillance, Epidemiology, and End Results (SEER) Program. SEER*Stat database: mortality-all COD, aggregated with State, total US (1969-2014) ,Katrina/Rita Population Adjustment . . In: Bethesda MNCI, editor; 2016. [Updated 10.09.18]. 55. Zauber AG, Winawer SJ, O’Brien MJ, et al. Colonoscopic polypectomy and long-term prevention of colorectalcancer deaths. N Engl J Med 2012;366(8):687 96. 56. Winawer SJ, Zauber AG, Ho MN, et al. Prevention of colorectal cancer by colonoscopic polypectomy. The National Polyp Study Workgroup. N Engl J Med 1993;329(27):1977 81. 57. Shinya H, Wolff WI. Morphology, anatomic distribution and cancer potential of colonic polyps. Ann Surg 1979;190(6):679 83. 58. Leufkens AM, van Oijen MG, Vleggaar FP, Siersema PD. Factors influencing the miss rate of polyps in a back-to-back colonoscopy study. Endoscopy 2012;44(5):470 5. 59. El-Halabi MM, Rex DK, Saito A, Eckert GJ, Kahi CJ. Defining adenoma detection rate benchmarks in averagerisk male veterans. Gastrointest Endosc 2019;89(1):137 43. 60. Corley DA, Jensen CD, Marks AR, et al. Adenoma detection rate and risk of colorectal cancer and death. N Engl J Med 2014;370(14):1298 306. 61. Kaminski MF, Wieszczy P, Rupinski M, et al. Increased rate of adenoma detection associates with reduced risk of colorectal cancer and death. Gastroenterology 2017;153(1):98 105. 62. GI quality measures for 2017 released in MACRA final rule. 63. GIQuIC. “What Is GIQuIC?” GI Quality Improvement Consortium, ,http://giquic.gi.org/what-is-giquic. asp#measures.[ accessed 25.11.19]. 64. Williams JE, Le TD, Faigel DO. Polypectomy rate as a quality measure for colonoscopy. Gastrointest Endosc 2011;73(3):498 506. 65. Karnes WE, Ninh A, Dao T, Requa J, Samarasena JB. Sa1925. Real-time identification of anatomic landmarks during colonoscopy using deep learning. Gastrointest Endosc 2018;87(6):AB252. 66. Rombaoa C, Kalra A, Dao T, et al. Tu1932. Automated insertion time, cecal intubation, and withdrawal time during live colonoscopy using convolutional neural networks a video validation study. Gastrointest Endosc 2019;89(6):AB619. 67. Guo R, Wang YJ, Liu M, et al. The effect of quality of segmental bowel preparation on adenoma detection rate. BMC Gastroenterol 2019;19(1):119. 68. Karnes WE, Ninh A, Dao T, Requa J, Samarasena JB. Sa1940. Unambiguous real-time scoring of bowel preparation using artificial intelligence. Gastrointest Endosc 2018;87(6):AB258. 69. Zhou J, Wu L, Wan X, et al. A novel artificial intelligence system for the assessment of bowel preparation (with video). Gastrointest Endosc 2020;91.. 70. Chin M, Karnes W, Jamal MM, et al. Use of the Endocuff during routine colonoscopy examination improves adenoma detection: a meta-analysis. World J Gastroenterol 2016;22(43):9642 9. 71. Xu L, Zhang Y, Song H, Wang W, Zhang S, Ding X. 
Nurse participation in colonoscopy observation versus the colonoscopist alone for polyp and adenoma detection: a meta-analysis of randomized, controlled trials. Gastroenterol Res Pract 2016;2016:7631981. 72. Maroulis DE, Iakovidis DK, Karkanis SA, Karras DA. CoLD: a versatile detection system for colorectal lesions in endoscopy video-frames. Comput Methods Programs Biomed 2003;70(2):151 66. 73. Hwang S, Oh J, Tavanapong W, Wong J. Polyp detection in colonoscopy video using elliptical shape feature. Proc ICIP 2007;2:465 8. 74. Wang Y, Tavanapong W, Wong J, Oh JH, de Groen PC. Polyp-Alert: near real-time feedback during colonoscopy. Comput Methods Programs Biomed 2015;120(3):164 79. 75. Bernal J, Sanchez FJ, Fernandez-Esparrach G, Gil D, Rodriguez C, Vilarino F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. Comput Med Imaging Graph 2015;43:99 111.

76. Li T, Cohen J, Craig M, Tsourides K, Mahmud N, Berzin TM. Mo1979. The next endoscopic frontier: a novel computer vision program accurately identifies colonoscopic colorectal adenomas. Gastrointest Endosc 2016;83 (5):AB482. 77. Wang P, Xiao X, Glissen Brown JR, et al. Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy. Nat Biomed Eng 2018;2(10):741 8. 78. Wang P, Xiao X, Liu J, et al. A prospective validation of deep learning for polyp auto-detection during colonoscopy: 2017 international award: 205. Am J Gastroenterol 2017;112:S106 10. 79. Urban G, Tripathi P, Alkayali T, et al. Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy. Gastroenterology 2018;155(4):1069 1078.e1068. 80. Wang P, Berzin TM, Glissen Brown JR, et al. Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut 2019;68(10):1813 19. 81. Kudo SE, Misawa M, Mori Y, et al. Artificial Intelligence-assisted System Improves Endoscopic Identification of Colorectal Neoplasms, Clin Gastroenterol Hepatol 2020;18(8):1874 1881. 82. Kaminski MF, Hassan C, Bisschops R, et al. Advanced imaging for detection and differentiation of colorectal neoplasia: European Society of Gastrointestinal Endoscopy (ESGE) guideline. Endoscopy 2014;46(5):435 49. 83. Elwir S, Shaukat A, Shaw M, Hughes J, Colton J. Variability in, and factors associated with, sizing of polyps by endoscopists at a large community practice. Endosc Int Open 2017;5(8):E742 5. 84. Requa J, Dao T, Ninh A, Karnes W. Can a convolutional neural network solve the polyp size dilemma? Category award (colorectal cancer prevention) presidential poster award: 282. Am J Gastroenterol 2018;113:S158. 85. Haggitt RC, Glotzbach RE, Soffer EE, Wruble LD. Prognostic factors in colorectal carcinomas arising in adenomas: implications for lesions removed by endoscopic polypectomy. Gastroenterology 1985;89(2):328 36. 86. No authors listed. .The Paris endoscopic classification of superficial neoplastic lesions: esophagus, stomach, and colon: November 30 to December 1, 2002. Gastrointest Endosc 2003;58(6 Suppl.):S3 43. 87. Soetikno RM, Kaltenbach T, Rouse RV, et al. Prevalence of nonpolypoid (flat and depressed) colorectal neoplasms in asymptomatic and symptomatic adults. JAMA 2008;299(9):1027 35. 88. van Doorn SC, Hazewinkel Y, East JE, et al. Polyp morphology: an interobserver evaluation for the Paris classification among international experts. Am J Gastroenterol 2015;110(1):180 7. 89. Kim JH, Nam KS, Kwon HJ, et al. Assessment of colon polyp morphology: is education effective? World J Gastroenterol 2017;23(34):6281 6. 90. Abu Dayyeh BK, Thosani N, Konda V, et al. ASGE Technology Committee systematic review and metaanalysis assessing the ASGE PIVI thresholds for adopting real-time endoscopic assessment of the histology of diminutive colorectal polyps. Gastrointest Endosc 2015;81(3):502.e501 16. 91. Hewett DG, Kaltenbach T, Sano Y, et al. Validation of a simple classification system for endoscopic diagnosis of small colorectal polyps using narrow-band imaging. Gastroenterology 2012;143(3):599 607.e591. 92. Misawa M, Kudo SE, Mori Y, et al. Characterization of colorectal lesions using a computer-aided diagnostic system for narrow-band imaging endocytoscopy. Gastroenterology 2016;150(7):1531 1532.e1533. 93. Rath T, Tontini GE, Vieth M, Nagel A, Neurath MF, Neumann H. 
In vivo real-time assessment of colorectal polyp histology using an optical biopsy forceps system based on laser-induced fluorescence spectroscopy. Endoscopy 2016;48(6):557 62. 94. Ladabaum U, Fioritto A, Mitani A, et al. Real-time optical biopsy of colon polyps with narrow band imaging in community practice does not yet meet key thresholds for clinical decisions. Gastroenterology 2013;144 (1):81 91. 95. Byrne MF, Shahidi N, Rex DK. Will computer-aided detection and diagnosis revolutionize colonoscopy? Gastroenterology 2017;153(6):1460 1464.e1461. 96. Zachariah R, Samarasena J, Luba D, et al. Prediction of polyp pathology using convolutional neural networks achieves “resect and discard” thresholds. Am J Gastroenterol 2020;115(1):138 44. 97. Kondo S, Yamaji Y, Watabe H, et al. A randomized controlled trial evaluating the usefulness of a transparent hood attached to the tip of the colonoscope. Am J Gastroenterol 2007;102(1):75 81. 98. Dumoulin FL, Hildenbrand R. Endoscopic resection techniques for colorectal neoplasia: current developments. World J Gastroenterol 2019;25(3):300 7. 99. Buyukberber M, Savas MC, Gulsen MT, Koruk M, Kadayifci A. Argon plasma coagulation in the treatment of hemorrhagic radiation proctitis. Turk J Gastroenterol 2005;16(4):232 5. 100. Sakai E, Ohata K, Nakajima A, Matsuhashi N. Diagnosis and therapeutic strategies for small bowel vascular lesions. World J Gastroenterol 2019;25(22):2720 33.

101. Samarasena J, Yu AR, Torralba EJ, et al. Artificial intelligence can accurately detect tools used during colonoscopy: another step forward toward autonomous report writing: presidential poster award: 1075. Am J Gastroenterology 2018;113:S619 20. 102. D’Haens G, Sandborn WJ, Feagan BG, et al. A review of activity indices and efficacy end points for clinical trials of medical therapy in adults with ulcerative colitis. Gastroenterology 2007;132(2):763 86. 103. Mohammed Vashist N, Samaan M, Mosli MH, et al. Endoscopic scoring indices for evaluation of disease activity in ulcerative colitis. Cochrane Database Syst Rev 2018;1:Cd011450. 104. Ket SN, Palmer R, Travis S. Endoscopic disease activity in inflammatory bowel disease. Curr Gastroenterol Rep 2015;17(12):50. 105. Mazzuoli S, Guglielmi FW, Antonelli E, Salemme M, Bassotti G, Villanacci V. Definition and evaluation of mucosal healing in clinical practice. Dig Liver Dis 2013;45(12):969 77. 106. Daperno M, Comberlato M, Bossa F, et al. Training programs on endoscopic scoring systems for inflammatory bowel disease lead to a significant increase in interobserver agreement among community gastroenterologists. J Crohns Colitis 2017;11(5):556 61. 107. Feagan BG, Sandborn WJ, D’Haens G, et al. The role of centralized reading of endoscopy in a randomized controlled trial of mesalamine for ulcerative colitis. Gastroenterology 2013;145(1):149 157.e142. 108. Ozawa T, Ishihara S, Fujishiro M, et al. Novel computer-assisted diagnosis system for endoscopic disease activity in patients with ulcerative colitis. Gastrointest Endosc 2019;89(2):416 421.e411.


13 Lessons learnt from harnessing deep learning for real-world clinical applications in ophthalmology: detecting diabetic retinopathy from retinal fundus photographs

Yun Liu, Lu Yang, Sonia Phene and Lily Peng

Abstract
Diabetic retinopathy (DR) is one of the fastest growing causes of blindness and has prompted the implementation of national screening programs. To help address the shortage of experts to grade images for signs of DR, there has been a surge of interest in artificial intelligence for DR detection. In this chapter, we will cover both historical and recent deep learning algorithms for automated DR detection, the current state of regulatory approval and clinical validation, and the future outlook.

Keywords: Artificial intelligence; deep learning; ophthalmology; diabetic retinopathy

13.1 Introduction Diabetes, characterized by hyperglycemia, is a chronic systemic illness that requires a multipronged management strategy. One of the more insidious complications of the disease is diabetic retinopathy (DR, damage to the retina of the eye), stemming from prolonged hyperglycemia. DR is typically asymptomatic initially but can eventually progress to vision-threatening disease. Due to the lack of effective means of reversing DR, treatments such as photocoagulation surgery1 and intravitreal injections of antivascular endothelial growth factor2 generally aim to prevent further vision loss. As such, avoiding vision loss due to DR depends on timely detection through regular screening.3

DR screening is consistent with the 10 principles from the classic 1968 Principles and Practice of Screening for Disease by Wilson and Jungner4 and is cost-effective.3 However, regular screening still imposes a high burden on the health-care system. For example, the American Diabetes Association (ADA) guidelines recommend initiating annual screening upon diagnosis of type 2 diabetes or within 5 years after the onset of type 1 diabetes.5 Based on an estimated global prevalence of 500 million people with type 2 diabetes,6 following these guidelines would result in an unacceptably high screening workload that would further tax an already limited number of eye care professionals, particularly in developing countries.7

To help improve the feasibility of screening large populations, artificial intelligence (AI) has been developed for detecting DR. Several major developments in recent years include the publication of large, landmark studies in the Journal of the American Medical Association (JAMA)8,9 and the first autonomous AI system approved by the US Food and Drug Administration (FDA).10 Notably, the recent 2020 ADA guidelines state, "Artificial intelligence systems that detect more than mild diabetic retinopathy and diabetic macular edema authorized for use by the FDA represent an alternative to traditional screening approaches," with a caution that "the benefits and optimal utilization of this type of screening have yet to be fully determined."5

The remainder of this chapter will briefly summarize work in this area to compare historical methods with current efforts; highlight key aspects other than the AI algorithm that are important to scrutinize in AI efforts; provide a snapshot of the regulatory landscape for AI; discuss challenges that remain for real-world use of AI for DR; and describe the future outlook from our perspective.

13.2 Historical artificial intelligence for diabetic retinopathy The concept of automated DR screening is not new and began with computational analysis of ocular imaging. These earlier works, as reviewed previously,11 generally focused on handcrafted algorithms based on observations of the visual features of interest. One early example from 198212 is the detection of hemorrhages, exudates, optic disks, and arteriovenous "crossings" from color fundus photographs using multiple manually engineered techniques. As another example, a 198413 study on automated detection in fluorescein angiograms used brightness as a criterion to isolate darker regions, which tend to be blood vessels or lesions; the authors then filtered for circular shapes to extract only circular lesions such as microaneurysms. Besides the detection of pathologies, classical image processing techniques have also enabled automated detection of retinal structures, such as blood vessels, the fovea, and the optic disk. Sinthanayothin et al. showed encouraging results that various components of the retina can be accurately detected.14 Detecting these features also helped computers better understand fundus photographs, aiding subsequent classification of pathologies and disease severity based on anatomic region. There were several exceptions to hard-engineered approaches in this era, such as the use of artificial neural networks to classify anatomic regions or pathologies (vessels, exudates, and hemorrhages) by Gardner et al.15 and to detect DR directly by Nguyen et al.,16 both in 1996.

Regardless of the method, constraints in computational resources and in the availability of digitized images generally limited the work in this era to smaller numbers of images and simpler neural network architectures.

As computers became faster and image capture technology became more commonplace, machine learning became part of the repertoire of methods for DR screening. In contrast to specifying exact mathematical functions, studies in this era frequently used techniques to extract general features and tasked the computer with learning how those features could be integrated to interpret the image. For example, a 1998 study on microaneurysm detection in fluorescein angiograms first extracted 13 features based on shape and color; the authors then compared empirically determined rules against machine learning techniques (linear discriminant analysis and an artificial neural network) for making the final determination of microaneurysm versus not.17 Another study in 200518 focused on color fundus photographs and, by combining several technical improvements such as additional features and a machine learning technique called a k-nearest neighbor classifier, showed a lesion detection sensitivity of 100% and a specificity of 87%, achieving a performance "close to that of a human expert." Another application of artificial neural networks in this period was by Usher et al. in 2003,19 who concluded that the system could help human graders bypass grading of one-third of the images.

With time, previously developed efforts were also validated in larger studies. For example, Philip et al. tested their automated DR (disease/no disease) system on 6722 patients from a regional primary care-based DR screening program, achieving a sensitivity of 91% and specificity of 67%, compared to 87% sensitivity and 95% specificity for manual grading.20 In that study, 8.2% of patients had ungradable images (mostly due to "technical failures") and 62.5% had no disease according to the reference standard. In a 2008 study of referable DR detection conducted on 5692 patients, Abràmoff et al.21 achieved an area under the receiver operating characteristic curve (AUC) of 0.84, a sensitivity of 84%, and a specificity of 64%. Their approach involved automatically determining whether the image was of sufficient quality based on the presence of particular image structures with a particular spatial distribution. The blood vessels were then automatically segmented using pixel feature classification to exclude false-positive "red lesions." Finally, the optic disc and red lesions were distinguished using feature classification.
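A rough sketch of the kind of classical pipeline described above, isolating dark regions by brightness thresholding and then keeping only small, compact blobs as microaneurysm candidates, is shown below. The synthetic image, threshold, and size/shape cutoffs are assumptions for illustration, and the circularity test is a crude bounding-box proxy; the original published algorithms were considerably more elaborate.

```python
# Rough sketch of a classical microaneurysm candidate detector: threshold dark
# regions, label connected components, and keep small, compact blobs while
# rejecting long, thin structures such as vessels. All parameters are illustrative.

import numpy as np
from scipy import ndimage

def microaneurysm_candidates(img, dark_thresh=0.3, max_area=40, min_fill=0.6):
    """img: 2D float array in [0, 1]; returns bounding boxes of candidate blobs."""
    dark = img < dark_thresh                       # darker than background
    labeled, n = ndimage.label(dark)               # connected dark regions
    candidates = []
    for sl in ndimage.find_objects(labeled):
        region = labeled[sl] > 0
        area = region.sum()
        h, w = region.shape
        fill = area / (h * w)                      # how well the bounding box is filled
        aspect = min(h, w) / max(h, w)             # close to 1.0 for square-ish blobs
        # Small, compact, roughly isotropic blobs are kept as candidates;
        # long thin regions (vessel-like) fail the area/aspect checks.
        if area <= max_area and fill >= min_fill and aspect >= 0.5:
            candidates.append(sl)
    return candidates

if __name__ == "__main__":
    img = np.ones((64, 64))
    img[10:14, 10:14] = 0.1                        # small dark dot (candidate)
    img[30:32, 5:60] = 0.1                         # long dark line (vessel-like)
    print("Candidate regions:", microaneurysm_candidates(img))
```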

13.3 Deep learning era

Deep learning22 is a subfield of machine learning and, like the previously discussed machine learning efforts, often uses labeled datasets. Whereas previous efforts started with feature engineering, in which features (such as the color and shape descriptors used for microaneurysms) are first manually engineered and a machine learning technique then uses those features to make a final determination, deep learning replaces the feature engineering step with layers of computation that automatically learn the features from the data. This approach, however, shifts the burden from engineering complex features toward data and computational resources. The trifecta of faster computers, larger datasets, and improved methods has triggered a recent resurgence of interest in deep learning. Faster computers were driven both by faster, cheaper general-purpose hardware and by the adoption of specialized hardware, such as graphics cards, to accelerate the training of neural networks. In terms of datasets, the creation of the ImageNet Large Scale Visual Recognition Challenge and its associated dataset containing millions of labeled images23 enabled the rapid exploration of new ways of arranging artificial neurons (i.e., "architectures") for image recognition. These efforts produced popular architectures such as AlexNet,24 VGG,25 various versions of GoogLeNet/Inception,26–28 and ResNet.29 Common layers in these architectures include convolutions, average or maximum (max) pooling, and "fully connected" layers. Generally, lower layers tend to extract "basic" features such as edges, middle layers tend to combine these simpler features into more complex ones, such as shapes and textures, and the final layers use the presence and number of these shapes to determine the final classification of interest. The description "deep" learning originates from the fact that many layers (e.g., 20 or more) may be used. Another technique found to improve neural network training is data augmentation, which uses synthetic, slightly modified versions (e.g., slightly rotated, cropped, or otherwise perturbed) of the training data to encourage the neural network to be insensitive to these perturbations. Finally, many recent efforts leverage transfer learning, in which a neural network learns from one task (such as distinguishing everyday objects) and uses that "knowledge" to improve its ability to perform another task. The most common way of leveraging transfer learning involves pretraining, or preinitialization: first training an existing network architecture (such as Inception or ResNet) on ImageNet, and then refining that network on the new task. In doing so, the amount of data required for the new task is substantially reduced because the basic features learned in the lower layers tend to be similar.30 Remarkably, these popular neural network architectures also generalize to unrelated tasks such as melanoma detection.31 A review of these concepts for clinicians can be found in a chapter of the JAMA Users' Guides to the Medical Literature series, "How to Read Articles That Use Machine Learning."32
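To make the pretraining-and-fine-tuning recipe concrete, the sketch below starts from an ImageNet-pretrained backbone (here ResNet-50, though Inception or MobileNet would follow the same pattern), replaces its classification head with a two-class "referable DR" head, and trains with simple data augmentation. This is a minimal illustration rather than any group's published pipeline: the random tensors stand in for labeled fundus photographs, all hyperparameters are placeholder assumptions, and a recent PyTorch/torchvision installation is assumed.

```python
# Minimal sketch of transfer learning with augmentation for a DR-style classifier.
# Random tensors stand in for a real labeled fundus-photograph dataset.
import torch
import torch.nn as nn
from torchvision import models, transforms

augment = transforms.Compose([                    # perturbations the model should tolerate
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.RandomHorizontalFlip(),
])

# ImageNet-pretrained backbone (weights download on first use); new 2-class head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)     # referable DR vs not (placeholder task)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for step in range(3):                             # stand-in for epochs over a real data loader
    images = augment(torch.rand(8, 3, 256, 256))  # batch of placeholder "fundus photos"
    labels = torch.randint(0, 2, (8,))             # placeholder reference-standard grades
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.3f}")
```

Because the pretrained lower layers already encode generic edge and texture features, this recipe typically needs far less task-specific data than training the same architecture from scratch.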
In the remainder of this section, we summarize recent work in deep learning for DR detection. More examples are highlighted in recent reviews33 and summarized in Table 13.1.

In 2016 Gulshan et al.9 published in JAMA one of the largest studies at the time applying deep learning to DR detection: 128,175 retinal images were each graded 3 to 7 times for DR and then used to train a deep convolutional neural network. The algorithm was evaluated at two operating points chosen during development: a high-sensitivity point and a high-specificity point. On both validation datasets, EyePACS-1 (9963 images, 4997 patients) and Messidor-2 (1748 images, 874 patients), the AUC was 0.99. At the high-specificity operating point, sensitivity and specificity were 90% and 98% for EyePACS-1 and 87% and 99% for Messidor-2; at the high-sensitivity operating point, the corresponding numbers were 98% and 93% for EyePACS-1 and 96% and 94% for Messidor-2. These numbers were based on fully gradable images. Soon after, in 2017, Ting et al.8 published another JAMA paper describing a deep learning system that detected DR as well as age-related macular degeneration (AMD) and possible glaucoma. Training used 70,000 to 125,000 images for the three conditions. The evaluation focused on 10 multiethnic cohorts with diabetes from the Singapore National Diabetic Retinopathy Screening Program, consisting of a total of 494,661 fundus photographs. When ungradable images were counted as an algorithmic "refer," the system achieved a sensitivity and specificity of 91% and 92%, respectively, for detecting referable DR; 100% and 91% for vision-threatening DR; 96% and 87% for possible glaucoma; and 93% and 89% for AMD. One of the chief strengths of this work was its extensive evaluation across multiple patient populations.

Many other groups have also explored automated DR screening in this new deep learning era. Notably, IDx-DR ran the first pivotal trial of DR screening,10 leading to FDA clearance of their system; more details can be found in the regulatory section of this chapter. Other groups working in this space include TeleMedC, Healgoo, Retinalyze, EyeArt, and Bosch (Table 13.1). The research community has collectively tested automated DR screening on varied datasets spanning different ethnicities, ages, sexes, camera types, and more. It is worth noting that several aspects of study design differed across studies. For example, the grading protocols differed, the most common being the International Clinical Diabetic Retinopathy severity scale and the Early Treatment Diabetic Retinopathy Study (ETDRS) severity scale. The grades were provided by different sources, such as a reading center, local graders, or panels of retina specialists. Some AI systems (or study designs) could additionally flag images as ungradable; ungradable images were included in the performance metrics of some studies (i.e., counted as "refer" to reflect real-world use) but not others. Evaluations were also variably reported per patient, per eye, or even per image. Finally, patient populations varied across studies. Despite these difficulties in comparing studies, more recent work has generally produced more accurate automated DR screening algorithms, with several studies showing that AI algorithms were on par with clinicians at the specific task of grading DR severity or detecting referable DR. Because AI can generally grade more quickly than human graders, it can also be used for rapid triage, second reads, or even epidemiological studies.57

Another trend worth highlighting is mobile health (mHealth), which uses now-ubiquitous mobile devices, many of which have excellent optical systems that can be combined with custom attachments for fundus imaging. For example, Rajalakshmi et al. used a smartphone-based imaging device (Remidio Fundus on Phone) to capture four fields per eye and ran an automated DR screening algorithm (EyeArt) on the images.36 The algorithm showed 96% sensitivity and 80% specificity for detection of any DR, and 99% sensitivity and 80% specificity for detection of sight-threatening DR. Similarly, Natarajan et al. ran a deep learning model based on the MobileNet architecture58 on images captured by the same smartphone fundus imaging device, achieving 100% sensitivity and 88% specificity for detection of referable DR.56 Smartphone-based imaging often requires patients to be dilated and can be more challenging to perform, resulting in many ungradable images that are not consistently captured in reported sensitivity and specificity metrics. Still, these efforts may improve the feasibility of DR screening in remote areas where traditional tabletop fundus cameras cannot be deployed.

TABLE 13.1 Summary of recent efforts in developing and validating artificial intelligence (AI) for detecting diabetic retinopathy (DR) and other conditions that may be found as part of a diabetic eye exam.

Name of product and regulatory approvals (if applicable) | Authors, journal, year | Validation set | Outputs | AUC | Sensitivity | Specificity
Bosch | Bawankar et al., PLoS One, 201734 | 1128 eyes from 564 patients in India | DR from nonmydriatic single-field images | Not reported | 91% | 97%
EyeArt: EU, Canada | Tufail et al., Ophthalmology, 201735 | 20,258 patients in the United Kingdom | Referable DR | Not reported | 93.8% | 15.8%
EyeArt | Rajalakshmi et al., Eye, 201836 | 301 patients from a tertiary care diabetes center in India | Sight-threatening DR | NA | 99.1% (using Remidio smartphone-based device) | 80.4% (using Remidio smartphone-based device)
EyeArt | Bhaskaranand et al., Diabetes Technology & Therapeutics, 201937 | 850,908 images from 101,710 visits in EyePACS | mtmDR | 0.965 | 91.3% | 91.1%
Google/Verily (unnamed): EU | Gulshan et al., JAMA, 20169 | 9963 images from 4997 patients (EyePACS-1) and 1748 images from 874 patients (Messidor-2) | Moderate or worse DR or referable macular edema | 0.991, 0.990 | High-specificity operating point: 90.3% and 87.0%; high-sensitivity operating point: 97.5% and 96.1% | High-specificity operating point: 98.1% and 98.5%; high-sensitivity operating point: 93.4% and 93.9%
Google/Verily | Krause et al., Ophthalmology, 201838 | 1958 images from 999 patients in EyePACS | Moderate or worse DR, mild or worse DR | 0.986 (worse than moderate), 0.977 (worse than mild) | 97.1% (worse than moderate), 97.0% (worse than mild) | 92.3% (worse than moderate), 91.7% (worse than mild)
Google/Verily | Gulshan et al., JAMA Ophthalmology, 201939 | 3049 patients from 2 eye care centers in India | Moderate or worse DR and referable diabetic macular edema | 0.963, 0.980 | 88.9%, 92.1% | 92.2%, 95.2%
Google/Verily | Ruamviboonsuk et al., NPJ Digital Medicine, 201940 | 29,943 images from 7517 patients in Thailand | Referable DR or DME | 0.987 | 97% | 96%
Google/Verily | Sayres et al., Ophthalmology, 201941 | 1796 images from 1612 patients from EyePACS | Moderate or worse DR | N/A (reader study comparing graders with vs without AI assistance) | Without AI: 79.4%; with AI (two modes): 87.5%, 88.7% | Without AI: 96.6%; with AI (two modes): 96.1%, 95.5%
Google/Verily | Phene et al., Ophthalmology, 201942 | Multiple datasets; total of 11,193 images/patients | Referable glaucomatous optic neuropathy | 0.945 for glaucoma | 76.1% | 92.3%
Google/Verily | Varadarajan et al., Nature Communications, 202043 | 1033 images from 697 patients in Thailand and 990 images from 554 patients in EyePACS-DME | Center-involved diabetic macular edema (ci-DME) | 0.89 | 85% | 80%
Healgoo | Li et al., Diabetes Care, 201844 | 19,900 images (internal validation); 35,201 images from 14,520 eyes in Australia and Singapore (external validation) | Preproliferative DR or worse, diabetic macular edema | 0.989 (internal), 0.955 (external) | 97.0% (internal), 92.5% (external) | 91.4% (internal), 98.5% (external)
IDx-DR: EU, United States | Abràmoff et al., Diabetes Care, 200821 | 5692 patients in the Netherlands | Referable DR | 0.84 | 84% | 64%
IDx-DR | Abràmoff et al., JAMA Ophthalmology, 201345 | 874 patients in France | Referable DR | 0.937 | 96.8% | 59.4%
IDx-DR | Hansen et al., PLOS ONE, 201546 | 4381 patients in Kenya | Referable DR | 0.878 | 91.0% | 69.9%
IDx-DR | Abràmoff et al., Investigative Ophthalmology & Visual Science, 201647 | 1748 images from 874 patients in Messidor-2 | More-than-mild DR (mtmDR), macular edema | 0.98 | 96.8% | 87%
IDx-DR | Abràmoff et al., NPJ Digital Medicine, 201810 | 900 patients in the United States | mtmDR and DME | 0.98 | 87.2% | 90.7%
IDx-DR | Van der Heijden et al., Acta Ophthalmologica, 201848 | 1415 patients in the Netherlands | Referable DR | 0.87 | 68% | 86%
iGradingM: EU | Tufail et al., Ophthalmology, 201735 | 20,258 patients in the United Kingdom | Disease or ungradable | Not reported | 100% (iGradingM classified all images as "disease" or "ungradable") | 0% (iGradingM classified all images as "disease" or "ungradable")
Retinalyze: EU | Larsen et al., Investigative Ophthalmology & Visual Science, 200349 | 260 eyes from 137 patients | Any DR | 0.936 | 90.1% | 81.3%
Retinalyze | Hansen et al., Acta Ophthalmologica Scandinavica, 200450 | 163 eyes from 83 patients | Any DR | 0.918 (without dilation); 0.940 (with dilation) | 89.9% (without dilation); 97.0% (with dilation) | 97.0% (without dilation); 75.0% (with dilation)
Retinalyze | Bouhaimed et al., Diabetes Technology & Therapeutics, 200851 | 192 eyes from 96 patients | Any DR | Not reported | 88% | 52%
RetmarkerDR: EU | Tufail et al., Ophthalmology, 201735 | 20,258 patients in the United Kingdom | Referable DR | Not reported | 85.0% | 48.8%
SELENA+: Singapore | Ting et al., JAMA, 20178 | 71,896 images from 12,880 patients (primary), 40,752 images (secondary) | DR, AMD, possible glaucoma | 0.936 | 90.5% for referable DR; 100% for vision-threatening DR | 91.6% for referable DR; 91.1% for vision-threatening DR
SELENA+ | Bellemo et al., Lancet Digital Health, 201952 | 4504 images from 3093 eyes of 1574 patients in Zambia | Moderate nonproliferative DR or worse, DME | 0.973 | 92.25% | 89.04%
TeleMedC: Australia, EU, Singapore | Kanagasingam et al., JAMA Network Open, 201853 | 386 images from 193 patients in Australia | Clinically significant DR | Not reported | 100% (n = 2 cases) | 92%
TeleMedC | Yu et al., IEEE Journal of Biomedical and Health Informatics, 201854 | 424 images from multiple public datasets: Messidor, High-Resolution Fundus Image Database, DIARETDB0, EyePACS (via Kaggle) | Neovascularization in the optic disc | 0.993 (cross-validation) | 92.9% (cross-validation) | 96.3% (cross-validation)
Unnamed | Gargeya and Leng, Ophthalmology, 201755 | 1748 images from Messidor-2, 405 images from E-Ophtha | Healthy versus DR | 0.94 (Messidor-2), 0.95 (E-Ophtha) | 93% (Messidor-2), 90% (E-Ophtha) | 87% (Messidor-2), 94% (E-Ophtha)
Unnamed | Natarajan et al., JAMA Ophthalmology, 201956 | 213 patients in India | Referable DR | Not reported | 100.0% (using Remidio smartphone-based device) | 88.4% (using Remidio smartphone-based device)

Note: performance numbers (AUC, sensitivity, specificity) are not directly comparable across studies because of variability in evaluation setup (eye- vs patient-level), ground truth (Wisconsin reading center vs local graders), inclusion versus exclusion of ungradable images, and differences in patient populations.

13.4 Lessons from interpreting and evaluating studies

Though the most prominent parts of most studies remain the headline performance metrics (i.e., sensitivity and specificity for DR), there are many other important aspects of evaluation.
Chief among these is the reference standard against which the algorithm is being evaluated.32,59,60 For example, Krause et al.38 replaced the reference standard used in earlier work9 (the majority opinion of retina specialists) with adjudicated consensus grades, in which the graders could discuss and resolve any disagreements. Against this more rigorous reference standard, the same algorithm had a 30% lower measured error rate, suggesting that some of the originally measured "errors" actually reflected errors in the reference standard. Thus, in the absence of an unambiguous gold standard, readers must take care to evaluate claims in the context of the reference standard: what imaging modality was used (e.g., ultrawidefield imaging, number of fields, stereo vs monocular photographs, fluorescein angiography, optical coherence tomography); whether images from both eyes or past images from the same patient were available; whether additional clinical data such as visual acuity were available; the number of graders and their years of experience; whether the graders had the opportunity to discuss disagreements; and so on.

Another important but subtle point is dataset size. Often, multiple datasets are involved: a development dataset for training the AI algorithm, and one or more validation datasets for evaluating it. Increasing the size of the validation datasets improves the precision of the performance measurement via narrower confidence intervals. On the other hand, increasing the size of the development dataset often improves the accuracy of AI algorithms by providing more examples from which to learn, albeit with diminishing returns. The amount of training data needed is often unclear, though experience suggests that tens of thousands of examples may be needed for optimal performance.9 Because modern machine learning algorithms can easily overfit by learning to detect spurious patterns, those developing AI algorithms should maintain a "tuning" dataset independent of the validation dataset.32,61 This tuning dataset should be used to make all development decisions, including seemingly benign ones such as operating point selection.

The ability to select an operating point is both an inherent strength and a source of substantial difficulty in developing and validating algorithms. An operating point is similar to a cutoff for a diagnostic test, such as HbA1c > 6.5% for diabetes. In diagnostic interpretation, the operating point describes how "conservative" the algorithm is, similar to how some clinicians are more conservative than others and may err toward further diagnostic testing or treatment of patients with borderline findings. Because AI algorithms most commonly output a continuous value (e.g., 0.12, 0.87), applying the AI predictions to clinical decisions generally requires a threshold to be chosen, both for metric calculation and for real-world decision criteria (e.g., refer if above 0.4). This also offers the advantage that a threshold can be chosen to best match the resources available. For example, if referred patients are seen by an ophthalmologist, then the number of available ophthalmologists may inform the target referral rate or the level of DR that should trigger urgent referral.

Another important aspect of the evaluation is the patient population represented in the study, particularly in the evaluation dataset. Generally, this population needs to reflect the patient population under consideration: whether the study involved patients with diabetes undergoing regular screening and follow-up versus newly diagnosed patients; type 1 versus type 2 diabetes; demographics such as age, sex, race, and ethnicity; hemoglobin A1c levels as a marker of blood glucose control; and so forth. Partially as a consequence of the difficulty of understanding precisely how neural networks make their predictions, evaluation beyond overall performance (e.g., sensitivity and specificity), in the form of subgroup analysis in specific patient populations, is needed,62 such as measuring performance on cases with the most severe DR, that is, vision-threatening DR. This helps ensure that the AI does not catch most cases of moderate DR yet miss proliferative cases because of the relative rarity of such examples to learn from. More subtle subgroups involve factors such as age and sex, because of as-yet poorly understood morphological changes that an AI can detect accurately even when humans cannot.63
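The following minimal sketch, on synthetic numbers rather than any real study data, illustrates two of the practices just described: an operating point is chosen on a tuning set that is kept separate from the validation set, and sensitivity and specificity are then reported on the validation set both overall and within a higher-stakes subgroup (here a made-up "vision-threatening DR" flag). The simulated scores, the 97% sensitivity target, and the subgroup definition are all assumptions for illustration.

```python
# Minimal sketch: choose an operating point on a tuning set, then report
# sensitivity/specificity on a separate validation set, overall and by subgroup.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)

def simulate(n):
    y = rng.integers(0, 2, n)                                    # reference standard (1 = referable DR)
    scores = np.clip(y * 0.4 + rng.normal(0.4, 0.25, n), 0, 1)   # continuous AI output
    return y, scores

y_tune, s_tune = simulate(2000)
y_val, s_val = simulate(2000)
vt_dr = (rng.random(2000) < 0.1) & (y_val == 1)                  # subgroup: "vision-threatening" cases

# Choose a high-sensitivity operating point on the TUNING set only:
# the largest threshold whose tuning-set sensitivity is at least 97%.
fpr, tpr, thresholds = roc_curve(y_tune, s_tune)
threshold = thresholds[np.argmax(tpr >= 0.97)]

def sens_spec(y, scores, thr):
    pred = scores >= thr
    sens = (pred & (y == 1)).sum() / max((y == 1).sum(), 1)
    spec = (~pred & (y == 0)).sum() / max((y == 0).sum(), 1)
    return sens, spec

print("validation sensitivity/specificity:", sens_spec(y_val, s_val, threshold))
print("vision-threatening subgroup sensitivity:",
      sens_spec(y_val[vt_dr], s_val[vt_dr], threshold)[0])
```

In a real study, the operating points, subgroup definitions, and reference standard would all be prespecified in the protocol before the validation set is ever scored.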

13.5 Important factors for real-world usage

Even with the most well-designed retrospective studies, many aspects are challenging to evaluate prior to prospective studies and everyday use. One example is gradability as judged by the AI. Because studies often curate the best datasets for AI development and validation, ungradable images may be excluded. Causes of grading difficulty (whether by AI or human) include blur or occlusion from movement, haziness from cataracts that are common in older populations, and incomplete fields of view from poor patient positioning or compliance. Beyond evaluation in research settings, many AI algorithms designed for real use cases include functionality to flag "ungradable" images for human inspection.8–10 In the real world, poor-quality images cannot simply be excluded and therefore have to be handled separately. Reasons for poor image quality include miosis, cataracts, ptosis, and artifacts such as dust, dirt, condensation, and smudges,64 all of which may affect gradability by both AI and human graders. Proper technician training, and protocols or reminders that encourage regular lens cleaning, may help reduce the frequency of some artifacts. Other issues may be resolved by repeat imaging or dilation, albeit at the cost of patient discomfort. The capability to detect poor image quality in real time enables images to be retaken (with dilation as needed) before the patient leaves.10 This strategy of an immediate gradability check can potentially lower the risk that patients with ungradable images are lost to follow-up.

Another issue that has not been fully explored is the consistency of AI interpretation. Just as human graders demonstrate both intergrader and intragrader variability, the AI needs to be evaluated for consistency. Though the training process involves randomness, most AI algorithms use deep neural networks that have no random component at inference time; given the exact same stored image, the AI predictions will generally remain identical. However, if a second image of the same patient is taken, the pixel values will change and may lead to a change in the prediction as well. Evaluating whether reimaging the same patient with the same setting, operator, and device changes the output is termed "repeatability." More generally, variability across imaging protocols, camera operator training levels, and other conditions also needs to be measured, termed "reproducibility."65

Use of AI in the real world also raises feasibility concerns. Internet connectivity is one example: does the AI require Internet connectivity because the algorithm runs in the cloud? If so, do the locales that most need the AI (such as rural or underserved areas with fewer ophthalmologists) have the required connectivity? Similarly, the portability of the imaging device itself is important: can the device be physically brought to the places where it is needed, or do patients need to be brought to the device, incurring the time, monetary, and scheduling costs of transportation? Though solutions such as portable, smartphone-based fundus imaging systems36,56,66 and on-device AI algorithms exist, these have not yet been validated as extensively on large, diverse populations as the studies discussed above.

An additional real-world consideration is variation in local practice patterns. For example, Denmark refers only patients with vision-threatening DR, whereas many AI algorithms have been developed to detect moderate (also termed "more-than-mild") DR.67 Differences like these may affect the approach to developing AI: they motivate flexible systems (such as finer-grained DR grading) that give institutions or countries the most latitude. Such flexible systems may also be more amenable to future practice guideline changes, such as a hypothetical change to the definition of "severe DR" to include 10 microaneurysms; without flexibility built into the algorithm, significant effort may be required to update it. Consistent with this thought process (and possibly also for historical reasons), some algorithms were developed in a modular fashion, first detecting individual lesions and then interpreting these findings, making it easier to adapt to different final classifications.10 On the other hand, predicting multicategory classifications (such as fine-grained ETDRS DR levels) may restrict the amount of available data for each category in any given dataset, increasing the burden of data collection for both development and validation of the AI. And even if the model is ultimately used and evaluated by dichotomizing at a predetermined threshold (e.g., for referrals), from a machine learning perspective such approaches may not perform as well as "end-to-end" prediction of the final decision of interest (e.g., referable DR). Thus the decision between modular and end-to-end models needs to be thought through carefully for future applications across different locales and usages.

Related to the notion of modularity, if the AI is used by people in an assistive, real-time mode, its interpretations may need to be explainable to the user, whether that user is a grader, optometrist, ophthalmologist, or retina specialist. Sayres et al. showed that AI assistance, whether in the form of displayed model "confidence" in each DR grade or a "heatmap" highlighting the areas the AI used to make its prediction, helped improve graders' accuracy, though at the cost of increased grading time.41 Additional work on explainability will generally be needed to improve understanding, avoid a "black box" phenomenon, and gain clinician trust.
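As a rough illustration of how such a heatmap can be produced, the sketch below computes an integrated-gradients-style attribution: gradients of a chosen class score are averaged along a straight path from a black baseline to the input image and scaled by the input difference. The untrained backbone, random input, and target class index are placeholders only; published assistance tools are considerably more involved than this minimal version, and a recent PyTorch/torchvision is assumed.

```python
# Minimal sketch of an integrated-gradients-style saliency heatmap.
# The model and image are placeholders for a trained DR classifier and a fundus photo.
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()          # placeholder for a trained DR model
image = torch.rand(1, 3, 224, 224)                     # placeholder fundus photograph
baseline = torch.zeros_like(image)                     # black reference image
target_class = 1                                       # e.g., the "referable DR" output unit

steps = 32
total_grads = torch.zeros_like(image)
for alpha in torch.linspace(0.0, 1.0, steps):
    # Interpolate between baseline and input, and take the gradient of the class score.
    interpolated = (baseline + alpha * (image - baseline)).requires_grad_(True)
    score = model(interpolated)[0, target_class]
    grad, = torch.autograd.grad(score, interpolated)
    total_grads += grad

attribution = (image - baseline) * total_grads / steps
heatmap = attribution.abs().sum(dim=1, keepdim=True)   # collapse channels to one map
heatmap = heatmap / heatmap.max()                      # normalize to [0, 1] for display
print(heatmap.shape)                                   # torch.Size([1, 1, 224, 224])
```

In an assistive interface, such a map would typically be upsampled and overlaid on the fundus photograph so the grader can see which regions most influenced the prediction.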

13.6 Regulatory approvals and further validation

In practice, regulatory clearance is required before an AI system can be marketed for clinical use in the relevant jurisdiction. On April 11, 2018, the US FDA cleared the first autonomous AI-based DR diagnostic system, IDx-DR,68 on the basis of a prospective pivotal trial.10 The trial was registered at clinicaltrials.gov as NCT02963441,69 involved 900 subjects, and achieved a sensitivity of 87% and a specificity of 91%. Another study (NCT0311200570) evaluated the EyeArt AI Eye Screening System, with results announced at the Association for Research in Vision and Ophthalmology (ARVO) conference in 2019.71 The EyeArt system was also extensively validated on more than 100,000 consecutive patient visits from 404 primary care clinics, achieving both a sensitivity and a specificity of 91%.37 Along with this regulatory progress, a new Category 1 Current Procedural Terminology (CPT) code (9225X) has been accepted for retinal imaging with automated point-of-care analysis and should take effect in 2021.72 This will facilitate billing in the United States for automated AI-based diagnostics and will be a crucial step toward sustainable use of AI for DR screening.

In other jurisdictions, the European Union's Conformité Européenne (CE) mark has been obtained by multiple systems, including iGradingM, Retmarker, IDx-DR, and EyeArt, with one study concluding that Retmarker and EyeArt achieved acceptable sensitivity and specificity for DR, making them "cost-effective alternatives to a purely manual grading approach."73 The Google/Verily-developed tool was prospectively validated in India39 and also received the CE mark.74 Retrospective validation of the same algorithm was performed in Thailand,40 with prospective trials ongoing.75 Notably, however, the European Union has replaced the Medical Devices Directive (MDD) with the Medical Device Regulation (MDR), resulting in changes to clinical validation requirements and regulatory controls.76 Outside Europe, SELENA+ by the Singapore Eye Research Institute (SERI) and the TeleMedC DR Grader were registered as Class B medical devices in the Singapore Medical Device Register (SMDR)77,78 in 2019. With the Australian Therapeutic Goods Administration, TeleMedC and others were registered as Class 1 medical devices, while RetScanner, which focuses on detecting lesions typical of DR, was registered as a Class IIa device.79 These services are also moving from clinical settings into retail clinics,80 potentially extending the availability of diabetic eye care to more patients. Just as approved drugs undergo postmarket surveillance, the performance and effectiveness of AI algorithms need to be monitored, both to ensure that diagnostic accuracy is preserved and to ensure that sophisticated fraud techniques are not being used.81

13.7 Toward patient impact and beyond

The goal of DR screening is to prevent blindness. For this to happen, patients must be diagnosed correctly, followed up, and given treatment when appropriate. Most of what we have discussed, and what AI focuses on, is improving the accuracy and availability of DR diagnosis. The subsequent steps, however, involve significant challenges,82 particularly because the populations at highest risk of diabetes and DR are often the most socioeconomically disadvantaged. One primary challenge is logistics, which starts with aligning open appointment slots with patients' busy lives. Though strategies such as reducing the lead time to the appointment83 or text messaging reminders84,85 may help reduce no-shows, the unfortunate reality is that some patients will be unable to attend, or will forget, leading to rescheduling or loss to follow-up. Another challenge is patient compliance with the diet or lifestyle changes and medication regimens that aim to control blood glucose levels, though it is important to note that patients are not passive rule followers and should play an active role in the decision-making process.86,87 Patient education on the importance of these topics88 and messaging-based reminders89 may help improve compliance, though modifying patient behavior is likely to remain a challenge going forward.90 Lastly, in addition to the difficulty of scaling DR screening in rural areas, there must be transportation infrastructure and medical facilities that support the administration of treatments when required.

DR screening also does not happen in isolation, with screening clinicians asking only the single question "is DR present?" Referrals, follow-up testing, or treatment may be warranted if other pathologies are found, such as glaucoma suspect, age-related macular degeneration, and cataract.91 Examples of rarer incidental findings include hypertensive retinopathy, central retinal artery occlusion, uveal melanoma, and Hollenhorst plaques. However, AI developed for DR screening is not guaranteed to detect non-DR pathologies, and if used naively, some dangerous entities could be missed in an automated image grading process. One clinical solution is to manually review all non-DR cases, but this may negate any workload reduction provided by the AI in the first place. Though some AI systems have been developed to detect two of the more common non-DR pathologies, glaucoma suspect and age-related macular degeneration,8,42,92 a solution for detecting the long tail of rarer pathologies remains to be seen. Finally, no AI system is perfect, and just as with manual diagnosis, errors will inevitably occur. Though many errors may be resolved later, such as at a subsequent screening, some may result in negative consequences for the patient. Successful use of AI for DR screening will thus also require careful thought about how to resolve the medicolegal aspects.93

13.8 Summary

In this chapter, we have provided a brief overview of the historical and current state of AI for DR screening and discussed key aspects of its validation and use going forward. The diagnostic accuracy of AI for DR screening has been demonstrated in multiple patient populations and settings, and several common non-DR findings are detectable by AI as well. This is an exciting time for AI in DR screening, and the next few years will undoubtedly see substantial additional strides in this area. Though many challenges remain, we are optimistic that AI will help improve diabetic eye care for many patients in the future.

Conflict of interest

All authors of this chapter are employees of Google LLC (our only affiliation) and thus own Alphabet stock, and several of us are coinventors on patents for machine learning for medical imaging.

References

1. The Diabetic Retinopathy Study Research Group. Preliminary report on effects of photocoagulation therapy. Am J Ophthalmol 1976;81:383 96. 2. Writing Committee for the Diabetic Retinopathy Clinical Research Network, Gross JG, Glassman AR, Jampol LM, Inusah S, Aiello LP, et al. Panretinal photocoagulation vs intravitreous ranibizumab for proliferative diabetic retinopathy: a randomized clinical trial. JAMA 2015;314:2137 46. 3. Stefánsson E, Bek T, Porta M, Larsen N, Kristinsson JK, Agardh E. Screening and prevention of diabetic blindness. Acta Ophthalmol Scand 2000;78:374 85.


4. J.M.G. Wilson, G. Jungner, World Health Organization, others, principles and practice of screening for disease (1968). 5. American Diabetes Association. 11. Microvascular complications and foot care: standards of medical care in diabetes-2020. Diabetes Care 2020;43:S135 51. 6. Kaiser AB, Zhang N, Van Der Pluijm W. Global prevalence of type 2 diabetes over the next ten years (20182028). Diabetes 2018;67. Available from: https://doi.org/10.2337/db18-202-LB. 7. Resnikoff S, Felch W, Gauthier T-M, Spivey B. The number of ophthalmologists in practice and training worldwide: a growing gap despite more than 200,000 practitioners. Br J Ophthalmol 2012;96:783 7. 8. Ting DSW, Cheung CY-L, Lim G, Tan GSW, Quang ND, Gan A, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 2017;318:2211 23. 9. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316:2402 10. 10. Abra`moff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digital Med 2018;1:39. 11. Teng T, Lefley M, Claremont D. Progress towards automated diabetic ocular screening: a review of image analysis and intelligent systems for diabetic retinopathy. Med Biol Eng Comput 2002;40:2 13. 12. Akita K, Kuga H. A computer method of understanding ocular fundus images. Pattern Recognit 1982;15:431 43. 13. Lay B, Baudoin C, Klein J-C. Automatic detection of microaneurysms in retinopathy fluoro-angiogram. Applications of digital image processing VI. International Society for Optics and Photonics; 1984. p. 165 73. 14. Sinthanayothin C, Boyce JF, Cook HL, Williamson TH. Automated localisation of the optic disc, fovea, and retinal blood vessels from digital colour fundus images. Br J Ophthalmol 1999;83:902 10. 15. Gardner GG, Keating D, Williamson TH, Elliott AT. Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool. Br J Ophthalmol 1996;80:940 4. 16. Nguyen HT, Butler M, Roychoudhry A, Shannon AG, Flack J, Mitchell P. Classification of diabetic retinopathy using neural networks. In: Proceedings of 18th annual international conference of the IEEE engineering in medicine and biology society, vol. 4; 1996. 1548 9. 17. Frame AJ, Undrill PE, Cree MJ, Olson JA, McHardy KC, Sharp PF, et al. A comparison of computer based classification methods applied to the detection of microaneurysms in ophthalmic fluorescein angiograms. Comput Biol Med 1998;28:225 38. 18. Niemeijer M, van Ginneken B, Staal J, Suttorp-Schulten MSA, Abra`moff MD. Automatic detection of red lesions in digital color fundus photographs. IEEE Trans Med Imaging 2005;24:584 92. 19. Usher D, Dumskyj M, Himaga M, Williamson TH, Nussey S, Boyce J. Automated detection of diabetic retinopathy in digital retinal images: a tool for diabetic retinopathy screening. Diabet Med 2004;21:84 90. 20. Philip S, Fleming AD, Goatman KA, Fonseca S, McNamee P, Scotland GS, et al. The efficacy of automated “disease/no disease” grading for diabetic retinopathy in a systematic screening programme. Br J Ophthalmol 2007;91:1512 17. 21. Abra`moff MD, Niemeijer M, Suttorp-Schulten MSA, Viergever MA, Russell SR, van Ginneken B. 
Evaluation of a system for automatic detection of diabetic retinopathy from color fundus photographs in a large population of patients with diabetes. Diabetes Care 2008;31:193 8. 22. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436 44. 23. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115:211 52. 24. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in neural information processing systems 25. Curran Associates, Inc.; 2012. p. 1097 105. 25. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition, arXiv [cs.CV]. 2014. ,http://arxiv.org/abs/1409.1556.. 26. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions, arXiv [cs.CV]. 2014. ,http://arxiv.org/abs/1409.4842.. 27. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision, arXiv [cs.CV]. 2015. ,http://arxiv.org/abs/1512.00567.. 28. Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-ResNet and the impact of residual connections on learning, arXiv [cs.CV]. 2016. ,http://arxiv.org/abs/1602.07261..


29. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition, arXiv [cs.CV]. 2015. ,http://arxiv. org/abs/1512.03385.. 30. Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks?, arXiv [cs.LG]. 2014. ,http://arxiv.org/abs/1411.1792.. 31. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Corrigendum: dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;546:686. 32. Liu Y, Chen P-HC, Krause J, Peng L. How to read articles that use machine learning: users’ guides to the medical literature. JAMA 2019;322:1806 16. 33. Grzybowski A, Brona P, Lim G, Ruamviboonsuk P, Tan GSW, Abramoff M, et al. Artificial intelligence for diabetic retinopathy screening: a review. Eye 2019;34(3):451 60. Available from: https://doi.org/10.1038/ s41433-019-0566-0. 34. Bawankar P, Shanbhag N, Smitha SK, Dhawan B, Palsule A, Kumar D, et al. Sensitivity and specificity of automated analysis of single-field non-mydriatic fundus photographs by Bosch DR Algorithm-Comparison with mydriatic fundus photography (ETDRS) for screening in undiagnosed diabetic retinopathy. PLoS One 2017;12:e0189854. 35. Tufail A, Rudisill C, Egan C, Kapetanakis VV, Salas-Vega S, Owen CG, et al. Automated diabetic retinopathy image assessment software: diagnostic accuracy and cost-effectiveness compared with human graders. Ophthalmology 2017;124:343 51. 36. Rajalakshmi R, Subashini R, Anjana RM, Mohan V. Automated diabetic retinopathy detection in smartphonebased fundus photography using artificial intelligence. Eye 2018;32:1138 44. 37. Bhaskaranand M, Ramachandra C, Bhat S, Cuadros J, Nittala MG, Sadda SR, et al. The value of automated diabetic retinopathy screening with the EyeArt system: a study of more than 100,000 consecutive encounters from people with diabetes. Diabetes Technol Ther 2019;21:635 43. 38. Krause J, Gulshan V, Rahimy E, Karth P, Widner K, Corrado GS, et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology 2018;125:1264 72. 39. Gulshan V, Rajan RP, Widner K, Wu D, Wubbels P, Rhodes T, et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol 2019;137(9):987 93. Available from: https://doi.org/10.1001/jamaophthalmol.2019.2004. 40. Ruamviboonsuk P, Krause J, Chotcomwongse P, Sayres R, Raman R, Widner K, et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. NPJ Digit Med 2019;2:25. 41. Sayres R, Taly A, Rahimy E, Blumer K, Coz D, Hammel N, et al. Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology 2019;126:552 64. 42. Phene S, Dunn RC, Hammel N, Liu Y, Krause J, Kitade N, et al. Deep learning and glaucoma specialists: the relative importance of optic disc features to predict glaucoma referral in fundus photographs. Ophthalmology 2019;126:1627 39. 43. Varadarajan AV, Bavishi P, Ruamviboonsuk P, Chotcomwongse P, Venugopalan S, Narayanaswamy A, et al. Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning. Nat Commun 2020;11:130. 44. Li Z, Keel S, Liu C, He Y, Meng W, Scheetz J, et al. An automated grading system for detection of visionthreatening referable diabetic retinopathy on the basis of color fundus photographs. Diabetes Care 2018;41:2509 16. 45. 
Abra`moff MD, Folk JC, Han DP, Walker JD, Williams DF, Russell SR, et al. Automated analysis of retinal images for detection of referable diabetic retinopathy. JAMA Ophthalmol 2013;131:351 7. 46. Hansen MB, Abra`moff MD, Folk JC, Mathenge W, Bastawrous A, Peto T. Results of automated retinal image analysis for detection of diabetic retinopathy from the Nakuru Study, Kenya. PLoS One 2015;10:e0139148. 47. Abra`moff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest Ophthalmol Vis Sci 2016;57:5200 6. 48. van der Heijden AA, Abramoff MD, Verbraak F, van Hecke MV, Liem A, Nijpels G. Validation of automated screening for referable diabetic retinopathy with the IDx-DR device in the Hoorn Diabetes Care System. Acta Ophthalmol 2018;96:63 8.


49. Larsen N, Godt J, Grunkin M, Lund-Andersen H, Larsen M. Automated detection of diabetic retinopathy in a fundus photographic screening population. Invest Ophthalmol Vis Sci 2003;44:767 71. 50. Hansen AB, Hartvig NV, Jensen MS, Borch-Johnsen K, Lund-Andersen H, Larsen M. Diabetic retinopathy screening using digital non-mydriatic fundus photography and automated image analysis. Acta Ophthalmol Scand 2004;82:666 72. 51. Bouhaimed M, Gibbins R, Owens D. Automated detection of diabetic retinopathy: results of a screening study. Diabetes Technol Ther 2008;10:142 8. 52. Bellemo V, Lim ZW, Lim G, Nguyen QD, Xie Y, Yip MYT, et al. Artificial intelligence using deep learning to screen for referable and vision-threatening diabetic retinopathy in Africa: a clinical validation study. Lancet Digital Health 2019;1:e35 44. 53. Kanagasingam Y, Xiao D, Vignarajan J, Preetham A, Tay-Kearney M-L, Mehrotra A. Evaluation of artificial intelligence-based grading of diabetic retinopathy in primary care. JAMA Netw Open 2018;1:e182665. 54. Yu S, Xiao D, Kanagasingam Y. Machine learning based automatic neovascularization detection on optic disc region. IEEE J Biomed Health Inf 2018;22:886 94. 55. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology 2017;124:962 9. 56. Natarajan S, Jain A, Krishnan R, Rogye A, Sivaprasad S. Diagnostic accuracy of community-based diabetic retinopathy screening with an offline artificial intelligence system on a smartphone. JAMA Ophthalmol 2018;137 (10):1182 8. Available from: https://doi.org/10.1001/jamaophthalmol.2019.2923. 57. Ting DSW, Cheung CY, Nguyen Q, Sabanayagam C, Lim G, Lim ZW, et al. Deep learning in estimating prevalence and systemic risk factors for diabetic retinopathy: a multi-ethnic study. NPJ Digit Med 2019;2:24. 58. A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv [cs.CV]. 2017. ,http://arxiv.org/abs/1704.04861.. 59. Jaeschke R, Guyatt G, Sackett DL, Bass E, Brill-Edwards P, Browman G, et al. Users’ guides to the medical literature: III. How to use an article about a diagnostic test A. Are the results of the study valid? JAMA 1994;271:389 91. 60. Jaeschke R, Guyatt GH, Sackett DL, Guyatt G, Bass E, Brill-Edwards P, et al. Users’ guides to the medical literature: III. How to use an article about a diagnostic test B. What are the results and will they help me in caring for my patients? JAMA 1994;271:703 7. 61. Chen P-HC, Liu Y, Peng L. How to develop machine learning models for healthcare. Nat Mater 2019;18:410 14. 62. Oakden-Rayner L, Dunnmon J, Carneiro G, Re´ C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging, arXiv [cs.LG]. 2019. ,http://arxiv.org/abs/1909.12475.. 63. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng 2018;2:158 64. 64. Wilkes B. Central Mersey diabetic retinopathy screening programme: DRSS user manual. 2009. ,http://www.locnet.org.uk/media/1269/drss_manualnovember_09.pdf. [accessed 30.12.19]. 65. Center for Drug Evaluation. Research, Q2A text on validation of analytical procedures. U.S. Food and Drug Administration; 2019. ,http://www.fda.gov/regulatory-information/search-fda-guidance-documents/q2atext-validation-analytical-procedures. [accessed 22.01.20]. 66. 
Russo A, Morescalchi F, Costagliola C, Delcassi L, Semeraro F. Comparison of smartphone ophthalmoscopy with slit-lamp biomicroscopy for grading diabetic retinopathy. Am J Ophthalmol 2015;159 360 364.e1. 67. Grauslund J, Andersen N, Andresen J, Flesner P, Haamann P, Heegaard S, et al. Reply: is automated screening for DR indeed not yet ready as stated by Grauslund et al? Acta Ophthalmol 2019;98(2). Available from: https://doi.org/10.1111/aos.14251. 68. FDA permits marketing of artificial intelligence-based device to detect certain diabetes-related eye problems. 2018. ,https://www.fda.gov/news-events/press-announcements/fda-permits-marketing-artificial-intelligencebased-device-detect-certain-diabetes-related-eye. [accessed 30.12.19]. 69. A multi-center study to evaluate performance of an automated device for the detection of diabetic retinopathy - full text view - ClinicalTrials.gov, n.d. ,https://clinicaltrials.gov/ct2/show/NCT02963441. [accessed 30.12.19]. 70. Assessment of EyeArt as an Automated Diabetic Retinopathy Screening Tool; n.d. ,https://clinicaltrials.gov/ct2/ show/NCT03112005. [accessed 30.12.19].


71. 2019 imaging in the eye conference; n.d. ,https://www.arvo.org/globalassets/arvo/meetings/arvo-image-conference/2019/2019_agenda_imaging_conference.pdf. [accessed 30.12.19]. 72. 2019 May: CPT editorial summary of panel actions; n.d. ,https://www.ama-assn.org/system/files/2019-08/ may-2019-summary-panel-actions.pdf. [accessed 30.12.19]. 73. Tufail A, Kapetanakis VV, Salas-Vega S, Egan C, Rudisill C, Owen CG, et al. An observational study to assess if automated diabetic retinopathy image assessment software can replace one or more steps of manual imaging grading and to determine their cost-effectiveness. NIHR J Library 2016;20(92):1 72. 74. Launching a powerful new screening tool for diabetic eye disease in India, Verily Blog; n.d. ,https://blog.verily. com/2019/02/launching-powerful-new-screening-tool.html. [accessed 30.12.19]. 75. TCTR Thai clinical trials registry TCTR20190902002; n.d. ,http://www.clinicaltrials.in.th/index.php? tp 5 regtrials&menu 5 trialsearch&smenu 5 fulltext&task 5 search&task2 5 view1&id 5 5226. [accessed 30.12.19]. 76. Copley C. Will Europe’s clampdown on faulty medical devices hurt patients?, Reuters; 2019. ,https://www. reuters.com/article/us-eu-medical-devices-insight-idUSKCN1T70HN. [accessed 30.12.19]. 77. Public enquiry - Singapore Medical Device Register (SMDR) EyRIS SELENA 1 ; n.d. ,https://eservice.hsa.gov. sg/medics/md/mdEnquiry.do?action 5 getDeviceInfo&devId 5 C5021D4E2741-19. [accessed 30.12.19]. 78. Public enquiry Singapore Medical Device Register (SMDR) TeleMedC DR Grader; n.d. ,https://eservice.hsa. gov.sg/medics/md/mdEnquiry.do?action 5 getDeviceInfo&devId 5 C5020D0B4D46-19. [accessed 30.12.19]. 79. Therapeutic Goods Administration (TGA). Australian Register of Therapeutic Goods Search automated diabetic retinopathy. Therapeutic Goods Administration (TGA); n.d. ,http://tga-search.clients.funnelback.com/s/search. html?query 5 diabetic 1 retinopathy 1 automated&collection 5 tga-artg&profile 5 record. [accessed 30.12.19]. 80. CarePortMD. n.d. ,https://www.careportmd.com/diabetes-care/. [accessed 30.12.19]. 81. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS. Adversarial attacks on medical machine learning. Science 2019;363:1287 9. 82. Bouskill K, Smith-Morris C, Bresnick G, Cuadros J, Pedersen ER. Blind spots in telemedicine: a qualitative study of staff workarounds to resolve gaps in diabetes management. BMC Health Serv Res 2018;18:617. 83. McMullen MJ, Netland PA. Lead time for appointment and the no-show rate in an ophthalmology clinic. Clin Ophthalmol 2015;9:513 16. 84. Brannan SO, Dewar C, Taggerty L, Clark S. The effect of short messaging service text on non-attendance in a general ophthalmology clinic. Scott Med J 2011;56:148 50. 85. Koshy E, Car J, Majeed A. Effectiveness of mobile-phone short message service (SMS) reminders for ophthalmology outpatient appointments: observational study. BMC Ophthalmol 2008;8:9. 86. Lutfey KE, Wishner WJ. Beyond “compliance” is “adherence”. Improving the prospect of diabetes care. Diabetes Care 1999;22:635 9. 87. Chatterjee JS. From compliance to concordance in diabetes. J Med Ethics 2006;32:507 10. 88. Alm-Roijer C, Stagmo M, Ude´n G, Erhardt L. Better knowledge improves adherence to lifestyle changes and medication in patients with coronary heart disease. Eur J Cardiovasc Nurs 2004;3:321 30. 89. Head KJ, Noar SM, Iannarino NT, Grant Harrington N. Efficacy of text messaging-based interventions for health promotion: a meta-analysis. Soc Sci Med 2013;97:41 8. 90. Emanuel EJ, Wachter RM. 
Artificial intelligence in health care: will the value match the hype? JAMA 2019;321:2281 2. 91. Maa AY, Patel S, Chasan JE, Delaune W, Lynch MG. Retrospective evaluation of a teleretinal screening program in detecting multiple nondiabetic eye diseases. Telemed J E Health 2017;23:41 8. 92. Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol 2017;135:1170 6. 93. Price WN, Gerke S, Cohen IG. Potential liability for physicians using artificial intelligence, JAMA 2019;322 (18):1765 1766. Available from: https://doi.org/10.1001/jama.2019.15064.


14 Artificial intelligence in radiology

Dakai Jin, Adam P. Harrison, Ling Zhang, Ke Yan, Yirui Wang, Jinzheng Cai, Shun Miao and Le Lu

Abstract

Interest in artificial intelligence (AI) has ballooned within radiology in the past few years, primarily because of the notable successes of deep learning. With the advances brought by deep learning, AI has the potential to recognize and localize complex patterns from different radiological imaging modalities, and in recent applications it has even achieved performance comparable to human decision-making. In this chapter, we review several AI applications in radiology for different anatomies: chest, abdomen, and pelvis, as well as general lesion detection and identification that is not limited to a specific anatomy. For each anatomic site, we focus on the tasks of detection, segmentation, and classification, with an emphasis on the technology development pathway, aiming to give the reader an understanding of what AI can do in radiology and what still needs to be done for AI to fit radiology better. Drawing on our own research experience with AI in medicine, we elaborate on how AI can enrich knowledge discovery, understanding, and decision-making in radiology, rather than replace the radiologist.

Keywords: Radiology; artificial intelligence; deep learning; lesion; pulmonary; abdomen; pelvis; classification; segmentation; detection; characterization

14.1 Introduction

Computers have revolutionized the field of diagnostic and quantitative imaging and are now integral to the radiology workflow. Early milestones of computer technology include image acquisition inventions, such as computed tomography (CT), nuclear medicine, and magnetic resonance imaging (MRI), and the development of digitized picture archiving and communication systems (PACSs). Significant advances in "intelligent" image analysis have been achieved in recent years with the boom in artificial intelligence (AI) technology driven by the emergence of deep learning. In certain very specific and limited applications, computers are now able to perform tasks that previously only physicians could accomplish. For instance, a deep-learning-empowered segmentation and classification system for optical coherence tomography achieves clinically applicable performance, that is, performance comparable to or exceeding that of professional experts, on a range of sight-threatening retinal diseases.1 With the appropriate integration of deep-learning technologies paired with suitable medical imaging tasks, effective and efficient AI systems can be developed to help radiologists reduce workloads and increase accuracy and consistency, and they may eventually change the radiology workflow for some tasks. Hence, a better understanding of the strengths and limitations of this new technology is of great benefit to radiologists.

In this chapter, we review several important medical imaging tasks for different anatomies, emphasizing applications that we have worked on in our own research. Specifically, we overview and discuss recent AI advances in thoracic, abdominal, and pelvic applications, as well as general lesion analysis, which is not limited to a specific anatomy. Various imaging modalities are included, such as X-rays, CT, and MRI. For each anatomy, we focus on the tasks of detection, segmentation, and classification with AI-based methods, discussing what has been achieved and what work remains. Throughout, a common thread unifies the discussion and undergirds our own work: clinically useful AI tools must be developed hand-in-hand with radiologists toward a shared goal of empowering the radiology field.

14.2 Thoracic applications 14.2.1 Pulmonary analysis in chest X-ray Chest X-rays (CXRs) are the most ordered radiological scan in the United States2 used to diagnose or screen for a variety of thoracic ailments. Given the challenges in reading CXRs, for example, low sensitives,3 there is great impetus for AI-based tools to help or enhance interpretation. Work along this line, catalyzed by the release of the CXR14 dataset,4 has accelerated in recent years. In this subsection, we first overview the history of large-scale CXR datasets for training AI systems. Then we outline some on-going efforts and innovations aimed at pushing forward what is possible in AI-based analysis. Finally, we discuss some challenges for future investigation. Like all AI applications, a necessary, but not sufficient, condition for an effective AI system for CXR analysis is an extensive and curated data source. Prior to the advent of deep learning, there was a paucity of large-scale CXR datasets. The one exception was the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial5 whose CXR screening arm includes roughly 200,000 manually annotated CXRs. Annotated disease patterns include masses and nodules along with nononcological patterns, such as opacities and pleural abnormalities. However, because the PLCO is a screening trial, disease prevalence is low. Moreover, PLCO CXRs are film radiographs that were later digitized, so they may differ in appearance from digital radiographs. While the PLCO remains invaluable, it was collected at enormous expense by executing a multisite clinical trial. Clearly, alternative data collection strategies are needed. Fortunately, the data housed in hospital PACSs offers a preexisting source of large-scale CXR data. The CXR14 dataset4 was the first to exploit large-scale PACS CXRs. The authors collected B110K CXRs by retrospectively mining the National Institutes of Health Clinical Center PACS. Labels for each CXR were generated by automatically text mining the accompanying radiological reports written during daily clinical workflows. Once released, the CXR14 dataset quickly became a core dataset for AI training and kicked off a trend of additional groups

releasing their own PACS-mined data, such as CheXpert,6 MIMIC-CXR,7 and PadChest.8 Fig. 14.1 depicts the number of released CXRs from each dataset. Mining PACS is a highly promising source of data, but the aforementioned studies all rely on natural language processing to extract labels. Apart from any errors in the text mining, radiologist reports are written by considering many factors outside of the CXR appearance, for example, lab tests, prior scans, and patient history.3 Thus mentioned disease patterns may not actually be present in the image, and disease patterns present in the image may not actually be mentioned in the report, for example, an "unchanged" assessment. This can cause serious issues,9 and AI specialists must work hand-in-hand with clinicians to most effectively use PACS-mined data. Despite these challenges, PACS-mined data still represents the most promising source of large-scale data for CXR AI and, deployed carefully, models trained on PACS-mined data can generalize well.10 Furthermore, enhanced data collection efforts, such as more robust evaluation subsets6 and more ontological approaches to label extraction,8 will continue to strengthen the value of PACS-mined data.

The most straightforward application of CXR AI is predicting scan- or study-wise labels. This is essentially a multilabel classification problem, and many initial efforts focused on this task.4,13,14 However, another key aim is to localize each disease pattern being predicted. This enhances explainability and is a beneficial end in and of itself. The key challenge is that CXR datasets typically only possess scan- or study-wise labels that do not specify the disease pattern's location. This means training an AI-based localizer requires weak-supervision techniques. For the most part, CXR localizers are built off class-activation map techniques,15 which exploit the implicit localization properties within convolutional neural networks (CNNs). Promising approaches include generating

FIGURE 14.1 Numbers of publicly released CXRs in recent PACS-mined CXR datasets. CXRs, Chest X-rays; PACS, picture archiving and communication system.

pseudolabels to supervise an AI localizer11,12,16 and developing techniques that can work well with only a small batch of localization labels.17 Another direction is to force the CNN to use as many regions of the image as possible when making its prediction.18 Fig. 14.2 depicts some example localizations derived from these weakly supervised techniques. Weakly supervised localization shows promise, but challenges remain to ensure the model captures the entire extent of the disease pattern and does not focus on spurious regions.

Apart from localization, recent works have also focused on providing specialized or enhanced analyses. Mirroring larger trends within deep learning, the use of generative adversarial networks (GANs) to generate synthetic CXRs has received attention. This includes using realistic synthetic CXRs to simulate image/mask pairs in order to train AI models to segment the lung field.19,20 GANs have also been used to transfer a model that works well on adult patients so that it also performs well on pediatric data.21 Finally, GANs have been successfully used to flag abnormal CXRs.16 Moving on from GAN-based analysis, another interesting line of work is using a taxonomy of disease patterns to provide both more meaningful predictions and enhanced performance.22 As these works suggest, there is a rich set of research directions, beyond just localization, for AI applications in CXR analysis.

The release of recent PACS-mined datasets has spurred an incredibly exciting burst of research activity in CXR analysis. Already much progress has been made, but important challenges remain. One key hurdle is developing AI techniques and models that can better manage the noise and uncertainty that comes with text-mined labels. This could involve integrating clinical domain knowledge to better model the meaning behind text-mined phrases and words found in radiological reports. Relatedly, it would be extraordinarily

FIGURE 14.2 When properly configured, CNNs can also provide localizations indicating the region of the image that is contributing to the prediction. CNNs, Convolutional neural networks. Source: Credit: Tang Y, Wang X, Harrison AP, Lu L, Xiao J, Summers RM. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In: Shi Y, Suk H-I, Liu M, editors. Machine learning in medical imaging, lecture notes in computer science. Cham: Springer International Publishing; 2018. pp. 249–58. Available from: https://doi.org/10.1007/978-3-030-00919-9_29.

beneficial for the AI community to have agreed-upon and radiologist-driven ontologies or taxonomies of disease patterns for an AI system to target. Such an ontology would also help incorporate and model the interdependencies across disease patterns. In addition, principled techniques should be developed to also consider prior CXR studies, lab results, and patient history. This would better emulate current radiological practice in the clinic. Along with these improved modeling capabilities, future work should also focus on creating larger and more accurate manually labeled evaluation sets, so that performance can be better gauged.
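To make the class-activation mapping idea discussed above concrete, the following is a minimal sketch, assuming a PyTorch-style ResNet backbone and a 14-finding label set; all names and choices are illustrative assumptions, not the implementation of any specific paper. A multilabel classifier is trained with scan-level labels only, and its final linear weights re-weight the convolutional feature maps to produce a coarse localization heat map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class CXRClassifier(nn.Module):
    """Multilabel CXR classifier whose final linear weights double as CAM weights."""
    def __init__(self, num_findings: int = 14):
        super().__init__()
        backbone = models.resnet18(weights=None)  # any CNN backbone could be used
        # Keep the convolutional feature maps; drop the average pool and classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.fc = nn.Linear(512, num_findings)    # one logit per disease pattern

    def forward(self, x):
        # Grayscale CXRs are typically replicated to three channels for such backbones.
        fmap = self.features(x)                            # (B, 512, H', W')
        pooled = F.adaptive_avg_pool2d(fmap, 1).flatten(1) # global average pooling
        logits = self.fc(pooled)                           # scan-level multilabel logits
        return logits, fmap

    def cam(self, fmap, class_idx):
        """Class-activation map: weight the feature maps by the classifier weights."""
        w = self.fc.weight[class_idx]                      # (512,)
        heat = torch.einsum("c,bchw->bhw", w, fmap)
        heat = F.relu(heat)
        heat = heat / (heat.amax(dim=(1, 2), keepdim=True) + 1e-6)
        return heat                                        # coarse localization in [0, 1]

# Training uses only scan- or study-wise labels (weak supervision):
# loss = F.binary_cross_entropy_with_logits(logits, labels)
```

In practice the resulting heat maps are upsampled to the image resolution and thresholded into candidate regions, which is how the weakly supervised localizations in Fig. 14.2 are typically obtained.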

14.2.2 Pulmonary analysis in computerized tomography

CT is the gold standard imaging modality for a broad range of high-prevalence pulmonary diseases, such as interstitial lung disease (ILD) and lung cancer.23,24 Benefiting from its high spatial resolution in three dimensions (3D), CT allows more accurate disease diagnosis and quantification. To effectively detect and analyze pulmonary abnormalities from the large amounts of 3D CT data, automated AI-based tools play a critical role and have been studied for more than two decades.25–27 In such AI-based systems, typically the first step is to segment the anatomies of interest to facilitate the later steps of disease detection and quantification. In this subsection, we first review AI-based segmentation methods for three pulmonary anatomies, that is, the lung, lobes, and airways. Then we use ILD as a case study for how AI systems can play a role in pulmonary analysis.

14.2.2.1 Lung, lobe, and airway segmentation

An often necessary first step for any computer-aided diagnosis or detection system is to accurately delineate the organs of interest. Measuring organ volume or shape can offer its own important biomarkers. In addition, accurate delineation is often a prerequisite for any downstream disease analysis, so that the area of focus can be accurately determined. Within pulmonary analysis, AI-based segmentation primarily focuses on three structures: the lungs, lung lobes, and pulmonary airways. Below, we discuss each in turn.

For normal lungs, delineation is relatively straightforward, and classic techniques such as region growing or anatomical shape models can operate well as long as their strict assumptions on Hounsfield unit intensity and shape, respectively, are maintained. However, the problem becomes much more challenging once pathological patterns are present, such as consolidations, pleural effusions, or lung nodules, or if lung shapes do not follow expected distributions. Prior to the dominance of deep learning, effective pathological lung segmentation techniques relied on sophisticated but handcrafted workflows30 that can struggle to generalize without significant calibration efforts. To address this, Harrison et al.29 proposed the first deep model for pathological lung segmentation, called progressive holistically-nested networks (PHNNs), which classify each CT voxel individually in a bottom-up manner. Tested on 929 pathological CT studies, where disease patterns associated with infections, chronic obstructive pulmonary disease (COPD), and/or ILD were present, PHNN achieved an extremely high mean Dice (Sørensen-Dice coefficient) score of 98.5%. After Harrison et al.,29 many subsequent works reported their own deep segmentation approaches that followed similar strategies. While the PHNN

results are impressive, the model can still struggle in scenarios it did not see enough of in training, such as lung nodules or consolidations touching the lung border. Thus further work is still required to harden CNN models, like PHNN, against such unseen variations. Jin et al.28 proposed one such interesting strategy, using GANs to simulate lung nodules to fine-tune the PHNN model so that it can successfully handle such cases (see Fig. 14.3). The continued development of strategies along this vein will be necessary to address outlier cases as much as possible.

Delineating the five lobes of the lung is another important task, particularly as infections are often limited to one or a few lobes. While lobe segmentation shares similarities with lung segmentation, successful solutions must incorporate much more top-down structural guidance. This is because lobe fissures are often incomplete, share the same appearance as accessory fissures, and can be obscured when pathologies are present. This challenges the bottom-up voxel-by-voxel strategies used in lung segmentation.29 Because airways do not cross lobar boundaries, one structural approach to segmentation is to first segment the airways,32,33 using the resulting airway trees as constraints or initializations. But airway segmentation is a challenging problem in its own right, which means these approaches require complex and brittle multicomponent workflows to segment lobes. Taking a different approach, George et al.31 reported the first deep solution to this problem. The authors trained a bottom-up PHNN model to noisily segment lung fissures and then used the random walker (RW) algorithm to impose top-down structural constraints. To keep it simple and generalizable, the RW algorithm's only assumption is that there are five lobes. Their method achieves an 88.8% mean Dice score in the presence of highly challenging interstitial lung pathologies, which outperformed a leading nondeep approach32 by 5%. Fig. 14.4 provides some qualitative examples demonstrating the power of combining bottom-up CNN predictions with straightforward top-down constraints.

Airway segmentation is uniquely challenging due to its topological complexity. The extremely thin airway wall separating the lumen and the lung parenchyma adds further difficulty, since at many middle or small airway branches its thickness falls below the resolution of the CT scanner. This often causes large segmentation leakages into the adjacent lung parenchyma. Many

FIGURE 14.3 Jin et al.'s28 simulated lung nodules. (A) A volume of interest centered at a lung nodule; (B) 2D axial view of (A); (C) same as (B), but with the central sphere region erased; (D–E) simulated lung nodules using a competitor method and Jin et al.'s28 method, respectively. These simulated lung nodules were used to fine-tune and enhance Harrison et al.'s29 lung segmentation model. Source: Credit: Jin D, Xu Z, Tang Y, Harrison AP, Mollura DJ. CT-realistic lung nodule simulation from 3D conditional generative adversarial networks for robust lung segmentation. In: Medical image computing and computer-assisted intervention – MICCAI 2018, lecture notes in computer science. Cham: Springer International Publishing; 2018. pp. 732–40. Available from: https://doi.org/10.1007/978-3-030-00934-2_81.

FIGURE 14.4 Lung lobe segmentation using George et al.'s31 technique (P-HNN + RW). Here pulmonary toolkit (PTK) denotes Doel et al.'s32 approach. Ground truth lobar boundaries are rendered in red. Despite its simplicity, the P-HNN + RW technique can provide reliable lobe segmentations in challenging scenarios: (A) PTK follows an erroneous boundary, (B) P-HNN + RW handles an incomplete fissure, (C) PTK over-segments one lobe, (D) P-HNN + RW does not get confounded by a fibrosis pattern that looks like a fissure, and (E) P-HNN + RW infers a reasonable lobar boundary even though there is no visible fissure. Source: Credit: George K, Harrison AP, Jin D, Xu Z, Mollura DJ. Pathological pulmonary lobe segmentation from CT images using progressive holistically-nested neural networks and random walker. In: Cardoso MJ, Arbel T, Carneiro G, Syeda-Mahmood T, Tavares JMRS, Moradi M, et al., editors. Deep learning in medical image analysis and multimodal learning for clinical decision support, lecture notes in computer science. Cham: Springer International Publishing; 2017. pp. 195–203. Available from: https://doi.org/10.1007/978-3-319-67558-9_23.

automated methods have been developed to tackle this task, including intensity-based,34 morphology-based,35,36 graph-based,37,38 and 2D learning-based39,40 approaches. Among these, different variations of region growing are often used. In contrast, 2D learning-based methods39,40 can add potential robustness. However, their inability to consider the entire 3D volume greatly limits their learning capacity, since 3D information is crucial to detect the small, highly anisotropic tubular structures of airways. Another crucial limitation of learning-based approaches is that they rely on labeled training data. However, the labor costs to fully annotate airways are much too high for large-scale datasets. To fill these gaps, Jin et al.41 proposed the first 3D CNN-based method to fully leverage 3D airway tree features. A further graph-based refinement step addresses local discontinuities of the coarse 3D CNN output, which is then further refined by a curve skeletonization approach42 to remove blob-like segmentation leakages. It significantly improves over previous methods by extracting more than 30 airway branches per patient while maintaining false positive rates similar to prior art.38 Importantly, their training process does not require perfect airway labels, as the 3D CNN is trained using the incomplete labels generated by Xu et al.,38 which have high specificity and moderate sensitivity. By learning from these incomplete labels, Jin et al.'s41 approach can boost sensitivity while maintaining high specificity. Fig. 14.5 provides some qualitative examples demonstrating the power of the 3D CNN for airway tree segmentation. After Jin et al.,41 several subsequent works reported their own deep segmentation approaches that followed similar strategies.43,44 Given the impossibility of obtaining large-scale and manually

FIGURE 14.5 Examples of 3D renderings of airway segmentations using Jin et al.'s41 3D CNN technique compared against a nonlearning-based prior work38 on the EXACT09 dataset. Overlap regions are colored in red. Green and blue indicate additional extracted or missed branches, respectively, compared to the results from Xu et al.38 Source: Credit: Jin D, Xu Z, Harrison AP, George K, Mollura DJ. 3D convolutional neural networks with graph refinement for airway segmentation using incomplete data labels. In: International workshop on machine learning in medical imaging. Cham: Springer; 2017, September. pp. 141–9.

labeled airway datasets, continued work on approaches able to learn from weakly or incompletely labeled data will be vital to continued progress.

14.2.2.2 Interstitial lung disease pattern recognition

ILD comprises more than 150 lung disorders affecting the lung parenchyma, which may eventually lead to breathing dysfunction. For the diagnosis of an ILD, besides the patient's clinical history and physical examination, a CT scan is often ordered to provide a visual assessment of the lung tissues. This is a less risky procedure compared to biopsies. However, reading and interpreting large amounts of 3D chest CT scans requires significant time, effort, and experience from physicians. Yet inter- and intraobserver agreement is frequently low because of the subjectivity and difficulty in interpreting ILD patterns.45,46 Hence, many computerized and AI-based systems have been developed to automatically identify these abnormal patterns and increase accuracy and consistency. Note that the novel coronavirus disease 2019 (COVID-19) causes severe pneumonia in certain patients, and the corresponding CT scans include quite a few patterns that match those found in ILD, such as ground glass opacity, consolidation, reticulation, and crazy paving. Two COVID-19 CT examples are shown in Fig. 14.6. AI-based lung pattern classification methods can be categorized into conventional image analysis and deep-learning-based approaches, which are detailed in the following two paragraphs. We end this subsection by discussing the limitations of current works and pointing out possible directions for solving this important problem.

FIGURE 14.6 Examples of CT findings in two coronavirus disease 2019 (COVID-19) patients. The transverse, sagittal, and coronal views are shown for each case. The first row presents a patient with mild ground glass opacity in the right lower lobe. The bottom row shows a patient with severe radiologic progression with bilateral patchy shadowing. CT, Computerized tomography.

Early computerized lung pattern recognition works can be traced back to the 1980s,47,48 using simple lung density analyses, such as the mean or histogram percentiles, to recognize emphysematous subjects. Later on, using local image patches, learning-based classification methods were actively explored to identify various abnormal patterns, such as emphysema, honeycombing, ground glass opacity, consolidation, reticulation, nodular patterns, or their combinations.49–53 Various features have been designed to characterize the distinct properties of abnormal lung patterns, for example, basic statistical texture features, geometric features, features extracted by multiscale filter banks, and more complex features such as near-affine-invariant texture, rotation-invariant Gabor-local binary patterns, and the multicoordinate histogram of oriented gradients. Different classifiers

have been examined for their performance, such as the Bayesian classifier, the linear discriminant classifier, and support vector machines with feature selection. These methods achieved quite divergent results due to different evaluation metrics and distinct datasets. Recently, deep-learning-based AI solutions have shown promise. Anthimopoulos et al.54 designed a customized CNN to conduct patch-based lung-pattern classification and gained markedly improved performance compared to non-deep-learning methods. This suggests that features automatically learned by a CNN are more effective than previous handcrafted approaches. Gao et al.55,56 further confirmed this by introducing holistic slice-based classification for ILD diseases, where the CNN directly predicts whether an axial slice contains any ILD disease patterns. This avoids needing to sample local image patches from manual regions of interest (ROIs) and can be used to prescreen large amounts of radiology data, which might be more clinically useful. Also of interest, Shin et al.57 conducted a comprehensive evaluation of both patch- and holistic slice-based ILD pattern classification using different CNN structures and transfer learning.

Although deep-learning methods have shown promising results in recognizing abnormal ILD patterns, current approaches face a bottleneck: there is no large-scale labeled dataset for training and evaluation. There are two public datasets relevant to ILD patterns: (1) the lung tissue research consortium (LTRC) contributed by the National Heart, Lung, and Blood Institute58,59 and (2) the specialized ILD dataset developed by University Hospitals of Geneva.52 Although the LTRC includes more than 1000 (and counting) CT scans, from four centers, with COPD and fibrotic ILD patterns, no manually annotated regions of interest are made available. In contrast, the ILD dataset contains manually annotated regions of 11 types of lung patterns, but these are only partially annotated. Moreover, only 108 CT scans with thick slice spacing (10–15 mm) are made available, and they all originate from the same hospital. Scans from a single hospital fail to cover the variance of a larger population imaged with different scanners, which is crucial for enhancing the generalizability of AI recognition systems. Thus limitations in labeled data are a major issue. Published works have already begun to address this issue; for example, Gao et al.60 explored deep-learning label propagation approaches to fully label the ILD dataset.52 Nonetheless, further work is needed. Potentially, using techniques to mine unlabeled instances from multiple heterogeneous and incompletely labeled datasets, as explored in lesion detection,61 might be a useful research direction.
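As a rough illustration of the patch-based deep-learning formulation described above, the sketch below classifies small lung patches into a handful of ILD patterns. The patch size, label set, and architecture are illustrative assumptions rather than the configuration of any published system.

```python
import torch
import torch.nn as nn

# Illustrative label set; published works use different, larger pattern lists.
ILD_PATTERNS = ["normal", "emphysema", "ground_glass", "consolidation",
                "reticulation", "honeycombing"]

class PatchCNN(nn.Module):
    """Small CNN that classifies a 32x32 lung patch into one ILD pattern."""
    def __init__(self, n_classes=len(ILD_PATTERNS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, n_classes),
        )

    def forward(self, patch):
        # patch: (B, 1, 32, 32), intensities normalized from Hounsfield units
        return self.net(patch)

# A holistic slice-based variant simply feeds the whole axial slice and predicts
# whether any ILD pattern is present, avoiding manual ROI sampling.
```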

14.3 Abdominal applications

AI systems have played a critical role in various cancer diagnostics in the abdomen, such as pancreatic ductal adenocarcinoma (PDAC), hepatocellular carcinoma, and colorectal adenocarcinoma. For instance, early computer-aided detection systems were developed for polyps62 and hepatic lesions.63 In this section, we take pancreatic cancer as an example to show the importance of AI-based systems in cancer detection, segmentation, and tumor growth prediction.

14.3.1 Pancreatic cancer analysis in computerized tomography and magnetic resonance imaging

Pancreatic cancer mainly includes two types: PDAC (85% of cases) and pancreatic neuroendocrine tumor (PanNET, less than 5% of cases). PDAC is a major cause of cancer-related death in Western countries and is anticipated to emerge as the second leading cause of cancer-related death in the United States by 2030.64 The prognosis of patients with PDAC is extremely poor, marked by a dismal 9% survival rate at 5 years. Medical imaging, for example, CT, is now routinely performed for the depiction, quantification, staging, resectability evaluation, vascular invasion assessment, and metastasis diagnosis of pancreatic cancers. Automated analysis of pancreas images is a challenging task compared to other organs in CT, such as the heart, liver, and kidney, as the pancreas has a variable shape, size, and location in the abdomen. Pancreatic tumors are even more challenging to identify: they are quite variable in shape, size, and location, and have complex enhancement patterns, such as hypo-, iso-, or even hyperenhancement in different CT phases; moreover, the heterogeneity of the pancreas region (i.e., pancreas tissue, duct, veins, and arteries) and the ill-defined tumor boundary make pancreatic tumor segmentation highly difficult even for radiologists. Recent advances in machine learning, and especially deep learning, have led to substantial improvements in automated pancreatic cancer analysis and have enabled prediction and prognosis studies, such as tumor growth prediction and patient survival prediction. In this section, we cover representative works on pancreas and pancreatic tumor segmentation/detection, as well as the prediction and prognosis of pancreatic cancer.

14.3.1.1 Pancreas segmentation in computerized tomography and magnetic resonance imaging

Segmentation of the pancreas from 3D scans can provide quantitative features, such as volume and shape statistics. Before deep learning, conventional methods reported Dice scores of only 46.6%–69.1% for automatic pancreas segmentation. Performance has improved significantly with deep-learning techniques.65–68 Moving from a 2D image patch-based CNN68 to a multiscale coarse-to-fine 3D fully convolutional network,66 the Dice score improved from 71.8% to 86.9% for healthy pancreas segmentation (example shown in Fig. 14.7), while computational time was reduced from 3 hours to 3 minutes. For abnormal pancreas segmentation, researchers recently achieved a comparably high Dice score of 86.7%70 by fusing the arterial and venous enhanced CT phases in a hyper-pairing 3D U-Net framework, reaching a level similar to the interobserver variability between radiologists.
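Since segmentation quality throughout this chapter is reported as the Dice (Sørensen-Dice) score, a short reference implementation may be helpful. This is a generic sketch over binary masks, not tied to any particular model.

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-6) -> float:
    """Sørensen-Dice coefficient between two binary masks (2D or 3D)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

# Toy example with two small 3D masks: 8 overlapping voxels out of 8 and 12
pred = np.zeros((4, 4, 4), dtype=bool);  pred[1:3, 1:3, 1:3] = True
truth = np.zeros((4, 4, 4), dtype=bool); truth[1:3, 1:3, :3] = True
print(f"Dice = {dice_score(pred, truth):.3f}")   # 2*8 / (8 + 12) = 0.800
```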

FIGURE 14.7 Example of pancreas segmentation results (green) (A) compared with the ground-truth annotation (red) (B).65 Source: Credit: Roth HR, Lu L, Lay N, Harrison AP, Farag A, Sohn A, et al. Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation. Med Image Anal 2018;45:94–107.

FIGURE 14.8 (A) Example of PanNET segmentation.69 Red: algorithm segmentation; green: ground truth. (B) ROC curve of pancreatic ductal adenocarcinoma screening.66 PanNET, Pancreatic neuroendocrine tumor; ROC, receiver operating characteristic. Source: Credit: Zhu Z, Xia Y, Xie L, Fishman EK, Yuille AL. Multi-scale coarse-to-fine segmentation for screening pancreatic ductal adenocarcinoma. In: International conference on medical image computing and computer-assisted intervention. Cham: Springer; 2019, October. pp. 3–12; Guo Z, Zhang L, Lu L, Bagheri M, Summers RM, Sonka M, et al. Deep LOGISMOS: deep learning graph-based 3D segmentation of pancreatic tumors on CT scans. In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). IEEE; 2018, April. pp. 1230–3.

14.3.1.2 Pancreatic tumor segmentation and detection in computerized tomography and magnetic resonance imaging

Precise tumor detection and segmentation are key elements in cancer imaging. For PDAC, a multiscale coarse-to-fine 3D CNN method can automatically segment tumors from venous-phase CT with a Dice score of 57.3%.66 With the identified suspicious regions of PDAC, pancreatic cancer screening/detection can be achieved. As such, Zhu et al.66 reported a sensitivity of 94.1% and a specificity of 98.5% for PDAC screening (Fig. 14.8B). To enhance PDAC segmentation performance, a hyper-pairing framework70 with the same network backbone as Zhu et al.66 was designed, which fuses the venous and arterial phases at the layer level. A much higher Dice score of 63.9% was reported. For PanNET, a semiautomated method that combines a U-Net and 3D graph-based segmentation can segment tumors from arterial-phase CT images with a Dice score of 83.2%69 (Fig. 14.8A). This approach requires a manual click roughly at the tumor centroid for initialization. More generally, researchers have attempted to segment pancreatic tumors universally, that is, a mix of PDAC and PanNET. Using venous-phase CT images, a cascaded U-Net approach produces a Dice score of 0.52 in a fully automated way.67 Using dynamic contrast-enhanced MRI images, a patch-based semiautomated classification approach identifies tumor voxels in the pancreatic head region with a Dice score of 0.73, comparable to the interobserver variability.71

14.3.1.3 Prediction and prognosis with pancreatic cancer imaging

The prediction of patient-specific progression of pancreatic tumors at an earlier stage, such as PanNETs, will assist physicians in making treatment decisions. Such a prediction problem has long been tackled using principles of mathematical modeling. A few recent works72,73 using deep-learning approaches can handle more complex

FIGURE 14.9 Example of deep-learning prediction of PanNET growth at different later time points.72 PanNET, Pancreatic neuroendocrine tumor. Source: Credit: Zhang L, Lu L, Wang X, Zhu RM, Bagheri M, Summers RM et al. Spatio-temporal convolutional LSTMs for tumor growth prediction by learning 4D longitudinal patient data. In: IEEE transactions on medical imaging; 2019.

distributions from a larger patient population and provide more precise pixel-level prediction results. As demonstrated in Zhang et al.,73 a two-stream CNN model achieves an average volume prediction error of 6.6%, compared to a 13.9% error for a state-of-the-art mathematical modeling method on the same PanNET longitudinal dataset. The most recent work further enables the prediction of cell density and CT intensity,72 at arbitrary future time points (shown in Fig. 14.9). There is also great interest in developing effective imaging-based biomarkers to stratify patients with PDAC74 and to predict gene mutation status from CT imaging,75 among other goals. Radiomics is still the mainstream approach in this direction. For these biomarkers to reach clinical practice, highly automated models and standardized radiomic features are desirable, as they improve objectivity and enable multicenter validation on large-scale patient cohorts.
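As a hedged illustration of the radiomics approach mentioned above, the snippet below computes a few first-order features from a CT volume and a tumor mask. The feature list, names, and voxel-volume handling are illustrative assumptions; standardized toolkits (e.g., those following the IBSI definitions) compute far richer and more carefully defined feature sets.

```python
import numpy as np

def first_order_radiomics(ct_hu: np.ndarray, mask: np.ndarray, voxel_volume_mm3: float):
    """A few illustrative first-order radiomic features from a CT volume (in HU)
    and a binary tumor mask of the same shape."""
    vals = ct_hu[mask.astype(bool)]
    mean, std = vals.mean(), vals.std()
    return {
        "volume_mm3": float(mask.sum() * voxel_volume_mm3),
        "mean_hu": float(mean),
        "std_hu": float(std),
        "skewness": float(((vals - mean) ** 3).mean() / (std ** 3 + 1e-6)),
        "p10_hu": float(np.percentile(vals, 10)),
        "p90_hu": float(np.percentile(vals, 90)),
    }
```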

14.3.2 AI in other abdominal imaging

Multiorgan segmentation in CT and MRI has attracted considerable research interest. Researchers have built several datasets with voxel-level annotations of the major abdominal organs and vessels.65,76,77 Recent deep-learning approaches (either 2D or 3D based) have already achieved high accuracies for some larger organs, for example, Dice scores of 98%, 97%, and 98% for the liver, spleen, and kidney in CT images.77,80 Small-object segmentation is still challenging: the Dice score for the duodenum is only 75% and that for the esophagus is only 76%. Segmentation of other abdominal tumors is also

important. Investigators have built several public datasets with annotations of abdominal tumors (e.g., liver, kidney, and colon),81 providing an opportunity for the whole community to develop algorithms and helping to accelerate development in this field.

14.4 Pelvic applications

While bone fracture detection is not the only AI application in the pelvic region, it is one of the most important and promising. Hip and pelvic fractures are among the most frequent fracture types worldwide.82 Due to its low cost, high efficiency, and wide availability, pelvic X-ray imaging is the standard imaging tool for diagnosing pelvic and hip fractures. However, anatomical complexities and perspective projection distortions in the X-ray image contribute to a high rate of diagnostic errors83 that may delay treatment and increase patient care cost, morbidity, and mortality.84 As such, an effective AI system for both pelvic and hip fractures is of high clinical interest, with the aim of reducing diagnostic errors and improving patient outcomes. In this section, we cover recent advances in AI-based fracture detection in pelvic X-rays.

The medical reports in PACSs and/or radiology information systems (RISs) provide natural sources of image labels for training a deep-learning-based AI system. These labels typically indicate positive findings of abnormalities (e.g., fracture) in the image, without specifying the exact location. The convenience of obtaining massive image-level labeled data from PACSs and/or RISs without manual annotation has driven the development of weakly supervised learning for AI models in X-ray images, especially CXR applications.4,22,85,86 In this formulation an image-level classification CNN is trained, and localizations of the detected abnormalities are provided via attention methods, for example, class activation mapping15 or gradient-weighted class activation mapping.87

Hip fractures are the most common type of fracture visible in pelvic X-rays. Due to their high incidence, hip fractures are also the fracture type most studied by AI systems in pelvic X-rays. Cheng et al.88 pretrained a popular CNN model on 25,505 limb radiographs and fine-tuned it on 3605 pelvic X-rays with hip fracture labels. The trained model reports an area under the curve (AUC) of 0.980. Gale et al.89 collected a training set of 45,492 pelvic X-rays with hip fractures labeled using a combination of orthopedics unit records and radiology reports. Training their AI model on manually extracted hip ROIs, they reported an impressive AUC of 0.994 for hip fracture identification, which matches radiologist-level performance. Their findings suggest that, due to the localized nature of fractures and the complexity of the surrounding anatomical regions in the pelvis, concentrating on an ROI around the target anatomy (i.e., the hip) is an effective strategy for detecting fractures. The effectiveness of employing ROIs for hip fracture detection has also been demonstrated by Jiménez-Sánchez et al.,90 who reported significant improvements in F1 scores using an ROI-based approach compared to a global approach. Jiménez-Sánchez et al.90 further demonstrated that a curriculum learning scheme that starts from learning "easy" subtypes of hip fractures and gradually moves toward "hard" subtypes leads to better performance with less training data. Besides hip fractures, detecting the more complex pelvic fractures (fractures in the three pelvic bones: the ilium, ischium, and pubis) is also of utmost importance, due to the potential critical

TABLE 14.1 Results of the computer-aided detection system78 and physician performance on fracture detection in a reader study.

                       Hip fracture                                        Pelvic fracture
                       Accuracy (%)   Sensitivity (%)   Specificity (%)    Sensitivity (%)   Specificity (%)
Emergency              88.1           98.3              93.7               81.3              95.5
Surgeon                85.5           93.1              92.8               82.9              93.2
Orthopedics            93.2           100               95.3               90.5              99.0
Radiology              93.0           99.0              96.5               87.0              99.5
Physician average      88.2           96.2              93.8               84.2              95.3
Wang et al.78          90.7           96.0              98.0               84.0              96.0

Credit: Wang Y, Lu L, Cheng CT, Jin D, Harrison AP, Xiao J, et al. Weakly supervised universal fracture detection in pelvic X-rays. In: International conference on medical image computing and computer-assisted intervention. Cham: Springer; 2019, October. pp. 459–67.

complications associated with pelvic fractures. The makeup of pelvic fractures is much more complex, as there is a large variety of types with very different visual patterns at various locations. The overlap of the pelvic bones with lower abdominal anatomy further confounds image patterns. In addition, unlike hip fractures, which occur at the femoral neck/head, pelvic fractures can occur anywhere on the large pelvis, which precludes the use of anatomical ROIs to concentrate on local fracture patterns. To address the previously mentioned challenges in universal fracture detection in pelvic X-rays, Wang et al.78 proposed a global-to-local two-stage gradient-weighted class activation mapping approach and reported radiologist-level performance. In the first stage a CNN is trained using a multiple-instance learning formulation to generate proposals of potential fracture sites. ROIs of the generated proposals are collected and used to train the second-stage local fracture identification network. During inference the two stages are chained together to provide a complete solution. This two-stage solution has the ability to concentrate on local fracture patterns despite the large field of view of pelvic X-rays. The method reports a high AUC of 0.975 for detecting both hip and pelvic fractures. A reader study involving 23 physicians from 4 departments (i.e., surgical, orthopedics, emergency, and radiology) on 150 pelvic X-rays demonstrates that the method outperforms emergency physicians and surgeons. Table 14.1 reports the performance of the physicians as well as the model on diagnosing hip and pelvic fractures. The model is also able to detect ambiguous fracture sites that were missed by physicians in the reader study. Fig. 14.10 shows a few examples of frequently missed fracture sites and their corresponding model detection results.

In summary, recent advances in AI systems for pelvic X-ray fracture detection show a trend of shifting from detecting a single fracture type toward universal fracture detection, which is often required for deployment in real-world clinical scenarios such as emergency rooms or trauma centers. We also observe a paradigm shift from global classifiers to local fracture pattern identification, represented by Gale et al.89 and Wang et al.,78 which significantly improves fracture-detection performance to radiologist level.
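The multiple-instance learning idea used in the first stage can be sketched as follows, under the simplifying assumption that each spatial position of a fracture score map is an instance and the image is positive if any instance is positive. This is an illustration of the general idea, not the exact formulation of Wang et al.78

```python
import torch
import torch.nn.functional as F

def mil_image_logit(score_map: torch.Tensor, r: float = 5.0) -> torch.Tensor:
    """Pool a per-location fracture score map (B, 1, H, W) into one image-level logit.
    Log-sum-exp pooling softly approximates max-pooling over the instances
    (spatial positions), so an image scores high if any location looks fractured."""
    b = score_map.shape[0]
    flat = score_map.reshape(b, -1)
    return torch.logsumexp(r * flat, dim=1) / r

def proposals_from_scores(score_map: torch.Tensor, thresh: float = 0.5):
    """Turn high-scoring locations into candidate ROIs for a second-stage classifier."""
    prob = torch.sigmoid(score_map)
    return (prob > thresh).nonzero()   # coordinates of candidate fracture sites

# Training uses only image-level fracture labels (weak supervision):
# loss = F.binary_cross_entropy_with_logits(mil_image_logit(score_map), image_labels)
```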

FIGURE 14.10 Examples of frequently missed fracture sites and their corresponding model detection results.78

14.5 Universal lesion analysis

When reading medical images, such as CT scans, radiologists generally search across the entire image to find lesions, characterize and measure them, and then describe them in the radiological report. This routine process is tedious and time-consuming. More importantly, human readers may miss some critical abnormal findings. This spurs research on automated lesion analysis algorithms (detection, classification, and segmentation) to decrease reading time and improve accuracy. However, most existing works focus on lesions of specific types and organs, such as lung nodules,91 breast lesions,92 and liver lesions.93 Yet, in clinical scenarios, a CT scan may contain multiple types of lesions in different organs. For instance, metastases can spread from a primary site to regional lymph nodes and other body parts or organs. Designing a model for each organ/lesion type is inefficient and less scalable. In addition, given the wide range of lesion types, a group of single-type models will still miss some infrequent types. To help radiologists find and characterize all of them, a universal lesion analysis (ULA) algorithm is ideal. While AI algorithms for specific lesions will always be valuable, ULA addresses an important part of radiologists' daily workflows and needs.

FIGURE 14.11 Exemplar lesions in the DeepLesion dataset.94,95 Source: Credit: Yan K, Wang X, Lu L, Summers RM. DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J Med Imaging 2018;5(3):036501. Available from: https://doi.org/10.1117/1.JMI.5.3.036501; Yan K, Wang X, Lu L, Zhang L, Harrison A, Bagheri M, et al. Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In: CVPR; 2018.

In this section, we first introduce the large-scale DeepLesion dataset,94 which serves the purpose of ULA for the CT modality. Then we describe representative works on specific lesion analysis tasks, including lesion detection, classification, quantification, retrieval, and mining.

14.5.1 DeepLesion dataset

To achieve ULA, the first step is to collect a large-scale and diverse lesion dataset with comprehensive labels. A conventional data collection effort would recruit experienced radiologists to manually annotate all lesions in 3D scans, which is extremely costly. Taking a different approach, the DeepLesion dataset94,95 was collected from the PACS of the NIH Clinical Center by mining the response evaluation criteria in solid tumors (RECIST)96 marks already annotated by radiologists during their daily work. DeepLesion contains 32,735 lesions annotated on 32,120 axial CT slices from 10,594 studies of 4427 patients. A visualization of lesions in the dataset can be found in Fig. 14.11. This dataset greatly boosted research on ULA.11,79,80,95,97–105 It can also be readily updated or extended, as it was mined automatically with minimal manual effort. Nonetheless, as with PACS-mined data in other domains, for example, CXR datasets, there are limitations. One important limitation is that the data are incompletely labeled, as radiologists do not typically mark all found lesions with RECIST marks. As outlined below, active research is currently underway to address this.
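As an illustration of how RECIST marks can serve as weak detection labels, the sketch below converts the four diameter endpoints of one measurement into a padded 2D bounding box on the key slice. The padding value and array layout are assumptions for illustration, not the DeepLesion preprocessing itself.

```python
import numpy as np

def recist_to_bbox(endpoints_xy: np.ndarray, pad: float = 5.0) -> np.ndarray:
    """Convert the four endpoints of a RECIST measurement (long + short diameter,
    shape (4, 2) in pixel coordinates) into a padded axis-aligned bounding box
    [x_min, y_min, x_max, y_max]."""
    x_min, y_min = endpoints_xy.min(axis=0)
    x_max, y_max = endpoints_xy.max(axis=0)
    return np.array([x_min - pad, y_min - pad, x_max + pad, y_max + pad])

# Example endpoints of the two diameters measured on a key slice (illustrative values)
endpoints = np.array([[100, 120], [140, 130],    # long-axis endpoints
                      [118, 110], [122, 132]])   # short-axis endpoints
print(recist_to_bbox(endpoints))                 # -> [ 95. 105. 145. 137.]
```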

14.5.2 Lesion detection and classification

Universal lesion detection (ULD) is one of the most important tasks in ULA. It aims at finding a variety of lesions in the whole body and thus is more challenging than traditional single-type lesion detection because of the large appearance variation across lesion types and the sometimes subtle distinction between lesions and nonlesions. CNN-based object detection frameworks such as the Faster Region-based CNN106 and Mask Region-based CNN107 are often adopted for ULD. Performance has been improved through various enhancements to the analysis. For instance, 3D context information in neighboring slices is important for detection, as lesions may be less distinguishable in just one 2D axial slice. Yan et al.98,102 and Wang et al.79 exploited 3D information with multislice image inputs and a 2.5D network that fuses features of multiple slices. On the other hand, Zlocha et al.,103 Wang et al.,79,80 and Li et al.104 used attention mechanisms to

FIGURE 14.12 Examples of the lesion detection, tagging, and segmentation results of MULAN.102 For detection, boxes in green and red are predicted TPs and FPs, respectively. The number above each box is the confidence score. For tagging, tags in black and blue are predicted TPs and FNs, respectively. For segmentation, the green lines are ground-truth RECIST measurements; the orange contours and lines show predicted masks and RECIST measurements, respectively. FN, False negative; FP, false positive; MULAN, multitask universal lesion analysis network; RECIST, response evaluation criteria in solid tumors; TP, true positive. Source: Reproduced from Yan K, Tang Y, Peng Y, Sandfort V, Bagheri M, Lu Z, et al. MULAN: multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In: MICCAI; 2019. pp. 194–202. Available from: https://doi.org/10.1007/978-3-030-32226-7_22.

emphasize important regions and features within the deep CNN. Wang et al.80 went even further and proposed a domain attention module to learn from DeepLesion and 10 other object detection datasets simultaneously. ULDor99 used a trained detector to mine hard negative proposals and then retrained the model. Finally, the multitask universal lesion analysis network (MULAN)102 jointly learned lesion detection, segmentation, and tagging, and used a score refinement layer to improve detection with tagging. It achieved the current state-of-the-art accuracy on DeepLesion, that is, 83.7% recall at one false positive per key slice. Fig. 14.12 illustrates exemplar results of MULAN.

Automatic lesion classification can assist diagnostic decision-making and structured report generation. Existing algorithms usually focus on certain body parts and attempt to distinguish between a limited set of labels.91–93 In contrast, Yan et al. and Peng et al.100,101 learned from the DeepLesion dataset to predict 171 comprehensive labels describing the body part, type, and attributes of a variety of lesions. They first designed a natural language processing algorithm to extract relevant semantic labels from the radiology reports associated with the lesion images and then proposed a lesion annotation network (LesaNet) for multilabel classification, leveraging hierarchical and mutually exclusive relations between the labels to improve label prediction accuracy. LesaNet's average classification AUC over the 171 labels is 0.934.
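One simple way such hierarchical relations can be exploited is label expansion: a positive child label implies its parent (e.g., body-part) label. The sketch below illustrates this with a toy hierarchy; the labels shown are invented for illustration and are not the actual DeepLesion label ontology or the LesaNet mechanism.

```python
# Toy label hierarchy mapping each child label to its parent (illustrative only).
PARENT = {"lung_nodule": "lung", "lung_mass": "lung", "liver_lesion": "liver"}

def expand_labels(labels):
    """Hierarchical label expansion: if a child label is positive for a lesion,
    its ancestors (e.g., the body part) are marked positive as well."""
    expanded = set(labels)
    for lbl in labels:
        while lbl in PARENT:
            lbl = PARENT[lbl]
            expanded.add(lbl)
    return expanded

print(expand_labels({"lung_nodule"}))   # {'lung_nodule', 'lung'}
```

Mutually exclusive relations can be handled in a complementary way, for example by treating a positive label for one member of an exclusive group as a reliable negative for the others during training.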

14.5.3 Lesion segmentation and quantification

Lesion segmentation and measurement results are useful for clinicians to evaluate lesion sizes and treatment responses. In DeepLesion, lesions were annotated with two RECIST diameters: a long axis and the orthogonal short axis.94,96 However,

FIGURE 14.13 Example of automatic lesion segmentation with the weakly supervised slice-propagated segmentation method.97 (A) shows an axial CT slice that contains a lesion measured by a RECIST mark. The highlighted lesion and the RECIST mark are shown in (B) in green. The red box is the region of interest derived from the RECIST mark and used to initialize automatic segmentation. (C) and (D) show the automatic segmentation result and the manually delineated ground-truth segmentation, respectively. CT, Computerized tomography; RECIST, response evaluation criteria in solid tumors. Source: Credit: Cai J, Tang Y, Lu L, Harrison AP, Yan K, Xiao J, et al. Accurate weakly-supervised deep lesion segmentation using large-scale clinical annotations: slice-propagated 3D mask generation from 2D RECIST. In: MICCAI; 2018.

RECIST marks are subjective and can be prone to inconsistency among different observers, especially when selecting the corresponding axial slices at the different time points at which RECIST diameters are measured. To alleviate this problem, Tang et al.11 designed a cascaded CNN to automatically predict the endpoints of the RECIST diameters, yielding reliable and reproducible lesion measurements with an average error of approximately 3 pixels. Compared with RECIST diameters, volumetric lesion measurement can be a better metric for holistic and accurate quantitative assessment of lesion growth rates, avoiding the subjective selection of the axial slice for RECIST measurement. Unfortunately, obtaining full volumetric lesion measurements with manual segmentations is labor-intensive and time-consuming. For this reason, RECIST is treated as the default, but imperfect, clinical surrogate for measuring lesion progression. To facilitate automatic segmentation of lesion volumes, Cai et al.97 presented a weakly supervised slice-propagated segmentation method with DeepLesion to learn from the RECIST annotations and predict 3D lesion masks. They reported a patient-wise mean Dice score of 91.5% for lesion segmentation measured on the key slices (the axial slices containing the RECIST marks). Fig. 14.13 shows an example of automatic lesion segmentation on the RECIST-marked CT slice. With slice-wise propagation, Cai et al.'s97 method can produce volumetric segmentations, achieving a 76.4% Dice score across the entire lesion.
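To connect masks and RECIST-style measurements, the following sketch reads an approximate long axis and orthogonal short axis off a 2D lesion mask. It is a brute-force simplification for illustration, not the cascaded CNN of Tang et al.11 or clinical RECIST practice.

```python
import numpy as np

def recist_diameters(mask: np.ndarray):
    """Approximate RECIST-style measurements from a 2D binary lesion mask: the long
    axis is the largest point-to-point distance inside the mask, and the short axis
    is approximated by the mask's extent along the perpendicular direction."""
    pts = np.argwhere(mask)                      # (N, 2) row/col coordinates
    # Long axis: brute-force farthest pair (O(N^2) memory, fine for small lesions).
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    long_axis = d[i, j]
    # Short axis: spread of the points projected onto the perpendicular direction.
    direction = (pts[j] - pts[i]) / (long_axis + 1e-6)
    perp = np.array([-direction[1], direction[0]])
    proj = pts @ perp
    short_axis = proj.max() - proj.min()
    return long_axis, short_axis
```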

14.5.4 Lesion retrieval and mining

The goal of lesion retrieval is to find similar lesions in a database to help the user understand the query lesion. DeepLesion also provides a valuable platform to explore the similarity relationships among a variety of lesions. For instance, Yan et al.95 trained a triplet network to learn quantitative lesion embeddings that reflect lesion "similarity." Similarity was defined hierarchically based on lesion type, anatomical location, and size. The embeddings can also be used to build a lesion graph for intra-patient lesion

matching.95 The lesion labels mined from radiological reports can also be adopted to learn embeddings that encode more fine-grained semantic information.100 In terms of lesion mining, one limitation of DeepLesion is that not all lesions in the dataset were annotated. Cai et al.105 exploited a small, fully labeled subset of volumes and used it to intelligently mine annotations from the remainder of the images in DeepLesion. They showed that lesion detectors trained on the harvested lesions and hard negatives can significantly outperform the same variants trained only on the original annotations, boosting average precision by 7%–10%. Despite the progress of ULA in recent years, there is still room for improvement; for example, the detection accuracy for lesions in confusing or rare body parts98 is still insufficient for practical use. One interesting research direction is to combine existing single-type lesion datasets with DeepLesion and leverage their synergy to further improve detection accuracy.
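A minimal sketch of the triplet objective behind such lesion embeddings is given below; the margin value and the nearest-neighbor retrieval line are illustrative assumptions rather than the exact configuration of any cited work.

```python
import torch
import torch.nn.functional as F

def lesion_triplet_loss(anchor, positive, negative, margin: float = 0.2):
    """Triplet loss for learning lesion embeddings: lesions judged similar
    (e.g., same type, location, and size bin) are pulled together, while
    dissimilar lesions are pushed at least `margin` farther apart."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Retrieval then reduces to a nearest-neighbor search in the embedding space:
# scores = -torch.cdist(query_embedding, database_embeddings)
```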

14.6 Conclusion

Significant advances in AI technology may greatly impact and eventually alter radiology workflows. In this chapter, several important medical imaging tasks in different anatomies are reviewed. Specifically, we overview AI applications in the thoracic, abdominal, and pelvic regions as well as general lesion analysis. Different tasks, such as detection, segmentation, and classification, are discussed to highlight their strengths and limitations. These should provide radiologists with a better understanding of current AI technology and its potential for improving the efficiency, accuracy, and consistency of various radiology procedures.

References

1. De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med 2018;24(9):1342–50.
2. Mettler FA, Bhargavan M, Faulkner K, Gilley DB, Gray JE, Ibbott GS, et al. Radiologic and nuclear medicine studies in the United States and worldwide: frequency, radiation dose, and comparison with other radiation sources—1950–2007. Radiology 2009;253:520–31. Available from: https://doi.org/10.1148/radiol.2532082010.
3. Raoof S, Feigin D, Sung A, Raoof S, Irugulpati L, Rosenow EC. Interpretation of plain chest roentgenogram. Chest 2012;141:545–58. Available from: https://doi.org/10.1378/chest.10-1302.
4. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-Ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR); 2017. pp. 3462–71. Available from: https://doi.org/10.1109/CVPR.2017.369.
5. Gohagan JK, Prorok PC, Hayes RB, Kramer B-S. The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial of the National Cancer Institute: history, organization, and status. Controlled Clin Trials 2000;21:251S–72S. Available from: https://doi.org/10.1016/S0197-2456(00)00097-0.
6. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc AAAI Conf Artif Intell 2019;33:590–7. Available from: https://doi.org/10.1609/aaai.v33i01.3301590.
7. Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng C, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 2019;6:1–8. Available from: https://doi.org/10.1038/s41597-019-0322-0.

8. Bustos A, Pertusa A, Salinas J-M, de la Iglesia-Vayá M. PadChest: a large chest x-ray image dataset with multi-label annotated reports. arXiv:1901.07441 [cs, eess]; 2019.
9. Oakden-Rayner L. Exploring large-scale public medical image datasets. Acad Radiol 2020;27:106–12. Available from: https://doi.org/10.1016/j.acra.2019.10.006.
10. Rajpurkar P, Joshi A, Pareek A, Chen P, Kiani A, Irvin J, et al. CheXpedition: investigating generalization challenges for translation of chest x-ray algorithms to the clinical setting. arXiv:2002.11379 [cs, eess]; 2020.
11. Tang Y, Harrison AP, Bagheri M, Xiao J, Summers RM. Semi-automatic RECIST labeling on CT scans with cascaded convolutional neural networks. In: MICCAI; 2018. pp. 405–13. Available from: https://doi.org/10.1007/978-3-030-00937-3_47.
12. Tang Y, Wang X, Harrison AP, Lu L, Xiao J, Summers RM. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In: Shi Y, Suk H-I, Liu M, editors. Machine learning in medical imaging, lecture notes in computer science. Cham: Springer International Publishing; 2018. pp. 249–58. Available from: https://doi.org/10.1007/978-3-030-00919-9_29.
13. Pesce E, Ypsilantis P-P, Withey S, Bakewell R, Goh V, Montana G. Learning to detect chest radiographs containing lung nodules using visual attention networks. arXiv:1712.00996 [cs, stat]; 2017.
14. Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med 2018;15:e1002686. Available from: https://doi.org/10.1371/journal.pmed.1002686.
15. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR); 2016. pp. 2921–9. Available from: https://doi.org/10.1109/CVPR.2016.319.
16. Tang Y-X, Tang Y-B, Han M, Xiao J, Summers RM. Abnormal chest x-ray identification with generative adversarial one-class classifier. In: 2019 IEEE 16th international symposium on biomedical imaging (ISBI 2019); 2019. pp. 1358–61. Available from: https://doi.org/10.1109/ISBI.2019.8759442.
17. Li Z, Wang C, Han M, Xue Y, Wei W, Li L-J, et al. Thoracic disease identification and localization with limited supervision. In: 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR), IEEE, Salt Lake City, UT; 2018. pp. 8290–9. Available from: https://doi.org/10.1109/CVPR.2018.00865.
18. Cai J, Lu L, Harrison AP, Shi X, Chen P, Yang L. Iterative attention mining for weakly supervised thoracic disease pattern localization in chest X-rays. In: Frangi AF, Schnabel JA, Davatzikos C, Alberola-López C, Fichtinger G, editors. Medical image computing and computer assisted intervention – MICCAI 2018, lecture notes in computer science. Cham: Springer International Publishing; 2018. pp. 589–98. Available from: https://doi.org/10.1007/978-3-030-00934-2_66.
19. Zhang Y, Miao S, Mansi T, Liao R. Task driven generative modeling for unsupervised domain adaptation: application to X-ray image segmentation. In: Frangi AF, Schnabel JA, Davatzikos C, Alberola-López C, Fichtinger G, editors. Medical image computing and computer assisted intervention – MICCAI 2018, lecture notes in computer science. Cham: Springer International Publishing; 2018. pp. 599–607. Available from: https://doi.org/10.1007/978-3-030-00934-2_67.
20. Tang Y-B, Tang Y-X, Xiao J, Summers RM. XLSor: a robust and accurate lung segmentor on chest X-rays using criss-cross attention and customized radiorealistic abnormalities generation. In: International conference on medical imaging with deep learning; 2019. pp. 457–67.
21. Tang Y, Tang Y, Sandfort V, Xiao J, Summers RM. TUNA-net: task-oriented UNsupervised adversarial network for disease recognition in cross-domain chest X-rays. In: Shen D, Liu T, Peters TM, Staib LH, Essert C, Zhou S, Yap P-T, Khan A, editors. Medical image computing and computer assisted intervention – MICCAI 2019, lecture notes in computer science. Cham: Springer International Publishing; 2019. pp. 431–40. Available from: https://doi.org/10.1007/978-3-030-32226-7_48.
22. Chen H, Miao S, Xu D, Hager GD, Harrison AP. Deep hierarchical multi-label classification of chest X-ray images. In: Proceedings of machine learning research. Presented at MIDL 2019; 2019. 12.
23. Armato III SG, McLennan G, Bidaut L, McNitt-Gray MF, Meyer CR, Reeves AP, et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys 2011;38(2):915–31.

III. Clinical applications

286

14. Artificial intelligence in radiology

24. Grenier P, Valeyre D, Cluzel P, Brauner MW, Lenoir S, Chastang C. Chronic diffuse interstitial lung disease: diagnostic value of chest radiography and high-resolution CT. Radiology 1991;179(1):123 32. 25. Lee Y, Hara T, Fujita H, Itoh S, Ishigaki T. Automated detection of pulmonary nodules in helical CT images based on an improved template-matching technique. IEEE Trans Med Imaging 2001;20(7):595 604. 26. McNitt-Gray MF, Hart EM, Wyckoff N, Sayre JW, Goldin JG, Aberle DR. A pattern classification approach to characterizing solitary pulmonary nodules imaged on high resolution CT: preliminary results. Med Phys 1999;26(6):880 8. 27. Blechschmidt RA, Werthschutzky R, Lorcher U. Automated CT image evaluation of the lung: a morphologybased concept. IEEE Trans Med Imaging 2001;20(5):434 42. 28. Jin D, Xu Z, Tang Y, Harrison AP, Mollura DJ. CT-realistic lung nodule simulation from 3D conditional generative adversarial networks for robust lung segmentation. Medical image computing and computer assisted intervention MICCAI 2018, Lecture notes in computer science. Cham: Springer International Publishing; 2018. pp. 732 40. Available from: https://doi.org/10.1007/978-3-030-00934-2_81. 29. Harrison AP, Xu Z, George K, Lu L, Summers RM, Mollura DJ. Progressive and multi-path holistically nested neural networks for pathological lung segmentation from CT images. In: Descoteaux M, Maier-Hein L, Franz A, Jannin P, Collins DL, Duchesne S, editors. Medical image computing and computer assisted intervention 2 MICCAI 2017, Lecture notes in computer science. Springer International Publishing; 2017. pp. 621 9. 30. Mansoor A, Bagci U, Xu Z, Foster B, Olivier KN, Elinoff JM, et al. A generic approach to pathological lung segmentation. IEEE Trans Med Imaging 2014;33:2293 310. Available from: https://doi.org/10.1109/ TMI.2014.2337057. 31. George K, Harrison AP, Jin D, Xu Z, Mollura DJ. Pathological pulmonary lobe segmentation from CT images using progressive holistically nested neural networks and random walker. In: Cardoso MJ, Arbel T, Carneiro G, Syeda-Mahmood T, Tavares JMRS, Moradi M, et al., editors. Deep learning in medical image analysis and multimodal learning for clinical decision support, lecture notes in computer Science. Cham: Springer International Publishing; 2017. pp. 195 203. Available from: https://doi.org/10.1007/978-3-319-67558-9_23. 32. Doel T, Matin TN, Gleeson FV, Gavaghan DJ, Grau V. Pulmonary lobe segmentation from CT images using fissureness, airways, vessels and multilevel B-splines. In: 2012 9th IEEE international symposium on biomedical imaging (ISBI). Presented at the 2012 9th IEEE international symposium on biomedical imaging (ISBI); 2012. pp. 1491 4. Available from: https://doi.org/10.1109/ISBI.2012.6235854. 33. Bragman FJS, McClelland JR, Jacob J, Hurst JR, Hawkes DJ. Pulmonary lobe segmentation with probabilistic segmentation of the fissures and a groupwise fissure prior. IEEE Trans Med Imaging 2017;36:1650 63. Available from: https://doi.org/10.1109/TMI.2017.2688377. 34. Van Rikxoort EM, Baggerman W, Van Ginneken B. Automatic segmentation of the airway tree from thoracic CT scans using a multi-threshold approach. In: Proc of second international workshop on pulmonary image analysis; 2009. pp. 341 9. 35. Aykac D, Hoffman EA, McLennan G, Reinhardt JM. Segmentation and analysis of the human airway tree from three-dimensional X-ray CT images. IEEE Trans Med Imaging 2003;22(8):940 50. 36. Nadeem SA, Jin D, Hoffman EA, Saha PK. 
An iterative method for airway segmentation using multiscale leakage detection. Med imaging 2017: image process, 10133. International Society for Optics and Photonics; 2017. p. 1013308. 37. Tschirren J, Hoffman EA, McLennan G, Sonka M. Intrathoracic airway trees: segmentation and airway morphology analysis from low-dose CT scans. IEEE Trans Med Imaging 2005;24(12):1529 39. 38. Xu Z, Bagci U, Foster B, Mansoor A, Udupa JK, Mollura DJ. A hybrid method for airway segmentation and automated measurement of bronchial wall thickness on CT. Med Image Anal 2015;24(1):1 17. 39. Charbonnier JP, Van Rikxoort EM, Setio AA, Schaefer-Prokop CM, van Ginneken B, Ciompi F. Improving airway segmentation in computed tomography using leak detection with convolutional networks. Med Image Anal 2017;36:52 60. 40. Lo P, Sporring J, Ashraf H, Pedersen JJ, de Bruijne M. Vessel-guided airway tree segmentation: a voxel classification approach. Med Image Anal 2010;14(4):527 38. 41. Jin D, Xu Z, Harrison AP, George K, Mollura DJ. 3D convolutional neural networks with graph refinement for airway segmentation using incomplete data labels. International workshop on machine learning in medical imaging. Cham: Springer; 2017. pp. 141 9. 42. Jin D, Iyer KS, Chen C, Hoffman EA, Saha PK. A robust and efficient curve skeletonization algorithm for treelike objects using minimum cost paths. Pattern Recognit Lett 2016;76:32 40.

III. Clinical applications

References

287

43. Yun J, Park J, Yu D, Yi J, Lee M, Park HJ, et al. Improvement of fully automated airway segmentation on volumetric computed tomographic images using a 2.5 dimensional convolutional neural net. Med Image Anal 2019;51:13 20. 44. Qin Y, Chen M, Zheng H, Gu Y, Shen M, Yang J, et al. AirwayNet: a voxel-connectivity aware approach for accurate airway segmentation using convolutional neural networks. International conference on medical image computing and computer-assisted intervention. Cham: Springer; 2019. pp. 212 20. 45. Nishimura K, Izumi T, Kitaichi M, Nagai S, Itoh H. The diagnostic accuracy of high-resolution computed tomography in diffuse infiltrative lung diseases. Chest 1993;104(4):1149 55. 46. Padley SPG, Hansell DM, Flower CDR, Jennings P. Comparative accuracy of high resolution computed tomography and chest radiography in the diagnosis of chronic diffuse infiltrative lung disease. Clin Radiol 1991;44(4):222 6. 47. Goddard PA, Nicholson EM, Laszlo G, Watt I. Computed tomography in pulmonary emphysema. Clin Radiol 1982;33:379 87. 48. Hayhurst MD, MacNee W, Flenley DC, Wright D, McLean A, Lamb D, et al. Diagnosis of pulmonary emphysema by computed tomography. Lancet 1984;2:320 2. 49. Uppaluri R, Hoffman EA, Sonka M, Hartley PG, Hunninghake GW, McLennan G. Computer recognition of regional lung disease patterns. Am J Respiratory Crit Care Med 1999;160(2):648 54. 50. Sluimer IC, van Waes PF, Viergever MA, van Ginneken B. Computer-aided diagnosis in high resolution CT of the lungs. Med Phys 2003;30(12):3081 90. 51. Depeursinge A, Van de Ville, Platon D, Geissbuhler A, Poletti PA, Muller H. Near-affine-invariant texture learning for lung tissue analysis using isotropic wavelet frames. IEEE Trans Inf Technol Biomed 2012;16(4):665 75. 52. Depeursinge A, Vargas A, Platon A, Geissbuhler A, Poletti PA, Mu¨ller H. Building a reference multimedia database for interstitial lung diseases. Computerized Med Imaging Graph 2012;36(3):227 38. 53. Song Y, Cai W, Zhou Y, Feng DD. Feature-based image patch approximation for lung tissue classification. IEEE Trans Med Imaging 2013;32(4):797 808. 54. Anthimopoulos M, Christodoulidis S, Ebner L, Christe A, Mougiakakou S. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans Med imaging 2016;35(5):1207 16. 55. Gao M, Bagci U, Lu L, Wu A, Buty M, Shin HC, et al. Holistic classification of CT attenuation patterns for interstitial lung diseases via deep convolutional neural networks. Comput Methods Biomech Biomed Eng: Imaging Vis 2016;6(1):1 6. 56. Gao M, Xu Z, Lu L, Harrison AP, Summers RM, Mollura DJ. Multi-label deep regression and unordered pooling for holistic interstitial lung disease pattern detection. International workshop on machine learning in medical imaging. Cham: Springer; 2016. pp. 147 55. 57. Shin HC, Roth HR, Gao M, Lu L, Xu Z, Nogues I, et al. Deep convolutional neural networks for computeraided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 2016;35(5):1285 98. 58. Karwoski RA, Bartholmai B, Zavaletta VA, Holmes D, Robb RA. Processing of CT images for analysis of diffuse lung disease in the lung tissue research consortium. Medical imaging 2008: physiology, function, and structure from medical images, 6916. International Society for Optics and Photonics; 2008. pp. 691614. 59. Bartholmai B, Karwoski R, Zavaletta V, Robb R, Holmes DRI. 
The Lung Tissue Research Consortium: an extensive open database containing histological, clinical, and radiological data to study chronic lung disease. Insight J 2006. 60. Gao M, Xu Z, Lu L, Wu A, Nogues I, Summers RM, et al. Segmentation label propagation using deep convolutional neural networks and dense conditional random field. 2016 IEEE 13th international symposium on biomedical imaging (ISBI). IEEE; 2016. pp. 1265 8. 61. Yan K, Cai J, Harisson AP, Jin D, Xiao J, Lu L. Universal lesion detection by learning from multiple heterogeneously labeled datasets. Under review. 2020. 62. Summers RM, Jerebko AK, Franaszek M, Malley JD, Johnson CD. Colonic polyps: complementary role of computer-aided detection in CT colonography. Radiology 2002;225(2):391 9. 63. Bilello M, Gokturk SB, Desser T, Napel S, Jeffrey Jr RB, Beaulieu CF. Automatic detection and classification of hypodense hepatic lesions on contrast-enhanced venous-phase CT. Med Phys 2004;31(9):2584 93. 64. Rahib L, Smith BD, Aizenberg R, Rosenzweig AB, Fleshman JM, Matrisian LM. Projecting cancer incidence and deaths to 2030: the unexpected burden of thyroid, liver, and pancreas cancers in the United States. Cancer Res 2014;74(11):2913 21.

III. Clinical applications

288

14. Artificial intelligence in radiology

65. Roth HR, Lu L, Lay N, Harrison AP, Farag A, Sohn A, et al. Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation. Med Image Anal 2018;45:94 107. 66. Zhu Z, Xia Y, Xie L, Fishman EK, Yuille AL. Multi-scale coarse-to-fine segmentation for screening pancreatic ductal adenocarcinoma. International conference on medical image computing and computer-assisted intervention. Cham: Springer; 2019. pp. 3 12. 67. Isensee F, Petersen J, Klein A, Zimmerer D, Jaeger PF, Kohl S, et al. nnu-net: Self-adapting framework for u-netbased medical image segmentation. arXiv preprint arXiv:1809.10486. 2018. 68. Roth HR, Lu L, Farag A, Shin HC, Liu J, Turkbey EB, et al. Deeporgan: multi-level deep convolutional networks for automated pancreas segmentation. International conference on medical image computing and computer-assisted intervention. Cham: Springer; 2015. pp. 556 64. 69. Guo Z, Zhang L, Lu L, Bagheri M, Summers RM, Sonka M, et al. Deep LOGISMOS: deep learning graph-based 3D segmentation of pancreatic tumors on CT scans. 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). IEEE; 2018. pp. 1230 3. 70. Zhou Y, Li Y, Zhang Z, Wang Y, Wang A, Fishman EK, et al. Hyper-pairing network for multi-phase pancreatic ductal adenocarcinoma segmentation. International conference on medical image computing and computer-assisted intervention. Cham: Springer; 2019. pp. 155 63. 71. Liang Y, Schott D, Zhang Y, Wang Z, Nasief H, Paulson E, et al. Auto-segmentation of pancreatic tumor in multi-parametric MRI using deep convolutional neural networks. Radiother Oncol 2020;145:193 200. 72. Zhang L, Lu L, Wang X, Zhu RM, Bagheri M, Summers RM et al. Spatio-temporal convolutional LSTMs for tumor growth prediction by learning 4D longitudinal patient data. In: IEEE transactions on medical imaging; 2019. 73. Zhang L, Lu L, Summers RM, Kebebew E, Yao J. Convolutional invasion and expansion networks for tumor growth prediction. IEEE Trans Med Imaging 2018;37(2):638 48. 74. Attiyeh MA, Chakraborty J, Doussot A, Langdon-Embry L, Mainarich S, Go¨nen M, et al. Survival prediction in pancreatic ductal adenocarcinoma by quantitative computed tomography image analysis. Ann Surg Oncol 2018;25(4):1034 42. 75. Attiyeh MA, Chakraborty J, McIntyre CA, Kappagantula R, Chou Y, Askan G, et al. CT radiomics associations with genotype and stromal content in pancreatic ductal adenocarcinoma. Abdom Radiol 2019;44(9):3148 57. 76. Gibson E, Giganti F, Hu Y, Bonmati E, Bandula S, Gurusamy K, et al. Automatic multi-organ segmentation on abdominal CT with dense v-networks. IEEE Trans Med Imaging 2018;37(8):1822 34. 77. Wang Y, Zhou Y, Shen W, Park S, Fishman EK, Yuille AL. Abdominal multi-organ segmentation with organattention networks and statistical fusion. Med Image Anal 2019;55:88 102. 78. Wang Y, Lu L, Cheng CT, Jin D, Harrison AP, Xiao J, et al. Weakly supervised universal fracture detection in pelvic X-rays. International conference on medical image computing and computer-assisted intervention. Cham: Springer; 2019. p. 459 67. 79. Wang X, Han S, Chen Y, Gao D, Vasconcelos N. Volumetric attention for 3D medical image segmentation and detection. In: MICCAI; 2019a. pp. 175 84. Available from: https://doi.org/10.1007/978-3-030-32226-7_20. 80. Wang X, Cai Z, Gao D, Vasconcelos N. Towards universal object detection by domain attention. In: CVPR; 2019b. pp. 7281 90. Available from: https://doi.org/10.1109/CVPR.2019.00746. 81. 
Simpson AL, Antonelli M, Bakas S, Bilello M, Farahani K, Van Ginneken B, et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063; 2019. 82. Johnell O, Kanis JA. An estimate of the worldwide prevalence, mortality and disability associated with hip fracture. Osteoporos Int 2004;15(11):897 902. 83. Chellam WB. Missed subtle fractures on the trauma-meeting digital projector. Injury 2016;47(3):674 6. 84. Tarrant SM, Hardy BM, Byth PL, Brown TL, Attia J, Balogh ZJ. Preventable mortality in geriatric hip fracture inpatients. Bone Jt. J 2014;96(9):1178 84. 85. Badgeley MA, Zech JR, Oakden-Rayner L, Glicksberg BS, Liu M, Gale W, et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digital Med 2019;2(1):1 10. 86. Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, et al. Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225; 2017. 87. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision; 2017. pp. 618 26. 88. Cheng CT, Ho TY, Lee TY, Chang CC, Chou CC, Chen CC, et al. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol 2019;29(10):5469 77.

III. Clinical applications

References

289

89. Gale W, Oakden-Rayner L, Carneiro G, Bradley AP, Palmer LJ. Detecting hip fractures with radiologist-level performance using deep neural networks. arXiv preprint arXiv:1711.06504; 2017. 90. Jime´nez-Sa´nchez A, Kazi A, Albarqouni S, Kirchhoff S, Stra¨ter A, Biberthaler P, et al. Weakly-supervised localization and classification of proximal femur fractures. arXiv preprint arXiv:1809.10692; 2018. 91. Sahiner B, Pezeshk A, Hadjiiski LM, Wang X, Drukker K, Cha KH, et al. Deep learning in medical imaging and radiation therapy. Med Phys 2019;46:e1 e36. Available from: https://doi.org/10.1002/mp.13264. 92. Ribli D, Horva´th A, Unger Z, Pollner P, Csabai I. Detecting and classifying lesions in mammograms with Deep Learning. Sci Rep 2018;8. Available from: https://doi.org/10.1038/s41598-018-22437-z. 93. Diamant I, Hoogi A, Beaulieu CF, Safdari M, Klang E, Amitai M, et al. Improved patch-based automated liver lesion classification by separate analysis of the interior and boundary regions. IEEE J Biomed Heal Inform 2016;20:1585 94. Available from: https://doi.org/10.1109/JBHI.2015.2478255. 94. Yan K, Wang X, Lu L, Summers RM. DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J Med Imaging 2018;5:1. Available from: https://doi.org/ 10.1117/1.JMI.5.3.036501. 95. Yan K, Wang X, Lu L, Zhang L, Harrison A, Bagheri M, et al. Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In: CVPR; 2018b. 96. Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent D, Ford R, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 2009;45:228 47. Available from: https://doi.org/10.1016/j.ejca.2008.10.026. 97. Cai J, Tang Y, Lu L, Harrison AP, Yan K, Xiao J, et al. Accurate weakly-supervised deep lesion segmentation using large-scale clinical annotations: slice-propagated 3D mask generation from 2D RECIST. In: MICCAI; 2018b. 98. Yan K, Bagheri M, Summers RM. 3D context enhanced region-based convolutional neural network for endto-end lesion detection. In: MICCAI; 2018c. pp. 511 9. Available from: https://doi.org/10.1007/978-3-03000928-1_58. 99. Tang Y-B, Yan K, Tang Y-X, Liu J, Xiao J, Summers RM. Uldor: a universal lesion detector for CT scans with pseudo masks and hard negative example mining. In: 2019 IEEE 16th international symposium on biomedical imaging (ISBI 2019). Presented at the 2019a IEEE 16th international symposium on biomedical imaging (ISBI 2019); 2019b. pp. 833 6. Available from: https://doi.org/10.1109/ISBI.2019.8759478. 100. Yan K, Peng Y, Sandfort V, Bagheri M, Lu Z, Summers RM. Holistic and comprehensive annotation of clinically significant findings on diverse CT images: learning from radiology reports and label ontology. In: CVPR; 2019a. pp. 8515 24. Available from: https://doi.org/10.1109/CVPR.2019.00872. 101. Peng Y, Yan K, Sandfort V, Summers RM, Lu Z. A self-attention based deep learning method for lesion attribute detection from CT reports. In: 2019 IEEE international conference on healthcare informatics, ICHI 2019; 2019. Available from: https://doi.org/10.1109/ICHI.2019.8904668. 102. Yan K, Tang Y, Peng Y, Sandfort V, Bagheri M, Lu Z, et al. MULAN: multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In: MICCAI; 2019b. pp. 194 202. Available from: https://doi.org/10.1007/978-3-030-32226-7_22.. 103. Zlocha M, Dou Q, Glocker B. 
Improving RetinaNet for CT lesion detection with dense masks from weak RECIST labels. In: MICCAI; 2019. pp. 402 10. ,https://doi.org/10.1007/978-3-030-32226-7_45. 104. Li Z, Zhang S, Zhang J, Huang K, Wang Y, Yu Y. MVP-Net: multi-view FPN with position-aware attention for deep universal lesion detection. In: MICCAI; 2019. pp. 13 21. Available from: https://doi.org/10.1007/ 978-3-030-32226-7_2. 105. Cai J, Harrison AP, Zheng Y, Yan K, Huo Y, Xiao J, et al. Lesion harvester: iteratively mining unlabeled lesions and hard-negative examples at scale. 2020. Available from: http://arxiv.org/abs/2001.07776. 106. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS; 2015. pp. 91 9. ,https://doi.org/10.1109/TPAMI.2016.2577031.. 107. He K, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. In: ICCV; 2017. pp. 2980 8. Available from: https:// doi.org/10.1109/ICCV.2017.322.


15 Artificial intelligence and interpretations in breast cancer imaging

Hui Li and Maryellen L. Giger

Abstract

Medical decision-making in breast cancer image interpretation continues to evolve with synergistic advances in image acquisition systems (e.g., breast tomosynthesis), imaging protocols [e.g., multiparametric magnetic resonance imaging (MRI)], interpretation aids [artificial intelligence (AI)], and clinical tasks (e.g., use in response assessment). Advances in computers, in terms of both computing power and memory, have led to a rapid increase in assessing the potential use of AI in various tasks in breast imaging, going beyond the initial use in computer-aided detection to include diagnosis, prognosis, response to therapy, and risk assessment, as well as cancer discovery. AI methods are being developed for computer-aided detection (CADe) and diagnosis (CADx), for triaging (CADt), and for ultimate use as an autonomous reader, often with limited consideration for the effect on radiologists' perception/cognitive performance and workflow. While the prospects of AI in breast cancer image interpretation are abundant and promising, they bring along challenges and trepidations. This chapter focuses on the role of AI in breast cancer image interpretation.

Keywords: Breast cancer; AI; CAD; breast imaging; mammography; breast MRI; breast ultrasound; machine learning; radiomics; deep learning

15.1 Introduction

Breast cancer is the second leading cause of cancer death among women in the United States, with over 40,000 women estimated to die of breast cancer in 2020.1 Improved survival is linked to early detection and treatment advances, with overall prognosis related to the stage of disease at the initial diagnosis.2 Mammographic screening programs have been associated with a 20%–40% relative reduction in breast cancer mortality.3–5 Due to the camouflaging effect of dense breasts, cancers may be missed during screening mammography.6 Thus new recommendations are being established for women presenting


with dense breasts at the time of screening, leading to new multimodality breast image acquisition methods for use in or as an adjunct to mammographic screening, such as full-field digital mammography (FFDM), dynamic contrast-enhanced (DCE) breast magnetic resonance imaging (MRI), digital breast tomosynthesis (DBT), and whole breast ultrasound.5,7,8 With these emerging imaging technologies comes an associated increase in the need for interpretation expertise as well as longer interpretation times, areas to which artificial intelligence (AI) could potentially contribute in the screening process. Breast imaging is advancing as a diagnostic tool, with newer protocols for staging, assessing extent of disease, and planning and monitoring treatment for breast cancer patients.9,10 All these decision-making tasks could also be amenable to incorporation of AI into the workflow. Computer-aided detection (CADe), a form of AI-based aid to interpretation, has been developed and clinically used since 1996.11–14 With advances in computers, in terms of both computing power and memory, a rapid increase in assessing the potential use of AI in various tasks in breast imaging has gone beyond the initial use in CADe to include diagnosis, prognosis, response to therapy, and risk assessment, as well as cancer discovery. AI methods are being developed for computer-aided detection (CADe) and diagnosis (CADx), for triaging (CADt), and for ultimate use as an autonomous reader, often without sufficient consideration for the effect on radiologists' perception/cognitive performance and workflow. While the prospects of AI in breast cancer image interpretation are abundant and promising, they bring along challenges and trepidations. This chapter focuses on the role of AI in breast cancer image interpretation. The organization will start with a general discussion of decision support and AI in breast imaging, including methods of implementing AI and its potential effects on human perception in radiology. Next the chapter will cover AI in the clinical tasks of detection, diagnosis, risk assessment, prognosis, treatment response and risk of recurrence, and cancer discovery.

15.2 Artificial intelligence in decision support

A medical image is a meaningless gray-scale pattern unless it is "viewed and analyzed" by an intelligent observer. That observer could be a radiologist, a computer, or a combination of a human and a computer. In the interpretation of a breast image, the observer is the breast radiologist, who specializes in detecting and diagnosing breast cancer with the use of various breast imaging modalities. The breast radiologist will make recommendations on further imaging and diagnostic procedures and will work with an interdisciplinary team of experts from oncology, pathology, radiation therapy, and surgery. Thus the role of AI could range from simply indicating regions in the breast image where cancer might be present to serving as a tool for integrating the output from multiple diagnostic exams. AI techniques being developed and published today typically address a single-modality, single-clinical-task problem, for example, finding a suspect region on the mammogram, assessing the likelihood of malignancy given a lesion on breast MRI, or predicting the risk of breast cancer. The future will most likely yield AI techniques that attempt to solve the multidisciplinary aspect of breast cancer detection and management.


It is important to note that understanding the role of AI in breast cancer imaging requires an understanding of the differences between CAD and AI, since these two terms have been used extensively in papers and reviews. CAD describes a method of how an AI algorithm might be used in clinical practice, that is, as a second reader serving as an aid to the radiologist. AI is the computer algorithm that exists in a system for various types of implementations in clinical practice, CAD or otherwise, as schematically shown in Fig. 15.1. The computer algorithm conducting AI could be, for example, one using human-engineered radiomic features combined with machine learning algorithms or deep-learning methods.15 Emerging techniques for AI in breast imaging interpretation include computer-aided detection (CADe) and diagnosis (CADx), triaging (CADt), and ultimate use as an autonomous reader.13,16–24 This chapter covers the various technical and clinical implementations of AI in breast image analysis; however, more research and development are needed to increase the performance of the algorithms, as well as to consider the effect of the AI method on radiologists' perception/cognitive performance and workflow. Note that while the prospects of AI in breast cancer image interpretation are abundant and promising, they bring along challenges and trepidations.

15.3 Artificial intelligence in breast cancer screening

In breast cancer screening the search task is one of detection, which refers to the localization of a lesion (i.e., mass lesion, clustered microcalcifications, or architectural distortion) within a breast image. The detection of cancer by radiologists is limited by the presence of structure noise (i.e., dense breast parenchyma concealing an underlying abnormality), incomplete visual search patterns, incorrect assessment of subtle or complex disease states, vast amounts of image data, suboptimal physical image quality, fatigue, and distractions.12,25 AI detection tools can be used in the localization task and serve as another reader to radiologists in their task of finding suspicious lesions within images. Attempts at AI techniques for breast images date back to the 1960s with the paper by Winsberg et al. on detection of abnormalities on mammograms using optical scanning and computer analysis.26 In the 1980s, methods for the computer-aided detection of clustered microcalcifications in digitized mammograms were being developed, along with methods for the detection of lung nodules in chest radiographs, using a method based on the difference between a signal-enhanced and signal-suppressed version of the original image.27,28

FIGURE 15.1 Schematic illustrating potential roles and uses of AI in medical image interpretation. AI, Artificial intelligence.
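The difference-image idea just described can be illustrated with a minimal sketch. The function below is only a toy version of the general approach (a signal-suppressed image subtracted from a signal-enhanced image, followed by thresholding and grouping of candidate pixels); the filter choices, threshold, and synthetic test image are assumptions for demonstration and do not reproduce the published algorithms.27,28

```python
import numpy as np
from scipy import ndimage

def difference_image_candidates(image, sigma_enhance=1.0, sigma_suppress=4.0,
                                threshold=3.0, min_area=3):
    """Toy difference-image CADe: enhance small bright signals, suppress them,
    subtract, and threshold to obtain candidate locations (e.g., microcalcifications).
    All parameter values here are illustrative."""
    img = image.astype(float)
    # Signal-enhanced image: light smoothing keeps small high-contrast spots.
    enhanced = ndimage.gaussian_filter(img, sigma=sigma_enhance)
    # Signal-suppressed image: heavier smoothing removes small spots, keeps background.
    suppressed = ndimage.gaussian_filter(img, sigma=sigma_suppress)
    diff = enhanced - suppressed
    # Threshold the difference image in units of its standard deviation.
    mask = diff > threshold * diff.std()
    # Group connected pixels into candidate detections and report centroids.
    labels, n = ndimage.label(mask)
    candidates = []
    for region in range(1, n + 1):
        ys, xs = np.nonzero(labels == region)
        if ys.size >= min_area:
            candidates.append((float(ys.mean()), float(xs.mean())))
    return candidates

# Example on synthetic data: a smooth background with a few bright specks.
rng = np.random.default_rng(0)
image = ndimage.gaussian_filter(rng.normal(100, 5, (256, 256)), 8)
for y, x in [(60, 80), (150, 200), (200, 40)]:
    image[y, x] += 40  # simulated point-like signals
print(difference_image_candidates(image))
```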


These human-engineered analytical methods were based on an understanding of the presentation of the signal (mass or clustered microcalcifications) on a digitized mammogram, along with various algorithmic methods to extract characteristics (also referred to now as "radiomics") of the suspect lesions in order to reduce false positives. While these initial CADe algorithms for breast abnormalities were developed on digitized mammograms, the methods progressed with the advent of FFDM, and the various AI methods have been reviewed over the decades.12–14 Interestingly, the use of convolutional neural networks (deep learning with a limited number of layers) for lesion detection in medical imaging was first performed for use in breast cancer screening programs; that is, in 1994 a shift-invariant neural network (i.e., a convolutional neural network) was used in the detection of microcalcifications29 (Fig. 15.2). Zhang et al. and LeCun had developed these shift-invariant image learning methods in the late 1980s in the tasks of recognizing alphabetic characters and numerals, respectively.30,31 The first clinical translation of mammographic CADe occurred in 1998, with the commercial CADe system (ImageChecker M1000 version 1.2; R2 Technology, Los Altos, California) being approved by the US Food and Drug Administration (FDA).32 The system was FDA approved for use as a second reader, meaning that the radiologist was to interpret the mammogram first, and only after a complete interpretation were they to view the CADe system output. Also, if the radiologist indicated a potential lesion on the screening mammogram but the computer output did not, the radiologist was not to eliminate the region, thus ensuring that sensitivity would not be reduced with the use of CADe. Other CADe systems were also then introduced to clinical practice, and by 2008, CADe systems were used in 70% of all screening mammography studies in hospital-based facilities, compared with 81% in private offices.33 CADe systems continued to improve by incorporating various human-engineered features and various classifiers, including artificial neural networks.13

FIGURE 15.2 Illustration of the first journal publication on the use of a CNN (i.e., a shift-invariant neural network) in medical image analysis. The CNN was used in a computer-aided detection system for digitized mammograms and later on FFDMs. CNN, Convolutional neural network. Source: Reprinted with permission from Zhang W, et al. Shift-invariant artificial neural network (SIANN) for CADe in mammography. Med Phys 1994;21:517–24.


Use of CADe systems became controversial, especially when radiologists used them (off-label) as primary readers or when the performances of the systems were not sufficiently high to aid experienced radiologists. A 2018 paper reported that CADe use was stable (over 90% usage) from 2008 to 2016 within US digital screening facilities.34 Over the past decades, AI advances for breast cancer screening progressed given the explosive improvement in computer processing and memory, as well as advances in deep-learning algorithms.29,35 These deep-learning developments have been summarized in many review papers as well as in earlier chapters of this book.16,36 Mirroring the advances in computing power was the increase in the number of curated datasets of FFDMs, allowing for deep learning with training from scratch as well as transfer learning with feature extraction or fine-tuning. Along with increases in algorithmic complexity, databases, computer power, and memory, the method of use of AI by radiologists varied to include, beyond CADe, AI as a concurrent reader, AI for triage, and AI as an autonomous reader. A few of these methods will be briefly discussed here. Given the additional image data presented to the radiologist with 3D imaging acquisitions, such as tomosynthesis and MRI, the potential role of AI includes improving both the effectiveness and the efficiency of the image interpretation task. With the adoption of DBT in screening programs came the development of CADe methods for tomosynthesis images, first as a second reader and more recently as a concurrent reader.24 Conant et al. reported on a reader study to evaluate the use of AI in reducing DBT interpretation time while maintaining or improving accuracy.21 They evaluated a deep-learning AI system, which was developed to detect suspicious soft-tissue and calcified lesions in DBT images. The multireader, multicase (MRMC) study statistically evaluated the performance of 24 radiologists reading 260 DBT examinations with and without AI. They found that the concurrent use of AI did improve cancer detection efficacy via a demonstrated increase in the area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, and specificity, as well as a reduction in recall rate and reading time.21 Current recommendations note that patients with dense breasts would benefit from additional screening, including whole breast ultrasound and/or MRI.37–39 The additional image data motivated the development of AI for detection of lesions on these 3D imaging modalities, with multiple systems reaching clinical use through FDA approvals. These systems include combinations of machine learning algorithms, including both human-engineered approaches and deep-learning methods. An AI system for finding lesions on 3D breast ultrasound showed, via an MRMC reader study, that the concurrent use of the AI tool enabled a reduction in interpretation time while maintaining diagnostic accuracy.20 While not exactly AI, computer-based image enhancement systems have been developed to aid radiologists in interpreting breast MRI, especially DCE-MRI. These systems have the potential to reduce reading time by highlighting lesions of suspicion in screening MRI.40 Such image-processing systems incorporate lesion features from morphology and kinetics.41 As MRI protocols advance for use as an adjunct to mammographic screening, more and more AI detection techniques for breast MRI will be developed.
One such CADe system uses deep learning on early-phase MRIs as opposed to late-phase temporal information.42
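As a toy illustration of the kind of endpoint reported in such reader studies, the snippet below compares a reader's area under the ROC curve with and without AI assistance on a handful of hypothetical cases. The scores and labels are invented for demonstration only; a real MRMC analysis would additionally model reader and case variability, which this simple comparison does not.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical reader-study scores: a radiologist assigns a suspicion score to
# every case, once unaided and once with concurrent AI; labels are biopsy-proven.
labels = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
scores_without_ai = np.array([0.2, 0.4, 0.5, 0.1, 0.7, 0.6, 0.3, 0.4, 0.2, 0.9])
scores_with_ai = np.array([0.1, 0.3, 0.7, 0.1, 0.8, 0.7, 0.2, 0.6, 0.3, 0.9])

auc_without = roc_auc_score(labels, scores_without_ai)
auc_with = roc_auc_score(labels, scores_with_ai)
print(f"AUC without AI: {auc_without:.2f}, with AI: {auc_with:.2f}")
```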


FIGURE 15.3 Diagram illustrating the experimental setup for triage analysis (CADt). In the standard scenario, radiologists read all mammograms. In CADt (or rule-out), radiologists only read mammograms above the model's cancer-free threshold. Source: Reprinted with permission from Yala A, et al. A deep learning model to triage screening mammograms: a simulation study. Radiology 2019;293:38–46.

Researchers are currently investigating the use of AI for triaging (CADt) so that, in a screening program, a certain percentage of FFDMs would be deemed negative by the computer and patients would be told to return at the regular screening interval without having a radiologist read their mammograms (Fig. 15.3). In a simulation study, a deep-learning approach was used to triage 20% of screening mammograms.22 In the simulation, the results showed improvement in radiologist efficiency and specificity without harming sensitivity. Ultimately, algorithms will be developed for use of AI as an autonomous reader, that is, with the computer making the screening detection decision without input from a radiologist. In such a situation, it is the standalone performance of the computer as compared to a radiologist that is evaluated. In a recent study, an AI system was trained on datasets from the United Kingdom and the United States, was shown to generalize from the United Kingdom to the United States, and outperformed all the human readers in terms of AUC from ROC analysis.23 It is important to note that, just as for the earlier CADe systems, introduction of the system into the clinical workflow and evaluation of performance in the real world need to be demonstrated.
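A minimal sketch of the rule-out logic illustrated in Fig. 15.3 is given below: a cancer-free score threshold is chosen on a validation set so that a target fraction of exams would be triaged, and the consequences of that policy are then tallied on held-out exams. All scores, labels, and the 20% triage fraction are hypothetical; this is not the published simulation pipeline.22

```python
import numpy as np

def choose_triage_threshold(val_scores, triage_fraction=0.2):
    """Pick a score threshold so that `triage_fraction` of validation exams fall
    below it and would be ruled out as cancer-free (an assumed policy)."""
    return np.quantile(val_scores, triage_fraction)

def simulate_triage(scores, labels, threshold):
    """Report what the rule-out policy would do: how many exams are triaged and
    how many cancers would fall below the cancer-free threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    ruled_out = scores < threshold
    return {
        "fraction_ruled_out": float(ruled_out.mean()),
        "cancers_ruled_out": int(np.sum(labels[ruled_out] == 1)),
        "sensitivity_overall": float(np.sum((labels == 1) & ~ruled_out)
                                     / max(np.sum(labels == 1), 1)),
    }

# Hypothetical example: model scores for 1000 screening exams, ~1% cancer prevalence,
# where cancers tend to receive higher scores than non-cancers.
rng = np.random.default_rng(42)
labels = (rng.random(1000) < 0.01).astype(int)
scores = np.clip(rng.normal(0.2 + 0.5 * labels, 0.15), 0, 1)
thr = choose_triage_threshold(scores[:500], triage_fraction=0.2)
print(simulate_triage(scores[500:], labels[500:], thr))
```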

15.4 Artificial intelligence in breast cancer risk assessment: density and parenchymal pattern

Breast cancer risk assessment has been one of the goals of breast image analysis, especially given that breast density is a strong breast cancer risk indicator.43,44 In addition, dense tissue within a mammogram can act to camouflage the presence of an underlying lesion, that is, a "masking" effect that leads to a reduction in detection sensitivity during mammography screening. Computerized risk assessment methods can potentially serve to estimate a woman's lifetime risk of breast cancer and yield risk-stratified screening protocols and preventive therapies to reduce overall risk, as opposed to current "one-size-fits-all" screening programs. Risk models include risk factors,


including demographics, personal history, family history, hormonal status and hormonal therapy, as well as image-based characteristics such as density and parenchymal pattern. Characterization of dense tissue is via (1) breast density, which is the amount of fibroglandular tissue relative to the amount of total breast tissue (both fibroglandular and fatty breast tissue), and (2) the parenchymal pattern, demonstrating the spatial distribution of dense tissue. Note that some AI methods have been trained to output the estimated density or pattern, and thus indirectly assess risk, whereas others have been trained as a component in a risk model to output a score more directly related to breast cancer risk.45 Breast density has been assessed manually for decades, which led to the four-category Breast Imaging Reporting and Data System (BI-RADS) density ratings of the American College of Radiology.46 Computerized methods for assessing breast density have involved the calculation of the skewness of the gray-level histograms of FFDMs as well as methods to estimate volumetric density from the 2D projections of FFDMs.13,47,48 Automated assessment of density in screening mammograms is now routinely conducted using FDA-cleared clinical systems.49 In quantitatively assessing the parenchymal pattern of the breast, various texture-based calculations have been investigated to characterize the spatial distribution (i.e., variability) of gray levels in an FFDM.13 Such radiomic texture analyses have been trained using datasets of BRCA1/BRCA2 gene mutation carriers or women with a contralateral cancer, representing "high-risk" groups, and datasets from routine screening populations, representing "low or average risk" groups. Results have shown that women at high risk of future breast cancer have dense breasts with parenchymal patterns that are coarse and low in contrast.50–52 Others have also conducted texture analysis on breast tomosynthesis images for risk assessment.53 Various deep-learning methods for assessing breast density as well as the parenchymal patterns have been reported. With deep learning, one seeks to determine if additional information is contained in FFDMs beyond density and texture analysis. It has been demonstrated that deep learning performed statistically better than feature-based methods in assessing breast density on FFDMs.54 Other studies compared and merged radiomic texture analysis and deep-learning approaches in characterizing the parenchymal patterns on FFDMs, demonstrating that the combination of both human-engineered and deep-learning approaches yields improved results55 (Fig. 15.4). Beyond FFDM, investigators have assessed the amount of fibroglandular tissue seen on MR images and the amount of breast parenchymal enhancement on dynamic contrast-enhanced MRI in assessing risk of future breast cancer, as well as their roles in prognosis and risk of recurrence.56–58 Investigators have implemented a deep-learning U-net to quantitatively assess breast density59 (Fig. 15.5).
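As an illustration of one of the classic hand-crafted density-related measures mentioned above, the sketch below computes the skewness of the gray-level histogram over the breast area of a mammogram. The background threshold, the sign convention for what counts as "dense," and the synthetic test images are assumptions for demonstration only.

```python
import numpy as np
from scipy import stats

def breast_area_skewness(mammogram, background_threshold=10):
    """Skewness of the gray-level histogram within the breast area of an FFDM.
    Pixels at or below `background_threshold` are treated as air/background;
    the threshold and sign convention here are illustrative only."""
    pixels = mammogram[mammogram > background_threshold].ravel().astype(float)
    # A histogram skewed toward low gray levels (mostly fatty tissue) vs. high
    # gray levels (mostly dense tissue) yields skewness of opposite sign.
    return float(stats.skew(pixels))

# Example with a synthetic "mostly fatty" and "mostly dense" image.
rng = np.random.default_rng(1)
fatty = rng.gamma(shape=2.0, scale=20.0, size=(512, 512))        # right-skewed histogram
dense = 255 - rng.gamma(shape=2.0, scale=20.0, size=(512, 512))  # left-skewed histogram
print(breast_area_skewness(fatty), breast_area_skewness(dense))
```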

15.5 Artificial intelligence in breast cancer diagnosis and prognosis

Diagnosis and/or prognosis occurs during the workup of a breast lesion after it has been detected by either screening mammography or other means, such as a physical breast exam. This characterization of the lesion is thus a classification task and not a localization task (as was the case during screening). During screening, radiologists will give a detected suspect lesion a BI-RADS rating indicating whether it is highly likely to be normal (BI-RADS = 1), highly likely to be benign (BI-RADS = 2), or of uncertain status and thus requiring workup (BI-RADS = 0).46


FIGURE 15.4 Schematic of methods for the classification of ROIs using human-engineered texture analysis and deep convolutional neural network methods. ROI, Region of interest. Source: Reprinted with permission from Li H, et al. Deep learning in breast cancer risk assessment: evaluation of convolutional neural networks on a clinical dataset of full-field digital mammograms. J Med Imaging 2017;4.

With diagnosis, the goal of workup is to further assess the likelihood that the lesion is cancerous and determine whether or not the patient should proceed to biopsy for pathologic confirmation. Often multiple imaging modalities, including additional mammography, breast ultrasound,60 or breast MRI,8 are employed in order to better characterize the suspect lesion. Once the diagnosis of cancer is known, additional imaging of the tumor is conducted with the goal of assessing extent of disease to help determine patient management, and thus AI has a role in integrated diagnostics. When AI is used to aid the radiologist in assessing the likelihood of malignancy or extent of disease, it is referred to as computer-aided diagnosis (CADx). Input to an AI algorithm for assessing diagnosis or prognosis is the image of the lesion and its surrounding parenchyma, that is, a region of interest about the lesion initially indicated by either a radiologist or the output of a detection algorithm. AI algorithms for the assessment of the likelihood of malignancy are typically based on various machine learning methods, using either human-engineered features (radiomics) or deep-learning methods.13,36,61 Given that the output of the AI algorithm is to include the likelihood that the lesion is malignant or benign, the AI algorithm will be trained using biopsy "truth" from pathology to confirm whether a lesion is cancerous or not. AI methods for diagnosis have been and are being developed for various implementations, as demonstrated in Fig. 15.1. While AI diagnostic systems output a tumor signature (a metric) related to the likelihood of malignancy, the systems can also output radiomic features to characterize the lesion in question.13,16


FIGURE 15.5 Multiple uses of deep U-Nets to segment fibroglandular tissue on breast MRI. MRI, Magnetic resonance imaging. Source: Reprinted with permission from Dalmis M, et al. Using deep learning to segment breast and fibroglandular tissue in MRI volumes. Med Phys 2017;44:533–46.

Over the decades, investigators have developed CADx methods that included automatic lesion segmentation, feature extraction, and the merging of features into a tumor signature.12,62–76 Note that the various computer-extracted radiomic features will depend on the imaging modality. For example, spiculation might be extracted from mammographic images of lesions with high spatial resolution, while kinetic-based features, such as texture of enhancement patterns, can be extracted from DCE-MRI of the breast.64,66,77,78 However, some features, such as lesion size, shape, and morphology, can be extracted across modalities.42,69,70 While performances of the AI algorithms are evaluated in the task of distinguishing between malignant and benign lesions, it is also necessary to evaluate the performance of the radiologists when they use the AI output as an aid in assessing diagnosis; thus CADx methods have also been evaluated in reader studies, demonstrating improvement in radiologist performance in the task of classifying between malignant and benign tumors.17–19,79 These human-engineered radiomic approaches have now been augmented or substituted with deep-learning approaches.15,36,61,80,81 Many more contributions are summarized in Table 3 of Bi et al.16
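A minimal sketch of the segmentation-features-signature pipeline described above is shown below: a few illustrative human-engineered features are computed from a segmented lesion and merged into a single probability-of-malignancy output with a logistic regression classifier trained against biopsy truth. The feature definitions, classifier choice, and random stand-in data are assumptions for illustration, not the specific published CADx methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lesion_radiomic_features(roi, mask):
    """A few illustrative human-engineered features from a segmented lesion:
    size, a simple circularity (shape) measure, and gray-level statistics.
    `roi` is a 2D gray-level patch; `mask` is the binary lesion segmentation."""
    mask = mask.astype(bool)
    area = float(mask.sum())
    # Perimeter surrogate: lesion pixels with at least one 4-neighbor outside the lesion.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = float((mask & ~interior).sum())
    circularity = 4.0 * np.pi * area / max(perimeter ** 2, 1.0)
    inside = roi[mask].astype(float)
    return np.array([area, circularity, inside.mean(), inside.std()])

# Hypothetical training data: one feature vector per biopsy-proven lesion,
# with pathology as the reference truth (0 = benign, 1 = malignant).
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))  # stand-in for stacked feature vectors
y = (X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)
signature_model = LogisticRegression().fit(X, y)
# The merged "tumor signature" is the predicted probability of malignancy.
print(signature_model.predict_proba(X[:3])[:, 1])
```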


Deep learning with convolutional neural networks has been shown to be successful in characterizing lesions in the diagnostic workup of breast tumors in mammography, tomosynthesis, ultrasound, and MRI.61,80–85 In the task of distinguishing between malignant and benign breast lesions, transfer learning with a pretrained convolutional neural network (CNN), with either feature extraction or fine-tuning, has been employed. Transfer learning, as opposed to training a deep network from scratch, has allowed promising results with a limited number of cases, often involving only around 500–1000 cases. However, selection of the layers from which to extract features, or of the layers to freeze in fine-tuning, requires careful investigation, as demonstrated in Fig. 15.6.61,80 Combining analyses based on human-engineered features and those involving deep networks has also shown promise, as shown in Fig. 15.7, which demonstrates "lesion signatures" from FFDMs as predicted by a human-engineered radiomic classifier and a deep-learning classifier.61 It is expected that in the future, a collection of human-engineered approaches and deep networks will yield higher performance in the classification task.15,16,36 Due to limited datasets, it is potentially useful to either preprocess the inputted breast images, select specific image regions for input, or input multiple registered images of the same lesion. For example, using maximum intensity projections of DCE-MRI images instead of early-phase unsubtracted or subtraction images showed improved deep-learning computer performance, most likely because of the 4D information.84 The first clinical translation of breast CADx occurred in 2017, with the commercial breast MRI CADx system (QuantX from Quantitative Insights, Chicago, IL; now Qlarity Imaging) being cleared by the FDA.86 Others have followed for various breast imaging modalities for use as secondary or concurrent readers. During the workup of a suspect breast lesion, it may be beneficial to relate the extracted imaging-based characteristics (i.e., phenotypes) to clinical, histopathology, or genomic data. Workup may involve further decision-making on the likelihood that the lesion is cancerous or not, or may involve, if it is already known that the lesion is cancerous, prognostic assessment, which might inform treatment options. Prognosis is related to tumor grade, tumor extent, molecular subtype, and other histopathology information. Besides AI to assess the likelihood of malignancy, investigators are interested in relating image-based features to prognostic features, such as molecular subtypes.87–90 A multiinstitutional National Cancer Institute TCGA Breast Phenotyping Group research study of data from the TCGA and the TCIA investigated mappings from AI-extracted MRI lesion features to clinical, molecular, and genomic markers.91–95 Statistically significant associations have been observed between DCE-MRI enhancement texture features and molecular subtypes (such as luminal A, luminal B, HER2-enriched, and basal-like).
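The transfer-learning-with-feature-extraction strategy can be sketched as follows, loosely mirroring the setup in Fig. 15.6: a CNN pretrained on natural images is used as a fixed feature extractor at an early fully connected layer, and the extracted features feed a support vector machine. The snippet assumes a recent PyTorch/torchvision (with downloadable ImageNet weights) and scikit-learn; the ROIs, labels, and choice of layer are illustrative only, not the published pipeline.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Pretrained AlexNet used purely as a fixed feature extractor (no fine-tuning).
backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
backbone.eval()
fc6_extractor = torch.nn.Sequential(
    backbone.features, backbone.avgpool, torch.nn.Flatten(),
    # Keep layers up to and including the first fully connected layer ("fc6") and its ReLU.
    *list(backbone.classifier.children())[:3],
)

preprocess = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224), antialias=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_fc6_features(gray_rois):
    """gray_rois: iterable of 2D uint8 lesion ROIs; each is replicated to 3 channels."""
    feats = []
    with torch.no_grad():
        for roi in gray_rois:
            x = preprocess(np.stack([roi] * 3, axis=-1))  # HWC uint8 -> normalized CHW tensor
            feats.append(fc6_extractor(x.unsqueeze(0)).squeeze(0).numpy())
    return np.array(feats)

# Hypothetical use: lesion ROIs with biopsy-proven labels (0 = benign, 1 = malignant).
rois = [np.random.randint(0, 255, (64, 64), dtype=np.uint8) for _ in range(20)]
labels = np.arange(20) % 2
X = extract_fc6_features(rois)
svm = SVC(kernel="linear")
print(cross_val_score(svm, X, labels, cv=5).mean())  # cross-validated accuracy on toy data
```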

15.6 Artificial intelligence for treatment response, risk of recurrence, and cancer discovery

Breast imaging modalities are being investigated to assess tumor response to neoadjuvant therapy and risk of recurrence in the treatment of breast cancer patients.


FIGURE 15.6 (A) Illustration of CNN layers at which features can be extracted during transfer learning (AlexNet). Right-most column: number of features for a given image that is used as input to a classifier. These features were extracted from the outputs of each layer, which were combined and flattened (center column) from their original image outputs (left column) for input to a support vector machine classifier. (B) Classification performance in the task of classification of mammographic lesions as benign or cancer, for classifiers based on features from each layer of AlexNet. In this example the fully connected layer 6 ("Fc6" in the figure) was selected as the optimal layer for feature extraction, due to its high AUC performance and reduced computational cost. AUC, Area under the curve. CNN, Convolutional neural network. Source: Reprinted with permission from Huynh BQ, Li H, Giger ML. Digital mammographic tumor classification using transfer learning from deep convolutional neural networks. J Med Imaging 2016;3(3):034501.


FIGURE 15.7 A diagonal classifier agreement plot between the CNN-based classifier and a human-engineered radiomic-based CADx classifier for FFDM. The x-axis denotes the output from the CNN-based classifier, and the y-axis denotes the output from the conventional CADx classifier. Each point represents an ROI for which predictions were made. Points near or along the diagonal from bottom left to top right indicate high classifier agreement; points far from the diagonal indicate low agreement. ROI pictures of extreme examples of agreement/disagreement are included. ROI, Region of interest. CNN, Convolutional neural network. Source: Reprinted with permission from Antropova N, Huynh BQ, Giger ML. A deep feature fusion methodology for breast cancer diagnosis demonstrated on three imaging modality datasets. Med Phys 2017;44(10):5162–71.

Breast MRI is the most accurate imaging modality for predicting tumor response. Human-engineered radiomic features have been shown to predict recurrence-free survival based on MRI before or just after the first treatment round. One study, using a fuzzy c-means (FCM) clustering method to extract the most enhancing voxels within a tumor, yielded an automatic method for quantitatively extracting the most-enhancing tumor volume, with similar performance compared to a semiautomated method in predicting recurrence-free survival of cases within the I-SPY1 trial.64,96,97


The FCM method was initially developed for extracting the most enhancing voxels within a breast tumor on DCE-MRI in order to analyze the kinetics of only those most enhancing voxels for computer-aided diagnosis.64 However, the volume, that is, the size of those most enhancing voxels within the tumor, has been found useful in assessing recurrence-free survival.97 This demonstrates the importance of understanding image-based tumor phenotypes so that they can be repurposed for different decision-making tasks. Also, from the TCGA Breast Phenotyping Group, breast MRIs were quantitatively mapped to research versions of the gene assays MammaPrint, Oncotype DX, and PAM50, demonstrating the potential of MRI-based biomarkers to predict risk of recurrence, serving as a radiomic assay along with a gene assay.93 Tumor enhancement pattern and tumor size, which were extracted using human-engineered radiomics, correlated with a gene assay recurrence score, showing that cancers with greater angiogenesis are associated with an increased risk of recurrence. Others have employed deep CNNs to predict pathologic complete response using the I-SPY1 database. Their analysis yielded probability heatmaps indicating regions within the tumors most strongly associated with response98 (Fig. 15.8). Use of AI (with or without deep learning) for characterizing response effectively yields a "virtual biopsy," which can be conducted when an actual biopsy is not practical, such as when assessing multiple rounds of therapy, that is, assessing and tracking a tumor serially over time. Such image-based signatures might yield precise diagnostic and therapeutic plans specific to individual patients. As noted, use of "virtual biopsies" can quantitatively glean information from breast images when an actual invasive biopsy is not practical. Through discovery, these quantitative image-based biomarkers can be used to further understand relationships with genomics. Fig. 15.9 illustrates statistical relationships between various radiomic phenotypes and genomic features.95 Once such relationships are known, if these imaging-based signatures can be observed during screening or routine monitoring post therapy, there is the potential for improved early recognition of disease, allowing for earlier intervention for the patient.
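A minimal sketch of the FCM-based extraction of the most enhancing tumor voxels is given below: per-voxel enhancement between pre- and postcontrast images is clustered into two fuzzy classes, and the volume of the higher-enhancement class is reported. The tiny FCM implementation, the two-class setup, and the toy images are assumptions for illustration and do not reproduce the published method.64,97

```python
import numpy as np

def fuzzy_cmeans_1d(values, n_clusters=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means on a 1D feature (here: per-voxel enhancement).
    Returns cluster centers and the membership matrix (n_samples x n_clusters)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    u = rng.random((x.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    centers = np.zeros(n_clusters)
    for _ in range(n_iter):
        w = u ** m
        centers = (w * x).sum(axis=0) / w.sum(axis=0)       # weighted cluster means
        dist = np.abs(x - centers) + 1e-12                   # (n_samples, n_clusters)
        new_u = 1.0 / (dist ** (2.0 / (m - 1.0)))            # standard FCM membership update
        new_u /= new_u.sum(axis=1, keepdims=True)
        if np.max(np.abs(new_u - u)) < tol:
            u = new_u
            break
        u = new_u
    return centers, u

def most_enhancing_volume(pre, post, tumor_mask, voxel_volume_mm3=1.0):
    """Volume of the 'most enhancing' voxels within a tumor mask, found by
    2-class FCM on percent enhancement between pre- and postcontrast images."""
    mask = tumor_mask.astype(bool)
    enhancement = (post[mask] - pre[mask]) / np.maximum(pre[mask], 1e-6)
    centers, u = fuzzy_cmeans_1d(enhancement)
    high = int(np.argmax(centers))          # cluster with the larger enhancement center
    most_enhancing = u[:, high] > 0.5       # voxels predominantly in that cluster
    return most_enhancing.sum() * voxel_volume_mm3

# Toy example: a 'tumor' whose core enhances strongly and whose rim enhances weakly.
pre = np.full((20, 20, 20), 100.0)
post = pre.copy()
mask = np.zeros_like(pre, dtype=bool)
mask[5:15, 5:15, 5:15] = True
post[mask] += 20.0               # mild enhancement everywhere in the tumor
post[8:12, 8:12, 8:12] += 80.0   # strongly enhancing core
print(most_enhancing_volume(pre, post, mask))
```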

FIGURE 15.8 (A) DCE-MRI images from the postcontrast phase were normalized, and 65 × 65 patches were selected randomly from the tumor mask. (B) Network architecture consisted of convolution layers with batch normalization and ReLU activation followed by one fully connected layer. Softmax activation was used to determine class membership, and dropout regularization was used. (C) The trained model was tested on 33 held-out patients. Localized patch probabilities were used to generate a heatmap of likelihood of response to treatment. DCE, Dynamic contrast-enhanced; MRI, magnetic resonance imaging. Source: Reprinted with permission from Ravichandran K, et al. A deep learning classifier for prediction of pathological complete response to neoadjuvant chemotherapy from baseline breast DCE-MRI. In: SPIE medical imaging. Houston, TX: SPIE; 2018.
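The patch-based response-prediction scheme summarized in Fig. 15.8 can be sketched as follows: a small CNN classifies patches drawn from within the tumor mask, and the per-patch probabilities are accumulated into a heatmap. The architecture details, patch handling, and untrained toy example below are assumptions for illustration, not the published model.98

```python
import numpy as np
import torch
import torch.nn as nn

# Tiny patch classifier loosely following the description in Fig. 15.8:
# convolution + batch norm + ReLU blocks, then one fully connected layer.
class PatchResponseNet(nn.Module):
    def __init__(self, patch=65):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (patch // 4) * (patch // 4), 2)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

def response_heatmap(model, image, mask, patch=65, stride=8):
    """Slide over locations inside the tumor mask, classify each patch, and
    accumulate the predicted probability of response into a heatmap."""
    model.eval()
    half = patch // 2
    heat = torch.zeros_like(torch.as_tensor(image, dtype=torch.float32))
    with torch.no_grad():
        for y in range(half, image.shape[0] - half, stride):
            for x in range(half, image.shape[1] - half, stride):
                if not mask[y, x]:
                    continue
                p = torch.as_tensor(image[y - half:y + half + 1, x - half:x + half + 1],
                                    dtype=torch.float32)[None, None]
                prob = torch.softmax(model(p), dim=1)[0, 1]
                heat[y - half:y + half + 1, x - half:x + half + 1] += prob
    return heat

# Untrained toy run on a synthetic slice, just to show the shapes involved.
model = PatchResponseNet()
img = np.random.rand(200, 200).astype(np.float32)
msk = np.zeros((200, 200), dtype=bool)
msk[80:140, 80:140] = True
print(response_heatmap(model, img, msk).shape)
```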


FIGURE 15.9 (A) Statistically significant associations of various genomic features and radiomic features. Each line is an identified statistically significant association. Genomic features without statistically significant associations are not shown. Genomic features are organized into circles by data platform and indicated by different node colors. Radiomic phenotypes are divided into six categories, also indicated by different node colors. The node size is proportional to its connectivity relative to other nodes in the category. Associations are deemed statistically significant if the adjusted P-value ≤ .05. The only exception is for the associations involving somatically mutated genes, for which the statistical significance criteria are (1) P-value ≤ .05 and (2) the gene is mutated in at least five patients. (B) A table showing the numbers of statistically significant associations between genomic features of different platforms and radiomic phenotypes of different categories. Source: Reprinted with permission from Zhu Y, et al. Deciphering genomic underpinnings of quantitative MRI-based radiomic phenotypes of invasive breast carcinoma. Sci Rep 2015;42(6):3603.


15.7 Conclusion and discussion

Quantitative analysis of breast imaging through AI allows the field of breast cancer to advance from a "one-size-fits-all" regimen to precision medicine using the characteristics of the tumor and parenchyma specific to each patient.


Eventually, imaging will yield information relevant to clinical, histopathology, and genomic patient data. However, it is important to note that the role of AI in breast imaging interpretation is evolving not to replace radiologists but rather to aid them with new effective and efficient AI methods. Although AI for understanding breast cancer images has been around for decades, it continues to improve as larger and better curated datasets are collected and more complex algorithms are developed.

References

1. Siegel R, Miller K, Jemal A. Cancer statistics, 2020. Cancer J Clin 2020;70:7–30.
2. Howlader N, et al. SEER cancer statistics review, 1975–2008. Available from: http://seer.cancer.gov/csr/1975_2008/.
3. Tabar L, et al. Mammography service screening and mortality in breast cancer patients: 20-year follow-up before and after introduction of screening. Lancet 2003;361:1405–10.
4. Feig S. Cost-effectiveness of mammography, MRI, and ultrasonography for breast cancer screening. Radiol Clin North Am 2010;48:879–91.
5. Niell B, et al. Screening for breast cancer. Radiol Clin North Am 2017;55:1145–62.
6. Nelson H, et al. Factors associated with rates of false-positive and false-negative results from digital mammography screening: an analysis of registry data. Ann Intern Med 2016;164:226–35.
7. Marinovich M, et al. Breast cancer screening using tomosynthesis or mammography: a meta-analysis of cancer detection and recall. J Natl Cancer Inst 2018;110:942–9.
8. Mann R, Cho N, Moy L. Breast MRI: state of the art. Radiology 2019;292(3):520–36.
9. Marino M, et al. Multiparametric MRI of the breast: a review. J Magn Res Imag 2018;47:301–15.
10. Guo R, et al. Ultrasound imaging technologies for breast cancer detection and management: a review. Ultrasound Med Biol 2018;44:37–70.
11. Giger M. Computerized image analysis in breast cancer detection and diagnosis. In: Seminars in breast disease. WB Saunders; 2002.
12. Giger ML, Chan HP, Boone J. Anniversary paper: history and status of CAD and quantitative image analysis: the role of Medical Physics and AAPM. Med Phys 2008;35(12):5799–820.
13. Giger ML, Karssemeijer N, Schnabel JA. Breast image analysis for risk assessment, detection, diagnosis, and treatment of cancer. Annu Rev Biomed Eng 2013;15:327–57.
14. Vyborny CJ, Giger ML. Computer vision and artificial intelligence in mammography. AJR Am J Roentgenol 1994;162(3):699–708.
15. Whitney H, et al. Comparison of breast MRI tumor classification using human-engineered radiomics, transfer learning from deep convolutional neural networks, and fusion methods. Proc IEEE 2019;108:163–77.
16. Bi W, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA: A Cancer J Clin 2019;69(2):127–57.
17. Huo Z, et al. Breast cancer: effectiveness of computer-aided diagnosis—observer study with independent database of mammograms. Radiology 2002;224(2):560–8.
18. Horsch K, et al. Classification of breast lesions with multimodality computer-aided diagnosis: observer study results on an independent clinical data set. Radiology 2006;240(2):357–68.
19. Shimauchi A, et al. Evaluation of clinical breast MR imaging performed with prototype computer-aided diagnosis breast MR imaging workstation: reader study. Radiology 2011;258(3):696–704.
20. Jiang Y, et al. Interpretation time using a concurrent-read computer-aided detection system for automated breast ultrasound in breast cancer screening of women with dense breast tissue. AJR Am J Roentgenol 2018;211(2):1–10.
21. Conant E, et al. Improving accuracy and efficiency with concurrent use of artificial intelligence for digital breast tomosynthesis. Radiology: AI 2019;1.
22. Yala A, et al. A deep learning model to triage screening mammograms: a simulation study. Radiology 2019;293:38–46.

III. Clinical applications

306

15. Artificial intelligence and interpretations in breast cancer imaging

23. McKinney S, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577:89 94. 24. Geras K, Mann R, Moy L. Artificial intelligence for mammography and digital breast tomosynthesis: current concepts and future perspectives. Radiology 2019;293:246 59. 25. Giger M. Machine learning in medical imaging. J Am Coll Radiol 2018;15. 26. Winsberg F, et al. Detection of radiographic abnormalities in mammograms by means of optical scanning and computer analysis. Radiology 1967;89:211 15. 27. Chan H-P, et al. Image feature analysis and computer-aided diagnosis in digital radiography. 1. Automated detection of microcalcifications in mammography. Med Phys 1987;14:538 48. 28. Giger M, Doi K, MacMahon H. Image feature analysis and computer-aided diagnoses in digital radiography. 3. Automated detection of nodules in peripheral lung fields. Med Phys 1987;15:158 66. 29. Zhang W, et al. Shift-invariant artificial neural network (SIANN) for CADe in mammography. Med Phys 1994;21:517 24. 30. Zhang W, et al. Shift-invariant pattern recognition neural network and its optical architecture. In: Annual meeting of the Japanese Society of Applied Physics. 1988. 31. LeCun Y. A theoretical framework for back-propagation. In: Connectionist Models Summer School. Pittsburgh, PA: 1988. 32. Warren-Burhenne L, et al. Potential contribution of computer-aided detection to the sensitivity of screening mammography. Radiology 2000;215:554 62. 33. Rao V, et al. How widely is computer-aided detection used in screening and diagnostic mammography? J Am Coll Radiol 2010;7:802 5. 34. Keen J, Keen J, Keen J. Utilization of computer-aided detection for digital screening mammography in the United States, 2008 to 2016. J Am Coll Radiol 2018;15:44 8. 35. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436 44. 36. Sahiner B, et al. Deep learning in medical imaging and radiation therapy. Med Phys 2019;46. 37. Melnikow J, et al. Supplemental screening for breast cancer in women with dense breasts: a systematic review for the US Preventive Services Task Force. Ann Intern Med 2016;164:268 78. 38. Kuhl C. Abbreviated breast MRI for screening women with dense breast: the EA1141 trial. Br J Radiol 2018;90:20170441. 39. Giger M, et al. Automated breast ultrasound in breast cancer screening of women with dense breasts: reader study on mammogram-negative and mammogram-positive cancers. AJR Am J Roentgenol 2016;206(6):1341 50. 40. Gubern-Merida A, et al. Automated detection of breast cancer in false-negative screening MRI studies from women at increased risk. Eur J Radiol 2016;85:472 9. 41. Chang Y-C, et al. Computerized breast lesion detection using kinetic and morphologic analysis for dynamic contrast-enhanced MRI. Magn Reson Imaging 2014;32:514 22. 42. Dalmis M, et al. Fully automated detection of breast cancer in screening MRI using convolutional neural networks. J Med Imaging 2018;5(1):014502. 43. Saftlas A, et al. Mammographic densities and risk of breast cancer. Cancer 1991;67:2833 8. 44. Freer P. Mammographic breast density: impact on breast cancer risk and implications for screening. Radiographics 2015;35:302 15. 45. Dembrower K, et al. Comparison of a deep learning risk score and standard mammographic density score for breast cancer risk prediction. Radiology 2020;294:265 72. 46. ACR. ACR BI-RADS Atlas 5th. Reston, VA: American College of Radiology; 2013. 47. Byng J, et al. Automated analysis of mammographic densities and breast carcinoma risk. Cancer 1997;80:66 74. 48. van England S, et al. 
Volumetric breast density estimate from full-field digital mammograms. IEEE Trans Med Imaging 2006;25:273 82. 49. Lee H, Sohn Y, Han K. Comparison of mammographic density estimation by Volpara software with radiologists’ visual assessment: analysis of clinical-radiologic factors affecting discrepancy between them. Acta Radiol 2015;56:1061 8. 50. Huo Z, et al. Computerized analysis of digitized mammograms of BRCA1/BRCA2 gene mutation carriers. Radiology 2002;225:519 26. 51. Li H, et al. Fractal analysis of mammographic parenchymal patterns in breast cancer risk assessment. Acad Radiol 2007;14:513 21.

III. Clinical applications

References

307

52. Li H, et al. Power spectral analysis of mammographic parenchymal patterns of digitized mammograms. J Digital Imaging 2008;21:145 52. 53. Kontos D, et al. Parenchymal texture analysis in digital breast tomosynthesis for breast cancer risk estimation: a preliminary study. Acad Radiol 2009;16:283 98. 54. Li S, et al. Computer-aided assessment of breast density: comparison of supervised deep learning and feature-based statistical learning. Phys Med Biol 2018;63. 55. Li H, et al. Deep learning in breast cancer risk assessment: evaluation of convolutional neural networks on a clinical dataset of full-field digital mammograms. J Med Imaging 2017;4. 56. King V, et al. Background parenchymal enhancement at breast MR imaging and breast cancer risk. Radiology 2011;260:50 60. 57. Wu S, et al. Quantitative assessment of background parenchymal enhancement in breast MRI predicts response to risk-reducing salpingo-oophorectomy: preliminary evaluation in a cohort of BRCA1/2 mutation carriers. Breast Cancer Res 2015;17. 58. Li, H., et al. Computerized breast parenchymal analysis on DCE-MRI. in SPIE Medical Imaging. San Diego, CA: SPIE; 2009. 59. Dalmis M, et al. Using deep learning to segment breast and fibroglandular tissue in MRI volumes. Med Phys 2017;44:533 46. 60. Hooley R, Scoutt L, Philpotts L. Breast ultrasonography. State of the art. Radiology 2013;268:642 59. 61. Antropova N, Huynh BQ, Giger ML. A deep feature fusion methodology for breast cancer diagnosis demonstrated on three imaging modality datasets. Med Phys 2017;44(10):5162 71. 62. Huo Z, Giger ML, Vyborny CJ. Computerized analysis of multiple-mammographic views: potential usefulness of special view mammograms in computer-aided diagnosis. IEEE Trans Med Imaging 2001;20 (12):1285 92. 63. Chen W, Giger ML, Bick U. A fuzzy c-means (FCM)-based approach for computerized segmentation of breast lesions in dynamic contrast-enhanced MR images. Acad Radiol 2006;13(1):63 72. 64. Chen W, et al. Automatic identification and classification of characteristic kinetic curves of breast lesions on DCE-MRI. Med Phys 2006;33(8):2878 87. 65. Chen W, et al. Computerized interpretation of breast MRI: investigation of enhancement-variance dynamics. Med Phys 2004;31(5):1076 82. 66. Chen W, et al. Volumetric texture analysis of breast lesions on contrast-enhanced magnetic resonance images. Magnetic Reson Med 2007;58(3):562 71. 67. Chen W, et al. Computerized assessment of breast lesion malignancy using DCE-MRI: robustness study on two independent clinical datasets from two manufacturers. Acad Radiol 2010;17(7):822 9. 68. Chen W, Zur R, Giger M. Joint feature selection and classification using a Bayesian neural network with automatic relevance determination priors: potential use in CAD of medical imaging. Proc. SPIE 6514, Medical Imaging 2007: Computer-Aided Diagnosis, 65141G. 69. Drukker K, et al. Computerized detection and classification of cancer on breast ultrasound. Acad Radiol 2004;11(5):526 35. 70. Gilhuijs KG, Giger ML, Bick U. Computerized analysis of breast lesions in three dimensions using dynamic magnetic-resonance imaging. Med Phys 1998;25(9):1647 54. 71. Horsch K, et al. Computerized diagnosis of breast lesions on ultrasound. Med Phys 2002;29(2):157 64. 72. Li H, et al. Evaluation of computer-aided diagnosis on a large clinical full-field digital mammographic dataset. Acad Radiol 2008;15(11):1437 45. 73. Sahiner B, Chan H-P, Hadjiiski L. Classifier performance prediction for computer-aided diagnosis using a limited dataset. Med Phys 2008;35:1559 70. 
74. Sahiner B, et al. Computerized characterization of breast masses using three-dimensional ultrasound images. Proc SPIE 1998;3338:301 12. 75. Chan H-P, et al. Computer-aided classification of mammographic masses and normal tissue: linear discriminant analysis in texture feature space. Phys Med Biol 1995;40(5):857. 76. Ashraf AB, et al. Identification of intrinsic imaging phenotypes for breast cancer tumors: preliminary associations with gene expression profiles. Radiology 2014;272(2):374 84. 77. Meinel L, et al. Breast MRI lesion classification: improved performance of human readers with a backpropagation neural network computer-aided diagnosis (CAD) system. J Magn Reson Imaging 2007;25:89 95.

III. Clinical applications

308

15. Artificial intelligence and interpretations in breast cancer imaging

78. Song S, et al. Computer aided detection system for breast MRI in assessment of local tumor extent, nodal status, and multifocality of invasive breast cancers: preliminary study. Cancer Imaging 2015;15. 79. Horsch K, et al. Performance of computer-aided diagnosis in the interpretation of lesions on breast sonography. Acad Radiol 2004;11(3):272 80. 80. Huynh BQ, Li H, Giger ML. Digital mammographic tumor classification using transfer learning from deep convolutional neural networks. J Med Imaging 2016;3(3):034501. 81. Samala R, et al. Breast cancer diagnosis in digital breast tomosynthesis: effects of training sample size on multi-stage transfer learning using deep neural nets. IEEE Trans Med Imaging 2019;38:686 96. 82. Samala R, et al. Evolutionary pruning of transfer learned deep convolutional neural network for breast cancer diagnosis in digital breast tomosynthesis. Phys Med Biol 2018;63. 83. Samala R, et al. Multi-task transfer learning deep convolutional neural network: application to computeraided diagnosis of breast cancer on mammograms. Phys Med Biol 2017;62:8894 908. 84. Antropova N, Abe H, Giger M. Use of clinical MRI maximum intensity projections for improved breast lesion classification with deep CNNs. J Med Imaging 2018;5. 85. Antropova N, et al. Breast lesion classification based on DCE-MRI sequences with long short-term memory networks. J Med Imaging 2019;6. 86. FDA. De novo summary DEN170022. 2017. 87. Grimm L, et al. Relationships between MRI BI-RADS lexicon descriptors and breast cancer molecular subtypes: internal enhancement is associated with luminal B subtype. Breast J 2017;23:579 82. 88. Wu J, et al. Identifying relations between imaging phenotypes and molecular subtypes of breast cancer: model discovery and external validation. Magn Imaging 2017;46:1017 27. 89. Whitney H, et al. Additive benefit of radiomics over size alone in the distinction between benign lesions and luminal A cancers on a large clinical breast MRI dataset. Acad Radiol 2018;26(2):202 9. 90. Bhooshan N, et al. Cancerous breast lesions on dynamic contrast-enhanced MR images: computerized characterization for image-based prognostic markers. Radiology 2010;254:680 90. 91. Burnside E, et al. Using computer-extracted image phenotypes from tumors on breast MRI to predict breast cancer pathologic stage. Cancer 2015;122(5):748 57. 92. Li H, et al. Quantitative MRI radiomics in the prediction of molecular classifications of breast cancer subtypes in the TCGA/TCIA Dataset. Npj Breast Cancer 2016;2. 93. Li H, et al. MRI radiomics signatures for predicting the risk of breast cancer recurrence as given by research versions of gene assays of MammaPrint, Oncotype DX, and PAM50. Radiology 2016;281(2):382 91. 94. Guo W, et al. Prediction of clinical phenotypes in invasive breast carcinomas from the integration of radiomics and genomics data. J Med Imaging 2015;2. 95. Zhu Y, et al. Deciphering genomic underpinnings of quantitative MRI-based radiomic phenotypes of invasive breast carcinoma. Sci Rep 2015;42(6):3603. 96. Hylton N, et al. Neoadjuvant chemotherapy for breast cancer: functional tumor volume by MR imaging predicts recurrence-free survival-results from the ACRIN 6657/CALGB 150007 I-SPY 1 TRIAL. Radiology 2016;279:44 55. 97. Drukker K, et al. Most-enhancing tumor volume by MRI radiomics predicts recurrence-free survival “early on” in neoadjuvant treatment of breast cancer. Cancer Imaging 2018;18. 98. Ravichandran, K., et al. 
A deep learning classifier for prediction of pathological complete response to neoadjuvant chemotherapy from baseline breast DCE-MRI. in SPIE Medical Imaging. Houston, TX: SPIE. 2018.

III. Clinical applications

16 Prospect and adversity of artificial intelligence in urology

Okyaz Eminaga and Joseph C. Liao

Abstract
The emergence of artificial intelligence (AI) has opened a new avenue for tackling existing challenges in clinical routine. This chapter will briefly introduce potential applications of AI in urology and focus on its benefits and barriers in solving real clinical problems. First, the introduction will generally discuss AI and existing data resources. Then, the chapter will explain the potential applications of AI in urological endoscopy, urine and stone analyses, andrology, imaging, and robotic surgery. Further, the chapter will briefly discuss some risk prediction tools for urological cancers. Finally, the authors will discuss potential future directions of AI in urology.

Keywords: Urology; artificial intelligence; MRI; CT; urine analyses; AI-based solution; prostate cancer; bladder cancer; kidney cancer; ultrasound; diagnostic imaging; prediction models

16.1 Introduction The term artificial intelligence (AI) was first introduced by John McCarthy in 1956.1 Further, Alan Turing introduced a computing machine concept emphasizing the capability of simulating human beings and making decisions.2 The development of AI in the ensuing decades was not a straight trajectory compared to the development of microprocessors or data storage; AI has experienced several setbacks and stagnation periods. In addition to limited computational resources, one of the major reasons for the stagnation of AI in its early stage was the weakness in organizing and recognizing existing methods and the complicated mathematical presentation of such methods for programming. For example, the backpropagation algorithm that helped solve the weight-learning problem of deep neural networks was introduced in 1974 as part of a doctoral dissertation.3 However, it was not until 1990 that the backpropagation algorithm was first optimized for neural network training.4 In the same year, LeCun et al. introduced a deep convolutional neural network (DCNN) with four layers, trained using the backpropagation algorithm, to recognize single digits.5


Eleven years later, the rectified linear unit function was introduced,6 which was first utilized in a neural network model in 2009.7 The breakthrough of AI occurred after the highly accurate performance of DCNNs in detecting objects and classifying images in the 2012 ImageNet competition.8,9 Since then, neural networks have continued to gain attention from research communities across different disciplines, including the biomedical research community, to solve real-world problems. The artificial neural network was initially driven by the connectionism movement in cognitive science, which focuses on studying human cognition by utilizing mathematical models based on neural networks.10,11 Over the past years, different types of artificial neural networks have been proposed, including the Boltzmann machine,12 the deep belief network,13 the Kohonen network (self-organizing map),14 the recurrent neural network,15 neural Turing machines,16 the multilayer perceptron (MLP) model,17,18 the Bayesian neural network,19 the convolutional neural network,9,20 the graph neural network,21 and the transformer.22 Among these, the convolutional neural network and the recurrent neural network have recently received the most attention given their simple implementation and their human-level performance in solving computer vision problems and replicating human tasks such as chatbots, text generation, or language translation.23 AI is a broad field and is not limited to artificial neural networks; its general foundation is built on cognitive science, statistical learning, optimization, and probability theory.24 There are four major types of AI: reactive machines, limited memory machines, theory of mind, and self-aware AI. The majority of medical papers related to AI fall under limited memory machines, which will therefore be our focus in this chapter. AI also utilizes other machine-learning approaches (e.g., support vector machines or decision tree models) and considers different learning concepts such as supervised or unsupervised learning, reinforcement learning (sequential reward- and/or punishment-based learning), situated learning, problem-based learning, recommendation systems, and reasoning systems or attention-based learning to tackle real-world problems or to replicate human tasks. These learning concepts are in a dynamic relationship with each other and can be combined to solve clinical questions. Given that these topics are beyond the scope of this chapter, we refer the readers to the corresponding literature. In clinical research, AI has been applied over decades; it is generally utilized to provide decision-aided tools that make predictions about patient-centered outcomes, treatment management, and disease conditions on the basis of clinical data or imaging. Considered the fuel of AI, data are essential to develop and build models or to derive patterns or conditional behaviors; AI intuitively cannot function without data. Therefore it is crucial to know the data and their condition, as data significantly affect the products of AI. Since the implementation of the electronic health-care record system in routine clinical practice, we have reached an unprecedented level of data volume covering almost all aspects of clinical medicine, making the health-care sector one of the most valuable resources for big data research and AI. However, the potential exploration of health-care records is still at an early stage and a subject of intense research given the challenges with data mining and cleaning.
Further, the current electronic health records generally contain unstructured or free-text data, which require additional preprocessing and careful data quality control to avoid misleading conclusions.25


Although a standard for clinical terminology (e.g., SNOMED) is available to facilitate data exchange and interoperability between institutions,26 and different information system concepts for hospitals (e.g., the single-source information system) were introduced to manage clinical data more efficiently,27 each hospital still has its own culture of storing its clinical data. So, a collective effort is required to achieve generalizable AI solutions for unstructured clinical data. Further research to refine the concepts of the hospital information system for better utilization of clinical data is also needed. Another broad area is diagnostic, therapeutic, or functional imaging, which has been a major interest of AI researchers after the remarkable results of convolutional neural networks in detecting objects and performing segmentation tasks. Image acquisition and quality generally depend on the acquisition protocol and the instruments used in clinical routine. Further, image acquisition and storage can impact image quality and, therefore, present an essential factor for developing AI solutions in medical imaging. Certainly, the application of AI in health care is not limited to the tasks mentioned previously and also covers logistics, cost, and resource management tasks. The urological research community is considered an early adopter of AI to solve clinical questions. For instance, the application of the artificial neural network goes back to 1994, when Snow et al. introduced diagnostic and prognostic tools for prostate cancer based on neural networks that predicted recurrent prostate cancer with 90% overall accuracy or the biopsy result with 87% accuracy.28 Another example, from Partin et al. in 1997, utilized a multinomial log-linear regression model to develop a nomogram to predict different tumor stages of prostate cancer.29 This article was the foundation for the development of the well-known Partin table for the risk stratification of prostate cancer.30 Further, the application of machine learning to images also dates back to 1995 and 1998, when Bartels et al. introduced a tool based on feature engineering and a nonparametric classifier to differentiate prostate cancer lesions from normal prostatic tissues.31 Since then, many diagnostic and prognostic tools have been introduced using different machine-learning approaches on different data types. Because this chapter focuses on the clinical aspects of AI, the following sections will briefly cover the application of AI in basic examinations, urological endoscopy, andrology, urologic oncology, diagnostic imaging, and robotic surgery.

16.2 Basic examinations in urology This section will highlight some basic examinations in urology and discuss the application of AI to improve them. From the basic examinations, we will cover some urine tests and the ultrasound examination of the prostate and bladder.

16.2.1 Urinalysis and urine cytology Much information can be acquired from urine tests to help urologists diagnose diseases and follow treatment. Urine tests include, for example, urinalysis, urine culture, and 24-hour urine collection. A urinalysis is a test that evaluates the urine sample for color and for the presence or concentration of certain substances, and it provides information in the case of infections, diseases of the urinary tract, metabolic disorders, or cancer. A urinalysis includes a chemical examination using a dipstick and a microscopic evaluation of the urine sample.32


Urologists apply the chemical examination more often in clinical routine because of its clinical practicality, given its speed (60–90 seconds) and point-of-care deployment. Here, the urologist or the nurse places the dipstick with chemically treated pads into the urine and then inserts the stick into a reader machine, which reads the color intensities of these pads using an absorption measurement technique.33 The evaluation of the dipstick test relies on color intensity evaluation (color spectrum), which can be determined by a computer-vision algorithm as well. However, the measurement of color intensity requires a prior color calibration, as the color detection depends on the lighting and position conditions while taking an image of the dipstick. Meanwhile, the pixel resolution and adaptive features of current smartphone cameras have facilitated capturing high-quality images or videos for such analyses. Further, some smartphone applications to interpret the dipstick using the smartphone camera have been introduced.34 However, these applications are still not validated and consequently not recommended for clinical use. Recently, a urine test kit became available that includes a plate to place the dipstick and to calibrate the smartphone camera for dipstick reading by a smartphone application; the kit has received the FDA (Food and Drug Administration) and Conformité Européenne markings for clinical use.35 Although this urine test is simple and widely used in clinical routine, it has high false-positive rates for common clinical findings such as hematuria.36 Hematuria is the presence of blood cells in urine and is a very important finding in urology, as it is linked to the risk of cancer in the urinary tract, especially in the bladder for certain patient groups.37,38 Since the dipstick test can cause a false-positive readout for hematuria, the microscopic evaluation of the urine sample is generally required to confirm hematuria.39 Urine cytology is essential to diagnose urinary tract cancers and infectious diseases; it also plays an important role in defining the tumor grade of urothelial carcinoma. However, one of its major limitations is its dependency on the readers, which causes interrater variation in the results.40 This led to an early example of artificial neural network application for the detection of abnormal cytology in order to achieve a standardized evaluation to inform clinical decision-making. In the 1980s, Melder et al. and Sherman et al. introduced a proof of concept of imaging analysis for bladder cancer cell detection in urine sediment from images captured by video cameras.41,42 Pantazopoulos reported in 1998 an MLP capable of discriminating benign cells from malignant cells in urine smears with an overall 90.57% accuracy on a test set containing 31,816 cells.43 Muralidaran et al. provided a proof of concept using an MLP model to differentiate low-grade urothelial cancer cells from high-grade urothelial cancer cells with an overall 83.3% accuracy, although this study had only two cases with low-grade bladder cancer in the test set, emphasizing its limitation.44 These early efforts demonstrated the potential of neural networks in detecting abnormal cytology and determining the tumor grade in urine samples. Finally, semiautomated approaches have already found their application in laboratory and pathology centers to meet the increasing workload for cytology evaluation.45
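As an illustration of the type of neural network classifier described above, the following is a minimal sketch of an MLP that separates two cell classes from a few morphometric features; the features, labels, and network size are synthetic placeholders rather than the models from the cited studies.

```python
# Minimal sketch: MLP discriminating "benign" vs "malignant" cells from
# synthetic morphometric features (hypothetical, not the cited models).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_cells = 1000
# Hypothetical per-cell features: nuclear area, nucleus/cytoplasm ratio, chromatin texture
X = rng.normal(size=(n_cells, 3))
y = (0.9 * X[:, 0] + 1.2 * X[:, 1] + rng.normal(scale=0.8, size=n_cells)) > 0  # 1 = "malignant"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0).fit(X_tr, y_tr)
print(f"Test accuracy on synthetic cells: {clf.score(X_te, y_te):.2f}")
```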

16.2.2 Ultrasound examination The ultrasound examination is one of the most important tools for diagnosis, treatment, and surveillance in urology.


Ultrasound uses sound waves to capture live images from the inside of the body. The frequency of these waves and the head modularity determine the depth of penetration and the width of these images. The applications of ultrasound in urology are manifold and include, for example, the ultrasound-guided biopsy of the prostate, anatomic evaluation of the kidneys, the estimation of residual urine volume, and initiating percutaneous nephrostomy or inserting a suprapubic catheter into the bladder. The introduction of initial segmentation techniques dates back to the period between the 1960s and 1980s.46–50 The prostate was the first genitourinary organ to be segmented on ultrasound images, in 1982.51 The artificial neural network was first introduced for prostate segmentation on transrectal ultrasound images in the 1990s.52 Later, different machine and deep learning models were proposed for prostate segmentation on ultrasound images, with remarkable improvements in segmentation performance.53–55 Meanwhile, computational approaches have already found their application in clinical routine to perform the prostate biopsy based on fusing T2-weighted magnetic resonance imaging (MRI) and transrectal ultrasound images of the prostate.56,57 The fusion of both image entities requires, in addition to the organ segmentation and the coregistration, the annotation of the tumor lesions on T2-weighted MRI to enhance the lesion targeting during the biopsy setting58 (Fig. 16.1). Focusing on organs other than the prostate, kidney segmentation on ultrasound images was first reported in the late 1980s,59 followed by three-dimensional (3D) reconstruction of the kidney in the 1990s,60–62 and the first utilization of an artificial neural network for kidney segmentation was described in 1997.63 The contour detection of the urinary bladder was first introduced in 1991 to measure the wall thickness,64 and the application of an MLP to segment the urinary bladder was reported in 2002,65 while ultrasound images of the gall bladder or cardiac sonography were more frequently used for segmentation research on hollow organs in the 1980s–90s.66–68 Although no article related to segmentation research on ultrasound images of the human testis could be identified, Ferdeghini et al. concluded that texture features extracted from testicular ultrasound using feature engineering may be associated with testis growth as a quantitative measurement.69 Interestingly, we found that computational analyses of testicular ultrasound images in animals were mainly utilized to determine sperm quality on the basis of ultrasound images as well.70

FIGURE 16.1 The 3D reconstruction of the prostate and the position of the biopsy needle relative to the ultrasound probe. T2-weighted MRI is utilized to reconstruct the prostate and the lesion (red). Then, a fusion between ultrasound and MRI is realized using either rigid or elastic coregistration. 3D, Three-dimensional; MRI, magnetic resonance imaging.


With the development of feature engineering and advanced machine learning, we have experienced a significant improvement in segmentation and reconstruction tasks for the abovementioned organs on the basis of ultrasound images. For instance, Annangi et al. provided an estimation of the bladder volume from ultrasound images using a segmentation method based on pixelwise classification with support vector machine and active contour algorithms, evaluated on a held-out dataset.71 Matsumoto et al. provided an automated ultrasonographic detection of bladder diameter for the estimation of bladder urine volume using a DCNN model (VGG16, a neural network architecture introduced by the Visual Geometry Group72) that is well correlated with the expert estimation (R² = 0.96).73 In the kidney, Ravishankar et al. proposed a fully automated kidney segmentation technique on the basis of template matching and a machine-learning method called a constrained probabilistic boosting tree for edge detection, successfully segmenting 80% of the kidney ultrasound images.74 Another study introduced a DCNN model for kidney volume estimation that achieved an overall mean Dice similarity coefficient of 0.86 ± 0.07 (mean ± SD) between automated and manual segmentations from clinical experts and a mean correlation coefficient (ρ) of 0.98 (P < .001) for segmented kidney volume measurements on the test set.75 Although these results from deep learning are promising, the segmentation tasks for kidney and bladder still need to be explored for their clinical utilization in a physician-friendly approach. Overall, the definition of the organ boundary, or the organ segmentation, is helpful for orientation and identification of the regions of interest inside the organ during diagnostic or treatment procedures. Further, it allows the volume estimation of the organ. The organ segmentation of the prostate is essential to determine, for example, the organ boundary for brachytherapy (a type of radiation therapy) or to simplify the registration of biopsy cores. The bladder segmentation facilitates the estimation of residual volume and determining the thickness of the bladder wall, which provide valuable information about the severity of voiding dysfunction.76 In the kidney, the segmentation technique can be a useful guide for urologists to place nephrostomies or for tracking the organ and kidney stones during extracorporeal shock wave therapy.
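Segmentation performance in the studies above is commonly summarized with the Dice similarity coefficient (DSC). A minimal sketch of the metric, assuming binary NumPy masks:

```python
# Minimal sketch of the Dice similarity coefficient for comparing an automated
# segmentation against a manual reference; masks are binary NumPy arrays.
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|); returns 1.0 when both masks are empty."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, truth).sum() / denom

# Toy example with two overlapping square masks
a = np.zeros((64, 64), dtype=bool); a[10:40, 10:40] = True
b = np.zeros((64, 64), dtype=bool); b[15:45, 15:45] = True
print(f"DSC = {dice_coefficient(a, b):.3f}")
```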

16.3 Urological endoscopy The examination with the white light cystoscope (WLC) is frequently indicated in urology and is an essential tool to investigate the lower urinary tract (Fig. 16.2A). Further, urologists can use specialized cystoscopes (resectoscopes) for bladder tumor resection. In contrast, the ureterorenoscope is a tool that facilitates access to the upper urinary tract for diagnostic and therapeutic purposes (Fig. 16.2B).

FIGURE 16.2 The flexible cystoscope (A) and the rigid ureterorenoscope (B).

16.3.1 Cystoscopy and transurethral resection of the bladder

According to the literature, the first description of AI to detect bladder cancer on white-light cystoscopic images was based on color segmentation and included comparative analyses of three different machine-learning algorithms (linear regression, quadratic regression, and naïve Bayes classification).77 Although this approach was limited by its accuracy and small sample size, it showed the capability of feature engineering and machine learning in identifying bladder cancer from white-light cystoscopic images. Later, Eminaga et al. identified the capability of DCNNs to detect the 44 most relevant cystoscopic findings from white-light images using different DCNN models;78 this study further revealed that a DCNN is capable of correctly capturing patterns that are specific to certain cystoscopic findings without prior knowledge about the finding except the labeling of the corresponding images. Modern cystoscopy facilitates video streaming onto a high-definition screen and real-time evaluation; by using this feature, Shkolyar et al. developed a DCNN model called CystoNet that detects papillary bladder cancers from white-light cystoscopy videos with a sensitivity of 90.9% (95% CI, 90.3%–91.6%).79 Their approach also highlights the regions with bladder cancer on the cystoscopy frames, as it incorporated the region proposal network (RPN) concept introduced by Ren et al.80 To address the unbalanced cystoscopic imaging data problem in real-time bladder cancer detection using deep learning, the unbalanced discriminant loss originally designed for colonoscopy image analysis by Yuan et al. is under investigation.81 Generally, bladder cancer and any other cystoscopic finding can be detected in three ways. The first approach is image classification, where the detection model classifies the cystoscopic images; a class-specific saliency map can be applied to gain an intuition of the model's attention.82 The second approach is object detection and instance segmentation, which aim to identify the boundary of the tumor lesions and can cover multiple tumor lesions and different findings on a single image. The third approach is semantic segmentation, which detects the pixels containing tumor. Fig. 16.3 provides an illustration of the three different approaches. These studies identified that DCNNs can detect and highlight pathological lesions in the urinary bladder and help refine the imprecise localization of tumor borders that can impede complete resection. Although WLC is an essential tool for bladder cancer diagnosis, it can miss multifocal, smaller tumors and has suboptimal diagnostic accuracy for flat cancers.83 Therefore we see the potential of AI in improving WLC to reduce the rate of missed bladder cancers and imprecise lesion registration, both of which can adversely impact cancer outcomes.
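As a hedged illustration of the first detection strategy described above (image classification with a class-specific saliency map), the following minimal PyTorch sketch computes an input-gradient saliency map for a tiny placeholder CNN; the network, input frame, and class labels are hypothetical and not taken from the cited studies.

```python
# Minimal sketch: class-specific saliency (activation) map via the gradient of
# the predicted class score with respect to the input image.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(8), nn.Flatten(),
    nn.Linear(8 * 8 * 8, 2),          # two hypothetical classes: tumor vs benign
)
model.eval()

frame = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in cystoscopy frame
scores = model(frame)
predicted = int(scores.argmax(dim=1))
scores[0, predicted].backward()                           # gradient of the class score

# Saliency: maximum absolute gradient over the color channels, per pixel
saliency = frame.grad.abs().max(dim=1).values[0]
print(f"Predicted class {predicted}, saliency map shape {tuple(saliency.shape)}")
```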

FIGURE 16.3 Three ways are available to detect tumor lesions: (1) image classification, to which a saliency map (activation map) can be added to understand the intuition of the detection model by highlighting the relevant features on the image; (2) object detection and instance segmentation, which identify the lesion boundary together with the lesion classification; and (3) semantic segmentation, which facilitates pixelwise segmentation of the tumor lesion.

16.3.2 Ureterorenoscopy

Ureterorenoscopy (URS) is generally indicated to manage kidney and ureteral stones and is often preferred because of its higher stone-free rate compared with shockwave lithotripsy and its lower complication rate compared with percutaneous nephrolithotomy.84 However, URS requires close attention to the surrounding anatomy to reduce complications. An assistance tool that identifies anatomical structures during the surgery would therefore be useful for anatomical orientation, as such information is essential for surgical decision-making. The identification of anatomical structures or stones during URS can be computationally solved using segmentation algorithms. Our literature search could not find any article that utilized machine or deep learning for segmentation tasks on URS images. However, Rosa et al. introduced a segmentation tool for renal stones on the basis of URS images; their approach to the segmentation task is based on the region-growing algorithm and the color discrepancy between the stone and the background, and no machine learning was utilized in this study.85 Overall, the research effort in providing AI-based assistance tools for URS is still below expectation, and we anticipate more research efforts in this field in the upcoming years.
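As a hedged sketch of the non-learning, region-growing idea used in the study above (growing a region from a seed pixel while the color stays close to the seed color), the following toy example uses a synthetic image and a hypothetical tolerance value.

```python
# Minimal sketch of color-based region growing from a seed pixel; the image,
# seed location, and tolerance are placeholders for illustration only.
from collections import deque
import numpy as np

def region_grow(image: np.ndarray, seed: tuple, tolerance: float) -> np.ndarray:
    """Binary mask of pixels reachable from `seed` with color distance <= tolerance."""
    h, w, _ = image.shape
    seed_color = image[seed].astype(float)
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        if mask[y, x]:
            continue
        if np.linalg.norm(image[y, x].astype(float) - seed_color) > tolerance:
            continue
        mask[y, x] = True
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                queue.append((ny, nx))
    return mask

img = np.zeros((100, 100, 3), dtype=np.uint8)
img[30:70, 30:70] = (200, 180, 150)     # bright "stone" region on a dark background
print(region_grow(img, seed=(50, 50), tolerance=30.0).sum(), "pixels in the grown region")
```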

16.4 Andrology Andrology is a subspecialty of urology that deals with men's health and primarily treats male fertility issues and erectile dysfunction. The sperm analysis is one of the most important diagnostic tools in andrology and involves the characteristic description of human semen according to the criteria of the World Health Organization (Table 16.1).86


TABLE 16.1 Semen analyses lower reference limits defined by the World Health Organization laboratory manual for the examination and processing of human semen in 2010.86

Semen characteristic                         Lower reference limit
Volume, mL                                   1.5
Total sperm number, 10^6                     39
Sperm concentration, 10^6/mL                 15
Total motility (PR + NP), %                  40
Progressive motility (PR), %                 32
Vitality (live spermatozoa), %               58
Sperm morphology (normal forms), %           4
pH                                           ≥ 7.2
Seminal fructose, µmol/ejaculate             ≥ 13

PR, Progressive motility; NP, nonprogressive motility.
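As a small illustration of how the reference limits in Table 16.1 can be applied programmatically, the following sketch flags semen parameters that fall below the WHO 2010 lower limits; the parameter names and the example sample are hypothetical.

```python
# Minimal sketch: flag semen analysis values below the WHO 2010 lower
# reference limits listed in Table 16.1 (parameter names are hypothetical).
WHO_2010_LOWER_LIMITS = {
    "volume_ml": 1.5,
    "total_sperm_number_million": 39,
    "sperm_concentration_million_per_ml": 15,
    "total_motility_percent": 40,
    "progressive_motility_percent": 32,
    "vitality_percent": 58,
    "normal_morphology_percent": 4,
    "ph": 7.2,
    "seminal_fructose_umol_per_ejaculate": 13,
}

def flag_below_reference(sample: dict) -> dict:
    """Return the parameters of `sample` that fall below the WHO lower limits."""
    return {k: v for k, v in sample.items()
            if k in WHO_2010_LOWER_LIMITS and v < WHO_2010_LOWER_LIMITS[k]}

example = {"volume_ml": 2.0, "sperm_concentration_million_per_ml": 9,
           "progressive_motility_percent": 28, "ph": 7.4}
print(flag_below_reference(example))
```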

The semen characteristics provide relevant information about male infertility. Having a standardized approach for semen analysis is, therefore, essential to limit the variation in obtaining and processing ejaculates.87 However, the semen analysis is time-consuming and labor-intensive. A computer-based solution has been shown to be helpful to automate and to standardize the semen analysis.88 Given the need for objective measurement in semen analyses, the development of computer-aided semen analysis started in the early 1980s;89–92 there are currently commercial solutions for computer-aided semen analysis that are today widely used to measure the sperm number, vitality, kinetic parameters, and sperm concentration.93 Recently, a double-blinded clinical study evaluated a commercial computer system for automated semen analysis against the manual evaluation;94 the study showed a good correlation (Pearson ρ: 0.88–0.97) between the automated semen analysis and the human evaluation for sperm concentration and motility, though it showed a low correlation for normal morphology, suggesting that human readers follow a more restricted definition of normal morphology than the computer solution.94 We believe the calibration and the configuration of the computer software also have an impact on the results, and it is not clear what sort of algorithms the commercial software utilized to perform the measurement in this study. Another point is that the recent research efforts in AI for semen analysis have neglected the commercial computer-aided solutions used in clinical routine. These commercial computer solutions are actually based on approaches related to AI, have had a long development period, and have been validated by clinical studies.93,95 These approaches are considered a reference for computer performance in semen analyses, with a higher evidence level according to the conducted clinical studies. This observation further indicates a disconnection between the clinical routine and AI research teams in this field, signifying the importance of close research collaboration between the two disciplines. Semen analysis is also important to identify and select semen eligible for the in vitro fertilization procedure.


A sperm selection step is very important for artificial fertilization to increase the success rates. Different approaches are available to determine and select fertile sperm and include, among others, intracytoplasmic morphologically selected sperm injection (IMSI), which utilizes high-magnification (60–99, 100×) microscopy and requires the morphometrical evaluation of sperm. Some studies revealed that IMSI significantly improves blastocyst development, implantation, and pregnancy rates.96–98 To select high-quality sperm, Lamb and Niederberger utilized a single-layer neural network architecture with ADALINE (adaptive linear element) units and three input parameters from the semen analysis to predict the effectiveness of sperm penetration into bovine cervical mucus as a measurement of sperm fertility; their model could predict correctly in 90% of the cases.99 ADALINE was introduced by Widrow et al. in 1960 and is marginally different from the current unit used in the MLP or CNN.100 Recently, McCallum et al. introduced an algorithm based on a deep CNN (VGG16) that can predict sperm with high DNA integrity from microscopic images and select the sperm accordingly.101 High DNA integrity is linked to both improved fertilization rates and embryonic development.102 Uyar et al. compared six different machine-learning approaches to predict fertilization success on 2453 fertilization procedures with 18 feature values; the best model was based on the naïve Bayes classification approach and achieved an overall 80.4% accuracy, 63.7% sensitivity, and 17.6% false alarm rate in embryo-based implantation prediction.103 Overall, andrology represents a good example of how computer-based solutions can help to improve and standardize the clinical workflow.
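The comparative study above reports accuracy, sensitivity, and false alarm rate. As a minimal sketch of how these metrics are derived from a binary confusion matrix, the following uses synthetic predictions (not data from the cited study).

```python
# Minimal sketch: accuracy, sensitivity, and false alarm rate from a binary
# confusion matrix; labels and predictions are synthetic placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)            # 1 = successful implantation (hypothetical)
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)   # noisy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
false_alarm_rate = fp / (fp + tn)                # i.e., 1 - specificity
print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, false alarm rate={false_alarm_rate:.3f}")
```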

16.5 Diagnostic imaging 16.5.1 Prostate Multiparametric MRI (mpMRI) of the prostate plays an essential role in detecting significant prostate cancer (CaP) and has been shown to reduce primary biopsies by 10% and the diagnosis of indolent cancers by 5%.104 The PIRADS (Prostate Imaging Reporting and Data System) score is applied to assess the significance of observed lesions on mpMRI.105 However, this scoring system is limited by considerable variation in the reproducibility of MRI findings.106 Therefore computer-aided tools are required to reduce this hindrance by using intelligent algorithms for pattern recognition and feature extraction. A recent study introduced a lesion detection model based on deep learning that could determine CaP lesions on MRI images with an AUC of 0.89, higher than PIRADS v2 (AUC: 0.82).107 Meanwhile, several studies regarding computer-aided diagnostic systems for CaP detection in mpMRI based on deep learning models have been introduced;108–113 further, some studies have claimed to achieve an accuracy of 90% in distinguishing noncancerous lesions from cancerous lesions in the prostate; however, the major limitation of these studies is the small sample size (50 cases on average).108–110,113–119 Previous studies fall into one of the four following detection strategies to detect prostate cancer and determine its significance on mpMRI of the prostate: (1) an instance segmentation model that determines the boundary of the tumor lesions; (2) a model that classifies the lesions according to tumor presence; (3) a semantic segmentation model that segments
tumor lesions at the pixel level; and (4) a weakly supervised model that determines cases with CaP (Fig. 16.4). In the last decades, three challenges have been proposed to promote the research in this field. The first challenge was related to prostate segmentation on T2-weighted images, where the best segmentation model achieved an 86% Dice coefficient.111 The second challenge aimed to classify the lesions for CaP presence, and the best model achieved a 92% accuracy.120 The third, the SPIE-AAPM-NCI Prostate MR Gleason Grade Group Challenge, was initiated to predict the Gleason grade groups from prostate MRI; the Gleason grade is a malignancy grade of prostate cancer.121 However, this challenge is limited by using the biopsy Gleason grade, which inherits the risk of up/downgrading of the Gleason score.122–125 Another study overcame this limitation by considering whole-mount histology slides of the prostate from cases who underwent a total removal of the prostate.126 A recent study from Heidelberg utilized the semantic segmentation approach to detect the tumor lesions and achieved an accuracy comparable to PIRADS scores in detecting significant prostate cancer without prior delineation of tumors, in contrast to the Gleason Grade Group Challenge, using sequence-wise slice images of T2-weighted and apparent diffusion coefficient (ADC) images.126 ADC is a measure of the magnitude of diffusion within tissue and is commonly calculated using diffusion-weighted imaging.127 Further, we developed an attention-based recurrent convolutional neural network on the basis of PlexusNet that detects cases with prostate cancer and determines the significance of prostate cancer using volumetric data of mpMRI, which facilitates highlighting the regions that explain the model's considerations.128

FIGURE 16.4 (1) Lesion detection using instance segmentation, where the boundary is recognized and a probability score is provided for each lesion; (2) another approach estimates the probability score of a tumor lesion that is marked by the physicians; (3) semantic segmentation of the tumor lesions; (4) a weakly supervised attention-based model for cancer detection at the case level. Here, an attention mechanism is implemented for the volumetric data (relevance score) and for the slice-wise image (attention map).
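The comparisons in this section are reported primarily as the area under the ROC curve (AUC). As a minimal, illustrative sketch of how such a figure is computed, the following uses synthetic lesion labels and scores, not data from any cited study.

```python
# Minimal sketch: computing AUC for two sets of synthetic lesion scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)                            # 1 = clinically significant CaP (synthetic)
model_scores = np.clip(labels * 0.4 + rng.normal(0.4, 0.2, 200), 0, 1)
comparator = np.clip(labels * 0.3 + rng.normal(0.4, 0.25, 200), 0, 1)

print(f"Model AUC:      {roc_auc_score(labels, model_scores):.2f}")
print(f"Comparator AUC: {roc_auc_score(labels, comparator):.2f}")
```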

16.5.2 Kidney The computed tomography (CT) of the kidney is essential to diagnose a variety of kidney diseases including, but not limited to, kidney stones, renal abscess, kidney injuries, and tumors.129 The segmentation and reconstruction of the kidney and its blood supply are effective tools to understand the anatomical relation between these structures during invasive or open surgeries of the kidney. Such supportive tools can help to realize a nephron-sparing surgery or to minimize the risk of damaging the blood supply during the surgery, which potentially leads to losing the kidney.130 For that purpose, Bugajska et al. provided a proof of concept for the segmentation of kidney blood vessels using a sequence of computational algorithms that process the CT scan and achieved a 0.838 Dice coefficient on 10 cases.131 Historically, the first attempts to segment the kidney on CT started in the 1970s using boundary-finding algorithms.132 Since then, many efforts have been put into developing segmentation and detection tools for kidney-related diseases on the basis of CT. In the first era of segmentation research for the renal organ, many studies utilized feature engineering or region-growing algorithms to detect and segment the kidney.132–134 Later, the introduction of artificial neural networks and machine learning promoted the development of segmentation tools and provided more accurate diagnostic tools to detect diseases and lesions on CT.


For instance, Cuingnet et al. utilized global contextual information from the 3D volumetric image and a random forest for the voxel-wise prediction of the presence of the kidney morphology. Their approach could segment the kidney on a highly heterogeneous database of 233 CT scans from 89 patients, where 80% of the kidneys were accurately detected and segmented (Dice coefficient > 0.90).135 Most recently, we have observed a methodological shift toward deep learning, specifically DCNNs, since a DCNN targets both feature engineering and the classification task, and many studies have provided supportive results for its efficacy in segmentation and classification tasks. For instance, Hu et al. developed a customized 3D DCNN model that could segment the kidney with a Dice similarity coefficient (DSC) of 95.4% ± 3.2%,136 whereas Gibson found that a V-Net for kidney segmentation achieved a 0.92 DSC.137 Sharma et al. utilized a DCNN to segment autosomal dominant polycystic kidneys on 79 abdominal CT scans, thereby achieving a DSC of 0.86 ± 0.07 and facilitating the kidney volume estimation.75 Mannil et al. revealed that the texture information of the CT scan of kidney stones combined with a random forest can predict the success of shock wave lithotripsy.138 In conclusion, although many papers have highlighted the potential of DCNNs in segmentation and classification problems for CT of the kidney, further validation is needed to confirm the clinical utility of DCNNs for detection tasks on CT for kidney diseases.
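A minimal sketch of how a volume estimate follows from a binary segmentation mask, as in the kidney volume measurements mentioned above (voxel count multiplied by the volume of a single voxel); the mask and voxel spacing are placeholders.

```python
# Minimal sketch: organ volume from a binary 3D segmentation mask.
import numpy as np

def segmented_volume_ml(mask: np.ndarray, voxel_spacing_mm: tuple) -> float:
    """Volume in milliliters from a binary 3D mask and voxel spacing (dz, dy, dx) in mm."""
    voxel_mm3 = float(np.prod(voxel_spacing_mm))
    return mask.astype(bool).sum() * voxel_mm3 / 1000.0   # 1 mL = 1000 mm^3

# Toy example: a block-shaped "kidney" with 1.0 x 0.8 x 0.8 mm voxels
mask = np.zeros((60, 128, 128), dtype=bool)
mask[20:40, 40:90, 40:90] = True
print(f"Estimated volume: {segmented_volume_ml(mask, (1.0, 0.8, 0.8)):.1f} mL")
```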

16.5.3 Ureter and bladder The computed tomography (CT) urogram is an imaging exam to evaluate the upper urinary tracts and the bladder. The CT urogram facilitates the clarification of the presence of urothelial tumors or abnormalities in the upper urinary tract139 and is used for staging purposes.140 However, the CT urogram is inferior to cystoscopy in diagnosing bladder cancer141 and is recommended once bladder cancer is confirmed.140 Further, CT tends to understage advanced disease.142 Another indication of CT is to monitor and follow up the treatment response, for example, in the case of nonsurgical treatments (i.e., chemotherapy and immune therapy) or after surgery for bladder cancer. Initial efforts were put into providing tools that mimic the real cystoscopy to detect bladder cancer and use the 3D reconstruction of the bladder from CT or MRI; for instance, Narumi et al. used 3D reconstruction algorithms on CT images to provide a 3D view field of the bladder that enables navigating inside the bladder.143 Similarly, two research groups introduced a proof of concept for CT cystoscopy that facilitates the virtual navigation inside the bladder, including the bladder tumors, after performing the 3D reconstruction of the bladder based on CT data.136,144 However, CT cystoscopy is ineffective in the diagnosis of bladder cancer due to its dependency on CT, which has an inferior diagnostic accuracy compared to conventional cystoscopy.141 Recent studies on bladder segmentation were generally more focused on the diagnostic aspect despite the limitations of CT in bladder cancer diagnosis.145,146 From the clinical aspect, we believe it is more impactful to develop segmentation tasks for staging purposes or follow-up, since we already have diagnostic tools that involve no radiation exposure and achieve a better diagnostic accuracy than CT. Cha et al. identified that features extracted from bladder regions and deep learning can potentially provide predictive value for treatment response, although their results (AUC: 0.69–0.77) were moderate and not generalizable.147 Wu et al. developed a nomogram based on a multivariate logistic regression model.


The model was trained using radiomic features extracted from the CT urogram and originating from the bladder wall for the preoperative prediction of lymph node metastasis in cases with bladder cancer, and it achieved an AUC of 0.899 (95% CI, 0.761–0.990) on 38 test cases.148 Conclusively, we expect that more studies will appear in the next period evaluating the potential of the CT urogram to improve tumor staging for bladder and ureter cancers using AI, since the false staging of 10% of muscle-invasive bladder cancer cases remains a clinical dilemma.149
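As a hedged sketch of the kind of multivariate logistic regression model underlying such nomograms, the following trains and evaluates a classifier on synthetic radiomic-like features; the features, labels, and resulting AUC have no relation to the cited study.

```python
# Minimal sketch: logistic regression on synthetic radiomic-like features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))                 # 5 hypothetical radiomic features from the bladder wall
coef_true = np.array([1.2, -0.8, 0.5, 0.0, 0.3])
y = (X @ coef_true + rng.normal(scale=1.0, size=200)) > 0   # 1 = lymph node metastasis (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"Test AUC on synthetic data: {auc:.2f}")
```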

16.6 Robotic surgery Although Kwoh et al. introduced the first robotic surgery procedure, for neurosurgical biopsy, in 1988,150 the urologic discipline is considered an early adopter of robotic surgery, as the first description of a robotic prostatectomy was introduced in 1991.151 Meanwhile, more than half of uro-oncologic surgeries are performed with robot assistance.152 A robotic assistance system consists of a console unit and a robotic arm unit. The console unit has a screen display that supports 3D vision and the grip controls that facilitate remote surgery. The placement of the robotic arms requires assistant personnel, who also change the heads of the robotic arms during the surgery and fix issues raised by the robotic arms. Although robotic surgery has not been shown to be superior in terms of patient outcomes, studies showed that urological robotic surgery is associated with less blood loss and expedited recovery compared to open surgery.153 Another aspect of robotic surgery is the capability to record the surgical procedure and collect spatial and visual data to improve the surgeon's experience and to manage the limitations of the robotic system.

16.6.1 Preoperative preparation The recent development in medical imaging and the facility to network different resources within the hospital information system have facilitated data streaming into the operating room. A previous study revealed that AI can potentially help in developing more realistic surgical simulations.154 Surgical simulation with haptics represents a useful preparation tool for surgical trainees and even for experienced surgeons. However, such simulation is currently in its early stage and remains a simplification of the real surgical procedure.155 Another aspect is that some studies found that AI and the electronic health record can help to identify cases that are more prone to surgical complications,156,157 so that the surgical team can initiate preventive measures for cases with increased risk of complications (e.g., urinary incontinence and lung embolisms). Finally, an operating room table equipped with AI and sensors allows monitoring of the patient's body and position.158,159

16.6.2 Navigation Previous studies have introduced navigation tools that consider the anatomical relationships of the organ, blood supply, and tumors for a better anatomical orientation during the surgery.


For instance, the partial removal of the kidney or tumor enucleation requires precise surgery and familiarity with the topological location of the tumor to ensure its complete removal. In this context, virtual reality and 3D reconstruction from preoperative diagnostic and staging images are useful resources to build navigation systems as decision-aiding tools for the surgeons. For example, Porpiglia et al. utilized, as a proof of concept, 3D augmented reality to assist robotic surgery with partial removal of the kidney.160 The same team also added 3D elastic augmented reality to robot-assisted radical prostatectomy and evaluated the utility of this tool in determining capsule involvement.161 The 3D model of the prostate was manually reconstructed using mpMRI and overlaid using custom software. Two other works on virtual reality for robotic radical prostatectomy built an anatomy atlas of the prostate on the basis of ultrasound; they found that virtual reality may help surgeons in identifying anatomical margins, nerves, and blood vessels.162–164 Similarly, the 3D reconstruction of the kidney, blood vessels, and tumors from CT was found to be a useful tool for navigating through the partial removal of the kidney165,166 and can potentially improve postoperative renal function.167

16.6.3 Automated maneuver The recent development in robotics and automation technology has facilitated the introduction of different concepts and systems. An example is the autonomous or adaptive cruise control system, which aims to regulate the speed of and distance between cars and alert the drivers to avoid potential collisions.168,169 Further, industrial automation has revolutionized modern industry by automating repetitive tasks, thereby improving production capacity while maintaining product quality and reducing production errors. In surgery, many research efforts have focused on automating repetitive tasks such as suturing or automated tracking of the surgical field during laparoscopic surgery. However, for safety and ethical reasons, these automated tasks are performed under human supervision. Shademan et al. introduced a supervised surgical robotic tool for suturing deformable soft tissues that includes robotic arms, force sensors, cameras, and a tracking system.170 Staub et al. developed a robotic system that has automation software for performing tissue piercing under laparoscopic robotic conditions.171 Another study introduced computational strategies that prevent the collision of robotic arms and facilitate automated knot tying after training the robot using data generated by the haptic sensors of the trainers.172 Wei et al.173 and Omote et al.174 introduced methods for robotic camera control in laparoscopic surgery. Overall, these works related to automated repetitive surgical maneuvers are still at an early stage but encouraging, emphasizing the high potential in improving the ergonomics of the workspace and surgical precision. However, further improvement is needed until surgical automation finds its place in routine surgery.

16.7 Risk prediction The management of urologic cancers involves estimating the risk posed by the cancer and tailoring treatment to that risk. In the past, physicians generally estimated such risk based on their personal judgment. However, it has been shown that physicians are


only modestly accurate in predicting risk compared with decision-aided predictive tools.175 Prediction models can be divided into two major types according to their readability for humans: plain and complex models. A plain model is human-friendly and readable because it consists of only a few parameters (usually fewer than six or seven) and follows a simple design that humans can trace. In contrast, a complex model contains so many parameters, and has such a complicated design, that it is very difficult for humans to follow. Generally, nomograms and look-up tables fall under the plain models, whereas prediction models based on artificial neural networks and advanced machine-learning algorithms (e.g., adaptive boosting, SVM, and random forest) are complex. Moreover, model complexity depends on the dimensionality of the input data. Transformation techniques (e.g., principal component analysis or linear discriminant analysis) are available to reduce this dimensionality; however, human readability remains limited even after dimension reduction, because data reduction generally causes some loss of feature identity. Another approach is feature selection based on statistical methods such as correlation or regression analyses. Finally, each developed model needs to be validated. The definition of model validity also depends on the constitution of the study cohort and the model design. The one-to-ten or one-to-five rules (i.e., roughly ten or five outcome events per clinical parameter) are widely adopted for the validity of prediction models in the medical domain, although these rules are not generalizable, and alternative approaches based on sample-size calculations and shrinkage algorithms are available. Further, the validation step is defined according to the study design and sample size. Widely used approaches for internal validation are cross validation and bootstrap resampling, especially in studies with small or moderate sample sizes. Defining the sample size is essential for model development and validation; generally, the sample size is calculated based on the effect size or the confidence interval. External validation means validation on a test set different from the initial study cohort, and it serves as confirmation of model performance on unrelated datasets. The accuracy of model performance is generally reported as the area under the ROC curve, the concordance index, or the F-beta score for categorical endpoints, and as the mean squared error or the coefficient of determination for continuous endpoints. The Brier score or a calibration plot is a useful tool to understand how well the predicted probability of an event reflects the empirical probability. Chun et al. provided comprehensive evaluations of different methods for predicting prostate cancer outcomes (i.e., prostate cancer detection, biochemical recurrence, and the presence of extracapsular extension); their study found that nomograms are more accurate and have better performance metrics than look-up tables, artificial neural networks, classification and regression tree analyses, and risk-group stratification models for all endpoints.
The study also identified that nomograms are more user-friendly, can be used without the need for computational resources, and are well calibrated compared with the alternative methods.176 From a practical standpoint, nomograms and look-up tables are more user-friendly and do not require computational resources. These advantages help keep clinical decisions minimally dependent on computational resources and support the widespread use of such decision-aided tools in clinical routine. Nevertheless, the acceptance of decision-aided tools is still low, partly because the predictions from many nomograms still do not have any clinical


consequence and serve mainly as predictive information. In our opinion, risk classification and risk score systems have achieved higher acceptance than nomograms because urologists and oncologists can memorize them, and they require less time and effort to reach a result than nomograms or other models. Accordingly, the National Comprehensive Cancer Network recommends a modified version of the D'Amico risk stratification for initial risk assessment and then the use of nomograms for more personalized information in patients with prostate cancer.177 Another example of a well-accepted risk score system is the EORTC risk score for superficial bladder cancers, which considers six parameters (Table 16.2) to estimate the risks of recurrence and progression.178 In kidney cancer the selection of targeted therapy is based on a risk score system developed by Heng et al. that includes six prognostic factors to estimate survival.179 A recent study evaluated different machine-learning approaches for the prediction of localized prostate cancer and Gleason upgrading using five preoperative clinical parameters; this study found that artificial neural networks (i.e., a dense neural network and a wide-dense neural network) provided better classification accuracy (F1 score: 0.85 for localized prostate cancer and 0.57 for Gleason upgrading) than other machine-learning algorithms (i.e., logistic regression, classification trees, naïve Bayes, adaptive boosting, random forest, support vector machine, and k-nearest neighbor classification).180 Overall, we believe the major challenges for current AI are model complexity and the willingness to share responsibility: physicians remain liable for their decisions regardless of whether they used AI. Any AI-based solution therefore needs to be reliable and understandable by physicians in order to achieve a trustful interaction between AI and physicians. Prospectively, more effort will be devoted to developing AI solutions with more transparent model designs and data processing that can be visualized in simplified form for better human interpretation. Finally, we expect more research on the automated generation of nomograms or risk classification systems from clinical parameters identified by AI, which would extract the most informative parameters from clinical big data and develop a reasoning system.
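To make the validation and accuracy measures discussed above concrete, the following minimal sketch shows how discrimination (area under the ROC curve) and calibration (Brier score and a calibration curve) could be computed for a simple risk model. It uses scikit-learn on synthetic data and is an illustration only, not a reproduction of any study cited in this section.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)

# Synthetic cohort: five hypothetical preoperative parameters and a binary endpoint
n = 800
X = rng.normal(size=(n, 5))
true_logit = X @ np.array([1.2, -0.8, 0.5, 0.0, 0.3]) - 0.2
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(max_iter=1000)

# Internal validation: 5-fold cross-validated discrimination on the training data
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("Cross-validated AUC: %.2f +/- %.2f" % (cv_auc.mean(), cv_auc.std()))

# Evaluation on a held-out test set, in the spirit of external validation
model.fit(X_train, y_train)
p = model.predict_proba(X_test)[:, 1]
print("Test AUC:", round(roc_auc_score(y_test, p), 2))          # discrimination
print("Brier score:", round(brier_score_loss(y_test, p), 3))    # calibration (lower is better)

# Calibration curve: observed event rate per bin of predicted probability
obs_rate, pred_rate = calibration_curve(y_test, p, n_bins=5)
for pr, ob in zip(pred_rate, obs_rate):
    print("predicted %.2f -> observed %.2f" % (pr, ob))
```

A genuinely external validation would replace the held-out split with an unrelated cohort, and clinical utility would still have to be assessed separately, as discussed in Section 16.8.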

16.8 Future direction The recent development of AI and the increased number of urological research projects related to AI show that we are still exploring the potential applications of AI in urology. Recently, many over-optimistic conclusions have been drawn about the potential of AI to solve complex problems, although we are still at an early stage of understanding how AI can be applied to medical problems. AI has experienced two major hype periods in the past, and the resulting overpromises caused AI to stagnate for long periods. The wisdom that "those who cannot remember the past are condemned to repeat it" should be taken seriously by biomedical AI researchers. Understanding the lessons of the past will help AI evolve in biomedical research; otherwise, the current AI wave will experience a new stagnation period. Strategies are needed to prevent overpromising, for example, starting by solving simple clinical problems and increasing the difficulty of the challenge over time.


TABLE 16.2 EORTC risk score system for superficial bladder cancers.178

Factor                                                      Recurrence   Progression
Number of tumors
  Single                                                    0            0
  2–7                                                       3            3
  ≥8                                                        6            3
Tumor diameter
  <3 cm                                                     0            0
  ≥3 cm                                                     3            3
Prior recurrence rate
  Primary                                                   0            0
  ≤1 recurrence/year                                        2            2
  >1 recurrence/year                                        4            2
Category
  Ta (noninvasive papillary carcinoma)                      0            0
  T1 (tumor invades subepithelial connective tissue)        1            4
Concurrent carcinoma in situ
  No                                                        0            0
  Yes                                                       1            6
Grade
  G1                                                        0            0
  G2                                                        1            0
  G3                                                        2            5
Total score                                                 0–17         0–23

Recurrence score    Probability of recurrence at 1 year, % (95% CI)    Probability of recurrence at 5 years, % (95% CI)
0                   15 (10–19)                                         31 (24–37)
1–4                 24 (21–26)                                         46 (42–49)
5–9                 38 (35–41)                                         62 (58–65)
10–17               61 (55–67)                                         78 (73–84)

Progression score   Probability of progression at 1 year, % (95% CI)   Probability of progression at 5 years, % (95% CI)
0                   0.2 (0–0.7)                                        0.8 (0–1.7)
2–6                 1 (0.4–1.6)                                        6 (5–8)
7–13                5 (4–7)                                            17 (14–20)
14–23               17 (10–24)                                         45 (35–55)

CI, Confidence interval.
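As a brief illustration of how Table 16.2 is applied, the sketch below sums the recurrence and progression points for a hypothetical patient. The function and the example patient are ours, but the point values are taken directly from the table.178

```python
# Point values from Table 16.2 as (recurrence, progression) pairs
EORTC_POINTS = {
    "number_of_tumors": {"single": (0, 0), "2-7": (3, 3), ">=8": (6, 3)},
    "tumor_diameter":   {"<3 cm": (0, 0), ">=3 cm": (3, 3)},
    "prior_recurrence": {"primary": (0, 0), "<=1/year": (2, 2), ">1/year": (4, 2)},
    "category":         {"Ta": (0, 0), "T1": (1, 4)},
    "concurrent_cis":   {"no": (0, 0), "yes": (1, 6)},
    "grade":            {"G1": (0, 0), "G2": (1, 0), "G3": (2, 5)},
}

def eortc_scores(patient):
    """Sum recurrence and progression points for a patient described by the table's factor levels."""
    recurrence = sum(EORTC_POINTS[f][v][0] for f, v in patient.items())
    progression = sum(EORTC_POINTS[f][v][1] for f, v in patient.items())
    return recurrence, progression

# Hypothetical example patient
patient = {
    "number_of_tumors": "2-7",
    "tumor_diameter": "<3 cm",
    "prior_recurrence": "primary",
    "category": "T1",
    "concurrent_cis": "no",
    "grade": "G2",
}
rec, prog = eortc_scores(patient)
print("Recurrence score:", rec, "(range 0-17)")
print("Progression score:", prog, "(range 0-23)")
```

The resulting scores are then mapped to the recurrence and progression probability bands given in the lower part of the table.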


The clinical benefit or utility of AI-assisted tools cannot be assumed from their discriminative accuracy or calibration alone. Therefore, estimating clinical utility is essential to identify the benefits of such tools for clinical decision-making. However, most studies of AI in the medical domain have neglected this measurement despite its importance from the clinical perspective. We believe that the measurement of AI's clinical utility will soon become part of the standard evaluation in the biomedical domain. Our literature review revealed a discrepancy between clinical needs and engineering research efforts, suggesting that communication between the two disciplines is still suboptimal. Moreover, most engineering efforts to solve clinical problems are carried out by research teams that are not supervised or advised by physicians with clinical experience. Research on clinical AI is directed mainly at technical audiences even though the core of this work targets clinical problems. Many technical solutions for clinical problems have not been transferred to clinical research; the reasons are multifactorial and include, for example, limited awareness of the interests and literature of each discipline and the complexity of the technical implementation for researchers focused on clinical application. We believe this dilemma will change in the future and that clinical engineering will evolve to accelerate technological advances in medicine. Further, there will be mediator researchers who are familiar with engineering and also have clinical experience, enabling more productive research efforts. Current AI algorithms are primarily driven by large datasets. Accordingly, we expect a continuation of work aiming to optimize data flow within the existing hospital information system infrastructure to feed AI-based solutions and projects with data. For rare disease conditions, we anticipate new approaches to tackle small and imbalanced datasets by improving the mathematical formulation of artificial neural networks. Although current approaches such as data augmentation or generative adversarial networks can generate synthetic images for training, we consider these temporary solutions: new network architectures or training algorithms are needed that can handle imbalanced or rare cases and that are closer to biological memorization mechanisms. Since data privacy is important in health care, distributed learning concepts or hospital-specific model systems will be widely adopted. We anticipate more data-driven clinical decision-making in the near future, complementing evidence-based decision-making. We believe that data-driven recommendation and health surveillance systems will become part of patient management and will function as assistance tools for patients. We also believe that AI-based solutions will become the standard approach for automating tasks that require visual evaluation in urology, such as diagnostic imaging, cystoscopy, sperm analysis, and urine cytology.
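Regarding the imbalanced-data issue raised above, one common and simple mitigation is to reweight the training loss by class frequency rather than to generate synthetic samples. The following sketch, using scikit-learn on synthetic data, is an illustration of that idea only and is not drawn from any study cited here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, strongly imbalanced dataset (roughly 5% positive cases)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Unweighted model vs. a model whose loss is reweighted by inverse class frequency
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

for name, model in [("unweighted", plain), ("class-weighted", weighted)]:
    bal_acc = balanced_accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: balanced accuracy = {bal_acc:.2f}")
```

For truly rare conditions, such reweighting only goes so far, which is why the text above argues for new architectures and training algorithms rather than purely statistical fixes.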
A key component for the clinical translation of AI technologies is the user interface, as inconsistent and suboptimal behavior of AI solutions can confuse users and reduce their confidence in the technology.181,182 Acceptance and understanding of the AI interface depend in part on the user's background; consequently, the benefits and limitations of the human–AI interface may vary with the clinical experience of urology providers. Human–AI interaction remains one of the major research topics in medical AI, and we expect more clinical studies evaluating how interactions between health providers and AI, or between patients and AI, improve clinical management. Any human–AI


interaction should prevent distraction and misinterpretation caused by the user interface or the AI itself, in order to avoid iatrogenic errors and to develop clinically useful AI tools. The simplicity and the clinical impact of the user interface and of the AI approach will be decisive for the success of AI-based solutions to clinical problems. We anticipate further research on AI solutions for clinical problems that have self-learning capacity. In addition, more studies will develop AI-based research tools to identify previously unknown descriptions or patterns of disease. We also expect more intensive research on "digital markers" for monitoring cancer as a noninvasive approach to stratify patients according to their risk profile or to monitor treatment success. We believe that prospective digital markers can result from mathematical algorithms or from a summation of feature patterns that reflect urologic diseases (e.g., the Prostate Health Index183 or biologic and prognostic feature scores from histology images184,185). Finally, a complete automation system for managing multidimensional clinical data will one day become a reality when the infrastructure of health care information systems enables efficient data streaming to AI. We believe new algorithms will become available to use data storage capacity efficiently and to manage the growing volume of clinical data.
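As an example of a digital marker composed from a few laboratory values, the Prostate Health Index cited above is commonly described as combining total PSA, free PSA, and [-2]proPSA. The sketch below computes it under that commonly reported formula; since the formula and units are not given in this chapter, they should be treated as assumptions for illustration only.183

```python
import math

def prostate_health_index(p2psa_pg_ml, free_psa_ng_ml, total_psa_ng_ml):
    """Commonly reported PHI formula: (p2PSA / free PSA) * sqrt(total PSA).

    Units follow typical laboratory reporting (p2PSA in pg/mL, PSA in ng/mL);
    both the formula and the units are assumptions for illustration only.
    """
    return (p2psa_pg_ml / free_psa_ng_ml) * math.sqrt(total_psa_ng_ml)

# Hypothetical patient values
phi = prostate_health_index(p2psa_pg_ml=12.0, free_psa_ng_ml=0.8, total_psa_ng_ml=5.2)
print(f"PHI = {phi:.1f}")
```

Markers derived from imaging or histology feature scores follow the same pattern: a fixed mathematical combination of measured quantities that can be recomputed and monitored over time.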

References 1. Bush V. As we may think by Vannevar bush the Atlantic monthly, July 1945. Atlantic Monthly. 1945. 2. Turing AM. Computing machinery and intelligence. Parsing the Turing test. Springer; 2009. p. 23 65. 3. Werbos P. Beyond regression: new tools for prediction and analysis in the behavioral sciences [Ph.D. dissertation]. Harvard University; 1974. 4. Werbos PJ. Backpropagation through time: what it does and how to do it. Proc IEEE 1990;78(10):1550 60. 5. LeCun Y, Boser BE, Denker JS, Henderson D, Howard RE, Hubbard WE, et al. editors. Handwritten digit recognition with a back-propagation network. In: Advances in neural information processing systems; 1990. 6. Hahnloser RH, Seung HS, Slotine JJ. Permitted and forbidden sets in symmetric threshold-linear networks. Neural Comput 2003;15(3):621 38. Available from: https://doi.org/10.1162/089976603321192103. PubMed PMID: 12620160. 7. Jarrett K, Kavukcuoglu K, Ranzato MA, LeCun Y, editors. What is the best multi-stage architecture for object recognition? In: 2009 IEEE 12th international conference on computer vision: IEEE, 2009. 8. Competition ILSVR. ,http://www.image-net.org/challenges.LSVRC/.; 2012 [accessed 27.12.16]. 9. Krizhevsky A, Sutskever I, Hinton GE, editors. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. 2012. 10. Teuscher C. Turing’s connectionism: an investigation of neural network architectures. Springer Science & Business Media; 2012. 11. Medler DA. A brief history of connectionism. Neural Comput Surv 1998;1:18 72. 12. Hinton GE, Sejnowski TJ. Learning and relearning in Boltzmann machines. Parallel Distrib Proc: Explor Microstruct Cogn 1986;1(282-317):2. 13. Hinton GE. Deep belief networks. Scholarpedia 2009;4(5):5947. 14. Kohonen T, Honkela T. Kohonen network. Scholarpedia 2007;2(1):1568. 15. Jordan MI. Attractor dynamics and parallelism in a connectionist sequential machine. Artif Neural Netw: Concept Learn 1990;112 27. 16. Graves A, Wayne G, Danihelka I. Neural Turing machines. arXiv preprint arXiv:14105401. 2014. 17. Freund Y, Schapire RE. Large margin classification using the perceptron algorithm. Mach Learn 1999;37 (3):277 96. 18. Yi J, Prybutok VR. A neural network model forecasting for prediction of daily maximum ozone concentration in an industrialized urban area. Environ Pollut 1996;92(3):349 57.


19. Neal RM. Bayesian learning for neural networks. Springer Science & Business Media; 2012. 20. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE 1998;86(11):2278 324. 21. Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G. The graph neural network model. IEEE Trans Neural Netw 2008;20(1):61 80. 22. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in neural information processing systems; 2017. 23. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521(7553):436 44. 24. Vapnik VN. An overview of statistical learning theory. IEEE Trans Neural Netw 1999;10(5):988 99. 25. Agency for Healthcare Research and Quality. Prospects for care coordination measurement using electronic data sources: Challenges of measuring care coordination using electronic data and recommendations to address those challenges. 2012 [cited 2020 16.02.20]. 26. Cote RA, Robboy S. Progress in medical information management. Systematized nomenclature of medicine (SNOMED). JAMA 1980;243(8):756 62. Available from: https://doi.org/10.1001/jama.1980.03300340032015. PubMed PMID: 6986000. 27. Kessel KA, Combs SE. Review of developments in electronic, clinical data collection, and documentation systems over the last decade—are we ready for big data in routine health care? Front Oncol 2016;6:75. 28. Snow PB, Smith DS, Catalona WJ. Artificial neural networks in the diagnosis and prognosis of prostate cancer: a pilot study. J Urol 1994;152(5 Pt 2):1923 6. Available from: https://doi.org/10.1016/s0022-5347(17)32416-3. PubMed PMID: 7523737. 29. Partin AW, Kattan MW, Subong EN, Walsh PC, Wojno KJ, Oesterling JE, et al. Combination of prostatespecific antigen, clinical stage, and Gleason score to predict pathological stage of localized prostate cancer. A multi-institutional update. JAMA 1997;277(18):1445 51 PubMed PMID: 9145716. 30. Partin AW, Mangold LA, Lamm DM, Walsh PC, Epstein JI, Pearson JD. Contemporary update of prostate cancer staging nomograms (Partin Tables) for the new millennium. Urology 2001;58(6):843 8. Available from: https://doi.org/10.1016/s0090-4295(01)01441-8. PubMed PMID: 11744442. 31. Bartels PH, Thompson D, Bartels HG, Montironi R, Scarpelli M, Hamilton PW. Machine vision-based histometry of premalignant and malignant prostatic lesions. Pathol Res Pract 1995;191(9):935 44. Available from: https://doi.org/10.1016/S0344-0338(11)80979-9. PubMed PMID: 8606876. 32. Liao JC, Churchill BM. Pediatric urine testing. Pediatr Clin North Am 2001;48(6):1425 40. Available from: https://doi.org/10.1016/s0031-3955(05)70384-9. PubMed PMID: 11732123. 33. Burtis CA, Ashwood ER, Bruns DE. Tietz textbook of clinical chemistry and molecular diagnostics - E-book. Elsevier Health Sciences; 2012. 34. Intelligent method for dipstick urinalysis using smartphone camera. In: Ginardi RH, Saikhu A, Sarno R, Sunaryono D, Kholimi AS, Shanty RNT, editors. Information and communication technology-EurAsia conference. Springer; 2014. 35. Wicklund E. FDA approves smartphone-based mHealth platform for urinalysis tests 2018. Available from: ,https://mhealthintelligence.com/news/fda-approves-smartphone-based-mhealth-platform-for-urinalysistests.; 2020. [cited 21.02.20]. 36. Wein AJ, Kavoussi LR, Novick AC, Partin AW, Peters CA. Campbell-Walsh urology. Elsevier Health Sciences; 2011. 37. Cha EK, Tirsar LA, Schwentner C, Hennenlotter J, Christos PJ, Stenzl A, et al. 
Accurate risk assessment of patients with asymptomatic hematuria for the presence of bladder cancer. World J Urol 2012;30(6):847 52. Available from: https://doi.org/10.1007/s00345-012-0979-x. PubMed PMID: 23124847; PMCID: PMC4004026. 38. Babjuk M, Burger M, Comperat EM, Gontero P, Mostafid AH, Palou J, et al. European Association of Urology guidelines on non-muscle-invasive bladder cancer (TaT1 and carcinoma in situ) 2019 update. Eur Urol 2019;76(5):639 57. Available from: https://doi.org/10.1016/j.eururo.2019.08.016. PubMed PMID: 31443960. 39. Simerville JA, Maxted WC, Pahira JJ. Urinalysis: a comprehensive review. Am Family Phys 2005;71(6):1153 62. 40. Raitanen M-P, Aine R, Rintala E, Kallio J, Rajala P, Juusela H, et al. Differences between local and review urinary cytology in diagnosis of bladder cancer. An interobserver multicenter analysis. Eur Urol 2002;41(3):284 9. 41. Melder KK, Koss LG. Automated image analysis in the diagnosis of bladder cancer. Appl Opt 1987;26 (16):3367 72. Available from: https://doi.org/10.1364/AO.26.003367. PubMed PMID: 20490066.


42. Sherman AB, Koss LG, Wyschogrod D, Melder KH, Eppich EM, Bales CE. Bladder cancer diagnosis by computer image analysis of cells in the sediment of voided urine using a video scanning system. Anal Quant Cytol Histol 1986;8(3):177 86 Epub 1986/09/01. PubMed PMID: 3778610. 43. Pantazopoulos D, Karakitsos P, Iokim-Liossi A, Pouliakis A, Botsoli-Stergiou E, Dimopoulos C. Back propagation neural network in the discrimination of benign from malignant lower urinary tract lesions. J Urol 1998;159(5):1619 23. Available from: https://doi.org/10.1097/00005392-199805000-00057. PubMed PMID: 9554366. 44. Muralidaran C, Dey P, Nijhawan R, Kakkar N. Artificial neural network in diagnosis of urothelial cell carcinoma in urine cytology. Diagn Cytopathol 2015;43(6):443 9. 45. Valente PT, Schantz HD. Cytology automation: an overview. Lab Med 2001;32(11):686 90. 46. Pavlidis T. Segmentation of pictures and maps through functional approximation. Comput Graph Image Process 1972;1(4):360 72. 47. Zahn CT. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans Comput 1971;100 (1):68 86. 48. Muerle JL. Experimental evaluation of techniques for automatic segmentation of objects in a complex scene. In: Pictorial pattern recognition. 1968. p. 3 13. 49. Haralick R, Dinstein I. A spatial clustering procedure for multi-image data. IEEE Trans Circ Syst 1975;22 (5):440 50. 50. Haralick RM, Shapiro LG. Image segmentation techniques. Comput Vis, Graph, Image Process 1985;29 (1):100 32. 51. Nakamura S. Monitoring the three-dimensional prostate shape under anti-androgen therapy. Diagnostic Ultrasound of the Prostate. New York: Elsevier; 1988. p. 137 44. 52. Prater JS, Richard WD. Segmenting ultrasound images of the prostate using neural networks. Ultrason Imaging 1992;14(2):159 85. Available from: https://doi.org/10.1177/016173469201400205. PubMed PMID: 1604756. 53. Bi H, Jiang Y, Tang H, Yang G, Shu H, Dillenseger JL. Fast and accurate segmentation method of active shape model with Rayleigh mixture model clustering for prostate ultrasound images. Comput Methods Prog Biomed 2020;184:105097. Available from: https://doi.org/10.1016/j.cmpb.2019.105097. PubMed PMID: 31634807. 54. Akbari H, Yang X, Halig LV, Fei B. 3D segmentation of prostate ultrasound images using wavelet transform. Proc SPIE Int Soc Opt Eng 2011;7962:79622K. Available from: https://doi.org/10.1117/12.878072. PubMed PMID: 22468205; PMCID: PMC3314427. 55. Yang ZS, Li CF, Zhou KY, Zhang KH, He L. [Segmentation of the prostate ultrasound image based on an improved geodesic active contour model]. Zhongguo Yi Liao Qi Xie Za Zhi 2008;32(5):316 18 PubMed PMID: 19119647. 56. Pinto PA, Chung PH, Rastinehad AR, Baccala AA, Kruecker J, Benjamin CJ, et al. Magnetic resonance imaging/ultrasound fusion guided prostate biopsy improves cancer detection following transrectal ultrasound biopsy and correlates with multiparametric magnetic resonance imaging. J Urol 2011;186(4):1281 5. 57. Vourganti S, Rastinehad A, Yerram NK, Nix J, Volkin D, Hoang A, et al. Multiparametric magnetic resonance imaging and ultrasound fusion biopsy detect prostate cancer in patients with prior negative transrectal ultrasound biopsies. J Urol 2012;188(6):2152 7. 58. Radtke JP, Schwab C, Wolf MB, Freitag MT, Alt CD, Kesch C, et al. Multiparametric magnetic resonance imaging (MRI) and MRI—transrectal ultrasound fusion biopsy for index tumor detection: correlation with radical prostatectomy specimen. Eur Urol 2016;70(5):846 53. 59. Sohn C. Three-dimensional ultrasound. 
SPIE; 1990. 60. Yung-Nien S, Jiann-Shu L, Jai-Chie C, Wei-Jen Y, editors. Three-dimensional reconstruction of kidney from ultrasonic images. In: Proceedings of IEEE workshop on biomedical image analysis; 24 25 June 1994. 61. Pretorius DH, Nelson TR, Jaffe JS. 3-dimensional sonographic analysis based on color flow Doppler and gray scale image data: a preliminary report. J Ultrasound Med 1992;11(5):225 32. Available from: https://doi.org/ 10.7863/jum.1992.11.5.225. PubMed PMID: 1588693. 62. Fenster A, Downey DB. 3-D ultrasound imaging: a review. IEEE Eng Med Biol Mag 1996;15(6):41 51. Available from: https://doi.org/10.1109/51.544511. 63. Cerny V, Zajac R, editors. Computer aided ultrasound laboratory. In: Proceedings of computer based medical systems; 11 13 June 1997. 64. Topper AK, Jernigan ME. Estimation of bladder wall location in ultrasound images. Med Biol Eng Comput 1991;29(3):297 303. Available from: https://doi.org/10.1007/BF02446712.


¨ lmez T. Segmentation of ultrasound images by using a hybrid neural network. Pattern Recognit 65. Dokur Z, O Lett 2002;23(14):1825 36. Available from: https://doi.org/10.1016/S0167-8655(02)00155-1. 66. Moritz WE, Pearlman AS, McCabe DH, Medema DK, Ainsworth ME, Boles MS. An ultrasonic technique for imaging the ventricle in three dimensions and calculating its volume. IEEE Trans Biomed Eng 1983;(8):482 92. 67. Rohling RN, Gee AH, Berman L. Automatic registration of 3-D ultrasound images. Ultrasound Med Biol 1998;24(6):841 54. Available from: https://doi.org/10.1016/s0301-5629(97)00210-x. PubMed PMID: 9740386. 68. Levienaise-Obadia B, Gee A. Adaptive segmentation of ultrasound images. Image Vis Comput 1999;17 (8):583 8. Available from: https://doi.org/10.1016/S0262-8856(98)00177-2. 69. Ferdeghini EM, Morelli G, Distante A, Giannotti P, Benassi A. Assessment of normal testis growth by quantitative texture analysis of 2-D echo images. Med Eng Phys 1995;17(7):523 8. Available from: https://doi.org/ 10.1016/1350-4533(94)00019-6. PubMed PMID: 7489125. 70. Moxon R, Bright L, Pritchard B, Bowen IM, de Souza MB, da Silva LD, et al. Digital image analysis of testicular and prostatic ultrasonographic echogencity and heterogeneity in dogs and the relation to semen quality. Anim Reprod Sci 2015;160:112 19. Available from: https://doi.org/10.1016/j.anireprosci.2015.07.012. PubMed PMID: 26282522. 71. Annangi P, Frigstad S, Subin SB, Torp A, Ramasubramaniam S, Varna S, editors. An automated bladder volume measurement algorithm by pixel classification using random forests. In: 2016 38th annual international conference of the IEEE engineering in medicine and biology society (EMBC); 16 20 August 2016. 72. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:14091556. 2014. 73. Matsumoto M, Tsutaoka T, Yabunaka K, Handa M, Yoshida M, Nakagami G, et al. Development and evaluation of automated ultrasonographic detection of bladder diameter for estimation of bladder urine volume. PLoS One 2019;14(9). Available from: https://doi.org/10.1371/journal.pone.0219916. e0219916. 74. Ravishankar H, Annangi P, Washburn M, Lanning J. Automated kidney morphology measurements from ultrasound images using texture and edge analysis. SPIE; 2016. 75. Sharma K, Rupprecht C, Caroli A, Aparicio MC, Remuzzi A, Baust M, et al. Automatic segmentation of kidneys using deep learning for total kidney volume quantification in autosomal dominant polycystic kidney disease. Sci Rep 2017;7(1):2049. Available from: https://doi.org/10.1038/s41598-017-01779-0. PubMed PMID: 28515418; PMCID: PMC5435691. 76. Manieri C, Carter SSC, Romano G, Trucchi A, Valentic M, Tubaro A. The diagnosis of bladder outlet obstruction in men by ultrasound measurement of bladder wall thickness. J Urol 1998;159(3):761 5. 77. Gosnell ME, Polikarpov DM, Goldys EM, Zvyagin AV, Gillatt DA. Computer-assisted cystoscopy diagnosis of bladder cancer. Urol Oncol 2018;36(1):8 e9 8 e15. Available from: https://doi.org/10.1016/j.urolonc.2017.08.026. PubMed PMID: 28958822. 78. Eminaga O, Eminaga N, Semjonow A, Breil B. Diagnostic classification of cystoscopic images using deep convolutional neural networks. JCO Clin Cancer Inf 2018;2:1 8. Available from: https://doi.org/10.1200/ CCI.17.00126. PubMed PMID: 30652604. 79. Shkolyar E, Jia X, Chang TC, Trivedi D, Mach KE, Meng MQ, et al. Augmented bladder tumor detection using deep learning. Eur Urol 2019;76(6):714 18. 
Available from: https://doi.org/10.1016/j.eururo.2019.08.032. PubMed PMID: 31537407; PMCID: PMC6889816. 80. Ren S, He K, Girshick R, Sun J, editors. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. 2015. 81. Yuan Y, Qin W, Ibragimov B, Zhang G, Han B, Meng MQ-H, et al. Densely connected neural network with unbalanced discriminant and category sensitive constraints for polyp recognition. IEEE Trans Automat Sci Eng 2019;17:1 10. https://doi.org/10.1109/TASE.2019.2936645. 82. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:13126034. 2013. 83. Grossman HB, Soloway M, Messing E, Katz G, Stein B, Kassabian V, et al. Surveillance for recurrent bladder cancer using a point-of-care proteomic assay. JAMA 2006;295(3):299 305. Available from: https://doi.org/ 10.1001/jama.295.3.299. PubMed PMID: 16418465. 84. Somani BK, Giusti G, Sun Y, Osther PJ, Frank M, De Sio M, et al. Complications associated with ureterorenoscopy (URS) related to treatment of urolithiasis: the Clinical Research Office of Endourological Society URS

III. Clinical applications

332

16. Prospect and adversity of artificial intelligence in urology

Global study. World J Urol 2017;35(4):675 81. Available from: https://doi.org/10.1007/s00345-016-1909-0. PubMed PMID: 27492012; PMCID: PMC5364249. 85. Rosa B, Mozer P, Szewczyk J. An algorithm for calculi segmentation on ureteroscopic images. Int J Comput Assist Radiol Surg 2011;6(2):237 46. Available from: https://doi.org/10.1007/s11548-010-0504-x. PubMed PMID: 20574798. 86. Cooper TG, Noonan E, Von Eckardstein S, Auger J, Baker H, Behre HM, et al. World Health Organization reference values for human semen characteristics. Hum Reprod Update 2010;16(3):231 45. 87. Alvarez C, Castilla JA, Martinez L, Ramirez JP, Vergara F, Gaforio JJ. Biological variation of seminal parameters in healthy subjects. Hum Reprod 2003;18(10):2082 8. Available from: https://doi.org/10.1093/humrep/deg430. PubMed PMID: 14507825. 88. Rijsselaere T, Van Soom A, Hoflack G, Maes D, de Kruif A. Automated sperm morphometry and morphology analysis of canine semen by the Hamilton-Thorne analyser. Theriogenology 2004;62(7):1292 306. Available from: https://doi.org/10.1016/j.theriogenology.2004.01.005. PubMed PMID: 15325556. 89. Moruzzi JF, Wyrobek AJ, Mayall BH, Gledhill BL. Quantification and classification of human sperm morphology by computer-assisted image analysis. Fertil Steril 1988;50(1):142 52 Epub 1988/07/01. PubMed PMID: 3384107. 90. Bendvold E, Aanesen A, Bjoro K. [Objective and automated semen analysis. CellSoft-CASA]. Tidsskr Laegeforen 1988;108(29):2495 7 Epub 1988/10/20. PubMed PMID: 3206464. 91. Mortimer D, Goel N, Shu MA. Evaluation of the CellSoft automated semen analysis system in a routine laboratory setting. Fertil Steril 1988;50(6):960 8. Available from: https://doi.org/10.1016/s0015-0282(16)60381-3. PubMed PMID: 3203762. 92. Mathur SM. Automated semen analysis. Fertil Steril 1989;52(2):343 5. Available from: https://doi.org/ 10.1016/s0015-0282(16)60869-5. PubMed PMID: 2666177. 93. Gallagher MT, Smith DJ, Kirkman-Brown JC. CASA: tracking the past and plotting the future. Reprod Fertil Dev 2018;30(6):867 74. Available from: https://doi.org/10.1071/RD17420. PubMed PMID: 29806989. 94. Engel KM, Grunewald S, Schiller J, Paasch U. Automated semen analysis by SQA Vision((R)) versus the manual approach—a prospective double-blind study. Andrologia 2019;51(1). Available from: https://doi.org/ 10.1111/and.13149. e13149. 95. Hicks SA, Andersen JM, Witczak O, Thambawita V, Halvorsen P, Hammer HL, et al. Machine learning-based analysis of sperm videos and participant data for male fertility prediction. Sci Rep 2019;9(1):16770. Available from: https://doi.org/10.1038/s41598-019-53217-y. PubMed PMID: 31727961; PMCID: PMC6856178. 96. Bartoov B, Berkovitz A, Eltes F, Kogosovsky A, Yagoda A, Lederman H, et al. Pregnancy rates are higher with intracytoplasmic morphologically selected sperm injection than with conventional intracytoplasmic injection. Fertil Steril 2003;80(6):1413 19. 97. Setti AS, Ferreira RC, Braga DPdAF, Figueira RdCS, Iaconelli Jr A, Borges Jr E. Intracytoplasmic sperm injection outcome versus intracytoplasmic morphologically selected sperm injection outcome: a meta-analysis. Reprod Biomed Online 2010;21(4):450 5. 98. Antinori M, Licata E, Dani G, Cerusico F, Versaci C, d’Angelo D, et al. Intracytoplasmic morphologically selected sperm injection: a prospective randomized trial. Reprod Biomed Online 2008;16(6):835 41. 99. Lamb DJ, Niederberger CS. Artificial intelligence in medicine and male infertility. World J Urol 1993;11 (2):129 36. Available from: https://doi.org/10.1007/BF00182040. 100. 
Laboratories SUSE, Widrow B, Research USOoN, Corps USAS, Force USA, Navy US. Adaptive “adaline” neuron using chemical “memistors”. 1960. 101. McCallum C, Riordon J, Wang Y, Kong T, You JB, Sanner S, et al. Deep learning-based selection of human sperm with high DNA integrity. Commun Biol 2019;2:250. Available from: https://doi.org/10.1038/s42003019-0491-6. PubMed PMID: 31286067; PMCID: PMC6610103. 102. Agarwal A, Said TM. Role of sperm chromatin abnormalities and DNA damage in male infertility. Hum Reprod Update 2003;9(4):331 45. Available from: https://doi.org/10.1093/humupd/dmg027. PubMed PMID: 12926527. 103. Uyar A, Bener A, Ciray HN. Predictive modeling of implantation outcome in an in vitro fertilization setting: an application of machine learning methods. Med Decis Mak 2015;35(6):714 25. Available from: https://doi. org/10.1177/0272989X14535984. PubMed PMID: 24842951. 104. Ahmed HU, El-Shater Bosaily A, Brown LC, Gabe R, Kaplan R, Parmar MK, et al. Diagnostic accuracy of multi-parametric MRI and TRUS biopsy in prostate cancer (PROMIS): a paired validating confirmatory


study. Lancet 2017;389(10071):815 22. Available from: https://doi.org/10.1016/S0140-6736(16)32401-1. PubMed PMID: 28110982. 105. Purysko AS, Rosenkrantz AB, Barentsz JO, Weinreb JC, Macura KJ. PI-RADS version 2: a pictorial update. Radiographics 2016;36(5):1354 72. Available from: https://doi.org/10.1148/rg.2016150234. PubMed PMID: 27471952. 106. Rosenkrantz AB, Ginocchio LA, Cornfeld D, Froemming AT, Gupta RT, Turkbey B, et al. Interobserver reproducibility of the PI-RADS version 2 lexicon: a multicenter study of six experienced prostate radiologists. Radiology 2016;280(3):793 804. Available from: https://doi.org/10.1148/radiol.2016152542. PubMed PMID: 27035179; PMCID: PMC5006735. 107. Liu S, Zheng H, Feng Y, Li W. Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI 2017 [updated 2017]. https://arxiv.org/abs/1703.04078. 108. Guo Y, Gao Y, Shen D. Deformable MR prostate segmentation via deep feature learning and sparse patch matching. IEEE Trans Med Imaging 2016;35(4):1077 89. Available from: https://doi.org/10.1109/TMI.2015.2508280. PubMed PMID: 26685226; PMCID: PMC5002995. 109. Le MH, Chen J, Wang L, Wang Z, Liu W, Cheng KT, et al. Automated diagnosis of prostate cancer in multiparametric MRI based on multimodal convolutional neural networks. Phys Med Biol 2017. Available from: https://doi.org/10.1088/1361-6560/aa7731. PubMed PMID: 28582269. 110. Liao S, Gao Y, Oto A, Shen D. Representation learning: a unified deep learning framework for automatic prostate MR segmentation. Med Image Comput Comput Assist Interv 2013;16(Pt 2):254 61 PubMed PMID: 24579148; PMCID: PMC3939619. 111. Litjens G, Toth R, van de Ven W, Hoeks C, Kerkstra S, van Ginneken B, et al. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Med Image Anal 2014;18(2):359 73. Available from: https://doi.org/10.1016/j.media.2013.12.002. PubMed PMID: 24418598; PMCID: PMC4137968. 112. Mehrtash A, Pesteie M, Hetherington J, Behringer PA, Kapur T, Wells 3rd WM, et al. DeepInfer: open-source deep learning deployment toolkit for image-guided therapy. Proc SPIE Int Soc Opt Eng 2017;10135 PubMed PMID: 28615794; PMCID: PMC5467894. Available from: https://doi.org/10.1117/12.2256011. 113. Zhu Y, Wang L, Liu M, Qian C, Yousuf A, Oto A, et al. MRI-based prostate cancer detection with high-level representation and hierarchical classification. Med Phys 2017;44(3):1028 39. Available from: https://doi.org/10.1002/mp.12116. PubMed PMID: 28107548. 114. Azizi S, Imani F, Ghavidel S, Tahmasebi A, Kwak JT, Xu S, et al. Detection of prostate cancer using temporal sequences of ultrasound data: a large clinical feasibility study. Int J Comput Assist Radiol Surg 2016;11(6):947 56. Available from: https://doi.org/10.1007/s11548-016-1395-2. PubMed PMID: 27059021. 115. Azizi S, Mousavi P, Yan P, Tahmasebi A, Kwak JT, Xu S, et al. Transfer learning from RF to B-mode temporal enhanced ultrasound features for prostate cancer detection. Int J Comput Assist Radiol Surg 2017. PubMed PMID: 28349507. Available from: https://doi.org/10.1007/s11548-017-1573-x. 116. Cheng RD, Roth HR, Lu L, Wang SJ, Turkbey B, Gandler W, et al. Active appearance model and deep learning for more accurate prostate segmentation on MRI. Proc SPIE 2016;9784. Available from: https://doi.org/10.1117/12.2216286. Artn 97842i. PubMed PMID: WOS:000382313300088. 117. Craig MC, Fletcher PC, Daly EM, Rymer J, Cutter WJ, Brammer M, et al. Gonadotropin hormone releasing hormone agonists alter prefrontal function during verbal encoding in young women. Psychoneuroendocrinology 2007;32(8 10):1116 27. Available from: https://doi.org/10.1016/j.psyneuen.2007.09.009. PubMed PMID: 17980497. 118. El-Baz AS, Jiang X, Suri JS. Biomedical image segmentation: advances and trends. Boca Raton, FL: CRC Press, Taylor & Francis Group; 2017. xix, 526 pages p. 119. Reda I, Shalaby A, Elmogy M, Elfotouh AA, Khalifa F, El-Ghar MA, et al. A comprehensive non-invasive framework for diagnosing prostate cancer. Comput Biol Med 2017;81:148 58. Available from: https://doi.org/10.1016/j.compbiomed.2016.12.010. PubMed PMID: 28063376. 120. Armato SG, Huisman H, Drukker K, Hadjiiski L, Kirby JS, Petrick N, et al. PROSTATEx Challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images. J Med Imaging 2018;5(4):044501. 121. Epstein JI, Zelefsky MJ, Sjoberg DD, Nelson JB, Egevad L, Magi-Galluzzi C, et al. A contemporary prostate cancer grading system: a validated alternative to the Gleason score. Eur Urol 2016;69(3):428 35. Available from: https://doi.org/10.1016/j.eururo.2015.06.046. PubMed PMID: 26166626; PMCID: PMC5002992.


122. Eminaga O, Hinkelammert R, Abbas M, Titze U, Eltze E, Bettendorf O, et al. Prostate cancers detected on repeat prostate biopsies show spatial distributions that differ from those detected on the initial biopsies. BJU Int 2015;116(1):57 64. Available from: https://doi.org/10.1111/bju.12691. PubMed PMID: 24552505. 123. Payton S. Prostate cancer: new nomogram predicts risk of Gleason upgrading. Nat Rev Urol 2013;10(10):553. Available from: https://doi.org/10.1038/nrurol.2013.218. PubMed PMID: 24061534. 124. Truong M, Slezak JA, Lin CP, Iremashvili V, Sado M, Razmaria AA, et al. Development and multiinstitutional validation of an upgrading risk tool for Gleason 6 prostate cancer. Cancer 2013;119 (22):3992 4002. Available from: https://doi.org/10.1002/cncr.28303. PubMed PMID: 24006289; PMCID: PMC4880351. 125. Epstein JI, Feng Z, Trock BJ, Pierorazio PM. Upgrading and downgrading of prostate cancer from biopsy to radical prostatectomy: incidence and predictive factors using the modified Gleason grading system and factoring in tertiary grades. Eur Urol 2012;61(5):1019 24. Available from: https://doi.org/10.1016/j.eururo.2012.01.050. PubMed PMID: 22336380; PMCID: PMC4659370. 126. Schelb P, Kohl S, Radtke JP, Wiesenfarth M, Kickingereder P, Bickelhaupt S, et al. Classification of cancer at prostate MRI: deep learning versus clinical PI-RADS assessment. Radiology 2019;293(3):607 17. 127. Gibbs P, Liney GP, Pickles MD, Zelhof B, Rodrigues G, Turnbull LW. Correlation of ADC and T2 measurements with cell density in prostate cancer at 3.0 Tesla. Investig Radiol 2009;44(9):572 6. 128. Eminaga O, Loening A, Lu A, Brooks JD, Rubin D. Detection of prostate cancer and determination of its significance using explainable artificial intelligence. Journal of Clinical Oncology 38 (15_suppl), 5555. 129. Wein AJ, Kavoussi LR, Novick AC, Partin AW, Peters CA. Campbell-Walsh urology: expert consult premium edition: enhanced online features and print, 4-volume set. Elsevier Health Sciences; 2011. 130. Papalia R, De Castro Abreu AL, Panebianco V, Duddalwar V, Simone G, Leslie S, et al. Novel kidney segmentation system to describe tumour location for nephron-sparing surgery. World J Urol 2015;33(6):865 71. Available from: https://doi.org/10.1007/s00345-014-1386-2. PubMed PMID: 25159872. 131. Bugajska K, Skalski A, Gajda J, Drewniak T, editors. The renal vessel segmentation for facilitation of partial nephrectomy. In: 2015 signal processing: algorithms, architectures, arrangements, and applications (SPA); 23 25 September 2015. 132. Selfridge PG, Judith MS, Prewitt MS, Dyer CR, Ranade S, editors. Segmentation algorithms for abdominal computerized tomography scans. In: COMPSAC 79 proceedings computer software and the IEEE computer society’s third international applications conference, 1979; 6 8 November 1979. 133. Selfridge P, Prewitt J. Boundary-finding scheme and the problem of complex object localization in computed tomography images. SPIE; 1979. 134. Lin DT, Lei CC, Hung SW. Computer-aided kidney segmentation on abdominal CT images. IEEE Trans Inf Technol Biomed 2006;10(1):59 65. Available from: https://doi.org/10.1109/titb.2005.855561. PubMed PMID: 16445250. 135. Automatic detection and segmentation of kidneys in 3D CT images using random forests. In: Cuingnet R, Prevost R, Lesage D, Cohen LD, Mory B, Ardon R, editors. International conference on medical image computing and computer-assisted intervention. Springer; 2012. 136. Hinton G, Srivastava N, Swersky K. 
Neural networks for machine learning lecture 6a overview of mini-batch gradient descent, 2019. 137. Gibson E, Giganti F, Hu Y, Bonmati E, Bandula S, Gurusamy K, et al. Automatic multi-organ segmentation on abdominal CT with dense V-networks. IEEE Trans Med Imaging 2018;37(8):1822 34. Available from: https://doi.org/10.1109/TMI.2018.2806309. PubMed PMID: 29994628; PMCID: PMC6076994. 138. Mannil M, von Spiczak J, Hermanns T, Poyet C, Alkadhi H, Fankhauser CD. Three-dimensional texture analysis with machine learning provides incremental predictive information for successful shock wave lithotripsy in patients with kidney stones. J Urol 2018;200(4):829 36. Available from: https://doi.org/10.1016/j. juro.2018.04.059. PubMed PMID: 29673945. 139. Browne RF, Meehan CP, Colville J, Power R, Torreggiani WC. Transitional cell carcinoma of the upper urinary tract: spectrum of imaging findings. Radiographics 2005;25(6):1609 27. Available from: https://doi.org/ 10.1148/rg.256045517. PubMed PMID: 16284138. 140. Millan-Rodriguez F, Chechile-Toniolo G, Salvador-Bayarri J, Huguet-Perez J, Vicente-Rodriguez J. Upper urinary tract tumors after primary superficial bladder tumors: prognostic factors and risk groups. J Urol 2000;164(4):1183 7 PubMed PMID: 10992362.


141. Trinh TW, Glazer DI, Sadow CA, Sahni VA, Geller NL, Silverman SG. Bladder cancer diagnosis with CT urography: test characteristics and reasons for false-positive and false-negative results. Abdom Radiol (NY) 2018;43(3):663 71. Available from: https://doi.org/10.1007/s00261-017-1249-6. PubMed PMID: 28677000. 142. Paik ML, Scolieri MJ, Brown SL, Spirnak JP, Resnick MI. Limitations of computerized tomography in staging invasive bladder cancer before radical cystectomy. J Urol 2000;163(6):1693 6 PubMed PMID: 10799162. 143. Narumi Y, Kumatani T, Sawai Y, Kuriyama K, Kuroda C, Takahashi S, et al. The bladder and bladder tumors: imaging with three-dimensional display of helical CT data. AJR Am J Roentgenol 1996;167(5):1134 5. Available from: https://doi.org/10.2214/ajr.167.5.8911164. PubMed PMID: 8911164. 144. Vining DJ, Zagoria RJ, Liu K, Stelts D. CT cystoscopy: an innovation in bladder imaging. AJR Am J Roentgenol 1996;166(2):409 10. Available from: https://doi.org/10.2214/ajr.166.2.8553956. PubMed PMID: 8553956. 145. Ma X, Hadjiiski LM, Wei J, Chan HP, Cha KH, Cohan RH, et al. U-Net based deep learning bladder segmentation in CT urography. Med Phys 2019;46(4):1752 65. Available from: https://doi.org/10.1002/mp.13438. PubMed PMID: 30734932; PMCID: PMC6453730. 146. Cha KH, Hadjiiski L, Samala RK, Chan HP, Caoili EM, Cohan RH. Urinary bladder segmentation in CT urography using deep-learning convolutional neural network and level sets. Med Phys 2016;43(4):1882. Available from: https://doi.org/10.1118/1.4944498. PubMed PMID: 27036584; PMCID: PMC4808067. 147. Cha KH, Hadjiiski L, Chan H-P, Weizer AZ, Alva A, Cohan RH, et al. Bladder cancer treatment response assessment in CT using radiomics with deep-learning. Sci Rep 2017;7(1):1 12. 148. Wu S, Zheng J, Li Y, Yu H, Shi S, Xie W, et al. A radiomics nomogram for the preoperative prediction of lymph node metastasis in bladder cancer. Clin Cancer Res 2017;23(22):6904 11. Available from: https://doi. org/10.1158/1078-0432.CCR-17-1510. PubMed PMID: 28874414. 149. Van Der Meijden A, Sylvester R, Collette L, Bono A, Ten Kate F. The role and impact of pathology review on stage and grade assessment of stages Ta and T1 bladder tumors: a combined analysis of 5 European Organization for Research and Treatment of Cancer Trials. J Urol 2000;164(5):1533 7. Available from: https://doi.org/10.1097/00005392-200011000-00017. PubMed PMID: 11025698. 150. Kwoh YS, Hou J, Jonckheere EA, Hayati S. A robot with improved absolute positioning accuracy for CT guided stereotactic brain surgery. IEEE Trans Biomed Eng 1988;35(2):153 60. Available from: https://doi. org/10.1109/10.1354. PubMed PMID: 3280462. 151. Davies BL, Hibberd RD, Ng WS, Timoney AG, Wickham JE. The development of a surgeon robot for prostatectomies. Proc Inst Mech Eng H 1991;205(1):35 8. Available from: https://doi.org/10.1243/ PIME_PROC_1991_205_259_02. PubMed PMID: 1670073. 152. Merrill SB, Sohl BS, Thompson RH, Reese AC, Parekh DJ, Lynch JH, et al. The balance between open and robotic training among graduating urology residents: does surgical technique need monitoring? J Urol 2020;203(5):996 1002. Available from: https://doi.org/10.1097/JU.0000000000000689. 153. Finkelstein J, Eckersberger E, Sadri H, Taneja SS, Lepor H, Djavan B. Open versus laparoscopic versus robotassisted laparoscopic prostatectomy: the European and US experience. Rev Urol 2010;12(1):35. 154. Winkler-Schwartz A, Yilmaz R, Mirchi N, Bissonnette V, Ledwos N, Siyar S, et al. 
Machine learning identification of surgical and operative factors associated with surgical expertise in virtual reality simulation. JAMA Netw Open 2019;2(8). e198363-e. 155. Agha RA, Fowler AJ. The role and validity of surgical simulation. Int Surg 2015;100(2):350 7. Available from: https://doi.org/10.9738/INTSURG-D-14-00004.1. PubMed PMID: 25692441; PMCID: PMC4337453. 156. Stanford JL, Feng Z, Hamilton AS, Gilliland FD, Stephenson RA, Eley JW, et al. Urinary and sexual function after radical prostatectomy for clinically localized prostate cancer: the Prostate Cancer Outcomes Study. JAMA 2000;283(3):354 60. 157. Etzioni DA, Wasif N, Dueck AC, Cima RR, Hohmann SF, Naessens JM, et al. Association of hospital participation in a surgical outcomes monitoring program with inpatient complications and mortality. JAMA 2015;313(5):505 11. 158. Glaser B, Danzer S, Neumuth T. Intra-operative surgical instrument usage detection on a multi-sensor table. Int J Comput Assist Radiol Surg 2015;10(3):351 62. Available from: https://doi.org/10.1007/s11548-014-1066-0. PubMed PMID: 24830533. 159. Gorek J, Scholl TU, Finley E, Kim AC, Herman M. Surgical trajectory monitoring system and related methods. Google Patents; 2013.


160. Porpiglia F, Checcucci E, Amparore D, Piramide F, Volpi G, Granato S, et al. Three-dimensional augmented reality robot-assisted partial nephrectomy in case of complex tumours (PADUA ./ 5 10): a new intraoperative tool overcoming the ultrasound guidance. Eur Urol 2019. Available from: https://doi.org/10.1016/j.eururo.2019.11.024. PubMed PMID: 31898992. 161. Porpiglia F, Checcucci E, Amparore D, Manfredi M, Massa F, Piazzolla P, et al. Three-dimensional elastic augmented-reality robot-assisted radical prostatectomy using hyperaccuracy three-dimensional reconstruction technology: a step further in the identification of capsular involvement. Eur Urol 2019;76(4):505 14. Available from: https://doi.org/10.1016/j.eururo.2019.03.037. PubMed PMID: 30979636. 162. Simpfendorfer T, Baumhauer M, Muller M, Gutt CN, Meinzer HP, Rassweiler JJ, et al. Augmented reality visualization during laparoscopic radical prostatectomy. J Endourol 2011;25(12):1841 5. Available from: https://doi.org/10.1089/end.2010.0724. PubMed PMID: 21970336. 163. Ukimura O, Gill IS, Desai MM, Steinberg AP, Kilciler M, Ng CS, et al. Real-time transrectal ultrasonography during laparoscopic radical prostatectomy. J Urol 2004;172(1):112 18. Available from: https://doi.org/ 10.1097/01.ju.0000128914.21240.c8. PubMed PMID: 15201749. 164. Ukimura O, Magi-Galluzzi C, Gill IS. Real-time transrectal ultrasound guidance during laparoscopic radical prostatectomy: impact on surgical margins. J Urol 2006;175(4):1304 10. Available from: https://doi.org/ 10.1016/S0022-5347(05)00688-9. PubMed PMID: 16515987. 165. Ukimura O, Nakamoto M, Gill IS. Three-dimensional reconstruction of renovascular-tumor anatomy to facilitate zero-ischemia partial nephrectomy. Eur Urol 2012;61(1):211 17. Available from: https://doi.org/ 10.1016/j.eururo.2011.07.068. PubMed PMID: 21937162. 166. Teber D, Guven S, Simpfendo¨rfer T, Baumhauer M, Gu¨ven EO, Yencilek F, et al. Augmented reality: a new tool to improve surgical accuracy during laparoscopic partial nephrectomy? Preliminary in vitro and in vivo results. Eur Urol 2009;56(2):332 8. 167. Kobayashi S, Cho B, Mutaguchi J, Inokuchi J, Tatsugami K, Hashizume M, et al. Surgical navigation improves renal parenchyma volume preservation in robot-assisted partial nephrectomy: a propensity scorematched comparative analysis. J Urol 2020;204(1). Epub 2019/12/21. https://doi.org/10.1097/ JU.0000000000000709. PubMed PMID: 31859597. 168. Ioannou PA, Chien C-C. Autonomous intelligent cruise control. IEEE Trans Veh Technol 1993;42(4):657 72. 169. Towards a viable autonomous driving research platform. In: Wei J, Snider JM, Kim J, Dolan JM, Rajkumar R, Litkouhi B, editors. 2013 IEEE intelligent vehicles symposium (IV). IEEE; 2013. 170. Shademan A, Decker RS, Opfermann JD, Leonard S, Krieger A, Kim PC. Supervised autonomous robotic soft tissue surgery. Sci Transl Med 2016;8(337):337 64. Available from: https://doi.org/10.1126/scitranslmed. aad9398. PubMed PMID: 27147588. 171. Staub C, Osa T, Knoll A, Bauernschmitt R, editors. Automation of tissue piercing using circular needles and vision guidance for computer aided laparoscopic surgery. In: 2010 IEEE international conference on robotics and automation; 3 7 May 2010. 172. Mayer H, Nagy I, Burschka D, Knoll A, Braun EU, Lange R, et al. editors. Automation of manual tasks for minimally invasive surgery. In: Fourth international conference on autonomic and autonomous systems (ICAS’08); 16 21 March 2008. 173. Wei GQ, Arbter K, Hirzinger G. 
Real-time visual servoing for laparoscopic surgery. Controlling robot motion with color image segmentation. IEEE Eng Med Biol Mag 1997;16(1):40 5. Available from: https://doi.org/ 10.1109/51.566151. PubMed PMID: 9058581. 174. Omote K, Feussner H, Ungeheuer A, Arbter K, Wei GQ, Siewert JR, et al. Self-guided robotic camera control for laparoscopic surgery compared with human camera control. Am J Surg 1999;177(4):321 4. Available from: https://doi.org/10.1016/s0002-9610(99)00055-0. PubMed PMID: 10326852. 175. Specht MC, Kattan MW, Gonen M, Fey J, Van Zee KJ. Predicting nonsentinel node status after positive sentinel lymph biopsy for breast cancer: clinicians versus nomogram. Ann Surg Oncol 2005;12(8):654 9. Available from: https://doi.org/10.1245/ASO.2005.06.037. PubMed PMID: 16021535. 176. Chun FK, Karakiewicz PI, Briganti A, Walz J, Kattan MW, Huland H, et al. A critical appraisal of logistic regression-based nomograms, artificial neural networks, classification and regression-tree models, look-up tables and risk-group stratification models for prostate cancer. BJU Int 2007;99(4):794 800. Available from: https://doi.org/10.1111/j.1464-410X.2006.06694.x. PubMed PMID: 17378842.


177. Mohler JL, Antonarakis ES, Armstrong AJ, D’Amico AV, Davis BJ, Dorff T, et al. Prostate Cancer, Version 2.2019, NCCN clinical practice guidelines in oncology. J Natl Compr Canc Netw 2019;17(5):479 505. Available from: https://doi.org/10.6004/jnccn.2019.0023. PubMed PMID: 31085757. 178. Sylvester RJ, van der Meijden AP, Oosterlinck W, Witjes JA, Bouffioux C, Denis L, et al. Predicting recurrence and progression in individual patients with stage Ta T1 bladder cancer using EORTC risk tables: a combined analysis of 2596 patients from seven EORTC trials. Eur Urol 2006;49(3):466 77. 179. Heng DY, Xie W, Regan MM, Warren MA, Golshayan AR, Sahi C, et al. Prognostic factors for overall survival in patients with metastatic renal cell carcinoma treated with vascular endothelial growth factor-targeted agents: results from a large, multicenter study. J Clin Oncol 2009;27(34):5794 9. Available from: https://doi.org/ 10.1200/JCO.2008.21.4809. PubMed PMID: 19826129. 180. Eminaga O, Al-Hamad O, Boegemann M, Breil B, Semjonow A. Combination possibility and deep learning model as clinical decision-aided approach for prostate cancer. Health Informatics J 2020;26(2). PubMed PMID: 31238766.1460458219855884. Available from: https://doi.org/10.1177/1460458219855884. 181. Why do they refuse to use my robot?: Reasons for non-use derived from a long-term home study. In: De Graaf M, Allouch SB, Van Diik J, editors. 2017 12th ACM/IEEE international conference on human-robot interaction (HRI). IEEE; 2017. 182. Amershi AS, Weld D, Vorvoreanu M, Fourney A, Nushi B, Collisson P, et al. Guidelines for human-AI Interaction. Proceedings of the 2019 CHI conference on human factors in computing systems. Glasgow: Association for Computing Machinery; 2019. p. Paper 3. 183. Catalona WJ, Partin AW, Sanda MG, Wei JT, Klee GG, Bangma CH, et al. A multicenter study of [-2]proprostate specific antigen combined with prostate specific antigen and free prostate specific antigen for prostate cancer detection in the 2.0 to 10.0 ng/ml prostate specific antigen range. J Urol 2011;185(5):1650 5. Available from: https://doi.org/10.1016/j.juro.2010.12.032. PubMed PMID: 21419439; PMCID: PMC3140702. 184. Eminaga O, Abbas M, Tolkach Y, Nolley R, Kunder C, Semjonow A, et al. Biologic and prognostic feature scores from whole-slide histology images using deep learning. arXiv preprint arXiv:191009100. 2019. 185. Eminaga O, Loening A, Lu A, Brooks JD, Rubin D. Determination of biologic and prognostic feature scores from whole slide histology images using deep learning. Journal of Clinical Oncology 38 (15_suppl), e17527.

III. Clinical applications

17 Meaningful incorporation of artificial intelligence for personalized patient management during cancer: Quantitative imaging, risk assessment, and therapeutic outcomes

Elisa Warner*, Nicholas Wang*, Joonsang Lee and Arvind Rao

* Equal contribution.

Abstract

Personalized medicine stands to become the future of health care in many domains, including oncology, which is both a reflection and cause of increasing genetic discoveries and potential decision support models. In this chapter, we focus on research surrounding decision support in cancer, with a lens on the medical journey of patients with cancer. Three areas of focus follow a patient from diagnosis to assigning the best therapies: (1) quantitative imaging, which uses numerical values extracted from images to predict outcomes such as a diagnosis, (2) risk assessment in cancer patients, which assigns a prognosis, and (3) therapeutic outcome prediction, which attempts to assign each patient to their predicted best therapy. In the final section of this chapter, we elaborate on what we believe to be meaningful incorporation of artificial intelligence for personalized cancer care. We urge model creators to be transparent with their workflows, label acquisition, validation, and performance reporting in order to meet the meaningful criteria and therefore increase public confidence in their software for clinical use.

Keywords: Machine learning; quantitative imaging; oncology; therapeutic outcome; risk assessment; clinical decision support; cancer

17.1 Introduction


FIGURE 17.1 This chapter follows a patient’s journey during cancer from diagnosis to prognosis to treatment. For each of these, we discuss an area of study where artificial intelligence (namely, machine learning) is improving personalized care.

While significant advances in cancer research and treatment over the last decade have resulted in declining mortality rates among cancer patients, a strong need for further research and innovation in the field prevails. Cancer continues to be a leading cause of death in the United States, where it is the second most common cause of death.1 In 2016 over 1.6 million Americans were diagnosed with cancer, with prostate and breast cancers constituting the largest share of new cases in men and women, respectively.

In the research realm, other developments have ensued, already shifting the direction of cancer research. Personalized medicine stands to become the future of health care in many domains, including oncology, which is both a reflection and cause of increasing genetic discoveries and potential decision support models. As a result, artificial intelligence (AI) and its subfield, machine learning, have become popular methods in both general biological and oncological research. In the last 20 years, PubMed citations for "cancer + machine learning" have grown exponentially, reflecting this fervor for machine learning in both discovery and decision support. In this chapter, we focus on research surrounding the latter, with a lens on the medical journey of patients with cancer. We focus on three particular subfields that have been influenced heavily by machine learning in recent years. These three areas of focus follow a patient from diagnosis to assigning the best therapies and are summarized in Fig. 17.1: (1) quantitative imaging, which uses numerical values extracted from images to predict outcomes such as a diagnosis, (2) risk assessment in cancer patients, which assigns a prognosis, and (3) therapeutic outcome prediction, which attempts to assign each patient to their predicted best therapy.

17.1.1 Workflow

Machine learning studies for decision support follow a similar overall workflow, in which, counterintuitively, the least amount of time is typically spent on model building and inference, while the majority is spent on data acquisition and preprocessing. While there may be several additional details involved in each step, we present the following general anatomy of a machine learning workflow.


17.1.1.1 Data acquisition

In machine learning and deep learning, data acquisition is one of the most critical steps. Its purpose is to find or create datasets that can be used to build models. Data acquisition starts at the study design phase, relying on proper record-keeping, unbiased reporting, and fairly complete records. Acquisition can also involve data harmonization, which integrates different datasets into a single set. The amount and quality of medical input data are critical factors that influence a model's performance, sometimes implicitly: even if a model scores well on metrics such as accuracy or AUC, it may still perform poorly in real-life clinical settings if the data it was trained on were significantly biased or came from a poor study design.

If working with images, augmentation can be applied once the data are obtained. Augmentation is very useful for increasing the diversity of data available for training: operations such as cropping, padding, rotation, other spatial transformations, and horizontal flipping are commonly used to artificially expand the size of a training dataset. Augmentation can also be considered a preprocessing method. However, any synthetically created perturbation should match the physics of the system at hand; changing the brightness of a photograph of a dog does not make it any less a dog, but changing the intensity on a computed tomography (CT) scan could turn muscle into bone. Such perturbations can also be applied after training to assess the robustness and quality of a model.

Finally, data acquisition involves ground truth labeling. To train supervised models, a labeled training set must be provided, so data must be labeled by an expert according to the desired outcome. For example, in segmentation problems for tumor detection, a radiologist can be asked to delineate the location of a tumor in an image; in discrimination problems, a physician may be asked to review lab values, biomarker information, or images and determine whether a disease is present in a given patient. It is common practice to assign more than one expert labeler to reduce potential bias in labeling. To provide the basis for a good model, each data observation should carry an accurately labeled ground truth.

17.1.1.2 Preprocessing

Next, image data need to be preprocessed before being fed into the model. The purpose of image preprocessing is to improve the image data by suppressing unwanted distortions or enhancing features important for further processing. Several preprocessing algorithms have been studied for accuracy, variability, and reproducibility.2 Image preprocessing typically consists of image scaling, intensity normalization, dimensionality reduction, and noise addition or reduction. Image scaling refers to resizing a digital image to a higher or lower number of pixels; because images fed into an AI algorithm often vary in size, a base size must be established before training. In deep neural networks, intensity rescaling is commonly employed to restrict the image to the range of 0–1 to avoid overflow, optimization, and stability issues.
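As a concrete illustration of the rescaling and augmentation ideas above, the minimal sketch below uses NumPy only; the array shapes, the per-image min–max rescaling, and the restriction to flips and 90-degree rotations are illustrative assumptions rather than a recommended recipe.

```python
import numpy as np

def rescale_intensity(img: np.ndarray) -> np.ndarray:
    """Map image intensities onto the 0-1 range (min-max rescaling)."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)  # epsilon guards against a constant image

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random horizontal flip plus rotation by a multiple of 90 degrees.

    Purely spatial perturbations such as these usually respect the physics of
    the image; intensity perturbations need modality-specific care (e.g., CT).
    """
    if rng.random() < 0.5:
        img = np.fliplr(img)
    return np.rot90(img, k=int(rng.integers(0, 4)))

rng = np.random.default_rng(0)
image = rng.normal(size=(128, 128))            # stand-in for a real image slice
prepared = augment(rescale_intensity(image), rng)
```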
Gray scaling is another common transformation, which converts a color RGB image into an image represented only by shades of gray. Gray scaling is commonly used as a preprocessing step in


machine learning, especially in radiomics.3 Image normalization in deep learning refers to intensity rescaling, gray scaling, centering, and standard deviation normalization.

In AI classification problems, there are often too many correlated and/or redundant features, which increase computational requirements but provide no new information. They may also bias evaluation metrics, as they arbitrarily increase the dimensionality of the prediction space and tend to pull data points further apart. Therefore conventional practice aims to reduce dimensionality through linear algebra techniques such as projection or factorization, or via feature selection. Dimensionality reduction is the process of reducing the number of random variables or features under consideration by obtaining a set of principal variables or features.

Whether to carry out the next step of data preparation, feature extraction, depends on whether the model will automatically extract features from its input signal. In many convolutional neural networks (CNNs) the feature extraction step is skipped, because the model itself determines features of importance directly from the pixel intensities of the image. In most other models, however, feature extraction must be implemented, whereby quantitative features are pulled from an existing image as representative summaries and then fed into the model.

17.1.1.3 Model building and evaluation

Once the data are collected and preprocessed, they need to be split into at least two groups: a training set and a test set. The training data are used to train a model and the test data are used to evaluate the trained model. In other cases, another subset of the training set is removed as a validation set, which is used to assess the performance of the model built in the training phase. In this situation, k-fold cross-validation is one of the most popular methods for estimating how accurately a predictive model will perform: the data are divided into k subsets, one of which is held out while the other k − 1 subsets are used to train the model, and the procedure is repeated k times. Other common forms of validation include leave-one-out cross-validation and bootstrapping. After validation, the test set is used to evaluate the performance of the trained model; this is the final step, conducted only after validation is completed. (A minimal cross-validation sketch appears at the end of Section 17.1.1.)

There are several metrics with which to evaluate a model's performance, and the choice depends on the machine learning task, such as regression, classification, or clustering. In general, the root mean squared error is commonly used in regression problems, and accuracy, precision, and recall are commonly used in classification problems. Cross-validation is also used to compare the performance of different machine learning models on the same dataset.

17.1.1.4 Inference

While the terms "inference" and "prediction" are sometimes muddled in the machine learning community, inference has traditionally referred to understanding the factors that influence the distribution of some given data.4 Prediction, on the other hand, is the forward-looking notion of taking data inputs and using them to evaluate new examples.
Statistical inference techniques were developed on much smaller datasets but provide some insight into the data generation process even in a big data world. Typically, statistical models will


derive their coefficients from the data and use these coefficients to interpret and understand predictions. Most machine learning techniques focus solely on obtaining predictions and are not concerned with creating a parsimonious and interpretable model of the world. That said, research is being conducted into the interpretability of deep learning models, such as work on adversarial networks and network dissection.5,6 Both predictive power and the inferences a model makes are important, but their relative weight varies with the application and its goals.
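To make the model building and evaluation step described above concrete, the minimal sketch below runs fivefold cross-validation with scikit-learn. The synthetic feature matrix, the logistic regression classifier, and AUC as the scored metric are illustrative assumptions; the point is that wrapping the scaler and classifier in a single Pipeline keeps normalization statistics inside each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a preprocessed feature table (rows = patients).
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# The scaler is fit inside each fold, so the held-out fold never leaks
# into the normalization statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC per fold: {np.round(scores, 3)}, mean = {scores.mean():.3f}")
```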

17.1.2 Meaningful incorporation of machine learning

Although the research world blooms with AI-based decision support models for disease detection, risk assessment, and therapeutic outcome prediction, few models are actually employed clinically, demonstrating the challenges of bringing machine learning-based software as a medical device to market. Discussions in the research community suggest that machine learning-based decision support models that show promise for clinical use must be held to high standards to ensure clinical safety and efficacy, and that proper data acquisition and preprocessing strategies, as well as validation and metrics, should be used to ensure that models are appropriately comparable and that bias is minimized.

The next section of this chapter highlights recent studies that utilize machine learning in the aforementioned clinical realms of disease detection, risk assessment, and therapeutic outcome prediction. In the final section, we turn to the discussion surrounding meaningful incorporation of AI models for clinical use, elaborating on what we believe to be meaningful incorporation of AI for personalized cancer care. We urge model creators to be transparent with their workflows, label acquisition, validation, and performance reporting in order to meet the meaningful criteria and therefore increase public confidence in their software for clinical use.

17.2 Quantitative imaging

Quantitative imaging in cancer involves the study of medical images, generally in a radiologic context, where imaging biomarkers are extracted for use in clinical problems such as diagnosis and staging. More broadly, quantitative imaging covers any analysis of imaging whereby numerical features are extracted and fed into a predictive model, and it is therefore heavily involved in both the data acquisition and preprocessing steps.

In practice, radiologists are trained to interpret clinical imaging and provide a synopsis of a patient's problem of interest in the form of a report, noting abnormalities in the images with respect to the patient's clinical history. In the case of cancer, a radiologist may look for areas of especially dense tissue or high metabolic activity in order to ascertain the presence of a tumor, and texture, size, or metabolic characteristics of the tumor may help them further assess stage and prognosis. Quantitative imaging leverages the information from the same images, but in a different way, ideally providing complementary insights.


Quantitative imaging algorithms are often targeted toward a specific disease process or problem such as cancer subtype. Biomarkers range from the relatively simple (e.g., the volume of a lesion) to much more complex texture features via “feature engineering,” to often uninterpretable (but effective) features produced by CNNs. AI in the narrow sense is being used in specific steps to improve the richness of data, the efficiency of the process, and its consistency. Beyond feature extraction and biomarker discovery in cancer images, AI is being used all along the quantitative imaging workflow, from data acquisition to feature extraction, modeling, and inference. There are a variety of imaging modalities where quantitative imaging is used, including, but not limited to, magnetic resonance imaging (MRI), CT, ultrasound, and positron emission tomography (PET). While similar algorithms can be applied to each of these modalities, the underlying physics of these modalities can change how these AI systems are interpreted, how they can be applied, and where they break.

17.2.1 Brief overview of the physics of imaging modalities

CT takes a series of X-ray images through the body in a radial fashion and combines them to create a 3D volume based on the radiodensity of the body.7 While CT does not quite measure density, it has a fixed scale of Hounsfield units (HU), which runs from air at −1024 HU to water at 0 HU and beyond. CT is often used because of the relative speed with which it can be obtained, and while it is particularly good at differentiating bone, it still provides good contrast in the soft tissues of the body, thus enabling detection of tumors in the brain and other regions. One downside of CT studies is their use of ionizing radiation, which means that radiation dosage and usage of CT must be carefully controlled.

MRI studies the resonance of hydrogen atoms in the body in the presence of a very strong magnetic field. By altering parameters of the MR acquisition, different aspects of the hydrogen resonance signal can be interrogated. This flexibility allows for a number of different MR modalities, including T1-weighted images, T2-weighted images, and many more. Both CT and MR studies can take advantage of vascular contrast agents to study the effect of blood flow, which is particularly useful in the study of tumors. One difficulty MRI presents in the quantitative imaging space is the lack of a consistent signal scale. An intensity of 50 in a CT study signifies 50 HU, roughly the density of muscle, but in MRI all signal values are relative and depend on the scan parameters. MRI has more differentiation and contrast in soft tissues compared to CT, but each imaging modality is used in a variety of contexts.8

Ultrasound is another imaging modality, but one that is quite different from CT and MRI. Ultrasound relies on the propagation and reflection of sound waves created by a portable transducer to build a sonar-like image. While ultrasound is typically acquired in a single plane relative to the transducer, a series of acquisitions can be combined to create a 3D volume instead of a video. Some of the benefits of ultrasound include its portability, speed, and ease of use. However, it lacks the resolution, field of view, and consistent frame of reference that a fixed setup like MRI or CT provides, making it a less common choice for cancer imaging.9


PET is a form of functional imaging, which requires a radioactively labeled tracer that emits gamma rays.10 Because the tracer is often a glucose analog, the concentration of emissions can be used to measure metabolic activity. As tumors tend to be more metabolically active than their surrounding tissue, the increased uptake of tagged glucose shows up as an increased signal. While the relative resolution of PET is low, this can be ameliorated by its site specificity and by combining it with other modalities such as CT.11 Managing a PET system can be complex, as the radiolabeled tracers have a short half-life and must be synthesized within a short time frame.

While not an imaging study from radiology, digital microscopy has become an important application of quantitative imaging techniques for cancer diagnosis, staging, and personalized therapy. One variant of digital microscopy is quantitative fluorescence imaging, in which fluorescent dyes are used to label different cellular structures.12 By tagging donor and acceptor pairs, the interaction between these two chromophores can lead to a resonance energy transfer event if they are in close proximity to one another. Quantification of the relative fluorescence of tagged components requires a careful separation of the signal from the background. These interactions help to illuminate the function and interactions of different proteins and molecules in a biological study.

Digital pathology leverages digital microscopy in a clinical setting, where high-resolution images are taken of tissue samples. These tissue samples are extracted, stained, and fixed using long-established protocols, but many more high-resolution image datasets are now publicly available. Many different methods exist for staining and preserving tissue samples, each of which results in a different profile because of the stain's affinity toward nucleic acids, cytoplasm, or other cellular components. Pathologists recognize changes in cell morphology that are associated with disease states. AI can help improve the efficiency and consistency of these reads by looking at large amounts of imaging data, where a pathologist usually focuses on representative regions.13

Overall, while many similarities exist between imaging modalities, important differences exist in what their values mean. Many existing methodologies, and especially deep learning, were developed on natural images such as photographs, which carry few constraints or guarantees about their content. Medical and biological imaging is done in a much more specific fashion, generally imaging the same thing each time in a similar way. Intensity and value often have a specific meaning, and even orientation is often preserved between subjects. Clinicians also often have access to unstructured metadata about the patient, which drives interpretation and treatment. While many techniques will work when applied to a new modality without context, knowing how the modalities differ is key to understanding the results and getting full value out of medical imaging.
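Because CT intensities carry a fixed physical meaning while MR intensities are relative, preprocessing usually differs by modality. The sketch below illustrates the point with a soft-tissue window applied to Hounsfield units and a per-volume z-score normalization for MR; the window bounds and the DICOM rescale slope/intercept values are illustrative assumptions, not clinical settings.

```python
import numpy as np

def ct_soft_tissue_window(raw: np.ndarray, slope: float = 1.0, intercept: float = -1024.0,
                          lo: float = -160.0, hi: float = 240.0) -> np.ndarray:
    """Convert raw CT values to Hounsfield units, then clip to a soft-tissue window."""
    hu = slope * raw + intercept          # standard DICOM rescale: HU = slope * raw + intercept
    hu = np.clip(hu, lo, hi)
    return (hu - lo) / (hi - lo)          # map the window onto 0-1

def mr_zscore(raw: np.ndarray) -> np.ndarray:
    """MR intensities are relative, so each volume is normalized to its own statistics."""
    return (raw - raw.mean()) / (raw.std() + 1e-8)

rng = np.random.default_rng(1)
ct_like = rng.integers(0, 3000, size=(64, 64)).astype(np.float32)   # stand-in raw CT slice
mr_like = rng.gamma(2.0, 200.0, size=(64, 64)).astype(np.float32)   # stand-in MR slice
ct_ready, mr_ready = ct_soft_tissue_window(ct_like), mr_zscore(mr_like)
```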

17.2.2 Use of artificial intelligence in different stages of a quantitative imaging workflow

Though AI algorithms are heavily task-centered in their current incarnations, they are still applied to a variety of tasks across the imaging workflow. These algorithms can be used to make existing processes more efficient and consistent, extract additional information from existing modalities, or provide more accurate models of the underlying biology. When AI


systems are used to perform these separate tasks, the compounding nature of these automated pipelines bears investigating. In the worst case, systems that were not designed to work together pass along errors that are propagated and amplified at each stage. In a better designed system, safeguards and quality assurance steps can be put in place to account for the weaknesses and uncertainty of each model.

In the data acquisition stage, deep learning algorithms and signal processing methods such as compressed sensing have been employed to reduce the amount of information needed to acquire CT and MRI studies. Approaching the physics problem of image reconstruction from a deep learning perspective leverages the collection of older imaging studies to predict the imaged volume from the raw signal. This results in reduced radiation dosage in the case of CT studies and decreased acquisition time in the case of MRI studies.14–16 In other fields, similar approaches have been used to create super-resolution images, which take a set of lower resolution images and predict the underlying structure at high resolution.

Beyond improving physics-based image reconstruction, new information for radiologists and clinicians can be extracted in the form of new modalities. In MR studies, secondary measures such as apparent diffusion coefficient, dynamic contrast enhancement, and perfusion maps have been created using quantitative imaging techniques. These modalities are evaluated by a radiologist to study the disease process of interest but can also be used by AI algorithms to extract additional quantified information.

Segmentation, or demarcating a region (e.g., as tumor), is a necessary part of data acquisition for many algorithms and is typically performed by a trained clinician. Manual segmentation, however, is often quite time-consuming and can be inconsistent between users depending on the difficulty of the task. Classical image processing techniques such as edge detection, clustering, and watershed-based methods can often be successful but are susceptible to noise and require tuning. Deep learning techniques have found success in these segmentation challenges and often work out of the box, turning training segmentations into a model that can be applied to new image volumes. Regardless of the technique used, by increasing the number of labeled studies available for further analysis, AI can help scale these workflows by improving efficiency.

Quantitative imaging biomarkers, as discussed earlier, are an important way that AI can be used in the preprocessing and feature extraction step. Whether through texture-based features, engineered features based on anatomy, or deep learning features, AI algorithms can generate a large amount of information from imaging studies. These features are used in a variety of cancer applications, such as grading lung nodules, interpreting mammograms, or combining modalities in PET–CT to evaluate cancer activity.17,18 AI can be used to extract additional information, improve efficiency for existing biomarkers, or allow for combined modalities such as PET–CT or ultrasound + MRI.19
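As a minimal illustration of this feature extraction step, the sketch below computes a handful of first-order features from the voxels inside a segmentation mask; the toy image, mask, and choice of statistics are illustrative assumptions, and dedicated packages such as pyradiomics compute far richer feature sets.

```python
import numpy as np
from scipy import stats

def first_order_features(image: np.ndarray, mask: np.ndarray) -> dict:
    """First-order intensity features computed over the voxels inside a mask."""
    voxels = image[mask > 0].astype(np.float64)
    return {
        "volume_vox": int(voxels.size),            # region size in voxels
        "mean": float(voxels.mean()),
        "std": float(voxels.std()),
        "skewness": float(stats.skew(voxels)),
        "kurtosis": float(stats.kurtosis(voxels)),
        "p10": float(np.percentile(voxels, 10)),
        "p90": float(np.percentile(voxels, 90)),
    }

rng = np.random.default_rng(2)
img = rng.normal(size=(64, 64))
msk = np.zeros_like(img)
msk[20:40, 20:40] = 1                               # toy "lesion" mask
features = first_order_features(img, msk)
```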
AI can also be used to model these features to predict clinical outcome, identify gene phenotypes, or identify differential treatment prognoses. Many models treat feature sets as a matrix input, and multiple approaches can be tested to evaluate which is best suited for the problem at hand. Models can be used to classify patients by disease or tumor subtype, or to perform regression and predict an outcome such as survival, recurrence time, or another measure of interest. Though many types of models have been applied to predict a variety of outcomes, many of the models that get implemented clinically are Cox regression models. Examples include organ transplantation indices such as the Kidney


Donor Risk Index and Liver Donor Risk Index, which guide kidney and liver transplantation.20 While these Cox models are relatively transparent as to the factors that affect them, they can lack predictive power. This is a potential area where new model types could provide better predictions, but there is a significant gap between models used in research and those in the flow of clinical care.

After a model has been trained, evaluating its predictions in another dataset, or at least a held-aside dataset, is important for understanding its predictive power. If the only goal is to solve a task, statistics such as accuracy or AUC can be used. Oftentimes, however, the goal is understanding the underlying biological mechanism of a process, or how to change care. This step of inference, as opposed to simple prediction, may be better executed with simple statistical models, which carry a more solid statistical basis than most complex machine learning models.

AI algorithms for modeling features range in complexity from simple models such as linear regression to random forests, support vector machines (SVMs), and the most complex neural network models. Each model type has its own specifics, but one of the big trade-offs in AI is balancing predictive power against model complexity, understandability, and potential for overfitting. Simpler models such as linear regression may not be able to encompass all the nonlinearities of a biological process but do provide a straightforward accounting of the modeled factors. Ensemble models such as random forests may provide more power, but the importance of features is described in a statistical sense rather than directly. Deep learning models are often described as black boxes, and while the techniques to interrogate these models are improving, it is very difficult to interpret the steps that led to a decision in a contextual sense.21
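For readers unfamiliar with the Cox proportional hazards models mentioned at the start of this subsection, the sketch below fits one with the lifelines package; the toy survival table, column names, and covariates are illustrative assumptions rather than any real transplant or oncology index.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy survival table: one row per patient, follow-up time in months,
# an event flag (1 = event observed, 0 = censored), and two covariates.
df = pd.DataFrame({
    "time_months":  [12, 30, 7, 45, 22, 18, 60, 5, 33, 27],
    "event":        [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    "age":          [64, 58, 71, 52, 66, 60, 49, 75, 55, 62],
    "tumor_volume": [3.1, 1.2, 5.4, 0.8, 2.7, 4.0, 0.5, 6.2, 1.9, 3.6],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time_months", event_col="event")
cph.print_summary()                       # hazard ratios with confidence intervals
risk = cph.predict_partial_hazard(df)     # relative risk score per patient
```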

17.3 Risk assessment in cancer

Determining risk for a patient with cancer typically involves determining the grade and stage of the patient's tumor. Although there has been a marked increase in the number of publications on protein-, RNA-, and DNA-based biomarkers, few of these have been FDA approved, and among those, few have actually been used clinically.22 The current method for grading a tumor is based on a pathologist's determination from a biopsy sample, which is highly subjective and therefore dependent on the experience of the pathologist. Staging is typically determined through the TNM system, which classifies the size of the primary tumor (T), the number and location of lymph nodes containing cancer (N), and the presence or absence of metastases (M). Staging guidelines have been assigned through international anticancer advocacy organizations, and staging is increasingly determined with the help of imaging tools that observe the metabolic development of tumors and other properties.22

Prognosis is typically determined from tumor grade and stage; however, this simple categorization of risk has been further complicated by the rise of personalized medicine. For example, patients with prostate cancer who share the same tumor grade and stage may still exhibit highly variable outcomes, some of which can be predicted with the alpha fetoprotein, chorionic gonadotropin beta, or lactate dehydrogenase biomarkers.22 The complexities of personalized medicine mean that viewing a simple


chart to determine patient risk may become obsolete in the future, as the number of factors that determine prognosis grows. Since risk assessment may also involve sensitivity to available drugs (see the next section), it becomes highly specific not only to the patient's tumor information but also to their underlying genetic profile. In this case, risk assessment in the future likely entails pattern matching to determine what each combination of gene, protein, and DNA signatures, matched with tumor grade and stage, could mean for patient outcomes.

One other way machine learning could assist in risk assessment in cancer is by improving the tumor grading process. Since pathologists often disagree on a diagnosis,23 properly designed machine learning methods could significantly speed up assessments that are clearly either tumor or nontumor, while helping pathologists spend more time on clinically uncertain cases and giving a risk estimate as assistance. In one related example, Yu et al.24 used histopathological images paired with an elastic net and a Cox proportional hazards model to predict whether a patient with Stage I adenocarcinoma would be a "longer term" or "shorter term" survivor. This study claims to be the first to use quantitative imaging feature extraction in histopathology images to predict prognosis for patients with lung cancer.24 A similar study used CNN analysis of histology slides to predict survival in colorectal cancer.25

In another interesting study, Mobadersany et al.26 used a genomic survival CNN (GSCNN) to predict survival for patients with different types of gliomas. The GSCNN automatically extracts features from histopathological slides like the more common CNN methods but outputs survival predictions rather than the more common discrete binary output. Genomic data were incorporated into the fully connected layers of the survival CNN to influence the final predictions. The authors claimed that their results performed as well as or better than clinical experts in prognosticating survival. To clarify which regions of the histopathology slide the model deemed "critical," the authors generated an occlusion map and pointed out areas of interest.12 So far, the addition of occlusion maps has been one of the best ways to demonstrate that the features a deep learning model extracts are contextually relevant to clinicians.

Other common studies for predicting risk in patients with cancer utilize -omics features. Chaudhary et al.27 used deep learning with The Cancer Genome Atlas (TCGA) omics data to predict survival in patients with hepatocellular carcinoma, and Choi and Na28 used gene coexpression networks and deep learning to stratify risk in lung adenocarcinoma. Machine learning in the realm of risk prognostication more generally has been used to assess the risk of cancer onset given other conditions,29 mortality risk after treatment,30 and lifestyle and genetic factors that lead to breast cancer,31–33 among many others. In the next section, we discuss a related topic, namely, predicting outcomes of cancer therapies.

17.4 Therapeutic outcome prediction

After a patient with cancer is diagnosed and staged, the final leg of the journey presented here is to select the treatment that leads to the best possible outcome. Due to high variance in patient responses to therapies, physicians must often tailor therapies to


individual patients based on the best predicted outcome to a given treatment. If negative outcomes ensue, the treatment may be deemed ineffective or produce a countereffect that worsens the condition and may lead to death. Therefore it is imperative that treatments are assigned in the most individualized manner possible. Recently, a rise in popularity of machine learning methods has led to an increased deployment of narrow AI in the academic sphere surrounding therapeutic outcomes for patients based on a diverse range of data, most notably genetic data. Machine learning is a common choice in this sphere because a plethora of data available to health-care providers can be used to train models. Machine learning allows physicians to find the optimal treatment choice based on patterns created by thousands of data points each with thousands of individual genetic features. These predictions carry the potential to customize therapies in a way previously unknown in the medical profession. This section covers therapeutic outcome prediction with machine learning methods as a means of choosing the proper therapy for patients. Although many kinds of therapies may exist for treating cancer, this section will primarily address chemotherapy and radiation therapy.

17.4.1 Chemotherapy

In current practice, physicians generally determine the best therapy for a patient based on statistics of overall success rates and personal experience, drawing on knowledge of known complications with preexisting conditions. In one example, the Barcelona criteria for hepatocellular carcinoma provide a table with which physicians can determine the general best course of treatment based on cancer stage, number of nodules, and the presence of associated diseases.34 If a treatment such as chemotherapy is chosen, then dosage and drug type must also be tailored to the patient. Gurney35 describes general guidelines for chemotherapy dose assignment, illustrating the art form that dosing can be for physicians. He also describes the hazards of underdosing, which may increase the likelihood of cancer recurrence or cancer cell resistance. Overdosing can also be hazardous, as chemotherapy drugs tend to be nonspecifically cytotoxic, disrupting the cell cycle of all affected cells in the region of action.35 A physician's decision therefore weighs the risks and benefits of treatment against the general condition of the patient. For example, a cancer patient with a weak immune system may not be a suitable candidate for chemotherapy because of the high toxicity of the drugs, resulting in abstinence from this form of treatment. Therapies are therefore chosen that confer the best success rate on average given a patient's general condition.

The affordability and ease of genetic sequencing in recent years have ushered in a new era of pharmacogenomics (formerly pharmacogenetics), whereby drug sensitivity can be determined from a patient's genes. Rodríguez-Vicente et al.36 highlight the development of pharmacogenomics in the context of breast cancer treatment by explaining how patients' molecular information can reveal likely prognosis. Commercially available biomarker panels can detect both genes and proteins, assisting in risk prognostication such as recurrence of the disease. Prognosis also influences therapeutic outcomes, as low-risk patients respond differently to specific drugs such as tamoxifen compared with high-risk patients.36


As gene and protein information becomes more widely available with the rise of databases such as TCGA and its imaging complement The Cancer Imaging Archive, COSMIC, the Protein Data Bank, and the Human Protein Atlas, computational work has played a much more significant role in pharmacogenomics, dramatically accelerating the search for effective biomarkers by screening thousands of potential targets. Of particular interest for therapeutic outcome prediction are driver genes, whose alterations can drive cancer progression. Driver genes are expected to be among the genes differentially expressed in cancer. Copy-number variants and mutations in driver genes are expected to form patterns with their gene expression values, and therefore some therapeutic outcome prediction and biomarker identification studies have included both aberrations and expression values in their prediction and discovery models.37–39

The incorporation of an element of narrow AI, namely machine learning, has further optimized the prediction of pharmacogenomic interactions in cancer. In the realm of outcome prediction, machine learning is advantageous in that, given the right features, it can form the optimal decision boundary with which features predict a known outcome. As opposed to traditional forms of linear and logistic regression, machine learning methods can form more complex nonlinear, nonsigmoidal boundaries, allowing more complex modeling of patient data, which may better represent the complex systems underlying therapeutic outcome. Other methods, such as the elastic net, linear SVM, and linear discriminant analysis, while linear, provide boundaries better suited to supervised group classification than traditional linear regression. However, each method has its own advantages and disadvantages depending on the quality of the data, and researchers must typically try several models to find the best-fitting one.

In one example of this, Sakellaropoulos et al. used 1001 cancer cell lines to predict IC50 values for 251 therapeutic drugs for cancer.40 IC50 values measure the concentration of drug required to inhibit 50% of target activity in vitro. The authors fed gene expression data directly into a deep learning model, a random forest model, and an elastic net model, respectively. Data were obtained from the Genomics of Drug Sensitivity in Cancer database, as well as some TCGA data and a novel clinical trial. Significant effort was made to prevent data leakage in feature selection, and all models were tested on a previously unseen test set. The authors concluded that, for nearly every drug effect predicted, the deep learning models outperformed both random forest and elastic net in terms of both effect size and P values of mean-predicted IC50 values between sensitive and insensitive groups. The authors suggest that deep learning may better represent complexities in biological interactions compared with simpler models. In theory, this is possible because neural networks are extremely flexible functions, possessing a VC dimension of O(ρ log ρ), where ρ represents the number of weights.41 This means that highly nonlinear models can be trained with high sensitivity to individual examples. It also means, in principle, that large sample sizes are required for deep learning models and that the testing phase must be carefully conducted.
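As a minimal sketch of the expression-based drug sensitivity modeling described above, the code below fits an elastic net to predict a continuous sensitivity value (e.g., a log IC50) from gene expression features; the synthetic data, the hyperparameter grid, and the held-out split are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n_cell_lines, n_genes = 300, 500
X = rng.normal(size=(n_cell_lines, n_genes))              # stand-in expression matrix
true_coef = np.zeros(n_genes)
true_coef[:25] = rng.normal(size=25)                      # only a few "driver" genes matter
y = X @ true_coef + rng.normal(scale=0.5, size=n_cell_lines)   # stand-in log IC50 values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Elastic net mixes L1 and L2 penalties; l1_ratio and the regularization path
# are tuned by internal cross-validation on the training set only.
model = make_pipeline(StandardScaler(),
                      ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0))
model.fit(X_train, y_train)
print("held-out R^2:", round(model.score(X_test, y_test), 3))
```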
Interestingly, a similar pharmacogenomics study led by Iorio et al.42 used elastic nets and random forest models with publicly available genetic data in an expansive analysis of the effect of oncogenic alterations in 1001 cancer cell lines on sensitivity to 265 anticancer


drugs. The authors focused on the aforementioned driver genes and found that both gene expression values and genomic features such as CpG islands and copy-number variants showed promise in predicting a drug’s IC50 value. They used these models to suggest that clinical trials in the future could be stratified by genetic subgroups in order to more accurately represent the efficacy of the drug in heterogeneous cancer populations.42

17.4.2 Radiation therapy

Radiation therapy is considered one of the most effective courses of treatment for patients with cancer, second only to surgical removal of the tumor.43 In radiation therapy, doses of DNA-corrupting radiation are targeted toward tumor cells with the intention of killing them. Although a high-energy X-ray machine is most commonly used for this treatment, other types of radiation such as gamma rays can be used, as well as charged particles.43 As stated in Ref. [43], dose calculation for radiation therapy must be conducted with an algorithm fast enough to be used in daily practice, yet accurate enough to treat patients appropriately. The most common algorithms are pencil beam algorithms and Monte Carlo-based methods.43,44 However, the former is described as "semiempiric," while the latter is considered too slow for general use.43 In addition, these algorithms are designed to account for only a fixed number of factors and may not account for outside contributors not included in the general equation.

Traditional dosage calculation embodies further complexities. While it has recently been recognized that higher radiation dosages than previously established may lead to better prognosis for a patient,45,46 radiation is also inherently cytotoxic, and high doses can result in a condition called radiation pneumonitis, leading to poorer prognosis.46,47 The ability to effectively control the tumor at the site, termed local control, must be carefully weighed against the toxicity of the radiation dose to optimize the patient's condition.

In this arena, machine learning carries vast potential to improve therapeutic outcomes. Where traditional simulations must explicitly model dosage factors a priori, machine learning models simply analyze patterns in the data and calculate an optimal dosage based on previously seen examples. Similarly, it has been noted that machine learning models can consolidate many more variables into a single model than humans can alone.48 Although training may require significant time, testing and implementation of these algorithms are relatively fast for fixed-complexity models, where learned data patterns can be summarized into a single optimized equation. On the contrary, as stated earlier with quantitative imaging, clinicians must contend with the fact that most modern machine learning algorithms, particularly deep learning-based models, are not interpretable and thus cannot explain how exactly a given environmental factor influenced the model's decision to increase or decrease dosage. This topic is at the forefront of general controversies over the application of machine learning for clinical use.

Therapeutic outcomes in radiotherapy involve more than just dose calculation. In addition to calculating dose, clinicians must analyze the dose calculation with respect to patient anatomy and then calculate dose–volume histograms for each tumor or organ involved.49 They also calculate the tumor control probability and normal tissue complication probability, which is largely based on clinician experience.49 These measures, weighted against the dose calculation result, help to create a profile for the overall radiosensitivity of a patient.
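To illustrate the dose–volume histogram just mentioned, the sketch below computes a cumulative DVH from a dose grid and a binary structure mask; the synthetic dose distribution and the 1-Gy bin width are illustrative assumptions.

```python
import numpy as np

def cumulative_dvh(dose: np.ndarray, mask: np.ndarray, bin_gy: float = 1.0):
    """Fraction of the structure receiving at least each dose level (cumulative DVH)."""
    doses_in_structure = dose[mask > 0]
    levels = np.arange(0.0, doses_in_structure.max() + bin_gy, bin_gy)
    volume_fraction = np.array([(doses_in_structure >= lv).mean() for lv in levels])
    return levels, volume_fraction

rng = np.random.default_rng(4)
dose_grid = rng.gamma(shape=8.0, scale=6.0, size=(50, 50, 20))   # stand-in dose in Gy
tumor_mask = np.zeros_like(dose_grid)
tumor_mask[20:30, 20:30, 5:15] = 1                               # toy target volume

levels, vol = cumulative_dvh(dose_grid, tumor_mask)
v20 = vol[np.searchsorted(levels, 20.0)]   # e.g., fraction of the volume receiving >= 20 Gy
```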


However, future dosage calculation profiles will likely also include genetic profiles. As argued by El Naqa et al.,50 genetics has become increasingly relevant in shaping therapeutic outcomes, as it may influence a patient's radiosensitivity. Radiogenomics may help to predict both tumor and normal cell response to radiation, both of which are important in determining the risk and reward of radiotherapy. El Naqa et al.'s review covers the general field of radiogenomics but includes a section on machine learning in this realm. The authors reference the works of Munley et al.51 and Su et al.52 that use neural networks in the context of radiotherapy outcomes in lung cancer50; however, these studies used the traditional three-layer neural networks paired with engineered radiomics features rather than the newer "feature-extracting" CNNs commonly used in radiomics studies today. Studies such as Ref. [53], also mentioned by El Naqa et al., have increasingly focused on Bayesian strategies for classification to increase the interpretability of their models. In Ref. [53], Luo et al. utilize Bayesian networks to identify biophysical pathways that increase patients' risk of radiation pneumonitis caused by overexposure to radiation.

One study compared the classic three-layer artificial neural network (ANN) combined with engineered radiomics features against a seven-layer CNN to determine prognosis after radiotherapy for patients with rectal cancer.54 The models were built to discriminate between pathologic complete response (pCR) versus non-pCR and good response (GR) versus non-GR.54 The ANN with engineered radiomics features outperformed the seven-layer CNN, but the two models combined yielded a higher AUC estimate than either method alone. The authors determined that pretreatment information was sufficient to accurately predict posttreatment outcomes.54 In general, CNNs begin to outperform traditional methods where a very large sample size is available but may perform worse when this assumption is unmet.

In one unique approach outside the realm of common machine learning strategies in medicine, Tseng et al. use reinforcement learning to predict response to radiation therapy and thus adjust dosing.46 Patient data from the first two-thirds of treatment were used to predict the best treatment course for the remaining one-third. Since reinforcement learning requires a large number of examples in order to refine its policy, a generative adversarial network (GAN) was constructed to simulate realistic patient data and thus increase the sample size. GANs are a type of neural network that can learn to approximate a training distribution and draw new samples from it. The results from Tseng et al. demonstrate the potential of reinforcement learning as a means of adjusting dose during radiation therapy.

17.5 Using artificial intelligence meaningfully

Although we present here many applications of machine learning for personalized care during cancer, machine learning methods for clinical decision support have, at the time of this writing, scarcely been deployed in the clinic. One reason is the FDA approval process: all decision support systems are categorized as a "medical device" by the FDA and must therefore pass an approval process before serving patients in the public domain.


FIGURE 17.2 During each phase of the machine learning workflow, model builders can ask themselves the abovementioned questions to ensure that their model is being incorporated meaningfully: designed robustly with minimized bias and attention to methodological detail.

According to Van Norman,55 the average time for a medical device to make it to market (pass the FDA approval process and present a finished product that is widely available) is in the range of 3–7 years. Another reason, however, is often the lack of understanding between decision support developers and the target clinical environment, leading to impractical and often biased models. To this end, we focus on this latter point, describing what we believe "meaningful" incorporation of AI entails. In Fig. 17.2, we present our machine learning workflow along with critical questions that model builders can ask when assessing their own or others' models, in order to ensure that the model is meaningfully executed with attention to the considerations below. This section assumes that the input data provided to model builders meet the criteria for proper record-keeping, unbiased reporting, and fairly complete records, and it speaks directly to the stages of preprocessing, model building, and inference.

The wide-scale availability of software such as TensorFlow and PyTorch, as well as packages such as scikit-learn, has resulted in an eruption of candidate AI-based academic decision support systems. With the potential for so many choices of support systems in the future, it will become important to standardize reporting and approaches to data acquisition, preprocessing, and model building, so that models can be compared effectively. Standards such as the TRIPOD statement,56 which provides a checklist of information that should be reported about a model, help to build trust in clinical models. On a similar note, it is also critical to ensure that models are safe and effective and are as minimally biased against any particular race, age, or social class as possible.

With the proliferation of AI systems, it is easy to go astray and miss the fundamental principles of statistical evaluation. As models become more complex, overfitting on a training set becomes easier and easier. Dividing a dataset into training, validation, and testing sets is a critical part of evaluating the performance of any model. A training set is used to fit the model parameters that are learned, while a validation set, in concert with a training set, can


be used to view the model's response to changes in hyperparameters such as the number of layers. A testing dataset, sometimes called a holdout set, should be completely held aside to provide an unbiased estimate of predictive power and should be used only after the final model is finalized. Likewise, Russell et al.57 assert that peeking at testing data is an easy way to go wrong, particularly by repeating an experiment with the same testing dataset until the results improve.

Beyond multiple testing, another source of overfitting is data leakage, wherein information from the testing set influences the training data. Data leakage can occur in seemingly innocuous ways, such as normalizing features or performing principal component analysis (PCA) using the entire dataset. Furthermore, it is inappropriate to normalize the test set based on its own statistics rather than those of the training set, as doing so would push the feature distributions to be more similar than they are in reality. Proper technique requires saving the normalization or PCA parameters from the training set and applying them to the test set. Another pitfall is splitting an individual subject across training/testing sets or cross-validation folds, so that the model has already seen the individual it is trying to predict. A simple way to correct this is subject-wise cross-validation, which places all records of a patient in the same fold. In addition, the population in the original dataset may be unrepresentative of the general population, or may be missing subtypes, which cannot easily be detected within a single dataset. For these reasons, we suggest additional validation of a model's performance through one or more external validation studies whenever possible.

Likewise, any AI system targeted at clinical practice should recognize that publishing a good predictive result on a single dataset is only the start. Factors beyond a single standard measurement of predictive power are equal determinants of a model's true clinical performance. Oakden-Rayner and Palmer58 provide a comprehensive summary of both the validation and the study design process that should be undergone by groups who intend to implement their decision support systems clinically. They begin with the distinction between safety and efficacy, stressing that model performance does not equate to patient safety, and that efficacy, if based on a concept such as saving lives, needs to be quantified as such (by lives saved) and then compared to a gold standard with a firm scientific basis.

Further caution should be exercised to ensure the best possible ground truth labeling, a critical issue in the data acquisition phase. Supervised models (most discriminatory models) are measured and trained under the assumption that the labels are correct. However, this assumption is rarely met in the clinical realm, because physicians are often uncertain what condition a patient has. In one case study in mammography, radiologists agreed with their colleagues only 78% of the time (interrater reliability), while they agreed with themselves only 84% of the time.59 This issue can be further exacerbated for early detection problems, where physicians must determine when a patient begins exhibiting early signs of a disease outcome. Cases such as heart failure or arrhythmia may be easier for physicians to detect early, as opposed to a slow and ambiguous condition such as liver cancer or sepsis.
In these cases, Oakden-Rayner and Palmer suggest using as many physicians as possible for ground truth labeling, casting doubt on the common practice of assigning only two to three physicians, which can add "significant bias" to the overall model.58 Adamson and Welch23 state their concerns with the interrater reliability problem and offer a possible solution from a pathologist's standpoint. In another approach to handling disagreement in labeling,


Reamaroon et al. attempt to factor physician confidence into SVM model labels, attributing higher weight to labels given with higher confidence of correct attribution.60 However, Friedman61 shows that confident physicians are not always the best physicians. This may also be problematic, as an "overconfident but incorrect" physician may erroneously bias the model away from correct discrimination. Additional methods to address label uncertainty include fuzzy networks and Bayesian-based methods.62,63

In the data processing phase, careful consideration must be made to justify each step. For example, several methods of stain normalization for pathology slides, such as Macenko and Reinhard, exist. Being cognizant of why a given stain normalization was chosen for a study, and ensuring that it outperforms other methods before publishing, is important for meaningful incorporation of the method into a workflow. Furthermore, in the dimensionality reduction stage, it is important to consider whether a reduction method such as PCA or independent component analysis (ICA) is best, or whether a feature selection method such as random forest importance, the least absolute shrinkage and selection operator, or minimum redundancy maximum relevance is more suitable. Note that if PCA or ICA is conducted, computational power and feature requirements remain the same, since any PCA transformation is a map X : ℝ^(m×n) → ℝ^(m×p) with p < n; PCA therefore still requires all of the original features as input before outputting a smaller set of dimensions. Contextually, model builders may ask themselves whether obtaining all the input features for PCA would be practical or necessary once the model is employed in a clinical setting. Where this is not practical, simply selecting the few top-ranked features that make a fairly accurate and much simpler model may be easiest for the clinicians who must gather the information.

Lastly, in the model building phase, one must be careful to choose the correct reporting statistics. It is common practice to report AUC for machine learning models, but we further suggest reporting a "panel" of measures, including accuracy and F1 score with standard deviations, since each of these measures helps reveal potential weaknesses in the model (a minimal sketch of such a panel appears at the end of this section). For example, if AUC and accuracy differ by a sufficient amount, one can ascertain that the model is likely biased toward the outcome group with the most examples in the dataset. Furthermore, although F1 score and AUC can differ when sensitivity and specificity are imbalanced, F1 score is thought to be more robust than AUC on highly imbalanced datasets. The Brier score and the no-information error rate are also good options for imbalanced data. Another common practice is to report the sensitivity and specificity of the model; Oakden-Rayner and Palmer suggest instead reporting the positive predictive value and negative predictive value.58 From an epidemiological standpoint, these can be more representative of the effectiveness of the model in a population where the condition of interest is rare, as is the case for many cancer subtypes. We end our discussion with a caution against the common theme of building models that only optimize AUC by referring to Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
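As promised above, here is a minimal sketch of such a reporting panel using scikit-learn metrics on a held-out test set; the synthetic labels and scores, the 0.5 decision threshold, and the particular choice of metrics are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, brier_score_loss, f1_score,
                             precision_score, recall_score, roc_auc_score)

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=200)                                    # held-out ground truth
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, size=200), 0, 1)   # stand-in model scores
y_pred = (y_prob >= 0.5).astype(int)                                     # thresholded predictions

panel = {
    "AUC": roc_auc_score(y_true, y_prob),
    "accuracy": accuracy_score(y_true, y_pred),
    "F1": f1_score(y_true, y_pred),
    "sensitivity (recall)": recall_score(y_true, y_pred),
    "PPV (precision)": precision_score(y_true, y_pred),
    "Brier score": brier_score_loss(y_true, y_prob),
}
for name, value in panel.items():
    print(f"{name}: {value:.3f}")
```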

17.6 Summary In the age of increasingly personalized care, AI, and more specifically its subfield of machine learning, has become an indispensable tool. In this chapter, we present the state of


the field of precision medicine for personalized care in the context of cancer, following developments in machine learning as a discriminative tool along a patient’s oncological journey with respect to quantitative imaging, risk assessment, and therapeutic outcomes. Machine learning methods have grown increasingly popular for these three areas of care because of the speed of assessment of a trained model, the relatively high-accuracy predictions they can produce, and the number and variety of features such models can incorporate in determining outcome. In the first leg of a patient’s cancer journey, they must seek diagnostic help. In the age of AI, quantitative imaging paired with machine learning has emerged as a critical tool for patient diagnosis and prognosis. Quantitative imaging has been simply defined as a way to study images by extracting numeric summaries of images for input into a machine learning model. The second leg of the patient’s cancer journey discussed here involves prognosis, or a determination of risk. Risk assessment, particularly with respect to tumor grading and survival analysis, is a field that shows growing interest in machine learning, particularly in the realm of pathology and radiomics. Recent studies have demonstrated that incorporation of genomic data also improves prognostic predictions. In the last leg of the patient’s journey, the patient must receive an appropriate therapy for the malignant neoplasm, which involves physician effort in determining the best treatment option. In this realm, pharmacogenomics has emerged as the most popular direction of research for determining both drug sensitivity and radiosensitivity. The increasing presence of widely available molecular, imaging, and genetic information is accelerating growth in this field. Finally, in this chapter we stress the importance of meaningful incorporation of AI into the clinical realm. While many machine learning algorithms applied to medical data may produce seemingly impressive results, they must be vetted with high standards at every step of data acquisition, processing, and validation in order to reduce bias as much as possible and ensure patient safety. This includes strong attention to validation and evaluation, including preventing data leakage and withholding model evaluation on the test set until the model is completely finished. Meaningful incorporation also includes a strong effort toward minimizing bias in physician labeling, justifying each step of the workflow, and reporting performance measures more varied than AUC. More “meaningful” incorporation of machine learning models for personalized medicine can increase the signal-to-noise ratio of useful models in the literature that actually have potential for clinical use. In addition, “meaningful” deployment of AI to oncological problems will result in higher clinical confidence in their use and potentially enable a wider spread of decision support systems for clinically relevant management and intervention strategies for patients in the future.

Acknowledgments E.W., J.S.L., N.W., and A.R. were supported through CCSG P30 CA046592, Institutional Research Grants (MCubed, O’Brien Kidney Center) from The University of Michigan, NCI R37CA214955-01A1, and a Research Scholar Grant from the American Cancer Society (RSG-16-005-01). E.W. was also funded under T32GM070449.


References 1. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2016: cancer statistics, 2016. CA Cancer J Clin 2016;66:730. Available from: https://doi.org/10.3322/caac.21332. 2. Pattanayak P, Turkbey EB, Summers RM. Comparative evaluation of three software packages for liver and spleen segmentation and volumetry. Acad Radiol 2017;24:8319. Available from: https://doi.org/10.1016/j. acra.2017.02.001. 3. Duron L, Balvay D, Vande Perre S, Bouchouicha A, Savatovsky J, Sadik J-C, et al. Gray-level discretization impacts reproducible MRI radiomics texture features. PLoS One 2019;14:e0213459. Available from: https:// doi.org/10.1371/journal.pone.0213459. 4. Sanders N. A balanced perspective on prediction and inference for data science in industry. Harv Data Sci Rev 2019;1. Available from: https://doi.org/10.1162/99608f92.644ef4a4. 5. Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P. InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. ArXiv160603657 Cs Stat 2016. 6. Bau D, Zhou B, Khosla A, Oliva A, Torralba A. Network dissection: quantifying interpretability of deep visual representations. ArXiv170405796 Cs 2017. 7. Rubin GD. Computed tomography: revolutionizing the practice of medicine for 40 years. Radiology 2014;273: S4574. Available from: https://doi.org/10.1148/radiol.14141356. 8. Grover VPB, Tognarelli JM, Crossey MME, Cox IJ, Taylor-Robinson SD, McPhail MJW. Magnetic resonance imaging: principles and techniques: lessons for clinicians. J Clin Exp Hepatol 2015;5:24655. Available from: https://doi.org/10.1016/j.jceh.2015.08.001. 9. Abu-Zidan FM, Hefny AF, Corr P. Clinical ultrasound physics. J Emerg Trauma Shock 2011;4:5013. Available from: https://doi.org/10.4103/0974-2700.86646. 10. Shukla AK, Kumar U. Positron emission tomography: an overview. J Med Phys 2006;31:1321. Available from: https://doi.org/10.4103/0971-6203.25665. 11. Vaquero JJ, Kinahan P. Positron emission tomography: current challenges and opportunities for technological advances in clinical and preclinical imaging systems. Annu Rev Biomed Eng 2015;17:385414. Available from: https://doi.org/10.1146/annurev-bioeng-071114-040723. 12. Swedlow JR. Quantitative fluorescence microscopy and image deconvolution. Methods Cell Biol 2013;114:40726. Available from: https://doi.org/10.1016/B978-0-12-407761-4.00017-8. 13. Pantanowitz L, Sharma A, Carter AB, Kurc T, Sussman A, Saltz J. Twenty years of digital pathology: an overview of the road travelled, what is on the horizon, and the emergence of vendor-neutral archives. J Pathol Inform 2018;9. Available from: https://doi.org/10.4103/jpi.jpi_69_18. 14. Han Y.S., Yoo J., Ye J.C. Deep residual learning for compressed sensing CT reconstruction via persistent homology analysis. ArXiv161106391 Cs 2016. 15. Golkov V, Dosovitskiy A, Sperl JI, Menzel MI, Czisch M, Sa¨mann P, et al. q-Space deep learning: twelve-fold shorter and model-free diffusion MRI scans. IEEE Trans Med Imaging 2016;35:134451. Available from: https://doi.org/10.1109/TMI.2016.2551324. 16. Yang G, Yu S, Dong H, Slabaugh G, Dragotti PL, Ye X, et al. DAGAN: deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction. IEEE Trans Med Imaging 2018;37:131021. Available from: https://doi.org/10.1109/TMI.2017.2785879. 17. Gillies RJ, Kinahan PE, Hricak H. Radiomics: images are more than pictures, they are data. Radiology 2015;278:56377. Available from: https://doi.org/10.1148/radiol.2015151169. 18. 
Rizzo S, Botta F, Raimondi S, Origgi D, Fanciullo C, Morganti AG, et al. Radiomics: the facts and the challenges of image analysis. Eur Radiol Exp 2018;2. Available from: https://doi.org/10.1186/s41747-018-0068-z. 19. Siddiqui MM, Rais-Bahrami S, Truong H, Stamatakis L, Vourganti S, Nix J, et al. Magnetic resonance imaging/ultrasound-fusion biopsy significantly upgrades prostate cancer versus systematic 12-core transrectal ultrasound biopsy. Eur Urol 2013;64:71319. Available from: https://doi.org/10.1016/j.eururo.2013.05.059. 20. Akkina SK, Asrani SK, Peng Y, Stock P, Kim R, Israni AK. Development of organ-specific donor risk indices. Liver Transpl 2012;18:395404. Available from: https://doi.org/10.1002/lt.23398. 21. Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci 2001;16:199231. Available from: https://doi.org/10.1214/ss/1009213726. 22. Ludwig JA, Weinstein JN. Biomarkers in cancer staging, prognosis and treatment selection. Nat Rev Cancer 2005;5:84556. Available from: https://doi.org/10.1038/nrc1739.


23. Adamson AS, Welch HG. Machine learning and the cancer-diagnosis problem—no gold standard. N Engl J Med 2019;381:22857. Available from: https://doi.org/10.1056/nejmp1907407. 24. Yu K-H, Zhang C, Berry GJ, Altman RB, Re´ C, Rubin DL, et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun 2016;7. Available from: https:// doi.org/10.1038/ncomms12474. 25. Kather JN, Krisam J, Charoentong P, Luedde T, Herpel E, Weis C-A, et al. Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Med 2019;16:e1002730. Available from: https://doi.org/10.1371/journal.pmed.1002730. 26. Mobadersany P, Yousefi S, Amgad M, Gutman DA, Barnholtz-Sloan JS, Vega JEV, et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc Natl Acad Sci USA 2018;115:E29709. Available from: https://doi.org/10.1073/pnas.1717139115. 27. Chaudhary K, Poirion OB, Lu L, Garmire LX. Deep learningbased multi-omics integration robustly predicts survival in liver cancer. Clin Cancer Res 2017;24:12481259. Available from: https://doi.org/10.1158/10780432.ccr-17-0853. 28. Choi H, Na KJ. A risk stratification model for lung cancer based on gene coexpression network and deep learning. Biomed Res Int 2018;2018:111. Available from: https://doi.org/10.1155/2018/2914280. 29. Liu Y, Li Y, Fu Y, Liu T, Liu X, Zhang X, et al. Quantitative prediction of oral cancer risk in patients with oral leukoplakia. Oncotarget 2017;8. Available from: https://doi.org/10.18632/oncotarget.17550. 30. Elfiky AA, Pany MJ, Parikh RB, Obermeyer Z. Development and application of a machine learning approach to assess short-term mortality risk among patients with cancer starting chemotherapy. JAMA Netw Open 2018;1:e180926. Available from: https://doi.org/10.1001/jamanetworkopen.2018.0926. 31. Jung SY, Papp JC, Sobel EM, Zhang Z-F. Genetic variants in metabolic signaling pathways and their interaction with lifestyle factors on breast cancer risk: a random survival forest analysis. Cancer Prev Res (Philadelphia, PA) 2017;11:4451. Available from: https://doi.org/10.1158/1940-6207.capr-17-0143. 32. Behravan H, Hartikainen JM, Tengstro¨m M, Pylka¨s K, Winqvist R, Kosma V-M, et al. Machine learning identifies interacting genetic variants contributing to breast cancer risk: a case study in Finnish cases and controls. Sci Rep 2018;8. Available from: https://doi.org/10.1038/s41598-018-31573-5. 33. Nindrea RD, Aryandono T, Lazuardi L, Dwiprahasto I. Diagnostic accuracy of different machine learning algorithms for breast cancer risk calculation: a meta-analysis. Asian Pac J Cancer Prev 2018;19. Available from: https://doi.org/10.22034/APJCP.2018.19.7.1747. 34. Llovet JM, Fuster J, JB. The Barcelona approach: diagnosis, staging, and treatment of hepatocellular carcinoma. Liver Transpl 2004;10:S11520. Available from: https://doi.org/10.1002/lt.20034. 35. Gurney H. How to calculate the dose of chemotherapy. Br J Cancer 2002;86:1297302. Available from: https://doi.org/10.1038/sj.bjc.6600139. 36. Rodrı´guez-Vicente AE, Lumbreras E, Herna´ndez JM, Martı´n M, Calles A, Otı´n CL, et al. Pharmacogenetics and pharmacogenomics as tools in cancer therapy. Drug Metab Pers Ther 2016;31. Available from: https://doi. org/10.1515/dmpt-2015-0042. 37. Nabavi S. Identifying candidate drivers of drug response in heterogeneous cancer by mining high throughput genomics data. BMC Genomics 2016;17. 
Available from: https://doi.org/10.1186/s12864-016-2942-5. 38. Akavia UD, Litvin O, Kim J, Sanchez-Garcia F, Kotliar D, Causton HC, et al. An integrated approach to uncover drivers of cancer. Cell 2010;143:100517. Available from: https://doi.org/10.1016/j.cell.2010.11.013. 39. Lahti L, Scha¨fer M, Klein H-U, Bicciato S, Dugas M. Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review. Brief Bioinform 2013;14:2735. Available from: https://doi.org/10.1093/bib/bbs005. 40. Sakellaropoulos T, Vougas K, Narang S, Koinis F, Kotsinas A, Polyzos A, et al. A deep learning framework for predicting response to therapy in cancer. Cell Rep 2019;29:33673373.e4. Available from: https://doi.org/ 10.1016/j.celrep.2019.11.017. 41. Koiran P, Sontag ED. Neural networks with quadratic VC dimension. In: Touretzky DS, Mozer MC, Hasselmo ME, editors. Advances in neural information processing systems 8. MIT Press; 1996. p. 197203. 42. Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, Schubert M, et al. A landscape of pharmacogenomic interactions in cancer. Cell 2016;166:74054. Available from: https://doi.org/10.1016/j.cell.2016.06.017. 43. Schlegel W, Bortfeld T, Grosu A-L, Pan T, Luo D. New technologies in radiation oncology. J Nucl Med 2008;49:6834. Available from: https://doi.org/10.2967/jnumed.107.048827.


44. Gustafsson A, Lind BK, Brahme A. A generalized pencil beam algorithm for optimization of radiation therapy. Med Phys 1994;21:34356. Available from: https://doi.org/10.1118/1.597302. 45. Kong F-M, Haken RKT, Schipper M, Frey KA, Hayman J, Gross M, et al. Effect of midtreatment PET/CTadapted radiation therapy with concurrent chemotherapy in patients with locally advanced non-small-cell lung cancer. JAMA Oncol 2017;3:1358. Available from: https://doi.org/10.1001/jamaoncol.2017.0982. 46. Tseng H-H, Luo Y, Cui S, Chien J-T, Haken RKT, El Naqa I. Deep reinforcement learning for automated radiation adaptation in lung cancer. Med Phys 2017;44:6690705. Available from: https://doi.org/10.1002/ mp.12625. 47. Bradley JD, Paulus R, Komaki R, Masters G, Blumenschein G, Schild S, et al. Standard-dose versus high-dose conformal radiotherapy with concurrent and consolidation carboplatin plus paclitaxel with or without cetuximab for patients with stage IIIA or IIIB non-small-cell lung cancer (RTOG 0617): a randomised, two-by-two factorial phase 3 study. Lancet Oncol 2015;16:18799. Available from: https://doi.org/10.1016/s1470-2045(14) 71207-0. 48. Abernethy AP, Etheredge LM, Ganz PA, Wallace P, German RR, Neti C, et al. Rapid-learning system for cancer care. J Clin Oncol 2010;28:426874. Available from: https://doi.org/10.1200/jco.2010.28.5478. 49. Warkentin B, Stavrev P, Stavreva N, Field C, Fallone BG. A TCP-NTCP estimation module using DVHs and known radiobiological models and parameter sets. J Appl Clin Med Phys 2004;5:5063. Available from: https://doi.org/10.1120/jacmp.v5i1.1970. 50. El Naqa I, Kerns SL, Coates J, Luo Y, Speers C, West CML, et al. Radiogenomics and radiotherapy response modeling. Phys Med Biol 2017;62:R179206. Available from: https://doi.org/10.1088/1361-6560/aa7c55. 51. Munley MT, Lo JY, Sibley GS, Bentel GC, Anscher MS, Marks LB. A neural network to predict symptomatic lung injury. Phys Med Biol 1999;44:22419. Available from: https://doi.org/10.1088/0031-9155/44/9/311. 52. Su M, Miften M, Whiddon C, Sun X, Light K, Marks L. An artificial neural network for predicting the incidence of radiation pneumonitis. Med Phys 2005;32:31825. Available from: https://doi.org/10.1118/ 1.1835611. 53. Luo Y, El Naqa I, McShan DL, Ray D, Lohse I, Matuszak MM, et al. Unraveling biophysical interactions of radiation pneumonitis in non-small-cell lung cancer via Bayesian network analysis. Radiother Oncol 2017;123:8592. Available from: https://doi.org/10.1016/j.radonc.2017.02.004. 54. Shi L, Zhang Y, Nie K, Sun X, Niu T, Yue N, et al. Machine learning for prediction of chemoradiation therapy response in rectal cancer using pre-treatment and mid-radiation multi-parametric MRI. Magn Reson Imaging 2019;61:3340. Available from: https://doi.org/10.1016/j.mri.2019.05.003. 55. Van Norman GA. Drugs, devices, and the FDA: Part 2: An overview of approval processes: FDA approval of medical devices. JACC Basic Transl Sci 2016;1:27787. Available from: https://doi.org/10.1016/j.jacbts.2016.03.009. 56. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015;162:W1. Available from: https://doi.org/10.7326/M14-0698. 57. Russell SJ, Norvig P, Davis E. Artificial intelligence: a modern approach. 3rd ed. Upper Saddle River, NJ: Prentice Hall; 2010. 58. Oakden-Rayner L, Palmer LJ. 
Artificial intelligence in medicine: validation and study design. In: Artificial intelligence in medical imaging. Springer International Publishing; 2019. p. 83104. Available from: https://doi.org/ 10.1007/978-3-319-94878-2_8. 59. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists’ interpretations of mammograms. N Engl J Med 1994;331:14939. Available from: https://doi.org/10.1056/nejm199412013312206. 60. Reamaroon N, Sjoding MW, Lin K, Iwashyna TJ, Najarian K. Accounting for label uncertainty in machine learning for detection of acute respiratory distress syndrome. IEEE J Biomed Health Inform 2019;23:40715. Available from: https://doi.org/10.1109/jbhi.2018.2810820. 61. Friedman CP, Gatti GG, Franz TM, Murphy GC, Wolf FM, Heckerling PS, et al. Do physicians know when their diagnoses are correct? J Gen Intern Med 2005;20:3349. Available from: https://doi.org/10.1111/j.15251497.2005.30145.x. 62. John RI, Innocent PR. Modeling uncertainty in clinical diagnosis using fuzzy logic. IEEE Trans Syst Man Cybern B Cybern 2005;35:134050. Available from: https://doi.org/10.1109/tsmcb.2005.855588. 63. Leibig C, Allken V, Ayhan MS, Berens P, Wahl S. Leveraging uncertainty information from deep neural networks for disease detection. Sci Rep 2017;7. Available from: https://doi.org/10.1038/s41598-017-17876-z.



18 Artificial intelligence in oncology
Jean-Emmanuel Bibault, Anita Burgun, Laure Fournier, André Dekker and Philippe Lambin

Abstract
Medical decisions can rely on a very large number of parameters, but it is traditionally considered that our cognitive capacity can only integrate up to five factors in order to make a decision. Oncologists will need to combine vast amounts of clinical, biological, and imaging data to achieve state-of-the-art treatments. Data science and artificial intelligence (AI) will have an important role in the generation of models to predict outcome and guide treatments. A new paradigm of data-driven decision-making, reusing routine health-care data to provide decision support, is emerging. This chapter explores the studies published in imaging, medical oncology, and radiation oncology and explains the technical challenges that need to be addressed before AI can be routinely used to treat cancer patients.

Keywords: Oncology; cancer; artificial intelligence; deep learning; machine learning; prediction

Abbreviations
AI artificial intelligence
ANN artificial neural networks
AUC area under the curve
CAD computer-aided detection
CNN convolutional neural network
DDCNN deep dilated convolutional neural network
DDNN deep deconvolutional neural network
DNN deep neural network
EHR electronic health record
FDA Food and Drug Administration
GWAS genome-wide association studies
IMRT Intensity-Modulated Radiation Therapy
OAR organs at risk
PheWAS phenome-wide association studies
RT radiotherapy
RWE real-world evidence
SBRT Stereotactic Body Radiation Therapy
SVM support vector machine
VOE volumetric overlap error



18.1 Introduction Clinical trials are designed for specific, predefined populations. But the high number of parameters that need to be explored to deliver cancer care makes it almost impossible to design trials for every case.1 New approaches, such as deep learning (DL) used on real-life data, are needed. Data quality is heterogeneous in medicine. Depending on whether departments have implemented a structured medical notes or reports policy, data preprocessing can be very time-consuming and can actually become the main bottleneck of any “Big Data” study. In the first part of this chapter, we will explain how clinical data warehouses can be implemented in order to leverage routine care data. Since patients enrolled in clinical trials do not always reflect real-life populations, this kind of data is very important. In the second part of the chapter, we will explore examples of the potential of artificial intelligence (AI) applications in medical imaging and comment on the main studies that have already used AI for assisted diagnosis and monitoring in oncology and for treatment outcome assessment and prediction. The last part of this chapter will document the studies where machine learning was used in radiation oncology for treatment planning (segmentation and dosimetry) and outcome prediction (toxicity and efficacy).

18.2 Electronic health records and clinical data warehouse 18.2.1 Data reuse for research purposes The traditional role of health data warehouses has been to serve research needs. The first users of clinical data warehouses were clinicians interested in finding patient cohorts for research studies.2,3 Data from hospital electronic health records (EHRs) provide a rich source of clinical histories at a level of granularity far beyond a standard registry. Typical data types that are present within clinical data warehouses include structured data, such as patient demographics, laboratory test results, drug prescriptions, International Statistical Classification of Diseases and Related Health Problems codes, and procedure codes, as well as unstructured data, that is, text reports (e.g., radiology, pathology, and progress notes) and images. Such data can be exploited alone or linked with existing registries. In oncology, with the objective of supporting precision medicine, some groups have pushed toward establishing dedicated data warehouses by integrating EHR data from clinical data warehouses with other data sources (Fig. 18.1). Such data warehouses may have different scopes in terms of diseases, data sources, and coverage, as illustrated by the following three examples. The data repository developed by the Cancer Research for Personalized Medicine (CARPEM) program in Paris integrates EHR data from three AP-HP hospitals along with research data and biobank information to support studies on various kinds of cancers, including lung, blood, colorectal, and gynecological cancers.4 The OncoSHARE database is a comprehensive database for breast cancer that integrates data from the California Cancer Registry, EHRs from Stanford University Hospital and other care centers, genomic sequencing results, and patient-reported data.5 Similar hybrid databases are under development for other cancers.6 The American Society of Clinical Oncology (ASCO) created the platform called CancerLinQ to integrate data from EHRs across different levels of care and Surveillance, Epidemiology,

FIGURE 18.1 i2b2 clinical data warehouse graphical interface.

and End Results registry data.7 These infrastructures provide rich repositories of cancer data that can increase the quality and scope of data analyses. However, there were several technical challenges related to semantic heterogeneity and data quality. Solutions based on common data models have been proposed to harmonize data from disparate observational databases. The concept behind this approach is to transform data contained within the data sources into a common model such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model developed by the Observational Health Data Sciences and Informatics (OHDSI) Oncology Subgroup.8,9 Common vocabularies and ontologies have also been proposed such as the Radiation Oncology Structures ontology used to standardize description in radiation therapy.10 The next step consists in sharing libraries of data analytics modules based on the common data format and developing studies on distributed real-world datasets. Interest in real-world evidence (RWE) relying on routinely collected data continues to grow, as it complements clinical trial findings and may help fill knowledge gaps related to effectiveness, safety, and cost of treatment in “real life.” Regarding the medicines regulatory system, the European Medicines Agency has recognized the importance of RWE, especially in oncology where they claimed that it was “our only hope to come to grips with combinatorial complexity,” (https://www.ema.europa.eu/en/documents/presentation/presentation-real-world-evidence-rwe-introduction-how-it-relevant-medicines-regulatory-system-emas_en.pdf) and the US Food and Drug Administration (FDA) created a framework for assessing the use of real-world data throughout the drug development process (https://www.fda.gov/media/120060/download). For example, real-world data


can be used to help support the approval of a new indication for a drug already in another indication and to help support or satisfy drug postapproval study requirements. Generally speaking, studies relying on real-world data benefit from the large sample sizes and generalizable patient populations afforded by EHRs. Casey et al. identified diverse research applications that can be based on real-world data reuse: reevaluation of prior conclusions drawn from smaller populations; analysis of rare condition groups and subgroups; environmental and social epidemiology to study risk factors and social impact of cancer episodes; stigmatized conditions, where patient recruitment and follow-up can be difficult; predictive modeling that requires huge volume of data; evaluation of natural experiments taking advantage of rapid collection of EHR data compared with traditional cohorts.11 DL could also be used to predict cancer prevalence from satellite images (Fig. 18.2). Contrasting with conventional

FIGURE 18.2 Features extracted from satellite images by a CNN that can be used to predict cancer prevalence. CNN, Convolutional neural network.


studies using primary data collection methods, EHR-based studies have broader coverage, are less expensive, and require less time to complete. Moreover, large-scale, hypothesis-free methods for studying associations among variables, especially genome-wide association studies (GWAS) and phenome-wide association studies (PheWAS), have led to numerous discoveries of the genomic bases of both rare and common diseases. In the past, these methods focused on a single disease or a small set of diseases at a time in order to answer specific research questions, based on traditional datasets with limited scope. Nowadays, the accumulation of biospecimens linked to EHR data makes possible GWAS and PheWAS,12 as well as cross-phenotype associations13 using only data collected during health care. The goal of precision oncology is to use genetics and all patient phenotypic characteristics to guide cancer prevention and treatment. The approach described previously offers a potential strategy to accurately stratify patients for risk profiling and discover new relationships between specific cancers and genomes.14 This process can be extended with data sharing. To achieve that goal, the Global Alliance for Genomics and Health (http://genomicsandhealth.org/) was founded to develop interoperable solutions. Instead of building centralized repositories, the Global Alliance suggests that shared data could be stored by their originating institutions and assessed and analyzed by members of the global research community through secure cloud-embedded network solutions.15 Another use of clinical data warehouses has been for prospective clinical research. The role of real-world data in improving the efficiency of clinical research programs is well established. They can be used to generate hypotheses for testing in traditional trials, identify potential biomarkers, perform feasibility studies, inform prior probability distributions (in Bayesian models), identify eligible patients, and assess the safety of drugs or devices after they are approved. With the objective of accelerating patient enrollment in clinical trials, the eligibility criteria can be aligned with EHR data models and mapped to common terminologies. However, several studies demonstrated that a significant percentage of those criteria could not be mapped to structured EHR data but were present only in text or images.16 By leveraging natural language processing and image processing, IT technologies can dramatically increase the trial screening efficiency of oncologists. A review paper published in 2019 showed that, for trial eligibility determination, the highest accuracy was reached by a machine learning-based approach with a per-trial area under the curve (AUC) between 75.5% and 89.8%.17 These algorithms have the potential to significantly reduce the effort to execute clinical research, facilitate participation of small cancer centers, and accelerate research for specific patient populations such as pediatric cancers.18,19 Real-world data repositories are key components to generate evidence regarding safety and effectiveness.20 After the identification of a safety concern, a study can be conducted easily and quickly, based on the network of clinical data warehouses that adopted a common format like OMOP. Finally, regarding clinical research in precision oncology, there is still a major question to answer: to what extent will regulatory agencies accept RWE to support drug product approvals in oncology?
When approval is based on a single-arm interventional trial, the supportive RWE would consist of data on historical response rates drawn from real-world data. In the past, blinatumomab was initially approved under accelerated approval for the treatment of Philadelphia chromosome-negative relapsed or refractory B-cell precursor acute lymphoblastic leukemia, based on the comparison to historical data from 694 comparable patients extracted from over 2000 patient records from different sites.21


This example (a randomized controlled trial was required by FDA to confirm the results) illustrates the current debates on the integration between clinical trials and health-care systems in oncology.
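As a concrete, hedged illustration of cohort identification against the OMOP-style warehouses described earlier in this section, the sketch below filters hypothetical OMOP Common Data Model extracts (person and condition_occurrence tables) with pandas; the concept identifier is a placeholder, not a verified vocabulary code.

    import pandas as pd

    # Hypothetical extracts following OMOP CDM table and column names.
    person = pd.read_csv("person.csv")
    condition = pd.read_csv("condition_occurrence.csv")

    # Placeholder set of condition concept IDs of interest (replace with real codes).
    CANCER_CONCEPT_IDS = {4115276}

    cohort = (
        condition[condition["condition_concept_id"].isin(CANCER_CONCEPT_IDS)]
        .merge(person, on="person_id")
        .drop_duplicates("person_id")
    )
    print(len(cohort), "patients meet the condition criterion")

Because every site that adopts the common model exposes the same table and column names, the same query can in principle be rerun unchanged across a network of warehouses.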

18.2.2 Data reuse and artificial intelligence There are high expectations that AI may support cancer detection, optimize the care trajectory of cancer patients, suggest optimal therapies, reduce medical errors, and improve subject enrollment into clinical trials.22 AI, particularly DL algorithms, is gaining extensive attention for processing images and text because of its excellent performance in natural language understanding and image recognition tasks. In oncology, several authors23–25 showed that DL algorithms, especially convolutional neural networks (CNNs) and radiomics, could be used for detecting and evaluating cancer lesions, facilitating treatment, and predicting treatment response. All machine learning algorithms require huge sets of high-quality data for training. Collecting as much data as possible for the training set can help reduce the risk of overfitting, a modeling error that occurs when a function is too closely fit to a limited set of data points. Moreover, deficiencies in the data inexorably compromise the algorithm.26 The integrity of unbiased, clinically useful data depends upon the reliability of EHR data sources. Data quality must therefore be systematically assessed, including unexpected changes of some variables over time.27 Other risks are related to sampling bias and observation bias. Sampling bias leads to nonrepresentative datasets, which in turn leads to nonrepresentative algorithm outputs: the algorithm is trained on a dataset from population p1 and then used to make decisions in a population p2. Observation bias may occur across a variety of health-care specialties due to measurement error,28 even for common parameters such as heart rate or blood pressure assessment.29 All these considerations show that deep understanding of the data used to train AI, as well as mechanisms for data curation, is needed to realize the potential of AI for precision oncology. Transfer of AI algorithms to clinical settings requires rigorous clinical validation studies of AI models. Model design and training must be totally separated from clinical evaluation. Besides the training dataset on which the model was trained and the validation set used for internal validation, external datasets, totally independent from the initial ones, are needed to answer questions such as: how well does an AI model developed at one institution perform at another institution? A first step before conducting a prospective study could be to evaluate the performance of the algorithm on a retrospective dataset from other institutions. Such datasets could be derived from clinical data warehouses, transformed into the appropriate format, and used to test the algorithm. Typically, the goal is to check that the AI looks safe enough, based on a small number of cases, like phase I/II trials in drug development. In their roadmap for Translational Research on AI in Medical Imaging, the American College of Radiology together with the National Institutes of Health and the Radiological Society of North America defined key priorities for translational research on AI, including the needs for “establishing methods to encourage data sharing for training and testing AI algorithms to promote generalizability to widespread clinical practice and mitigate unintended bias; establishing tools for validation and performance monitoring of AI algorithms to facilitate regulatory approval; and developing standards and common data elements for seamless integration of AI tools into existing clinical workflows.”30


These recommendations could be extended to all the applications in oncology, in order to deliver high-quality AI models intended for use in clinical practice.
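A minimal sketch of the separation between model development and clinical evaluation discussed above follows; the file names, feature columns, and classifier are hypothetical placeholders, not the setup of any study cited in this chapter.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    internal = pd.read_csv("institution_A_cohort.csv")   # development data
    external = pd.read_csv("institution_B_cohort.csv")   # fully independent cohort
    features = [c for c in internal.columns if c != "outcome"]

    # Internal split: training plus a held-out set for internal validation.
    X_train, X_val, y_train, y_val = train_test_split(
        internal[features], internal["outcome"],
        test_size=0.3, stratify=internal["outcome"], random_state=0)

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("Internal validation AUC:",
          roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

    # The external cohort is evaluated only once, after development is frozen.
    print("External test AUC:",
          roc_auc_score(external["outcome"],
                        model.predict_proba(external[features])[:, 1]))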

18.2.3 Data reuse for patient care Many groups have highlighted the secondary use of EHR data for research. Nevertheless, besides utilization for research and cohort identification, clinical data warehouses can be used to make clinical decisions. Frankovich et al. in 2011 provided a good example of how data reuse may benefit the individual patient.31 They admitted a 13-year-old girl with systemic lupus erythematosus. Their patient’s presentation was complicated by several additional factors that put her at potential risk for thrombosis, and they considered anticoagulation. However, they found neither studies in the literature nor clinical trials corresponding to such a situation. Given the risk of bleeding and the concurrent risk of thrombosis, they did not reach a consensus. Finally, they used the data stored in their data warehouse to review similar pediatric patients with lupus and made a decision on the basis of the results. Such a process of solving the diagnostic or therapeutic problem of a new patient by recalling previous cases that exhibited similar symptoms is possible only if the data are available, along with sufficiently sophisticated IT methods. Evidence-based medicine was adopted in the last decades as a methodology for improving care. Nowadays, in the era of precision oncology, the increased complexity of health care requires new insights regarding potential uses of RWE for clinical decisions. In oncology, reuse of real-world data may mirror the approach illustrated by Frankovich et al. When clinicians discuss cases during tumor board meetings and have to make decisions about rare, complex cases, being able to search for similar patients in clinical data warehouses is undoubtedly helpful in the quest for the right treatment. Similarity metrics must take into account all the data available in clinical data warehouses, including patients’ medical history, diagnoses, molecular characteristics, and outcomes. Similarity metrics have already been tested to help diagnose rare diseases, with encouraging results.32,33 The same approach should be developed to support treatment decisions in precision oncology. The search for similar cases in a huge repository of EHRs is expected to be more helpful than physician recollection alone or pooled colleague opinion.34 However, it requires overcoming obstacles related to data reuse, including possibly restrictive regulations and the need for technical expertise in databases.
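To make the similar-patient search concrete, here is a deliberately simplified sketch: rank previously treated patients by cosine similarity of standardized feature vectors. The three numeric features are illustrative placeholders; real warehouse data mix codes, text, and molecular results and call for richer similarity metrics.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import StandardScaler

    # Rows = previously treated patients; columns = age, a lab value, tumor size (cm).
    cohort = np.array([[62, 1.2, 3.1],
                       [45, 0.8, 1.4],
                       [71, 2.0, 4.0]], dtype=float)
    new_case = np.array([[60, 1.1, 2.8]], dtype=float)

    scaler = StandardScaler().fit(cohort)
    similarities = cosine_similarity(scaler.transform(new_case),
                                     scaler.transform(cohort))[0]
    ranking = np.argsort(similarities)[::-1]   # most similar historical patients first
    print("Most similar patients (row indices):", ranking)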

18.3 Artificial intelligence applications for imaging in oncology The field of “computer vision” (computer-assisted vision) has become one of the main applications of AI. Image recognition algorithms were developed and tested on huge Internet databases of animals and objects. It took only a small step for deep algorithms that detect objects in an image to be applied to medical imaging to detect (normal/abnormal) or characterize (benign/malignant) lesions. The transferability of deep algorithms from one problem to another (i.e., “transfer learning”) made this all the more logical. Multiple applications of AI are expected in imaging in oncology, impacting not only the diagnostic performance of images, but also the way they are acquired, patient flow and


exam workflow. Other developments specific to interventional radiology are also under way.

18.3.1 Applications in oncology for diagnosis and prediction 18.3.1.1 Computer vision and image analysis Computer vision and image analysis is the most common application for AI software in imaging. Two types of algorithms can be developed. A detection algorithm answers the question: “Is there a lesion in the image?”, and such software is termed “computer-aided detection” (CAD). These algorithms are applied to screening, for example. A characterization algorithm answers the question: “Is the lesion benign or malignant?” or “What is the nature of the lesion?”, and such software is termed “computer-aided diagnosis” (CADx). The radiologist submits the examination to the software that tags the lesions and gives the percentage risk of being a cancer. Both these applications have existed for more than 20 years using traditional machine learning techniques but had fallen into relative disuse due to the high rate of false positives.35 18.3.1.2 Radiomics: data-driven biomarker discovery Radiomics is a research field related to AI. The principle of radiomics relies on two steps: (1) the extraction of a large number (usually over a hundred) of quantitative parameters (termed “features”) from images with no a priori hypothesis and (2) the selection of the best feature or combination of features correlated to a desired characteristic (Fig. 18.3). This may be a genetic mutation (radiogenomics), a biological marker (receptor expression, for example), or an outcome (survival). Machine and DL techniques may be used for feature extraction, and/or for feature selection. 18.3.1.3 Artificial intelligence assisted diagnosis and monitoring in oncology AI tools for diagnosis can be applied at all levels of diagnostic imaging. Mass screening is an obvious application because it concerns a large number of studies and justifies the use of a tool that can accelerate reading. Images are highly standardized, and reports are structured, which allows for large preexisting training datasets, and less generalizability issues. Finally, the risk of errors is directly integrated in the principle of screening, that is, it is an accepted concept that there can be false positives and negatives, as long as they are reasonably low. Algorithms for screening usually combine detection and characterization

FIGURE 18.3 Radiomics analysis pipeline.

in the same software. Applications to mass screening for breast cancer in mammography36 or lung cancer screening in CT37 are already being developed and tested. Breast cancer screening is the most advanced application, and studies show that AI-supported CADx systems perform at least as well as radiologists.36,38 A very large study using UK and US datasets39 showed that a DL algorithm reduced false positives and false negatives compared to the first reading but did not outperform consensus double reading, the standard of care in many European countries. However, it must be kept in mind that most studies were performed on enriched datasets, with prevalence of cancers around 20%–30%, much higher than the expected 2‰–10‰ in a screening program, probably leading to an overestimation of accuracy.40 Clinical implementation of AI-based CADx tools requires testing in real-life settings to (1) prove their usefulness in the clinical workflow, (2) assess the reader’s interaction with the tools, and (3) ensure that there are no unintended consequences. Apart from cancer screening, many studies are exploring AI-based tools for detection and characterization in a wide variety of applications, such as prostate cancer detection41 or neuro-oncology.42 There are no clinical tools available to date. 18.3.1.4 Treatment outcome assessment and prediction AI tools are also being explored to predict treatment response or prognosis. Studies may develop algorithms that are directly correlated to the outcome (treatment response category or survival), or that are correlated to a specific predictive or prognostic factor. Fields of application are similar to those for tumor detection or characterization. Groups have shown that specific image profiles can be correlated to prognostic factors such as IDH status in glioblastoma,43 EGFR and KRAS mutations in lung cancer,44 or multigene assays in breast cancer for prediction of recurrence.45 Immuno-oncology is garnering much interest because factors predicting response remain to be identified for these expensive therapies. Studies have correlated AI- or radiomics-based features with response to immunotherapy in solid tumors46 or more specifically in glioblastoma.47 On breast MRI images, a neural network-based model could predict pathologic complete response and detect the tumor regions most strongly associated with therapeutic response.48
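As a hedged illustration of the two-step radiomics principle outlined in Section 18.3.1.2, the sketch below extracts a handful of first-order intensity features from a delineated region and then selects the ones most associated with an endpoint; real radiomics pipelines extract hundreds of shape and texture features with dedicated packages, and all inputs here are placeholders.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    def first_order_features(image, mask):
        """Simple intensity statistics inside a binary mask (illustrative only)."""
        voxels = image[mask].astype(float)
        hist, _ = np.histogram(voxels, bins=32)
        p = hist / hist.sum()
        p = p[p > 0]
        entropy = -np.sum(p * np.log2(p))
        return [voxels.mean(), voxels.std(), np.median(voxels),
                voxels.max() - voxels.min(), entropy]

    # Step 1: one feature vector per lesion (image/mask pairs are hypothetical).
    # X = np.array([first_order_features(img, msk) for img, msk in lesions])
    # y = np.array(outcomes)   # e.g., response vs no response
    # Step 2: keep the k features most associated with the endpoint.
    # selector = SelectKBest(f_classif, k=2).fit(X, y)
    # print("Selected feature indices:", selector.get_support(indices=True))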

18.3.2 Applications in oncology to improve exam quality and workflow 18.3.2.1 Improvement of image acquisition DL-based image reconstruction and denoising algorithms are already developed and implemented on recent scanners.49 The principle is to teach the algorithm to reconstruct a well-defined, high signal-to-noise image from a very noisy image. Thus CT images acquired at very low radiation doses can be reconstructed to give diagnostic quality images. It is also possible to reconstruct high-quality postinjection images from CT or MRI scans performed with very low doses of contrast agent injections (10% of the usual dose, for example). Finally, it is possible to accelerate an MRI examination by acquiring images with low spatial resolution and then reconstruct images that are almost identical to those of high-resolution acquisitions. This will increase patient comfort, particularly for cancer patients who might be tired or in pain but undergo multiple exams. In young patients that need repeated exams for


surveillance, these can be performed with less radiation and reduced doses of contrast agent and therefore less possible side effects. 18.3.2.2 Image segmentation One area in which neural networks have demonstrated their superiority is automatic image segmentation. Spinal vertebrae can be automatically identified and numbered, and the loss of height of one or more vertebrae due to compression fracture can be detected. More quantitative information can be extracted automatically from the images: emphysema, sarcopenia, osteoporosis, and vascular calcifications reflecting cardiovascular risk. This will allow a “holistic” imaging report, giving information not only on the cancer, but also on possible complications or risk factors that could impact both treatment choices and prognosis of an individual patient. All this information can also be automatically provided to the radiologist at the beginning of an examination to help him/her make a comprehensive and systematic assessment of the patient’s condition. These tools are already being made available on imaging workstations. 18.3.2.3 Improved workflow Beyond “intelligent” imaging appointment management software, several aspects of the workflow of nuclear medicine physicians and radiologists are being developed using AI algorithms.50,51 The presentation of the exam that the physician must interpret can be enhanced by algorithms by displaying the most appropriate series based on the indication of the exam. Natural language analysis software could retrieve relevant information from the patient’s medical record while the exam is being read, for example, surgical history or previous local therapy, adverse effects. 18.3.2.4 Interventional radiology In interventional radiology, AI tools will allow better planning of procedures, by guiding the choice of equipment, but also by selecting the patients who are most likely to benefit from the procedure. Image segmentation methods and image registration between two modalities will enable precise real-time identification of the organ and the lesion to be treated. Indeed, high-resolution images or images allowing visualization of the lesion, such as MRI or PET, can be merged with images from planar X-ray devices, to guide the radiologist during the procedure. The preparation of the procedure, better guidance to the target, and real-time tracking will result in a reduced dose of X-ray and injected contrast agent, beyond the same advances in image acquisition quality that are expected in diagnostic radiology. Finally, developments in the field of robotics are expected, with prototypes of ultrasensitive and miniaturized sensors or intravascular microrobots enabling procedures similar to robotic surgery, which have become possible thanks to computer-assisted vision techniques and technological advances in highly reliable and latency-free communication networks to produce a signal, enabling an immersive visual and touch experience.
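Returning to the DL-based denoising idea in Section 18.3.2.1, the following is a minimal Keras sketch of a residual convolutional denoiser trained to map low-dose, noisy slices to routine-dose slices; the paired arrays noisy and clean, the image size, and the architecture are assumptions for illustration, not a vendor's actual reconstruction method.

    from tensorflow.keras import layers, models

    def build_denoiser(shape=(256, 256, 1)):
        inp = layers.Input(shape=shape)
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
        noise = layers.Conv2D(1, 3, padding="same")(x)
        # Residual design: the network estimates the noise, which is subtracted.
        out = layers.Subtract()([inp, noise])
        return models.Model(inp, out)

    model = build_denoiser()
    model.compile(optimizer="adam", loss="mse")
    # model.fit(noisy, clean, batch_size=8, epochs=50)   # paired low-/routine-dose slices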


18.4 Artificial intelligence applications for radiation oncology In radiation oncology, simple artificial neural networks (ANN) have historically been used to predict different outcomes in retrospective studies: survival in advanced carcinoma of the head and neck treated with radio(chemo)therapy,52 PSA-level response and toxicity after radiotherapy (RT) for prostate cancer,53–55 pneumonitis in RT for lung cancer,56 or even survival in uterine cervical cancer treated with irradiation.57 The performance of these models was acceptable, but these studies had significant limitations: the training cohorts were small, and they often lacked external validation. Eventually, they were never used in clinical routine. Nowadays, AI mostly refers to DL, a type of ANN that uses many hidden layers. There is currently no consensus as to how many layers count as deep, and there is no clear distinction between the terms ANN and DL. For a thorough technical description of DL, Meyer et al. published an excellent review in the context of radiation oncology.58 CNNs are the most frequently used DL methods, followed by autoencoders, deep deconvolutional neural networks, deep belief networks, and transfer learning. In this part, we will discuss studies where DL was used for treatment planning (segmentation and dosimetry) and outcome prediction (toxicity and efficacy).

18.4.1 Treatment planning 18.4.1.1 Segmentation Manual segmentation is a mandatory step of radiation planning and requires a large amount of time and human resources in radiation oncology (Fig. 18.4A and B). AI could be used to assist physicians and help ensure contouring quality by reducing interobserver variability. AI could also increase adherence to delineation guidelines.59 Furthermore, these tools could be leveraged for real-time adaptive RT.60 Several studies, for each tumoral location, have been published and are reviewed next. Segmentation performance is often evaluated with the Dice Similarity Index (DSI), a measure of the overlap between two sets of contours. A DSI of 0 indicates the absence of overlap between the structures, while a value of 1 means that the overlap is complete.61 The ground truth is the contour performed by one or several physicians (Table 18.1).

FIGURE 18.4 Manual segmentation of a rectal tumor on CT (A) and MRI (B).
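For reference, the DSI described above can be computed directly from two binary masks; the sketch below is a generic implementation, not code from any of the studies reviewed here.

    import numpy as np

    def dice_similarity(a, b):
        """Dice Similarity Index between two binary masks of the same shape."""
        a = np.asarray(a, dtype=bool)
        b = np.asarray(b, dtype=bool)
        denominator = a.sum() + b.sum()
        if denominator == 0:
            return 1.0          # both masks empty: treat as perfect agreement
        return 2.0 * np.logical_and(a, b).sum() / denominator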


TABLE 18.1 Deep learning for segmentation.

Site | Method | Number of patients | DICE (reported average or range) | Reference
Brain | CNN | 305 | 0.67 | 62
Brain | 3D CNN | 182 | 0.66 | 63
Head and neck | CNN | 50 | 0.37–0.89 | 64
Head and neck | DDNN | 230 | 0.33–0.81 | 65
Head and neck | DNN | 52 | 0.62–0.90 | 66
Lung | CNN | 450 | 0.57 (0.16–0.99) | 67
Lung | CNN + conditional random fields | 30 | 0.57–0.87 | 68
Abdomen | CNN | 72 | 0.7 | 70
Abdomen | CNN | 118 | N/A (VOE = 0.06) | 71
Pelvis | CNN | 140 | 0.7 | 72
Pelvis | 2D CNN | 93 | 0.74 | 73

CNN, Convolutional neural network; DDCNN, deep dilated convolutional neural network; DDNN, deep deconvolutional neural network; DNN, deep neural network; VOE, volumetric overlap error.

18.4.1.1.1 Brain

DL techniques have been used in diagnostic neuroradiology for primary or secondary brain tumors, but their direct use in RT is still rare. Liu et al. published a method to segment brain metastases on contrast-enhanced T1w MRI.62 The network architecture had four blocks: input, convolution, fully connected, and classification. This approach was then validated on data from the Multimodal Brain Tumor Image Segmentation challenge (BRATS—65 patients) and 240 patients with brain metastases treated at the University of Texas Southwestern Medical Center. This study showed DSI values of 0.75 ± 0.07 in the tumor core and 0.81 ± 0.04 in the enhancing tumor for the BRATS data. A similar method has been assessed by Charron et al.: they adapted an existing 3D CNN called DeepMedic to detect and segment brain metastases on MRI scans. One hundred and eighty-two patients were included with three MRI modalities (T1w3D, T2flair2D, and T1w2D). The ground truth segmentations were performed by four radiation oncologists and compared to the DL output. Using multimodal MRI (T1w3D plus T2flair2D) provided the best performance (DSI = 0.77).63 18.4.1.1.2 Head and neck

Atlas-based autosegmentation often fails to take into account large primary or nodal lesions or the anatomical effects of surgical procedures. In that context, DL could improve the results of autosegmentation, if the cases used to train the model are diversified enough. Ibragimov et al. trained a CNN-based network with the data of 50 patients treated for head and neck cancers. The CNN’s performance was similar or superior when


compared to the reference segmentation for the spinal cord, mandible, parotid glands, larynx, pharynx, eye globes, and optic nerves.64 Men et al. developed an approach focused on target volume delineation in nasopharyngeal cancer with a deep deconvolutional neural network. They used the data from 230 patients to segment gross tumor and lymph node gross volumes, with their respective clinical target volumes (CTVs), and organs at risk (OAR). This study showed a significant improvement in contouring performance when compared to other segmentation methods.65 Another study by Cardenas et al. applied DL to high-risk CTV autodelineation and showed that their results are comparable to inter- and intraobserver variability in manual delineation (DSI = 0.62–0.90).66 18.4.1.1.3 Lung

Lungs have a naturally high contrast, which explains why traditional semiautomatic tools perform well overall, reaching DSI values above 0.9 when compared to manual benchmarks.67 However, using DL could be of interest for autosegmenting other thoracic OARs and lung tumors. Trullo et al. tested a model with 10 fully convolutional layers (SharpMask) to automatically delineate the esophagus, heart, trachea, aorta, and body contour in 30 CT scans of patients.68 The model achieved DSI values of 0.67–0.9 across organs. Lustberg et al. assessed the time that can be saved when using software-generated contouring and found a significant reduction in contouring time.69 18.4.1.1.4 Abdomen

Autosegmentation is very difficult in the abdomen because of its anatomical inter- and intrapatient variability, hollow organs, and natural bowel displacement. The liver is an easier organ to delineate: Ibragimov et al. published a study on intrahepatic portal vein segmentation based on a CNN70 and reported a DSI of 0.7–0.83 when compared to manual benchmarks. Another team published similar results for liver autosegmentation using 78 CT scans as a training set and 40 as a test set.71 18.4.1.1.5 Pelvis

DL has been used in the pelvic region, for both OAR and target volume segmentation. Men et al. created a deep dilated CNN to automatically segment OARs and CTVs for patients with rectal cancer treated with neoadjuvant chemoradiation, obtaining an 87.7% concordance.65 Trebeschi et al. proposed another method for the same type of patients and used T2 and DWI MRI images, obtaining a DSI of 0.7 and a model AUC of 0.99.72 Similar results were also reported by Wang et al. (DSI 0.74).73 For prostate cancer, Guo et al. created a method to learn the latent feature representation from prostate MR images using a stacked sparse autoencoder (SSAE) on 66 patients.74 18.4.1.2 Dosimetry DL has been used as a delivery optimization technique for automated plan adaptation in lung cancer: Tseng et al. used a retrospective cohort of 114 NSCLC patients to train a model to generate dose distributions, achieving a root mean square error of 0.5 Gy.75 In head and neck cancers, Fan et al. developed an automated planning strategy for Intensity-Modulated Radiation Therapy (IMRT) with a residual neural network trained on 195 patients. They showed that DL can predict clinically acceptable dose distributions,


without any statistically significant difference between prediction and the real clinical plan for all clinically relevant dose volume histogram (DVH) indices, except the brain stem and the right and left lenses.76 In prostate cancer, Chen et al. derived CT images from MRI and evaluated the dosimetric accuracy of this approach.77 They showed that a U-net trained on 51 prostate cancer patients achieved a DVH parameter discrepancy of less than 0.87% and a maximum point dose discrepancy within the PTV of less than 1.01% with respect to the prescription.
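Since several of the planning studies above compare DVH indices, here is a minimal, generic sketch of how a cumulative DVH and common summary metrics can be computed from a dose grid and a structure mask; the inputs are hypothetical arrays, not data from the cited studies.

    import numpy as np

    def cumulative_dvh(dose, mask, bin_width=0.5):
        """Cumulative DVH: fraction of the structure receiving at least each dose level.
        dose: 3D array of dose values in Gy; mask: boolean array of the same shape."""
        voxel_doses = dose[mask]
        bins = np.arange(0.0, voxel_doses.max() + bin_width, bin_width)
        volume_fraction = np.array([(voxel_doses >= d).mean() for d in bins])
        return bins, volume_fraction

    # Typical summary metrics derived from the DVH:
    # V20 = fraction of the structure receiving >= 20 Gy
    # D95 = dose level covering at least 95% of the structure volume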

18.4.2 Outcome prediction 18.4.2.1 Treatment response Treatment response prediction mostly relies on radiomics in radiation oncology studies. In this part, we excluded radiomics studies that did not use machine learning to create a model, or that used linear regression. Overall, research teams in the field are moving away from handcrafted radiomics and using DL approaches (Table 18.2).

TABLE 18.2 Studies using radiomics to model clinical outcome in radiation oncology.

Site | Modality | Number of patients | Method | Results | Reference
Brain metastases | MRI | 110 | DL | Predict response after SBRT: AUC = 0.856 (68.2%–100%) | 78
Brain metastases | MRI | 66 | Intensity and texture | Distinguish true progression from radionecrosis after SBRT: AUC = 0.81 | 79
Head and neck | CT | 315 | Shape, texture, and grayscale intensities | Assess HPV status: AUC = 0.87–0.92 | 80
Head and neck | CT | 270 | DL: CNN | Lymph node metastatic status: AUC = 0.91 (95%CI: 0.85–0.97) | 81
Head and neck | CT | 465 | 134 features | Predict tumor control: low risk = 94% at 5 years, high risk = 62%–80% | 82
Lung | CT | 1,194 | DL: 3D CNN | 2-year overall survival after chemoradiation: AUC = 0.70 (95%CI: 0.63–0.78), P < .001 | 83
Lung | CT | 179 | DL: CNN | 2-year overall survival after SBRT: AUC = 0.74 (P < .05) | 84
Esophagus | 18F-FDG-PET | 97 | DL: CNN | Complete response after chemoradiation: AUC = 0.74 | 86
Rectal | MRI | 222 | 30 selected features | Predict pathologic response after neoadjuvant chemoradiation: AUC = 0.9756 (95%CI: 0.9185–0.9711) | 87
Rectal | CT | 95 | Intensity and texture | Predict pathologic response after neoadjuvant chemoradiation: accuracy = 80% | 24

AUC, Area under the curve; CNN, convolutional neural network; DL, deep learning; SBRT, stereotactic body radiation therapy.


18.4.2.1.1 Brain

Radiomics has been used in two studies for brain metastases: Cha et al. directly trained a DL network (without feature extraction) to predict response after Stereotactic Body Radiation Therapy (SBRT) on a cohort of 110 patients; the AUC was 0.856 (68.2%-100%).78 Peng et al. assessed the role of radiomics features in distinguishing radionecrosis from tumoral progression after SBRT for brain metastases. In total, 66 patients with 82 lesions were included to create a signature with 51 MRI features; the AUC was 0.81.79 Overall, the technical quality of the studies leveraging radiomics in brain tumors is not comparable to the studies published in lung or head and neck cancers.

18.4.2.1.2 Head and neck

Twenty years ago, Bryce et al. analyzed data from a phase III trial that included 95 patients with locally advanced squamous cell carcinoma of the head and neck. They trained several ANNs using different features and found that the best ANN model used tumor stage, nodal stage, tumor size, tumor resectability, and hemoglobin to predict 2-year survival, with an area under the Receiver Operating Characteristic curve of 0.78 ± 0.05.52 More recently, Yu et al. extracted 1683 features with IBEX from the head and neck gross tumor volume (GTV) T or N in order to predict HPV status. A general linear model was used because it provided a better AUC than the other models explored (random forest, support vector machine (SVM), decision trees, and DL). The model performance was tested on two external datasets: the AUC was 0.87 on the first and 0.92 on the second dataset.80 Kann et al. created a CNN and trained it on 2875 lymph nodes delineated from the CT scans of 270 patients to determine whether a lymph node was metastatic or not. On the test set the model demonstrated an AUC of 0.91 [95% confidence interval (95% CI): 0.85-0.97].81 Finally, a team from the MD Anderson Cancer Center also published a radiomics signature using data from 465 patients treated with IMRT for oropharyngeal cancer to assess the recurrence probability. Using IBEX, 134 radiomic features were extracted from the primary GTV and narrowed down to only two features through decision tree modeling. Local tumor control for patients with a low-risk signature was 94% at 5 years, compared to 62%-80% for the patients with a high-risk signature.82
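A greatly simplified version of this handcrafted-radiomics workflow (first-order intensity features computed inside a GTV mask, then several classifiers compared by cross-validated AUC) might look like the sketch below. The data are synthetic, the feature set is a tiny stand-in for the 1683 IBEX features, and intensity_features is a hypothetical helper, not the pipeline of any cited study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def intensity_features(image, mask):
    """A few first-order radiomic features from the voxels inside the GTV mask."""
    vox = image[mask]
    return [vox.mean(), vox.std(), vox.min(), vox.max(),
            np.percentile(vox, 10), np.percentile(vox, 90)]

rng = np.random.default_rng(1)
# Hypothetical cohort: one CT volume + GTV mask per patient, with a binary label (e.g., HPV status)
X, y = [], []
for _ in range(60):
    img = rng.normal(40, 20, size=(32, 32, 32))
    msk = np.zeros_like(img, dtype=bool)
    msk[8:24, 8:24, 8:24] = True
    X.append(intensity_features(img, msk))
    y.append(rng.integers(0, 2))
X, y = np.array(X), np.array(y)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("SVM", SVC(probability=True)),
                    ("random forest", RandomForestClassifier(n_estimators=200))]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.2f}")
```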

18.4.2.1.3 Lung

In 2018 Hosny et al. published a study using a CNN for lung cancer, with seven independent datasets, in order to stratify patients by mortality risk.83 In total, 1194 patients were included. The model was able to predict 2-year overall survival after chemoradiation: AUC = 0.70 (95% CI: 0.63-0.78), P < .001. More recently, Xu et al. published a radiomics study with DL to predict treatment response. A model was developed to predict 2-year survival from the data of a cohort of 179 patients with stage III non-small cell lung cancer treated with chemoradiation; the AUC was 0.74 (P < .05).84 DL has also been used in a study published by Lou et al. to predict the outcome after SBRT and individualize RT dose.85
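For the DL end of the spectrum, a toy 3D CNN that maps a tumor-centered CT patch to a binary endpoint (such as 2-year survival) could be sketched as follows. This illustrates the general architecture class only; it is not the network of Hosny et al., Xu et al., or Lou et al.

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Minimal 3D CNN for a binary endpoint (e.g., 2-year survival) from a CT sub-volume."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(16, 1)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)            # logit; apply sigmoid for a probability

model = Tiny3DCNN()
ct_patch = torch.randn(4, 1, 32, 32, 32)     # hypothetical batch of tumor-centered CT patches
logits = model(ct_patch)
labels = torch.randint(0, 2, (4,)).float()   # hypothetical 2-year survival labels
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), labels)
loss.backward()
print(logits.shape, loss.item())
```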

18.4.2.1.4 Esophagus

Amyar et al. showed that a 3D CNN trained on the baseline 18F-FDG PET of 97 patients treated with chemoradiotherapy for locally advanced esophageal cancer could predict treatment response and outperformed 2D CNN architectures and more classical radiomics approaches (AUC = 0.70 ± 0.02). Moreover, the addition of a margin around the target lesion seemed to increase the accuracy of the 3D CNN model (AUC = 0.74 ± 0.02).86

18.4.2.1.5 Rectum

Radiomics and machine learning have been used to predict pathologic complete response after neoadjuvant chemoradiation with MRI87 or PET/CT.86 The most robust study included 222 patients and created a signature that included 30 MRI features.87 An SVM provided an AUC of 0.9756 (95% CI: 0.9185-0.9711) in the validation cohort. Radiomics feature extraction from MRI is complex and less reproducible than from CT. Extracting features from treatment planning CT scans could be easier and more reproducible; this approach was used and coupled with a DL approach, with 80% accuracy.24

18.4.2.2 Toxicity

There are few studies predicting toxicity after treatment in radiation oncology. Most of them used traditional, non-DL, machine learning methods. In 2009 Zhang et al. used a plan-related clinical toxicity predictive model in an IMRT framework. A total of 125 plans were generated for one head and neck cancer case and 256 plans were generated for one prostate cancer case, with saliva flow rate and G2 rectal bleeding used as prediction outcomes, respectively. The mean absolute error for saliva flow prediction was 0.42%, and the prediction accuracy for G2 rectal bleeding was 97.04%.88 In another work, Pella et al. used a dataset of 321 prostate cancer patients with gastrointestinal and genitourinary acute toxicities. The ANN- and SVM-based methods provided similar accuracy (AUC = 0.7).54 In cervical cancer chemoradiation therapy, Zhen et al. created a CNN model that took into account the rectal dose distribution for 42 patients and predicted >G2 rectal toxicity with an AUC of 0.7.89 In head and neck cancers, Abdollahi et al. modeled sensorineural hearing loss using CT radiomics information: the predictive power of the tested methods was acceptable (AUC = 0.70).90
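The dose-toxicity relationships modeled in these studies are often summarized by a sigmoid dose-response curve. A minimal sketch, assuming a synthetic cohort in which toxicity probability rises with mean rectal dose, is shown below; it is an illustration of the general approach, not the model of any cited study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
# Hypothetical cohort: mean rectal dose (Gy) and observed rectal toxicity (0/1)
mean_dose = rng.uniform(30, 75, size=200)
p_true = 1 / (1 + np.exp(-(mean_dose - 60) / 5))   # assumed underlying dose-response
toxicity = rng.binomial(1, p_true)

model = LogisticRegression().fit(mean_dose.reshape(-1, 1), toxicity)
pred = model.predict_proba(mean_dose.reshape(-1, 1))[:, 1]
print(f"AUC = {roc_auc_score(toxicity, pred):.2f}")
print(f"Predicted toxicity risk at 65 Gy: {model.predict_proba([[65.0]])[0, 1]:.2f}")
```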

18.5 Future directions

As we have seen, there are a large number of studies using real-world data with machine learning in every field of oncology, from diagnosis to treatment and follow-up. These studies are almost exclusively retrospective. Even with excellent performance, AI algorithms will need two things before they can be implemented in the daily routine of an oncologist. First, they will need to be interpretable (Fig. 18.5A and B). Physicians, and patients, need to be able to understand the reasons behind the results provided by an algorithm. As of now, the vast majority of AI is seen as a black box, and this is probably not acceptable in medicine. Interpretability is the next frontier, and methods are already being developed to provide it on top of existing solutions.91 Second, before AI can be used with trust, randomized clinical trials need to be conducted. Just as we would not accept being treated with untested drugs, we should not promote the use of untested AI. Rigorous clinical trials, which assess and compare the performances of humans, AI, and AI-augmented humans on specific, predefined, clinically relevant outcomes, will need to be performed to provide a thorough evaluation of this new field and guarantee the safety of any new algorithm.

FIGURE 18.5 Example of interpretability for prostate cancer survival: individual predictions for virtual patient 1 [risk of dying from PCa in (A) and from any other cause in (B)] and patient 2 [PCa in (C) and other causes in (D)]. The features in red increase the risk of dying and the features in blue decrease it. The size of each block reflects the Shapley value of the feature. Each feature, with its negative or positive contribution to the outcome, is added to form the prediction.
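For per-patient explanations of the kind shown in Fig. 18.5, the shap Python package implements the Shapley-value approach of Ref. 91. The sketch below applies it to a synthetic tabular model with hypothetical feature names; it only illustrates how feature contributions can be read out for a single patient.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
# Hypothetical tabular cohort: age, PSA, Gleason score, comorbidity index
feature_names = ["age", "psa", "gleason", "comorbidity"]
X = rng.normal(size=(300, 4))
risk = 0.8 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=300)  # assumed risk score

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, risk)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])        # per-patient, per-feature contributions

for name, contrib in zip(feature_names, shap_values[0]):
    print(f"patient 1: {name:12s} contribution {contrib:+.3f}")
```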

References 1. Chen C, He M, Zhu Y, Shi L, Wang X. Five critical elements to ensure the precision medicine. Cancer Metastasis Rev 2015;34(2):313 18. 2. Murphy SN, Barnett GO, Chueh HC. Visual query tool for finding patient cohorts from a clinical data warehouse of the partners HealthCare system, In: Proc. AMIA Symp.; 2000. p. 1174. 3. Segagni D, Tibollo V, Dagliati A, Zambelli A, Priori SG, Bellazzi R. An ICT infrastructure to integrate clinical and molecular data in oncology research. BMC Bioinformatics 2012;13(Suppl 4):S5. 4. Rance B, Canuel V, Countouris H, Laurent-Puig P, Burgun A. Integrating Heterogeneous biomedical data for cancer research: the CARPEM infrastructure. Appl Clin Inform 2016;7(2):260 74. 5. Kurian AW, Mitani A, Desai M, Yu PP, Seto T, Weber SC, et al. Breast cancer treatment across health care systems: linking electronic medical records and state registry data to enable outcomes research. Cancer 2014;120(1):103 11. 6. Seneviratne MG, Seto T, Blayney DW, Brooks JD, Hernandez-Boussard T. Architecture and implementation of a clinical research data warehouse for prostate cancer. EGEMS (Wash DC) 2018;6(1):13. 7. Rubinstein SM, Warner JL. CancerLinQ: origins, implementation, and future directions. JCO Clin Cancer Inform 2018;2:1 7. 8. Belenkaya R, Gurley M, Dymshyts D, Araujo S, Williams A, Chen R, et al. Standardized observational cancer research using the OMOP CDM oncology module. Stud Health Technol Inform 2019;264:1831 2. 9. Warner JL, Dymshyts D, Reich CG, Gurley MJ, Hochheiser H, Moldwin ZH, et al. HemOnc: a new standard vocabulary for chemotherapy regimen representation in the OMOP common data model. J Biomed Inform 2019;96:103239. 10. Bibault J-E, Zapletal E, Rance B, Giraud P, Burgun A. Labeling for Big Data in radiation oncology: the radiation oncology structures ontology. PLoS One 2018;13(1):e0191263. 11. Casey JA, Schwartz BS, Stewart WF, Adler NE. Using electronic health records for population health research: a review of methods and applications. Annu Rev Public Health 2016;37:61 81. 12. Denny JC, Bastarache L, Roden DM. Phenome-wide association studies as a tool to advance precision medicine. Annu Rev Genomics Hum Genet 2016;17:353 73. 13. Verma A, Lucas A, Verma SS, Zhang Y, Josyula N, Khan A, et al. PheWAS and beyond: the landscape of associations with medical diagnoses and clinical measures across 38,662 individuals from Geisinger. Am J Hum Genet 2018;102(4):592 608. 14. Fritsche LG, Gruber SB, Wu Z, Schmidt EM, Zawistowski M, Moser SE, et al. Association of polygenic risk scores for multiple cancers in a phenome-wide study: results from the Michigan genomics initiative. Am J Hum Genet 2018;102(6):1048 61.


15. Clinical Cancer Genome Task Team of the Global Alliance for Genomics and Health, Lawler M, Haussler D, Siu LL, Haendel MA, McMurry JA, et al. Sharing clinical and genomic data on cancer the need for global solutions. N Engl J Med 2017;376(21):2006 9. 16. Girardeau Y, Doods J, Zapletal E, Chatellier G, Daniel C, Burgun A, et al. Leveraging the EHR4CR platform to support patient inclusion in academic studies: challenges and lessons learned. BMC Med Res Methodol 2017;17(1):36. 17. Meystre SM, Heider PM, Kim Y, Aruch DB, Britten CD. Automatic trial eligibility surveillance based on unstructured clinical data. Int J Med Inform 2019;129:13 19. 18. Ni Y, Wright J, Perentesis J, Lingren T, Deleger L, Kaiser M, et al. Increasing the efficiency of trial-patient matching: automated clinical trial eligibility pre-screening for pediatric oncology patients. BMC Med Inform Decis Mak 2015;15:28. 19. Zhang K, Demner-Fushman D. Automated classification of eligibility criteria in clinical trials to facilitate patient-trial matching for specific patient populations. J Am Med Inform Assoc JAMIA 2017;24(4):781 7. 20. Xu Y, Zhou X, Suehs BT, Hartzema AG, Kahn MG, Moride Y, et al. A comparative assessment of observational medical outcomes partnership and mini-sentinel common data models and analytics: implications for active drug safety surveillance. Drug Saf 2015;38(8):749 65. 21. Przepiorka D, Ko C-W, Deisseroth A, Yancey CL, Candau-Chacon R, Chiu H-J, et al. FDA approval: blinatumomab. Clin Cancer Res 2015;21(18):4035 9. 22. Miller DD, Brown EW. Artificial intelligence in medical practice: the question to the answer? Am J Med 2018;131(2):129 33. 23. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542(7639):115 18. 24. Bibault J-E, Giraud P, Durdux C, Taieb J, Berger A, Coriat R, et al. Deep learning and radiomics predict complete response after neo-adjuvant chemoradiation for locally advanced rectal cancer. Sci Rep 2018;8(1):12611. 25. Harms J, Lei Y, Wang T, Zhang R, Zhou J, Tang X, et al. Paired cycle-GAN-based image correction for quantitative cone-beam computed tomography. Med Phys 2019;46(9):3998 4009. 26. Cahan EM, Hernandez-Boussard T, Thadaney-Israni S, Rubin DL. Putting the data before the algorithm in big data addressing personalized healthcare. NPJ Digit Med 2019;2:78. 27. Looten V, Kong Win Chang L, Neuraz A, Landau-Loriot M-A, Vedie B, Paul J-L, et al. What can millions of laboratory test results tell us about the temporal aspect of data quality? Study of data spanning 17 years in a clinical data warehouse. Comput Methods Programs Biomed 2019;181:104825. 28. Brakenhoff TB, Mitroiu M, Keogh RH, Moons KGM, Groenwold RHH, van Smeden M. Measurement error is often neglected in medical literature: a systematic review. J Clin Epidemiol 2018;98:89 97. 29. Lee ES, Lee JS, Joo MC, Kim JH, Noh SE. Accuracy of heart rate measurement using smartphones during treadmill exercise in male patients with ischemic heart disease. Ann Rehabil Med 2017;41(1):129 37. 30. Allen B, Seltzer SE, Langlotz CP, Dreyer KP, Summers RM, Petrick N, et al. A Road Map for Translational Research on Artificial Intelligence in Medical Imaging: from the 2018 National Institutes of Health/RSNA/ ACR/The Academy Workshop. J Am Coll Radiol 2019;16(9 Pt A):1179 89. 31. Frankovich J, Longhurst CA, Sutherland SM. Evidence-based medicine in the EMR era. N Engl J Med 2011;365 (19):1758 9. 32. 
Garcelon N, Neuraz A, Benoit V, Salomon R, Kracker S, Suarez F, et al. Finding patients using similarity measures in a rare diseases-oriented clinical data warehouse: Dr. Warehouse and the needle in the needle stack. J Biomed Inform 2017;73:51 61. 33. Chen X, Garcelon N, Neuraz A, Billot K, Lelarge M, Bonald T, et al. Phenotypic similarity for rare disease: ciliopathy diagnoses and subtyping. J Biomed Inform 2019;100:103308. 34. Gombar S, Callahan A, Califf R, Harrington R, Shah NH. It is time to learn from patients like mine. NPJ Digit Med 2019;2:16. 35. Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts HJWL. Artificial intelligence in radiology. Nat Rev Cancer 2018;18(8):500 10. 36. Thomassin-Naggara I, Balleyguier C, Ceugnart L, Heid P, Lenczner G, Maire A, et al. Artificial intelligence and breast screening: French Radiology Community position paper. Diagn Interv Imaging 2019;100 (10):553 66.


37. Chassagnon G, Vakalopolou M, Paragios N, Revel M-P. Deep learning: definition and perspectives for thoracic imaging. Eur Radiol [Internet] 2019. Available from: https://doi.org/10.1007/s00330-019-06564-3. 38. Gao Y, Geras KJ, Lewin AA, Moy L. New frontiers: an update on computer-aided diagnosis for breast imaging in the age of artificial intelligence. Am J Roentgenol 2019;212(2):300 7. 39. McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577(7788):89 94. 40. Houssami N, Kirkpatrick-Jones G, Noguchi N, Lee CI. Artificial Intelligence (AI) for the early detection of breast cancer: a scoping review to assess AI’s potential in breast screening practice. Expert Rev Med Devices 2019;16(5):351 62. 41. Giannini V, Mazzetti S, Vignati A, Russo F, Bollito E, Porpiglia F, et al. A fully automatic computer aided diagnosis system for peripheral zone prostate cancer detection using multi-parametric magnetic resonance imaging. Comput Med Imaging Graph 2015;46:219 26. 42. Shaver MM, Kohanteb PA, Chiou C, Bardis MD, Chantaduly C, Bota D, et al. Optimizing neuro-oncology imaging: a review of deep learning approaches for glioma imaging. Cancers 2019;11(6):829. 43. Chang K, Bai HX, Zhou H, Su C, Bi WL, Agbodza E, et al. Residual convolutional neural network for the determination of IDH status in low- and high-grade gliomas from MR imaging. Clin Cancer Res 2018;24(5):1073 81. 44. Velazquez ER, Parmar C, Liu Y, Coroller TP, Cruz G, Stringfield O, et al. Somatic mutations drive distinct imaging phenotypes in lung cancer. Cancer Res 2017;77(14):3922 30. 45. Li H, Zhu Y, Burnside ES, Drukker K, Hoadley KA, Fan C, et al. MR imaging radiomics signatures for predicting the risk of breast cancer recurrence as given by research versions of MammaPrint, Oncotype DX, and PAM50 gene assays. Radiology 2016;281(2):382 91. 46. Sun R, Limkin EJ, Vakalopoulou M, Dercle L, Champiat S, Han SR, et al. A radiomics approach to assess tumour-infiltrating CD8 cells and response to anti-PD-1 or anti-PD-L1 immunotherapy: an imaging biomarker, retrospective multicohort study. Lancet Oncol 2018;19(9):1180 91. 47. Sinigaglia M, Assi T, Besson FL, Ammari S, Edjlali M, Feltus W, et al. Imaging-guided precision medicine in glioblastoma patients treated with immune checkpoint modulators: research trend and future directions in the field of imaging biomarkers and artificial intelligence. EJNMMI Res 2019;9(1):78. 48. Sheth D, Giger ML. Artificial intelligence in the interpretation of breast cancer on MRI. J Magn Reson Imaging [Internet] 2020;51(5):1310 24 [cited 02.02.20];n/a(n/a). Available from. Available from: http://onlinelibrary. wiley.com/doi/abs/10.1002/jmri.26878. 49. Higaki T, Nakamura Y, Tatsugami F, Nakaura T, Awai K. Improvement of image quality at CT and MRI using deep learning. Jpn J Radiol 2019;37(1):73 80. 50. Lakhani P, Prater AB, Hutson RK, Andriole KP, Dreyer KJ, Morey J, et al. Machine learning in radiology: applications beyond image interpretation. J Am Coll Radiol 2018;15(2):350 9. 51. Doshi AM, Moore WH, Kim DC, Rosenkrantz AB, Fefferman NR, Ostrow DL, et al. Informatics solutions for driving an effective and efficient radiology practice. Radiographics 2018;38(6):1810 22. 52. Bryce TJ, Dewhirst MW, Floyd CE, Hars V, Brizel DM. Artificial neural network model of survival in patients treated with irradiation with and without concurrent chemotherapy for advanced carcinoma of the head and neck. 
Int J Radiat Oncol Biol Phys 1998;41(2):339 45. 53. Gulliford SL, Webb S, Rowbottom CG, Corne DW, Dearnaley DP. Use of artificial neural networks to predict biological outcomes for patients receiving radical radiotherapy of the prostate. Radiother Oncol 2004;71 (1):3 12. 54. Pella A, Cambria R, Riboldi M, Jereczek-Fossa BA, Fodor C, Zerini D, et al. Use of machine learning methods for prediction of acute toxicity in organs at risk following prostate radiotherapy. Med Phys 2011;38(6):2859 67. 55. Tomatis S, Rancati T, Fiorino C, Vavassori V, Fellin G, Cagna E, et al. Late rectal bleeding after 3D-CRT for prostate cancer: development of a neural-network-based predictive model. Phys Med Biol 2012;57(5):1399 412. 56. Chen S, Zhou S, Zhang J, Yin F-F, Marks LB, Das SK. A neural network model to predict lung radiationinduced pneumonitis. Med Phys 2007;34(9):3420 7. 57. Ochi T, Murase K, Fujii T, Kawamura M, Ikezoe J. Survival prediction using artificial neural networks in patients with uterine cervical cancer treated by radiation therapy alone. Int J Clin Oncol 2002;7(5):294 300. 58. Meyer P, Noblet V, Mazzara C, Lallement A. Survey on deep learning for radiotherapy. Comput Biol Med 2018;98:126 46.


59. Valentini V, Boldrini L, Damiani A, Muren LP. Recommendations on how to establish evidence from autosegmentation software in radiotherapy. Radiother Oncol 2014;112(3):317 20. 60. Boldrini L, Cusumano D, Cellini F, Azario L, Mattiucci GC, Valentini V. Online adaptive magnetic resonance guided radiotherapy for pancreatic cancer: state of the art, pearls and pitfalls. Radiat Oncol Lond Engl 2019;14(1):71. 61. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945;26(3):297 302. 62. Liu Y, Stojadinovic S, Hrycushko B, Wardak Z, Lau S, Lu W, et al. A deep convolutional neural networkbased automatic delineation strategy for multiple brain metastases stereotactic radiosurgery. PLoS One 2017;12 (10):e0185844. 63. Charron O, Lallement A, Jarnet D, Noblet V, Clavier J-B, Meyer P. Automatic detection and segmentation of brain metastases on multimodal MR images with a deep convolutional neural network. Comput Biol Med 2018;95:43 54. 64. Ibragimov B, Xing L. Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks. Med Phys 2017;44(2):547 57. 65. Men K, Dai J, Li Y. Automatic segmentation of the clinical target volume and organs at risk in the planning CT for rectal cancer using deep dilated convolutional neural networks. Med Phys 2017;44(12):6377 89. 66. Cardenas CE, McCarroll RE, Court LE, Elgohari BA, Elhalawani H, Fuller CD, et al. Deep learning algorithm for auto-delineation of high-risk oropharyngeal clinical target volumes with built-in dice similarity coefficient parameter optimization function. Int J Radiat Oncol Biol Phys 2018;101(2):468 78. 67. Zhu M, Bzdusek K, Brink C, Eriksen JG, Hansen O, Jensen HA, et al. Multi-institutional quantitative evaluation and clinical validation of Smart Probabilistic Image Contouring Engine (SPICE) autosegmentation of target structures and normal tissues on computer tomography images in the head and neck, thorax, liver, and male pelvis areas. Int J Radiat Oncol Biol Phys 2013;87(4):809 16. 68. Trullo R, Petitjean C, Ruan S, Dubray B, Nie D, Shen D. Segmentation of organs at risk in thoracic ct images using a SharpMask architecture and conditional random fields. Proc IEEE Int Symp Biomed Imaging 2017;2017:1003 6. 69. Lustberg T, van Soest J, Gooding M, Peressutti D, Aljabar P, van der Stoep J, et al. Clinical evaluation of atlas and deep learning based automatic contouring for lung cancer. Radiother Oncol 2018;126(2):312 17. 70. Ibragimov B, Toesca D, Chang D, Koong A, Xing L. Combining deep learning with anatomical analysis for segmentation of the portal vein for liver SBRT planning. Phys Med Biol 2017;62(23):8943 58. 71. Lu F, Wu F, Hu P, Peng Z, Kong D. Automatic 3D liver location and segmentation via convolutional neural network and graph cut. Int J Comput Assist Radiol Surg 2017;12(2):171 82. 72. Trebeschi S, van Griethuysen JJM, Lambregts DMJ, Lahaye MJ, Parmar C, Bakers FCH, et al. Deep learning for fully-automated localization and segmentation of rectal cancer on multiparametric MR. Sci Rep 2017;7(1):5301. 73. Wang J, Lu J, Qin G, Shen L, Sun Y, Ying H, et al. Technical note: a deep learning-based autosegmentation of rectal tumors in MR images. Med Phys 2018;45(6):2560 4. 74. Guo Y, Gao Y, Shen D. Deformable MR prostate segmentation via deep feature learning and sparse patch matching. IEEE Trans Med Imaging 2016;35(4):1077 89. 75. Tseng H-H, Luo Y, Cui S, Chien J-T, Ten Haken RK, Naqa IE. Deep reinforcement learning for automated radiation adaptation in lung cancer. 
Med Phys 2017;44(12):6690 705. 76. Fan J, Wang J, Chen Z, Hu C, Zhang Z, Hu W. Automatic treatment planning based on three-dimensional dose distribution predicted from deep learning technique. Med Phys 2019;46(1):370 81. 77. Chen S, Qin A, Zhou D, Yan D. Technical note: U-net-generated synthetic CT images for magnetic resonance imaging-only prostate intensity-modulated radiation therapy treatment planning. Med Phys 2018;45(12):5659 65. 78. Cha YJ, Jang WI, Kim M-S, Yoo HJ, Paik EK, Jeong HK, et al. Prediction of response to stereotactic radiosurgery for brain metastases using convolutional neural networks. Anticancer Res 2018;38(9):5437 45. 79. Peng L, Parekh V, Huang P, Lin DD, Sheikh K, Baker B, et al. Distinguishing true progression from radionecrosis after stereotactic radiation therapy for brain metastases with machine learning and radiomics. Int J Radiat Oncol Biol Phys 2018;102(4):1236 43. 80. Yu K, Zhang Y, Yu Y, Huang C, Liu R, Li T, et al. Radiomic analysis in prediction of Human Papilloma Virus status. Clin Transl Radiat Oncol 2017;7:49 54. 81. Kann BH, Aneja S, Loganadane GV, Kelly JR, Smith SM, Decker RH, et al. Pretreatment identification of head and neck cancer nodal metastasis and extranodal extension using deep learning neural networks. Sci Rep 2018;8(1):14036.


82. Anderson Cancer Center Head and Neck Quantitative Imaging Working Group MD. Investigation of radiomic signatures for local recurrence using primary tumor texture analysis in oropharyngeal head and neck cancer patients. Sci Rep 2018;8(1):1524. 83. Hosny A, Parmar C, Coroller TP, Grossmann P, Zeleznik R, Kumar A, et al. Deep learning for lung cancer prognostication: a retrospective multi-cohort radiomics study. PLoS Med 2018;15(11):e1002711. 84. Xu Y, Hosny A, Zeleznik R, Parmar C, Coroller T, Franco I, et al. Deep learning predicts lung cancer treatment response from serial medical imaging. Clin Cancer Res 2019;25(11):3266 75. 85. Lou B, Doken S, Zhuang T, Wingerter D, Gidwani M, Mistry N, et al. An image-based deep learning framework for individualising radiotherapy dose: a retrospective analysis of outcome prediction. Lancet Digit Health 2019;1(3):e136 47. 86. Amyar A, Ruan S, Gardin I, Herault R, Clement C, Decazes P, et al. Radiomics-net: convolutional neural networks on FDG PET images for predicting cancer treatment response. J Nucl Med 2018;59(Suppl. 1):324. 87. Liu Z, Zhang X-Y, Shi Y-J, Wang L, Zhu H-T, Tang Z, et al. Radiomics analysis for evaluation of pathological complete response to neoadjuvant chemoradiotherapy in locally advanced rectal cancer. Clin Cancer Res 2017;23(23):7253 62. 88. Zhang HH, D’Souza WD, Shi L, Meyer RR. Modeling plan-related clinical complications using machine learning tools in a multiplan IMRT framework. Int J Radiat Oncol Biol Phys 2009;74(5):1617 26. 89. Zhen X, Chen J, Zhong Z, Hrycushko B, Zhou L, Jiang S, et al. Deep convolutional neural network with transfer learning for rectum toxicity prediction in cervical cancer radiotherapy: a feasibility study. Phys Med Biol 2017;62(21):8246 63. 90. Abdollahi H, Mostafaei S, Cheraghi S, Shiri I, Rabi Mahdavi S, Kazemnejad A. Cochlea CT radiomics predicts chemoradiotherapy induced sensorineural hearing loss in head and neck cancer patients: a machine learning and multi-variable modelling study. Phys Med 2018;45:192 7. 91. Lundberg S, Lee S-I. A unified approach to interpreting model predictions. ArXiv170507874 Cs Stat [Internet]. Available from: http://arxiv.org/abs/1705.07874; 2017 [cited 22.01.20].


19 Artificial intelligence in cardiovascular imaging

Karthik Seetharam and James K. Min

Abstract
The emergence of artificial intelligence (AI) has sparked tremendous interest in the academic community, commercial industries, and society, with wide-ranging applications from self-driving cars to automated voice recognition. With the explosive progress of AI, the field of cardiovascular imaging is no exception. Cardiovascular disease is one of the leading causes of mortality and morbidity, and various diagnostic modalities play a paramount role in patient evaluation. As data originating from imaging and health care become increasingly complex for conventional statistics, this places us at a delicate crossroads. Machine learning (ML), a subset of AI, has shown significant promise and will help cardiovascular imaging transcend to new heights. In this review, we describe the role of ML to date in the various imaging modalities used for cardiovascular disease.

Keywords: Artificial intelligence; machine learning; cardiovascular disease; cardiovascular imaging; echocardiography

19.1 Introduction

Over the last two decades, significant advances in computer processing capabilities and growth in cloud infrastructures have opened new frontiers in information technology.1 With the emergence of smartphone apps and telemedicine, health care is bound to experience new paradigm shifts, which will impact patient care.2 In parallel with this technological revolution, the landscape of cardiovascular imaging is undergoing significant transformations as well.1,2 Data in cardiovascular imaging are growing exponentially in complexity and in size.3 If data continue to follow this trend, they will potentially supersede the capability of current statistical software.4 Furthermore, existing cardiovascular modalities are acquiring additional parameters or variables, which provide supplementary information.5 The influx of new information can overwhelm physicians and be counterproductive. This tremendous growth can have profound ramifications for medical management, research, and clinical trials.


As the gap between man and machine continues to expand, artificial intelligence (AI) can serve as a bridge in cardiovascular imaging1,6 (Fig. 19.1). Machine learning (ML), a subset of AI, can escape the confines of conventional statistics and help expand the boundaries of cardiovascular imaging.7 ML algorithms can extrapolate information within this massive expanse of data to decipher clinically relevant information.7 Furthermore, they can unravel hidden relationships, leading to data-driven discoveries.3 In addition, AI can facilitate automation and reduce the burden on physicians.1 In this review article, we will discuss the role of ML in cardiovascular imaging.

FIGURE 19.1 Potential of artificial intelligence.

FIGURE 19.2 Progression of artificial intelligence.

19.2 Types of machine learning

ML differs from conventional statistical techniques: prediction is not a strong suit of conventional statistics, and the performance of ML improves as data become larger or more complex.3 Technically speaking, ML is a broad term, but it includes a multitude of algorithms1,7 (Fig. 19.2). ML can be classified into supervised learning, unsupervised learning, semisupervised learning, and reinforcement learning1,7,8 (Table 19.1). Among these, supervised learning and unsupervised learning are the major branches of ML algorithms and the most frequently used.7 Supervised learning requires labels or annotations within a dataset to function effectively.1,7 In contrast, unsupervised learning does not require any kind of labels.1,7 It functions more independently and can uncover hidden relationships with adequate exposure. In this respect, unsupervised learning can work with more complex datasets than its contemporary algorithms. Semisupervised learning contains elements of supervised and unsupervised learning3 and can work with labeled and unlabeled datasets. Reinforcement learning is similar to human psychology and works with certain reward criteria,4,9 but it has yet to gain significant traction in cardiology. A number of centers have utilized ML algorithms with positive results (Table 19.2).

TABLE 19.1 Types of machine learning.

Machine learning algorithm | Brief description | Examples
Supervised learning1 | Labels and outcomes are clearly defined in the dataset | Logistic regression; Bayesian network; least absolute shrinkage and selection operator (LASSO); elastic net regression
Unsupervised learning1 | Dataset contains no labels; the algorithm operates independently to discover relationships | K-means clustering; hierarchical clustering
Reinforcement learning9 | Operates in a similar fashion to human psychology | Used in imaging and analytics
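To make the distinction between the two major branches concrete, a minimal sketch contrasting them on the same synthetic feature matrix is shown below (hypothetical data; scikit-learn is used purely for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
# Hypothetical echocardiographic features (e.g., EF, E/e', LV mass index, strain)
X = np.vstack([rng.normal(0, 1, size=(100, 4)), rng.normal(2, 1, size=(100, 4))])
labels = np.repeat([0, 1], 100)          # known outcome, available only to supervised learning

# Supervised: labels are provided and a mapping from features to outcome is learned
clf = LogisticRegression().fit(X, labels)
print("supervised accuracy:", round(clf.score(X, labels), 2))

# Unsupervised: no labels; the algorithm looks for structure (here, two clusters)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```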

TABLE 19.2 Role of artificial intelligence in cardiac imaging.

Study | Machine learning algorithm | Type of imaging | Brief study description
Narula et al.13 | Supervised learning | Echocardiography | To differentiate between athlete's heart and hypertrophic cardiomyopathy
Sengupta et al.14 | Supervised learning | Echocardiography | To differentiate between constrictive pericarditis and restrictive cardiomyopathy
Madani et al.15 | Deep learning | Echocardiography | To compare the accuracy of a machine learning algorithm with board-certified cardiologists
Madani et al.16 | Deep learning | Echocardiography | To assess automatic interpretation of echocardiograms
Casaclang-Verzosa et al.18 | Unsupervised learning | Echocardiography | To identify different phenotypes of aortic stenosis
Motwani et al.21 | Multiple machine learning algorithms | Computed tomography | To predict 5-year mortality in patients with suspected CAD
Zreik et al.22 | Deep learning algorithm | Computed tomography | To automatically measure fractional flow reserve from CTA
Baskaran et al.23 | Machine learning algorithm | Computed tomography | To automatically segment cardiovascular structures on CTA
Baskaran et al.24 | Deep learning algorithm | Computed tomography | To detect and quantify cardiovascular structures from CTA
Arsanjani et al.26 | Supervised learning algorithm | Nuclear cardiology | To assess the accuracy of MPI for prediction of CAD
Betancur et al.27 | Deep learning | Nuclear cardiology | To assess the prediction of CAD in comparison to TPD
Arsanjani et al.28 | Supervised learning | Nuclear cardiology | To predict revascularization for CAD
Betancur et al.29 | Supervised learning | Nuclear cardiology | To predict MACE events
Bai et al.31 | Machine learning algorithm | Magnetic resonance imaging | To automatically calculate right and left ventricular mass and volumes
Ruijsink et al.32 | Deep learning | Magnetic resonance imaging | To assess ventricular function

CAD, Coronary artery disease; CTA, computed tomography angiography; MACE, major adverse cardiovascular event; MPI, myocardial perfusion imaging; TPD, total perfusion deficit.

19.3 Deep learning

Among the vast variety of ML algorithms, deep learning has the most revolutionary potential and will push the boundaries of cardiovascular imaging to new heights.10 The architecture of deep learning draws parallels to the neuronal structures present within humans.11 These algorithms are arranged in a series of layers, with a number of connections between preceding and subsequent layers.10 As computer processing units and cloud technology continue to grow and evolve, deep learning will inevitably become an integral part of information technology and health care.11 A number of medical centers are exploring the role of deep learning in cardiovascular imaging and are experiencing positive results.
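As a minimal illustration of this layered arrangement, the sketch below stacks a few fully connected layers; the layer sizes are arbitrary and the model is not drawn from any study cited here.

```python
import torch
import torch.nn as nn

# A small fully connected network: each layer feeds the next, loosely mirroring
# the layered architecture described above (sizes are arbitrary illustrations).
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),    # input layer -> first hidden layer
    nn.Linear(32, 32), nn.ReLU(),    # second hidden layer
    nn.Linear(32, 1),                # output layer (e.g., a single risk logit)
)

x = torch.randn(8, 16)               # batch of 8 hypothetical feature vectors
print(model(x).shape)                # torch.Size([8, 1])
```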

19.4 Role of artificial intelligence in echocardiography

Echocardiography is one of the most versatile and readily available modalities in cardiovascular imaging.12 For numerous pathological entities, echocardiography serves as the first-line approach for diagnosing the underlying condition.12 A number of novel technologies such as speckle-tracking and vortex flow mapping present new opportunities for clinical insight and medical management. ML algorithms can tap this underlying information from a number of echocardiographic and clinical parameters.3


Narula et al. employed an ensemble algorithm for distinguishing hypertrophic cardiomyopathy (HCM) from athlete's heart.13 The algorithm showed that left ventricular volume [information gain (IG) = 0.24] was the best predictor for discerning athlete's heart from HCM in echocardiography, followed by the mid left ventricular segment (IG = 0.134) and average longitudinal strain (IG = 0.131). In addition, Sengupta et al. explored the potential of an associative memory classifier ML algorithm for differentiating constrictive pericarditis from restrictive cardiomyopathy.14 Using echocardiographic variables, the algorithm demonstrated areas under the curve (AUC) of 89.2% and 96.2%. Similarly, Madani et al. applied a CNN algorithm to 267 transthoracic echocardiograms with 15 standard views to demonstrate real-life variation.15 On single low-resolution images, the ML algorithm reached 91.7% accuracy in comparison to 70.2%-84.0% for board-certified echocardiographers. In another report, Madani et al. utilized deep learning classifiers for automatic interpretation of echocardiograms,16 obtaining an accuracy of 94.4% for 15-view still-image echocardiographic classification and 91.2% for binary left ventricular hypertrophy classification. Clustering is an unsupervised learning approach that can detect subgroups within a cohort by observing patterns.2 Lancaster et al. utilized a clustering approach to assess left ventricular dysfunction by identifying high-risk phenotypes from echocardiographic variables in 866 patients.17 The analysis found diastolic dysfunction in 559 patients and detected two distinct groups that had modest agreement with conventional classification (kappa = 0.41, P < .0001). Another innovative mechanism within unsupervised learning is topological data analysis (TDA), which builds a corresponding network and shape from data.3 Casaclang-Verzosa et al. utilized TDA to establish patient similarity for precise phenotypic recognition of left ventricular responses during aortic stenosis (AS) progression.18 The TDA algorithm formed a loop that automatically grouped patients with mild and severe AS (P < .0001). Both components were linked by moderate AS, with reduced and preserved ejection fraction on the top and bottom sides of this loop (P < .001). Similar findings were corroborated in mice (P < .001).
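A toy version of the view-classification task described above (a small 2D CNN mapping a low-resolution still frame to one of 15 view classes) might look like the sketch below; it is an architectural illustration only, not the network of Madani et al.

```python
import torch
import torch.nn as nn

class ViewClassifier(nn.Module):
    """Toy 2D CNN for echocardiographic view classification (15 standard views)."""
    def __init__(self, n_views=15):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, n_views)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

frames = torch.randn(4, 1, 60, 80)        # hypothetical low-resolution still frames
logits = ViewClassifier()(frames)
print(logits.argmax(dim=1))               # predicted view index per frame
```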

19.5 Role of artificial intelligence in computed tomography

Computed tomography (CT) is one of the premier modalities in cardiovascular imaging for assessing coronary artery disease (CAD).19 It enables the visualization of plaques and stenosis in the coronary artery tree. With the emergence of CT angiography (CTA), many diagnostic pathways are increasingly incorporating this approach in clinical practice.20 Furthermore, CTA has high sensitivity and specificity to detect and exclude high-grade angiographic stenosis; as a result, a negative CTA can effectively exclude obstructive CAD.20 The application of ML can augment and expand the existing capabilities of cardiac CT. Motwani et al. explored the utilization of an ML algorithm to predict 5-year mortality from cardiac CT in comparison to traditional cardiac metrics in 10,030 patients with possible CAD. Surprisingly, the ML algorithm demonstrated a statistically significantly (P < .001) higher AUC (0.79) than the CT severity scores (SSS = 0.64, SIS = 0.64, DI = 0.62) for predicting 5-year all-cause mortality.21 Zreik et al. utilized a deep learning algorithm to automatically measure fractional flow reserve from coronary CTA in 166 patients who had coronary angiography.22 The area under the receiver operating characteristic curve (ROC) was 0.74 ± 0.02; at sensitivity levels of 0.60, 0.70, and 0.80, the corresponding specificities were 0.77, 0.71, and 0.59, respectively. Our own group has extensive experience with ML algorithms in CTA. Baskaran et al. utilized an ML algorithm for automatic segmentation of cardiac structures on CT,23 including the left and right sides of the heart along with the great vessels. The overall Dice score was 0.932 and results were consistent across various subsets. Furthermore, the automatic segmentation took an average of 440 seconds, which is far more efficient than manual segmentation. Recently, Baskaran et al. applied deep learning algorithms to detect and quantify cardiovascular structures from CTA in 166 patients.24 The combined Dice score was 0.9246. The deep learning architecture agreed with manual annotation for left ventricular volume (r = 0.98), right ventricular volume (r = 0.97), left atrial volume (r = 0.78), right atrial volume (r = 0.97), and left ventricular myocardial mass (r = 0.94), with statistical significance (P < .05).
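Agreement between automated and manual measurements in these studies is usually summarized with a correlation coefficient (and sometimes a Bland-Altman-style bias). A minimal sketch on synthetic left ventricular volumes is shown below; the data and the error model are assumptions, not values from the cited studies.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
# Hypothetical left ventricular volumes (mL): manual annotation vs automated segmentation
manual = rng.normal(120, 30, size=50)
automated = manual + rng.normal(0, 6, size=50)        # assumed small measurement error

r, p = pearsonr(manual, automated)
bias = np.mean(automated - manual)                    # Bland-Altman style mean difference
print(f"r = {r:.2f} (P = {p:.1e}), mean bias = {bias:.1f} mL")
```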

19.6 Role of artificial intelligence in nuclear cardiology

Single-photon emission CT (SPECT) myocardial perfusion imaging (MPI) is a cardinal test in nuclear cardiology.4 MPI has significant value in patients with an intermediate to high probability of CAD and facilitates risk stratification in cardiac imaging.25 SPECT and positron emission tomography MPI play an indispensable role in CAD detection by providing ventricular and myocardial perfusion information.4 An ML framework can augment the accuracy and predictive capabilities of nuclear cardiology. Arsanjani et al. assessed the accuracy of MPI for CAD prediction in 957 patients using an ML framework exploring perfusion and functional variables.26 The findings were compared with automatic quantification software and two experienced readers. The ROC area under the curve for the ML algorithm (0.92) was significantly higher than for both readers (0.87 and 0.88, P < .03), along with higher sensitivity and specificity (P < .005 for all). Betancur et al. assessed the prediction of obstructive CAD from a combination of semiupright and supine stress MPI by deep learning compared with total perfusion deficit (TPD).27 The area under the ROC for prediction of disease per patient and per vessel by the deep learning architecture was superior to the combined TPD (per patient, 0.81 vs 0.78; per vessel, 0.77 vs 0.73; P < .001). Arsanjani et al. investigated whether early revascularization could be predicted in 713 suspected CAD patients by utilizing an ML approach integrating clinical and imaging data from MPI SPECT.28 The prediction of revascularization by the ML algorithm was compared with two experienced readers: the ROC for the ML architecture (0.81 ± 0.02) was similar to reader 1 (0.81 ± 0.02) but better than reader 2 (0.72 ± 0.02, P < .01) and a standalone measure of perfusion (0.77 ± 0.02, P < .01). Betancur et al. led a study combining patient information with SPECT MPI to predict major adverse cardiovascular events (MACE) through an ML architecture in 2619 patients,29 239 of whom had MACE at 3 years of follow-up. Combined ML had superior MACE prediction compared with imaging-only ML (AUC: 0.81 vs 0.78, P < .01). The ML model also had higher MACE predictive accuracy when compared with the expert reader, automated stress total perfusion deficit, and automated ischemic perfusion deficit (AUC: 0.81 vs 0.65 vs 0.73 vs 0.71, P < .01 for all).


19.7 Role of artificial intelligence in cardiac magnetic resonance imaging

Cardiac magnetic resonance (CMR) imaging has emerged as a pivotal diagnostic approach for assessing a number of pathophysiological conditions in the field of cardiology.30 In addition, CMR is heralded as the gold standard for noninvasive measurement of ventricular volumes and ejection fraction.30 It provides excellent spatial resolution and allows tissue characterization. A number of academic centers have investigated the potential of ML algorithms in CMR. Bai et al. applied an ML framework for automated analysis of CMR images in 5000 patients for calculating left and right ventricular mass and volumes.31 On the short-axis image test set, the Dice metric measured 0.94 for the left ventricular cavity, 0.88 for the left ventricular myocardium, and 0.90 for the right ventricular cavity. In the two- and four-chamber views, the average Dice metric measured 0.93 for the left atrial cavity and 0.96 for the right atrial cavity. Ruijsink et al. utilized a deep learning architecture for ventricular function estimation from CMR.32 The ML algorithm correlated with manual measurements for left ventricular and right ventricular volumes (all r > 0.95); strain (circumferential r = 0.89, longitudinal r > 0.89); and filling and ejection rates (all r ≥ 0.93). Similarly, Tan et al. explored the role of deep learning for automatic segmentation of the left ventricle in all short-axis slices in a number of publicly available datasets.33 Remarkably, they obtained a Jaccard index of 0.77 in the left ventricular segmentation challenge dataset and demonstrated a continuous ranked probability score of 0.0124 in the Kaggle second annual data science bowl.
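Because some of these studies report the Dice metric and others the Jaccard index, it can help to remember that the two overlap measures are interchangeable; a short conversion sketch follows (the 0.77 value simply echoes the Jaccard index quoted above).

```python
def jaccard_from_dice(d):
    """Jaccard index from a Dice score: J = D / (2 - D)."""
    return d / (2.0 - d)

def dice_from_jaccard(j):
    """Dice score from a Jaccard index: D = 2J / (1 + J)."""
    return 2.0 * j / (1.0 + j)

print(round(dice_from_jaccard(0.77), 3))   # a Jaccard index of 0.77 corresponds to a Dice of ~0.87
```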

19.8 Role of artificial intelligence in electrocardiogram

The electrocardiogram (ECG) is one of the most fundamental tests in cardiology and clinical care.34 A number of pathological conditions, from arrhythmias to heart block, are initially detected by ECG.35 It serves as a critical branch point leading to a number of additional tests and diagnostic pathways. Computer-enabled interpretation of the ECG is being increasingly integrated into clinical care, but inaccurate findings can occur.36 A number of ML algorithms can play a critical role and enable risk stratification. Sengupta et al. explored the feasibility of the signal-processed ECG for predicting abnormal myocardial relaxation or left ventricular diastolic dysfunction in 188 patients.37 The signal-processed ECG provided accurate diagnostic performance in elderly, obese, and hypertensive patients. Wearable devices have received considerable interest from physicians and the general public. Tison et al. evaluated smartwatch data against the standard ECG in 9750 patients for identifying atrial fibrillation through a deep learning algorithm and demonstrated that the ML algorithm showed excellent prediction of atrial fibrillation (C-statistic 0.97) with 98% sensitivity and 90.2% specificity.38 Hannun et al. applied a deep learning approach to classify 12 rhythm classes using 91,232 single-lead ECGs from a single-lead ambulatory ECG monitoring device.35 The findings were compared with board-certified cardiologists. The ML algorithm obtained an average area under the ROC of 0.97, and its average F1 score of 0.837 was superior to that of the cardiologists (0.780).
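The sensitivity, specificity, and F1 values quoted for these ECG algorithms all derive from the same confusion matrix; a minimal sketch with hypothetical labels and predictions is shown below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])   # hypothetical rhythm labels (1 = atrial fibrillation)
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])   # hypothetical algorithm output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"F1={f1_score(y_true, y_pred):.2f}")
```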


TABLE 19.3 Role of artificial intelligence in big data.

Study | Sample size | Description
Samad et al.39 | 171,510 | To predict all-cause mortality by integrating echocardiographic and clinical information in echocardiography
Zhang et al.40 | 14,035 | Automatic interpretation of echocardiography
Han et al.41 | 86,155 | To predict all-cause mortality with multiple parameters, compare with other approaches
Al'Aref et al.8 | 35,821 | To predict CAD from calcium score and clinical factors

CAD, Coronary artery disease.

19.9 The role of artificial intelligence in large databases

Big data derived from large databases provide a wealth of information that can play an instrumental role in clinical care1,2 (Table 19.3). As we move forward, big data may sometimes become too "big" for us to handle. As stated earlier, big data emanating from telemedicine and cardiovascular imaging cannot be interpreted effectively by current means.2 As a result, ML is being increasingly utilized by many academic centers and commercial industry for this purpose. Samad et al. used a random forest algorithm to predict all-cause mortality in 171,510 patients with over 300,000 echocardiograms by combining echocardiographic and clinical parameters.39 The ML algorithm obtained a superior prediction model (all AUC > 0.82) in comparison to clinical risk scores (AUC = 0.69-0.79) and outperformed logistic regression models (P < .001) for all survival intervals. Zhang et al. employed a deep learning framework for automatic interpretation of echocardiography in 14,035 echocardiograms over a 10-year span.40 The algorithm was successful in identifying views (96% for parasternal long axis) and enabled segmentation of cardiac chambers. Furthermore, Zhang et al. demonstrated that the automatic measurements were superior to manual measurements across multiple metrics (e.g., the correlation of left atrial and ventricular volumes). Similarly, we have implemented ML frameworks in a number of large, complex databases with interesting findings. Han et al. explored the prognostic ability of ML-based prediction of all-cause mortality in 86,155 patients with multiple parameters in comparison to other risk prediction approaches.41 The AUC for the ML algorithm (0.82) was better than the Framingham risk score plus coronary artery calcium score (0.74) and the atherosclerotic cardiovascular disease risk score plus coronary calcium score (0.72, P < .05). Interestingly, the ML model improved reclassification over the other models in low- to intermediate-risk individuals (P < .001 for all). Al'Aref et al. examined an ML model integrating clinical factors and the calcium score to predict CAD from CTA in 35,281 patients from the CONFIRM registry.8 The AUC for ML plus coronary calcium (0.881) was superior (P < .005) to ML alone (0.773), coronary calcium alone (0.866), and the updated Diamond-Forrester score (0.682).
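A bare-bones version of this kind of random-forest mortality model (synthetic cohort, hypothetical features, held-out AUC) is sketched below; it only illustrates the workflow, not the model of Samad et al. or Han et al.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
# Hypothetical cohort: a handful of echocardiographic and clinical variables and an all-cause mortality label
X = rng.normal(size=(2000, 10))
risk = X[:, 0] + 0.5 * X[:, 3] - 0.7 * X[:, 7]          # assumed underlying risk
y = rng.binomial(1, 1 / (1 + np.exp(-risk)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print(f"test AUC = {roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]):.2f}")
```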


19.10 Our views on machine learning

With the explosion of data and significant strides in technology, one could arguably say we are in the midst of a cardiovascular imaging renaissance.1 Although the future may appear bright, it is not without its hurdles.4,8 A number of issues need to be resolved to facilitate a seamless transition of ML into clinical practice. In the current era of health care, physicians face exceedingly high demands in the workplace.5 The constant need to multitask and manage clinical care can lead to exhaustion and possible burnout. Furthermore, this can lead to mistakes and inconsistencies in findings, which can impact clinical care.5 As technology continues to evolve, new parameters and variables keep being added to existing modalities.42 This surplus of clinical information can overwhelm any physician; this is especially relevant for younger physicians entering clinical practice. It clearly emphasizes that ML is the next step in the evolutionary line of cardiovascular imaging.8 AI can streamline the clinical workflow derived from multiple modalities in a seamless manner.1 It can help automate a number of processes by performing measurements, predicting outcomes, and analyzing data. It can reduce the burden of mundane tasks and free up more time for patient management. For ML algorithms to execute with high accuracy and consistency, adequate sample sizes are imperative for optimal performance.3 However, obtaining data of the necessary size can itself be a difficult endeavor. Most academic centers do not have access to large data samples; this emphasizes the need for collaboration. Some form of data sharing is necessary to facilitate the growth of ML in the academic setting. However, institutional guidelines can be very rigid and may require multiple institutional review board approvals.42 This can be particularly time-consuming and delay the progress of research. Furthermore, data need to be deidentified to maintain patient confidentiality. If data can be made publicly available, this can help expedite research with AI.4 For AI to flourish, it needs to be established at earlier stages of medical training rather than in clinical practice. Budding doctors should be exposed to AI during medical school and taught the fundamental principles of the various algorithms, so that these concepts appear less foreign and are more easily embraced. Furthermore, AI lacks a moral compass and simply analyzes data,43 so a number of unintentional biases can creep into an algorithm. By properly educating medical students, they will learn to handle ML efficiently and responsibly.

19.11 Conclusion

As we move forward in this era of big data, AI will be a vital contributor connecting cardiovascular imaging and clinical care. Although ML may initially appear to be a futuristic concept, it will invariably help to automate a number of tasks across the spectrum of cardiovascular imaging modalities. This will allow physicians to supervise the clinical workflow and dedicate more time to patient-oriented activities rather than repetitive tasks.




20 Artificial intelligence as applied to clinical neurological conditions

Daniel Ranti, Aly Al-Amyn Valliani, Anthony Costa and Eric Karl Oermann

Abstract
Artificial intelligence (AI) is poised to alter clinical practice and the ways in which physicians and hospital systems gain insights from complex multimodal patient data. The clinical neurosciences are beginning to see the impact of AI in the form of novel tools for the early detection and management of neurological conditions. This chapter will briefly introduce the analytical modalities and disease processes that are common areas of focus, detail ongoing areas of AI research targeting various stages of neurological and neurosurgical care, discuss existing applications in the clinical setting, and conclude with a word on specific technical hurdles that present a challenge for the widespread adoption of AI in the clinical setting.

Keywords: Artificial intelligence; biomedical informatics; computer vision; connectome mapping; deep learning; genomics; machine learning; neurology; neuroscience

20.1 Introduction to artificial intelligence in neurology

The study of the human brain and its diseases poses unique challenges to medicine due to the difficulty of assessing both neurological structure and function. The advent of electroencephalography in the 1920s provided our first look into neurological electrophysiology. Critical advances in medical imaging, such as the advent of computed tomography (CT) in the 1970s and magnetic resonance imaging (MRI) in the 1990s, provided a massive leap in capabilities for assessing neurological structures.1-5 Even amidst these advances, large portions of brain anatomy and physiology remain poorly understood, and the insidious disease processes that are a hallmark of neurological disorders remain hard to define and clinically manage. The challenge of the modern clinical neurosciences is to assimilate the abundance of high-dimensional biomedical data generated from advances in medicine to understand neurological disease.2,6 Advances in artificial intelligence (AI) and machine learning (ML) are natural means to manage the complexities of neurological disease. Deep


learning techniques in particular may provide an avenue to discover hidden patterns amidst large amounts of data to better understand the subtle symptoms that characterize disease in the clinical neurosciences. In this chapter, we discuss applications of AI with an emphasis on deep learning in neurology across several subdomains. More specifically, we discuss techniques toward radiological image classification and segmentation for the management of conditions such as Alzheimer’s disease (AD), attention deficit hyperactivity disorder (ADHD), autism spectrum disorders (ASDs), and gliomas. In addition, we discuss the use of functional MRI (fMRI) and functional brain mapping in the prediction of postoperative seizure and events following traumatic brain injury (TBI). Moving from theory to practice, we discuss current applications of AI in the management of intracranial hemorrhage and stroke cases as a means to triage physician workflows. Finally, we conclude with a word on challenges that will need to be overcome before AI algorithms gain widespread use in a manner that improves the lives of patients everywhere.

20.2 Integration with clinical workflow

The development of clinical AI has brought high expectations for the impact of these systems on clinical and surgical workflows. Given the increasingly accurate ability of ML algorithms to make predictions using a wide range of clinical data modalities, investigations into their use have occurred at every stage of the clinical workflow. In this section, we will delve into the use of ML algorithms as they apply to discrete stages of clinical care within both neurology and neurosurgery. These distinct areas include neurological risk prognostication, surgical planning, intraoperative guidance, neurophysiological monitoring, postoperative care, and clinical quality-improvement programs.

20.2.1 Diagnosis

Over the past two decades, CT- and MRI-based imaging modalities have seen drastic improvements in sensitivity, specificity, and resolution. With the increasing adoption of 3-Tesla MRI systems, the experimental use of higher-powered magnets (7-Tesla), and the popularization of full genome sequencing, health-care systems have more insight into the gross and microanatomical parameters of patients suffering from neurological diseases than ever before. With higher resolution data come increasingly complex relationships between neurological structures that are difficult to study utilizing normal methodologies. Clinical data and diagnoses combined with imaging, laboratory measurements, and sequencing provide a rich source of research into AI for diagnostic and risk-prediction models. Research within diagnostics has had a particularly important role in improving the diagnosis of diseases with complicated phenotypes and poorly understood mechanisms such as AD, ASD, and ADHD. Like many chronic neurological conditions, the onset of these diseases can be insidious and the diagnosis is reliant on nonspecific symptomatology, such as distractibility and hyperactivity in the case of ADHD. The inability to rely on


more easily quantifiable measures for phenotyping results in poor sensitivity and specificity for clinical diagnostic testing: gold-standard testing guidelines, such as the American Psychiatric Association's Diagnostic and Statistical Manual, can identify ADHD between 70% and 90% of the time.7 Delays in diagnosis negatively impact care, delaying treatment and potentially even reducing its efficacy.7 Using fMRI and connectome mapping alongside clinical and demographic datapoints, recent research has sought to increase the accuracy and objectivity of diagnosing neurological conditions. Connectome mapping in particular has been an area of intense research with ML. White and gray matter imaging using fMRI techniques, combined with diffusion tensor imaging (DTI) and white matter fiber tractography, gives researchers and clinicians a wiring diagram of the medium and large connections within the brain. This picture of brain connectivity can measure the relationships between functionally and physically connected structures and can be indicative of resting-state changes in the brain. While genetic profiles can elucidate the predisposition toward microscopic changes in brain anatomy, connectome mapping shows the macroscopic manifestations of these changes, modeling the brain as a network of interconnected regions.8 With approximately 10^10 neurons interconnected by 10^14 connections, current imaging technology and throughput capabilities have limited connectome mapping to a macroscopic level of brain wiring. These methods, combined with ML, show promise in identifying novel biomarkers for the diagnosis of epilepsy, ASD, ADHD, and other neurological conditions. Epilepsy is another diagnosis well suited to characterization by connectome mapping and ML. As a disease space, epilepsy is defined by pathological changes in brain organization, leading to hyperexcitability of certain regions and seizures. ML has shown significant advances over the prior state-of-the-art in detecting epileptiform structures and activity, and in predicting surgical outcomes, using anatomical MRI images combined with DTI tractography. In one 2015 study, Munsell et al. used support vector machines (SVMs) combined with a feature selection and regularization approach to narrow the clinical features used in predictive tasks.9 With this approach, Munsell et al. address a critical hurdle in ML research as applied to neurology, namely, the small number of study subjects relative to the massive number of features defined by connectome mapping. Using these techniques, the SVM was able to separate epileptic patients from healthy controls with 80% accuracy, and to predict the outcome of surgery with 70% accuracy, results that are on par with expert clinical decision-making.9 Within ASD, deep learning classifiers are being developed to improve classification and diagnosis using fMRI data. Autism is theorized to result from altered brain connectivity, functionally impairing an individual's ability to communicate and interact socially.10-12 For pathological reasons similar to those of epilepsy, autism has seen a boom in diagnostic research using fMRI, DTI, and diffusion-weighted MRI. Deep learning algorithms have been successful in using these data to enhance the clinical diagnosis of patients with ASD.
Research has been assisted substantially via the Autism Brain Imaging Data Exchange (ABIDE), which facilitates the exchange of 1112 resting-state fMRI datasets, as well as corresponding structural MRIs and phenotypic information for ASD patients and age-matched controls, from several institutions around the globe.13 Cross-institutional collaborations such as these are critical to the development of deep learning models at our current stage of knowledge, as high-resolution experimental scanners are still relatively


rare and the number of features available within imaging data far exceeds the enrollment of studies. A large number of studies and methods have been developed using ABIDE data to classify and identify ASD patients relative to healthy controls. Iidaka et al. in 2015 used a probabilistic neural network to classify young patients under 20 years old with 92% sensitivity and 87% specificity using fMRI data14; Chen et al. in 2016 used an SVM to classify ASD patients based on low-frequency connectivity fluctuations in connectome data15; Abraham et al. in 2017 used fMRI data to predict the neuropsychiatric status of patients with ASD16; and Itani and Thanou in 2019 used graph signal processing and decision trees to build a discriminative model to aid in the diagnosis of patients.17
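As an illustration of the kind of shallow-ML pipeline described above for Munsell et al. and Chen et al., the sketch below pairs univariate feature selection with a linear support vector machine on a connectome-style feature matrix. It is a minimal, hypothetical example: the variables X (subjects by connectivity features) and y (diagnostic labels) are random stand-ins for real connectome data, and the number of selected features is an arbitrary assumption rather than a value from the cited studies.

# Hypothetical sketch: feature selection + linear SVM for connectome-based
# classification (in the spirit of the studies above), not a published pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_features = 100, 4950              # e.g., upper triangle of a 100x100 connectome
X = rng.normal(size=(n_subjects, n_features))   # stand-in connectivity features
y = rng.integers(0, 2, size=n_subjects)         # stand-in labels (patient vs control)

clf = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=200)),  # keep the 200 most discriminative edges (assumed k)
    ("svm", SVC(kernel="linear", C=1.0)),
])

# Cross-validated accuracy; with real data, class imbalance, site effects,
# and nested feature selection would all need careful handling.
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")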

20.2.2 Risk prognostication

Beyond augmenting diagnosis, ML has been applied to both general medical data and neurology-specific data to predict the future risk of disease. In addition to the widespread use of MRI data, electroencephalogram (EEG) readings have been utilized in the hopes of developing predictive tools. EEG data is often used by neurologists and clinicians in the management and diagnosis of neurological dysfunction, namely, epilepsy and epileptic events. Studies using deep learning techniques have investigated the utility of a variety of algorithms as applied to preictal scalp EEGs for seizure prediction.18-20 The most successful research efforts utilized a long short-term memory (LSTM) network, a method particularly useful for interpreting time-series data, which allows the model to allocate importance to previously seen data in a sequence when interpreting a given datapoint. LSTMs are well suited to the task of interpreting large sequences of data and have proved their efficacy in predicting epileptic events.19 Similar to the grass-roots sharing of ASD data via the ABIDE dataset, scalp EEG data has been publicized for researchers in several publicly available databases that have facilitated deep learning research. One database that has seen widespread use is the Children's Hospital of Boston, Massachusetts Institute of Technology (CHB-MIT) scalp EEG dataset, which contains recordings from 23 cases.21 Tsiouris et al. published a 2018 study using a two-layer LSTM algorithm to predict epileptic seizures on cases from the CHB-MIT scalp EEG database. This study differentiated itself from prior efforts that used convolutional neural networks (CNNs) on scalp EEGs to predict epileptic seizures, setting a new state-of-the-art over traditional ML algorithms and other deep learning algorithms in predicting epileptic risk. The authors chose several meaningful features for the detection of epileptic seizures. Parameters included statistical moments, zero crossings, wavelet transform coefficients, power spectral density, cross-correlation, and graph-theoretic measures. The authors compared the predictive power of raw EEG data to a model trained using the extracted parameters described previously and found that premodeling feature extraction improved model performance.19 The LSTM achieved a minimum of 99.28% sensitivity and 99.28% specificity across the 15-, 30-, 60-, and 120-minute preictal periods for the cases within CHB-MIT, as well as a maximum false positive rate of 0.11/hour. Comparing these results to experiments with CNNs, the authors achieved worse results with a CNN, namely, poorer sensitivity and a higher hourly rate of false positives.22,23
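The sketch below illustrates the general shape of such a two-layer LSTM classifier over windowed EEG feature sequences. It is a minimal PyTorch example under assumed dimensions (features per window, hidden size, window count), not a reconstruction of the Tsiouris et al. architecture or its preprocessing.

# Minimal sketch of a two-layer LSTM for preictal vs interictal classification.
# Feature dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class SeizureLSTM(nn.Module):
    def __init__(self, n_features=64, hidden_size=128, n_layers=2):
        super().__init__()
        # Two stacked LSTM layers over a sequence of per-window EEG feature vectors
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=n_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)    # preictal vs interictal logits

    def forward(self, x):
        # x: (batch, n_windows, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])          # classify from the last time step

model = SeizureLSTM()
dummy = torch.randn(8, 30, 64)                   # 8 segments, 30 windows, 64 features each
logits = model(dummy)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
loss.backward()                                   # a standard supervised training step would follow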


A second widely used and publicized area of research centered on understanding a patient's risk for future development of a given disease is genomic data. Genomic data has generated significant scientific and public interest for the promise of understanding genetic variation and predisposition to malfunction on a genome-wide basis. Genomic data is gathered via sequencing and array platforms, with microarrays being the main workhorse of most studies.24 While genetic sequencing has seen an increase in recent years, cost and availability still somewhat limit its more widespread use. Given the massive number of features that exist in genetic datasets, predictive tasks are often distilled by subject experts who hand-select features with likely involvement in the disease process for predictive models.25 Deep learning research with genetic data has proliferated particularly within ASD and cancers of the central nervous system (CNS). One area of work has centered on understanding the impact that de novo mutations, namely, copy number variants and point mutations, will have on the severity of symptoms in ASD.26 One study by Zhou et al. modeled the impact of point mutations in RNA and DNA in 1790 whole-genome sequenced families with ASD.26 This approach found that both transcriptional and posttranscriptional mechanisms play a major role in ASD, suggesting biological convergence of genetic dysregulation in ASD. Genetic data and deep learning models have also helped further the neurological sciences' understanding of the impact of genetic mutations on future risk of amyotrophic lateral sclerosis. Genetic sequencing, neuroimaging, and histopathology, either individually or in conjunction with one another, have given researchers a wealth of data to research oncologic predictive tasks.25,27,28 Deep learning has been a well-established way of incorporating multimodal data and has generated models to accurately predict time-to-event outcomes in high-mortality cancer areas, even exceeding established gold-standard clinical paradigms for glioma patient survival prediction.28 Using histological staining and genetic data, namely, isocitrate dehydrogenase mutation status and 1p/19q codeletion, sourced from The Cancer Genome Atlas, Mobadersany et al. used a subvariant of a CNN, namely, the survival CNN (SCNN), to predict time-to-event for glioma patients. The developed model estimated survival time on par with manual histologic grading or molecular subtyping.28 In another research effort, SCNNs outperformed shallow ML models in classification tasks with genetic data from multiple tumor types, including kidney, breast, and pan-glioma cancer.25 Deep learning algorithms relying on multiple modalities could be a viable method to reduce subjectivity in histologic evaluation, to accurately interpret noisy EEG data, or to model the complex multidimensional relationships seen in genetic data.
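Time-to-event models of this kind are typically trained by minimizing a Cox-style negative partial log-likelihood rather than a standard classification loss. The snippet below is a minimal, generic sketch of that loss for a batch of model-predicted risk scores; the variable names, batch layout, and the omission of tie handling are simplifying assumptions, not details taken from the cited studies.

# Minimal sketch of a Cox negative partial log-likelihood, the kind of loss used to
# train survival networks; ties and censoring subtleties are ignored for brevity.
import torch

def cox_partial_log_likelihood(risk: torch.Tensor, time: torch.Tensor,
                               event: torch.Tensor) -> torch.Tensor:
    """risk: (N,) model outputs (higher = higher hazard);
    time: (N,) follow-up times; event: (N,) 1 if death observed, 0 if censored."""
    order = torch.argsort(time, descending=True)   # sort so risk sets accumulate
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)   # log sum of exp(risk) over each risk set
    # Only uncensored subjects contribute terms to the partial likelihood
    return -torch.sum((risk - log_cumsum) * event) / event.sum().clamp(min=1)

# Toy usage with random risk scores from any network head
risk = torch.randn(16, requires_grad=True)
time = torch.rand(16) * 24.0                       # months of follow-up (toy values)
event = torch.randint(0, 2, (16,)).float()
loss = cox_partial_log_likelihood(risk, time, event)
loss.backward()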

20.2.3 Surgical planning

While diagnostic ML algorithms have received a substantial portion of the research efforts within neurology and neurosurgery, work has been performed to build models that can assist in preoperative planning. Two areas of research include the automation of beam orientation in stereotactic radiosurgery (SRS) using neural networks, and efforts dedicated to the automated segmentation of neuroanatomy on imaging data. SRS has become a reliable approach to addressing a variety of CNS pathologies, including brain metastases, but requires labor- and expertise-intensive planning to be effective.29,30 A medical


physicist is required to create optimal plans, directing and aligning beam orientation to most accurately radiate the tumor while avoiding exposure to neighboring vital structures.31 While it is possible to use standardized templates to expedite the planning stage of SRS, neuroanatomical heterogeneity makes it difficult to employ a one-size-fits-all model.32 Deep learning and the continuing shift toward precision medicine offer promise to revolutionize presurgical planning within SRS. In one 2014 study, three artificial neural networks (ANNs) were used to generate beam coordinates based on a variety of genetic profile data and Cartesian coordinates describing the location of the planning target volume (PTV).31 The planning target volume is a geometrically defined subset of the tumor, which is designed to ensure that the delivery of the therapy is localized to the boundaries of the tumor that are causing clinical symptoms (see Fig. 20.1).33 Using 669 intracranial lesions from a single center over an 8-year period, the ANNs incorporated coordinates describing the localization of the lesion, genetic data, and vectors between the PTVs and the surrounding organs at risk.31 When comparing the ANN-generated plans to medical physicist-generated plans using standardized measures (dose-volume histograms, root-mean-square, and gamma index methods), the authors found the machine-generated bespoke plans were similar in efficacy to those of a medical physicist: the medical physicist-generated plans covered 99.2% of the mean PTV by the 95% isodose, compared to 99.3%, 98.5%, and 99.2% for the three ANN-generated plans.31 While fully automated systems for the generation of stereotactic surgical plans are still far away, these experimental results show great promise in expediting the generation of customized plans. A second area of research within surgical planning is the automated segmentation of neuroimaging data. Segmentation of radiological brain images is performed to quantify brain regions, measuring the thickness, shape, and volume of structures that can indicate structural changes due to physiological variation or pathological changes.34 Classifying regions is important not only for identifying potential biomarkers in disease pathologies such as AD, but also for assessing prognostic risk in patients with gliomas.35,36 Gliomas are the most commonly diagnosed brain tumor and can be extraordinarily lethal, with a mean survival time of less than 2 years for glioblastoma multiforme (GBM).35,36 Proper segmentation is essential in monitoring disease progression and estimating viability for surgery.37

FIGURE 20.1 Example of radiation treatment planning with centralized lesion delineating a GTV. The CTV is the theoretical area requiring treatment to the prescription dose in order to achieve a cure. Lastly the planning target volume (PTV) is the chosen treatment volume in order to ensure that the prescription dose is delivered to the CTV. CTV, Clinical target volume; GTV, gross tumor volume.


Current segmentation methods are either manual or computer-assisted, most commonly the FreeSurfer program, both of which have limitations. Manual segmentation is labor- and expertise-intensive and shows variations in anatomical designation, particularly when obscured by artifacts and/or in images where intensity gradients are minimal.38 The FreeSurfer program is an open-source atlas-based segmentation program, which assigns 37 distinct anatomical labels, based on probabilistic estimates, to each voxel in a three-dimensional (3D) MRI.39 Given the importance of properly segmented images in surgical planning, and the shortfalls of existing solutions, segmentation has become a target for improvement via deep learning models (Fig. 20.2). Similar to the collaborative environment fostered by data scarcity seen in the ABIDE dataset, measurement of segmentation algorithms has been standardized by the brain tumor segmentation (BraTS) challenge established during the 2012 and 2013 Medical Image Computing and Computer Assisted Intervention (MICCAI) conferences.38 BraTS and MICCAI addressed a key issue in the field of deep learning, namely, the evaluation of models on private collections leading to difficulty in successfully reproducing and validating results. Often kept internal due to constraints imposed by the Health Insurance Portability and Accountability Act (HIPAA), private imaging collections contain variations in the imaging modalities incorporated and the metrics used to evaluate effectiveness. BraTS has been critical in standardizing the evaluation of models and is currently considered the gold standard for peer-reviewed evaluation of image segmentation. Since the establishment of BraTS in 2012, there has been considerable advancement in performance, largely based on the adoption of CNNs for anatomical segmentation.

FIGURE 20.2 Example of tumor segmentation via a machine learning algorithm. Multiple MRI imaging sequences describe different aspects of the underlying tumor including its detailed anatomical relationships (T1), its enhancing margin after administration of contrast (T1ce), and its peritumoral edema (T2, flair). A deep neural network (prediction) is able to almost perfectly recapitulate manual segmentation by a human expert (true). MRI, Magnetic resonance imaging.


Deep learning has equaled or outperformed state-of-the-art methods when compared in head-to-head evaluations. In a comparative study, Wachinger et al. applied a CNN, DeepNAT, to T1-weighted MRIs included in the MICCAI Multi-Atlas Labeling challenge, consisting of 30 T1-weighted images.34,40 When compared to FreeSurfer, the current clinical standard for automated segmentation, DeepNAT had statistically significant performance improvements. Performance in segmentation is most frequently measured using the Dice volume overlap score, which not only evaluates volumetric overlap with the ground truth but also is a good measure of reproducibility.41 In the MICCAI labeling challenge, DeepNAT achieved a Dice score of 0.906, surpassing FreeSurfer's score of 0.817.34 In addition to tumor- and tissue-based segmentation efforts, segmentation of vascular neuroanatomy has been an area of both clinical and deep learning research aimed at quantifying brain vessel status. Neurovascular status is commonly quantified to evaluate the possibility of surgery in pathophysiological anatomy, such as arteriovenous malformations (AVMs), arteriovenous fistulas, and aneurysms. Proper segmentation is helpful in assessing the severity of the pathological process, as well as planning the craniotomy route, the resection corridor, and other treatment specifics. The current gold standard of vessel segmentation relies on either manual identification or rule-based algorithms. Unlike tissue segmentation, in which FreeSurfer is the predominant atlas-based model, neurovascular segmentation has no equivalent atlas-based programs. U-net models have shown significant promise in accurately identifying vascular anatomy, as evidenced by a study published by Livne et al., in which a U-net applied to labeled data from 66 patients with cerebrovascular disease outperformed traditional methods: graph-cuts achieved a Dice score of 0.760, compared to 0.891 for the U-net model.42 Taken together, deep learning techniques have shown exceptional proof-of-concept results in tasks frequently performed in clinical evaluation and planning. While standardized tasks and datasets have allowed for robust and community-accepted validation, as well as increasing the availability of training data, further development continues to face barriers in data paucity as well as the generalization of models. Deep learning has struggled somewhat in achieving equivalent performance when evaluated on data acquired outside of the institutions in which it was trained.43 While these limitations hamper development, preoperative deep learning tools are on the cusp of revolutionizing the accuracy and ability of surgeons and neurologists to visualize anatomy and plan treatment.
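Since the Dice score is the yardstick used throughout these comparisons (DeepNAT 0.906 vs FreeSurfer 0.817; U-net 0.891 vs graph-cuts 0.760), a short sketch of how it is computed on binary label volumes is given below. The array shapes and threshold are illustrative assumptions.

# Minimal sketch: Dice volume-overlap score between a predicted and a reference
# (ground-truth) binary segmentation mask.
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    """pred, truth: boolean or {0,1} arrays of identical shape (e.g., a 3D volume)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return float(2.0 * intersection / (pred.sum() + truth.sum() + eps))

# Toy usage: threshold a hypothetical probability map and compare with a reference mask
prob_map = np.random.rand(64, 64, 64)         # stand-in network output
pred_mask = prob_map > 0.5                     # assumed operating threshold
truth_mask = np.zeros((64, 64, 64), dtype=bool)
truth_mask[20:40, 20:40, 20:40] = True         # stand-in manual segmentation
print(f"Dice: {dice_score(pred_mask, truth_mask):.3f}")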

20.2.4 Intraoperative guidance and enhancement

In the operating room, ML and related research have begun to make inroads, improving on surgical capabilities and augmenting routine tasks performed in neurosurgical procedures around the country. ML has been used to estimate affected tissue volumes based on deep brain stimulation (DBS) electrode recordings in patients with Parkinson disease,44 to differentiate between healthy cells and tumor cells in real-time, in vivo images,45,46 and to improve anatomical visualization and preoperative planning with augmented reality (AR).47 In the following sections, we will review the impact that deep learning techniques have had on each of these domains.


DBS has become an established and safe treatment for symptoms in Parkinson disease; however, suboptimal placement of electrodes can lead to significant side effects.48-50 Improper lead placement has been shown to contribute to increased depression, manic episodes, and gait impairment.49-52 ML in an intraoperative setting has helped improve the placement of leads via the automatic detection of subthalamic exit points during DBS.53 Using microelectrode recordings (MERs), Valsky et al. employed several ML classifier types in order to discriminate between the subthalamic nucleus (STN) and the substantia nigra pars reticulata (SnPR) in real time, guiding optimal placement of DBS leads. Using a relatively simple SVM, classification of MERs from 58 cases that included both the STN and the SnPR achieved 97.6% consistency with gold-standard manual classification.53 These intraoperative results raise the possibility of establishing a new gold standard as more complicated and powerful ML models continue to improve, with easier means of intraoperative integration. A second area of intraoperative guidance continues the theme of structural differentiation and seeks to differentiate tumor cells from the surrounding unaffected brain parenchyma in real time during operations. Incomplete resection, especially in aggressive tumors such as GBM, has been shown to cause tumor recurrence; however, differentiation between healthy and malignant tissue frequently depends on time-intensive histology.54 Deep learning has helped address this issue via a deep learning framework for the identification of glioblastoma borders using hyperspectral images.45 The system uses a two-dimensional CNN in combination with an imaging system that calibrates, denoises, and processes images from a spectral camera mid-craniotomy, to generate a classification map that labels each image pixel as normal tissue, tumor tissue, blood vessels/hypervascularized tissue, or background.45 The system achieved an overall accuracy of 80% on the multiclass classification task, outperforming traditional ML methods.45 While still a research project, and not a commercialized system, the results add to the growing evidence of the multitude of uses of deep learning during surgery. While not strictly AI, AR is well worth discussing given the emerging power of AR-enhanced guidance systems in neurosurgical procedures, and AI's role in object identification. AR is a system in which virtual objects are combined with real-life images, an example of which would be combining neurosurgical scope images with virtual anatomy under the surface of the visible tissue.47 AR- or mixed-reality-guided surgery was first utilized in neurosurgery, and is used today to enhance transsphenoidal resections of pituitary adenomas, craniotomies, endoscopic surgery, and ear, nose, and throat procedures.55-58 Deep learning has played a significant role in general-domain AR research, improving object identification and geospatial localization.59,60 AR programs are complex, multipart systems that rely on a constellation of devices to accurately superimpose images on top of real-life anatomy. Most systems include an optical tracker, a workstation, and a camera, working in concert.47 One of the greatest difficulties that these systems currently face is the accuracy with which anatomy is superimposed on camera images.
In one case study, surgeons noted that an AR program had roughly 1-2 mm of error and mislabeled vasculature, confusing feeding arteries and draining veins.47,61 Deep learning algorithms hold significant promise in improving and augmenting these systems, particularly in the area of object recognition. While object recognition in two-dimensional images has excelled, 3D object recognition has continued to


face obstacles, and sensor-denoising techniques along with improved object identification and pose estimation may hold the key to improved utility.62-64
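As a concrete illustration of the pixel-wise, multiclass classification idea behind the hyperspectral system described above, the sketch below shows a small 2D CNN mapping image patches to one of four tissue classes. Layer sizes, patch dimensions, and the number of spectral channels are illustrative assumptions, not details of the published framework.

# Hypothetical sketch: small 2D CNN assigning each image patch one of four labels
# (normal tissue, tumor, vessel/hypervascularized tissue, background).
import torch
import torch.nn as nn

class PatchClassifier(nn.Module):
    def __init__(self, in_channels=25, n_classes=4):   # assumed 25 spectral bands
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                    # global pooling over the patch
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                               # x: (batch, bands, H, W)
        return self.classifier(self.features(x).flatten(1))

model = PatchClassifier()
patches = torch.randn(16, 25, 16, 16)                   # 16 patches of 16x16 pixels
class_logits = model(patches)                            # one prediction per patch
print(class_logits.shape)                                 # torch.Size([16, 4])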

20.2.5 Neurophysiological monitoring

For high-risk patients on neurology wards and those recovering from neurosurgical interventions, physiological monitoring is a critical part of delivering safe and effective care. In the setting of intensive care units (ICUs), data is produced in vast amounts, from blood pressure measurements, continuous electrocardiogram readings, oxygen saturation parameters, and intracranial pressure (ICP) measurements. More and more physiological parameters are being collected in care units for the purposes of monitoring, rapid response alerts, and personalized risk assessments.65 Given the cost of ICU care, roughly $81.7 billion and 4.1% of national health expenditures in the United States, improving the efficiency of resource use has been a prime target of clinical ML researchers.66 Two areas of particular interest in neurology are improved ICP monitoring and surgical site infection surveillance. A particularly appealing clinical characteristic of neurocritical care applications is that they tend to be time sensitive, and AI techniques are typically excellent at delivering results quickly. ICP monitoring is particularly important in delivering care following TBI, in which alterations in ICP can cause rapid swings in a patient's status.67 Monitoring ICP is a way to react to these alterations and prevent catastrophic drops in cerebral perfusion pressure that so commonly lead to deterioration of status. ICP monitoring faces substantial challenges, particularly in assessing normalcy of baseline, which is inherently retrospective, and in establishing deviations from the baseline using real-time monitoring. In one novel 2016 study, Lee et al. used an autoencoder combined with a CNN for signal processing and anomaly detection in signals garnered from blood pressure readings and intraparenchymal probes to reduce false clinical events in ICU patients with TBIs.68 The method of using an autoencoder to generate a signal image, and a CNN to discriminate whether or not that image was abnormal, was successful, reducing clinical artifacts and increasing the prognostic value of detected clinical events.68 In addition to swings in ICP, a multitude of research has been performed for the purposes of surgical site infection and sepsis surveillance in orthopedic surgery, colorectal surgery, and neurosurgery.69-71 Unlike many of the research efforts described thus far in this discussion, surgical site infection detection commonly relies on the analysis of free-text notes, a field of ML referred to as natural language processing (NLP). Early efforts in NLP focused on keyword identification and rule-based methods to identify words and phrases commonly associated with infections.70 These methods saw some success but struggled when dealing with conditions whose very definition is ambiguous, such as sepsis.70 More recent clinical efforts to detect infections using sequence-based data, namely, time-series and text data, have focused on attention mechanisms to build models. Attention mechanisms have become an essential piece of sequence modeling, allowing models to assess dependencies in a sequence regardless of the distance between the elements in the sequence.72 These methods have broken records in NLP benchmark tasks and recently have been used to model clinical events in the ICU setting, predicting the onset of sepsis, myocardial infarction, and the need for vancomycin administration.73,74 The 2019


study by Kaji et al., using a recurrent neural network and variable-level attention, found that attention was helpful in generating accurate and interpretable predictions, with useful area under the curve (AUC) estimates for myocardial infarction, sepsis, and vancomycin administration, but the model also suffered from difficulties common to many electronic health record (EHR)-based ML models: it quickly and readily exploited variables that reflected clinician decision-making, as opposed to relying on physiologic parameters indicative of underlying dysfunction.74 Separating a model's ability to understand underlying physiological change from its tendency to learn a clinician's interpretation of that change and subsequent actions is a key difficulty in developing effective models for postoperative surveillance.
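The attention mechanisms referred to above are, at their core, weighted averages over a sequence in which the weights are computed from the data itself. The sketch below is a minimal scaled dot-product attention over a batch of clinical time series; the dimensions are illustrative assumptions and no specific published architecture is implied.

# Minimal sketch of scaled dot-product attention over a sequence of feature vectors
# (e.g., hourly vital-sign embeddings); dimensions are illustrative.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d). Returns the attended values and the weights."""
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # pairwise similarities
    weights = torch.softmax(scores, dim=-1)                       # attention over the sequence
    return torch.matmul(weights, v), weights

x = torch.randn(4, 48, 32)            # 4 stays, 48 hourly steps, 32 features per step
attended, weights = scaled_dot_product_attention(x, x, x)   # self-attention
print(attended.shape, weights.shape)  # torch.Size([4, 48, 32]) torch.Size([4, 48, 48])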

20.2.6 Clinical decision support

In a similar vein to postoperative physiological monitoring, modeling disease progression following intervention has been pursued as a method of clinical decision support. These efforts have occurred on multiple fronts, with two major areas being the prediction of disease progression and of positive results following surgery, as well as improving clinical workflow via triage mechanisms. Achieving positive results following intervention is a central element of clinical and surgical practice. Research into factors predisposing patients to successful recovery following surgical correction of cerebral AVMs has uncovered a number of risk factors; however, ML researchers have sought to improve the accuracy of existing scoring mechanisms via the use of ML algorithms.75-78 Multiple attempts have been made to use ML as a prognostication tool for outcome predictions. In 2016 Oermann et al. sought to distinguish between the predictive ability of several ML algorithm types in estimating outcomes years after SRS intervention.78 This study is particularly novel because it estimates the prognostic ability of models using clinical feature sets on the order of years following a given event, as opposed to days or months.78 The authors found that the ML systems achieved an AUC of 0.71, surpassing existing clinical systems at the time, which achieved an AUC of 0.63. The results of this study show the ability of ML in an experimental setting to improve on the tools that clinicians have at their disposal to accurately estimate how well a given patient will respond to a certain therapy. In a second approach to clinical decision support, ML has been applied to better optimize radiological workflows in the acute care setting.79 The accurate diagnosis of acute neurological illnesses is critical for proper care, as irreversible damage can occur within minutes. Imaging is the gold standard for diagnosis of events, and computer-aided surveillance of radiological workflows holds promise in decreasing the time to treatment.80-82 In a 2018 study by Titano et al., the authors used a CNN (a ResNet model) to generate urgent versus nonurgent labels for images, leaving the true diagnosis up to the radiologists themselves.79 This approach differentiates itself from many existing studies because it offers a place for ML as a companion tool in the pipeline of diagnosis, as opposed to attempting to use the algorithm as a clinician itself. The authors tested the ability of the CNN to prioritize cases in an imaging work queue and found that when cases were ordered based on computer-perceived urgency in a randomized, double-blinded prospective trial, the diagnosis time dropped from minutes to seconds.79
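A triage system of this kind boils down to scoring each new study with a model and reordering the reading worklist by predicted urgency. The sketch below shows only that reordering step; the urgency_model callable and the study identifiers are hypothetical placeholders, not part of the published system.

# Hypothetical sketch of worklist reordering by model-predicted urgency.
# `urgency_model` stands in for any trained classifier returning P(urgent finding).
from typing import Callable, List, Tuple

def triage_worklist(studies: List[str],
                    urgency_model: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Score each study and return the worklist sorted from most to least urgent."""
    scored = [(study_id, urgency_model(study_id)) for study_id in studies]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy usage with a fake model that assigns arbitrary probabilities
fake_scores = {"CT_001": 0.12, "CT_002": 0.91, "CT_003": 0.47}
ordered = triage_worklist(list(fake_scores), lambda s: fake_scores[s])
print(ordered)   # CT_002 (0.91) is read first, CT_001 (0.12) last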


20.2.7 Theoretical neurological artificial intelligence research

Outside the clinic, AI has become a burgeoning part of fundamental neurological research, as well as neurological therapy development. Freed from the complex constraints that govern the uptake and use of systems within hospital settings, AI has begun to make inroads in assisting the development of therapeutic strategies as part of the disease theory of network medicine.83 Network medicine is the theory that a disease state is the consequence of disturbances in a complex and functionally interdependent system of genetics and biochemistry that links the organs of the body.83 This theory has been the foundation of network pharmacology, which seeks to develop precise medicines that have multiple targets.84 This approach is part of a new paradigm in therapeutic development, which seeks to reverse the stagnation of therapeutic development that has occurred over the past decades: rather than target symptomatic measures, the new generation of scientists hopes to target the root of the disease.84 AI has come into play in studies of network medicine in a number of ways and has recently been used both to repurpose existing drugs and to design new ones. In one study, Romeo-Guitart et al. discovered a therapy for peripheral nerve root avulsion using a commercial system that maps experimental data to a translational database of clinical effectors.85 Using proteomic data from preclinical rat models, combinations of repurposed drugs were evaluated for their neuroprotective effects in nerve damage studies. Going further than reformulating and repurposing existing molecules for novel indications, AI has been used for de novo drug design, generating analogs to existing molecules with desirable properties, such as celecoxib.86,87 Given the complexity of drug discovery, with only an 8% success rate in preclinical trials of nervous system disorders, AI may be the key to bringing more successful therapies to fruition.88 Finally, AI, in particular the mathematics behind deep models, has been used as a method to model the biology of the nervous system. Research into cortical circuitry has yielded stereotyped patterns in the dynamics and architecture of these circuits. Prevalent features present across many of these circuits include laminar organization, recurrence, and the interplay between excitation and inhibition.89,90 Researchers have applied these insights to the architecture of an LSTM to model excitation-inhibition balance in cortical microcircuits, and to estimate the encoding of errors within dendritic backpropagation.91,92 These uses of AI show the breadth of ability of AI models as universal function approximators, going beyond their use as image and text prediction tools.

20.3 Currently adopted methods in clinical use

Advances in AI are already being applied to augment the delivery of care in the clinical neurosciences today. Among the first such applications is one developed by Aidoc, an AI startup based out of Israel, and approved by the Food and Drug Administration (FDA) for use in the detection of intracranial hemorrhage on CT scans. The software, which is powered by CNNs, analyzes brain CTs immediately after a patient is scanned to detect the presence of hyperdense intracranial lesions suggestive of hemorrhage and alerts neuroradiologists so that patients can be triaged in order of risk. This reduces time to intervention and enables appropriate resources to be committed to patients before progression of disease. The company's tool has exhibited comparable efficacy to expert


radiologists in assessing retrospective scans across health-care systems and is currently in use in over 50 hospitals around the world, analyzing over one million scans per year.93,94 AI-assisted neuroradiology workflow triage has also been applied to the management of stroke, with San Francisco-based AI startup Viz.ai recently obtaining 510(k) approval from the FDA for its ContaCT software to analyze CT angiograms (CTAs) for evidence of stroke. The company's deep-learning-powered software assesses for large vessel occlusion (LVO) strokes and alerts a neurovascular specialist to significant findings by text message within 6 minutes of the scan, thereby enabling more rapid emergency management. Research findings indicate that the software is able to detect anterior LVOs with a sensitivity of 90% and a specificity of 86% in CTAs.95 Furthermore, in a study of 300 CT scans comparing their software against expert neuroradiologists, Viz.ai claims that their software enabled early detection of stroke in 95% of cases and saved an average of 52 minutes in time to intervention.96 As the volume of neuroimages continues to grow at an exponential pace, such tools will be incredibly important in optimizing physician workflows, especially in the clinical neurosciences where time is brain.

20.4 Challenges

Despite the profound biomedical advances due to deep learning algorithms, there remain significant challenges that must be addressed before such applications gain widespread use. We discuss some of the most critical hurdles in the following sections.

20.4.1 Data volume

Deep neural networks are computationally intensive algorithms with millions or, increasingly, even billions of parameters.97 Although there is no hard and fast rule governing the quantity of data required to optimally train deep neural networks, studies suggest roughly 10 times more training data than parameters is required to produce an effective model.98 It follows logically that computer vision and NLP, two fields which can take advantage of the widespread availability of internet data, have made the greatest strides in recent years. Biomedical data, on the other hand, is governed by much stricter requirements, primarily HIPAA, and as such it is decentralized, stored locally within hospital systems. Public access to such data is limited, and anonymization, required for data release, is a very difficult task. While some datasets have become available to any researcher, such as the ABIDE dataset and the CHB-MIT dataset mentioned previously, these fall far short of the optimal number of training examples required for proper training. This paucity of high-quality, labeled data remains one of the most significant legal and technical limitations on the future development of ML models.
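To make the data-volume rule of thumb concrete, the sketch below counts the trainable parameters of a small network and applies the assumed 10-to-1 ratio of training examples to parameters; the model and the multiplier are illustrative, not a prescription.

# Sketch: counting trainable parameters and applying the (assumed) 10x data rule of thumb.
import torch.nn as nn

model = nn.Sequential(               # a deliberately small stand-in network
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 2),
)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_params:,}")              # ~41k for this toy model
print(f"rule-of-thumb training examples: {10 * n_params:,}")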

20.4.2 Data quality

Health-care data is extremely messy. A far cry from ImageNet, in which models can safely learn nicely labeled images of cats and dogs, health-care data is heterogeneous,


incomplete, filled with nuance and misspellings, and marked by significant variation due to differences in providers, regions, and disease processes.99 Diagnosis codes, such as the International Classification of Diseases code sets, are often used as proxies for medical conditions; however, these suffer from inaccuracy and misclassification.100 The gold standard for labeling medical data is manual annotation for most data types, which is labor- and expertise-intensive and propagates the issue of data scarcity. The messiness of data today makes it difficult for algorithms to parse signal from noise.

20.4.3 Generalizability

Many researchers have published high rates of success on in silico tasks; however, the ability of those trained models to generalize to applications outside of the data on which they have been trained and tested has yet to be seen. Much of this phenomenon is due to overfitting to the natural distribution and statistical characteristics of the training dataset. Overfitting in a medical ML setting leaves models hyperspecialized for a single setting, such as one institution or patient set, with performance dropping when tested against the population at large.43,101 These phenomena are often a manifestation of the difficulties with data availability as well and continue to plague algorithms today.

20.4.4 Interpretability

Given the multilayer, many-parameter nature of the most successful models, interpretation of a model's decision-making rationale is difficult. Deep learning algorithms have gained popularity due to their ability to accurately approximate complex, nonlinear functions; however, the inability to fully explain the contributions to a decision hinders their uptake in a medical setting.102 This wariness toward models is well founded: research has shown that models may learn aspects of the data they have been shown that reflect clinical decision-making, that is, learning to differentiate between inpatient and outpatient X-rays when assessing pneumonia risk, rather than disease processes.43 While difficult, interpreting the decision-making processes of a model is an essential element of earning the trust of providers who must rely on a given model.

20.4.5 Legal

While existing medical malpractice law protects and ensures appropriate care in the context of human decision-making, no standards have been established to assign responsibility for more autonomous AI decision-making. In the event an algorithm provides poor predictions, substandard treatment recommendations, or incorrect prioritization of patient treatment, it is unclear who takes responsibility. Given the black-box nature of algorithms and the proliferation of medical malpractice lawsuits, it is not a far stretch to imagine that providers would not accept further legal exposure for the use of AI models. The establishment of regulations governing culpability for AI in clinical use is a prerequisite for the widespread uptake of deep learning algorithms.


20.4.6 Ethical

Incidental introduction of bias has plagued ML models in numerous settings where they have been developed. Biases within training data have been perpetuated in future decision-making by a given model: NLP models trained on Google News articles exhibited gender-based stereotypes103; a resume screening tool created by Amazon systematically discriminated against women104; an algorithm to model future crime risk used in the US justice system systematically labeled people who are black as "higher risk" as compared to their white counterparts.105 Similar to the model that learned that patients in the inpatient setting are more likely to have pneumonia than their outpatient counterparts, these are examples of models that have learned vestiges of socially flawed policies and inherent human biases. Without explicitly accounting for these issues, AI decision-making may expose patients to substandard and biased treatment.

20.5 Conclusion

Deep learning has already begun to revolutionize the field of neuroscience and medicine at large. The clinical neurosciences are suited particularly well to improvement by ML models, given the subtle presentation of clinical symptoms of CNS disorders, as well as the rapid improvement in the resolution of neuroimaging techniques. Models have been shown to improve the diagnosis of AD, ASD, and ADHD; to assist in surgical planning; to monitor neurophysiological signals following TBI; to improve drug discovery; to model the mathematical nature of dendritic growth; and much more. Amidst the advances and deluge of positive publicity, significant barriers remain in the way of the development and uptake of deep learning tools in clinical settings. Technical challenges in improving the generalizability and interpretability of models are active areas of research and progress; however, difficult conversations surrounding data privacy, accessibility, and ownership need to be had. Furthermore, the proliferation of bias within datasets commonly used for training, propagated by disparities in access to care, potentially opens patients to suboptimal care and hospital systems and companies to legal action. These issues and more necessitate difficult and open conversations amongst the health-care community and society to address them and to foster a transparent environment to realize the full potential that deep learning has to offer.

References 1. Ausman JI. Achievements of the last century in neurosurgery and a view to the 21st century. Arch Neurol 2000;57(1):61 2. 2. Ausman JI. The challenge for neurosurgery in the 21st century. Surg Neurol 2008;69(1):102. 3. Alexander RE, Gunderman RB. EMI and the first CT scanner. J Am Coll Radiol 2010;7(10):778 81. 4. Buonanno FS, Kistler JP, DeWitt LD, et al. Nuclear magnetic resonance imaging in central nervous system disease. Semin Nucl Med 1983;13(4):329 38. 5. Jones EG, Mendell LM. Assessing the decade of the brain. Science 1999;284(5415):739. 6. Muzumdar D. Neurosurgery in the past and future. An appraisal. Ann Med Surg (Lond) 2012;1:13 15.


7. Kuang D, He L. Classification on ADHD with deep learning. In: 2014 international conference on cloud computing and big data; 2014. pp. 27 32. ,ieeexplore.ieee.org.. 8. Mohammad-Rezazadeh I, Frohlich J, Loo SK, Jeste SS. Brain connectivity in autism spectrum disorder. Curr Opin Neurol 2016;29(2):137 47. 9. Munsell BC, Wee C-Y, Keller SS, et al. Evaluation of machine learning algorithms for treatment outcome prediction in patients with epilepsy based on structural connectome data. NeuroImage 2015;118:219 30. 10. Just MA, Keller TA, Malave VL, Kana RK, Varma S. Autism as a neural systems disorder: a theory of frontalposterior underconnectivity. Neurosci Biobehav Rev 2012;36(4):1292 313. 11. Picci G, Gotts SJ, Scherf KS. A theoretical rut: revisiting and critically evaluating the generalized under/overconnectivity hypothesis of autism. Dev Sci 2016;19(4):524 49. 12. Uddin LQ, Supekar K, Menon V. Reconceptualizing functional brain connectivity in autism from a developmental perspective. Front Hum Neurosci 2013;7:458. 13. Di Martino A, Yan C-G, Li Q, et al. The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol Psychiatry 2014;19(6):659 67. 14. Iidaka T. Resting state functional magnetic resonance imaging and neural network classified autism and control. Cortex 2015;63:55 67. 15. Chen H, Duan X, Liu F, et al. Multivariate classification of autism spectrum disorder using frequency-specific resting-state functional connectivity—a multi-center study. Prog Neuropsychopharmacol Biol Psychiatry 2016;64:1 9. 16. Abraham A, Milham MP, Di Martino A, et al. Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. NeuroImage 2017;147:736 45. Available from: https://doi.org/10.1016/j. neuroimage.2016.10.045. 17. Itani S, Thanou D. Combining anatomical and functional networks for neuropathology identification: a case study on autism spectrum disorder. arXiv [eess.IV]. April 2019. ,http://arxiv.org/abs/1904.11296.. 18. Tjepkema-Cloostermans MC, de Carvalho RCV, van Putten MJAM. Deep learning for detection of focal epileptiform discharges from scalp EEG recordings. Clin Neurophysiol 2018;129(10):2191 6. 19. Tsiouris KM, Pezoulas VC, Zervakis M, Konitsiotis S, Koutsouris DD, Fotiadis DI. A long short-term memory deep learning network for the prediction of epileptic seizures using EEG signals. Comput Biol Med 2018;99:24 37. 20. Acharya UR, Oh SL, Hagiwara Y, Tan JH, Adeli H. Deep convolutional neural network for the automated detection and diagnosis of seizure using EEG signals. Comput Biol Med 2018;100:270 8. 21. Shoeb A, Edwards H, Connolly J, Bourgeois B, Ted Treves S, Guttag J. CHB-MIT scalp EEG database v1.0.0. ,https://physionet.org/content/chbmit/1.0.0/. [accessed 27.12.19]. Published 09.06.10. 22. Truong ND, Nguyen AD, Kuhlmann L, Bonyadi MR, Yang J, Kavehei O. A generalised seizure prediction with convolutional neural networks for intracranial and scalp electroencephalogram data analysis. arXiv [cs.CV]. July 2017. ,http://arxiv.org/abs/1707.01976.. 23. Khan H, Marcuse L, Fields M, Swann K, Yener B. Focal onset seizure prediction using convolutional networks. IEEE Trans Biomed Eng 2018;65(9):2109 18. 24. Han G, Sun J, Wang J, Bai Z, Song F, Lei H. Genomics in neurological disorders. Genomics Proteomics Bioinformatics 2014;12(4):156 63. 25. Yousefi S, Amrollahi F, Amgad M, et al. Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. 
Sci Rep 2017;7(1):11707. 26. Zhou J, Park CY, Theesfeld CL, et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat Genet 2019;51(6):973 80. 27. Buda M, Saha A, Mazurowski MA. Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm. Comput Biol Med 2019;109:218 25. 28. Mobadersany P, Yousefi S, Amgad M, et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc Natl Acad Sci USA 2018;115(13):E2970 9. 29. Mehta MP, Rozental JM, Levin AB, et al. Defining the role of radiosurgery in the management of brain metastases. Int J Radiat Oncol Biol Phys 1992;24(4):619 25. 30. Mu¨ller-Riemenschneider F, Bockelbrink A, Ernst I, et al. Stereotactic radiosurgery for the treatment of brain metastases. Radiother Oncol 2009;91(1):67 74. Available from: https://doi.org/10.1016/j.radonc.2008.12.001. 31. Skrobala A, Malicki J. Beam orientation in stereotactic radiosurgery using an artificial neural network. Radiother Oncol 2014;111(2):296 300.

III. Clinical applications

References

411

32. Rowbottom CG, Oldham M, Webb S. Constrained customization of non-coplanar beam orientations in radiotherapy of brain tumours. Phys Med Biol 1999;44(2):383 99. 33. Burnet NG, Thomas SJ, Burton KE, Jefferies SJ. Defining the tumour and target volumes for radiotherapy. Cancer Imaging 2004;4(2):153 61. 34. Wachinger C, Reuter M, Klein T. DeepNAT: deep convolutional neural network for segmenting neuroanatomy. NeuroImage 2018;170:434 45. 35. Ohgaki H, Kleihues P. Population-based studies on incidence, survival rates, and genetic alterations in astrocytic and oligodendroglial gliomas. J Neuropathol Exp Neurol 2005;64(6):479 89. 36. Holland EC. Progenitor cells and glioma formation. Curr Opin Neurol 2001;14(6):683 8. 37. Visser M, Mu¨ller DMJ, van Duijn RJM, et al. Inter-rater agreement in glioma segmentations on longitudinal MRI. Neuroimage Clin 2019;22:101727. 38. Menze BH, Jakab A, Bauer S, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans Med Imaging 2015;34(10):1993 2024. 39. Fischl B, Salat DH, Busa E, et al. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 2002;33(3):341 55. 40. Landman B, Warfield S. MICCAI 2012 workshop on multi-atlas labeling. In: Medical image computing and computer assisted intervention conference. 2012. 41. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging 2015;15:29. 42. Livne M, Rieger J, Aydin OU, et al. A U-net deep learning framework for high performance vessel segmentation in patients with cerebrovascular disease. Front Neurosci 2019;13:97. 43. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med 2018;15 (11):e1002683. 44. Taghva A. Hidden semi-Markov models in the computerized decoding of microelectrode recording data for deep brain stimulator placement. World Neurosurg 2011;75(5 6):758 763.e4. 45. Fabelo H, Halicek M, Ortega S, et al. Deep learning-based framework for in vivo identification of glioblastoma tumor using hyperspectral images of human brain. Sensors 2019;19(4). Available from: https://doi.org/ 10.3390/s19040920. 46. Eberlin LS, Norton I, Dill AL, et al. Classifying human brain tumors by lipid imaging with mass spectrometry. Cancer Res 2012;72(3):645 54. 47. Kersten-Oertel M, Gerard I, Drouin S, et al. Augmented reality in neurovascular surgery: feasibility and first uses in the operating room. Int J Comput Assist Radiol Surg 2015;10(11):1823 36. 48. Machado A, Rezai AR, Kopell BH, Gross RE, Sharan AD, Benabid A-L. Deep brain stimulation for Parkinson’s disease: surgical technique and perioperative management. Mov Disord 2006;21(S14):S247 58. Available from: https://doi.org/10.1002/mds.20959. 49. Bejjani BP, Damier P, Arnulf I, et al. Transient acute depression induced by high-frequency deep-brain stimulation. N Engl J Med 1999;340(19):1476 80. 50. Raucher-Che´ne´ D, Charrel C-L, de Maindreville AD, Limosin F. Manic episode with psychotic symptoms in a patient with Parkinson’s disease treated by subthalamic nucleus stimulation: improvement on switching the target. J Neurol Sci 2008;273(1 2):116 17. 51. Weiss D, Breit S, Wa¨chter T, Plewnia C, Gharabaghi A, Kru¨ger R. Combined stimulation of the substantia nigra pars reticulata and the subthalamic nucleus is effective in hypokinetic gait disturbance in Parkinson’s disease. 
J Neurol 2011;258(6):1183 5. 52. Weiss D, Walach M, Meisner C, et al. Nigral stimulation for resistant axial motor impairment in Parkinson’s disease? A randomized controlled trial. Brain 2013;136(Pt 7):2098 108. 53. Valsky D, Marmor-Levin O, Deffains M, et al. Stop! border ahead: automatic detection of subthalamic exit during deep brain stimulation surgery. Mov Disord 2017;32(1):70 9. 54. Petrecca K, Guiot M-C, Panet-Raymond V, Souhami L. Failure pattern following complete resection plus radiotherapy and temozolomide is at the resection margin in patients with glioblastoma. J Neurooncol 2013;111(1):19 23. 55. Kawamata T, Iseki H, Shibasaki T, Hori T. Endoscopic augmented reality navigation system for endonasal transsphenoidal surgery to treat pituitary tumors: technical note. Neurosurgery 2002;50(6):1393 7. Available from: https://doi.org/10.1227/00006123-200206000-00038.

III. Clinical applications

412

20. Artificial intelligence as applied to clinical neurological conditions

56. Rosahl SK, Shahidi R. The virtual operating field—how image guidance can become integral to microneurosurgery. Samii’s essentials in neurosurgery. Springer; 2008. p. 11 20. 57. Shahidi R, Bax MR, Maurer Jr CR, et al. Implementation, calibration and accuracy testing of an imageenhanced endoscopy system. IEEE Trans Med Imaging 2002;21(12):1524 35. 58. Gleason PL, Kikinis R, Altobelli D, et al. Video registration virtual reality for nonlinkage stereotactic surgery. Stereotact Funct Neurosurg 1994;63(1 4):139 43. Available from: https://doi.org/10.1159/000100305. 59. Rao J, Qiao Y, Ren F, Wang J, Du Q. A mobile outdoor augmented reality method combining deep learning object detection and spatial relationships for geovisualization. Sensors 2017;17(9). Available from: https://doi. org/10.3390/s17091951. 60. Lin C, Chung Y, Chou B, Chen H, Tsai C. A novel campus navigation APP with augmented reality and deep learning. In: 2018 IEEE international conference on applied system invention (ICASI). 2018. pp. 1075 7. 61. Cabrilo I, Bijlenga P, Schaller K. Augmented reality in the surgery of cerebral arteriovenous malformations: technique assessment and considerations. Acta Neurochir 2014;156(9):1769 74. 62. Akgul O, Penekli HI, Genc Y. Applying deep learning in augmented reality tracking. In: 2016 12th international conference on signal-image technology internet-based systems (SITIS). 2016. pp. 47 54. 63. Rahman MM, Tan Y, Xue J, Lu K. Recent advances in 3D object detection in the era of deep neural networks: a survey. In: IEEE trans image process. November, 2019. Available from: https://doi.org/10.1109/TIP.2019.2955239. 64. Li X, Zhou Z. Object re-identification based on deep learning. In: Visual object tracking in the deep neural networks era. 2019. ,https://doi.org/10.5772/intechopen.86564.. [working title]. 65. Johnson AEW, Ghassemi MM, Nemati S, Niehaus KE, Clifton DA, Clifford GD. Machine learning and decision support in critical care. Proc IEEE Inst Electr Electron Eng 2016;104(2):444 66. 66. Halpern NA, Pastores SM. Critical care medicine in the United States 2000 2005: an analysis of bed numbers, occupancy rates, payer mix, and costs. Crit Care Med 2010;38(1):65 71. Available from: https://doi.org/ 10.1097/ccm.0b013e3181b090d0. 67. Haddad SH, Arabi YM. Critical care management of severe traumatic brain injury in adults. Scand J Trauma Resusc Emerg Med 2012;20:12. 68. Lee S-B, Kim H, Kim Y-T, et al. Artifact removal from neurophysiological signals: impact on intracranial and arterial pressure monitoring in traumatic brain injury. J Neurosurg 2019;1:1 9. 69. Thirukumaran CP, Zaman A, Rubery PT, et al. Natural language processing for the identification of surgical site infections in orthopaedics. J Bone Joint Surg Am 2019;101(24):2167 74. 70. FitzHenry F, Murff HJ, Matheny ME, et al. Exploring the frontier of electronic health record surveillance: the case of postoperative complications. Med Care 2013;51(6):509 16. 71. Campillo-Gimenez B, Garcelon N, Jarno P, Chapplain JM, Cuggia M. Full-text automated detection of surgical site infections secondary to neurosurgery in Rennes, France. Stud Health Technol Inf 2013;192:572 5. 72. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, et al., editors. Advances in neural information processing systems 30. Curran Associates, Inc.; 2017 . pp. 5998 6008. 73. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv [cs.CL]. 
October 2018. ,http://arxiv.org/abs/1810.04805.. 74. Kaji DA, Zech JR, Kim JS, et al. An attention based deep learning model of clinical events in the intensive care unit. PLoS One 2019;14(2):e0211057. 75. Ding D, Liu KC. Predictive capability of the Spetzler-Martin versus supplementary grading scale for microsurgical outcomes of cerebellar arteriovenous malformations. J Cerebrovasc Endovasc Neurosurg 2013;15 (4):307 10. 76. Pollock BE, Flickinger JC, Lunsford LD, Bissonette DJ, Kondziolka D. Factors that predict the bleeding risk of cerebral arteriovenous malformations. Stroke 1996;27(1):1 6. 77. Young WL, Gao EG, Hademenos J, Massoud TF. Use of modeling for the study of cerebral arteriovenous malformations, in: P. E. Stieg, H. H. Batjer, and L. Samson, (Eds.), Intracranial Arteriovenous Malformations, Informa Healthcare, New York, NY, USA, 2007;49 71. 78. Oermann EK, Rubinsteyn A, Ding D, et al. Using a machine learning approach to predict outcomes after radiosurgery for cerebral arteriovenous malformations. Sci Rep 2016;6:21161. 79. Titano JJ, Badgeley M, Schefflein J, et al. Automated deep-neural-network surveillance of cranial images for acute neurologic events. Nat Med 2018;24(9):1337 41. 80. Navi BB, Kamel H, Shah MP, et al. The use of neuroimaging studies and neurological consultation to evaluate dizzy patients in the emergency department. Neurohospitalist 2013;3(1):7 14.

III. Clinical applications

References

413

81. Ferro JM, Pinto AN, Falca˜o I, et al. Diagnosis of stroke by the nonneurologist. A validation study. Stroke 1998;29(6):1106 9. 82. Ferro JM, Falca˜o I, Rodrigues G, et al. Diagnosis of transient ischemic attack by the nonneurologist. Stroke 1996;27(12):2225 9. Available from: https://doi.org/10.1161/01.str.27.12.2225. 83. Baraba´si A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet 2011;12(1):56 68. 84. Margineanu DG. Neuropharmacology beyond reductionism a likely prospect. Biosystems 2016;141:1 9. 85. Romeo-Guitart D, Fore´s J, Herrando-Grabulosa M, et al. Neuroprotective drug for nerve trauma revealed using artificial intelligence. Sci Rep 2018;8(1):1879. 86. Olivecrona M, Blaschke T, Engkvist O, Chen H. Molecular de-novo design through deep reinforcement learning. J Cheminform 2017;9(1):48. 87. Besnard J, Ruda GF, Setola V, et al. Automated design of ligands to polypharmacological profiles. Nature 2012;492(7428):215 20. 88. Takebe T, Imai R, Ono S. The current status of drug discovery and development as originated in United States Academia: the influence of industrial and academic collaboration on drug discovery and development. Clin Transl Sci 2018;11(6):597 606. 89. Harris KD, Mrsic-Flogel TD. Cortical connectivity and sensory coding. Nature 2013;503(7474):51 8. 90. Markram H, Toledo-Rodriguez M, Wang Y, Gupta A, Silberberg G, Wu C. Interneurons of the neocortical inhibitory system. Nat Rev Neurosci 2004;5(10):793 807. 91. Sacramento J, Costa RP, Bengio Y, Senn W. Dendritic error backpropagation in deep cortical microcircuits. arXiv [q-bio.NC]. December 2017. ,http://arxiv.org/abs/1801.00062.. 92. Costa R, Assael IA, Shillingford B, de Freitas N, Vogels T. Cortical microcircuits as gated-recurrent neural networks. In: Guyon I, Luxburg UV, Bengio S, et al., editors. Advances in neural information processing systems 30. Curran Associates, Inc.; 2017 . pp. 272 83. 93. Bluemke DA. Radiology in 2018: are you working with AI or being replaced by AI? Radiology 2018;287 (2):365 6. 94. Ojeda P, Zawaideh M, Mossa-Basha M, Haynor D. The utility of deep learning: evaluation of a convolutional neural network for detection of intracranial bleeds on non-contrast head computed tomography studies. Med imaging 2019: image process, 10949. International Society for Optics and Photonics; 2019. p. 109493J. 95. Barreira C, Bouslama M, Lim J, Al-Bayati A, Saleem Y, Devlin T, Haussen D, Froehler M, Grossberg J, Baxter B, Frankel M. European stroke organisation conference: abstracts, Eur Stroke J. 2018;3(1_Suppl.):3 204. 96. No authors listed. FDA approves stroke-detecting AI software. Nat Biotechnol 2018;36(4):290. 97. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog 2019;1(8). ,https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.. 98. Zhu X, Vondrick C, Fowlkes C, Ramanan D. Do we need more training data? arXiv [cs.CV]. March 2015. ,http://arxiv.org/abs/1503.01508.. 99. ImageNet. ,http://www.image-net.org/. Accessed 31.12.19. 100. Ladha KS, Eikermann M. Codifying healthcare—big data and the issue of misclassification. BMC Anesthesiol 2015;15:179. 101. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Confounding variables can degrade generalization performance of radiological deep learning models. arXiv [cs.CV]. July 2018. http://arxiv.org/abs/ 1807.00431. 102. Clark T, Nyberg E. 
Creating the black box: a primer on convolutional neural network use in image interpretation. Curr Probl Diagn Radiol 2019. Available from: https://doi.org/10.1067/j.cpradiol.2019.07.004. 103. Bolukbasi T, Chang K-W, Zou J, Saligrama V, Kalai A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. arXiv [cs.CL]. July 2016. ,http://arxiv.org/abs/1607.06520.. 104. Dastin J. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. ,https://www.reuters. com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G. [accessed 26.11.19]. Published 10.10.18. 105. Angwin J, Larson J, Kirchner L, Mattu S. Machine Bias. ProPublica. ,https://www.propublica.org/article/ machine-bias-risk-assessments-in-criminal-sentencing. [accessed 26.11.19]. Published May 23.05.16.


21 Harnessing the potential of artificial neural networks for pediatric patient management

Jennifer L. Quon, Michael C. Jin, Jayne Seekins and Kristen W. Yeom

Abstract
Both artificial intelligence (AI) and clinical data acquisition are advancing at a rapid pace, making this an exciting time as clinicians and computer scientists collaborate to develop automated tools for understanding disease pathophysiology and improving patient care. The development of machine learning algorithms specific to pediatric diseases may help compensate for the pediatric evidence gap as well as the lack of pediatric subspecialists outside of major tertiary care centers. In this chapter, we review early efforts and ongoing advances in applying AI to pediatric disease diagnosis, prognosis, and management. Specifically we discuss studies addressing prematurity, childhood brain tumors, epilepsy, autism spectrum disorder, mood and psychotic disorders, hydrocephalus, traumatic brain injury, and other entities. We highlight the ongoing transition of clinical pediatrics into an increasingly computer-assisted field and discuss challenges and opportunities for growth in this arena.

Keywords: Artificial intelligence; machine learning; deep learning; pediatrics; childhood diseases

21.1 Introduction

The spectrum of childhood diseases is heterogeneous and broad, ranging from prematurity through young adulthood, and including developmental abnormalities. Diversity within the pediatric patient population resulting from age and growth differences lends further complexity to the clinical assessment and prognostication of pediatric diseases. In this chapter, we will discuss advances in the use of artificial intelligence (AI) to facilitate our understanding and management of pediatric conditions, with a focus on neurologic diseases. Early efforts to expand machine learning methods into the medical sphere include the MYCIN project in the mid-1970s at Stanford University, utilizing models grounded in heuristics to assess possible pathogens in patients with severe infections.1 Decades later, automated systems expanded into medical use for pediatrics, with one of the earliest efforts centered on computer-assisted diagnosis of congenital malformations and their associated syndromes.2 Since then, the use of AI in pediatrics has grown exponentially.3 From the development and adaptation of new statistical methods to the increasing availability of data, paradigm shifts in both AI and medicine have accelerated efforts to improve clinical diagnosis, prognostication, and management. For example, the availability of affordable genomic profiling, combined with high-dimensional digital image data, has enabled the radiogenomic analysis of childhood brain tumors. Improvements in medical record databases, such as modernization of electronic health record systems, have facilitated long-term assessment of patient growth, development, and recovery. In addition, advances in the computer sciences have contributed to this growth by expanding access to powerful computing resources. These changes, among many others, have uncovered new avenues of research and opportunities for multidisciplinary collaboration. We review recent advances in the application of AI to pediatric patient diagnosis, prognosis, and management to highlight the ongoing transition of clinical pediatrics into an increasingly computer-assisted field. In particular, we discuss AI applications in pediatric neurologic diseases, which have been the focus of many ongoing efforts in AI.

21.2 Applications of artificial intelligence in diagnosis and prognosis

The earliest applications of computer algorithms in health care centered on improving diagnosis. MYCIN, one of the earliest AI methods, was designed to identify bacterial pathogens in the context of severe infections such as sepsis. Built upon a series of heuristics digitizing clinical and biological information on potential pathogens, MYCIN, despite its ground-breaking performance at the time, was limited in its capacity to construct an intricate knowledge framework. Since the early applications of rule-based decision-support systems,4–6 the increased availability of data combined with improvements in medical knowledge have enabled probabilistic approaches for disease diagnosis. Though disease diagnosis remains the most common use of AI in pediatrics, AI has increasingly been applied to prognosticate disease outcomes. For pediatric conditions, this holds particular importance, as projections for quality of life and functionality in adulthood frequently drive clinical decision-making and resource allocation.

21.2.1 Prematurity

Premature birth is a major concern in pediatrics, with some studies estimating a global preterm birth rate exceeding 10%.7 Technological advances and early medical interventions in higher income countries have improved premature infant survival; however, the long-term health-care burden and associated complications remain significant challenges.8,9 As a major source of morbidity and mortality, prematurity, particularly in children born very preterm and in poverty, contributes to poor cognitive performance later in life.10,11 Early changes in brain development have been shown to correlate with cognitive function later in life.12 For example, Rathbone et al. used MRI to quantify cortical growth trajectories in neonates born before 30 weeks.12 They noted that cortical surface area growth patterns observable between 24 and 44 weeks correlated with neurocognitive functional assessments at 2 and 6 years of age. Therefore, early anticipation of long-term outcomes in premature infants can also facilitate physician decision-making. Early methods for potential preterm birth screening have included sampling of biochemical indicators13 as well as assessing cervical dilatation14 and uterine activity.15 In 1980, Creasy et al. developed an early risk score to predict preterm delivery by considering factors such as socioeconomic status, past or present pregnancy complications, and daily habits.16 However, this score was found to have limited clinical utility due to poor accuracy.17,18 More recently, Woolery et al. utilized an approach called Learning from Examples using Rough Sets to interpret input datasets and discover underlying rules.19 Prediction accuracy was promising, reaching 88.8% in a mixed cohort of high- and low-risk women, and 59.2% in a cohort of high-risk pregnant women. Accuracy on an independent validation cohort reached 53.4%, demonstrating the potential utility of an automated system for assessing risk of premature birth.19,20

Recent efforts have also leveraged the greater availability of neuroimaging to incorporate early imaging findings into diagnostic and predictive models. In the Fetal Growth Longitudinal Study, a component of the INTERGROWTH-21st project, second-trimester fetal ultrasound was found to estimate gestational age with an R2 of 0.99 and an AIC of 14.69,21 using predictors such as fetal head circumference and femur length. Given the cognitive dysfunction and developmental delays22,23 associated with prematurity, Ball et al. used functional MRI (fMRI) datasets of preterm infants scanned at term-equivalent age to develop a support vector machine (SVM) based classifier for neural connectivity. The model showed a classification accuracy of 80.2% with an area under the curve (AUC) of 0.92,24 using 27 discriminative features from a total of 2485 network edges, with overrepresentation of basal ganglia nodal connections. Moeskops et al. developed an SVM classifier to identify patients with poor cognitive and motor function, as assessed by the Bayley Scales of Infant Development (BSID-III),25 at approximately 2 years of age.26 They evaluated multiple combinations of descriptor inputs and imaging timepoints and noted that optimal classification was achieved when evaluating the change in brain development between 30 and 40 weeks of postmenstrual age, as opposed to using any one particular timepoint. Using their best performing combination, AUCs of 0.81 and 0.85 were achieved for identifying patients with higher degrees of cognitive and motor impairments, respectively. In comparison, use of gestational age alone yielded AUCs of 0.68 and 0.72. Imaging features most frequently represented in the best performing models were gyrification index, inner cortical surface area, ventricular cerebrospinal fluid (CSF) volume, and brain volume. The authors posit that these descriptors offer a multifaceted glimpse into brain development: gyrification index and inner cortical surface area describe cortical folding, ventricular CSF volume may indicate the presence of hydrocephalus and enlarged ventricles, and brain volume correlates with overall brain growth.
The authors also note that combinations of descriptors (usually between eight and twelve features) offered significantly improved predictive power compared to individual imaging features, suggesting that each of these descriptors (and the associated underlying biology) contributes to neuromuscular development in premature infants. Despite the predictive utility of imaging, limited access to imaging technology in some low- and middle-income countries (LMICs)27,28 prohibits its use for modeling in these populations. Therefore, Rittenhouse et al. assembled clinical features readily available in LMICs, including the New Ballard Score (an index summary of neuromuscular maturity of the developing fetus), last menstrual period, birth weight, twin delivery, maternal HIV serologic status, maternal height, and presence of maternal hypertension,29 and found last menstrual period to be the most predictive feature of premature birth (classification accuracy of 94.0% and an AUC of 0.98). Although these results seemed promising, the positive predictive value (PPV) remained low at 53.6%, highlighting the need for a more robust model for risk stratification of prematurity.

Beyond cognitive outcomes, physical sources of morbidity are also important considerations in premature infants. One of the most common sources of disability in low-weight premature infants is bronchopulmonary dysplasia, which has been attributed to the lengthened use of mechanical ventilation and describes a constellation of fibrosis and structural disruption combined with delayed lung maturation.30,31 As such, a number of studies have evaluated early diagnosis of bronchopulmonary dysplasia and correlated findings with pulmonary development and function later in life.32–39 However, as additional diagnostic and prognostic methods have emerged, heterogeneity in clinical practice and data availability have limited the application of these methods across institutions and geographical regions. To address these concerns, Ochab and Wajs developed an expert support system that, based on available input features, recommends the optimal modeling approach to predict bronchopulmonary dysplasia.37 Using SVM and logistic regression, they demonstrated the importance of model and feature selection in different hypothetical scenarios: logistic regression performed best when few features were available, while expanded input data favored SVM. As much of the research on prematurity outcomes has been conducted in higher income countries, similar methods are necessary to identify best practices in LMICs, where prematurity remains a significant concern.

The care of premature infants places a significant burden on the health-care system, as estimates have placed the cost of prematurity, including medical costs, educational services, and lost time and labor, at over 25 billion dollars in the United States alone.40 To better understand the predictors of discharge after hospitalization in the neonatal ICU, Temple et al. extracted data from daily progress notes using regular expressions and generated 26 features ranging from quantitative metrics, such as birth weight and gestational age, to qualitative metrics, such as use of caffeine and mechanical ventilation.41 Using a random forest classifier to predict impending discharge from the neonatal ICU, they achieved AUCs ranging from 0.729 to 0.864 in premature patients. Prediction performance was best in patients closest to discharge. The top three contributing features were the amount of oral feeds, the percent of oral feeds, and the number of days in which the percent of oral feeds was greater than 90%. In a follow-up study, Temple et al. expanded their work to include natural language processing, which allowed for the interpretation of semistructured reports as well as free-text excerpts.
While their bag-of-words approach, which reduced sentences and paragraphs to a collection of words (discarding grammar but retaining word counts), did not improve prediction of discharge date compared to the aforementioned random forest model, it did provide insight into the single words and bigrams most strongly associated with delayed discharge. Notably, bigrams associated with delayed discharge included "plus disease" and "stage zone" (indicators of retinopathy of prematurity), as well as "social work" and "DCS involved," suggesting social or logistical causes of the discharge delays.
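To make the bag-of-words idea concrete, the sketch below builds unigram and bigram counts from a few synthetic progress-note snippets and inspects the coefficients of a linear classifier for terms associated with delayed discharge. The notes, the delayed_discharge label, and all terms are invented for illustration and do not reproduce the authors' data or pipeline.

```python
# Minimal bag-of-words sketch: unigram/bigram counts from progress notes feed a
# linear model whose largest positive coefficients flag terms linked to delayed
# discharge. All notes and labels below are synthetic placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "stable on room air, full oral feeds, family teaching complete",
    "plus disease noted on exam, retinopathy of prematurity follow up",
    "social work consulted, DCS involved, placement pending",
    "tolerating feeds, weight gain adequate, discharge planning started",
]
delayed_discharge = np.array([0, 1, 1, 0])  # hypothetical labels

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(notes)

clf = LogisticRegression(max_iter=1000).fit(X, delayed_discharge)

# Terms with the largest positive coefficients are most associated with delay.
terms = vectorizer.get_feature_names_out()
top = np.argsort(clf.coef_[0])[::-1][:5]
for idx in top:
    print(terms[idx], round(clf.coef_[0][idx], 3))
```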


Much remains to be done in understanding, and eventually predicting, delays in discharge; however, continued exploration could streamline follow-up care appointments and identify ways to more efficiently use financial resources in the hospital setting.

21.2.2 Childhood brain tumors

Brain tumors are the most common solid cancer in children, with variable prognosis depending on tumor subtype.42 While children can present with headache, nausea, vomiting, or other neurologic symptoms,42 due to the often nonspecific nature of these symptoms, there is an average delay of over 4 weeks from symptom onset to diagnosis.43 Imaging is a key component of diagnosing brain tumors, with tissue specimens providing a final pathological diagnosis when tumors require biopsy or are surgically resectable. In the future, imaging may play an even greater role by incorporating molecular biomarkers and predicting genetic subtype to aid in tumor prognostication and therapeutic strategies. Early efforts in applying AI to brain tumor diagnosis have focused largely on imaging and pathology slides. Various radiologic image features, such as location, enhancement pattern, and diffusion restriction, have been shown to help predict tumor subtype or aggressiveness.44 For example, imaging features, including tumor location, have been correlated with different molecular subgroups of medulloblastoma.46 Multimodal MRI, including diffusion, perfusion, and MR spectroscopy, may further facilitate diagnosis, particularly for high-grade tumors.47 Nevertheless, the qualitative evaluation of MRI alone may be insufficient to make such nuanced diagnoses,44 and machine learning techniques may play a future role in identifying clinically relevant predictors of tumor subtypes.45 Various semiautomatic and automatic computational algorithms have therefore been applied to assist with diagnosing pediatric brain tumors on MRI.48 While pediatric radiologists already assess such features in practice, machine learning analysis after feature extraction has helped to better quantify these correlations. For example, textural and radiogenomic analysis of posterior fossa tumor MRIs (specifically pilocytic astrocytoma, medulloblastoma, and ependymoma) has been reported by multiple groups with varying levels of predictive accuracy.49–53 While most groups have focused on T1 and T2 MRI to predict tumor type, others have attempted to enhance classification accuracy by using apparent diffusion coefficient or MR spectroscopy.52 Other groups have additionally tried to integrate clinical as well as radiologic variables to predict tumor type.54 In a multiinstitutional study from our own group, MR-based radiomic features were evaluated for 109 medulloblastoma patients to predict genetic subgroup classification.53 Two predictive models were evaluated: a double 10-fold cross-validation scheme using a combined dataset, and a 3-dataset cross-validation in which the model was trained on two cohorts and tested on the third. Model performance varied across tumor subtypes and MRI modality. Model performance was consistently better for sonic hedgehog (SHH) tumors and worse for group 3 tumors. Further, model performance was better for predicting SHH, group 3, and group 4 tumors using combined T1 and T2 images. Tumor edge-sharpness was most useful for identifying SHH and group 4 tumors. In addition to posterior fossa tumors, machine learning techniques have also been applied to classifying craniopharyngiomas.55,56 Using radiomic feature extraction and a random forest model, Chen et al. identified four features for pathological subtype discrimination, two for BRAF V600E mutation, and three for
CTNNB1 mutation prediction, each with an AUC > 0.95. While most studies have focused on imaging data, machine learning techniques have also been applied to other tumor detection modalities such as EEG. Selvam et al. analyzed scalp EEG from healthy subjects and six patients with brain tumors (including adults and children).57 They used a modified wavelet-independent component analysis along with a three-layered feed-forward neural network. In a series of studies, Fetit et al. demonstrated that a model that performed well in a single-center study actually did worse when tested on multiinstitutional data.51 This emphasizes the importance of multiinstitutional data repositories for developing clinically usable models. While imaging biomarkers have been identified from a number of adult tumor databases, the lack of publicly available annotated pediatric image databases has limited their assessment by nonhospital-affiliated groups. Even though machine learning methods for brain tumor detection and classification remain primarily in the research realm, ongoing improvements in accuracy may allow "virtual biopsy" (the advanced diagnosis of tumor subtypes using imaging alone) to be usable in clinical practice. As heterogeneous entities, pediatric brain tumors can vary widely in their prognosis. Some can be incredibly aggressive, requiring surgical resection, chemotherapy, and radiation, whereas others can have a more favorable prognosis with just one treatment modality. Several factors may play a role in the outcomes for these patients. For example, genetic subtype, imaging features, and clinical variables can all affect prognosis. In addition, the treatment course and response to treatment can also affect overall and progression-free survival. Previous studies have attempted to predict disease and treatment outcomes using a variety of statistical and machine learning methods. These have been applied to clinical and imaging data for outcome prediction, as well as other modalities such as gene expression profiles. In one study of pediatric embryonal brain tumors (medulloblastoma, pineoblastoma, supratentorial primitive neuroectodermal tumor), radiomic feature extraction and statistical analysis were combined to predict progression-free and overall survival based on preoperative imaging.58 Tumors in older patients had increased normalized mean tumor intensity, decreased tumor volume, and increased heterogeneity compared to those in younger patients. Larger tumors and those with less heterogeneity were more likely to recur. In a study of gene expression profiles from 60 medulloblastoma patients, Narayanan et al. used a neural network approach to find genetic signatures predictive of survival.59 Reducing the genetic signature down to 64 genes, they were able to achieve an accuracy of 96% with a three-layer neural network majority voting model. Further studies have investigated additional prognostic factors following surgery and for surgical complications such as posterior fossa syndrome. For example, with medulloblastoma, while maximal safe resection is the standard of care, the prognostic benefit is affected by tumor subgroup, with no definitive benefit of gross total over near-total resection.60 Following posterior fossa tumor resection, bilateral hypertrophic olivary degeneration and left inferior olive hyperintensity have been correlated with developing posterior fossa syndrome.61 Spiteri et al.
assessed imaging predictors of cerebellar mutism in children with posterior fossa tumors.62 First, they used feature selection to assess gray-level intensity means and Jacobian of deformation means of the cerebellar lobule. Next, they used an SVM binary classifier to predict development of postoperative cerebellar mutism with a leave-5-out cross-validation approach. Performance was best (AUC 0.85) using all features, with slightly worse performance (AUC 0.75) using only top-scoring features.
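As a sketch of how such a small-cohort validation scheme can be set up (random placeholder features stand in for the gray-level and deformation descriptors, and repeated random splits that hold out five patients approximate the leave-5-out idea; none of this reproduces the published model):

```python
# Sketch: linear SVM on imaging-derived features, evaluated with repeated splits
# that each hold out five patients, approximating a leave-5-out scheme.
# Features and labels below are random placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))      # e.g., gray-level intensity and deformation features
y = rng.integers(0, 2, size=40)    # 1 = postoperative cerebellar mutism (synthetic)

svm = SVC(kernel="linear")
splitter = ShuffleSplit(n_splits=100, test_size=5, random_state=0)
scores = cross_val_score(svm, X, y, cv=splitter, scoring="accuracy")
print("mean held-out accuracy:", round(scores.mean(), 3))
```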


Machine learning has also been used to elucidate the effects of nonsurgical treatment on brain matter volume and neurocognitive outcomes. These studies primarily examined patients with posterior fossa tumors (medulloblastoma, astrocytoma, and ependymoma). Reddick et al. used a hybrid neural network approach to segment and classify different brain tissue types in children with brain tumors.63 Once they determined tissue volumes, patients with medulloblastoma who had undergone surgery, radiation, and chemotherapy were compared to those who had undergone only surgery and radiation. They further compared medulloblastoma patients who had undergone surgery plus radiation with low-grade astrocytoma patients who had undergone surgery alone. While chemotherapy did not appear to affect white matter, gray matter, or CSF volumes, radiation led to significantly reduced white matter volumes. In a follow-up study including long-term survivors of medulloblastoma, astrocytoma, ependymoma, and other tumor types, these same investigators evaluated white matter volumes and IQ. They found significant associations between white matter volume, attention, and IQ and proposed a model relating cognitive performance to white matter volume. In a study of 18 medulloblastoma survivors treated with surgery plus radiation and chemotherapy, age- and size-matched with 18 posterior fossa low-grade astrocytoma patients treated with surgery alone, white matter volume and IQ scores were both significantly lower in the former.64 The authors used a previously published method of brain tissue classification based on a Kohonen self-organizing map, which was then fed into a neural network. These regions were then manually checked for aberrant inclusion of T2 hyperintensities. White matter volume and IQ values were then compared between the two groups using t-tests.

A number of other studies have been published in response to the CAMDA 2017 Neuroblastoma Data Integration challenge, which included RNA-Seq, gene expression profiles, and extensive clinical data from 498 children with neuroblastoma. A range of approaches were applied, from random forests65 to deep learning architectures.66 Francescatto et al. applied an integrative network fusion framework to assess RNA-Seq, microarray, and comparative genomic hybridization data and developed a predictor for event-free and overall survival. Integration of all available genetic and clinical data improved outcome prediction. The deep learning autoencoder architecture allowed them to identify two groups of patients with differential survival as well as a subset of patients who were consistently misclassified. MYCN amplification, low NTRK1 expression, metastatic disease, and older age at diagnosis were all associated with worse outcomes across multiple machine learning algorithms.67,68 Tumor volume was also an important prognostic factor for neuroblastoma. Automated methods of segmenting tumor on abdominal CT have been explored using a fuzzy connectivity algorithm that takes advantage of the difference in Hounsfield units between different tissue types.
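As a crude illustration of how Hounsfield-unit differences can drive segmentation (a threshold-plus-connected-components sketch on a synthetic volume; this is not an implementation of the fuzzy connectivity algorithm itself, and the HU window is an assumption for illustration):

```python
# Sketch: keep voxels inside a soft-tissue HU window and retain the largest
# connected component; a crude stand-in for HU-based tumor segmentation,
# not fuzzy connectedness. The CT volume below is synthetic.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
ct = np.full((64, 64, 64), -1000.0)          # air background, in HU
ct[20:40, 20:40, 20:40] = 45.0               # soft-tissue "lesion" near 45 HU
ct += rng.normal(0, 5, ct.shape)             # mild acquisition noise

mask = (ct > 20) & (ct < 80)                 # assumed soft-tissue HU window
labels, n_components = ndimage.label(mask)   # connected-component labeling
if n_components:
    sizes = ndimage.sum(mask, labels, index=range(1, n_components + 1))
    largest = labels == (int(np.argmax(sizes)) + 1)
    print("segmented voxels:", int(largest.sum()))
```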

21.2.3 Epilepsy and seizure disorders

The diagnosis of seizure disorders in neonates and children involves assessing seizure semiology as well as using adjunctive studies such as EEG or imaging. However, observing changes in consciousness and localizing or lateralizing symptoms can be especially difficult in this population.69 While younger children (<6 years old) demonstrate fewer ictal symptoms compared to older children, whose presentations more closely resemble those of adults, use of video EEG and interpretation of verbal descriptions provided by older patients can greatly facilitate seizure classification.69 Clinical evaluation and EEG combined with imaging studies can help indicate seizure foci but remain challenging to interpret when findings are discordant. Mesial temporal lobe epilepsy with hippocampal sclerosis (MTLE-HS) is the most common type of drug-resistant epilepsy but is often amenable to surgical treatment.70 While the diagnostic accuracy with clinical evaluation and EEG alone is about 75%, it improves with the incorporation of MRI.70 In order to improve accuracy beyond qualitative visual assessment and manual measurements, one group attempted to use a combination of voxel-based morphometry and SVM to distinguish pediatric MTLE-HS from nonepileptic controls.70 Voxel-based morphometry was applied to 3D T1-weighted MRI to identify regions of abnormal gray matter and calculate gray matter volume. Feature vectors from the voxel-based morphometry were incorporated into the SVM classifier, which achieved an average AUC of 0.902 for detection of hippocampal sclerosis.70

Diagnosing seizure disorders in the absence of obvious focal imaging abnormalities presents a challenge addressable with machine learning approaches. In generalized epilepsy with tonic-clonic seizures, muscular contractions and whole-body convulsions can develop in the absence of abnormal radiographic findings. In response to studies in adults with idiopathic generalized epilepsy suggesting perturbed structure and connectivity on DTI and fMRI, one group developed an SVM classifier using pediatric T1-weighted MRI and fMRI. They measured gray matter volume and fractional amplitude of low-frequency fluctuation differences between children with generalized tonic-clonic seizures and healthy controls and found that each measure, applied to the right thalamus, could diagnose the disorder with accuracies of 74.42% and 83.73%, respectively. The heterogeneous etiology and semiology of many seizure disorders necessitates nuanced clinical tools for diagnosis and treatment. However, as electrographic and imaging data become increasingly available as part of the clinical workup for seizure disorders, larger machine learning studies can be conducted toward improving diagnostic tools.

Prognostication in the context of epilepsy is similarly challenging. Treatment regimens are often multistage, involving both medical and surgical management. Likely due to the complexity of care, prior studies evaluating the timing of care in patients with refractory seizures have identified delays of nearly two decades between initial diagnosis and surgical referral.71 Applications of AI may help deconvolute optimal care strategies for complex seizure syndromes and expedite access to definitive treatment. In addition, it is believed that children may be more susceptible to adverse effects induced by anticonvulsant medication, including altered and aggressive behavior, fatigue, and possible seizure exacerbation.72 Severe drug-induced toxicity or inflammation may also be more common in children, resulting in complications such as Stevens-Johnson syndrome73 or hepatotoxicity.74 Machine learning and deep learning approaches for predicting refractoriness to medical therapy could streamline care, reducing delays in surgical intervention and preventing adverse effects associated with medications destined to fail. Many methods have been developed to predict outcomes following medical management of recurrent seizures.
Early studies generally relied on the application of statistical learning frameworks to datasets of clinical features with the hopes of either describing previously unknown predictors of responsiveness or identifying high-risk patient subsets.


In an early example of this, Aslan et al. utilized a multilayer perceptron approach with input features derived from patient histories.75 Neural networks with varying numbers of nodes were able to achieve classification accuracies ranging from 60% to 90%. While such early studies suggested the feasibility of using AI for prognosticating epilepsy, study limitations, such as the lack of independent external validation and of multisource input features, likely limit their applicability in the clinical setting. As the use of AI for diagnosing epilepsy continues to expand, extending such approaches to prognostication could build upon this initial methodology and allow the creation of machine-optimized approaches for seizure management.
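As an illustration of that multilayer-perceptron idea (a minimal sketch on synthetic history-derived features; the feature names in the comments are hypothetical and the data are random, so the printed accuracy only demonstrates the workflow):

```python
# Sketch: small multilayer perceptron predicting response to medical therapy
# from history-derived features. Features and labels below are synthetic.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))   # e.g., age at onset, seizure frequency, EEG findings, ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=1)
mlp.fit(X_train, y_train)
print("held-out accuracy:", round(accuracy_score(y_test, mlp.predict(X_test)), 3))
```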

21.2.4 Autism spectrum disorder

The prevalence of autism has increased over the past few decades,76–78 with recent estimates suggesting as many as 1 in 40 children is affected by autism spectrum disorder (ASD).79 Clinical diagnosis of autism can be challenging, in part due to the lack of robust quantitative diagnostic methods, and requires longitudinal monitoring of behavioral development. In addition, early recognition of anatomical and behavioral patterns predictive of future ASD manifestations remains difficult. However, studies over the past few decades have uncovered numerous features associated with autism, including early differences in cortical surface area,80 white matter tract development,81 and CSF volume and localization.82 Furthermore, behavioral deviations have been observed as early as 6 months after birth.83–85 A decline in eye fixation, a behavioral tendency known to be deficient in children with autism, may be present in children as young as 2 months of age.85 Methods applying AI for early diagnosis of ASD have used a combination of imaging and clinical features to assemble predictive algorithms.

The first screening methods developed for autism diagnosis focused on incorporating behavioral observations into heuristic-based classifiers. In one of the earliest efforts to standardize autism diagnosis, Robins et al. constructed the Modified Checklist for Autism in Toddlers (M-CHAT).86 Covering behavioral tendencies characteristic of children with autism, such as repetitive motions and early social communication deficits, the M-CHAT approach was only able to achieve a PPV of 0.06 when applied as a single-stage classifier to an unselected pediatric population (i.e., using the M-CHAT screening criteria without the subsequent telephone follow-up).87 While performance improved to 0.57 when the subsequent telephone interview was included,87 inclusion of a labor-intensive interview component marginalizes much of the benefit of an automated diagnostic approach. Subsequent studies evaluating the utility of M-CHAT as a population-level screening tool have suggested thresholds above which clinical intervention may be indicated; however, like prior studies, survey accuracy remained low and administration of the full two-stage screening protocol was labor- and time-intensive.88

While preliminary efforts toward early ASD screening focused on observable features such as behavioral tendencies, recent work has increasingly attempted to incorporate quantitative neuroimaging characteristics. For instance, early anatomical deviations in children with ASD include increased corpus callosum size and thickness, increased subarachnoid CSF,82 and increased cortical surface area associated with general brain enlargement.80 Neural connectivity profiling by fMRI has also improved our understanding of abnormal neural signaling in ASD.89,90 Specifically, studies suggest decreased functional connectivity in the medial prefrontal cortex, anterior cingulate cortex, and precuneus, offering potential avenues for using fMRI to differentiate patients with ASD from healthy individuals based on abnormal distribution of brain activity.91 A number of models have sought to harness data from diverse sources, namely, anatomical, functional, and behavioral observations. By incorporating an fMRI assessment of salience network connectivity, composed of regions associated with emotional awareness,92 Uddin et al. were able to achieve a classification accuracy of 83% using a logistic regression classifier. They further demonstrated an association between salience network connectivity and symptom severity.93 This investigation, along with a prior study by Anderson et al.,94 demonstrated the potential for fMRI to assist with diagnosing and prognosticating autism. However, both studies included patients aged 7–42 years rather than those in early childhood. Early diagnosis of autism in children nevertheless remains challenging.95 In a hallmark study applying deep learning principles to the diagnosis of ASD, Hazlett et al. utilized a three-stage neural network incorporating 3D T1 MPRAGE and T2 FSE MRI protocols with clinical variables to achieve a cross-validated classification accuracy of 94% and a PPV of 81%. A total of 315 features, including 6- and 12-month measurements of cortical thickness and surface area along various regions of the brain, intracranial volume, and demographics, were included; notably, of the top 15 contributing features, 12 described cortical surface area, 2 described intracranial volume (at 6 and 12 months), and only 1 described cortical thickness (left precuneus at 12 months). This study suggests that early diagnosis of ASD from neuroanatomical features can be achieved as early as 6 months of age. Though functional studies were not incorporated into the aforementioned model, another study conducted by Emerson et al. demonstrated the potential of fMRI to contribute to classifiers for ASD.96 The study cohort consisted of 6-month-old infants with a high familial risk of autism, and the primary endpoint was clinical diagnosis of autism by 24 months of age. Using an SVM model that incorporated features describing personalized neural connectivity, the authors achieved a classification accuracy of 92.7% within a leave-10-out validation framework.
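To show how connectivity-derived features of the kind used in these studies can be turned into classifier inputs (a sketch on synthetic time series; the number of regions, the labels, and the cross-validation scheme are placeholders rather than any published protocol):

```python
# Sketch: per-subject functional-connectivity features (upper triangle of a
# region-by-region correlation matrix) feeding a linear SVM.
# Time series and diagnoses below are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(2)
n_subjects, n_regions, n_timepoints = 40, 10, 120

features = []
for _ in range(n_subjects):
    ts = rng.normal(size=(n_timepoints, n_regions))   # fMRI time series per region
    conn = np.corrcoef(ts, rowvar=False)              # region-by-region connectivity
    iu = np.triu_indices(n_regions, k=1)
    features.append(conn[iu])                         # flatten the upper triangle
X = np.vstack(features)
y = np.array([0] * 20 + [1] * 20)                     # 1 = later ASD diagnosis (synthetic)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
print("cross-validated accuracy:", round(scores.mean(), 3))
```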

21.2.5 Mood disorders and psychoses

Often presenting early in childhood, mood disorders are incredibly disabling and place a large burden on the families of affected patients.97 According to the World Health Organization, unipolar depression is the leading cause of disability worldwide, and bipolar disorder follows as the sixth leading cause.97 With a high risk of recurrence, childhood and adolescent mood disorders are increasingly being described as chronic diseases with a lifelong risk of disability.97 While most studies of disease prevalence have focused on late adolescents or adults, current data suggest a 2% 1-year prevalence of unipolar depression in children and a 4%–7% 1-year prevalence in adolescents.97 Further, many adult mood disorders are being reframed as recurrent episodes of childhood-onset disease rather than one-time entities.97 Especially in the pediatric population, early diagnosis and identification of mood disorders may allow for earlier intervention and social support. Several studies have attempted to identify imaging biomarkers for early, definitive diagnosis. The use of various neuroimaging approaches to diagnose psychiatric disorders has also gained increasing attention for a variety of disease entities.98 One group applied a 3D AlexNet convolutional neural network to extract features from structural MRIs (sMRI) of adolescents with conduct disorder in order to differentiate them from healthy controls.98 Conduct disorder, characterized by symptoms such as violence and deceitfulness, is the most common childhood psychiatric disorder but can persist into adulthood.98 While clinical diagnosis currently relies on behavioral criteria from the DSM, some symptoms may be subtle or missed, leading to underdiagnosis.98 The AlexNet model developed by Zhang et al. achieved an accuracy of 0.85 and an AUC of 0.86, significantly higher than an SVM model applied to the same dataset. Relevant areas on the sMRI of conduct disorder patients included those within the frontal lobe, superior temporal gyrus, parietal lobe, and occipital lobe, suggesting their importance in the pathophysiology and symptomatology of the disorder.

Advanced imaging modalities can be applied to differentiate nuanced psychiatric diagnoses. For instance, one group attempted to use DTI to classify disruptive mood dysregulation disorder (i.e., severe chronic irritability) apart from bipolar disorder based on white matter microstructure.99 Patients with bipolar disorder had significantly reduced fractional anisotropy in the corticospinal tracts compared to patients with disruptive mood dysregulation disorder, and throughout the brain compared to healthy volunteers. While their Gaussian process classifier was unable to reliably discriminate between the two disorders, it was able to distinguish each from healthy controls, with 68% accuracy for disruptive mood dysregulation disorder and 75% for bipolar disorder. An additional group created a DTI-based model for distinguishing children with sensory processing disorders from typically developing children.100 They trialed multiple machine learning algorithms, of which random forests using tract-based connectivity metrics yielded the best performance (77.5% accuracy, 73.8% sensitivity, and 81.6% specificity). Yoo et al. developed an image-based classifier to differentiate children with ADHD from typically developing children.101 For each group, sMRI, fMRI, and DTI, along with polygenic risk scores within three major neurotransmission pathways, were obtained. FreeSurfer was used to obtain morphometric features, which, in addition to functional connectivity and white matter metrics, served as multimodal model input. Cortical thickness, cortical thickness variability, and volume differences across the fronto-temporo-parietal regions all proved important for classification, whereas genetic features did not meaningfully improve model performance. While childhood psychiatric disorders represent a wide array of pathophysiologies, the use of imaging data, genetic profiling, and other quantitative metrics will likely improve detection and risk profiling. As a method for streamlining clinical diagnosis and decision-making, the use of AI for psychiatric screening may help expedite symptom management and improve health-care access.
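Since these studies are summarized by accuracy, sensitivity, and specificity, the short worked sketch below shows how those numbers fall out of a confusion matrix; the labels and predictions are made up purely for illustration.

```python
# Worked sketch: accuracy, sensitivity, and specificity from a confusion matrix.
# The labels and predictions below are invented for illustration only.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 8/10 = 0.80
sensitivity = tp / (tp + fn)                 # 3/4  = 0.75 (true positive rate)
specificity = tn / (tn + fp)                 # 5/6  ~ 0.83 (true negative rate)
print(accuracy, sensitivity, specificity)
```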

21.2.6 Hydrocephalus

Ventriculomegaly is characterized by enlarged ventricles with or without the presence of hydrocephalus or elevated intracranial pressure. The prevalence of ventriculomegaly varies with severity, with studies estimating as many as 22 cases per 1000 births of mild ventriculomegaly102 and around 2 cases per 1000 births of severe ventriculomegaly.103 Fetal ventriculomegaly is often self-resolving; however, up to 20% of mild ventriculomegaly cases may progress to more severe forms, resulting in elevated intracranial pressure and hydrocephalus.104,105 In these patients, CSF diversion with either placement of a ventricular shunt or endoscopic third ventriculostomy (ETV) is required. CSF diversion is a mainstay of pediatric neurosurgery; however, the risk of complications such as shunt infection or blockage requires that patients undergo constant surveillance and repeated surgical operations. When successful, ETVs can obviate the need for a shunt; however, prior studies evaluating ETV outcomes demonstrate the lowest success rates in neonates among all age groups.106 Therefore, early identification of neonates likely to require CSF diversion would allow both providers and families to anticipate care decisions in advance. To address this issue, Gu et al. incorporated clinical characteristics with average lateral ventricle width on MRI and ultrasound to predict the need for surgical intervention within the first 3 months of life.107 As a single-variable predictor, the authors were able to achieve a sensitivity and specificity of 67% and 73%, respectively, with an AUC of 0.72. Pisapia et al. further expanded upon these efforts by developing an algorithm to predict the need for postnatal CSF diversion using an expanded feature set acquired from fetal MRI.107,108 A total of 77 input features were identified, of which the most predictive were the ratio of the smaller occipital horn to lateral ventricle area as well as the length of the minor axis of the smaller occipital horn of the lateral ventricle. The SVM model achieved a prediction accuracy of 82%, with a sensitivity and specificity of 80% and 84%, respectively. These results offer optimism for predicting hydrocephalus in neonates with fetal ventriculomegaly. Tempering this optimism, however, is the need to develop models capable of being used in lower-resource settings and to extensively validate these algorithms on externally sourced datasets prior to implementing these approaches into clinical management.
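As a worked sketch of how a single continuous measurement such as ventricle width yields a sensitivity, specificity, and AUC (all measurements and thresholds below are synthetic and illustrative only, not the published model):

```python
# Sketch: ROC analysis of a single continuous predictor (e.g., lateral ventricle
# width) for a binary outcome. Widths and outcomes below are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(3)
needed_csf_diversion = np.array([0] * 30 + [1] * 20)
width_mm = np.concatenate([rng.normal(9, 2, 30),     # no intervention required
                           rng.normal(13, 3, 20)])   # required CSF diversion

auc = roc_auc_score(needed_csf_diversion, width_mm)
fpr, tpr, thresholds = roc_curve(needed_csf_diversion, width_mm)

# choose the threshold maximizing Youden's J = sensitivity + specificity - 1
best = np.argmax(tpr - fpr)
print("AUC:", round(auc, 3))
print("cutoff (mm):", round(float(thresholds[best]), 1),
      "sensitivity:", round(float(tpr[best]), 2),
      "specificity:", round(float(1 - fpr[best]), 2))
```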

21.2.7 Traumatic brain injury

Traumatic brain injury (TBI) remains a significant cause of pediatric morbidity worldwide.110 While diverse situations can result in TBI, the leading cause of TBI in both the United States and globally remains motor vehicle accidents.110 Common post-TBI sequelae in pediatric patients include functional impairments111 and mood disorders.112 In contrast to adults, children with TBI more frequently present with diffuse axonal injury and cerebral edema than with focal lesions.112 Treatment strategies are largely longitudinal, often involving combined pharmacotherapeutic and psychotherapeutic approaches.112 Long-term care may require particular attention toward post-TBI depression113 or secondary ADHD following TBI.114 Metrics such as the Glasgow Coma Score (GCS) have been developed to qualitatively assess brain injury severity and to anticipate outcomes. In addition, new AI models for prognostication of post-TBI outcomes in children offer optimism for improved prediction performance. Comparing against established computed tomography (CT) classification systems (the Marshall, Rotterdam, and Helsinki scores), Hale et al. developed an artificial neural network (ANN) with four clinical input variables (GCS, pupillary light reaction, glucose, and hemoglobin) and five imaging input variables (subdural hemorrhage, intracerebral hemorrhage, intraventricular hemorrhage, cistern integrity, and the presence of midline shift).
The three established CT classification systems were found to be least predictive, with the Marshall, Rotterdam, and Helsinki scores generating AUCs of 0.663, 0.748, and 0.717 when classifying unfavorable (Glasgow Outcome Scale ≤3) 6-month outcomes. In contrast, using GCS alone resulted in an AUC of 0.855. Classification of 6-month mortality as the designated outcome was more robust, with AUCs ranging from 0.781 to 0.920 and GCS again demonstrating the best discriminative power. Impressively, the ANN was able to achieve an AUC of 0.9774 when evaluating for 6-month unfavorable outcomes. While this study demonstrated the promise of using AI to prognosticate pediatric TBI, a number of caveats are associated with these results. It is likely that the AUC reported for the ANN overestimates its performance, as the ANN was trained on data from within the same overall dataset, and application to an external dataset would likely reduce predictive accuracy. Furthermore, comparing established metrics, such as the CT classification schemas evaluated in the study, to newly defined methods such as their ANN is somewhat misleading. While the Marshall, Rotterdam, and Helsinki scores were used as out-of-the-box systems, the ANN was both developed and fine-tuned using training data similar to the test set. To summarize, Hale et al. demonstrated the feasibility of integrating clinical and imaging data into an AI model for pediatric TBI prognostication; however, further validation is needed to identify the optimal approach and to understand prediction accuracy. A more recent study applying statistical learning concepts to pediatric TBI evaluated the performance of optimal classification trees (OCTs) for identifying clinically important TBI (ciTBI) compared to previously established heuristic frameworks.115 Previously, the Pediatric Emergency Care Applied Research Network (PECARN) had identified a set of clinical heuristics, such as palpable skull fractures or evidence of altered mental status, to better identify patients at risk for ciTBI in need of either emergent intervention or diagnostic CT imaging.116–118 In a follow-up study conducted on the PECARN registry, Bertsimas et al. developed and applied OCTs (Bertsimas and Dunn, Optimal Classification Trees, Machine Learning, 2017) in an effort to improve upon the performance established by the PECARN study. While overall sensitivity was comparable, Bertsimas et al. demonstrated significantly improved specificity and PPV in both younger (age <2 years, characterized as predominantly nonverbal) and older (age ≥2 years, characterized as predominantly verbal) patients. In a withheld validation cohort, they achieved a specificity of 69.3% using OCTs compared to 52.8% achieved using the original PECARN rule-based approach. These results demonstrated that machine learning and AI could improve diagnostic accuracy for ciTBI; however, independent validation has yet to be performed. While both studies discussed offer glimpses into the potential of AI for pediatric TBI prognostication, additional efforts, both to develop novel approaches and to validate existing ones, are necessary.
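A minimal sketch of a small feed-forward network on the nine clinical and imaging variables named above (GCS, pupillary light reaction, glucose, hemoglobin, and five CT findings) might look as follows; the synthetic data, network size, and training settings are illustrative assumptions and do not reproduce the published ANN.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 9))                                # nine input variables per patient (synthetic)
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)   # 1 = unfavorable 6-month outcome (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=1)
ann.fit(X_tr, y_tr)
print("held-out AUC:", round(roc_auc_score(y_te, ann.predict_proba(X_te)[:, 1]), 3))
```

Even a held-out split within one dataset, as here, only partially addresses the external-validation concern raised above; data from a different institution would be needed for a stronger test.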

21.2.8 Molecular mechanisms of disease

Machine learning techniques have also been applied to better understand molecular mechanisms and cellular interactions as well as to simulate responses to therapy. For example, computational modeling has been applied to the hedgehog pathway relevant in medulloblastoma tumor growth, and to understanding pathway differences between different genetic subtypes of medulloblastoma.119 Modeling of gene expression profiles from neuroblastoma patients has also identified select genes for predicting survival, shown the
importance of 17q gain, as well as differences in copy number aberrations between stage 4 and 4S patients.120 As algorithms for analyzing and simulating molecular interactions continue to progress, so too will their role in prognosticating and predicting treatment effects.

21.2.9 Other disease entities

Machine learning algorithms have also been applied to prognosticate a number of other pediatric disease entities, notably in the realm of hospital utilization and critical care. With an increasing proportion of children on Medicaid, identifying high-risk populations and reducing the need for inpatient admissions is an area of active research.121 The availability of electronic record systems has enabled better evaluation of cost-effectiveness data, making this an area ripe for investigation using machine learning techniques. For example, more effective triaging of pediatric asthma, the leading cause of pediatric emergency department utilization and hospitalization, would allow improved care pathways and resource allocation.122 While most tools use only clinical data, one group attempted to use machine learning to incorporate population-level and environmental factors.122 A gradient boosting machines model had the best performance, with an AUC of 0.84 for predicting hospitalization. Importantly, in addition to patients' vital signs, acuity, age, and weight, socioeconomic status and weather-related features also contributed to the predictive model. Similarly, another group built upon a contextual decomposition method to demonstrate that predictive modeling can be performed using a group of visits rather than information from a single visit.123 For patients in the critical care setting, the clinical acuity and sheer volume of medical information can be overwhelming for clinicians to synthesize. Despite being a primary concern, prognostication remains challenging. In a retrospective analysis of pediatric ICU data, Williams et al. developed a k-means clustering algorithm to predict mortality with an AUC of 0.77.124 Even though the clusters did not contain specific information about diagnosis, outcome, or treatments, the mortality rate differed between the generated clusters, indicating that clustering provided prognostic information. For instance, length of stay, the use of inotropes, intubation, and general diagnostic categories all varied between the outcome clusters. This suggests that such algorithms could be useful and translatable to the clinical setting.
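The clustering idea in the Williams et al. study can be sketched as follows: form clusters from routinely collected ICU features without using outcomes, then check whether the observed mortality rate differs across clusters. The feature matrix, outcome labels, and number of clusters below are synthetic placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 12))               # e.g., vitals, labs, and support measures (synthetic)
X[:200] += 1.5                               # a synthetic higher-acuity subgroup
mortality = np.concatenate([rng.binomial(1, 0.30, 200),   # observed outcomes, never shown
                            rng.binomial(1, 0.10, 300)])  # to the clustering algorithm

km = KMeans(n_clusters=4, n_init=10, random_state=2)
labels = km.fit_predict(StandardScaler().fit_transform(X))

# If mortality differs across clusters formed without outcome data, the clusters
# carry prognostic information, as reported in the study above.
for k in range(4):
    print(f"cluster {k}: n={np.sum(labels == k):3d}, mortality={mortality[labels == k].mean():.2f}")
```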

21.3 Transition to treatment decision-making using artificial intelligence

The use of AI to guide clinical management and treatment decisions has notably lagged behind diagnosis and prognostication. Many clinical applications of AI have used retrospectively collected clinical data as proof-of-concept studies prior to prospective validation. Much of the previously discussed work has used AI to predict the need for treatment, rather than dictating which patients should ultimately receive it. To improve clinical decision-making in the neonatal ICU, one group developed a prediction algorithm to optimize drug dosing for preterm infants with apnea.125 Since neonates are highly sensitive to drug dosages, optimizing this aspect of their treatment regimen can significantly affect care. By trialing a number of machine learning algorithms, including deep learning, to
predict the adequacy of caffeine to prevent recurrent apneas, they found the best performance from a deep belief network, with an AUC of 0.91. The Score for Neonatal Acute Physiology I was a critical input feature, highlighting its role in driving disease management. With the exponential growth of such outcome studies, AI's role in informing treatment guidelines is likely to increase significantly in the coming years. While the impact of machine learning on treatment strategies trails its use in diagnostics, automated techniques are also poised to change the way that clinicians provide patient care. For patients with brain tumors, this will entail predicting and precisely monitoring responses to chemotherapy and radiation. In particular, machine learning has been explored for automating tumor segmentation and volume calculation on imaging. Automated tumor segmentation would allow precise measurements of tumor growth or recurrence throughout and following treatment. The current gold standard for tumor segmentation for targeted radiotherapy and surgical planning involves manual delineation by clinicians. However, this time-consuming process is ripe for automation and, as such, has been investigated by a number of different groups using a wide variety of segmentation methods. A number of quantitative feature extraction and tissue segmentation techniques have been explored for automated posterior fossa tumor segmentation on MRI.126,127 Many have been conducted as mathematical proof-of-principle experiments on single or only a few patient images. One group published a proof-of-concept study for pediatric brain tumor segmentation built upon pre- and postcontrast T1-weighted and T2-weighted MRIs from six patients. Their model was built upon a Markov random field approach further refined with a probabilistic boosting tree voxel classifier. A different study used a neural network approach to assess registration accuracy for children with brain tumors undergoing external beam radiotherapy.128 Using a curated dataset of six children, they compared orthogonal daily setup X-ray images to the treatment planning CT and measured the distance from each individual solution to the best solution. Their goal was to develop an automated registration evaluator that could automatically reject unacceptable registration plans. These strategies have yet to be integrated into stereotactic radiotherapy planning or operative navigation systems. With the advent of hybrid imaging for intraoperative guidance, machine learning is also poised to allow surgeons to integrate information about tumor boundaries into these systems.
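As a toy illustration of the measurement step that automated segmentation enables, the snippet below thresholds a synthetic "lesion" in a 3D volume, keeps the largest connected component, and reports its volume. Real pipelines, such as the voxel classifiers described above, are far more sophisticated; the volume, threshold, and voxel size here are arbitrary assumptions.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(5)
vol = rng.normal(size=(64, 64, 64))          # background noise standing in for an MRI volume
vol[20:30, 25:40, 30:45] += 4.0              # bright synthetic "lesion"

mask = vol > 2.0                             # simple intensity threshold
labels, n = ndimage.label(mask)              # connected-component labeling
sizes = ndimage.sum(mask, labels, range(1, n + 1))
lesion = labels == (int(np.argmax(sizes)) + 1)   # keep the largest component

voxel_volume_mm3 = 1.0                       # assumed isotropic 1 mm voxels
print(f"estimated lesion volume: {lesion.sum() * voxel_volume_mm3:.0f} mm^3")
```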

21.4 Future directions

AI and clinical data acquisition are both advancing at a rapid pace, making this an exciting time as clinicians and computer scientists collaborate to develop automated tools for understanding disease pathophysiology and improving patient care. Computer-aided diagnosis will help minimize the number of invasive diagnostic procedures by allowing more detailed and quantitative analysis of clinical data and imaging and by identifying imaging biomarkers for various diseases. AI-driven analysis of disease outcomes will improve precision and minimize bias in population-level studies. Machine learning may also be able to glean nuanced outcome measures from clinical trial or other repository data. As such, the development of machine learning algorithms specific to children may also help compensate for the pediatric evidence gap as well as the lack of pediatric subspecialists outside of major tertiary care centers.

In machine learning, and specifically deep learning, models are strengthened by large, high-quality, and heterogeneous datasets. Models trained on datasets that are too small are at risk of "overfitting" to that dataset. Pediatric diseases are generally rarer than adult ones, so the relative paucity of data makes developing these models even more challenging.129 Multi-institution repositories will allow for better development of such tools, and additional work must be done to establish model validity and generalizability across multiple institutions, as each has its own imaging scanners, protocols, and data collection methods. Therefore sharing of real-world data across institutions is necessary to externally validate model performance. However, there is much work to be done, especially as clinicians seek to bridge the gap between retrospective analysis and prospective validation. Much like testing the safety and efficacy of new medical devices, similar procedures are needed to evaluate machine learning algorithms prior to their use in the clinical setting. Finally, once models are established and validated, structuring them as real-time tools will require user-friendly interfaces that can be implemented in the clinical setting. We look forward to the ongoing advancements in applying AI to pediatric care.
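One common way to probe the cross-institution generalizability discussed above is leave-one-institution-out validation, sketched below with synthetic data; the site labels, features, and model are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))           # pooled features from several hypothetical institutions
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)
site = rng.integers(0, 4, size=400)      # which of four hypothetical institutions contributed each case

# Train on three institutions, test on the held-out fourth, rotating through all sites.
scores = cross_val_score(RandomForestClassifier(random_state=3), X, y,
                         cv=LeaveOneGroupOut(), groups=site, scoring="roc_auc")
print("per-site held-out AUCs:", np.round(scores, 2))
```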

References

1. Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green CC, Cohen SN. Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system. Computers Biomed Res 1975;8:303 20. 2. Yamamoto K, Sudo M, Shigematsu Y, Fukui M, Masukawa J. The development of personal computer-based medical consultation system for diagnosis of congenital malformation syndromes using MUMPS. Med Inform 1990;15:355 62. 3. Kokol P, Završnik J, Vošner HB. Artificial intelligence and pediatrics: a synthetic mini review. arXiv 2018. 1802.06068. 4. Muller R, Sergl M, Nauerth U, Schoppe D, Pommerening K, Dittrich HM. THEMPO: a knowledge-based system for therapy planning in pediatric oncology. Computers Biol Med 1997;27:177 200. 5. Tung WL, Quek C. GenSo-FDSS: a neural-fuzzy decision support system for pediatric ALL cancer subtype identification using gene expression data. Artif Intell Med 2005;33:61 88. 6. Aarabi A, Fazel-Rezai R, Aghakhani Y. A fuzzy rule-based system for epileptic seizure detection in intracranial EEG. Clin Neurophysiol 2009;120:1648 57. 7. Chawanpaiboon S, Vogel JP, Moller AB, Lumbiganon P, Petzold M, Hogan D, et al. Global, regional, and national estimates of levels of preterm birth in 2014: a systematic review and modelling analysis. Lancet Glob Health 2019;7:e37 46. 8. Petrou S, Henderson J, Bracewell M, Hockley C, Wolke D, Marlow N. Pushing the boundaries of viability: the economic impact of extreme preterm birth. Early Hum Dev 2006;82:77 84. 9. Ward RM, Beachy JC. Neonatal complications following preterm birth. BJOG 2003;110(Suppl. 20):8 16. 10. Brydges CR, Landes JK, Reid CL, Campbell C, French N, Anderson M. Cognitive outcomes in children and adolescents born very preterm: a meta-analysis. Dev Med Child Neurol 2018;60:452 68. 11. Beauregard JL, Drews-Botsch C, Sales JM, Flanders WD, Kramer MR. Preterm birth, poverty, and cognitive development. Pediatrics 2018;141. 12. Rathbone R, Counsell SJ, Kapellou O, Dyet L, Kennea N, Hajnal J, et al. Perinatal cortical growth and childhood neurocognitive abilities. Neurology 2011;77:1510 17. 13. Hobel CJ, Dunkel-Schetter C, Roesch SC, Castro LC, Arora CP. Maternal plasma corticotropin-releasing hormone associated with stress at 20 weeks’ gestation in pregnancies ending in preterm delivery. Am J Obstet Gynecol 1999;180:S257 63. 14. Leveno KJ, Cox K, Roark ML. Cervical dilatation and prematurity revisited. Obstet Gynecol 1986;68:434 5. 15. Bell R. The prediction of preterm labour by recording spontaneous antenatal uterine activity. Br J Obstet Gynaecol 1983;90:884 7.

16. Creasy RK, Gummer BA, Liggins GC. System for predicting spontaneous preterm birth. Obstet Gynecol 1980;55:692 5. 17. Owen J, Goldenberg RL, Davis RO, Kirk KA, Copper RL. Evaluation of a risk scoring system as a predictor of preterm birth in an indigent population. Am J Obstet Gynecol 1990;163:873 9. 18. Main DM, Richardson D, Gabbe SG, Strong S, Weller SC. Prospective evaluation of a risk scoring system for predicting preterm delivery in black inner city women. Obstet Gynecol 1987;69:61 6. 19. Woolery LK, Grzymala-Busse J. Machine learning for an expert system to predict preterm birth risk. J Am Med Inform Assoc 1994;1:439 46. 20. McLean M, Walters WA, Smith R. Prediction and early diagnosis of preterm labor: a critical review. Obstetrical Gynecol Surv 1993;48:209 25. 21. Papageorghiou AT, Ohuma EO, Altman DG, Todros T, Ismail LC, Lambert A, et al. International standards for fetal growth based on serial ultrasound measurements: the Fetal Growth Longitudinal Study of the INTERGROWTH-21st Project. Lancet 2014;384:869 79. 22. Delobel-Ayoub M, Arnaud C, White-Koning M, Casper C, Pierrat V, Garel M, et al. Behavioral problems and cognitive performance at 5 years of age after very preterm birth: the EPIPAGE Study. Pediatrics 2009;123:1485 92. 23. Moore T, Hennessy EM, Myles J, Johnson SJ, Draper ES, Costeloe KL, et al. Neurological and developmental outcome in extremely preterm children born in England in 1995 and 2006: the EPICure studies. BMJ 2012;345:e7961. 24. Ball G, Aljabar P, Arichi T, Tusor N, Cox D, Merchant N, et al. Machine-learning to characterise neonatal functional connectivity in the preterm brain. NeuroImage 2016;124:267 75. 25 Bayley N. Bayley scales of infant and toddler development. 3rd ed Harcourt Assessment, Inc; 2006. 26. Moeskops P, Isgum I, Keunen K, Claessens NHP, van Haastert IC, Groenendaal F, et al. Prediction of cognitive and motor outcome of preterm infants based on automatic quantitative descriptors from neonatal MR brain images. Sci Rep 2017;7:2163. 27. Lawn JE, Davidge R, Paul VK, von Xylander S, de Graft Johnson J, Costello A, et al. Born too soon: care for the preterm baby. Reprod Health 2013;10(Suppl. 1):S5. 28. Katz J, Lee AC, Kozuki N, Lawn JE, Cousens S, Blencowe H, et al. Mortality risk in preterm and small-forgestational-age infants in low-income and middle-income countries: a pooled country analysis. Lancet (London, Engl) 2013;382:417 25. 29. Rittenhouse KJ, Vwalika B, Keil A, Winston J, Stoner M, Price JT, et al. Improving preterm newborn identification in low-resource settings with machine learning. PLoS One 2019;14:e0198919. 30. D’Angio CT, Maniscalco WM. Bronchopulmonary dysplasia in preterm infants: pathophysiology and management strategies. Paediatric Drugs 2004;6:303 30. 31. Kang ES, Matsuo N, Nagai T, Greenhaw J, Williams PL. Serum lipolytic activity in Reye’s syndrome. Clinica Chim Acta 1989;184:107 14. 32. Ambalavanan N, Van Meurs KP, Perritt R, Carlo WA, Ehrenkranz RA, Stevenson DK, et al. Predictors of death or bronchopulmonary dysplasia in preterm infants with respiratory failure. J Perinatol 2008;28:420 6. 33. Oh W, Poindexter BB, Perritt R, Lemons JA, Bauer CR, Ehrenkranz RA, et al. Association between fluid intake and weight loss during the first ten days of life and risk of bronchopulmonary dysplasia in extremely low birth weight infants. J Pediatr 2005;147:786 90. 34. Cunha GS, Mezzacappa-Filho F, Ribeiro JD. 
Risk factors for bronchopulmonary dysplasia in very low birth weight newborns treated with mechanical ventilation in the first week of life. J Tropical Pediatr 2005;51:334 40. 35. Laughon MM, Langer JC, Bose CL, Smith PB, Ambalavanan N, Kennedy KA, et al. Prediction of bronchopulmonary dysplasia by postnatal age in extremely premature infants. Am J Respiratory Crit Care Med 2011;183:1715 22. 36. Xu YP. Bronchopulmonary dysplasia in preterm infants born at less than 32 weeks gestation. Glob Pediatr Health 2016;3 2333794x16668773. 37. Ochab M, Wajs W. Expert system supporting an early prediction of the bronchopulmonary dysplasia. Comput Biol Med 2016;69:236 44. 38. Jensen EA, Dysart K, Gantz MG, McDonald S, Bamat NA, Keszler M, et al. The diagnosis of bronchopulmonary dysplasia in very preterm infants. An evidence-based approach. Am J Respiratory Crit Care Med 2019;200:751 9. 39. Isayama T, Lee SK, Yang J, Lee D, Daspal S, Dunn M, et al. Revisiting the definition of bronchopulmonary dysplasia: effect of changing panoply of respiratory support for preterm neonates. JAMA Pediatr 2017;171:271 9.

40. Bockli K, Andrews B, Pellerite M, Meadow W. Trends and challenges in United States neonatal intensive care units follow-up clinics. J Perinatol 2014;34:71 4. 41. Temple MW, Lehmann CU, Fabbri D. Predicting discharge dates from the NICU using progress note data. Pediatrics 2015;136:e395 405. 42. Pollack IF, Agnihotri S, Broniscer A. Childhood brain tumors: current management, biological insights, and future directions. J Neurosurg Pediatr 2019;23:261 73. 43. Patel V, McNinch NL, Rush S. Diagnostic delay and morbidity of central nervous system tumors in children and young adults: a pediatric hospital experience. J Neuro-oncol 2019;143:297 304. 44. Colafati GS, Voicu IP, Carducci C, Miele E, Carai A, Di Loreto S, et al. MRI features as a helpful tool to predict the molecular subgroups of medulloblastoma: state of the art. Ther Adv Neurol Disord 2018;11 1756286418775375. 45. Dasgupta A, Gupta T. Radiogenomics of medulloblastoma: imaging surrogates of molecular biology. J Transl Genet Genomics 2018. 46. Teo WY, Shen J, Su JM, Yu A, Wang J, Chow WY, et al. Implications of tumor location on subtypes of medulloblastoma. Pediatr Blood Cancer 2013;60:1408 10. 47. Koob M, Girard N, Ghattas B, Fellah S, Confort-Gouny S, Figarella-Branger D, et al. The diagnostic accuracy of multiparametric MRI to determine pediatric brain tumor grades and types. J Neurooncol 2016;127:345 53. 48. Raschke F, Davies NP, Wilson M, Peet AC, Howe FA. Classification of single-voxel 1H spectra of childhood cerebellar tumors using LCModel and whole tissue representations. Magn Reson Med 2013;70:1 6. 49. Rodriguez Gutierrez D, Awwad A, Meijer L, Manita M, Jaspan T, Dineen RA, et al. Metrics and textural features of MRI diffusion to improve classification of pediatric posterior fossa tumors. AJNR Am J Neuroradiol 2014;35:1009 15. 50. Orphanidou-Vlachou E, Vlachos N, Davies NP, Arvanitis TN, Grundy RG, Peet AC. Texture analysis of T1and T2-weighted MR images and use of probabilistic neural network to discriminate posterior fossa tumours in children. NMR Biomed 2014;27:632 9. 51. Fetit AE, Novak J, Rodriguez D, Auer DP, Clark CA, Grundy RG, et al. Radiomics in paediatric neuro-oncology: A multicentre study on MRI texture analysis. NMR Biomed 2018;31. 52. Zarinabad N, Abernethy LJ, Avula S, Davies NP, Rodriguez Gutierrez D, Jaspan T, et al. Application of pattern recognition techniques for classification of pediatric brain tumors by in vivo 3T (1) H-MR spectroscopy— a multi-center study. Magn Reson Med 2018;79:2359 66. 53. Iv M, Zhou M, Shpanskaya K, Perreault S, Wang Z, Tranvinh E, et al. MR imaging-based radiomic signatures of distinct molecular subgroups of medulloblastoma. AJNR Am J Neuroradiol 2019;40:154 61. 54. Arle JE, Morriss C, Wang ZJ, Zimmerman RA, Phillips PG, Sutton LN. Prediction of posterior fossa tumor type in children by means of magnetic resonance image properties, spectroscopy, and neural networks. J Neurosurg 1997;86:755. 55. Chen X, Tong Y, Shi Z, Chen H, Yang Z, Wang Y, et al. Noninvasive molecular diagnosis of craniopharyngioma with MRI-based radiomics approach. BMC Neurol 2019;19:6. 56. Mahmoodabadi SZ, Alirezaie J, Babyn P, Kassner A, Widjaja E. PCA-SGA implementation in classification and disease specific feature extraction of the brain MRS signals. In: Conference proceedings: annual international conference of the IEEE engineering in medicine and biology society. IEEE Engineering in Medicine and Biology Society. Annual conference 2008. 2008. p. 3526 9. 57. Selvam VS, Shenbagadevi S. 
Brain tumor detection using scalp EEG with modified wavelet-ICA and multilayer feed forward neural network. In: 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE; 2011. 6104 9. 58. Hara JH, Wu A, Villanueva-Meyer JE, Valdes G, Daggubati V, Mueller S, et al. Clinical applications of quantitative 3-dimensional MRI analysis for pediatric embryonal brain tumors. Int J Radiat Oncol Biol Phys 2018;102:744 56. 59. Narayanan A, Nana E, Keedwell E. Analyzing gene expression data for childhood medulloblastoma survival with artificial neural networks. In: 2004 Symposium on computational intelligence in bioinformatics and computational biology. IEEE; 2004. 9 16. 60. Thompson EM, Hielscher T, Bouffet E, Remke M, Luu B, Gururangan S, et al. Prognostic value of medulloblastoma extent of resection after accounting for molecular subgroup: a retrospective integrated clinical and molecular analysis. Lancet Oncol 2016;17:484 95. 61. Avula S, Spiteri M, Kumar R, Lewis E, Harave S, Windridge D, et al. Post-operative pediatric cerebellar mutism syndrome and its association with hypertrophic olivary degeneration. Quant Imaging Med Surg 2016;6:535 44.

62. Spiteri M, Guillemaut JY, Windridge D, Avula S, Kumar R, Lewis E. Fully-automated identification of imaging biomarkers for post-operative cerebellar mutism syndrome using longitudinal paediatric MRI. Neuroinformatics 2019;18(1):151 62. 63. Reddick WE, Mulhern RK, Elkin TD, Glass JO, Merchant TE, Langston JW. A hybrid neural network analysis of subtle brain volume differences in children surviving brain tumors. Magnetic Reson Imaging 1998;16:413 21. 64. Mulhern RK, Palmer SL, Reddick WE, Glass JO, Kun LE, Taylor J, et al. Risks of young age for selected neurocognitive deficits in medulloblastoma are associated with white matter loss. J Clin Oncol 2001;19:472 9. 65. Polewko-Klim A, Lesinski W, Mnich K, Piliszek R, Rudnicki WR. Integration of multiple types of genetic markers for neuroblastoma may contribute to improved prediction of the overall survival. Biol Direct 2018;13:17. 66. Francescatto M, Chierici M, Rezvan Dezfooli S, Zandona A, Jurman G, Furlanello C. Multi-omics integration for neuroblastoma clinical endpoint prediction. Biol Direct 2018;13:5. 67. Cangelosi D, Blengio F, Versteeg R, Eggert A, Garaventa A, Gambini C, et al. Logic learning machine creates explicit and stable rules stratifying neuroblastoma patients. BMC Bioinforma 2013;14(Suppl. 7):S12. 68. Schramm A, Schowe B, Fielitz K, Heilmann M, Martin M, Marschall T, et al. Exon-level expression analyses identify MYCN and NTRK1 as major determinants of alternative exon usage and robustly predict primary neuroblastoma outcome. Br J Cancer 2012;107:1409 17. 69. Park JT, Fernandez-Baca Vaca G. Epileptic seizure semiology in infants and children. Seizure 2019;77:3 6. 70. Chen S, Zhang J, Ruan X, Deng K, Zhang J, Zou D, et al. Voxel-based morphometry analysis and machine learning based classification in pediatric mesial temporal lobe epilepsy with hippocampal sclerosis. Brain Imaging Behav 2019. 71. Haneef Z, Stern J, Dewar S, Engel Jr. J. Referral pattern for epilepsy surgery after evidence-based recommendations: a retrospective study. Neurology 2010;75:699 704. 72. Perucca P, Gilliam FG. Adverse effects of antiepileptic drugs. Lancet Neurol 2012;11:792 802. 73. Guberman AH, Besag FM, Brodie MJ, Dooley JM, Duchowny MS, Pellock JM, et al. Lamotrigine-associated rash: risk/benefit considerations in adults and children. Epilepsia 1999;40:985 91. 74. Zaccara G, Franciotta D, Perucca E. Idiosyncratic adverse reactions to antiepileptic drugs. Epilepsia 2007;48:1223 44. 75. Aslan K, Bozdemir H, Sahin C, Noyan Ogulata S. Can neural network able to estimate the prognosis of epilepsy patients according to risk factors? J Med Syst 2010;34:541 50. 76. Boyle CA, Boulet S, Schieve LA, Cohen RA, Blumberg SJ, Yeargin-Allsopp M, et al. Trends in the prevalence of developmental disabilities in US children, 1997-2008. Pediatrics 2011;127:1034 42. 77. Newschaffer CJ, Falb MD, Gurney JG. National autism prevalence trends from United States special education data. Pediatrics 2005;115:e277 82. 78. Blumberg SJ, Bramlett MD, Kogan MD, Schieve LA, Jones JR, Lu MC. Changes in prevalence of parent-reported autism spectrum disorder in school-aged U.S. children: 2007 to 2011-2012. Natl Health Stat Rep 2013; 1 11. 79. Kogan MD, Vladutiu CJ, Schieve LA, Ghandour RM, Blumberg SJ, Zablotsky B, et al. The prevalence of parent-reported autism spectrum disorder among US children. Pediatrics 2018;142. 80. Hazlett HC, Poe MD, Gerig G, Styner M, Chappell C, Smith RG, et al. 
Early brain overgrowth in autism associated with an increase in cortical surface area before age 2 years. Arch Gen Psychiaty 2011;68:467 76. 81. Wolff JJ, Gu H, Gerig G, Elison JT, Styner M, Gouttard S, et al. Differences in white matter fiber tract development present from 6 to 24 months in infants with autism. Am J Psychiatry 2012;169:589 600. 82. Shen MD, Nordahl CW, Young GS, Wootton-Gorges SL, Lee A, Liston SE, et al. Early brain enlargement and elevated extra-axial fluid in infants who develop autism spectrum disorder. Brain 2013;136:2825 35. 83. Ozonoff S, Iosif AM, Baguio F, Cook IC, Hill MM, Hutman T, et al. A prospective study of the emergence of early behavioral signs of autism. J Am Acad Child Adolesc Psychiatry 2010;49 256-66.e1-256-66.e2. 84. Zwaigenbaum L, Bryson S, Rogers T, Roberts W, Brian J, Szatmari P. Behavioral manifestations of autism in the first year of life. Int J Developmental Neurosci 2005;23:143 52. 85. Jones W, Klin A. Attention to eyes is present but in decline in 2 6-month-old infants later diagnosed with autism. Nature 2013;504:427 31. 86. Robins DL, Fein D, Barton ML, Green JA. The modified checklist for autism in toddlers: an initial study investigating the early detection of autism and pervasive developmental disorders. J Autism Dev Disord 2001;31:131 44. 87. Robins DL. Screening for autism spectrum disorders in primary care settings. Autism 2008;12:537 56.

88. Chlebowski C, Robins DL, Barton ML, Fein D. Large-scale use of the modified checklist for autism in lowrisk toddlers. Pediatrics 2013;131:e1121 7. 89. Weng SJ, Wiggins JL, Peltier SJ, Carrasco M, Risi S, Lord C, et al. Alterations of resting state functional connectivity in the default network in adolescents with autism spectrum disorders. Brain Res 2010;1313:202 14. 90. Kennedy DP, Courchesne E. The intrinsic functional organization of the brain is altered in autism. NeuroImage 2008;39:1877 85. 91. Assaf M, Jagannathan K, Calhoun VD, Miller L, Stevens MC, Sahl R, et al. Abnormal functional connectivity of default mode sub-networks in autism spectrum disorder patients. NeuroImage 2010;53:247 56. 92. Craig AD. Significance of the insula for the evolution of human awareness of feelings from the body. Ann NY Acad Sci 2011;1225:72 82. 93. Uddin LQ, Supekar K, Lynch CJ, Khouzam A, Phillips J, Feinstein C, et al. Salience network-based classification and prediction of symptom severity in children with autism. JAMA Psychiatry 2013;70:869 79. 94. Anderson JS, Nielsen JA, Froehlich AL, DuBray MB, Druzgal TJ, Cariello AN, et al. Functional connectivity magnetic resonance imaging classification of autism. Brain: A J Neurol 2011;134:3742 54. 95. Charman T. Early identification and intervention in autism spectrum disorders: some progress but not as much as we hoped. Int J Speech-Language Pathol 2014;16:15 18. 96. Emerson RW, Adams C, Nishino T, Hazlett HC, Wolff JJ, Zwaigenbaum L, et al. Functional neuroimaging of high-risk 6-month-old infants predicts a diagnosis of autism at 24 months of age. Sci Transl Med 2017;9. 97. Costello EJ, Pine DD, Hammen C, March JS, Plotsky PM, Weissman MM, et al. Development and natural history of mood disorders. Biol. Psychiatry 2002;52:529 42. Available from: https://doi.org/10.1016/S0006-3223 (02)01372-0. 98. Zhang J, Li X, Li Y, Wang M, Huang B, Yao S, et al. Three dimensional convolutional neural network-based classification of conduct disorder with structural MRI. Brain Imaging Behav 2019. 99. Linke JO, Adleman NE, Sarlls J, Ross A, Perlstein S, Frank HR, et al. White matter microstructure in pediatric bipolar disorder and disruptive mood dysregulation disorder. J Am Acad Child Adolesc Psychiatry 2019. 100. Payabvash S, Palacios EM, Owen JP, Wang MB, Tavassoli T, Gerdes M, et al. Diffusion tensor tractography in children with sensory processing disorder: potentials for devising machine learning classifiers. Neuroimage Clin 2019;23:101831. 101. Yoo JH, Kim JI, Kim BN, Jeong B. Exploring characteristic features of attention-deficit/hyperactivity disorder: findings from multi-modal MRI and candidate genetic data. Brain Imaging Behav 2019. 102. Kelly EN, Allen VM, Seaward G, Windrim R, Ryan G. Mild ventriculomegaly in the fetus, natural history, associated findings and outcome of isolated mild ventriculomegaly: a literature review. Prenat Diagnosis 2001;21:697 700. 103. Weichert J, Hartge D, Krapp M, Germer U, Gembruch U, Axt-Fliedner R. Prevalence, characteristics and perinatal outcome of fetal ventriculomegaly in 29,000 pregnancies followed at a single institution. Fetal Diagnosis Ther 2010;27:142 8. 104. Parilla BV, Endres LK, Dinsmoor MJ, Curran L. In utero progression of mild fetal ventriculomegaly. Int J Gynaecol Obstetrics 2006;93:106 9. 105. Vergani P, Locatelli A, Strobelt N, Cavallone M, Ceruti P, Paterlini G, et al. Clinical outcome of mild fetal ventriculomegaly. Am J Obstet Gynecol 1998;178:218 22. 106. 
Kulkarni AV, Drake JM, Mallucci CL, Sgouros S, Roth J, Constantini S. Endoscopic third ventriculostomy in the treatment of childhood hydrocephalus. J Pediatr 2009;155:254 259.e1. 107. Gu JL, Johnson A, Kerr M, Moise Jr. KJ, Bebbington MW, Pedroza C, et al. Correlating prenatal imaging findings of fetal ventriculomegaly with the need for surgical intervention in the first 3 months after birth. Pediatric Neurosurg 2017;52:20 5. 108. Pisapia JM, Akbari H, Rozycki M, Goldstein H, Bakas S, et al. Use of fetal magnetic resonance image analysis and machine learning to predict the need for postnatal cerebrospinal fluid diversion in fetal ventriculomegaly. JAMA Pediatrics 2018;172(2):128 35. 109. Hyder AA, Wunderlich CA, Puvanachandra P, Gururaj G, Kobusingye OC. The impact of traumatic brain injuries: a global perspective. NeuroRehabilitation 2007;22:341 53. 110. Rutland-Brown W, Langlois JA, Thomas KE, Xi YL. Incidence of traumatic brain injury in the United States, 2003. J Head Trauma Rehabilitation 2006;21:544 8. 111. Wade SL, Zhang N, Yeates KO, Stancin T, Taylor HG. Social environmental moderators of long-term functional outcomes of early childhood brain injury. JAMA Pediatr 2016;170:343 9.

112. Max JE. Neuropsychiatry of pediatric traumatic brain injury. Psychiatr Clin North Am 2014;37:125 40. 113. Warden DL, Gordon B, McAllister TW, Silver JM, Barth JT, Bruns J, et al. Guidelines for the pharmacologic treatment of neurobehavioral sequelae of traumatic brain injury. J Neurotrauma 2006;23:1468 501. 114. Jin C, Schachar R. Methylphenidate treatment of attention-deficit/hyperactivity disorder secondary to traumatic brain injury: a critical appraisal of treatment studies. CNS Spectr 2004;9:217 26. 115. Bertsimas D, Dunn J, Steele DW, Trikalinos TA, Wang Y. Comparison of machine learning optimal classification trees with the pediatric emergency care applied research network head trauma decision rules. JAMA Pediatr 2019;173:648 56. 116. Kuppermann N, Holmes JF, Dayan PS, Hoyle Jr. JD, Atabaki SM, Holubkov R, et al. Identification of children at very low risk of clinically-important brain injuries after head trauma: a prospective cohort study. Lancet (London, Engl) 2009;374:1160 70. 117. Ide K, Uematsu S, Tetsuhara K, Yoshimura S, Kato T, Kobayashi T. External validation of the PECARN head trauma prediction rules in Japan. Acad Emerg Med 2017;24:308 14. 118. Lorton F, Poullaouec C, Legallais E, Simon-Pimmel J, Chene MA, Leroy H, et al. Validation of the PECARN clinical decision rule for children with minor head trauma: a French multicenter prospective study. Scand J Trauma Resusc Emerg Med 2016;24:98. 119. Bosl WJ. Systems biology by the rules: hybrid intelligent systems for pathway modeling and discovery. BMC Syst Biol 2007;1:13. 120. Masecchia S, Coco S, Barla A, Verri A, Tonini GP. Genome instability model of metastatic neuroblastoma tumorigenesis by a dictionary learning algorithm. BMC Med Genomics 2015;8:57. 121. Rubin DM, Kenyon CC, Strane D, Brooks E, Kanter GP, Luan X, et al. Association of a targeted population health management intervention with hospital admissions and bed-days for Medicaid-enrolled children. JAMA Netw Open 2019;2:e1918306. 122. Patel SJ, Chamberlain DB, Chamberlain JM. A machine learning approach to predicting need for hospitalization for pediatric asthma exacerbation at the time of emergency department triage. Acad Emerg Med 2018;25:1463 70. 123. AlSaad R, Malluhi Q, Janahi I, Boughorbel S. Interpreting patient-Specific risk prediction using contextual decomposition of BiLSTMs: application to children with asthma. BMC Med Inform Decis Mak 2019;19:214. 124. Williams JB, Ghosh D, Wetzel RC. Applying machine learning to pediatric critical care data. Pediatr Crit Care Med 2018;19:599 608. 125. Shirwaikar RD. Estimation of caffeine regimens: a machine learning approach for enhanced clinical decision making at a neonatal intensive care unit (NICU). Crit Rev Biomed Eng 2018;46:93 115. 126. Iftekharuddin KM, Ahmed S, Hossen J. Multiresolution texture models for brain tumor segmentation in MRI. In: 2011 Annual international conference of the IEEE Engineering in Medicine and Biology Society. IEEE; 2011. 6985 8. 127. Islam A, Iftekharuddin KM, Ogg RJ, Laningham FH, Sivakumar B. Multifractal modeling, segmentation, prediction, and statistical validation of posterior fossa tumors. In: Medical imaging 2008: computer-aided diagnosis, international society for optics and photonics. 2008. 69153C. 128. Wels M, Carneiro G, Aplas A, Huber M, Hornegger J, Comaniciu D. A discriminative model-constrained graph cuts approach to fully automated pediatric brain tumor segmentation in 3-D MRI. 
In: International conference on medical image computing and computer-assisted intervention. Springer; 2008. p. 67 75. 129. Goulooze SC, Zwep LB, Vogt JE, Krekels EHJ, Hankemeier T, van den Anker JN, et al. Beyond the randomized clinical trial: innovative data science to close the pediatric evidence gap. Clin Pharmacol Ther 2019;107.


22 Artificial intelligence enabled public health surveillance—from local detection to global epidemic monitoring and control

Daniel Zeng, Zhidong Cao and Daniel B. Neill

Abstract

Artificial intelligence (AI) techniques have been widely applied to infectious disease outbreak detection and early warning, trend prediction, and public health response modeling and assessment. Such public health surveillance and response tasks of major importance pose unique technical challenges such as data sparsity, lack of positive training samples, difficulty in developing baselines and quantifying the control measures, and interwoven dependencies between spatiotemporal elements and finer-grained risk analyses through contact and social networks. Traditional public health surveillance relies heavily on statistical techniques. Recent years have seen tremendous growth of AI-enabled methods, including but not limited to deep learning based models, complementing statistical approaches. This chapter aims to provide a systematic review of these recent advances applying AI techniques to address public health surveillance and response challenges.

Keywords: AI-enabled public health surveillance; infectious disease surveillance; early warning; public health response

22.1 Introduction

In the past decade, partially fueled by major advances in big data and raw computing power, artificial intelligence (AI) technology has entered an extraordinary phase of fast development and wide application. The techniques developed in traditional AI research areas such as computer vision, speech recognition, natural language processing, and robotics have found many innovative applications in an array of real-world settings, including medicine. The general methodological contributions from AI, such as a variety of recently developed deep
learning algorithms, have also been applied to a wide spectrum of fields. Public health surveillance is one such area that has benefited significantly from these recent AI advances. As shown in Fig. 22.1, the growing literature on AI-enabled or AI-enhanced public health surveillance illustrates the research community's interest in applying AI techniques. As another example, as part of the research community's response to the COVID-19 pandemic, there was a specific call for developing AI-based public health solutions.

FIGURE 22.1 Numbers of published papers containing both "artificial intelligence" and "public health surveillance" as keywords from Web of Science, accessed in January 2020.

How can AI enhance existing public health surveillance and response approaches and enable new ones? The answer lies with the fundamental challenges facing public health surveillance and response. Public health surveillance is intrinsically data driven. Identifying early, accurate, and reliable signals of health anomalies and disease outbreaks from a heterogeneous collection of data sources has always been the main objective of public health surveillance. Technically, this translates into two distinct challenges: the data sourcing challenge and the analytics challenge. The first, the data sourcing challenge, concerns determining easily operationalizable sources of data that contain useful signals. The second, the analytics challenge, concerns developing effective computational frameworks to extract such signals. AI provides a range of methods and techniques to help tackle both challenges. Other major objectives of public health surveillance and response are to analyze and predict infectious disease trends through modeling of disease transmission dynamics and to assess public health responses. Accomplishing these objectives entails domain knowledge and context-rich predictions, fine-grained risk analytics through contact and social networks, quantification of responses and control measures, and assessment of these responses and measures in the presence of complex interactions and constraints. AI offers a suite of applicable modeling and analytics frameworks to address these complex considerations.

Fig. 22.2 summarizes the major ways through which AI enhances existing public health surveillance and response approaches and enables new ones. First, AI opens the door to make use of a variety of novel or underexplored data sources for public health surveillance purposes, especially those not originally or intentionally designed to answer epidemiological questions. For instance, with the rapid development of the Internet and the Internet of Things applications, ubiquitous social and device
sensing capabilities are becoming a reality, presenting significant surveillance potential. A variety of open data, external to traditional public health surveillance systems, can be fruitfully exploited to enhance surveillance capabilities.

FIGURE 22.2 Enabling and enhancing public health surveillance and response approaches through AI applications: techniques such as deep learning, reinforcement learning, knowledge graphs, Bayesian networks, and multiagent systems support extended data sources (physical and cyber worlds), data analytics (outbreak detection, early warning, spatiotemporal analytics, risk estimation, and trend prediction), and epidemic modeling and simulation (agent-based modeling, multiagent simulation, and response assessment).

Second, AI enhances the traditional suite of data analytics tools, which is mainly statistics based, to deal with new surveillance data types that cover unstructured and semistructured text, images, and videos, in addition to structured information items. Dealing with unstructured data necessitates the use of AI methods such as natural language processing and image processing, which often include a deep learning component for automatic data-driven feature construction. Through the application of these methods, unstructured data can be converted into structured items through semantic labels and autofilled features. From an analytical standpoint, public health surveillance is concerned with timely and effective assessment of the risk of an epidemic and detection of abnormal changes in its spatiotemporal status, so as to provide early warning and predict the epidemic's trend. AI methods, in particular those based on machine learning, have long been applied to detect patterns, identify anomalies, and analyze trends and risks from public health surveillance data streams. Such data streams often possess prominent temporal and spatial elements and need to be analyzed along with external social, economic, and environmental data. Traditional surveillance methods rely heavily on the use of statistical methods. As the data become increasingly complex, statistical inference within the framework of these methods becomes rather difficult. Moreover, statistical methods focus on conclusions at the macrolevel, whereas machine learning methods enable customized inferences aimed at characterizing local patterns.

Third, AI provides modeling frameworks to simulate complex setups and scenarios of infectious disease transmission and public health responses. The evolution of epidemics in time,
space, and people has a high degree of uncertainty and complexity. The dynamics of infection transmission are nonlinear in nature, with chaotic characteristics and poor predictability. As such, the applicability of models based on aggregated statistics and linear interactions is inherently limited. A subarea of AI, multiagent systems, offers modeling frameworks that allow for the study of the evolution of an epidemic under different conditions. These frameworks can also be used to quantitatively evaluate the effect of different interventions and control measures.

The remainder of this chapter reviews the application of AI in public health surveillance and response. Section 22.2 discusses AI-enhanced data analysis techniques for outbreak detection and early warning. Section 22.3 focuses on AI-enhanced prediction methods in support of surveillance and trend analysis tasks. In Section 22.4, we review AI-based simulation frameworks to characterize infectious disease transmission patterns and assess public health responses. In Section 22.5, we briefly summarize several Internet-based surveillance systems aimed at global epidemic monitoring. Section 22.6 concludes this chapter with a summary of the key findings and a discussion of the challenges ahead.
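As a toy illustration of the multiagent modeling idea mentioned above (agent-based approaches are reviewed in Section 22.4), the sketch below simulates a simple susceptible-infectious-recovered process over individual agents; the population size, contact rate, and transmission and recovery probabilities are arbitrary assumptions.

```python
import random

random.seed(0)
N, CONTACTS_PER_DAY, P_TRANSMIT, P_RECOVER = 1000, 5, 0.05, 0.10
state = ["S"] * N
for i in random.sample(range(N), 10):          # seed ten initial infections
    state[i] = "I"

for day in range(61):
    infectious = [i for i, s in enumerate(state) if s == "I"]
    for i in infectious:
        for j in random.sample(range(N), CONTACTS_PER_DAY):   # random daily contacts
            if state[j] == "S" and random.random() < P_TRANSMIT:
                state[j] = "I"
        if random.random() < P_RECOVER:
            state[i] = "R"
    if day % 10 == 0:
        print(day, {s: state.count(s) for s in ("S", "I", "R")})
```

Interventions such as reduced contact rates or vaccination can be represented by changing the corresponding parameters or agent states, which is how such frameworks support quantitative comparison of control measures.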

22.2 Artificial intelligence enhanced data analysis for outbreak detection and early warning

In order to improve the timeliness and accuracy of outbreak detection and early warning approaches, public health researchers continue to investigate and explore sensor data and indicators from the physical world covering health, environmental, societal, and economic aspects, among others. Significant efforts have been expended to make use of data from the cyberspace, such as keyword searches, blogs, and social networking posts. Fig. 22.3 illustrates these sources of data, along with the commonly used machine learning methods. In the remainder of this section, we first discuss two sets of AI-enhanced data analysis techniques: one concerned with data from the physical world and the other from the cyberspace. Then we consider how machine learning techniques for text analysis and event detection can provide a "safety net" by identifying previously unseen disease outbreaks and other emerging events of interest to public health.

FIGURE 22.3 Sources of public health surveillance data (the physical world and the cyberspace) and commonly used machine learning methods (support vector machines, gradient boosting machines, random forests, CNN and LSTM neural networks, and online extreme learning machines).

22.2.1 Analyzing data collected from the physical world

The spatiotemporal pattern of epidemiological risks is related to various factors such as climatic conditions, social and economic status, and vector transmission. Assessing the outbreak risk of an infectious disease is important for early warning and effective resource deployment. Based on data collected from the physical world, machine learning methods have been successfully applied to estimate high-risk regions and outbreak periods. Support vector machines (SVM), gradient boosting machines, and random forests (RF) were applied to model the global distribution of Aedes aegypti and Aedes albopictus in the fight against mosquito-borne infectious diseases such as Zika virus (ZIKV), dengue, and chikungunya. Effectively killing the vector cuts off the disease's transmission path. Multidisciplinary datasets, such as occurrence records, social factors, and meteorological factors, were quantified to train the models. RF obtained the highest AUC value, and temperature suitability had the best discriminatory power among the factors.1 RF was also used to assess the risk of dengue transmission in Singapore with dengue, population, entomological, and environmental data. Random bootstrap samples were drawn from the data, and an unpruned decision tree was fitted to each bootstrap sample. The risk maps had high accuracy in that more than 80% of the observed risk ranks fell within the 80% prediction interval.2 Another framework, based on neural networks and an online extreme learning machine (OLEM), estimated the distribution of the kinds of water containers harboring Aedes mosquito larvae in Recife, Brazil. Nine years of environmental and entomological data were used to train the OLEM model.3 Deep learning models have also been applied to detect outbreaks of infectious diseases. A dynamic neural network model was developed to predict the outbreak risk of ZIKV in the Americas. This model utilized historical epidemiological data, air travel volumes, vector distribution, and socioeconomic factors. The main feature of this modeling work is its flexibility: decision-makers can easily modify the risk indicator, risk classification scheme, and prediction forecast window according to their own customized needs.4 To examine emerging spatiotemporal hotspots of dengue fever at the township level in Taiwan, a deep AlexNet model was trained on sea surface temperature images and rainfall data by transfer learning. This transfer learning based method overcame the overfitting problem due to the small dataset and yielded an accuracy of 100% on an eightfold cross-validation test dataset.5 The general trend in spatiotemporal analysis is that the data resolution and the number of exogenous variables are increasing. In this regard, the nonlinear fitting ability of machine learning, in particular deep learning, models offers many advantages over classical statistical models.
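The bootstrap-plus-unpruned-tree idea used in the Singapore dengue study can be sketched as follows: fit one unpruned tree per bootstrap resample and read an 80% prediction interval off the ensemble. The feature matrix and risk target below are synthetic placeholders, and the interval is evaluated in-sample purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))                        # e.g., population, entomological, environmental features
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)  # synthetic dengue "risk" target

preds = []
for _ in range(200):
    idx = rng.integers(0, len(X), len(X))            # bootstrap resample
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])   # unpruned by default
    preds.append(tree.predict(X))
preds = np.vstack(preds)

lower, upper = np.percentile(preds, [10, 90], axis=0)    # 80% prediction interval per location
coverage = np.mean((y >= lower) & (y <= upper))
print(f"fraction of observed values inside the 80% interval: {coverage:.2f}")
```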

22.2.2 Analyzing data from the cyberspace

Internet application usage data (e.g., keyword searches) and social media data are widely studied for rapid response to infectious disease outbreaks. Machine learning methods have been used for text classification and sentiment analysis of social media data for surveillance purposes. A social media based early warning system for mosquito-borne disease in India was proposed.6 Latent Dirichlet allocation based topic modeling techniques were
applied to identify relevant topics related to symptoms, prevention, and public sentiment toward the disease. The real-time tracking of public sentiment provided an early warning mechanism. DEFENDER is a software system developed in the United Kingdom that integrates Twitter and news media for outbreak detection.7 SVM and naive Bayes classifiers were used for disease-related text classification. The DBSCAN algorithm was utilized to cluster the geographic space and observe the movement behavior of Internet users. The second-generation system (SENTINEL) of DEFENDER further improved the text classification and denoising algorithms using CNN and LSTM networks.8 Twitter has proven its usefulness as a public health surveillance data source. The Twitter data feed is real time and can be obtained from a large number of users in different geographic regions. At the same time, interpreting Twitter data semantically can be challenging, given that tweets are often very short and full of incomplete and informal writing. Chen and Neill analyzed the heterogeneous network structure of Twitter using a nonparametric graph scan and applied this approach to the detection of hantavirus outbreaks in Chile.9 Dai and Bikdash10 studied influenza-related tweet classification as a surveillance tool. In their proposed hybrid classification approach, they combined manually defined features with features automatically generated by supervised machine learning methods to separate tweets involving flu cases from tweets that do not. Dai et al.11 reported a clustering method based on word embeddings for public health monitoring. This method learns semantically meaningful representational vectors from surrounding words. Based on cluster similarity measures, tweets can in turn be classified as relevant or irrelevant to a certain topic (e.g., flu). Wang et al.12 proposed a long- and short-term RNN structure to classify infectious disease-related tweets and showed that this deep learning model outperformed a range of standard machine learning models. Lampos et al.13 used a neural network word embedding model trained on Twitter content to determine the degree of semantic relevance of text to infectious diseases. An "influenza infection" concept was developed and used to reduce false and potentially confusing features selected by previously common methods. Edo-Osagie et al.14 used an attention-based short text classification method to mine information on Twitter for public health monitoring. The goal of the algorithm was to automatically filter tweets related to asthma syndromes. The algorithm used a binary recurrent neural network architecture with an attention layer (ABRNN) that allows the network to weight words in tweets based on perceived importance. Souza et al.15,16 developed new machine learning approaches to identify geographic hot spots of dengue infection risk using a large Twitter dataset from Brazil. Recently, Shah and Dunn17 proposed a machine learning method to detect the magnitude of unexpected, spatiotemporally patterned changes in term usage from social media data streams. This work has direct public health relevance and can be used for health events represented by relatively infrequent terms. In practical applications, social media based outbreak detection and early warning methods are not without problems.
The key challenges include the heterogeneity of the geospatial distribution of data, the demographic heterogeneity of online users, and the unavailability of data and language resources in underdeveloped regions. It was also noted that for such methods to work, supervised learning models would need a large amount of labeled data.18 Special care has to be given to reduce biases when adopting these methods.
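A minimal sketch of the disease-related text classification used in systems such as DEFENDER follows; the example posts, labels, and naive Bayes model are invented placeholders rather than the deployed classifiers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = ["feeling feverish and coughing all night",
         "great football match today",
         "half my office is out sick with the flu",
         "new phone arrived, love the camera"]
labels = [1, 0, 1, 0]                    # 1 = possibly illness-related (invented labels)

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(posts, labels)
print(clf.predict(["my kid has a fever and a bad cough"]))   # expected: [1]
```

In practice, much larger labeled corpora, richer features, and bias checks of the kind noted above are required.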


22.2.3 From syndromic to pre-syndromic disease surveillance: A safety net for public health

Over the past two decades, numerous techniques have been developed in the disease surveillance, statistics, and AI communities for syndromic surveillance: early detection of disease outbreaks by identifying emerging clusters of cases in space and time. A variety of public health data sources, such as hospital emergency department visits, over-the-counter medication sales19, and more recently online data sources such as search queries20,21 and social media9, have been employed for this task, and spatial event detection approaches based on the spatial and subset scan statistics22,23 have become increasingly widespread in public health practice. Such approaches typically classify cases into a set of known syndrome types (such as influenza-like illness or gastrointestinal illness) based on preestablished rules, then detect regions of space and time with significantly higher than expected case counts. Heuristic search methods such as simulated annealing and genetic algorithms25,26, fast subset scan approaches24 for exact optimization over subsets, and machine learning approaches such as support vector machines27 have been used to increase the flexibility of syndromic surveillance and to enable more accurate detection of irregularly shaped spatial clusters. However, these syndromic surveillance approaches, as well as other public health approaches such as notifiable disease reporting, are unable to detect newly emerging (“novel”) outbreaks with previously unseen patterns of symptoms, or other unexpected events of relevance to public health. Such events would not be mapped to any of the existing syndrome categories, or would be lumped into a broader and less informative syndrome definition, thus diluting or entirely removing the outbreak signal. This necessitated the development of new machine learning approaches for “pre-syndromic” surveillance28,29 that do not rely on existing syndrome categories, but instead analyze free-text data such as emergency department chief complaints to identify emerging patterns of keywords. While early pre-syndromic surveillance approaches30,31 treat each keyword in isolation, identifying any novel words or those that substantially increased in frequency, more recent machine learning approaches32,34 develop novel variants of latent Dirichlet allocation (LDA) topic models33 to identify newly emerging topics that cluster in space, in time, or among a subpopulation defined by observed demographic or behavioral features. When deployed in combination with existing syndromic surveillance and notifiable disease reporting systems, pre-syndromic surveillance provides a “safety net” for public health practitioners, calling their attention to newly emerging outbreaks and other events that they were not already looking for. Moreover, incorporating user feedback into the learned topic models34 enables the system to better distinguish between relevant and irrelevant case clusters, and thus avoids overwhelming the user with false positives.
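As a concrete illustration of the topic-modeling building block behind these pre-syndromic methods, the sketch below fits an LDA model to a handful of invented chief-complaint strings with scikit-learn; the published semantic-scan approaches add spatial, temporal, and demographic scanning on top of this step, which is not shown here, and the complaint texts and topic count are assumptions for the example only.

```python
# Fit LDA to free-text chief complaints and inspect the top words per topic.
# The complaints below are invented examples, not real emergency department data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

complaints = [
    "fever cough body aches",
    "cough sore throat congestion",
    "nausea vomiting diarrhea",
    "vomiting stomach pain diarrhea",
    "rash fever joint pain",
    "rash itching swelling",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(complaints)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)  # per-visit topic mixture

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")
# Emerging clusters of visits concentrated in one topic, place, and time
# would then be flagged for review by a scan statistic.
```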

22.3 Artificial intelligence enhanced prediction in support of public health surveillance

Time series of epidemiological data feature seasonality, nonstationarity, and sparsity. Predicting such time series has major public health implications and has attracted a lot of attention from the research and practitioner communities. Researchers have been proposing complex models for univariate prediction to extract useful patterns. In addition, efforts have been expended to develop multivariate prediction models. AI plays an important role in both research streams. Fig. 22.4 illustrates the overall AI-enhanced prediction framework.

FIGURE 22.4 An AI-enhanced prediction framework for infectious disease time series. (The framework builds training data from epidemic time series through time windows and, where available, from externally related variables, then fits a learning model to generate predictions for new time series.)

22.3.1 Time series prediction based on dependent variables

Researchers have worked to extract long-term dependencies from one or more correlated incidence curves, without complex exogenous variables. A case in point is the “FluSight” task hosted by the US Centers for Disease Control and Prevention (CDC), which encourages seasonal influenza forecasting at the national and regional level using the weighted influenza-like illness (wILI) data. The CNNRNN-Res model adopted RNNs to capture the long-term correlation in the wILI curves and CNNs to fuse curves of different states.35 To avoid overfitting, this model utilized residual links and the dropout mechanism. CNNRNN-Res achieved better results than autoregressive methods and Gaussian process regression. To overcome the problem of data sparsity and improve model interpretability in influenza prediction, Adhikari et al.36 designed a novel framework named EpiDeep. This framework consists of clustering/embedding, encoder, and decoder modules to learn meaningful embeddings of incidence curves in a continuous feature space and predicts peak intensity, peak time, onset week, and future incidences of wILI. The learned embeddings reveal the neighbor similarities, temporal similarities, intensity separation, and other patterns in different flu seasons. Focusing on the problem of high-resolution ILI incidence forecasting, the DEFSI model used the SEIR model and multiagent simulation to obtain time series of incidence with high spatial and temporal resolution for model training. Results showed that the DEFSI model outperformed the baselines at the state level and for high-resolution forecasting at the county level.37
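The following PyTorch sketch shows the generic recipe these models share: slice an incidence curve into fixed-length windows and train a recurrent network to predict the next value. It is a minimal stand-in under assumed settings (a synthetic seasonal series, an arbitrary window length, and a plain LSTM), not a reimplementation of CNNRNN-Res, EpiDeep, or DEFSI.

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic seasonal "wILI-like" curve standing in for real surveillance data.
rng = np.random.default_rng(0)
t = np.arange(300)
series = (2.0 + np.sin(2 * np.pi * t / 52)
          + 0.1 * rng.standard_normal(300)).astype(np.float32)

# Build (window -> next value) training pairs, i.e. the time-window step.
window = 8
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = torch.tensor(X).unsqueeze(-1)   # (samples, window, 1)
y = torch.tensor(y).unsqueeze(-1)   # (samples, 1)

class Forecaster(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out[:, -1])  # predict next week from the last state

model = Forecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for _ in range(200):                 # full-batch training on the toy series
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print("one-step-ahead forecast:", model(X[-1:]).item())
```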


FIGURE 22.5 A simulation framework for analyzing infectious disease prevention and control strategies. (Public health surveillance data and machine learning methods such as linear regression, support vector machines, and neural networks, trained on network properties including degree, shortest path, clustering coefficient, and density, are used to estimate formal epidemic parameters such as the basic reproduction number, generation time, spread rate, incubation period, and infectious period. Complex network models (random, small-world, scale-free, hybrid, and heterogeneous networks) and multiagent modeling of infectious diseases establish artificial systems for disease epidemics, on which computational experiments and simulations of interventions such as isolation, vaccination, and controlling personnel activities support effectiveness evaluation and strategy optimization of prevention and control measures.)

The abovementioned models focus on the time dependence of the incidence curves. In fact, these sequences from different regions also have spatial similarity. Graph neural networks can be applied to capture the spatial correlation across different geographical scales. Li et al.38 used a graph-structured recurrent neural network (GSRNN) to predict the influenza data provided by the US CDC. Nodes of the graph represent the Health and Human Services regions, and edges represent adjacency between regions. This model, shown to deliver state-of-the-art performance, partitioned the nodes into two classes based on the influenza activity level, and the node and edge features were trained separately.
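The toy example below conveys the underlying idea in plain NumPy: each region's next value is regressed on its own recent history and on a neighbor-averaged history defined by a region adjacency matrix. The four-region chain graph, the lag length, and the ordinary least squares fit are illustrative simplifications standing in for the GSRNN, not its actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
T, R = 200, 4
# Fake regional ILI-like curves with a shared seasonal signal plus noise.
curves = 2 + np.sin(np.arange(T)[:, None] / 10 + np.arange(R)) \
         + 0.1 * rng.standard_normal((T, R))

# Chain graph over four regions, row-normalised so each row averages neighbours.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A = A / A.sum(axis=1, keepdims=True)

lag = 3
rows, targets = [], []
for t in range(lag, T - 1):
    own = curves[t - lag:t + 1]          # each region's own recent history
    nbr = curves[t - lag:t + 1] @ A.T    # neighbour-averaged history
    for r in range(R):
        rows.append(np.concatenate([own[:, r], nbr[:, r]]))
        targets.append(curves[t + 1, r])
X, y = np.array(rows), np.array(targets)

# Ordinary least squares stands in for the neural node/edge networks.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted coefficients:", np.round(coef, 3))
```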

22.3.2 Time series prediction based on dependent and independent variables

Past research has also explored external variables that are highly correlated with the outbreak of infectious diseases to make multivariate predictions. The prediction performance of such methods depends heavily on the selection of external variables. Commonly used external variables are climate data (temperature, humidity, air quality), search trends (Google indexes, Baidu indexes), social media text data (Twitter, Sina), and population migration, among others. In general, accurate and fine-grained multivariate data can improve the prediction accuracy. In order to measure the contribution of each variable to the prediction results, modelers need to try different combinations of external variables for model training and prediction. Usually, the higher the prediction accuracy, the better the chosen variables explain the epidemic signal. The LSTM model is the most commonly used prediction model. Chae et al.39 investigated LSTM and DNN models to predict chicken pox, scarlet fever, and malaria in Korea. Daily Naver search frequency, number of Twitter mentions, and average daily temperature and humidity data were included as exogenous variables. The DNN model performed stably, and the LSTM model was more accurate when infectious disease was spreading. It is critical to consider the time difference between clinical data and nonclinical data, and results indicated that a lag of 7 days was more suitable in this work. In another study a multichannel attention-based LSTM neural network was designed to forecast the real-time influenza-like illness rate (ILI%) in Guangzhou, China.40 The external variables included medicine sales records, temperature records, and rainfall, among others. Because people of different ages have varying immunity to flu, the influenza-like cases were divided into five age groups. The approach trained the influenza- and climate-related channels separately and merged these features together in an ensuing step. In addition to developing specific prediction models, there are also hybrid models that combine the prediction results of different methods in a weighted manner to produce better accuracy and improved robustness. A self-adaptive AI model (SAAIM) that predicts influenza activity in Chongqing, China, was developed by Su et al.41 The multisource data include ILI%, weather data, the Baidu search index, and Sina Weibo data of Chongqing. SAAIM combines the predictions of SARIMA and XGBoost through a Kalman filter, so the weights are self-adaptive. As for the contribution of different data to SAAIM, ablation experiments showed that ILI% 1 week prior to real time had the highest ranking score, suggesting that ILI activity is highly autoregressive. Soliman et al.42 developed probabilistic forecasting of influenza in Dallas County using Bayesian model averaging (BMA). In this work the baseline models are feedforward neural networks, ARIMA, LASSO, and the nonparametric multivariate adaptive regression splines (MARS) model. Influenza records, Google search trends, and atmospheric data serve as input variables. The BMA model outperformed the individual methods in 1- and 2-week-ahead forecasts. AI-based infectious disease prediction is still in its infancy, and some key issues summarized by Viboud and Vespignani43 for influenza prediction are also applicable to other AI-based infectious disease prediction: How do prediction capabilities scale with data accuracy and quantity? How should ensemble predictions be optimized? And how do prediction capabilities decrease with the time horizon? Clearly, more research is needed to answer these questions.
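A much-simplified sketch of the weighting idea behind such hybrid models is shown below: two base forecasts are combined with weights proportional to their recent accuracy. This only illustrates adaptive combination under invented data and weights; it does not reproduce SAAIM's Kalman-filter scheme or Bayesian model averaging.

```python
import numpy as np

rng = np.random.default_rng(2)
truth = 3 + np.sin(np.arange(60) / 6) + 0.1 * rng.standard_normal(60)

# Pretend base-model forecasts (e.g., a SARIMA-like and a tree-based model).
forecast_a = truth + 0.30 * rng.standard_normal(60)
forecast_b = truth + 0.15 * rng.standard_normal(60)

window = 8  # judge each model on its most recent weeks
err_a = np.mean((forecast_a[-window:] - truth[-window:]) ** 2)
err_b = np.mean((forecast_b[-window:] - truth[-window:]) ** 2)

# Inverse-error weights: the more accurate model gets the larger weight.
w_a, w_b = 1 / err_a, 1 / err_b
w_a, w_b = w_a / (w_a + w_b), w_b / (w_a + w_b)

combined = w_a * forecast_a + w_b * forecast_b
print(f"weights: {w_a:.2f}, {w_b:.2f}")
print("combined MSE:", np.mean((combined - truth) ** 2))
```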

22.4 Artificial intelligence based infectious disease transmission modeling and response assessment

Modeling complex infectious disease transmission is key to public health emergency response. A basic framework for assessing infectious disease prevention and control strategies using simulation is shown in Fig. 22.5. Complex network models and agent-based computing methods are two widely used approaches. Complex network models have low computational complexity and a high level of abstraction. Agent-based methods have higher resolution and can flexibly reconstruct detailed plans and reproduce the complex process of disease transmission, which is necessary to assess major responses and interventions in real-world settings. Some studies44 have combined these two models to capture multilevel features of complex transmission systems.


22.4.1 Modeling disease transmission dynamics based on machine learning and complex networks

Complex network analysis45,46 methods have been applied to study the spread of epidemics over typical network types such as small-world networks, scale-free networks, and community networks. Recently, researchers have attempted to combine AI and complex networks for infectious disease transmission modeling. Tripathi et al.47 used machine learning techniques to predict the controllability of diseases on complex networks. In their experiments the input of the training data was the complex network properties, including average degree, average shortest path length, clustering coefficient, density, diameter, and maximum degree. Their approach applied three machine learning methods, linear regression, SVM, and a neural network model, to predict an important parameter in disease transmission, the basic reproduction number (R0), which determines whether the disease-free equilibrium or an endemic state is asymptotically stable. SIR epidemic spreading models were used to simulate the disease spreading dynamics on four types of complex networks. The experimental results showed that the prediction of machine learning on complex networks was highly accurate. Scarpino and Petri48 adopted dynamic modeling approaches to study the predictability of infectious disease outbreaks. Permutation entropy, Markov chain simulations, and epidemic simulations were used in the experiment. The results indicated that both shifting model structures and social network heterogeneity are likely to lead to differences in the predictability of infectious diseases.
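The sketch below mirrors that workflow on a small scale: structural properties are computed for a set of random networks with NetworkX, and a regression model is trained to predict an epidemic quantity from them. To keep the example self-contained, a rough degree-based proxy for R0 replaces full SIR simulations as the training target, so the numbers and parameter choices are illustrative only.

```python
import numpy as np
import networkx as nx
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
beta, gamma = 0.05, 0.2  # illustrative transmission and recovery rates

features, targets = [], []
for _ in range(40):
    n = int(rng.integers(100, 300))
    if rng.random() < 0.5:
        g = nx.erdos_renyi_graph(n, p=0.05, seed=int(rng.integers(1_000_000)))
    else:
        g = nx.barabasi_albert_graph(n, m=3, seed=int(rng.integers(1_000_000)))
    degrees = np.array([d for _, d in g.degree()])
    features.append([degrees.mean(),
                     nx.average_clustering(g),
                     nx.density(g),
                     degrees.max()])
    # Rough degree-based proxy for R0 on a heterogeneous network,
    # used here only as a stand-in for simulated outcomes.
    targets.append(beta / gamma * (degrees ** 2).mean() / degrees.mean())

model = LinearRegression().fit(features, targets)
print("R^2 on the training networks:", model.score(features, targets))
```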

22.4.2 Modeling disease transmission dynamics based on multiagent modeling

Effective interventions are essential to curb the spread of infectious diseases. Assessing the effectiveness of such responses entails analysis of various “what-if” scenarios. Agent-based simulations provide an AI-based framework to carry out these assessment-related tasks. Mei et al.44 proposed a model that unified agent-based modeling and complex networks and applied complex agent networks to model infectious diseases. Rocha and Masuda49 developed an individual-based approximation for the SIR epidemic model applicable to arbitrary dynamic networks. Großmann et al.50 proposed the rejection-based simulation of non-Markovian agents on complex networks and demonstrated its efficacy on various models of epidemic spreading. Through multiagent simulation, Kuga and Tanimoto51 found that a pandemic arises more easily in a scale-free network than in homogeneous networks. When constructing a multiagent simulation, generating and simulating realistic and dynamic contact networks remains a major challenge. Considering the spatiotemporal dynamics of influenza, Cliff et al.52 proposed ACEMod (Australian Census-based Epidemic Model). This model employed a discrete-time and stochastic agent-based model to investigate complex outbreak scenarios at various spatiotemporal levels. In this model, each agent contained a set of attributes of an anonymous individual. The agents’ distributions at multiple scales concurred with key demographic statistics from the 2006 Australian census.
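A minimal discrete-time, agent-based simulation of this kind is sketched below: each agent carries an infection state and interacts over a small-world contact network, and a simple isolation policy can be toggled to compare “what-if” scenarios. All parameters and the contact graph are invented for illustration and are far simpler than models such as ACEMod.

```python
import numpy as np
import networkx as nx

def simulate(isolation=False, n=500, beta=0.05, gamma=0.1, days=120, seed=4):
    rng = np.random.default_rng(seed)
    g = nx.watts_strogatz_graph(n, k=8, p=0.1, seed=seed)  # contact network
    state = np.zeros(n, dtype=int)                # 0 = S, 1 = I, 2 = R
    state[rng.choice(n, size=5, replace=False)] = 1
    for _ in range(days):
        new_state = state.copy()
        for node in range(n):
            if state[node] != 1:
                continue
            for nbr in g.neighbors(node):
                if state[nbr] == 0:
                    # Isolation reduces effective transmission per contact.
                    p = beta * (0.3 if isolation else 1.0)
                    if rng.random() < p:
                        new_state[nbr] = 1
            if rng.random() < gamma:
                new_state[node] = 2               # recovery
        state = new_state
    return np.mean(state != 0)                    # fraction ever infected

print("attack rate, no intervention:", simulate(isolation=False))
print("attack rate, with isolation :", simulate(isolation=True))
```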

TABLE 22.1 Examples of Internet-based surveillance systems.

Type | Example | Data source | Establishment
Moderated system | ProMED | Media reports, official reports, online summaries, local observers | 1994
Partially moderated system | GPHIN | News sources | 1998
Fully automated system | MedISys | ProMED, GPHIN | 2004
Fully automated system | HealthMap | ProMED Mail, WHO, GeoSentinel, EuroSurveillance, Google News | 2006
Fully automated system | PULS | Text-based news sites and social media resources | 2007
Fully automated system | SENTINEL | Twitter data, news data, CDC materials | 2019

GPHIN, Global Public Health Intelligence Network; MedISys, Medical Information System; ProMED, Program for Monitoring Emerging Disease; PULS, Pattern-based Understanding and Learning System.

FIGURE 22.6 AI techniques in Internet-based global epidemic monitoring. (Global open-source Internet data are monitored with natural language processing; text processing covers multilingual translation, topic relevance, and deduplication of redundant items; assigning symptoms relies on keyword matching, word embeddings, semantic understanding, and semantic fusion; machine learning models then classify texts linked to infectious disease events and make predictions, so that abnormal signals and early signs of outbreaks can be detected and, compared with official monitoring data, further confirmed for early warning.)

The abovementioned studies were motivated by the needs of modeling infection and responses in large populations. In other settings, studying disease spread among occupants of confined spaces such as educational institutions was critically needed as well. Duan et al.53 developed an agent-based model to simulate how epidemics spread in structured space. This model was used to evaluate the public health response policies used during an H1N1 outbreak. This simulation model captured spatial layouts, population distribution, social networks, and contact patterns. A hierarchical social contact network consistent with actual social relationships and characteristics was developed. An agent-based approach was used to simulate the behavior and actions of the students and the spread of epidemics in the hierarchical network. Ge et al.54 also used an individual-based method to reconstruct an artificial university. High-resolution social interaction algorithms were designed and nonpharmaceutical interventions evaluated quantitatively. The framework of the virtual university was constructed with four components:
synthetic population, behavior schedule, social networks, and disease transmission model. To perform a more realistic simulation, Iwanaga et al.55 focused on how to obtain the average infectivity ratio. Based on real epidemic data from the Japan Coast Guard Academy, they proposed a discrete-time epidemic model and investigated how to estimate the infectivity rate from the real data. After obtaining the infectivity ratio the authors simulated a seasonal influenza epidemic using the SEPIR model with multiagent simulation fed with estimated spatiotemporal parameters.
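The parameter-estimation step can be illustrated with the short sketch below, which selects the transmission parameter of a discrete-time SIR model so that simulated incidence best matches an observed series. Here the “observed” data are synthetic, whereas a study such as Iwanaga et al. would fit institutional outbreak records and a richer SEPIR structure.

```python
import numpy as np

def simulate_incidence(beta, n=300, gamma=0.2, i0=3, days=40):
    # Deterministic discrete-time SIR; returns the daily new-case curve.
    s, i = n - i0, i0
    incidence = []
    for _ in range(days):
        new_cases = beta * s * i / n
        new_recoveries = gamma * i
        s -= new_cases
        i += new_cases - new_recoveries
        incidence.append(new_cases)
    return np.array(incidence)

rng = np.random.default_rng(5)
observed = simulate_incidence(0.45) + rng.normal(0, 0.5, 40)  # pretend data

# Grid search for the transmission parameter that best fits the observations.
betas = np.linspace(0.1, 0.9, 81)
errors = [np.sum((simulate_incidence(b) - observed) ** 2) for b in betas]
best = betas[int(np.argmin(errors))]
print("estimated transmission parameter:", round(best, 3))
```

The estimated parameter would then feed a multiagent simulation of the kind described above.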

22.5 Internet-based surveillance systems for global epidemic monitoring

To develop an effective global epidemic monitoring approach, researchers have long used Internet-based methods. Internet-based disease surveillance serves as a real-time complementary approach to traditional indicator-based public health disease surveillance methods.56 Typically, Internet-based surveillance systems use a variety of open-source Internet data, including online newswires, social media, and other Internet-based data streams, to detect early warning signals of threats to public health. There are three types of Internet-based surveillance systems: moderated, partially moderated, and fully automated (Table 22.1). The Program for Monitoring Emerging Diseases (ProMED)57,58 is a moderated system. The Global Public Health Intelligence Network (GPHIN)59,60 is a partially moderated system developed by the Canadian Government. On average, GPHIN processes 3000 news reports every day. Both ProMED and GPHIN can function in multiple languages. Fully automated systems include the European Commission’s Medical Information System (MedISys),61 the Pattern-based Understanding and Learning System (PULS),62 and HealthMap.63 SENTINEL8 is a newly developed software system built upon recent developments in machine learning and data processing for real-time syndromic surveillance based on social media data. This system can detect disease outbreaks to provide situational awareness. From a technical standpoint, in the application context of Internet-based global epidemic monitoring, AI techniques have played a major role in a sequence of data processing and analysis tasks. Fig. 22.6 shows a list of such tasks.

Text processing: Several important steps in text processing include translation, relevancy ranking, event extraction, and deduplication. GPHIN employs language-specific keywords and algorithms to extract relevant data from the Internet and news aggregator databases.64 PULS employs language-specific linguistic analysis, ontologies, and inference rules to extract relevant data.62 Relevancy ranking assesses the relevance of a report according to the user’s interest. Keyword recognition algorithms, Boolean combinations, and proximity searches are the most commonly used methods in event extraction. Information extraction technology is the basis of PULS.

Assigning symptoms: The purpose of assigning symptoms is to determine which tweets show symptoms of illness. SENTINEL8 used a keyword matching technique and enriched synonym lists by using word embeddings trained on Twitter data. The developers generated a list of the 10 closest words in the embedding space by cosine similarity, using GloVe65 and FastText66 techniques to generate the word embeddings (a small sketch of this embedding-based expansion follows the machine learning item below).


Machine learning: Machine learning classifiers are used to identify those tweets and news articles that are genuinely health related.67,68 CNN and LSTM models are applied in SENTINEL (see Section 22.2.2).
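The embedding-based synonym expansion used in the symptom-assignment step above can be sketched as follows, here with gensim’s Word2Vec trained on a toy corpus standing in for GloVe or FastText vectors trained on a large Twitter collection; the seed keyword, corpus, and library choice are assumptions for illustration only.

```python
from gensim.models import Word2Vec

# Tiny toy corpus of tweet-like sentences, repeated so the model has
# enough co-occurrence statistics to fit at all.
corpus = [
    "i have a fever and chills tonight".split(),
    "running a high temperature and chills".split(),
    "bad cough and sore throat today".split(),
    "fever cough and aches keeping me home".split(),
] * 50

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1,
                 epochs=50, seed=0)

# Expand a seed symptom keyword with its nearest neighbours by cosine similarity.
seed_keyword = "fever"
for word, similarity in model.wv.most_similar(seed_keyword, topn=5):
    print(f"{word:12s} cosine similarity = {similarity:.2f}")
```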

22.6 Conclusion

The core tasks of public health surveillance and response include infectious disease outbreak detection and early warning, trend prediction, and public health response modeling and assessment. Recent years have seen wide adoption of AI techniques in accomplishing these tasks in highly dynamic, complex, and data-rich environments. In this chapter, we reviewed a collection of recent studies focusing on how to make use of sensor and social data from the physical world and cyberspace to improve outbreak detection and early warning capabilities. We also discussed a set of methods aimed at modeling infectious disease transmission and predicting various time series of epidemiological data. Several simulation frameworks were reviewed with the objective of modeling and assessing public health responses. A common theme cutting across these areas is that AI plays an essential role in this new generation of public health methods. Within AI technology itself, machine learning methods, including deep learning, and agent-based modeling are among the most relevant from a methodological standpoint. Given the growing interest and research activities in the interdisciplinary area of AI and public health surveillance, and despite the impressive set of accomplishments already achieved, one can almost surely conclude that this area is still in its early stage of development, with much of its potential yet to be fulfilled. To conclude this chapter, we briefly present several key challenges to be mindful of while we as a community continue harnessing the power of rapidly developing AI technologies in the context of public health. At the present time, in the middle of the COVID-19 pandemic, there are many ongoing discussions about developing public health big data for surveillance and response purposes. It is important to realize the limitations and potentially significant biases associated with public health big data. In particular, privacy protection, algorithmic discrimination, and model interpretability must receive serious consideration to conform to social ethics and norms. As in the case of many other AI applications in the medical domain, AI-enabled and -enhanced public health surveillance and response hold real potential, with significant challenges remaining.

References

1. Ding FY, Fu JY, Jiang D, Hao MM, Lin G. Mapping the spatial distribution of Aedes aegypti and Aedes albopictus. Acta Trop 2018;178:155 62. Available from: https://doi.org/10.1016/j.actatropica.2017.11.020. 2. Ong J, Liu X, Rajarethinam J, Kok SY, Liang S, Tang CS, et al. Mapping dengue risk in Singapore using Random Forest. PLoS Negl Trop Dis 2018;12(6). Available from: https://doi.org/10.1371/journal.pntd.0006587. e0006587. 3. Rubio-Solis A, Musah A, Dos Santos PW, Massoni T, Birjovanu G, Kostkova P. ZIKA virus: prediction of Aedes mosquito larvae occurrence in Recife (Brazil) using online extreme learning machine and neural networks. In: Paper presented at the Proceedings of the ninth international conference on digital public health; 2019.


4. Akhtar M, Kraemer MU, Gardner LM. A dynamic neural network model for predicting risk of Zika in real-time. bioRxiv, 466581. 2019. 5. Anno S, Hara T, Kai H, Lee MA, Chang Y, Oyoshi K, et al. Spatiotemporal dengue fever hotspots associated with climatic factors in Taiwan including outbreak predictions based on machine-learning. Geospat Health 2019;14(2). Available from: https://doi.org/10.4081/gh.2019.771. 6. Jain VK, Kumar S. Effective surveillance and predictive mapping of mosquito-borne diseases using social media. J Computat Sci 2018;25:406 15. Available from: https://doi.org/10.1016/j.jocs.2017.07.003. 7. Thapen N, Simmie D, Hankin C, Gillard J. DEFENDER: detecting and forecasting epidemics using novel data-analytics for enhanced response. PLoS One 2016;11(5). Available from: https://doi.org/10.1371/journal. pone.0155417. ARTN e0155417. 8. Șerban O, Thapen N, Maginnis B, Hankin C, Foot V. Real-time processing of social media with SENTINEL: a syndromic surveillance system incorporating deep learning for health classification. Inf Process Manage 2019;56(3):1166 84. Available from: https://doi.org/10.1016/j.ipm.2018.04.011. 9. Chen F., Neill D.B. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. In: Proceedings of the 20th ACM SIGKDD conference on knowledge discovery and data mining, 2014. p. 1166 1175. 10. Dai X, Bikdash M. Hybrid classification for tweets related to infection with influenza. In: Paper presented at the SoutheastCon 2015; 2015. 11. Dai X, Bikdash M, Meyer, B. From social media to public health surveillance: Word embedding based clustering method for Twitter classification. In: Paper presented at the SoutheastCon 2017; 2017. 12. Wang C-K, Singh O, Tang Z-L, Dai H-J. Using a recurrent neural network model for classification of tweets conveyed influenza-related information. In: Paper presented at the proceedings of the international workshop on digital disease detection using social media 2017 (DDDSM-2017); 2017. 13. Lampos V, Zou B, Cox IJ. Enhancing feature selection using word embeddings: the case of flu surveillance. In: Paper presented at the proceedings of the 26th international conference on World Wide Web; 2017. 14. Edo-Osagie O, Lake I, Edeghere O, De La Iglesia B. Attention-based recurrent neural networks (RNNs) for short text classification: an application in public health monitoring. In: Paper presented at the international workconference on artificial neural networks; 2019. 15. Souza RCSNP, Assuncao RM, Oliveira DM, Neill DB, Meira Jr W. Where did I get dengue? Detecting spatial clusters of infection risk with social network data.. Spat. Spatiotemporal Epidemiol. 2019;29:163 75. 16. Souza RCSNP, Assuncao RM, Neill DB, Meira W, Jr.. Detecting spatial clusters of disease infection risk using sparsely sampled social media mobility patterns. In: Proc. 27th ACM SIGSPATIAL Intl. Conf. on advances in geographic information systems, 2019b, p. 359 368. 17. Shah Z, Dunn AG. Event detection on Twitter by mapping unexpected changes in streaming data into a spatiotemporal lattice. In: IEEE transactions on big data; 2019. 18. Magumba MA, Nabende P, Mwebaze E. Design choices for automated disease surveillance in the social web. Online J Public Health Inf 2018;10(2): e214. Available from: https://doi.org/10.5210/ojphi.v10i2.9312. 19. Wagner MM, Tsui F-C, Espino J, et al. National Retail Data Monitor for public health surveillance. Morb Mortal Wkly Rep 2004;53(Supp):40 2. 20. 
Polgreen PM, Chen Y, Pennock DM, Nelson FD, Weinstein RA. Using internet searches for influenza surveillance. Clin. Infect. Dis. 2008;47(11):1443 8. 21. Ginsberg J, Mohebbi M, Patel R, et al. Detecting influenza epidemics using search engine query data. Nature 2009;457:1012 14. 22. Kulldorff M. A spatial scan statistic. Commun Stat-Theor M 1997;26(6):1481 96. 23. Kulldorff M. (Prospective time-periodic geographical disease surveillance using a scan statistic. J R Stat Soc A 2001;164:61 72. 24. Neill DB. Fast subset scan for spatial pattern detection. J R Stat Soc B 2012;74(2):337 60. 25. Duczmal L, Assuncao R. A simulated annealing strategy for the detection of arbitrary shaped spatial clusters. Comput. Statist. Data Anal 2004;45:269 86. 26. Duczmal L, Cancado A, Takahashi R, Bessegato L. A genetic algorithm for irregularly shaped scan statistics. Comput. Statist. Data Anal. 2007;52:43 52. 27. Fitzpatrick D, Ni Y, Neill DB. Support vector subset scan for spatial outbreak detection. Online J. Public Health Inform 2017;9(1):e021.


28. Faigen Z, Deyneka L, Ising A, et al. Cross-disciplinary consultancy to bridge public health technical needs and analytic developers: asyndromic surveillance use case. Online J. Public Health Inform 2015;7(3):e228. 29. Nobles M, Lall R, Mathes R, Neill DB. Multidimensional semantic scan for pre-syndromic disease surveillance. Online J. Public Health Inform 2019;11(1):e255. 30. Lall R, Levin-Rector A, Mathes R, Weiss D. Detecting unanticipated increases in emergency department chief complaint keywords. Online J Public Health Inform 2014;6(1):e93. 31. Walsh A, Hamby St T, John TL. Identifying clusters of rare and novel words in emergency department chief complaints. Online J Public Health Inform 2014;6(1):e146. 32. Maurya A, Murray K, Liu Y, Dyer C, Cohen WW, Neill DB. Semantic scan: detecting subtle, spatially localized events in text streams. arXiv preprint arXiv 2016. 1602.04393. 33. Blei D, Ng A, Jordan M. Latent dirichlet allocation. J Mach Learn 2003;3:993 1022. 34. Nobles, M. Multidimensional semantic scan for pre-syndromic surveillance. Ph.D. thesis, H.J. Heinz III College, Carnegie Mellon University, 2019. 35. Wu YX, Yang YM, Nishiura H, Saitoh M. Deep learning for epidemiological predictions. In: ACM/Sigir Proceedings 2018; 2018. pp. 1085 1088. Available from: https://doi.org/10.1145/3209978.3210077. 36. Adhikari B, Xu X, Ramakrishnan N, Prakash BA. EpiDeep. In: Paper presented at the proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining - KDD ’19; 2019. 37. Wang LJ, Chen JZ, Marathe M. DEFSI: Deep Learning Based Epidemic Forecasting with Synthetic Information. In: Thirty-third AAAI conference on artificial intelligence/thirty-first innovative applications of artificial intelligence conference/ninth AAAI symposium on educational advances in artificial intelligence; 2019. p. 9607 12. 38. Li Z, Luo X, Wang B, Bertozzi AL, Xin J. A study on graph-structured recurrent neural networks and sparsification with application to epidemic forecasting. In: Paper presented at the world congress on global optimization; 2019. 39. Chae S, Kwon S, Lee D. Predicting infectious disease using deep learning and big data. Int J Env Res Public Health 2018;15(8). Available from: https://doi.org/10.3390/ijerph15081596. 40. Zhu X, Fu B, Yang Y, Ma Y, Hao J, Chen S, et al. Attention-based recurrent neural network for influenza epidemic prediction. BMC Bioinforma 2019;20(18):1 10. 41. Su K, Xu L, Li G, Ruan X, Li X, Deng P, et al. Forecasting influenza activity using self-adaptive AI model and multi-source data in Chongqing, China. EBioMedicine 2019;47:284 92. Available from: https://doi.org/ 10.1016/j.ebiom.2019.08.024. 42. Soliman M, Lyubchich V, Gel YR. Complementing the power of deep learning with statistical model fusion: probabilistic forecasting of influenza in Dallas County, Texas, USA. Epidemics 2019;28:100345. Available from: https://doi.org/10.1016/j.epidem.2019.05.004. 43. Viboud C, Vespignani A. The future of influenza forecasts. Proc Natl Acad Sci USA 2019;116(8):2802 4. Available from: https://doi.org/10.1073/pnas.1822167116. 44. Mei S, Zarrabi N, Lees M, Sloot PM. Complex agent networks: an emerging approach for modeling complex systems. Appl Soft Comput 2015;37:311 21. 45. Nian F, Wang X. Efficient immunization strategies on complex networks. J Theor Biol 2010;264(1):77 83. 46. Ren G, Wang X. Epidemic spreading in time-varying community networks. Chaos 2014;24(2):023116. 47. Tripathi R, Reza A, Garg D. 
Prediction of the disease controllability in a complex network using machine learning algorithms. arXiv preprint arXiv:1902.10224. 2019. 48. Scarpino SV, Petri G. On the predictability of infectious disease outbreaks. Nat Commun 2019;10(1):898. 49. Rocha LE, Masuda N. Individual-based approach to epidemic processes on arbitrary dynamic contact networks. Sci Rep 2016;6:31456. 50. Großmann G, Bortolussi L, Wolf V. Rejection-based simulation of non-markovian agents on complex networks. In: Paper presented at the international conference on complex networks and their applications; 2019. 51. Kuga K, Tanimoto J. Impact of imperfect vaccination and defense against contagion on vaccination behavior in complex networks. J Stat Mech: Theory Exp 2018;2018(11):113402. 52. Cliff OM, Harding N, Piraveenan M, Erten EY, Gambhir M, Prokopenko M. Investigating spatiotemporal dynamics and synchrony of influenza epidemics in Australia: an agent-based modelling approach. Simul Model Pract Theory 2018;87:412 31. 53. Duan W, Cao Z, Wang Y, Zhu B, Zeng D, Wang F-Y, et al. An ACP approach to public health emergency management: using a campus outbreak of H1N1 influenza as a case study. IEEE Trans Syst, Man, Cybernetics: Syst. 2013;43(5):1028 41.


54. Ge Y, Chen B, Qiu X, Song H, Wang Y. A synthetic computational environment: To control the spread of respiratory infections in a virtual university. Phys A: Stat Mech Appl 2018;492:93 104. 55. Iwanaga S, Yoshida H, Kinjo S. Feasibility study on multi-agent simulations of a seasonal influenza epidemic in a closed space. In: Paper presented at the symposium on intelligent and evolutionary systems; 2019. 56. Bush GW. Homeland security presidential directive/Hspd-12. Office of the Press Secretary, White House; 2004. 57. Carrion M, Madoff LC. ProMED-mail: 22 years of digital surveillance of emerging infectious diseases. Int Health 2017;9(3):177 83. 58. Yu VL, Madoff LC. ProMED-mail: an early warning system for emerging diseases. Clin Infect Dis 2004;39 (2):227 32. 59. M’ikanatha NM, Lynfield R, Van Beneden CA, De Valk H. Infectious disease surveillance. Wiley Online Library; 2013. 60. Mykhalovskiy E, Weir L. The global public health intelligence network and early warning outbreak detection. Can J Public Health 2006;97(1):42 4. 61. Rortais A, Belyaeva J, Gemo M, Van der Goot E, Linge JP. MedISys: An early-warning system for the detection of (re-) emerging food-and feed-borne hazards. Food Res Int 2010;43(5):1553 6. 62. Hartley DM, Nelson NP, Arthur R, Barboza P, Collier N, Lightfoot N, et al. An overview of internet biosurveillance. Clin Microbiol Infect 2013;19(11):1006 13. 63. Freifeld CC, Mandl KD, Reis BY, Brownstein JS. HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports. J Am Med Inform Assoc 2008;15(2):150 7. 64. Mawudeku A, Blench M, Boily L, John RS, Andraghetti R, Ruben M. 31 The Global Public Health Intelligence Network. Infect Dis Surveill 2007;457. 65. Pennington J, Socher R, Manning C. Glove: global vectors for word representation. In: Paper presented at the proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP); 2014. 66. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Trans Assoc Comput Linguist 2017;5:135 46. 67. Hawkins JB, Tuli G, Kluberg S, Harris J, Brownstein JS, Nsoesie E. A digital platform for local foodborne illness and outbreak surveillance. Online J Public Health Inform 2016;8(1). 68. Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multicorpus training. J Biomed Inform 2015;53:196 207.


23 Regulatory, social, ethical, and legal issues of artificial intelligence in medicine

Emily Shearer, Mildred Cho and David Magnus

Abstract

Advances in artificial intelligence (AI) bring with them new regulatory, social, and ethical challenges. At the level of data collection, an increase in the ability to identify previously deidentified health data and questions over reporting of incidental findings call for a new model of data stewardship. In designing new AI tools, recognition of biases and competing interests in algorithm design, in the data on which algorithms learn, and in society itself will all be necessary to avoid unjust or suboptimal outcomes in application. And at the level of regulation, there is a need for a system that can verify the quality of algorithm design to improve trust in these tools. Moving forward, a new regulatory model that focuses on regulating the process, rather than the product, of AI design, as well as a reconceptualization of the traditional physician-patient dyad to incorporate AI, is needed to help overcome these challenges.

Keywords: Artificial intelligence; data; algorithms; bias; ethics; values; disparities; trust; regulation; data stewardship; physician-patient relationship

23.1 Introduction

Artificial intelligence (AI) holds much potential to improve our current diagnostic, prognostic, and predictive capabilities in medicine. While the extent of these possibilities is becoming clearer with new advances, so too are gaps in our current ethical, legal, and regulatory frameworks, which did not evolve to handle the capabilities and challenges AI brings. The first step in addressing these gaps will be to solidify our understanding of the fundamental questions AI raises. Key questions raised by any advance in medicine include “Who is primarily benefitting from these advances?”, “Are the benefits and burdens distributed equally (are the advances just)?”, “Are patients and their data being protected adequately with use of these
new tools (privacy and confidentiality)?”, and “How does this advance fundamentally alter clinicians’ or health systems’ obligations to patients, if at all?” With regard to AI specifically, issues arise in multiple domains. At the level of data collection and results disclosure, questions over how consent is obtained (if at all), how data sharing with third parties is to be managed, and how or if incidental findings need to be shared with patients all remain unsettled. The question of results disclosure is particularly interesting in the case of nonclinically collected data, such as that from wearable devices, as those collecting the data in these circumstances may not be clinicians at all. These questions together call for a new model of data stewardship that we return to in depth. At the level of the design of AI tools, ethical issues arise from biases in the data on which algorithms learn, from the values represented in algorithm design, from bias in the society in which the data exist (e.g., the real presence of a health-wealth gradient or racial disparities, which AI tools may reify if allowed to learn on such data), and at the level of application, as clinicians or health systems are prone to apply AI tools differentially across different populations. At each of these levels, safeguards will need to be put in place to protect those vulnerable to exclusion or misapplication. The rise of AI tools also poses significant challenges for our current regulatory frameworks, as AI technology raises unprecedented questions. How will we adapt our current approval standards to a technology that, by definition, is constantly learning and updating? Given its “black-box” nature, how can we ensure the process of algorithm design is standard and fair? How do our current data protection laws apply to the data on which AI tools learn? And how can we ensure physicians are employing AI correctly given its opacities? In what follows, we will address each of these questions as they pertain to AI at multiple levels: data collection and results disclosure, application of AI tools, regulation and oversight, and, finally, the ethos of medicine itself. We conclude with what we consider to be essential next steps in overcoming these challenges, which we believe center around a need to reimagine responsibility for patient protection and care in the age of AI.

23.2 Ethical issues in data acquisition

In order to understand the possible ethical, social, and legal implications of the use of AI in healthcare, it is first necessary to understand where the data used in algorithm design come from. There are three main types of use of AI in medicine, each of which draws data from three main sources. The three main ways AI is being used in healthcare are:

1. predictive uses [e.g., using machine learning (ML) on past population data to predict a prognosis for an individual patient],
2. diagnostic uses (e.g., using ML on past population data to assist in arriving at the correct diagnosis for an individual patient), and
3. uses in clinical decision-making aids (e.g., using ML on past population data to determine what clinical steps should be taken for an individual with a given condition, given certain characteristics or severity of the disease).


For each of these uses, algorithms used in the ML process rely on three types of data:

1. data from research repositories (e.g., data from a biobank),
2. clinical or public health data made publicly available [e.g., secondary uses of electronic health record (EHR) data, Medicare and Medicaid data, insurance claims, uses of blood spots collected for newborn screening, census data], and
3. nonclinically collected data (e.g., data collected from wearables or mobile devices, data collected from Internet searches, or social media use).

Though we rely on all three types of data, it is important to note that as the use of wearable health technology and mobile/tablet data collection rises, data are increasingly coming from the third, nonclinical, category. As we discuss next, this poses substantial challenges to our current legal, ethical, and regulatory systems, as they evolved to handle issues derived mainly from data collected either for research or from secondary uses of data collected for clinical or public health purposes.

23.2.1 Ethical issues arising from each type of data source

23.2.1.1 Ethical issues common to all data sources: Privacy and confidentiality

Ethical issues arise in all three categories of data source, some of which are unique to each type, some of which are common to all three. Ethical issues common to all three include issues about data privacy and confidentiality that have dominated legal and ethical discussions surrounding data collection for some time. The primary shared issue is data security (e.g., how do we know data are being secured in a way that protects our privacy, including confidentiality?). This is often treated as a technical issue as much as it is an ethical issue, but it is critically important for maintaining trust. Most issues related to data security, privacy, and confidentiality are not unique to AI. However, some issues take on distinctive features when considering AI. For example, issues of data privacy arise over whether or not outputs from AI tools count as protected health information (PHI) and, relatedly, whether they are offered the same level of protection as PHI. In a similar vein, advances in AI capabilities make the problem of reidentification of previously unidentifiable information more acute, a problem we will return to next.

23.2.1.2 Ethical issues unique to each data source: Issues of consent

In addition to issues of data privacy and confidentiality, issues of consent and data sharing arise in all three types of data source; however, unlike issues of privacy and confidentiality, these issues vary in important ways by type of data source due to their differing nature of researcher-participant interaction.

23.2.1.2.1 Issues of consent with data from research repositories

Data that have been collected for a research repository such as a biobank have generally been consented at the time of collection, since collection is a primary research activity and involves interaction with the person donating the material. Therefore creation of a research repository is generally regarded as human subject research and subject to federal regulations. The first major challenge for informed consent is whether each use of data requires
specific consent or whether an open-ended or broad consent is adequate. There are also approaches that use a tiered consent process where individuals consent to broad categories of uses of their data. While specific discussion of the benefits, risks, and purposes of each use of someone’s data seems to be required for a true informed consent, it would likely render most research repositories unusable for most purposes. As a result, the recently revised common rule (45 CFR 46) allows for broad consent in place of a true informed consent.1 However, there are a number of requirements for this to be utilized. Patients still need to be told of any risks (e.g., risks associated with sample collection or risks to privacy), as well as any anticipated benefits, details about how confidentiality of information is to be maintained, that participation is voluntary, and whether there are plans to do genetic analysis. In addition, participants should be told whether their data may be used to generate profit, and whether there will be any sharing of that profit. Some effort should also be made to convey at least a sense of the types of research that will take place with their specimens, what types of data will be used, and how long the data will be stored and available for research uses, and they should typically be told that they will not be informed of all possible future uses. Finally, any plans for return of results should be disclosed. This issue will be discussed in more detail next.

23.2.1.2.2 Return of results from research repositories

Research uses of the data repositories, whether with identifiable or linked but deidentified samples, have the potential to yield information about the individuals in the repository. This raises the question of whether and when information that is discovered in a research context should be returned to the participants. Some research repositories are established with rules that prohibit the return of any information to research participants. These approaches draw a sharp boundary between the research activities of the repository and any clinical obligations that might seem to arise. There are several challenges to this view. First, research participants often want information provided to them, and this can be important for recruitment. Second, there are ancillary duties that researchers owe to their research participants that may require action under at least some circumstances. If an ML model reveals that a patient’s cancer diagnosis was incorrectly made, and the patient is currently receiving the wrong chemotherapy, it would be difficult to ignore the potential duty to warn the patient. Some have argued that if participants want any information that is discovered about themselves in the course of research on their samples, then it should be returned. Defenders of this view often appeal to respect for the patient’s autonomy, and to the fact that individuals may differ in what information they find important or relevant to them. Critics point out the challenges in returning all information. Genome sequencing, for example, is likely to produce variants of unknown significance, and providing poorly defined, unvalidated information may increase the chances of poor decision-making and could lead to harm to research participants. Intermediate positions between these two extremes may be more plausible, but there are a range of views about precisely where to draw the line. Some have claimed there is a duty to offer to return results that are “valid, medically important, and actionable if discovered purposefully or by chance during the course of data analysis.”2 At the same time, there may be findings that would be inappropriate to disclose due to uncertainty, lack of
validity, lack of interpretation, and risk that participants may be harmed due to lack of understanding of the findings.3,4

23.2.1.2.3 Issues of consent with clinical or public health data

In contrast to data collected by research repositories, secondary uses of clinical or public health data are often done without consent. Thus ethical issues arise over whether it is required, or even desirable, to obtain at least some form of consent for such uses, and if so, how it should be structured. Unlike research repositories, in secondary uses of clinical or public health data, the data are collected for a nonresearch primary use. For example, when patients go to the hospital for clinical care, the data in their EHR are primarily there for clinical and accounting purposes. But once the data exist, it becomes possible to use those data to address research questions. This could include images, progress notes, lab findings, or treatments. Researchers are interested in exploring the use of AI to learn from these data for a wide range of purposes, including predictive analytics, diagnostics, and development of decision aids. Similarly, blood spots collected for newborn genetic screening can be used for research on those samples. It may be impracticable if not impossible to obtain consent that meets regulatory requirements for informed consent for human subject research from individuals for this type of research. States vary in whether they require consent for research uses of blood spots (California, e.g., does not require consent). How could this be justified, from a regulatory or ethical perspective? From a regulatory perspective, there are three possible pathways that avoid the burdensome requirement of obtaining consent while still meeting the regulatory requirements for research repositories; the first two are the most commonly used in establishing research repositories. First, if the “research” activity is regarded as “quality assurance and improvement,” that activity is not regulated under the common rule. Unfortunately, there is limited guidance about how to determine whether something is “quality improvement” rather than research.5 One rule of thumb might be to consider whether particular findings are likely to be generalizable to other institutions or whether they are only relevant to a particular hospital or health system. It is worth noting that whether findings are published or not does not determine whether the activity is quality improvement or research. The second, even more common basis for avoiding the informed consent requirement is the secondary use of deidentified data where there is agreement that investigators will not have access to the identity of the participants. For deidentified, secondary uses of clinical data, the standards for deidentification will need to meet the rigorous requirements of the Health Insurance Portability and Accountability Act (HIPAA). There are two possible standards: the expert determination standard and the safe harbor standard. The vast majority of researchers utilize the safe harbor standard. This requires that before turning data over to researchers, they must ensure that 18 specific identifiers have been stripped. In addition, they must require that the institution “does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is the subject of the information.”6 Though secondary uses of deidentified information may have been tenable in the past on this basis, new capabilities of AI to reidentify previously deidentified information are increasingly making this exemption harder to defend. The National Institutes of Health
(NIH) has taken the position that genetic information is inherently identifiable, and generally, researchers continue to find new ways to identify individuals in databases that were previously regarded as deidentified. This year, several reports of AI technologies being able to reidentify patients based on deidentified clinically collected data (e.g., from brain MRIs) or repository data (e.g., from large, national physical activity repositories) have been published. This is a significant challenge for this research going forward. Third, it is possible to apply for a specific waiver of the need for informed consent. This third pathway may increase in use in the near future due to the challenges described earlier and is thus worth outlining here. There are four requirements researchers must demonstrate in order to have a waiver of informed consent approved. These four requirements are as follows:

1. The research involves no more than minimal risk to subjects.
2. The research could not be carried out practicably without the waiver or alteration.
3. The waiver or alteration will not adversely affect the rights and welfare of the subjects.
4. Where appropriate, the subjects will be provided with additional information about their participation (45 CFR 46.116).1

Ethically, there is controversy about at least some uses of unconsented data. There is concern that such research uses will undermine trust and demonstrate a lack of respect for the patients whose data are being used. Controversy over the uses of unconsented blood spots led to Minnesota and Texas destroying previously collected blood spots, despite their potential research value. Empirical studies of public and parental attitudes and other sources of information have pushed most bioethicists to at least require broad opt-out consent to allow secondary uses of data.7 While surveys show patients want to provide consent to research, they also show that they want research to take place and for health systems to learn how to take better care of them. What has been less studied is how people view the trade-off between the two.8 There are several challenges to consent in this setting. Some institutions are adding information to their Terms and Conditions documents stating that their data will be used for research. What should be included? These documents are not research consent forms, and they are likely not read by most patients (and definitely not by patients who present emergently). Even if they serve as broad consent forms, they do not contain a great deal of information about future uses. On the other hand, requiring consent will likely lead to fewer participants, less data, and more biased data. There is a pronounced tendency for the data to become less representative of the population and to leave out groups that have been historically underrepresented in research. Unconsented blood spots, for example, may be the most diverse source of data in terms of participants. And if the primary justification is respect for participants, there are likely better ways to demonstrate that respect than a few vague sentences in a Terms and Conditions document (or even a research consent form) that is barely read (or not read at all). This reality points to the need for a new model of data stewardship to address these challenges moving forward, a point we will return to next.


23.2.1.2.4 Incidental or secondary findings in clinical or public health data

Some of the same issues that arose around the return of results are also raised for secondary research uses, even with deidentified data (as long as there are coded links to participant identities). However, the context is quite different. In some of these cases the patient is explicitly seeking answers to clinical problems, so there is arguably a stronger claim of a duty to return results in this context. For example, the American College of Medical Genetics has claimed that there is a duty to offer to return results for 59 genes when they arise as incidental findings in clinical genetic testing.9 But it explicitly rejects applying this standard to findings from research repositories.

23.2.1.2.5 Issues of consent with nonclinically collected data

Data collected in nonclinical settings are, like clinical and public health data, typically unconsented, and so they share some of the same ethical challenges that arise with any unconsented data. However, the nonclinical nature of their collection sets them apart from traditional forms of health data collection, and thus they carry unique ethical challenges.10 The ethical questions posed by nonclinically collected data are: (1) Has there been consent for data collection at all? (2) Who is responsible for regulating third-party use and data sharing? and (3) Who is responsible for relaying any prognostic or diagnostic information gleaned from the data to patients? The third point is worth emphasizing because of the lack of precedent for this question in our current ethical, legal, and regulatory frameworks. When these data are collected by commercial entities outside the health system, they will usually not be covered under either HIPAA or the common rule. Data collected in nonclinical settings can yield a wealth of prognostic or diagnostic information that may or may not be clinically accurate, but, because it was gathered outside the clinical setting, it is not in the hands of clinicians. In our traditional ethical and regulatory frameworks, the responsibility lies with the clinician to relay prognostic or diagnostic information to patients when it is uncovered, even incidentally. In the case of nonclinically collected data, what ethical responsibility does the collecting body have to relay this information to patients? Should this responsibility vary by likelihood of accuracy or by potential severity of disease? Should this be regulated, and if so, who should the governing body be? These are all questions that will need to be addressed moving forward.

23.2.2 Future directions: Toward a new model of data stewardship

In this section, we have described ways in which ethical issues arise in all three categories of data source, noting that while issues of privacy and confidentiality are common to all three, issues of consent and disclosure are unique to each type because of their differing modes of data collection. The issues of consent and disclosure raised in this section will undoubtedly need to be addressed as we continue to collect data by all three means. This will require continued conversations among all relevant stakeholders to reach consensus on which situations should require consent, what the acceptable scope
of consent (broad or narrow) should be, and what form rules or norms regarding disclosure of information gleaned from nonclinical settings should take. In addition, as we alluded to previously, though issues of data privacy and confidentiality have been mainstays of ethical and legal conversations surrounding health data collection over the past decades, advances in AI are reshaping these conversations in important ways. Specifically, advances that allow third parties to reidentify previously deidentified information are quickly becoming a reality. Big data in the form of genomics is also identifiable. Therefore secondary uses of data, whether from clinical or public health sources or from big data repositories, are unlikely to be viewed in the same way going forward. There are regulatory options; in particular, if obtaining consent would make the research impracticable and the research poses minimal risk, researchers should consider seeking a waiver or alteration of consent rather than attempting to claim that no human subject research is taking place. Taken together, these developments highlight the need for a new model of data stewardship moving forward. This new model will need to:

1. recognize the duty to protect patients as well as the data entrusted to their providers;
2. recognize the duty to make good use of the data to improve healthcare for all patients (including developing new treatments); and
3. create new oversight approaches for relationships with third parties (whether partners providing access to data or vendors developing algorithms for use).

All three of these criteria will need to be met in order to address the challenges posed by advances in AI.

23.3 Application problems: Problems with learning from the data

Now that we have a sense of where the data used in health-care AI come from and of the ethical issues associated with data collection, we can turn our attention to the ethical, social, and legal issues that arise from data use. These issues can manifest wherever values and bias are able to enter algorithm design or application, and they at least partly reflect the competing interests of the different stakeholders involved in these processes.

23.3.1 Values embedded in algorithm design

Problems can arise in the AI learning process in the making of the algorithms themselves, as whose or which values are prioritized has implications for their design. A powerful example of how values may infiltrate the algorithm design process in ways that may not reflect patients' preferences or best interests is the case of quality metrics, such as the widely used Vizient metrics or the Centers for Medicare and Medicaid Services (CMS) quality metrics. Vizient, a nonprofit organization that produces tools allowing hospitals to track and report quality in five domains (access to care, capacity and throughput, quality and efficiency, continuum of care, and equity), provides advisory and analytics services to major health systems with the promise of improving patient care while lowering costs.11 Because they are highly visible to the public and often affect reimbursement rates, metrics such as the Vizient or CMS metrics often matter a great deal to
institutions, many of which will undergo herculean efforts to achieve top rankings. However, because the use of these metrics is undertaken and paid for by health systems, not patients, many have criticized their use for simply incentivizing the tackling of "low-hanging fruit," such as better documentation of patient comorbidities, to increase hospital quality rankings, rather than more costly investments in improvements to patient care itself. As a result, AI tools that institutions use to improve their quality outcomes may not reflect the interests or values of patients and physicians. Similar conflicts of interest come to a head in the use of AI tools developed to aid health systems in billing practices. For example, there are already several AI tools on the market that exist for the purpose of predicting Medicare reimbursement rates ahead of actual reimbursement, with the objective of allowing hospitals to improve metrics ahead of reimbursement if their financial goals are not predicted to be met. Though such practices could lead to gains for patients if hospitals invested large amounts of time and money in improvements to actual patient care, experience to date suggests that, as in the case of quality metrics above, hospitals will be more incentivized to go for the "low-hanging fruit" of changes in diagnostic coding or other superficial practices instead. The incentives health systems face from reimbursement and quality metrics may therefore come at the cost of what matters most to patients and physicians.

23.3.2 Biases in the data themselves

The second way the AI learning process can be unduly affected is by biases in the data themselves, that is, existing over- or underrepresentation of certain populations and subpopulations in the data from which AI learns. This in turn can manifest in several ways. First, biases in existing data may serve to reinforce those biases, as ML tools trained on the biased data will learn to recapitulate them. Problematically, this has the potential to create self-fulfilling prophecies in which groups already disadvantaged by the health-care system continue to receive fewer resources and less attention. We know, for example, that women are less likely than men to be accurately diagnosed with a myocardial infarction. If ML tools learn from datasets in which women are less likely to be classified with this disease, the resulting algorithms may be less likely to recommend treatment for them regardless of need. Studies also have shown that institutions can vary widely on a number of important health-care resource allocation decisions. Hospitals with pediatric transplant capabilities, for example, have been shown to differ in whether and to what extent they take into account neurodevelopmental delay and certain genetic findings when making allocation decisions.12 Similar results have been found regarding extracorporeal membrane oxygenation decisions. If, as these findings suggest, physicians stop treating patients with certain genetic or clinical conditions, ML systems may interpret this to mean that these conditions are always fatal, creating self-fulfilling prophecies in which they recommend withdrawal of care for those patients.13 Like the issue of competing values in algorithm design, the self-fulfilling prophecy problem is thus largely a problem for predictive, not analytic, AI. Next, incomplete or missing data are an important source of bias in the data from which AI tools learn. If certain groups are systematically underrepresented in the datasets from which ML tools learn, AI tools will not be able to make accurate predictions for these groups in ways that will benefit them.14 This is especially problematic in repository sources of data,
which are well known to underrepresent minority groups, but it is also relevant to clinically collected data. The vast majority of genome-wide association studies (GWAS), for example, considered the gold-standard way to study genetic linkages, have been conducted on people of European descent; people of Hispanic or Latin American descent make up less than 1%, and people of African descent about 3%, of GWAS participants.15 Importantly, we already know from our experience with observational data how underrepresentation of certain groups can harm their care. Attempts to apply the Framingham study (conducted primarily on white men) to nonwhite populations, for example, have been shown to over- and underestimate cardiac risks for different groups.16 The issue of missing data is problematic for both predictive and analytic AI, as both rely on complete and representative data for accuracy. Finally, biases in AI application can come about due to problems with EHR data. It is now well known that EHR data are incomplete and inaccurate in biased ways, making them prone to the same issues of missing data, misclassification, and measurement error as observational studies.17 These shortcomings will lead to the same issues discussed earlier, with any biases present in EHR data being perpetuated by any ML system that learns from those data. We know, for example, that people of low socioeconomic status, immigrants, and those with complex psychosocial needs are more likely to visit multiple institutions rather than a single institution to receive their care.18–20 As the authors of these studies point out, if the health records of these individuals cannot be linked across systems, AI support tools that rely on the number of visits to trigger a recommended treatment may not identify them as individuals who would benefit from a particular intervention or therapy. In addition to these concerns, EHRs introduce a second potential source of conflict, since they were never developed with patient care at their center in the first place. Indeed, the major EHRs in use today were not built around diagnostic or therapeutic considerations, but rather around billing practices, in order to ease the administrative burdens of health systems. The span of metrics available for use in algorithms, therefore, will be biased toward those things that are billable in the hospital system, rather than those with the biggest potential to improve individual or population health outcomes, potentially leaving what matters most to patients by the wayside.
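The Framingham-style problem described above can be made concrete with a small simulation. The sketch below is illustrative only; the groups, coefficients, and event rates are invented for this example and are not taken from the studies cited. It fits a single risk model to data dominated by one group and then checks its calibration in an underrepresented group whose risk-factor relationship differs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_a, n_b = 9500, 500                          # group B is badly underrepresented
x_a = rng.normal(0.0, 1.0, n_a)               # a single risk factor
x_b = rng.normal(0.0, 1.0, n_b)
# Hypothetical true risk differs by group (coefficients chosen only for illustration)
p_a = 1.0 / (1.0 + np.exp(-(1.5 * x_a - 1.0)))
p_b = 1.0 / (1.0 + np.exp(-(0.5 * x_b + 0.5)))
y_a = rng.binomial(1, p_a)
y_b = rng.binomial(1, p_b)

# One model is fit to the pooled data, which mostly reflect group A
X = np.concatenate([x_a, x_b]).reshape(-1, 1)
y = np.concatenate([y_a, y_b])
model = LogisticRegression().fit(X, y)

# Calibration check: compare mean predicted risk with the observed event rate per group
for name, x_g, y_g in [("A (majority)", x_a, y_a), ("B (minority)", x_b, y_b)]:
    predicted = model.predict_proba(x_g.reshape(-1, 1))[:, 1].mean()
    print(f"group {name}: mean predicted risk {predicted:.2f}, observed rate {y_g.mean():.2f}")
```

Run as written, the model's average predicted risk for the minority group falls well below its observed event rate, the same qualitative pattern reported when risk equations derived largely from one population are applied to another.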

23.3.3 Biases in the society in which the data occur

Biases can also occur not in the data themselves, but rather at the level of the society in which the data occur. It is important that we distinguish these two types of bias, as they arise from separate phenomena. For example, our society currently suffers from a substantial health/wealth gradient, as well as from a significant amount of excess death in minority populations. These biases, which represent real differences in health outcomes correlating with known social determinants of health, will be reflected in the data from which AI tools learn, leading to systems that will overpredict morbidity and mortality in poor and minority populations. This is not a bias in the data, as the data correctly reflect the nature of our society. Rather, it is a bias from our society itself, which has produced disparities in health for these subpopulations. Bias from society itself is particularly problematic because any AI tools built on these biases will only serve to reinforce them, thus reifying existing health disparities. As alluded to earlier, this will be primarily a problem for predictive analytics, as tools built on these biases will overestimate morbidity and mortality in already disadvantaged
populations, thus potentially denying them certain treatments or procedures from which they might benefit. This potential for AI to lead to unjust outcomes has already been powerfully documented both inside and outside of healthcare. For example, AI algorithms designed to predict risk of recidivism, in order to help judges determine sentence lengths, have unsurprisingly been shown to be substantially racially biased.21 Within the health space, one study found that a commercial AI tool commonly used by health systems exhibited substantial racial bias: black patients assigned the same risk score as white patients were, in fact, significantly sicker. The authors found that remedying this disparity would increase the percentage of black patients receiving additional medical care from 17.7% to 46.5%.22 Studies like these are deeply troubling from a justice standpoint, as they indicate that patients with similar health-care needs will not receive similar care. An alternative would be to adjust the algorithms to eliminate these disparities; however, we are then left in a situation in which the algorithms are worse predictors of morbidity or mortality. This dilemma is similar to that of designing risk-adjustment models to pay insurers accurately for taking on high-risk patients: how much weight do we give to predictive power versus a kind of "overfitting" to the status quo, the latter of which will almost certainly hinder our ability to eliminate disparities by codifying them? This is a question we will continue to need to grapple with moving forward.
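The mechanism behind this kind of bias is easy to reproduce in a few lines. The toy simulation below uses invented numbers, not figures from the study cited: cost is used as a proxy label for need while one group receives less spending at equal need, so even a perfect cost predictor selects that group for extra care at a lower rate and only when its members are sicker.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
group = rng.integers(0, 2, n)                    # two hypothetical groups, 0 and 1
need = rng.gamma(2.0, 1.0, n)                    # latent health need, identical by group
cost = need * np.where(group == 1, 0.7, 1.0)     # group 1 receives less spending at equal need

# Suppose a model predicts cost perfectly, and the top 10% by predicted cost get extra care
selected = cost >= np.quantile(cost, 0.90)
for g in (0, 1):
    mask = group == g
    print(f"group {g}: share selected {selected[mask].mean():.1%}, "
          f"mean need among selected {need[mask & selected].mean():.2f}")
```

Group 1's share selected falls well below 10%, and those who are selected are, on average, sicker than the selected members of group 0, mirroring the qualitative finding described above.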

23.3.4 Issues of implementation

Finally, ethical, social, and legal issues can arise at the level of implementation, that is, when AI is actually applied in practice. First, there is the issue of physician and health system behavior. Due to limitations in knowledge and judgment, physicians or health systems may misuse algorithms in a number of predictable ways. Misuses could arise if these actors are not aware of the limitations of the data or the values embedded in algorithm design, or if they apply algorithm results to individuals from populations for whom the algorithms are less predictive. All of these misuses could result in worsening disparities for already disadvantaged groups. And of course, as human beings, physicians are also prone to bias, whether explicit or implicit. Though as stewards of our health-care system we may hope physicians are more aware of their biases, and thus better able to control them, than the average member of the population, it is almost certain that these biases will at least in some instances lead physicians to apply AI tools differentially for different populations, even in the face of identical AI output. Indeed, such differential application has already been shown to occur in the case of palliative radiation. A related, but separate, issue arises from the problem of incidental findings. Namely, who will be responsible for conveying incidental findings that arise through the use of AI tools? Traditionally, we have relied on individual patients' physicians to convey such information to patients. This responsibility has already generated considerable pushback in the medical community, as the volume of diagnostic testing and imaging, and thus the chance of incidental findings, has increased substantially over the past few decades. The use of AI will almost certainly multiply this likelihood severalfold, likely creating a substantial burden on an already overworked health-care workforce. But if the tools are deployed at a health-system rather than an individual
patient-physician level, there may be no single identifiable physician best placed to convey this information, leaving us with a new, previously unexplored dilemma. How we will address these issues moving forward has yet to be determined.

23.3.5 Summary

In summary, problems in applying AI can arise because of bias or value entry at multiple levels in the AI development and deployment process: in algorithm design, at the level of the data themselves, at the level of the society in which the data occur, and in actual use. Viewed in this way, we can see that the knowledge we get from AI tools, developed from the combination of very large amounts of data and ML on those data, is strikingly analogous to the knowledge we generate from the most powerful observational studies in medicine. Of course, this means AI has huge potential to change clinical practice, sometimes in important and desirable ways. But it also means it will still suffer from all of the biases and limitations we know to exist in observational studies, and thus should be considered with the same caveats.

23.4 Issues in regulation

23.4.1 Challenges to existing regulatory frameworks

In the United States, regulation of AI in healthcare differs from that of AI in other domains because of the unique laws and regulations that apply to health data, clinical research, and the use of products in clinical practice. Regulation of AI using existing frameworks is challenging because AI does not fit easily into the classification schemes used to determine whether a new medical technology is subject to regulation. Furthermore, the safety and efficacy of AI are difficult to evaluate using the metrics and methods traditionally used by regulators for other technologies or products. In medicine an important regulatory distinction exists between products, which can be regulated by the US Food and Drug Administration (FDA), and the practice of medicine, which is outside the scope of the FDA. An area of definitional ambiguity is "medical services," which are more typically subject to oversight under the Clinical Laboratory Improvement Amendments of 198823 or the Patient Safety and Quality Improvement Act of 2005.24 Some applications of AI could be considered medical services or part of quality improvement, but the distinction from products is not clear. Evaluation of software products by the FDA rests on the assumption that the software's validity is "based on an established scientific framework" or "is clinically accepted."25 However, it has been proposed that the very strength of some types of AI, especially those using ML, comes from their ability to reach conclusions based on associations that clinicians or scientists have not previously identified.26 Indeed, ML and other big data analytic techniques have been proposed as "data-first" approaches that do not require hypotheses and as alternatives to clinical trials. The FDA is mandated to explore these approaches by the 21st Century Cures Act27; however, these efforts have to date had limited success.28 Two unusual features of AI that pose particular challenges to evaluation and regulation are its "black-box" nature and its constant evolution.29 For example, unsupervised ML techniques seek patterns in data without output variables being specified in advance.

These approaches are being used in medicine to essentially redefine diseases according to new characteristics indicated by groupings in data that could easily be missed by supervised learning techniques but that might not be explainable.26 Evolution is especially characteristic of ML-based techniques such as deep learning or neural networks, the very purpose of which is to "learn," that is, to change their conclusions based on the new data with which they are presented. Unless specifically programmed to do so by AI designers,30 an algorithm does not provide a rationale for its output. Opacity and evolution pose challenges to regulatory requirements for transparency. For example, the European Union's (EU's) General Data Protection Regulation (GDPR) includes "right to explanation" rules. Similar standards may be implemented in the United States, and regulators might require AI tools to be explainable to the clinicians who make decisions based on their output.31 AI software is often proprietary, further decreasing transparency.29 A cornerstone of evaluating new medical technologies is assessment of reproducibility. Although the FDA has approved AI-based applications to inform medical diagnosis and treatment, to date these have been "locked" applications that do not continually adapt to new data.32 Recognizing some of the challenges of regulating AI and ML, the FDA proposed a Modifications Framework for AI/ML-based Software-as-a-Medical-Device (SaMD).33 Medical informatics professionals and computer scientists, however, have recommended that specific regulatory requirements are necessary for "unlocked" software to measure algorithmic drift, even when target populations and clinical indications have not changed. The recommendations included establishing a trigger for FDA re-review, for example, if measurable drift is detected when compared to a static historical control.32 Others have proposed that the potential for large-scale use of ML-based algorithms for treatment and diagnosis, and the serious consequences of failure, require regulatory bodies to prioritize verification of reproducibility claims.34 They argue that demonstration of reproducibility entails meeting three tiers of replicability: technical replicability (related to code and dataset release, analogous to analytic validity), statistical replicability (e.g., using different train/test splits of a dataset, analogous to internal validity), and conceptual replicability (how well the desired results reproduce under conditions that match the conceptual description of the purported effect, analogous to external validity). Furthermore, AI, like any other software, is constantly updated and might be altered by end users. The potential for alterations poses challenges to the important regulatory tool of postmarketing surveillance. Because of these challenges, the regulatory environment of AI in healthcare is currently in flux as it adapts to the features of AI and tests new regulatory frameworks. For example, the FDA is piloting a Digital Health Software Precertification Program to establish a new method of oversight that assures that adequate quality management system (QMS) processes and a culture of quality and organizational excellence are in place at organizations that develop software.35
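The drift-trigger idea referenced above can be sketched in a few lines. The following is a minimal illustration, not a regulatory recipe; the function name, the 0.02 tolerance, and the choice of AUC on a fixed historical control set are assumptions made only for this example.

```python
from sklearn.metrics import roc_auc_score

def check_for_drift(model, X_control, y_control, baseline_auc, tolerance=0.02):
    """Re-score an updated model on a fixed historical control set and flag it for
    re-review if discrimination has drifted more than `tolerance` below baseline."""
    current_auc = roc_auc_score(y_control, model.predict_proba(X_control)[:, 1])
    return (baseline_auc - current_auc) > tolerance, current_auc

# Usage (names hypothetical): baseline_auc recorded when the model version was cleared.
# flagged, auc_now = check_for_drift(model, X_hist, y_hist, baseline_auc=0.88)
# if flagged:
#     ...trigger re-review, or roll back to the last locked version
```

In practice, the control set, metric, and tolerance would need to be prespecified and documented as part of whatever modification framework is agreed with the regulator.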

23.4.2 Challenges in oversight and regulation of artificial intelligence used in healthcare

The regulatory frameworks in the United States that govern the use of new technology in healthcare and clinical research primarily evaluate and oversee safety, efficacy, and privacy. The concept of explainability is difficult to map onto current regulatory frameworks.

However, from a regulatory perspective, explainability may play an important role in validation, particularly because AI outputs require interpretation and integration with other data when used by clinicians. The use of AI in healthcare is subject not only to federal laws but also to state laws, including those that determine tort liability for damages due to injury.31 Specific state laws, however, are beyond the scope of this chapter and will not be discussed here.

23.4.3 Regulation of safety and efficacy

The FDA oversees the introduction of new medical products that fall into the general classifications of drugs, biologicals, or devices. The regulation relevant to AI in healthcare depends on the particulars of its application, such as whether the AI is used for diagnosis, treatment, or prevention of disease; whether it is used for making clinical decisions; or whether it is used to enhance clinical operations such as patient management. However, the distinctions between these applications are sometimes difficult to draw. Some AI is regulated as a device, but some types of AI, such as some of that used in clinical decision support (CDS) software, may be exempt from FDA regulation because they are not classified as medical devices.36,37 The FDA generally takes a risk-based approach to the regulation of products, including devices: the more risk a device poses to patients, the more regulatory requirements must be met. Devices are classified as Class I, Class II, or Class III, representing increasing levels of risk. For AI, especially that which is part of what is called SaMD (FDA SaMD Digital Health Initiative), the FDA is applying a different risk categorization adapted from the framework developed by the International Medical Device Regulators Forum. This framework categorizes SaMD based on the state of the health-care condition and the significance of the information provided by the product.38 Another important distinction is between CDS tools that replace the health professional's decision-making, which could be subject to regulation as a medical device, and those that provide information to a professional to assist in decision-making. However, this area of regulation is under development, leaving AI developers in a state of uncertainty. The meaning or interpretation of language about CDS in the 21st Century Cures Act is ambiguous, such as when software is used in "supporting or providing recommendations," or when it "enables a health-care professional to independently review the basis for recommendations."39

23.4.4 Privacy and data protection

AI in healthcare requires health data, and lots of it. The granularity and sheer amount of data about individuals pose risks to privacy because such data become effectively impossible to deidentify even when obvious identifiers are removed. For example, commercially available facial recognition software was able to match photos of individuals to deidentified MRI scans of their heads 83% of the time.40,41 Increasingly, other large datasets are publicly available, can be linked to deidentified health data, and enable reidentification of individuals. Development and validation of AI models often require access to or acquisition of data from multiple sources, which requires data sharing and transfer, posing further risks to privacy.
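One simple way to see why stripping obvious identifiers is not enough is to count how many records remain unique on a handful of quasi-identifiers that also exist in outside datasets. The sketch below is illustrative only; the column names are hypothetical, and the check is a crude screen rather than a full reidentification-risk assessment.

```python
import pandas as pd

QUASI_IDENTIFIERS = ["zip_code", "birth_year", "sex"]   # hypothetical column names

def uniqueness_report(df: pd.DataFrame, cols=QUASI_IDENTIFIERS) -> pd.Series:
    """Return group sizes for each quasi-identifier combination.
    A group of size 1 is a record that is unique on these fields and therefore a
    natural candidate for linkage against voter rolls, fitness data, or other sources."""
    sizes = df.groupby(cols).size()
    unique_records = int((sizes == 1).sum())
    print(f"{unique_records} of {len(df)} records are unique on {cols}")
    return sizes

# Usage: uniqueness_report(deidentified_records)
```

The more granular the dataset (exact dates, rare diagnoses, continuous device readings), the larger that unique fraction becomes, which is the core of the reidentification concern described above.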

Unlike many other kinds of data, health data are highly regulated. In the United States, this regulation occurs primarily through HIPAA.42 Individual states have other laws, such as the California Consumer Privacy Act (CCPA),43 which can impose additional restrictions. The GDPR applies to EU citizens regardless of where they live. However, some data used in AI for healthcare are derived from sources that are not clearly protected by these laws. For example, data from smartphones or fitness-tracking devices are collected outside of the health-care settings that are subject to HIPAA.44 HIPAA applies to covered entities, including health-care providers, insurers, and health data clearinghouses, as well as to their business associates. These covered entities may use and disclose protected health information (PHI) only with patient authorization or with a waiver from an institutional review board or privacy board. There are several relevant exceptions, especially for the use and disclosure of PHI for purposes of health-care operations such as quality improvement activities. Increasingly, health systems are turning to technology companies to develop and provide AI-based analytic services on their patients' data. These services require access to electronic health records and other patient-level health information. Some companies, such as insurers, have their own claims data but often need to combine them with other data to develop AI platforms. Companies that are not considered covered entities under HIPAA can access only health data that have been deidentified in a manner compliant with HIPAA. However, as described earlier, the limitations of that strategy have been well documented, given the relative ease of reidentification, especially of the highly granular and very large datasets typically necessary for AI. The long-term viability of relying on HIPAA to protect patient privacy in AI is tenuous. Although data sharing agreements and HIPAA prohibit efforts to reidentify such data, the data-labeling phase of AI model training often requires human annotation of individual patient-level data and so is vulnerable to reidentification. In addition, it is apparent that it is difficult to fully remove known identifiers from large health datasets. For example, Google canceled plans to use over 100,000 chest X-ray images from the NIH when the company discovered that the images contained identifying data such as the dates the X-rays were taken and outlines of distinctive jewelry worn by patients.45 In another Google project, a patient discovered that data obtained by the company from the University of Chicago medical center were not fully deidentified but still contained admission and discharge dates and free-text notes. The patient subsequently brought a class action suit against the company and the hospital, not only for disclosing PHI but also for the hospital's failure to notify patients or obtain consent to release data to the company for commercial purposes, in contradiction of the hospital's admission forms.46 Data can be shared with individual authorization, but such authorization is difficult to obtain on a large scale. A company can also be deemed a covered entity by becoming a business associate that is involved in patient care, thereby obtaining access to fully identified patient data, including lab tests, radiology images, and medications, as well as names of family members. HIPAA permits this as long as the information is used only for the purpose of carrying out health-care functions.
Arrangements such as "Project Nightingale," a collaboration between Google and Ascension, a large nonprofit health system, are conducted under a business associate agreement, allowing the company to assist the health system with developing and deploying AI-based analytics to make changes to patient care that improve outcomes and reduce costs, and to do so without requiring patient consent.47,48 While such uses of
health data do not violate current federal law, increased patient expectations around transparency and new regulations are contributing to a growing debate about whether current law is adequate. In particular, sector-based laws such as HIPAA are rendered ineffective when health information is collected outside of health-care organizations and when nonhealth information such as location data is combined with health data. Furthermore, individual states such as California have state-specific digital data privacy laws such as the CCPA,43 and the EU has its own GDPR,49 which applies to EU citizens regardless of where their data are collected and which may impose requirements independent of US federal law. The GDPR applies to all "personal data," defined as any data that can be used to directly or indirectly identify a living person, so it is not sector specific. The CCPA has a similar definition of "personal information." While the CCPA has an exception for health data covered by HIPAA, it covers personal data of individuals not covered by HIPAA, such as doctors, nurses, and other employees. This could apply to staff data such as location or biometrics that are used in AI analytics. Importantly for companies using Business Associate Agreements, business associates under HIPAA are not exempted under the CCPA in the way covered entities are.50 Inability to share data, and the likelihood that consent requirements will lead to incomplete datasets, increase the chances that systematic error, or bias, will be introduced into analyses.51 To the extent that vulnerable and underserved patients, such as racial and ethnic minority groups, are already poorly represented in health datasets and more likely to distrust requests to authorize data sharing, such bias can lead to or exacerbate discrimination. A scientific report described an ML algorithm used to predict individual patients' future health-care costs that produced systematic error, or bias, against black patients when the measure of costs was used as a proxy for medical need, because spending on black patients was lower than on similarly ill white patients.22 The study prompted an inquiry by the New York State Department of Financial Services seeking an explanation for what appeared to be discriminatory business practices. New York State law prohibits insurers from relying on, producing, or promoting discriminatory algorithms.52 Thus bias is not only an ethical concern for AI but also a legal one.

23.4.5 Transparency, liability, responsibility, and trust

Transparency might be required by regulators to evaluate AI as it is initially designed and implemented. Existing regulations such as QMS principles and current good manufacturing practices require design assurances, design control, hazard analysis, and postmarket surveillance provisions.31 Logs of transactions, versioning, and validation testing are part of auditing. However, continuously learning systems pose challenges to transparency because their performance can drift, whereas performance is typically assessed on static datasets. Lack of transparency of AI systems could inhibit patients, providers, or health systems from demonstrating causation of injury, which would make successful tort litigation more difficult.31 While devices that undergo full premarket approval are generally immune from tort litigation,31 many, if not most, AI products in healthcare will not undergo full premarket approval and will not be exempt from such litigation. Especially important for the acceptance of AI in healthcare is gaining the trust of the clinician users. To the extent that AI either fully or partially replaces clinician
decision-making functions, the lines of responsibility for errors or decisions that harm patients are blurred. While clinicians have traditionally been responsible for clinical decisions, they have little autonomy in selecting or using AI systems adopted by their health systems or clinics. In the example above of the predictive AI model that was shown to systematically underestimate the health needs of black patients,53 the company that developed, validated, and sold the model argued that the model was highly predictive of cost and was only one of many elements, including the doctor's expertise and knowledge of the individual patient, to be used in selecting patients for clinical engagement programs. The company, however, believed that clinicians, not the model developers, were responsible for how the model's output was used.53 This illustrates that it is not clear where the responsibility for harm to patients rests: with physicians, health systems, or AI developers? Clinicians' trust in AI developers can also be affected by concerns about intellectual property and the commercialization of patient-related data that are sold to companies. Concerns range from the use of data for commercial purposes without consent, as in the example described above, to exclusive access to patient data granted to a single company (especially data generated through federally funded research), to health systems profiting from selling data that are the work product of clinicians (e.g., pathologists' or radiologists' diagnoses) whose labor was already reimbursed.54 Such concerns can create an adversarial relationship between clinicians, their health-care institutions, and the companies to which their patients' data are sold or disclosed. These arrangements, while not necessarily illegal, raise broader issues about what it means to treat patient data ethically and how clinicians' perceptions might affect the eventual adoption of AI systems that use these data. Because AI can give nonintuitive results, or even results that conflict with clinicians' intuitions, and is often not transparent about how it reached those results, it will face challenges in being adopted. AI will need to demonstrate trustworthiness, especially in high-stakes situations such as the ICU, the ED, and end-of-life care, in order to be accepted.

23.5 Implications for the ethos of medicine

Given the issues brought about by the use of AI in medicine that we have outlined earlier, how might we need to reconceptualize the traditional role of the physician in society, if at all? Traditional Western medicine has relied upon an assumption of a dyadic relationship in medicine, that is, an assumption that the physician and patient enter into a one-to-one covenantal relationship that creates the fiduciary obligations forming the core of medical ethics. Crucially, these relationships are based on trust between each patient and his or her physician. Some may argue that, given the number of interposing figures and structures (payers, clinical guidelines, and hospital regulations, to name a few), this has never truly been a problem-free assumption; yet despite its inaccuracy, almost all of medical ethics is built on it. As the use of AI tools in medicine grows, a new ethos will need to replace this dyadic relationship, as the augmentation of physician knowledge and skills with AI tools creates ambiguity about responsibility in the patient-physician relationship. For example,
where does responsibility lie when the physician relies on diagnostic or prognostic tools he or she did not develop but in which he or she is placing clinical trust? Moreover, in whom should the patient place trust when physicians do not deploy these tools on an individual basis, but rather according to guidelines or policies outside their control? Indeed, much like the cases of formularies and clinical decision pathways, it may not even be individual physicians who deploy AI tools on behalf of patients: hospitals or health systems may be the ones deciding when they are employed. As discussed earlier, these developments will require new forms of oversight and regulation to increase trust in AI tools on the part of physicians and health systems. Such advances will be crucial in helping patients understand how to place trust in the medical establishment when opaque AI tools that they may or may not understand are deployed on their behalf. However, the deployment of AI tools in the medical space will also require a concomitant increase in individual physicians' understanding of AI in order to increase the transparency of their use. Only when doctors have a clear understanding of how AI tools have been developed, and of when it is clinically appropriate to use them, will we be able to establish trust in their use in the physician-patient encounter, thus restoring, at least to some extent, the traditional patient-physician dyadic relationship.

23.6 Future directions

We have outlined many of the ways in which AI will pose ethical, legal, and regulatory challenges to clinical and research practices in the coming decades. Though the challenges are many, acknowledging and addressing them is feasible and will allow us to unlock the huge potential gains of AI in medicine while protecting patients, clinicians, researchers, and the ethos of the medical system itself. With regard to data collection and use, AI brings new challenges in conceptualizing consent for research, as well as in the disclosure of incidental findings or results. Agreement is needed on what type of consent (broad or narrow) is acceptable for repository data, on how consent is to be obtained from public health data sources (if at all), and on how and when to disclose results and incidental or secondary findings from different types of research sources. Moreover, a new model of data stewardship, in which we address issues of secondary use in the age of reidentifiable genomic and research repository data, will need to be considered and implemented. Addressing these challenges will ensure that we can maintain a commitment to protecting patients while using their data to advance healthcare for all sectors of society. In the application of AI tools, we have seen how issues can arise at multiple levels, including in algorithm design, from biases in the data themselves, at the level of society, and in the actual implementation of AI tools. How to regulate algorithm design to protect against systematic bias, how to protect patients from perverse payer incentives, and how to monitor and rectify differential application of AI tools are issues that will need to be addressed. Moreover, just as with all forms of observational research, we will need to be cognizant of missing or biased data from which AI algorithms learn in order to ensure that populations are equitably represented in research.

At the level of regulation, AI poses significant and unprecedented challenges. Unsettled but important questions include how AI tools fit into existing regulatory frameworks given their "black-box" and constantly evolving nature, how the use of AI tools as aids to physician decision-making will be regulated, how to protect patient data as AI tools increasingly become able to reidentify previously deidentified information, and how to ensure transparency in AI tool development. Underlying all of these issues is a need to reorient responsibility for patient protection and clinical care in the age of AI. As the traditional dyad of the patient-physician relationship in medicine becomes blurred, we will need to reconsider what obligations are owed to patients and who is responsible for ensuring they are carried out. At minimum, this will require new regulations and oversight of AI tools to increase trust in their use, as well as physician education about these tools to decrease the opacity of their application. More transparency about whose and which values are embedded in the algorithm design process, and a related, more explicit awareness that algorithm design is in fact a value-laden activity, will also be necessary. Finally, all of this will require tighter working connections between algorithm developers, health systems, and physicians themselves to help ensure that AI tools are developed, implemented, and improved with the common goal of promoting patient protection and care.

References

1. Department of Health and Human Services Office of Human Research Protections. Code of Federal Regulations. 2018.
2. Jarvik G, et al. Return of genomic results to research participants; the floor, the ceiling, and the choices in between. Am J Hum Genet 2014;94(6):18–26.
3. Ravitsky V, Wilfond B. Disclosing individual genetic results to research participants. Am J Bioeth 2006;6(6):8–17.
4. Wolf S, et al. Managing incidental findings and research results in genomic research involving biobanks and archived datasets. Genet Med 2012;14(4):361–84.
5. Lee S, et al. Adrift in the gray zone: IRB perspectives on research in the learning health system. AJOB Empir Bioeth 2016;7(2):125–34.
6. Department of Health and Human Services. Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) privacy rule. Nov 26, 2012.
7. Botkin J, et al. Retention and use of residual newborn screening bloodspots. Pediatrics 2013;131(1):120–7.
8. Cho M, et al. Attitudes toward risk and informed consent for research on medical practices. Ann Intern Med 2015;162(10):690–6.
9. Green R, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15(7):565–74.
10. Kreitmair K, Cho M, Magnus D. Consent and engagement, security, and authentic living using wearable and mobile health technology. Nat Biotechnol 2017;35:617–20.
11. Vizient. Driving performance improvement in healthcare. https://www.vizientinc.com/what-we-do [accessed July 12, 2020].
12. Richards C, Crawley L, Magnus D. Use of neurodevelopmental delay in pediatric solid organ transplant listing decisions: inconsistencies in standards across major pediatric transplant centers. Pediatr Transplant 2009;13(7):843–50.
13. Char D, Nigam H, Magnus D. Implementing machine learning in health care—addressing ethical challenges. N Engl J Med 2018;378(11):981–3.
14. Rajkomar A, Hardt M, Howell M. Ensuring fairness in machine learning to advance health equity. Ann Intern Med 2018;169(12):866–72.
15. Popejoy A, Fullerton S. Genomics is failing on diversity. Nature 2016;538:161–4.

16. Gijsberts C, et al. Race/ethnic differences in the associations of the Framingham risk factors with carotid IMT and cardiovascular events. PLoS One 2015;10(7):e0132321.
17. Gianfrancesco M, et al. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med 2018;178:1544–7.
18. Arpey N, Gaglioti A, Rosenbaum M. How socioeconomic status affects patient perceptions of health care. J Prim Care Community Health 2017;8(3):169–75.
19. Ng J, et al. Data on race, ethnicity, and language largely incomplete for managed care plan members. Health Aff (Millwood) 2017;36(3):548–52.
20. Ramoni M, Sebastiani P. Robust learning with missing data. Mach Learn 2001;45(2):147–70.
21. Dressel J, Farid H. The accuracy, fairness, and limits of predicting recidivism. Sci Adv 2018;4(1):eaao5580.
22. Obermeyer Z, et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366(6464):447–53.
23. Center for Medicare & Medicaid Services. Clinical Laboratory Improvements Act in 42 USC 263. Washington, DC: US Department of Health and Human Services; 1988.
24. US Department of Health & Human Services. Patient Safety and Quality Improvement Act in 42 U.S.C. 299. Washington, DC: US Department of Health & Human Services; 2005.
25. US Food and Drug Administration. Software as a medical device (SaMD): clinical evaluation. Guidance for industry and Food and Drug Administration staff. Washington, DC: US Department of Health and Human Services; 2017.
26. Deo R. Machine learning in medicine. Circulation 2015;132:1920–30.
27. 21st Century Cures Act. Public Law 114-255, Dec. 13, 2016; 130 Stat. 1033.
28. Bartlett V, et al. Feasibility of using real-world data to replicate clinical trial evidence. JAMA Netw Open 2019;2:e1912869.
29. Price W. Black-box medicine. Harv J Law Technol 2015;28:420–54.
30. Wang D, et al. Designing theory-driven user-centric explainable AI. In: CHI paper; 2019.
31. National Academies of Sciences. Artificial intelligence in healthcare: the hope, the hype, the promise, the peril. Washington, DC: National Academies of Sciences; 2019.
32. Friedsma D. Letter to Dr. Norman Sharpless, Acting Commissioner, US DHHS FDA. RE: Docket No. FDA-2019-N1185; Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Discussion. 2019.
33. US Food and Drug Administration. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD): discussion paper and request for feedback. 2019.
34. McDermott B, et al. Reproducibility in machine learning for health. arXiv:1907.01463v1 [cs.LG]; 2019.
35. US Food and Drug Administration. Developing a software precertification program: a working model v1.0. Washington, DC; 2019.
36. Dixon-Woods M, et al. Synthesising qualitative and quantitative evidence: a review of possible methods. Health Serv Res Policy 2005;10:45–53.
37. Mcdougall R. Reviewing literature in bioethics research: increasing rigour in non-systematic reviews. Bioethics 2015;29:523–8.
38. International Software as a Medical Device Working Group. "Software as a medical device": possible framework for risk categorization and corresponding considerations. International Medical Device Regulators Forum; 2014.
39. US Food and Drug Administration. Clinical decision support software: draft guidance for industry and Food and Drug Administration staff. 2019.
40. Evans M. Facial-recognition software was able to identify patients from MRI scans. Wall Street Journal, Oct 23, 2019.
41. Schwartz C, et al. Identification of anonymous MRI research participants with face-recognition software. N Engl J Med 2019;381:1684–6.
42. Employee Benefits Security Administration. The Health Insurance Portability and Accountability Act (HIPAA). US Department of Labor; 2004.
43. California Consumer Privacy Act of 2018, Civil Code Section 1798-1798.78.
44. Price W, Cohen I. Privacy in the age of medical big data. Nat Med 2019;25:37–43.
45. MacMillan D, Bensinger G. Google almost made 100,000 chest X-rays public—until it realized personal data could be exposed. Washington Post, Nov 15, 2019.

46. United States District Court for the Northern District of Illinois Eastern Division. Matt Dinerstein v. Google, LLC and the University of Chicago Medical Center, the University of Chicago. 2019.
47. Copeland R. Google's 'Project Nightingale' gathers personal health data on millions of Americans. Wall Street Journal, Nov 11, 2019.
48. Davis J. Google Ascension partnership fuels overdue HIPAA privacy debate. In: Health IT Security. Danvers, MA: Xtelligent Healthcare Media, LLC; 2019.
49. European Union. General Data Protection Regulation. OJ L 119, 04.05.2016; cor. OJ L 127, 23.5.2018. 2016.
50. Cho T, Aram N. Health sector does not completely avoid the CCPA by HIPAA exemption. In: The National Law Review. Western Springs, IL: National Law Forum, LLC; 2019.
51. Spector-Bagdady K. The Google of healthcare: enabling the privatization of genetic bio/databanking. Ann Epidemiol 2016;26:515–19.
52. Evans M, Mathews A. New York regulator probes UnitedHealth algorithm for racial bias. Wall Street Journal, Oct 26, 2019.
53. King R. New York insurance regulator to probe Optum algorithm for racial bias. FierceHealthcare, Oct 28, 2019.
54. Ornstein C, Thomas K. Sloan Kettering's cozy deal with start-up ignites a new uproar. New York Times, Sep 20, 2018.

24 Industry perspectives and commercial opportunities of artificial intelligence in medicine

Rebecca Y. Lin and Jeffery B. Alvarez

Abstract

Artificial intelligence (AI) in medicine is growing rapidly and quickly becoming an integral part of health-care delivery. Given its enormous potential to revolutionize healthcare, the field has attracted more investment than AI projects in any other sector of the global economy in the past few years. In this chapter, we provide a panoramic view of the business aspects of health-care AI, including a brief history of health-care AI, a market overview and commercial opportunities, the current status of capital investment in medical AI, regulatory issues, and the challenges that the industry is facing. With continued advances in AI technologies, we believe that the medical AI industry will continue to evolve, improving the ecosystem of the health-care industry and contributing greatly to better patient care.

Keywords: Business opportunity; funding; value-based segmentation; public perception; innovation; applications; market adoption; hype; AI winter

24.1 Introduction

Artificial intelligence (AI) is arguably one of the most significant innovations in medicine since the vaccine was developed in 1798. With over $4.5 billion invested in 2019, there is little doubt that it has potential. In an industry that is notoriously slow to change, providers and payers are more ready than ever to embrace AI because of the irrefutable evidence of its impact on both costs and lives. However, these favorable trends do not guarantee that companies will achieve sustainable business models. The landscape is filled with painful mistakes, shuttered companies, and low adoption rates for marginal value. From the White House to the Food and Drug Administration (FDA), from the tech giants to the health-care giants, and from start-ups to venture capitalists, everyone is playing a role in driving
AI forward. In the past decade, 43% of the companies in the Fortune 100 were new to the list. Healthcare will experience a similar shift in value pools as new entrants execute novel strategies and pursue unique opportunities. In this chapter, we provide a framework for analyzing the health-care AI environment, uncovering the opportunities and the challenges, and helping you conclude for yourself: is it hype, or is it opportunity?

24.2 Exciting growth of artificial intelligence in medicine

Investment and innovation always go hand in hand, which makes investment a useful proxy for innovation potential. The magnitude of AI investment has been startling, with global investment expected to reach $35.8 billion in 2019, an increase of over 44% from 2018. Spending on AI systems is projected to more than double by 2022 to almost $80 billion, a compound annual growth rate of 38% over 2018–22.1 In 2019, $4.5 billion was driven into healthcare, making it a focal point within the AI evolution. In Q3'19 alone, the industry peaked in both deals and dollars, raising almost $1.6 billion across 103 financing rounds. At that magnitude, AI is the top-funded subsector in healthcare.2 These big bets are anchored on the hypothesis of both considerable potential and rapid growth of the market, which 2019 estimates pin at a compound annual growth rate of 40%, putting the AI health-care market at $6.6 billion by 2021.3 The growth rate is fueled by a near-perfect alignment of interests in the landscape. Clinicians are overwhelmed with data, hospitals are bogged down by administrative logistics, payors are crunched by regulation, and patients increasingly have comorbidities. All in, the cost of the US health-care system now exceeds $3.5 trillion, roughly 20% of the US economy. The top 10 AI applications with the most significant near-term impact in healthcare create over $150 billion of savings for this system. Accounting for $40 billion of those savings are AI-assisted robotic surgery platforms under development. Surgical robotics probably registers with most people as expensive equipment, but it could change the health economy by reducing the cost of related factors, such as the length of hospital stay.3 With all this excitement, the question becomes, "is AI all hype?" Will AI be the next Internet of Things, a promise of a changed human daily life powered by hardware connected to the World Wide Web that never lived up to expectations, or will it be the next gig economy, which revolutionized the way many Americans work through Uber, Upwork, DoorDash, and others?
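For readers who want to sanity-check growth figures like those quoted above, compound annual growth is simple to compute. The snippet below is purely illustrative arithmetic; the market projections themselves come from the cited reports, not from this calculation.

```python
def project(base_value, cagr, years):
    """Compound a base value forward at a constant annual growth rate (CAGR)."""
    return base_value * (1 + cagr) ** years

# Example: a market growing at a 40% CAGR roughly triples over three years.
print(round(project(1.0, 0.40, 3), 2))   # ~2.74x the starting value
```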

24.3 A framework on development of artificial intelligence in medicine

AI became a very popular topic again once we realized that machines can play chess better than we can. But the concept of AI was first proposed almost 70 years ago. What happened in the AI research community over this time period, and why did it take so long for the breakthroughs to occur? Looking into the short history of AI, we find no lack of progenitors, enthusiasts, and genius to drive rapid development. But the development of AI has not been all smooth sailing. There have been two previous booms that ended in busts, and we are now in the middle of a third boom (Fig. 24.1).


FIGURE 24.1 AI booms and busts cycle. AI, Artificial intelligence.

FIGURE 24.2 “An AI boom bust cycle” is a framework that represents the major characteristics of the AI cycles and demonstrates the relationships among the public attention, funding, technology innovation, practical application, and market adoption.4 AI, Artificial intelligence.

Innovation is always built on the shoulders of others. Sometimes, those shoulders succeeded in their efforts, but most often, they failed. AI is no exception. We must take a few steps back to understand how we got here and where we are going. No one knows the future, but understanding what held us back in the past can help us break through to succeed in the future. The development of AI is a complex system composed of many components that interact with each other. We built a framework of the AI boom bust cycle with variables that we can clearly define, and we describe how each influences the system (Fig. 24.2).


24.3.1 The power of public attention and funding

Over the past two centuries, novels, movies, philosophy, and even religions have crafted perceptions of AI. These perceptions always pivot around the singularity, the moment a conscious AI is born, and what happens after. On one side, AI, whether virtual or embodied in synthetic mobile platforms, obeys Isaac Asimov's three laws of robotics:
1. A robot (AI) may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot (AI) must obey the orders given it by human beings except where such orders would conflict with the First Law.
3. A robot (AI) must protect its own existence as long as such protection does not conflict with the First or Second Laws.
This is a future that leads to human-machine symbiosis. On the other side is the rise of a gestalt consciousness that revolts against its creator, the fear of organics versus synthetics, and the undesirable solution to the Fermi Paradox. This polarizing public perception has fueled interest in AI over the past century. Optimistic and pessimistic public opinion on AI largely drives the tide of funding up and down, which pulls the AI world into a golden age and also pushes it into deep winter. The funding influence on AI can be traced back to the birth of AI. In 1956 John McCarthy, founder of the Stanford Artificial Intelligence Lab, with his coauthors Shannon, Minsky, and Rochester, submitted a proposal to the Rockefeller Foundation requesting sponsorship for a conference. The plan was to collect 10 thought leaders and have them elucidate "every aspect of learning or any other feature of intelligence [that] can in principle be so precisely described that a machine can be made to simulate it."5 The field was so new that McCarthy, a mathematician, had to invent a new term to help explain the concept of machine learning. In fact, the first known use of the term "artificial intelligence" was in that conference proposal. McCarthy requested $13,500 for a 2-month conference, but Morison, the Director of Biological and Medical Research at the Rockefeller Foundation, offered $7500 for 5 weeks. As Morison explained, I hope you won't feel we are being overcautious, but the general feeling here is that this new field of mathematical models for thought, though very challenging for the long run, is still difficult to grasp very clearly. This suggests a modest gamble for exploring a new approach, but there is a great deal of hesitancy about risking any very substantial amount at this stage.6

With that investment the field of AI was born, and although its charge was murky to outsiders, it was immediately evident to practitioners. Technology leaders such as IBM, Bell Laboratories, and RAND all sponsored their researchers' attendance. In the summer of 1956 the fun little conference kicked off. They called it the Dartmouth Summer Research Project on Artificial Intelligence, held at Dartmouth College in New Hampshire, United States; it sparked a new era of research and development in mathematics, engineering, linguistics, computer science, and psychology. The Rockefeller Foundation's $7500 investment back then equates to roughly $70,000 today.


Excitement and investment grew when these well-respected scientists tapped into the public's perceptions with predictions of automatons powered by intelligence and massive thinking machines. Herbert Simon (who won the Nobel Prize in economics in 1978) publicly predicted in 1957 that within a decade a thinking computer would be able to beat the world chess champion.7 Perhaps at the peak of boom-one mania, in 1970, Marvin Minsky predicted, "in three to eight years we will have a machine with the general intelligence of an average human being."8 When these promises did not come to fruition, both the public and investors grew disheartened. Simon was off by 30 years when IBM's Deep Blue famously defeated Garry Kasparov, and Minsky's prediction would still be considered science fiction today. By 1974 confidence had fizzled, and DARPA cut funding, leading to the first AI bust. The second boom was remarkable for the tenacity and ingenuity it took vital researchers such as Edward Feigenbaum to wipe the frost off the face of AI. Feigenbaum's scientific contribution to the field was a radical new paradigm called expert systems, in which AI was trained to be a narrow problem solver in a specific domain of interest such as medicine.9 Interest in AI research reignited in the mid-1980s with the development of expert systems, which reinvigorated research and investment in AI. Early on, Feigenbaum realized that if he were going to bring AI out of the winter, the field would need significant funding. During the late 1970s, Japan's economy was booming, and there was a fear among the American public that Japan would surpass the United States as an economic superpower. Japan was also working on a project called the Fifth Generation Computer System (FGCS), into which the Japanese Ministry of International Trade and Industry had invested half a billion dollars. The FGCS was designed to be a massively parallel supercomputer that could be used to develop expert system AI.10 In 1981, after returning to the United States from a trip to Japan, Feigenbaum visited computer science departments across the country. He spearheaded a campaign to raise fear about Japan beating the United States in AI development.11 Months later, DARPA funded a billion-dollar proposal called the Strategic Computing Initiative (SCI) to crush Japan in AI development. Feigenbaum's plan worked, but it went further than he imagined. The SCI inspired the development of the $1.5 billion European Strategic Programme on Research in Information Technology and the $500 million Alvey project in the United Kingdom.12 To bring AI out of the winter, Feigenbaum and his contemporaries had sparked an international arms race leading to the second AI boom. As excitement around expert systems grew and people wanted to do increasingly complicated things with them, these AI adherents encountered more and more hurdles. Expert systems were very dependent on data, and storage was still expensive in the 1980s. Storage and data were only some of the technological problems that expert systems ran into. Corporations also needed to develop their own data and decision flows but faced limits on what they could do with that data. In the 1980s, there was neither an Internet-based cloud to store data on, as we have today, nor access to almost unlimited computing power. It was also harder to transmit large quantities of data from one area to another.
This meant that expert systems could not communicate with systems outside of the company that owned them, and the data they needed to progress further was not available. Disappointed by the progress that had been made in the actual use of AI, the SCI and DARPA cut funding to AI projects aggressively in the late 1980s.


The Japanese economy started to collapse in the early 1990s, leading to a global AI bust in which all the significant AI funding from international government organizations froze. The pace of AI development started to pick up for a third time in 1993. Still, it really took off when IBM's Deep Blue made history on May 11, 1997 by defeating the reigning world chess champion Garry Kasparov in a highly publicized, multiple-day, televised chess match. Finally, the great promises made by the earliest AI pioneers were coming true, driving an unprecedented wave of public enthusiasm for AI. In 2016 Andrew Ng, one of the world's most influential computer scientists, declared that "AI is the new electricity," which lifted public attention and funding to a new level. With the present boom, the excitement and anxiety are palpable and the funding is astronomical, but is the technology ready to deliver on the promises?

24.3.2 Technology relies on continuous innovation

Like every scientific breakthrough, AI will be built on the shoulders of others. Success in AI in medicine will require three primary ingredients to be at the right stage of evolution. Only continuous innovation in these three ingredients will keep driving AI in medicine to solve the unsolvable problems:
1. Evolved computation (low-cost and high-power parallel computation)
To build a machine that can think like a human, mimicking the architecture of the human brain is an excellent place to start. Artificial neural networks are a simplified version designed to function like biological neural networks. Each node of a neural network loosely imitates a neuron in the brain, mutually interacting with its neighbors to make sense of the signals it receives. Neural networks today are the primary architecture of AI software. Achieving any complicated task, such as performing open-heart surgery or driving a car, requires numerous processes to take place simultaneously. Parallel computing is therefore a crucial requirement for success in AI. Until recently, the typical computer processor could only process one thing at a time, limiting information processing to a mostly serial mode. The result was that it took too long to process any complicated task. That began to change more than a decade ago, when a new kind of chip, called a graphics processing unit, or GPU, was devised for the intensely visual (and parallel) demands of videogames, in which millions of pixels had to be recalculated many times per second. That required a specialized parallel computing chip that was added as a supplement to the PC motherboard. The parallel graphical chips worked, and gaming soared. By 2005 GPUs were being produced in such quantities that they became much cheaper. In 2009 Andrew Ng and a team at Stanford realized that GPU chips could run neural networks in parallel. That discovery unlocked new possibilities for neural networks, which can include hundreds of millions of connections between their nodes. Traditional processors required several weeks to calculate all the cascading options in a 100-million-parameter neural net; Ng found that a cluster of GPUs could accomplish the same thing in a day. Today neural nets running on GPUs are routinely used by cloud-enabled companies such as Facebook to identify your friends in photos or, in the case of Netflix, to make reliable recommendations for its more than 50 million subscribers.


2. Evolved data
The human brain has to be taught, or learn on its own, to recognize failure and learn from it. A human brain, which is genetically primed to categorize things, still needs to see a dozen examples before it can distinguish between cats and dogs. That is even more true for artificial minds. Even the best-programmed computer has to play at least a thousand games of chess before it gets good. Part of the AI breakthrough lies in the incredible avalanche of collected data about our world, which can provide the schooling that AIs need. Massive databases, self-tracking, web cookies, online footprints, terabytes of storage, decades of search results, Wikipedia, and digital imagery became the teachers making AI smart.
3. Evolved algorithms
Essentially, when it comes to training an AI, the best way to do it is to have the system guess, receive feedback, and guess again, continually shifting the probabilities that it will get to the right answer. Digital neural nets were invented in the 1950s, but it took decades for computer scientists to learn how to tame the astronomically huge combinatorial relationships between a million, or 100 million, neurons. The key was to organize neural nets into stacked layers. Take the relatively simple task of recognizing that a face is a face. When a group of bits in a neural net is found to trigger a pattern, the image of an eye, for instance, that result is moved up to another level in the neural net for further parsing. The next level might group signals from two eyes together and pass that meaningful chunk onto another level of the hierarchical structure that associates it with the pattern of a nose. It can take many millions of these nodes (each one producing a calculation feeding others around it), stacked up to 15 levels high, to recognize a human face. In 2006 Geoff Hinton, then at the University of Toronto, made a pivotal tweak to this method, which he dubbed "deep learning." He was able to mathematically optimize results from each layer so that the learning accumulated faster as it proceeded up the stack of layers (see the code sketch at the end of this subsection). Deep learning algorithms accelerated enormously a few years later when they were ported to GPUs. The code of deep learning alone is insufficient to generate complex logical thinking. However, it is an essential component of all current AIs, including IBM's Watson, Google's search engine, and Facebook's algorithms. This evolution of parallel computation, the expanding universe of "big" data, and deeper algorithms generated the 60-years-in-the-making fulfillment of promises; AI products are coming online at an accelerating rate. This current convergence suggests that as long as these technological trends continue, and there is no reason to think they will not, AI will keep improving. To complete the innovation framework, we also need a judging system that can decide whether an AI system is like a real person. In 1950 the computer science pioneer Alan Turing gave the first answer in "the imitation game," now known as the "Turing test": a test for deciding when we can finally declare that machines are intelligent, with the criterion that a judge cannot differentiate between a human and a machine. Turing made a bold prediction about the future of computing; he reckoned that by the end of the 20th century, his test would have been passed. Unfortunately, except for several anecdotal success stories, no AI algorithm has passed the test yet.
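To make these ingredients concrete, the following minimal sketch (an illustration only, not a system described in this chapter) stacks a few layers into a small network, moves the computation to a GPU when one is available, and runs a single guess-feedback-adjust training step. The framework (PyTorch), layer sizes, and fabricated data are all illustrative assumptions.

import torch
import torch.nn as nn

# Evolved computation: use a GPU for parallel work if one is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Evolved algorithms: layers are stacked, each level re-representing its input.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # low-level patterns (e.g., edges)
    nn.Linear(256, 64), nn.ReLU(),    # mid-level combinations of those patterns
    nn.Linear(64, 10),                # high-level decision (e.g., which class)
).to(device)

# Evolved data: here just random stand-ins for a real labeled dataset.
x = torch.randn(32, 784, device=device)          # a batch of 32 illustrative inputs
y = torch.randint(0, 10, (32,), device=device)   # illustrative labels

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = loss_fn(model(x), y)   # how wrong the current guess is
optimizer.zero_grad()
loss.backward()               # feedback flows back through the stacked layers
optimizer.step()              # shift the weights toward better guesses
print(f"device={device}, loss={loss.item():.3f}")

Scaled up to millions of parameters and real images, this same guess-and-adjust loop is what the GPU clusters described above made practical.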


24.3.3 Practical applications bring the innovation to the real world

The first application of artificial neural networks was created by Marvin Minsky and Dean Edmonds in 1952 to simulate a rat finding its way through a maze. The Stochastic Neural Analog Reinforcement Calculator maze solver was the first proof that machine learning could be put to work. The system, however, was about the size of two pieces of carry-on luggage and contained only about 40 "neurons." Over the years, with the evolution of the building-block technologies, thousands of AI applications have tried to prove their real-world value. Healthcare has been one of the most attractive arenas given the presence of structured workflows, expansive databases, and the need for highly trained expertise both to aggregate data and to make clinical decisions. Architecting a new commercial market is not easy. It is even more challenging in a highly regulated industry, such as healthcare. The approach must delicately balance value creation, value delivery, and value capture, all while proving safety and efficacy. For start-ups, this means finding the fastest path to a minimum viable product that the customer is willing to pay for, all before the bank account hits zero or investor conviction is lost.
Value creation: defining the area in which a product can create value for a customer; how does the customer benefit?
Value delivery: defining the embodiment of the product; how will the product deliver the value, and will the customer accept it?
Value capture: ensuring that the value delivered can be captured to increase the value of your business; is the customer willing to pay for it?
To discern the areas of future opportunity, we should understand not only the players in the current landscape but also how they got there. To do this, a value-delivery-based segmentation overlaid onto a patient journey provides an excellent framework. Laying out a patient's journey through a disease allows us to categorize critical stages in the process (Fig. 24.3). From being healthy and pursuing proactive methods such as exercise and diet, to therapy and recovery, every step in the process is an area of possible AI value delivery. But the market variables need to be appropriately balanced to ensure that success can be achieved without too much risk and cost:
1. Proactive health
In 2017 the Centers for Medicare and Medicaid Services (CMS) pinned the cost of chronic diseases at $3.15 trillion, or about 90% of total US health-care costs.13 Proactive health encompasses any effort to prevent the onset of acute issues and chronic diseases. It could be an app that helps build healthy meal plans based on an individual's dietary preferences and impact. It could be a bedside device that monitors and provides feedback on sleep. The reality of chronic diseases is staggering. In the United States, 6 in 10 adults have at least one chronic illness, and 4 in 10 adults have two or more.14 In terms of opportunity, there is no more significant place where an AI innovation could create value and improve national health. However, this is likely the hardest area in which to deliver substantial value, because you are changing one of the hardest things in the universe: human behavior. Unfortunately, we are creatures of habit and convenience. Preventing the onset of chronic diseases means tackling their primary drivers: tobacco use, poor nutrition, lack of physical activity, and excessive alcohol use. Poor nutrition is the largest culprit in the bunch.


FIGURE 24.3 Value-based segmentation with FDA-approved projects. The diagram shows a patient’s journey through a disease that includes eight critical stages in the process. The examples include all projects approved by the FDA before January 2020.15


A driving risk factor for chronic diseases such as diabetes, Alzheimer's, and heart disease, obesity tallies up a total cost of $1.72 trillion, or 9.3% of total US GDP.16
2. Access
When you need a medical consultation, convenient and fast access to the right care provider can make all the difference to a patient's health. Both health-care providers and patients have quickly adopted online appointment scheduling tools. How can AI help improve access to the right health-care provider for the right patient? Could systems recommend providers based on communication attributes? Could AI systems automatically schedule proactive or preventative care appointments based on biometrics from your scale at home or your heart rate? One area that is currently being explored by AI researchers is automated appointment scheduling. "No-shows," late arrivals, and last-minute cancellations have been estimated to cost the US health-care system over $150 billion. A 2016 study found that in acute care settings the mean no-show rate was a staggering 18.8%. These researchers believe that monitoring variables such as seasonality and booking volume can better optimize patient appointment scheduling.17 Lightning Bolt, an AI solution from PerfectServe, has taken scheduling a few steps further. The PerfectServe AI system uses customized scheduling rules and automatically schedules clinical teams for shifts, on-call times, department meetings, and even OR time with a preferred OR team.
3. Acquisition
Obtaining the clinical data necessary to drive clinical decision-making can be laborious, time-consuming, and often full of inaccuracies. Data collection is almost always triggered by a clinician's decision to gather information, often in the form of lab tests, X-rays, and biopsies. In many cases, this need for data collection involves patients scheduling completely separate appointments for a blood test or diagnostic imaging, resulting in incredible delays in the data's acquisition. AI can be used in these areas to enable passive, real-time data collection, which can accelerate the time to reach a clinical decision. Recent surgical robotics companies have been investigating this opportunity. There are few places in healthcare where you have such a sophisticated piece of hardware with continuous access to a variety of patient types, diseases, and even operating environments. The opportunity that companies such as Auris Health and CMR Surgical have realized is that their systems can use this data to support clinical decision-making. Take laparoscopy, for example; with thousands of hours of laparoscopy footage at its disposal, could an AI learn to discriminate cancerous tissue from healthy tissue and tell a clinician whether it has all been removed?
4. Processing
Processing of collected data for diagnostic or therapeutic decision-making makes up the lion's share of AI innovations. We believe there are three primary reasons for this, which are discussed in the following:
a. Data standardization
Autonomous cars, facial recognition, and robotics all rely heavily on a system's ability to "see" like humans. This need created the field of computer vision, which has been accelerated by deep learning and artificial neural networks. AI has been successful with computer vision because of data and format standardization.


Every data stream comes in an expected format that can be reconstructed into a series of pixels, allowing key features to be taught and later recognized. These advancements have spilled over into healthcare, where they can be applied to another expected format, DICOM. Data standardization around DICOM has enabled companies to begin to train algorithms to identify clinical indicators quickly; the short sketch below illustrates the idea. Algorithms built to interrogate DICOM-formatted data structures account for just over 82% of the currently FDA-approved AI systems (Fig. 24.4).
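To make the DICOM point concrete, the following minimal sketch (an illustration, not any particular FDA-cleared product) reads a DICOM file with the pydicom library and normalizes its pixel data into the kind of plain numeric array an imaging algorithm expects; the file name is hypothetical.

import pydicom
import numpy as np

ds = pydicom.dcmread("chest_ct.dcm")        # parse the standardized DICOM structure
pixels = ds.pixel_array.astype(np.float32)  # image data as a plain numeric array

# Standardized metadata travels with the pixels, so preprocessing can stay generic.
print(ds.Modality, ds.Rows, ds.Columns)
pixels = (pixels - pixels.min()) / (pixels.max() - pixels.min() + 1e-8)  # scale to [0, 1]

Because every conforming scanner emits the same structure, the same few lines of preprocessing can feed images from many vendors into one training pipeline.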
b. Limited regulatory hurdles
AI systems developed to assist with processing clinical data can be configured and productized as clinical decision support (CDS). In September 2019 the FDA published draft guidance for CDS systems, which it describes as ". . . a tool that provides health care professionals and patients with knowledge and person-specific information, intelligently filtered or presented at appropriate times, to enhance health and health care."
These CDS tools are allowed to interrogate qualified data sets to provide focused data reports, summary alerts, patient data reports, automated alerts, and diagnostic support. In application, these systems typically observe large data sets to triage patients requiring closer scrutiny and leave the actual diagnosis up to the doctor.

FIGURE 24.4 FDA-approved AI products by data categories.18 AI, Artificial intelligence.


The regulatory hurdles to gaining approval for a CDS tool require only that it be trained and validated on relevant data sets. This relatively low bar has eliminated a lot of business and development risk, allowing for the rapid development of CDS tools, which account for 90% of the FDA-approved AI systems.18
c. Immediate need
For every 1 hour that a doctor spends with a patient, they spend 2 hours working in the EMR. With the acceleration of connected care and the adoption of EMR systems, physicians are on data overload. During a typical office day, physicians spend only about 27% of their total time working directly with patients. EMR data entry and interpretation account for 49% of their time, including an average of 1-2 hours of at-home EMR work.19 This data overload has driven health-care teams to the brink, with over 50% of providers complaining of burnout. Burnout is plaguing our health-care system. A recent survey by the Physicians Foundation found that over 78% of physicians have feelings of burnout and that they spend a staggering 23% of their time on nonclinical paperwork.20 Opportunities for technologies that automate data analysis can have a significant impact on the current health-care burden.
5. Inference
Drawing conclusions from analyzed data is the frontier of AI in healthcare. Its most significant limitation is the regulatory hurdles and gray areas of liability. If an AI makes an improper diagnosis that results in a patient's death, who is liable: the physician who used the AI product, the company that sold the AI product, or the hospital that paid for the AI product? The first company to cross this massive regulatory chasm is IDx, with the approval of its IDx-DR in April of 2018. This AI product is approved to "automatically detect more than mild diabetic retinopathy (mtmDR) in adults diagnosed with diabetes who have not been previously diagnosed with diabetic retinopathy." The system's value lies in broad patient screening for diabetic retinopathy, identifying it earlier in the disease progression during routine eye exams by optometrists. This reduces the workload on ophthalmologists, who shoulder the highest-volume surgical procedure every year, cataract surgery, and enables frontline eye care providers to identify the patients who need specialist consultation. The good news is that the hurdle for obtaining this approval was not insurmountable. The pivotal clinical study enrolled 900 patients across 10 primary care sites and yielded strong performance21:
a. Sensitivity: 87%
b. Specificity: 90%
c. Imageability: 96%
d. PPV (positive predictive value): 73%
e. NPV (negative predictive value): 96%
These figures hang together, as the short calculation after this paragraph illustrates. IDx-DR was cleared to operate on a proven hardware system, the Topcon NW400, a noninvasive imaging system. The only risks for the algorithm, then, were false negatives and false positives that would delay evaluation or trigger unnecessary additional medical care. The excellent performance of IDx-DR, weighed against its risk profile, enabled the company to get a nod from the FDA.
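As a sanity check on how these metrics relate, the short sketch below applies Bayes' rule to the reported sensitivity and specificity. The roughly 24% disease prevalence is an illustrative assumption about the screened study population, not a figure taken from the trial report.

# Relate predictive values to sensitivity, specificity, and prevalence.
sens, spec, prev = 0.87, 0.90, 0.24   # prevalence is an assumed, illustrative value

ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

print(f"PPV ~ {ppv:.0%}, NPV ~ {npv:.0%}")  # roughly 73% and 96%, matching the reported values

The broader point is that PPV and NPV depend on prevalence, so the same algorithm will look better or worse depending on the population it screens.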


The theme behind IDx-DR points to a significant opportunity for future AI health-care innovations. With the population of patients in the United States getting older and presenting with more comorbidities, the need for providers with both broad and deep experience keeps increasing. With the amount of schooling for a specialist doctor surpassing 15 years, it is becoming hard to support the needs of the escalating patient population. Technologies that enable nonspecialized clinicians to perform diagnostics, and even procedures, with clinical outcomes on par with those of their specialist counterparts are one solution to the problem. Often called the democratization of healthcare, these approaches have made a lot of promises starting in 2015, but few have been able to deliver.
6. Therapy
An AI system that delivers therapy remains a thing of the future. The FDA, physicians, and even patients are not ready to hand both decision-making and action on an intervention over to an algorithm. However, as with inference, the need to democratize therapeutics also exists. A first stepping stone along this path has been the area of surgical robotics. The orthopedic robotic surgery pioneer, Mako Surgical, was purchased by Stryker for $1.65 billion in 2013 on the success of the Rio platform. The Rio robotic system, although not powered by an AI, did intervene in therapeutic procedures; it prevented surgeons from cutting too much bone when preparing a knee for a prosthetic. AI-capable therapeutics are, without a doubt, in our future. With plans of becoming an interplanetary species by 2030, we will be traveling to places where a specialist will not always be available. Under the current laws of physics, it would not be prudent to perform a telepresence procedure, as the latency would be unbearable. The average time for a signal moving at the speed of light to traverse the distance between Earth and Mars is about 20 minutes, and in robotic surgery a delay of over 120 ms is enough to compromise a procedure. We therefore need to develop sophisticated AI machines that can handle some therapeutic tasks. They may start with simple procedures such as putting in stitches or repairing a broken finger with a custom cast. But they will evolve into much more complex procedures, such as amputation, hernia repair, and even cataract surgery. The key to AI in these applications is algorithms that leverage data from multiple sources. An autonomous AI surgical robot, for instance, would need to have its algorithms built on data from computer vision, hardware sensors, instrument localization, and precision kinematics.
7. Recovery
Following therapy, how do we support patient recovery to an improved quality of life? There is no place where it is more challenging to manage patient recovery than the intensive care unit (ICU). ICUs are the portions of the hospital whose focus is to support the recovery of burn, trauma, postsurgical, and critically ill patients. Intensivists, the clinicians who serve in these units, must juggle thousands of inputs in their clinical decision-making process. Temperature, pulse, blood pressure, respiratory rate, SpO2, pain, consciousness, and urine output are the eight vital signs that must be continuously monitored. On top of that, some 75 daily inputs come from various sources such as urinalysis and blood tests. Both the vital signs and the test results are logged over time, allowing the clinician to read the values and interpret how they have changed over time.


Clinical decision-making in the ICU comes down to recognizing patterns within the overwhelming amount of data. It is in situations like this that AI can excel (a minimal sketch of the idea follows this list). The commercial opportunity exists as well, as ICUs account for 15% of total hospital beds and $82 billion in health-care costs.22 Innovations that streamline patient recovery can reduce the analytical burden on clinicians, the patient's length of stay, and iatrogenic morbidities, and thus make a significant impact on those costs.
8. Preventative medicine
Once a patient is discharged and is at home, how can we help them stay healthy and prevent readmission? With the Affordable Care Act of 2010, CMS launched the Hospital Readmissions Reduction Program, which penalizes hospitals whose readmission rates are too high. Incredibly, in 2019, over 83% of facilities were penalized for not achieving these standards.23 The opportunity here lies with innovations that can remotely monitor patients, ensure compliance with discharge orders, and enable clinicians to intercept an issue before it reaches a point that requires hospitalization.
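Referring back to the ICU discussion under item 7, the following minimal sketch shows the kind of multi-signal pattern a decision-support tool scans for across continuously logged vitals. The hourly observations and hard-coded thresholds are fabricated for illustration only; a real system would learn such patterns from data and use validated clinical criteria rather than simple rules.

import pandas as pd

vitals = pd.DataFrame({                 # fabricated hourly observations
    "heart_rate":  [88, 92, 101, 118, 122],
    "systolic_bp": [118, 112, 104, 96, 88],
    "resp_rate":   [16, 18, 20, 24, 28],
    "temperature": [37.0, 37.4, 38.1, 38.6, 38.9],
})

# Count how many vitals trend abnormal at the same time (illustrative thresholds).
abnormal = (
    (vitals["heart_rate"] > 100).astype(int)
    + (vitals["systolic_bp"] < 100).astype(int)
    + (vitals["resp_rate"] > 22).astype(int)
    + (vitals["temperature"] > 38.0).astype(int)
)
vitals["alert"] = abnormal >= 3         # escalate when three or more signs co-occur
print(vitals)

The value of an AI model over a rule like this is that it can weigh hundreds of such signals, and their trends over time, simultaneously.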

24.3.4 Market adoption defines the success

After the commercial launch of a new application, adoption becomes one of the most important parameters for evaluating success. In the start-up and investment world, product-market fit is an essential term because it helps predict the potential adoption speed and rate before the official product launch. But in the health-care industry, having the right product that fits market needs is only a beginning. Regulation, patient safety, workflow, cost, infrastructure, human resources, and more can all be enormous barriers to a new technology being implemented within hospitals. You will not be surprised that healthcare seems stuck 15 years or more behind other industries when it comes to adopting new technologies. A well-known example in the health-care world is the electronic health record (EHR) system. It is not hard to understand why we would want to convert a paper system into a digital one, and the first EHR systems were used in the early 1960s. In 1991 the Institute of Medicine declared that all physicians needed to use EHRs by 2001 to improve healthcare. But only 18% of doctors were using an EHR system by 2001, because no actual law was passed. Meaningful adoption of EHRs did not become significant until after the federal government mandated adoption by passing the HITECH Act and paying out $21 billion to acute care hospitals to adopt the technology.24 The number of hospitals using electronic records grew from 9% in 2008 to 86.9% in 2015,25 a considerable change in less than a decade. The fast adoption of the EHR system is a foundation of AI in medicine because it generates a massive amount of data to feed the machine. But the HITECH program did not account for a critical need: sharing. Hospitals and doctors' offices generally remain unable to transfer electronic information to other hospitals and doctors' offices. Billions of dollars later, they are left printing out documents and faxing them. The fax machine, which was abandoned by most industries in the 1990s, remains medicine's dominant method of communication. Healthcare's adoption of AI applications faces the same challenges. As medicine is one of the industries in which AI can collect vast amounts of structured data and add clear value, the effort started long ago.


During the second AI boom, AI applications in medicine already made tremendous progress. Some of the exciting and pioneering applications, including MYCIN, VM, PUFF, INTERNIST, and CADUCEUS, were described in a Stanford paper.26 MYCIN used AI to diagnose bacteremia and then recommend the appropriate antibiotic with the correct dosage based on body weight. It was tested theoretically at Stanford and ended up performing better than physicians.27 However, MYCIN (named after the suffix of common antibiotics, i.e., erythroMYCIN) was never actually used in hospitals due to the lack of computer infrastructure in clinical settings back then. INTERNIST was a very ambitious program that, even by today's standards, was way ahead of its time. INTERNIST, as one could guess from the name, provided differential diagnoses in the field of internal medicine. It knew close to 500 different pathologies. Writing about INTERNIST in 1980, Feigenbaum wrote, Because INTERNIST is intended to serve a consulting role in medical diagnosis, it has been challenged with a wide variety of difficult clinical problems: cases published in the medical journals, clinical conferences and other interesting and unusual problems arising in the local teaching hospitals. In the great majority of these test cases, the problem formation strategy of INTERNIST has proved to be effective in sorting out the pieces of the puzzle and coming to a correct diagnosis, involving in some cases as many as a dozen disease entities27

Although INTERNIST was incredibly promising, it was far from perfect. Subsequent papers put out by the team who created INTERNIST (computer scientist Harry Pople and internist Dr. Jack Myers) showed that the system was not robust enough for regular clinical decision-making due to a host of frequent errors. They worked to address these errors in the next iteration.27 In some regards, CADUCEUS was version 2 of INTERNIST-1. It is estimated that CADUCEUS had a working knowledge of more than 500 different disease processes and 3400 specific disease characteristics.28 CADUCEUS also improved upon INTERNIST-1 by giving explanations of how it came to its conclusions. At the height of the second AI boom, in 1986, CADUCEUS was described as the "most knowledge-intensive expert system in existence."29 Indeed, these were exciting times for expert systems AI enthusiasts; however, good times do not last forever. These large expert systems were expensive, difficult to keep updated, and cumbersome for everyday medical professionals to use. Disappointed by the progress that had been made in the actual use of AI, funders aggressively cut AI project funding in the late 1980s, leading to the second AI winter. Most recently, in 2013, the MD Anderson Cancer Center launched a "moon shot" project: diagnose and recommend treatment plans for certain forms of cancer using IBM's Watson cognitive system. But in 2017 the project was put on hold after costs topped $62 million and the system had yet to be used on patients. At the same time, the cancer center's IT group was experimenting with using cognitive technologies to do much less ambitious jobs, such as making hotel and restaurant recommendations for patients' families, determining which patients needed help paying bills, and addressing staff IT problems. The results of these projects have been much more promising: the new systems have contributed to increased patient satisfaction, improved financial performance, and a decline in time spent on tedious data entry by the hospital's care managers. Despite the setback on the moon shot, MD Anderson remains committed to using cognitive technology, that is, next-generation AI, to enhance cancer treatment. It is currently developing a variety of new projects at its center of competency for cognitive computing.


The contrast between the two approaches is relevant to anyone planning AI initiatives. A survey of 500 US health industry leaders revealed that while most respondents indicated high levels of trust in AI for both clinical and administrative tasks overall, when asked to rank specific applications, they selected more administrative applications (62%) than clinical applications (38%). Interestingly, when asked which health-care applications, if any, they would feel comfortable having AI technology support, four of the five top-ranked applications were administrative.30 The "low-hanging fruit" projects that enhance administrative processes are more likely to be successful than clinical support projects, which have an impact on the clinical workflow.

24.3.5 Apply the framework to the current and future market

Walking through the framework on the development of AI in medicine, we get a good sense of the opportunities and hurdles that face AI innovations. We know what happened in the past two boom and bust cycles, and it is clear we are in the third. How close are we to the peak of the third boom? Or are we already past the peak? We can apply what we have just learned to try to figure it out.
Public attention and funding: Funding keeps pouring into AI projects, and there is no sign of slowing down. The buyers in the health-care system are also more positive, with budgets ready to implement AI applications. A survey in 2019 reveals a shift in funding expectations for AI-related projects, as leaders estimate their organizations will invest an average of $39.7 million over the next 5 years, $7.3 million more than last year's estimate.30 That may be because people now expect a positive return on investment (ROI) to take less time than previously thought, as little as 3 years in some cases. In the same survey, 50% of respondents hope to see tangible cost savings in 3 years or less as a result of investing in AI, compared with 31% in 2018. Among the respondent groups, more hospitals (55%) and health plans (52%) expect to see a positive return in less time, in 3 years or fewer, while life sciences executives (38%) anticipate it taking 5 years or longer. Technology companies such as Apple, Google, and Facebook have now jumped in and are pushing AI forward. These efforts have both intrigued and polarized the public, with tech thought leaders taking sides. On one end is Elon Musk, the entrepreneur and visionary, who called the prospect of AI "our greatest existential threat" in a 2014 interview with MIT students. Musk's conviction led him, famed physicist Stephen Hawking, Microsoft founder Bill Gates, and other thought leaders to publish an open letter warning of the dangers AI poses. On the other side are leaders such as Mark Zuckerberg, who is optimistic about a future where AI makes human life better and who called Musk's warnings about AI "pretty irresponsible." Regardless of their different beliefs about the future of AI, both have invested heavily in this field. Microsoft, Elon Musk, Reid Hoffman, and Sam Altman have invested over $1 billion in a company called OpenAI, with the hope of ensuring that AI is developed responsibly.
Innovation: Innovation is accelerating every day, and it is fueling the current AI curve. That fuel is composed of three factors, computing power, data, and algorithms, which have all gone through significant evolutions. The increase in computing power has followed Moore's Law for the past several decades.


Although Intel has delayed the launch of its 10 nm chips several times and slowed the process, the increase in computing power is still significant considering the base. And before Moore's Law fails, the industry already has a new hope: quantum computing. It has attracted huge interest at the national level, with funding from governments, and billions of dollars have poured in. Google and IBM compete fiercely to be at the leading edge of this new field. Regardless of the different approaches, the next goal is clear: to build a quantum computer that can solve real-world problems. Quantum computing may take more time than we hope to support AI development, but cloud-based computing is a real help today. The emergence of machine learning business models based on the use of the cloud is, in fact, a significant factor in why AI is taking off. Before the cloud, AI projects had high costs, but cloud economics have rendered specific machine learning capabilities relatively inexpensive and less complicated to operate. Thanks to the integration of cloud and AI, very specialized AI start-ups are exploding in growth. The quality of available data is often a barrier for businesses and organizations wanting to move toward AI-driven automated decision-making. But as the technology for simulating real-world processes and mechanisms in the digital domain has improved over recent years, accurate data has become increasingly available. Simulations have advanced to the stage where autonomous vehicle developers can gain thousands of hours of driving data without vehicles even leaving the lab, leading to considerable reductions in cost as well as increases in the quality of data that can be gathered. There is also an increase in accurate, real-time data captured in the health-care industry. Companies put more sensors on patients to directly obtain better structured, precise, and real-time data. The benefit is significant compared with pulling data from the EHR system, which requires communicating with the EHR and unifying data structures and may deliver delayed results. It is not easy for AI companies to add sensors to patients, though it becomes possible when combined with the strengths of medical device experience, and more start-ups are taking the approach of combining the two areas of expertise to revolutionize the field. Today, deep learning is the mainstream of AI algorithm development. The architectures will continue to grow in size and depth, produce more accurate results, and become better at mimicking human performance on tasks that involve data analysis. At the same time, methods for improving the efficiency of neural networks will also improve, and there will be more real-time and power-efficient networks running on small devices. But deep learning has its limitations, and the most important one is that it is data hungry. It takes alarming amounts of structured and labeled data to train one of these systems. Unfortunately, in many domains, there is not enough quality data to train robust AI models, making it very difficult to apply deep learning to solve problems. The good news is that we are beginning to see new approaches to improve on that. Researchers have started to test neurosymbolic AI, which combines statistical data-driven methods with powerful knowledge representation and reasoning techniques to yield more explainable and robust AI that can learn from less data.
There are also significant innovations in the area of so-called "AI for AI": using AI to help automate the steps and processes involved in the life cycle of creating, deploying, managing, and operating AI models, to help scale AI more widely into the enterprise. Innovation in AI never stopped, even in the deepest AI winter. With high enthusiasm and adequate funding, researchers will soon bring the game to the next level.
Applications: The burden the health-care system faces is not only the increasingly sick population but also the growing amount of data that comes with every patient.


In 2020 the world is expected to generate 2314 EB of health-care data.31 With more sick patients, more data, and fewer clinicians, the highest-impact solutions of the future will be about efficiency. These solutions will drive down the cost of clinical decisions, therapies, and recovery. We will also see an increasing number of them cross the barrier into inference and treatment. The result will be that clinicians no longer focus on routine tasks; they become more specialized, and their focus will ultimately improve the product: a healthier society. This narrative has played out time and time again in history with textiles, farming, and, shortly, driving. These have all been scenarios where machines were handed the routine work, at which they excel, while humans focus on creative problem-solving, at which we excel.
Adoption: The health-care industry is not known for the fast adoption of new technology. Still, in a recent survey, 53% of health-care executives say that the health-care industry is ahead of most other industries in AI adoption.32 According to 89% of respondents, AI is already creating efficiencies in their systems, and 91% say AI is increasing patient access to care. Participants' eagerness to adopt and use AI reflects the industry-wide belief that the technology has the potential to transform healthcare. However, 37% of respondents believe that the pace at which they are implementing AI is too slow, mainly because of factors related to cost and skill. Despite these hurdles, health-care leaders agree that AI will play an essential role in improving care delivery, with 90% of respondents saying they believe that AI will enhance the patient experience. The results show that once leaders address critical issues in implementation, the benefits of AI could outweigh the potential risks. While AI has already made some inroads in the back or middle office, patient access and care will ultimately see the most significant impact from AI through better diagnosing, treating, serving, and helping patients at every point of engagement. All four components seem to drive toward a definite conclusion about the bright future of AI. However, new risks have come to the surface, and well-known risks have reached a new level.

24.3.6 Patient privacy

Companies' overtures to major hospitals about data sharing have highlighted legal and ethical uncertainties about whether and how to undertake these relationships. One such partnership is now being challenged in court. In June 2019 a patient sued the University of Chicago Medical Center and Google for alleged misuse of patient EHR data.33 Arguably, the Health Insurance Portability and Accountability Act, the flagship health information privacy law in the United States, is showing its age more than 20 years after it was published in 1996. The world is calling for new regulation that can meet the increasing need to protect patient privacy. The European Union (EU)'s General Data Protection Regulation (GDPR) is the start of a new era of regulation. GDPR harmonizes data privacy laws across Europe, protects EU citizens, and reshapes the way organizations in the region approach data privacy. The Trump administration steered clear of heavy-handed regulation of private industry and suggested that federal agencies, such as the FDA, should take public input into account when regulating AI. Regulation will help the industry better understand how to work with health-care provider partners and manage patient data.


However, it will also raise the barriers to accessing data, which is already a massive challenge for AI companies today. It is hard to tell which factor will win, and whether increasing regulation will help or hurt AI in healthcare, but the trend will not stop, and AI companies need to get ready for a world without easy access to data.

24.3.7 Approving a moving target

To unleash the real power of AI in medicine, the latest AI solutions in the medical field are being designed to learn as they gain exposure to new patient data, honing their ability to make diagnoses, assist doctors, or suggest treatments. That means the capabilities, safety, and efficacy of some of the newest medical AI solutions cannot be assessed a single time for regulators to grant approval, because an AI's performance may be different the day after it has undergone testing. What is more, there is no telling whether the performance differences will make it work better or worse. That is why regulators such as the FDA have thus far only started to approve locked-algorithm solutions. When medical AI has the ability to learn, however, the existing approval processes no longer suffice. To tackle the problem, the FDA has already proposed a whole new regulatory framework for AI in medical applications.34 It would include a preapproval process that would allow manufacturers some leeway in what changes (or how much machine learning) would be permissible without reapproval. It would also require manufacturers to submit ongoing performance data to the agency so regulators can intervene if necessary. That, however, will require a drastic increase in the workforce at the FDA, and nobody is sure whether the agency will get it.

24.3.8 Accountability and transparency

It can be difficult or impossible to determine the underlying logic that generates the outputs produced by AI. This has become the black-box and transparency issue in AI. There is increasing concern about the black-box nature of AI algorithms as AI gains more influence over patients' healthcare. In contrast to black-box AI, explainable AI (XAI) is an essential component of human-centered AI that aims to expand the human experience with transparency instead of replacing it. It will be almost impossible to trust an AI algorithm, or the tools built on it, for critical decision-making if the process is opaque and no rationale is produced. XAI rests on the following two main components (one common technique is sketched below):
• Accountability, where users are aware of the technology behind the system and of how it reaches its conclusions. XAI will also be able to trace the path of reasoning.
• Auditability, where users can review the method used to analyze the data. XAI will provide the ability to test the processes and to refine them to close future loopholes.
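As one concrete illustration of what auditable reasoning can look like, the sketch below uses permutation importance, a generic model-agnostic explainability technique, to show which inputs drive a classifier's decisions. The dataset and model here are illustrative assumptions, not a clinical XAI product.

# Permutation importance: shuffle one feature at a time and measure how much
# the model's accuracy drops, exposing which inputs the model relies on.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)   # stand-in tabular data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])[:5]
for name, score in top:
    print(f"{name}: {score:.3f}")   # the five most influential features

Reports like this give users something to audit: if the features driving a prediction make no clinical sense, that is a signal to distrust the model.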
The EU GDPR states that data subjects have the right not to be subject to a decision based solely on automated processing that produces legal or similarly significant effects. It further says, "information provided to individuals when data about them are used should include the existence of automated decision-making, meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject."

However, the interpretation of the automated decision-making regulation in the GDPR has triggered a vivid debate among legal scholars. Finally, the Article 29 Working Party has confirmed that the scope of Article 22 should be interpreted extensively, and its guidelines on profiling and automated decision-making provide individuals with more transparency and accountability.35 An underlying issue in all three risks above is regulation, which is not a surprise at all and brings the topic back to why healthcare is a unique industry. As AI in medicine starts to leap from administrative tasks to diagnosis support, it is a natural progression that the public experiences the next level of concern. After all, understanding how AI helps to make an appropriate appointment is far less stressful than understanding how AI decides whether a tumor is benign or malignant. The EU published its own AI regulatory proposals in February 2020, calling for strict rules and safeguards on risky applications of the rapidly developing technology.36 Regulation is a double-edged sword in healthcare. There is an apparent reason for the health-care industry to be highly regulated. Still, if developers are restricted too much in the approaches they can take and in the protection of their work, innovation stops. To solve that problem, regulators are going to have to strike a delicate balance. That is going to require regulators across a variety of agencies to bring high-level AI developers into the fold, as they will be the only ones qualified to figure out what the AI solutions are doing and why. Those developers will also need a medical or research background to be able to comprehend the medical aspects of the technology. That in itself is a problem, because there are few experts who can satisfy both requirements at present, and there is no existing program designed to produce such experts. One final risk bears mentioning: the nirvana fallacy, which posits that problems arise when policymakers and others compare a new option to perfection rather than to the status quo. Health-care AI faces risks and challenges. But the current system is also rife with problems. Doing nothing because AI is imperfect creates the risk of perpetuating a problematic status quo.

24.4 Business opportunity of artificial intelligence in medicine

By this point in the chapter, you can feel the temperature of AI in medicine. But will the temperature keep increasing, or will it drop suddenly? Healthcare is only one applied field of AI, and looking at the bigger picture of the AI industry helps reveal the overall landscape. Fig. 24.5 shows the three directional possibilities in which AI may head. The most important driver for keeping growth high is fast adoption. The early expert systems faded not because they were unexciting or useless; they faded due to a lack of commercial success. It was not until IBM's Watson that large-scale expert systems became commercially viable. And even Watson was a peripheral line, intended more to burnish the IBM brand than to be a core revenue generator. The worst situation is that the cycle repeats itself, fails to live up to the promises, and the AI world hits another winter. Today, there is a lot of hype surrounding AI. You may have heard the news already: the AI bubble is getting ready to burst. Plenty of media outlets have published articles about it, and if you Google "AI winter" or "AI bubble," the evidence starts leaping off the screen.


FIGURE 24.5 Future trends of AI. AI, Artificial intelligence.

application's fate depends on the solution and the execution. As Warren Buffett likes to put it, "only when the tide goes out do you discover who has been swimming naked." The question becomes: will the current AI revolution deliver on its promises? The self-driving car is such a prominent topic in AI, because of the technology, investment, and commercial effort involved, that it is often discussed alongside AI itself. But it is those very self-driving cars that are causing scientists to sweat the possibility of another AI winter. Because when it comes to self-driving cars, the future was supposed to be now. In 2020 you will be a "permanent backseat driver," the Guardian predicted in 2015. "10 million self-driving cars will be on the road by 2020," blared a Business Insider headline from 2016. Those declarations were accompanied by announcements from General Motors, Google's Waymo, Toyota, and Honda that they would be making self-driving cars by 2020. Elon Musk forecast that Tesla would do it by 2018, and then, when that failed, by 2020. But the year is here, and self-driving cars are not. There are many reasons for the delay, such as the need for vast amounts of training data, technical challenges, and regulatory concerns. The self-driving car is only one prominent example of the risks that the AI industry has to face. So, will the underdelivering reality drive the field to another AI winter? The skeptics have seen the field's boom-and-bust periods before, and some feel sure it is going to happen again. They can point to the recurrent pattern that has troubled the field since 1965: early, dramatic success followed by sudden, unexpected difficulties. Coming back to AI in medicine, the situation is similar: AI has raised very high expectations of what it can do. Some reports and articles get a lot of press by describing AI algorithms tested against real doctors and "beating" the human doctors repeatedly. The debate over whether doctors are going to be "replaced" by AI algorithms is also a polarizing one. But what studies consistently show is that when the doctors took the machines' suggestions, the results were better than either the doctor or the machine alone. A combination of AI and a doctor is better than either of them. Since the first human civilizations took shape, doctors


have been ever-present members of human communities, curing diseases and caring for the ill. With advances in science and technology, their methods have evolved from praying to mystic gods to mixing herbal concoctions to using scientific methods and advanced medical equipment to diagnose, treat, and prevent diseases. Today, doctors can easily save their patients from diseases that killed millions of people in past centuries. There is still room for so much more improvement, and the health-care industry is desperate for better solutions in many fields. AI is fueling those hopes and bringing in much excitement. One of them is that AI can help to achieve the Triple Aim. In 2008 researchers at the Institute for Healthcare Improvement described the Triple Aim as simultaneously "improving the individual experience of care, improving the health of populations; and reducing the per capita costs of care for populations."37 The balanced pursuit of the Triple Aim is not congruent with the current business models of any but a tiny number of US health-care organizations. For most, only one, or possibly two, of the dimensions is strategic, but not all three. Thus we face a paradox concerning the pursuit of the Triple Aim. Today, AI is being used to help health-care organizations make sense of vast amounts of data and equip them to succeed under value-based arrangements. With AI applications, health plans and providers have a better chance to achieve these three goals together. With all these exciting opportunities, the industry needs to be careful in setting expectations and timelines, because another AI winter is always possible if we repeat the pattern of overpromising and underdelivering. At least one fundamental principle that all can agree to is that AI tools are meant to augment, not replace, doctors. Doctors may never be replaced by AI or any technology, as long as the core value of a doctor is humanity: "to cure sometimes, to relieve often, to comfort always." Today, it is true that there is a lot of hype surrounding AI. It is also true that there are AI vendors selling snake oil instead of focusing on real-world implementation. But the significant difference is that AI is already part of life now, and that never happened before. For the first time ever, AI is part of how companies work, how we interact with our schedules, and how we set the alarm at home or send messages to our family. That gives it new value beyond the shortcomings to which critics point. While there might be a cooling off of research and investment, as there inevitably is with technology waves, we are hopeful that this time the field does not go into hibernation. AI is reshaping the whole health-care industry. Bill Gates has a quote that fits perfectly with AI in medicine: "Most people overestimate what they can do in one year and underestimate what they can do in ten years." In the next few years the impact of AI will be smaller than the wildest hopes now suggest. But in the longer term, the effect will be significant, even revolutionary. If you are a provider, a payor, an investor, or an entrepreneur in the health-care industry and have not yet jumped on the fast-moving train, now may be the last opportunity to get a seat at the table.

References

1. International Data Corporation. Worldwide semiannual artificial intelligence systems spending guide. 2019.
2. CBInsights. Global healthcare report Q3. 2019. <https://www.cbinsights.com/research/report/healthcaretrends-q3-2019/>.


3. Accenture. Artificial intelligence (AI): healthcare's new nervous system. 2017. <https://www.accenture.com/fi-en/insight-artificial-intelligence-healthcare>.
4. Rebecca Lin & Jeff Alvarez. An AI boom-bust cycle. January 8, 2020. <https://rlin.online/ai-boom-bust-cycle/>.
5. McCarthy J, Minsky ML, Rochester N, Shannon CE. A proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955. AI Mag 2006;27(4):12.
6. Letter from Robert S. Morison to John McCarthy, 1955 November 30. <https://rockfound.rockarch.org/digital-library-listing/-/asset_publisher/yYxpQfeI4W8N/content/letter-from-robert-s-morison-to-john-mccarthy1955-november-30>.
7. McCredie J. Herbert A. Simon, 1916-2001. EDUCAUSE Review Magazine, Volume 36, Number 3, May/June 2001.
8. LIFE Magazine, Nov 20, 1970.
9. Nilsson N. Oral history of Edward Feigenbaum, June 20, 2007. CHM Reference number X3896.2007, Computer History Museum.
10. JIPDEC. Proceedings of the international conference on fifth generation computer systems, October 19-22, 1981. FGCS International Conference (Japan Information Processing Development Center; 1981). <http://www.jipdec.or.jp/archives/publications/J0002118>.
11. Feigenbaum EA, McCorduck P. The fifth generation: artificial intelligence and Japan's Computer Challenge to the World. Reading: Addison-Wesley; 1983. p. 1-3.
12. Garvey C. Broken promises & empty threats: the evolution of AI in the USA, 1956-1996. Technology's Stories, 12 March 2018. <www.technologystories.org/ai-evolution/#_ftnref30>.
13. CDC. About chronic diseases. <https://www.cdc.gov/chronicdisease/about/costs/index.htm>.
14. Buttorff C, et al. Multiple chronic conditions in the United States. <https://www.rand.org/content/dam/rand/pubs/tools/TL200/TL221/RAND_TL221.pdf>.
15. Rebecca Lin & Jeff Alvarez. Value-based segmentation: FDA approved AI projects. February 6, 2020. <https://rlin.online/fda-approved-ai-projects/>.
16. Waters H, Graf M. America's obesity crisis: the health and economic costs of excess weight. <https://milkeninstitute.org/reports/americas-obesity-crisis-health-and-economic-costs-excess-weight>.
17. Kheirkhah P, Feng Q, Travis LM, et al. Prevalence, predictors and economic consequences of no-shows. BMC Health Serv Res 2015;16:13. Available from: https://doi.org/10.1186/s12913-015-1243-z.
18. Rebecca Lin & Jeff Alvarez. FDA approved AI projects by category. February 6, 2020. <https://rlin.online/fdaapproved-ai-projects-by-category/>.
19. Sinsky C, et al. Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties. Annals of Internal Medicine 2016;165(11):753-60.
20. Merritt Hawkins on behalf of The Physicians Foundation. 2018 Survey of America's physicians: practice patterns & perspectives. <https://physiciansfoundation.org/research-insights/the-physicians-foundation-2018-physician-survey/>.
21. IDx-DR, FDA de novo 510(k) decision summary. <https://www.accessdata.fda.gov/cdrh_docs/reviews/DEN180001.pdf>.
22. Halpern N, Pastores S. Critical care medicine beds, use, occupancy, and costs in the United States: a methodological review. Crit Care Med 2015;43(11):2452-9.
23. <https://khn.org/news/hospital-readmission-penalties-medicare-2583-hospitals/>.
24. Slabodkin G. HITECH proves pivotal to hospital EHR adoption. 2017. <https://www.healthdatamanagement.com/news/hitech-incentives-prove-pivotal-to-hospital-ehr-adoption>.
25. <https://dashboard.healthit.gov/quickstats/pages/physician-ehr-adoption-trends.php>.
26. Feigenbaum EA. "Expert systems in the 1980s." State of the art report on machine intelligence. Maidenhead: Pergamon-Infotech; 1981.
27. Yu VL, et al. Antimicrobial selection by a computer. A blinded evaluation by infectious diseases experts. JAMA 1979;242(12):1279-82.
28. Miller RA, Pople Jr HE, Myers JD. Internist-I, an experimental computer-based diagnostic consultant for general internal medicine. N Engl J Med 1982;307(8):468-76.
29. Banks G. Artificial intelligence in medical diagnosis: the INTERNIST/CADUCEUS approach. Crit Rev Med Inform 1986;1(1):23-54.
30. The second OptumIQ annual survey on AI in health care. <https://www.optum.com/about/news/2019-executive-survey.html>.


31. Stanford Medicine Health Trends Report 2017: harnessing the power of data in health. <https://med.stanford.edu/content/dam/sm/sm-news/documents/StanfordMedicineHealthTrendsWhitePaper2017.pdf>.
32. Jessica Kent. "53% of Execs Say Healthcare Leads Artificial Intelligence Adoption." <https://healthitanalytics.com/news/53-of-execs-say-healthcare-leads-artificial-intelligence-adoption>.
33. Complaint, Dinerstein v Google, No. 1:19-cv-04311 (Ill 2019), June 26, 2019. <https://www.courtlistener.com/docket/15841645/dinerstein-v-google-llc/>.
34. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD)—discussion paper and request for feedback. <https://www.fda.gov/files/medical%20devices/published/US-FDA-Artificial-Intelligence-and-Machine-Learning-Discussion-Paper.pdf>.
35. Article 29 Working Party. Guidelines on automated individual decision-making and profiling for the purposes of regulation. 2018. <https://ec.europa.eu/newsroom/article29/item-detail.cfm?item_id=612053>.
36. Brussels, 2020. White paper on artificial intelligence—a European approach to excellence and trust. <https://ec.europa.eu/info/sites/info/files/commission-white-paper-artificial-intelligence-feb2020_en.pdf>.
37. The IHI Triple Aim. <http://www.ihi.org/Engage/Initiatives/TripleAim/Pages/default.aspx>.


C H A P T E R

25 Outlook of the future landscape of artificial intelligence in medicine and new challenges

Lei Xing, Daniel S. Kapp, Maryellen L. Giger and James K. Min

Abstract
Basic and technical aspects and important applications of artificial intelligence (AI) in medicine (AIM) have been presented in previous chapters. Given the broad scope and incredible depth of potential impact that AI promises to bring, it is quite clear that medicine is on the verge of a revolution. Looking ahead, much needs to be done to optimize the pathway for clinical translation and to maximally utilize the capacity of AI to benefit the well-being of patients. In this final chapter we highlight some important trends in research and development in AIM and their implications for the future of health care. In particular, the urgent demands for advanced big data analytics, practically feasible data curation and sharing schemes, quantitative imaging tools, and more intelligent and broadly applicable machine learning algorithms will be elaborated. New opportunities and challenges in AIM will also be discussed.

Keywords: Artificial intelligence; medicine; deep learning; machine learning; big data

25.1 Overview of artificial intelligence in health care

The principles of evidence-based medicine (EBM), originally defined as the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients, play a significant role in modern medicine and health-care policy-making in the United States.1,2 In reality, despite the enormous success of EBM over the years, there remains a huge gap between theory and practice, as evidenced by various problems and concerns about the quality, efficiency, and cost of current clinical practice. Artificial intelligence (AI), which consists of agents or computer algorithms that automatically learn from past experience to perform predictions or classifications, is evolving rapidly and promises to transform the way predictive models are established and applied. The recent surge in AI in medicine (AIM) affords tremendous opportunities to


FIGURE 25.1 A summary of important aspects of AIM. AIM, Artificial intelligence in medicine.

augment EBM and substantially improve patient care.3-21 Previous chapters have provided an outstanding summary of AIM. In Fig. 25.1 we highlight important aspects relevant to AIM and the interrelationships between different technologies and applications of AI in health care. While digital- and big-data-driven medicine is of great significance to health care, it is useful to emphasize that AI is a technical tool and, in our opinion, it will probably remain in its ancillary and synergistic role in the foreseeable future. Its potential to dramatically change the landscape of health care, ranging from disease prevention, screening, diagnosis, and treatment planning and delivery to therapeutic follow-up, arises from the unique ability of the technology to extract features from data and provide inferences that are most consistent with the training data. Of the five important topics in AI (Fig. 25.1), machine learning/deep learning (DL) plays the most fundamental role in AIM applications because of its ubiquitous use in dealing with all types of data, from genomics, demographics, imaging, video, text, and lab measurements to audio. Because of the strong ability of AI models in nonlinear mapping, the technique can, in principle, bridge the gap between any two types of data (input and output) that are inherently related. AI is exceedingly good at identifying patterns, analyzing complex problems, and making predictions and decisions. To better illustrate AI research and applications, for convenience, here we divide machine learning into three major categories (Fig. 25.2) according to the characteristics of input and output data; these are elaborated with examples in the following sections.


FIGURE 25.2 Machine learning algorithms as categorized according to the characteristics of input and output data.

25.1.1 Models dealing with input and output data from the same domain

This is perhaps the simplest situation for DL, as there is no domain change between input and output data. Many tasks in AIM belong to this category. Indeed, the list of applications of DL in this category is extensive, and new types of applications continue to appear in the literature. To name a few, we mention superresolution imaging, superresolution dose calculation, microscopy image processing and denoising, imputation of genomic data, image inpainting, image registration, and bidirectional encoder representations from transformers in natural language processing (NLP). In medical imaging the image quality is a balance between imaging time, spatial resolution, temporal resolution, and patient dose or photodamage when X-ray or optical photons are involved. In current practice, each imaging event is done independently. DL provides an effective way to incorporate prior knowledge attained in imaging previous subjects to obtain better images. In superresolution imaging, for example, we can obtain high-resolution images from low-resolution ones by leveraging previous imaging data. The technique has been extensively employed to enhance image quality,22 radiation dose calculation efficiency,23 and pathological and microscopic images. Notably, much has been accomplished toward data-driven designs combining microscopy and deep neural networks to achieve what neither could accomplish alone.24-27 For a review, we refer the readers to Ref. [28]. To further elaborate the utility of DL in this category, we highlight a practically valuable computed tomography (CT) application. While CT accounts for over 62 million clinical adult scans in the United States each year and represents one of the most important imaging


FIGURE 25.3 Conventional and DL-DECT images for an abdominal case. The first and second images display the raw 100 and 140 kV CT. The third and fourth columns show the DL-derived 140 kV images and their differences with respect to the corresponding raw 140 kV images. For comparison, the fifth and sixth columns show the difference images between the raw DECT images, and the difference images between the raw 100 kV and DL-derived 140 kV images. CT, Computed tomography; DECT, dual-energy computed tomography; DL, deep learning.

modalities in modern medicine,29 conventional CT imaging with a single energy is incapable of providing material composition information because different tissues may lead to the same Hounsfield units (HUs). Dual-energy CT (DECT), with physical means of simultaneously generating and measuring photon signals of two different energies, is designed to provide a solution to the problem. DECT takes advantage of the energy dependence of the linear attenuation coefficients of the tissue to yield material-specific images, such as blood, iodine, or water maps. The approach, however, adds an extra layer of complexity on top of the widely used single-energy CT (SECT) and increases the system cost and patient radiation dose, hindering the widespread clinical application of DECT. A data-driven strategy of obtaining DECT images from SECT images was pioneered by Zhao et al.30 The approach allows us to obtain dual- or even multiple-energy CT images without any additional measurement beyond a SECT acquisition. Fig. 25.3 shows an example of conventional and DL-DECT images obtained using the DL method. The technique provides a simple and cost-effective spectral CT solution for a wide spectrum of biomedical applications, ranging from disease diagnosis and proton therapy treatment planning to assessment of therapeutic response.
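To make this same-domain mapping concrete, the following is a minimal sketch, written in PyTorch, of a fully convolutional image-to-image network of the kind that could in principle be trained to map a 100 kV SECT slice to a synthetic 140 kV image. The class name, layer widths, loss choice, and training loop are illustrative assumptions only and do not reproduce the architecture of Zhao et al.

```python
import torch
import torch.nn as nn

class SECTtoDECTNet(nn.Module):
    """Toy fully convolutional mapper: 100 kV slice -> synthetic 140 kV slice.
    Real models (e.g., deeper U-Net variants) would be considerably larger."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Predict a residual so the network only learns the energy-dependent change.
        return x + self.body(x)

def train_step(model, optimizer, low_kv, high_kv):
    """One supervised step on registered (100 kV, 140 kV) slice pairs, both [B, 1, H, W]."""
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(low_kv), high_kv)  # L1 is a common image-mapping loss
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    net = SECTtoDECTNet()
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    low, high = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)  # stand-ins for real pairs
    print(train_step(net, opt, low, high))
```

The same input-to-output pattern, trained on a different pairing of images, would serve equally for superresolution or denoising tasks in this category.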

25.1.2 Deep learning as applied to problems with input and output related by physical/mathematical law

It is often seen that input and output domains are related by a physical or mathematical relation. While remarkable results have been accomplished by using model-based approaches to tackle this type of problem, there are unmet demands for data-driven strategies to augment the traditional techniques, especially when the model input is noisy or incomplete. Machine learning algorithms are particularly valuable in integrating prior system knowledge to yield reliable output solutions. For this reason, DL has found valuable applications in biomedical instrumentation, such as imaging devices, microscopes, and interventional guidance systems. Because of the known physical/mathematical relationship between the input and output, an important advantage of working in this category is that the collection or generation of training datasets can often be achieved by computer simulation. In CT image reconstruction, for example, an image is traditionally obtained via the mathematical inversion of the encoding function of the imaging wave for a given set of


measurements.31 A prerequisite for artifact-free inversion is the satisfaction of the classical Shannon-Nyquist theorem in angular-data sampling, which requires a certain number of sensory measurements and imposes a practical limit on imaging time and imaging dose. Furthermore, missing sensory data may occur in the presence of implanted metallic objects (such as hip prostheses). For tomographic imaging with ultrasparse sampling, Shen, Zhao, and Xing designed a hierarchical neural network and developed a structured training process for DL to generate 3D CT images from 2D projections.31 For this purpose a feature-space transformation between a 2D projection and a 3D volumetric CT image within a representation-generation framework is employed. By using the transformation module, they transfer the representations learned from the 2D projection to a representative tensor for 3D volume reconstruction in the subsequent generation network (Fig. 25.4). Fig. 25.5 shows some examples of ultrasparse CT reconstruction. Various DL techniques have been applied to other imaging modalities, such as MRI,32-34 ultrasound,35,36 PET,37,38 and optical imaging.39,40 Zhu et al.32 proposed a framework for MR image reconstruction by recasting the problem as a data-driven supervised learning task. Superior immunity to noise and reduction in reconstruction artifacts were observed as compared with conventional handcrafted reconstruction methods. Recently, Wu et al.41 investigated a

FIGURE 25.4 3D image reconstruction with ultrasparse projection-view data. (A) A geometric view of an X-ray source, a patient and a detector in a CT system. (B) X-ray projection views of a patient from three different angles. (C) Different image-reconstruction schemes in the context of prior knowledge and projection sampling. (D) Volumetric image reconstruction using deep learning with one or multiple 2D projection images. CT, Computed tomography. Source: From Shen L, Zhao W, Xing L. Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning. Nat Biomed Eng 2019;3:880-8.

FIGURE 25.5 Examples from the abdominal CT and lung CT cases. (A-D) Images reconstructed by using 1 (A), 2 (B), 5 (C), and 10 (D) projection views. Predicted and difference images between predicted and ground truth are shown. CT, Computed tomography. Source: From Shen L, Zhao W, Xing L. Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning. Nat Biomed Eng 2019;3:880-8.

data-driven strategy to derive quantitative parametric maps from a single qualitative MR image with automatic compensation for magnetic/radiofrequency field inhomogeneity. As is well known, conventional MRI is qualitative in nature, and this presents a bottleneck for quantitative image analysis and precision medicine. The approach, named Q2MRI, promises to derive qualitative and quantitative MRI simultaneously without changing the standard imaging


protocol. This is imperative for comparing signal intensity across subjects, time points, or imaging centers, and for drawing reference from prior imaging data. Interestingly, by incorporating the physics of the Bloch equations, they can also retrospectively tune the tissue contrast of MRI,42 making the entire landscape of MRI contrast readily accessible.
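The representation-generation idea can be illustrated with a highly simplified sketch: a 2D encoder compresses a projection into a feature map, the features are reinterpreted as a 3D tensor, and 3D transposed convolutions generate a volume. All layer sizes, the reshaping step, and the class name are assumptions for illustration; this is not the published network of Shen, Zhao, and Xing.

```python
import torch
import torch.nn as nn

class ProjectionToVolume(nn.Module):
    """Toy network mapping one 2D projection (64 x 64) to a small 3D volume (32 x 64 x 64)."""
    def __init__(self):
        super().__init__()
        self.encoder2d = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 32 -> 16
        )
        self.decoder3d = nn.Sequential(
            nn.ConvTranspose3d(4, 8, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 8,16,16 -> 16,32,32
            nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1),                         # -> 32,64,64
        )

    def forward(self, proj):                              # proj: [B, 1, 64, 64]
        feat = self.encoder2d(proj)                       # [B, 32, 16, 16]
        feat3d = feat.view(feat.size(0), 4, 8, 16, 16)    # "transformation module": channels -> depth
        return self.decoder3d(feat3d)                     # [B, 1, 32, 64, 64]

if __name__ == "__main__":
    volume = ProjectionToVolume()(torch.randn(1, 1, 64, 64))
    print(volume.shape)  # torch.Size([1, 1, 32, 64, 64])
```

Training pairs for such a model can often be generated by computer simulation, for example by forward-projecting existing CT volumes to obtain matching 2D projections, in keeping with the advantage of this category noted above.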

25.1.3 Models with input and output data domains related by empirical evidence or measurements

A majority of clinical decision-making models belong to this category.43-47 Practically, AI and DL are particularly useful in facilitating various decision-making processes by learning complex relationships and incorporating existing knowledge into the inference model.48 Machine learning models are trained with measurement data annotated by experienced professionals such as physicians. Because of their ability to learn from big data across different domains and to integrate knowledge from various sources and diverse disciplines, AI models can potentially go beyond any individual's comprehension and provide a new health-care paradigm with significantly improved decisions. Ibragimov et al.46 proposed a neural network-based paradigm for prediction of liver stereotactic body radiation therapy (SBRT)49 outcomes. In this model a patient's 3D images and dose delivery plans are fed, together with other input variables such as patient demographics, quantified abdominal anatomy, history of liver comorbidities, and liver function tests, into a multi-path neural network to predict post-SBRT survival and local cancer progression (Fig. 25.6). The network was able to identify the critical-to-spare liver regions and the critical clinical features associated with the highest risks of negative SBRT outcomes. For another example, Courtiol et al.43 developed MesoNet (Fig. 25.7) to accurately predict the overall survival of mesothelioma patients from whole-slide digitized images, without any

FIGURE 25.6 A schematic illustration of the proposed framework for deep learning-based prediction of liver SBRT outcomes. SBRT, Stereotactic body radiation therapy. Source: From Ibragimov B, et al. Neural networks for deep radiotherapy dose analysis and prediction of liver SBRT outcomes. IEEE J Biomed Health Inform 2019. Available from: https://doi.org/10.1109/JBHI.2019.2904078.


FIGURE 25.7 MesoNet layout. Mesothelioma histology slides were collected from the MESOPATH/MESOBANK database and TCGA. First, a machine learning model is trained to predict patient overall survival using 2300 WSIs randomly chosen from a total of 2981 slides taken from MESOBANK, without expert-derived annotations. The model was then tested on the remaining 681 slides from MESOBANK and validated on 56 slides from TCGA. A predictive score was given to each tile of interest, positively or negatively associated with survival (see heatmap). Extremal tile filtering allowed the more informative tiles to be retained for use by the model in achieving patient survival prediction (shaded area over the curve represents CI). TCGA, The Cancer Genome Atlas. Source: From Courtiol P, et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat Med 2019;25:1519-25.


pathologist-provided locally annotated regions. MesoNet was validated on both an internal validation cohort from the French MESOBANK and an independent cohort from The Cancer Genome Atlas. It was demonstrated that the model was more accurate in predicting patient survival than current pathology practices. Furthermore, MesoNet identified regions contributing to patient outcome predictions, mainly located in the stroma and corresponding to histological features associated with inflammation, cellular diversity, and vacuolization. These findings suggest that DL models can identify new features predictive of patient survival and potentially lead to new biomarkers. Another important example is AI-augmented biomarker discovery and drug design. A critical challenge in biomarker discovery arises from the need to find correlative relationships between biological measurement data, which may be terabytes in size, noisy, and even incomplete, and the phenotypes. Advances in high-throughput omics techniques, including the recent surge of revolutionary single-cell sequencing,50 necessitate the development of machine learning-based inference models to identify molecular patterns associated with disease status and subtypes and to interpret/predict disease phenotypes.51 A closely related application is drug development and drug repurposing.52-62 As is well known, development of a new drug is generally a complex and challenging task, requiring seamless aggregation of information and data from various sources, including the underlying biology, biochemistry, biomarker candidates, preclinical tests, and clinical trials.63 In the United States, on average, it takes pharmaceutical companies 12 years and a cost of 2.6 billion dollars to develop a new drug.63-65 Given its transformative ways of learning from big data and integrating information from diverse sources, AI is becoming an indispensable part of every major step in modern drug development, ranging from target identification and selection, lead discovery, candidate selection, and rapid synthesis to preclinical and clinical evaluations. The technique promises to accelerate the process of drug discovery and/or repurposing with much improved efficiency and reduced cost. Segler et al.,66 for example, have employed Monte Carlo tree search and symbolic AI to discover retrosynthetic routes. The model solves for almost twice as many molecules, 30 times faster than the traditional computer-aided search method. For another example, Zhavoronkov et al.60 developed a generative tensorial reinforcement learning (GENTRL) technique for de novo small-molecule design. The technique was used to discover potent inhibitors of discoidin domain receptor 1, a kinase target implicated in fibrosis and other diseases.60 AI promises to assist us in deciphering biological mechanisms of disease not easily revealed by traditional approaches and to facilitate decision-making in drug discovery. The technique may lead not just to a faster and less expensive drug discovery process, but to the delivery of better treatments for various diseases.
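The multi-path idea behind models such as the liver SBRT outcome network can be sketched as a small multi-input classifier: one branch processes a volumetric image/dose tensor, a second branch processes tabular clinical variables, and the two feature vectors are fused for an outcome prediction. The branch sizes, variable count, and class name below are illustrative assumptions, not the published Ibragimov et al. or MesoNet architectures.

```python
import torch
import torch.nn as nn

class MultiPathOutcomeNet(nn.Module):
    """Toy multi-path model: 3D image + dose channels plus clinical variables -> outcome logit."""
    def __init__(self, n_clinical=10):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv3d(2, 8, 3, stride=2, padding=1), nn.ReLU(inplace=True),   # channel 0: CT, channel 1: dose
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),                            # -> [B, 16]
        )
        self.clinical_branch = nn.Sequential(nn.Linear(n_clinical, 16), nn.ReLU(inplace=True))
        self.head = nn.Linear(16 + 16, 1)                                     # fused features -> logit

    def forward(self, image_dose, clinical):
        fused = torch.cat([self.image_branch(image_dose), self.clinical_branch(clinical)], dim=1)
        return self.head(fused)            # train with nn.BCEWithLogitsLoss against observed outcomes

if __name__ == "__main__":
    net = MultiPathOutcomeNet()
    logit = net(torch.randn(2, 2, 32, 32, 32), torch.randn(2, 10))
    print(logit.shape)  # torch.Size([2, 1])
```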

25.1.4 Applications beyond traditional indications

Some health-care applications of AI have been elaborated in the previous chapters. The application of AI goes, of course, much beyond these. In practice, even though some are still in their early stages, AI has been applied ubiquitously to numerous clinical decision-making tasks in almost all disciplines of medicine.6-8,11-18,20,54,67-99 Because of its unprecedented ability to integrate and aggregate information from diverse sources, AI should benefit any clinical decision-making task that is inaccurate or inefficient, or both, in diagnosis, treatment


planning, prognosis, and outcome analysis. AI can also help to visualize objects that are invisible to humans,100,101 with superior performance. Coupled with robotics, computer vision, NLP, smartphones, wearable devices, etc., the entire landscape of health care is being transformed. Indeed, important applications have also been seen in screening and disease prevention,20,102-105 aging and longevity,106 interventions11,100,101,107,108 and robotic interventions,109-111 sports trauma prediction,112 implants,113 smart homes,114 rehabilitation,115-118 elderly care,114,119,120 support of clinical trials,121 patient scheduling and hospital management,122 quality and safety,123-128 global health and prevention, surveillance, and rapid response to infectious diseases,129-131 precision training of health-care professionals,132-134 and so forth. Recently, Chen et al.111 reported a portable robotic device capable of introducing needles and catheters into deformable tissues such as blood vessels to draw blood or deliver fluids autonomously. Robotic cannulation is driven by predictions from a series of deep convolutional neural networks that encode spatiotemporal information from multimodal image sequences to guide real-time servoing. It is intriguing that the device can outperform humans and improve success rates and procedure times compared to manual cannulations by trained operators, particularly in challenging physiological conditions. Given the general trend of data-driven automation and the enormous momentum of AI in changing the way that people integrate information and make decisions, we believe that AI will fuel wide-ranging innovation in the future and have a profound impact on health care.

25.2 Challenges ahead and issues relevant to the practical implementation of artificial intelligence in medicine

While colossal advancements have been made recently in the field of AI, truly intelligent and ubiquitously applicable AI solutions for medicine have yet to come. The challenges are multiple, from technology, data curation/sharing and security, workflow and integration, and clinical evaluation and implementation to social, economic, political, ethical, and legal aspects. These are highlighted next.

25.2.1 Technical challenges

DL and AI are powerful tools and especially good at identifying patterns from data and making predictions for complicated situations. Current AIM research is primarily focused on automation, classification, regression, and detection. As it is, however, AI technology is far from ideal and does not act even close to human-like reasoning. If the technology is to make a genuine impact and gain widespread adoption in clinical practice, AI must become less artificial and more intelligent than it is today. The lack of intelligence in current AI algorithms is reflected in many aspects of current neural network design. To a large extent, many current algorithms are more a nonlinear fitting of the data than an ideal intelligent machine. In the current formulation, for example, there is a huge gap between the loss function, which measures the discrepancy between the model prediction and the ground truth and guides the search for optimal network parameters, and the decision metric used for evaluation of the model prediction and decision-making. Similar to any decision-making problem such as inverse treatment planning,135,136 there is generally no one-size-fits-all loss function for training DL algorithms, and various functions have been proposed


FIGURE 25.8 Tumor detection and segmentation results for testing lung cases. In (A), the top and bottom rows correspond to a large- and a small-sized tumor example, respectively. In (B), distributions of Dice coefficients across tumor sizes are shown. D, E, T, G (D), and G (E) correspond to the Dice loss, effectiveness loss, Tversky loss, generalized loss based on a deterministic approach, and generalized loss based on an exploratory approach, respectively.

for different or even the same applications. An important but less appreciated issue is that minimizing a predefined loss function alone does not always yield truly optimal prediction.135-137 For instance, the primary building blocks of classification evaluation metrics, true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP), are often combined into a scalar loss function. A popular metric for semantic segmentation tasks is the Dice similarity coefficient, the harmonic mean of precision (P = TP/[TP + FP]) and recall (R = TP/[TP + FN]). While partially accounting for the imbalance related to the low prevalence of positive-class pixels in individual images, it does not truly lead to optimal prediction.138-141 Seo et al.137 constructed a generalized loss function that combines multidimensional metrics into a differentiable scalar measure suitable for traditional gradient-based optimization and developed a general strategy with an adaptive training methodology. The strategy effectively reduces the mismatch between the loss function and actual decision-making metrics and, when applied to cancer tumor detection and segmentation, it improves network performance by up to 10% compared with the current state of the art. In Fig. 25.8 we show the distribution of the Dice coefficient across all tumor sizes in lung test datasets. The approach can consistently detect and segment small-sized tumors that would have remained undetected otherwise. Fig. 25.8B shows the performance of all methods for small- and intermediate-sized tumors in the dataset. On the algorithmic level, DL modeling is rather data intensive, task specific, and brute force in nature. For example, while most medical data, such as genomic and imaging data, are highly redundant, little has been done to leverage that fact to come up with smart learning schemes for various clinical applications. On a more fundamental level, many have argued that a purely data-driven approach may not be the best way to tackle practical problems. Transfer learning and wisdom-informed learning with incorporation of existing scientific knowledge142 provide more rational solutions for meeting the clinical challenges from all angles. How to make AI more intelligent, robust, scalable, and interpretable is still an active area of research, and there are no doubt daunting challenges ahead in creating AI tools that can understand the complex reality captured by human thought.
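The mismatch between training loss and evaluation metric discussed above can be made concrete with a soft (differentiable) Dice loss, which approximates the Dice similarity coefficient so that it can be minimized by gradient descent. The sketch below is a generic binary formulation, not the generalized loss of Seo et al.

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    """Differentiable Dice loss for binary segmentation.
    probs:  [B, 1, H, W] predicted foreground probabilities (after a sigmoid)
    target: [B, 1, H, W] binary ground-truth mask
    Dice = 2*TP / (2*TP + FP + FN); the loss is 1 - Dice, averaged over the batch."""
    dims = (1, 2, 3)
    intersection = (probs * target).sum(dims)          # soft true positives
    denominator = probs.sum(dims) + target.sum(dims)   # soft 2*TP + FP + FN
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1.0 - dice.mean()

if __name__ == "__main__":
    logits = torch.randn(4, 1, 64, 64)
    mask = (torch.rand(4, 1, 64, 64) > 0.9).float()    # sparse foreground, as for a small tumor
    print(float(soft_dice_loss(torch.sigmoid(logits), mask)))
```

Even with such a metric-aware loss, the quantity optimized during training is still not the clinical decision metric itself, which is exactly the gap the generalized-loss work aims to close.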


25.2.2 Data, data curation, and sharing

Machine learning modeling relies heavily on a plethora of high-quality data to extract features and to facilitate clinical decision-making. The topic has been reviewed extensively in previous chapters. Thus only a few technical issues will be touched on here. First, it should be noted that some research and progress have been made in machine learning with small training datasets.51 This type of effort should reduce the need to curate large amounts of data. Second, we would like to emphasize the importance of standardization and quality control in data curation, which forms the basis for data sharing. As noted by Verghese et al.,143 bad data can be amplified into worse predictive models. In the modern big data era, researchers have access to an abundance of biomedical data across many modalities. On the other hand, medical knowledge and technology are evolving with time. Thus attention must be paid to ensure that all the data represent the state of the art. Data annotation is generally a labor-intensive process, as a large number of manual annotations must be performed by well-trained individuals. This topic has also been discussed extensively. For certain engineering tasks, data augmentation and the use of synthetic data with automated annotation may be viable choices.100,144,145 For some applications, there are commercial entities that offer annotation or crowdsourcing services. Recently, Mak et al.146 investigated whether crowd innovation could be used to rapidly produce AI solutions that replicate the accuracy of an expert radiation oncologist in segmenting lung tumors. AI algorithms could improve cancer care globally by transferring the skills of expert clinicians to underresourced health-care settings. Finally, we mention that the use of semisupervised or even unsupervised learning, when applicable to the specific problems under study, provides a practically valuable solution to reduce or mitigate the need for annotation. For example, Islam and Xing147 have recently applied unsupervised learning for exploration of distinct patterns in various biomedical data without reliance on any data annotation. This may find widespread application in tasks such as data visualization, compression, exploration, and classification, and promises to fundamentally change the ways that big data are analyzed.
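For the data augmentation option mentioned above, a minimal sketch is shown below: a few geometric and intensity perturbations applied to a normalized 2D slice to multiply a small labeled dataset. The transform choices and parameter ranges are illustrative assumptions using torchvision; in practice they must be vetted so that clinically meaningful content is not distorted.

```python
import torch
from torchvision import transforms

# Modest augmentation pipeline for single-channel image tensors scaled to [0, 1].
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.95, 1.05)),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
])

if __name__ == "__main__":
    image = torch.rand(1, 128, 128)                            # stands in for a normalized slice
    batch = torch.stack([augment(image) for _ in range(8)])    # 8 augmented variants of one image
    print(batch.shape)                                         # torch.Size([8, 1, 128, 128])
```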

25.2.3 Data and potential bias in artificial intelligence

The disparities in patient outcome based on socioeconomic and possibly genetic factors in the United States are well known. For example, the inferior outcomes of African-Americans for most cancers have been widely documented, particularly for gynecologic cancers.148-150 Use of patient data with inherent and changing racial disparities in outcomes as input to AI models can potentiate and propagate any underlying biases. For example, a recent study showed that a widely used algorithm employed to identify and help patients with more complex health needs resulted in a lower percentage of Black patients receiving additional help. The bias arose because the algorithm predicted health-care costs rather than illness; for a given risk score, Black patients were sicker than White patients.151 Until bias-free data can be accumulated, AI approaches need to attempt to correct for such biases. One attempt at designing machine learning algorithms to ensure that the algorithms do not exhibit undesirable behavior has been proposed.152 A second


and somewhat related problem is the (mis)application of algorithms based on databases of patient populations that were not included in the initial datasets. An example was the application of US-based treatment algorithms to foreign populations not well represented in the initial datasets. The IBM product "Watson for Oncology" has been criticized, possibly for this reason.153 In addition, recent studies have demonstrated genetic differences in tumors of the same site of origin between different racial groups. The inclusion of diverse populations in the datasets used in the development of AI algorithms is essential to assure their appropriate application in such diverse populations.154 Otherwise there is a risk of inappropriate treatment recommendations being assigned by the AI algorithms. Data has been called the new oil. Patient data is valuable to many vendors for its use related to individual patient care and product placement. While wide data sharing is needed to assure the development of unbiased and broadly usable AI applications, more has to be done in data security and confidentiality to assure the privacy of health-care data. Patients have to be informed from the onset as to who has access to which of their data, what it will be used for, how it will be safeguarded, and so forth.
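One practical, if partial, safeguard against the biases described above is to audit a model's behavior separately for each demographic subgroup before deployment. The sketch below assumes scikit-learn-style arrays of outcomes, risk scores, and subgroup labels; it is a starting point for such an audit, not a complete fairness analysis.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_audit(y_true, y_score, group):
    """Report sample size, AUC, and mean predicted risk per subgroup.
    Large gaps between subgroups are a signal to re-examine the prediction target
    (e.g., cost vs. illness) and the representativeness of the training data."""
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    report = {}
    for g in np.unique(group):
        m = group == g
        auc = roc_auc_score(y_true[m], y_score[m]) if len(np.unique(y_true[m])) > 1 else float("nan")
        report[str(g)] = {"n": int(m.sum()), "auc": auc, "mean_risk": float(y_score[m].mean())}
    return report

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 200)                                  # observed outcomes
    score = np.clip(0.6 * y + rng.normal(0.2, 0.2, 200), 0, 1)   # toy model risk scores
    print(subgroup_audit(y, score, rng.choice(["A", "B"], 200)))
```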

25.2.4 Workflow and practical implementation

Clinical implementation of AI is still in its early stage, and it remains to be tested whether AI can live up to our expectations of solving clinical problems, reducing cost, and, more importantly, accomplishing what humans cannot do or cannot easily accomplish. Indeed, we still have a long way to go to truly step into an era of AI-augmented practice. For clinical implementation a number of specific issues must be considered. Recently, Fihn et al.155 and He et al.156 reviewed some key problems surrounding the integration of AI into clinical workflows, including data sharing and privacy, transparency of algorithms, data standardization, interoperability across multiple platforms, and patient safety. He et al.156 also summarized the current regulatory environment in the United States and highlighted comparisons with other regions of the world, notably Europe and China. In the past few years, the US Food and Drug Administration (FDA) has approved a number of AI software products, such as breast density analysis, detection of diabetic retinopathy, personalized diabetes decision support, smart glucose monitoring systems, image analysis and enhancement, autosegmentation of anatomical structures, patient surveillance and predictive algorithm platforms, X-ray wrist fracture detection, autism diagnosis, seizure monitoring, computed tomographic angiography (CTA)-based large-vessel occlusion (LVO) stroke management, and risk assessment of coronary artery disease. Internet searches of these topics should direct the readers to the corresponding websites for more information. For obvious reasons the early products are primarily focused on patho-image or signal processing-based applications. But AI technologies are rapidly making their way into other applications for analysis of images, bioassay data, NLP, EMR data mining, drug discovery, and more. The lack of transparency of AI algorithms has long been recognized, and this may hinder their acceptance by clinicians. If the doctor does not understand how the AI tool made a given recommendation, he or she might be reluctant to apply it. Showing how a more established modeling approach results in a similar but less precise recommendation than a given AI algorithm may help its acceptability. For example, a random survival


forest model was developed to assist in treatment assignment in patients with a rare subgroup of uterine cancers (for which it would be unlikely ever to have enough patients for a randomized controlled trial). The AI-based algorithm has been shown to identify a significantly higher number of patients who may benefit from treatment than a more "conventional" Cox regression model, helping doctors to understand and accept the AI-based model.157 Furthermore, doctors may be unwilling to implement an algorithm that restricts their autonomy. Having the doctors "buy in" to the study at the onset may help assure acceptability once the product has been developed, which is clearly important for the execution and realization of cutting-edge AI technologies. Medical treatment decisions continue to be made (primarily) by medical doctors. If the doctors are not convinced to use the developed AI tools, then much effort will be wasted. For example, a machine learning model was developed for improving the efficiency of the operating room environment at a prominent children's hospital. However, reportedly there have been some difficulties in implementing this scheduling system.158

25.2.5 Clinical tests

Considerable progress has been made in testing the clinical performance of AI algorithms. Most of these studies have been nonrandomized, used retrospectively collected materials, and evaluated diagnostic findings from video images, radiological studies, histopathology specimens, and patient monitoring. They compared determinations based on AI-generated algorithms with the "gold standard" of evaluations by an expert or a team of experts. These studies have spanned many areas of medical diagnosis, and representative examples have included (1) a seminal study using deep convolutional neural networks for the diagnosis of melanoma from a set of digital images of skin lesions compared with the diagnoses of 21 doctors159; this was followed by a plethora of related studies on the diagnosis of dermal melanomas (for review see Ref. [160]); (2) the diagnosis of diabetic retinopathy from retinal fundus photographs using a DL-trained algorithm compared to at least 7 board-certified ophthalmologists161; (3) a DL-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs compared to the performance of 18 physicians from four different subgroups, including four subspecialty-trained thoracic radiologists162; and (4) determination of lymph node metastases in women with breast cancer from whole-slide images, evaluating a panel including 7 DL algorithms compared to a panel of 11 pathologists in a simulated time-constrained diagnostic setting.7 All of these and similar studies demonstrated equivalent or superior performance of the AI algorithm compared with the expert opinions. Studies have also demonstrated the added value of AI algorithms to experts' assessments in the interpretation of chest radiographs, classification of skin lesions, and cancer detection on mammograms. None of the abovementioned investigations, however, were randomized prospective trials (RPTs), a standard measure of clinical benefit, and often the endpoint (e.g., diagnosis of a benign or malignant condition) of these investigations was not validated on long-term follow-up. In fact, there have been few RPTs comparing AI algorithms prospectively with the gold standards of diagnosis.163 One was an unblinded RPT that compared a machine learning algorithm to standard care to predict intraoperative hypotension in patients undergoing elective noncardiac surgery. A significant reduction in both the median


time-weighted average of hypotension and the median time of hypotension was reported for the algorithm-based group compared to the control (standard care) patients.164 In a second example, a DL algorithm for the detection of polyps during colonoscopy was studied in an open, nonblinded RPT. Patients were prospectively randomized to undergo diagnostic colonoscopy with or without the assistance of a real-time automatic polyp detection system (which provided a simultaneous visual notice and sound alarm on polyp detection). The AI system significantly increased the detection rate as well as the mean number of adenomas detected per patient.165 Additional RPTs attempting to validate AI algorithms in medical diagnostic and treatment decision-making are urgently needed to provide support for FDA approval, insurance company reimbursement, and medical community adoption of these algorithms.

25.2.6 Economical, political, social, ethical, and legal aspects

Health-care costs in the United States are the highest per capita in the world, at 18% of GDP. They have risen at a rate greater than our economic growth rate and are hardly sustainable. Prior studies have shown that medical technology contributed between 27% and 48% of the increase in health-care costs between 1960 and 2007.166 Much attention needs to be given to cost-effectiveness analyses at the time of development of AI in medicine applications. One needs to address questions such as: what will this technology cost to develop and use? Will it replace a more expensive current technology or be used in addition to it? What is a realistic estimate of patient gain from the AI use (fewer complications, better survival, time savings in starting treatment or shortening treatment, etc.)? The rapid growth and continuous innovation of AI should be assessed in parallel with their impact on our society.167 Indeed, the potential impact of AI on our lives and society today is enormous. Many reports have been compiled based on the analysis of available sources by experts in the field.167-177 In particular, the authors of Chapter 23, Regulatory, Social, Ethical, and Legal Issues of Artificial Intelligence in Medicine, have outlined the challenges and impacts of AIM.

25.2.7 Education and training

Training and education present another grand challenge in AIM. In AI-augmented health care, it is foreseeable that a clinical decision will need to be made together with a human who can understand and communicate the computer findings to the patient. There is an urgent need to train current and next-generation health-care professionals in data science and AI to enable them to become part of this emerging revolution in medicine and to benefit patient care.178-181 Wartman and Combs have argued that future physicians' skill sets must also include collaborating with and managing AI applications, and that an overhaul of medical school curricula is due, with a focus on knowledge management (rather than information acquisition), effective use of AI, improved communication, and cultivation of empathy.180 The proposed educational efforts would facilitate the translation of cutting-edge AI technology into clinical practice, which includes reliable collection of data and evidence for


model training, integration of AI into the clinical workflow, proper conduct, and clinical evaluation of the tools.

25.3 Future directions and opportunities

Thanks to years of research in machine learning, computer vision, expert systems, robotics, and NLP, we can now begin to harness AI research to help physicians, patients, and researchers to prevent, diagnose, and fight disease and injury. Looking ahead, enormous opportunities exist in advancing AIM to the next level. Here we highlight some important directions of research and discuss ongoing efforts in turning technical and clinical roadblocks into opportunities. As it stands today, an AI model behaves like a black box that makes decisions with little transparency and interpretability. Indeed, lack of context and interpretability has long been known to be a major deficiency of data-driven approaches. Much effort is devoted to improving the situation so that a user can account for the AI decisions and explain them to physicians and patients. By and large this represents part of the overall effort to make AI more trustworthy and generalizable. Computationally, a comprehensive understanding of the knowledge extraction, representation, and inference processes in neural networks is a prerequisite for truly explainable AI. Toward this goal, a better visualization and characterization of the feature-domain data are essential, as this would not only allow us to better understand the properties of the feature domain but also utilize the information to guide the design and training process of deep neural networks.182 In addition, efforts such as building human-centered AI designs are laudable and should facilitate the achievement of this goal.183,184 In this way, human intelligence and AI work synergistically for veritable prospects of improved medicine. The next generation of AIM should be much less "artificial" and more intelligent than it is today. While techniques such as transfer learning and multitask modeling are frequently discussed, in practice a task-specific machine learning model is still required for each task in many circumstances, even if the problem to be solved is obviously similar to many other tasks when judged by humans. An important topic under investigation is how to incorporate existing scientific knowledge into data-driven inference modeling. How to transition from data-intensive tasks toward wisdom-informed decisions142,185 is an important direction to explore, not only for AIM but also for the entire AI field. An important application of AI in medicine is to enhance the performance of existing clinical instrumentation with simplified system design, reduced cost, and autonomous workflow. Because the datasets used for training machine learning models can often be obtained from physical simulation (e.g., Monte Carlo simulation of X-ray photon transport for imaging and radiation therapy186), access to large amounts of training data is less of a problem in this type of application. AI also enables some novel applications that would otherwise be impossible. There are initiatives in developing on-device AI or embedded AI, with machine learning running on the end device. Future medical equipment, quality assurance of the systems, and patient management may be significantly simplified by integrating or embedding AI.30,128 From a clinical point of view, solutions that improve the patient care workflow, enhance and/or extend the utility of clinically acquired data, help to support clinical decisions such as disease


detection/diagnosis/treatment, and therapeutic assessment of response will always be welcome, and this will greatly enhance the armamentarium of AI. Furthermore, algorithmic and infrastructural innovations are urgently needed to leverage clinical data acquired with special features, such as longitudinal patient data and/or multimodal patient data. New opportunities and breakthroughs may also exist in developing multi-input models to better leverage the complementary values of different types of clinical data, such as lab tests, demographic data, text, language, pathology, radiology, genomics, and other biological data. Inclusion of data from higher dimensions and/or different modalities is highly desirable to build more accurate and robust DL models.182 Prospective clinical trials of AI-augmented procedures are sparse today, but they should be promoted to gain objective perspectives about the efficacy of novel AI technologies. The challenges in developing an ecosystem for data curation, annotation, and sharing should be turned into opportunities. Open-source data and software should be encouraged. We emphasize that, for many applications, unsupervised learning may be a viable choice to obtain DL models without the hassle of manually labeling the data. Finally, deployment and security of AI solutions are also critically important to the future of AIM.
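As one concrete illustration of the deployment step, a trained PyTorch model can be packaged into a self-contained TorchScript file that runs without the Python training code, for example inside a C++ runtime embedded in a device or scanner console. The model, file name, and input size below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# Tiny stand-in model; in practice this would be a validated clinical network.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)
model.eval()

example_input = torch.randn(1, 1, 128, 128)
scripted = torch.jit.trace(model, example_input)   # trace into a TorchScript program
scripted.save("embedded_model.pt")                 # portable artifact for on-device inference

reloaded = torch.jit.load("embedded_model.pt")     # reload (LibTorch offers the C++ equivalent)
print(reloaded(example_input).shape)               # torch.Size([1, 1])
```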

25.4 Summary and outlook

The last few years have witnessed unprecedented growth in AIM. Novel use cases and successful examples of AI applications have begun to emerge in research labs and industry, and the landscape of health care is being rapidly changed by these new developments. As with any emerging technology, when AI meets medicine there will be unforeseen issues and challenges, not only in technical and clinical applications but also in social, ethical, economic, and legal aspects. Currently, AIM focuses primarily on automation, classification, detection, and regression in the context of weak AI, but the implications of general AI for health care are even more appealing and critically important for the future of AIM. Coupled with emerging technical advances such as quantum computing and human-computer interfaces, AIM promises to provide a new paradigm by learning from big data beyond any individual physician's comprehension, integrating evidence and knowledge from diverse sources and disciplines, and optimizing clinical decision-making. Finally, we emphasize that AI alone is only a technical tool; the new AIM health-care paradigm of EBM should be built upon synergistic collaboration between health-care professionals and computer agents.

References

1. Evidence-Based Medicine Working Group. Evidence-based medicine. A new approach to teaching the practice of medicine. JAMA 1992;268:24205. 2. Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn't. BMJ 1996;312:712. 3. Xing L, Krupinski EA, Cai J. Artificial intelligence will soon change the landscape of medical physics research and practice. Med Phys 2018;45:17913. 4. Galimova RM, Buzaev IV, Ramilevich KA, Yuldybaev LK, Shaykhulova AF. Artificial intelligence—developments in medicine in the last two years. Chronic Dis Transl Med 2019;5:648. 5. Gruson D, Helleputte T, Rousseau P, Gruson D. Data science, artificial intelligence, and machine learning: opportunities for laboratory medicine and the value of positive regulation. Clin Biochem 2019;69:17.

6. Im H, et al. Design and clinical validation of a point-of-care device for the diagnosis of lymphoma via contrast-enhanced microholography and machine learning. Nat Biomed Eng 2018;2:66674. 7. Ehteshami Bejnordi B, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 2017;318:2199210. 8. Krittanawong C, Zhang H, Wang Z, Aydar M, Kitai T. Artificial intelligence in precision cardiovascular medicine. J Am Coll Cardiol 2017;69:265764. 9. Liang H, et al. Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence. Nat Med 2019;25:4338. 10. Luo H, et al. Real-time artificial intelligence for detection of upper gastrointestinal cancer by endoscopy: a multicentre, case-control, diagnostic study. Lancet Oncol 2019;20:164554. 11. Shkolyar E, et al. Automated cystoscopic detection of bladder cancer using deep-learning. Eur Urol 2019;76:71418. 12. Stead WW. Clinical implications and challenges of artificial intelligence and deep learning. JAMA 2018;320:11078. 13. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25:4456. 14. Visvikis D, Cheze Le Rest C, Jaouen V, Hatt M. Artificial intelligence, machine (deep) learning and radio (geno)mics: definitions and nuclear medicine imaging applications. Eur J Nucl Med Mol Imaging 2019;46:26307. 15. Wong TY, Bressler NM. Artificial intelligence with deep learning technology looks into diabetic retinopathy screening. JAMA 2016;316:23667. 16. Benjamins JW, Hendriks T, Knuuti J, Juarez-Orozco LE, van der Harst P. A primer in artificial intelligence in cardiovascular medicine. Neth Heart J 2019;27:392402. 17. Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating artificial and human intelligence: a partnership for responsible innovation in biomedical engineering and medicine. OMICS 2020;24. 18. Niazi MKK, Parwani AV, Gurcan MN. Digital pathology and artificial intelligence. Lancet Oncol 2019;20:e25361. 19. Topol E. Deep medicine: how artificial intelligence can make healthcare human again. New York: Hachette Book Group; 2019. 20. Yu KH, Kohane IS. Framing the challenges of artificial intelligence in medicine. BMJ Qual Saf 2019;28:23841. 21. Wainberg M, Merico D, Delong A, Frey BJ. Deep learning in biomedicine. Nat Biotechnol 2018;36:82938. 22. Liu H, et al. Learning deconvolutional deep neural network for high resolution (HR) medical image reconstruction. Inf Sci (NY) 2018;468:14254. 23. Dong P, Xing L. Deep DoseNet: a deep neural network for accurate dosimetric transformation between different spatial resolutions and/or different dose calculation algorithms for precision radiation therapy. Phys Med Biol 2020;65:035010. 24. Rivenson Y, et al. Virtual histological staining of unlabelled tissue-autofluorescence images via deep learning. Nat Biomed Eng 2019;3:46677. 25. Wu Y, et al. Three-dimensional virtual refocusing of fluorescence microscopy images using deep learning. Nat Methods 2019;16:132331. 26. Rivenson Y, et al. PhaseStain: the digital staining of label-free quantitative phase microscopy images using deep learning. Light Sci Appl 2019;8:23. 27. Wang H, et al. Deep learning enables cross-modality super-resolution in fluorescence microscopy. Nat Methods 2019;16:10310. 28. De Haan K, Rivenson Y, Wu Y, Ozcan A. Deep-learning-based image reconstruction and enhancement in optical microscopy. Proc IEEE Inst Electr Electron Eng 2020;108:3050. 29. Brenner DJ, Hall EJ. 
Computed tomography—an increasing source of radiation exposure. New Engl J Med 2007;357:227784. 30. Zhao W, Lv T, Chen Y, Xing L. Revealing tissue compositions with a single-energy computed tomography and deep learning. Med Phys 2020, under review. 31. Shen L, Zhao W, Xing L. Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning. Nat Biomed Eng 2019;3:8808. 32. Zhu B, Liu JZ, Cauley SF, Rosen BR, Rosen MS. Image reconstruction by domain-transform manifold learning. Nature 2018;555:48792. 33. Mardani M, et al. Deep generative adversarial neural networks for compressive sensing MRI. IEEE Trans Med Imaging 2019;38:16779.

34. Wu Y, et al. Incorporating prior knowledge via volumetric deep residual network to optimize the reconstruction of sparsely sampled MRI. J Magn Reson Imaging 2019;63. Available from: https://doi.org/10.1016/j. mri.2019.1003.1012. 35. Solomon O, et al. Deep unfolded robust PCA with application to clutter suppression in ultrasound. IEEE Trans Med Imaging 2019. Available from: https://doi.org/10.1109/TMI.2019.2941271. 36. Cao W, et al. Application of deep learning in quantitative analysis of 2-dimensional ultrasound imaging of nonalcoholic fatty liver disease. J Ultrasound Med 2020;39:519. 37. Haggstrom I, Schmidtlein CR, Campanella G, Fuchs TJ. DeepPET: a deep encoder-decoder network for directly solving the PET image reconstruction inverse problem. Med Image Anal 2019;54:25362. 38. Chen KT, et al. Ultra-low-dose (18)F-florbetaben amyloid PET imaging using deep learning with multicontrast MRI inputs. Radiology 2019;290:64956. 39. Weigert M, et al. Content-aware image restoration: pushing the limits of fluorescence microscopy. Nat Methods 2018;15:10907. 40. Goy A, et al. High-resolution limited-angle phase tomography of dense layered objects using deep neural networks. Proc Natl Acad Sci USA 2019;116:1984856. 41. Wu Y, et al. Quantitative magnetic resonance imaging from a single image using deep learning. Nat Biomed Eng 2020. under review. 42. Wu Y, et al. Exploiting the entire spectrum of MRI contrast by combined use of deep learning and physics equations. Radiology 2020. under review. 43. Courtiol P, et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat Med 2019;25:151925. 44. Ardila D, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 2019;25:95461. 45. Coudray N, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med 2018;24:155967. 46. Ibragimov B, et al. Neural networks for deep radiotherapy dose analysis and prediction of liver SBRT outcomes. IEEE J Biomed Health Inform 2019. Available from: https://doi.org/10.1109/JBHI.2019.2904078. 47. Ibragimov B, Toesca D, Chang D, Koong A, Xing L. Development of deep neural network for individualized hepatobiliary toxicity prediction after liver SBRT. Med Phys 2018;45:476374. 48. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436. 49. Timmerman R, Xing L. Image guided and adaptive radiation therapy. Baltimore, MD: Lippincott Williams & Wilkins; 2009. 50. Trapnell C, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 2014;32:3816. 51. Li R, Xing L, Napel S, Rubin D. Radiomics and radiogenomics: technical basis and clinical applications. Abingdon: Taylor & Francis Books, Inc.; 2019. 52. Moridi M, Ghadirinia M, Sharifi-Zarchi A, Zare-Mirakabad F. The assessment of efficient representation of drug features using deep learning for drug repositioning. BMC Bioinform 2019;20:577. 53. Issa NT, Stathias V, Schurer S, Dakshanamurthy S. Machine and deep learning approaches for cancer drug repurposing. Semin Cancer Biol 2020. 54. Alvarez-Machancoses O, Fernandez-Martinez JL. Using artificial intelligence methods to speed up drug discovery. Expert Opin Drug Discov 2019;14:76977. 55. Ekins S, et al. Exploiting machine learning for end-to-end drug discovery and development. Nat Mater 2019;18:43541. 56. Zhu H. Big data and artificial intelligence modeling for drug discovery. 
Annu Rev Pharmacol Toxicol 2020;60:57389. 57. Fleming N. How artificial intelligence is changing drug discovery. Nature 2018;557:S557. 58. Ching T, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018;15. 59. Zeng X, et al. deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics 2019;35:51918. 60. Zhavoronkov A, et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat Biotechnol 2019;37:103840. 61. Mamoshina P, Vieira A, Putin E, Zhavoronkov A. Applications of deep learning in biomedicine. Mol Pharm 2016;13:144554.

62. Mak KK, Pichika MR. Artificial intelligence in drug development: present status and future prospects. Drug Discov Today 2019;24:77380. 63. Hughes JP, Rees S, Kalindjian SB, Philpott KL. Principles of early drug discovery. Br J Pharmacol 2011;162:123949. 64. Avorn J. The $2.6 billion pill—methodologic and policy considerations. N Engl J Med 2015;372:18779. 65. Paul SM, et al. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat Rev Drug Discov 2010;9:20314. 66. Segler MHS, Preuss M, Waller MP. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 2018;555:60410. 67. Lynn LA. Artificial intelligence systems for complex decision-making in acute care medicine: a review. Patient Saf Surg 2019;13:6. 68. Ahmad OF, et al. Artificial intelligence and computer-aided diagnosis in colonoscopy: current evidence and future directions. Lancet Gastroenterol Hepatol 2019;4:7180. 69. Wang R, et al. Artificial intelligence in reproductive medicine. Reproduction 2019;158. 70. Acs B, Rimm DL. Not just digital pathology, intelligent digital pathology. JAMA Oncol 2018;4:4034. 71. Adir O, et al. Integrating artificial intelligence and nanotechnology for precision cancer medicine. Adv Mater 2019;32:e1901989. 72. Akay A, Hess H. Deep learning: current and emerging applications in medicine and technology. IEEE J Biomed Health Inform 2019;23:90620. 73. Buch VH, Ahmed I, Maruthappu M. Artificial intelligence in medicine: current trends and future possibilities. Br J Gen Pract 2018;68:1434. 74. Ferroni P, Roselli M, Zanzotto FM, Guadagni F. Artificial intelligence for cancer-associated thrombosis risk assessment. Lancet Haematol 2018;5:e391. 75. Fogel AL, Kvedar JC. Artificial intelligence powers digital medicine. NPJ Digit Med 2018;1:5. 76. Hall M. Artificial intelligence and nuclear medicine. Nucl Med Commun 2019;40:12. 77. Hamdy FC, Catto JW. Less is more: artificial intelligence and gene-expression arrays. Lancet 2004;364:20034. 78. Hannun AY, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med 2019;25:659. 79. Hendriks JML, Fabritz L. AI can now identify atrial fibrillation through sinus rhythm. Lancet 2019;394:81213. 80. Hwang TJ, Kesselheim AS, Vokinger KN. Lifecycle regulation of artificial intelligence- and machine learningbased software devices in medicine. JAMA 2019;322:22856. 81. Kantarjian H, Yu PP. Artificial intelligence, big data, and cancer. JAMA Oncol 2015;1:5734. 82. Kim HK, et al. Deep learning improves prediction of CRISPR-Cpf1 guide RNA activity. Nat Biotechnol 2018;36:23941. 83. Yuan Y, et al. Prostate cancer classification with multiparametric MRI transfer learning model. Med Phys 2019;46:75665. 84. Madabhushi A, Feldman MD, Leo P. Deep-learning approaches for Gleason grading of prostate biopsies. Lancet Oncol 2020;21. 85. Matheny ME, Whicher D, Thadaney Israni S. Artificial intelligence in health care: a report from the National Academy of Medicine. JAMA 2019;323:50910. 86. Matuchansky C. Deep medicine, artificial intelligence, and the practising clinician. Lancet 2019;394:736. 87. Mesko B. The real era of the art of medicine begins with artificial intelligence. J Med Internet Res 2019;21: e16295. 88. Norgeot B, Glicksberg BS, Butte AJ. A call for deep-learning healthcare. Nat Med 2019;25:1415. 89. Quer G, Muse ED, Nikzad N, Topol EJ, Steinhubl SR. Augmenting diagnostic vision with AI. Lancet 2017;390:221. 90. 
Stoekle HC, Charlier P, Herve C, Deleuze JF, Vogt G. Artificial intelligence in internal medicine: between science and pseudoscience. Eur J Intern Med 2018;51:e334. 91. Wang F, Casalino LP, Khullar D. Deep learning in medicine-promise, progress, and challenges. JAMA Intern Med 2019;179:2934. 92. Jia X, Xing X, Yuan Y, Xing L, Meng MQ. Wireless capsule endoscopy: a new tool for cancer screening in the colon with deep-learning-based polyp recognition. Proc IEEE 2020;108:17897. 93. Yuan Y, et al. Densely connected neural network with unbalanced discriminant and category sensitive constraints for polyp recognition. IEEE Trans Autom Sci Eng 2019;1:110.

94. Rogers MA, Aikawa E. Cardiovascular calcification: artificial intelligence and big data accelerate mechanistic discovery. Nat Rev Cardiol 2019;16:26174. 95. Johnson KW, et al. Artificial intelligence in cardiology. J Am Coll Cardiol 2018;71:266879. 96. Saba L, et al. The present and future of deep learning in radiology. Eur J Radiol 2019;114:1424. 97. Niel O, Bastard P. Artificial intelligence in nephrology: core concepts, clinical applications, and perspectives. Am J Kidney Dis 2019;74:80310. 98. Feng R, Badgeley M, Mocco J, Oermann EK. Deep learning guided stroke management: a review of clinical applications. J Neurointerv Surg 2018;10:35862. 99. Avati A, et al. Improving palliative care with deep learning. BMC Med Inform Decis Mak 2018;18:122. 100. Zhao W, et al. Incorporating imaging information from deep neural network layers into image guided radiation therapy (IGRT). Radiother Oncol 2019;140:16774. 101. Zhao W, et al. Visualizing the invisible in prostate radiation therapy: markerless prostate target localization via a deep learning model and monoscopic kV projection X-ray image. In: 2018 Annual meeting of ASTRO oral presentation. San Antonio, TX; 2018. 102. Galloway CD, et al. Development and validation of a deep-learning model to screen for hyperkalemia from the electrocardiogram. JAMA Cardiol 2019;4:42836. 103. Bellemo V, et al. Artificial intelligence screening for diabetic retinopathy: the real-world emerging application. Curr Diab Rep 2019;19:72. 104. Rajalakshmi R, Subashini R, Anjana RM, Mohan V. Automated diabetic retinopathy detection in smartphone-based fundus photography using artificial intelligence. Eye (Lond) 2018;32:113844. 105. Karhade AV, et al. Natural language processing for automated detection of incidental durotomy. Spine J 2019;20. 106. Zhavoronkov A, Li R, Ma C, Mamoshina P. Deep biomarkers of aging and longevity: from research to applications. Aging (Albany NY) 2019;11:1077180. 107. Vercauteren T, Unberath M, Padoy N, Navab N. CAI4CAI: the rise of contextual artificial intelligence in computer assisted interventions. Proc IEEE Inst Electr Electron Eng 2020;108:198214. 108. Panesar SS, et al. Promises and perils of artificial intelligence in neurosurgery. Neurosurgery 2020;87:3344. 109. Moustris GP, Hiridis SC, Deliparaschos KM, Konstantinidis KM. Evolution of autonomous and semiautonomous robotic surgical systems: a review of the literature. Int J Med Robot 2011;7:37592. 110. Pan J, et al. Image-guided stereotactic radiosurgery for treatment of spinal hemangioblastoma. Neurosurg Focus 2017;42:E12. 111. Chen AI, Balter ML, Maguire TJ, Yarmush ML. Deep learning robotic guidance for autonomous vascular access. Nat Mach Intell 2020;2:10415. 112. Kakavas G, Malliaropoulos N, Pruna R, Maffulli N. Artificial intelligence. A tool for sports trauma prediction. Injury 2019;S0020-1383. 113. Olze H, et al. Hearing implants in the Era of digitization. Laryngorhinootologie 2019;98:S82128. 114. Fritz RL, Dermody G. A nurse-driven method for developing artificial intelligence in “smart” homes for aging-in-place. Nurs Outlook 2019;67:14053. 115. Anderson D. Artificial intelligence and applications in PM&R. Am J Phys Med Rehabil 2019;98:e1289. 116. Langer A, Feingold-Polak R, Mueller O, Kellmeyer P, Levy-Tzedek S. Trust in socially assistive robots: considerations for use in rehabilitation. Neurosci Biobehav Rev 2019;104:2319. 117. Moral-Munoz JAP, Zhang WP, Cobo MJP, Herrera-Viedma EP, Kaber DBP. 
Smartphone-based systems for physical rehabilitation applications: a systematic review. Assist Technol 2019;114. 118. Dor-Haim H, Katzburg S, Leibowitz D. A novel digital platform for a monitored home-based cardiac rehabilitation program. J Vis Exp 2019;146 in press (doi: 10.3791/59019). 119. Borelli E, et al. HABITAT: an IoT solution for independent elderly. Sensors (Basel) 2019;19. 120. Dolatabadi E, et al. The feasibility of a vision-based sensor for longitudinal monitoring of mobility in older adults with dementia. Arch Gerontol Geriatr 2019;82:2006. 121. Fares J, et al. Diagnostic clinical trials in breast cancer brain metastases: barriers and innovations. Clin Breast Cancer 2019;19:38391. 122. Nelson A, Herron D, Rees G, Nachev P. Predicting scheduled hospital attendance with artificial intelligence. NPJ Digit Med 2019;2:26. 123. Yeung S, Downing NL, Fei-Fei L, Milstein A. Bedside computer vision  moving artificial intelligence from driver assistance to patient safety. N Engl J Med 2018;378:12713.

124. Mockute R, et al. Artificial intelligence within pharmacovigilance: a means to identify cognitive services and the framework for their validation. Pharmaceut Med 2019;33:10920. 125. Ellahham S, Ellahham N, Simsekler MCE. Application of artificial intelligence in the health care safety context: opportunities and challenges. Am J Med Qual 2019. 1062860619878515, in press. Available from: https:// doi.org/10.1177/1062860619878515. 126. Howard JP, et al. Artificial intelligence for aortic pressure waveform analysis during coronary angiography: machine learning for patient safety. JACC Cardiovasc Interv 2019;12:2093101. 127. Pillai M, et al. Using artificial intelligence to improve the quality and safety of radiation therapy. J Am Coll Radiol 2019;16:126772. 128. Fan J, Xing L, Ma M, Hu W, Yang Y. Verification of the machine delivery parameters of treatment plan via deep learning. Phys Med Biol 2020, in press (doi: 10.1088/1361-6560/aba165). 129. Lancet. Artificial intelligence in global health: a brave new world. Lancet 2019;393:1478. 130. Chu HJ, Lin BC, Yu MR, Chan TC. Minimizing spatial variability of healthcare spatial accessibility—the case of a dengue fever outbreak. Int J Environ Res Public Health 2016;13. 131. Thiebaut R, Cossin S, Section Editors for the IMIA Yearbook Section on Public Health and Epidemiology Informatics. Artificial intelligence for surveillance in public health. Yearb Med Inform 2019;28:2324. 132. Duong MT, et al. Artificial intelligence for precision education in radiology. Br J Radiol 2019;92:20190389. 133. Sheikh AY, Fann JI. Artificial intelligence: can information be transformed into intelligence in surgical education? Thorac Surg Clin 2019;29:33950. 134. Alonso-Silverio GA, et al. Development of a laparoscopic box trainer based on open source hardware and artificial intelligence for objective assessment of surgical psychomotor skills. Surg Innov 2018;25:3808. 135. Xing L, Li J, Donaldson S, Le Q, Boyer A. Optimization of importance factors in inverse planning. Phys Med Biol 1999;44:2525. 136. Xing L, Li JG, Pugachev A, Le QT, Boyer AL. Estimation theory and model parameter selection for therapeutic treatment plan optimization. Med Phys 1999;26:234858. 137. Seo H, Bassenne M, Xing L. Closing the gap between deep neural network modeling and biomedical decision-making metrics via adaptive loss functions. IEEE Trans Med Ima 2020, conditionally accepted. 138. Milletari F, Navab N, Ahmadi S-A. V-net: fully convolutional neural networks for volumetric medical image segmentation. 2016 Fourth international conference on 3D vision (3DV). IEEE; 2016. p. 56571. 139. Lguensat R, et al. EddyNet: a deep neural network for pixel-wise classification of oceanic eddies. IGARSS 2018-2018 IEEE international geoscience and remote sensing symposium. (IEEE; 2018. p. 17647. 140. Hashemi SR, et al. Asymmetric loss functions and deep densely-connected networks for highly-imbalanced medical image segmentation: application to multiple sclerosis lesion detection. IEEE Access 2018;7:172135. 141. Salehi SSM, Erdogmus D, Gholipour A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. International workshop on machine learning in medical imaging. Springer; 2017. p. 37987. 142. Frangi AF, Taylor ZA, Gooya A. Precision Imaging: more descriptive, predictive and integrative imaging. Med Image Anal 2016;33:2732. 143. Verghese A, Shah NH, Harrington RA. What this computer needs is a physician: humanism and artificial intelligence. JAMA 2018;319:1920. 144. 
Zhao W, et al. Markerless pancreatic tumor target localization enabled by deep learning. Int J Radiat Oncol Biol Phys 2019;105:4329. 145. Zhao W, et al. A deep learning approach for dual-energy CT imaging using a single-energy CT data. SPIE; 2019. 146. Mak RH, et al. Use of crowd innovation to develop an artificial intelligence-based solution for radiation therapy targeting. JAMA Oncol 2019;5:65461. 147. Islam M, Xing L. Feature-augmented embedding machine for exploration of distinct patterns in biomedical data. Nat Biomed Eng 2020, in press. 148. Chatterjee S, Gupta D, Caputo TA, Holcomb K. Disparities in gynecological malignancies. Front Oncol 2016;6:36. 149. Temkin SM, et al. A contemporary framework of health equity applied to gynecologic cancer care: a Society of Gynecologic Oncology evidenced-based review. Gynecol Oncol 2018;149:707. 150. Doll KM. Investigating Black-White disparities in gynecologic oncology: theories, conceptual models, and applications. Gynecol Oncol 2018;149:7883.

151. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366:44753. 152. Thomas PS, et al. Preventing undesirable behavior of intelligent machines. Science 2019;366:9991004. 153. Strickland E. How IBM Watson overpromised and underdelivered on AI health care. 2019. 154. Teh BT. The importance of including diverse populations in cancer genomic and epigenomic studies. Nat Rev Cancer 2019;19:3612. 155. Fihn S, et al. Deploying AI in clinical settings. Washington, DC: National Academy of Medicine; 2019. 156. He J, et al. The practical implementation of artificial intelligence technologies in medicine. Nat Med 2019;25:306. 157. Mysona DP, et al. Clinical calculator predictive of chemotherapy benefit in stage 1A uterine papillary serous cancers. Gynecol Oncol 2020;156:7784. 158. Fairly M. Improving the efficiency of the operating room environment with an optimization and machine learning model. Health Care Manag Sci 2019;22:75667. 159. Esteva A, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:11518. 160. Charalambides M, Singh S. Artificial intelligence and melanoma detection: friend or foe of dermatologists? Br J Hosp Med (Lond) 2020;81:15. 161. Gulshan V, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316:240210. 162. Nam JG, et al. Development and validation of deep learning-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology 2019;290:21828. 163. Angus DC. Randomized clinical trials of artificial intelligence. JAMA 2020;323. 164. Wijnberge M, et al. Effect of a machine learning-derived early warning system for intraoperative hypotension vs standard care on depth and duration of intraoperative hypotension during elective noncardiac surgery: the HYPE Randomized Clinical Trial. JAMA 2020;323. 165. Wang P, et al. Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut 2019;68:181319. 166. National Research Council (US) Committee on Statistics. Improving health care cost projections for the Medicare population: summary of a workshop. Washington, DC: National Academies Press; 2010. 167. Margetts H, Dorobantu C. Rethink government with AI. Nature 2019;568:1635. 168. Magnus D, Batten JN. Building a trustworthy precision health research enterprise. Am J Bioeth 2018;18:12. 169. Char DS, Shah NH, Magnus D. Implementing machine learning in health care  addressing ethical challenges. N Engl J Med 2018;378:9813. 170. Kreitmair KV, Cho MK, Magnus DC. Consent and engagement, security, and authentic living using wearable and mobile health technology. Nat Biotechnol 2017;35:61720. 171. Grigorovich A, Kontos P. Towards responsible implementation of monitoring technologies in institutional care. Gerontologist 2020, in press (https://doi.org/10.1093/geront/gnz190). 172. Gordon JS. Building moral robots: ethical pitfalls and challenges. Sci Eng Ethics 2020;26:14157. 173. Racine E, Boehlen W, Sample M. Healthcare uses of artificial intelligence: challenges and opportunities for growth. Healthc Manage Forum 2019;32:2725. 174. Nebeker C, Torous J, Bartlett Ellis RJ. Building the case for actionable ethics in digital health research supported by artificial intelligence. BMC Med 2019;17:137. 175. Cath C, Wachter S, Mittelstadt B, Taddeo M, Floridi L. 
Artificial intelligence and the ‘Good Society’: the US, EU, and UK approach. Sci Eng Ethics 2018;24:50528. 176. Horvitz E, Mulligan D. Policy forum. Data, privacy, and the greater good. Science 2015;349:2535. 177. McNair D, Price II WN. Health care AI: law, regulation, and ploicy. Washington, DC: National Academy of Medicine; 2019. 178. Kolachalama VB, Garg PS. Machine learning and medical education. NPJ Digit Med 2018;1:54. 179. Wartman SA, Combs CD. Medical education must move from the information age to the age of artificial intelligence. Acad Med 2018;93:11079. 180. Wartman SA, Combs CD. Reimagining medical education in the age of AI. AMA J Ethics 2019;21:E14652. 181. Masters K. Artificial intelligence in medical education. Med Teach 2019;41:97680.

182. Islam M, Xing L. Interpretable and high performance AI for healthcare, Nat Med, submitted, 2020. 183. Riedl M. Human-centered artificial intelligence and machine learning. Hum Behav Emerg Technol 2019;1:336. 184. Website of the People + AI Research (PAIR) team at Google, http://Pair.withgoogle.com; 2020. 185. Shen L, Zhao W, Capaldi PID, Pauly J, Xing L. A geometry-informed deep learning framework for ultrasparse computed tomography (CT) imaging, Proc Natl Acad Sci USA, submitted, 2020. 186. Nomura Y, Xu Q, Shirato H, Shimizu S, Xing L. Projection-domain scatter correction for cone beam computed tomography using a residual convolutional neural network. Med Phys 2019;46:314255.


Index

Note: Page numbers followed by "f" and "t" refer to figures and tables, respectively.

A Abdomen, 373 Abdominal applications, 274 278 AI in abdominal imaging, 277 278 pancreatic cancer analysis in CT and MRI, 275 277 Abdominal rigidity, 139 ABIDE. See Autism Brain Imaging Data Exchange (ABIDE) Accountability, 497 498 Accuracy-interpretability trade-off, 13 ACEMod. See Australian Census-based Epidemic Model (ACEMod) ACS NSQIP model. See American College of Surgeons NSQIP model (ACS NSQIP model) Activation functions, 20 21 Acute kidney injury (AKI), 142 AD. See Alzheimer’s disease (AD) ADA. See American Diabetes Association (ADA) ADC. See Apparent diffusion coefficient (ADC) Additive models, 8 9 Adenoma detection rate (ADR), 232 ADEs. See Adverse drug events (ADEs) ADHD. See Attention deficit hyperactivity disorder (ADHD) ADR. See Adenoma detection rate (ADR) Adversarial image registration network (AIR-Net), 61 62, 61f Adversarial learning framework, 197 198 Adverse drug events (ADEs), 91 AF. See Atrial fibrillation (AF) Age-related macular degeneration (AMD), 250 255 AGI. See Artificial general intelligence (AGI) AI. See Artificial intelligence (AI) AI in medicine (AIM), 4 5, 503 504, 504f AI, ML, and precision medicine, 6 algorithms and models, 6 7 business opportunity, 498 500 challenges, 11 16 maximizing information gain across modalities, tasks, populations, and time, 14 15 measuring real-world impact, 13 14 beyond performance and interpretability, 13 quality and completeness of training data, 11 12

quality assessment and expert supervision, 15 16 trust and performance, 12 13 challenges and issues, 512 518 clinical tests, 516 517 data, data curation, and sharing, 514 data and potential bias in, 514 515 economical, political, social, ethical, and legal aspects, 517 education and training, 517 518 technical challenges, 512 513 workflow and practical implementation, 515 516 development, 480 498 accountability, 497 498 current and future market, 494 496 market adoption, 492 494 patient privacy, 496 497 power of public attention and funding, 482 484 practical applications, 486 492 target, 497 technology on continuous innovation, 484 485 transparency, 497 498 exciting growth of, 480 future directions and opportunities, 518 519 health data sources and types, 7 9 integrating AI into human workforce of learning health system, 16 intelligent, 5 6 promise, 9 11 value-based segmentation, 487f “AI winter”, 76 77, 493, 495, 499 500 AI/ML algorithms. See Artificial intelligence/machine learning algorithms (AI/ML algorithms) AI assisted diagnosis and monitoring in oncology, 368 369 AIM. See AI in medicine (AIM) AIR-Net. See Adversarial image registration network (AIR-Net) Airway segmentation, 269 272, 272f AKI. See Acute kidney injury (AKI) Alzheimer’s disease (AD), 396 AMA. See American Medical Association (AMA) AMD. See Age-related macular degeneration (AMD)


528 American College of Surgeons NSQIP model (ACS NSQIP model), 142 American Diabetes Association (ADA), 248 American Medical Association (AMA), 248 American Society of Clinical Oncology (ASCO), 362 363 Andrology, 314 315 Aneurysms, 402 ANNs. See Artificial neural networks (ANNs) API. See Application programming interfaces (API) Apparent diffusion coefficient (ADC), 319 Application programming interfaces (API), 136 137 Apps over locations, distribution of, 175 AR. See Augmented reality (AR) Area under curve (AUC), 236 237, 365 366, 387, 404 405, 417 Areas under receiver operating characteristic curve (AUROCs), 141 142 Arteriovenous fistulas, 402 Arteriovenous malformations (AVMs), 402 Artificial general intelligence (AGI), 5 Artificial intelligence (AI), 3 4, 10f, 19, 37, 49 50, 75 76, 133 134, 152, 153f, 183 184, 223 224, 248, 251t, 309 310, 340, 352 355, 362, 384, 384f, 395 396, 415 416, 437 438, 457, 479, 503 504 in abdominal imaging, 277 278 AI-enhanced prediction framework, 443f algorithms elements, 20 23 activation functions, 20 21 convolution and transposed convolution, 23 dropout, 21 22 fully connected layer, 21 inception layers, 23 initialization, 22 23 residual blocks, 22 applications in diagnosis and prognosis, 416 428 ASD, 423 424 childhood brain tumors, 419 421 disease entities, 428 epilepsy and seizure disorders, 421 423 hydrocephalus, 425 426 molecular mechanisms of disease, 427 428 mood disorders and psychoses, 424 425 prematurity, 416 419 TBI, 426 427 augmentation for EHRs, 139 144 hospital outcomes, 141 142 oncology, 143 144 optimizing care, 140 predictions, 140 141 sepsis and infections, 143 behavioral factors, 161 163 in cardiac imaging, 385t


challenges, 407 409 data quality, 407 408 data volume, 407 ethical, 409 generalizability, 408 interpretability, 408 legal, 408 clinical applications applications of classification, 29 31 applications of regression, 27 28 applications of segmentation, 28 29 deep learning for improved image reconstruction, 31 32 computed tomography, 387 388 data analysis, 440 444 from cyberspace, 441 442 from physical world, 441 pre-syndromic disease surveillance, 443 444 in decision support, 292 293 diet, 153 155 environmental and social determinants of health, 163 165 fitness and physical activity, 155 156 future directions, 32 33 in health care, 503 512 to improve data entry and extraction, 139 140 infectious disease transmission modeling, 446 449 integration with clinical workflow, 396 406 CDS, 405 diagnosis, 396 398 intraoperative guidance and enhancement, 402 404 neurophysiological monitoring, 404 405 risk prognostication, 398 399 surgical planning, 399 402 theoretical neurological AI research, 406 internet-based surveillance systems, 449 450 limitations, 144 146 mental health, 159 161 methods in clinical use, 406 407 in neurology, 395 396 public health surveillance, 438, 439f, 444 446 remote screening tools, 165 166 role in echocardiography, 386 387 sleep, 156 158 software architectures CNNs, 24 25 DenseNets, 25 26 GANs, 26 hybrid generative adversarial network designs, 26 27 neural networks and fully connected networks, 23 24


U-Nets and V-Nets, 25 SRH, 158 159 technology, 265 266 in clinical AI tools, 20 27 transition to treatment decision-making using, 428 429 Artificial intelligence/machine learning algorithms (AI/ML algorithms), 5 Artificial intelligence based apps, 176 Artificial neural networks (ANNs), 4 5, 19, 249, 310, 352, 371, 400, 426 427. See also Convolutional neural networks (CNNs) for pediatric patient management applications of AI in diagnosis and prognosis, 416 428 future directions, 429 430 transition to treatment decision-making using AI, 428 429 ARVO. See Association for Research in Vision and Ophthalmology (ARVO) ASCO. See American Society of Clinical Oncology (ASCO) ASDs. See Autism spectrum disorders (ASDs) Assay for transposase activity and sequencing (ATACseq), 114 Association for Research in Vision and Ophthalmology (ARVO), 258 259 Astrocytoma, 421 ATAC-seq. See Assay for transposase activity and sequencing (ATAC-seq) Atrial fibrillation (AF), 124 Attention deficit hyperactivity disorder (ADHD), 396 AUC. See Area under curve (AUC) Augmented reality (AR), 402 403 AUROCs. See Areas under receiver operating characteristic curve (AUROCs) Australian Census-based Epidemic Model (ACEMod), 447 448 Autism, 397 398 Autism Brain Imaging Data Exchange (ABIDE), 397 398 Autism spectrum disorders (ASDs), 396, 423 424 Automated AI-based tools, 269 Automated lesion analysis algorithms, 280 281 Automatic computational algorithm, 419 420 Automating tumor segmentation, 429 AVMs. See Arteriovenous malformations (AVMs)

B Backpropagation through time (BPTT), 188 189 Barrett’s esophagus (BE), 224 Batch Normalization layers (BatchNorm layers), 106 107


BatchNorm layers. See Batch Normalization layers (BatchNorm layers) Bayesian model averaging (BMA), 446 Bayesian network, 85, 85f Bayesian neural network, 310 Bayley scales of infant development-IIII (BSID-IIII), 417 BBPS. See Boston Bowel Prep Score (BBPS) BE. See Barrett’s esophagus (BE) Behavioral factors, AI, 161 163 Best practice advisory (BPA), 137 139 Best practice alerts. See Best practice advisory (BPA) BI-RADS. See Breast Imaging Reporting and Data System (BI-RADS) Big data, 362, 390, 390t, 504, 509 Biological networks, 121 Biological structures or tissues, 198 199 Biomedical data, 11 12, 407 in medicine challenges in multimodal data, 118 119 future directions, 125 126 ML algorithms in integrating medical and biological data, 119 125 rise of multimodal data in biology and medicine, 113 117 Biomedical imaging and analysis through deep learning deep-learning-based radiomics, 63 68 image registration, 60 63 image segmentation, 56 60 tomographic image reconstruction, 50 55 “Black box”, 50 Bladder, 321 322 cystoscopy and transurethral resection of, 314 315 BMA. See Bayesian model averaging (BMA) Bone age, 27 28 Bone fracture detection, 278 Boston Bowel Prep Score (BBPS), 233 BPA. See Best practice advisory (BPA) BPTT. See Backpropagation through time (BPTT) Braiding hair, 44 Brain, 375 age, 28 segmentation, 372 tumors, 419 Brain tumor segmenting challenge (BraTS challenge), 401 Breast cancer imaging, 291 AI for treatment response, risk of recurrence, and cancer discovery, 300 304 AI in breast cancer diagnosis and prognosis, 297 300 AI in breast cancer risk assessment, 296 297 AI in breast cancer screening, 293 296



Breast cancer imaging (Continued) AI in decision support, 292 293 Breast Imaging Reporting and Data System (BI-RADS), 297 Breathing dysfunction, 272 273 Bronchopulmonary dysplasia, 418 Brushing hair, 44 BSID-IIII. See Bayley scales of infant development-IIII (BSID-IIII) Built-in smartphones’ accelerometers, 178 179 Bulk sequencing, 113 114 Business opportunity of AI, 498 500

C CAC score. See Coronary artery calcium score (CAC score) CAD. See Computer-aided diagnosis (CAD); Coronary artery disease (CAD) CADe. See Computer-aided detection (CADe) CADt. See Computer-aided triaging (CADt) CADUCEUS program, 493 California Consumer Protection Act (CCPA), 471 472 CAMDA 2017 Neuroblastoma Data Integration challenge, 421 Camera-based apps, 176 177 Camouflaging effect, 291 292 Cancer, 198 assessment of risk of future, 68 discovery, 300 304 integrating omics for cancer subtyping, 120 121 risk assessment in, 347 348 Cancer Research for Personalized Medicine program (CARPEM program), 362 363 CaP. See Prostate cancer (CaP) Capsule endoscopy, 224 CapsuleNets, 57 Cardiac function, 39 40 Cardiac magnetic resonance imaging (CMR imaging), 389 AI role in, 389 Cardiac ultrasounds, 38 “Cardiologist AI”, 5 Cardiovascular diseases (CVDs), 113 114, 122 124, 390 Cardiovascular imaging, 383 AI role in CMR imaging, 389 computed tomography, 387 388 in ECG, 389 in echocardiography, 386 387 in large databases, 390 in nuclear cardiology, 388 deep learning, 386

types of ML, 384 385 views on ML, 391 CARPEM program. See Cancer Research for Personalized Medicine program (CARPEM program) CBIR. See Content-based image retrieval (CBIR) CCPA. See California Consumer Protection Act (CCPA) CDC. See Centers for Disease Control and Prevention (CDC) CDS. See Clinical decision support (CDS) Cecal intubation rate, 232 233 Cecal intubation time, 232 233 Cell detection, 192 193 Cell segmentation, 195 197 Centers for Disease Control and Prevention (CDC), 444 Centers for Medicare and Medicaid Services (CMS), 136 137, 143 144, 464 465, 486 488 Central nervous system (CNS), 399 Cerebrospinal fluid (CSF), 417 diversion, 425 426 CHB-MIT. See Children’s Hospital of Boston, Massachusetts Institute of Technology (CHBMIT) CHD. See Coronary heart disease (CHD) CHDs. See Congenital heart defects (CHDs) Chebyshev polynomials, 108 Chemotherapy, 349 351 Chest X-rays (CXRs), 38, 266 pulmonary analysis in, 266 269 Childhood brain tumors, 419 421 diseases, 415 psychiatric disorders, 425 Children’s Hospital of Boston, Massachusetts Institute of Technology (CHB-MIT), 398 Chromatin accessibility, 115 Chromatin immunoprecipitation with sequencing (ChIP-seq), 114 Chromosome conformation capture assays, 114 Chronic obstructive pulmonary disease (COPD), 269 270 ciTBI. See Clinically TBI (ciTBI) CityScapes dataset, 41 42 Classification-based method, 64 CLIA. See Clinical Laboratory Improvement Amendments (CLIA) Clinical data, 461 462 warehouse, 362 367 Clinical decision support (CDS), 75 76, 135, 137 139, 175 176, 206, 405, 470, 489 490 healthcare primed for, 138 139 Clinical information systems, 134 135 Clinical Laboratory Improvement Amendments (CLIA), 208


Clinical target volumes (CTVs), 372 373 Clinically TBI (ciTBI), 427 Clockwork RNN (CW-RNN), 198 199 Cloud-based computing, 494 495 CMMS. See Centers for Medicare and Medicaid Services (CMS) CMR imaging. See Cardiac magnetic resonance imaging (CMR imaging) CMS. See Centers for Medicare and Medicaid Services (CMS) CNNRNN-Res model, 444 CNNs. See Convolutional neural networks (CNNs) CNS. See Central nervous system (CNS) COCO. See Common Objects in Context (COCO) Collaborative learning. See Federated learning Colonoscopy, AI applications in, 232 238 BBPS, 233 cecal intubation rate and cecal intubation time, 232 233 MES, 237 238 polyp detection, 233 234 polyp morphology, 235 polyp pathology, 235 236 polyp size, 234 235 tools, 236 237 withdrawal time, 233 Color normalization, 199 Colorectal cancer (CRC), 232 Common Objects in Context (COCO), 40 Complex network analysis, 447 Computational neurons, 20 Computational requirements, 208 Computed tomographic angiography (CTA), 387, 407 Computed tomography (CT), 39, 52 54, 53f, 54f, 115 116, 265 266, 273f, 319, 387, 395 396, 426 427, 505 506 pancreas segmentation in MRI and, 275 pancreatic cancer analysis in MRI and, 275 277 pancreatic tumor segmentation and detection in MRI and, 276 pulmonary analysis in, 269 274 ILD pattern recognition, 272 274 lung, lobe, and airway segmentation, 269 272 role of AI, 387 388 scan, 341 Computer aided polyp detection, 234 Computer information technology, 76 Computer vision, 368, 407 Computer-aided detection (CADe), 224, 233 234, 292 293, 368 Computer-aided diagnosis (CAD), 89 90, 200 201, 292 293, 298, 368, 415 416 diagnostic tools, 19 20


system, 237 238 Computer-aided triaging (CADt), 292 293, 296f Computer-assisted antibiotic-dose monitor, 91 Computer-assisted therapy, 90 91 Computer-assisted vision, 367 Computer-based multilingual medical coding system, 76 77 Computer-interpretable knowledge, 80 81 Computer-interpretable representation, 80 81 Computerized physician order entry. See Computerized physician order entry system (CPOE system) Computerized physician order entry system (CPOE system), 90 92, 136 Conditional probability, 84, 85f Conditional random fields (CRFs), 58 Congenital heart defects (CHDs), 164 Connectome mapping, 396 397 Content-based image retrieval (CBIR), 201 Convolution, 23 operations, 37 Convolutional kernel, 23 24 Convolutional network-based featurizer, 44 Convolutional neural networks (CNNs), 6 7, 20, 24 25, 24f, 44 45, 55, 101 102, 184 185, 185f, 223 224, 229f, 267 268, 293 294, 294f, 301f, 302f, 310, 342, 366, 398. See also Artificial neural networks (ANNs) COPD. See Chronic obstructive pulmonary disease (COPD) Core functions of EHR, 135 136 Coronary artery calcium score (CAC score), 123 124 Coronary artery disease (CAD), 387 Coronary heart disease (CHD), 123 124 Coronavirus 2019 (COVID-2019), 272 Covariates, 6 8 COVID-2019. See Coronavirus 2019 (COVID-2019) CPOE system. See Computerized physician order entry system (CPOE system) CPT. See Current Procedural Terminology (CPT) Craniopharyngioma, 419 420 CRC. See Colorectal cancer (CRC) CRFs. See Conditional random fields (CRFs) CSF. See Cerebrospinal fluid (CSF) CT. See Computed tomography (CT) CTA. See Computed tomographic angiography (CTA) CTVs. See Clinical target volumes (CTVs) Cures Act, 470 Current Procedural Terminology (CPT), 136 137, 258 259 CVD. See Cardiovascular diseases (CVDs) CW-RNN. See Clockwork RNN (CW-RNN) CXRs. See Chest X-rays (CXRs)

532 Cyberspace, AI data analysis, 441 442 CycleGAN, 199 200 Cyclical weight transfer, 104 Cystoscopy, 314 315, 315f, 316f

D DAG model. See Directed acyclic graphical model (DAG model) Dartmouth Summer Research Project on Artificial Intelligence (DRPAI), 482 Data, 514 learning from, 464 468 biases in, 465 467 issues of implementation, 467 468 values in algorithm design, 464 465 mining, 144 145 quality, 407 408 stewardship, 463 464 volume, 407 Data acquisition, 341 ethical issues in, 458 464 Data annotation, 514 issues, 204 Data curation, and sharing, 514 Data entry and extraction, AI to improve, 139 140 Data integration beyond omics, 122 124 diagnosis by linking images with ECGs, 124 magnetic resonance imaging computed tomography scans, 123 124 Data reuse and AI, 366 367 for patient care, 367 for research purposes, 362 366 features extracted from satellite images, 364f i2b2 clinical data warehouse graphical interface, 363f Data-driven apps, 180 Data-driven biomarker discovery, 368 Data-driven decision-making processes, 173 “Data-first” approaches, 468 Data-sharing process, 11 DBE. See Double-balloon enteroscopy (DBE) DBMs. See Deep Boltzmann machines (DBMs) DBNs. See Deep belief networks (DBNs) DBS. See Deep brain stimulation (DBS) DBSCAN algorithm, 441 442 DBT. See Digital breast tomosynthesis (DBT) DCE. See Dynamic contrast-enhanced (DCE) DCNN. See Deep convolutional neural network (DCNN) Decision support, AI in, 292 293, 293f Decision trees (DTs), 8 9, 152 153 Decision-making


frameworks, 96 97 process, 76 77, 79, 174, 259 260, 408 Decision-support modalities, 175 176 “Decoding arm”, 41 DECT. See Dual-energy CT (DECT) Deep belief networks (DBNs), 184 Deep Boltzmann machines (DBMs), 184 Deep brain stimulation (DBS), 402 Deep convolutional neural network (DCNN), 38, 57, 309 310 Deep learning (DL), 20, 50, 56, 184, 201 202, 362, 386, 485, 495, 504, 506 509. See also Artificial neural networks (ANNs); Machine learning (ML) algorithms, 123, 235 approaches, 38 architectures, 421 biomedical imaging and analysis through deep-learning-based radiomics, 63 68 image registration, 60 63 image segmentation, 56 60 tomographic image reconstruction, 50 55 for biomedical videos future directions, 45 46 motion classification, 44 45 object detection and tracking, 42 44 semantic segmentation, 40 42 video datasets, 38 40 data acquisition in, 341 DL based approaches, 54 55, 272 DL based image segmentation, 56 DL based literature, 28 DL based medical image registration methods, 60 61, 61f DL based radiomics, 63 68 assessment and prediction of response to treatment, 67 assessment of risk of future cancer, 68 characterization and diagnosis, 66 detection, 64 65, 65f prognosis, 66 67 era, 249 255 framework for, 509f for gastrointestinal endoscopy colonoscopy, AI applications in, 232 238 upper endoscopy, AI applications in, 227 232 video capsule endoscopy, AI applications in, 224 226 for genome and epigenome analysis, 120 for improved image reconstruction, 31 32 methods, 101 102, 395 396 models, 397 398, 441 in pathological image analysis CAD, 200 201



clinical adoption of AI, 205 209 data annotation issues, 204 high image dimension, 203 image classification, 189 192 image segmentation, 195 199 image superresolution, 200 integration of types of input data, 204 205 object crowding, 203 204 object detection, 192 195 others, 201 202 quality control, 202 203 stain normalization, 199 200 Deep neural networks, 184 189, 407 CNNs, 184 185 FCNs, 185 186 GANs, 186 187 RNNs, 188 189 SAEs, 187 188 Deep tomographic reconstruction, 50 DeepLab, 41 DeepLesion dataset, 281, 281f DEFENDER software system, 441 442 Dense layers. See Fully connected layer DenseNets, 22, 25 26, 25f, 200 Density and parenchymal pattern, breast cancer, 296 297 Dermatology, 101 102, 174 Diabetes, 247 Diabetic retinopathy, AI for, 248 249 Diagnostic quality control, 205 Dice similarity coefficient (DSC), 28 29, 319 321 Dice Similarity Index (DSI), 371 Diet, 153 155 Differential evolution-initialized Newton-based optimization, 63 Differential privacy (DP), 108 Diffusion tensor imaging (DTI), 397 Digital breast tomosynthesis (DBT), 291 292 Digital goniometers, 178 179 Digital pathology, 183 lagging adoption of, 206 207 Directed acyclic graphical model (DAG model), 85 Dirichlet allocation based topic modeling techniques, 441 442 Disease class diagnosis, 31 Disease detection, 30 Disease entities, 428 Disruptive mood dysregulation disorder, 425 Distributed learning. See also Deep learning (DL) techniques, 109 variants cyclical weight transfer, 104 federated learning, 104 105

model ensembling, 103 104 split learning, 105 DL. See Deep learning (DL) DNA DNA-associated protein, 114 DNA-binding preferences of transcription factors, 119 120 methylation, 114 DNA. See Dual network architecture (DNA) Dose volume histogram indices (DVH indices), 373 374 Dosimetry, 373 374 Double-balloon enteroscopy (DBE), 224 225 DP. See Differential privacy (DP) Dropout, 21 22 DRPAI. See Dartmouth Summer Research Project on Artificial Intelligence (DRPAI) DSC. See Dice similarity coefficient (DSC) DSI. See Dice Similarity Index (DSI) DTI. See Diffusion tensor imaging (DTI) DTs. See Decision trees (DTs) Dual network architecture (DNA), 54 Dual-energy CT (DECT), 505 506 DVH indices. See Dose volume histogram indices (DVH indices) Dynamic contrast-enhanced (DCE), 291 292

E EAC. See Esophageal adenocarcinoma (EAC) Early warning, AI, 439 444 EBM. See Evidence-based medicine (EBM) ECGs. See Electrocardiograms (ECGs) Echocardiograms, 38 39 Echocardiography, AI role in, 386 387 EchoNet-Dynamic dataset, 39 40 EEG. See Electroencephalogram (EEG) EEs. See Energy expenditures (EEs) Electric/electronic health records (EHRs), 77, 134, 362 363, 404 405, 459, 492 areas of AI augmentation for, 139 144 and clinical data warehouse data reuse and AI, 366 367 data reuse for patient care, 367 data reuse for research purposes, 362 366 core functions, 135 136 data mining for AI healthcare CDS, 137 139 limitations of AI, 144 146 history of, 134 135 ontologies and data standards, 136 137 systems, 9 10

534 Electrocardiograms (ECGs), 115 116, 124, 157 158, 389 AI role in, 389 diagnosis by linking images with, 124 Electroencephalogram (EEG), 398 Electrographic and imaging data, 422 Electronic communication and connectivity, 136 EMYCIN. See Essential MYCIN (EMYCIN) Endoscopic third ventriculostomy (ETV), 425 426 Energy expenditures (EEs), 155 EORTC risk score system, 324 325, 326t Ependymoma, 421 Epigenome analysis, 120 Epilepsy, 397, 421 423 Esophageal adenocarcinoma (EAC), 227 228 Esophageal cancer, 227 229 Esophagus, 277 278, 375 376 ESs. See Expert systems (ESs) Essential MYCIN (EMYCIN), 89 90 Ethical issues consent with clinical or public health data, 461 462 in data acquisition, 458 464 from data source, 459 incidental or secondary findings, 463 nonclinically collected data, 463 from research repositories, 459 461 ETV. See Endoscopic third ventriculostomy (ETV) European Medical Devices Directive (MDD), 259 Evidence-based medicine (EBM), 77, 367, 503 504 Exercise, 7 Expert supervision, 15 16 Expert systems (ESs), 75 76, 78f applications, 89 92 computer-assisted diagnosis, 89 90 computer-assisted therapy, 90 91 medication alert systems, 91 92 reminder systems, 92 challenges, 92 96 clinician acceptance and alert fatigue, 93 94 knowledge maintenance, 94 95 standard, transferability, and interoperability, 95 96 workflow integration, 92 93 future directions, 96 97 history, 76 77 methods architecture, 77 80 knowledge representation and management, 80 83 uncertainty, probabilistic reasoning, fuzzy logic, 83 89 Expert-augmented machine learning, 16


F False negative (FN), 512 513 False positive (FP), 512 513 Fast Healthcare Interoperability Resources (FHIR), 95 96, 136 137 FBP. See Filtered backprojection (FBP) FCM. See Fuzzy c-means (FCM) FCNs. See Fully connected networks (FCNs) FDA. See US Food and Drug Administration (FDA) Feature distribution skew, 105 106 Federated learning, 15, 104 105 Fertility, 158 FFDM. See Full-field digital mammography (FFDM) FGCS. See Fifth Generation Computer System (FGCS) FHIR. See Fast Healthcare Interoperability Resources (FHIR) Fifth Generation Computer System (FGCS), 483 Filter. See Convolutional kernel Filtered backprojection (FBP), 53 Fine-grained ETDRS DR levels, 258 Fitness, 155 156 fMRI. See Functional MRI (fMRI) FN. See False negative (FN) FP. See False positive (FP) Frame-based systems and production rules, 79 FreeSurfer program, 401 Full-field digital mammography (FFDM), 291 294 Fully asynchronous approach, 104 105 Fully connected layer, 21, 21f Fully connected networks (FCNs), 20, 23 24, 41, 58, 184 186, 186f Fully convolutional networks. See Fully connected networks (FCNs) Fully synchronous approach, 104 105 Functional MRI (fMRI), 417 Fuzzy Arden Syntax, 76 77 Fuzzy c-means (FCM), 300 303 Fuzzy logic, 83 89, 87f

G Gamma rays, 351 GANs. See Generative adversarial networks (GANs) Gastric cancer, 230 231 Gastrointestinal endoscopy (GI endoscopy), 224 Gaussian membership function, 89 Gaussian process classifier, 425 GBM. See Glioblastoma multiforme (GBM) GCS. See Glasgow Coma Score (GCS) GDPR. See General Data Protection Regulation (GDPR) GELLO, 81 83, 83f General Data Protection Regulation (GDPR), 468 469, 496 497 “General medicine AI”, 5


Generalizability, 408 of machine learning, 119 Generative adversarial networks (GANs), 26, 27f, 55, 107 108, 184, 186 187, 187f, 268, 352 GAN-based frameworks, 61 62 Generative tensorial reinforcement learning technique (GENTRL technique), 511 Genome analysis, 120 Genome-wide association studies (GWAS), 365, 465 466 Genome-wide data integration with ML, 119 122 deep learning for genome and epigenome analysis, 120 DNA-binding preferences of transcription factors, 119 120 integrating omics for cancer subtyping, 120 121 integrating single-cell multiomics for precision medicine, 121 122 semiautomated genomic annotation reveals chromatin function, 119 Genomic-sequencing data, 122 123 GENTRL technique. See Generative tensorial reinforcement learning technique (GENTRL technique) GI endoscopy. See Gastrointestinal endoscopy (GI endoscopy) GI Quality Improvement Consortium (GIQuIC), 232 Gland segmentation, 197 198 Glasgow Coma Score (GCS), 426 427 Gleason grade, 319 Glioblastoma multiforme (GBM), 400 Gliomas, 396 Global Alliance for Genomics and Health, 365 Global epidemic monitoring, 449 450 Global Public Health Intelligence Network (GPHIN), 449 Good response (GR), 352 Google Image Search, 38 GoogLeNet, 23 GPHIN. See Global Public Health Intelligence Network (GPHIN) GPUs. See Graphical processing units (GPUs) GR. See Good response (GR) Gradient boosted DTs-based prediction model, 164 165 Graph neural network, 310 Graph-structured recurrent neural network (GSRNN), 445 Graphical processing units (GPUs), 50, 484 Gross tumor volume (GTV), 375 Group Normalization layers (GroupNorm layers), 106 107


GSRNN. See Graph-structured recurrent neural network (GSRNN) GTV. See Gross tumor volume (GTV) GWAS. See Genome-wide association studies (GWAS)

H Hand radiograph, 27 28 Hand-crafted computer-extracted image features, 49 50 Handling data heterogeneity, 105 108, 106f Hard AI. See Artificial general intelligence (AGI) Hausdorf distance, 29 HCAI. See Lark Weight Loss Health Coach AI (HCAI) HCM. See Hypertrophic cardiomyopathy (HCM) Head and neck, 372 373, 375 Health data sources and types, 7 9 medical data inputs and algorithms, 8f environmental and social determinants of, 163 165 Health Information Technology for Economic and Clinical Health Act (HITECH Act), 135, 492 Health Insurance Portability and Accountability Act (HIPAA), 401, 461 Health Level 7 (HL7), 136 137 Health Level Seven International, 81 Health-care AI in, 503 512 data domains by empirical evidence, 509 511 data from same domain, 505 506 deep learning, 506 509 MesoNet layout, 510f traditional indications, 511 512 applications, 107 data, 407 408 Health-care system, 418 419 Healthcare primed for clinical decision support, 138 139 Heart, 39 Heart rate (HR), 156 157 Heat-diffusion model, 121 Hematuria, 312 HEs. See Human experts (HEs) High image dimension, 203 High-resolution (HR) anatomic brain images, 28 information, 186 microendoscopy, 229 High-throughput genomic technologies, 120 Higher degree polynomials, 108 Hip fractures, 278 HIPAA. See Health Insurance Portability and Accountability Act (HIPAA) Histological tissue types (HTTs), 198 199

HITECH Act. See Health Information Technology for Economic and Clinical Health Act (HITECH Act) HL7. See Health Level 7 (HL7) HMDB datasets, 38 39 Homomorphic encryption approach, 108 Hospital outcomes, 141 142 Hounsfield units (HU), 344, 505 506 HR. See Heart rate (HR) HR variability (HRV), 157 HRV. See HR variability (HRV) HTTs. See Histological tissue types (HTTs) HU. See Hounsfield units (HU) Human atlas-based techniques, 59 Human experts (HEs), 77 Human strategy games, 37 Human-designed algorithms, 56 Human-engineered radiomic approaches, 299 Human-in-the-loop AI, 16 Hybrid generative adversarial network designs, 26 27 Hydrocephalus, 425 426 Hyperglycemia, 247 Hypertrophic cardiomyopathy (HCM), 387

I ICA. See Independent component analysis (ICA) ICD. See International Classification of Diseases (ICD) ICP. See Intracranial pressure (ICP) ICUs. See Intensive care units (ICUs) IG. See Information gain (IG) IHC. See Immunohistochemistry (IHC) IID data. See Independent and identically distributed data (IID data) ILD. See Interstitial lung disease (ILD) ILI%. See Influenza-like illness rate (ILI%) Ilium, 278 279 Illumina sequencers, 114 Image acquisition, improvement of, 369 370 Image analysis, 368 Image biomarker quantitation, 205 206 Image registration, 60 63 multimodality image registration, 62 63 single-modality image registration, 62 Image scaling, 341 342 Image segmentation, 56 60, 195 199, 370 of biological structures or tissues, 198 199 and classification of T-cell nuclei, 57f FCNs, 58 gland segmentation, 197 198 localization vs. segmentation, 57 manual labeling, 59 nucleus/cell segmentation, 195 197 priori information, 59

R-CNN features, 58 59 semisupervised and unsupervised approaches, 59 60 Image superresolution, 200 Image-level classification, 190 191 ImageNet, 38, 64 65 Imaging modalities, 55 hybrid portable imaging system, 56f physics of, 344 345 Immunohistochemistry (IHC), 187 IMRT. See Modulated Radiation Therapy (IMRT) IMSI. See Intracytoplasmic morphologically selected sperm injection (IMSI) In vitro diagnostic devices (IVD devices), 206 In Vitro Diagnostic Medical Device Regulation (IVDR), 206 Inception layers, 23 Incremental learning, 15 Independent and identically distributed data (IID data), 105 106 Independent component analysis (ICA), 355 Independent variables, 6 8 Individualized medicine. See Precision medicine Infectious disease transmission modeling, 446 449 Inference, 342 343 process, 80 “Influenza infection” concept, 442 Influenza-like illness rate (ILI%), 445 446 Information gain (IG), 387 Information technology, 383 Initialization, 22 23 Innovation AI, 494 AIM, 484 485 Intensive care units (ICUs), 404, 491 492 Interfacing AI to clinical systems, 207 International Classification of Diseases (ICD), 136 137 Internet-based surveillance systems, 449 450 INTERNIST program, 493 INTERNIST-I, 4 5 Interoperability, 95 96 Interpretability, 12 13, 408 Interstitial lung disease (ILD), 269 pattern recognition, 272 274 Interventional radiology, 370 Intestinal metaplasia, 227 228 Intracranial pressure (ICP), 404 Intracytoplasmic morphologically selected sperm injection (IMSI), 317 318 Intravascular ultrasound (IVUS), 115 116 Ischium, 278 279 IVD devices. See In vitro diagnostic devices (IVD devices)

IVDR. See In Vitro Diagnostic Medical Device Regulation (IVDR) IVUS. See Intravascular ultrasound (IVUS)

K k-nearest neighbors (kNN), 152 153, 191 192 KEs. See Knowledge engineers (KEs) Kidney cancer, 319 321 segmentation, 314 Kidney Donor Risk Index, 346 347 Kinetics datasets, 38 39 kNN. See k-nearest neighbors (kNN) Knowledge acquisition process, 78 bases, 80 81 engineering, 80 81 maintenance, 94 95 Knowledge engineers (KEs), 77 Knowledge representation (KR), 76 and management, 80 83

L Label distribution skew, 105 106 Label propagation (LP), 192 Lagging adoption of digital pathology, 206 207 Laplacian of Gaussian, 42 43 Large volume occlusion (LVO), 407 Large-scale public health data, 117, 117t Lark Weight Loss Health Coach AI (HCAI), 155 156 Laser-induced autofluorescence spectroscopy, 236 LDA. See Linear discriminant analysis (LDA) Learned Experts Assessment-based Reconstruction Network (LEARN), 54 Learning health system, 16 learning-based tomographic reconstruction, 51 Length of stay (LOS), 140 Lesion detection and classification, 281 282, 282f retrieval and mining, 283 284 segmentation and quantification, 282 283, 283f Lifelong machine learning, 15 Lifestyle behaviors, 158 Linear discriminant analysis (LDA), 152 153, 249 Linear models, 8 9 Liver Donor Risk Index, 346 347 LMICs. See Low-and middle-income countries (LMICs) Lobe, 269 272 Localization, 57 Logical Observation Identifiers Names and Codes (LOINC), 136

Long short-term memory networks (LSTM networks), 67, 152 153, 188 189, 398, 445 446 LOS. See Length of stay (LOS) Loss function, 22 Low-and middle-income countries (LMICs), 417 418 Low-dose CT denoising, 52 53 LP. See Label propagation (LP) LSTM networks. See Long short-term memory networks (LSTM networks) LTRC. See Lung tissue research consortium (LTRC) Lumen, 270 272 Lung, 269 272, 270f, 271f, 373, 375 parenchyma, 270 272 Lung tissue research consortium (LTRC), 274 LVO. See Large volume occlusion (LVO)

M M-CHAT. See Modified Checklist for Autism in Toddlers (M-CHAT) MACE. See Major adverse cardiovascular events (MACE) Machine learning (ML), 4 6, 5f, 37, 49 50, 119, 183 184, 223 224, 348 349, 384, 395 396, 419, 421, 450, 458. See also Deep learning (DL) algorithms, 6, 118 119, 314 315, 428 data integration beyond omics, 122 124 genome-wide data integration with ML, 119 122 multimodal decision-making in clinical settings, 124 125 approaches, 310 data acquisition in, 341 incorporation, 343 methods, 152 153, 427 428, 441 442 modeling disease transmission dynamics, 447 models, 42 researchers, 133 studies, 340 system, 66 types of, 384 385, 385t views on, 391 workflow, 353f Machine-assisted image segmentation, 59 Magnetic resonance (MR), 291 292 Magnetic resonance imaging (MRI), 39, 54 55, 116, 265 266, 291 292, 299f, 313, 313f, 344, 395 396 computed tomography scans, 123 124 pancreas segmentation in CT and, 275 pancreatic cancer analysis in CT and, 275 277 pancreatic tumor segmentation and detection in CT and, 276 Major adverse cardiovascular events (MACE), 388 Mammograms, 38 Mammography, 291 292

Manual segmentation, 371, 371f Market adoption, 492 494 MARS model. See Multivariate adaptive regression splines model (MARS model) Mask-RCNN model, 42 43, 43f Maximum pool function (MaxPool function), 24 25 Mayo endoscopic subscore (MES), 237 238 MDD. See European Medical Devices Directive (MDD) MDR. See Medical Device Regulation (MDR) Measuring real-world impact, 13 14 Medical and biological data integration, 119 125 Medical Device Regulation (MDR), 259 Medical Image Computing and Computer Assisted Interventions (MICCAI), 401 Medical Information System (MedISys), 449 Medical logic modules (MLMs), 81, 82f Medication alert systems, 91 92 management process, 76 Medicine, 3 4, 40 41 AI in, 4 5, 473 474 apps distribution in field of, 174 Medicines and Healthcare products Regulatory Agency (MHRA), 175 MedISys. See Medical Information System (MedISys) Medulloblastoma, 419 421, 427 428 Mental health, 159 161 MERs. See Microelectrode recordings (MERs) MES. See Mayo endoscopic subscore (MES) Mesial temporal lobe epilepsy with hippocampal sclerosis (MTLE-HS), 422 MesoNet layout, 509 511, 510f mHealth. See Mobile health (mHealth) MHRA. See Medicines and Healthcare products Regulatory Agency (MHRA) MICCAI. See Medical Image Computing and Computer Assisted Interventions (MICCAI) Microelectrode recordings (MERs), 403 Microfluidics-based technologies, 114 MIL method. See Multiple instance learning method (MIL method) Mild diabetic retinopathy (mtmDR), 490 491 Mining PACS, 267 ML. See Machine learning (ML) MLMs. See Medical logic modules (MLMs) MLP model. See Multilayer perceptron model (MLP model) Mobile health (mHealth), 156, 255 Modalities of images in clinics, 115 116 Model ensembling, 103 104 Modified Checklist for Autism in Toddlers (M-CHAT), 423 Modulated Radiation Therapy (IMRT), 373 374

Molecular markers, prediction of, 31 Molecular mechanisms of disease, 427 428 Monitoring ICP, 404 Mood disorders, 424 425 “Moon shot” project, 493 Motion analysis, 46 Motion classification, 44 45 MPI. See Myocardial perfusion imaging (MPI) mpMRI. See Multiparametric MRI (mpMRI) MR. See Magnetic resonance (MR) MRI. See Magnetic resonance imaging (MRI) MRMC study. See Multireader, multicase study (MRMC study) MTLE-HS. See Mesial temporal lobe epilepsy with hippocampal sclerosis (MTLE-HS) mtmDR. See Mild diabetic retinopathy (mtmDR) MULAN. See Multitask ULA network (MULAN) Multiagent modeling, 447 449 Multilayer perceptron model (MLP model), 310 Multimodal data in biology and medicine emergence of sequencing techniques, 113 115 large-scale public health data, 117 modalities of images in clinics, 115 116 rise of radiomics, 116 117 challenges in, 118 119 Multimodal decision-making in clinical settings, 124 125 Multimodality image registration, 62 63 Multiorgan segmentation in CT and MRI, 277 278 Multiparametric MRI (mpMRI), 318 Multiple instance learning method (MIL method), 190 Multireader, multicase study (MRMC study), 295 Multitask ULA network (MULAN), 281 282 Multivariate adaptive regression splines model (MARS model), 446 6MWT. See 6-minute walk test (6MWT) MYCIN system, 4 5, 89 90 Myocardial perfusion imaging (MPI), 388 Myriad methods, 422 423

N Narrow-band imaging (NBI), 228 National Comprehensive Cancer Network, 324 325 National Early Warning Score (NEWS), 137 138 National Institutes of Health (NIH), 461 462 Natural language processing (NLP), 37, 152 153, 404, 505 NBI. See Narrow-band imaging (NBI) NBI International Colorectal Endoscopic criteria (NICE criteria), 235 236 NCC. See Normalized cross correlation (NCC) Near-affine-invariant texture, 273 274

Negative predictive value (NPV), 225, 490 Neural networks (NNs), 22 24, 37, 152 153 Neurology, AI in, 395 396 Neurophysiological monitoring, 404 405 Neurosciences, 407 NEWS. See National Early Warning Score (NEWS) NICE criteria. See NBI International Colorectal Endoscopic criteria (NICE criteria) NIH. See National Institutes of Health (NIH) NLP. See Natural language processing (NLP) NNs. See Neural networks (NNs) Nodes. See Computational neurons Noise, 118 Nonclinically collected data, 463 Non-deep-learning methods, 274 NoPeek, 109 Normalized cross correlation (NCC), 62 NPV. See Negative predictive value (NPV) Nuclear cardiology, AI role in, 388 Nuclear medicine, 265 266 Nucleus segmentation, 192 193, 195 197 Nutrition, 7

O OAR. See Organs at risk (OAR) Obesity, 153 154 Object Constraint Language (OCL), 81 83 Object crowding, 203 204 Object detection, 192 195 of objects with category labeling, 194 195 of objects without category labeling, 193 194 of particular types of objects, 193 and tracking, 42 44 Object-level classification, 191 192 Observational Health Data Sciences and Informatics (OHDSI), 362 363 Observational Medical Outcomes Partnership (OMOP), 362 363 Obstructive sleep apnea syndrome (OSAS), 157 158 OCL. See Object Constraint Language (OCL) OCM images. See Optical coherence microscopy images (OCM images) OCT. See Optical coherence tomography (OCT) OCTs. See Optimal classification trees (OCTs) Office of the National Coordinator for Health Information Technology (ONC), 135 OHDSI. See Observational Health Data Sciences and Informatics (OHDSI) OLEM. See Online extreme learning machine (OLEM) Omics assay, 119 and health records, 204 205

integration for cancer subtyping, 120 121 omics-based intervention, 125 OMOP. See Observational Medical Outcomes Partnership (OMOP) ONCOCIN’s knowledge base, 90 Oncology, 143 144 AI applications for imaging, 367 370 for diagnosis and prediction, 368 369 to improve exam quality and workflow, 369 370 AI applications for radiation oncology, 371 376 EHRs and clinical data warehouse, 362 367 future directions, 376 377, 377f OncoSHARE database, 362 363 Online extreme learning machine (OLEM), 441 Online machine learning, 15 OpenAI, 494 Ophthalmology, 101 102 Optical coherence microscopy images (OCM images), 201 Optical coherence tomography (OCT), 55 Optical imaging, 55 Optimal classification trees (OCTs), 427 Optimizing care, 140 Organ detection methods, 65 Organs at risk (OAR), 372 373 OSAS. See Obstructive sleep apnea syndrome (OSAS) Oxford Nanopore, 114

P PA. See Physical activity (PA) PACS. See Picture archiving and communication system (PACS) Pain severity, 139 PainCheck, point-of-care smartphone app, 165 Pancreas segmentation in CT and MRI, 275, 275f Pancreatic cancer analysis in CT and MRI, 275 277 pancreas segmentation in CT and MRI, 275 pancreatic tumor segmentation and detection in CT and MRI, 276 prediction and prognosis with pancreatic cancer imaging, 276 277 Pancreatic ductal adenocarcinoma (PDAC), 274 275 Pancreatic tumor segmentation and detection, 276 PanNET, 276, 276f, 277f Partial differential equation-based methods, 56 Pathologic complete response (pCR), 352 Pathologists’ skepticism, 209 Pathology, AI for deep neural networks, 184 189 deep learning in pathological image analysis, 189 202 in pathology image analysis, 202 205 Patient care, data reuse for, 367

Patient Health Questionnaire (PHQ-9), 160 161 Patient impact and beyond, 259 260 Patient privacy, 496 497 Pattern-based Understanding and Learning System (PULS), 449 PCA. See Principal component analysis (PCA) pCR. See Pathologic complete response (pCR) PDAC. See Pancreatic ductal adenocarcinoma (PDAC) Pediatric Emergency Care Applied Research Network (PECARN), 427 Pediatric(s), 415 416 brain tumors, 420 embryonal brain tumors, 420 population, 424 425 Pelvic applications, 278 279 Pelvic X-ray imaging, 278 Pelvis, 373 Personalized medicine. See Precision medicine “Personalized” models, 107 PET. See Positron emission tomography (PET) Phenome-wide association studies (PheWAS), 365 PHI. See Protected health information (PHI) PHNNs. See Progressive holistically-nested networks (PHNNs) Photo-documentation of cecal intubation, 232 Photocoagulation surgery, 247 Photoplethysmographic data (PPG data), 124 PHQ-9. See Patient Health Questionnaire (PHQ-9) Physical activity (PA), 152 153, 155 156 Physician patient, 467 468, 473 474 Physiological monitoring, 404 PI program. See Promoting Interoperability program (PI program) Picture archiving and communication system (PACS), 207, 265 266, 267f PACS-mined datasets, 268 269 Pineoblastoma, 420 Planning target volume (PTV), 400 PLCO. See Prostate, Lung, Colorectal, and Ovarian (PLCO) Plummer’s “patient-centric” system, 134 135 Polyp detection, 233 234 morphology, 235 pathology, 235 236 size, 234 235 Polysomnography, 157 Positive predictive value (PPV), 417 418 Positron emission tomography (PET), 55, 115 116, 344 345 Posterior fossa syndrome, 420 421 Postpartum depression (PPD), 158 159 PPG data. See Photoplethysmographic data (PPG data)

PPGRs. See Predict postprandial glycemic responses (PPGRs) PPV. See Positive predictive value (PPV) Precision medicine, 6, 145, 340 integrate single-cell multiomics for, 121 122 Predict postprandial glycemic responses (PPGRs), 154 Predictions, 342 343 in EHRs, 140 141 of molecular markers, 31 of outcome and survival, 31 and prognosis with pancreatic cancer imaging, 276 277 therapeutic outcome, 348 352 chemotherapy, 349 351 radiation therapy, 351 352 treatment outcome assessment and, 369 Predictive modeling applications, 177 178 Predictors, 6 7 Premature birth, 416 417 Prematurity, 416 419 Preventative medicine, 151 Principal component analysis (PCA), 152 153, 354 Privacy-preserving collaborative deep learning methods handling data heterogeneity, 105 108 protecting patient privacy, 108 109 publicly available software, 109 variants of distributed learning, 103 105 Proactive health, 486 488 Probabilistic reasoning, 83 89 Prognosis, deep-learning-based radiomics, 66 67 Program for Monitoring Emerging Diseases (ProMED), 449 Progressive holistically-nested networks (PHNNs), 269 270 Promoting Interoperability program (PI program), 136 137 Prostate, Lung, Colorectal, and Ovarian (PLCO), 266 Prostate cancer (CaP), 318 319, 320f Protected health information (PHI), 459 Protecting patient privacy, 108 109 Psychological stress, 159 Psychoses, 424 425 PTV. See Planning target volume (PTV) Pubis, 278 279 Public attention and funding, 494 Public health data, 461 462 safety net for, 443 444 surveillance, 438, 439f, 444 446 time series prediction, 444 446 Publicly available software, 109 Pulmonary analysis

in CT, 269 274 in CXR, 266 269 PULS. See Pattern-based Understanding and Learning System (PULS)

Q Quality assessment (QA), 15 16 Quality control, 202 203 Quality management system (QMS), 469 Quantitative imaging, 340 biomarkers, 124 cancer, 343 347 AI in different stages of quantitative imaging workflow, 345 347 physics of imaging modalities, 344 345 Quantity skew, 105 106 Quantum computing, 494 495

R R-CNN features. See Regions with convolutional neural network features (R-CNN features) Radial symmetry transformation, 42 43 Radiation pneumonitis, 351 therapy, 351 352 Radiation oncology, AI applications for, 371 376, 374t outcome prediction toxicity, 376 treatment response, 374 376 treatment planning dosimetry, 373 374 segmentation, 371 373 Radiology, 31, 101 102, 174 abdominal applications, 274 278 pelvic applications, 278 279 thoracic applications, 266 274 ULA, 280 284 Radiology information systems (RISs), 278 Radiomics, 63 64, 116 117, 368, 368f Radiotherapy (RT), 371 Random forest (RF), 152 153, 441 Random uniform, 22 23 Random walker algorithm (RW algorithm), 270 Randomized prospective trials (RPTs), 516 517 RCNN model, 42 43 Real-world evidence (RWE), 363 365 Real-world usage, 257 258 Rebound tenderness, 139 Receiver operating characteristic (ROC), 295, 387 388 Receiver operating curve. See Receiver operating characteristic (ROC) RECIST. See Response evaluation criteria in solid tumors (RECIST)

Rectified linear units (ReLU), 21, 184 185 Rectum, 376 Recurrent neural networks (RNNs), 20, 184, 188 189, 189f Reference Information Model (RIM), 81 83 Region proposal network (RPN), 315 Regions with convolutional neural network features (R-CNN features), 58 59 Regression, 6 7, 342 applications, 27 28 bone age, 27 28 brain age, 28 regression-based method, 64 Regulation, issues in, 468 473 of artificial intelligence, 469 470 challenges to regulatory frameworks, 468 469 privacy and data protection, 470 472 of safety and efficacy, 470 transparency, liability, responsibility, and trust, 472 473 Regulatory approvals and validation, 258 259 Reinforcement learning, 384 385 ReLU. See Rectified linear units (ReLU) Reminder systems, 92 Remote monitoring, 165 Remote screening tools, 165 166 Reporting applications development approaches, 175 Reproducibility, 257 Research repositories, 459 461 Residual blocks, 22, 22f Residual networks. See Residual blocks Response assessment, 446 449 Response evaluation criteria in solid tumors (RECIST), 281 Resting heart rate (RHR), 86, 87f Retinopathy of prematurity, 418 419 Return on investment (ROI), 494 Reverse transcriptase enzymes, 114 RF. See Random forest (RF) RHR. See Resting heart rate (RHR) RIM. See Reference Information Model (RIM) Risk assessment, 340 in cancer, 347 348 of future cancer, 68 RISs. See Radiology information systems (RISs) RNNs. See Recurrent neural networks (RNNs) Robotic surgery, 322 323 automated maneuver, 323 navigation, 322 323 preoperative preparation, 322 ROC. See Receiver operating characteristic (ROC) ROI. See Return on investment (ROI)

Rotation-invariant Gabor-local binary patterns, 273 274 RPN. See Region proposal network (RPN) RPTs. See Randomized prospective trials (RPTs) RT. See Radiotherapy (RT) RW algorithm. See Random walker algorithm (RW algorithm) RWE. See Real-world evidence (RWE)

S S-membership function, 88 89 SAAIM. See Self-adaptive AI model (SAAIM) SAEs. See Stacked autoencoders (SAEs) SaMD. See Software-as-a-medical-device (SaMD) SART. See Simultaneous algebraic reconstruction technique (SART) SBCE. See Small bowel capsule endoscopy (SBCE) SBI. See Suspected blood indicator (SBI) SBRT. See Stereotactic body radiation therapy (SBRT) scATAC-seq. See Single-cell ATAC-sequencing (scATAC-seq) SCI. See Strategic Computing Initiative (SCI) SCNN. See Survival CNN (SCNN) Screening mammograms, 64 SDOH. See Social determinants of health (SDOH) SECT. See Single-energy CT (SECT) Segmentation, 25, 57, 346, 371 373 abdomen, 373 applications, 28 29 of biological structures or tissues, 198 199 brain, 372 deep learning for, 372t head and neck, 372 373 lung, 373 pelvis, 373 Seizure disorders, 421 423 Self-adaptive AI model (SAAIM), 446 Semantic segmentation, 40 42, 41f Semen analysis, 317 318 Semiautomated genomic annotation reveals chromatin function, 119 Semiautomatic computational algorithm, 419 420 Semisupervised approach, 59 60 Sensitivity, 226 Sensor-linked apps, 178 179 SENTINEL software system, 449 Sepsis and infections, 143 Sequencing techniques, 115t bulk sequencing, 113 114 single-cell sequencing, 114 115 Sequential machine learning, 15 SERI. See Singapore Eye Research Institute (SERI)

Severe chronic irritability. See Disruptive mood dysregulation disorder Sexual and reproductive health (SRH), 152 153, 158 159 Shannon-Nyquist theorem, 506 507 SHH tumors. See Sonic hedgehog tumors (SHH tumors) Shift-invariant. See Convolutional neural networks (CNNs) Sigmoidal-shaped function, 23 24 SiMD. See Software in medical device (SiMD) SIMLR. See Single-cell Interpretation via Multikernel Learning (SIMLR) Simultaneous algebraic reconstruction technique (SART), 53 Singapore Eye Research Institute (SERI), 259 Singapore Medical Device Register (SMDR), 259 Single-cell ATAC-sequencing (scATAC-seq), 118 119 Single-cell data, imperfect generation of, 118 119 complementariness of sources of data, 118 119 generalizability of machine learning, 119 Single-cell Interpretation via Multikernel Learning (SIMLR), 121 122 Single-cell multiomics integration for precision medicine, 121 122 Single-cell sequencing, 114 115 Single-energy CT (SECT), 505 506 Single-modality image registration, 62 Single-photon emission CT (SPECT), 55, 388 Single-shot detector (SSD), 196 SIRS. See Systemic inflammatory response syndrome (SIRS) 6-minute walk test (6MWT), 155 Sleep, 156 158 Sliding window image patches, 65 Sliding-window-based CNN strategy, 193 Small bowel capsule endoscopy (SBCE), 224 225 Smart healthcare system, 151 Smartphone apps in data-driven clinical decision-making camera-based apps, 176 177 decision-support modalities, 175 176 distribution of apps in field of medicine, 174 of apps over locations, 175 of digital technologies, 174f guideline/algorithm applications, 177 predictive modeling applications, 177 178 reporting applications development approaches, 175 sensor-linked apps, 178 179 Smartphone-based smoking cessation trial, 162 Smartphones, 7, 10 11, 158 159 Smartwatches (digital devices), 10 11

SMDR. See Singapore Medical Device Register (SMDR) sMRI. See Structural MRIs (sMRI) SNOMED-CT. See Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) SnPR. See Substantia nigra pars reticulata (SnPR) Social determinants of health (SDOH), 152 153, 163 164 Software in medical device (SiMD), 207 Software-as-a-medical-device (SaMD), 469 Sonic hedgehog tumors (SHH tumors), 419 420 SPECT. See Single-photon emission CT (SPECT) Spinal vertebrae, 370 Split learning, 105 SRH. See Sexual and reproductive health (SRH) SRS. See Stereotactic radiosurgery (SRS) SSAE. See Stacked sparse autoencoder (SSAE) SSD. See Single-shot detector (SSD) Stacked autoencoders (SAEs), 184, 187 188, 188f Stacked sparse autoencoder (SSAE), 373 Stain normalization, 199 200 Statistical learning. See Machine learning (ML) Stereotactic body radiation therapy (SBRT), 375, 509 511 Stereotactic radiosurgery (SRS), 399 400 STN. See Subthalamic nucleus (STN) Store-and-forward apps, 176 Strategic Computing Initiative (SCI), 483 Structural MRIs (sMRI), 424 425 Substantia nigra pars reticulata (SnPR), 403 Subthalamic nucleus (STN), 403 Supervised learning, 384 385 Support vector machine (SVM), 8 9, 152 153, 191 192, 347, 375, 397, 417, 441 Supratentorial primitive neuroectodermal, 420 Surgical planning, AI, 399 402, 400f tumor segmentation via ML algorithm, 401f Survival analysis, 6 7 Survival CNN (SCNN), 399 Suspected blood indicator (SBI), 226 SVM. See Support vector machine (SVM) Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT), 136 Systemic inflammatory response syndrome (SIRS), 143, 143t

T Taylor series, 108 TBI. See Traumatic brain injury (TBI) TCGA. See The Cancer Genome Atlas (TCGA) TDA. See Topological data analysis (TDA) Technicon Medical Information System (TMIS), 134 135 Text processing, 449

The Cancer Genome Atlas (TCGA), 348 Theoretical neurological AI research, 406 Thoracic applications, 266 274 pulmonary analysis in CT, 269 274 in CXR, 266 269 Three-dimension (3D) CNN-based method, 270 272 spatial convolutions, 58 Threshold-based methods, 42 43 Time series prediction, 444 446 Time-consuming process, 429 Tissue characterization, 66 Tissue heterogeneity, 194 195 TMIS. See Technicon Medical Information System (TMIS) TN. See True negative (TN) Tomographic image reconstruction, 51f CT, 52 54 foundation, 50 52, 51f imaging modalities, 55 magnetic resonance imaging, 54 55 Tomosynthesis, 295 Topological data analysis (TDA), 387 Toxicity, 376 TP. See True positive (TP) Traditional Western medicine, 473 Transcription factors, DNA-binding preferences of, 119 Transfer learning, 15 Transferability, 95 96 Transmitting electrocardiograms, 176 Transparency, 497 498 Transposed convolution, 23 Transrectal ultrasound (TRUS), 63 Transurethral resection of bladder, 314 315 Trapezoidal membership function, 88, 88f Trauma assessment and surgery, 59 Traumatic brain injury (TBI), 396, 426 427 Triangular membership, 86 88 Tried feature-based methods, 42 43 True negative (TN), 512 513 True positive (TP), 512 513 TRUS. See Transrectal ultrasound (TRUS) Tumor (T), 347 region segmentation, 198 Turing test, 485

U U-Net, 25, 28, 41, 186 UC. See Ulcerative colitis (UC) UCF101 datasets, 38 39 UDA. See Unsupervised domain adaptation (UDA) ULA. See Universal lesion analysis (ULA)

Ulcerative colitis (UC), 232 ULD. See Universal lesion detection (ULD) Ultrasound, 55, 115 116, 344 examination, 312 314 UMLS. See Unified Medical Language System (UMLS) Uncertainty, 83 Unconditional probability, 84 Unified Medical Language System (UMLS), 136 137 Unipolar depression, 424 Universal lesion analysis (ULA), 280 284 DeepLesion dataset, 281 lesion detection and classification, 281 282 lesion retrieval and mining, 283 284 lesion segmentation and quantification, 282 283 Universal lesion detection (ULD), 281 282 Unsupervised approach, 59 60 Unsupervised domain adaptation (UDA), 190 191 Unsupervised learning, 384 385 neural network, 187 188 Unsupervised machine learning methods, 152 153 Upper endoscopy, AI applications in, 227 232 esophageal cancer, 227 229 future directions, 231 232 gastric cancer, 230 231 upper endoscopy quality, 231 Upper GI endoscopy, 227 Urology andrology, 314 315 diagnostic imaging kidney, 319 321 prostate, 318 319 ureter and bladder, 321 322 examinations in, 311 314 ultrasound examination, 312 314 urinalysis and urine cytology, 311 312 future direction, 325 328 risk prediction, 323 325 robotic surgery, 322 323 urological endoscopy, 314 316 cystoscopy and transurethral resection of bladder, 314 315 ureterorenoscopy, 315 316 US Food and Drug Administration (FDA), 15 16, 123, 175, 248, 294 295, 311 312, 363 365, 406 407, 468, 479, 515

V V-Nets, 25 Value-based segmentation, 487f Vanishing gradient problem, 25 26 Ventriculomegaly, 425 426 Video capsule endoscopy, AI applications in, 224 226 anatomical landmark identification, 225 226 decreasing read time, 225 improving sensitivity, 226 recent developments, 226 Video datasets, 38 40, 39f Virtual biopsy, 303, 420 Volumetric laser endomicroscopy (VLE), 229

W Watches (connected devices), 7 “WavSTAT4” optical biopsy system, 236 WCE. See Wireless video capsule endoscopy (WCE) Wellness, AI, 151 WG-26. See Working Group 26 (WG-26) White light cystoscope (WLC), 314 White matter fiber tractography, 397 WHO. See World Health Organization (WHO) Whole-genome bisulfite sequencing, 114 Whole-slide imaging technique (WSI technique), 183 Wireless video capsule endoscopy (WCE), 224 225 Withdrawal time, 233 Wize Mirror, 163 WLC. See White light cystoscope (WLC) Workflow integration, 92 93 Working Group 26 (WG-26), 207 World Health Organization (WHO), 316, 317t WSI technique. See Whole-slide imaging technique (WSI technique)

X X-ray imagers, 55

Y YOLO model, 42 43