Machine Learning in Clinical Neuroimaging: 4th International Workshop, MLCN 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September ... Vision, Pattern Recognition, and Graphics) 3030875857, 9783030875855

This book constitutes the refereed proceedings of the 4th International Workshop on Machine Learning in Clinical Neuroimaging, MLCN 2021, held in conjunction with MICCAI 2021, in Strasbourg, France, in September 2021.

Table of contents :
Preface
Organization
Contents
Computational Anatomy
Unfolding the Medial Temporal Lobe Cortex to Characterize Neurodegeneration Due to Alzheimer’s Disease Pathology Using Ex vivo Imaging
1 Introduction
2 Ex-vivo Imaging Dataset
2.1 Specimen Preparation and Imaging
2.2 Quantitative NFT Burden Maps from Histology
2.3 Histology-Guided MTL Subregion Segmentations
3 Methods
3.1 Overview of Topological Unfolding Framework
3.2 Segmentation of the Outer MTL Boundary in Ex vivo MRI
3.3 Laplacian Coordinate System
3.4 Mapping Image and Morphological Features to Unfolded Space
4 Experiments and Results
4.1 Consensus MTL Subregion Segmentation in Unfolded Coordinate Space
4.2 Correlating NFT Burden with MTL Neurodegeneration
4.3 Surface-Based Registration Using Mean Curvature Maps
5 Conclusions
References
Distinguishing Healthy Ageing from Dementia: A Biomechanical Simulation of Brain Atrophy Using Deep Networks
1 Introduction
2 Methods
2.1 Data
2.2 Preprocessing
2.3 Model Overview
2.4 Training and Evaluation
3 Experimental Methods and Results
3.1 Evaluation of Biomechanical Model
3.2 Evaluation of Atrophy Estimation
4 Discussion and Future Work
References
Towards Self-explainable Classifiers and Regressors in Neuroimaging with Normalizing Flows
1 Introduction
2 Normalizing Flows as Generative Invertible Classifiers and Regressors
2.1 Manifold-Constrained NFs for Efficient 3D Data Processing
2.2 Implementation Details and Model Training
3 Explainable AI with Normalizing Flows
3.1 Derivative-Based Attribution Map of the Inverse
3.2 Counterfactual Images for Systematic Analyses
4 Experiments and Results
5 Conclusion
References
Patch vs. Global Image-Based Unsupervised Anomaly Detection in MR Brain Scans of Early Parkinsonian Patients
1 Introduction
2 Brain Anomaly Detection Pipeline
2.1 Autoencoder Architectures
2.2 Post-processing of the Reconstruction Error Maps
3 Experiments
3.1 Data
3.2 Training of the Auto-Encoders
3.3 Performance Evaluation
4 Results
5 Discussion and Conclusion
References
MRI Image Registration Considerably Improves CNN-Based Disease Classification
1 Introduction
2 Data and Methods
2.1 Dataset
2.2 Image Preprocessing
2.3 Network Architecture and Training
3 Results
4 Discussion
References
Dynamic Sub-graph Learning for Patch-Based Cortical Folding Classification
1 Introduction
2 Methods
3 Experimental Results
4 Conclusions
References
Detection of Abnormal Folding Patterns with Unsupervised Deep Generative Models
1 Introduction
2 Methods
2.1 Focusing on Folding Information
2.2 Generating Synthetic Brain Anomalies
2.3 Learning a Representation of the Normal Variability
3 Results
3.1 Datasets and Implementation
3.2 Analysing Learned Folding Variability
4 Discussion and Conclusion
References
PialNN: A Fast Deep Learning Framework for Cortical Pial Surface Reconstruction
1 Introduction
2 Related Work
3 Method
3.1 Deformation Block
3.2 Smoothing and Training
4 Experiments
5 Conclusion
References
Multi-modal Brain Segmentation Using Hyper-Fused Convolutional Neural Network
1 Introduction
2 Method
2.1 Baseline Architecture
2.2 Proposed Architecture
2.3 Learning Process and Implementation Details
3 Experiments and Results
3.1 Datasets
3.2 Results and Discussion
4 Conclusion
References
Robust Hydrocephalus Brain Segmentation via Globally and Locally Spatial Guidance
1 Introduction
2 Method
2.1 Guidance with Registration Module
2.2 Segmentation with Positional Correlation Attention Block
2.3 Training Strategy
3 Experiments and Results
3.1 Datasets and Experiments
3.2 Results
4 Conclusion
References
Brain Networks and Time Series
Geometric Deep Learning of the Human Connectome Project Multimodal Cortical Parcellation
1 Introduction
2 Methods
2.1 Participants and Image Acquisition
2.2 Modelling the Cortex as an Icosphere
2.3 Image Processing and Augmentation
2.4 Model Architecture and Implementation
3 Results
4 Discussion
References
Deep Stacking Networks for Conditional Nonlinear Granger Causal Modeling of fMRI Data
1 Introduction
2 Materials and Methods
2.1 Deep Stacking Network
2.2 Conditional Nonlinear Granger Causal Modeling with DSN
2.3 Model Validation and Application
3 Experiments and Results
3.1 Synthetic Dataset
3.2 Simulated fMRI Dataset
3.3 Real-World fMRI Dataset
4 Discussion
5 Conclusion
References
Dynamic Adaptive Spatio-Temporal Graph Convolution for fMRI Modelling
1 Introduction
2 Methodology
2.1 Preliminaries
2.2 Temporal Lag Correction
2.3 Temporal Feature Extraction
2.4 Spatial Feature Extraction
2.5 Framework of the Model
3 Experiments
3.1 Dataset
3.2 Experimental Setup
3.3 Experimental Results
4 Generalizability
5 Limitations
6 Discussion
References
Structure-Function Mapping via Graph Neural Networks
1 Introduction
2 Preliminaries
2.1 Problem Statement
2.2 Autoencoder
2.3 Graph Convolutional Networks (GCN)
2.4 Graph Transformer Networks (GTN)
3 Experiments
3.1 Data
3.2 Implementation
4 Results and Discussion
5 Conclusion
References
Improving Phenotype Prediction Using Long-Range Spatio-Temporal Dynamics of Functional Connectivity
1 Introduction
2 Related Works
3 Methods
4 Results
5 Discussion
References
H3K27M Mutations Prediction for Brainstem Gliomas Based on Diffusion Radiomics Learning
1 Introduction
2 Proposed Method
2.1 Materials and Preprocessing
2.2 Problem Formulation
2.3 Multi-Mechanism Diffusion-Convolution (MMDC)
2.4 Diffusion-Radiomics
3 Experiments and Results
3.1 Experimental Settings
3.2 H3K27M Mutation Prediction Results
3.3 Node Pooling Interpretation
4 Conclusion
References
Constrained Learning of Task-Related and Spatially-Coherent Dictionaries from Task fMRI Data
1 Introduction
2 Constrained Online Dictionary Learning
2.1 Dictionary Learning of fMRI Data
2.2 Incorporating Task Characteristics
2.3 Constraining Spatial Patterns
2.4 Optimization
3 Application and Results
3.1 Synthetic fMRI Data Generation Using SimTB
3.2 Evaluation of Sparse Dictionary Learning Algorithms
3.3 Synthetic Data Results
3.4 Real Task fMRI Data
4 Conclusion
References
Author Index

LNCS 13001

Ahmed Abdulkadir · Seyed Mostafa Kia · Mohamad Habes · Vinod Kumar · Jane Maryam Rondina · Chantal Tax · Thomas Wolfers (Eds.)

Machine Learning in Clinical Neuroimaging 4th International Workshop, MLCN 2021 Held in Conjunction with MICCAI 2021 Strasbourg, France, September 27, 2021, Proceedings

Lecture Notes in Computer Science Founding Editors Gerhard Goos Karlsruhe Institute of Technology, Karlsruhe, Germany Juris Hartmanis Cornell University, Ithaca, NY, USA

Editorial Board Members Elisa Bertino Purdue University, West Lafayette, IN, USA Wen Gao Peking University, Beijing, China Bernhard Steffen TU Dortmund University, Dortmund, Germany Gerhard Woeginger RWTH Aachen, Aachen, Germany Moti Yung Columbia University, New York, NY, USA

13001

More information about this subseries at http://www.springer.com/series/7412

Ahmed Abdulkadir · Seyed Mostafa Kia · Mohamad Habes · Vinod Kumar · Jane Maryam Rondina · Chantal Tax · Thomas Wolfers (Eds.)

Machine Learning in Clinical Neuroimaging 4th International Workshop, MLCN 2021 Held in Conjunction with MICCAI 2021 Strasbourg, France, September 27, 2021 Proceedings

Editors Ahmed Abdulkadir University of Pennsylvania Philadelphia, PA, USA

Seyed Mostafa Kia Donders Institute Nijmegen, The Netherlands

Mohamad Habes The University of Texas Health Science Center at San Antonio San Antonio, TX, USA

Vinod Kumar Max Planck Institute for Biological Cybernetics Tübingen, Germany

Jane Maryam Rondina University College London London, UK

Chantal Tax University Medical Center Utrecht Utrecht, The Netherlands

Thomas Wolfers University of Oslo Oslo, Norway

Cardiff University Brain Research Imaging Centre (CUBRIC) Cardiff, UK

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-87585-5 ISBN 978-3-030-87586-2 (eBook) https://doi.org/10.1007/978-3-030-87586-2 LNCS Sublibrary: SL6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Methodological developments in neuroimaging analysis contribute to the progress in clinical neurosciences. In specific domains of academic image analysis, impressive strides were made thanks to modern machine learning and data analysis methods such as deep artificial neural networks. The initial success in academic applications of complex neural networks started a wave of studies throughout the neuroimaging research field. Deep learning is now complementing more traditional machine learning as a tool for image and data analysis. It is our view that incorporating interdisciplinary domain knowledge into the machine learning models is critical to answer challenging clinically relevant research questions in the field of clinical neuroscience that eventually will translate to clinical routine. With this workshop, we aimed at creating an intellectual playing field for clinicians and machine learning experts alike to share and discuss knowledge at the interface between machine learning and clinical application.

The 4th International Workshop on Machine Learning in Clinical Neuroimaging (MLCN 2021) was held as a satellite event of the 24th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2021) to foster a scientific dialog between experts in machine learning and clinical neuroimaging. The call for papers was published on April 30, 2021, and the submission window closed on July 5, 2021. Each submitted manuscript was reviewed by three members of the Program Committee in a double-blind review process. The accepted manuscripts contained in these proceedings present a methodologically sound, novel, and thematically fitting contribution to the field of clinical neuroimaging, and were presented and discussed by the authors at the virtual MLCN workshop.

The contributions studied in vivo structural and functional magnetic resonance imaging data. Several accepted submissions were concerned with computational anatomy, involving a wide range of methods including supervised image segmentation, registration, classification, anomaly detection, and generative modeling. Network analysis and time series were other topical branches of the workshop contributions, in which a wide variety of methods were employed and developed, including dictionary learning, graph neural networks, and space-time convolutional neural networks. The fields of application were as diverse as the methods. They included detection and modeling of abnormal cortical folding patterns, simulation of brain atrophy, mapping histology to ex vivo imaging, mapping functional cortical regions, and mapping structural to functional connectivity graphs. The methodological developments pushed the boundaries of clinical neuroscience image analysis with fast algorithms for complex and accurate descriptors of structure, function, or the combination of multiple modalities.


This workshop was made possible by a devoted community of authors, Program Committee, Steering Committee, and workshop participants. We thank all creators and attendees for their valuable contributions. September 2021

Ahmed Abdulkadir Mohamad Habes Seyed Mostafa Kia Vinod Kumar Jane Maryam Rondina Chantal Tax Thomas Wolfers

Organization

Steering Committee Christos Davatzikos Andre Marquand Jonas Richiardi Emma Robinson

University of Pennsylvania, USA Donders Institute, The Netherlands Lausanne University Hospital, Switzerland King’s College London, UK

Organizing Committee Ahmed Abdulkadir Mohamad Habes Seyed Mostafa Kia Vinod Kumar Jane Maryam Rondina Chantal Tax Thomas Wolfers

University of Pennsylvania, USA University of Texas Health Science Center at San Antonio, USA University Medical Center Utrecht, The Netherlands Max Planck Institute for Biological Cybernetics, Germany University College London, UK University Medical Center Utrecht, The Netherlands NORMENT, Norway

Program Committee Mohammed Al-Masni Andre Altman Pierre Berthet Özgün Çiçek Richard Dinga Charlotte Fraza Pouya Ghaemmaghami Francesco La Rosa Sarah Lee Hangfan Liu Emanuele Olivetti Pradeep Reddy Raamana Saige Rutherford Hugo Schnack Haochang Shou Haykel Snoussi Sourena Soheili Nezhad Rashid Tanweer Petteri Teikari

Yonsei University, South Korea University College London, UK University of Oslo, Norway University of Freiburg, Germany Donders Institute, The Netherlands Donders Institute, The Netherlands Concordia University, Canada Ecole Polytechnique Fédérale de Lausanne, Switzerland Amallis Consulting, UK University of Pennsylvania, USA Fondazione Bruno Kessler, Italy University of Toronto, Canada University of Michigan, USA University Medical Center Utrecht, The Netherlands University of Pennsylvania, USA University of Texas Health Science Center at San Antonio, USA Radboud University Medical Center, The Netherlands University of Pennsylvania, USA University College London, UK


Erdem Varol Matthias Wilms Tianbo Xu Mariam Zabihi

Columbia University, USA University of Calgary, Canada University College London, UK Radboud University Medical Center, The Netherlands

Contents

Computational Anatomy

Unfolding the Medial Temporal Lobe Cortex to Characterize Neurodegeneration Due to Alzheimer’s Disease Pathology Using Ex vivo Imaging . . . 3
Sadhana Ravikumar, Laura Wisse, Sydney Lim, David Irwin, Ranjit Ittyerah, Long Xie, Sandhitsu R. Das, Edward Lee, M. Dylan Tisdall, Karthik Prabhakaran, John Detre, Gabor Mizsei, John Q. Trojanowski, John Robinson, Theresa Schuck, Murray Grossman, Emilio Artacho-Pérula, Maria Mercedes Iñiguez de Onzoño Martin, María del Mar Arroyo Jiménez, Monica Muñoz, Francisco Javier Molina Romero, Maria del Pilar Marcos Rabal, Sandra Cebada Sánchez, José Carlos Delgado González, Carlos de la Rosa Prieto, Marta Córcoles Parada, David Wolk, Ricardo Insausti, and Paul Yushkevich

Distinguishing Healthy Ageing from Dementia: A Biomechanical Simulation of Brain Atrophy Using Deep Networks . . . 13
Mariana Da Silva, Carole H. Sudre, Kara Garcia, Cher Bass, M. Jorge Cardoso, and Emma C. Robinson

Towards Self-explainable Classifiers and Regressors in Neuroimaging with Normalizing Flows . . . 23
Matthias Wilms, Pauline Mouches, Jordan J. Bannister, Deepthi Rajashekar, Sönke Langner, and Nils D. Forkert

Patch vs. Global Image-Based Unsupervised Anomaly Detection in MR Brain Scans of Early Parkinsonian Patients . . . 34
Verónica Muñoz-Ramírez, Nicolas Pinon, Florence Forbes, Carole Lartizen, and Michel Dojat

MRI Image Registration Considerably Improves CNN-Based Disease Classification . . . 44
Malte Klingenberg, Didem Stark, Fabian Eitel, and Kerstin Ritter for the Alzheimer’s Disease Neuroimaging Initiative

Dynamic Sub-graph Learning for Patch-Based Cortical Folding Classification . . . 53
Zhiwei Deng, Jiong Zhang, Yonggang Shi, and the Health and Aging Brain Study (HABS-HD) Study Team

Detection of Abnormal Folding Patterns with Unsupervised Deep Generative Models . . . 63
Louise Guillon, Bastien Cagna, Benoit Dufumier, Joël Chavas, Denis Rivière, and Jean-François Mangin

PialNN: A Fast Deep Learning Framework for Cortical Pial Surface Reconstruction . . . 73
Qiang Ma, Emma C. Robinson, Bernhard Kainz, Daniel Rueckert, and Amir Alansary

Multi-modal Brain Segmentation Using Hyper-Fused Convolutional Neural Network . . . 82
Wenting Duan, Lei Zhang, Jordan Colman, Giosue Gulli, and Xujiong Ye

Robust Hydrocephalus Brain Segmentation via Globally and Locally Spatial Guidance . . . 92
Yuanfang Qiao, Haoyi Tao, Jiayu Huo, Wenjun Shen, Qian Wang, and Lichi Zhang

Brain Networks and Time Series Geometric Deep Learning of the Human Connectome Project Multimodal Cortical Parcellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Logan Z. J. Williams, Abdulah Fawaz, Matthew F. Glasser, A. David Edwards, and Emma C. Robinson Deep Stacking Networks for Conditional Nonlinear Granger Causal Modeling of fMRI Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 Kai-Cheng Chuang, Sreekrishna Ramakrishnapillai, Lydia Bazzano, and Owen T. Carmichael Dynamic Adaptive Spatio-Temporal Graph Convolution for fMRI Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Ahmed El-Gazzar, Rajat Mani Thomas, and Guido van Wingen Structure-Function Mapping via Graph Neural Networks . . . . . . . . . . . . . . . . . . . . 135 Yang Ji, Samuel Deslauriers-Gauthier, and Rachid Deriche Improving Phenotype Prediction Using Long-Range Spatio-Temporal Dynamics of Functional Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 Simon Dahan, Logan Z. J. Williams, Daniel Rueckert, and Emma C. Robinson


H3K27M Mutations Prediction for Brainstem Gliomas Based on Diffusion Radiomics Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 Ne Yang, Xiong Xiao, Xianyu Wang, Guocan Gu, Liwei Zhang, and Hongen Liao Constrained Learning of Task-Related and Spatially-Coherent Dictionaries from Task fMRI Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Sreekrishna Ramakrishnapillai, Harris R. Lieberman, Jennifer C. Rood, Stefan M. Pasiakos, Kori Murray, Preetham Shankapal, and Owen T. Carmichael Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

Computational Anatomy

Unfolding the Medial Temporal Lobe Cortex to Characterize Neurodegeneration Due to Alzheimer’s Disease Pathology Using Ex vivo Imaging Sadhana Ravikumar1(B) , Laura Wisse2 , Sydney Lim1 , David Irwin1 , Ranjit Ittyerah1 , Long Xie1 , Sandhitsu R. Das1 , Edward Lee1 , M. Dylan Tisdall1 , Karthik Prabhakaran1 , John Detre1 , Gabor Mizsei1 , John Q. Trojanowski1 , John Robinson1 , Theresa Schuck1 , Murray Grossman1 , Emilio Artacho-Pérula3 , Maria Mercedes Iñiguez de Onzoño Martin3 , María del Mar Arroyo Jiménez3 , Monica Muñoz3 , Francisco Javier Molina Romero3 , Maria del Pilar Marcos Rabal3 , Sandra Cebada Sánchez3 , José Carlos Delgado González3 , Carlos de la Rosa Prieto3 , Marta Córcoles Parada3 , David Wolk1 , Ricardo Insausti3 , and Paul Yushkevich1 1 University of Pennsylvania, Philadelphia, USA

[email protected]

2 Department of Diagnostic Radiology, Lund University, Lund, Sweden 3 University of Castilla La Mancha, Albacete, Spain

Abstract. Neurofibrillary tangle (NFT) pathology in the medial temporal lobe (MTL) is closely linked to neurodegeneration, and is the early pathological change associated with Alzheimer’s Disease (AD). In this work, we investigate the relationship between MTL morphometry features derived from high-resolution ex vivo imaging and histology-based measures of NFT pathology using a topological unfolding framework applied to a dataset of 18 human postmortem MTL specimens. The MTL has a complex 3D topography and exhibits a high degree of intersubject variability in cortical folding patterns which poses a significant challenge for volumetric registration methods typically used during MRI template construction. By unfolding the MTL cortex, the proposed framework explicitly accounts for the sheet-like geometry of the MTL cortex and provides a two-dimensional reference coordinate space which can be used to implicitly register cortical folding patterns across specimens based on distance along the cortex despite large anatomical variability. Leveraging this framework in a subset of 15 specimens, we characterize the associations between NFTs and morphological features such as cortical thickness and surface curvature and identify regions in the MTL where patterns of atrophy are strongly correlated with NFT pathology. Keywords: Medial temporal lobe · Ex vivo MRI · Cortical unfolding

Electronic supplementary material The online version of this chapter (https://doi.org/10.1007/ 978-3-030-87586-2_1) contains supplementary material, which is available to authorized users. © Springer Nature Switzerland AG 2021 A. Abdulkadir et al. (Eds.): MLCN 2021, LNCS 13001, pp. 3–12, 2021. https://doi.org/10.1007/978-3-030-87586-2_1


1 Introduction

The medial temporal lobe (MTL) is an essential component of the human memory system and the earliest region of the cortex affected by tau neurofibrillary tangles (NFT), a hallmark pathology associated with Alzheimer’s Disease (AD). The accumulation of NFT pathology in the brain is closely linked to neurodegeneration and cognitive decline [1, 2]. According to studies by Braak and Braak [1], the spread of NFTs through the brain follows a characteristic pattern, with early manifestations observed in a specific region of the MTL surrounding the border between the lateral part of the entorhinal cortex (ERC) and the transentorhinal cortex (which corresponds to Brodmann area (BA) 35). The NFTs then spread further into the ERC before emerging in the hippocampus.

Measurements of neurodegeneration in MTL subregions can be derived using structural magnetic resonance imaging (MRI) and have been shown to be sensitive to changes during the early stages of AD [3]. However, the specificity of these measurements to AD is limited by the fact that aging, other neurodegenerative pathologies such as TDP-43, and vascular disease are frequently comorbid in patients with AD, and also cause structural changes in the MTL. Recent studies suggest that compared to NFT pathology, these concomitant pathologies are associated with different patterns of neurodegeneration within the MTL [4]. Therefore, improved characterization of the relationship between MTL neurodegeneration and NFT pathology could lead to the discovery of atrophy patterns that are strongly associated with NFT burden specifically, thereby contributing towards the development of in vivo imaging biomarkers for neurodegeneration that are more sensitive to longitudinal change in the presence of early AD.

Here, we study the relationship between MTL morphometry measures derived from high-resolution ex vivo imaging and histology-based measures of NFT pathology in a dataset of 15 human MTL specimens. The MTL has a complex topography and exhibits a high degree of anatomical variability, which poses a significant challenge during groupwise analyses that rely on creating a statistical reference space across all subjects. Typically, volumetric deformable registration is used to align individual anatomies and creates a reference space in the form of an average-shaped template of the anatomical region of interest [5]. However, when applied to the MTL, deformable registration can result in collapsing of different morphologies of the collateral sulcus, a basic landmark in the MTL [6, 7]. The variability in depth and shape of this sulcus can limit our ability to accurately infer the locations of MTL subregions such as BA35 and BA36, which are located along the collateral sulcus. A promising alternative is to use a surface-based approach which explicitly accounts for cortical folding patterns. In fact, ex vivo studies suggest that the locations of subregion borders depend on distance along the flattened cortical surface, indicating that MTL topology is an important consideration when studying inter-individual differences in MTL structure [8]. Therefore, in this work, we aim to investigate the relationship between NFT pathology and cortical thickness in a flattened space.
Additionally, we seek to better understand the features of MTL morphometry driving the accumulation of NFT pathology by comparing our findings to the results of regional thickness analyses performed after alignment of cortical folding patterns using both surface-based and volumetric registration [7].


Existing tools such as FreeSurfer [9], developed for flattening the cortical surface in vivo, are not easily applicable to ultra-high-resolution ex vivo MRI. Instead, we customize the topological framework developed by DeKraker et al. [10] to create a two-dimensional (2D) unfolded representation of the extra-hippocampal MTL cortex, which includes the ERC, BA35, BA36 and the parahippocampal cortex (PHC). This unfolded coordinate space can be used to index locations of the extrahippocampal MTL cortex in 2D based on their distance from the hippocampus and thus provides implicit registration between specimens despite differences in sulcal depth and folding patterns. Leveraging this framework, we identify regions of the MTL where atrophy correlates most strongly with NFT burden. In an exploratory analysis, we show that groupwise registration of MTL sulcal patterns across specimens can be performed using surface curvature for local shape comparison.

2 Ex-vivo Imaging Dataset

2.1 Specimen Preparation and Imaging

Intact ex vivo brain bank specimens of the MTL were obtained from 18 donors (12 males, aged 45–93) from the University of Pennsylvania (UPenn) and the University of Castilla-La Mancha (UCLM) in Spain. Human brain specimens were obtained in accordance with the UPenn Institutional Review Board guidelines, and the Ethical Committee of UCLM. Where possible, pre-consent during life and, in all cases, next-of-kin consent at death was given. Following 4+ weeks of fixation, the MTL blocks were imaged on a Varian 9.4 T animal scanner at a 0.2 × 0.2 × 0.2 mm³ resolution using a T2-weighted, multi-slice spin echo sequence (TR = 9330 ms, TE = 23 ms). Due to gradient distortions in the 9.4 T scanner, as part of the post-processing, all of the scans had to be warped to correct for differences between the scanner coordinate frame and physical coordinate frame. Linear scaling factors for this transformation were derived using a 3D printed phantom [11]. Following MRI scanning, the specimens underwent serial histological processing. Specimens were cut into 2 cm blocks using custom molds that were 3D printed to fit each MTL specimen, frozen and sectioned at 50 µm intervals. Every 10th section was stained for cytoarchitecture (Nissl stain) and, in 15 specimens, every 20th section was prepared for immunohistochemistry (IHC) with the anti-tau AT8 antibody and Nissl counterstain. Sections were mounted on 7.5 cm × 5 cm slides and digitally scanned at 20X resolution. For each block, the scanned sections were reconstructed in 3D and aligned to MRI space using a custom deformable 3D registration pipeline.

2.2 Quantitative NFT Burden Maps from Histology

For the 15 specimens with anti-tau IHC, “heat maps” quantifying the burden of NFT pathology on each of the anti-tau IHC sections were generated using a weakly supervised deep learning algorithm as described in [12]. Given an input patch extracted from a histology slide, the network outputs a spatial heatmap indicating the intensity of tangles at each location. The automated NFT burden measures generated by the network were shown to be consistent with manual NFT counts and semi-quantitative ratings of NFT severity provided by an expert neuropathologist.


For each specimen, the trained network was applied to all anti-tau whole-slide IHC images and the resulting heatmaps were reconstructed into a 3D volume and transformed into the space of the 9.4 T MRI. Further details of the histology protocol, the approach for 3D reconstruction and matching of histology to MRI, and NFT mapping are provided in [12].

2.3 Histology-Guided MTL Subregion Segmentations

In 11 specimens, the MTL subregions were labeled in ex vivo MRI space based on cytoarchitectural features derived from the serial histology images (Nissl stain). These histology-guided segmentations are performed manually on each MRI slice and are highly labor intensive. Therefore, they are currently only available in a subset of cases. First, the boundaries between hippocampal subfields and extrahippocampal subregions (ERC, BA35, BA36, area TE and the PHC (areas TF and TH)) were identified in the histology images. Following histology reconstruction and registration to MRI space, the boundary annotations were mapped into 3D MRI space and overlaid on the coregistered MRI and histology images. Guided by the boundary annotations, the subfield segmentations were manually traced in 3D MRI space (Fig. 1C). Note that for each specimen, small gaps in the segmentation may exist between histology blocks.

3 Methods

3.1 Overview of Topological Unfolding Framework

To unfold the extra-hippocampal MTL, we applied the framework developed by DeKraker et al. [10, 13] which imposes a curvilinear coordinate system on the cortex by solving a set of Laplace's equations along segmentations of the gray matter. DeKraker et al. propose unfolding the hippocampus by computing potential field gradients along the anterior-posterior and proximal-distal directions of the cortex. This is done by defining additional boundary conditions at the anterior, posterior, proximal and distal ends of the region of interest and solving Laplace's equation for three sets of boundary conditions, $\nabla^2 \varphi_{AP} = 0$, $\nabla^2 \varphi_{PD} = 0$, $\nabla^2 \varphi_{IO} = 0$, in the anterior-posterior (AP), proximal-distal (PD) and laminar (IO) directions respectively. We focus on unfolding the extra-hippocampal MTL by first segmenting the extra-hippocampal region over which the potential field is defined, labelling the boundary surfaces, and then solving the set of three Laplace's equations to generate a potential field in each direction.

3.2 Segmentation of the Outer MTL Boundary in Ex vivo MRI

In the ex vivo MRI scans of each of the 18 specimens, separate labels were used to segment the MTL gray matter and six boundary surfaces. The MTL gray matter was segmented using a semi-automated interpolation approach which combines inter-slice interpolation [14] and manual editing to reduce the manual effort needed to generate the segmentations. In the anterior region of MTL cortex, the gray matter segmentation extends until the medial bank of the occipitotemporal sulcus. In the posterior region of


MTL cortex, heuristically defined as starting 6 mm after the end of the hippocampus head, the gray matter segmentation only extends until the fundus of the collateral sulcus. This difference in lateral boundaries between the two regions introduces a discontinuity in the extent of unfolded tissue. Figure 1A shows an example MTL segmentation with boundary labels.

Fig. 1. Illustration of the topological unfolding framework applied to the extra-hippocampal MTL in an example specimen. (A) 3D reconstruction and coronal view of the semi-automated segmentation of the extra-hippocampus (red) and the six boundary labels used for solving Laplace’s equation along the anterior-posterior (AP), proximal-distal (PD) and inner-outer (IO) directions. The inner and outer boundary labels are marked in the cross-sectional view. (B) Coronal view and mid-surface model of the Laplacian solutions in each direction. (C) Coronal view of the NFT burden map, cortical thickness measurements (in mm) and MTL subregion labels in native MRI space. (D) Subregion labels shown in native and unfolded space along with unfolded representations of NFT burden, cortical thickness, and mean curvature along the cortical surface. (Color figure online)

Prior to segmentation, the MRI scan for each specimen was re-oriented such that the long axis of the hippocampus aligned with the anterior-posterior direction. Therefore, the AP boundary labels were obtained by dilating the MTL gray matter segmentation in the anterior-posterior dimension. Since the unfolding framework focuses on the extrahippocampal MTL, we defined the hippocampus as the proximal boundary. We note that part of the subiculum and parasubiculum is included in the unfolded tissue (they form the medial boundary, particularly for the PHC). The distal boundary was obtained by manually labelling the voxels bordering the lateral extent of the MTL gray matter segmentation on each slice.


Lastly, the inner white matter and outer pial surfaces were labelled using a semi-automated approach which involved dilating the MTL segmentation in the coronal plane and re-labelling the voxels falling within the background (identified by thresholding the MRI scan) with a different label to separate the inner and outer boundaries. Following this process, each of the completed segmentations were visually evaluated and any errors were manually corrected. Figure 1A shows a coronal view and 3D reconstruction of an example MTL segmentation including the boundaries used to solve the Laplace equations.

3.3 Laplacian Coordinate System

Given the segmentation image, the Laplace equations were solved in the AP, PD, and IO directions by modifying the MATLAB code provided in [13] (Fig. 1B). The AP and PD potential field gradients together make up a 2D, unfolded coordinate system that can be used to index any point along the unfolded cortex. Due to the discontinuity at the boundary of the anterior and posterior MTL segmentation (shown in Fig. 1A), we computed Laplace solutions separately for the anterior and posterior MTL cortex resulting in a set of two coordinate maps per specimen. To reflect the real-world size and extent of the two regions, the unfolded map of the anterior region was sampled with a 1:1 AP:PD aspect ratio, while the posterior region was sampled with an empirically estimated, 1:0.7 aspect ratio.

3.4 Mapping Image and Morphological Features to Unfolded Space

For the 15 specimens with NFT burden maps, regional cortical thickness was measured in native MRI space by generating a smoothed surface mesh of the extra-hippocampal segmentation, extracting the pruned Voronoi skeleton of the surface mesh [15] and computing twice the distance between each vertex and the closest point on the skeleton. The NFT heatmaps, subregion labels, and thickness measurements (shown in Fig. 1C), were transformed from native MRI space to unfolded space as follows: the mid-surface of the extra-hippocampal MTL was extracted by interpolating the 3D native-space coordinates corresponding to the unfolded points at a laminar potential of 0.5 from the inner and outer surfaces. Delaunay triangulation was used to perform scattered interpolation [16]. Image features were then sampled from the nearest-neighbor interpolated location in MRI space for each point along the mid-surface. Additionally, mean curvature was estimated at each vertex along the 0.5-level mid-surface of the cortex using the patchcurvature() function in MATLAB. Gaussian smoothing was then applied to the thickness and curvature maps for each specimen in a reparametrized unfolded space that reflects the real-world distances between points (Supplementary Table 1) [13]. Figure 1D shows examples of the four image features mapped to unfolded space. Additionally, the unfolded feature maps for all 18 specimens are shown in Supplementary Fig. 1.
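To illustrate the Laplacian coordinate construction of Sects. 3.1 and 3.3, the short Python sketch below relaxes Laplace's equation inside a gray-matter mask with Dirichlet boundary conditions. This is an editorial illustration, not the authors' MATLAB implementation: the mask and boundary-label names are hypothetical, and the simple 6-neighbour Jacobi update treats out-of-mask neighbours as zero, which a faithful implementation would avoid.

```python
import numpy as np

def solve_laplace(gm_mask, source_mask, sink_mask, n_iter=2000):
    """Relax Laplace's equation inside the gray-matter mask.

    The potential is fixed to 0 on the source boundary and 1 on the sink
    boundary; interior voxels converge towards the harmonic interpolant,
    giving one coordinate (e.g. anterior-posterior) of the unfolded space.
    """
    phi = np.zeros(gm_mask.shape, dtype=np.float64)
    phi[sink_mask] = 1.0
    interior = gm_mask & ~source_mask & ~sink_mask
    for _ in range(n_iter):
        # 6-neighbour average (Jacobi update); out-of-mask neighbours are
        # zero here, a simplification of this sketch
        avg = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
               np.roll(phi, 1, 1) + np.roll(phi, -1, 1) +
               np.roll(phi, 1, 2) + np.roll(phi, -1, 2)) / 6.0
        phi[interior] = avg[interior]
        phi[source_mask], phi[sink_mask] = 0.0, 1.0
    return phi

# One potential field per direction: (phi_ap, phi_pd) index the unfolded 2D
# space and phi_io parameterises depth, with the 0.5 level approximating the
# cortical mid-surface onto which image features are sampled, e.g.
# phi_ap = solve_laplace(gm, anterior_boundary, posterior_boundary)
# phi_pd = solve_laplace(gm, proximal_boundary, distal_boundary)
# phi_io = solve_laplace(gm, inner_wm_boundary, outer_pial_boundary)
```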

4 Experiments and Results

4.1 Consensus MTL Subregion Segmentation in Unfolded Coordinate Space

An average MTL subregion segmentation was generated in the unfolded space by performing voxel-wise majority voting among the subregion segmentations of the subset


of 11 specimens with histology-based annotations. When obtaining the consensus segmentation, we incorporated slight regularization using a Markov Random Field prior to smooth the boundaries between labels and provide continuity at voxels where little data is available (due to gaps in the histology segmentations) (Fig. 2A and Supplementary Table 1). For each pair of specimens, we computed the generalized Dice coefficient (GDSC) between their multi-label segmentations in unfolded space, only including points that are labeled in both specimens [17]. The average GDSC across the 11 specimens is 0.62 ± 0.09. This is higher than the average GDSC of 0.57 ± 0.07 obtained when registering specimens using a volumetric deformable registration pipeline [7], suggesting improved alignment of the MTL cortex of individual specimens in unfolded space.
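For concreteness, a minimal numpy sketch of the label fusion and overlap measure used above is given below. It assumes the segmentations have already been resampled to a common unfolded grid and omits the Markov Random Field regularization, so it is illustrative rather than a reproduction of the authors' pipeline; all names are hypothetical.

```python
import numpy as np

def majority_vote(label_maps, background=0):
    """Voxel-wise majority vote across co-registered multi-label maps.

    label_maps: (N, ...) integer array of N segmentations in unfolded space;
    voxels labelled `background` in a specimen do not contribute a vote.
    """
    labels = np.unique(label_maps)
    labels = labels[labels != background]
    votes = np.stack([(label_maps == lab).sum(axis=0) for lab in labels])
    consensus = labels[votes.argmax(axis=0)]
    consensus[votes.sum(axis=0) == 0] = background  # no label anywhere
    return consensus

def generalized_dice(seg_a, seg_b, background=0):
    """Generalized Dice coefficient over all foreground labels, restricted
    to points that carry a label in both segmentations."""
    valid = (seg_a != background) & (seg_b != background)
    a, b = seg_a[valid], seg_b[valid]
    labels = np.union1d(np.unique(a), np.unique(b))
    inter = sum(np.logical_and(a == lab, b == lab).sum() for lab in labels)
    total = sum((a == lab).sum() + (b == lab).sum() for lab in labels)
    return 2.0 * inter / total
```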

Fig. 2. Comparison of the MTL segmentation, results of the regional thickness analysis, and average maps of cortical thickness and NFT burden, visualized in unfolded space, before and after sulcus registration using mean curvature. The consensus MTL subregion segmentation is derived from serial histology in 11 specimens. The statistical maps show regions where significant correlations were observed between cortical thickness and the 90th percentile of NFT burden in BA35, using the Spearman rank correlation model with age as a covariate, in a dataset of 15 specimens. Point-wise correlations were considered significant after false discovery rate correction (p < 0.1). The thickness and NFT burden maps shown in the third and fourth column respectively, represent the average values computed across 15 specimens. In each case, the boundaries of the consensus segmentation are overlaid in black.

4.2 Correlating NFT Burden with MTL Neurodegeneration

To characterize the effects of NFT pathology on MTL thickness, for each of the 15 specimens with NFT burden heatmaps we first computed a summary measure of NFT severity, defined as the 90th percentile of NFT burden across all points in the unfolded map that fall within BA35.


We used BA35 to compute the summary measure since it contains the transentorhinal cortex, the first cortical site affected by NFTs in AD [1]. Since only five of the specimens in the dataset have both NFT burden maps and histology-based subregion segmentations, the consensus segmentation was used to determine the boundary of BA35 in all specimens. We then performed a statistical analysis to investigate the relationship between the NFT summary measure and thickness (standardized across subjects) at each location in the unfolded space using the partial Spearman rank correlation model with age as a covariate. Point-wise correlations were considered significant after false discovery rate (FDR) correction (p < 0.1). P-values and the corresponding FDR threshold are plotted in Supplementary Fig. 3. Due to missing data along the borders of the thickness maps, 10% of values along the anterior, posterior, proximal and distal edge were not included in the analysis. As seen in Fig. 2A, strong correlations were observed in the ERC and the border of BA35, consistent with the early Braak regions, and parts of BA36 [1]. No significant correlations were detected in the posterior MTL.

4.3 Surface-Based Registration Using Mean Curvature Maps

In an exploratory analysis, we were interested in disentangling whether the distribution of NFTs within the MTL is dependent on distance along the cortex or merely a consequence of MTL morphology, with NFTs tending to accumulate within cortical folds. While the unfolding framework aligns specimens based on relative distance from a boundary surface, the mean curvature maps can be used to align specimens based on sulcal patterns and location since the fundus of the collateral sulcus is visible as regions of high curvature in the unfolded maps (Supplementary Fig. 1). This approach is analogous to the FreeSurfer surface-based registration method developed for in vivo MRI [9]. We applied groupwise intensity-based registration to the mean curvature maps of all 18 specimens to create an average curvature map and a set of transformations between the average map and each specimen [5]. Groupwise registration was performed using an implementation of the log domain diffeomorphic demons algorithm [18] included in the “Greedy” package and involved iteratively alternating between highly regularized, deformable registration of the individual curvature maps to an average template using the sum of squared differences metric and updating the template by averaging the registered maps (Supplementary Table 1). Supplementary Fig. 2 shows the average curvature map before and after registration. We observe that the regions of high curvature are better defined following registration, suggesting improved alignment of the collateral sulcus between specimens.

To test the relationship between NFT burden and cortical thickness in this space, we repeated the pointwise thickness analysis after first mapping the subregion labels, NFT burden maps and thickness maps to the normalized space and re-computing the consensus subregion segmentation and NFT summary measures. As shown in Fig. 2B (and Supplementary Fig. 3), no significant correlations are observed following sulcus-based registration, suggesting that patterns of tau distribution and neurodegeneration are perhaps more dependent on distance along the cortex than sulcal folding patterns. This finding is consistent with the result obtained when we performed pointwise regional thickness analysis in the space of a 3D MTL template generated using volumetric deformable registration (Supplementary Fig. 3) [7].
While the expected effect of MTL morphology on NFT pathology is unclear, this result is also consistent with the findings reported in a recent study by Arena et al. that assessed the


preferential distribution of tau pathology towards sulcal depths in the context of chronic traumatic encephalopathy (CTE) and AD and found that NFTs showed a more uniform distribution along the cortex when compared to astroglial tau pathology [19].
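A hedged sketch of the point-wise statistical analysis in Sect. 4.2 is shown below, assuming scipy and statsmodels are available. The residual-of-ranks construction is one standard way of computing a partial Spearman correlation with a covariate, and the Benjamini-Hochberg procedure is one common FDR correction; neither is necessarily the exact routine used by the authors, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def partial_spearman(x, y, covar):
    """Spearman correlation of x and y after removing the (rank) effect of a
    covariate: regress the ranks of x and y on the ranks of covar and
    correlate the residuals."""
    rx, ry, rc = (stats.rankdata(v) for v in (x, y, covar))
    design = np.column_stack([np.ones_like(rc), rc])
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    rho, p = stats.pearsonr(res_x, res_y)
    return rho, p

def pointwise_analysis(thickness, nft_summary, age, q=0.1):
    """thickness: (n_specimens, n_points) standardised unfolded maps;
    nft_summary: (n_specimens,) 90th-percentile NFT burden within BA35;
    age: (n_specimens,) covariate. Returns p-values and the FDR decision."""
    pvals = np.array([partial_spearman(thickness[:, i], nft_summary, age)[1]
                      for i in range(thickness.shape[1])])
    reject, _, _, _ = multipletests(pvals, alpha=q, method="fdr_bh")
    return pvals, reject
```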

5 Conclusions

We present a topological unfolding framework applied to the extrahippocampal MTL cortex, using ex vivo MR imaging of a sizable collection of human MTL specimens (n = 18). This approach allows us to visualize, for the first time, the distribution of extrahippocampal subregions and NFT pathology in an unfolded space and analyze the effects of NFT burden on MTL neurodegeneration while explicitly accounting for the complex topology of the MTL. Our result suggesting that the association between NFT burden and cortical thickness is weakened following alignment of sulcal patterns motivates further work in a flattened space with a larger dataset. While we show that the unfolding framework provides a valuable tool for detailed investigation of MTL neurodegeneration due to NFT pathology, in ongoing work, IHC for other common molecular pathologies is being performed in many of our specimens and the unfolding framework can be easily extended to investigate a different set of image features. Future work will focus on expanding the size of our dataset and extending the unfolding framework to the hippocampal subfields, which include early targets for NFT pathology. This will allow us to further refine our understanding of early AD and support the development of better AD biomarkers.

References 1. Braak, H., Braak, E.: Neuropathological staging of Alzheimer-related changes. Acta Neuropathol. 82, 239–259 (1991). https://doi.org/10.1007/bf00308809 2. Hyman, B.T., et al.: National institute on aging–Alzheimer’s association guidelines for the neuropathologic assessment of Alzheimer’s disease. Alzheimer’s Dement. 8, 1–13 (2012). https://doi.org/10.1016/j.jalz.2011.10.007 3. Olsen, R.K., Palombo, D.J., Rabin, J.S., Levine, B., Ryan, J.D., Rosenbaum, R.S.: Volumetric analysis of medial temporal lobe subregions in developmental amnesia using highresolution magnetic resonance imaging. Hippocampus 23, 855–860 (2013). https://doi.org/ 10.1002/hipo.22153 4. Small, S.A., Schobel, S.A., Buxton, R.B., Witter, M.P., Barnes, C.A.: A pathophysiological framework of hippocampal dysfunction in ageing and disease (2011). https://doi.org/10.1038/ nrn3085 5. Joshi, S., Davis, B., Jomier, M., Gerig, G.: Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage Neuroimage (2004). https://doi.org/10.1016/j.neuroi mage.2004.07.068 6. Xie, L., et al.: Automatic clustering and thickness measurement of anatomical variants of the human perirhinal cortex. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8675, pp. 81–88. Springer, Cham (2014). https://doi.org/10.1007/ 978-3-319-10443-0_11 7. Ravikumar, S., et al.: Building an ex vivo atlas of the earliest brain regions affected by Alzheimer’s disease pathology. In: Proceedings - International Symposium on Biomedical Imaging (2020). https://doi.org/10.1109/ISBI45749.2020.9098427


8. Ding, S.L., Van Hoesen, G.W.: Borders, extent, and topography of human perirhinal cortex as revealed using multiple modern neuroanatomical and pathological markers. Hum. Brain Mapp. 31, 1359–1379 (2010). https://doi.org/10.1002/hbm.20940 9. Fischl, B., Sereno, M.I., Tootell, R.B.H., Dale, A.M.: High-resolution intersubject averaging and a coordinate system for the cortical surface. Hum. Brain Mapp. 8, 272–284 (1999). https:// doi.org/10.1002/(SICI)1097-0193(1999)8:4%3c272::AID-HBM10%3e3.0.CO;2-4 10. DeKraker, J., Ferko, K.M., Lau, J.C., Köhler, S., Khan, A.R.: Unfolding the hippocampus: an intrinsic coordinate system for subfield segmentations and quantitative mapping. Neuroimage 167, 408–418 (2018). https://doi.org/10.1016/j.neuroimage.2017.11.054 11. Adler, D.H., et al.: Characterizing the human hippocampus in aging and Alzheimer’s disease using a computational atlas derived from ex vivo MRI and histology. Proc. Natl. Acad. Sci. U.S.A. 115, 4252–4257 (2018). https://doi.org/10.1073/pnas.1801093115 12. Yushkevich, P.A., et al.: Three-dimensional mapping of neurofibrillary tangle burden in the human medial temporal lobe. Brain 139, 16–17 (2021). https://doi.org/10.1093/BRAIN/AWA B262 13. DeKraker, J., Lau, J.C., Ferko, K.M., Khan, A.R., Köhler, S.: Hippocampal subfields revealed through unfolding and unsupervised clustering of laminar and morphological features in 3D BigBrain. Neuroimage 206 (2020). https://doi.org/10.1016/j.neuroimage.2019.116328 14. Ravikumar, S., Wisse, L., Gao, Y., Gerig, G., Yushkevich, P.: Facilitating manual segmentation of 3D datasets using contour and intensity guided interpolation. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 714–718 (2019) 15. Ogniewicz, R.L., Kübler, O.: Hierarchic Voronoi skeletons. Pattern Recogn. 28, 343–359 (1995). https://doi.org/10.1016/0031-3203(94)00105-U 16. Amidror, I.: Scattered data interpolation methods for electronic imaging systems: a survey. J. Electron. Imaging 11, 157 (2002). https://doi.org/10.1117/1.1455013 17. Crum, W.R., Camara, O., Hill, D.L.G.: Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Trans. Med. Imaging. 25, 1451–1461 (2006). https://doi.org/10.1109/TMI.2006.880587 18. Vercauteren, T., Pennec, X., Perchant, A., Ayache, N.: Symmetric log-domain diffeomorphic registration: a demons-based approach. In: Metaxas, D., Axel, L., Fichtinger, G., Székely, G. (eds.) MICCAI 2008. LNCS, vol. 5241, pp. 754–761. Springer, Heidelberg (2008). https:// doi.org/10.1007/978-3-540-85988-8_90 19. Arena, J.D., et al.: Astroglial tau pathology alone preferentially concentrates at sulcal depths in chronic traumatic encephalopathy neuropathologic change. Brain Commun. 2 (2020). https:// doi.org/10.1093/BRAINCOMMS/FCAA210

Distinguishing Healthy Ageing from Dementia: A Biomechanical Simulation of Brain Atrophy Using Deep Networks

Mariana Da Silva1(B), Carole H. Sudre1,2,3, Kara Garcia4, Cher Bass1,5, M. Jorge Cardoso1, and Emma C. Robinson1

1 School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK
mariana.da [email protected]
2 MRC Unit for Lifelong Health and Ageing at UCL, University College London, London, UK
3 Centre for Medical Image Computing, Department of Computer Science, University College London, London, UK
4 Department of Radiology and Imaging Sciences, School of Medicine, Indiana University, Bloomington, USA
5 Panakeia Technologies, London, UK

Abstract. Biomechanical modeling of tissue deformation can be used to simulate different scenarios of longitudinal brain evolution. In this work, we present a deep learning framework for hyper-elastic strain modelling of brain atrophy, during healthy ageing and in Alzheimer’s Disease. The framework directly models the effects of age, disease status, and scan interval to regress regional patterns of atrophy, from which a strain-based model estimates deformations. This model is trained and validated using 3D structural magnetic resonance imaging data from the ADNI cohort. Results show that the framework can estimate realistic deformations, following the known course of Alzheimer’s disease, that clearly differentiate between healthy and demented patterns of ageing. This suggests the framework has potential to be incorporated into explainable models of disease, for the exploration of interventions and counterfactual examples.

Keywords: Deep learning · Biomechanical modelling · Neurodegeneration · Disease progression

1 Introduction

Alzheimer’s Disease (AD) is a neurodegenerative condition characterized by progressive and irreversible death of neurons, which manifests macroscopically on structural magnetic resonance images (MRI) as progressive tissue loss or atrophy. While, cross-sectionally, the progression of the disease is well documented, presenting with disproportionate atrophy of the hippocampus, medial temporal,


and posterior temporoparietal cortices [7,17], relative to age matched controls - in reality disease progression is heterogeneous across individuals and may be categorised into subtypes [8]. Historically, this has meant that early stage AD has been challenging to diagnose from structural MRI changes alone [1,9,11,15]. Biomechanical models present an alternate avenue, in which rather than performing post-hoc diagnosis of AD from longitudinally acquired data, it instead becomes possible to build a forward model of disease, simulating different possible scenarios for progression [12,13]. Such models have been used broadly throughout the literature to simulate both atrophy and growth [18,20,22] and are usually based on hyperelastic strain models, implemented using finite element methods (FEM) [21] or finite difference methods (FDM) [12]. Accordingly, in this paper we propose a novel deep network for biomechanical simulation of brain atrophy, and seek to model differential patterns of atrophy following healthy ageing or AD. In this way our model parallels a growing body of deep generative, interpretable or explainable models of disease. This includes [3– 5,14] which train generative models to deform [14] and/or change the appearance [3–5,14] of images, in such a way that it changes their class. By contrast, deep structural causal models such as [16], go further to support counterfactual models of disease progression, by associating demographic and phenotypic variables to imaging data, through variational inference on a causal graph. One challenge with structural causal models is that they require prior hypothesis of a causal graph, defining the directions of influence of different parameters in the model. In this paper, we therefore take a more explicit approach to explainable modelling, training a hyper-elastic strain simulation of brain growth and atrophy, while building an explicit simulation of atrophy for different populations and time windows. This supports subject-specific interventions, simulating projections of brain atrophy, following differing diagnoses.

2 Methods

2.1 Data

Data used in this study were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database1 . A total of 1054 longitudinal MRI scans, collated from the ADNI1, ADNI2, ADNI-GO and ADNI3 studies, were used. All examples have at least 2 different T1-weighted scans, separated by at least 1 year (range 1–14 years). Accelerated MRI data was used for the subjects that don’t have non-accelerated images for both time-points. The dataset includes 210 subjects diagnosed with AD, 677 subjects with Mild Cognitive Impairment (MCI), 67 subjects with Significant Memory Loss (SMC) and 92 cognitively normal (CN). From this, subjects were separated into 845 training datasets, 104 validation datasets, and 105 test datasets. An equal distribution of the 4 disease classes was ensured in each set. 1

http://adni.loni.usc.edu/.


Fig. 1. Model architecture: the biomechanical model estimates a deformation field from a prescribed atrophy map corresponding to local volume changes; the atrophy estimator predicts region-wise atrophy values based on demographics and time. In this work, we train the two networks in different stages: the model highlighted in yellow is pre-trained based on a biomechanical cost function (green arrows); the atrophy estimator is trained based on the similarity metrics between simulated and true follow-up image (blue arrows). At inference time, the model estimates a follow-up scan based on metadata and a baseline image alone (the testing procedure can be identified by the black and grey arrows). (Color figure online)
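To make the two-network design concrete, the following PyTorch sketch outlines the atrophy estimator summarized in the caption above; the layer sizes (two hidden layers of 32 and 64 units, 27 regional outputs) follow the description in Sect. 2.3, while the activation functions, the positivity constraint on the outputs, and the encoding of the demographic inputs are assumptions of this illustration rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class AtrophyEstimator(nn.Module):
    """MLP mapping (age, sex, diagnosis, scan interval) to 27 region-wise
    atrophy/growth factors. ReLU activations and the softplus output
    (keeping the factors positive) are assumptions of this sketch."""
    def __init__(self, n_in=4, n_regions=27):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, n_regions), nn.Softplus(),
        )

    def forward(self, demographics):   # (batch, 4)
        return self.net(demographics)  # (batch, 27)

def atrophy_to_map(region_values, parcellation):
    """Broadcast 27 per-region factors (one example) onto a labelled volume;
    labels are assumed to run 1..27, with label 0 (background) set to 1."""
    amap = torch.ones_like(parcellation, dtype=region_values.dtype)
    for r in range(region_values.shape[0]):
        amap[parcellation == (r + 1)] = region_values[r]
    return amap
```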

2.2 Preprocessing

MRI images were segmented into cerebrospinal fluid (CSF), white matter (WM), gray matter (GM), deep gray matter (DGM) and cerebellum using NiftySeg (http://github.com/KCL-BMEIS/NiftySeg). The images were parcellated into 138 regions (NeuroMorph parcellations) using the geodesic information flow (GIF) algorithm [6]. We then generate a less granular parcellation of 27 regions that includes the separate cortical lobes, ventricular system and hippocampus, which we use in the model and for our analysis. T1 images were skull stripped based on the segmentations, then resampled to MNI space with rigid registration using FSL’s FLIRT [10]. Data were normalised into the range 0–1 using histogram normalization, based on data from a target subset of 50 subjects.

2.3 Model Overview

Figure 1 offers an overview of the model and the training procedure. The model consists of two networks: an Atrophy Estimator and a Biomechanical Network. The Atrophy Estimator is a two-hidden-layer (32 and 64 nodes) multi-layer perceptron (MLP) which takes as input 4 demographic variables: biological age (at the time of the first scan, normalized), sex, disease class (CN, SMC, MCI or AD) and time-interval between scans (Δt). It predicts as output a tensor of size 27, which corresponds to a predicted atrophy or growth value for each region of the brain (using the less granular parcellations, in order to reduce



computational cost). The goal of this network is to estimate region-wise values of atrophy and growth between any 2 longitudinal scans. This vector is then mapped back onto the label image in order to generate a 3D volumetric map of prescribed atrophies, piece-wise constant across regions. Biomechanical Network: The goal of the Biomechanical Network is to estimate a displacement field u from atrophy values, a, corresponding to local volume changes. In this paper, u was estimated from a U-net architecture (implemented as for VoxelMorph [2]) and was then used to simulate follow-up scans X from a baseline scan x, as X = x + u. A Spatial Transformer was used to apply the deformation field to the original grid and compute the deformed image. In training, network parameters were optimized based on a biomechanics-inspired cost function. Following the convention used in modelling growth of biological tissues [19,23], we model the brain as a Neo-Hookean material and minimise the strain energy density W:

W = \frac{\mu}{2}\left(\mathrm{Tr}\left(F_K F_K^T\right) J^{-2/3} - 3\right) + \frac{K}{2}\left(J - 1\right)^2    (1)

Here, J = det(F_K), and the elastic deformation F_K is responsible for driving equilibrium. This is given by F_K = F · G^{-1}, where F = ∇u + I is the total deformation gradient and G is the applied growth, G = a^{-1/3} I. a represents relative changes in volume, and we assume isotropic growth/atrophy. μ is the shear modulus and K = 100μ is the bulk modulus. We define μ = 1 for voxels belonging to GM and WM, and set μ = 0.01 for the CSF, which we model as a quasi-free tissue. As only the tissues inside the skull undergo deformation, we add a loss term to encourage zero displacement in the voxels outside of the CSF. We also minimize the displacement at the voxel corresponding to the centre of mass of the brain. The total cost function is:

L_{Biomechanical} = W + \lambda_1 \sum \|u_{background}\|^2 + \lambda_2 \|u_{center}\|^2    (2)

where λ_1, λ_2 are hyperparameters weighting the contribution of these terms.
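The strain energy of Eq. (1) maps naturally onto automatic-differentiation frameworks. The following is only a sketch, not the authors' implementation: it assumes a displacement field u of shape (B, 3, D, H, W), an atrophy map a of shape (B, 1, D, H, W), finite-difference spatial gradients, and a small clamp on J for numerical stability.

```python
# Hedged sketch of the Neo-Hookean strain energy of Eq. (1) (shapes and gradients are assumptions).
import torch

def neo_hookean_energy(u, a, mu, K=None):
    # mu may be a scalar or a per-voxel map of shape (B, D, H, W); K = 100*mu as in the text
    K = 100.0 * mu if K is None else K
    # F = grad(u) + I, with grads[b, i, j] = d u_i / d x_j from finite differences
    grads = torch.stack(torch.gradient(u, dim=(2, 3, 4)), dim=2)      # (B, 3, 3, D, H, W)
    eye = torch.eye(3, device=u.device).view(1, 3, 3, 1, 1, 1)
    F = grads + eye
    # isotropic growth G = a^(-1/3) I, hence F_K = F G^{-1} = a^(1/3) F
    F_K = a.unsqueeze(1) ** (1.0 / 3.0) * F
    J = torch.linalg.det(F_K.permute(0, 3, 4, 5, 1, 2))               # (B, D, H, W)
    trace = (F_K ** 2).sum(dim=(1, 2))                                # Tr(F_K F_K^T)
    J_safe = J.clamp(min=1e-6)                                        # numerical guard (assumption)
    W = 0.5 * mu * (trace * J_safe ** (-2.0 / 3.0) - 3.0) + 0.5 * K * (J - 1.0) ** 2
    return W.mean()
```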

2.4 Training and Evaluation

We train the two networks of our model in two separate stages: Pre-training the Biomechanical Model: Here, ground-truth per-region atrophy maps were first calculated from the volume ratio between the 2 time-points for each of the original 138 NeuroMorph parcellations (a_ground-truth = V1/V2). These were then used to simulate a range of possible atrophy maps by sampling, for each region, from a uniform distribution of plausible atrophies (with range constrained between the min and max values of each population). In this way, the diversity of training samples seen by the model was increased. At each iteration, the model was trained with either a subject-specific ground-truth atrophy or an atrophy pattern randomly sampled from these distributions. We note that the aim here is to train the network to estimate displacement fields from any reasonable value of prescribed growth or atrophy, rather than learn deformation patterns from the population. The biomechanical model was


trained for 200 epochs using a mini-batch size of 6 and ADAM optimizer with a learning rate of 1 × 10^-4. Based on our previous experiments, we set λ1 = 10^-1 and λ2 = 10^2. Atrophy Simulation: Subsequently, the atrophy estimator was trained to predict the atrophies from the subject demographics and time-window. To train this network, the MLP outputs are applied to the pre-trained biomechanical model to compute the corresponding displacement field, simulated image and simulated parcellations. We then update the weights of the MLP based on the average Soft Dice Loss across the 27 parcellations and the L1 loss between simulated follow-up scan and ground-truth follow-up scan. The total loss of this network is therefore given by:

L_{MLP} = \mathrm{SoftDice} + 0.1\, L_1    (3)

The network was trained for 50 epochs, with batch size = 3 and learning rate = 1 × 10^-4.
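For illustration, the atrophy-estimator MLP and the loss of Eq. (3) could be sketched as follows; tensor shapes and the exact Soft Dice formulation are assumptions, not the authors' code.

```python
# Hedged sketch of the atrophy estimator (Sect. 2.3) and the training loss of Eq. (3).
import torch
import torch.nn as nn

class AtrophyEstimator(nn.Module):
    """Two-hidden-layer MLP: 4 demographic inputs -> 27 region-wise atrophy values."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                                 nn.Linear(32, 64), nn.ReLU(),
                                 nn.Linear(64, 27))

    def forward(self, demographics):            # (B, 4): age, sex, disease class, delta_t
        return self.net(demographics)           # (B, 27): per-region atrophy/growth values

def soft_dice_loss(pred, target, eps=1e-6):
    # pred/target: (B, 27, D, H, W) soft label maps (one channel per parcellation label)
    inter = (pred * target).sum(dim=(2, 3, 4))
    union = pred.sum(dim=(2, 3, 4)) + target.sum(dim=(2, 3, 4))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def mlp_loss(sim_img, true_img, sim_labels, true_labels):
    # Eq. (3): SoftDice + 0.1 * L1
    return soft_dice_loss(sim_labels, true_labels) + 0.1 * (sim_img - true_img).abs().mean()
```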

3 Experimental Methods and Results

3.1 Evaluation of Biomechanical Model

We apply the trained biomechanical model to the region-wise ground-truth volume change values of the 105 subjects of the test set. Figure 2 shows a representative example of a prescribed atrophy map, computed atrophy (det(F)), simulated follow-up scan and corresponding ground-truth for a subject diagnosed with MCI, with a time-span between scans of 7 years. We evaluate the performance of the network by comparing the prescribed atrophy maps with the computed atrophy using the Mean Squared Error (MSE), and compare the simulated images and segmentations to the ground-truth using MSE and Dice overlap scores. In addition, we calculate the Absolute Symmetric Percentage Volume Change (ASPVC) between the simulated and ground-truth follow-up images as in [13]. The objective is to show that the biomechanical network can estimate realistic deformation fields when provided with a specific atrophy map corresponding to local volume changes.

Fig. 2. Results of biomechanical model applied to a ground-truth atrophy map. The simulated follow-up scan shows atrophy of the ventricles and cortex that approximates the true difference map between the scans.


Table 1. Evaluation metrics (Mean and Standard Deviation) calculated over the 105 subjects of the test set, for the biomechanical model applied to the ground-truth atrophy maps.

                     MSE_atrophy    MSE_Image      Dice_vent  Dice_cortex  ASPVC_vent  ASPVC_cortex
Mean                 1.27 × 10^-4   2.38 × 10^-3   0.901      0.760        2.6%        4.1%
Standard deviation   2.87 × 10^-4   1.65 × 10^-3   0.047      0.065        4.5%        5.0%

Table 1 shows the evaluation metrics over the test set, with a focused analysis of the Dice and ASPVC metrics for the ventricles and cortex. We report high Dice overlap and low ASPVC for the ventricular region across all subjects, showing that the model can accurately simulate the deformation patterns in the ventricles. Calculated ASPVC values are inside the range (2%–5%) of values reported in [13], which simulated atrophy using an FDM model. Lower values of Dice overlap for the cortex region can be explained partially by registration differences between the two scans, which influence not only the comparison between the simulated and ground-truth images, but also the "ground-truth" volume changes used as input to the model, which are calculated from the parcellations. Note that, throughout, 0% volume change could only be expected for the images if the model were prescribed precise voxel-wise atrophy values.

3.2 Evaluation of Atrophy Estimation

Our next aim is to use this model to simulate patterns of atrophy according to different conditions, including the elapsed time between scans and the disease status. In this section, we show that: 1) our atrophy estimation model is capable of simulating follow-up scans consistent with the ground-truth; 2) the model can differentiate between atrophy for healthy, MCI and AD subjects; 3) the model can project forward in time. Comparison with Ground-Truth: We start by evaluating our atrophy estimation model on the 105 subjects of the test set and compare the simulated follow-up scans with the ground-truth images. This evaluation is done in a similar manner to Sect. 3.1, but here using the full model with the atrophy maps, a, estimated from the metadata using the MLP atrophy simulator. Figure 3 shows the Dice overlap between simulated and ground-truth follow-ups, over all considered regions. Predicting Trajectories of Disease: In order to evaluate the ability of our model to differentiate between healthy aging, MCI and AD, we use our trained MLP to estimate atrophy patterns for the different classes. For this, we use the metadata from the 105 subjects of the test set and re-estimate the atrophy maps by intervening on the input channel corresponding to the diagnosis class. We


Fig. 3. Dice overlap scores between the simulated and ground-truth parcellations for each of the 27 regions, computed across the 105 subjects of the test set.

therefore calculate 4 atrophy maps for each subject (disease class = {CN, SMC, MCI, AD}), and keep the remaining metadata (age, sex, Δt) as the true values. Figure 4 shows the predicted atrophy values for the ventricles and hippocampus when intervening on the disease class. We performed one-sided paired t-test analysis on the computed atrophies, and conclude that, for the ventricles, the model is able to predict statistically significant differences in atrophy distributions between all 4 diagnosis classes (P < .001). For the hippocampus, the model estimates atrophies that are significantly different when comparing CN vs MCI, CN vs. AD and MCI vs. AD (P < .001); the distributions for CN and SMC are not significantly different (P = 0.89).
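A comparison of this kind can be run with a standard paired test, as in the sketch below; the file names, array layout and the direction of the one-sided alternative (more hippocampal shrinkage under the AD intervention) are assumptions made for illustration.

```python
# Hedged sketch of a one-sided paired t-test on predicted regional atrophy values.
import numpy as np
from scipy import stats

atrophy_cn = np.load("hippocampus_atrophy_cn.npy")   # hypothetical per-subject values (CN intervention)
atrophy_ad = np.load("hippocampus_atrophy_ad.npy")   # same subjects, class switched to AD

# alternative="greater": the AD intervention is expected to yield larger (shrinking) atrophy values
t, p = stats.ttest_rel(atrophy_ad, atrophy_cn, alternative="greater")
print(f"t = {t:.2f}, one-sided p = {p:.3g}")
```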

Fig. 4. Predicted atrophy values on (a) ventricles and (b) hippocampus when intervening on disease status. Values of a < 1 correspond to regional expansion and a > 1 correspond to shrinking.


Fig. 5. Predicted atrophy trajectory for a subject for healthy ageing (true class) and in the presence of AD. Age at baseline = 82 years.

Finally, to show that the model can predict forward in time, we estimate the atrophy for the subjects of the test set for multiple time-spans (Δt = 2, 4, 6, and 8 years). Figure 5 shows the computed atrophy progression when considering healthy aging, and when changing the input class to Alzheimer's Disease. Comparing the trajectories for both cases, it is apparent that, as expected, the model predicts larger values of atrophy across the brain tissue in the AD case, including the ventricles and hippocampus.

4 Discussion and Future Work

The results presented here show that the proposed framework can be used to simulate structural changes in brain shape resulting from neurodegenerative disease, and differentiate between healthy and diseased atrophy patterns. In the present case, atrophy was encoded to predict only demographic trends; however, our goal is to expand the atrophy estimator network to better model subject-specific heterogeneity by considering information present in the baseline image. By estimating atrophy values from a small number of variables, the current framework can be used as a simple simulator of disease progression where one can easily intervene on these inputs, including disease status, by simply changing the class. However, while the model can clearly differentiate between healthy and AD subjects, it is well documented that differentiation between MCI and AD is a complex task due to the heterogeneous nature of these disorders. This is reflected in the results from Fig. 4, and in particular for larger time-windows between scans, for which there are fewer data for the AD class. In the future, and in addition to exploring the use of imaging data as input to the atrophy estimator, we aim to include other metrics of cognitive assessment as input to the network, such as the Mini-Mental State Examination (MMSE) and Alzheimer's Disease Assessment Scale-Cognition (ADAS-Cog) in order to more accurately predict disease trajectories. We also aim to evaluate the impact of class imbalance on the network training, and address this by including more data from healthy subjects,


as well as exploring techniques of oversampling and weighted loss when training the network. In this work, we estimate region-wise atrophy maps, which are then used as input to the biomechanical model. In future work, and in order to model patient-specific trajectories of disease, we plan on using the region-wise atrophy estimates as priors to further compute subject-specific voxel-wise patterns that more accurately represent true atrophy patterns. Note that, although we focus on modelling brain atrophy with age, this proposed model can be translated to other tasks, including brain growth, and can support the use of different biomechanical models of tissue deformation. Acknowledgments. The data used in this work was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012).

References

1. Bae, J.B., et al.: Identification of Alzheimer's disease using a convolutional neural network model based on T1-weighted magnetic resonance imaging. Sci. Rep. 10(1), 1–10 (2020)
2. Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: a learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 38(8), 1788–1800 (2019). https://doi.org/10.1109/TMI.2019.2897538
3. Bass, C., et al.: Image synthesis with a convolutional capsule generative adversarial network, December 2018. https://openreview.net/forum?id=rJen0zC1lE
4. Bass, C., da Silva, M., Sudre, C., Tudosiu, P.D., Smith, S., Robinson, E.: ICAM: interpretable classification via disentangled representations and feature attribution mapping. In: Advances in Neural Information Processing Systems, vol. 33 (2020)
5. Baumgartner, C.F., Koch, L.M., Tezcan, K.C., Ang, J.X., Konukoglu, E.: Visual feature attribution using Wasserstein GANs, June 2018. http://arxiv.org/abs/1711.08998
6. Cardoso, M.J., et al.: Geodesic information flows: spatially-variant graphs and their application to segmentation and fusion. IEEE Trans. Med. Imaging 34(9), 1976–1988 (2015). https://doi.org/10.1109/TMI.2015.2418298
7. Carmichael, O., McLaren, D.G., Tommet, D., Mungas, D., Jones, R.N., Alzheimer's Disease Neuroimaging Initiative, et al.: Coevolution of brain structures in amnestic mild cognitive impairment. NeuroImage 66, 449–456 (2013)
8. Ferreira, D., et al.: Distinct subtypes of Alzheimer's disease based on patterns of brain atrophy: longitudinal trajectories and clinical applications. Sci. Rep. 7(1), 1–13 (2017). https://doi.org/10.1038/srep46263
9. Frisoni, G.B., Fox, N.C., Jack, C.R., Scheltens, P., Thompson, P.M.: The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol. 6(2), 67–77 (2010)
10. Jenkinson, M., Smith, S.: A global optimisation method for robust affine registration of brain images. Med. Image Anal. 5(2), 143–156 (2001)
11. Khan, N.M., Abraham, N., Hon, M.: Transfer learning with intelligent training data selection for prediction of Alzheimer's disease. IEEE Access 7, 72726–72735 (2019)


12. Khanal, B., Lorenzi, M., Ayache, N., Pennec, X.: A biophysical model of shape changes due to atrophy in the brain with Alzheimer's disease. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 41–48. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10470-6_6
13. Khanal, B., Lorenzi, M., Ayache, N., Pennec, X.: A biophysical model of brain deformation to simulate and analyze longitudinal MRIs of patients with Alzheimer's disease. NeuroImage 134, 35–52 (2016). https://doi.org/10.1016/j.neuroimage.2016.03.061
14. Bigolin Lanfredi, R., Schroeder, J.D., Vachet, C., Tasdizen, T.: Interpretation of disease evidence for medical images using adversarial deformation fields. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12262, pp. 738–748. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59713-9_71
15. Li, H., Habes, M., Wolk, D.A., Fan, Y., for the Alzheimer's Disease Neuroimaging Initiative and the Australian Imaging Biomarkers and Lifestyle Study of Aging: A deep learning model for early prediction of Alzheimer's disease dementia based on hippocampal magnetic resonance imaging data. Alzheimer's Dement. 15(8), 1059–1070 (2019). https://doi.org/10.1016/j.jalz.2019.02.007
16. Pawlowski, N., Castro, D.C., Glocker, B.: Deep structural causal models for tractable counterfactual inference, June 2020. arXiv:2006.06485
17. Rabinovici, G., et al.: Distinct MRI atrophy patterns in autopsy-proven Alzheimer's disease and frontotemporal lobar degeneration. Am. J. Alzheimer's Dis. Dement. 22(6), 474–488 (2008)
18. Richman, D.P., Stewart, R.M., Hutchinson, J.W., Caviness, V.S.: Mechanical model of brain convolutional development. Science 189(4196), 18–21 (1975). https://doi.org/10.1126/science.1135626
19. Rodriguez, E.K., Hoger, A., McCulloch, A.D.: Stress-dependent finite growth in soft elastic tissues. J. Biomech. 27(4), 455–467 (1994). https://doi.org/10.1016/0021-9290(94)90021-3
20. Tallinen, T., Chung, J.Y., Biggins, J.S., Mahadevan, L.: Gyrification from constrained cortical expansion. Proc. Natl. Acad. Sci. 111(35), 12667–12672 (2014). https://doi.org/10.1073/pnas.1406015111
21. Tallinen, T., Chung, J.Y., Rousseau, F., Girard, N., Lefèvre, J., Mahadevan, L.: On the growth and form of cortical convolutions. Nat. Phys. 12(6), 588–593 (2016). https://doi.org/10.1038/nphys3632
22. Xu, G., Knutsen, A.K., Dikranian, K., Kroenke, C.D., Bayly, P.V., Taber, L.A.: Axons pull on the brain, but tension does not drive cortical folding. J. Biomech. Eng. 132(7), 071013 (2010). https://doi.org/10.1115/1.4001683
23. Young, J.M., Yao, J., Ramasubramanian, A., Taber, L.A., Perucchio, R.: Automatic generation of user material subroutines for biomechanical growth analysis. J. Biomech. Eng. 132(10), 104505 (2010). https://doi.org/10.1115/1.4002375

Towards Self-explainable Classifiers and Regressors in Neuroimaging with Normalizing Flows

Matthias Wilms1,2,3(B), Pauline Mouches1,2,3, Jordan J. Bannister1,2,3, Deepthi Rajashekar1,2,3, Sönke Langner4, and Nils D. Forkert1,2,3

1 Department of Radiology, University of Calgary, Calgary, Canada
[email protected]
2 Hotchkiss Brain Institute, University of Calgary, Calgary, Canada
3 Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Canada
4 Institute for Diagnostic and Interventional Radiology, Pediatric and Neuroradiology, University Medical Center Rostock, Rostock, Germany

Abstract. Deep learning-based regression and classification models are used in most subareas of neuroimaging because of their accuracy and flexibility. While such models achieve state-of-the-art results in many different application scenarios, their decision-making process is usually difficult to explain. This black box behaviour is problematic when non-technical users like clinicians and patients need to trust them and make decisions based on their results. In this work, we propose to build self-explainable generative classifiers and regressors using a flexible and efficient normalizing flow framework. We directly exploit the invertibility of those normalizing flows to explain the decision-making process in a highly accessible way via consistent and spatially smooth attribution maps and counterfactual images for alternate prediction results. The evaluation using more than 5000 3D MR images highlights the explainability capabilities of the proposed models and shows that they achieve a similar level of accuracy as standard convolutional neural networks for image-based brain age regression and brain sex classification tasks.

Keywords: Normalizing flows · Explainable AI

1 Introduction

Over the last decade, deep neural networks (DNNs) have revolutionized medical image analysis in general and many areas of neuroimaging in particular [32]. Their accuracy is usually a result of the availability of large training data sets that help the models to learn complex functions to map the input images to the desired outputs. While this data-driven approach helps to learn accurate mappings, it is also a major reason why DNNs are often deemed black boxes that are difficult to analyze and interpret [32]. This lack of interpretability and


Fig. 1. Graphical overview of the proposed use of an invertible regression model for explainable brain age prediction. The invertible model can predict the age of a given input image (left to right) but can also be used to generate data (right to left) that helps to explain its decision-making process through interpretable voxel-level attribution maps and counterfactual images for alternate prediction results.

explainability is a major problem in neuroimaging applications, where their success relies on the acceptance by non-technical users like clinicians and patients who need to trust the model-generated results [10,20]. Recently, considerable progress has been made to open up the so-called black boxes by developing mechanisms that help to explain and interpret the decisions made by DNN models [6,22]. Many strategies used for image-based DNNs like convolutional neural networks (CNNs) can be categorized as model-agnostic, post-hoc feature attribution methods [3,6,20,22], which query a trained model to determine what features contribute most to a specific decision. Popular representatives of this category are gradient-based approaches that generate saliency or attribution maps by computing the gradient of the decision function with respect to input features [23,24,27]. While those methods are easy to apply and popular in neuroimaging (e.g., [5,17]), they lack an easily accessible explanation of a DNN's decision as they mainly answer questions such as "Which voxels affect the prediction most?". It can be argued that non-technical users like clinicians are more interested in questions that reveal the learned concepts [7]: "Why was this image classified as A and not B?" and "What would this image look like if it were from a different class?". Therefore, approaches that analyze a given model's behaviour with respect to concepts have been proposed [8,14]. Those models provide a better comprehension of a DNN's decisions but are usually not able to generate meaningful alternate versions of the real images (counterfactuals) that would have resulted in different decisions and that are known to help in visually assessing the decision-making process [7,13]. More sophisticated approaches generate images for alternate decisions using a generative model such as a Generative Adversarial Network (GAN) that approximately inverts the DNN's decision-making process [19,25]. Instead of training an additional generative model for explanation purposes, it is more natural and consistent to build a self-explainable model [22]. This can, for example, be done by using invertible DNN architectures that not only solve the classification/regression task but are also able to systematically manipulate the inputs to explain their decisions via counterfactual images. Recently, generative classifiers based on normalizing flows (NFs, [15]) have gained popularity in computer vision [2,18,26]. Aside from notable exceptions like [11,31], NFs


have rarely been used in the medical domain due to their high computational costs when applied to 3D data. This issue has recently been addressed in [30] where a deformation-based NF brain aging model was proposed that circumvents computational limitations by equipping the NF with a theoretically sound dimensionality reduction step. Although the model in [30] performs brain age regression, the authors do not exploit its invertibility for explainability purposes. In this paper, we (1) propose to build invertible, self-explainable generative classifiers and regressors derived from the efficient NF approach of [30] and (2) describe how to directly exploit the generative/invertibility properties of the models to explain their decisions through more natural attribution maps as well as counterfactuals (see Fig. 1). To our knowledge, this paper is the first that proposes invertible generative classifiers and regressors using NFs for explainable AI for 3D neuroimaging data and which also shows that they are able to achieve results comparable to standard black box CNNs. Our approach relies on the architecture proposed in [30], but we directly apply it to 3D images instead of deformation fields and not only investigate its use for brain age regression but also for sex classification. We also draw from some explainability definitions and concepts proposed in [26], but our setup significantly differs from theirs.

2 Normalizing Flows as Generative Invertible Classifiers and Regressors

Our goal is to learn a NF-based decision function that maps a 3D input image X : R^3 → R to a scalar result r ∈ Ω. For a continuous regression problem, Ω ≡ R, and for a binary classification problem, we define Ω ≡ [0, 1]. Without loss of generality, we also define that x ∈ R^{n_vox} is a sampled and vectorized version of X with n_vox being the number of image voxels. The decision function is then defined as f : R^{n_vox} → Ω × R^{n_vox - 1} and maps an input vector of n_vox voxels to a vector of the same size with the first dimension representing the decision result. While such a definition seems counterintuitive at first, when using a NF, f(·) has to be a bijection to guarantee its invertibility. The n_vox − 1 dimensions of the output vector not encoding the decision result contain the additional data needed to reconstruct the input x when applying the inverse f^{-1}(·) (see [30]). The invertible function f(·) is then used in conjunction with the change-of-variable technique to define a conditional probabilistic generative model that allows probabilities to be assigned to images x given r by mapping the input space to a latent space on which simple priors can be imposed [30]:

p(x \mid r) = p_Z\bigl(f_z(x)\bigr)\, p_R\bigl(f_r(x) \mid r\bigr)\, \left|\det \frac{\partial f(x)}{\partial x}\right|    (1)

Here, p(x | r) is the density of 3D images conditioned on a decision result r, f_z : R^{n_vox} → Z is the part of f(·) that maps x to Z ≡ R^{n_vox - 1}, while f_r : R^{n_vox} → R maps x to R ≡ Ω. We use a factorized multivariate Gaussian prior for z ∼ p_Z, which covers the image variability independent of r, and assume that the density p_R(f_r(x) | r) is implicitly modeled by the regression or classification metric used


(see Sect. 2.2). In summary, Eq. (1) defines a bidirectional model where f_r(·) solves the regression/classification problem (predictive direction), while f^{-1}(·) and sampling from the priors allow data to be generated (generative direction).

2.1 Manifold-Constrained NFs for Efficient 3D Data Processing

The major challenge of the setup presented above is the high dimensionality of the inputs x ∈ R^{n_vox} (millions of voxels for 3D images), which implies that enormous computational resources are required to learn f(·). We, therefore, follow the dimensionality reduction approach proposed in [30] for deformation fields and adapt it to gray value images. The basic idea is to approximate f(·) through a two-step process as f(x) ≈ e(x) = h ∘ g(x). Here, g : R^{n_vox} → R^{n_dim} first projects x to a lower dimensional manifold of dimensionality n_dim before a standard NF-based bijection h : R^{n_dim} → R^{n_dim} maps this space to the model's latent space whose first dimension holds the regression/classification result and on which priors are imposed. While g(·) is not a bijection, we assume that it is a chart of the n_dim-dimensional manifold and invertible for manifold elements. Hence, e : R^{n_vox} → R^{n_dim} can be used as a replacement for f(·) in Eq. (1) if we assume that the information lost is not important for the task at hand. Similar to what has been done in [30] for deformations, we assume that 3D gray value images can be sufficiently represented by an n_dim-dimensional affine subspace with translation vector x̄ ∈ R^{n_vox} and a matrix Q ∈ R^{n_vox × n_dim} composed of n_dim orthogonal columns. This subspace is estimated from training data in closed form via principal component analysis (PCA), retaining the top n_dim eigenvectors. Then, y = g(x) = Q^+(x − x̄) with Q^+ being the pseudoinverse of Q, and g^{-1}(y) = Qy + x̄. Here, [r, z] = e(x) = h ∘ g(x) solves the regression/classification problem by mapping x to r (and z ∈ R^{n_dim − 1}), and its inverse x = e^{-1}([r, z]) = g^{-1} ∘ h^{-1}([r, z]) can be utilized to generate new images.
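For an orthonormal Q, the chart g(·) and its inverse reduce to a mean-centred projection and back-projection; a minimal NumPy sketch (names and the dense SVD are assumptions) is:

```python
# Hedged sketch of the affine-subspace chart g and its inverse g^{-1} (Sect. 2.1).
import numpy as np

class AffineSubspace:
    def __init__(self, X_train, n_dim=500):
        # X_train: (n_sbj, n_vox) vectorized training images (a dense SVD is used here for clarity)
        self.mean = X_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_train - self.mean, full_matrices=False)
        self.Q = Vt[:n_dim].T                       # (n_vox, n_dim), orthonormal columns

    def project(self, x):                           # y = g(x) = Q^+ (x - x_mean), with Q^+ = Q^T
        return self.Q.T @ (x - self.mean)

    def reconstruct(self, y):                       # g^{-1}(y) = Q y + x_mean
        return self.Q @ y + self.mean
```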

2.2 Implementation Details and Model Training

Learning e(·) from data is equivalent to learning h(·) as g(·) can be pre-computed via PCA. For [r, z] = h(y) = h_{n_lay} ∘ ··· ∘ h_i ∘ ··· ∘ h_1(y), we follow the typical NF paradigm and use a sequence of n_lay easily invertible sub-functions/layers h_i : R^{n_dim} → R^{n_dim}. Each b = h_i(a) is an affine coupling layer that transforms its input vector a = [a_1, a_2] to b = [b_1, b_2] by splitting the inputs into two equally sized parts and applying a learnable affine transform (see also [15,30]):

b_1 = \exp\bigl(s_i(a_2)\bigr) \odot a_1 + t_i(a_2) \quad\text{and}\quad b_2 = a_2    (2)

Here, ⊙ denotes an element-wise multiplication and s_i(·) and t_i(·) are fully-connected neural networks with n_hid hidden layers and ReLU activations. An affine coupling layer can be inverted by reversing the affine transformation without having to invert s_i(·) and t_i(·) [15,30]. To learn the weights of all s_i(·)/t_i(·) based on a training set {(x_j, r_j)}_{j=1}^{n_sbj} of n_sbj subjects with images x_j and ground-truth targets r_j, we minimize the negative log-likelihood of Eq. (1) for e(·). Using


a Gaussian prior for the difference between r_j and prediction e_r(x_j) and a unit-diagonal multivariate Gaussian prior for z ∼ p_Z resulting from e_z(x_j), gives

L = \frac{1}{n_{sbj}} \sum_{j=1}^{n_{sbj}} \left( \frac{1}{2}\,\sigma^{-2}\, \bigl\| r_j - e_r(x_j) \bigr\|_2^2 + \frac{1}{2}\, \bigl\| e_z(x_j) \bigr\|_2^2 - \log \left| \det \frac{\partial e(x)}{\partial x} \right| \right)    (3)

The parameter σ can be used to balance the target fit and the adherence of the additional unrelated variability to the imposed prior. For a regression problem, Eq. (3) minimizes the L2 loss between prediction and ground-truth as in [30], while for a binary classification problem, we first pass e_r(x_j) through a sigmoid function to map the output to the unit interval before computing the difference.
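For concreteness, a single affine coupling layer of Eq. (2) and its analytic inverse can be sketched as below; PyTorch is used here purely for brevity (the paper reports a TensorFlow implementation), and the layer widths follow the configuration given in Sect. 4.

```python
# Hedged sketch of one affine coupling layer (Eq. 2) with its analytic inverse.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, n_dim=500, n_hidden=32):
        super().__init__()
        half = n_dim // 2
        def mlp():  # fully-connected s_i / t_i networks with two hidden ReLU layers
            return nn.Sequential(nn.Linear(half, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, half))
        self.s, self.t = mlp(), mlp()

    def forward(self, a):
        a1, a2 = a.chunk(2, dim=-1)
        b1 = torch.exp(self.s(a2)) * a1 + self.t(a2)     # Eq. (2); log|det| = s(a2).sum(-1)
        return torch.cat([b1, a2], dim=-1)

    def inverse(self, b):
        b1, b2 = b.chunk(2, dim=-1)
        a1 = (b1 - self.t(b2)) * torch.exp(-self.s(b2))  # inverted without inverting s or t
        return torch.cat([a1, b2], dim=-1)
```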

3 Explainable AI with Normalizing Flows

The goal is now to exploit the invertibility of the learned decision function [r, z] = e(x) = h ∘ g(x) to explain the model's decision-making process in regression and classification scenarios in a highly accessible way through a derivative-based attribution map (Sect. 3.1) and counterfactual images (Sect. 3.2).

3.1 Derivative-Based Attribution Map of the Inverse

Standard gradient-based approaches to compute attribution maps for DNNs, like Grad-CAM and SmoothGrad [23,27] (see also Sect. 1), rely on the gradient of the decision function with respect to the input image or derived feature maps. The resulting attribution maps indicate voxel locations that influenced the decision made. However, we argue that, from an interpretability point of view, this is an unnatural way of analyzing the model and that a more natural approach is to explore what effect manipulations of the decision result would have on the input image x (see also [26] and Sect. 1). Such an analysis can be easily carried out with an invertible NF model by computing the partial derivative of the inverse e^{-1}([r, z]) with respect to the prediction result r = e_r(x) (with z = e_z(x)):

\frac{\partial}{\partial r} e^{-1}([r, z]) = \frac{\partial}{\partial r}\left(Q\, h^{-1}([r, z]) + \bar{x}\right) = Q\, \frac{\partial}{\partial r} h^{-1}([r, z])    (4)

The partial derivative ∂h^{-1}([r, z])/∂r can be conveniently computed via automatic differentiation. We then use a normalized version as an attribution map that can be visualized in the image space. Utilizing the inverse of the decision function has the benefit that only voxels directly related to r will be highlighted, while the remaining unrelated image information projected to z will have no effect.
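Since h(·) lives in an automatic-differentiation framework, the map of Eq. (4) amounts to one Jacobian evaluation of the inverse with respect to r; a hedged sketch, assuming a differentiable h_inverse and a subspace basis Q, is:

```python
# Hedged sketch of the derivative-based attribution map of Eq. (4).
import torch

def attribution_map(h_inverse, Q, r, z):
    # h_inverse: latent (n_dim,) -> subspace coordinates (n_dim,); Q: (n_vox, n_dim); r: scalar tensor
    def inv_wrt_r(r_scalar):
        return h_inverse(torch.cat([r_scalar.reshape(1), z]))
    dy_dr = torch.autograd.functional.jacobian(inv_wrt_r, r).reshape(-1)   # d/dr h^{-1}([r, z])
    attr = Q @ dy_dr                                                       # back to voxel space via Q
    return attr / attr.abs().max()                                         # normalized attribution map
```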

3.2 Counterfactual Images for Systematic Analyses

Geometrically, the partial derivative in Eq. (4) defines the tangent to the curve parameterized by e^{-1}([r, z]) for a fixed vector z. Assuming that the training process described in Sect. 2.2 successfully disentangles r (= decision result) and z


Fig. 2. Explainability results for the brain age regression task for two test subjects at the age of 58 years (top) and 69 years (bottom) when using the baseline CNN and the NF-based model. Both models correctly estimate the age of the first subject (58 years; rounded) while both fail to do so for the second one (CNN: 55 years; NF: 52 years). For the CNN model, vanilla gradients and SmoothGrad maps are visualized. For the NF-based model, partial derivatives of the inverse, counterfactual images, and associated difference images to the original image (blue: negative gray value diff., red: positive diff.; white: no diff.) are visualized. Bottom row counterfactual: The brain the model would have expected to see for the correct prediction (69 yrs. instead of 52 yrs.). (Color figure online)

(= unrelated information) for a given input image x, we can follow the curve to generate meaningful artificial images for alternate decision results r + δr to systematically analyze the model's decision-making process and the learned concept ("What would this image look like if r were different?"; see Sect. 1). Those artificial images represent alternate realities and are usually called counterfactuals [26]. Given the prediction [r, z] = e(x) for an input image x, we define a counterfactual image x_{δr} for a modified decision result r + δr and fixed z as

x_{\delta r} = e^{-1}([r + \delta r, z]) + n_x    (5)

Vector n_x ∈ R^{n_vox} is the information of x lost when projecting the image to the affine subspace, n_x = x − (g^{-1} ∘ g)(x). We add this residual information to the generated counterfactual image to improve its visual appearance by substantially reducing the blurriness. Keeping z fixed allows us to generate images x_{δr} highly similar to x that would lead to another decision result r + δr when applying e(·). We, therefore, argue that a NF-based model is inherently self-explainable.
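Generating a counterfactual as in Eq. (5) then only requires shifting the first latent dimension and adding back the subspace residual; the function names below are assumptions, not the authors' API.

```python
# Hedged sketch of counterfactual generation following Eq. (5).
import torch

def counterfactual(e, e_inverse, g, g_inverse, x, dr):
    r_z = e(x)                              # [r, z] = e(x)
    r_z_shifted = r_z.clone()
    r_z_shifted[0] = r_z[0] + dr            # intervene on the decision dimension only, keep z fixed
    n_x = x - g_inverse(g(x))               # residual information lost by the subspace projection
    return e_inverse(r_z_shifted) + n_x     # x_dr = e^{-1}([r + dr, z]) + n_x
```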

4 Experiments and Results

The evaluation aims at showing that NF-based generative classifiers and regressors (1) achieve competitive results for typical classification/regression problems in neuroimaging and (2) are better at explaining their decisions when compared


to a standard black box CNN model. The tasks being analyzed here are brain age regression and brain-based sex classification using structural T1 MR images. Data: We use T1-weighted brain MR images of 5287 healthy adults from five databases (SHIP [28]: 3164 subjects; IXI (https://brain-development.org/ixi-dataset/): 563 subjects; SALD [29]: 494 subjects; DLBS (http://fcon_1000.projects.nitrc.org/indi/retro/dlbs.html): 309 subjects; OASIS-3 [16]: 757 subjects). For all subjects, age data (age range: 20–90 years) as well as their sex is available (females: 55%). All 5287 images are first pre-processed (N4 bias correction and skull-stripping [12]), affinely registered and histogram matched to the SRI24 atlas [21] (cropped to 173×211×155 voxels; isotropic 1 mm spacing), and finally split into independent training (4281 subjects), test (684 subjects), and validation subsets (322 subjects) via age- and sex-stratified random sampling. Experimental Design: Based on the 4281 training subjects, independent NF-based models as described in Sect. 2 are built for both tasks. Brain age regression is a continuous regression problem (true age vs. predicted age) and sex classification is a binary classification scenario (true sex vs. predicted sex). For both problems/models, the same architecture is used: affine subspace with n_dim = 500, n_lay = 16 affine coupling layers, fully-connected scaling/translation networks with n_hid = 2 hidden layers of width 32. During training, Eq. (3) is minimized for 20k epochs with an AdamW optimizer, a learning rate of 10^-4, and σ = 0.16 (age)/σ = 0.1 (sex). Parameters were chosen based on experiments on the validation data; see also [30]. Training takes 3 h on an NVIDIA Quadro P4000 GPU with a TensorFlow 2.2 implementation. As a baseline, we also train a classical CNN [4] frequently used in neuroimaging for both tasks. Results: Our NF-based generative brain age regression model achieves a mean absolute error (MAE) between true age and predicted age of 4.83±3.60 years for all test subjects, while the MAE for the baseline CNN model from [4] is 4.45±3.33 years. For the brain sex classification task, our NF-based classifier achieves an accuracy of 90.10% when using a 0.5 threshold after the sigmoid function and an area under the curve of the receiver operating characteristic (AUROC) of 0.97. The accuracy of the baseline CNN model is 92.98% (0.5 threshold) and the AUROC is 0.97. While the MAE and accuracy values are slightly better for the baseline CNN, none of the differences is statistically significant (paired Wilcoxon signed rank test; alpha level: 0.05; age: p = 0.09; sex: p = 0.49). Figures 2 (regression) and 3 (classification) show explainable AI results for both tasks and models. For the NF-based models, derivative-based attribution maps of the inverse (see Sect. 3.1) and counterfactual images for alternate prediction results (see Sect. 3.2) are visualized. In addition, vanilla gradients as well as SmoothGrad ([27], regression only, parameters from [17]) and Grad-CAM attribution maps ([23], classification only) are shown for the baseline CNN. For both tasks and in comparison to the other attribution maps, it is immediately obvious that the maps of the NF-based models are often more consistent and/or spatially smoother. We argue that those properties make it easier to interpret the maps



Fig. 3. Explainability results for the sex classification task for two female test subjects. The CNN model correctly classifies both subjects, while the NF-based model classifies the bottom one as being male (see also counterfactual image). For the CNN model, vanilla gradients and Grad-CAM maps are visualized. See also caption of Fig. 2.

while also increasing their trustworthiness. It also highlights that the NF-based models implicitly learn a consistent concept of the task at hand. It needs to be highlighted in particular that the concept learned for the brain age regression problem is in agreement with what is known about the normal aging process [9], as the model focuses on areas where aging-related atrophy is usually most visible (ventricles and gyri/sulci). The counterfactual images generated for the age regression task also help to understand the concept learned and are especially useful when analyzing why the NF model severely underestimated the age of the second subject (52 years instead of 69 years). From the counterfactual image for the correct age (69 years), it can be seen that the model would have expected to see larger ventricles, indicating a more advanced state of atrophy. The sex classification results are harder to interpret/verify as the average global volume difference between male and female brains cannot be used by the models, as this information was removed by the affine registration during preprocessing. Interestingly, the NF-based model still uses some volume information (brains of males show more atrophy), as highlighted by the attribution maps and the difference images. This may indicate a bias that exists in the training data or was introduced during pre-processing, and which would be harder to detect when using CNN-related maps. The NF model additionally believes that females have a larger cerebellum (see counterfactual for second subject in Fig. 3) than males, which is in agreement with the pediatric study results in [1].

5 Conclusion

In this paper, we proposed to build self-explainable generative classifiers and regressors based on invertible NFs that are easy to train and directly applicable to 3D neuroimaging data. Our evaluation showed that the proposed models achieve competitive results when compared to a standard CNN model for brain


age regression and brain sex classification. Because of their invertibility, the resulting models can be utilized to generate smooth and consistent attribution maps that directly visualize the concepts they are using. Furthermore, they can generate realistic counterfactual images for alternate prediction results that help to systematically analyze a model’s behaviour in a highly accessible way. All of this is possible without using any additional models or methods. The models proposed in this work are also fully-functional probabilistic generative models. This property has not been explicitly exploited here, but it allows, for example, to systematically sample data or to generate conditional templates (see [30] for examples). Future work will focus on additional regression and classification tasks in neuroimaging and the addition of other baseline methods for explainable AI to the evaluation. In summary, we see this work as an important step towards normalizing flow-based, self-explainable models in neuroimaging and believe that in the future improvements with respect to their accuracy can be expected by incorporating recent advances from the machine learning community [2,18]. Acknowledgements. This work was supported by a T. Chen Fong postdoctoral fellowship and the River Fund at Calgary Foundation.

References

1. Adeli, E., et al.: Deep learning identifies morphological determinants of sex differences in the pre-adolescent brain. Neuroimage 223, 117293 (2020)
2. Ardizzone, L., Mackowiak, R., Rother, C., Köthe, U.: Training normalizing flows with the information bottleneck for competitive generative classification. NeurIPS 33 (2020)
3. Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020)
4. Cole, J.H., et al.: Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker. Neuroimage 163, 115–124 (2017)
5. Eitel, F., Ritter, K.: Testing the robustness of attribution methods for convolutional neural networks in MRI-based Alzheimer's disease classification. In: Suzuki, K., et al. (eds.) ML-CDS/IMIMIC 2019. LNCS, vol. 11797, pp. 3–11. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-33850-3_1
6. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations: an overview of interpretability of machine learning. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89. IEEE (2018)
7. Goyal, Y., Wu, Z., Ernst, J., Batra, D., Parikh, D., Lee, S.: Counterfactual visual explanations. In: ICML, pp. 2376–2384 (2019)
8. Graziani, M., Andrearczyk, V., Marchand-Maillet, S., Müller, H.: Concept attribution: explaining CNN decisions to physicians. Comput. Biol. Med. 123, 103865 (2020)
9. Hedman, A.M., van Haren, N.E., Schnack, H.G., Kahn, R.S., Hulshoff Pol, H.E.: Human brain changes across the life span: a review of 56 longitudinal magnetic resonance imaging studies. Human Brain Mapp. 33(8), 1987–2002 (2012)
10. Holzinger, A., Biemann, C., Pattichis, C.S., Kell, D.B.: What do we need to build explainable AI systems for the medical domain? (2017). arXiv:1712.09923


11. Hwang, S.J., Tao, Z., Kim, W.H., Singh, V.: Conditional recurrent flow: conditional generation of longitudinal samples with applications to neuroimaging. In: CVPR, pp. 10692–10701 (2019)
12. Isensee, F., et al.: Automated brain extraction of multisequence MRI using artificial neural networks. Human Brain Mapp. 40(17), 4952–4964 (2019)
13. Jeyakumar, J.V., Noor, J., Cheng, Y.H., Garcia, L., Srivastava, M.: How can I explain this to you? An empirical study of deep neural network explanation methods. NeurIPS 33 (2020)
14. Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al.: Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In: ICML, pp. 2668–2677. PMLR (2018)
15. Kobyzev, I., Prince, S., Brubaker, M.: Normalizing flows: an introduction and review of current methods. IEEE TPAMI (2020)
16. LaMontagne, P.J., et al.: OASIS-3: longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer disease. medRxiv (2019)
17. Levakov, G., Rosenthal, G., Shelef, I., Raviv, T.R., Avidan, G.: From a deep learning model back to the brain–identifying regional predictors and their relation to aging. Human Brain Mapp. 41(12), 3235–3252 (2020)
18. Mackowiak, R., Ardizzone, L., Köthe, U., Rother, C.: Generative classifiers as a basis for trustworthy computer vision. arXiv:2007.15036 (2020)
19. Narayanaswamy, A., et al.: Scientific discovery by generating counterfactuals using image translation. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12261, pp. 273–283. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59710-8_27
20. Reyes, M., et al.: On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol. Artif. Intell. 2(3), e190043 (2020)
21. Rohlfing, T., Zahr, N.M., Sullivan, E.V., Pfefferbaum, A.: The SRI24 multichannel atlas of normal adult human brain structure. Human Brain Mapp. 31(5), 798–819 (2010)
22. Samek, W., Montavon, G., Lapuschkin, S., Anders, C.J., Müller, K.R.: Explaining deep neural networks and beyond: a review of methods and applications. Proc. IEEE 109(3), 247–278 (2021)
23. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: CVPR, pp. 618–626 (2017)
24. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv:1312.6034 (2013)
25. Singla, S., Pollack, B., Wallace, S., Batmanghelich, K.: Explaining the black-box smoothly - a counterfactual approach. arXiv:2101.04230 (2021)
26. Sixt, L., Schuessler, M., Weiß, P., Landgraf, T.: Interpretability through invertibility: a deep convolutional network with ideal counterfactuals and isosurfaces (2021). https://openreview.net/forum?id=8YFhXYe1Ps
27. Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: SmoothGrad: removing noise by adding noise. arXiv:1706.03825 (2017)
28. Völzke, H., et al.: Cohort profile: the Study of Health in Pomerania. Int. J. Epidemiol. 40(2), 294–307 (2011)
29. Wei, D., Zhuang, K., Chen, Q., Yang, W., Liu, W., Wang, K., Sun, J., Qiu, J.: Structural and functional MRI from a cross-sectional Southwest University Adult Lifespan Dataset (SALD). BioRxiv, p. 177279 (2017)


30. Wilms, M., et al.: Bidirectional modeling and analysis of brain aging with normalizing flows. In: Kia, S.M., et al. (eds.) MLCN/RNO-AI 2020. LNCS, vol. 12449, pp. 23–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66843-3_3
31. Zhen, X., Chakraborty, R., Yang, L., Singh, V.: Flow-based generative models for learning manifold to manifold mappings. arXiv:2012.10013 (2020)
32. Zhou, S.K., et al.: A review of deep learning in medical imaging: image traits, technology trends, case studies with progress highlights, and future promises. arXiv:2008.09104 (2020)

Patch vs. Global Image-Based Unsupervised Anomaly Detection in MR Brain Scans of Early Parkinsonian Patients

Verónica Muñoz-Ramírez1,2, Nicolas Pinon3, Florence Forbes2, Carole Lartizen3, and Michel Dojat1(B)

1 Univ. Grenoble Alpes, Inserm U1216, CHU Grenoble Alpes, Grenoble Institut des Neurosciences, 38000 Grenoble, France
{veronica.munoz-ramirez,michel.dojat}@univ-grenoble-alpes.fr
2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
[email protected]
3 Univ. Lyon, CNRS, Inserm, INSA Lyon, UCBL, CREATIS, UMR5220, U1206, 69621 Villeurbanne, France
{nicolas.pinon,carole.lartizen}@creatis.insa-lyon.fr

Abstract. Although neural networks have proven very successful in a number of medical image analysis applications, their use remains difficult when targeting subtle tasks such as the identification of barely visible brain lesions, especially given the lack of annotated datasets. Good candidate approaches are patch-based unsupervised pipelines, which have the advantage of both increasing the number of input data and capturing local and fine anomaly patterns distributed in the image, while a potential inconvenience is the loss of global structural information. We illustrate this trade-off on Parkinson's disease (PD) anomaly detection, comparing the performance of two anomaly detection models based on a spatial auto-encoder (AE) and an adaptation of a patch-fed siamese auto-encoder (SAE). On average, the SAE model performs better, showing that patches may indeed be advantageous.

Keywords: Parkinson's disease · Anomaly detection · Patches · Siamese networks · Auto-encoders

Data used in the preparation of this article were obtained from the Parkinson's Progression Markers Initiative (PPMI) database (www.ppmi-info.org/data). VMR is supported by a grant from NeuroCoG IDEX-UGA (ANR-15-IDEX-02). This work is partially supported by the French program "Investissement d'Avenir" run by the Agence Nationale pour la Recherche (ANR-11-INBS-0006). VMR and NP contributed equally to this work.

1 Introduction

Medical imaging represents the largest percentage of data produced in healthcare and thus a particular interest has emerged in deep learning (DL) methods


to create support tools for radiologists to analyze multimodal medical images, segment lesions and detect subtle pathological changes that even an expert eye can miss. The vast majority of these methods are based on supervised models, which require training on large series of annotated data that are time- and resource-consuming to generate. Over the years, several publicly available neuroimaging databases have been curated and completed with annotations. Some of the most prominent ones are: MSSEG, for multiple sclerosis lesion segmentation [4]; BRATS, for brain tumor segmentation [14]; ISLES, for ischemic stroke lesion segmentation [13]; and mTOP for mild traumatic injury outcome prediction [12]. Challenges, notably those organized at MICCAI, are held regularly to showcase the latest technological advancements and push the community towards better performances. However, there are several neurological diseases seldom studied due to the small size and subtlety of the lesions they present. This is the case of vascular disease, epilepsy, and most neurodegenerative diseases in their early stages. The main challenge for such pathologies, indeed, is to identify the variability of the pathological patterns on images where the lesion is barely seen or not visible. Unsupervised methods are good candidates to tackle both the lack of annotated examples and the subtlety of brain scan anomalies [3,19]. They rely on networks that learn to encode normal brain patterns in such a manner that any atypical occurrence can be identified by the inability of the network to reproduce it. Auto-encoders (AE), variational auto-encoders (VAE) [10] and generative adversarial networks (GAN) [7] have been extensively used as building blocks for unsupervised anomaly detection due to their ability to learn high-dimensional distributions [3]. Parkinson's Disease (PD) is a neurodegenerative disorder that is only identifiable through routine MR scans at an advanced stage. Nevertheless, the manifestation of non-motor symptoms, years before the appearance of the first motor disturbances, suggests the presence of physio-pathological differences that could allow for earlier PD diagnosis. PD afflicts patients for as long as one to two decades of their lives and current treatments can only attenuate some motor manifestations [21]. Therefore, reducing the gap between diagnosis and the onset of the neurodegenerative process is of paramount importance to identify personalized treatments that would significantly slow its natural progression. Unsupervised anomaly detection models are here employed to explore such challenging MR data analysis. In a previous work [16], we compared deterministic and variational, spatial and dense autoencoders for the detection of subtle anomalies in the diffusion parametric maps of de novo (i.e., newly diagnosed and without dopaminergic treatment) PD patients from the PPMI database [11]. Our results, while preliminary, offered compelling evidence that DL models are useful to identify subtle anomalies in early PD, even when trained with a moderate number of images and only two parametric maps as input. Our goal in this paper is to compare an improved anomaly detection pipeline based on a deterministic spatial auto-encoder, hereafter simply referred to as AE, to


an adaptation of the patch-based siamese auto-encoder (SAE) proposed in [1]. This architecture was originally intended for the detection of subtle epileptic lesions, an application for which it achieved promising results. One important difference between the two compared architectures is the dimension of the input and output data. While the AE were trained on 2D transverse slices, thus capturing a global pattern in the image, the SAE were trained on small patches sampled throughout the data, making them more suitable to capture fine patterns but losing global structural integrity. Through this comparison we aim to analyze the advantages of patch-fed architectures for the identification of subtle and local abnormalities, as well as to propose an alternative for anomaly detection in moderate-size image datasets.

2 Brain Anomaly Detection Pipeline

The anomaly detection task with auto-encoders can be formally posed as follows:
– An auto-encoder is first trained to reconstruct normal samples as accurately as possible. This network is composed of two parts: an encoder (1) that maps the input data into a lower dimensional latent space, assumed to contain important image features, and a decoder (2) that maps the code from the latent space into an output image.
– When fed an unseen image, this trained network produces a reconstructed image from its sampled latent distribution, which is the 'normal' counterpart of the input image.
– Reconstruction error maps, computed as the difference between the input and output images, are thus assumed to highlight anomalous regions of the input data.
– Anomaly scores at the voxel, region of interest or image level may then be derived from the post-processing of these reconstruction error maps.
In this work, we present a general framework for unsupervised brain anomaly detection based on auto-encoders to produce reconstruction error maps and a novel post-processing step to derive per-region anomaly scores.

2.1 Autoencoder Architectures

We constructed and evaluated two auto-encoder models: a classic auto-encoder (AE) and a siamese auto-encoder (SAE). Both models are fully-convolutional. Their architectures are displayed in Fig. 1 and their differences are detailed below. Classic Auto-Encoder: This architecture consists of 5 convolutional layers that go from input to bottleneck and 5 transposed convolutional layers going from bottleneck to output. The output of the encoder network is directly the latent vector z and the loss function was simply the L1-norm reconstruction error between input x and output x̂:

L_{AE}(x) = \| x - \hat{x} \|_1    (1)


Fig. 1. Classical auto-encoder (AE) on top, Siamese auto-encoder (SAE) at the bottom

Siamese Auto-Encoder: The siamese auto-encoder (SAE) model [1] consists of two identical convolutional auto-encoders with shared parameters. The SAE receives a pair of patches (x_1, x_2) as input that are propagated through the network, yielding representations z_t ∈ Z, t = 1, 2, in the bottleneck layer. The second term of the loss function L_SAE (Eq. 2) is designed to maximize the cosine similarity between z_1 and z_2. This constrains patches that are "similar" to be aligned in the latent space. Unlike standard siamese architectures, where both similar and dissimilar pairs are presented to the network, Alaverdyan et al. [1] proposed to train this architecture on similar pairs only and to compensate for the lack of dissimilar pairs through a regularizing term that prevents the loss from being driven to 0 by mapping all patches to a constant value. This term is defined as the mean squared error between the input patches and their reconstructed outputs. The proposed loss function for a single pair hence is:

L_{SAE}(x_1, x_2) = \sum_{t=1}^{2} \|x_t - \hat{x}_t\|_2^2 - \alpha \cdot \cos(z_1, z_2)    (2)

where x̂_t is the reconstructed output of the patch x_t, z_t is its representation in the bottleneck layer, and α is a hyperparameter that controls the trade-off between the two terms. As depicted in Fig. 1, the encoder part is composed of 3 convolutional layers with one max-pooling layer between the first and second convolutions, while the (non-symmetrical) decoder part is composed of 4 convolutional layers, with an upsampling layer between the second and third convolutional layers.
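For concreteness, a minimal PyTorch-style sketch of this loss is given below (our own illustration; function and tensor names are placeholders, and the use of a mean-reduced MSE follows the training description in Sect. 3.2):

```python
import torch
import torch.nn.functional as F

def sae_loss(x1, x2, x1_hat, x2_hat, z1, z2, alpha=0.005):
    """Siamese auto-encoder loss of Eq. (2): reconstruction terms for both
    patches minus a weighted cosine similarity between their latent codes."""
    reconstruction = F.mse_loss(x1_hat, x1) + F.mse_loss(x2_hat, x2)
    similarity = F.cosine_similarity(z1.flatten(1), z2.flatten(1), dim=1).mean()
    return reconstruction - alpha * similarity
```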

2.2 Post-processing of the Reconstruction Error Maps

We leveraged the reconstruction error maps obtained from both architectures to generate an anomaly score, following the methodology introduced in [15]. The voxel-wise reconstruction errors in one image were computed as ||x_i − x̂_i||_1. Since the architectures were fed more than one channel (here two MR modalities), we defined the joint reconstruction error of every voxel as the square root of the sum of squares of the differences between input and output over all channels. Next, we fixed a threshold on these reconstruction error maps to decide whether or not a given voxel should be considered abnormal, hereafter called the abnormality threshold. Since we expected PD patients to exhibit larger numbers of abnormal voxels than controls, this value corresponded to an extreme quantile (e.g. the 98% quantile) of the error distribution in the control population. The thresholded reconstruction error maps were then employed to identify anatomical brain regions for which the number of abnormal voxels could be used to discriminate between patients and controls.
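A small NumPy sketch of this post-processing (our own illustration; array shapes and function names are assumptions) could look as follows:

```python
import numpy as np

def joint_error_map(x, x_hat):
    """Joint reconstruction error over channels (here FA and MD):
    square root of the sum of squared per-channel differences.
    x, x_hat: arrays of shape (channels, ...)."""
    return np.sqrt(((x - x_hat) ** 2).sum(axis=0))

def abnormality_threshold(control_error_maps, q=0.98):
    """Extreme quantile of the error distribution in the control population."""
    all_errors = np.concatenate([e.ravel() for e in control_error_maps])
    return np.quantile(all_errors, q)

def abnormal_voxels(error_map, threshold):
    """Binary map of voxels considered abnormal."""
    return error_map > threshold
```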

3 Experiments

3.1 Data

The dataset used in this work consisted of DTI MR scans of 57 healthy controls and 129 de novo PD patients selected from the PPMI database. All images were acquired with the same MR scanner model (3T Siemens Trio Tim) configured with the same acquisition parameters. Only one healthy control was excluded from the study due to severe artifacts in the images. From these images, two measures, mean diffusivity (MD) and fractional anisotropy (FA), were computed using MRtrix3.0. Values of FA and MD were normalized into the range [0, 1]. The images were spatially normalized to the standard brain template of the Montreal Neurological Institute (MNI) with a non-linear deformation. The resulting MD and FA parameter maps were of dimension 121 × 145 × 121 with a voxel size of 1.5 × 1.5 × 1.5 mm3. The control dataset was divided into 41 training controls and 15 testing controls to avoid data leakage. This split was performed in 10 different ways through a bootstrap procedure in order to assess the generalization of our predictions, as advised in [17]. We took special care to maintain a mean age of around 61 years for both the training and test populations, as well as a 40–60 proportion of females and males. Once the models were trained with one of the 10 training sets, they were evaluated on the corresponding healthy control test set and on the PD dataset (age: 62 y. ± 9; sex: 48 F).

3.2 Training of the Auto-Encoders

AE Models. The training dataset of the AE models consisted of 1640 images corresponding to 40 axial slices around the center of the brain for each of the 41 training control subjects. The AE models were trained for 160 epochs, with a learning rate of 10−3. 3 × 3 kernels were convolved using a padding of 1 pixel and a stride of (2, 2). The bottleneck dimensions were h=4, w=5 and c=256. There were no pooling layers. The implementation used Python 3.6.8, PyTorch 1.0.1 and CUDA 10.0.130, and the models were trained on an NVIDIA GeForce RTX 2080 Ti GPU with batches of 40 images. After each convolutional layer, batch normalization [8] was applied for its regularization properties. The rectified linear unit (ReLU) activation function was employed in each layer except the last, for which a sigmoid was preferred. The loss functions were optimized using Adam [9].

SAE Models. The SAE model was trained with 600 000 patches of size 15 × 15 × 2 (∼15 000 patches per subject). The model was trained for 30 epochs, with a learning rate of 1 × 10−3. Bottleneck dimensions were h=2, w=2 and c=16. Max-pooling and upsampling layers were used, as detailed before. No batch-normalization layers were used. The activation function for every convolutional layer was the rectified linear unit (ReLU), and the sigmoid function was used in the last layer. The kernel size and the number of filters were 3 × 3 and 16 respectively for all convolution blocks except the final one, which used 2 × 2 and 2 (equal to the number of channels) respectively. The stride was 1 for all convolution blocks. The max-pooling/upsampling factor was 2. The implementation used Python 3.8.10, TensorFlow 2.4.1 and CUDA 11.0.221, and the model was trained on an NVIDIA GeForce GTX 1660 GPU with batches of 225 patches. The loss function comprised a reconstruction part (mean squared error) and a similarity measure (cosine similarity) weighted by a coefficient α=0.005. The loss function was minimized using Adam [9].
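To make the AE configuration concrete, the sketch below (our own illustration only) assembles a 2D fully-convolutional auto-encoder following the stated settings (3 × 3 kernels, stride 2, padding 1, batch normalization, ReLU, sigmoid output, 256-channel bottleneck); the intermediate channel widths and the final resizing step are our assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class AE2D(nn.Module):
    """2D fully-convolutional auto-encoder for axial FA/MD slices (2 channels)."""
    def __init__(self, widths=(2, 16, 32, 64, 128, 256)):
        super().__init__()
        enc, dec = [], []
        for c_in, c_out in zip(widths[:-1], widths[1:]):          # 5 encoder layers
            enc += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                    nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        rev = widths[::-1]
        for i, (c_in, c_out) in enumerate(zip(rev[:-1], rev[1:])):  # 5 decoder layers
            last = i == len(rev) - 2
            dec += [nn.ConvTranspose2d(c_in, c_out, 3, stride=2,
                                       padding=1, output_padding=1)]
            dec += [nn.Sigmoid()] if last else [nn.BatchNorm2d(c_out),
                                                nn.ReLU(inplace=True)]
        self.encoder, self.decoder = nn.Sequential(*enc), nn.Sequential(*dec)

    def forward(self, x):
        out = self.decoder(self.encoder(x))
        # resize to the input slice size: odd in-plane dimensions are not exactly
        # inverted by five stride-2 layers (a simplification on our part)
        return F.interpolate(out, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)
```

Training would then minimize `torch.nn.functional.l1_loss(model(x), x)` with the Adam optimizer, in line with Eq. (1).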

3.3 Performance Evaluation

The percentage of abnormal voxels found in the thresholded reconstruction error maps was employed to classify subjects as healthy or pathological (PD) based on a threshold. The critical choice of this threshold was investigated using a receiver operating characteristic (ROC) curve, taking into account the imbalanced nature of our test set (15 healthy and 129 PD). Every point on the ROC curve corresponds to the sensitivity and specificity values obtained with a given threshold. As proposed in [15], the choice of the optimal threshold, referred to as the pathological threshold, was based on the optimal geometric mean, g-mean = √(Sensitivity × Specificity). Additionally, to help evaluate the localization of anomalies, two atlases were considered: the Neuromorphometrics atlas [2] and the MNI PD25 atlas [20]. The first was used to segment the brain into 8 macro-regions: subcortical structures, white matter and the 5 gray matter lobes (Frontal, Temporal, Parietal, Occipital, Cingulate/Insular). The latter was specifically designed for the exploration of PD patients. It contains 8 regions: substantia nigra (SN), red nucleus (RN), subthalamic nucleus (STN), globus pallidus interna and externa (GPi, GPe), thalamus, putamen and caudate nucleus. For each of the aforementioned regions of interest (ROI), we calculated the g-mean for the associated pathological threshold, yielding the classification performance of our models.
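The selection of the pathological threshold can be sketched as follows (our own illustration using scikit-learn's ROC utilities; variable names are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_curve

def pathological_threshold(labels, scores):
    """Pick the decision threshold on the subject-level score (percentage of
    abnormal voxels) that maximizes the geometric mean of sensitivity and
    specificity. labels: 1 = PD, 0 = healthy control."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    g_means = np.sqrt(tpr * (1.0 - fpr))   # sensitivity = tpr, specificity = 1 - fpr
    best = int(np.argmax(g_means))
    return thresholds[best], g_means[best]
```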

4 Results

As can be seen in Fig. 2, both auto-encoder architectures achieve good-quality reconstructions; however, the SAE seems to capture finer details and textures than the AE. This explains the high contrast in the AE reconstructions.

Fig. 2. Showcase of a slice of the original data and its AE and SAE reconstructions

The visualization of the percentage of abnormal voxels in the ROIs presented in Fig. 3 showcases the inter-subject variability amongst members of the same population (healthy and PD). Even so, abnormal voxels are clearly more numerous in the PD patient population.

Fig. 3. The percentage of abnormal voxels found by the SAE in the anatomical ROIs presented in Sect. 3. Top: the test controls of Sample 1; Bottom: 15 randomly selected PD patients.

The g-mean classification scores for all models, obtained for each ROI and each sub-population, are presented in Fig. 4. We notice that, on average, the SAE model performed better than the AE on the whole brain and on most of the macro and subcortical structures studied, with the exception of the temporal lobe, the putamen, the thalamus and the internal and external segments of the globus pallidus. We note that the results varied greatly across the ten population samples. As an indication, for the whole brain, the SAE obtained an average g-mean score of 66.9 ± 5.8% and the AE 65.3 ± 7.5%; however, the best scores among the 10 samples were 79.9% and 81.9% for the AE and SAE respectively, both on sample 1. The corresponding values for the white matter are 68.2 ± 4.6% for the SAE and 66.2 ± 6.7% for the AE. The largest standard deviations among the observed anatomical regions belonged to the white matter and the frontal and occipital lobes.

Fig. 4. g-mean scores for the whole brain and several ROIs for AE and SAE. The vertical dashed lines separate macro- and micro brain structures.

5 Discussion and Conclusion

Unsupervised auto-encoders (AE) have been shown to efficiently tackle challenging detection tasks where brain alterations are barely or not at all visible. The objective of our study was to explore the potential of such AE models for the detection of subtle anomalies in de novo PD data and to compare patch-based versus image-based models. Both the AE and SAE architectures produced good-quality reconstructions and were able to discriminate between healthy individuals and recently diagnosed PD patients with performances (see Figs. 3 and 4) that are competitive with those found in the literature. Notably, the SVM mean accuracy score of Correia et al. [5] for a selection of WM regions is 61.3%, whereas both the SAE and AE achieved g-mean scores above 66% for the WM. Also, the cross-validation procedure of Schuff et al. [18] obtained a ROC AUC of 59% for the rostral segment of the SN, which is below our average SN g-mean score for the SAE and equal to that of the AE. Note that at this early stage of the disease (1–2 on the H-Y scale) the patients have no tremor or uncontrolled movements compared to healthy subjects. This rules out movement as the feature that allowed PD classification.


Using DTI data, we did not search for structural atrophy or lesion load but for degradation of WM properties in the early stages of PD, which could appear anywhere in the brain. This explains why the WM obtains the highest g-mean scores. That being said, our models could largely improve with a larger dataset. Furthermore, the addition of another MR modality, such as iron load using T2/T2* relaxometry, could allow us to detect the reduction of dopaminergic neurons in subcortical structures, largely reported in the early stages of PD but not visible in DTI. Regarding the comparison between the AE and SAE, the choice is not clear. While the classic AE architecture benefits from a more straightforward implementation, the SAE offers significant advantages for small databases. Indeed, patch-fed networks can be trained with smaller samples of data, and the siamese constraint of the architecture ensures efficient learning. What is more, the latent space features of these models contain local information that can be used to classify between healthy and pathological individuals at the voxel level and to produce anomaly maps like those introduced in [1]. In future work, we plan to generate such maps for early PD patients to offer more precise indications about the localization of anomalies and to correlate them with the PD hemispheric lateralization. In addition, we aim to extend our dataset by adding other MR modalities such as perfusion and relaxometry, but also by gathering heterogeneous data from multi-vendor and multi-site exams. For this purpose, we will use a harmonization procedure as a preprocessing step (an extension of DeepHarmony [6]). Finally, our 3D implementation of the SAE model is ongoing.

References
1. Alaverdyan, Z., Jung, J., Bouet, R., Lartizien, C.: Regularized siamese neural network for unsupervised outlier detection on brain multiparametric magnetic resonance imaging: application to epilepsy lesion screening. Med. Image Anal. 60, 101618 (2020). https://doi.org/10.1016/j.media.2019.101618
2. Bakker, R., Tiesinga, P., Kötter, R.: The scalable brain atlas: instant web-based access to public brain atlases and related content. Neuroinformatics 13(3), 353–366 (2015). https://doi.org/10.1007/s12021-014-9258-x
3. Baur, C., Denner, S., Wiestler, B., Navab, N., Albarqouni, S.: Autoencoders for unsupervised anomaly segmentation in brain MR images: a comparative study. Med. Image Anal. 69, 101952 (2021). https://doi.org/10.1016/j.media.2020.101952
4. Commowick, O., et al.: Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure. Sci. Rep. 8, 13650 (2018). http://portal.fli-iam.irisa.fr/msseg-challenge
5. Correia, M.M., et al.: Towards accurate and unbiased imaging-based differentiation of Parkinson's disease, progressive supranuclear palsy and corticobasal syndrome. Brain Commun. (2020)
6. Dewey, B.E., et al.: DeepHarmony: a deep learning approach to contrast harmonization across scanner changes. Magn. Reson. Imaging (2019). https://doi.org/10.1016/j.mri.2019.05.041


7. Goodfellow, I.J., et al.: Generative adversarial networks. arXiv:1406.2661 [cs, stat], June 2014
8. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 [cs], March 2015
9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980 [cs], January 2017
10. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv:1312.6114 [cs, stat], May 2014
11. Marek, K., et al.: The Parkinson's progression markers initiative (PPMI) - establishing a PD biomarker cohort. Ann. Clin. Transl. Neurol. 1460–1477 (2018)
12. MICCAI: Mild traumatic brain injury outcome prediction (2016). www.tbichallenge.wordpress.com
13. MICCAI: Ischemic stroke lesion segmentation challenge (2018). www.isleschallenge.org
14. MICCAI: Brain tumor segmentation challenge (2020). http://braintumorsegmentation.org/
15. Muñoz-Ramírez, V., Kmetzsch, V., Forbes, F., Meoni, S., Moro, E., Dojat, M.: Subtle anomaly detection in MRI brain scans: application to biomarkers extraction in patients with de novo Parkinson's disease. medRxiv (2021). https://doi.org/10.1101/2021.06.03.21258269
16. Muñoz-Ramírez, V., Kmetzsch, V., Forbes, F., Dojat, M.: Deep learning models to study the early stages of Parkinson's disease. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1534–1537 (2020). https://doi.org/10.1109/ISBI45749.2020.9098529
17. Poldrack, R.A., Huckins, G., Varoquaux, G.: Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry, 534–540 (2019)
18. Schuff, N., et al.: Diffusion imaging of nigral alterations in early Parkinson's disease with dopaminergic deficits. Mov. Disord. 30, 1885–1892 (2015)
19. Shinde, S., et al.: Predictive markers for Parkinson's disease using deep neural nets on neuromelanin sensitive MRI. NeuroImage: Clin. 22, 101748 (2019)
20. Xiao, Y., et al.: Multi-contrast unbiased MRI atlas of a Parkinson's disease population. Int. J. Comput. Assist. Radiol. Surg. 10, 329–341 (2015)
21. Zhao, Y.J., et al.: Progression of Parkinson's disease as evaluated by Hoehn and Yahr stage transition times. Mov. Disord. 25(6), 710–716 (2010). https://doi.org/10.1002/mds.22875

MRI Image Registration Considerably Improves CNN-Based Disease Classification

Malte Klingenberg1,2, Didem Stark1,2(B), Fabian Eitel1,2, and Kerstin Ritter1,2, for the Alzheimer's Disease Neuroimaging Initiative

1 Department of Psychiatry and Neurosciences | CCM, Charité – Universitätsmedizin Berlin (corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health), Berlin, Germany
2 Bernstein Center for Computational Neuroscience, Berlin, Germany

Abstract. Machine learning methods have many promising applications in medical imaging, including the diagnosis of Alzheimer's Disease (AD) based on magnetic resonance imaging (MRI) brain scans. These scans usually undergo several preprocessing steps, including image registration. However, the effect of image registration methods on the performance of the machine learning classifier is poorly understood. In this study, we train a convolutional neural network (CNN) to detect AD on a dataset preprocessed in three different ways. The scans were registered to a template either linearly or nonlinearly, or were only padded and cropped to the needed size without performing image registration. We show that both linear and nonlinear registration increase the balanced accuracy of the classifier significantly by around 6–7% in comparison to no registration. No significant difference between linear and nonlinear registration was found. The dataset split, although carefully matched for age and sex, affects the classifier performance strongly, suggesting that some subjects are easier to classify than others, possibly due to different clinical manifestations of AD and varying rates of disease progression. In conclusion, we show that for a CNN detecting AD, a prior image registration improves the classifier performance, but the choice of a linear or nonlinear registration method has only little impact on the classification accuracy and can be made based on other constraints such as computational resources or planned further analyses like the use of brain atlases.

Keywords: Alzheimer · CNN · Image registration · MRI · Deep learning

M. Klingenberg and D. Stark—These authors contributed equally to this work.
Alzheimer's Disease Neuroimaging Initiative—Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

1 Introduction

In recent years, machine learning techniques have frequently been used to diagnose neurological and psychiatric diseases, such as Alzheimer's disease (AD), based on structural magnetic resonance imaging (MRI) data [12,14,18]. While traditional methods such as support vector machines or random forests usually require a prior image registration in combination with feature extraction (e.g., volumes of brain areas or cortical thickness), convolutional neural networks (CNNs) can directly operate on 3-dimensional MRI data [9]. However, for computational reasons and because of relatively small sample sizes, most studies so far have focused on MRI data registered to a template (e.g., in MNI space) [18]. In addition, image registration makes it possible to compare scans taken from multiple subjects and to identify relevant brain regions for a classifier [4]. Algorithms for image registration include simple approaches that perform the registration linearly, using only translation, rotation, and scaling operations to match the input image to the reference image, as well as more complicated methods that use an additional nonlinear warping step to achieve a better correspondence between the two images. The latter increases the computational cost, but also improves the image similarity. While there are studies comparing different image registration methods based on similarity metrics such as the overlap between specific volumes in the input and reference images, the similarity between these volumes, or differences in their boundaries [1,2,7,13,16], to the best of our knowledge, there has not been an analysis of the effect of the registration method on the performance of a machine learning classifier. In the present study, we examine how a CNN performs in AD classification when trained on a dataset preprocessed in three different ways, namely no registration but only image cropping, linear registration, and nonlinear registration. The CNN receives the entire 3D scan as input, without prior extraction of regions of interest or other pre-selection of image features. To eliminate potential confounders, we carefully balanced the dataset with respect to age and sex, a step often neglected in previous work.

2 Data and Methods

2.1 Dataset

Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. We included subjects from all ADNI study phases who were, at their baseline visit, either classified as cognitively normal (CN) or diagnosed with AD. Subjects who were classified as CN at their baseline visit but received a diagnosis of MCI or AD at some later visit were excluded from our analysis. Overall, our study population included 573 subjects (406 CN, 167 AD). Since the mean age of female subjects was lower than the mean age of male subjects in both groups (AD: 74.0 ± 7.9 vs. 75.2 ± 8.3 years, p = 0.3211; CN: 71.3 ± 6.7 vs. 73.6 ± 6.8 years, p = 0.0007; p-values were calculated with a two-sample t-test), we removed this possible source of bias by undersampling based on subject sex. We divided the population into bins according to diagnosis (AD or CN) and age (in 5-year ranges, e.g. 60–64, 65–69, etc.), and if such a bin contained a different number of female and male subjects, we randomly removed subjects from the bin until their numbers were equal. In total, this process reduced the population size by about 25%, with the final population consisting of 432 subjects (306 CN, 126 AD). The resulting female and male age distributions were similar, with their means no longer significantly different (p > 0.75). The population was then split into a training set (306 subjects) and validation and test sets (63 subjects each). We used a stratified split based on subject sex, diagnosis, and age range to ensure that the distributions of the sets are as close as possible. It is important to split the dataset on the subject level instead of the image level to avoid data leakage [18], as otherwise scans of a single subject may end up in both the training and the test set. The final size of the training set ranged from 758 to 834 images, depending on the number of scans taken from the specific subjects remaining after the undersampling. For the test set, we only kept the scans taken at the baseline visit. As the resulting test set was rather small, the results were likely to vary significantly depending on the specific data split. To avoid the possible issue of a lucky or unlucky split, we repeated both the undersampling and the data split for ten different random seeds. The subsequent analyses were then performed on all ten resulting data splits.
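The undersampling step can be sketched as follows (our own illustration with pandas; column names and the handling of bins containing a single sex are assumptions):

```python
import pandas as pd

def undersample_by_sex(subjects: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Within each (diagnosis, 5-year age bin) cell, randomly drop subjects of
    the larger sex group until female and male counts are equal. Cells that
    contain only one sex are kept unchanged in this sketch."""
    df = subjects.copy()
    df["age_bin"] = (df["age"] // 5) * 5
    kept = []
    for _, cell in df.groupby(["diagnosis", "age_bin"]):
        n_keep = cell["sex"].value_counts().min()
        for _, group in cell.groupby("sex"):
            kept.append(group.sample(n=n_keep, random_state=seed))
    return pd.concat(kept).drop(columns="age_bin")
```

A subsequent stratified subject-level split (e.g. on the combination of sex, diagnosis, and age bin) would then produce the training, validation, and test sets.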

2.2 Image Preprocessing

For this analysis, we used T1-weighted structural MRI scans acquired at a magnetic field strength of 3 T. The scans were taken at different imaging sites and had undergone gradwarping (gradient inhomogeneity correction), intensity correction, and were scaled for gradient drift using the phantom data. We preprocessed the scans in three different ways, using the 1mm T1 version of the ICBM152 reference brain as a template. First, if the slice thickness of a scan in any plane was different from 1mm, the scan was resampled to that resolution. We then (1) used Advanced Normalization Tools (ANTs)1 to linearly register the scans to the template (transform type parameter 'a'); (2) additionally used the ANTs implementation of the SyN algorithm [3] to apply nonlinear warping to better fit the template (transform type parameter 's'); and (3) as a baseline comparison, only padded and/or cropped the scan to fit the dimensions of the template without performing any image registration.

http://stnava.github.io/ANTs/.


Fig. 1. Axial slice of a sample raw scan (a) and the resulting scans after preprocessing in three different ways: padding/cropping to the template dimensions (b), linear registration (c) and nonlinear registration (d). In all three cases, skull-stripping followed. Note the differing dimensions of the raw scan (208 × 240 × 256) and the preprocessed scans (182 × 218 × 182). The slightly off-center positioning and rotation of the brain in the raw scan is retained in the padded/cropped scan (b), while the brain is centered in the template and thus in the two registered scans (c) and (d).

We have chosen the SyN algorithm because of its superior performance reported in [13]. In all cases, we then used the FSL Brain Extraction Tool (fsl-bet) for skull stripping [11,17]. Figure 1 shows a comparison of the preprocessing results.
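For illustration, the three preprocessing variants could be driven from Python roughly as follows (a sketch under our assumptions about file naming; the exact calls used in the study are not specified in the text):

```python
import subprocess

def preprocess_scan(scan, template, out_prefix, mode):
    """mode: 'a' = linear (affine), 's' = nonlinear (SyN), None = no registration.
    File names and output prefixes are placeholders."""
    if mode in ("a", "s"):
        subprocess.run(["antsRegistrationSyN.sh", "-d", "3",
                        "-f", template, "-m", scan,
                        "-o", out_prefix, "-t", mode], check=True)
        registered = out_prefix + "Warped.nii.gz"
    else:
        registered = scan        # no registration: only pad/crop to template dimensions
    # skull stripping with the FSL Brain Extraction Tool in all three cases
    subprocess.run(["bet", registered, out_prefix + "_brain.nii.gz"], check=True)
    return out_prefix + "_brain.nii.gz"
```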

2.3 Network Architecture and Training

The CNN architecture chosen for this paper was taken from [4] and contains four convolutional blocks, each consisting of a convolutional layer with filter size 3 and 8, 16, 32, and 64 features respectively, as well as batch normalization and max pooling with window size 2, 3, 2 and 3. The convolutional blocks are followed by a fully connected layer of 128 units and the 2-unit output layer (representing the two classes CN and AD). Before each of these two layers, dropout is applied with a value of p = 0.4. Training the network was done using the Adam optimizer and cross entropy loss, with the learning rate and weight decay set to 10−4 . The chosen batch size of 16 was limited by the available GPU memory. The training data was augmented by translating the scans along the sagittal axis by up to 2 voxels and mirroring the scans across the sagittal plane. When the balanced accuracy achieved by the model on the validation set did not improve for 8 epochs, training was stopped. The model with the best balanced accuracy on the validation set was then evaluated on the test set. The training process was repeated five times for each split and the results averaged for each of the ten splits, to increase the robustness of the results.
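A sketch of the described architecture (our own PyTorch illustration; padding choices and the use of a lazily-initialized first fully-connected layer are simplifications on our part):

```python
import torch.nn as nn

class ADClassifier(nn.Module):
    """3D CNN with four conv blocks (8/16/32/64 filters of size 3, batch
    normalization, max pooling with window sizes 2, 3, 2, 3), a 128-unit
    fully-connected layer and a 2-unit output, dropout p = 0.4 before both."""
    def __init__(self):
        super().__init__()
        channels, pools = (1, 8, 16, 32, 64), (2, 3, 2, 3)
        blocks = []
        for c_in, c_out, p in zip(channels[:-1], channels[1:], pools):
            blocks += [nn.Conv3d(c_in, c_out, kernel_size=3),
                       nn.BatchNorm3d(c_out),
                       nn.ReLU(inplace=True),
                       nn.MaxPool3d(kernel_size=p)]
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.4), nn.LazyLinear(128), nn.ReLU(inplace=True),
            nn.Dropout(p=0.4), nn.Linear(128, 2))    # two classes: CN vs. AD

    def forward(self, x):      # x: (batch, 1, 182, 218, 182) after preprocessing
        return self.classifier(self.features(x))
```

Training would then combine `torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)` with `nn.CrossEntropyLoss`, stopping when the validation balanced accuracy has not improved for 8 epochs.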

3 Results

The performance achieved by the classifier on the three differently processed datasets is shown in Fig. 2. The best results are obtained using the nonlinear


Fig. 2. Mean balanced accuracy, sensitivity, and specificity achieved by the classifier on the test set across all runs for all splits. The error bars show the standard error. Significance values were calculated using the Wilcoxon signed-rank test.

preprocessing, with a balanced test accuracy mean and standard error of 83.3 ± 0.9% (training accuracy 93.1 ± 0.5%, validation accuracy 87.2 ± 0.7%). The linear preprocessing performs only slightly worse at 82.6 ± 0.9% (training 94.4 ± 0.3%, validation 87.8 ± 0.5%), while the classifier trained on unregistered data achieves a balanced accuracy of 76.2 ± 1.1% (training 85.3 ± 0.6%, validation 79.4 ± 0.8%). The improvement in balanced test accuracy for either registration method over the unregistered data is significant (p < 0.0001, Wilcoxon signed-rank test), but the difference between linear and nonlinear registration is not (p = 0.2885). We also calculated the receiver operating characteristic (ROC) curve for each individual classifier and then averaged the curves for each preprocessing method. The average ROC curves are shown in Fig. 3. Again, the results for nonlinear and linear registration are very similar, with an area under the curve (AUC) of 0.910 ± 0.008 and 0.904 ± 0.008 respectively, while the classifier using non-registered images performs slightly worse at 0.850 ± 0.011. Additionally, there was a very strong dependence of the classifier performance on the specific dataset split the model was trained on. Figure 4 shows the mean balanced accuracy for each split, with differences between splits of as much as 17%. The observed effect still holds, with the classifier trained on unregistered images achieving the worst results in all but one split. For most splits, using nonlinear registration gives the best results, while linear registration outperforms nonlinear registration in two of the ten splits.
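The significance test can be reproduced in outline as follows (our sketch; the arrays of paired per-split balanced accuracies are placeholders, not the study's actual values):

```python
from scipy.stats import wilcoxon

# placeholder paired balanced accuracies (one value per split) -- replace with real results
acc_registered = [0.82, 0.84, 0.79, 0.85, 0.81, 0.83, 0.80, 0.86, 0.78, 0.84]
acc_unregistered = [0.75, 0.78, 0.72, 0.80, 0.74, 0.77, 0.73, 0.79, 0.71, 0.78]

statistic, p_value = wilcoxon(acc_registered, acc_unregistered)
print(f"Wilcoxon signed-rank test: p = {p_value:.4f}")
```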


Fig. 3. Average receiver operating characteristic (ROC) curves for the three preprocessing methods. The shaded area shows the standard error of the curve.

4 Discussion

In this study, we have trained a 3D CNN on MRI brain scans for an AD vs. CN classification task using a dataset balanced for sex and age. We preprocessed the dataset in three different ways: no registration but only image cropping, linear registration, and nonlinear registration. Registering the input images improves the classifier performance by 6–7%, depending on the registration method. While image registration is common practice, this result is not obvious, as the information removed by registering the images to a template could be relevant for the classification task itself. Linear registration already eliminates all but slight differences between the inputs in positioning, rotation, and size of the brain within the scan. Nonlinear registration goes further by not only matching the brain as a whole to the template but also matching individual structures, resulting in the removal of the relative sizes of specific brain areas [1]. Both registration methods remove variations and result in a more homogenized input [7,13]. More recently, deep learning based image registration methods have also been proposed; however, those methods have not yet become part of publicly available medical image processing tools [10,16]. Image registration can impact classification performance either negatively or positively, depending on how relevant the variations in the input were. In our case, registering the images benefits the classification. The information that was removed therefore seems not to have been salient, instead masking the actually important image features. The structural changes in the brain brought about


Fig. 4. Mean balanced accuracy achieved by the classifier on the different splits. The error bars show the standard error.

by AD are subtle, especially in milder cases, and on a smaller scale than the variations in brain positioning and size present in the input images. Removing these large-scale variations therefore enables the network to better focus on the smaller variations important for deciding the classification task. While there is a benefit to using registered instead of non-registered images, we did not find a significant difference between linear and nonlinear registration. This suggests that, while large-scale variations like subject positioning impede the performance of the network, it does not depend on the more precise alignment to the template achieved by a nonlinear registration method. Because the deformations introduced during nonlinear registration are usually small, this might be explained by the invariance to small shifts and distortions caused by the pooling layers of the CNN [15]. For a thorough region-wise analysis across subjects, as for example done in [4,8], a nonlinear registration can still be beneficial. We would like to point out the following limitations and ideas for future work. First, although we compared for the first time the effect of different registration methods on a machine learning classifier, future studies should investigate whether our results generalize to different software packages performing linear and nonlinear registration, and to different tasks and datasets. Moreover, it should be investigated whether the difference between no registration and linear or nonlinear registration can be alleviated in a larger dataset (>10,000 subjects), allowing the network to learn across large-scale variations. Second, although we


achieved a balanced accuracy on par with results reported in the literature [18], the performance of our classifier could likely have been improved with a thorough hyperparameter optimization. While the training, validation and test accuracies are reasonably close, further increases might be possible. However, since we have run extensive repetitions in order to get robust results, hyperparameter optimization was not possible due to computational time constraints. And third, although the conclusions about the superiority of registration over no registration remain, we have found a strong dependence of the results on the specific dataset split used in training the classifier, suggesting that some subjects are generally easier to correctly classify than others. This is supported by the fact that there are splits for which all preprocessing methods give good results, while for other splits all methods perform rather poorly, see for example splits 0 and 3 in Fig. 4. These differences are therefore likely caused by the data, i.e. the subjects themselves rather than by other effects or random chance. As the varying performance for different splits may also in part be caused by the small test set size, future research should address this problem by, for example, using oversampling instead of undersampling to balance the dataset. While this can lead to overfitting, using oversampling instead of undersampling or combining the two approaches would increase the test set size and has been shown to be capable of improving classifier performance [5,6]. Acknowledgements. We thank Tobias Scheffer for his useful suggestions. We acknowledge support from the German Research Foundation (DFG, 389563835; TRR 265, 402170461; CRC 1404, 414984028), the Brain & Behavior Research Foundation (NARSAD Young Investigator Grant, USA) and the Manfred and Ursula-M¨ uller Stiftung. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.


References
1. Abderrahim, M., Baâzaoui, A., Barhoumi, W.: Comparative study of relevant methods for MRI/X brain image registration. In: Jmaiel, M., Mokhtari, M., Abdulrazak, B., Aloulou, H., Kallel, S. (eds.) ICOST 2020. LNCS, vol. 12157, pp. 338–347. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-51517-1_30
2. Andrade, N., Faria, F.A., Cappabianco, F.A.M.: A practical review on medical image registration: from rigid to deep learning based approaches. In: 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images, pp. 463–470 (2018)
3. Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C.: Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 12(1), 26–41 (2008)
4. Böhle, M., Eitel, F., Weygandt, M., Ritter, K.: Layer-wise relevance propagation for explaining deep neural network decisions in MRI-based Alzheimer's disease classification. Front. Aging Neurosci. 11, 194 (2019)
5. Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018)
6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
7. Dadar, M., Fonov, V.S., Collins, D.L., Alzheimer's Disease Neuroimaging Initiative: A comparison of publicly available linear MRI stereotaxic registration techniques. NeuroImage 174, 191–200 (2018)
8. Eitel, F., Ritter, K., for the Alzheimer's Disease Neuroimaging Initiative (ADNI): Testing the robustness of attribution methods for convolutional neural networks in MRI-based Alzheimer's disease classification. In: Suzuki, K., et al. (eds.) ML-CDS/IMIMIC 2019. LNCS, vol. 11797, pp. 3–11. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-33850-3_1
9. Eitel, F., Schulz, M.A., Seiler, M., Walter, H., Ritter, K.: Promises and pitfalls of deep neural networks in neuroimaging-based psychiatric research. Exp. Neurol. 339, 113608 (2021)
10. Haskins, G., Kruger, U., Yan, P.: Deep learning in medical image registration: a survey. Mach. Vis. Appl. 31(1), 1–18 (2020)
11. Jenkinson, M., Beckmann, C.F., Behrens, T.E., Woolrich, M.W., Smith, S.M.: FSL. Neuroimage 62(2), 782–790 (2012)
12. Jo, T., Nho, K., Saykin, A.J.: Deep learning in Alzheimer's disease: diagnostic classification and prognostic prediction using neuroimaging data. Front. Aging Neurosci. 11, 220 (2019)
13. Klein, A., et al.: Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration. NeuroImage 46(3), 786–802 (2009)
14. Klöppel, S., et al.: Accuracy of dementia diagnosis–a direct comparison between radiologists and a computerized method. Brain 131(11), 2969–2974 (2008)
15. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
16. Nazib, A., Fookes, C., Perrin, D.: A comparative analysis of registration tools: traditional vs deep learning approach on high resolution tissue cleared data. arXiv preprint arXiv:1810.08315 (2018)
17. Smith, S.M.: Fast robust automated brain extraction. Hum. Brain Mapp. 17(3), 143–155 (2002)
18. Wen, J., et al.: Convolutional neural networks for classification of Alzheimer's disease: overview and reproducible evaluation. Med. Image Anal. 63, 101694 (2020)

Dynamic Sub-graph Learning for Patch-Based Cortical Folding Classification

Zhiwei Deng, Jiong Zhang, Yonggang Shi(B), and the Health and Aging Brain Study (HABS-HD) Study Team

USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, USA
[email protected]

Abstract. Surface mapping techniques have been commonly used for the alignment of cortical anatomy and the detection of gray matter thickness changes in Alzheimer's disease (AD) imaging research. Two major hurdles exist in further advancing the accuracy of cortical analysis. First, the high variability in the topological arrangement of gyral folding patterns makes it very likely that a sulcal area in one brain will be mapped to a gyral area of another brain. Second, the considerable differences in the thickness distribution of sulcal and gyral areas will greatly reduce the power of atrophy detection if misaligned. To overcome these challenges, it is desirable to identify brains with cortical regions sharing similar folding patterns and perform anatomically more meaningful atrophy detection. To this end, we propose a patch-based classification method for folding patterns by developing a novel graph convolutional neural network (GCN). We focus on the classification of the precuneus region in this work because it is one of the early cortical regions affected by AD and is considered to have three major folding patterns. Compared to previous GCN-based methods, the main novelty of our model is the dynamic learning of sub-graphs for each vertex of a surface patch based on distances in the feature space. Our proposed network dynamically updates the vertex feature representation without overly smoothing the local folding structures. In our experiments, we use a large-scale dataset with 980 precuneus patches and demonstrate that our method outperforms five other neural network models in classifying precuneus folding patterns.

Keywords: Cortical folding analysis · Alzheimer disease · Graph convolutional neural network

HABS-HD MPIs: Sid E O'Bryant, Kristine Yaffe, Arthur Toga, Robert Rissman, & Leigh Johnson; and the HABS-HD Investigators: Meredith Braskie, Kevin King, James R Hall, Melissa Petersen, Raymond Palmer, Robert Barber, Yonggang Shi, Fan Zhang, Rajesh Nandy, Roderick McColl, David Mason, Bradley Christian, Nicole Philips and Stephanie Large.


Fig. 1. An illustration of three major folding patterns of the precuneus cortical region. Pattern 1: three approximately parallel gyri; Pattern 2: M-shaped gyri; Pattern 3: two approximately parallel gyri. Row 1: precuneus patches colored by the principal curvature; Row 2: segmentation of the patch into gyral (red) and sulcal (yellow) areas; Row 3: gray matter thickness distribution of the gyral and sulcal areas of each patch. (Color figure online)

1 Introduction

To map cortical atrophy in Alzheimer's disease (AD), conventional approaches relied on surface mapping techniques that computed assumed one-to-one correspondences across different brains [6,19]. The high variability of cortical folding patterns across subjects, however, has been well studied and is well known in brain anatomy [3,16]. In addition, there are salient differences in gray matter thickness between the sulci and gyri of the cortical ribbon [5]. Taken together, these two factors make it difficult for mapping-based methods to establish meaningful correspondences in many association cortices critical for AD diagnosis, because they would likely mix sulcal and gyral areas with different thickness distributions. To overcome this fundamental limitation, we propose to develop a novel graph convolutional network (GCN) for patch-based cortical folding classification. Our goal is to enable the mapping of cortical patches with similar folding patterns and ultimately enhance the power in detecting localized cortical atrophy in brain degeneration.


While the variability of gyral and sulcal folding patterns over the whole cortex can be immense across the population, the number of folding patterns in each cortical region is more tractable. In this work, we therefore focus on a patch-based approach to develop our cortical folding classification method. As one of the early cortical regions affected by AD, the precuneus region exhibits three main folding patterns [16] and is thus a good test bed for our method development and evaluation. As shown in Fig. 1, there are three different folding patterns of the precuneus region with different gray matter thickness distributions between the sulcal and gyral areas. With these heterogeneous topological patterns of the gyri and sulci, surface mapping methods cannot avoid matching some of the gyral area with the sulcal portion of the precuneus across different subjects if the different cortical folding patterns are not classified and mapped separately. Deep learning has been applied to many neuroimaging tasks and has achieved human-level performance for structure classification [1,9,13,17,18]. To learn geometrical features, graph convolutional networks (GCNs) have been proposed and proven effective in shape analysis tasks [1,9], where the graph convolution operator [11] was designed to mimic the convolution operator in CNNs. Unlike conventional convolutional operators on fixed grids, graph convolution operators learn graph representations through feature propagation along connected edges, which enables these frameworks to learn from local to global features. Typically, the propagation graph is fixed once it is constructed, which limits the propagation between vertices that are close in the feature space but far apart in the geometric space. In addition, the global propagation process of a GCN could also smooth node features across sulcal and gyral areas in our problem and obscure discriminative features. To alleviate this problem, in this paper we propose a novel dynamic sub-graph propagation model for the classification of the folding pattern of precuneus cortical patches. As illustrated in Fig. 2, the proposed method has two critical components: (1) a dynamic sub-graph construction strategy, which can aggregate vertices with similar features to prevent the over-smoothing problem during the propagation process; and (2) a graph-based vertex representation learning module via propagation within the sub-graphs. During the learning process, these components allow each layer of the network to learn the optimal graph structure to generate the vertex feature representation for the final classification. In our experiments, we applied our method to a large-scale dataset of 980 precuneus patches and demonstrate that our novel model achieves superior classification accuracy compared to five other neural network methods.


Fig. 2. The architecture of the proposed dynamic sub-graph learning model. Top row: the pipeline of the whole end-to-end model, which pre-processes and learns different structural pattern information by minimizing the Cross-entropy loss. Bottom row: a zoom-in view of the sub-graph propagation module, which adaptively updates the sub-graph transition matrix of each vertex and learns the representations.

2 Methods

In this section, we develop our novel approach to classify the folding pattern of precuneus patches, inspired by the graph convolutional neural network (GCN) [11]. Instead of propagating node features globally as in a GCN, we exploit the local features of the surface by constructing sub-graphs on each vertex of the surface patch. Before we develop our classification method, we leverage an existing surface mapping method [7] to establish common vertex indices for surface patches from different subjects. This helps factor out macroscopic differences across cortices and allows us to focus on the classification of more detailed folding patterns. Consider a precuneus patch P(V), where V = {v_1, v_2, ..., v_n} denotes a set of vertices on the precuneus surface whose indices are common to all subjects. Let A ∈ R^{n×n} represent the connection matrix and X ∈ R^{n×d} the feature matrix, where each row of X represents the feature of the corresponding node. Our aim is to learn a patch representation from the input node features.

Vertex Representation Learning. For each vertex v_i in V, there is a corresponding feature vector x_i ∈ R^d, where d is the feature dimension. To better capture the geometric features of the input patch, the initial feature dimension is set to 4 for the 1st sub-graph propagation layer, as shown in Fig. 2, comprising the 3-D coordinates and the 1-D principal curvature value at each vertex. The feature dimension d can also be extended by adding other shape descriptors and functional characterizations. The node feature propagation process has been proven efficient for feature learning with graph-structured data [8,11]. In general, the propagation process can be regarded as the multiplication of the normalized connection transition matrix A and the feature matrix X^l of the l-th layer:

X^{l+1} = A^T X^l    (1)


where each element a_ij of A indicates the connection between vertex i and vertex j. This process learns the updated node representation by taking both the current node feature and the features of connected nodes into consideration. However, this method propagates vertex features globally, which may overly smooth the features across the sulcal and gyral areas and hence obscure the local folding structures. Such a loss of sulcal vs. gyral contrast may compromise the final structure learning for pattern classification. To better learn the node features, we propose to represent local structures in the sub-graph propagation layers shown in Fig. 2. In our work, instead of aggregating the features from vertices in the geometric neighborhood, we construct a sub-graph among neighboring nodes in the feature space and use this sub-graph as the feature of the current node to characterize its local folding information. More specifically, as shown in the bottom row of Fig. 2, we first apply the K-nearest-neighbor (K-NN) algorithm to find K1 (K1 = 4 in the example in Fig. 2) neighbors V_N_blue in the feature space for the blue node. This collection of neighboring vertices is shown as the yellow patch in the second row of Fig. 2. After that, we apply another K-NN search to identify K2 (K2 = 2 in the example in Fig. 2) neighbors for every node in V_N_blue to construct the sub-graph, which is the graph structure shown in the yellow patch in Fig. 2. During the propagation process on the sub-graph, the features of each node in the sub-graph are updated, and finally the updated sub-graph is read out and passed as input to the next sub-graph propagation layer. There are three sub-graph propagation layers in our model in total. Finally, the output from the sub-graph layers is sent to fully-connected layers to learn the cortical folding classification. The feature learning process in the sub-graph propagation layer can be expressed as:

X_{N_i}^{l+1} = G_i^{l+1} X_{N_i}^{l}    (2)

X_i^{l+1} = f(W · (max(X_{N_i}^{l+1}) || avg(X_{N_i}^{l+1}) || sum(X_{N_i}^{l+1})) + b)    (3)

where X_{N_i}^{l} and G_i^{l} represent the features of V_{N_i} and the sub-graph of the i-th node constructed using V_{N_i} at the l-th layer, max, avg, sum denote the pooling operators, and || represents the concatenation operator. W is a trainable projection matrix, b is the bias term, and f denotes the activation function. X_i^{l+1} is the readout passed to the next layer. In every sub-graph propagation module, the same process is applied to every vertex of the input graph in parallel, which means the sub-graphs are generated only using the input features of each layer. It is worth noting that propagation with our proposed method not only maintains the vertex features but also includes structural information, because the connection information is retained in the sub-graphs.

Dynamic Graph Construction. In graph convolutional neural networks [11], the node features are propagated along fixed graph edges, which are usually constructed from geometric neighborhoods. However, nodes with similar features are not necessarily close in the geometric space, so the information sharing between these nodes is limited in the context of GCN.


In contrast to the graph convolutional neural network, where the transition matrix is fixed, the sub-graphs in this work are dynamically constructed layer by layer. In our model, the union of the sub-graphs at the l-th layer can be denoted by G^l = {G_1^l, G_2^l, ..., G_n^l}, where G_i^l is the sub-graph on V_{N_i}, the set of neighboring vertices of the i-th node constructed by the K-NN algorithm in the feature space of each layer. In addition, this dynamic graph construction process can potentially increase the number of intra-class connections and decrease the number of inter-class connections, which has been proven to be an efficient strategy to prevent the over-smoothing problem for GCNs [2]. Furthermore, in graph-based shape recognition tasks, dynamically updating the propagation graph is used to enlarge the receptive field and group the points in semantic spaces (sulcal or gyral areas in our problem), which enables the model to learn not only the node features, but also how to construct the graphs for a better representation [20].
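The following sketch (our own PyTorch illustration, written per vertex for readability rather than efficiency; layer sizes and the row-normalization of the sub-graph transition matrix are assumptions) shows how one dynamic sub-graph propagation layer combining Eqs. (2)–(3) with the K-NN construction could be implemented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubGraphPropagation(nn.Module):
    """One dynamic sub-graph propagation layer: K-NN in feature space,
    sub-graph propagation (Eq. 2), and pooled readout with projection (Eq. 3)."""
    def __init__(self, d_in, d_out, k1=8, k2=4):
        super().__init__()
        self.k1, self.k2 = k1, k2
        self.proj = nn.Linear(3 * d_in, d_out)        # W and b of Eq. (3)

    def forward(self, x):                              # x: (n_vertices, d_in)
        dist = torch.cdist(x, x)                       # pairwise distances in feature space
        nbrs = dist.topk(self.k1 + 1, largest=False).indices[:, 1:]  # K1 neighbors per vertex
        out = []
        for i in range(x.size(0)):
            xn = x[nbrs[i]]                            # features of V_{N_i}
            # sub-graph among the K1 neighbors: K2 nearest neighbors in feature space
            idx = torch.cdist(xn, xn).topk(self.k2, largest=False).indices
            g = torch.zeros(self.k1, self.k1, device=x.device)
            g.scatter_(1, idx, 1.0)
            g = g / g.sum(dim=1, keepdim=True)         # row-normalized transition matrix G_i
            xn = g @ xn                                # propagation within the sub-graph (Eq. 2)
            readout = torch.cat([xn.max(dim=0).values, xn.mean(dim=0), xn.sum(dim=0)])
            out.append(F.elu(self.proj(readout)))      # Eq. (3), ELU as in the experiments
        return torch.stack(out)                        # updated vertex representations
```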

Table 1. Mean classification accuracy with 5-fold cross-validation

Methods      ADNI (%)   HABS-HD (%)   Combined (%)
Graph-CNN    82.1       82.2          85.0
Graph-Unet   84.6       83.3          86.3
DGCNN        85.6       85.0          86.8
DenseNet     80.4       81.6          83.5
VoxNet       71.5       67.5          70.3
Ours         86.3       86.0          88.5

3 Experimental Results

Datasets and Labeling. In our experiment, we use the T1-weighted MRI of 588 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) [14] and 1675 subjects from the Health and Aging Brain Study: Health Disparities (HABS-HD) study [15]. Cortical surfaces were first reconstructed by FreeSurfer, and the precuneus patches were then extracted based on the ROI labels from FreeSurfer. Following previous anatomical descriptions [4], we manually inspected the cortical foldings of the left hemisphere of all subjects. During this screening process, we identified potentially more general folding patterns than the three-class definition proposed in [16], which could be interesting for future anatomical research. In the current experiment, however, we follow the existing anatomy literature [4] and use the 980 subjects from these two cohorts that match the three-class definition. More specifically, there are 284 and 696 patches selected from the ADNI3 and the HABS-HD dataset, respectively. In the combined dataset, there are 417, 301 and 262 patches labeled as pattern 1, 2 and 3, respectively. Example patches of the different patterns are shown in Fig. 1.

Experiment Settings. We perform three classification tasks to evaluate our model's overall performance on the ADNI3 (n = 284), HABS-HD (n = 696),


Fig. 3. A comparison of prediction performance on four difficult cases. Under each case, the (true label, our prediction, Graph-CNN prediction) are listed. The disturbances in the folding patterns are unexpected gyral branch, short gyrus, broken gyrus, and unexpected gyral extension, respectively.

and the combination of these two datasets (n = 980). For each task, we use 5-fold cross-validation to evaluate the robustness of the models. For comparison, we selected three graph-based (Graph-CNN [18], Graph U-nets [8], and DGCNN [20]) and two voxel-based (VoxNet [12] and DenseNet [10]) neural network models. In the voxel-based models, the surface patches were voxelized into a 52 × 52 × 52 cube and the structural representations are learned by 3-D convolution operators. For a fair comparison, the decision MLP and the convolutional layers of the different models are set the same as in the proposed model. More specifically, the number of graph convolutional layers in Graph-CNN and Graph U-nets, and of the EdgeConv layers in DGCNN, is set to 3. The same numbers of 3-D convolutional layers and dense layers were applied in VoxNet and DenseNet. For all models, ELU is used as the activation function to avoid the dead-neuron problem, and dropout (with a rate of 0.3) was applied in both the fully-connected layers and the GCN layers for better generalization. All experiments are implemented and tested with PyTorch on an NVIDIA GTX 1080 GPU with 8 GB memory.

Results on Precuneus Classification. The experimental results of all methods are listed in Table 1. The classification accuracies obtained by our model in the three classification tasks are 86.3%, 86.0% and 88.5% respectively, which are the highest among all methods. From our experiments, we also make three important observations. First, all graph-based models generally perform better than the voxel-based models. This shows that describing the cortical surface patches as graph-structured data is more efficient and robust than using voxel-structured data. Second, it is worth noting that DGCNN and the proposed method are the only methods here that dynamically update the graph structure during the learning process, and both of them surpass the other graph-based methods. This validates that dynamically constructed graphs enable the model to learn how to propagate information more effectively. Third, the proposed model has demonstrated more robust performance on atypical patterns in our experiments. As shown in Fig. 3, our model is able to better handle cases with atypical patterns deviating from the three-class definition.

Fig. 4. Second-order energy distribution heat map of vertex representation. Brighter color means higher energy. Most representative energy is distributed in gyral areas. Generally shared gyral areas enclosed by the red curves are suppressed and have little representative energy. (Color figure online)

Fig. 5. Effects of neighborhood size on classification accuracy on the combined dataset of ADNI and HABS-HD.

Analysis of Structure Learning. We also investigated the information that the model learned to verify that our model classifies the precuneus patches based on correct anatomical details. After the three sub-graph propagation layers in Fig. 2, the vertex representation’s energy distribution map can be computed according to [21] and plotted for different folding patterns in Fig. 4. As can be seen, most energy is focused on the gyral area of the patch, which is exactly the anatomical classification criterion used in [4]. Furthermore, our model can suppress the irrelevant gyral information in the upper area of each case (enclosed in the red curves) to enhance the discriminative power in classification.

Effects of Neighborhood Sizes. To examine the effects of the neighborhood construction process, we trained our model with different neighborhood sizes. The model’s classification performance is evaluated with K1 equal to 4, 8, 12, 16 and 20 respectively, and K2 set to K1/2 for sub-graph construction. The classification accuracies are plotted in Fig. 5. The model performs best with K1 = 8, and the performance drops if K1 is too small due to insufficient neighborhood representation. On the other hand, the performance of the model will also decline

if K1 is too large, which might be due to increased risk of the over-smoothing problem for graph propagation as mentioned in [2]. Thus, our model is trained with the best parameter setting (K1 = 8 and K2 = 4).
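The paper does not spell out the neighbourhood construction in code; the following is a minimal PyTorch sketch, under our own naming and assumptions, of one way to build the K1-nearest-neighbour graph and the smaller K2 = K1/2 sub-graph from the vertex coordinates of a patch.

```python
# Minimal sketch (our own naming, not the authors' released code) of K1-NN graph
# and K2 = K1/2 sub-graph construction from patch vertex coordinates.
import torch


def knn_indices(coords: torch.Tensor, k: int) -> torch.Tensor:
    """coords: (N, 3) vertex coordinates; returns (N, k) neighbour indices."""
    dists = torch.cdist(coords, coords)          # (N, N) pairwise Euclidean distances
    dists.fill_diagonal_(float("inf"))           # a vertex is not its own neighbour
    return dists.topk(k, largest=False).indices  # k smallest distances per vertex


def build_graphs(coords: torch.Tensor, k1: int = 8):
    k2 = max(k1 // 2, 1)
    full_graph = knn_indices(coords, k1)         # K1-neighbourhood graph
    sub_graph = full_graph[:, :k2]               # K2 closest neighbours as the sub-graph
    return full_graph, sub_graph


if __name__ == "__main__":
    verts = torch.rand(1000, 3)                  # toy patch with 1000 vertices
    g1, g2 = build_graphs(verts, k1=8)
    print(g1.shape, g2.shape)                    # torch.Size([1000, 8]) torch.Size([1000, 4])
```

In practice the dynamic variant would rebuild these neighbourhoods in feature space at every propagation layer rather than once from the input coordinates.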

4 Conclusions

In this paper, we proposed a node sub-graph network for the classification of cortical patches, which can dynamically change the graph connections of vertices and effectively learn the local folding structure information. In the proposed network, the vertex features are propagated based on automatically updated sub-graphs in each layer. In comparison to five existing neural networks, we demonstrated the superior performance of the proposed network on a large-scale dataset of precuneus patches. For future work, we will extend the current framework to the classification of other cortical regions, and enable the ROI-based clustering of subjects in population studies for enhanced power of brain atrophy detection in AD.

Acknowledgements. This work was supported by the National Institute of Health (NIH) under grants RF1AG064584, RF1AG056573, R01EB022744, R21AG064776, P41EB015922, P30AG066530. Research reported in this publication was also supported by the National Institute on Aging of the NIH under Award Numbers R01AG054073 and R01AG058533. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

References
1. Besson, P., Parrish, T., Katsaggelos, A.K., Bandt, S.K.: Geometric deep learning on brain shape predicts sex and age. bioRxiv (2020)
2. Chen, D., Lin, Y., Li, W., Li, P., Zhou, J., Sun, X.: Measuring and relieving the over-smoothing problem for graph neural networks from the topological view (2019)
3. Ding, S.L., Van Hoesen, G.W.: Borders, extent, and topography of human perirhinal cortex as revealed using multiple modern neuroanatomical and pathological markers. Hum. Brain Mapp. 31(9), 1359–1379 (2010)
4. Duan, D., et al.: Exploring folding patterns of infant cerebral cortex based on multi-view curvature features: methods and applications. NeuroImage 185, 575–592 (2019)
5. Fischl, B., Dale, A.M.: Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc. Natl. Acad. Sci. 97(20), 11050–11055 (2000)
6. Fischl, B., Sereno, M.I., Dale, A.M.: Cortical surface-based analysis: II. Inflation, flattening, and a surface-based coordinate system. NeuroImage 9(2), 195–207 (1999)
7. Gahm, J.K., Tang, Y., Shi, Y.: Patch-based mapping of transentorhinal cortex with a distributed atlas. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11072, pp. 689–697. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00931-1_79
8. Gao, H., Ji, S.: Graph U-Nets. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2083–2092. PMLR, 09–15 June 2019

9. Gopinath, K., Desrosiers, C., Lombaert, H.: Graph convolutions on spectral embeddings for cortical surface parcellation. Med. Image Anal. 54, 297–305 (2019)
10. Gottapu, R.D., Dagli, C.H.: DenseNet for anatomical brain segmentation. Procedia Comput. Sci. 140, 179–185 (2018). Cyber Physical Systems and Deep Learning, Chicago, Illinois, 5–7 November 2018
11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2016)
12. Maturana, D., Scherer, S.: VoxNet: a 3D convolutional neural network for real-time object recognition. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928 (2015)
13. Mehta, R., Sivaswamy, J.: M-Net: a convolutional neural network for deep brain structure segmentation. In: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 437–440 (2017)
14. Mueller, S., et al.: The Alzheimer’s disease neuroimaging initiative. Clin. North Am. 15(869–877), xi–xii (2005)
15. O’Bryant, S.E., et al., for the HABLE Study Team: The Health & Aging Brain among Latino Elders (HABLE) study methods and participant characteristics. Alzheimer’s Dement. Diagn. Assess. Dis. Monit. 13(1), e12202 (2021)
16. Pereira-Pedro, A.S., Bruner, E.: Sulcal pattern, extension, and morphology of the precuneus in adult humans. Ann. Anat. - Anatomischer Anz. 208, 85–93 (2016)
17. Qiu, S., et al.: Development and validation of an interpretable deep learning framework for Alzheimer’s disease classification. Brain 143(6), 1920–1933 (2020)
18. Song, T.A., et al.: Graph convolutional neural networks for Alzheimer’s disease classification. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 414–417 (2019)
19. Thompson, P.M., et al.: Cortical change in Alzheimer’s disease detected with a disease-specific population-based brain atlas. Cereb. Cortex 11(1), 1–16 (2001)
20. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds (2018)
21. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53

Detection of Abnormal Folding Patterns with Unsupervised Deep Generative Models

Louise Guillon1(B), Bastien Cagna1, Benoit Dufumier1,2, Joël Chavas1, Denis Rivière1, and Jean-François Mangin1

1 Université Paris-Saclay, CEA, CNRS, NeuroSpin, Baobab, Gif-sur-Yvette, France — [email protected]
2 LTCI, Télécom Paris, IP Paris, Palaiseau, France

Abstract. Although the main structures of cortical folding are present in each human brain, the folding pattern is unique to each individual. Because of this large normal variability, the identification of abnormal patterns associated with developmental disorders is a complex open challenge. In this paper, we tackle this problem as an anomaly detection task and explore the potential of deep generative models using benchmarks made up of synthetic anomalies. To focus learning on the folding geometry, brain MRIs are first preprocessed to deal only with a skeleton-based negative cast of the cortex. A variational auto-encoder is trained to get a representation of the regional variability of the folding pattern of the general population. Then several synthetic benchmark datasets of abnormalities are designed. The latent space expressivity is assessed through classification experiments between the latent codes of controls and abnormal samples. Finally, the properties encoded in the latent space are analyzed through perturbation of specific latent dimensions and observation of the resulting modification of the reconstructed images. The results show that the latent representation is rich enough to distinguish subtle differences like asymmetries between the right and left hemispheres.

Keywords: Variational autoencoder · Brain architecture · Cortical folding · Anomaly benchmark · Anomaly detection

1 Introduction

The cortex folds in utero to form numerous furrows called sulci, which delimit circumvolutions. Cortical folding is related to cortical architecture (architectony and connectivity) [7,8] and can be impacted by developmental issues that lead to brain disorders [9,25]. The identification of folding patterns acting as markers of developmental brain diseases would be a major breakthrough facilitating early diagnosis. However, although cortical morphology embeds a topography of the sulci sufficiently consistent across subjects to enable the design of automatic

recognition tools [5], the shapes of the sulci present a high diversity, which hinders the modelling of the inter-individual variability necessary to define abnormalities [18]. Hence, the diversity of the folding pattern is often put aside and canceled out using spatial normalisation, namely warping all brains toward a template space.

Associations between folding patterns and developmental disorders have already been described. For instance, a very rare pattern called the Power Button Sign (PBS) has been linked to the epileptogenic zone of patients suffering from drug-resistant type 2 focal cortical dysplasia [17]. Similarly, it was demonstrated that the paracingulate sulcus morphology is correlated to hallucinations in patients suffering from schizophrenia [24]. Sulci shape deviations have also been observed in autism spectrum disorder (ASD) [2,11]. Once abnormal folding patterns linked to a pathology have been identified, automatic detection techniques can be developed using supervised learning. For instance, the PBS can be detected with a supervised classifier [4]. However, the upstream process of identifying and defining new patterns of interest is tedious and difficult, as each individual has a unique cortical folding geometry and spotting a recurrent abnormality throughout a set of patients is very complex. An unsupervised tool designed to uncover cortical folding abnormalities and potential biomarkers would be an important lever to harvest the potential meaning of unusual folding patterns. In this paper we propose a dedicated framework based on deep learning and we test its potential through the detection of synthetic unusual folding patterns.

The automatic detection of abnormal folding patterns is a challenging task that has not yet been addressed in the field of neuroimaging. In this work, we relate this objective to the general field of anomaly detection. Anomaly and novelty detection aims at identifying samples that do not fit the normal data distribution [19]. A few years ago, anomaly detection methods evolved towards deep learning approaches, and specifically unsupervised deep learning, due to the ability to detect potentially unseen events. Auto-encoder (AE) based methods have been particularly studied as they infer a latent space of interest with much fewer dimensions than the input space, forcing the model to learn only the most common features of the training data. There exists a broad range of AE-based models. An extensive review on the detection of epilepsy lesions in MRI patches can be found in [1]. Similarly, different methods dedicated to medical images have been compared in [3], qualifying the variational AE (VAE) architecture as the most efficient. Generative adversarial networks (GAN) have also been used in order to identify biomarkers in optical coherence tomography scans, reaching good performances [21,22]. Based on these initial results, a first framework was proposed for anomaly detection in computed tomography scans of 3D brains where anomalies consisted of labeled traumatic brain injuries [23]. β-VAEs have also been successfully used to model the inter-individual variability in the mouse brain [14]. More recently, very promising self-supervised methods have been applied to anomaly detection problems in medical images [6], but these methods lack the generative aspect provided by GAN and VAE, which

is crucial in terms of explainability. All these works assessed images containing known lesions. Our aim, however, is to discover still unknown patterns linked to diseases, which leads to challenging evaluation issues. Therefore, this paper is focused on dedicated synthetic benchmark datasets.

In this paper, we propose the first technique aiming at bringing to light unusual and potentially abnormal folding patterns. For this purpose, we first propose a dedicated preprocessing that focuses the learning on the cortical folding geometry of a specific region of interest (ROI). Then, like in [14], a β-VAE is trained on a set of control data sampling the general population to get a latent representation of the folding pattern distribution in this ROI. We also create several benchmark datasets simulating unusual regional folding patterns to assess the ability of our model to detect them. Finally, we analyse the capacity of the latent space to separate regional patterns from the two hemispheres.

2 Methods

2.1 Focusing on Folding Information

Brain MRIs contain diverse information that is not all relevant to the study of folding patterns. Our method therefore includes a crucial first step of data pre-processing based on the BrainVisa/Morphologist pipeline (http://brainvisa.info) [16]. This pipeline combines several steps such as bias correction, grey-white segmentation, and skeletonization to obtain a negative cast of the folding. Morphologist’s skeletons were used as input of our learning model. They were first defined in [15] and are obtained by skeletonization of the union of the grey matter and cerebrospinal fluid while preserving the topology. The result is 3D volumes with three values: inside of the brain, sulci skeletons and outside of the brain. The use of these simple images rather than raw MRIs puts the focus of learning on the folding geometry and discards a major confound related to the width of the sulci, which increases with local atrophies induced by aging or degenerative pathologies.

2.2 Generating Synthetic Brain Anomalies

One of our biggest challenges is the lack of consensual datasets of abnormalities to assess the approach. The examples mentioned in the introduction are either especially challenging in terms of shape (the PBS in epilepsy) or inter-subject local variability (Broca’s area or Superior Temporal Sulcus (STS) branches in autism), or correspond to a stratification of the population into several frequent patterns, which is not in the scope of anomaly detection (paracingulate sulcus). Therefore, in this paper, we focus on a 3D ROI of 23 × 37 × 36 voxels with 2 mm isotropic resolution, localized in only one of these challenging areas, the STS branches. This ROI has been defined in each subject using affine normalisation to the classical MNI reference space. The localization of the ROI in the MNI space has been learned from the open access training dataset with annotated sulci of the Morphologist pipeline [5]. We have designed several dedicated synthetic

anomaly benchmark datasets in order to be able to evaluate the performances of our model (Fig. 1). A sketch of how such samples can be generated follows the descriptions below.

Deletion: Our first benchmark dataset consists of skeletons in which we have randomly deleted one piece of sulcus, which is chosen among topologically elementary parts called simple surfaces and proposed by the Morphologist pipeline [15]. To be deleted, a simple surface must be completely within the ROI and made up of more than 1000 voxels, which corresponds to about 17% of the average number of skeleton voxels in the ROI. This arbitrary threshold aims at modifying the geometry beyond the normal anatomical variability observed in the population.

Random: A second benchmark dataset is composed of random ROIs of the same dimensions and overlapping the skeleton but localized in different positions in the cortex. This benchmark is expected to be very easily spotted as abnormal since the images are highly different. It is used to ensure that the model is able to identify inputs that are far away from the normal distribution and that what the model has learned is not only non-region-specific features such as voxel proportion and sulci continuity.

Asymmetry: The last benchmark dataset corresponds to the same ROI but defined in the other hemisphere and flipped. The flip is defined from the inter-hemispheric plane of the MNI space after affine spatial normalisation. This benchmark has a biological interest as hemispheric asymmetry is still an intense field of research.
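The benchmark-generation code is not reproduced in the paper; the sketch below only illustrates, with assumed array names (`skeleton_roi`, `simple_surface_labels`, `contralateral_roi`), an assumed axis for the inter-hemispheric flip, and the 1000-voxel threshold quoted above, how the Deletion and Asymmetry samples could be produced.

```python
# Illustrative sketch of the Deletion and Asymmetry benchmarks; names, the flip
# axis and the voxel encoding (0 = inside brain) are our assumptions.
import numpy as np


def make_deletion_sample(skeleton_roi, simple_surface_labels, rng, min_voxels=1000):
    """Erase one randomly chosen simple surface fully contained in the ROI."""
    labels, counts = np.unique(simple_surface_labels, return_counts=True)
    candidates = [l for l, c in zip(labels, counts) if l != 0 and c > min_voxels]
    if not candidates:
        return skeleton_roi.copy()                   # nothing large enough to delete
    out = skeleton_roi.copy()
    # reset the deleted piece of sulcus to the "inside brain" value (assumed 0 here)
    out[simple_surface_labels == rng.choice(candidates)] = 0
    return out


def make_asymmetry_sample(contralateral_roi, axis=0):
    """Flip the ROI of the other hemisphere about the inter-hemispheric plane."""
    return np.flip(contralateral_roi, axis=axis).copy()


rng = np.random.default_rng(0)
```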

Fig. 1. Generated anomaly benchmarks. Benchmark Deletion: Original crop and its modified version. Benchmark Random: two examples of random crops. Benchmark Asymmetry: crop of right hemisphere and crop of left hemisphere flipped.

2.3 Learning a Representation of the Normal Variability

An effective way to model population variability is through AE-based networks. These architectures learn to project input data onto a lower-dimensional manifold, also called the latent space, and to reconstruct the input image from this space. Simple AEs are known to have some drawbacks, particularly the lack of regularization of the latent space. To overcome this issue, the VAE model was introduced [13], and later the β-VAE was proposed [10]. Like classical AEs, β-VAEs are composed of two parts, an encoder and a decoder, but add a variational

objective. Contrary to a simple AE, an input from the image space X is encoded as a distribution in a latent space Z comprising L dimensions, leading to a twofold objective: first, the minimization of the reconstruction error of the input image; second, the matching of the encoded distribution to a prior distribution, usually a Gaussian, which is done through the Kullback-Leibler divergence and regularizes the latent space. The VAE is a β-VAE with the KL divergence weighted at 1. Thus, the β-VAE encoder qφ and decoder pθ are trained by maximising the following objective:

L(θ, φ; x, z, β) = E_{qφ(z|x)}[log pθ(x|z)] − β · D_KL(qφ(z|x) ‖ p(z))    (1)

where p(z) is the prior distribution, a reduced centred Gaussian in our work, approximated by the posterior distribution qφ(z|x). Tuning the β parameter improves the disentanglement of the latent factors [10].

Analysing the Latent Space. The analysis of the latent space to understand the meaning of the encoded features is essential to assess the potential of our model to highlight unusual folding patterns. As such, we first trained a β-VAE on normal data only, for our model to learn to encode normal variability. Next, normal and benchmark data unseen during training are projected to the latent space to perform this analysis. The resulting latent codes are used to train a gradient boosting algorithm to classify normal versus synthetic abnormal samples. These three classifiers are used first to ensure that the latent representations are able to capture some relevant information regarding the folding patterns. Then, we can focus our analysis on the features contributing the most to the success of the classification using the generative power of the β-VAE, like in [14]. We travel throughout the latent space modifying only one of these features and observe the generated folding patterns.
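As a concrete reference for Eq. (1), the following is a minimal PyTorch sketch of the β-VAE objective as it is typically implemented; the voxel-wise MSE reconstruction term and the variable names are our assumptions, not the authors' implementation (the β = 2 default matches the setting reported later in the paper).

```python
# Minimal PyTorch sketch of the beta-VAE objective of Eq. (1); the MSE
# reconstruction term is an assumed choice of log-likelihood.
import torch
import torch.nn.functional as F


def beta_vae_loss(recon_x, x, mu, logvar, beta=2.0):
    """recon_x: decoder output; x: input crop; mu, logvar: parameters of q(z|x)."""
    recon = F.mse_loss(recon_x, x, reduction="sum")              # reconstruction term
    # closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl                                     # negative ELBO to minimise
```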

3 Results

3.1 Datasets and Implementation

To learn the inter-individual variability of control subjects, the HCP database (https://www.humanconnectome.org/) was used. MRIs were obtained with a single Siemens Skyra Connectom scanner with a resolution of 0.7 mm × 0.7 mm × 0.7 mm. In our work, we studied only the right hemisphere of 997 right-handed subjects with high-quality results of the Morphologist pipeline. 547 subjects were used for the training and 150 for the validation of the β-VAE. The remaining 300 subjects were used to train classifiers, half of them being used to create synthetic abnormal patterns following each of the benchmark experiments. Two thirds of these 300 subjects were used to estimate the β-VAE hyper-parameters using a grid-search driven by a classifier, and the last third was used to explore the latent space organization. The skeletons were spatially normalized with an affine transformation to the standard MNI space and were down-sampled to a voxel size of 2 mm, which

is sufficient to preserve the folding geometry. Voxels were set to 0 for inside the brain (26% of voxels), 1 for outside the brain (65%) and 2 for the sulci skeleton (9%). As mentioned above, the input to the β-VAE was a 3D ROI, whose bounding box in MNI space was learned from the BrainVISA open access training set of 64 annotated brains, in order to include the two posterior branches of the STS. This 3D ROI is made up of 23 × 37 × 36 voxels extended to 40 × 40 × 40 using 1-padding. The complete pipeline is shown in Fig. 2. Our β-VAE was composed of fully convolutional encoder and decoder with symmetrical architectures comprising three convolutional blocks and 2 fully connected layers. We did a grid-search (L = 8–100, β = 1–20; ranges are based on previous works [14] and reconstruction abilities), where hyperparameters were chosen according to the classification performances of the Deletion classifier applied to 100 controls and 100 synthetic samples using a 5-fold stratified cross-validation. We selected L = 100, β = 2 and a learning rate of 2e–4. Training lasted for 400 epochs on an Nvidia Quadro RTX 8000 GPU and was completed in roughly 1 h.
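A hedged sketch of the crop-and-pad step described above is given here; the bounding-box coordinates and array names are placeholders, and the preceding Morphologist processing and affine normalisation (Fig. 2) are omitted.

```python
# Sketch of the input preparation: a 23 x 37 x 36 crop at an assumed bounding box
# is padded with the "outside brain" value 1 to a 40^3 cube.
import numpy as np


def crop_and_pad(skeleton_2mm, bbox_min, bbox_size=(23, 37, 36), target=40):
    """skeleton_2mm: volume with values {0: inside, 1: outside, 2: skeleton}."""
    x, y, z = bbox_min
    dx, dy, dz = bbox_size
    crop = skeleton_2mm[x:x + dx, y:y + dy, z:z + dz]
    pad = [((target - s) // 2, target - s - (target - s) // 2) for s in crop.shape]
    return np.pad(crop, pad, mode="constant", constant_values=1)   # 1-padding to 40^3
```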

Fig. 2. Whole pipeline. First, bounding boxes of sulci of interest are defined. The HCP database is processed with the Morphologist pipeline and cropped thanks to the bounding boxes. Crops are then downsampled before feeding the β − V AE.

3.2 Analysing Learned Folding Variability

First, we visually evaluated the reconstruction ability of our β-VAE. 2D slices of several reconstructed inputs are presented in Fig. 3. For the control set, deletion and asymmetry benchmarks, the reconstructions are approximate but retain the general geometry. The main sulci included in the ROI can be identified. In return, the random crops cannot be reconstructed by the decoder, which outputs an image looking like a disturbed configuration of STS branches. This suggests that the model has really learned the distribution of the specific geometry of this cortical region rather than a generic distribution covering any skeleton configuration. To evaluate our model latent space, three gradient boosting classifiers were trained using 50 controls and 50 synthetic samples from the test set, using a 5-fold stratified cross-validation. ROC curves are presented in Fig. 4.A). For

Fig. 3. Reconstructions of test inputs. 2D sections from the 3D ROI presented in sagittal view at depth 18. First row: original images, second row: model outputs.

each configuration, the AUC score is above chance. The highest performances are obtained on benchmark Random (AUC = 0.98), which was expected. This first result confirms that our model is able to capture very obvious abnormalities. On benchmark Asymmetry, very good scores are also obtained (AUC = 0.85). Though for an inexperienced eye the difference between the right hemisphere and the flipped left hemisphere can seem subtle, this region is known to be asymmetrical in terms of length and tilt of the main sulci [20]. It confirms that our model is capable of representing specific anatomical structures included in the ROI. Finally, when deleting one large simple surface, results are slightly above chance (AUC = 0.69), which indicates a potential of the latent space to detect such anomalies. Nevertheless, further work is required using a larger test set to overcome potential limitations of this classifier experiment. Using gradient boosting classifiers gives insights on the most decisive dimensions of the latent space, which depend on the benchmark. Figure 4.B) shows a visualization of the datasets using the two most important latent features or a t-SNE 2D manifold projection. For the Random and Asymmetry benchmarks, two groups can be clearly identified. Surprisingly, in the t-SNE visualization, the random crops are surrounded by controls, which is counter-intuitive relative to the Gaussian prior and does not fit with the plot obtained from the two most discriminative dimensions. Further experiments with unbalanced datasets including only a small ratio of abnormality will help us clarify this point. The Deletion benchmark is clearly the most difficult classification experiment with the current latent space. The last experiment consists of sampling latent vectors to travel along the most important dimensions according to the classifiers. Figure 4.C) shows the reconstruction provided by the decoder when following the dimension corresponding to the best discriminator of the Asymmetry benchmark. All other dimensions were fixed at their mean value. As expected, the outputs look like approximated skeletons of the ROI. Subtle changes that are consistent during the travel can be observed: i.e. the upper part of the sulcus presented in pink in Fig. 4.C), called the Sylvian fissure (SF), seems to shorten from the left to the right of the dimension. This observation is consistent with the distribution of the Asymmetry benchmark’s subjects according to the two most important features in Fig. 4.B). Indeed,

Fig. 4. Analyses of latent space. A) ROC curves of GradientBoosting classifiers for classifications between controls and benchmarks. B) Distribution of data points using classifiers most important features and t-SNE. C) Travelling through the latent space.

control subjects, i.e. right hemispheres correspond to higher values of the 88th dimension which is also suggested by the SF shortening on Fig. 4.C). This evolution is interesting as previous works demonstrated that the SF was shorter in the right hemisphere. Thus, this dimension could encode the length of the SF which is an important marker of laterality [12].
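The classifier experiments summarised above can be reproduced in outline with standard scikit-learn components; the sketch below assumes 100-dimensional latent codes for 50 controls and 50 synthetic samples per benchmark and is not the authors' exact evaluation script.

```python
# Outline of the latent-code classification experiment (assumed setup, not the
# released evaluation code): 5-fold stratified CV, ROC-AUC, feature importances.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score


def evaluate_latent_codes(z_controls, z_benchmark, n_splits=5, seed=0):
    """z_controls, z_benchmark: (n, L) latent codes of controls / synthetic samples."""
    X = np.vstack([z_controls, z_benchmark])
    y = np.concatenate([np.zeros(len(z_controls)), np.ones(len(z_benchmark))])
    aucs, importances = [], []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        clf = GradientBoostingClassifier().fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
        importances.append(clf.feature_importances_)   # most discriminative latent dimensions
    return float(np.mean(aucs)), np.mean(importances, axis=0)
```

The averaged feature importances returned here would play the role of the "most decisive dimensions" used above to choose which latent dimension to traverse.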

4 Discussion and Conclusion

In this paper we developed a framework that shall lead to the discovery of abnormal folding patterns that are beyond the reach of human cognition because of the high inter-subject normal variability. Our main contribution is the design of synthetic benchmarks used to decipher the organization of the latent space used to model this variability. We have shown that the regional specificity of the folding pattern can be learned and used to detect some deviations from the norm. Our method was able to detect obvious and more subtle deviations (with the Random and Asymmetry benchmarks, respectively), but detection is harder for more complex ones such as the Deletion benchmark. However, we do not seek to detect benchmark subjects but rather to detect abnormal patterns. In the future we plan to perform further experiments to get more hints about the nature of the representation. Although some interesting features can be observed when varying the values of latent vector dimensions, it is very difficult to apprehend all possible shifts by looking at 2D sections because we deal with 3D images. Additionally, in some cases, we observe that the generated images are not realistic, preventing the interpretation. Generating synthetic anomaly benchmarks is a very useful initial step but

induces biases that must be acknowledged. In our case, with the Deletion benchmark we made the hypothesis that anomalies could be linked to “missing” simple surfaces. This work constitutes a first proof of concept and enables control over the complexity of abnormalities; however, an important limitation is the high dependency of the results on simulated data, as our benchmarks are made of synthetic anomalies. Nevertheless, we stress that the model has learned only control data, and is thus totally unsupervised with regard to anomaly data, and only the evaluation results depend on simulated data. In future works, the ultimate benchmarks will have to be built from large datasets of neurodevelopmental disorders and will aim at discovering actual abnormal folding patterns.

Acknowledgments. This project has received funding from the FRM DIC20161236445, the ANR-19-CE45-0022-01 IFOPASUBA, the ANR-14-CE30-0014-02 APEX and the ANR-20-CHIA-0027-01 FOLDDICO. Data were provided in part by the Human Connectome Project funded by the NIH. This work was performed using HPC resources from GENCI-IDRIS (Grant 2020-AD011011929).

References
1. Alaverdyan, Z.: Unsupervised representation learning for anomaly detection on neuroimaging. Application to epilepsy lesion detection on brain MRI. Ph.D. thesis, Université de Lyon (2019)
2. Auzias, G., et al.: Atypical sulcal anatomy in young children with autism spectrum disorder. NeuroImage: Clin. 4, 593–603 (2014). https://doi.org/10.1016/j.nicl.2014.03.008
3. Baur, C., Denner, S., Wiestler, B., Albarqouni, S., Navab, N.: Autoencoders for unsupervised anomaly segmentation in brain MR images: a comparative study. arXiv:2004.03271 [cs, eess] (2020)
4. Borne, L., et al.: Automatic recognition of specific local cortical folding patterns. NeuroImage 238 (2021). https://doi.org/10.1016/j.neuroimage.2021.118208
5. Borne, L., Rivière, D., Mancip, M., Mangin, J.F.: Automatic labeling of cortical sulci using patch- or CNN-based segmentation techniques combined with bottom-up geometric constraints. Med. Image Anal. 62 (2020). https://doi.org/10.1016/j.media.2020.101651
6. Bozorgtabar, B., Mahapatra, D., Vray, G., Thiran, J.P.: SALAD: self-supervised aggregation learning for anomaly detection on X-rays. In: Martel, A.L., et al. (eds.) Medical Image Computing and Computer Assisted Intervention - MICCAI 2020, pp. 468–478. Lecture Notes in Computer Science, Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-59710-8_46
7. Fernandez, V., Llinares-Benadero, C., Borrell, V.: Cerebral cortex expansion and folding: what have we learned? EMBO J. 35(10), 1021–1044 (2016). https://doi.org/10.15252/embj.201593701
8. Fischl, B., et al.: Cortical folding patterns and predicting cytoarchitecture. Cereb. Cortex 18(8), 1973–1980 (2008). https://doi.org/10.1093/cercor/bhm225
9. Guerrini, R., Dobyns, W.B., Barkovich, A.J.: Abnormal development of the human cerebral cortex: genetics, functional consequences and treatment options. Trends Neurosci. 31(3), 154–162 (2008). https://doi.org/10.1016/j.tins.2007.12.004
10. Higgins, I., et al.: beta-VAE: learning basic visual concepts with a constrained variational framework (2016). https://openreview.net/forum?id=Sy2fzU9gl

11. Hotier, S., et al.: Social cognition in autism is associated with the neurodevelopment of the posterior superior temporal sulcus. Acta Psychiatr. Scand. 136(5), 517–525 (2017). https://doi.org/10.1111/acps.12814
12. Idowu, O.E., Soyemi, S., Atobatele, K.: Morphometry, asymmetry and variations of the Sylvian fissure and sulci bordering and within the pars triangularis and pars operculum: an autopsy study. J. Clin. Diagn. Res. JCDR 8(11), AC11–AC14 (2014). https://doi.org/10.7860/JCDR/2014/9955.5130
13. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv:1312.6114 [cs, stat] (May 2014)
14. Liu, R., et al.: A generative modeling approach for interpreting population-level variability in brain structure. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12265, pp. 257–266. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59722-1_25
15. Mangin, J.F., Frouin, V., Bloch, I., Régis, J., Lopez-Krahe, J.: From 3D magnetic resonance images to structural representations of the cortex topography using topology preserving deformations. J. Math. Imaging Vis. 5(4), 297–318 (1995)
16. Mangin, J.F., et al.: Object-based morphometry of the cerebral cortex. IEEE Trans. Med. Imaging 23, 968–982 (2004). https://doi.org/10.1109/TMI.2004.831204
17. Mellerio, C., et al.: The power button sign: a newly described central sulcal pattern on surface rendering MR images of type 2 focal cortical dysplasia. Radiology 274(2), 500–507 (2014). https://doi.org/10.1148/radiol.14140773
18. Ono, M., Kubik, S., Abernathey, C.D.: Atlas of the Cerebral Sulci. G. Thieme Verlag; Thieme Medical Publishers, Stuttgart; New York (1990)
19. Pang, G., Shen, C., Cao, L., Hengel, A.V.D.: Deep learning for anomaly detection: a review. arXiv:2007.02500 [cs, stat] (July 2020)
20. Rubens, A.B., Mahowald, M.W., Hutton, J.T.: Asymmetry of the lateral (Sylvian) fissures in man. Neurology 26(7), 620–620 (1976)
21. Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 54, 30–44 (2019). https://doi.org/10.1016/j.media.2019.01.010
22. Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.: Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. arXiv:1703.05921 [cs] (March 2017)
23. Simarro Viana, J., de la Rosa, E., Vande Vyvere, T., Robben, D., Sima, D.M., CENTER-TBI Participants and Investigators: Unsupervised 3D brain anomaly detection. In: Crimi, A., Bakas, S. (eds.) BrainLes 2020. LNCS, vol. 12658, pp. 133–142. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72084-1_13
24. The Australian Schizophrenia Research Bank, Garrison, J.R., Fernyhough, C., McCarthy-Jones, S., Haggard, M., Simons, J.S.: Paracingulate sulcus morphology is associated with hallucinations in the human brain. Nat. Commun. 6(1), 8956 (2015). https://doi.org/10.1038/ncomms9956
25. Walsh, C.A.: Genetic malformations of the human cerebral cortex. Neuron 23(1), 19–29 (1999). https://doi.org/10.1016/S0896-6273(00)80749-7

PialNN: A Fast Deep Learning Framework for Cortical Pial Surface Reconstruction

Qiang Ma1(B), Emma C. Robinson2, Bernhard Kainz1, Daniel Rueckert1, and Amir Alansary1

1 BioMedIA, Department of Computing, Imperial College London, London, UK — [email protected]
2 School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK

Abstract. Traditional cortical surface reconstruction is time consuming and limited by the resolution of brain Magnetic Resonance Imaging (MRI). In this work, we introduce Pial Neural Network (PialNN), a 3D deep learning framework for pial surface reconstruction. PialNN is trained end-to-end to deform an initial white matter surface to a target pial surface by a sequence of learned deformation blocks. A local convolutional operation is incorporated in each block to capture the multi-scale MRI information of each vertex and its neighborhood. This is fast and memory-efficient, which allows reconstructing a pial surface mesh with 150k vertices in one second. The performance is evaluated on the Human Connectome Project (HCP) dataset including T1-weighted MRI scans of 300 subjects. The experimental results demonstrate that PialNN reduces the geometric error of the predicted pial surface by 30% compared to state-of-the-art deep learning approaches. The codes are publicly available at https://github.com/m-qiang/PialNN.

1 Introduction

As an essential part in neuroimage processing, cortical surface reconstruction aims to extract 3D meshes of the inner and outer surfaces of the cerebral cortex from brain MRI, also known as the white matter and pial surfaces. The extracted surface can be further analyzed for the prediction and diagnosis of brain diseases as well as for the visualisation of information on the cortex. However, it is difficult to extract a geometrically accurate and topologically correct cortical surface due to its highly curved and folded geometric shape [3,6]. The typical cortical surface reconstruction pipeline, which can be found in existing neuroimage analysis tools [1,3,5,8,15], consists of two main steps. Firstly, an initial white matter surface mesh is created by applying mesh tessellation or marching cubes [12] to the segmented white matter from the scanned image, along with topology fixing to guarantee the spherical topology. The initial mesh is further refined and smoothed to produce the final white matter surface.

Secondly, the pial surface mesh is generated by expanding the white matter surface iteratively until it reaches the boundary between the gray matter and cerebrospinal fluid or causes self-intersection. One limitation of such approaches is the high computational cost. For example, FreeSurfer [5], a widely used brain MRI analysis tool, usually takes several hours to extract the cortical surfaces for a single subject. As a fast and end-to-end alternative approach, deep learning has shown its advantages in surface reconstruction for general shape objects [7,9,13,14,18] and medical images [2,10,16,19]. Given brain MRI scans, existing deep learning frameworks [2,10] are able to predict cortical surfaces within 30 min. However, although the white matter surfaces can be extracted accurately, the pial surface reconstruction is still challenging. Due to its highly folded and curved geometry, the pial surface reconstructed by previous deep learning approaches tends to be oversmoothed to prevent self-intersections, or fails to reconstruct the deep and narrow sulcal regions.

In this work, we propose a fast and accurate architecture for reconstructing the pial surface, called Pial Neural Network (PialNN). Given an input white matter surface and its corresponding MR image, PialNN reconstructs the pial surface mesh using a sequence of learned deformation blocks. In each block, we introduce a local convolutional operation, which applies a 3D convolutional neural network (CNN) to a small cube containing the MRI intensity of a vertex and its neighborhood. Our method can work on brain MRI at arbitrary resolution without increasing the complexity. PialNN establishes a one-to-one correspondence between the vertices in the white matter and pial surfaces, so that a point-to-point loss can be minimized directly without any regularization terms or point matching. The performance is evaluated on the publicly available Human Connectome Project (HCP) dataset [17]. PialNN shows superior geometric accuracy compared to existing deep learning approaches. The main contributions and advantages of PialNN can be summarized as:

– Fast: PialNN can be trained end-to-end to reconstruct the pial surface mesh within one second.
– Memory-efficient: The local convolutional operation enables PialNN to process a high resolution input mesh (>150k vertices) using input MR brain images at arbitrary resolution.
– Accurate: The proposed point-to-point loss, without additional vertex matching or mesh regularization, improves the geometric accuracy of the reconstructed surfaces effectively.

2 Related Work

Deep learning-based surface reconstruction approaches can be divided into implicit [13,14] and explicit methods [7,9,18]. The former use a deep neural network (DNN) to learn an implicit surface representation such as an occupancy field [13] and a signed distance function [14]. A triangular mesh is then extracted using isosurface extraction. For explicit methods [7,9,18], a DNN is trained

Fig. 1. The proposed architecture for pial surface reconstruction (PialNN). The input white matter surface is deformed by three deformation blocks to predict a pial surface. Each deformation block incorporates two types of features: point features from the white matter surface vertices and local features from the brain MRI. Finally, the output mesh is refined using Laplacian smoothing.

end-to-end to deform an initial mesh to a target mesh, producing an explicit mesh directly. Previous deep learning frameworks [2,10] for cortical surface reconstruction mainly adopted implicit methods. Henschel et al. [10] proposed FastSurfer pipeline, which improved FreeSurfer [5] by introducing a fast CNN for wholebrain segmentation instead of atlas-based registration. The cortical surface is then extracted by a non-learning approach [5]. Cruz et al. [2] proposed DeepCSR framework to predict the implicit representation of both the inner and outer cortical surfaces. Explicit surfaces are extracted by the marching cubes algorithm [12]. Implicit methods require a time-consuming topology correction, while explicit methods can pre-define an initial mesh with spherical topology to achieve fast inference. Wickramasinghe et al. [19] presented an explicit framework, called Voxel2Mesh, to extract 3D meshes from medical images. Voxel2Mesh employed a series of deformation and unpooling layers to deform an initial mesh while increasing the number of vertices iteratively. Regularization terms are utilized to improve the mesh quality and prevent self-intersections, whereas these terms tend to oversmooth the output mesh. Conversely, our PialNN uses explicit methods to learn the pial surface reconstruction without any regularization terms.

3 Method

We first introduce necessary notations to formulate the problem. Let M = (V, E, F) be a 3D triangular mesh, where V ⊂ R3 , E and F are the sets of

vertices, edges and faces of the mesh. The corresponding coordinates and normals of the vertices are represented by v, n ∈ R^{|V|×3}, where |V| is the number of vertices. Given an initial white matter surface M0 = (V0, E0, F0) and a target pial surface M∗ = (V∗, E∗, F∗), we assume that M0 and M∗ have the same connectivity, i.e. E0 = E∗ and F0 = F∗. Given a brain MRI volume I ∈ R^{L×W×H}, the goal of deep learning-based pial surface reconstruction is to learn a neural network g such that the coordinates v∗ = g(v0, n0, I). As illustrated in Fig. 1, the PialNN framework aims to learn a series of deformation blocks fθl for 1 ≤ l ≤ L that iteratively deform the white matter surface M0 to match the target pial surface M∗, where θl represents the learnable parameters of the neural network.

3.1 Deformation Block

Let Ml be the l-th intermediate deformed mesh. The vertices of Ml can be computed as:

v_l = v_{l−1} + Δv_{l−1} = v_{l−1} + f_{θ_l}(v_{l−1}, n_{l−1}, I),    (1)

for 1 ≤ l ≤ L, where f_{θ_l} is the l-th deformation block represented by a neural network. The purpose of PialNN is to learn the optimal f_{θ_l}, such that the final predicted mesh M_L matches the target mesh M∗, i.e. M_L = M∗. The architecture of the deformation block is shown in Fig. 1. In this approach, the deformation block predicts a displacement Δv based on the point feature and local feature of the vertex v.

Point Feature. The point feature of a vertex is defined as the feature extracted from its coordinate v and normal n, which includes the spatial and orientation information. We extract the point feature using a multi-layer perceptron (MLP).

Local Feature. We adopt a local convolutional operation to extract the local feature of a vertex from brain MRI scans. Rather than using a memory-intensive 3D CNN on the entire MRI volume [19], this method only employs a CNN on a cube containing the MRI intensity of each vertex and its neighborhood. As illustrated in Fig. 1, for each vertex, we find the corresponding position in the brain MRI volume. Then a K³ grid is constructed based on the vertex to exploit its neighborhood information. The voxel value of each point in the grid is sampled from the MRI volume. Such a cube sampling approach extracts a K³ voxel cube containing the MRI intensity of each vertex and its neighborhood. Furthermore, we build a 3D image pyramid including 3 scales (1, 1/2, 1/4) and use cube sampling on the different scales. Therefore, each vertex is represented by a K³ local cube with 3 channels containing multi-scale information. A 3D CNN with kernel size K is then applied to each local cube, which converts the cube to a local feature vector of its corresponding vertex. An MLP layer follows to further refine the local feature.

Such a local convolutional operation is memory- and time-efficient. As there are in total |V| cubes with 3 channels, it only executes the convolution operators

3|V| times, which is far less than the L × W × H times needed to run a 3D CNN on the full MRI. Since the complexity only relies on the number of vertices |V|, the local convolutional operation can process MRI volumes at arbitrary resolution without increasing the complexity. The point and local features are concatenated as the input of several MLP layers followed by leaky ReLU activation, which predict a 3D displacement Δv_{l−1}. The new vertices v_l are updated according to Eq. 1 and act as the input for the next deformation block.
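To make the cube-sampling idea concrete, the following is an illustrative NumPy sketch (not the PialNN implementation, which is vectorised, samples trilinearly, and repeats this on a three-scale image pyramid to obtain 3-channel cubes); it extracts a K³ neighbourhood of intensities around each vertex with nearest-neighbour rounding.

```python
# Illustrative nearest-neighbour cube sampling at a single scale; the released
# code differs (trilinear sampling, three scales, fully vectorised).
import numpy as np


def sample_cubes(volume, verts, k=5):
    """volume: (D, H, W) MRI intensities; verts: (V, 3) vertex positions in voxel
    coordinates (same axis order as the volume). Returns (V, k, k, k) cubes."""
    half = k // 2
    padded = np.pad(volume, half, mode="constant")        # keep border vertices in range
    cubes = np.empty((len(verts), k, k, k), dtype=volume.dtype)
    centers = np.round(verts).astype(int) + half          # shift into padded coordinates
    for i, (z, y, x) in enumerate(centers):
        cubes[i] = padded[z - half:z + half + 1,
                          y - half:y + half + 1,
                          x - half:x + half + 1]
    return cubes
```

Running the same sampling on the 1, 1/2 and 1/4 resolution volumes and stacking the results would give the 3-channel, multi-scale cubes fed to the 3D CNN described above.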

3.2 Smoothing and Training

Laplacian Smoothing. After three deformation blocks, a Laplacian smoothing is used to further smooth the surface and prevent self-intersections. For each vertex v_i ∈ R³, the smoothing is defined as v̄_i = (1 − λ)v_i + λ Σ_{j∈N(i)} v_j / |N(i)|, where λ controls the degree of smoothness and N(i) is the adjacency list of the i-th vertex. The smoothing layer is incorporated in both training and testing.

Loss Function. The Chamfer distance [4] is commonly used as the loss function for training explicit surface reconstruction models [18,19]. It measures the distance from a vertex in one mesh to the closest vertex in the other mesh bidirectionally. For PialNN, since the input and target mesh have the same connectivity, we can directly compute a point-to-point mean square error (MSE) loss between each pair of vertices. Therefore, the loss function is defined as:

L(M_L, M∗) = L(v_L, v∗) = ‖v_L − v∗‖²₂.    (2)

Rather than computing the loss for all intermediate meshes Ml, we only compute the loss between the final predicted pial surface M_L and the ground truth M∗, because the gradient can be backpropagated to all deformation blocks f_{θ_l} for 1 ≤ l ≤ L. The parameters θl are learned by minimizing the MSE loss. It is noted that no explicit regularization term is required in the loss function, as the vertices learn from the point-to-point supervision to move to a correct location. Such a loss function effectively improves the geometric accuracy of the output mesh. Besides, we use an additional Laplacian smoothing after training to improve the mesh quality and to fix self-intersections.
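A minimal PyTorch sketch of the smoothing step and the point-to-point loss of Eq. (2) is given below; the adjacency representation and the per-vertex loop are our simplifications for clarity, not the released code (which would use a vectorised, sparse formulation).

```python
# Sketch of Laplacian smoothing and the point-to-point MSE loss of Eq. (2).
import torch


def laplacian_smooth(verts, adjacency, lam=1.0):
    """verts: (V, 3) tensor; adjacency: list where adjacency[i] holds the indices N(i).
    lam = 1.0 matches the smoothing coefficient reported in the experiments."""
    smoothed = verts.clone()
    for i, neigh in enumerate(adjacency):            # loop kept for clarity, not speed
        smoothed[i] = (1.0 - lam) * verts[i] + lam * verts[neigh].mean(dim=0)
    return smoothed


def point_to_point_loss(pred_verts, gt_verts):
    """Valid because the predicted and target meshes share the same connectivity."""
    return ((pred_verts - gt_verts) ** 2).sum(dim=1).mean()
```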

4 Experiments

Dataset. The proposed framework is evaluated using the WU-Minn Human Connectome Project (HCP) Young Adult dataset [17] (https://www.humanconnectome.org/study/hcp-young-adult/data-releases). We use 300 subjects, each of which has T1-weighted brain MRI scans with 1 mm isotropic resolution. Each brain MRI is cropped to size of (192, 224, 192). The 300 subjects are split into 200/50/50 for training/validation/testing. The input white matter surface and ground truth pial surface are generated by FreeSurfer [5]. Each surface has approximately 150k vertices and 300k faces for one hemisphere. It is noted that the input white matter surfaces can be generated by other faster tools [10,15].

Fig. 2. Visualization of the reconstructed pial surface meshes.

Implementation Details. PialNN consists of L = 3 layers of deformation blocks. We set the smoothing coefficient λ = 1 and kernel size K = 5 for 3D CNN. The Adam optimizer with learning rate 10−4 is used for training the model for 200 epochs with batch size 1. Experiments compare the performance of PialNN with state-of-the-art deep learning baselines, such as Voxel2Mesh [19] and DeepCSR [2]. All models are trained on an Nvidia GeForce RTX3080 GPU. Since Voxel2Mesh uses iterative mesh unpooling, the input white matter surface is simplified to a mesh with 5120 faces using quadric error metric decimation. For DeepCSR, we train two different models based on occupancy fields (DeepCSR-OCC) and signed distance functions (DeepCSR-SDF) for ground truth. The size of the implicit representation for DeepCSR is set to (192, 224, 192) in order to have a reasonable number of vertices for a fair comparison. Geometric Accuracy. We evaluate the geometric accuracy of the PialNN framework by computing the error between the predicted pial surfaces and FreeSurfer ground truth. We utilize three distance-based metrics to measure the geometric error, namely, Chamfer distance (CD) [4,18], average absolute distance (AD) [2] and Hausdorff distance (HD) [2]. The CD measures the mean distance between two sets of vertices. AD and HD compute the average and maximum distance between two sets of 150k sampled points from surface meshes. All distances are computed bidirectionally in millimeters (mm). A lower distance means a better result. The experimental results are given in Table 1, which shows that PialNN achieves the best geometric accuracy compared with existing deep learning baselines. It reduces the geometric error by >30% compared to Voxel2Mesh and DeepCSR in all three distances (mm). In addition, the quality of the predicted pial surface mesh is visualized in Fig. 2.
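The three evaluation distances can be computed between point sets sampled from the two surfaces as in the sketch below; conventions vary slightly across papers (e.g., whether the Chamfer distance sums or averages the two directions), so the exact definitions here are one common choice rather than a verbatim reproduction of the evaluation code.

```python
# Sketch of Chamfer, average absolute and Hausdorff distances between two
# point sets sampled from the predicted and reference surfaces (in mm).
import torch


def surface_distances(points_a, points_b):
    """points_a: (Na, 3), points_b: (Nb, 3) points from the two surfaces."""
    d = torch.cdist(points_a, points_b)            # (Na, Nb) pairwise distances
    a2b = d.min(dim=1).values                      # each A point to its closest B point
    b2a = d.min(dim=0).values
    chamfer = a2b.mean() + b2a.mean()              # bidirectional Chamfer distance
    average = 0.5 * (a2b.mean() + b2a.mean())      # average absolute distance
    hausdorff = torch.max(a2b.max(), b2a.max())    # symmetric Hausdorff distance
    return chamfer, average, hausdorff
```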

Table 1. Geometric error for pial surface reconstruction. The results include the comparison with existing deep learning baselines and the ablation study. Chamfer distance (mm), average absolute distance (mm), and Hausdorff distance (mm) are computed for both left and right hemisphere. A lower distance means a better result.

Method         | Left Chamfer | Left Average | Left Hausdorff | Right Chamfer | Right Average | Right Hausdorff
PialNN (Ours)  | 0.39 ± 0.01  | 0.21 ± 0.02  | 0.45 ± 0.04    | 0.39 ± 0.02   | 0.20 ± 0.02   | 0.44 ± 0.04
Voxel2Mesh     | 0.58 ± 0.03  | 0.34 ± 0.04  | 0.82 ± 0.09    | 0.57 ± 0.02   | 0.31 ± 0.02   | 0.80 ± 0.07
DeepCSR-OCC    | 0.66 ± 0.04  | 0.42 ± 0.04  | 0.87 ± 0.13    | 0.65 ± 0.05   | 0.40 ± 0.04   | 0.88 ± 0.20
DeepCSR-SDF    | 0.72 ± 0.07  | 0.45 ± 0.06  | 1.23 ± 0.36    | 0.78 ± 0.11   | 0.49 ± 0.09   | 1.58 ± 0.54
Single Scale   | 0.42 ± 0.02  | 0.23 ± 0.02  | 0.50 ± 0.05    | 0.43 ± 0.02   | 0.23 ± 0.02   | 0.51 ± 0.05
Point Sampling | 0.56 ± 0.03  | 0.40 ± 0.04  | 0.87 ± 0.09    | 0.57 ± 0.03   | 0.41 ± 0.05   | 0.91 ± 0.11
GCN            | 0.39 ± 0.02  | 0.21 ± 0.02  | 0.46 ± 0.04    | 0.40 ± 0.01   | 0.21 ± 0.01   | 0.46 ± 0.04

Figure 3 provides a detailed visual comparison between different approaches. The DeepCSR-SDF-2x represents DeepCSR-SDF with input size of (384, 448, 384). We focus on the areas highlighted by the blocks in different colors. In the red block, the DeepCSR frameworks fail to distinguish two separate regions in the surface. The issue remains unsolved after increasing the input size. The yellow block indicates an inaccurate Voxel2Mesh prediction since the mesh is oversmoothed. In the orange block, only FreeSurfer and PialNN reconstruct the deep and narrow sulci accurately. The green block indicates the error of Voxel2Mesh and DeepCSR-OCC in a sulcus region. It is noted that PialNN makes a correct reconstruction in all highlighted areas.

Fig. 3. A visual evaluation of the predicted pial surfaces (cyan colour). (Color figure online)

Figure 3 further shows that the predicted mesh from Voxel2Mesh is oversmoothed, which can be a result of the used regularization terms. Besides, it loses the geometric prior provided by the input white matter surface due to the mesh simplification. Regardless of the input size, DeepCSR is prone to fail in the deep sulcus regions, since the implicit representation can be affected by partial volume effects. Ablation Study. We consider three ablation experiments. First, we only use single-scale brain MRI rather than a multi-scale image pyramid. Second, instead of cube sampling, we only employ point sampling, which samples the MRI voxels at the exact position of each vertex. Third, we substitute the MLP layers with

Graph Convolutional Networks (GCN) [11]. The results are listed in Table 1 and the error maps are given in Fig. 4, which shows the Chamfer distance between the output surface and the FreeSurfer ground truth. Multi-scale input slightly improves the geometric accuracy, while the cube sampling contributes a lot to the performance of PialNN. There is no notable improvement after replacing MLP with GCN layers but the memory usage has increased.

Fig. 4. Error maps of the pial surface from ablation study. The color visualizes the Chamfer distance ranging from 0 to 2 mm. (Color figure online)

Fig. 5. Runtime (seconds) of deep learning-based approaches for pial surface reconstruction.

Runtime. We compute the runtime for each framework, as shown in Fig. 5, for both left and right pial surfaces reconstruction. PialNN achieves the fastest runtime with 0.52 s, whereas traditional pipelines [5,8,15] usually take >10 min for pial surface generation based on the white matter surface. Voxel2Mesh needs 4.8 s as it requires mesh simplification for the input. DeepCSR runs in >100 s due to the time-consuming topology correction.

5 Conclusion

PialNN is a fast and memory-efficient deep learning framework for cortical pial surface reconstruction. The proposed framework learns several deformation blocks to generate a pial surface mesh from an input white matter surface. Each block incorporates the point feature extracted from the coordinates and normals, as well as the local feature extracted from the MRI intensity of the vertex and its neighborhood. Experiments demonstrate that our framework achieves the best performance with highest accuracy and fastest runtime (within one second) compared to state-of-the-art deep learning baselines. A future direction will be to extend the PialNN framework to predict the segmentation labels and reconstruct both cortical white matter and pial surfaces using only the input MR brain images. Acknowledgements. This work was supported by the President’s PhD Scholarships at Imperial College London.

References
1. Avants, B.B., Tustison, N., Song, G.: Advanced normalization tools (ANTs). Insight J. 2(365), 1–35 (2009)
2. Cruz, R.S., Lebrat, L., Bourgeat, P., Fookes, C., Fripp, J., Salvado, O.: DeepCSR: a 3D deep learning approach for cortical surface reconstruction. arXiv preprint arXiv:2010.11423 (2020)
3. Dale, A.M., Fischl, B., Sereno, M.I.: Cortical surface-based analysis: I. Segmentation and surface reconstruction. Neuroimage 9(2), 179–194 (1999)
4. Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613 (2017)
5. Fischl, B.: FreeSurfer. Neuroimage 62(2), 774–781 (2012)
6. Fischl, B., Dale, A.M.: Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc. Natl. Acad. Sci. 97(20), 11050–11055 (2000)
7. Gkioxari, G., Malik, J., Johnson, J.: Mesh R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9785–9795 (2019)
8. Glasser, M.F., et al.: The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage 80, 105–124 (2013)
9. Groueix, T., Fisher, M., Kim, V.G., Russell, B.C., Aubry, M.: A papier-mâché approach to learning 3D surface generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 216–224 (2018)
10. Henschel, L., Conjeti, S., Estrada, S., Diers, K., Fischl, B., Reuter, M.: FastSurfer - a fast and accurate deep learning based neuroimaging pipeline. NeuroImage 219, 117012 (2020)
11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
12. Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3D surface construction algorithm. ACM SIGGRAPH Comput. Graph. 21(4), 163–169 (1987)
13. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: learning 3D reconstruction in function space. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4470 (2019)
14. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: DeepSDF: learning continuous signed distance functions for shape representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 165–174 (2019)
15. Shattuck, D.W., Leahy, R.M.: BrainSuite: an automated cortical surface identification tool. Med. Image Anal. 6(2), 129–142 (2002)
16. Tóthová, K., et al.: Probabilistic 3D surface reconstruction from sparse MRI information. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12261, pp. 813–823. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59710-8_79
17. Van Essen, D.C., et al.: The WU-Minn Human Connectome Project: an overview. Neuroimage 80, 62–79 (2013)
18. Wang, N., et al.: Pixel2Mesh: 3D mesh model generation via image guided deformation. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
19. Wickramasinghe, U., Remelli, E., Knott, G., Fua, P.: Voxel2Mesh: 3D mesh model generation from volumetric data. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12264, pp. 299–308. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59719-1_30

Multi-modal Brain Segmentation Using Hyper-Fused Convolutional Neural Network

Wenting Duan1(B), Lei Zhang1, Jordan Colman1,2, Giosue Gulli2, and Xujiong Ye1

1 Department of Computer Science, University of Lincoln, Lincoln, UK [email protected]
2 Ashford and St Peter’s Hospitals NHS Foundation Trust, Surrey, UK

Abstract. Algorithms that fuse information acquired from different imaging modalities have been shown to improve segmentation results in various medical applications. Motivated by recent successes achieved using densely connected fusion networks, we propose a new fusion architecture for 3D segmentation in multi-modal brain MRI volumes. Based on a hyper-densely connected convolutional neural network, our network promotes a progressive information abstraction process, introduces a new module – ResFuse – to merge and normalize features from different modalities, and adopts Combo loss for handling data imbalance. The proposed approach is evaluated on both a hospital-collected dataset for acute ischemic stroke lesion segmentation and a public dataset for infant brain segmentation (iSeg-17). The experimental results show that our approach achieves superior performance on both datasets compared to the state-of-the-art fusion network. Keywords: Multi-modal fusion · Dense network · Brain segmentation

1 Introduction

In medical imaging, segmentation of lesions or organs using a multi-modal approach has become a growing trend as more advanced systems and data become available. For example, magnetic resonance imaging (MRI), which is widely used for brain lesion or tumor detection and segmentation, comes in several modalities including T1-weighted (T1), T2-weighted (T2), FLuid Attenuated Inversion Recovery (FLAIR) and Diffusion-Weighted Imaging (DWI), among others. Compared to a single modality, the extraction of information from multi-modal images brings complementary information that contributes to reduced uncertainty and an improved discriminative power of the clinical diagnosis system [1]. Motivated by the success of deep learning, image fusion strategies have largely moved from probability theory [2] or fuzzy concept [3] based methods to deep convolutional neural network based approaches [1, 4]. Promising performance has been achieved by deep learning based methods for medical image segmentation from multi-modal images. The most widely applied strategy is simply concatenating images or image patches of different modalities to learn a unified image feature set [5–7]. Such networks combine the data at the input level to


form a multi-channel input. Another straightforward fusion strategy is for the images of each modality to learn an independent feature map. These single-modality feature sets then either learn their own separate classifiers and use ‘votes’ to arrive at a final output, or learn a multi-modal classifier integrating high-level representations of the different modalities [8–10]. In comparison to the strategies mentioned previously, where fusion happens either at the input level or at the output/classifier level, some recent works [11–14] have shown that performing fusion within the convolutional feature learning stage instead generally gives much better segmentation results. Tseng et al. [14] proposed a cross-modality convolution to aggregate data from different modalities within an encoder–decoder network. A convolutional LSTM is then used to model the correlations between slices. The method requires images of all modalities to be co-registered, and the network parameters vary with the number of slices involved in the training dataset. For unpaired modalities such as CT and MRI, Dou et al. [15] developed a novel scheme involving separate feature normalization but shared convolutions. A knowledge distillation-based loss is proposed to promote softer probability distributions over classes. However, the design so far is limited to two modalities. Another avenue of research on multi-modal fusion is based on DenseNet [16], where feature re-use is induced by connecting each layer with all previous layers. For example, Dolz et al. [13] extend DenseNet so that the dense connections exist not only between the layers of the same modality but also between modalities. Their network (HyperDense-Net) made significant improvements over other state-of-the-art segmentation techniques and ranked first in two highly competitive multi-modal brain segmentation challenges. Dolz et al. [17] also explored the integration of DenseNet in U-Net, which involved a multi-path densely connected encoder and inception module-based convolution blocks with dilated convolutions at different scales. However, that network only accepts 2D slices and not 3D volumes as input. As reviewed in [3], dense connection-based layer-level fusion improves the effectiveness and efficiency of multi-modal segmentation networks through better information propagation, implicit deep supervision and a reduced risk of over-fitting on small datasets. While recognising the advantages provided by densely connected networks for multi-modal fusion, the HyperDense-Net architecture has some limitations, which we address in this paper. The first lies in how the number of filters varies across layers. Compared to many other segmentation networks such as U-Net, HyperDense-Net contains no pooling layers between convolutional layers and is overall not very deep (it contains nine convolution blocks and four fully-convolutional layers). However, it retained the conventional way of increasing the number of filters (just like networks with pooling layers) by doubling or multiplying by 1.5 after every three consecutive convolution blocks, resulting in a drastic change in feature abstraction at the 4th and 7th layers and only moderate learning in the other layers. The other limitation we identified is the way the multi-modal feature maps are concatenated. In HyperDense-Net, the feature maps from all modalities as well as from previous layers are simply fused using concatenation along the channel dimension.
We speculate that this approach fails to consider the discrepancy in visual features across modalities and the importance of modality-specific learning, resulting in ineffective multi-modal feature merging and propagation. Given the challenges and limitations described above, we propose a new densely connected fusion architecture, which we refer to as HyperFusionNet, for multi-modal


brain tissue segmentation. The proposed network is trained in an end-to-end fashion, where a progressive feature abstraction process is ensured and a better feature fusion strategy is integrated to alleviate the interference and incompatibility of feature maps generated from different modality paths. We compare the proposed architecture to the state-of-the-art method using both a private dataset on acute ischemic stroke lesions and data from the iSeg-2017 MICCAI Grand Challenge [18] on 6-month-old infant brain MRI segmentation.

2 Method

2.1 Baseline Architecture

The pipeline of the baseline architecture – HyperDense-Net [13] – is shown in Fig. 1, but without the added ResFuse modules. Taking the fusion of three modalities as an example, each imaging modality has its own stream for the propagation of the features until it reaches the fully convolutional layer. Every convolutional block includes batch normalization, PReLU activation and convolution, with no spatial pooling. For a convolutional block in a conventional CNN, the output of the current layer, denoted as $x_l$, is obtained by applying a mapping function $F_l(\cdot)$ to the output $x_{l-1}$ of the previous layer, i.e.

$$x_l = F_l(x_{l-1}) \qquad (1)$$

However, in HyperDense-Net, feature maps generated from the different modalities as well as the feature outputs from previous layers are concatenated in a feed-forward manner to form the input to the convolution block. Let $M$ represent the number of modalities involved in the multi-modal network; the output of the $l$-th layer along a stream $m = 1, 2, \ldots, M$ in the baseline architecture is then defined as

$$x_l^m = F_l\big(\big[x_{l-1}^1, x_{l-1}^2, \ldots, x_{l-1}^M, x_{l-2}^1, x_{l-2}^2, \ldots, x_{l-2}^M, \ldots, x_0^M\big]\big) \qquad (2)$$
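To make this concrete, the following minimal PyTorch sketch (our illustration, not the authors' code) shows a block in the order listed above and the Eq. (2)-style concatenation of every stored feature map from every modality stream. Because the convolutions use no padding, earlier maps are centre-cropped to the current spatial size before concatenation; all channel counts are illustrative only.

```python
import torch
import torch.nn as nn

class ConvBlock3D(nn.Module):
    """One block in the order listed in the text: batch norm, PReLU, then a 3x3x3
    convolution without padding (no spatial pooling), so each axis shrinks by 2 voxels."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.act = nn.PReLU()
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3)

    def forward(self, x):
        return self.conv(self.act(self.bn(x)))

def crop_to(x, size):
    """Centre-crop a (B, C, D, H, W) tensor to the given spatial size."""
    d, h, w = x.shape[2:]
    sd, sh, sw = (d - size[0]) // 2, (h - size[1]) // 2, (w - size[2]) // 2
    return x[:, :, sd:sd + size[0], sh:sh + size[1], sw:sw + size[2]]

def hyper_dense_input(streams):
    """Eq. (2): concatenate, along channels, every stored feature map from every modality
    stream, cropping the earlier (larger) maps to the most recent spatial size."""
    target = streams[0][-1].shape[2:]
    return torch.cat([crop_to(f, target) for history in streams for f in history], dim=1)

# Illustrative usage with two modality streams (channel counts arbitrary, not those of Table 1).
stream_a = [torch.randn(1, 25, 25, 25, 25), torch.randn(1, 33, 23, 23, 23)]
stream_b = [torch.randn(1, 25, 25, 25, 25), torch.randn(1, 33, 23, 23, 23)]
x = hyper_dense_input([stream_a, stream_b])   # (1, 116, 23, 23, 23)
y = ConvBlock3D(116, 41)(x)                   # (1, 41, 21, 21, 21)
```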

All streams are then concatenated together before entering the fully convolutional layers. The output of the network is fed into a softmax function to generate the probabilistic map. The final segmentation result is computed based on the highest probability value. The baseline network is optimised using the Adam optimiser and a cross-entropy loss function.

2.2 Proposed Architecture

To avoid drastic changes of feature abstraction, we first modified the number of filters in the baseline network. Instead of having an equal number of filters for every three consecutive convolutional blocks, we gradually increase the number of filters in the successive blocks. Let w denote the increase in filter number used in the original network; we add w/3 filters to each successive convolutional layer in the proposed network. The effectiveness of such a design was demonstrated previously in [19]. To improve the fusion of the multi-modal features along each modality path, we propose to merge the feature maps via a ‘ResFuse Module’. Inspired by [20], the module


(illustrated in Fig. 2) contains a residual connection through which the main information belonging to that specific modality path is passed directly. A 1 × 1 convolutional layer is also introduced to allow some modelling of the channel correspondence between the features of the specific path and the merging information from the other modalities. For the concatenated feature maps, we apply a non-linear PReLU activation before summation in order to promote better mapping and information flow in the fused propagation. Equation (2) is then updated to

$$x_l^m = H_l\big(x_{l-1}^m\big) + G_l\big(\big[x_{l-1}^1, x_{l-1}^2, \ldots, x_{l-1}^M, x_{l-2}^1, x_{l-2}^2, \ldots, x_{l-2}^M, \ldots, x_0^M\big]\big) \qquad (3)$$

where $H_l$ applies the dimension expansion of $x_{l-1}^m$ via the 1 × 1 convolution and $G_l$ performs the concatenation of features from all modalities followed by the activation.
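A minimal PyTorch sketch of this merging step is shown below; the module and argument names are ours, the channel counts in the usage example are illustrative rather than those of Table 1, and spatial sizes are assumed to have been matched beforehand.

```python
import torch
import torch.nn as nn

class ResFuse(nn.Module):
    """Sketch of the ResFuse merge in Eq. (3). H_l expands the current stream's features with a
    1x1x1 convolution so they can be summed with G_l, the PReLU-activated concatenation of the
    dense multi-modal features. Names and channel handling are our assumptions."""
    def __init__(self, stream_channels, fused_channels):
        super().__init__()
        self.expand = nn.Conv3d(stream_channels, fused_channels, kernel_size=1)  # H_l
        self.act = nn.PReLU()                                                     # activation used in G_l

    def forward(self, x_stream, dense_feats):
        fused = self.act(torch.cat(dense_feats, dim=1))   # G_l: concatenate, then non-linear activation
        return self.expand(x_stream) + fused              # residual summation of Eq. (3)

# Illustrative usage: a 33-channel stream merged with three dense feature maps of 25 channels each.
x_m = torch.randn(1, 33, 23, 23, 23)
dense = [torch.randn(1, 25, 23, 23, 23) for _ in range(3)]
out = ResFuse(stream_channels=33, fused_channels=75)(x_m, dense)   # (1, 75, 23, 23, 23)
```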

Fig. 1. The proposed HyperFusionNet architecture in the case of three imaging modalities. The feature map generated by each convolutional block is colour coded; the deeper the colour, the deeper the layer. The stacked feature maps show how the dense connection and layer shuffling happen originally along each path. The ResFuse Module is added to replace the original concatenation.

Fig. 2. Proposed residual fusion module for the multi-modal feature merging.

The layer parameters of the proposed network are detailed in Table 1, and the overall architecture layout, which we term HyperFusionNet, is presented in Fig. 1.


Table 1. The HyperFusionNet architecture detail. Notations: CB – convolutional block; RFM – residual fusion module; FC – fully convolutional layer.

Network components | No. filters | Output size
CB1                | 25          | 25³
RFM1, CB2          | 75, 33      | 23³
RFM2, CB3          | 174, 41     | 21³
RFM3, CB4          | 297, 50     | 19³
RFM4, CB5          | 447, 58     | 17³
RFM5, CB6          | 621, 66     | 15³
RFM6, CB7          | 819, 75     | 13³
RFM7, CB8          | 1044, 83    | 11³
RFM8, CB9          | 1293, 91    | 9³
RFM9, FC1          | 1566, 600   | 9³
FC2                | 300         | 9³
FC3                | 150         | 9³
FC4                | No. classes | 9³

2.3 Learning Process and Implementation Details

Another change we made to the baseline network concerns the loss function. Instead of using the cross-entropy loss, we propose to use Combo Loss, which combines Dice Loss (DL) and Cross-Entropy (CE). The Combo Loss allows us to benefit from DL for better handling of the lightly imbalanced classes and at the same time leverage the advantage of CE for curve smoothing. It is defined as

$$L = \alpha\left(-\frac{1}{N}\sum_{i=1}^{N}\left[\beta\,(g_i \log s_i) + (1-\beta)\,(1-g_i)\log(1-s_i)\right]\right) - (1-\alpha)\,\frac{2\sum_{i=1}^{N} s_i g_i + \varepsilon}{\sum_{i=1}^{N} s_i + \sum_{i=1}^{N} g_i + \varepsilon} \qquad (4)$$

where $g_i$ is the ground truth for pixel $i$, and $s_i$ is the corresponding predicted probability. The model is implemented in PyTorch and trained on a single NVIDIA GTX 1080Ti GPU. Images from each modality are skull-stripped and normalized by subtracting the mean value and dividing by the standard deviation. 3D image patches of size 27 × 27 × 27 are randomly extracted, and only those containing lesion voxels are used for training. The Adam optimization algorithm is used with default parameter values, and the network was trained for 600 epochs. For model inference, the testing images are first normalised and non-overlapping 3D patches are extracted. The output, which is the 9 × 9 × 9 voxel-wise classification obtained from the prediction at the centre of the patch, is used to reconstruct the full image volume by reversing the extraction process. The source code for the implemented model is available on GitHub (https://github.com/Norika2020/HyperFusionNet).
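A compact PyTorch sketch of this loss, under the assumption of a binary (lesion vs. non-lesion) setting and with illustrative default weights, is given below; the paper does not report the exact α and β values used.

```python
import torch

def combo_loss(probs, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Combo Loss of Eq. (4): an alpha-weighted, beta-modulated cross-entropy term minus a
    (1 - alpha)-weighted Dice term. `probs` are predicted foreground probabilities and
    `target` is the binary ground truth. alpha/beta defaults are illustrative assumptions."""
    s = probs.clamp(eps, 1.0 - eps).flatten()
    g = target.float().flatten()
    ce = -(beta * g * torch.log(s) + (1.0 - beta) * (1.0 - g) * torch.log(1.0 - s)).mean()
    dice = (2.0 * (s * g).sum() + eps) / (s.sum() + g.sum() + eps)
    return alpha * ce - (1.0 - alpha) * dice

# Hypothetical usage on a 27^3 patch:
# loss = combo_loss(torch.sigmoid(logits), labels)
```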


3 Experiments and Results

3.1 Datasets

The proposed HyperFusionNet is evaluated both on a hospital-collected multi-modal dataset for acute stroke lesion segmentation and on the public iSeg-17 MICCAI Grand Challenge dataset. The hospital-collected dataset was divided into 90 training cases and 30 testing cases, with three modalities in each case, i.e., T2, DWI-b1000 and DWI-b0. All images are of size 256 × 256 × 32. The ground truth for the acute stroke lesion in the dataset was annotated by experienced physicians and there are two classes involved: lesion and non-lesion. In comparison, iSeg-17 is a much smaller dataset containing 10 available volumes with two modalities, i.e., T1- and T2-weighted. To be consistent with the experiment carried out in the original baseline paper [13], we also split the dataset into training, validation and testing sets, with 6, 1 and 3 subjects, respectively. There are four classes involved in the iSeg-17 dataset, i.e., background, cerebrospinal fluid (CSF), grey matter (GM) and white matter (WM).

Fig. 3. Validation accuracy measured using mean DC during proposed model training on the stroke lesion dataset.

3.2 Results and Discussion

The proposed network is first evaluated by assessing its performance at segmenting acute stroke lesions in the hospital-collected dataset. In this experiment, the batch size was set to 10 and the learning rate was set to 0.0002. Figure 3 shows the comparison of the validation accuracy between the baseline and HyperFusionNet. The mean Dice score of the validation set is calculated after every ten epochs. We can see from the learning curve that HyperFusionNet is not only more accurate than the baseline but also converges faster. This can be attributed to the synergy between the residual connections and the feature activation after concatenation. Table 2 shows the segmentation results on the testing volumes in terms of the Dice coefficient (DC) and Hausdorff distance (HD). Both measurements suggest that the proposed network provides more effective fusion of multi-modal features than the original approach. Figure 4 shows some examples of


qualitative results on three kinds of stroke lesion conditions: a large lesion, multiple lesions and a small lesion. Overall, we observe that the proposed network is better at discarding outliers and predicts stroke lesion regions of higher quality. To better understand how the proposed modifications to the baseline contribute to the network performance, we also conducted an ablation study. In this experiment, the 3D networks were changed to 2D (i.e. slice-by-slice input with patch size 27 × 27) to save training and computation time. As shown in Table 3, the accuracy immediately decreases when the network is changed to 2D. This is expected, and it also emphasises the importance of exploiting the slice-dimension information for such networks. The results show the clear improvements made by each modification to the 2D baseline network, with the ResFuse module making the biggest contribution. We also tested other loss functions – Dice Loss, Focal Loss and Tversky Loss. In comparison, Combo Loss proved more advantageous in our proposed network.

Table 2. The testing results on stroke lesion segmentation measured in DC (%) and HD, with their associated standard deviations, for the compared networks.

Network        | Mean DC | DC Std | Mean HD | HD Std
Baseline       | 65.6    | 18.0   | 87.756  | 20.386
HyperFusionNet | 67.7    | 16.5   | 85.462  | 14.496

Fig. 4. Qualitative results obtained for the stroke dataset using the baseline and the proposed networks.

We also tested the HyperFusionNet on the iSeg-17 dataset to investigate its performance on a smaller dataset with more classes involved. To allow a fair comparison, the parameters such as batch size (=5) and learning rate (initially = 0.001 and reducing by a factor of 2 every 100 epochs) are set to match the baseline paper. The results for the


Table 3. The testing results of the proposed modification to the baseline on stroke lesion segmentation measured in DC (%).

Network modification  | DC   | Other loss function      | DC
Baseline 2D (CE loss) | 41.9 | HyperFusionNet (CE loss) | 46.6
+ Incremental filters | 43.0 | + Dice loss              | 45.9
+ ResFuse module      | 46.6 | + Focal loss             | 39.0
+ Combo loss          | 47.1 | + Tversky loss           | 43.6

Table 4. The performance comparison on the testing set of the iSeg-17 brain segmentation measured in DC (%).

Architecture   | CSF        | WM         | GM
Baseline       | 93.4 ± 2.9 | 89.6 ± 3.5 | 87.4 ± 2.7
HyperFusionNet | 93.6 ± 2.5 | 90.2 ± 2.2 | 87.8 ± 2.3

baseline are reproduced using their published PyTorch code (https://github.com/josedolz/HyperDenseNet_pytorch) in order to compare results under the same experimental setting. The results in Table 4 show that the proposed network yields better segmentation results than the baseline. Although there is not a significant improvement in the averaged Dice score, we observed that it worked well for challenging cases of segmenting GM and WM. Figure 5 depicts such a challenging example, where the proposed HyperFusionNet shows a better contour recovery than that obtained by the baseline.

Fig. 5. Qualitative results achieved by the baseline and proposed network compared to the ground truth contour.



4 Conclusion

In this work, we propose a novel method, HyperFusionNet, for brain segmentation using 3D images captured with multiple modalities. The proposed network presents a new way to fuse features from different modalities in a densely connected architecture. A progressive feature abstraction process is promoted, and a ResFuse module is introduced to replace the simple concatenated fusion used in the baseline network. The network is improved further with a Combo loss function. We evaluate the proposed network on both acute ischemic lesion segmentation and infant brain segmentation and compare it to a state-of-the-art multi-modal fusion network. The experimental results demonstrate the effectiveness of HyperFusionNet and its capability to tackle challenging multi-modal segmentation tasks with different applications and dataset sizes. Our research largely focused on the fusion network itself, and little data augmentation or post-processing was included. For future work, we will improve the network further by implementing pre- and post-processing enhancements. The influence of each modality on different applications will also be investigated.

References 1. Tongxue, Z., Su, R., Stéphane, C.: A review: deep learning for medical image segmentation using multi-modality fusion. Array 3–4 (2019) 2. Lapuyade-Lahorgue, J., Xue, J.H., Ruan, S.: Segmenting multi-source images using hidden markov fields with copula-based multivariate statistical distributions. IEEE Trans. Image Process. 26(7), 3187–3195 (2017) 3. Balasubramaniam, P., Ananthi, N.: Image fusion using intuitionistic fuzzy sets. Inf. Fus. 20(1), 21–30 (2014) 4. Guo, Z., Li, X., Huang, H., Guo, N., Li, Q.: Deep learning-based image segmentation on multimodal medical imaging. IEEE Trans. Radiat. Plasma Med. Sci. 3(2), 162–169 (2019) 5. Havaei, M., et al.: Brain tumor segmentation with deep neural networks. Med. Image Anal. 35, 18–31 (2017) 6. Kamnitsas, K., et al.: Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017) 7. Lavdas, I., et al.: Fully automatic, multiorgan segmentation in normal whole body magnetic resonance imaging (MRI), using classification forests (CFs), convolutional neural networks (CNNs), and a multi-atlas (MA) approach. Med. Phys. 44(10), 5210–5220 (2017) 8. Cai, H., Verma, R., Ou, Y., Lee, S., Melhem, E.R., Davatzikos, C.: Probabilistic segmentation of brain tumors based on multi-modality magnetic resonance images. In 4th IEEE International Symposium on Biomedical Imaging, pp. 600–603 (2007) 9. Klein, S., van der Heide, U.A., Lips, I.M., van Vulpen, M., Staring, M., Pluim, J.P.: Automatic segmentation of the prostate in 3D MR images by atlas matching using localized mutual information. Med. Phys. 35(4), 1407–1417 (2008) 10. Menze, B.H., et al.: The multimodal brain tumor image segmentation benchmark. IEEE Trans. Med. Imaging 34(10), 1993–2024 (2015) 11. Aygun, M., Sahin, Y.H., Unal, G.: Multimodal convolutional neural networks for brain tumor segmentation. arXiv preprint:1809.06191 (2018) 12. Chen, Y., Chen, J., Wei, D., Li, Y., Zheng, Y.: OctopusNet: a deep learning segmentation network for multi-modal medical images. In: Li, Q., Leahy, R., Dong, B., Li, X. (eds.) MMMI 2019. LNCS, vol. 11977, pp. 17–25. Springer, Cham (2020). https://doi.org/10.1007/978-3030-37969-8_3


13. Dolz, J., Gopinath, K., Yuan, J., Lombaert, H., Desrosiers, C., Ben Ayed, I.: HyperDenseNet: a hyper-densely connected CNN for multi-modal image segmentation. IEEE Trans. Med. Imaging 38(5), 1116–1126 (2019) 14. Tseng, K.L., Lin, Y.L., Hsu, W., Huang, C.Y.: Joint sequence learning and cross-modality convolution for 3D biomedical segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 3739–3746 (2017) 15. Dou, Q., Liu, Q., Heng, P.A., Glocker, B.: Unpaired multi-modal segmentation via knowledge distillation. IEEE Trans. Med. Imaging 39(7), 2415–2425 (2020) 16. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 2261–2269 (2017) 17. Dolz, J., Ben Ayed, I., Desrosiers, C.: Dense multi-path U-net for ischemic stroke lesion segmentation in multiple image modalities. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11383, pp. 271–282. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11723-8_27 18. Wang, L., et al.: Benchmark on automatic 6-month-old infant brain segmentation algorithms: the iSeg-2017 challenge. IEEE Trans. Med. Imaging 38(9), 2219–2230 (2019) 19. Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C.: The importance of skip connections in biomedical image segmentation. In: Deep Learning and Data Labeling for Medical Applications, pp. 179–187 (2016) 20. Ibtehaz, N., Sohel Rahman, M.: MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 121, 74–87 (2020)

Robust Hydrocephalus Brain Segmentation via Globally and Locally Spatial Guidance

Yuanfang Qiao, Haoyi Tao, Jiayu Huo, Wenjun Shen, Qian Wang, and Lichi Zhang(B)

School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China [email protected]

Abstract. Segmentation of brain regions in hydrocephalus MR images is pivotally important for quantitatively evaluating patients’ abnormalities. However, the brain image data obtained from hydrocephalus patients always show large deformations and lesion occupancies compared to normal subjects. This leads to the disruption of the brain’s anatomical structure and dramatic changes in the shape and location of the brain regions, which poses a significant challenge to the segmentation task. In this paper, we propose a novel segmentation framework with two modules to better locate and segment these highly distorted brain regions. First, to provide global anatomical structure information and the absolute position of target regions for segmentation, we use a dual-path registration network which is incorporated into the framework and trained simultaneously with it. Second, we develop a novel Positional Correlation Attention Block (PCAB) to introduce local prior information about the relative positional correlations between different regions, so that the segmentation network can be guided in locating the target regions. In this way, the segmentation framework can be trained with spatial guidance from both global and local positional priors to ensure the robustness of the segmentation. We evaluated our method on the brain MR data of hydrocephalus patients by segmenting 17 consciousness-related ROIs and demonstrated that the proposed method can achieve high performance on image data with high variations of deformations. Source code is available at: https://github.com/JoeeYF/TBI-Brain-Region-Segmentation.

Keywords: Image segmentation · Image registration · Hydrocephalus

1 Introduction

Hydrocephalus is an abnormal accumulation of cerebrospinal fluid (CSF) in the patient’s brain with persistent ventricular dilatation, which is a secondary injury from Traumatic Brain Injury (TBI). It is usually caused by the obstruction of cerebrospinal fluid pathways, impaired cerebrospinal fluid circulation, etc., which


leaves the patient in a state of impaired consciousness [3]. The evaluation of the brain’s abnormalities and their correspondence to the patient’s consciousness state plays an important role in assisting the clinical assessment of disease progression. Specifically, it is demonstrated in [8] that there are 17 brain regions whose functional and anatomical shape states have certain correlations with improvements in the consciousness level. Therefore, the identification and parcellation of these consciousness-related brain regions is in demand. However, it is generally impractical to conduct segmentation manually, which is tedious, time-consuming, and introduces inter-observer variability. These issues are exacerbated for hydrocephalus brain images, which contain even higher variability and more extensive brain changes than normal brain images. Therefore, the development of an accurate and automatic brain parcellation method for hydrocephalus images would be highly beneficial. With the development of deep learning, the convolutional neural network (CNN) and its extensions have dominated the field of medical image segmentation. Specifically, UNet [5] combined high-level and low-level features with different context information to estimate precise segmentation results, and has become the most applied method in this field. Many attempts have also been made specifically for brain image segmentation. For example, Moeskops et al. [11] used a CNN to achieve automatic segmentation of MR brain images into a number of tissue classes. Ghafoorian et al. [6] integrated anatomical location information into the network to obtain explicit location features and improve the segmentation results substantially. Alignment-based brain mapping methods, which adopt registration techniques such as VoxelMorph [2] to align a brain template to the target for brain parcellation, have also been widely applied. However, most brain region segmentation methods are designed for normal brain images, while hydrocephalus images have much higher variations of anatomical structures compared to normal ones, due to the large deformations and lesion occupancies caused by the disease, as shown in Fig. 1. The brain anatomical structure information is much more complicated to encode when constructing the segmentation model, especially when the training data are also limited. Some attempts have been made to resolve these issues in hydrocephalus brain segmentation. Ledig et al. [10] intended to develop a multi-atlas-based method to segment the brain regions in TBI data, but their experiments reported failure cases. Ren et al. [12] proposed a two-stage framework with hard and soft attention modules to segment the brain regions of hydrocephalus brain images and demonstrated that it outperforms state-of-the-art methods such as UNet. However, it is not an end-to-end framework: the two modules are not fully integrated into a single network and thus cannot fully share the features they learn.

Fig. 1. Exemplar hydrocephalus MR brain images with consciousness-related regions.


In this paper, we intend to parcellate the 17 consciousness-related regions according to [8,12] from MR images of hydrocephalus patients. To resolve the issues in hydrocephalus brain segmentation, we focus on developing a novel end-to-end framework, with two modules to locate and segment the target regions from two different perspectives: (1) We use a registration guidance module to provide the segmentation network with more anatomical structure information about the absolute position of target regions in the whole brain. (2) We propose a novel Positional Correlation Attention Block (PCAB), integrated into the UNet, to improve performance by extracting more explicit structural features. In the PCAB, we design a Positional Correlation Layer (PCL) to extract the relative positional relationships between different brain regions and use them to refine the segmentation estimation via an attention layer. The absolute and relative position information provides complementary and comprehensive guidance to the segmentation network, which is designed as an end-to-end network for better information integration. Experiments show that it can outperform the alternatives with statistical significance, as evaluated by 5-fold cross-validation on 17 brain regions with large deformations in the collected hydrocephalus brain images.

2 Method

To reduce the impact of the deformation caused by the occupancy and erosion of lesion areas on normal brain areas, we propose two novel modules to extract more comprehensive anatomical structure information, from both the absolute location of regions in the whole brain and the correlation between each other’s locations. Figure 2 shows our proposed segmentation framework with a dual-path registration module and a Positional Correlation Attention Block (PCAB) module. The two networks are trained simultaneously, but only the segmentation network is used in the inference stage.

2.1 Guidance with Registration Module

We use a UNet-like network with a dual-path encoder to non-rigidly align the hydrocephalus subject (moving set) with the brain template (fixed set), as shown in Fig. 2(b). Specifically, we adopt a dual-path encoder that takes the original brain image $(I_M, I_F)$ and the brain segmentation mask $(M_M, M_F)$ as two-channel inputs to make the registration network focus on the ROI regions. The features extracted from the encoder are concatenated with the corresponding decoder features by skip connections to fuse low-level and high-level features. The decoder predicts a deformation field $\phi$ to align the masks of the different brain regions. Finally, the hydrocephalus brain image and brain region mask are warped using a Spatial Transformation Network (STN) according to $\phi$. To introduce more supervised information into the segmentation framework and improve robustness [13], we warp the output prediction map of the segmentation network according to $\phi$ and calculate the similarity with the template dataset. The segmentation and registration networks can then be trained simultaneously.
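The warping step can be sketched in PyTorch roughly as follows; the displacement-field layout and all names are assumptions, and the commented lines only indicate how a warped prediction could be compared with the template ROIs rather than reproducing the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp(volume, flow, mode="bilinear"):
    """Spatial-transformer style warping: resample a (B, C, D, H, W) tensor with a dense
    displacement field `flow` of shape (B, 3, D, H, W), whose channels are assumed to hold
    voxel displacements along (D, H, W). A generic sketch, not the authors' code."""
    b, _, d, h, w = volume.shape
    zs, ys, xs = torch.meshgrid(torch.linspace(-1, 1, d), torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
    identity = torch.stack((xs, ys, zs), dim=-1).unsqueeze(0).expand(b, -1, -1, -1, -1).to(volume)
    # Convert voxel displacements to normalised grid units and reorder (dz, dy, dx) -> (dx, dy, dz).
    scale = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1), 2.0 / max(d - 1, 1)]).to(volume)
    disp = flow.permute(0, 2, 3, 4, 1)[..., [2, 1, 0]] * scale
    return F.grid_sample(volume, identity + disp, mode=mode, align_corners=True)

# Hypothetical joint-training consistency term in the spirit of the text:
# seg_prob = seg_net(image).softmax(dim=1)        # segmentation prediction
# flow = reg_net(moving, fixed)                   # deformation field phi
# loss_c = F.mse_loss(warp(seg_prob, flow), template_rois)
```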


Fig. 2. The overview of the proposed method. The framework consists of a segmentation network and a registration network. The Positional Correlation Attention Blocks (PCAB) are integrated into the segmentation network. Note that only the segmentation network is used in the inference stage.

2.2 Segmentation with Positional Correlation Attention Block

We use a 3D UNet as our main segmentation network, and PCABs are integrated at different levels. The purpose of the PCAB is to use the location correlations of different brain regions to generate an attention map with anatomical structure information. The PCAB includes a Positional Correlation Layer (PCL) to estimate the location maps and an attention layer to generate the attention map that refines the input feature. As shown in Fig. 2(a), the PCAB is a plug-and-play block and is integrated into the first two levels of the encoder and the last two levels of the decoder in our study.

Positional Correlation Layer. Localization probability maps for each brain region, $A \in \mathbb{R}^{l\times w\times h\times 17}$, are calculated from the input feature $F \in \mathbb{R}^{l\times w\times h\times c}$ by a convolutional layer. Our PCL organises the target brain regions into two directed graph structures according to the correlations between each other’s locations; the graph is shown in Fig. 2(c). The probability map for every brain region is passed to the adjacent regions using a convolutional layer, according to the linkages between brain regions in the graph above. This method has been used in pose estimation [4]. To further expand receptive fields, we use 3D dilated convolution with a dilation rate of 2 and a filter size of 7 × 7. Note that, due to the large distance between some regions, we exclude 3 brain regions and construct 2 directed graph structures using the remaining 14 brain regions.


Specifically, let $A_k$ be the original feature map, where $k$ is the index of the brain region. The positional correlation for $A_k$ can be defined as

$$\tilde{A}_k = f\Big(A_k + \sum_{j \in N} C(A_j)\Big), \qquad (1)$$

where $j \in N$ means that $A_j$ is a brain region that has a linkage with $A_k$, $f$ is the ReLU function, and $C$ is the dilated convolution. Take $A_{11}$ as an example: $A_{11}$ is refined by receiving information from $\tilde{A}_9$ and $A_{10}$, so the updated $\tilde{A}_{11}$ after the PCL layer is

$$\tilde{A}_{11} = f\big(A_{11} + C(\tilde{A}_9) + C(A_{10})\big). \qquad (2)$$

Since the graphs start from $A_1$, $A_5$ and $A_{10}$, which do not receive information from other regions, these remain the same as the original ones. The other regions’ feature maps are refined in a similar way to $A_{11}$.

Attention Layer. The 17 location maps generated by the PCL may have different priorities for the segmentation task. Here we use an attention layer consisting of spatial attention and channel attention to exploit the most significant features and refine the input feature. First, we use a convolution layer with an output channel number of $c$ to generate a feature map $\alpha \in \mathbb{R}^{l\times w\times h\times c}$, which has the same size as the input feature. Next, we use global max pooling (GMP) and global average pooling (GAP) to obtain the global information of each channel, represented as $F_{max} \in \mathbb{R}^{1\times 1\times 1\times 17}$ and $F_{avg} \in \mathbb{R}^{1\times 1\times 1\times 17}$, respectively. Then, the channel attention coefficient $\beta \in \mathbb{R}^{1\times 1\times 1\times c}$ is calculated by a multi-layer perceptron consisting of two fully connected layers and a ReLU activation. The output feature of the PCAB, $\tilde{F}$, can be obtained as $\tilde{F} = F + \sigma(\alpha \cdot \beta)\cdot F$ and is fed into the next convolution block, where $\sigma$ denotes a sigmoid function.
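The attention layer can be sketched in PyTorch roughly as below; since the text does not state how the GMP and GAP summaries are combined before the MLP, summing their MLP outputs is our assumption, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PCABAttention(nn.Module):
    """Sketch of the PCAB attention step: a 1x1x1 convolution turns the 17 PCL location maps into a
    spatial map alpha with c channels; GMP and GAP summaries pass through a two-layer MLP to give a
    channel coefficient beta; the input feature is refined as F_out = F + sigmoid(alpha * beta) * F.
    Combining the pooled summaries by summation and the hidden size are our assumptions."""
    def __init__(self, in_channels, n_regions=17, hidden=32):
        super().__init__()
        self.spatial = nn.Conv3d(n_regions, in_channels, kernel_size=1)
        self.mlp = nn.Sequential(nn.Linear(n_regions, hidden), nn.ReLU(), nn.Linear(hidden, in_channels))

    def forward(self, feat, loc_maps):
        # feat: (B, c, D, H, W); loc_maps: (B, 17, D, H, W) produced by the PCL.
        alpha = self.spatial(loc_maps)                   # spatial attention, (B, c, D, H, W)
        gmp = torch.amax(loc_maps, dim=(2, 3, 4))        # global max pooling, (B, 17)
        gap = loc_maps.mean(dim=(2, 3, 4))               # global average pooling, (B, 17)
        beta = self.mlp(gmp) + self.mlp(gap)             # channel attention coefficient, (B, c)
        beta = beta.view(beta.shape[0], beta.shape[1], 1, 1, 1)
        return feat + torch.sigmoid(alpha * beta) * feat # refined feature fed to the next block

# refined = PCABAttention(64)(torch.randn(1, 64, 24, 24, 24), torch.randn(1, 17, 24, 24, 24))
```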

2.3 Training Strategy

The registration network loss consists of the following three components:

$$L_R = \beta_1 L_{NCC}(I_M(\phi), I_F) + \beta_2 L_{mask}(M_M(\phi), M_F) + \beta_3 L_{smooth}(\phi), \qquad (3)$$

where the first term is the local cross-correlation loss between the warped image and the fixed image (the same as in VoxelMorph), the second term calculates the mean square error (MSE) between the brain region masks, and the last term penalizes local spatial variations of $\phi$ to keep it smooth. The segmentation network is optimized by two aspects of supervision information. The loss function is written as follows:

$$L_S = \alpha_1 L_{seg}(\hat{Y}, Y_G) + \alpha_2 \frac{1}{N}\sum_{i=1}^{N} L_{loc}^{i}(\tilde{A}^i, A_G^i). \qquad (4)$$

The first term in Eq. (4) calculates the cross-entropy between the prediction map of the segmentation network $\hat{Y}$ and the ground-truth $Y_G$. The second term


calculates the weighted binary cross-entropy between the location maps $\tilde{A}^i$ from the PCL at each level $i$ and the ground-truth $A_G^i$. $A_G^i$ is the center of each brain region, dilated into a ball area with a radius of 3. To balance the loss of foreground and background, we use the inverse of the average pixel value of the ground-truth as the loss weight of the foreground. The equation for the second term is shown below, where $V^i$ denotes the sum of voxels for level $i$:

$$L_{loc}^{i}(\tilde{A}, A_G) = -\frac{V^i}{A_G^i}\, A_G^i \cdot \log \tilde{A}^i - \left(1 - A_G^i\right)\cdot \log\left(1 - \tilde{A}^i\right). \qquad (5)$$

During the training stage, we warp the segmentation map according to $\phi$ and calculate the MSE with the brain template ROIs, which is defined as

$$L_C = \gamma L_{mask}(\hat{Y}(\phi), M_F), \qquad (6)$$

where $L_C$ is used to train the segmentation network and registration network simultaneously, but only when the training epoch is greater than 15, in order to keep the training process stable. In this way, the overall loss function can be defined as $L = L_S + L_R + L_C$. Note that $\beta_1$, $\beta_2$, $\beta_3$ in Eq. (3), $\alpha_1$, $\alpha_2$ in Eq. (4) and $\gamma$ in Eq. (6) are hyper-parameters, and $\gamma$ is 0 when the training epoch is less than 15.
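As a small, hedged illustration of the second term, the sketch below implements a weighted binary cross-entropy for one PCL level with the foreground weight computed as described; the reduction over voxels is our assumption.

```python
import torch

def weighted_location_bce(pred, gt, eps=1e-6):
    """Weighted binary cross-entropy for one PCL level, following Eq. (5): the foreground term is
    weighted by the inverse of the mean ground-truth value (number of voxels / sum of the ground
    truth). The mean reduction to a scalar is our assumption."""
    pred = pred.clamp(eps, 1.0 - eps)
    gt = gt.float()
    w_fg = gt.numel() / (gt.sum() + eps)                 # inverse of the average ground-truth value
    loss = -(w_fg * gt * torch.log(pred) + (1.0 - gt) * torch.log(1.0 - pred))
    return loss.mean()

# The overall objective could then be assembled as L = L_S + L_R + L_C, enabling the consistency
# term only after epoch 15, e.g.: total = l_seg + l_reg + (gamma if epoch > 15 else 0.0) * l_mask
```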

3 Experiments and Results

3.1 Datasets and Experiments

The proposed method was evaluated on in-house T1-weighted MR brain data from 44 hydrocephalus patients; all subjects present with hematoma and hydrocephalus. The brain template used in the registration network is the Colin 27 Average Brain [7]. The 17 consciousness-related ROIs shown in Fig. 1 were manually delineated and used for training. Before the data were fed into the network, we preprocessed them with the following steps: the voxel spacing of the images was resampled to 1 mm × 1 mm × 1 mm, and histogram matching was then conducted for intensity normalization. In the registration module, the moving set was first affine-registered to the fixed set using the ANTs package [1]. The data were grouped into 5 folds for cross-validation and the ablation study. The network was implemented using PyTorch 1.6 and trained on an NVIDIA RTX Titan GPU. We used different learning rate settings for the two networks: 1e−3 for the registration network and 1e−4 for the segmentation network. All learning rates have a decay of 0.1 every 5 epochs, and training runs for 100 epochs. The Adam optimizer was adopted with a weight decay of 1e−4. Due to the limitation of the GPU’s memory size, the batch size was set to 1; therefore, we replaced BatchNorm with GroupNorm to reduce the effect of the small batch size, and the number of groups was set to 4.

3.2 Results

To verify the effect of the proposed method, we conduct ablation studies by adjusting the number of PCABs, denoted as N in Eq. (4), and the incorporation of


Table 1. Dice Coefficient of ablation studies and comparison with other methods. PCABs(N) means that there are N PCABs integrated into the framework.

Methods                    | Dice coefficient (%)
(a) Ablation studies
UNet (Baseline)            | 61.64 ± 21.60
UNet+Registration          | 63.85 ± 18.71
UNet+PCABs(2)              | 64.83 ± 18.67
UNet+PCABs(4)              | 66.00 ± 17.76
UNet+PCABs(4)+Registration | 69.03 ± 14.85
(b) Comparison with other methods
VoxelMorph [2]             | 39.23 ± 22.23
UNet [5]                   | 61.64 ± 21.60
nnUNet [9]                 | 63.37 ± 23.62
Ren et al. [12]            | 67.19 ± 17.18
Proposed                   | 69.03 ± 14.85

Fig. 3. Dice Coefficient of each brain region in different studies. (Color figure online)

the registration network, which are shown in Table 1(a). PCABs(2) means that the PCAB is only integrated at the first level and the last level of the decoder, while PCABs(4) integrates two more PCABs at the deeper levels, as in Fig. 2. We use the Dice Similarity Coefficient (DSC) as the evaluation metric. As presented in Table 1(a), the DSC is increased by 4.36% by applying PCABs to the baseline UNet, and performance also increases with the number of PCABs. The results indicate that the PCABs can help the network encode the variations of the anatomical structure by extracting positional correlation information. By training the segmentation network and registration network together, the DSC improved by a further 3.03% and the standard deviation decreased by 2.91%, which shows that the registration network further confirms the absolute position of the regions in the whole brain. The results also show that combining the two modules helps to achieve the best segmentation


Fig. 4. Visualization of segmentation results comparison of UNet and the proposed method.

performance. Figure 3 shows the distribution of Dice scores for each brain region in the different ablation studies. It can be concluded that the proposed method performs better than the baseline method. The final proposed method (red) has a higher average Dice score and lower standard deviation in most of the brain regions, indicating that our method can achieve more accurate and more stable results on the hydrocephalus dataset. We also compared our method with other state-of-the-art segmentation methods; the results are shown in Table 1(b). Our method achieved an average Dice score of 69.03% with a standard deviation of 14.85%, which is 1.84% higher than the recent state-of-the-art method [12]. The lower standard deviation also indicates that the result improves most on the hard subject samples, which have larger deformations than the others, and that the proposed method achieves more robust performance than the alternatives. Figure 4 visualizes the results from our segmentation framework and the UNet method. The first column is the ground truth, and the next columns are the results of UNet and the proposed method, respectively. As shown in the zoomed block, our method can locate and segment the brain regions, while UNet fails to do so. For cases with large deformation, there is a huge improvement in accuracy and robustness with our proposed method.

4 Conclusion

We have proposed a novel segmentation framework based on global registration guidance and local relative positional correlation guidance for the hydrocephalus dataset. First, we used a dual-path registration network to provide global anatomical structure information, which is important to the final segmentation results. In addition, to make the network focus more on the target regions, we integrated a Positional Correlation Attention Block (PCAB) into the UNet to generate a location attention map that refines the features of the encoder and decoder. The comparative results show that the proposed method achieves higher accuracy and a greater improvement for cases with larger deformations, indicating that our method has high accuracy and robustness.


Acknowledgements. This work was supported by the National Key Research and Development Program of China (2018YFC0116400), National Natural Science Foundation of China (NSFC) grants (62001292), Shanghai Pujiang Program (19PJ1406800), and Interdisciplinary Program of Shanghai Jiao Tong University.

References 1. Avants, B.B., Tustison, N.J., Song, G., Cook, P.A., Klein, A., Gee, J.C.: A reproducible evaluation of ants similarity metric performance in brain image registration. Neuroimage 54(3), 2033–2044 (2011) 2. Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: a learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 38(8), 1788–1800 (2019) 3. Chari, A., Czosnyka, M., Richards, H.K., Pickard, J.D., Czosnyka, Z.H.: Hydrocephalus shunt technology: 20 years of experience from the Cambridge shunt evaluation laboratory. J. Neurosurg. 120(3), 697–707 (2014) 4. Chu, X., Ouyang, W., Li, H., Wang, X.: Structured feature learning for pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4715–4723 (2016) ¨ Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-net: 5. C ¸ i¸cek, O., learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-31946723-8 49 6. Ghafoorian, M., et al.: Location sensitive deep convolutional neural networks for segmentation of white matter hyperintensities. Sci. Rep. 7(1), 1–12 (2017) 7. Holmes, C.J., Hoge, R., Collins, L., Woods, R., Evans, A.C.: Enhancement of MR images using registration for signal averaging. J. Comput. Assist. Tomogr. 3(2), 324–333 (1998) 8. Huo, J., et al.: Neuroimage-based consciousness evaluation of patients with secondary doubtful hydrocephalus before and after lumbar drainage. Neurosci. Bull. (9) (2020) 9. Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H.: nnU-net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18(2), 203–211 (2021) 10. Ledig, C., et al.: Robust whole-brain segmentation: application to traumatic brain injury. Med. Image Anal. 21(1), 40–58 (2015) 11. Moeskops, P., Viergever, M.A., Mendrik, A.M., De Vries, L.S., Benders, M.J., Iˇsgum, I.: Automatic segmentation of MR brain images with a convolutional neural network. IEEE Trans. Med. Imaging 35(5), 1252–1261 (2016) 12. Ren, X., Huo, J., Xuan, K., Wei, D., Zhang, L., Wang, Q.: Robust brain magnetic resonance image segmentation for hydrocephalus patients: Hard and soft attention. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 385–389. IEEE (2020) 13. Xu, Z., Niethammer, M.: DeepAtlas: joint semi-supervised learning of image registration and segmentation. In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 420–429. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32245-8 47

Brain Networks and Time Series

Geometric Deep Learning of the Human Connectome Project Multimodal Cortical Parcellation

Logan Z. J. Williams1,2(B), Abdulah Fawaz2, Matthew F. Glasser3, A. David Edwards1,4,5, and Emma C. Robinson1,2

1 Centre for the Developing Brain, Department of Perinatal Imaging and Health, School of Biomedical Engineering and Imaging Sciences, King’s College London, London SE1 7EH, UK [email protected]
2 Department of Biomedical Engineering, School of Biomedical Engineering and Imaging Science, King’s College London, London SE1 7EH, UK
3 Departments of Radiology and Neuroscience, Washington University Medical School, Saint Louis, MO 63110, USA
4 Department for Forensic and Neurodevelopmental Sciences, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London SE5 8AF, UK
5 MRC Centre for Neurodevelopmental Disorders, King’s College London, London SE1 1UL, UK

Abstract. Understanding the topographic heterogeneity of cortical organisation is an essential step towards precision modelling of neuropsychiatric disorders. While many cortical parcellation schemes have been proposed, few attempt to model inter-subject variability. For those that do, most have been proposed for high-resolution research quality data, without exploration of how well they generalise to clinical quality scans. In this paper, we benchmark and ensemble four different geometric deep learning models on the task of learning the Human Connectome Project (HCP) multimodal cortical parcellation. We employ Monte Carlo dropout to investigate model uncertainty with a view to propagate these labels to new datasets. Models achieved an overall Dice overlap ratio of >0.85 ± 0.02. Regions with the highest mean and lowest variance included V1 and areas within the parietal lobe, and regions with the lowest mean and highest variance included areas within the medial frontal lobe, lateral occipital pole and insula. Qualitatively, our results suggest that more work is needed before geometric deep learning methods are capable of fully capturing atypical cortical topographies such as those seen in area 55b. However, information about topographic variability between participants was encoded in vertex-wise uncertainty maps, suggesting a potential avenue for projection of this multimodal parcellation to new datasets with limited functional MRI, such as the UK Biobank.

Keywords: Human connectome project · Geometric deep learning · Cortical parcellation


1 Introduction

Cortical parcellation is the process of segmenting the cerebral cortex into functionally specialised regions. Most often, these are defined using sulcal morphology [5], and are propagated to individuals from a population-average template (or set of templates) based on the correspondence of cortical shape [17,28]. By contrast, while it is possible to capture subject-specific cortical topography from functional imaging in a data-driven way [14], it is difficult to perform populationbased comparisons with these approaches as they typically result in parcellations where the number and topography of the parcels vary significantly across subjects [15]. Notably, even following image registration methods that use both structural and functional information [25,26], considerable topographic variation remains across individuals [10,19]. Recently, [10] achieved state-of-the-art cortical parcellation through hand annotation of a group-average multimodal magnetic resonance imaging (MRI) atlas from the Human Connectome Project (HCP). Specifically, a sharp group average of cortical folding, cortical thickness, cortical myelination, task and resting state functional MRI (fMRI), were generated through novel multi-modal image registration [25] driven by ‘areal features’: specifically T1w/T2w ratio (cortical myelin) [13] and cortical fMRI; modalities which are known to more closely reflect the functional organisation of the brain. This improved alignment allowed for manual annotation of regional boundaries via identification of sharp image gradients, consistent across modalities. With this group average template, they trained a multi-layer perceptron (MLP) classifier to recognise the multimodal ‘fingerprint’ of each cortical area. This approach allowed [10] to propagate parcellations from labelled to unlabelled subjects, in a registrationindependent manner, also providing an objective method to validate parcellation in an independent set of test participants. This classifier detected 96.6% of the cortical areas in test participants, and could correctly parcellate areas in individuals with atypical topography [10]. However, even in this state-of-the-art approach, the classifier was still unable to detect 3.4% of areas across all subjects [10]. Moreover, they were unable to replicate previously identified parcels in regions such as the orbitofrontal cortex [23] and the association visual cortex [1]. It is also unknown whether this classifier generalises to different populations with lower quality data, for example the UK Biobank [21] and the Developing Human Connectome Project [20]. Thus, development of new tools that improve upon areal detection and allow generalisation of this parcellation to new populations with less functional MRI data is warranted. To this end, we consider convolutional neural networks (CNNs), which have proven state-of-the-art for many 2D and 3D medical imaging tasks [3,16]. More specifically, we benchmark a range of different geometric deep learning (gDL) frameworks, since these adapt CNNs to irregular domains such as surfaces, meshes and graphs [2]. The specific contributions of this paper are as follows: 1. We propose a novel framework for propagating the HCP cortical parcellation [10] to new surfaces using gDL methods. These offer a way to improve


over vertex-wise classifiers (as used by [10]) by additionally learning the spatial context surrounding different image features. 2. Since gDL remains an active area of research, with several complementary approaches for implementing surface convolutions, we explore the potential to improve performance by ensembling predictions made across a range of models. 3. Given the degree of heterogeneity and anticipated problems in generalising to new data, we return estimates of model uncertainty using techniques for Bayesian deep learning implemented using Monte Carlo dropout [8].

2 Methods

2.1 Participants and Image Acquisition

A total of 390 participants from the HCP were included in this study. Acquisition and minimal preprocessing pipelines are described in [12]. Briefly, modalities included T1w and T2w structural images, task-based and resting state-based fMRI images, acquired at high spatial and temporal resolution on a customized Siemens 3 T (3T) scanner [12]. From these, a set of 110 features were derived and used as inputs for cortical parcellation: 1 thickness map corrected for curvature, 1 T1w/T2w map [13], 1 surface curvature map, 1 mean task-fMRI activation map, 20 task-fMRI component contrast maps, 77 surface resting state fMRI maps (from a d = 137 independent component analysis), and 9 visuotopic features. This differs from 112 features used by the MLP classifier in [10] in that artefact features were not included and visuotopic spatial regressors were included. Individual subject parcellations predicted by the MLP classifier were used as labels for training each gDL model, as there are no ground truth labels available for multimodal parcellation in the HCP. 2.2

Modelling the Cortex as an Icosphere

For all experiments, the cortical surface was modelled as a regularly tessellated icosphere: a choice which reflects strong evidence that, for many parts of the cortex, cortical shape is a poor correlate of cortical functional organisation [7,10]. Icospheres also offer many advantages for deep learning. Since their vertices form regularly spaced hexagons, icospheric meshes allow consistently shaped spatial filters to be defined and lend themselves to straightforward upsampling and downsampling. This generates a hierarchy of regularly tessellated spheres over multiple resolutions, which is particularly useful as it allows deep learning models to aggregate information through pooling. 2.3

Image Processing and Augmentation

Spherical meshes and cortical metric data (features and labels) for each subject were resampled from the 32k (FS LR) HCP template space [30], to a sixth-order


icosphere (with 40,962 vertices). Input spheres were augmented using non-linear spherical warps estimated by: first, randomly displacing the vertices of a 2nd order icospheric mesh; then propagating these deformations to the input meshes using barycentric interpolation. In total, 100 warps were simulated, and these were randomly sampled from during training. Cortical metric data were then normalised to a mean and standard deviation of 0 and 1 respectively, using precomputed group means and standard deviations per feature.

Fig. 1. Mean (top row) and standard deviation (bottom row) (a) Dice overlap ratio, (b) recall score (c) and precision score per region for gDL ensemble. Mean (top row) and standard deviation (bottom row) Dice overlap ratio per region for (d) ensemble - ChebNet, (e) ensemble - GConvNet, (f) GConvNet - MoUNet, and (g) ensemble Spherical UNet

2.4 Model Architecture and Implementation

Geometric convolutions may be broadly classified into spatial or spectral methods, referring to the domain in which the convolution is computed (see [2,9] for more details). In brief, spatial methods [22,32] simulate the familiar concept of passing a localised filter over the surface. In practice, while expressive, such methods only approximate mathematically exact convolutions, since the lack of a single, fixed coordinate system makes it impossible to slide a filter over a curved surface whilst maintaining a consistent filter orientation. Spectral methods, on the other hand, utilise an alternate representation in which the (generalised) Fourier transform of a convolution of two functions may be represented by the product of their Fourier transforms. As full spectral methods are computationally expensive, it is standard practice to address this through polynomial approximation [4]. Each method therefore results in different compromises and for that reason offers complementary solutions, which in principle may be combined to improve performance. In this paper, we therefore benchmark and ensemble two spatial networks, Spherical U-Net [32] and MoUNet [22], and two spectral (polynomial approximation) methods, ChebNet [4] and GConvNet [18]. In each case, methods were implemented with a U-Net-like [27] architecture with a 6-layer encoder and decoder, and upsampling was performed using transpose convolution (as implemented by [32]). Code for Spherical U-Net was taken from its GitHub repository (https://github.com/zhaofenqiang/Spherical U-Net), and ChebNet, GConvNet and MoUNet were written using PyTorch Geometric [6]. Optimisation was performed using Adam, with an unweighted Dice loss and learning rates of 1 × 10−3 (for Spherical U-Net) and 1 × 10−4 (for all other models). All models were implemented on a Titan RTX 24 GB GPU, with batch size limited to 1 due to memory constraints (resulting from the high dimension of input channels). Models were trained and tested with a train/validation/test split of 338/26/26, using data from both left and right hemispheres. Following training, an unweighted ensemble approach was taken, where one-hot encoded predictions for a single test subject were averaged across all gDL models. Model performance was also assessed using weighted recall and precision scores. Finally, uncertainty estimation was implemented using test-time dropout [8] (with p = 0.2, the probability of an input channel being dropped). Vertex-wise uncertainty maps were produced by repeating dropout 200 times and then calculating the standard deviation of the predicted parcellation at each vertex per subject.
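For illustration only, the fragment below sketches the main ingredients described above in PyTorch / PyTorch Geometric: a heavily simplified graph-convolutional model (a stand-in for the full 6-layer U-Net), an unweighted soft Dice loss, and Monte Carlo dropout for vertex-wise uncertainty. Layer sizes, the 180-area output dimension, and the form of the uncertainty summary are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv  # ChebConv(in, out, K) is a drop-in alternative

class TinyGConvNet(torch.nn.Module):
    def __init__(self, in_ch=110, hidden=64, n_classes=180, p_drop=0.2):
        super().__init__()
        # Dropout (simplified; the paper drops whole input channels with p = 0.2).
        self.drop = torch.nn.Dropout(p=p_drop)
        self.conv1 = GCNConv(in_ch, hidden)
        self.conv2 = GCNConv(hidden, n_classes)

    def forward(self, x, edge_index):
        # x: (n_vertices, n_channels); edge_index encodes the icosphere mesh.
        x = self.drop(x)
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)          # (n_vertices, n_classes) logits

def dice_loss(logits, one_hot_labels, eps=1e-6):
    """Unweighted (mean-over-classes) soft Dice loss."""
    probs = logits.softmax(dim=-1)
    inter = (probs * one_hot_labels).sum(dim=0)
    denom = probs.sum(dim=0) + one_hot_labels.sum(dim=0)
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

@torch.no_grad()
def mc_dropout_uncertainty(model, x, edge_index, n_samples=200):
    """Vertex-wise uncertainty as the spread of predictions under dropout."""
    model.train()                                  # keep dropout active at test time
    samples = torch.stack([model(x, edge_index).softmax(-1) for _ in range(n_samples)])
    return samples.std(dim=0).mean(dim=-1)         # one value per vertex
```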

3 Results

Table 1 shows the overall performance of each model on HCP parcellation, using single-subject cortical maps predicted by the HCP MLP classifier. All methods perform well, achieving a Dice overlap ratio of >0.85, a recall score of >0.82 and a precision score of >0.85. The mean and standard deviation of the Dice overlap ratio, recall and precision scores per area are shown for GConvNet (the best performing model) in Fig. 1a–c. V1 and cortical areas in the parietal lobe had a higher mean and lower standard deviation Dice overlap ratio, whilst cortical areas in the medial frontal lobe, occipital pole and insula had a lower mean and higher standard deviation Dice overlap ratio. At the level of a single cortical region, the mean and standard deviation Dice overlap ratio varied across models (Fig. 1b–d), and this regional variability was exploited through an ensemble approach to improve parcellation performance (Table 1). The ability of gDL models to detect atypical cortical topography was assessed qualitatively in a test-set participant in whom area 55b was split into three distinct parcels by the frontal and posterior eye fields [11]. This showed that, while none of the gDL models predicted this split (Fig. 2b), vertex-wise uncertainty maps highlighted the split as a region of uncertainty. Figure 3b demonstrates that the most likely labels for this subject (at the vertex marked with a white dot) were the frontal eye fields (184/200 epochs) and area 55b (16/200 epochs). By contrast, at a similar vertex location in a subject with typical parcellation in area 55b (Fig. 3c), there was no uncertainty in the estimated label, with area 55b predicted across all epochs.


Fig. 2. Label border (a) and border estimated by the gDL ensemble (b) for a test-set participant with atypical area 55b topography. Borders are overlaid on the T1w/T2w map, the functional map from the HCP language task, and the functional connectivity map highlighting the frontal and posterior eye fields.

Table 1. Mean ± standard deviation Dice overlap ratio, recall, and precision for all four geometric deep learning methods and the unweighted ensemble approach

Method            Dice overlap ratio   Recall           Precision
ChebNet           0.871 ± 0.021        0.839 ± 0.024    0.862 ± 0.016
GConvNet          0.875 ± 0.020        0.843 ± 0.230    0.865 ± 0.015
MoUNet            0.873 ± 0.021        0.841 ± 0.023    0.864 ± 0.015
Spherical UNet    0.860 ± 0.021        0.825 ± 0.022    0.851 ± 0.013
Ensemble          0.880 ± 0.019        0.848 ± 0.022    0.860 ± 0.019

Beyond area 55b, gDL models often predicted cortical areas as single contiguous parcels, whereas the HCP MLP classifier predicted some cortical areas as being comprised of several smaller, topographically distinct parcels. This uncertainty relative to the MLP is further emphasised by the findings from the Monte Carlo dropout uncertainty modelling, which showed that uncertainty tended to be greatest along the boundaries between regions and was higher in locations where more than two regions met.


Fig. 3. (a) Example of a vertex-wise uncertainty map produced using Monte Carlo dropout (MoUNet). (b) From left to right: label, estimate, and vertex-wise uncertainty map in a subject with atypical topography of area 55b. (c) From left to right: label, estimate, and vertex-wise uncertainty map in a subject with typical topography of area 55b.

4 Discussion

Developing methods that capture the topographic variability of cortical organisation is essential for precision modelling of neuropsychiatric disorders. Here we show that gDL methods achieve good performance in predicting individual subjects' cortical organisation when trained on labels output by the HCP MLP classifier. Even though overall metrics of regional overlap were high, there was marked variability across cortical areas. These findings are in part a consequence of using an unweighted Dice loss, since, in this case, mislabelling single vertices of smaller cortical areas will have less impact than for larger ones [24]. This is reflected in the results above, where larger regions, e.g. V1 and cortical areas in the parietal lobe, had a higher mean and lower standard deviation Dice overlap ratio, recall and precision score per region, compared to smaller regions in the medial frontal lobe, insula, and lateral occipital lobe, which had a lower mean and higher standard deviation. This inherent limitation of the Dice overlap ratio might also explain why GConvNet (the gDL method with the smallest kernel size) performed the best, as it was capable of learning very localised features. The variation in performance across cortical areas also differed between models, which suggests that each gDL model is learning a different set of features. This was expected given the theoretical differences in how each model's convolution is defined. Utilising these differences in an ensemble approach improved the Dice overlap ratio by 0.005 (0.5%) above GConvNet, which translates to an overlap improvement of approximately 200 vertices on a 6th-order icosphere (on the same icosphere, area 55b is only 123 vertices in size).

Although not described here, we also trained these gDL methods using a generalised (weighted) Dice overlap ratio, as described by [29], that is designed to address class imbalance, but found that it did not perform as well as the unweighted Dice overlap ratio. This suggests that future work on improving model performance should, in part, address the limitations of common image segmentation losses in the context of multimodal cortical parcellation.

Qualitative assessment of gDL model performance on cortical parcellation is essential for investigating topographic variability, as this information is not fully captured by performance metrics. Although subjects with atypical topography of area 55b were included in the training set, none of the gDL methods were able to correctly identify this topography in a test-set subject. Specifically, all models predicted area 55b as a contiguous parcel, in contrast to the HCP MLP prediction, where it was split into three smaller areas by the frontal and posterior eye fields. The atypical topography of area 55b in this subject was confirmed manually from the features known to contribute to its multimodal fingerprint (namely, the T1w/T2w ratio, the HCP language task contrast "Story vs. Baseline", and the resting-state functional connectivity map) [11]. The performance of the gDL models in area 55b highlights the overall tendency of these models to predict cortical areas as contiguous regions compared to those predicted by the HCP MLP classifier. This behaviour might be a result of CNNs learning spatial context, and of a strong bias towards learning typical topographic organisation due to downsampling and skip connections in the U-Net architecture. In contrast, the HCP MLP was trained to classify each vertex independently using limited spatial context (a 30 mm radius searchlight across the surface) [10]. The importance of spatial contiguity in defining cortical areas is unknown, but given the lack of ground truth it is difficult to evaluate which approach is more accurate without extensive further qualitative and quantitative evaluation. However, these results do suggest that each model introduces unique biases that need to be accounted for when investigating cortical organisation and neuropsychiatric disorders.

Achieving multimodal cortical parcellation in datasets beyond the HCP will be invaluable for precision modelling of neuropsychiatric disorders. However, generalising these multimodal labels to other datasets such as the UK Biobank (healthy ageing adults) [21] and the Developing Human Connectome Project (term and preterm neonates) [20] is challenging due to differences in population demographics and data acquisition (less data, of lower quality). The vertex-wise uncertainty maps introduced here provide a quantitative method to evaluate label propagation, which could also be used to inform post-processing of individual participant cortical parcellations, similar in nature to [10]. The information about topographic variability encoded in these vertex-wise maps might also provide a way to investigate atypical topography in less explored cortical areas.

Acknowledgements. Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University [31].


References

1. Abdollahi, R.O., et al.: Correspondences between retinotopic areas and myelin maps in human visual cortex. Neuroimage 99, 509–524 (2014)
2. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond Euclidean data. IEEE Signal Process. Mag. 34(4), 18–42 (2017)
3. Chen, H., Dou, Q., Yu, L., Qin, J., Heng, P.A.: VoxResNet: deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage 170, 446–455 (2018)
4. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. arXiv preprint arXiv:1606.09375 (2016)
5. Desikan, R.S., et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31(3), 968–980 (2006)
6. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
7. Frost, M.A., Goebel, R.: Measuring structural-functional correspondence: spatial variability of specialised brain regions after macro-anatomical alignment. Neuroimage 59(2), 1369–1381 (2012)
8. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learning, pp. 1050–1059. PMLR (2016)
9. Given, N.A.: Benchmarking geometric deep learning for cortical segmentation and neurodevelopmental phenotype prediction (2021, in preparation)
10. Glasser, M.F., et al.: A multi-modal parcellation of human cerebral cortex. Nature 536(7615), 171–178 (2016)
11. Glasser, M.F., et al.: The Human Connectome Project's neuroimaging approach. Nat. Neurosci. 19(9), 1175–1187 (2016)
12. Glasser, M.F., et al.: The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage 80, 105–124 (2013)
13. Glasser, M.F., Van Essen, D.C.: Mapping human cortical areas in vivo based on myelin content as revealed by T1- and T2-weighted MRI. J. Neurosci. 31(32), 11597–11616 (2011)
14. Gordon, E.M., et al.: Individual-specific features of brain systems identified with resting state functional correlations. Neuroimage 146, 918–939 (2017)
15. Gratton, C., et al.: Defining individual-specific functional neuroanatomy for precision psychiatry. Biol. Psychiatr. 88, 28–39 (2019)
16. Havaei, M., et al.: Brain tumor segmentation with deep neural networks. Med. Image Anal. 35, 18–31 (2017)
17. Heckemann, R.A., Hajnal, J.V., Aljabar, P., Rueckert, D., Hammers, A.: Automatic anatomical brain MRI segmentation combining label propagation and decision fusion. NeuroImage 33(1), 115–126 (2006)
18. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
19. Kong, R., et al.: Spatial topography of individual-specific cortical networks predicts human cognition, personality, and emotion. Cereb. Cortex 29(6), 2533–2551 (2019)
20. Makropoulos, A., et al.: The developing human connectome project: a minimal processing pipeline for neonatal cortical surface reconstruction. Neuroimage 173, 88–112 (2018)
21. Miller, K.L., et al.: Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat. Neurosci. 19(11), 1523–1536 (2016)
22. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., Bronstein, M.M.: Geometric deep learning on graphs and manifolds using mixture model CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
23. Öngür, D., Ferry, A.T., Price, J.L.: Architectonic subdivision of the human orbital and medial prefrontal cortex. J. Comp. Neurol. 460(3), 425–449 (2003)
24. Reinke, A., et al.: Common limitations of image processing metrics: a picture story. arXiv preprint arXiv:2104.05642 (2021)
25. Robinson, E.C., et al.: Multimodal surface matching with higher-order smoothness constraints. Neuroimage 167, 453–465 (2018)
26. Robinson, E.C., et al.: MSM: a new flexible framework for multimodal surface matching. Neuroimage 100, 414–426 (2014)
27. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
28. Sabuncu, M.R., Yeo, B.T., Van Leemput, K., Fischl, B., Golland, P.: A generative model for image segmentation based on label fusion. IEEE Trans. Med. Imaging 29(10), 1714–1729 (2010)
29. Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Cardoso, M.J.: Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Cardoso, M., et al. (eds.) DLMIA 2017, ML-CDS 2017. LNCS, vol. 10553, pp. 240–248. Springer (2017). https://doi.org/10.1007/978-3-319-67558-9_28
30. Van Essen, D.C., Glasser, M.F., Dierker, D.L., Harwell, J., Coalson, T.: Parcellations and hemispheric asymmetries of human cerebral cortex analyzed on surface-based atlases. Cereb. Cortex 22(10), 2241–2262 (2012)
31. Van Essen, D.C., et al.: The WU-Minn Human Connectome Project: an overview. Neuroimage 80, 62–79 (2013)
32. Zhao, F., et al.: Spherical U-Net on cortical surfaces: methods and applications. In: Chung, A., Gee, J., Yushkevich, P., Bao, S. (eds.) IPMI 2019. LNCS, vol. 11492, pp. 855–866. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20351-1_67

Deep Stacking Networks for Conditional Nonlinear Granger Causal Modeling of fMRI Data

Kai-Cheng Chuang1,2, Sreekrishna Ramakrishnapillai2, Lydia Bazzano3, and Owen T. Carmichael2(B)

1 Medical Physics Graduate Program, Louisiana State University, Baton Rouge, LA, USA
2 Biomedical Imaging Center, Pennington Biomedical Research Center, Baton Rouge, LA, USA
{Kai.Chuang,Owen.Carmichael}@pbrc.edu
3 Department of Epidemiology, Tulane School of Public Health and Tropical Medicine, New Orleans, LA, USA

Abstract. Conditional Granger causality, based on functional magnetic resonance imaging (fMRI) time series signals, is the quantification of how strongly brain activity in a certain source brain region contributes to brain activity in a target brain region, independent of the contributions of other source regions. Current methods to solve this problem are either unable to model nonlinear relationships between source and target signals, unable to efficiently quantify time lags in source-target relationships, or require ad hoc parameter settings and post hoc calculations to assess conditional Granger causality. This paper proposes the use of deep stacking networks, with dilated convolutional neural networks (CNNs) as component parts, to address these challenges. The dilated CNNs nonlinearly model the target signal as a function of source signals. Conditional Granger causality is assessed in terms of how much modeling fidelity increases when additional dilated CNNs are added to the model. Time lags between source and target signals are estimated by analyzing estimated dilated CNN parameters. Our technique successfully estimated conditional Granger causality, did not spuriously identify false causal relationships, and correctly estimated time lags when applied to synthetic datasets and data generated by the STANCE fMRI simulator. When applied to real-world task fMRI data from an epidemiological cohort, the method identified biologically plausible causal relationships among regions known to be task-engaged and provided new information about causal structure among sources and targets that traditional single-source causal modeling could not provide. The proposed method is promising for modeling complex Granger causal relationships within brain networks.

Keywords: Functional causal modeling (FCM) · Deep stacking networks (DSNs) · Functional magnetic resonance imaging (fMRI)

1 Introduction

Conditional Granger causal modeling has become an important concept in neuroscience as functional neuroimaging methods, including functional magnetic resonance imaging (fMRI), enable the precise mapping of brain activity within neural circuitry underlying perception, cognition, and behavior [1, 2]. Conditional Granger causality (CGC) is defined as the unique influence that activity in one brain region (the "source") exerts over activity in another region (the "target") after accounting for the influence of activity in other source regions [3–5] (Fig. 1). CGC provides deeper information about brain circuit functioning than correlational concepts of functional connectivity by capturing directional relationships that underlie information flow during rest or task execution. Charting such directional relationships is important to the study of brain development, dysfunction during diseases, and degeneration during aging, as well as normal functioning during perception, cognition, behavior, and consciousness [2, 6–9].

Fig. 1. (a) A hypothetical three-time-series dataset, where X_t has a strong linear relationship with Y_{t+1}, suggesting a causal relationship between X and Y with a time lag of 1 time step. Y_t has a nonlinear causal relationship with Z_t with a time lag of 3 time steps. (b) Conditional Granger causality analysis allows us to determine that while X and Z appear to have a nonlinear causal relationship with a time lag of 4 time steps, this relationship is really just a reflection of causal relationships with the intermediary source Y.

The multiple vector auto-regression (MVAR) model has been widely used to assess conditional Granger causality. This paper addresses two key challenges limiting the utility of the MVAR model. First, nonlinear relationships between source and target signals (caused by complex neural dynamics giving rise to those signals) are commonly studied in neuroscience but are not well modeled by the MVAR model, which is inherently linear [10–16]. In addition, identifying time lags in MVAR is computationally complex, typically requiring estimation of all models representing all possible time lags followed by model scoring using the Akaike information criterion (AIC) or similar criteria [16–18]. Kernel Granger causality (KGC), extended Granger causality, recurrent neural network methods [19–21], and multilayer perceptron methods [22] solve the nonlinearity problem for traditional single-source causality but do not address conditional Granger causality. Nauta et al. proposed the Temporal Causal Discovery Framework (TCDF), an attention-based dilated convolutional neural network (CNN) that solves the Granger causality problem efficiently while allowing fast estimation of time lags via interpretation of dilated CNN internal parameters [23]. However, the resulting causalities depend on operating parameters (thresholds on attention scores and a loss function) that may be difficult to set.

We propose a machine learning algorithm called deep stacking networks (DSNs), which improves upon CGC (Fig. 2(a)) by using dilated CNNs to estimate nonlinear relationships with efficient time lag estimation (Fig. 2(b)). Computational efficiency of training is an additional benefit of DSNs, as each dilated CNN can be trained separately from the others and some can be trained in parallel.

Fig. 2. (a) MVAR assesses X as a source for the target Z, conditioned on Y (X → Z|Y), by comparing the prediction errors of VAR models that use only Y and Z vs. X, Y, and Z simultaneously. (b) A deep stacking network solves this same problem using a series of CNNs, providing modeling of nonlinear causal relationships and efficient computation of time lags.

2 Materials and Methods

2.1 Deep Stacking Network

The philosophy of DSN design is based on the concept of stacking proposed by Wolpert, where simple modules of functions or classifiers are trained first, and then they are "stacked" on top of each other to compose complex functions or classifiers [24]. Deng et al. presented the basic form of the DSN architecture, which consists of many stacking modules, each of which takes the simplified form of a shallow multilayer perceptron using convex optimization for learning perceptron weights [25, 26]. The 'stacking' architecture is built by concatenating the output from previous modules with the original input to form the new "input" for the next module. Each layer of hidden units learns to represent features that capture higher-order correlations within the input data. One benefit of DSNs is scalability, owing to their ability to train modules in parallel [25, 26].

2.2 Conditional Nonlinear Granger Causal Modeling with DSN

Following the above example of sources X and Y, and target Z, we train a DSN with individual dilated CNNs to transform X_t, Y_t, and Z_t into Z_t, resulting in Z_t estimates Ẑ_{t,1}, Ẑ_{t,2}, ..., Ẑ_{t,5} (see Algorithm 1 and Fig. 2(b)). First, to evaluate the effect of Y and Z on Z, dilated CNNs 1 and 2 are trained respectively to transform Y_t and Z_t into Z_t, resulting in estimates Ẑ_{t,1} and Ẑ_{t,2} and prediction errors ε_{t,1} and ε_{t,2}. Then, to represent the best estimate of Z based on both Y and Z, Ẑ_{t,1} and Ẑ_{t,2} are both provided as inputs to the third module, which estimates an element-wise weighted sum of the inputs to predict Z_t, resulting in estimate Ẑ_{t,3} and prediction error ε_{t,3}. (An element-wise weighted sum is used instead of another dilated CNN because a dilated CNN complicates the assessment of causalities and estimation of time lags.) To assess the effect of X on Z, the time series X_t is then provided as input to dilated CNN 3, again with Z_t as the target, resulting in predicted time series Ẑ_{t,4} and prediction error ε_{t,4}. To model Z in terms of X, Y, and Z, Ẑ_{t,3} and Ẑ_{t,4} are both inputs to an element-wise weighted sum that produces the final estimate of Z, Ẑ_{t,5}, and prediction error ε_{t,5}. The conditional Granger causality of X to Z, conditioned on Y, is defined in terms of the reduction in modeling error when X, Y, and Z are used to model Z, compared to when only Y and Z are used to model Z:

GC_index_{X→Z|Y} = ln( var(ε_{t,3}) / var(ε_{t,5}) )    (1)

If incorporating X improves the modeling of Z after accounting for effects of Y, GC_index_{X→Z|Y} will be a large positive number, providing evidence for conditional Granger causality between X and Z, conditioned on Y. Complex causal relationships among several time series can be disentangled by calculating conditional Granger causality with differing assignments of time series to X, Y, and Z.

Algorithm 1: Conditional nonlinear Granger causal modeling with DSN
DSN module (input(s), target, estimate, prediction error)
Step 1. Train Dilated CNN 1 (Y_t, Z_t, Ẑ_{t,1}, ε_{t,1}) and Dilated CNN 2 (Z_t, Z_t, Ẑ_{t,2}, ε_{t,2})
Step 2. Estimate Element-wise weighted sum ([Ẑ_{t,1}, Ẑ_{t,2}], Z_t, Ẑ_{t,3}, ε_{t,3})
Step 3. Train Dilated CNN 3 (X_t, Z_t, Ẑ_{t,4}, ε_{t,4})
Step 4. Estimate Element-wise weighted sum ([Ẑ_{t,3}, Ẑ_{t,4}], Z_t, Ẑ_{t,5}, ε_{t,5})
Step 5. Calculate GC_index_{X→Z|Y} (Eq. (1))
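To make Eq. (1) and the weighted-sum modules concrete, here is a small illustrative Python sketch (not the authors' code); the closed-form least-squares fit stands in for however the element-wise weights are actually learned:

```python
import numpy as np

def gc_index(eps_t3: np.ndarray, eps_t5: np.ndarray) -> float:
    """Eq. (1): log ratio of residual variances.

    eps_t3: residuals of the model using only Y and Z (Step 2);
    eps_t5: residuals of the full model using X, Y and Z (Step 4).
    """
    return float(np.log(np.var(eps_t3) / np.var(eps_t5)))

def weighted_sum_fit(z_hat_a, z_hat_b, z_target):
    """Stand-in for the weighted-sum module: Z_hat = a * Z_hat_A + b * Z_hat_B."""
    A = np.stack([z_hat_a, z_hat_b], axis=1)          # (T, 2) design matrix
    w, *_ = np.linalg.lstsq(A, z_target, rcond=None)  # weights for the two inputs
    z_hat = A @ w
    return z_hat, z_target - z_hat                     # estimate and residuals
```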

Nonlinear parametric ReLU (PReLU) activation functions were used in each hidden layer to capture both linear and nonlinear relationships between source and target regions. Each dilated CNN had 3 hidden layers, a 1D kernel, a kernel size of 2, and zero causal padding, inspired by the well-known WaveNet architecture [23, 27]. A dilated convolution applies the kernel over a span larger than its own length by skipping input values with a fixed step (the dilation rate). Zero causal padding was used to prevent future time point values from influencing the prediction of current time points. To efficiently estimate time lags in source-target relationships, we interpreted the kernel weights W_i in each dilated CNN.
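For illustration, a hedged PyTorch sketch of one such WaveNet-style dilated causal CNN module follows; the dilation schedule, channel width and output head are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution with left-only zero padding, so time t never sees t+1 onwards."""
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # pad on the left only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return self.conv(x)

class DilatedCNN(nn.Module):
    """Three hidden causal layers with kernel size 2 and PReLU activations."""
    def __init__(self, channels=8):
        super().__init__()
        layers, in_ch = [], 1
        for d in (1, 2, 4):                            # illustrative dilation schedule
            layers += [CausalConv1d(in_ch, channels, dilation=d), nn.PReLU()]
            in_ch = channels
        layers += [nn.Conv1d(channels, 1, kernel_size=1)]  # map back to one series
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Usage: predict a target series of length 128 from a source series.
source = torch.randn(1, 1, 128)
z_hat = DilatedCNN()(source)        # shape (1, 1, 128)
```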

2.3 Model Validation and Application

We applied the proposed method to synthetic time series data, simulated fMRI data from a public-domain simulator, and a real-world fMRI dataset. We assessed how well the true Granger causal relationships and time lags programmed into the synthetic and simulated data were identified by our method, as well as whether spurious causal relationships were identified. For each synthetic and simulated dataset, 100 (X, Y, Z) time series triples were generated. Seventy-two participants were included in the real-world fMRI dataset. Ten-fold cross-validation was used to repeatedly train the DSNs and quantify causal relationships within the testing data. The mean and standard deviation of the GC index across cross-validation folds are presented. We consider the evidence for a particular conditional Granger causal relationship strong when the corresponding GC index is statistically significantly greater than 0 in a one-tailed Student's t-test (p-value < 0.05). The identified Granger causalities from the proposed method were also compared with those from previous methods, MVAR [16] and TCDF [23]. The same statistical significance criterion (p-value < 0.05) was applied for MVAR. TCDF analysis followed the same dilated CNN architecture and statistical criterion as Nauta et al. [23].
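As an illustration of this significance criterion (with made-up per-fold values, not results from the paper), the GC indices from the cross-validation folds can be tested against zero as follows; the `alternative` argument assumes SciPy 1.6 or later:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Illustrative GC index values, one per cross-validation fold (not real results).
gc_index_per_fold = np.array([0.31, 0.28, 0.35, 0.30, 0.33, 0.29, 0.34, 0.32, 0.27, 0.36])

# One-tailed one-sample t-test of the GC index against 0.
t_stat, p_value = ttest_1samp(gc_index_per_fold, popmean=0.0, alternative='greater')
print(f"mean GC index = {gc_index_per_fold.mean():.3f} "
      f"± {gc_index_per_fold.std(ddof=1):.3f}, p = {p_value:.4g}")
significant = p_value < 0.05
```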

3 Experiments and Results

3.1 Synthetic Dataset

Synthetic Dataset 1: Linear Causal Relationships. Each time series had 128 time points generated according to the following formulas [28], where w_{t,1}, w_{t,2}, w_{t,3} were Gaussian white noise with a standard deviation of 0.1. True causal relationships within this dataset are represented in Fig. 3(a).

X_t = 0.95 · √2 · X_{t−1} − 0.9025 · X_{t−2} + w_{t,1}
Y_t = −0.5 · X_{t−1} + w_{t,2}
Z_t = −0.5 · Y_{t−1} + 0.5 · Z_{t−1} + w_{t,3}    (2)

Synthetic Dataset 2: Nonlinear Causal Relationships. Time series X_t and Y_t share the same common Gaussian white noise, w_{t,1}, with a standard deviation of 0.1. Independent Gaussian white noise, w_{t,2}, with a standard deviation of 0.1, was added to time series Z_t. The length of the time series was set to 128 time points. The causal relationships of this dataset are shown in Fig. 3(b).

X_t = 0.95 · √2 · X_{t−1} − 0.9025 · X_{t−2} − 0.9 · Y_{t−1} + 0.5 · w_{t,1}
Y_t = −1.05 · Y_{t−1} − 0.85 · Y_{t−2} − 0.8 · X_{t−1} + 0.5 · w_{t,1}
Z_t = 0.1 + 0.4 · Z_{t−2} + 2.4 / (1 + e^{−4·X_{t−3}}) − 0.9 · X_{t−3} + w_{t,2}    (3)
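A minimal Python sketch of generating one (X, Y, Z) triple of synthetic dataset 1 directly from Eq. (2); the initialisation of the first two samples is an assumption of this sketch.

```python
import numpy as np

def synthetic_dataset_1(T=128, sigma=0.1, seed=0):
    """Generate one (X, Y, Z) triple following Eq. (2)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, sigma, size=(T, 3))
    X, Y, Z = np.zeros(T), np.zeros(T), np.zeros(T)
    X[:2], Y[:2], Z[:2] = w[:2, 0], w[:2, 1], w[:2, 2]   # noise-only start-up
    for t in range(2, T):
        X[t] = 0.95 * np.sqrt(2) * X[t - 1] - 0.9025 * X[t - 2] + w[t, 0]
        Y[t] = -0.5 * X[t - 1] + w[t, 1]
        Z[t] = -0.5 * Y[t - 1] + 0.5 * Z[t - 1] + w[t, 2]
    return X, Y, Z
```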


Results. The proposed method correctly identified the true conditional Granger causalities X → Y|Z and Y → Z|X with statistically significant GC indices in synthetic dataset 1 (Eq. (2)). In addition, a causal relationship such as X → Z, which would be viewed as significant when X and Z are considered in isolation, was non-significant when the influence of the intermediary source (Y) was taken into account (X → Z|Y). Similarly, the true conditional Granger causalities in synthetic dataset 2 were identified. The method also accurately identified the true time lags between all sources and targets (Y → X|Z). Moreover, the magnitude of the GC index also reflected the strength of the causal relationship in synthetic time series 1. Specifically, there was strong evidence for X → Y|Z (GC_index_{X→Y|Z} = 0.3151), reflecting that Y depended solely on the previous time steps of X and not on the other time series (Eq. (2)). In contrast, Z depended on previous time steps of both Y and Z, suggesting less of a unique contribution of one or the other; this was reflected in a lower GC index (0.0494). The strength of the causalities in synthetic time series 2 is difficult to interpret from the equations themselves, given the mutual and nonlinear causal relationships. However, the GC indices indicated a strong causality of X → Y|Z, a moderate causality of X → Z|Y, and a weak causality of Y → X|Z. Figure 4 compares the three conditional Granger causality methods on synthetic datasets 1 and 2. Our proposed method, DSN-CNNs, fully detected the causal relationships and time lags that were programmed into the datasets, whereas the MVAR and TCDF models did not perform well (Fig. 4).

Fig. 3. (a) Synthetic dataset 1: linear case with causalities X → Y and Y → Z, and self-causalities of X and Z. (b) Synthetic dataset 2: nonlinear case with mutual causalities between X and Y, a flow of X → Z, and self-causalities of X, Y, and Z.

3.2 Simulated fMRI Dataset

The STANCE simulator, developed by Hill et al., was used to simulate task-based fMRI data [29]. For each simulated fMRI dataset, event designs were produced with causal relationships among the time series, then convolved with a canonical hemodynamic response function (HRF), followed by the addition of simulated system and physiological noise whose magnitude was about 10% of the simulated fMRI signal. Each time series had 128 time points.

Simulated Dataset 1. Source time series X_t and Y_t were generated with 26 randomly spaced events, each with a duration of 1 time step. The target time series, Z_t, would have an event with 70% probability if there was an event at X_{t−2} and would have an event with 30% probability if there was an event at Y_{t−5}. The causal relationships of this dataset are represented in Fig. 5(a).
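A hedged sketch of this event-design-plus-HRF construction (this is not the STANCE simulator itself; the TR, the double-gamma HRF parameters and the noise model are illustrative assumptions):

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr=1.0, length=32):
    """Simple double-gamma canonical HRF sampled at the repetition time."""
    t = np.arange(0, length, tr)
    return gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 12)

def simulated_dataset_1(T=128, n_events=26, seed=0):
    rng = np.random.default_rng(seed)
    x, y, z = np.zeros(T), np.zeros(T), np.zeros(T)
    x[rng.choice(T, n_events, replace=False)] = 1        # 26 randomly spaced events
    y[rng.choice(T, n_events, replace=False)] = 1
    for t in range(T):
        if t >= 2 and x[t - 2] and rng.random() < 0.7:   # X -> Z, lag 2, p = 0.7
            z[t] = 1
        if t >= 5 and y[t - 5] and rng.random() < 0.3:   # Y -> Z, lag 5, p = 0.3
            z[t] = 1
    hrf = canonical_hrf()
    bold = [np.convolve(s, hrf)[:T] for s in (x, y, z)]  # convolve with the HRF
    noise = 0.1 * np.max(bold) * rng.standard_normal((3, T))  # ~10% noise
    return np.asarray(bold) + noise
```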


Table 1. The identified causal relationships and time lags between time series X, Y, and Z in synthetic datasets 1 and 2. The causalities programmed into the datasets a priori are marked in bold.
Causality Synthetic dataset 1 GC index X → Y|Z

Causality Synthetic dataset 2 p-value

Time lags

0.3151 ± 0.0778