LNCS 13559
Lecture Notes in Computer Science Founding Editors Gerhard Goos Karlsruhe Institute of Technology, Karlsruhe, Germany Juris Hartmanis Cornell University, Ithaca, NY, USA
Editorial Board Members Elisa Bertino Purdue University, West Lafayette, IN, USA Wen Gao Peking University, Beijing, China Bernhard Steffen TU Dortmund University, Dortmund, Germany Moti Yung Columbia University, New York, NY, USA
More information about this series at https://link.springer.com/bookseries/558
Ghada Zamzmi · Sameer Antani · Ulas Bagci · Marius George Linguraru · Sivaramakrishnan Rajaraman · Zhiyun Xue (Eds.)
Medical Image Learning with Limited and Noisy Data First International Workshop, MILLanD 2022 Held in Conjunction with MICCAI 2022 Singapore, September 22, 2022 Proceedings
Editors Ghada Zamzmi National Institutes of Health Bethesda, MD, USA
Sameer Antani National Institutes of Health Bethesda, MD, USA
Ulas Bagci Northwestern University Chicago, IL, USA
Marius George Linguraru Children’s National Hospital Washington, DC, USA
Sivaramakrishnan Rajaraman National Institutes of Health Bethesda, MD, USA
Zhiyun Xue National Institutes of Health Bethesda, MD, USA
ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-031-16759-1 ISBN 978-3-031-16760-7 (eBook) https://doi.org/10.1007/978-3-031-16760-7 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Deep learning (DL)-based computer-aided diagnostic systems have been widely and successfully studied for analyzing various image modalities such as chest X-rays, computed tomography, ultrasound, and optical imaging including microscopic imagery. Such analyses help in identifying, localizing, and classifying disease patterns as well as staging the extent of the disease and recommending therapies. Although DL approaches have great potential to advance medical imaging technologies and potentially improve quality and access to healthcare, their performance relies heavily on the quality, variety, and size of training data sets as well as appropriate high-quality annotations. In the medical domain, obtaining such data sets is challenging due to several privacy constraints and tedious annotation processes. Further, real-world medical data tends to be noisy and incomplete, leading to unreliable and potentially biased algorithm performance.

To mitigate or overcome training challenges in imperfect or data-limited scenarios, several training techniques have been proposed. Despite the successful application of these techniques in a wide range of medical image applications, there is still a lack of theoretical and practical understanding of their learning characteristics and decision-making behavior when applied to medical images.

This volume presents novel approaches for handling noisy and limited medical image data sets. The collection is derived from articles presented in the workshop titled “Medical Image Learning with Noisy and Limited Data (MILLanD)”, held in conjunction with the 25th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2022). The workshop brought together machine learning scientists, biomedical engineers, and medical doctors to discuss the challenges and limitations of current deep learning methods applied to limited and noisy medical data and to present new methods for training models using such imperfect data.

The workshop received 54 full-paper submissions on various topics including efficient data annotation and augmentation strategies, new approaches for learning with noisy/corrupted data or uncertain labels, weakly-supervised learning, semi-supervised learning, self-supervised learning, and transfer learning strategies. Each submission was reviewed by 2–3 reviewers and further assessed by the workshop’s chairs. The reviewing process was double-blind, i.e., both the reviewer and author identities were concealed throughout the review process. This process resulted in the selection of 22 high-quality papers that are included in this volume.

August 2022
Ghada Zamzmi Sameer Antani Ulas Bagci Marius George Linguraru Sivaramakrishnan Rajaraman Zhiyun Xue
Organization
General Chair

Ghada Zamzmi, National Institutes of Health, USA

Program Committee Chairs

Sameer Antani, National Institutes of Health, USA
Ulas Bagci, Northwestern University, USA
Marius George Linguraru, Children’s National Hospital, USA
Sivaramakrishnan Rajaraman, National Institutes of Health, USA
Zhiyun Xue, National Institutes of Health, USA
Ghada Zamzmi, National Institutes of Health, USA

Program Committee

Sema Candemir, Eskişehir Technical University, Turkey
Somenath Chakraborty, University of Southern Mississippi, USA
Amr Elsawy, National Institutes of Health, USA
Prasanth Ganesan, Stanford Medicine, USA
Loveleen Gaur, Amity University, India
Peng Guo, National Institutes of Health, USA
Mustafa Hajij, University of San Francisco, USA
Alba García Seco Herrera, University of Essex, UK
Alexandros Karargyris, Institute of Image-Guided Surgery, IHU Strasbourg, France
Ismini Lourentzou, Virginia Tech, USA
Rahul Paul, Harvard Medical School, USA
Anabik Pal, SRM University, Andhra Pradesh, India
Harshit Parmar, Texas Tech University, USA
Sirajus Salekin, University of South Florida, USA
Ahmed Sayed, Milwaukee School of Engineering, USA
Mennatullah Siam, York University, Canada
Sudhir Sornapudi, Corteva Agriscience, USA
Lokendra Thakur, MIT and Harvard Broad Institute, USA
Lihui Wang, Guizhou University, China
Feng Yang, National Institutes of Health, USA
Mu Zhou, Stanford University, USA
Additional Reviewers Kexin Ding M. Murugappan Venkatachalam Thiruppathi Zichen Wang Miaomiao Zhang Qilong Zhangli
Contents
Efficient and Robust Annotation Strategies

Heatmap Regression for Lesion Detection Using Pointwise Annotations (p. 3)
Chelsea Myers-Colet, Julien Schroeter, Douglas L. Arnold, and Tal Arbel

Partial Annotations for the Segmentation of Large Structures with Low Annotation Cost (p. 13)
Bella Specktor Fadida, Daphna Link Sourani, Liat Ben Sira, Elka Miller, Dafna Ben Bashat, and Leo Joskowicz

Abstraction in Pixel-wise Noisy Annotations Can Guide Attention to Improve Prostate Cancer Grade Assessment (p. 23)
Hyeongsub Kim, Seo Taek Kong, Hongseok Lee, Kyungdoc Kim, and Kyu-Hwan Jung

Meta Pixel Loss Correction for Medical Image Segmentation with Noisy Labels (p. 32)
Zhuotong Cai, Jingmin Xin, Peiwen Shi, Sanping Zhou, Jiayi Wu, and Nanning Zheng

Re-thinking and Re-labeling LIDC-IDRI for Robust Pulmonary Cancer Prediction (p. 42)
Hanxiao Zhang, Xiao Gu, Minghui Zhang, Weihao Yu, Liang Chen, Zhexin Wang, Feng Yao, Yun Gu, and Guang-Zhong Yang

Weakly-Supervised, Self-supervised, and Contrastive Learning

Universal Lesion Detection and Classification Using Limited Data and Weakly-Supervised Self-training (p. 55)
Varun Naga, Tejas Sudharshan Mathai, Angshuman Paul, and Ronald M. Summers

BoxShrink: From Bounding Boxes to Segmentation Masks (p. 65)
Michael Gröger, Vadim Borisov, and Gjergji Kasneci

Multi-Feature Vision Transformer via Self-Supervised Representation Learning for Improvement of COVID-19 Diagnosis (p. 76)
Xiao Qi, David J. Foran, John L. Nosher, and Ilker Hacihaliloglu

SB-SSL: Slice-Based Self-supervised Transformers for Knee Abnormality Classification from MRI (p. 86)
Sara Atito, Syed Muhammad Anwar, Muhammad Awais, and Josef Kittler

Optimizing Transformations for Contrastive Learning in a Differentiable Framework (p. 96)
Camille Ruppli, Pietro Gori, Roberto Ardon, and Isabelle Bloch

Stain Based Contrastive Co-training for Histopathological Image Analysis (p. 106)
Bodong Zhang, Beatrice Knudsen, Deepika Sirohi, Alessandro Ferrero, and Tolga Tasdizen

Active and Continual Learning

CLINICAL: Targeted Active Learning for Imbalanced Medical Image Classification (p. 119)
Suraj Kothawade, Atharv Savarkar, Venkat Iyer, Ganesh Ramakrishnan, and Rishabh Iyer

Real Time Data Augmentation Using Fractional Linear Transformations in Continual Learning (p. 130)
Arijit Patra

DIAGNOSE: Avoiding Out-of-Distribution Data Using Submodular Information Measures (p. 141)
Suraj Kothawade, Akshit Shrivastava, Venkat Iyer, Ganesh Ramakrishnan, and Rishabh Iyer

Transfer Representation Learning

Auto-segmentation of Hip Joints Using MultiPlanar UNet with Transfer Learning (p. 153)
Peidi Xu, Faezeh Moshfeghifar, Torkan Gholamalizadeh, Michael Bachmann Nielsen, Kenny Erleben, and Sune Darkner

Asymmetry and Architectural Distortion Detection with Limited Mammography Data (p. 163)
Zhenjie Cao, Xiaoyun Zhou, Yuxing Tang, Mei Han, Jing Xiao, Jie Ma, and Peng Chang

Imbalanced Data and Out-of-Distribution Generalization

Class Imbalance Correction for Improved Universal Lesion Detection and Tagging in CT (p. 177)
Peter D. Erickson, Tejas Sudharshan Mathai, and Ronald M. Summers

CVAD: An Anomaly Detector for Medical Images Based on Cascade VAE (p. 187)
Xiaoyuan Guo, Judy Wawira Gichoya, Saptarshi Purkayastha, and Imon Banerjee

Approaches for Noisy, Missing, and Low Quality Data

Visual Field Prediction with Missing and Noisy Data Based on Distance-Based Loss (p. 199)
Quang T. M. Pham, Jong Chul Han, and Jitae Shin

Image Quality Classification for Automated Visual Evaluation of Cervical Precancer (p. 206)
Zhiyun Xue, Sandeep Angara, Peng Guo, Sivaramakrishnan Rajaraman, Jose Jeronimo, Ana Cecilia Rodriguez, Karla Alfaro, Kittipat Charoenkwan, Chemtai Mungo, Joel Fokom Domgue, Nicolas Wentzensen, Kanan T. Desai, Kayode Olusegun Ajenifuja, Elisabeth Wikström, Brian Befano, Silvia de Sanjosé, Mark Schiffman, and Sameer Antani

A Monotonicity Constrained Attention Module for Emotion Classification with Limited EEG Data (p. 218)
Dongyang Kuang, Craig Michoski, Wenting Li, and Rui Guo

Automated Skin Biopsy Analysis with Limited Data (p. 229)
Yung-Chieh Chan, Jerry Zhang, Katie Frizzi, Nigel Calcutt, and Garrison Cottrell

Author Index (p. 239)
Efficient and Robust Annotation Strategies
Heatmap Regression for Lesion Detection Using Pointwise Annotations

Chelsea Myers-Colet¹, Julien Schroeter¹, Douglas L. Arnold², and Tal Arbel¹

¹ Centre for Intelligent Machines, McGill University, Montreal, Canada
{cmyers,julien,arbel}@cim.mcgill.ca
² Montreal Neurological Institute, McGill University, Montreal, Canada
[email protected]
Abstract. In many clinical contexts, detecting all lesions is imperative for evaluating disease activity. Standard approaches pose lesion detection as a segmentation problem despite the time-consuming nature of acquiring segmentation labels. In this paper, we present a lesion detection method which relies only on point labels. Our model, which is trained via heatmap regression, can detect a variable number of lesions in a probabilistic manner. In fact, our proposed post-processing method offers a reliable way of directly estimating the lesion existence uncertainty. Experimental results on Gad lesion detection show our point-based method performs competitively compared to training on expensive segmentation labels. Finally, our detection model provides a suitable pre-training for segmentation. When fine-tuning on only 17 segmentation samples, we achieve comparable performance to training with the full dataset.

Keywords: Lesion detection · Lesion segmentation · Heatmap regression · Uncertainty · Multiple sclerosis

1 Introduction
For many diseases, detecting the presence and location of all lesions is vital for estimating disease burden and treatment efficacy. In stroke patients, for example, the location of a cerebral hemorrhage was shown to be an important factor in assessing the risk of aspiration [1]; thus, failing to locate even a single one could drastically impact the assessment. Similarly, in patients with Multiple Sclerosis (MS), detecting and tracking all gadolinium-enhancing lesions (Gad lesions), whether large or small, is especially relevant for determining treatment response in clinical trials [2]. Detecting all Gad lesions is imperative as just one new lesion indicates new disease activity. To achieve this goal, standard practice in deep learning consists of training a lesion segmentation model with a post-processing detection step [3,4]. However, segmentation labels are expensive and time consuming to acquire. To this end, we
develop a lesion detection model trained on pointwise labels thereby reducing the manual annotation burden. Unlike previous point annotation-based methods [5– 7], ours combines the ability to detect a variable number of lesions with the benefit of leveraging a probabilistic approach. Indeed, our refinement method is not only independent of a specific binarization threshold, it offers a unique way of estimating the lesion existence probability. Our contributions are threefold: (1) We demonstrate the merit of training on point annotations via heatmap regression over segmentation labels on the task of Gad lesion detection. With weaker labels, our models still achieve better detection performance. (2) Our proposed refinement method allows for a reliable estimation of lesion existence uncertainty thus providing valuable feedback for clinical review. (3) When the end goal is segmentation, our detection models provide a suitable pre-training for fine-tuning on a limited set of segmentation labels. When having access to only 17 segmentation samples, we can achieve comparable performance to a model trained on the entire segmentation dataset.
2 Related Work
Point annotations are often extremely sparse which leads to instability during training of deep neural networks. Therefore, most state-of-the-art methods rely on the application of a smoothing operation to point labels. A Gaussian filter is commonly applied to create a heatmap as was done in [5,6] for suture detection. Others have found success applying distance map transformations. For instance, Han et al. [8] and van Wijnen et al. [9] used Euclidean and Geodesic distance maps to perform lesion detection. We demonstrate the benefits of training with Gaussian heatmaps over distance maps as they offer a more precise and interpretable probabilistic prediction yielding superior detection performance. Irrespective of the choice of smoothing used for training, detection methods will often differ in their post-processing refinement step, i.e. in extracting lesion coordinates from a predicted heatmap. The simplest approach consists in finding the location with the maximum mass [5,10,11] or computing the centre of mass [6]. Although these approaches easily allow for the detection of multiple lesions, they require careful tuning of the binarization threshold and are susceptible to missing both isolated and overlapping peaks. More sophisticated methods exist which aim to fit a Gaussian distribution to the predicted heatmap thus retaining its probabilistic interpretation, e.g. [12,13]. Specifically, to perform cephalogram landmark detection Thaler et al. [7] align a Gaussian distribution via Least Squares curve fitting. Since the approach taken in [7] is limited to a set number of landmarks, we extend it to detect a variable number of lesions. Our method thus offers the flexibility of simpler approaches, without any dependence on a binarization threshold, while providing a probabilistic interpretation.
3 Method
In this work, we propose a strategy to detect the presence and location of multiple lesions from brain MRIs of patients with a neurodegenerative disease. Our model
Fig. 1. Overview of the detection method given a predicted heatmap $\hat{H}_i$: lesion candidates are found by (1) locating the global maximum, (2) fitting a Gaussian distribution to an extracted region and (3) subtracting the influence of this lesion from the heatmap. (4) Repeat steps 1–3 before (5) filtering out unlikely lesions.
is trained via heatmap regression (Sect. 3.1) while lesion detection is performed in a post-processing step (Sect. 3.2). Finally, we present a transfer learning scheme to perform segmentation on a limited dataset (Sect. 3.3).

3.1 Training via Heatmap Regression
The proposed heatmap regression training scheme requires a domain expert to label only a single point identifying each lesion, e.g. by marking the approximate centre of the lesion. To stabilize training, we apply a Gaussian filter with smoothing parameter σ to the point annotations, thus creating a multi-instance heatmap [5–7,14–16]. Since all lesions are represented by a single point and smoothed using the same value of σ, equal importance is attributed to lesions of all sizes. We train a model $f_\theta$ to map a sequence of input MRIs to a predicted heatmap $\hat{H}_i$.
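A minimal sketch of how such a regression target could be constructed from point annotations is given below. It is our illustration, not the authors' released code: the function name, the point-label format (voxel coordinates of lesion centres), and the default σ value are assumptions, and SciPy's Gaussian filter is used for the smoothing step described above.

```python
# Sketch: build a multi-instance target heatmap from point annotations.
import numpy as np
from scipy.ndimage import gaussian_filter

def build_target_heatmap(volume_shape, lesion_points, sigma=1.25):
    """volume_shape: (D, H, W); lesion_points: list of (z, y, x) voxel coordinates."""
    target = np.zeros(volume_shape, dtype=np.float32)
    for z, y, x in lesion_points:
        target[z, y, x] = 1.0              # one impulse per annotated lesion centre
    # Gaussian smoothing turns the sparse points into a continuous heatmap;
    # the same sigma for every lesion gives equal weight to lesions of all sizes.
    return gaussian_filter(target, sigma=sigma)
```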
3.2 Detection During Inference
Given a continuous heatmap $\hat{H}_i$, we now aim to detect individual lesions. Specifically, for patient $i$, we wish to represent the $k$-th detected lesion by a single point, $\hat{\mu}_{ik}$, which can be extracted from the heatmap. We assume the predicted heatmap $\hat{H}_i$ will model a sum of Gaussian distributions (each describing a single lesion) to mimic the target heatmap:¹

$$\hat{H}_i = \sum_{k \in K} \hat{H}_{ik} = \sum_{k \in K} \mathcal{N}(\hat{\mu}_{ik}, \sigma) \tag{1}$$

¹ Valid as long as $f_\theta$ sufficiently minimizes the loss and thus models the target.
Our method essentially aims to find the individual Gaussian distributions comprising the sum in Eq. 1 in an iterative manner, as shown in Fig. 1. We now describe each depicted step in detail.

(1) Locate Global Maximum. The location of the global maximum serves as an initial estimate for the $k$-th predicted lesion centre, $\hat{\mu}_{ik}$.

(2) Gaussian Fitting. In the region surrounding a detected lesion with centre $\hat{\mu}_{ik}$, we fit a Gaussian distribution with normalizing constant $\hat{\alpha}_{ik}$:

$$\hat{H}_{ik} = \hat{\alpha}_{ik}\,\mathcal{N}(\hat{\mu}_{ik}, \sigma) \tag{2}$$

Provided there is minimal overlap between neighbouring lesions, we can use a Least Squares curve fitting algorithm to estimate $\hat{\alpha}_{ik}$ and $\hat{\mu}_{ik}$.² The normalizing constant, $\hat{\alpha}_{ik}$, represents the prior probability of producing a peak in this region (from Bayes' Theorem), i.e. it is the belief that a lesion exists in the given region. We thus refer to $\hat{\alpha}_{ik}$ as the lesion existence probability (similar to [17]). As an initial estimate for $\hat{\alpha}_{ik}$, we sum within the extracted region, i.e. the hypothesis space, as shown in Fig. 1 (2).

(3) Subtract. Now that potential lesion $k$ has been identified and fitted with a continuous Gaussian function, we remove its contribution to the sum in Eq. 1. This allows our method to more easily detect the individual contributions of neighbouring lesions with overlapping Gaussian distributions.

$$\hat{H}_i' = \hat{H}_i - \hat{\alpha}_{ik}\,\mathcal{N}(\hat{\mu}_{ik}, \sigma) \tag{3}$$

(4) Repeat. Since we have subtracted the contribution of lesion $k$ from the aggregated heatmap, the global maximum now corresponds to a different candidate lesion. Steps 1 to 3 are repeated until a maximum number of lesions have been found or when the lesion existence probability drops below a threshold, e.g. 0.01.

(5) Filtering. Lesions with a low probability of existence are discarded (threshold optimized on the validation set). By overestimating the lesion count and subsequently discarding regions unlikely to contain a lesion, we can better account for noisy peaks in the heatmap. We evaluate the calibration of these probabilities and demonstrate the validity of this filtering step (Sect. 4.2).
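The sketch below illustrates one way steps (1)–(5) could be implemented. It is not the authors' released implementation: the window size around each peak, the fixed candidate budget, and the fallback when the fit fails are assumptions, while the 0.01 stopping value and the fixed σ follow the description above.

```python
# Sketch: iterative Gaussian fitting and subtraction on a predicted heatmap.
import numpy as np
from scipy.optimize import curve_fit

def detect_lesions(heatmap, sigma, max_lesions=20, min_alpha=0.01, win=4):
    h = heatmap.astype(np.float64).copy()
    norm = (2.0 * np.pi * sigma ** 2) ** 1.5              # 3D Gaussian normalizer

    def blob(coords, alpha, mz, my, mx):                  # alpha * N(mu, sigma), sigma fixed
        z, y, x = coords
        d2 = (z - mz) ** 2 + (y - my) ** 2 + (x - mx) ** 2
        return alpha * np.exp(-d2 / (2.0 * sigma ** 2)) / norm

    lesions = []
    for _ in range(max_lesions):
        mz, my, mx = np.unravel_index(np.argmax(h), h.shape)     # (1) global maximum
        ranges = [np.arange(max(c - win, 0), min(c + win + 1, s))
                  for c, s in zip((mz, my, mx), h.shape)]
        zz, yy, xx = np.meshgrid(*ranges, indexing="ij")
        vals = h[zz, yy, xx].ravel()
        alpha0 = float(vals.sum())                               # initial existence estimate
        if alpha0 < min_alpha:
            break
        try:                                                     # (2) least-squares Gaussian fit
            (alpha, mz, my, mx), _ = curve_fit(
                blob, (zz.ravel(), yy.ravel(), xx.ravel()), vals,
                p0=(alpha0, mz, my, mx))
        except RuntimeError:
            alpha = alpha0
        lesions.append({"centre": (float(mz), float(my), float(mx)),
                        "existence": float(alpha)})
        Z, Y, X = np.meshgrid(*[np.arange(s) for s in h.shape], indexing="ij")
        h -= blob((Z, Y, X), alpha, mz, my, mx)                  # (3) subtract, then (4) repeat
    # (5) filtering: discard candidates below a threshold tuned on validation data
    return [l for l in lesions if l["existence"] >= min_alpha]
```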
3.3 Segmentation Transfer Learning
In addition to detection, estimating a lesion segmentation can be beneficial for assessing lesion load. We therefore design a transfer learning scheme which first relies on building a strong lesion detector using point annotations before fine-tuning on a small segmentation dataset. Specifically, we (1) train a detection model on point annotations until convergence; (2) build a small segmentation training set; (3) fine-tune the detection model on segmentation samples only. Training in this manner minimizes the amount of detailed segmentation labels that must be generated.

² Valid for Gad lesions given a sufficiently small smoothing parameter σ.
4 Experiments and Results
The proposed heatmap regression model is compared against three benchmarks in terms of detection performance. We train models on (1) segmentation labels, (2) Euclidean distance maps and (3) Geodesic distance maps [8,9]. Similar to our method, lesions are detected from the output prediction in a post-processing step. Here, we instead binarize the output at threshold τ (optimized on the validation set), cluster connected components to form detected lesions and use the centre of mass (segmentation) or the maximum (detection) to represent the lesion (referred to as CC). As an additional benchmark, we apply this method to heatmap outputs from our proposed regression models. This is in line with detection methods used by [18,19] for segmentation outputs and [5,6] for heatmap predictions.

4.1 Experimental Setup
Dataset. We evaluate our method on Gad lesion detection, as Gad lesions are a relevant indicator of disease activity in MS patients [20]. However, their subtlety and extreme size variation make them difficult to identify. Experiments are performed using a large, multi-centre, multi-scanner proprietary dataset consisting of 1067 patients involved in a clinical trial to treat Relapsing-Remitting MS. Multi-modal MRIs, including post-contrast T1-weighted MRI, are available for each patient and are provided as inputs to our system. For fairness, we create train (60%), validation (20%) and test (20%) sets by first splitting at the patient level. We have access to manually derived Gad lesion segmentation masks. Each sample is first independently rated by two experts who then meet to produce a consensus. Point labels were generated directly from segmentation masks by calculating the centre of mass of each lesion and transformed into either heatmaps, using a Gaussian kernel with smoothing parameter σ, or distance maps (baseline methods), using decay parameter p. Hyperparameters were selected based on validation performance.

Model. We train a modified 5-layer U-Net [21] with dropout and instance normalization using a Mean-Squared Error loss for heatmap regression and a weighted cross-entropy loss for segmentation. See code for details.³

Evaluation. We apply the Hungarian algorithm [22] to match predicted lesions to ground truth lesions using Euclidean distance as a cost metric. Assignments with large distances are considered both a false positive and a false negative.
³ https://github.com/ChelseaM-C/MICCAI2022-Heatmap-Lesion-Detection
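A small sketch of the matching step used for evaluation is given below, assuming lesion centres as 3D coordinates; the distance cutoff value is an assumption (the paper only states that large-distance assignments count as both a false positive and a false negative).

```python
# Sketch: Hungarian matching of predicted to ground-truth lesion centres.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_lesions(pred_centres, gt_centres, max_dist=10.0):
    pred, gt = np.asarray(pred_centres, float), np.asarray(gt_centres, float)
    if len(pred) == 0 or len(gt) == 0:
        return 0, len(pred), len(gt)                     # TP, FP, FN
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)             # minimum-cost assignment
    tp = int(np.sum(cost[rows, cols] <= max_dist))       # far-apart pairs are not matches
    fp = len(pred) - tp
    fn = len(gt) - tp
    return tp, fp, fn
```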
Table 1. Lesion detection results as a mean over 3 runs. Reported are the detection F1-score, precision, recall and small lesion recall for models trained with segmentation, Gaussian heatmap or distance map (Geodesic, Euclidean) [8, 9] labels using connected components (CC) or Gaussian fitting (Gaussian).

Label type       | Detection method | F1-score    | Precision   | Recall      | Small lesion recall
Segmentation     | CC               | 85.4 ± 0.02 | 85.3 ± 1.10 | 85.5 ± 1.15 | 67.7 ± 3.58
Euclidean map    | CC               | 80.6 ± 1.01 | 92.6 ± 1.28 | 71.4 ± 2.14 | 51.0 ± 3.28
Geodesic map     | CC               | 73.7 ± 4.89 | 81.0 ± 7.97 | 75.0 ± 5.75 | 47.8 ± 2.81
Gaussian heatmap | CC               | 83.9 ± 0.27 | 80.9 ± 4.43 | 87.3 ± 2.64 | 67.8 ± 2.98
Gaussian heatmap | Gaussian         | 86.3 ± 0.24 | 87.0 ± 1.89 | 85.7 ± 1.44 | 70.4 ± 4.47

4.2 Lesion Detection Results
Despite only having access to point annotations, the proposed Gaussian heatmap approach performs competitively with the segmentation baseline (see Table 1). In fact, our proposed iterative detection method (Gaussian) even slightly outperforms the segmentation model on all detection metrics. By contrast, both distance map approaches show notably worse performance with especially low recall scores indicating a high number of missed lesions. The proposed model additionally outperforms competing methods for the task of small lesion detection (3 to 10 voxels in size) underlining the merit of training directly for detection. Segmentation models will typically place more importance on larger lesions since they have a higher contribution to the loss, a bias not imposed by our detection model. Our model additionally does not sacrifice precision for high recall on small lesions; we perform on par with segmentation. Lesion Existence Probability Evaluation. We evaluate the quality of our fitted lesion existence probabilities on the basis of calibration and derived uncertainty to justify both the curve fitting and filtering steps. (1) Calibration. We compare the calibration [23] of the lesion existence probabilities before and after Least Squares curve fitting. Recall the initial estimate for αik is found by summing locally within the extracted region. Our proposed existence probabilities are well calibrated (Fig. 2a), with little deviation from the ideal case thus justifying our proposed filtering step. As well, the fitted probabilities are significantly better calibrated than the initial estimates demonstrating the benefit of curve fitting. (2) Uncertainty. While it is important to produce accurate predictions, quantifying their uncertainty is of equal importance in the medical domain. We can compute the entropy of our lesion existence probabilities without sampling and show it is well correlated with detection accuracy. As we consider only the least uncertain instances, we observe a monotonic increasing trend, even achieving an accuracy of 100% (Fig. 2b). Similar results are achieved with the more standard MC Dropout approach [24] applied to segmentation outputs (calculated at a lesion level as in [18]).
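For reference, since each candidate is governed by a single fitted existence probability, the entropy referred to above can be computed in closed form as that of a Bernoulli variable with parameter $\hat{\alpha}_{ik}$ (our reading of "without sampling"):

$$H(\hat{\alpha}_{ik}) = -\hat{\alpha}_{ik}\,\log \hat{\alpha}_{ik} - \left(1-\hat{\alpha}_{ik}\right)\log\left(1-\hat{\alpha}_{ik}\right)$$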
Fig. 2. Lesion existence probability evaluation. (a) Calibration of unfitted (pink) vs. fitted (green) probabilities. (b) Detection accuracy of least uncertain samples according to our model (green) vs. MC Dropout applied to segmentation (pink). (Color figure online)
Our derived lesion existence probabilities are not only well-calibrated, they produce meaningful uncertainty estimates. With only a single forward pass, our uncertainty estimates perform on par with standard sampling-based approaches.

4.3 Lesion Segmentation via Transfer Learning
To demonstrate the adaptability of our method, we fine-tune the trained heatmap regression models with a small segmentation dataset as described in Sect. 3.3. Specifically, we use segmentation labels for a randomly chosen 1% of our total training set for fine-tuning. To account for bias in the selected subset, we repeat this process 3 times and average the results. For comparison, we train from scratch with this limited set as well as on the full segmentation training set. We additionally include results on random subsets of 5% and 10% in the appendix along with the associated standard deviation of each experiment. Remarkably, our pre-trained models show only a 3% drop in segmentation F1-score performance with a mere 1% of the segmentation labels compared to the model trained on the full segmentation dataset (see Table 2). By contrast, the model trained from scratch with the same 1% of segmentation labels shows a 10% drop in segmentation F1-score. This emphasizes the importance of detecting lesions since models trained from scratch in the low data regime show considerably lower detection F1-score. It is clear the models do not require very much data in order to properly segment lesions as demonstrated by competitive performance of our pre-trained models. However, as indicated by poor detection performance of the pure segmentation models in the low data regime, it is clear these models need help localizing lesions before they can be segmented. We can
Table 2. Segmentation transfer learning results averaged over 3 random subsets. We present the segmentation F1-score (Seg. F1) and detection (Det.) metrics of the fine-tuned segmentation models: F1-score, precision, recall. Pre-trained models are distinguished by their smoothing hyperparameter σ.

Quantity seg. labels | Pre-trained model | Seg. F1     | Det. F1     | Det. precision | Det. recall
100%                 | None              | 70.5 ± 0.31 | 85.4 ± 0.02 | 85.3 ± 1.10    | 85.5 ± 1.15
1%                   | None              | 60.7 ± 2.48 | 69.7 ± 5.92 | 79.0 ± 6.47    | 62.7 ± 7.69
1%                   | σ = 1.0           | 67.2 ± 0.71 | 85.4 ± 0.62 | 83.9 ± 1.74    | 87.0 ± 2.40
1%                   | σ = 1.25          | 67.6 ± 1.31 | 84.5 ± 0.38 | 86.0 ± 0.67    | 83.1 ± 1.32
1%                   | σ = 1.5           | 67.2 ± 1.51 | 85.0 ± 0.50 | 83.6 ± 1.01    | 86.6 ± 1.10
1%                   | Euclidean         | 66.3 ± 0.18 | 84.7 ± 1.45 | 83.3 ± 3.24    | 86.1 ± 1.67
1%                   | Geodesic          | 59.9 ± 1.44 | 77.9 ± 4.36 | 85.6 ± 7.33    | 71.7 ± 3.49
see a similar trend with the models pre-trained on distance maps. The Euclidean distance maps offered higher detection scores than the Geodesic ones (Table 1) and therefore serve as a better pre-training for segmentation, although still lower than our models.
5 Discussion and Conclusion
In this work, we have demonstrated how training a heatmap regression model to detect lesions can achieve the same, and at times better, detection performance compared to a segmentation model. By requiring clinicians to indicate a single point within each lesion, our approach significantly reduces the annotation burden imposed by deep learning segmentation methods. Our proposed method of iteratively fitting Gaussian distributions to a predicted heatmap produces well-calibrated existence probabilities which capture the underlying uncertainty. Perhaps most significantly, our transfer learning experiments have revealed an important aspect about segmentation models. Our results demonstrate that segmentation models must learn first and foremost to find lesions. Indeed, our models, which are already adept at lesion detection, can easily learn to delineate borders with only a few segmentation samples. By contrast, the models provided with the same limited set of segmentation labels trained from scratch fail primarily to detect lesions thus lowering their segmentation scores. It therefore presents an unnecessary burden on clinicians to require them to manually segment large datasets in order to build an accurate deep learning segmentation model. Although we have demonstrated many benefits, Gaussian heatmap matching has its limitations. The smoothing hyperparameter requires careful tuning to both maintain stable training and to avoid a significant overlap in peaks (especially for densely packed lesions). As well, the method still requires an expert annotator to mark approximate lesion centres however, this is much less timeconsuming than fully outlining each lesion. We recognize this could introduce high variability in the labels regarding where the point is placed within each lesion. Though the current model was trained on precise centres of mass, the
proposed method does not necessarily impose any such constraints, in theory. Future work is needed to evaluate the robustness of the model to high variability in the label space. In summary, our proposed training scheme and iterative Gaussian fitting post-processing step constitute an accurate and label-efficient method of performing lesion detection and segmentation.

Acknowledgements. This work was supported by awards from the International Progressive MS Alliance (PA-1412-02420), the Canada Institute for Advanced Research (CIFAR) Artificial Intelligence Chairs program (Arbel), the Canadian Natural Science and Engineering Research Council (CGSM-NSERC-2021-Myers-Colet) and the Fonds de recherche du Québec (303237). The authors would also like to thank Justin Szeto, Kirill Vasilevski, Brennan Nichyporuk and Eric Zimmermann as well as the companies who provided the clinical trial data: Biogen, BioMS, MedDay, Novartis, Roche/Genentech, and Teva. Supplementary computational resources were provided by Calcul Québec, WestGrid, and Compute Canada.
References 1. Daniels, S.K., Foundas, A.L.: Lesion localization in acute stroke. J. Neuroimaging 9(2), 91–98 (1999) 2. Rudick, R.A., Lee, J.-C., Simon, J., Ransohoff, R.M., Fisher, E.: Defining interferon β response status in multiple sclerosis patients. Ann. Neurol. Official J. Am. Neurol. Assoc. Child Neurol. Soc. 56(4), 548–555 (2004) 3. Lundervold, A.S., Lundervold, A.: An overview of deep learning in medical imaging focusing on MRI. Zeitschrift f¨ ur Medizinische Physik 29(2), 102–127 (2019) 4. Doyle, A., Elliott, C., Karimaghaloo, Z., Subbanna, N., Arnold, D.L., Arbel, T.: Lesion detection, segmentation and prediction in multiple sclerosis clinical trials. In: Crimi, A., Bakas, S., Kuijf, H., Menze, B., Reyes, M. (eds.) BrainLes 2017. LNCS, vol. 10670, pp. 15–28. Springer, Cham (2018). https://doi.org/10.1007/ 978-3-319-75238-9 2 5. Sharan, L., et al.: Point detection through multi-instance deep heatmap regression for sutures in endoscopy. Int. J. Comput. Assist. Radiol. Surg. 16(12), 2107–2117 (2021). https://doi.org/10.1007/s11548-021-02523-w 6. Stern, A., et al.: Heatmap-based 2d landmark detection with a varying number of landmarks. arXiv preprint arXiv:2101.02737 (2021) 7. Thaler, F., Payer, C., Urschler, M., Stern, D.: Modeling annotation uncertainty with gaussian heatmaps in landmark localization. arXiv preprint arXiv:2109.09533 (2021) 8. Han, X., Zhai, Y., Yu, Z., Peng, T., Zhang, X.-Y.: Detecting extremely small lesions in mouse brain MRI with point annotations via multi-task learning. In: Lian, C., Cao, X., Rekik, I., Xu, X., Yan, P. (eds.) MLMI 2021. LNCS, vol. 12966, pp. 498–506. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87589-3 51 9. van Wijnen, K.M.H., et al.: Automated lesion detection by regressing intensitybased distance with a neural network. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11767, pp. 234–242. Springer, Cham (2019). https://doi.org/10.1007/ 978-3-030-32251-9 26 10. Donn´e, S., De Vylder, J., Goossens, B., Philips, W.: Mate: machine learning for adaptive calibration template detection. Sensors 16(11), 1858 (2016)
11. Chen, B., Xiong, C., Zhang, Q.: CCDN: checkerboard corner detection network for robust camera calibration. In: Chen, Z., Mendes, A., Yan, Y., Chen, S. (eds.) ICIRA 2018. LNCS (LNAI), vol. 10985, pp. 324–334. Springer, Cham (2018). https://doi. org/10.1007/978-3-319-97589-4 27 12. Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7093–7102 (2020) 13. Graving, J.M., et al.: Deepposekit, a software toolkit for fast and robust animal pose estimation using deep learning. Elife 8, e47994 (2019) 14. Wang, X., Bo, L., Fuxin, L.: Adaptive wing loss for robust face alignment via heatmap regression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6971–6981 (2019) 15. Pfister, T., Charles, J., Zisserman, A.: Flowing convnets for human pose estimation in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1913–1921 (2015) ´ 16. Hervella, A.S., Rouco, J., Novo, J., Penedo, M.G., Ortega, M.: Deep multi-instance heatmap regression for the detection of retinal vessel crossings and bifurcations in eye fundus images. Comput. Methods Programs Biomed. 186, 105201 (2020) 17. Schroeter, J., Myers-Colet, C., Arnold, D., Arbel, T.: Segmentation-consistent probabilistic lesion counting. Med. Imaging Deep Learn. (2022) 18. Nair, T., Precup, D., Arnold, D.L., Arbel, T.: Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Med. Image Anal. 59, 101557 (2020) 19. De Moor, T., Rodriguez-Ruiz, A., M´erida, A.G., Mann, R., Teuwen, J.: Automated lesion detection and segmentation in digital mammography using a u-net deep learning network. In: 14th International Workshop on Breast Imaging (IWBI 2018), vol. 10718, p. 1071805. International Society for Optics and Photonics (2018) 20. McFarland, H.F., et al.: Using gadolinium-enhanced magnetic resonance imaging lesions to monitor disease activity in multiple sclerosis. Ann. Neurol. 32(6), 758– 766 (1992) 21. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 22. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2(1–2), 83–97 (1955) 23. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: International Conference on Machine Learning, pp. 1321–1330. PMLR (2017) 24. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: International Conference on Machine Learning, pp. 1050–1059. PMLR (2016)
Partial Annotations for the Segmentation of Large Structures with Low Annotation Cost

Bella Specktor Fadida¹, Daphna Link Sourani², Liat Ben Sira³,⁴, Elka Miller⁵, Dafna Ben Bashat²,³, and Leo Joskowicz¹

¹ School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
{bella.specktor,josko}@cs.huji.ac.il
² Sagol Brain Institute, Tel Aviv Sourasky Medical Center, Tel Aviv-Yafo, Israel
³ Sackler Faculty of Medicine and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv-Yafo, Israel
⁴ Division of Pediatric Radiology, Tel Aviv Sourasky Medical Center, Tel Aviv-Yafo, Israel
⁵ Medical Imaging, Children’s Hospital of Eastern Ontario, University of Ottawa, Ottawa, Canada
Abstract. Deep learning methods have been shown to be effective for the automatic segmentation of structures and pathologies in medical imaging. However, they require large annotated datasets, whose manual segmentation is a tedious and time-consuming task, especially for large structures. We present a new method of partial annotations of MR images that uses a small set of consecutive annotated slices from each scan with an annotation effort that is equal to that of only few annotated cases. The training with partial annotations is performed by using only annotated blocks, incorporating information about slices outside the structure of interest and modifying a batch loss function to consider only the annotated slices. To facilitate training in a low data regime, we use a two-step optimization process. We tested the method with the popular soft Dice loss for the fetal body segmentation task in two MRI sequences, TRUFI and FIESTA, and compared full annotation regime to partial annotations with a similar annotation effort. For TRUFI data, the use of partial annotations yielded slightly better performance on average compared to full annotations with an increase in Dice score from 0.936 to 0.942, and a substantial decrease in Standard Deviations (STD) of Dice score by 22% and Average Symmetric Surface Distance (ASSD) by 15%. For the FIESTA sequence, partial annotations also yielded a decrease in STD of the Dice score and ASSD metrics by 27.5% and 33% respectively for in-distribution data, and a substantial improvement also in average performance on out-of-distribution data, increasing Dice score from 0.84 to 0.9 and decreasing ASSD from 7.46 to 4.01 mm. The two-step optimization process was helpful for partial annotations for both in-distribution and out-of-distribution data. The partial annotations method with the two-step optimizer is therefore recommended to improve segmentation performance under low data regime. Keywords: Deep learning segmentation · Partial annotations · Fetal MRI
1 Introduction Fetal MRI has the potential to complement US imaging and improve fetal development assessment by providing more accurate volumetric information about the fetal structures [1, 2]. However, volumetric measurements require manual delineation, also called segmentation, of the fetal structures, which is time consuming, annotator-dependent and error-prone. In this paper, we focus on the task of fetal body segmentation in MRI scans. Several automatic segmentation methods were proposed for this task. In an early work, Zhang et al. [3] proposed a graph-based segmentation method. More recently, automatic segmentation methods for fetal MRI are based on deep neural networks. Dudovitch et al. [4] describes a fetal body segmentation network that reached high performance with only nine annotated examples. However, the method was tested only on data with similar resolutions and similar gestational ages for the FIESTA sequence. Lo et al. [5] proposed a 2D deep learning framework with cross attention squeeze and excitation network with 60 training scans for fetal body segmentation in SSFP sequences. While effective, robust deep learning methods usually require a large, high-quality dataset of expert-validated annotations, which is very difficult and expensive to obtain. The annotation process is especially time consuming for structures with large volumes, as they require the delineation of many slices. Therefore, in many cases, the annotation process is performed iteratively, when first initial segmentation is obtained with few annotated datasets, and subsequently manual segmentations are obtained by correcting network results. However, the initial segmentation network trained on few datasets is usually not robust and might fail for cases that are very different from the training set. To address the high cost associated with annotating structures with large volumes, one approach is to use sparse annotations, where only a fraction of the slices or pixels are annotated [6]. Çiçek et al. [7] describes a 3D network to generate a dense volumetric segmentation from sparse annotations, in which uniformly sampled slices were selected for manual annotation. Goetz et al. [8] selectively annotated unambiguous regions and employed domain adaptation techniques to correct the differences between the training and test data distributions caused by sampling selection errors. Bai et al. [9] proposed a method that starts by propagating the label map of a specific time frame to the entire longitudinal series based on the motion estimation, and then combines FCN with a Recurrent Neural Network (RNN) for longitudinal segmentation. Lejeune et al. [10] introduced a semi-supervised framework for video and volume segmentation that iteratively refined the pixel-wise segmentation within an object of interest. However, these methods impose restrictions on the way the partial annotations are sampled and selected that may be inconvenient for the annotator and still require significant effort. Wang et al. [11] proposed using incomplete annotations in a user-friendly manner of either a set of consecutive slices or a set of typical separate slices. They used a combined cross-entropy loss with boundary loss and performed labels completion based on the network output uncertainty that was incorporated in the loss function. They showed that their method with 30% of annotated slices was close to the performance using full annotations. 
However, the authors did not compare segmentation results using full versus partial annotations with the same annotation effort. Also, a question remains if
user-friendly partial annotations can be leveraged in the context of the Dice loss as well, a widely used loss function that is robust to class imbalance [12].
Fig. 1. Training flow with partial annotations. 1) Non-empty blocks are picked from the partially annotated scans (sagittal view, example of relevant blocks is shown in yellow). 2) A batch of nonempty blocks is used as input along with information about non-empty slices. The black areas of the blocks correspond to unselected voxels (voxels that are not used by the loss function). 3) The network is trained with a selective loss that uses only the pixels in annotated slices. (Color figure online)
Training with limited data usually makes the training optimization more difficult. Therefore, to facilitate optimization, we seek a scheme that will help in avoiding convergence to a poor local minimum. Smith [13] proposed the usage of a cyclic learning rate to remove the need for finding the best values and schedule for the global learning rates. Loshchilov et al. [14] showed the effectiveness of using warm learning rate restarts to deal with ill-conditioned functions. They used a simple learning rate restart scheme after a predefined number of epochs. In this paper, we explore the effectiveness of using partial annotations under low data regime with the Soft Dice loss function. We also explore the usefulness of a warm restarts optimization scheme in combination with fine-tuning to deal with the optimization difficulties under low data regime.
2 Method Our segmentation method with small annotation cost consists of two main steps: 1) manual partial delineations, where the user partially annotates scans with the guidance of the algorithm; 2) training with partial annotations, where a 3D segmentation network is trained with blocks of the partially annotated data using a selective loss function. The manual partial delineations step is performed as follows. First, the uppermost and lowermost slices of the organ are manually selected by the annotator, which is a quick and easy task. Then, the algorithm randomly chooses a slice within the structure of interest. Finally, the slices to annotate around this slice are selected. The number of slices is determined by the chosen annotation percentage. The annotation percentage is taken from the slices that include the structure of interest, i.e., non-empty segmentation slices. The slices to annotate are chosen consecutively to reduce annotation time, as often the annotations depend on the 3D structure of the organ seen by scrolling and viewing nearby slices during the annotation. The training with partial annotations is performed as follows. Only the non-empty blocks of the partially annotated data are used for training, as some of the blocks may
not include annotations at all. To enrich the annotated data, we also use the border slices information in the loss function and treat the slices outside the structure of interest as annotated slices. We add as input to the network a binary mask specifying the locations of the annotated slices during training. The network is trained with a selective loss function that takes into account only the annotated slices. Also, we use a relatively large batch size of 8 to include enough information during each optimization iteration. Figure 1 shows the training flow.
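A minimal sketch of the algorithm-guided slice selection described above is given below; it is our illustration, not the authors' code, and the function and variable names are hypothetical. It assumes the annotator has already marked the uppermost and lowermost slices of the structure, and that 20% of the non-empty slices are to be annotated, as in the experiments later in the paper.

```python
# Sketch: choose a consecutive block of slices for partial manual annotation.
import numpy as np

def pick_slices_to_annotate(first_slice, last_slice, fraction=0.2, rng=None):
    rng = rng or np.random.default_rng()
    n_structure = last_slice - first_slice + 1           # slices containing the structure
    n_annotate = max(1, round(fraction * n_structure))   # e.g. 20% of non-empty slices
    centre = rng.integers(first_slice, last_slice + 1)   # random slice inside the structure
    start = int(np.clip(centre - n_annotate // 2,
                        first_slice, last_slice - n_annotate + 1))
    return list(range(start, start + n_annotate))        # consecutive block to annotate
```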
Fig. 2. Illustration of the two-step optimization process with the proposed learning rate regimes (graphs of learning rate as a function of epoch number).
2.1 Selective Dice Loss

To train a network with partially annotated data, we modify the loss function to use only the annotated slices information. We illustrate the use of a selective loss for the commonly used Soft Dice loss. A batch loss is used, meaning that the calculation is performed on the 4-dimensional batch directly instead of averaging the losses of single data blocks. Let the number of image patches be $I$ and let the image patch consist of $C$ pixels. The number of voxels in a minibatch is therefore given by $I \times C = N$. Let $t_i$ be a voxel at location $i$ in the minibatch for the ground truth delineation $t_i \in T$ and $r_i$ be a voxel at the location $i$ in the minibatch for the network result $r_i \in R$. The Batch Dice loss [15] is defined as:

$$\text{Batch Dice Loss} = -\frac{2\sum_{i \in N} t_i\, r_i}{\sum_{i \in N} t_i + \sum_{i \in N} r_i} \tag{1}$$

Since we have partial annotations, we will use only the annotated slices locations in the loss calculation. Let $T' \subset T$ and $R' \subset R$ be the ground truth in the annotated slices and the network result in the annotated slices, with minibatch voxels $t_i' \in T'$ and $r_i' \in R'$ respectively. The number of voxels that we consider in the minibatch is now $N' < N$, corresponding only to the annotated slices. The batch Dice loss for partial annotations is defined as:

$$\text{Selective Batch Dice Loss} = -\frac{2\sum_{i \in N'} t_i'\, r_i'}{\sum_{i \in N'} t_i' + \sum_{i \in N'} r_i'} \tag{2}$$
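The sketch below shows one way Eq. (2) could be written in PyTorch, assuming a binary mask marks the annotated slices (1) and unannotated ones (0); the tensor shapes and the small epsilon added to the denominator are our assumptions, not the authors' implementation.

```python
# Sketch: selective batch Dice loss over annotated voxels only (Eq. 2).
import torch

def selective_batch_dice_loss(pred, target, mask, eps=1e-6):
    """pred, target, mask: tensors of shape (B, D, H, W); mask selects annotated slices."""
    pred, target = pred * mask, target * mask       # keep only annotated voxels
    intersection = (pred * target).sum()            # single sum over the whole batch
    denom = pred.sum() + target.sum()
    return -(2.0 * intersection) / (denom + eps)
```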
2.2 Optimization

To facilitate the optimization process under a small data regime, we perform the training in two steps. First, a network is trained with reduction of the learning rate on plateau. Then, we use the weights of the network with the best results on the validation set to continue training. Similarly to the first phase, the training in the second phase is performed with reduction on plateau, but this time with learning rate restarts every predefined number of epochs (Fig. 2).
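A sketch of this two-step schedule is given below. The learning rate of 0.0005 and the 60-epoch restart period follow the experimental setup later in the paper; the choice of Adam, the plateau patience, and the helper callables `train_epoch` and `validate` are assumptions for illustration.

```python
# Sketch: phase one uses reduce-on-plateau; phase two restarts from the best
# checkpoint and additionally resets the learning rate at fixed intervals.
import torch

def run_phase(model, train_epoch, validate, epochs, base_lr=5e-4, restart_every=None):
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=10)
    best = float("inf")
    for epoch in range(epochs):
        if restart_every and epoch > 0 and epoch % restart_every == 0:
            for group in optimizer.param_groups:     # warm restart: reset the learning rate
                group["lr"] = base_lr
        train_epoch(model, optimizer)
        val_loss = validate(model)
        scheduler.step(val_loss)
        if val_loss < best:
            best = val_loss
            torch.save(model.state_dict(), "best.pt")  # best weights feed the next phase

# Phase 1: run_phase(model, train_epoch, validate, epochs=300)
# Phase 2: model.load_state_dict(torch.load("best.pt"))
#          run_phase(model, train_epoch, validate, epochs=300, restart_every=60)
```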
3 Experimental Results

To evaluate our method, we retrospectively collected fetal MRI scans with the FIESTA and TRUFI sequences and conducted two studies.

Datasets and Annotations: We collected fetal body MRI scans of patients acquired with the true fast imaging with steady-state free precession (TRUFI) and the fast imaging employing steady-state acquisition (FIESTA) sequences from the Sourasky Medical Center (Tel Aviv, Israel) with gestational ages (GA) of 28–39 weeks, and fetal body MRI scans acquired with the FIESTA sequence from Children’s Hospital of Eastern Ontario (CHEO), Canada, with GA between 19–37 weeks. Table 1 shows a detailed description of the data.

Table 1. Datasets description.

MRI sequence | ID/OOD | Clinical site | Scanners | Resolution (mm³) | Pixels/slice | # Slices | GA | #
TRUFI  | ID  | Sourasky Medical Center | Siemens Skyra 3T, Prisma 3T, Aera 1.5T | 0.6–1.34 × 0.6–1.34 × 2–4.8 | 320–512 × 320–512 | 50–120 | 28–39 | 101
FIESTA | ID  | Sourasky Medical Center | GE MR450 1.5T | 1.48–1.87 × 1.48–1.8 × 2–5 | 256 × 256 | 50–100 | 28–39 | 104
FIESTA | OOD | Children’s Hospital | Mostly GE Signa HDxt 1.5T; Signa 1.5T, SIEMENS Skyra 3T | 0.55–1.4 × 0.55–1.4 × 3.1–7.5 | 256 × 256 – 512 × 512 | 19–55 | 19–37, mostly 19–24 | 33
18
B. S. Fadida et al.
first, a FIESTA network was used to perform initial segmentation and afterwards a TRUFI network was trained for improved initial segmentation. Both the annotations and the corrections were performed by a clinical trainee. All segmentations were validated by a clinical expert. Studies: We conducted two studies that compare partial annotations to full annotations with the same number of slices. Study 1 evaluates the partial annotations method for the TRUFI body dataset and performs ablation for the two-step optimization process and the usage of slices outside of the fetal body structure. Study 2 evaluates the partial annotations method for the FIESTA body dataset for both ID and OOD data. For both studies, we compared training with 6 fully annotated cases to 30 partially annotated cases with annotation of 20% of the slices. The selection of cases and the location for partial annotations was random for all experiments. Because of the high variability in segmentation quality for the low-data regime, we performed all the experiments with four different randomizations and averaged between them. The segmentation quality is evaluated with the Dice, Hausdorff and 2D ASSD (slice Average Symmetric Surface Difference) metrics.
Fig. 3. Fetal body segmentation results for the FIESTA sequence. Training with full annotations (full) is compared to training with partial annotations with (\w) and without (\wo) border slices. The colored bars show the STD of the metric and the grey bars show the range of the metric (minimum and maximum).
A network architecture similar to Dudovitch et al. [4] was utilized with a patch size of 128 × 128 × 48 to capture a large field of view. A large batch size of 8 was used in all experiments to allow for significant updates in each iteration for the partial annotations regime. Since the TRUFI sequence had a higher resolution compared to FIESTA, the scans were downscaled by × 0.5 in the in-plane axes to have a large field of view [16]. The segmentation results were refined by standard post-processing techniques of holes filling and extraction of the main connected component. Both partially annotated and fully annotated networks were trained in a two-step process. First, the network was trained with a decreasing learning rate, with an initial learning rate of 0.0005. The network that yielded the best validation result was selected, and this network was then fine-tunned on the same data. For fine-tuning, we again used a decreasing learning rate scheme with an initial learning rate of 0.0005, but this time we performed learning rate restarts every 60 epochs.
Partial Annotations for the Segmentation of Large Structures
19
Study 1: partial annotations for TRUFI sequence and ablation The method was evaluated on 30/13/58 training/validation/test split for partially annotated cases with 20% of annotated slices and 6/13/58 for fully annotated cases. The 6 fully annotated training cases were randomly chosen out of the 30 partially annotated training cases. Ablation experiments were performed to evaluate the effectiveness of the two-step optimization scheme and the usage of slices outside the body structure. Six scenarios were tested: 1) full annotations without fine tuning; 2) partial annotations without fine tuning and without borders information; 3) partial annotations without fine-tuning and with borders information; 4) full annotations with fine tuning; 5) partial annotations with fine tuning but without borders information; 6) partial annotations with fine tuning and borders information. Figure 3 shows the fetal body segmentation results with the Dice score and ASSD evaluation metrics. Fine tuning with restarts was helpful for both full and partial annotations regimes, increasing the full annotations segmentation Dice score from 0.919 to 0.937 and partial annotations with borders segmentation Dice score from 0.92 to 0.942. Incorporating border information with the selective Dice loss function improved partial annotation setting, increasing the Dice score from 0.936 to 0.942 and decreasing the Dice Standard Deviation (STD) from 0.056 to 0.049. Finally, partial annotations with borders information had slightly better average results to the full annotations regime with a Dice score of 0.937 and 0.942 and ASSD of 3.61 and 3.52 for the full and partial annotations respectively, with a substantially smaller STD: a Dice score STD of 0.063 compared to 0.049 and ASSD STD of 4.04 compared to 3.45 for the full annotations and partial annotations regimes respectively. Table 2. Segmentation results comparison between partial and full annotations for FIESTA body sequence on ID and OOD data. Best results are shown in bold. Unusual behavior for fine-tuning (two step optimization) is indicated with italics. Data distribution
Data distribution | Network training | Dice | Hausdorff (mm) | 2D ASSD (mm)
In-Distribution (ID) | Full | 0.959 ± 0.044 | 34.51 ± 37.26 | 2.15 ± 2.33
In-Distribution (ID) | Full fine-tuned | 0.964 ± 0.040 | 32.98 ± 36.86 | 1.88 ± 2.07
In-Distribution (ID) | Partial | 0.959 ± 0.034 | 34.15 ± 35.96 | 2.21 ± 1.67
In-Distribution (ID) | Partial fine-tuned | 0.965 ± 0.029 | 31.89 ± 35.82 | 1.90 ± 1.39
Out-of-Distribution (OOD) | Full | 0.836 ± 0.178 | 39.34 ± 29.26 | 7.46 ± 10.61
Out-of-Distribution (OOD) | Full fine-tuned | 0.826 ± 0.214 | 39.61 ± 32.66 | 8.86 ± 16.54
Out-of-Distribution (OOD) | Partial | 0.875 ± 0.091 | 36.19 ± 21.44 | 5.47 ± 3.92
Out-of-Distribution (OOD) | Partial fine-tuned | 0.899 ± 0.067 | 30.37 ± 18.86 | 4.00 ± 2.26
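The selective Dice loss with border information mentioned above can be sketched as follows. This is an assumption about its form: the loss is computed only over slices that carry annotations (including the two border slices marking the structure extent), while unannotated slices are masked out; tensor shapes and names are illustrative.

```python
import torch

def selective_dice_loss(pred, target, annotated, eps=1e-6):
    """Dice loss restricted to annotated slices.
    pred, target: (B, 1, H, W, D) predicted probabilities and partial annotation mask.
    annotated:    (B, D) boolean mask, True for slices with annotations
                  (including the border slices that mark the structure extent)."""
    mask = annotated[:, None, None, None, :].float()   # broadcast over H and W
    p, t = pred * mask, target * mask
    inter = (p * t).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + t.sum() + eps)
```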
Study 2: partial annotations for the FIESTA sequence on ID and OOD data. For the partial annotations regime, the network was trained on 30 cases, and for the full annotations regime the network was trained on 6 cases randomly chosen from the 30 partially annotated training cases.
Fig. 4. Illustrative fetal body segmentation results for the FIESTA OOD data. Left to right (columns): 1) original slice; 2) Full annotations without fine-tuning; 3) Full annotations with fine-tuning; 4) Partial annotations with fine-tuning; 5) ground truth.
For both methods, we used the same 6 cases for validation, 68 test cases for ID data, and 33 test cases for OOD data. The OOD data was collected from a different clinical site than the training set and included mostly smaller fetuses (28 out of 33 fetuses had a GA of 19–24 weeks, compared to a GA of 28–39 weeks in the training set). For both the partial and full annotations regimes we used Test-Time Augmentation (TTA) [17] in the OOD setting to reduce over-segmentation. Because of large resolution differences, we rescaled the OOD data to a resolution of 1.56 × 1.56 × 3.0 mm³, similar to the resolution of the training set. In total, eight scenarios were tested, four for ID data and four for OOD data. For both ID and OOD data the following were tested: 1) full annotations without fine-tuning; 2) full annotations with fine-tuning; 3) partial annotations without fine-tuning; 4) partial annotations with fine-tuning. Table 2 shows the results. For the ID data, partial annotations results were similar to full annotations with the same annotation effort, but again the STD was much smaller: a Dice STD of 0.04 compared to 0.029 and an ASSD STD of 2.07 compared to 1.39 for full and partial annotations respectively. For both the full and partial annotations regimes, fine-tuning slightly improved the segmentation results. For the OOD data, the differences between segmentation results using full and partial annotations were much larger, with better results for the partial annotations regime. Using partial annotations, results improved from a Dice score of 0.836 to 0.899 and from an ASSD of 7.46 mm to 4 mm. Unlike in the ID setting, fine-tuning with restarts hurt performance on OOD data in the full annotations regime, potentially indicating overfitting. This was not the case for partial annotations, where fine-tuning with learning rate restarts again further improved segmentation results, as in the ID setting. Figure 4 shows illustrative body segmentation results for the OOD data. Partial annotations showed better performance on these cases compared to full annotations, indicating higher robustness. Also, fine-tuning with full annotations resulted in decreased performance, with a complete failure to detect the case in the top row, which may indicate overfitting to the training set.
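A minimal sketch of flip-based test-time augmentation is shown below. The specific augmentation set is an assumption (here only in-plane flips); [17] describes a richer TTA scheme, and the assumption that the network outputs logits is also illustrative.

```python
import torch

@torch.no_grad()
def tta_predict(net, volume):
    """Average predictions over in-plane flips of a volume of shape (1, 1, H, W, D)."""
    flip_dims = [[], [2], [3], [2, 3]]              # no flip, flip H, flip W, flip both
    prob = torch.zeros_like(volume)
    for dims in flip_dims:
        v = torch.flip(volume, dims) if dims else volume
        p = torch.sigmoid(net(v))                   # assumes the net returns logits
        prob += torch.flip(p, dims) if dims else p  # undo the flip before averaging
    return prob / len(flip_dims)
```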
4 Conclusion
We have presented a new method for using partial annotations for large structures. The method consists of an algorithm-guided annotation step and a network training step with selective data blocks and a selective loss function. The method demonstrated significantly better robustness in a low-data regime compared to full annotations. We also presented a simple two-step optimization scheme for the low-data regime that combines fine-tuning with learning rate restarts. Experimental results show the effectiveness of the optimization scheme for the partial annotations method on both ID and OOD data. For full annotations, the two-step optimization was useful only for ID data but hurt performance on OOD data, indicating potential overfitting. The selected partial annotations are user-friendly and require only two additional clicks, at the beginning and end of the structure of interest, which is negligible compared to the effort required for full segmentation delineations. Thus, they can easily be used to construct a dataset with a low annotation cost for an initial segmentation network.
Acknowledgements. This research was supported in part by Kamin Grant 72061 from the Israel Innovation Authority.
References 1. Reddy, U.M., Filly, R.A., Copel, J.A.: Prenatal imaging: ultrasonography and magnetic resonance imaging. Obstet. Gynecol. 112(1), 145–150 (2008) 2. Rutherford, M., et al.: MR imaging methods for assessing fetal brain development. Dev. Neurobiol. 68(6), 700–711 (2008) 3. Zhang, T., Matthew, J., Lohezic, M., Davidson, A., Rutherford, M., Rueckert, D et al.: Graph-based whole body segmentation in fetal MR images. In: Proceedings of the Medical Image Computing and Computer-Assisted Intervention Workshop on Perinatal, Preterm and Paediatric Image Analysis (2016) 4. Dudovitch, G., Link-Sourani, D., Ben Sira, L., Miller, E., Ben Bashat, D., Joskowicz, L.: Deep learning automatic fetal structures segmentation in MRI scans with few annotated datasets. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12266, pp. 365–374. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59725-2_35 5. Lo, J., et al.: Cross attention squeeze excitation network (CASE-Net) for whole body fetal MRI segmentation. Sensors 21(13), 4490 (2021) 6. Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J.N., Wu, Z., Ding, X.: Embracing imperfect datasets: a review of deep learning solutions for medical image segmentation. Med. Image Anal. 63(1), 101693 (2020) 7. Çiçek, O., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M., Unal, G., Wells, W. (eds.) Proceedings of the international Conference Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. MICCAI 2016, LNIP, vol 9901, pp. 424–432. Springer, Cham. https://doi.org/10.1007/978-3-319-467238_49 8. Goetz, M., et al.: DALSA: domain adaptation for supervised learning from sparsely annotated MR images. IEEE Trans. Med. Imag. 35(1), 184–196 (2016)
9. Bai, W., et al..: Recurrent neural networks for aortic image sequence segmentation with sparse annotations. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11073, pp. 586–594. Springer, Cham (2018). https:// doi.org/10.1007/978-3-030-00937-3_67 10. Lejeune, L., Grossrieder, J., Sznitman, R.: Iterative multi-path tracking for video and volume segmentation with sparse point supervision. Med. Image Anal. 50, 65–81 (2018) 11. Wang, S., et al.: CT male pelvic organ segmentation via hybrid loss network with incomplete annotation. IEEE Trans. Med. Imaging 39(6), 2151–2162 (2020) 12. Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Jorge Cardoso, M.: Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Cardoso, M.J., et al. (eds.) DLMIA/ML-CDS -2017. LNCS, vol. 10553, pp. 240–248. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67558-9_28 13. Smith, L.N.: Cyclical learning rates for training neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) 24 Mar 2017, pp. 464–472. IEEE 14. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016) 15. Kodym, O., Španˇel, M., Herout, A.: Segmentation of head and neck organs at risk using CNN with batch dice loss. In: Brox, T., Bruhn, A., Fritz, M. (eds.) GCPR 2018. LNCS, vol. 11269, pp. 105–114. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-12939-2_8 16. Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a selfconfiguring method for deep learning-based biomedical image segmentation. Nat. Meth. 18(2), 203–211 (2021) 17. Wang, G., Li, W., Aertsen, M., Deprest, J., Ourselin, S., Vercauteren, T.: Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing 338, 34–45 (2019)
Abstraction in Pixel-wise Noisy Annotations Can Guide Attention to Improve Prostate Cancer Grade Assessment
Hyeongsub Kim, Seo Taek Kong, Hongseok Lee, Kyungdoc Kim, and Kyu-Hwan Jung(B)
VUNO Inc., Seoul, Korea
[email protected] https://www.vuno.co/
Abstract. Assessing prostate cancer grade from whole slide images (WSIs) is a challenging task. While both slide-wise and pixel-wise annotations are available, the latter suffers from noise. Multiple instance learning (MIL) is a widely used method to train deep neural networks using WSI annotations. In this work, we propose a method to enhance MIL performance by deriving weak supervisory signals from pixel-wise annotations to effectively reduce noise while maintaining fine-grained information. This auxiliary signal can be derived in various levels of hierarchy, all of which have been investigated. Comparisons with strong MIL baselines on the PANDA dataset demonstrate the effectiveness of each component to complement MIL performance. For 2,097 test WSIs, accuracy (Acc), the quadratic weighted kappa score (QWK), and Spearman coefficient were increased by 0.71%, 5.77%, and 6.06%, respectively, while the mean absolute error (MAE) was decreased by 14.83%. We believe that the method has great potential for appropriate usage of noisy pixel-wise annotations. Keywords: Multiple instance learning · Weak supervision · Noisy labels · Prostate cancer grade assessment · Whole slide image
1 Introduction
Prostate cancer is one of the most common cancers in the world [8,10]. Important prognostic information is inferred from Gleason patterns and grades, which are categorized into International Society of Urological Pathology (ISUP) grade groups [5] based on their severity. Assessing prostate cancer grades in whole slide images (WSIs) with giga-scale resolutions is time-consuming, and pixel-wise annotations have significant noisiness [1]. Deep neural networks, when used to assist the diagnosis of cancer, must indicate the regions where Gleason patterns are present for further confirmation. However, pixel-wise Gleason pattern annotations are known to be excessively noisy and their
noise levels outweigh their potential benefits. Optimizing patch-wise metrics was insufficient to translate to slide-wise performance, and consequently learning algorithms have typically used pixel-wise annotations only for feature extraction, while the final classifiers have been trained on the less noisy slide-wise annotations [9]. Multiple instance learning (MIL) is a widely used paradigm for classifying histopathological WSIs because slide-wise annotations can be obtained through medical information systems while pixel-wise annotations are not readily available [4]. Attention-based MIL emphasizes regions to locate sparsely positioned lesions in core needle biopsy tissues but never directly accesses pixel-wise information. Several attempts to utilize both WSI and pixel-wise annotations are outlined in [2,9]. Instead of relying on noisy Gleason patterns, these studies use annotations indicating the presence of tumor and separate localization from classification. Specifically, Strom and Kartasalo et al. applied boosting on ensembles of detection and grading networks and evaluated their patch-wise performances [9]. Bulten et al. mimicked a clinical setting where a feature extraction network learns to identify tumor positions [2]; features were then extracted to train a classification model predicting ISUP grade groups. Without training on segmentation masks, [4] utilized a ranking of the top-K relevant patches together with MIL, and the relevant patches were fed to a recurrent neural network to diagnose malignant or benign tumors. This work seeks to complement MIL by eliciting useful information from noisy pixel-wise annotations through a statistical approach. Our experiments demonstrate that without carefully filtering pixel-wise noise, a combination with MIL amplifies errors in already mis-classified cases, e.g. an ISUP grade 3 classified as 2 by a MIL model is classified as 1 by their combination. To allay such issues, we propose to construct weak supervisory signals from noisy pixel-wise annotations. Annotation abstraction derived from Gleason patterns was shown to enhance spatial attention by reducing pixel-wise noise. Experiments demonstrate how coarse auxiliary signals effectively enhance an attention module's accuracy and improve ISUP grading of prostate WSIs.
2 Materials and Method
2.1 Data
The Prostate cANcer graDe Assessment (PANDA) dataset containing 10,516 WSIs was used for this study [3]. The data were split into 8,419 and 2,097 WSIs of digitized hematoxylin and eosin (H&E)-stained biopsies for training and testing. Slide-wise annotations are provided in the form of Gleason scores and a corresponding severity grade ranging from 0 to 5 according to the International Society of Urological Pathology (ISUP) standard [5], with endpoints indicating no tumor or malignancy. Mask values in the pixel-wise annotations depend on the data provider [3]: masks acquired from different institutions come with different semantics and are converted to a common mask indicating tumor presence. In this work, we excluded slide-wise Gleason scores to focus on the effect of annotation abstraction. The distribution of the dataset is detailed in Table 1.
Table 1. Dataset description separated by ISUP grade groups.
Grading group | Train | Val | Test | Total
No tumor | 1,726 | 575 | 572 | 2,873
ISUP group 1 | 1,572 | 524 | 520 | 2,616
ISUP group 2 | 805 | 269 | 267 | 1,341
ISUP group 3 | 735 | 244 | 247 | 1,226
ISUP group 4 | 750 | 249 | 246 | 1,245
ISUP group 5 | 727 | 243 | 245 | 1,215
Total | 6,315 | 2,104 | 2,097 | 10,516
2.2 Architecture
The end-to-end network architecture used throughout this work is described here. An ImageNet-pretrained ResNeXt-50 extracted 2,048-channel pre-global-average-pooling features from a batch of bg = 16 WSI inputs. Each WSI was split into bs = 32 patches with a resolution of H = W = 224. A learnable global convolution filter followed by a sigmoid activation was used to compute the attention A, which was multiplied with the input feature. Post-attention features were fed to the classification layers to predict the ISUP grade group. The classification layer consists of a max-pooling layer, an average-pooling layer, and a fully connected (FC) layer, as in Fig. 1(a).
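A sketch of this attention head is given below. The exact kernel size of the "global convolution", how the two pooled vectors are combined, and how patch features are aggregated per slide are not specified in the text, so the 1×1 convolution, the concatenation of max- and average-pooled features, and all names are assumptions.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Illustrative attention head on top of pre-GAP ResNeXt-50 features."""
    def __init__(self, in_ch=2048, n_classes=6):
        super().__init__()
        self.att_conv = nn.Conv2d(in_ch, 1, kernel_size=1)   # attention logits (assumed 1x1)
        self.fc = nn.Linear(2 * in_ch, n_classes)

    def forward(self, feats):
        # feats: (bg * bs, C, h, w) pre-global-average-pooling features
        att = torch.sigmoid(self.att_conv(feats))             # attention map A
        feats = feats * att                                    # post-attention features
        pooled = torch.cat([feats.amax(dim=(2, 3)), feats.mean(dim=(2, 3))], dim=1)
        return self.fc(pooled), att                            # ISUP logits and attention
```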
2.3 Multiple Instance Learning for Cancer Grade Assessment
Let Y = {0, . . . , 5} be the set of possible ordinal annotations describing ISUP grades. A classifier is trained to predict slide-wise ordinal annotations y ∈ Y. Its softmax prediction is denoted by p̂. Because the classes share ordinal relations, the mean-variance loss [7] is added to the standard cross-entropy loss:

$$\mathcal{L}_{mv} = H(y, \hat{p}) + \mathbb{E}_{\hat{y}\sim\hat{p}}\!\left[(\hat{y}-y)^2\right] + \left(\mathbb{E}_{\hat{y}\sim\hat{p}}[\hat{y}] - y\right)^2. \tag{1}$$
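A direct sketch of Eq. (1) is shown below; variable names and the use of logits as input are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mean_variance_loss(logits, y):
    """Eq. (1): cross entropy plus E[(ŷ - y)^2] and (E[ŷ] - y)^2 under the softmax p̂."""
    p = logits.softmax(dim=1)                                   # (N, 6)
    grades = torch.arange(p.size(1), device=p.device).float()   # ordinal grades 0..5
    ce = F.cross_entropy(logits, y)
    e_sq = (p * (grades[None, :] - y[:, None].float()) ** 2).sum(dim=1).mean()
    mean_sq = (((p * grades).sum(dim=1) - y.float()) ** 2).mean()
    return ce + e_sq + mean_sq
```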
2.4 Noisy Labels and Weak Supervision
Raw pixel-wise annotations are extremely noisy [3] and have therefore often been discarded [1]. Models trained using only MIL often weigh each patch equally because only ISUP grade groups are learnt. Appropriately processed fine-grained annotations can potentially inform the model to utilize local morphological features whose importance should be weighed differently. Consensus on fine-grained pixel-wise annotations was rarely achievable, so the annotation method itself could be a major source of the noise in pixel-wise annotations. However, abstracting them at an increased coarseness
Fig. 1. (a) An overview of the proposed method. Attention mechanism for multiple instance learning and additional network layers when using (b) no auxiliary loss, (c) patch-wise auxiliary loss, and (d) slide-wise auxiliary loss. bg : Global batch size, bs : slide batch size, C: Input channel size, W: Input width, H: Input height, Cf : Initial feature channel
releases pixel-wise noise in WSI annotations. Let γ_p and γ_s be the ratio of tumor to total tissue area in each patch and slide, respectively:

$$\gamma_\ell = \frac{1}{|\Omega_\ell|} \sum_{\omega \in \Omega_\ell} \mathbb{1}\{M_\omega = 1\}, \quad \ell \in \{p, s\}, \tag{2}$$

where Ω_ℓ = {1, . . . , H_ℓ} × {1, . . . , W_ℓ} is the resolution set of a patch or slide, i.e. its pixel indices, and M_ω is the tumor indicator mask. The masks (Fig. 2(a)) are obtained from the WSI using the Akensert method [1], with a resolution of 1.0 micron per pixel (mpp). To ensure sufficient representation capacity for separability, we added a learnable block followed by a sigmoid for each coarseness level ℓ ∈ {p, s}, shown in Fig. 1(b–d). The auxiliary losses L_ℓ are then computed as the binary cross entropy H_2 between the predictions and the above ratios:

$$\mathcal{L}_\ell = H_2(\gamma_\ell, \hat{p}_\ell). \tag{3}$$
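The abstraction targets of Eq. (2) and the auxiliary loss of Eq. (3) can be sketched as follows; the tensor layout (one slide's bs patch masks stacked together) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def tumor_ratios(tumor_mask):
    """Eq. (2): fraction of tumor pixels per patch and for the whole slide.
    tumor_mask: (bs, H, W) binary tumor-presence masks for the bs patches of one slide."""
    gamma_p = tumor_mask.float().mean(dim=(1, 2))   # (bs,) patch-level ratios
    gamma_s = tumor_mask.float().mean().view(1)     # (1,)  slide-level ratio
    return gamma_p, gamma_s

def auxiliary_loss(pred_ratio, gamma):
    """Eq. (3): binary cross entropy between the predicted and target ratio."""
    return F.binary_cross_entropy(pred_ratio, gamma)
```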
Fig. 2. (a) Slide batch generation; bs: slide batch size, C: channel size, W: patch width, H: patch height. (b) Abstraction of the noisy annotations based on the noisy pixel-wise annotation.
Combining all the losses considered, the total loss is a convex combination of the MIL loss and the auxiliary loss computed at a given level of abstraction ℓ ∈ {p, s}. Here, w is the weight for the auxiliary loss:

$$\mathcal{L} = w\,\mathcal{L}_\ell + (1 - w)\,\mathcal{L}_{mv}. \tag{4}$$
3 Experiments
3.1 Implementation and Evaluation
We compared the performance of three baselines without the auxiliary loss and conducted an ablation study assessing the effectiveness of each auxiliary loss according to the abstraction type and its weight (w). All models shared the same ResNeXt-50 (32 × 4d) encoder. The first baseline is a MIL model with neither attention nor the auxiliary loss. This MIL baseline already achieved high performance, ranking in the top 10 in the challenge [1]. The second
Fig. 3. A comparison of methods using patch-wise and slide-wise annotation abstractions evaluated with respect to the following criteria: (a) Accuracy (Acc), (b) Mean absolute error (MAE), (c) Quadratic weighted kappa (QWK), and (d) Spearman correlation. (Color figure online)
baseline model consisted of two stages. In the first stage, a U-Net model with a ResNeXt-50 backbone was trained on pixel-wise annotations for feature extraction. In the second stage, MIL with the frozen ResNeXt-50 from the end of the first stage (Fig. 1(a)) was trained only on slide-wise annotations based on the first stage's output, as in typical methods [2,9]. The third baseline adds only the attention module, without abstraction, on top of the second baseline.
The ablation study proceeds with increasing levels of abstraction (patch/slide) and various coefficients (w). A coefficient of 0.7 on the auxiliary loss, which weighs the abstraction loss during training, was found to work best via grid search. The AdamW optimizer [6] with 16 slides in each mini-batch was used with cosine annealing, and the initial learning rate was set to 1e−4. Performance for ISUP grade group prediction was evaluated with respect to accuracy, mean absolute error (MAE), quadratic weighted kappa (QWK), and Spearman rank correlation.
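These four criteria can be computed directly with standard library calls, as in the sketch below (the function name and dictionary layout are illustrative).

```python
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, cohen_kappa_score, mean_absolute_error

def isup_metrics(y_true, y_pred):
    """Evaluation criteria used above; y_true / y_pred are integer ISUP grades 0-5."""
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "QWK": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
        "Spearman": spearmanr(y_true, y_pred).correlation,
    }
```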
3.2 Results
As shown in Fig. 3(a), the accuracy of the model trained on pixel-wise noisy annotations was improved by the use of slide-wise annotations. This margin is similar to the gain achieved by adding attention to the MIL baseline (green dotted line in Fig. 3). However, inspecting the other criteria (b–d), which penalize incorrect predictions far from the true annotations, demonstrates how pixel-wise labels are detrimental in amplifying incorrect predictions. Acc, QWK, and the Spearman coefficient increased by 0.71%, 5.77%, and 6.06%, and MAE decreased by 14.83%, when adding slide-wise label abstraction to the MIL baseline. For such cases, models trained using either patch- or slide-wise abstraction predicted ISUP grades closer to the true annotations. The higher the level of abstraction, the more noise is naturally filtered out, which is why the slide-wise abstraction benefits the most under highly noisy annotations. These results support that an auxiliary loss based on abstracted annotations is helpful in improving model performance.
Fig. 4. Confusion matrices comparing (a) the pixel-wise-annotation-based baseline model without abstraction and (b) the proposed method trained on abstracted annotations.
We also visualized the distribution of predictions against the true ISUP grade groups in Fig. 4. The QWK increased from 0.8190 to 0.8663 when using slide-wise abstractions. Under- and over-estimated predictions with a margin ≥ 2 are highlighted with blue and red triangles, respectively. The implications of under- and over-estimates differ: over-estimations (blue) lead to unnecessary costs of care, whereas under-estimating the severity of cancer (red) is critical because the patient would not receive proper treatment. The cumulative number of upper-triangular cases slightly increased by 9 cases (from 64 to 73), but the number of lower-triangular cases decreased by 30%, from 148 cases to 100 cases. This implies that the potential risk to a patient can be mitigated with the use of our method. In this study, we tested the effective use of pixel-wise noisy labels in slide-wise inference. It showed a performance improvement in terms of QWK compared to slide-wise classification with attention based on the results of the segmentation model. Compared with the PANDA challenge, the source of the dataset we used, we note that there may be a slight performance difference because our training and test sets differ from those of the challenge.
4 Conclusion
We proposed a method to guide a MIL attention network by performing abstraction to filter annotation noise. Our method demonstrated superior performance in comparison with strong baselines. In particular, the performance was improved for samples that were difficult to predict due to noisy annotations, thereby reducing the severity of misdiagnosis. We believe that this study has potential not only for pathology, but also for large-scale environments when fine-grained annotations are contaminated with substantial noise levels.
References 1. Bulten, W., et al.: Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the panda challenge. Nat. Med. 24, 1–10 (2022) 2. Bulten, W., et al.: Automated Gleason grading of prostate biopsies using deep learning. arXiv preprint arXiv:1907.07980 (2019) 3. Bulten, W., Pinckaers, S., Eklund, K., et al.: The PANDA challenge: prostate cancer grade assessment using the Gleason grading system. MICCAI challenge (2020) 4. Campanella, G., et al.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25(8), 1301–1309 (2019) 5. Egevad, L., Delahunt, B., Srigley, J.R., Samaratunga, H.: International society of urological pathology (ISUP) grading of prostate cancer-an ISUP consensus on contemporary grading (2016) 6. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 7. Pan, H., Han, H., Shan, S., Chen, X.: Mean-variance loss for deep age estimation from a face. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5285–5294 (2018)
8. Society, A.C.: About prostate cancer. https://www.cancer.org/cancer/prostatecancer/about/key-statistics.html 9. Str¨ om, P., et al.: Pathologist-level grading of prostate biopsies with artificial intelligence. arXiv preprint arXiv:1907.01368 (2019) 10. UK, P.C.: What is the prostate? https://prostatecanceruk.org/prostateinformation/about-prostate-cancer
Meta Pixel Loss Correction for Medical Image Segmentation with Noisy Labels
Zhuotong Cai(B), Jingmin Xin, Peiwen Shi, Sanping Zhou, Jiayi Wu, and Nanning Zheng
Xi'an Jiaotong University, Xi'an, China
[email protected]
Abstract. Supervised training with deep learning has exhibited impressive performance in numerous medical image domains. However, previous successes rely on the availability of well-labeled data. In practice, it is a great challenge to obtain a large, high-quality labeled dataset, especially for the medical image segmentation task, which generally needs pixel-wise labels, and inaccurate (noisy) labels may significantly degrade the segmentation performance. In this paper, we propose a novel Meta Pixel Loss Correction (MPLC) method based on a simple meta guided network for medical image segmentation that is robust to noisy labels. The core idea is to estimate a pixel transition confidence map with the meta guided network in order to take full advantage of noisy labels for pixel-wise loss correction. To achieve this, we introduce a small meta dataset and a meta-learning method to train the whole model and help the meta guided network automatically learn the pixel transition confidence map in an alternating training manner. Experiments have been conducted on three medical image datasets, and the results demonstrate that our method achieves superior segmentation with noisy labels compared to existing state-of-the-art approaches.
Keywords: Label noise · Loss correction · Meta learning
1 Introduction
With the recent emergence of large-scale datasets supervised by high-quality annotations, deep neural networks (DNNs) have exhibited impressive performance in numerous domains, particularly in medical applications. They have proved to be worthy computer assistants in solving many medical problems, including early disease diagnosis, disease progression prediction, patient classification, and many other crucial medical image processing tasks such as image registration and segmentation [7]. However, this success is mostly attributable to the availability of well-labeled data. In practice, it is a great challenge to obtain large, high-quality datasets with accurate labels in medical imaging, because such labeling is
not only time-expensive but also expertise-intensive. In most cases, labeled datasets contain some noisy labels, especially for the segmentation task, which generally needs pixel-wise annotation. Therefore, a segmentation model that is robust to such noisy training data is highly desirable. To overcome this problem, a few recent approaches have been proposed. Mirikharaji et al. [11] proposed a semi-supervised method for skin lesion segmentation that optimizes the weights of the images in the noisy dataset by reducing the loss on a small clean dataset. Inspired by [13], Zhu et al. [20] detected incorrect labels in the segmentation of the heart, clavicles, and lungs in chest radiographs by decreasing the weight of samples with incorrect labels. Wang et al. [18] combined meta-learning with a re-weighting method to identify corrupted pixels and re-weight their influence on the loss for lung and liver segmentation. All these methods are built on excluding or simply re-weighting the suspected noisy samples to reduce their negative influence on training. However, simple exclusion or re-weighting cannot make full use of noisy labels and ignores the cause of these noisy labels, which leaves room for further performance improvement. This motivates us to explore the feasibility of taking full advantage of noisy labels by estimating a pixel transition confidence map, so as to perform pixel loss correction, make the model noise-robust, and improve segmentation performance in the presence of corrupted pixels. In this paper, we propose a novel Meta Pixel Loss Correction (MPLC) method to address the problem of medical image segmentation with noisy labels. Specifically, we design a meta guided network that takes the segmentation network prediction as input and generates a pixel transition confidence map. The obtained pixel transition confidence map represents the probability of transitioning from the latent clean label to the observed noisy label, which leads to improved robustness to noisy labels in the segmentation network through pixel loss correction. The contributions of this paper can be summarized as follows: 1) We propose a novel meta pixel loss correction method that yields a noise-robust segmentation model and makes full use of the training data. 2) With the introduction of noise-free meta data, the whole model can be trained in an alternating manner to automatically estimate a pixel transition confidence map, which is then used for pixel loss correction. 3) We conduct experiments on three medical datasets, LIDC-IDRI, LiTS and BraTS19, for segmentation with noisy labels. The results show that our method achieves state-of-the-art performance in medical image segmentation with noisy labels.
2 Methodology
We propose a novel Meta Pixel Loss Correction (MPLC) method to correct the loss function and obtain a noise-robust segmentation network from noisy labels. The detailed architecture of our proposed framework and workflow is shown in Fig. 1. It consists of two components: (1) a segmentation network based on U-Net, and (2) a meta guided network that generates the pixel transition confidence map used for pixel loss correction. The components are trained in an end-to-end manner and are described below.
Fig. 1. Overview of our workflow in one loop.
2.1 Meta Pixel Loss Correction
Given a set of training samples with noisy labels S = {(X^i, Ỹ^i)}, 1 ≤ i ≤ N, where X^i are the training input images and Ỹ^i ∈ {0, 1}^{h×w×c} are the observed noisy segmentation annotations. We use U-Net [14] as the backbone DNN for segmentation; it generates a prediction P^i = f(X^i, ω), where f denotes the U-Net and ω its parameters. For a conventional segmentation task, the cross entropy Loss = l(P^i, Ỹ^i) is used as the loss function to learn the parameters ω. However, there may be many noisy labels in the training dataset, which leads to poor performance of the trained U-Net, because these errors in the loss function can drive the gradient in the wrong direction and cause overfitting [16]. Instead of simply excluding the corrupted, unreliable pixels [11,18], we aim to take advantage of these noisy labels.

T Construction. Assume there is a pixel transition confidence map T that bridges the clean and noisy labels by specifying the probability of a clean label flipping to the noisy label. T is applied to the segmentation prediction through a transition function, yielding a revised prediction that resembles the corresponding noisy mask. Thus, the noisy labels are used properly, the usual cross entropy loss between the revised prediction and the noisy mask can be applied as is, and training is approximately equivalent to training on clean labels. In this paper, we design a learning framework that adaptively generates the pixel transition confidence map T from the prediction P^i at every training step:

$$T^i = g(P^i, \theta), \tag{1}$$

where θ denotes the parameters of this framework. Specifically, for T at every pixel we have

$$T^i_{xy} = p\!\left(\tilde{Y}^i_{xy} = m \mid Y^i_{xy} = n\right), \quad \forall m, n \in \{0, 1\}, \tag{2}$$
where T^i_{xy} represents the confidence of transitioning from the latent clean label Y^i_{xy} to the observed noisy label Ỹ^i_{xy} at pixel (x, y). Corrupted pixels have low pixel confidence but a high transition probability. Since the segmentation is binary, we assume the size of the pixel transition matrix is N × C × H × W, where C = 2 represents the foreground and background. Each value in the transition matrix for a given C represents the confidence that a foreground or background pixel does not flip to the other class. We can use T^i_{xy} to perform pixel loss correction, and the loss function of the whole model can be written as:

$$Loss = -\frac{1}{Nhw} \sum_{i=1}^{N} \sum_{x=1}^{h} \sum_{y=1}^{w} l\!\left(H_{trans}\!\left(T^i_{xy}, f(X^i_{xy}, \omega)\right),\, \tilde{Y}^i_{xy}\right), \tag{3}$$

$$H_{trans}\!\left(T^i_{xy}, f(X^i_{xy}, \omega)\right) = P^i_{xy} \cdot T^i_{xy}(C{=}1) + \left(1 - P^i_{xy}\right) \cdot \left(1 - T^i_{xy}(C{=}0)\right), \tag{4}$$
where l is the BCE loss function and H_trans is the transition function between foreground and background. In our method, the transition function of Eq. (4) expresses that the foreground of the prediction is kept unchanged while the background of the prediction may flip into the foreground.

Optimization. Given a fixed θ, the optimal ω can be found by minimizing the following objective function:

$$\omega^*(\theta) = \arg\min_{\omega} \frac{1}{Nhw} \sum_{i=1}^{N} \sum_{x=1}^{h} \sum_{y=1}^{w} l\!\left(H_{trans}\!\left(T^i_{xy}, f(X^i_{xy}, \omega)\right),\, \tilde{Y}^i_{xy}\right). \tag{5}$$
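A minimal sketch of the pixel loss correction of Eqs. (3)-(4) is shown below; the channel layout of T (channel 0 for T(C=0), channel 1 for T(C=1)) and the assumption that the U-Net output is a sigmoid probability are illustrative.

```python
import torch
import torch.nn.functional as F

def corrected_loss(pred, t_map, noisy_mask):
    """Pixel loss correction of Eqs. (3)-(4).
    pred:       (N, 1, H, W) U-Net foreground probabilities P.
    t_map:      (N, 2, H, W) pixel transition confidences T (channel layout assumed).
    noisy_mask: (N, 1, H, W) observed noisy labels."""
    revised = pred * t_map[:, 1:2] + (1.0 - pred) * (1.0 - t_map[:, 0:1])   # Eq. (4)
    return F.binary_cross_entropy(revised, noisy_mask)                      # Eq. (3)
```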
We now describe how the parameters θ are learned through our meta guided network. Motivated by the success of meta-parameter optimization, our method uses a small trusted dataset to correct a possibly wrong gradient direction and to guide the generation of the pixel loss correction map. Specifically, we leverage an additional meta dataset {(X^j, Y^j)}, 1 ≤ j ≤ M, which has clean annotations; M is the number of meta samples and M ≪ N. Given a meta input X^j and the optimized parameters ω*(θ), the segmentation network produces the prediction map P^j = f(X^j, ω*(θ)), and the meta loss on the meta dataset can be written as:

$$Loss_{meta} = -\frac{1}{Mhw} \sum_{j=1}^{M} \sum_{x=1}^{h} \sum_{y=1}^{w} l\!\left(f(X^j_{xy}, \omega^*(\theta)),\, Y^j_{xy}\right). \tag{6}$$

Combining Eq. (5) and Eq. (6) yields a bi-level minimization problem, and the optimal θ* is obtained by minimizing the following objective function:

$$\theta^* = \arg\min_{\theta} \frac{1}{Mhw} \sum_{j=1}^{M} \sum_{x=1}^{h} \sum_{y=1}^{w} l\!\left(f(X^j_{xy}, \omega^*(\theta)),\, Y^j_{xy}\right). \tag{7}$$
After obtaining θ*, we get the pixel transition confidence map, which estimates the confidence of transitioning from correct labels to corrupted ones and helps train a noise-robust segmentation model.
Fig. 2. Illustration of how the meta guided network works (a dilation operator is used to generate the noise).
Meta Guided Network. For the meta guided network g in Eq. (1), we explored different architectures, which need to be compatible with the encoder-decoder structure of U-Net and also be easy to train on the small meta dataset via meta-learning. In this paper, SENet [5] is used as the backbone: it is a simple, easily trained structure that generates an output of the same size as the U-Net prediction for the transition. By taking the prediction P^i as input, the meta guided network can adaptively recalibrate the latent transition confidence by explicitly modeling interdependencies between channels, in particular to find the transition confidence from correct labels to corrupted ones. Figure 2 shows how our meta guided network helps build a noise-robust model. By feeding the prediction (c) to the meta guided network, the corresponding pixel transition confidence map is obtained. Corrupted pixels have low pixel confidence but a high transition probability. After applying the transition function with the confidence map, the prediction is turned into the revised prediction (d), which is very similar to the noisy mask (e). Finally, cross entropy between the revised prediction and the observed noisy mask can be used to train the segmentation model. This enables our method to train a noise-robust segmentation network with noisy labels.
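The paper specifies only that the guide network uses an SENet-style backbone and outputs a map of the same size as the prediction; the sketch below is therefore an illustrative guess at a small SE-based guide network, with all layer sizes assumed.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel recalibration [5]."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
                                nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))        # squeeze: global average per channel
        return x * w[:, :, None, None]         # excite: rescale channels

class MetaGuidedNet(nn.Module):
    """Illustrative g(P; θ): maps the U-Net prediction to a 2-channel confidence map T."""
    def __init__(self, ch=16):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  SEBlock(ch),
                                  nn.Conv2d(ch, 2, 3, padding=1))

    def forward(self, pred):                    # pred: (N, 1, H, W)
        return torch.sigmoid(self.body(pred))   # T:    (N, 2, H, W)
```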
2.2 Optimization Algorithm
The algorithm consists of the following main steps. Given the training input (X^i, Ỹ^i), we can formulate the one-step update of ω with respect to θ as

$$\hat{\omega}(\theta) = \omega^{(t)} - \alpha \frac{1}{Nhw} \sum_{i=1}^{N} \sum_{x=1}^{h} \sum_{y=1}^{w} \nabla_{\omega}\, l\!\left(H_{trans}\!\left(T^{i(t)}_{xy}, f(X^i_{xy}, \omega)\right),\, \tilde{Y}^i_{xy}\right), \tag{8}$$

where α is the learning rate and T^{i(t)}_{xy} is computed by feeding the pixel-level prediction into the meta guided network with parameters θ^{(t)}. Then, with the current mini-batch of meta data samples (X^j, Y^j), we can perform a one-step update to solve for θ:

$$\theta^{(t+1)} = \theta^{(t)} - \beta \frac{1}{Mhw} \sum_{j=1}^{M} \sum_{x=1}^{h} \sum_{y=1}^{w} \nabla_{\theta}\, l\!\left(f(X^j_{xy}, \hat{\omega}(\theta)),\, Y^j_{xy}\right), \tag{9}$$
Algorithm 1: The proposed learning algorithm
Input: Training data S, meta data set, batch sizes n and m, number of iterations I
1: Initialize segmentation network parameters ω and meta guided network parameters θ
2: for t = 1 to I do
3:   X, Y ← sample a minibatch of n training samples
4:   X^m, Y^m ← sample a minibatch of m meta samples
5:   Update θ^(t+1) by Eq. (9)
6:   Update ω^(t+1) by Eq. (10)
7:   Update T with the current segmentation network with parameters ω^(t+1)
8: end for
Output: Segmentation network parameters ω^(I+1)
where β is the learning rate and we use autograd to calculate the Jacobian. After obtaining θ^{(t+1)}, we can update ω:

$$\omega^{(t+1)} = \omega^{(t)} - \alpha \frac{1}{Nhw} \sum_{i=1}^{N} \sum_{x=1}^{h} \sum_{y=1}^{w} \nabla_{\omega}\, l\!\left(H_{trans}\!\left(T^{i(t+1)}_{xy}, f(X^i_{xy}, \omega)\right),\, \tilde{Y}^i_{xy}\right). \tag{10}$$

The map T^{i(t+1)}_{xy} is updated with the parameters ω^{(t+1)} of the segmentation network. The entire procedure is summarized in Algorithm 1.
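One way to realize the alternating updates of Algorithm 1 is sketched below, using the third-party `higher` library for the differentiable virtual update of Eq. (8); this is an assumption about the implementation, and it further assumes both networks end with a sigmoid so outputs are probabilities. Function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F
import higher  # third-party library for differentiable inner-loop updates

def corrected_loss(seg, guide, x, y_noisy):
    pred = seg(x)                                   # P (probabilities, sigmoid assumed)
    t = guide(pred)                                 # T (2 channels)
    revised = pred * t[:, 1:2] + (1 - pred) * (1 - t[:, 0:1])   # Eq. (4)
    return F.binary_cross_entropy(revised, y_noisy)              # Eq. (3)

def train_step(seg, guide, seg_opt, guide_opt, x, y_noisy, x_meta, y_meta, alpha=1e-4):
    # Virtual one-step update of ω that stays differentiable w.r.t. θ (Eq. 8)
    inner_opt = torch.optim.SGD(seg.parameters(), lr=alpha)
    with higher.innerloop_ctx(seg, inner_opt, copy_initial_weights=True) as (fseg, diffopt):
        diffopt.step(corrected_loss(fseg, guide, x, y_noisy))
        meta_loss = F.binary_cross_entropy(fseg(x_meta), y_meta)  # Eq. (6)
        guide_opt.zero_grad()
        meta_loss.backward()                                       # gradient w.r.t. θ (Eq. 9)
    guide_opt.step()
    # Real update of ω with the new θ (Eq. 10); any gradient reaching θ here is
    # discarded by the zero_grad at the start of the next step.
    seg_opt.zero_grad()
    corrected_loss(seg, guide, x, y_noisy).backward()
    seg_opt.step()
```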
3 Experiment Results
3.1 Dataset
We evaluate our method on three medical segmentation datasets selected for lesion segmentation: LIDC-IDRI [1], LiTS [4] and BraTS2019 [10]. We follow the same preprocessing and experiment settings as [18] on the LIDC-IDRI and LiTS datasets, with 64 × 64 cropped lesion patches. LIDC-IDRI is a lung CT dataset consisting of 1018 lung CT scans; 3591 patches are adopted, split into a training set of 1906 images, a test set of 1385 images, and 300 images for the meta set. LiTS contains 130 abdominal CT liver scans with a tumor and liver segmentation challenge; 2214 samples are drawn from this dataset, with 1471, 300 and 443 images used for training, meta weight learning, and testing respectively. BraTS19 is a brain tumor challenge dataset consisting of 385 labeled 3D MRI scans, each with four modalities (T1, T1 contrast-enhanced, T2 and FLAIR); 3863 ET lesion patches are adopted, and the training, meta and test sets contain 1963, 300 and 1600 samples respectively. Because our input is cropped lesion patches, the challenge results cannot be cited in our experiments.
Table 1. Results of segmentation models on LIDC-IDRI (r = 0.4).
Model name | Dilation: mIOU / Dice / Hausdorff | ElasticDeform: mIOU / Dice / Hausdorff
U-Net [14] | 62.53 / 75.56 / 1.9910 | 65.01 / 76.17 / 1.9169
Prob U-Net [9] | 66.42 / 78.39 / 1.8817 | 68.43 / 79.50 / 1.8757
Phi-Seg [3] | 67.01 / 79.06 / 1.8658 | 68.55 / 81.76 / 1.8429
UA-MT [19] | 68.18 / 80.98 / 1.8574 | 68.84 / 82.47 / 1.8523
Curriculum [8] | 67.78 / 79.54 / 1.8977 | 68.18 / 81.30 / 1.8691
Few-Shot GAN [12] | 67.74 / 78.11 / 1.9137 | 67.93 / 77.83 / 1.9223
Quality Control [2] | 65.00 / 76.50 / 1.9501 | 68.07 / 77.68 / 1.9370
U2 Net [6] | 65.92 / 76.01 / 1.9666 | 67.20 / 77.05 / 1.9541
MWNet [15] | 71.56 / 81.17 / 1.7762 | 71.89 / 81.04 / 1.7680
MCPM [18] | 74.69 / 84.64 / 1.7198 | 75.79 / 84.99 / 1.7053
Our MPLC | 77.24 / 87.16 / 1.6387 | 77.52 / 87.44 / 1.6157
3.2 Experiment Setting
Noise Setting: Extensive experiments were conducted under different types of noise. We artificially corrupted the target lesion masks with two types of label degradation: a dilation morphology operator and ElasticDeform. 1) Dilation morphology operator: the foreground region is expanded by several pixels (randomly drawn from [0, 6]). 2) ElasticDeform [17]: label noise is generated by complicated operations such as rotation, translation, deformation and morphological dilation of the ground-truth labels. We set a probability r, the noisy label ratio, to represent the proportion of corrupted noisy labels in all the data.
Implementation Detail: We train our model with SGD at an initial learning rate of 1e−4, a momentum of 0.9, and a weight decay of 1e−3 with mini-batch size 60. We set α = 1e−4 and β = 1e−3 in all the experiments. The learning rate decays by 0.1 at the 30th and 60th epochs, for a total of 120 epochs. mIOU, Dice and Hausdorff distance were used to evaluate our method.
3.3 Experimental Results
Comparisons with State-of-the-Art Methods. In this section, we set r to 40% for all experiments, which means that 40% of the training labels are noisy labels with corrupted pixels. There are 9 existing segmentation methods for a similar task on the LIDC-IDRI dataset, including Prob U-Net [9], Phi-Seg [3], UA-MT [19], Curriculum [8], Few-Shot GAN [12], Quality Control [2], U2 Net [6], MWNet [15] and MCPM [18]. Visualization results are shown in Fig. 3. Table 1 shows the results of all competing methods on the LIDC-IDRI dataset with the aforementioned experiment setting. It can be observed that our method achieves the best performance. Specifically, compared with MCPM and MWNet,
which use the re-weighting method, our algorithm has a competitive Dice result (87.16) and outperforms the second best method (MCPM) by 2.52%. An additional t-test comparison between our method and the second best method (MCPM) shows a P-value < 0.01, indicating a statistically significant difference between our method and MCPM.
(Fig. 3 panels, left to right: input with GT, U-Net, Prob U-Net, MWNet, MCPM, our MPLC. Dice per row: LIDC-IDRI 35.95%, 38.46%, 64.47%, 72.73%, 80.36%; LiTS 48.50%, 62.21%, 78.41%, 82.11%, 87.54%; BraTS19 63.32%, 76.82%, 88.89%, 89.70%, 95.54%.)
Fig. 3. Visualization of segmentation results under r = 0.8. Green and red contours indicate the ground truths and segmentation results, respectively. The Dice value is shown at the bottom of each panel, and our method produces much better results than the other methods on every dataset. (Color figure online)
Robustness to Various r-s. We explore the robustness of our MPLC under various noise label ratios r ∈ {0.2, 0.4, 0.6, 0.8}, evaluated on the LIDC-IDRI, LiTS and BraTS19 datasets under the dilation operation. Table 2 shows the results compared with the baseline approaches. Our method consistently outperforms the other methods across all noise ratios on all datasets, showing the effectiveness of our meta pixel loss correction strategy.
Table 2. Results (mIOU) of segmentation methods using various r-s (Noise = Dilation).
LIDC-IDRI
r
0.8
U-Net [14]
42.64 51.23 62.53 69.88 37.18 43.55 46.41 51.20 32.51 50.02 56.27 63.65
0.6
LiTS 0.4
0.2
0.8
BraTS19 0.6
0.4
0.2
0.8
0.6
0.4
0.2
Prob U-Net 52.13 60.81 66.42 71.03 40.16 45.90 49.22 53.97 55.04 56.25 58.08 62.64 MWNet [15] 61.28 67.33 71.56 72.07 43.14 44.97 51.96 58.65 60.63 66.06 67.99 69.50 MCPM [18] 67.60 68.97 74.69 74.87 45.09 48.76 55.17 62.04 61.74 67.39 67.93 69.52 Our MPLC
3.4
73.04 76.07 77.24 78.16 62.25 64.53 65.56 66.44 63.67 67.79 69.09 71.79
Limitation
Because our approach is based on the instance-independent assumption that P (˜ y |y) = P (˜ y |x, y). It is more suitable to model single noise distribution but
40
Z. Cai et al.
fails in real-world stochastic noise like the complicated noise setting with multi noises(erosion, dilation, deformity, false negatives, false positives). When it is extended to instance-dependent, we should model the relationship among clean label, noisy label and instance for P (˜ y |x, y) in future work.
4
Conclusion
We present a novel Meta Pixel Loss Correction method to alleviate the negative effect of noisy labels in medical image segmentation. Given a small number of high-quality labeled images, the deduced learning regime makes our meta guided network able to take full use of noisy labels and estimate the pixel transition confidence map, which can be used to do further pixel loss correction and train a noise-robust segmentation. We extensively evaluated our method on three datasets, LIDC-IDRI, LiTS and BraTS19. The result shows that the proposed method can outperform state-of-the-art in medical image segmentation with noisy labels. Acknowledgment. This work was supported by by the National Natural Science Foundation of China under Grant 61790562 and Grant 61773312.
References 1. Armato, S.G., et al.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Acad. Radiol. 14(12), 1455–1463 (2007) 2. Audelan, B., Delingette, H.: Unsupervised quality control of image segmentation based on Bayesian learning. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 21–29. Springer, Cham (2019). https://doi.org/10.1007/978-3-03032245-8 3 3. Baumgartner, F., et al.: PHiSeg: capturing uncertainty in medical image segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 119–127. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32245-8 14 4. Han, X.: Automatic liver lesion segmentation using a deep convolutional neural network method (2017) 5. Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 42(8), 2011–2023 (2020). https://doi.org/ 10.1109/TPAMI.2019.2913372 6. Huang, C., Han, H., Yao, Q., Zhu, S., Zhou, S.K.: 3D U2 -net: a 3D universal u-net for multi-domain medical image segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 291–299. Springer, Cham (2019). https://doi.org/10. 1007/978-3-030-32245-8 33 7. Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020) ´ Ben Ayed, I.: Curriculum semi-supervised 8. Kervadec, H., Dolz, J., Granger, E., segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 568–576. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32245-8 63
Meta Pixel Loss Correction
41
9. Kohl, S.A., et al.: A probabilistic u-net for segmentation of ambiguous images. arXiv preprint arXiv:1806.05034 (2018) 10. Menze, B.H., Jakab, A., Bauer, S., et al.: The multimodal brain tumor image segmentation benchmark (brats). IEEE Trans. Med. Imaging 34(10), 1993–2024 (2015). https://doi.org/10.1109/TMI.2014.2377694 11. Mirikharaji, Z., Yan, Y., Hamarneh, G.: Learning to segment skin lesions from noisy annotations. In: Wang, Q., et al. (eds.) DART/MIL3ID -2019. LNCS, vol. 11795, pp. 207–215. Springer, Cham (2019). https://doi.org/10.1007/978-3-03033391-1 24 12. Mondal, A.K., Dolz, J., Desrosiers, C.: Few-shot 3d multi-modal medical image segmentation using generative adversarial learning. arXiv preprint arXiv:1810.12241 (2018) 13. Ren, M., Zeng, W., Yang, B., Urtasun, R.: Learning to reweight examples for robust deep learning. In: International Conference on Machine Learning, pp. 4334–4343. PMLR (2018) 14. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 15. Shu, J., et al.: Meta-weight-net: learning an explicit mapping for sample weighting. arXiv preprint arXiv:1902.07379 (2019) 16. Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Transactions on Neural Networks and Learning Systems (2022) 17. van Tulder, G.: Package elsticdeform. http://github.com/gvtulder/elasticdeform/. Accessed 4 Dec 2018 18. Wang, J., Zhou, S., Fang, C., Wang, L., Wang, J.: Meta corrupted pixels mining for medical image segmentation. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12261, pp. 335–345. Springer, Cham (2020). https://doi.org/10.1007/978-3030-59710-8 33 19. Yu, L., Wang, S., Li, X., Fu, C.-W., Heng, P.-A.: Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 605–613. Springer, Cham (2019). https:// doi.org/10.1007/978-3-030-32245-8 67 20. Zhu, H., Shi, J., Wu, J.: Pick-and-learn: automatic quality evaluation for noisylabeled image segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11769, pp. 576–584. Springer, Cham (2019). https://doi.org/10.1007/978-3-03032226-7 64
Re-thinking and Re-labeling LIDC-IDRI for Robust Pulmonary Cancer Prediction
Hanxiao Zhang1, Xiao Gu2, Minghui Zhang1, Weihao Yu1, Liang Chen3, Zhexin Wang3(B), Feng Yao3, Yun Gu1,4(B), and Guang-Zhong Yang1(B)
1 Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China {geron762,gzyang}@sjtu.edu.cn
2 Imperial College London, London, UK
3 Department of Thoracic Surgery, Shanghai Chest Hospital, Shanghai Jiao Tong University, Shanghai, China [email protected]
4 Shanghai Center for Brain Science and Brain-Inspired Technology, Shanghai, China
Abstract. The LIDC-IDRI database is the most popular benchmark for lung cancer prediction. However, with subjective assessment from radiologists, nodules in LIDC may have entirely different malignancy annotations from the pathological ground truth, introducing label assignment errors and subsequent supervision bias during training. The LIDC database thus requires more objective labels for learning-based cancer prediction. Based on an extra small dataset containing 180 nodules diagnosed by pathological examination, we propose to re-label LIDC data to mitigate the effect of original annotation bias verified on this robust benchmark. We demonstrate in this paper that providing new labels by similar nodule retrieval based on metric learning would be an effective re-labeling strategy. Training on these re-labeled LIDC nodules leads to improved model performance, which is enhanced when new labels of uncertain nodules are added. We further infer that re-labeling LIDC is current an expedient way for robust lung cancer prediction while building a large pathological-proven nodule database provides the long-term solution.
Keywords: Pulmonary nodule · Cancer prediction · Metric learning · Re-labeling
1 Introduction
The LIDC-IDRI (Lung Image Database Consortium and Image Database Resource Initiative) [1] is a leading source of public datasets. Since the introduction of LIDC, it is used extensively for lung nodule detection and cancer prediction using learning-based methods [4,6,11,12,15–17,21,23].
When searching papers in PubMed¹ with the following filter: (“deep learning” OR convolutional) AND (CT OR “computed tomography”) AND (lung OR pulmonary) AND (nodule OR cancer OR “nodule malignancy”) AND (prediction OR classification), among 53 papers assessed for eligibility for nodule malignancy classification, 40 papers used the LIDC database, 5 papers used the NLST (National Lung Screening Trial) database² [10,18,19] (which provides no exact nodule locations), and 8 papers used other individual datasets. LIDC is therefore the most popular benchmark in cancer prediction research. A careful examination of the LIDC database, however, reveals several potential issues for cancer prediction. During the annotation of LIDC, the characteristics of nodules were assessed by multiple radiologists, and the rating of malignancy scores (1 to 5) was based on the assumption of a 60-year-old male smoker. Due to the lack of clinical information, these malignancy scores are subjective. Although a subset of LIDC cases possesses a patient-based pathological diagnosis [13], nodule-level binary labels cannot be confirmed from it. Since it is hard to recapture the pathological ground truth for each LIDC nodule, we employ the extra SCH-LND dataset [24] with pathologically proven labels, which is used not only to establish a truthful and fair evaluation benchmark but also to transfer pathological knowledge for different clinical indications. In this paper, we first assess the nodule prediction performance of LIDC-driven models in six scenarios and their fine-tuning behavior on SCH-LND with detailed experiments. Having identified the problems of the undecided binary label assignment scheme of the original LIDC database and the unstable transfer learning outcomes, we seek to re-label LIDC nodule classes by interacting with SCH-LND. The first re-labeling strategy adopts the state-of-the-art nodule classifier as an end-to-end annotator, but it turns out to contribute nothing to LIDC re-labeling. The second strategy uses metric learning to learn similarity and discrimination between nodule pairs, which is then used to elect new LIDC labels based on a pairwise similarity ranking between the under-labeled LIDC nodule and each nodule of SCH-LND. Experiments show that models trained with the re-labeled LIDC data created by the metric learning model not only resolve the bias problem of the original data but also surpass the performance of our model, especially when the new labels of the uncertain subset are added. Further statistical results demonstrate that the re-labeled LIDC data suffers from a class imbalance problem, which motivates building a larger nodule database with pathologically proven labels.
2 Materials
LIDC-IDRI Database: Following the practice in [14], we excluded CT scans with a slice thickness larger than 3 mm and sampled nodules identified by at least three radiologists. We only involve solid nodules in the SCH-LND and LIDC databases because giving accurate labels for solid nodules is particularly challenging.
https://pubmed.ncbi.nlm.nih.gov/. https://cdas.cancer.gov/datasets/nlst/.
44
H. Zhang et al.
Extra Dataset: The extra dataset called SCH-LND [24] consists of 180 solid nodules (90 benign/90 malignant) with exact spatial coordinates and radii. Each sample is very rare because all the nodules are confirmed and diagnosed by immediate pathological examination via biopsy with ethical approval. To regulate variant CT formats, CT slice thickness is resampled to 1mm/pixel if it is larger than 1 mm/pixel, while the X and Y axes are fixed to 512 × 512 pixels. Each pixel value is unified to the HU (Hounsfield Unit) value before nodule volume cropping.
3
Study Design
Fig. 1. Illustration of the study design for nodule cancer prediction. Case 1: training from scratch over the LIDC database after assigning nodule labels according to the average malignancy scores in 6 scenarios. Case 2: training over extra data based on accurate pathological-proven labels by 5-fold cross-validation. Case 3: testing or finetuning LIDC models of Case 1 using extra data.
The preliminary study follows the instructions of Fig. 1 where two types of cases (Case 1 and Case 2) conduct training and testing in each single data domain and one type of case (Case 3) involves domain interaction (cross-domain testing and transfer learning) between LIDC and SCH-LND. In Case 1 and Case 3, we identify 6 different scenarios by removing uncertain average scores (Scenarios A and B) or setting division threshold (Scenarios C, D, E, and F) to assign binary labels for LIDC data training. Training details are described in Sect. 5.1. To evaluate the model performance comprehensively, we additionally introduce Specificity (also called Recallb , when treating benign as positive sample) and Precisionb (Precision in benign class) [20], besides regular evaluation metrics including Sensitivity (Recall), Precision, Accuracy, and F1 score. Based on the visual assessment of radiologists, human-defined nodule features can be easily extracted and classified by a commonly used model (3D ResNet-18 [5]), whose performance can emulate the experts’ one (Fig. 2, Case 1). Many studies still put investigational efforts for better results across the LIDC board, overlooking inaccurate radiologists’ estimations and bad model capability in the real world. However, once the same model is revalidated under
Re-thinking and Re-labeling LIDC-IDRI
45
Fig. 2. Performance comparisons between different Cases or Scenarios (Scen) in Fig. 1. For instance, ‘A:(12/45)’ represents ‘Scenario A’ that treats LIDC scores 1 & 2 as benign labels and scores 4 & 5 as malignant labels. FT denotes fine-tuning using extra data by 5-fold cross-validation based on the pre-trained model in each scenario.
the pathological-proven benchmark (Fig. 2, Case 3, Scenario A), its drawback is objectively revealed that LIDC model decisions take up too many false-positive predictions. These two experimental outcomes raise a suspicion that whether the visual assessment of radiologists might have a bias toward malignant class. To resolve this suspicion, we compare the performances of 6 scenarios in Case 3. Evidence reveals that, under the testing data from SCH-LND, the number of false-positive predictions has a declining trend when the division threshold moves from the benign side to the malignant side, but the bias problem is still serious when reaching Scenario E, much less of Scenario A and B. Besides, as training on the SCH-LND dataset from scratch can hardly obtain a high capacity model (Fig. 2, Case 2), we use transfer learning in Case 3 to get the model fine-tuned on the basis of weights of different pre-trained LIDC models. Observing the inter-comparison within each scenario in Case 3, transfer learning can push scattered metric values close. However, compared with Case 2, the fine-tuning technique would bring both positive and negative transfer, depending upon the property of the pre-trained model. Thus, either for training from scratch or transfer learning process, the radiologists’ assessment of LIDC nodule malignancy can be hard to properly use. In addition to its inevitable assessment errors, there is a thorny problem to assign LIDC labels (how to set division threshold) and removing uncertain subset (waste of data). We thus expect to re-label the LIDC malignancy classes with the interaction of SCH-LND, to correct the assessment bias as well as utilize the uncertain nodules (average score = 3). Two independent approaches are described in the following section.
4 Methods
We put forward two re-labeling strategies to obtain new ground-truth labels for the LIDC database. The first strategy generates the malignancy label with a machine annotator: the state-of-the-art nodule classifier that has been pre-trained on LIDC data and fine-tuned on SCH-LND to predict the nodule class. The second strategy assigns labels through the ranking of similar nodules with a machine comparator: a metric-based network that measures the correlation between nodule pairs. Considering that the knowledge from the radiologists' assessments could be a useful resource, two modes of LIDC re-labeling are proposed for each strategy. Mode 1 (Substitute): LIDC completely accepts the re-label outcomes from the label machine. Mode 2 (Consensus): the final LIDC re-label results are decided by the consensus between the label-machine outcome and the original label (Scenario A). In other words, this mode keeps only the nodules whose two labels agree and discards controversial ones, which may cause data reduction. We evaluate the LIDC re-labeling effect by training a model from scratch on the re-labeled data and testing it on SCH-LND.
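The two re-labeling modes can be summarized in a few lines (a sketch with hypothetical data structures; labels are 0 for benign and 1 for malignant, and uncertain score-3 nodules carry no original Scenario-A label):

```python
def relabel_substitute(machine_labels):
    """Mode 1 (Substitute): the label-machine outcome replaces the original
    LIDC label for every under-labeled nodule."""
    return dict(machine_labels)                 # nodule_id -> 0 (benign) / 1 (malignant)

def relabel_consensus(machine_labels, scenario_a_labels):
    """Mode 2 (Consensus): keep a nodule only when the machine label agrees with
    its original Scenario-A label; controversial nodules are discarded, so the
    resulting training set shrinks."""
    kept = {}
    for nodule_id, new_label in machine_labels.items():
        original = scenario_a_labels.get(nodule_id)   # None for uncertain (score 3) nodules
        if original is not None and original == new_label:
            kept[nodule_id] = new_label
    return kept
```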
4.1 Label Induction Using Machine Annotator
The model optimized with the fine-tuning technique can correct the learning bias introduced by the LIDC data. Some fine-tuned models even surpass the LIDC model on a large share of the evaluation metrics. We therefore ask whether the current best-performing model can help classify and annotate new LIDC labels. Experiments are conducted using the two annotation models from Case 2 and Case 3 (Scenario A) in Sect. 3.
4.2 Similar Nodule Retrieval Using Metric Learning
Fig. 3. The second strategy of LIDC re-labeling, which uses a metric learning model to search for the most similar nodules and assign new labels.
Metric learning [2,7] provides a few-shot learning approach that aims to learn useful representations through distance comparisons. We use a Siamese Network
[3,9] in this study, which consists of two networks whose parameters are tied to each other. Parameter tying guarantees that two similar nodules are mapped by their respective networks to adjacent locations in the feature space. To train the Siamese Network in Fig. 3, we pass the inputs as a set of pairs. Each pair is randomly chosen from SCH-LND and labeled according to whether its two nodules belong to the same class. The two nodule volumes are then passed individually through the 3D ResNet-18 to generate fixed-length feature vectors. The underlying hypothesis is that if the two nodules belong to the same class, their feature vectors should lie at a small distance from each other; otherwise, their feature vectors should lie far apart. To distinguish between same-class and different-class pairs of nodules during training, we apply a contrastive loss over the Euclidean distance metric (similarity score) induced by the malignancy representation. During re-labeling, we first pair every SCH-LND nodule used in training with each under-labeled LIDC nodule and sort the partners of each under-labeled nodule by their similarity scores. The new LIDC label is then obtained by averaging the labels of the top 20% partner nodules in this ranking.
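A minimal PyTorch sketch of this strategy is given below. It assumes a generic 3D backbone and uses hypothetical helper names; it pairs the weight-tied embedding branches with the standard contrastive loss and then re-labels an under-labeled LIDC nodule from its top-20% most similar SCH-LND partners:

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Two weight-tied branches: a shared 3D backbone followed by a small fully
    connected head producing the 8-dimensional malignancy embedding."""
    def __init__(self, backbone3d, feat_dim, emb_dim=8):
        super().__init__()
        self.backbone = backbone3d          # e.g. a 3D ResNet-18 trunk (assumed given)
        self.fc = nn.Linear(feat_dim, emb_dim)

    def embed(self, x):
        return self.fc(self.backbone(x))

    def forward(self, x1, x2):
        return self.embed(x1), self.embed(x2)

def contrastive_loss(z1, z2, same_class, margin=1.0):
    """same_class = 1 for a pair from the same class, 0 otherwise."""
    dist = torch.norm(z1 - z2, dim=1)
    pos = same_class * dist.pow(2)
    neg = (1 - same_class) * torch.clamp(margin - dist, min=0).pow(2)
    return (pos + neg).mean()

@torch.no_grad()
def relabel_by_retrieval(model, lidc_volume, sch_volumes, sch_labels, top_frac=0.2):
    """Pair one under-labeled LIDC nodule with every SCH-LND training nodule,
    rank the partners by embedding distance, and average the labels of the
    closest 20% to obtain the new binary label."""
    z_query = model.embed(lidc_volume)                          # (1, emb_dim)
    z_bank = torch.cat([model.embed(v) for v in sch_volumes])   # (N, emb_dim)
    dists = torch.norm(z_bank - z_query, dim=1)                 # Euclidean similarity scores
    k = max(1, int(top_frac * len(sch_volumes)))
    nearest = torch.topk(dists, k, largest=False).indices       # smallest distances
    avg_label = torch.tensor([float(sch_labels[i]) for i in nearest]).mean()
    return int(avg_label.item() >= 0.5)
```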
5 Experiments and Results
5.1 Implementation
We apply 3D ResNet-18 [5] in this paper, with adaptive average pooling (output size of 1 × 1 × 1) following the final convolution layer. For the general cancer prediction model, we use a fully connected layer and a Sigmoid function to output the prediction score (binary cross-entropy loss). For the Siamese Network, we instead use a fully connected layer to generate the feature vector (8 neurons). Due to varying nodule sizes, the batch size is set to 1, and group normalization [22] is adopted after each convolution layer. All experiments are implemented in PyTorch on a single NVIDIA GeForce GTX 1080 Ti GPU and trained with the Adam optimizer [8], using a learning rate of 1e-3 (100 epochs) and 1e-4 for fine-tuning in transfer learning (50 epochs). The validation set occupies 20% of the training set in each experiment. All experiments and results that involve training on SCH-LND are strictly conducted with 5-fold cross-validation.
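For reference, the cancer-prediction head and optimizer settings described above can be sketched as follows (a hedged illustration; the stand-in backbone below only makes the sketch runnable and is not the paper's full 3D ResNet-18):

```python
import torch
import torch.nn as nn

class NodulePredictor(nn.Module):
    """General cancer-prediction model in this sketch: a 3D backbone (the paper
    uses 3D ResNet-18 with group normalization after each convolution), adaptive
    average pooling to 1 x 1 x 1, and a single fully connected layer with a sigmoid."""
    def __init__(self, backbone3d, feat_channels=512):
        super().__init__()
        self.backbone = backbone3d
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(feat_channels, 1)

    def forward(self, x):                      # x: (1, 1, D, H, W) -- batch size 1
        f = self.pool(self.backbone(x)).flatten(1)
        return torch.sigmoid(self.fc(f)).squeeze(1)

# Trivial stand-in backbone so the sketch runs; replace with a full 3D ResNet-18.
backbone = nn.Sequential(nn.Conv3d(1, 512, kernel_size=3, padding=1),
                         nn.GroupNorm(32, 512), nn.ReLU())
model = NodulePredictor(backbone)
criterion = nn.BCELoss()                                    # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # 1e-4 when fine-tuning
```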
5.2 Quantitative Evaluation
To evaluate the first strategy (machine annotator), we first use the Case 2 model to re-label the LIDC nodules (in a 5-fold cross-validation manner), excluding the uncertain subset (original average score = 3). The re-labeled nodules are then fed into the 3D ResNet-18 model, which is trained from scratch and tested on the corresponding subset of SCH-LND for evaluation.
Table 1. Performance of different re-labeling methods under each mode of the re-labeling strategies. Under-labeled LIDC data are chosen by their original average score.

Baselines:
Row | Method   | Training | Testing | Sensitivity | Specificity | Precision | Precision_b | Accuracy | F1
1   | Case 3-A | LIDC     | Extra   | 0.9778 | 0.2333 | 0.5605 | 0.9130 | 0.6056 | 0.7126
2   | Case 2   | Extra    | Extra   | 0.6333 | 0.6000 | 0.6129 | 0.6207 | 0.6167 | 0.6230
3   | Siamese  | Extra    | Extra   | 0.6667 | 0.6000 | 0.6250 | 0.6429 | 0.6333 | 0.6452

LIDC re-labeling:
Row | Strategy   | Mode       | Method   | Under-label | Sensitivity | Specificity | Precision | Precision_b | Accuracy | F1
4   | Annotator  | Substitute | Case 2   | 1;2;4;5     | 0.5778 | 0.5667 | 0.5714 | 0.5730 | 0.5722 | 0.5746
5   | Annotator  | Substitute | Case 3-A | 1;2;4;5     | 0.4630 | 0.6667 | 0.5814 | 0.5538 | 0.5648 | 0.5155
6   | Annotator  | Consensus  | Case 2   | 1;2;4;5     | 0.8778 | 0.3667 | 0.5809 | 0.7500 | 0.6222 | 0.6991
7   | Annotator  | Consensus  | Case 3-A | 1;2;4;5     | 0.8556 | 0.3778 | 0.5789 | 0.7234 | 0.6167 | 0.6906
8   | Comparator | Substitute | Siamese  | 1;2;4;5     | 0.6111 | 0.6556 | 0.6395 | 0.6277 | 0.6333 | 0.6250
9   | Comparator | Substitute | Siamese  | 1;2;3;4;5   | 0.6778 | 0.6667 | 0.6703 | 0.6742 | 0.6722 | 0.6740
10  | Comparator | Consensus  | Siamese  | 1;2;4;5     | 0.8000 | 0.3778 | 0.5625 | 0.6538 | 0.5889 | 0.6606
11  | Comparator | Consensus  | Siamese  | 1;2;3;4;5   | 0.7333 | 0.5889 | 0.6408 | 0.6883 | 0.6611 | 0.6839
The result (4th row) shows that although this action largely corrects the label bias to a balanced state, this group of new labels can hardly build a model that tests well on SCH-LND. Contrary to expectation, the state-of-the-art nodule classifier makes the re-label performance even worse (5th row), much lower than learning from scratch on SCH-LND (2nd row), indicating that the best model optimized with the fine-tuning technique is not suitable for LIDC re-labeling. For the first strategy, the two experiments adopting Mode 2 (Consensus) achieved better overall outcomes than Mode 1 (Substitute), but with low Specificity (Table 1). Metric learning takes a different re-label strategy, retrieving similar nodules according to the distance metric. On a small dataset, metric learning obtains better performance (3rd row) than general learning from scratch (2nd row). The re-label outcomes (8th and 9th rows) also show a clear overall improvement over the baselines under Mode 1, with the re-labeling of uncertain nodules (average score = 3) being an important contributing factor. Overall, there is a trade-off between Mode 1 and Mode 2, but Mode 2 appears to retain the LIDC bias, since its test results often have low Specificity, and it also introduces data reduction. Re-labeling by consensus (Mode 2) may combine the defects of both the original labels and the models, especially for malignant labels, whereas re-labeling the uncertain nodules helps mitigate this defect of Mode 2. We finally re-labeled the LIDC database with the Siamese Network trained on all of SCH-LND. As shown in Fig. 4, our re-labeled results are in broad agreement with the original labels of nodules with low malignancy scores. For score 3 (uncertain data), the majority of nodules are re-labeled as benign, which explains the better performance when score-3 nodules are assigned the benign label in Scenario E (Fig. 2, Case 3). The new labels overturn more than half of the original labels of nodules with score 4, which could be the main source of the data bias.
5.3 Discussion
Re-labeling through metric learning is distinct from general supervised models in two notable ways. First, the input pairs generated by random sampling for
Fig. 4. Statistics of the re-labeled LIDC nodules (benign or malignant) with respect to their original average malignancy scores, where the smooth curve depicts a simplified frequency-distribution histogram of the average label outputs. For each average score of 1, 2, 4, and 5, one example of a nodule re-labeled to the opposite class (scores 1 and 2 treated as benign; 4 and 5 as malignant) is provided.
metric learning provide a data augmentation effect that helps overcome overfitting with limited data. Second, each under-labeled LIDC nodule takes the average label of its top-ranked similar nodules, which increases the confidence of the label propagation. These two points may explain why general supervised models (including fine-tuned models) perform worse than metric learning in the re-labeling task. Unfortunately, after re-labeling, a class imbalance problem emerges (748 versus 174), which in turn limits the model training performance in the aforementioned experiments. Moreover, due to the lack of pathological ground truth, the re-label outcomes of this study should remain open to question until LIDC clinical information becomes available. Considering the further issues that LIDC may raise, the evidence in this paper motivates us to promote the ongoing collection of a large pathological-proven nodule database, which is expected to become a powerful open-source resource for the international medical imaging and clinical research community.
6 Conclusion and Future Work
The LIDC-IDRI database is currently the most popular public database of lung nodules with specific spatial coordinates and experts' annotations. However, because of the absence of clinical information, deep learning models trained on this database have poor generalization capability in lung cancer prediction and downstream tasks. To challenge the low-confidence labels of LIDC, an extra nodule dataset with pathological-proven labels was used to identify the annotation bias problems of LIDC and its label assignment difficulties. With the
robust supervision of SCH-LND, we used a metric learning-based approach to re-label LIDC data via similar-nodule retrieval. The empirical results show that the re-labeled LIDC data yield improved performance while maximizing LIDC data utilization, although a class imbalance problem subsequently arises. These conclusions provide a guideline for the further collection of a large pathological-proven nodule database, which is beneficial to the community. Acknowledgments. This work was partly supported by the Medicine-Engineering Interdisciplinary Research Foundation of Shanghai Jiao Tong University (YG2021QN128), the Shanghai Sailing Program (20YF1420800), the National Natural Science Foundation of China (No. 62003208), the Shanghai Municipal Science and Technology Project (Grant No. 20JC1419500), and the Science and Technology Commission of Shanghai Municipality (Grant 20DZ2220400).
References
1. Armato, S.G., III, et al.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011)
2. Bellet, A., Habrard, A., Sebban, M.: Metric Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 9, no. 1, pp. 1–151 (2015)
3. Guo, Q., Feng, W., Zhou, C., Huang, R., Wan, L., Wang, S.: Learning dynamic siamese network for visual object tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1763–1771 (2017)
4. Han, F., et al.: Texture feature analysis for computer-aided diagnosis on pulmonary nodules. J. Digit. Imaging 28(1), 99–115 (2015)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
6. Hussein, S., Cao, K., Song, Q., Bagci, U.: Risk stratification of lung nodules using 3D CNN-based multi-task learning. In: Niethammer, M., et al. (eds.) IPMI 2017. LNCS, vol. 10265, pp. 249–260. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59050-9_20
7. Kaya, M., Bilge, H.Ş.: Deep metric learning: a survey. Symmetry 11(9), 1066 (2019)
8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
9. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2. Lille (2015)
10. Kramer, B.S., Berg, C.D., Aberle, D.R., Prorok, P.C.: Lung cancer screening with low-dose helical CT: results from the national lung screening trial (NLST) (2011)
11. Liao, Z., Xie, Y., Hu, S., Xia, Y.: Learning from ambiguous labels for lung nodule malignancy prediction. arXiv preprint arXiv:2104.11436 (2021)
12. Liu, L., Dou, Q., Chen, H., Qin, J., Heng, P.A.: Multi-task deep model with margin ranking loss for lung nodule analysis. IEEE Trans. Med. Imaging 39(3), 718–728 (2019)
13. McNitt-Gray, M.F., et al.: The lung image database consortium (LIDC) data collection process for nodule detection and annotation. Acad. Radiol. 14(12), 1464–1474 (2007)
14. Setio, A.A.A., et al.: Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med. Image Anal. 42, 1–13 (2017)
15. Shen, W., et al.: Learning from experts: developing transferable deep features for patient-level lung cancer prediction. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 124–131. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_15
16. Shen, W., Zhou, M., Yang, F., Yang, C., Tian, J.: Multi-scale convolutional neural networks for lung nodule classification. In: Ourselin, S., Alexander, D.C., Westin, C.-F., Cardoso, M.J. (eds.) IPMI 2015. LNCS, vol. 9123, pp. 588–599. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19992-4_46
17. Shen, W., et al.: Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recogn. 61, 663–673 (2017)
18. National Lung Screening Trial Research Team: The national lung screening trial: overview and study design. Radiology 258(1), 243–253 (2011)
19. National Lung Screening Trial Research Team: Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. 365(5), 395–409 (2011)
20. Wu, B., Sun, X., Hu, L., Wang, Y.: Learning with unsure data for medical image diagnosis. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 10590–10599 (2019)
21. Wu, B., Zhou, Z., Wang, J., Wang, Y.: Joint learning for pulmonary nodule segmentation, attributes and malignancy prediction. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1109–1113. IEEE (2018)
22. Wu, Y., He, K.: Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
23. Xie, Y., et al.: Knowledge-based collaborative deep learning for benign-malignant lung nodule classification on chest CT. IEEE Trans. Med. Imaging 38(4), 991–1004 (2018)
24. Zhang, H., Gu, Y., Qin, Y., Yao, F., Yang, G.-Z.: Learning with sure data for nodule-level lung cancer prediction. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12266, pp. 570–578. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59725-2_55
Weakly-Supervised, Self-supervised, and Contrastive Learning
Universal Lesion Detection and Classification Using Limited Data and Weakly-Supervised Self-training

Varun Naga1, Tejas Sudharshan Mathai1(B), Angshuman Paul2, and Ronald M. Summers1

1 Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Radiology and Imaging Sciences, Clinical Center, National Institutes of Health, Bethesda, MD, USA
[email protected]
2 Indian Institute of Technology, Jodhpur, Rajasthan, India
Abstract. Radiologists routinely identify, measure, and classify clinically significant lesions for cancer staging and tumor burden assessment. As these tasks are repetitive and cumbersome, only the largest lesion is identified, leaving others of potential importance unmentioned. Automated deep learning-based methods for lesion detection have been proposed in the literature, using the publicly available DeepLesion dataset (32,735 lesions, 32,120 CT slices, 10,594 studies, 4,427 patients, 8 body part labels), to help relieve this burden. However, this dataset contains missing lesions and displays a severe class imbalance in the labels. In our work, we use a subset of the DeepLesion dataset (boxes + tags) to train a state-of-the-art VFNet model to detect and classify suspicious lesions in CT volumes. Next, we predict on a larger data subset (containing only bounding boxes) and identify new lesion candidates for a weakly-supervised self-training scheme. The self-training is done across multiple rounds to improve the model's robustness against noise. Two experiments were conducted with static and variable thresholds during self-training, and we show that sensitivity improves from 72.5% without self-training to 76.4% with self-training. We also provide a structured reporting guideline through a "Lesions" subsection for entry into the "Findings" section of a radiology report. To our knowledge, we are the first to propose a weakly-supervised self-training approach for joint lesion detection and tagging in order to mine for under-represented lesion classes in the DeepLesion dataset.
Keywords: CT · Detection · Classification · Deep learning · Self-training
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-16760-7_6.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
G. Zamzmi et al. (Eds.): MILLanD 2022, LNCS 13559, pp. 55–64, 2022. https://doi.org/10.1007/978-3-031-16760-7_6
1 Introduction
Radiologists evaluate tumor burden and stage cancer in their clinical practice by detecting, measuring, and classifying clinically significant lesions. Computed tomography (CT) and positron emission tomography (PET) studies are usually the preferred imaging modalities for lesion assessment [1]. In CT volumes acquired with or without contrast agents, lesions have diverse appearances and asymmetrical shapes. The lesion size is measured using its long and short axis diameters (LAD and SAD) according to the RECIST guidelines. Lesion size is a surrogate biomarker for malignancy and impacts the ensuing course of patient therapy. According to the guidelines, lesions are clinically meaningful if their LAD is ≥10 mm [2]. Assessment standardization is complicated by a number of factors, such as observer measurement variability, the variety of CT scanners, different contrast phases, and exam protocols. Moreover, a radiologist must identify the same lesion in a prior study and assess the treatment response (shrinkage, growth, unchanged) [1,2]. Another confounding factor is the chance of smaller metastatic lesions being missed during a busy clinical day. To alleviate the radiologist's repetitive task of lesion assessment, many state-of-the-art automated approaches [3–7] have been developed to universally detect, classify, and segment lesions with high sensitivity on a publicly available dataset called DeepLesion [8]. DeepLesion contains eight (8) lesion-level tags for only the validation and test splits. As seen in Fig. 1(a), there is a profound lesion class imbalance in this dataset (validation and test), with large quantities of certain labels (lung, abdomen, mediastinum, and liver) in contrast to other under-represented classes (pelvis, soft tissue, kidney, bone). Since tags are unavailable for the DeepLesion training split, little research has been done on lesion classification [9,10], and these approaches are not easily reproduced due to the need for a sophisticated lesion ontology to generate multiple lesion tags (body part, type, and attributes). Moreover, DeepLesion is not fully annotated, as only clinically significant lesions were measured while others remain unannotated [6–8]. These imbalances inhibit the development of efficient CT lesion detection and tagging algorithms. Approaches that use a limited dataset and exploit any unannotated or weakly-annotated data are desirable for clinical use cases, such as interval change detection (lesion tracking) [11–13] and structured report generation. To that end, in this paper, we design a method that uses a limited DeepLesion subset (30% annotated split) consisting of lesion bounding boxes and body part labels to train a state-of-the-art VFNet model [14] for lesion detection and tagging. Our model subsequently utilizes a larger data subset (with only bounding boxes) through a weakly-supervised self-training process, in which the model learns from its own predictions and efficiently re-trains itself for lesion and tag prediction. The self-training process is performed over multiple rounds, with each round designed to improve model robustness against noise through the inclusion of new data points (box + tags) predicted with high confidence along with the original annotated (box + tags) training data. The final model is used for the detection and tagging of lesions, and we provide a clinical application of our work by describing a structured reporting guideline for creating a dedicated "Lesions"
Fig. 1. (a) Distribution of body part labels in the annotated DeepLesion dataset (30%, 9816 lesions, 9624 slices). (b) and (c) Model predictions before and after self-training (ST), respectively. Green boxes: ground truth, yellow: true positives, and red: false positives (FP). The top row shows a decrease in FP with ST. The middle row shows a "Kidney" lesion that was initially missed with no ST but found after ST. The last row shows the predicted class corrected from "Lung" to "Mediastinum" after ST. (d) Four lung nodules were detected by the model. The top-3 lesion predictions, their labels, and confidence scores were entered into a structured "Lesions" list for inclusion in the "Findings" section of a radiology report. Lesions below a 50% confidence are shown in red. Lesion 2 was annotated in the original DeepLesion dataset, while Lesion 1 was not; our model correctly detected Lesion 1, but it was considered a FP. Lesion 3 had a lower confidence score than Lesion 4 and hence was not entered in the "Lesions" sub-section. (Color figure online)
sub-section for entry into the “Findings” section of a radiology report. The “Lesions” sub-section contains a structured list of detected lesions along with their body part tags, detection confidence, and series and slice numbers. To our knowledge, we are the first to present a joint lesion detection and tagging approach based on weakly-supervised self-training, such that under-represented classes in DeepLesion can be mined.
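As an illustration of the proposed reporting guideline, the sketch below builds such a "Lesions" sub-section from a list of detections; the field and function names are hypothetical and only mirror the items named in the text (body part tag, confidence, series and slice numbers):

```python
from dataclasses import dataclass

@dataclass
class LesionEntry:
    """One detected lesion destined for the 'Lesions' sub-section."""
    body_part: str        # predicted tag, e.g. "Lung"
    confidence: float     # detection confidence in [0, 1]
    series_number: int
    slice_number: int

def build_lesions_subsection(detections, top_k=3, min_confidence=0.5):
    """Keep the top-k most confident detections and render them as a structured
    list for the 'Findings' section; low-confidence entries are flagged."""
    ranked = sorted(detections, key=lambda d: d.confidence, reverse=True)[:top_k]
    lines = ["Lesions:"]
    for i, d in enumerate(ranked, start=1):
        flag = "" if d.confidence >= min_confidence else " [low confidence]"
        lines.append(f"  {i}. {d.body_part} lesion, confidence {d.confidence:.0%}, "
                     f"series {d.series_number}, slice {d.slice_number}{flag}")
    return "\n".join(lines)

print(build_lesions_subsection([
    LesionEntry("Lung", 0.91, 2, 143),
    LesionEntry("Mediastinum", 0.47, 2, 150),
]))
```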
2 Methods
Data. The DeepLesion dataset [8] contains annotated keyslices with 2D bounding boxes that demarcate lesions present in that slice. Contextual information in the form of slices 30 mm above and below the keyslice was also provided, but these slices were not annotated. Annotations were done through RECIST measurements with long and short axis diameters (LAD and SAD) [8]. The dataset was divided into 70% train, 15% validation and 15% test splits. Eight (8) lesion-level tags (bone - 1, abdomen - 2, mediastinum - 3, liver - 4, lung - 5, kidney - 6, soft tissue - 7, and pelvis - 8) were available for only the validation and test splits. The lesion tags were obtained through a body part regressor [15], which provided a continuous score representing the normalized position of the body part for a CT slice in a CT volume (e.g., liver, lung, kidney, etc.). The body part label for the CT slice was assigned to any lesion annotated in that slice. In our work, we used the limited annotated 30% subset of the original dataset for model training. Figure 1(a) shows the labeled lesion distribution in this limited 30% data subset. It was then sub-divided into 70/15/15% training/validation/test splits. The test set was kept constant and all results are presented for this set. Model. In this work, we used a state-of-the-art detection network called Varifocal Network (VFNet) [14] for the task of lesion detection and classification, as seen in Fig. 2(a). VFNet combines a Fully Convolutional One-Stage (FCOS) object detector [16] (without the centerness branch) and an Adaptive Training Sample Selection (ATSS) mechanism [17]. A Varifocal loss was used to up-weight the contribution of positive object candidates and down-weight negative candidates. Moreover, a star-shaped bounding box representation was utilized to extract contextual cues that reduced the misalignment between the ground truth and the predicted bounding boxes. VFNet was trained to predict a lesion's bounding box and body part label in the CT slice. We also conducted experiments with a Faster R-CNN model [18] for the overall task of lesion detection (without tagging); however, Faster R-CNN showed inferior detection performance compared to VFNet (see supplementary material). Once a model had been trained, Weighted Boxes Fusion (WBF) [19] was used to combine the numerous predictions from multiple epochs of a single model run or from multiple runs of the same model. As these predictions cluster together in common image areas, with many being false positives (FP) that decrease the overall precision and recall, WBF amalgamated each cluster into a single box. Our aim was to improve VFNet's prediction capabilities for under-represented classes through the mining of data in DeepLesion. Weakly-Supervised Self-Training. The VFNet model trained on the limited DeepLesion data (30% annotated subset) was then used to iteratively mine new lesions in DeepLesion's original training split. This split contained only the annotated bounding boxes of a lesion in a CT slice without the lesion tags (body part labels), and only clinically significant lesions were measured, leaving many others unannotated. These clinically meaningful bounding boxes served as weak supervision for our model. After the VFNet model was
trained on the limited DeepLesion data, it generated predictions comprising the bounding boxes, class labels, and tag confidence scores for each lesion. Mined lesions were filtered, first by keeping only those with at least a 30% overlap with the originally annotated ground-truth bounding boxes, and second by keeping only those that surpassed a tag confidence threshold (see Sect. 3). Once the mining procedure was complete, the lesions that met these two criteria were added back into the training data, effectively allowing the model to train on some of its own predictions. After lesion mining, the model was trained from scratch, and this procedure was repeated for four mining rounds. We also experimented with training VFNet from a previous mining round's checkpoint weights, but the results were worse than training from scratch.
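The filtering rule that decides which mined candidates re-enter training can be summarized with a short sketch (an illustration with hypothetical helper names, not the authors' code; the 0.8 default corresponds to the static-threshold experiment described in Sect. 3):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_mined_lesions(predictions, gt_boxes, conf_thresh=0.8, iou_thresh=0.3):
    """Keep a predicted (box, tag, score) only if it overlaps an annotated
    ground-truth box by at least `iou_thresh` and its tag confidence reaches
    `conf_thresh`; survivors are added back to the training set for the next round."""
    mined = []
    for box, tag, score in predictions:
        if score >= conf_thresh and any(iou(box, gt) >= iou_thresh for gt in gt_boxes):
            mined.append((box, tag, score))
    return mined
```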
Fig. 2. (a) VFNet model in the self-training pipeline takes CT slices annotated with GT as inputs and predicts lesion bounding boxes (Bn ), classes (Cn ), and confidences at a mining round n. Lesions are filtered by their confidences and IoU overlap with GT, and then fed back to the model for re-training. (b)-(c) Comparison of a static threshold (TS ) vs variable threshold (TV ) used in self-training. Green boxes: GT, yellow: TP, and red: FP. The first row shows a “soft tissue” lesion detection with TV showing fewer FP. The second row shows a “Bone” lesion that was missed by TS , but identified with TV . The third row reveals an incorrect “Abdomen” prediction with TS that was subsequently corrected to “Kidney” by TV . (d) Comparison of the mean recall at precisions [85,80,75,70]% for the experiments with static ES and variable EV thresholds respectively. (Color figure online)
3 Experiments and Results

Experimental Design. In the weakly-supervised self-training setting, we designed two experiments to detect and classify lesions with our limited dataset. In the first experiment ES, we set a static lesion tagging confidence threshold (TS) of 80% and a 30% box IoU overlap. Lesions with predicted class confidences ≥ TS and overlaps ≥ 30% were incorporated into later mining rounds. Through this experiment, we hypothesized that only high-quality mined lesions would be collected across the rounds, leading directly to an efficient detector. In our second experiment EV, a variable lesion tagging confidence threshold TV was set along with a 30% box IoU overlap. A higher confidence threshold of 80% was set for the first mining round, and it was progressively lowered by 10% over the remaining rounds. The rationale was that although good-quality mined lesions would be found in the first round, larger lesion quantities would also be collected across the rounds with a reduced threshold. The results of these experiments were compared against an experiment EN in which the model underwent no self-training. Consistent with prior work [6,7,20], our results at 4 FP/image and 30% IoU overlap on the 15% test split are presented in Figs. 1 and 2 and Table 1. Implementation details for the model are in the supplementary material.

Results - No Self-training. The model in our no-self-training experiment EN achieved a mean sensitivity of 72.5% at 4 FP, with the lowest sensitivities for under-represented classes, such as "Kidney" (∼54%) and "Bone" (∼56%), and the highest recalls for over-represented classes, such as "Lung" (∼87%) and "Liver" (∼83%). Generally, classes with more data (see Fig. 1) seemed to perform better, with the exception of the "Abdomen" class. We believe the "Abdomen" class performed relatively poorly as it was a "catch-all" term for all abdominal lesions that were not "Kidney" or "Liver" masses [8].
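For concreteness, the per-round acceptance thresholds of the two experiments can be written as a small schedule function (a sketch; only the values stated above are encoded, and the experiment names ES/EV follow the text):

```python
def confidence_schedule(experiment, num_rounds=4):
    """Per-round tag-confidence thresholds for accepting mined lesions:
    ES keeps a static 80%, while EV starts at 80% and drops by 10% per round."""
    if experiment == "ES":
        return [0.8] * num_rounds
    if experiment == "EV":
        return [round(0.8 - 0.1 * r, 1) for r in range(num_rounds)]   # 0.8, 0.7, 0.6, 0.5
    raise ValueError(f"unknown experiment: {experiment}")
```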
Experimental Design. In the weakly-supervised self-training setting, we designed two experiments to detect and classify lesions with our limited dataset. In the first experiment ES , we set a static lesion tagging confidence threshold (TS ) of 80% and a 30% box IoU overlap. Lesions that had predicted class confidences ≥ TS and overlaps ≥ 30% were incorporated into later mining rounds. Through this experiment, we hypothesized that only high quality mined lesions would be collected across the rounds that would directly lead to an efficient detector. In our second experiment EV , a variable lesion tagging confidence threshold TV was set along with a 30% box IoU overlap. A higher confidence threshold of 80% was set for the first mining round, and it was progressively lowered by 10% over the remaining rounds. The belief was that although good quality mined lesions would be found in the first round, larger lesion quantities would also be collected across the rounds with a reduced threshold. The results of these experiments were compared against an experiment EN where the model underwent no self-training. Consistent with prior work [6,7,20], our results at 4 FP/image and 30% IoU overlap on the 15% test split are presented in Figs. 1 and 2 and Table 1. Implementation details for the model are in the supplementary material. Results - No Self-training. The model in our no self-training experiment EN achieved a mean sensitivity of 72.5% at 4FP with the lowest sensitivities for underrepresented classes, such as “kidney” (∼54%) and “Bone” (∼56%) respectively, and the highest recalls for over-represented classes, such as “Lung” (∼87%) and “Liver” (∼83%). Generally, classes with more data (see Fig. 1) seemed to perform better with the exception of “Abdomen” class. We believe the “Abdomen” class performed relatively poorly as it was a “catch-all” term for all abdominal lesions that were not “Kidney” or “Liver” masses [8]. Anatomically however, the two Table 1. VFNet sensitivities on the task of lesion detection and tagging. The recalls were calculated at 4FP and at 30% IoU overlap. Mining round
Bone Kidney Soft tissue Pelvis Liver Mediastinum Abdomen Lung Mean (95% CI)
No self training No self-training # Lesions used
55.9 179
54.8 353
73.3 476
75.4 612
82.9 912
79.6 1193
69.7 1506
87.5 72.5 (71.8%-73.0%) 1640 -
Static threshold of 80% across 4 rounds of self-training Round Round Round Round
1 2 3 4
(80%) (80%) (80%) (80%)
58.8 52.9 47.1 58.8
# Lesions mined 26
58.1 61.3 61.3 59.7
75.2 75.2 71.4 73.3
79.0 75.4 78.3 72.5
85.0 79.3 85.0 83.9
85.5 83.4 80.0 83.4
69.7 73.0 69.7 71.2
88.1 88.9 87.3 87.5
74.9 73.7 72.5 73.8 (73.3%–74.3%)
129
541
517
929
1049
708
2515 -
Variable threshold [80, 70, 60, 50%] across 4 rounds of self-training Round Round Round Round
1 2 3 4
(80%) (70%) (60%) (50%)
58.8 61.8 44.1 61.8
# Lesions mined 115
58.1 66.1 56.5 64.5
75.2 69.5 75.2 73.8
79.0 76.8 78.3 75.4
85.0 85.5 87.1 89.6
85.5 82.1 81.3 83.0
69.7 70.6 73.9 72.7
88.1 89.4 89.1 90.5
74.9 75.2 73.2 76.4 (75.9%–76.9%)
473
1187
1438
2089 2936
3003
4249 -
Anatomically, however, the two organs are in close proximity, and axial slices often show cross-sections of both the kidney and liver within the same slice (cf. Fig. 1(b), second row). A confusion matrix provided in the supplementary material confirmed our belief, as it showed that the "Abdomen", "Liver" and "Kidney" lesions were most often confused with each other.

Results - Weakly-Supervised Self-training. First, results from our static threshold experiment ES are discussed. In contrast to EN, sensitivities at round 4 either improved or were maintained for 7/8 classes, the exception being the "Pelvis" class. The average sensitivity improved by 1.3% compared against EN, and those of the individual classes improved by 1.4% on average. In rounds 2 and 3 of self-training, a drop in mean sensitivity was observed, but the performance recovered by round 4 to within 1.1% of the mean sensitivity from round 1. We also see a greater number of "Lung", "Mediastinum" and "Liver" lesions mined (>900) in contrast to the "Bone" (26) and "Kidney" (129) lesions. Despite additional lesions being mined, only 3/8 classes ("Bone", "Kidney", "Abdomen") improved when the recalls at round 4 were compared against round 1.

From our variable threshold experiment EV (commencing at 80% confidence), average recalls improved by 3.9% in contrast to EN, and all 8/8 classes either improved or maintained their performance. As the "Kidney" class performed the worst in EN, it saw the biggest increase of 9.7% in EV. On average, the individual class sensitivities improved by ∼4%, which was larger than that seen with ES. An assessment over 4 rounds revealed a general trend of sensitivity improvement. While the recall dipped moving from round 2 to round 3 for certain classes, it recovered by round 4 and surpassed round 1 by 1.5%. Table 1 also shows that the numbers of mined lesions for the "Bone" and "Kidney" classes are significantly lower than for the other classes, such as "Lung" and "Liver". The number of lesions mined after 4 rounds of self-training is shown in Table 1, and the supplementary material shows the number of lesions mined at each round.

For a quantitative comparison of ES and EV across rounds, we plotted the precision-recall curves for each mining round in Fig. 2(d). The model's mean recall at [85,80,75,70]% precision was evaluated to gauge the performance at higher true-positive (TP) rates, and we saw EV outperforming ES. By round 4, EV had outperformed ES by ∼3%. Additionally, the recall increased between mining rounds except for round 4, which saw a slight decrease of 1%. However, the overall performance had improved from round 1 by 4.2%, which suggested that additional rounds of self-training with a variable threshold TV helped improve recall. Qualitative analysis in Figs. 1 and 2 showed that EV improved recall by finding missed lesions, along with correct classification of lesion tags and a reduction in FP. We also saw improved performance on under-represented classes, such as "Kidney" and "Bone", as seen in Figs. 2(b) and 2(c).
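The operating point used throughout this section, sensitivity at 4 FP/image, can be computed from ranked detections as in the following simplified FROC-style sketch (an illustration under the assumption that detections have already been matched to ground truth at the 30% IoU criterion; this is not the authors' evaluation code):

```python
import numpy as np

def sensitivity_at_fp_rate(scores, is_tp, num_images, num_gt, fp_per_image=4.0):
    """Sort all detections by confidence, find the cut-off at which the average
    number of false positives per image reaches `fp_per_image`, and report the
    recall at that operating point."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=bool)[order]
    cum_tp = np.cumsum(hits)
    cum_fp = np.cumsum(~hits)
    allowed_fp = fp_per_image * num_images
    idx = int(np.searchsorted(cum_fp, allowed_fp, side="right")) - 1
    return 0.0 if idx < 0 else float(cum_tp[idx]) / num_gt
```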
4 Discussion and Conclusion
Discussion. As shown in Figs. 1 and 2, self-training improved model performance in lesion detection and tagging. We saw an improvement in sensitivities
across classes for the variable threshold experiment EV in comparison to the static threshold experiment ES. We believe that EV found a balance between data quality and quantity, making it ideal for the self-training procedure: lowering the threshold across rounds allowed the model to identify high-quality data initially, while additional training data was progressively included over the remaining rounds to provide data variety. But as shown in Fig. 1(a) and Table 1 in the supplementary material, the model performance suffers from the under-representation of classes, such as "Bone" and "Kidney", in the dataset. For these classes, sensitivities varied without consistent improvements (cf. Table 1), because greater lesion quantities were mined for the over-represented classes than for the under-represented ones. Balancing the lesion quantities in these classes, which would drastically decrease the amount of training data, could shed some light on the performance of these low-data classes in self-training. Another solution involves a custom adjustment of the loss weights for the under-represented classes in the VFNet loss function, which would penalize the model when it performs poorly on under-represented classes and mitigate the class imbalance effect. Additionally, the confusion matrices in the supplementary material make it evident that the model often confused the "Abdomen", "Liver" and "Kidney" classes. We believe that "Abdomen" and "Soft Tissue" are ambiguous, non-specific labels that broadly encompass multiple regions in the abdomen. As these labels were generated using a body part regressor, future work involves the creation of fine-grained labels and examining the performance of the "Abdomen", "Kidney", and "Pelvis" classes. Furthermore, the DeepLesion dataset contains both contrast and non-contrast enhanced CT volumes, but the exact phase information is unavailable in the dataset description. Comparison against prior work [9,10] was not possible, as MULAN [10] is the only existing approach that jointly detects and tags lesions in a CT slice. However, it used a Mask-RCNN model that needed segmentation labels, which we did not create in this work. Furthermore, MULAN also provided detailed tags, which would require a sophisticated ontology derived from reports to map them to the body part tags used in this work. To mitigate this issue, we tested our approach against Faster R-CNN, but our VFNet model fared better at lesion detection (cf. supplementary material). Prior to this study, limited research discussed the presentation of results from detection models in a clinical workflow. In Fig. 1(d), we present a structured reporting guideline with the creation of a "Lesions" sub-section for entry into the "Findings" section of a radiology report. This sub-section contains a structured list of detected lesions along with their body part tags, confidences, and series and slice numbers.

Conclusion. In this work, we used a limited DeepLesion data subset (30% annotated data) containing lesion bounding boxes and body part labels to train a VFNet model for lesion detection and tagging. Subsequently, the model predicted lesion locations and tags on a larger data subset (boxes and no tags) through a weakly-supervised self-training process.
The self-training process was done across multiple rounds, and two experiments were conducted that showed that sensitivity improved from 72.5% (no self-training) to 74.9% (static threshold) and 76.4% (variable threshold) respectively. In every round, new data points (boxes +
tags) predicted with high confidence were added to the original annotated (boxes + tags) training data, and the model was trained from scratch. We also provide a structured reporting guideline for the clinical workflow. A “Lesions” sub-section for entry into the “Findings” section of a radiology report was created, and it contained a structured list of detected lesions, body part tags, confidences, and series and slice numbers. Acknowledgements. This work was supported by the Intramural Research Program of the National Institutes of Health (NIH) Clinical Center.
References
1. Eisenhauer, E., et al.: New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur. J. Cancer 45(2), 228–247 (2009)
2. van Persijn van Meerten, E.L., et al.: RECIST revised: implications for the radiologist. A review article on the modified RECIST guideline. Eur. Radiol. 20, 1456–1467 (2010)
3. Yang, J., et al.: AlignShift: bridging the gap of imaging thickness in 3D anisotropic volumes. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12264, pp. 562–572. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59719-1_55
4. Yang, J., He, Y., Kuang, K., Lin, Z., Pfister, H., Ni, B.: Asymmetric 3D context fusion for universal lesion detection. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol. 12905, pp. 571–580. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87240-3_55
5. Han, L., et al.: SATr: Slice Attention with Transformer for Universal Lesion Detection. arXiv (2022)
6. Yan, K., et al.: Learning from multiple datasets with heterogeneous and partial labels for universal lesion detection in CT. IEEE TMI 40(10), 2759–2770 (2021)
7. Cai, J., et al.: Lesion harvester: iteratively mining unlabeled lesions and hard-negative examples at scale. IEEE TMI 40(1), 59–70 (2021)
8. Yan, K., et al.: DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging 5(3), 036501 (2018)
9. Yan, K., et al.: Holistic and comprehensive annotation of clinically significant findings on diverse CT images: learning from radiology reports and label ontology. In: IEEE CVPR (2019)
10. Yan, K., et al.: MULAN: multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11769, pp. 194–202. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32226-7_22
11. Hering, A., et al.: Whole-body soft-tissue lesion tracking and segmentation in longitudinal CT imaging studies. In: PMLR, pp. 312–326 (2021)
12. Cai, J., et al.: Deep lesion tracker: monitoring lesions in 4D longitudinal imaging studies. In: IEEE CVPR, pp. 15159–15169 (2021)
13. Tang, W., et al.: Transformer Lesion Tracker. arXiv (2022)
14. Zhang, H., et al.: VarifocalNet: an IoU-aware dense object detector. In: IEEE CVPR, pp. 8514–8523 (2021)
15. Yan, K., et al.: Unsupervised body part regression via spatially self-ordering convolutional neural networks. In: IEEE ISBI, pp. 1022–1025 (2018)
16. Tian, Z., et al.: FCOS: fully convolutional one-stage object detection. In: IEEE ICCV, pp. 9627–9636 (2019)
17. Zhang, S., et al.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: IEEE CVPR (2020)
18. Ren, S., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE PAMI 39(6), 1137–1149 (2017)
19. Solovyev, R., et al.: Weighted boxes fusion: ensembling boxes from different object detection models. Image Vis. Comput. 107, 104117 (2021)
20. Mattikalli, T., et al.: Universal lesion detection in CT scans using neural network ensembles. In: SPIE Medical Imaging: Computer-Aided Diagnosis, vol. 12033 (2022)
BoxShrink: From Bounding Boxes to Segmentation Masks

Michael Gröger(B), Vadim Borisov, and Gjergji Kasneci

University of Tübingen, Tübingen, Germany
[email protected]

M. Gröger and V. Borisov contributed equally.

Abstract. One of the core challenges facing the medical image computing community is fast and efficient data sample labeling. Obtaining fine-grained labels for segmentation is particularly demanding since it is expensive, time-consuming, and requires sophisticated tools. On the contrary, applying bounding boxes is fast and takes significantly less time than fine-grained labeling, but does not produce detailed results. In response, we propose a novel framework for weakly-supervised tasks with the rapid and robust transformation of bounding boxes into segmentation masks without training any machine learning model, coined BoxShrink. The proposed framework comes in two variants: rapid-BoxShrink for fast label transformations, and robust-BoxShrink for more precise label transformations. An average improvement of four percent in IoU is found across several models when they are trained using BoxShrink in a weakly-supervised setting, compared to using only bounding box annotations as inputs, on a colonoscopy image data set. We open-sourced the code for the proposed framework and published it online.

Keywords: Weakly-supervised learning · Segmentation · Colonoscopy · Deep neural networks

1 Introduction
Convolutional neural networks (CNNs) have achieved remarkable results across vision tasks of increasing complexity, from pure image classification to full panoptic segmentation, and have consequently become the standard method for these tasks in computer vision [19]. However, there are also certain drawbacks associated with these methods. One of them is that, in order to achieve satisfactory results, a data set of an appropriate size with high-quality labels is needed [21]. The costs and time associated with labeling increase with the complexity of the task, with image classification being the cheapest and image segmentation the most expensive. All of these challenges especially
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-16760-7_7.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
G. Zamzmi et al. (Eds.): MILLanD 2022, LNCS 13559, pp. 65–75, 2022. https://doi.org/10.1007/978-3-031-16760-7_7
apply to medical artificial intelligence (MAI) applications, since they depend on the input and feedback of expensive domain experts [22]. In this work, we present a novel approach for fast segmentation label preprocessing, which is decoupled from any particular artificial neural network architecture. The proposed algorithmic framework can serve as a first approach for practitioners to transform a data set with only bounding box annotations into a pre-labeled (i.e., semantically segmented) version of the data set. Our framework consists of independent components such as superpixels [23], fully-connected conditional random fields [14], and embeddings. This makes it easy to add our framework to an existing machine learning pipeline. To evaluate the proposed framework, we select an endoscopic colonoscopy data set [4]. Multiple experiments show that our framework helps to considerably reduce the gap in segmentation performance and efficiency between a neural network trained only on bounding boxes and one trained on fully segmented masks. The main contributions of this work are:
– We propose the BoxShrink framework, consisting of two methods: one for a time-efficient and one for a more robust transformation of bounding boxes into segmentation masks. Neither method requires training a model.
– We publish our bounding-box labels for the CVC-Clinic data set for future research in the area of weakly-supervised learning.
– We open-source our code and publish it online.1
2 Related Work
In this section, we further define weakly-supervised learning and distinguish it from other approaches such as semi-supervised learning. We also position our work among approaches that use similar components. To reduce the need for resources such as time and money, various learning methodologies have been introduced, such as semi-supervised and weakly-supervised learning [30]. Semi-supervised learning leverages labeled data (for segmentation tasks, correctly and fully segmented images) together with a larger amount of unlabeled data [16]. Weakly-supervised learning, on the other hand, exploits noisy labels as a weak supervisory signal to generate segmentation masks. These labels can be provided in different forms: simpler ones such as points [3] or image-level labels [27], or more complex ones such as scribbles [15,24] or bounding boxes [6,11]. A work similar to ours [29] also utilizes superpixel embeddings and CRFs, but their method additionally requires constructing a graph of superpixels and a custom deep neural network architecture. Our method, on the other hand, is easier to integrate into existing pipelines. Also, in contrast to many other weakly-supervised approaches [10,28], we do not apply CRFs as a postprocessing step on the output of the model but as a preprocessing step on the input; hence, we leave the downstream model untouched. Furthermore, the proposed framework does not require special hardware such as a GPU or TPU for the label preprocessing step.
1 https://github.com/michaelgroeger/boxshrink
Fig. 1. The impact of varying the threshold ts , i.e., a hyperparameter of the BoxShrink framework for tuning the final segmentation quality, where (a) shows two data samples from the data set after the superpixel assignment step (Sect. 3.2), and (b) demonstrates pseudo-masks after the FCRF postprocessing. As seen from this experiment, having a higher threshold might generate better masks but increases the risk of losing correct foreground pixels.
3 BoxShrink Framework
This section presents our proposed BoxShrink framework. First, we define its main components: superpixel segmentation, fully-connected conditional random fields, and the embedding step. We then explain two different settings of the framework, both having the same goal: to reduce the number of background pixels that are labeled as foreground inside the bounding box mask.
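To make the superpixel component described in the next subsection concrete, the sketch below uses scikit-image's SLIC to over-segment an RGB endoscopy frame and keeps only the superpixels that fall mostly inside the box mask. The majority rule (analogous in spirit to the threshold ts in Fig. 1) and the hyperparameter values are illustrative assumptions rather than the exact BoxShrink procedure, which also applies an embedding step and an FCRF:

```python
import numpy as np
from skimage.segmentation import slic

def shrink_box_with_superpixels(image, box_mask, n_segments=200, compactness=10,
                                keep_thresh=0.5):
    """Over-segment the image with SLIC and keep only the superpixels whose area
    lies mostly (> keep_thresh) inside the bounding-box mask, yielding a rough
    pseudo-mask that already hugs the object more tightly than the box."""
    segments = slic(image, n_segments=n_segments, compactness=compactness,
                    start_label=1)
    pseudo_mask = np.zeros_like(box_mask)
    for sp in np.unique(segments):
        inside = box_mask[segments == sp].mean()
        if inside > keep_thresh:
            pseudo_mask[segments == sp] = 1
    return pseudo_mask
```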
3.1 Main Components
Superpixels aim to group pixels into bigger patches based on their color similarity or other characteristics [23]. In our implementation, we utilize the SLIC algorithm proposed in [1], a k-means-based algorithm that groups pixels based on their proximity in a 5D space. A crucial hyperparameter of SLIC is the number of segments to be generated, which is an upper bound on how many superpixels the algorithm returns for a given image. The relationship between the output of SLIC and the maximum number of segments can be seen in the supplementary material. Fully-connected CRFs are an advanced version of conditional random fields (CRFs), which represent pixels as a graph structure. CRFs take into account
a unary potential for each pixel and the dependency structure between that pixel and its neighboring ones through pairwise potentials [25]. Fully-connected CRFs (FCRFs) address some of the limitations of classic CRFs, such as the inability to capture long-range dependencies, by connecting all pixel pairs. Equation 1 shows the main building block of FCRFs, the Gibbs energy function [13]:

E(x) = \sum_{i=1}^{N} \psi_u(x_i) + \sum_{i<j}^{N} \psi_p(x_i, x_j),    (1)