Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 978-3-319-46723-8, 3319467239, 978-3-319-46722-1

The three-volume set LNCS 9900, 9901, and 9902 constitutes the refereed proceedings of the 19th International Conference

English Pages 703 [728] Year 2016

Table of contents :
Content: Machine learning and feature selection --
Deep learning in medical imaging --
Applications of machine learning --
Segmentation --
Cell image analysis.

LNCS 9901

Sebastien Ourselin · Leo Joskowicz · Mert R. Sabuncu · Gozde Unal · William Wells (Eds.)

Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016 19th International Conference Athens, Greece, October 17–21, 2016 Proceedings, Part II

Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7412

Editors
Sebastien Ourselin, University College London, London, UK
Gozde Unal, Istanbul Technical University, Istanbul, Turkey
Leo Joskowicz, The Hebrew University of Jerusalem, Jerusalem, Israel
William Wells, Harvard Medical School, Boston, MA, USA
Mert R. Sabuncu, Harvard Medical School, Boston, MA, USA

ISSN 0302-9743
ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-46722-1
ISBN 978-3-319-46723-8 (eBook)
DOI 10.1007/978-3-319-46723-8
Library of Congress Control Number: 2016952513
LNCS Sublibrary: SL6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics

© Springer International Publishing AG 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

In 2016, the 19th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2016) was held in Athens, Greece. It was organized by Harvard Medical School, The Hebrew University of Jerusalem, University College London, Sabancı University, Bogazici University, and Istanbul Technical University. The meeting took place at the Intercontinental Athenaeum Hotel in Athens, Greece, during October 18–20. Satellite events associated with MICCAI 2016 were held on October 17 and October 21. MICCAI 2016 and its satellite events attracted world-leading scientists, engineers, and clinicians, who presented high-standard papers, aiming at uniting the fields of medical image processing, medical image formation, and medical robotics.

This year the triple anonymous review process was organized in several phases. In total, 756 submissions were received. The review process was handled by one primary and two secondary Program Committee members for each paper. It was initiated by the primary Program Committee member, who assigned exactly three expert reviewers, who were blinded to the authors of the paper. Based on these initial anonymous reviews, 82 papers were directly accepted and 189 papers were rejected. Next, the remaining papers went to the rebuttal phase, in which the authors had the chance to respond to the concerns raised by reviewers. The reviewers were then given a chance to revise their reviews based on the rebuttals. After this stage, 51 papers were accepted and 147 papers were rejected based on a consensus reached among reviewers. Finally, the reviews and associated rebuttals were discussed in person at the MICCAI 2016 Program Committee meeting, which took place in London, UK, during May 28–29, 2016, and was attended by 28 of the 55 Program Committee members, the four Program Chairs, and the General Chair. The process led to the acceptance of another 95 papers and the rejection of 192 papers.

In total, 228 papers of the 756 submitted papers were accepted, which corresponds to an acceptance rate of 30.1%. For these proceedings, the 228 papers are organized in 18 groups as follows. The first volume includes Brain Analysis (12), Brain Analysis: Connectivity (12), Brain Analysis: Cortical Morphology (6), Alzheimer Disease (10), Surgical Guidance and Tracking (15), Computer Aided Interventions (10), Ultrasound Image Analysis (5), and Cancer Image Analysis (7). The second volume includes Machine Learning and Feature Selection (12), Deep Learning in Medical Imaging (13), Applications of Machine Learning (14), Segmentation (33), and Cell Image Analysis (7). The third volume includes Registration and Deformation Estimation (16), Shape Modeling (11), Cardiac and Vascular Image Analysis (19), Image Reconstruction (10), and MRI Image Analysis (16).

We thank Dekon, who did an excellent job in the organization of the conference. We thank the MICCAI society for provision of support and insightful comments, the Program Committee for their diligent work in helping to prepare the technical program, as well as the reviewers for their support during the review process. We also thank Andreas Maier for his support in editorial tasks. Last but not least, we thank our sponsors for the financial support that made the conference possible. We look forward to seeing you in Quebec City, Canada, in 2017!

August 2016

Sebastien Ourselin William Wells Leo Joskowicz Mert Sabuncu Gozde Unal

Organization

General Chair
Sebastien Ourselin, University College London, London, UK

General Co-chair
Aytül Erçil, Sabanci University, Istanbul, Turkey

Program Chair
William Wells, Harvard Medical School, Boston, MA, USA

Program Co-chairs
Mert R. Sabuncu, A.A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA
Leo Joskowicz, The Hebrew University of Jerusalem, Israel
Gozde Unal, Istanbul Technical University, Istanbul, Turkey

Local Organization Chair
Bülent Sankur, Bogazici University, Istanbul, Turkey

Satellite Events Chair
Burak Acar, Bogazici University, Istanbul, Turkey

Satellite Events Co-chairs
Evren Özarslan, Harvard Medical School, Boston, MA, USA
Devrim Ünay, Izmir University of Economics, Izmir, Turkey
Tom Vercauteren, University College London, UK

Industrial Liaison
Tanveer Syeda-Mahmood, IBM Almaden Research Center, San Jose, CA, USA

Publication Chair
Andreas Maier, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

MICCAI Society Board of Directors
Stephen Aylward (Treasurer), Kitware, Inc., NY, USA
Hervé Delingette, Inria, Sophia Antipolis, France
Simon Duchesne, Université Laval, Québec, QC, Canada
Gabor Fichtinger (Secretary), Queen’s University, Kingston, ON, Canada
Alejandro Frangi, University of Sheffield, UK
Pierre Jannin, INSERM/Inria, Rennes, France
Leo Joskowicz, The Hebrew University of Jerusalem, Israel
Shuo Li, Digital Imaging Group, Western University, London, ON, Canada
Wiro Niessen (President and Board Chair), Erasmus MC - University Medical Centre, Rotterdam, The Netherlands
Nassir Navab, Technical University of Munich, Germany
Alison Noble (Past President - Non Voting), University of Oxford, UK
Sebastien Ourselin, University College London, UK
Josien Pluim, Eindhoven University of Technology, The Netherlands
Li Shen (Executive Director), Indiana University, IN, USA

MICCAI Society Consultants to the Board
Alan Colchester, University of Kent, Canterbury, UK
Terry Peters, University of Western Ontario, London, ON, Canada
Richard Robb, Mayo Clinic College of Medicine, MN, USA

Executive Officers
President and Board Chair: Wiro Niessen
Executive Director (Managing Educational Affairs): Li Shen
Secretary (Coordinating MICCAI Awards): Gabor Fichtinger
Treasurer: Stephen Aylward
Elections Officer: Rich Robb

Non-Executive Officers
Society Secretariat: Janette Wallace, Canada
Recording Secretary and Web Maintenance: Jackie Williams, Canada
Fellows Nomination Coordinator: Terry Peters, Canada

Student Board Members
President: Lena Filatova
Professional Student Events Officer: Danielle Pace
Public Relations Officer: Duygu Sarikaya
Social Events Officer: Mathias Unberath

Program Committee
Arbel, Tal – McGill University, Canada
Cardoso, Manuel Jorge – University College London, UK
Castellani, Umberto – University of Verona, Italy
Cattin, Philippe C. – University of Basel, Switzerland
Chung, Albert C.S. – Hong Kong University of Science and Technology, Hong Kong
Cukur, Tolga – Bilkent University, Turkey
Delingette, Herve – Inria, France
Feragen, Aasa – University of Copenhagen, Denmark
Freiman, Moti – Philips Healthcare, Israel
Glocker, Ben – Imperial College London, UK
Goksel, Orcun – ETH Zurich, Switzerland
Gonzalez Ballester, Miguel Angel – Universitat Pompeu Fabra, Spain
Grady, Leo – HeartFlow, USA
Greenspan, Hayit – Tel Aviv University, Israel
Howe, Robert – Harvard University, USA
Isgum, Ivana – University Medical Center Utrecht, The Netherlands
Jain, Ameet – Philips Research North America, USA
Jannin, Pierre – University of Rennes, France
Joshi, Sarang – University of Utah, USA
Kalpathy-Cramer, Jayashree – Harvard Medical School, USA
Kamen, Ali – Siemens Corporate Technology, USA
Knutsson, Hans – Linkoping University, Sweden
Konukoglu, Ender – Harvard Medical School, USA
Landman, Bennett – Vanderbilt University, USA
Langs, Georg – University of Vienna, Austria
Lee, Su-Lin – Imperial College London, UK
Liao, Hongen – Tsinghua University, China
Linguraru, Marius George – Children’s National Health System, USA
Liu, Huafeng – Zhejiang University, China
Lu, Le – National Institutes of Health, USA
Maier-Hein, Lena – German Cancer Research Center, Germany
Martel, Anne – University of Toronto, Canada
Masamune, Ken – The University of Tokyo, Japan
Menze, Bjoern – Technische Universität München, Germany
Modat, Marc – Imperial College, London, UK
Moradi, Mehdi – IBM Almaden Research Center, USA
Nielsen, Poul – The University of Auckland, New Zealand
Niethammer, Marc – UNC Chapel Hill, USA
O’Donnell, Lauren – Harvard Medical School, USA
Padoy, Nicolas – University of Strasbourg, France
Pohl, Kilian – SRI International, USA
Prince, Jerry – Johns Hopkins University, USA
Reyes, Mauricio – University of Bern, Bern, Switzerland
Sakuma, Ichiro – The University of Tokyo, Japan
Sato, Yoshinobu – Nara Institute of Science and Technology, Japan
Shen, Li – Indiana University School of Medicine, USA
Stoyanov, Danail – University College London, UK
Van Leemput, Koen – Technical University of Denmark, Denmark
Vrtovec, Tomaz – University of Ljubljana, Slovenia
Wassermann, Demian – Inria, France
Wein, Wolfgang – ImFusion GmbH, Germany
Yang, Guang-Zhong – Imperial College London, UK
Young, Alistair – The University of Auckland, New Zealand
Zheng, Guoyan – University of Bern, Switzerland

Reviewers Abbott, Jake Abolmaesumi, Purang Acosta-Tamayo, Oscar Adeli, Ehsan Afacan, Onur Aganj, Iman Ahmadi, Seyed-Ahmad Aichert, Andre Akhondi-Asl, Alireza Albarqouni, Shadi Alberola-López, Carlos Alberts, Esther

Alexander, Daniel Aljabar, Paul Allan, Maximilian Altmann, Andre Andras, Jakab Angelini, Elsa Antony, Bhavna Ashburner, John Auvray, Vincent Awate, Suyash P. Bagci, Ulas Bai, Wenjia

Bai, Ying Bao, Siqi Barbu, Adrian Batmanghelich, Kayhan Bauer, Stefan Bazin, Pierre-Louis Beier, Susann Bello, Fernando Ben Ayed, Ismail Bergeles, Christos Berger, Marie-Odile Bhalerao, Abhir

Bhatia, Kanwal Bieth, Marie Bilgic, Berkin Birkfellner, Wolfgang Bloch, Isabelle Bogunovic, Hrvoje Bouget, David Bouix, Sylvain Brady, Michael Bron, Esther Brost, Alexander Buerger, Christian Burgos, Ninon Cahill, Nathan Cai, Weidong Cao, Yu Carass, Aaron Cardoso, Manuel Jorge Carmichael, Owen Carneiro, Gustavo Caruyer, Emmanuel Cash, David Cerrolaza, Juan Cetin, Suheyla Cetingul, Hasan Ertan Chakravarty, M. Mallar Chatelain, Pierre Chen, Elvis C.S. Chen, Hanbo Chen, Hao Chen, Ting Cheng, Jian Cheng, Jun Cheplygina, Veronika Chowdhury, Ananda Christensen, Gary Chui, Chee Kong Côté, Marc-Alexandre Ciompi, Francesco Clancy, Neil T. Claridge, Ela Clarysse, Patrick Cobzas, Dana Comaniciu, Dorin Commowick, Olivier Compas, Colin

Conjeti, Sailesh Cootes, Tim Coupe, Pierrick Crum, William Dalca, Adrian Darkner, Sune Das Gupta, Mithun Dawant, Benoit de Bruijne, Marleen De Craene, Mathieu Degirmenci, Alperen Dehghan, Ehsan Demirci, Stefanie Depeursinge, Adrien Descoteaux, Maxime Despinoy, Fabien Dijkstra, Jouke Ding, Xiaowei Dojat, Michel Dong, Xiao Dorfer, Matthias Du, Xiaofei Duchateau, Nicolas Duchesne, Simon Duncan, James S. Ebrahimi, Mehran Ehrhardt, Jan Eklund, Anders El-Baz, Ayman Elliott, Colm Ellis, Randy Elson, Daniel El-Zehiry, Noha Erdt, Marius Essert, Caroline Fallavollita, Pascal Fang, Ruogu Fenster, Aaron Ferrante, Enzo Fick, Rutger Figl, Michael Fischer, Peter Fishbaugh, James Fletcher, P. Thomas Forestier, Germain Foroughi, Pezhman

Foroughi, Pezhman Forsberg, Daniel Franz, Alfred Freysinger, Wolfgang Fripp, Jurgen Frisch, Benjamin Fritscher, Karl Funka-Lea, Gareth Gabrani, Maria Gallardo Diez, Guillermo Alejandro Gangeh, Mehrdad Ganz, Melanie Gao, Fei Gao, Mingchen Gao, Yaozong Gao, Yue Garvin, Mona Gaser, Christian Gass, Tobias Georgescu, Bogdan Gerig, Guido Ghesu, Florin-Cristian Gholipour, Ali Ghosh, Aurobrata Giachetti, Andrea Giannarou, Stamatia Gibaud, Bernard Ginsburg, Shoshana Girard, Gabriel Giusti, Alessandro Golemati, Spyretta Golland, Polina Gong, Yuanhao Good, Benjamin Gooya, Ali Grisan, Enrico Gu, Xianfeng Gu, Xuan Gubern-Mérida, Albert Guetter, Christoph Guo, Peifang B. Guo, Yanrong Gur, Yaniv Gutman, Boris Hacihaliloglu, Ilker

Haidegger, Tamas Hamarneh, Ghassan Hammer, Peter Harada, Kanako Harrison, Adam Hata, Nobuhiko Hatt, Chuck Hawkes, David Haynor, David He, Huiguang He, Tiancheng Heckemann, Rolf Hefny, Mohamed Heinrich, Mattias Paul Heng, Pheng Ann Hennersperger, Christoph Herbertsson, Magnus Hütel, Michael Holden, Matthew Hong, Jaesung Hong, Yi Hontani, Hidekata Horise, Yuki Horiuchi, Tetsuya Hu, Yipeng Huang, Heng Huang, Junzhou Huang, Xiaolei Hughes, Michael Hutter, Jana Iakovidis, Dimitris Ibragimov, Bulat Iglesias, Juan Eugenio Iordachita, Iulian Irving, Benjamin Jafari-Khouzani, Kourosh Jain, Saurabh Janoos, Firdaus Jedynak, Bruno Jiang, Tianzi Jiang, Xi Jin, Yan Jog, Amod Jolly, Marie-Pierre Joshi, Anand Joshi, Shantanu

Kadkhodamohammadi, Abdolrahim Kadoury, Samuel Kainz, Bernhard Kakadiaris, Ioannis Kamnitsas, Konstantinos Kandemir, Melih Kapoor, Ankur Karahanoglu, F. Isik Karargyris, Alexandros Kasenburg, Niklas Katouzian, Amin Kelm, Michael Kerrien, Erwan Khallaghi, Siavash Khalvati, Farzad Köhler, Thomas Kikinis, Ron Kim, Boklye Kim, Hosung Kim, Minjeong Kim, Sungeun Kim, Sungmin King, Andrew Kisilev, Pavel Klein, Stefan Klinder, Tobias Kluckner, Stefan Konofagou, Elisa Kunz, Manuela Kurugol, Sila Kuwana, Kenta Kwon, Dongjin Ladikos, Alexander Lamecker, Hans Lang, Andrew Lapeer, Rudy Larrabide, Ignacio Larsen, Anders Boesen Lindbo Lauze, Francois Lea, Colin Lefèvre, Julien Lekadir, Karim Lelieveldt, Boudewijn Lemaitre, Guillaume

Lepore, Natasha Lesage, David Li, Gang Li, Jiang Li, Xiang Liang, Liang Lindner, Claudia Lioma, Christina Liu, Jiamin Liu, Mingxia Liu, Sidong Liu, Tianming Liu, Ting Lo, Benny Lombaert, Herve Lorenzi, Marco Loschak, Paul Loy Rodas, Nicolas Luo, Xiongbiao Lv, Jinglei Maddah, Mahnaz Mahapatra, Dwarikanath Maier, Andreas Maier, Oskar Maier-Hein (né Fritzsche), Klaus Hermann Mailhe, Boris Malandain, Gregoire Mansoor, Awais Marchesseau, Stephanie Marsland, Stephen Martí, Robert Martin-Fernandez, Marcos Masuda, Kohji Masutani, Yoshitaka Mateus, Diana Matsumiya, Kiyoshi Mazomenos, Evangelos McClelland, Jamie Mehrabian, Hatef Meier, Raphael Melano, Tim Melbourne, Andrew Mendelson, Alex F. Menegaz, Gloria Metaxas, Dimitris

Mewes, Philip Meyer, Chuck Miller, Karol Misra, Sarthak Misra, Vinith Mørup, Morten Moeskops, Pim Moghari, Mehdi Mohamed, Ashraf Mohareri, Omid Moore, John Moreno, Rodrigo Mori, Kensaku Mountney, Peter Mukhopadhyay, Anirban Müller, Henning Nakamura, Ryoichi Nambu, Kyojiro Nasiriavanaki, Mohammadreza Negahdar, Mohammadreza Nenning, Karl-Heinz Neumann, Dominik Neumuth, Thomas Ng, Bernard Ni, Dong Näppi, Janne Niazi, Muhammad Khalid Khan Ning, Lipeng Noble, Alison Noble, Jack Noblet, Vincent Nouranian, Saman Oda, Masahiro O’Donnell, Thomas Okada, Toshiyuki Oktay, Ozan Oliver, Arnau Onofrey, John Onogi, Shinya Orihuela-Espina, Felipe Otake, Yoshito Ou, Yangming Özarslan, Evren

Pace, Danielle Panayiotou, Maria Panse, Ashish Papa, Joao Papademetris, Xenios Papadopoulo, Theo Papież, Bartłomiej W. Parisot, Sarah Park, Sang hyun Paulsen, Rasmus Peng, Tingying Pennec, Xavier Peressutti, Devis Pernus, Franjo Peruzzo, Denis Peter, Loic Peterlik, Igor Petersen, Jens Petersen, Kersten Petitjean, Caroline Pham, Dzung Pheiffer, Thomas Piechnik, Stefan Pitiot, Alain Pizzolato, Marco Plenge, Esben Pluim, Josien Polimeni, Jonathan R. Poline, Jean-Baptiste Pont-Tuset, Jordi Popovic, Aleksandra Porras, Antonio R. Prasad, Gautam Prastawa, Marcel Pratt, Philip Preim, Bernhard Preston, Joseph Prevost, Raphael Pszczolkowski, Stefan Qazi, Arish A. Qi, Xin Qian, Zhen Qiu, Wu Quellec, Gwenole Raj, Ashish Rajpoot, Nasir

Randles, Amanda Rathi, Yogesh Reinertsen, Ingerid Reiter, Austin Rekik, Islem Reuter, Martin Riklin Raviv, Tammy Risser, Laurent Rit, Simon Rivaz, Hassan Robinson, Emma Rohling, Robert Rohr, Karl Ronneberger, Olaf Roth, Holger Rottman, Caleb Rousseau, François Roy, Snehashis Rueckert, Daniel Rueda Olarte, Andrea Ruijters, Daniel Salcudean, Tim Salvado, Olivier Sanabria, Sergio Saritas, Emine Sarry, Laurent Scherrer, Benoit Schirmer, Markus D. Schnabel, Julia A. Schultz, Thomas Schumann, Christian Schumann, Steffen Schwartz, Ernst Sechopoulos, Ioannis Seeboeck, Philipp Seiler, Christof Seitel, Alexander Sepasian, Neda Sermesant, Maxime Sethuraman, Shriram Shahzad, Rahil Shamir, Reuben R. Shi, Kuangyu Shi, Wenzhe Shi, Yonggang Shin, Hoo-Chang

Siddiqi, Kaleem Silva, Carlos Alberto Simpson, Amber Singh, Vikas Sivaswamy, Jayanthi Sjölund, Jens Skalski, Andrzej Slabaugh, Greg Smeets, Dirk Sommer, Stefan Sona, Diego Song, Gang Song, Qi Song, Yang Sotiras, Aristeidis Speidel, Stefanie Špiclin, Žiga Sporring, Jon Staib, Lawrence Stamm, Aymeric Staring, Marius Stauder, Ralf Stewart, James Studholme, Colin Styles, Iain Styner, Martin Sudre, Carole H. Suinesiaputra, Avan Suk, Heung-Il Summers, Ronald Sun, Shanhui Sundar, Hari Sushkov, Mikhail Suzuki, Takashi Szczepankiewicz, Filip Sznitman, Raphael Taha, Abdel Aziz Tahmasebi, Amir Talbot, Hugues Tam, Roger Tamaki, Toru Tamura, Manabu Tanaka, Yoshihiro Tang, Hui Tang, Xiaoying Tanner, Christine

Tasdizen, Tolga Taylor, Russell Thirion, Bertrand Tie, Yanmei Tiwari, Pallavi Toews, Matthew Tokuda, Junichi Tong, Tong Tournier, J. Donald Toussaint, Nicolas Tsaftaris, Sotirios Tustison, Nicholas Twinanda, Andru Putra Twining, Carole Uhl, Andreas Ukwatta, Eranga Umadevi Venkataraju, Kannan Unay, Devrim Urschler, Martin Vaillant, Régis van Assen, Hans van Ginneken, Bram van Tulder, Gijs van Walsum, Theo Vandini, Alessandro Vasileios, Vavourakis Vegas-Sanchez-Ferrero, Gonzalo Vemuri, Anant Suraj Venkataraman, Archana Vercauteren, Tom Veta, Mitko Vidal, Rene Villard, Pierre-Frederic Visentini-Scarzanella, Marco Viswanath, Satish Vitanovski, Dime Vogl, Wolf-Dieter von Berg, Jens Vrooman, Henri Wang, Defeng Wang, Hongzhi Wang, Junchen Wang, Li

Wang, Liansheng Wang, Linwei Wang, Qiu Wang, Song Wang, Yalin Warfield, Simon Weese, Jürgen Wegner, Ingmar Wei, Liu Wels, Michael Werner, Rene Westin, Carl-Fredrik Whitaker, Ross Wörz, Stefan Wiles, Andrew Wittek, Adam Wolf, Ivo Wolterink, Jelmer Maarten Wright, Graham Wu, Guorong Wu, Meng Wu, Xiaodong Xie, Saining Xie, Yuchen Xing, Fuyong Xu, Qiuping Xu, Yanwu Xu, Ziyue Yamashita, Hiromasa Yan, Jingwen Yan, Pingkun Yan, Zhennan Yang, Lin Yao, Jianhua Yap, Pew-Thian Yaqub, Mohammad Ye, Dong Hye Ye, Menglong Yin, Zhaozheng Yokota, Futoshi Zelmann, Rina Zeng, Wei Zhan, Yiqiang Zhang, Daoqiang Zhang, Fan

Zhang, Le Zhang, Ling Zhang, Miaomiao Zhang, Pei Zhang, Qing Zhang, Tianhao Zhang, Tuo

Zhang, Yong Zhen, Xiantong Zheng, Yefeng Zhijun, Zhang Zhou, Jinghao Zhou, Luping Zhou, S. Kevin

Zhu, Hongtu Zhu, Yuemin Zhuang, Xiahai Zollei, Lilla Zuluaga, Maria A.

Contents – Part II

Machine Learning and Feature Selection

Feature Selection Based on Iterative Canonical Correlation Analysis for Automatic Diagnosis of Parkinson’s Disease
  Luyan Liu, Qian Wang, Ehsan Adeli, Lichi Zhang, Han Zhang, and Dinggang Shen

Identifying Relationships in Functional and Structural Connectome Data Using a Hypergraph Learning Method
  Brent C. Munsell, Guorong Wu, Yue Gao, Nicholas Desisto, and Martin Styner

Ensemble Hierarchical High-Order Functional Connectivity Networks for MCI Classification
  Xiaobo Chen, Han Zhang, and Dinggang Shen

Outcome Prediction for Patient with High-Grade Gliomas from Brain Functional and Structural Networks
  Luyan Liu, Han Zhang, Islem Rekik, Xiaobo Chen, Qian Wang, and Dinggang Shen

Mammographic Mass Segmentation with Online Learned Shape and Appearance Priors
  Menglin Jiang, Shaoting Zhang, Yuanjie Zheng, and Dimitris N. Metaxas

Differential Dementia Diagnosis on Incomplete Data with Latent Trees
  Christian Ledig, Sebastian Kaltwang, Antti Tolonen, Juha Koikkalainen, Philip Scheltens, Frederik Barkhof, Hanneke Rhodius-Meester, Betty Tijms, Afina W. Lemstra, Wiesje van der Flier, Jyrki Lötjönen, and Daniel Rueckert

Bridging Computational Features Toward Multiple Semantic Features with Multi-task Regression: A Study of CT Pulmonary Nodules
  Sihong Chen, Dong Ni, Jing Qin, Baiying Lei, Tianfu Wang, and Jie-Zhi Cheng

Robust Cancer Treatment Outcome Prediction Dealing with Small-Sized and Imbalanced Data from FDG-PET Images
  Chunfeng Lian, Su Ruan, Thierry Denœux, Hua Li, and Pierre Vera

Structured Sparse Kernel Learning for Imaging Genetics Based Alzheimer’s Disease Diagnosis
  Jailin Peng, Le An, Xiaofeng Zhu, Yan Jin, and Dinggang Shen

Semi-supervised Hierarchical Multimodal Feature and Sample Selection for Alzheimer’s Disease Diagnosis
  Le An, Ehsan Adeli, Mingxia Liu, Jun Zhang, and Dinggang Shen

Stability-Weighted Matrix Completion of Incomplete Multi-modal Data for Disease Diagnosis
  Kim-Han Thung, Ehsan Adeli, Pew-Thian Yap, and Dinggang Shen

Employing Visual Analytics to Aid the Design of White Matter Hyperintensity Classifiers
  Renata Georgia Raidou, Hugo J. Kuijf, Neda Sepasian, Nicola Pezzotti, Willem H. Bouvy, Marcel Breeuwer, and Anna Vilanova

Deep Learning in Medical Imaging

The Automated Learning of Deep Features for Breast Mass Classification from Mammograms
  Neeraj Dhungel, Gustavo Carneiro, and Andrew P. Bradley

Multimodal Deep Learning for Cervical Dysplasia Diagnosis
  Tao Xu, Han Zhang, Xiaolei Huang, Shaoting Zhang, and Dimitris N. Metaxas

Learning from Experts: Developing Transferable Deep Features for Patient-Level Lung Cancer Prediction
  Wei Shen, Mu Zhou, Feng Yang, Di Dong, Caiyun Yang, Yali Zang, and Jie Tian

DeepVessel: Retinal Vessel Segmentation via Deep Learning and Conditional Random Field
  Huazhu Fu, Yanwu Xu, Stephen Lin, Damon Wing Kee Wong, and Jiang Liu

Deep Retinal Image Understanding
  Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Pablo Arbeláez, and Luc Van Gool

3D Deeply Supervised Network for Automatic Liver Segmentation from CT Volumes
  Qi Dou, Hao Chen, Yueming Jin, Lequan Yu, Jing Qin, and Pheng-Ann Heng

Deep Neural Networks for Fast Segmentation of 3D Medical Images
  Karl Fritscher, Patrik Raudaschl, Paolo Zaffino, Maria Francesca Spadea, Gregory C. Sharp, and Rainer Schubert

SpineNet: Automatically Pinpointing Classification Evidence in Spinal MRIs
  Amir Jamaludin, Timor Kadir, and Andrew Zisserman

A Deep Learning Approach for Semantic Segmentation in Histology Tissue Images
  Jiazhuo Wang, John D. MacKenzie, Rageshree Ramachandran, and Danny Z. Chen

Spatial Clockwork Recurrent Neural Network for Muscle Perimysium Segmentation
  Yuanpu Xie, Zizhao Zhang, Manish Sapkota, and Lin Yang

Automated Age Estimation from Hand MRI Volumes Using Deep Learning
  Darko Štern, Christian Payer, Vincent Lepetit, and Martin Urschler

Real-Time Standard Scan Plane Detection and Localisation in Fetal Ultrasound Using Fully Convolutional Neural Networks
  Christian F. Baumgartner, Konstantinos Kamnitsas, Jacqueline Matthew, Sandra Smith, Bernhard Kainz, and Daniel Rueckert

3D Deep Learning for Multi-modal Imaging-Guided Survival Time Prediction of Brain Tumor Patients
  Dong Nie, Han Zhang, Ehsan Adeli, Luyan Liu, and Dinggang Shen

Applications of Machine Learning

From Local to Global Random Regression Forests: Exploring Anatomical Landmark Localization
  Darko Štern, Thomas Ebner, and Martin Urschler

Regressing Heatmaps for Multiple Landmark Localization Using CNNs
  Christian Payer, Darko Štern, Horst Bischof, and Martin Urschler

Self-Transfer Learning for Weakly Supervised Lesion Localization
  Sangheum Hwang and Hyo-Eun Kim

Automatic Cystocele Severity Grading in Ultrasound by Spatio-Temporal Regression
  Dong Ni, Xing Ji, Yaozong Gao, Jie-Zhi Cheng, Huifang Wang, Jing Qin, Baiying Lei, Tianfu Wang, Guorong Wu, and Dinggang Shen

Graphical Modeling of Ultrasound Propagation in Tissue for Automatic Bone Segmentation
  Firat Ozdemir, Ece Ozkan, and Orcun Goksel

Bayesian Image Quality Transfer
  Ryutaro Tanno, Aurobrata Ghosh, Francesco Grussu, Enrico Kaden, Antonio Criminisi, and Daniel C. Alexander

Wavelet Appearance Pyramids for Landmark Detection and Pathology Classification: Application to Lumbar Spinal Stenosis
  Qiang Zhang, Abhir Bhalerao, Caron Parsons, Emma Helm, and Charles Hutchinson

A Learning-Free Approach to Whole Spine Vertebra Localization in MRI
  Marko Rak and Klaus-Dietz Tönnies

Automatic Quality Control for Population Imaging: A Generic Unsupervised Approach
  Mohsen Farzi, Jose M. Pozo, Eugene V. McCloskey, J. Mark Wilkinson, and Alejandro F. Frangi

A Cross-Modality Neural Network Transform for Semi-automatic Medical Image Annotation
  Mehdi Moradi, Yufan Guo, Yaniv Gur, Mohammadreza Negahdar, and Tanveer Syeda-Mahmood

Sub-category Classifiers for Multiple-instance Learning and Its Application to Retinal Nerve Fiber Layer Visibility Classification
  Siyamalan Manivannan, Caroline Cobb, Stephen Burgess, and Emanuele Trucco

Vision-Based Classification of Developmental Disorders Using Eye-Movements
  Guido Pusiol, Andre Esteva, Scott S. Hall, Michael Frank, Arnold Milstein, and Li Fei-Fei

Scalable Unsupervised Domain Adaptation for Electron Microscopy
  Róger Bermúdez-Chacón, Carlos Becker, Mathieu Salzmann, and Pascal Fua

Automated Diagnosis of Neural Foraminal Stenosis Using Synchronized Superpixels Representation
  Xiaoxu He, Yilong Yin, Manas Sharma, Gary Brahm, Ashley Mercado, and Shuo Li

Segmentation

Automated Segmentation of Knee MRI Using Hierarchical Classifiers and Just Enough Interaction Based Learning: Data from Osteoarthritis Initiative
Satyananda Kashyap, Ipek Oguz, Honghai Zhang, and Milan Sonka

Dynamically Balanced Online Random Forests for Interactive Scribble-Based Segmentation
Guotai Wang, Maria A. Zuluaga, Rosalind Pratt, Michael Aertsen, Tom Doel, Maria Klusmann, Anna L. David, Jan Deprest, Tom Vercauteren, and Sébastien Ourselin

Orientation-Sensitive Overlap Measures for the Validation of Medical Image Segmentations
Tasos Papastylianou, Erica Dall’Armellina, and Vicente Grau

High-Throughput Glomeruli Analysis of μCT Kidney Images Using Tree Priors and Scalable Sparse Computation
Carlos Correa Shokiche, Philipp Baumann, Ruslan Hlushchuk, Valentin Djonov, and Mauricio Reyes

A Surface Patch-Based Segmentation Method for Hippocampal Subfields
Benoit Caldairou, Boris C. Bernhardt, Jessie Kulaga-Yoskovitz, Hosung Kim, Neda Bernasconi, and Andrea Bernasconi

Automatic Lymph Node Cluster Segmentation Using Holistically-Nested Neural Networks and Structured Optimization in CT Images
Isabella Nogues, Le Lu, Xiaosong Wang, Holger Roth, Gedas Bertasius, Nathan Lay, Jianbo Shi, Yohannes Tsehay, and Ronald M. Summers

Evaluation-Oriented Training via Surrogate Metrics for Multiple Sclerosis Segmentation
Michel M. Santos, Paula R.B. Diniz, Abel G. Silva-Filho, and Wellington P. Santos

Corpus Callosum Segmentation in Brain MRIs via Robust Target-Localization and Joint Supervised Feature Extraction and Prediction
Lisa Y.W. Tang, Tom Brosch, XingTong Liu, Youngjin Yoo, Anthony Traboulsee, David Li, and Roger Tam

Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully Convolutional Neural Networks and 3D Conditional Random Fields
Patrick Ferdinand Christ, Mohamed Ezzeldin A. Elshaer, Florian Ettlinger, Sunil Tatavarty, Marc Bickel, Patrick Bilic, Markus Rempfler, Marco Armbruster, Felix Hofmann, Melvin D’Anastasi, Wieland H. Sommer, Seyed-Ahmad Ahmadi, and Bjoern H. Menze


3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation
Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger

Model-Based Segmentation of Vertebral Bodies from MR Images with 3D CNNs
Robert Korez, Boštjan Likar, Franjo Pernuš, and Tomaž Vrtovec

Pancreas Segmentation in MRI Using Graph-Based Decision Fusion on Convolutional Neural Networks
Jinzheng Cai, Le Lu, Zizhao Zhang, Fuyong Xing, Lin Yang, and Qian Yin

Spatial Aggregation of Holistically-Nested Networks for Automated Pancreas Segmentation
Holger R. Roth, Le Lu, Amal Farag, Andrew Sohn, and Ronald M. Summers

Topology Aware Fully Convolutional Networks for Histology Gland Segmentation
Aïcha BenTaieb and Ghassan Hamarneh

HeMIS: Hetero-Modal Image Segmentation
Mohammad Havaei, Nicolas Guizard, Nicolas Chapados, and Yoshua Bengio

Deep Learning for Multi-task Medical Image Segmentation in Multiple Modalities
Pim Moeskops, Jelmer M. Wolterink, Bas H.M. van der Velden, Kenneth G.A. Gilhuijs, Tim Leiner, Max A. Viergever, and Ivana Išgum

Iterative Multi-domain Regularized Deep Learning for Anatomical Structure Detection and Segmentation from Ultrasound Images
Hao Chen, Yefeng Zheng, Jin-Hyeong Park, Pheng-Ann Heng, and S. Kevin Zhou

Gland Instance Segmentation by Deep Multichannel Side Supervision
Yan Xu, Yang Li, Mingyuan Liu, Yipei Wang, Maode Lai, and Eric I-Chao Chang

Enhanced Probabilistic Label Fusion by Estimating Label Confidences Through Discriminative Learning
Oualid M. Benkarim, Gemma Piella, Miguel Angel González Ballester, and Gerard Sanroma


Feature Sensitive Label Fusion with Random Walker for Atlas-Based Image Segmentation
Siqi Bao and Albert C.S. Chung

Deep Fusion Net for Multi-atlas Segmentation: Application to Cardiac MR Images
Heran Yang, Jian Sun, Huibin Li, Lisheng Wang, and Zongben Xu

Prior-Based Coregistration and Cosegmentation
Mahsa Shakeri, Enzo Ferrante, Stavros Tsogkas, Sarah Lippé, Samuel Kadoury, Iasonas Kokkinos, and Nikos Paragios

Globally Optimal Label Fusion with Shape Priors
Ipek Oguz, Satyananda Kashyap, Hongzhi Wang, Paul Yushkevich, and Milan Sonka

Joint Segmentation and CT Synthesis for MRI-only Radiotherapy Treatment Planning
Ninon Burgos, Filipa Guerreiro, Jamie McClelland, Simeon Nill, David Dearnaley, Nandita deSouza, Uwe Oelfke, Antje-Christin Knopf, Sébastien Ourselin, and M. Jorge Cardoso

Regression Forest-Based Atlas Localization and Direction Specific Atlas Generation for Pancreas Segmentation
Masahiro Oda, Natsuki Shimizu, Ken’ichi Karasawa, Yukitaka Nimura, Takayuki Kitasaka, Kazunari Misawa, Michitaka Fujiwara, Daniel Rueckert, and Kensaku Mori

Accounting for the Confound of Meninges in Segmenting Entorhinal and Perirhinal Cortices in T1-Weighted MRI
Long Xie, Laura E.M. Wisse, Sandhitsu R. Das, Hongzhi Wang, David A. Wolk, Jose V. Manjón, and Paul A. Yushkevich

7T-Guided Learning Framework for Improving the Segmentation of 3T MR Images
Khosro Bahrami, Islem Rekik, Feng Shi, Yaozong Gao, and Dinggang Shen

Multivariate Mixture Model for Cardiac Segmentation from Multi-Sequence MRI
Xiahai Zhuang

Fast Fully Automatic Segmentation of the Human Placenta from Motion Corrupted MRI
Amir Alansary, Konstantinos Kamnitsas, Alice Davidson, Rostislav Khlebnikov, Martin Rajchl, Christina Malamateniou, Mary Rutherford, Joseph V. Hajnal, Ben Glocker, Daniel Rueckert, and Bernhard Kainz

Multi-organ Segmentation Using Vantage Point Forests and Binary Context Features
Mattias P. Heinrich and Maximilian Blendowski

Multiple Object Segmentation and Tracking by Bayes Risk Minimization
Tomáš Sixta and Boris Flach

Crowd-Algorithm Collaboration for Large-Scale Endoscopic Image Annotation with Confidence
L. Maier-Hein, T. Ross, J. Gröhl, B. Glocker, S. Bodenstedt, C. Stock, E. Heim, M. Götz, S. Wirkert, H. Kenngott, S. Speidel, and K. Maier-Hein

Emphysema Quantification on Cardiac CT Scans Using Hidden Markov Measure Field Model: The MESA Lung Study
Jie Yang, Elsa D. Angelini, Pallavi P. Balte, Eric A. Hoffman, Colin O. Wu, Bharath A. Venkatesh, R. Graham Barr, and Andrew F. Laine

Cell Image Analysis

Cutting Out the Middleman: Measuring Nuclear Area in Histopathology Slides Without Segmentation
Mitko Veta, Paul J. van Diest, and Josien P.W. Pluim

Subtype Cell Detection with an Accelerated Deep Convolution Neural Network
Sheng Wang, Jiawen Yao, Zheng Xu, and Junzhou Huang

Imaging Biomarker Discovery for Lung Cancer Survival Prediction
Jiawen Yao, Sheng Wang, Xinliang Zhu, and Junzhou Huang

3D Segmentation of Glial Cells Using Fully Convolutional Networks and k-Terminal Cut
Lin Yang, Yizhe Zhang, Ian H. Guldner, Siyuan Zhang, and Danny Z. Chen

Detection of Differentiated vs. Undifferentiated Colonies of iPS Cells Using Random Forests Modeled with the Multivariate Polya Distribution
Bisser Raytchev, Atsuki Masuda, Masatoshi Minakawa, Kojiro Tanaka, Takio Kurita, Toru Imamura, Masashi Suzuki, Toru Tamaki, and Kazufumi Kaneda

Detecting 10,000 Cells in One Second
Zheng Xu and Junzhou Huang

A Hierarchical Convolutional Neural Network for Mitosis Detection in Phase-Contrast Microscopy Images
Yunxiang Mao and Zhaozheng Yin

Author Index

Feature Selection Based on Iterative Canonical Correlation Analysis for Automatic Diagnosis of Parkinson’s Disease

Luyan Liu1, Qian Wang1, Ehsan Adeli2, Lichi Zhang1,2, Han Zhang2, and Dinggang Shen2(✉)

1 School of Biomedical Engineering, Med-X Research Institute, Shanghai Jiao Tong University, Shanghai, China
2 Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, USA
[email protected]

Abstract. Parkinson’s disease (PD) is a major progressive neurodegenerative disorder. Accurate diagnosis of PD is crucial to control the symptoms appropriately. However, its clinical diagnosis mostly relies on the subjective judgment of physicians and on clinical symptoms that often appear late. Recent neuroimaging techniques, along with machine learning methods, provide alternative solutions for PD screening. In this paper, we propose a novel feature selection technique, based on iterative canonical correlation analysis (ICCA), to investigate the roles of different brain regions in PD through T1-weighted MR images. First, gray matter and white matter tissue volumes in brain regions of interest are extracted as two feature vectors. Then, a small group of significant features is selected from both feature vectors using the iterative structure of our proposed ICCA framework. Finally, the selected features are used to build a robust classifier for automatic diagnosis of PD. Experimental results show that the proposed feature selection method results in better diagnosis accuracy, compared to the baseline and state-of-the-art methods.

1 Introduction

Parkinson’s disease (PD) is a major neurodegenerative disorder that threatens aged people and places a heavy burden on society. The clinical diagnosis of PD, however, is particularly prone to errors, because the diagnosis mostly relies on substantial symptoms of the patients [1]. Computer-aided techniques can utilize machine learning for the diagnosis of PD and also for identifying biomarkers of the disease from neuroimaging data. Several studies in the literature aim to distinguish PD from other similar diseases or from normal subjects. In [2], single-photon emission computed tomography (SPECT) images were analyzed automatically for PD diagnosis, while in [3], a novel speech signal-processing algorithm was proposed. Different clinical features (including response to levodopa, motor fluctuation, rigidity, dementia, speech, etc.) were evaluated in [4] for distinguishing multiple system atrophy (MSA) from PD. In [5], a novel synergetic paradigm integrating Kohonen self-organizing map (KSOM) was proposed to extract features for clinical diagnosis based on least squares support vector machine (LS-SVM).

© Springer International Publishing AG 2016 S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 1–8, 2016. DOI: 10.1007/978-3-319-46723-8_1


In this study, we propose reliable feature selection and classification models for PD diagnosis using T1-weighted MR images. Therefore, our method would be a non-invasive and reasonable solution to PD screening, which is especially important to developing countries with limited healthcare resources. Specifically, we extract a large number of features from T1 MR images, which describe the volumes of individual tissues, such as white matter (WM) and gray matter (GM), in the regions-of-interest (ROIs). The features can thus be naturally grouped into two vectors, corresponding to WM and GM. Afterwards, we introduce an iterative feature selection strategy based on canonical correlation analysis (CCA) to iteratively identify the optimal set of features. Then, the selected features are used for establishing the robust linear discriminant analysis (RLDA) model to classify PD patients from the normal control (NC) subjects.

Feature selection is an important dimensionality reduction technique and has been applied to solving various problems in translational medical studies. For example, sparse logistic regression was proposed to select features for predicting the conversion from mild cognitive impairment (MCI) to Alzheimer’s disease (AD) in [6]. The least absolute shrinkage and selection operator (LASSO) was used in [7] for feature selection. Similar works can also be found in [8, 9], where principal component analysis (PCA) and CCA were used, respectively.

CCA is able to explore the relationship between two high-dimensional vectors of features, and transform them from their intrinsic spaces to a common feature space [9]. In our study, the two feature vectors describe each subject under consideration from two views of different anatomical feature spaces, associated with WM and GM, respectively. The two feature vectors, thus, need to be transformed to a common space, where features can be compared and jointly selected for subsequent classification. Specifically, after linearly transforming the two views of features to the common space by CCA, we learn a regression model to fit the PD/NC labels based on the transformed feature representations. With the CCA-based transformation and the regression model, we are able to identify the most useful and relevant features for PD classification.

In addition, PD is not likely to affect all brain regions; rather, only a small number of ROIs are relevant for classification. Therefore, the obtained features could include many redundant and probably noisy ones, which may negatively affect the CCA mappings to a common space. In this sense, a single round of CCA-based feature selection, in which a large number of features are discarded at the same time, would probably yield a suboptimal outcome. Accordingly, we develop an iterative structure of CCA-based feature selection, or ICCA, in which we propose to discard features gradually, step by step. In this way, the two feature vectors gradually get a better common space and thus more relevant features can be selected.

Specifically, our ICCA method consists of multiple iterations for feature selection. In each iteration, we transform the features of the WM/GM views into a common space, build a regression model, inverse-transform the regression coefficients into the original space, and eliminate the most irrelevant features for PD classification. This iterative scheme allows us to gradually refine the estimation of the common feature space, by eliminating only the least important features. Finally, we utilize the representations in the common space, transformed from the selected features, to conduct PD classification. Note that, although the CCA-based transform is linear, our ICCA consists of an iterative procedure and thus provides fairly local linear operation capabilities to select features of different anatomical views. Experimental results show that the proposed method significantly improves the diagnosis accuracy and outperforms state-of-the-art feature selection methods, including sparse learning [7], PCA [8], CCA [10] and minimum redundancy-maximum relevance (mRMR) [11].

2 Method

Figure 1 illustrates the pipeline for PD classification in this paper. After extracting the WM and GM features from T1 images, we feed them into the ICCA-based feature selection framework. The WM/GM feature vectors are mapped to a common space using CCA, where the canonical representations of the features are computed. The regression model, based on the canonical representations, fits the PD/NC labels of individual subjects. The regression then leads to the weights assigned to the canonical representations, from which the importance of the WM/GM features can be computed. We then select the WM/GM features conservatively, by only eliminating the least important features. The remaining WM/GM features are transformed to the refined common space by CCA and selected repeatedly, until only a set of optimal features remains. The finally selected features are used to build a robust classifier for separating PD patients from the NC subjects.

Fig. 1. Pipeline of the ICCA-based feature selection and PD classification.

Feature Extraction: All T1-weighted MR images are pre-processed by skull stripping, cerebellum removal, and tissue segmentation into WM and GM. Then, the automated anatomical labeling (AAL) atlas with 90 pre-defined ROIs is registered to the native space of each subject, using FLIRT [12], followed by HAMMER [13]. For each ROI, we compute the WM/GM volumes as features. In this way, we extract 90 WM and 90 GM features for each subject. The features are naturally grouped into two vectors, which will be handled by the ICCA-based feature selection and the RLDA-based classification.
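The per-ROI volume computation described above can be sketched as follows. This is a minimal illustration, assuming that the registered atlas is available as an integer label array (0 for background, 1 to 90 for the ROIs) and that the tissue segmentation is a boolean mask on the same grid; the function name `roi_tissue_volumes` and its parameters are hypothetical and are not taken from the paper or from FLIRT/HAMMER.

```python
import numpy as np

def roi_tissue_volumes(atlas_labels, tissue_mask, voxel_volume_mm3=1.0, n_rois=90):
    """Return the tissue volume (mm^3) inside each atlas ROI.

    atlas_labels: integer array, 0 = background, 1..n_rois = ROI index
    tissue_mask:  boolean array of the same shape (True where the voxel
                  was segmented as the tissue of interest, e.g. WM or GM)
    """
    if atlas_labels.shape != tissue_mask.shape:
        raise ValueError("atlas and tissue mask must share a shape")
    # Count tissue voxels per label; index 0 (background) is dropped.
    counts = np.bincount(atlas_labels[tissue_mask].ravel(), minlength=n_rois + 1)
    return counts[1:n_rois + 1] * voxel_volume_mm3
```

Calling the function once with the WM mask and once with the GM mask would give the two 90-dimensional feature vectors of one subject.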


CCA-based Feature Selection: For $N$ subjects, we record their $d$-dimensional feature vectors as individual columns in $X_1 \in \mathbb{R}^{d \times N}$ and $X_2 \in \mathbb{R}^{d \times N}$, corresponding to WM and GM features, respectively. The class labels for the $N$ subjects are stored in $Y$, where each entry is either 1 or 0, indicating which class (PD or NC) each corresponding subject belongs to. Let $X = [X_1; X_2] \in \mathbb{R}^{2d \times N}$, and $\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$ be its covariance matrix. CCA can find the basis vectors $B_1$ and $B_2$ to maximize the correlation between $B_1^\top X_1$ and $B_2^\top X_2$. The two basis vectors $B_1$ and $B_2$ can be optimized by solving the following objective function:

$$\max_{B_1, B_2} \; \frac{B_1^\top \Sigma_{12} B_2}{\sqrt{B_1^\top \Sigma_{11} B_1}\,\sqrt{B_2^\top \Sigma_{22} B_2}} \qquad (1)$$

The optimal solution of $(B_1, B_2)$ is obtained by a generalized eigen-decomposition [9]. The canonical representations of all features in the common space can be computed by $Z_m = B_m^\top X_m$, $m = 1, 2$.

With the canonical representations, we build a sparse regression model. The regression aims to fit the class labels with the canonical representations by assigning various weights to the representations in the common space:

$$\min_{W} \; \|Y - W^\top Z\|_2^2 + \beta \|W\|_1 + \gamma \|W\|_C \qquad (2)$$

where $Z = [Z_1; Z_2]$ is the canonical representation matrix and $W = [W_1; W_2]$ is the regression coefficient matrix, which assigns weights to individual canonical representations; $\beta$ and $\gamma$ are trade-off parameters; $\|W\|_1$ denotes the $\ell_1$ norm, which tends to assign non-zero coefficients to only a few canonical representations; and $\|W\|_C$ denotes the canonical regularizer [10] (Eq. 3), which is defined in terms of $\{\lambda_i\}$, a set of canonical correlation coefficients, and $w_i^{(1)}$ and $w_i^{(2)}$, the weights corresponding to the same feature index across the two views (GM and WM). The canonical regularizer enforces the selection of highly correlated representations across the two feature views. In other words, representations with larger canonical correlation coefficients tend to be selected, while less correlated canonical representations across the two views (small canonical correlation coefficients) are not selected. Note that a greater $\lambda_i$ will lead to larger values in $w_i^{(1)}$ and $w_i^{(2)}$ after the optimization process.

The Proposed ICCA-based Feature Selection Method: The CCA-based feature selection might be limited, as all features are (globally) linearly transformed to the common space and then truncated in a one-shot fashion. Therefore, we propose a novel ICCA-based feature selection method, in which we discard the most irrelevant pair of features in each iteration, and re-evaluate the new set until the best set of features is selected. Altogether, this fairly simulates a local linear operation. In Eq. (2), we obtain the regression coefficient matrix $W$, containing the weights for the canonical representations in the tentatively estimated common space. Since the canonical representations in $Z$ are linear combinations of the WM/GM features in $X$ ($Z_m = B_m^\top X_m$), the weights in $W$ are also linearly associated with the importance of WM/GM features prior to the CCA-based mapping. Therefore, the importance of WM/GM features can be computed by $V_m = B_m W_m$, where $V_m$ records the importance of the $m$-th view of the original features. Given $V_1$ and $V_2$, we eliminate the least important WM and GM features, respectively. Then, CCA is applied to transform the remaining WM/GM features to an updated common space. This transforming-eliminating scheme is iteratively executed until the number of iterations exceeds a predefined threshold. In other words, the iterations are stopped when the classification performance in the subsequent steps does not increase anymore.

Robust Linear Discriminant Analysis (RLDA): In this study, we use the robust linear discriminant analysis (RLDA) [14] to classify PD from the normal subjects based on the selected features. Let $X \in \mathbb{R}^{d \times N}$ be the matrix containing the $N$ $d$-dimensional samples possibly corrupted by noise. We know that an amount of noise can be introduced in the data, as the data acquisition and preprocessing steps are not error-free. Even a small number of noisy features or outliers can affect a model significantly. In RLDA, noisy data is modeled as $X = D + E$, where $D$ is the underlying noise-free component and $E$ contains the outliers. RLDA learns the mapping $T$ from $D$ to fit the class labels in $Y$. RLDA decomposes $X$ into $D$ and $E$, and computes the mapping $T$ using the noise-free data $D$, which yields the following objective function:

$$\min_{T, D, E} \; \frac{\eta}{2} \left\| (Y Y^\top)^{-1/2} (Y - T D H) \right\|_F^2 + \|D\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = D + E \qquad (4)$$

The normalization factor $(Y Y^\top)^{-1/2}$ compensates for different numbers of samples per class. $H$ is a centering matrix and $T$ denotes the mapping, which is learned only from the centered noise-free data $DH$. Therefore, it avoids projecting the noise factor to the output space, and thus results in an unbiased estimation. After $T$ is learned in the training stage, a test sample $x_{\mathrm{test}}$ is projected by $T$ onto the output space, and the class label of the test data can then be determined by the k-Nearest Neighbor (k-NN) algorithm. The RLDA formulation can be solved using the augmented Lagrangian method, as detailed in [14].
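The transform, regress, back-project and eliminate loop of the ICCA scheme can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: CCA is computed by whitening each view and taking an SVD of the whitened cross-covariance (equivalent to the generalized eigen-decomposition), a plain ridge regression stands in for the sparse regression with the canonical regularizer, feature importance is taken as the magnitude of the back-projected weights, and the loop stops at a fixed number of retained features. The names `cca_bases`, `icca_select`, `reg`, `ridge` and `n_keep` are hypothetical.

```python
import numpy as np

def _inv_sqrt(S):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca_bases(X1, X2, reg=1e-6):
    """CCA for two views stored as (features x subjects) matrices.

    Returns bases B1, B2 and the canonical correlations. `reg` is a small
    ridge term (an assumption of this sketch) that keeps the covariances
    invertible.
    """
    X1 = X1 - X1.mean(axis=1, keepdims=True)
    X2 = X2 - X2.mean(axis=1, keepdims=True)
    n = X1.shape[1]
    S11 = X1 @ X1.T / (n - 1) + reg * np.eye(X1.shape[0])
    S22 = X2 @ X2.T / (n - 1) + reg * np.eye(X2.shape[0])
    S12 = X1 @ X2.T / (n - 1)
    # Whiten each view, then take the SVD of the whitened cross-covariance.
    W1, W2 = _inv_sqrt(S11), _inv_sqrt(S22)
    U, corr, Vt = np.linalg.svd(W1 @ S12 @ W2)
    return W1 @ U, W2 @ Vt.T, corr

def icca_select(X1, X2, y, n_keep, ridge=1e-2):
    """Iteratively drop the least important feature of each view.

    Returns the indices of the retained features in each view.
    """
    keep1 = np.arange(X1.shape[0])
    keep2 = np.arange(X2.shape[0])
    yc = y - y.mean()
    while keep1.size > n_keep and keep2.size > n_keep:
        A1 = X1[keep1] - X1[keep1].mean(axis=1, keepdims=True)
        A2 = X2[keep2] - X2[keep2].mean(axis=1, keepdims=True)
        B1, B2, _ = cca_bases(A1, A2)
        Z = np.vstack([B1.T @ A1, B2.T @ A2])  # canonical representations
        # Ridge regression: a stand-in for the sparse regression model.
        W = np.linalg.solve(Z @ Z.T + ridge * np.eye(Z.shape[0]), Z @ yc)
        k1 = keep1.size
        # Back-project the weights to the original feature space (V_m = B_m W_m).
        v1 = np.abs(B1 @ W[:k1])
        v2 = np.abs(B2 @ W[k1:])
        keep1 = np.delete(keep1, np.argmin(v1))
        keep2 = np.delete(keep2, np.argmin(v2))
    return keep1, keep2
```

With a plain ridge penalty the back-projected weights are close to ordinary regression coefficients on the original features, so in this simplified form the common space matters little; the sparsity and canonical penalties that the sketch omits are what tie the selection to the canonical correlations.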

3 Experimental Results

The data used in this paper were obtained from the Parkinson’s Progression Markers Initiative (PPMI) database. In this paper, we use 112 subjects (56 PD and 56 NC), each with a T1-weighted MR scan acquired on a 3T Siemens MAGNETOM TrioTim syngo scanner with the following parameters: acquisition matrix = 256 × 256 mm², 176 slices, voxel size = 1 × 1 × 1 mm³, echo time (TE) = 2.98 ms, repetition time (TR) = 2300 ms, inversion time (TI) = 900 ms, and flip angle = 9°.

In order to evaluate the proposed and the baseline methods, we used an 8-fold cross-validation. For each of the 8 folds, feature selection and classification models were established using the other 7 folds as the training subset, and the diagnostic ability was evaluated with the unused 8th testing fold. We compared our ICCA-based feature selection with four popular feature selection or feature reduction methods, including PCA, sparse feature selection (SFS), one-shot CCA and minimum redundancy-maximum relevance (mRMR). No feature selection (NoFS) was also regarded as a baseline method.

Table 1 shows the performances of all the methods. As can be seen, the proposed ICCA-based feature selection method achieves the best performance. In particular, our method improves by 11.5 % on ACC and 16 % on AUC, respectively, compared to the baseline (NoFS). Moreover, compared to feature selection through PCA, SFS, CCA and mRMR, our method achieves 5.3 %, 5.4 %, 4.4 % and 3.8 % accuracy improvements, respectively.

Table 1. PD/NC classification comparison (ACC: Accuracy; AUC: Area Under ROC Curve).

Method                                          ACC (%)   AUC (%)
NoFS                                            58.0      55.1
SFS                                             65.1      64.5
PCA                                             65.2      59.8
CCA                                             66.1      64.4
mRMR                                            66.7      65.6
Proposed (Select in canonical feature space)    68.8      69.3
Proposed (Select in WM/GM feature space)        70.5      71.1
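The evaluation protocol above, in which both feature selection and the classifier are fitted on seven folds and scored on the held-out fold, can be sketched as follows. This is an illustrative sketch only: the stratified fold assignment, the `fit`/`predict` callback interface and the names `stratified_folds` and `cross_validate` are assumptions, and the data matrix here is laid out as subjects × features.

```python
import numpy as np

def stratified_folds(y, n_folds=8, seed=0):
    """Assign each subject to a fold so that classes are balanced per fold."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_of[idx] = np.arange(idx.size) % n_folds
    return fold_of

def cross_validate(X, y, fit, predict, n_folds=8, seed=0):
    """Return the pooled accuracy over all held-out folds.

    X: (subjects x features); fit(X_train, y_train) returns a model that
    bundles feature selection and classifier; predict(model, X_test)
    returns predicted labels.
    """
    fold_of = stratified_folds(y, n_folds, seed)
    correct = 0
    for k in range(n_folds):
        test = fold_of == k
        # Selection and training only ever see the other folds.
        model = fit(X[~test], y[~test])
        correct += int(np.sum(predict(model, X[test]) == y[test]))
    return correct / len(y)
```

With 56 PD and 56 NC subjects, each of the 8 folds then holds 7 subjects per class. Keeping the feature selection inside `fit` matters: selecting features on all subjects before splitting would leak test information into the reported accuracy.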

In order to further analyze the proposed ICCA-based feature selection, we also conduct another experiment, in which we select canonical representations rather than the original WM/GM features. That is, in every iteration of ICCA, we directly eliminate the least important canonical representations, instead of using the inverse-transform and elimination steps in the original WM/GM features. The remaining canonical representations are used for re-estimating the new common space by CCA and are further selected. The last two rows in Table 1 show the results of this experiment: the ICCA-based feature selection in the canonical space performs no better than the selection in the original WM/GM feature space. These results indicate that canonical representations are not better than the original features (after proper selection) for distinguishing PD from NC. This can be due to the fact that the CCA mapping to the common space is unsupervised, and, after the two feature vectors are mapped, they are highly correlated. As a result, there is much redundant information in the canonical representations, which could mislead the feature selection and the classification. Inverse-transforming the representations and going back to the original feature space (using our proposed ICCA framework) avoids this shortcoming.


To identify the biomarkers of PD, we further inspect the selected features, which correspond to specific WM/GM areas. Since the features selected in each cross-validation fold may be different, we define the most discriminative regions as those whose features were selected in at least 60 % of the cross-validation folds. The most discriminative regions, as shown in Fig. 2, include ‘precuneus’, ‘thalamus’, ‘hippocampus’, ‘temporal pole’, ‘postcentral gyrus’, ‘middle frontal gyrus’, and ‘medial frontal gyrus’. GM and WM features extracted from these regions are found to be closely associated with PD pathology, which is in line with previous clinical research [15, 16].

Fig. 2. The most discriminative ROIs for automatic diagnosis of PD.
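The 60 % selection-frequency criterion used to define the most discriminative regions can be sketched as follows. The function name `stable_features` and its arguments are hypothetical; the sketch assumes that each cross-validation fold reports the indices of the features it retained.

```python
import numpy as np

def stable_features(selected_per_fold, n_features, min_fraction=0.6):
    """Return indices selected in at least `min_fraction` of the folds,
    together with the per-feature selection frequency."""
    counts = np.zeros(n_features, dtype=int)
    for selected in selected_per_fold:
        # Each fold contributes at most one count per feature.
        counts[np.asarray(selected, dtype=int)] += 1
    frequency = counts / len(selected_per_fold)
    return np.flatnonzero(frequency >= min_fraction), frequency
```

With 8 folds, a feature has to be retained in at least 5 folds to pass the 60 % threshold.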

4 Conclusion

In this paper, a novel feature selection technique was proposed to help distinguish individuals with PD from NC. The proposed ICCA-based feature selection framework can achieve a fairly local linear mapping capability. Moreover, it can dig deeper into the underlying structure of the feature space and explore the relationship among features. By increasing the depth of learning in the ICCA framework, the two views of the selected features become closer and closer when mapped to their CCA common space. This also decreases the number of the selected features. The results show that the proposed ICCA feature selection framework outperforms conventional feature selection methods, and can improve the diagnosis ability based on T1 MR images.

Note that, in the proposed framework of ICCA-based feature selection, we discard a pair of features from GM and WM at each iteration. In order to avoid dropping possibly important features, we drop the WM/GM features conservatively in each iteration. Dropping the features in a smarter way could optimize the whole feature selection framework. This can be pursued in future work. Furthermore, the current ICCA-based feature selection can only explore the relationship between two views of the features. We will investigate the possibility of handling more views of features simultaneously, which can effectively enhance the feasibility of the proposed method.

Besides further optimizing the efficiency and performance of the ICCA framework, future work includes improving the classification method for PD diagnosis. RLDA is a linear classifier, which cannot model the nonlinear relationship between features and labels. Therefore, some nonlinear classifiers can probably perform at least as well as the linear classifier. In this study, we only used structural information in T1 MR images; we will explore the integration of other imaging modalities such as diffusion tensor imaging (DTI) and functional MRI in the future to further improve the classification performance based on the proposed framework.


Acknowledgement. This work is supported by National Natural Science Foundation of China (NSFC) Grants (Nos. 61473190, 61401271, 81471733).

References

1. Calne, D.B., Snow, B.J.: Criteria for diagnosing Parkinson’s disease. Ann. Neurol. 32, 125–127 (1992)
2. Goebel, G., Seppi, K., et al.: A novel computer-assisted image analysis of [123I] β-CIT SPECT images improves the diagnostic accuracy of parkinsonian disorders. Eur. J. Nucl. Med. Mol. Imaging 38, 702–710 (2011)
3. Tsanas, A., Little, M.A., et al.: Novel speech signal processing algorithms for high-accuracy classification of Parkinson’s disease. IEEE TBME 59, 1264–1271 (2012)
4. Wenning, G.K., et al.: What clinical features are most useful to distinguish definite multiple system atrophy from Parkinson’s disease. J. Neurol. Neurosurg. Psychiatry 68, 434–440 (2000)
5. Singh, G., Samavedham, L.: Unsupervised learning based feature extraction for differential diagnosis of neurodegenerative diseases: a case study on early-stage diagnosis of Parkinson disease. J. Neurosci. Methods 256, 30–40 (2015)
6. Ye, J., et al.: Sparse learning and stability selection for predicting MCI to AD conversion using baseline ADNI data. BMC Neurol. 12, 1 (2012)
7. Ye, J., Liu, J.: Sparse methods for biomedical data. ACM SIGKDD Explor. Newsl. 14, 4–15 (2012)
8. Lu, Y., et al.: Feature selection using principal feature analysis. In: ACM-MM (2007)
9. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16, 2639–2664 (2004)
10. Zhu, X., Suk, H.-I., Shen, D.: Multi-modality canonical feature selection for Alzheimer’s disease diagnosis. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014, Part II. LNCS, vol. 8674, pp. 162–169. Springer, Heidelberg (2014)
11. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005)
12. Smith, S.M., et al.: Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage 23, 208–219 (2004)
13. Shen, D., Davatzikos, C.: HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE TMI 21, 1421–1439 (2002)
14. Huang, D., Cabral, R.S., De la Torre, F.: Robust regression. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part IV. LNCS, vol. 7575, pp. 616–630. Springer, Heidelberg (2012)
15. Hanakawa, T., Katsumi, Y., et al.: Mechanisms underlying gait disturbance in Parkinson’s disease. Brain 122, 1271–1282 (1999)
16. Burton, E.J., McKeith, I.G., et al.: Cerebral atrophy in Parkinson’s disease with and without dementia: a comparison with Alzheimer’s disease, dementia with Lewy bodies and controls. Brain 127, 791–800 (2004)

Identifying Relationships in Functional and Structural Connectome Data Using a Hypergraph Learning Method

Brent C. Munsell1(B), Guorong Wu2, Yue Gao3, Nicholas Desisto1, and Martin Styner4

1 Department of Computer Science, College of Charleston, Charleston, SC, USA
[email protected]
2 Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
3 School of Software, Tsinghua University, Beijing, China
4 Department of Psychiatry, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Abstract. The brain connectome provides an unprecedented degree of information about the organization of neuronal network architecture, both at a regional level and for the entire brain network. Over the last several years the neuroimaging community has made tremendous advancements in the analysis of structural connectomes derived from white matter fiber tractography or functional connectomes derived from time-series blood oxygen level signals. However, computational techniques that combine structural and functional connectome data to discover complex relationships between fiber density and signal synchronization, including the relationship with health and disease, have not been consistently applied. To overcome this shortcoming, a novel connectome feature selection technique is proposed that uses hypergraphs to identify connectivity relationships when structural and functional connectome data are combined. Using publicly available connectome data from the UMCD database, experiments are provided that show SVM classifiers trained with structural and functional connectome features selected by our method are able to correctly identify autism subjects with 88 % accuracy. These results suggest our combined connectome feature selection approach may improve outcome forecasting in the context of autism.

1 Introduction

Improvements in computational analyses of neuroimaging data now permit the assessment of whole-brain maps of connectivity, commonly referred to as the brain connectome [7]. The brain connectome provides unprecedented information about global and regional conformations of neuronal network architecture (or network architecture for short) that is particularly relevant as it relates to neurological disorders. For this reason, the brain connectome has recently become instrumental in the investigation of network architecture organization and its relationship with health and disease, notably in the context of neurological conditions such as epilepsy, autism, Alzheimer's, and Parkinson's.

In general, two connectome categories exist: (1) a structural connectome that is reconstructed using white matter fiber tractography from diffusion tensor imaging (DTI), and (2) a functional connectome that is reconstructed using resting-state time-series signal data from blood oxygen level dependent (BOLD) functional MRI (rsfMRI). In mathematical terms, a connectome is a weighted undirected graph, where nodes in the graph represent brain regions (defined in an anatomical parcellation, or brain atlas), and the edge that connects two different nodes is weighted by a value that represents the level of neural connectivity, or information exchange. To better understand how the brain network is organized, network analysis algorithms [4] are applied to the connectome to reveal the underlying network architecture of the brain, which can then be used to quantify the differences between healthy and disease conditions.

Currently, network analysis techniques have mainly been applied to either structural or functional connectivity data alone. However, research that combines both types of data [1,5,6,8] to better understand functional and structural connectivity relationships has gained attention in recent years. Here a novel combined connectome feature selection technique is proposed that uses hypergraphs to discover latent relationships in node-based graph-theoretic measures found in structural and functional connectomes. The primary rationale behind selecting features where structural and functional connectivity agree is that fiber density and signal synchronization similarities are likely to be correlated, and when combined these similarities may be easier to identify and quantify.

More specifically, for each diagnosis label (i.e. disease and healthy) the proposed feature selection technique uses a hypergraph learning algorithm to find a hypergraph Laplacian that combines structural and functional node-based connectivity measures. A hierarchical partitioning algorithm is then applied to the hypergraph Laplacian, which in turn creates a code vector that encodes structural and functional connectivity similarities. The resulting code vectors are then used to create a binary weight vector that selects only brain regions associated with structural or functional node-based connectivity measures capable of differentiating the disease condition from the healthy one. Lastly, the selected structural and functional connectome features are used to train an SVM classifier that can predict the diagnosis label of subjects not included in the training procedure.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 9–17, 2016. DOI: 10.1007/978-3-319-46723-8_2

2 Materials and Methods

2.1 Participants and MRI Data Acquisition

All participant data were acquired from the publicly available University of Southern California (USC)/University of California Los Angeles (UCLA) multimodal connectivity database¹ (UMCD). In particular, high-functioning children and adolescents with an autism spectrum disorder (ASD), and healthy control (HC) children and adolescents were recruited. In total, the autism study has 70 participants (35 ASD and 35 HC) who had both rsfMRI and DTI scan data. A complete list of all the demographic data, including the scan parameters, from the original study can be found in [5].

¹ http://umcd.humanconnectomeproject.org.

2.2 Preprocessing and Connectome Reconstruction

Functional preprocessing steps were performed using the FSL² and AFNI³ software libraries. In general, the following steps were performed: skull stripping, slice timing correction, motion correction with rigid-body alignment using MCFLIRT, and geometric distortion correction using FUGUE. Structural preprocessing steps were performed using the FSL and Diffusion Toolkit⁴ software libraries. In general, the following steps were performed: skull stripping, eddy current correction, motion correction with rigid-body alignment using MCFLIRT, voxel-wise fractional anisotropy (FA) estimation, and fiber track assignment using the FACT algorithm. A complete overview of all the preprocessing steps can be found in [5].

The FSL FEAT query function is then applied to the functional and structural preprocessed images. In particular, the atlas proposed in Power et al. [2] defines m = 264 ROIs, each represented by a 10 mm diameter sphere. A symmetric m × m functional connectivity matrix $C_f$ is constructed using the extracted ROIs, where each element reflects the signal synchronization between two ROIs, estimated by computing the correlation between their two discrete time-series rsfMRI signals. Likewise, a symmetric m × m structural connectivity matrix $C_s$ is constructed using the same ROIs, where each element reflects the average number of fiber tracks, or fiber density, that connect the two ROIs.

2.3 Node-Based Connectome Feature Vector

The next step is to convert the values defined in C into a node-based connectome feature vector $c_\alpha = (c_{\alpha 1}, \ldots, c_{\alpha m})$ using the betweenness centrality graph-theoretic connectivity measure, where α = s represents a node-based structural connectome feature vector, and α = f represents a node-based functional connectome feature vector. Betweenness centrality is a global measure that represents the fraction of shortest paths that go through a particular node (or brain region) defined in the connectome. The betweenness centrality measure for node i is

$$c_{\alpha i} = \frac{1}{(m-1)(m-2)} \sum_{h, j \in m} \frac{\rho^{i}_{hj}}{\rho_{hj}} \tag{1}$$

where $h \neq j$, $h \neq i$, $j \neq i$. The number of shortest paths between nodes h and j is represented by $\rho_{hj}$, and the number of these shortest paths going through node i is represented by $\rho^{i}_{hj}$. The sum is normalized to a value in [0, 1], where (m − 1)(m − 2) is the highest score attainable in the network.

² http://www.fmrib.ox.ac.uk/fsl.
³ https://afni.nimh.nih.gov/afni/.
⁴ http://trackvis.org/dtk/.

2.4 Combined Connectome Feature Selection

Given a training data set $A = \{a^\phi\}_{\phi=1}^{n}$ of n ASD subjects, we compute a set of graph Laplacians $\{L_\phi\}_{\phi=1}^{n}$, where $a^\phi = (c^\phi_s \mid c^\phi_f)$ is a 2m-dimensional feature vector that combines structural and functional node-based connectome values for subject φ. To do so, we first create a complete bipartite graph $G_\phi = (c^\phi_s, c^\phi_f, E_\phi)$ for each subject, where the edge that connects structural node i to functional node j in the bipartite graph is weighted by $w_{ij} = 1 - |c_{si} - c_{fj}|$. The proposed edge weight strategy has a straightforward and intuitive meaning: if two brain regions have similar connectivity values then $w_{ij} \approx 1$; conversely, if two brain regions do not have similar connectivity values then $w_{ij} \approx 0$.
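The two ingredients of this step, the betweenness centrality of Eq. (1) and the edge weight $w_{ij}$, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an unweighted toy graph with at least three nodes (the text does not state how connectome weights are converted to path lengths), and the function names `betweenness` and `bipartite_weights` are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Normalized betweenness centrality (Eq. 1) of an unweighted, undirected graph.

    adj maps each node to the set of its neighbours (assumption: >= 3 nodes).
    Shortest paths are counted by breadth-first search and accumulated with
    Brandes' scheme, which sums over ordered pairs (h, j).
    """
    nodes = list(adj)
    m = len(nodes)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        sigma = {v: 0 for v in nodes}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}  # predecessors on shortest paths
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    norm = (m - 1) * (m - 2)            # number of ordered pairs excluding node i
    return {v: score[v] / norm for v in nodes}


def bipartite_weights(c_s, c_f):
    """w[i][j] = 1 - |c_s[i] - c_f[j]| for structural node i and functional node j."""
    return [[1.0 - abs(cs - cf) for cf in c_f] for cs in c_s]


# Toy three-region example: a path graph (structural) and a triangle (functional).
structural = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
functional = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
c_s = list(betweenness(structural).values())
c_f = list(betweenness(functional).values())
w = bipartite_weights(c_s, c_f)
```

In this toy case only the middle node of the path graph lies on a shortest path, so it is the only structural node whose centrality differs from the functional values, and its row of weights drops to zero.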

Fig. 1. Hierarchical partition approach. Each partition level in the hierarchy has a unique integer code value that represents structural and functional connectivity similarities between brain regions.

A $2m \times m^2$ hypergraph incidence matrix $H_\phi$ for subject φ is then created using $G_\phi$. Because we use a bipartite graph, each hyperedge represents only the structural-functional relationship between two node-based connectome features. Once $H_\phi$ is found, the normalized hypergraph Laplacian⁵

$$L_\phi = I - D_v^{-1/2} H_\phi D_e^{-1} H_\phi^{T} D_v^{-1/2} \tag{2}$$

is computed [11], where $D_v$ is a diagonal matrix that defines the strength of each vertex in $H_\phi$, $D_e$ is a diagonal matrix that defines the strength of each edge in $H_\phi$, and I is the identity matrix. In general, our design has two advantages: (1) we identify functional and structural connectivity relationships only between two different regions in the brain, and (2) the resulting hypergraph Laplacian is very sparse.

⁵ In our approach each hyperedge has the same influence; therefore W is the identity matrix and is omitted in Eq. (2).

A median hypergraph Laplacian $L_m$ is then found using each subject-specific hypergraph Laplacian in $\{L_\phi\}_{\phi=1}^{n}$, where $L_m(i, j) = \mathrm{median}(\{L_1(i, j), L_2(i, j), \ldots, L_n(i, j)\})$. Eigendecomposition is applied to $L_m$, creating a 2m-dimensional embedding space, and then a hierarchical partition is performed as illustrated in Fig. 1. More specifically, each embedding space partition in the hierarchy defines three cluster groups: (1) clusters that only have DTI brain regions, (2) clusters that only have rsfMRI brain regions, and (3) clusters that have both DTI and rsfMRI brain regions. At each partition the three cluster groups are found using the well-known normalized spectral clustering technique in [10]. However, instead of using a k-means algorithm, the density estimation algorithm in [3] is applied, primarily because the number of clusters is found automatically and outliers can be automatically recognized and excluded. As shown in Fig. 1, the DTI and rsfMRI brain region cluster becomes the search space for the next partition in the hierarchy, and the procedure terminates when a DTI and rsfMRI brain region cluster does not exist.

In our approach, each partition level in the hierarchy represents a unique integer code. Partitions at the top of the hierarchy represent brain regions that show low structural and functional connectivity similarities (i.e. low code value), and partitions near the bottom of the hierarchy represent brain regions that show high structural and functional connectivity similarities (i.e. large code value). Lastly, a code vector $x_{ad} = (x_{s1}, x_{s2}, \ldots, x_{sm}, x_{f1}, x_{f2}, \ldots, x_{fm})$ is created using the code values in the partition hierarchy, and then normalized by dividing all the code values by the height of the partition hierarchy.
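Equation (2) and the element-wise median Laplacian can be sketched in a few lines of plain Python. This is an illustrative sketch only: the text does not spell out how the entries of the incidence matrix are filled, so the code assumes that both vertices of the hyperedge joining structural node i and functional node j receive the value $w_{ij}$; the names `hypergraph_laplacian` and `median_laplacian` are hypothetical, and the spectral clustering and partition steps are not shown.

```python
from statistics import median


def hypergraph_laplacian(w):
    """Normalized hypergraph Laplacian (Eq. 2) for one subject.

    w is the m x m matrix of bipartite edge weights. Vertices 0..m-1 are
    structural nodes, vertices m..2m-1 are functional nodes, and hyperedge
    e = i*m + j joins structural node i to functional node j.
    ASSUMPTION: both incidence entries of hyperedge e are set to w[i][j].
    """
    m = len(w)
    n_v, n_e = 2 * m, m * m
    H = [[0.0] * n_e for _ in range(n_v)]
    for i in range(m):
        for j in range(m):
            e = i * m + j
            H[i][e] = w[i][j]
            H[m + j][e] = w[i][j]
    d_v = [sum(row) for row in H]                                   # vertex strengths
    d_e = [sum(H[v][e] for v in range(n_v)) for e in range(n_e)]    # edge strengths
    L = [[0.0] * n_v for _ in range(n_v)]
    for u in range(n_v):
        for v in range(n_v):
            s = 0.0
            if d_v[u] > 0 and d_v[v] > 0:
                s = sum(H[u][e] * H[v][e] / d_e[e]
                        for e in range(n_e) if d_e[e] > 0)
                s /= (d_v[u] * d_v[v]) ** 0.5
            L[u][v] = (1.0 if u == v else 0.0) - s
    return L


def median_laplacian(laplacians):
    """Element-wise median over the subject-specific Laplacians."""
    n = len(laplacians[0])
    return [[median(L[i][j] for L in laplacians) for j in range(n)]
            for i in range(n)]


# Toy example with m = 2 regions and three subjects.
L_a = hypergraph_laplacian([[1.0, 1.0], [1.0, 1.0]])
L_b = hypergraph_laplacian([[1.0, 0.5], [0.5, 1.0]])
L_m = median_laplacian([L_a, L_a, L_b])
```

With m = 264 regions the dense loops above would be far too slow; a sparse matrix library would be the natural choice in practice, which is consistent with the sparsity argument made in the text.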
The same procedure is then applied to a training set of HC subjects, and an HC code vector $x_{hc}$ is produced. Next, a weight vector

$$w = |x_{ad} - x_{hc}| \tag{3}$$

is created, where a weight value close to one represents structural or functional brain regions that have dramatically different code values, which suggests these regions may better differentiate the disorder from the normal condition. On the other hand, a weight value close to zero represents structural or functional brain regions that have the same (or very similar) code values, which suggests these regions may not be able to differentiate the disorder from the normal condition. Lastly, we make w binary by applying a threshold $t_h$, i.e. $w_i = 1$ if $w_i \ge t_h$ and $w_i = 0$ otherwise. The primary motivation behind making the weight vector binary was to reduce the number of dimensions, which in turn reduces the amount of error that may be introduced into the chosen classifier.

2.5 Linear SVM Classifier

Using a training data set $A = \{a^\phi\}_{\phi=1}^{n}$ that now includes both ASD and HC subjects, the binary diagnosis labels $y = (y_1, y_2, \ldots, y_n)$, e.g. ASD = 1 and HC = 0, and the binary weight vector w, a linear two-class SVM classifier based on the LIBSVM library⁶ is trained. In particular, the binary values in w are applied to each feature vector in A, creating a new sparse training data matrix $\tilde{A}$. Finally, an SVM classifier is trained using $\tilde{A}$. Once the SVM classifier is trained, the diagnosis label of a subject not included in the training data set can be predicted as follows: first compute $a = (c_s \mid c_f)$, then create the sparse feature vector $\tilde{a} = (a_1 w_1, a_2 w_2, \ldots, a_{2m} w_{2m})$ by applying the learned binary weights, and lastly calculate the predicted diagnosis label y using the trained SVM classifier, where the sign of y (non-negative or negative) determines the diagnosis label.

Since the proposed combined connectome feature selection has two free parameters, i.e. the number of eigenvalues (or dimensions) used by the clustering algorithm (d) and the binary weight threshold ($t_h$), a grid search procedure is performed that uses a 10-fold cross-validation strategy. Specifically, an independent two-dimensional grid search is performed for each left-out fold, where the values stored at grid coordinate $(d, t_h)$ are the mean and standard deviation of the accuracy (ACC), sensitivity (SEN), specificity (SPC), negative predictive value (NPV), and positive predictive value (PPV) measures. In particular, d is adjusted in increments of 1 starting at 1 and ending at 2m, and $t_h$ is adjusted in increments of 0.05 starting at 0.1 and ending at 1.0. Lastly, when the grid search completes, the parameter values that have the highest ACC and PPV scores are selected.
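The weight vector of Eq. (3), its thresholding, the masking of a feature vector, and the parameter grid described above can be sketched as follows. This is a minimal illustration with made-up code vectors; the function names are hypothetical, and the SVM training itself (done with LIBSVM in the text) is not shown.

```python
def binary_weights(x_ad, x_hc, th):
    """Eq. (3) followed by thresholding: 1 where |x_ad - x_hc| >= th, else 0."""
    return [1 if abs(a - h) >= th else 0 for a, h in zip(x_ad, x_hc)]


def mask_features(a, w):
    """Sparse feature vector: element-wise product of features and binary weights."""
    return [ai * wi for ai, wi in zip(a, w)]


def parameter_grid(m):
    """All (d, th) pairs: d = 1..2m in steps of 1, th = 0.10..1.00 in steps of 0.05.

    Thresholds are built from integers to avoid floating-point drift.
    """
    thresholds = [k / 100 for k in range(10, 101, 5)]
    return [(d, th) for d in range(1, 2 * m + 1) for th in thresholds]


# Made-up normalized code vectors for m = 2 regions (2m = 4 entries).
x_ad = [0.2, 1.0, 0.6, 0.0]
x_hc = [0.2, 0.1, 0.5, 0.9]
w = binary_weights(x_ad, x_hc, th=0.8)
a_tilde = mask_features([0.3, 0.7, 0.1, 0.5], w)
grid = parameter_grid(m=2)
```

For m = 264 the grid has 528 values of d and 19 values of the threshold, so each fold evaluates 10,032 parameter pairs.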

3 Results

The grid search parameters that yielded the best ACC and PPV classification results are d = 3 and $t_h$ = 0.8. To assess the performance of the proposed feature selection method, SVM classifiers are also trained using structural and functional connectome features in training data set A that are selected by: (1) a linear regression technique that includes ℓ1 regularization (i.e. Lasso), and (2) no feature selection. As shown in Table 1, an SVM classifier trained using structural and functional connectome features selected by the proposed method is the most accurate at 88.3 %, can predict the disease case (i.e. PPV) approximately 87.2 % of the time, and consistently shows the highest sensitivity, specificity, and NPV.

Table 1. ASD vs. HC 10-fold classification results in x̄ ± σ format. The proposed method attains the highest value for every performance measure.

Feature selection    PPV             ACC             NPV             SEN             SPC
Proposed             87.2 % ± 2.9    88.3 % ± 2.7    89.4 % ± 5.8    89.4 % ± 6.1    86.2 % ± 3.3
Lasso [9]            74.0 % ± 2.9    73.7 % ± 3.5    73.8 % ± 5.7    72.9 % ± 7.8    74.6 % ± 2.1
None (SVM only)      64.1 % ± 4.1    63.9 % ± 4.3    63.8 % ± 5.1    62.9 % ± 7.3    64.9 % ± 5.0

⁶ http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
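The five measures reported in Table 1 follow from confusion-matrix counts. The sketch below uses the standard definitions and assumes the coding ASD = 1 and HC = 0 from Sect. 2.5; the function name and the toy labels are illustrative, and every denominator is assumed to be non-zero.

```python
def classification_measures(y_true, y_pred):
    """ACC, SEN, SPC, PPV and NPV from binary labels (1 = ASD, 0 = HC)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / len(pairs),  # fraction of correct predictions
        "SEN": tp / (tp + fn),          # true positive rate
        "SPC": tn / (tn + fp),          # true negative rate
        "PPV": tp / (tp + fp),          # precision for the ASD class
        "NPV": tn / (tn + fn),          # precision for the HC class
    }


# Toy fold with 4 ASD and 4 HC subjects.
measures = classification_measures(
    y_true=[1, 1, 1, 1, 0, 0, 0, 0],
    y_pred=[1, 1, 1, 0, 0, 0, 1, 1],
)
```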


The bar plots in Fig. 2 show the median⁷ structural and functional weight values found using Eq. (3) when grid search parameter $t_h$ = 0.8 is used. The SVM classifier in Table 1 is trained using only the node-based connectivity values from the selected 47 regions (24 structural regions and 23 functional regions) also shown in Fig. 2. In general, the 47 regions have the largest differences in code values, which suggests the structural and functional connectivity characteristics in these brain regions differ markedly between ASD and HC subjects. Lastly, Fig. 3 shows the median⁷ top, middle, and bottom DTI and rsfMRI regions in the learned partition hierarchy. Included are tables that list the brain regions in the bottom level (i.e. last partition) of the hierarchy. These regions have the most similar structural and functional connectivity characteristics. Note: the term "shared" in this figure means that the same brain region is present in both connectomes within the grouping.

Fig. 2. Bar plots that show the median weight values for each node-based connectome feature (structural and functional) found by Eq. (3). The SVM classifier in Table 1 was trained using only the node-based connectivity values from the selected 47 regions, an approximately 91 % reduction in node-based connectome features.

⁷ Median value is found using the results from all 10 folds.


Fig. 3. Visualizations that show the DTI and rsfMRI regions in the top, middle, and bottom partitions (see Fig. 1 for design of partition hierarchy). The tables summarize the brain regions in the bottom partition of the hierarchy.

4 Conclusion

A novel connectome feature selection technique is proposed that uses a hypergraph learning algorithm to identify brain regions that have similar structural and functional connectivity characteristics. SVM classifiers trained using structural and functional connectome features selected by our method are more accurate (88.3 % vs. 73.7 % in Table 1) than SVM classifiers trained using connectome features selected by a well-known sparse regression algorithm (Lasso). Furthermore, since our approach converts a subject-specific complete bipartite graph to an incidence matrix, the resulting incidence matrix is very sparse, which in turn greatly reduces the space and time requirements of our approach. Visualizations that display brain regions in the top, middle, and bottom partitions of the proposed partition hierarchy show structural and functional connectivity differences between ASD and HC subjects, as seen in Fig. 3. Lastly, even though the betweenness centrality node-based connectivity measure is used here, our method achieved similar accuracy and PPV classification results (mean ±3 %) when it was replaced by the eigenvector centrality or clustering coefficient connectivity measures.


References

1. Greicius, M.D., Supekar, K., Menon, V., Dougherty, R.F.: Resting-state functional connectivity reflects structural connectivity in the default mode network. Cereb. Cortex 19(1), 72–78 (2009)
2. Power, J.D., Barnes, K.A., Snyder, A.Z., Schlaggar, B.L., Petersen, S.E.: Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. NeuroImage 59(3), 2142–2154 (2012)
3. Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014)
4. Rubinov, M., Sporns, O.: Complex network measures of brain connectivity: uses and interpretations. NeuroImage 52(3), 1059–1069 (2010)
5. Rudie, J., Brown, J., Beck-Pancer, D., Hernandez, L., Dennis, E., Thompson, P., Bookheimer, S., Dapretto, M.: Altered functional and structural brain network organization in autism. NeuroImage Clin. 2, 79–94 (2013)
6. Saur, D., Schelter, B., Schnell, S., Kratochvil, D., Küpper, H., Kellmeyer, P., Kümmerer, D., Klöppel, S., Glauche, V., Lange, R., Mader, W., Feess, D., Timmer, J., Weiller, C.: Combining functional and anatomical connectivity reveals brain networks for auditory language comprehension. NeuroImage 49(4), 3187–3197 (2010)
7. Sporns, O.: The human connectome: origins and challenges. NeuroImage 80, 53–61 (2013)
8. Teipel, S.J., Bokde, A.L., Meindl, T., Amaro, E., Soldner, J., Reiser, M.F., Herpertz, S.C., Möller, H.J., Hampel, H.: White matter microstructure underlying default mode network connectivity in the human brain. NeuroImage 49(3), 2021–2032 (2010)
9. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58, 267–288 (1996)
10. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
11. Zhu, D., Li, K., Terry, D.P., Puente, A.N., Wang, L., Shen, D., Miller, L.S., Liu, T.: Connectome-scale assessments of structural and functional connectivity in MCI. Hum. Brain Mapp. 35(7), 2911–2923 (2014)

Ensemble Hierarchical High-Order Functional Connectivity Networks for MCI Classification

Xiaobo Chen, Han Zhang, and Dinggang Shen(✉)

Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, USA [email protected]

Abstract. Conventional functional connectivity (FC) and the corresponding networks characterize the pairwise correlation between two brain regions, while high-order FC (HOFC) and its networks can model more complex relationships between two brain region "pairs" (i.e., four regions). HOFC is promising for clinical applications because it provides unique information for brain disease classification. Since the number of brain region pairs is very large, clustering is often used to reduce the scale of the HOFC network. However, a single HOFC network, generated by a specific clustering parameter setting, may lose multifaceted, highly complementary information contained in other HOFC networks. To accurately and comprehensively characterize such complex HOFC towards better discriminability of brain diseases, in this paper we propose a novel HOFC-based disease diagnosis framework, which hierarchically generates multiple HOFC networks and then ensembles them with a selective feature fusion method. Specifically, we create a multi-layer HOFC network construction strategy, where the networks in upper layers are formed by hierarchically clustering the nodes of the networks in lower layers. In this way, information is passed from lower layers to upper layers by removing the most redundant part of the information while retaining the most unique part. Then, the retained information/features from all HOFC networks are fed into a selective feature fusion method, which combines sequential forward selection and sparse regression, to further select the most discriminative feature subset for classification. Experimental results confirm that our method outperforms all the single HOFC networks corresponding to any single parameter setting in the diagnosis of mild cognitive impairment (MCI) subjects.

Keywords: Functional connectivity · High-order network · Hierarchical clustering · Brain network · Resting state · Functional magnetic resonance imaging

1 Introduction

Alzheimer's disease (AD) is the most prevalent dementia, accounting for about 60–80 % of dementia cases among the worldwide elderly population. Mild cognitive impairment (MCI), as a prodromal stage of AD, tends to convert to clinical AD with an average annual conversion rate of 10–15 % [1]. Early diagnosis of MCI is of great importance for possibly delaying AD progression, and is an active topic in translational clinical research.

Resting-state functional magnetic resonance imaging (RS-fMRI) can be used to infer functional connectivity (FC) among brain regions, and to characterize brain network abnormalities in MCI. Here, the FC is traditionally defined as the temporal correlation of the blood-oxygenation-level-dependent (BOLD) time series between two brain regions [2]. The whole-brain network can further be constructed by accounting for the FC of every pair of brain regions. Recently, static FC calculated from the entire BOLD time series has been challenged by dynamic FC and dynamic network studies. A widely adopted dynamic FC analysis strategy is to calculate FC in sliding windows [4–6]. Furthermore, multiple time-varying FC networks can be constructed and used for MCI classification [5]. Nevertheless, these time-varying FC networks are still low-order, since they are estimated from the raw BOLD time series and reflect only the relationship among brain regions in a pairwise manner.

The series of time-varying low-order FC (LOFC) networks can be equivalently represented as a set of dynamic FC time series, each of which is associated with a pair of brain regions and characterizes their time-evolving FC. Then, we can further compute the correlation between two FC time series (each of which involves two brain regions) and obtain a high-order FC (HOFC) that involves four brain regions (i.e., two pairs of brain regions). With this strategy, an HOFC network can be constructed for the whole brain, and graph-theory-based high-order features characterizing the properties of the HOFC network can be extracted.

HOFC has the following merits: (1) it is computed from FC time series instead of raw BOLD time series, thus representing high-level features; (2) it reflects a more complex relationship among brain regions by characterizing how different brain region pairs, instead of two brain regions, functionally interact with each other; (3) it is time invariant, solving the problem of phase mismatch among subjects. However, the side-effect of HOFC (i.e., the increase of network scale from $N^2$ to $(N^2)^2$, where N is the total number of brain regions) requires clustering-based dimension reduction and thus results in inevitable information loss when a single HOFC network (corresponding to a specific number of clusters) is used for classification. The clustering pattern in a high-dimensional space is not a single discrete structure; instead, the relationships between network nodes viewed at different spatial scales carry rich information that could be used to boost classification accuracy.

Accordingly, we first generate multiple HOFC networks for each subject; each network has a different discriminative ability for disease identification. More importantly, these HOFC networks are organized in a hierarchical fashion, which means the network in each layer is generated by merging some nodes while retaining other nodes of the HOFC network in the previous layer. By doing so, the HOFC networks in two consecutive layers are highly overlapped. As a result, the features extracted from the HOFC network of each layer can be decomposed into two parts (blocks): one is redundant and the other is informative or complementary with respect to the features extracted from the previous layer. To further refine the informative feature blocks from all HOFC networks, a feature fusion strategy based on sequential forward selection and sparse regression [10] is developed, and the resulting feature subset is used for classification with a linear support vector machine (SVM).

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 18–25, 2016. DOI: 10.1007/978-3-319-46723-8_3

2 Approach

The proposed method consists of the following 8 steps.

(1) For each subject, a sliding window with length N and step size s is applied to partition the entire BOLD time series into multiple overlapping segments [4, 5].
(2) Multiple LOFC networks are constructed, each of which is based on a respective segment of the BOLD time series. By doing so, we obtain a set of FC time series, each describing the temporal variation of the correlation between two brain regions.
(3) All subjects' FC time series associated with the same brain region pair are concatenated to form a long FC time series.
(4) The long FC time series from all brain region pairs are grouped into U clusters by a clustering algorithm, thus yielding consistent clustering results across different subjects.
(5) For each subject, the mean of the FC time series within the same cluster is computed, and then an HOFC network is constructed based on the correlation between the mean FC time series of different clusters.
(6) Repeating steps (4) and (5) multiple times with different values of U generates multiple HOFC networks.
(7) The features extracted from all HOFC networks are analyzed based on correlation, and then a feature subset is selected by a feature selection method that combines sequential forward selection and sparse regression.
(8) A support vector machine (SVM) [8] with linear kernel is finally trained with the selected features to classify MCI and NC subjects.

The main flowchart of our hierarchical HOFC network construction and feature extraction is shown in Fig. 1, where four brain regions denoted as A, B, C, and D are illustrated.
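Steps (1), (2) and (5) can be sketched for a single subject as follows. This is a simplified illustration, not the authors' pipeline: it uses synthetic Gaussian signals in place of BOLD data, it skips the cross-subject clustering of steps (3) and (4) so that every region pair acts as its own cluster, and the function names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def fc_time_series(bold, window, step):
    """Steps (1)-(2): one sliding-window correlation series per region pair.

    bold maps a region name to its list of BOLD samples.
    """
    regions = sorted(bold)
    length = len(bold[regions[0]])
    starts = range(0, length - window + 1, step)
    series = {}
    for k, a in enumerate(regions):
        for b in regions[k + 1:]:
            series[(a, b)] = [
                pearson(bold[a][t:t + window], bold[b][t:t + window])
                for t in starts
            ]
    return series


def high_order_fc(series):
    """Step (5) without clustering: correlation between pairs of FC time series."""
    pairs = sorted(series)
    return {
        (p, q): pearson(series[p], series[q])
        for k, p in enumerate(pairs)
        for q in pairs[k + 1:]
    }


# Synthetic data: 3 regions, 30 time points (assumption, for illustration only).
rng = random.Random(0)
bold = {r: [rng.gauss(0, 1) for _ in range(30)] for r in ("A", "B", "C")}
series = fc_time_series(bold, window=10, step=1)
hofc = high_order_fc(series)
```

Each key of `hofc` is a pair of region pairs, which is why a high-order connection involves up to four regions.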

Fig. 1. Hierarchical HOFC networks construction and feature decomposition.


2.1 Hierarchical Clustering and Feature Decomposition

As shown in the top left panel of Fig. 1, we can obtain the FC time series corresponding to each pair of brain regions by following steps (1) and (2) above. When repeating step (4) with different numbers of clusters, we use agglomerative hierarchical clustering [9] to group these FC time series into $u_i$ and $u_{i+1}$ clusters ($u_i > u_{i+1}$), respectively, in layer i and layer i + 1. In Fig. 1, we have $u_i = 5$ and $u_{i+1} = 4$. When the difference between $u_i$ and $u_{i+1}$ is small, a few clusters in layer i + 1 are newly formed by merging some closer clusters in layer i, while the other clusters in layer i + 1 are the same as those in layer i; this is how agglomerative hierarchical clustering proceeds. It is illustrated in the right panels of Fig. 1, where the blue and pink clusters in layer i are merged into a new red cluster in layer i + 1 while the other clusters are kept the same.

Based on the clustering results in layers i and i + 1, the HOFC networks $HON_i$ and $HON_{i+1}$ are constructed, respectively, where each node corresponds to a cluster and the weight of each edge is the Pearson's correlation between the mean FC time series of two different clusters. As we can see from the upper right panel of Fig. 1, only the newly formed nodes and the associated edges (shown in red) in $HON_{i+1}$ may contain extra information with respect to $HON_i$. Afterwards, the feature vectors $Fea_i \in \mathbb{R}^{u_i}$ and $Fea_{i+1} \in \mathbb{R}^{u_{i+1}}$ can be extracted from $HON_i$ and $HON_{i+1}$, respectively. In this paper, weighted local clustering coefficients (WLCC) [7] are used as features. Each entry in $Fea_i$ and $Fea_{i+1}$ corresponds to a node in $HON_i$ and $HON_{i+1}$, respectively, and thus also to a cluster in layers i and i + 1.

Since only a small number of nodes and edges in $HON_{i+1}$ are different from those in $HON_i$, $Fea_{i+1}$ can be decomposed as $Fea_{i+1} = [D_{i+1}, S_{i+1}]$, where D and S respectively refer to the nodes newly formed from the previous layer and the nodes that already existed in the previous layer. As a result, $S_{i+1}$ in layer i + 1 should be highly correlated with some features in $Fea_i$, while only $D_{i+1}$ may contain novel information with respect to $Fea_i$. Generalizing the above observation to L levels, for each subject, the features extracted from all hierarchical HOFC networks can be condensed and expressed as $Fea = [D_0, D_1, D_2, \cdots, D_L]$, where $D_0 = S_1$. Each $D_i$ is called a feature block.

2.2 Selective Feature Fusion

Although the above agglomerative hierarchical clustering and correlation analysis can reduce the dimensionality of the features to a large extent, redundancy between different layers may still exist, especially when multiple layers are taken into account. In addition, not all of the features in Fea are discriminative in terms of MCI classification. To benefit from the information contained in Fea and reduce redundancy, we propose a feature fusion method that combines sequential forward selection and sparse regression [10] under the framework of wrapper-based feature selection [11]. Sequential forward selection finds discriminant feature blocks progressively, while sparse regression selects individual features that are predictive for classification.

Specifically, given a current set A of feature blocks, a new feature block $D_i$ from Fea − A can be selected and combined with A, thus producing an enlarged feature subset $A \cup D_i$. Then, $\ell_1$-norm based sparse regression, i.e., the least absolute shrinkage and selection operator (LASSO), is performed on all training samples with the features within $A \cup D_i$ to find a small subset C that is beneficial for classification. Next, the selected features of all training subjects are used to train a linear SVM model, and the classification accuracy on the validation subjects is used to guide the selection of $D_i$. That is, the block yielding the best accuracy is finally selected. In this way, the feature blocks selected in previous iterations are kept and guide the selection of a new feature block in the next iteration. The procedure above is repeated until either the optimal performance or the pre-defined number of feature blocks is reached.
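The block-wise selection loop can be sketched as a greedy wrapper. This is an illustration under stated assumptions rather than the authors' code: the `evaluate` callback is hypothetical and stands in for the LASSO step followed by linear SVM validation accuracy, "optimal performance" is interpreted as stopping when no remaining block improves the score, and the toy scoring function exists only to make the example runnable.

```python
def forward_select_blocks(blocks, evaluate, max_blocks):
    """Greedy sequential forward selection over feature blocks.

    blocks: dict mapping a block name to its list of feature indices.
    evaluate: HYPOTHETICAL callback taking a list of feature indices and
        returning a validation score (stand-in for LASSO + linear SVM).
    max_blocks: pre-defined maximum number of blocks to select.
    """
    selected, best_score = [], float("-inf")
    remaining = set(blocks)
    while remaining and len(selected) < max_blocks:
        scored = []
        for name in sorted(remaining):
            # Candidate subset: previously selected blocks plus one new block.
            features = [f for b in selected + [name] for f in blocks[b]]
            scored.append((evaluate(features), name))
        score, name = max(scored)
        if score <= best_score:
            break  # assumption: stop once no block improves the score
        selected.append(name)
        remaining.remove(name)
        best_score = score
    return selected, best_score


def toy_evaluate(features):
    """Toy score: reward informative features 0 and 3, penalize subset size."""
    return len({0, 3} & set(features)) - 0.01 * len(features)


blocks = {"D0": [0, 1], "D1": [2], "D2": [3]}
selected, best = forward_select_blocks(blocks, toy_evaluate, max_blocks=3)
```

In this toy run the block D2 is chosen first, D0 second, and D1 is rejected because adding it lowers the score.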

3 Experiments

3.1 Data

The Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset is used in this study. 50 MCI subjects and 49 normal controls (NCs) are selected from the ADNI-GO and ADNI-2 datasets. Subjects from both datasets were age- and gender-matched and were scanned using 3.0T Philips scanners. The voxel size is 3.13 × 3.13 × 3.13 mm³. The SPM8 software package (http://www.fil.ion.ucl.uk/spm/software/spm8) was used to preprocess the RS-fMRI data. The first 3 volumes of each subject were discarded before preprocessing to allow for magnetization equilibrium. A rigid-body transformation was used to correct head motion, and subjects with head motion larger than 2 mm or 2° were discarded. The fMRI images were normalized to the Montreal Neurological Institute (MNI) space and spatially smoothed with a Gaussian kernel with a full width at half maximum (FWHM) of 6 × 6 × 6 mm³. We did not scrub the data with frame-wise displacement larger than 0.5 mm, as doing so would introduce additional artifacts. Instead, we excluded from further analysis the subjects who had more than 2.5 min (the maximum is 7 min) of RS-fMRI data with large frame-wise displacement. The RS-fMRI images were parcellated into 116 regions according to the Automated Anatomical Labeling (AAL) template. The mean RS-fMRI time series of each brain region was band-pass filtered (0.015 ≤ f ≤ 0.15 Hz). Head motion parameters (Friston24) and the mean BOLD time series of the white matter and the cerebrospinal fluid were regressed out.

3.2 HOFC Network and Feature Correlation Analysis

In this study, we use a sliding window with s = 1 and N = 50. Denote the number of clusters as U. To generate multiple layers, we start from one layer with a relatively large number of clusters (U = 220) because it can retain sufficient information. Subsequent layers are then added by gradually reducing U by 30 until the optimal performance is achieved.
In this way, we eventually generate four HOFC networks from layer 1 to layer 4: $HON_1$, $HON_2$, $HON_3$, and $HON_4$, where the number of clusters equals 220, 190, 160, and 130, respectively. The averaged HOFC networks in layer 1 for MCI and NC subjects are shown in the left and middle panels of Fig. 2. The corresponding high-order

Ensemble Hierarchical High-Order Functional Connectivity Networks


feature vectors (WLCC) $Fea_1 \in \mathbb{R}^{220}$, $Fea_2 \in \mathbb{R}^{190}$, $Fea_3 \in \mathbb{R}^{160}$, and $Fea_4 \in \mathbb{R}^{130}$ are extracted. Since our method is a feature fusion method, the correlation between features from different HOFC networks provides important information about redundancy. To empirically verify the rationale of the feature decomposition (Sect. 2.1), we compute the correlation between $Fea_1$ and $Fea_2$ and show the correlation matrix in the right panel of Fig. 2. This matrix has 220 rows and 190 columns. As shown by the red straight lines in this matrix, most features in $Fea_2$ are highly correlated with features in $Fea_1$. This observation is consistent for all HOFC networks, implying that most features in the current layer are redundant with respect to those in the previous layer and should be eliminated before feature fusion. Based on this correlation, each feature vector $Fea_i$ ($i = 1, 2, 3, 4$) can be decomposed into two feature blocks, $Fea_i = [D_i, S_i]$, where $D_1 \in \mathbb{R}^{29}$, $S_1 \in \mathbb{R}^{191}$, $D_2 \in \mathbb{R}^{30}$, $S_2 \in \mathbb{R}^{160}$, $D_3 \in \mathbb{R}^{29}$, $S_3 \in \mathbb{R}^{131}$, and $D_4 \in \mathbb{R}^{30}$, $S_4 \in \mathbb{R}^{100}$. Note that an extra level with 250 clusters is used to decompose $Fea_1$ and is then discarded. Consequently, only about 30 features of each layer ($i > 1$), i.e., $D_i$, are less correlated with the previous layer, while the others are redundant. To include sufficient information while reducing redundancy, five feature blocks, $S_1 \in \mathbb{R}^{191}$, $D_1 \in \mathbb{R}^{29}$, $D_2 \in \mathbb{R}^{30}$, $D_3 \in \mathbb{R}^{29}$, and $D_4 \in \mathbb{R}^{30}$, are used in the subsequent feature fusion, while the others are discarded. As a result, this unsupervised correlation analysis reduces the total number of features from 700 to 309.

[Figure 2 image: the left and middle panels plot Cluster against Cluster; the right panel plots Fea1 (rows) against Fea2 (columns).]

Fig. 2. Averaged HOFC network in layer 1 for MCI (left) and NC (middle) subjects and feature correlation matrices between HOFC networks from layer 1 and layer 2 (right).
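The correlation-based decomposition of a layer into a novel block D and a redundant block S can be sketched as below. This is a minimal sketch under stated assumptions: the function names, the toy data, and the 0.9 threshold are hypothetical, since the text does not state how "highly correlated" was decided. The last lines only re-check the feature counts quoted in Sect. 3.2.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant feature: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def split_layer(
    prev_cols: List[List[float]],
    next_cols: List[List[float]],
    threshold: float = 0.9,  # hypothetical cut-off, not given in the paper
) -> Tuple[List[int], List[int]]:
    """Return (D, S): indices of next-layer features that are novel / redundant."""
    novel: List[int] = []
    redundant: List[int] = []
    for j, col in enumerate(next_cols):
        best = max(abs(pearson(col, p)) for p in prev_cols)
        (redundant if best >= threshold else novel).append(j)
    return novel, redundant


# Toy data: each inner list holds one feature's values across 5 subjects.
prev_layer = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]]
next_layer = [
    [2.0, 4.0, 6.0, 8.0, 10.0],  # scaled copy of prev_layer[0]: redundant
    [5.0, 1.0, 1.0, 1.0, 5.0],   # weakly correlated with both: novel
]
D_idx, S_idx = split_layer(prev_layer, next_layer)
print(D_idx, S_idx)

# Feature counts quoted in the text.
layer_sizes = [220, 190, 160, 130]
kept_blocks = {"S1": 191, "D1": 29, "D2": 30, "D3": 29, "D4": 30}
print(sum(layer_sizes), sum(kept_blocks.values()))
```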

3.3 Classification Accuracy

The proposed hierarchical HOFC network feature fusion based on sequential forward selection and sparse regression (HHON-SFS) is compared with several closely related methods: (1) a feature fusion method (HHON-CON) that concatenates all features extracted from the four HOFC networks, (2) the four individual HOFC networks (HON1, HON2, HON3, and HON4), and (3) two LOFC networks based on partial correlation (LON-PAC) and Pearson’s correlation (LON-PEC), respectively. All methods were implemented in the MATLAB 2012b environment. The SLEP [10] and Libsvm toolboxes were used to implement sparse regression and SVM classification, respectively. Leave-one-out cross validation (LOOCV) is adopted to evaluate the performance of the different methods. The hyper-parameter of each method is tuned on the training subjects by nested LOOCV. To measure the performance of the different methods, we use the following indices: accuracy (ACC), area under ROC curve (AUC),


sensitivity (SEN), specificity (SPE), Youden’s Index (YI), F-score, and balanced accuracy (BAC). The experimental results are shown in Table 1. The HOFC networks achieve better accuracy than the two LOFC networks, indicating that the HOFC networks provide more discriminative biomarkers for MCI identification. Comparing the four individual HOFC networks, we observe that their performance is rather sensitive to the number of clusters: a value of U that is too large or too small adversely affects performance. Although the HOFC network HON2 (U = 190) performs better than the other individual HOFC networks, this does not mean that the information contained in the other networks is useless for distinguishing MCI from NC subjects. To make full use of the information in the four HOFC networks, HHON-CON directly concatenates Fea1, Fea2, Fea3, and Fea4 to form a combined feature vector of length 700. Although this method uses all features, its accuracy falls between the best and the worst individual networks. This may be because too many redundant features are used, which complicates the relationship between features and thus makes individual feature selection difficult and over-fitting more likely. In contrast, the proposed method, HHON-SFS, achieves the best performance among all competing methods. On the one hand, this improvement can be attributed to the feature correlation analysis and the resulting feature decomposition, which eliminates many redundant features. On the other hand, the combination of sequential forward selection and sparse regression makes it possible to evaluate the importance of feature blocks and individual features progressively. As a result, crucial and complementary features are more likely to be selected and fused for classification.

Table 1. Performance comparison of different methods in MCI classification.

Method     ACC (%)  AUC     SEN (%)  SPE (%)  YI (%)  F-score (%)  BAC (%)
LON-PEC    57.58    0.6008  58.00    57.14    15.14   58.00        57.57
LON-PAC    60.61    0.6249  58.00    63.27    21.27   59.79        60.63
HON1       80.81    0.8567  86.00    75.51    61.51   81.90        80.76
HON2       81.82    0.8743  82.00    81.63    63.63   82.00        81.82
HON3       68.69    0.8094  74.00    63.27    37.27   70.48        68.63
HON4       72.73    0.7857  74.00    71.43    45.43   73.27        72.71
HHON-CON   78.79    0.8702  80.00    77.55    57.55   79.21        78.78
HHON-SFS   84.85    0.9057  88.00    81.63    69.63   85.44        84.82
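The indices in Table 1 follow from the confusion matrix, as the short sketch below illustrates. The counts used here (44 of 50 MCI subjects and 40 of 49 NCs classified correctly) are not reported in the paper; they are inferred from the HHON-SFS sensitivity and specificity together with the group sizes, and MCI is assumed to be the positive class. The function name is illustrative.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Percent-scaled indices computed from confusion-matrix counts."""
    sen = 100.0 * tp / (tp + fn)                    # sensitivity
    spe = 100.0 * tn / (tn + fp)                    # specificity
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)   # accuracy
    f_score = 100.0 * 2 * tp / (2 * tp + fp + fn)   # F1 score
    return {
        "ACC": acc,
        "SEN": sen,
        "SPE": spe,
        "YI": sen + spe - 100.0,    # Youden's index
        "F-score": f_score,
        "BAC": (sen + spe) / 2.0,   # balanced accuracy
    }


# Inferred counts for HHON-SFS (assumption, see the note above).
m = classification_metrics(tp=44, fn=6, tn=40, fp=9)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

With these counts the six values agree with the HHON-SFS row of Table 1 to two decimals. AUC is not included because it requires the continuous classifier scores rather than the confusion matrix.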

4 Conclusion

In this paper, we propose to fuse the information contained in multiple HOFC networks for better MCI classification. To this end, hierarchical clustering is used to generate multiple HOFC networks, one per layer. With this framework, the features extracted from the network at each layer can be refined, and only the informative feature block is taken into account. Specifically, by combining sequential forward selection and sparse regression, a novel feature fusion method is developed. This method


is able to selectively integrate informative feature blocks from different HOFC networks and further detect a small set of individual features that are discriminative for early diagnosis. Finally, an SVM with a linear kernel is used for MCI classification. The experimental results demonstrate the capability of the proposed approach to make full use of the information contained in multiple scales of HOFC networks. In addition, combining multiple HOFC networks is shown to yield better classification performance than using a single HOFC network.

Acknowledgments. This work was supported by the National Institutes of Health (EB006733, EB008374, and AG041721) and the National Natural Science Foundation of China (Grant No. 61203244).

References

1. Misra, C., Fan, Y., Davatzikos, C.: Baseline and longitudinal patterns of brain atrophy in MCI patients, and their use in prediction of short-term conversion to AD: results from ADNI. NeuroImage 44, 1415–1422 (2009)
2. Friston, K., Frith, C., Liddle, P., Frackowiak, R.: Functional connectivity: the principal-component analysis of large (PET) data sets. J. Cereb. Blood Flow Metab. 13, 5–14 (1993)
3. Brier, M.R., Thomas, J.B., Fagan, A.M., Hassenstab, J., Holtzman, D.M., Benzinger, T.L., Morris, J.C., Ances, B.M.: Functional connectivity and graph theory in preclinical Alzheimer’s disease. Neurobiol. Aging 35, 757–768 (2014)
4. Leonardi, N., Richiardi, J., Gschwind, M., Simioni, S., Annoni, J.-M., Schluep, M., Vuilleumier, P., Van De Ville, D.: Principal components of functional connectivity: a new approach to study dynamic brain connectivity during rest. NeuroImage 83, 937–950 (2013)
5. Wee, C.-Y., Yang, S., Yap, P.-T., Shen, D., Alzheimer’s Disease Neuroimaging Initiative: Sparse temporally dynamic resting-state functional connectivity networks for early MCI identification. Brain Imaging Behav. 10, 1–15 (2015)
6. Hutchison, R.M., Womelsdorf, T., Allen, E.A., Bandettini, P.A., Calhoun, V.D., Corbetta, M., Della Penna, S., Duyn, J.H., Glover, G.H., Gonzalez-Castillo, J., Handwerker, D.A., Keilholz, S., Kiviniemi, V., Leopold, D.A., de Pasquale, F., Sporns, O., Walter, M., Chang, C.: Dynamic functional connectivity: promise, issues, and interpretations. NeuroImage 80, 360–378 (2013)
7. Rubinov, M., Sporns, O.: Complex network measures of brain connectivity: uses and interpretations. NeuroImage 52, 1059–1069 (2010)
8. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
9. Ward Jr., J.H.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963)
10. Liu, J., Ji, S., Ye, J.: SLEP: sparse learning with efficient projections, vol. 6, p. 491. Arizona State University (2009)
11. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. 97, 273–324 (1997)

Outcome Prediction for Patient with High-Grade Gliomas from Brain Functional and Structural Networks

Luyan Liu1, Han Zhang2, Islem Rekik2, Xiaobo Chen2, Qian Wang1, and Dinggang Shen2(✉)

1 Med-X Research Institute, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
2 Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, USA
[email protected]

Abstract. High-grade glioma (HGG) is a lethal cancer characterized by a very poor prognosis. To help optimize treatment strategy, accurate preoperative prediction of an HGG patient’s outcome (i.e., survival time) is of great clinical value. However, HGG shows huge individual variability, which produces a large variation in survival time and makes prognostic prediction more challenging. Previous brain imaging-based outcome prediction studies relied only on the imaging intensity inside or slightly around the tumor, while ignoring any information located far away from the lesion (i.e., the “normal appearing” brain tissue). We hypothesize that, in addition to altering MR image intensity, HGG growth and its mass effect also change both structural brain connectivity (which can be modeled by diffusion tensor imaging (DTI)) and functional brain connectivity (estimated by resting-state functional magnetic resonance imaging (rs-fMRI)). Therefore, integrating connectomics information could improve outcome prediction accuracy. To this end, we devise what is, to our knowledge, the first machine learning-based HGG prediction framework that extracts valuable features from the complex human brain connectome using network analysis tools, followed by a novel multi-stage feature selection strategy that singles out good features while reducing feature redundancy. Ultimately, we use a support vector machine (SVM) to classify HGG outcome as either bad (survival time ≤ 650 days) or good (survival time > 650 days). Our method achieved 75 % prediction accuracy. We also found that functional and structural networks provide complementary information for outcome prediction, leading to higher prediction accuracy than the baseline method, which uses only basic clinical information (63.2 %).

© Springer International Publishing AG 2016 S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 26–34, 2016. DOI: 10.1007/978-3-319-46723-8_4

1 Introduction

Gliomas account for around 45 % of primary brain tumors. The prognosis of gliomas depends on multiple factors, such as age, histopathology, tumor size and location, extent of resection, and type of treatment. Most deadly gliomas are classified by World Health Organization (WHO) as Grade III (anaplastic astrocytoma, and anaplastic


oligodendroglioma) and Grade IV (glioblastoma multiforme), according to the histopathological subtypes. These are referred to as high-grade gliomas (HGG), which grow fast and infiltrate diffusely. More importantly, HGG are characterized by a very poor prognosis, but the outcome (i.e., the overall survival time) differs significantly from case to case. This can be explained by a large variation in tumor characteristics (e.g., location and cancer cell type). Although still challenging, pre-operative prediction of HGG outcome is of great importance and is highly desired by clinicians. Multimodal presurgical brain imaging has been gaining ground in surgical planning. In turn, this produces abundant multimodal neuroimaging information for potential HGG outcome prediction. For instance, in [1], multiple features reflecting intensity distributions of various magnetic resonance imaging (MRI) sequences were extracted to predict patient survival time and molecular subtype of glioblastoma. In [2], morphologic features and hemodynamic parameters, along with clinical and genomic biomarkers, were used to predict the outcome of glioblastoma patients. In [3], data mining techniques based on image attributes from MRI produced better HGG outcome prediction performance than using histopathologic information alone. Although promising, all these studies share a first key limitation: they overlooked the relationship between brain connectivity and the outcome. In other words, they mainly relied on information extracted from the tumor region (i.e., tumor and necrotic tissue) or around it (e.g., the edema region). This excludes the majority of the “normal appearing” brain tissue, which has most likely also been affected by the tumor. Our hypothesis is rooted in the fact that HGG diffuses extensively along white matter fiber tracts, thus altering the brain structural connectivity. In turn, altered structural connectivity will lead to altered functional connectivity. Moreover, the mass effect, edema, and abnormal neovascularization may further change brain functional and structural connectivities. Therefore, connectomics data may provide useful information that complements intensity-based survival time prediction. A second key limitation of previous studies is that none of them compared the prediction performance obtained with conventional clinical data against that obtained with advanced connectomics data from multiple modalities. We aim to address both limitations. Conventional neuroimaging computing methods, such as graph-theory-based complex network analysis, have demonstrated promising value in disease classification and biomarker detection [4]. However, to the best of our knowledge, no previous study has used the brain connectome to predict the treatment outcome of HGG patients. In this work, we hypothesize that gliomas have “diffusive effects” on both structural and functional connectivities, involving both white matter and grey matter, which could alter the inherent brain connectome and lead to abnormalities in network attributes. Hence, we devise an HGG outcome prediction framework that integrates, extracts, and selects the best set of advanced brain connectome features. Specifically, we retrospectively divided the recruited HGG patients into short and long survival time groups based on the follow-up of a large number of glioma patients. Our method comprises the following key steps. First, we construct both functional and structural brain networks. Second, we extract structural and functional connectomics features using diverse network metrics. Third, we propose a novel framework to effectively reduce the dimension of the connectomics features by selecting, step by step, the


most discriminative features in a gradual, three-stage strategy. Finally, we use support vector machine (SVM) to predict the outcome.

2 Method

Figure 1 illustrates the proposed pipeline, which automatically predicts the survival time of HGG patients in three steps. In Sect. 2.1, we introduce the construction of brain networks based on resting-state functional MRI (rs-fMRI) and diffusion tensor imaging (DTI). In Sect. 2.2, we describe how to calculate network properties based on graph theory using a binary graph and a weighted graph. After adding clinical information, such as tumor location, size, and histopathological type, we obtain a long stacked feature vector. In Sect. 2.3, we propose a three-stage feature selection algorithm to remove redundant features. Finally, we apply an SVM to the selected features to predict the treatment outcome.

Fig. 1. Proposed pipeline of treatment outcome prediction for high-grade glioma patients. (K: degree; L: shortest path length; C: clustering coefficient; B: betweenness centrality; Eg: global efficiency; El: local efficiency; OS: overall survival. For details, please see Sect. 2.2).

2.1 Brain Network Construction

Subjects. A total of 147 HGG patients were originally included in this study. We excluded patients lacking either rs-fMRI or DTI data. Patients with inadequate follow-up time, or who died of other causes (e.g., a road accident), were also excluded. Those with significant image artifacts or excessive head motion, as identified by the data processing described below, were removed as well. All images were checked by three experts to assess the deformation of the brain caused by the tumor. Images for which all three experts agreed that the deformation was huge were also removed. Finally, 34 patients who died within 650 days after surgery were labeled as the “bad” outcome group, and the remaining 34 patients who survived more than 650 days after surgery were labeled as the “good” outcome group. The reason for using 650 days as the threshold is that the two-year survival rate for malignant glioma patients was


reported to be 51.7 % [5]. We slightly adjusted the threshold to balance the sample sizes of the two groups.

Imaging. In addition to the conventional clinical imaging protocols, research-dedicated whole-brain rs-fMRI and DTI data were also collected preoperatively. The rs-fMRI has TR (repetition time) = 2 s, number of acquisitions = 240 (8 min), and voxel size = 3.4 × 3.4 × 4 mm³. The DTI has 20 directions, voxel size = 2 × 2 × 2 mm³, and multiple acquisition = 2.

Clinical Treatment and Follow-up. All patients were treated according to the clinical guideline for HGGs, including a total or sub-total resection of the tumor entity during craniotomy and radio- and chemo-therapy after surgery. They were followed up on a schedule, e.g., 3, 6, 12, 24, 36, and 48 months after discharge. Any vital event, such as death, was reported to us so that we could calculate the overall survival time.

Image Processing. SPM8 and DPARSF [6] were used to preprocess the rs-fMRI data and build functional brain networks. FSL and PANDA [7] were used to process the DTI data and build structural brain networks. Multimodal images were first co-registered within each subject and then registered to the standard space. All these steps follow the commonly accepted pipeline and are thus not detailed here.

Network Construction. For each subject, two types of brain networks were constructed (see descriptions below). For each network, we calculated graph theory-based properties from both binary and weighted graphs.

• Structural Brain Network. We parcellated each brain into 116 regions using the Automated Anatomical Labeling (AAL) atlas, by warping the AAL template to each individual brain. The parcellated ROIs in each subject were used as graph nodes. The weighted network $N_s^w$ is constructed by calculating the structural connectivity strength $w_s(i,j) = \frac{2}{S_i + S_j} \sum_{i,j \in N} l(f)$ for the edge connecting nodes $i$ and $j$ ($i, j \in N$, $i \neq j$), where $N$ is the set of all 116 nodes in the network, $l(f)$ represents the number of fibers linking each pair of ROIs, and $S_i$ denotes the cortical surface area of node $i$. The sum $S_i + S_j$ corrects the bias caused by different ROI sizes. The binary structural network $N_s^b$ is generated by ranking $w_s$ in descending order and setting the weights of the top 15 % of edges to 1 and the others to 0 [8].
• Functional Brain Network. Using the same parcellation, we extracted the mean BOLD time series $TS_i$ ($i \in N$) of each brain region. Then, we defined the functional connectivity strength $w_f(i,j)$ in the functional network as the Pearson’s correlation between the two BOLD time series of each pair $(i,j)$ of the 116 brain regions: $w_f(i,j) = \mathrm{Corr}(TS_i, TS_j)$ ($i, j \in N$, $i \neq j$), thus generating a weighted functional brain network $N_f^w$. The binary functional network $N_f^b$ is generated in the same way as described above.
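The binarization step (keep the top 15 % of edges, zero the rest) can be sketched as follows. This is a minimal illustration on a toy 5-node matrix; the function name, the rounding of the edge count, and the arbitrary handling of ties are assumptions, and the toy call uses a fraction of 0.3 only so that a 10-edge graph keeps a whole number of edges.

```python
from typing import List


def binarize_top_fraction(
    weights: List[List[float]], fraction: float = 0.15
) -> List[List[int]]:
    """Set the strongest `fraction` of undirected edges to 1 and all others to 0."""
    n = len(weights)
    # Collect each undirected edge once (upper triangle, no self-loops).
    edges = [(weights[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    edges.sort(key=lambda e: e[0], reverse=True)  # descending by weight
    n_keep = int(round(fraction * len(edges)))    # assumption: round to nearest
    binary = [[0] * n for _ in range(n)]
    for _, i, j in edges[:n_keep]:
        binary[i][j] = binary[j][i] = 1           # keep the matrix symmetric
    return binary


# Toy symmetric weight matrix for 5 nodes (10 undirected edges).
w = [
    [0.0, 0.9, 0.1, 0.2, 0.8],
    [0.9, 0.0, 0.3, 0.7, 0.1],
    [0.1, 0.3, 0.0, 0.4, 0.2],
    [0.2, 0.7, 0.4, 0.0, 0.5],
    [0.8, 0.1, 0.2, 0.5, 0.0],
]
b = binarize_top_fraction(w, fraction=0.3)  # keeps the 3 strongest edges
for row in b:
    print(row)
```

For the 116-region parcellation there are 116 × 115 / 2 = 6670 candidate edges, so a 15 % threshold keeps roughly 1000 of them.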

2.2 Feature Extraction

Graph theory-based complex network analysis is used to independently extract multiple features from the four networks ($N_s^w$, $N_s^b$, $N_f^w$, $N_f^b$) of each subject. Since various graph metrics reflect different organizational properties of the networks, we calculated four types of metrics (i.e., degree, small-world properties, network efficiency properties, and nodal centrality) [9], which are detailed below.

• Degree. In each binary network, $N_s^b$ and $N_f^b$, the degree $k_i$ of node $i$ counts the number of edges linked to it. In each of the weighted networks, $N_s^w$ and $N_f^w$, the node degree is defined by $k_i = \sum_{j \in N, j \neq i} w_*(i,j)$, where $*$ refers to $s$ or $f$.
• Small-world property. This type of property was originally used to describe small-world networks and can also be calculated separately for each node. It includes the clustering coefficient $C_i$ (which measures the local interconnectivity of node $i$’s neighbors) and the shortest path length $L_i$ (which measures the overall communication speed between node $i$ and all other nodes). Specifically, in $N_s^b$ and $N_f^b$, $C_i$ is calculated by dividing the number of edges connecting $i$’s neighbors by the number of all possible edges linking $i$’s neighbors (i.e., $k_i(k_i - 1)/2$). In $N_s^w$ and $N_f^w$, on the other hand, $C_i$ is calculated as a normalized sum of the mean weight of the two participating edges in all triangles that have node $i$ as a vertex. $L_i$ is defined as the averaged minimum number of edges from node $i$ to all other nodes in $N_s^b$ or $N_f^b$, and as the averaged minimum sum of weighted edges in $N_s^w$ and $N_f^w$.
• Network efficiency. The efficiency of a network measures how efficiently information is exchanged within it, which gives a precise quantitative analysis of the network’s information flow. The global efficiency, $E_{global}(i)$, is defined as the sum of the inverse of the shortest path length between node $i$ and all other nodes. The local efficiency, $E_{local}(i)$, represents the global efficiency of the subgraph that consists of all of node $i$’s neighbors. The binary and weighted versions of the shortest path length result in binary and weighted efficiency metrics.
• Nodal centrality. Nodal centrality, $B_i$, quantifies how important node $i$ is in the network. A node with high $B_i$ acts as a hub in the network. It is calculated as $B_i = \sum_{m \neq n \neq i \in N} \frac{L_{mn}(i)}{L_{mn}}$, where $L_{mn}$ is the total number of shortest paths from node $m$ to node $n$, and $L_{mn}(i)$ is the number of these shortest paths passing through node $i$. Since $L_{mn}(i)$ and $L_{mn}$ have both binary and weighted versions, $B_i$ is calculated for each binary and weighted network.
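For the binary case, several of the nodal metrics listed above can be illustrated on a small toy graph. This is a sketch only and not GRETNA's implementation: the graph, the function names, and the adjacency-set representation are illustrative assumptions. The last function follows the text in summing inverse distances, whereas many toolboxes additionally divide by the number of other nodes.

```python
from collections import deque
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # node -> set of neighbours (undirected, binary)


def degree(g: Graph, i: int) -> int:
    return len(g[i])


def clustering(g: Graph, i: int) -> float:
    """Edges among i's neighbours divided by k_i (k_i - 1) / 2."""
    nbrs = sorted(g[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1 for a in range(k) for b in range(a + 1, k) if nbrs[b] in g[nbrs[a]]
    )
    return links / (k * (k - 1) / 2)


def bfs_distances(g: Graph, src: int) -> Dict[int, int]:
    """Minimum number of edges from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mean_shortest_path(g: Graph, i: int) -> float:
    d = [v for node, v in bfs_distances(g, i).items() if node != i]
    return sum(d) / len(d) if d else float("inf")


def inverse_distance_sum(g: Graph, i: int) -> float:
    """Sum of 1 / (shortest path length) from i to all other reachable nodes."""
    return sum(1.0 / v for node, v in bfs_distances(g, i).items() if node != i)


# Toy graph with edges 0-1, 0-2, 1-2, 2-3.
toy: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for node in toy:
    print(node, degree(toy, node), clustering(toy, node),
          mean_shortest_path(toy, node), inverse_distance_sum(toy, node))
```

Weighted versions replace edge counts with sums of edge weights and the breadth-first search with a weighted shortest-path algorithm such as Dijkstra's.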
These network metrics, which are used as connectomics features, were computed using GRETNA [8]. We also add 13 clinical features (age, gender, tumor size, WHO grade, histopathological type, main location, epilepsy or not, specific location in all lobes, and hemisphere of tumor tissue). Therefore, a total of 2797 features (6 metrics × 4 networks × 116 regions + 13 clinical features) were used for each subject. The number of features is much greater than the number of samples (68 subjects), which makes machine learning-based methods prone to overfitting and to interference from noise. Thus, we design a three-stage


feature selection framework, as specified below, to select the most relevant features for our classification (i.e., prediction) problem.

2.3 Three-Stage Feature Selection

To identify a small number of features that are optimal for treatment outcome prediction, we propose a three-stage feature selection method that gradually selects the most relevant features.

• First stage. We roughly select the features that significantly distinguish the two outcome groups (i.e., “bad” and “good”) using two-sample t-tests with p < 0.05.
• Second stage. RELIEFF [10] is used to rank the remaining features and compute their weights; the ranked feature list is denoted X. RELIEFF is an algorithm that estimates feature quality in classification. Many heuristic measures of feature quality assume that features are independent, whereas in practice they may be dependent. RELIEFF can correctly estimate the quality of each feature in classification problems with strong dependencies among features. Its main idea is to estimate how well each feature distinguishes a sample from its neighbors that belong to other classes. Given a randomly selected sample R, RELIEFF first searches for its k nearest neighbors from the same class as R (called nearest hits H) and its k nearest neighbors from a different class (called nearest misses M). Then, it updates the quality estimate W(A) of every feature A based on the distance from R to H and the distance from R to M. The features can therefore be ranked in X in descending order of W(A).
• Third stage. A sequential backward selection [11] strategy was applied to carefully select a small group of significant features from X. An inner SVM was wrapped into the feature selection framework to evaluate the predictive accuracy of each candidate subset of features using leave-one-out cross validation. Sequential backward selection removes one feature at a time from X, from back to front (i.e., starting with the lowest-ranked feature). The classification accuracy is recorded for the remaining subset of X. When no feature is left, the selection process stops and the subset of X with the highest classification accuracy is selected.

Next, the selected features are fed into an outer SVM with leave-one-out cross validation to build the prediction model. To test which features are more useful for outcome prediction, we conducted five experiments in which different features were combined in different ways for classification (see Sect. 3).
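The third stage can be sketched as a loop over prefixes of the ranked list, as below. This is a minimal sketch under stated assumptions: `backward_select` and `score_fn` are hypothetical names, `score_fn` stands in for the inner SVM's leave-one-out accuracy, the toy accuracies are made up, and breaking ties in favor of the smaller subset is an assumption not stated in the text.

```python
from typing import Callable, List, Sequence, Tuple


def backward_select(
    ranked: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Drop the lowest-ranked feature one at a time and keep the best subset."""
    current = list(ranked)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > 1:
        current = current[:-1]        # remove from the back of the ranking
        score = score_fn(current)
        if score >= best_score:       # assumption: prefer fewer features on ties
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-in for the inner SVM accuracy, indexed by subset size.
toy_acc = {5: 0.62, 4: 0.66, 3: 0.74, 2: 0.74, 1: 0.60}


def toy_score(features: List[str]) -> float:
    return toy_acc[len(features)]


ranked_features = ["f1", "f2", "f3", "f4", "f5"]  # already sorted by weight
subset, acc = backward_select(ranked_features, toy_score)
print(subset, acc)
```

Because features are only ever removed from the end of the ranking, at most one candidate subset is evaluated per feature, which keeps the wrapper affordable when each evaluation is a full leave-one-out run.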

3 Results

The outcome prediction accuracy of our proposed prediction framework is displayed in Table 1. Using only clinical features, the prediction accuracy only reaches 63.2 %. Notably, when using only the features from functional networks, the accuracy increases


to 72.1 %. When structural network features are combined with functional ones, the classification rate reaches its peak (75 %, better than when only clinical features are used). However, no improvement was noted when clinical features were further added, which suggests that the information contained in the clinical features is already represented in the graph-theoretic features of the brain functional and structural networks. To test whether our results could have been obtained by chance, we also performed a permutation test with 30 permutations. The p-value of the permutation test was 0.015 and the mean accuracy over the 30 permutations was 49.1 %, which means that our results reflect the intrinsic properties of the data to some degree. The most significant features shown in Table 2 (also drawn in Fig. 2) are those that were selected by our three-stage feature selection strategy more than 60 times out of 68 trials.

Table 1. Prediction accuracy of using different sets of features.

Features                            Accuracy (%)  Sensitivity (%)  Specificity (%)
Clinical information                63.2          61.8             64.7
Structural network                  69.1          64.7             73.5
Functional network                  72.1          70.6             73.5
Functional + Structural networks    75            82.4             67.6
Functional + Structural + Clinical  75            82.4             67.6
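A label-permutation test of the kind mentioned in the results can be sketched as follows. This is a toy illustration, not the authors' procedure: `permutation_p_value`, `accuracy_fn`, and the fixed prediction vector are hypothetical, the classifier is not retrained on each permutation as it would be in practice, and the (count + 1) / (n + 1) p-value convention is an assumption that may differ from how the reported p-value was obtained.

```python
import random
from typing import Callable, List, Sequence, Tuple


def permutation_p_value(
    labels: Sequence[int],
    observed_acc: float,
    accuracy_fn: Callable[[List[int]], float],
    n_perm: int = 30,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (p-value, mean null accuracy) from label shuffling."""
    rng = random.Random(seed)
    null_accs: List[float] = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)                  # break the label/feature link
        null_accs.append(accuracy_fn(shuffled))
    n_ge = sum(1 for a in null_accs if a >= observed_acc)
    p_value = (n_ge + 1) / (n_perm + 1)        # assumed convention
    return p_value, sum(null_accs) / n_perm


# Toy setup: 34 "bad" (0) and 34 "good" (1) subjects with fixed predictions
# that agree with the true labels in 28 + 23 = 51 of 68 cases.
true_labels = [0] * 34 + [1] * 34
fixed_pred = [0] * 28 + [1] * 6 + [0] * 11 + [1] * 23


def agreement(labels: List[int]) -> float:
    return sum(int(p == y) for p, y in zip(fixed_pred, labels)) / len(labels)


observed = agreement(true_labels)
p, null_mean = permutation_p_value(true_labels, observed, agreement, n_perm=30)
print(observed, p, null_mean)
```

In a real analysis, `accuracy_fn` would rerun the whole pipeline, including feature selection, on the permuted labels, so that the null distribution reflects any optimism introduced by the selection step.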

Table 2. The most useful features for outcome prediction.

Network metrics: Clustering coefficient; Shortest path length; Global efficiency; Degree; Betweenness
Predictive ROIs from fMRI / Predictive ROIs from DTI: CUN R; PAL R^a; PAL R; IFGoper R; CER9 R; MFG R, IFGoper R; CER9 R, PAL R; ACG R, PoCG R

a For the full names of the brain regions, please see [4]. R: right side; L: left side.

Fig. 2. Discriminative ROIs with high predictive power in functional and structural brain networks, respectively.


As reported in many previous studies, the most useful regions for HGG outcome prediction are highly correlated with movement, cognition, emotion, language and memory functions. The deteriorated structural and functional connections to these regions could influence the survival time. The most frequently selected ROIs from functional network are those in the cerebellum, which have dense functional connectivity to the neocortex and are closely associated with motor and cognitive functions. However, the most frequently selected ROIs from structural network are mostly located in the cortex and less overlapped with each other, which may indicate that the structural network is easily affected by brain tumors.

4 Conclusion and Future Works

In this paper, we have shown that complex brain network analysis based on graph theory is a powerful tool for treatment outcome prediction in high-grade glioma patients. Our findings highlight the relevance of integrating functional and structural brain connectomics for HGG outcome prediction. Although the relationship between structural and functional brain networks is still poorly understood, our prediction framework benefited markedly from the use of brain connectomics for prognosis evaluation. In future work, we will incorporate global graph properties (e.g., the averaged clustering coefficient or network efficiency across all brain regions) as new features, so that the individual heterogeneity of tumor characteristics can be better addressed. More advanced graph metrics, e.g., assortativity, modularity, and rich-club value, can also be taken into account for a more comprehensive network measurement. Moreover, intraoperatively derived features, e.g., the extent of tumor resection, can be integrated as important prognostic predictors.

Acknowledgement. This work is supported by National Natural Science Foundation of China (NSFC) Grants (Nos. 61473190, 61401271, 81471733).


Mammographic Mass Segmentation with Online Learned Shape and Appearance Priors

Menglin Jiang1, Shaoting Zhang2(B), Yuanjie Zheng3, and Dimitris N. Metaxas1

1 Department of Computer Science, Rutgers University, Piscataway, NJ, USA
2 Department of Computer Science, UNC Charlotte, Charlotte, NC, USA [email protected]
3 School of Information Science and Engineering, Shandong Normal University, Jinan, China

Abstract. Automatic segmentation of mammographic masses is an important yet challenging task. Despite the great success of shape priors in biomedical image analysis, existing shape modeling methods are not suitable for mass segmentation. The reason is that masses have no specific biological structure and exhibit complex variation in shape, margin, and size. In addition, it is difficult to preserve the local details of mass boundaries, as masses may have spiculated and obscure boundaries. To solve these problems, we propose to learn online shape and appearance priors via image retrieval. In particular, given a query image, its visually similar training masses are first retrieved via Hough voting of local features. Then, query-specific shape and appearance priors are calculated from these training masses on the fly. Finally, the query mass is segmented using these priors and graph cuts. The proposed approach is extensively validated on a large dataset built on DDSM. Results demonstrate that our online learned priors lead to substantial improvement in mass segmentation accuracy compared with previous systems.

1 Introduction

For years, mammography has played a key role in the diagnosis of breast cancer, which is the second leading cause of cancer-related death among women. The major indicators of breast cancer are masses and microcalcifications. Mass segmentation is important to many clinical applications. For example, it is critical to the diagnosis of masses, since morphological and spiculation characteristics derived from the segmentation result are strongly correlated with mass pathology [2]. However, mass segmentation is very challenging, since masses vary substantially in shape, margin, and size, and they often have obscure boundaries [7].

During the past two decades, many approaches have been proposed to facilitate mass segmentation [7]. Nevertheless, few of them adopt shape and appearance priors, which provide promising directions for many other biomedical image segmentation problems [12,13], such as segmentation of the human lung, liver, prostate, and hippocampus. In mass segmentation, the absence of studies on shape and appearance priors is mainly due to two reasons. First, unlike the aforementioned organs/objects, mammographic masses have no specific biological structure, and they present large variation in shape, margin, and size. Naturally, it is very hard to construct shape or appearance models for masses [7]. Second, masses are often indistinguishable from surrounding tissues and may have greatly spiculated margins. Therefore, it is difficult to preserve the local details of mass boundaries.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 35–43, 2016. DOI: 10.1007/978-3-319-46723-8_5

To solve the above problems, we propose to incorporate image retrieval into mammographic mass segmentation, and learn “customized” shape and appearance priors for each query mass. The overview of our approach is shown in Fig. 1. Specifically, during the offline process, a large number of diagnosed masses form a training set. SIFT features [6] are extracted from these masses and stored in an inverted index for fast retrieval [11]. During the online process, a given query mass is first matched with all the training masses through Hough voting of SIFT features [10] to find the most similar ones. A similarity score is also calculated to measure the overall similarity between the query mass and its retrieved training masses. Then, shape and appearance priors are learned from the retrieved masses on the fly, which characterize the global shape and local appearance information of these masses. Finally, the two priors are integrated in a segmentation energy function, and their weights are automatically adjusted using the aforesaid similarity score. The query mass is segmented by solving the energy function via graph cuts [1].

Fig. 1. Overview of our approach (panels: Query Mass, Training Masses, Retrieved Training Masses, Shape Prior, Appearance Prior, Query Mass Boundary). The blue lines around training masses denote radiologist-labeled boundaries, and the red line on the rightmost image denotes our segmentation result.

In mass segmentation, our approach has several advantages over existing online shape prior modeling methods, such as atlas-based methods [12] and sparse shape composition (SSC) [13]. First, these methods are generally designed for organs/objects with anatomical structures and relatively simple shapes.
When dealing with mass segmentation, some assumptions of those methods, such as correspondence between organ landmarks, are violated, and the results thus become unreliable. In contrast, our approach adopts a retrieval method dedicated to handling objects with complex shape variations. Therefore, it can obtain effective shape priors for masses. Second, since the retrieved training masses are similar to the query mass in terms of not only global shape but also local appearance, our approach incorporates a novel appearance prior, which complements the shape prior and preserves the local details of mass boundaries. Finally, the priors’ weights in our segmentation energy function are automatically adjusted using the similarity between the query mass and its retrieved training masses, which makes our approach even more adaptive.

2 Methodology

In this section, we first introduce our mass retrieval process, and then describe how to learn shape and appearance priors from the retrieval result, followed by our mass segmentation method. The framework of our approach is illustrated in Fig. 1.

Mass Retrieval Based on Hough Voting: Our approach characterizes mammographic images with densely sampled SIFT features [6], which demonstrate excellent performance in mass retrieval and analysis [4]. To accelerate the retrieval process, all the SIFT features are quantized using the bag-of-words (BoW) method [4,11], and the quantized SIFT features extracted from the training set are stored in an inverted index [4]. The training set D comprises a series of samples, each of which contains a diagnosed mass located at the center. A training mass $d \in D$ is represented as $d = \{(v_j^d, p_j^d)\}_{j=1}^{n}$, where n is the number of features extracted from d, $v_j^d$ denotes the j-th quantized feature (visual word ID), and $p_j^d = [x_j^d, y_j^d]^T$ denotes the relative position of $v_j^d$ from the center of d (the coordinate origin is at the mass center). The query set Q includes a series of query masses. Note that query masses are not necessarily located at image centers. A query mass $q \in Q$ is represented as $q = \{(v_i^q, p_i^q)\}_{i=1}^{m}$, where $p_i^q = [x_i^q, y_i^q]^T$ denotes the absolute position of $v_i^q$ (the origin is at the upper left corner of the image since the position of the mass center is unknown).

Given a query mass q, it is matched with all the training masses. In order to find similar training masses with different orientations or sizes, all the training masses are virtually transformed using 8 rotation degrees (from 0 to 7π/4) and 8 scaling factors (from 1/2 to 2). To this end, we only need to re-calculate the positions of SIFT features, since SIFT is invariant to rotation and scale change [6]. For the given query mass q and any (transformed) training mass d, we calculate a similarity map $S^{q,d}$, a similarity score $s^{q,d}$, and the position of the query mass center $c^{q,d}$. $S^{q,d}$ is a matrix of the same size as q, and its element at position p, denoted as $S^{q,d}(p)$, indicates the similarity between the region of q centered at p and d.

The matching process is based on generalized Hough voting of SIFT features [10], which is illustrated in Fig. 2. The basic idea is that, if q matches d, the features should be quantized to the same visual words and be spatially consistent (i.e., have similar positions relative to mass centers). In particular, given a pair of matched features $v_i^q = v_j^d = v$, the absolute position of the query mass center, denoted as $c_i^q$, is first computed based on $p_i^q$ and $p_j^d$. Then $v_i^q$ updates the similarity map $S^{q,d}$. To resist gentle nonrigid deformation, $v_i^q$ votes in favor of not only $c_i^q$ but also the neighbors of $c_i^q$. $c_i^q$ earns a full vote, and each neighbor gains a vote weighted by a Gaussian factor:

$$S^{q,d}(c_i^q + \delta p) \mathrel{+}= \frac{\mathrm{idf}^2(v)}{\mathrm{tf}(v,q)\,\mathrm{tf}(v,d)} \exp\left(-\frac{\|\delta p\|^2}{2\sigma^2}\right), \tag{1}$$

where $\delta p$ represents the displacement from $c_i^q$ to its neighbor, σ determines the order of $\delta p$, tf(v, q) and tf(v, d) are the term frequencies (TFs) of v in q and d respectively, and idf(v) is the inverse document frequency (IDF) of v. TF-IDF reflects the importance of a visual word to an image in a collection of images, and is widely adopted in BoW-based image retrieval methods [4,10,11]. The cumulative votes of all the feature pairs generate the similarity map $S^{q,d}$. The largest element in $S^{q,d}$ is defined as the similarity score $s^{q,d}$, and the position of the largest element is defined as the query mass center $c^{q,d}$.

Fig. 2. Illustration of our mass matching algorithm (panels: Query Mass and Retrieved Mass, each with its matched SIFT features). The blue lines denote the mass boundaries labeled by radiologists. The dots indicate the positions of the matched SIFT features, which are spatially consistent. The arrows denote the relative positions of the training features to the center of the training mass. The center of the query mass is localized by finding the maximum element in $S^{q,d}$.

After computing the similarity scores between q and all the (transformed) training masses, the top k most similar training masses along with their diagnostic reports are returned to radiologists. These masses are referred to as the retrieval set of q, which is denoted as $N_q$. The average similarity score of these k training masses is denoted as $\omega = (1/k) \sum_{d \in N_q} s^{q,d}$. During the segmentation of q, ω measures the confidence of our shape and appearance priors learned from $N_q$.

Note that our retrieval method could find training masses which are similar in local appearance and global shape to the query mass. A match between a query feature and a training feature assures that the two local patches, from where the SIFT features are extracted, have similar appearances. Besides, the spatial consistency constraint guarantees that two matched masses have similar shapes and sizes. Consequently, the retrieved training masses could guide segmentation of the query mass. Moreover, due to the adoption of the BoW technique and the inverted index, our retrieval method is computationally efficient.

Learning Online Shape and Appearance Priors: Given a query mass q, our segmentation method aims to find a foreground mask $L_q$. $L_q$ is a binary matrix of the size of q, and its element $L_q(p) \in \{0, 1\}$ indicates the label of the pixel at position p, where 0 and 1 represent background and mass respectively. Each retrieved training mass $d \in N_q$ has a foreground mask $L_d$. To align d with q, we simply copy $L_d$ to a new mask of the same size as $L_q$ and move the center of $L_d$ to the query mass center $c^{q,d}$. $L_d$ will hereafter denote the aligned foreground mask. Utilizing the foreground masks of the retrieved training masses in $N_q$, we could learn shape and appearance priors for q on the fly.

Shape prior models the spatial distribution of the pixels in q belonging to a mass. Our approach estimates this prior by averaging the foreground masks of the retrieved masses:

$$p_S(L_q(p) = 1) = \frac{1}{k} \sum_{d \in N_q} L_d(p), \qquad p_S(L_q(p) = 0) = 1 - p_S(L_q(p) = 1). \tag{2}$$

Appearance prior models how likely a small patch in q belongs to a mass. In our approach, a patch is a small region from where a SIFT feature is extracted, and it is characterized by its visual word (quantized SIFT feature). The probability of word v belonging to a mass is estimated on $N_q$:

$$p_A(L_q(p_v) = 1) = \frac{n_v^f}{n_v}, \qquad p_A(L_q(p_v) = 0) = 1 - p_A(L_q(p_v) = 1), \tag{3}$$

where $p_v$ is the position of word v, $n_v$ is the total number of times that v appears in $N_q$, and $n_v^f$ is the number of times that v appears in foreground masses.

It is noteworthy that our shape and appearance priors are complementary. In particular, shape prior tends to recognize mass centers, since the average foreground mask of the retrieved training masses generally has large scores around mass centers. Appearance prior, on the other hand, tends to recognize mass edges, as SIFT features extracted from mass edges are very discriminative [4]. Examples of shape and appearance priors are provided in Fig. 1.

Mass Segmentation via Graph Cuts with Priors: Our segmentation method computes the foreground mask $L_q$ by minimizing the following energy function:

$$\begin{aligned}
E(L_q) &= \lambda_1 E_I(L_q) + \lambda_2 \omega E_S(L_q) + \lambda_3 \omega E_A(L_q) + E_R(L_q) \\
&= -\lambda_1 \sum_p \ln p_I(I_q(p) \mid L_q(p)) - \lambda_2 \omega \sum_p \ln p_S(L_q(p)) \\
&\quad - \lambda_3 \omega \sum_p \ln p_A(L_q(p)) + \sum_{p,p'} \beta(L_q(p), L_q(p')),
\end{aligned} \tag{4}$$

where $I_q$ denotes the intensity matrix of q, and $I_q(p)$ represents the value of the pixel at position p. $E_I(L_q)$, $E_S(L_q)$, $E_A(L_q)$ and $E_R(L_q)$ are the energy terms related to intensity information, shape prior, appearance prior, and regularity constraint, respectively. $\beta(L_q(p), L_q(p'))$ is a penalty term for adjacent pixels with different labels. $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights for the first three energy terms, and the last term has an implicit weight of 1.
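As a rough illustration of Eqs. (2) and (3), the sketch below computes both priors for a toy retrieval set in plain Python. It is not the authors' implementation: the function names, the nested-list mask layout, and the `(word_id, inside_mask)` input pairs are assumptions made for this example only.

```python
def shape_prior(aligned_masks):
    """Eq. (2): per-pixel mean of the k aligned binary foreground masks."""
    k = len(aligned_masks)
    rows, cols = len(aligned_masks[0]), len(aligned_masks[0][0])
    return [[sum(m[r][c] for m in aligned_masks) / k for c in range(cols)]
            for r in range(rows)]


def appearance_prior(word_observations):
    """Eq. (3): fraction of occurrences of each visual word inside a mass.

    word_observations: iterable of (word_id, inside_mask) pairs collected
    from all retrieved training masses (hypothetical input format).
    """
    total, foreground = {}, {}
    for word, inside in word_observations:
        total[word] = total.get(word, 0) + 1
        foreground[word] = foreground.get(word, 0) + (1 if inside else 0)
    return {w: foreground[w] / total[w] for w in total}


# Toy retrieval set with k = 2 aligned 2x3 masks.
masks = [
    [[0, 1, 1],
     [0, 1, 0]],
    [[0, 1, 0],
     [0, 1, 0]],
]
p_shape = shape_prior(masks)

# Visual word 7 occurs four times (three of them inside a mass), word 9 twice.
observations = [(7, True), (7, True), (7, False), (7, True),
                (9, False), (9, False)]
p_app = appearance_prior(observations)
```

Since Eq. (4) uses the logarithm of these probabilities, an implementation would presumably have to keep them away from exactly 0 and 1 (for instance by clamping) before taking the log; the text does not discuss this detail.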


In particular, the intensity energy $E_I(L_q)$ evaluates how well $L_q$ explains $I_q$. It is derived from the total likelihood of the observed intensities given certain labels. Following conventions in radiological image segmentation [5,12], the foreground likelihood $p_I(I_q(p) \mid L_q(p) = 1)$ and background likelihood $p_I(I_q(p) \mid L_q(p) = 0)$ are approximated by a Gaussian density function and a Parzen window estimator respectively, and both are learned on the entire training set D. The shape energy $E_S(L_q)$ and appearance energy $E_A(L_q)$ measure how well $L_q$ fits the shape and appearance priors. The regularity energy $E_R(L_q)$ is employed to promote smooth segmentation. It calculates a penalty score for every pair of neighboring pixels (p, p′). Following [1,12], we compute this score using:

$$\beta(L_q(p), L_q(p')) = \frac{\mathbf{1}(L_q(p) \neq L_q(p'))}{\|p - p'\|} \exp\left(-\frac{(I_q(p) - I_q(p'))^2}{2\zeta^2}\right), \tag{5}$$

where $\mathbf{1}$ is the indicator function, and ζ determines the order of intensity difference. The above function assigns a positive score to (p, p′) only if they have different labels, and the score will be large if they have similar intensities and a short distance.

Similar to [12], we first plug Eqs. (2), (3) and (5) into the energy function Eq. (4), then convert it to the sum of unary potentials and pairwise potentials, and finally minimize it via graph cuts [1] to obtain the foreground mask $L_q$. Note that the overall similarity score ω is utilized to adjust the weights of the prior-related energy terms in Eq. (4). As a result, if there are similar masses in the training set, our segmentation method will rely on the priors. Otherwise, ω will be very small and Eq. (4) automatically degenerates to traditional graph cuts-based segmentation, which prevents ineffective priors from reducing the segmentation accuracy.
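Because ω is derived from the retrieval scores, it may help to see the voting of Eq. (1) spelled out. The following is a minimal sketch in plain Python rather than the authors' implementation. The function name `hough_vote`, the 3×3 voting neighbourhood (`radius=1`), the use of raw counts as term frequencies, integer pixel positions, and the centre estimate $c_i^q = p_i^q - p_j^d$ are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def hough_vote(query, train, idf, sigma=1.0, radius=1):
    """Accumulate the votes of Eq. (1) for one query/training pair.

    query: list of (word, (x, y)) with absolute positions in the query image.
    train: list of (word, (dx, dy)) with offsets from the training mass centre.
    idf:   dict mapping word -> inverse document frequency.
    Returns (similarity_map, best_score, best_centre).
    """
    tf_q = Counter(word for word, _ in query)  # raw counts (assumption)
    tf_d = Counter(word for word, _ in train)
    sim = defaultdict(float)
    for wq, (x, y) in query:
        for wd, (dx, dy) in train:
            if wq != wd:
                continue  # only features quantized to the same word vote
            cx, cy = x - dx, y - dy  # candidate centre, assumed p_q - p_d
            weight = idf[wq] ** 2 / (tf_q[wq] * tf_d[wq])
            for ox in range(-radius, radius + 1):
                for oy in range(-radius, radius + 1):
                    gauss = math.exp(-(ox * ox + oy * oy) / (2 * sigma ** 2))
                    sim[(cx + ox, cy + oy)] += weight * gauss
    if not sim:  # no matched features at all
        return sim, 0.0, None
    best_centre = max(sim, key=sim.get)
    return sim, sim[best_centre], best_centre


# Toy example: a training mass that appears shifted by (10, 20) in the query.
train = [(1, (-2, 0)), (2, (3, 1)), (3, (0, -4))]
query = [(w, (10 + dx, 20 + dy)) for w, (dx, dy) in train]
idf = {1: 1.0, 2: 1.0, 3: 1.0}
sim, score, centre = hough_vote(query, train, idf)
```

In this toy case all three matched features vote for the same centre, so the maximum of the map lies at (10, 20). The eight rotations and eight scales described in the text would be handled by transforming the training offsets before calling the function, and averaging the best scores of the top k training masses would give ω.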

3 Experiments

In this section, we first describe our dataset, then evaluate the performance of mass retrieval and mass segmentation using our approach.

Dataset: Our dataset builds on the digital database for screening mammography (DDSM) [3], which is currently the largest public mammogram database. DDSM comprises 2,604 cases, and every case consists of four views, i.e., LEFT-CC, LEFT-MLO, RIGHT-CC and RIGHT-MLO. The masses have diverse shapes, margins, sizes, breast densities as well as patients’ ages. They also have radiologist-labeled boundaries and diagnosed pathologies. To build our dataset, 2,340 image regions centered at masses are extracted. Our approach and the comparison methods are tested five times. Each time, 100 images are randomly selected to form the query set Q, and the remaining 2,240 images form the training set D. Q and D are selected from different cases in order to avoid positive bias. Below we report the average of the evaluation results over the five tests.


Fig. 3. Our segmentation results on four masses, which are represented by red lines. The blue lines denote radiologist-labeled mass boundaries. The left two masses are malignant (cancer), and the right two masses are benign.

Evaluation of Mass Retrieval: The evaluation metric adopted here is retrieval precision. In our context, precision is defined as the percentage of retrieved training masses that have the same shape category as the query mass. All the shape attributes are divided into two categories. The first category includes “irregular”, “lobulated”, “architectural distortion”, and their combinations. The second category includes “round” and “oval”. We compare our method with a state-of-the-art mass retrieval approach [4], which indexes quantized SIFT features with a vocabulary tree. The precision scores of both methods change slightly as the size of the retrieval set k increases from 1 to 30, and our method systematically outperforms the comparison method. For instance, at k = 20, the precision scores of our method and the vocabulary tree-based method are 0.85 ± 0.11 and 0.81 ± 0.14, respectively. Our precise mass retrieval method lays the foundation for learning accurate priors and improving mass segmentation performance.

Evaluation of Mass Segmentation: Segmentation accuracy is assessed by area overlap measure (AOM) and average minimum distance (DIST), which are two widely used evaluation metrics in medical image segmentation. Our method is tested with three settings, i.e., employing the shape prior, the appearance prior, and both priors. The three configurations are hereafter denoted as “Ours-Shape”, “Ours-App”, and “Ours-Both”. For all the configurations, we set k to the same value as in the mass retrieval experiments, i.e., k = 20. λ1, λ2 and λ3 are tuned through cross validation. Three classical and state-of-the-art mass segmentation approaches are implemented for comparison, which are based on active contour (AC) [8], convolution neural network (CNN) [5], and traditional graph cuts (GC) [9], respectively. The key parameters of these methods are tuned using cross validation. The evaluation results are summarized in Table 1.
A few segmentation results obtained by Ours-Both are provided in Fig. 3 for qualitative evaluation.

The above results lead to several conclusions. First, our approach could find visually similar training masses for most query masses and calculate effective shape and appearance priors. Therefore, Ours-Shape and Ours-App substantially surpass GC. Second, as noted earlier, the two priors are complementary: shape prior recognizes mass centers, whereas appearance prior is vital to keep the local details of mass boundaries. Thus, by integrating both priors, Ours-Both further improves the segmentation accuracy. Third, detailed results show that for some “uncommon” query masses, which have few similar training masses, the overall similarity score ω is very small and the segmentation results of Ours-Both are similar to those of GC. That is, the adaptive weights of priors successfully prevent ineffective priors from backfiring. Finally, Ours-Both outperforms all the comparison methods especially for masses with irregular and spiculated shapes. Its segmentation results have a close agreement with radiologist-labeled mass boundaries, and are highly consistent with mass pathologies.

Table 1. AOM and DIST (unit is mm) scores of the evaluated methods

       AC [8]        CNN [5]       GC [9]        Ours-Shape    Ours-App      Ours-Both
AOM    0.78 ± 0.12   0.73 ± 0.17   0.75 ± 0.14   0.81 ± 0.13   0.80 ± 0.10   0.84 ± 0.09
DIST   1.09 ± 0.43   1.36 ± 0.62   1.24 ± 0.54   0.97 ± 0.49   1.01 ± 0.45   0.88 ± 0.47
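The text does not spell out the formulas behind the two scores in Table 1. The sketch below assumes definitions that are common in the segmentation literature: AOM as the intersection of the two regions divided by their union, and DIST as the symmetric mean of nearest-point distances between the two boundaries. Both definitions, the function names, and the pixel-unit distances (a pixel spacing would be needed to obtain mm) are assumptions and may differ from what the authors used.

```python
import math


def aom(mask_a, mask_b):
    """Area overlap measure, assumed here to be intersection over union."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0


def average_min_distance(contour_a, contour_b):
    """Symmetric mean of nearest-point distances between two contours.

    Contours are lists of (x, y) boundary points; the symmetric form is an
    assumption, since the text does not define DIST explicitly.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(contour_a, contour_b) + one_way(contour_b, contour_a))


truth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
result = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
overlap = aom(truth, result)  # 2 shared pixels out of 6 in the union

square = [(0, 0), (0, 2), (2, 2), (2, 0)]
shifted = [(x + 1, y) for x, y in square]
dist = average_min_distance(square, shifted)
```

Under these assumed definitions, a higher AOM and a lower DIST both indicate closer agreement with the radiologist-labeled boundary, which matches the direction of the improvements reported in Table 1.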

4 Conclusion

In this paper, we leverage an image retrieval method to learn query-specific shape and appearance priors for mammographic mass segmentation. Given a query mass, similar training masses are found via Hough voting of SIFT features, and priors are learned from these masses. The query mass is segmented through graph cuts with priors, where the weights of the priors are automatically adjusted according to the overall similarity between the query mass and its retrieved training masses. Extensive experiments on DDSM demonstrate that our online learned priors considerably improve the segmentation accuracy, and that our approach outperforms several widely used mass segmentation methods or systems. Future endeavors will be devoted to distinguishing between benign and malignant masses using features derived from mass boundaries.

References

1. Boykov, Y.Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222–1239 (2001)
2. D’Orsi, C.J., Sickles, E.A., Mendelson, E.B., et al.: ACR BI-RADS Atlas, Breast Imaging Reporting and Data System, 5th edn. American College of Radiology, Reston (2013)
3. Heath, M., Bowyer, K., Kopans, D., Moore, R., Kegelmeyer, W.P.: The digital database for screening mammography. In: Proceeding IWDM, pp. 212–218 (2000)
4. Jiang, M., Zhang, S., Li, H., Metaxas, D.N.: Computer-aided diagnosis of mammographic masses using scalable image retrieval. IEEE Trans. Biomed. Eng. 62(2), 783–792 (2015)
5. Lo, S.B., Li, H., Wang, Y.J., Kinnard, L., Freedman, M.T.: A multiple circular paths convolution neural network system for detection of mammographic masses. IEEE Trans. Med. Imaging 21(2), 150–158 (2002)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
7. Oliver, A., Freixenet, J., Martí, J., Pérez, E., Pont, J., Denton, E.R.E., Zwiggelaar, R.: A review of automatic mass detection and segmentation in mammographic images. Med. Image Anal. 14(2), 87–110 (2010)
8. Rahmati, P., Adler, A., Hamarneh, G.: Mammography segmentation with maximum likelihood active contours. Med. Image Anal. 16(6), 1167–1186 (2012)
9. Saidin, N., Sakim, H.A.M., Ngah, U.K., Shuaib, I.L.: Computer aided detection of breast density and mass, and visualization of other breast anatomical regions on mammograms using graph cuts. Comput. Math. Methods Med. 2013(205384), 1–13 (2013)
10. Shen, X., Lin, Z., Brandt, J., Avidan, S., Wu, Y.: Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking. In: Proceeding CVPR, pp. 3013–3020 (2012)
11. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: Proceeding ICCV, pp. 1470–1477 (2003)
12. van der Lijn, F., den Heijer, T., Breteler, M.M.B., Niessen, W.J.: Hippocampus segmentation in MR images using atlas registration, voxel classification, and graph cuts. NeuroImage 43(4), 708–720 (2008)
13. Zhang, S., Zhan, Y., Zhou, Y., Uzunbas, M., Metaxas, D.N.: Shape prior modeling using sparse representation and online dictionary learning. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7512, pp. 435–442. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33454-2_54

Differential Dementia Diagnosis on Incomplete Data with Latent Trees

Christian Ledig1(B), Sebastian Kaltwang1, Antti Tolonen2, Juha Koikkalainen3, Philip Scheltens4, Frederik Barkhof4, Hanneke Rhodius-Meester4, Betty Tijms4, Afina W. Lemstra4, Wiesje van der Flier4, Jyrki Lötjönen3, and Daniel Rueckert1

1 Department of Computing, Imperial College London, London, UK [email protected]
2 VTT Technical Research Centre of Finland, Tampere, Finland
3 Combinostics Ltd., Tampere, Finland
4 Department of Neurology, VU University Medical Center, Amsterdam, The Netherlands

Abstract. Incomplete patient data is a substantial problem that is not sufficiently addressed in current clinical research. Many published methods assume both completeness and validity of study data. However, this assumption is often violated as individual features might be unavailable due to missing patient examination or distorted/wrong due to inaccurate measurements or human error. In this work we propose to use the Latent Tree (LT) generative model to address current limitations due to missing data. We show on 491 subjects of a challenging dementia dataset that LT feature estimation is more robust towards incomplete data as compared to mean or Gaussian Mixture Model imputation and has a synergistic effect when combined with common classifiers (we use SVM as example). We show that LTs allow the inclusion of incomplete samples into classifier training. Using LTs, we obtain a balanced accuracy of 62 % for the classification of all patients into five distinct dementia types even though 20 % of the features are missing in both training and testing data (68 % on complete data). Further, we confirm the potential of LTs to detect outlier samples within the dataset.

Keywords: Latent Trees · Differential diagnosis · Dementia · Incomplete data

1 Introduction

The accurate diagnosis of neurodegenerative diseases is a prerequisite to apply efficient treatment strategies, insofar as available, or recruit homogeneous study cohorts [6]. Many studies have shown that visually assessed criteria and a battery of quantitative features extracted from brain magnetic resonance imaging (MRI) have the potential to discriminate between different types of dementia [2]. Most published studies that address this classification problem assume a complete data set in the sense that all features are available for all samples. In practice this assumption does not hold as certain examinations might have been missed due to high measurement cost or missing patient consent [13]. However, many discriminative classifiers, such as SVMs, require training and testing data where a full set of features is available for every sample.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 44–52, 2016. DOI: 10.1007/978-3-319-46723-8_6

A common strategy to account for unavailable features is the removal of incomplete samples from the study cohort [13,14]. However, the exclusion of data not only reduces statistical power, but is also of ethical concern as acquired subject data remains unused. Other proposed approaches rely on feature imputation, such as replacing a missing feature with the feature’s mean, or model-based feature estimation using Gaussian Mixture Models (GMMs) [7,11,12,14]. In so-called “hot-deck” imputation, missing features are replaced by those of similar complete samples [14]. Feature imputation can also be considered as a matrix completion problem [3,13] or tackled with genetic algorithms and neural networks [1,11]. Care needs to be taken when features are not missing at random to avoid the introduction of bias [7,15]. The performance of imputation approaches is ideally assessed by both the feature error and the classification accuracy on the imputed features. The latter is usually the main objective [7,15].

In this paper, we adapt the recently developed Latent Tree (LT) structure learning¹ [9] to estimate missing features before applying a discriminative classifier. The proposed approach is applicable to the two common scenarios where features are missing in (1) the testing data or (2) both the training and testing data. The basic idea of the approach is summarised in Fig. 1. LT learns a hierarchical model of latent variables and thus is able to discover a hidden dependence structure within the features. In contrast, GMMs assume that all features depend on a single latent variable.
LTs can thus exploit the learned structure to provide more accurate estimates of missing features. In comparison to other LT learning methods [5,8], the approach of [9] imposes fewer restrictions on the features (distribution, tree structure) while allowing for an efficient optimisation. The main contributions of this paper are (a) a formulation of the LT model that is trainable on incomplete data; (b) feature imputation using LTs and subsequent combination with a discriminative classifier (SVM); (c) evaluation on a novel dementia cohort for the differential diagnosis of five dementia types under missing features; (d) a proof of concept that LTs are suitable for detecting candidate outlier samples within the data set. In Sect. 2, we describe an LT model that can handle missing data. In Sect. 3 we compare its performance to a baseline mean imputation and the widely used GMM estimation.
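For reference, the mean-imputation baseline mentioned above amounts to little more than the following. This is a minimal sketch in plain Python; the function name and the convention of marking missing values with `None` are assumptions made for the example.

```python
def mean_impute(samples):
    """Replace each None by the mean of the observed values of that feature."""
    n_features = len(samples[0])
    means = []
    for m in range(n_features):
        observed = [row[m] for row in samples if row[m] is not None]
        # A feature missing in every sample has no mean; it stays None.
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[m] if x is None else x for m, x in enumerate(row)]
            for row in samples]


data = [[1.0, 10.0],
        [3.0, None],
        [None, 30.0]]
completed = mean_impute(data)
```

In a train/test setting the feature means would normally be estimated on the training samples only and then reused for the testing samples; the sketch computes them on whatever set it is given.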

2 Method

This work addresses the classification problem of inferring the disease condition state y from M features $X = \{x_1, ..., x_M\}$, while only an observed subset of the features $O \subseteq X$ is available. Each feature $x_m$ ($m \in \{1, ..., M\}$) can either be continuous (for attributes like structural volumes) or categorical with the number of states $K_m$ (for attributes like gender). Which features are observed, i.e. the composition of O, varies between samples and is unknown a priori for the testing data. Since any feature might be missing in any of the samples, it is not possible to find a subset of features that is observed for all samples. Only in the case of complete data (i.e. O = X) can we use any of the established classification methods (in this work we use SVM). Thus we propose to first complete the partial data O → X using LT and then classify X → y using SVM.

Let the unobserved set of features be $U = X \setminus O$. We proceed by training LT to model the density p(X), i.e. we treat each $x_m$ as a random variable. During testing, the observed features O are completed by inferring the unseen features U using the maximum likelihood solution $u_m^{pred} = \arg\max_{u_m} p(u_m \mid O)$ for each $u_m \in U$. Once all features are completed, we can proceed with established classification methods to obtain y.

Fig. 1. A data model is trained using the mean (blue), GMMs (orange) or LTs (yellow) to complete missing data (red). The trained models are employed to estimate missing features in the training/testing data. A discriminative classifier (e.g. SVM) is then trained and employed for testing on samples with the complete feature set available.

¹ Implementation available at: https://github.com/kaltwang/2015latent.

2.1 Latent Trees (LT)

LT specifies a graphical model to represent the distribution $p(X)$ by introducing additional latent random variables $H = \{h_1, \ldots, h_L\}$. Each node of the graphical model corresponds to a single variable from $X \cup H$ and edges correspond to conditional probability distributions. In order to keep inference tractable, the graph structure is limited to a tree. All $x_m$ are leaves of the tree and the distribution of each node $x_m$ or $h_l$ is conditioned on its parent $h_{P(m)}$ or $h_{P(l)}$, respectively ($l \in \{1, \ldots, L\}$). The tree structure is learned from data and represented by the function $P(\cdot)$, which assigns the parent to each node or the empty set $\emptyset$ if the node is a root. For discrete observed nodes and hidden nodes $h_l$, the conditional distribution is categorical:

$$p(h_l | h_{P(l)} = k) = \mathrm{Cat}(h_l; \boldsymbol{\mu}_{k,l}) \tag{1}$$

Here $k \in \{1, \ldots, K\}$, and $\mathrm{Cat}(h; \boldsymbol{\mu})$ is a categorical distribution over $h \in \{1, \ldots, K\}$ with the parameter $\boldsymbol{\mu} \in \mathbb{R}^K$, $\boldsymbol{\mu}(k) \ge 0$ and $\sum_k \boldsymbol{\mu}(k) = 1$. For observed nodes $K$ is determined by the feature type (cf. Sect. 3.1) and for all hidden nodes $K$ is set to $K^{\text{hid}}$. The conditional distribution is Gaussian for continuous observed nodes $x_m$:

$$p(x_m | h_{P(m)} = k) = \mathcal{N}(x_m; \mu_{k,m}, \sigma^2_{k,m}) \tag{2}$$

Here $\mathcal{N}(x; \mu, \sigma^2)$ is a Gaussian distribution over $x \in \mathbb{R}$ with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in \mathbb{R}^+$. The tree root $h_r$ has no parent and therefore is not conditioned on another node, which means that the distribution is a prior: $p(h_r | h_{P(r)}) = p(h_r | \emptyset) = \mathrm{Cat}(h_r; \boldsymbol{\mu}_r)$. Then the joint distribution of the whole tree is

$$p(X, H) = \prod_{m,l} p(x_m | h_{P(m)}) \, p(h_l | h_{P(l)}) \tag{3}$$

Given $N$ datapoints $X^{(1)}, \ldots, X^{(N)}$, the marginal log-likelihood of the complete data is

$$L = \sum_n \ln p(X^{(n)}) = \sum_n \ln \sum_H p(X^{(n)}, H) \tag{4}$$

LT training optimises $L$ on the training data by applying a structural EM procedure that iteratively optimises the tree structure and the parameters of the conditional probability distributions; for details see [9]. An example tree structure is shown in Fig. 2.
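To make Eqs. 1 to 4 concrete, the following minimal sketch evaluates the marginal log-likelihood of a small, fixed tree by passing messages from the leaves to the root. It is an illustration only and not the implementation of [9]: the dictionary layout, the function names (`upward_message`, `log_likelihood`, `leaf`) and all numeric parameters are assumptions made for this example, and neither structure learning nor EM is shown.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of the Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def upward_message(node, sample):
    """Return a list whose k-th entry is p(observed leaves below node | parent state k)."""
    if node["kind"] == "gauss":
        # Continuous leaf, Eq. (2): one Gaussian per parent state.
        x = sample[node["name"]]
        return [gauss_pdf(x, m, v) for m, v in zip(node["mean"], node["var"])]
    # Hidden node, Eq. (1): cpt[k][j] = p(h = j | parent = k).
    child_msgs = [upward_message(child, sample) for child in node["children"]]
    n_states = len(node["cpt"][0])
    below = [math.prod(msg[j] for msg in child_msgs) for j in range(n_states)]
    return [sum(row[j] * below[j] for j in range(n_states)) for row in node["cpt"]]

def log_likelihood(root, samples):
    """Marginal log-likelihood of Eq. (4). The root's cpt has a single row (its prior)."""
    return sum(math.log(upward_message(root, s)[0]) for s in samples)

def leaf(name, mean, var):
    """Hypothetical helper that builds a continuous leaf node."""
    return {"kind": "gauss", "name": name, "mean": mean, "var": var}

# Toy tree with three latent variables and four leaves (made-up parameters).
h2 = {"kind": "hidden", "cpt": [[0.9, 0.1], [0.2, 0.8]],
      "children": [leaf("x1", [0.0, 2.0], [1.0, 1.0]),
                   leaf("x2", [1.0, -1.0], [0.5, 0.5])]}
h3 = {"kind": "hidden", "cpt": [[0.7, 0.3], [0.1, 0.9]],
      "children": [leaf("x3", [0.0, 1.0], [1.0, 2.0]),
                   leaf("x4", [-2.0, 2.0], [1.0, 1.0])]}
h1 = {"kind": "hidden", "cpt": [[0.4, 0.6]], "children": [h2, h3]}

toy_samples = [{"x1": 0.1, "x2": 0.9, "x3": -0.2, "x4": -1.5},
               {"x1": 2.2, "x2": -1.1, "x3": 1.3, "x4": 2.4}]
toy_loglik = log_likelihood(h1, toy_samples)
```

Because the graph is a tree, the sum over all hidden configurations in Eq. 4 factorises along the edges, so the cost grows linearly with the number of nodes rather than exponentially with the number of latent variables.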

[Figure 2, two panels. "Latent Tree": latent variables h1, h2, h3 above feature variables x1, x2, x3, x4. "Gaussian Mixture": a single latent variable h above feature variables x1, x2, x3, x4.]

Fig. 2. Example LT and GMM structure for four feature variables x1 , ..., x4 . Here, LT has learned three latent variables, whereas GMM always includes only a single latent variable. The rightmost edges are labeled with the conditional probability distributions.

The application of LT in this work differs from [9] in three ways: (1) we use LT to predict features and not to classify the disease target (for this we use SVM), (2) we adapt LT to handle missing features rather than noisy features that have been replaced by a random value, and (3) we deal with missing features during both the training and testing stages of the model. In contrast, [9] only induces noisy features in the test data.

2.2

Handling of Missing Data

LT (and GMM) handle missing variables U by treating them as latent variables (equivalent to H) and derive p(um |O) during inference. For predicting the missing variables, the maximum likelihood estimate is obtained as specified in Sect. 2.

Since the parameter and structure learning algorithms only depend on the posterior marginal distributions obtained from inference, it is sufficient for the LT algorithm to deal with missing variables at the inference step. Treating missing values $U^{(n)}$ as latent variables leads to the modified log-likelihood optimisation target

$$L = \sum_n \ln \sum_{H, U^{(n)}} p(O^{(n)}, U^{(n)}, H) \tag{5}$$

In the complete data case (i.e. $O^{(n)} = X^{(n)}$ and $U^{(n)} = \emptyset$, $\forall n$) Eq. 5 is equivalent to Eq. 4.
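The effect of treating missing features as latent variables can be sketched on the simplest possible tree: one hidden root whose children are all leaves. This is an assumed toy setting, not the learned trees of this work. A missing leaf is summed (or integrated) out, which contributes a factor of 1, so it is simply skipped. The names `posterior_over_root` and `impute_categorical` and all parameters below are hypothetical.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of the Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_over_root(sample, prior, leaves):
    """Return (p(h | O), ln p(O)); entries of `sample` that are None are missing."""
    weights = []
    for k, p_k in enumerate(prior):
        w = p_k
        for name, spec in leaves.items():
            value = sample.get(name)
            if value is None:
                continue  # missing leaf: marginalising it out leaves a factor of 1
            if spec["kind"] == "gauss":
                w *= gauss_pdf(value, spec["mean"][k], spec["var"][k])
            else:  # categorical leaf, value is a state index
                w *= spec["probs"][k][value]
        weights.append(w)
    evidence = sum(weights)  # p(O), one term of the sum over n in Eq. (5)
    return [w / evidence for w in weights], math.log(evidence)

def impute_categorical(name, sample, prior, leaves):
    """Most probable state of a missing categorical feature: arg max_u p(u | O)."""
    post, _ = posterior_over_root(sample, prior, leaves)
    probs = leaves[name]["probs"]
    n_states = len(probs[0])
    p_u = [sum(post[k] * probs[k][j] for k in range(len(prior))) for j in range(n_states)]
    return max(range(n_states), key=p_u.__getitem__), p_u

# Made-up two-state model with one continuous and one binary feature.
toy_prior = [0.5, 0.5]
toy_leaves = {
    "volume": {"kind": "gauss", "mean": [0.0, 3.0], "var": [1.0, 1.0]},
    "gender": {"kind": "cat", "probs": [[0.9, 0.1], [0.2, 0.8]]},
}
state, p_u = impute_categorical("gender", {"volume": 3.0, "gender": None},
                                toy_prior, toy_leaves)
```

For a missing continuous feature, $p(u_m | O)$ is a mixture of Gaussians whose maximiser generally has no closed form, which is why only the categorical case is shown here.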

3

Experiments

3.1

Data and Setup

We study a total of 491 patients² from the Amsterdam Dementia Cohort who had visited the Alzheimer center of the VU University Medical Center. Images were acquired on MRI scanners at the field strengths 1, 1.5 or 3 T. All patients underwent a standardised work-up including a lumbar puncture and a battery of neurological and neuropsychological markers. Patients were subsequently diagnosed in a multidisciplinary consensus meeting into five categories: subjective cognitive decline (SCD), Alzheimer’s Dementia (AD), Frontotemporal-Lobe Dementia (FTD), Dementia with Lewy Bodies (DLB) and Vascular Dementia (VaD) according to standardised criteria. A detailed description of the data and the employed clinical disease criteria can be found in [10]. A brief overview and the distribution of the data are summarised in Table 1. For our experiments we consider 31 features in total, which are grouped in two sets: the first set (VIS) contains 13 biomarkers assessed during the clinical visit, and the second set (IMG) includes 18 features automatically derived from the MRI scans. In detail, the sets contain:

– VIS: Age, Gender (categorical variable: K = 2), Verhage 7-point scale for education (K = 7), years of education (YoE), mini-mental state examination (MMSE), Amyloid-β42, ApoE4 genotype (K = 5), Fazekas Score (K = 4), presence of lacunes in basal ganglia (K = 2), presence of infarcts (K = 2), 3 manually assessed atrophy measures (K = 5, for left/right medial temporal lobe and global cortex)
– IMG: 15 unnormalised volume measures (left/right/total HC, l/r Amy, l/r Ent, l/r inf lat Vent, l/r lat Vent, 3rd/4th Vent, l/r WM), 3 vascular burden measures (WM hyper-intensities total/adj., lacunar infarcts volume)

The data contains dependencies between features (e.g. between structural volumes), which LT is able to encode to enable an improved imputation.
² The dataset consists of 504 patients; 13 patients were excluded due to missing reference features that are required for the performed quantitative evaluation.

Table 1. Overview over patient data with reference diagnosis and age.

            Total       SCD        AD          FTD        DLB        VaD
N (♀)       491 (217)   116 (44)   219 (118)   89 (40)    47 (6)     20 (9)
Age (SD)    64 (± 8)    60 (± 9)   66 (± 7)    63 (± 7)   68 (± 9)   69 (± 6)

All experiments were evaluated with five repetitions (to account for random model initialisation) of five configurations (to account for randomly removed features) of a 10-fold cross-validation (CV), leading to 250 evaluations in total. Paired, two-sided Student’s t-tests were calculated on the results of the five configurations averaged over the five repetitions. Significant differences between LT/GMM and mean imputation ($p < 0.05^{m}/0.001^{M}$) or between LT and GMM ($^{g}/^{G}$) are indicated respectively. For the 5-class classification problem we employ libSVM [4] (linear, cost = 0.1) and calculate the balanced accuracy as $\mathrm{bACC} = \frac{1}{5} \sum_{r} \frac{M_{r,r}}{\sum_{c} M_{r,c}}$ from the confusion matrix $M$, where $r$ runs over the rows and $c$ over the columns. Here, features were normalised (zero-mean, unit-variance) based on the respective training data. The model parameter $K^{\text{hid}} \in [2; 20]$ was chosen for $\epsilon^{\text{test}} = 0.5$ (cf. Sect. 3.2) and set to the optimum value of $K^{\text{hid}} = 16$ for GMM and $K^{\text{hid}} = 5$ for LT. We compare LT with the baseline methods (1) mean imputation (Mean) and (2) GMM.

3.2

Predicting Missing Features

In a first experiment we investigate the performance of LTs to predict missing features. We simulated missing features by randomly removing a fraction $\epsilon^{\text{test}}$ of features of the testing data. The random selection of features to remove is applied per sample, i.e. each sample now includes different features. We measure the prediction error with respect to the true value of the removed features ($u^{\text{true}}_{n,m}$) by the normalised root mean squared error (NRMSE). The NRMSE is calculated for each feature $x_m$ over all $N$ samples as

$$\mathrm{NRMSE}_m = \sqrt{\frac{1}{N} \sum_n \left(\tilde{u}^{\text{true}}_{n,m} - \tilde{u}^{\text{pred}}_{n,m}\right)^2},$$

where $\tilde{u}_{n,m} = (u_{n,m} - \mu^{\text{all}}_m)/\sigma^{\text{all}}_m$ denotes feature $m$ of sample $n$ normalised by the feature statistics ($\mu^{\text{all}}_m$, $\sigma^{\text{all}}_m$) calculated on the whole dataset. The NRMSE is an error measure relative to the standard deviation $\sigma^{\text{all}}_m$ of each feature. E.g. an NRMSE of 0.5 means that the expected prediction error is 50 % of $\sigma^{\text{all}}_m$. For selected features the NRMSE is summarised in Table 2. The prediction error with respect to $\epsilon^{\text{test}}$ is illustrated in Fig. 3. LT significantly improves feature imputation as compared to mean replacement and GMMs. The advantage of LT reduces with increasing $\epsilon^{\text{test}}$ as not sufficient information remains to leverage the learned structure.

3.3

Disease Classification

We explored the effect of improved feature imputation on classification accuracy. We simulated missing features as in Sect. 3.2, but now either in the testing set ($\epsilon^{\text{test}}$) or in all available data ($\epsilon^{\text{all}}$). Classification results and NRMSE for varying $\epsilon^{\text{test}}$ are shown in Fig. 3 for mean, GMM and LT feature imputation. All approaches yield 68 % accuracy on complete data and drop to 20 % when all features are missing, equivalent to a random guess among five classes. For an increasing $\epsilon^{\text{test}}$ the NRMSE approaches 1, which means that the standard deviation of the error becomes equivalent to $\sigma^{\text{all}}_m$ of the respective feature. Classification accuracies for varying $\epsilon^{\text{test}}$ and $\epsilon^{\text{all}}$ are shown in Table 3. With 50 % of the features missing in both training and testing data, LT imputation still allows a high accuracy of 56.6 % (58.6 % if trained on complete data). The SVM model can better account for mean replacement during testing when it is also trained on mean-replaced data. This leads to an 8 % increase at $\epsilon^{\text{all}} = 50$ % in comparison to clean training data, even outperforming GMM. LT consistently outperforms both reference methods.

Table 2. Prediction error as $\sigma^{\text{all}}_m \times \mathrm{NRMSE}_m$ of selected features using mean, GMM or LT for feature completion with 20/50/70 % missing features in testing data.

Table 3. Balanced accuracy [%] with missing features in testing data (left) or in all data (middle) and the confusion matrix corresponding to $\epsilon^{\text{all}} = 0$ % (right).

Fig. 3. Classification accuracy (bACC, left) and prediction error (NRMSE, right) for an increasing factor of missing features in testing data for compared methods.
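The two measures plotted in Fig. 3, the balanced accuracy of Sect. 3.1 and the NRMSE of Sect. 3.2, can be sketched as follows. This is a minimal illustration with hypothetical function names; the balanced accuracy is written for any number of classes (the text uses five) and assumes that every class occurs at least once, so that no row sum is zero.

```python
import math

def balanced_accuracy(confusion):
    """Mean per-class recall; confusion[r][c] counts class-r samples predicted as c."""
    recalls = [row[r] / sum(row) for r, row in enumerate(confusion)]
    return sum(recalls) / len(recalls)

def nrmse(true_values, predicted_values, mean_all, std_all):
    """RMSE of one feature after normalising with dataset-wide mean and std."""
    squared = [((t - mean_all) / std_all - (p - mean_all) / std_all) ** 2
               for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))

# Two-class toy confusion matrix: recalls are 0.5 and 1.0.
toy_bacc = balanced_accuracy([[5, 5], [0, 10]])
# Errors of +/-1 on a feature with std 2 give an NRMSE of 0.5.
toy_nrmse = nrmse([1.0, 3.0], [2.0, 2.0], mean_all=0.0, std_all=2.0)
```

Averaging recalls instead of counting correct samples keeps the small classes (e.g. VaD with 20 patients in Table 1) from being dominated by the large ones.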

Fig. 4. Performance for detecting outlier samples after swapping two random features within 50 % of the testing samples (intra-sample swaps).

3.4

Application: Detection of Samples with Inconsistent Features

The LT model allows the calculation of $L$ (Eq. 5), which measures the likelihood that a given sample belongs to the distribution of the trained model. This is used to detect samples that contain inconsistent features. We suggest calculating $L$ for all samples in the training data to estimate $\mu^{L}_{\text{train}}$ and $\sigma^{L}_{\text{train}}$. We then calculate the Z-score for each testing sample as a measure of how well a sample fits the training distribution. Specifically, we classify each sample $n$ as an outlier if $(L_n - \mu^{L}_{\text{train}})/\sigma^{L}_{\text{train}} \le Z_{\text{Limit}}$. To investigate the applicability of this approach we simulated erroneous samples by swapping two random features within 50 % of the testing samples (intra-sample swaps). This simulates the common human error of inserting values in the wrong data field. Then we employed the proposed method to detect the samples with swapped features. The results are summarised in Fig. 4. Balanced accuracies are high (around 80 %) for a wide range of possible thresholds $Z_{\text{Limit}}$ and consistently higher using the LT model as compared to the reference GMM model. A high AUC ≈ 87 % (GMM AUC ≈ 83 %) confirms the feasibility of detecting outlier candidates. Note that there is an upper bound to the possible accuracy as feature swapping might lead to valid samples (e.g. swapping left and right hippocampal volumes), which are consistent with the training distribution.
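The Z-score rule above can be sketched in a few lines, assuming the per-sample log-likelihoods have already been computed by a trained model. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
import statistics

def fit_loglik_stats(train_logliks):
    """Mean and standard deviation of the training log-likelihoods."""
    # Assumption: population standard deviation; the text does not specify this.
    return statistics.mean(train_logliks), statistics.pstdev(train_logliks)

def flag_outliers(test_logliks, mu_train, sigma_train, z_limit):
    """True for each test sample whose Z-score is at or below the threshold."""
    return [(value - mu_train) / sigma_train <= z_limit for value in test_logliks]

# Made-up log-likelihood values for illustration.
mu, sigma = fit_loglik_stats([-10.0, -12.0, -11.0, -9.0, -13.0])
flags = flag_outliers([-11.0, -20.0], mu, sigma, z_limit=-2.0)
```

Only unusually low likelihoods are flagged, so the test is one-sided: a sample that fits the model better than the training average is never reported.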

4

Conclusion

We have shown that LT is a powerful model to incorporate incomplete data for differential dementia diagnosis. The generative nature of LT allows the classification of arbitrary, a-priori unknown targets and substantially boosts the performance of discriminative classifiers under missing data. LT can reveal candidate outlier samples and is superior to the comparison data imputation strategies in all conducted experiments. An open source implementation of LT is available (cf. Sect. 1).

Acknowledgments. This work received funding from the European Union’s Seventh Framework Programme under grant agreement no. 611005 (http://PredictND.eu). The work of S. Kaltwang was funded by H2020-EU.2.1.1.4. ref. 645094 (SEWA).

References

1. Abdella, M., Marwala, T.: The use of genetic algorithms and neural networks to approximate missing data in database. In: IEEE International Conference Computer Cybernetics, pp. 207–212 (2005)
2. Burton, E.J., Barber, R., Mukaetova-Ladinska, E.B., Robson, J., Perry, R.H., Jaros, E., Kalaria, R.N., O’Brien, J.T.: Medial temporal lobe atrophy on MRI differentiates Alzheimer’s disease from dementia with Lewy bodies and vascular cognitive impairment: a prospective study with pathological verification of diagnosis. Brain 132(1), 195–203 (2009)
3. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)
4. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)
5. Choi, M.J., Tan, V.Y.F., Anandkumar, A., Willsky, A.S.: Learning latent tree graphical models. J. Mach. Learn. Res. 12, 1771–1812 (2011)
6. Falahati, F., Westman, E., Simmons, A.: Multivariate data analysis and machine learning in Alzheimer’s disease with a focus on structural magnetic resonance imaging. J. Alzheimer’s Dis. 41(3), 685–708 (2014)
7. García-Laencina, P.J., Sancho-Gómez, J.-L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010)
8. Harmeling, S., Williams, C.K.I.: Greedy learning of binary latent trees. IEEE Trans. Pattern Anal. Mach. Intell. 33(6), 1087–1097 (2011)
9. Kaltwang, S., Todorovic, S., Pantic, M.: Latent trees for estimating intensity of facial action units. In: IEEE Conference Computer Vision Pattern Recognition (2015)
10. Koikkalainen, J., Rhodius-Meester, H., Tolonen, A.: Differential diagnosis of neurodegenerative diseases using structural MRI data. NeuroImage Clin. 11, 435–449 (2016)
11. Nelwamondo, F.V., Mohamed, S., Marwala, T.: Missing Data: A Comparison of Neural Network and Expectation Maximisation Techniques. ArXiv e-prints (2007)
12. Schneider, T.: Analysis of incomplete climate data: estimation of mean values and covariance matrices and imputation of missing values. J. Climate 14(5), 853–871 (2001)
13. Thung, K.-H., Wee, C.-Y., Yap, P.-T., Shen, D.: Neurodegenerative disease diagnosis using incomplete multi-modality data via matrix shrinkage and completion. NeuroImage 91, 386–400 (2014)
14. Williams, D., Liao, X., Xue, Y., Carin, L., Krishnapuram, B.: On classification with incomplete data. IEEE Trans. Pattern Anal. Mach. Intell. 29(3), 427–436 (2007)
15. Zhu, X., Zhang, S., Jin, Z., Zhang, Z., Xu, Z.: Missing value estimation for mixed-attribute data sets. IEEE Trans. Knowl. Data Eng. 23(1), 110–121 (2011)

Bridging Computational Features Toward Multiple Semantic Features with Multi-task Regression: A Study of CT Pulmonary Nodules

Sihong Chen1, Dong Ni1, Jing Qin2, Baiying Lei1, Tianfu Wang1, and Jie-Zhi Cheng1(✉)

1 National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University, Shenzhen, China [email protected]
2 Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Kowloon, Hong Kong

Abstract. The gap between computational and semantic features is one of the major factors that keep computer-aided diagnosis (CAD) from clinical usage. To bridge this gap, we propose to utilize a multi-task regression (MTR) scheme that leverages heterogeneous computational features derived from the deep learning models of stacked denoising autoencoder (SDAE) and convolutional neural network (CNN), as well as Haar-like features, to approach 8 semantic features of lung CT nodules. We hypothesise that relations may exist among the semantic features of “spiculation”, “texture”, “margin”, etc., that can be exploited with the multi-task learning technique. The Lung Imaging Database Consortium (LIDC) data is adopted for its rich annotations, where nodules were quantitatively rated for the semantic features by many radiologists. By treating each semantic feature as a task, the MTR selects and regresses the heterogeneous computational features toward the radiologists’ ratings, with 10-fold cross-validation evaluation on 1400 randomly selected LIDC nodules. The experimental results suggest that the predicted semantic scores from MTR are closer to the radiologists’ ratings than the predicted scores from single-task LASSO and elastic net regression methods. The proposed semantic scoring scheme may provide richer quantitative assessments of nodules for deeper analysis and support more sophisticated clinical content retrieval in medical databases.

Keywords: Multi-task regression · Lung nodule · CT · Deep learning

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 53–60, 2016. DOI: 10.1007/978-3-319-46723-8_7

1 Introduction

Semantic features such as “spiculation” and “lobulation” are commonly used to describe the phenotype of a pulmonary nodule in the radiology report. For the differential diagnosis of pulmonary nodules in CT images, the semantic spiculation feature and the high-level texture feature of nodule solidness are suggested to be important factors for the identification of malignancy in several diagnostic guidelines [1, 2]. In the context of computer-aided diagnosis (CAD), several methods have also attempted to computationally approximate some high-level semantic features to achieve classification tasks [3–6]. For example, in [4] a partial goal of the bag-of-frequencies descriptor was to classify 51 spiculated and 204 non-spiculated nodules in CT images, whereas the nodule solidness categorization method in [5] was developed based on low-level intensity features. In general, most of these works focused on the elaboration of a single semantic feature for the discrete trichotomous/dichotomous nodule classification of malignancy, spiculation and solidness [3–6]. Scarcely any work has attempted to quantify the degrees of these high-level features to support deeper nodule analysis.

Since a pulmonary nodule can be profiled with several semantic features, relations may exist among the semantic features. In this paper, we aim to address two specific problems: (1) degree quantification of the semantic features and (2) jointly mapping the computational image features toward the multiple semantic features. Distinct from the traditional CAD scheme that only suggests a malignancy probability [3, 6], the proposed nodule profiling scheme may provide broader quantitative assessment indices in the hope of getting closer to clinical usage.

The thoracic CT dataset from the Lung Image Database Consortium (LIDC) [7] is adopted here for its rich annotation resources from many radiologists across several institutes in the USA. A nodule with a diameter larger than 3 mm was annotated by radiologists, who gave their ratings for the semantic features of “spiculation”, “lobulation”, “texture”, “calcification”, “sphericity”, “subtlety”, “margin”, “internal structure”, and “malignancy”. The exemplar nodules of each semantic feature are shown in Fig. 1. The “malignancy” feature is excluded in this study as it relates to diagnosis. Most semantic features were scored in the range of 1–5, except “internal structure” and “calcification”, which were scored in the ranges of 1–4 and 1–6, respectively.

Fig. 1 Illustration of nodule patterns for the 8 semantic features.

As shown in Fig. 1, our goal is challenging. The appearances and shapes of nodules are very diverse for each semantic feature, and the surrounding tissues of nodules may further complicate the image patterns of nodules. Therefore, intensive elaboration on the extraction and selection of effective computational image features for each semantic feature is needed. In this study, we leverage the techniques of stacked denoising autoencoder (SDAE) [8], convolutional neural network (CNN) [9], and Haar-like feature computing [10] along with a multi-task regression (MTR) framework to bring our predicted scores close to the radiologists’ ratings. With the SDAE, CNN and Haar-like features, the MTR can automatically exploit the sharable knowledge across the semantic features and select useful computational features for each of them. Here, each semantic feature is treated as an individual task.

2 Method

The training of the nodule scoring scheme for the 8 semantic features is based on 2D nodule ROIs to avoid direct 3D feature computing from the LIDC image data, whose resolution is anisotropic between the x-y and z directions. The slice thickness variation is quite high (1.25–3 mm) in the LIDC dataset. At testing, the predicted scores for a nodule are obtained by averaging the scores over all its member slices. Each nodule ROI is defined as the bounding box of the radiologists’ outlines, expanded by an offset of 10 pixels to include more anatomical context. For training and testing, all ROIs are resized to 28 × 28 for efficiency. The flowchart of our scheme is shown in Fig. 2.

Fig. 2 Flowchart of the proposed scheme.
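The ROI definition and the slice-wise averaging described in this section can be sketched as follows. This is an illustration only: the function names are hypothetical, clipping the expanded box at the image border is an assumption that the text does not state, and the resizing to 28 × 28 is omitted.

```python
def expand_bbox(outline, offset, width, height):
    """Bounding box (x0, y0, x1, y1) of outline points, grown by `offset` pixels.

    Clipping to the image extent is an assumption made for this sketch.
    """
    xs = [x for x, _ in outline]
    ys = [y for _, y in outline]
    x0 = max(min(xs) - offset, 0)
    y0 = max(min(ys) - offset, 0)
    x1 = min(max(xs) + offset, width - 1)
    y1 = min(max(ys) + offset, height - 1)
    return x0, y0, x1, y1

def nodule_scores(slice_scores):
    """Average the per-slice predictions of one nodule, feature by feature."""
    features = slice_scores[0].keys()
    return {f: sum(s[f] for s in slice_scores) / len(slice_scores) for f in features}

# Made-up outline points (x, y) on a 512 x 512 slice, offset of 10 pixels.
roi = expand_bbox([(20, 40), (30, 45), (25, 42)], offset=10, width=512, height=512)
scores = nodule_scores([{"spiculation": 2.0, "margin": 4.0},
                        {"spiculation": 3.0, "margin": 5.0}])
```

Working slice by slice sidesteps the anisotropic voxel spacing, at the cost of ignoring shape information along the z direction.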

2.1

Extraction of Heterogeneous Computational Features

Referring to Fig. 1, the semantic features (tasks) cover the high-level description of the shape and appearance of pulmonary nodules, and the effective computational features for each task are generally unknown. Therefore, we first compute heterogeneous features that are as diverse as possible, and then use the MTR framework to seek the suitable features for each task. The SDAE, CNN and Haar-like features are computed as the heterogeneous features. The SDAE and CNN are deep learning models that can automatically learn spatial patterns as features. The learnt SDAE and CNN features may encode both appearance and shape characteristics of nodules. SDAE features are derived from the unsupervised phase and thus are general features. The training of a CNN requires the sample labels, and hence CNN features are more task-specific. The Haar-like features aim to characterize low-level image contextual cues of nodules. To compensate for the ROI resizing effect, we add the scaling factors of the x and y directions and the aspect ratio as three extra features; see Fig. 2.

The SDAE model consists of unsupervised and supervised training phases. Here, we only take the output neurons of the unsupervised phase as the SDAE features. In the unsupervised phase, the SDAE architecture is built by stacking autoencoders in a layer-by-layer fashion. A layer of autoencoder can be constructed by seeking the coding neurons that minimize the reconstruction error

$$\left\| x - \sigma\left(W' \sigma(W \tilde{x} + b) + b'\right) \right\|^2, \tag{1}$$

where $x$ is the input data and $\tilde{x}$ is the corrupted input data, which is used for better performance. The data corruption is conducted with random zero masking at a ratio of 0.5. $W$, $b$, $W'$, and $b'$ are the synaptic matrices and biases of the coding and reconstruction neurons, respectively, and $\sigma$ is the sigmoid function. There are 100 SDAE features in total for MTR.

A typical CNN model is composed of several pairs of convolutional (C) and max-pooling (M) layers and commonly ends with fully-connected (F) and soft-max layers. We train 8 CNN models for the 8 tasks and adopt the neural responses at the fully-connected layer as the CNN features. Therefore, the CNN features are task-specific. The number of CNN features for each task is 192.

The Haar-like feature for a nodule ROI, $Z$, is computed with two blocks cropped from the resized ROI as:

$$\frac{1}{(2s_1 + 1)^2} \sum_{\|p - c_1\| \le s_1} Z(p) + \epsilon \, \frac{1}{(2s_2 + 1)^2} \sum_{\|q - c_2\| \le s_2} Z(q), \tag{2}$$

where $c_1$, $s_1$ and $c_2$, $s_2$ are the centers and half-sizes of the two square blocks, respectively, and $Z(p)$ is the HU value of $p$. $\epsilon$ can be 1, 0, or −1 and is randomly determined. The centers and half-sizes of the blocks are randomly set to generate 50 Haar-like features. The half-size can be 1, 2, or 3.
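Eq. (2) amounts to a signed combination of two block means. The sketch below is a hedged illustration, not the authors' code: the function names are hypothetical, the ROI is a plain list of rows, the blocks are taken as squares of side 2s + 1 (which matches the normalisation in Eq. (2)), and sampling the centers so that each block lies fully inside the ROI is an assumption.

```python
import random

def block_mean(roi, center, half_size):
    """Mean intensity of the square block with side 2 * half_size + 1 around center."""
    cy, cx = center
    total = sum(roi[y][x]
                for y in range(cy - half_size, cy + half_size + 1)
                for x in range(cx - half_size, cx + half_size + 1))
    return total / (2 * half_size + 1) ** 2

def haar_like(roi, c1, s1, c2, s2, eps):
    """Eq. (2): mean of block 1 plus eps times the mean of block 2."""
    return block_mean(roi, c1, s1) + eps * block_mean(roi, c2, s2)

def random_haar_params(size, count, rng):
    """Draw `count` random (c1, s1, c2, s2, eps) tuples for a size x size ROI."""
    params = []
    for _ in range(count):
        s1, s2 = rng.choice([1, 2, 3]), rng.choice([1, 2, 3])
        # Assumption: centers are restricted so that blocks stay inside the ROI.
        c1 = (rng.randint(s1, size - 1 - s1), rng.randint(s1, size - 1 - s1))
        c2 = (rng.randint(s2, size - 1 - s2), rng.randint(s2, size - 1 - s2))
        params.append((c1, s1, c2, s2, rng.choice([1, 0, -1])))
    return params

# Toy 28 x 28 ROI whose value equals the column index.
toy_roi = [[float(x) for x in range(28)] for _ in range(28)]
toy_params = random_haar_params(28, 50, random.Random(0))
toy_features = [haar_like(toy_roi, *p) for p in toy_params]
```

With eps = 0 the feature reduces to a single local mean, while eps = −1 gives a difference of two local means, which is the classical Haar-like contrast.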

2.2

Multi-task Regression

The 8 semantic features (tasks) describe nodule shape and appearance and hence may relate to each other in semantic meaning. The relations among the 8 tasks are generally unknown, and some tasks may share some computational features whereas other tasks may not. To exploit the inter-task relation, we apply an MTR scheme with the constraints of block and element sparsity [11]. Specifically, the cost function of the MTR is expressed as:

$$\sum_{t=1}^{8} \left\| X_t^T W_t - Y_t \right\|_F^2 + \lambda_B \|B\|_{1,\infty} + \lambda_S \|S\|_{1,1}, \quad W = B + S, \tag{3}$$

where $Y_t$ and $X_t$ are the data labels and the SDAE + CNN + Haar-like features for the task $t$, respectively, $W_t$ holds the feature coefficients of task $t$, $W = [W_1, \ldots, W_8]$, and $\|\cdot\|_F$ is the Frobenius norm. The regularization terms $\|B\|_{1,\infty}$ and $\|S\|_{1,1}$ assure the block and element sparsity, and $\lambda_B$ and $\lambda_S$ are their weightings. The $\|S\|_{1,1}$ is defined as $\sum_{i,j} |S_{i,j}|$, and $\|B\|_{1,\infty}$ is computed as $\sum_i \|B_i\|_\infty$, where $\|B_i\|_\infty = \max_j |B_{i,j}|$; $i$ and $j$ are the indices of the rows and columns w.r.t. each matrix. The $\|S\|_{1,1}$ encourages zero elements in the matrix, whereas the $\|B\|_{1,\infty}$ favors zero rows in the matrix. Each column of $W$ carries the feature coefficients of one task, while the coefficients of shared and task-specific features are held in $B$ and $S$, respectively; see Fig. 4 (right). The minimization of Eq. (3) is realized by alternately seeking proper $B$ and $S$ with the coordinate descent algorithm. The output $W$ can be obtained from the final $B$ and $S$. As shown in [11], with the constraints in Eq. (3), the coefficients of non-zero rows in $B$ are sparse and distinctive, because the term $\|B\|_{1,\infty}$ may help avoid the situation of nearly-identical elements in the non-sparse rows that arises with the constraint of the $\ell_1/\ell_q$-norm. In such a case, the MTR can not only exploit the shared features but also preserve the flexibility of coefficient variation of the shared features w.r.t. each task.
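The cost of Eq. (3) can be evaluated directly once $B$ and $S$ are given. The sketch below only computes the objective value and does not perform the coordinate descent optimisation; the function name, the data layout (one list of samples per task, one row per feature in `B` and `S`) and the toy numbers are assumptions made for illustration.

```python
def mtr_objective(X, Y, B, S, lam_b, lam_s):
    """Value of the MTR cost for W = B + S.

    X[t]: list of samples of task t, each a list of d feature values.
    Y[t]: list of target scores of task t.
    B, S: d x T matrices (lists of rows); column t holds the coefficients of task t.
    """
    n_feat, n_task = len(B), len(B[0])
    W = [[B[i][t] + S[i][t] for t in range(n_task)] for i in range(n_feat)]
    loss = 0.0
    for t in range(n_task):
        for x, y in zip(X[t], Y[t]):
            prediction = sum(W[i][t] * x[i] for i in range(n_feat))
            loss += (prediction - y) ** 2
    block_term = sum(max(abs(v) for v in row) for row in B)  # sum of row-wise maxima
    element_term = sum(abs(v) for row in S for v in row)     # sum of absolute entries
    return loss + lam_b * block_term + lam_s * element_term

# Two features, two tasks: feature 0 is shared (row of B), feature 1 is task-specific.
toy_B = [[1.0, 1.0], [0.0, 0.0]]
toy_S = [[0.0, 0.0], [0.0, 0.5]]
toy_X = [[[1.0, 0.0]], [[0.0, 2.0]]]
toy_Y = [[1.0], [0.0]]
toy_cost = mtr_objective(toy_X, toy_Y, toy_B, toy_S, lam_b=2.0, lam_s=4.0)
```

In the toy example the squared loss is 1, the block term is 1 and the element term is 0.5, so the cost is 1 + 2 · 1 + 4 · 0.5 = 5. A row of `B` costs only its largest absolute entry, so a feature that is used by one task can be used by the other tasks at no extra penalty, which is what encourages sharing.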

3 Experiments and Results

To illustrate its efficacy, the MTR framework is compared with two single-task regression schemes, LASSO [12] and elastic net [13], which can also select sparse features within a linear regression framework. The single-task regression schemes use the same set of computational features as MTR, except for the CNN features, and perform the regression for each task independently. For MTR, all CNN features from the 8 tasks are involved, whereas the single-task regressions only use the CNN features derived from the respective task. The performances of the two single-task regression schemes are tuned independently for each task, and thus the regression parameters differ from task to task. To further show the effect of each type of computational feature, we also compare the sole uses of SDAE, CNN, and Haar-like features in all regression schemes. 1400 nodules randomly selected from the LIDC dataset are involved in this study with 10-fold cross-validation (CV) evaluation (the basic unit is the nodule). The feature computing and regression use the same data partition in each fold. Each nodule may have more than one annotation instance from different radiologists. In each fold of training, only one instance is utilized for nodules with multiple annotation instances, whereas all instances of the same nodule are involved in the testing in each fold. There are in total 581, 321, 254, and 244 nodules with one, two, three and four annotation instances from different radiologists, respectively. We adopt the differences between the computer-predicted and radiologists’ scores as the assessment metrics.

Table 1 summarizes the statistics of absolute differences between the computer-predicted (MTR, LASSO, and elastic net) and radiologists’ scores over the 10 folds of CV. The performance of the sole use of each of the three heterogeneous feature types with the three regression schemes is also reported in Table 1 to show the effectiveness of each type of computational feature. The absolute differences of inter-observer ratings are shown in Table 1 for comparison. The inter-observer variation is computed from all possible pairs of annotation instances of the same nodule. As can be observed, the inter-observer variation is quite close to the variation between MTR scores and radiologists’ scores. This may suggest that ambiguity exists between the scoring degrees of the 8 semantic features, which leads to rating disagreements among radiologists; see Fig. 4 (left), where two ROIs of a nodule are shown. The nodule in Fig. 4 (left) has degree ambiguity in “Subtlety”, with scores from 4 radiologists of (2, 4, 5, 3), while the MTR score is 3.78. The “Subtlety” scoring is highly subjective and depends on radiologists’ experience.

Table 1. Absolute distance performance (mean ± standard deviation). “Tex”, “Sub”, “Spi”, “Sph”, “Mar”, “Lob”, “IS”, and “Cal” stand for the tasks “Texture”, “Subtlety”, “Spiculation”, “Sphericity”, “Margin”, “Lobulation”, “Internal Structure”, and “Calcification”, respectively. “IB”, “LS” and “EN” indicate the inter-observer variation, LASSO and elastic net, respectively. A dash (–) marks a value that is not recoverable from the source; two detached values in the source (1.04 and 1.05) could not be assigned to a cell with certainty.

Features      Method  Tex          Sub          Spi          Sph          Mar          Lob          IS           Cal          Overall
              IB      –            –            –            –            –            –            –            –            –
All features  LS      – ± 0.53     1.25 ± 0.65  0.89 ± 0.85  1.25 ± 0.90  1.13 ± 0.62  0.95 ± 0.84  0.02 ± 0.19  2.18 ± 0.61  1.09 ± 0.65
All features  EN      1.24 ± 0.50  1.20 ± 0.63  0.86 ± 0.79  1.09 ± 0.74  0.98 ± 0.91  0.96 ± 0.93  0.14 ± 0.24  1.44 ± 0.84  0.99 ± 0.70
All features  MTR     –            –            –            –            –            –            –            –            –
CNN           LS      1.06 ± 0.58  1.13 ± 0.66  1.04 ± 1.08  1.29 ± 0.95  1.13 ± 1.00  1.28 ± 1.04  0.02 ± 0.19  2.12 ± 0.66  1.13 ± 0.77
CNN           EN      1.27 ± 0.53  1.68 ± 0.82  1.04 ± 1.08  1.39 ± 0.97  1.31 ± 0.96  1.08 ± 0.95  0.02 ± 0.19  1.89 ± 0.71  1.21 ± 0.78
CNN           MTR     0.74 ± 0.60  0.84 ± 0.56  0.86 ± 0.65  0.83 ± 0.48  0.97 ± 0.64  0.90 ± 0.66  0.04 ± 0.19  0.69 ± 0.56  0.73 ± 0.54
Haar-like     LS      2.42 ± 1.18  2.43 ± 1.09  0.88 ± 1.02  1.50 ± –     1.88 ± 1.16  0.95 ± 1.02  0.02 ± 0.19  4.38 ± 1.13  1.81 ± 0.98
Haar-like     EN      3.48 ± 1.05  3.08 ± 1.08  0.95 ± 1.01  2.68 ± 0.95  2.71 ± 1.18  1.02 ± 1.00  0.26 ± 0.46  4.61 ± 0.99  2.35 ± 0.97
Haar-like     MTR     1.70 ± 1.17  1.63 ± 1.11  0.91 ± 1.04  1.65 ± 1.08  1.56 ± 1.09  0.96 ± 1.04  0.07 ± 0.29  2.94 ± 1.37  1.43 ± 1.02
SDAE          LS      1.17 ± 0.54  1.43 ± 0.76  1.17 ± 0.91  1.68 ± 0.88  1.35 ± 0.80  1.16 ± 0.89  0.02 ± 0.19  2.24 ± 0.58  1.28 ± 0.69
SDAE          EN      1.23 ± 0.75  1.23 ± 0.70  1.15 ± 0.90  1.58 ± 0.87  1.47 ± 0.90  1.18 ± 0.92  0.55 ± 0.24  1.56 ± 0.95  1.24 ± 0.78
SDAE          MTR     0.74 ± 0.74  0.84 ± 0.63  0.95 ± 0.69  0.84 ± 0.50  1.00 ± 0.69  0.97 ± 0.70  0.05 ± 0.19  0.64 ± 0.71  0.76 ± 0.61

Fig. 3 Boxplots of the signed differences for MTR, LASSO, and elastic net, respectively.

Fig. 4 Annotation ambiguity (left) and illustration of B, S and W at one fold of CV (right).

Figure 3 shows the box-plots of signed differences between the computer-predicted and radiologists’ scores in the 10 folds of CV. In Table 1, it can be found that the performance for the task “internal structure” is very good. This is because most nodules were rated as score 1 (1388), whereas the sample numbers for the scores 2–4 are nearly zero. Accordingly, the regression for the task “internal structure” is not difficult. For tasks like “spiculation” and “lobulation”, the performance of the two single-task regression methods is not bad. However, it should be recalled that these two single-task methods require tedious task-by-task performance tuning, which may turn out to be impractical if the number of tasks becomes very large. On the other hand, the MTR jointly considers the 8 tasks and achieves better performance. To give further insight into the meaning and effect of the W separation mechanism in the MTR scheme, the B and S found at one fold of CV are shown in Fig. 4 (right), where the black and non-black areas indicate zero and non-zero elements, respectively. The blue rectangle identifies the Haar-like features, while the left and right sides of the rectangle are the CNN and SDAE features, respectively. The selected task-specific features in S are very sparse, and many CNN and SDAE features are sharable across tasks, as can be seen in B.
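The two kinds of absolute differences behind Table 1 can be illustrated with the “Subtlety” example from the text (radiologists’ scores 2, 4, 5, 3 and an MTR score of 3.78). The sketch below uses hypothetical function names and covers a single nodule and a single semantic feature; the statistics in Table 1 pool such differences over all nodules and folds.

```python
from itertools import combinations

def inter_observer_abs_diffs(ratings):
    """Absolute differences over all pairs of annotation instances of one nodule."""
    return [abs(a - b) for a, b in combinations(ratings, 2)]

def prediction_abs_diffs(predicted, ratings):
    """Absolute differences between one predicted score and every radiologist's score."""
    return [abs(predicted - r) for r in ratings]

subtlety_ratings = [2, 4, 5, 3]
pair_diffs = inter_observer_abs_diffs(subtlety_ratings)
pred_diffs = prediction_abs_diffs(3.78, subtlety_ratings)
mean_pair_diff = sum(pair_diffs) / len(pair_diffs)
mean_pred_diff = sum(pred_diffs) / len(pred_diffs)
```

For this nodule the six rating pairs differ by 2, 3, 1, 1, 1 and 2 points, while the predicted score deviates from the four ratings by 1.78, 0.22, 1.22 and 0.78 points.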

4 Discussion and Conclusion

A computer-aided attribute scoring scheme for CT pulmonary nodules is proposed that leverages heterogeneous SDAE, CNN and Haar-like features within the multi-task regression (MTR) framework. The scores yielded by the MTR are shown to be closer to the radiologists’ ratings than the scores from the two single-task regression methods. The two single-task methods share a formulation similar to Eq. (3) but without the consideration of multiple tasks, and are therefore suitable for comparison. Accordingly, the MTR can help to select useful features for each task by exploring inter-task relations. The effectiveness of using all SDAE, CNN, and Haar-like features together is also illustrated in Table 1. The efficacy of the MTR and of the heterogeneous features is thus well corroborated. Our automatic scoring scheme may help deeper nodule analysis and support more sophisticated content retrieval of clinical reports and images for better diagnostic decision support [14].

Acknowledgement. This work was supported by the National Natural Science Funds of China (Nos. 61501305, 61571304, and 81571758), the Shenzhen Basic Research Project (Nos. JCYJ20150525092940982 and JCYJ20140509172609164), and the Natural Science Foundation of SZU (No. 2016089).

References

1. Naidich, D.P., et al.: Recommendations for the management of subsolid pulmonary nodules detected at CT: a statement from the Fleischner Society. Radiology 266, 304–317 (2013)
2. Gould, M.K., et al.: Evaluation of individuals with pulmonary nodules: When is it lung cancer?: Diagnosis and management of lung cancer: American College of Chest Physicians evidence-based clinical practice guidelines. Chest 143, e93S–e120S (2013)
3. Cheng, J.-Z., et al.: Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans. Sci. Rep. 6, 24454 (2016)
4. Ciompi, F., et al.: Bag-of-frequencies: a descriptor of pulmonary nodules in computed tomography images. IEEE TMI 34(4), 962–973 (2015)
5. Jacobs, C., et al.: Solid, part-solid, or non-solid?: classification of pulmonary nodules in low-dose chest computed tomography by a computer-aided diagnosis system. Invest. Radiol. 50(3), 168–173 (2015)
6. Gurney, W., Swensen, S.: Solitary pulmonary nodules: determining the likelihood of malignancy with neural network analysis. Radiology 196, 823–829 (1995)
7. Armato III, S.G., et al.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011)
8. Vincent, P., et al.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
9. LeCun, Y., et al.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
10. Gao, Y., Shen, D.: Collaborative regression-based anatomical landmark detection. Phys. Med. Biol. 60(24), 9377 (2015)
11. Jalali, A., Sanghavi, S., Ruan, C., Ravikumar, P.K.: A dirty model for multi-task learning. NIPS, pp. 964–972 (2010)
12. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58(1), 267–288 (1996)
13. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B 67(2), 301–320 (2005)
14. Kurtz, C., et al.: On combining image-based and ontological semantic dissimilarities for medical image retrieval applications. Med. Image Anal. 18(7), 1082–1100 (2014)

Robust Cancer Treatment Outcome Prediction Dealing with Small-Sized and Imbalanced Data from FDG-PET Images

Chunfeng Lian1,2(B), Su Ruan2, Thierry Denœux1, Hua Li4, and Pierre Vera2,3

1 Sorbonne Universités, Université de Technologie de Compiègne, CNRS, UMR 7253 Heudiasyc, 60205 Compiègne, France
[email protected]
2 Université de Rouen, QuantIF - EA 4108 LITIS, 76000 Rouen, France
3 Department of Nuclear Medicine, Centre Henri-Becquerel, 76038 Rouen, France
4 Department of Radiation Oncology, Washington University School of Medicine, Saint Louis, MO 63110, USA

Abstract. Accurately predicting the outcome of cancer therapy is valuable for tailoring and adapting treatment planning. To this end, features extracted from multiple sources of information (e.g., radiomics and clinical characteristics) are potentially profitable. While it is of great interest to select the most informative features from all available ones, small-sized and imbalanced datasets, as often encountered in the medical domain, pose a crucial challenge that hinders reliable and stable subset selection. We propose a prediction system that primarily uses radiomic features extracted from FDG-PET images. It incorporates a feature selection method based on Dempster-Shafer theory, a powerful tool for modeling and reasoning with uncertain and/or imprecise information. Using a data rebalancing procedure and specified prior knowledge to enhance the reliability and robustness of the selected feature subsets, the proposed method aims to reduce the imprecision and the overlaps between different classes in the selected feature subspace, thereby improving the prediction accuracy. It has been evaluated on two clinical datasets, showing good performance.

1 Introduction

Accurate outcome prediction prior to or even during cancer therapy is of great clinical value. It benefits the adaptation of more effective treatment planning for individual patients. With the advances in medical imaging technology, radiomics [1], referring to the extraction and analysis of a large amount of quantitative image features, provides an unprecedented opportunity to improve personalized treatment assessment. Positron emission tomography (PET) with the radio-tracer fluoro-2-deoxy-D-glucose (FDG) is one of the important and advanced imaging tools generally used in clinical oncology for diagnosis and staging. The functional information provided by FDG-PET has also emerged as predictive of the pathologic response to treatment in some types of cancer, such as lung and esophageal tumors [10]. Numerous radiomic features have been studied in FDG-PET images [3], which include standardized uptake values (SUVs), e.g., SUVmax, SUVpeak and SUVmean, to describe metabolic uptake in a volume of interest (VOI), and metabolic tumor volume (MTV) and total lesion glycolysis (TLG) to describe metabolic tumor burden. Apart from SUV-based features, complementary characterizations of PET images, e.g., texture analysis, may also provide supplementary knowledge associated with the treatment outcome.

Although the quantification of these radiomic features has been claimed to have discriminant power [1], their practical application is still hampered by several difficulties: (i) uncertainty and inaccuracy of the extracted radiomic features caused by noise and the limited resolution of imaging systems, by the effect of small tumour volumes, and also by the lack of a priori knowledge regarding the most discriminant features; (ii) the small-sized datasets often encountered in the medical domain, which result in a high risk of over-fitting with a relatively high-dimensional feature space; (iii) skewed datasets in which training samples originate from classes of remarkably distinct sizes, usually leading to poor performance in classifying the minority class. The challenge is to robustly select an informative feature subset from uncertain, small-sized, and imbalanced datasets.

To learn efficiently from noisy and highly overlapping training sets, Lian et al. proposed a robust feature subset selection method, EFS [11], based on the Dempster-Shafer Theory (DST) [13], a powerful tool for modeling and reasoning with uncertain and/or imprecise knowledge. EFS quantifies the uncertainty and imprecision caused by different feature subsets, and then attempts to find a feature subset leading to both high classification accuracy and small overlaps between different classes. While it has shown competitive performance compared with conventional methods, the influence of imbalanced data is still left unsolved; moreover, the loss function used in EFS can be improved to reduce the method’s complexity.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 61–69, 2016. DOI: 10.1007/978-3-319-46723-8_8

Fig. 1. Protocol of the prediction system. [Diagram labels: Training, Testing, Feature extraction, Prior knowledge, Data rebalancing, dataset #1 … dataset #B, REFS, Feature selection, Selected features, Classifier, Treatment outcome.]

We propose a new framework for predicting the outcome of cancer therapy. Input features are extracted from multiple sources of information, which include radiomics in FDG-PET images and clinical characteristics. Then, as the main contribution of this paper, the EFS method proposed in [11] is comprehensively improved to select features from uncertain, small-sized, and imbalanced datasets. The protocol of the proposed prediction system is shown in Fig. 1 and is described in more detail in the following sections.

2 Robust Outcome Prediction with FDG-PET Images

The prediction system is learnt on a dataset $\{(X_i, Y_i)\}_{i=1}^{N}$ of N different patients, where the vector $X_i$ consists of V input features and $Y_i$ denotes the already known treatment outcome. Since $Y_i$ in our applications has only two possible values (e.g., recurrence versus no recurrence), the set of possible classes is defined as $\Omega = \{\omega_1, \omega_2\}$. It is worth noting that this prediction system can also deal with multi-class problems.

2.1 Feature Extraction

To extract features, images acquired at different time points are registered to the image at initial staging via a rigid registration method. The VOIs around tumors are cuboid bounding boxes manually delineated by experienced physicians. Five types of SUV-based features are calculated from the VOI, namely SUVmin, SUVmax, SUVpeak, MTV and TLG. To characterize tumor uptake heterogeneity, the Gray Level Size Zone Matrix (GLSZM) [16] is adopted to extract eleven texture features. Since the temporal changes of these features may also provide discriminant value, their relative difference between the baseline and the follow-up PET acquisitions is calculated as additional features. Patients’ clinical characteristics can also be included as complementary knowledge if they are available. The number of extracted features is roughly between thirty and fifty.

2.2 Feature Selection

In this part, EFS [11] is comprehensively improved; the improved method is denoted REFS for simplicity. Compared with EFS, REFS incorporates a data rebalancing procedure and specified prior knowledge to enhance the robustness of the selected features on small-sized and imbalanced data. Moreover, to reduce the method’s complexity, the loss function used in EFS is simplified without loss of effectiveness.

Prior Knowledge: considering that SUV-based features have shown great significance for assessing the response to treatment [12], we incorporate this prior knowledge in REFS to guide feature selection. More specifically, RELIEF [9] is used to rank all SUV-based features. Then, the top-ranked SUV-based feature is included in REFS as a must-be-selected element of the desired feature subset. This added constraint drives REFS into a confined search space. By decreasing the uncertainty caused by the scarcity of learning samples, it ensures more robust feature selection on small-sized datasets, thus increasing prediction reliability.

Data Rebalancing: pre-sampling is a common approach for imbalanced learning [8]. As an effective pre-sampling method that can generate artificial minority class samples, adaptive synthetic sampling (ADASYN) [8] is adopted in REFS to rebalance data for feature selection. As in the example shown in Fig. 2, the key idea of ADASYN is to adaptively simulate samples according to the distribution of the minority class samples, where more instances are generated for the minority class samples that are harder to learn. However, due to the random nature of the rebalancing procedure, and also with a limited number of training samples, the rebalanced dataset cannot always ensure that instances that are hard to learn are properly tackled (e.g., Fig. 2(b)). Therefore, ADASYN is executed B times in total (B = 5 in our experiments) to provide B rebalanced training datasets. REFS is then executed on them to obtain B feature subsets. The final output is the subset that occurred most frequently among the B independent runs.

Fig. 2. Data rebalancing by ADASYN: (a) original data with two input features randomly selected from the lung tumor dataset (Sect. 3); (b) and (c) are two independent simulations, where more synthetic (yellow) instances have been generated for the minority class (cyan) samples that are harder to learn (on the boundary). However, due to the random nature, no points have been generated for the minority class samples within the orange circle in (b).

Robust EFS (REFS): similar to [11], we search for a qualified feature subset according to three requirements: (i) high classification accuracy; (ii) low imprecision and uncertainty, i.e., small overlaps between different classes; (iii) sparsity to reduce the risk of over-fitting. To learn such a feature subset, the dissimilarity between any feature vectors $X_i$ and $X_j$ is defined as a weighted Euclidean distance, i.e., $d_{i,j}^2 = \sum_{p=1}^{V} \lambda_p d_{ij,p}^2$, where $d_{ij,p} = |x_{i,p} - x_{j,p}|$ represents the difference in the pth feature. Features are selected via the binary vector $\Lambda = [\lambda_1, \ldots, \lambda_V]^t$, where the pth feature is selected when $\lambda_p = 1$. We successively regard each training instance $X_i$ as a query object. In the framework of DST, the other samples in the training pool can be considered as independent items of evidence that support different hypotheses regarding the class membership of $X_i$. The evidence offered by $(X_j, Y_j = \omega_q)$, where $j \neq i$ and $q \in \{1, 2\}$, asserts that $X_i$ also belongs to $\omega_q$. According to [11], this piece of evidence is partially reliable, which can be quantified as a mass function [13], i.e., $m_{i,j}(\{\omega_q\}) + m_{i,j}(\Omega) = 1$, where $m_{i,j}(\{\omega_q\}) = \exp(-\gamma_q d_{i,j}^2)$ and $\gamma_q$ relates to the mean distance within the same class. The quantity $m_{i,j}(\{\omega_q\})$ denotes the degree of belief attached to the hypothesis “$Y_i \in \{\omega_q\}$”; similarly, $m_{i,j}(\Omega)$ is attached to “$Y_i \in \Omega$”, i.e., the degree of ignorance. The precision of $m_{i,j}$ is inversely related to $d_{i,j}^2$: when $d_{i,j}^2$ is too large, it becomes totally ignorant (i.e., $m_{i,j}(\Omega) \approx 1$) and provides little evidence regarding the class membership of $X_i$. Hence, for each $X_i$, it is sufficient to consider only the mass functions offered by its first K nearest neighbors (with K large, e.g., ≥ 10).
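The weighted distance and the per-neighbor mass function can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and `gamma_q` is simply passed in as a parameter because the text only states that it relates to the mean within-class distance.

```python
import math


def weighted_sq_distance(x_i, x_j, lam):
    """Squared distance d_ij^2 = sum_p lam_p * (x_ip - x_jp)^2 with binary lam."""
    return sum(l * (a - b) ** 2 for a, b, l in zip(x_i, x_j, lam))


def neighbor_mass(x_i, x_j, lam, gamma_q):
    """Mass function induced by one neighbour of class omega_q.

    Returns (m({omega_q}), m(Omega)); the two values sum to 1.
    """
    d2 = weighted_sq_distance(x_i, x_j, lam)
    belief = math.exp(-gamma_q * d2)
    return belief, 1.0 - belief


def k_nearest(i, X, lam, k):
    """Indices of the k training samples closest to X[i] in the selected subspace."""
    others = [j for j in range(len(X)) if j != i]
    others.sort(key=lambda j: weighted_sq_distance(X[i], X[j], lam))
    return others[:k]
```

Note that changing the binary vector changes both the distances and the neighborhood itself, which is why the evidence, and hence the loss, depends on the selected features.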

Let $\{X_{i_1}, \ldots, X_{i_K}\}$ be the selected training samples for $X_i$. Correspondingly, $\{m_{i,i_1}, \ldots, m_{i,i_K}\}$ are the K pieces of evidence taken into account. In the framework of DST, beliefs are refined by aggregating different items of evidence. A specific combination rule was proposed in [11] to fuse the mass functions $\{m_{i,i_1}, \ldots, m_{i,i_K}\}$ for $X_i$. While it can lead to a robust quantification of data uncertainty and imprecision, the accompanying tuning parameters increase the method’s complexity. To tackle this problem, this combination rule is replaced by the conjunctive combination rule defined in the Transferable Belief Model (TBM) [14], considering that the latter is a basic but robust rule for the fusion of independent pieces of evidence. We assign $\{m_{i,i_1}, \ldots, m_{i,i_K}\}$ to two different groups ($\Theta_1$ and $\Theta_2$) according to $\{Y_{i_1}, \ldots, Y_{i_K}\}$. In each group $\Theta_q \neq \emptyset$, the mass functions are fused to deduce a new mass function $m_i^{\Theta_q}$ without conflict:

$$
\begin{cases}
m_i^{\Theta_q}(\{\omega_q\}) = 1 - \prod_{p=1,\ldots,K;\ X_{i_p} \in \Theta_q} \left(1 - e^{-\gamma_q d_{i,i_p}^2}\right), \\
m_i^{\Theta_q}(\Omega) = \prod_{p=1,\ldots,K;\ X_{i_p} \in \Theta_q} \left(1 - e^{-\gamma_q d_{i,i_p}^2}\right);
\end{cases}
\tag{1}
$$

while, when $\Theta_q$ is empty, $m_i^{\Theta_q}(\Omega) = 1$. After that, $m_i^{\Theta_1}$ and $m_i^{\Theta_2}$ are further combined to obtain a global mass function $M_i$ regarding the class membership of $X_i$, namely

$$
\begin{cases}
M_i(\{\omega_q\}) = m_i^{\Theta_q}(\{\omega_q\}) \cdot m_i^{\Theta_{\bar{q}}}(\Omega), \quad \forall q \in \{1, 2\},\ \bar{q} \neq q, \\
M_i(\Omega) = m_i^{\Theta_1}(\Omega) \cdot m_i^{\Theta_2}(\Omega), \\
M_i(\emptyset) = m_i^{\Theta_1}(\{\omega_1\}) \cdot m_i^{\Theta_2}(\{\omega_2\}).
\end{cases}
\tag{2}
$$
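The two-stage fusion of Eqs. (1) and (2) can be sketched as below. The sketch assumes that the squared distances to the neighbors of each class have already been computed; the function names and the dictionary keys are hypothetical choices made for illustration only.

```python
import math


def fuse_group(sq_dists, gamma_q):
    """Eq. (1): fuse the neighbours of one class; returns (m({omega_q}), m(Omega))."""
    ignorance = 1.0
    for d2 in sq_dists:  # an empty group leaves the ignorance at 1
        ignorance *= 1.0 - math.exp(-gamma_q * d2)
    return 1.0 - ignorance, ignorance


def global_mass(sq_dists_1, sq_dists_2, gamma_1, gamma_2):
    """Eq. (2): conjunctive combination of the two class-wise mass functions."""
    b1, o1 = fuse_group(sq_dists_1, gamma_1)
    b2, o2 = fuse_group(sq_dists_2, gamma_2)
    return {
        "omega1": b1 * o2,  # belief in class 1, discounted by ignorance of group 2
        "omega2": b2 * o1,  # belief in class 2, discounted by ignorance of group 1
        "Omega": o1 * o2,   # imprecision: neither group says much
        "empty": b1 * b2,   # conflict: both groups are confident
    }
```

The four masses always sum to one, since they expand the product of the two normalized class-wise mass functions.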

Based on (1) and (2), $M_i$ is determined by the weighted Euclidean distance, i.e., it is a function of the binary vector $\Lambda$ defining which features are selected. The quantity $M_i(\emptyset)$ measures the conflict in the neighborhood of $X_i$. A large $M_i(\emptyset)$ means that $X_i$ is located in a highly overlapping area of the current feature subspace. In contrast, $M_i(\Omega)$ measures the imprecision regarding the class membership of $X_i$. A large $M_i(\Omega)$ may indicate that $X_i$ is isolated from all other samples. According to the requirements of a qualified feature subset, the loss function with respect to $\Lambda$ is

$$
\arg\min_{\Lambda}\ \frac{1}{N} \sum_{i=1}^{N} \sum_{q=1}^{2} \left\{ M_i(\{\omega_q\}) - t_{i,q} \right\}^2 + \frac{1}{N} \sum_{i=1}^{N} \left\{ M_i(\emptyset)^2 + M_i(\Omega)^2 \right\} + \beta \|\Lambda\|_0. \tag{3}
$$

The first term is a mean squared error measure, where the vector $t_i$ is an indicator of the outcome label, with $t_{i,q} = \delta_{i,q}$, which equals 1 if $Y_i = \omega_q$ and 0 otherwise. The second term penalizes feature subsets that result in high imprecision and large overlaps between different classes. The last term, namely $\|\Lambda\|_0 = \sum_{v=1}^{V} \lambda_v$, forces the selected feature subset to be sparse. The scalar β (≥ 0) is a hyper-parameter that controls the sparsity penalty. It can be tuned according to the training performance. A global optimization method, namely MI-LXPM [4], is utilized to minimize this loss function. Finally, the selected features are used to train a robust classifier, namely the EK-NN classification rule [5], for predicting the outcome of cancer treatment.
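Evaluating the objective of Eq. (3) for one candidate binary vector can be sketched as follows. This is an illustrative sketch only: it assumes the global masses have already been computed for every training sample under the candidate vector, it uses hypothetical dictionary keys for the four masses, and it does not reproduce the MI-LXPM search itself.

```python
def refs_loss(masses, labels, lam, beta):
    """Evaluate the objective of Eq. (3) for one candidate binary vector lam.

    masses: list of dicts with keys "omega1", "omega2", "Omega", "empty"
            (the global mass M_i of each training sample under lam).
    labels: list of 1 or 2, the true class index of each sample.
    """
    n = len(masses)
    mse = 0.0
    penalty = 0.0
    for m, y in zip(masses, labels):
        # Indicator target: 1 for the true class, 0 for the other one.
        t1 = 1.0 if y == 1 else 0.0
        t2 = 1.0 if y == 2 else 0.0
        mse += (m["omega1"] - t1) ** 2 + (m["omega2"] - t2) ** 2
        # Conflict and imprecision are both penalised.
        penalty += m["empty"] ** 2 + m["Omega"] ** 2
    # For a binary vector, the l0 norm is just the number of selected features.
    return mse / n + penalty / n + beta * sum(lam)
```

A global optimizer over binary vectors would call such a function once per candidate, recomputing the masses each time because they depend on the selected features.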

3 Experimental Results

The proposed prediction system has been evaluated on two clinical datasets:

(1) Lung Tumor Data: twenty-five patients with inoperable stage II-III non-small cell lung cancer (NSCLC) treated with curative-intent chemoradiotherapy were studied. All patients underwent FDG-PET scans at initial staging, after induction chemotherapy, and during the fifth week of radiotherapy. In total, 52 SUV-based and GLSZM-based features were extracted. One year after the end of treatment, local or distant recurrence (majority) was diagnosed in 19 patients, while no recurrence (minority) was reported for the remaining 6 patients.

(2) Esophageal Tumor Data: thirty-six patients with esophageal squamous cell carcinomas treated with chemo-radiotherapy were studied. Since only PET/CT scans at initial tumor staging were available, some clinical characteristics were included as complementary knowledge. As a result, 29 features were gathered, comprising SUV-based features, GLSZM-based features, and patients’ clinical characteristics (gender, tumour stage and location, WHO performance status, dysphagia grade and weight loss from baseline). At least one month after treatment, 13 patients were labeled disease-free (minority) when neither loco-regional nor distant tumor recurrence was detected, while the other 23 patients were disease-positive (majority).

Table 1. Feature selection and corresponding prediction performance evaluated by the .632+ Bootstrapping. “All” denotes the input feature space.

Lung Tumor Data
             All   RELIEF  FAST  SVMRFE  KCS   HFS   EFS   REFS
Robustness   —     0.16    0.11  0.12    0.10  0.48  0.21  0.82
Accuracy     0.85  0.82    0.82  0.84    0.83  0.85  0.81  0.94
AUC          0.37  0.64    0.60  0.53    0.65  0.81  0.77  0.94
Subset size  52    10      5     29      7     3     4     4

Esophageal Tumor Data
             All   RELIEF  FAST  SVMRFE  KCS   HFS   EFS   REFS
Robustness   —     0.33    0.61  0.31    0.29  0.32  0.44  0.74
Accuracy     0.74  0.69    0.74  0.74    0.69  0.74  0.77  0.83
AUC          0.63  0.66    0.63  0.75    0.66  0.71  0.75  0.82
Subset size  29    25      5     3       6     5     3     3

Feature Selection & Prediction Performance: REFS was compared with two univariate methods (RELIEF [9] and FAST [2]) and four multivariate methods (SVMRFE [7], KCS [18], HFS [12] and EFS [11]). Because of the limited number of instances, all compared methods were evaluated by the .632+ Bootstrapping [6], which ensures low-bias and low-variance estimation. As a metric of selection performance, the robustness of the selected feature subsets was measured by the relative weighted consistency [15]. Its calculation is based on feature occurrence statistics obtained from all iterations of the .632+ Bootstrapping. The relative weighted consistency takes values in [0, 1], where 1 means that all selected feature subsets are approximately identical. To assess the prediction performance after feature selection, Accuracy and AUC were calculated. For all the compared methods except EFS, the SVM was chosen as the default classifier; the EK-NN [5] classifier was used with EFS and REFS. With the number of Bootstraps set to 100, the results obtained by all methods are summarized in Table 1, where the input feature space is also presented as the baseline for comparison. REFS is competitive, as it led to better performance than the other methods on both imbalanced datasets. The significance of the specified prior knowledge and of the data rebalancing procedure for REFS was also evaluated by successively removing them. The results obtained on the lung tumor data are shown in Fig. 3, from which we can see that both are important for improving the feature selection and prediction performance.

Fig. 3. Evaluating REFS, where REFS+ denotes results obtained without data rebalancing, while REFS∗ denotes results obtained without prior knowledge. [Bar chart of Robustness, Accuracy and AUC for REFS+, REFS∗ and REFS.]

Fig. 4. Features selected on (a) lung and (b) esophageal tumor datasets, respectively. Each column represents a bootstrapping evaluation, while the yellow points denote selected features. [Axes: feature index versus bootstrap index.]

Analysis of Selected Feature Subsets: the indexes of the features selected on both datasets over the 100 different Bootstraps are summarized in Fig. 4. For the lung tumor data, SUVmax during the fifth week of radiotherapy and the temporal change of three GLSZM-based features were stably selected; for the esophageal tumor data, TLG at staging and two clinical characteristics were stably selected.
It is worth noting that the SUV-based features selected by REFS have also been shown to have significant predictive power in clinical studies, e.g., the SUVmax during the fifth week of radiotherapy has been clinically validated in [17] for NSCLC, while the TLG (total lesion glycolysis) at staging has been validated in [10] for oesophageal squamous cell carcinoma. Therefore, we might say that the feature subsets selected by REFS are consistent with existing clinical studies; moreover, the other kinds of features included in each subset can provide complementary information for these already validated predictors to improve the prediction performance. To support this analysis, on the esophageal tumor data, which have been followed up over a long term of up to five years, we drew the Kaplan-Meier (KM) survival curves obtained by the EK-NN classifier using, respectively, the feature subset selected by REFS and the clinically validated predictor (i.e., TLG at tumor staging). The results are shown in Fig. 5, in which each KM survival curve shows the fraction of patients in a classified group that survives over time. As can be seen, using REFS (Fig. 5(b)), patients were better separated into two groups with distinct survival rates than using only TLG (Fig. 5(a)).

Fig. 5. The KM survival curves. The two groups of patients are obtained by (a) the clinically validated predictor, and (b) the features selected by REFS. [Axes: survival proportion versus time (month); legend: two groups and censored observations.]

4 Conclusion

In this paper, the prediction of cancer therapy outcome primarily based on FDG-PET images has been studied. A robust method based on Dempster-Shafer Theory has been proposed to select discriminant feature subsets from small-sized and imbalanced datasets containing noisy and highly overlapping inputs. The effectiveness of the proposed method has been evaluated on two real datasets, and the results obtained are consistent with published clinical studies. Future work will validate the proposed method on more datasets with much higher-dimensional features. In addition, how to improve the stability of the involved prior knowledge should be studied further.

References

1. Aerts, H.J., et al.: Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature Commun. 5 (2014)
2. Chen, X., et al.: Fast: a ROC-based feature selection metric for small samples and imbalanced data classification problems. In: KDD, pp. 124–132 (2008)
3. Cook, G.J., et al.: Radiomics in PET: principles and applications. Clin. Transl. Imaging 2(3), 269–276 (2014)
4. Deep, K., et al.: A real coded genetic algorithm for solving integer and mixed integer optimization problems. Appl. Math. Comput. 212(2), 505–518 (2009)
5. Denœux, T.: A K-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE TSMC 25(5), 804–813 (1995)
6. Efron, B., et al.: Improvements on cross-validation: the 632+ bootstrap method. JASA 92(438), 548–560 (1997)
7. Guyon, I., et al.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1–3), 389–422 (2002)
8. He, H., et al.: Learning from imbalanced data. IEEE TKDE 21(9), 1263–1284 (2009)
9. Kira, K., et al.: The feature selection problem: Traditional methods and a new algorithm. AAAI 2, 129–134 (1992)
10. Lemarignier, C., et al.: Pretreatment metabolic tumour volume is predictive of disease-free survival and overall survival in patients with oesophageal squamous cell carcinoma. Eur. J. Nucl. Med. Mol. Imaging 41(11), 2008–2016 (2014)
11. Lian, C., et al.: An evidential classifier based on feature selection and two-step classification strategy. Pattern Recogn. 48(7), 2318–2327 (2015)
12. Mi, H., et al.: Robust feature selection to predict tumor treatment outcome. Artif. Intell. Med. 64(3), 195–204 (2015)
13. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)
14. Smets, P., et al.: The transferable belief model. Artif. Intell. 66(2), 191–234 (1994)
15. Somol, P., et al.: Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality. IEEE TPAMI 32(11), 1921–1939 (2010)
16. Thibault, G., et al.: Advanced statistical matrices for texture characterization: application to cell classification. IEEE TBME 61(3), 630–637 (2014)
17. Vera, P., et al.: FDG PET during radiochemotherapy is predictive of outcome at 1 year in non-small-cell lung cancer patients: a prospective multicentre study (RTEP2). Eur. J. Nucl. Med. Mol. Imaging 41(6), 1057–1065 (2014)
18. Wang, L.: Feature selection with kernel class separability. IEEE TPAMI 30(9), 1534–1546 (2008)

Structured Sparse Kernel Learning for Imaging Genetics Based Alzheimer’s Disease Diagnosis

Jailin Peng1,2, Le An1, Xiaofeng Zhu1, Yan Jin1, and Dinggang Shen1(B)

1 Department of Radiology and BRIC, UNC at Chapel Hill, Chapel Hill, NC, USA
[email protected]
2 College of Computer Science and Technology, Huaqiao University, Xiamen, China

Abstract. A kernel-learning based method is proposed to integrate multimodal imaging and genetic data for Alzheimer’s disease (AD) diagnosis. To facilitate structured feature learning in kernel space, we represent each feature with a kernel and then group the kernels according to modalities. In view of the highly redundant features within each modality and the complementary information across modalities, we introduce a novel structured sparsity regularizer for feature selection and fusion, which is different from conventional lasso and group lasso based methods. Specifically, we enforce a penalty on the kernel weights to simultaneously select features sparsely within each modality and densely combine different modalities. We have evaluated the proposed method using magnetic resonance imaging (MRI), positron emission tomography (PET), and single-nucleotide polymorphism (SNP) data of subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The effectiveness of our method is demonstrated by both the clearly improved prediction accuracy and the discovered brain regions and SNPs relevant to AD.

1 Introduction

Alzheimer’s disease (AD) is an irreversible and progressive brain disorder. Early prediction of the disease using multimodal neuroimaging data has yielded important insights into the progression patterns of AD [11,16,18]. Among the many risk factors for AD, genetic variation has been identified as an important one [11,17]. Therefore, it is important and beneficial to build prediction models by leveraging both imaging and genetic data, e.g., magnetic resonance imaging (MRI), positron emission tomography (PET), and single-nucleotide polymorphisms (SNPs). However, this is a challenging task due to the multimodal nature of the data, the limited number of observations, and the highly redundant, high-dimensional data.

Multiple kernel learning (MKL) provides an elegant framework to learn an optimally combined kernel representation for heterogeneous data [4,5,10]. When it is applied to classification problems with multimodal data, the data of each modality are usually represented using a base kernel [3,8,12]. The selection of certain sparse regularization methods, such as lasso (ℓ1 norm) [13] and group lasso (ℓ2,1 norm) [15], yields different modality selection approaches [3,8,12]. In particular, ℓ1-MKL [10] is able to sparsely select the most discriminative modalities. With grouped kernels, group lasso performs sparse group selection, while densely combining kernels within groups. In [8], the group lasso regularized MKL was employed to select the most relevant modalities. In [12], a class of generalized group lasso with a focus on inter-group sparsity was introduced into MKL for channel selection on EEG data, where groups correspond to channels.

J. Peng was partially supported by NSFC (11401231) and NSFFC (2015J01254).
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 70–78, 2016. DOI: 10.1007/978-3-319-46723-8_9

Fig. 1. Schematic illustration of our proposed framework (a), and different sparsity patterns (b) produced by lasso (ℓ1 norm), group lasso (ℓ2,1 norm) and the proposed structured sparsity (ℓ1,p norm, p > 1). Darker color in (b) indicates larger weights.

In view of the unique and complementary information contained in different modalities, all of them are expected to be utilized for AD prediction. Moreover, compared with modality-wise analysis followed by relevant modality selection, an integration of feature-level and modality-level analysis is more favorable. However, for some modalities, their features, as a whole or individually, are weaker than those in other modalities. In these scenarios, as shown in Fig. 1(b), the lasso and group lasso tend to independently select the most discriminative features/groups, so that features from weak modalities have less chance of being selected. Moreover, with an ℓ1 norm penalty they are less effective at utilizing complementary information among modalities [5,7]. To address these issues, we propose to jointly learn a better integration of multiple modalities and simultaneously select subsets of discriminative features from all the modalities. Accordingly, we propose a novel structured sparsity (i.e., ℓ1,p norm with p > 1) regularized MKL for heterogeneous multimodal data integration. It is noteworthy that the ℓ1,2 norm was considered in [6,7] in settings such as regression and multitask learning. Here, we go beyond these studies by considering the ℓ1,p constrained MKL for multimodal feature selection and fusion and its application to AD diagnosis.
Moreover, contrary to representing each modality with a single kernel as in conventional MKL-based methods [3,4,8], we assign each feature a kernel and then group kernels according to modalities to facilitate both feature- and group-level analysis. Specifically, we promote sparsity inside groups with an inner ℓ1 norm and pursue a dense combination of groups with an outer nonsparse ℓp norm. Guided by the learning of the modality-level dense combination, sparse feature selections in different modalities interact with each other for a better overall performance. This ℓ1,p regularizer is completely different from group lasso [15] and its generalization [9] (i.e., ℓp,1 norm), which gives sparse groups but performs no feature selection within each group [12,15]. An illustration of the different sparsity patterns selected by lasso, group lasso and the proposed method is shown in Fig. 1(b). In comparison, the proposed model not only keeps information from each modality through the outer nonsparse regularization but also supports variable interpretability and scalability through the inner sparse feature selection.

2 Method

We are given a set of N labeled data samples $\{x^i, y^i\}_{i=1}^{N}$, where $x^i = (x_1^i, x_2^i, \cdots, x_M^i)^T$, M is the number of all features in all modalities, and $y^i \in \{1, -1\}$ is a class label. MKL aims to learn an optimal combination of base kernels, where each kernel describes a different property of the data. To also perform the task of joint feature selection, we assign each feature a base kernel through its own feature mapping. An overview of the proposed framework is illustrated in Fig. 1(a).

2.1 Structured Sparse Feature and Kernel Learning

Let $G = \{1, 2, \cdots, M\}$ be the feature index set, which is partitioned into L non-overlapping groups $\{G_l\}_{l=1}^{L}$ according to task-specific knowledge. For instance, in our application, we partition G into L = 3 groups according to modalities. Let $\{K_m \succeq 0\}_{m=1}^{M}$ be the M base kernels for the M features respectively, which are induced by M feature mappings $\{\phi_m\}_{m=1}^{M}$. Given the feature space defined by the joint feature mapping $\Phi(x) = (\phi_1(x_1), \phi_2(x_2), \cdots, \phi_M(x_M))^T$, we learn a linear discriminant function of the form

$$f(x) = \sum_{l=1}^{L} \sum_{m \in G_l} \sqrt{\theta_m}\, \tilde{w}_m^T \phi_m(x_m) + b.$$

Here, we have explicitly written out the group structure in the function f(x), in which $\tilde{w}_m$ is the normal vector corresponding to $\phi_m$, b encodes the bias, and $\theta = (\theta_1, \theta_2, \cdots, \theta_M)^T$ contains the weights for the M feature mappings. Therefore, feature mappings with zero weights are not active in f(x). In the following, we perform feature selection by enforcing a structured sparsity on the weights of the feature mappings. To introduce a more general model, we further introduce (1) M positive weights $\beta = (\beta_1, \beta_2, \cdots, \beta_M)^T$ for features and (2) L positive weights $\gamma = (\gamma_1, \gamma_2, \cdots, \gamma_L)^T$ for feature groups to encode prior information. If we have no knowledge about group/feature importance, we can set $\beta_m = 1$ and $\gamma_l = 1$ for each l and m. Accordingly, our generalized MKL model with a structured sparsity inducing constraint can be formulated as below:

$$\min_{\theta} \min_{\tilde{w}_m, b} \ \frac{1}{2} \sum_{l=1}^{L} \sum_{m \in G_l} \|\tilde{w}_m\|_2^2 + C' \sum_{i=1}^{N} \mathcal{L}\big(f(x^i), y^i\big), \quad \text{s.t. } \|\theta\|_{1,p;\beta,\gamma} = \Big( \sum_{l=1}^{L} \gamma_l \Big( \sum_{m \in G_l} \beta_m |\theta_m| \Big)^{p} \Big)^{\frac{1}{p}} \le \tau, \ \mathbf{0} \le \theta, \tag{1}$$

Structured Sparse Kernel Learning for Imaging Genetics

where $\mathcal{L}(t, y) = \max(0, 1 - ty)$ is the hinge loss function, $C'$ is a trade-off weight, τ controls the sparsity level, and $\mathbf{0}$ is a vector of all zeros. Similar to the typical MKL [10], this model is equivalent to learning an optimally combined kernel $K = \sum_{l=1}^{L} \sum_{m \in G_l} \theta_m K_m$. The inequality constraint employs a weighted ℓ1,p mixed norm (p > 1), i.e., $\|\cdot\|_{1,p;\beta,\gamma}$, which simultaneously promotes sparsity inside groups with the inner weighted ℓ1 norm and pursues a dense combination of groups with the outer weighted ℓp norm. The rationale for using this regularization is that, while each individual modality contains redundant high-dimensional features, different modalities can offer unique and complementary information. Owing to the heterogeneity of different modalities, we sparsely select features from each homogeneous feature group, i.e., modality, and densely integrate different modalities. As discussed in [5], with p > 1, the non-sparse ℓp norm has the advantage of combining complementary features better than the ℓ1 norm. Moreover, in view of the unequal reliability of different modalities, we take a compromise between ℓ1 lasso and ℓ2 ridge regularization and intuitively set p = 1.5 for inter-group regularization, i.e., ℓ1,1.5. More specifically, due to the geometrical property of the ℓ1.5 contour lines, it results in unequal shrinkage of weights with higher probability than the ℓ2 norm, thus allowing the assignment of larger weights to leading groups/modalities.

Further understanding and computation of our model can be achieved with the following lemma and theorem. Let $w_m = \sqrt{\theta_m}\, \tilde{w}_m$, $w = (w_1, w_2, \cdots, w_M)^T$ and $W = (\|w_1\|_2, \|w_2\|_2, \cdots, \|w_M\|_2)^T$. We first have the following lemma.

Lemma 1. Given p ≥ 1 and positive weights γ and β, we use the convention that 0/0 = 0. For fixed $w \neq 0$, the minimal θ in Eq. (1) is attained at

$$\theta_m^{*} = \frac{\|w_m\|_2}{\beta_m^{\frac{1}{2}}\, \gamma_{l_m}^{\frac{1}{p+1}}\, \|W_{G_{l_m}}\|_{1;\beta}^{\frac{p-1}{p+1}}} \cdot \frac{\tau}{\Big( \sum_{l=1}^{L} \gamma_l^{\frac{1}{p+1}} \|W_{G_l}\|_{1;\beta}^{\frac{2p}{p+1}} \Big)^{\frac{1}{p}}}, \quad \forall m = 1, 2, \cdots, M, \tag{2}$$

where $\|W_{G_l}\|_{1;\beta} = \sum_{m' \in G_l} \beta_{m'}^{\frac{1}{2}} \|w_{m'}\|_2$, and $G_{l_m}$ is the index set containing m.

For the fixed w, this lemma gives an explicit solution for θ. The proof can be done by deriving the first-order optimality conditions of Eq. (1). Plugging Eq. (2) into the model in Eq. (1) yields the following compact optimization problem.

Theorem 1. Let $q = \frac{2p}{p+1}$. For p > 1, the model in Eq. (1) is equivalent to

$$\min_{w_m, b} \ \frac{1}{2\tau} \Big( \sum_{l=1}^{L} \gamma_l^{\frac{2-q}{q}} \|W_{G_l}\|_{1;\beta}^{q} \Big)^{\frac{2}{q}} + C' \sum_{i=1}^{N} \mathcal{L}\Big( \sum_{l=1}^{L} \sum_{m \in G_l} w_m^T \phi_m(x_m^i) + b,\ y^i \Big). \tag{3}$$

The first term is a weighted ℓ1,q norm penalty on W with q ∈ (1, 2). By choosing p = 1.5 and thus q = 1.2, it shares similar group-level regularization property with that in Eq. (1) on θ. Specifically, in each group, only a small number of wm can contribute to the decision function f (x) with nonzero values. Accordingly, few features in each group can be selected. Meanwhile, the sparsely filtered groups are densely combined, while allowing the presence of leading groups.
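The closed form in Lemma 1 can be checked numerically on a toy problem. The following is a minimal sketch in plain Python, assuming each feature is summarized by its norm $\|w_m\|_2$; the function names, the toy groups and the toy norms are illustrative assumptions and do not come from the paper. It evaluates the weighted ℓ1,p norm of Eq. (1) and the update of Eq. (2), so that one can verify that θ* meets the constraint with equality and is zero exactly where $\|w_m\|_2 = 0$.

```python
def weighted_l1p_norm(theta, groups, beta, gamma, p):
    """Weighted l_{1,p} norm used in the constraint of Eq. (1)."""
    total = 0.0
    for l, members in enumerate(groups):
        inner = sum(beta[m] * abs(theta[m]) for m in members)  # weighted l1 inside a group
        total += gamma[l] * inner ** p                          # weighted lp across groups
    return total ** (1.0 / p)


def lemma1_theta(w_norms, groups, beta, gamma, p, tau=1.0):
    """Closed-form theta* of Eq. (2) for fixed per-feature norms ||w_m||_2."""
    # V[l] = ||W_{G_l}||_{1;beta} = sum_m sqrt(beta_m) * ||w_m||_2
    V = [sum(beta[m] ** 0.5 * w_norms[m] for m in members) for members in groups]
    S = sum(gamma[l] ** (1.0 / (p + 1)) * V[l] ** (2.0 * p / (p + 1))
            for l in range(len(groups)))
    theta = [0.0] * len(w_norms)
    if S == 0.0:
        return theta  # w = 0: convention 0/0 = 0
    scale = tau / S ** (1.0 / p)
    for l, members in enumerate(groups):
        if V[l] == 0.0:
            continue  # whole group inactive
        denom_l = gamma[l] ** (1.0 / (p + 1)) * V[l] ** ((p - 1.0) / (p + 1))
        for m in members:
            theta[m] = w_norms[m] / (beta[m] ** 0.5 * denom_l) * scale
    return theta


def theta_objective(w_norms, theta):
    """0.5 * sum_m ||w_m||^2 / theta_m, the theta-dependent part of Eq. (1)."""
    return 0.5 * sum(a * a / t for a, t in zip(w_norms, theta) if a > 0.0)


# Toy setting (hypothetical): 8 features in 3 groups, unit prior weights, p = 1.5.
groups = [[0, 1, 2], [3, 4], [5, 6, 7]]
w_norms = [0.9, 0.0, 0.1, 0.5, 0.0, 0.0, 0.3, 0.2]
beta, gamma = [1.0] * 8, [1.0] * 3
theta_star = lemma1_theta(w_norms, groups, beta, gamma, p=1.5, tau=1.0)
print(theta_star)
print(weighted_l1p_norm(theta_star, groups, beta, gamma, p=1.5))
```

In this sketch every group keeps at least one nonzero weight while inactive features inside a group receive zero weight, which is the sparse-within, dense-across pattern described above.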

J. Peng et al.

2.2 Model Computation

After the change of variables, we can optimize the proposed model via block coordinate descent. For fixed θ, the subproblem in w and b can be solved with any support vector machine (SVM) [2] solver. According to Lemma 1, $\theta_m$ can be computed analytically with w fixed. $\theta_m$ can be initialized as $\theta_m = \big( \sum_{l=1}^{L} \gamma_l \big( \sum_{m' \in G_l} \beta_{m'} \big)^{p} \big)^{-\frac{1}{p}}$ to satisfy the constraint in Eq. (1) (with equality for τ = 1). Moreover, from Eq. (3), it is clear that we can fold τ and $C'$ into a single trade-off weight C and set τ = 1. In this way, we have a single model parameter C, which not only acts as the soft margin parameter but also controls the sparsity of θ and W.
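To make the starting point of the block coordinate descent concrete, the sketch below computes the uniform initialization of θ and the combined kernel $K = \sum_m \theta_m K_m$ for the case in which each base kernel is a linear kernel on a single feature. The function names and the list-of-lists data layout are assumptions made for illustration; the SVM step itself is not shown and could be carried out by any solver that accepts a precomputed kernel.

```python
def initial_theta(groups, beta, gamma, p):
    """Uniform start: theta_m = (sum_l gamma_l * (sum_{m' in G_l} beta_m')^p)^(-1/p)."""
    total = sum(gamma[l] * sum(beta[m] for m in members) ** p
                for l, members in enumerate(groups))
    value = total ** (-1.0 / p)
    n_features = sum(len(members) for members in groups)  # assumes indices 0..M-1
    return [value] * n_features


def combined_linear_kernel(X, theta):
    """K = sum_m theta_m * K_m, with K_m(i, j) = x_m^i * x_m^j (one feature per kernel)."""
    return [[sum(t * xi[m] * xj[m] for m, t in enumerate(theta)) for xj in X]
            for xi in X]


# Hypothetical example: 5 features in 2 groups, 3 samples.
groups = [[0, 1], [2, 3, 4]]
theta0 = initial_theta(groups, beta=[1.0] * 5, gamma=[1.0] * 2, p=1.5)
X = [[0.1, 0.4, 1.0, 0.0, 0.3],
     [0.2, 0.1, 0.5, 0.7, 0.9],
     [0.9, 0.8, 0.2, 0.1, 0.0]]
K0 = combined_linear_kernel(X, theta0)
```

Because the initialization is uniform, the first SVM is trained on an unweighted sum of the single-feature kernels, and sparsity only appears after the first analytic update of θ.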

3 Experimental Results

3.1 Dataset

We evaluated our method on a subset of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset¹. In total, we used MRI, PET, and SNP data of 189 subjects, including 49 patients with AD, 93 patients with Mild Cognitive Impairment (MCI), and 47 Normal Controls (NC). After preprocessing, the MRI and PET images were segmented into 93 regions-of-interest (ROIs). The gray matter volumes of these ROIs in MRI and the average intensity of each ROI in PET were calculated as features. The SNPs [11] were genotyped using the Human 610-Quad BeadChip. Among all SNPs, only those belonging to the top AD candidate genes listed in the AlzGene database² as of June 10, 2010 were selected after the standard quality control and imputation steps. The Illumina annotation information based on the Genome build 36.2 was used to select a subset of SNPs belonging or proximal to the top 135 AD candidate genes. The above procedure yielded 5677 SNPs from 135 genes. Thus, we have a total of 93 + 93 + 5677 = 5863 features from the three modalities for each subject.

3.2 Experimental Settings

For method evaluation, we used 10 times repeated 10-fold cross-validation. All parameters were learned by conducting 5-fold inner cross-validation. Three measures, namely classification accuracy (ACC), sensitivity (SEN), and specificity (SPE), were used. We compared the proposed method with (1) feature selection based methods, i.e., Fisher Score (FS) [2] and Lasso [13], and (2) MKL based methods, i.e., the method of Zhang et al. [16] and ℓ1-MKL [10]. In the Lasso method, the logistic loss [2] was used. The method in [16] represented each modality with a base kernel and further learned a linearly combined kernel with cross-validation. For FS, Lasso and the method in [16], the linear SVM implemented in the LibSVM software³ was used as the classifier. For all methods, we used a t-test [2] thresholded by p-value as a feature pre-selection step to reduce feature size and improve computational efficiency. The commonly used p-value < 0.05 was applied for MRI and PET. Considering the large number of SNP features, we selected the p-value threshold from {0.05, 0.02, 0.01}. Therefore, t-test-SVM, which combines the t-test and SVM, was also included for comparison with the same p-value setting. For our proposed model, ℓ1-MKL and Zhang’s method, to avoid further kernel parameter selection, each kernel matrix was defined as a linear kernel on a single feature. Furthermore, we assumed no knowledge of feature and group weights and thus set γ = 1 and β = 1. The soft margin parameter C was selected by grid search from $\{2^{-5}, 2^{-4}, \cdots, 2^{5}\}$.

¹ http://adni.loni.usc.edu.
² http://www.alzgene.org.
³ https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

3.3 Results and Discussions

The classification results of AD vs. NC and MCI vs. NC using all three modalities are listed in Table 1. By taking advantage of the structured feature learning in kernel space, the proposed method outperforms all competing methods in classification accuracy. For AD vs. NC classification, our method achieves an ACC of 96.1 %, an improvement of 2.1 % over the best performance of the other methods. Meanwhile, the standard deviation of the proposed method is also lower, demonstrating its stability. For classifying MCI from NC, the improvement by the proposed method is 2.4 % in terms of ACC. In comparison with t-test-SVM, we obtained 4.2 % and 7.6 % improvements in terms of ACC for classifying AD and MCI from NC, respectively. Similar results are obtained for the classification of AD vs. MCI, which are not listed in Table 1 due to the space limit. For example, the ACCs of Lasso-SVM, ℓ1-MKL and our method are 70.3 ± 1.5 %, 73.0 ± 1.6 %, and 76.9 ± 1.4 %, respectively. In summary, these results show the improved classification performance of our method.

Table 1. Performance comparison of different methods in terms of “mean ± standard deviation” for AD vs. NC and MCI vs. NC classifications, using MRI, PET and SNPs. The superscript “∗” indicates statistically significant difference (p-value < 0.05) compared with the proposed method

Methods        AD vs. NC (%)                             MCI vs. NC (%)
               ACC           SEN          SPE            ACC           SEN          SPE
t-test-SVM     91.9 ± 1.9∗   92.7 ± 2.0   91.1 ± 3.0     72.7 ± 2.1∗   85.4 ± 2.9   47.7 ± 5.2
FS-SVM         92.4 ± 1.3∗   93.5 ± 2.7   91.3 ± 1.9     76.1 ± 1.4∗   84.3 ± 2.2   59.8 ± 3.1
Zhang et al.   92.6 ± 2.2∗   92.7 ± 1.4   92.6 ± 2.7     75.1 ± 1.4∗   82.8 ± 3.0   59.8 ± 2.9
Lasso-SVM      93.5 ± 1.4∗   94.9 ± 1.7   92.1 ± 1.8     76.3 ± 2.3∗   85.2 ± 2.3   58.7 ± 4.3
ℓ1-MKL         94.0 ± 1.4∗   94.3 ± 2.5   93.6 ± 2.0     77.9 ± 1.4∗   85.7 ± 1.4   62.6 ± 4.0
Proposed       96.1 ± 1.0    97.3 ± 1.0   94.9 ± 1.8     80.3 ± 1.6    85.6 ± 1.9   69.8 ± 3.7

To further investigate the benefit of SNP data and multimodality fusion, Table 2 reports the performance of the proposed method w.r.t. different modality combinations. First of all, the performance of any single modality is much lower than that of their combinations. Among the three modalities, the SNP data shows the lowest performance. However, when combined with other modalities, genetic data clearly helps improve predictions. For example, in AD vs. NC classification, the performances using MRI+SNP and PET+SNP demonstrate 2.7 % and 5.7 % improvements in terms of ACC over the cases of only using MRI and PET, respectively; the improvement with MRI+PET+SNP over that with MRI+PET is 3.8 %. Similar results are obtained for MCI vs. NC.

Table 2. Comparison of our proposed method in the cases of using different modality combinations. “∗” indicates statistically significant difference with MRI+PET+SNP

Modalities     AD vs. NC (%)              MCI vs. NC (%)
               ACC      SEN     SPE       ACC     SEN     SPE
MRI            88.4∗    84.1    93.0      71.6    83.9    47.2
PET            86.3     84.5    88.1      68.8    85.5    35.7
SNP            76.0∗    69.8    82.6      66.2    75.4    48.1
MRI+PET        92.3∗    91.9    91.7      76.4    83.9    61.5
MRI+SNP        91.1     89.8    92.6      74.9    84.5    55.7
PET+SNP        92.0∗    90.8    93.2      71.3    81.4    51.3
MRI+PET+SNP    96.1     97.3    94.9      80.3    85.6    69.8
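For readers who wish to reproduce the evaluation protocol of Sect. 3.2, the following sketch shows one possible way to generate the repeated 10-fold splits, the grid for C, and the three reported measures from predicted labels. It is an illustration under stated assumptions rather than the evaluation code used in this work: the function names are hypothetical, the folds are not stratified (the text does not say whether stratification was used), and the inner 5-fold parameter selection is omitted.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for r in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                      # new random partition per repeat
        for f in range(n_folds):
            test = sorted(order[f::n_folds])    # every n_folds-th shuffled index
            held_out = set(test)
            train = [i for i in range(n_samples) if i not in held_out]
            yield r, f, train, test


def acc_sen_spe(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    acc = (tp + tn) / len(pairs)
    sen = tp / (tp + fn) if tp + fn else float("nan")
    spe = tn / (tn + fp) if tn + fp else float("nan")
    return acc, sen, spe


# Candidate soft margin parameters {2^-5, ..., 2^5} from Sect. 3.2.
C_GRID = [2.0 ** e for e in range(-5, 6)]
```

With the class labels y ∈ {1, −1} used above, the patient class would be passed as `positive=1`, and the reported mean ± standard deviation would be taken over the repeats.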

The most frequently selected brain regions and SNPs in our algorithm are also potential biomarkers for clinical diagnosis. In MRI, the hippocampal formation and the uncus in the parahippocampal gyrus are identified in both AD vs. NC and MCI vs. NC classifications, as well as multiple temporal gyrus regions. This is in line with the findings on the most affected regions in AD in previous neuro-studies [3,8,16,18]. The amygdala, a subcortical region that is the integrative center for emotions, is also identified for AD. In PET, the angular gyri, precuneus, and entorhinal cortices are the regions identified, which are also among the altered regions in AD reported in prior studies [16,18]. As to the genetic information, the most selected SNPs for AD vs. NC classification are from the APOE, VEGFA, and SORCS1 genes. For MCI prediction, the most selected SNPs are from the KCNMA1, APOE, VEGFA and CTNNA3 genes. Generally, our results are consistent with existing results [11,17]. For instance, APOE and SORCS1 are well-known top candidate genes related to AD and MCI [11]. VEGFA, the gene for vascular endothelial growth factor, represents a potential mechanism through which vascular and AD pathologies are related [1].

4 Conclusion

We developed a kernel-based multimodal feature selection and integration method, and applied it to imaging and genetic data for AD diagnosis. Instead of independently selecting features from each modality and then combining them [16], or selecting only the most relevant modalities [8,14], we integrated multimodal feature selection and combination in a novel structured sparsity regularized kernel learning framework. A block coordinate descent algorithm was derived to solve our general ℓ1,p (p ≥ 1) constrained non-smooth objective function. Comparisons in various experiments have shown better AD diagnosis performance with our proposed method. In future work, we will incorporate prior knowledge about feature/group importance into the proposed framework.

References

1. Chiappelli, M., Borroni, B., Archetti, S., et al.: VEGF gene and phenotype relation with Alzheimer’s disease and mild cognitive impairment. Rejuvenation Res. 9(4), 485–493 (2006)
2. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, New York (2012)
3. Hinrichs, C., Singh, V., Xu, G., Johnson, S.: MKL for robust multi-modality AD classification. In: Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (eds.) MICCAI 2009. LNCS, vol. 5762, pp. 786–794. Springer, Heidelberg (2009). doi:10.1007/978-3-642-04271-3_95
4. Jin, Y., Wee, C.Y., Shi, F., et al.: Identification of infants at high-risk for autism spectrum disorder using multiparameter multiscale white matter connectivity networks. Hum. Brain Mapp. 36(12), 4880–4896 (2015)
5. Kloft, M., Brefeld, U., Sonnenburg, S., et al.: Lp-norm multiple kernel learning. J. Mach. Learn. Res. 12, 953–997 (2011)
6. Kong, D., Fujimaki, R., Liu, J., et al.: Exclusive feature learning on arbitrary structures via L12-norm. In: NIPS, pp. 1655–1663 (2014)
7. Kowalski, M.: Sparse regression using mixed norms. Appl. Comput. Harmon. Anal. 27(3), 303–324 (2009)
8. Liu, F., Zhou, L., Shen, C., et al.: Multiple kernel learning in the primal for multimodal Alzheimer’s disease classification. IEEE J. Biomed. Health Inform. 18(3), 984–990 (2014)
9. Liu, J., Ye, J.: Efficient l1/lq norm regularization. arXiv:1009.4766 (2010)
10. Rakotomamonjy, A., Bach, F., Canu, S., et al.: Simple MKL. J. Mach. Learn. Res. 9, 2491–2521 (2008)
11. Shen, L., Thompson, P., Potkin, S., et al.: Genetic analysis of quantitative phenotypes in AD and MCI: imaging, cognition and biomarkers. Brain Imaging Behav. 8(2), 183–207 (2014)
12. Szafranski, M., Grandvalet, Y., Rakotomamonjy, A.: Composite kernel learning. Mach. Learn. 79(1), 73–103 (2010)
13. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58(1), 267–288 (1996)
14. Wang, H., Nie, F., Huang, H.: Multi-view clustering and feature learning via structured sparsity. In: ICML, pp. 352–360 (2013)
15. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B Stat. Methodol. 68(1), 49–67 (2006)
16. Zhang, D., Wang, Y., Zhou, L., et al.: Multimodal classification of Alzheimer’s disease and mild cognitive impairment. NeuroImage 55(3), 856–867 (2011)
17. Zhang, Z., Huang, H., Shen, D.: Integrative analysis of multi-dimensional imaging genomics data for Alzheimer’s disease prediction. Front. Aging Neuros. 6, 260 (2013)
18. Zhu, X., Suk, H.I., Lee, S.W., et al.: Subspace regularized sparse multi-task learning for multiclass neurodegenerative disease identification. IEEE Trans. Biomed. Eng. 63(3), 607–618 (2015)

Semi-supervised Hierarchical Multimodal Feature and Sample Selection for Alzheimer’s Disease Diagnosis

Le An, Ehsan Adeli, Mingxia Liu, Jun Zhang, and Dinggang Shen(B)

Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, USA
[email protected]

Abstract. Alzheimer’s disease (AD) is a progressive neurodegenerative disease that impairs a patient’s memory and other important mental functions. In this paper, we leverage the mutually informative and complementary features from both structural magnetic resonance imaging (MRI) and single nucleotide polymorphism (SNP) for improving the diagnosis. Due to the feature redundancy and sample outliers, direct use of all training data may lead to suboptimal performance in classification. In addition, as redundant features are involved, the most discriminative feature subset may not be identified in a single step, as commonly done in most existing feature selection approaches. Therefore, we formulate a hierarchical multimodal feature and sample selection framework to gradually select informative features and discard ambiguous samples in multiple steps. To positively guide the data manifold preservation, we utilize both labeled and unlabeled data in the learning process, making our method semi-supervised. The finally selected features and samples are then used to train support vector machine (SVM) based classification models. Our method is evaluated on 702 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, and the superior classification results in AD related diagnosis demonstrate the effectiveness of our approach as compared to other methods.

1 Introduction

As one of the most common neurodegenerative diseases, Alzheimer’s disease (AD) accounts for most dementia cases. AD is progressive and the symptoms worsen over time by gradually affecting patients’ memory and other mental functions. Unfortunately, there is no cure for AD yet. Nevertheless, once AD is diagnosed, treatment including medications and management strategies can help improve symptoms. Therefore, timely and accurate diagnosis of AD and its prodromal stage, i.e., mild cognitive impairment (MCI), which can be further categorized into progressive MCI (pMCI) and stable MCI (sMCI), is highly desired in practice. Among various diagnosis tools, brain imaging, such as structural magnetic resonance imaging (MRI), has been widely used, since it allows accurate measurements of the brain structures, especially in the hippocampus and other AD related regions [1].

Besides imaging data, genetic variants are also related to AD [2], and genome-wide association studies (GWAS) have been conducted to identify the association between single nucleotide polymorphism (SNP) and the imaging data [3]. In [4], the associations between SNPs and MRI-derived measures with the presence of AD were explored and the informative SNPs were identified to guide the disease interpretation. To date, most previous works have focused on analyzing the correlation between imaging and genetic data [5], while using both for AD/MCI diagnosis has received very little attention [6].

In this paper, we aim to jointly use structural MRI and SNPs for improving AD/MCI diagnosis, as the data from both modalities are mutually informative [3]. For MRI-based diagnosis, features can be extracted from regions-of-interest (ROIs) in the brain [6]. Since not all of the ROIs are relevant to the particular disease of AD/MCI, feature selection can be conducted to identify the most relevant features in order to learn the classification model more effectively [7]. Similarly, only a small number of SNPs from a large SNP pool are associated with AD/MCI [6]. Therefore, it is preferable to use only the most discriminative features from both MRI and SNPs to learn the most effective classification model. To achieve this, supervised feature selection methods such as Lasso-based sparse feature learning have been widely used [8]. However, they do not consider discarding non-discriminative samples, which might be outliers or non-representative, and including them in the model learning process can be counterproductive.

This work was supported in part by NIH grants (EB006733, EB008374, EB009634, MH100217, MH108914, AG041721, AG049371, AG042599).
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 79–87, 2016. DOI: 10.1007/978-3-319-46723-8_10
In this paper, we propose a semi-supervised hierarchical multimodal feature and sample selection (ss-HMFSS) framework. We utilize both labeled and unlabeled data for manifold regularization, to preserve the neighborhood structures during the mapping from the original feature space to the label space. Furthermore, since the redundant features and outlier samples inevitably affect the learning process, instead of selecting features and samples in one step, we perform feature and sample selection in a hierarchical manner. The updated features and pruned sample set from each current hierarchy are supplied to the next one to further identify a subset with most discriminative features and samples. In this way, we gradually refine the feature and sample subsets step-by-step, undermining the effect of non-discriminative or rather noisy data. The finally selected features and samples are used to train support vector machine (SVM) classifiers for AD and MCI related diagnosis. The proposed method is evaluated on 702 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. In different classification tasks, i.e., AD vs. NC, MCI vs. NC, and pMCI vs. sMCI, superior results are achieved by our framework as compared to the other competing methods.

2 Method

2.1 Data Preprocessing

In this study, we use 702 subjects in total from the ADNI cohort whose MRI and SNP features are available¹. Among them, 165 are AD patients, 342 are MCI patients, and the remaining 195 subjects are normal controls (NCs). Among the MCI patients, there are 149 pMCI cases and 193 sMCI cases. sMCI subjects are those who were diagnosed as MCI patients and remained stable throughout, while pMCI refers to MCI cases that converted to AD within 24 months. For MRI data, the preprocessing steps included skull stripping, dura and cerebellum removal, intensity correction, tissue segmentation and registration. The preprocessed images were then divided into 93 pre-defined ROIs, and the gray matter volumes in these ROIs were calculated as MRI features. The SNP data were genotyped using the Human 610-Quad BeadChip. According to the AlzGene database², only SNPs that belong to the top AD gene candidates were selected. The selected SNPs were imputed to estimate the missing genotypes, and the Illumina annotation information was used to select a subset of SNPs [9]. The processed SNP data have 2098 features. Since the SNP feature dimension is much higher than that of MRI, we perform sparse feature learning [8] on the training data to reduce the SNP feature dimension to a level similar to that of the MRI features.

2.2 Semi-supervised Hierarchical Feature and Sample Selection

The framework of the proposed method is illustrated in Fig. 1. After features are extracted and preprocessed from the raw SNP and MRI data, we first calculate the graph Laplacian matrix to model the data structure using the concatenated features from both labeled and unlabeled data. This Laplacian matrix is then used in the manifold regularization to jointly learn the feature coefficients and sample weights. The features are selected and weighted based on the learned coefficients, and the samples are pruned by discarding those with smaller sample weights. The updated features and samples are forwarded to the next hierarchy for further selection in the same manner. In such a hierarchical manner, we gradually select the most discriminative features and samples in order to mitigate the effects of data redundancy in the learning process. Finally, the selected features and samples are used to train classification models (SVM in this work) for AD/MCI diagnosis tasks. In the following, we explain in detail how the joint feature and sample selection works in each hierarchy.

¹ http://adni.loni.usc.edu/.
² www.alzgene.org.

Fig. 1. Framework of the proposed semi-supervised hierarchical multimodal feature and sample selection (ss-HMFSS) for AD/MCI diagnosis.

Suppose we have $N_1$ labeled training subjects with their class labels and the corresponding features from both MRI and SNP, denoted by $y \in \mathbb{R}^{N_1}$, $X_{\mathrm{MRI}} \in \mathbb{R}^{N_1 \times d_1}$, and $X_{\mathrm{SNP}} \in \mathbb{R}^{N_1 \times d_2}$, respectively. In addition, data from $N_2$ unlabeled subjects are also available, denoted as $\tilde{X}_{\mathrm{MRI}} \in \mathbb{R}^{N_2 \times d_1}$ and $\tilde{X}_{\mathrm{SNP}} \in \mathbb{R}^{N_2 \times d_2}$. The goal is to utilize both labeled and unlabeled data in a semi-supervised framework to jointly select the most discriminative samples and features for the subsequent classification model training and prediction. Let $X = [X_{\mathrm{MRI}}, X_{\mathrm{SNP}}] \in \mathbb{R}^{N_1 \times (d_1 + d_2)}$ be the concatenated features of the labeled data, $\tilde{X} = [\tilde{X}_{\mathrm{MRI}}, \tilde{X}_{\mathrm{SNP}}] \in \mathbb{R}^{N_2 \times (d_1 + d_2)}$ represent the features of the unlabeled data, and $w \in \mathbb{R}^{d_1 + d_2}$ be the feature coefficient vector. The objective function for this joint sample and feature learning model can be written as

$$F = E(y, X) + R_m(y, X, \tilde{X}, w) + R_f(w), \tag{1}$$

where E(y, X) is the loss function defined for the labeled data, and $R_m(y, X, \tilde{X}, w)$ is the manifold regularization term for both labeled and unlabeled data. This regularizer is based on the natural assumption that if two data samples $x_p$ and $x_q$ are close in their original feature space, they should also be close to each other after mapping into the new space (i.e., the label space). $R_f(w) = \|w\|_1$ is the sparse regularizer for the purpose of feature selection. In the following, we explain in detail how the loss function and the manifold regularization term are defined by taking into account sample weights.

Loss function: The loss function E(y, X) considers the weighted loss for each sample, and it is defined as

$$E(y, X) = \|A(y - Xw)\|_2^2, \tag{2}$$

where $A \in \mathbb{R}^{N_1 \times N_1}$ is a diagonal matrix with each diagonal element denoting the weight for a data sample. Intuitively, a sample that can be more accurately mapped into the label space with less error is more desirable, and thus it should contribute more to the classification model. The sample weights in A will be learned through optimization and the samples with larger weights will be selected to train the classifier.

Manifold regularization: The manifold regularization preserves the neighborhood structures for both labeled and unlabeled data during the mapping from the feature space to the label space. It is defined as

$$R_m(y, X, \tilde{X}, w) = (\hat{A} \hat{X} w)^{\top} L (\hat{A} \hat{X} w), \tag{3}$$

where $\hat{X} \in \mathbb{R}^{(N_1 + N_2) \times (d_1 + d_2)}$ contains the features of both labeled data X and unlabeled data $\tilde{X}$. The Laplacian matrix $L \in \mathbb{R}^{(N_1 + N_2) \times (N_1 + N_2)}$ is given by L = D − S, where D is a diagonal matrix such that $D(p, p) = \sum_q S(p, q)$, and S is the affinity matrix with S(p, q) denoting the similarity between samples $x_p$ and $x_q$. S(p, q) is defined as

$$S(p, q) = 1 - |y_p - y_q|, \tag{4}$$

where $y_p$ and $y_q$ are the labels for $x_p$ and $x_q$. For an unlabeled sample $x_p$, $y_p$ is a soft label defined as $y_p = k_p^{pos}/k$, where $k_p^{pos}$ is the number of $x_p$’s neighbors with positive class labels out of its k neighbors in total. Note that for an unlabeled sample, the nearest neighbors are searched only in the labeled training data, and the soft label represents its closeness to a target class. Using this definition, the similarity matrix S encodes relationships among both labeled and unlabeled samples.

The diagonal matrix $\hat{A} \in \mathbb{R}^{(N_1 + N_2) \times (N_1 + N_2)}$ applies weights to both labeled and unlabeled samples. The elements of $\hat{A}$ are different for labeled and unlabeled data:

$$\hat{A}(p, p) = \begin{cases} A(p, p), & p \in [1, N_1], \\ \left| 1 - \dfrac{2 k_p^{pos}}{k} \right|, & p \in [N_1 + 1, N_1 + N_2]. \end{cases} \tag{5}$$

By this definition, an unlabeled sample whose k nearest neighbors are relatively balanced between the positive and negative classes (i.e., $k_p^{pos}/k \approx 0.5$) is assigned a smaller weight, as this sample may not be representative enough in terms of class separation. The weights in A for the labeled data are to be learned in the optimization process.
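The graph quantities of Eqs. (4) and (5) can be sketched in a few lines of plain Python. The sketch below assumes class labels coded as 0 (negative) and 1 (positive), so that the soft labels in [0, 1] are directly comparable with the hard labels, and it assumes Euclidean distance for the neighbor search; neither choice is stated in the text, and the function names are illustrative only.

```python
import math


def soft_labels(X_lab, y_lab, X_unl, k):
    """Soft label k_pos / k from the k nearest labeled neighbours of each unlabeled sample."""
    out = []
    for x in X_unl:
        # Neighbours are searched only among the labeled samples.
        nearest = sorted(range(len(X_lab)), key=lambda i: math.dist(x, X_lab[i]))[:k]
        out.append(sum(1 for i in nearest if y_lab[i] == 1) / k)
    return out


def graph_laplacian(labels):
    """Return (S, L) with S(p, q) = 1 - |y_p - y_q| (Eq. 4) and L = D - S."""
    n = len(labels)
    S = [[1.0 - abs(labels[p] - labels[q]) for q in range(n)] for p in range(n)]
    L = [[-S[p][q] for q in range(n)] for p in range(n)]
    for p in range(n):
        L[p][p] += sum(S[p])  # D(p, p) = sum_q S(p, q)
    return S, L


def unlabeled_weights(soft):
    """Diagonal entries of A-hat for unlabeled samples: |1 - 2 k_pos / k| (Eq. 5)."""
    return [abs(1.0 - 2.0 * s) for s in soft]


# Hypothetical toy data: four labeled and three unlabeled samples in two dimensions.
X_lab = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y_lab = [0, 0, 1, 1]
X_unl = [[0.2, 0.1], [5.1, 5.2], [2.5, 3.0]]
soft = soft_labels(X_lab, y_lab, X_unl, k=2)
S, L = graph_laplacian(y_lab + soft)   # labeled samples first, then unlabeled
a_hat_unlabeled = unlabeled_weights(soft)
```

In this toy example the third unlabeled sample lies between the two classes, so its soft label is 0.5 and its weight in $\hat{A}$ vanishes, which matches the behavior described for ambiguous samples.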

Overall objective function: Taking into account the loss function, the manifold regularization, and the sparse regularization on features, the objective function is

$$\min_{w, A} \ \|A(y - Xw)\|_2^2 + \lambda_1 (\hat{A} \hat{X} w)^{\top} L (\hat{A} \hat{X} w) + \lambda_2 \|w\|_1, \quad \text{s.t. } \mathbf{1}^{\top} \mathrm{diag}(A) = 1, \ \mathrm{diag}(A) \ge 0. \tag{6}$$

Note that the elements in A are constrained to be non-negative so that physically interpretable weights are assigned to different samples. Also, the diagonal of A should sum to one, which allows the sample weights to be interpreted as probabilities and ensures that the sample weights will not all be zero. We employ an alternating optimization strategy to solve this problem, i.e., we fix A to find the solution of w, and vice versa. For solving w, the Accelerated Proximal Gradient (APG) algorithm is used. The optimization on A is a constrained quadratic programming problem, and it can be solved using the interior-point algorithm. After each hierarchy, insignificant features and samples are discarded based on the values in w and A, and the remaining features are weighted by the coefficients in w. The remaining samples with their updated features are used in the next hierarchy to further refine the sample and feature set. The entire process of the proposed method is summarized in Algorithm 1.

Algorithm 1. Semi-supervised hierarchical multimodal feature and sample selection
Input: Labeled and unlabeled data from MRI and SNP, and the number of hierarchies L.
1: Initialize labeled sample weights in A and feature coefficients in w.
2: for i = 1 to L do
3:   Calculate the data similarity scores in S by Eq. (4).
4:   Calculate the sample weights in Â by Eq. (5).
5:   repeat
6:     Fix A and solve w in Eq. (6).
7:     Fix w and solve A in Eq. (6).
8:   until convergence
9:   Discard insignificant samples and features based on the values in A and w.
10:  Weight the remaining features by the coefficients in w.
11: end for
Output: Subset of samples and features for classification model training.
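The pruning performed at the end of each hierarchy can be illustrated as follows. This is a minimal sketch and not the implementation used in this work: the function name and data layout are hypothetical, the default thresholds (5 % of the samples, coefficient magnitude below $10^{-3}$) are those reported in the experimental settings, and comparing the absolute value of each coefficient with the threshold as well as re-weighting by direct multiplication with the signed coefficient are assumptions about details the text leaves open.

```python
def prune_hierarchy(X, y, sample_weights, w, drop_fraction=0.05, coef_threshold=1e-3):
    """One pruning step: drop low-weight samples and near-zero features, then re-weight."""
    n_drop = int(round(drop_fraction * len(X)))
    # Sample indices ordered by learned weight, lowest first.
    ranked = sorted(range(len(X)), key=lambda i: sample_weights[i])
    keep_samples = sorted(ranked[n_drop:])
    # Keep features whose coefficient magnitude reaches the threshold (assumption: |w_j|).
    keep_features = [j for j, c in enumerate(w) if abs(c) >= coef_threshold]
    # Weight the surviving features by their coefficients (assumption: plain product).
    X_new = [[X[i][j] * w[j] for j in keep_features] for i in keep_samples]
    y_new = [y[i] for i in keep_samples]
    return X_new, y_new, keep_samples, keep_features
```

Calling this function once per hierarchy, with `sample_weights` taken from the diagonal of A and `w` from the solution of Eq. (6), would hand a smaller and re-weighted data set to the next hierarchy.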

3

Experiments

Experimental Settings: We consider three binary classification tasks in the experiments: AD vs. NC, MCI vs. NC, and pMCI vs. sMCI. A 10-fold cross-validation strategy is adopted to evaluate the classification performance. For the unlabeled data used in our method, we choose subjects that are irrelevant to the current classification task; e.g., when we classify AD vs. NC, the data from MCI subjects are used as unlabeled data. The dimension of the SNP features is reduced to 100 before the joint feature and sample learning. The neighborhood size k is chosen by cross-validation on the training data. After each hierarchy, 5 % of the samples are discarded, and the features whose coefficients are smaller than $10^{-3}$ are removed. To train the classifier, we use LIBSVM’s implementation of the linear SVM³. The parameters for feature and sample selection in each classification task are determined by grid search on the training data.

³ https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Results: To examine the effectiveness of the proposed hierarchical structure, Fig. 2 shows the classification accuracy (ACC) and the area under the receiver operating characteristic curve (AUC) for different numbers of hierarchies. The use of more hierarchies benefits the classification performance in all tasks, although the improvement becomes marginal after the third hierarchy. Especially for cases such as pMCI vs. sMCI, where the training data are not abundant, continuing to discard samples and features over many hierarchies may leave insufficient data for training the classification model. Therefore, we set the number of hierarchies to three in our experiments. It is also worth mentioning that, compared to AD vs. NC classification, MCI vs. NC and pMCI vs. sMCI classifications are more difficult, yet they are important problems for early diagnosis and possible therapeutic interventions.
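The pruning applied after each hierarchy (discarding 5 % of the samples and removing features whose coefficients fall below 10^-3) can be sketched as below. The sketch assumes that the samples with the smallest learned weights are the ones discarded, that coefficients are compared by absolute value, and that the 5 % count is rounded to the nearest integer; these details and the function name `prune` are illustrative assumptions rather than the authors' stated procedure.

```python
import numpy as np


def prune(X, y, a, w, sample_frac=0.05, coef_tol=1e-3):
    """Drop the lowest-weight samples and near-zero features, then reweight features."""
    n_drop = int(round(sample_frac * X.shape[0]))
    # Keep all but the n_drop samples with the smallest weights (original order preserved).
    keep_s = np.sort(np.argsort(a)[n_drop:])
    # Keep features whose coefficient magnitude reaches the tolerance.
    keep_f = np.flatnonzero(np.abs(w) >= coef_tol)
    # Weight each remaining feature column by its coefficient.
    X_new = X[np.ix_(keep_s, keep_f)] * w[keep_f]
    return X_new, y[keep_s], keep_s, keep_f


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = np.where(np.arange(40) % 2 == 0, 1, -1)
a = np.arange(1, 41) / np.arange(1, 41).sum()   # increasing sample weights, sum to one
w = np.array([0.5, 0.0005, -0.2, 0.0])
X_new, y_new, keep_s, keep_f = prune(X, y, a, w)
print(X_new.shape, keep_f)
```

Calling `prune` once per hierarchy and feeding `X_new` and `y_new` into the next round mirrors the loop of Algorithm 1 under these assumptions.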

[Figure 2 panels: (a) AD vs. NC, (b) MCI vs. NC, (c) pMCI vs. sMCI. Each panel plots ACC and AUC (results in %) against the number of hierarchies (0 to 3).]

Fig. 2. Effects of using different numbers of hierarchies.

For benchmark comparison, the proposed method (ss-HMFSS) is compared with the following baseline methods: (1) classification using MRI features only without feature selection (noFS (MRI only)), (2) classification using SNP features only without feature selection (noFS (SNP only)), (3) classification using concatenated MRI and SNP features without feature selection (noFS), (4) classification using concatenated MRI and SNP features with Laplacian score for feature selection (Laplacian), and (5) classification using concatenated MRI and SNP features with Lasso-based sparse feature learning (SFL). In addition, we evaluate the performance of our method using labeled data only (HMFSS). Besides ACC and AUC, we also report sensitivity (SEN) and specificity (SPE). The mean classification results are reported in Table 1.

Regarding each feature modality, MRI is more discriminative than SNP in distinguishing AD from NC, while for MCI vs. NC and pMCI vs. sMCI classifications, SNP is more useful. Directly combining features from two different modalities does not necessarily improve the results. For example, in AD vs. NC classification, simply concatenating SNP and MRI features decreases the accuracy due to the less discriminative nature of the SNP features, which contribute negatively to the classification model learning. This limitation is alleviated by SFL. In our method, we further improve the selection scheme in a hierarchical manner, and only the most discriminative features and samples are kept to train the classification models. Even without using unlabeled data, our method (i.e., HMFSS) outperforms the other baseline methods. By incorporating unlabeled data to facilitate the learning process, the performance of our method (i.e., ss-HMFSS) is further improved. It is also worth mentioning that when only feature selection is enabled in our method, the accuracies for the three classification tasks are 91.1 %, 77.2 %, and 77.9 %, respectively, which are all inferior to the results obtained using both feature and sample selection. Compared with a state-of-the-art method for AD diagnosis [10], which considers the relationships among samples and different feature modalities when performing feature selection, at least a 1–2 % improvement in accuracy is achieved by our method on the same data. Regarding the computational cost, our Matlab implementation, run on a computer with a 2.4 GHz CPU and 8 GB of memory, takes about 15 s for feature and sample selection, and the SVM classifier training takes less than 0.5 s.

Table 1. Comparison of classification performance by different methods (in %)

Method           | AD vs. NC                 | MCI vs. NC                | pMCI vs. sMCI
                 | ACC   SPE   SEN   AUC     | ACC   SPE   SEN   AUC     | ACC   SPE   SEN   AUC
noFS (MRI only)  | 88.3  81.9  92.2  94.1    | 72.5  80.7  42.9  72.0    | 68.4  59.4  75.9  73.2
noFS (SNP only)  | 77.3  75.3  80.6  85.3    | 74.8  83.2  36.1  74.1    | 73.1  67.7  80.8  79.2
noFS             | 87.5  81.6  90.3  95.6    | 73.8  85.1  53.6  80.6    | 74.7  64.5  78.8  83.4
Laplacian        | 88.7  82.9  90.5  95.7    | 74.6  86.9  56.8  82.8    | 75.5  67.3  81.4  77.5
SFL              | 89.2  83.6  90.7  95.8    | 74.7  87.3  57.6  83.2    | 76.3  68.1  83.0  77.9
HMFSS            | 90.8  83.9  94.1  96.7    | 77.6  83.9  65.7  85.2    | 78.3  68.9  84.5  85.6
ss-HMFSS         | 92.1  85.7  95.9  97.3    | 79.9  85.0  67.5  86.9    | 80.7  71.1  85.3  88.0

4

Conclusions

In this paper, we have proposed a semi-supervised hierarchical multimodal feature and sample selection (ss-HMFSS) framework for AD/MCI diagnosis using both imaging and genetic data. In addition, both labeled and available unlabeled data were utilized to preserve the data manifold in the label space. Experimental results on data from the ADNI cohort showed that the hierarchical scheme was able to gradually refine the feature and sample set in multiple steps. Superior performance in different classification tasks was achieved compared to the other baseline methods. Currently, data from two modalities, MRI and SNP, are used. We would like to extend our method to utilize data from more modalities, such as positron emission tomography (PET) and cerebrospinal fluid (CSF), to further improve the diagnosis performance.

References

1. Chen, G., Ward, B.D., Xie, C., Li, W., Wu, Z., Jones, J.L., Franczak, M., Antuono, P., Li, S.J.: Classification of Alzheimer disease, mild cognitive impairment, and normal cognitive status with large-scale network analysis based on resting-state functional MR imaging. Radiology 259(1), 213–221 (2011)
2. Gaiteri, C., Mostafavi, S., Honey, C.J., De Jager, P.L., Bennett, D.A.: Genetic variants in Alzheimer disease - molecular and brain network approaches. Nat. Rev. Neurol. 12, 1–15 (2016)
3. Shen, L., Kim, S., Risacher, S.L., Nho, K., Swaminathan, S., West, J.D., Foroud, T., Pankratz, N., Moore, J.H., Sloan, C.D., Huentelman, M.J., Craig, D.W., DeChairo, B.M., Potkin, S.G., Jack Jr., C.R., Weiner, M.W., Saykin, A.J.: Whole genome association study of brain-wide imaging phenotypes for identifying quantitative trait loci in MCI and AD: a study of the ADNI cohort. NeuroImage 53(3), 1051–1063 (2010)
4. Hao, X., Yu, J., Zhang, D.: Identifying genetic associations with MRI-derived measures via tree-guided sparse learning. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 757–764. Springer, Heidelberg (2014)
5. Lin, D., Cao, H., Calhoun, V.D., Wang, Y.P.: Sparse models for correlative and integrative analysis of imaging and genetic data. J. Neurosci. Methods 237, 69–78 (2014)
6. Zhang, Z., Huang, H., Shen, D.: Integrative analysis of multi-dimensional imaging genomics data for Alzheimer’s disease prediction. Front. Aging Neurosci. 6, 260 (2014)
7. Zhu, X., Suk, H.I., Lee, S.W., Shen, D.: Canonical feature selection for joint regression and multi-class identification in Alzheimer’s disease diagnosis. Brain Imaging Behav. 10, 818–828 (2015)
8. Ye, J., Farnum, M., Yang, E., Verbeeck, R., Lobanov, V., Raghavan, N., Novak, G., DiBernardo, A., Narayan, V.A.: Sparse learning and stability selection for predicting MCI to AD conversion using baseline ADNI data. BMC Neurol. 12(1), 1–12 (2012)
9. Bertram, L., McQueen, M.B., Mullin, K., Blacker, D., Tanzi, R.E.: Systematic meta-analyses of Alzheimer disease genetic association studies: the AlzGene database. Nat. Genet. 39, 17–23 (2007)
10. Liu, M., Zhang, D., Shen, D.: Relationship induced multi-template learning for diagnosis of Alzheimer’s disease and mild cognitive impairment. IEEE Trans. Med. Imag. 35(6), 1463–1474 (2016)

Stability-Weighted Matrix Completion of Incomplete Multi-modal Data for Disease Diagnosis

Kim-Han Thung, Ehsan Adeli, Pew-Thian Yap, and Dinggang Shen(B)

Department of Radiology and BRIC, University of North Carolina, Chapel Hill, USA
[email protected]

Abstract. Effective utilization of heterogeneous multi-modal data for Alzheimer’s Disease (AD) diagnosis and prognosis has always been hampered by incomplete data. One method to deal with this is low-rank matrix completion (LRMC), which simultaneously imputes missing data features and target values of interest. Although LRMC yields reasonable results, it implicitly weights features from all the modalities equally, ignoring the differences in discriminative power of features from different modalities. In this paper, we propose stability-weighted LRMC (swLRMC), an LRMC improvement that weights features and modalities according to their importance and reliability. We introduce a method, called stability weighting, that utilizes subsampling techniques and outcomes from a range of hyper-parameters of sparse feature learning to obtain a stable set of weights. By incorporating these weights into LRMC, swLRMC can better account for differences in features and modalities to improve diagnosis. Experimental results confirm that the proposed method outperforms the conventional LRMC, feature-selection based LRMC, and other state-of-the-art methods.

1

Introduction

Effective methods to jointly utilize heterogeneous multi-modal and longitudinal data for Alzheimer’s Disease (AD) diagnosis and prognosis often need to overcome the problem of incomplete data. Data are incomplete for various reasons, including cost concerns, poor data quality, and subject dropouts. Most studies deal with this issue by simply discarding incomplete samples, hence significantly reducing the sample size of the study. A more effective approach to deal with missing data is to impute them using k-nearest neighbor, expectation maximization, low-rank matrix completion (LRMC) [2], or other methods [8,13]. However, these methods perform well only if a small portion, but not a whole chunk, of the data is missing. To avoid propagation of the imputation error to the diagnosis stage, Goldberg et al. [3] propose to simultaneously impute the missing data and the diagnostic labels using LRMC. This approach, along with other variants [9], however, inherently assumes that the features are equally important. This might not be the case, especially when the data are multi-modal and heterogeneous, with some features being more discriminative than others [4,6,10]. For example, in our study involving magnetic resonance imaging (MRI) data, positron emission tomography (PET) data, and cognitive assessment data, we found that clinical scores, though fewer in dimension, are more discriminative than PET data, and that within the PET data only a few features are related to the progression of mild cognitive impairment (MCI), a prodromal stage of AD. To address this issue, the method in [9] shrinks the data via selection of the most discriminant features and samples using sparse learning methods and then applies LRMC. Although effective, this approach still neglects the disproportionate discriminative power of different features when employing LRMC.

In this paper, we explicitly consider the differential discriminative power of features and modalities in our formulation of LRMC by weighting them using a procedure called stability weighting. We first explain feature weighting, where each feature is assigned a weight according to its feature-target relationship, i.e., more discriminative features are assigned higher weights, and vice versa. For instance, in sparse feature weighting [14], the feature-target regression coefficients are used as feature weights. Feature weighting like [14] always involves tuning one (or multiple) regularizing hyper-parameter(s), which is (are) normally determined via cross-validation. However, as pointed out in [7], it is difficult to choose a single set of hyper-parameters that is able to retain all the discriminative features while removing the noisy features.

Stability weighting avoids the difficulties of proper regularization [7] in feature weighting by going beyond one set of hyper-parameters. It utilizes multiple sets of hyper-parameters and subsampled data to compute a set of aggregated weights for the features. Using random subsampling and aggregation, stability weighting estimates the weights based on the “stability” of the contribution of a feature. More specifically, we perform a series of logistic regression tasks, involving different hyper-parameters and different data subsets, for each modality. Regression coefficients corresponding to the hyper-parameters that yield higher prediction performance are then aggregated as feature weights. We use the terms “importance” and “reliability” to denote how good a feature and a modality are in the prediction task, respectively. In the context of stability weighting, feature importance is quantified by the aggregated weight values, while modality reliability is quantified by the performance measures. We then incorporate the feature importance and modality reliability into LRMC, giving us stability-weighted LRMC (swLRMC) for greater prediction accuracy.

The contribution of our work is two-fold. (1) We propose a stability weighting procedure to quantify the importance of features and the reliability of modalities. (2) We incorporate this information into the formulation of the proposed swLRMC for more robust and accurate prediction using incomplete heterogeneous multi-modal data.

This work was supported in part by NIH grants (NS093842, EB006733, EB008374, EB009634, AG041721, and MH100217).
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 88–96, 2016. DOI: 10.1007/978-3-319-46723-8_11

2

Materials, Preprocessing and Feature Extraction

In this study, we focus on MCI and use the baseline multi-modal data from the ADNI dataset¹, including MRI, PET, and clinical scores (i.e., Mini-Mental State Exam (MMSE), Clinical Dementia Rating (CDR-global, CDR-SOB), and Alzheimer’s Disease Assessment Scale (ADAS-11, ADAS-13)). Only the MRI data are complete; the other two modalities are incomplete. MCI subjects who progressed to AD within 48 months are retrospectively labeled as pMCI, whereas those who remained stable are labeled as sMCI. MCI subjects who progressed to AD after the 48th month are excluded from this study. Table 1 shows the demographic information of the subjects involved.

Table 1. Demographic information of MCI subjects involved in this study. (Edu.: Education)

       | # Subjects | Gender (M/F) | Age (years) | Edu. (years)
pMCI   | 169        | 103/66       | 74.6 ± 6.7  | 15.8 ± 2.8
sMCI   | 61         | 45/16        | 73.9 ± 7.7  | 14.9 ± 3.4
Total  | 230        | 148/82       | -           | -

We use region-of-interest (ROI)-based features from the MRI and PET images in this study. The processing steps involved are as follows. Each MRI image was AC-PC aligned using MIPAV², corrected for intensity inhomogeneity using the N3 algorithm, skull stripped, tissue segmented, and registered using a template to obtain a subject-labeled image with 93 ROIs [11]. Gray matter (GM) volumes, normalized by the total intracranial volume, were extracted from the 93 ROIs as features [9,10]. We also linearly aligned each PET image to its corresponding MRI image, and used the mean intensity value of each ROI as PET features.

3

Method

Figure 1 gives an overview of the proposed swLRMC framework. The main difference between swLRMC and LRMC is the introduction of a stability weight matrix W, which is computed via stability weighting. W is then used in swLRMC to simultaneously impute the missing feature values and the unknown target values (i.e., diagnostic labels and conversion times). We provide the details of each step in the following.

¹ http://adni.loni.ucla.edu.
² http://mipav.cit.nih.gov.


Fig. 1. Stability-weighted low-rank matrix completion (swLRMC).

3.1

Notation

Let $\mathbf{X} = [\mathbf{X}^{(1)} \cdots \mathbf{X}^{(m)}] \in \mathbb{R}^{N \times d}$ denote the feature matrix of $N$ samples. The features from $m$ modalities (i.e., MRI, PET, and clinical scores (Cli)) are concatenated to give $d$ features per sample. Since not all the modalities are available for every sample, $\mathbf{X}$ is incomplete, with some missing values. We use $\mathbf{Y} = [\mathbf{y}_1 \cdots \mathbf{y}_t] \in \mathbb{R}^{N \times t}$ to denote the corresponding target matrix with two targets ($t = 2$), i.e., the diagnostic label (1 for pMCI and −1 for sMCI) and the conversion time (i.e., the number of months prior to AD conversion). The conversion time of an sMCI subject should ideally be set to infinity. For feasibility, however, we set it to a large value, computed as 12 months plus the maximum conversion time over all pMCI samples. Throughout the paper, we use bold upper-case letters to denote matrices and bold lower-case letters to denote column vectors.

3.2

Low-Rank Matrix Completion (LRMC)

Prediction using LRMC is based on several assumptions. First, it assumes a linear relationship between $\mathbf{X}$ and $\mathbf{Y}$, i.e., $\mathbf{Y} = [\mathbf{X}\ \mathbf{1}]\,\boldsymbol{\beta}$, where $\mathbf{1}$ is a column vector of all ones and $\boldsymbol{\beta}$ is the coefficient matrix. Second, it assumes that $\mathbf{X}$ is low-rank, i.e., rows (columns) of $\mathbf{X}$ can be represented by other rows (columns). It can then be inferred that the concatenated matrix $\mathbf{M} = [\mathbf{X}\ \mathbf{1}\ \mathbf{Y}]$ is also low-rank [3]. Hence, LRMC can be applied to $\mathbf{M}$ to impute the missing feature values and the unknown output targets simultaneously, without knowing $\boldsymbol{\beta}$. This is achieved by solving $\min_{\mathbf{Z}} \{\|\mathbf{Z}\|_* \mid P_{\Omega}(\mathbf{M}) = P_{\Omega}(\mathbf{Z})\}$ [2], where $\mathbf{Z}$ is the completed version of $\mathbf{M}$, $\Omega$ is the set of indices of known values in $\mathbf{M}$, $P$ is the projection operator, and $\|\cdot\|_*$ is the nuclear norm (i.e., the sum of singular values), which is used as a convex surrogate for matrix rank. In the presence of noise, and using different loss functions for $\mathbf{X}$ and $\mathbf{Y}$, this problem is reformulated as [3]:

\[
\min_{\mathbf{Z}}\ \mu\|\mathbf{Z}\|_* + \frac{1}{|\Omega_X|}\,\|P_{\Omega_X}(\mathbf{Z} - \mathbf{M})\|_F^2 + \sum_{i=1}^{t} \frac{\lambda_i}{|\Omega_{y_i}|}\, L_i\big(P_{\Omega_{y_i}}(\mathbf{Z}),\, P_{\Omega_{y_i}}(\mathbf{M})\big),
\tag{1}
\]

where $L_i(\cdot,\cdot)$ is the loss function for the $i$-th column of $\mathbf{Y}$. Since the first target is the diagnostic label (binary) and the second target is the conversion time (continuous), we use the logistic loss $L_1(\mathbf{u},\mathbf{v}) = \sum_j \log(1 + \exp(-u_j v_j))$ and the mean square loss $L_2(\mathbf{u},\mathbf{v}) = \sum_j \tfrac{1}{2}(u_j - v_j)^2$ for the first and second targets, respectively. $\Omega_X$ and $\Omega_{y_i}$ are the index sets of the known feature values and target outputs in $\mathbf{M}$, respectively, $|\cdot|$ denotes the cardinality of a set, and $\|\cdot\|_F$ is the Frobenius norm. The parameters $\mu$ and $\lambda_i$ are tuning hyper-parameters that control the effect of each term. The feature fitting term (second term) in (1) shows that the conventional LRMC treats all the features equally, without considering the importance of each feature in relation to the target(s). In the following, we propose to modulate this fitting term according to the feature-target relationship.

3.3

Stability-Weighted LRMC (swLRMC)

Due to missing feature values for some modalities, conventional feature selection methods cannot be applied to the whole data. Thus, we compute the weights separately for each modality. Denoting the importance of the features in the $j$-th modality as the vector $\mathbf{w}^{(j)}$ and the reliability of the $j$-th modality as $s^{(j)}$, we reformulate the second term of (1) as follows:

\[
\frac{1}{|\Omega_X|} \sum_{j=1}^{m} s^{(j)}\, \big\|P_{\Omega_{X^{(j)}}}\big(\operatorname{diag}(\mathbf{w}^{(j)})(\mathbf{Z}_{X^{(j)}} - \mathbf{X}^{(j)})\big)\big\|_F^2,
\tag{2}
\]

where $\mathbf{Z}_{X^{(j)}}$ is the part of $\mathbf{Z}$ that corresponds to the features of the $j$-th modality, $\Omega_{X^{(j)}}$ is the set of indices of known values in $\mathbf{X}^{(j)}$, and $\operatorname{diag}(\cdot)$ is the diagonal operator. Each element in $\mathbf{w}^{(j)}$ quantifies the importance of the corresponding feature in $\mathbf{X}^{(j)}$ in terms of discriminative power. More important features are given higher values, so that they are less affected by the smoothing effect of the low-rank constraint (first term of (1)) and play more dominant roles in the optimization process. In the following, we explain how $\mathbf{w}^{(j)}$ and $s^{(j)}$ are obtained via stability weighting.

Stability Weighting: Stability weighting uses data subsampling and sparse feature weighting with multiple hyper-parameters (similar to stability selection [7]) to improve robustness in feature weighting. Any feature weighting method can be used for stability weighting. In this paper, we choose logistic elastic net [14]. First, we use elastic net to compute a weight vector for each modality:

\[
\min_{\boldsymbol{\beta}^{(i)}}\ \big\|\log\big(1 + \exp(-\mathbf{y}_1 \odot (\mathbf{X}^{(i)}\boldsymbol{\beta}^{(i)}))\big)\big\|_1 + \alpha_1 \|\boldsymbol{\beta}^{(i)}\|_1 + \alpha_2 \|\boldsymbol{\beta}^{(i)}\|_2^2,
\tag{3}
\]

where $\mathbf{y}_1$ is a column vector of diagnostic labels, $\odot$ is element-wise multiplication, $\alpha_1$ and $\alpha_2$ are the tuning hyper-parameters, and $\boldsymbol{\beta}^{(i)}$ is a sparse coefficient vector. The magnitude of each element in $\boldsymbol{\beta}^{(i)}$ can be seen as an indicator of the importance of the corresponding feature in $\mathbf{X}^{(i)}$. Note that in this process one needs to determine the hyper-parameter $\boldsymbol{\alpha} = [\alpha_1\ \alpha_2]$, which is normally done through cross-validation. However, instead of limiting ourselves to just one hyper-parameter and one set of data, we use a range of hyper-parameters and subsamples of the training data to determine the feature weights. More specifically, we solve (3) for a range of $\boldsymbol{\alpha}$ values using 5-fold cross-validation on the training data with 10 repetitions. For each $\boldsymbol{\alpha}$, we therefore have 50 versions of $\boldsymbol{\beta}^{(i)}$ and one average F-score³. We choose the three $\boldsymbol{\alpha}$ values that give the highest F-scores, and compute the weight vector for the $i$-th modality as $\mathbf{w}^{(i)} = \bar{\boldsymbol{\beta}}^{(i)} / \max(\bar{\boldsymbol{\beta}}^{(i)}) + \epsilon$, where $\epsilon$ is a small constant and $\bar{\boldsymbol{\beta}}^{(i)}$ is the mean absolute vector of all (50 × 3 = 150) $\boldsymbol{\beta}^{(i)}$’s that correspond to the $\boldsymbol{\alpha}$’s with the highest average F-scores. We then use the best average F-score to quantify the reliability of using $\mathbf{X}^{(i)}$ in predicting target $\mathbf{y}_1$, which is denoted as $s^{(i)}$. Note that $s^{(i)}$ and $\mathbf{w}^{(i)}$ in (2) can be combined into a single weight matrix as $\mathbf{W} = \operatorname{diag}([s^{(1)}\mathbf{w}^{(1)}; \cdots; s^{(m)}\mathbf{w}^{(m)}])$. Finally, the compact equivalent form of swLRMC is given as

\[
\min_{\mathbf{Z}}\ \mu\|\mathbf{Z}\|_* + \frac{1}{|\Omega_X|}\,\|P_{\Omega_X}(\mathbf{W}(\mathbf{Z} - \mathbf{M}))\|_F^2 + \sum_{i=1}^{t} \frac{\lambda_i}{|\Omega_{y_i}|}\, L_i\big(P_{\Omega_{y_i}}(\mathbf{Z}),\, P_{\Omega_{y_i}}(\mathbf{M})\big).
\tag{4}
\]
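The aggregation step of stability weighting can be sketched as follows, assuming that the elastic-net coefficients and F-scores for every hyper-parameter setting and cross-validation run have already been computed. The array layout, the value of the small constant ε, the guard for an all-zero coefficient vector, and the function names `stability_weights` and `combine` are assumptions made for illustration only.

```python
import numpy as np


def stability_weights(betas, fscores, n_top=3, eps=1e-3):
    """Aggregate coefficients into feature weights w and a modality reliability s.

    betas:   array (n_alpha, n_runs, d), one coefficient vector per alpha and run
    fscores: array (n_alpha, n_runs), the F-score of each run
    """
    mean_f = fscores.mean(axis=1)                     # average F-score per alpha
    top = np.argsort(mean_f)[::-1][:n_top]            # alphas with the highest averages
    beta_bar = np.abs(betas[top]).mean(axis=(0, 1))   # mean absolute coefficient vector
    peak = beta_bar.max()
    # Normalize by the largest entry; fall back to eps if every coefficient is zero.
    w = beta_bar / peak + eps if peak > 0 else np.full_like(beta_bar, eps)
    s = float(mean_f.max())                           # best average F-score
    return w, s


def combine(weights, reliabilities):
    """Diagonal of W = diag([s1 * w1; ...; sm * wm]) over all modalities."""
    return np.concatenate([s * w for w, s in zip(weights, reliabilities)])


# Toy example: 4 hyper-parameter settings, 2 runs each, 3 features.
betas = np.array([
    [[1.0, 0.0, -0.5], [1.0, 0.0, -0.5]],
    [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[3.0, 3.0, 3.0], [3.0, 3.0, 3.0]],   # lowest F-score, excluded from the top three
])
fscores = np.array([[0.8, 0.8], [0.9, 0.7], [0.7, 0.7], [0.5, 0.5]])
w, s = stability_weights(betas, fscores)
W_diag = combine([w], [s])
print(w, s)
```

In the setting described above, `betas` would have 50 runs per hyper-parameter value, so the three selected values contribute 150 coefficient vectors to the mean.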

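Optimization: Equation (4) can be solved for the matrix $\mathbf{Z}$ by iterating over $l$ through the two steps below until convergence [3]:

1. Gradient Step: $\mathbf{G}^{l} = \mathbf{Z}^{l} - \tau\, g(\mathbf{Z}^{l})$, where $\mathbf{G}$ is an intermediate matrix, $\tau$ is the step size, and $g(\mathbf{Z}^{l})$ is the matrix gradient defined as

\[
g(Z_{ij}) =
\begin{cases}
\dfrac{\lambda_1}{|\Omega_{y_1}|}\, \dfrac{-M_{ij}}{1 + \exp(M_{ij} Z_{ij})}, & (i,j) \in \Omega_{y_1}, \\[2ex]
\dfrac{W_{jj}}{|\Omega_X|}\, (Z_{ij} - M_{ij}), & (i,j) \in \Omega_X, \\[2ex]
\dfrac{\lambda_2}{|\Omega_{y_2}|}\, (Z_{ij} - M_{ij}), & (i,j) \in \Omega_{y_2}, \\[2ex]
0, & \text{otherwise.}
\end{cases}
\tag{5}
\]

2. Shrinkage Step [2]: $\mathbf{Z}^{l+1} = S_{\tau\mu}(\mathbf{G}^{l}) = \mathbf{P}\,\max(\boldsymbol{\Lambda} - \tau\mu, 0)\,\mathbf{Q}^\top$, where $S(\cdot)$ is the matrix shrinkage operator, $\mathbf{P}\boldsymbol{\Lambda}\mathbf{Q}^\top$ is the SVD of $\mathbf{G}^{l}$, and $\max(\cdot)$ is the element-wise maximum operator.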
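The two-step iteration can be illustrated with the following simplified sketch. To stay short, it applies a weighted squared loss with a factor of 1/2 to all known entries, i.e., it omits the logistic loss used for the label column, so the column weights enter the gradient squared. It also fixes the step size to the reciprocal of the Lipschitz constant of this simplified fitting term, which is a choice made here and not taken from the text. The function names and the toy rank-one matrix are hypothetical.

```python
import numpy as np


def svt(G, thresh):
    """Matrix shrinkage: soft-threshold the singular values of G by thresh."""
    P, sing, Qt = np.linalg.svd(G, full_matrices=False)
    return (P * np.maximum(sing - thresh, 0.0)) @ Qt


def objective(Z, M, known, col_w, mu):
    """mu * ||Z||_* + (1 / (2 |Omega|)) * ||P_Omega((Z - M) diag(col_w))||_F^2."""
    resid = np.where(known, Z - np.where(known, M, 0.0), 0.0) * col_w
    return mu * np.linalg.norm(Z, "nuc") + 0.5 * np.sum(resid ** 2) / known.sum()


def complete(M, known, col_w, mu, n_iter=200):
    """Alternate a gradient step on the fitting term with singular value shrinkage."""
    M0 = np.where(known, M, 0.0)          # missing entries (NaN) never enter the gradient
    n_known = known.sum()
    tau = n_known / np.max(col_w ** 2)    # step size = 1 / Lipschitz constant
    Z = np.zeros_like(M0)
    for _ in range(n_iter):
        grad = known * (Z - M0) * col_w ** 2 / n_known   # gradient step
        Z = svt(Z - tau * grad, tau * mu)                # shrinkage step
    return Z


# Toy example: a rank-one matrix with two hidden entries and uniform column weights.
rng = np.random.default_rng(0)
M_full = rng.normal(size=(8, 1)) @ rng.normal(size=(1, 6))
known = np.ones(M_full.shape, dtype=bool)
known[0, 0] = known[3, 4] = False
M = np.where(known, M_full, np.nan)
col_w = np.ones(6)
Z = complete(M, known, col_w, mu=1e-3)
print(objective(np.zeros_like(M_full), M, known, col_w, 1e-3), objective(Z, M, known, col_w, 1e-3))
```

Passing non-uniform `col_w` makes the highly weighted columns harder for the shrinkage step to smooth, which is the effect the stability weights are meant to have.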

4

Results and Discussions

We evaluated the proposed method, swLRMC, using multi-modal data from the ADNI database. We evaluated two versions of swLRMC: (1) swLRMC on the original feature matrix without removing any features, and (2) swLRMC on the feature-selected matrix (fs-swLRMC), obtained by discarding the features that were selected less than 50 % of the time in stability selection. We compared our methods with two baseline LRMC methods: (1) LRMC without feature selection, and (2) LRMC with sparse feature selection (fs-LRMC). The hyper-parameters $\mu$, $\lambda_1$, $\lambda_2$ for all methods were selected automatically using Bayesian hyper-parameter optimization [1] in the ranges of $\{10^{-6}, \cdots, 10^{-2}\}$, $\{10^{-4}, \cdots, 10^{-1}\}$, and $\{10^{-4}, \cdots, 10^{-1}\}$, respectively. For sparse feature selection, we used the SLEP package⁴ and performed 5-fold cross validation on the training data to select the best hyper-parameter. Since the dataset we used was unbalanced, we used the F-score and the area under the ROC curve (AUC) to measure the classification performance, and the correlation coefficient (CC) to measure the accuracy of conversion time prediction. All the results reported are the averages of 10 repetitions of 10-fold cross validation.

³ We use F-score as performance measure as our dataset is unbalanced.

The results shown in Fig. 2 indicate that swLRMC (blue bars) performs consistently better than baseline LRMC (orange bars) for all the performance metrics and modality combinations. It is worth noting that swLRMC and fs-swLRMC perform almost equally well, but fs-swLRMC is faster to compute, due to its smaller matrix size during imputation. It is also interesting to see that swLRMC performs better than fs-LRMC in terms of F-score and CC values, indicating that penalizing less discriminative features is better than removing them. Another encouraging observation is that swLRMC is less sensitive to “noisy” features in the multi-modal data. This can be seen in the MRI+PET combination, where the performance of LRMC drops compared to the case where only MRI is used, whereas the performance of swLRMC improves. A similar pattern can be observed for MRI+PET+Cli, where LRMC performs worse than in the MRI+Cli case, whereas swLRMC maintains its performance.
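For reference, the F-score and the correlation coefficient used as performance measures can be computed as in the sketch below. It assumes that the F-score is the balanced F1 score with pMCI (label 1) as the positive class and that CC is the Pearson correlation; the text does not state either choice explicitly, and the function names are hypothetical.

```python
import numpy as np


def f_score(y_true, y_pred, positive=1):
    """F1 score: 2*TP / (2*TP + FP + FN), with `positive` as the positive class."""
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def pearson_cc(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Toy labels: 1 = pMCI, -1 = sMCI.
y_true = np.array([1, 1, 1, -1])
y_pred = np.array([1, 1, -1, 1])
f1 = f_score(y_true, y_pred)

# Toy conversion times (months): predictions offset by a constant are perfectly correlated.
cc = pearson_cc(np.array([10.0, 20.0, 30.0]), np.array([12.0, 22.0, 32.0]))
print(f1, cc)
```

Unlike accuracy, the F-score ignores true negatives, which is why it is less flattering to a classifier that simply predicts the majority class on unbalanced data.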

[Figure 2 panels: F-score, AUC, and CC for LRMC, fs-LRMC, swLRMC, and fs-swLRMC, each across the modality combinations MRI, MRI+PET, MRI+Cli, and MRI+PET+Cli.]

Fig. 2. Comparisons between the baseline LRMC and the proposed swLRMC methods using multi-modal data. The first two plots: pMCI/sMCI classification results (first target), the last plot: conversion time prediction results (second target). Error bars: standard deviations, *: statistically significant.

⁴ http://www.yelab.net/software/SLEP/.


Table 2. Comparison with [12] and [5]. Bold: Best results, *: Statistically significant.

Data         | F-score                 | AUC                     | CC
             | swLRMC  [12]    [5]     | swLRMC  [12]    [5]     | swLRMC  [12]    [5]
MRI          | 0.862   0.853*  0.777*  | 0.776   0.758*  0.755*  | 0.506   0.427*  0.464*
MRI+PET      | 0.870   0.853*  0.798*  | 0.802   0.786*  0.806   | 0.523   0.469*  0.492*
MRI+Cli      | 0.892   0.853*  0.809*  | 0.838   0.829*  0.841   | 0.594   0.567*  0.560*
MRI+PET+Cli  | 0.883   0.859*  0.805*  | 0.851   0.827*  0.842   | 0.599   0.568*  0.553*

We also show in Table 2 a comparison of swLRMC with two methods that work with incomplete datasets: (1) incomplete data multi-task learning [12], and (2) Ingalhalikar’s ensemble method [5]. We selected the best hyper-parameters for these methods using 5-fold cross validation. For [12], we used the logistic loss and the mean-square loss for classification and regression, respectively. The highest score in each category is highlighted in bold. The results show that swLRMC outperforms both methods in F-score and CC for all the combinations of modalities. In terms of AUC, swLRMC gives comparable performance. To test the significance of the results, we perform a paired t-test between the best result and the other results in each category. The outcomes of the paired t-test are included in Fig. 2 and Table 2, where results that differ significantly from the best method, at the 95 % confidence level, are marked with asterisks. The results show that the improvement of the proposed method is statistically significant in terms of F-score and CC values for all the combinations of multi-modal data.

5

Conclusion

We have demonstrated that the proposed method, swLRMC, which explicitly considers feature importance and modality reliability using a stability weighting procedure, outperforms conventional LRMC, fs-LRMC, and two state-of-the-art methods that were designed for incomplete multi-modal data. Experimental results show that our proposed method is effective when dealing with incomplete multi-modal data, where not all the feature values are equally important.

References

1. Bergstra, J.S., et al.: Algorithms for hyper-parameter optimization. In: Proceedings of Advances in Neural Information Processing Systems, pp. 2546–2554 (2011)
2. Candès, E.J., et al.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)
3. Goldberg, A., et al.: Transduction with matrix completion: three birds with one stone. In: Proceedings of Advances in Neural Information Processing Systems, vol. 23, pp. 757–765 (2010)
4. Huang, L., Gao, Y., Jin, Y., Thung, K.-H., Shen, D.: Soft-split sparse regression based random forest for predicting future clinical scores of Alzheimer’s disease. In: Zhou, L., Wang, L., Wang, Q., Shi, Y. (eds.) MLMI 2015. LNCS, vol. 9352, pp. 246–254. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24888-2_30
5. Ingalhalikar, M., Parker, W.A., Bloy, L., Roberts, T.P.L., Verma, R.: Using multiparametric data with missing features for learning patterns of pathology. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7512, pp. 468–475. Springer, Heidelberg (2012)
6. Jin, Y., et al.: Identification of infants at high-risk for autism spectrum disorder using multiparameter multiscale white matter connectivity networks. Hum. Brain Mapp. 36(12), 4880–4896 (2015)
7. Meinshausen, N., et al.: Stability selection. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 72(4), 417–473 (2010)
8. Qin, Y., et al.: Semi-parametric optimization for missing data imputation. Appl. Intell. 27(1), 79–88 (2007)
9. Thung, K.H., et al.: Neurodegenerative disease diagnosis using incomplete multimodality data via matrix shrinkage and completion. Neuroimage 91, 386–400 (2014)
10. Thung, K.H., et al.: Identification of progressive mild cognitive impairment patients using incomplete longitudinal MRI scans. Brain Struct. Funct. pp. 1–17 (2015)
11. Wang, Y., Nie, J., Yap, P.-T., Shi, F., Guo, L., Shen, D.: Robust deformable-surface-based skull-stripping for large-scale studies. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011. LNCS, vol. 6893, pp. 635–642. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23626-6_78
12. Yuan, L., et al.: Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data. NeuroImage 61(3), 622–632 (2012)
13. Zhu, X., et al.: Missing value estimation for mixed-attribute data sets. IEEE Trans. Knowl. Data Eng. 23(1), 110–121 (2011)
14. Zou, H., et al.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. Ser. B 67(2), 301–320 (2005)

Employing Visual Analytics to Aid the Design of White Matter Hyperintensity Classifiers

Renata Georgia Raidou1,2(B), Hugo J. Kuijf3, Neda Sepasian1, Nicola Pezzotti2, Willem H. Bouvy4, Marcel Breeuwer1,5, and Anna Vilanova1,2

1 Eindhoven University of Technology, Eindhoven, The Netherlands
[email protected]
2 Delft University of Technology, Delft, The Netherlands
3 Image Sciences Institute, University Medical Center Utrecht, Utrecht, The Netherlands
4 Department of Neurology, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, The Netherlands
5 Philips Healthcare, Best, The Netherlands

Abstract. Accurate segmentation of brain white matter hyperintensities (WMHs) is important for prognosis and disease monitoring. To this end, classifiers are often trained, usually using T1 and FLAIR weighted MR images. Incorporating additional features, derived from diffusion weighted MRI, could improve classification. However, the multitude of diffusion-derived features requires selecting the most adequate ones. For this, automated feature selection is commonly employed, which can often be sub-optimal. In this work, we propose a different approach, introducing a semi-automated pipeline to interactively select features for WMH classification. The advantage of this solution is the integration of the knowledge and skills of experts in the process. In our pipeline, a Visual Analytics (VA) system is employed to enable user-driven feature selection. The resulting features are T1, FLAIR, Mean Diffusivity (MD), and Radial Diffusivity (RD) and, secondarily, CS and Fractional Anisotropy (FA). The next step in the pipeline is to train a classifier with these features and compare its results to those of a similar classifier, used in previous work with automated feature selection. Finally, VA is employed again to analyze and understand the classifier performance and results.

Keywords: White matter hyperintensities (WMHs) · Visual Analytics (VA) · Classification · Interactive feature selection

1

Introduction

White matter hyperintensities of presumed vascular origin (WMHs) are a common finding in MR images of elderly subjects. They are a manifestation of cerebral small vessel disease (SVD) and are associated with cognitive decline and dementia [1]. Accurate segmentation of WMHs is important for prognosis and disease monitoring. To this end, automated WMH classification techniques have been developed [2]. Conventional approaches include raw image intensities from T1 and FLAIR weighted MR images, but recently, it has been suggested that diffusion MRI can improve the segmentation [3,4]. Multiple features can be derived from this imaging modality; thus, careful feature selection is required.

In this work, we propose a semi-automated approach to aid the design of WMH classifiers. Our novelty is the introduction of a user-driven, interactive pipeline that provides new insight into the entire classification procedure, especially in the identification of an adequate feature list and the analysis of the outcome. Up to now, the knowledge and cognitive skills of experts have not been intensively involved in the process. In the first step of our pipeline, we employ a Visual Analytics (VA) system [5], where expert users interactively select the most important features. In the second step, the resulting feature list is used to train a classifier for WMH segmentation. The performance and results of this classifier can be analyzed and interpreted in the final step of the pipeline, again using VA.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 97–105, 2016. DOI: 10.1007/978-3-319-46723-8_12

2 Related Work

Visual Analytics (VA) refers to the field that combines, through interaction, visualizations with pattern recognition, data mining and statistics, and focuses on aiding exploration and analytical reasoning [6]. Recently, Raidou et al. proposed a highly interactive VA system for the exploration of intra-tumor tissue characteristics [5]. The system employs a t-Distributed Stochastic Neighbor Embedding [7] of several imaging-derived features, used in tumor diagnosis. It also consists of multiple interactive views, for the exploration and analysis of the underlying structure of the feature space, providing linking to anatomy and ground truth data. Yet, to the best of our knowledge, involving users through VA and interaction in an entire pipeline for feature selection, classification and outcome evaluation for WMH structures has not been addressed before.

3 Materials and Method

3.1 Subjects and MRI Data

We used the subjects of the MRBrainS13 challenge [8], with additional manual WMH delineations. Subjects included patients with diabetes and matched controls (men: 10, age: 71 ± 4 years). All subjects underwent a standardized 3 T MR exam, including a 3D T1-weighted, a multi-slice FLAIR, a multi-slice IR, and a single-shot EPI DTI sequence with 45 directions. All sequences were aligned with the FLAIR sequence [9]. The diffusion images were corrected for subject motion, eddy current induced geometric distortions, and EPI distortions, including the required B-matrix adjustments, using ExploreDTI [10]. The dataset includes T1, FLAIR and IR weighted images, as well as the following diffusion features: Fractional Anisotropy (FA), Mean Diffusivity (MD), Axial Diffusivity (AD), Radial Diffusivity (RD), the Westin measures CL, CP, CS [11], and MNI152-normalized spatial coordinates [9,12]. This exact dataset has been previously reported in a study by Kuijf et al. [3], for the investigation of the added value of diffusion features in a WMH classifier. Since we have access to the exact same data and share the same goal, we use the previous work of Kuijf et al. as a baseline for the evaluation of our results.

3.2 Method

In this section, we describe our new pipeline¹ for the user-driven, interactive selection of features that can differentiate WMHs from healthy brain tissue. Our pipeline consists of three steps, depicted in Fig. 1. First, the data are interactively explored and analyzed by expert users, in the VA system proposed by Raidou et al. [5]. From this step, we obtain, through interaction and visual analysis, a list of features adequate for WMH detection. These features are subsequently used to train a classifier. After classification, the VA system is used again to evaluate and better understand the classification process and outcome.

Fig. 1. The pipeline proposed for the user-driven feature selection, classification and outcome evaluation for the segmentation of White matter hyperintensities (WMHs).

¹ An interactive demo can be found here: https://vimeo.com/170609498

Feature Selection Using VA. The VA system of Raidou et al. [5] is employed to interactively explore the data of each of the available subjects (Fig. 2). Initially, t-Distributed Stochastic Neighbor Embedding (t-SNE) [7] is used to map the high-dimensional feature space of each subject (described in Sect. 3.1) into a reduced 2D abstract embedding view, preserving the local structure of the feature space. Spatial coordinates are excluded, as we are interested in preserving similarities in the feature space, and the voxel positions could introduce bias. In the resulting embedding view (Fig. 2-ii), close-by 2D data points reflect voxels with similar behavior in the high-dimensional feature space. Therefore, voxels from structures with similar imaging characteristics are expected to be grouped together in the embedding, in so-called visual clusters. Having ground truth data available, i.e., manual delineations of the WMHs, makes it possible to associate visual clusters from the feature space with anatomy, and vice versa (Fig. 2-i). When a WMH-containing visual cluster is interactively selected, its intrinsic feature characteristics are explored; for example, against other structures of the brain, or against WMH voxels that are not within the selected visual cluster. Then, several linked views (Fig. 2-iii) are interactively updated with complementary data information. This includes feature distributions and correlations, multidimensional data patterns, cluster validity analysis and information on features that help separate visual clusters from each other, as given by the weights of the separation vector of Linear Discriminant Analysis (LDA). In this way, features suitable for the detection of WMHs are interactively identified.

Fig. 2. The adopted VA system [5] during the exploration of the data of a subject from the MRBrainS13 challenge [8]. The three components of the system are denoted.

For example, for the subject of Fig. 2, two visual clusters have been selected in the t-SNE of the middle view. As depicted in the anatomical views, one corresponds to the WMH core (green) and the other to the periphery (purple). Together, they represent the biggest part of the structure. Still, several small parts are missed. The separation vector, resulting from LDA between the two visual clusters containing the WMHs against the rest of the brain, is extracted. From the weights of this vector, features adequate for differentiating the detected WMHs from the rest of the brain are identified. This analysis is subject-specific and has to be performed on a single-subject basis. When all subjects have been explored, the user decides on the most suitable overall feature list.

Classification. In this step, many different classification approaches could be followed, but comparing all of them would be out of scope for this work. Recently, Kuijf et al. [3] presented an approach for WMH classification, using the same set of diffusion features. To evaluate whether our user-driven feature selection outperforms automated feature selection, we adopt a classification approach similar to that of Kuijf et al. The list of features resulting from the VA system is used to train a k-nearest-neighbor classifier for WMH segmentation.
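The k-nearest-neighbor step can be illustrated with a minimal sketch. It is not the implementation used in this work or in [3]: the function name `knn_predict`, the toy feature vectors and the inverse-distance form of the weighted vote are assumptions made only for illustration, and a small k is used so that the example stays readable.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k=3, weights="uniform"):
    """Classify one voxel feature vector by a vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training sample.
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for distance, label in neighbors[:k]:
        if weights == "uniform":
            votes[label] += 1.0
        else:
            # Assumed inverse-distance weighting; the floor avoids division by zero.
            votes[label] += 1.0 / max(distance, 1e-12)
    return max(votes, key=votes.get)


# Hypothetical toy vectors (T1, FLAIR, MD, RD); label 1 = WMH, 0 = other tissue.
train_X = [
    (0.2, 0.9, 0.8, 0.7),
    (0.3, 0.8, 0.9, 0.8),
    (0.7, 0.3, 0.2, 0.2),
    (0.8, 0.2, 0.3, 0.1),
    (0.6, 0.4, 0.3, 0.3),
]
train_y = [1, 1, 0, 0, 0]
query = (0.25, 0.85, 0.85, 0.75)

label_k3 = knn_predict(train_X, train_y, query, k=3)
label_k5_uniform = knn_predict(train_X, train_y, query, k=5)
label_k5_distance = knn_predict(train_X, train_y, query, k=5, weights="distance")
```

With k = 5 the uniform vote is dominated by the three distant non-WMH samples, whereas the distance-weighted vote still follows the two close WMH samples, which is the practical difference between the two weighting schemes.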


Table 1. The most important features for each subject, as resulting from the weights of the LDA separation vector, performed for the detected visual clusters of WMH voxels against the rest of the brain. The second column denotes the size of WMHs in voxels. The third column shows the percentage of WMHs detected by visual clusters in the VA tool. The other columns represent features, and their weights are color encoded per row. The resulting feature list is the set MD, RD, T1 and FLAIR (then, CS and FA). Color legend: negative high weight, low weight, positive high weight.
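The LDA separation-vector weights summarized in Table 1 can be illustrated with a small two-class Fisher discriminant. This is a sketch under stated assumptions, not the VA system's code: the helper names, the ridge term and the toy voxel values (columns labeled FLAIR, MD, IR) are hypothetical, and the weights are taken as the normalized solution of S_w w = mu_cluster - mu_rest.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, mu):
    """Within-class scatter: sum of outer products of deviations from the mean."""
    d = len(mu)
    S = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [row[a] - mu[a] for a in range(d)]
        for a in range(d):
            for b in range(d):
                S[a][b] += diff[a] * diff[b]
    return S


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lda_separation_vector(cluster, rest, ridge=1e-6):
    """Unit-length Fisher direction separating `cluster` from `rest`."""
    mu1, mu0 = mean_vector(cluster), mean_vector(rest)
    S1, S0 = scatter_matrix(cluster, mu1), scatter_matrix(rest, mu0)
    d = len(mu1)
    # Assumed small ridge on the diagonal keeps the system well conditioned.
    Sw = [[S1[a][b] + S0[a][b] + (ridge if a == b else 0.0) for b in range(d)]
          for a in range(d)]
    w = solve(Sw, [mu1[a] - mu0[a] for a in range(d)])
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w]


# Hypothetical voxels with columns (FLAIR, MD, IR); IR has the same mean in both groups.
wmh_cluster = [(0.8, 0.7, 0.6), (1.0, 0.7, 0.4), (0.8, 0.9, 0.4), (1.0, 0.9, 0.6)]
rest_of_brain = [(0.1, 0.2, 0.6), (0.3, 0.2, 0.4), (0.1, 0.4, 0.4), (0.3, 0.4, 0.6)]
weights = lda_separation_vector(wmh_cluster, rest_of_brain)
```

In this toy case the FLAIR and MD weights are clearly non-zero while the IR weight is close to zero, which mirrors how a feature with negligible weight would be read as uninformative for the selected clusters.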

For different feature combinations, several classifiers are trained with k = 50, 75, or 100, and the neighbor weighting is either uniform or distance-based [3].

Evaluation of Classification. In many cases, classifiers are treated as black boxes, and users do not have actual insight into the achieved result. With this step, we want to provide a way of evaluating and understanding both the results of the classifier and the classifier itself. To this end, we import the binary masks resulting from the classification (detected vs. missed WMHs) into the VA system [5]. The user can interactively explore the high-dimensional feature space of the two regions of the mask, and generate hypotheses about why the classifier failed to detect parts of the WMHs, with respect to the imaging features.
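The two mask regions used in the evaluation step (detected vs. missed WMHs) can be derived from the manual delineation and the classifier output as sketched below. The function name and the tiny 2D example are assumptions for illustration; the actual data are 3D volumes.

```python
def split_detected_missed(ground_truth, prediction):
    """Return two binary masks: WMH voxels the classifier found, and those it missed."""
    detected = [[int(g and p) for g, p in zip(g_row, p_row)]
                for g_row, p_row in zip(ground_truth, prediction)]
    missed = [[int(g and not p) for g, p in zip(g_row, p_row)]
              for g_row, p_row in zip(ground_truth, prediction)]
    return detected, missed


# Hypothetical 2D slice: 1 marks a WMH voxel.
ground_truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
prediction = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],  # the leading 1 is a false positive and belongs to neither mask
]
detected, missed = split_detected_missed(ground_truth, prediction)
```

Both masks only cover voxels inside the manual delineation, so false positives are deliberately left out of this split.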

4 Results

Feature Selection Using VA. In most t-SNE embeddings of the subjects, the majority of voxels of the WMHs are grouped together, in one or two visual clusters, similar to the case depicted in Fig. 2. From selecting these visual clusters, we could identify that, for subjects with two visual clusters, these either correspond to the core and the periphery, or to anterior and posterior WMHs. For large WMHs (top 50 %), the visual clusters of the embedding identify 84–98 % of the structures. For the rest, the visual clusters can at least detect the core, with a minimum detection percentage of 54 %. The multiple interactive linked views of the VA system show that there are comparable behaviors within all cases of visual clusters, especially for larger WMH structures. As mentioned before, the cluster analysis view of the VA system provides the separation vector, resulting from LDA between the visual cluster containing most of the WMHs and the visual cluster of the rest of the brain. Table 1 depicts, for all investigated subjects, the weights of separation for these two visual clusters. In all but three cases, T1, FLAIR, RD and MD are more important, as they have a considerable weight. For bigger WMHs, CS and FA also become important. The contribution of other features such as AD, CL, CP and IR does not seem significant. Considering also the (cor-)relations between diffusion features, we decide on the overall set of features for the classifier: MD, RD, T1 and FLAIR (secondarily, CS and FA). Here, we add the MNI152-normalized spatial coordinates (x, y, z) to better represent the brain volume and to suppress non-WMH structures.

Classification. Based on the results of the VA system, the following four combinations of feature sets si ∈ S are chosen for our k-NN classifier: s1: MD, RD, T1, FLAIR; s2: s1 + (x, y, z); s3: s1 + CS, FA; s4: s3 + (x, y, z). For each classifier trained on a feature set si ∈ S, we measure the sensitivity and Dice similarity coefficient (mean ± standard deviation), as shown in Table 2. These measurements are performed with respect to the available manual delineations of the WMH structures. Furthermore, our results are compared to the feature sets fi ∈ F, previously used by Kuijf et al. [3]: f1: T1, IR, FLAIR; f2: f1 + (x, y, z); f3: f1 + FA, MD; f4: f2 + FA, MD; f5: f4 + CL, CP, CS, AD, RD. The results of Table 2 demonstrate that our proposed VA-guided feature selection can achieve similar or slightly better performance than the automated feature selection presented by Kuijf et al. [3]. The two best performing feature sets of Kuijf et al. used 8 (f4) and 13 (f5) features, while our current two best methods use only 7 (s2) and 8 (s4) features, with comparable results. Our approach allows discarding CL, CP, AD and IR, which do not contribute to the classification, thereby saving scanning and computational time.

Evaluation of Classification. To evaluate the classification outcome, we introduce the results of the two best performing classifiers, s2 and s4, into the VA system. One of the goals is to explore and analyze the parts of the WMHs that are missed, but also to understand better how these classifiers work and how they can be improved. An initial inspection shows that classifier s2 is restricted to the core of the WMHs, while s4 detects an extension of it. The WMH core is always detected by both classifiers, as it has consistent imaging characteristics and is well-clustered in the t-SNE embeddings. In subjects with bigger WMHs, s4 misses only small or thin structures and part of the periphery.


Table 2. Sensitivity, Dice similarity coefficient (SI, higher is better) and number of features for the classifiers, trained on combinations of features si ∈ S (left, from our VA-driven approach) and fi ∈ F (right, from [3]), with respect to the available manual delineations.

Our VA-driven approach                               Automated approach from [3]
S    Sensitivity (%)   Dice SI         Features      F    Sensitivity (%)   Dice SI         Features
s1   64.8 ± 0.2        0.460 ± 0.003   4             f1   59.7 ± 0.2        0.349 ± 0.001   3
s2   76.2 ± 0.4        0.560 ± 0.005   7             f2   73.4 ± 0.4        0.536 ± 0.005   6
s3   66.3 ± 0.2        0.471 ± 0.004   5             f3   67.8 ± 0.3        0.411 ± 0.003   5
s4   76.6 ± 0.5        0.576 ± 0.004   8             f4   77.2 ± 0.4        0.565 ± 0.004   8
                                                     f5   75.2 ± 0.6        0.561 ± 0.003   13
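The two measures reported in Table 2 follow the standard definitions: sensitivity is TP / (TP + FN) and the Dice similarity coefficient is 2·TP / (2·TP + FP + FN), both computed against the manual delineation. The sketch below assumes flattened binary voxel lists; the function name and the example values are hypothetical.

```python
def sensitivity_and_dice(ground_truth, prediction):
    """Compute sensitivity and Dice for two flattened binary masks of equal length."""
    tp = fp = fn = 0
    for g, p in zip(ground_truth, prediction):
        if g and p:
            tp += 1      # WMH voxel correctly detected
        elif p:
            fp += 1      # voxel wrongly labeled as WMH
        elif g:
            fn += 1      # WMH voxel missed by the classifier
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else float("nan")
    return sensitivity, dice


# Hypothetical flattened masks: 4 true WMH voxels, 2 detected, 1 false positive.
ground_truth = [1, 1, 1, 1, 0, 0, 0, 0]
prediction = [1, 1, 0, 0, 1, 0, 0, 0]
sensitivity, dice = sensitivity_and_dice(ground_truth, prediction)
```

Because Dice also penalizes false positives, a classifier can have a high sensitivity and still a moderate Dice value, which is the pattern visible in Table 2.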

In subjects with smaller WMHs, there is a tendency to miss periphery parts and posterior structures more often than the anterior ones. For bigger WMHs, the core differs from the missed structures in T1, MD and RD. Also, the latter are not as well clustered in the t-SNE embeddings as the core, i.e., they are not coherent in their imaging characteristics. As WMHs become smaller, the influence of T1 becomes less strong, while MD and RD seem to become more important.

5 Discussion and Conclusions

We proposed a user-driven pipeline for aiding the design of classifiers, focusing on WMH segmentation. Using VA and the cognitive skills of an expert user, we initially identified the list of features (MD, RD, T1, FLAIR, and secondarily, FA and CS) that are suitable for the separation of WMHs. Then, this list was used for WMH classification. Compared with previous work [3], our results are comparable. Yet, our results are not achieved through trial-and-error, but after a justifiable and understandable, interactive feature selection. Additionally, our approach requires fewer features, which allows skipping several imaging sequences and makes the feature calculation less computationally intensive and time consuming. For example, we concluded that CL, CP, AD and IR can be omitted, which saves valuable scanning time (IR: 3:49.6 min). After classification, we evaluated the classifier outcome in the VA system. The periphery is consistently missed. Thin and small structures can be missed due to the partial volume effect, while the MNI152-normalized spatial coordinates can influence the separation of posterior or anterior WMHs. For certain subjects, the missed structures have intrinsically different imaging characteristics. In this case, more features, such as texture or tensor information, should be further investigated. The performance of the classifier could be further improved by adding post-processing to remove false positive detections, which was not performed here, in order to remain comparable to Kuijf et al. [3]. Also, it would be interesting to investigate what happens when our VA-selected features are used with more sophisticated classification algorithms.


In the entire pipeline, the user interacts and guides the analysis. This has the advantage that the cognitive capabilities of the user, which are not easily automated, can be included in feature selection. However, the results are user-dependent and it remains important to analyze the bias introduced by the user. Although t-SNE is widely used [13] for understanding high dimensional data, errors can also be introduced due to its use. Adding more features for exploration in the VA system, such as textural features or information from tensors, could give interesting results. However, certain visualizations of the VA system do not scale well to a high number of features; thus, new visualizations would be needed to tackle hundreds of features. Finally, evaluating the use of the pipeline with a user study, to define its general usefulness, is another point for future work. Nevertheless, employing VA in the design of classifiers has potential for better understanding the data under exploration, and for obtaining more insight into classifiers and the frequently exploding set of imaging features.

References

1. Pantoni, L.: Cerebral small vessel disease: from pathogenesis and clinical characteristics to therapeutic challenges. Lancet Neurol. 9(7), 689–701 (2010)
2. Anbeek, P., Vincken, K.L., van Osch, M.J., Bisschops, R.H., van der Grond, J.: Probabilistic segmentation of white matter lesions in MR imaging. NeuroImage 21(3), 1037–1044 (2004)
3. Kuijf, H.J., et al.: The added value of diffusion tensor imaging for automated white matter hyperintensity segmentation. In: O'Donnell, L., Nedjati-Gilani, G., Rathi, Y., Reisert, M., Schneider, T. (eds.) Computational Diffusion MRI. Mathematics and Visualization, pp. 45–53. Springer International Publishing, Switzerland (2014)
4. Maillard, P., Carmichael, O., Harvey, D., Fletcher, E., Reed, B., Mungas, D., DeCarli, C.: FLAIR and diffusion MRI signals are independent predictors of white matter hyperintensities. Am. J. Neuroradiol. 34(1), 54–61 (2013)
5. Raidou, R.G., Van Der Heide, U.A., Dinh, C.V., Ghobadi, G., Kallehauge, J.F., Breeuwer, M., Vilanova, A.: Visual analytics for the exploration of tumor tissue characterization. Comput. Graph. Forum 34(3), 11–20 (2015)
6. Keim, D.A., Mansmann, F., Schneidewind, J., Thomas, J., Ziegler, H.: Visual analytics: scope and challenges. In: Simoff, S.J., Böhlen, M.H., Mazeika, A. (eds.) Visual Data Mining: Theory, Techniques and Tools for Visual Analytics. LNCS, vol. 4404, pp. 76–90. Springer, Heidelberg (2008). doi:10.1007/978-3-540-71080-6_6
7. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
8. Mendrik, A.M., Vincken, K.L., Kuijf, H.J., Breeuwer, M., Bouvy, W.H., de Bresser, J., Jog, A.: MRBrainS challenge: online evaluation framework for brain image segmentation in 3T MRI scans. Comput. Intell. Neurosc. No. 813696, p. 16 (2015). Special issue on Simulation and Validation in Brain Image Analysis
9. Klein, S., Staring, M., Murphy, K., Viergever, M.A., Pluim, J.P.: Elastix: a toolbox for intensity-based medical image registration. IEEE Trans. Med. Imag. 29(1), 196–205 (2010)
10. Leemans, A., Jeurissen, B., Sijbers, J., Jones, D.K.: ExploreDTI: a graphical toolbox for processing, analyzing, and visualizing diffusion MR data. In: Proceedings of the 17th Annual Meeting of Intl Soc Mag Reson Med, Vol. 209, p. 3537 (2009)
11. Westin, C.F., Maier, S.E., Mamata, H., Nabavi, A., Jolesz, F.A., Kikinis, R.: Processing and visualization for diffusion tensor MRI. Med. Image Anal. 6(2), 93–108 (2002)
12. Fonov, V.S., Evans, A.C., McKinstry, R.C., Almli, C.R., Collins, D.L.: Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage 47, S102 (2009)
13. Amir, E.A.D., Davis, K.L., Tadmor, M.D., Simonds, E.F., Levine, J.H., Bendall, S.C., Pe'er, D.: viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature Biotechnol. 31(6), 545–552 (2013)

The Automated Learning of Deep Features for Breast Mass Classification from Mammograms

Neeraj Dhungel1(B), Gustavo Carneiro1, and Andrew P. Bradley2

1 ACVT, School of Computer Science, The University of Adelaide, Adelaide, Australia
{neeraj.dhungel,gustavo.carneiro}@adelaide.edu.au
2 School of ITEE, The University of Queensland, Brisbane, Australia
[email protected]

Abstract. The classification of breast masses from mammograms into benign or malignant has been commonly addressed with machine learning classifiers that use as input a large set of hand-crafted features, usually based on general geometrical and texture information. In this paper, we propose a novel deep learning method that automatically learns features based directly on the optimisation of breast mass classification from mammograms, where we target an improved classification performance compared to the approach described above. The novelty of our approach lies in the two-step training process that involves a pre-training based on the learning of a regressor that estimates the values of a large set of hand-crafted features, followed by a fine-tuning stage that learns the breast mass classifier. Using the publicly available INbreast dataset, we show that the proposed method produces better classification results, compared with the machine learning model using hand-crafted features and with a deep learning method trained directly for the classification stage without the pre-training stage. We also show that the proposed method produces the current state-of-the-art breast mass classification results for the INbreast dataset. Finally, we integrate the proposed classifier into a fully automated breast mass detection and segmentation system, which shows promising results.

Keywords: Deep learning · Breast mass classification · Mammograms

1 Introduction

Mammography represents the main imaging technique used for breast cancer screening [1] that uses the (mostly manual) analysis of lesions (i.e., masses and micro-calcifications) [2]. Although effective, this manual analysis has a trade-off between sensitivity (84 %) and specificity (91 %) that results in a relatively large number of unnecessary biopsies [3]. The main objective of computer aided diagnosis (CAD) systems in this problem is to act as a second reader with the goal of increasing the breast screening sensitivity and specificity [1]. Current automated mass classification approaches extract hand-crafted features from an image patch containing a breast mass, and subsequently use them in a classification process based on traditional machine learning methodologies, such as support vector machines (SVM) or multi-layer perceptron (MLP) [4]. One issue with this approach is that the hand-crafted features are not optimised to work specifically for the breast mass classification problem. Another limitation of these methods is that the detection of image patches containing breast masses is typically a manual process [4,5] that guarantees the presence of a mass for the segmentation and classification stages.

This work was partially supported by the Australian Research Council's Discovery Projects funding scheme (project DP140102794). Prof. Bradley is the recipient of an Australian Research Council Future Fellowship (FT110100623).

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 106–114, 2016. DOI: 10.1007/978-3-319-46723-8_13

Fig. 1. Four classification models explored in this paper, where our main contribution consists of the last two models (highlighted in red and green).

In this paper, we propose a new deep learning model [6,7] which addresses the issue of producing features that are automatically learned for the breast mass classification problem. The main novelty of this model lies in the training stage that comprises two main steps: the first stage acknowledges the importance of the aforementioned hand-crafted features by using them to pre-train our model, and the second stage fine-tunes the features learned in the first stage to become more specialised for the classification problem. We also propose a fully automated CAD system for analysing breast masses from mammograms, comprising detection [8] and segmentation [9] steps, followed by the proposed deep learning models that classify breast masses. Using the INbreast [10] dataset, we show that the features learned by our proposed models produce accurate classification results compared with the hand-crafted features [4,5] and the features produced by a deep learning model without the pre-training stage [6,7] (Fig. 1). Also, our fully automated system is able to detect 90 % of the masses at 1 false positive per image, where the final classification accuracy reduces only by 5 %.

2 Literature Review

Breast mass classification systems from mammograms comprise three steps: mass detection, segmentation and classification. The majority of classification methods still rely on the manual localisation of masses, as their automated detection is still considered a challenging problem [4]. The segmentation is mostly an automated process generally based on active contour [11] or dynamic programming [4]. The classification usually relies on hand-crafted features, extracted from the detected image patches and their segmentation, which are fed into classifiers that classify masses into benign or malignant [4,5,11]. A common issue with these approaches is that they are tested on private datasets, preventing fair comparisons. A notable exception is the work by Domingues et al. [5] that uses the publicly available INbreast dataset [10]. Another issue is that the results from fully automated detection, segmentation and classification CAD systems are not (often) published in the open literature, which makes comparisons difficult.

Deep learning models have consistently been shown to produce more accurate classification results compared to models based on hand-crafted features [6,12]. Recently, these models have been successfully applied in mammogram classification [13], breast mass detection [8] and segmentation [9]. Carneiro et al. [13] have proposed a semi-automated mammogram classification using a deep learning model pre-trained with computer vision datasets, which differs from our proposal given that ours is fully automated and that we process each mass independently. Finally, for the fully automated CAD system, we use the deep learning models of detection [8] and segmentation [9] that produce the current state-of-the-art results on INbreast [10].

3 Methodology

Dataset. The dataset is represented by $D = \{(x, A)_i\}_{i=1}^{|D|}$, where mammograms are denoted by $x : \Omega \rightarrow \mathbb{R}$ with $\Omega \in \mathbb{R}^2$, and the annotation for the $|A_i|$ masses for mammogram $i$ is represented by $A_i = \{(d, s, c)_j\}_{j=1}^{|A_i|}$, where $d(i)_j = [x, y, w, h] \in \mathbb{R}^4$ represents the left-top position $(x, y)$ and the width $w$ and height $h$ of the bounding box of the $j$-th mass of the $i$-th mammogram, $s(i)_j : \Omega \rightarrow \{0, 1\}$ represents the segmentation map of the mass within the image patch defined by the bounding box $d(i)_j$, and $c(i)_j \in \{0, 1\}$ denotes the class label of the mass that can be either benign (i.e., BI-RADS $\in \{1, 2, 3\}$) or malignant (i.e., BI-RADS $\in \{4, 5, 6\}$).

Classification Features. The features are obtained by a function that takes a mammogram, the mass bounding box and segmentation, defined by:

$$f(x, d, s) = z \in \mathbb{R}^N. \qquad (1)$$
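A function of the form in (1) can be illustrated with a toy feature extractor that maps a binary segmentation map to a short vector of shape descriptors. This is only a sketch: the function name, the edge-counting perimeter and the choice of four descriptors are assumptions for illustration and do not reproduce the feature set of [4]; circularity is taken as 4π·area / perimeter².

```python
import math


def shape_features(mask):
    """Toy instance of f(x, d, s): [area, perimeter, perimeter/area, circularity]."""
    rows, cols = len(mask), len(mask[0])
    area = 0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            area += 1
            # Count every pixel edge that borders background or the patch boundary.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < rows and 0 <= cc < cols)
                if outside or not mask[rr][cc]:
                    perimeter += 1
    if area == 0:
        raise ValueError("segmentation map contains no mass pixels")
    circularity = 4 * math.pi * area / perimeter ** 2
    return [area, perimeter, perimeter / area, circularity]


# Hypothetical 5 x 6 segmentation map containing a 3 x 4 rectangular mass.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
z = shape_features(mask)
```

For the rectangular example the vector starts with an area of 12 pixels and a perimeter of 14 pixel edges; a real extractor would append texture descriptors computed from the image patch as well.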

In the case of hand-crafted features, the function $f(.)$ in (1) extracts a vector of morphological and texture features [4]. The morphological features are computed from the segmentation map $s$ and consist of geometric information, such as area, perimeter, ratio of perimeter to area, circularity, rectangularity, etc. The texture features are computed from the image patch limited by the bounding box $d$ and use the spatial gray level dependence (SGLD) matrix [4] in order to produce energy, correlation, entropy, inertia, inverse difference moment, sum average, sum variance, sum entropy, difference of average, difference of entropy, difference variance, etc. The hand-crafted features are denoted by $z^{(H)} \in \mathbb{R}^N$.

The classification features from the deep learning model are obtained using a convolutional neural network (CNN) [7], which consists of multiple processing layers containing a convolution layer followed by a non-linear activation and a sub-sampling layer, where the last layers are represented by fully connected layers and a final regression/classification layer [6,7]. Each convolution layer $l \in \{1, ..., L\}$ computes the output at location $j$ from input at $i$ using the filter $W_m^{(l)}$ and bias $b_m^{(l)}$, where $m \in \{1, ..., M(l)\}$ indexes the features in layer $l$, as follows: $\tilde{x}^{(l+1)}(j) = \sigma\left(\sum_{i \in \Omega} x^{(l)}(i) * W_m^{(l)}(i, j) + b_m^{(l)}(j)\right)$, where $\sigma(.)$ is the activation function [6,7], $x^{(1)}$ is the original image, and $*$ is the convolution operator. The sub-sampling layer is computed by $x^{(l)}(j) = \downarrow(\tilde{x}^{(l)}(j))$, where $\downarrow(.)$ is the subsampling function that pools the values (i.e., a max pooling operator) in the region $j \in \Omega$ of the input data $\tilde{x}^{(l)}(j)$. The fully connected layer is determined by the convolution equation above using a separate filter for each output location, using the whole input from the previous layer. In general, the last layer of a CNN consists of a classification layer, represented by a softmax activation function. For our particular problem of mass classification, recall that we have a binary classification problem, defined by $c \in \{0, 1\}$ (Sect. 3), so the last layer contains two nodes (benign or malignant mass classification), with a softmax activation function [6]. The training of such a CNN is based on the minimisation of the regularised cross-entropy loss [6], where the regularisation is generally based on the $\ell_2$ norm of the parameters $\theta$ of the CNN.
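The convolution, activation and sub-sampling operations described above can be sketched for a single filter as follows. The sketch rests on several assumptions that are not stated in the text: ReLU stands in for the activation σ, pooling uses non-overlapping 2 × 2 windows, and the filter is applied as a sliding-window correlation (no kernel flip), as is common in deep learning code. All names and values are hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide the filter over the image (no padding) and add the bias."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[bias + sum(image[r + a][c + b] * kernel[a][b]
                        for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    """Assumed activation function: element-wise max(0, value)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Sub-sampling: keep the maximum of each non-overlapping 2 x 2 region."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]) - 1, 2)]
            for r in range(0, len(feature_map) - 1, 2)]


# Hypothetical 5 x 5 patch with a vertical edge between columns 1 and 2.
patch = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2 x 2 filter that responds to left-to-right intensity increases.
edge_filter = [[-1, 1],
               [-1, 1]]

activated = relu(conv2d_valid(patch, edge_filter))   # 4 x 4 feature map
pooled = max_pool_2x2(activated)                      # 2 x 2 feature map
```

The filter responds only where its window straddles the edge, and pooling keeps that response while halving the spatial resolution.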
In order to have a fair comparison between the hand-crafted and CNN features, the number of nodes in layer $L - 1$ must be $N$, which is the number of hand-crafted features in (1). It is well known that a CNN can overfit the training data even with the regularisation of the weights and biases based on the $\ell_2$ norm, so a current topic of investigation is how to regularise the training more effectively [14]. One of the contributions of this paper is an experimental investigation of how to regularise the training for problems in medical image analysis that have traditionally used hand-crafted features. Our proposal is a two-step training process, where the first stage consists of training a regressor (see step 1 in Fig. 2), where the output $\tilde{x}^{(L)}$ approximates the values of the hand-crafted features $z^{(H)}$ using the following loss function:

$$J = \sum_{i=1}^{|D|} \sum_{j=1}^{|A_i|} \left\| z^{(H)}_{(i,j)} - \tilde{x}^{(L)}_{(i,j)} \right\|_2, \qquad (2)$$

where $i$ indexes the training images, $j$ indexes the masses in each training image, and $z^{(H)}_{(i,j)}$ denotes the vector of hand-crafted features from mass $j$ and image $i$. This first step acts as a regulariser for the classifier that is subsequently fine-tuned (see step 2 in Fig. 2).

Fig. 2. Two steps of the proposed model with the pre-training of the CNN with the regression to the hand-crafted features (step 1), followed by the fine-tuning using the mass classification problem (step 2).

Fully Automated Mass Detection, Segmentation and Classification. The mass detection and segmentation methods are based on deep learning methods recently proposed by Dhungel et al. [8,9]. More specifically, the detection consists of a cascade of increasingly more complex deep learning models, while the segmentation comprises a structured output model, containing deep learning potential functions. We use these particular methods given their use of deep learning methods (which facilitates the integration with the proposed classification), and their state-of-the-art performance on both problems.
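The pre-training loss in (2) sums, over all training images and all masses in each image, the distance between the hand-crafted feature vector and the network output. The sketch below assumes the plain (unsquared) ℓ2 norm, which is how the norm subscript is read here; the function name and the toy vectors are hypothetical.

```python
import math


def pretraining_loss(handcrafted, predicted):
    """Sum of Euclidean distances between hand-crafted and regressed feature vectors."""
    total = 0.0
    for z_image, x_image in zip(handcrafted, predicted):   # loop over images i
        for z, x in zip(z_image, x_image):                  # loop over masses j
            total += math.dist(z, x)
    return total


# Hypothetical data: image 0 has one mass, image 1 has two masses, N = 2 features.
handcrafted = [[(3.0, 4.0)], [(1.0, 1.0), (0.0, 0.0)]]
predicted = [[(0.0, 0.0)], [(1.0, 1.0), (0.0, 2.0)]]
J = pretraining_loss(handcrafted, predicted)
```

In this toy case the three masses contribute distances of 5, 0 and 2; minimising such a sum drives the last layer of the network towards the hand-crafted feature values before the classifier is fine-tuned.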

4 Materials and Methods

We use the publicly available INbreast dataset [10] that contains 115 cases with 410 images, where 116 images contain benign or malignant masses. Experiments are run using five-fold cross validation by randomly dividing the 116 cases in a mutually exclusive manner, with 60 % of the cases for training, 20 % for validation and 20 % for testing. We test our classification methods using a manual and an automated set-up, where the manual set-up uses the manual annotations for the mass bounding box and segmentation. The automated set-up first detects the mass bounding boxes [8]: we select a detection score threshold, based on the training results, that produces a TPR of 0.93 ± 0.05 and FPI = 0.8 on training data; this same threshold produces a TPR of 0.90 ± 0.02 and FPI = 1.3 on testing data, where a detection is positive if the intersection over union ratio (IoU) ≥ 0.5 [8]. The resulting bounding boxes and segmentation maps are resized to 40 × 40 pixels using bicubic interpolation, where the image patches are contrast enhanced, as described in [11]. Then the bounding boxes are automatically segmented [9], where the segmentation results using only the TP detections have a Dice coefficient of 0.85 ± 0.01 in training and 0.85 ± 0.02 in testing. From these patches and segmentation maps, we extract 781 hand-crafted features [4] used to pre-train the CNN model and to train and test the baseline model using the random forest (RF) classifier [15]. The CNN model for step 1 (pre-training in Fig. 2) has an input with two channels containing the image patch with a mass and respective segmentation mask; layer 1 has 20 filters of size 5 × 5, followed by a max-pooling layer (subsamples by 2); layer 2 contains 50 filters of size 5 × 5 and a max-pooling that

Fig. 3. Accuracy on test data of the methodologies explored in this paper: (a) manual set-up; (b) automated set-up.

Fig. 4. ROC curves of various methodologies explored in this paper on test data: (a) manual set-up; (b) automated set-up.

Table 1. Comparison of the proposed and state-of-the-art methods on test sets.

| Methodology | Dataset (Rep?) | Set-up | ACC |
|---|---|---|---|
| Proposed RF on CNN with pre-training | INbreast (Yes) | Manual | 0.95 ± 0.05 |
| Proposed CNN with pre-training | INbreast (Yes) | Manual | 0.91 ± 0.06 |
| Proposed RF on CNN with pre-training | INbreast (Yes) | Fully automated | 0.91 ± 0.02 |
| Proposed CNN with pre-training | INbreast (Yes) | Fully automated | 0.84 ± 0.04 |
| Domingues et al. [5] | INbreast (Yes) | Manual | 0.89 |
| Varela et al. [4] | DDSM (No) | Semi-automated | 0.81 |
| Ball et al. [11] | DDSM (No) | Semi-automated | 0.87 |

subsamples by 2; layer 3 has 100 filters of size 4×4 followed by a rectified linear unit (ReLU) [16]; layer 4 has 781 filters of size 4 × 4 followed by a ReLU unit; layer 5 comprises a fully-connected layer of 781 nodes that is trained to approximate the hand-crafted features, as in (2). The CNN model for step 2 (fine-tuning in Fig. 2) uses the pre-trained model from step 1, where a softmax layer containing two nodes (representing the benign versus malignant classification) is added, and the fully-connected layers are trained with drop-out of 0.3 [14]. Note that for comparison purposes, we also train a CNN model without


Fig. 5. Results of RF on features from the CNN with pre-training on test set. Red and blue lines denote manual detection and segmentation whereas yellow and green lines are the automated detection and segmentation.

the pre-training step to show its influence on the classification accuracy. In order to improve the regularisation of the CNN models, we artificially augment the training data 10-fold using geometric transformations (rotation, translation and scale). Moreover, using the hand-crafted features, we train an RF classifier [15], where model selection is performed using the validation set of each cross validation training set. We also train an RF classifier using the 781 features from the second-to-last fully-connected layer of the fine-tuned CNN model. We carried out all our experiments using a computer with the following configuration: Intel(R) Core(TM) i5-2500k 3.30 GHz CPU with 8 GB RAM and graphics card NVIDIA GeForce GTX 460 SE 4045 MB. We compare the results of the methods explored in this paper with receiver operating characteristic (ROC) curve and classification accuracy (ACC).
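The layer sizes given for the step 1 CNN can be checked with a short shape walkthrough. The sketch below assumes unpadded, stride-1 convolutions and non-overlapping 2 × 2 max-pooling; the text does not state padding or stride, so these are assumptions, and the helper names `conv_out` and `pool_out` are hypothetical. Under these assumptions the 40 × 40 patch shrinks to a 1 × 1 map with 781 channels after layer 4, which matches the 781 hand-crafted features being regressed.

```python
def conv_out(size, kernel):
    # Output width of an unpadded, stride-1 convolution (assumed here).
    return size - kernel + 1


def pool_out(size, factor=2):
    # Output width of non-overlapping max-pooling that subsamples by `factor`.
    return size // factor


size = 40                            # 40 x 40 input patch, two channels (image, mask)
trace = [size]
size = pool_out(conv_out(size, 5))   # layer 1: 20 filters of 5 x 5, then pooling
trace.append(size)
size = pool_out(conv_out(size, 5))   # layer 2: 50 filters of 5 x 5, then pooling
trace.append(size)
size = conv_out(size, 4)             # layer 3: 100 filters of 4 x 4 (+ ReLU)
trace.append(size)
size = conv_out(size, 4)             # layer 4: 781 filters of 4 x 4 (+ ReLU)
trace.append(size)
```

Here `trace` records the spatial width after each of the four layers, starting from the input width.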

5 Results

Figures 3(a–b) show a comparison amongst the models explored in this paper using classification accuracy for both manual and automated set-ups. The most accurate model in both set-ups is the RF on features from the CNN with pre-training, with an ACC of 0.95 ± 0.05 on the manual and 0.91 ± 0.02 on the automated set-up (results obtained on the test set). Similarly, Figs. 4(a–b) display the ROC curves, which also show that the RF on features from the CNN with pre-training produces the best overall result, with an area under curve (AUC) value of 0.91 ± 0.12 for the manual and 0.76 ± 0.23 for the automated set-up on test sets. In Table 1, we compare our results with the current state-of-the-art techniques in terms of accuracy (ACC), where the second column describes the dataset used and whether it can


be reproduced (‘Rep’) because it uses a publicly available dataset, and the third column, denoted by ‘set-up’, describes the method of mass detection and segmentation (semi-automated means that detection is manual, but segmentation is automated). The running time for the fully automated system is 41 s, divided into 39 s for the detection, 0.2 s for the segmentation and 0.8 s for classification. The training time for classification is 6 h for pre-training, 3 h for fine-tuning and 30 min for the RF classifier training (Fig. 5).

6 Discussion and Conclusions

The results from Figs. 3 and 4 (both manual and automated set-ups) show that the CNN model with pre-training and the RF on features from the CNN with pre-training are better than the RF on hand-crafted features and the CNN without pre-training. Another important observation from Fig. 3 is that the RF classifier performs better than the CNN classifier on features from the CNN with pre-training. The results for the CNN model without pre-training in the automated set-up are not shown because they are not competitive, which is expected given its relatively worse performance in the manual set-up. In order to verify the statistical significance of these results, we perform the Wilcoxon paired signed-rank test between the RF on hand-crafted features and the RF on features from the CNN with pre-training, where the p-value obtained is 0.02, which indicates that the result is significant (assuming a 5 % significance level). In addition, both the proposed CNN with pre-training and the RF on features from the CNN with pre-training generalise well, where the training accuracy in the manual set-up for the former is 0.93 ± 0.06 and for the latter is 0.94 ± 0.03.

In this paper we show that the proposed two-step training process, involving a pre-training based on the learning of a regressor that estimates the values of a large set of hand-crafted features, followed by a fine-tuning stage that learns the breast mass classifier, produces the current state-of-the-art breast mass classification results on INbreast. Finally, we also show promising results from a fully automated breast mass detection, segmentation and classification system.

References

1. Giger, M.L., Karssemeijer, N., Schnabel, J.A.: Breast image analysis for risk assessment, detection, diagnosis, and treatment of cancer. Ann. Rev. Biomed. Eng. 15, 327–357 (2013)
2. Fenton, J.J., Taplin, S.H., Carney, P.A., et al.: Influence of computer-aided detection on performance of screening mammography. N. Engl. J. Med. 356(14), 1399–1409 (2007)
3. Elmore, J.G., Jackson, S.L., Abraham, L., et al.: Variability in interpretive performance at screening mammography and radiologists characteristics associated with accuracy. Radiology 253(3), 641–651 (2009)
4. Varela, C., Timp, S., Karssemeijer, N.: Use of border information in the classification of mammographic masses. Phys. Med. Biol. 51(2), 425 (2006)
5. Domingues, I., Sales, E., Cardoso, J., Pereira, W.: Inbreast-database masses characterization. In: XXIII CBEB (2012)
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS, vol. 1 (2012)
7. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. In: Arbib, M.A. (ed.) The Handbook of Brain Theory and Neural Networks. MIT Press, Massachusetts (1995). 3361
8. Dhungel, N., Carneiro, G., Bradley, A.: Automated mass detection in mammograms using cascaded deep learning and random forests. In: DICTA, November 2015
9. Dhungel, N., Carneiro, G., Bradley, A.P.: Deep learning and structured prediction for the segmentation of mass in mammograms. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 605–612. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_74
10. Moreira, I.C., Amaral, I., Domingues, I., et al.: Inbreast: toward a full-field digital mammographic database. Acad. Radiol. 19(2), 236–248 (2012)
11. Ball, J.E., Bruce, L.M.: Digital mammographic computer aided diagnosis (cad) using adaptive level set segmentation. In: EMBS 2007. IEEE (2007)
12. Farabet, C., Couprie, C., Najman, L., et al.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915 (2013)
13. Carneiro, G., Nascimento, J., Bradley, A.P.: Unregistered multiview mammogram analysis with pre-trained deep learning models. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 652–660. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_78
14. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
15. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
16. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML (2010)

Multimodal Deep Learning for Cervical Dysplasia Diagnosis

Tao Xu1, Han Zhang2, Xiaolei Huang1(B), Shaoting Zhang3, and Dimitris N. Metaxas2

1 Computer Science and Engineering Department, Lehigh University, Bethlehem, PA, USA
{tax313,xih206}@lehigh.edu
2 Department of Computer Science, Rutgers University, Piscataway, NJ, USA
3 Department of Computer Science, UNC Charlotte, Charlotte, NC, USA

Abstract. To improve the diagnostic accuracy of cervical dysplasia, it is important to fuse multimodal information collected during a patient’s screening visit. However, current multimodal frameworks suffer from low sensitivity at high specificity levels, due to their limitations in learning correlations among highly heterogeneous modalities. In this paper, we design a deep learning framework for cervical dysplasia diagnosis by leveraging multimodal information. We first employ the convolutional neural network (CNN) to convert the low-level image data into a feature vector fusible with other non-image modalities. We then jointly learn the non-linear correlations among all modalities in a deep neural network. Our multimodal framework is an end-to-end deep network which can learn better complementary features from the image and non-image modalities. It automatically gives the final diagnosis for cervical dysplasia with 87.83 % sensitivity at 90 % specificity on a large dataset, which significantly outperforms methods using any single source of information alone and previous multimodal frameworks.

1 Introduction

Cervical cancer ranks as the second most common type of cancer in women aged 15 to 44 years worldwide [13]. Screening can help prevent cervical cancer by detecting cervical intraepithelial neoplasia (CIN), which is the potentially precancerous change and abnormal growth of squamous cells on the surface of the cervix. According to the World Health Organization (WHO) [13], CIN has three grades: CIN1 (mild), CIN2 (moderate), and CIN3 (severe). Mild dysplasia in CIN1 only needs conservative observation while lesions in CIN2/3 or cancer (denoted as CIN2+ in this paper) require treatment. In clinical practice one important goal of screening is to differentiate normal/CIN1 from CIN2+ for early detection of cervical cancer.

Widely used cervical cancer screening methods today include Pap tests, HPV tests, and visual examination. Pap tests are effective, but they often suffer from low sensitivity in detecting CIN2+ [10]. HPV tests are often used in conjunction with Pap tests, because nearly all cases of cervical cancer are caused by human papillomavirus (HPV) infection. Digital cervicography is a non-invasive and low-cost visual examination method that takes a photograph of the cervix (called a Cervigram®) after the application of 5 % acetic acid to the cervix epithelium. Recently, automated Cervigram analysis techniques [10,14] have shown great potential for CIN classification.

T. Xu and H. Zhang contributed equally.
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 115–123, 2016. DOI: 10.1007/978-3-319-46723-8_14

Fig. 1. Our multimodal deep network: (1) we apply a convolutional neural network (CNN) to learn image features from raw data in Cervigram ROIs; (2) we use joint fully connected (jfc) layers to model the non-linear correlations across all sources of information for CIN classification.

Previous works [1,3,10,14] have shown that multimodal information from conventional screening tests can provide complementary information to improve the diagnostic accuracy of cervical dysplasia. DeSantis et al. [3] combined spectroscopic image information measured from the cervix with other patient data, such as Pap results. Chang et al. [1] investigated the diagnostic potential of different combinations of reflectance and fluorescence spectral features. In [10,14], the authors hand-crafted pyramid histograms of color and oriented gradient features to represent Cervigrams and directly utilized the clinical results to represent non-image modalities. Then either a Support Vector Machine (SVM) [14] or k-nearest neighbor (k-NN) [10] was used to calculate the decision score for each group of modalities separately. The final decision was simply made by combining the decision scores of all the modalities. Since those previous methods integrated multimodal information at the final stage, they did not fully exploit the inherent correlations across image and non-image modalities. Another limitation is that their hand-crafted features may require strong domain knowledge, and it is difficult to manually design proper features that are fusible across different modalities.

Recently, deep learning has been exploited in medical image analysis to achieve state-of-the-art results [2,8,9,12]. Besides learning data representations


just from a single modality, deep learning is also able to discover the intricate structure in multimodal datasets (e.g., video and audio) and improve the performance of the corresponding tasks [7]. However, this attractive feature is less well investigated in the medical domain. In a pioneering work in this direction, Suk et al. [11] applied a multimodal Deep Boltzmann Machine (DBM) to learn a unified representation from the paired patches of PET and MRI for AD/MCI diagnosis. However, considering that PET and MRI are both image modalities, this could be less complicated compared to the problem we face. Particularly, in a patient’s medical record, the data is more heterogeneous. The raw medical image data is a high dimensional vector. It requires less human labor to obtain but it contains a large amount of undiscovered information. The clinical results that are verified by clinicians have fewer feature dimensions, but they usually provide more instructional information. Therefore, it is challenging to combine the information from these modalities to perform improved diagnosis.

In this paper, we apply deep learning for the task of cervical dysplasia diagnosis using multimodal information collected during a patient’s screening visit. The contribution is threefold. (1) We solve the challenging problem of highly heterogeneous input data by converting the low-level Cervigram data into a feature fusible with other non-image modalities using convolutional neural networks (CNN). (2) We propose a deep neural network to jointly learn the non-linear correlations among all modalities. (3) We unify the CNN image processing network and the joint learning network into an end-to-end framework which gives the final diagnosis for cervical dysplasia with 87.83 % sensitivity at 90 % specificity on the test dataset. The proposed multimodal network significantly outperforms methods using any single source of information alone and previous multimodal frameworks.

2 Our Approach

In our dataset, every screening visit of a patient has at least one Cervigram and the clinical results of Pap tests and HPV tests. As shown in Fig. 1, we use the Cervigram as low-level image input. Motivated by the work of Song et al. [10], we construct a 13D high-level non-image input using four Pap test results (e.g., Cytyc ThinPrep), three pairs of HPV test results (e.g., high risk HPV16 and HPV18), one Cervigram observation result, the pH value and the age of the patient. Not every visit has a complete set of clinical results for all Pap and HPV tests. Thus, for our non-image feature vector, we estimate each missing value by the average of that dimension, computed from the available data of that dimension in the training dataset. This imputation method is widely used in training deep networks since, once the data are whitened (and hence mean-centred), an imputed entry becomes zero, so the missing dimension is effectively ignored. Next, we will describe each component of our proposed multimodal deep network.
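The effect of mean imputation can be illustrated with a minimal sketch. It replaces full whitening with per-dimension standardisation (subtracting the training mean and dividing by the training standard deviation), which is a simplifying assumption; the function name `impute_and_standardize` and the toy values are hypothetical. After standardisation, the imputed entry is exactly zero and therefore contributes nothing to the first weighted sum of the network.

```python
import math


def impute_and_standardize(train_column):
    """Mean-impute missing entries (None) in one feature dimension, then
    standardize it with the mean and std of the observed training values."""
    observed = [v for v in train_column if v is not None]
    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / len(observed)
    std = math.sqrt(var) or 1.0   # fall back to 1.0 for a constant column
    filled = [mean if v is None else v for v in train_column]
    return [(v - mean) / std for v in filled]


ages = [30.0, None, 50.0, 40.0]            # hypothetical values, one missing
standardized = impute_and_standardize(ages)
```

At test time the same training mean and standard deviation would be reused, so a missing test entry is also mapped to zero.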

2.1 Learning a Deep Representation for Cervigram

Inspired by the recent success of convolutional neural networks (CNN) in general recognition tasks [6], instead of hand-crafting features [10,14], we propose to use a CNN to learn visual features directly from Cervigram images.

Fine-tune pre-trained model: We use AlexNet [6] as our network structure for the feature learning. This model contains five convolutional layers (conv1–conv5), two fully connected layers (fc6 and fc7) and a final 1000-way softmax layer. Since our Cervigram dataset is relatively small compared to the general image classification datasets, we follow the transfer learning scheme to train our model. We first take the model pre-trained on the ImageNet classification task and replace its output layer with a new 2-way softmax layer. Then, we fine-tune its parameters on our Cervigram dataset. We detect one cervix region of interest (ROI) for each Cervigram using the method proposed in [10]. Every ROI is fed into the network, which outputs the corresponding feature vector from its last fully connected layer (fc7). Since there are 4096 hidden units in the fc7 layer, we get a 4096D image feature embedding.

CNN feature compression: The dimension of the CNN feature vector from the fc7 layer is much higher than that of the non-image feature. Our experimental results show that the high dimensional image feature can overwhelm the low dimensional non-image feature if we fuse them directly. Thus, we add another fully connected layer (fc8) with 13 units on top of fc7 to reduce the dimension of the CNN feature to be comparable with the non-image data. In this way, our image feature is non-linearly compressed to 13D.

Fig. 2. Baseline methods: (a) Late Fusion by SVM; (b) Late Fusion by Softmax; (c) image only; (d) non-image only.

2.2 Novel Method for Fusing Multimodal Information in a Deep Network

Increasing evidence shows that cues from different modalities can provide complementary information in cervical dysplasia diagnosis [1,3,10,14]. However, it is challenging to integrate highly heterogeneous modalities. To motivate our multimodal deep network, we first discuss two simple fusion models and their drawbacks.

Baseline models: The previous multimodal methods [10,14] assumed that the image and non-image data should be treated separately to make the prediction by


themselves. The fusion between image and non-image modalities only happened when merging the decision scores from each modality. We call this type of fusion Late Fusion. Based on this assumption, we will compare our proposed fusion model with two baseline late fusion frameworks. In the first one (Fig. 2a), the 13D CNN feature and the 13D non-image feature are directly concatenated and fed into a linear SVM for CIN classification. It is an intuitive approach without any feature learning or engineering in the non-image modalities. In the second baseline model (Fig. 2b), we mirror the feature learning strategy of the image modality and use a neural network to learn the features in the non-image modalities, and then combine them with the CNN features for the final classification using softmax. In this setting, the hidden units in the deep neural networks only model the correlations within each group of modalities.

Our model: Instead of using the above assumption, we assume that the data in the different modalities have a tighter correlation. For example, visual features (e.g., acetowhite epithelium) in the image can be treated as a complementary support of positive HPV or Pap. Therefore, those correlations can be used as a better representation to improve the classification accuracy. However, hand-engineering such complementary pairs is difficult and time-consuming. It is better to learn such correlations directly from the multimodal data. Therefore, we propose an early fusion framework that uses deep neural networks to learn the highly non-linear correlations across all the modalities. As shown in Fig. 1, the 13D image feature and the 13D non-image feature (e.g., clinical results) are concatenated at an early stage and followed by joint learning layers. To solve the problem that data in different modalities have different statistical properties, we apply the batch normalization (BN) transform [5] to fix the means and variances of the input in each modality.
Given the input $x_1, x_2, \ldots, x_m$ over a mini-batch, the output $\hat{x}_i$ is calculated as:

$$\hat{x}_i = \gamma \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \tag{1}$$

where γ, β are the parameters to be learned by the network, $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2$. The batch normalization can regularize the model and allow us to have higher learning rates. Thus, we also apply it to joint fully connected layers. In our network, the joint fully connected layer (jfc) is applied to learn the correlations across different modalities. Each node in the jfc layer is computed by Eq. 2,

$$z_k = f(W_k \hat{x}_{k-1} + b_k) \tag{2}$$

where $z_k$ indicate the activations in the k-th layer; $W_k$ and $b_k$ are weights and bias learned for the k-th layer; $\hat{x}_{k-1}$ are the normalized output of the previous layer; $f(x) = \max(0, x)$ is the ReLU activation function. Compared to the previous framework [1,3,10,14], the units in our jfc layers model the non-linear correlations across modalities. Also the output of jfc can be viewed as a better representation for the multimodal data.
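To make Eqs. (1) and (2) concrete, the following minimal sketch applies batch normalization to one scalar feature over a mini-batch and evaluates one joint fully connected layer with a ReLU activation. The function names `batch_norm` and `jfc_layer`, the default values of γ, β and ε, and the toy weights are all hypothetical; in the network these parameters are learned.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Eq. (1) applied to one scalar feature over a mini-batch."""
    m = len(batch)
    mu = sum(batch) / m
    var = sum((x - mu) ** 2 for x in batch) / m
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]


def jfc_layer(x_hat, weights, biases):
    """Eq. (2): z = ReLU(W x_hat + b) for one normalized input vector."""
    return [max(0.0, sum(w * x for w, x in zip(row, x_hat)) + b)
            for row, b in zip(weights, biases)]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
z = jfc_layer([1.0, -1.0], weights=[[1.0, 0.5], [-2.0, 0.0]], biases=[0.0, 1.0])
```

With γ = 1 and β = 0 the normalized batch has zero mean, and the second unit of the toy layer is clipped to zero by the ReLU because its pre-activation is negative.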


Finally, a 2-way softmax (Eq. 3) layer is added upon the last joint fully connected layer to predict the diagnosis.

$$p(c = j \mid \hat{x}; W, b) = \frac{\exp(W_j \hat{x} + b_j)}{\sum_{l=0}^{1} \exp(W_l \hat{x} + b_l)} \tag{3}$$

where p(c = j) indicates the probability of the input data belonging to the j-th category, here j ∈ {0, 1}; $\hat{x}$ is the normalized output of the last joint fully connected layer; W and b are weights and bias learned for the softmax layer. During the training process, we compute the cross-entropy loss and apply stochastic gradient descent (SGD) to train the whole network. The classification loss is also backpropagated to the image CNN layers to guide the CNN to extract visual features that complement the clinical features for the classification. The number of joint fully connected layers and the number of hidden units in each jfc layer are hyper-parameters, and we choose them through cross validation.
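The two-way softmax and the cross-entropy loss described above can be sketched as follows. This is a minimal illustration for a single input vector, not the authors' implementation; the function names `softmax2` and `cross_entropy` and the toy parameters are hypothetical. Subtracting the maximum logit before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math


def softmax2(x_hat, W, b):
    """Two-way softmax: returns [p(c=0), p(c=1)] for one input vector."""
    logits = [sum(w * x for w, x in zip(W[j], x_hat)) + b[j] for j in range(2)]
    shift = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class (label is 0 or 1)."""
    return -math.log(probs[label])


# With all-zero weights and biases both classes are equally likely.
probs = softmax2([1.0, 2.0], W=[[0.0, 0.0], [0.0, 0.0]], b=[0.0, 0.0])
loss = cross_entropy(probs, label=1)
```

For this uninformative toy classifier the loss equals ln 2, the value expected from a two-class model that always predicts 0.5.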

3 Experiments

We evaluate our method on a dataset built from a large data archive collected by the National Cancer Institute (NCI) from 10,000 anonymized women in the Guanacaste project [4]. Each patient typically had multiple visits at different ages. During each visit, multiple cervical screening tests were performed. Since the Guanacaste project is a population-based study, only a small proportion of patient visits have the Worst Histology results: multiple expert histology interpretations were done on each biopsy taken during a visit; the most severe interpretation is labeled the Worst Histology and serves as the “gold standard” ground truth for that visit in the database. From those labeled visits, we randomly sample 345 positive visits (CIN2+) and 345 negative visits (normal/CIN1) to build our visit-level dataset. We use the same three-round three-fold cross validation to evaluate the proposed method and to compare it with baseline models and previous works [1,3,10,14].

Hyper-parameters of the proposed method: For our proposed models with different hyper-parameters, we compare their overall performance in Fig. 3. Their accuracy and sensitivity at high (90 % and 92 %) specificity are also listed in Table 1. We first evaluate the importance of our feature compression layer (fc8) by comparing models “4096D-image+non-image” and “13D-image+non-image”. In the former model, we directly concatenate the 4096D CNN feature from the fine-tuned AlexNet with the 13D non-image feature and feed them into the softmax for the final classification. In the latter one, we first compress the 4096D CNN feature into 13 dimensions using an additional hidden layer, and then perform the concatenation and classification. The result shows that “13D-image+non-image” significantly outperforms “4096D-image+non-image”, especially at high specificity. For example, the sensitivity is increased by about 10 % at both 90 %


and 92 % specificity using the compressed “13D-image+non-image”. The reason is that the high dimensional image feature overwhelms the low dimensional non-image feature in “4096D-image+non-image”. We can further improve our model by adding joint fully connected (jfc) layers. After trying different depths (the number of jfc layers) and widths (the number of units in each jfc layer), we get our best model “13D-image+non-image,2jfc”. It has two jfc layers and each of them has 52 units. This deeper model achieves a better overall performance, with a further sensitivity increase of over 10 % at 90 % and 92 % specificity. It indicates that the information in image and non-image modalities needs to be jointly learned and non-linearly transformed in a deeper network. To test the effectiveness of our batch normalization, we remove the batch normalization transform in our best model. The new model “13D-image+non-image,2jfc(noBN)” has a decreased AUC of 91.61 % (in comparison to an AUC of 94 % with BN). Thus it is important to use batch normalization to fix the means and variances of the input in each modality and to regularize the model. To conclude, our “13D-image+non-image,2jfc” model with BN gives the best performance, with 88.91 % accuracy and 87.83 % sensitivity at 90 % specificity. In the following experiments, we use this model for comparison.
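The metric "sensitivity at a given specificity" used above can be computed from classifier scores by scanning decision thresholds. The sketch below is one straightforward way to do this and is not taken from the paper; the function name `sensitivity_at_specificity`, the thresholding convention (a visit is predicted positive when its score is at least the threshold) and the toy scores are assumptions.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.90):
    """Highest sensitivity among thresholds whose specificity >= target.

    scores: higher means more likely positive (CIN2+).
    labels: 1 for positive visits, 0 for negative visits.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    # Candidate thresholds: every observed score, plus "predict nothing positive".
    for t in sorted(set(scores)) + [float("inf")]:
        specificity = sum(s < t for s in negatives) / len(negatives)
        sensitivity = sum(s >= t for s in positives) / len(positives)
        if specificity >= target_specificity:
            best = max(best, sensitivity)
    return best


# Hypothetical scores for four positive and four negative visits.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
sens_at_75 = sensitivity_at_specificity(scores, labels, 0.75)
sens_at_100 = sensitivity_at_specificity(scores, labels, 1.0)
```

In the toy data, demanding perfect specificity forces the threshold above the highest negative score (0.6), which sacrifices the positive visit scored 0.4.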

Fig. 3. Our models with different hyper-parameters (please view in color). ROC curves of sensitivity (true positive rate) against 1 − specificity (false positive rate). AUC: 4096D-image+non-image 89.53 %; 13D-image+non-image 92.85 %; 13D-image+non-image,2jfc(noBN) 91.61 %; 13D-image+non-image,2jfc 94.00 %.

Fig. 4. Comparison with baseline methods (please view in color). ROC curves of sensitivity (true positive rate) against 1 − specificity (false positive rate). AUC: non-image only 86.06 %; image only 88.77 %; Late Fusion by SVM 91.02 %; Late Fusion by Softmax 92.05 %; Ours 94.00 %.

Comparison with previous works: For fair comparison, we search the best hyper-parameters for alternative CIN classification methods shown in Fig. 2. We report the results of their best models as baselines in Fig. 4.

We first compare our model with the methods using image only or non-image only. From Fig. 4, it is clear that our model achieves a significant improvement over using any single group of modalities (our 94 % AUC vs. image-only 88.77 % and non-image only 86.06 %). The sensitivity of our method is 38.26 % higher


Table 1. Results of the proposed models with different hyper-parameters. (“noBN” indicates that batch normalization is not used in joint fully connected (jfc) layers.)

| Model | AUC (%) | Accu (%) at 90 % specificity | Sensi (%) at 90 % specificity | Accu (%) at 92 % specificity | Sensi (%) at 92 % specificity |
|---|---|---|---|---|---|
| 4096D-image+non-image | 89.53 | 78.04 | 66.09 | 77.30 | 62.61 |
| 13D-image+non-image | 92.85 | 83.70 | 77.39 | 82.09 | 72.17 |
| 13D-image+non-image,2jfc(noBN) | 91.61 | 85.00 | 80.00 | 84.70 | 77.39 |
| 13D-image+non-image,2jfc | 94.00 | 88.91 | 87.83 | 87.74 | 83.48 |

than “non-image only” and 21.74 % higher than “image only” at 90 % specificity. It demonstrates the importance of fusing raw Cervigram information with other non-image modalities (e.g., Pap and HPV results) for cervical dysplasia diagnosis. Figure 4 also shows the comparison of our early fusion method with “Late Fusion by SVM” and “Late Fusion by Softmax”. The best model of “Late Fusion by Softmax” has two hidden layers with 104 units in each layer. Our method outperforms the best results of both late fusion methods, especially in the high-specificity region. For instance, our model achieves more than 10 % higher sensitivity at 90 % specificity than both of them. This comparison supports our assumption that the information in image and non-image modalities has a tighter correlation and that the proposed “Early Fusion” assumption is better than the “Late Fusion” assumption used in [10,14].

Our model achieves 88.91 % accuracy and 87.83 % sensitivity at 90 % specificity. Compared with other previous multimodal methods [1,3,10,14], ours is state-of-the-art in terms of visit-level classification. For example, the method by DeSantis et al. [3] only achieved an accuracy of 71.3 % and the approach in [1] gave 82.39 % accuracy. Two previous works [10,14] utilized the same multimodal information as ours. The work in [14] performed visit-level classification. However, our performance is much better than theirs (our 88.91 % accuracy vs. their 79.68 %). Song et al. [10] utilized patient-level (multiple visits) information and achieved similar performance to ours (their 89.00 % accuracy vs. our 88.91 %). However, their patient-level method could not tell which visit of the patient was diagnosed as high risk (i.e., CIN2+).

4 Conclusions

In this paper, we propose a multimodal deep network for the task of cervical dysplasia diagnosis. We integrate highly heterogeneous data collected during a patient’s screening visit by extending the conventional CNN structure with joint fully connected layers. The proposed model can learn better complementary features for the image and non-image modalities through backpropagation. It automatically gives the final diagnosis for cervical dysplasia with 87.83 % sensitivity at 90 % specificity on a large dataset, which is state-of-the-art performance in visit-level classification.

References

1. Chang, S.K., Mirabal, Y.N., et al.: Combined reflectance and fluorescence spectroscopy for in vivo detection of cervical pre-cancer. J. Biomed. Optics 10(2), 024–031 (2005)
2. Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 411–418. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_51
3. DeSantis, T., Chakhtoura, N., Twiggs, L., Ferris, D., Lashgari, M., et al.: Spectroscopic imaging as a triage test for cervical disease: a prospective multicenter clinical trial. J. Lower Genital Tract Dis. 11(1), 18–24 (2007)
4. Herrero, R., Schiffman, M., Bratti, C., et al.: Design and methods of a population-based natural history study of cervical neoplasia in a rural province of Costa Rica: the Guanacaste Project. Rev Panam Salud Publica 1, 362–375 (1997)
5. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML, pp. 448–456 (2015)
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)
7. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: ICML, pp. 689–696 (2011)
8. Roth, H.R., et al.: A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8673, pp. 520–527. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10404-1_65
9. Shin, H., Orton, M., Collins, D.J., Doran, S.J., Leach, M.O.: Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data. TPAMI 35(8), 1930–1943 (2013)
10. Song, D., Kim, E., Huang, X., Patruno, J., Munoz-Avila, H., Heflin, J., Long, L., Antani, S.: Multi-modal entity coreference for cervical dysplasia diagnosis. TMI 34(1), 229–245 (2015)
11. Suk, H., Lee, S., Shen, D.: Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage 101, 569–582 (2014)
12. Suk, H.-I., Shen, D.: Deep learning-based feature representation for AD/MCI classification. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 583–590. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_72
13. WHO: Human papillomavirus and related cancers in the world. Summary report. ICO Information Centre on HPV and Cancer, August 2014
14. Xu, T., Huang, X., Kim, E., Long, L., Antani, S.: Multi-test cervical cancer diagnosis with missing data estimation. In: SPIE Medical Imaging, p. 94140X–94140X-8 (2015)

Learning from Experts: Developing Transferable Deep Features for Patient-Level Lung Cancer Prediction

Wei Shen1, Mu Zhou2, Feng Yang3(B), Di Dong1, Caiyun Yang1, Yali Zang1, and Jie Tian1(B)

1 Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing, China [email protected], [email protected]
2 Stanford University, Stanford, CA, USA
3 Beijing Jiaotong University, Beijing, China [email protected]

Abstract. Due to recent progress in Convolutional Neural Networks (CNNs), developing image-based CNN models for predictive diagnosis is gaining enormous interest. However, to date, insufficient imaging samples with pathologically-proven labels impede the evaluation of CNN models at scale. In this paper, we formulate a domain-adaptation framework that learns transferable deep features for patient-level lung cancer malignancy prediction. The presented work learns CNN-based features from a large discovery set (2272 lung nodules) with malignancy likelihood labels involving multiple radiologists’ assessments, and then tests the transferable predictability of these CNN-based features on a diagnosis-definite set (115 cases) with true pathologically-proven lung cancer labels. We evaluate our approach on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset, where both human expert labeling information on cancer malignancy likelihood and a set of pathologically-proven malignancy labels are provided. Experimental results demonstrate the superior predictive performance of the transferable deep features in predicting true patient-level lung cancer malignancy (Acc = 70.69 %, AUC = 0.66), which outperforms a nodule-level CNN model (Acc = 65.38 %, AUC = 0.63) and is even comparable to that obtained using the radiologists’ knowledge (Acc = 72.41 %, AUC = 0.76). The proposed model can largely reduce the demand for pathologically-proven data, holding promise to empower cancer diagnosis by leveraging multi-source CT imaging datasets.

1 Introduction

W. Shen and M. Zhou: These two authors contributed equally.
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 124–131, 2016. DOI: 10.1007/978-3-319-46723-8_15

Lung cancer is one of the leading causes of cancer death, with a dismal 5-year survival rate of 15–18 % [9]. Computed Tomography (CT) imaging of patients at varying stages has evolved rapidly over the past years. Therefore, developing image-based, data-driven models is of great clinical interest for identifying predictive imaging biomarkers from multiple CT imaging sources. Recently, Convolutional Neural Networks (CNNs) [3,11] have emerged as a powerful learning model that has gained increasing recognition for a variety of machine learning problems. Approaches have been proposed for improving computer-aided diagnosis with cascade CNN frameworks [6–8]. However, these studies are limited to building CNN models for a single diagnostic data source, without considering the relationship across various diagnostic CT data of the disease. To verify a learned CNN model, a related but different diagnostic dataset can serve as a good benchmark. Therefore, we ask two specific questions: Can CNN-based features generalize to other sets for image-based diagnosis? How do these features transfer across different types of diagnostic datasets?

We address these questions with an application in lung cancer malignancy prediction. More specifically, we define two malignancy-related sets: (1) DiscoverySet (source domain): CT imaging with abundant labels from radiologists’ assessments only; (2) DiagnosedSet (target domain): CT imaging with definite, follow-up diagnosis labels of lung cancer malignancy. It is reasonable to assume that radiologists’ knowledge in assessing risk factors is a helpful resource, but a quantitative comparison with definite diagnostic information is currently lacking. Bridging the disconnection between them would accelerate diagnostic knowledge sharing and help radiologists refine follow-up diagnosis for patients. The challenge, however, remains how to develop a transferable scheme that fuses the cross-domain knowledge, given the growing availability of CT imaging arrays.
To overcome this obstacle, we propose a new, integrated framework to learn transferable malignancy knowledge for patient-level lung cancer prediction. The proposed model, called CNN-MIL, is composed of a convolutional neural network (CNN) model and a multiple instance learning (MIL) model, which are trained on the DiscoverySet (2272 lung nodules) and the DiagnosedSet (115 patients), respectively. We achieve knowledge transfer by sharing the learned weights between the trained CNN and the instance networks (see Fig. 1). The proposed approach draws inspiration from a recent study [5] suggesting the feasibility of the CNN architecture in transfer learning. The difference between that work and ours is that our knowledge adaptation is achieved via the instance networks, where the nodule-to-patient relationship is defined, and that our target network is deeper than the source-domain network. The contributions of this paper can be summarized as follows: (i) We demonstrate that the knowledge defined by radiologists can be effectively learned by a CNN model and then transferred to the domain with definite diagnostic CT data. (ii) We present experimental evidence that knowledge adaptation can improve the accuracy of patient-level lung cancer prediction over a baseline model. (iii) The proposed CNN-MIL largely reduces the demand for pathologically-proven CT data by incorporating a referenced discovery set, holding promise to empower lung cancer diagnosis by leveraging multi-source CT imaging datasets.


[Figure 1 diagram. Upper branch, "Nodule-level CNN modeling" on the DiscoverySet: Conv1, Conv2, Conv3, FC4, Regression Output. Lower branch, "Multiple Instance Learning" and "Patient-level Malignancy Detection" on the DiagnosedSet: several instance networks linked to the upper branch by "Network weights sharing", with the block labels Conv1, Conv2, Conv3, PCA, FC4, FC5, MAX, and Regression Output.]

Fig. 1. Illustration of the proposed framework. Upper part: a nodule-level CNN model (CNNnodule ) is firstly trained on the DiscoverySet to extract the radiologist’s knowledge (Sect. 2.1). It has three convolutional layers (Conv1-3) and the output layer has one neuron that estimates the malignancy rating of the input nodule. The number of hidden neurons in FC4 is 32. The network weights (Conv1-3) learned from DiscoverySet will be directly applied into the DiagnosedSet. Lower part: Multiple Instance Learning (MIL) models the nodule-to-patient relationship towards patient-level cancer prediction on the DiagnosedSet. Notably, the dimension of Conv3 feature from the instance network is reduced to 32 via Principal Component Analysis. The number of hidden neurons in FC5 is 4. The output of the MIL model is the aggregated output of the instance network, estimating true lung cancer malignancy (Sect. 2.2).

2 Methods

2.1 Knowledge Extraction via the CNN Model

As seen in Fig. 1, we first build a nodule-level CNN model to learn the radiologists’ knowledge in estimating nodule malignancy likelihood from the source domain. The proposed CNN model is composed of three stacked convolutional layers, each followed by a Rectified Linear Unit and a max-pooling layer. The two fully-connected layers that follow (the FC4 layer and the regression layer) are used to estimate the malignancy rating of the nodules. These layers follow the standard CNN structure; see [10] for more details. The input of our CNN model is a raw nodule patch of size 64 × 64 × 64 voxels centered on the nodule. Each convolutional layer has 64 convolutional kernels of size 3 × 3. The pooling window size is 4 × 4 in the first max-pooling layer and 2 × 2 in the remaining layers. The loss function is the L2-norm loss between the predicted rating and the radiologists’ malignancy rating:

$$L = \frac{1}{N} \sum_{i=1}^{N} (R_i - P_i)^2, \qquad (1)$$

where N is the number of nodule patches in the DiscoverySet, and R_i and P_i are the ratings of the i-th nodule given by the radiologists and by our model, respectively. The loss function of Eq. 1 is minimized via stochastic gradient descent. Once training is done, the knowledge of nodule malignancy estimation is encoded in the weights retained in the trained CNN model. Next, the weights of the three convolutional layers are shared with the instance networks for knowledge transfer. The weights of the fully-connected layers are not transferred, as higher layers appear to be more domain-biased and hence less transferable [4]. Having learned a feature representation for nodules, we detail the patient-level prediction via a MIL model next.

2.2 The MIL Model for Patient-Level Lung Cancer Prediction

We formulate the patient-level cancer prediction as a MIL task, as shown in Fig. 1. The input to the nodule-level CNN is a nodule patch, while the input to the MIL network is all the nodule features within a patient case. MIL builds on the concept of bags and instances: the label of a bag is positive if it contains at least one positive instance, and negative if and only if all of its instances are negative. Thus, in the scenario of two-category malignancy prediction, we define each patient as a bag and each nodule as an instance. Given a patient (O_i), if all of his/her nodules are non-malignant, the patient is non-malignant; if at least one nodule is malignant, the patient is malignant. Let the patient-level malignancy predictions of m patients be O = {O_i | i = 1, 2, 3, ..., m} and the patients’ malignancy labels be t = {t_i | i = 1, 2, 3, ..., m}, with t_i ∈ [0, 1]. Given the i-th patient with n nodules (1 ≤ n ≤ 15 in this study), we denote the output for the j-th nodule by o_{i,j} (i = 1, 2, 3, ..., m; j = 1, 2, 3, ..., n). The final output of the regression layer (lower part, Fig. 1) determines the i-th patient’s malignancy by aggregating the nodule instance outputs o_{i,j}:

$$O_i = \max(o_{i,1}, o_{i,2}, o_{i,3}, \ldots, o_{i,n}), \quad n \in [1, 15]. \qquad (2)$$

The loss function is again the L2-norm loss, here between the prediction O_i and the diagnostic label t_i. As discussed, the weights of the convolutional layers are shared between the instance networks and the nodule-level CNN network. The weights of the fully-connected layers (FC4 and FC5) are further learned on the DiagnosedSet, as in [5]. Next, we report experimental results on the DiagnosedSet for patient-level lung cancer prediction.
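The max aggregation of Eq. 2 and the squared-error loss of Eq. 1 can be illustrated with a few lines of plain Python. This is a minimal sketch and not the authors' implementation: the helper names and the toy instance outputs are hypothetical, and averaging the patient-level loss over patients is an assumption carried over from the form of Eq. 1.

```python
def l2_loss(targets, predictions):
    """Mean squared error in the form of Eq. 1 (averaging is assumed for the MIL loss too)."""
    assert len(targets) == len(predictions) and targets
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def patient_output(nodule_outputs):
    """Eq. 2: a patient (bag) takes the maximum over its nodule (instance) outputs."""
    if not 1 <= len(nodule_outputs) <= 15:
        raise ValueError("the study considers 1 to 15 nodules per patient")
    return max(nodule_outputs)


# Hypothetical instance-network outputs o_{i,j} for three patients (made-up values).
bags = [[0.1, 0.2], [0.3, 0.9, 0.4], [0.6]]
labels = [0, 1, 1]  # hypothetical diagnostic labels t_i

outputs = [patient_output(bag) for bag in bags]  # one score per patient
loss = l2_loss(labels, outputs)                  # (0.04 + 0.01 + 0.16) / 3, about 0.07
```

Because only the maximum contributes to O_i, the gradient of this loss reaches a single nodule per patient, namely the one currently judged most malignant.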

3 Experiments and Results

Dataset: We use the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset [2]. Nodule samples (>3 mm) are included in either the DiscoverySet or the DiagnosedSet based on the absence or presence of a definite diagnosis. In the DiscoverySet, the malignancy likelihood of each nodule was rated by four experienced thoracic radiologists on an increasing scale (i.e., R_rad ∈ [1,5]). The average of the four radiologists’ ratings was taken as the final rating of each nodule, as in [7]. Overall, 2272 nodules were included. We further split the DiscoverySet into a training set containing 80 % of the samples (1817 nodules) and a validation set containing 20 % of the samples (455 nodules) to observe the CNN model performance. The DiagnosedSet contains 115 cases with true pathologically-proven diagnostic labels: 30 non-malignant cases and 85 malignant cases (40 primary cancer cases and 45 malignant, metastatic cancer cases).

Model Configuration: For the nodule-level CNN model, the learning rate was 0.0001 and the number of training epochs (one epoch means that each sample has been seen once in the training phase) was 50. For the MIL model, the learning rate was 0.001. To evaluate the performance of the MIL model under different settings, we investigated different numbers of hidden neurons n_h (n_h ∈ {4, 8, 16}) in the FC5 layer and different numbers of MIL training epochs n_e (n_e ∈ {5, 10}) in Sect. 3.2, while the default values were n_h = 4 and n_e = 10 in Sect. 3.3. We report average results of the MIL model from ten repetitions of five-fold cross-validation. During each round of cross-validation, there were 92 cases (24 non-malignant and 68 malignant cases) in the training set and 23 cases (6 non-malignant and 17 malignant cases) in the test set. Since the number of non-malignant cases was much smaller than that of malignant cases in the training set, we fed the non-malignant cases to our MIL model multiple times to obtain an approximately balanced dataset.

3.1 Knowledge Extraction

Let P_cnn ∈ [1,5] be the output of the nodule-level CNN model (CNNnodule) and R_rad ∈ [1,5] the radiologists’ rating on the DiscoverySet. To verify that the radiologists’ knowledge of malignancy is properly extracted, Fig. 2 shows the distribution of the estimation error, defined as E = |P_cnn − R_rad| ∈ [0,4]. We observed that errors E ∈ [0,1] already accounted for 90.99 % of the nodules in the validation set of the DiscoverySet, revealing that the outputs of the CNN model approximate the radiologists’ ratings. Having established that the radiologists’ knowledge of malignancy is well preserved by the nodule-level CNN model, we next report its results on patient-level cancer prediction on the DiagnosedSet.

[Figure 2 plot, titled "The distribution of the error". Horizontal axis: rating estimation error, ticks from 0.0 to 2.0. Vertical axis ticks from 0.0 to 0.2.]

Fig. 2. The estimation error (E) distribution of CNNnodule on the DiscoverySet.
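The check reported above, the share of validation nodules whose estimation error falls in [0,1], amounts to the following computation. The ratings below are made-up placeholders and the function names are hypothetical; only the definition E = |P_cnn − R_rad| comes from the text.

```python
def estimation_errors(predicted, rated):
    """E = |P_cnn - R_rad| per nodule; both ratings lie in [1, 5], so E lies in [0, 4]."""
    return [abs(p - r) for p, r in zip(predicted, rated)]


def fraction_within(errors, bound=1.0):
    """Share of nodules whose estimation error is at most `bound`."""
    return sum(e <= bound for e in errors) / len(errors)


# Hypothetical CNN outputs and averaged radiologist ratings for five nodules.
p_cnn = [1.5, 3.0, 4.5, 2.0, 4.0]
r_rad = [1.0, 3.5, 2.5, 2.0, 4.75]

errors = estimation_errors(p_cnn, r_rad)  # [0.5, 0.5, 2.0, 0.0, 0.75]
share = fraction_within(errors)           # 4 of 5 nodules, i.e. 0.8
```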

3.2 Patient-Level Lung Cancer Malignancy Prediction

We show the performance of our CNN-MIL model on patient-level malignancy prediction for different values of n_h and n_e. Prediction accuracy (i.e., the ratio of the number of correctly classified patients to the total number of patients in the DiagnosedSet) and the area under the curve (AUC) score were used to measure model performance. As shown in Table 1, the performance of our model was insensitive to n_h and n_e. A possible explanation is that the shared weights preserve much of the nodule information, so that discriminative features are fed into the final fully-connected layers. We next compare the performance of the proposed approach with competing methods.

Table 1. Mean value and standard deviation of prediction accuracy and AUC score (in parentheses) of the CNN-MIL model with different n_h and n_e.

          n_h = 4                     n_h = 8                     n_h = 16
n_e = 5   68.80±3.12 % (0.65±0.03)    70.56±2.25 % (0.64±0.02)    68.12±1.97 % (0.62±0.02)
n_e = 10  70.69±2.34 % (0.66±0.03)    68.99±1.90 % (0.63±0.02)    68.98±2.10 % (0.62±0.03)
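The means and standard deviations in Table 1 come from ten repetitions of five-fold cross-validation with the non-malignant training cases replicated. The sketch below reproduces only the bookkeeping of that protocol, under stated assumptions: the stratified split, the random seeds, and the replication factor (the rounded class ratio) are illustrative choices, since the text does not specify them, and the model training itself is left as a comment.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split case indices into k folds that each keep the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def oversample_minority(train_indices, labels, minority=0):
    """Repeat minority-class cases until the two classes are roughly balanced."""
    minority_cases = [i for i in train_indices if labels[i] == minority]
    majority_cases = [i for i in train_indices if labels[i] != minority]
    repeats = max(1, round(len(majority_cases) / len(minority_cases)))  # assumed factor
    return majority_cases + minority_cases * repeats


# 115 cases as in the DiagnosedSet: 30 non-malignant (0) and 85 malignant (1).
labels = [0] * 30 + [1] * 85

for repetition in range(10):                    # ten repetitions
    folds = stratified_folds(labels, k=5, seed=repetition)
    for test_fold in folds:                     # five folds per repetition
        held_out = set(test_fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        balanced_train = oversample_minority(train, labels)
        # A model would be trained on `balanced_train` and scored on `test_fold` here.
```

With these counts every test fold holds 6 non-malignant and 17 malignant cases, matching the 92/23 split described in the model configuration; tripling the 24 non-malignant training cases gives 72 against 68 malignant ones.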

3.3 Methods Comparison

We chose the nodule-level CNNnodule as a baseline model and the radiologists’ ratings (RR) as a reference model. For CNNnodule and RR, all nodule malignancy likelihoods within a patient were combined according to Eq. 2 to give the patient-level malignancy score. We also implemented an MI-SVM model [1] and a deep MIL model without knowledge transfer (DMIL) [11]. The features fed to the MI-SVM were also the 32-dimensional CNN features generated from Conv3 (Fig. 1), and the kernel function was the radial basis function. The best parameters for MI-SVM were obtained via grid search, and the parameter settings of DMIL were identical to those of our CNN-MIL, except that DMIL had no PCA operation. As seen in Table 2, with efficient knowledge transfer, our CNN-MIL outperformed both DMIL and MI-SVM. Compared with CNNnodule, our CNN-MIL model, which integrates transferable features through shared network weights, achieved a performance boost. Surprisingly, the accuracy of our CNN-MIL model was only marginally lower than that obtained using the radiologists’ ratings (70.69 % vs. 72.41 %), although the AUC gap was larger (0.66 vs. 0.76). This demonstrates the effectiveness of the proposed method in transferring human knowledge to the prediction of unknown samples. On the other hand, although the radiologists’ ratings (i.e., RR) may affect our model learning because of potentially mislabelled samples, we demonstrate that experts’ knowledge, built upon consensus agreement from multiple radiologists, can be captured by our CNN-MIL model to further estimate true nodule malignancies in lung cancer.

Table 2. Average prediction accuracy and AUC score of patient-level cancer prediction using different models.

Model        Accuracy   AUC
DMIL [11]    59.40 %    0.56
MI-SVM [1]   61.93 %    0.55
CNNnodule    65.38 %    0.63
CNN-MIL      70.69 %    0.66
RR           72.41 %    0.76

As shown in Fig. 3, the predictions of our CNN-MIL model and CNNnodule for two patients illustrate that transfer learning on the DiagnosedSet allows the instance networks to be optimized for improved patient-level cancer prediction, correcting errors made by the nodule-level CNN model. Using p = 0.5 as the division point (p ≥ 0.5 as malignant), CNN-MIL corrected erroneous predictions (red boxes) from CNNnodule on both patients. The success of our model can be attributed to the ability of the CNN to learn rich mid-level image representations (e.g., features derived from the Conv3 layer) that have been shown to be transferable to related visual recognition tasks [3,5].

Fig. 3. Illustration of our CNN-MIL for patient-level cancer prediction with a non-malignant patient and a malignant patient. Our CNN-MIL model can make more accurate patient-level predictions than CNNnodule (ratings rescaled to [0,1]) by reassigning the nodule malignancy probability (red boxes) from the nodule-level CNN model.

Overall, the purpose of this study is not to pursue precise malignancy classification on a single diagnostic CT set; rather, we seek to infer data-driven knowledge across different sets (with different diagnostic labels), which holds promise to reduce the pressing demand for truly diagnosed, labelled data that typically require invasive biopsy and prolonged monitoring of cancer progression. We developed the domain transfer model based on the fact that the DiscoverySet (with radiologist ratings) is relatively easy to obtain at an early stage of diagnosis thanks to ubiquitous CT screening (2272 nodules here). Meanwhile, it is not surprising that the DiagnosedSet (with definitive clinical labels) is much more difficult to scale, owing to the invasive biopsy testing and surgery required for pathological verification in a controlled patient population (115 cases here).
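The two metrics used in the comparison, accuracy at a 0.5 division point and AUC, can be computed from patient-level scores as sketched below. This is an illustrative stand-in for whatever evaluation code was actually used: the function names and the four toy patients are hypothetical, and the AUC is computed with the pairwise (rank-based) definition of the area under the ROC curve.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of patients classified correctly; score >= threshold is called malignant."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical patient-level malignancy scores in [0, 1] and diagnostic labels.
scores = [0.4, 0.7, 0.6, 0.9]
labels = [0, 1, 0, 1]

acc = accuracy(scores, labels)  # third patient is misclassified -> 0.75
area = auc(scores, labels)      # every malignant case outranks every non-malignant one -> 1.0
```

The toy case also shows why the two metrics can disagree, as they do between CNN-MIL and RR: accuracy depends on the chosen division point, while AUC depends only on the ranking of the scores.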

4 Conclusion

Multi-source data integration in medical imaging is a rising topic with growing volumes of imaging data. Developing causal inference among different sets would allow a better understanding of imaging set-to-set relationships in computer-aided diagnosis, thus enabling alternative biomarkers for improved cancer diagnosis. In this paper, we demonstrate that the transfer learning model is able to learn transferable deep features for lung cancer malignancy prediction. The empirical evidence supports the feasibility of using a data-driven CNN to leverage multi-source CT data. In the future, we plan to expand to large-scale, multimodel image sets to improve predictive diagnostic performance.

Acknowledgement. This paper is supported by the CAS Key Deployment Program under Grant No. KGZD-EW-T03, the National NSFC funds under Grant No. 81227901, 81527805, 61231004, 81370035, 81230030, 61301002, 61302025, 81301346, 81501616, the Beijing NSF under Grant No. 4132080, the Fundamental Research Funds under Grant No. 2016JBM018, and the CAS Scientific Research and Equipment Development Project under Grant No. YZ201457.

References

1. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Advances in Neural Information Processing Systems, pp. 561–568 (2002)
2. Armato, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., et al.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011)
3. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
4. Long, M., Wang, J.: Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791 (2015)
5. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724 (2014)
6. Roth, H.R., Yao, J., Lu, L., Stieger, J., Burns, J.E., Summers, R.M.: Detection of sclerotic spine metastases via random aggregation of deep convolutional neural network classifications. In: Yao, J., Glocker, B., Klinder, T., Li, S. (eds.) Recent Advances in Computational Methods and Clinical Applications for Spine Imaging, vol. 20, pp. 3–12. Springer, Heidelberg (2015)
7. Shen, W., Zhou, M., Yang, F., Yang, C., Tian, J.: Multi-scale convolutional neural networks for lung nodule classification. In: Ourselin, S., Alexander, D.C., Westin, C.-F., Cardoso, M.J. (eds.) IPMI 2015. LNCS, vol. 9123, pp. 588–599. Springer, Heidelberg (2015). doi:10.1007/978-3-319-19992-4_46
8. Shen, W., Zhou, M., Yang, F., Yu, D., Dong, D., Yang, C., Zang, Y., Tian, J.: Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognition (2016)
9. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer Statistics, 2015. CA Cancer J. Clin. 65(1), 5–29 (2015)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
11. Wu, J., Yinan, Y., Huang, C., Kai, Y.: Deep multiple instance learning for image classification and auto-annotation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3460–3469. IEEE (2015)

DeepVessel: Retinal Vessel Segmentation via Deep Learning and Conditional Random Field

Huazhu Fu1(B), Yanwu Xu1, Stephen Lin2, Damon Wing Kee Wong1, and Jiang Liu1,3

1 Institute for Infocomm Research, A*STAR, Singapore, Singapore [email protected]
2 Microsoft Research, Beijing, China
3 Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, China

Abstract. Retinal vessel segmentation is a fundamental step for various ocular imaging applications. In this paper, we formulate the retinal vessel segmentation problem as a boundary detection task and solve it using a novel deep learning architecture. Our method is based on two key ideas: (1) applying a multi-scale and multi-level Convolutional Neural Network (CNN) with a side-output layer to learn a rich hierarchical representation, and (2) utilizing a Conditional Random Field (CRF) to model the long-range interactions between pixels. We combine the CNN and CRF layers into an integrated deep network called DeepVessel. Our experiments show that the DeepVessel system achieves state-of-the-art retinal vessel segmentation performance on the DRIVE, STARE, and CHASE_DB1 datasets with an efficient running time.

1 Introduction

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 132–139, 2016. DOI: 10.1007/978-3-319-46723-8_16

Retinal vessels are of much diagnostic significance, as they are commonly examined to evaluate and monitor various ophthalmological diseases. However, manual segmentation of retinal vessels is both tedious and time-consuming. To assist with this task, many approaches have been introduced in the last two decades to segment retinal vessels automatically. For example, Marin et al. employed the gray-level vector and moment invariant features to classify each pixel using a neural network [8]. Nguyen et al. utilized a multi-scale line detection scheme to compute vessel segmentation [11]. Orlando et al. performed vessel segmentation using a fully-connected Conditional Random Field (CRF) whose configuration is learned using a structured-output support vector machine [12]. Existing methods such as these, however, lack sufficiently discriminative representations and are easily affected by pathological regions, as shown in Fig. 1.

Fig. 1. Retinal vessel segmentation results. Existing vessel segmentation methods (e.g., Nguyen et al. [11], and Orlando et al. [12]) are affected by the optic disc and pathological regions (highlighted by red arrows), while our DeepVessel deals well with these regions.

Deep learning (DL) has recently been demonstrated to yield highly discriminative representations that have aided many computer vision tasks. For example, Convolutional Neural Networks (CNNs) have brought heightened performance in image classification and semantic image segmentation. Xie et al. employed a holistically-nested edge detection (HED) system with deep supervision to resolve the challenging ambiguity in object boundary detection [16]. Zheng et al. reformulated the Conditional Random Field (CRF) as a Recurrent Neural Network (RNN) to improve semantic image segmentation [18]. These works inspire us to learn a rich hierarchical representation based on a DL architecture. A DL-based vessel segmentation method was proposed in [9], which addressed the problem as pixel classification using a deep neural network. In [7], Li et al. employed a cross-modality data transformation from retinal image to vessel map, and output the label map of all pixels for a given image patch. These methods have two drawbacks: first, they do not account for non-local correlations when classifying individual pixels/patches, which leads to failures caused by noise and local pathological regions; second, the classification strategy is computationally intensive in both the training and testing phases. In this paper, we address retinal vessel segmentation as a boundary detection task that is solved using a novel DL system called DeepVessel, which utilizes a CNN with a side-output layer to learn discriminative representations, and also a CRF layer that accounts for non-local pixel correlations. With this approach, our DeepVessel system achieves state-of-the-art performance on publicly-available datasets (DRIVE, STARE, and CHASE_DB1) with relatively efficient processing.

2 Proposed Method

Our DeepVessel architecture consists of three main layers. The first is a convolutional layer used to learn a multi-scale discriminative representation. The second is a side-output layer that operates with the early layers to generate a companion local output. The last one is a CRF layer, which is employed to further take into account the non-local pixel correlations. The overall architecture of our DeepVessel system is illustrated in Fig. 2.

Fig. 2. Architecture of our DeepVessel system, which consists of convolutional, side-output, and CRF layers. The front network is a four-stage HED-like architecture [16], where the side-output layer is inserted after the last convolutional layers in each stage (marked in bold). The convolutional layer parameters are denoted as “Conv-”. The CRF layer is represented as an RNN as done in [18]. The ReLU activation function is not shown for brevity. The red blocks exist only in the training phase.

Convolutional Layer is used to learn local feature representations based on patches randomly sampled from the image. Suppose $L_j^{(n)}$ is the j-th output map of the n-th layer, and $L_i^{(n-1)}$ is the i-th input map of the n-th layer. The output of the convolutional layer is then defined as:

$$L_j^{(n)} = f\Big(\sum_i L_i^{(n-1)} * W_{ij}^{(n)} + b_j^{(n)} \mathbf{1}\Big), \qquad (1)$$

where $W_{ij}^{(n)}$ is the kernel linking the i-th input map to the j-th output map, $*$ denotes the convolution operator, and $b_j^{(n)}$ is the bias element.

Side-output Layer acts as a classifier that produces a companion local output for early layers [6]. Suppose W denotes the parameters of all the convolutional layers, and there are M side-output layers in the network, where the corresponding weights are denoted as $w = (w^{(1)}, \ldots, w^{(M)})$. The objective function of the side-output layer is given as:

$$L_s(W, w) = \sum_{m=1}^{M} \alpha_m L_s^{(m)}(W, w^{(m)}), \qquad (2)$$

where $\alpha_m$ is the loss function fusion-weight for each side-output layer, and $L_s^{(m)}$ denotes the image-level loss function, which is computed over all pixels in the training retinal image X and its vessel ground truth Y. For the retinal image, the pixels of the vessel and background are imbalanced, thus we follow HED [16] to utilize a class-balanced cross-entropy loss function:

$$L_s^{(m)}(W, w^{(m)}) = -\frac{|Y^+|}{|Y|} \sum_{j \in Y^+} \log \sigma(a_j^{(m)}) - \frac{|Y^-|}{|Y|} \sum_{j \in Y^-} \log\big(1 - \sigma(a_j^{(m)})\big), \qquad (3)$$

where $|Y^+|$ and $|Y^-|$ denote the vessel and background pixels in the ground truth Y, and $\sigma(a_j^{(m)})$ is the sigmoid function on pixel j of the activation map $A_s^{(m)} \equiv \{a_j^{(m)}, j = 1, \ldots, |Y|\}$ in side-output layer m. Simultaneously, we can obtain the vessel prediction map of each side-output layer m by $\hat{Y}_s^{(m)} = \sigma(A_s^{(m)})$.

Conditional Random Field (CRF) Layer is used to model non-local pixel correlations. Although the CNN can produce a satisfactory vessel probability map, it still has some problems. First, a traditional CNN has convolutional filters with large receptive fields and hence produces maps too coarse for pixel-level vessel segmentation (e.g., non-sharp boundaries and blob-like shapes). Second, a CNN lacks smoothness constraints, which may result in small spurious regions in the segmentation output. Thus, we utilize a CRF layer to obtain the final vessel segmentation result. Following the fully-connected CRF model of [5], every node is a neighbor of every other node, so the model takes into account long-range interactions in the whole image. We denote $v = \{v_i\}$ as a labeling over all pixels of the image, with $v_i = 1$ for vessel and $v_i = 0$ for background. The energy of a label assignment v is given by:

$$E(v) = \sum_i \psi_u(v_i) + \sum_{i<j} \psi_p(v_i, v_j), \qquad (4)$$

Radiologist BA DCNN-FI BA DCNN-FI CA

100.0

0.0

100.0 0.0

98.7 28.6 98.7 28.6 98.7 3.6

71.4 1.3 71.4 1.3 96.4 1.3

volumes of each bone are slightly rotated and translated around the estimated anatomical landmarks defining the bone. To reduce the number of parameters optimized by the DCNN, only the part of the bone volume that contains the epiphyseal plate is used and all bone images are resized to 40 × 40 × 40 voxels. Implemented in the Caffe framework [4], our DCNN was optimized with stochastic gradient descent with a maximal number of iterations of $10^{4}$, momentum 0.9 and learning rate $10^{-4}$. For estimation of BA and classification to discriminate between minors and adults, we experimented with training our DCNN on both BA and CA. The results from training the DCNN on original intensity images (II) and on the filtered images (FI) as explained in Sect. 2.1 are compared. As a baseline method we use the age estimation based on random regression forests (RRF) as proposed in [13], with the only difference that in this work we used an increased number of training samples, the same as for the DCNN.

Results: Using either intensity (II) or filtered (FI) volumes for training, the results of BA estimation when training on BA and on CA are given in Table 1. Detailed results for each biological age group of the best performing DCNN-FI compared with RRF-FI are presented in the box-and-whisker plot in Fig. 2c. The contingency table of classifying subjects as being minor or adult is given in Table 2. All results are compared with the RRF age estimation method of [13].
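As a rough illustration of the optimizer settings quoted above (momentum 0.9, learning rate 10^-4, at most 10^4 iterations), the following sketch applies the classical momentum update to a toy problem. It is not the authors' Caffe configuration: the function name `sgd_momentum_step` and the quadratic toy objective are assumptions introduced here purely for illustration.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=1e-4, momentum=0.9):
    """One classical momentum update: v <- momentum * v - lr * grad, w <- w + v."""
    v_new = momentum * v - lr * grad
    return w + v_new, v_new

# Hypothetical toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(10_000):          # maximal number of iterations from the text
    w, v = sgd_momentum_step(w, v, grad=w)

print("distance from optimum after training:", np.linalg.norm(w))
```

With momentum 0.9 the accumulated step is roughly ten times the raw learning rate, which is why such a small learning rate can still make progress within the iteration budget on this toy problem.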

4

Discussion and Conclusion

Inspired by TW2 [10], which is considered the most accurate radiological hand bone age estimation method due to its fusion of independent per-bone estimates, we have designed our novel automatic age estimation DCNN using an architecture mimicking the TW2 method. Limited by dataset size, our design choice to use the LeNet architecture [6] as a building block for our method was motivated by keeping the number of DCNN weights as low as possible. To further reduce model complexity, we also experimented with sharing network weights between SE blocks of bones that undergo the same physical maturation process, an idea borrowed from the TW2 staging scheme. However, as shown in Table 1, we experienced no performance gains, which might be due to our limited number of training images but could also indicate subtly different physical maturation processes in bones where the TW2 staging system assigns the same score. Compared to the selection of hand-crafted features in RFs, DCNNs internally learn to generate the features relevant for age estimation. This comes at the cost of requiring a larger amount of training data; therefore, we obtain additional training data by augmentation with synthetic transformations. We found that using the pre-processed filtered images (FI) as input to our DCNN yields a higher estimation accuracy than using raw intensity images. Thus, in accordance with [13], by suppressing image intensity variations and enhancing the appearance of the ossifying epiphyseal plate relative to the surrounding anatomical structures, it was possible to simplify the learning task for the DCNN.

For discussing the results presented in Tables 1 and 2, it is important to understand that “true” BA, which we want to estimate, is the average stage of physical development for individuals of the same CA. Therefore, the estimation of “true” BA would require a large dataset of subjects with given CA that statistically represents biological variation. Since our limited dataset can only partly cover biological variation in the target age range, we use BA as estimated by a radiologist as ground truth for training and testing, although the deviation from “true” BA that is introduced by the radiologist [9] can never be corrected by an algorithm. Moreover, the reported inter-observer variation for radiographic images varies in the range of 0.5 to 2 years, depending on the age, sex and origin of the examined population [9].

In clinical medicine applications, when biological age is required, training our DCNN-FI method on BA estimated by radiologists shows higher accuracy (0.36 ± 0.30 y) compared to training on CA (0.56 ± 0.44 y) on our dataset. The higher error can be explained by biological variation in the training dataset using CA. As shown in Table 1, our best performing DCNN-FI method outperforms previous work when estimating BA. When interpreting the detailed results in Fig. 2c, it has to be noted that the improvement of RRF-FI upon our previous work [13] is due to a larger, synthetically enhanced training dataset and the method being trained and evaluated on BA. Depending on the population used, results of the prominent automatic BoneXpert age estimation method [11] were reported between 0.65 and 0.72 years when compared to radiologists' GP ratings for X-ray images of boys, but further comparison to our method has to be taken with care due to the differences in datasets.

In legal medicine applications, recent migration tendencies lead to challenges when asylum seekers without identification documents have to be classified according to whether they have reached majority age. As can be seen in Table 2, our DCNN-FI classifier trained on BA is able to perfectly discriminate between subjects below and above 18 years of BA in our dataset. Nevertheless, such a perfectly discriminating classifier trained on BA makes a larger error by classifying 28.6 % of minors as adults, the same error that radiologists make when approximating BA with CA using the GP method. Thus, better discrimination cannot be achieved by a classifier when using BA defined by radiologists for training. We further retrained our classifier using CA for training and achieved significantly better discrimination of legal majority age, misclassifying 3.6 % of minors as adults. This observed behavior is in line with literature showing that BA estimated with the GP method has the tendency to underestimate CA [9] due to advanced physical maturation in today's population, while GP is based on radiographs that were acquired in the 1930s. In conclusion, our proposed DCNN method has proven to be the best automatic method for BA estimation from 3D MR images, although it has to be used carefully in legal medicine applications due to the unavoidable misclassification when discriminating minors from adults, which is caused by biological variation.
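The minor/adult discrimination discussed above boils down to thresholding an estimated age at 18 years and counting how often the decision disagrees with the reference age. The sketch below shows that bookkeeping under stated assumptions: the function name `majority_error_rates`, the convention that exactly 18 counts as adult, and the example ages are all hypothetical and are not taken from the paper's data.

```python
import numpy as np

def majority_error_rates(reference_age, estimated_age, threshold=18.0):
    """Return (fraction of minors classified as adults,
               fraction of adults classified as minors).

    Assumes both groups are non-empty; ages are in years.
    """
    ref = np.asarray(reference_age, dtype=float)
    est = np.asarray(estimated_age, dtype=float)
    minors = ref < threshold
    adults = ~minors
    minors_as_adults = np.mean(est[minors] >= threshold)
    adults_as_minors = np.mean(est[adults] < threshold)
    return minors_as_adults, adults_as_minors

# Hypothetical example: reference ages and the ages estimated by some method.
reference = [16.0, 17.0, 17.5, 19.0, 20.0]
estimated = [15.0, 18.5, 17.0, 19.5, 17.9]
m2a, a2m = majority_error_rates(reference, estimated)
print(f"minors classified as adults: {m2a:.1%}, adults classified as minors: {a2m:.1%}")
```

Choosing CA or BA as `reference_age` changes which error is being measured, which is the distinction the paragraph above draws between training on BA and training on CA.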

References

1. Ebner, T., Štern, D., Donner, R., Bischof, H., Urschler, M.: Towards automatic bone age estimation from MRI: localization of 3D anatomical landmarks. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 421–428. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10470-6_53
2. Frangi, A.F., Niessen, W.J., Vincken, K.L., Viergever, M.A.: Multiscale vessel enhancement filtering. In: Wells, W.M., Colchester, A., Delp, S. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 130–137. Springer, Heidelberg (1998). doi:10.1007/BFb0056195
3. Greulich, W.W., Pyle, S.I.: Radiographic Atlas of Skeletal Development of the Hand and Wrist, 2nd edn. Stanford University Press, Stanford (1959)
4. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia (MM 2014), pp. 675–678 (2014)
5. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
6. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient based learning applied to document recognition. Proc. IEEE 86(11), 2278–2323 (1998)
7. Lee, S.C., Shim, J.S., Seo, S.W., Lim, K.S., Ko, K.R.: The accuracy of current methods in determining the timing of epiphysiodesis. Bone Jt. J. 95-B(7), 993–1000 (2013)
8. Lindner, C., Bromiley, P.A., Ionita, M.C., Cootes, T.F.: Robust and accurate shape model matching using random forest regression-voting. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1862–1874 (2015)
9. Ritz-Timme, S., Cattaneo, C., Collins, M.J., Waite, E.R., Schuetz, H.W., Kaatsch, H.J., Borrman, H.I.: Age estimation: the state of the art in relation to the specific demands of forensic practise. Int. J. Legal Med. 113(3), 129–136 (2000)
10. Tanner, J.M., Whitehouse, R.H., Cameron, N., Marshall, W.A., Healy, M.J.R., Goldstein, H.: Assessment of Skeletal Maturity and Prediction of Adult Height (TW2 Method), 2nd edn. Academic Press, Oxford (1983)
11. Thodberg, H.H., Kreiborg, S., Juul, A., Pedersen, K.D.: The BoneXpert method for automated determination of skeletal maturity. IEEE Trans. Med. Imaging 28(1), 52–66 (2009)
12. Štern, D., Ebner, T., Bischof, H., Grassegger, S., Ehammer, T., Urschler, M.: Fully automatic bone age estimation from left hand MR images. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 220–227. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10470-6_28
13. Štern, D., Urschler, M.: From individual hand bone age estimates to fully automated age estimation via learning-based information fusion. In: ISBI (2016)

Real-Time Standard Scan Plane Detection and Localisation in Fetal Ultrasound Using Fully Convolutional Neural Networks

Christian F. Baumgartner1(B), Konstantinos Kamnitsas1, Jacqueline Matthew2,3, Sandra Smith3, Bernhard Kainz1, and Daniel Rueckert1

1 Biomedical Image Analysis Group, Imperial College London, London, UK [email protected]
2 Biomedical Research Centre, Guy’s and St Thomas’ NHS Foundation, London, UK
3 Division of Imaging Sciences and Biomedical Engineering, King’s College London, London, UK

Abstract. Fetal mid-pregnancy scans are typically carried out according to fixed protocols. Accurate detection of abnormalities and correct biometric measurements hinge on the correct acquisition of clearly defined standard scan planes. Locating these standard planes requires a high level of expertise. However, there is a worldwide shortage of expert sonographers. In this paper, we consider a fully automated system based on convolutional neural networks which can detect twelve standard scan planes as defined by the UK fetal abnormality screening programme. The network design allows real-time inference and can be naturally extended to provide an approximate localisation of the fetal anatomy in the image. Such a framework can be used to automate or assist with scan plane selection, or for the retrospective retrieval of scan planes from recorded videos. The method is evaluated on a large database of 1003 volunteer mid-pregnancy scans. We show that standard planes acquired in a clinical scenario are robustly detected with a precision and recall of 69 % and 80 %, which is superior to the current state-of-the-art. Furthermore, we show that it can retrospectively retrieve correct scan planes with an accuracy of 71 % for cardiac views and 81 % for non-cardiac views.

1

Introduction

Abnormal fetal development is a leading cause of perinatal mortality in both industrialised and developing countries [11]. Although many countries have introduced fetal screening programmes based on mid-pregnancy ultrasound (US) scans at around 20 weeks of gestational age, detection rates remain relatively low. For example, it is estimated that in the UK approximately 26 % of fetal anomalies are not detected during pregnancy [4]. Detection rates have also been reported to vary considerably across different institutions [1], which suggests that, at least in part, differences in training may be responsible for this variability. Moreover, according to the WHO, it is likely that worldwide many US scans are carried out by individuals with little or no formal training [11].

© Springer International Publishing AG 2016 S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 203–211, 2016. DOI: 10.1007/978-3-319-46723-8_24

C.F. Baumgartner et al.

Biometric measurements and identification of abnormalities are performed on a number of standardised 2D US view planes acquired at different locations in the fetal body. In the UK, guidelines for selecting these planes are defined in [7]. Standard scan planes are often hard to localise even for experienced sonographers and have been shown to suffer from low reproducibility and large operator bias [4]. Thus, a system automating or aiding with this step could have significant clinical impact, particularly in geographic regions where few highly skilled sonographers are available. It is also an essential step for further processing such as automated measurements or automated detection of anomalies.

[Fig. 1 image. Column groups: Detection; Unsupervised Localisation. Example predictions: Abdominal View (98 %), Lips View (96 %). Panels: (a) Input frame, (b) Prediction, (c) Category-specific feature map, (d) Localised saliency map, (e) Approximate localisation.]

Fig. 1. Overview of the proposed framework for two standard view examples. Given a video frame (a) the trained convolutional neural network provides a prediction and confidence value (b). By design, each classifier output has a corresponding low-resolution feature map (c). Back-propagating the error from the most active feature neurons results in a saliency map (d). A bounding box can be derived using thresholding (e).

Contributions: We propose a real-time system which can automatically detect 12 commonly acquired standard scan planes in clinical free-hand 2D US data. We demonstrate the detection framework for (1) real-time annotation of US data to assist sonographers, and (2) the retrospective retrieval of standard scan planes from recordings of the full examination. The method employs a fully convolutional neural network (CNN) architecture which allows robust scan plane detection at more than 100 frames per second. Furthermore, we extend this architecture to obtain saliency maps highlighting the part of the image that provides the highest contribution to a prediction (see Fig. 1). Such saliency maps provide a localisation of the respective fetal anatomy and can be used as a starting point for further automatic processing. This localisation step is unsupervised and does not require ground-truth bounding box annotations during training.

Related Work: Standard scan plane classification of 7 planes was proposed for a large fetal image database [13]. This differs significantly from our work since in that scenario it is already known that every image is in fact a standard plane, whilst in video data the majority of frames do not show standard planes.

Real-Time Standard Plane Detection and Localisation in Fetal Ultrasound

A number of papers have proposed methods to detect fetal anatomy in videos of fetal 2D US sweeps (e.g. [6]). In those works the authors were aiming at detecting the presence of fetal structures such as the skull, heart or abdomen rather than specific standardised scan planes. Automated fetal standard scan plane detection has been demonstrated for 1–3 standard planes in 2D fetal US sweeps [2,3,8]. Notably, [2,3] also employed CNNs. US sweeps are acquired by moving the US probe from the cervix upwards in one continuous motion [3]. However, not all standard views required to determine the fetus’ health status are adequately visualised using a sweep protocol. For example, visualising the femur or the lips normally requires careful manual scan plane selection. Furthermore, data obtained using the sweep protocol are typically only 2–5 s long and consist of fewer than 50 frames [3]. To the best of our knowledge, fetal standard scan plane detection has never been performed on true free-hand US data, which typically consist of 10,000+ frames. Moreover, none of the related works was demonstrated to run in real time, typically requiring multiple seconds per frame.

2

Materials and Methods

Data and Preprocessing: Our dataset consists of 1003 2D US scans of consented volunteers with gestational ages between 18 and 22 weeks, which have been acquired by a team of expert sonographers using GE Voluson E8 systems. For each scan a screen capture video of the entire procedure was recorded. Additionally, the sonographers saved “freeze frames” of a number of standard views for each subject. A large fraction of these frames have been annotated, allowing us to infer the correct ground-truth (GT) label. All video frames and images were downsampled to a size of 225 × 273 pixels. We considered 12 standard scan planes based on the guidelines in [7]. In particular, we selected the following: two brain views at the level of the ventricles (Vt.) and the cerebellum (Cb.), the standard abdominal view, the transverse kidney view, the coronal lip, the median profile, and the femur and sagittal spine views. We also included four commonly acquired cardiac views: the left and right ventricular outflow tracts (LVOT and RVOT), the three vessel view (3VV) and the 4 chamber view (4CH)¹. In addition to the labelled freeze frames, we sampled 50 random frames from each video in order to model the background class, i.e., the “not a standard scan plane” class.

Network Architecture: The architecture of our proposed CNN is summarised in Fig. 2. Following recent advances in computer vision, we opted for a fully convolutional network architecture which replaces traditional fully connected layers with convolution layers using a 1 × 1 kernel [5,9]. In the final convolutional layer (C6) the input is reduced to K 13 × 13 feature maps Fk, where K is the number of classes. Each of these feature maps is then averaged to obtain the input to the final Softmax layer. This architecture makes the network flexible with regard to the size of the input images. Larger images will simply result in larger feature maps, which will nevertheless be mapped to a scalar for the final network output. We use this fact to train on cropped square images rather than the full field of view, which is beneficial for data augmentation.

¹ A detailed description of the considered standard planes is included in the supplementary material available at http://www.doc.ic.ac.uk/~cbaumgar/dwnlds/miccai2016/.

[Fig. 2 image. Layer labels: C1 (7x7/2) - MP, C2 (5x5/2) - MP, C3 (3x3/1), C4 (3x3/1), C5 (1x1/1), C6 (1x1/1), Global Average Pooling, Softmax. Image/feature-map sizes shown: 225x225x1, 55x55x32, 13x13x128, 13x13x128, 13x13x64, 13x13x64, 13x13xK.]

Fig. 2. Overview of the proposed network architecture. The size and stride of the convolutional kernels are indicated at the top (notation: kernel size/stride). Max-pooling steps are indicated by MP (2 × 2 bins, stride of 2). The activation functions of all convolutions except C6 are rectified non-linear units (ReLUs). C6 is followed by a global average pooling step. The sizes at the bottom of each image/feature map refer to the training phase and will be slightly larger during inference due to larger input images.
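The step from the K category-specific feature maps to the class probabilities (global average pooling followed by a softmax) can be sketched in a few lines. This is a minimal NumPy illustration rather than the authors' implementation: the function name `classify_from_feature_maps` and the random feature maps are assumptions, and the convolutional layers C1 to C6 that would produce the maps are not modelled.

```python
import numpy as np

def classify_from_feature_maps(feature_maps):
    """Map K feature maps of shape (K, H, W) to K class probabilities."""
    scores = feature_maps.mean(axis=(1, 2))   # global average pooling -> shape (K,)
    exp = np.exp(scores - scores.max())       # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
K = 13  # 12 standard planes plus background

# Training-size input gives 13 x 13 maps; full-frame input gives 13 x 16 maps.
probs_train = classify_from_feature_maps(rng.normal(size=(K, 13, 13)))
probs_infer = classify_from_feature_maps(rng.normal(size=(K, 13, 16)))
print(probs_train.shape, probs_infer.shape)
```

Because the pooling averages over whatever spatial extent it receives, the same function accepts both map sizes, which mirrors the input-size flexibility described for the network.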

A key aspect of our proposed network architecture is that we enforce a one-to-one correspondence between each feature map Fk and the respective prediction yk. Since each neuron in the feature maps Fk has a receptive field in the original image, during training, the neurons will learn to activate only if an object of class k is in that field. This allows Fk to be interpreted as a spatially encoded confidence map for class k [5]. In this paper, we take advantage of this fact to generate localised saliency maps as described below.

Training: We split our dataset into a test set containing 20 % of the subjects and a training set containing 80 %. We use 10 % of the training data as a validation set to monitor the training progress. In total, we model 12 standard view planes plus one background class, resulting in K = 13 categories. We train the model using mini-batch gradient descent and the categorical cross-entropy cost function. In order to prevent overfitting we add 50 % dropout after the C5 and C6 layers. To account for the significant class imbalance introduced by the background category, we create mini-batches with even class-sampling. Additionally, we augment each batch by a factor of 5 by taking 225 × 225 square sub-images with a random horizontal translation and transforming them with a small random rotation and flips along the vertical axis. Taking random square sub-images allows more variation to be introduced into the augmented batches compared to training on the full field of view. This helps to reduce the overfitting of the network. We train the network for 50 epochs and choose the network parameters with the lowest error on the validation set.
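Two of the training ingredients described above, even class-sampling and random square sub-images, are easy to illustrate. The sketch below is a simplified assumption-laden version: the helper names `balanced_batch_indices` and `random_square_crop` are hypothetical, sampling is done with replacement, and the small random rotation mentioned in the text is omitted.

```python
import numpy as np

def balanced_batch_indices(labels, per_class, rng):
    """Draw the same number of sample indices from every class (with replacement)."""
    labels = np.asarray(labels)
    picks = [rng.choice(np.flatnonzero(labels == c), size=per_class, replace=True)
             for c in np.unique(labels)]
    return np.concatenate(picks)

def random_square_crop(frame, size, rng):
    """Take a size x size window at a random horizontal offset, then maybe mirror it."""
    height, width = frame.shape
    assert height == size and width >= size
    x0 = rng.integers(0, width - size + 1)   # random horizontal translation
    crop = frame[:, x0:x0 + size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]                 # flip along the vertical axis (left-right)
    return crop

rng = np.random.default_rng(0)
labels = [0] * 100 + [1] * 5 + [2] * 3       # hypothetical, heavily imbalanced labels
batch = balanced_batch_indices(labels, per_class=4, rng=rng)
crop = random_square_crop(np.zeros((225, 273)), size=225, rng=rng)
print(np.bincount(np.asarray(labels)[batch]), crop.shape)
```

Sampling per class rather than per frame keeps the large background class from dominating each mini-batch, which is the motivation given in the text.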

Frame Annotation and Retrospective Retrieval: After training we feed the network with video frames containing the full field of view (225 × 273 pixels) of the input videos. This results in larger category-specific feature maps of 13 × 16. The prediction $y_k$ and confidence $c_k$ of each frame are given by the prediction with the highest probability and the probability itself. For retrospective frame retrieval, for each subject we calculate and record the confidence for each class over the entire duration of an input video. Subsequently, we retrieve the frame with the highest confidence for each class.

Saliency Maps and Unsupervised Localisation: After obtaining the category $y_k$ of the current frame X from a forward pass through the network, we can examine the feature map $F_k$ (i.e. the output of the C6 layer) corresponding to the predicted category k. Two examples of feature maps are shown in Fig. 1c. The $F_k$ could already be used to make an approximate estimate of the location of the respective anatomy similar to [9]. Here, instead of using the feature maps directly, we present a novel method to obtain localised saliency with the resolution of the original input images. For each neuron $F_k^{(p,q)}$ at the location (p, q) in the feature map it is possible to calculate how much each original input pixel $X^{(i,j)}$ contributed to the activation of this neuron. This corresponds to calculating the partial derivatives

$$S_k^{(i,j)} = \frac{\partial F_k^{(p,q)}}{\partial X^{(i,j)}},$$

which can be solved efficiently using an additional backwards pass through the network. [12] proposed a method for performing this back-propagation in a guided manner by allowing only error signals which contribute to an increase of the activations in the higher layers (i.e. layers closer to the network output) to back-propagate. In particular, the error is only back-propagated through each neuron’s ReLU unit if the input to the neuron x, as well as the error in the higher layer $\delta_\ell$, are positive. That is, the back-propagated error $\delta_{\ell-1}$ of each neuron is given by $\delta_{\ell-1} = \delta_\ell \, \sigma(x) \, \sigma(\delta_\ell)$, where $\sigma(\cdot)$ is the unit step function. In contrast to [12], who back-propagated from the final output, in this work we take advantage of the spatial encoding in the category-specific feature maps and only back-propagate the errors for the 10 % most active feature map neurons, i.e. the spatial locations where the fetal anatomy is predicted. The resulting saliency maps are significantly more localised compared to [12] (see Fig. 3).

These saliency maps can be used as a starting point for various image analysis tasks such as automated segmentation or measurements. Here, we demonstrate how they can be used for approximate localisation using basic image processing. We blur the absolute value image of a saliency map $|S_k|$ using a 25 × 25 Gaussian kernel and apply a thresholding using Otsu’s method [10]. Finally, we compute the minimum bounding box of the components in the thresholded image.
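The two ingredients of the saliency computation described above, the guided back-propagation rule at a ReLU and the restriction to the most active feature-map neurons, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the example arrays are made up, and a full implementation would apply the rule layer by layer inside a deep learning framework.

```python
import numpy as np

def guided_relu_backward(x, delta_upper):
    """Guided back-propagation through a ReLU.

    Passes the error only where both the forward input x and the incoming
    error delta_upper are positive: delta_lower = delta_upper * step(x) * step(delta_upper).
    """
    return np.where((x > 0) & (delta_upper > 0), delta_upper, 0.0)

def top_fraction_mask(feature_map, fraction=0.10):
    """Binary mask that keeps the `fraction` most active neurons of a feature map."""
    k = max(1, int(round(fraction * feature_map.size)))
    threshold = np.sort(feature_map, axis=None)[-k]
    return (feature_map >= threshold).astype(float)

# Hypothetical ReLU inputs and incoming errors.
x = np.array([-1.0, 2.0, 3.0, 0.5])
delta = np.array([0.4, -0.3, 0.7, 0.2])
print(guided_relu_backward(x, delta))

# The mask would serve as the starting error signal for the backward pass.
rng = np.random.default_rng(0)
mask = top_fraction_mask(rng.normal(size=(13, 16)))
print(int(mask.sum()), "of", mask.size, "neurons selected")
```

Starting the backward pass from the masked feature map rather than from the pooled output is what ties the resulting saliency to the spatial locations where the class is detected.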

Fig. 3. Saliency maps obtained from the input frame (LVOT class) shown on the left. The middle map was obtained using guided back-propagation from the average pool layer output [12]. The map on the right was obtained using our proposed method.
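The approximate localisation from a saliency map (blur the absolute values, threshold with Otsu's method, take a bounding box) can be sketched with NumPy alone. Several details here are assumptions not given in the text: the Gaussian standard deviation `sigma=4.0`, the 256-bin histogram, the helper names, and the simplification of boxing all foreground pixels together instead of analysing connected components. The blur also assumes both image dimensions are at least as large as the kernel.

```python
import numpy as np

def gaussian_blur(image, size=25, sigma=4.0):
    """Separable Gaussian blur with a size x size kernel (sigma is an assumed value)."""
    ax = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-0.5 * (ax / sigma) ** 2)
    kernel /= kernel.sum()
    rows = np.apply_along_axis(np.convolve, 1, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")

def otsu_threshold(image, bins=256):
    """Threshold that maximises the between-class variance of the histogram."""
    hist, edges = np.histogram(image, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist).astype(float)        # pixel count in bins 0..t
    w1 = w0[-1] - w0                          # pixel count in bins t+1..end
    s0 = np.cumsum(hist * centers)
    mean0 = s0 / np.maximum(w0, 1.0)
    mean1 = (s0[-1] - s0) / np.maximum(w1, 1.0)
    between = w0 * w1 * (mean0 - mean1) ** 2
    return edges[np.argmax(between) + 1]      # upper edge of the last "background" bin

def bounding_box(mask):
    """(row_min, row_max, col_min, col_max) of all foreground pixels."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

# Hypothetical saliency map with one salient block.
saliency = np.zeros((60, 80))
saliency[20:30, 40:55] = -1.0                 # sign is irrelevant: |S| is used
blurred = gaussian_blur(np.abs(saliency))
box = bounding_box(blurred > otsu_threshold(blurred))
print("bounding box (row_min, row_max, col_min, col_max):", box)
```

For frames with several disconnected salient regions, a connected-component step before the bounding box would be closer to the description in the text.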

3

Experiments and Results

Frame Annotation: We evaluated the ability of our method to detect standard frames by classifying the test data including the randomly sampled background class. We report the achieved precision (pc) and recall (rc) scores in Table 1. The lowest scores were obtained for cardiac views, which are also the most difficult to scan for expert sonographers. This fact is reflected in the low detection rates for serious cardiac anomalies (e.g. only 35 % in the UK).

Table 1. Precision pc = TP/(TP + FP) and recall rc = TP/(TP + FN) for the classification of the modelled scan planes. Background class: pc = 0.96, rc = 0.93.

view         pc    rc      view     pc    rc      view  pc    rc
Brain (Vt.)  0.96  0.90    Lips     0.85  0.88    LVOT  0.63  0.63
Brain (Cb.)  0.92  0.94    Profile  0.71  0.82    RVOT  0.40  0.46
Abdominal    0.85  0.80    Femur    0.79  0.93    3VV   0.46  0.60
Kidneys      0.64  0.87    Spine    0.51  0.99    4CH   0.61  0.74
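The precision and recall definitions used in Table 1 can be computed per class from frame-level labels as in the sketch below. The function name `precision_recall` and the toy label lists are hypothetical; they are not the paper's test data.

```python
import numpy as np

def precision_recall(y_true, y_pred, cls):
    """pc = TP / (TP + FP) and rc = TP / (TP + FN) for one class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == cls) & (y_true == cls)))
    fp = int(np.sum((y_pred == cls) & (y_true != cls)))
    fn = int(np.sum((y_pred != cls) & (y_true == cls)))
    pc = tp / (tp + fp) if tp + fp else float("nan")
    rc = tp / (tp + fn) if tp + fn else float("nan")
    return pc, rc

# Hypothetical frame labels ("bg" is the background class).
y_true = ["4CH", "4CH", "bg", "bg", "bg", "LVOT"]
y_pred = ["4CH", "bg", "4CH", "bg", "bg", "LVOT"]
pc_4ch, rc_4ch = precision_recall(y_true, y_pred, "4CH")
print(f"4CH: pc = {pc_4ch:.2f}, rc = {rc_4ch:.2f}")
```

Because background frames vastly outnumber standard planes in video data, a handful of background frames misclassified as a given view can lower that view's precision noticeably while leaving its recall unchanged.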

[2] have recently reported pc/rc scores of 0.75/0.75 for the abdominal standard view, and 0.77/0.61 for the 4CH view in US sweep data. We obtained comparable values for the 4CH view and considerably better values for the abdominal view. However, with 12 modelled standard planes and free-hand US data our problem is significantly more complex. Using an Nvidia Tesla K80 graphics processing unit (GPU) we were able to classify 113 frames per second (FPS) on average, which significantly exceeds the recording rate of the ultrasound machine of 25 FPS. We include an annotated video in the supplementary material.

Table 2. Fraction of correctly retrieved frames for each standard view for all 201 test subjects.

view         fraction    view     fraction    view  fraction
Brain (Vt.)  0.95        Lips     0.77        LVOT  0.73
Brain (Cb.)  0.89        Profile  0.76        RVOT  0.70
Abdominal    0.79        Femur    0.75        3VV   0.66
Kidneys      0.87        Spine    0.77        4CH   0.78

Fig. 4. Retrieved standard frames (RET) and GT frames annotated and saved by expert sonographers for two volunteers. Correctly retrieved and incorrectly retrieved frames are denoted with a green check mark or red cross, respectively. Frames with no GT annotation are indicated. The confidence is shown in the lower right of each image. The frames in (b) additionally contain the results of our proposed localisation (boxes).

Retrospective Frame Retrieval: We retrieved the standard views from videos of all test subjects and manually evaluated whether the retrieved frames corresponded to the annotated GT frames for each category. Several cases did not have GTs for all views because they were not manually included by the sonographer in the original scan. For those cases we did not evaluate the retrieved frame. The results are summarised in Table 2. We show examples of the retrieved frames for two volunteers in Fig. 4. Note that in many cases the retrieved planes match the expert GT almost exactly. Moreover, some planes which were not annotated by the experts were nevertheless found correctly. As before, most cardiac views achieved lower scores compared to other views.

Localisation: We show results for the approximate localisation of the respective fetal anatomy in the retrieved frames for one representative case in Fig. 4b and in the supplemental video. We found that performing the localisation reduced the frame rate to 39 FPS on average.
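The retrospective retrieval evaluated above amounts to recording the per-class confidence for every frame of a video and then picking, for each class, the frame where that confidence peaks. A minimal sketch follows, assuming the network outputs are available as a (frames × classes) array; the function name, the choice of index 0 for the background class, and the example probabilities are all hypothetical.

```python
import numpy as np

def retrieve_best_frames(probabilities, background_class=0):
    """For each non-background class, return (frame index, confidence) of its best frame.

    probabilities: array of shape (num_frames, K) with per-frame class probabilities.
    """
    best = {}
    for k in range(probabilities.shape[1]):
        if k == background_class:
            continue                                   # no frame is retrieved for background
        frame = int(np.argmax(probabilities[:, k]))
        best[k] = (frame, float(probabilities[frame, k]))
    return best

# Hypothetical three-frame video with K = 3 classes (class 0 = background).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
for cls, (frame, conf) in retrieve_best_frames(probs).items():
    print(f"class {cls}: frame {frame} with confidence {conf:.2f}")
```

Since only a running maximum per class is needed, the same bookkeeping could be done on the fly while the video is being classified, without storing all per-frame outputs.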

4

Discussion and Conclusion

We have proposed a system for the automatic detection of twelve fetal standard scan planes from real clinical fetal US scans. The employed fully convolutional network architecture allowed for robust real-time inference. Furthermore, we have proposed a novel method to obtain localised saliency maps by combining the information in category-specific feature maps with a guided back-propagation step. To the best of our knowledge, our approach is the first to model a large number of fetal standard views from a substantial population of free-hand US scans. We have shown that the method can be used to robustly annotate US data with classification scores exceeding values reported in related work for some standard planes, but in a much more challenging scenario. A system based on our approach could potentially be used to assist or train inexperienced sonographers. We have also shown how the framework can be used to retrieve standard scan planes retrospectively. In this manner, relevant key frames could be extracted from a video acquired by an inexperienced operator and sent for further analysis to an expert. We have also demonstrated how the proposed localised saliency maps can be used to extract an approximate bounding box of the fetal anatomy. This is an important stepping stone for further, more specialised image processing.

Acknowledgments. Supported by the Wellcome Trust IEH Award [102431].

References

1. Bull, C., et al.: Current and potential impact of fetal diagnosis on prevalence and spectrum of serious congenital heart disease at term in the UK. The Lancet 354(9186), 1242–1247 (1999)
2. Chen, H., Dou, Q., Ni, D., Cheng, J.-Z., Qin, J., Li, S., Heng, P.-A.: Automatic fetal ultrasound standard plane detection using knowledge transferred recurrent neural networks. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 507–514. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_62
3. Chen, H., Ni, D., Qin, J., Li, S., Yang, X., Wang, T., Heng, P.: Standard plane localization in fetal ultrasound via domain transferred deep neural networks. IEEE J. Biomed. Health Inf. 19(5), 1627–1636 (2015)
4. Kurinczuk, J., Hollowell, J., Boyd, P., Oakley, L., Brocklehurst, P., Gray, R.: The contribution of congenital anomalies to infant mortality. University of Oxford, National Perinatal Epidemiology Unit (2010)
5. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv:1312.4400 (2013)
6. Maraci, M.A., Napolitano, R., Papageorghiou, A., Noble, J.A.: Searching for structures of interest in an ultrasound video sequence. In: Wu, G., Zhang, D., Zhou, L. (eds.) MLMI 2014. LNCS, vol. 8679, pp. 133–140. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10581-9_17
7. NHS Screening Programmes: Fetal Anomaly Screen Programme Handbook, pp. 28–35 (2015)
8. Ni, D., Yang, X., Chen, X., Chin, C.T., Chen, S., Heng, P.A., Li, S., Qin, J., Wang, T.: Standard plane localization in ultrasound by radial component model and selective search. Ultrasound Med. Biol. 40(11), 2728–2742 (2014)
9. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Is object localization for free? Weakly-supervised learning with convolutional neural networks. In: IEEE Proceedings of CVPR, pp. 685–694 (2015)
10. Otsu, N.: A threshold selection method from gray-level histograms. Automatica 11(285–296), 23–27 (1975)
11. Salomon, L., Alfirevic, Z., Berghella, V., Bilardo, C., Leung, K.Y., Malinger, G., Munoz, H., et al.: Practice guidelines for performance of the routine mid-trimester fetal ultrasound scan. Ultrasound Obst. Gyn. 37(1), 116–126 (2011)
12. Springenberg, J., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. arXiv:1412.6806 (2014)
13. Yaqub, M., Kelly, B., Papageorghiou, A.T., Noble, J.A.: Guided random forests for identification of key fetal anatomy and image categorization in ultrasound scans. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 687–694. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_82

3D Deep Learning for Multi-modal Imaging-Guided Survival Time Prediction of Brain Tumor Patients

Dong Nie1,2, Han Zhang1, Ehsan Adeli1, Luyan Liu1, and Dinggang Shen1(B)

1 Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, USA
[email protected]
2 Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, USA

Abstract. High-grade glioma is the most aggressive and severe brain tumor, leading to the death of almost 50 % of patients within 1–2 years. Thus, accurate prognosis for glioma patients would provide essential guidelines for their treatment planning. Conventional survival prediction generally utilizes clinical information and limited handcrafted features from magnetic resonance images (MRI), which is often time consuming, laborious and subjective. In this paper, we propose using deep learning frameworks to automatically extract features from multi-modal preoperative brain images (i.e., T1 MRI, fMRI and DTI) of high-grade glioma patients. Specifically, we adopt 3D convolutional neural networks (CNNs) and also propose a new network architecture for using multi-channel data and learning supervised features. Along with the pivotal clinical features, we finally train a support vector machine to predict whether the patient has a long or short overall survival (OS) time. Experimental results demonstrate that our methods can achieve an accuracy as high as 89.9 %. We also find that the learned features from fMRI and DTI play more important roles in accurately predicting the OS time, which provides valuable insights into functional neuro-oncological applications.

1 Introduction

Brain tumors are among the most lethal and difficult-to-treat cancers. The most deadly brain tumors are the World Health Organization (WHO) high-grade (III and IV) gliomas. The prognosis of glioma, often measured by the overall survival (OS) time, varies largely across individuals. Based on histopathology, OS is relatively longer for WHO-III gliomas and shorter for WHO-IV gliomas. For instance, the median survival time is approximately 3 years for anaplastic astrocytoma but only 1 year for glioblastoma [2]. Tumor WHO grading, imaging phenotype, and other clinical data have been studied in their relationship to OS [1,6]. Bisdas et al. [1] showed that the relative cerebral blood volume in astrocytomas is predictive of recurrence and 1-year OS rate. However, this conclusion cannot be extended to other higher-grade gliomas, where prognosis prediction is more important. Lacroix et al. [6] identified five independent predictors of OS in glioblastoma patients, including age, Karnofsky Performance Scale score, extent of resection, degree of necrosis and enhancement in preoperative MRI. However, one may question the generalization ability of such models.

Recently, researchers have been using informative imaging phenotypes to study prognosis [8,9,11]. Contrast-enhanced T1 MRI has been widely used for glioblastoma imaging. Previous studies [9] have shown that T1 MRI features can contribute largely to prognostic studies of survival in glioblastoma patients. For instance, Pope et al. [9] analyzed 15 MRI features and found that several of them, such as non-enhancing tumor and infiltration area, are good predictors of OS. In addition, diffusion tensor imaging (DTI) provides complementary information, e.g., white matter integrity. For instance, the authors of [11] concluded that DTI features can help discriminate between short and long survival of glioblastoma patients more efficiently than using only the histopathologic information. Functional MRI (fMRI) has also been used to measure cerebrovascular reactivity, which is impaired around the tumor entity and reflects angiogenesis, a key sign of malignancy [8].

These approaches mostly used handcrafted, engineered features based on prior medical experience. This often limits the ability to take full advantage of all the information embedded in MR images, since handcrafted features are usually bound to the current limited knowledge of the specific field. On the other hand, the features exploited in some previous methods, e.g., [11], were often obtained in an unsupervised manner, which could introduce many useless features or even ignore some useful clues. To overcome such problems, we propose a novel method to predict survival time for brain tumor patients by using high-level imaging features captured in a supervised manner.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 212–220, 2016. DOI: 10.1007/978-3-319-46723-8_25

[Figure: fMRI patches → multi-channel deep network (fMRI) → feature vector 1; DTI patches → multi-channel deep network (DTI) → feature vector 2; T1 MRI patches → deep network (T1 MRI) → feature vector 3; feature vectors → feature fusion → build an SVM]

Fig. 1. The flow chart for our survival prediction system

Specifically, we do not use only the handcrafted features; rather, we propose a 3D deep learning approach to discover high-level features that can better characterize the properties of different brain tumors. In the first step, we apply a supervised deep learning method, i.e., a convolutional neural network (CNN), to extract high-level tumor appearance features from T1 MRI, fMRI and DTI images, for distinguishing between long and short survival patients. In the second step, we train an SVM [3] with the above extracted features (after feature selection, and together with some basic clinical information) for predicting the patient's OS time. The whole pipeline of our system is shown in Fig. 1. Note that the MR images are all 3D images and, therefore, we adopt a 3D CNN structure on the respective 3D patches. Furthermore, since both fMRI and DTI data include multiple channels (as explained later), we further propose a multi-channel CNN (mCNN) to properly fuse information from all the channels of fMRI or DTI.

2 Data Acquisition and Preprocessing

The data acquired from the tumor patients include the contrast-enhanced T1 MRI, resting-state fMRI and DTI images. The images from a sample subject are shown in Fig. 2, in which we can see single-channel data for T1 MRI, and multi-channel images for both fMRI and DTI. These three modalities of images are preprocessed following the conventional pipelines. Briefly, they are first aligned. Then, for T1 MRI, intensity normalization is performed. For DTI, diffusion tensor modeling is done, after which diffusion metrics (i.e., FA, MD, RD, lambda 1/2/3) are calculated in addition to the B0 image. For fMRI data, frequency-specific blood oxygen level-dependent (BOLD) fluctuation powers are calculated in five non-overlapping frequency bands within 0–0.25 Hz.

Fig. 2. A sample subject with glioblastoma in our dataset: (a) T1 MRI, (b) fMRI, (c) DTI.

We collected the data from 69 patients, all of whom were diagnosed with WHO-III or IV gliomas. Exclusion criteria were (1) any surgery, radiation therapy, or chemotherapy of brain tumor before inclusion in the study, (2) missing data, and (3) excessive head motion or presence of artifacts. All patients were treated under the same protocol based on clinical guidelines. The whole study was approved by a local ethical committee.

Patch Extraction: The tumor often appears in a certain region of the brain. We want to extract features not only from the contrast-enhanced regions, but also from the adjacent regions, where edema and tumor infiltration occur. In this paper, we manually label and annotate the tumor volume in T1 MRI, which has the highest resolution. Specifically, we define a cuboid by picking the coordinates of the 8 vertices that confine the tumor and its neighboring areas. For fMRI and DTI data, we locate the tumor region according to the tumor mask defined in the T1 MRI data. We then rescale the extracted tumor region of each subject to a predefined size (i.e., 64 × 64 × 64), from which we can extract many overlapping 32 × 32 × 32 patches to train our CNN/mCNNs.

Definition of Survival Time: OS time is defined as the duration from the date the patient was first scanned to the date of tumor-related death. Subjects with unrelated causes of death (e.g., suicide) were excluded. A threshold of 22 months was used to divide the patients into 2 groups: (1) short-term survival and (2) long-term survival, with 34 and 35 subjects in each group, respectively. This threshold was defined based on the OS rates in the literature [6], as approximately 50 % of the high-grade glioma patients died within this time period.
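As a rough illustration of the patch extraction step, the sketch below enumerates the corners of all 32 × 32 × 32 patches that fit inside a 64 × 64 × 64 volume and crops one of them. The function names, the nested-list volume layout and the stride values are assumptions made for this example; the stride actually used for the overlapping patches is not stated in the text.

```python
from itertools import product


def patch_origins(volume_size=64, patch_size=32, stride=16):
    """Return the (z, y, x) corner of every cubic patch that fits in the volume.

    The stride is a hypothetical parameter: a stride smaller than the patch
    size produces overlapping patches.
    """
    if patch_size > volume_size or stride <= 0:
        raise ValueError("patch must fit in the volume and stride must be positive")
    starts = range(0, volume_size - patch_size + 1, stride)
    return list(product(starts, repeat=3))


def extract_patch(volume, origin, patch_size=32):
    """Crop one cubic patch from a volume stored as nested lists volume[z][y][x]."""
    z0, y0, x0 = origin
    return [
        [row[x0:x0 + patch_size] for row in plane[y0:y0 + patch_size]]
        for plane in volume[z0:z0 + patch_size]
    ]
```

With a stride of 32 the grid contains 8 non-overlapping patches, and with a stride of 16 it contains 27 overlapping ones.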

3 The Proposed Method

Deep learning models can learn a hierarchy of features, in which high-level features are built upon low-level image features layer-by-layer. CNN [4,5] is a useful deep learning tool; when trained with appropriate regularization, it has shown superior performance on both visual object recognition and image classification tasks (e.g., [5]). In this paper, we first employ the CNN architecture to train one survival time prediction model for each of the T1 MRI, fMRI and DTI modalities. With such trained deep models, we can extract features from the respective image modalities in a supervised manner. Then, a binary classifier (e.g., SVM) is trained to predict OS time. In the following, we first introduce our supervised feature extraction strategies, followed by a classifier training step. Note that the feature extraction strategy for T1 MRI data is different from the feature extraction strategies for fMRI and DTI. This is because fMRI and DTI have multiple channels of data from each subject, while T1 MRI has a single channel of data. Thus, we first introduce our 3D CNN architecture for single-channel data (T1 MRI), and then extend it for multi-channel data (fMRI and DTI).

[Figure: input 1@32x32x32 3D patch; layers conv1, conv2, conv3, conv4, fc5, fc6, fc7; feature maps 32@32x32x32, 32@16x16x16, 64@8x8x8, 32@4x4x4; fully-connected layers with 1028 and 256 neurons; output 2 neurons]

Fig. 3. The CNN architecture for single-channel feature extraction from 3D patches
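The spatial sizes listed in Fig. 3 (32, 16, 8, 4) are consistent with a downsampling by a factor of two between consecutive layer groups. The sketch below traces these sizes and the number of values in the last feature map. The pooling factor of 2, the number of poolings and the helper names are assumptions for illustration; the figure residue does not state where exactly each pooling is applied.

```python
def trace_spatial_size(input_size, num_poolings, pool=2):
    """Edge length of a cubic feature map after each assumed pooling step.

    Assumes size-preserving convolutions, so only pooling changes the size.
    """
    sizes = [input_size]
    for _ in range(num_poolings):
        sizes.append(sizes[-1] // pool)
    return sizes


def flattened_length(channels, spatial_size):
    """Number of values in a stack of cubic 3D feature maps."""
    return channels * spatial_size ** 3
```

For example, `trace_spatial_size(32, 3)` gives `[32, 16, 8, 4]`, and the 32@4x4x4 feature maps correspond to `flattened_length(32, 4) = 2048` values.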

Single-Channel Feature Extraction: As described earlier, T1 MRI data is in 3D, and therefore we propose a 3D CNN model with a set of 3D trainable filters. The architecture used in our paper is depicted in Fig. 3, in which we define four convolutional layer groups (conv1 to conv4; note that both conv1 and conv2 are composed of two convolutional layers), and three fully-connected layers (fc5 to fc7). The convolutional layers associate their outputs with local 3D regions in their inputs, each computing a convolution with a 3D filter of size 3 × 3 × 3 (i.e., between their weights and the 3D regions they are operating on). These results are 3D volumes of the same size as their inputs, which are followed by a max-pooling procedure to perform a downsampling operation along the 3D volume dimensions. The fully-connected layers include neurons connected to all activations in their previous layer, as in conventional neural networks. The last layer (fc7) has 2 neurons, whose outputs are associated with the class scores. The supervision on the class scores leads to a back-propagation procedure for learning the most relevant features in the fc layers. We use the outputs from the last two layers of this CNN (fc6 and fc7) as our fully supervised learned features, and also compare their efficiency and effectiveness in the experiments.

Multi-channel Feature Extraction: As mentioned, both fMRI and DTI images are composed of multiple channels of data (as in Fig. 2). To effectively employ all multi-channel data, we propose a new multi-channel CNN (mCNN) architecture and train one mCNN per modality, considering that the multiple channels can provide different information about the brain tumor. On the other hand, different channels may have little direct complementary information in their original image spaces, due to different acquisition techniques. Inspired by the work in [10], in which a multi-modal deep Boltzmann machine was introduced, we modify our 3D CNN architecture to deal with multi-channel data.
Specifically, in our proposed mCNN, the same convolution layers are applied to each channel of data separately, and a fusion layer is added to fuse the outputs of the conv4 layers from all channels. Then, three fully connected layers are further incorporated to finally extract the features. This new mCNN architecture is illustrated in Fig. 4. In this architecture, the networks in the lower layers (i.e., the conv layers) for each pathway are different, accounting for different input distributions from each channel. The fusion layer combines the outputs from these different streams. Note that all these layer groups are identical to those described for the single-channel data. It is important to note that the statistical properties of different channels of the data are often different, which makes it difficult for a single-channel model (as illustrated in Fig. 3) to directly find correlations across channels. In contrast, in our proposed mCNN model, the differences between the channels can be largely bridged by fusing their respective higher-layer features.

Fig. 4. Architecture of mCNN for feature extraction from multi-channel data

Classification: Once we train a CNN (Fig. 3) for T1 MRI images and two mCNNs (Fig. 4) for fMRI and DTI, respectively, we can map each raw (3D) image from all three modalities to its high-level representation. This is accomplished by feeding each image patch to the corresponding CNN or mCNN model, according to the modality from which the patch was extracted. The outputs of the last fully-connected layers are regarded as the learned features for the input image patch. Finally, each subject is represented by a same-length feature vector. Together with survival labels, we train an SVM classifier for survival classification.
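One plausible way to obtain a same-length vector per subject is to concatenate the per-patch features of all modalities in a fixed order. The sketch below does this with plain lists. The function name, the dictionary layout and the modality order are assumptions for illustration, and concatenation itself is an assumed fusion rule that the text does not spell out.

```python
def subject_feature_vector(patch_features, modality_order=("T1 MRI", "fMRI", "DTI")):
    """Concatenate per-patch feature lists of all modalities into one vector.

    patch_features maps a modality name to a list of per-patch feature lists.
    A fixed modality and patch order keeps vectors comparable across subjects.
    """
    vector = []
    for modality in modality_order:
        for features in patch_features[modality]:
            vector.extend(features)
    return vector
```

With 8 patches per modality, 256 fc6 outputs per patch yield 8 × 256 × 3 = 6144 values per subject, and 2 fc7 outputs per patch yield 48 values.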

4 Experiments and Results

As described earlier, we learn features from multi-modal brain tumor data in a supervised manner, and then use these extracted features to train an SVM classifier for survival prediction. For the training of deep networks, we adopt a back-propagation algorithm to update the network parameters. The network weights are initialized by the Xavier algorithm [4]. The initial learning rate and weight decay parameter are determined by conducting a coarse line search, followed by decreasing the learning rate during training. To evaluate our proposed method, we incorporate different sets of features for training our method. In particular, the two commonly-used sets of features from CNNs are the outputs of the fc6 and fc7 layers, as can be seen in Figs. 3 (for single-channel CNN) and 4 (for mCNN). The network for each modality is trained on the respective modality data of the training set. Then, each testing patch is fed into the trained network to obtain its respective features. Note that the layer before the output layer, denoted as fc6, has 256 neurons, while the output layer (fc7) is composed of 2 neurons. In addition to these features, the handcrafted features (HF) are also included in our experiments. These HF features consist of generic brain tumor features, including gender, age at diagnosis, tumor location, size of tumor, and the WHO grade. We also use the scale-invariant feature transform (SIFT) [7] as a comparison feature extraction approach. To show the advantage of the 3D filters and architecture of the proposed method, we also provide the results obtained using the conventional 2D-CNN with the same settings. As stated in Sect. 2, each patient has 3 modalities of images (i.e., T1 MRI, fMRI and DTI). From each of these three modalities (of each subject), 8 different patches are extracted.
This leads to 8 × 256 = 2048 learned fc6 features and 8 × 2 = 16 learned fc7 features for each modality, and in total to 2048 × 3 = 6144 and 16 × 3 = 48 learned features in the fc6 and fc7 layers, respectively, for the three modalities. Due to the large number of features in fc6, we conducted a feature selection/reduction procedure on them. Specifically, we used Principal Component Analysis (PCA) and Sparse Representation (SR).

Results: We used 10-fold cross-validation, in which for each testing fold, the 9 other folds are used to train both the CNN and mCNNs, and then the SVM with the learned features. The performance measures averaged over the 10 folds are reported in Table 1, including accuracy (ACC), sensitivity (SEN), specificity (SPE), positive predictive rate (PPR) and negative predictive rate (NPR), which are defined in the following (TP: true positive; TN: true negative; FP: false positive; FN: false negative):

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{SEN} = \frac{TP}{TP + FN}, \quad \mathrm{SPE} = \frac{TN}{TN + FP}, \quad \mathrm{PPR} = \frac{TP}{TP + FP}, \quad \mathrm{NPR} = \frac{TN}{TN + FN}.$$
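The five measures can be computed directly from the confusion-matrix counts. The following sketch is a minimal illustration; the function name is hypothetical, and the counts in the usage note are invented for the example and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return ACC, SEN, SPE, PPR and NPR as fractions in [0, 1].

    Following the discussion in the text, the positive class is taken to be
    the short-survival group. All denominators are assumed to be non-zero.
    """
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "PPR": tp / (tp + fp),
        "NPR": tn / (tn + fn),
    }
```

For instance, with hypothetical counts of 30 true positives, 28 true negatives, 7 false positives and 4 false negatives (34 short-survival and 35 long-survival subjects), the specificity is 28/35 = 0.8 and the negative predictive rate is 28/32 = 0.875.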

As can be seen, incorporating both deeply-learned (with the proposed 3D CNN and 3D mCNN models) and handcrafted (HF) features results in the best classification accuracy of 89.85 %. In contrast, with HF alone, or in combination with unsupervised learned features (SIFT), we obtain an accuracy of only 62.96 % or 78.35 %, respectively. Furthermore, the 3D architecture outperforms the conventional 2D architecture (i.e., 2D-CNN), which suggests that the 3D filters can lead to better feature learning. Regarding sensitivity and specificity, the higher the sensitivity, the lower the chance of misclassifying the short survival patients; on the other hand, the higher the specificity, the lower the chance of misclassifying the long survival patients. The proposed feature extraction method resulted in approximately 30 % higher sensitivity and specificity, compared to the traditional handcrafted features. Interestingly, our model predicts the short survival patients with more confidence than the long survival patients. Furthermore, the features from different layers of the CNN and mCNN models (fc6 and fc7) exhibit roughly similar power in predicting the OS time on this dataset.

Table 1. Performance evaluation of different features and selection/reduction methods.

Features        ACC (%)  SEN (%)  SPE (%)  PPR (%)  NPR (%)
HF              62.96    66.39    58.53    63.18    65.28
HF + SIFT       78.35    80.00    77.28    67.59    87.09
HF + 2D-CNN     81.25    81.82    80.95    74.23    88.35
fc7             80.12    85.60    77.64    71.71    87.50
fc6-PCA         80.55    84.85    76.92    75.68    85.71
fc6-SR          76.39    86.67    69.05    66.67    87.88
HF + fc7        89.58    92.19    88.22    84.44    95.57
HF + fc6-PCA    89.85    96.87    83.90    84.94    93.93
HF + fc6-SR     85.42    92.60    80.39    75.36    96.83

[Figure: two bar charts, one for fc6 and one for fc7; vertical axis: # of selected features; bars: fMRI, DTI, T1 MRI]

Fig. 5. The average number of selected features from each modality, using our model.

To analyze the importance of the features for predicting OS, we also visualize the number of features selected from each modality. To do this, for the features from fc6, we count the features selected by sparse representation, and for the features from the fc7 layer, we use an ℓ1-regularized SVM for classification, to enforce selection of the most discriminative features. We average the number of selected features over all cross-validation folds. The results are depicted in Fig. 5. As can be seen, the fMRI data contribute to the prediction model, as do the DTI and T1 MRI data.
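The per-modality counting described above can be sketched as follows, treating a feature as selected when its weight in the sparse model is non-zero. The function names, the tolerance and the index layout of the modalities within the weight vector are assumptions for this illustration.

```python
def count_selected(weights, modality_slices, tol=1e-12):
    """Count the non-zero weights that fall into each modality's index range."""
    return {
        name: sum(1 for w in weights[start:stop] if abs(w) > tol)
        for name, (start, stop) in modality_slices.items()
    }


def average_selected(fold_weights, modality_slices):
    """Average the per-modality counts over all cross-validation folds."""
    totals = {name: 0 for name in modality_slices}
    for weights in fold_weights:
        for name, count in count_selected(weights, modality_slices).items():
            totals[name] += count
    return {name: total / len(fold_weights) for name, total in totals.items()}
```

For the fc7 features, a hypothetical layout would be `{"fMRI": (0, 16), "DTI": (16, 32), "T1 MRI": (32, 48)}`, matching the 16 fc7 features per modality.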

5 Conclusions

In this study, we proposed a 3D deep learning model to predict the (long or short) OS time for brain glioma patients. We trained 3D CNN and mCNN models for learning features from single-channel (T1 MRI) and multi-channel (fMRI and DTI) data in a supervised manner, respectively. The extracted features were then fed into a binary SVM classifier. Experimental results showed that our supervised-learned features substantially improved the predictive accuracy of glioma patients' OS time. This indicates that our proposed 3D deep learning frameworks can enable computational models to extract useful features for such neuro-oncological applications. In addition, the analysis of the selected features further shows that DTI data can contribute slightly more than fMRI, but both fMRI and DTI play more significant roles than T1 MRI in building such successful prediction models.

References

1. Bisdas, S., et al.: Cerebral blood volume measurements by perfusion-weighted MR imaging in gliomas: ready for prime time in predicting short-term outcome and recurrent disease? Am. J. Neuroradiol. 30(4), 681–688 (2009)
2. DeAngelis, L.M.: Brain tumors. N. Engl. J. Med. 344(2), 114–123 (2001)
3. Fan, R.-E., et al.: Liblinear: a library for large linear classification. JMLR 9, 1871–1874 (2008)
4. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS, pp. 249–256 (2010)
5. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
6. Lacroix, M., et al.: A multivariate analysis of 416 patients with glioblastoma multiforme: prognosis, extent of resection, and survival. J. Neurosurg. 95(2), 190–198 (2001)
7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
8. Pillai, J.J., Zacà, D.: Clinical utility of cerebrovascular reactivity mapping in patients with low grade gliomas (2011)
9. Pope, W.B., et al.: MR imaging correlates of survival in patients with high-grade gliomas. Am. J. Neuroradiol. 26(10), 2466–2474 (2005)
10. Srivastava, N., Salakhutdinov, R.R.: Multimodal learning with deep Boltzmann machines. In: NIPS, pp. 2222–2230 (2012)
11. Zacharaki, E.I., et al.: Survival analysis of patients with high-grade gliomas based on data mining of imaging variables. Am. J. Neuroradiol. 33(6), 1065–1071 (2012)

From Local to Global Random Regression Forests: Exploring Anatomical Landmark Localization

Darko Štern1, Thomas Ebner2, and Martin Urschler1,2,3(B)

1 Ludwig Boltzmann Institute for Clinical Forensic Imaging, Graz, Austria
[email protected]
2 Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria
3 BioTechMed-Graz, Graz, Austria

Abstract. State of the art anatomical landmark localization algorithms pair local Random Forest (RF) detection with disambiguation of locally similar structures by including high level knowledge about relative landmark locations. In this work we pursue the question, how much high-level knowledge is needed in addition to a single landmark localization RF to implicitly model the global configuration of multiple, potentially ambiguous landmarks. We further propose a novel RF localization algorithm that distinguishes locally similar structures by automatically identifying them, exploring the back-projection of the response from accurate local RF predictions. In our experiments we show that this approach achieves competitive results in single and multi-landmark localization when applied to 2D hand radiographic and 3D teeth MRI data sets. Additionally, when combined with a simple Markov Random Field model, we are able to outperform state of the art methods.

1 Introduction

Automatic localization of anatomical structures consisting of potentially ambiguous (i.e. locally similar) landmarks is a crucial step in medical image analysis applications like registration or segmentation. Lindner et al. [5] propose a state of the art localization algorithm, which is composed of a sophisticated statistical shape model (SSM) that locally detects landmark candidates by three-step optimization over a random forest (RF) response function. Similarly, Donner et al. [2] use locally restricted classification RFs to generate landmark candidates, followed by a Markov Random Field (MRF) optimizing their configuration. Thus, in both approaches good RF localization accuracy is paired with disambiguation of landmarks by including high-level knowledge about their relative location. A different concept for localizing anatomical structures is from Criminisi et al. [1], suggesting that the RF framework itself is able to learn global structure configuration. This was achieved with random regression forests (RRF) using arbitrary long range features and allowing pixels from all over the training image to globally vote for anatomical structures. Although roughly capturing global structure configuration, their long range voting is inaccurate when pose variations are present, which led to extending this concept with a graphical model [4]. Ebner et al. [3] adapted the work of [1] for multiple landmark localization without the need for an additional model and improved it by introducing a weighting of voting range at testing time and by adding a second RRF stage restricted to the local area estimated by the global RRF. Despite putting more trust into the surroundings of a landmark, their results crucially depend on empirically tuned parameters defining the restricted area according to first stage estimation. In this work we pursue the question, how much high-level knowledge is needed in addition to a single landmark localization RRF to implicitly model the global configuration of multiple, potentially ambiguous landmarks [6]. Investigating different RRF architectures, we propose a novel single landmark localization RRF algorithm, robust to ambiguous, locally similar structures. When extended with a simple MRF model, our RRF outperforms the current state of the art method of Lindner et al. [5] on a challenging multi-landmark 2D hand radiographs data set, while at the same time performing best in localizing single wisdom teeth landmarks from 3D head MRI.

D. Štern: This work was supported by the province of Styria (HTI:Tech for Med ABT08-22-T-7/2013-13) and the Austrian Science Fund (FWF): P 28078-N33.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 221–229, 2016. DOI: 10.1007/978-3-319-46723-8_26

Fig. 1. Overview of our RRF based localization strategy. (a) 37 anatomical landmarks in 2D hand X-ray images and differently colored MRF configurations. (b) In phase 1, RRF is trained locally on an area surrounding a landmark (radius R) with short range features, resulting in accurate but ambiguous landmark predictions (c). (d) Backprojection is applied to select pixels for training the RRF in phase 2 with larger feature range (e). (f) Estimated landmarks by accumulating predictions of pixels in local neighbourhood. (g, h) One of two independently predicted wisdom teeth from 3D MRI.

2 Method

Although being constrained by all surrounding objects, the location of an anatomical landmark is most accurately defined by its neighboring structures. While increasing the feature range leads to more surrounding objects being seen for defining a landmark, enlarging the area from which training pixels are drawn leads to the surrounding objects being able to participate in voting for a landmark location. We explore these observations and investigate the influence of different feature and voting ranges, by proposing several RRF strategies for single landmark localization. Following the ideas of Lindner et al. [5] and Donner et al. [2], in the first phase of the proposed RRF architectures, the local surroundings of a landmark are accurately defined. The second RRF phase establishes different algorithm variants by exploring distinct feature and voting ranges to discriminate ambiguous, locally similar structures. In order to maintain the accuracy achieved during the first RRF phase, locations outside of a landmark's local vicinity are recognized and banned from estimating the landmark location.

2.1 Training the RRF

We independently train an RRF for each anatomical landmark. Similar to [1,3], at each node of the $T$ trees of a forest, the set of pixels $S_n$ reaching node $n$ is pushed to the left ($S_{n,L}$) or right ($S_{n,R}$) child node according to the splitting decision made by thresholding a feature response for each pixel. Feature responses are calculated as differences between the mean image intensities of two rectangles with maximal size $s$ and maximal offset $o$ relative to a pixel position $v_i$, $i \in S_n$. Each node stores a feature and threshold selected from a pool of $N_F$ randomly generated features and $N_T$ thresholds, maximizing the objective function $I$:

$$I = \sum_{i \in S_n} \left\| d_i - \bar{d}(S_n) \right\|^2 - \sum_{c \in \{L,R\}} \sum_{i \in S_{n,c}} \left\| d_i - \bar{d}(S_{n,c}) \right\|^2. \tag{1}$$

For pixel set $S$, $d_i$ is the $i$-th voting vector, defined as the vector between landmark position $l$ and pixel position $v_i$, while $\bar{d}(S)$ is the mean voting vector of pixels in $S$. For later testing, we store at each leaf node $l$ the mean value of relative voting vectors $d_l$ of all pixels reaching $l$.

First training phase: Based on a set of pixels $S^{I}$, selected from the training images at locations inside a circle of radius $R$ centered at the landmark position, the RRF is first trained locally with features whose rectangles have maximal size in each direction $s^{I}$ and maximal offset $o^{I}$, see Fig. 1b. Training of this phase is finished when a maximal depth $D^{I}$ is reached.

Second training phase: Here, our novel algorithm variants are designed by implementing different strategies for handling feature ranges and for selecting the area from which pixels are drawn during training. By pursuing the same local strategy as in the first phase for continuing training of the trees up to a maximal depth $D^{II}$, we establish the localRRF, similar to the RF part in [2,5].
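To make the node objective of Eq. (1) concrete, the sketch below evaluates it for one candidate split of a small set of voting vectors. The function names are hypothetical, and the convention that pixels with a feature response below the threshold go to the left child is an assumption of this example.

```python
def sum_squared_deviation(vectors):
    """Sum of squared distances of the vectors to their mean (0 for an empty set)."""
    if not vectors:
        return 0.0
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return sum(sum((v[k] - mean[k]) ** 2 for k in range(dim)) for v in vectors)


def split_objective(votes, responses, threshold):
    """Objective I of Eq. (1) for one feature/threshold candidate.

    votes: voting vectors d_i of the pixels reaching the node
    responses: feature response of each pixel, in the same order
    """
    left = [d for d, r in zip(votes, responses) if r < threshold]
    right = [d for d, r in zip(votes, responses) if r >= threshold]
    return (sum_squared_deviation(votes)
            - sum_squared_deviation(left)
            - sum_squared_deviation(right))
```

A node would evaluate this value for each feature and threshold candidate in its pool and keep the pair with the largest result, so that pixels with similar voting vectors end up in the same child.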


If we continue training to depth $D^{II}$ with a restriction to pixels $S^{I}$ but additionally allow long range features with maximal offset $o^{II} > o^{I}$ and maximal size $s^{II} > s^{I}$, we get fAdaptRRF. Another way of introducing long range features, but still keeping the same set of pixels $S^{I}$, was proposed for segmentation in Peter et al. [7]. They optimize for each forest node the feature size and offset instead of the traditional greedy RF node training strategy. For later comparison, we have adapted the strategy from [7] for our localization task by training trees from root node to a maximal depth $D^{II}$ using this optimization. We denote it as PeterRRF. Finally, we propose two strategies where feature range and area from which to select pixels are increased in the second training phase. By continuing training to depth $D^{II}$, allowing in the second phase large scale features ($o^{II}$, $s^{II}$) and simultaneously extending the training pixels (set of pixels $S^{II}$) to the whole image, we get the fpAdaptRRF. Here $S^{II}$ is determined by randomly sampling from pixels uniformly distributed in the image. The second strategy uses a different set of pixels $S^{II}$, selected according to back-projection images computed from the first training phase. This concept is a main contribution of our work, therefore the next paragraph describes it in more detail.

2.2 Pixel Selection by Back-Projection Images

In the second training phase, pixels $S^{II}$ from locally similar structures are explicitly introduced, since they provide information that may help in disambiguation. We automatically identify similar structures by applying the RRF from the first phase on all training images in a testing step as described in Sect. 2.3. Thus, pixels from the area surrounding the landmark as well as pixels with locally similar appearance to the landmark end up in the first phase RRF's terminal nodes, since the newly introduced pixels are pushed through the first phase trees. The obtained accumulators show a high response on structures with a similar appearance compared to the landmark's local appearance (see Fig. 1c). To identify pixels voting for a high response, we calculate for each accumulator a back-projection image (see Fig. 1d), obtained by summing for each pixel $v$ all accumulator values at the target voting positions $v + d_l$ of all trees. We finalize our backProjRRF strategy by selecting for each tree training pixels $S^{II}$ as $N_{px}$ randomly sampled pixels according to a probability proportional to the back-projection image (see Fig. 1e).

2.3 Testing the RRF

During testing, all pixels of a previously unseen image are pushed through the RRF. Starting at the root node, pixels are passed recursively to the left or right child node according to the feature tests stored at the nodes until a leaf node is reached. The estimated location of the landmark L(v) is calculated based on the pixel’s position v and the relative voting vector dl stored in the leaf node l. However, if the length of voting vector |dl| is larger than radius R, i.e. pixel v is not in the area closely surrounding the landmark, the estimated location is omitted from the accumulation of the landmark location predictions. Separately for each landmark, the pixel’s estimations are stored in an accumulator image.

2.4 MRF Model

For multi-landmark localization, high-level knowledge about landmark configuration may be used to further improve disambiguation between locally similar structures. An MRF selects the best candidate for each landmark according to the RRF accumulator values and a geometric model of the relative distances between landmarks, see Fig. 1a. In the MRF model, each landmark Li corresponds to one variable while candidate locations selected as the Nc strongest maxima in the landmark’s accumulator determine the possible states of a variable. The landmark configuration is obtained by optimizing the energy function

$$E(L) = \sum_{i=1}^{N_L} U_i(L_i) + \sum_{\{i,j\} \in C} P_{i,j}(L_i, L_j), \qquad (2)$$

where unary term Ui is set to the RRF accumulator value of candidate Li and the relative distances of two landmarks from the training annotations define pairwise term Pi,j, modeled as normal distributions for landmark pairs in set C.
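The structure of Eq. (2) can be illustrated with a small toy sketch. It is not the authors' implementation: it assumes that the energy is maximized, that the pairwise term is the normal density of the Euclidean distance between two candidates, and it replaces the message passing solver mentioned in Sect. 3 with an exhaustive search that is only feasible for very few candidates. The names `energy`, `best_configuration`, `unary` and `pair_stats` are hypothetical.

```python
import itertools
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1D normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def energy(config, unary, pair_stats):
    """E(L) = sum_i U_i(L_i) + sum_{{i,j} in C} P_ij(L_i, L_j).

    config     -- tuple with one candidate position per landmark
    unary      -- unary[i][pos] is the accumulator value of candidate pos
    pair_stats -- {(i, j): (mean_distance, std_distance)} for pairs in C
                  (assumed form of the normal-distribution pairwise model)
    """
    total = sum(unary[i][pos] for i, pos in enumerate(config))
    for (i, j), (mean, std) in pair_stats.items():
        total += gaussian_pdf(math.dist(config[i], config[j]), mean, std)
    return total


def best_configuration(candidates, unary, pair_stats):
    """Exhaustive search over all candidate combinations (toy problems only)."""
    return max(itertools.product(*candidates),
               key=lambda cfg: energy(cfg, unary, pair_stats))


# Toy example: landmark 1 has a slightly stronger but geometrically
# implausible candidate at (30, 0); the pairwise term favours (10, 0).
candidates = [[(0, 0)], [(10, 0), (30, 0)]]
unary = [{(0, 0): 1.0}, {(10, 0): 0.8, (30, 0): 0.9}]
pair_stats = {(0, 1): (10.0, 2.0)}
print(best_configuration(candidates, unary, pair_stats))
```

In this toy setting the unary values and the density values are simply added; how the two terms are scaled against each other in practice is not specified in the text above.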

3 Experimental Setup and Results

We evaluate the performance of our landmark localization RRF variants on data sets of 2D hand X-ray images and 3D MR images of human teeth. As evaluation measure, we use the Euclidean distance between ground truth and estimated landmark position. To measure reliability, the number of outliers, defined as localization errors larger than 10 mm for hand landmarks and 7 mm for teeth, is calculated. For both data sets, which were normalized in intensities by performing histogram matching, we perform a three-fold cross-validation, splitting the data into 66 % training and 33 % testing data, respectively.

Hand Dataset consists of 895 2D X-ray hand images publicly available at Digital Hand Atlas Database¹. Because the images lack physical pixel resolution information, we assume a wrist width of 50 mm, resample the images to a height of 1250 pixels and normalize image distances according to the wrist width as defined by the ground-truth annotation of two landmarks (see Fig. 1a). For evaluation, NL = 37 landmarks, many of them showing locally similar structures, e.g. finger tips or joints between the bones, were manually annotated by three experts.

Teeth Dataset consists of 280 3D proton density weighted MR images of left or right side of the head. In the latter case, images were mirrored to create a consistent data set of images with 208 × 256 × 30 voxels and a physical resolution of 0.59 × 0.59 × 1 mm per voxel. Specifying their center locations, two wisdom teeth per data set were annotated by a dentist. Localization of wisdom teeth is challenging due to the presence of other locally similar molars (see Fig. 1g).

¹ Available from http://www.ipilab.org/BAAweb/, as of Jan. 2016.


Experimental setup: For each method described in Sect. 2, an RRF consisting of NT = 7 trees is built separately for every landmark. The first RRF phase is trained using pixels from training images within a range of R = 10 mm around each landmark position. The splitting criterion for each node is greedily optimized with NF = 20 candidate features and NT = 10 candidate thresholds except for PeterRRF. The random feature rectangles are defined by maximal size in each direction sI = 1 mm and maximal offset oI = R. In the second RRF phase, Npx = 10000 pixels are introduced and feature range is increased to a maximal feature size sII = 50 mm and offset in each direction oII = 50 mm. Treating each landmark independently on both 2D hands and 3D teeth dataset, the single-landmark experiments show the performance of the methods in case it is not feasible (due to lack of annotation) or semantically meaningful (e.g. third vs. other molars) to define all available locally similar structures. We compare our algorithms that start with local feature scale ranges and increase to more global scale ranges (localRRF, fAdaptRRF, PeterRRF, fpAdaptRRF, backProjRRF ) with reimplementations of two related works that start from global feature scale ranges (CriminisiRRF [1], with maximal feature size sII and offset oII from pixels uniformly distributed over the image) and optionally decrease to more local ranges (EbnerRRF [3]). First training phases stop for all methods at DI = 13, while the second phase continues training within the same trees until DII = 25. To ensure fair comparison, we use the same RRF parameters for all methods, except for the number of candidate features in PeterRRF, which was set to NF = 500 as suggested in [7]. Cumulative error distribution results of the single-landmark experiments can be found in Fig. 2. Table 1 shows quantitative localization results regarding reliability for all hand landmarks and for subset configurations (fingertips, carpals, radius/ulna). 
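The second-phase pixel selection of Sect. 2.2 (back-projection of the accumulator, followed by drawing Npx pixels with probability proportional to the back-projection image) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: images are represented as dictionaries from pixel position to value, each tree is reduced to a hypothetical mapping from pixel to its stored voting vector, and sampling is done with replacement, which the text does not specify.

```python
import random


def back_projection(accumulator, votes_per_tree):
    """Sum, for each pixel v, the accumulator values at its voting targets v + d_l.

    accumulator    -- {(x, y): value}, missing positions count as 0
    votes_per_tree -- one dict per tree, {(x, y): (dx, dy)} giving the voting
                      vector of the leaf reached by that pixel (hypothetical
                      stand-in for pushing pixels through the trees)
    """
    backproj = {}
    for votes in votes_per_tree:
        for (x, y), (dx, dy) in votes.items():
            target = (x + dx, y + dy)
            backproj[(x, y)] = backproj.get((x, y), 0.0) + accumulator.get(target, 0.0)
    return backproj


def sample_pixels(backproj, n_px, rng):
    """Draw n_px pixels with probability proportional to the back-projection.

    Sampling is with replacement (an assumption); rng.choices raises
    ValueError if all weights are zero.
    """
    pixels = list(backproj)
    weights = [backproj[p] for p in pixels]
    return rng.choices(pixels, weights=weights, k=n_px)


# Toy example with a single accumulator peak at (5, 5) and two trees.
accumulator = {(5, 5): 10.0}
votes_per_tree = [
    {(0, 0): (5, 5), (9, 9): (1, 1)},
    {(0, 0): (5, 5), (9, 9): (-4, -4)},
]
backproj = back_projection(accumulator, votes_per_tree)
selected = sample_pixels(backproj, 5, random.Random(0))
```

Pixels whose votes land on strong accumulator responses receive a large back-projection value and are therefore drawn more often, which is what steers the second training phase towards locally similar structures.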
Fig. 2. Cumulative localization error distributions for hand and teeth data sets. (Panels: hand dataset, teeth dataset; x-axis: error [mm]; y-axis: cumulative distribution; curves: CriminisiRRF, EbnerRRF, localRRF, PeterRRF, fAdaptRRF, fpAdaptRRF, backProjRRF.)

The multi-landmark experiments allow us to investigate the benefits of adding high level knowledge about landmark configuration via an MRF to the prediction. In addition to our reimplementation of the related works [1,3], Lindner et al. [5] applied their code onto our hand data set using DI = 25 in their implementation of the local RF stage. To allow a fair comparison with Lindner et al. [5], we modify our two training phases by training two separate forests for both stages until maximum depths DI = DII = 25, instead of continuing training trees of a single forest. Thus, we investigate our presented backProjRRF, the combination of backProjRRF with an MRF, localRRF combined with an MRF, and the two state of the art methods from Ebner et al. [3] (EbnerRRF) and Lindner et al. [5]. The MRF, which is solved by a message passing algorithm, uses Nc = 75 candidate locations (i.e. local accumulator maxima) per landmark as possible states of the MRF variables. Quantitative results on multi-landmark localization reliability for the 2D hand data set can be found in Table 1. Since all our methods including EbnerRRF are based on the same local RRFs, accuracy is the same with a median error of $\mu_E^{\mathrm{hand}}$ = 0.51 mm, which is slightly better than accuracy of Lindner et al. [5] ($\mu_E^{\mathrm{hand}}$ = 0.64 mm).

Table 1. Multi-landmark localization reliability results on hand radiographs for all landmarks and subset configurations (compare Fig. 1 for configuration colors).

4 Discussion and Conclusion

Single-landmark RRF localization performance is highly influenced by both the selection of the area from which training pixels are drawn and the range of hand-crafted features used to construct its forest decision rules, yet the exact influence is currently not fully understood. As shown in Fig. 2, the global CriminisiRRF method does not give accurate localization results (median error $\mu_E^{\mathrm{hand}}$ = 2.98 mm), although it shows the capability to discriminate ambiguous structures due to the use of long range features and training pixels from all over the image. As a reason for low accuracy we identified greedy node optimization, which favors long range features even at deep tree levels when no ambiguity among training pixels is present anymore. Our implementation of PeterRRF [7], which overcomes greedy node optimization by selecting the optimal feature range in each node, shows a strong improvement in localization accuracy ($\mu_E^{\mathrm{hand}}$ = 0.89 mm). Still, it is not as accurate as the method of Ebner et al. [3], which uses a local RRF with short range features in the second stage ($\mu_E^{\mathrm{hand}}$ = 0.51 mm), while also requiring a significantly larger number (around 25 times) of feature candidates per node. The drawback of EbnerRRF is essentially the same as for localRRF if the area from which local RRF training pixels are drawn, despite being reduced by the global RRF of the first stage, still contains neighboring, locally similar structures. To investigate the RRF's capability to discriminate ambiguous structures reliably while preserving high accuracy of locally trained RRFs, we switch the order of EbnerRRF stages, thus inverting their logic in the spirit of [2,5]. Therefore, we extended localRRF by adding a second training phase that uses long range features for accurate localization and differently selects areas from which training pixels are drawn. While increasing the feature range in fAdaptRRF shows the same accuracy compared to localRRF ($\mu_E^{\mathrm{hand}}$ = 0.51 mm), reliability is improved, but not as strongly as when introducing novel pixels into the second training phase. Training on novel pixels is required to make feature selection more effective in discriminating locally similar structures, but it is important to note that they do not participate in voting at testing time since the accuracy obtained in the first phase would be lost. With our proposed backProjRRF we force the algorithm to explicitly learn from examples which are hard to discriminate, i.e. pixels belonging to locally similar structures, as opposed to fpAdaptRRF, where pixels are randomly drawn from the image. Results in Fig. 2 reveal that highest reliability (0.172 % and 7.07 % outliers on 2D hand and 3D teeth data sets, respectively) is obtained by backProjRRF, while still achieving the same accuracy as localRRF. In a multi-landmark setting, RRF based localization can be combined with high level knowledge from an MRF or SSM as in [2,5]. Method comparison results from Table 1 show that our backProjRRF combined with an MRF model outperforms the state-of-the-art method of [5] on the hand data set in terms of accuracy and reliability. However, compared to localRRF our backProjRRF shows no benefit when both are combined with a strong graphical MRF model. In cases where such a strong graphical model is unaffordable, e.g. if expert annotations are limited (see subset configurations in Table 1), combining backProjRRF with an MRF shows much better results in terms of reliability compared to localRRF+MRF.
This is especially prominent in the results for radius and ulna landmarks. Moreover, Table 1 shows that even without incorporating an MRF model, the results of our backProjRRF are competitive to the state of the art methods when limited high level knowledge is available (fingertips, radius/ulna, carpals). Thus, in conclusion, we have shown the capability of RRF to successfully model locally similar structures by implicitly encoding global landmark configuration while still maintaining high localization accuracy.

References

1. Criminisi, A., Robertson, D., Konukoglu, E., Shotton, J., Pathak, S., White, S., Siddiqui, K.: Regression forests for efficient anatomy detection and localization in computed tomography scans. Med. Image Anal. 17(8), 1293–1303 (2013)
2. Donner, R., Menze, B.H., Bischof, H., Langs, G.: Global localization of 3D anatomical structures by pre-filtered hough forests and discrete optimization. Med. Image Anal. 17(8), 1304–1314 (2013)
3. Ebner, T., Stern, D., Donner, R., Bischof, H., Urschler, M.: Towards automatic bone age estimation from MRI: localization of 3D anatomical landmarks. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 421–428. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10470-6_53
4. Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A.: Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 262–270. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_33
5. Lindner, C., Bromiley, P.A., Ionita, M.C., Cootes, T.F.: Robust and accurate shape model matching using random forest regression-voting. IEEE Trans. PAMI 37, 1862–1874 (2015)
6. Lindner, C., Thomson, J., Consortium, T.O.G.E.N., Cootes, T.F.: Learning-based shape model matching: training accurate models with minimal manual input. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 580–587. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_69
7. Peter, L., Pauly, O., Chatelain, P., Mateus, D., Navab, N.: Scale-adaptive forest training via an efficient feature sampling scheme. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 637–644. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_78

Regressing Heatmaps for Multiple Landmark Localization Using CNNs

Christian Payer1(B), Darko Štern2, Horst Bischof1, and Martin Urschler2,3

1 Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria [email protected]
2 Ludwig Boltzmann Institute for Clinical Forensic Imaging, Graz, Austria
3 BioTechMed-Graz, Graz, Austria

Abstract. We explore the applicability of deep convolutional neural networks (CNNs) for multiple landmark localization in medical image data. Exploiting the idea of regressing heatmaps for individual landmark locations, we investigate several fully convolutional 2D and 3D CNN architectures by training them in an end-to-end manner. We further propose a novel SpatialConfiguration-Net architecture that effectively combines accurate local appearance responses with spatial landmark configurations that model anatomical variation. Evaluation of our different architectures on 2D and 3D hand image datasets show that heatmap regression based on CNNs achieves state-of-the-art landmark localization performance, with SpatialConfiguration-Net being robust even in case of limited amounts of training data.

C. Payer: This work was supported by the province of Styria (HTI:Tech for Med ABT08-22-T-7/2013-13) and the Austrian Science Fund (FWF): P 28078-N33.
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 230–238, 2016. DOI: 10.1007/978-3-319-46723-8_27

1 Introduction

Localization of anatomical landmarks is an important step in many medical image analysis tasks, e.g. for registration or to initialize segmentation algorithms. Since machine learning approaches based on deep convolutional neural networks (CNN) outperformed the state-of-the-art in many computer vision tasks, e.g. ImageNet classification [1], we explore in this work the capability of CNNs to accurately locate anatomical landmarks in 2D and 3D medical data. Inspired by the human visual system, neural networks serve as superior feature extractors [2] compared to hand-crafted filters as used e.g. in random forests. However, they involve increased model complexity by requiring a large number of weights that need to be optimized, which is only possible when a large amount of data is available to prevent overfitting. Unfortunately, acquiring large amounts of medical data is challenging, thus imposing practical limits on network complexity. Additionally, working with 3D volumetric data further increases the number of required network weights to be learned due to the added dimension of filter kernels. Demanding CNN training for 3D input was previously investigated for different applications [3–5]. In [3], knee cartilage segmentation of volumetric data is performed by three 2D CNNs, representing xy, yz, and xz planes, respectively. Despite not using full 3D information, it outperformed other segmentation algorithms. In [4], authors differently decompose 3D into 2D CNNs by randomly sampling n viewing planes of a volume for detection of lymph nodes. Finally, [5] presents deep network based 3D landmark detection, where first a shallow network generates multiple landmark candidates using a sliding window approach. Image patches around landmark candidates are classified with a subsequent deep network to reduce false positives. This strategy has the substantial drawback of not being able to observe global landmark configuration, which is crucial for robustly localizing locally similar structures, e.g. fingertips in Fig. 1. Thus, to get rid of false positives, state-of-the-art localization methods for multiple landmark localization rely on local feature responses combined with high level knowledge about global landmark configuration [6], in the form of graphical [7] or statistical shape models [8]. This widely used explicit incorporation has proven very successful due to strong anatomical constraints present in medical data. When designing CNN architectures these constraints could be used to reduce network complexity and allow training even on limited datasets. In this work, we investigate the idea of directly estimating multiple landmark locations from 2D or 3D input data using a single CNN, trained in an end-to-end manner. Exploring the idea of Pfister et al. [9] to regress heatmaps for landmarks simultaneously instead of absolute landmark coordinates, we evaluate different fully convolutional [10] deep network architectures that benefit from constrained relationships among landmarks. 
Additionally, we propose a novel architecture for multiple landmark localization inspired by latest trends in the computer vision community to design CNNs that implicitly encode high level knowledge as a convolution stage [11]. Our novel architecture thus emphasizes the CNN’s capability of learning features for both accurately describing local appearance as well as enforcing restrictions in possible spatial landmark configurations. We discuss benefits and drawbacks of our evaluated architectures when applied to two datasets of hand images (2D radiographs, 3D MRI) and show that CNNs achieve state-of-the-art performance compared to other multiple landmark localization approaches, even in the presence of limited training data.

Fig. 1. Multiple landmark localization by regressing heatmaps in a CNN framework.

Fig. 2. Schematic representation of the CNN architectures. Blue boxes represent images; the remaining symbols denote convolution, pooling, concatenation, upsampling, channel-by-channel convolution, channel-by-channel addition, and element-wise multiplication. Arrow thickness illustrates kernel sizes.

2 Heatmap Regression Using CNNs

As shown in Fig. 1, our approach for multiple landmark localization uses a CNN framework to regress heatmaps directly from input images. Similarly to [9], we represent heatmaps Hi as images where Gaussians are located at the position of landmarks Li. Given a set of input images and corresponding target heatmaps, we design different fully convolutional network architectures (see Fig. 2 for schematic representations), all capable of capturing spatial relationships between landmarks by allowing convolutional filter kernels to cover large image areas. After training the CNN architectures, final landmark coordinates are obtained as maximum responses of the predicted heatmaps. We propose three different CNN architectures inspired by the literature, which are explained in more detail in the following, while Sect. 2.1 describes our novel SpatialConfiguration-Net.

Downsampling-Net: This architecture (Fig. 2a) uses alternating convolution and pooling layers. Due to the involved downsampling, it is capable of covering large image areas with small kernel sizes. As a drawback of the low resolution of the target heatmaps, poor accuracy in localization has to be expected.

ConvOnly-Net: To overcome the low target resolution, this architecture (Fig. 2b) uses neither pooling layers nor strided convolution layers. Thus, much larger kernels are needed for observing the same area as the Downsampling-Net, which largely increases the number of network weights to optimize.

U-Net: The architecture (Fig. 2c) is slightly adapted from [12], since we replace maximum with average pooling in the contracting path. Also, instead of learning the deconvolution kernels, we use fixed linear upsampling kernels in the expanding path, thus obtaining a fully symmetric architecture. Due to the contracting and expanding path, the net is able to grasp a large image area using small kernel sizes while still keeping high accuracy.

Fig. 3. Illustration showing the combination of local appearance from landmark $L_i$ with transformed heatmaps $H_{i,j}^{\mathrm{trans}}$ from all other landmarks.

2.1 SpatialConfiguration-Net

Finally, we propose a novel, three block architecture (Fig. 2d), that combines local appearance of landmarks with the spatial configuration to all other landmarks. The first block of the network consists of three convolutional layers with small kernel sizes, that result in local appearance heatmaps $H_i^{\mathrm{app}}$ for every landmark $L_i$. Although these intermediate heatmaps are very accurate, they may contain ambiguities due to similarly looking landmarks, e.g. fingertips, as the kernels do not grasp information on the larger surrounding area. This ambiguity is reduced by combining the output of the appearance block with the prediction of our spatial configuration block. A sample heatmap calculation for one landmark is visualized in Fig. 3. The position of each landmark predicted by this block is based on the estimated locations of all remaining landmarks as obtained from the appearance block. Thus, the relative position of landmark $L_i$ according to $L_j$ is learned as a convolution kernel $K_{i,j}$, transforming the response of heatmap $H_j^{\mathrm{app}}$ into a heatmap $H_{i,j}^{\mathrm{trans}}$, that predicts the position of $L_i$. The transformed heatmap is defined as

$$H_{i,j}^{\mathrm{trans}} = H_j^{\mathrm{app}} * K_{i,j}, \qquad (1)$$

where $*$ denotes a convolution of $H_j^{\mathrm{app}}$ with $K_{i,j}$. Note that by not having any restriction on the kernels $K_{i,j}$, the net is able to learn the spatial configuration between landmarks on its own. For each $L_i$, the responses of the transformed heatmaps $H_{i,j}^{\mathrm{trans}}$ are accumulated resulting in a location estimation obtained from all other landmarks. This accumulated heatmap is defined as

$$H_i^{\mathrm{acc}} = \sum_{j=1}^{n} H_{i,j}^{\mathrm{trans}}. \qquad (2)$$

The final heatmap, which combines local appearance and spatial configuration between all other landmarks, is calculated as

$$H_i = H_i^{\mathrm{app}} \odot H_i^{\mathrm{acc}}, \qquad (3)$$

with $\odot$ the element-wise product. This suppresses locations from local appearance predictions that are infeasible due to the spatial configuration of landmarks. The spatial configuration block is calculated on a lower resolution, as kernels $K_{i,j}$ have to be very large to capture the spatial landmark configuration. However, a high resolution is not necessary for the spatial configuration block, as it is solely used to remove ambiguities. To preserve accuracy of the local appearance block, the outputs of the spatial configuration block are upsampled and the final heatmaps $H_i$ are calculated on the same resolution as the input.
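Equations (1)–(3) can be traced on a one-dimensional toy example. The sketch below is illustrative only and rests on several assumptions: heatmaps are plain 1D lists instead of 2D or 3D images, the kernels are hand-set shift kernels rather than learned ones, the sum in Eq. (2) is taken over all landmarks j ≠ i (following the phrase "all other landmarks"), and the downsampling and upsampling of the spatial configuration block are omitted. The helper names `convolve_same` and `combine` are hypothetical.

```python
def convolve_same(signal, kernel):
    """Zero-padded 1D convolution with a centred, odd-length kernel."""
    c = len(kernel) // 2
    out = [0.0] * len(signal)
    for x in range(len(signal)):
        for k in range(-c, c + 1):
            if 0 <= x - k < len(signal):
                out[x] += signal[x - k] * kernel[k + c]
    return out


def combine(h_app, kernels):
    """Combine appearance heatmaps with the spatial configuration.

    h_app[i]      -- appearance heatmap of landmark i
    kernels[i][j] -- kernel K_ij predicting landmark i from landmark j
                     (unused for j == i)
    """
    n = len(h_app)
    final = []
    for i in range(n):
        h_acc = [0.0] * len(h_app[i])
        for j in range(n):
            if j == i:
                continue  # assumption: only the other landmarks contribute
            h_trans = convolve_same(h_app[j], kernels[i][j])    # Eq. (1)
            h_acc = [a + t for a, t in zip(h_acc, h_trans)]     # Eq. (2)
        final.append([a * b for a, b in zip(h_app[i], h_acc)])  # Eq. (3)
    return final


# Landmark 0 has an ambiguous appearance response (peaks at 2 and 6),
# landmark 1 has a single peak at 4. Landmark 0 lies two positions to the
# left of landmark 1, which is encoded by simple shift kernels.
h_app = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
]
shift_left_2 = [1.0, 0.0, 0.0, 0.0, 0.0]   # moves a response by -2
shift_right_2 = [0.0, 0.0, 0.0, 0.0, 1.0]  # moves a response by +2
kernels = [[None, shift_left_2], [shift_right_2, None]]
final = combine(h_app, kernels)
```

After the element-wise product, the spurious peak of landmark 0 at position 6 is suppressed because no other landmark predicts landmark 0 at that position.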

3 Experimental Setup and Results

Materials: We evaluated localization performance of the network architectures on two different datasets. The first one consists of 895 publicly available X-ray images of hands¹ with an average size of 1563 × 2169, acquired with different X-ray scanners. 37 characteristic landmarks, e.g. finger tips or bone joints, were annotated manually by an expert. As the images do not contain information about physical pixel resolution, we assume a wrist width of 50 mm, defined by two annotated landmarks. The second dataset consists of 60 T1-weighted 3D gradient echo hand MR scans with 28 annotated landmarks. The average volume size was 294 × 512 × 72 with a voxel resolution of 0.45 × 0.45 × 0.9 mm³.

¹ Digital Hand Atlas Database System, http://www.ipilab.org/BAAweb/.

Experimental Setup: The 2D hand radiographs were acquired with various different X-ray scanners, resulting in large intensity variations. Histogram equalization was performed to adjust intensity values. Additionally, we preprocessed pixels by subtracting mean intensities and setting standard deviation equal to 1. For the 3D data, we only subtracted the mean since intensity variations were negligible. To augment the datasets, nine additional synthetic images were created for each image by applying rotation (up to 30°), translation (up to 1 cm), and intensity scaling/shifting (up to 10 % difference, only used in 2D). Heatmaps were created by centering Gaussians at landmark positions, normalized to maximum value of 1 and with σ ranging from 1.5 to 3 depending on heatmap size. To achieve best possible performance, we tuned each network architecture regarding kernel and layer size as well as number of outputs. All networks consist of standard layers (Fig. 2), i.e., convolution, pooling (average), concatenation, addition, multiplication, and fixed deconvolution (linear upsampling). In each network, the final convolution layer has the same number of outputs as landmarks and a kernel size of 1 without an activation function. All other convolution layers have a ReLU activation function and produce 128 intermediate outputs in 2D and 64 in 3D (except 3D ConvOnly-Net with 32 outputs, due to memory limitations). Additionally, all pooling layers use averaging and halve the output size in every dimension, while all linear upsampling layers double output size. The networks are structured as follows: The 2D ConvOnly-Net consists of 6 convolution layers with 11 × 11 kernels (3D: 6, 5 × 5 × 5). The Downsampling-Net is composed of multiple blocks containing two convolution layers followed by pooling. After the last block, two additional convolution layers are appended. In 2D we use 5 × 5 kernels and 2 downsampling blocks (3D: 3 × 3 × 3, 1 block). The U-Net consists of a contracting path, being equivalent to Downsampling-Net, and an expanding path, consisting of blocks of upsampling, concatenation with the same level output from the contracting path, and finally two convolution layers. In 2D we use 3 × 3 kernels with 4 down- and upsampling blocks (3D: 3 × 3 × 3, 3 blocks). The 2D SpatialConfiguration-Net consists of 3 convolution layers with 5 × 5 kernels, followed by the spatial configuration block, using 15 × 15 kernels with a downsampling factor of 1/8 (3D: 3 × 3 × 3, 3, and 9 × 9 × 5, factor 1/4). We evaluated the 2D dataset with three-fold cross-validation and additionally compared to results obtained with two other state-of-the-art localization methods of Lindner et al. [8], who applied their code to our dataset, and of Ebner et al. [13]. The 3D dataset evaluation used five cross-validation rounds splitting the dataset randomly into 43 training and 17 testing images, respectively, and we also compared our results to Ebner et al. [13]. 
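The heatmap targets described in the experimental setup above (a Gaussian centred at the landmark position, normalized to a maximum value of 1) and the read-out of landmark coordinates as the maximum response of a heatmap (Sect. 2) can be sketched in a few lines. This is a minimal illustration using nested lists; the function names and the (row, column) convention are assumptions, and the value of σ is just one choice from the stated range of 1.5 to 3.

```python
import math


def gaussian_heatmap(height, width, landmark, sigma):
    """Target heatmap with an isotropic Gaussian centred at landmark=(row, col).

    The Gaussian is not scaled by 1 / (2 * pi * sigma^2), so its peak is 1.
    """
    r0, c0 = landmark
    return [[math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)]


def heatmap_argmax(heatmap):
    """Return the (row, col) position of the maximum response."""
    best = max((value, r, c)
               for r, row in enumerate(heatmap)
               for c, value in enumerate(row))
    return best[1], best[2]


# One target heatmap per landmark; coordinates are recovered via the maximum.
target = gaussian_heatmap(8, 10, (3, 6), sigma=1.5)
print(heatmap_argmax(target))
```

For a network with several landmarks, one such heatmap would be produced per output channel, matching the statement that the final convolution layer has as many outputs as landmarks.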
We additionally evaluated the performance of U-Net and SpatialConfiguration-Net on a dataset with largely reduced number of images, to show the limits of these architectures in terms of required training data. Here, for the same three cross-validation setups as in the original 2D experiment, we used only 10 of the 597 annotated images and tested on the remaining 298. By excessive data augmentation on these 10 images we get to the same number of training images as used in the original experiment.

Results: All networks were trained from scratch using the Caffe framework [14]. We did not fine-tune networks pre-trained on large image databases, as no such network exists for 3D and converting 2D kernels to 3D is not straightforward. Our networks were optimized using stochastic gradient descent with L2-loss, momentum of 0.99, and a batch size of 5 for 2D and 2 for 3D inputs, respectively. The learning rate was set to 10^-5 for the ConvOnly- and Downsampling-Nets, and 10^-6 for the U- and SpatialConfiguration-Nets, with weight decay of 0.0005. The network biases were initialized with 0, the weight values drawn from a Gaussian distribution with a standard deviation of 0.01. Networks were trained until the testing error reached a plateau, which took between 15000 and 40000 iterations, depending on the architecture. We did not observe overfitting to the datasets as also the test error remained at a plateau. Even after decreasing learning rate, results did not improve any further. Training time was similar for all architectures, between 5 and 10 h per cross-validation round on a 12 GB RAM NVidia Geforce TitanX. Testing per image takes below 10 s, with down- and upsampling of in- and output consuming most of the time. Detailed localization results for 2D and 3D datasets are shown in Tables 1 and 2, also comparing performance of CNN architectures with the state-of-the-art.

Table 1. Localization results on 2D dataset containing 895 images with 37 landmarks, grouped as full and reduced dataset. #w shows the relative number of network weights.

Set   Method              Image height       Localization error (in mm)    #Outliers > 10 mm   #w
                          Input    Target    Median   Mean ± SD
Full  Downsampling-Net    256      64        1.85     1.96 ± 1.14          12 (0.036 %)        1.8
Full  ConvOnly-Net        128      128       1.13     1.29 ± 1.13          9 (0.027 %)         8.7
Full  U-Net               256      256       0.68     0.87 ± 1.05          15 (0.045 %)        2.0
Full  SpatialConf-Net     256      256       0.91     1.13 ± 0.98          12 (0.036 %)        1.0
Full  Lindner et al. [8]  1250     1250      0.64     0.85 ± 1.01          20 (0.060 %)        -
Full  Ebner et al. [13]   1250     1250      0.51     0.97 ± 2.45          228 (0.689 %)       -
Red   U-Net               256      256       1.24     3.29 ± 11.78         1175 (3.548 %)      2.0
Red   SpatialConf-Net     256      256       1.14     1.61 ± 3.43          120 (0.362 %)       1.0

Table 2. 3D localization results on 85 images with 28 landmarks per image.

Method              Image height       Localization error (in mm)    #Outliers > 10 mm
                    Input    Target    Median   Mean ± SD
Downsampling-Net    128      64        1.91     2.21 ± 2.82          16 (0.672 %)
ConvOnly-Net        128      128       1.10     8.17 ± 23.62         360 (15.126 %)
U-Net               128      128       1.01     1.18 ± 1.31          3 (0.126 %)
SpatialConf-Net     128      128       1.01     1.19 ± 1.48          3 (0.126 %)
Ebner et al. [13]   512      512       1.27     1.44 ± 1.51          6 (0.252 %)
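The outlier percentages in Tables 1 and 2 can be reproduced from the outlier counts if one assumes that they are relative to the total number of evaluated landmarks, i.e. images times landmarks per image (895 × 37 for the 2D dataset, 85 × 28 for the 3D dataset). This denominator is an inference from the table captions rather than something stated explicitly in the text; the helper name `outlier_rate` is hypothetical.

```python
def outlier_rate(n_outliers, n_images, n_landmarks):
    """Percentage of landmarks with a localization error above the threshold."""
    return 100.0 * n_outliers / (n_images * n_landmarks)


# 2D dataset (Table 1): 895 images with 37 landmarks each.
rates_2d = {
    "Downsampling-Net": round(outlier_rate(12, 895, 37), 3),
    "Ebner et al. [13]": round(outlier_rate(228, 895, 37), 3),
    "U-Net (reduced)": round(outlier_rate(1175, 895, 37), 3),
}

# 3D dataset (Table 2): 85 test images with 28 landmarks each.
rates_3d = {
    "Downsampling-Net": round(outlier_rate(16, 85, 28), 3),
    "ConvOnly-Net": round(outlier_rate(360, 85, 28), 3),
    "U-Net": round(outlier_rate(3, 85, 28), 3),
}

print(rates_2d)
print(rates_3d)
```

Under this assumption the computed values agree with the tabulated percentages, e.g. 12 outliers out of 33115 landmarks give 0.036 %.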

4 Discussion and Conclusion

Results of our experiments show that fully convolutional CNN architectures trained in an end-to-end manner are able to achieve state-of-the-art localization performance by regressing heatmaps. Despite using a much lower input and target heatmap resolution, the best-performing U-Net architecture still achieves the same accuracy as the method of Lindner et al. [8] on the 2D dataset, while all architectures have fewer outliers (see Table 1). On the 3D dataset (see Table 2), the U-Net and SpatialConfiguration-Net architectures achieve even better results than the method of Ebner et al. [13]. With a medium number of network weights, Downsampling-Net is capable of capturing the spatial configuration of the landmarks; however, since it involves downsampling, its accuracy is the worst among the compared architectures. ConvOnly-Net improves the accuracy, but it requires a high number of network weights, leading to the worst performance in terms of outliers when used for 3D data, where memory restrictions prevent large enough kernel sizes. We found that localization performance corresponds with target heatmap size, as emphasized by U-Net and SpatialConfiguration-Net showing the best results among compared architectures. In future work, we plan to also evaluate datasets with sparser landmarks and more spatial variation. By both accurately describing local appearance and enforcing restrictions in possible spatial landmark configurations, our novel SpatialConfiguration-Net architecture is able to achieve accurate localization performance with a low number of outliers, despite requiring the lowest number of network weights. While both achieve the same result for the 3D dataset, in the 2D experiment we found that U-Net performance is slightly better than that of SpatialConfiguration-Net; however, U-Net requires more network weights. When evaluating SpatialConfiguration-Net on the augmented training dataset where anatomical variation is defined from only 10 input images, it reveals its capability to model spatial landmark configuration, outperforming U-Net significantly in number of outliers. Thus, explicitly encoding spatial landmark configuration as in our novel SpatialConfiguration-Net proves to be a promising strategy for landmark localization in the presence of limited training data, as is usually the case when working with medical images.

References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
2. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2323 (1998)
3. Prasoon, A., Petersen, K., Igel, C., Lauze, F., Dam, E., Nielsen, M.: Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 246–253. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_31
4. Roth, H.R., Lu, L., Seff, A., Cherry, K.M., Hoffman, J., Wang, S., Liu, J., Turkbey, E., Summers, R.M.: A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8673, pp. 520–527. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10404-1_65
5. Zheng, Y., Liu, D., Georgescu, B., Nguyen, H., Comaniciu, D.: 3D deep learning for efficient and robust landmark detection in volumetric data. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 565–572. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_69
6. Liu, D., Zhou, K.S., Bernhardt, D., Comaniciu, D.: Search strategies for multiple landmark detection by submodular maximization. In: CVPR, pp. 2831–2838 (2010)
7. Donner, R., Menze, B.H., Bischof, H., Langs, G.: Global localization of 3D anatomical structures by pre-filtered Hough forests and discrete optimization. Med. Image Anal. 17(8), 1304–1314 (2013)
8. Lindner, C., Bromiley, P.A., Ionita, M.C., Cootes, T.: Robust and accurate shape model matching using random forest regression-voting. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1862–1874 (2015)
9. Pfister, T., Charles, J., Zisserman, A.: Flowing convnets for human pose estimation in videos. In: ICCV, pp. 1913–1921 (2015)


10. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
11. Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Semantic image segmentation via deep parsing network. In: ICCV, pp. 1377–1385 (2015)
12. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_28
13. Ebner, T., Stern, D., Donner, R., Bischof, H., Urschler, M.: Towards automatic bone age estimation from MRI: localization of 3D anatomical landmarks. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 421–428. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10470-6_53
14. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of ACM International Conference on Multimedia (MM 2014), pp. 675–678 (2014)

Self-Transfer Learning for Weakly Supervised Lesion Localization

Sangheum Hwang and Hyo-Eun Kim

Lunit Inc., Seoul, Korea
{shwang,hekim}@lunit.io

Abstract. Recent advances in deep learning have achieved remarkable performance in various computer vision tasks, including weakly supervised object localization. Weakly supervised object localization is practically useful since it does not require fine-grained annotations. Current approaches overcome the difficulties of weak supervision via transfer learning from models pre-trained on large-scale general image datasets such as ImageNet. However, they cannot be utilized in the medical image domain, in which no such priors exist. In this work, we present a novel weakly supervised learning framework for lesion localization named self-transfer learning (STL). STL jointly optimizes both classification and localization networks to help the localization network focus on correct lesions without any type of prior. We evaluate the STL framework on chest X-rays and mammograms, and achieve significantly better localization performance than previous weakly supervised localization approaches.

Keywords: Weakly supervised learning · Lesion localization · Convolutional neural networks

1 Introduction

Recently, deep convolutional neural networks (CNNs) have shown promising performance in various computer vision tasks such as classification [6,9], localization [2], and segmentation [12]. Among those tasks, object localization (lesion localization in medical images) is one of the most challenging problems. Object localization requires many training images with bounding-box (or pixel-level) annotations of regions of interest (ROIs). However, building a dataset with such location information demands heavy annotation effort. Annotation costs are even greater for medical images because only experts can interpret them. To alleviate this problem, several works on weakly supervised localization that use only a weakly labeled (i.e. image-level label) dataset have been proposed [10,11,14]. These approaches require models pre-trained on relatively well-localized datasets (e.g., ImageNet [3]) to transfer good initial features for localization. Therefore, we cannot expect good performance in the medical image domain, since we do not have such domain-specific, well-trained features.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 239–246, 2016. DOI: 10.1007/978-3-319-46723-8_28


In this work, we propose a self-transfer learning (STL) framework for weakly supervised lesion localization in medical images. STL co-optimizes the classification and localization networks simultaneously in order to guide the localization network with the most discriminative features in terms of the classification task (see Fig. 1). The proposed method requires neither location information nor any type of prior for training. We show that previous approaches without good initial features are not effective by themselves, since errors are back-propagated through a restricted path or with insufficient information.

Fig. 1. Overall architecture of STL ($F_{cls}$ denotes the fully connected classification layers; $C_{loc}$ and $P_{loc}$ denote a 1×1 convolutional layer and a global pooling layer, respectively). The final objective function $\mathrm{Loss}_{total}$ is a weighted sum of $\mathrm{Loss}_{cls}$ and $\mathrm{Loss}_{loc}$ with a controllable hyperparameter α. Self-transfer learning is realized by adaptively re-weighting α during training.

Related Work. We consider recent CNN-based methods that show promising performance on weakly supervised object localization [10,11,14,15]. Their common strategy is to produce activation maps (in other words, score maps) for each class, and to select or extract a representative activation value from each map. The dimensions of those maps are determined automatically by the network architecture. If such a network is trained well, a target object can be localized easily by examining the activation map corresponding to its class.

To select or extract the representative activation for each class, typical pooling methods can be used. In [10], global max pooling is used, and its classification and localization performance is verified in the domain of general images. Another choice is global average pooling. As discussed in [15], it may be better for localization, since global max pooling focuses on the most discriminative part only, while global average pooling discovers as many of those parts as possible. Inferring a segmentation map is more challenging than object localization, since it requires pixel-level classification. In [11], the authors adopt a Log-Sum-Exponential pooling method, a smooth version of max pooling, to explore the entire feature map. Smoothing priors are also considered to obtain fine-grained segmentation maps.

Those approaches can be interpreted as variants of multiple instance learning (MIL), which is designed for classification where labels are associated with sets of instances, called bags, instead of individual instances. In image classification tasks, the full-size image and its subsampled patches are considered as a bag and its instances, respectively. For instance, using global max pooling to select a representative value among the activations of patches is equivalent to using a single well-classified patch to build the decision boundary. Strictly speaking, however, current approaches are not generally applicable, since they essentially require well-trained features on semantically similar datasets.
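To make the difference between the three pooling choices discussed above concrete, the following NumPy sketch applies global max, global average and log-sum-exp pooling to a single activation map. It is only an illustration: the function names, the sharpness parameter `r` and its default value are assumptions and are not taken from the cited works' code.

```python
import numpy as np

def global_max_pool(score_map):
    """Representative value = the single strongest activation."""
    return float(np.max(score_map))

def global_avg_pool(score_map):
    """Representative value = mean over all spatial positions."""
    return float(np.mean(score_map))

def global_lse_pool(score_map, r=5.0):
    """Log-sum-exp pooling: (1/r) * log(mean(exp(r * s))).

    Large r behaves like max pooling, small r like average pooling.
    The maximum is subtracted before exponentiation for numerical stability.
    """
    s = np.asarray(score_map, dtype=np.float64).ravel()
    m = s.max()
    return float(m + np.log(np.mean(np.exp(r * (s - m)))) / r)

# Toy 15 x 15 activation map with one strongly activated position.
toy_map = np.zeros((15, 15))
toy_map[7, 7] = 4.0
pooled = (global_avg_pool(toy_map), global_lse_pool(toy_map), global_max_pool(toy_map))
```

For any positive `r`, the log-sum-exp value lies between the average and the maximum, which is why it can be read as a smooth compromise between the two.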

2 Self-Transfer Learning for Weakly Supervised Learning

STL consists of three main components: shared convolutional layers, fully connected layers (i.e. the classifier), and localization layers (i.e. the localizer) (see Fig. 1). The key features of STL are twofold. First, it simultaneously propagates errors backward from both the classifier and the localizer, to prevent the localizer from wandering the loss surface in search of a local optimum. Second, an adjustable hyperparameter α is introduced to control the relative importance of the classifier and the localizer. Two losses, $\mathrm{Loss}_{cls}$ from the classifier and $\mathrm{Loss}_{loc}$ from the localizer, are computed in the forward pass, and the weighted sum of those errors is propagated in the backward pass. The errors from the classifier contribute to training the filters in an overall view, while those from the localizer are backpropagated through the subsampled region that is the most important window for classifying the training set.

At the early stage of training, the errors from the classifier should be weighted more heavily than those from the localizer, to prevent the localizer from falling into a bad local optimum. By reducing the effect of the localizer's errors, filters with good discriminative power can be trained even when the localizer fails to find the objects associated with the class label. As training proceeds, the weight for the localizer increases, to focus on the subsampled region of the input image. At this stage, the network's filters are fine-tuned for the localization task.

Consider a data set of N input-target pairs $\{x_i, t_i\}_{i=1}^{N}$, where $x_i$ and $t_i$ denote the i-th image and the corresponding K-dimensional true label vector, respectively, and K is the number of classes. Assuming an image with a single class label, our objective function is a weighted sum of the cross-entropy losses from the classifier and the localizer:

\[
\mathrm{Loss}_{total} = (1-\alpha)\,\mathrm{Loss}_{cls} + \alpha\,\mathrm{Loss}_{loc}
= -(1-\alpha)\sum_{i=1}^{N} t_i^{\top}\log\big(y_i^{cls}\big) - \alpha\sum_{i=1}^{N} t_i^{\top}\log\big(y_i^{loc}\big) \tag{1}
\]

where $y_i^{cls}$ and $y_i^{loc}$ are the K-dimensional class probability vectors from the classifier and the localizer, respectively, for the i-th input, and log(·) denotes an element-wise logarithm.

The effect of the proposed STL can be explained by examining the backpropagation process at the end of the shared convolutional layers C. Suppose that node i is a particular node in C that is connected to H nodes in $F_{cls}$ and K nodes in $C_{loc}$. Note that $C_{loc}$ is obtained by a 1 × 1 convolution on C, as shown in Fig. 1, and K equals the number of activation maps (i.e. the number of classes).
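As a rough illustration of Eq. (1) and of the two branches in Fig. 1, the following NumPy sketch computes both losses for a single image (N = 1). It is not the authors' implementation: the single linear layer standing in for the fully connected classifier, the function names `stl_forward` and `alpha_schedule`, and the array shapes are assumptions made for illustration. The values 0.1, 0.9 and 60 epochs are the schedule reported in the experiments section; treating the epoch index as zero-based is an assumption.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(t, y, eps=1e-12):
    """-t^T log(y) for a one-hot target t and a probability vector y."""
    return float(-np.sum(t * np.log(y + eps)))

def stl_forward(C, t, W_cls, W_loc, alpha, pool="avg"):
    """Weighted STL loss for one image.

    C     : (D, H, W) output of the shared convolutional layers
    t     : (K,) one-hot label
    W_cls : (K, D*H*W) weights of a single linear classifier (assumed stand-in)
    W_loc : (K, D) weights of the 1x1 convolution producing K activation maps
    """
    # Classifier branch: flatten the shared features and classify.
    y_cls = softmax(W_cls @ C.ravel())
    # Localizer branch: a 1x1 convolution is a per-pixel linear map over channels.
    maps = np.einsum("kd,dhw->khw", W_loc, C)
    scores = maps.max(axis=(1, 2)) if pool == "max" else maps.mean(axis=(1, 2))
    y_loc = softmax(scores)
    loss_cls = cross_entropy(t, y_cls)
    loss_loc = cross_entropy(t, y_loc)
    total = (1.0 - alpha) * loss_cls + alpha * loss_loc
    return total, loss_cls, loss_loc, maps

def alpha_schedule(epoch, switch_epoch=60, early=0.1, late=0.9):
    """Step schedule for alpha: classifier-dominated first, localizer-dominated later."""
    return early if epoch < switch_epoch else late

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 15, 15))            # hypothetical shared feature maps
t = np.array([0.0, 1.0])                    # positive class, K = 2
W_cls = 0.01 * rng.normal(size=(2, 8 * 15 * 15))
W_loc = rng.normal(size=(2, 8))
total, loss_cls, loss_loc, maps = stl_forward(C, t, W_cls, W_loc, alpha_schedule(0))
```

In a real training loop both branches would be differentiated with respect to the shared filters; the sketch only shows how the two cross-entropy terms are combined.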


If the ReLU activation function is used for node i, the backpropagated error $\delta_i$ at node i can be written as follows:

\[
\delta_i = \max\big(0,\ \delta_i^{cls} + \delta_i^{loc}\big), \quad \text{where} \quad \delta_i^{cls} = \sum_{j=1}^{H} w_{ji}\,\delta_j, \qquad \delta_i^{loc} = \sum_{k=1}^{K} w_{ki}\,\delta_k \tag{2}
\]

It should be noted that the relative importance of the classifier and the localizer is already reflected in the errors $\delta_i^{cls}$ and $\delta_i^{loc}$ through the weighted loss function $\mathrm{Loss}_{total}$. Without $\delta_i^{cls}$, the errors $\delta_i^{loc}$ are backpropagated in an undesirable way because of the special treatment (global pooling) applied to the activation maps in $C_{loc}$. For instance, if global max pooling is used to aggregate the activations within each activation map and the location corresponding to node i in C is not selected as the maximum, all $\delta_k$ to be backpropagated from $C_{loc}$ will be zero. Therefore, the computed errors of most nodes in C, except those whose locations correspond to the maximal response of each activation map, will be zero. In the case of global average pooling, the zero errors are merely replaced with a mean of the errors. This situation is certainly not desirable, especially when we train the network from scratch (i.e. without pre-trained filters). By incorporating the classifier into the network architecture, the shared convolutional layers C can be improved consistently even if the backpropagated errors $\delta_i^{loc}$ from the localizer do not contribute to learning useful features.

It should be noted that STL is different from multi-task learning (MTL). They look similar because of the branching architecture and the multiple objectives. However, both branches of STL solve exactly the same task, and therefore STL does not need any extra supervision, whereas MTL jointly trains several tasks with separate losses. Therefore, it is more appropriate to see the classifier in STL as an auxiliary component for the successful training of the localizer.
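The argument about global pooling can be checked numerically. The sketch below, a minimal illustration with hypothetical function names, computes the gradient that a scalar upstream error sends back to each position of one activation map under global max pooling and under global average pooling.

```python
import numpy as np

def grad_global_max_pool(act_map, upstream):
    """Gradient of max(act_map) w.r.t. each position: only the arg-max receives the error."""
    act_map = np.asarray(act_map, dtype=np.float64)
    grad = np.zeros_like(act_map)
    idx = np.unravel_index(np.argmax(act_map), act_map.shape)
    grad[idx] = upstream
    return grad

def grad_global_avg_pool(act_map, upstream):
    """Gradient of mean(act_map): every position receives the same averaged error."""
    act_map = np.asarray(act_map, dtype=np.float64)
    return np.full(act_map.shape, upstream / act_map.size)

act = np.arange(225, dtype=np.float64).reshape(15, 15)   # toy 15 x 15 activation map
g_max = grad_global_max_pool(act, upstream=1.0)
g_avg = grad_global_avg_pool(act, upstream=1.0)
n_updated_max = int(np.count_nonzero(g_max))   # positions that receive any error
n_updated_avg = int(np.count_nonzero(g_avg))
```

With max pooling a single position out of 225 receives a non-zero error, while average pooling spreads one averaged value over all positions regardless of their content, which matches the two cases described in the text.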

3 Computational Experiments

In this section we use two medical image datasets, chest X-rays (CXRs) and mammograms, to evaluate the classification and localization performance of STL. All training CXRs and mammograms are resized to 500 × 500. The network architecture used in this experiment is a slightly modified version of the network from [9]¹. For the localizer, 15 × 15 activation maps for each class are obtained via a 1 × 1 convolution. Two global pooling methods, max pooling [10] and average pooling [15], are applied to the activation maps. The network is trained via stochastic gradient descent with momentum 0.9 and a minibatch size of 64. STL has an additional hyperparameter α that determines the relative importance of the classifier and the localizer. We set its initial value to 0.1, so that the network focuses more on learning representative features at the early stage, and increase it to 0.9 after 60 epochs to fine-tune the localizer.

¹ We add one convolutional layer (i.e. the 6th convolutional layer) since the resolution of the input image is relatively high compared to the input images for [9].

To compare classification performance, the area under the receiver operating characteristic curve (AUC), accuracy, and average precision (AP) of each class are used. For STL, the class probabilities obtained from the localizer are used for measuring performance. For the localization task, a performance metric similar to that in [10] is used. It is based on AP, but differs in how true positives and false positives are counted. In classification, a test image is a true positive if its class probability exceeds some threshold. Under the localization metric, a test image whose class probability is greater than the threshold (i.e. a true positive in the classification sense) but whose maximal response in the activation map does not fall within the ground-truth annotations, allowing some tolerance, is counted as a false positive. In our experiment, only the positive class is considered for localization AP, since there is no ROI in the negative class. First, the activation map of the positive class is resized to the size of the original image via simple bilinear interpolation; then it is examined whether the maximal response falls within the ground-truth annotations with a tolerance of 16 pixels, which is half of the global stride of 32 of the considered network architecture. If the response is located inside the true annotations, the test image is counted as a true positive. If not, it is counted as a false positive.

Tuberculosis Detection. We use three CXR datasets, namely the KIT, Shenzhen and MC sets, in this experiment. All the CXRs used in this work were de-identified by the corresponding image providers. The KIT set contains 10,848 DICOM images, consisting of 7,020 normal and 3,828 abnormal (TB) cases, from the Korean Institute of Tuberculosis. The Shenzhen² and MC³ sets are available for research purposes only from the authors of [1,7,8]. We train the models using the KIT set, and test the classification and localization performance using the Shenzhen and MC sets. Since the test sets (Shenzhen and MC) do not contain any annotations for TB lesions, we obtained detailed annotations from a TB clinician to evaluate the localization performance.
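The localization criterion described above (resize the positive-class activation map, take its maximal response, and accept it if it lies within 16 pixels of the annotation) can be sketched as follows. This is an illustrative reading rather than the evaluation code used in the paper: the corner-aligned bilinear resize, the use of Euclidean distance to the nearest annotated pixel, and the function names are all assumptions.

```python
import numpy as np

def bilinear_resize(a, out_h, out_w):
    """Simple bilinear interpolation of a 2-D array (corner-aligned sampling grid)."""
    a = np.asarray(a, dtype=np.float64)
    in_h, in_w = a.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = a[y0][:, x0] * (1 - wx) + a[y0][:, x1] * wx
    bottom = a[y1][:, x0] * (1 - wx) + a[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def is_localization_hit(act_map, gt_mask, tolerance=16.0):
    """True if the maximal response lies within `tolerance` pixels of the annotation."""
    gt_mask = np.asarray(gt_mask, dtype=bool)
    h, w = gt_mask.shape
    resized = bilinear_resize(act_map, h, w)
    py, px = np.unravel_index(np.argmax(resized), resized.shape)
    gy, gx = np.nonzero(gt_mask)
    if gy.size == 0:            # no annotated lesion: nothing to hit
        return False
    dist = np.sqrt((gy - py) ** 2 + (gx - px) ** 2).min()
    return bool(dist <= tolerance)

# Toy example: a 15 x 15 map with one peak, evaluated on a 500 x 500 image.
act = np.zeros((15, 15))
act[3, 10] = 1.0
near_mask = np.zeros((500, 500), dtype=bool)
near_mask[120:140, 350:360] = True      # roughly 13 pixels below the resized peak
far_mask = np.zeros((500, 500), dtype=bool)
far_mask[300:330, 100:140] = True
hit_near = is_localization_hit(act, near_mask)
hit_far = is_localization_hit(act, far_mask)
```

A test image that passes the classification threshold would then be counted as a true positive when the check returns `True` and as a false positive otherwise.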

Fig. 2. Training curves and first-layer filters at 5000 iterations in the case of average pooling.

² The Shenzhen set has 326 normal and 336 TB cases from Shenzhen No. 3 People's Hospital, Guangdong Medical College, Shenzhen, China.

³ The MC set is from the National Library of Medicine, National Institutes of Health, Bethesda, MD, USA. It consists of 80 normal and 58 TB cases.


Table 1. Classification and localization performances for CXRs and mammograms (subscripts + and − denote the positive and negative class, respectively; Loc. denotes localization)

Test set    | Shenzhen set                 | MC set                       | MIAS set
Task        | Classification          Loc. | Classification          Loc. | Classification          Loc.
Metric      | ACC   AUC   AP+   AP−   AP   | ACC   AUC   AP+   AP−   AP   | ACC   AUC   AP+   AP−   AP
MaxPool     | .786  .867  .892  .814  .698 | .725  .805  .809  .797  .602 | .531  .486  .322  .653  -
AvePool     | .787  .907  .924  .888  .690 | .703  .784  .752  .804  .512 | .662  .633  .544  .716  .095
STL+MaxPool | .834  .917  .934  .900  .789 | .768  .863  .848  .883  .796 | .665  .536  .435  .664  .149
STL+AvePool | .837  .927  .943  .906  .872 | .841  .890  .884  .892  .807 | .696  .675  .575  .761  .326

Table 1 summarizes the experimental results. For both classification and localization tasks, STL consistently outperforms the other methods. The best-performing model is STL+AvePool. Global average pooling works well for localization, which is consistent with [15]. Since the localization AP is always less than the classification AP (by the definition of the measure), it is helpful to look at the improvement ratio for performance comparison. Regardless of the pooling method, the localization APs for both the Shenzhen and MC sets improve much more over the baselines (i.e. MaxPool and AvePool) than the classification APs do. This means that STL certainly assists the localizer in finding the most important ROIs that define the class label. Figure 2 clearly shows the advantages of STL: faster training and better feature learning. Localization examples from the test sets are visualized in Fig. 3.

Mammography. We use two public mammography databases, the Digital Database for Screening Mammography (DDSM) [4,5] and the Mammographic Image Analysis Society (MIAS) database [13]. DDSM and MIAS are used for training and testing, respectively. We preprocess the DDSM images to have two labels, positive (abnormal) and negative (normal). Originally, abnormal mammographic images contain several types of abnormalities such as masses, microcalcifications, etc. We merge all types of abnormalities into the positive class to distinguish any abnormality from normal; thus the training set (DDSM) contains 4,025 positive and 6,338 negative images. The test set (MIAS) contains 112 positive and 210 negative images. Note that we do not use any information other than image-level labels for training, although the training set has boundary information for abnormal ROIs. The boundary information of the test set is used to evaluate the localization performance.

Table 1 reports the classification and localization results⁴. As we can see, classification of mammograms is much more difficult than TB detection. First of all, the mammograms used for training are low-quality images that contain some degree of artifact and distortion generated by the scanning process that creates digital images from films. Moreover, this task is inherently complicated, since there also exist quite a few irregular patterns in the negative class caused by the various shapes and characteristics of normal tissues. Nevertheless, it is confirmed that STL is significantly better than the other methods for both classification and localization. Again, regardless of the pooling method, the localization performance improves much more over the baselines than the classification performance does. Figure 3 shows some localization examples from the test set.

⁴ For global max pooling without STL, the training loss does not decrease at all, i.e., the network cannot be trained. Therefore, its localization performance is not reported in Table 1.

Fig. 3. Localization examples for chest X-rays and mammograms. The top row shows test images with ground-truth annotations. The rows below show the results from MaxPool, AvePool, STL+MaxPool and STL+AvePool, in that order. The activation map for the positive class is linearly scaled to the range between 0 and the maximum probability.
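The improvement-ratio comparison used in the discussion of Table 1 can be reproduced from the table's values. The text does not define the ratio precisely, so the sketch below assumes a relative improvement, (STL − baseline) / baseline, and assumes that the classification AP to compare against is the positive-class AP (AP+); the dictionary layout and function names are illustrative only.

```python
# (classification AP+, localization AP) taken from Table 1; None where not reported.
TABLE1 = {
    "Shenzhen": {"MaxPool": (0.892, 0.698), "AvePool": (0.924, 0.690),
                 "STL+MaxPool": (0.934, 0.789), "STL+AvePool": (0.943, 0.872)},
    "MC":       {"MaxPool": (0.809, 0.602), "AvePool": (0.752, 0.512),
                 "STL+MaxPool": (0.848, 0.796), "STL+AvePool": (0.884, 0.807)},
    "MIAS":     {"MaxPool": (0.322, None), "AvePool": (0.544, 0.095),
                 "STL+MaxPool": (0.435, 0.149), "STL+AvePool": (0.575, 0.326)},
}

def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline` (assumed definition of the ratio)."""
    return (improved - baseline) / baseline

def improvement_ratios(test_set, pooling):
    """Return (classification gain, localization gain) of STL over its baseline."""
    base_cls, base_loc = TABLE1[test_set][pooling]
    stl_cls, stl_loc = TABLE1[test_set]["STL+" + pooling]
    cls_gain = relative_improvement(base_cls, stl_cls)
    loc_gain = None if base_loc is None else relative_improvement(base_loc, stl_loc)
    return cls_gain, loc_gain

ratios = {(s, p): improvement_ratios(s, p)
          for s in TABLE1 for p in ("MaxPool", "AvePool")}
```

Under these assumptions the localization gain exceeds the classification gain for every reported pair, which is the pattern the text describes.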

4 Conclusion

In this work, we propose a novel framework, STL, which enables training a CNN for lesion localization with neither location information nor pre-trained models. Our framework jointly learns the classifier and the localizer using a weighted loss as the objective function, in order to prevent the localizer from falling into a bad local optimum. Self-transfer is realized via a weight controlling the relative importance of the classifier and the localizer. We also discuss the effect of the classifier on the localizer to provide the rationale behind the advantages of the proposed framework. Computational experiments on lesion localization given only image-level labels show that the proposed framework outperforms existing approaches in terms of both classification and localization performance metrics.


References

1. Candemir, S., et al.: Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Trans. Med. Imaging 33(2), 577–590 (2014)
2. Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 411–418. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_51
3. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
4. Heath, M., Bowyer, K., Kopans, D., Kegelmeyer Jr., P., Moore, R., Chang, K., Munishkumaran, S.: Current status of the digital database for screening mammography. In: Proceedings of the Fourth International Workshop on Digital Mammography, pp. 457–460 (1998)
5. Heath, M., Bowyer, K., Kopans, D., Moore, R., Kegelmeyer, W.P.: The digital database for screening mammography. In: Proceedings of the 5th International Workshop on Digital Mammography, pp. 212–218 (2000)
6. Hwang, S., Kim, H.E., Jeong, J., Kim, H.J.: A novel approach for tuberculosis screening based on deep convolutional neural networks. In: Proceedings of SPIE Medical Imaging (2016)
7. Jaeger, S., Karargyris, A., Candemir, S., Siegelman, J., Folio, L., Antani, S., Thoma, G.: Automatic screening for tuberculosis in chest radiographs: a survey. Quant. Imaging Med. Surg. 3(2), 89–99 (2013)
8. Jaeger, S., et al.: Automatic tuberculosis screening using chest radiographs. IEEE Trans. Med. Imaging 33(2), 233–245 (2014)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
10. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Is object localization for free? Weakly-supervised learning with convolutional neural networks. In: CVPR, pp. 685–694 (2015)
11. Pinheiro, P.O., Collobert, R.: From image-level to pixel-level labeling with convolutional networks. In: CVPR, pp. 1713–1721 (2015)
12. Roth, H.R., Lu, L., Farag, A., Shin, H.-C., Liu, J., Turkbey, E.B., Summers, R.M.: DeepOrgan: multi-level deep convolutional networks for automated pancreas segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 556–564. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_68
13. Suckling, J., Parker, J., Dance, D., Astley, S., Hutt, I., Boggis, C., Ricketts, I., Stamatakis, E., Cerneaz, N., Kok, S., et al.: The mammographic image analysis society digital mammogram database. Excerpta Medica Int. Cong. Ser. 1069, 375–378 (1994)
14. Wu, J., Yu, Y., Huang, C., Yu, K.: Deep multiple instance learning for image classification and auto-annotation. In: CVPR, pp. 3460–3469 (2015)
15. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. arXiv preprint (2015). arXiv:1512.04150

Automatic Cystocele Severity Grading in Ultrasound by Spatio-Temporal Regression

Dong Ni¹, Xing Ji¹, Yaozong Gao², Jie-Zhi Cheng¹, Huifang Wang³, Jing Qin⁴, Baiying Lei¹, Tianfu Wang¹, Guorong Wu², and Dinggang Shen²

¹ National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University, Shenzhen, China
[email protected]
² Department of Radiology and BRIC, UNC at Chapel Hill, Chapel Hill, NC 27599, USA
³ Department of Ultrasound, Shenzhen Second People's Hospital, Shenzhen, China
⁴ School of Nursing, Centre for Smart Health, The Hong Kong Polytechnic University, Kowloon, Hong Kong

Abstract. Cystocele is a common disease in women. Accurate assessment of cystocele severity is very important for choosing treatment options. Transperineal ultrasound (US) has recently emerged as an alternative tool for cystocele grading. Cystocele severity is usually evaluated by manually measuring the maximal descent of the bladder (MDB) relative to the symphysis pubis (SP) during the Valsalva maneuver. However, this process is time-consuming and operator-dependent. In this study, we propose an automatic scheme for cystocele grading from transperineal US video. A two-layer spatio-temporal regression model is proposed to identify the middle axis and lower tip of the SP and to segment the bladder, which are essential tasks for measuring the MDB. Both appearance and context features are extracted in the spatio-temporal domain to help the anatomy detection. Experimental results on 85 transperineal US videos show that our method significantly outperforms the state-of-the-art regression method.

Keywords: Ultrasound · Regression · Spatio-temporal · Cystocele

1 Introduction

Cystocele is a common disease in women that occurs when the bladder bulges into the vagina due to defects in pelvic support. Accurate assessment of cystocele severity is very important for choosing treatment options, which can range from no treatment for a mild case to surgery for a serious case. The Pelvic Organ Prolapse Quantification system (POP-Q) is widely used for cystocele diagnosis [1]. This evaluation system involves many complicated procedures and may be clinically inefficient [2]. Recently, transperineal ultrasound (US) has emerged as a new and effective tool for cystocele diagnosis owing to its advantages of no radiation exposure, minimal discomfort, cost-effectiveness and real-time imaging capability [3].

Generally, the US examination for cystocele includes four steps [4] (Fig. 1). First, a radiologist holds the US probe steadily on the patient while asking the patient to perform the Valsalva maneuver. Then, an image frame containing the maximal descent of the bladder (MDB) relative to the symphysis pubis (SP) is manually selected from the US video. Next, the MDB is manually measured as the distance from the lowest point of the bladder to the reference line. Finally, with the measured MDB, the degree of cystocele severity is graded as normal, mild, moderate, or severe. In these steps, frame selection and manual measurement are time-consuming and experience-dependent, which often leads to significant inter-observer grading variations [5]. Therefore, automatic methods for cystocele grading may help to improve diagnostic efficiency and decrease inter-observer variability.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 247–255, 2016. DOI: 10.1007/978-3-319-46723-8_29

Fig. 1. Illustration of the MDB measurement. Several US snapshots acquired during the Valsalva maneuver are listed in the upper row. The lower row shows the process of MDB measurement. The MDB (in green, sub-figure (c)) is measured as the distance between the reference line (RL, in blue) and the lowest point of the bladder (BL) relative to the RL. The RL originates from the lower tip of the SP, and its direction is 135 degrees clockwise from the middle axis of the SP.
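The geometric definition in the caption of Fig. 1 can be sketched in a few lines of NumPy. This is only one possible reading of the caption, under stated assumptions: image coordinates with x to the right and y downwards, a middle-axis direction vector whose orientation along the SP is chosen by the caller, and a sign convention in which points on the positive side of the reference line count as descended. The function names are hypothetical.

```python
import numpy as np

def reference_line_direction(sp_axis_dir, angle_deg=135.0):
    """Rotate the SP middle-axis direction by `angle_deg` clockwise (as seen on screen).

    With x to the right and y downwards, the standard rotation matrix below
    turns a vector clockwise on screen.
    """
    u = np.asarray(sp_axis_dir, dtype=np.float64)
    u = u / np.linalg.norm(u)
    th = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(th), -np.sin(th)],
                    [np.sin(th),  np.cos(th)]])
    return rot @ u

def signed_distances_to_line(points, line_point, line_dir):
    """Signed perpendicular distance of each (x, y) point to the line through `line_point`."""
    rel = np.asarray(points, dtype=np.float64) - np.asarray(line_point, dtype=np.float64)
    normal = np.array([-line_dir[1], line_dir[0]])   # assumed positive side of the RL
    return rel @ normal

def bladder_descent(bladder_points, sp_lower_tip, sp_axis_dir):
    """Largest signed distance of the bladder contour to the reference line in one frame."""
    rl_dir = reference_line_direction(sp_axis_dir)
    d = signed_distances_to_line(bladder_points, sp_lower_tip, rl_dir)
    k = int(np.argmax(d))
    return float(d[k]), k

# Toy frame: the axis direction is chosen so that the reference line is horizontal.
axis_dir = (-np.sqrt(0.5), -np.sqrt(0.5))
contour = [(1.0, 2.0), (3.0, 5.0), (-2.0, 1.0)]
descent, lowest_idx = bladder_descent(contour, sp_lower_tip=(0.0, 0.0), sp_axis_dir=axis_dir)
```

Applied to every frame of a Valsalva video, the maximum of the per-frame values would give the MDB in pixels; converting it to a severity grade requires thresholds that are not given in this section.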

As shown in Fig. 1, identifying the middle axis and lower tip of the SP and segmenting the bladder in US images are necessary tasks for severity grading. However, these tasks are very challenging. First, due to the vagueness of US images, localizing the SP and its lower tip is very difficult, even for a senior radiologist. Second, the missing or weak boundaries of the bladder that result from acoustic attenuation, speckle and shadows make the segmentation task difficult. Third, the image appearance, geometry and shape of the anatomical structures vary significantly across the US image series of the Valsalva maneuver because of the forced exhalation. They also vary significantly from subject to subject. These large variations impose additional difficulty on our automation goal.

In this study, a novel spatio-temporal regression model is proposed to address these three challenges for the automatic analysis of transperineal US video and cystocele grading. The technical contributions of this work are summarized as follows. First, to our knowledge, this is the first study that performs computerized grading of cystocele severity with transperineal US. Second, we propose a two-layer spatio-temporal regression model for context-aware detection of anatomical structures at all time points jointly. In our proposed model, both appearance and context features are extracted in the spatio-temporal domain to impose temporal consistency along the temporal displacement maps, so that the detection results can help each other to alleviate ambiguity and refine structure localization.

2 Method

For the automatic grading of cystocele severity, we first train the two-layer spatio-temporal regression models for identifying the middle axis and lower tip of the SP and for segmenting the bladder in US images. With the trained models, the descent of the bladder relative to the SP is measured in all image frames of a Valsalva maneuver US video. The MDB can then be obtained from the estimated distance measurements over all US frames for cystocele grading.

2.1 The Proposed Spatio-Temporal Regression Model

Random forest [6] is an ensemble learning technique with good generalization capability [7]. This technique has been successfully applied in many medical image analysis tasks, e.g., landmark detection, organ segmentation and localization [8–10]. Here we employ random forests to train the two-layer spatio-temporal regression models for the detection of target structures in US videos. To build a random forest, multiple decision trees are constructed by randomly sampling the training data and features for each tree to avoid over-fitting. The final regression result, $P(d_s \mid v)$, is then obtained by averaging the predictions of the T trees, $p_i(d_s \mid v)$:

\[
P\big(d_s(x) \mid v(x)\big) = \frac{1}{T}\sum_{i=1}^{T} p_i\big(d_s(x) \mid v(x)\big) \tag{1}
\]

where x is the image pixel, v is the feature vector and ds is the distance of x to the target structure s, and s ∈ {l, t, b}. The target structures l, t and b represent the middle axis and lower tip of the SP and the bladder, respectively. As shown in Fig. 2, we train one regression forest for each target structure s, to learn its specific non-linear mapping from each pixel’s local appearance and geometry to its 2D displacement vector towards the specific structure. Specifically, the first layer is designed to provide the initial displacement field for each time point by using the appearance and coordinates features from neighboring US images, while the second layer is designed to refine the detection result in spatio-temporal domain (a 2D+t neighborhood) by using contexture features from the results in the first layer. First-Layer Regression. The SP appears like a large bright ridge with two dark valleys around in US images (see Fig. 1), whereas a bladder is depicted

D. Ni et al.

Fig. 2. The flowchart of the proposed two-layer spatio-temporal regression model.

with hypoechogenicity in sonography because of its fluid content. Accordingly, contrast features should be informative for modeling these structures. Furthermore, the correlation between neighboring US frames can be utilized as a temporal consistency constraint on the displacement field. In this regard, we compute randomized Haar-like features [11] at different scales in the spatio-temporal domain to describe the intensity patterns and the contrast of the target structures, and to boost anatomy detection at the current time point with additional temporal cues from the previous and next time points. We also use normalized coordinates as input features. With these features, we train the regression forest to seek a reliable nonlinear mapping that gives the displacement vector of a pixel to each target structure (the middle axis of the SP, the lower tip of the SP, and the bladder), denoted as dl, dt, and db, respectively. The definitions of the displacement maps for the three target structures are shown in Fig. 3.

Second-Layer Regression. We first use the trained first-layer regression forest to estimate an initial displacement map at each time point. Thus, for each image pixel, we have not only appearance features but also additional high-level context features [12] from the initial displacement map at the current time point and from the displacement maps at all other time points. All these

Automatic Cystocele Severity Grading in Ultrasound

features are used to train the second-layer regression forest jointly. Specifically, our context features are again calculated as Haar-like features from local patches in the displacement maps. Two types of context features are extracted: (1) Within-time-point context features are the Haar-like features extracted within the displacement map of each structure. These features convey the structure locations estimated from nearby pixels, and can be used to spatially regularize the whole displacement map of each structure. (2) Across-time-point context features are the Haar-like features extracted from the displacement maps of the same structure at other time points. These features encode the temporal relationship along time, i.e., the trajectory of the structure. Thus, the use of across-time-point context features can effectively impose temporal consistency on the displacement field. With the augmented feature vector, we perform the random forest regression again to approach the target distance spaces of dl, dt, and db.

2.2

Cystocele Severity Grading

With the two-layer random forest regressors, the middle axis and lower tip of the SP and the bladder contour can be inferred for the MDB measurement and severity grading. We first generate the displacement maps of the three target structures from the testing sonography. The voting maps for the lower tip and middle axis of the SP are then obtained by applying the voting strategy in [8] to the corresponding displacement maps. Next, the lower tip of the SP is identified as the location with the most votes in its voting map. The middle axis of the SP is then delineated by seeking the line that originates from the lower tip and has the maximal average voting response. For the bladder segmentation in the testing sonography, the bladder boundary is obtained simply by finding the zero level set of its displacement map. Once the three target anatomies are defined, we calculate the MDB from the consecutive US images (Fig. 1). Then, we categorize the severity degree of cystocele into normal, mild, moderate, and severe by adopting the MDB thresholds recommended in [13].
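The final two steps of this pipeline (taking the maximum descent over all frames, then thresholding it into a grade) can be illustrated with a minimal sketch. It assumes that a larger per-frame value means a larger descent of the bladder relative to the SP. The cut-off values are purely hypothetical placeholders, since the thresholds of [13] are not reproduced in this paper, and the function names are illustrative only.

```python
# Hypothetical upper bounds in mm (NOT the thresholds of [13], which are not
# given in the text). A measurement above the last bound is graded "severe".
HYPOTHETICAL_UPPER_BOUNDS_MM = [(0.0, "normal"), (10.0, "mild"), (20.0, "moderate")]


def maximum_descent(per_frame_descent_mm):
    """Return (frame_index, value) of the largest descent in one video."""
    if not per_frame_descent_mm:
        raise ValueError("need at least one per-frame measurement")
    best = max(range(len(per_frame_descent_mm)),
               key=lambda k: per_frame_descent_mm[k])
    return best, per_frame_descent_mm[best]


def grade_cystocele(mdb_mm, upper_bounds=HYPOTHETICAL_UPPER_BOUNDS_MM):
    """Map an MDB value to a severity label using ordered upper bounds."""
    for upper, label in upper_bounds:
        if mdb_mm <= upper:
            return label
    return "severe"


# Usage: four frames of a (made-up) Valsalva video.
frame, mdb = maximum_descent([-4.0, 2.5, 12.0, 7.0])
severity = grade_cystocele(mdb)
```

Keeping the thresholds in a separate table makes it easy to substitute the clinically recommended values without touching the grading logic.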

3

Experimental Results

Materials. We acquired 170 US videos from 170 women aged 20 to 41. Each video lasts approximately 10 s and contains around 400 frames. The data were randomly split into 85 videos for training and 85 videos for testing. All videos were acquired using a Mindray DC8 US scanner with local IRB approval. To support the training of the regression models, one graduate student was recruited to prepare the necessary annotations on each training image. The annotated training data were further reviewed by a senior radiologist with more than 15 years of experience in medical US to ensure correctness. The number of neighboring frames for extracting spatio-temporal features was 30, and other parameters were set according to [11]. To evaluate the performance of our system and the inter-observer variation, three radiologists with more than 3 years of US imaging experience were invited to annotate the middle axis and lower tip of the SP on each testing image. Each radiologist was also asked to measure the bladder descent on each testing image and to give the cystocele severity grades of all patients. The bladder boundaries were not annotated in the testing data, as the boundary drawing task is very costly.

Fig. 3. The distance definition with respect to three target structures.

Fig. 4. Boxplots of the MDB distributions.

Fig. 5. Comparison of measurements by our method (in red) and 3 radiologists (in yellow, green and purple). The severities are graded into normal, mild, moderate and severe from the top to the bottom videos. The sub-figure marked by red box contains the maximal descent of the bladder from the SP.

Table 1. Overall grading accuracy and Kappa statistics.

                           Auto vs. E1   Auto vs. E2   Auto vs. E3
Grading accuracy   2D      78.82 %       74.12 %       75.29 %
                   2D+t    87.06 %       81.18 %       82.35 %
Kappa value        2D      0.64          0.54          0.55
                   2D+t    0.78          0.67          0.68

Intermediate Results. We first evaluate the performance on the identification of the middle axis and lower tip of the SP. Figure 5 compares the results of our automatic system on four typical cases with the three sets of manual annotations. The SP and bladder vary considerably in shape, geometry and appearance; nevertheless, our method generates reasonably good intermediate results compared with the manual definitions. We further evaluate the MDB performance by comparing the accuracies of the MDBs from the spatio-temporal regression model (2D+t) and the regression model without temporal cues (2D) [11]. The means and standard deviations of the absolute MDB differences between the proposed method and the three radiologists (namely E1, E2 and E3) are 3.02 ± 2.74 mm, 3.01 ± 2.59 mm and 3.00 ± 2.91 mm, respectively, whereas the differences between the MDBs of the 2D regression [11] and the three radiologists are 3.92 ± 3.04 mm, 4.68 ± 3.19 mm and 4.78 ± 3.50 mm, respectively. The p-values (two-sample, two-tailed t-test) between the two automatic methods w.r.t. the three radiologists are 0.0287, 6.8538e-04 and 9.2093e-04, respectively. We therefore conclude that our spatio-temporal model is significantly better than the regression method without temporal cues. The boxplots of the MDB measurements by the two methods are shown in Fig. 4.

Accuracy of Cystocele Severity Grading. Here we show the clinical applicability by comparing the final grading results of the two automatic methods. Cohen's kappa statistic is used to evaluate the grading agreement between the radiologists and the computerized methods. As shown in Table 1, the overall grading accuracies of our proposed method (2D+t) with respect to the three radiologists are all higher than 80 %. The grading results of our method are significantly better than those of the 2D regression method [11]. The Kappa values in Table 1 further indicate that our method achieves significantly better agreement with the radiologists than the 2D regression method. This suggests that incorporating temporal appearance and context features into the random forest regression is effective. We further calculate the Kappa values of the manual grading results of the three radiologists, in order to compare the agreement between the radiologists and the computer with the inter-radiologist agreement. The Kappa values between radiologists are 0.65 (E1 vs. E2), 0.55 (E1 vs. E3) and 0.87 (E2 vs. E3). This suggests that the grading agreement between the computer and the radiologists is relatively stable compared with the inter-radiologist agreement. In particular, the grading results of radiologist E1 are relatively less consistent with those of the other radiologists.
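The two agreement measures reported in Table 1 can be computed as in the following minimal sketch. It assumes the unweighted form of Cohen's kappa (the text does not state whether a weighted variant was used), and the function names and toy grades are illustrative only.

```python
from collections import Counter


def grading_accuracy(rater_a, rater_b):
    """Fraction of cases on which two raters assign the same grade."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must grade the same non-empty set of cases")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    observed = grading_accuracy(rater_a, rater_b)
    n = len(rater_a)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the two raters' marginal grade frequencies.
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    if expected == 1.0:  # both raters always use the same single grade
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Usage with made-up grades for four patients.
auto = ["normal", "normal", "mild", "mild"]
expert = ["normal", "normal", "mild", "normal"]
kappa = cohens_kappa(auto, expert)
```

Kappa corrects the raw accuracy for agreement expected by chance, which is why it is lower than the accuracy in Table 1 when some grades dominate.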

4

Conclusions

This paper develops the first automatic solution for grading cystocele severity in transperineal US videos. A novel spatio-temporal regression model is proposed to introduce temporal consistency into the displacement field estimation. Both appearance and context features in the spatio-temporal domain can boost the anatomy detection performance in US images. The experimental results suggest that our method significantly outperforms the 2D regression method in terms of both the intermediate distance measurement and the final severity grading. The developed system is robust and has potential for clinical application.

Acknowledgement. This work was supported by the National Natural Science Funds of China (Nos. 61501305, 61571304, and 81571758), the Shenzhen Basic Research Project (Nos. JCYJ20150525092940982 and JCYJ20140509172609164), and the Natural Science Foundation of SZU (No. 2016089).

References

1. Persu, C., Chapple, C., Cauni, V., Gutue, S., Geavlete, P.: Pelvic organ prolapse quantification system (POP-Q)-a new era in pelvic prolapse staging. J. Med. Life 4(1), 75 (2011)
2. Lee, U., Raz, S.: Emerging concepts for pelvic organ prolapse surgery: what is cure? Cur. Urol. Rep. 12(1), 62–67 (2011)
3. Santoro, G., Wieczorek, A., Dietz, H., Mellgren, A., Sultan, A., Shobeiri, S., Stankiewicz, A., Bartram, C.: State of the art: an integrated approach to pelvic floor ultrasonography. Ultrasound Obstet. Gynecol. 37(4), 381–396 (2011)
4. Chan, L., Tse, V., Stewart, P.: Pelvic floor ultrasound (2015)
5. Thyer, I., Shek, C., Dietz, H.: New imaging method for assessing pelvic floor biomechanics. Ultrasound Obstet. Gynecol. 31(2), 201–205 (2008)
6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
7. Verikas, A., Gelzinis, A., Bacauskiene, M.: Mining data with random forests: a survey and results of new tests. Pattern Recogn. 44(2), 330–349 (2011)
8. Gao, Y., Shen, D.: Context-aware anatomical landmark detection: application to deformable model initialization in prostate CT images. In: Wu, G., Zhang, D., Zhou, L. (eds.) MLMI 2014. LNCS, vol. 8679, pp. 165–173. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10581-9_21
9. Richmond, D., Kainmueller, D., Glocker, B., Rother, C., Myers, G.: Uncertainty-driven forest predictors for vertebra localization and segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 653–660. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_80
10. Zhou, S.K., Comaniciu, D.: Shape regression machine. In: Karssemeijer, N., Lelieveldt, B. (eds.) IPMI 2007. LNCS, vol. 4584, pp. 13–25. Springer, Heidelberg (2007). doi:10.1007/978-3-540-73273-0_2
11. Shao, Y., Gao, Y., Wang, Q., Yang, X., Shen, D.: Locally-constrained boundary regression for segmentation of prostate and rectum in the planning CT images. Med. Image Anal. 26(1), 345–356 (2015)
12. Tu, Z., Bai, X.: Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 32(10), 1744–1757 (2010)
13. Wang, H., Chen, H., Zhe, R., Xu, F., Chen, Q., Liu, Y., Guo, J., Shiya, W.: Correlation between anterior compartment prolapse assessments by transperineal ultrasonography and pelvic organ prolapse quantification. Chin. J. Ultrason. 22(8), 684–687 (2013)

Graphical Modeling of Ultrasound Propagation in Tissue for Automatic Bone Segmentation

Firat Ozdemir(B), Ece Ozkan, and Orcun Goksel

Computer-Assisted Applications in Medicine, ETH Zurich, Zurich, Switzerland [email protected]

Abstract. Bone surface identification and localization in ultrasound have been widely studied in the contexts of computer-assisted orthopedic surgeries, trauma diagnosis, and post-operative follow-up. Nevertheless, the (semi-)automatic bone surface segmentation methods proposed so far either require manual interaction or complex parametrizations, while failing to deliver accuracy fit for clinical purposes. In this paper, we utilize the physics of ultrasound propagation in human tissue by encoding this in a factor graph formulation for an automatic bone surface segmentation approach. We comparatively evaluate our method on annotated in-vivo ultrasound images of bones from several anatomical locations. Our method yields a root-mean-square error of 0.59 mm, far superior to state-of-the-art approaches.

1

Introduction

Radiography (e.g. X-ray, CT, fluoroscopy) is the conventional technique for imaging bones; however, it involves radiation exposure. Ultrasound (US) has been proposed as a safe, real-time imaging alternative for certain applications such as bone surface localization for diagnosis and routine orthopedic controls, e.g. [1–4], and for intra-operative guidance in computer-assisted orthopedic surgery (CAOS), e.g. [5,6]. Nevertheless, identifying the bone surface is a challenging task, since US suffers from a range of different artifacts and presents a low signal-to-noise ratio (SNR) in general. The methods proposed in the literature require manual interaction or complex parametrizations, limiting their generalizability. Although ultrasound raw radio-frequency data can be used to segment bones [7], its availability for routine clinical applications from commercial US machines is still quite limited.

Considering conventional B-mode imaging, early work on bone surface segmentation utilized intensity and gradient information, e.g. [8]. Hacihaliloglu et al. [1] exploited phase congruency from Kovesi [9] to introduce phase symmetry (PS) in 2D and 3D to identify bone fractures by aggregating log-Gabor filters at different orientations. This enhances bone surface appearance as seen in Fig. 1c. The tedious parameter selection phase of log-Gabor filters for PS was automated later in [10]. Inspired by the gradient energy tensor from [11], PS was also used to define a local phase tensor (LPT) metric, which was studied for enhancing bone surface appearance for registering statistical shape models to 3D US images [5]. Despite its high sensitivity, the major drawback of PS is its low specificity; i.e. it gives false positives at interfaces between soft tissue layers (Fig. 1c). Therefore, most works using PS alone require manual interaction, e.g. selection of a region-of-interest (ROI) around the expected bone surface, or post-processing to remove false positives. Note that PS is a hard decision, giving an almost binary (very high dynamic range) response, from which post-processing may not always recover, leading to suboptimal solutions.

Alternatively, in [3] confidence in phase-symmetry (CPS) was introduced to enhance bone surfaces in US by uniformly weighting PS, attenuation and shadowing features; the latter two stem from confidence maps [12] based on random walks. The shadowing feature is exemplified in Fig. 1d. These earlier works either lack a principled approach to combine the available information, e.g. image appearance and physical constraints of ultrasound, or rely strongly on PS for the bone surface. In this paper, we propose a novel graphical model, which is robust to false-positive responses, by introducing physical constraints of ultrasound-bone interaction combined in a principled way with appearance information from a supervised learning framework.

Fig. 1. (a) An in-vivo bone US. The red line indicates the mid-column, along which the plots show: (b) B-mode intensity, (c) PS [1], (d) shadowing feature [12], (e) shadow and (f) soft-tissue probabilities from the trained appearance model cf. Sect. 2.1.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 256–264, 2016. DOI: 10.1007/978-3-319-46723-8_30

2

Methods

Despite the fact that soft-tissue interfaces and bone surface may both appear as hyperechoic reflections, there is a fundamental difference at bone surfaces: Due to the relatively higher acoustic impedance of cortical bone, it causes an almost total reflection of transmitted ultrasound energy. This leads to a bright surface appearance and, behind this, a dark or incoherent appearance due to lack of ultrasound penetration. Accordingly, we categorize the US scene in three classes: bone surface (B), shadow behind this surface (S), and other (soft) tissue (T). We model their appearance using supervised learning with the following features.

2.1

Image Features and Learned Appearance Models

2D image patch and 1D image column features are employed, the latter approximating the axial propagation of focused beams. Features regarding statistical, textural, and random walks-based information are extracted at different scales as listed in Table 1, where Scale-space indicates kernel sizes, i.e. the edge length of square kernels or length of vector kernels. A subset of the features for a sample US image is depicted in Fig. 2. Below they are briefly summarized.

Fig. 2. Sample features from a US image (a), where the subscript n denotes the filter kernel scale from fine "1" to coarse "3": (b) Median_3, (c) Entropy_2, (d) Attenuation, (e) Gauss_3, (f) Column-long σ, (g) LBP_1, (h) Rayleigh fit error.

Local-patch statistics. Simple and higher-order statistical features are used.

Random-Walks. Features from the literature are applied, such as confidence maps ($m_{x,y}$) from [12] and, based on these, attenuation ($a_{x,y} = \mathrm{norm}(\sum_{w} (m_{x,y} - m_{\min}))$) and shadowing ($s_{x,y} = \mathrm{norm}(\sum_{w} m_{x,y} / m_{\min})$) from [3], where norm(·) is the unity-based normalization and $w$ is the number of pixels in the patch.

Column-wise and integral statistics. Intuitive metrics motivated by the reflection and attenuation effects acting in a cumulative manner as ultrasound propagates are also employed from the far side of the image to a point.

Local Binary Patterns. In order to capture textural (speckle) information visually, we used the well-known Local Binary Patterns [13] and Modified Census Transform [14], which relate the intensity at a point to its neighbors.

Speckle characteristics. A last feature is included from an ultrasound physics perspective: it is known that the appearance of fully-developed speckle can be characterized locally by Rayleigh, Nakagami, or similar distributions. At locations where ultrasound SNR is low, e.g. behind the bone surface, although it may be possible to get a high intensity (with high gain, etc.), the content would be mostly other (e.g. electrical) noise, which will follow a Gaussian or uniform distribution. Accordingly, we used the fit of a Rayleigh probability density function (pdf) to patch intensity histograms to quantify its speckle characteristics, i.e.:

$$\text{fit error} = \lVert \mathrm{pdf}_{\mathrm{Rayleigh}} - \mathrm{norm}(\mathrm{hist}(\mathrm{patch}_i)) \rVert \tag{1}$$

where pdfRayleigh is the maximum likelihood fitted distribution and the second term is the normalized histogram of patch intensities.
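The Rayleigh fit error of Eq. (1) can be sketched as follows. This is a minimal illustration under several assumptions not stated in the text: the norm is taken as the L2 norm, the histogram is normalized to a density so that it is comparable to the pdf, the pdf is evaluated at bin centres, and the number of bins (16) is arbitrary. The maximum-likelihood Rayleigh scale is the standard closed form, σ² = Σ x² / (2N).

```python
import math


def rayleigh_fit_error(patch_intensities, n_bins=16):
    """L2 distance between an ML-fitted Rayleigh pdf and the patch histogram."""
    values = [float(v) for v in patch_intensities]
    if not values or min(values) < 0 or max(values) <= 0:
        raise ValueError("need non-negative intensities with a positive maximum")
    n = len(values)
    # Maximum-likelihood estimate of the Rayleigh scale parameter (squared).
    sigma2 = sum(v * v for v in values) / (2.0 * n)
    # Histogram on [0, max], normalized so that it integrates to one.
    width = max(values) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v / width), n_bins - 1)] += 1
    density = [c / (n * width) for c in counts]
    # Rayleigh pdf x / sigma^2 * exp(-x^2 / (2 sigma^2)) at the bin centres.
    centres = [(k + 0.5) * width for k in range(n_bins)]
    model = [x / sigma2 * math.exp(-x * x / (2.0 * sigma2)) for x in centres]
    return math.sqrt(sum((m - d) ** 2 for m, d in zip(model, density)))


# Usage: deterministic quantile samples of a Rayleigh and a uniform distribution.
quantiles = [(k + 0.5) / 2000 for k in range(2000)]
speckle_like = [math.sqrt(-2.0 * math.log(1.0 - u)) for u in quantiles]
noise_like = [4.0 * u for u in quantiles]
error_speckle = rayleigh_fit_error(speckle_like)
error_noise = rayleigh_fit_error(noise_like)
```

In this toy comparison the speckle-like patch yields the smaller error, which is the behaviour the feature relies on to separate tissue from the noise-dominated region behind the bone.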

Table 1. Features extracted at different kernel space-scales for US transmit wavelength (λ) and pixels (px).

Filter group type         Filter names                                             Scale-space
Intensity                 Pixel intensity (Fig. 2a)                                1 px
Local patch statistics    Mean, median (Fig. 2b), variance, standard deviation,    3, 6, 12λ
                          skewness, kurtosis, entropy (Fig. 2c)
Random Walks              Confidence Map [12], Shadowing [3], logShadowing,        1 px
                          Attenuation [3] (Fig. 2d)
Column-wise statistics    0th order (Fig. 2e), 1st order, 2nd order, 3rd order     2, 5, 11λ
Integral statistics       Integral, weighted integral, standard deviation          5, 11, 31λ
Local Binary Patterns     Local Binary Patterns [13], Modified Census
                          Transform [14]
Speckle characteristics   Rayleigh fit error (Fig. 2h)                             12λ

To capture scale-space information, patch-based features are extracted at multiple scales (see Table 1). At a point i, this leads to a feature vector of fi of length 47 populated by the above-mentioned features. From these features extracted from all image locations of annotated sample images, two discriminative binary classifiers are then trained to construct independent probability functions p(fi | labeli ) for classes S and T, below and above the annotated bones respectively. For bone surface B, we use phase symmetry PS, converted to a PS likelihood as e −σ0 . For a given test image, we cast the bone segmentation as a graph labeling problem shown below.

Fig. 3. (a) Unary cost calculation and (b) Pairwise edge connections for (i) 4-connected, (ii) directional 4-connected, (iii) proposed configuration. Horizontal, vertical and jumpedge connections are denoted with H, V and J respectively.

2.2

Encoding Ultrasound Physics on Graph Edges

For spatially consistent results and removing false local responses, Markov Random Fields (MRF) are a common regularization approach. In an MRF, the image is represented by a graphical model, where pixels are the nodes and inter-pixel interactions (e.g. regularization) are encoded on the edges. A maximum-a-posteriori solution involves the minimization of a cost function of the following form:

$$\sum_{i} \Psi(i) + \mu \sum_{i} \sum_{j \in N_i} \Psi(i, j) \tag{2}$$

where Ψ(·) and Ψ(·, ·) are the unary and pairwise cost functions and N_i is the neighbourhood of node i. One can then obtain a regularized labeling (segmentation) solution, e.g., using the common Potts potential for pairwise regularization and the label models above as unary costs, as seen in Fig. 3(a).

Table 2. Our pairwise cost definition for horizontal (H), vertical (V) and jump (J) edges.

(a) Ψ^H(i, j)   T(j)   B(j)   S(j)
    T(i)        k1     1      1
    B(i)        1      k2     1
    S(i)        1      1      k1

(b) Ψ^V(i, j)   T(j)   B(j)   S(j)
    T(i)        k2     k3     ∞1
    B(i)        ∞1     k2     k3
    S(i)        ∞1     ∞1     k2

(c) Ψ^J(i, j)   T(j)   B(j)   S(j)
    T(i)        0      0      ∞2
    B(i)        ∞1     ∞3     0
    S(i)        ∞1     ∞1     0

An MRF uses undirected edges as in Fig. 3(b.i), and thus can only encode bidirectional information. Ultrasound, however, travels axially, so different types of interaction occur between vertical (V) and horizontal (H) pixel neighbours in the image. Different pairwise costs for such neighbours can be set using a directed factor graph as in Fig. 3(b.ii). For horizontal edges, we use a Potts-like model in Table 2(a), where identical labels on both ends are penalized less, with parameters k1 and k2 in the range (0, 1), since neighboring pixels are more likely to be of the same class. For vertical edges, we know the following:

1. soft tissue T starts from the skin;
2. once the bone surface B is encountered, the rest of the image (below that location) should be shadow S (no more T); and
3. S cannot start without encountering B first.

Constraint 1 is enforced by a unary constraint on the top image pixels (skin), and the latter two are enforced using the vertical pairwise costs in Table 2(b), where ∞1 prohibits transitions that violate these conditions. Consequently, starting from the transducer, the labels encountered (downward) should be in this strict order: T→B→S. Thanks to factor graphs, vertical transitions can also be penalized differently from horizontal ones, if desired, controlled by the parameter k3 (=1 for isotropic penalty).

Reflection of ultrasound at the bone surface generates a hyperechoic band, the thickness of which depends on various factors (e.g., ultrasound frequency). Accordingly, after the label switches to bone surface B, it should not continue as B until the bottom of the image, but instead switch to S shortly after. We encode this with an additional so-called jump edge (J) connecting each pixel to the one l pixels below, as in Fig. 3(b.iii) (green). With the costs given in Table 2(c), this enforces the thickness of the surface appearance to be exactly l pixels: ∞2 prohibits S below T, setting a lower bound of l; and ∞3 prohibits both ends from being B, setting an upper bound of l. For J, ∞1 still enforces the right order of transmission. We call this novel connectivity and cost definition the bone factor graph (BFG). It is optimized by off-the-shelf tools to obtain the segmentation.
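The effect of the vertical and jump-edge costs of Table 2(b, c) on a single image column can be checked with a small sketch. It evaluates only the pairwise terms along one column (unary and horizontal terms are omitted), uses the k2 and k3 values reported in the experiments, and assumes that edge (i, j) runs from the upper pixel i to the lower pixel j and that jump edges reaching past the image bottom are simply absent. The jump length of 3 pixels is a hypothetical value chosen for illustration.

```python
INF = float("inf")
K2, K3 = 0.5, 0.3  # parameter values reported in the experiments

# Vertical costs, Table 2(b): key = (upper label, lower label).
PSI_V = {
    ("T", "T"): K2, ("T", "B"): K3, ("T", "S"): INF,
    ("B", "T"): INF, ("B", "B"): K2, ("B", "S"): K3,
    ("S", "T"): INF, ("S", "B"): INF, ("S", "S"): K2,
}
# Jump-edge costs, Table 2(c): key = (label at i, label l pixels below).
PSI_J = {
    ("T", "T"): 0.0, ("T", "B"): 0.0, ("T", "S"): INF,
    ("B", "T"): INF, ("B", "B"): INF, ("B", "S"): 0.0,
    ("S", "T"): INF, ("S", "B"): INF, ("S", "S"): 0.0,
}


def column_pairwise_cost(labels, jump):
    """Sum of vertical and jump-edge costs for one top-to-bottom column."""
    cost = 0.0
    for i in range(len(labels) - 1):
        cost += PSI_V[(labels[i], labels[i + 1])]
    for i in range(len(labels) - jump):
        cost += PSI_J[(labels[i], labels[i + jump])]
    return cost


# Usage with a hypothetical jump length of l = 3 pixels.
valid = column_pairwise_cost("TTTBBBSSS", jump=3)      # bone exactly 3 px thick
too_thin = column_pairwise_cost("TTTTBBSSS", jump=3)   # bone only 2 px thick
too_thick = column_pairwise_cost("TTBBBBSSS", jump=3)  # bone 4 px thick
wrong_order = column_pairwise_cost("TTSSSBBBS", jump=3)
```

Only the labeling whose bone band is exactly l pixels thick and follows the order T→B→S receives a finite cost, which matches the lower and upper bounds described above.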

3

Results and Discussion

37 US images were acquired using a SonixTouch machine (Ultrasonix, Richmond, Canada) with an L14-5 transducer at depths [3, 5] cm with frequencies {6.66, 10} MHz (depending on body location). B-mode images had an isotropic pixel resolution of 230 µm. Collected data include bones in the forearm (radius, ulna), shoulder (acromion, humerus tip), leg (fibula, tibia, malleolus), hip (iliac crest), jaw (mandible, ramus) and fingers (phalanges). Following [2], bone surfaces were delineated in the images by an expert at locations where they can be distinguished with certainty; i.e. unannotated columns mean either no bone or not visible.

Fig. 4. (a) Comparison of algorithms (best scores are shown in bold), (b) average accuracy vs. tolerance margin, and (c) F1 score; where we propose CFG↑ & BFG.

We ran 6-fold cross-validation experiments. For learning probability models, L2-regularized logistic regression was used from the LIBLINEAR library [footnote 1]. Factor graphs were implemented using the OpenGM library [footnote 2]. For transmit wavelength λ at a given ultrasound frequency, l = 4λ, σ0 = 10^−3, µ = 1, k1 = k2 = 0.5 and k3 = 0.3 are used for the experiments. We utilized the Sequential Tree-Reweighted Message Passing (TRW-S) algorithm for graph optimization.

Footnote 1: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
Footnote 2: http://hci.iwr.uni-heidelberg.de/opengm2/

Fig. 5. Sample qualitative results show robustness (top row) to false detections in soft tissue interfaces; (bottom, left & center) to shadowing and reverberation artifacts inside bone; (top, center) separate bone surfaces, e.g. radius and ulna; (right) images demonstrate typical failures.

To compare BFG with alternatives, we also implemented an MRF with the edge connectivity in Fig. 3(b.i) and Potts pairwise potentials, and conventional factor graphs (CFG) without the jump-edge potential, with the edge connectivity in Fig. 3(b.ii), the potentials in Table 2(a, b), and the parameters given above. As these implementations gave arbitrarily poor results for our evaluation metrics due to many false negatives in comparison to BFG, we applied the following post-processing steps to improve these alternative methods to a comparable level. We first thinned the result to a single pixel using morphological thinning [15]. Subsequently, if there are multiple occurrences of bone detection, only the lowermost pixel is kept to avoid false positives within the soft tissue. We denote these two post-processing steps with (·↑).

Considering typical state-of-the-art PS methods, most require the selection of a ROI around actual bones, since multiple reflections are extracted. We compared our method with [10], where the highest PS response per column (PSmax) was proposed as an automatic way of identifying the bone surface. Since this yielded relatively poor results as it was, we also applied (·↑) to PS as an alternative technique, which we refer to as the state of the art. We also compared with confidence-weighted phase symmetry (CPS) from [3]. This similarly yields many false negatives, so we report its post-processed version CPS↑. For BFG, simply the midpoint of the l-thick B band was output.

We used common bone-detection evaluation metrics: symmetric Hausdorff distance (sHD), one-way Hausdorff distance (oHD), and RMSE of the detected bone surface to the closest gold standard (GS) point. Quantitative results averaged over the 6 folds are seen in Fig. 4(a), indicating that our algorithm outperforms the other approaches. We also looked at the classification accuracy of surface detections. We considered a detection pixel correct if it is within a tolerance margin around the gold standard annotation. Accordingly, we generated a classification result (i.e. true/false positive/negative) for each column and computed the accuracy score over those for the image. Fig. 4(b) shows the average accuracy of the three best methods as the tolerance is changed. The accuracy of BFG is 86 % whereas that of PS↑ is 65 % at a 1 mm margin; 92 % vs. 72 % at 2 mm; and 95 % vs. 79 % at 4 mm. BFG outperforms the others at all operating points. According to [16], the error tolerance in CAOS is 1 mm excluding operator error. Choosing this as an example operating point, we also calculated F1 scores as in Fig. 4(c). This shows the robustness of our method across all test images, compared to the alternatives. A qualitative comparison between BFG, PS↑, and GS is seen in Fig. 5.

A computer with an Intel i7 930 @ 2.80 GHz and 8 GB RAM was used for the experiments. Results were computed in 2 min on average with a non-optimized Matlab implementation, where the majority of the time is taken by feature extraction. This could be accelerated in the future by parallel computation or feature selection, which was not the focus of this paper.
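The column-wise scoring with a tolerance margin can be illustrated with the sketch below. It is one plausible reading of the procedure, not the authors' implementation: each column carries either a detected depth in mm or `None`, and a detection that falls outside the tolerance is counted as a false positive, which is an assumption since the text does not say how such columns are scored.

```python
def column_outcomes(detected_mm, gold_mm, tolerance_mm):
    """Count true/false positives/negatives over the columns of one image."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for det, gold in zip(detected_mm, gold_mm):
        if det is None and gold is None:
            counts["tn"] += 1          # no bone annotated, none detected
        elif det is None:
            counts["fn"] += 1          # annotated bone was missed
        elif gold is None:
            counts["fp"] += 1          # detection where no bone is annotated
        elif abs(det - gold) <= tolerance_mm:
            counts["tp"] += 1          # detection within the margin
        else:
            counts["fp"] += 1          # assumption: off-margin counts as FP
    return counts


def accuracy(counts):
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


def f1_score(counts):
    denom = 2 * counts["tp"] + counts["fp"] + counts["fn"]
    return 2 * counts["tp"] / denom if denom else 0.0


# Usage: five columns of a made-up image, scored at a 1 mm margin.
detected = [10.0, 10.4, None, 12.0, None]
gold = [10.2, 12.0, None, None, 11.0]
outcome = column_outcomes(detected, gold, tolerance_mm=1.0)
```

Sweeping `tolerance_mm` over a range of margins and averaging the accuracy over images would produce a curve of the kind shown in Fig. 4(b).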

4

Conclusions

In this paper, we have presented a novel graph representation of ultrasound-bone interaction for robust and fully automatic segmentation of bone surfaces. Our method outperforms alternative techniques, demonstrating clinically relevant performance for a diverse range of anatomical regions. In the future, we will improve its speed for real-time surface detection, e.g. for registration of pre-operative models to real-time US data for navigation and guidance.

Acknowledgements. We thank Dr. Andreas Schweizer for annotations, and the Swiss National Science Foundation and Zurich Department of Health for funding.

References

1. Hacihaliloglu, I., Abugharbieh, R., Hodgson, A., Rohling, R.: Bone surface localization in ultrasound using image phase-based features. Ultrasound Med. Biol. 35(9), 1475–1487 (2009)
2. Jain, A.K., Taylor, R.H.: Understanding bone responses in B-mode ultrasound images and automatic bone surface extraction using a Bayesian probabilistic framework. In: Proceedings of SPIE, San Diego, USA, pp. 131–142 (2004)
3. Quader, N., Hodgson, A., Abugharbieh, R.: Confidence weighted local phase features for robust bone surface segmentation in ultrasound. In: Linguraru, M.G., Oyarzun Laura, C., Shekhar, R., Wesarg, S., González Ballester, M.Á., Drechsler, K., Sato, Y., Erdt, M. (eds.) CLIP 2014. LNCS, vol. 8680, pp. 76–83. Springer, Heidelberg (2014). doi:10.1007/978-3-319-13909-8_10
4. Cheung, C.W.J., Law, S.Y., Zheng, Y.P.: Development of 3-D ultrasound system for assessment of adolescent idiopathic scoliosis (AIS): and system validation. In: EMBC, pp. 6474–6477 (2013)
5. Hacihaliloglu, I., Rasoulian, A., Rohling, R.N., Abolmaesumi, P.: Statistical shape model to 3D ultrasound registration for spine interventions using enhanced local phase features. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 361–368. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_45
6. Scepanovic, D., Kirshtein, J., Jain, A.K., Taylor, R.H.: Fast algorithm for probabilistic bone edge detection (FAPBED). In: Medical Imaging, pp. 1753–1765 (2005)
7. Hussain, M.A., Hodgson, A., Abugharbieh, R.: Robust bone detection in ultrasound using combined strain imaging and envelope signal power detection. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8673, pp. 356–363. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10404-1_45
8. Daanen, V., Tonetti, J., Troccaz, J.: A fully automated method for the delineation of osseous interface in ultrasound images. In: Barillot, C., Haynor, D.R., Hellier, P. (eds.) MICCAI 2004. LNCS, vol. 3216, pp. 549–557. Springer, Heidelberg (2004). doi:10.1007/978-3-540-30135-6_67
9. Kovesi, P.: Image features from phase congruency. J. Comput. Vis. Res. 1(3), 1–26 (1999)
10. Hacihaliloglu, I., Abugharbieh, R., Hodgson, A.J., Rohling, R.N.: Automatic adaptive parameterization in local phase feature-based bone segmentation in ultrasound. Ultrasound Med. Biol. 37(10), 1689–1703 (2011)
11. Felsberg, M., Köthe, U.: GET: the connection between monogenic scale-space and Gaussian derivatives. In: Kimmel, R., Sochen, N.A., Weickert, J. (eds.) Scale-Space 2005. LNCS, vol. 3459, pp. 192–203. Springer, Heidelberg (2005). doi:10.1007/11408031_17
12. Karamalis, A., Wein, W., Klein, T., Navab, N.: Ultrasound confidence maps using random walks. Med. Image Anal. 16(6), 1101–1112 (2012)
13. He, D.C., Wang, L.: Texture unit, texture spectrum, and texture analysis. IEEE Trans. Geosci. Remote Sens. 28(4), 509–512 (1990)
14. Froba, B., Ernst, A.: Face detection with the modified census transform. In: Proceedings of International Conference on Automatic Face and Gesture Recognition (2004)
15. Lam, L., Lee, S.W., Suen, C.Y.: Thinning methodologies-a comprehensive survey. Pattern Anal. Mach. Intell. 14(9), 869–885 (1992)
16. Phillips, R.: The accuracy of surgical navigation for orthopaedic surgery. Curr. Orthop. 21(3), 180–192 (2007)

Bayesian Image Quality Transfer

Ryutaro Tanno1,3(B), Aurobrata Ghosh1, Francesco Grussu2, Enrico Kaden1, Antonio Criminisi3, and Daniel C. Alexander1

1 Centre for Medical Image Computing, University College London, London, UK
[email protected]
2 Institute of Neurology, University College London, London, UK
3 Machine Intelligence and Perception Group, Microsoft Research Cambridge, Cambridge, UK

Abstract. Image quality transfer (IQT) aims to enhance clinical images of relatively low quality by learning and propagating high-quality structural information from expensive or rare data sets. However, the original framework gives no indication of confidence in its output, which is a significant barrier to adoption in clinical practice and downstream processing. In this article, we present a general Bayesian extension of IQT which enables efficient and accurate quantification of uncertainty, providing users with an essential prediction of the accuracy of enhanced images. We demonstrate the efficacy of the uncertainty quantification through super-resolution of diffusion tensor images of healthy and pathological brains. In addition, the new method displays improved performance over the original IQT and standard interpolation techniques in both reconstruction accuracy and robustness to anomalies in input images.

1 Introduction

A diverse range of technological and economic factors constrain the quality of magnetic resonance (MR) images. Whilst there exist bespoke scanners or imaging protocols with the capacity to generate ultra high-quality data, their prohibitive cost and lengthy acquisition time render the technology impractical in clinical applications. On the other hand, the poor quality of clinical data often limits the accuracy of subsequent analysis. For example, low spatial resolution of diffusion weighted images (DWI) gives rise to partial volume effects, introducing a bias in diffusion tensor (DT) measurements [1] that are widely used to study white matter in terms of anatomy, neurological diseases and surgical planning.

Super-resolution (SR) reconstruction potentially addresses this challenge by post-processing to increase the spatial resolution of a given low-resolution (LR) image. One popular approach is the single-image SR method, which attempts to recover a high-resolution (HR) image from a single LR image. Numerous machine-learning based methods have been proposed. For instance, [2,3] use example patches from HR images to super-resolve scalar MR and DW images respectively, with an explicitly defined generative model relating a HR patch to a LR patch and carefully crafted regularisation. Another generative approach is taken by sparse-representation methods [4,5], which construct a coupled library of HR and LR images from training data and solve the SR problem through projection onto it. Image quality transfer (IQT) [6] is a general quality-enhancement framework based on patch regression, which shows great promise in SR of DT images and requires no special acquisition, so is applicable to a large variety of existing data.

A key limitation of the above methods is the lack of a mechanism to communicate confidence in the predicted HR image. High-quality training data typically come from healthy volunteers. Thus, performance in the presence of pathology or other effects not observed in the training data is questionable. We expect a method to have high confidence in regions where it has seen many similar examples during training, and lower confidence on previously unseen structures. However, current methods implicitly have equal confidence in all areas. Such an uncertainty characterisation is particularly important in medical applications, where ultimately images can inform life-and-death decisions. It is also beneficial to downstream image processing algorithms, such as registration or tractography.

In this paper, we extend the IQT framework to predict and map uncertainty in its output. We incorporate Bayesian inference into the framework and name the new method Bayesian IQT (BIQT). Although many SR methods [2–5] can be cast as maximum a posteriori (MAP) optimisation problems, the dimensionality or complexity of the posterior distribution makes the computation of uncertainty very expensive. In contrast, the random forest implementation of the original IQT is amenable to uncertainty estimation thanks to the simple linear model at each leaf node, but the current approach computes a maximum likelihood (ML) solution. BIQT replaces this ML-based inference with Bayesian inference (rather than just MAP), and this allows the uncertainty estimate to reflect unfamiliarity of input data (see Fig. 1(a)). We demonstrate improved performance through SR of DT images on the Human Connectome Project (HCP) dataset [7], which has sufficient size and resolution to provide training data and a testbed to gauge the baseline performance. We then use clinical data sets from multiple sclerosis (MS) and tumour studies to show the efficacy of the uncertainty estimation in the presence of focal tissue damage, not represented in the HCP training data.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 265–273, 2016. DOI: 10.1007/978-3-319-46723-8_31

2 Methods

Here we first review the original IQT framework based on a regression forest. We then introduce our Bayesian extension, BIQT, highlighting the proposed efficient hyperparameter optimisation method and the robust uncertainty measure.

Background. IQT splits a LR image into small patches and performs quality enhancement on them independently. This patch-wise reconstruction is formulated as a regression problem of learning a mapping from each patch $x$ of $N_l$ voxels in the LR image to a corresponding patch $y(x)$ of $N_h$ voxels in the HR image. Input and output voxels are vector-valued, containing $p_l$ and $p_h$ values respectively, and thus the mapping is $x \in \mathbb{R}^{N_l p_l} \to y(x) \in \mathbb{R}^{N_h p_h}$. Training data come from high-quality data sets, which are artificially downsampled to provide matched pairs of LR and HR patches. For application, each patch of a LR image is passed through the learned mapping to obtain a HR patch, and those patches are combined to estimate a HR image.

To solve the regression problem, IQT employs a variant of random forests [8]. The method proceeds in two stages: training and prediction. During training, we grow a number of trees on different sets of training data. Each tree implements a piecewise linear regression; it partitions the input space $\mathbb{R}^{N_l p_l}$ and performs regressions in the respective subsets. Learning the structure of a tree on a dataset $D = \{x_i, y_i\}_{i=1}^{|D|}$ aims to find an 'optimal' sequence of binary partitions of the following form. At the initial node (root), $D$ is split into two sets $D_R$ and $D_L$ by thresholding one of $J$ scalar functions of $x$, or features, $f_1, \ldots, f_J$. The optimal pair of a feature $f_m$ and a threshold $\tau$ giving the most effective split is selected by maximising the information gain [9],

$$\mathrm{IG}(f_m, \tau, D) \triangleq |D| \cdot H(D) - |D_R| \cdot H(D_R) - |D_L| \cdot H(D_L),$$

where $|D|$ denotes the size of set $D$ and $H(D)$ is the average differential entropy of the predictive distribution $P(y|x, D, \mathcal{H})$, given by

$$H(D) \triangleq -\frac{1}{|D|} \sum_{x \in D} \int P(y|x, D, \mathcal{H}) \cdot \log P(y|x, D, \mathcal{H}) \, dy. \tag{1}$$

Maximising the information gain helps select the split with the highest confidence in the predictive distributions. This optimisation problem is solved by performing a golden-section search on the threshold for all features. The hypothesis space $\mathcal{H}$ specifies the class of statistical models and governs the form of the predictive distribution. In particular, IQT fits the ML estimate of a linear model with Gaussian noise. To control over-fitting, a validation set $D_V$ of similar size to $D$ is used, and the root node is only split if the residual error is reduced. This process is repeated in all new nodes until no more splits pass the validation test.

At the time of prediction, every LR patch $x$ is routed to one of the leaf nodes (nodes with no children) in each tree through the series of binary splits learned during training, and the corresponding HR patch is estimated by the mode of the predictive distribution. The forest output is computed as the average of the predictions from all trees, weighted by the inverse variance of the predictive distributions.

Bayesian Image Quality Transfer. Our method, BIQT, follows the IQT framework described above and performs a patch-wise reconstruction using a regression forest. The key novelty lies in our Bayesian choice of $\mathcal{H}$ (Eq. (1)). For a given training set at a node, $D = \{x_i, y_i\}_{i=1}^{N}$, BIQT fits a Bayesian linear model $y = Wx + \eta$, where the additive noise $\eta$ and the linear transform $W \in \mathbb{R}^{N_h p_h \times N_l p_l}$ follow isotropic Gaussian distributions $P(\eta|\beta) = \mathcal{N}(\eta|0, \beta^{-1} I)$ and $P(\mathrm{vec}(W)|\alpha) = \mathcal{N}(\mathrm{vec}(W)|0, \alpha^{-1} I)$, with $\mathrm{vec}(W)$ denoting the row-wise vectorised version of $W$. The hyperparameters $\alpha$ and $\beta$ are positive scalars, and $I$ denotes an identity matrix. Assuming for now that $\alpha, \beta$ are known, the predictive distribution is computed by marginalising out the model parameters $W$ as

$$P(y|x, D, \mathcal{H}) = P(y|x, D, \alpha, \beta) = \mathcal{N}\big(y \,\big|\, W_{\mathrm{Pred}}\, x,\; \sigma^2_{\mathrm{Pred}}(x) \cdot I\big) \tag{2}$$


Fig. 1. (a) 1D illustration (i.e. both $x, y \in \mathbb{R}$) of ML and Bayesian linear models fitted to the data (blue circles). The red line and shaded areas show the mode and variance (uncertainty) of $P(y|x, D, \mathcal{H})$ at the respective $x$ values. The Bayesian method assigns high uncertainty to an input distant from the training data, whilst the ML uncertainty is fixed. (b) 2D illustration of the input (grey) and output (red) patches.

where the $i$th columns of the matrices $X$ and $Y$ are given by $x_i$ and $y_i$, the mean linear map is $W_{\mathrm{Pred}} = Y X^T \big(X X^T + \frac{\alpha}{\beta} I\big)^{-1}$ and the variance is $\sigma^2_{\mathrm{Pred}}(x) = x^T A^{-1} x + \beta^{-1}$ with $A = \alpha I + \beta X X^T$. The mean differential entropy in Eq. (1) can be computed as $H(D) = N_h p_h |D|^{-1} \sum_{x \in D} \log(x^T A^{-1} x + \beta^{-1})$ (up to constants).

The predictive variance $\sigma^2_{\mathrm{Pred}}(x)$ provides an informative measure of uncertainty over the enhanced patch $y(x)$ by combining two quantities: the degree of variation in the training data, $\beta^{-1}$, and the degree of 'familiarity', $x^T A^{-1} x$, which measures how different the input patch $x$ is from the observed data. For example, if $x$ contains previously unseen features such as pathology, the familiarity term becomes large, indicating high uncertainty. The equivalent measure for the original IQT, however, consists solely of the term $\beta^{-1}$ determined from the training data, and yields a fixed uncertainty estimate for any new input $x$ (see Fig. 1(a)).

Once a full BIQT forest $F$ is grown, we perform reconstruction in the same way as before. All leaf nodes are endowed with predictive distributions of the form in Eq. (2), and BIQT quantifies the uncertainty over the HR output $y(x)$ as the predictive variance $\sigma^2_{\mathrm{Pred}}(x)$ at the leaf nodes (at which $x$ arrives after traversing the respective trees), averaged over the trees in the forest $F$.

A priori the hyperparameters $\alpha$ and $\beta$ are unknown, so we optimise them by maximising the marginal likelihood $P(D|\alpha, \beta)$. As $W_{\mathrm{Pred}}$ is in fact a solution of an L2 regularisation problem with smoothing parameter $\alpha/\beta$, this optimisation procedure can be viewed as a data-driven determination of the regularisation level. Although a closed form for $P(D|\alpha, \beta)$ exists, exhaustive search is impractical as we have to solve this problem for every binary split (characterised by a feature and a threshold) at all internal nodes of the tree. We thus derive and use the multi-output generalisation of the Gull-MacKay fixed-point iteration algorithm [10]

$$\beta_{\mathrm{new}} = \frac{1 - \beta_{\mathrm{old}} \cdot |D|^{-1}\, \mathrm{trace}\big(A(\alpha_{\mathrm{old}}, \beta_{\mathrm{old}})^{-1} X X^T\big)}{\frac{1}{|D| N_h p_h} \sum_{j=1}^{N_h p_h} \sum_{i=1}^{|D|} \big[y_{ji} - \mu_j(\alpha_{\mathrm{old}}, \beta_{\mathrm{old}})^T x_i\big]^2} \tag{3}$$

$$\alpha_{\mathrm{new}} = \frac{N_l p_l - \alpha_{\mathrm{old}} \cdot \mathrm{trace}\big(A(\alpha_{\mathrm{old}}, \beta_{\mathrm{old}})^{-1}\big)}{\frac{1}{N_h p_h} \sum_{j=1}^{N_h p_h} \mu_j(\alpha_{\mathrm{old}}, \beta_{\mathrm{old}})^T \mu_j(\alpha_{\mathrm{old}}, \beta_{\mathrm{old}})} \tag{4}$$


where $\mu_j(\alpha, \beta) = \beta \cdot A(\alpha, \beta)^{-1} \sum_{i=1}^{|D|} y_{ji} x_i$. Whilst the standard MATLAB optimisation solver (e.g. fminunc) requires at least 50 times more computational time per node optimisation than IQT, this iterative method is on average only 2.5 times more expensive, making the Bayesian extension viable. We use it in preference to the Expectation Maximisation algorithm because its convergence is twice as fast.
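The node-wise model can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names, the initial values of the hyperparameters and the fixed iteration count are assumptions made for illustration. It computes the mean map and predictive variance of Eq. (2) and iterates the updates of Eqs. (3) and (4).

```python
import numpy as np

def posterior(X, Y, alpha, beta):
    """Return W_pred and A^{-1} for the model y = W x + noise.

    X: (d_in, N) inputs as columns; Y: (d_out, N) outputs as columns.
    """
    d_in = X.shape[0]
    A = alpha * np.eye(d_in) + beta * (X @ X.T)
    A_inv = np.linalg.inv(A)
    # Equivalent to Y X^T (X X^T + (alpha/beta) I)^{-1}; row j of W_pred is mu_j^T.
    W_pred = beta * (Y @ X.T) @ A_inv
    return W_pred, A_inv

def predictive_variance(x, A_inv, beta):
    """sigma^2_pred(x) = x^T A^{-1} x (familiarity) + 1/beta (noise level)."""
    return float(x @ A_inv @ x) + 1.0 / beta

def fixed_point_step(X, Y, alpha, beta):
    """One update of (alpha, beta) following Eqs. (3) and (4)."""
    d_in, n = X.shape
    d_out = Y.shape[0]
    W_pred, A_inv = posterior(X, Y, alpha, beta)
    residual = Y - W_pred @ X
    beta_new = (1.0 - beta * np.trace(A_inv @ X @ X.T) / n) / (
        np.sum(residual ** 2) / (n * d_out))
    alpha_new = (d_in - alpha * np.trace(A_inv)) / (np.sum(W_pred ** 2) / d_out)
    return alpha_new, beta_new

def optimise_hyperparameters(X, Y, alpha=1.0, beta=1.0, n_iter=50):
    """Hypothetical driver: a fixed number of iterations from an assumed start."""
    for _ in range(n_iter):
        alpha, beta = fixed_point_step(X, Y, alpha, beta)
    return alpha, beta
```

In this sketch an input far from the training data yields a larger `predictive_variance` than one close to it, which is the behaviour shown in Fig. 1(a). A practical implementation would probably stop on a convergence criterion rather than a fixed iteration count.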

3 Experiments and Results

Here we demonstrate and evaluate BIQT through the SR of DTI. First, we describe the formulation of the application. Second, we compare the baseline performance on the HCP data to the original IQT. Lastly, we demonstrate on clinical images of diseased brains that our uncertainty measure highlights pathologies.

Super-Resolution of DTIs. Given a LR image, BIQT enhances its resolution patch by patch. Each case takes as input a cubic patch of $N_l = (2n + 1)^3$ voxels, each containing $p_l = 6$ DT elements, and super-resolves its central voxel by a factor of $m$ (so the output is a patch of $N_h = m^3$ voxels, each new voxel also containing $p_h = 6$ DT components). For all the experiments, we use $n = 2$ and $m = 2$ (Fig. 1(b)), and so the map is $\mathbb{R}^{750 = 5^3 \times 6} \to \mathbb{R}^{48 = 2^3 \times 6}$. The features $\{f_i\}$ consist of mean eigenvalues, principal orientation and orientation dispersion, averaged over central subpatches of widths 1, 3 and 5 within the LR patch.

Training data is generated from 8 randomly selected HCP subjects and used for all subsequent experiments. We use a subsample of each dataset, which consists of 90 DWIs of voxel size $1.25^3$ mm$^3$ with $b = 1000$ s/mm$^2$. We create training pairs by downsampling each DWI by a factor of $m$ and then fitting the DT to the downsampled and original DWI. A coupled library of LR and HR patches is then constructed by associating each patch in the downsampled DTI with the corresponding patch in the ground truth DTI. Training of each tree is performed on a different data set obtained by randomly sampling $\approx 10^5$ pairs from this library, and it takes under 2 h for the largest data sets in Fig. 2.

Testing on HCP dataset. We test BIQT on another set of 8 subjects from the HCP cohort. To evaluate reconstruction quality, three metrics are used: the root-mean-squared error of the six independent DT elements (DT RMSE); the Peak Signal-to-Noise Ratio (PSNR); and the mean Structural Similarity (MSSIM) index [11].
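The patch dimensions quoted in the formulation above can be checked with a small sketch. The helper names and the toy volume are hypothetical, and the strided subsampling is only a crude stand-in for the DWI downsampling and DT refitting that the text describes.

```python
import numpy as np

n, m = 2, 2          # patch radius and super-resolution factor from the text
p_l = p_h = 6        # six DT elements per voxel

N_l = (2 * n + 1) ** 3   # voxels in the cubic LR input patch
N_h = m ** 3             # HR voxels replacing the central LR voxel

def lr_patch_vector(lr_dti, centre, n):
    """Flatten the (2n+1)^3 neighbourhood of `centre` into one input vector x."""
    i, j, k = centre
    patch = lr_dti[i - n:i + n + 1, j - n:j + n + 1, k - n:k + n + 1, :]
    return patch.reshape(-1)

def hr_patch_vector(hr_dti, centre, m):
    """Flatten the m^3 HR voxels that tile LR voxel `centre` into one output vector y."""
    i, j, k = (m * c for c in centre)
    patch = hr_dti[i:i + m, j:j + m, k:k + m, :]
    return patch.reshape(-1)

# Toy HR tensor volume; real training pairs come from downsampled DWIs (assumption: random data).
hr = np.random.default_rng(0).normal(size=(16, 16, 16, p_h))
lr = hr[::m, ::m, ::m, :]   # crude stand-in for downsampling, for shape checking only

x = lr_patch_vector(lr, (4, 4, 4), n)   # one LR input vector
y = hr_patch_vector(hr, (4, 4, 4), m)   # the matching HR output vector
print(x.shape, y.shape)
```

Each such pair (x, y) would form one column of the matrices X and Y used at a tree node.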
We super-resolve each DTI after downsampling by a factor of 2 as before, and these quality measures are then computed between the reconstructed HR image and the ground truth. BIQT displays highly statistically significant ($p < 10^{-8}$) improvements (see Fig. 2) on all three metrics over IQT, linear regression methods and a range of interpolation techniques. In addition, trees obtained with BIQT are generally deeper than those of the original IQT. Standard linear regression performs as well as Bayesian regression because of the large training data size. However, with BIQT, as one descends each tree, the number of training data points at each node gets smaller, increasing the degree of uncertainty in model fitting, and so the data-driven regularisation performed in each node-wise Bayesian regression becomes more effective, leading to better reconstruction quality. This is also manifested in the deeper structure of BIQT trees, indicating more successful validation tests and thus greater generalisability. Moreover, the feedforward architecture of the trees and the parallelisability of patch-wise SR mean highly efficient reconstruction (a few minutes for a full volume).

[Figure 2: three panels plotting DT RMSE (mm$^2$ s$^{-1}$), PSNR (dB) and MSSIM against the size of training data. Legend: tricubic interpolation, β-spline interpolation, linear regression (LR), Bayesian regression (BLR), IQT, BIQT.]

Fig. 2. Three reconstruction metrics of various SR methods as a function of training data size; RMSE (left), PSNR (middle) and MSSIM (right). The performance of LR (yellow) and BLR (purple) coincide. The results for linear and nearest-neighbour interpolation are omitted for their poor performance.

Figure 3 shows reconstruction accuracies and uncertainty maps for BIQT and IQT. The uncertainty map of BIQT is more consistent with its reconstruction accuracy when compared to the original IQT. Higher resemblance is also observed between the distribution of accuracy (RMSE) and uncertainty (variance). The BIQT uncertainty map also highlights subtle variations in the reconstruction quality within the white matter, whereas the IQT map contains flatter contrasts with discrete uncertainties that vary greatly in the same region (see histograms in bottom row). This improvement reflects the positive effect of the data-driven regularisation and better generalisability of BIQT and can be observed particularly in the splenium and genu of the Corpus Callosum, where despite good reconstruction accuracy, IQT assigns higher uncertainty than in the rest of the white matter and BIQT indicates a lower and more consistent uncertainty. Thus, the BIQT uncertainty map displays higher correspondence with accuracy and allows for a more informative assessment of reconstruction quality. Note that while the uncertainty measure for IQT is governed purely by the training data, for BIQT the uncertainty also incorporates the familiarity of the test data.

Testing on MS and tumour data. We further validate our method on images with previously unseen abnormalities; we use trees trained on healthy subjects from the HCP dataset to super-resolve DTIs of MS and brain tumour patients (10 each).
We process the raw data (DWI) as before, and only use the $b = 1200$ s/mm$^2$ measurements for the MS dataset and $b = 700$ s/mm$^2$ for the tumour dataset.

Fig. 3. Reconstruction accuracy and uncertainty maps. (top row) The voxel-wise RMSE as a normalised colour-map and its distribution; (bottom row) Uncertainty map (variance) over the super-resolved voxels and its distribution for (a) BIQT and (b) original IQT. Trees were trained on $\approx 4 \times 10^5$ patch pairs.

[Figure 4: panels (a) MS, (b) RMSE and (c) Edema. Panel (b) plots DT RMSE (mm$^2$ s$^{-1}$) for control and MS subjects for BIQT, IQT, BLR, LR, spline and cubic interpolation.]

Fig. 4. (a), (c) Normalised uncertainty map (variance is shown, i.e. the smaller the more certain) for BIQT (middle row) and IQT (bottom row) along with the T2-weighted slices (top row) for MS (with focal lesions in orange) and edema (contours highlighted), respectively. (b) The RMSE for MS and control subjects (averaged over 10 subjects in each case).

The voxel size for both datasets is $2^3$ mm$^3$. The MS dataset also contains lesion masks manually outlined by a neurologist. Figure 4(a), (c) middle row shows that the uncertainty map of BIQT precisely identifies previously unseen features (pathologies in this case) by assigning lower confidence than for the remaining healthy white matter. Moreover, in accordance with the reconstruction accuracy, the prediction is more confident in pathological regions than in the cerebrospinal fluid (CSF). This is expected since the CSF is essentially free water with low SNR and is also affected by cardiac pulsations, whereas the pathological regions are contained within the white matter and produce better SNR. Each BIQT tree appropriately sends pathological patches into the 'white-matter' subspace, and their abnormality is detected there by the 'familiarity' term, leading to a lower confidence with respect to the healthy white matter. By contrast, IQT sends pathological patches into the CSF subspace and assigns the fixed corresponding uncertainty, which is higher than it should be. In essence, BIQT provides an uncertainty measure that correlates with the pathologies in a much more plausible way, and this is achieved by its more effective partitioning of the input space and the uncertainty estimation conferred by Bayesian inference. Moreover, Fig. 4(b) shows the superior generalisability of BIQT even in reconstruction accuracy (here SR is performed on downsampled clinical DTIs); the RMSE of BIQT for MS patients is even smaller than that of IQT for healthy subjects.
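The uncertainty maps discussed above are forest-level quantities. As a minimal sketch, assuming each tree returns the leaf-node prediction and its predictive variance for a given patch, the two aggregation rules stated in the Methods (inverse-variance weighted prediction, tree-averaged variance) could look as follows. The function name and the list-based interface are hypothetical.

```python
def combine_trees(tree_means, tree_variances):
    """Aggregate per-tree leaf predictions for one input patch.

    tree_means: list of per-tree predicted HR patches (lists of floats).
    tree_variances: list of per-tree predictive variances (positive floats).
    """
    # Forest prediction: average of tree predictions weighted by inverse variance.
    weights = [1.0 / v for v in tree_variances]
    total = sum(weights)
    dim = len(tree_means[0])
    forest_mean = [
        sum(w * mean[k] for w, mean in zip(weights, tree_means)) / total
        for k in range(dim)
    ]
    # Forest uncertainty: predictive variance averaged over the trees.
    forest_variance = sum(tree_variances) / len(tree_variances)
    return forest_mean, forest_variance

# Two trees, the first being three times more confident than the second.
mean, variance = combine_trees([[1.0, 0.0], [3.0, 4.0]], [1.0, 3.0])
print(mean, variance)
```

With this weighting, a tree whose leaf is unfamiliar with the input contributes less to the prediction, while the averaged variance is what an uncertainty map would display per voxel.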

4 Conclusion

We presented a computationally viable Bayesian extension of Image Quality Transfer (IQT). The application in super-resolution of DTI demonstrated that the method not only achieves better reconstruction accuracy than the original IQT and standard interpolation techniques, even in the presence of pathology (Fig. 4(b)), but also provides an uncertainty measure which is highly correlated with the reconstruction quality. Furthermore, the uncertainty map is shown to highlight focal pathologies not observed in the training data. BIQT also performs a computationally efficient reconstruction while preserving the generality of IQT, with large potential to be extended to higher-order models beyond DTI and applied to a wider range of modalities and problems such as parameter mapping and modality transfer [6]. We believe that these results are sufficiently compelling to motivate larger-scale experiments for clinical validation in the future.

Acknowledgements. This work was supported by Microsoft scholarship. Data were provided in part by the HCP, WU-Minn Consortium (PIs: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by NIH and Wash. U. The tumour data were acquired as part of a study led by Alberto Bizzi, MD at his hospital in Milan, Italy. The MS data were acquired as part of a study at UCL Institute of Neurology, funded by the MS Society UK and the UCL Hospitals Biomedical Research Centre (PIs: David Miller and Declan Chard).

References

1. Berlot, R., Metzler-Baddeley, C., Jones, D.K., O’Sullivan, M.J., et al.: CSF contamination contributes to apparent microstructural alterations in mild cognitive impairment. Neuroimage 92, 27–35 (2014)
2. Rousseau, F.: Brain hallucination. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5302, pp. 497–508. Springer, Heidelberg (2008). doi:10.1007/978-3-540-88682-2_38
3. Coupé, P.: Collaborative patch-based super-resolution for diffusion-weighted images. Neuroimage 83, 245–261 (2013)
4. Rueda, A., Malpica, N., Romero, E.: Single-image super-resolution of brain MR images using overcomplete dictionaries. Med. Image Anal. 17(1), 113–132 (2013)
5. Wang, Y.H., Qiao, J., Li, J.B., Fu, P., Chu, S.C., Roddick, J.F.: Sparse representation-based MRI super-resolution reconstruction. Measurement 47, 946–953 (2014)
6. Alexander, D.C., Zikic, D., Zhang, J., Zhang, H., Criminisi, A.: Image quality transfer via random forest regression: applications in diffusion MRI. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8675, pp. 225–232. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10443-0_29
7. Sotiropoulos, S.N., et al.: Advances in diffusion MRI acquisition and processing in the human connectome project. Neuroimage 80, 125–143 (2013)
8. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
9. Criminisi, A., Shotton, J.: Decision Forests for Computer Vision and Medical Image Analysis. Springer, London (2013)
10. MacKay, D.J.: Bayesian interpolation. Neural Comput. 4(3), 415–447 (1992)
11. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Proc. 13(4), 600–612 (2004)

Wavelet Appearance Pyramids for Landmark Detection and Pathology Classification: Application to Lumbar Spinal Stenosis

Qiang Zhang1(B), Abhir Bhalerao1, Caron Parsons2,3, Emma Helm2, and Charles Hutchinson2,3

1 Department of Computer Science, University of Warwick, Coventry, UK
[email protected]
2 Division of Health Sciences, University of Warwick, Coventry, UK
3 Department of Radiology, University Hospital Coventry and Warwickshire, Coventry, UK

Abstract. Appearance representation and feature extraction of anatomy or anatomical features is a key step for segmentation and classification tasks. We focus on an advanced appearance model in which an object is decomposed into pyramidal complementary channels, and each channel is represented by a part-based model. We apply it to landmark detection and pathology classification on the problem of lumbar spinal stenosis. The performance is evaluated on 200 routine clinical data with varied pathologies. Experimental results show an improvement on both tasks in comparison with other appearance models. We achieve a robust landmark detection performance with average point to boundary distances lower than 2 pixels, and image-level anatomical classification with accuracies around 85 %.

1 Introduction

Diagnosis and classification based on radiological images is one of the key tasks in medical image computing. A standard approach is to represent the anatomy with coherent appearance models or feature descriptors, and vectorise the representations as inputs for training a classifier (Fig. 1(A)). The training data usually consist of instances with landmarks annotated at consistent anatomical features. The appearance correspondence across the instances is built by aligning a deformable appearance model, e.g., the Active Appearance Model (AAM) [1], or by extracting local features at the landmarks [2–4]. During testing, the landmarks are detected in new, unseen instances, and the features are extracted and sent to a classifier for pathology classification. For robust landmark detection, a prior model of the object class is learned by formulating the statistics of the vectorised representations, and the search is conducted under the regularisation of the prior model. The deformable model is either holistic [1], consisting of the shape and aligned appearance, or part-based [2–5], representing an object by locally rigid parts with a shape capturing the spatial relationships among parts. Part-based models have shown superior performance, benefiting from local feature detection [2,3,5] and shape optimisation methods [2,4]. However, less attention has been paid to optimising the appearance representation and preserving the anatomical details.

We propose a new appearance model referred to as a Wavelet Appearance Pyramid (WAP) to improve the performance of landmark detection and pathology classification; see an overview in Fig. 1(B). The object is decomposed into multi-scale textures and each scale is further decomposed into simpler parts. To achieve an explicit scale decomposition, the filter banks are designed and arranged directly in the Fourier domain. The logarithmic wavelets (loglets) [6] are adopted as the basis functions of the filter banks for their superior properties, such as uniform coverage of the spectrum (losslessness) and an infinite number of vanishing moments (smoothness). The scales are complementary in the Fourier domain, which enables the reconstruction of the appearance from a WAP. The variations in the population can be modelled and visualised, with the deformation approximated by local rigid translations of the multi-scale parts, and the appearance changes by linear modes of the assembly of the parts. We apply the WAP to the problem of lumbar spinal stenosis (LSS) and present an approach for fitting the landmarks and grading the central and foraminal stenosis [7,8]. The Supervised Descent Method (SDM) [2] is integrated with the WAP for landmark detection. The performance is validated on MRI data from 200 patients with varied LSS symptoms. Experimental results show an improvement in both the landmark detection and pathology grading over other models such as Active Shape Models (ASMs), AAMs [1], and Constrained Local Models (CLMs) [9].¹

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 274–282, 2016. DOI: 10.1007/978-3-319-46723-8_32

2 Method

To provide a more comprehensive description of an object, we decompose the appearance into pyramidal channels at complementary scale ranges with wavelets, and represent each channel with a part-based model. We refer to this form of appearance model as a Wavelet Appearance Pyramid. We detail the method as follows.

¹ Supplementary videos of the paper can be found at https://sites.google.com/site/waveletappearancepyramids.

Fig. 1. (A) A standard approach of landmark detection and pathology grading. (B) The proposed appearance model ($\mathcal{A}$) and feature descriptor ($h(\mathcal{A})$).

Fig. 2. (a) Radial profiles of the filters. (b) Scale selection in the Fourier domain. (c) The high pass filter in the Fourier domain. (d) The first bandpass filter.

Explicit Scale Selection in the Fourier Domain. We start by decomposing an image $I$ into multi-scale channels directly in the Fourier domain. When considered in polar coordinates, the Fourier spectrum $\mathcal{I}$ spans a scale space, with larger scales at lower frequencies and smaller scales spreading outwards. Therefore a multi-scale decomposition of image textures can be achieved explicitly by dividing the spectrum into subbands, see Fig. 2(b). In practice, filtering the spectrum with sharp windows introduces discontinuities and therefore causes aliasing. To design a bank of window functions that are smooth in shape while uniformly covering the spectrum, we use loglets as the basis functions because they possess a number of useful properties [6]. Denoting the frequency vector by $u$ and its length by $\rho$, a bandpass window with a loglet basis can be designed in the Fourier domain as

$$W(u; s) = \mathrm{erf}\left(\alpha \log\left(\beta^{s+\frac{1}{2}} \frac{\rho}{\rho_0}\right)\right) - \mathrm{erf}\left(\alpha \log\left(\beta^{s-\frac{1}{2}} \frac{\rho}{\rho_0}\right)\right) \tag{1}$$

where $\alpha$ controls the radial bandwidth, $s$ is an integer defining the scale of the filter, and $\beta > 1$ sets the relative ratio of adjacent scales (set to two for one-octave intervals). $\rho_0$ is the peak radial frequency of the window with scale $s = 0$.

To extract the sharp textures of an image, the first scale channel should cover the highest frequency components. Noting the uniform property of the loglets, we accumulate a group of loglets with successively one-octave higher central frequencies as the first scale window, i.e., $W^{(1)} = \sum_s W(u; s)$, $s = \{0, -1, \ldots\}$, which achieves an even coverage towards the highest frequency; see the 1D profile in Fig. 2(a) shown as a red curve, and the 2D window in Fig. 2(c). The second and larger scale features can be selected by windows covering lower frequencies, $W^{(s)}(u) = W(u; s - 1)$. Profiles of two adjacent larger scale windows are shown in Fig. 2(a) as blue curves, and a 2D window is shown in Fig. 2(d). For a lossless decomposition, the largest scale window should uniformly cover the lowest frequencies, so it is designed as an accumulation of the remaining loglet functions, $W^{(L)} = \sum_s W(u; s)$, $s = \{L - 1, L, \ldots\}$; see the green curve in Fig. 2(a). $L$ is the total number of scales in the filter bank. As image filtering can be implemented in the Fourier domain by multiplication, the filters can be applied efficiently by windowing the image spectrum $\mathcal{I}$, and the image channels are obtained by the inverse Fourier transform of the windowed spectrum, $I^{(s)} = \mathcal{F}^{-1}(\mathcal{I} \cdot W^{(s)})$, $s = \{1, 2, \ldots, L\}$. The image is thus decomposed into complementary channels $\{I^{(s)}\}$.

Wavelet Image Pyramid. Larger scale textures can be described sufficiently at a lower resolution. Note in Fig. 2(a) that the magnitude of the two larger scale windows beyond $\pi/2$ and $\pi/4$ is almost zero. Therefore we can discard these areas of the spectrum, which results in an efficient downsampling without information loss or aliasing effects.² As a result, the resolution is reduced by $2^s$ at scale $s$ and a subband pyramid is obtained, see Fig. 1(c, d).

Wavelet Appearance Pyramid (WAP). Given a landmark $x$, we extract an image patch $A_s$ at each scale $s$ of the pyramid. All patches $\{A_s\}_{s=1}^{L}$ have the same size in pixels, and describe the local features at octave larger scales, domain sizes and lower resolutions, see Fig. 1(d, e).
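The radial profiles of the filter bank described under Explicit Scale Selection can be sketched in plain Python. This is an illustrative reading of Eq. (1), not the authors' code: the values of `ALPHA` and `RHO0` are assumptions (the text only fixes β = 2), and the infinite sums for the first and last windows are evaluated through their telescoped closed forms.

```python
import math

ALPHA = 2.0          # radial bandwidth parameter (assumed value)
BETA = 2.0           # one-octave spacing, as stated in the text
RHO0 = math.pi / 2   # peak radial frequency at scale s = 0 (assumed value)

def edge(rho, k):
    """erf(ALPHA * log(BETA**k * rho / RHO0)), using the limit -1 as rho -> 0."""
    if rho <= 0.0:
        return -1.0
    return math.erf(ALPHA * math.log(BETA ** k * rho / RHO0))

def loglet(rho, s):
    """Bandpass window W(u; s) of Eq. (1) evaluated at radial frequency rho."""
    return edge(rho, s + 0.5) - edge(rho, s - 0.5)

def filter_bank(rho, L):
    """Windows W^(1), ..., W^(L) at radial frequency rho (requires L >= 2)."""
    high = 1.0 + edge(rho, 0.5)                        # sum of W(u; s), s = 0, -1, ...
    mids = [loglet(rho, s - 1) for s in range(2, L)]   # W^(s) = W(u; s - 1)
    low = 1.0 - edge(rho, L - 1.5)                     # sum of W(u; s), s = L-1, L, ...
    return [high] + mids + [low]

# The windows add up to the same constant at every frequency, so no part of the
# spectrum is lost; with Eq. (1) as written that constant is 2.
for rho in (0.0, 0.1, 1.0, 3.0):
    print(rho, sum(filter_bank(rho, L=4)))
```

A 2D version would evaluate the same functions on the radial frequency grid of the image spectrum and multiply each window with the spectrum before the inverse transform.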
A WAP, denoted by $\Phi = [A, \mathbf{s}]$, consists of an assembly of feature patches $A = \{\{A_{s,i}\}_{s=1}^{L}\}_{i=1}^{N}$ extracted at all the landmarks $\{\mathbf{x}_i\}_{1}^{N}$, and a shape $\mathbf{s} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$ designating the locations of the patches. At larger scales, fewer patches are manually chosen at key landmarks to reduce the overlap. $\Phi$ is then flattened into a 1D vector serving as the profile of the anatomy. A further feature extraction function such as the histogram of oriented gradients (HOG) can be readily applied to the patches to reduce the dimensionality and enhance robustness, i.e., $h(A) = \{\{h(A_{s,i})\}_{s=1}^{L}\}_{i=1}^{N}$, see Fig. 1(f). To reconstruct the original appearance from the profile, we first pad the patches at each scale with the geometry configured by $\mathbf{s}$ to recover the individual channels. As the scales are complementary, all channels are then accumulated to recover the object appearance, see Fig. 1(g).

Landmark Detection. We integrate our WAP representation with the SDM algorithm [2] for robust landmark detection. To deduce the true landmark location $\mathbf{s}^{*}$ given an initial estimate $\hat{\mathbf{s}}$, we extract the descriptor $h(A(\hat{\mathbf{s}}))$ at $\hat{\mathbf{s}}$ and learn the mapping $h(A(\hat{\mathbf{s}})) \to \Delta\mathbf{s}^{*}$, in which $\Delta\mathbf{s}^{*} = \mathbf{s}^{*} - \hat{\mathbf{s}}$. The direct mapping function satisfying all the cases in the dataset is non-linear in nature and can

² Spectrum cropping as image downsampling is explained at https://goo.gl/ApJJeL.


Q. Zhang et al.

be over-fitted. So we adopt the SDM and approximate the non-linear mapping with a sequence of linear mappings $\{R^{(i)}, \mathbf{b}^{(i)}\}$ and landmark updating steps,

Mapping: $\Delta\mathbf{s}^{(i)} = R^{(i)}\, h(A(\hat{\mathbf{s}}^{(i)})) + \mathbf{b}^{(i)}$,
Updating: $\hat{\mathbf{s}}^{(i+1)} = \hat{\mathbf{s}}^{(i)} + \Delta\mathbf{s}^{(i)}$.    (2)

The descriptor $h(A)$ is extracted and updated at each iteration. More details on the SDM can be found in [2].

Anatomical Classification. For the classification tasks, the correspondence of anatomical features should be built such that the differences among the descriptors account for the true variations rather than for misalignment. In a WAP the appearance correspondence is built by extracting local features at corresponding landmarks. A classifier predicts the label $\ell$ given an anatomical observation $\Phi$, i.e., $\ell = \arg\max p(\ell \mid \Phi)$. The most significant variations in the training data $\{\Phi\}$ can be learned by principal component analysis and the dimensionality reduced by preserving the first $t$ significant components, which span a feature space $P \in \mathbb{R}^{M \times t}$ with $M$ being the dimensionality of $\Phi$. A WAP can therefore be represented in the feature space by a compact set of parameters $\mathbf{b}_\Phi$, i.e., $\mathbf{b}_\Phi = P^{T}(\Phi - \bar{\Phi})$, in which $\bar{\Phi}$ is the mean of $\{\Phi\}$. Using $\mathbf{b}_\Phi$ as inputs, the classifier now predicts $\ell = \arg\max p(\ell \mid \mathbf{b}_\Phi)$.
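The two linear steps above, the SDM cascade of Eq. (2) and the PCA projection onto the feature space, can be sketched as follows. This is a minimal illustration assuming NumPy; the names `sdm_refine`, `describe`, `pca_basis` and `project` are hypothetical, the regressors are taken as given rather than learned, and the descriptor is passed in as a callback standing in for $h(A(\cdot))$.

```python
import numpy as np


def sdm_refine(s_init, describe, regressors):
    """Apply a cascade of linear updates to a landmark estimate.

    describe(s) stands in for the descriptor h(A(s)); regressors is a
    sequence of (R, b) pairs, assumed to have been learned beforehand.
    """
    s = np.asarray(s_init, dtype=float)
    for R, b in regressors:
        delta = R @ describe(s) + b  # mapping step
        s = s + delta                # updating step
    return s


def pca_basis(profiles, t):
    """Mean profile and the first t principal directions (columns of P)."""
    mean = profiles.mean(axis=0)
    _, _, vt = np.linalg.svd(profiles - mean, full_matrices=False)
    return mean, vt[:t].T  # P has shape (M, t)


def project(profile, mean, P):
    """Compact parameters b = P^T (profile - mean)."""
    return P.T @ (profile - mean)
```

In this sketch each profile is one row of `profiles`, so `P` has the $M \times t$ shape used in the text, and the descriptor is re-evaluated at every cascade stage, as the text prescribes.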

3

Results and Discussion

Clinical Background. Lumbar spinal stenosis (LSS) is a common disorder of the spine. The important function of radiological studies is to evaluate the morphological abnormalities and make the anatomical classification. Disc-level axial images in MRI scans can provide rich information for the diagnosis. In paired sagittal-axial scans, the disc-level planes (red line in Fig. 3(a)) are localised in sagittal scans, and the geometry is mapped to the registered axial scans (dashed lines in Fig. 3(a)) to extract the disc-level images. On the disc-level image shown in Fig. 3(b), conditions of the posterior margins of the disc (red line), posterior

Fig. 3. (a) Mid-sagittal scan of a lumbar spine. (b) Anatomy of a L3/4 disc-level axial image. (c) Severe central stenosis. (d) Foraminal stenosis. The neural foramen are suppressed by the thickening of the facet (green) and the disc (red).


spinal canal (cyan line) and the facet between the superior and inferior articular processes (green line) are typically inspected for diagnosis and grading. Degeneration of these structures can constrict the spinal canal and the neural foramen, causing central and foraminal stenosis. Pathological examples are given in Fig. 3(c, d). In clinical practice, parameters such as the antero-posterior diameter and the cross-sectional area of the spinal canal are typically measured [7]. However, there is a lack of consensus on these parameters and no diagnostic criteria are generally applicable [8]. A more detailed appearance model of the anatomy, followed by a classification, could therefore contribute to reliable diagnoses.

Data and Settings. The dataset consists of T2-weighted MRI scans of 200 patients with varied LSS symptoms. Each patient has routine paired sagittal-axial scans. The L3/4, L4/5 and L5/S1 disc-level axial planes are localised in the sagittal scans and the images sampled from the axial scans. We obtain 3 subsets of 200 disc-level images from the three intervertebral planes, 600 images in total. Each image is inspected and graded, and the anatomy annotated with 37 landmarks outlining the disc, central canal and facet.

Results of Landmark Detection. To cover richer pathological variations, we perform the landmark detection on the mixed dataset containing all 600 images. We randomly choose 300 images for training and detect the landmarks on the remaining 300. Two metrics are used for the evaluation: the Point to Boundary Distance (PtoBD) and the Dice Similarity Coefficient (DSC) of the canal and disc contours. The DSC measures the overlap between a fitted shape and the ground truth, DSC = 2 · tp/(2 · tp + fp + fn), with tp, fp and fn denoting the true positive, false positive and false negative values, respectively. We compare the proposed WAP with AAMs, ASMs and CLMs.
To validate the improvement of the loglets pyramid decomposition, we also report the performance of an alternative model by replacing our pyramids with the original images but using the same HOG features and SDM algorithm. We refer to this control model as WAP−. The mean results of landmark detection are shown in Table 1. We can see that the WAP outperforms the other methods by a favourable margin. Several qualitative results by WAP are shown in Fig. 4 (top). Generating a WAP is also efficient as the filtering is conducted directly in the Fourier domain. The most expensive computation is extracting the HOG descriptors, which takes only 54 ms on an image of size 496 × 496 pixels.

Table 1. Performance of landmark detection

Metrics           | AAM         | ASM         | CLM         | WAP−        | WAP
PtoBD (in pixels) | 3.10 ± 1.29 | 2.51 ± 1.32 | 2.34 ± 1.15 | 1.95 ± 0.92 | 1.87 ± 0.73
DSC (%)           | 90.6 ± 4.9  | 92.1 ± 5.2  | 92.4 ± 5.2  | 93.9 ± 3.3  | 94.7 ± 2.6
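As a small worked illustration of the DSC definition used for Table 1, the sketch below evaluates DSC = 2 · tp/(2 · tp + fp + fn) on two binary masks. It assumes NumPy and assumes that the fitted shape and the ground truth have already been rasterised into masks of equal size; the function name `dice_coefficient` is hypothetical.

```python
import numpy as np


def dice_coefficient(fitted, truth):
    """Dice similarity coefficient of two binary masks of equal shape."""
    fitted = np.asarray(fitted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(fitted & truth)
    fp = np.count_nonzero(fitted & ~truth)
    fn = np.count_nonzero(~fitted & truth)
    denominator = 2 * tp + fp + fn
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return 2 * tp / denominator if denominator else 1.0
```

For example, a fitted mask covering two pixels of which one lies inside a one-pixel ground truth gives tp = 1, fp = 1, fn = 0 and hence a DSC of 2/3.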

Results of Anatomical Classification. For central stenosis, in each of the three subsets, the morphology of the central canal is inspected and labelled with


three grades: normal, moderate and severe. The average appearances of these classes delineated by WAPs are shown in Fig. 5(a). We randomly pick 100 samples to train the classifier, test on the remaining 100, and repeat this 100 times for an unbiased result. The WAPs extracted from the detected landmarks are projected onto the feature space and represented by a compact set of parameters $\mathbf{b}_\Phi$ (Fig. 4, bottom), which are used as inputs of the classifier. The performance of normal/abnormal classification is measured with accuracy, which is calculated as (tp + tn)/(tp + tn + fp + fn). The grading errors are measured with the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). We compare the performance of our method against approaches using other models as inputs to the same classifier. The agreement of the results with manual inspection is reported in Table 2. Similarly, we perform another normal/abnormal classification on the morphology of the neural foramina. The average appearances are given in Fig. 5(b). The classification accuracy of the compared methods is reported in Table 3. We can see that in both tasks, our WAP appearance models enable a significant improvement.
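The agreement measures named above can be written down compactly. The sketch below assumes NumPy and assumes that grades are encoded as integers with 0 = normal, 1 = moderate and 2 = severe (the encoding is not stated in the text); the function name `grading_agreement` is hypothetical.

```python
import numpy as np


def grading_agreement(predicted, reference):
    """Return (accuracy of normal/abnormal, MAE, RMSE) for integer grades.

    Assumed encoding: 0 = normal, 1 = moderate, 2 = severe.
    """
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Accuracy = (tp + tn) / (tp + tn + fp + fn) on the binarised labels.
    accuracy = np.mean((predicted > 0) == (reference > 0))
    error = predicted - reference
    mae = np.mean(np.abs(error))
    rmse = np.sqrt(np.mean(error ** 2))
    return accuracy, mae, rmse
```

With this encoding, confusing moderate with severe costs one grade in MAE and RMSE but does not affect the normal/abnormal accuracy.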

Fig. 4. Top: Qualitative results of landmark detection. Bottom: Appearance fitted by WAP. The appearances shown are represented by b Φ in the feature space which are used as inputs for classification.

Fig. 5. Average appearance of classes represented by WAP. (a) Three grades of central stenosis. (b) Normal and abnormal in terms of foraminal stenosis.


Table 2. Agreement of classification and grading of central stenosis

       | Accuracy (%) of classification         | MAE of grading       | RMSE of grading
Method | L3/4       | L4/5       | L5/S1      | L3/4 | L4/5 | L5/S1 | L3/4 | L4/5 | L5/S1
ASM    | 79.1 ± 4.8 | 77.4 ± 4.3 | 81.7 ± 4.5 | 0.25 | 0.31 | 0.20  | 0.55 | 0.67 | 0.48
AAM    | 70.1 ± 7.1 | 69.7 ± 7.3 | 71.3 ± 8.8 | 0.41 | 0.44 | 0.32  | 0.72 | 0.79 | 0.58
CLM    | 81.0 ± 4.9 | 82.4 ± 4.5 | 82.7 ± 4.4 | 0.23 | 0.25 | 0.23  | 0.53 | 0.56 | 0.52
WAP−   | 80.7 ± 4.9 | 82.1 ± 4.6 | 84.7 ± 4.2 | 0.23 | 0.25 | 0.18  | 0.53 | 0.58 | 0.47
WAP    | 84.7 ± 4.6 | 84.5 ± 4.3 | 85.9 ± 4.2 | 0.19 | 0.21 | 0.16  | 0.48 | 0.54 | 0.44

Table 3. Accuracy (%) of classification of foraminal stenosis

Anatomy | ASM        | AAM        | CLM        | WAP−       | WAP
L3/4    | 83.3 ± 3.8 | 73.3 ± 5.5 | 83.1 ± 4.7 | 84.3 ± 4.1 | 85.0 ± 3.9
L4/5    | 82.4 ± 4.6 | 76.2 ± 5.8 | 83.3 ± 4.3 | 86.9 ± 3.9 | 87.8 ± 3.5
L5/S1   | 81.8 ± 4.7 | 74.5 ± 5.7 | 82.9 ± 4.5 | 85.2 ± 4.3 | 85.7 ± 4.3

4

Conclusions

We have presented a novel appearance model and demonstrated its applications to the problem of LSS for variability modelling, landmark detection and pathology classification. The improvement in the diagnosis and grading lies in its ability to capture detailed appearances, better appearance correspondence by scale decomposition, and more precise landmark detection. The model can be readily applied to other anatomical areas for clinical tasks requiring segmentation and classification. The source code will be released for research purposes. For the task of LSS, our future work is aimed towards patient-level diagnosis by utilising the image-level anatomical classification together with etiological information.

References

1. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23(6), 681–685 (2001)
2. Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: 2013 IEEE Conference on CVPR, pp. 532–539. IEEE (2013)
3. Lindner, C., Thiagarajah, S., Wilkinson, J., Consortium, T., Wallis, G., Cootes, T.F.: Fully automatic segmentation of the proximal femur using random forest regression voting. IEEE Trans. Med. Imaging 32(8), 1462–1472 (2013)
4. Antonakos, E., Alabort-i Medina, J., Zafeiriou, S.: Active pictorial structures. In: Proceedings of the IEEE Conference on CVPR, pp. 5435–5444 (2015)
5. Saragih, J.M., Lucey, S., Cohn, J.F.: Face alignment through subspace constrained mean-shifts. In: IEEE 12th International Conference on Computer Vision, pp. 1034–1041. IEEE (2009)
6. Knutsson, H., Andersson, M.: Loglets: generalized quadrature and phase for local spatio-temporal structure estimation. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 741–748. Springer, Heidelberg (2003). doi:10.1007/3-540-45103-X_98
7. Steurer, J., Roner, S., Gnannt, R., Hodler, J.: Quantitative radiologic criteria for the diagnosis of lumbar spinal stenosis: a systematic literature review. BMC Musculoskelet. Disorders 12(1), 175 (2011)
8. Ericksen, S.: Lumbar spinal stenosis: imaging and non-operative management. Semin. Spine Surg. 25, 234–245 (2013). Elsevier
9. Cristinacce, D., Cootes, T.: Automatic feature localisation with constrained local models. Pattern Recogn. 41(10), 3054–3067 (2008)

A Learning-Free Approach to Whole Spine Vertebra Localization in MRI

Marko Rak(B) and Klaus-Dietz Tönnies

Department of Simulation and Graphics, Otto von Guericke University, Magdeburg, Germany
[email protected]

Abstract. In recent years, analysis of magnetic resonance images of the spine has gained considerable interest, with vertebra localization being a key step for higher-level analysis. Approaches based on trained appearance - which are the de facto standard - may be inappropriate for certain tasks, because processing usually takes several minutes or training data is unavailable. Learning-free approaches have yet to show their competitiveness for whole-spine localization. Our work fills this gap. We combine a fast engineered detector with a novel vertebrae appearance similarity concept. The latter can compete with trained appearance, which we show on a data set of 64 T1- and 64 T2-weighted images. Our detection took 27.7 ± 3.78 s with a detection rate of 96.0 % and a distance to ground truth of 3.45 ± 2.2 mm, which is well below the slice thickness.

1

Motivation

The localization of vertebrae is one of the key steps for spine analysis based on magnetic resonance imaging. It may be used to initialize spine registrations, for curved planar reformations [8], for scan geometry planning [9], and to ease vertebrae segmentation and Cobb angle measurements [10]. Depending on the application, time may be a critical resource, which is often overlooked in previous works. According to a recent review [5], the majority of works use learned vertebra detectors, which may take minutes to process only a section of the spine, cf. [1]. Moreover, even if training data is available, the man-hours spent on data preparation may already be wasted when acquisition parameters change. Learning-free approaches reviewed in [5] are specialized to the (thoraco-)lumbar section of the spine, cf. [6,8]. To the best of our knowledge, we are the first to show that whole-spine localization is feasible without preliminary training of any kind.

After detection, vertebrae candidates are usually fed into curve-fitting strategies like random sample consensus [2,6], least-trimmed squares [8] and others [10] to sort out the wrong candidates. Alternatively, a graphical model is used to infer the sought candidates [3,4,9]. We apply graphical modeling as well. However, instead of another strong appearance learner we contribute a weak but quickly computable candidate detector. We exploit the homogeneity of vertebral bone and the increasing vertebra size in superior-inferior direction, which is observable in any spine image regardless of the imaging sequence or acquisition parameters.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 283–290, 2016. DOI: 10.1007/978-3-319-46723-8_33

M. Rak and K.-D. Tönnies

To compete with strong learners, we design a novel appearance energy which exploits the similar appearance of adjacent vertebrae, making our information more case-specific compared to preliminary trained information.

2

Graphical Modeling

We model vertebra localization as an inference task on a second-order Markov random field. As depicted in Fig. 1, our model comprises a node for each vertebra, and connections between nodes account for vertebrae dependencies. Although such dependencies may span several vertebrae in reality, most previous works apply first-order Markov models, cf. [3,9]. Only [4] implements second-order models, as we do, to give a more complete picture of vertebrae dependencies.
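For a chain-structured second-order model of this kind, the lowest-energy constellation can be found exactly by dynamic programming over pairs of adjacent candidates, which for a chain is equivalent to min-sum message passing. The sketch below is only an illustration under stated assumptions: the name `infer_second_order_chain` and the `triple_energy` callback are hypothetical, the model is assumed to consist solely of terms over three consecutive vertebrae, and no claim is made that this matches the authors' implementation.

```python
import math


def infer_second_order_chain(candidates, triple_energy):
    """Exact minimum-energy selection on a second-order chain.

    candidates[i] lists the candidates of vertebra i (at least 3 vertebrae).
    triple_energy(i, a, b, c) is a user-supplied (hypothetical) callback that
    scores candidates a, b, c placed at vertebrae i-1, i, i+1.
    """
    n = len(candidates)
    # cost[(a, b)]: best energy of a prefix ending with candidate indices a, b.
    cost = {(a, b): 0.0
            for a in range(len(candidates[0]))
            for b in range(len(candidates[1]))}
    back = []
    for i in range(2, n):
        new_cost, pointers = {}, {}
        for (a, b), prefix in cost.items():
            for c in range(len(candidates[i])):
                energy = prefix + triple_energy(
                    i - 1, candidates[i - 2][a], candidates[i - 1][b],
                    candidates[i][c])
                if (b, c) not in new_cost or energy < new_cost[(b, c)]:
                    new_cost[(b, c)] = energy
                    pointers[(b, c)] = a
        back.append(pointers)
        cost = new_cost

    # Backtrack from the best final pair to the first vertebra.
    (b, c), best = min(cost.items(), key=lambda item: item[1])
    picks = [b, c]
    for pointers in reversed(back):
        picks.insert(0, pointers[(picks[0], picks[1])])
    return [candidates[i][k] for i, k in enumerate(picks)], best
```

The state space is the set of candidate pairs, so the cost grows with the product of three neighbouring candidate-set sizes per vertebra. If every constellation has infinite energy, the returned energy is `math.inf`.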

Fig. 1. As illustrated (bottom row) on a T1 /T2 -weighted image (left/right), our Markov model (top row) aims to select the vertebrae (plus signs) out of a number of candidates (plus signs and crosses), being the minima of the underlying homogeneity measure.

Given a set $C_i$ of candidates for each vertebra $i$, we seek to infer the most likely constellation of candidates $\mathbf{c} = (c_1, \ldots, c_N) \in C_1 \times \ldots \times C_N$. The likelihood is defined via the model energy $E(\mathbf{c} \mid I)$, where $I$ is the current image. Since the factor graph of our model is a tree, inference of $\mathbf{c}$ is feasible by belief propagation. Candidate-based approaches may be problematic when certain vertebrae are missed during detection. Like in [3,4], we introduce a virtual catchment candidate for each vertebra and “tie” its neighbors together as discussed later on. This is not possible with first-order Markov models, which can only enforce positional relations between directly adjacent vertebrae. Thus gaps may occur at catchment candidates, which manifest as disconnected vertebrae chains in image space.

2.1

Candidate Detection

Vertebrae detection is often seen as a sliding window classification task. For instance, support vector machines [1,3,4] and AdaBoost [9,10] are applied to histograms of oriented gradients [1,3,4] and Haar-like features [9,10]. Orientation and size differences are addressed by searching over various scales [1,3,4,10]


and rotations [3] or classifiers are trained on differently scaled [2] and rotated data [2,10]. Either way, the computational costs are significant, especially when classifiers are trained per spine section [1,3,9] or even per vertebra [4]. Our approach takes the opposite way. Instead of a strong appearance learner, we apply a rather weak but quickly computable homogeneity criterion and sort out wrong candidates later by establishing additional appearance relations between adjacent vertebrae. We assess homogeneity by the Shannon entropy

$$H(X) = -\sum_{i=1}^{\#\mathrm{bins}} \Pr(x_i) \log_2(\Pr(x_i)) \qquad (1)$$

of the local intensity histogram $X$ at each node of the image grid and take the spatial minima of the entropy map as our candidates (see Fig. 1). A similar strategy is used in [8], where two-dimensional entropy maps are calculated via a distance-weighted sum of the entropies of several concentric rings around a grid node. We made their idea feasible for three-dimensional regions of different sizes. To this end, we apply the distance weighting on the voxel level, letting voxels contribute only their assigned weight to a local intensity histogram. The results are similar to [8], but ours can be obtained by convolution, which accelerates the computation, especially when using a separable weighting kernel. Also desirable is cascade convolution, where a sequence of small incremental kernels is used to generate results of increasing extent to address differently sized vertebrae. Considering both factors, a Gaussian weighting is a reasonable choice.

Summarizing, we detect vertebra candidates by Algorithm 1, where the size range [a, b] reflects the growth of vertebrae in superior-inferior direction. The choice of the size range is uncritical, because the gradually decreasing Gaussian weighting favors centrality even if selected and true sizes do not match accurately. The different extents along the vertebral body axis are compensated by this too. A factor of 0.75 (first line) ensures underestimation of the true sizes to cover even small subjects. To speed up the detection further, we use a pyramidal implementation, downsampling the histogram array B by a factor of two along its spatial dimensions whenever σ exceeds a power of two times the voxel size.

1  {σ_i} ← EquidistantSizes(0.75 · a, 0.75 · b, N);   // σ_i in mm
2  B ← InitializeHistogramArray(I);                   // size #rows × #columns × #slices × #bins
3  for i ← 1 to N do
       /* calculate incremental sigma */
4      σ_Δ² ← σ_i² − σ_{i−1}²;                        // special case σ_0 = 0
       /* update previous histograms */
5      G ← CreateGaussianKernel(σ_Δ);
6      B ← SeparableConvolution(B, G);                // along rows, columns & slices
       /* extract vertebrae candidates */
7      H ← EstimateEntropies(B);                      // size #rows × #columns × #slices
8      C_i ← ExtractLocalMinima(H);                   // candidate set for vertebra i
9  end

Algorithm 1. Pseudocode implementation of our candidate detection
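A runnable sketch of Algorithm 1 may help to make the individual steps concrete. It assumes NumPy only and makes several simplifications that are not from the paper: the pyramidal downsampling is omitted, the Gaussian kernel is truncated at three standard deviations with edge padding, a candidate is taken to be a strict minimum of its 3^d neighbourhood, and the number of bins is capped by an assumed parameter `n_bins`. The log2-domain binning (steps of 0.5 starting at log2(16), lower intensities in a background bin) follows the preprocessing described in the experiments section; all function names are hypothetical.

```python
import itertools
import numpy as np


def log2_bin_indices(image, start=16.0, step=0.5, n_bins=16):
    """Bin intensities in the log2 domain; values below `start` go to bin 0."""
    image = np.asarray(image, dtype=float)
    logs = np.log2(np.maximum(image, start))  # clamp to avoid log2(0)
    idx = 1 + np.floor((logs - np.log2(start)) / step).astype(int)
    idx[image < start] = 0
    return np.minimum(idx, n_bins - 1)  # n_bins is an assumed cap


def gaussian_kernel(sigma):
    radius = max(1, int(np.ceil(3.0 * sigma)))  # assumed truncation at 3 sigma
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def smooth_spatial(hist, sigma):
    """Separable Gaussian smoothing along every axis except the last (bins)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    for axis in range(hist.ndim - 1):
        pad = [(0, 0)] * hist.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(hist, pad, mode="edge")
        hist = np.apply_along_axis(np.convolve, axis, padded, kernel, mode="valid")
    return hist


def entropy_map(hist):
    """Shannon entropy (in bits) of the histogram stored at each grid node."""
    p = hist / hist.sum(axis=-1, keepdims=True)
    logp = np.log2(np.where(p > 0, p, 1.0))  # 0 * log(0) is treated as 0
    return -(p * logp).sum(axis=-1)


def local_minima(values):
    """Boolean mask of grid nodes strictly smaller than all their neighbours."""
    padded = np.pad(values, 1, mode="constant", constant_values=np.inf)
    mask = np.ones(values.shape, dtype=bool)
    for offset in itertools.product((-1, 0, 1), repeat=values.ndim):
        if any(offset):
            window = tuple(slice(1 + o, 1 + o + n)
                           for o, n in zip(offset, values.shape))
            mask &= values < padded[window]
    return mask


def detect_candidates(image, a, b, n_vertebrae, voxel_size=1.0, n_bins=16):
    """Candidate grid positions per vertebra, following Algorithm 1."""
    sigmas = np.linspace(0.75 * a, 0.75 * b, n_vertebrae) / voxel_size
    hist = np.eye(n_bins)[log2_bin_indices(image, n_bins=n_bins)]  # one-hot
    candidates, previous = [], 0.0
    for sigma in sigmas:
        increment = np.sqrt(sigma ** 2 - previous ** 2)  # incremental sigma
        if increment > 0:
            hist = smooth_spatial(hist, increment)
        candidates.append(np.argwhere(local_minima(entropy_map(hist))))
        previous = sigma
    return candidates
```

The cascade relies on the fact that convolving Gaussians adds their variances, so smoothing the already smoothed histograms with the incremental sigma yields the same result as smoothing the initial histograms once with σ_i (up to kernel truncation).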

2.2

Derivation of Energies

To account for the dependencies among adjacent vertebrae, we incorporate information about vertebrae appearance and spine geometry, i.e. relative position, into our model by an affine combination of the respective energies according to

$$E(\mathbf{c} \mid I) = \sum_{i=2}^{N-1} \bigl[\, \alpha\, E_{\mathrm{app}}(c_{i-1}, c_i, c_{i+1} \mid I) + (1 - \alpha)\, E_{\mathrm{geo}}(c_{i-1}, c_i, c_{i+1}) \,\bigr] \qquad (2)$$

where $\alpha \in (0, 1)$ controls the influence of both kinds of information. We designed $E_{\mathrm{app}}(\,\cdot \mid I)$ and $E_{\mathrm{geo}}(\,\cdot\,)$ to accept a wide range of variation per vertebra (large valley of low costs) and solely penalize unrealistic variation (sharp increase of costs). The unique composition of the spine, i.e. the many close-by and similarly-appearing parts, should provide sufficient compensation here.

Appearance. Previous works assign an appearance score to each detected candidate, e.g. the probability output of their classifier, and use this as the unary term to form $E_{\mathrm{app}}(\,\cdot \mid I)$. We use a novel higher-order term to extract case-specific appearance “on the fly”, avoiding any beforehand learning of appearance. To this end, we exploit the fact that adjacent vertebrae appear very similar. Let $X$ and $Y$ be the local intensity histograms of two adjacent vertebrae, then their similarity can be assessed by the Kullback-Leibler divergence

$$D_{\mathrm{KL}}(X \,\|\, Y) = \sum_{i=1}^{\#\mathrm{bins}} \Pr(x_i) \log_2\bigl(\Pr(x_i) \cdot \Pr(y_i)^{-1}\bigr). \qquad (3)$$

Since we still want both vertebrae to be homogeneous, we combine entropy and Kullback-Leibler divergence into a cross-entropy formulation. The most natural way would be $H(X; Y) = H(X) + D_{\mathrm{KL}}(X \,\|\, Y)$, which is neither symmetric nor always defined. Instead, we use the following symmetric variant

$$\bar{H}(X; Y) = H(X) + D_{\mathrm{KL}}(X \,\|\, X + Y) + H(Y) + D_{\mathrm{KL}}(Y \,\|\, Y + X), \qquad (4)$$

which has neither problem and is also less rigorous to small histogram differences (large valley of low costs), because of the averaging in the second argument of the Kullback-Leibler divergence. Eventually, our appearance energy reads

$$E_{\mathrm{app}}(c_i, c_j, c_k \mid I) = \begin{cases} \tfrac{1}{2} \bar{H}(X_i; X_j) + \tfrac{1}{2} \bar{H}(X_j; X_k) & \text{if none is virtual} \\ \bar{H}(X_j; X_k),\ \bar{H}(X_i; X_k) \text{ or } \bar{H}(X_i; X_j) & \text{if only one is virtual} \\ +\infty & \text{otherwise,} \end{cases} \qquad (5)$$

where $X_i$, $X_j$ and $X_k$ are the local intensity histograms of $c_i$, $c_j$ and $c_k$, respectively. The last case prohibits several virtual candidates in a row. In previous works, catchment candidates are realized by penalties learned from training data [3,4]. This is unnecessary here, because case two involves a natural penalty, i.e. distant vertebrae usually appear more different than adjacent ones.
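Eqs. (1) and (3) to (5) can be sketched in a few lines. The sketch assumes NumPy, represents a virtual catchment candidate by `None`, and interprets the second argument "X + Y" of the Kullback-Leibler divergence as the normalised pooled histogram, i.e. the average of the two normalised histograms. That interpretation is an assumption based on the remark about averaging; the function names are hypothetical.

```python
import numpy as np


def normalise(hist):
    hist = np.asarray(hist, dtype=float)
    return hist / hist.sum()


def entropy(p):
    """Shannon entropy in bits, Eq. (1), with 0 * log(0) taken as 0."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits, Eq. (3); needs q > 0 where p > 0."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())


def symmetric_cross_entropy(x, y):
    """Symmetric variant of Eq. (4); 'X + Y' is read as the averaged histogram."""
    x, y = normalise(x), normalise(y)
    pooled = 0.5 * (x + y)  # assumed meaning of X + Y
    return (entropy(x) + kl_divergence(x, pooled)
            + entropy(y) + kl_divergence(y, pooled))


def appearance_energy(hist_i, hist_j, hist_k):
    """Appearance energy of Eq. (5); None marks a virtual candidate."""
    real = [h for h in (hist_i, hist_j, hist_k) if h is not None]
    if len(real) == 3:
        return (0.5 * symmetric_cross_entropy(hist_i, hist_j)
                + 0.5 * symmetric_cross_entropy(hist_j, hist_k))
    if len(real) == 2:
        return symmetric_cross_entropy(real[0], real[1])
    return float("inf")
```

Because the pooled histogram is positive wherever either input is positive, both divergences are always finite, which is the property that the asymmetric form $H(X) + D_{\mathrm{KL}}(X \,\|\, Y)$ lacks.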


Geometry. To guarantee connectedness, geometrical dependencies are needed. These may be implemented by distances between vertebrae [3,9] and centrality with respect to the neighbors [4,9]. Even vertebrae orientation is used [3,9] if detection is sensitive to orientation, which is not the case here. We restrict ourselves to a centrality-based energy and constrain distances by implementation as discussed later. The rationale is that centrality already integrates relative distances and absolute distances vary strongly with the imaged subject, requiring large safety margins or low relative importance w.r.t. the model energy. Assessing centrality, we need to take into account that detected centers and their expected positions given their neighbors only represent beliefs about the true vertebra center. Thus, both are only correct in a probabilistic sense, which is why we model each as a Gaussian in image space. Let $N_c$ and $N_e$ be these Gaussians, then their agreement is assessed by the Kullback-Leibler divergence

$$D_{\mathrm{KL}}(N_c \,\|\, N_e) = (2\sigma^2 \ln(2))^{-1} (\mu_c - \mu_e)^{T} (\mu_c - \mu_e), \qquad (6)$$

where $\sigma^2$ is their assumed (shared) variance. The factor $\ln(2)$ transforms units from nats to bits, to be compatible with Eq. 4. Eventually, our geometric energy reads

$$E_{\mathrm{geo}}(c_i, c_j, c_k) = \begin{cases} D_{\mathrm{KL}}\bigl(N(p_j, \sigma_j) \,\|\, N(\tfrac{p_i + p_k}{2}, \sigma_j)\bigr) & \text{if none is virtual} \\ (2 \ln(2))^{-1} & \text{if only one is virtual} \\ +\infty & \text{otherwise,} \end{cases} \qquad (7)$$

where $p_i$, $p_j$ and $p_k$ are the centers of $c_i$, $c_j$ and $c_k$, respectively. Case two treats catchment candidates as if they were $\sigma_j$ (assumed vertebra size) apart from their expected position, which is just sufficient penalty to not prefer them per se. Typically, learned statistics are used to form the geometric energy. We found this unnecessary, because training data often contains too few samples for reliable (multivariate) estimates of the true dependencies, and vertebrae localization approaches outside the graphical modeling paradigm, e.g. the curve-fitting approaches mentioned earlier, do not apply geometrical learning at all.

Implementation. Evaluating energies for every combination of candidates poses a combinatorial problem, at least for whole-body three-dimensional tasks. It is unclear how previous works addressed this issue. We propose to filter spatially reasonable combinations prior to inference. To this end, we perform a range query (k-d tree implementation) around each candidate of vertebra $i$ to identify all candidates of vertebra $j$ that lie within a range of $[0.5, 2] \cdot (\sigma_i + \sigma_j)$ mm. Safety factors 0.5 and 2 avoid very close and distant results, respectively.
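The geometric energy of Eq. (7) and the spatial pre-filtering can be sketched as follows. The sketch assumes NumPy, represents a virtual catchment candidate by `None`, and replaces the k-d tree range query with a brute-force distance matrix, which returns the same pairs but scales quadratically; the function names are hypothetical.

```python
import numpy as np


def geometric_energy(p_i, p_j, p_k, sigma_j):
    """Centrality energy of Eq. (7); None marks a virtual candidate."""
    n_virtual = sum(p is None for p in (p_i, p_j, p_k))
    if n_virtual == 0:
        expected = 0.5 * (np.asarray(p_i, float) + np.asarray(p_k, float))
        diff = np.asarray(p_j, float) - expected
        return float(diff @ diff) / (2.0 * sigma_j ** 2 * np.log(2.0))
    if n_virtual == 1:
        return 1.0 / (2.0 * np.log(2.0))
    return float("inf")


def plausible_pairs(centers_i, centers_j, sigma_i, sigma_j):
    """Index pairs whose distance lies in [0.5, 2] * (sigma_i + sigma_j).

    Brute-force stand-in for the k-d tree range query named in the text.
    """
    centers_i = np.asarray(centers_i, dtype=float)
    centers_j = np.asarray(centers_j, dtype=float)
    dist = np.linalg.norm(centers_i[:, None, :] - centers_j[None, :, :], axis=-1)
    lower = 0.5 * (sigma_i + sigma_j)
    upper = 2.0 * (sigma_i + sigma_j)
    return np.argwhere((dist >= lower) & (dist <= upper))
```

A center that lies exactly $\sigma_j$ away from the midpoint of its neighbours receives the same energy as a virtual candidate, which matches the interpretation of case two given above.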

3

Experiments

3.1

Data Set and Preprocessing

We carried out experiments on T1 - and T2 -weighted images of 64 subjects from the “Study of Health in Pomerania” [7], acquired by turbo spin echo sequences on


two Siemens 1.5 Tesla Magnetom Avanto imagers. During acquisition, a field-of-view of typically 900 × 500 × 66 mm was sliced sagittally at a resolution of 1.12 × 1.12 × 4.4 mm. To simplify processing, we upsampled all images in mediolateral direction to isotropic voxels. For each subject, we are given annotations of vertebra centers from two independent readers. For comparison to previous works, we assessed the detection rate, i.e. the portion of correctly detected vertebrae, and the localization error, i.e. the Euclidean distance to ground truth. In three subjects the number of presacral vertebrae differed by one from expectation. We used shortened/prolonged models in these cases.

During preprocessing we transform image intensities into the log2-domain and apply intensity binning therein. The rationale is that bins should widen with signal level, because noise variance - under Rician noise - does too. The binning uses steps of 0.5, starting at log2(16) for the second bin, while lower intensities are assigned to the first (background) bin. We did not find any deterioration of entropies unless bin widths were increased above 1.0. To select a size range, we took the data set’s first subject as reference, yielding a range of [5, 11] mm. As mentioned earlier, our approach requires only a rough estimate of the true range. The relative weight of candidate appearance and geometry was set to 0.5, but can be varied between 0.25 and 0.75 with little effect. This is due to the fact that both energies allow a wide range of variation (large valleys of low costs) and penalize only unrealistic variation strongly (sharp increase of costs).

3.2

Results and Discussion

Experiments were carried out on an Intel Core i5 2500 @ 4 × 3.30 GHz. During computation, most time was spent on candidate detection, which took 27.7 ± 3.78 s. The model inference, i.e. calculation of potentials and belief propagation, took only a few seconds. For comparison, [1] reported 4.1 ± 1.32 min (Intel Core i7 820QM @ 4 × 1.73 GHz) on cervical images. “Typically [...] less than a minute” (unknown system) was spent by [3] on thoracolumbar images that comprise around ten vertebrae, while run times range “from minutes to hours” (Intel Core 2 Duo @ 2 × 2.0 GHz) on lumbar images in [6]. Certainly, our hardware is faster, but the reported tasks were significantly smaller as well. Most time during candidate detection was spent on convolution, which is ideal for heavy parallelism. Therefore, we expect a considerable speed-up (factor > 2) from an implementation on graphics hardware, facilitating applicability in practice.

Quantitative results in comparison to previous works are listed in Table 1. Although the results are not one-to-one comparable, we still see that our learning-free approach to whole-spine vertebrae localization is competitive. Please note that the low L2-error listed for Daenzer [1] needs to be seen relative to the small sizes of cervical vertebrae. Splitting our results into T1- and T2-weighted subsets, we obtain detection rates of 94.2 % and 97.8 %, respectively. Most of the difference can be explained by five T1-weighted images that show little contrast between cervical vertebrae and the surrounding tissue. Thus, vertebral entropy minima fuse with adjacent structures (see Fig. 2). Regarding localization quality, both


Table 1. Comparison with previous works. ‘2d’ implies mid-sagittal; ‘+’ implies joint use; C - cervical; T - thoracic; L - lumbar; TL - thoracolumbar; W - whole spine.

   | Works       | Learning | Section    | Weighting | #Images | Detection rate | L2-Error [mm]
2d | Huang [2]   | ×        | C, T, L, W | T2        | 17      | 98.1 %         |
2d | Oktay [4]   | ×        | L          | T1+T2     | 80      | 97.8 %         | 3.1 ± ?
3d | Daenzer [1] | ×        | C          | T2        | 21      | -              | 1.64 ± 0.70
3d | Lootus [3]  | ×        | TL         | T2        | 291     | 92.7 %         | 3.3 ± 3.2
3d | Štern [6]   |          | L          | T1, T2    | 13      | -              | 2.9 ± 1.7
3d | Vrtovec [8] |          | TL         | T1, T2    | 18      | -              | 2.5 ± 1.1
3d | Zhan [9]    | ×        | C, L, W    | ?         | 300     | 97.7 %         | -
3d | Zukić [10]  | ×        | W          | ?         | 15      | -              | 3.07 ± ?
3d |             |          | L          | T1, T2    | 17      | 92.9 %         | -
3d | Ours        | -        | W          | T1, T2    | 128     | 96.0 %         | 3.45 ± 2.20

Fig. 2. Results on T1/T2-weighted images (left/right) for a successful case (top row) with kyphoscoliotic tendency and a problematic case (bottom row), where vertebral entropy minima (crosses) of the T1-weighted image fuse with adjacent cervical structures due to missing contrast. Results were projected onto the mid-sagittal slice.

subsets score similarly, i.e. 3.55 ± 2.26 mm and 3.34 ± 2.14 mm. For both, the L2-error is larger than the inter-reader variation (1.96 ± 1.23 mm). Interestingly, it is also larger than the inter-sequence variation (2.92 ± 2.56 mm), i.e. the distance between T1- and T2-weighted results. Hence, human readers and our results are more consistent with themselves than with each other, indicating that both have a slightly different “opinion” of what the vertebra center should be.

4

Conclusion

We proposed a learning-free approach to whole-spine vertebrae localization in magnetic resonance images. Our work combines a fast vertebra detector with a novel similarity-based appearance energy under a graphical modeling framework. Our approach is largely independent of any particular type of image, because we exploit only information that is common to vertebrae in general. Results on 64 T1- and 64 T2-weighted images show that our approach is competitive with previous works, most of which are learning-based.


Our approach has limitations. In case of severe vertebra pathologies, e.g. fractures, osteoporosis and tumors, or when larger implants are present, our approach may fail to detect these vertebrae, because our assumptions about their appearance are violated. This is not problematic in general, because we designed our model to compensate for such missing vertebra candidates. However, if the detection misses two or more directly adjacent vertebrae then our model needs to be extended to Markov fields of third-order or above for compensation. When the number of presacral vertebrae differs from expectation a shortened/prolonged model is used. Handling such cases automatically is a challenge for future work, requiring a substantially larger data set to include enough such cases. We think that concurrent optimization of differently sized models combined with an additional model selection stage is a promising option. Acknowledgments. We thank all parties of the Study of Health in Pomerania. This research was funded by the German Research Foundation (TO 166/13-2).

References

1. Daenzer, S., Freitag, S., von Sachsen, S., Steinke, H., Groll, M., Meixensberger, J., Leimert, M.: VolHOG: a volumetric object recognition approach based on bivariate histograms of oriented gradients for vertebra detection in cervical spine MRI. Med. Phys. 41(8), 082305.1–082305.10 (2014)
2. Huang, S.H., Chu, Y.H., Lai, S.H., Novak, C.L.: Learning-based vertebra detection and iterative normalized-cut segmentation for spinal MRI. IEEE Trans. Med. Imag. 28, 1595–1605 (2009)
3. Lootus, M., Kadir, T., Zisserman, A.: Vertebrae detection and labelling in lumbar MR images. In: Computational Methods and Clinical Applications for Spine Imaging, MICCAI 2013, vol. 17, pp. 219–230. Springer, Cham (2014)
4. Oktay, A.B., Akgul, Y.S.: Simultaneous localization of lumbar vertebrae and intervertebral discs with SVM-based MRF. IEEE Trans. Biomed. Eng. 60, 2375–2383 (2013)
5. Rak, M., Tönnies, K.D.: On computerized methods for spine analysis in MRI: a systematic review. Int. J. Comput. Assist. Radiol. Surg. 11(8), 1–21 (2016)
6. Štern, D., Likar, B., Pernuš, F., Vrtovec, T.: Automated detection of spinal centrelines, vertebral bodies and intervertebral discs in CT and MR images of lumbar spine. Phys. Med. Biol. 55, 247–264 (2010)
7. Völzke, H., Alte, D., Schmidt, C.O., Radke, D., Lorbeer, R., et al.: Cohort profile: the study of health in pomerania. Int. J. Epidemiol. 40, 294–307 (2011)
8. Vrtovec, T., Ourselin, S., Gomes, L., Likar, B., Pernuš, F.: Automated generation of curved planar reformations from MR images of the spine. Phys. Med. Biol. 52, 2865–2878 (2007)
9. Zhan, Y., Maneesh, D., Harder, M., Zhou, X.S.: Robust MR spine detection using hierarchical learning and local articulated model. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7510, pp. 141–148. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33415-3_18
10. Zukić, D., Vlasák, A., Egger, J., Hořínek, D., Nimsky, C., Kolb, A.: Robust detection and segmentation for diagnosis of vertebral diseases using routine MR images. Comput. Graph. Forum. 33, 190–204 (2014)

Automatic Quality Control for Population Imaging: A Generic Unsupervised Approach

Mohsen Farzi1,2, Jose M. Pozo1, Eugene V. McCloskey2, J. Mark Wilkinson2, and Alejandro F. Frangi1(B)

1 Centre for Computational Imaging and Simulation Technologies in Biomedicine, University of Sheffield, Sheffield, UK
[email protected]
2 Academic Unit of Bone Metabolism, University of Sheffield, Sheffield, UK

Abstract. Population imaging studies have opened new opportunities for a comprehensive characterization of disease phenotypes by providing large-scale databases. A major challenge is the ability to amass the vast amount of data automatically and accurately, and hence to develop streamlined image analysis pipelines that are robust to varying image quality. This requires a generic and fully unsupervised quality assessment technique. However, existing methods are designed for specific types of artefacts and cannot detect incidental unforeseen artefacts. Furthermore, they require manual annotations, which are demanding to produce, prone to error, and in some cases ambiguous. In this study, we propose a generic unsupervised approach to simultaneously detect and localize the artefacts. We learn the normal image properties from a large dataset by introducing a new image representation approach based on an optimal coverage of images with the learned visual dictionary. The artefacts are then detected and localized as outliers. We tested our method on a femoral DXA dataset with 1300 scans. The sensitivity and specificity are 81.82 % and 94.12 %, respectively.

1 Introduction

Over recent decades, large-scale databases containing prominent imaging components in addition to omics, demographics and other metadata have increasingly underpinned national population imaging cohorts [1,7,10], biomedical research (cf. [10]), and pharmaceutical clinical trials (cf. [6]). Large imaging databases provide new opportunities for the comprehensive characterization of the population or disease phenotype but equally pose new challenges. Collecting and processing large databases requires automatic, streamlined image analysis pipelines that are robust to the varying quality of images. In contrast, most image analysis methods developed in the last three decades were designed for relatively small datasets where visual quality checking and pruning is still feasible, and were thereby not exposed to heterogeneous image quality. On the other hand, visual quality assurance in multi-centre large-scale cohorts with several imaging sequences per subject is simply infeasible. Despite efforts to enforce strict imaging protocols to ensure consistently high-quality images, various incidental artefacts are inevitable, which will have a significant impact on the analysis techniques [8]. Detecting this small fraction of inadequate data requires an enormous amount of expert manual labour or resorting to sampling schemes, which are prone to miss some of the inadequate data. Hence, automatic image quality control in population imaging studies is an emerging challenge for the medical image computing community.

In the multimedia community, image quality assessment (IQA) has been explored extensively over the last two decades [5]. However, existing algorithms are not directly applicable to biomedical images, mainly for three reasons. Firstly, multimedia IQA algorithms generally quantify image quality in relation to human visual perception rather than being task-oriented, and are thereby not appropriate for detecting medical image artefacts, as pointed out by Xie et al. [12]. For a description of such artefacts see, for instance, Jacobson et al. [3] for dual energy X-ray absorptiometry (DXA). Secondly, in multimedia, a limited number of artefacts such as blurring, noise, and JPEG compression are normally of interest. This is in stark contrast with the much more diverse and difficult-to-catalogue artefacts in medical imagery. Furthermore, most of the literature on imaging artefacts has focused on their origins in the physics of image acquisition [11] or on their qualitative description (cf. [3]), which is not helpful for developing automated techniques for their detection. Although some research in the medical community addresses specific artefact types in an automatic way, such as eddy currents and head motion in diffusion MRI [8] or blurring and uneven illumination in dermoscopy [12], these methods are not general purpose and cannot detect unforeseen incidental artefacts in large-scale databases. Thirdly, artefacts in multimedia are usually easier to model mathematically, which can be used to synthetically generate annotations over arbitrarily large image sets including artefacts with varying degrees of severity. In contrast, manual quality annotations in large-scale medical image databases are rare, and their creation would require extensive and tedious visual assessment, which is laborious and error-prone. A completely unsupervised framework is therefore desirable.

In this work, we propose a general-purpose and fully unsupervised approach to detect images with artefacts in large-scale medical image datasets. Furthermore, our approach also localizes the artefact within the corresponding image. Our work relies on only three main assumptions. Firstly, artefacts are local in nature and can be observed in image patches of an appropriate size, although the extent of such patches could be, in the extreme case, the full image. Secondly, the image database is large enough to capture the statistics of the normal images and of the artefacts of interest. Thirdly, the incidence of artefacts in the image database is small enough that artefacts always remain outliers in the statistical distribution of the database. Under these assumptions, we propose a novel image coverage technique based on a visual dictionary of image patches to represent each image with a set of visual words localized in the image domain. Artefact detection is based on the similarity between each representative visual word and the corresponding raw patch from the image. The proposed method will facilitate large-scale, high-throughput quality control, and will be applicable to different imaging modalities without the need for manual annotation.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 291–299, 2016. DOI: 10.1007/978-3-319-46723-8_34

2 Proposed Method

The basic idea is to build a model representing the normal characteristics of the image population and to detect artefacts as deviations from this model. Based on the assumptions above, the proposed method works on image patches and comprises three main constituents: a robust learning of a dictionary of patches, an optimal coverage of the images with the dictionary words, and an assessment of the similarity between each covered patch and the corresponding dictionary word. This assessment allows us to detect outliers, identifying both images with artefacts and their locations in the image.

The use of a visual dictionary has been proposed for image quality assessment in [13] and explored later as the so-called bag of visual words (BOVW) representation. In the BOVW approach, each image is modelled as a distribution over the visual words by normalizing the histogram of occurrence counts for each visual word. In this study, however, we propose a novel image representation with a subset of visual words that provides an optimal image coverage, which is a new concept and differs from the typical usage of visual codebooks. The proposed optimal image coverage allows the visual words to be positioned optimally without prefixed positions. This reduces the number of words required for a good representation, which maximizes the generalizability of the model. Artefact detection is based on the similarity between the visual words and the corresponding image patches (Fig. 1).

An image patch with a small dissimilarity score can be interpreted as artefact-free only if the dictionary has already excluded all words with artefacts. Hence, being robust to outliers is crucial in the dictionary learning step, and common clustering algorithms like k-means are therefore not appropriate. We propose a robust dictionary learning based on the fixed-width clustering algorithm, introduced for outlier detection in network systems [2].

2.1 Robust Dictionary Learning

The objective is to learn a dictionary $W = \{w_1, \ldots, w_N\}$ with $N$ visual words from a large pool of patches, capturing the normal shape and appearance variations in the image class while excluding outlier patches. An outlier patch is expected to lie in a sparse region of the feature space (here, the raw intensity values), having few or no neighbours within a typical distance $r$. Observe that outlier patches detected in this step cannot be used directly to identify image artefacts. Since images are not coregistered and patches are extracted from fixed locations, some proportion of outliers will be due to misalignments and will not necessarily represent an image artefact. The proposed robust dictionary learning is as follows. Each image is divided into overlapping square patches of size $k$ for 2D images, or cubic patches for 3D images, with an overlap of size $k'$ between neighbouring patches. The fixed-width clustering algorithm is then applied as follows. All the patches are shuffled randomly and the first patch becomes the first visual word. For all the subsequent patches, the Euclidean distance of the patch to each visual word is computed.


Fig. 1. An example of optimal coverage and artefact detection as outliers in the dissimilarity score distribution.

If a distance is less than $r$, the patch is added to the corresponding cluster and the cluster centroid is recalculated as the average of all its members. If the patch is within distance $r$ of none of the existing words, it is added as the centroid of a new cluster. Observe that each patch may contribute to more than one cluster. Finally, clusters with only one member are considered outliers and removed from the dictionary.
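The fixed-width clustering steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fixed_width_clustering`, the `seed` argument, and the use of flattened patches stored as rows of a NumPy array are all assumptions made for the example.

```python
import numpy as np

def fixed_width_clustering(patches, r, seed=0):
    """Return the centroids (visual words) of all clusters with more than one member.

    patches: array of shape (num_patches, num_features), one flattened patch per row.
    r: fixed cluster width (Euclidean distance threshold).
    Both the name and the signature are hypothetical.
    """
    patches = np.asarray(patches, dtype=float)
    order = np.random.default_rng(seed).permutation(len(patches))  # random shuffle
    centroids, sums, counts = [], [], []
    for idx in order:
        p = patches[idx]
        joined = False
        # No early exit: a patch may contribute to more than one cluster.
        for c in range(len(centroids)):
            if np.linalg.norm(p - centroids[c]) < r:
                sums[c] += p
                counts[c] += 1
                centroids[c] = sums[c] / counts[c]  # centroid = mean of all members
                joined = True
        if not joined:  # no word within r: this patch starts a new cluster
            centroids.append(p.copy())
            sums.append(p.copy())
            counts.append(1)
    # Singleton clusters are treated as outliers and dropped.
    return np.array([c for c, n in zip(centroids, counts) if n > 1])
```

With two tight groups of points and one isolated point, the isolated point forms a singleton cluster and is excluded, so the returned dictionary holds two words. Because the centroids move as members are added, the result can in general depend on the shuffling order.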

2.2 Optimal Image Coverage and Word Selection

A coverage of an image $I$ is a selection of visual words placed at different locations in the image so that all pixels are covered at least once. Let us consider that the image $I$ has $P$ pixels and each visual word can be deployed at $L$ locations indexed by $\ell \in [1, L]$, where $L \le P$ depends on the resolution with which the image is swept. The binary mask $m_\ell$ represents the word location $\ell$ in the image, $d_\ell^n$ denotes the word $w_n$ placed at location $\ell$ with appropriate zero-padding, and the binary variable $z_\ell^n$ encodes whether $d_\ell^n$ is selected to cover the image or not. Thus, the binary vector $z = (z_1^1, \ldots, z_L^N)$ represents a coverage of the image if

$$\sum_{n,\ell} z_\ell^n \, m_\ell \ge \mathbf{1}_{P \times 1}, \tag{1}$$

where the left-hand side is an integer vector counting how many times each pixel is covered in the image. We define the image coverage error as the $L_2$-norm of the difference between each selected visual word and the corresponding image patch,

$$E = \sum_{n,\ell} z_\ell^n \, \left\| d_\ell^n - m_\ell \circ I \right\|^2. \tag{2}$$

Here, $m_\ell \circ I$ denotes the component-wise product between the binary mask $m_\ell$ and the image $I$. The optimal image coverage is defined by the minimization of the coverage error, $\hat{z} = \arg\min_{z} E$, subject to the constraint in Eq. 1. Let us denote by $z_\ell^* = \sum_n z_\ell^n$ the number of visual words placed at location $\ell$. If two words, $w_{n_1}$ and $w_{n_2}$, are used at the same location $\ell$ ($z_\ell^{n_1} = z_\ell^{n_2} = 1$), then the coverage error will always be larger than when using just one of them, without any effect on the constraint. Hence, the optimal solution will place at each location $\ell$ either one single visual word ($z_\ell^* = 1$) or none ($z_\ell^* = 0$). Therefore, the optimization can be split into two independent problems. First, for each $\ell$, the locally optimal visual word $w_{n(\ell)}$ is selected by minimizing the local error,

$$E_\ell = \min_{n} \left\| d_\ell^n - m_\ell \circ I \right\|^2. \tag{3}$$

Then, the optimal locations, $z^* = (z_1^*, \ldots, z_L^*)$, are selected by minimizing the total coverage error,

$$z^* = \arg\min_{z^*} \sum_{\ell} z_\ell^* E_\ell \quad \text{subject to} \quad \sum_{\ell} z_\ell^* \, m_\ell \ge \mathbf{1}_{P \times 1}. \tag{4}$$





Equation 4 can be solved efficiently using integer linear programming packages such as the Matlab Optimization Toolbox (Mathworks Inc., Cambridge, MA).
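To make the two-stage split of Eqs. 3 and 4 concrete, the sketch below works on a toy one-dimensional "image". It is only an illustration under stated assumptions: the function names `local_errors` and `optimal_coverage` are hypothetical, the squared Euclidean distance is used as the local error, and the integer program of Eq. 4 is replaced by an exhaustive search over all binary selection vectors, which is exact but only feasible for a handful of locations. It is not a substitute for an integer programming solver.

```python
import itertools
import numpy as np

def local_errors(image, words, k, stride=1):
    """Eq. 3 on a 1-D signal: best word and its squared error at every location."""
    starts = list(range(0, len(image) - k + 1, stride))
    errors, best_word = [], []
    for s in starts:
        d = ((words - image[s:s + k]) ** 2).sum(axis=1)  # error of each word here
        best_word.append(int(d.argmin()))
        errors.append(float(d.min()))
    return starts, np.array(errors), best_word

def optimal_coverage(n_pixels, starts, errors, k):
    """Eq. 4 by brute force: cheapest set of locations covering every pixel."""
    best_cost, best_z = np.inf, None
    for z in itertools.product([0, 1], repeat=len(starts)):
        covered = np.zeros(n_pixels, dtype=int)
        for s, use in zip(starts, z):
            if use:
                covered[s:s + k] += 1
        cost = float(np.dot(z, errors))
        if covered.min() >= 1 and cost < best_cost:  # feasible and cheaper
            best_cost, best_z = cost, z
    return best_z, best_cost

# Toy example: a step signal and a two-word dictionary (both invented).
image = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
words = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
starts, errors, best_word = local_errors(image, words, k=3)
z_star, cost = optimal_coverage(len(image), starts, errors, k=3)
```

In this toy case the two end locations match the dictionary exactly, so the search selects only those two locations and the total coverage error is zero. The search visits $2^L$ candidates, which is why a real image needs a proper solver.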

2.3 Artefact Detection

For a given image, a dissimilarity score is computed between each representative visual word and its corresponding raw patch. Any image patch with an associated score above an optimal threshold identifies the location of an artefact in the given image. Observe that, since the matching of the words is local and the best-fitting locations are found by the optimal coverage without forcing a priori known locations, images do not need to be registered beforehand.

Dissimilarity score: The local properties of an image can be described by the set of its derivatives, which is called the local jet [9]. For a given image $I$ and a scale $\sigma$, the local jet of order $N$ at point $x$ is defined as

$$J^N[I](x, \sigma) \triangleq \left\{ L_{i_1,\ldots,i_n}(x; \sigma) \right\}_{n=0}^{N}, \tag{5}$$

where the $n$th-order derivative tensors are computed by the convolution

$$L_{i_1,i_2,\ldots,i_n}(x; \sigma) = \left( G^{(\sigma)}_{i_1,i_2,\ldots,i_n} * I \right)(x), \tag{6}$$

with the corresponding derivatives of the Gaussian kernel $G^{(\sigma)}_{i_1,i_2,\ldots,i_n}(x)$, and $i_k = 1, \ldots, D$ for $D$-dimensional images. For each derivative order, a complete set of independent invariants under orthogonal transformations can be extracted [9]. For 2D images and second order, for example, this complete set is formed by the intensity, the magnitude of gradients $\sum_i L_i^2$, the Laplacian $\sum_i L_{ii}$, the Hessian norm $\sum_{i,j} L_{ij} L_{ji}$, and the second derivative along the gradient $\sum_{i,j} L_i L_{ij} L_j$. A multiresolution description can be obtained by changing the scale $\sigma$ in a convenient set of scale-space representations. For each invariant feature, the Euclidean distance between the visual word and the corresponding image patch is used as the dissimilarity metric.

Optimum threshold: The optimum threshold for each dissimilarity score is computed as follows. For each image in the database, the maximum score among all the representative visual words is computed. The optimum threshold is selected as $q_3 + \nu (q_3 - q_1)$, where $\nu = 1.5$, and $q_1$ and $q_3$ are the first and third quartiles of these maximum scores, respectively. An image is then considered artefact-free only if all the representative visual words have a dissimilarity score below the optimum threshold with respect to all the considered features.
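The threshold rule above is a quartile-based outlier fence and can be sketched in a few lines. The function names `optimum_threshold` and `flag_patches` are hypothetical, the scores in the example are invented, and the quartiles are assumed to be computed with NumPy's default linear interpolation, which the paper does not specify.

```python
import numpy as np

def optimum_threshold(max_scores, nu=1.5):
    """q3 + nu * (q3 - q1) over the per-image maximum dissimilarity scores."""
    q1, q3 = np.percentile(max_scores, [25, 75])
    return q3 + nu * (q3 - q1)

def flag_patches(patch_scores, threshold):
    """Indices of the patches in one image whose score exceeds the threshold."""
    return np.flatnonzero(np.asarray(patch_scores) > threshold)

# Invented per-image maxima for one feature, e.g. from a training set.
threshold = optimum_threshold([1.0, 2.0, 3.0, 4.0, 5.0])
suspicious = flag_patches([0.5, 8.0, 6.9], threshold)
```

One threshold would be computed per feature. Following the rule stated above, an image counts as artefact-free only if `flag_patches` returns an empty array for every feature.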

3 Experiments and Results

Data specifications: We tested the proposed algorithm on a dataset of hip DXA scans collected using a Hologic QDR4500 Acclaim densitometer (Hologic Inc., Bedford, MA, USA) during a previous pharmaceutical clinical trial [6]. DXA is the standard imaging technique for bone densitometry. The current software packages for DXA analysis require manual interaction, so each scan must be analysed individually. Quality control is therefore assumed to be performed online, and fairly accurately, during the analysis step. Contrary to this common perception, a survey of members of the International Society for Clinical Densitometry (ISCD) indicated that errors in DXA acquisition are not rare [4]. Although a qualitative description of DXA artefacts is available [3], no quantitative classification or manually crafted features have been introduced so far. Thus, no prior knowledge of the types of artefacts can be assumed for the DXA modality. This makes automatic DXA quality control an interesting challenge. A population cohort of 5592 women was included in the clinical trial [6], from which 1300 scans were already available.

Experimental set-up: To evaluate the method, 300 scans were randomly selected for manual annotation. We learned the visual dictionary and the set of optimum thresholds on the remaining 1000 scans. Observe that the learning step is also unsupervised, so we could train on the whole dataset and then validate the method on a proportion of it. However, to make the evaluation more rigorous, we split the dataset into training and test sets.

Performance measures: Sensitivity and specificity of the method are reported on the test data based on a priori manual annotation. The sensitivity is defined as the proportion of images with artefacts that are detected correctly by the algorithm. The specificity is defined as the proportion of normal images that are correctly labelled as artefact-free. The localization capacity of the algorithm is measured in terms of the number of patches that localize an artefact correctly divided by the number of all detected patches on those images that are truly detected as artefacts. We will refer to it as the localization rate.
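The three performance measures defined above reduce to simple ratios, sketched below. The function names are hypothetical. In the example, the 11 annotated artefact images of which 9 are detected come from the results reported in this paper, whereas the split of the 289 normal test images into 272 correctly labelled and 17 flagged is not stated in the paper; it is an assumption chosen here only because it reproduces the reported specificity after rounding.

```python
def sensitivity_specificity(has_artefact, flagged):
    """Image-level sensitivity and specificity from boolean truth and prediction lists."""
    pairs = list(zip(has_artefact, flagged))
    tp = sum(1 for t, p in pairs if t and p)          # artefact, detected
    fn = sum(1 for t, p in pairs if t and not p)      # artefact, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # normal, labelled artefact-free
    fp = sum(1 for t, p in pairs if not t and p)      # normal, wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)

def localization_rate(correct_patches, detected_patches):
    """Correctly localizing patches divided by all detected patches (true-positive images only)."""
    return correct_patches / detected_patches

# 300 test images: 11 with artefacts (9 detected), 289 normal (assumed 272 kept).
truth = [True] * 11 + [False] * 289
pred = [True] * 9 + [False] * 2 + [True] * 17 + [False] * 272
sens, spec = sensitivity_specificity(truth, pred)
```

Under these counts, `sens` is 9/11 and `spec` is 272/289, which round to the 81.82 % and 94.12 % reported in the paper.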


Fig. 2. Samples from the dataset. Red squares show the location of detected artefacts.

Parameter selection: The patch size and the radius $r$ are the two parameters of the proposed method. Both parameters are data dependent. The radius $r$ is selected automatically by estimating a typical small distance between patches in the same image. For each image, all pairwise distances between the patches comprising the image are computed. Next, the $1/n$-quantile of these distances is computed per image, where $n$ is the total number of patches extracted from each image. Then, the parameter $r$ is selected as the median of the computed quantiles over the image dataset. The patch size could be estimated based on the size of the effect that is measured. For example, in femoral DXA scans, the diameter of the femoral stem is approximately 64 pixels. We tested patches of size 32 and 64 with an 8-pixel overlap. No differences were observed in the sensitivity. We present the results with patches of size 64. In total, 24830 patches were extracted from the 1000 images. The radius estimated for this dataset is $r = 3.5$. The obtained dictionary contained 1146 visual words. We tested invariant features up to the second order. However, the second-order features did not provide any new information. Hence, only intensity and gradient magnitude are finally used as features. The gradient magnitude for each image patch or visual word is normalized to have unit Euclidean norm. Single-scale analysis with $\sigma = 0.2$ was used. The optimum thresholds were derived as 0.37 and 4.86 for the gradient magnitude and intensity, respectively.

Results: Eleven images out of 300 were manually annotated as containing artefacts. Nine of these eleven were detected using the proposed algorithm. Sensitivity and specificity are 81.82 % and 94.12 %, respectively. The localization rate is 96 %. Figure 2 shows normal images and artefacts. Only 2 out of 11 image artefacts are misclassified as normal. These two scans are characterized as movement artefacts that cause subtle vertical displacement in the image. However, the algorithm managed to successfully localize other types of artefacts, including the presence of an external object (the key-shaped object in Fig. 2).
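The automatic selection of the radius $r$ described under parameter selection can be sketched as below. The function name `select_radius` and the input layout (one array of flattened patches per image) are assumptions, as is the use of NumPy's default linearly interpolated quantile; the paper does not say how the quantile is estimated.

```python
import numpy as np

def select_radius(patches_per_image):
    """Median over images of the 1/n-quantile of within-image pairwise patch distances."""
    quantiles = []
    for patches in patches_per_image:
        x = np.asarray(patches, dtype=float)
        x = x.reshape(len(x), -1)                  # one flattened patch per row
        n = len(x)
        diff = x[:, None, :] - x[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))   # full n-by-n distance matrix
        pairwise = dist[np.triu_indices(n, k=1)]   # each unordered pair once
        quantiles.append(np.quantile(pairwise, 1.0 / n))
    return float(np.median(quantiles))

# Two invented "images" with three one-pixel patches each.
r = select_radius([[[0.0], [1.0], [3.0]], [[0.0], [2.0], [6.0]]])
```

The dense distance matrix keeps the example short; for images with many large patches a chunked or condensed distance computation would be the more practical choice.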

4 Conclusion

In this paper, we proposed a completely unsupervised and generic framework to address automatic quality control in large medical image datasets. Based on the assumption that artefacts constitute a small proportion of the dataset, a dictionary-based framework built on an optimal coverage of images was introduced to detect and localize image artefacts as outliers to the normal image class. The computational complexity of the method is linear in the number of input images, providing good scalability to large datasets. We tested the method on 1300 femoral DXA scans and reported good sensitivity and specificity on the dataset.

Acknowledgement. M Farzi was funded through a PhD Fellowship from the United Kingdom Medical Research Council-Arthritis Research-UK Centre for Integrated research into Musculoskeletal Ageing (CIMA).

References

1. Bamberg, F., Kauczor, H.U., Weckbach, S., Schlett, C.L., Forsting, M., Ladd, S.C., Greiser, K.H., Weber, M.A., Schulz-Menger, J., Niendorf, T., et al.: Whole-body MR imaging in the German national cohort: rationale, design, and technical background. Radiology 277(1), 206–220 (2015)
2. Eskin, E., Arnold, A., Prerau, M., Portnoy, L., Stolfo, S.: A geometric framework for unsupervised anomaly detection. In: Barbará, D., Jajodia, S. (eds.) Applications of Data Mining in Computer Security. AISC, pp. 77–101. Springer, New York (2002)
3. Jacobson, J.A., Jamadar, D.A., Hayes, C.W.: Dual X-ray absorptiometry: recognizing image artifacts and pathology. Am. J. Roentgenol. 174(6), 1699–1705 (2000)
4. Lewiecki, E.M., Binkley, N., Petak, S.M.: DXA quality matters. J. Clin. Densitometry 9(4), 388–392 (2006)
5. Manap, R.A., Shao, L.: Non-distortion-specific no-reference image quality assessment: a survey. Inf. Sci. 301, 141–160 (2015)
6. McCloskey, E.V., Beneton, M., Charlesworth, D., Kayan, K., de Takats, D., Dey, A., Orgee, J., Ashford, R., Forster, M., Cliffe, J., et al.: Clodronate reduces the incidence of fractures in community-dwelling elderly women unselected for osteoporosis: results of a double-blind, placebo-controlled randomized study. J. Bone Miner. Res. 22(1), 135–141 (2007)
7. Petersen, S.E., Matthews, P.M., Bamberg, F., Bluemke, D.A., Francis, J.M., Friedrich, M.G., Leeson, P., Nagel, E., Plein, S., Rademakers, F.E., et al.: Imaging in population science: cardiovascular magnetic resonance in 100,000 participants of UK Biobank-rationale, challenges and approaches. J. Cardiovasc. Magn. Reson. 15(1), 46 (2013)
8. Roalf, D.R., Quarmley, M., Elliott, M.A., Satterthwaite, T.D., Vandekar, S.N., Ruparel, K., Gennatas, E.D., Calkins, M.E., Moore, T.M., Hopson, R., et al.: The impact of quality assurance assessment on diffusion tensor imaging outcomes in a large-scale population-based cohort. NeuroImage 125, 903–919 (2016)
9. Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 19(5), 530–534 (1997)
10. Weiner, M.W., Veitch, D.P., Aisen, P.S., Beckett, L.A., Cairns, N.J., Cedarbaum, J., Donohue, M.C., Green, R.C., Harvey, D., Jack, C.R., et al.: Impact of the Alzheimer’s disease neuroimaging initiative, 2004 to 2014. Alzheimer’s & Dementia 11(7), 865–884 (2015)
11. Willis, C.E., Weiser, J.C., Leckie, R.G., Romlein, J.R., Norton, G.S.: Optimization and quality control of computed radiography. In: Medical Imaging 1994, pp. 178–185. International Society for Optics and Photonics (1994)
12. Xie, F., Lu, Y., Bovik, A.C., Jiang, Z., Meng, R.: Application-driven no-reference quality assessment for dermoscopy images with multiple distortions. IEEE Trans. Biomed. Eng. 63(6), 1248–1256 (2016). IEEE
13. Ye, P., Doermann, D.: No-reference image quality assessment based on visual codebook. In: 18th IEEE International Conference on Image Processing (ICIP), pp. 3089–3092. IEEE (2011)

A Cross-Modality Neural Network Transform for Semi-automatic Medical Image Annotation

Mehdi Moradi(B), Yufan Guo, Yaniv Gur, Mohammadreza Negahdar, and Tanveer Syeda-Mahmood

IBM Almaden Research Center, San Jose, CA, USA
[email protected]

Abstract. There is a pressing need in the medical imaging community to build large scale datasets that are annotated with semantic descriptors. Given the cost of expert produced annotations, we propose an automatic methodology to produce semantic descriptors for images. These can then be used as weakly labeled instances or reviewed and corrected by clinicians. Our solution is in the form of a neural network that maps a given image to a new space formed by a large number of text paragraphs written about similar, but different images, by a human expert. We then extract semantic descriptors from the text paragraphs closest to the output of the transform network to describe the input image. We used deep learning to learn mappings between images/texts and their corresponding fixed size spaces, but a shallow network as the transform between the image and text spaces. This limits the complexity of the transform model and reduces the amount of data, in the form of image and text pairs, needed for training it. We report promising results for the proposed model in automatic descriptor generation in the case of Doppler images of cardiac valves and show that the system catches up to 91 % of the disease instances and 77 % of disease severity modifiers.

1 Introduction

The availability of large datasets and today’s immense computational power have resulted in the success of data driven methods in traditional application areas of computer vision. In such applications, it is fairly inexpensive to label images based on crowd sourcing methods and create datasets with millions of categorized images or use the publicly available topical photo blogs. One hurdle for fully utilizing the potential of big data in medical imaging is the expensive process of annotating images. Crowd sourcing in simple annotation tasks has been reported in the past [7,10]. However, the expert requirements for certain medical labeling and annotation tasks limit the applicability of crowd sourcing. More importantly, privacy concerns and regulations prohibit the posting of some medical records on crowd sourcing websites even in anonymized format.

Electronic medical records (EMR) are the natural sources of big data in our field. One potential solution for establishing ground truth labels such as disease type and severity for images within EMR is automatic concept extraction from unstructured sources such as clinician reports stored with images. This is a very active and mature area of work [13]. In many situations, however, the clinical reports are not available. In other situations, a clinical record consists of many images and only one report. In an echocardiography study of cardiac valves, for example, there are usually many continuous wave (CW) Doppler images of four different cardiac valves. Typically these are stored as short videos. Only some patient records also include a cardiologist report (less than half in our dataset). Even when the report is available, there is no matching between each image and passages of the text. For low level algorithm development tasks, such as learning to detect a specific disease from CW Doppler, we need individually annotated images.

Our work here addresses a scenario in the annotation of a set of medical images where we also have access to a rather large set of text reports from clinical records, written by clinicians based on images of the same modality from other patients. This could be a text data dump from the EMR. We do not have access to the images matched to these reports. In fact, the lack of a huge set of images and text reports that are matched with each other separates our scenario from some of the work in the machine learning community in the area of automatic image captioning [4,5]. Our goal is to speed up the process of labeling images for semantic concepts such as the imaged valve, disease type and severity by providing a fairly accurate initial automatic annotation driven by the text reports of similar images written by clinicians.

To this end, we propose a learned transform between the image and text spaces. We use a multilayer perceptron (MLP) neural network which acts in the role of a universal function approximator, as opposed to a classifier. This transform network receives a fixed length representation of an image and outputs a vector in the space defined by fixed length representations of text reports. The key to success is that we have separated the process of learning the quantitative representation of images and texts from the process of learning the mapping between the two. The former relies on rather large datasets and deep learning, while the latter uses a small neural network and can be trained by using a small set of paired images and text. We show the practical value of this innovation on a clinical dataset of CW Doppler images. This methodology can significantly speed up the process of creating labeled datasets for training big data solutions in medical imaging.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 300–307, 2016. DOI: 10.1007/978-3-319-46723-8_35

2 Method

The general methodology involves three networks: a transform network that acts as a mapping function, taking a fixed length feature vector describing the image as input and producing a fixed length text vector as output; and two deep networks that act in the capacity of feature generators to map images and text paragraphs to their corresponding fixed length spaces. We will describe the proposed methodology in the context of fast annotation of CW Doppler echocardiography images for the most common valvular diseases, namely regurgitation and stenosis, and the severity of these conditions. CW Doppler images are routinely used for the study of the mitral, tricuspid, pulmonic, and aortic valves (Fig. 1). In the context of this specific problem, our solution also includes a fourth neural network that acts as a classifier to label the CW images for the valve. The motivation for separating this step is to limit the search space for the closest text paragraph in the final stage to only those text paragraphs that describe the relevant valve.

Fig. 1. Examples of the CW Doppler images: left panel shows a full CW image from the aortic valve. Right: region of interest CW images of aortic (top left), mitral (top right), tricuspid (bottom left) and pulmonic (bottom right) valve.

2.1 Learning a Fixed Length Vector Representation of Text Paragraphs

The text data was from the EMR of a local hospital network and included 57,108 cardiac echocardiography reports. The first step in our text pipeline was to isolate paragraphs focused on each of the four valve types. This was fairly trivial as the echo reports routinely include paragraphs starting with “Aortic valve:”, and likewise for the mitral, pulmonic and tricuspid valves. With this simple rule, we isolated 10,253 text paragraphs with a valve label.

Traditionally, text can be represented as a fixed-length feature vector, composed of a variety of lexical, syntactic, semantic, and discourse features such as words, word sequences, part-of-speech tags, grammatical relations, and semantic roles. Despite the impressive performance of the aforementioned features in many text analytics tasks, especially in text classification, vector representations generated through traditional feature engineering have their limits. Given the complexity and flexibility within natural languages, features such as bag of words or word sequences usually result in a high dimensional vector, which may cause data sparsity issues when the size of the training data is small relative to the number of features. Moreover, in a traditional feature space, words such as “narrowing”, “stenosis”, and “normal” are equally distant from each other, regardless of meaning.

In this work, we used a neural network language model proposed in [6] to generate distributed representations of texts in an unsupervised fashion, in the absence of deliberate feature engineering. This network is often referred to as Doc2Vec in the literature¹. The input of the neural network includes a sequence of observed words (e.g. “aortic valve peak”), each represented by a fixed-length vector, along with a text snippet token, also in the form of a dense vector and corresponding to the sentence/document source for the sequence. The concatenation or average of the word and paragraph vectors was used to predict the next word (e.g. “velocity”) in the snippet. The two types of vectors were trained on the 10,253 paragraphs. Training was performed using stochastic gradient descent via backpropagation. At the testing stage, given an unseen paragraph, we freeze the word vectors from training time and just infer the paragraph vector. The fixed length of the text feature vector, m, is a parameter of the Doc2Vec model. In our application, since the length of the paragraphs is typically only two to three sentences, we prefer a short vector. This also helps with limiting the complexity of the transform network as it defines the number of output nodes. We report the results for m = 10.
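The rule-based isolation of valve paragraphs described at the start of this section can be sketched with a regular expression. This is only an illustration of the stated rule (paragraphs starting with "Aortic valve:" and the like). The function name, the assumption that paragraphs are separated by blank lines, and the sample report text are all invented for the example and are not taken from the dataset.

```python
import re

# Matches a paragraph that begins with "<valve name> valve:" (case-insensitive).
HEADER = re.compile(r"(aortic|mitral|pulmonic|tricuspid) valve:\s*(.*)",
                    re.IGNORECASE | re.DOTALL)

def isolate_valve_paragraphs(report):
    """Return (valve, text) pairs for paragraphs that start with a valve header."""
    pairs = []
    for paragraph in re.split(r"\n\s*\n", report):  # assumed blank-line separation
        match = HEADER.match(paragraph.strip())
        if match:
            pairs.append((match.group(1).lower(), match.group(2).strip()))
    return pairs

# Invented report text, for illustration only.
report = (
    "Left ventricle: Normal size.\n\n"
    "Aortic valve: Mild stenosis. Peak velocity increased.\n\n"
    "Mitral valve: Trace regurgitation.\n"
)
labelled = isolate_valve_paragraphs(report)
```

Each returned pair carries the valve label implied by the paragraph header, which is the label later used to restrict the search to paragraphs about the relevant valve.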

Image Vectors

We rely on deep transfer learning to create a vector of learned features to represent each image. Large pre-trained deep networks such as the convolutional network designed by the Visual Geometry Group (VGG) of the University of Oxford [2] have been widely used as “feature generators” in both medical [1,11] and non-medical [9] applications, as an alternative to the computation and selection of handcrafted features. We use the VGG implementation available through the MatConvNet Matlab library. This network consists of five convolution layers, two fully connected layers and a SoftMax layer with 1000 output nodes for the categories of the ImageNet challenge [3]. We ignore this task-specific SoftMax layer. Instead, we harvest a feature vector at the output of the fully connected layer (FC7) of the network. The VGG network has several variations in which the FC7 layer has between 128 and 4096 nodes. We run each CW image through the pre-trained VGG networks with FC7 sizes of both 128 and 4096. The former is used for training the transform network, and the latter for the valve type classification network. The smaller feature vector is chosen for the transform network because it defines the size of the input layer. Given the small size of the dataset used to train the transform network, we keep the size of the image vectors at 128 to minimize the number of weights. For the valve classifier network, we use the 4096-dimensional representation of the images, since the dataset is larger and the output layer is limited to the number of valve classes, which is four.

2.3 Valve Recognition Network

Since the text paragraphs are trivially separated by valve, we can reduce errors and limit the search space in the final stage of the pipeline by first accurately classifying the images for the depicted valve, based exclusively on the image features. In most cases, the text fields on the image (left side of Fig. 1) include clues that reveal the valve type and can be discerned using optical character recognition (OCR). In this work, however, we opt for a learning method. The classifier used in this work is an MLP network that takes the 4096-dimensional feature vector from VGG FC7 as input and has a single hidden layer and four SoftMax output nodes, one for each valve type. To train this valve classifier, we created an expert-reviewed dataset of 496 CW images, each labeled with one of the four valve types. The number of nodes in the hidden layer was optimized using leave-one-out cross-validation. The results are reported for a network with 128 nodes in the hidden layer.

Footnote 1. Open source code: http://deeplearning4j.org/doc2vec.html.

2.4 The Transform Network: Architecture and Training

The universal approximation theorem states that a feedforward neural network with a single hidden layer can, in theory, act as a general function approximator, given a sufficiently large hidden layer; in practice, sufficient training data is also required. The transform network used in our work is designed based on this principle. This is the only network in our system that requires paired images and clinical text paragraphs. Since this network acts as a regressor rather than a classifier, the output layer activation functions were set to linear instead of SoftMax. To optimize the number of hidden nodes of this network and train the weights, we used a dataset of 226 images and corresponding text reports in a leave-one-out scheme. We optimized the network with the objective of minimizing the mean Euclidean distance between the output vector and the target text vector for the image. The optimal architecture had four nodes in the hidden layer.

2.5 Deployment Stage and the Independent Test Data

Given an image, we first determine the valve type using the valve classifier network. The remaining steps to arrive at the disease descriptors are depicted in Fig. 2. The given image is first passed through the VGG network. The output is fed to the transform network to obtain a vector in the text space. We then search the text dataset for the closest matches to this vector. The closest match, or the top few matches, in terms of Euclidean distance are used for extraction of semantic descriptors of the image. Note that the use of the valve classifier reduces the cost of the search step by a factor of four, as we only search the text paragraphs written for the same type of valve. The extraction of the semantic descriptors from the retrieved paragraphs is performed by a proprietary concept extractor that accurately identifies the given descriptors in the text only when they are mentioned in the positive sense [12].

The overall performance of the model is investigated on a holdout dataset of CW images that was not used in the training or cross-validation of the transform network or the valve classifier network. This consists of 48 CW images with corresponding text reports, which were used only to validate the semantic labels extracted for the image using our model. This test set includes 14 CW images each of the mitral, aortic, and tricuspid valves and six of the pulmonic valve.

Fig. 2. The workflow of identifying a text segment as the source for semantic descriptors for a given image. The valve classifier network is not depicted in this illustration.
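The transform and retrieval steps described in Sects. 2.4 and 2.5 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tanh hidden activation, the presence of bias terms, and the corpus layout (a list of valve label, paragraph text, paragraph vector triples) are all hypothetical; only the linear output layer, the mean Euclidean distance objective, the 128/4/10 layer sizes, and the search restricted to the predicted valve type come from the text.

```python
import math


def transform(image_vec, W1, b1, W2, b2):
    """Map an image vector to the text space with one hidden layer.

    The output layer is linear, as stated in the text; the tanh hidden
    activation and the bias terms are assumptions made for this sketch.
    """
    hidden = [math.tanh(sum(w * x for w, x in zip(row, image_vec)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_euclidean_loss(predictions, targets):
    """Mean distance between network outputs and target text vectors."""
    return sum(euclidean(p, t) for p, t in zip(predictions, targets)) / len(targets)


def count_parameters(n_in, n_hidden, n_out):
    """Weights plus biases of a single-hidden-layer network."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out


def retrieve(text_vec, corpus, valve, k=3):
    """Return the k paragraphs of the given valve type closest to text_vec.

    corpus is a list of (valve_label, paragraph_text, paragraph_vector).
    """
    scored = [(euclidean(text_vec, vec), text)
              for label, text, vec in corpus if label == valve]
    scored.sort(key=lambda pair: pair[0])
    return [text for _, text in scored[:k]]
```

Under these assumptions, `count_parameters(128, 4, 10)` gives 566 trainable values, against 16,438 for `count_parameters(4096, 4, 10)`, which illustrates why the 128-dimensional image vector suits a training set of 226 pairs.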

3 Results

Result of classification for valve type: The optimized automatic valve classifier achieved an accuracy of 96 % on the test set, misclassifying only two of the 48 test samples, both of them tricuspid valve images. Note that OCR could potentially improve this by determining the valve type directly, without the need for classification, when the information is recorded on the image.

Evaluation of the quality of identified semantic labels: Table 1 provides examples of real text segments and text segments retrieved using our proposed methodology for a number of images in the test set. It is notable that the retrieved paragraphs often include anatomic information about the valve that is not discernible from CW images. For example, the thickening of the valve is visible in B-mode images. Since we use only CW images, we evaluate the results only for descriptors related to these images. It should be noted that the system is not designed as a diagnostic tool, and the results here should be evaluated in light of the intended application.

For the 34 test images of tricuspid, mitral, and pulmonic valves, the most common finding is regurgitation, present in 24 of the images. In 22 of these cases (91 %), at least one of the top retrieved reports also reported regurgitation. The modifiers in this case included “trace”, “mild”, “moderate” and “severe”. If extracted only from the top retrieved report, the modifier is underestimated in 9 (41 %) of the positive regurgitation findings. This drops to 5 (23 %) if the semantic descriptors are extracted from the top three retrieved reports.
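The percentages quoted above follow directly from the reported counts. The short check below is only a worked calculation; the helper name `pct` and the variable names are hypothetical, and the counts are the ones stated in the text.

```python
def pct(numerator, denominator):
    """Percentage of numerator over denominator."""
    return 100.0 * numerator / denominator


valve_accuracy = pct(48 - 2, 48)      # two of 48 test images misclassified
regurgitation_found = pct(22, 24)     # regurgitation retrieved in 22 of 24 cases
underestimated_top1 = pct(9, 22)      # modifier underestimated, top report only
underestimated_top3 = pct(5, 22)      # modifier underestimated, top three reports
```

To one decimal place these evaluate to 95.8, 91.7, 40.9 and 22.7, consistent with the rounded figures in the text (the 91 % appears to be truncated rather than rounded).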


Table 1. Examples of the text segments retrieved with CW images (right column) and the text segment from the original clinical report for the image (middle column).

| Valve imaged | Corresponding report | Retrieved text |
|---|---|---|
| Aortic | “Aortic Valve: Aortic sclerosis present with evidence of stenosis. 55.1 mmHg, and the mean gradient is 27.0 mmHg, compatible with moderate stenosis” | “Aortic Valve: Severe aortic stenosis is present. The aortic valve peak velocity is 4.07 m/s, the peak gradient is 66.3 mmHg, and the mean gradient is 35.5 mmHg. The aortic valve area is estimated to be 0.61 cm². Moderate aortic valve insufficiency” |
| Mitral | “Mitral Valve: There is marked mitral annular calcification. The mitral valve leaflets are calcified. Mild mitral regurgitation is present” | “Mitral Valve: Mild mitral regurgitation is present” |

In the case of the aortic valve, the most critical finding is stenosis. Cardiologists report aortic stenosis following the guidelines of the American Heart Association (AHA) with “mild”, “moderate” or “severe” modifiers. In our set of 14 independent aortic valve CW images in the final test set, the original corresponding text paragraphs reported stenosis in five cases. In all five cases, the combination of the top three retrieved paragraphs provided “stenosis” as a descriptor. In one case, there was a finding of stenosis in the top retrieved paragraph but not in the original report; however, further examination revealed that the case was positive based on one measure of stenosis (maximum jet velocity) and negative based on another (mean pressure gradient). For modifiers, in four cases the original modifier was “mild” and the true modifier was also either “moderate” or “mild”. In one case, the clinician had not reported a modifier and the retrieved paragraph reported “severe”.

4 Conclusion

We proposed a methodology for generating annotations, in the form of semantic disease-related labels, for medical images based on a learned transform that maps the image to a space formed by a large number of text segments written by clinicians for images of the same type. Note that we used a pre-trained convolutional neural network. Handcrafted feature sets such as histograms of gradients can be studied as alternative image descriptors in this framework. However, the CNN-based features proved more powerful in our previous work [8] and also here. While the quantitative analysis reported here is limited to stenosis and regurgitation, there is no such limitation in our implementation. Our evidence from over 10,000 text reports shows that we can cover a wide range of labels. For example, we can accurately pick up labels related to deficiencies such as valve thickening, calcification and decreased excursion. As the examples in Table 1 show, in many cases the retrieved reports also include values of relevant measured clinical features. As a future step, we will explore the idea of expanding the list of top matches and averaging the values to obtain a rough estimate of the measurements for the image of interest. Also, the inclusion of B-mode images can improve the value of the retrieved paragraphs, which often include features only visible in such images. Finally, a larger user study is under way to understand the practical benefits of the system in terms of cost saving.

References

1. Bar, Y., Diamant, I., Wolf, L., Greenspan, H.: Deep learning with non-medical training used for chest pathology identification. In: SPIE, Medical Imaging 2015 (2015)
2. Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: delving deep into convolutional nets. In: British Machine Vision Conference (2014)
3. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR (2009)
4. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
5. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Understanding and generating image descriptions. In: CVPR (2011)
6. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: ICML, pp. 1188–1196 (2014)
7. Maier-Hein, L., Mersmann, S., Kondermann, D., Bodenstedt, S., Sanchez, A., Stock, C., Kenngott, H.G., Eisenmann, M., Speidel, S.: Can masses of non-experts train highly accurate image classifiers? In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 438–445. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10470-6_55
8. Moradi, M., et al.: A hybrid learning approach for semantic labeling of cardiac CT slices and recognition of body position. In: IEEE ISBI, pp. 1418–1421 (2016)
9. Park, C.C., Kim, G.: Expressing an image stream with a sequence of natural sentences. In: NIPS (2015)
10. Rodríguez, A.F., Müller, H.: Ground truth generation in medical imaging: a crowdsourcing-based iterative approach. In: Proceedings of the ACM Workshop on Crowdsourcing for Multimedia (2012)
11. Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016)
12. Syeda-Mahmood, T., Chiticariu, L.: Extraction of information from clinical reports. US Patent App. 13/408,906, 29 Aug 2013. http://www.google.com/patents/US20130226841
13. Wang, F., Syeda-Mahmood, T., Beymer, D.: Information extraction from multimodal ECG documents. In: ICDAR, pp. 381–385 (2009)

Sub-category Classifiers for Multiple-instance Learning and Its Application to Retinal Nerve Fiber Layer Visibility Classification

Siyamalan Manivannan1(B), Caroline Cobb2, Stephen Burgess2, and Emanuele Trucco1

1 CVIP, School of Science and Engineering (Computing), University of Dundee, Dundee, UK
[email protected]
2 Department of Ophthalmology, NHS Ninewells, Dundee, UK

Abstract. We propose a novel multiple instance learning method to assess the visibility (visible/not visible) of the retinal nerve fiber layer (RNFL) in fundus camera images. Using only image-level labels, our approach learns to classify the images as well as to localize the RNFL-visible regions. We transform the original feature space to a discriminative subspace, and learn a region-level classifier in that subspace. We propose a margin-based loss function to jointly learn this subspace and the region-level classifier. Experiments with an RNFL dataset containing 576 images annotated by two experienced ophthalmologists give agreements (kappa values) of 0.65 and 0.58 with the two annotators respectively, against an inter-annotator agreement of 0.62. Note that our system agrees more closely with the more experienced annotator. Comparative tests with three public datasets (MESSIDOR and DR for diabetic retinopathy, UCSB for breast cancer) show improved performance over the state-of-the-art.

1 Introduction

This paper introduces an automatic system assessing the visibility and location of the RNFL in fundus camera (FC) images from image-level labels. The optic nerve transmits visual information from the retina to the brain. It connects to the retina in the optic disc, and its expansion forms the RNFL, the innermost retinal layer. The RNFL has recently been considered as a potential biomarker for dementia [1], by assessing its thickness in optical coherence tomography (OCT) images. However, screening of high numbers of patients would be enabled if the RNFL could be assessed with FC, which is still much more common than OCT for retinal inspection and increasingly part of routine optometry checks. Very little work exists on RNFL-related studies with FC images that examine associations with dementia [2]. This is in contrast with RNFL analysis via OCT, which is supported by a rich literature [1]. The RNFL is not always visible in FC images, and its visibility itself has been posited as a biomarker for neurodegenerative conditions. This motivates our work, part of a larger project on multi-modal retina-brain biomarkers for dementia (EPSRC grant EP/M005976/1).

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 308–316, 2016. DOI: 10.1007/978-3-319-46723-8_36


Fig. 1. RNFL visibility in the green channel: (a) an image with visible RNFL (the marked region indicates its visibility), (b) an image with invisible RNFL, (c) examples of RNFL-visible regions, (d) examples of RNFL-invisible regions, (e) a synthetic image showing RNFL (pink) and blood vessels (blue).

We report an automatic system to identify FC images with visible RNFL regions and simultaneously localize the visible regions. A crucial challenge is obtaining ground truth annotations of visible RNFL regions from clinicians, a notoriously difficult and time-consuming process. We therefore take a Multiple Instance Learning (MIL) approach, requiring only image-level labels (RNFL visible/invisible), which can be generated much more efficiently. In MIL, images are regarded as bags, and image regions as instances. Visible RNFL regions have significant intra-class variations, and can be difficult to distinguish from RNFL-invisible regions. To address this, we embed the instances in a discriminative subspace defined by the outputs of a set of sub-category classifiers. An instance-level (IL) classifier is then learned in that subspace by maximizing the margin between positive and negative bags. A margin-based loss is proposed to learn the IL and the sub-category classifiers jointly.

Our two main contributions are the following.

1. To the best of our knowledge, we address a new problem with significant impact potential for biomarker discovery, i.e. classifying FC images as RNFL-visible/invisible, including region localization.
2. We improve experimental performance compared to state-of-the-art MIL systems by proposing a novel MIL approach with a novel margin-based loss (instead of the cross-entropy loss commonly used in comparable MIL systems).

The differences between our work and recent, comparable work are captured in Sect. 2 after a concise discussion of related work.

We evaluated our approach on a local dataset (“RNFL”) of 576 FC images, and on three public datasets (MESSIDOR [3] and DR [4] for diabetic retinopathy, UCSB [5] for breast cancer). Table 1 summarizes the datasets and the experimental settings used. The images (green channel) in our RNFL dataset were annotated (image-level annotations) independently by two experienced ophthalmologists (A1 and A2, with A1 the more experienced). Overall, they agreed ≃ 83 % of the time (P ≃ 83 %) with a kappa value of K ≃ 0.62. Our experiments suggest that our system agrees more closely with A1 than with A2 (system agreement with A1: P ≃ 84 %, K ≃ 0.65; with A2: P ≃ 82 %, K ≃ 0.58). Our approach also improves on the state-of-the-art results on the public datasets (see Table 2).

2 Related Work

MIL approaches can be divided into two broad classes: (1) instance-level (IL) and (2) bag-level (BL). In both cases a classifier is trained to separate positive from negative bags using a loss function defined at the bag level.

IL approaches: the classifier is trained to classify instances, obtaining IL predictions. Here, BL predictions are usually obtained by aggregating IL decisions, e.g. MI-SVM [6] and MCIL [7].

BL approaches: a classifier is trained to classify bags. Usually a feature representation is computed for each bag from its instances and then used to learn a supervised classifier. As this classifier is trained at the BL, IL predictions cannot be obtained directly; examples include JC2MIL [8] and RMC-MIL [9].

The original feature space may not be discriminative. Hence embedding-based (EB) approaches try to embed the instances in a discriminative space [8–10]. The bag representation computed in this space is used to learn a BL classifier. MIL approaches have also been explored within the recent, successful Convolutional Neural Networks (CNN) paradigm for visual recognition [11]. Here, a MIL pooling layer is introduced at the end of the deep network architecture to aggregate (pool) IL predictions and compute the BL ones.

Our approach is an EB approach, but it learns an IL classifier instead of the BL one learned in [8,9]. Therefore it can provide both IL and BL predictions. CNN+MIL [11], EB approaches [8,9,12] as well as other approaches [7] minimize a cross-entropy loss. However, recent results suggest that a margin-based loss is better than the cross-entropy loss for classification problems [13]. Considering this, we propose a novel soft-margin loss in which the bags that violate the margin are penalized, and show improved performance over the cross-entropy loss.

3 Method

3.1 Motivation and the Overview of the Method

Most MIL approaches do not make explicit assumptions about the inter- or intra-class variations of the positive and negative bags (e.g. [6,14]). However, with high intra-class variation and low inter-class distinction these approaches may not perform well. This is the case for our RNFL dataset: the visible RNFL regions have high intra-class variation, and they are difficult to distinguish from RNFL-invisible regions (Fig. 1). To overcome this, we assume there exists a set of discriminative sub-categories, and learn a set of classifiers for them. These sub-categories may, for instance, capture different variations (or visual appearances) of the RNFL. Each classifier in this pool is learned specifically to separate a particular sub-category from the others. Each instance is thus transformed from its original feature space to a discriminative subspace defined by the outputs of these classifiers. An IL classifier is then learned in this space by maximizing the margin between the positive and the negative bags. For each bag, the BL prediction is obtained by aggregating (pooling) the decisions of its instances. An overview of the proposed approach is illustrated in Fig. 2.


Fig. 2. Overview of the proposed approach.

3.2 Sub-category Classifiers for MIL

Let the training dataset contain $\{(B_i, y_i)\}_{i=1}^{N}$, where $B_i$ is the $i$-th bag (image), $y_i \in \{-1, 1\}$ is its label, and $N$ is the number of bags. Each bag $B_i$ consists of $N_i$ instances (image regions), so that $B_i = \{x_{ij}\}_{j=1}^{N_i}$, where $x_{ij} \in \mathbb{R}^d$ is the feature representation of the $j$-th instance of the $i$-th bag.

Let $M = [\mu_1, \ldots, \mu_K] \in \mathbb{R}^{d \times K}$ be a set of sub-category classifiers, where each classifier is learned to separate a particular sub-category from the others. The probability of an instance $x_{ij}$ belonging to the $k$-th sub-category vs. the rest can be given as $q_{ijk} = \sigma(\mu_k^T x_{ij})$, where $\sigma(x) = 1/(1 + \exp(-x))$. The new instance representation $z_{ij}$ in the discriminative subspace is defined by the outputs of these sub-category classifiers, i.e. $z_{ij} = [q_{ij1}, \ldots, q_{ijK}]$.

Let $w \in \mathbb{R}^K$ define the IL classifier which is learned in this discriminative subspace, and $p_{ij} = \sigma(w^T z_{ij})$ the probability of the instance $x_{ij}$ belonging to the positive class. The BL probability, $P_i$, of a bag $B_i$ can be obtained by aggregating (pooling) the probabilities of the instances inside the bag. In this work, we use the generalized-mean operator ($G$) for aggregation:

$$P_i = \left( \frac{1}{N_i} \sum_{j=1}^{N_i} p_{ij}^{r} \right)^{1/r},$$

where $r$ is a pooling parameter. When $r = 1$, $G$ becomes average-pooling, and large $r$ values ($r \to \infty$) approximate max-pooling. The sub-category classifiers ($M$), the pooling parameter ($r$), and the IL classifier ($w$) can be learned using a cross-entropy loss function (Eq. (1)):

$$\arg\min_{r, M, w} \; \frac{\lambda}{2} \|w\|_2^2 - \frac{1}{N_+} \sum_{i: y_i = 1} \log(P_i) - \frac{1}{N_-} \sum_{i: y_i = -1} \log(1 - P_i) \qquad (1)$$
where Pi = Pi (yi = 1|Bi , r, M), λ is a regularization parameter, and N+ , N− are the total number of positive and negative bags in the training set respectively. Note that, this loss is widely used by the existing MIL approaches in [8,9,11,12].

312

S. Manivannan et al.

Instead, we propose a margin-based loss (Eq. (2)) which penalizes the bags violating the margin, as a margin-based loss has two main advantages over the cross-entropy loss [13]. (1) It tries to improve the classification accuracy on the training data (by focusing on the wrongly classified images), instead of making the correct predictions more accurate (as in the cross-entropy loss). (2) It improves training speed, as model updates are based only on the images classified wrongly; the ones classified correctly do not contribute to the model updates, and can be skipped altogether in derivative calculations.

$$\arg\min_{r, M, w} \; \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{N_+} \sum_{i: y_i = 1} L_i(y_i, B_i, r, M) + \frac{1}{N_-} \sum_{i: y_i = -1} L_i(y_i, B_i, r, M) \qquad (2)$$

where

$$L_i(y_i, B_i, r, M) = \max\left[0, \gamma + y_i (0.5 - P_i)\right]^2.$$

γ ∈ (0, 0.5] is a margin parameter. In our experiments we set γ = 0.1, λ = 102 . Initialization and optimization: We use gradient descent to optimize Eq. (2), alternating between optimizing M, w and r until convergence. To initialize M, first the instances from the training set are clustered using k-means (dictionary size = K), then a set of one-vs-rest linear SVM classifiers are learned to separate each cluster from the rest. These classifiers give the initial values to M.
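A minimal sketch of the margin-based objective of Eq. (2) is given below, assuming the per-bag loss is the squared hinge term read from the equation above and that the bag probabilities have already been computed. The function names are hypothetical, and `lam` is left as a required argument rather than given a default.

```python
def bag_margin_loss(y, P, gamma=0.1):
    """Squared hinge penalty for one bag; y is +1 or -1, P the bag probability."""
    return max(0.0, gamma + y * (0.5 - P)) ** 2


def margin_objective(labels, bag_probs, w, lam, gamma=0.1):
    """Value of Eq. (2), assuming at least one positive and one negative bag."""
    pos = [bag_margin_loss(y, P, gamma)
           for y, P in zip(labels, bag_probs) if y == 1]
    neg = [bag_margin_loss(y, P, gamma)
           for y, P in zip(labels, bag_probs) if y == -1]
    reg = 0.5 * lam * sum(v * v for v in w)
    return reg + sum(pos) / len(pos) + sum(neg) / len(neg)
```

With γ = 0.1, a positive bag contributes no loss once its probability reaches 0.6, and a negative bag once its probability falls to 0.4; such bags also contribute nothing to the gradient, which is the second advantage claimed for the margin-based loss.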

4 Experiments

4.1 Datasets and Experimental Settings

The experimental settings for the different datasets are summarized in Table 1.

(1) Messidor [3]: A public diabetic retinopathy screening dataset containing 1200 eye fundus images. It is well studied in [3] for BL classification. Each image was rescaled to 700 × 700 pixels and split into 135 × 135 regions. Each region was represented by a set of features including intensity histograms and texture.

(2) The diabetic retinopathy (DR) screening dataset [4]: 425 FC images, constructed from 4 publicly available datasets (DiabRetDB0, DiabRetDB1, STARE and Messidor). Each image is represented by a set of 48 instances.

(3) UCSB breast cancer [5]: 58 TMA H&E stained breast images (26 malignant, 32 benign). It was used in [5,8,12] to compare different MIL approaches; each image was divided into 49 instances, and each instance is represented by a 708-dimensional feature vector including SIFT and local binary patterns.

(4) RNFL retinal fundus image dataset: The green channel was used for processing. Images were resized, preserving their aspect ratio, so that their maximum dimension (row or column) is 700 pixels. Each image is then histogram-equalized. Instances (square image regions) of size 128 × 128 pixels with an overlap of 64 pixels are extracted, leading to ∼ 90 instances per image. Inside each instance, SIFT features (patch size of 24 × 24 pixels, overlap 16 pixels) are computed and encoded using Sparse Coding with a dictionary size of 200. Average-pooling was applied to get a feature representation for each instance.


Table 1. Datasets and experimental settings (FCV: fold cross-validation).

| Dataset | Positive images | Negative images | Total images | Exp. setup |
|---|---|---|---|---|
| Messidor [3] | 654 | 546 | 1200 | 10 times 2-FCV |
| DR [4] | 265 | 160 | 425 | fixed train (2/3) - test (1/3) split |
| UCSB Breast cancer [5] | 26 | 32 | 58 | 10 times 4-FCV |
| RNFL fundus images, A1 | 348 | 228 | 576 | 20 times 2-FCV |
| RNFL fundus images, A2 | 436 | 140 | 576 | 20 times 2-FCV |

Table 2. Results on the public datasets. All the results except ours and mi-Graph were copied from [3–5]. Some references are omitted due to space. Different evaluation measures were used as they were reported in [3–5].

| (a) Messidor dataset [3]: Method | Acc. | (b) DR dataset [4]: Method | Acc. | (c) UCSB cancer [5]: Method | AUC |
|---|---|---|---|---|---|
| MI-SVM [6] | 54.5 | mi-SVM [6] | 70.32 | MILBoost | 0.83 |
| SIL-SVM | 58.4 | MILES | 71.00 | MI-SVM [6] | 0.90 |
| GP-MIL | 59.2 | EMDD | 73.50 | BRT [12] | 0.93 |
| MILBoost | 64.1 | SNPM [4] | 81.30 | mi-Graph [14] | 0.946 |
| mi-Graph [14] | 72.5 | mi-Graph [14] | 83.87 | JC2MIL [8] | 0.95 |
| Ours | 73.1 | Ours | 88.00 | Ours | 0.965 |

4.2 Experiments with Public Medical Image Datasets

Table 2 reports the comparative results on the public datasets. For a fair comparison, we directly use the publicly available features (Messidor and UCSB cancer: http://www.miproblems.org/datasets/; DR: https://github.com/ragavvenkatesan/np-mil) and follow the same experimental set-up used by the existing approaches. With Messidor, our approach gives a competitive accuracy of 73.1 % (with a standard error of ±0.12) compared to the accuracy obtained by mi-Graph, which, being a BL approach, cannot provide IL predictions. With DR, our approach improves the state-of-the-art accuracy by ∼ 4 %. With UCSB, our approach achieves an AUC of 0.965 with a standard error of 0.001. Our Equal Error Rate was 0.07 ± 0.002, much smaller than the one reported in [12] (0.16 ± 0.03). The figure on the right shows K (x-axis) vs. accuracy values (y-axis) for the DR dataset. As expected, increasing K improves the accuracy, saturating for K > 150. This figure also shows that the margin-based loss (Eq. (2)) outperforms the cross-entropy loss (Eq. (1)). The advantages of the margin-based loss are discussed in Sect. 3.


Fig. 3. Example region-level predictions for test images. Top row: images with rough annotations for visible RNFL regions. In the last two images the RNFL is invisible. Second row: region-level probabilities obtained by the proposed approach, where high values (red) indicate the probable RNFL-visible regions.

Table 3. Approaches and their agreements (P and K ± standard error) with different annotators (A1 and A2) for RNFL visibility classification. Note that the agreement between the two annotators is P = 82.99 % and K = 0.6190.

| Method | Percentage of images agreed (P), A1 | Percentage of images agreed (P), A2 | Kappa values (K), A1 | Kappa values (K), A2 |
|---|---|---|---|---|
| mi-SVM | 73.11 ± 0.27 | 71.92 ± 0.62 | 0.4354 ± 0.0042 | 0.3942 ± 0.0063 |
| MIL-Boost | 80.53 ± 0.09 | 78.35 ± 0.09 | 0.5865 ± 0.0020 | 0.4940 ± 0.0029 |
| BL-SVM | 83.64 ± 0.09 | 80.72 ± 0.09 | 0.6523 ± 0.0017 | 0.5526 ± 0.0024 |
| Ours | 83.78 ± 0.08 | 82.09 ± 0.09 | 0.6539 ± 0.0017 | 0.5798 ± 0.0020 |

4.3 RNFL Visibility Classification

We used the public code from [3] for MILBoost and mi-SVM, taking care to select parameters that guarantee the fairest possible comparison. As a further baseline, we implemented BL-SVM, a supervised linear SVM classifier trained on the image-level feature representations obtained by average-pooling the dictionary-encoded (size 200) SIFT features. The training images with consensus labels from the annotators were used for training in each cross-validation run. Table 3 reports the results. Our approach gives better agreement with the annotators than the other approaches. Over the entire RNFL dataset we found that the inter-annotator agreement is P = 82.99 % with a kappa value of K = 0.6190. Our approach gives higher agreement with the experienced annotator (A1) than with the less experienced one (A2). Notice that, although BL-SVM gives a competitive performance compared to our approach, as a BL method it cannot give region-level predictions. Figure 3 shows some region-level predictions by our approach.
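The reported inter-annotator kappa can be reproduced from the observed agreement and the per-annotator label counts in Table 1. The sketch below assumes that K denotes Cohen's kappa, which the text does not state explicitly; the function name and argument layout are hypothetical.

```python
def cohen_kappa(observed_agreement, positives_a, positives_b, total):
    """Cohen's kappa for two annotators labelling the same binary task."""
    # Chance agreement from each annotator's marginal label frequencies.
    p_both_pos = (positives_a / total) * (positives_b / total)
    p_both_neg = ((total - positives_a) / total) * ((total - positives_b) / total)
    expected = p_both_pos + p_both_neg
    return (observed_agreement - expected) / (1.0 - expected)


# A1 labelled 348 of 576 images positive, A2 labelled 436; P = 82.99 %.
kappa_a1_a2 = cohen_kappa(0.8299, 348, 436, 576)
```

Under this assumption the chance agreement is about 0.554 and the resulting kappa is about 0.619, matching the value reported for the two annotators.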

5 Conclusions

The RNFL thickness and its visibility have been posited as biomarkers for neurodegenerative conditions. We have proposed a novel MIL method to assess the visibility (visible/not visible) of the RNFL in fundus camera images, which would enable screening of much larger patient volumes than OCT. In addition, our approach locates visible RNFL regions from image-level training labels. Experiments suggest that our margin-based loss performs better than the cross-entropy loss used by existing EB MIL approaches [8,9,12]. Experiments with a local RNFL dataset and 3 public medical image datasets show considerable improvements compared to the state-of-the-art. Future work will address the associations of RNFL visibility with brain features and patient outcome.

Acknowledgement. S. Manivannan is supported by EPSRC grant EP/M005976/1. The authors would like to thank Prof. Stephen J. McKenna and Dr. Jianguo Zhang for valuable comments.

References

1. Thomson, K.L., Yeo, J.M., Waddell, B., Cameron, J.R., Pal, S.: A systematic review and meta-analysis of retinal nerve fiber layer change in Dementia, using optical coherence tomography. Alzheimer’s Dementia Diagn. Assess. Disease Monit. 1(2), 136–143 (2015)
2. Hedges, H.R., Galves, R.P., Speigelman, D., Barbas, N.R., Peli, E., Yardley, C.J.: Retinal nerve fiber layer abnormalities in Alzheimer’s disease. Acta Ophthalmologica Scandinavica 74(3), 271–275 (1996)
3. Kandemir, M., Hamprecht, F.A.: Computer-aided diagnosis from weak supervision: a benchmarking study. Comput. Med. Imaging Graph. 42, 44–50 (2015)
4. Venkatesan, R., Chandakkar, P., Li, B.: Simpler non-parametric methods provide as good or better results to multiple-instance learning. In: IEEE International Conference on Computer Vision, pp. 2605–2613 (2015)
5. Kandemir, M., Zhang, C., Hamprecht, F.A.: Empowering multiple instance histopathology cancer diagnosis by cell graphs. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 228–235. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10470-6_29
6. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Advances in Neural Information Processing Systems 15, pp. 561–568. MIT Press (2003)
7. Xu, Y., Zhu, J.Y., Chang, E., Tu, Z.: Multiple clustered instance learning for histopathology cancer image classification, segmentation and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 964–971 (2012)
8. Sikka, K., Giri, R., Bartlett, M.: Joint clustering and classification for multiple instance learning. In: British Machine Vision Conference, pp. 71.1–71.12 (2015)
9. Ruiz, A., Van de Weijer, J., Binefa, X.: Regularized multi-concept MIL for weakly-supervised facial behavior categorization. In: British Machine Vision Conference (2014)
10. Chen, Y., Bi, J., Wang, J.: MILES: multiple-instance learning via embedded instance selection. IEEE Trans. Pattern Anal. Mach. Intell. 28(12), 1931–1947 (2006)
11. Kraus, O.Z., Ba, L.J., Frey, B.J.: Classifying and segmenting microscopy images using convolutional multiple instance learning. CoRR abs/1511.05286 (2015)
12. Li, W., Zhang, J., McKenna, S.J.: Multiple instance cancer detection by boosting regularised trees. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 645–652. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_79
13. Jin, J., Fu, K., Zhang, C.: Traffic sign recognition with hinge loss trained convolutional neural networks. IEEE Intell. Transp. Syst. 15(5), 1991–2000 (2014)
14. Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treating instances as non-i.i.d. samples. In: International Conference on Machine Learning, pp. 1249–1256 (2009)

Vision-Based Classification of Developmental Disorders Using Eye-Movements

Guido Pusiol1(B), Andre Esteva2, Scott S. Hall3, Michael Frank4, Arnold Milstein5, and Li Fei-Fei1

1 Department of Computer Science, Stanford University, Stanford, USA [email protected]
2 Department of Electrical Engineering, Stanford University, Stanford, USA
3 Department of Psychiatry, Stanford University, Stanford, USA
4 Department of Psychology, Stanford University, Stanford, USA
5 Department of Medicine, Stanford University, Stanford, USA

Abstract. This paper proposes a system for fine-grained classification of developmental disorders via measurements of individuals’ eye-movements using multi-modal visual data. While the system is engineered to solve a psychiatric problem, we believe the underlying principles and general methodology will be of interest not only to psychiatrists but to researchers and engineers in medical machine vision. The idea is to build features from different visual sources that capture information not contained in either modality. Using an eye-tracker and a camera in a setup involving two individuals speaking, we build temporal attention features that describe the semantic location that one person is focused on relative to the other person’s face. In our clinical context, these temporal attention features describe a patient’s gaze on finely discretized regions of an interviewing clinician’s face, and are used to classify their particular developmental disorder.

1 Introduction

Autism Spectrum Disorder (ASD) is an important developmental disorder with both increasing prevalence and substantial social impact. Significant effort is spent on early diagnosis, which is critical for proper treatment. In addition, ASD is also a highly heterogeneous disorder, making diagnosis especially problematic. Today, identification of ASD requires a set of cognitive tests and hours of clinical evaluations that involve extensively testing participants and observing their behavioral patterns (e.g. their social engagement with others). Computer-assisted technologies to identify ASD are thus an important goal, potentially decreasing diagnostic costs and increasing standardization.

In this work, we focus on Fragile X Syndrome (FXS). FXS is the most common known genetic cause of autism [5], affecting approximately 100,000 people in the United States. Individuals with FXS exhibit a set of developmental and cognitive deficits including impairments in executive functioning, visual memory and perception, social avoidance, communication impairments and repetitive behaviors [14]. In particular, as in ASD more generally, eye-gaze avoidance during social interactions with others is a salient behavioral feature of individuals with FXS. FXS is an important case study for ASD because it can be diagnosed easily as a single-gene mutation. For our purposes, the focus on FXS means that ground-truth diagnoses are available and heterogeneity of symptoms in the affected group is reduced.

Maintaining appropriate social gaze is critical for language development, emotion recognition, social engagement, and general learning through shared attention [3]. Previous studies [4,10] suggest that gaze fluctuations play an important role in the characterization of individuals in the autism spectrum.

In this work, we study the underlying patterns of visual fixations during dyadic interactions. In particular we use those patterns to characterize different developmental disorders. We address two problems. The first challenge is to build new features to characterize fine behaviors of participants with developmental disorders. We do this by exploiting computer vision and multi-modal data to capture detailed visual fixations during dyadic interactions. The second challenge is to use these features to build a system capable of discriminating between developmental disorders.

The remainder of the paper is structured as follows: In Sect. 2, we discuss prior work. In Sect. 3, we describe the raw data: its collection and the sensors used. In Sect. 4, we describe the built features and analyze them. In Sect. 5, we describe our classification techniques. In Sect. 6, we describe the experiments and results. In Sect. 7, we discuss the results.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 317–325, 2016. DOI: 10.1007/978-3-319-46723-8_37

2 Previous Work

Pioneering work by Rehg et al. [12] shows the potential of using coarse gaze information to measure relevant behavior in children with ASD. However, this work does not address the issue of fine-grained classification between ASD and other disorders in an automated way. Our work thus extends this work to develop a means for disorder classification via multi-modal data. In addition, some previous efforts in the classification of developmental disorders such as epilepsy and schizophrenia have relied on using electroencephalogram (EEG) recordings [11]. These methods are accurate, but they require long recording times; in addition, the use of EEG probes positioned over a participant’s scalp and face can limit applicability to developmental populations. Meanwhile, eye-tracking has long been used to study autism [1,7], but we are not aware of an automated system for inter-disorder assessment using eye-tracking such as the one proposed here.

3 Dataset

Our dataset consists of 70 videos of a clinician interviewing a participant, overlaid with the participant’s point of gaze (as measured by a remote eye-tracker), first reported in [6].

Fig. 1. (a) We study social interactions between a participant with a mental impairment and an interviewer, using multi-modal data from a remote eye-tracker and camera. The goal of the system is to achieve fine-grained classification of developmental disorders using this data. (b) A frame from videos showing the participant’s view (participant’s head is visible in the bottom of the frame). Eye-movements were tracked with a remote eye-tracker and mapped into the coordinate space of this video.

The participants were diagnosed with either an idiopathic developmental disorder (DD) or Fragile X Syndrome (FXS). Participants with DD displayed similar levels of autistic symptoms to participants with FXS, but did not have a diagnosis of FXS or any other known genetic syndrome. There are known gender-related behavioral differences between FXS participants, so we further subdivided this group by gender into males (FXS-M) and females (FXS-F). There were no gender-related behavioral differences in the DD group, and genetic testing confirmed that DD participants did not have FXS. Participants were between 12 and 28 years old, with 51 FXS participants (32 male, 19 female) and 19 DD participants. The two groups were well-matched on chronological and developmental age, and had similar mean scores on the Vineland Adaptive Behavior Scales (VABS), a well-established measure of developmental functioning. The average score was 58.5 (SD = 23.47) for individuals with FXS and 57.7 (SD = 16.78) for controls, indicating that the level of cognitive functioning in both groups was 2–3 SDs below the typical mean.

Participants were each interviewed by a clinically-trained experimenter. In our setup the camera was placed behind the patient and facing the interviewer. Figure 1 depicts the configuration of the interview and of the physical environment. Eye-movements were recorded using a Tobii X120 remote corneal reflection eye-tracker, with time-synchronized input from the scene camera. The eye-tracker was spatially calibrated to the remote camera by having the patient look at a known set of locations prior to the interview.

4 Visual Fixation Features

A goal of our work is to design features that simultaneously provide insight into these disorders and allow for accurate classification between them. These features are the building blocks of our system, and the key challenge is engineering them to properly distill the most meaningful parts out of the raw eye-tracker and video footage. We capture the participant’s point of gaze and its distribution over the interviewer’s face, 5 times per second during the whole interview. There are 6 relevant regions of interest: nose, left eye, right eye, mouth, jaw, outside face. The precise detection of these fine-grained features enables us to study small changes in participants’ fixations at scale. For each video frame, we detected a set of 69 landmarks on the interviewer’s face using a part-based model [16]. Figure 1 shows examples of landmark detections. In total, we processed 14,414,790 landmarks. We computed 59 K, 56 K and 156 K frames for DD, FXS-Female, and FXS-Male groups respectively. We evaluated a sample of 1 K randomly selected frames, out of which only a single frame was incorrectly annotated. We mapped the eye-tracking coordinates to the facial landmark coordinates with a linear transformation. Our features take the label of the cluster (e.g. jaw) holding the closest landmark to the participant’s point of gaze. We next present some descriptive analyses of these data.

Feature granularity. We want to analyze the relevance of our fine-grained attention features. Participants, especially those with FXS, spent only a fraction of the time looking at the interviewer’s face. Analyzing the time series data of when individuals are glancing at the face of their interviewer (see Fig. 2), we observe high variance across participants. For example, most of the FXS-F individual sequences could be easily confused with those of the other groups. Clinicians often express the opinion that the distribution of fixations, not just the sheer lack of face fixations, seems related to the general autism phenotype [8,10]. This opinion is supported by the distributions in Fig. 3: DD and FXS-F are quite similar, whereas FXS-M is distinct. FXS-M focuses primarily on mouth (4) and nose (1) areas.

Fig. 2. Temporal analysis of attention to face for (a) DD, (b) FXS-F, and (c) FXS-M. X axis represents time in frames (in increments of 0.2 s). Y axis represents each participant. Black dots represent time points when the participant was looking at the interviewer’s face. White space signifies that they were not.

Fig. 3. Histograms of visual fixation for the various disorders: (a) DD, (b) FXS-F, (c) FXS-M. X-axis represents fixations, from left to right: nose (1), eye-left (2), eye-right (3), mouth (4), and jaw (5). The histograms are computed with the data of all participants. The non-face fixation is removed for visualization convenience.

Attentional transitions. In addition to the distribution of fixations, clinicians also believe that the sequence of fixations describes underlying behavior. In particular, FXS participants often glance to the face quickly and then look away, or scan between non-eye regions. Figure 4 shows region-to-region transitions in a heatmap. There is a marked difference between the different disorders: individuals with DD make more transitions, while those with FXS exhibit significantly fewer, congruent with the clinical intuition. The transitions between facial regions better identify the three groups than the transitions from non-face to face regions. FXS-M participants tend to swap their gaze quite frequently between mouth and nose, while the other two groups do not. DD participants exhibit much more movement between facial regions, without any clear preference. FXS-F patterns resemble DD, though the pattern is less pronounced.

Fig. 4. Matrix of attentional transitions for each disorder: (a) DD, (b) FXS-F, (c) FXS-M. Each square [ij] represents the aggregated number of times participants of each group transitioned attention from state i to state j. The axes represent the different states: non-face (0), nose (1), eye-left (2), eye-right (3), mouth (4), and jaw (5).

Fig. 5. (a)–(c) Analysis of the ApEn of the data per individual varying the window length parameter w, for (a) DD, (b) FXS-F, and (c) FXS-M, with r = 1 and τ = 1. Y-axis is ApEn and X-axis varies w. Each line represents one participant’s data. We observe great variance among individuals.
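The fixation labels, histograms, and transition counts described in this section can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the `max_dist` cut-off used to decide that a gaze point is off the face, and the choice to count only changes of region as transitions are all assumptions made for the example.

```python
from collections import Counter

# Region indices follow the numbering used in the figure captions.
REGIONS = ["non-face", "nose", "eye-left", "eye-right", "mouth", "jaw"]


def label_fixation(gaze, landmarks, max_dist):
    """Return the region index of the landmark closest to the gaze point.

    gaze: (x, y) in video coordinates.
    landmarks: list of (x, y, region_index) tuples for one frame.
    max_dist: hypothetical cut-off; if no landmark is within this
        distance, the gaze is labelled as non-face (0).
    """
    best_region, best_d2 = 0, max_dist ** 2
    for x, y, region in landmarks:
        d2 = (x - gaze[0]) ** 2 + (y - gaze[1]) ** 2
        if d2 <= best_d2:
            best_region, best_d2 = region, d2
    return best_region


def fixation_histogram(labels):
    """Count how often each region was fixated."""
    counts = Counter(labels)
    return [counts.get(i, 0) for i in range(len(REGIONS))]


def transition_matrix(labels):
    """Count transitions from region i to region j between consecutive samples.

    Assumption: staying in the same region is not counted as a transition.
    """
    n = len(REGIONS)
    matrix = [[0] * n for _ in range(n)]
    for a, b in zip(labels, labels[1:]):
        if a != b:
            matrix[a][b] += 1
    return matrix
```

Aggregating the per-participant matrices over a group would give a heatmap of the kind shown in Fig. 4.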


Approximate Entropy. We next use Approximate Entropy (ApEn) analysis to provide a measure of how predictable a sequence is [13]. A lower entropy value indicates a higher degree of regularity in the signal. For each group (DD, FXS-Female, FXS-Male), we selected 15 random participants’ sequences. We compute ApEn by varying w (sliding window length). Figure 5 depicts this analysis. We can see that there is great variance amongst individuals of each population, many sharing similar entropy with participants of other groups. The high variability of the data sequences makes them harder to classify.
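For readers unfamiliar with ApEn, the following is a minimal sketch of the standard definition with template length m and tolerance r. It is not the authors' code, and how the window length w used in the text maps onto these parameters is not specified here, so `m` and `r` below should be read as the usual textbook parameters rather than as the paper's settings.

```python
import math


def approximate_entropy(seq, m, r):
    """Approximate Entropy ApEn(m, r) of a 1-D numeric sequence.

    m: template (embedding) length.
    r: tolerance; two templates match if their Chebyshev distance is <= r.
    Lower values indicate a more regular (predictable) sequence.
    """
    n = len(seq)

    def phi(dim):
        # All overlapping templates of length `dim`.
        templates = [seq[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for a in templates:
            # Number of templates within tolerance r of `a`.
            # The self-match is included, so the count is never zero.
            close = sum(
                1 for b in templates
                if max(abs(u - v) for u, v in zip(a, b)) <= r
            )
            total += math.log(close / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)
```

With integer region labels, choosing r below 1 makes templates match only when they are identical; a strictly alternating sequence then yields an ApEn close to zero, while an irregular sequence yields a larger value.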

5 Classifiers

The goal of this work is to create an end-to-end system for classification of developmental disorders from raw visual information. So far we have introduced features that capture social attentional information and analyzed their temporal structure. We next need to construct methods capable of utilizing these features to predict the specific disorder of the patient.

Model (RNN). The Recurrent Neural Network (RNN) is a generalization of feedforward neural networks to sequences. Our deep learning model is an adaptation of the attention-enhanced RNN architecture proposed by Vinyals et al. [15] (LSTM+A). The model has produced impressive results in other domains such as language modeling and speech processing. Our feature sequences fit this data profile. In addition, an encoder-decoder RNN architecture allows us to experiment with sequences of varying lengths in a cost-effective manner. Our actual models differ from LSTM+A in two ways. First, we have replaced the LSTM cells with GRU cells [2], which are memory-efficient and could provide a better fit to our data [9]. Second, our decoder produces a single output value (i.e. class). The decoder is a single-unit multi-layered RNN (without unfolding) with a soft-max output layer. Conceptually it could be seen as a many-to-one RNN, but we present it as a configuration of [15] given its proximity and our adoption of the attention mechanism. For our experiments, we used 3 RNN configurations: RNN 128: 3 layers of 128 units; RNN 256: 3 layers of 256 units; RNN 512: 2 layers of 512 units. These parameters were selected considering our GPU memory allocation limitation. We trained our models for a total of 1000 epochs. We used batches of sequences, SGD with momentum and max gradient normalization (0.5).

Other Classifiers. We also trained shallow baseline classifiers. We engineered a convolutional neural network approach (CNN) that can exploit the local-temporal relationship of our data. It is composed of one hidden layer of 6 convolutional units followed by point-wise sigmoidal nonlinearities. The feature vectors computed across the units are concatenated and fed to an output layer composed of an affine transformation followed by another sigmoid function. We also trained support vector machines (SVMs), Naive Bayes (NB) classifiers, and Hidden Markov Models (HMMs).
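One plausible reading of the optimizer settings mentioned for the RNN ("SGD with momentum and max gradient normalization (0.5)") is a momentum update applied to a gradient whose norm is capped at 0.5. The sketch below illustrates that reading only; the learning rate and momentum values are hypothetical, as they are not given in the text, and the function names are made up for this example.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def sgd_momentum_step(params, grad, velocity, lr=0.1, momentum=0.9, max_norm=0.5):
    """One SGD-with-momentum update on a flat parameter list.

    lr and momentum are hypothetical defaults; max_norm=0.5 mirrors the
    value quoted in the text under the clipping interpretation.
    """
    grad = clip_by_norm(grad, max_norm)
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    params = [p + v for p, v in zip(params, velocity)]
    return params, velocity
```

Capping the gradient norm is a common way to keep recurrent networks from diverging when a few sequences produce very large gradients.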

6 Experiments and Results

By varying the classification methods described in Sect. 5 we perform a quantitative evaluation of the overall system. We assume the gender of the patient is known, and select the clinically-relevant pair-wise classification experiments DD vs FXS-F and DD vs FXS-M. For the experiments we use 32 FXS-male, 19 FXS-female and 19 DD participants. To maintain equal data distribution in training and testing we build $S_{\text{train}}$ and $S_{\text{test}}$ by randomly shuffling participants of each class, ensuring a 50 %/50 % distribution of the two participant classes over the sets. At each new training/testing fold the process is repeated so that the average classification results will represent the entire set of participants. We classify the developmental disorder of the participants, given their individual time-series feature data p, to evaluate the precision of our system. For N total participants, we create an 80 %/20 % training/testing dataset such that no participant’s data is shared between the two datasets. For each experiment, we performed 10-fold cross validation where each fold was defined by a new random 80/20 split of the participants; about 80 participants were tested per experiment.

Table 1. Comparison of precision of our system against other classifiers. Columns denote pairwise classification precision of participants for DD vs FXS-female and DD vs FXS-male binary classification. Classifiers are run on 3, 10, and 50 second time windows. We compare the system classifier (RNN) to CNN, SVM, NB, and HMM algorithms.

Classifier   Window length (s)   DD vs FXS-female (precision)   DD vs FXS-male (precision)
SVM          3                   0.65                           0.83
SVM          10                  0.65                           0.80
SVM          50                  0.55                           0.85
NB           3                   0.60                           0.85
NB           10                  0.60                           0.87
NB           50                  0.60                           0.75
HMM          3                   0.67                           0.81
HMM          10                  0.66                           0.82
HMM          50                  0.68                           0.74
CNN          3                   0.68                           0.82
CNN          10                  0.68                           0.90
CNN          50                  0.55                           0.77
RNN 128      3                   0.69                           0.79
RNN 250      10                  0.79                           0.81
RNN 512      50                  0.86                           0.91

Metric. We consider the binary classification of an unknown participant as having DD or FXS. We adopt a voting strategy where, given a patient’s data $p = [f_1, f_2, \ldots, f_T]$, we classify all sub-sequences s of p of fixed length w using a sliding-window approach. In our experiments, w corresponds to 3, 10, and 50 seconds of video footage. To predict the participant’s disorder, we employ a max-voting scheme over each class. The predicted class C of the participant is given by:

$$C = \operatorname*{arg\,max}_{c \in \{C_1, C_2\}} \sum_{\text{sub-seq. } s} \mathbb{1}\left(\mathrm{Class}(s) = c\right) \qquad (1)$$

where $C_1, C_2 \in \{\text{DD}, \text{FXS-F}, \text{FXS-M}\}$ and Class(s) is the output of a classifier given input s. We use 10 cross validation folds to compute the average classification precision.

Results. The results are reported in Table 1. We find that the highest average precision is attained using the RNN 512 model with a 50 second time window. It classifies DD versus FXS-F with 0.86 precision and DD versus FXS-M with 0.91 precision. We suspect that the salient results produced by the RNN 512 are related to its high capacity and its capability of representing complex temporal structures.
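The evaluation protocol described in this section, participant-level splits followed by window-level voting as in Eq. (1), can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' pipeline: the function names are hypothetical, the split is stratified by class with a simple rounding rule, and `classify` stands in for any of the trained window classifiers.

```python
import random
from collections import Counter


def split_participants(labels, test_fraction, rng):
    """Split participant ids into train/test sets, stratified by class.

    labels: dict mapping participant id -> class name.
    Each participant ends up in exactly one set, so no participant's
    data is shared between training and testing.
    """
    by_class = {}
    for pid, cls in labels.items():
        by_class.setdefault(cls, []).append(pid)
    train, test = [], []
    for cls in sorted(by_class):
        pids = sorted(by_class[cls])
        rng.shuffle(pids)
        n_test = max(1, round(test_fraction * len(pids)))
        test.extend(pids[:n_test])
        train.extend(pids[n_test:])
    return train, test


def sliding_windows(sequence, w):
    """All contiguous sub-sequences of length w (w is counted in samples)."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def vote(sequence, w, classify):
    """Eq. (1): the class predicted for the most sub-sequences wins."""
    windows = sliding_windows(sequence, w)
    if not windows:
        raise ValueError("sequence is shorter than the window length")
    counts = Counter(classify(s) for s in windows)
    return counts.most_common(1)[0][0]
```

Since fixations are sampled 5 times per second, a 3 second window corresponds to w = 15 samples in this sketch. Repeating `split_participants` with a fresh random state for each fold mirrors the repeated random 80/20 splits described above.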

7 Conclusion

We hereby demonstrate the use of computer vision and machine learning techniques in a cost-effective system for assistive diagnosis of developmental disorders that exhibit visual phenotypic expression in social interactions. Data of experimenters interviewing participants with developmental disorders was collected using video and a remote eye-tracker. We built visual features corresponding to fine grained attentional fixations, and developed classification models using these features to discern between FXS and idiopathic developmental disorder. Despite finding a high degree of variance and noise in the signals used, our high accuracies imply the existence of temporal structures in the data. This work serves as a proof of concept of the power of modern computer vision systems in assistive development disorder diagnosis. We are able to provide a high-probability prediction about specific developmental diagnoses based on a short eye-movement recording. This system, along with similar ones, could be leveraged for remarkably faster screening of individuals. Future work will consider extending this capability to a greater range of disorders and improving the classification accuracy.

References

1. Boraston, Z., Blakemore, S.J.: The application of eye-tracking technology in the study of autism. J. Physiol. 581(3), 893–898 (2007)
2. Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches (2014). arXiv:1409.1259
3. Csibra, G., Gergely, G.: Social learning and social cognition: The case for pedagogy. In: Process of Change in Brain and Cognitive Development, Attention and Performance 9, pp. 152–158 (2006)
4. Golarai, G., Grill-Spector, K., Reiss, A.: Autism and the development of face processing. Clin. Neurosci. Res. 6, 145–160 (2006)
5. Hagerman, P.J.: The fragile x prevalence paradox. J. Med. Genet. 45, 498–499 (2008)
6. Hall, S.S., Frank, M.C., Pusiol, G.T., Farzin, F., Lightbody, A.A., Reiss, A.L.: Quantifying naturalistic social gaze in fragile x syndrome using a novel eye tracking paradigm. Am. J. Med. Genet. Part B Neuropsychiatric Genet. 168, 564–572 (2015)
7. Hashemi, J., Spina, T.V., Tepper, M., Esler, A., Morellas, V., Papanikolopoulos, N., Sapiro, G.: A computer vision approach for the assessment of autism-related behavioral markers. In: 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–7. IEEE (2012)
8. Jones, W., Klin, A.: Attention to eyes is present but in decline in 2-6-month-old infants later diagnosed with autism. Nature (2013)
9. Józefowicz, R., Zaremba, W., Sutskever, I.: An empirical exploration of recurrent network architectures. In: Bach, F.R., Blei, D.M. (eds.) ICML, vol. 37, pp. 2342–2350 (2015)
10. Klin, A., Jones, W., Schultz, R., Volkmar, F., Cohen, D.: Visual fixation patterns during viewing of naturalistic social situations as predictors of social competence in individuals with autism. Arch. Gen. Psychiatry 59(9), 809–816 (2002)
11. Kumar, Y., Dewal, M.L., Anand, R.S.: Epileptic seizure detection using DWT based fuzzy approximate entropy and support vector machine. Neurocomputing 133, 271–279 (2014)
12. Rehg, J.M., Rozga, A., Abowd, G.D., Goodwin, M.S.: Behavioral imaging and autism. IEEE Pervasive Comput. 13(2), 84–87 (2014)
13. Restrepo, J.F., Schlotthauer, G., Torres, M.E.: Maximum approximate entropy and r threshold: A new approach for regularity changes detection (2014). arXiv:1405.7637 (nlin.CD)
14. Sullivan, K., Hatton, D.D., Hammer, J., Sideris, J., Hooper, S., Ornstein, P.A., Bailey, D.B.: Sustained attention and response inhibition in boys with fragile X syndrome: measures of continuous performance. Am. J. Med. Gen. Part B Neuropsychiatric Gen. 144B(4), 517–532 (2007)
15. Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., Hinton, G.: Grammar as a foreign language. In: Advances in Neural Information Processing Systems 28, pp. 2773–2781. Curran Associates, Inc. (2015). http://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf
16. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: CVPR, pp. 2879–2886. IEEE (2012)

Scalable Unsupervised Domain Adaptation for Electron Microscopy

Róger Bermúdez-Chacón(B), Carlos Becker, Mathieu Salzmann, and Pascal Fua

Computer Vision Lab, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland [email protected]

Abstract. While Machine Learning algorithms are key to automating organelle segmentation in large EM stacks, they require annotated data, which is hard to come by in sufficient quantities. Furthermore, images acquired from one part of the brain are not always representative of another due to the variability in the acquisition and staining processes. Therefore, a classifier trained on the first may perform poorly on the second and additional annotations may be required. To remove this cumbersome requirement, we introduce an Unsupervised Domain Adaptation approach that can leverage annotated data from one brain area to train a classifier that applies to another for which no labeled data is available. To this end, we establish noisy visual correspondences between the two areas and develop a Multiple Instance Learning approach to exploiting them. We demonstrate the benefits of our approach over several baselines for the purpose of synapse and mitochondria segmentation in EM stacks of different parts of mouse brains.

Keywords: Domain Adaptation · Multiple instance learning · Electron microscopy · Synapse segmentation · Mitochondria segmentation

1 Introduction

Electron Microscopy (EM) can now deliver huge amounts of high-resolution data that can be used to model brain organelles such as mitochondria and synapses. Since doing this manually is immensely time-consuming, there has been increasing interest in automating the process. Many state-of-the-art algorithms [2,12,14] rely on Machine Learning to detect and segment organelles. They are effective but require annotated data to train them. Unfortunately, organelles look different in different parts of the brain as shown in Fig. 1. Also, since the EM data preparation processes are complicated and not easily repeatable, significant appearance variations can even occur when imaging the same areas. In other words, the classifiers usually need to be retrained after each new image acquisition. This entails annotating sufficient amounts of new data, which is cumbersome.

Fig. 1. Slices from four 3D Electron Microscopy volumes acquired from different parts of a mouse brain (annotated organelles overlaid in yellow): (a) Striatum, (b) Hippocampus, (c) Cerebellum, and (d) Somatosensory Cortex 3D stacks, each shown as image data and ground truth. Note the large differences in appearance, despite using the same microscope in all cases.

Domain Adaptation (DA) [11] is a well-established Machine Learning approach to mitigating this problem by leveraging information acquired when training earlier models to reduce the labeling requirements when handling new data. Previous DA methods for EM [3,17] have focused on the Supervised DA setting, which involves acquiring sufficient amounts of labeled training data from one specific image set, which we will refer to as the source domain, and then using it in conjunction with a small amount of additional labeled training data from any subsequent one, which we will refer to as the target domain, to retrain the target domain classifier. In this paper, we go one step further and show that we can achieve Unsupervised Domain Adaptation, that is, Domain Adaptation without the need for any labeled data in the target domain. This has the potential to greatly speed up the process since the human expert will only have to annotate the source domain once after the first acquisition and then never again.

Our approach is predicated on a very simple observation. As shown in Fig. 2, even though the organelles in the source and target domain look different, it is still possible to establish noisy visual correspondences between them using a very simple metric, such as the Normalized Cross Correlation. By this, we mean that, for each labeled source domain sample, we can find a set of likely target domain locations of similar organelles. Not all these correspondences will be right, but some will. To handle this uncertainty, we introduce a Multiple Instance Learning approach to performing Domain Adaptation, which relies on boosted tree stumps similar to those of [3]. In essence, we use the correspondences to replace manual annotations and automatically handle the fact that some might be wrong.

In the remainder of this paper, we briefly review related methods in Sect. 2. We then present our approach in more detail in Sect. 3 and show in Sect. 4 that it outperforms other Unsupervised Domain Adaptation techniques.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 326–334, 2016. DOI: 10.1007/978-3-319-46723-8_38

Fig. 2. Potential visual correspondences between an EM source stack (left) and a target stack (right) found with NCC. Our algorithm can handle noisy correspondences and discard incorrect matches.

2 Related Work

Domain Adaptation (DA) methods have proven valuable for many different purposes [11]. They can be roughly grouped in the two classes described below. Supervised DA methods rely on the existence of partial annotations in the target domain. Such methods include adapting SVMs [5], projective alignment methods [4,20], and metric learning approaches [16]. Supervised DA has been applied to EM data to segment synapses and mitochondria [3], and to detect immunogold particles [17]. While effective, these methods still require manual user intervention and are therefore unsuitable for fully-automated processing. Unsupervised DA methods, by contrast, do not require any target domain annotation and therefore overcome the need for additional human intervention beyond labeling the original source domain images. In this context, many approaches [1,10,15] attempt to transform the data so as to make the source and target distributions similar. Unfortunately, they either rely on very specific assumptions about the data, or their computational complexity becomes prohibitive for large datasets. By contrast, other methods rely on subspace-based representations [7,9], and are much less expensive. Unfortunately, as will be shown in the results section, the simple linear assumption on which they rely is too restrictive for the kinds of domain shift we encounter. Recently, Deep Learning has been investigated for supervised and unsupervised DA [13,18]. These techniques have shown great potential for natural image classification, but are more effective on 2D patches than 3D volumes because of the immense amounts of memory required to run Convolutional Neural Nets on them. They are therefore not ideal to leverage the 3D information that has proven so crucial for effective segmentation [2]. By contrast, our approach operates directly in 3D, can leverage large amounts of data, and its computational complexity is linear in the number of samples.

3 Method

Our goal is to leverage annotated training samples from a source domain, in which they are plentiful, to train a voxel classifier to operate in a target domain, in which there are no labeled samples. Our approach is predicated on the fact that we can establish noisy visual correspondences from the source to the target domain, which we exploit to adapt a boosted decision stump classifier.

Formally, let $f_{\theta^s}$ be a boosted decision stump classifier with parameters $\theta^s$ trained on the source domain, where we have enough annotated data. In practice, we rely on gradient boosting optimization and use the spatially extended features of [2], which capture contextual information around voxels of interest. The score of such a classifier can be expressed as

$$f_{\theta^s}(x^s) = \sum_{d=1}^{D} \alpha_d^s \cdot \operatorname{sign}\left(x_d^s - \tau_d^s\right),$$

where $\alpha^s = \{\alpha_1^s, \ldots, \alpha_D^s\}$ are the learned stump weights, $\Gamma^s = \{\tau_1^s, \ldots, \tau_D^s\}$ the learned thresholds, and $x^s = \{x_1^s, \ldots, x_D^s\}$ the features selected during training. Given the corresponding features $x^t$ extracted in the target domain, our challenge is to learn the new thresholds $\Gamma^t$ for the target domain classifier $f_{\theta^t}$, with $\theta^t = \{\alpha^s, \Gamma^t\}$, without any additional annotations.

To this end, we select a number of positive and negative samples from the source training set $C^s = \{c_1^s, \ldots, c_{N_c}^s\}$. For each one, we establish multiple correspondences by finding a set of k candidate locations in the target stack $C_i^t = \{c_{i,1}^t, \ldots, c_{i,k}^t\}$ that visually resemble it, as depicted by Fig. 2. In practice, correspondences tend to be unreliable, and we can never be sure that any $c_{i,j}^t$ is a true match for sample $c_i^s$. We therefore develop a Multiple Instance Learning formulation to overcome this uncertainty and learn a useful set of parameters $\Gamma^t$ nevertheless.

3.1 Noisy Visual Correspondences

To establish correspondences between samples from both stacks, we rely on Normalized Cross Correlation (NCC). It assigns high scores to regions of the target domain with intensity values that locally correlate to a template 3D patch. We take these templates to be small cubic regions centered around each selected sample csi in the source stack. Since the organelles can appear in any orientation, we precompute a set of 20 rotated versions of these patches. For each template, we compute the NCC at each target location for all 20 rotations and keep the highest one. This results in one score at every target location for each source template, which we reduce to the scores of the k locations with the highest NCC per source template via non-maximum suppression. Figure 3 shows some examples of the resulting noisy matches. The intuition behind establishing correspondences is that, since we are looking for similar structures in both domains, they ought to have similar shapes even if the gray levels have been affected by the domain change. In practice, the behavior is the one depicted by Fig. 3. Among the candidates, we find some that do indeed correspond to similarly shaped mitochondria or synapses and some that are wrong. On average, however, there are more valid ones, which allows the robust approach to parameter estimation described below to succeed.
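The matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a brute-force search over a small synthetic volume, omits the 20 rotated templates, and uses a simple greedy non-maximum suppression with a hypothetical `min_dist` parameter (minimum Chebyshev distance between kept locations).

```python
import numpy as np

def ncc(template, patch):
    """Normalized cross-correlation between two equally shaped arrays."""
    t = template - template.mean()
    p = patch - patch.mean()
    denom = np.sqrt((t * t).sum() * (p * p).sum())
    return float((t * p).sum() / denom) if denom > 0 else 0.0

def top_k_matches(template, volume, k, min_dist):
    """Score every valid location, then keep the k best via greedy NMS."""
    tz, ty, tx = template.shape
    vz, vy, vx = volume.shape
    scored = []
    for z in range(vz - tz + 1):
        for y in range(vy - ty + 1):
            for x in range(vx - tx + 1):
                patch = volume[z:z + tz, y:y + ty, x:x + tx]
                scored.append((ncc(template, patch), (z, y, x)))
    scored.sort(key=lambda s: s[0], reverse=True)
    kept = []
    for score, loc in scored:
        # Suppress candidates that lie too close to an already kept match.
        far_enough = all(
            max(abs(a - b) for a, b in zip(loc, other)) >= min_dist
            for _, other in kept
        )
        if far_enough:
            kept.append((score, loc))
        if len(kept) == k:
            break
    return kept

# Synthetic data (assumption): the template is cut out of the volume itself,
# so the best match is known in advance.
rng = np.random.default_rng(0)
volume = rng.random((12, 12, 12))
template = volume[3:6, 4:7, 5:8].copy()
matches = top_k_matches(template, volume, k=3, min_dist=3)
```

Each entry of `matches` is a `(score, location)` pair; only the first one is a true correspondence here, which mirrors the situation in which most of the k candidates are noisy.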

R. Bermúdez-Chacón et al.

[Figure 3: a source patch followed by the target domain correspondences found using NCC, each labeled with its contribution to the gradient. Values as printed, first row: 0.56, 0.32, 0.093, 0.018, 0.00024, 2.56e-5, 2.10e-6; second row: 0.21, 0.20, 0.20, 0.19, 0.08, 0.08, 0.04.]

Fig. 3. Examples of visual correspondences and their contributions to the gradient of the softmin function (Eq. 1) for synapses (top) and mitochondria (bottom).

3.2 Multiple Instance Learning

We aim to infer a target domain classifier given the source domain one and a few potential target matches for each source sample. To handle noisy many-to-one matches, we pose our problem as a Multiple Instance Learning (MIL) one. Standard MIL techniques [19] group the training data into bags containing a number of samples. They then minimize a loss function that is a weighted sum of scores assigned to these bags. Here, the bags are the sets $C_i^t$ of target samples assigned to each source sample $c_i^s$. We then express our loss function as

$$\hat{\Gamma}^t = \arg\min_{\Gamma^t} \frac{1}{|C^s|} \sum_{c_i^s \in C^s} \mathrm{softmin}\left[\ell_{i1}, \ell_{i2}, \dots, \ell_{ik}\right], \tag{1}$$

where $\ell_{ij} = L_\delta\left(f_{\theta^s}(c_i^s) - f_{\theta^t}(c_{i,j}^t)\right)$, $L_\delta$ is the Huber loss, and

$$\mathrm{softmin}\left[\ell_1, \dots, \ell_k\right] = -\frac{1}{r} \ln \frac{1}{k} \sum_{j=1}^{k} \exp(-r \ell_j) \tag{2}$$

is the log-sum-exponential, with $r = 100$ and $\delta = 0.1$ in our experiments. To find the parameters $\hat{\Gamma}^t$ that minimize the loss of Eq. 1, we rely on gradient boosting [8] and learn the thresholds one at a time as boosting progresses.

To avoid overfitting when correspondences do not provide enough discriminative information, we estimate probability distributions for the source and target thresholds $\tau_d^*$. In particular, we assume that these thresholds follow a normal distribution $\tau_d^* \sim \mathcal{N}\left(\mu_{\tau_d}^*, (\sigma_{\tau_d}^*)^2\right)$, and estimate its parameters by bootstrap resampling [6]. For the source domain, we learn multiple values for each $\tau_d^s$ from random subsamples of the training data, and then take the mean and variance of these values. Similarly, for the target domain, we randomly sample subsets of the source-target matches, and minimize Eq. 1 for each subset. From these multiple estimates of $\tau_d^t$ we can compute the required means and variances.
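To make the loss of Eqs. 1 and 2 concrete, the following sketch evaluates it for fixed target thresholds on toy data. It is an illustration under stated assumptions rather than the authors' code: the function names, the toy features, and the standard form of the Huber loss are chosen here, and the gradient boosting step that actually searches for the thresholds is not shown.

```python
import numpy as np

def stump_score(x, alphas, taus):
    """Boosted decision stump score: sum_d alpha_d * sign(x_d - tau_d)."""
    return float(np.sum(alphas * np.sign(x - taus)))

def huber(a, delta=0.1):
    """Standard Huber loss: quadratic for |a| <= delta, linear beyond."""
    a = np.abs(a)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def softmin(losses, r=100.0):
    """-(1/r) * ln((1/k) * sum_j exp(-r * l_j)), evaluated stably (Eq. 2)."""
    z = -r * np.asarray(losses, dtype=float)
    m = z.max()
    return float(-(m + np.log(np.mean(np.exp(z - m)))) / r)

def mil_loss(source_scores, target_bags, alphas, taus_t, r=100.0, delta=0.1):
    """Average softmin over bags (Eq. 1) for given target thresholds."""
    total = 0.0
    for f_s, bag in zip(source_scores, target_bags):
        losses = [huber(f_s - stump_score(x_t, alphas, taus_t), delta)
                  for x_t in bag]
        total += softmin(losses, r)
    return total / len(source_scores)

# Toy example (assumption): two stumps, two source samples, two candidates each.
alphas = np.array([0.5, 0.3])
taus_t = np.array([0.0, 0.0])
source_scores = [0.8, -0.8]
target_bags = [
    [np.array([1.0, 1.0]), np.array([-1.0, -1.0])],
    [np.array([-1.0, -1.0]), np.array([1.0, -1.0])],
]
loss = mil_loss(source_scores, target_bags, alphas, taus_t)
```

Because softmin is bounded by the smallest loss in a bag plus at most ln(k)/r, a bag contributes little as soon as one of its candidates agrees with the source score, which is what makes the formulation tolerant of wrong matches.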


Finally, we take τˆdt = arg maxτ p(τds = τ )p(τdt = τ ), where p(τds ) acts as a prior over the target domain thresholds: if the target domain correspondences produce high variance estimates, the distribution learned in the source domain acts as a regularizer.
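Under the normality assumption, the maximizer of the product of the two densities has a closed form: the precision-weighted mean $(\mu^s \sigma_t^2 + \mu^t \sigma_s^2) / (\sigma_s^2 + \sigma_t^2)$. The sketch below illustrates this with made-up bootstrap estimates; the function names and the numbers are assumptions for illustration only.

```python
import numpy as np

def gaussian_params(estimates):
    """Mean and sample variance of bootstrap estimates of one threshold."""
    estimates = np.asarray(estimates, dtype=float)
    return float(estimates.mean()), float(estimates.var(ddof=1))

def map_threshold(mu_s, var_s, mu_t, var_t):
    """Argmax over tau of N(tau; mu_s, var_s) * N(tau; mu_t, var_t)."""
    return (mu_s * var_t + mu_t * var_s) / (var_s + var_t)

# Made-up estimates (assumption): the source threshold is stable, while the
# target estimates obtained from noisy correspondences vary widely.
source_estimates = [0.48, 0.52, 0.50, 0.51, 0.49]
target_estimates = [0.20, 0.90, 0.60, 0.35, 0.75]
mu_s, var_s = gaussian_params(source_estimates)
mu_t, var_t = gaussian_params(target_estimates)
tau_hat = map_threshold(mu_s, var_s, mu_t, var_t)
```

With a high-variance target estimate, `tau_hat` stays close to the source mean, which is the regularizing behavior described above; with a low-variance target estimate it moves toward the target mean.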

4 Experimental Results

We test our DA method for mitochondria and synapse segmentation in manually annotated FIB-SEM stacks imaged from mouse brains (Fig. 1). We use source domain labels for training purposes and target domain labels for evaluation only. For mitochondria segmentation, we use an 853 × 506 × 496 stack from the mouse striatum as source domain and a 1024 × 883 × 165 stack from the hippocampus as target domain, both imaged at an isotropic 5 nm resolution. For synapse segmentation, we use a 750 × 564 × 750 stack from the mouse cerebellum as source domain, and a 1445 × 987 × 147 stack from the mouse somatosensory cortex as target domain, both at an isotropic 6.8 nm resolution.

4.1 Baselines

No adaptation. We use the model trained on the source domain directly for prediction on the target domain, to show the need for Domain Adaptation.

Histogram Matching. We change the gray levels in the target stack prior to feature extraction to match the distribution of intensity values in the source domain. We apply the classifier trained on the source domain on the modified target stack, to rule out that a simple transformation of the images would suffice.

TD Only. For each source example, we assume that the best match found by NCC is a true correspondence, which we annotate with the same label. A classifier is trained on these labeled target examples.

Subspace Alignment (SA). We test the method of [7], one of the very few state-of-the-art DA approaches directly applicable to our problem, as discussed in Sect. 2. It first aligns the source and target PCA subspaces and then trains a linear SVM classifier. We also tested a variant that uses an AdaBoost classifier on the transformed source data to check if introducing non-linearity helps.

4.2 Results

For our quantitative evaluation, we report the Jaccard Index. Figure 4 shows that our method is robust to the choice of number of potential correspondences k; our approach yields good performance for k between 3 and 15. This confirms the importance of MIL over simply choosing the highest ranked correspondence. However, too large a k is detrimental, since the ratio of right to wrong candidates then becomes lower. In practice, we used k = 8 for both datasets.

[Figure 4: two plots of the Jaccard index (vertical axis) against k (horizontal axis, ticks at 1, 2, 4, 8, 16, 40, 80).]

Fig. 4. Segmentation performance as a function of the number of candidate matches k used for Multiple Instance Learning, for the synapses (left) and mitochondria (right) datasets. Our approach is stable for a large range of values.

Table 1. Jaccard indices for our method and the baselines of Sect. 4.1.

               No adaptation   Histogram matching   TD only   SA [7] + Lin. SVM   SA [7] + AdaBoost   Ours (k = 8)
Synapse        0.22            0.32                 0.39      0.13                0.39                0.57
Mitochondria   0.50            0.39                 0.57      0.24                0.59                0.62

Table 1 compares our approach to the above-mentioned baselines. Note that we significantly outperform them in both cases. We conjecture that the inferior performance of SA [7] is because our features are highly correlated, making PCA a suboptimal representation to align both domains.

The training time for the baselines was around 30 min each. Our method takes around 35 min for training. Finding correspondences for 10000 locations takes around 24 h when parallelized over 10 cores, which corresponds to around 81 s per source domain patch. While our approach takes longer overall, it yields significant performance improvement with no need for user supervision. All the experiments were carried out on a 20-core Intel Xeon 2.8 GHz.

In Fig. 5, we provide qualitative results by overlaying on a single target domain slice results with our domain adaptation and without. Note that our approach improves in terms of both false positives and false negatives.

Fig. 5. Detected synapses and mitochondria overlaid on one slice of the target domain stacks (top: synapses; bottom: mitochondria). In both cases, we display from left to right the results obtained without domain adaptation, with domain adaptation, and the ground truth.
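For reference, the Jaccard index used in this evaluation is the ratio of the intersection to the union of the predicted and ground-truth foreground voxels. A minimal sketch on binary masks follows; the function name and the convention for two empty masks are assumptions made here.

```python
import numpy as np

def jaccard_index(pred, truth):
    """|pred AND truth| / |pred OR truth| for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return float(np.logical_and(pred, truth).sum() / union)

# One voxel in common, three voxels in the union.
score = jaccard_index([1, 1, 0, 0], [1, 0, 1, 0])
```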

5 Conclusion

We have introduced an Unsupervised Domain Adaptation method based on automated discovery of inter-domain visual correspondences and shown that its accuracy compares favorably to several baselines. Furthermore, its computational complexity is low, which makes it suitable for handling large data volumes. A limitation of our current approach is that it computes the visual correspondences individually, thus disregarding the inherent structure of the matching problem. Incorporating such structural information will be a topic for future research.

References

1. Baktashmotlagh, M., Harandi, M., Lovell, B., Salzmann, M.: Unsupervised domain adaptation by domain invariant projection. In: CVPR (2013)
2. Becker, C., Ali, K., Knott, G., Fua, P.: Learning Context Cues for Synapse Segmentation. TMI (2013)
3. Becker, C., Christoudias, M., Fua, P.: Domain adaptation for microscopy imaging. TMI 34(5), 1125–1139 (2015)
4. Conjeti, S., Katouzian, A., Roy, A.G., Peter, L., Sheet, D., Carlier, S., Laine, A., Navab, N.: Supervised domain adaptation of decision forests: Transfer of models trained in vitro for in vivo intravascular ultrasound tissue characterization. Medical image analysis (2016)
5. Duan, L., Tsang, I., Xu, D.: Domain transfer multiple kernel learning. PAMI (2012)
6. Efron, B.: The jackknife, the bootstrap and other resampling plans, vol. 38. SIAM (1982)
7. Fernando, B., Habrard, A., Sebban, M., Tuytelaars, T.: Unsupervised visual domain adaptation using subspace alignment. In: ICCV (2013)
8. Friedman, J.: Stochastic Gradient Boosting. Computational Statistics & Data Analysis (2002)
9. Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: CVPR (2012)
10. Heimann, T., Mountney, P., John, M., Ionasec, R.: Learning without labeling: domain adaptation for ultrasound transducer localization. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8151, pp. 49–56. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40760-4_7
11. Jiang, J.: A Literature Survey on Domain Adaptation of Statistical Classifiers. Technical report, University of Illinois at Urbana-Champaign (2008)
12. Kreshuk, A., Koethe, U., Pax, E., Bock, D., Hamprecht, F.: Automated detection of synapses in serial section transmission electron microscopy image stacks. PloS one (2014)
13. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: ICML (2015)
14. Lucchi, A., Becker, C., Márquez Neila, P., Fua, P.: Exploiting enclosing membranes and contextual cues for mitochondria segmentation. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8673, pp. 65–72. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10404-1_9
15. Pan, S., Tsang, I., Kwok, J., Yang, Q.: Domain adaptation via transfer component analysis. TNN 22(2), 199–210 (2011)
16. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 213–226. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15561-1_16
17. Sousa, R.G., Esteves, T., Rocha, S., Figueiredo, F., Sá, J.M., Alexandre, L.A., Santos, J.M., Silva, L.M.: Transfer learning for the recognition of immunogold particles in TEM imaging. In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2015. LNCS, vol. 9094, pp. 374–384. Springer, Heidelberg (2015). doi:10.1007/978-3-319-19258-1_32
18. Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: ICCV (2015)
19. Viola, P., Platt, J., Zhang, C.: Multiple instance boosting for object detection. In: NIPS, pp. 1417–1424 (2005)
20. Wang, C., Mahadevan, S.: Heterogeneous domain adaptation using manifold alignment. In: IJCAI (2011)

Automated Diagnosis of Neural Foraminal Stenosis Using Synchronized Superpixels Representation

Xiaoxu He1,2(B), Yilong Yin3, Manas Sharma1,2, Gary Brahm1,2, Ashley Mercado1,2, and Shuo Li1,2

1 Digital Imaging Group (DIG), London, ON, Canada
2 The University of Western Ontario, London, ON, Canada
[email protected]
3 Shandong University, Jinan, Shandong, China

Abstract. Neural foramina stenosis (NFS), a common spine disease, affects 80 % of people. Clinical diagnosis by physicians’ manual segmentation is inefficient and laborious. Automated diagnosis is highly desirable but faces the class overlapping problem caused by the diverse shapes and sizes of neural foramina. In this paper, a fully automated diagnosis approach is proposed for NFS. It is based on a newly proposed synchronized superpixels representation (SSR) model, in which a highly discriminative feature space is obtained for accurately and easily classifying neural foramina into normal and stenosed classes. To achieve this, class labels (0: normal, 1: stenosed) are integrated to guide manifold alignment, which correlates images from the same class, so that the intra-class difference is reduced and the inter-class margin is maximized. The overall result reaches a high accuracy (98.52 %) on 110 mid-sagittal MR spine images collected from 110 subjects. Hence, our approach provides an efficient and accurate clinical tool that greatly reduces the burden on physicians and supports the timely treatment of NFS.

1 Introduction

Neural foramina stenosis (NFS) is known as a common result of disc degeneration due to age. For example, about 80 % of people suffer lower back pain caused by NFS [1,2]. Existing clinical diagnosis by physicians’ manual segmentation is very inefficient and tedious. Automated diagnosis, which predicts the class label (0: normal, 1: stenosed) for a given neural foramina image, is highly desirable. However, automated diagnosis is still challenging due to the difficulty of extracting a sufficiently discriminative feature representation from extremely diverse neural foramina images [1]. This diversity leads to a severe inter-class overlapping problem when classifying neural foramina images into the normal or stenosed class (see Fig. 1(a)). The class overlapping problem is regarded as one of the toughest pervasive problems in classification [3–6], and severely affects the diagnosis accuracy of neural foramina. To solve it, a discriminant feature space maximizing the inter-class margin between the normal and stenosed classes is needed.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 335–343, 2016. DOI: 10.1007/978-3-319-46723-8_39


X. He et al.

Fig. 1. SSR model, implemented by integrating class label (0:normal,1:stenosed) into manifold alignment, provides a discriminative feature space (called SSR space) for reliable classification. (a) the class overlapping problem in original image space; (b) stenosed SSR; (c) normal SSR; (d) SSR space.

In this paper, a fully automated and reliable diagnosis framework is proposed for NFS. For reliable classification, it constructs a new discriminative feature space (as shown in Fig. 1(d)) using a new synchronized superpixels representation (SSR) model (as shown in Fig. 1(b) and (c)). The SSR model integrates class labels into manifold approximation and alignment to jointly decompose, synchronize, and cluster the spectral representations of neural foramina images from the same class. The obtained normal SSR and stenosed SSR are new superpixel representations synchronized for images from the same class. Because the synchronization is performed only for images from the same class, images from different classes have unsynchronized superpixel representations, which enlarges the inter-class difference. Hence, the constructed SSR space is highly discriminative due to the enlarged inter-class margin and the reduced intra-class difference (as shown in Fig. 1(d)). With this discriminative space, any classifier, even a simple k-NN, could achieve superior performance in automated diagnosis of NFS. Our diagnosis framework thus provides an automated and accurate clinical diagnosis tool for NFS.

2 Spectral Graph, Spectral Bases, and Superpixels

There are three key concepts used in our framework:

Spectral graph $G = (V, E)$ is a graph structure for the pairwise similarities among all pixels within an image [7,8]. For an image $I$ with $N$ pixels in total, we construct $G = (V, E)$, where $V$ ($N = |V|$) is the pixel set and each edge $e \in E$ connects two arbitrary pixels $i, j$ in the image. Each edge between $i$ and $j$ is weighted by $W(i, j)$, determined by the intensity, the spatial location, and the intervening contours between the two pixels:

$$W(i, j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\delta_x} - \frac{\|I_i - I_j\|^2}{\delta_I} - \max_{x \in \mathrm{line}(i,j)} \frac{\|\mathrm{Edge}(x)\|^2}{\delta_E}\right) \tag{1}$$

where $x_i, x_j$ are the locations of the pixels $i, j$ and $I_i, I_j$ are their intensities, respectively. $\mathrm{Edge}(x)$ represents an edge detector (i.e., the Canny detector) at location $x$. $\delta_x, \delta_I, \delta_E$ are constants that are assigned empirically. In practice, the spectral matrix $W$ is only computed over the k-nearest neighbors, so $W$ is a sparse matrix.

Spectral bases $U = [\xi_1(G), \dots, \xi_N(G)]$ are the eigenvectors of the spectral matrix $W$ [8,9]. In practice, they are decomposed from the graph Laplacian $L$ instead of $W$:

$$L = Id - D^{-\frac{1}{2}} W D^{-\frac{1}{2}} \tag{2}$$

where $Id = \mathrm{diag}(1, 1, \dots, 1)$ is the identity matrix and $D$ is the diagonal matrix whose elements are the row sums of $W$.

Superpixels are the clusters obtained by grouping image pixels based on the spectral bases, which approximate the manifold of an image. They correspond to a high-level representation of an image, such as smooth and non-overlapping regions in the image.
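The three concepts above can be illustrated on a tiny synthetic image. The sketch below is a simplification under stated assumptions, not the authors' pipeline: it builds a dense W from the spatial and intensity terms only (the edge-detector term and the k-nearest-neighbor sparsification are omitted), uses hypothetical values for the constants, and groups pixels by the sign of the second eigenvector instead of a full clustering step.

```python
import numpy as np

def affinity(image, delta_x=4.0, delta_i=0.1):
    """Dense W(i, j) from the spatial and intensity terms of Eq. (1) only."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    val = image.ravel().astype(float)
    d_pos = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=2)
    d_val = (val[:, None] - val[None, :]) ** 2
    return np.exp(-d_pos / delta_x - d_val / delta_i)

def normalized_laplacian(W):
    """L = Id - D^(-1/2) W D^(-1/2), with D the diagonal of row sums (Eq. 2)."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def spectral_bases(L, k):
    """Eigenvectors of L for its k smallest eigenvalues."""
    vals, vecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    return vals[:k], vecs[:, :k]

# Synthetic 6 x 6 image (assumption): dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
W = affinity(image)
L = normalized_laplacian(W)
vals, U = spectral_bases(L, k=3)
# Crude two-region grouping from the sign of the second spectral basis.
labels = (U[:, 1] > 0).astype(int).reshape(image.shape)
```

On this image the two groups coincide with the two halves, a toy version of superpixels as smooth, non-overlapping regions.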

3 Methodology

Our diagnosis framework includes two phases (as shown in Fig. 2): (1) in training (Sect. 3.1), the SSR space is constructed by label-supervised synchronization of the spectral bases’ decomposition and clustering; (2) in testing (Sect. 3.2), the class label of an unlabeled localized neural foramina is predicted by searching its nearest neighbors in SSR space.

3.1 Training Phase

Given a training set $\{I, C\}$ that includes $M_1$ normal neural foramina images $I_{nor} = \{I_m \mid C_m = 0, I_m \in I, C_m \in C, m = 1, \dots, M_1\}$, $M_2$ stenosed neural foramina images $I_{ste} = \{I_m \mid C_m = 1, I_m \in I, C_m \in C, m = 1, \dots, M_2\}$, and the corresponding set of Laplacians $L = \{L_m, m = 1, \dots, M\}$, where $M = M_1 + M_2$, the construction of the SSR space includes the following two steps:

Fig. 2. The overview of our automated diagnosis framework.

Spectral Bases Synchronization: Synchronized spectral bases for normal and stenosed images are simultaneously obtained by integrating the class labels $\{C_m = C_l\}$ into Joint Laplacian Diagonalization with Fourier coupling [7]. We look for a set of synchronized bases $\{Y_m : Y_m^T Y_m = Id\}_{m=1}^{M}$ such that $Y_m^T L_m Y_m$ are approximately diagonal for $m = 1, \dots, M$. To ensure that the bases from the same class behave consistently, the label-supervised coupling constraints [7] are introduced: given a vector $f^m$ on the manifold of image $I_m$ and a corresponding vector $f^l$ on the manifold of image $I_l$, if $C_m = C_l$, we require that their Fourier coefficients in the respective bases coincide, $Y_m^T f^m = Y_l^T f^l$. So, the label-supervised coupled diagonalization problem can be rewritten as

$$\min_{Y_1, \dots, Y_M} \sum_{m \in I} \|Y_m^T L_m Y_m - \Lambda_m\|_F^2 + \mu \sum_{m, l \in I,\ C_m = C_l} \|F_m^T Y_m - F_l^T Y_l\|_F^2 \tag{3}$$

where $\Lambda_m = \mathrm{diag}(\lambda_1, \dots, \lambda_K)$ denotes the diagonal matrix containing the $K$ smallest eigenvalues of $L_m$. $F$ is an arbitrary feature mapping that maps a spectral map to a fixed-dimension feature vector. The optimal results $Y_1^*, \dots, Y_M^*$ can be classified as normal and stenosed synchronized spectral bases according to their class labels.

In practice, to resolve the ambiguity of $Y_m$ and simplify the optimization, the first $K$ vectors of the synchronized spectral bases are approximated as a linear combination of the eigenvectors of $L_m$ for its $K' \geq K$ smallest eigenvalues, denoted by $U_m = [\xi_1(G_m), \dots, \xi_{K'}(G_m)]$. We parameterize the synchronized spectral base $Y_m$ as $Y_m = U_m A_m$, where $A_m$ is the $K' \times K$ matrix of linear combination coefficients. From the orthogonality of $Y_m$, it follows that $A_m^T A_m = Id$. Plugging this subspace parametrization into Eq. (3) gives

$$\min_{A_1, \dots, A_M} \sum_{m \in I} \|A_m^T \tilde{\Lambda}_m A_m - \Lambda_m\|_F^2 + \mu \sum_{m, l \in I,\ C_m = C_l} \|F_m^T U_m A_m - F_l^T U_l A_l\|_F^2 \quad \text{s.t.}\ A_m^T A_m = Id\ (m = 1, \dots, M) \tag{4}$$

where $\tilde{\Lambda}_m$ is the diagonal matrix containing the first $K'$ eigenvalues of $L_m$. Problem (4) can be solved using standard constrained optimization techniques. As the label-supervised Coupled Diagonalization of Laplacians enables the approximation and alignment of manifolds for images from the same class, the obtained synchronized spectral bases approximate the manifold of each image and align them for images from the same class.

Superpixels Synchronization: Normal SSR and stenosed SSR are respectively obtained by grouping all image pixels with the corresponding synchronized spectral bases. As the obtained spectral bases are automatically synchronized for images from the same class, the obtained normal SSR and stenosed SSR simultaneously minimize the intra-class difference. Correspondingly, the unsynchronized spectral bases for images from different classes yield clearly different superpixel representations, which maximizes the inter-class margin. Hence, the obtained SSR space provides a new discriminative ability for reliable diagnosis even with a simple classifier.

3.2 Testing Phase

In testing, an unlabeled neural foramina image $I_w$ is first localized by a trained SVM sub-window localization classifier implemented with the method introduced in [10]; its class label is then predicted by finding the nearest neighbors in SSR space.

Incremental Spectral Bases Synchronization: For an unlabeled neural foramina image $I_w$, incremental synchronization is proposed to obtain its mapping point $Y_w$ in SSR space:

$$cost_{ste} = \min_{Y_w} \|Y_w^T L_w Y_w - \Lambda_w\|_F^2 + \mu \sum_{C_m = 1} \|F_m^T Y_m - F_w^T Y_w\|_F^2 \tag{5}$$

$$cost_{nor} = \min_{Y_w} \|Y_w^T L_w Y_w - \Lambda_w\|_F^2 + \mu \sum_{C_m = 0} \|F_m^T Y_m - F_w^T Y_w\|_F^2 \tag{6}$$

where $L_w$ is the Laplacian matrix of $I_w$, $\{Y_m \mid C_m = 1, m = 1, \dots, M_2\}$ are the learned stenosed synchronized spectral bases, $\{Y_m \mid C_m = 0, m = 1, \dots, M_1\}$ are the learned normal synchronized spectral bases, and $cost_{ste}$ and $cost_{nor}$ denote the mapping cost losses. Incremental spectral bases synchronization maps $I_w$ into SSR space.

Diagnosis: The class label of $I_w$ is predicted by comparing the computed cost losses $cost_{ste}$ and $cost_{nor}$. For example, if $cost_{nor}$ is smaller, its approximated manifold $Y_w$ is more similar to images from the normal class and the mapping point of $I_w$ in SSR space is in the normal class. Hence, the class label of $I_w$ is naturally predicted by the minimal mapping cost loss:

$$f(I_w) = \begin{cases} 1\ (\text{stenosed}), & \text{if } cost_{ste} < cost_{nor}, \\ 0\ (\text{normal}), & \text{otherwise.} \end{cases} \tag{7}$$

4 Experiments and Results

4.1 Experiment Setup

Following the clinical standard, our experiments are conducted on 110 mid-sagittal MR lumbar spine images collected from 110 subjects, including healthy cases and patients with NFS. These MR scans were acquired using sagittal T1-weighted MRI with a repetition time (TR) of 533 ms and an echo time (TE) of 17 ms under a magnetic field of 1.5 T. The training set includes two types of images: (1) NF images and non-NF images used to train the SVM localization classifier; (2) normal NF images and stenosed NF images used to train the SSR model. These training images were manually cropped and labeled by a physician according to the commonly used clinical diagnosis criterion [1]. The classification accuracy, specificity, and sensitivity are reported as the average over ten runs of leave-one-subject-out cross-validation.

Table 1. Performance of the proposed framework

               Accuracy   Sensitivity   Specificity
Localization   99.27 %    99.08 %       100.00 %
Diagnosis      98.52 %    97.96 %       100.00 %

Fig. 3. Accurate diagnosis results in multiple subjects with diverse appearance, size, and shape.

4.2 Results

The high accuracy achieved by the proposed framework in both localization (99.27 %) and classification (98.52 %) is shown in Table 1. Besides, its robustness in localizing and diagnosing neural foramina with different appearance, shape, and orientation is qualitatively displayed in Fig. 3. This high accuracy and robustness are derived from the intrinsic class separation captured by our framework. Hence, an accurate and efficient diagnosis of NFS is obtained regardless of the disturbance from appearance, shape, and orientation. Table 2 demonstrates that SSR achieved highest accuracy (>95 %) than other five classical features (

[...]

0, it means after the new data arrives, more positive data should be sampled from $P_0$ in order to keep the same sampling rate $\mathrm{Pois}(\lambda_{p1})$ as used for $P'$. We additionally sample $P_0$ with $\mathrm{Pois}(\delta_{p0})$ to get an Add Set $A_p$, and add it to $A$. If $\delta_{p0} < 0$, it means after the new data arrives, fewer positive samples from $\dot{P}_0$ are needed to keep the same sampling rate $\mathrm{Pois}(\lambda_{p1})$ as used for $P'$. We generate a random number $r \sim \mathrm{Pois}(|\delta_{p0}| \times |P_0|)$ and get $\min(r, |\dot{P}_0|)$ samples from $\dot{P}_0$. We denote these sampled data as a Remove Set $R_p$, and then add $R_p$ to $R$. The same steps are used to deal with $S_0$’s negative subset $N_0$, so that either an Add Set $A_n$ or a Remove Set $R_n$ is obtained. Thus we get the whole Add Set $A = \dot{S}' \cup A_p \cup A_n$ and the whole Remove Set $R = R_p \cup R_n$. To get the updated training sample set $\dot{S}_1$ for a tree on the fly, we remove $R$ from $\dot{S}_0$ and add $A$ to it: $\dot{S}_1 = (\dot{S}_0 - R) \cup A$. Thanks to the way $R$ and $A$ are generated, $\dot{S}_1$ is balanced and adapted to the new imbalance ratio $\gamma_1$.

Tree Growing and Shrinking. Instead of reconstructing trees from scratch, we use the Remove Set $R$ and Add Set $A$ to update an existing tree that has been constructed based on $\dot{S}_0$, so that the updated tree adapts to the imbalance ratio change. Each sample in $R$ and $A$ is propagated from the root to a leaf. Assuming a subset $R_l$ of $R$ and a subset $A_l$ of $A$ fall into a certain leaf $l$ with an existing sample set $S_l^{old}$, the sample set of $l$ is updated as $S_l^{new} = (S_l^{old} - R_l) \cup A_l$. Then tree growing or shrinking is implemented on $l$ based on $S_l^{new}$. If $|S_l^{new}| > 0$, a split test is executed for $l$ and its children are created (growing) if applicable, based on the same split rules as used in the tree construction stage [3]. If $|S_l^{new}| = 0$, $l$ is deleted (shrinking). Its parent merges the left and right child and becomes a leaf. The parent of a deleted leaf is tested for growing or shrinking again if applicable.

Applying DyBa ORF to Interactive Image Segmentation. In our image segmentation tasks, DyBa ORF learns from scribbles that the user provides gradually, and it predicts the probability of each pixel being foreground. The features are extracted from a 9 × 9 region of interest (ROI) centered on each pixel [8]. We use gray level features based on the mean and standard deviation of intensity, histogram of oriented gradients (HOG), Haar wavelet, and texture features from the gray level co-occurrence matrix (GLCM). The probability given by DyBa ORF is combined with a Conditional Random Field (CRF) [2] for spatial regularization.
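The Add Set and Remove Set step for the existing data of one class can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the rate change `delta` (written $\delta_{p0}$ above) is taken as an input because its defining formula is not part of the surviving text, samples are represented by integer indices, and the function names are hypothetical.

```python
from collections import Counter

import numpy as np

def update_sets(pool, in_bag, delta, rng):
    """Add/Remove index lists for the existing samples of one class in one tree.

    pool:   indices of all existing samples of this class (P0)
    in_bag: indices currently used by the tree (the sampled subset of P0)
    delta:  change in the Poisson rate for existing samples (assumed given)
    """
    add, remove = [], []
    if delta > 0:
        # Sample each existing item Pois(delta) more times.
        counts = rng.poisson(delta, size=len(pool))
        for idx, c in zip(pool, counts):
            add.extend([idx] * int(c))
    elif delta < 0:
        # Draw r ~ Pois(|delta| * |P0|) and remove min(r, |in_bag|) items.
        r = int(rng.poisson(abs(delta) * len(pool)))
        n = min(r, len(in_bag))
        if n > 0:
            picked = rng.choice(len(in_bag), size=n, replace=False)
            remove = [in_bag[int(i)] for i in picked]
    return add, remove

def apply_update(in_bag, add, remove):
    """Multiset version of S1 = (S0 - R) U A."""
    bag = Counter(in_bag)
    bag.subtract(Counter(remove))
    bag.update(add)
    return sorted(bag.elements())

rng = np.random.default_rng(0)
pool = list(range(100))
in_bag = pool[:60]
add_up, remove_up = update_sets(pool, in_bag, 0.5, rng)       # rate increases
add_down, remove_down = update_sets(pool, in_bag, -0.2, rng)  # rate decreases
new_bag = apply_update(in_bag, add_down, remove_down)
```

The same two functions would be called once for the positive and once for the negative subset of each tree; propagating the resulting samples through the tree for growing and shrinking is not shown.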

3 Experiments and Results

DyBa ORF was compared with three counterparts: (1) a traditional ORF [1] with multiple Poisson distributions based on Eq. (1) (MP ORF), (2) a traditional ORF with a single Poisson distribution Pois(λ) (SP ORF), and (3) an offline counterpart (OffBa RF) which learns from scratch when new data arrives.


G. Wang et al.

Table 1. Comparison of G-mean on four UCI data sets after 100 % of the training data arrived in online learning. The bold font shows values that are not significantly different from the corresponding results of OffBa RF (p-value > 0.05). The G-mean of SP ORF on Wine is zero due to classifying all the samples into the negative class.

                                         G-mean (%)
Data set           Imbalance ratio γ     OffBa RF      DyBa ORF      MP ORF        SP ORF
Biodegradation     1.23±0.47             83.33±2.50    83.57±2.55    81.80±2.90    80.92±2.98
Musk (version 1)   0.72±0.36             82.78±4.44    83.06±4.08    73.83±6.95    81.65±4.93
Cardiotocography   19.16±3.46            97.06±1.69    97.09±1.56    95.52±1.16    87.59±5.41
Wine               27.42±2.90            76.14±3.34    76.50±3.91    74.99±4.39    0.00±0.00

The parameter settings were: λ = 1.0, tree number 50, maximal tree depth 20, and minimal sample number for split 6. The code was implemented in C++¹.

Validation of DyBa ORF. Firstly, we validate DyBa ORF as an online learning algorithm with four widely used UCI data sets²: QSAR biodegradation, Musk (Version 1), Cardiotocography and Wine. The positive class labels for them are “RB”, “1”, “8” and “8” respectively. Each of these data sets has an imbalance between the positive and negative class. We used Monte Carlo cross-validation with 100 repetitions. In each repetition, 20 % of the positive samples and 20 % of the negative samples were randomly selected to constitute the test data. The remaining 80 % of the samples were used as training data T in an online manner. The initial training set S0 contained the first 50 % of T and it was gradually enlarged by the second 50 % of T, with 5 % of T arriving each time in the same order as they appeared in T. We measured the update time when new data arrived, sensitivity, specificity, and G-mean, which is defined by $\text{G-mean} = \sqrt{\text{sensitivity} \times \text{specificity}}$. Table 1 shows the final G-mean on all four data sets after 100 % of T arrived. The performances on the QSAR biodegradation data set are presented in Fig. 1, which shows a decreasing sensitivity and increasing specificity for SP ORF and MP ORF. In contrast, OffBa RF keeps high sensitivity and G-mean when the imbalance ratio increases. DyBa ORF achieves a sensitivity and specificity close to OffBa RF, but its update time is much shorter when new data arrives.

Fig. 1. Performance of DyBa ORF and counterparts on UCI QSAR biodegradation data set. Training data was gradually obtained from 50 % to 100 %.

¹ The code is available at https://github.com/gift-surg/DyBaORF.
² http://archive.ics.uci.edu/ml/datasets.html.

Dynamically Balanced Online Random Forests for Interactive Segmentation

Interactive Segmentation of the Placenta and Adult Lungs. DyBa ORF was applied to two different 2D segmentation tasks: placenta segmentation of fetal MRI and adult lung segmentation of radiographs. Stacks of MRI images from 16 pregnancies in the second trimester were acquired with slice dimension 512 × 448 and pixel spacing 0.7422 mm × 0.7422 mm. A slice in the middle of each placenta was used, with the ground truth manually delineated by a radiologist. Lung images and ground truth³ were downloaded from the JSRT Database⁴. Data from the first 20 normal patients were used (image size 2048 × 2048, pixel spacing 0.175 mm × 0.175 mm). At the start of segmentation, the user draws an initial set of scribbles to indicate the foreground and background, and the RFs and CRF are applied. After that, the user gives more scribbles several times, and each time the RFs are updated and used to predict the probability at each pixel.

Fig. 2. Visual comparison of DyBa ORF and counterparts in segmentation of (a) placenta from fetal MRI and (b) adult lungs from radiographs. The first row in each sub-figure shows two stages of interaction, where scribbles are extended with changing imbalance ratio. Probability higher than 0.5 is highlighted by green color. The last column in (a) and (b) show the final segmentation and the ground truth.

Figure 2 shows an example of the placenta and adult lung segmentation with increasing scribbles. In Fig. 2(a) and (b), the lower accuracy of MP ORF and SP ORF compared with OffBa RF and DyBa ORF can be observed in the second and third columns. Quantitative evaluations of both segmentation tasks after the last stage of interaction are listed in Table 2. We measured the G-mean and Dice score (DS) of the probability map thresholded at 0.5, the DS after using CRF, and the average update time after the arrival of new scribbles. Table 2 shows that DyBa ORF achieved a higher accuracy than MP ORF and SP ORF, and an accuracy comparable with OffBa RF, with largely reduced update time.

Table 2. G-mean and Dice Score (DS) of DyBa ORF and counterparts in placenta and adult lung segmentation. G-mean and DS(RF) were measured on the probability given by the RFs. DS(CRF) was measured on the result after using CRF. tu is the time for forest update after the arrival of new scribbles. The bold font shows values that are not significantly different from the corresponding results of OffBa RF (p-value > 0.05).

              Method      G-mean (%)    DS(RF) (%)    DS(CRF) (%)    Average tu (s)
Placenta      OffBa RF    84.24±4.02    74.97±7.20    89.32±3.62     1.80±0.92
              DyBa ORF    83.09±4.18    75.25±6.88    89.17±3.73     0.42±0.22
              MP ORF      78.21±8.12    71.98±9.76    85.14±9.13     0.37±0.18
              SP ORF      74.49±6.94    69.40±8.55    79.32±12.07    0.53±0.26
Adult Lungs   OffBa RF    90.80±2.30    86.87±3.89    94.25±1.62     7.40±1.17
              DyBa ORF    90.08±2.36    86.69±3.56    94.06±1.64     1.52±0.43
              MP ORF      85.51±3.82    82.95±4.43    90.53±3.59     1.14±0.30
              SP ORF      83.38±5.52    80.93±6.70    87.27±9.18     2.19±0.68
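The two accuracy measures reported in this section can be written down in a few lines. The sketch below uses the G-mean definition given in the text and the standard definition of the Dice score, 2|A ∩ B| / (|A| + |B|); the function names and the toy masks are assumptions, and both classes are assumed to be present in the ground truth.

```python
import numpy as np

def g_mean(pred, truth):
    """sqrt(sensitivity * specificity) for binary predictions."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return float(np.sqrt(sensitivity * specificity))

def dice_score(pred, truth):
    """2 * |pred AND truth| / (|pred| + |truth|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    return float(2.0 * np.sum(pred & truth) / (pred.sum() + truth.sum()))

truth = [1, 0, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0]
gm = g_mean(pred, truth)
ds = dice_score(pred, truth)
# A classifier that labels everything negative has zero sensitivity.
gm_all_negative = g_mean([0, 0, 0, 0, 0, 0], truth)
```

The last line shows why a classifier that assigns every sample to the negative class obtains a G-mean of zero, as reported for SP ORF on the Wine data set.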

4 Discussion and Conclusion

Experimental results show that SP ORF had the worst performance, because it does not explicitly deal with data imbalance. MP ORF [1] performed better, but it failed to adapt to changes in the imbalance ratio. OffBa RF, which learns from scratch for each update, and DyBa ORF, which considers the new imbalance ratio in both existing and new data, were adaptive to imbalance ratio changes. DyBa ORF's comparable accuracy and reduced update time compared with OffBa RF show that it is more suitable for interactive image segmentation. In addition, the results indicate that MP ORF and SP ORF need additional user interaction to achieve the same accuracy as DyBa ORF. This indirectly demonstrates that our model helps to reduce user interaction and save interaction time. Future work could further investigate how well DyBa ORF reduces user interaction in segmentation tasks.

³ http://www.isi.uu.nl/Research/Databases/SCR/.
⁴ http://www.jsrt.or.jp/jsrt-db/eng.php.

In conclusion, we present a dynamically balanced online random forest to deal with incremental and imbalanced training data with a changing imbalance ratio, which occurs in scribble-and-learning-based image segmentation. Our method adapts to imbalance ratio changes by combining a dynamically balanced online Bagging with a tree growing and shrinking strategy to update the random forests. Experiments show that it achieved a higher accuracy than traditional ORF, with a higher efficiency than its offline counterpart. Thus, it is better suited to interactive image segmentation. It can also be applied to other online learning problems with imbalanced data and a changing imbalance ratio.

Acknowledgements. This work was supported through an Innovative Engineering for Health award by the Wellcome Trust [WT101957]; Engineering and Physical Sciences Research Council (EPSRC) [NS/A000027/1], the EPSRC (EP/H046410/1, EP/J020990/1, EP/K005278), the National Institute for Health Research University College London Hospitals Biomedical Research Centre (NIHR BRC UCLH/UCL High Impact Initiative), a UCL Overseas Research Scholarship and a UCL Graduate Research Scholarship.

References

1. Barinova, O., Shapovalov, R., Sudakov, S., Velizhev, A.: Online random forest for interactive image segmentation. In: EEML, pp. 1–8 (2012)
2. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: ICCV 2001, vol. 1, pp. 105–112, July 2001
3. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
4. Chen, C., Liaw, A., Breiman, L.: Using random forest to learn imbalanced data, pp. 1–12. University of California, Berkeley (2004)
5. Grady, L., Schiwietz, T., Aharon, S., Westermann, R.: Random walks for interactive organ segmentation in two and three dimensions: implementation and validation. In: Duncan, J.S., Gerig, G. (eds.) MICCAI 2005. LNCS, vol. 3750, pp. 773–780. Springer, Heidelberg (2005). doi:10.1007/11566489_95
6. Harrison, A.P., Birkbeck, N., Sofka, M.: IntellEditS: intelligent learning-based editor of segmentations. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8151, pp. 235–242. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40760-4_30
7. Saffari, A., Leistner, C., Santner, J., Godec, M., Bischof, H.: Online random forests. In: ICCV Workshops 2009, pp. 1393–1400 (2009)
8. Santner, J., Unger, M., Pock, T., Leistner, C., Saffari, A., Bischof, H.: Interactive texture segmentation using random forests and total variation. In: BMVC 2009, pp. 66.1–66.12 (2009)
9. Top, A., Hamarneh, G., Abugharbieh, R.: Active learning for interactive 3D image segmentation. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011. LNCS, vol. 6893, pp. 603–610. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23626-6_74
10. Wang, G., Zuluaga, M.A., Pratt, R., Aertsen, M., David, A.L., Deprest, J., Vercauteren, T., Ourselin, S.: Slic-seg: slice-by-slice segmentation propagation of the placenta in fetal MRI using one-plane scribbles and online learning. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 29–37. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_4
11. Yang, W., Cai, J., Zheng, J., Luo, J.: User-friendly interactive image segmentation through unified combinatorial user inputs. IEEE TIP 19(9), 2470–2479 (2010)

Orientation-Sensitive Overlap Measures for the Validation of Medical Image Segmentations

Tasos Papastylianou¹(✉), Erica Dall’Armellina², and Vicente Grau¹

¹ Institute of Biomedical Engineering, University of Oxford, Oxford, UK
[email protected]
² Acute Vascular Imaging Centre, Radcliffe Department of Medicine, John Radcliffe Hospital, University of Oxford, Oxford, UK

Abstract. Validation is a key concept in the development and assessment of medical image segmentation algorithms. However, the proliferation of modern, non-deterministic segmentation algorithms has not been met by an equivalent improvement in validation strategies. In this paper, we briefly examine the state of the art in validation, and propose an improved validation method for non-deterministic segmentations, showing that it improves validation precision and accuracy on both synthetic and clinical sets, compared to more traditional (but still widely used) methods and the state of the art.

Keywords: Validation · Segmentation · Fuzzy · Probabilistic · T-Norms

1 Introduction

It is often noted in the segmentation literature that, while research on newer segmentation methods abounds, corresponding research on appropriate evaluation methods tends to lag behind [1,2]. This is particularly the case with medical images, which suffer inherent difficulties in terms of validation, such as: limited datasets; clinical ambiguity over the ground truth itself; difficulty and relative unreliability of clinical gold standards (which usually default to expert-led manual contour delineation); and variability in agreed segmentation protocols and clinical evaluation measures [1]. Furthermore, many of the latest approaches in segmentation have been increasingly non-deterministic, or “fuzzy”¹ in nature [3]; this is of particular importance in medical image segmentation, due to the presence of the Partial Volume Effect (PVE) [4]. However, appropriate validation methods that take fuzziness specifically into account are rarely considered, despite the fact that gold standards are also becoming increasingly fuzzy (e.g. expert delineations at higher resolutions; consensus voting [3]). On the contrary, most segmentation papers approach fuzziness as a validation nuisance instead, as they tend to rely on more conventional binary validation methods, established from early segmentation literature, and work around the ‘problem’ of fuzziness by thresholding pixels at a (sometimes arbitrary) threshold, so as to produce the binary sets required for traditional validation.

¹ We use the term “fuzzy” here in a broad sense, i.e. all methods assigning non-discrete labels, of which modern probabilistic segmentation methods are a strict subset.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 361–369, 2016. DOI: 10.1007/978-3-319-46723-8_42

State of the art in the validation of fuzzy/probabilistic segmentations: There is a multitude of validation approaches and distance/similarity metrics; Deza and Deza’s Encyclopedia of Distances [5] alone spans over 300 pages of such metrics from various fields. In the medical image segmentation literature, traditional binary methods like the Dice and Tanimoto coefficients seem to be by far the most popular overlap-based metrics [7,8], even when explicitly validating inherently non-binary segmentation/gold-standard pairs. While there are a few cases where thresholding could be deemed appropriate (e.g. when fuzziness does not have a straightforward interpretation, therefore a simplifying assumption is made that all output pixels are pure), we would argue that in most cases the thresholding approach is still used mostly by convention, or at most out of a need for consistency and comparison with older literature, rather than because it is an appropriate method for fuzzy sets. This occurs at the cost of discarding valuable information, particularly in the case where fuzziness essentially denotes a PVE.

Yi et al. [6] addressed this issue by treating PVE pixels as a separate binary class denoting ‘boundary’ pixels. While they demonstrated that this approach led to higher validation scores compared to thresholding, there was no discussion as to whether this genuinely produces a more accurate, precise, and reliable result; furthermore, it is still wasteful of information contained in pixel fuzziness, since all degrees of fuzziness at the boundary would be treated as a single label. Chang et al. [7] proposed a framework for extending traditional validation coefficients to fuzzy inputs, by taking the intersection of two non-zero pixels to be equal to the complement of their discrepancy. In other words, two pixels of equal fuzzy value are given an intersection value of 1. However, this is a rather limited interpretation of fuzziness, which is not consistent with PVE or geometric interpretations of fuzziness, as we will show later. Furthermore, the authors did not formally assess validation performance against their binary counterparts. Crum et al. [8] proposed a fuzzy extension of the Tanimoto coefficient, demonstrating increased validation accuracy. They used a specific pair of Intersection and Union operators derived from classical fuzzy set theory, and compared against the traditional thresholding approach. They assessed their operator using fuzzy segmentations and gold standards derived from a synthetic ‘petal’ set, whose ground-truth validation could be reproduced analytically.

Main contributions: Our work expands on the theoretical framework put in place by Crum et al. by examining the geometric significance of the particular fuzzy intersection and union operators used. Armed with this insight, we proceed to:

1. establish absolute validation operator bounds, outside of which pixel and geometry semantics do not apply;
2. show that thresholding tends to violate these bounds, leading to reduced validation precision and accuracy, and rendering it unreliable and unsuitable for the assessment and comparison of non-deterministic segmentations;
3. propose a novel fuzzy intersection and union operator defined within these bounds, which takes into account the orientation of fuzzy pixels at object boundaries, and show that this improves validation precision and accuracy on both synthetic and real clinical sets.

2 Background Theory and Motivation

2.1 Fuzziness and Probability in Medical Image Segmentation

What does it mean for a pixel to be fuzzy? In classic segmentation literature, a segmentation (SG) mask, and similarly a gold-standard (GS) mask, is a binary image (i.e. pixels take values in the set {0,1}) of the same resolution as the input image, where the values denote the absence or presence in the pixel of the tissue of interest. In a fuzzy SG mask, pixels can instead take any value in the interval [0,1]. The underlying semantics of such a value are open to interpretation; however, perhaps the most intuitive interpretation is that of a mapping from the fuzzy value to the extent to which a pixel is occupied by the tissue in question. For example, in the simplest case of a linear mapping, a pixel with a fuzzy value of 0.56 could be interpreted as consisting of 56 % tissue of interest and 44 % background. Figure 1 demonstrates this graphically.


Fig. 1. Two fuzzy pixels with the same fuzzy value of 0.5625, but different underlying configurations (i.e. tissue distribution inside the pixel). The pixel’s fuzzy value is the average of its constituent subpixels. Note the pixel on the right is more homogeneous.
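The subpixel interpretation in Fig. 1 can be sketched numerically. The two 4 × 4 configurations below are hypothetical (they are not the ones drawn in the figure); each labels 9 of 16 subpixels as tissue, so both pixels share the fuzzy value 9/16 = 0.5625 despite different internal layouts.

```python
import numpy as np

# Hypothetical subpixel layouts: 1 = tissue of interest, 0 = background.
scattered = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 1, 0, 1],
                      [0, 1, 1, 0]])

# A more homogeneous layout, with tissue filling the pixel from the top.
homogeneous = np.array([[1, 1, 1, 1],
                        [1, 1, 1, 1],
                        [1, 0, 0, 0],
                        [0, 0, 0, 0]])

def fuzzy_value(subpixels):
    """Fuzzy value of a pixel: the average of its constituent subpixels."""
    return float(np.mean(subpixels))

print(fuzzy_value(scattered), fuzzy_value(homogeneous))
```

The fuzzy value alone therefore does not determine how tissue is distributed inside the pixel, which is the information the orientation-based operators later in the paper try to recover.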

Fig. 2. Edge Pixels: In all cases, GS (thin blue lines) covers 70 % of the pixel, and SG (coarse yellow lines) covers 50 %. From Left to right: GS and SG have the same orientation; GS and SG have opposite orientations; GS and SG have perpendicular directionality; GS and SG exhibit arbitrary orientations. Their intersections have pixel coverages of 50 %, 20 %, 35 %, and 46.2 % respectively.

T-Norms and T-Conorms: Intersection and Union operations on fuzzy inputs: Triangular Norms and Triangular Conorms, or T-Norms (TN) and T-Conorms (TC), are generalisations of the Intersection and Union operators (denoted ∩ and ∪ respectively) in Fuzzy Set Theory [9]. A TN takes two fuzzy inputs and returns a fuzzy output, is commutative (A ∩ B ≡ B ∩ A), associative (A ∩ (B ∩ C) ≡ (A ∩ B) ∩ C), monotonically nondecreasing with respect to increasing inputs, and treats 0 and 1 as null and unit elements respectively (A ∩ 0 ≡ 0, A ∩ 1 ≡ A). A TC is similar to a TN, except the null and unit elements are reversed (i.e. A ∪ 0 = A, A ∪ 1 = 1). TN and TC are dual operations, connected (as in the binary case) via De Morgan’s Law: A ∪ B ≡ ¬(¬A ∩ ¬B), where ¬ represents complementation, i.e. ¬(A) = 1 − A in this context.

2.2 Motivation: Defining Theoretical Limits for Valid Intersections

We shall now make special mention of two specific TN/TC pairs:

               T-Norm                        T-Conorm
Gödel:         A ∩G B = min(A, B)            A ∪G B = max(A, B)
Łukasiewicz:   A ∩L B = max(0, A + B − 1)    A ∪L B = min(1, A + B)
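The two TN/TC pairs can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and the brute-force helper simply enumerates every pair of subpixel label sets in a small pixel to see where the true overlap falls relative to the two TNs. The example inputs (0.7 and 0.5) are the coverages used in Fig. 2.

```python
import numpy as np

def godel_and(a, b):
    """Gödel T-norm: min(A, B)."""
    return np.minimum(a, b)

def godel_or(a, b):
    """Gödel T-conorm: max(A, B)."""
    return np.maximum(a, b)

def lukasiewicz_and(a, b):
    """Łukasiewicz T-norm: max(0, A + B - 1)."""
    return np.maximum(0.0, np.add(a, b) - 1.0)

def lukasiewicz_or(a, b):
    """Łukasiewicz T-conorm: min(1, A + B)."""
    return np.minimum(1.0, np.add(a, b))

def complement(a):
    """Fuzzy complement: 1 - A."""
    return 1.0 - np.asarray(a, dtype=float)

def overlaps_within_bounds(n_subpixels=4, tol=1e-12):
    """Check every pair of subpixel label sets in a pixel of n_subpixels.

    Each set is encoded as a bit pattern; the true overlap fraction is
    compared with the Łukasiewicz (lower) and Gödel (upper) T-norms.
    """
    for a_bits in range(2 ** n_subpixels):
        for b_bits in range(2 ** n_subpixels):
            a = bin(a_bits).count("1") / n_subpixels
            b = bin(b_bits).count("1") / n_subpixels
            overlap = bin(a_bits & b_bits).count("1") / n_subpixels
            lower = lukasiewicz_and(a, b) - tol
            upper = godel_and(a, b) + tol
            if not (lower <= overlap <= upper):
                return False
    return True

gs, sg = 0.7, 0.5
print(godel_and(gs, sg))        # largest possible overlap for these coverages
print(lukasiewicz_and(gs, sg))  # smallest possible overlap for these coverages
# De Morgan duality: A ∪ B == ¬(¬A ∩ ¬B)
print(np.isclose(godel_or(gs, sg),
                 complement(godel_and(complement(gs), complement(sg)))))
print(overlaps_within_bounds(4))
```

For these inputs the two TNs give approximately 0.5 and 0.2, which match the same-orientation and opposite-orientation intersections reported in the caption of Fig. 2.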

In this section, we show that these TNs are of special relevance in the context of pixels: when the fuzzy inputs are SG and GS masks denoting underlying tissue composition, these TNs have the following properties:

Theorem 1: The Gödel TN (∩G) represents the most optimal ∩ possible for that pixel.

Proof: We remind ourselves that fuzziness here is defined in the frequentist sense, i.e. as the number of subpixels labelled true, over the total number of subpixels N comprising the pixel. Optimality occurs when the set of all true subpixels in one mask is a strict subset of the set of all true subpixels in the other mask; therefore the ∩ of the two sets is equivalent to the set with the least elements.

Theorem 2: The Łukasiewicz TN (∩L) represents the most pessimal ∩ possible.

Proof: For any two sets of true subpixels A and B such that |A| + |B| = N, the most pessimistic scenario occurs when A and B are mutually exclusive (i.e. ∩ = 0). Decreasing the number of true subpixels in either A or B still results in ∩ = 0. Increasing either input (to, say, Â and B̂) will result in a necessary overlap equal to the number of extra true subpixels introduced from either set, i.e. (Â − A) + (B̂ − B) = Â + B̂ − A − B = Â + B̂ − 1, with all quantities expressed as fractions of N, so that A + B = 1.

Therefore, these TNs represent theoretical bounds for fuzzy pixels. Any validation which was obtained via ∩ outputs outside these theoretical bounds should be considered theoretically infeasible, and therefore unreliable. We immediately note that the traditional thresholding approach and the Chang approach can lead to unreliable validations (e.g. for SG and GS fuzzy inputs of 0.6, both methods lead to an out-of-bounds ∩ output of 1).

2.3 Boundary Pixel Validation – The Case for a Directed TN (∩D)

Pixels at object boundaries commonly exhibit PVE, which manifests in their corresponding SG masks as fuzziness. Such mask pixels can be thought of as homogeneous fuzzy pixels (see Fig. 1) with a particular orientation. At its simplest, we can portray such a pixel as divided by a straight line parallel to the object boundary at that point, splitting it into foreground and background regions. We define as the fuzzy orientation for that pixel the outward direction perpendicular to this straight line (in other words, the negative ideal local image-gradient). It is easy to see that optimal overlap between an SG and GS mask pixel occurs when their corresponding orientations are congruent; similarly, pessimal overlap occurs when they are completely incongruent. In other words, ∩G and ∩L in the context of boundary pixels correspond to absolute angle differences in orientation of 0° and 180° respectively. Figure 2 demonstrates this visually for the particular case of a 2D square pixel. It should be clear that for any absolute orientation angle difference between 0° and 180°, there exists a suitable ∩ operator which returns a value between the most optimal (i.e. ∩G) and most pessimal (i.e. ∩L) value, and which decreases monotonically between these two limits as this absolute angle difference increases within that range. Furthermore, as Fig. 2 suggests, for a particular pixel of known shape and dimensionality, this can be calculated exactly in an analytical fashion. We define such an operator as a Directed TN (∩D), and its dual as a Directed TC (∪D), and distinguish between generalised and exact versions as above.

We can now define suitable fuzzy validation operators, in a similar fashion to Crum et al. [8], by substituting binary ∩ and ∪ operations with fuzzy ones in the definitions of the validation operators used; here, by way of example, we will focus on the Tanimoto Coefficient, defined as Tanimoto(SG, GS) = |SG ∩ GS| / |SG ∪ GS|; however, any validation operator which can be defined in terms of set operations can be extended to fuzzy set theory this way.
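The substitution described above can be sketched as a Tanimoto function that accepts any TN/TC pair. This is a minimal illustration under stated assumptions: the cardinality of a fuzzy set is taken to be the sum of its pixel values, the function names and the toy masks are hypothetical, and the binary baseline thresholds with `>= 0.5`.

```python
import numpy as np

def fuzzy_tanimoto(sg, gs, t_norm, t_conorm):
    """Tanimoto coefficient with pluggable fuzzy intersection and union.

    Cardinality is assumed to be the sum of pixel values. If the union is
    empty, 1.0 is returned (an assumption made here for illustration).
    """
    sg = np.asarray(sg, dtype=float)
    gs = np.asarray(gs, dtype=float)
    intersection = float(np.sum(t_norm(sg, gs)))
    union = float(np.sum(t_conorm(sg, gs)))
    return intersection / union if union > 0 else 1.0

# Toy fuzzy masks (illustrative values only).
sg = np.array([[0.0, 0.5, 1.0],
               [0.0, 0.6, 1.0]])
gs = np.array([[0.0, 0.7, 1.0],
               [0.2, 0.6, 1.0]])

# Gödel pair (min / max), i.e. the operators attributed to Crum et al. above.
godel = fuzzy_tanimoto(sg, gs, np.minimum, np.maximum)

# Traditional approach: threshold first, then use the same formula on binary masks.
thresholded = fuzzy_tanimoto(sg >= 0.5, gs >= 0.5, np.minimum, np.maximum)

print(godel, thresholded)
```

In this toy case thresholding reports perfect agreement, while the Gödel pair, which is the most optimistic operator the preceding section allows, reports a lower value; this is the kind of out-of-bounds result the text warns about.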

3 Methods and Results

3.1 Exact and Generalised Directed TNs

An Exact Directed T-Norm: As mentioned above, a ∩D can be calculated exactly, if we assume the exact shape/dimensionality of a particular pixel to be known and relevant to the problem. In other words, given a fuzzy value and orientation, the shape of the pixel dictates the exact manner in which tissue is distributed within it. An ∩ operation between an SG and a GS pixel, whose fuzzy values and orientations are both known, can therefore be calculated exactly in an analytical manner, but the result is specific to that particular pixel shape. Figure 3 shows the profile of such an Exact ∩D for the specific case of a 2D pixel and a range of fuzzy inputs and angle differences (the algorithmic steps for this particular Exact ∩D calculation are beyond the scope of this paper, but an Octave/Matlab implementation is available on request).

A Generalised Directed T-Norm: Clearly, while an exact ∩D should be more accurate, this dependence on the exact pixel shape adds an extra layer of complexity, which may be undesirable. The biggest benefit of the ∩D is its property of outputting a suitable value between the theoretical limits imposed by ∩L and ∩G; therefore, any function that adheres to this description should be able to provide most of the benefits afforded by an Exact ∩D, regardless of pixel shape/dimensions, and potentially with very little extra computational overhead compared to standard TNs. For the purposes of this paper, we chose a sinusoidal function:

A ∩D B = ((1 + cos θ) / 2) G + ((1 − cos θ) / 2) L,

where G = A ∩G B, L = A ∩L B, and θ signifies the discrepancy angle between SG and GS front orientations. Our particular choice of function here aimed to provide a good fit to the Exact version specified above (see Fig. 3), while also being generalisable to pixels of higher dimensions. However, we reiterate that this is simply one of many valid formulations, chosen as proof of concept; in theory, any valid formulation (i.e. monotonically decreasing for increasing discrepancy in angle) should prove more accurate than ∩L or ∩G alone, the latter already being considered state of the art.
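The sinusoidal formulation above can be sketched in a few lines. The function names are hypothetical and θ is taken in radians. The dual `directed_or` is an assumption: it is obtained here by applying De Morgan's Law with the same θ, which the text implies by calling ∪D the dual of ∩D but does not spell out.

```python
import numpy as np

def directed_and(a, b, theta):
    """Generalised Directed T-norm: blend of Gödel (G) and Łukasiewicz (L).

    theta is the orientation discrepancy in radians; 0 gives G, pi gives L.
    """
    g = np.minimum(a, b)
    l = np.maximum(0.0, np.add(a, b) - 1.0)
    weight = (1.0 + np.cos(theta)) / 2.0
    return weight * g + (1.0 - weight) * l

def directed_or(a, b, theta):
    """Assumed dual via De Morgan's Law: 1 - ((1 - A) ∩D (1 - B))."""
    return 1.0 - directed_and(1.0 - np.asarray(a, dtype=float),
                              1.0 - np.asarray(b, dtype=float),
                              theta)

gs, sg = 0.7, 0.5  # the coverages used in Fig. 2
for degrees in (0, 90, 180):
    theta = np.deg2rad(degrees)
    print(degrees, directed_and(gs, sg, theta), directed_or(gs, sg, theta))
```

For coverages of 0.7 and 0.5 the sketch gives roughly 0.5, 0.35 and 0.2 at 0°, 90° and 180°, which agree with the same, perpendicular and opposite orientation cases in the caption of Fig. 2. With this choice of dual, intersection and union always sum to A + B, as they would for crisp sets.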

[Figure 3: grid of subplots, one for each combination of SG value and GS value (0, 0.25, 0.5, 0.75 and 1); each x-axis spans −180 to 180 and each y-axis spans 0 to 1.]

Fig. 3. The Exact (solid line) and Generalised (dashed line) ∩D described in Sect. 3.1, for a range of fuzzy inputs. The x-axis of each subplot corresponds to the angle difference between the SG and GS pixels, and the y-axis to the corresponding fuzzy intersection output.

Fig. 4. Fused masks from the petal and clinical sets used. Top row: GTv (high-resolution) sets; bottom row: fuzzy (low-resolution) sets. SG shown as violet, GS as green throughout; colour-fusion (i.e. intersection) produces white colour.

3.2 Demonstration on Synthetic and Clinical Sets

Synthetic example: The synthetic ‘Petal’ set introduced in Crum et al. [8] was replicated, to obtain a ‘high resolution’ binary image (100 × 100), like the one shown in Fig. 4. The Ground Truth validation (GTv), i.e. the latent truth, was obtained by rotating one copy of the petal image (acting as the SG mask) onto another, stationary petal image acting as the GS mask, and calculating a normal Tanimoto coefficient via the ∩ and ∪ of the two masks, at various angles of rotation. At each rotation angle, fuzzy SG and GS masks (25 × 25 resolution) were also produced from their corresponding GT masks, using simple 4 × 4 block-averaging (i.e., each 4 × 4 block in the high-resolution masks became a single pixel in the fuzzy, low-resolution masks). For each angle, we obtained a validation output for each of the following methods: traditional (i.e. binary validation post thresholding), Yi [6], Chang [7], Crum [8] (i.e. the Gödel TN), Łukasiewicz TN, and finally the Exact and Generalised Directed operators. Figure 5a shows the difference between each method and the GTv at each angle.

Clinical example: The STARE (STructured Analysis of the REtina) Project [10] provides a clinical dataset of 20 images of human retinae, freely available online. For each image, it also provides a triplet of binary masks (700 × 605): two manual delineations of retinal blood vessels from two medical experts and one automated method. One of the manual sets was treated as the GS mask, and the other two were treated as SG masks (human rater vs computer algorithm); Fig. 4 shows an example of the automated SG against the GS. Similar to the petal set, fuzzy versions were produced using various degrees of block-averaging, and validated using the same array of methods. Figure 5b shows the validation accuracy and precision profiles of each method over the whole dataset for the case of a human rater (the algorithm-based results were very similar) at 4 × 4 block averaging. Larger blocks resulted in less accurate/precise curves (not shown here), but interrelationships between methods were preserved. Figure 5c compares human rater vs automated method accuracy as assessed by each validation operator.
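The block-averaging step and the comparison against the latent truth can be sketched as follows. This is not the petal experiment: two shifted discs stand in for the petal masks (an assumption made purely for illustration), the function names are hypothetical, and only the thresholded and Gödel validations are computed.

```python
import numpy as np

def block_average(mask, block=4):
    """Downsample a 2D mask by averaging non-overlapping block x block tiles."""
    mask = np.asarray(mask, dtype=float)
    rows, cols = mask.shape
    if rows % block or cols % block:
        raise ValueError("mask dimensions must be multiples of the block size")
    tiles = mask.reshape(rows // block, block, cols // block, block)
    return tiles.mean(axis=(1, 3))

def binary_tanimoto(a, b):
    """Tanimoto coefficient of two boolean masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return float((a & b).sum() / (a | b).sum())

def godel_tanimoto(a, b):
    """Fuzzy Tanimoto coefficient using the Gödel min / max operators."""
    return float(np.minimum(a, b).sum() / np.maximum(a, b).sum())

# Stand-in 'high resolution' masks (100 x 100): two discs, one shifted by 6 pixels.
yy, xx = np.mgrid[0:100, 0:100]
gs_high = (xx - 50) ** 2 + (yy - 50) ** 2 <= 30 ** 2
sg_high = (xx - 56) ** 2 + (yy - 50) ** 2 <= 30 ** 2

# Fuzzy 'low resolution' masks (25 x 25) via 4 x 4 block averaging.
gs_low = block_average(gs_high, 4)
sg_low = block_average(sg_high, 4)

gtv = binary_tanimoto(sg_high, gs_high)                 # latent truth
fuzzy = godel_tanimoto(sg_low, gs_low)                  # Gödel validation
thresholded = binary_tanimoto(sg_low >= 0.5, gs_low >= 0.5)

print(gtv, fuzzy, thresholded)
```

Because the Gödel intersection is the largest overlap a fuzzy pixel pair can have, and the Gödel union the smallest, the Gödel validation of block-averaged masks can never fall below the latent truth; the thresholded value carries no such guarantee in either direction.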

[Figure 5: three panels. (a) Difference from True Tanimoto against Rotation (Degrees). (b) Frequency density against Difference from True Tanimoto coefficient. (c) Human vs Auto (Tanimoto diff) against Experiment No. Legend entries include True, Goedel, Threshold, Exact and Generalised, the last two with unprocessed and optimal Gradient variants.]

Fig. 5. (a) Difference between the GTv and our proposed methods, the standard 0.5 threshold approach, and state of the art (Crum), plotted against different angles of rotation for the rotating petal set. (b) Distribution of the differences between fuzzy and latent validations over 20 retinal sets (human rater only), represented as gaussian curves. The insets are ‘zoomed-out’ views, showing in gray the remaining methods (∩L , Yi and Chang) which were far less accurate to depict at this scale. (c) Difference between human rater and computer algorithm SG validation for the above methods.

4 Discussion and Conclusions

There are several interesting points to note with regard to the datasets and TNs:

1. Pixels at the object boundary are generally more likely to be fuzzy than pixels at the core (indeed, true core pixels should be deterministic); for segmentations of large organ structures, one generally expects to see a relatively high overall reported validation value, even with less sophisticated segmentation algorithms, due to the disproportionately large contribution of core pixels. More importantly, one might also expect less variability in validation values for the same reason, even between algorithms of variable quality. However, a segmentation algorithm’s true quality/superiority compared to other algorithms generally boils down to its superior performance on exactly such boundary pixels; it is therefore even more important in such objects that appropriate validation methods are chosen that ensure accurate and precise discriminating ability at boundary pixels as well as core ones. Nevertheless, for demonstration purposes, both the synthetic and clinical sets used in this paper involved relatively thin structures, i.e. a high “boundary-to-core” pixel ratio. This choice was intentional, such that the relationship between fuzziness, choice of validation operator, and validation precision/accuracy could be demonstrated more clearly; this also explains why the GTv itself between the two experts was of a fairly low value (with a mean of the order of 0.6).
2. The Generalised and Exact ∩D seem to perform equally well; oddly enough, the Exact ∩D seems to be slightly less accurate for the clinical dataset, but it is more precise.
3. Both are much more accurate and precise than all the other methods investigated, and even more so when a more appropriate gradient response is used.
4. Decreasing the resolution affects all methods negatively, but ∩D still performs much better.
5. Out of all the methods in the literature, Crum’s approach (i.e. ∩G) is the next most accurate/precise overall; the ∩L operator, while very precise at times of bad overlap, seems to be completely off at times of good overlap.
6. The threshold approach seems very unreliable, at least as demonstrated through these datasets: the theoretical maximum established by ∩G is violated consistently (Fig. 5a); it tends to be inaccurate and imprecise (Fig. 5b); and as a result, its use for comparing segmentation algorithms can lead to false conclusions: e.g. in Fig. 5c, experiment 11, the algorithm is judged to be almost as good as the human rater (validation difference of 1 %), but in fact, the GTv difference is as high as 10 %.
7. The Yi and Chang algorithms also violate this boundary. Furthermore, they both seem to be overpessimistic at times of good overlap, and overoptimistic at times of bad overlap.

Conclusion: Validation is one of the most crucial steps in ensuring the quality and reliability of segmentation algorithms; however, the quality and reliability of validation algorithms themselves has not attracted much attention in the literature, despite the fact that many state-of-the-art segmentation algorithms are already deployed in clinical practice for diagnostic and prognostic purposes. We have shown in this paper that in the presence of non-deterministic sets, conventional validation approaches can lead to conclusions that are inaccurate, imprecise, and theoretically unsound, and we have proposed appropriate alternatives which we have demonstrated to be accurate, precise, and robust in their theoretical underpinnings. Further work is needed to evaluate and understand the advantages of the proposed validation framework on more specific clinical scenarios, including segmentations of large organs and small lesions.

Acknowledgements. TP is supported by the RCUK Digital Economy Programme (grant EP/G036861/1: Oxford Centre for Doctoral Training in Healthcare Innovation). ED acknowledges the BHF intermediate clinical research fellow grant (FS/13/71/30378) and the NIHR BRC. VG is supported by a BBSRC grant (BB/I012117/1), an EPSRC grant (EP/J013250/1) and by BHF New Horizon Grant NH/13/30238.

References

1. Udupa, J.K., et al.: A framework for evaluating image segmentation algorithms. Comput. Med. Imaging Graph. 30, 75–87 (2006)
2. Zhang, Y.: A survey on evaluation methods for image segmentation. Pattern Recogn. 29, 1335–1346 (1996)
3. Weisenfeld, N.I., Warfield, S.K.: SoftSTAPLE: Truth and performance-level estimation from probabilistic segmentations. In: IEEE International Symposium on Biomedical Imaging, pp. 441–446 (2011)
4. Ballester, M.A.G., Zisserman, A.P., Brady, M.: Estimation of the partial volume effect in MRI. Med. Image Anal. 6, 389–405 (2002)
5. Deza, M.M., Deza, E.: Encyclopedia of Distances, 2nd edn. Springer, Heidelberg (2013)
6. Yi, Z., Criminisi, A., Shotton, J., Blake, A.: Discriminative, semantic segmentation of brain tissue in MR images. In: Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (eds.) MICCAI 2009. LNCS, vol. 5762, pp. 558–565. Springer, Heidelberg (2009). doi:10.1007/978-3-642-04271-3_68
7. Chang, H.H., et al.: Performance measure characterization for evaluating neuroimage segmentation algorithms. NeuroImage 47, 122–135 (2009)
8. Crum, W.R., Camara, O., Hill, D.L.: Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Trans. Med. Imaging 25, 1451–1461 (2006)
9. Bloch, I., Maitre, H.: Fuzzy mathematical morphologies: a comparative study. Pattern Recogn. 28, 1341–1387 (1995)
10. Hoover, A., Kouznetsova, V., Goldbaum, M.: Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Med. Imaging 19, 203–210 (2000)

High-Throughput Glomeruli Analysis of µCT Kidney Images Using Tree Priors and Scalable Sparse Computation

Carlos Correa Shokiche¹,²(✉), Philipp Baumann³, Ruslan Hlushchuk², Valentin Djonov², and Mauricio Reyes¹(✉)

¹ Institute for Surgical Technology and Biomechanics, University of Bern, Bern, Switzerland
{carlos.correa,mauricio.reyes}@istb.unibe.ch
² Institute of Anatomy, University of Bern, Bern, Switzerland
³ Department of Business Administration, University of Bern, Bern, Switzerland

Abstract. Kidney-related diseases have increasingly become a major cause of death. Glomeruli are the physiological units in the kidney responsible for blood filtration. Therefore, their statistics, including number and volume, directly describe the efficiency and health state of the kidney. Stereology, the current quantification method, relies on histological sectioning, sampling and further 2D analysis, and is laborious and sample destructive. New micro-Computed Tomography (µCT) imaging protocols resolve structures down to capillary level. However, large-scale glomeruli analysis remains challenging due to object identifiability, allotted memory resources and computational time. We present a methodology for high-throughput glomeruli analysis that incorporates physiological a priori information relating the kidney vasculature to estimates of glomeruli counts. We propose an effective sampling strategy that exploits scalable sparse segmentation of kidney regions for refined estimates of both glomeruli count and volume. We evaluated the proposed approach on a database of µCT datasets, yielding a segmentation accuracy comparable to that of an exhaustive supervised learning method. Furthermore, we show the ability of the proposed sampling strategy to produce improved estimates of glomeruli counts and volume without requiring an exhaustive segmentation of the µCT image. This approach can potentially be applied to analogous organizations, such as the quantification of alveoli in lungs.

1 Introduction

Kidney-related diseases have increasingly become an important public health issue worldwide. According to the International Federation of Kidney Foundations, chronic kidney disease is an important cause of death. Yet the underpinning mechanisms are still poorly understood. In general, the kidney is the organ responsible for urine production through filtration units called glomeruli. The statistics associated with glomeruli are crucial since they are in direct relation with the state, efficiency and filtration power of the kidney. However, the current standard for quantification of these structures relies on stereology, which involves a very time-consuming and costly manual analysis of each histological section, and therefore is not carried out by most laboratories. In recent years, due to further development of micro-Computed Tomography (µCT), it has become possible to resolve structures down to capillary level. Supervised machine learning algorithms have been used to automate the quantification process. Rempfler [11] and Schneider [12] performed image segmentation based on supervised random forest (RF) with vessel completion under physiological constraints for brain networks. However, the segmentation task in general remains challenging due to large-scale datasets with scarce labelled data for model training, as well as a poor distinction between glomeruli and other structures such as capillaries.

We propose an efficient approach for large-scale glomeruli analysis of µCT kidney images exploiting physiological information. The key idea is to model the relationship between the kidney vasculature topology and the glomeruli counts. This relationship is based on the allometry (i.e. proportions) between parent and child vessel radii along the vascular tree, allowing us to derive bounds on the glomeruli count. These bounds serve as the initialisation of an iterative sampling strategy that incrementally updates estimates of the glomeruli number and their total volume. The update step proceeds on selected regions with semi-supervised segmentation, which relies on sparse-reduced computation [1], suitable for high-throughput data. We report results on a database of µCT kidney images, comparing it to an exhaustive RF-based segmentation method. We demonstrate the ability of the sparse sampling strategy to provide accurate estimates of glomeruli counts and volume at different levels of image coverage.

© Springer International Publishing AG 2016
S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 370–378, 2016.
DOI: 10.1007/978-3-319-46723-8_43

2 Methods

We start with a brief description of the kidney morphology, followed by descriptions of the main steps depicted in Fig. 1.

Biological landmarks. The overall kidney structure consists of two large regions: medulla and cortex. Glomeruli are uniformly distributed only within the latter [3]. The natural boundary between those two regions is roughly delineated by the vascular tree. In turn, the kidney vasculature consists of two trees (i.e. arterial and venous tree) connected in series, where the joining points correspond to the glomeruli. For glomeruli analysis, the arterial tree is of interest [3].

Preprocessing: Vasculature tree separation. As only a coarse description of the vasculature tree is necessary, simple thresholding and connected component analysis were adopted in this study, and proved an effective and practical solution. We observed that downsampling the image (by a factor of 8 per axis) does not affect the main topological features, since we are interested only in the split proportions between the main large branches, while making the computation efficient.


[Fig. 1, panel labels: Volume; Preprocessing: tree extraction; Vascular tree parametrization; Glomeruli total number bounds (Nlow …); Region sampling and segmentation; update.]
Ai > AK > 0).

Determining number of sampled regions for segmentation. Let us consider a fixed known partition of $B$ regular regions, from which $b$ sample regions are drawn with equal inclusion probability [14]. To determine the number of sample regions, one can derive the required quantity from the bounded probability of an unbiased estimator $\hat{N}$ for the total number of events $N$:

$$P\left(\frac{|\hat{N} - N|}{\sqrt{\mathbb{V}[\hat{N}]}} > z_{\alpha/2}\right) = \alpha,$$

where $(\hat{N} - N)/\sqrt{\mathbb{V}[\hat{N}]} \sim \mathcal{N}(0, 1)$ for a desired significance level $\alpha$ ($\alpha = 5\,\%$, as customary for statistical hypothesis testing). One specifies an absolute error $d$ on the total number estimator (i.e. $z_{\alpha/2}\sqrt{\mathbb{V}[\hat{N}]} \le d$), and then solving for $b$ gives

$$b = \frac{1}{1/b_0 + 1/B}, \quad \text{with} \quad b_0 = \sigma^2 \left(\frac{B\, z_{\alpha/2}}{d}\right)^2. \tag{3}$$

Here $\mathbb{V}[\cdot]$ denotes the variance operator, the hat notation ($\hat{\cdot}$) refers to estimators and $\sigma^2$ is the population variance of the total number, which can be obtained from previous studies or estimated from training data.

Scalable glomerular segmentation using sparse computation. Knowing the number of sampled regions $b$ (Eq. (3)), uniform regions are extracted and segmented. In order to address the issue of scalability and the use of unlabelled data in large image volumes, we draw upon sparse-reduced computation (SRC) for efficient graph-partitioning introduced by Baumann et al. [1]. Sparse-reduced computation creates a compact graph representation of the data with minimal loss of relevant information. This is achieved by efficiently projecting the feature vectors onto a low-dimensional space using a sampling variant of principal component analysis. The low-dimensional space is then partitioned into grid blocks, and feature vectors that fall into the same or neighbouring grid blocks are replaced by representatives. These representatives are computed as the center of mass of the feature vectors they represent. The graph is then constructed on the representatives rather than on the feature vectors, which significantly reduces its size. The segmentation for the representatives is obtained by applying Hochbaum's normalized cut (HNC) [7], which can be solved in polynomial time, whereas the normalized cut problem itself is NP-hard.
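The grid-and-representatives step of sparse-reduced computation described above can be illustrated with a minimal sketch. It assumes that the feature vectors have already been projected to a low-dimensional space and scaled to [0, 1) per axis, so the projection step is omitted. The function name `grid_representatives` and the choice of one representative per non-empty block are illustrative assumptions, not the implementation of [1].

```python
from collections import defaultdict


def grid_representatives(points, resolution):
    """Group low-dimensional points into grid blocks and return one
    centre-of-mass representative per non-empty block.

    points: list of equal-length tuples with coordinates in [0, 1).
    resolution: number of divisions per axis.
    """
    blocks = defaultdict(list)
    for index, point in enumerate(points):
        # Block index along each axis; clamp so a coordinate of 1.0 stays in range.
        key = tuple(min(int(c * resolution), resolution - 1) for c in point)
        blocks[key].append(index)

    representatives = []
    assignment = [None] * len(points)
    for rep_id, (key, members) in enumerate(sorted(blocks.items())):
        dim = len(points[members[0]])
        # Centre of mass of all feature vectors that fell into this block.
        centre = tuple(
            sum(points[i][d] for i in members) / len(members) for d in range(dim)
        )
        representatives.append(centre)
        for i in members:
            assignment[i] = rep_id
    return representatives, assignment


# Hypothetical 2-D projected features for four voxels.
points = [(0.05, 0.10), (0.15, 0.12), (0.90, 0.95), (0.92, 0.97)]
reps, assignment = grid_representatives(points, resolution=5)
```

Once the representatives have been labelled (for example by a graph cut on the reduced graph), the labels can be passed back to the voxels through `assignment`, e.g. `[rep_labels[a] for a in assignment]` for a hypothetical list `rep_labels`.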


Iterative count refinement: unbiased estimators for total number and total volume of glomeruli. We can iteratively refine our estimates by including more segmented subvolumes $b$ as a function of a target confidence of the glomeruli counts and volume estimates. To this end, in this study we adopted a bootstrapping replication scheme [6]. Based on unbiased estimators for totals with equal inclusion probabilities with uniform random sampling without replacement [14], we obtain for each $j$-th replication

$$\hat{N}_j = B\,\bar{y}_{bj} = \frac{B}{b}\sum_{i=1}^{b} y_{ij} \quad \text{and} \quad \hat{V}_j = B\,\bar{v}_{bj} = \frac{B}{b}\sum_{i=1}^{b} v_{ij}, \quad \forall j \in J \tag{4}$$

$$\hat{\mathbb{V}}[\hat{N}_j] = B^2\,\hat{\mathbb{V}}[\bar{y}_{bj}] = B(B - b)\,\frac{s^2_{\hat{N}_j}}{b} \quad \text{and} \quad \hat{\mathbb{V}}[\hat{V}_j] = B^2\,\hat{\mathbb{V}}[\bar{v}_{bj}] = B(B - b)\,\frac{s^2_{\hat{V}_j}}{b},$$

where $\hat{N}_j$ and $\hat{V}_j$ are unbiased estimators for total glomerular number and volume, respectively. $y_i$ and $v_i$ are the actual glomeruli counts and the segmented volume, both on the $i$-th segmented region. Empirical bootstrap distributions are generated from $J$ estimate replications, which are drawn with replacement, and hence confidence intervals are constructed for the mean number $\hat{\bar{N}}$ and mean total volume $\hat{\bar{V}}$ estimators, in order to assess their quality in terms of their statistical consistency: unbiasedness and low variability.
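The sampling design of Eqs. (3) and (4) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names `required_regions` and `estimate_total` are hypothetical, `sigma2` is taken to be the variance of the per-region counts, the result of Eq. (3) is rounded up to an integer, and the bootstrap replication over $j$ is left out.

```python
import math
import statistics
from statistics import NormalDist


def required_regions(total_regions, sigma2, abs_error, alpha=0.05):
    """Number of regions b to sample for an absolute error d, following Eq. (3)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    b0 = sigma2 * (total_regions * z / abs_error) ** 2
    b = 1.0 / (1.0 / b0 + 1.0 / total_regions)
    return math.ceil(b)  # rounding up is an assumption of this sketch


def estimate_total(sample_values, total_regions):
    """Expansion estimator of a total and its estimated variance, following Eq. (4)."""
    b = len(sample_values)
    total = total_regions * statistics.fmean(sample_values)
    s2 = statistics.variance(sample_values)  # sample variance (b - 1 denominator)
    variance = total_regions * (total_regions - b) * s2 / b
    return total, variance


# Hypothetical glomeruli counts observed in b = 5 of B = 100 regions.
counts = [12, 9, 15, 11, 13]
total, variance = estimate_total(counts, total_regions=100)
b = required_regions(total_regions=100, sigma2=5.0, abs_error=100.0)
```

The same `estimate_total` call applies to segmented volumes per region to obtain the total volume estimate. The factor `(total_regions - b)` is the finite-population correction, so the estimated variance drops to zero when every region is segmented.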

3 Results

Data source and experiments. Our database consisted of 9 right kidneys belonging to two groups: 5 healthy mice and 4 GDNF+/− mice, which are genetically modified and known to have about 30 % fewer glomeruli and smaller kidney volume [4,5]. The volume size ranges from 1 K × 2.5 K × 4 K up to 3 K × 4 K × 6 K voxels, which corresponds to 30–80 GB. The isotropic size of a voxel is 2.6 µm. We performed two experiments: (1) evaluation of the sparse-reduced semi-supervised segmentation, and (2) evaluation of the iterative sampling scheme. In experiment (1) we generated ground-truth by manually annotating a reduced region of interest of size 512 × 512 × 141 and compared the sparse-reduced semi-supervised segmentation to an exhaustive RF segmentation.

Experiment 1. We evaluated the performance of sparse-reduced computation with different grid resolutions. Figure 2 illustrates the process of sparse-reduced computation for a manually segmented region which comprises 36 million voxels. Rather than constructing a complete graph with 36 million nodes, the sparse-reduced computation approach constructs a much smaller graph by consolidating highly-similar voxels into a small number of representatives. The reduced graph contains a node for each representative. The segmentation for the representatives is obtained by applying Hochbaum's normalized cut (HNC) [7] to the reduced graph formed by edges connecting only representatives. The labels that are assigned to the representatives are passed on to each of the voxels that they represent.

Fig. 2. Sparse-reduced computation for semi-supervised segmentation. Left: Projection of feature vectors onto a 3-dimensional space. Right: Partitioning of 3-dimensional space into blocks and generation of representatives. Similarities are only computed between representatives of the same and neighboring blocks. This provides a significant reduction in computational complexity.

In this study, we used standard first and second order statistics of intensity information as features (i.e. mean, quantiles, entropy, gradient, etc.). We compared the sparse-reduced computation approach to an exhaustive RF-based segmentation method [2] with 100 trees, depth = 18, and the same set of 19 features. In order to simplify the evaluations, no attempt was made to regularise the segmentation result. In Table 1 we report the overall performance for different grid resolutions in terms of accuracy, precision, recall, and F1 score. The grid resolution determines the number of divisions of each axis (e.g. 5 divisions per axis, cf. Fig. 2, right side), and hence it controls the total number of grid blocks. The results were compared with the exhaustive RF segmentation method. From Table 1 it is observed that higher grid resolutions yield better segmentation results, most notably in accuracy and F1 score, up to a saturation point between grid size 10 and 15. In terms of computational load, the increase is negligible as the number of representatives remains low for all resolutions. In comparison to the exhaustive RF-based segmentation, the results are competitive and attractive in scenarios where labelled data and memory resources are limited.

Based on the results from this first experiment, we regarded the segmentation from the exhaustive RF-based method as a silver ground-truth to evaluate in experiment 2 the reliability of the iterative sampling scheme. The RF segmentation was run in full on the 9 datasets and then visually inspected as a sanity check. Minimal manual corrections were needed.


Table 1. Segmentation performance of sparse-reduced computation at different grid resolutions, and RF-based segmentation. Best results of SRC are in bold.

| Experiment                  | Accuracy   | Precision | Recall    | F1 score  |
|-----------------------------|------------|-----------|-----------|-----------|
| Sparse - grid resolution 5  | 0.9829     | **0.736** | 0.284     | 0.41      |
| Sparse - grid resolution 10 | 0.9858     | 0.618     | **0.843** | **0.713** |
| Sparse - grid resolution 15 | **0.9864** | 0.661     | 0.719     | 0.689     |
| Sparse - grid resolution 20 | 0.985      | 0.664     | 0.576     | 0.617     |
| Exhaustive random forest    | 0.99       | 0.761     | 0.793     | 0.777     |
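As a reading aid for Table 1, the F1 score is the harmonic mean of precision and recall, so the last column can be recomputed from the two preceding ones. The short sketch below does this for the reported values; the dictionary name `table1` and the helper `f1_score` are illustrative names, and the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in Table 1.
table1 = {
    "Sparse - grid resolution 5": (0.736, 0.284, 0.41),
    "Sparse - grid resolution 10": (0.618, 0.843, 0.713),
    "Sparse - grid resolution 15": (0.661, 0.719, 0.689),
    "Sparse - grid resolution 20": (0.664, 0.576, 0.617),
    "Exhaustive random forest": (0.761, 0.793, 0.777),
}

for name, (precision, recall, reported) in table1.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f} (reported {reported})")
```

The low F1 score at grid resolution 5 comes from its low recall, even though its precision is the highest among the sparse configurations.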

Experiment 2. In this experiment we evaluate the unbiasedness (consistency) and efficiency of both the glomeruli count and total volume estimators for different numbers of segmented sampled regions, expressed as kidney volume coverage from 5 % up to 80 %. We considered J = 10 bootstrap replications in order to derive bootstrap empirical distributions and construct bootstrap confidence intervals for the mean estimators. Figure 3(a)–(b) shows the decrease in the total volume and count errors as more volume is covered, with values ranging from 6 % down to 1 % for both volume and count error. This is consistent with the unbiasedness property of the estimators. Furthermore, it also shows an increase in efficiency, reflected in the reduction of the estimator variance as a larger volume is covered.

Fig. 3. Iterative sampling scheme: performance of the mean total count and mean total volume estimators as a function of volume coverage for the complete 9 datasets (a)–(b), and mean total counts per group (c): healthy and genetically modified GDNF+/−. Panels: (a) error on glomeruli total volume (%), (b) error on glomeruli total count (%), (c) glomeruli total counts per group. The central solid line corresponds to the mean estimation. Confidence intervals at levels α = 0.05 and α = 0.2 are shown as shaded areas.

Figure 3(c) shows that the mean total count estimate separates the healthy and the genetically modified GDNF+/− groups. We recover the nearly 30 % mean difference in total glomeruli number reported in the literature [4,5]. Note that the bootstrap confidence intervals for the mean total count estimator do not overlap between groups, which suggests that the statistical efficiency (i.e. low variance) of the estimator allows researchers to discriminate groups in studies involving GDNF+/− subjects as a disease model.

4 Discussion and Conclusion

In this study we have presented a fast and efficient iterative sampling strategy to quantify glomeruli in large µCT kidney images. In contrast to previous approaches, we combine estimators of volume and counts with a scalable and computationally efficient semi-supervised segmentation approach. The proposed pipeline exploits physiological relations between kidney vasculature and glomeruli counts and volume, and yields fast, statistically efficient and accurate estimators of both. The iterative nature of the approach allows users to define a trade-off between accuracy of the estimates and computational cost, up to a desired level. The sparse-reduced computation is suitable for large image volumes for which annotated data are typically scarce, while yielding results competitive with standard supervised RF-based segmentation approaches. The method features high scalability through an efficient computation of similarities among representatives. The proposed approach can be extended to high-throughput analysis of structures in large-scale images. It can also potentially be applied to analogous organizations, such as the quantification of alveoli in lungs.

Acknowledgements. This work is funded by the Kommission für Technologie und Innovation (KTI) Project No. 14055.1 PFIW-IW.

References

1. Baumann, P., et al.: Sparse-reduced computation - enabling mining of massively-large data sets. In: Proceedings of ICPRAM 2016, pp. 224–231 (2016)
2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
3. Bruce, M., et al.: Berne and Levy Physiology, 6th edn. Elsevier (2010)
4. Cullen-McEwen, L., Drago, J., Bertram, J.: Nephron endowment in glial cell line-derived neurotrophic factor (GDNF) heterozygous mice. Kidney Int. 60(1), 31–36 (2001)
5. Cullen-McEwen, L., et al.: Nephron number, renal function, and arterial pressure in aged GDNF heterozygous mice. Hypertension 41(2), 335–340 (2003)
6. Davison, A., Hinkley, D.: Bootstrap Methods and their Applications. Cambridge University Press, Cambridge (1997)
7. Hochbaum, D.: Polynomial time algorithms for ratio regions and a variant of normalized cut. IEEE Trans. Pattern Anal. Mach. Intell. 32, 889–898 (2010)
8. Kerschnitzki, M., et al.: Architecture of the osteocyte network correlates with bone material quality. J. Bone Miner. Res. 28(8), 1837–1845 (2013)
9. Murray, C.: The physiological principle of minimum work: II. Oxygen exchange in capillaries. Proc. Natl. Acad. Sci. U.S.A. 12(5), 299–304 (1926)
10. Murray, C.: The physiological principle of minimum work: I. The vascular system and the cost of blood volume. Proc. Natl. Acad. Sci. U.S.A. 12(3), 207–214 (1926)
11. Rempfler, M., et al.: Reconstructing cerebrovascular networks under local physiological constraints by integer programming. Med. Image Anal. 25(1), 86–94 (2015). Special Issue on MICCAI 2014
12. Schneider, M., et al.: Tissue metabolism driven arterial tree generation. Med. Image Anal. 16(7), 1397–1414 (2012). Special Issue on MICCAI 2011
13. Sherman, T.: On connecting large vessels to small. The meaning of Murray's law. J. Gen. Physiol. 78(4), 431–453 (1981)
14. Thompson, S.: Sampling. Wiley Series in Probability and Statistics. Wiley (2002)

A Surface Patch-Based Segmentation Method for Hippocampal Subfields

Benoit Caldairou1(✉), Boris C. Bernhardt1, Jessie Kulaga-Yoskovitz1, Hosung Kim2,1, Neda Bernasconi1, and Andrea Bernasconi1

1 NeuroImaging of Epilepsy Laboratory, Montreal Neurological Institute, McGill University, Montreal, QC, Canada
[email protected]
2 UCSF School of Medicine, San Francisco, CA, USA

Abstract. Several neurological disorders are associated with hippocampal pathology. As changes may be localized to specific subfields or span different subfields, accurate subfield segmentation may improve non-invasive diagnostics. We propose an automated subfield segmentation procedure, which combines surface-based processing with a patch-based template library and feature matching. Validation experiments in 25 healthy individuals showed high segmentation accuracy (Dice >82 % across all subfields) and robustness to variations in the template library size. Applying the algorithm to a cohort of patients with temporal lobe epilepsy and hippocampal sclerosis, we correctly lateralized the seizure focus in >90 %. This compares favorably with classifiers relying on volumes retrieved from other state-of-the-art algorithms.

Keywords: Multi-template · Segmentation · SPHARM · Mesiotemporal lobe · Surface-based patch · MRI · Hippocampus · Epilepsy

1 Introduction

The hippocampus plays a key role in cognition and its compromise is a hallmark of several prevalent brain disorders, such as temporal lobe epilepsy (TLE) [1]. With the advent of large-scale neuroimaging databasing and analysis in health and disease, the development of accurate automated segmentation approaches becomes increasingly important. The majority of automated hippocampal segmentation approaches have operated on a global scale. Recent methods rely on a multi-template framework to account for inter-individual anatomical variability. While the majority of previous algorithms employed a purely voxel-based strategy, adopting a surface-based library has shown benefits by improving flexibility to model shape deformations often seen in disease, but also in 10–15 % of healthy subjects [2]. To improve label fusion and image matching, recent studies have adopted patch-based methods that compactly represent shape, anatomy, texture, and intensity [3]. Notably, these approaches have also been successful in non-segmentation tasks, such as image denoising [4] and supersampling [5].

© Springer International Publishing AG 2016 S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 379–387, 2016. DOI: 10.1007/978-3-319-46723-8_44


Developments in MRI hardware have begun to generate images of brain anatomy with unprecedented detail [6], fostering guidelines to manually delineate hippocampal subfields. Few automated methods have been proposed, each relying on multiple templates together with: (i) Bayesian inference and fusion of ex-vivo/in-vivo landmarks [7, 8], (ii) label propagation to intermediate templates [9], (iii) combinations of label fusion (taking inter-template similarity into account) and post-hoc segmentation correction [10]. These methods operate either on anisotropic T2-weighted images [10], or T1-weighted images with standard millimetric [8, 9] or submillimetric resolution [7]. Only one study so far [9] calculated Dice overlaps of manual and automated labels in millimetric T1-weighted images, with modest performance (Dice: 0.56–0.65).

We propose a novel approach for hippocampal subfield segmentation, SurfPatch, which combines multi-template feature matching with deformable parametric surfaces and vertex-wise patch sampling, relying on point-wise correspondence across the template library. Validation was performed using a publicly available 3T dataset of manual segmentations together with high- and standard-resolution MRI data of healthy controls [11]. We also applied SurfPatch to 17 TLE patients with hippocampal atrophy, testing its ability to lateralize the seizure focus, and compared its performance to two other algorithms: ASHS (Automatic Segmentation of Hippocampal Subfields) [10] and FreeSurfer 5.3 [7].

2 Methodology

Figure 1 below summarizes the segmentation steps.

Fig. 1. Flowchart of the proposed algorithm SurfPatch.

For training, SurfPatch builds the mean patch surface $S^{\mu}$ and standard deviation (SD) patch surface $S^{\sigma}$ across the template library (Fig. 1A). For segmentation, it nonlinearly warps each template surface to the test case, re-computes patch features across the warped surface, and normalizes features using surface-based z-scoring (relative to $S^{\mu}$ and $S^{\sigma}$). Based on vertex-wise z-scores, it selects a subset of templates, builds an average surface, and performs a deformation for final segmentation (Fig. 1B).


2.1 Training Step (Fig. 1A)

Subfield labels are converted to surface meshes and parameterized using spherical harmonics and a point distribution model (SPHARM-PDM) [12] that guarantees correspondence of surface points (henceforth, vertices) across subjects (Fig. 1A-1). For a given template $t$, its reconstructed surface $S^t$ is mapped to its corresponding T1-MRI $I^t$. Let $x^k_v$ with $k \in \{1, \dots, 8\}$ be the eight closest voxels of a given vertex $v$, and let $P^t_{x,k}$ be the corresponding local cubical neighborhoods (i.e., patches) centered around these voxels. We build a vertex patch $P^t_v$ by computing a trilinear interpolation of these 8 patches (which is an extension of the trilinear interpolation of the 8 closest voxels). Patches are considered as vectors. By pooling corresponding vertex patches from each template surface, we derive the mean patch $P^{\mu}_v$ and SD patch $P^{\sigma}_v$ at vertex $v$:

$$P^{\mu}_v = \frac{\sum_{t=1}^{N} P^t_v}{N} \quad \text{and} \quad P^{\sigma}_v = \sqrt{\frac{\sum_{t=1}^{N}\left((P^t_v)^2 - (P^{\mu}_v)^2\right)}{N}} \tag{1}$$

where $N$ is the number of templates (Fig. 1A-2).

2.2 Segmentation Step (Fig. 1B)

Registration and Subset Restriction. Each template MRI is nonlinearly registered to the test MRI to increase shape similarity (Fig. 1B-1). We used the ANIMAL non-linear registration tool [13], enhanced with a boundary-based similarity measure [14]. Registration was based on a volume-of-interest that includes the labels of all hippocampi in the template library, plus a margin of 10 voxels in each direction to account for additional shape variability. Applying the registration to the library surface places the warped surface $\hat{S}^t$ on the test MRI. We then re-compute patch features across vertices and compare these patch features $\hat{P}^t_v$ with the template library patch distribution, using vertex-wise z-scoring:

$$F^t_v = \frac{\hat{P}^t_v - P^{\mu}_v}{P^{\sigma}_v} \tag{2}$$

This is an element-wise operation. The absolute deviation from the library can be quantified by summing the squared norm of each patch over all vertices:

$$F^t = \sum_{v=1}^{K} \lVert F^t_v \rVert_2^2 \tag{3}$$

where $K$ is the number of vertices. Figure 1B-2 shows vertex-wise deviation maps. Surfaces in the template library are ranked according to this measure, with smaller scores indicating better fit. To obtain an initial estimation, we performed a successive surface averaging (Fig. 1B-3) defined as:

$$S^k = \frac{\sum_{l=1}^{k} \frac{1}{F^l}\,\hat{S}^l}{\sum_{l=1}^{k} \frac{1}{F^l}} \tag{4}$$

where $l$ corresponds to the ranking index. To sum up, $S^1$ corresponds to the best template, $S^2$ to the weighted surface average of the two best templates and $S^k$ to the weighted surface average of the $k$ best templates. Corresponding deviation scores $F^k$ were computed as in (2) and (3), and the average resulting in the minimal measure is chosen as initialization for a deformable model. This selection of templates has the advantage of automatically adapting to the template library size.

Deformable Model. To further increase segmentation accuracy and to account for potential errors in the preceding steps, we applied a parametric deformable model to the surface average $S$ [15]. The use of an explicit parameterization of the surface ensures vertex-wise correspondence across the library that would be otherwise lost (e.g. when using level-sets [16]). The objective function to minimize is composed of a regularization term, based on mechanical properties of the surface (stretching and bending), and a data term, which is represented by the deviation score $F$. Surface deformation is performed using gradient descent search:

$$\vec{x}_v = \vec{x}_v - \gamma\left\{\left(\alpha S_v^{(2)} + \beta S_v^{(4)}\right) + \Delta F_v\right\} \tag{5}$$

where $\vec{x}_v$ represents the spatial coordinates of vertex $v$, $\gamma$ is the step size controlling the magnitude of the surface's deformation; $\alpha$ and $\beta$ are parameters controlling surface stretching and bending. $S_v^{(2)}$ and $S_v^{(4)}$ are the surface's second and fourth order spatial derivatives, respectively, at vertex $v$. $\Delta F_v$ represents the gradient of the surface's deviation score at vertex $v$. Figure 1B-4 illustrates a final segmentation.
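The template-selection steps of Eqs. (1) to (4) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the SurfPatch implementation: patches are plain lists of intensities already sampled at each vertex, the function names are hypothetical, and the small constant `eps` that protects the division in the z-score is an addition of this sketch. Patch sampling, registration and the deformable model are not covered.

```python
import math


def patch_statistics(library):
    """Vertex-wise mean and SD patches across templates, as in Eq. (1).

    library[t][v] is the patch of template t at vertex v (a list of floats).
    """
    n = len(library)
    mean, sd = [], []
    for v in range(len(library[0])):
        dim = len(library[0][v])
        m = [sum(tpl[v][d] for tpl in library) / n for d in range(dim)]
        # Element-wise mean of squares minus squared mean; clamp rounding errors.
        s = [
            math.sqrt(max(sum(tpl[v][d] ** 2 - m[d] ** 2 for tpl in library) / n, 0.0))
            for d in range(dim)
        ]
        mean.append(m)
        sd.append(s)
    return mean, sd


def deviation_score(patches, mean, sd, eps=1e-8):
    """Eqs. (2)-(3): sum over vertices of the squared norm of z-scored patches."""
    return sum(
        ((p - m) / (s + eps)) ** 2
        for pv, mv, sv in zip(patches, mean, sd)
        for p, m, s in zip(pv, mv, sv)
    )


def successive_averages(surfaces, scores):
    """Eq. (4): inverse-score weighted average of the k best-ranked surfaces.

    surfaces[t][v] is the (x, y, z) position of vertex v on warped template t;
    scores[t] is its strictly positive deviation score.
    Returns the ranking and a list whose entry k-1 is the average S^k.
    """
    ranking = sorted(range(len(scores)), key=lambda t: scores[t])
    averages = []
    for k in range(1, len(ranking) + 1):
        best = ranking[:k]
        weights = [1.0 / scores[t] for t in best]
        total = sum(weights)
        averages.append([
            tuple(
                sum(w * surfaces[t][v][c] for w, t in zip(weights, best)) / total
                for c in range(3)
            )
            for v in range(len(surfaces[0]))
        ])
    return ranking, averages


# Toy library: 2 templates, 1 vertex, 2-element patches.
library = [[[1.0, 2.0]], [[3.0, 2.0]]]
mean, sd = patch_statistics(library)
score = deviation_score([[3.0, 2.0]], mean, sd)

# Toy warped surfaces with hypothetical deviation scores 1.0 and 2.0.
surfaces = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]
ranking, averages = successive_averages(surfaces, [1.0, 2.0])
```

In the method described above, each averaged surface would then be scored again on the test image, and the average with the smallest deviation score would initialize the deformable model; that re-scoring requires image sampling and is outside this sketch.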

Fig. 2. Overall Dice with respect to patch (A) and library size (B). CA1-3 (red), CA4-DG (green), SUB (blue).

3 Experiments and Results

3.1 Material

The training set includes 25 subjects from a public repository (31 ± 7 yrs, 13 females; data available at http://www.nitrc.org/projects/mni-hisub25) of MRI and manually-drawn labels (CA1-3, DG-CA4, SUB; average intra-/inter-rater Dice >90/87 %) [11]. MRI data consist of isotropic T1-weighted millimetric (1 mm³) and submillimetric (0.6 mm³) 3D-MPRAGE and anisotropic 2D T2-weighted TSE (0.4 × 0.4 × 2 mm³). Images underwent automated correction for intensity non-uniformity [17] and intensity standardization. Submillimetric data were resampled to 0.4 mm³ resolution in MNI152 space. The patient cohort consists of 17 TLE patients. MRI post-processing followed the same steps as in [11]. TLE diagnosis and lateralization of the seizure focus were based on a multi-disciplinary evaluation. Hippocampal atrophy was determined as hippocampal volumes beyond 2 SD of the corresponding mean of healthy controls [18].

3.2 Experiments

Parameter optimization and robustness to library size were evaluated on submillimetric T1-weighted images.

Parameter Optimization. Parameters for the active contour were empirically set to α = 100 and β = 100. The step size parameter γ was set to 10⁻⁵. Performance with regard to patch size was evaluated using a leave-one-out (LOO) strategy, based on the Dice overlap index between automated and manual segmentations.

Robustness to Template Library Variations and Image Resolution. For each subject, we randomly decreased the library from the full size (n = 24 in LOO validation) to 1/2 (12), 1/3 (8) and 1/5 (5) of its original size. We repeated this process 5 times. We evaluated performance with smaller template libraries, based on Dice overlaps. We tested whether SurfPatch achieved adequate performance by operating solely on standard 1 mm³ MPRAGE data. In this evaluation, we first linearly upsampled images to 0.4 mm³, followed by the segmentation outlined above. This permitted the use of equivalent patch sizes. In addition to Dice, we computed correlation coefficients between automated and manual volumes, and generated Bland-Altman plots.

TLE Lateralization. Direct Dice overlap comparisons between SurfPatch and both ASHS and FreeSurfer are challenged by the absence of a unified subfield segmentation protocol and by the optimization of different algorithms to different MRI sequences. We thus assessed the clinical utility of the different approaches using a "TLE lateralization challenge" that assessed the accuracy of linear discriminant analysis (LDA) classifiers to lateralize the seizure focus in individual patients based on subfield volumes obtained with SurfPatch compared to those using volumes generated by FreeSurfer 5.3 [7] (freely available at http://freesurfer.net/fswiki/DownloadAndInstall) and ASHS [10] (ASHS and the UPenn PMC atlas are freely available at https://www.nitrc.org/projects/ashs/). We ran both algorithms with their required modalities (FreeSurfer: 1 mm³ T1-weighted; ASHS: 0.4 × 0.4 × 2 mm³ T2-weighted) and default parameters. As both algorithms operate in native space, subfield volumes were corrected for intracranial volume by multiplying them by the Jacobian determinant of the corresponding linear transform to MNI152 space. Cross-validation was performed using a 5-fold scheme, repeated 200 times.

ASHS Evaluation. Given that it includes an atlas building tool, we also trained ASHS using our template library. Inputs are submillimetric T1-weighted and T2-weighted images, resampled to MNI152 space along with the corresponding labels. T1-weighted images are used for registration and T2-weighted images for segmentation.

3.3 Results

Parameter Optimization and Robustness to Template Library Size. Maximum accuracy was achieved with a patch size of 13 × 13 × 13 voxels for CA1-3 (% Dice: 87.43 ± 2.47), 19 × 19 × 19 for CA4-DG (82.71 ± 2.85) and 11 × 11 × 11 for SUB (84.95 ± 2.45) (Fig. 2A). Mean Dice indices remained >80 % for all structures when using only 8 templates (Fig. 2B).

Robustness with Respect to Standard T1-Weighted Images. Segmenting subfields using only standard millimetric T1-weighted images, we obtained accuracy of 85.71 ± 2.48 for CA1-3 (average decrease compared to submillimetric T1-MRI = −1.72 %), 81.10 ± 3.86 for CA4-DG (−1.61 %) and 82.21 ± 3.72 for SUB (−2.75 %). We obtained overall higher correlations between manual and automated volumes for submillimetric (Fig. 3A) than for standard images (Fig. 3B; submillimetric/millimetric CA1-3: 0.73/0.64, CA4-DG: 0.44/0.28, SUB: 0.56/0.63). Bland-Altman plots suggested lower bias in submillimetric than standard images (average shrinkage based on submillimetric/millimetric images for CA1-3: 58/131 mm³ (1.6/3.6 % from average manual volume), CA4-DG: 23/83 mm³ (3.4/8.3 %), SUB: 76/35 mm³ (4.2/1.9 %)). Segmentation examples with SurfPatch are shown in Fig. 4.

Fig. 3. Correlations and Bland-Altman plots between automated and manual volumes (in mm³) in submillimetric (A) and millimetric (B) T1-weighted images.
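The two agreement measures used in this evaluation can be sketched in a few lines. This is a minimal illustration, not the evaluation code of the study: labels are flattened binary sequences, volumes are plain lists of numbers, the function names are hypothetical, and the 1.96 SD limits of agreement are the conventional Bland-Altman choice, which the text does not state explicitly.

```python
import statistics


def dice(labels_a, labels_b):
    """Dice overlap between two binary label sequences of equal length.

    Assumes that at least one of the two segmentations is non-empty.
    """
    size_a = sum(1 for x in labels_a if x)
    size_b = sum(1 for x in labels_b if x)
    overlap = sum(1 for x, y in zip(labels_a, labels_b) if x and y)
    return 2.0 * overlap / (size_a + size_b)


def bland_altman(automated, manual):
    """Mean difference (bias) and 95 % limits of agreement between paired volumes."""
    diffs = [a - m for a, m in zip(automated, manual)]
    bias = statistics.fmean(diffs)
    spread = 1.96 * statistics.stdev(diffs)
    return bias, (bias - spread, bias + spread)


# Hypothetical flattened masks and paired volumes (mm^3).
overlap_index = dice([1, 1, 0, 0], [1, 0, 1, 0])
bias, limits = bland_altman([10.0, 12.0, 14.0], [11.0, 12.0, 16.0])
```

A negative bias in this convention means that automated volumes are on average smaller than manual ones, which corresponds to the shrinkage reported above.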

Fig. 4. Manual delineation and SurfPatch automated segmentations relying on a submillimetric T1-weighted image and a standard T1-weighted image.

TLE Lateralization. Lateralization of the seizure focus in TLE patients was highly accurate when using SurfPatch, based on both submillimetric and millimetric T1-weighted images (>93 %; Table 1). For ASHS and FreeSurfer, we performed two experiments using: (i) single subfields as defined by the anatomical templates and (ii) subfields grouped into CA1-3, DG-CA4 and SUB, as in [11]. Although better results were obtained with the second option, overall performance was lower than with SurfPatch (Table 1).

Table 1. Average accuracy of seizure focus lateralization in TLE.
Columns: Anatomical template; SurfPatch submillimetric; SurfPatch millimetric; ASHS; FreeSurfer.
Rows: Single subfields; Subfields grouped.
Values in extracted order: 92.9 ± 4.7; 93.8 ± 6.0; 75.0 ± 7.8; 78.6 ± 9.5; 81.8 ± 4.1; 86.7 ± 4.5.

ASHS Evaluation. Trained on our library, ASHS achieved performance similar to that of SurfPatch (CA1-3: 87.36 ± 1.97; CA4-DG: 82.54 ± 3.45; SUB: 85.48 ± 2.43).

4 Discussion and Conclusion

SurfPatch is a novel subfield segmentation algorithm combining surface-based processing with patch similarity measures. Its use of a population-based patch normalization relative to a template library has desirable run-time and space complexity properties. Moreover, it operates on T1-weighted images only, the currently preferred anatomical contrast of many big data MRI initiatives, and thus avoids T2-weighted MRI, a modality prone to motion and flow artifacts.

In controls, accuracy was excellent, with Dice overlap indices of >82 % when submillimetric images were used and only marginal performance drops when using millimetric data. Performance remained robust when reducing the size of the template library, an advantageous feature given high demands on expertise/time for the generation of subfield-specific atlases. While Dice indices across studies need to be cautiously interpreted given differences in protocols, our results compare favorably to the literature. Indeed, FreeSurfer achieved 62 % for CA1, 74 % in CA2-3 and 68 % in DG-CA4 when applied to high-resolution T1-MRI [7]. With respect to ASHS, slightly lower Dice indices than for our evaluations have been previously reported [10], particularly for CA (80 %) and SUB (75 %), whereas similarly high performance was achieved for DG (82 %). It is possible that the reliance of ASHS on anisotropic images presents a challenge to cover shape variability in the antero-posterior direction. Notably, ASHS achieved performance similar to that of SurfPatch when trained on our library and dataset.

Although ASHS, FreeSurfer and SurfPatch consistently achieved high lateralization performance, learners based on volume measures derived from the latter lateralized the seizure focus more accurately than the other two. Robust performance on diseased hippocampi may stem from the combination of the patch-based framework, offering intrinsic modeling of multi-scale intensity features, with surface-based feature sampling, which may more flexibly capture shape deformations and displacements seen in this condition.

References 1. Blumcke, I., Thom, M., Aronica, E., Armstrong, D.D., Bartolomei, F., Bernasconi, A., et al.: International consensus classification of hippocampal sclerosis in temporal lobe epilepsy: a task force report from the ILAE commission on diagnostic methods. Epilepsia 54(7), 1315– 1329 (2013) 2. Kim, H., Mansi, T., Bernasconi, N., Bernasconi, A.: Surface-based multi-template automated hippocampal segmentation: application to temporal lobe epilepsy. Med. Image Anal. 16(7), 1445–1455 (2012) 3. Giraud, R., Ta, V.T., Papadakis, N., Manjon, J.V., Collins, D.L., Coupe, P., et al.: An optimized PatchMatch for multi-scale and multi-feature label fusion. Neuroimage 124(Pt A), 770–782 (2016) 4. Buades, A., Coll, B., Morel, J.M.: A review of image denoising algorithms, with a new one. Multiscale Model Sim. 4(2), 490–530 (2005) 5. Manjon, J.V., Coupe, P., Buades, A., Fonov, V., Louis Collins, D., Robles, M.: Non-local MRI upsampling. Med. Image Anal. 14(6), 784–792 (2010) 6. Winterburn, J.L., Pruessner, J.C., Chavez, S., Schira, M.M., Lobaugh, N.J., Voineskos, A.N., et al.: A novel in vivo atlas of human hippocampal subfields using high-resolution 3 T magnetic resonance imaging. Neuroimage 74, 254–265 (2013) 7. Van Leemput, K., Bakkour, A., Benner, T., Wiggins, G., Wald, L.L., Augustinack, J., et al.: Automated segmentation of hippocampal subfields from ultra-high resolution in vivo MRI. Hippocampus 19(6), 549–557 (2009) 8. Iglesias, J.E., Augustinack, J.C., Nguyen, K., Player, C.M., Player, A., Wright, M., et al.: A computational atlas of the hippocampal formation using ex vivo, ultra-high resolution MRI: application to adaptive segmentation of in vivo MRI. Neuroimage 115, 117–137 (2015) 9. Pipitone, J., Park, M.T., Winterburn, J., Lett, T.A., Lerch, J.P., Pruessner, J.C., et al.: Multiatlas segmentation of the whole hippocampus and subfields using multiple automatically generated templates. Neuroimage 101, 494–512 (2014)


10. Yushkevich, P.A., Pluta, J.B., Wang, H., Xie, L., Ding, S.L., Gertje, E.C., et al.: Automated volumetry and regional thickness analysis of hippocampal subfields and medial temporal cortical structures in mild cognitive impairment. Hum. Brain Mapp. 36(1), 258–287 (2015) 11. Kulaga-Yoskovitz, J., Bernhardt, B.C., Hong, S.-J., Mansi, T., Liang, K.E., van der Kouwe, A.J.W., et al.: Multi-contrast submillimetric 3 Tesla hippocampal subfield segmentation protocol and dataset. Sci. Data 2, 150059 (2015) 12. Styner, M., Oguz, I., Xu, S., Brechbuhler, C., Pantazis, D., Levitt, J.J., et al.: Framework for the statistical shape analysis of brain structures using SPHARM-PDM. Insight J. 1071, 242– 250 (2006) 13. Collins, D.L., Holmes, C.J., Peters, T.M., Evans, A.C.: Automatic 3-D model-based neuroanatomical segmentation. Hum. Brain Mapp. 3(3), 190–208 (1995) 14. Greve, D.N., Fischl, B.: Accurate and robust brain image alignment using boundary-based registration. Neuroimage 48(1), 63–72 (2009) 15. Kass, M., Witkin, A., Terzopoulos, D.: Snakes – active contour models. Int. J. Comput. Vision 1(4), 321–331 (1987) 16. Osher, S., Sethian, J.A.: Fronts propagating with curvature-dependent speed - algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79(1), 12–49 (1988) 17. Sled, J.G., Zijdenbos, A.P., Evans, A.C.: A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging 17(1), 87–97 (1998) 18. Bernasconi, N., Bernasconi, A., Caramanos, Z., Antel, S.B., Andermann, F., Arnold, D.L.: Mesial temporal damage in temporal lobe epilepsy: a volumetric MRI study of the hippocampus, amygdala and parahippocampal region. Brain. 126(2), 462–469 (2003)

Automatic Lymph Node Cluster Segmentation Using Holistically-Nested Neural Networks and Structured Optimization in CT Images

Isabella Nogues1(B), Le Lu1, Xiaosong Wang1, Holger Roth1, Gedas Bertasius2, Nathan Lay1, Jianbo Shi2, Yohannes Tsehay1, and Ronald M. Summers1

1 Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD 20892-1182, USA
[email protected]
2 University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract. Lymph node segmentation is an important yet challenging problem in medical image analysis. The presence of enlarged lymph nodes (LNs) signals the onset or progression of a malignant disease or infection. In the thoracoabdominal (TA) body region, neighboring enlarged LNs often spatially collapse into “swollen” lymph node clusters (LNCs) (up to 9 LNs in our dataset). Accurate segmentation of TA LNCs is complicated by the noticeably poor intensity and texture contrast among neighboring LNs and surrounding tissues, and has not been addressed in previous work. This paper presents a novel approach to TA LNC segmentation that combines holistically-nested neural networks (HNNs) and structured optimization (SO). Two HNNs, built upon recent fully convolutional networks (FCNs) and deeply supervised networks (DSNs), are trained to learn the LNC appearance (HNN-A) or contour (HNN-C) probabilistic output maps, respectively. HNN first produces the class label maps with the same resolution as the input image, like FCN. Afterwards, HNN predictions for LNC appearance and contour cues are formulated into the unary and pairwise terms of conditional random fields (CRFs), which are subsequently solved using one of three different SO methods: dense CRF, graph cuts, and boundary neural fields (BNF). BNF yields the highest quantitative results. Its mean Dice coefficient between segmented and ground truth LN volumes is 82.1 % ± 9.6 %, compared to 73.0 % ± 17.6 % for HNN-A alone. The LNC relative volume (cm3) difference is 13.7 % ± 13.1 %, a promising result for the development of LN imaging biomarkers based on volumetric measurements.

1 Introduction

Lymph node (LN) segmentation and volume measurement play a crucial role in important medical imaging-based diagnosis tasks, such as quantitatively evaluating disease progression or the effectiveness of a given treatment or therapy. Enlarged LNs, defined by the widely observed RECIST criterion [14] to have a short axis diameter ≥ 10 mm on an axial computed tomography (CT) slice, signal the onset or progression of a malignant disease or an infection. Often performed manually, LN segmentation is highly complex, tedious and time consuming. Previous methods for automatic LN segmentation in CT images fall under several categories, including atlas registration and label fusion [17], 3D deformable surface shape models [6] and statistical 3D image feature learning [1,7]. This paper addresses and solves a novel problem: lymph node cluster (LNC) segmentation in the thoracoabdominal (TA) region. LN volumes are subsequently predicted from our segmentation results.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 388–397, 2016. DOI: 10.1007/978-3-319-46723-8_45

Fig. 1. CT images of thoracoabdominal lymph node clusters with annotated (red) boundaries.

In CT, the TA region exhibits exceptionally poor intensity and texture contrast among neighboring LNs and between LNs and their surrounding tissues. Furthermore, TA LNs often appear in clusters. Weak intensity contrast renders the boundaries of distinct agglomerated LNs ambiguous (Fig. 1). Existing fully automated methods have been applied to the more contrast-distinctive axillary and pelvic regions [1], as well as the head-and-neck section [6,17]. This paper presents a fully automated method for TA LNC segmentation. More importantly, the segmentation task is formulated as a flexible, bottom-up binary image classification problem that can be effectively solved using deep convolutional neural networks (CNNs) and graph-based structured optimization and inference. Our bottom-up approach can easily handle all variations in LNC size and spatial configuration. By contrast, top-down, model-fitting methods [1,6,7,17] may struggle to localize and segment each LN. LN volume is a more robust metric than short-axis diameter, which is susceptible to high inter-observer variability and human error. Furthermore, our proposed method is well-suited for measuring agglomerated LNs, whose ambiguous boundaries compromise the accuracy of diameter measurement.

This paper addresses a clinically relevant and challenging problem: automatic segmentation and volume measurement of TA LNCs. A publicly available dataset1 containing 171 TA 3D CT volumes (with manually annotated LN segmentation masks) [15] is used. The method in this paper integrates HNN learning with structured optimization (SO). Furthermore, it yields remarkable quantitative results. The mean Dice similarity coefficient (DSC) between predicted and ground truth LN volumes is 82.1 % ± 9.6 % for boundary neural fields (BNF), and 73.0 % ± 17.6 % for HNN-A. The relative volume measurement error is 13.7 % ± 13.1 % for BNF and 32.16 % ± 36.28 % for HNN-A.

1 https://wiki.cancerimagingarchive.net/display/Public/CT+Lymph+Nodes.
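As a minimal sketch of how the overlap and volume figures quoted above can be computed from binary masks, the following Python snippet evaluates the DSC, the closely related intersection over union (IoU), and a relative volume difference (RVD). The function name, the flat-list mask layout, and the exact RVD formula (absolute volume difference divided by the ground truth volume) are assumptions made for illustration and are not taken from the paper.

```python
def overlap_metrics(pred, truth, voxel_volume_cm3=1.0):
    """Return (DSC, IoU, RVD) for two equally sized flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    n_pred = sum(1 for p in pred if p)
    n_truth = sum(1 for t in truth if t)
    if n_truth == 0:
        raise ValueError("ground truth mask is empty; RVD is undefined")
    n_both = sum(1 for p, t in zip(pred, truth) if p and t)
    n_union = n_pred + n_truth - n_both
    dsc = 2.0 * n_both / (n_pred + n_truth)
    iou = n_both / n_union
    # Assumed definition: |V_pred - V_truth| / V_truth, volumes in cm^3.
    v_pred = n_pred * voxel_volume_cm3
    v_truth = n_truth * voxel_volume_cm3
    rvd = abs(v_pred - v_truth) / v_truth
    return dsc, iou, rvd


# Toy example: 5 predicted voxels, 4 ground truth voxels, 3 in common.
pred = [0, 1, 1, 1, 0, 0, 1, 1]
truth = [0, 1, 1, 0, 0, 1, 1, 0]
dsc, iou, rvd = overlap_metrics(pred, truth)
print(round(dsc, 3), iou, rvd)
```

On this toy pair of masks the DSC is 2/3, the IoU is 0.5, and the predicted volume differs from the ground truth volume by 25 %.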

2 Methods

Our segmentation framework comprises two stages: holistically-nested neural network (HNN) training/inference and structured optimization (SO). (1) Two HNNs, designed in [18], are trained on pairs of raw CT images from the TA region and their corresponding binary LN appearance (segmentation) or contour (boundary) masks. We denote them as HNN-A and HNN-C, respectively. The HNN merges the CNN frameworks of a fully convolutional network (FCN) [11] and a deeply-supervised network (DSN) [10]. The FCN component is an end-to-end holistic image training-prediction architecture: the output probability label map has the same dimension as the input image. The DSN component performs multi-scale feature learning, in which deep layer supervision informs and refines classification results at multiple convolutional stages. HNN’s multi-level contextual architecture and auxiliary cost functions (which assign pixel-wise label penalties) allow for capturing implicit, informative deep features to enhance the segmentation accuracy. (2) However, HNN’s large receptive fields and pooling layers potentially lead to segmentation outputs that are not precisely localized along the LN boundaries. Hence, we implement and evaluate explicit conditional random field (CRF) based structured optimization schemes [2,3,9] to refine segmentation results. In particular, we optimize a CRF energy function that includes a unary term encoding LN area information (HNN-A) and a pairwise term representing the object-specific LN boundary discontinuities (learned by HNN-C via deep supervision). This pairwise energy differs from the conventional intensity contrast-sensitive term from [9]. We note that the integration of boundary and appearance neural networks for automatic segmentation has also been newly exploited in [13]. Both our paper and [13] have been inspired by the visual cognitive science and computer vision literature (namely [12,18]). However, the overall methods differ: [13] refines HNN predictions via robust spatial aggregation using a random forest, as opposed to structured optimization.

2.1 Holistically-Nested Neural Networks

The holistically-nested neural network (HNN) [18] was first proposed as an image-to-image solution to the long-standing edge detection problem using deep CNNs. In this study, we empirically find that the HNN architecture is also highly effective and efficient in predicting the full object segmentation mask, due to its per-pixel label cost formulation. Therefore, we train two separate holistically-nested neural networks to learn the probabilistic label maps of the LN-specific binary appearance mask (HNN-A) and contour mask (HNN-C) from raw TA CT images. The HNN-A prediction map provides the approximate location and shape of the target LNC, while the HNN-C prediction map renders LN boundary cues. By learning boundaries alone, HNN-C generates boundary information with refined spatial and contextual detail, relative to HNN-A. HNN-A and HNN-C results are combined in the Structured Optimization phase (cf. Section: Structured Optimization) to obtain more accurate pixel-wise label predictions.

HNN Training. We adopt the CNN architecture in [18], which derives from a VGGNet model pre-trained on ImageNet [16]. The HNN contains five convolutional stages, with strides 1, 2, 4, 8, and 16, respectively, and different receptive field sizes, all nested in the VGGNet as in [18]. The HNN includes one side layer per convolutional stage, which is associated with an auxiliary classifier. The side outputs, generated by each side layer, are increasingly refined, as they gradually approach the ground truth. Finally, all side outputs are fed into a “weighted-fusion” layer, which generates a global probability map merging the information from all side output scales. During the training phase, HNN seeks to minimize the per-pixel cost of each network stage, by applying stochastic gradient descent to the global objective function

$$(\mathbf{W}, \mathbf{w}, \mathbf{h})^{*} = \operatorname{argmin}\bigl(\mathcal{L}_{\mathrm{side}}(\mathbf{W}, \mathbf{w}) + \mathcal{L}_{\mathrm{fuse}}(\mathbf{W}, \mathbf{w}, \mathbf{h})\bigr), \qquad (1)$$

where $\mathcal{L}_{\mathrm{side}}$ is the loss function computed at each side-output layer (i.e., auxiliary cost functions), and $\mathcal{L}_{\mathrm{fuse}}$ is the cross-entropy distance between the ground truth and fusion layer output edge maps. The loss function $\mathcal{L}_{\mathrm{side}}$ is a linear combination of the image-level loss functions $\ell_{\mathrm{side}}^{(m)}(\mathbf{W}, \mathbf{w}^{(m)})$. The parameters $\mathbf{W}$ correspond to the set of standard network parameters, and $\mathbf{w}$ to the weights in each side-output layer’s classifier. Due to the per-pixel cost setup [11,18], HNN does not require a large number of training images to converge, and thus can be applied to small datasets.

HNN Testing. During the testing phase, the network generates edge map predictions for each layer. The final unified output is a weighted average of all prediction maps:

$$\hat{Y}_{\mathrm{HED}} = \operatorname{Average}\bigl(\hat{Y}_{\mathrm{fuse}}, \hat{Y}_{\mathrm{side}}^{(1)}, \ldots, \hat{Y}_{\mathrm{side}}^{(5)}\bigr) \qquad (2)$$
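The training objective and the test-time averaging can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' Caffe implementation: the helper names, the side-output weights `alphas`, the plain (not class-balanced) cross-entropy, the fusion of probabilities rather than pre-sigmoid activations, and the use of an unweighted mean for Eq. 2 are all assumptions.

```python
import math


def cross_entropy(prob_map, truth):
    """Mean per-pixel binary cross-entropy between probabilities and a 0/1 mask."""
    eps = 1e-7  # keeps log() finite when a probability is exactly 0 or 1
    total = 0.0
    for p, y in zip(prob_map, truth):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(truth)


def fuse(side_maps, h):
    """Weighted-fusion layer: per-pixel weighted sum of the side outputs."""
    n_pixels = len(side_maps[0])
    return [sum(h_m * m[i] for h_m, m in zip(h, side_maps)) for i in range(n_pixels)]


def hnn_objective(side_maps, h, truth, alphas):
    """L_side + L_fuse for one image, with hypothetical side weights alphas."""
    l_side = sum(a * cross_entropy(m, truth) for a, m in zip(alphas, side_maps))
    l_fuse = cross_entropy(fuse(side_maps, h), truth)
    return l_side + l_fuse


def hnn_test_output(side_maps, h):
    """Mean of the fused map and all side-output maps (one reading of Eq. 2)."""
    maps = [fuse(side_maps, h)] + side_maps
    return [sum(m[i] for m in maps) / len(maps) for i in range(len(maps[0]))]


# Toy 4-pixel image with five side outputs that get closer to the ground truth.
truth = [0, 1, 1, 0]
side_maps = [[0.5 + 0.09 * s * (2 * y - 1) for y in truth] for s in range(1, 6)]
h = [0.2] * 5  # hypothetical fusion weights
loss = hnn_objective(side_maps, h, truth, alphas=[1.0] * 5)
y_hat = hnn_test_output(side_maps, h)
print(round(loss, 4), [round(p, 2) for p in y_hat])
```

In this toy setting the deepest side output has the lowest cross-entropy, and the averaged map stays close to the fused prediction.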

HNN is highly efficient, requiring a mere 0.4 s per image in the feed-forward testing. Further details on the HNN architecture and training are provided in [18].

2.2 Structured Optimization

Although HNN is a state-of-the-art, image-to-image, semantic pixel-wise labeling method, it tends to produce imprecise segmentations, like many deep CNN models. Its large receptive fields and many pooling layers compromise clarity and spatial resolution in the deep layers. Therefore, we exploit explicit structured optimization techniques to refine HNN-A’s segmentation results. We select a conditional random field (CRF) optimization framework, as it is well-suited for integrating LN predictions with LN boundary cues. As in [2], the boundary cues serve to improve segmentation coherence and object localization. For a given target CT image, the unary potential is a function of the corresponding HNN-A prediction, while the pairwise potential is a function of the corresponding HNN-C prediction. Three structured optimization representations and methods are described and evaluated: dense CRF, graph cuts, and boundary neural fields. Under all three techniques, segmentation is cast as a binary classification problem. A graph representation for the original CT image is provided, in which vertices correspond to image pixels, and edges to inter-pixel connections. A boundary strength-based affinity function, defined in [2], for distinct pixels i and j is given by:

$$w_{ij} = \exp\left(\frac{-M_{ij}}{\sigma}\right), \qquad (3)$$

where $M_{ij}$ is the magnitude of the strongest LN boundary intersecting {i, j}, and $\sigma$ is a smoothing hyper-parameter. The boundary strength map is obtained by performing non-maximum suppression on an HNN-C prediction [18]. This boundary-based affinity function better refines the segmentation than would a standard image intensity gradient-based function, as intensity information is highly ambiguous in TA CT images. We set the degree of pixel i to $d_i = \sum_{j \neq i}^{N} w_{ij}$, where N is the total number of pixels.

Dense Conditional Random Field: Our dense conditional random field (dCRF) representation follows the framework in [5,9]. We adopt the CT intensity contrast-sensitive pixel affinities for all possible image pixel pairs, as described in [5]. Finally, the dCRF solver designed in [9] is utilized, as a variation of distributed message passing.

Graph Cuts: The minimum-cut/maximum-flow graph cuts (GC) algorithm described in [4] is applied to the segmentation problem.
We optimize an energy function whose unary term is the negative log-likelihood of the HNN-A LN segmentation probability value per pixel and whose pairwise term is defined using Eq. 3 [2]. All inter-pixel affinities are computed within a 20 × 20 neighborhood for each pixel location.

Boundary Neural Fields: The LN mask (HNN-A) and boundary (HNN-C) predictions are integrated into a matrix model. We optimize the global energy function:

$$\mathbf{X}^{*} = \operatorname*{argmin}_{\mathbf{X}} \; \frac{\mu}{2}\,\bigl(\mathbf{D}(\mathbf{X} - \mathbf{D}^{-1}\mathbf{f})\bigr)^{T}(\mathbf{X} - \mathbf{D}^{-1}\mathbf{f}) + \frac{1}{2}\,\mathbf{X}^{T}(\mathbf{D} - \mathbf{W})\mathbf{X}, \qquad (4)$$

where $\mathbf{X}^{*}$ is an N × 1 vector representing an optimal continuous label assignment for a vectorized input image (with N pixels), $\mathbf{D}$ is the N × N diagonal degree matrix, $\mathbf{W}$ is the N × N pairwise affinity matrix, and $\mathbf{f}$ is an N × 1 vector containing the HNN-A prediction values. Each diagonal entry $d_{i,i}$ is set to $d_i$, defined above. $\mathbf{W}$ is a sparse weight matrix: the entries $w_{ij}$ are computed (via Eq. 3 [2]) only for pixel pairs i, j belonging to the same 20 × 20 pixel neighborhood.
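Because the energy in Eq. 4 is quadratic in X, setting its gradient to zero gives μ(DX − f) + (D − W)X = 0, i.e. the sparse linear system ((1 + μ)D − W)X = μf. The sketch below builds the affinities of Eq. 3 for a one-dimensional row of pixels and solves that system by Gauss-Seidel sweeps. It is only an illustration under stated assumptions: the function names, the toy data, the value σ = 0.1, the 1-D neighborhood of radius 2 (the paper uses 20 × 20 pixel neighborhoods), and the final labeling rule, which compares the solution for f with the solution for the background map 1 − f, are not taken from the paper.

```python
import math


def affinities(edge_strength, radius, sigma):
    """Sparse affinities w[i][j] = exp(-M_ij / sigma) for a 1-D row of pixels.

    edge_strength[k] is the boundary strength between pixels k and k + 1;
    M_ij is the strongest boundary crossed when going from pixel i to pixel j.
    """
    n = len(edge_strength) + 1
    w = [dict() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(n, i + radius + 1)):
            m_ij = max(edge_strength[i:j])
            w[i][j] = w[j][i] = math.exp(-m_ij / sigma)
    return w


def bnf_solve(f, w, mu=0.025, sweeps=2000):
    """Solve ((1 + mu) D - W) x = mu f by Gauss-Seidel sweeps."""
    d = [sum(row.values()) for row in w]  # d_i = sum over j != i of w_ij
    x = [0.0] * len(f)
    for _ in range(sweeps):
        for i, row in enumerate(w):
            coupled = sum(w_ij * x[j] for j, w_ij in row.items())
            x[i] = (mu * f[i] + coupled) / ((1.0 + mu) * d[i])
    return x


hnn_a = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]  # toy HNN-A probabilities
edges = [0.0, 0.0, 1.0, 0.0, 0.0]       # toy HNN-C boundary between pixels 2 and 3
w = affinities(edges, radius=2, sigma=0.1)
x_fg = bnf_solve(hnn_a, w)
x_bg = bnf_solve([1.0 - p for p in hnn_a], w)
labels = [int(a > b) for a, b in zip(x_fg, x_bg)]
print(labels)
```

In this toy row, thresholding the HNN-A values at 0.5 would label pixel 2 as background and pixel 3 as foreground, whereas the boundary-aware solution groups each of them with the pixels on its own side of the boundary. Gauss-Seidel is used here only to keep the sketch dependency-free; any sparse linear solver would serve the same purpose.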


The unary energy attempts to find a segmentation assignment X that deviates little from the HNN-A output. The assignment X is weighted by D, in order to assign larger unary costs to pixels with many similar neighbors. By contrast, the pairwise energy minimizes the cost assigned to such pixels by weighting the squared distances between segmentation assignments of similar pixel pairs {i, j} by their affinity $w_{ij}$. To balance the unary and pairwise contributions, the unary term is weighted by the hyperparameter μ. As in [2], μ is set to 0.025. The optimal segmentation is given by:

$$\mathbf{X}^{*} = (\mathbf{D} - \alpha\mathbf{W})^{-1}\beta\mathbf{f}, \quad \text{where } \alpha = \frac{1}{1+\mu} \text{ and } \beta = \frac{\mu}{1+\mu}. \qquad (5)$$

Note that Eq. 5 is a closed-form solution.

3 Results and Discussion

3.1 Dataset Creation

Our dataset contains 84 abdominal and 87 mediastinal 3D CT scans (512 × 512 × 512 voxels), publicly available from [15]. We spatially group the ground truth binary LN masks in 3D to form clusters with a linking distance constraint. All LN clusters, padded by 32 pixels in each direction, are subsequently cropped, resulting in 1–7 such subvolume regions (77 × 76 × 79 to 212 × 235 × 236 voxels) per CT volume. All CT axial slices have been extracted from the portal venous phase with slice thickness 1–1.25 mm and manually segmented by an expert radiologist. This yields a total of 39,361 images (16,268 images containing LN pixels) in 411 LN clusters (with 395 abdominal and 295 mediastinal LNs). By extracting all LN contours from the ground truth appearance masks, we obtain the LN contour masks to train HNN-C. Examples of LN CT image ground truth boundaries are shown in Figs. 1 and 2.

3.2 Quantitative Analysis

Segmentation accuracies of HNN-A, BNF, GC, and dCRF are evaluated. The volume-wise means and standard deviations (std.) are computed for three evaluation metrics: Dice similarity coefficient (DSC), Intersection over Union (IoU), and Relative Volume Difference (RVD) (cm3) between predicted and ground truth LNC 3D masks. The RVD indicates whether volume measurement is accurate enough to be used as a new imaging biomarker, in addition to the diameter-based RECIST criterion [14].

Fig. 2. Examples of LN CT image segmentation. Top, Bottom: CT images with ground truth (red) and BNF segmented (green) boundaries. Center: HNN-A LN probability maps. CT images 1–3, 5–8 depict successful segmentation results. CT image + map 4 present an unsuccessful case.

Our experiments are conducted under 4-fold cross-validation, with the dataset split at the patient level. Prior to generating binary segmentation results for HNN-A prediction maps alone (with pixels in the range [0, 1]), we remove all pixels below the threshold $\tau = 8.75 \times 10^{-1}$. This value of τ, which is shown to maximize the mean DSC between the HNN-A predictions and ground truth LN masks, is calibrated using the training folds. HNN is run on Caffe [8], using an Nvidia Tesla K40 GPU. The system requires 5 h 40 min for training (30K iterations), and 7 min 43 s for testing (computation of all contour and area maps). BNF is run on MATLAB 2013a (∼9 h). dCRF and GC are run using C++ (∼3 h; ∼12 h). The hyperparameters of dCRF and GC (described in [3,9]) are optimized using a randomized search.

BNF yields the highest quantitative segmentation results. Its mean and std per-volume DSC is 82.1 ± 9.6 %, above HNN-A’s 73.0 ± 17.6 %, dCRF’s 69.0 ± 22.0 %, and GC’s 67.3 ± 16.8 % (cf. Table 1). Additionally, it decreases HNN-A’s mean RVD from 32.2 % to 13.7 %. Meanwhile, dCRF marginally decreases the RVD to 29.6 %, and GC increases it to 86.5 %. Figure 2 contains 8 examples of LNC segmentation using BNF. The plots in Fig. 3 compare segmentation and ground truth LNC volume values (in cm3). Due to the HNN’s deeply nested architecture and auxiliary loss functions on multi-scale side-output layers, the quality of HNN-A segmentation is already high. dCRF’s decline in performance relative to HNN-A may be partly attributed to its CT intensity contrast-sensitive pairwise CRF term, defined in [9]. The appearance kernel in this function implies that neighboring pixels of similar intensity are likely to belong to the same class. Though highly relevant in natural images, this idea cannot be extended to TA CT images, in which LN and background pixels may have similar intensities. GC, however, uses a pairwise term defined by the HNN-C boundary cues. Its lower performance may be attributed to its usage of an L1 norm for the CRF energy minimization. Unlike the L2 norm, used by dCRF and BNF, the L1 norm may yield multiple and/or unstable solutions, thus compromising the final segmentation accuracy. Like GC, BNF omits all intensity information from the energy function. However, it utilizes the boundary cues from HNN-C in both the unary and pairwise terms. Hence, one may infer that an emphasis on LN boundary information is crucial to overcome the complexity of TA LNC CT image segmentation. Additionally, the L2 norm may increase the accuracy of the final segmentation result.

Fig. 3. Comparison between ground truth and predicted LN volumes. (a) HNN-A, (b) Boundary Neural Fields, (c) Graph Cuts.

Comparison to previous work: The proposed bottom-up LN segmentation scheme equals or surpasses previous top-down methods in performance, though applied to a more challenging body region. For head-and-neck segmentation, [17] obtains a mean DSC of 73.0 % on five CT scans. The neck study in [6] yields relative volumetric segmentation error ratios of 38.88 %–51.75 % for five CT scans. The discriminative learning approach in [7] only reports LN detection results on 54 chest CT scans and no information on LN segmentation accuracy. The more comprehensive study from [1] achieves a mean DSC of 80.0 % ± 12.6 % for 308 axillary LNs and 76.0 % ± 12.7 % for 455 pelvic+abdominal LNs, from a dataset of 101 CT cases.

Table 1. Evaluation of segmentation accuracy: HNN-A, BNF, dCRF, and GC

Method    Mean DSC (%)    Mean IoU (%)    Mean RVD (%)
HNN-A     73.0 ± 17.6     60.1 ± 18.8     32.2 ± 46.3
BNF       82.1 ± 9.6      70.6 ± 11.9     13.7 ± 13.1
dCRF      69.0 ± 22.0     56.2 ± 21.6     29.6 ± 45.4
GC        67.3 ± 16.8     53.0 ± 17.9     86.5 ± 107.6

4 Conclusion

To solve a challenging problem with high clinical relevance, namely automatic segmentation and volume measurement of TA LNCs in CT images, our method integrates HNN learning in both LN appearance and contour channels and exploits different structured optimization methods. BNF (combining HNN-A and HNN-C via a sparse matrix representation) is the most accurate segmentation scheme. BNF’s mean RVD of 13.7 ± 13.1 % is promising for the development of LN imaging biomarkers based on volumetric measurements, which may lay the groundwork for improved RECIST LN measurements.

Acknowledgments. This work was supported by the Intramural Research Program at the NIH Clinical Center.

References
1. Barbu, A., Suehling, M., Xu, X., Liu, D., Zhou, S.K., Comaniciu, D.: Automatic detection and segmentation of lymph nodes from CT data. IEEE Trans. Med. Imaging (2012)
2. Bertasius, G., Shi, J., Torresani, L.: Semantic segmentation with boundary neural fields. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
3. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient ND image segmentation. IJCV (2006)
4. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pat. Ana. Mach. Intel. (2004)
5. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: ICLR (2015)
6. Dornheim, J., Seim, H., Preim, B., Hertel, I., Strauss, G.: Segmentation of neck lymph nodes in CT datasets with stable 3D mass-spring models. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4191, pp. 904–911. Springer, Heidelberg (2006). doi:10.1007/11866763_111
7. Feulner, J., Zhou, S., Hammond, M., Hornegger, J., Comaniciu, D.: Lymph node detection and segmentation in chest CT data using discriminative learning and a spatial prior. In: Medical Image Analysis, pp. 254–270 (2013)
8. Jia, Y.: Caffe: an open source convolutional architecture for fast feature embedding (2013). http://goo.gl/Fo9YO8
9. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2012)
10. Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: AISTATS (2015)
11. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE CVPR, pp. 3431–3440 (2015)
12. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pat. Ana. Mach. Intel. (2004)
13. Roth, H., Lu, L., Farag, A., Sohn, A., Summers, R.: Spatial aggregation of holistically-nested networks for automated pancreas segmentation. In: Ourselin, S., Wells, W.M., Joskowicz, L., Sabuncu, M., Unal, G. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 451–459. Springer, Heidelberg (2016)
14. Schwartz, L., Bogaerts, J., Ford, R., Shankar, L., Therasse, P., Gwyther, S., Eisenhauer, E.: Evaluation of lymph nodes with RECIST 1.1. Euro. J. Cancer 45(2), 261–267 (2009)
15. Seff, A., Lu, L., Barbu, A., Roth, H., Shin, H.-C., Summers, R.M.: Leveraging mid-level semantic boundary cues for automated lymph node detection. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9350, pp. 53–61. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24571-3_7
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2014)
17. Stapleford, L., Lawson, J., et al.: Evaluation of automatic atlas-based lymph node segmentation for head-and-neck cancer. Int. J. Rad. Onc. Bio. Phys. (2010)
18. Xie, S., Tu, Z.: Holistically-nested edge detection. In: IEEE ICCV, pp. 1395–1403 (2015)

Evaluation-Oriented Training via Surrogate Metrics for Multiple Sclerosis Segmentation

Michel M. Santos1(B), Paula R.B. Diniz2, Abel G. Silva-Filho1, and Wellington P. Santos3

1 Centro de Informática, Universidade Federal de Pernambuco, Pernambuco, Brazil
[email protected]
2 Núcleo de Telessaúde, Universidade Federal de Pernambuco, Pernambuco, Brazil
3 Dpto. de Eng. Biomédica, Universidade Federal de Pernambuco, Pernambuco, Brazil

Abstract. In current approaches to automatic segmentation of multiple sclerosis (MS) lesions, the segmentation model is not optimized with respect to all relevant evaluation metrics at once, leading to unspecific training. An obstacle is that the computation of relevant metrics is three-dimensional (3D). The high computational costs of 3D metrics make their use impractical as learning targets for iterative training. In this paper, we propose an oriented training strategy that employs cheap 2D metrics as surrogates for expensive 3D metrics. We optimize a simple multilayer perceptron (MLP) network as the segmentation model. We study the fidelity and efficiency of surrogate 2D metrics. We compare oriented training to unspecific training. The results show that oriented training produces a better balance between metrics, surpassing unspecific training on average. The segmentation quality obtained with a simple MLP through oriented training is comparable to the state of the art; this includes a recent work using a deep neural network, a more complex model. By optimizing all relevant evaluation metrics at once, oriented training can improve MS lesion segmentation.

1 Introduction

In most multiple sclerosis (MS) lesion segmentation methods, specific metrics serve as evaluation criteria, but not as training targets. For instance, many studies use the Dice similarity coefficient, sensitivity, specificity, and surface distance as evaluation metrics, but have energy, likelihood, or maximum a posteriori functions as the learning target [1–4]. Despite previous efforts, automatic MS segmentation still requires improvements [5,6]. The use of evaluation metrics as optimization targets has the potential to improve MS segmentation methods [7]. Evaluation metrics of clinical relevance involve lesion count, volume, and shape, which are criteria used to characterize MS progression [8]. However, previous works failed to target all metrics of clinical relevance at the same time [7,9,10]. An obstacle to evaluation-oriented training is that domain-specific metrics can be too expensive for iterative optimization of MS segmentation methods.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 398–405, 2016. DOI: 10.1007/978-3-319-46723-8_46

The high computational cost is in part due to the three-dimensional (3D) computation of evaluation metrics. For instance, a shape-based metric, such as the average surface distance, may take minutes to be computed in three dimensions [11]. In addition, the measures based on counts of overlapped regions between segmentation and ground truth depend in turn on a connected-components method to obtain contiguous regions. The 3D computation of connected components can take minutes as well [12]. To see why the optimization of 3D metrics is impractical, suppose that the computation of the measures takes 10 min for each of 10 patients. In this case, considering 1000 iterations of an optimization algorithm, the training would take more than two months (100,000 min). As an alternative, surrogate-assisted optimization, a field of evolutionary computing, uses efficient computational models to approximate costly target functions to be optimized [13]. While surrogate-assisted optimization has been applied to other fields, no studies have been found in MS lesion segmentation. The surrogate-assisted method can make oriented training feasible, enabling the optimization of a segmentation model specifically for MS lesion detection. The contributions of this paper are as follows: a surrogate-assisted training method that allows indirect optimization of multiple costly metrics, specially designed for MS segmentation; a surrogate objective function, which the method comprises; and the application of the proposed optimization method to carry out evaluation-oriented training, targeting metrics of lesion count, volume, and shape, all at once.
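The training-time estimate above is simple arithmetic; the short Python check below reproduces it. The variable names are illustrative only.

```python
minutes_per_patient = 10   # assumed cost of evaluating the 3D metrics once
patients = 10
iterations = 1000

total_minutes = minutes_per_patient * patients * iterations
total_days = total_minutes / (60 * 24)
print(total_minutes, round(total_days, 1))  # 100000 69.4
```

About 69 days of metric evaluation alone is consistent with the "more than two months" quoted in the text.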

2 MS Segmentation Challenge: Dataset and 3D Metrics

The MS Lesion Segmentation Challenge 2008 aims to compare automatic methods [14]. The competition website¹ still provides image data and accepts new submissions. For each patient in the database, three images were acquired with different MRI sequences: T1-weighted (T1w), T2-weighted (T2w), and fluid-attenuated inversion recovery (FLAIR). All images were coregistered and resampled to a matrix of 512 × 512 × 512, corresponding to a voxel size of 0.5 × 0.5 × 0.5 mm. The segmentation was made by human experts, who assigned each voxel the value 1 for lesion and 0 otherwise. In the competition, the images are divided into two sets, one named training and the other named test. To avoid confusion with the stages of cross-validation, in our study we renamed the competition's training and test sets to public and private, respectively, according to the availability of ground truth. For each submission, the competition website assigns a score based on the segmentation of the private set. This evaluation score comprises four metrics computed in 3D: relative absolute volume difference (RAVD); average symmetric surface distance (ASSD); lesion-wise true positive rate (LTPR); and lesion-wise false positive rate (LFPR). The final (3D) score is an average scaled to lie between 0 and 100; the higher, the better. A score of 90 is equivalent to the expected performance of an independent human expert. The original evaluation algorithm is available online².

¹ Available at http://www.ia.unc.edu/MSseg/.
² https://github.com/telnet2/gradworks/tree/master/EvalSegmentation.


M.M. Santos et al.

3 Segmentation Model and Optimization Method

In our approach, we adopt a simple multilayer perceptron (MLP) model with a single hidden layer containing just a few neurons, to accelerate execution and enable statistical evaluation. This shallow neural network serves to demonstrate the impact of oriented training and should not be disregarded: shallow neural networks, when properly parameterized and trained, can achieve performance equivalent to that of state-of-the-art deep learning models [15]. We employ particle swarm optimization (PSO) to train the MLP as an alternative to backpropagation [16]. The backpropagation algorithm restricts the form of the learning target, whereas PSO allows different learning targets to be used without developing a specific procedure for each function to be optimized. In our implementation, the MLP had three input neurons corresponding to the intensity values of the T1w, T2w, and FLAIR images at each voxel position, two neurons in the output layer indicating the voxel class, and a single hidden layer with six neurons and a logistic activation function. In PSO, 20 particles were used with the parameters w = 0.65, c1 = 3, and c2 = 1. The search space at each coordinate was limited to the range [−20, 20]. The maximum velocity was set to 20 % of the search-space width. When a particle left the search space, its velocity was inverted and reduced by multiplying it by a factor of −0.01. All parameters were defined empirically or taken from the literature.
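The PSO configuration described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the standard velocity update in which c1 weights the particle's own best position and c2 the swarm's best, assumes that the 38 optimized parameters include biases, and assumes that a particle leaving the search space is clamped to the boundary before its velocity is multiplied by −0.01. The function name `pso_minimize` is hypothetical.

```python
import random

W, C1, C2 = 0.65, 3.0, 1.0      # inertia and acceleration coefficients from the text
LO, HI = -20.0, 20.0            # search range per coordinate
V_MAX = 0.2 * (HI - LO)         # 20 % of the search-space width
N_PARTICLES = 20
DIM = 3 * 6 + 6 + 6 * 2 + 2     # 38 MLP parameters, assuming biases are included


def pso_minimize(objective, dim=DIM, iterations=100, seed=0):
    """Minimize `objective` over [LO, HI]^dim; returns (best position, best value)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(LO, HI) for _ in range(dim)] for _ in range(N_PARTICLES)]
    vs = [[0.0] * dim for _ in range(N_PARTICLES)]
    pbest = [x[:] for x in xs]
    pbest_f = [objective(x) for x in xs]
    g = min(range(N_PARTICLES), key=pbest_f.__getitem__)
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iterations):
        for i in range(N_PARTICLES):
            x, v = xs[i], vs[i]
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[d] = (W * v[d]
                        + C1 * r1 * (pbest[i][d] - x[d])
                        + C2 * r2 * (gbest[d] - x[d]))
                v[d] = max(-V_MAX, min(V_MAX, v[d]))   # velocity limit
                x[d] += v[d]
                if x[d] < LO or x[d] > HI:
                    x[d] = max(LO, min(HI, x[d]))      # assumption: clamp to the boundary
                    v[d] *= -0.01                      # invert and damp the velocity
            f = objective(x)
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = x[:], f
                if f < gbest_f:
                    gbest, gbest_f = x[:], f
    return gbest, gbest_f


# Usage on a toy function; a score to be maximized can be passed as its negative.
best, value = pso_minimize(lambda x: sum(c * c for c in x), dim=5, iterations=50)
```

Because the optimizer only needs objective values, the same loop can be reused with any learning target, which is the property the text relies on.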

4 Evaluation-Oriented Training Through Surrogate Metrics

4.1 Computationally Cheaper Training Targets

The 3D computation of learning targets would be infeasible for iterative training. To reduce the computational burden, we employed cheaper 2D surrogates instead of the costly 3D metrics. Accordingly, the surrogate 2D-score depends on four 2D metrics: ρ1, relative absolute area difference; ρ2, average symmetric surface distance; ρ3, lesion-wise true positive rate; ρ4, lesion-wise false positive rate. As in the 2008 competition (Sect. 2), the four metrics are transformed to the range [0, 100], with a score of 90 being equivalent to the performance of a human expert. Thus, the learning target for evaluation-oriented training is 2D-score = (ρ1 + ρ2 + ρ3 + ρ4)/4. Optimizing all four metrics is important: a single metric that reaches just a 50 % score represents a loss of 12.5 points on the final score, which is too much in such a competitive task. The 2D-score is faster because it is calculated in a lower dimension and averaged over a reduced number of selected slices (as we will see). Furthermore, the surrogate metrics are faster because of problem reduction, regardless of hardware-based acceleration.

4.2 Evaluation-Oriented Versus Unspecific Training

The proposed evaluation-oriented training differs from unspecific training in two aspects: learning target and data sampling. In evaluation-oriented training, the objective function (Algorithm 1) has the 2D-score as its learning target. In contrast, unspecific training has the mean squared error (MSE) as the learning target in the objective function (Algorithm 2). Regarding data sampling, the oriented objective computes an average of four 2D metrics, which requires samples of slices for each patient. On the other hand, samples of voxels are used in unspecific training for the MSE calculation. Both the oriented and unspecific objectives include a pre-classification rule. This rule prevents unnecessary computation of MLP outputs and takes advantage of the fact that lesions appear hyper-intense in FLAIR images (Algorithm 1 at line 5 and Algorithm 2 at line 4). Note that this rule requires standardized intensities; we standardized images to have median 0 and interquartile range 1. In unspecific training, to balance classes, all lesion voxels of each patient are taken, and an equal number of non-lesion voxels is sampled. Note that the 2D-score must be maximized, whereas the MSE must be minimized. For each objective function, the PSO was configured and employed accordingly.

Algorithm 1. Oriented Objective

 1: function objective(sliceSamples)
 2:   for each patient do
 3:     for each slice in sliceSamples do
 4:       for each voxel do
 5:         if FLAIR intensity < 0 then
 6:           assign non-lesion to voxel
 7:         else
 8:           classify voxel via MLP
 9:         end if
10:       end for
11:       compute 2D (per slice) metrics
12:     end for
13:   end for
14:   compute average 2D metrics
15:   objective = 2D-score          ⊲ Sect. 4.1
16: end function

Algorithm 2. Unspecific Objective

 1: function objective(voxelSamples)
 2:   for each patient do
 3:     for each voxel in voxelSamples do
 4:       if FLAIR intensity < 0 then
 5:         assign non-lesion to voxel
 6:       else
 7:         classify voxel via MLP
 8:       end if
 9:     end for
10:     compute error per voxel
11:   end for
12:   objective = MSE
13: end function
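The two objectives can be made concrete with a small sketch. Several details below are assumptions, since the text does not fix them: the flat layout of the 38 MLP parameters, the logistic activation at the output layer, the convention that the second output neuron stands for the lesion class, and the use of hard labels in the squared error. The per-patient loop of Algorithm 2 is flattened into a single list of samples, and the transformation of each 2D metric to the range [0, 100] is not reproduced; `score_2d` only averages four scores that are already on that scale.

```python
import math

N_IN, N_HID, N_OUT = 3, 6, 2   # layer sizes given in Sect. 3


def logistic(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mlp_outputs(params, voxel):
    """Forward pass; `params` is a flat list of 38 values (layout is an assumption)."""
    k, hidden = 0, []
    for _ in range(N_HID):                       # each hidden neuron: 3 weights + bias
        z = params[k + N_IN] + sum(w * x for w, x in zip(params[k:k + N_IN], voxel))
        hidden.append(logistic(z))
        k += N_IN + 1
    outputs = []
    for _ in range(N_OUT):                       # each output neuron: 6 weights + bias
        z = params[k + N_HID] + sum(w * h for w, h in zip(params[k:k + N_HID], hidden))
        outputs.append(logistic(z))              # assumption: logistic output units
        k += N_HID + 1
    return outputs


def classify(params, voxel):
    """Return 1 (lesion) or 0 (non-lesion) for a (T1w, T2w, FLAIR) voxel."""
    if voxel[2] < 0:                             # pre-classification rule on FLAIR
        return 0
    non_lesion, lesion = mlp_outputs(params, voxel)
    return 1 if lesion > non_lesion else 0


def unspecific_objective(params, voxel_samples):
    """Algorithm 2 sketch: MSE over (voxel, label) pairs, to be minimized."""
    errors = [(classify(params, v) - y) ** 2 for v, y in voxel_samples]
    return sum(errors) / len(errors)


def score_2d(rho1, rho2, rho3, rho4):
    """2D-score of Sect. 4.1: mean of four metric scores already scaled to [0, 100]."""
    return (rho1 + rho2 + rho3 + rho4) / 4.0


print(score_2d(100, 100, 100, 50))  # 87.5, i.e. a loss of 12.5 points
```

An oriented objective would call `classify` on every voxel of the sampled slices, compute the four 2D metrics per slice, and return `score_2d` of their averages for maximization.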

5 Results and Discussion

5.1 Experimental Setup

Fidelity and Efficiency of 2D-Score. The 2D-score was evaluated regarding fidelity and efficiency as a function of the number of slices. The plane, quantity, and position of the selected slices can influence the final 2D-score. We selected slices in the axial plane. The number of slices $s$ was varied as $s \in \{2^2, 2^3, \ldots, 2^9\}$. When $s$ slices were selected, the position $p_n$ of the $n$-th slice was given by $p_n = 1 + (n - 1)(2^9/s)$. For example, for $s = 4$, the selected slices were {1, 129, 257, 385}. In addition, we evaluated a case with three slices at positions 240, 250, and 260, which are approximately medial. A 2D-score based on $s$ slices per patient is denoted 2D-score-s. This analysis employed a subset of 10 images, each segmented by two human experts, for a total of 20 segmentations. For each of these segmentations, the 2D-score and the 3D-score were computed as measures of rater-rater agreement.
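The slice-selection rule can be checked with a short sketch. The function name is hypothetical; integer division is used on the assumption that s is a power of two dividing the 512 axial slices, as in the text.

```python
def slice_positions(s, n_slices=512):
    """1-based positions p_n = 1 + (n - 1) * (n_slices / s) for n = 1..s."""
    step = n_slices // s   # exact when s is a power of two up to n_slices
    return [1 + (n - 1) * step for n in range(1, s + 1)]


print(slice_positions(4))  # [1, 129, 257, 385]
```

With s = 512 the rule selects every slice, so 2D-score-512 uses the whole volume slice by slice.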


Fidelity was given by the (Pearson) correlation of the 2D-score with the 3D-score. Efficiency was assessed via the median time taken per patient to compute each 2D-score. Note that this analysis is independent of automatic methods, since we employ only segmentations made by human raters.

Comparison Based on Public and Private Data. The images with public ground truth serve for the statistical evaluation of training methods, since they can be used to compute segmentation metrics locally and repeatedly. We employed a repeated holdout method to compare the training algorithms statistically. For each type of training, the execution of the PSO-MLP was repeated 120 times. We compare two training algorithms: oriented, using 2D-score-3, and unspecific, using MSE. We computed the median execution time and the median number of objective function evaluations to assess training efficiency. The purpose of the private ground truth data is to avoid optimizing algorithms to a particular set of images. We compare oriented training with unspecific training on private data as well. For this, we selected the MLP with the best learning target value over all images in the public dataset. Next, each selected MLP was applied to detect lesions on the images with private ground truth. The training methods are compared through the 3D-score provided by the competition website. We also compare oriented training with state-of-the-art methods. For this, we employed a Gaussian filter as a post-processing step to avoid false positives due to small candidate lesions, as in other approaches [2,9].

5.2 Surrogate 2D Versus Actual 3D Metrics

Fidelity and efficiency of the 2D-score as a surrogate for the 3D-score are examined. The fidelity tends to grow with the number of slices used to calculate the 2D-score (Table 1). On the other hand, the efficiency of the 2D-score decreases with more slices. Note that the 2D-score has positive fidelity even with a small number of slices. A positive correlation, even a weak one, indicates a tendency for the 3D-score to increase as the 2D-score increases. This monotonic relationship makes it reasonable to use the low-fidelity 2D-score-3 as the objective function. Moreover, the low-fidelity 2D-score-3 is around 80 times faster than the high-fidelity 2D-score-512.

Table 1. Fidelity and efficiency of 2D-score as surrogate for 3D-score

Number of slices in 2D-score   3     4     8     16    32    64    128   256   512
Correlation with 3D-score      0.13  0.15  0.17  0.19  0.28  0.29  0.49  0.77  0.81
Median time per patient (s)    0.04  0.04  0.06  0.11  0.22  0.44  0.88  1.73  3.46
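The trade-off in Table 1 can be inspected directly from the tabulated values. The sketch below uses only the rounded numbers printed in the table, so the ratio it yields is indicative; the speedup of "around 80 times" quoted in the text presumably comes from unrounded timings. The variable and function names are illustrative.

```python
SLICES = [3, 4, 8, 16, 32, 64, 128, 256, 512]
CORRELATION = [0.13, 0.15, 0.17, 0.19, 0.28, 0.29, 0.49, 0.77, 0.81]
SECONDS = [0.04, 0.04, 0.06, 0.11, 0.22, 0.44, 0.88, 1.73, 3.46]


def speedup(fast_slices, slow_slices):
    """Ratio of median times per patient between two 2D-score variants."""
    t = dict(zip(SLICES, SECONDS))
    return t[slow_slices] / t[fast_slices]


# Fidelity never decreases as slices are added, while the cost keeps growing.
fidelity_is_monotonic = all(a <= b for a, b in zip(CORRELATION, CORRELATION[1:]))
print(fidelity_is_monotonic, round(speedup(3, 512), 1))  # True 86.5
```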

5.3 Training Efficiency Using Public Ground Truth

We compare oriented with unspecific training with respect to the number of function evaluations and execution time (Table 2). Evaluation-oriented training stopped after fewer objective function evaluations but took more time overall (∼1 h) than unspecific training (∼1 min), because the MSE is computed faster than the surrogate measures. On the other hand, compared with 3D metrics, the 2D surrogates can reduce the overall training time from months to hours. In oriented training, the time per evaluation of the objective function is roughly 3238/1645 ≈ 1.97 s. With 10 individuals in the training set, the time per patient in a single oriented objective computation is about 0.2 s (note that the time for obtaining just the 2D-score-3 is 0.04 s per patient). Once training is concluded, the time taken to detect lesions in a new patient is about 40 s.

Table 2. Training method efficiency

                               Unspecific   Oriented
Median objective evaluations   2911         1645
Median execution time (s)      78.81        3238.48

5.4 Segmentation Effectiveness on Private Ground Truth

Evaluation-oriented training with 2D-score-3 surpasses unspecific training in segmentation quality as given by the average 3D-score (Table 3, best scores marked with *). Unspecific training with MSE performs better only in the true-positive score. These results reveal in which metrics the oriented training has the advantage. Furthermore, the proposed oriented training with 2D-score-3 better balances the four metrics and surpasses unspecific training in the average score. A low-fidelity 2D-score with just 3 slices enables evaluation-oriented training to reach a better segmentation quality than unspecific training.

Table 3. Segmentation quality per learning target on private data

Target       RAVD [%]  Score   ASSD [mm]  Score   LTPR [%]  Score   LFPR [%]  Score   Average 3D-Score
MSE          2302      6       14         71      88        98*     97        51      56.30
2D-score-3   63        91*     6          87*     43        76      49        80*     83.26*

5.5 Comparison with State-of-the-art Methods

Compared with state-of-the-art methods, evaluation-oriented training is the first to optimize measures of lesion count, volume, and shape through surrogate learning targets (Table 4). The proposed method reaches a segmentation quality score comparable to that of state-of-the-art methods, surpassing some recent approaches. Interestingly, our oriented method, using a simple and shallow neural network model, reaches a 3D-score of 83.26, which is very close to the 84.07 obtained by a state-of-the-art deep neural network [17].

Table 4. Comparison with state-of-the-art methods

Optimized targets (a)                  3D-Score   Reference
Maximum a Posteriori                   86.93      [4]
Rotation-invariant Similarity          86.10      [18]
Trimmed Likelihood, Energy Function    84.46      [3]
MSE Weighted by Sensitivity            84.07      [17]
⇒ LTPR, LFPR, RAVD, ASSD               83.26      Present work
Energy Function, ROC, Jaccard          82.54      [10]
Dice (voxel-wise), LTPR, LPPV          82.34      [9]
Log-likelihood                         79.99      [1]
Trimmed Likelihood                     79.09      [2]

(a) LPPV: lesion-wise positive predictive value; ROC: receiver operating characteristic.

6 Conclusion

In this paper, we introduced an evaluation-oriented training strategy that targets the evaluation criteria of MS lesion segmentation. To make evaluation-oriented training feasible, we optimize cheaper 2D surrogates rather than costly 3D metrics. Unlike previous methods, evaluation-oriented training targets metrics of lesion count, volume, and shape all at once. We found that oriented training significantly improves MS segmentation quality compared with unspecific training. Moreover, the oriented training of a simple MLP yielded a segmentation quality comparable to state-of-the-art performance. Surrogate metrics are a promising approach to optimizing MS segmentation methods. Future studies should investigate oriented training for other segmentation models, evaluate slice selection strategies, and target the 2D-score computed from more slices.

References

1. Souplet, J.C., Lebrun, C., Ayache, N., Malandain, G., et al.: An automatic segmentation of T2-FLAIR multiple sclerosis lesions. In: The MIDAS Journal-MS Lesion Segmentation (MICCAI 2008 Workshop) (2008)
2. García-Lorenzo, D., Prima, S., Arnold, D.L., Collins, D.L., Barillot, C.: Trimmed-likelihood estimation for focal lesions and tissue segmentation in multisequence MRI for multiple sclerosis. IEEE Trans. Med. Imaging 30(8), 1455–1467 (2011)
3. Tomas-Fernandez, X., Warfield, S.: A model of population and subject (MOPS) intensities with application to multiple sclerosis lesion segmentation. IEEE Trans. Med. Imaging 34(6), 1349–1361 (2015)
4. Jesson, A., Arbel, T.: Hierarchical MRF and random forest segmentation of MS lesions and healthy tissues in brain MRI. In: The Longitudinal MS Lesion Segmentation Challenge (2015)
5. Lladó, X., Oliver, A., Cabezas, M., Freixenet, J., Vilanova, J.C., Quiles, A., Valls, L., Ramió-Torrentà, L., Rovira, A.: Segmentation of multiple sclerosis lesions in brain MRI: a review of automated approaches. Inf. Sci. 186(1), 164–185 (2012)
6. García-Lorenzo, D., Francis, S., Narayanan, S., Arnold, D.L., Collins, D.L.: Review of automatic segmentation methods of multiple sclerosis white matter lesions on conventional magnetic resonance imaging. Med. Image Anal. 17(1), 1–18 (2013)
7. Lecoeur, J., Ferré, J.C., Barillot, C.: Optimized supervised segmentation of MS lesions from multispectral MRIs. In: MICCAI workshop on Medical Image Analysis on Multiple Sclerosis (Validation and Methodological Issues) (2009)
8. Barkhof, F., Filippi, M., Miller, D.H., Scheltens, P., Campi, A., Polman, C.H., Comi, G., Ader, H.J., Losseff, N., Valk, J.: Comparison of MRI criteria at first presentation to predict conversion to clinically definite multiple sclerosis. Brain 120(11), 2059–2069 (1997)
9. Roura, E., Oliver, A., Cabezas, M., Valverde, S., Pareto, D., Vilanova, J.C., Ramió-Torrentà, L., Rovira, À., Lladó, X.: A toolbox for multiple sclerosis lesion segmentation. Neuroradiology, pp. 1–13 (2015)
10. Zhan, T., Zhan, Y., Liu, Z., Xiao, L., Wei, Z.: Automatic method for white matter lesion segmentation based on T1-fluid-attenuated inversion recovery images. IET Computer Vision (2015)
11. Taha, A.A., Hanbury, A.: Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 15(1), 29 (2015)
12. He, L., Chao, Y., Suzuki, K.: Two efficient label-equivalence-based connected-component labeling algorithms for 3-D binary images. IEEE Trans. Image Process. 20(8), 2122–2134 (2011)
13. Jin, Y.: Surrogate-assisted evolutionary computation: recent advances and future challenges. Swarm Evol. Comput. 1(2), 61–70 (2011)
14. Styner, M., Lee, J., Chin, B., Chin, M.S., huong Tran, H., Jewells, V., Warfield, S.: 3D segmentation in the clinic: a grand challenge II: MS lesion segmentation. In: MICCAI 2008 Workshop, pp. 1–5 (2008)
15. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2654–2662. Curran Associates, Inc. (2014)
16. Eberhart, R.C., Shi, Y.: Computational Intelligence - Concepts to Implementations. Elsevier, San Francisco (2007)
17. Brosch, T., Tang, L.Y.W., Yoo, Y., Li, D.K.B., Traboulsee, A., Tam, R.: Deep 3D convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE Trans. Med. Imaging 35(5), 1229–1239 (2016)
18. Guizard, N., Coupé, P., Fonov, V.S., Manjón, J.V., Arnold, D.L., Collins, D.L.: Rotation-invariant multi-contrast non-local means for MS lesion segmentation. NeuroImage Clin. 8, 376–389 (2015)

Corpus Callosum Segmentation in Brain MRIs via Robust Target-Localization and Joint Supervised Feature Extraction and Prediction

Lisa Y.W. Tang(✉), Tom Brosch, XingTong Liu, Youngjin Yoo, Anthony Traboulsee, David Li, and Roger Tam

MS/MRI Research Group, University of British Columbia, Vancouver, BC, Canada
[email protected]

Abstract. Accurate segmentation of the mid-sagittal corpus callosum as captured in magnetic resonance images is an important step in many clinical research studies of various neurological disorders. This task can be challenging, however, especially in clinical studies such as those of multiple sclerosis patients, whose brain structures may have undergone significant changes, rendering registrations inaccurate and hence (multi-) atlas-based segmentation algorithms inapplicable. Furthermore, the MRI scans to be segmented often vary significantly in image quality, rendering many generic unsupervised segmentation methods insufficient, as demonstrated in a recent work. In this paper, we hypothesize that adopting a supervised approach to the segmentation task may bring a breakthrough in performance. By employing a discriminative learning framework, our method automatically learns a set of latent features that are useful for identifying the target structure and that generalize well across various datasets, as our experiments demonstrate. Our evaluations, conducted on four large datasets collected from different sources and totaling 2,033 scans, demonstrate that our method achieves an average Dice similarity score of 0.93 on test sets when the models were trained on at most 300 images, while the top-performing unsupervised method could only achieve an average Dice score of 0.77.

1 Introduction

Accurate and robust segmentation of the corpus callosum (CC) in brain magnetic resonance images (MRI) is a crucial step in many automatic image analysis tasks, with clinical applications like computer-aided diagnosis and prognosis of several neurological diseases [1,6,10,16]. The segmentation task is often challenging, however, especially when dealing with clinical studies of chronic diseases, like those of multiple sclerosis patients, whose brain structures may have undergone significant changes over time, thereby rendering registrations of subject images to pre-segmented brain templates inaccurate [12,18]. Consequently, the standard approach of employing (multi-) atlas-based segmentation algorithms may fail, as reported in recent publications [15,18]. Furthermore, the MRI scans to be segmented often vary significantly in image quality, so that many unsupervised segmentation algorithms, which often require delicate image-dependent parameter tuning, become impractical when a large dataset of images must be segmented. As demonstrated in [18], a multi-atlas-based algorithm custom-designed for the segmentation of the corpus callosum achieved a segmentation accuracy of 0.77 in Dice similarity score (DSC), as evaluated on a challenging clinical dataset, while other generic multi-atlas-based methods, like weighted majority voting and STEPS [8], achieved an average DSC of 0.65 to 0.70.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 406–414, 2016. DOI: 10.1007/978-3-319-46723-8_47

In view of the availability of several public image datasets [12,14], some of which are accompanied by expert segmentations, we hypothesize that adopting a supervised approach [5,7] to the segmentation task may provide a better alternative to existing unsupervised methods. More specifically, we propose a two-stage process to improve the efficiency (of model training) and the accuracy of the final segmentation. In the first stage, we employ multi-atlas-based segmentation to localize the CC structure. Because the neighbouring structures of the CC provide adequate contextual information, our localization step is shown to be tolerant to imaging artifacts and shape variations caused by significant brain atrophy, and is sufficient for CC localization, which requires less precision than the actual CC segmentation task. In the second stage, we adopt a discriminative learning approach in which we employ a subset of the available expert segmentations from a single dataset to train a neural network architecture called a convolutional encoder network (CEN) [7], which performs feature extraction and model parameter training in a joint manner.

Our two-stage pipeline is motivated as follows. Rather than training a CEN that would extract features from the entire image slice, we propose to examine image features only within the bounding box of the target structure. By first performing target localization to identify the region of interest (ROI) of the target structure, our two-stage approach explicitly constrains the network to learn and extract features only around the periphery of the CC, thereby yielding features that are more relevant to differentiating the CC structure from its neighbouring vessels, which, as similarly reported in [18], often confused existing unsupervised segmentation algorithms.

To this end, our main contributions are as follows. We (1) propose a novel pipeline for a fully automatic CC segmentation algorithm; (2) performed a comparative analysis with existing unsupervised methods; and (3) conducted an extensive evaluation of our method using 4 datasets involving over 2,000 scans with different image characteristics.

2 Methods

2.1 ROI Localization

In localizing the CC structure of each image, we employ a multi-atlas-based segmentation approach. This step is done both at training and at test time. Our localization process involves registering 5 segmented atlases from the public dataset of Heckemann et al. [12], which allows us to transfer the CC segmentation label of each atlas onto the target image to be segmented. After registration, we perform weighted majority voting on the transferred atlas labels and threshold the cast votes to define a rough outline of the CC in each image. To obtain a confident estimate of the CC region, we employ a relatively high threshold value of 0.7 and set the weight of each atlas's vote to be proportional to the average local correlation between the registered atlas and the subject image being segmented. For registration, we employ a coarse-to-fine, multi-stage (rigid to affine) registration pipeline as provided by the ANTS registration toolkit [2]. Initial experiments demonstrated that the localization results are insensitive to the choice of similarity measure. We chose the mutual information measure because of its low computational and time requirements (as opposed to the local cross-correlation measure). In calculating the similarity measure, we also employ stochastic sampling to further increase computational efficiency.

2.2 Joint Feature-Learning and Model Training

Let there be a training set of $N$ images $\mathcal{I} = \{I_n\}_{n=1}^{N}$ and corresponding CC segmentations $\mathcal{S} = \{S_n\}_{n=1}^{N}$. Following the supervised approach of [7], our goal is to seek a segmentation function $f$ that maps an input image $I_n$ to its corresponding CC segmentation $S_n$:

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{n} E(S_n, f(I_n)), \qquad (1)$$

where $\mathcal{F}$ is the set of possible segmentation functions, and $E$ is an error function that measures the sum of squared differences between $S_n$ and the segmentation predicted by $f$. As done in previous work [7], $\mathcal{F}$ is modeled by a convolutional encoder network (CEN) with shortcut connections, whose model architecture is divided into a convolutional pathway and a deconvolutional pathway [5]. The convolutional pathway is designed to automatically learn a set of features from $\mathcal{I}$ and is constructed with alternating convolutional and pooling layers, where the former convolve their input signal with convolutional filter kernels, and the latter perform averaging over blocks of data from the convolved signal. Conversely, the deconvolutional pathway is designed to reconstruct the segmentation mask using inputs from the convolutional pathway and consists of alternating deconvolutional and unpooling layers. In short, the convolutional pathway learns a hierarchical set of low-level to high-level image features, while the deconvolutional pathway learns to predict segmentations using the learned features extracted from the convolutional pathway.

As detailed in [3,7], the use of low-level and high-level image features respectively provides the means to increase localization precision and to provide contextual information to guide segmentation. Hence, to integrate these hierarchical features, the two pathways in the CEN model [7] are further linked with “shortcut connections”, via which the activations of the convolutional pathway and those of the deconvolutional pathway become connected [7].

2.3 Details on Implementation and Model-Training

We trained a 2-layer CEN where, in each layer, we employ 32 filters of size 9 × 9 for the convolutional layers and blocks of size 2 × 2 for the pooling layers. For the hidden units, we employ an improved version of the rectified linear units [7]. We stopped training when the training error converged, which was usually at around 800 epochs. To avoid overfitting [7], we also employ the dropout technique with dropout rates of 0.5–0.75. Further, recent research [7,17] has shown that pre-training the CEN model parameters, followed by adequate fine-tuning, improves accuracy across a variety of application domains. Accordingly, in a manner similar to [7], we performed pre-training on the input images layer by layer using 3 layers of convolutional restricted Boltzmann machines (convRBMs) [13]. More specifically, the first layer is trained on the input images and each subsequent layer is trained on the hidden activations of the previous layer. The model parameters of our CENs are then initialized using the weights obtained from this pre-training step. In training the convRBMs, we employ the AdaDelta algorithm and explored various weight initializations for the convRBM model parameters, i.e., with a mean of zero and a standard deviation of {0.1, 0.03, 0.01, 0.003, 0.001}, values chosen on a logarithmic scale under the condition that the images have been normalized to have a mean of 0 and a standard deviation of 1. We also explored different settings for the hyper-parameter ε of AdaDelta and optimized it empirically (ε ∈ {1e−8, 1e−9, 1e−10, 1e−11}).
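To show where the hyper-parameter ε enters, the following sketch implements one step of the AdaDelta update rule as it is commonly stated, on plain Python lists. It is not the training code used for the convRBMs: the decay rate `rho = 0.95` is an assumed value that the text does not report, and the function name and the toy quadratic used for the demonstration are illustrative.

```python
import math


def adadelta_step(params, grads, state, rho=0.95, eps=1e-8):
    """One AdaDelta update on flat lists; `state` holds two running averages.

    state[0] accumulates squared gradients, state[1] accumulates squared updates.
    `rho` is an assumed decay rate; `eps` is the hyper-parameter explored in the text.
    """
    acc_g, acc_dx = state
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        acc_g[i] = rho * acc_g[i] + (1.0 - rho) * g * g
        dx = -math.sqrt(acc_dx[i] + eps) / math.sqrt(acc_g[i] + eps) * g
        acc_dx[i] = rho * acc_dx[i] + (1.0 - rho) * dx * dx
        new_params.append(p + dx)
    return new_params


# Toy usage: a few steps on f(x) = x^2, whose gradient is 2x.
x, state = [1.0], ([0.0], [0.0])
for _ in range(10):
    x = adadelta_step(x, [2.0 * x[0]], state)
```

Since ε sets the size of the very first updates (the running average of squared updates starts at zero), smaller values of ε lead to a slower start, which is one reason to tune it empirically.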

3 Experiments and Results

Materials. The following datasets of T1w MR images were retrospectively collected for this work. Because of their retrospective nature, two protocols were used to generate segmentations of the CC for both training and testing purposes: manual and semi-automatic.

MS (in-house): 348 patients with secondary progressive multiple sclerosis (SPMS) were enrolled in a clinical trial. Each subject was scanned at 4 different timepoints: at screening, baseline (∼4 weeks after screening), year one, and year two. This resulted in a total of 280 + 270 + 280 + 300 = 1130 scans. Each scan was registered to the template space, and the mid-sagittal slice was automatically extracted from each volume. Then, the area of the CC was manually outlined by a trained medical student who was blinded to the chronological sequence of the four timepoints.

CIS (in-house): 140 subjects with clinically isolated syndrome (a prodromal phase of MS) were enrolled in another clinical trial that was completed approximately in 2013. Each subject was scanned at up to 5 different timepoints: at screening, baseline (∼4 weeks after screening), 3 months, 6 months, year one, and year two. This resulted in a total of 280 + 270 + 280 + 300 = 1130 scans. Registration and automatic extraction of the mid-sagittal slice from each volume were performed. Then, the Yuki tool [6] was applied to segment the CC from each mid-sagittal slice. Finally, each segmentation was manually edited by a blinded operator using a graphical user interface.

OAS-NC (public): normal subjects collected from the OASIS dataset¹ were downloaded from NITRC². The CC segmentations were obtained using the semi-automatic CC extraction method of [6], with 20 % of the scans subsequently edited by a blinded operator using the ITK-SNAP software, as noted in [6].

OAS-AD (public): subjects with Alzheimer’s disease, also from the OASIS dataset, were downloaded from NITRC. The CC segmentations were obtained using the same procedure as that used for the previous set.

Figure 1 shows a collection of images from each dataset, while Table 1 summarizes the characteristics of these datasets. All scans were registered to the template space and resampled to 256 × 256 with a pixel size of 1 × 1 mm prior to analysis. We chose the images acquired at the fourth timepoint of the MS dataset as the training set (N = 300). We denote this training set as MS#4.

Fig. 1. Randomly selected images from our four test datasets (rows from top to bottom): MS#1 (first row), CIS, OAS-NC, and OAS-AD datasets.

Table 1. Characteristics of the datasets: size (mm), signal-to-noise-ratio (SNRmean as given in [9]), original pixel spacing (mm) and image dimensions.

Dataset   Size of dataset    SNR    Mean CC area   Orig. spacing   Orig. dim
MS#1-4    280,270,280,300    2.06   731            1×1×1           256×256×1
CIS       140                2.54   661            1×1×1           512×512×1
OAS-AD    100                3.10   586            0.5×0.5×1       512×512×1
OAS-NC    316                2.85   573            0.5×0.5×1       512×512×1

¹ http://www.oasis-brains.org.
² https://www.nitrc.org/frs/?group_id=90.


Table 2. DSC obtained by different existing unsupervised methods and our novel supervised approach as averaged over all the available test data. “Ours” denotes using the proposed two-stage pipeline to learn and extract features only within the bounding box of the CC structure as localized from stage 1. “CEN” denotes applying the approach of [7] on the entire image slice.

       Unsupervised methods                        Supervised method
       CCTPA   WMV     STAPLE   STEPS   RW [18]    CEN only   Ours
DSC    0.734   0.776   0.713    0.790   0.798      0.929      0.947

Experiment I. Comparative analysis with unsupervised methods. We first performed a comparative analysis using existing unsupervised methods; these include the intensity-based approach of Adamson et al. [1] (denoted as CCTPA), weighted majority voting (WMV), STAPLE [4], STEPS [8], and the random walker-based algorithm of [18]. Note that all multi-atlas based methods (except CCTPA) employed the same region of interest (ROI) that we employed for our proposed two-stage pipeline. For CCTPA, as its parameters depend on the input size, we did not use the same ROI. Table 2 reports the Dice similarity score (DSC) obtained by each method as averaged over the entire MS dataset, excluding the training set MS#4. Our proposed supervised approach gave results superior to those derived from the unsupervised methods. We also examined the effect of training the CEN directly on the entire mid-sagittal slices, bypassing the CC localization step. From the same table, we can see that our two-stage ROI-specific CEN approach gave the highest DSC, thereby confirming the usefulness of the proposed two-stage pipeline.

Experiment II. Sensitivity analysis of the training sample size. Table 3 highlights the effect of training sample size (N); its top three rows report the segmentation accuracy obtained by our method for each dataset not used for training. Evidently, the more training samples are used, the higher the accuracy obtained for all four datasets. We again examined the effect of omitting the proposed two-stage pipeline by training the CEN directly on the entire image slices. Examining the bottom row of Table 3, we can observe again that our proposed two-stage approach gave higher segmentation accuracy on all datasets except for MS#1. Hence, the proposed two-stage approach is not only more efficient (i.e. a smaller image domain translates to fewer degrees of freedom being needed to model f), but is also capable of achieving better results.
Figure 2 shows example segmentation results that were obtained by thresholding the probabilistic predictions inferred from the trained CEN, i.e. with no other post-processing. Note that because we adopt a voxel-wise classification approach, and that the CENs that we employ do not explicitly enforce any form of shape and/or spatial regularization, some disconnected voxels can be mislabeled as CC, as highlighted in the figure in red. However, the produced segmentations can still be refined by fairly simple processing steps, such as connected


Table 3. Effect of using N training images randomly drawn from MS#4. Dice scores of the coloured cells indicate top performance achieved for each dataset. Asterisks denote significant improvement over second competing method (p

Spatial Aggregation of HNN for Pancreas Segmentation

90 % [1–3], the pancreas' variable shape, size and location in the abdomen limits segmentation accuracy to 0.5 which is sufficient to reject the vast amount of non-pancreas from the CT images. This initial candidate generation is sufficient to extract bounding box regions that completely surround the pancreases in all used cases with nearly 100 % recall. All candidate regions are computed during the testing phase of cross-validation (CV) as in [6]. Note that candidate region proposal is not the focus of this work and is assumed to be fixed for the rest of this study. This part could be replaced by other means of detecting an initial bounding box for pancreas detection, e.g., by RF regression [12] or sliding-window CNNs [6].

Semantic Mid-Level Segmentation Cues: We show that organ segmentation can benefit from multiple mid-level cues, like organ interior and boundary predictions. We investigate deep-learning based approaches to independently learn the pancreas' interior and boundary mid-level cues. Combining both cues via learned spatial aggregation can elevate the overall performance of this semantic segmentation system. Organ boundaries are a major mid-level cue for defining and delineating the anatomy of interest. They could prove to be essential for accurate semantic segmentation of an organ.

Fig. 1. Schematics of (a) the holistically-nested nets, in which multiple side outputs are added, and (b) the HNN-I/B network architecture for both interior (left images) and boundary (right images) detection pathways. We highlight the error back-propagation paths to illustrate the deep supervision performed at each side-output layer after the corresponding convolutional layer. As the side-outputs become smaller, the receptive field sizes get larger. This allows HNN to combine multi-scale and multi-level outputs in a learned weighted fusion layer (Figures adapted from [11] with permission).

Holistically-Nested Nets: In this work, we explicitly learn the pancreas' interior and boundary image-labeling models via Holistically-Nested Networks (HNN). Note that this type of CNN architecture was first proposed by [11] under the name "holistically-nested edge detection" as a deep learning based general image edge detection method. We find, however, that it can be a suitable method for segmenting the interior of organs as well (see Sect. 3). HNN tries to address two important issues: (1) training and prediction on the whole image end-to-end (holistically) using a per-pixel labeling cost; and (2) incorporating multi-scale and multi-level learning of deep image features [11] via auxiliary cost functions at each convolutional layer. HNN computes the image-to-image or pixel-to-pixel prediction maps (from any input raw image to its annotated labeling map). The per-pixel labeling cost function [9,11] makes it feasible to train HNN/FCN effectively using only several hundred annotated image pairs. This enables the automatic learning of rich hierarchical feature representations (contexts) that are critical to resolve spatial ambiguity in the segmentation of organs. The network structure is initialized based on an ImageNet pre-trained VGGNet model [13]. It has been shown that fine-tuning CNNs pre-trained on the general image classification task (ImageNet) is helpful to low-level tasks, e.g., edge detection [11].


H.R. Roth et al.

I/B Network Formulation: Our training data $S^{I/B} = \{(X_n, Y_n),\ n = 1, \dots, N\}$ is composed of cropped axial CT images $X_n$ (rescaled to within $[0, \dots, 255]$ with a soft-tissue window of $[-160, 240]$ HU); $Y_n^{I} \in \{0, 1\}$ and $Y_n^{B} \in \{0, 1\}$ denote the (binary) ground truths of the interior and boundary map of the pancreas, respectively, for any corresponding $X_n$. Each image is considered holistically and independently as in [11]. The network is able to learn features from these images alone, from which interior (HNN-I) and boundary (HNN-B) prediction maps can be produced. HNN can efficiently generate multi-level image features due to its deep architecture. Furthermore, multiple stages with different convolutional strides can capture the inherent scales of (organ edge/interior) labeling maps. However, due to the difficulty of learning such deep neural networks with multiple stages from scratch, we use the pre-trained network provided by [11] and fine-tune it on our specific training data sets $S^{I/B}$ with a relatively small learning rate of $10^{-6}$. We use the HNN network architecture with 5 stages, including strides of 1, 2, 4, 8 and 16, respectively, and with different receptive field sizes as suggested by the authors¹. In addition to standard CNN layers, an HNN network has $M$ side-output layers as shown in Fig. 1. These side-output layers are also realized as classifiers in which the corresponding weights are $w = (w^{(1)}, \dots, w^{(M)})$. For simplicity, all standard network layer parameters are denoted as $W$. Hence, the following objective function can be defined²: $L_{side}(W, w) = \sum_{m=1}^{M} \alpha_m\, l_{side}^{(m)}(W, w^{(m)})$. Here, $l_{side}$ denotes an image-level loss function for side-outputs, computed over all pixels in a training image pair $X$ and $Y$. Because of the heavy bias towards non-labeled pixels in the ground truth data, [11] introduces a strategy to automatically balance the loss between positive and negative classes via a per-pixel class-balancing weight $\beta$.
This offsets the imbalance between edge/interior ($y = 1$) and non-edge/exterior ($y = 0$) samples. Specifically, a class-balanced cross-entropy loss function can be used, with $j$ iterating over the spatial dimensions of the image:

$$l_{side}^{(m)}(W, w^{(m)}) = -\beta \sum_{j \in Y_+} \log \Pr\left(y_j = 1 \mid X; W, w^{(m)}\right) - (1 - \beta) \sum_{j \in Y_-} \log \Pr\left(y_j = 0 \mid X; W, w^{(m)}\right). \tag{1}$$

Here, $\beta$ is simply $|Y_-|/|Y|$ and $1 - \beta = |Y_+|/|Y|$, where $|Y_-|$ and $|Y_+|$ denote the sizes of the ground truth sets of negatives and positives, respectively. The class probability $\Pr(y_j = 1 \mid X; W, w^{(m)}) = \sigma(a_j^{(m)}) \in [0, 1]$ is computed on the activation value at each pixel $j$ using the sigmoid function $\sigma(\cdot)$. Now, organ edge/interior map predictions $\hat{Y}_{side}^{(m)} = \sigma(\hat{A}_{side}^{(m)})$ can be obtained at each side-output layer, where $\hat{A}_{side}^{(m)} \equiv \{a_j^{(m)},\ j = 1, \dots, |Y|\}$ are activations of the side-output of layer $m$. Finally, a "weighted-fusion" layer is added to the network that can be simultaneously learned during training. The loss function at the fusion layer $L_{fuse}$ is defined as $L_{fuse}(W, w, h) = \mathrm{Dist}(Y, \hat{Y}_{fuse})$, where $\hat{Y}_{fuse} \equiv \sigma\left(\sum_{m=1}^{M} h_m \hat{A}_{side}^{(m)}\right)$ with $h = (h_1, \dots, h_M)$ being the fusion weight. $\mathrm{Dist}(\cdot, \cdot)$ is a distance measure between the fused predictions and the ground truth label map. We use cross-entropy loss for this purpose. Hence, the following objective function can be minimized via standard stochastic gradient descent and back propagation:

$$(W, w, h)^{\star} = \operatorname{argmin}\ \left(L_{side}(W, w) + L_{fuse}(W, w, h)\right) \tag{2}$$

¹ https://github.com/s9xie/hed.
² We follow the notation of [11].
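To make Eq. (1) and the fusion layer concrete, the following is a minimal, dependency-free sketch rather than the authors' implementation. It assumes that each side output is given as a flat list of pre-sigmoid activations and that the labels are a flat list of 0/1 values; the function names and the small constant `eps` are hypothetical choices made for illustration.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function sigma(a)."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def class_balanced_bce(activations, labels, eps=1e-12):
    """Class-balanced cross-entropy of Eq. (1) for a single side output.

    activations: flat list of pre-sigmoid values a_j (assumed layout)
    labels:      flat list of 0/1 ground-truth values y_j
    eps:         hypothetical guard against log(0)
    """
    n = len(labels)
    n_pos = sum(labels)
    beta = (n - n_pos) / n  # beta = |Y-| / |Y|
    loss = 0.0
    for a, y in zip(activations, labels):
        p = sigmoid(a)  # Pr(y_j = 1 | X; W, w)
        if y == 1:
            loss -= beta * math.log(p + eps)
        else:
            loss -= (1.0 - beta) * math.log(1.0 - p + eps)
    return loss


def fused_prediction(side_activations, h):
    """Weighted fusion: sigma(sum_m h_m * A_side^(m)), pixel by pixel."""
    n_pixels = len(side_activations[0])
    return [
        sigmoid(sum(h_m * a[j] for h_m, a in zip(h, side_activations)))
        for j in range(n_pixels)
    ]
```

With one positive pixel out of four, the positive term is weighted by β = 0.75 and each negative term by 0.25, so the rare class is not swamped by the background.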

Testing Phase: Given image $X$, we obtain both interior (HNN-I) and boundary (HNN-B) predictions from the models' side-output layers and the weighted-fusion layer as in [11]:

$$\left(\hat{Y}_{fuse}^{I/B},\ \hat{Y}_{side}^{I_1/B_1},\ \dots,\ \hat{Y}_{side}^{I_M/B_M}\right) = \text{HNN-I/B}\left(X, (W, w, h)\right). \tag{3}$$

Learning Organ-Specific Segmentation Object Proposals: "Multiscale Combinatorial Grouping" (MCG³) [14] is one of the state-of-the-art methods for generating segmentation object proposals in computer vision. We utilize this approach to generate organ-specific superpixels based on the learned boundary prediction maps HNN-B. Superpixels are extracted via continuous oriented watershed transform at three different scales ($\hat{Y}_{side}^{B_2}$, $\hat{Y}_{side}^{B_3}$, $\hat{Y}_{fuse}^{B}$) supervisedly learned by HNN-B. This allows the computation of a hierarchy of superpixel partitions at each scale and merges superpixels across scales, thereby efficiently exploring their combinatorial space [14]. This, then, allows MCG to group the merged superpixels toward object proposals. We find that the first two levels of MCG object proposals are sufficient to achieve ∼88 % DSC (see Table 1 and Fig. 2), with the optimally computed superpixel labels using their spatial overlapping ratios against the segmentation ground truth map. All merged superpixels $S$ from the first two levels are used for the subsequently proposed spatial aggregation of HNN-I and HNN-B.

Spatial Aggregation with Random Forest: We use the superpixel set $S$ generated previously to extract features for spatial aggregation via random forest classification⁴. Within any superpixel $s \in S$ we compute simple statistics including the 1st-4th order moments and 8 percentiles [20 %, 30 %, . . . , 90 %] on CT, HNN-I, and HNN-B. Additionally, we compute the mean x, y, and z coordinates normalized by the range of the 3D candidate region (Sect. 2). This results in 39 features describing each superpixel, which are used to train a random forest classifier on the training positive or negative superpixels at each round of 4-fold CV. Empirically, we find 50 trees to be sufficient to model our feature set. A final 3D pancreas segmentation is simply obtained by stacking each slice prediction back into the space of the original CT images. No further post-processing is employed, as spatial aggregation of HNN-I and HNN-B maps for superpixel classification is already of high quality. This complete pancreas segmentation model is denoted as HNN-I/B-RF or HNN-RF.

³ https://github.com/jponttuset/mcg.
⁴ Using MATLAB's TreeBagger() class.
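The 39-dimensional superpixel descriptor described above (12 statistics on each of CT, HNN-I and HNN-B, plus 3 normalized coordinates) can be sketched as follows. This is an illustration only: reading the "1st-4th order moments" as mean, variance, skewness and kurtosis is an assumption, as are the linear percentile interpolation, the argument layout and all function names.

```python
import math


def moments(values):
    """Mean, variance, skewness, kurtosis (assumed reading of the 4 moments)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        return [mean, 0.0, 0.0, 0.0]
    std = math.sqrt(var)
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n
    return [mean, var, skew, kurt]


def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0..100) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def superpixel_features(ct, hnn_i, hnn_b, coords, region_min, region_max):
    """Build the 39 features of one superpixel.

    ct, hnn_i, hnn_b: values of the three maps inside the superpixel
    coords:           (x, y, z) position of every voxel in the superpixel
    region_min/max:   corners of the 3D candidate region (for normalization)
    """
    feats = []
    for channel in (ct, hnn_i, hnn_b):
        feats += moments(channel)                          # 4 values
        ordered = sorted(channel)
        feats += [percentile(ordered, q) for q in range(20, 100, 10)]  # 8 values
    for axis in range(3):                                  # 3 values
        mean_c = sum(c[axis] for c in coords) / len(coords)
        span = region_max[axis] - region_min[axis]
        feats.append((mean_c - region_min[axis]) / span if span else 0.0)
    return feats
```

The resulting vectors, labelled positive or negative, could then be passed to any random forest implementation; the paper reports using MATLAB's TreeBagger with 50 trees.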


[Figure 2: panels showing the CT image, $\hat{Y}_{side}^{B_2}$, $\hat{Y}_{side}^{B_3}$, $\hat{Y}_{fuse}^{B}$, and the merged superpixels]

Fig. 2. "Multiscale Combinatorial Grouping" (MCG) [14] on three different scales of learned boundary prediction maps from HNN-B: $\hat{Y}_{side}^{B_2}$, $\hat{Y}_{side}^{B_3}$, and $\hat{Y}_{fuse}^{B}$, using the original CT image as input (shown with ground truth delineation of pancreas). MCG computes superpixels at each scale and produces a set of merged superpixel-based object proposals. We only visualize the boundary probabilities p > 10 %.

3 Results and Discussion

Data: Manual tracings of the pancreas for 82 contrast-enhanced abdominal CT volumes were provided by a publicly available dataset5 [6], for ease of comparison. Our experiments are conducted on random splits of ∼60 patients for training and ∼20 for unseen testing in 4-fold cross-validation. Most previous works [1–3] use the leave-one-patient-out (LOO) protocol, which is computationally expensive (e.g., ∼15 h to process one case using a powerful workstation [1]) and may not scale up efficiently towards larger patient populations.

Evaluation: Table 1 shows the improvement from HNN-I to using spatial aggregation via HNN-RF based on thresholded probability maps (calibrated based on the training data), using DSC and average minimum distance. The average DSC is increased from 76.99 % to 78.01 % statistically significantly (p greater

[Figure 1: panels (a) concept of pair-wise intensity comparisons, (b) BRIEF- and LBP-like comparisons, (c) contextual similarity in other scan]

Fig. 1. Contextual information (BRIEF) is captured by comparing mean values of two offset locations $P_i(q)$ and $P_i(r)$. Structural content (LBP) can be obtained by fixing one voxel to be the central $P_i(0)$. When determining the training samples from (c) that are closest to the central voxel in (a) using our vantage point forest, the similarity map overlaid on (c) is obtained, which clearly outlines the corresponding psoas muscle.

by the Hamming weight of their entire feature vectors, so that all $i$ for which $d_H(i, j) = \|h_i - h_j\|_H < \tau$ will be assigned to the left node (and vice-versa). The Hamming distance measures the number of differing bits in two binary strings, $\|h_i - h_j\|_H = \Xi\{h_i \oplus h_j\}$, where $\oplus$ denotes an exclusive OR and $\Xi$ a bit count. The partitioning is recursively repeated until a minimum leaf size is reached (we store both the class distribution and the indices of the remaining training samples $S_l$ for each leaf node $l$).

During testing, each sample (query) is inserted in a tree starting at the root node. Its distance w.r.t. the training sample of the current vantage point is calculated and compared with $\tau$ (determining the direction in which the search branches off). When reaching a leaf node, the class distribution is retrieved and averaged across all trees within the forest¹. When trees are not fully grown (leaving more than one sample in each leaf node), we propose to gather all training samples from all trees that fall in the same leaf node (at least once) and perform a linear search in Hamming space to determine the k-nearest neighbours (this will later be denoted as VPF+kNN). Even though intuitively this will add computational cost, since more Hamming distances have to be evaluated, this approach is faster in practice (for small $L_{min}$) compared to deeper trees due to cache efficiencies. It is also much more efficient than performing an approximate global nearest neighbour search using locality sensitive hashing or related approaches [14].

Split Optimisation: While vantage point forests can be built completely unsupervised, we also investigate the influence of supervised split optimisation. In this case the vantage points are not chosen fully at random (as noted in Line 4 of Algorithm 1); instead, a small random set is evaluated based on the respective information gain (see [7] for details on this criterion), and the point that separates the classes best is chosen, setting $\tau$ again to the median distance for balanced trees.

¹ Our source code is publicly available at http://mpheinrich.de/software.html.

Algorithm 1. Training of Vantage Point Forest

Input: $|M|$ labelled training samples $(h_j, y_j)$, parameters: number of trees $T$, minimum leaf size $L_{min}$
Output: $T$ tree structures: indices of vantage points, thresholds $\tau$ for every node, class distributions $p(y|h_i)$ and sample indices for leaf nodes.
1  foreach $t \in T$ do
2      add initial subset $S_0 = M$ (whole training set → root) to top of stack
3      while stack is not empty do
4          retrieve $S_n$ from stack, select vantage point $j \in S_n$ (randomly)
5          if $|S_n| > L_{min}$ then
6              calculate $d_H(i, j) = \|h_i - h_j\|_H\ \forall i \in S_n$, and median distance $\tau = \tilde{d}_H$
7              partition elements $i$ of $S_n$ in two disjunct subsets $S_n^l = \{i \mid d_H(i, j) < \tau\}$, $S_n^r = S_n \setminus S_n^l$ and add them to stack
8          else
9              store $p(y|h_i)$ and sample indices of $S_l$ (leaf node)

2.3 Spatial Regularisation Using Multi-label Random Walk

Even though the employed features provide good contextual information, the classification output is not necessarily spatially consistent. It may therefore be beneficial for a dense segmentation task to spatially regularise the obtained probability maps $P^y(x)$ (in practice the classification is performed on a coarser grid, so probabilities are first linearly interpolated). We employ the multi-label random walk [15] to obtain a smooth probability map $P(x)^y_{reg}$ for every label $y \in C$ by minimising $E(P(x)^y_{reg})$:

$$E(P(x)^y_{reg}) = \sum_x \frac{1}{2} \left(P(x)^y - P(x)^y_{reg}\right)^2 + \sum_x \frac{\lambda}{2} \left\|\nabla P(x)^y_{reg}\right\|^2 \tag{1}$$

where the regularisation weight is $\lambda$. The gradient of the probability map is weighted by $w_j = \exp\left(-(I(x_i) - I(x_j))^2 / (2\sigma_w^2)\right)$ based on differences of image intensities $I$ of $x_i$ and its neighbouring voxels $x_j \in N_i$ in order to preserve edges. Alternatively, other optimisation techniques such as graph cuts or conditional random fields (CRF) could be used, but we found that random walk provided good results and low computation times.
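A discrete version of this energy can be minimised with a few Gauss-Seidel sweeps. The sketch below is only an illustration under stated assumptions, not the solver of [15]: it works on a 2D grid with a 4-neighbourhood, applies the edge weight to the squared difference between neighbouring probabilities, and runs a fixed number of sweeps. The function name and default values are hypothetical.

```python
import math


def regularise(prob, image, lam=10.0, sigma_w=10.0, n_iter=200):
    """Smooth one label's probability map while preserving image edges.

    Minimises  1/2 * sum_i (p_i - q_i)^2
             + lam/2 * sum_{(i,j)} w_ij * (q_i - q_j)^2
    where w_ij = exp(-(I_i - I_j)^2 / (2 * sigma_w^2)).
    Setting the derivative w.r.t. q_i to zero gives the update used below.
    """
    rows, cols = len(prob), len(prob[0])
    q = [row[:] for row in prob]  # start from the unregularised map
    for _ in range(n_iter):
        for r in range(rows):
            for c in range(cols):
                num, den = prob[r][c], 1.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        diff = image[r][c] - image[rr][cc]
                        w = math.exp(-diff * diff / (2.0 * sigma_w ** 2))
                        num += lam * w * q[rr][cc]
                        den += lam * w
                q[r][c] = num / den  # Gauss-Seidel update for pixel (r, c)
    return q
```

For a multi-label problem, the function would be called once per label and the label with the highest regularised probability selected at each pixel. Across a strong intensity edge the weight is close to zero, so probabilities on either side are not mixed.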

3 Experiments

We performed automatic multi-organ segmentations for 20 abdominal contrast enhanced CT scans from the VISCERAL Anatomy 3 training dataset (and additionally for the 10 ceCT test scans) [16]. The scans form a heterogeneous dataset

Multi-organ Segmentation Using Vantage Point Forests


with various topological changes between patients. We resample the volumes to 1.5 mm isotropic resolution. Manual segmentations are available for a number of different anatomical structures and we focus on the ones which are most frequent in the dataset, namely: liver, spleen, bladder, kidneys and psoas major muscles (see example in Fig. 2 with median automatic segmentation quality).

Parameters: Classification is performed in a leave-one-out fashion. A rough foreground mask (with approx. 30 mm margin to any organ) is obtained by non-rigidly registering a mean intensity template to the unseen scan using [17]. We compare our new vantage point classifier to standard random forests (RDF) with axis-aligned splits using the implementation of [18]. For each method 15 trees are trained and either fully grown or terminated at a fixed leaf size of $L_{min} = 15$ (VPF+kNN). Using more trees did not improve classification results of RDF. The number $k$ of nearest neighbours in VPF+kNN is set to 21. A total of $n = 640$ intensity comparisons are used for all methods within patches of size 101³ voxels, after pre-smoothing the images with a Gaussian kernel with $\sigma_p = 3$ voxels. Half the features are comparisons between the voxel centred around $i$ and a randomly displaced location (LBP), and for the other half both locations are random (BRIEF). The displacement distribution is normal with a standard deviation of 20 or 40 voxels (for 320 features each). The descriptors are extracted for every fourth voxel (for testing) or sixth voxel (in training) in each dimension (except outside the foreground mask), yielding ≈500'000 training and ≈60'000 test samples. Spatial regularisation (see Sect. 2.3) is performed for all methods with optimal parameters of $\lambda = 10$ for RDF, $\lambda = 20$ for VPF and $\sigma_w = 10$ throughout (run time ≈ 20 s). RDF have been applied with either binary or real-valued (float) features. We experimented with split-node optimisation for VPF, but found (similar to [3] for ferns) that it is not necessary except when using very short feature strings (which may indicate that features of the same organs cluster together without supervision).
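The unsupervised vantage point forest training of Algorithm 1 and the VPF+kNN query, with the parameter values quoted above as defaults (15 trees, $L_{min} = 15$, $k = 21$), can be sketched as below. This is a hedged illustration rather than the authors' released code: storing each binary feature string as a Python integer, using recursion in place of the explicit stack, stopping at degenerate splits, and all function names are assumptions.

```python
import random
from collections import Counter


def hamming(a, b):
    """Hamming distance of two bit strings stored as ints: popcount of XOR."""
    return bin(a ^ b).count("1")


def build_tree(features, indices, l_min, rng):
    """Split `indices` at the median distance to a random vantage point."""
    if len(indices) <= l_min:
        return {"leaf": indices}
    vp = rng.choice(indices)                       # random vantage point
    dist = {i: hamming(features[i], features[vp]) for i in indices}
    tau = sorted(dist.values())[len(indices) // 2]  # median distance
    left = [i for i in indices if dist[i] < tau]
    right = [i for i in indices if dist[i] >= tau]
    if not left or not right:                      # degenerate split: stop
        return {"leaf": indices}
    return {"vp": vp, "tau": tau,
            "left": build_tree(features, left, l_min, rng),
            "right": build_tree(features, right, l_min, rng)}


def train_forest(features, n_trees=15, l_min=15, seed=0):
    """Train n_trees vantage point trees on the whole training set."""
    rng = random.Random(seed)
    all_idx = list(range(len(features)))
    return [build_tree(features, all_idx, l_min, rng) for _ in range(n_trees)]


def find_leaf(tree, features, query):
    """Descend by comparing the query's distance to each vantage point with tau."""
    while "leaf" not in tree:
        d = hamming(query, features[tree["vp"]])
        tree = tree["left"] if d < tree["tau"] else tree["right"]
    return tree["leaf"]


def predict_vpf_knn(forest, features, labels, query, k=21):
    """VPF+kNN: pool leaf samples from all trees, then exact kNN vote."""
    candidates = set()
    for tree in forest:
        candidates.update(find_leaf(tree, features, query))
    nearest = sorted(candidates, key=lambda i: hamming(features[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

Because every split compares whole feature vectors through a single XOR and bit count, no individual feature dimension is ever selected, which is the property the paper contrasts with axis-aligned random forest splits.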

[Figure 2: panels (a) test image, (b) ground truth, (c) random forest, (d) VP forest (ours)]

Fig. 2. Coronal view of CT segmentation: Psoas muscles and left kidney are not fully segmented using random forests. Vantage point forests better delineate the spleen and the interface between liver and right kidney (bladder is out of view).


M.P. Heinrich and M. Blendowski

Fig. 3. Distribution of Dice overlaps demonstrates that vantage point forests significantly outperform random forests (p < 0.001) and improve over several algorithms from the literature. Including the kNN search over samples within leaf nodes from all trees is particularly valuable for the narrow psoas muscles. Our results are very stable across all organs and not over-reliant on post-processing (see boxplots with grey lines).

Results: We evaluated the automatic segmentation results $A$ using the Dice overlap $D = 2|A \cap E|/(|A| + |E|)$ (compared to an expert segmentation $E$). Vantage point forests clearly outperform random forests and achieve accuracies of >0.90 for liver and kidneys and ≈0.70 for the smaller structures. Random forests benefit from using real-valued features but are on average 10 percentage points inferior, revealing in particular problems with the thin psoas muscles. Our average Dice score of 0.84 (see details in Fig. 3) is higher than results for MALF: 0.70 or SIFT keypoint transfer: 0.78 published by [8] on the same VISCERAL training set. For the test set [16], we obtain a Dice of 0.88, which is on par with the best MALF approach and only slightly inferior to the overall best performing method, which uses shape models and is orders of magnitude slower. Training times for vantage point trees are ≈15 s (over 6x faster than random forests). Applying the model to a new scan takes ≈1.5 s for each approach.
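The Dice overlap used for evaluation is simple to compute from two binary masks. The short sketch below assumes that both masks are flat sequences of 0/1 values of equal length; returning 1.0 when both masks are empty is a convention chosen here, not one stated in the paper.

```python
def dice(auto_mask, expert_mask):
    """Dice overlap D = 2|A ∩ E| / (|A| + |E|) of two flat binary masks."""
    a = {i for i, v in enumerate(auto_mask) if v}
    e = {i for i, v in enumerate(expert_mask) if v}
    if not a and not e:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(a & e) / (len(a) + len(e))
```

For example, two masks with two foreground voxels each that share one voxel give D = 2·1/(2+2) = 0.5.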

4 Conclusion

We have presented a novel classifier, vantage point forest, that is particularly well suited for multi-organ segmentation when using binary context features. It is faster to train, less prone to over-fitting and significantly more accurate than random forests (using axis-aligned splits). VP forests capture joint feature relations by comparing the entire feature vector at each node, while being computationally efficient (testing time of ≈1.5 s) due to the use of the Hamming distance (which greatly benefits from hardware popcount instructions, but if necessary real-valued features could also be employed in addition). We demonstrate state-of-the-art performance for abdominal CT segmentation, comparable to much more time-consuming multi-atlas registration (with label fusion). We obtained especially good results for small and challenging structures. Our method would also be directly applicable to other anatomies or modalities such as MRI, where the contrast insensitivity of BRIEF features would be desirable. The results of our algorithm could be further refined by adding subsequent stages (cascaded classification) and be further validated on newer benchmarks, e.g. [19].

References

1. Glocker, B., Pauly, O., Konukoglu, E., Criminisi, A.: Joint classification-regression forests for spatially structured multi-object segmentation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 870–881. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33765-9_62
2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
3. Özuysal, M., Calonder, M., Lepetit, V., Fua, P.: Fast keypoint recognition using random ferns. IEEE PAMI 32(3), 448–461 (2010)
4. Pauly, O., Glocker, B., Criminisi, A., Mateus, D., Möller, A.M., Nekolla, S., Navab, N.: Fast multiple organ detection and localization in whole-body MR dixon sequences. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011. LNCS, vol. 6893, pp. 239–247. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23626-6_30
5. Zikic, D., Glocker, B., Criminisi, A.: Encoding atlases by randomized classification forests for efficient multi-atlas label propagation. Med. Image Anal. 18(8), 1262–1273 (2014)
6. Calonder, M., Lepetit, V., Ozuysal, M., Trzcinski, T., Strecha, C., Fua, P.: BRIEF: computing a local binary descriptor very fast. IEEE PAMI 34(7), 1281–1298 (2012)
7. Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found. Trends. Comp. Graph. Vis. 7(2–3), 81–227 (2012)
8. Wachinger, C., Toews, M., Langs, G., Wells, W., Golland, P.: Keypoint transfer segmentation. In: Ourselin, S., Alexander, D.C., Westin, C.-F., Cardoso, M.J. (eds.) IPMI 2015. LNCS, vol. 9123, pp. 233–245. Springer, Heidelberg (2015). doi:10.1007/978-3-319-19992-4_18
9. Yianilos, P.N.: Data structures and algorithms for nearest neighbor search in general metric spaces. SODA 93, 311–321 (1993)
10. Kumar, N., Zhang, L., Nayar, S.: What is a good nearest neighbors algorithm for finding similar patches in images? In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5303, pp. 364–378. Springer, Heidelberg (2008). doi:10.1007/978-3-540-88688-4_27
11. Menze, B.H., Kelm, B.M., Splitthoff, D.N., Koethe, U., Hamprecht, F.A.: On oblique random forests. In: ECML, pp. 453–469 (2011)
12. Schneider, M., Hirsch, S., Weber, B., Székely, G., Menze, B.: Joint 3-d vessel segmentation and centerline extraction using oblique hough forests with steerable filters. Med. Image Anal. 19(1), 220–249 (2015)
13. Ahonen, T., Hadid, A., Pietikainen, M.: Face description with local binary patterns: application to face recognition. IEEE PAMI 28(12), 2037–2041 (2006)
14. Muja, M., Lowe, D.G.: Fast matching of binary features. In: CRV, pp. 404–410 (2012)
15. Grady, L.: Multilabel random walker image segmentation using prior models. In: CVPR, pp. 763–770 (2005)
16. Jiménez-del Toro, O., et al.: Cloud-based evaluation of anatomical structure segmentation and landmark detection algorithms: VISCERAL anatomy benchmarks. IEEE Trans. Med. Imaging, 1–20 (2016)
17. Heinrich, M., Jenkinson, M., Brady, J., Schnabel, J.: MRF-based deformable registration and ventilation estimation of lung CT. IEEE Trans. Med. Imaging 32(7), 1239–1248 (2013)
18. Dollar, P., Rabaud, V.: Piotr Dollar's image and video toolbox for matlab. UC San Diego (2013). https://github.com/pdollar/toolbox
19. Xu, Z., Lee, C., Heinrich, M., Modat, M., Rueckert, D., Ourselin, S., Abramson, R., Landman, B.: Evaluation of six registration methods for the human abdomen on clinically acquired CT. IEEE Trans. Biomed. Eng. 1–10 (2016)

Multiple Object Segmentation and Tracking by Bayes Risk Minimization

Tomáš Sixta and Boris Flach

Department of Cybernetics, Faculty of Electrical Engineering, Center for Machine Perception, Czech Technical University in Prague, Prague, Czech Republic

Abstract. Motion analysis of cells and subcellular particles like vesicles, microtubules or membrane receptors is essential for understanding various processes which take place in living tissue. Manual detection and tracking is usually infeasible due to the large number of particles. In addition, the images are often distorted by noise caused by the limited resolution of optical microscopes, which makes the analysis even more challenging. In this paper we formulate the task of detection and tracking of small objects as a Bayes risk minimization. We introduce a novel spatiotemporal probabilistic graphical model which models the dynamics of individual particles as well as their relations, and propose a loss function suitable for this task. Performance of our method is evaluated on artificial but highly realistic data from the 2012 ISBI Particle Tracking Challenge [8]. We show that our approach is fully comparable to or even outperforms state-of-the-art methods.

Keywords: Multiple object tracking · Graphical models · Bayes risk minimization

1 Introduction

Multiple object tracking (MOT) is an important tool for understanding various processes in living tissues by analyzing the dynamics and interactions of cells and subcellular particles. Tracking biological cells is challenging, because they may appear and disappear at any moment, reproduce or merge (e.g. a macrophage engulfing a pathogen) and often cannot be visually distinguished from each other. Existing techniques for MOT (see [5] for an excellent review) often consist of two separate modules: a detector and a tracker. The reliability of the detector is crucial for their success, because without feedback from the tracker it is impossible to recover from detection errors. Unfortunately, current visualization techniques (e.g. fluorescence microscopy) are prone to noise, which motivates the development of more elaborate tracking methods.

Network flow based methods formulate the MOT as a min-cost flow problem in a network, which is constructed using the detector output and local motion information. A typical example is [16]. A similar formulation is adopted in [2] and solved by the k-shortest paths algorithm. Integer linear programming (ILP) can be used to handle aspects of MOT which are difficult to model by a flow-based formulation (object-object interactions). The ILP framework allows detection and tracking to be expressed by a single objective [3,14], which avoids the problem of propagating detection errors. This additional power however comes at a cost, because ILP is NP-hard in general and can be solved only approximately by an LP relaxation [12].

A natural way to deal with the uncertainty of the detector is to use a probabilistic model. The model in [4] employs interacting Kalman filters with several candidate detections. Many works start by assuming that a sparse tracking graph has already been created and can be used as an input. Nodes of the graph then correspond either to the detection candidates [7] or short tracklets [15], and potential functions are used to model the shape prior, particle dynamics, interactions among the objects or their membership in social groups [9]. Node labels may carry information about presence or absence of an object, its position and/or its depth [13] to resolve occlusions. In [1] the detection and data association tasks are solved jointly in a single probabilistic framework. Their model however requires the number of objects to be known a priori, which severely restricts its applicability to tracking biological particles.

The optimal output of a probabilistic tracker is determined by the loss function. Despite its importance it receives surprisingly little attention in the literature. In fact the zero-one loss has become a de facto standard, and although it often leads to acceptable results, we believe that a better loss could lead to more robust and accurate trackers.

In this paper we propose a novel probabilistic model for MOT, which can deal with an unknown and changing number of objects, model their interactions and recover from detection errors. The tracking task is formulated as minimization of a Bayes risk with a loss function specifically designed for this purpose. The optimal set of trajectories is found by a greedy optimization algorithm.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 607–615, 2016. DOI: 10.1007/978-3-319-46723-8_70

2 Multiple Object Tracking

2.1 The Model

We assume that we are given a sequence of $n$ images on a domain $D \subset \mathbb{Z}^2$ and corresponding measurements $U = \{U_1, U_2, \ldots, U_n\}$. Each $U_t$ contains a detection score for each pixel, a number from the interval $[0, 1]$. These scores are not final decisions, however: $U_t(i) = 1$ does not necessarily imply the presence of an object at pixel $i$, and $U_t(i) = 0$ does not exclude an object at that position. To account for the noise in the images and the resulting uncertainty of the detection scores, we model a conditional distribution of a set of trajectories, $p_\theta(S \mid U)$. The basic building block of the model is a trajectory, represented as a sequence of $n$ discrete positions of a single particle. A special label indicates that the particle is not in the scene. This makes it possible to deal with a variable and unknown number of particles, as the only requirement is that the model has at least as many trajectories as there are particles in the scene. The probability of a set of trajectories $S$ given measurements $U$ is defined by a graphical model that involves terms for individual trajectories, interaction potentials and hard constraints, which ensure that the set of trajectories is physically possible:

$$p_\theta(S \mid U) = \frac{1}{Z(\theta)} \exp\Big[\sum_{s \in S} \phi_\theta(s, U) + \sum_{s, s' \in S} \psi_\theta(s, s') + \sum_{s \in S} \kappa_\theta(s)\Big], \tag{1}$$

where $Z(\theta)$ is the normalization constant. The single trajectory potentials $\phi_\theta(s, U)$ are sums of contributions of the appearance model, the motion model (proportional to the distance between the expected and the real position of the particle) and penalties for trajectories that enter the scene (this prevents the tracker from modelling a single particle by several short trajectories). The interaction potentials $\psi_\theta(s, s')$ prevent two trajectories from getting too close to each other, which eliminates the need for rather heuristic non-maximum suppression. The hard constraints $\kappa_\theta(s)$ are $0/{-\infty}$-valued functions used mainly to ensure that no particle exceeds some predefined maximum velocity.

2.2 The Loss Function

In the context of Bayes risk minimization, a good loss function should reflect the properties of the physical system. For example, the commonly used zero-one loss is not a proper choice, because it cannot distinguish between solutions that are only slightly off and solutions that are completely wrong. Another common choice, the additive L2 loss, is also inappropriate, because it is unclear how to evaluate it for two sets of trajectories with different cardinality. The loss function we propose is based on Baddeley's delta loss, which was originally developed for binary images [11]. Let $I$ be a binary image on a discrete domain $D \subset \mathbb{Z}^2$ and $\delta_I$ its distance transform with respect to a truncated pixel distance $\min\{\mathrm{dist}(i, j), \epsilon\}$ (Fig. 1a). The value of Baddeley's loss for two binary images $I$ and $\hat{I}$ is the squared Euclidean distance between $\delta_I$ and $\delta_{\hat{I}}$:

$$B(I, \hat{I}) = \sum_{i \in D} \big(\delta_I(i) - \delta_{\hat{I}}(i)\big)^2. \tag{2}$$
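To make Eq. (2) concrete, the following is a minimal sketch of Baddeley's delta loss for small binary images. It assumes the truncated L1 (city-block) pixel distance of Fig. 1a and computes the distance transform with a simple breadth-first search; the function names are illustrative and are not taken from the paper.

```python
from collections import deque


def truncated_distance_transform(image, eps):
    """Return min(L1 distance to nearest foreground pixel, eps) for every pixel.

    `image` is a list of rows with 0/1 entries. If there is no foreground
    pixel at all, every entry of the result equals eps.
    """
    h, w = len(image), len(image[0])
    dist = [[eps] * w for _ in range(h)]
    queue = deque()
    for r in range(h):
        for c in range(w):
            if image[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    # Multi-source BFS over the 4-neighbourhood yields L1 distances.
    while queue:
        r, c = queue.popleft()
        d = dist[r][c] + 1
        if d >= eps:
            continue  # anything further away is truncated to eps anyway
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and d < dist[nr][nc]:
                dist[nr][nc] = d
                queue.append((nr, nc))
    return dist


def baddeley_loss(image_a, image_b, eps=2):
    """Squared Euclidean distance between the two truncated distance transforms."""
    da = truncated_distance_transform(image_a, eps)
    db = truncated_distance_transform(image_b, eps)
    return sum((x - y) ** 2 for ra, rb in zip(da, db) for x, y in zip(ra, rb))
```

On a 3 × 3 image with a single foreground pixel in the centre, an estimate that is shifted by one pixel receives a smaller loss than an empty estimate, whereas the zero-one loss would treat both as equally wrong.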

Fig. 1. (a) Distance transform of 2D binary images with truncated L1 norm ($\epsilon = 2$). (b) Two sets of trajectories with identical single frame distance transforms. (c) Set of sites (tracklets) for two consecutive frames. Trajectories in $S$ are drawn with a thicker line.

To make the loss applicable to the tracking task, we have to define an appropriate set of sites. The simplest option is to take the pixels of the input images as sites; the foreground sites would then be the positions where the trajectories intersect the frames. The resulting loss has a serious flaw, however, because it cannot distinguish between parallel and crossing trajectories (Fig. 1b). To solve this issue we also consider all conceivable tracklets of length two, and the foreground sites are those which are part of some trajectory (Fig. 1c). The distance between two tracklets is the sum of the distances of their positions in the individual frames. Because of the maximum velocity constraint, we often have to consider only a subset of the tracklets. Our loss function is then defined as Baddeley's loss over all tracklets of length one (pixels) and two (we denote this set by $TR(D)$):

$$L(S, \hat{S}) = \sum_{a \in TR(D)} \big(\delta_S(a) - \delta_{\hat{S}}(a)\big)^2. \tag{3}$$

This loss function has several appealing properties. It is permutation invariant, does not require trajectories from $S$ and $\hat{S}$ to be matched artificially, and is well defined for sets of trajectories with different cardinality.
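The effect of adding length-two tracklets to the set of sites can be illustrated with a small sketch of Eq. (3). It makes several simplifying assumptions that are not part of the paper: positions are one-dimensional integers, the distance is the absolute difference, all tracklets are enumerated (no velocity-based pruning), and a trajectory is a tuple with one position per frame (`None` when the particle is absent). All names are illustrative.

```python
from itertools import product


def distance_transform(sites, foreground, eps):
    """Truncated distance of each site to the nearest foreground site.

    The distance between two sites is the sum of per-frame |differences|.
    """
    return {
        a: min([eps] + [sum(abs(x - y) for x, y in zip(a, b)) for b in foreground])
        for a in sites
    }


def foreground_sites(trajectories, t, length):
    """Pieces of length `length` starting in frame t that lie on some trajectory."""
    return [s[t:t + length] for s in trajectories if None not in s[t:t + length]]


def tracking_loss(set_a, set_b, positions, n_frames, eps=2, lengths=(1, 2)):
    """Baddeley-type loss over pixel sites (length 1) and tracklets (length 2)."""
    loss = 0
    for length in lengths:
        for t in range(n_frames - length + 1):
            sites = list(product(positions, repeat=length))
            da = distance_transform(sites, foreground_sites(set_a, t, length), eps)
            db = distance_transform(sites, foreground_sites(set_b, t, length), eps)
            loss += sum((da[a] - db[a]) ** 2 for a in sites)
    return loss


parallel = [(0, 0), (2, 2)]   # two particles that stay where they are
crossing = [(0, 2), (2, 0)]   # two particles that swap positions
pixel_only = tracking_loss(parallel, crossing, range(3), 2, lengths=(1,))
with_tracklets = tracking_loss(parallel, crossing, range(3), 2)
```

In this toy setting `pixel_only` is zero, because both sets occupy the same pixels in each frame, while `with_tracklets` is positive. The function also accepts sets of different size (including the empty set) and does not depend on the order of the trajectories.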

3 The Inference

In the Bayesian framework, the optimal set of trajectories $S^*$ minimizes the expected risk $E_{p_\theta}[L(S, \hat{S})]$ with respect to $\hat{S}$. If we substitute the proposed loss function (3) and discard terms that cannot influence the optimal solution, we end up with the following optimization task:

$$S^* = \operatorname*{argmin}_{\hat{S}} \sum_{a \in TR(D)} \Big[\delta_{\hat{S}}(a)^2 - 2\,\delta_{\hat{S}}(a)\,E_{p_\theta}[\delta_S(a)]\Big]. \tag{4}$$
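The step from the expected loss to Eq. (4) can be spelled out as follows. This is a short worked expansion under the assumption that the expectation is taken over $S \sim p_\theta(\cdot \mid U)$ while $\hat{S}$ is held fixed.

```latex
\begin{align*}
E_{p_\theta}\big[L(S, \hat{S})\big]
  &= \sum_{a \in TR(D)} E_{p_\theta}\Big[\big(\delta_S(a) - \delta_{\hat{S}}(a)\big)^2\Big] \\
  &= \sum_{a \in TR(D)} \Big( E_{p_\theta}\big[\delta_S(a)^2\big]
     - 2\,\delta_{\hat{S}}(a)\,E_{p_\theta}\big[\delta_S(a)\big]
     + \delta_{\hat{S}}(a)^2 \Big).
\end{align*}
```

The term $E_{p_\theta}[\delta_S(a)^2]$ does not depend on $\hat{S}$ and can be dropped, which leaves the two terms of Eq. (4). Only the first moments $E_{p_\theta}[\delta_S(a)]$ of the distance transform are therefore needed.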

This is a difficult energy minimization problem which, in addition, requires calculating for each tracklet the expected distance to the nearest trajectory, which is also non-trivial. We estimate $E_{p_\theta}[\delta_S(a)]$ using an MCMC sampling method and minimize the risk by a greedy algorithm. Both approaches are described in the following subsections.

3.1 Model Sampling

Our probabilistic model (1) is in fact a fully connected conditional random field in which each variable represents a single trajectory. The usual procedure for Gibbs sampling is to iteratively simulate from the conditional distributions of single variables, i.e. hidden Markov chains in our case. Although this can be done by a dynamic programming algorithm [10], its time complexity $O(n|D|^2)$ makes it prohibitively slow for our purposes (for example, for images with 512 × 512 pixels, $|D|^2 = 2^{36}$). Instead, we use Gibbs sampling in a more restricted way by sampling single components of the trajectory sequence (i.e. a single position or the special label), which is tractable and enjoys the same asymptotic properties. However, in our case Gibbs sampling (both full and restricted) suffers from poor mixing, so in addition we use the following Metropolis-Hastings scheme:

1. Take two trajectories and some frame $t$.
2. Cut both trajectories in frame $t$ and swap the sequences for subsequent frames.
3. Accept the new labelling with probability $\min\{1, p_\theta(S' \mid U) / p_\theta(S \mid U)\}$, where $S'$ denotes the set of trajectories after the cut-swap procedure was performed.

The resulting MCMC algorithm consists of iteratively repeating the restricted Gibbs and cut-swap schemes. The expected values $E_{p_\theta}[\delta_S(a)]$ are estimated by averaging over the generated samples. Calculating $\delta_S(a)$ for a given sample $S$ is straightforward, and for many commonly used distance functions it can be done in $O(|TR(D)|)$ [6].
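The cut-swap move can be sketched as below. This is only an illustration under stated assumptions: trajectories are equal-length tuples, the two trajectories and the cut frame are drawn uniformly (which makes the proposal symmetric, so the acceptance ratio reduces to the ratio in step 3), and `log_prob` is a hypothetical user-supplied function returning the unnormalized log of $p_\theta(S \mid U)$; since $Z(\theta)$ cancels in the ratio, it never has to be computed.

```python
import math
import random


def cut_swap_step(trajectories, log_prob, rng=random):
    """Perform one Metropolis-Hastings cut-swap move and return the new state."""
    if len(trajectories) < 2 or len(trajectories[0]) < 2:
        return trajectories
    n_frames = len(trajectories[0])
    i, j = rng.sample(range(len(trajectories)), 2)
    t = rng.randrange(1, n_frames)  # cut point; both heads keep at least one frame
    proposal = list(trajectories)
    proposal[i] = trajectories[i][:t] + trajectories[j][t:]
    proposal[j] = trajectories[j][:t] + trajectories[i][t:]
    # Symmetric proposal: accept with probability min(1, p(S'|U) / p(S|U)).
    log_ratio = log_prob(proposal) - log_prob(trajectories)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return trajectories
```

A proposal that violates a hard constraint has log-probability $-\infty$ and is therefore always rejected, which mirrors the role of the $0/{-\infty}$-valued terms $\kappa_\theta$ in model (1).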

3.2 Risk Minimization

We minimize the risk (4) by a greedy algorithm (Algorithm 1), which iteratively adds new trajectories and extends existing ones. The procedure Extend($S^*$, $E_{p_\theta}[\delta_S(a)]$, $t$) iterates over all the trajectories in $S^*$ that reached the $(t-1)$-th frame and, for each of them, calculates the change of the risk for all possible extensions to the $t$-th frame. The extension with the largest decrease of the risk is selected and $S^*$ is changed accordingly. The procedure stops when no trajectory extension can decrease the risk. The procedure Start($S^*$, $E_{p_\theta}[\delta_S(a)]$, $t$) iteratively adds new trajectories starting in the $t$-th frame, such that adding each trajectory leads to the largest possible decrease of the risk. If adding a new trajectory would increase the risk, the algorithm proceeds to the next frame. This algorithm is easy to implement, but it remains open whether some optimality guarantees could be proved or whether there is a different algorithm with better properties. The main focus of this paper is, however, on the model and the loss function itself, so we leave this issue to future work.

    Data: ∀a ∈ TR(D): E_pθ[δ_S(a)]
    Result: The set of optimal trajectories S*
    S* ← ∅;
    for t ← 1 to n do
        S* ← Extend(S*, E_pθ[δ_S(a)], t);
        S* ← Start(S*, E_pθ[δ_S(a)], t);
    end

Algorithm 1: Greedy risk minimization
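A brute-force sketch of Algorithm 1 on a toy problem is given below. It is not the authors' implementation and rests on several assumptions: positions are one-dimensional integers, trajectories are lists with `None` for frames in which the particle is absent, the risk of Eq. (4) is re-evaluated from scratch for every candidate, and the expected distances are passed in as a dictionary keyed by `(frame, site)`. The helper names `delta`, `risk`, `improve` and `greedy_tracks` are hypothetical.

```python
from itertools import product


def delta(trajs, positions, n_frames, eps):
    """Truncated distance of every site (pixel or length-two tracklet) to the set."""
    out = {}
    for length in (1, 2):
        for t in range(n_frames - length + 1):
            fg = [s[t:t + length] for s in trajs if None not in s[t:t + length]]
            for a in product(positions, repeat=length):
                dists = [sum(abs(x - y) for x, y in zip(a, b)) for b in fg]
                out[(t, a)] = min([eps] + dists)
    return out


def risk(trajs, expected, positions, n_frames, eps):
    """Objective of Eq. (4): sum of delta^2 - 2 * delta * E[delta] over all sites."""
    d = delta(trajs, positions, n_frames, eps)
    return sum(d[a] ** 2 - 2 * d[a] * expected[a] for a in d)


def improve(trajs, candidates, score):
    """Apply the candidate with the largest risk decrease until none lowers the risk."""
    current = score(trajs)
    while True:
        scored = [(score(c), c) for c in candidates(trajs)]
        if not scored:
            return trajs
        best_risk, best = min(scored, key=lambda rc: rc[0])
        if best_risk >= current:
            return trajs
        trajs, current = best, best_risk


def greedy_tracks(expected, positions, n_frames, eps=2, vmax=1):
    positions = list(positions)

    def score(trajs):
        return risk(trajs, expected, positions, n_frames, eps)

    trajs = []
    for t in range(n_frames):
        def extend(cur, t=t):
            # Extend trajectories that reached frame t-1, respecting the velocity bound.
            return [cur[:k] + [s[:t] + [p] + s[t + 1:]] + cur[k + 1:]
                    for k, s in enumerate(cur)
                    if t > 0 and s[t - 1] is not None and s[t] is None
                    for p in positions if abs(p - s[t - 1]) <= vmax]

        def start(cur, t=t):
            # Start a new trajectory in frame t at any position.
            return [cur + [[None] * t + [p] + [None] * (n_frames - t - 1)]
                    for p in positions]

        trajs = improve(trajs, extend, score)
        trajs = improve(trajs, start, score)
    return trajs
```

If the expected distances are set to the distance transform of a known set of trajectories, for instance `delta([[1, 2, 3]], range(5), 3, 2)`, the sketch recovers that set. In the actual method the expectations come from the MCMC samples, and an efficient implementation would update only the sites affected by a candidate rather than recompute the full risk.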

4 Experimental Results

4.1 The Data

We test our method on artificial but highly realistic data from the 2012 ISBI Particle Tracking Challenge [8]. It contains simulated videos of fluorescence microscopy images of vesicles in the cytoplasm, microtubule transport, membrane receptors and infecting viruses (3D) with three different densities (on average 100, 500 and 1000 particles per frame) and four different signal-to-noise ratios (SNR): 1, 2, 4 and 7 (see Fig. 2). Motion types correspond to the dynamics of real biological particles: Brownian for vesicles, directed (near constant velocity) for microtubules and random switching between these two for receptors and viruses. The particles may appear and disappear randomly at any position and in any frame. The data contain ambiguities similar to those in real data, including noise, clutter, parallel trajectories, intersecting trajectories, and visual merging and splitting.

Fig. 2. 2012 ISBI Particle Tracking Challenge Data. (a) Four biological scenarios were simulated (from left to right): vesicles, microtubules, receptors and viruses. (b) Each scenario contains images with four different SNR (from left to right, illustrated on images of vesicles): 1, 2, 4 and 7. (c) Images with three different particle densities are available for each SNR level. For the sake of clarity only 150 × 150 px segments of the original 512 × 512 px images are shown.

4.2 Evaluation Objectives

To make our method comparable to the other contributions to the 2012 ISBI Particle Tracking Challenge, we use the most important evaluation objectives from the challenge: the true positive rate $\alpha(S^*, S_{GT}) \in [0, 1]$, the false positive rate $\beta(S^*, S_{GT}) \in [0, \alpha(S^*, S_{GT})]$ and the root mean square error (RMSE) of true positives. The true positive rate takes the value 1 if for every trajectory in the ground truth $S_{GT}$ there is a corresponding trajectory in $S^*$ with exactly the same properties, and it is smaller than 1 if some trajectories are misplaced or missing ($\alpha(S^*, S_{GT}) = 0$ means that $S^* = \emptyset$). Spurious trajectories, which cannot be paired with any trajectory from $S_{GT}$, are taken into account by the second objective $\beta(S^*, S_{GT})$. It is equal to $\alpha(S^*, S_{GT})$ if there are none, whereas a small $\beta(S^*, S_{GT})$ indicates that there are many of them. Instead of being combined into a single measure of quality, these objectives are used separately in the challenge; thus a direct comparison of two trackers is possible only if one outperforms the other in both of them. We refer the reader to [8] for their detailed definition and a description of other auxiliary objectives used in the challenge.

Fig. 3. Tracking results for the sequence of vesicles with SNR = 7 (150 × 150 px segments): (a) frame 1, (b) frame 4, (c) frame 7, (d) frame 10. Objects found by the tracker are marked by crosses and ground truth by circles.

Fig. 4. Objectives α, β and RMSE from the 2012 ISBI particle tracking challenge for tracking low-density vesicles (SNR 1, 2, 4, and 7). Competing methods are numbered according to [8]. Team 4 did not submit results for low density vesicles.

4.3 Tracking Vesicles

We demonstrate the performance of our method on sequences of low density vesicles for all four levels of SNR. Each sequence consists of 100 images (512 × 512 px) and contains around 500 physical particles. Vesicles appear as brighter blobs in the images, but due to the noise the pixel values are not reliable detection scores. Instead, we enhanced the contrast and reduced the noise by the following filter:

1. Convolve the input images with a 3 × 3 Gaussian kernel (σ = 1) to filter out the noise.
2. Choose two thresholds $0 < t_1 < t_2 < 1$, replace all the pixel values lower than $t_1$ by $t_1$ and all the pixel values higher than $t_2$ by $t_2$.
3. Normalize the pixel values to the interval $[0, 1]$ and use them as detection scores.

The appearance model was a simple linear function of the detection scores, and we used the Brownian motion assumption. In all four experiments we used the same model parameters, and only the detector thresholds $t_1$ and $t_2$ were tuned using the first image of the sequence. The maximum velocity of the particles was bounded to 15 pixels per frame. As shown in Fig. 4, for SNR 2, 4 and 7 our method outperforms all the methods from the challenge in objectives α and β, and it is highly competitive in RMSE. Similarly to the other methods, it performs poorly on the data with SNR 1. This is caused by our oversimplified detector, which fails in that case beyond the model's ability to recover. We believe that better results could be achieved by incorporating the tracker and an improved detector into a joint model.
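The three-step detector filter described in this section can be sketched in plain Python as follows. Two details are assumptions, since the text does not specify them: image borders are handled by replicating the edge pixels, and the normalization in step 3 is taken to be the linear map $(v - t_1)/(t_2 - t_1)$. Input pixel values are assumed to lie in $[0, 1]$, and the function names are illustrative.

```python
import math


def gaussian_kernel_3x3(sigma=1.0):
    """Normalized 3 x 3 Gaussian kernel."""
    weights = [[math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2))
                for dx in (-1, 0, 1)] for dy in (-1, 0, 1)]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]


def detection_scores(image, t1, t2, sigma=1.0):
    """Smooth, clamp to [t1, t2] and rescale to [0, 1]."""
    h, w = len(image), len(image[0])
    kernel = gaussian_kernel_3x3(sigma)
    scores = []
    for r in range(h):
        row = []
        for c in range(w):
            # Step 1: Gaussian smoothing; border pixels are replicated (assumption).
            v = sum(kernel[dy + 1][dx + 1]
                    * image[min(max(r + dy, 0), h - 1)][min(max(c + dx, 0), w - 1)]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            # Step 2: clamp to [t1, t2]. Step 3: map linearly onto [0, 1].
            v = min(max(v, t1), t2)
            row.append((v - t1) / (t2 - t1))
        scores.append(row)
    return scores
```

Pixels whose smoothed value is at or below $t_1$ receive score 0 and those at or above $t_2$ receive score 1, so the two thresholds control how much of the intensity range is treated as uncertain.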

5 Conclusion and Future Work

In this paper we have proposed a novel probabilistic model and loss function for multiple object tracking. The model can recover from detection errors and can model interactions between the objects, and therefore does not need heuristics like non-maximum suppression. We demonstrate the performance of the method on data from the 2012 ISBI particle tracking challenge and show that it outperforms state-of-the-art methods in most cases. In future work we will incorporate detection and tracking into a joint model, which will allow the tracker to recover from detection errors on a larger scale. We will also focus on automated parameter learning as well as on a better detector and inference algorithm.

Acknowledgements. This work has been supported by the Grant Agency of the CTU Prague under Project SGS15/156/OHK3/2T/13 and by the Czech Science Foundation under Project 16-05872S.

References

1. Aeschliman, C., Park, J., Kak, A.C.: A probabilistic framework for joint segmentation and tracking. In: CVPR, pp. 1371–1378. IEEE (2010)
2. Berclaz, J., Fleuret, F., Turetken, E., Fua, P.: Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell. 33(9), 1806–1819 (2011)
3. Chenouard, N., Bloch, I., Olivo-Marin, J.: Multiple hypothesis tracking for cluttered biological image sequences. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2736–2750 (2013)
4. Godinez, W.J., Rohr, K.: Tracking multiple particles in fluorescence time-lapse microscopy images via probabilistic data association. IEEE Trans. Med. Imaging 34(2), 415–432 (2015)
5. Luo, W., Zhao, X., Kim, T.: Multiple object tracking: a review. CoRR abs/1409.7618 (2014)
6. Meijster, A., Roerdink, J., Hesselink, W.H.: A general algorithm for computing distance transforms in linear time. In: Goutsias, J., Vincent, L., Bloomberg, D.S. (eds.) Mathematical Morphology and its Applications to Image and Signal Processing, pp. 331–340. Springer, Boston (2000)
7. Nillius, P., Sullivan, J., Carlsson, S.: Multi-target tracking - linking identities using Bayesian network inference. In: CVPR, vol. 2, pp. 2187–2194. IEEE Computer Society (2006)
8. Olivo-Marin, J.C., Meijering, E.: Objective comparison of particle tracking methods. Nat. Methods 11(3), 281–289 (2014)
9. Pellegrini, S., Ess, A., Gool, L.: Improving data association by joint modeling of pedestrian trajectories and groupings. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 452–465. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15549-9_33
10. Rao, V., Teh, Y.W.: Fast MCMC sampling for Markov jump processes and extensions. J. Mach. Learn. Res. 14, 3207–3232 (2013). arxiv:1208.4818
11. Rue, H., Syversveen, A.R.: Bayesian object recognition with Baddeley's delta loss. Adv. Appl. Probab. 30(1), 64–84 (1998)
12. Türetken, E., Wang, X., Becker, C.J., Fua, P.: Detecting and tracking cells using network flow programming. CoRR abs/1501.05499 (2015)
13. Wang, C., de La Gorce, M., Paragios, N.: Segmentation, ordering and multi-object tracking using graphical models. In: ICCV, pp. 747–754. IEEE (2009)
14. Wu, Z., Thangali, A., Sclaroff, S., Betke, M.: Coupling detection and data association for multiple object tracking. In: CVPR, pp. 1948–1955. IEEE (2012)
15. Yang, B., Huang, C., Nevatia, R.: Learning affinities and dependencies for multi-target tracking using a CRF model. In: CVPR, pp. 1233–1240. IEEE (2011)
16. Zhang, L., Li, Y., Nevatia, R.: Global data association for multi-object tracking using network flows. In: CVPR. IEEE Computer Society (2008)

Crowd-Algorithm Collaboration for Large-Scale Endoscopic Image Annotation with Confidence

L. Maier-Hein1(B), T. Ross1, J. Gröhl1, B. Glocker2, S. Bodenstedt3, C. Stock4, E. Heim1, M. Götz5, S. Wirkert1, H. Kenngott6, S. Speidel3, and K. Maier-Hein5

1 Computer-assisted Interventions Group, German Cancer Research Center (DKFZ), Heidelberg, Germany, [email protected]
2 Biomedical Image Analysis Group, Imperial College London, London, UK
3 Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
4 Institute of Medical Biometry and Informatics, University of Heidelberg, Heidelberg, Germany
5 Medical Image Computing Group, DKFZ, Heidelberg, Germany
6 Department of General, Visceral and Transplant Surgery, University of Heidelberg, Heidelberg, Germany

Abstract. With the recent breakthrough success of machine learning based solutions for automatic image annotation, the limited availability of reference image annotations for algorithm training is one of the major bottlenecks in medical image segmentation and many other fields. Crowdsourcing has evolved as a valuable option for annotating large amounts of data while sparing the resources of experts. Yet segmentation of objects from scratch is relatively time-consuming and typically requires an initialization of the contour. The purpose of this paper is to investigate whether the concept of crowd-algorithm collaboration can be used to simultaneously (1) speed up crowd annotation and (2) improve algorithm performance based on the feedback of the crowd. Our contribution in this context is two-fold: Using benchmarking data from the MICCAI 2015 endoscopic vision challenge, we show that atlas forests extended by a novel superpixel-based confidence measure are well-suited for medical instrument segmentation in laparoscopic video data. We further demonstrate that the new algorithm and the crowd can mutually benefit from each other in a collaborative annotation process. Our method can be adapted to various applications and thus holds high potential to be used for large-scale low-cost data annotation.

L. Maier-Hein and T. Ross: Contributed equally to this paper.
K. Maier-Hein: Many thanks to Carolin Feldmann for designing Fig. 1 and Pallas Ludens for providing the annotation platform. This work was conducted within the setting of the SFB TRR 125: Cognition-guided surgery (A02, I04, A01) funded by the German Research Foundation (DFG). It was further sponsored by the Klaus Tschira Foundation.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 616–623, 2016. DOI: 10.1007/978-3-319-46723-8_71

1 Introduction

With the paradigm shift from open surgical procedures towards minimally invasive procedures, endoscopic image processing for surgical navigation, context-aware assistance, skill assessment and various other applications has been gaining increasing interest over the past years. The recent international endoscopic vision challenge, organized at MICCAI 2015, revealed that state-of-the-art methods in endoscopic image processing are almost exclusively based on machine learning techniques. However, the limited availability of training data with reference annotations capturing the wide range of anatomical/scene variance is evolving as a major bottleneck in the field because of the limited resources of medical experts.

Recently, the concept of crowdsourcing has been introduced as a valuable alternative for large-scale annotation of endoscopic images [6]. It has been shown that anonymous untrained individuals from an online community are able to generate training data of expert quality. Similar achievements were made in other biomedical imaging fields such as histopathological image analysis [1]. A remaining problem, however, is that object segmentation from scratch is relatively time-consuming and thus expensive compared to other tasks outsourced to the crowd. On the other hand, state-of-the-art annotation algorithms are already mature enough to annotate at least parts of the input image with high confidence.

To address this issue, we propose a collaborative approach to large-scale endoscopic image annotation. The concept is based on a novel atlas-forest-based segmentation algorithm that uses atlas-individual uncertainty maps to weigh training images according to their relevance for each superpixel (Spx) of a new image. As the new algorithm can estimate its own uncertainty with high accuracy, crowd feedback only needs to be acquired for regions with low confidence. Using international benchmarking data, we show that the new approach requires only a minimum of crowd input to enlarge the training data base.

2 Methods

The following sections introduce our new approach to confidence-guided instrument segmentation (Sect. 2.1), our concept for collaborative image annotation (Sect. 2.2) as well as our validation experiments (Sect. 2.3).

2.1 Confidence-Weighted Atlas Forests

The instrument segmentation methods presented in the scope of the MICCAI Endoscopic Vision Challenge 2015 were typically based on random forests (cf. e.g. [2,3]). One issue with commonly used random forests [4] is that the same classifier is applied to all new images, although it is well known that endoscopic images vary highly according to a number of parameters (e.g. hardware applied, medical application); hence, the relevance of a training image can be expected to vary crucially with the test image. Furthermore, non-approximative addition of new training data requires complete retraining of standard forests. A potentially practical way to give more weight to the most relevant training images without having to retrain the classifier is the application of so-called atlas forests [9]. Atlas forests are based on multiple random forests (atlases), each trained on a single image (or a subset of training images). A previously unseen image can then be annotated by combining the results of the individual forests [8,9]. To our knowledge, atlas forests have never been investigated in the context of medical instrument segmentation in particular and endoscopic image processing in general. The hypothesis of our work with respect to automatic instrument segmentation is:

Hypothesis I: Superpixel-specific atlas weighting using local uncertainty estimation improves atlas forest based medical instrument segmentation in laparoscopic video data.

Hence, we assume that the optimal atlases vary not only from image to image but also from (super)pixel to (super)pixel. Our approach is based on a set of training images $I_1^{train}, \ldots, I_{N_{img}}^{train}$ with corresponding reference segmentations $REF_1, \ldots, REF_{N_{img}}$ and involves the following steps (cf. Fig. 1):

Atlas forest generation: As opposed to training a single classifier with the available training data, we train one random forest $A_i$ per training image $I_i^{train}$. To be able to learn the relevance of a training image for a Spx (see confidence learning), we apply each atlas $A_i$ to each image $I_j^{train}$. Based on the generated instrument probability map $P_{ij} = A_i(I_j^{train})$, we compute a Spx-based error map $E_{ij}(x) = |REF_j(x) - P_{ij}(x)|$ that represents the difference between the probability assigned to a Spx $x$ and the true class label $REF_j(x)$, defined as the number of pixels corresponding to an instrument (according to the reference annotation) divided by the total number of pixels in the Spx.

Confidence learning: Based on $E_{ij}$, we train a regressor (uncertainty estimator) $U_{A_i}(x)$ for each atlas that estimates the error made when applying atlas $A_i$ to Spx $x$, where the Spx is represented by exactly the same features as used by the atlas forests. The error estimator can be used to generate atlas-specific confidence maps $C_i(I)$ for an image, where the confidence in a Spx $x$ is defined as $|1 - U_{A_i}(x)|$.

Confidence-weighted segmentation merging: To segment a new image $I$, all atlases $A_j$ are applied to it, and the resulting probability maps $P_j$ are merged considering the corresponding uncertainty maps $U_j(x)$ in each Spx.

Fig. 1. Concept of collaborative large-scale data annotation as described in Sect. 2.2.

In our first prototype implementation, we instantiate our proposed concept as follows. We base our method on the most recently proposed random-forest-based endoscopic instrument segmentation algorithm [3], which extends a method presented at the MICCAI 2015 challenge by classifying Spx rather than pixels and combines state-of-the-art rotation, illumination and scale invariant descriptors from different color spaces. In this paper, we apply this method to individual images rather than sets of images. For error estimation, a regression forest (number of trees: 50) is trained on a Spx basis with the same features as used by the atlas forests. To classify a Spx $x$ of a new image $I$, we initially select the most confident atlases:

$$S(x) = \{A_i \mid i \in \{1, \ldots, N_{img}\},\ U_{A_i}(x) < e_{max}\} \tag{1}$$

If $S(x)$ is empty, we add the atlases $A_j$ with the lowest $U_{A_j}(x)$ to $S(x)$ until $|S(x)| = N^A_{min}$. For each Spx (type: SEEDS), the mean of the classification results of all confident atlases is used as probability value, and Otsu's method is applied to the whole image to generate a segmentation.
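The selection rule of Eq. (1), together with the fallback for an empty $S(x)$ and the averaging step, can be sketched for a single superpixel as follows. The function and argument names are illustrative, not taken from the paper, and the subsequent Otsu thresholding of the whole image is not shown.

```python
def merge_atlas_predictions(probs, uncertainties, e_max=0.1, n_min=1):
    """Confidence-weighted merge of atlas outputs for one superpixel.

    probs[i]          instrument probability predicted by atlas A_i
    uncertainties[i]  estimated error U_{A_i}(x) of atlas A_i for this superpixel
    n_min             fallback size N^A_min, assumed to be at least 1
    """
    # Eq. (1): keep the atlases whose estimated error is below e_max.
    selected = [i for i, u in enumerate(uncertainties) if u < e_max]
    if not selected:
        # Fallback: take the n_min atlases with the lowest estimated error.
        order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
        selected = order[:n_min]
    # The mean over the selected atlases is the merged probability value.
    return sum(probs[i] for i in selected) / len(selected)
```

For example, with `probs = [0.9, 0.2, 0.8]` and `uncertainties = [0.05, 0.5, 0.08]`, only the first and third atlas pass the threshold $e_{max} = 0.1$, so the unreliable prediction 0.2 does not influence the merged probability.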

2.2 Collaborative Image Annotation

In previous work on crowd-based instrument segmentation [6], the user had to define the instrument contours from scratch. A bounding box placed around the instruments was used to clarify which object to segment. With this approach, the average time used for annotating one image was almost 2 min. The second hypothesis of this paper is:

Hypothesis II: Crowd-algorithm collaboration reduces annotation time.

Our collaborative annotation concept involves the following steps (cf. Fig. 1).

Atlas forest initialization. The confidence-weighted atlas forest is initialized according to Sect. 2.1 using all the available training data and yields the initial segmentation algorithm $AF^0$.

Iterative collaborative annotation. A previously unseen image is segmented by the current atlas forest $AF^t$. The regions with low accumulated confidence are distributed to the crowd for verification. The crowd refines the segmentation, and the resulting crowd-generated reference annotation is used to generate a new atlas $A_{N_{img}+t}$. The corrections of the crowd may be used to retrain the uncertainty estimators, and the new atlas is added to the new atlas forest $AF^{t+1}$ along with the corresponding uncertainty estimator $U_{A_{N_{img}+t}}$.

2.3 Experiments

The purpose of our validation was to confirm the two hypotheses corresponding to Sects. 2.1 and 2.2. Our experiments were performed on the data of the laparoscopic instrument segmentation challenge that had been part of the MICCAI 2015 endoscopic vision challenge. The data comprises 300 images extracted from six different laparoscopic surgeries (50 each).

Investigation of Hypothesis I. To investigate the benefits of using confidence-weighted atlas forests, we adapted the recently proposed Spx-based instrument classifier [3] already presented in Sect. 2.1. 200 images from four surgeries were used to train (1) an atlas forest $AF$ with simple averaging of the individual probability maps, which served as baseline, and (2) an atlas forest with confidence weighting $AF_w$ according to Sect. 2.1 ($e_{max} = 0.1$; $N^A_{min} = 0.1 \cdot N^{train}_{img}$). The remaining 100 images from two surgeries were used for testing. For each classifier and all test images, we determined descriptive statistics for the distance between the true label of a Spx and the corresponding computed probability. In addition, we converted the probability maps to segmentations using Otsu's method and computed precision, recall and accuracy.

Investigation of Hypothesis II. For our collaborative annotation concept, we designed two annotation tasks for the crowd, using Amazon Mechanical Turk (MTurk) as Internet-based crowdsourcing platform. In the false positive (FP) task, the crowd is presented with Spx classified as instrument that had a low accumulated confidence (here: mean of confidence averaged over all atlases that were used for the classification of the Spx) in our weighting-based method. An eraser can be used to delete regions that are not part of medical instruments. In the false negative (FN) task, the crowd is presented with Spx classified as background that had a low accumulated confidence in our weighting-based method. An eraser can be used to delete regions that are not part of the background. To investigate whether crowd-algorithm collaboration can increase annotation speed by the crowd, we initialized the atlas forest $AF_w$ with the 200 training images according to Sect. 2.2. $AF_w^0$ was then applied to the testing images, and the resulting segmentations were corrected using the two refinement tasks (majority voting with 10 users). We compared the annotation time required for the collaborative approach with the annotation time needed when segmenting the instruments from scratch.
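The aggregation of the crowd corrections by majority voting can be illustrated with the following sketch. It assumes that each worker returns a binary mask of the same size and that the vote is taken independently per pixel; how ties are resolved is not stated in the text, so the sketch assigns them to the background as an explicit assumption.

```python
def majority_vote(masks):
    """Pixel-wise majority vote over binary masks of identical shape.

    A pixel is labelled 1 only if strictly more than half of the workers
    marked it; ties therefore fall to 0 (an assumption of this sketch).
    """
    n = len(masks)
    h, w = len(masks[0]), len(masks[0][0])
    return [[1 if 2 * sum(m[r][c] for m in masks) > n else 0
             for c in range(w)] for r in range(h)]
```

With an even number of workers, such as the 10 users employed here, a 5 to 5 split is possible, which is why the tie-breaking rule should be chosen deliberately.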

3 Results

The observed median (interquartile range (IQR)) and maximum of the difference between the true class label (i.e. the number of pixels corresponding to an instrument according to the reference annotation, divided by the total number of pixels in the Spx) and the corresponding probability value on the test data were 0.07 (0.04, 0.09) and 0.20. Descriptive statistics for the accuracy of the non-weighted atlas forest $AF$ and the weighted atlas forest $AF_w$ after segmentation using Otsu's method are shown in Fig. 2. There is a trade-off between the percentage of Spx regarded as confident and the quality of classification. When varying the confidence threshold $e_{max} = 0.1$ by up to ±75 % ($e_{max} \in \{0.025, 0.05, \ldots, 0.175\}$), the (median) accuracy on the confident regions decreases monotonically from 0.99 (78 % coverage) to 0.96 (94 % coverage). This compares to a median accuracy of 0.87 for the baseline method ($AF$) and of 0.94 for $AF_w$ applied to all Spx. On the confident regions, the (median) precision was 1.00 for all thresholds, compared to 0.38 for non-weighted AFs and 0.84–0.86 for weighted AFs applied to all Spx. This high precision comes at the cost of a reduced recall (range: 0.49–0.55). Example classification results from the atlas forest and the weighted atlas forest along with the corresponding confidence map are visualized in Fig. 3.

Fig. 2. (a) Accuracy of the standard atlas forest $AF$ and the weighted atlas forest $AF_w$ when using all superpixels (Spx) of 100 test images as well as accuracy of $AF_w$ when evaluated only on the confident Spxs of these images. (b) Accuracy of $AF_w$ on just the confident Spxs for varying confidence threshold $e_{max}$. The whiskers of the box plot represent the 2.5 % and 97.5 % quantiles.

Fig. 3. Test image $I$ (a) with corresponding $AF$ classification (blue: low probability) (b), $AF_w$ classification (c) and confidence map of $AF_w$ (blue: low confidence) (d). The specular highlight is not recognized as part of the instrument but the associated uncertainty is high.

The median (IQR) and maximum percentage of atlases that had a confidence above the chosen threshold ranged from 25 % (0.05 %, 50 %) and 87 % (emax : 0.025) per Spx to 89 % (57 %, 96 %) and 100 % (emax : 0.175). The corresponding (median) percentage of Spx classified incorrectly and not shown to the crowd ranged from 0.5 % to 3.4 %. With the collaborative annotation approach, the annotation time per image could be reduced from about two minutes to less than one minute (median: 51 s; IQR: (35 s, 70 s); max: 173 s).

4 Discussion

To our knowledge, we are the first to investigate the concept of crowd-algorithm collaboration in the field of large-scale medical image annotation. Our approach involves (1) automatic initialization of crowd annotations with a new confidence-weighted atlas-forest-based algorithm and (2) using the feedback of the crowd to iteratively enlarge the training database. In analogy to recent work outside the field of medical image processing [5,7], we were able to show that collaborative annotation can speed up the annotation process considerably. Our experiments further demonstrate that the performance of an atlas on previously unseen images can be predicted with high accuracy. Hence, Spx-individual weighting of atlases improves classification performance of atlas forests compared to the non-weighted approach.

It is worth noting that we have presented only a first prototype implementation of the collaborative annotation approach. For example, we took a simple threshold-based approach to convert the set of probability maps with corresponding confidence maps into a final segmentation. Furthermore, we did not systematically optimize the parameters of our method. This should be considered when comparing the results of our atlas forest with the results of other methods. According to our experience, the performance of random forests compared to atlas forests is highly dependent on the features used. In fact, when we initially trained all classifiers on point-based features (without local binary patterns), non-weighted atlas forests showed a similar performance to random forests. In the current version, random forests [3] perform similarly to the weighted AFw when evaluated on all Spx.
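As a rough illustration of Spx-individual atlas weighting followed by a threshold-based decision, the sketch below keeps, for one superpixel, only those atlases whose estimated error is at most the confidence threshold and averages their probabilities. The combination rule, the function name `fuse_superpixel` and the 0.5 decision threshold are assumptions made for illustration; the text above does not spell out the exact rule used in the prototype.

```python
def fuse_superpixel(probs, est_errors, e_max=0.1, decision=0.5):
    """Combine per-atlas instrument probabilities for a single superpixel.

    probs      -- probability from each atlas that the superpixel is instrument
    est_errors -- estimated error of each atlas on this superpixel
    Returns (label, fused_probability). Both are None when no atlas is
    confident, i.e. the superpixel would be left for the crowd to annotate.
    """
    kept = [p for p, e in zip(probs, est_errors) if e <= e_max]
    if not kept:
        return None, None
    fused = sum(kept) / len(kept)
    return (1 if fused >= decision else 0), fused


# Two confident atlases agree on "instrument"; the uncertain one is ignored.
print(fuse_superpixel([0.9, 0.2, 0.8], [0.05, 0.30, 0.08]))
# No atlas is confident, so the superpixel is not labelled automatically.
print(fuse_superpixel([0.6, 0.4], [0.2, 0.3]))
```

A hard cut-off is the simplest possible weighting (weights of 0 or 1); a smooth weight that decreases with the estimated error would be an equally plausible variant.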


A disadvantage of our approach could be seen in the fact that we currently train one uncertainty estimator for each atlas. Note, however, that there is no need to perform the training on all images with reference annotations. Hence, the strong advantages of atlas forests are kept. A major advantage of our method is the extremely high precision. Given the 100 % precision on the confident regions, we designed an additional fill-up task, where the crowd was simply asked to complete the segmentation of the algorithm. This way, annotation times were further reduced to about 45 s per image.

In conclusion, we have shown that large-scale endoscopic image annotation using crowd-algorithm collaboration is feasible. As our method can be adapted to various applications, it could become a valuable tool in the context of big data analysis.

References

1. Albarqouni, S., Baur, C., Achilles, F., Belagiannis, V., Demirci, S., Navab, N.: Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans. Med. Imaging 35, 1313–1321 (2016)
2. Allan, M., Chang, P.-L., Ourselin, S., Hawkes, D.J., Sridhar, A., Kelly, J., Stoyanov, D.: Image based surgical instrument pose estimation with multi-class labelling and optical flow. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 331–338. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_41
3. Bodenstedt, S., Goertler, J., Wagner, M., Kenngott, H., Mueller-Stich, B.P., Dillmann, R., Speidel, S.: Superpixel-based structure classification for laparoscopic surgery. In: SPIE Medical Imaging, p. 978618 (2016)
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). http://dx.doi.org/10.1023/A%3A1010933404324
5. Cheng, J., Bernstein, M.S.: Flock: hybrid crowd-machine learning classifiers. In: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 600–611. ACM (2015)
6. Maier-Hein, L., et al.: Can masses of non-experts train highly accurate image classifiers? In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 438–445. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10470-6_55
7. Radu, A.-L., Ionescu, B., Menéndez, M., Stöttinger, J., Giunchiglia, F., Angeli, A.: A hybrid machine-crowd approach to photo retrieval result diversification. In: Gurrin, C., Hopfgartner, F., Hurst, W., Johansen, H., Lee, H., O’Connor, N. (eds.) MMM 2014. LNCS, vol. 8325, pp. 25–36. Springer, Heidelberg (2014). doi:10.1007/978-3-319-04114-8_3
8. Zikic, D., Glocker, B., Criminisi, A.: Classifier-based multi-atlas label propagation with test-specific atlas weighting for correspondence-free scenarios. In: Menze, B., Langs, G., Montillo, A., Kelm, M., Müller, H., Zhang, S., Cai, W.T., Metaxas, D. (eds.) MCV 2014. LNCS, vol. 8848, pp. 116–124. Springer, Heidelberg (2014). doi:10.1007/978-3-319-13972-2_11
9. Zikic, D., Glocker, B., Criminisi, A.: Encoding atlases by randomized classification forests for efficient multi-atlas label propagation. Med. Image Anal. 18(8), 1262–1273 (2014)

Emphysema Quantification on Cardiac CT Scans Using Hidden Markov Measure Field Model: The MESA Lung Study

Jie Yang1, Elsa D. Angelini1, Pallavi P. Balte2, Eric A. Hoffman3,4, Colin O. Wu5, Bharath A. Venkatesh6, R. Graham Barr2,7, and Andrew F. Laine1(B)

1 Department of Biomedical Engineering, Columbia University, New York, NY, USA
[email protected]
2 Department of Medicine, Columbia University Medical Center, New York, NY, USA
3 Department of Radiology, University of Iowa, Iowa City, IA, USA
4 Department of Biomedical Engineering, University of Iowa, Iowa City, IA, USA
5 Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, MD, USA
6 Department of Radiology, Johns Hopkins University, Baltimore, MD, USA
7 Department of Epidemiology, Columbia University Medical Center, New York, NY, USA

Abstract. Cardiac computed tomography (CT) scans include approximately 2/3 of the lung and can be obtained with low radiation exposure. Large population-based cohort studies have reported high correlations of emphysema quantification between full-lung (FL) and cardiac CT scans, using thresholding-based measurements. This work extends a hidden Markov measure field (HMMF) model-based segmentation method for automated emphysema quantification on cardiac CT scans. We show that the HMMF-based method, when compared with several types of thresholding, provides more reproducible emphysema segmentation on repeated cardiac scans, and more consistent measurements between longitudinal cardiac and FL scans from a diverse pool of scanner types and thousands of subjects with tens of thousands of scans.

1 Introduction

Pulmonary emphysema is defined by a loss of lung tissue in the absence of fibrosis, and overlaps considerably with chronic obstructive pulmonary disease (COPD). Full-lung (FL) quantitative computed tomography (CT) imaging is commonly used to measure a continuous score of the extent of emphysema-like lung tissue, which has been shown to be reproducible [1], and correlates well with respiratory symptoms [2]. Cardiac CT scans, which are commonly used for the assessment of coronary artery calcium scores to predict cardiac events [3], include about 70 % of the lung volume, and can be obtained with low radiation exposure. Despite missing apical and basal individual measurements, emphysema quantification on cardiac CT was shown to have high reproducibility and high correlation with FL measures [4], and to correlate well with risk factors of lung disease and mortality [5] at the population-based level. With the availability of large-scale, well-characterized cardiac CT databases such as the Multi-Ethnic Study of Atherosclerosis (MESA) [6,7], emphysema quantification on cardiac scans is now actively used in various population-based studies [8].

However, currently used methods for emphysema quantification on cardiac scans rely on measuring the percentage of lung volume (referred to as %emph) with intensity values below a fixed threshold. Although thresholding-based methods are commonly used in research, they can be very sensitive to factors leading to variation in image quality and voxel intensity distributions, including variations in scanner type, reconstruction kernel, radiation dose and slice thickness. To study %emph on heterogeneous datasets of FL scans, density correction [9], noise filtering [10,11] and reconstruction-kernel adaptation [12] have been proposed. These approaches consider only a part of the sources of variation, and their applicability to cardiac scans has not been demonstrated. Superiority of a segmentation method based on Hidden Markov Measure Field (HMMF) to thresholding-based measures and these correction methods was demonstrated in [13,14] on FL scans.

In this work, we propose to adapt the parameterization of the HMMF model to cardiac scans from 6,814 subjects in the longitudinal MESA Lung Study. Results compare HMMF and thresholding-based %emph measures for three metrics: (1) intra-cardiac scan reproducibility, (2) longitudinal correlation on “normal” subjects (never-smokers without respiratory symptoms or disease [15]), and (3) emphysema progression on “normal” and “disease” subjects.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 624–631, 2016. DOI: 10.1007/978-3-319-46723-8_72

2 Method

2.1 Data

The MESA Study consists of 6,814 subjects screened with cardiac CT scans at baseline (Exam 1, 2000–2002), and with follow-up scans in Exams 2 to 4 (2002–2008). Most subjects had two repeated cardiac scans per visit (same scanner). Among these subjects, 3,965 were enrolled in the MESA Lung Study and underwent FL scans in Exam 5 (2010–2012). Cardiac scans were collected using either an EBT scanner from GE, or six types of MDCT scanners from GE or Siemens (details in [4]). The average slice thickness is 2.82 mm, and in-plane resolution is in the range [0.44, 0.78] mm. Lung segmentation was performed with the APOLLO® software (VIDA Diagnostics, Iowa). FL scans were cropped (removing apical and basal lung) to match the field of view of the cardiac scans. Longitudinal correlation of lung volumes in incremental cardiac exams is in the range [0.84, 0.95]. Cardiac scans were acquired at full inspiration with cardiac and respiration gating, while FL scans were acquired at full inspiration without cardiac gating. For this study, we selected a random subset of 10,000 pairs of repeated cardiac scans with one in each pair considered as the “best” scan in terms of inflation or scan quality [8]. Out of these 10,000 pairs, 379 pairs were discarded due to corruption in one scan during image reconstruction or storage, detected via


Fig. 1. (a) Illustration of fitting lung-field intensity with skew-normal distribution on three cardiac scans. (b) Population average of %emphHMMF (mB(λ)) measured from normal subjects on four baseline cardiac scanners (SB) versus λ values. (c) [top] Outside air mean value (HU) per subject and per scanner used to tune μE; [bottom] Initial μN value (HU) per subject and per scanner.

Table 1. Year and number of MESA cardiac/FL CT scans evaluated.

MESA Exam #                        1           2           3           4          5
Year start-end                     2000-02     2002-04     2004-05     2005-08    2010-12
# of subjects available in MESA    6,814       2,955       2,929       1,406      3,965
# of normal subjects evaluated     741         261         307         141        827
Total # of scans evaluated         6,088 (×2)  1,164 (×2)  1,645 (×2)  724 (×2)   2,984

abnormally high values of mean and standard deviation of outside air voxel intensities (cf. Fig. 1(c) for ranges of normal values). The selected subset involves 6,552 subjects, among which 2,984 subjects had a FL scan in Exam 5, and 827 are “normals”, as detailed in Table 1. Overall, we processed a grand total of 9,621 pairs of repeated cardiac scans, 3,508 pairs of “best” longitudinal cardiac scans, and 5,134 pairs of “best” cardiac-FL scans.

2.2 HMMF-Based Emphysema Segmentation

The HMMF-based method [13] enforces spatial coherence of the segmentation, and relies on parametric models of intensity distributions within emphysematous and normal lung tissue. It uses a Gaussian distribution NE(θE) for the emphysema class and a skew-normal distribution NN(θN) for normal lung tissue. We found the skew-normal distribution model to be applicable to cardiac scans. Figure 1(a) gives examples of histogram fitting results for three cardiac scans from normal subjects.

For a given image I : Ω → ℝ, the HMMF estimates on Ω the continuous-valued measure field q ∈ [0, 1] by maximizing the posterior distribution P for q and the associated parameter vector θ = [θE, θN] expressed as:

\[ P(q, \theta \mid I) = \frac{1}{R}\, P(I \mid q, \theta)\, P_q(q)\, P_\theta(\theta) \tag{1} \]

where R is a normalization constant. The Markov random field (MRF) variable q is a vector q = [qE, qN], representing the intermediate labeling of both classes. Emphysema voxels are selected as {v ∈ Ω | qE(v) > qN(v)}, from which %emphHMMF is computed. The distribution Pq(q) enforces spatial regularity via Markovian regularization on neighborhood cliques C and involves a weight parameter λ in the potential of the Gibbs distribution. The likelihood P(I|q, θ) requires initialization of parameter values for both classes, which are tuned in this work to handle the heterogeneity of the dataset, as described below.

Parameter Tuning for Cardiac Scans. The parameters of intensity distributions are θE = [µE, σE], θN = [µN, σN, αN] where µ denotes the mean, σ the standard deviation and α the skewness of the respective classes.

Likelihood for Normal Lung Tissue: The standard deviation σN and the skewness αN are assumed to be sensitive to scanner-specific image variations. They are tuned separately for each scanner type by averaging on the subpopulation of normal subjects, after fitting their intensity histograms. The initial value of mean µN is sensitive to inflation level and morphology and therefore made subject-specific via fitting individual intensity histograms with the pre-fixed σN and αN. Measured initial µN values are plotted in Fig. 1(c).

Likelihood for Emphysema Class: The initial value of mean µE is set to the average scanner-specific outside air mean intensity value, learned on a subpopulation of both normal and disease subjects from each scanner type, and illustrated in Fig. 1(c). The standard deviation σE is set to be equal to the scanner-specific σN since the value of σ is mainly affected by image quality.

Cliques: To handle the slice thickness change from FL (mean 0.65 mm) to cardiac CT (mean 2.82 mm), the spatial clique is set to 8-connected neighborhoods in 2-D planes instead of 26-connected 3-D cliques used in [13].
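To make the two intensity models concrete, the following sketch evaluates a Gaussian density for the emphysema class and a skew-normal density for normal lung tissue, and labels each voxel by comparing the two likelihoods. It deliberately omits the Markovian regularization, so it is not the HMMF estimate. It also uses the common location/scale/shape form of the skew-normal density, which is an assumption, since the text parameterizes the class by mean, standard deviation and skewness. All function names and numeric values are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def skew_normal_pdf(x, loc, scale, shape):
    """Skew-normal density: (2 / scale) * phi(z) * Phi(shape * z)."""
    z = (x - loc) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(shape * z / math.sqrt(2.0)))
    return 2.0 / scale * phi * cdf


def percent_emph_ml(intensities, theta_e, theta_n):
    """Percentage of voxels whose emphysema likelihood exceeds the
    normal-tissue likelihood (no spatial prior, unlike the HMMF)."""
    n_emph = sum(
        1 for v in intensities
        if gaussian_pdf(v, *theta_e) > skew_normal_pdf(v, *theta_n)
    )
    return 100.0 * n_emph / len(intensities)


# Hypothetical parameters in Hounsfield units.
theta_e = (-1000.0, 20.0)      # mean, standard deviation
theta_n = (-900.0, 40.0, 3.0)  # location, scale, shape
voxels = [-1005.0, -990.0, -930.0, -880.0, -850.0, -820.0, -980.0, -900.0]
print(percent_emph_ml(voxels, theta_e, theta_n))
```

With a shape parameter of zero the skew-normal reduces to a Gaussian, which is a convenient sanity check when fitting histograms.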
Regularization Weight λ: The regularization weight λ is made scanner-specific to adapt to image quality and noise level. We denote by mX the population average of %emphHMMF measures on normal subjects using scanners in category X. There are three scanner categories: scanners used only at baseline (SB), scanners used at baseline and some follow-up times (SBF), and scanners used only at follow-up (SF). For each scanner in SB and SBF, we chose, via bootstrapping, the λB value that returns mB(λB) = 2 % (i.e. a small arbitrary value). The selection process is illustrated in Fig. 1(b). For scanners in SBF, the same λB values are used at follow-up times, leading to population %emphHMMF averages mBF(λB). Finally, the λF are chosen such that mBF(λB) = mF(λF).

Parameter Tuning for FL Scans. Parameters for the segmentation of FL scans with HMMF were tuned similarly to [13], except for λ and the initial values of µN and µE. In [13], scans reconstructed with a smooth kernel were used as a reference to set λ for noisier reconstructions. In this work, having only one reconstruction per scan, we propose to use the progression rate of %emph measured on longitudinal cardiac scans from the subpopulation of normal subjects. We set mFL(λFL) = mpr, with mpr the predicted normal population average of %emph at the time of acquisition of the FL scans, based on linear interpolation of earlier progression rates. This leads to λFL in the range [3, 3.5] for the different scanners, which is quite different from the range of λ values tuned on cardiac scans (cf. Fig. 1(b)).

2.3 Quantification via Thresholding (%emph−950)

Standard thresholding-based measures %emph−950 were obtained for comparison, using a reference threshold value Tref. Among standard values used by radiologists, Tref = −950 HU was found to generate higher intra-class correlation and lower maximal differences on a subpopulation of repeated cardiac scans. For reproducibility testing on repeated cardiac scans (same scanner), an additional measure %emph−950G was generated after Gaussian filtering, which was shown to reduce the effect of image noise level in previous studies [13]. The scale parameter of the Gaussian filter is tuned in the same manner as λ for the HMMF (i.e. matching average values of %emph−950G on normal subpopulations with the reference values). This leads to scale parameter values in the range [0.075, 0.175]. For longitudinal correlations, an additional measure %emph−950C was computed by correcting Tref with respect to the scanner-dependent bias observed on mean outside air intensity values (µE), as: Tref = −950 + (µE − (−1000)) HU.
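The fixed-threshold measure and the bias-corrected threshold can be sketched in a few lines. The function names, the list of lung voxel intensities and the outside-air mean are hypothetical; only the threshold of −950 HU and the correction formula come from the text.

```python
def percent_emph(lung_hu, threshold=-950.0):
    """Percentage of lung voxels with intensity below the threshold (HU)."""
    below = sum(1 for v in lung_hu if v < threshold)
    return 100.0 * below / len(lung_hu)


def corrected_threshold(mu_e, t_ref=-950.0, air_nominal=-1000.0):
    """Shift the reference threshold by the scanner bias on outside air:
    T = t_ref + (mu_e - air_nominal)."""
    return t_ref + (mu_e - air_nominal)


lung_hu = [-990.0, -960.0, -948.0, -940.0, -900.0, -870.0, -955.0, -800.0]
mu_e = -992.0  # hypothetical scanner-specific outside-air mean (HU)

print(percent_emph(lung_hu))                             # fixed -950 HU
print(percent_emph(lung_hu, corrected_threshold(mu_e)))  # corrected threshold
```

In this toy case, a scanner that reads outside air 8 HU too high moves the threshold to −942 HU, which changes the label of one voxel and therefore the %emph score.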

3 Experimental Results

3.1 Intra-Cardiac Scan Reproducibility

Intraclass Correlation (ICC) on Repeated Cardiac Scans. Scatter plots and ICC (average over Exams 1–4) of %emph in 9,621 pairs of repeat cardiac scans are shown in Fig. 2(a). All three measures show high reproducibility (ICC > 0.98). %emph−950G provides minor improvement compared with %emph−950 , which may be explained by the low noise level in MESA cardiac scans. Spatial Overlap of Emphysema Masks on Repeated Cardiac Scans. Lung masks of repeated cardiac scans were registered with FSL [16], using a similarity transform (7 degrees of freedom). Spatial overlap of emphysema was measured

[Fig. 2 plot legends: (a) %emph in repeated scan versus %emph in first scan, with %emph−950 ICC = 0.980, %emph−950G ICC = 0.981, %emphHMMF ICC = 0.986; (b) Dice versus %emph, with %emph−950 Dice = 0.39, %emph−950G Dice = 0.40, %emphHMMF Dice = 0.61; (c) overlap key: TP, FN, FP.]

Fig. 2. Reproducibility of %emph measures on repeated cardiac scans: (a) Intraclass correlation (ICC) (N = 9,621); (b) Dice of emphysema mask overlap for disease subjects (N = 471); (c) Example of emphysema spatial overlap on a baseline axial slice from a pair of repeated cardiac scans (TP = true positive, FN = false negative, FP = false positive).

with the Dice coefficient, on subjects with %emph−950 > 5 % (N = 471). Scatter plots and average values of Dice are reported in Fig. 2(b). Except for very few cases, HMMF returned higher overlap measures than thresholding, with an average Dice = 0.61, which is comparable to the value achieved on FL scans (0.62) [14]. Figure 2(c) gives an example of spatial overlaps of emphysema segmented on a pair of repeated cardiac scans, where there is less disagreement with HMMF.

3.2 Longitudinal Correlation and Progression of %emph

Pairwise Correlation on Longitudinal Cardiac Scans. For longitudinal cardiac scans, we correlated all baseline scans and follow-up scans acquired within a time interval of 48 months, in the population of normal subjects, who are expected to have little emphysema progression over time (only due to aging). Figure 3(a) shows that %emphHMMF measures return the highest pair-wise correlations on longitudinal cardiac scans, followed by %emph−950C measures. Emphysema Progression. Differential %emph scores ∆ were computed at follow-up times t to evaluate emphysema progression, as: ∆(t) = %emph(t) − %emph(baseline). Mean values and standard errors of the mean of ∆ for 87 normal subjects and 238 disease subjects who have three longitudinal cardiac scans and one FL scan are shown in Fig. 3(b). The %emphHMMF measures progressed steadily along cardiac and FL (measuring on cardiac field of view) scans, and at different rates for normal and disease populations. The %emph−950C measures progressed steadily across cardiac scans but decreased from cardiac to FL scans, which indicates that a single threshold is not able to provide consistency between cardiac and FL scans. Furthermore, thresholding-based measurements on cardiac scans show similar progression rates in normal and disease populations, which is not what is expected.
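The differential score ∆(t) and its group summary can be sketched as follows. The data layout (one dictionary of %emph scores per subject), the function names and the numbers are hypothetical, and the standard error of the mean is computed from the sample standard deviation, which is an assumption about the convention used.

```python
import math


def progression(scores_by_time, baseline_key="baseline"):
    """Delta(t) = %emph(t) - %emph(baseline) for one subject."""
    base = scores_by_time[baseline_key]
    return {t: s - base for t, s in scores_by_time.items() if t != baseline_key}


def mean_and_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


# Hypothetical %emph scores for three subjects at baseline and two follow-ups.
subjects = [
    {"baseline": 2.0, "t1": 2.5, "t2": 3.0},
    {"baseline": 1.0, "t1": 1.5, "t2": 2.5},
    {"baseline": 3.0, "t1": 4.0, "t2": 4.5},
]
deltas_t2 = [progression(s)["t2"] for s in subjects]
print(mean_and_sem(deltas_t2))
```

Computing the summary separately for the normal and disease groups at each follow-up time gives the kind of curves summarized in Fig. 3(b).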


Fig. 3. (a) %emph measures on longitudinal cardiac scans of normal subjects (N = 478; r = pairwise Pearson correlation); (b) Mean and standard error of the mean of emphysema progression ∆ (normal: N = 87, disease: N = 238).

Finally, we tested mixed linear regression models on all longitudinal scans to assess the progression of %emphHMMF and %emph−950C over time after adjusting for demographic and scanner related factors. The initial model (model 1) includes age at baseline, gender, race, height, weight, BMI, baseline smoking pack years, current cigarettes smoking per day, scanner type, and voxel size. In the subsequent model (model 2), to assess the effect modification for some demographic factors (including age at baseline, gender, race, baseline smoking pack years and current cigarettes smoking per day), their interaction terms with time (starting from the baseline) were added. In model 2 we observed that progression of %emphHMMF was higher with higher baseline age (p = 0.0001), baseline smoking pack years (p < 0.0001) and current cigarettes smoking per day (p = 0.03). These findings were not significant for %emph−950C except for baseline smoking pack years (p = 0.0016). Additionally, both models demonstrated that the effects of scanner types in cardiac scans are attenuated for %emphHMMF when compared with %emph−950C .
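For orientation, one plausible way to write model 2 is given below. This is only a schematic under stated assumptions: the random-effects structure (a single subject-specific random intercept b_i) and all symbols are introduced here for illustration and are not specified in the text.

```latex
% y_{ij}: %emph of subject i at scan j; t_{ij}: time since baseline.
% x_{ijk}: adjustment covariates (age at baseline, gender, race, height,
%          weight, BMI, pack years, cigarettes per day, scanner type, voxel size).
% D: subset of demographic factors that receive an interaction with time.
% b_i: assumed subject-specific random intercept; eps_{ij}: residual error.
\[
  y_{ij} = \beta_0 + \beta_t\, t_{ij}
         + \sum_{k} \beta_k\, x_{ijk}
         + \sum_{k \in \mathcal{D}} \gamma_k\, x_{ijk}\, t_{ij}
         + b_i + \varepsilon_{ij},
  \qquad b_i \sim \mathcal{N}(0, \sigma_b^2),\;
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

Under this notation, model 1 corresponds to setting all γ_k to zero, and a positive γ_k means that progression of %emph is faster for higher values of the corresponding factor.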

4 Discussions and Conclusions

This study introduced a dedicated parameter tuning framework to enable the use of an automated HMMF segmentation method to quantify emphysema in a robust and reproducible manner on a large dataset of cardiac CT scans from multiple scanners. While thresholding compared well with HMMF segmentation for intraclass correlation on repeated cardiac scans, only HMMF was able to provide high spatial overlaps of emphysema segmentations on repeated cardiac scans, consistent longitudinal measures between cardiac and FL scans, attenuated scanner effects on population-wide analysis of emphysema progression rates, and clear discrimination of emphysema progression rates between normal and disease subjects. Exploiting HMMF segmentation to quantify emphysema on low-dose cardiac CT scans has great potential given their very large incidence in clinical routine.

Acknowledgements. Funding provided by NIH/NHLBI R01-HL121270, R01-HL077612, RC1-HL100543, R01-HL093081 and N01-HC095159 through N01-HC95169, UL1-RR-024156 and UL1-RR-025005.

References

1. Mets, O.M., De Jong, P.A., Van Ginneken, B., et al.: Quantitative computed tomography in COPD: possibilities and limitations. Lung 190(2), 133–145 (2012)
2. Kirby, M., Pike, D., Sin, D.D., et al.: COPD: Do imaging measurements of emphysema and airway disease explain symptoms and exercise capacity? Radiology, p. 150037 (2015)
3. Detrano, R., Guerci, A.D., Carr, J.J., et al.: Coronary calcium as a predictor of coronary events in four racial or ethnic groups. NEJM 358(13), 1336–1345 (2008)
4. Hoffman, E.A., Jiang, R., Baumhauer, H., et al.: Reproducibility and validity of lung density measures from cardiac CT scans: the multi-ethnic study of atherosclerosis (MESA) lung study. Acad. Radiol. 16(6), 689–699 (2009)
5. Oelsner, E.C., Hoffman, E.A., Folsom, A.R., et al.: Association between emphysema-like lung on cardiac CT and mortality in persons without airflow obstruction: A cohort study. Ann. Intern. Med. 161(12), 863–873 (2014)
6. Bild, D.E., Bluemke, D.A., Burke, G.L., et al.: Multi-ethnic study of atherosclerosis: objectives and design. Am. J. Epidemiol. 156, 871–881 (2002)
7. The MESA website. https://mesa-nhlbi.org/
8. Barr, R.G., Bluemke, D.A., Ahmed, F.S., et al.: Percent emphysema, airflow obstruction, and impaired left ventricular filling. NEJM 362(3), 217–227 (2010)
9. Kim, S.S., Seo, J.B., Kim, N., et al.: Improved correlation between CT emphysema quantification and pulmonary function test by density correction of volumetric CT data based on air and aortic density. Eur. Radiol. 83(1), 57–63 (2014)
10. Schilham, A.M.R., van Ginneken, B., Gietema, H., et al.: Local noise weighted filtering for emphysema scoring of low-dose CT images. IEEE TMI 25(4), 451–463 (2006)
11. Gallardo-Estrella, L., Lynch, D.A., Prokop, M., et al.: Normalizing computed tomography data reconstructed with different filter kernels: effect on emphysema quantification. Eur. Radiol. 26, 478–486 (2016)
12. Bartel, S.T., Bierhals, A.J., Pilgram, T.K., et al.: Equating quantitative emphysema measurements on different CT image reconstructions. J. Med. Phys. 38(8), 4894–4902 (2011)
13. Hame, Y., Angelini, E.D., Hoffman, E., et al.: Adaptive quantification and longitudinal analysis of pulmonary emphysema with a hidden markov measure field model. IEEE TMI 33(7), 1527–1540 (2014)
14. Hame, Y., Angelini, E.D., Barr, R.G., et al.: Equating emphysema scores and segmentations across CT reconstructions: a comparison study. In: ISBI 2015, pp. 629–632. IEEE
15. Hoffman, E.A., Ahmed, F.S., Baumhauer, H., et al.: Variation in the percent of emphysema-like lung in a healthy, nonsmoking multiethnic sample. The MESA lung study. Ann. Am. Thorac. Soc. 11(6), 898–907 (2014)
16. Smith, S.M., Jenkinson, M., Woolrich, M.W., et al.: Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage 23, S208–S219 (2004)

Cutting Out the Middleman: Measuring Nuclear Area in Histopathology Slides Without Segmentation

Mitko Veta1(B), Paul J. van Diest2, and Josien P.W. Pluim1

1 Medical Image Analysis Group (IMAG/e), TU/e, Eindhoven, The Netherlands
[email protected]
2 Department of Pathology, UMCU, Utrecht, The Netherlands
http://tue.nl/image

Abstract. The size of nuclei in histological preparations from excised breast tumors is predictive of patient outcome (large nuclei indicate poor outcome). Pathologists take into account nuclear size when performing breast cancer grading. In addition, the mean nuclear area (MNA) has been shown to have independent prognostic value. The straightforward approach to measuring nuclear size is by performing nuclei segmentation. We hypothesize that given an image of a tumor region with known nuclei locations, the area of the individual nuclei and region statistics such as the MNA can be reliably computed directly from the image data by employing a machine learning model, without the intermediate step of nuclei segmentation. Towards this goal, we train a deep convolutional neural network model that is applied locally at each nucleus location, and can reliably measure the area of the individual nuclei and the MNA. Furthermore, we show how such an approach can be extended to perform combined nuclei detection and measurement, which is reminiscent of granulometry.

Keywords: Histopathology image analysis · Breast cancer · Deep learning · Convolutional neural networks

1 Introduction

Cancer causes changes in the tissue phenotype (tissue appearance) that can be observed in histological tissue preparations. Based on the characteristics of the cancer phenotype, patients can be stratified into groups with different expected outcomes (recurrence or survival). Such characteristics identified by pathologists are arranged in grading systems. One such instance is the Bloom-Richardson grading system that is used for estimating the prognosis of breast cancer patients after surgical removal of the tumor. It consists of estimation of three biomarkers: nuclear pleomorphism, nuclear proliferation and tubule formation. Although such grading systems are routinely applied in clinical practice, they are known to suffer from reproducibility issues due to the subjectivity of the assessment. This can result in suboptimal estimation of the prognosis of the patient and in turn poor treatment planning. With the advent of digital pathology, which involves digitization of histological slides in the form of large, gigapixel images, automated quantitative image analysis is being proposed as a solution for this problem [12].

In this paper we address the problem of measuring nuclear size in digitized histological slides from breast cancer patients. Estimation of the average nuclear size in the tissue by pathologists is part of the nuclear pleomorphism scoring (high grade tumors are characterized by large nuclear size). In addition to being part of the Bloom-Richardson grading system, nuclear size expressed as the mean nuclear area (MNA) is an independent biomarker both by manual [5,7] and automatic [11] measurement. The straightforward approach to measuring the MNA of a tumor region is to detect the locations of all nuclei or of a representative sample, measure their area by segmentation and compute the average over the sample. Thus, when designing an automatic measurement method, the more general and difficult task of automatic nuclei segmentation needs to be solved first.

We hypothesize that given an image of a tumor region with known nuclei locations, the area of the individual nuclei and region statistics such as the MNA can be reliably computed directly from the image data by employing a machine learning model, without the intermediate step of nuclei segmentation. With our approach, the machine learning model is applied locally and separately for each nucleus, i.e. on an image patch centered at the nucleus, and outputs an estimate of the nuclear area. Deep convolutional neural networks (CNN), which recently came into prominence and operate directly on raw image data, are a natural fit for such an approach. This type of model has been successfully applied to a large variety of general computer vision tasks and is becoming increasingly relevant for medical image analysis [1,3,8], including histopathology image analysis [2].

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 632–639, 2016. DOI: 10.1007/978-3-319-46723-8_73
Additionally, we show how such an approach can be extended to perform combined nuclei detection and area measurement, without relying on manual input for the nuclei locations. This is reminiscent of granulometry [13], however, instead of using mathematical morphology operators, a machine learning model that can better handle the complexity of histological images is used.

2 Dataset

The experiments in this paper were performed with the dataset of breast cancer histopathology images with manually segmented nuclei originally used in [10]. This dataset consists of 39 slides from patients with invasive breast cancer. The slides were routinely prepared, stained with hematoxylin and eosin (H&E) and digitized at ×40 magnification with a spatial resolution of 0.25 µm/pixel at the Department of Pathology, University Medical Center Utrecht, The Netherlands. From each slide, representative tumor regions of size 1 × 1 mm (resulting in images of size 4000 × 4000 pixels) were selected by an experienced pathologist. In each region, approximately 100 nuclei were selected with systematic random sampling and manually segmented by an expert observer (pathology resident). The dataset is divided into two subsets: subset A, consisting of 21 cases with 2191 manually segmented nuclei, is used as a training dataset, and subset B, consisting of 18 cases with 2073 manually segmented nuclei, is used as a testing dataset. For the experiments in this paper, subset A is further divided in two: subset A1, consisting of 14 cases, which is used for training the area measurement model, and subset A2, consisting of the remaining 7 cases, which is used as a validation dataset (to monitor for overfitting during training).
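The relation between pixel units and physical units in this dataset follows directly from the stated resolution of 0.25 µm/pixel. The short sketch below illustrates the conversion; the function names and the 1600-pixel example mask are hypothetical and are not taken from the paper.

```python
# Spatial resolution reported for the dataset (micrometres per pixel side).
UM_PER_PIXEL = 0.25


def pixels_to_um(length_px):
    """Convert a length in pixels to micrometres."""
    return length_px * UM_PER_PIXEL


def pixel_count_to_um2(area_px):
    """Convert an area given as a pixel count to square micrometres."""
    return area_px * UM_PER_PIXEL ** 2


# Side of a 4000-pixel tumor region, in micrometres (1000 µm = 1 mm).
print(pixels_to_um(4000))
# Area of a hypothetical segmentation mask covering 1600 pixels, in µm².
print(pixel_count_to_um2(1600))
```

Under this conversion, one pixel covers 0.0625 µm², so a manually segmented contour enclosing 1600 pixels would correspond to a nuclear area of 100 µm².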

3 Methods

We first address the problem of measuring the area of nuclei with known locations (centroids) in the image using a machine learning model. Then, we show how such an approach can be extended to perform combined granulometry-like nuclei detection and area measurement.

Area measurement as a classification task. We assume that the locations of all or a sample of the nuclei in the image are known (we use the nuclei locations from the manual annotation by a pathologist). Given an image patch x centered at a nucleus location, we want to learn the parameters w of a function f(x, w) that will approximate as closely as possible the area y of the nucleus in the center of the patch. This amounts to training a regression model. However, we chose to treat this problem as classification. Instead of predicting the nuclear area directly with a regression model, a classification model can predict the bin of the area histogram to which a nucleus belongs (each histogram bin represents one class in the classification problem). The number of histogram bins defines the fidelity of the nuclei area measurement. The advantage of this over training a regression model is that it enables seamless extension to a combined nuclei detection and area measurement model. The output fc(x, w) of the classification model is a vector with probabilities associated with each class (area histogram bin). The area of the nucleus in x is reconstructed as the weighted average of the histogram bin centroids with the output probabilities used as weights. This approach takes into account the confidence of the class prediction and results in a continuous output for the area measurements.

Classification model. We model fc(x, w) as a deep CNN for classification. The deep CNN model consists of eight convolutional layers and two fully connected layers. As in [9], we use filter size and padding combinations that preserve the input size, which simplifies the network design. The first convolutional layer has a kernel size of 5 × 5, and all remaining convolutional layers have kernels of size 3 × 3. The first, second, fifth and eighth layers are each followed by a 2 × 2 max-pooling layer. The first two convolutional layers have 32 feature maps and the remaining six have 64 feature maps. The first fully connected layer has 128 neurons. The second fully connected layer (output layer) has a number of neurons equal to the number of classes and is followed by a softmax function. Dropout, which has a regularization effect, is applied after the last two max-pooling layers and between the two fully connected layers. Rectified linear unit (ReLU) nonlinearities are used throughout the network.
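The reconstruction of a continuous area from the class probabilities can be illustrated with a minimal sketch. It uses the area range (16.6–151.8 µm²) and the number of bins (20) reported in the Training paragraph, and it assumes equal-width bins and clipping of out-of-range areas; both are assumptions of this sketch, as are all function names.

```python
# Area range and bin count as reported in the Training paragraph.
AREA_MIN, AREA_MAX, N_BINS = 16.6, 151.8, 20

# Assumption: equal-width bins, which gives a centroid spacing of 6.76 µm².
BIN_WIDTH = (AREA_MAX - AREA_MIN) / N_BINS
BIN_CENTROIDS = [AREA_MIN + (k + 0.5) * BIN_WIDTH for k in range(N_BINS)]


def area_to_class(area_um2):
    """Map an area to its histogram bin index (out-of-range areas are clipped)."""
    k = int((area_um2 - AREA_MIN) // BIN_WIDTH)
    return min(max(k, 0), N_BINS - 1)


def expected_area(probs):
    """Weighted average of the bin centroids, with class probabilities as weights."""
    total = sum(probs)
    return sum(p * c for p, c in zip(probs, BIN_CENTROIDS)) / total


# Hypothetical classifier output: probability mass split between bins 3 and 4.
probs = [0.0] * N_BINS
probs[3] = 0.5
probs[4] = 0.5
print(expected_area(probs))
```

With the mass split evenly between two neighboring bins, the estimate falls on the boundary between them, which is how a classifier over discrete bins can still yield a continuous area measurement.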


Fig. 1. Examples of data augmentation. New training samples are generated by random transformation of the original training samples.

Training. The range of nuclear areas in the training dataset, determined to be 16.6–151.8 µm² based on the 0.5th and 99.5th percentiles, was quantized into 20 histogram bins (20 classes for the classification problem). This number of histogram bins results in a reasonably small distance of 6.8 µm² between two neighboring bin centroids. Each nucleus was represented by a patch of size 96 × 96 pixels with a center corresponding to the nucleus centroid. This patch size is large enough to fit the largest nuclei in the dataset while still capturing some of the context. The number of classes and the patch size were chosen based on optimization on the validation set, but the performance was stable for a wide range of values. Since there is only a limited number of training samples in subset A1, data augmentation was necessary in order to avoid overfitting. The training samples were replicated by performing random translation, rotation, reflection, scaling, and color and contrast transformations. Note that the scaling transformation can change the class of the object, which is accounted for by changing the class label of the newly generated sample. We used this property to balance the distribution of classes in the training set. Each nucleus in subset A1 was replicated 1000 times. This resulted in over 1.4 million training samples. Examples of the data augmentation are shown in Fig. 1. We used the Caffe [4] deep learning framework to implement and optimize the deep CNN model. The model was optimized with mini-batch gradient descent with batch size 256 and momentum 0.9. The base learning rate of 0.01 was decreased by 10 % of the current value every 2000 iterations. In addition to dropout and data augmentation, L2 regularization was performed during training with a weight decay value of 0.001. The weights of the neural network were initialized with small random numbers drawn from a uniform distribution. All biases were initialized to 0.1.
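The learning rate schedule stated above can be read as a step decay, in which the rate is multiplied by 0.9 after every 2000 iterations. The sketch below encodes that reading; it is an illustration of the stated rule under this assumption and not the authors' solver configuration, and the function name is hypothetical.

```python
BASE_LR = 0.01     # base learning rate reported in the text
GAMMA = 0.9        # "decreased by 10 % of the current value"
STEP_SIZE = 2000   # number of iterations between two decreases


def learning_rate(iteration):
    """Step decay: BASE_LR * GAMMA ** floor(iteration / STEP_SIZE)."""
    return BASE_LR * GAMMA ** (iteration // STEP_SIZE)


for it in (0, 2000, 10000, 25000):
    print(it, learning_rate(it))
```

Under this reading, twelve decreases have been applied by iteration 25,000, so the learning rate at that point would be roughly 0.0028.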
The choice of these parameters was based on commonly used values for similar network configurations in the literature. The training was stopped after 25,000 iterations, when the loss on the validation set (subset A2) stopped decreasing.

Combined nuclei detection and area measurement. In order to train a model that can perform combined nuclei detection and classification, an additional “background” class is introduced in the classification task. This class accounts for patches that are not centered at nuclei locations. The classifier is then applied to every pixel location in a test image. The probability outputs for the “background” class are used to form a nuclei detection probability map. Local minima in this probability map below a certain threshold will correspond to nuclei centroids (the threshold value is the operating point of the detector and is subject to optimization). Once the nuclei are detected using the nuclei detection probability map, the same procedure as described before can be used to infer their size from the probability outputs of the “foreground” classes. The annotated dataset used in this paper does not allow proper sampling of “background” patches for a training set, as only a limited number of the nuclei present in an image are annotated. In order to sample the “background” class, we used the results from the automatic segmentation method in [10] as surrogate ground truth (we assume that the method correctly segments all nuclei in the image). The results from this method were used to create a mask of nuclei centroid locations. The “background” patches were then randomly sampled from the remaining image locations. Note that the surrogate ground truth was only used for sampling of the “background” class; the training samples for the remaining classes were based on the ground truth assigned by pathologists. From each image in subset A1, 40,000 “background” patches were randomly sampled and, together with the samples from the original 20 classes, used to train the new classifier. We used a neural network architecture and a training procedure identical to the ones described before. The training of this classification model converged after 40,000 iterations. For computational efficiency, the model was transformed to a fully convolutional neural network [6] by converting the fully connected layers to convolutional layers.
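The detection step described above, finding local minima of the “background” probability map that lie below a threshold, can be sketched as follows. The neighborhood radius, the strict-minimum rule (ties on a plateau yield no detection) and the toy probability map are assumptions made for this illustration only.

```python
def detect_nuclei(bg_prob, threshold=0.5, radius=1):
    """Return (row, col) positions that are strict local minima of the
    background probability map and lie below the given threshold."""
    rows, cols = len(bg_prob), len(bg_prob[0])
    detections = []
    for r in range(rows):
        for c in range(cols):
            value = bg_prob[r][c]
            if value >= threshold:
                continue
            # Collect all neighbors inside the (2*radius+1)^2 window.
            neighbours = [
                bg_prob[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                if (rr, cc) != (r, c)
            ]
            if all(value < n for n in neighbours):
                detections.append((r, c))
    return detections


# Hypothetical 4 x 5 background probability map with two candidate nuclei.
bg = [
    [0.9, 0.9, 0.9, 0.9, 0.9],
    [0.9, 0.2, 0.6, 0.9, 0.9],
    [0.9, 0.6, 0.9, 0.4, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.9],
]
print(detect_nuclei(bg, threshold=0.5))
print(detect_nuclei(bg, threshold=0.3))
```

Lowering the threshold makes the detector more conservative: in this toy map the second candidate is dropped at a threshold of 0.3, which illustrates why the threshold acts as the operating point of the detector.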

4 Experiments and Results

We evaluate both nuclear area measurement with known nuclei locations and combined nuclei detection and area measurement. The former enables testing our hypothesis that nuclear area can be reliably measured by a machine learning model without performing segmentation under ideal conditions (manually annotated nuclei locations). The model for nuclear area measurement with known nuclei locations was trained with the manually annotated nuclei in subset A1, and then used to measure the nuclear area at the manually annotated nuclei locations in subset B. From the area measurements of the individual nuclei, the MNA was computed for the 18 tumor regions in subset B. The measured area of individual nuclei and the MNA were compared with the measurements based on the manually segmented nuclei contours. The combined nuclei detection and area measurement model was trained with the manually annotated nuclei in subset A1, using the surrogate ground truth to sample the “background” class. The optimal operating point of the detector was determined based on subset A2. The error of the estimation of the MNA over this subset was used as an optimization criterion. The trained model and the determined optimal operating point were then used to perform joint nuclei detection and area measurement in subset B. The resulting measurements were used to compute the MNA and compare it with the measurement based on the manually segmented nuclei contours. The agreement between two sets of measurements was evaluated with the Bland-Altman method. In addition, the coefficient of determination for a linear fit between the two measurements was computed.

Fig. 2. Scatter and Bland-Altman plots for manual and automatic measurement of nuclear area. (a) and (b) refer to the measurement of individual nuclei and (c) and (d) to the measurement of the MNA with the approach that relies on known nuclei locations. (e) and (f) refer to the measurement of the MNA with the combined nuclei detection and area measurement approach. (g) and (h) refer to the measurement of the MNA with the method described in [10]. The red line in the scatter plots indicates the identity.

Nuclear area measurement with known nuclei locations. The Bland-Altman plots and the corresponding scatter plots for the measurement of the area of individual nuclei and the MNA are shown in Fig. 2(a–d). The bias and limits of agreement for the measurement of the area of individual nuclei were b = −2.19 ± 18.85 µm² and for the measurement of the MNA b = −2.18 ± 3.32 µm². The coefficient of determination was r² = 0.87 for the measurement of the area of individual nuclei and r² = 0.99 for the measurement of the MNA.

Combined nuclei detection and area measurement. The Bland-Altman plots and the corresponding scatter plots for the measurement of the MNA are shown in Fig. 2(e, f). The bias and limits of agreement for the measurement of the MNA were b = −2.98 ± 9.26 µm². The coefficient of determination for the measurement of the MNA was r² = 0.89. Some examples from the combined nuclei detection and area measurement are shown in Fig. 3.
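The agreement statistics reported above can be sketched as follows. The sketch assumes the conventional Bland-Altman definition, with limits of agreement at the bias ± 1.96 standard deviations of the differences, and it takes the difference as automatic minus manual; the paper does not state either convention explicitly. The function names and the example values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(reference, measured):
    """Return (bias, half_width); the limits of agreement are bias ± half_width."""
    diffs = [m - r for r, m in zip(reference, measured)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # assumed 95 % limits of agreement
    return bias, half_width


def r_squared(x, y):
    """Coefficient of determination of a least-squares linear fit of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical MNA values (µm²) for four tumor regions.
manual = [50.0, 60.0, 70.0, 80.0]
automatic = [52.0, 61.0, 73.0, 82.0]
print(bland_altman(manual, automatic))
print(r_squared(manual, automatic))
```

Averaging over many nuclei before comparing, as done for the MNA, reduces the spread of the differences, which would explain the much narrower limits of agreement for the MNA than for individual nuclei.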

Fig. 3. Examples of combined nuclei detection and area measurement. The circles indicate the location and measured size of the nuclei (note that they are not contour segmentations).

Comparison to measurement by automatic nuclei segmentation. For comparison, we show the results for the measurement of the MNA by performing nuclei segmentation with the method described in [10]. The Bland-Altman plots and the corresponding scatter plots for the measurement of the MNA are shown in Fig. 2(g, h). The bias and limits of agreement for the measurement of the MNA were b = −1.20 ± 13.50 µm². The coefficient of determination for the measurement of the MNA was r² = 0.77.

5 Discussion and Conclusions

The Bland-Altman plot for the measurement of the area of individual nuclei with known locations (Fig. 2(b)) indicates that there is a small bias in the automatic measurement. In other words, the nuclear area measured with the automatic method is on average larger when compared with the manual method. However, the bias value is very small considering the scale of nuclei sizes. The limits of agreement indicate moderate agreement, with differences in the measurement that can be on the order of the area of the smallest nuclei in the dataset. Due to the averaging effect, the measurement of the MNA is considerably more accurate (Fig. 2(d)). Although a small bias is still present, the limits of agreement indicate almost perfect agreement between the automatic and manual methods. This shows that the area of individual nuclei, and region statistics such as the MNA in particular, can be reliably computed directly from the image data without performing nuclei segmentation. These results, however, were achieved under ideal conditions, with expert annotations for the nuclei locations. The extension of this approach to combined nuclei detection and measurement has a much larger practical potential. The measurement of the MNA with this method had lower, but nevertheless substantial, agreement with the manual measurement (Fig. 2(f)). In part, the lower agreement is likely due to the two MNA measurements being based on different nuclei populations, although detection errors also have an influence. This agreement was better than that of the MNA measurement based on automatic nuclei segmentation (Fig. 2(h)). An added advantage of the methodology proposed in this paper is that deep CNNs can be efficiently run on GPUs. In our current implementation using fully convolutional neural networks, combined nuclei detection and area measurement in an image of size 4000 × 4000 pixels is performed in approximately 5 min on a Tesla K40 GPU. We expect that this can be improved upon by exploiting the spatial redundancy of the image data (the current implementation evaluates the classifier at every pixel location), using a lower magnification such as ×20, and optimizing the CNN architecture. In future work we plan to use this methodology for automatic assessment of the prognosis of breast cancer patients.

References

1. Cireşan, D., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. NIPS 25, 2843–2851 (2012)
2. Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 411–418. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_51
3. Ginneken, B.v., Setio, A.A.A., Jacobs, C., Ciompi, F.: Off-the-shelf convolutional neural network features for pulmonary nodule detection in computed tomography scans. In: IEEE ISBI 2015, pp. 286–289 (2015)
4. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding (2014). arXiv:1408.5093
5. Kronqvist, P., Kuopio, T., Collan, Y.: Morphometric grading of invasive ductal breast cancer. I. Thresholds for nuclear grade. Br. J. Cancer 78, 800–805 (1998)
6. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE CVPR 2015, pp. 3431–3440 (2015)
7. Mommers, E.C., Page, D.L., Dupont, W.D., Schuyler, P., Leonhart, A.M., Baak, J.P., Meijer, C.J., van Diest, P.J.: Prognostic value of morphometry in patients with normal breast tissue or usual ductal hyperplasia of the breast. Int. J. Cancer 95, 282–285 (2001)
8. Roth, H.R., et al.: A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8673, pp. 520–527. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10404-1_65
9. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556
10. Veta, M., van Diest, P.J., Kornegoor, R., Huisman, A., Viergever, M.A., Pluim, J.P.W.: Automatic nuclei segmentation in H&E stained breast cancer histopathology images. PLoS ONE 8, e70221 (2013)
11. Veta, M., Kornegoor, R., Huisman, A., Verschuur-Maes, A.H.J., Viergever, M.A., Pluim, J.P.W., van Diest, P.J.: Prognostic value of automatically extracted nuclear morphometric features in whole slide images of male breast cancer. Mod. Pathol. 25, 1559–1565 (2012)
12. Veta, M., Pluim, J.P.W., van Diest, P.J., Viergever, M.A.: Breast cancer histopathology image analysis: a review. IEEE Trans. Biomed. Eng. 61, 1400–1411 (2014)
13. Vincent, L.: Fast grayscale granulometry algorithms. In: Mathematical Morphology and its Applications to Image Processing, pp. 265–272 (1994)

Subtype Cell Detection with an Accelerated Deep Convolution Neural Network

Sheng Wang, Jiawen Yao, Zheng Xu, and Junzhou Huang(✉)

Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA

Abstract. Robust cell detection in histopathological images is a crucial step in computer-assisted diagnosis methods. In addition, recent studies show that cell subtypes play a significant role in better characterization of tumor growth and outcome prediction. In this paper, we propose a novel subtype cell detection method with an accelerated deep convolution neural network. The proposed method not only detects cells but also gives the subtype classification of the detected cells. Based on the subtype cell detection results, we extract subtype cell related features and use them in survival prediction. We demonstrate that our proposed method has excellent subtype cell detection performance and that our proposed subtype cell features can achieve more accurate survival prediction.

1 Introduction

Analysis of microscopy images is very popular in modern cell biology and medicine. In microscopic image analysis for computer-assisted diagnosis, automatic cell detection is the basis. However, this task is challenging due to (1) cell clumping and background clutter, (2) large variation in the shape and size of cells, and (3) the long processing time caused by the high resolution of histopathological images. To solve these problems, Arteta et al. proposed a general non-overlapping extremal regions selection (NERS) method [1], which achieves state-of-the-art cell detection performance. Recently, to fully exploit the hierarchical discriminative features learned by deep neural networks, especially deep convolution neural networks (DCNN), many DCNN-based cell detection methods [7,8,11] have been proposed. These methods regard the DCNN as a two-class classifier to detect cells in a pixel-wise way. Recent studies show that different cell types (tumor cells, stromal cells, lymphocytes) play different roles in tumor growth and metastasis, and accurately classifying cell types is a critical step towards better characterization of tumor growth and outcome prediction [2,9,13]. However, to the best of our knowledge, there is no existing automatic microscopic subtype cell analysis method with DCNN.

This work was partially supported by U.S. NSF IIS-1423056, CMMI-1434401, CNS-1405985.
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 640–648, 2016. DOI: 10.1007/978-3-319-46723-8_74


In this paper, we propose a subtype cell detection method with an accelerated deep convolution neural network. Our contributions are threefold: (1) We propose a subtype cell detection method that can detect cells in histopathological images and give the subtype information of the detected cells at the same time. To the best of our knowledge, this is the first study to report subtype cell detection using DCNN. (2) We introduce the d-regularly sparse kernels [6] into our method to eliminate all the redundant computation and to speed up the detection process. (3) A new set of features based on the subtype cell detection results is extracted and used to give more accurate survival prediction.

2 Methodology

Our approach to subtype cell detection is to detect cells in histopathological images and give the subtype information of the detected cells at the same time. To accomplish this, cell patches extracted according to their annotations are used to train two DCNN models for classification that partially share weights: one for cell/non-cell classification, the other for subtype cell classification. We then apply sparse kernels in the two DCNN models to eliminate all the redundant computations, so that detection of cells and subtypes for a tile can be done in one round. Next, we integrate the two DCNN models into one subtype cell detection model. Finally, we extract subtype cell features from the subtype cell detection results.

2.1 Training Two DCNNs for Classification

Given two sets of training data, the cell and non-cell patches $\{(x_c^i, y_c^i) \in (X_c, Y_c)\}_{i=1}^{N_c}$ and the subtype cell patches $\{(x_s^i, y_s^i) \in (X_s, Y_s)\}_{i=1}^{N_s}$, we use them to train two deep convolution neural networks via

$$\operatorname*{argmin}_{\theta_c^1,\ldots,\theta_c^L} \frac{1}{N_c} \sum_{i=1}^{N_c} \mathcal{L}\big(H_c(x^i; \theta_c^1,\ldots,\theta_c^L),\, y^i\big), \tag{1}$$

and

$$\operatorname*{argmin}_{\theta_s^1,\ldots,\theta_s^L} \frac{1}{N_s} \sum_{i=1}^{N_s} \mathcal{L}\big(H_s(x^i; \theta_s^1,\ldots,\theta_s^L),\, y^i\big), \tag{2}$$

where $\mathcal{L}$ is the loss function, $H_c$ and $H_s$ are the outputs of the cell/non-cell DCNN and the subtype DCNN, and $\theta_c^1,\ldots,\theta_c^L$ and $\theta_s^1,\ldots,\theta_s^L$ are the weights of the layers, indexed from 1 to $L$, in these two DCNN models. Since the training dataset $(X_s, Y_s)$ consists of subtype cell patches, and in real-world applications it is harder to manually annotate the subtypes of cells, it is common to have fewer subtype cell patches for training than cell/non-cell patches, i.e. $X_s \subset X_c$. In a DCNN, the convolution and pooling layers operate together to extract hierarchical features from the training images, while the fully-connected layers and the loss function layer mainly serve to solve the classification task. To mitigate the insufficiency and imbalance of subtype cell patches, and to benefit from the better convolution features learned from the larger set of cell/non-cell patches, we reuse all convolution layer weights learned via Eq. (1) and keep them fixed when training the subtype DCNN model via Eq. (2). Suppose that all the convolution-related layers are indexed from 1 to $j - 1$; then Eq. (2) becomes

$$\operatorname*{argmin}_{\theta_s^j,\ldots,\theta_s^L} \frac{1}{N_s} \sum_{i=1}^{N_s} \mathcal{L}\big(H_s(x^i; \theta_s^j,\ldots,\theta_s^L),\, y^i\big). \tag{3}$$

We keep the weights of the convolution layers unchanged throughout the optimization of Eq. (3). Thus, by Eqs. (1) and (3), we train our two DCNNs for cell/non-cell classification and subtype cell classification.

2.2 Accelerated Detection with d-Regularly Sparse Kernel

The traditional pixel-wise detection method requires patch-by-patch sliding-window scanning for every pixel in the image. It sequentially and independently takes cell patches as the inputs of the DCNN model, and the forward propagation is repeated for all the local pixel patches. However, this strategy is time-consuming because there are many redundant convolution operations among adjacent patches. To eliminate the redundant convolution computation, we introduce the d-regularly sparse kernel technique [6] for the convolution, pooling and fully-connected layers in our DCNN models. The d-sparse kernels are created by inserting all-zero rows and columns into the original kernels so that every two originally neighboring entries are d pixels apart. We apply the d-sparse kernels to all the original convolution, pooling and fully-connected layers. With d-regularly sparse kernels, our model can take a whole tile as input for subtype cell detection in one run instead of computing one patch at a time. In our subtype cell detection model and experiments, we have three cell subtypes: lymphocytes, stromal cells and tumor cells, so the subtype cell DCNN model is a three-class classification model. The network structure used for our model is the same as the basic LeNet [5] (two convolution-pooling combinations followed by two fully-connected layers), with an input patch size of 40 × 40 for training and 551 × 551 for testing after padding.

2.3 Subtype Cell Detection

Our subtype cell detection model is shown in Fig. 1. Since we apply the d-regularly sparse kernel in our model, a whole tile image can be taken as the input of our model and processed at once. After the shared convolution and pooling operations, there are two branches, one for each trained DCNN. The upper branch is for cell detection. After the softmax layer, we have the cell probability map of the tile image. The next operation can be any method that maps the probability map to the final detection result. In our model, we use the moment centroid based method to get our final cell detection result. The other branch is the DCNN for subtype cell classification. It gives the probability of all the subtypes for each pixel in the tile. In the end, the results of the two branches are merged by simple multiplication to give the final subtype cell detection results.

Fig. 1. Subtype Cell Detection. C stands for the multiple shared convolution and pooling layers between the two models. F stands for fully-connected layer and S stands for softmax layer. [Diagram labels: Image Tile; Cell Probability; Cell Detection; Subtype Cell Probability; × (merge); Subtype Cell Detection.]

2.4 Subtype Cell Features for Survival Prediction

According to recent studies, accurately classifying cell types is a critical step towards better survival prediction. Thus, three groups of cellular features are extracted from our subtype cell detection results for survival prediction. These features, motivated by [12,14], cover cell-level information (e.g., appearance and shape) of individual subtype cells as well as texture properties of background tissue regions.

Holistic Statistics: The four holistic statistics capture overall information such as the total area, perimeter, number and corresponding ratio of each cell subtype.

Geometry Features: Geometry properties, including area, perimeter and so on, are calculated for each detected subtype cell from its detection region in the prediction map. Zernike moments are also computed for each type of cell. When combining different tiles, we calculate the mean, median and standard deviation of each feature. There are 564 features in this group.

Texture Features: This group of features contains Gabor “wavelet” features, Haralick features and granularity, which measure texture properties of objects (e.g., cells and tissues), resulting in 1685 texture features.

3 Experiments

We evaluated our subtype cell detection model via two experiments: subtype cell detection and survival prediction with subtype cell features. All experiments were conducted on a workstation with an Intel(R) Xeon(R) E5-2620 v2 @ 2.10 GHz CPU, 32 GB of RAM and two NVIDIA Tesla K40c GPUs.

3.1 Subtype Cell Detection

The subtype cell detection performance of the proposed method is evaluated on part of the TCGA (The Cancer Genome Atlas) data portal. We use 300 lung cancer tiles of size 512 × 512 with subtype cell annotations as our dataset, 270 for training and 30 for evaluation. All the tiles are annotated by a pathologist. For cell/non-cell classification, we have 48562 patches for training and 5158 for testing. We have three cell subtypes in our dataset: lymphocytes, stromal cells and tumor cells. For subtype cell classification, we use 24281 patches (11671 lymphocytes, 10225 tumor cells and 2385 stromal cells) for training and 2579 patches (1218 lymphocytes, 1122 tumor cells and 239 stromal cells) for testing. For cell/non-cell detection, we compare our proposed method with NERS [1] and the robust lung cancer cell detection (RLCCD) method based on pixel-wise DCNN [8]. A detected cell centroid is considered to be a true positive (TP) sample if the circular area of radius 8 centered at the detected nucleus contains the ground-truth annotation; otherwise, it is considered a false positive (FP). Missed ground-truth dots are counted as false negatives (FN). The results are reported in terms of the F1 score F1 = 2PR/(P + R), where precision P = TP/(TP + FP) and recall R = TP/(TP + FN). The average precision, recall, F1 score and running time on the testing tiles for the three methods are listed in Table 1. Among these methods, the proposed method has the highest precision, recall and F1 score, which demonstrates that it can ensure excellent detection performance. The proposed method is also the fastest. By applying the d-regularly sparse kernel, the proposed method is around 80 times faster than pixel-wise detection with DCNN. Our convolution kernel sizes in both the training model and the d-regularly sparse kernel are much larger than those in RLCCD [8]. If the proposed method used the same kernel settings as RLCCD, it would be much faster.
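The d-regularly sparse kernels credited above with the speed-up are described in Sect. 2.2 as the original kernels with all-zero rows and columns inserted so that originally neighboring entries end up d pixels apart. A minimal sketch of that construction is given below; the function name and the example kernel are hypothetical, and the sketch covers only the kernel construction, not how d is chosen for each layer in [6].

```python
def sparsify_kernel(kernel, d):
    """Insert all-zero rows and columns into a 2-D kernel so that entries
    that were direct neighbors end up d pixels apart."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (k_rows - 1) * d + 1
    out_cols = (k_cols - 1) * d + 1
    sparse = [[0.0] * out_cols for _ in range(out_rows)]
    for r in range(k_rows):
        for c in range(k_cols):
            # Original entry (r, c) moves to (r * d, c * d); the rest stays zero.
            sparse[r * d][c * d] = kernel[r][c]
    return sparse


# Hypothetical 2 x 2 kernel made 2-regularly sparse (result is 3 x 3).
for row in sparsify_kernel([[1, 2], [3, 4]], d=2):
    print(row)
```

The number of non-zero weights is unchanged, so the sparse kernel computes the same response as the original kernel would on a correspondingly subsampled input; with d = 1 the kernel is returned unchanged.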
Our subtype cell classification accuracy on the testing set is 88.64 %. 2286 of the 2579 subtype cells in the testing dataset have been detected. If one subtype ground truth corresponds to multiple detection results, we choose the nearest detection and set its subtype result as the final result. Then, over all the subtype detection results, the accuracy is 87.18 %. The accuracies for lymphocytes, tumor cells and stromal cells are 88.05 %, 87.39 % and 81.08 %, respectively. This demonstrates that our method achieves impressive subtype cell detection performance. In addition, we show one of the subtype cell detection results in Fig. 2, with red points for lymphocytes, yellow points for tumor cells and green points for stromal cells. The detection results are visibly close to the ground truth.

Table 1. Cell detection results

Method      Precision  Recall  F1 score  Time (s)
NERS [1]    0.7990     0.6109  0.6757    31.4773
RLCCD [8]   0.7280     0.8030  0.7759    52.8912
Proposed    0.8683     0.8215  0.8029    0.7147
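The detection metrics reported in Table 1 can be computed for a single tile as sketched below. The greedy one-to-one matching of detections to ground-truth dots is an assumption of this sketch, since the text only states the radius-8 criterion; the function names and the example coordinates are hypothetical.

```python
import math


def match_detections(detections, ground_truth, radius=8.0):
    """Greedily match each detection to the nearest unmatched ground-truth
    dot within the given radius. Returns (TP, FP, FN)."""
    unmatched = list(ground_truth)
    tp = 0
    for dr, dc in detections:
        best = None
        for i, (gr, gc) in enumerate(unmatched):
            dist = math.hypot(dr - gr, dc - gc)
            if dist <= radius and (best is None or dist < best[0]):
                best = (dist, i)
        if best is not None:
            unmatched.pop(best[1])
            tp += 1
    fp = len(detections) - tp
    fn = len(unmatched)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score as defined in the text."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical tile: three detections and three annotated cells.
dets = [(10, 10), (30, 30), (100, 100)]
gts = [(12, 11), (33, 34), (60, 60)]
tp, fp, fn = match_detections(dets, gts)
print(tp, fp, fn, precision_recall_f1(tp, fp, fn))
```

Note that Table 1 lists averages over the testing tiles, so the F1 column need not equal 2PR/(P + R) evaluated on the averaged precision and recall; for the proposed method that expression gives about 0.844, whereas the reported average F1 score is 0.8029.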


Fig. 2. Subtype cell detection results. Red dots stand for lymphocytes, yellow dots stand for tumor cells and green dots stand for stromal cells.

To the best of our knowledge, this is the first study to report subtype cell detection using fast DCNN. To further evaluate our subtype cell detection performance, we next extract features from the subtype cell detection results and the corresponding probability maps and use them for survival prediction, as described in the next subsection.

3.2 Survival Prediction

For survival prediction, we focused on the widely used lung cancer dataset NLST (National Lung Screening Trial). The NLST dataset contains complete pathology images of the patients. We collected data from 144 adenocarcinoma (ADC) and 113 squamous cell carcinoma (SCC) patients. To examine whether the features extracted from the subtype cell detection results of our trained model achieve better predictions than traditional imaging biomarkers, we compared against the state-of-the-art framework in lung cancer [10], which does not use subtype cell features. To test our proposed features, we randomly divided the whole NLST dataset into a training set (97 for ADC, 76 for SCC) and a testing set (47 for ADC, 37 for SCC) and built multivariate Cox regression models on the top 50 selected features for ADC and SCC, respectively. Figure 3 presents the predictive power on a partitioning of the testing set into two groups ((a), (b) for ADC; (c), (d) for SCC). A significant difference (Wald test) in survival times can be seen in Fig. 3(a),(c). This demonstrates that our proposed features, which are extracted from subtype cell detection results and cover subtype cell distributions and granularity, are more strongly associated with survival outcomes than the traditional imaging biomarkers used in [10]. We then randomly divided the whole set into 50 splits and used the concordance index (C-index) to compare the prediction performance of the two methods. The C-index is a nonparametric measure of the discriminatory power of a predictive model: a value of 1 indicates perfect prediction accuracy, and a value of 0.5 is as good as a random guess. Component-wise likelihood-based boosting (CoxBoost) [3] and random survival forest (RSF) [4] are both applied as survival models on the ADC and SCC cases. Figure 4 shows that the proposed method has the higher median C-index in both cases. This illustrates the robustness of the proposed subtype cell features, since the subtype cell features are highly associated with tumor growth and survival outcomes.

Fig. 3. Kaplan-Meier survival curves of two groups on testing set. The x axis is the time in days and the y axis denotes the probability of overall survival. (a), (c) are from proposed framework while (b), (d) are from Wang’s method [10].

Fig. 4. Boxplot of C-index distributions (Left: ADC, Right: SCC). Each panel compares CoxBoost and RSF.
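As a concrete reading of the C-index described above, the following is a minimal sketch of a Harrell-style concordance index in plain Python. It is an illustration only: the function name `concordance_index`, the convention that tied risk scores count as 0.5, and the rule that a pair is comparable only when the earlier time is an observed event are assumptions, not the implementation used in the experiments.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style C-index for right-censored survival data.

    times:       observed follow-up time of each subject
    events:      1 if death was observed, 0 if the subject was censored
    risk_scores: model output, higher score = higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Assumed convention: a pair is comparable when subject i has an
            # observed event strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier death got the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A model that ranks every comparable pair correctly scores 1, and a model that assigns the same risk to everyone scores 0.5, which matches the interpretation given in the text.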

4 Conclusion

In this paper, we propose a subtype cell detection method with an accelerated deep convolution neural network. The proposed method can detect the cells in a histological image and give the subtype cell information at the same time. By applying sparse kernels, the proposed method can detect the subtype cells of a tile image in one round. We also present a set of features extracted from the subtype cell detection results and use them in survival prediction to improve the prediction performance. Experimental results show that our proposed method gives good subtype cell detection and that the corresponding subtype features we extract are more strongly associated with survival outcomes than traditional imaging biomarkers.

References
1. Arteta, C., Lempitsky, V., Noble, J.A., Zisserman, A.: Learning to detect cells using non-overlapping extremal regions. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7510, pp. 348–356. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33415-3_43
2. Beck, A.H., Sangoi, A.R., Leung, S., Marinelli, R.J., Nielsen, T.O., van de Vijver, M.J., West, R.B., van de Rijn, M., Koller, D.: Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci. Transl. Med. 3, 108ra113 (2011)
3. Binder, H., Schumacher, M.: Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform. (2008)
4. Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S.: Random survival forests. Ann. Appl. Stat. 2(3), 841–860 (2008)
5. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
6. Li, H., Zhao, R., Wang, X.: Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification. arXiv preprint arXiv:1412.4526 (2014)
7. Liu, F., Yang, L.: A novel cell detection method using deep convolutional neural network and maximum-weight independent set. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 349–357. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_42
8. Pan, H., Xu, Z., Huang, J.: An effective approach for robust lung cancer cell detection. In: Wu, G., Coupé, P., Zhan, Y., Munsell, B., Rueckert, D. (eds.) PatchMI 2015. LNCS, vol. 9467, pp. 87–94. Springer, Heidelberg (2015). doi:10.1007/978-3-319-28194-0_11
9. Tabesh, A., Teverovskiy, M., Pang, H.Y., Kumar, V.P., Verbel, D., Kotsianti, A., Saidi, O.: Multifeature prostate cancer diagnosis and Gleason grading of histological images. IEEE Trans. Med. Imaging 26(10), 1366–1378 (2007)
10. Wang, H., Xing, F., Su, H., Stromberg, A., Yang, L.: Novel image markers for non-small cell lung cancer classification and survival prediction. BMC Bioinform. 15, 310 (2014)
11. Xu, Z., Huang, J.: Efficient lung cancer cell detection with deep convolution neural network. In: Wu, G., Coupé, P., Zhan, Y., Munsell, B., Rueckert, D. (eds.) PatchMI 2015. LNCS, vol. 9467, pp. 79–86. Springer, Heidelberg (2015). doi:10.1007/978-3-319-28194-0_10
12. Yao, J., Ganti, D., Luo, X., Xiao, G., Xie, Y., Yan, S., Huang, J.: Computer-assisted diagnosis of lung cancer using quantitative topology features. In: Zhou, L., Wang, L., Wang, Q., Shi, Y. (eds.) MLMI 2015. LNCS, vol. 9352, pp. 288–295. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24888-2_35
13. Yuan, Y., Failmezger, H., Rueda, O.M., Ali, H.R., Gräf, S., Chin, S.F., Schwarz, R.F., Curtis, C., Dunning, M.J., Bardwell, H., et al.: Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci. Transl. Med. 4(157), 157ra143 (2012)
14. Zhu, X., Yao, J., Luo, X., Xiao, G., Xie, Y., Gazdar, A., Huang, J.: Lung cancer survival prediction from pathological images and genetic data - an integration study. In: IEEE ISBI, pp. 1173–1176, April 2016

Imaging Biomarker Discovery for Lung Cancer Survival Prediction

Jiawen Yao, Sheng Wang, Xinliang Zhu, and Junzhou Huang(B)

Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA
[email protected]

Abstract. Solid tumors are heterogeneous tissues composed of a mixture of cells and have special tissue architectures. However, cellular heterogeneity, i.e., the differences in cell types, is generally not reflected in molecular profiles or in recent histopathological image-based analyses of lung cancer, leaving such information underused. This paper presents the development of a computational approach for H&E stained pathological images that quantitatively describes cellular heterogeneity from different types of cells. In our work, a deep learning approach was first used for cell subtype classification. Then we introduced a set of quantitative features to describe cellular information. Several feature selection methods were used to discover significant imaging biomarkers for survival prediction. These discovered imaging biomarkers are consistent with pathological and biological evidence. Experimental results on two lung cancer data sets demonstrated that survival models built from the clinical imaging biomarkers have better prediction power than state-of-the-art methods using molecular profiling data and traditional imaging biomarkers.

1 Introduction

Lung cancer is the second most common cancer in both men and women. Non-small cell lung cancer (NSCLC) accounts for the majority (80–85%) of lung cancers, and its two major types are adenocarcinoma (ADC) (40%) and squamous cell carcinoma (SCC) (25–30%).¹ The 5-year survival rate of lung cancer (17.7%) is still significantly lower than that of most other cancers.² Therefore, predicting the clinical outcome of lung cancer is an active field in today’s medical research. Molecular profiling is a technique for querying the expression of thousands of molecules simultaneously. The information derived from molecular profiling can be used to classify tumors and to help make clinical decisions [6,15]. However, the tumor microenvironment is a complex milieu that includes not only the cancer cells but also the stromal cells and immune cells. All this “extra” genomic information may muddle results and therefore make molecular analysis a challenging task for cancer prognosis [14].

This work was partially supported by NSF IIS-1423056, CMMI-1434401, CNS-1405985.
¹ http://www.cancer.org/cancer/lungcancer-non-smallcell/detailedguide/
² http://seer.cancer.gov/statfacts/html/lungb.html

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 649–657, 2016. DOI: 10.1007/978-3-319-46723-8_75


Recently, Warth et al. [10] showed that there are connections between lung tumor morphology and prognosis. Advances in imaging have created a good opportunity to study such information using histopathological images to help tumor diagnosis [1,14,16]. In general, a pathologist can visually examine stained slides of a tumor to discover imaging biomarkers that can be used for diagnosis. For example, Fig. 1 shows two pathology images from ADC lung cancer patients. (A) is an image from a patient who had the worse survival outcome, while (B) is captured from a patient who lived longer. A distinct pattern can be found in Fig. 1(A): the more advanced tumor cells, clustered in a larger and more condensed area, indicate a worse survival outcome than in Fig. 1(B), where tumor cells are scattered into a smaller region with lymphocytes and stromal cells nearby. However, manually searching for such imaging biomarkers is very labor-intensive and cannot easily be scaled to a large number of samples. Wang et al. [9] proposed an automated image analysis to help pathologists find imaging biomarkers that could identify lung cancer survival characteristics. However, some issues remain in their results. First, they collected ADC and SCC samples together when looking for imaging biomarkers. According to lung cancer pathology [8], the two major types of NSCLC (ADC and SCC) are generally regarded as two different types of disease due to their distinct molecular mechanisms and pathological patterns. Second, spatial variations between the different types of cells (ADC and SCC) are associated with survival outcomes [14]. However, the study in [9] adopted a traditional cell segmentation, which was unable to classify cell subtypes and to produce clinically interpretable imaging biomarkers in lung cancer.

Fig. 1. Tumor morphology is correlated with patient survival.

In this paper, we introduce a computational image analysis to discover clinically interpretable imaging biomarkers for lung cancer survival prediction. Experiments on two lung cancer cohorts demonstrate that: (1) the two major subtypes of NSCLC should be treated separately, since they have different key imaging biomarkers; (2) the spatial distributions of subtype cells are informative imaging biomarkers for lung cancer survival prediction; (3) the proposed framework can better describe tumor morphology and can provide more powerful survival analysis than state-of-the-art methods using molecular profiling data.

2 Methodology

An overview of our method is presented in Fig. 2. An expert pathologist first labels regions of tissue. Several image tiles are extracted from the regions of interest. Then a deep learning approach is applied to detect different types of cells (tumor, stromal and lymphocyte cells). A set of quantitative descriptors is used to cover granularity and subtype cellular heterogeneity. Our image analysis pipeline automatically segments H&E stained images, classifies cellular components into three categories (tumor, lymphocyte, stromal), and extracts features based on cell segmentation and detection results. Feature selection methods are used to find important features (image markers). These imaging biomarkers can then be used to build survival models that predict patient clinical outcomes.

Fig. 2. Overview of the proposed framework. (Panel labels: NSCLC patients (ADC, SCC); whole slide scanning; training samples (lymphocyte, tumor, stroma, other); deep learning for cell subtype detection; imaging features; feature selection; survival prediction.)

Fig. 3. The architecture of DCNNs for cell type classification (C stands for the multiple shared convolution and pooling layers between two models. F stands for fully-connected layer and S stands for softmax layer).

2.1 Deep Learning Approach for Cell Subtype Classification

The architecture of the network is shown in Fig. 3. Different cell types (cancer cells, stromal cells, lymphocytes) play different roles in tumor growth and metastasis, and accurately classifying cell types is a critical step toward better characterization of tumor growth and outcome prediction [2,14]. Due to the large appearance variation and high complexity of lung cancer tissues, traditional machine learning approaches do not clearly distinguish or define the different cell types. Motivated by recent deep learning methods for cell detection [11,12], we developed two deep convolution neural networks (DCNNs) with partially shared weights for cell subtype detection. The ground truth for cell subtype classification was annotated by an experienced pathologist. We then built training samples with two annotations: one for cell/non-cell classification and the other for cell subtype. Each patch has size 40 × 40. We collected 48562 and 24281 patches for cell/non-cell and subtype cell classification, respectively. Sparse kernels [5] are applied in the two DCNN models to eliminate redundant calculations for acceleration. In the final step, the two DCNN models are integrated into one model to achieve subtype cell detection. More details can be found on our research web page: http://ranger.uta.edu/~huang/R_Lung.htm.

2.2 Quantitative Imaging Feature Extraction

Motivated by [9,13], three groups of cellular features were extracted using the subtype cell detection results. These features cover cell-level information (e.g., appearance and shape) of individual subtype cells and also texture properties of background tissue regions.

Group 1: Geometry Features. Geometry properties are calculated for each segmented subtype cell, including area, perimeter, circularity, and major-minor axis ratio. Zernike moments were also applied to each type of cell. Combining the different tiles, we calculated the mean, median and standard deviation of each feature, for a total of 564 features.

Group 2: Texture Features. This group contains Gabor “wavelet” features, co-occurrence matrix and granularity features to measure texture properties of objects (e.g., cells and tissues), resulting in 1685 texture features.

Group 3: Holistic Statistics. The four holistic statistics include overall information such as the total area, perimeter, number and corresponding ratio of each cell subtype.

2.3 Imaging Biomarkers Discovery

The objective of this step is to find important imaging biomarkers, since not all features are highly correlated with patients’ survival outcomes. Unlike traditional applications, feature selection in survival analysis must deal with censoring (subjects are censored if they are lost to follow-up or the study ends before they die). In this study, we built the predictive models using two well-established types of methods: (1) the multivariate Cox proportional hazards model with L1 penalized log partial likelihood (Lasso) [7] or component-wise likelihood-based boosting (CoxBoost) [3] for feature selection, and (2) random survival forest (RSF) [4]. Because of the high dimension of the image features, we first applied univariate Cox regression and kept those features with a Wald test p-value less than 0.05. We then conducted the feature selection on this smaller candidate set to speed up survival model fitting.
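To make the role of censoring and of the L1 penalty concrete, the sketch below evaluates the objective that a Lasso-penalized Cox model minimizes: the negative log partial likelihood plus an L1 term. It is a hedged illustration, not the fitting code used in this study. The function name `penalized_cox_objective`, the parameter `l1`, and the assumption of no tied event times are all introduced here for illustration.

```python
import math


def penalized_cox_objective(beta, X, times, events, l1=0.0):
    """Negative Cox log partial likelihood plus an L1 penalty.

    beta:   list of coefficients, one per feature
    X:      list of feature rows, one row per subject
    times:  observed follow-up times
    events: 1 if death was observed, 0 if censored
    l1:     hypothetical penalty weight (lambda)
    Assumes no tied event times.
    """
    n = len(times)
    # Linear predictor eta_i = beta . x_i for every subject.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    nll = 0.0
    for i in range(n):
        if events[i] != 1:
            continue  # censored subjects contribute only through risk sets
        # Risk set: subjects still under observation at times[i].
        risk_sum = sum(math.exp(eta[j]) for j in range(n) if times[j] >= times[i])
        nll -= eta[i] - math.log(risk_sum)
    return nll + l1 * sum(abs(b) for b in beta)
```

Censored subjects never add their own term to the sum, but they stay in the risk sets of earlier events, which is how the partial likelihood uses incomplete follow-up. Larger values of `l1` push more coefficients toward zero, which is what makes the penalized model act as a feature selector.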

3 Experimental Results

3.1 Materials

We focused on two widely used lung cancer datasets: NLST (National Lung Screening Trial)³ and the TCGA Data Portal⁴. Both datasets contain complete patients’ pathology images with survival and clinical information, while the TCGA cohorts provide additional molecular profiling data. In NLST, we collected 144 ADC and 113 SCC patients. In TCGA, we focused on the SCC case and collected 106 patients with four types of molecular data: copy number variation (CNV), mRNA, microRNA and protein expression (RPPA). To examine whether imaging biomarkers from the proposed framework can achieve better predictions than traditional imaging biomarkers and molecular profiling data (biomarkers), we compared against two state-of-the-art frameworks in lung cancer [9,15].

³ https://biometry.nci.nih.gov/cdas/studies/nlst/
⁴ https://tcga-data.nci.nih.gov/tcga/

3.2 Imaging Biomarker Discovery for Survival Analysis

ADC vs SCC samples. In this experiment, we followed the framework in [9] and investigated differences between imaging biomarkers selected from the ADC and SCC sets separately and from the combined ADC and SCC set. To ensure the robustness of selection, we resampled the whole dataset with replacement, performed the boosting feature selection procedure [3], and calculated the frequency with which each variable was chosen. Figure 4 shows that the key features (high frequencies shown in the green rectangle) chosen from the combined set are very different from those of ADC and SCC, respectively. These differences convinced us that the prognosis models for ADC and SCC should be developed separately. This finding is consistent with the evidence in lung cancer pathology that lung cancer subtypes are highly heterogeneous and cannot be combined. For ADC and SCC, the selected features include information about subtype cell distributions, cell shape and granularity. Among them, subtype cell distributions and granularity have been confirmed to be associated with survival outcomes [8,14]. To test these imaging biomarkers, we built multivariate Cox regression models using the top 50 selected features and evaluated them on the testing sets (47 for ADC and 37 for SCC). Figure 5 presents the predictive power on a partitioning of the testing set into two groups (a–b for ADC and c–d for SCC). A significant difference (Wald test) in survival times can be seen in Fig. 5(a),(c). This demonstrates that the discovered imaging biomarkers, which cover subtype cell distributions and granularity, are more strongly associated with survival outcomes than traditional imaging biomarkers. Then we randomly divided the whole set into 50 splits (2/3 for training, 1/3 for testing). Each feature selection method performed 10-fold cross validation for parameter optimization. Figure 6 shows the concordance index (C-index) results of the two methods on the ADC and SCC sets. The C-index is a nonparametric measure of the discriminatory power of a predictive model: a value of 1 indicates perfect prediction accuracy, and a value of 0.5 is as good as a random guess. From Fig. 6, we can see the higher median C-index of the discovered imaging markers in both cases with different survival models. This illustrates the robustness of the proposed method, since the discovered imaging biomarkers are highly associated with tumor growth and survival outcomes.

Fig. 4. Frequencies of features on ADC, SCC and ADC+SCC set.

Fig. 5. Kaplan-Meier survival curves of two groups on testing set. The x axis is the time in days and the y axis denotes the probability of overall survival. (a,c) are from the framework developed in this research, while (b,d) are using features from [9].

Fig. 6. Boxplot of C-index distributions (Left: ADC, Right: SCC). Each panel compares CoxBoost and RSF.

Fig. 7. Comparison of the survival predictive power using Cox+Lasso model.

3.3 Comparison of Survival Model with Imaging and Molecular Data

To examine whether the proposed imaging biomarkers can provide better prediction power than traditional molecular data, we conducted experiments on the TCGA LUSC cohort following the recent study [15]. We applied 50 random splits and assessed the C-index of models built from the individual imaging and molecular data sets alone. Figure 7A shows that survival models built on the discovered imaging biomarkers have the highest median C-index. When each type of data is integrated with clinical variables (“+” denotes the integration), all prediction accuracies increase, while the proposed method still has the best results (Fig. 7B). This verifies that the discovered imaging biomarkers better describe tumor morphology, which enables the proposed framework to give the best predictions for survival analysis.
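The repeated random-split protocol used in these comparisons can be sketched as follows. This is only an illustration under stated assumptions: the helper name `repeated_split_scores`, the callback `score_fn` (which would fit a survival model on the training indices and return its C-index on the testing indices), the fixed seed, and the default 2/3 training fraction taken from Sect. 3.2 are all hypothetical choices, not the authors' code.

```python
import random


def repeated_split_scores(n_samples, score_fn, n_splits=50, train_frac=2 / 3, seed=0):
    """Evaluate score_fn on repeated random train/test partitions.

    score_fn(train_idx, test_idx) is a user-supplied callback that returns
    one number per split, for example a C-index on the test indices.
    """
    rng = random.Random(seed)            # fixed seed so splits are repeatable
    indices = list(range(n_samples))
    n_train = int(round(train_frac * n_samples))
    scores = []
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)            # new random permutation for each split
        train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]
        scores.append(score_fn(train_idx, test_idx))
    return scores
```

Applying `statistics.median` to the returned list gives the kind of median score that the boxplots summarize, and keeping the same seed across data types would let imaging and molecular models be compared on identical splits.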

4 Conclusions

In this paper, we investigated subtype cell information and found that it contains useful patterns for predicting patient survival. These results are consistent with a recent study in lung cancer pathology [10]. Extensive experiments have been conducted to demonstrate that imaging biomarkers from subtype cell information can better describe tumor morphology and provide more accurate predictions than state-of-the-art methods using imaging and molecular profiling data. In the future, we will try to find more quantitative measurements to better describe tumor morphology and further improve the prediction performance.


References
1. Barker, J., Hoogi, A., Depeursinge, A., Rubin, D.L.: Automated classification of brain tumor type in whole-slide digital pathology images using local representative tiles. Med. Image Anal. 30, 60–71 (2016)
2. Beck, A.H., Sangoi, A.R., Leung, S., Marinelli, R.J., Nielsen, T.O., van de Vijver, M.J., West, R.B., van de Rijn, M., Koller, D.: Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci. Transl. Med. 3(108), 108ra113 (2011)
3. Binder, H., Schumacher, M.: Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform. 9(1), 1–10 (2008)
4. Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S.: Random survival forests. Ann. Appl. Stat. 2(3), 841–860 (2008)
5. Li, H., Zhao, R., Wang, X.: Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification. arXiv preprint arXiv:1412.4526 (2014)
6. Shedden, K., Taylor, J.M., Enkemann, S.A., Tsao, M.S., Yeatman, T.J., Gerald, W.L., Eschrich, S., Jurisica, I., Giordano, T.J., Misek, D.E., et al.: Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat. Med. 14(8), 822–827 (2008)
7. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc.: Ser. B (Methodol.) 58, 267–288 (1996)
8. Travis, W.D., Harris, C.: Pathology and genetics of tumours of the lung, pleura, thymus and heart (2004)
9. Wang, H., Xing, F., Su, H., Stromberg, A., Yang, L.: Novel image markers for non-small cell lung cancer classification and survival prediction. BMC Bioinform. 15(1), 310 (2014)
10. Warth, A., Muley, T., Meister, M., Stenzinger, A., Thomas, M., Schirmacher, P., Schnabel, P.A., Budczies, J., Hoffmann, H., Weichert, W.: The novel histologic International Association for the Study of Lung Cancer/American Thoracic Society/European Respiratory Society classification system of lung adenocarcinoma is a stage-independent predictor of survival. J. Clin. Oncol. 30(13), 1438–1446 (2012)
11. Xie, Y., Xing, F., Kong, X., Su, H., Yang, L.: Beyond classification: structured regression for robust cell detection using convolutional neural network. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 358–365. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_43
12. Xu, Z., Huang, J.: Efficient lung cancer cell detection with deep convolution neural network. In: Wu, G., Coupé, P., Zhan, Y., Munsell, B., Rueckert, D. (eds.) PatchMI 2015. LNCS, vol. 9467, pp. 79–86. Springer, Heidelberg (2015). doi:10.1007/978-3-319-28194-0_10
13. Yao, J., Ganti, D., Luo, X., Xiao, G., Xie, Y., Yan, S., Huang, J.: Computer-assisted diagnosis of lung cancer using quantitative topology features. In: Zhou, L., Wang, L., Wang, Q., Shi, Y. (eds.) MLMI 2015. LNCS, vol. 9352, pp. 288–295. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24888-2_35
14. Yuan, Y., Failmezger, H., Rueda, O.M., Ali, H.R., Gräf, S., Chin, S.F., Schwarz, R.F., Curtis, C., Dunning, M.J., Bardwell, H., et al.: Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci. Transl. Med. 4(157), 157ra143 (2012)
15. Yuan, Y., Van Allen, E.M., Omberg, L., Wagle, N., Amin-Mansour, A., Sokolov, A., Byers, L.A., Xu, Y., Hess, K.R., Diao, L., et al.: Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat. Biotechnol. 32(7), 644–652 (2014)
16. Zhu, X., Yao, J., Luo, X., Xiao, G., Xie, Y., Gazdar, A., Huang, J.: Lung cancer survival prediction from pathological images and genetic data - an integration study. In: 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 1173–1176 (2016)

3D Segmentation of Glial Cells Using Fully Convolutional Networks and k-Terminal Cut

Lin Yang¹(B), Yizhe Zhang¹, Ian H. Guldner², Siyuan Zhang², and Danny Z. Chen¹

¹ Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
[email protected]
² Department of Biological Sciences, Harper Cancer Research Institute, University of Notre Dame, Notre Dame, IN 46556, USA

Abstract. Glial cells play an important role in regulating synaptogenesis, development of the blood-brain barrier, and brain tumor metastasis. Quantitative analysis of glial cells can offer new insights to many studies. However, the complicated morphology of the protrusions of glial cells and the entangled cell-to-cell network make it difficult to extract quantitative information from images. In this paper, we present a new method for instance-level segmentation of glial cells in 3D images. First, we obtain accurate voxel-level segmentation by leveraging the recent advances of fully convolutional networks (FCN). Then we develop a k-terminal cut algorithm to disentangle the complex cell-to-cell connections. During the cell cutting process, to better capture the nature of glial cells, a shape prior computed based on a multiplicative Voronoi diagram is exploited. Extensive experiments using real 3D images show that our method outperforms the state-of-the-art methods.

1 Introduction

Glial cells are heterogeneous central nervous system-specific cells. Recent studies have shown that glial cells are critical to various homeostatic biological processes (e.g., metabolic support of neurons and synapse formation and function), and are largely implicated in neuroinflammatory events such as traumatic brain injury and multiple sclerosis [3]. Quantitative analysis of glial cells, such as the number of cells, cell volume, number of protrusions, and length of each protrusion, can offer new insights to many studies. Although the current imaging modalities are able to image glial cells in 3D in their native environment, the complicated morphology of their protrusions and the entangled cell-to-cell network make the extraction of quantitative information very difficult (Fig. 1(a)). In this paper, we present a new method for instance-level segmentation of glial cells in 3D images based on fully convolutional networks (FCN) and k-terminal cut.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 658–666, 2016. DOI: 10.1007/978-3-319-46723-8_76

Fig. 1. (a) 3D visualization of a raw image of the glial cell network; (b) 3D visualization of the instance-level segmentation of (a) computed by our method. Random colors are assigned to every segmented instance (glial cell).

Many methods were developed to visualize and quantify glial cells in 3D microscopy images, such as angular variance [2], random walk [10], and local priority-based parallel tracing (LPP) [6]. Due to the difficulties in attaining accurate voxel-level segmentation, none of these methods can automatically compute accurate instance-level segmentation of glial cells in 3D. By leveraging the recent advances of fully convolutional networks (FCN) [7], we obtain accurate voxel-level segmentation (Fig. 2(d)). Based on that, we develop a new algorithm for computing instance-level segmentation of glial cells in 3D microscopy images.

Utilizing deep learning techniques greatly relieves the burden of image segmentation and detection. However, it is still non-trivial to attain accurate 3D instance-level segmentation of glial cells, for the following reasons. First, although some recent FCNs are capable of computing instance-level segmentation in 2D images [8], due to the memory limitation of the current computing systems, so far no FCN can directly yield good quality instance-level segmentation in 3D. Second, in the native environment of glial cells, they form a large network in which their protrusions entangle together (Fig. 1(a)). Hence, in the voxel-level segmentation, thousands of glial cells form one large connected network. In order to generate accurate 3D instance-level segmentation, we need to carefully cut this connected network into thousands of individual 3D glial cells. We model this cutting task as a k-terminal cut problem, in which the k terminals correspond to the root points of glial cells (the root point of a glial cell is the convergence point of its protrusions). Much work was done on the general k-terminal cut problem [5], which is NP-hard and was solved only by approximation. Interestingly, by exploiting the morphological properties of glial cells in our 3D images, we are able to develop an effective k-terminal cut algorithm to disentangle the complex cell-to-cell connections. First, since glial cells touch one another with their protrusions, we design a shape prior computed by a multiplicative Voronoi diagram (MVD) [1]. Then, we develop an iterative k-terminal cut algorithm to incorporate this shape prior. Finally, we apply our k-terminal cut algorithm to the voxel-level segmentation results computed by deep learning. Extensive experiments using real 3D images show that our method performs much better than the state-of-the-art glial cell analysis methods [6,10].


Fig. 2. (a) A training image; (b) the training labels for (a); (c) a testing image; (d) the testing result of (c).

2 Method

Our method consists of three major components: (1) voxel-level segmentation and root point detection using FCN; (2) computing a shape prior based on a multiplicative Voronoi diagram (MVD) in a network model; (3) generating 3D instance-level segmentation using a k-terminal cut algorithm.

2.1 Voxel-Level Segmentation and Root Point Detection

Accurate voxel-level segmentation and root point detection are the foundations of our method. As shown in [7], FCNs are a state-of-the-art method for segmentation in 2D images. Thus, we chose U-net [9] (a state-of-the-art FCN for biomedical image segmentation) to solve this problem. Note that due to the memory limitation, U-net is so far not effective for direct 3D segmentation. This section shows how we can still apply U-net to attain our goals on 3D images. We first process an input 3D image slice by slice. The procedure for the pixel-level segmentation in a 2D slice remains the same as in [9]. For the root point detection, we mark each root point by a disk of 5-pixel diameter in the training images. Figure 2(b) gives an example of the labeled training images for segmentation and root point detection. Because we use optical sectioning when we image the tissue samples, there is neither shift nor distortion between the slices. Thus, after processing the 2D slices, the voxel-level segmentation can be obtained by simply stacking the 2D segmentation results together. However, root point detection is more complicated. Figure 3(a) shows that after stacking the 2D root point detection slices, the detected root points form columns rather than spheres, and the columns belonging to different cells may be connected together. This is caused by two main factors: (1) the axial resolution of multi-photon microscopy is 2–3 times lower than the lateral resolution; (2) U-net is very sensitive in detecting root points (it can still detect the true center root points even when they are out of the focal plane). To discriminate different cells that have the same x-y positions but different z positions, we adopt the noise-tolerance non-maximum suppression method [4] (see Fig. 3(b) for an example).

2.2 Computing Shape Prior

After obtaining the voxel-level segmentation and root point detection results, we compute the 3D instance-level segmentation using a k-terminal cut algorithm. However, common minimum cut algorithms are sensitive to the number of cut edges and are likely to give trivial solutions (e.g., cutting only a few edges close to the terminals). In this section, we address this issue by computing a shape prior based on a multiplicative Voronoi diagram (MVD) in a network model.

Fig. 3. (a) 3D visualization of a stacked root point detection probability map. (b) 3D visualization of the non-maximum suppressed root point detection map (each root point is dilated for illustration). (c) Some touching glial cells. (d) The cell cutting areas computed by our method. The root points of glial cells are in yellow, the segmentation is in blue, and the computed cell cutting areas are in green.

Since glial cells touch one another with their protrusions, in the 3D instance-level segmentation the cutting boundary between two touching cells should fall inside the area of their touching protrusions. We call this area the touching protrusion area (TPA). We observed two properties of the TPAs: (1) while two neighboring cells may have multiple touching protrusions, their TPA always lies between the two root points; (2) the TPA is at least 8 µm away from their root points (this distance can be adjusted to accommodate different scenarios). Thus, we define the common region of two neighboring cells Ca and Cb that satisfies both these properties as the cell cutting area, denoted by Aab (Fig. 3(d)).

Due to the morphology of glial cells, we use the concept of MVD in a network model to define Aab. The MVD is a variant of the Voronoi diagram that uses weight parameters for the sites (root points) to control the volumes of their Voronoi regions (i.e., Voronoi cells) [1]. Since cell touching is a pairwise relation between two neighboring root points, we apply the MVD in a pairwise manner to determine Aab. The MVD is defined as follows: Given an undirected graph G = (V, E) and m sites vsi ∈ V with weight wi > 0, divide V into m Voronoi regions Ri, one for each site vsi, such that for every vx ∈ V, vx ∈ Ri if and only if D(vx, vsi)/wi = minj {D(vx, vsj)/wj}, where D(vp, vq) is the length of the shortest vp-to-vq path in G. We define the Voronoi region border of the MVD as a set Vbd of vertices: For each vx ∈ V, vx ∈ Vbd if and only if either vx is shared by different Voronoi regions or ∃(vx, vy) ∈ E such that vx and vy belong to different Voronoi regions.
In our model G = (V, E) for two cells Ca and Cb , V corresponds to all the voxels in Ca ∪ Cb , E is the 6-neighborhood system of V whose edge lengths are all 1, m = 2, vsi corresponds to a root point with i ∈ {a, b}, and wi is the weight of vsi . Since our MVD has only two sites, different weight values wa and wb with the same ratio of wa /wb result in the same partition of V . Let wab = wa /wb denote the relative weight. Note that different wab values yield different Vbd .


Intuitively, we regard every Voronoi region border of the MVD in G that satisfies property (2) as lying between the two root points and thus as a possible cell cutting area. We therefore define Aab as follows: vx ∈ V is in Aab if and only if ∃wab > 0 such that vx ∈ Vbd (for this wab value) and all vy ∈ Vbd satisfy property (2) of the TPAs (i.e., for each vy ∈ Vbd, D(vy, vsa) ≥ 8 µm and D(vy, vsb) ≥ 8 µm). Based on the definition of MVD, Vbd satisfies property (2) if and only if 1/θab ≤ wab ≤ θab, where θab = (D(vsa, vsb) − dmin)/dmin and dmin corresponds to 8 µm in the image space. Thus, we compute Aab as follows: For each vx ∈ V, vx is in Aab if and only if either

$$\frac{1}{\theta_{ab}} \le \frac{D(v_x, v_{s_a})}{D(v_x, v_{s_b})} \le \theta_{ab}$$

or ∃(vx, vy) ∈ E such that

$$\left[\frac{1}{\theta_{ab}},\ \theta_{ab}\right] \cap \left(\min\left\{\frac{D(v_x, v_{s_a})}{D(v_x, v_{s_b})}, \frac{D(v_y, v_{s_a})}{D(v_y, v_{s_b})}\right\},\ \max\left\{\frac{D(v_x, v_{s_a})}{D(v_x, v_{s_b})}, \frac{D(v_y, v_{s_a})}{D(v_y, v_{s_b})}\right\}\right) \neq \emptyset.$$
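The computation of the cell cutting area described above can be illustrated with a short sketch. This is not the authors' implementation: the function names, the representation of voxels as integer (x, y, z) tuples, and the assumption that the voxel set is connected and that d_min is expressed in edge-length units are all choices made here for illustration. Shortest-path lengths are obtained by breadth-first search, which is valid because all edge lengths are 1.

```python
from collections import deque

# 6-neighborhood offsets; every edge has length 1, as in the model above.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def neighbors(v, voxels):
    """Return the 6-neighbors of v that belong to the voxel set."""
    x, y, z = v
    return [(x + dx, y + dy, z + dz) for dx, dy, dz in OFFSETS
            if (x + dx, y + dy, z + dz) in voxels]


def shortest_path_lengths(voxels, source):
    """Breadth-first search: D(v, source) for every v reachable inside voxels."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in neighbors(v, voxels):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def cell_cutting_area(voxels, root_a, root_b, d_min):
    """Hypothetical helper: voxels whose distance ratio (or an incident edge)
    meets the interval [1/theta, theta]. Assumes voxels is connected."""
    dist_a = shortest_path_lengths(voxels, root_a)
    dist_b = shortest_path_lengths(voxels, root_b)
    theta = (dist_a[root_b] - d_min) / d_min
    if theta < 1.0:  # the interval [1/theta, theta] is empty
        return set()
    low, high = 1.0 / theta, theta

    def ratio(v):
        # D(v, s_a) / D(v, s_b); infinite at root_b itself.
        return float("inf") if dist_b[v] == 0 else dist_a[v] / dist_b[v]

    area = set()
    for v in voxels:
        r_v = ratio(v)
        if low <= r_v <= high:
            area.add(v)
            continue
        for u in neighbors(v, voxels):
            r_u = ratio(u)
            lo, hi = min(r_v, r_u), max(r_v, r_u)
            # The open interval (lo, hi) meets [low, high].
            if lo < hi and lo < high and hi > low:
                area.add(v)
                break
    return area
```

On a straight line of 11 voxels with the two root points at the ends and d_min = 2.5, this sketch marks the voxels at positions 2 through 8 as the cutting area.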

2.3

The Iterative k-Terminal Cut Algorithm

Given the voxel-level segmentation and k detected root points, we compute the 3D instance-level segmentation by assigning to each segmented cell voxel a label for one of the k root points. We use the k-terminal cut [5] to model this task. The k-terminal cut problem is defined as follows: Given an undirected graph G = (V, E) with edge costs and k terminals vt1, . . . , vtk ∈ V, assign to each vx ∈ V a label L(vx) ∈ {t1, . . . , tk} such that L(vti) = ti, all vertices with label ti form a connected subgraph in G using edges whose end vertices are both labeled by ti, for each i = 1, 2, . . . , k, and the total cost

$$T_{cost} = \sum_{(v_p, v_q) \in E} c(v_p, v_q) \cdot I\big(L(v_p) \neq L(v_q)\big)$$

is minimized, where c(vp, vq) > 0 is the cost of edge (vp, vq), and I(b) is the indicator function that is 1 when b is true and 0 otherwise.

In our k-terminal cut model, all the segmented cell voxels are vertices in G, the k root points are the k terminals, the labels L(v) specify the instance-level segmentation, E is the 6-neighborhood system of V, all edge lengths are 1, and the edge cost is c(vp, vq) = min{I(vp), I(vq)} + 1, where I(vx) is the intensity of voxel vx in the raw image.

In Sect. 2.2, we introduced a shape prior to guide our cell cutting. This shape prior is determined between the pairs of neighboring cells. Thus, we solve the k-terminal cut problem as a collection of minimum st-cut problems: we iteratively refine the cut between each pair of neighboring cells. We use the ordinary Voronoi diagram [1] in our graph G to define the initial neighboring relations (two glial cells are neighbors of each other if their initial Voronoi regions touch at their boundaries), and update the neighboring relations as the iterative algorithm refines the cuts. The details of the algorithm are given below.

Step 1: We first compute the ordinary Voronoi diagram in G, whose k sites correspond to the k root points. The definition of the ordinary Voronoi diagram is the same as that of the MVD in Sect. 2.2, but with all wi = 1. To avoid conflict in the labeling, for each vx ∈ V that belongs to multiple Voronoi regions, we assign vx to the Voronoi region with the largest index. As initialization, for each vx ∈ V, we let L(vx) = ti if and only if vx ∈ Ri. Let CCti denote the set of all the vertices with label ti. By the definition of minimum st-cuts and the ordinary Voronoi diagram algorithm in G, every CCti forms a connected subgraph in G by using edges whose end vertices are both labeled by ti. Intuitively, each CCti (roughly) corresponds to a cell in the image. Two cells Ca and Cb are taken as a neighboring pair if and only if ∃vp ∈ CCta and ∃vq ∈ CCtb such that (vp, vq) ∈ E. We construct the set N of neighboring pairs, with (CCti, CCtj) ∈ N if and only if CCti and CCtj are neighbors in G.

Step 2: For each (CCti, CCtj) ∈ N, we construct a subgraph G′ = (V′, E′) of G to compute (or refine) the cut between this pair, where V′ is the set of vertices in CCti ∪ CCtj and E′ = {(vp, vq) | (vp, vq) ∈ E and both vp, vq ∈ V′}. Since we allow cutting only in the TPAs, we compute Aij as in Sect. 2.2 (with G′ as the input graph, and vti and vtj as the two sites). Note that even for the same neighboring pair (CCti, CCtj), G′ may change from iteration to iteration. Thus, Aij needs to be recomputed in each iteration. To restrict the minimum st-cut in G′, for each (vp, vq) ∈ E′, we set its cutting cost c′(vp, vq) = c(vp, vq) if both vp, vq ∈ Aij, and c′(vp, vq) = ∞ otherwise. Then we refine the cut between CCti and CCtj by computing the minimum st-cut in G′ (with s = vti and t = vtj). After the st-cut, for each vx in G′ belonging to the component Hs containing s (resp., Ht containing t), we assign L(vx) = ti (resp., L(vx) = tj).

Step 3: After obtaining the cut between every neighboring pair in N, the neighboring relations in G may change. Hence, we update N accordingly. If Tcost for the k-terminal cut in G thus computed decreases after Step 2, we go back to Step 2 with the updated N. Otherwise, we output the results and stop.

When applied to our real 3D images, the above algorithm usually stops after 2–3 iterations. Once the labels L(v) are assigned, to generate the 3D instance-level segmentation, we assign each cell voxel in the image a color for its label. All the cell voxels sharing the same label have the same unique color.

The time complexity of the k-terminal cut algorithm is O(nkmv^3), where n is the number of iterations, k is the number of terminals, m is the maximum number of neighbors of a glial cell, and v is the number of voxels in the largest glial cell.
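The bookkeeping used by the iterative algorithm above (the total cost Tcost, the set N of neighboring pairs, and the restricted cutting cost c′) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: labels and intensities are assumed to be dictionaries keyed by integer (x, y, z) voxel tuples, all function names are hypothetical, and the minimum st-cut itself is left to any standard max-flow routine.

```python
# Positive offsets only, so that each undirected 6-neighborhood edge is visited once.
HALF_OFFSETS = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]


def edges(labels):
    """Yield every edge (p, q) of the 6-neighborhood system on the labeled voxels."""
    for (x, y, z) in labels:
        for dx, dy, dz in HALF_OFFSETS:
            q = (x + dx, y + dy, z + dz)
            if q in labels:
                yield (x, y, z), q


def edge_cost(intensity, p, q):
    """c(p, q) = min{I(p), I(q)} + 1, with I the raw-image intensity."""
    return min(intensity[p], intensity[q]) + 1


def total_cut_cost(labels, intensity):
    """T_cost: sum of c(p, q) over edges whose end vertices carry different labels."""
    return sum(edge_cost(intensity, p, q)
               for p, q in edges(labels) if labels[p] != labels[q])


def neighboring_pairs(labels):
    """The set N: unordered pairs of labels joined by at least one edge."""
    return {tuple(sorted((labels[p], labels[q])))
            for p, q in edges(labels) if labels[p] != labels[q]}


def restricted_cost(intensity, p, q, cutting_area):
    """c'(p, q): the original cost inside the cutting area, infinity elsewhere."""
    if p in cutting_area and q in cutting_area:
        return edge_cost(intensity, p, q)
    return float("inf")
```

In a loop following Steps 2 and 3, one would recompute `neighboring_pairs` and compare `total_cut_cost` before and after each round of pairwise refinements, stopping when the cost no longer decreases.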

3

Experiments and Results

To perform quantitative performance analysis, we collected four sample 3D images of mouse brains using two-photon microscopy. The size of each 3D image is 640×640×25 voxels and the resolution of each voxel is 1×1×2 µm. Fluorescence background in all images is corrected by the algorithm in [11]. To evaluate the performance of the root point detection, human experts manually marked all the root points in these images. Due to the difficulties in labeling ground truth in 3D, to evaluate the performance of the 3D instance-level segmentation, three representative 2D slices were selected from each of these four 3D images. To capture the changes of glial cells at different 3D depths, we use the 7th, 13th, and 19th slices. Human experts then labeled all the glial cells in these representative 2D slices. After that, glial cells are assigned unique IDs and these IDs are used to evaluate the 3D instance-level segmentation. The 1st and 25th slices of each 3D image are also labeled and used as training data for U-net.

Two recent methods are selected for comparison with our method. The first one is an interactive method based on random walk [10]. The second method detects the root points using a logistic classifier based on hand-crafted features and detects the center lines of the protrusions using the LPP algorithm [6].

To measure the accuracy of the root point detection, we compute the maximum cardinality bipartite matching between the true root points and the detected root points based on the following criterion. A true root point ri can be matched with a detected root point rj′ only if the Euclidean distance between ri and rj′ is smaller than 10 µm. All the matched detected root points are counted as true positives, all the unmatched true root points are counted as false negatives, and all the unmatched detected root points are counted as false positives. Then F1 scores are calculated, and the results are shown in Table 1. Note that since method one [10] requires that the positions of all root points be given as input, it is not applicable to this comparison.

For a fair comparison on the 3D instance-level segmentation, we give method one [10] our root detection results and method two [6] our voxel-level segmentation results. To generate a segmentation by method two [6], each cell voxel in our voxel-level segmentation is assigned to the nearest protrusion center line.

Table 1. F1 scores of the root point detection results.

              Sample 1   Sample 2   Sample 3   Sample 4   Average F1 score
Our method    0.8999     0.9208     0.9148     0.8783     0.9035
LPP [6]       0.7960     0.8336     0.7697     0.8054     0.8222

Table 2. F1 scores of the 3D instance-level segmentation results.

                    Sample 1   Sample 2   Sample 3   Sample 4   Average F1 score
Our method          0.8709     0.9087     0.9032     0.8931     0.8940
Our w/o MVD         0.8423     0.8930     0.8852     0.8937     0.8786
Random walk [10]    0.7781     0.7855     0.7620     0.7856     0.7778
LPP [6]             0.8017     0.7943     0.8150     0.7960     0.8018
LPP w/ our roots    0.8203     0.8320     0.8308     0.8157     0.8247

To measure the accuracy of the 3D instance-level segmentation, we also compute the maximum cardinality bipartite matching between the true instances and the segmented instances. A true instance Ii and a segmented instance Ij′ can be matched if and only if |Ii ∩ Ij′|/|Ii ∪ Ij′| > 0.5, where |I| denotes the total number of voxels of an instance I. All the voxels of the unmatched true instances are counted as false negatives, and all the voxels of the unmatched segmented instances are counted as false positives. Among the matched instances, the numbers of true positives, false negatives, and false positives are calculated based on their definitions. F1 scores are computed, and Table 2 summarizes the results.
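The root point evaluation protocol described in this section (maximum cardinality bipartite matching under a distance threshold, followed by an F1 score) can be sketched as below. The function names are hypothetical, coordinates are assumed to be given in µm, and the matching uses a standard augmenting-path method rather than any specific routine named in the paper.

```python
import math


def max_bipartite_matching(adjacency, num_right):
    """Size of a maximum matching, found with augmenting paths.

    adjacency[i] lists the right-side indices that left vertex i may be matched to.
    """
    match_of_right = [-1] * num_right

    def try_augment(i, seen):
        for j in adjacency[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take j if it is free, or if its current partner can be re-matched.
            if match_of_right[j] == -1 or try_augment(match_of_right[j], seen):
                match_of_right[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(adjacency)))


def root_point_f1(true_points, detected_points, max_dist=10.0):
    """F1 score: a true and a detected point may match only if closer than max_dist."""
    adjacency = [[j for j, d in enumerate(detected_points) if math.dist(t, d) < max_dist]
                 for t in true_points]
    tp = max_bipartite_matching(adjacency, len(detected_points))
    fn = len(true_points) - tp      # unmatched true root points
    fp = len(detected_points) - tp  # unmatched detected root points
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

The same matching routine could be reused for the instance-level evaluation by building the adjacency lists from the intersection-over-union criterion instead of the distance threshold.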


Fig. 4. A cropped window in a slice from Sample 3. (a) The original image (the ground truth is overlaid on the raw image for illustration); (b) the result of method one [10]; (c) the result of method two [6]; (d) our result. It shows that our method attains the best result when multiple cells are entangled together.

Some visual results are shown in Fig. 4. In all the cases shown, our method produces more accurate results than the other two methods. Finally, the average computation time for each cell is ∼3.5 s on a personal computer.

4

Conclusions

In this paper, we present a new method for instance-level segmentation of glial cells in 3D images based on fully convolutional networks and an iterative k-terminal cut algorithm. Further, we apply a shape prior of glial cells using a multiplicative Voronoi diagram. Experiments on real 3D images show that our method outperforms the state-of-the-art methods.

Acknowledgment. This research was supported in part by NSF Grants CCF-1217906 and CCF-1617735, and by NIH Grant 5R01CA194697-02.

References

1. Aurenhammer, F.: Voronoi diagrams – a survey of a fundamental geometric data structure. ACM Comput. Surv. (CSUR) 23(3), 345–405 (1991)
2. Bjornsson, C.S., Lin, G., Al-Kofahi, Y., Narayanaswamy, A., Smith, K.L., Shain, W., Roysam, B.: Associative image analysis: a method for automated quantification of 3D multi-parameter images of brain tissue. J. Neurosci. Methods 170(1), 165–178 (2008)
3. Clarke, L.E., Barres, B.A.: Emerging roles of astrocytes in neural circuit development. Nat. Rev. Neurosci. 14(5), 311–321 (2013)
4. Collins, T.J.: ImageJ for microscopy. Biotechniques 43(1 Suppl.), 25–30 (2007)
5. Cunningham, W.H.: The optimal multiterminal cut problem. DIMACS Ser. Discrete Math. Theor. Comput. Sci. 5, 105–120 (1991)
6. Kulkarni, P.M., Barton, E., Savelonas, M., Padmanabhan, R., Lu, Y., Trett, K., Shain, W., Leasure, J.L., Roysam, B.: Quantitative 3-D analysis of GFAP labeled astrocytes from fluorescence confocal images. J. Neurosci. Methods 246, 38–51 (2015)
7. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: 2015 IEEE Conference on CVPR, pp. 3431–3440 (2015)
8. Romera-Paredes, B., Torr, P.H.: Recurrent instance segmentation. arXiv preprint arXiv:1511.08250 (2015)
9. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_28
10. Suwannatat, P., Luna, G., Ruttenberg, B., Raviv, R., Lewis, G., Fisher, S.K., Höllerer, T.: Interactive visualization of retinal astrocyte images. In: 2011 IEEE Conference on ISBI, pp. 242–245 (2011)
11. Yang, L., Zhang, Y., Guldner, I.H., Zhang, S., Chen, D.Z.: Fast background removal in 3D fluorescence microscopy images using one-class learning. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 292–299. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_35

Detection of Differentiated vs. Undifferentiated Colonies of iPS Cells Using Random Forests Modeled with the Multivariate Polya Distribution

Bisser Raytchev1(B), Atsuki Masuda1, Masatoshi Minakawa1, Kojiro Tanaka1, Takio Kurita1, Toru Imamura2,3, Masashi Suzuki3, Toru Tamaki1, and Kazufumi Kaneda1

1 Department of Information Engineering, Hiroshima University, Higashihiroshima, Japan
[email protected]
2 School of Bioscience and Biotechnology, Tokyo University of Technology, Hachioji, Japan
3 Biotechnology Research Institute for Drug Discovery, AIST, Tokyo, Japan

Abstract. In this paper we propose a novel method for automatic detection of undifferentiated vs. differentiated colonies of iPS cells, which is able to achieve excellent accuracy of detection using only a few training images. Local patches in the images are represented through the responses of texture-layout filters over texton maps and learned using Random Forests. Additionally, we propose a novel method for probabilistic modeling of the information available at the leaves of the individual trees in the forest, based on the multivariate Polya distribution.

1

Introduction

Induced pluripotent stem (iPS) cells [10], for whose discovery S. Yamanaka received the Nobel Prize in Physiology or Medicine in 2012, are already revolutionizing medical therapy by personalizing regenerative medicine and contributing to the creation of novel human disease models for research and therapeutic testing (see e.g. [3] for a review of iPS cell technology, which also discusses clinical applications). Like embryonic stem (ES) cells, iPS cells have the ability to differentiate into any other cell type in the body. However, since iPS cells can be derived from adult somatic tissues, they avoid the ethical issues connected with ES cells, which can only be derived from embryos. Additionally, iPS cells do not engender immune rejection since they are autologous cells unique to each patient, which allows modeling disease in vitro on a patient-by-patient basis.

Since a large number of undifferentiated human iPS cells must be prepared for use as a renewable source of replacement cells for regenerative medicine, the development of an automated culture system for iPS cells is considered to be crucial. Among the multiple procedures involved in the culture system, detection of good/bad cells and the subsequent elimination of bad cells appears to be one of the most important parts. In this regard, machine learning techniques are expected to play an important role in the automatic detection of abnormalities, using images of the cultivated cell colonies.

Although this is a very new technology, several attempts in this direction have already been reported. In [11], in-house image analysis software is used to detect cell colonies, and then colonies are classified as iPS or non-iPS based on morphological rules represented in a decision tree. This method needs a large amount of data (more than 2000 colonies being used) to learn the rules, and an average accuracy of 80.3 % is achieved. Joutsijoki et al. [6] use intensity histograms calculated over the whole image of a colony as features and SVMs with a linear kernel function as a classifier, obtaining 54 % accuracy on 80 colony images.

In this paper we propose a novel method for automatic detection of Good/Bad colonies of iPS cells, which is able to achieve excellent accuracy of detection using only a few training images. Figure 1 shows a few examples of Good and Bad colonies of cells, which have been used in the experiments reported in this paper. In Good colonies most cells appear to be undifferentiated (compact, round), while in Bad colonies cells appear to be differentiated (spread, flattened). We represent local patches in the images as the responses of texture-layout filters over texton maps (details are given in the Experiments section), which are able to capture rich information about shape, texture and context/layout. The detection is performed using a Random Forest trained on the features extracted from the local patches. On a theoretical level, we propose a new method for probabilistic modeling of the information available at the leaves of the individual trees, based on the multivariate Polya distribution.

Fig. 1. Several representative examples of (a) Good (undifferentiated) and (b) Bad (differentiated) colonies of iPS cells, which have been used in the experiments.

© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 667–675, 2016. DOI: 10.1007/978-3-319-46723-8_77

2

Random Forests Modeled with the Multivariate Polya Distribution

Random forests (RFs) [1] are among the most successful machine learning algorithms and have found numerous applications in a large variety of medical image analysis tasks [2]. The building blocks of RFs are individual decision trees, which classify a data sample x ∈ X by recursively branching left or right down the tree, starting from a root node, until reaching a leaf node. Each leaf stores the class labels y ∈ Y of the subgroup of training samples that reached the same leaf during the training phase. A binary split function s(x, θj) ∈ {0, 1} is associated with each node j; guided by the parameters θj, it sends the data to the left if s(x, θj) = 0, and to the right if s(x, θj) = 1. Suitable parameters θj that would result in a good split of the data are found during the training phase by maximizing an information gain criterion (usually based on the decrease of the entropy after the split).

Individual decision trees are known to overfit due to high variance, and RFs avoid this by combining the outputs of multiple de-correlated trees. Many different ways to achieve de-correlation have been proposed, one of the most popular being to inject randomness at node level by randomly subsampling the features and the splits used to train each node. Much less attention has been given to the question of how to combine the information from the leaves of the forest in order to predict the label of the test data. The standard procedure is to normalize the histogram of class occurrences at the t-th tree's leaf by dividing by the total number of occurrences there, and to consider this to represent the posterior distribution pt(c|x) for class c for that tree. Then the forest prediction is calculated by simple averaging:

$$p(c \mid x) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid x). \tag{1}$$
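As a small illustration of the averaging rule in Eq. (1), the sketch below normalizes each leaf histogram and averages the resulting per-tree posteriors. The function names and the list-of-counts representation are assumptions made for this example, and every leaf histogram is assumed to be non-empty.

```python
def tree_posterior(histogram):
    """Normalize one leaf histogram of class counts into p_t(c | x)."""
    total = sum(histogram)
    return [count / total for count in histogram]


def forest_average(histograms):
    """Eq. (1): average the per-tree posteriors over the T leaves reached by x."""
    num_trees = len(histograms)
    num_classes = len(histograms[0])
    posteriors = [tree_posterior(h) for h in histograms]
    return [sum(p[c] for p in posteriors) / num_trees for c in range(num_classes)]
```

For two trees whose leaves hold the counts [8, 2] and [5, 5], the per-tree posteriors are [0.8, 0.2] and [0.5, 0.5], and their average is approximately [0.65, 0.35].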

In this paper, we investigate whether a more careful probabilistic modeling of the information available at the leaves might lead to capturing better leaf statistics. We consider a forest of trees, where each tree is indexed by $t = 1, 2, \ldots, T$. In the proposed approach, to each tree we associate a probability mass function (pmf) $\mathbf{p}^{(t)}$ of length $K$, where $K$ is the number of different classes. We model the $T$ pmfs as coming from a Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\alpha})$, defined as [4]

$$p(\mathbf{p}^{(t)} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\prod_{j=1}^{K} \Gamma(\alpha_j)} \prod_{j=1}^{K} \left(p_j^{(t)}\right)^{\alpha_j - 1}, \tag{2}$$

where $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_K]$ ($\alpha_j > 0$) are the parameters of the distribution and $\alpha_0 = \sum_{j=1}^{K} \alpha_j$.

However, we do not have direct access to the pmfs, but only to $n_t$ samples (discrete outcomes) drawn from the $t$-th pmf, which we represent by sample histograms $\mathbf{h}^{(t)}$ in vector form: the $j$-th element of $\mathbf{h}^{(t)}$ gives the number of times the $j$-th class has been observed in the leaf of the $t$-th tree. We denote this number by $h_{tj} = h_j^{(t)}$; for example, $h_{tj} = 10$ means that the $j$-th class has been observed 10 times in the corresponding leaf of the $t$-th tree. Also, $\sum_j h_{tj} = n_t$. Then $\mathbf{h}^{(t)}$ has a multinomial distribution with parameter $\mathbf{p}^{(t)}$:

$$p(\mathbf{h}^{(t)} \mid \mathbf{p}^{(t)}) = \frac{\left(\sum_{j=1}^{K} h_{tj}\right)!}{\prod_{j=1}^{K} h_{tj}!} \prod_{j=1}^{K} \left(p_j^{(t)}\right)^{h_{tj}}. \tag{3}$$

We can express the resulting distribution over the histogram of outcomes as

$$p(\mathbf{h}^{(t)} \mid \boldsymbol{\alpha}) = \int p(\mathbf{h}^{(t)}, \mathbf{p}^{(t)} \mid \boldsymbol{\alpha})\, d\mathbf{p}^{(t)} = \int p(\mathbf{h}^{(t)} \mid \mathbf{p}^{(t)}, \boldsymbol{\alpha})\, p(\mathbf{p}^{(t)} \mid \boldsymbol{\alpha})\, d\mathbf{p}^{(t)} = \int p(\mathbf{h}^{(t)} \mid \mathbf{p}^{(t)})\, p(\mathbf{p}^{(t)} \mid \boldsymbol{\alpha})\, d\mathbf{p}^{(t)}, \tag{4}$$

and, by substituting Eqs. (2) and (3) into the last line of Eq. (4) while taking the normalization constants outside the integral, obtain the multivariate Polya distribution (also known as the compound Dirichlet distribution):

$$p(\mathbf{h}^{(t)} \mid \boldsymbol{\alpha}) = \frac{\left(\sum_{j=1}^{K} h_{tj}\right)!}{\prod_{j=1}^{K} h_{tj}!} \cdot \frac{\Gamma(\alpha_0)}{\prod_{j=1}^{K} \Gamma(\alpha_j)} \int \prod_{j=1}^{K} \left(p_j^{(t)}\right)^{h_{tj} + \alpha_j - 1} d\mathbf{p}^{(t)} = \frac{\left(\sum_{j=1}^{K} h_{tj}\right)!\ \Gamma(\alpha_0)\ \prod_{j=1}^{K} \Gamma(\alpha_j + h_{tj})}{\prod_{j=1}^{K} h_{tj}!\ \prod_{j=1}^{K} \Gamma(\alpha_j)\ \Gamma\!\left(\sum_{j=1}^{K} (\alpha_j + h_{tj})\right)}. \tag{5}$$

Finally, the likelihood of $\boldsymbol{\alpha}$ can be formulated using all available data (leaf histograms) from all trees in the forest $D = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \ldots, \mathbf{h}^{(T)}\}$ as

$$p(D \mid \boldsymbol{\alpha}) = \prod_{t=1}^{T} p(\mathbf{h}^{(t)} \mid \boldsymbol{\alpha}) = \prod_{t=1}^{T} \frac{\left(\sum_{j=1}^{K} h_{tj}\right)!\ \Gamma(\alpha_0)\ \prod_{j=1}^{K} \Gamma(\alpha_j + h_{tj})}{\prod_{j=1}^{K} h_{tj}!\ \prod_{j=1}^{K} \Gamma(\alpha_j)\ \Gamma\!\left(\sum_{j=1}^{K} (\alpha_j + h_{tj})\right)}. \tag{6}$$

The gradient of the log-likelihood is

$$\frac{d \log p(D \mid \boldsymbol{\alpha})}{d\alpha_j} = \sum_{t=1}^{T} \Big[ \Psi(\alpha_0) - \Psi(n_t + \alpha_0) + \Psi(h_{tj} + \alpha_j) - \Psi(\alpha_j) \Big], \tag{7}$$

where $\Psi(x) = d \log \Gamma(x)/dx$ is known as the digamma function. The value of the parameter $\boldsymbol{\alpha}$ which maximizes the log-likelihood can then be computed via the fixed-point iteration [8]

$$\alpha_j^{new} = \alpha_j\, \frac{\sum_{t=1}^{T} \big[ \Psi(h_{tj} + \alpha_j) - \Psi(\alpha_j) \big]}{\sum_{t=1}^{T} \big[ \Psi(n_t + \alpha_0) - \Psi(\alpha_0) \big]}. \tag{8}$$

See also Refs. [5,8] for alternative ways to find the maximum of the log-likelihood. For the experiments reported in the next section we have used the Fastfit Matlab toolbox, available from http://research.microsoft.com/en-us/um/people/minka/software/fastfit/.

Once $\boldsymbol{\alpha}$ is estimated, the ensemble-aggregated decision of the forest in probabilistic form is given by the expectation $\boldsymbol{\pi} = E[\mathbf{p}^{(t)}] = \boldsymbol{\alpha} / \sum_{j=1}^{K} \alpha_j$, so that $\pi_j$ gives the posterior probability for the $j$-th class.
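The fixed-point iteration of Eq. (8) and the resulting forest posterior can be sketched in a few lines. This is an illustrative sketch, not the Fastfit toolbox used in the paper: the function names, the initial guess of all ones, and the stopping rule are assumptions. Because the counts are integers, the digamma differences are evaluated exactly through the recurrence Ψ(x + 1) = Ψ(x) + 1/x, so no special-function library is needed. At least one leaf histogram is assumed to be non-empty.

```python
def psi_diff(a, n):
    """Return Psi(a + n) - Psi(a) for an integer n >= 0.

    Follows from the recurrence Psi(x + 1) = Psi(x) + 1/x.
    """
    return sum(1.0 / (a + i) for i in range(n))


def fit_polya(histograms, max_iter=1000, tol=1e-9):
    """Estimate alpha from leaf histograms with the fixed-point update of Eq. (8)."""
    num_classes = len(histograms[0])
    alpha = [1.0] * num_classes  # assumed initial guess
    for _ in range(max_iter):
        alpha0 = sum(alpha)
        # Denominator of Eq. (8): sum over trees of Psi(n_t + alpha_0) - Psi(alpha_0).
        denom = sum(psi_diff(alpha0, sum(h)) for h in histograms)
        # Numerator of Eq. (8) per class; a class that is never observed gets alpha_j = 0.
        new_alpha = [alpha[j] * sum(psi_diff(alpha[j], h[j]) for h in histograms) / denom
                     for j in range(num_classes)]
        converged = max(abs(a - b) for a, b in zip(alpha, new_alpha)) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha


def forest_posterior(alpha):
    """pi = alpha / sum_j alpha_j, the expected class distribution."""
    total = sum(alpha)
    return [a / total for a in alpha]
```

A call such as `forest_posterior(fit_polya(leaf_histograms))`, where `leaf_histograms` holds one count vector per tree for the leaves reached by a test sample, would then take the place of the simple average of Eq. (1).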

3

Experimental Results

We conduct experiments using a dataset of 59 images of iPS cell colonies obtained at the Biotechnology Research Institute for Drug Discovery, AIST, Japan. The iPS cell colonies were cultured under various conditions either to maintain the undifferentiated state of the iPS cells or to induce cellular differentiation. The images were then captured by a digital camera mounted to a phase-contrast microscope. The size of the images is 1600 × 1200 pixels – several representative images are shown in Fig. 1. One of the typical conditions involved feeder-free culture surface and serum-free culture medium specifically formulated for iPS cells. The medium was exchanged every two days. This enabled maintenance and propagation of undifferentiated iPS cell colonies as the dominant cell population. Still, sometimes differentiated cells spontaneously emerged at low frequency in some colonies. In some other cases, undifferentiated iPS cells were cultured in differentiation-inducing medium for several days. This enabled controlled differentiation of the cells: the cells become different from undifferentiated iPS cells in both phenotype (shape) as well as biochemical characteristics. For automatic detection we consider 3 different categories: Good (undifferentiated cells), Bad (differentiated) and Background (or BGD for simplicity) – the last category including all other objects apart from the Good/Bad cells, coming from the culture medium, etc. Features are extracted from the images in the following manner. First, the images are converted to grayscale and convolved with a 48-dimensional filter bank of Gaussian derivative filters of several different scales and directions. Then a dictionary of textons [7] is learned by k-means clustering (k = 255 was used), and each pixel in an image is assigned to the nearest cluster center, i.e. a texton map is generated for each image. 
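The texton assignment step described above (each pixel is assigned to the nearest cluster center in filter-response space) can be sketched as follows. The function names and the nested-list image representation are assumptions for illustration; the cluster centers are taken as given, for example from a k-means run that is not shown here.

```python
def nearest_texton(response, centers):
    """Index of the cluster center closest (squared Euclidean distance) to one response vector."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(centers):
        dist = sum((r - c) ** 2 for r, c in zip(response, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def texton_map(responses, centers):
    """Build a texton map; responses[y][x] is the filter-bank response vector of pixel (x, y)."""
    return [[nearest_texton(pixel, centers) for pixel in row] for row in responses]
```

In the setting of the paper, each response vector would have 48 entries (one per filter) and `centers` would hold the learned texton dictionary.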
Once the texton map is generated for each image, we extract local features on a grid overlaid on the texton map by representing the local patch of the image centered on the current grid node (say the i-th pixel) using the texture-layout filter representation of [9]. The texture-layout filter randomly chooses rectangular areas of random size within the patch relative to its center (the i-th pixel) and calculates the proportion of pixels in each offset region that correspond to each texton. Thus, the dimension of the features is given by the product of the number of offset regions and the number of textons, and can be infinite in principle. These features contain rich information about texture, shape and context/layout, and are especially suitable for use in Random Forests, where the necessity to inject randomness at node level is well served by the random selection of the offset regions (which can be done on-the-fly, as needed). We found that patches of size 150 × 150 pixels and a grid step of 45 pixels produce the best results. The feature representation corresponding to each patch is assigned the label of the class it belongs to, i.e. Good, Bad or BGD, and sent to a Random Forest for either training or classification. We used the RF implementation in Microsoft Research's C++ and C# code library for decision forests, available from research.microsoft.com/en-us/projects/decisionforests.

Figure 2 compares the results obtained with a Random Forest that uses simple averaging to combine the predictions from each tree (left), the proposed method using the Polya distribution (center) and a Convolutional Neural Net (CNN) (right). CNNs have recently shown state-of-the-art results on similar medical image analysis problems, and therefore are used here as a reference. The results were obtained using 5-fold cross validation, and the bars show the average accuracy with 1 std error bars. Table 1 shows the confusion matrices corresponding to the results in Fig. 2.

Fig. 2. Comparison of accuracy obtained when using a Random Forest with averaging (left), the Polya distribution (center) and a CNN (right).

Table 1. Confusion matrices for the results shown in Fig. 2.

average   Good             Bad              BGD
Good      100.00 ± 0.00    0.00 ± 0.00      0.00 ± 0.00
Bad       0.00 ± 0.00      100.00 ± 0.00    0.00 ± 0.00
BGD       1.86 ± 2.73      0.60 ± 1.36      97.54 ± 2.55

Polya     Good             Bad              BGD
Good      100.00 ± 0.00    0.00 ± 0.00      0.00 ± 0.00
Bad       0.00 ± 0.00      98.57 ± 3.19     1.43 ± 3.19
BGD       1.25 ± 1.71      0.61 ± 1.36      98.14 ± 1.70

CNN       Good             Bad              BGD
Good      98.67 ± 2.98     0.00 ± 0.00      1.33 ± 2.98
Bad       0.00 ± 0.00      98.57 ± 3.19     1.43 ± 3.19
BGD       0.61 ± 1.36      0.00 ± 0.00      99.39 ± 1.36

Extensive experimentation showed that for the RF-based methods the best results were obtained with an ensemble of 1000 trees of depth 15, where the parameters θj for the split function at each node were determined by randomly subsampling 100 feature dimensions and 10 threshold levels. Similarly, extensive experimentation with different architectures and parameters for the CNN showed that the best results were obtained when using 4 convolution and max pooling layers and 1 fully connected layer with 1000 units, dropout rate 0.5, tanh activation and softmax for classification (values of the many other CNN-related parameters are omitted for lack of space).

The results indicate that all 3 methods perform very well on this task. While the average recognition rates are almost identical, there are some qualitative differences. For example, compared to RF+averaging, RF+Polya performs slightly worse on class Bad, but slightly better on class BGD. CNNs perform slightly better on class BGD, but worse on classes Good and Bad.

Fig. 3. Results obtained by RF+averaging (3rd column), RF+Polya (4th column) and CNN (5th column). First column shows the original images, and 2nd column the ground truth segmentation provided by an expert. See main text for details.

Figure 3 compares all 3 methods on some of the test images. In all images red is used to denote Good cells, green Bad cells and blue the BGD category. Saturation level is used to represent confidence in the results, e.g. darker red signifies higher probability for the Good category, while lower confidence corresponds to less saturated colors (the corresponding spots become whitish). Finally, Fig. 4 shows several cases where the corresponding regions have been wrongly identified by the CNN (last column), but have been correctly detected by the proposed method (3rd column), and vice versa.


Fig. 4. The first 2 rows show examples of regions where the proposed method produced significantly better results than the CNN, while the opposite is true for the bottom row. The first column shows the original images, the second column the ground truth, the third column the results from Polya and the last column the CNN results. Best viewed in color.

4

Conclusion

In this paper we have proposed a novel method for automatic detection of undifferentiated vs. differentiated colonies of iPS cells. Excellent accuracy of detection is achieved using only a few training images. The proposed method was able to obtain slightly better results on class BGD, compared to RF with simple averaging, and on class Bad compared to a CNN. However, further experiments on different datasets are needed to show whether the probabilistic modeling based on the multivariate Polya distribution could lead to a more significant increase in accuracy.

Acknowledgements. This work was supported in part by JSPS KAKENHI Grant Numbers 25330337, 16K00394 and 16H01430.


Detecting 10,000 Cells in One Second

Zheng Xu and Junzhou Huang

Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA

Abstract. In this paper, we present a generalized distributed deep neural network architecture to detect cells in whole-slide high-resolution histopathological images, which usually hold $10^8$ to $10^{10}$ pixels. Our framework can adapt and accelerate any deep convolutional neural network pixel-wise cell detector to perform whole-slide cell detection within a reasonable time limit. We accelerate the convolutional neural network forwarding through a sparse kernel technique, eliminating almost all of the redundant computation among connected patches. Since disk I/O becomes a bottleneck as the image size grows, we propose an asynchronous prefetching technique to hide a large portion of the disk I/O time. An unbalanced distributed sampling strategy is proposed to enhance the scalability and communication efficiency in distributed computing. Blending the advantages of the sparse kernel, asynchronous prefetching and distributed sampling techniques, our framework is able to accelerate the conventional convolutional deep learning method by nearly 10,000 times with the same accuracy. Specifically, our method detects cells in a $10^8$-pixel ($10^4 \times 10^4$) image in 20 s (approximately 10,000 cells per second) on a single workstation, which is an encouraging result in whole-slide imaging practice.

1 Introduction

Interest in the cell detection problem has grown recently in the research community. A large number of cell detection methods on small images (with around $10^4$ to $10^6$ pixels) have been proposed [1–4]. Due to the recent success of deep convolutional neural networks in imaging, several deep neural network based methods have been proposed for cell-related applications in the past few years [2–4]. While these methods have achieved great success on small images, very few of them are ready to be applied to practical whole-slide cell detection, since real whole-slide images usually have $10^8$ to $10^{10}$ pixels. It takes several weeks to detect cells in a single whole-slide image by directly applying the deep learning cell detection methods [2–4], which is prohibitive in practice.

J. Huang: This work was partially supported by U.S. NSF IIS-1423056, CMMI-1434401, CNS-1405985.
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 676–684, 2016. DOI: 10.1007/978-3-319-46723-8_78


To alleviate the issue, we propose a generalized distributed deep convolutional neural network framework for pixel-wise cell detection. Our framework accelerates any deep convolutional neural network pixel-wise cell detector. In the proposed framework, we first improve the forwarding speed of the deep convolutional neural network with the sparse kernel technique; similar techniques are described in [5,6]. In order to reduce the disk I/O time, we propose a novel asynchronous prefetching technique. The separable iteration behavior also suggests the need for a scalable and communication-efficient distributed and parallel computing framework to further accelerate the detection process on whole-slide images. We therefore recommend an unbalanced distributed sampling strategy with two spatial dimensions, extending the balanced cutting in [7]. The combination of the aforementioned techniques yields a speedup of up to 10,000x in practice. To the best of our knowledge, the research presented in this paper represents the first attempt to develop an extremely efficient deep neural network based pixel-wise cell detection framework for whole-slide images. In particular, it is general enough to work with any deep convolutional neural network on whole-slide imaging. Our technical contributions are summarized as: (1) A general sparse kernel neural network model is applied to pixel-wise cell detection, accelerating the forwarding procedure of deep convolutional neural networks. (2) An asynchronous prefetching technique is proposed to reduce nearly 95% of the disk I/O time. (3) We propose a scalable and communication-efficient framework to extend our neural network to multi-GPU and cluster environments, dramatically accelerating the entire detection process. Extensive experiments have been conducted to demonstrate the efficiency and effectiveness of our method.

2 Methodology

2.1 Sparse Kernel Convolutional Neural Network

The sparse kernel network takes the whole tile image, instead of a pixel-centered patch, as input and can predict the whole label map with just one pass of the accelerated forward propagation. The sparse kernel network uses the same weights as the original network trained in the training stage and generates exactly the same results as the original pixel-wise detector. To achieve this goal, we incorporate the k-sparse kernel technique [6] for convolution and blended max-pooling layers into our approach. The k-sparse kernels are created by inserting all-zero rows and columns into the original kernels so that every two originally neighboring entries are k pixels apart. In [6], however, it remains unclear how to deal with fully connected layers; this gap is filled in our research. A fully connected layer is treated as a special convolution layer with the kernel size set to the input dimension and the number of kernels set to the output dimension of the fully connected layer. This special convolution layer generates exactly the same output as the fully connected layer when given the same input. The conversion algorithm is summarized in Algorithm 1.
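As a hedged illustration of the zero-insertion step, and of why it can reproduce the patch-wise output, the following pure-Python sketch builds a k-sparse kernel and compares a patch-by-patch evaluation against a single whole-image pass. The toy network (2 × 2 max pooling with stride 2 followed by one 2 × 2 convolution on 4 × 4 patches) and all function names are assumptions made for this example; they are not taken from [6] or from the authors' code.

```python
def sparsify_kernel(kernel, k):
    """Insert zero rows/columns so neighbouring kernel entries end up k pixels apart."""
    rows, cols = len(kernel), len(kernel[0])
    sparse = [[0] * ((cols - 1) * k + 1) for _ in range((rows - 1) * k + 1)]
    for i in range(rows):
        for j in range(cols):
            sparse[i * k][j * k] = kernel[i][j]
    return sparse


def conv2d_valid(image, kernel):
    """Plain 'valid' cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]


def max_pool(image, size, stride):
    """Max pooling with a square window."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [[max(image[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)] for r in range(out_h)]


def detect_patchwise(image, weights):
    """Slow reference: run the toy network on every 4x4 patch separately."""
    out = []
    for r in range(len(image) - 3):
        row = []
        for c in range(len(image[0]) - 3):
            patch = [line[c:c + 4] for line in image[r:r + 4]]
            pooled = max_pool(patch, 2, 2)            # 4x4 patch -> 2x2 map
            row.append(conv2d_valid(pooled, weights)[0][0])
        out.append(row)
    return out


def detect_whole_image(image, weights):
    """Fast path: pooling stride forced to 1, later kernel made 2-sparse."""
    pooled = max_pool(image, 2, 1)
    return conv2d_valid(pooled, sparsify_kernel(weights, 2))
```

On integer inputs the two paths agree exactly, because the 2-sparse kernel reads the stride-1 pooled map at the same positions that the stride-2 pooling would have produced inside each patch.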


Algorithm 1. Network to Sparse Kernel Network Conversion Algorithm
Input: Original network $N$ with $K$ layers, denoted as $N = \{N^{(1)}, \dots, N^{(K)}\}$.
Output: Sparse kernel network $\hat{N}$ with $K$ layers.
Initialization: $d = 1$
for $k = 1, 2, \dots, K$ do
    if $N^{(k)}$ is a convolution layer then
        Set $\hat{N}^{(k)}$ as a ConvolutionSK layer
        $\hat{N}^{(k)}_{\text{stride}} := 1$, $\hat{N}^{(k)}_{\text{kstride}} := d$, $\hat{N}^{(k)}_{\text{kernel}} := N^{(k)}_{\text{kernel}}$
    else if $N^{(k)}$ is a pooling layer then
        Set $\hat{N}^{(k)}$ as a PoolingSK layer
        $\hat{N}^{(k)}_{\text{stride}} := 1$, $\hat{N}^{(k)}_{\text{kstride}} := d$
    else if $N^{(k)}$ is a fully connected layer then
        Set $\hat{N}^{(k)}$ as a ConvolutionSK layer
        $\hat{N}^{(k)}_{\text{stride}} := 1$, $\hat{N}^{(k)}_{\text{kstride}} := d$, $\hat{N}^{(k)}_{\text{num\_output}} := N^{(k)}_{\text{num\_output}}$
        $\hat{N}^{(k)}_{\text{kernel\_size}} := N^{(k-1)}_{\text{output\_shape}}$, $\hat{N}^{(k)}_{\text{kernel}} := N^{(k)}_{\text{weight}}$
    else
        $\hat{N}^{(k)} = N^{(k)}$
    end if
    $d := d \times N^{(k)}_{\text{stride}}$
end for
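The bookkeeping in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary-based layer description, the function names, and the assumption that every layer operates on square inputs are choices made for this example, not part of the authors' Caffe implementation. The effective extent of a layer with kernel size $s$ and kernel stride $d$ is taken to be $(s - 1)d + 1$, which follows from the zero-insertion construction.

```python
def to_sparse_kernel_network(layers, input_size):
    """Convert a list of layer specs following the steps of Algorithm 1.

    layers: dicts with 'type' in {'conv', 'pool', 'fc', 'relu'};
            conv/pool carry 'kernel' and 'stride', fc carries 'num_output'.
    input_size: spatial size of the (square) training patch.
    """
    d = 1                      # accumulated stride of the original network
    size = input_size          # output size of the previous original layer
    converted = []
    for layer in layers:
        kind = layer["type"]
        if kind == "conv":
            new = {"type": "conv_sk", "kernel": layer["kernel"],
                   "stride": 1, "kstride": d}
        elif kind == "pool":
            new = {"type": "pool_sk", "kernel": layer["kernel"],
                   "stride": 1, "kstride": d}
        elif kind == "fc":
            # Kernel size equals the spatial output shape of the previous layer.
            new = {"type": "conv_sk", "kernel": size, "stride": 1,
                   "kstride": d, "num_output": layer["num_output"]}
        else:
            new = dict(layer)  # e.g. ReLU is copied unchanged
        converted.append(new)
        if kind in ("conv", "pool"):
            size = (size - layer["kernel"]) // layer["stride"] + 1
            d *= layer["stride"]
        elif kind == "fc":
            size = 1
    return converted


def effective_extent(layer):
    """Footprint of a kernel once zeros are inserted between its entries."""
    return (layer["kernel"] - 1) * layer["kstride"] + 1


def output_size(converted, tile_size):
    """Spatial size of the label map produced from a padded square tile."""
    size = tile_size
    for layer in converted:
        if "kstride" in layer:
            size -= effective_extent(layer) - 1
    return size


# LeNet variant as listed on the left of Table 1.
LENET = [
    {"type": "conv", "kernel": 5, "stride": 1},
    {"type": "pool", "kernel": 2, "stride": 2},
    {"type": "conv", "kernel": 5, "stride": 1},
    {"type": "pool", "kernel": 2, "stride": 2},
    {"type": "fc", "num_output": 500},
    {"type": "relu"},
    {"type": "fc", "num_output": 2},
]
```

Applied to `LENET` with a 20 × 20 input, the effective extents come out as 5, 2, 9, 3, 5 and 1, and a 531 × 531 padded tile shrinks to 512 × 512, which agrees with the sizes listed for the accelerated network in Table 1.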

2.2 Asynchronous Prefetching

Compared with other procedures in the whole cell detection process, e.g. the memory transfer between GPU and CPU memory, disk I/O becomes a bottleneck in the cell detection problem. In this subsection, we describe our asynchronous prefetching technique, which relieves this bottleneck by reducing frequent I/O operations while avoiding insufficient-memory problems. We first load a relatively large image, referred to as a cached image, into memory (e.g., 4096 × 4096). While we start to detect cells on the first cached image tile by tile, we immediately start loading the second cached image in another thread. Since reading is usually faster than detection, the second cached image has already been loaded when detection on the first cached image finishes, so we can start detection on the second cached image and begin loading the next cached image immediately. Hence, the reading time of the second cached image, as well as of the cached images thereafter, is hidden from the overall runtime. Experiments have shown that this technique reduces approximately 95% of the disk I/O time. It achieves an even larger speedup on a cluster, since NFS (Network File System) operations are even more time-consuming and we hide most of them.
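The overlap of loading and detection described above can be sketched with a background thread and a bounded queue. This is a minimal sketch under stated assumptions: `prefetching_reader`, `load` and `keys` are hypothetical names introduced for this example, and the authors' implementation (built on Caffe and MPI) may be organized quite differently.

```python
import queue
import threading


def prefetching_reader(load, keys, depth=1):
    """Yield load(key) for each key while a worker thread loads ahead.

    At most `depth` finished items wait in the buffer (the worker may hold
    one more that it has loaded but not yet handed over).
    """
    buffer = queue.Queue(maxsize=depth)

    def worker():
        try:
            for key in keys:
                buffer.put((True, load(key)))
        except Exception as exc:        # forward loader errors to the consumer
            buffer.put((False, exc))
            return
        buffer.put((False, None))       # end-of-stream marker

    threading.Thread(target=worker, daemon=True).start()
    while True:
        ok, item = buffer.get()
        if ok:
            yield item                  # consumer runs detection on this item
        elif item is None:
            return
        else:
            raise item
```

A caller would iterate `for cached in prefetching_reader(read_region, regions): ...` and run tile-by-tile detection inside the loop body; `read_region` and `regions` are placeholders for whatever reads one cached image from disk. Only the first read is then exposed in the overall runtime, provided that reading is faster than detection.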

2.3 Multi-GPU Parallel and Distributed Computing

When considering distributed optimization, two resources are at play: (1) the amount of processing on each machine, and (2) the communication between machines. The single machine performance has been optimized in Sects. 2.1 and 2.2. We now describe the unbalanced distributed sampling strategy with two spatial dimensions used in our framework, which is a gentle extension of [7]. Assuming $T = \{(1, 1), (1, 2), \dots, (H, W)\}$ is the index set of an image of size $H \times W$, we aim at sampling tiles of sizes not larger than $h \times w$.

Unbalanced Partitioning. Let $S := \lceil HW/C \rceil$. We first partition the index set $T$ into a set of blocks $P^{(1)}, P^{(2)}, \dots, P^{(C)}$ according to the following criteria:

1. $T = \bigcup_{c=1}^{C} P^{(c)}$,
2. $P^{(c')} \cap P^{(c'')} = \emptyset$ for $c' \neq c''$,
3. $|P^{(c)}| \le S$,
4. $P^{(c)}$ is connected.

Sampling. After partitioning, we sample small tiles on $C$ different machines and devices. For each $c \in \{1, \dots, C\}$, $\hat{Z}^{(c)}$ is a connected subset of $P^{(c)}$ satisfying $|\hat{Z}^{(c)}| \le hw$ and $\hat{Z}^{(c')} \cap \hat{Z}^{(c'')} = \emptyset$ for $c' \neq c''$. The set-valued mapping $\hat{Z} = \bigcup_{c=1}^{C} \hat{Z}^{(c)}$ is termed $(C, hw)$-unbalanced sampling, which is used for fully sampling tile images from the entire image. Note that this is not a subsampling process, since all the tile images are sampled from the whole slide in one data pass. Since only index sets are transmitted among the machines, the communication cost of network transfer is very low. This distributed sampling strategy also ensures the scalability of the proposed framework, as indicated in Sect. 3.4.
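One way to obtain blocks that meet the four partitioning criteria is sketched below. The snake (boustrophedon) traversal is an assumption introduced here purely because it makes connectivity easy to guarantee; the paper does not say how its blocks are formed, and a practical system would more likely cut rectangular tiles. Indices are 0-based in the code, whereas the text uses 1-based indices.

```python
def snake_order(height, width):
    """Visit all pixels so that consecutive indices are 4-neighbours."""
    order = []
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def partition(height, width, num_blocks):
    """Split the index set into num_blocks connected, disjoint blocks.

    Block sizes differ by at most one, so none exceeds ceil(H*W / C).
    """
    order = snake_order(height, width)
    base, extra = divmod(len(order), num_blocks)
    blocks, start = [], 0
    for c in range(num_blocks):
        size = base + (1 if c < extra else 0)
        blocks.append(order[start:start + size])
        start += size
    return blocks


def sample_tiles(block, max_tile_pixels):
    """Cut one block into connected pieces of at most max_tile_pixels indices."""
    return [block[i:i + max_tile_pixels]
            for i in range(0, len(block), max_tile_pixels)]


def is_connected_run(indices):
    """True if every consecutive pair of indices is 4-adjacent."""
    return all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
               for a, b in zip(indices, indices[1:]))
```

Because each machine only needs its own list of indices (or, for rectangular tiles, just the tile corners), very little data has to be exchanged, which is the property the text relies on for low communication cost.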

3 Experiments

3.1 Experiment Setup

Throughout the experiment section, we use a variant [4,8]¹ of LeNet [9] as a pixel-wise classifier to show the effectiveness and efficiency of our framework. We have implemented our framework based on Caffe [10] and MPI. The original network structure is shown in Table 1 (left). The classifier is designed to classify a 20 × 20 patch centered at a specific pixel and to predict the probability that the pixel is in a cell region. Applying Algorithm 1, we obtain the accelerated network shown on the right of Table 1, which detects cells on a tile image of size 512 × 512. Since the classifier deals with 20 × 20 image patches, we mirror pad the original 512 × 512 tile image to a 531 × 531 image.

¹ The code is publicly available at https://github.com/uta-smile/caffe-fastfpbp. We also provide a web demo for our method at https://celldetection.zhengxu.work/.

Table 1. Original LeNet classifier (left) and accelerated forward network (right) architectures. M: the training batch size, N: the testing batch size. Layer types: I - Input, C - Convolution, MP - Max Pooling, ReLU - Rectified Linear Unit, FC - Fully Connected.

Original network (left):

Type    Maps and neurons      Filter size   Filter number   Stride
I       3 × 20 × 20M          -             -               -
C       20 × 16 × 16M         5             20              1
MP      20 × 8 × 8M           2             -               2
C       50 × 4 × 4M           5             50              1
MP      50 × 2 × 2M           2             -               2
FC      500M                  1             -               -
ReLU    500M                  1             -               -
FC      2M                    1             -               -

Accelerated network (right):

Type    Maps and neurons      Filter size   Filter number   Stride
I       3 × 531 × 531N        -             -               -
C       20 × 527 × 527N       5             20              1
MP      20 × 526 × 526N       2             -               1
C       50 × 518 × 518N       9             50              1
MP      50 × 516 × 516N       3             -               1
FC(C)   500 × 512 × 512N      5             -               1
ReLU    500 × 512 × 512N      1             -               -
FC(C)   2 × 512 × 512N        1             -               -

3.2 Effectiveness Validation

Our framework can be applied to any convolutional neural network for pixel-wise cell detection, e.g., [2–4]. Thus, the effectiveness of our framework depends strongly on the performance of the original deep neural networks designed for small-scale cell detection. In this subsection, we validate the consistency of results between our framework and the original work [4]. We conduct experiments on 215 tile images of size 512 × 512 sampled from the NLST² whole-slide images, with 83245 cell object annotations. These tile images are partitioned into three subsets: the training set (143 images), the testing set (62 images) and the evaluation set (10 images). The neural network model was trained on the training set with the original network described in Table 1 (left). We then applied Algorithm 1 to convert the original network into our framework. This experiment was conducted on a workstation with an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10 GHz, 32 GB RAM, and a single Nvidia K40 GPU. For quantitative analysis, we used precision, recall and F1 score as evaluation metrics to measure the performance of the two methods. Since the proposed method detects the rough cell area, we calculated the raw image moment centroid as the approximate nucleus location. Each detected cell centroid is associated with the nearest ground-truth annotation. A detected cell centroid is considered a True Positive ($TP$) sample if the Euclidean distance between the detected cell centroid and the ground-truth annotation is less than 8 pixels; otherwise, it is considered a False Positive ($FP$). Missed ground-truth dots are counted as False Negative ($FN$) samples. We consider the F1 score $F_1 = 2PR/(P + R)$, where precision $P = TP/(TP + FP)$ and recall $R = TP/(TP + FN)$. We report the precision, recall and F1 score of the original work and our framework in Table 2.

² https://biometry.nci.nih.gov/cdas/studies/nlst/.

Table 2. Quantitative comparison between original work and our framework

Methods             Precision     Recall        F1 score      Overall runtime   Pixel rate
Original work [4]   0.83 ± 0.09   0.84 ± 0.10   0.83 ± 0.07   38.47 ± 1.01      6814.24 ± 174.43
Our framework       0.83 ± 0.09   0.84 ± 0.10   0.83 ± 0.07   0.10 ± 0.00       2621440.00 ± 24.01
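The matching rule and the three scores described in Sect. 3.2 can be sketched as follows. The function name and the handling of several detections that fall on the same annotation (each is counted as a true positive here) are assumptions made for this example; the text does not specify that case.

```python
import math


def evaluate(detections, annotations, max_dist=8.0):
    """Return (precision, recall, f1) for lists of (x, y) centroids.

    A detection is a true positive if its nearest annotation is closer
    than max_dist pixels; annotations that no detection matched are
    counted as false negatives.
    """
    matched = set()
    tp = fp = 0
    for det in detections:
        if not annotations:
            fp += 1
            continue
        nearest = min(range(len(annotations)),
                      key=lambda i: math.dist(det, annotations[i]))
        if math.dist(det, annotations[nearest]) < max_dist:
            tp += 1
            matched.add(nearest)
        else:
            fp += 1
    fn = len(annotations) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, with three detections of which two lie within 8 pixels of distinct annotations and one annotation left unmatched, the sketch gives precision, recall and F1 all equal to 2/3.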


Table 2 also shows the overall runtime (in seconds) and pixel rate (pixels per second) comparison. While our framework produced the same result as the original work, our overall speed was approximately 400 times higher on small-scale images on a single GPU device. This is reasonable, since our method removes most of the redundant convolution computation among neighboring pixel patches.

3.3 Prefetching Speedup

In this subsection, we validate the effectiveness of the proposed asynchronous prefetching technique. Figure 1 shows the disk I/O time comparison among memory, file and prefetching modes on a whole-slide image (NLSI0000105, with spatial dimension 13483 × 17943). The I/O time is calculated as the difference between the overall runtime and the true detection time. As mentioned in Sect. 2.2, memory mode is slightly faster than file mode because memory mode requires fewer hardware interrupt invocations. Note that the prefetching technique does not truly reduce the I/O time. Instead, it hides most of the I/O time within the detection time, since the caching procedure and detection occur simultaneously. For a $10^8$-pixel whole-slide image, our technique hides 95% of the I/O time compared with file mode, because the only exposed I/O time with our prefetching technique is that of reading the first cached image.

Fig. 1. I/O time comparison among memory, file and proposed asynchronous prefetching modes (in seconds)

3.4 Parallel and Distributed Computing

In this subsection, we show our experimental results on several whole-slide images. We randomly selected five whole-slide images, in Aperio SVS format, from the NLST and TCGA [11] data sets, varying in size from $10^8$ to $10^{10}$ pixels. In order to show the efficiency of our method, we conducted experiments on all five whole-slide images on a single workstation with an Intel(R) Core(TM) i7-5930K CPU @ 3.50 GHz, 64 GB RAM, a 1 TB Samsung(R) 950 Pro Solid-State Drive and four Nvidia Titan X GPUs. Table 3 shows the overall runtime of cell detection on these whole-slide images. On a single workstation, our method is able to detect cells in a whole-slide image of size around $10^4 \times 10^4$ (NLSI0000105) in 20 s. Since the detection result of this whole-slide image includes approximately 200,000 cells, our method detects nearly 10,000 cells per second on average on a single workstation, while the original work [4] only detects approximately 6 cells per second, which amounts to a 1,500 times speedup.

Table 3. Time comparison on single workstation (in seconds)

Image name (Dimension)           1 GPU     2 GPUs    3 GPUs    4 GPUs
NLSI0000105 (13483 × 17943)      71.43     38.81     26.89     20.88
NLSI0000081 (34987 × 37879)      366.74    194.99    131.30    99.20
TCGA-05-4405 (83712 × 50432)     1502.16   800.24    529.00    449.94
TCGA-35-3615 (62615 × 133335)    2953.99   1519.57   1100.32   861.17
TCGA-38-4627 (65033 × 149642)    3385.28   1773.11   1216.80   972.36
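As a small worked check of the figures quoted around Table 3, the sketch below recomputes the multi-GPU speedups and the cells-per-second throughput. The runtimes are copied from Table 3 and the count of 200,000 cells from the text; the variable and function names are introduced here for illustration only.

```python
# Runtimes in seconds for 1, 2, 3 and 4 GPUs, copied from Table 3.
TABLE_3_SECONDS = {
    "NLSI0000105": [71.43, 38.81, 26.89, 20.88],
    "NLSI0000081": [366.74, 194.99, 131.30, 99.20],
    "TCGA-05-4405": [1502.16, 800.24, 529.00, 449.94],
    "TCGA-35-3615": [2953.99, 1519.57, 1100.32, 861.17],
    "TCGA-38-4627": [3385.28, 1773.11, 1216.80, 972.36],
}


def speedup_over_one_gpu(times):
    """Speedup of each configuration relative to the single-GPU runtime."""
    return [times[0] / t for t in times]


def cells_per_second(num_cells, seconds):
    """Average detection throughput."""
    return num_cells / seconds


if __name__ == "__main__":
    for name, times in TABLE_3_SECONDS.items():
        print(name, [round(s, 2) for s in speedup_over_one_gpu(times)])
    # About 200,000 cells in 20.88 s on four GPUs (NLSI0000105).
    print(round(cells_per_second(200_000, 20.88)))
```

With four GPUs the smallest image runs a little under 3.5 times faster than with one, and the throughput works out to roughly 9,600 cells per second, in line with the "nearly 10,000" stated in the text.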

The behavior of our method in a distributed computing environment is demonstrated on TACC Stampede GPU clusters³. Each node is equipped with two 8-core Intel Xeon E5-2680 2.7 GHz CPUs, 32 GB RAM and a single Nvidia K20 GPU. We show distributed results only for the last four images from Table 3, since the first image is too small to be sliced into 32 pieces. Table 4 shows that our method detects cells in a whole-slide image (TCGA-38-4627) with nearly $10^{10}$ pixels within 155.87 s. Directly applying the original work takes approximately 400 h (1,440,000 s), even without considering the disk I/O time. Our method thus achieves a speedup of nearly 10,000 times compared with naively applying [4]. The linear speedup also demonstrates the scalability and communication efficiency of our framework, since our sampling strategy removes most of the communication overhead.

Table 4. Time comparison on multi-node cluster (in seconds)

Image name (Dimension)           1         2         4         8        16       32
NLSI0000081 (34987 × 37879)      520.94    266.06    143.99    77.16    44.10    26.03
TCGA-05-4405 (83712 × 50432)     1820.08   945.77    508.23    271.02   155.39   86.31
TCGA-35-3615 (62615 × 133335)    3558.48   1834.00   944.91    487.47   266.35   147.07
TCGA-38-4627 (65033 × 149642)    4151.56   2107.46   1086.53   559.28   293.98   155.87
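To make the scaling claim concrete, the following sketch computes the parallel efficiency for the largest image and the speedup over the naive estimate of 400 h. The runtimes are the TCGA-38-4627 row of Table 4; reading the six columns as 1 to 32 cluster nodes is an interpretation based on the table caption and the remark about 32 pieces, and the names are illustrative.

```python
NODES = [1, 2, 4, 8, 16, 32]
# TCGA-38-4627 runtimes in seconds, copied from Table 4.
TCGA_38_4627_SECONDS = [4151.56, 2107.46, 1086.53, 559.28, 293.98, 155.87]
NAIVE_SECONDS = 400 * 3600      # approximately 400 h, as stated in the text


def parallel_efficiency(nodes, times):
    """Speedup over one node divided by the number of nodes (1.0 is ideal)."""
    return [times[0] / (n * t) for n, t in zip(nodes, times)]


def speedup_over_naive(seconds):
    """How many times faster a run is than the naive 400 h estimate."""
    return NAIVE_SECONDS / seconds


if __name__ == "__main__":
    for n, e in zip(NODES, parallel_efficiency(NODES, TCGA_38_4627_SECONDS)):
        print(f"{n:2d} nodes: efficiency {e:.2f}")
    print(f"speedup over naive: {speedup_over_naive(155.87):.0f}x")
```

Under these assumptions the efficiency stays above 0.8 even at 32 nodes, and the speedup over the naive estimate is a little over 9,000, consistent with the "nearly 10,000 times" quoted above.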

4 Conclusions

In this paper, a generalized distributed deep neural network framework is introduced to detect cells in whole-slide histopathological images. The framework can be applied with any deep convolutional neural network pixel-wise cell detector. Our method is highly optimized for detecting cells in whole-slide images in distributed environments. We utilize a sparse kernel neural network forwarding technique to remove nearly all redundant convolution computations. An asynchronous prefetching technique is recommended to hide most of the disk I/O time when loading large histopathological images into memory. Furthermore, an unbalanced distributed sampling strategy is presented to enhance the scalability and communication efficiency of our framework. These techniques constitute the three pillars of our framework. Extensive experiments demonstrate that our method can detect approximately 10,000 cells per second on a single workstation, which is encouraging for high-throughput cell data. Because our method enables high-speed cell detection, it can be expected to benefit further pathological analysis, e.g. feature extraction [12].

³ https://www.tacc.utexas.edu/stampede/.

Acknowledgments. The authors would like to thank NVIDIA for GPU donation and the National Cancer Institute for access to NCI's data collected by the National Lung Screening Trial. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by NCI.

References

1. Arteta, C., Lempitsky, V., Noble, J.A., Zisserman, A.: Learning to detect cells using non-overlapping extremal regions. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7510, pp. 348–356. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33415-3_43
2. Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 411–418. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_51
3. Xie, Y., Xing, F., Kong, X., Su, H., Yang, L.: Beyond classification: structured regression for robust cell detection using convolutional neural network. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 358–365. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_43
4. Pan, H., Xu, Z., Huang, J.: An effective approach for robust lung cancer cell detection. In: Wu, G., Coupé, P., Zhan, Y., Munsell, B., Rueckert, D. (eds.) Patch-MI 2015. LNCS, vol. 9467, pp. 87–94. Springer, Heidelberg (2015)
5. Giusti, A., Cireşan, D.C., Masci, J., Gambardella, L.M., Schmidhuber, J.: Fast image scanning with deep max-pooling convolutional neural networks. arXiv preprint arXiv:1302.1700 (2013)
6. Li, H., Zhao, R., Wang, X.: Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification. arXiv preprint arXiv:1412.4526 (2014)
7. Mareček, J., Richtárik, P., Takáč, M.: Distributed block coordinate descent for minimizing partially separable functions. In: Numerical Analysis and Optimization, pp. 261–288. Springer, Switzerland (2015)
8. Xu, Z., Huang, J.: Efficient lung cancer cell detection with deep convolution neural network. In: Wu, G., Coupé, P., Zhan, Y., Munsell, B., Rueckert, D. (eds.) Patch-MI 2015. LNCS, vol. 9467, pp. 79–86. Springer, Heidelberg (2015)
9. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
10. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
11. Network, C.G.A.R., et al.: Comprehensive molecular profiling of lung adenocarcinoma. Nature 511(7511), 543–550 (2014)
12. Yao, J., Ganti, D., Luo, X., Xiao, G., Xie, Y., Yan, S., Huang, J.: Computer-assisted diagnosis of lung cancer using quantitative topology features. In: Zhou, L., Wang, L., Wang, Q., Shi, Y. (eds.) MLMI 2015. LNCS, vol. 9352, pp. 288–295. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24888-2_35

A Hierarchical Convolutional Neural Network for Mitosis Detection in Phase-Contrast Microscopy Images

Yunxiang Mao and Zhaozheng Yin

Department of Computer Science, Missouri University of Science and Technology, Rolla, USA
{ym8r8,yinz}@mst.edu

Abstract. We propose a Hierarchical Convolutional Neural Network (HCNN) for mitosis event detection in time-lapse phase-contrast microscopy. Our method contains two stages: first, we extract candidate spatial-temporal patch sequences in the input image sequences which potentially contain mitosis events. Then, we identify whether each patch sequence contains a mitosis event using a hierarchical convolutional neural network. In the experiments, we validate the design of our proposed architecture and evaluate the mitosis event detection performance. Our method achieves 99.1% precision and 97.2% recall in very challenging image sequences of multipolar-shaped C3H10T1/2 mesenchymal stem cells and outperforms other state-of-the-art methods. Furthermore, the proposed method does not depend on hand-crafted feature design or cell tracking. It can be straightforwardly adapted to event detection for other cell types.

1 Introduction

Analyzing the proliferative behavior of stem cells in vitro plays an important role in many biomedical applications. Most analysis methods use fluorescent, luminescent or colorimetric microscopy images which are acquired by invasive methods, such as staining cells with fluorescent dyes and irradiating them with light of a specific wavelength. Invasive methods damage cell viability or kill cells, and are therefore not suitable for continuously monitoring the cell proliferation process. Phase-contrast microscopy, as a non-invasive imaging modality, offers the possibility to persistently monitor cell behavior in the culturing dish without altering the cells. Quantitatively analyzing the cell proliferation process relies on the accurate detection of mitosis events, in which the genetic material of a eukaryotic cell is equally divided, resulting in daughter cells. The process of a mitosis event consists of four stages: interphase, start of mitosis, formation of daughter cells, and separation of daughter cells, as shown in Fig. 1. According to the four stages, a mitotic cell performs the following sequential actions: it reduces its migration speed, shrinks in size and increases in brightness, appears like the number "8", and splits into two daughter cells. In this paper, mitosis detection is defined as detecting the time and location at which the daughter cells first appear (birth moment).

Fig. 1. The process of a mitosis event.

This research was supported by NSF CAREER award IIS-1351049, NSF EPSCoR grant IIA-1355406, ISC and CBSE centers at Missouri S&T.
© Springer International Publishing AG 2016. S. Ourselin et al. (Eds.): MICCAI 2016, Part II, LNCS 9901, pp. 685–692, 2016. DOI: 10.1007/978-3-319-46723-8_79

1.1 Related Work

Several mitosis detection methods based on phase-contrast microscopy images have been proposed in the past decade. Liu et al. [2] proposed an approach based on Hidden Conditional Random Fields (HCRF) [1] in which mitosis candidate patch sequences are extracted through a 3D seeded region growing method, and an HCRF is then trained to classify each candidate patch sequence. This method achieves good performance on C3H10T1/2 stem cell datasets. Since only one label is assigned to each patch sequence, this HCRF-based approach can classify each patch sequence as mitosis or not, but it cannot accurately localize the birth moment of the mitosis event in the patch sequence. A few extensions have been made to the HCRF-based approach. Huh et al. [3] proposed an Event-Detection CRF (EDCRF) in which each patch in a candidate sequence is assigned one label. The birth moment of the mitosis event is determined by checking whether there is a change from the "before mitosis" label to the "after mitosis" label. Liu et al. [4] utilized a maximum-margin learning framework for training the HCRF and proposed a semi-Markovian model to localize mitosis events. Cireşan et al. [5] utilized a DCNN as a pixel classifier for mitosis detection in individual breast cancer histology images. In histology, the specimens are stained and sandwiched, which makes this approach unsuitable for detecting mitosis events in time-lapse image sequences.

1.2 Motivation and Contributions

The previous mitosis detection approaches either use handcrafted features or take a single image as the input of DCNN architectures. If we attempt to detect mitosis events from a single image, we may lose the information on visual appearance change over the whole process of a mitosis event. Furthermore, motion information hidden in the continuous image sequence can also aid the detection of mitosis events. Thus, we propose a Hierarchical Convolutional Neural Network (HCNN) for the task of mitosis detection, which utilizes the temporal appearance change information and motion information in continuous microscopy images.

2 Methodology

Our proposed method takes a video sequence as the input, and detects when and where mitosis events occur in the sequence. It consists of two steps: first, candidate patch sequences that possibly contain mitosis events are extracted from the image sequences; then, each candidate patch sequence is classified by our Hierarchical Convolutional Neural Network (HCNN).

2.1 Mitosis Candidate Extraction

The mitosis candidate extraction aims to eliminate the regions in the input image where mitosis events are highly unlikely to occur, and to retrieve patch sequences in temporally continuous frames as the input to our classifier. We follow a process similar to that in [3], except that we run flatfield correction (illumination normalization, [10]) on the observed images and use a Gaussian filter with a standard deviation of 3 to smooth the background-subtracted image. The time length of each mitosis event may be quite different. However, the most salient images during mitosis are just a few images around the birth moment, so we choose a short fixed time length to extract candidate patch sequences. Each candidate patch sequence contains five 52 × 52 image patches. The precision of mitosis events is low (1.2%) after candidate extraction, thus we propose the HCNN in the next section to further improve the performance.

2.2 Hierarchical CNN Architecture

The overall architecture of our proposed Hierarchical Convolutional Neural Network (HCNN) is illustrated in Fig. 2. The first set of inputs contains five consecutive patches in the candidate patch sequence, and the second set of inputs contains the five corresponding motion images computed by the central finite difference. Each of the ten convolutional neural networks in the first layer ($CNN_1^k$, $k \in [1, 10]$) takes a single image as the input. In the second layer of our HCNN, we design two CNNs ($CNN_2^{11}$ and $CNN_2^{12}$) to learn joint features at the patch-sequence level from the original patch sequences and their motion patch sequences separately. In the last layer of our HCNN, combined appearance and motion features are fed into the last CNN ($CNN_3^{13}$) to make the final prediction. In the notation $CNN_i^k$, $i$ denotes the layer in our HCNN and $k$ indexes the CNN out of the total of 13 CNNs in our HCNN.

Fig. 2. The overview of our proposed Hierarchical CNN architecture

The design of such an architecture has two motivations. First, mitosis is a continuous event. Instead of detecting mitosis events from a single frame, leveraging several nearby frames is more reliable for detecting the birth moments of mitosis events. Second, the movement pattern of mitotic cells is different from that of migrating cells, thus utilizing the motion information should boost the classification performance.

The first layer of our HCNN contains ten CNNs ($CNN_1^k$, $k \in [1, 10]$), each of which classifies a single appearance or motion image at a different time instant of a mitosis event. The ten CNNs share the same architecture, as shown in Fig. 3. There are three convolutional layers, each followed by a 2 × 2 max pooling layer. We add one more drop-out layer to guard against over-fitting. The prediction layer outputs the label of the input image, indicating whether the input image is the image at the specific time instant of a mitosis event.

Fig. 3. The architecture of CNNs in the first layer of our HCNN.

The architecture of the CNNs in the second and last layers of our HCNN ($CNN_2^{11}$, $CNN_2^{12}$ and $CNN_3^{13}$) is shown in Fig. 4. The input to $CNN_2^{11}$ is the combined features from the Fully-connected Layer 2 of $CNN_1^k$, $k \in [1, 5]$, leading to a 5120-dimensional vector. The input to $CNN_2^{12}$ is the combined features from the Fully-connected Layer 2 of $CNN_1^k$, $k \in [6, 10]$, and the input to $CNN_3^{13}$ is the combined features from the Fully-connected Layer 3 of $CNN_2^{11}$ and $CNN_2^{12}$.

Fig. 4. The architecture of CNNs in the second and last layer of our HCNN.

Hierarchical CNN Training

Since the overall HCNN has 13 CNNs, the number of parameters is large. Training the whole HCNN at once would increase the training complexity and, given the limited amount of training data, the risk of over-fitting. Therefore, we divide the training process into the two steps below.

2.3.1 Pretraining Each CNN Independently

First, we train each CNN in the three layers independently. For the first-layer CNNs ($\mathrm{CNN}_1^k$, $k \in [1, 10]$), we use the trained weights of the first CNN (e.g., $\mathrm{CNN}_1^1$) to initialize the remaining four CNNs (e.g., $\mathrm{CNN}_1^k$, $k \in [2, 5]$) to achieve faster convergence. When training the 13 CNNs, we set the batch size to 100 and the number of epochs to 20, with the learning rate gradually decreasing from $10^{-3}$ to $10^{-4}$. The drop-out rate is set to 0.5 for all drop-out layers.

2.3.2 Fine-Tuning the Hierarchical CNN

After each CNN is properly pretrained, we fine-tune the complete HCNN. The prediction layers of the CNNs in the first and second layers are bypassed, and the error from the third-layer CNN ($\mathrm{CNN}_3^{13}$) is back-propagated to all the CNNs to update the weights.
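The wiring of the 13 CNNs and the pretraining settings described above can be summarised in a small bookkeeping sketch. It is not the authors' implementation: the per-network feature width `FC2_DIM = 1024` is inferred from the 5120-dimensional input of CNN 11 (5120 / 5), the motion stream is assumed to be initialised from its first CNN in the same way as the appearance stream, the geometric decay in `learning_rate` is only one possible reading of "gradually decreasing", and all class, field and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

FC2_DIM = 1024  # assumption: 5120-dimensional input of CNN 11 divided by its 5 sources


@dataclass
class CnnNode:
    index: int                       # k in the paper's notation (1..13)
    layer: int                       # i in the paper's notation (1..3)
    sources: List[int] = field(default_factory=list)  # CNNs whose features are concatenated
    init_from: Optional[int] = None  # CNN whose pretrained weights seed this one


def build_hcnn() -> Dict[int, CnnNode]:
    """Describe which CNN feeds which; no weights are created here."""
    nodes: Dict[int, CnnNode] = {}
    for k in range(1, 11):
        first_of_stream = 1 if k <= 5 else 6   # 1-5 appearance, 6-10 motion
        seed = None if k == first_of_stream else first_of_stream
        nodes[k] = CnnNode(index=k, layer=1, init_from=seed)
    nodes[11] = CnnNode(index=11, layer=2, sources=[1, 2, 3, 4, 5])
    nodes[12] = CnnNode(index=12, layer=2, sources=[6, 7, 8, 9, 10])
    nodes[13] = CnnNode(index=13, layer=3, sources=[11, 12])
    return nodes


def learning_rate(epoch: int, n_epochs: int = 20,
                  start: float = 1e-3, end: float = 1e-4) -> float:
    """Geometric decay from `start` (epoch 0) to `end` (last epoch); an assumed schedule."""
    return start * (end / start) ** (epoch / (n_epochs - 1))


def pretraining_order(nodes: Dict[int, CnnNode]) -> List[int]:
    """Step 1 order: lower layers first, and seed CNNs before the CNNs they initialise."""
    return sorted(nodes, key=lambda k: (nodes[k].layer, nodes[k].init_from is not None, k))


hcnn = build_hcnn()
print(len(hcnn), "CNNs; input width of CNN 11:", FC2_DIM * len(hcnn[11].sources))
print("pretraining order:", pretraining_order(hcnn))
print("learning rates:", [round(learning_rate(e), 6) for e in (0, 10, 19)])
```

In the fine-tuning step, a real implementation would additionally drop the prediction layers of the layer-1 and layer-2 nodes and back-propagate the loss of CNN 13 through every `sources` edge.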

3 Experiments

3.1 Dataset

We evaluate our proposed method on five phase-contrast video sequences obtained from [4], containing 79, 94, 85, 120 and 41 mitotic cells, respectively. Each sequence consists of 1436 images (resolution: 1392 × 1040 pixels). The location and time of mitosis events in the video sequences are provided as the ground truth. In order to train our HCNN, data expansion is performed to generate more positive training data. For each positive mitosis sequence, we rotate the images in steps of 45° (8 variations) and slightly translate them (e.g., by 5 pixels) horizontally and/or vertically (9 variations), which yields 72 times the original positive training data. We retrieve negative training sequences with our proposed candidate patch sequence extraction method. Finally, the training data are balanced by randomly duplicating positive samples until the numbers of positive and negative samples are equal.
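The factor of 72 follows from combining the 8 rotations with the 9 translations. A minimal sketch of this bookkeeping and of the balancing step is given below. It assumes the 9 translations are the offsets -5, 0 and +5 pixels along each axis (the text gives 5 pixels only as an example), and the function names are hypothetical.

```python
import random
from itertools import product

ROTATIONS_DEG = [45 * i for i in range(8)]   # 0, 45, ..., 315 degrees
SHIFTS_PX = (-5, 0, 5)                       # assumed offsets along each axis


def augmentation_params():
    """All (rotation, dx, dy) combinations applied to one positive sequence."""
    return list(product(ROTATIONS_DEG, SHIFTS_PX, SHIFTS_PX))


def balance(positives, negatives, seed=0):
    """Randomly duplicate positives until both classes have the same size."""
    rng = random.Random(seed)
    positives = list(positives)
    deficit = len(negatives) - len(positives)
    if deficit > 0 and positives:
        positives += rng.choices(positives, k=deficit)
    return positives, list(negatives)


params = augmentation_params()
print(len(params))  # 8 rotations x 9 translations
pos, neg = balance(["p1", "p2", "p3"], [f"n{i}" for i in range(10)])
print(len(pos), len(neg))
```

The identity transform (no rotation, no shift) is one of the combinations, so the original sequence is counted among the 72 variants.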

3.2 Evaluation Metric

We adopt a leave-one-out policy in the experiments, i.e., four sequences are used for training and the remaining one for testing. For testing, we use non-maximum suppression to merge the detection results based on their spatial and temporal locations and confidence scores. We use two evaluation metrics in our experiments. First, we evaluate the performance of mitosis occurrence detection in terms of the mean and standard deviation of precision, recall and F score over the five leave-one-out tests, without examining the timing of birth events. In this case, a True Positive (TP) is a detected patch sequence that contains a mitosis event, a False Positive (FP) is a detected patch sequence that does not contain a mitosis event, and a False Negative (FN) is a patch sequence containing a mitosis event that is classified as negative. Second, the performance of mitosis detection is evaluated more strictly in terms of the timing error of birth moments: the aforementioned true positive patch sequences are counted as true positives only if the timing error of the mitosis event is less than or equal to a given threshold. The timing error is measured as the frame difference between the detection result and the ground truth.

3.3 Evaluation on the Hierarchical Architecture

In this section, we show the effectiveness of each module in the proposed architecture design. We compare the performance of four designs: a single-appearance CNN ($\mathrm{CNN}_1^3$) targeted at the detection of the birth moment; a multi-appearance HCNN with the five original image patches as input ($\mathrm{CNN}_1^1$ to $\mathrm{CNN}_1^5$ plus $\mathrm{CNN}_2^{11}$); a simple CNN that takes 10-channel images as input; and our complete HCNN. As shown in Table 1, because the single-appearance CNN cannot capture temporal appearance changes, its F score is about 5 percentage points lower than that of the multi-appearance HCNN, which classifies the whole patch sequence. With only appearance information as input, the F score of the multi-appearance HCNN is about 10 percentage points lower than that of our HCNN, which further incorporates motion information. As shown in [9], fusing temporal information at the feature level is better than at the input pixel level; accordingly, our HCNN performs better than a simple CNN with 10-channel images as input.

Table 1. Mitosis occurrence detection accuracy of different designs.

Model                          Precision (%)   Recall (%)    F score (%)
Our HCNN                       99.1 ± 0.8      97.2 ± 2.4    98.2 ± 1.3
CNN with multi-channel input   97.6 ± 1.2      94.0 ± 1.9    95.8 ± 1.2
Multi-appearance HCNN          90.9 ± 3.8      85.6 ± 3.3    88.1 ± 1.4
Single-appearance CNN          85.9 ± 4.7      80.5 ± 8.1    82.9 ± 4.7
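As a quick consistency check on Table 1, assume the F score is the usual harmonic mean of precision and recall (the text does not spell out the formula). The mean precision and recall of our HCNN then give a value close to the reported one. The two need not coincide exactly, since the table averages per-fold F scores.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2PR}{P + R}
\]
\[
F \approx \frac{2 \times 99.1 \times 97.2}{99.1 + 97.2}
  = \frac{19265.04}{196.3} \approx 98.1
\]
```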

Table 2. Comparison of mitosis detection accuracy.

Model              Precision (%)   Recall (%)    F score (%)
Our HCNN           99.1 ± 0.8      97.2 ± 2.4    98.6 ± 1.3
MM-HCRF+MM-SMM     95.8 ± 1.0      88.1 ± 3.1    91.8 ± 2.0
MM-HCRF            82.8 ± 2.4      92.2 ± 2.4    87.2 ± 1.6
EDCRF              91.3 ± 4.0      87.0 ± 4.8    88.9 ± 0.7
CRF                90.5 ± 4.7      75.3 ± 9.6    81.5 ± 4.4
HMM                83.4 ± 4.9      79.4 ± 8.8    81.0 ± 3.4
SVM                68.0 ± 3.4      96.0 ± 4.2    79.5 ± 1.7

Table 3. Comparison of mitosis event timing accuracy.

th    Precision (%)             Recall (%)                F score (%)
      Our HCNN     [4]          Our HCNN     [4]          Our HCNN     [4]
1     92.8 ± 1.4   79.8 ± 3.4   93.1 ± 1.1   73.3 ± 2.4   93.0 ± 0.4   76.4 ± 2.7
3     96.6 ± 1.1   91.1 ± 2.2   94.9 ± 2.0   83.8 ± 3.7   95.8 ± 0.8   87.3 ± 2.8
5     98.3 ± 1.2   94.7 ± 0.5   96.9 ± 1.6   87.1 ± 2.8   97.6 ± 0.9   90.8 ± 1.7
10    99.1 ± 0.8   95.8 ± 1.0   97.2 ± 2.4   88.1 ± 3.1   98.2 ± 1.3   91.8 ± 2.0

3.4 Comparisons

We compare our method with six existing methods (Table 2), including Max-Margin Hidden Conditional Random Fields + Max-Margin Semi-Markov Model (MM-HCRF + MM-SMM) [4], EDCRF [3], HCRF [2], Hidden Markov Models (HMMs) [6], and Support Vector Machines (SVMs) [7]. As shown in Table 2, our HCNN achieves an average precision of 99.14 %, recall of 97.21 % and F score of 98.15 %, outperforming these methods by a large margin. When evaluating mitosis detection in terms of the timing error of the birth event, we use four different thresholds th (1, 3, 5 and 10 frames) to report the precision, recall and F score. As shown in Table 3, our HCNN achieves better performance than MM-HCRF + MM-SMM [4]. The reason is two-fold. First, the method in [4] extracts hand-crafted SIFT features [8] from patch images, which are less suitable descriptors than the features learned by a CNN. Second, the method in [4] labels each patch throughout the whole progress of mitosis, but the early and late frames may introduce noise into the model because their appearance is not distinctive. In contrast, we focus only on consecutive frames near the birth event, whose appearance is distinctive and easy to capture.
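The test-time procedure of Sect. 3.2 and the threshold-based scores of Table 3 can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code behind the tables: the tuple layouts, the spatial radius of 20 pixels, the temporal suppression window of 5 frames, and the greedy one-to-one matching between detections and ground-truth events are all hypothetical choices that the text does not specify.

```python
from math import hypot


def suppress(detections, radius=20.0, window=5):
    """Greedy non-maximum suppression over (x, y, t, score) tuples."""
    kept = []
    for det in sorted(detections, key=lambda d: d[3], reverse=True):
        x, y, t, _ = det
        clash = any(hypot(x - kx, y - ky) <= radius and abs(t - kt) <= window
                    for kx, ky, kt, _ in kept)
        if not clash:
            kept.append(det)
    return kept


def evaluate(detections, truths, th, radius=20.0):
    """Return (precision, recall, f_score) for a timing threshold th in frames."""
    unmatched = list(truths)          # ground-truth events as (x, y, t)
    tp = 0
    for x, y, t, _ in sorted(detections, key=lambda d: d[3], reverse=True):
        candidates = [g for g in unmatched
                      if hypot(x - g[0], y - g[1]) <= radius and abs(t - g[2]) <= th]
        if candidates:
            # Match the temporally closest event and use each event only once.
            unmatched.remove(min(candidates, key=lambda g: abs(t - g[2])))
            tp += 1
    fp = len(detections) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score


# Toy data, invented purely for illustration.
detections = [(100, 100, 50, 0.9), (102, 101, 51, 0.6),
              (400, 300, 120, 0.8), (700, 700, 10, 0.7)]
truths = [(101, 100, 50), (398, 302, 124), (50, 900, 300)]

kept = suppress(detections)
for th in (1, 3, 5, 10):
    print(th, evaluate(kept, truths, th))
```

In this toy example, loosening the threshold turns a detection that is four frames off from a false positive into a true positive, which mirrors the trend across the rows of Table 3.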

4 Conclusion

In this paper, we propose a Hierarchical Convolutional Neural Network (HCNN) for mitosis event detection in phase-contrast microscopy images. We extract candidate patch sequences from the image sequence as the input to the HCNN. In our HCNN architecture, we utilize both the appearance information and the temporal cues hidden in patch sequences to identify the birth event of mitotic cells. Given the complex HCNN structure, we propose an efficient training methodology to learn the parameters of the HCNN and reduce the risk of over-fitting. The experiments show that the design of our HCNN is sound and that our method outperforms other state-of-the-art methods by a large margin.

References

1. Quattoni, A., et al.: Hidden conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell. 29(10), 1848–1853 (2007)
2. Liu, A., et al.: Mitosis sequence detection using hidden conditional random fields. In: Proceedings of IEEE International Symposium on Biomedical Imaging (2010)
3. Huh, S., et al.: Automated mitosis detection of stem cell populations in phase-contrast microscopy images. IEEE Trans. Med. Imaging 30(3), 586–596 (2011)
4. Liu, A., et al.: A semi-Markov model for mitosis segmentation in time-lapse phase contrast microscopy image sequences of stem cell populations. IEEE Trans. Med. Imaging 31(2), 359–369 (2012)
5. Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 411–418. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_51
6. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
7. Suykens, J., et al.: Least squares support vector machine classifiers. Neural Process. Lett. 9(3), 293–300 (1999)
8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
9. Karpathy, A., et al.: Large-scale video classification with convolutional neural networks. In: Proceedings of CVPR (2014)
10. Murphy, D.: Fundamentals of Light Microscopy and Electronic Imaging. Wiley, New York (2001)

Author Index

Aalamifar, Fereshteh I-577 Abdulkadir, Ahmed II-424 Aboagye, Eric O. III-536 Abolmaesumi, Purang I-465, I-644, I-653 Aboud, Katherine I-81 Aboulfotouh, Ahmed I-610 Abugharbieh, Rafeef I-132, I-602 Achilles, Felix I-491 Adalsteinsson, Elfar III-54 Adeli, Ehsan I-291, II-1, II-79, II-88, II-212 Adler, Daniel H. III-63 Aertsen, Michael II-352 Afacan, Onur III-544 Ahmadi, Seyed-Ahmad II-415 Ahmidi, Narges I-551 Akgök, Yigit H. III-527 Alansary, Amir II-589 Alexander, Daniel C. II-265 Al-Kadi, Omar S. I-619 Alkhalil, Imran I-431 Alterovitz, Ron I-439 Amann, Michael III-362 An, Le I-37, II-70, II-79 Anas, Emran Mohammad Abu I-465 Ancel, A. III-335 Andělová, Michaela III-362 Andres, Bjoern III-397 Angelini, Elsa D. II-624 Ankele, Michael III-502 Arbeláez, Pablo II-140 Armbruster, Marco II-415 Armspach, J.-P. III-335 Arslan, Salim I-115 Aung, Tin III-441 Awate, Suyash P. I-237, III-191 Ayache, Nicholas III-174 Aydogan, Dogu Baran I-201 Azizi, Shekoofeh I-653 Bagci, Ulas I-662 Bahrami, Khosro II-572 Bai, Wenjia III-246 Bajka, Michael I-593 Balédent, O. III-335

Balfour, Daniel R. III-493 Balte, Pallavi P. II-624 Balter, Max L. III-388 Bandula, Steven I-516 Bao, Siqi II-513 Barillot, Christian III-570 Barkhof, Frederik II-44 Barr, R. Graham II-624 Barratt, Dean C. I-516 Bartoli, Adrien I-404 Baruthio, J. III-335 Baumann, Philipp II-370 Baumgartner, Christian F. II-203 Bazin, Pierre-Louis I-255 Becker, Carlos II-326 Bengio, Yoshua II-469 Benkarim, Oualid M. II-505 BenTaieb, Aïcha II-460 Berks, Michael III-344 Bermúdez-Chacón, Róger II-326 Bernardo, Marcelino I-577 Bernasconi, Andrea II-379 Bernasconi, Neda II-379 Bernhardt, Boris C. II-379 Bertasius, Gedas II-388 Beymer, D. III-238 Bhaduri, Mousumi III-210 Bhalerao, Abhir II-274 Bickel, Marc II-415 Bilgic, Berkin III-467 Bilic, Patrick II-415 Billings, Seth D. III-133 Bischof, Horst II-230 Bise, Ryoma III-326 Blendowski, Maximilian II-598 Boctor, Emad M. I-577, I-585 Bodenstedt, S. II-616 Bonmati, Ester I-516 Booth, Brian G. I-175 Borowsky, Alexander I-72 Bounincontri, Guido III-579 Bourdel, Nicolas I-404 Boutagy, Nabil I-431 Bouvy, Willem H. II-97

Boyer, Edmond III-450 Bradley, Andrew P. II-106 Brahm, Gary II-335 Breeuwer, Marcel II-97 Brosch, Tom II-406 Brown, Colin J. I-175 Brown, Michael S. III-273 Brox, Thomas II-424 Burgess, Stephen II-308 Burgos, Ninon II-547 Bustamante, Mariana III-519 Buty, Mario I-662 Caballero, Jose III-246 Cai, Jinzheng II-442, III-183 Cai, Weidong I-72 Caldairou, Benoit II-379 Canis, Michel I-404 Cao, Xiaohuan III-1 Cao, Yu III-238 Carass, Aaron III-553 Cardon, C. III-255 Cardoso, M. Jorge II-547, III-605 Carlhäll, Carl-Johan III-519 Carneiro, Gustavo II-106 Caselli, Richard I-326 Cattin, Philippe C. III-362 Cerrolaza, Juan J. III-219 Cetin, Suheyla III-467 Chabannes, V. III-335 Chahal, Navtej III-158 Chang, Chien-Ming I-559 Chang, Eric I-Chao II-496 Chang, Hang I-72 Chang, Ken I-184 Chapados, Nicolas II-469 Charon, Nicolas III-475 Chau, Vann I-175 Chen, Alvin I. III-388 Chen, Danny Z. II-176, II-658 Chen, Geng I-210, III-587 Chen, Hanbo I-63 Chen, Hao II-149, II-487 Chen, Kewei I-326 Chen, Ronald I-627 Chen, Sihong II-53 Chen, Terrence I-395 Chen, Xiaobo I-37, II-18, II-26 Chen, Xin III-493 Cheng, Erkang I-413

Cheng, Jie-Zhi II-53, II-247 Choyke, Peter I-653 Christ, Patrick Ferdinand II-415 Christlein, Vincent III-432 Chu, Peng I-413 Chung, Albert C.S. II-513 Çiçek, Özgün II-424 Çimen, Serkan III-142, III-291 Clancy, Neil T. III-414 Cobb, Caroline II-308 Coello, Eduardo III-596 Coles, Claire I-28 Collet, Pierre I-534 Collins, Toby I-404 Comaniciu, Dorin III-229 Combès, Benoit III-570 Commowick, Olivier III-570, III-622 Cook, Stuart III-246 Cooper, Anthony I-602 Coskun, Huseyin I-491 Cowan, Noah J. I-474 Crimi, Alessandro I-140 Criminisi, Antonio II-265 Culbertson, Heather I-370 Cutting, Laurie E. I-81 D’Anastasi, Melvin II-415 Dall’ Armellina, Erica II-361 Darras, Kathryn I-465 Das, Dhritiman III-596 Das, Sandhitsu R. II-564 Davatzikos, Christos I-300 David, Anna L. I-353, II-352 Davidson, Alice II-589 de Marvao, Antonio III-246 De Silva, T. III-124 de Sousa, P. Loureiro III-335 De Vita, Enrico III-511 Dearnaley, David II-547 Delbany, M. III-335 Delingette, Hervé III-174 Denny, Thomas III-264 Denœux, Thierry II-61 Depeursinge, Adrien I-619 Deprest, Jan II-352 Dequidt, Jeremie I-500 Deriche, Rachid I-89 Desisto, Nicholas II-9 Desjardins, Adrien E. I-353 deSouza, Nandita II-547

Dhamala, Jwala III-282 Dhungel, Neeraj II-106 di San Filippo, Chiara Amat I-378, I-422 Diehl, Beate I-542 Diniz, Paula R.B. II-398 Dinsdale, Graham III-344 DiPietro, Robert I-551 Djonov, Valentin II-370 Dodero, Luca I-140 Doel, Tom II-352 Dong, Bin III-561 Dong, Di II-124 Dou, Qi II-149 Du, Junqiang I-1 Du, Lei I-123 Duan, Lixin III-441, III-458 Dufour, A. III-335 Duncan, James S. I-431 Duncan, John S. I-542, III-81 Durand, E. III-335 Duriez, Christian I-500 Dwyer, A. III-613 Eaton-Rosen, Zach III-605 Ebbers, Tino III-519 Eberle, Melissa I-431 Ebner, Thomas II-221 Eggenberger, Céline I-593 El-Baz, Ayman I-610, III-613 El-Ghar, Mohamed Abou I-610, III-613 Elmogy, Mohammed I-610 Elshaer, Mohamed Ezzeldin A. II-415 Elson, Daniel S. III-414 Ershad, Marzieh I-508 Eslami, Abouzar I-378, I-422 Essert, Caroline I-534 Esteva, Andre II-317 Ettlinger, Florian II-415 Fall, S. III-335 Fan, Audrey III-467 Farag, Amal II-451 Farzi, Mohsen II-291 Faskowitz, Joshua I-157 Fei-Fei, Li II-317 Fenster, Aaron I-644 Ferrante, Enzo II-529 Fichtinger, Gabor I-465 Fischl, Bruce I-184

Flach, Barbara I-593 Flach, Boris II-607 Forman, Christoph III-527 Fortin, A. III-335 Frangi, Alejandro F. II-291, III-142, III-201, III-353 Frank, Michael II-317 Fritscher, Karl II-158 Fu, Huazhu II-132, III-441 Fua, Pascal II-326 Fuerst, Bernhard I-474 Fujiwara, Michitaka II-556 Fundana, Ketut III-362 Funka-Lea, Gareth III-317 Gaed, Mena I-644 Gahm, Jin Kyu I-228 Gallardo-Diez, Guillermo I-89 Gao, Mingchen I-662 Gao, Wei I-106 Gao, Wenpeng I-457 Gao, Yaozong II-247, II-572, III-1 Gao, Yue II-9 Gao, Zhifan III-98 Garnotel, S. III-335 Gateno, Jaime I-559 Ge, Fangfei I-46 Génevaux, O. III-335 Georgescu, Bogdan III-229 Ghesu, Florin C. III-229, III-432 Ghista, Dhanjoo III-98 Gholipour, Ali III-544 Ghosh, Aurobrata II-265 Ghotbi, Reza I-474 Giannarou, Stamatia I-386, I-525 Gibson, Eli I-516, I-644 Gilhuijs, Kenneth G.A. II-478 Gilmore, John H. I-10 Gimelfarb, Georgy I-610, III-613 Girard, Erin I-395 Glocker, Ben I-148, II-589, II-616, III-107, III-536 Goerres, J. III-124 Goksel, Orcun I-568, I-593, II-256 Golland, Polina III-54, III-166 Gomez, Jose A. I-644 Gómez, Pedro A. III-579 González Ballester, Miguel Angel II-505 Gooya, Ali III-142, III-201, III-291 Götz, M. II-616

Goury, Olivier I-500 Grady, Leo III-380 Grant, P. Ellen III-54 Grau, Vicente II-361 Green, Michael III-423 Groeschel, Samuel III-502 Gröhl, J. II-616 Grunau, Ruth E. I-175 Grussu, Francesco II-265 Guerreiro, Filipa II-547 Guerrero, Ricardo III-246 Guizard, Nicolas II-469 Guldner, Ian H. II-658 Gülsün, Mehmet A. III-317 Guo, Lei I-28, I-46, I-123 Guo, Xiaoyu I-585 Guo, Yanrong III-238 Guo, Yufan II-300 Gupta, Vikas III-519 Gur, Yaniv II-300, III-238 Gutiérrez-Becker, Benjamín III-10, III-19 Gutman, Boris A. I-157, I-326 Ha, In Young III-89 Hacihaliloglu, Ilker I-362 Haegelen, Claire I-534 Hager, Gregory D. I-551, III-133 Hajnal, Joseph V. II-589 Hall, Scott S. II-317 Hamarneh, Ghassan I-175, II-460 Hamidian, Hajar III-150 Hamzé, Noura I-534 Han, Ju I-72 Han, Junwei I-28, I-46 Handels, Heinz III-28, III-89 Hao, Shijie I-219 Havaei, Mohammad II-469 Hawkes, David J. I-516 Hayashi, Yuichiro III-353 He, Xiaoxu II-335 Heim, E. II-616 Heimann, Tobias I-395 Heinrich, Mattias Paul II-598, III-28, III-89 Helm, Emma II-274 Heng, Pheng-Ann II-149, II-487 Herrick, Ariane III-344 Hibar, Derrek Paul I-335 Hipwell, John H. I-516 Hlushchuk, Ruslan II-370 Ho, Chin Pang III-158

Ho, Dennis Chun-Yu I-559 Hodgson, Antony I-602 Hoffman, Eric A. II-624 Hofmann, Felix II-415 Hofmanninger, Johannes I-192 Holdsworth, Samantha III-467 Holzer, Markus I-192 Horacek, Milan III-282 Hornegger, Joachim III-229, III-527 Horváth, Antal III-362 Hu, Jiaxi III-150 Hu, Xiaoping I-28 Hu, Xintao I-28, I-46, I-123 Hu, Yipeng I-516 Hua, Jing III-150 Huang, Heng I-273, I-317, I-344 Huang, Junzhou II-640, II-649, II-676 Huang, Xiaolei II-115 Hunley, Stanley C. III-380 Huo, Yuankai I-81 Huo, Zhouyuan I-317 Hutchinson, Charles II-274 Hutton, Brian F. III-406 Hwang, Sangheum II-239 Ichim, Alexandru-Eugen I-491 Iglesias, Juan Eugenio III-536 Imamura, Toru II-667 Imani, Farhad I-644, I-653 Iraji, Armin I-46 Išgum, Ivana II-478 Ishii, Masaru III-133 Ismail, M. III-335 Ittyerah, Ranjit III-63 Jacobson, M.W. III-124 Jahanshad, Neda I-157, I-335 Jakab, András I-247 Jamaludin, Amir II-166 Janatka, Mirek III-414 Jannin, Pierre I-534 Jayender, Jagadeesan I-457 Jeong, Won-Ki III-484 Jezierska, A. III-335 Ji, Xing II-247 Jiang, Baichuan I-457 Jiang, Menglin II-35 Jiang, Xi I-19, I-28, I-55, I-63, I-123 Jiao, Jieqing III-406

Jie, Biao I-1 Jin, Yan II-70 Jin, Yueming II-149 Jog, Amod III-553 John, Matthias I-395 John, Paul St. I-465 Johns, Edward I-448 Jojic, Vladimir I-627 Jomier, J. III-335 Jones, Derek K. III-579 Joshi, Anand A. I-237 Joshi, Sarang III-46, III-72 Kacher, Daniel F. I-457 Kaden, Enrico II-265 Kadir, Timor II-166 Kadoury, Samuel II-529 Kainz, Bernhard II-203, II-589 Kaiser, Markus I-395 Kakileti, Siva Teja I-636 Kaltwang, Sebastian II-44 Kamnitsas, Konstantinos II-203, II-589, III-246 Kaneda, Kazufumi II-667 Kang, Hakmook I-81 Karasawa, Ken’ichi II-556 Kashyap, Satyananda II-344, II-538 Kasprian, Gregor I-247 Kaushik, S. III-255 Kee Wong, Damon Wing II-132 Kendall, Giles I-255 Kenngott, H. II-616 Kerbrat, Anne III-570 Ketcha, M. III-124 Keynton, R. III-613 Khalifa, Fahmi I-610, III-613 Khanna, A.J. III-124 Khlebnikov, Rostislav II-589 Kim, Daeseung I-559 Kim, Hosung II-379 Kim, Hyo-Eun II-239 Kim, Junghoon I-166 Kim, Minjeong I-264 King, Andrew P. III-493 Kiryati, Nahum III-423 Kitasaka, Takayuki II-556 Kleinszig, G. III-124 Klusmann, Maria II-352 Knopf, Antje-Christin II-547 Knoplioch, J. III-255

Kochan, Martin III-81 Koesters, Zachary I-508 Koikkalainen, Juha II-44 Kokkinos, Iasonas II-529 Komodakis, Nikos III-10 Konen, Eli III-423 Kong, Bin III-264 Konno, Atsushi III-116 Konukoglu, Ender III-536 Korez, Robert II-433 Kou, Zhifeng I-46 Krenn, Markus I-192 Kriegman, David III-371 Kruecker, Jochen I-653 Kuijf, Hugo J. II-97 Kulaga-Yoskovitz, Jessie II-379 Kurita, Takio II-667 Kwak, Jin Tae I-653 Lai, Maode II-496 Laidley, David III-210 Laine, Andrew F. II-624 Landman, Bennett A. I-81 Langs, Georg I-192, I-247 Larson, Ben III-46 Lassila, Toni III-201 Lasso, Andras I-465 Lavdas, Ioannis III-536 Lay, Nathan II-388 Lea, Colin I-551 Leahy, Richard M. I-237 Ledig, Christian II-44 Lee, Gyusung I. I-551 Lee, Kyoung Mu III-308 Lee, Matthew III-246 Lee, Mija R. I-551 Lee, Soochahn III-308 Lee, Su-Lin I-525 Lee, Thomas C. I-457 Lei, Baiying II-53, II-247 Leiner, Tim II-478 Lelieveldt, Boudewijn P.F. III-107 Lemstra, Afina W. II-44 Leonard, Simon III-133 Lepetit, Vincent II-194 Lessoway, Victoria A. I-465 Li, David II-406 Li, Gang I-10, I-210, I-219 Li, Hua II-61 Li, Huibin II-521

Li, Qingyang I-326, I-335 Li, Shuo II-335, III-98, III-210 Li, Xiang I-19, I-63 Li, Xiao I-123 Li, Yang II-496 Li, Yanjie III-98 Li, Yuanwei III-158 Li, Yujie I-63 Lian, Chunfeng II-61 Lian, Jun I-627 Liao, Rui I-395 Liao, Ruizhi III-54 Liebschner, Michael A.K. I-559 Lienkamp, Soeren S. II-424 Likar, Boštjan II-433 Lim, Lek-Heng III-502 Lin, Jianyu III-414 Lin, Ming C. I-627 Lin, Stephen II-132 Lin, Weili I-10, I-210 Ling, Haibin I-413 Linguraru, Marius George III-219 Lippé, Sarah II-529 Liu, Jiang II-132, III-441, III-458 Liu, Luyan II-1, II-26, II-212 Liu, Mingxia I-1, I-308, II-79 Liu, Mingyuan II-496 Liu, Tianming I-19, I-28, I-46, I-55, I-63, I-123 Liu, Weixia III-63 Liu, Xin III-98 Liu, XingTong II-406 Lombaert, Herve I-255 Lorenzi, Marco I-255 Lötjönen, Jyrki II-44 Lu, Allen I-431 Lu, Jianfeng I-55 Lu, Le II-388, II-442, II-451 Lugauer, Felix III-527 Luo, Jie III-54 Lv, Jinglei I-19, I-28, I-46, I-55, I-63 Lynch, Mary Ellen I-28 Ma, Andy I-482 MacKenzie, John D. II-176 Madhu, Himanshu J. I-636 Maguire, Timothy J. III-388 Mai, Huaming I-559 Maier, Andreas III-432, III-527 Maier-Hein, K. II-616

Maier-Hein, L. II-616 Majewicz, Ann I-508 Malamateniou, Christina II-589 Malpani, Anand I-551 Mancini, Laura III-81 Mani, Baskaran III-441 Maninis, Kevis-Kokitsi II-140 Manivannan, Siyamalan II-308 Manjón, Jose V. II-564 Mansi, Tommaso III-229 Mao, Yunxiang II-685 Marami, Bahram III-544 Marchesseau, Stephanie III-273 Mari, Jean-Martial I-353 Marlow, Neil I-255, III-605 Marom, Edith M. III-423 Marsden, Alison III-371 Marsden, Paul K. III-493 Masuda, Atsuki II-667 Mateus, Diana III-10, III-19 Mattausch, Oliver I-593 Matthew, Jacqueline II-203 Mayer, Arnaldo III-423 Mazauric, Dorian I-89 McClelland, Jamie II-547 McCloskey, Eugene V. II-291 McEvoy, Andrew W. I-542, III-81 McGonigle, John III-37 Meining, Alexander I-448 Melbourne, Andrew I-255, III-406, III-511, III-605 Meng, Yu I-10, I-219 Menze, Bjoern H. II-415, III-397, III-579, III-596 Menzel, Marion I. III-579 Mercado, Ashley II-335 Merino, Maria I-577 Merkow, Jameson III-371 Merveille, O. III-335 Metaxas, Dimitris N. II-35, II-115 Miao, Shun I-395 Miller, Steven P. I-175 Milstein, Arnold II-317 Min, James K. III-380 Minakawa, Masatoshi II-667 Miraucourt, O. III-335 Misawa, Kazunari II-556 Miserocchi, Anna I-542 Modat, Marc III-81 Modersitzki, Jan III-28

Moeskops, Pim II-478 Molina-Romero, Miguel III-579 Mollero, Roch III-174 Mollura, Daniel J. I-662 Moore, Tonia III-344 Moradi, Mehdi II-300, III-238 Mori, Kensaku II-556, III-353 Mousavi, Parvin I-465, I-644, I-653 Moussa, Madeleine I-644 Moyer, Daniel I-157 Mullick, R. III-255 Mulpuri, Kishore I-602 Munsell, Brent C. II-9 Murino, Vittorio I-140 Murray, Andrea III-344 Mwikirize, Cosmas I-362 Nachum, Ilanit Ben III-210 Naegel, B. III-335 Nahlawi, Layan I-644 Najman, L. III-335 Navab, Nassir I-378, I-422, I-474, I-491, III-10, III-19 Negahdar, Mohammadreza II-300, III-238 Negussie, Ayele H. I-577 Neumann, Dominik III-229 Ng, Bernard I-132 Nguyen, Yann I-500 Ni, Dong II-53, II-247 Nicolas, G. III-255 Nie, Dong II-212 Nie, Feiping I-291 Niethammer, Marc I-439, III-28 Nill, Simeon II-547 Nimura, Yukitaka II-556 Noachtar, Soheyl I-491 Nogues, Isabella II-388 Noh, Kyoung Jin III-308 Nosher, John L. I-362 Nutt, David J. III-37 O’Donnell, Matthew I-431 O’Regan, Declan III-246 Oda, Masahiro II-556, III-353 Oelfke, Uwe II-547 Oguz, Ipek II-344, II-538 Okamura, Allison M. I-370

Oktay, Ozan III-246 Orasanu, Eliza I-255 Ourselin, Sebastien I-255, I-353, I-542, II-352, II-547, III-81, III-406, III-511, III-605 Owen, David III-511 Ozdemir, Firat II-256 Ozkan, Ece II-256 Pagé, G. III-335 Paknezhad, Mahsa III-273 Pang, Yu I-413 Pansiot, Julien III-450 Papastylianou, Tasos II-361 Paragios, Nikos II-529 Parajuli, Nripesh I-431 Parisot, Sarah I-115, I-148 Park, Jin-Hyeong II-487 Park, Sang Hyun I-282 Parker, Drew I-166 Parsons, Caron II-274 Parvin, Bahram I-72 Passat, N. III-335 Patil, B. III-255 Payer, Christian II-194, II-230 Peng, Hanchuan I-63 Peng, Jailin II-70 Pennec, Xavier III-174, III-300 Pereira, Stephen P. I-516 Pernuš, Franjo II-433 Peter, Loïc III-19 Pezold, Simon III-362 Pezzotti, Nicola II-97 Pichora, David I-465 Pickup, Stephen III-63 Piella, Gemma II-505 Pinto, Peter I-577, I-653 Pizer, Stephen I-439 Pluim, Josien P.W. II-632 Pluta, John III-63 Pohl, Kilian M. I-282 Polzin, Thomas III-28 Pont-Tuset, Jordi II-140 Pozo, Jose M. II-291, III-201 Pratt, Rosalind II-352 Prayer, Daniela I-247 Preston, Joseph Samuel III-72 Price, True I-439

Prieto, Claudia III-493 Prince, Jerry L. III-553 Prosch, Helmut I-192 Prud’homme, C. III-335 Pusiol, Guido II-317 Qi, Ji III-414 Qin, Jing II-53, II-149, II-247 Quader, Niamul I-602 Quan, Tran Minh III-484 Rahmim, Arman I-577 Raidou, Renata Georgia II-97 Raitor, Michael I-370 Rajan, D. III-238 Rajchl, Martin II-589 Rak, Marko II-283 Ramachandran, Rageshree II-176 Rapaka, Saikiran III-317 Rasoulian, Abtin I-465 Raudaschl, Patrik II-158 Ravikumar, Nishant III-142, III-291 Rawat, Nishi I-482 Raytchev, Bisser II-667 Reader, Andrew J. III-493 Reaungamornrat, S. III-124 Reda, Islam I-610 Rege, Robert I-508 Reiman, Eric M. I-326 Reiter, Austin I-482, III-133 Rekik, Islem I-210, II-26, II-572 Rempfler, Markus II-415, III-397 Reyes, Mauricio II-370 Rhodius-Meester, Hanneke II-44 Rieke, Nicola I-422 Robertson, Nicola J. I-255 Rockall, Andrea G. III-536 Rodionov, Roman I-542 Rohé, Marc-Michel III-300 Rohling, Robert I-465 Rohrer, Jonathan III-511 Ronneberger, Olaf II-424 Roodaki, Hessam I-378 Rosenman, Julian I-439 Ross, T. II-616 Roth, Holger R. II-388, II-451 Rottman, Caleb III-46

Ruan, Su II-61 Rueckert, Daniel I-115, I-148, II-44, II-203, II-556, II-589, III-246, III-536 Rutherford, Mary II-589 Sabouri, Pouya III-46 Salmon, S. III-335 Salzmann, Mathieu II-326 Sanabria, Sergio J. I-568 Sankaran, Sethuraman III-380 Sanroma, Gerard II-505 Santos, Michel M. II-398 Santos, Wellington P. II-398 Santos-Ribeiro, Andre III-37 Sapkota, Manish II-185 Sapp, John L. III-282 Saria, Suchi I-482 Sarrami-Foroushani, Ali III-201 Sase, Kazuya III-116 Sato, Imari III-326 Sawant, Amit III-46 Saygili, Gorkem III-107 Schaap, Michiel III-380 Scheltens, Philip II-44 Scherrer, Benoit III-544 Schirmer, Markus D. I-148 Schlegl, Thomas I-192 Schmidt, Michaela III-527 Schöpf, Veronika I-247 Schott, Jonathan M. III-406 Schubert, Rainer II-158 Schulte, Rolf F. III-596 Schultz, Thomas III-502 Schwab, Evan III-475 Schwartz, Ernst I-247 Scott, Catherine J. III-406 Seifabadi, Reza I-577 Seitel, Alexander I-465 Senior, Roxy III-158 Sepasian, Neda II-97 Sermesant, Maxime III-174, III-300 Shakeri, Mahsa II-529 Shalaby, Ahmed I-610 Sharma, Manas II-335 Sharma, Puneet III-317 Sharp, Gregory C. II-158 Shatkay, Hagit I-644

Shehata, M. III-613 Shen, Dinggang I-10, I-37, I-106, I-210, I-219, I-264, I-273, I-291, I-308, I-317, I-344, II-1, II-18, II-26, II-70, II-79, II-88, II-212, II-247, II-572, III-561, III-587 Shen, Shunyao I-559 Shen, Wei II-124 Shi, Feng II-572 Shi, Jianbo II-388 Shi, Jie I-326 Shi, Xiaoshuang III-183 Shi, Yonggang I-201, I-228 Shigwan, Saurabh J. III-191 Shimizu, Natsuki II-556 Shin, Min III-264 Shin, Seung Yeon III-308 Shokiche, Carlos Correa II-370 Shriram, K.S. III-255 Shrock, Christine I-482 Siewerdsen, J.H. III-124 Siless, Viviana I-184 Silva-Filho, Abel G. II-398 Simonovsky, Martin III-10 Sinha, Ayushi III-133 Sinusas, Albert J. I-431 Sixta, Tomáš II-607 Smith, Sandra II-203 Sohn, Andrew II-451 Sokooti, Hessam III-107 Soliman, A. III-613 Sommer, Wieland H. II-415 Sona, Diego I-140 Sonka, Milan II-344, II-538 Sotiras, Aristeidis I-300 Spadea, Maria Francesca II-158 Sparks, Rachel I-542 Speidel, S. II-616 Sperl, Jonathan I. III-579 Stalder, Aurélien F. III-527 Stamm, Aymeric III-622 Staring, Marius III-107 Stendahl, John C. I-431 Štern, Darko II-194, II-221, II-230 Stock, C. II-616 Stolka, Philipp J. I-370 Stonnington, Cynthia I-326 Stoyanov, Danail III-81, III-414 Styner, Martin II-9 Subramanian, N. III-255

Suk, Heung-Il I-344 Summers, Ronald M. II-388, II-451, III-219 Sun, Jian II-521 Sun, Shanhui I-395 Sun, Xueqing III-414 Sun, Yuanyuan III-98 Sutton, Erin E. I-474 Suzuki, Masashi II-667 Syeda-Mahmood, Tanveer II-300, III-238 Synnes, Anne R. I-175 Szopos, M. III-335 Tahmasebi, Amir I-653 Talbot, H. III-335 Tam, Roger II-406 Tamaki, Toru II-667 Tan, David Joseph I-422 Tanaka, Kojiro II-667 Tang, Lisa Y.W. II-406 Tang, Meng-Xing III-158 Tanner, Christine I-593 Tanno, Ryutaro II-265 Tarabay, R. III-335 Tatavarty, Sunil II-415 Tavakoli, Behnoosh I-585 Taylor, Charles A. III-380 Taylor, Chris III-344 Taylor, Russell H. III-133 Taylor, Zeike A. III-142, III-291 Teisseire, M. III-255 Thiriet, M. III-335 Thiruvenkadam, S. III-255 Thomas, David III-511 Thompson, Paul M. I-157, I-326, I-335 Thornton, John S. III-81 Thung, Kim-Han II-88 Tian, Jie II-124 Tijms, Betty II-44 Tillmanns, Christoph III-527 Toi, Masakazu III-326 Tolonen, Antti II-44 Tombari, Federico I-422, I-491 Tönnies, Klaus-Dietz II-283 Torres, Renato I-500 Traboulsee, Anthony II-406 Trucco, Emanuele II-308 Tsagkas, Charidimos III-362 Tsehay, Yohannes II-388 Tsien, Joe Z. I-63 Tsogkas, Stavros II-529

Tsujita, Teppei III-116 Tu, Zhuowen III-371 Tunc, Birkan I-166 Turk, Esra A. III-54 Turkbey, Baris I-577, I-653 Ulas, Cagdas III-579 Unal, Gozde III-467 Uneri, A. III-124 Ungi, Tamas I-465 Urschler, Martin II-194, II-221, II-230 Usman, Muhammad III-493 Van De Ville, Dimitri I-619 van der Flier, Wiesje II-44 van der Velden, Bas H.M. II-478 van Diest, Paul J. II-632 Van Gool, Luc II-140 Vantini, S. III-622 Varol, Erdem I-300 Vedula, S. Swaroop I-551 Venkataramani, Krithika I-636 Venkatesh, Bharath A. II-624 Vera, Pierre II-61 Vercauteren, Tom II-352, III-81 Verma, Ragini I-166 Veta, Mitko II-632 Vidal, René III-475 Viergever, Max A. II-478 Vilanova, Anna II-97 Vizcaíno, Josué Page I-422 Vogt, S. III-124 Voirin, Jimmy I-534 Vrtovec, Tomaž II-433 Walker, Julie M. I-370 Walter, Benjamin I-448 Wang, Chendi I-132 Wang, Chenglong III-353 Wang, Guotai II-352 Wang, Hongzhi II-538, II-564 Wang, Huifang II-247 Wang, Jiazhuo II-176 Wang, Jie I-335 Wang, Li I-10, I-219 Wang, Linwei III-282 Wang, Lisheng II-521 Wang, Qian II-1, II-26 Wang, Sheng II-640, II-649

Wang, Tianfu II-53, II-247 Wang, Xiaoqian I-273 Wang, Xiaosong II-388 Wang, Yalin I-326, I-335 Wang, Yipei II-496 Wang, Yunfu I-72 Wang, Zhengxia I-291 Ward, Aaron D. I-644 Warfield, Simon K. III-544, III-622 Wassermann, Demian I-89 Watanabe, Takanori I-166 Wehner, Tim I-542 Wei, Zhihui I-37 Weier, Katrin III-362 Weiskopf, Nikolaus I-255 Wells, William M. III-166 West, Simeon J. I-353 Wetzl, Jens III-527 Whitaker, Ross III-72 White, Mark III-81 Wilkinson, J. Mark II-291 Wilms, Matthias III-89 Wilson, David I-465 Winston, Gavin P. III-81 Wirkert, S. II-616 Wisse, Laura E.M. II-564 Wolinsky, J.-P. III-124 Wolk, David A. II-564, III-63 Wolterink, Jelmer M. II-478 Wong, Damon Wing Kee III-441, III-458 Wong, Tien Yin III-458 Wood, Bradford J. I-577, I-653 Wu, Aaron I-662 Wu, Colin O. II-624 Wu, Guorong I-106, I-264, I-291, II-9, II-247, III-1 Wu, Wanqing III-98 Wu, Yafeng III-587 Würfl, Tobias III-432 Xia, James J. I-559 Xia, Wenfeng I-353 Xie, Long II-564 Xie, Yaoqin III-98 Xie, Yuanpu II-185, III-183 Xing, Fuyong II-442, III-183 Xiong, Huahua III-98 Xu, Sheng I-653 Xu, Tao II-115 Xu, Yan II-496

Xu, Yanwu II-132, III-441, III-458 Xu, Zheng II-640, II-676 Xu, Ziyue I-662 Xu, Zongben II-521

Yamamoto, Tokunori III-353 Yan, Pingkun I-653 Yang, Caiyun II-124 Yang, Feng II-124 Yang, Guang-Zhong I-386, I-448, I-525 Yang, Heran II-521 Yang, Jianhua III-1 Yang, Jie II-624 Yang, Lin II-185, II-442, II-658, III-183 Yang, Shan I-627 Yang, Tao I-335 Yao, Jiawen II-640, II-649 Yap, Pew-Thian I-210, I-308, II-88, III-561, III-587 Yarmush, Martin L. III-388 Ye, Chuyang I-97 Ye, Jieping I-326, I-335 Ye, Menglong I-386, I-448 Yendiki, Anastasia I-184 Yin, Qian II-442 Yin, Yilong II-335, III-210 Yin, Zhaozheng II-685 Yoo, Youngjin II-406 Yoshino, Yasushi III-353 Yousry, Tarek III-81 Yu, Lequan II-149 Yu, Renping I-37 Yuan, Peng I-559 Yun, Il Dong III-308 Yushkevich, Paul A. II-538, II-564, III-63 Zaffino, Paolo II-158 Zang, Yali II-124 Zapp, Daniel I-378 Zec, Michelle I-465 Zhan, Liang I-335 Zhan, Yiqiang III-264 Zhang, Daoqiang I-1 Zhang, Guangming I-559 Zhang, Haichong K. I-585

Zhang, Han I-37, I-106, II-1, II-18, II-26, II-115, II-212 Zhang, Heye III-98 Zhang, Honghai II-344 Zhang, Jie I-326 Zhang, Jun I-308, II-79 Zhang, Lichi II-1 Zhang, Lin I-386 Zhang, Miaomiao III-54, III-166 Zhang, Qiang II-274 Zhang, Shaoting II-35, II-115, III-264 Zhang, Shu I-19, I-28 Zhang, Siyuan II-658 Zhang, Tuo I-19, I-46, I-123 Zhang, Wei I-19 Zhang, Xiaoqin III-441 Zhang, Xiaoyan I-559 Zhang, Yizhe II-658 Zhang, Yong I-282, III-561 Zhang, Zizhao II-185, II-442, III-183 Zhao, Liang I-525 Zhao, Qinghua I-55 Zhao, Qingyu I-439 Zhao, Shijie I-19, I-28, I-46, I-55 Zhen, Xiantong III-210 Zheng, Yefeng I-413, II-487, III-317 Zheng, Yingqiang III-326 Zheng, Yuanjie II-35 Zhong, Zichun III-150 Zhou, Mu II-124 Zhou, S. Kevin II-487 Zhou, Xiaobo I-559 Zhu, Hongtu I-627 Zhu, Xiaofeng I-106, I-264, I-291, I-344, II-70 Zhu, Xinliang II-649 Zhu, Ying I-413 Zhu, Yingying I-106, I-264, I-291 Zhuang, Xiahai II-581 Zisserman, Andrew II-166 Zombori, Gergely I-542 Zontak, Maria I-431 Zu, Chen I-291 Zuluaga, Maria A. I-542, II-352 Zwicker, Jill G. I-175