LNBI 13064
Yanjie Wei · Min Li · Pavel Skums · Zhipeng Cai (Eds.)
Bioinformatics Research and Applications
17th International Symposium, ISBRA 2021
Shenzhen, China, November 26–28, 2021
Proceedings
Lecture Notes in Bioinformatics
13064
Subseries of Lecture Notes in Computer Science

Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA
Editorial Board Members
Søren Brunak, Technical University of Denmark, Kongens Lyngby, Denmark
Mikhail S. Gelfand, IITP, Research and Training Center on Bioinformatics, Moscow, Russia
Thomas Lengauer, Max Planck Institute for Informatics, Saarbrücken, Germany
Satoru Miyano, University of Tokyo, Tokyo, Japan
Eugene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Marie-France Sagot, Université Lyon 1, Villeurbanne, France
David Sankoff, University of Ottawa, Ottawa, Canada
Ron Shamir, Tel Aviv University, Ramat Aviv, Tel Aviv, Israel
Terry Speed, Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany
W. Eric Wong, University of Texas at Dallas, Richardson, TX, USA
More information about this subseries at http://www.springer.com/series/5381
Editors
Yanjie Wei, Shenzhen Institutes of Advanced Technology, Shenzhen, China
Min Li, Central South University, Changsha, China
Pavel Skums, Georgia State University, Atlanta, GA, USA
Zhipeng Cai, Georgia State University, Atlanta, GA, USA
ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Bioinformatics ISBN 978-3-030-91414-1 ISBN 978-3-030-91415-8 (eBook) https://doi.org/10.1007/978-3-030-91415-8 LNCS Sublibrary: SL8 – Bioinformatics © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
On behalf of the Program Committee, we would like to welcome you to the proceedings of the 17th International Symposium on Bioinformatics Research and Applications (ISBRA 2021), held in Shenzhen, China, November 26–28, 2021. The symposium provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications.

This year, we received 135 submissions in response to the call for extended abstracts. The Program Committee decided to accept 51 of these for full publication in the proceedings and presentation at the symposium; a list of these contributions can be found in this front matter. The technical program also featured keynote talks delivered by six distinguished speakers: Guoliang Chen, Academician of the Chinese Academy of Sciences, from Shenzhen University and Nanjing University of Posts and Telecommunications, China; Dongqing Wei from Shanghai Jiaotong University, China; Yixue Li from the Chinese Academy of Sciences, China; Bairong Shen from West China Hospital, Sichuan University, China; Shaoliang Peng from Hunan University, China; and Xiaolong Xu from Nanjing University of Information Science and Technology, China.

We would like to thank the Program Committee members and the additional reviewers for volunteering their time to review and discuss symposium papers. We would like to extend special thanks to the steering and general chairs of the symposium for their leadership, and to the finance, publicity, workshops, local organization, and publications chairs for their hard work in making ISBRA 2021 a successful event. Last but not least, we would like to thank all authors for presenting their work at the symposium.

October 2021

Zhipeng Cai
Min Li
Pavel Skums
Yanjie Wei
Organization
Steering Committee
Yi Pan (Chair), Georgia State University, USA
Dan Gusfield, University of California, Davis, USA
Ion Mandoiu, University of Connecticut, USA
Marie-France Sagot, Inria, France
Zhirong Sun, Tsinghua University, China
Ying Xu, The University of Georgia, USA
Aidong Zhang, State University of New York, USA

General Chairs
Ye Li, Shenzhen Institutes of Advanced Technology, CAS, China
Sanguthevar Rajasekaran, University of Connecticut, USA
Jianxin Wang, Central South University, China
Alexander Zelikovsky, Georgia State University, USA

Program Chairs
Zhipeng Cai, Georgia State University, USA
Min Li, Central South University, China
Pavel Skums, Georgia State University, USA
Yanjie Wei, Shenzhen Institutes of Advanced Technology, CAS, China

Publicity Chairs
Gangman Yi, Dongguk University, Korea
Xiujuan Lei, Shaanxi Normal University, China
Fa Zhang, Institute of Computing Technology, CAS, China

Publication Chair
Jin Liu, Central South University, China

Local Arrangement Chair
Yunpeng Cai, Shenzhen Institutes of Advanced Technology, CAS, China
Workshop Chairs
Gulsah Altun, Thermo Fisher, USA
Quan Zou, University of Electronic Science and Technology, China
Le Zhang, Sichuan University, China

Web Chairs
Filipp Rondel, Georgia State University, USA
Sergey Knyazev, Centers for Disease Control and Prevention, USA
Haiping Zhang, Shenzhen Institutes of Advanced Technology, CAS, China
Program Committee
Yanjie Wei, Shenzhen Institutes of Advanced Technology, CAS, China
Daniel Brown, University of Waterloo, Canada
Jian Liu, Harbin Institute of Technology, China
Shuai Cheng Li, City University of Hong Kong, China
Steffen Heber, North Carolina State University, USA
Zhi-Ping Liu, Shandong University, China
Juan Wang, Inner Mongolia University, China
Lu Zhang, Hong Kong Baptist University, China
Jian-Yu Shi, Northwestern Polytechnical University, China
Wei Jiang, Nanjing University of Aeronautics and Astronautics, China
Liang Cheng, Harbin Medical University, China
Gabriel Valiente, Technical University of Catalonia, Spain
Russell Schwartz, Carnegie Mellon University, USA
Serghei Mangul, University of California, Los Angeles, USA
Hongmin Cai, South China University of Technology, China
Tatsuya Akutsu, Kyoto University, Japan
Murray Patterson, Georgia State University, USA
Seth Weinberg, The Ohio State University, USA
Xuan Guo, University of North Texas, USA
Ruxin Wang, Shenzhen Institutes of Advanced Technology, CAS, China
Nikita Alexeev, ITMO University, Russia
Xing Chen, China University of Mining and Technology, China
Junwei Luo, Henan Polytechnic University, China
Oliver Eulenstein, Iowa State University, USA
Qi Zhao, University of Science and Technology Liaoning, China
Filipp Rondel, Georgia State University, USA
Danny Krizanc, Wesleyan University, USA
Chunyan Ji, Georgia State University, USA
Pufeng Du, Tianjin University, China
Peng Yin, Shenzhen Institutes of Advanced Technology, CAS, China
Yuedong Yang, Sun Yat-sen University, China
Shuqiang Wang, Shenzhen Institute of Advanced Technology, CAS, China
Zhipeng Cai, Georgia State University, USA
Jin Liu, Central South University, China
Min Li, Central South University, China
Pavel Skums, Georgia State University, USA
Meng Jintao, Shenzhen Institutes of Advanced Technology, CAS, China
Fen Miao, Shenzhen Institutes of Advanced Technology, CAS, China
Shuigeng Zhou, Fudan University, China
Emily Chia-Yu Su, Taipei Medical University, China
Le Zhang, Sichuan University, China
Fa Zhang, Institute of Computing Technology, CAS, China
Hongyan Wu, Shenzhen Institutes of Advanced Technology, CAS, China
Andrei Paun, University of Bucharest, Romania
Jianxin Wang, Central South University, China
Zengyou He, Dalian University of Technology, China
Yi Shi, Shanghai Jiao Tong University, China
Wei Peng, Kunming University of Science and Technology, China
Nadia Pisanti, University of Pisa, Italy
Yuri Porozov, Saint Petersburg National Research University of Information Technologies, Mechanics and Optics, Russia
Lei Deng, Central South University, China
Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia
Yufeng Wu, University of Connecticut, USA
Xing-Ming Zhao, Tongji University, China
Joao Setubal, University of São Paulo, Brazil
Ion Mandoiu, University of Connecticut, USA
Weiguo Liu, Shandong University, China
Yaohang Li, Old Dominion University, USA
Mukul S. Bansal, University of Connecticut, USA
Sing-Hoi Sze, Texas A&M University, USA
Yuk Yee Leung, University of Pennsylvania, USA
Weitian Tong, Eastern Michigan University, USA
Yubao Wu, Georgia State University, USA
Sergey Knyazev, Georgia State University, USA
Xiaowen Liu, Indiana University-Purdue University Indianapolis, USA
Xinghua Shi, Temple University, USA
Quan Zou, Tianjin University, China
Xiaoqing Peng, Central South University, China
Ileana Streinu, Smith College, USA
Yunpeng Cai, Shenzhen Institute of Advanced Technology, CAS, China
Zeng Xiangxiang, Hunan University, China
Fatemeh Mohebbi, Georgia State University, USA
Paola Bonizzoni, Università degli Studi di Milano-Bicocca, Italy
Derek Aguiar, University of Connecticut, USA
Alexandre G. De Brevern, University of Paris, France
Yongxian Fan, Guilin University of Electronic Technology, China
Jun Wang, Southwest University, China
Fei Guo, Tianjin University, China
Xiaobo Li, Lishui University, China
Lei Xu, Shenzhen Polytechnic, China
Leyi Wei, Shandong University, China
Junjie Chen, Harbin Institute of Technology, China
Xuefeng Cui, Shandong University, China
Ruiqing Zheng, Central South University, China
Cheng Yan, Hunan University of Chinese Medicine, China
Hongdong Li, Central South University, China
Min Zeng, Central South University, China
Mengyun Yang, Shaoyang University, China
Xingyi Li, Northwestern Polytechnical University, China
Contents
AI and Disease

MKL-LP: Predicting Disease-Associated Microbes with Multiple-Similarity Kernel Learning-Based Label Propagation
  Ying-Lian Gao, Meng-Meng Yin, Jin-Xing Liu, Junliang Shang, and Chun-Hou Zheng

Immune-Microbiota Crosstalk Underlying Inflammatory Bowel Disease
  Congmin Xu, Quoc D. Mac, Qiong Jia, and Peng Qiu

Epidemic Vulnerability Index for Effective Vaccine Distribution Against Pandemic
  Hunmin Lee, Mingon Kang, Yingshu Li, Daehee Seo, and Donghyun Kim

Exploiting Multi-granular Features for the Enhanced Predictive Modeling of COPD Based on Chinese EMRs
  Qing Zhao, Renyan Feng, Jianqiang Li, and Yanhe Jia

Task-Oriented Feature Representation for Spontaneous Speech of AD Patients
  Jiyun Li and Peng Huang

Identification of Protein Markers Predictive of Drug-Specific Survival Outcome in Cancers
  Shuting Lin, Jie Zhou, Yiqiong Xiao, Bridget Neary, Yong Teng, and Peng Qiu

Diabetic Retinopathy Grading Base on Contrastive Learning and Semi-supervised Learning
  Yunchao Gu, Xinliang Wang, Junjun Pan, and Zhong Zhou

Reinforcement Learning for Diabetes Blood Glucose Control with Meal Information
  Jinhao Zhu, Yinjia Zhang, Weixiong Rao, Qinpei Zhao, Jiangfeng Li, and Congrong Wang

Predicting Microbe-Disease Association via Tripartite Network and Relation Graph Convolutional Network
  Yueyue Wang, Xiujuan Lei, and Yi Pan
Combining Model-Based and Model-Free Reinforcement Learning Policies for More Efficient Sepsis Treatment
  Xiangyu Liu, Chao Yu, Qikai Huang, Luhao Wang, Jianfeng Wu, and Xiangdong Guan

An Efficient Two-Stage Fusion Network for Computer-Aided Diagnosis of Diabetic Foot
  Anping Song, Hongtao Zhu, Lifang Liu, Ziheng Song, and Hongyu Jin

A Heterogeneous Graph Convolutional Network-Based Deep Learning Model to Identify miRNA-Disease Association
  Zicheng Che, Wei Peng, Wei Dai, Shoulin Wei, and Wei Lan

Identification of Gastric Cancer Immune Microenvironment Related Genes with Poor Prognosis and Tumor Immune Infiltration
  Yishu Wang, Lingyun Xu, Xuehan Tian, and Zhe Lin

A k-mer Based Approach for SARS-CoV-2 Variant Identification
  Sarwan Ali, Bikram Sahoo, Naimat Ullah, Alexander Zelikovskiy, Murray Patterson, and Imdadullah Khan

A Novel Network Representation of SARS-CoV-2 Sequencing Data
  Sergey Knyazev, Daniel Novikov, Mark Grinshpon, Harman Singh, Ram Ayyala, Varuni Sarwal, Roya Hosseini, Pelin Icer Baykal, Pavel Skums, Ellsworth Campbell, Serghei Mangul, and Alex Zelikovsky

Computational Proteomics

A Sequence-Based Antibody Paratope Prediction Model Through Combing Local-Global Information and Partner Features
  Shuai Lu, Yuguang Li, Xiaofei Nan, and Shoutao Zhang

SuccSPred: Succinylation Sites Prediction Using Fused Feature Representation and Ranking Method
  Ruiquan Ge, Yizhang Luo, Guanwen Feng, Gangyong Jia, Hua Zhang, Chong Xu, Gang Xu, and Pu Wang

BindTransNet: A Transferable Transformer-Based Architecture for Cross-Cell Type DNA-Protein Binding Sites Prediction
  Zixuan Wang, Xiaoyao Tan, Beichen Li, Yuhang Liu, Qi Shao, Zijing Li, Yihan Yang, and Yongqing Zhang

Overlapping Protein Complexes Detection Based on Multi-level Topological Similarities
  Wenkang Wang, Xiangmao Meng, Ju Xiang, and Min Li
LPI-FKLGCN: Predicting LncRNA-Protein Interactions Through Fast Kernel Learning and Graph Convolutional Network
  Wen Li, Shulin Wang, and Hu Guo

Biomedical Imaging

Prediction of Protein Subcellular Localization from Microscopic Images via Few-Shot Learning
  Francesco Arcamone, Yanlun Tu, and Yang Yang

A Novel Pseudo-Labeling Approach for Cell Detection Based on Adaptive Threshold
  Tian Bai, Zhenting Zhang, Chen Zhao, and Xiao Luo

Parameter Transfer Learning Measured by Image Similarity to Detect CT of COVID-19
  Chang Zhao and Shunfang Wang

A Novel Prediction Framework for Two-Year Stroke Recurrence Using Retinal Images
  Yidan Dai, Yuanyuan Zhuo, Xingxian Huang, Haibo Yu, and Xiaomao Fan

The Classification System and Biomarkers for Autism Spectrum Disorder: A Machine Learning Approach
  Zhongyang Dai, Haishan Zhang, Feifei Lin, Shengzhong Feng, Yanjie Wei, and Jiaxiu Zhou

LiteTrans: Reconstruct Transformer with Convolution for Medical Image Segmentation
  Shuying Xu and Hongyan Quan

CFCN: A Multi-scale Fully Convolutional Network with Dilated Convolution for Nuclei Classification and Localization
  Bin Xin, Yaning Yang, Dongqing Wei, and Shaoliang Peng

Image to Image Transfer Makes Malpositioned Teeth Orderly
  Sanbi Luo

MIFS: A Peer-to-Peer Medical Images Storage and Sharing System Based on Consortium Blockchain
  Hao Liu, Xia Xiao, Xinglong Zhang, Kenli Li, and Shaoliang Peng

StarLace: Nested Visualization of Temporal Brain Connectivity Data
  Ming Jing, Yunjing Liu, Xiaoxiao Wang, and Li Zhang
Batch Weighted Nuclear-Norm Minimization for Medical Image Sequence Segmentation
  Kele Xu, Zijian Gao, Jilong Wang, Yang Wen, Ming Feng, Changjian Wang, and Yin Wang

Drug Screening and Drug-Drug Interaction Prediction

Predicting Drug Drug Interactions by Signed Graph Filtering-Based Convolutional Networks
  Ming Chen, Yi Pan, and Chunyan Ji

Drug-Target Interaction Prediction Based on Gaussian Interaction Profile and Information Entropy
  Lina Liu, Shuang Yao, Zhaoyun Ding, Maozu Guo, Donghua Yu, and Keli Hu

A Deep Learning Approach Based on Feature Reconstruction and Multi-dimensional Attention Mechanism for Drug-Drug Interaction Prediction
  Jiang Xie, Jiaming Ouyang, Chang Zhao, Hongjian He, and Xin Dong

OrgaNet: A Deep Learning Approach for Automated Evaluation of Organoids Viability in Drug Screening
  Xuesheng Bian, Gang Li, Cheng Wang, Siqi Shen, Weiquan Liu, Xiuhong Lin, Zexin Chen, Mancheung Cheung, and XiongBiao Luo

HGDD: A Drug-Disease High-Order Association Information Extraction Method for Drug Repurposing via Hypergraph
  Shanchen Pang, Kuijie Zhang, Shudong Wang, Yuanyuan Zhang, Sicheng He, Wenhao Wu, and Sibo Qiao

IDOS: Improved D3DOCK on Spark
  Yonghui Cui, Zhijian Xu, and Shaoliang Peng

Biomedical Data

A New Deep Learning Training Scheme: Application to Biomedical Data
  Jianhong Cheng, Qichang Zhao, Lei Xu, and Jin Liu

EEG-Based Emotion Recognition Fusing Spacial-Frequency Domain Features and Data-Driven Spectrogram-Like Features
  Chen Wang, Jingzhao Hu, Ke Liu, Qiaomei Jia, Jiayue Chen, Kun Yang, and Jun Feng
ECG Arrhythmia Detection Based on Hidden Attention Residual Neural Network
  Yuxia Guan, Jinrui Xu, Ning Liu, Jianxin Wang, and Ying An

EEG-Based Depression Detection with a Synthesis-Based Data Augmentation Strategy
  Xiangyu Wei, Meifei Chen, Manxi Wu, Xiaowei Zhang, and Bin Hu

Sequencing Data Analysis

Joint CC and Bimax: A Biclustering Method for Single-Cell RNA-Seq Data Analysis
  He-Ming Chu, Xiang-Zhen Kong, Jin-Xing Liu, Juan Wang, Sha-Sha Yuan, and Ling-Yun Dai

Improving Protein-protein Interaction Prediction by Incorporating 3D Genome Information
  Zehua Guo, Kai Su, Liangjie Liu, Xianbin Su, Mofan Feng, Song Cao, Mingxuan Zhang, Runqiu Chi, Luming Meng, Guang He, and Yi Shi

Boosting Metagenomic Classification with Reads Overlap Graphs
  M. Cavattoni and M. Comin

ScDA: A Denoising AutoEncoder Based Dimensionality Reduction for Single-cell RNA-seq Data
  Xiaoshu Zhu, Yongchang Lin, Jian Li, Jianxin Wang, and Xiaoqing Peng

Others

PickerOptimizer: A Deep Learning-Based Particle Optimizer for Cryo-Electron Microscopy Particle-Picking Algorithms
  Hongjia Li, Ge Chen, Shan Gao, Jintao Li, and Fa Zhang

SkeIn: Sketchy-Intensive Reading Comprehension Model for Multi-choice Biomedical Questions
  Jing Li, Shangping Zhong, Kaizhi Chen, and Taibiao Li

DNA Image Storage Using a Scheme Based on Fuzzy Matching on Natural Genome
  Jitao Zhang, Shihong Chen, Haoling Zhang, Yue Shen, and Zhi Ping

Prediction of Virus-Receptor Interactions Based on Similarity and Matrix Completion
  Lingzhi Zhu, Guihua Duan, Cheng Yan, and Jianxin Wang
An Efficient Greedy Incremental Sequence Clustering Algorithm
  Zhen Ju, Huiling Zhang, Jingtao Meng, Jingjing Zhang, Xuelei Li, Jianping Fan, Yi Pan, Weiguo Liu, and Yanjie Wei

Correlated Evolution in the Small Parsimony Framework
  Brendan Smith, Cristian Navarro-Martinez, Rebecca Buonopane, S. Ashley Byun, and Murray Patterson

Author Index
AI and Disease
MKL-LP: Predicting Disease-Associated Microbes with Multiple-Similarity Kernel Learning-Based Label Propagation

Ying-Lian Gao2, Meng-Meng Yin1, Jin-Xing Liu1(B), Junliang Shang1, and Chun-Hou Zheng1

1 School of Computer Science, Qufu Normal University, Rizhao 276826, China
2 Qufu Normal University Library, Qufu Normal University, Rizhao 276826, China
Abstract. A growing body of clinical evidence shows that microbes and diseases are closely associated. Developing computational models to uncover unknown microbe-disease associations, rather than relying on traditional experimental methods that are usually expensive and time-consuming, is currently an active research direction. In this paper, we present a new method, MKL-LP, which combines Multiple Kernel Learning (MKL) and Label Propagation (LP) and builds on known microbe-disease associations and multiple microbe/disease similarities. First, multiple microbe/disease similarities are calculated. Second, to make the input information more comprehensive, the multiple microbe/disease similarity kernels are fused by MKL into a single fused microbe/disease kernel. Third, because many recorded non-associations may in fact be positive, a pre-processing step uses the microbe/disease similarity information to estimate the association probability of unknown entries in the association matrix. LP is then applied to predict novel microbe-disease associations. Finally, 5-fold cross validation is used to compare the predictive performance of our method with that of four other prediction methods. In addition, in a case study on Chronic Obstructive Pulmonary Disease (COPD), 10 of the top 15 candidate microbes are supported by the literature. These results suggest that MKL-LP can play a significant role in discovering novel microbe-disease associations, thereby providing insight into complicated disease mechanisms and facilitating new approaches to the diagnosis and treatment of disease.

Keywords: Microbe-disease association prediction · Multiple kernel learning · Label propagation · Pre-processing step
1 Introduction

The human body contains approximately 37 trillion cells [1], and the number of microbes (such as bacteria, fungi, viruses, and archaea) distributed among these cells is about 10 times the number of cells [2]. Most of the microbes in the human body are beneficial or even indispensable to human health.

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 3–10, 2021. https://doi.org/10.1007/978-3-030-91415-8_1
To support current and future microbe research, Ma et al. constructed the Human Microbe-Disease Association Database (HMDAD) by collecting and collating association data from microbiome studies [3]. Based on this database, a number of computational methods have been developed to explore new microbe-disease associations. KATZHMDA predicts unknown microbe-disease pairs with the KATZ measure, using known microbe-disease associations and the Gaussian Interaction Profile (GIP) kernel similarity of microbes and diseases; its only input is the topological information of the known associations [4]. BRWMDA, proposed by Yan et al., is based on a modified bi-random walk on the microbial similarity network and the disease similarity network [5]. In 2019, Qu et al. proposed MDLPHMDA, which fuses matrix decomposition and label propagation [6]: it first extracts a new adjacency matrix from the original associations through sparse learning and then uses label propagation to find unknown microbe-disease pairs.

Although these methods have achieved positive results, they still have limitations. For example, some rely on a single source of prior information, depending too heavily on known associations while ignoring other kinds of biological information, such as Medical Subject Headings (MeSH) and symptom data. Such methods can be skewed towards well-studied diseases and microbes. In view of this, this paper proposes a new method, MKL-LP, which fuses multiple similarity kernels by Multiple Kernel Learning (MKL) [7] and uses Label Propagation (LP) [8] to predict hidden microbe-disease associations. The method begins by constructing multiple similarity kernels.
Moreover, considering that many non-associations in the association matrix may be positive, the pre-processing step [9] is used to estimate the association probability of unknown entries according to the similarity information. Then, 5-fold cross validation (CV) and the case study on Chronic Obstructive Pulmonary Disease (COPD) are used to verify the performance of MKL-LP. The results suggest that MKL-LP is of great importance in the discovery of new microbe-disease pairs.
2 Materials and Methods

2.1 Human Microbe-Disease Associations

Duplicate associations are removed from the association data obtained from HMDAD [3], yielding a dataset of 39 diseases, 292 microbes and 450 known associations, denoted by $A_{known}$.

2.2 Similarity Calculation

In this paper, four calculation methods are applied to evaluate the disease similarity and the microbe similarity, respectively: two types of disease semantic similarity ($D_1$ and $D_2$), two kinds of microbe functional similarity ($M_1$ and $M_2$), GIP kernel similarity of diseases and microbes ($D_3$ and $M_3$), and cosine similarity of diseases and microbes ($D_4$ and $M_4$).
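The two similarity measures that depend only on the association matrix can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not give the GIP formula, so the sketch assumes the commonly used form exp(-gamma * ||a_i - a_j||^2) with the bandwidth normalised by the mean squared profile norm. The function names, the `bandwidth` parameter and the toy matrix `A` are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, bandwidth=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Assumed form: exp(-gamma * ||a_i - a_j||^2), with gamma equal to
    `bandwidth` divided by the mean squared norm of the profiles.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = bandwidth / sq_norms.mean()
    # Pairwise squared Euclidean distances between interaction profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def cosine_kernel(profiles):
    """Cosine similarity between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(profiles, axis=1)
    norms[norms == 0] = 1.0  # an all-zero profile gets similarity 0 to every row
    unit = profiles / norms[:, None]
    return unit @ unit.T

# Hypothetical toy association matrix: 3 diseases (rows) x 4 microbes (columns).
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])

D_gip, D_cos = gip_kernel(A), cosine_kernel(A)      # disease-side kernels
M_gip, M_cos = gip_kernel(A.T), cosine_kernel(A.T)  # microbe-side kernels
```

Rows of the association matrix serve as disease profiles and columns as microbe profiles, so the same two functions cover both sides by passing `A` or `A.T`.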
The disease semantic similarity is calculated in the same way as in microRNA-disease association prediction [10, 11]. Two models are used, which take into account semantic information and information content, respectively. The microbe functional similarity, which is based on these two kinds of disease semantic similarity, likewise follows the calculation of microRNA functional similarity [10].

2.3 Multiple Kernel Learning

In order to fuse the different kernels in the disease/microbe space, linear combinations of these kernels are formed by the Multiple Kernel Learning algorithm [7]. The similarity kernels of the disease space and the microbe space can be expressed as $D = \{D_1, D_2, \ldots, D_{num\_d}\}$ and $M = \{M_1, M_2, \ldots, M_{num\_m}\}$, respectively, where $num\_d$ and $num\_m$ are the numbers of similarity kernels defined in the disease space and the microbe space. Correspondingly, the optimal kernels are defined as:

$$K_D^* = \sum_{num1=1}^{num\_d} \vartheta_D^{num1} D_{num1}, \quad D_{num1} \in \mathbb{R}^{N_{dd} \times N_{dd}}, \tag{1}$$

$$K_M^* = \sum_{num2=1}^{num\_m} \vartheta_M^{num2} M_{num2}, \quad M_{num2} \in \mathbb{R}^{N_{mm} \times N_{mm}}, \tag{2}$$

where $\vartheta_D = \{\vartheta_D^1, \vartheta_D^2, \ldots, \vartheta_D^{num\_d}\}$ and $\vartheta_M = \{\vartheta_M^1, \vartheta_M^2, \ldots, \vartheta_M^{num\_m}\}$ are the weights with which the multiple similarity kernels are combined in the disease space and the microbe space, respectively. Moreover, $N_{dd}$ and $N_{mm}$ are the numbers of diseases and microbes, respectively. In order to obtain the optimal kernel (taking the disease space as an example), the objective function is defined as follows:

$$\max_{\vartheta, K_D^*} \cos\left(\mathrm{vec}(K_D^*), \mathrm{vec}(K_D^{ideal})\right) \tag{3}$$

$$= \frac{\left\langle \mathrm{vec}(K_D^*), \mathrm{vec}(K_D^{ideal}) \right\rangle}{\left\| \mathrm{vec}(K_D^*) \right\| \left\| \mathrm{vec}(K_D^{ideal}) \right\|} \tag{4}$$

subject to

$$K_D^* = \sum_{num1=1}^{num\_d} \vartheta_D^{num1} D_{num1}, \tag{5}$$

$$\vartheta_D^{num1} \ge 0, \quad num1 = 1, 2, 3, \ldots, num\_d, \tag{6}$$

$$\sum_{num1=1}^{num\_d} \vartheta_D^{num1} = 1, \tag{7}$$

where $K_D^*$ is the optimal kernel in the disease space, and $\mathrm{vec}(\cdot)$ denotes the vector format of each matrix. Also, $K_D^{ideal}$ is the ideal kernel of the disease.
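The objective in Eqs. (3) to (7) can be illustrated with a short sketch. It computes the cosine alignment between vectorised kernels and forms a convex combination of kernels, but it does not solve the optimisation problem: the weights come from a simple stand-in heuristic (each kernel's own alignment with the ideal kernel, clipped at zero and normalised to sum to 1). The ideal kernel is assumed here to be the association matrix times its transpose, which the text does not specify. All names and toy values are hypothetical.

```python
import numpy as np

def alignment(K1, K2):
    """Cosine between vec(K1) and vec(K2), as in Eqs. (3) and (4)."""
    v1, v2 = np.ravel(K1), np.ravel(K2)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def combine_kernels(kernels, weights):
    """Weighted sum of kernels, as in Eq. (5)."""
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights, np.stack(kernels), axes=1)

def heuristic_weights(kernels, K_ideal):
    """Stand-in for the optimiser: weights proportional to each kernel's
    alignment with K_ideal, non-negative and summing to 1 (Eqs. (6), (7)).
    Assumes at least one kernel has positive alignment."""
    scores = np.array([max(alignment(K, K_ideal), 0.0) for K in kernels])
    return scores / scores.sum()

# Hypothetical toy data: 3 diseases x 4 microbes.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
K_ideal = A @ A.T                      # assumed form of the ideal disease kernel
kernels = [np.eye(3), K_ideal / K_ideal.max()]

weights = heuristic_weights(kernels, K_ideal)
K_star = combine_kernels(kernels, weights)
print(weights, alignment(K_star, K_ideal))
```

A faithful implementation would replace `heuristic_weights` with a constrained optimiser over the simplex; the heuristic is used here only to keep the example short and dependency-free.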
2.4 Label Propagation for Predicting Novel Associations

The label propagation algorithm [8] is used to find potential associations. Label propagation from the disease side is defined as follows:

$$P_D^{t+1} = \mathrm{gamma} \times K_D^* \times P_D^t + (1 - \mathrm{gamma}) \times A_F, \tag{8}$$

where $P_D^t$ is the association probability of the microbe-disease pairs at step $t$, and gamma is the rate that balances the retained neighborhood information against the initial label information. The matrix $A_F$ is obtained by fusing the pre-processed association matrix and the original association matrix. Similarly, another prediction score matrix $P_M$ can be obtained from the microbe side. The final predicted scores are then obtained as follows:

$$P = \frac{P_D + P_M^T}{2}. \tag{9}$$
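The propagation and averaging steps can be sketched as below. Several choices are assumptions for illustration rather than details from the text: the kernels are symmetrically normalised so that the iteration converges, the loop stops when the largest update falls below a tolerance, and $A_F$ is taken to have diseases in rows and microbes in columns, so the microbe-side scores are transposed before averaging. The toy matrices are hypothetical.

```python
import numpy as np

def normalize_kernel(K):
    """Symmetric normalisation D^{-1/2} K D^{-1/2} (assumed; keeps the iteration stable)."""
    K = np.asarray(K, dtype=float)
    degree = K.sum(axis=1)
    degree[degree == 0] = 1.0  # avoid division by zero for isolated nodes
    inv_sqrt = 1.0 / np.sqrt(degree)
    return K * inv_sqrt[:, None] * inv_sqrt[None, :]

def propagate(K, A_F, gamma=0.9, tol=1e-8, max_iter=1000):
    """Iterate P <- gamma * K @ P + (1 - gamma) * A_F until updates fall below tol."""
    A_F = np.asarray(A_F, dtype=float)
    P = A_F.copy()
    for _ in range(max_iter):
        P_next = gamma * K @ P + (1.0 - gamma) * A_F
        if np.abs(P_next - P).max() < tol:
            return P_next
        P = P_next
    return P

# Hypothetical fused association matrix (3 diseases x 4 microbes) and kernels.
A_F = np.array([[1.0, 0.0, 0.6, 0.0],
                [0.8, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 1.0]])
K_D = np.array([[1.0, 0.7, 0.0],
                [0.7, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
K_M = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))

P_D = propagate(normalize_kernel(K_D), A_F)    # diseases x microbes
P_M = propagate(normalize_kernel(K_M), A_F.T)  # microbes x diseases
P = (P_D + P_M.T) / 2.0                        # final scores
```

With a normalised kernel and gamma below 1, the update is a contraction, so the loop approaches the fixed point (1 - gamma) (I - gamma K)^{-1} A_F; the closed form could be used directly for small matrices.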
3 Results and Discussion

3.1 Parameter Selection

Two parameters must be tuned experimentally: K, the number of nearest neighbors in the pre-processing step, and the rate gamma in the label propagation predictor. 5-fold CV is used to search for the optimal value of each, with K ranging over [1, 10] and gamma over (0, 1). The results are shown in Fig. 1. The best result under 5-fold CV, 0.9748 ± 0.0022, is obtained when K is 10 and gamma is 0.9.
Fig. 1. The selection results of K and gamma under 5-fold CV.
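The parameter search described in Sect. 3.1 amounts to a grid search over K and gamma. The sketch below is a hedged illustration only: `cv_auc` is a hypothetical callback standing in for a full 5-fold CV run of the method, the gamma step of 0.1 is an assumption (the text only gives the open interval (0, 1)), and the toy scoring function is invented so that the example runs on its own.

```python
def select_parameters(cv_auc, k_values=range(1, 11), gamma_values=None):
    """Return (best score, best K, best gamma) over a grid of candidate values."""
    if gamma_values is None:
        # Assumed grid: 0.1, 0.2, ..., 0.9
        gamma_values = [round(0.1 * i, 1) for i in range(1, 10)]
    best = None
    for k in k_values:
        for gamma in gamma_values:
            score = cv_auc(k, gamma)
            if best is None or score > best[0]:
                best = (score, k, gamma)
    return best

def toy_cv_auc(k, gamma):
    # Hypothetical stand-in for a real 5-fold CV evaluation.
    return 0.97 - 0.001 * (10 - k) - 0.01 * abs(gamma - 0.9)

best_auc, best_k, best_gamma = select_parameters(toy_cv_auc)
```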
3.2 Performance Evaluation

In this paper, 5-fold CV is used to evaluate the predictive performance of MKL-LP. The AUC (area under the receiver operating characteristic (ROC) curve) is selected as the evaluation index; its value ranges between 0 and 1.
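For readers unfamiliar with the metric, the AUC can be computed from the ranks of the predicted scores. The following is a minimal sketch using the rank-sum (Mann-Whitney) formulation with average ranks for ties; the function name and the toy labels and scores are assumptions, and this is not taken from the evaluation code of the paper.

```python
def auc_score(labels, scores):
    """AUC from ranks: probability that a positive is scored above a negative."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Group tied scores and give them their average rank.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy example (hypothetical): known associations are labeled 1, held-out non-associations 0.
perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
tied = auc_score([1, 0], [0.5, 0.5])
```

A perfect ranking gives an AUC of 1, and a ranking that cannot separate the two classes gives 0.5.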
Under 5-fold CV, the AUC of MKL-LP is compared with those of established methods (BRWMDA [5], MDLPHMDA [6], NTSHMDA [12] and NGRHMDA [13]), as shown in Fig. 2. The AUC value of MKL-LP is 6.52%, 10.43%, 8.89% and 8.19% higher than those of BRWMDA, MDLPHMDA, NTSHMDA and NGRHMDA, respectively.
Fig. 2. The comparison result of MKL-LP with BRWMDA, MDLPHMDA, NTSHMDA and NGRHMDA under 5-fold CV.
Because MKL-LP makes predictions from both the disease side and the microbe side and then integrates them, we also make predictions from the disease side alone and from the microbe side alone, and compare them with MKL-LP. The results are shown in Fig. 3. Although the prediction result of MKL-LP is not much different from those of the unilateral predictions under 5-fold CV, we still adopt the bilateral prediction because it uses more comprehensive information.

Fig. 3. The comparison results of predicting from different sides under 5-fold CV.

3.3 Case Study

A detailed case study is conducted on the prediction results of MKL-LP, with COPD as the disease of interest. The predicted microbes for the disease are ranked in descending order of score, and the top 15 candidate associations are checked against the literature for supporting evidence. The prediction results for COPD are shown in Table 1.

Table 1. The predicted microbes associated with COPD.

Rank  Microbe                Evidence
1     Proteobacteria         PMID: 33766947
2     Bacteroides vulgatus   Unconfirmed
3     Prevotella             PMID: 33547327
4     Haemophilus            PMID: 33784296
5     Lactobacillus          PMID: 33230454
6     Brenneria              Unconfirmed
7     Pseudomonas            PMID: 33849721
8     Helicobacter pylori    PMID: 32787348
9     Actinobacillus         PMID: 33598807
10    Clostridium            PMID: 33425595
11    Clostridium cocleatum  Unconfirmed
12    Lysobacter             Unconfirmed
13    Rickettsiales          PMID: 4361691
14    Streptococcus mitis    Unconfirmed
15    Xanthomonas            PMID: 11775914
4 Conclusion

More and more studies focus on predicting the associations between microbes and diseases, and an increasing number of prediction methods are emerging. The main contribution of MKL-LP lies in the use of MKL to fuse a variety of similarity information, which makes the information used for prediction more comprehensive. Secondly, because many unobserved entries in the association matrix may in fact be true associations, a new pre-processing method is applied to the original association matrix to estimate the association probability of unknown items. Finally, label propagation is used to predict potential unknown associations. The experimental results show the potential of MKL-LP to support the diagnosis and treatment of complex diseases.

Although the predictive performance of MKL-LP is strong, it still has some drawbacks. For example, too few known associations are available to exploit, and we hope that more known associations will become available for prediction in the future. In addition, MKL assumes that each similarity matrix being fused is symmetric, but the matrices produced by some similarity calculation methods are not, so we plan to explore other similarity fusion algorithms in the future.

Acknowledgment. This work was supported by the National Natural Science Foundation of China (Grant Nos. 62172254 and 61872220).
References

1. Aagaard, K., Ma, J., Antony, K.M., Ganu, R., Petrosino, J., Versalovic, J.: The placenta harbors a unique microbiome. Sci. Transl. Med. 6(237), 11 (2014). https://doi.org/10.1126/scitranslmed.3008599
2. Mertz, L.: My body, my microbiome microbes outnumber cells, but what are they doing? IEEE Pulse 5(6), 40–45 (2014). https://doi.org/10.1109/mpul.2014.2355309
3. Ma, W., et al.: An analysis of human microbe-disease associations. Brief. Bioinform. 18(1), 85–97 (2017). https://doi.org/10.1093/bib/bbw005
4. Chen, X., Huang, Y.-A., You, Z.-H., Yan, G.-Y., Wang, X.-S.: A novel approach based on KATZ measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics 33(5), 733–739 (2017). https://doi.org/10.1093/bioinformatics/btw715
5. Yan, C., Duan, G., Wu, F.-X., Pan, Y., Wang, J.: BRWMDA: predicting microbe-disease associations based on similarities and bi-random walk on disease and microbe networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 17(5), 1595–1604 (2020). https://doi.org/10.1109/tcbb.2019.2907626
6. Qu, J., Zhao, Y., Yin, J.: Identification and analysis of human microbe-disease associations by matrix decomposition and label propagation. Front. Microbiol. 10, 10 (2019). https://doi.org/10.3389/fmicb.2019.00291
7. Ding, Y., Tang, J., Guo, F.: Identification of drug-side effect association via semisupervised model and multiple kernel learning. IEEE J. Biomed. Health Inf. 23(6), 2619–2632 (2019). https://doi.org/10.1109/jbhi.2018.2883834
8. Yu, S.-P., Liang, C., Xiao, Q., Li, G.-H., Ding, P.-J., Luo, J.-W.: MCLPMDA: a novel method for miRNA-disease association prediction based on matrix completion and label propagation. J. Cell. Mol. Med. 23(2), 1427–1438 (2019). https://doi.org/10.1111/jcmm.14048
9. Wu, G., Yang, M., Li, Y., Wang, J.: De novo prediction of drug-target interaction via laplacian regularized schatten-p norm minimization. In: Cai, Z., Mandoiu, I., Narasimhan, G., Skums, P., Guo, X. (eds.) ISBRA 2020. LNCS, vol. 12304, pp. 154–165. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57821-3_14
10. Wang, D., Wang, J., Lu, M., Song, F., Cui, Q.: Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 26(13), 1644–1650 (2010). https://doi.org/10.1093/bioinformatics/btq241
11. Xuan, P., et al.: Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS ONE 8(8), 15 (2013). https://doi.org/10.1371/journal.pone.0070204
12. Luo, J., Long, Y.: NTSHMDA: prediction of human microbe-disease association based on random walk by integrating network topological similarity. IEEE/ACM Trans. Comput. Biol. Bioinform. 17(4), 1341–1351 (2020). https://doi.org/10.1109/TCBB.2018.2883041
13. Huang, Y.-A., You, Z.-H., Chen, X., Huang, Z.-A., Zhang, S., Yan, G.-Y.: Prediction of microbe-disease association from the integration of neighbor and graph with collaborative recommendation model. J. Transl. Med. 15, 11 (2017). https://doi.org/10.1186/s12967-017-1304-7
Immune-Microbiota Crosstalk Underlying Inflammatory Bowel Disease

Congmin Xu1, Quoc D. Mac1, Qiong Jia2, and Peng Qiu1(B)

1 The Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Georgia, USA
[email protected]
2 The Department of Pediatrics, Peking University Third Hospital, Beijing, China
Abstract. Long-term evolution has shaped a harmonious host-microbiota symbiosis between the intestinal microbiota and the host immune system. Inflammatory bowel disease (IBD) results from a dysbiotic microbial composition together with aberrant mucosal immune responses, but the underlying mechanism is far from clear. In this report, we propose that, when correlating with the host metabolism, functional microbial communities matter more than individual bacteria. Based on this assumption, we performed a systematic analysis to characterize the co-metabolism of host and gut microbiota in a set of newly diagnosed Crohn's disease (CD) samples and healthy controls. From the host side, we applied gene set enrichment analysis to host mucosal proteome data to identify host pathways associated with CD. At the same time, we applied community detection analysis to the metagenomic data of the mucosal microbiota to identify microbial communities assembled for a functional purpose. We then conducted a correlation analysis between host pathways and microbial communities. We discovered two microbial communities negatively correlated with IBD-enriched host pathways. The dominant genera of these two microbial communities are known to be beneficial to health and could serve as a reference for designing combinations of beneficial microorganisms for IBD treatment. The correlated host pathways are all relevant to MHC antigen presentation pathways, which hints at a possible mechanism of immune-microbiota crosstalk underlying IBD.
Keywords: IBD · Microbial symbiont · MHC · Co-metabolism

1 Introduction
Inflammatory bowel disease (IBD), comprising Crohn's disease (CD) and ulcerative colitis, results from accumulating alterations in the intestinal microbiota and disorders of the immune system. However, the mechanisms leading to the chronic mucosal inflammation that characterizes IBD remain unclear. There has been a dramatic increase in metagenomic and metabolomic studies of IBD in the past decades [1], aiming to characterize IBD in terms of host metabolic activities and the accompanying microbial dysbiosis.

Studies aiming to understand the host pathways involved in IBD initiation have revealed that IBD is strongly associated with the immune system, including antigen processing and presentation pathways linked with the major histocompatibility complex (MHC) [2]. Antigen presentation by intestinal epithelial cells (IEC) is crucial for intestinal homeostasis. Disturbances of MHC I- and II-related presentation pathways in IEC are involved in an altered activation of CD4+ and CD8+ T cells in IBD [3]. From the microbial side, the current literature has clearly demonstrated a perturbation of the gut microbiota in IBD patients [4]. Gevers et al. linked alterations in the mucosa-associated microbiota with CD status using metagenomic analysis [5]. A meta-analysis reported that 467 out of 536 patients with CD (87%) experienced resolution of diarrhea after fecal microbiota transplant treatment [6], which supports the significance of microbial dysbiosis in CD patients.

The microbes inside the human gut often have correlated functions and can be aggregated into functional communities that dynamically respond to or modulate host metabolic activities [7]. When correlating with host metabolites, the functional communities of microbes matter more than the relative abundance of individual microbes [8]. We propose to apply a community detection algorithm to the microbial composition of the human gut to identify microbial communities, and then to cross-link these communities with gene pathways enriched in IBD-associated genes. With this approach, genera often reported as beneficial, such as Bacteroides, Blautia, Faecalibacterium and Propionibacterium, are found to be negatively correlated with the host immunological pathways enriched in IBD patients, especially those relevant to MHC presentation.

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 11–21, 2021. https://doi.org/10.1007/978-3-030-91415-8_2
2 Materials and Methods

2.1 Sample and Data Description
We retrieved data for 21 subjects with both 16S rRNA sequencing data of the mucosa-luminal interface (MLI) microbiota and proteome data of the colon or ileum from a previous study [9], including 11 Crohn's disease patients and 10 healthy controls (Table 1). The Crohn's disease samples come from teenagers with new-onset disease, so there is no treatment influence and there are fewer co-morbidities than in samples from adults. More details about the sampling and sequencing technologies can be found in [9].

Table 1. Sample information

Groups            Number  Age           Male  Female
Healthy controls  10      14.25 ± 2.70  6     4
Crohn's disease   11      13.3 ± 2.92   5     6
2.2 Annotation of the 16S rDNA Sequencing Data of the Mucosa-Luminal Interface Microbiota
The 16S rDNA sequencing data of the MLI microbiota were processed with a standard pipeline [10]. Raw reads were downloaded from NCBI under accession code SRP056939 [9]. Read quality control was performed with FastQC. High-quality reads that passed quality control were converted into FASTA format and imported into QIIME using the QIIME import command. Duplicate sequences were removed to speed up the annotation process. Dereplicated contigs were clustered into operational taxonomic units (OTUs) using a closed-reference OTU picking workflow against the Greengenes 16S rRNA gene database (version gg13-8) at an average identity of 0.97, after which a set of representative sequences and an OTU relative abundance (proportion) matrix were obtained. A taxonomic annotation was assigned to the representative sequence of each OTU using classify-sklearn of QIIME. By summing the abundances of OTUs assigned to the same genus, a genus-level taxonomic abundance matrix was obtained.

2.3 Microbial Community Detection
The microbes inside the human gut aggregate into different communities for functional purposes. When analyzing crosstalk with the host metabolic pathways, treating microbes in the same community as a whole is likely to shed new light on the interaction mechanism between microbiota and host pathways. To identify microbial communities, we first calculated a pairwise similarity matrix for all OTUs. The similarity was quantified as the correlation between each pair of OTUs in terms of their relative abundance across all samples. To ensure that the microbes in the same community correlate with each other in the same direction, and to exclude spurious correlations induced by the unit-sum constraint, only positive correlations were kept and negative correlations were set to zero. Furthermore, weak or insignificant correlations (i.e., correlation coefficient |R| < 0.2 or p-value P > 0.05) were also set to zero. Once the similarity matrix was generated, the Louvain community detection algorithm [11] was applied to it to identify OTU clusters. Twelve OTU clusters were identified, and each cluster of OTUs was considered as one microbial community.

We then defined the level of activity for each microbial community. Since only positive correlations were considered during community detection, the cross-sample changes of OTUs in the same microbial community follow the same trend. The simplest way to quantify the level of community activity is to sum the relative abundances of all OTUs in each OTU cluster/community.

2.4 Proteome Data of the Host Tissue (Human Colon or Ileum)
For the same set of 21 subjects with 16S rDNA sequencing data of MLI, the biopsies of their colon or ileum were profiled by mass spectrometry to characterize their proteome. We retrieved the relative abundance matrix of 3,861 proteins/genes from a public data source released by Mottawea et al. [9].
2.5 Gene Set Enrichment Analysis
IBD-enriched gene pathways were identified by applying Gene Set Enrichment Analysis (GSEA) [12] with the gene sets in the KEGG database [13] as the reference. Genes were sorted according to their fold change, and the fold change ($FC$) for gene $i$ was defined as

$$FC_i = \frac{\dfrac{1}{N_C}\sum_{j=1}^{N_C} C_{ij} - \dfrac{1}{N_H}\sum_{j=1}^{N_H} H_{ij}}{\dfrac{1}{N_C + N_H}\left(\sum_{j=1}^{N_C} C_{ij} + \sum_{j=1}^{N_H} H_{ij}\right)} \tag{1}$$

where $N_C$ is the total number of samples in the Crohn's group and $N_H$ is the total number of samples in the healthy control group. $C_{ij}$ and $H_{ij}$ are the relative abundances of gene $i$ in the $j$th Crohn's sample and the $j$th healthy sample, respectively. When performing GSEA, the number of permutations was set to 1000, the minimal gene set size was set to 20, and the p-value cutoff was set to 0.05.

2.6 The Activeness of Each Gene Pathway
The activeness of a metabolic pathway can be quantified by the expression levels of the genes in the pathway. The simplest way to calculate the activeness of a pathway is to average the expression levels of its genes. However, genes in the same pathway may be positively correlated but may also be negatively correlated, so in a simple average the negatively correlated genes cancel each other out. As an alternative, principal component analysis (PCA) was adopted here. PCA performs dimension reduction by linearly combining the genes/features to derive principal component scores that maximally preserve the variance. In a gene pathway where a large portion of genes are correlated, the first principal component score is typically dominated by a weighted combination of the correlated genes, and the signs of the weights avoid the cancelling effect caused by negative correlations among genes in the pathway. Operationally, given the gene list for a pathway, a sub-matrix containing the relative abundance of these genes was retrieved, and the first principal component of the sub-matrix was used to represent the overall activeness of the pathway.
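The cancelling effect and the PCA alternative can be seen in a small example. The sketch below is an assumption-laden illustration, not the authors' code: it computes the first principal component scores with a plain power iteration on the sample covariance matrix, and the function name and the two-gene toy pathway are hypothetical. In practice an established PCA routine would normally be used.

```python
import math

def first_pc_scores(X, n_iter=200):
    """First principal component score per sample; X is samples x genes."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # Sample covariance matrix (genes x genes).
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration from a non-uniform start vector; a uniform start would be
    # orthogonal to the first component in the toy example below.
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # The sign of a principal component is arbitrary; make the largest loading positive.
    if max(v, key=abs) < 0:
        v = [-x for x in v]
    return [sum(r[j] * v[j] for j in range(p)) for r in Xc]

# Toy pathway (hypothetical): two perfectly anti-correlated genes across four samples.
abundance = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]
simple_average = [sum(row) / len(row) for row in abundance]
pc1 = first_pc_scores(abundance)
```

Here the simple average is identical for every sample, whereas the first principal component still separates the samples because the two genes receive loadings of opposite sign.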
3 Results
As shown in Fig. 1, based on the proteome data of host colon and ileum, our approach aims to identify gene pathways significantly enriched by those genes associated with the IBD condition. In parallel, our approach takes the taxonomic composition of intestinal microbiota, and identifies microbial communities. After that, the correlations between gene pathways and OTU communities are examined to discover OTU communities that are closely linked with IBD-enriched pathways.
[Figure 1: pipeline diagram. Host side: Genes → GSEA → CD-enriched pathways → PCA → pathway activeness. Microbial side: OTUs → Louvain → microbial communities → sum → community activeness.]
Fig. 1. Schematic diagram of the analysis pipeline. The left side shows the procedure for identifying host pathways and calculating their activeness. Genes were sorted according to the fold change of their expression levels in CD vs. healthy controls. GSEA then identified the KEGG pathways significantly enriched or depleted in CD patients, and the activeness of these KEGG pathways was calculated using PCA as described in the Materials and Methods section. The right side illustrates how microbial OTUs were aggregated into different communities. The activeness of each microbial community was calculated by summing the relative abundance of every OTU in that community. Finally, for each combination of host pathway and microbial community, a Pearson correlation was calculated based on their activeness. A significant correlation implies a strong interaction between the host metabolic pathway and the activities of the bacteria in the corresponding microbial community.
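The last two steps of the pipeline, summing OTU abundances into a community activeness and correlating it with pathway activeness, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, identifiers and toy values are hypothetical, and only the correlation-coefficient cutoff is applied (the accompanying significance test on the p-value is omitted here and would need a statistics library).

```python
import math

def community_activity(otu_abundance, communities):
    """Sum the relative abundances of all OTUs in each community, per sample."""
    n_samples = len(next(iter(otu_abundance.values())))
    return {cid: [sum(otu_abundance[otu][s] for otu in otus)
                  for s in range(n_samples)]
            for cid, otus in communities.items()}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # correlation is undefined for a constant profile
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(pathway_activeness, activity, r_cutoff=0.4):
    """Keep (pathway, community, r) triples whose |r| exceeds the cutoff."""
    pairs = []
    for pathway, path_act in pathway_activeness.items():
        for cid, comm_act in activity.items():
            r = pearson(path_act, comm_act)
            if abs(r) > r_cutoff:
                pairs.append((pathway, cid, r))
    return pairs

# Toy data (hypothetical): three samples, three OTUs, two communities, one pathway.
otu_abundance = {
    "otu_a": [0.10, 0.20, 0.30],
    "otu_b": [0.10, 0.10, 0.10],
    "otu_c": [0.50, 0.40, 0.30],
}
communities = {"community_1": ["otu_a", "otu_b"], "community_2": ["otu_c"]}
pathway_activeness = {"pathway_x": [3.0, 2.0, 1.0]}

activity = community_activity(otu_abundance, communities)
pairs = correlated_pairs(pathway_activeness, activity)
```

In the toy data, community_1 decreases as the pathway activeness increases (negative correlation), while community_2 follows the pathway (positive correlation).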
3.1 Pathways Enriched in IBD Patients
Several metabolic pathways were identified as significantly enriched or down-regulated in IBD patients compared with the healthy controls. Using the KEGG pathway database as the reference, we performed GSEA and identified 17 KEGG pathways as significantly enriched (Fig. 2). Among these 17 KEGG pathways, five are involved in viral infection, i.e., Epstein-Barr virus infection, Herpes simplex virus 1 infection, Measles, Hepatitis B and Influenza A; two are related to bacterial infection, i.e., Tuberculosis and Staphylococcus aureus infection; and one is associated with Toxoplasmosis, which is also an infectious disease. These infectious diseases are all linked with disordered immune responses [14]. The other nine pathways are also relevant to the immune response. The NOD-like receptor (NLR) signaling pathway mediates the production of pro-inflammatory cytokines; NLRs, together with inflammatory factors, enhance the body's inflammatory response and its resistance to microbial infection [15]. The Complement and coagulation cascades pathway and the Antigen processing and presentation pathway are well known as parts of the immune system. The Phagosome pathway is linked to abnormal immune responses. Transcriptional misregulation in cancer is an NF-kappa B-related pathway. Osteoclast differentiation is mainly regulated by signaling pathways activated by immune receptors. Systemic lupus erythematosus is an autoimmune disease. The IL-17 signaling pathway is mainly involved in mucosal host defense mechanisms: the IL-17 family signals via its corresponding receptors and activates downstream pathways, including NF-kappa B, MAPKs and C/EBPs, to induce the expression of antimicrobial peptides, cytokines and chemokines.
Eight of the 17 identified KEGG pathways were annotated as infectious diseases, and these eight pathways cover a broad range of biological processes. To identify the key effectors of the metabolic alterations in these pathways, we used the Hallmark gene sets in MSigDB [16] as the reference database to perform another set of GSEA analyses. Six pathways were significantly enriched by genes associated with IBD, i.e., Complement, Interferon gamma response, Allograft rejection, Coagulation, Interferon alpha response, and TNFA signaling via NFKB. These pathways point to the up-regulation of adaptive immune responses during IBD, which is consistent with what we found in the KEGG pathway analyses and supports our conjecture that the eight KEGG infectious-disease pathways indicate alterations of the immune system.

[Figure 2: dot plot titled "IBD enriched pathways". x-axis: GeneRatio; dot size: Count; dot color: p.adjust. Pathways shown: Leishmaniasis, Systemic lupus erythematosus, Staphylococcus aureus infection, Complement and coagulation cascades, Osteoclast differentiation, Herpes simplex virus 1 infection, Antigen processing and presentation, Transcriptional misregulation in cancer, IL-17 signaling pathway, Epstein-Barr virus infection, Measles, Tuberculosis, Toxoplasmosis, Phagosome, NOD-like receptor signaling pathway, Hepatitis B, Influenza A.]

Fig. 2. Pathways enriched in CD patients. These 17 pathways were identified through GSEA with the KEGG database as reference. The size of each dot represents the number of genes in each gene set, and the adjusted P values of the enrichment tests are shown in different colors, as indicated by the color bar.
3.2 OTU Communities Within Human Gut
Stintzi and his colleagues reported individual OTUs that were significantly negatively correlated with the severity of IBD in the host. In contrast, our analysis takes a different perspective: we propose to examine microbial communities, each of which aggregates multiple OTUs. Before being manifested in disease severity, alterations in the human gut microbiota first interact with the host metabolism. Rather than interacting with the host metabolism individually, microbes that share a common set of metabolic activities aggregate into functional microbial communities. After taxonomic binning of all high-quality raw reads of the whole-genome-sequenced human gut microbiota, we searched for microbial communities based on pairwise correlations between OTUs. The correlations were quantified using Spearman correlation coefficients, and the OTU communities were identified using the Louvain community detection algorithm. Overall, 12 OTU communities were discovered. Different OTU communities were dominated by different genera, and the genera in the same community are expected to participate in the same sets of metabolic pathways.

3.3 Multiple Health Beneficial Genera Are Negatively Correlated with Inflammation-Relevant MHC Pathways
As described in the Materials and Methods section, the correlation between microbial OTU communities and host metabolic pathways was computed based on the activeness of each OTU community and each host metabolic pathway. Two of the 12 OTU communities (OTU community number 2 and number 7) were found to be negatively correlated with seven Crohn-enriched host metabolic pathways (Pearson correlation with correlation coefficient |R| > 0.4 and significance test p-value P < 0.05) (Fig. 3). Counting the occurrences of different genera in these two OTU communities showed that the dominant genera are beneficial ones. The nine most dominant genera in these two OTU communities (each assigned to more than five OTUs) are Bacteroides, Blautia, Clostridium, Dorea, Faecalibacterium, Propionibacterium, Prevotella, Ruminococcus and Parabacteroides. Out of these nine dominant genera, five genera Blautia, Roseburia, Ruminococcus, Clostridium and Faecalibacterium were reported as negatively correlated with IBD severity in the paper from which we obtained the raw data [9], which supports our findings here. A comprehensive literature review of these nine dominant genera advanced our understanding of their metabolic roles and provided evidence of their health-beneficial roles. Bacteroides has been shown to influence the host immune system and to inhibit the activities of competing pathogens [17]. Blautia is associated with the remission of IBD and is one of the most important features characterizing disease activity levels in pediatric IBD patients [18]. Clostridium spp. contribute to colonization resistance in the mucosa, play an important role in the host immune response, and are among the strong inducers of colonic T regulatory cell (Treg) accumulation [19].
The genus Dorea has also been reported to play an important role in host immune system activity [20], as suggested by its elevated abundance in patients with an autoimmune condition. Faecalibacterium, a butyrate-producing genus, is decreased in Crohn's disease compared with healthy controls [21]. Species of the genus Propionibacterium have been shown to display promising immunomodulatory properties and anti-inflammatory effects via interacting with surface proteins [22,23]. Prevotella was associated with the T helper type 17 (Th17) immune response, which can be beneficial to the host during infection [24]. The Ruminococcus species R. albus, R. callidus, and R. bromii are less abundant in patients with IBD than in healthy controls [25]. Furthermore, Prevotella, Parabacteroides, Bacteroides, Faecalibacterium and Clostridium have been shown to be more abundant in healthy controls than in multiple sclerosis patients, which further supports the immunomodulatory role of these genera [26–29]. Giri et al. also reported Prevotella, Parabacteroides, Clostridium, and Adlercreutzia as part of the anti-inflammatory symbionts [30].

[Figure 3: panel A, GSEA enrichment plot (running enrichment score and ranked list metric against rank in ordered dataset) for the host pathways Antigen processing and presentation, Epstein-Barr virus infection, Hepatitis B, Leishmaniasis, Phagosome, Staphylococcus aureus infection and Systemic lupus erythematosus; panel B, pie chart of the dominant genera for cluster 2; panel C, pie chart of the dominant genera for cluster 7.]

Fig. 3. Two microbial communities were closely correlated with seven immune-related host pathways. A, The seven host pathways enriched in CD samples and also correlated with the microbial communities. At the bottom of plot A, each vertical line represents the fold change of one gene in terms of expression levels in CD vs. healthy controls. A positive value indicates that the gene is more abundant in CD; a negative value indicates that it is more abundant in healthy controls. All genes were sorted in descending order of fold change. For each host pathway, an enrichment score is calculated based on the fold changes of the genes occurring in the pathway. A positive enrichment score indicates that the pathway is up-regulated in CD, and vice versa. B, Dominant genera for the two OTU communities closely correlated with the host pathways. Only genera with more than five affiliated OTUs are shown. The area on the pie plot represents the number of OTUs assigned to that genus.

Seven Crohn-enriched host metabolic pathways were correlated with the alteration of the microbial communities. These seven KEGG pathways are Leishmaniasis, Epstein-Barr virus infection, Staphylococcus aureus infection, Hepatitis B, Antigen processing and presentation, Systemic lupus erythematosus and Phagosome. A literature review of these seven pathways led to an interesting finding: they are all relevant to MHC processing and presentation pathways. Leishmaniasis was reported to be associated with the defective expression of MHC genes, which silences subsequent T cell activation mediated by macrophages, resulting in abnormal immune responses [14,31]. MHC class II was observed to be induced following Epstein-Barr virus infection [32,33]. Staphylococcus aureus expresses an MHC class II analog protein (Map), which influences the immune response of T cells [34]. MHC class I-related chain A (MICA) was induced after HBV infection compared with the uninfected control [35]. Antigen processing and presentation is closely relevant to EBV and MHC presentation. Systemic lupus erythematosus is closely linked with the MHC-relevant pathways [36]. Bacterially derived antigens within the phagosome are closely linked with the MHC-I processing and presentation pathway [37].
4 Conclusions
In the literature, most studies on the host metabolism and the microbial community have been conducted separately. To the best of our knowledge, none of the previous studies correlated the host metabolism with microbes at the community level. Here we explored a new analysis approach that addresses the importance of microbial communities for the interplay between microbiota and host metabolic pathways. We identified two microbial communities of beneficial microbes that provide potential directions for developing beneficial microbes to treat IBD. Animal studies should be designed to test the influence of these beneficial microbes on host medical conditions by transplanting combinations of these microbes to the mucosa of IBD mouse models.

Acknowledgments. This work was supported by funding from the National Science Foundation (CCF1552784 and CCF2007029). P.Q. is an ISAC Marylou Ingram Scholar, a Carol Ann and David D. Flanagan Faculty Fellow, and a Wallace H. Coulter Distinguished Faculty Fellow.
References 1. Wlodarska, M., Ramnik, A.: An integrative view of microbiome-host interactions in inflammatory bowel diseases. Cell Host Microbe 17(5), 577–591 (2015) 2. Goyette, P., Boucher, G., et al.: High-density mapping of the mhc identifies a shared role for hla-drb1*01:03 in inflammatory bowel diseases and heterozygous advantage in ulcerative colitis. Nat. Genet. 47(2), 172–179 (2015)
20
C. Xu et al.
3. Bär, F., et al.: Inflammatory bowel diseases influence major histocompatibility complex class i (mhc i) and ii compartments in intestinal epithelial cells. J. Transl. Immunol. 172(2), 280–289 (2013)
4. Khan, I.: Alteration of gut microbiota in inflammatory bowel disease (ibd): cause or consequence? ibd treatment targeting the gut microbiome. Pathogens 8(3), 126 (2019)
5. Gevers, D., et al.: The treatment-naive microbiome in new-onset crohn’s disease. Cell Host Microbe 15(3), 382–392 (2014)
6. Cammarota, G., Ianiro, G., Gasbarrini, A.: Fecal microbiota transplantation for the treatment of clostridium difficile infection. J. Clin. Gastroenterol. 48(8), 693–702 (2014)
7. Burke, C., Steinberg, P., Rusch, D., Kjelleberg, S., Thomas, T.: Bacterial community assembly based on functional genes rather than species. Proc. Natl. Acad. Sci. 108(34), 14288–14293 (2011)
8. Visconti, A., et al.: Interplay between the human gut microbiome and host metabolism. Nat. Commun. 10(1), 1–10 (2019)
9. Mottawea, W., et al.: Altered intestinal microbiota-host mitochondria crosstalk in new onset crohn’s disease. Nat. Commun. 7(1), 13419 (2016)
10. Caporaso, J.G., et al.: Qiime allows analysis of high-throughput community sequencing data. Nat. Methods 7(5), 335–336 (2010)
11. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008(10), 10008 (2008)
12. Subramanian, A., et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102(43), 15545–15550 (2005)
13. Kanehisa, M.: The kegg database. In: Novartis Foundation Symposium, pp. 91–100. Wiley Online Library (2020)
14. Cunningham, A.C.: Parasitic adaptive mechanisms in infection by leishmania. Exp. Mol. Pathol. 72(2), 132–141 (2002)
15. Creagh, E.M., O’Neill, L.A.J.: Tlrs, nlrs and rlrs: a trinity of pathogen sensors that co-operate in innate immunity. Trends Immunol. 27(8), 352–357 (2006)
16. Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdóttir, H., Tamayo, P., Mesirov, J.P.: Molecular signatures database (msigdb) 3.0. Bioinformatics 27(12), 1739–1740 (2011)
17. Wexler, H.M.: Bacteroides: the good, the bad, and the nitty-gritty. Clin. Microbiol. Rev. 20(4), 593–621 (2007)
18. Papa, E., et al.: Non-invasive mapping of the gastrointestinal microbiota identifies children with inflammatory bowel disease. PLoS ONE 7(6), e39242 (2012)
19. Atarashi, K., et al.: Induction of colonic regulatory t cells by indigenous clostridium species. Science 331(6015), 337–341 (2011)
20. Shahi, S.K., Freedman, S.N., Mangalam, A.K.: Gut microbiome in multiple sclerosis: the players involved and the roles they play. Gut Microbes 8(6), 607–615 (2017)
21. Morgan, X.C., et al.: Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13(9), R79 (2012)
22. Plé, C., et al.: Combining selected immunomodulatory propionibacterium freudenreichii and lactobacillus delbrueckii strains: Reverse engineering development of an anti-inflammatory cheese. Molec. Nutr. Food Res. 60(4), 935–948 (2016)
23. Plé, C.: Single-strain starter experimental cheese reveals anti-inflammatory effect of propionibacterium freudenreichii cirm bia 129 in tnbs-colitis model. J. Funct. Foods 18, 575–585 (2015)
24. Larsen, J.M.: The immune response to prevotella bacteria in chronic inflammatory disease. Immunology 151(4), 363–374 (2017)
25. Kang, S., et al.: Dysbiosis of fecal microbiota in crohn’s disease patients as revealed by a custom phylogenetic microarray. Inflam. Bowel Dis. 16(12), 2034–2042 (2010)
26. Jangi, S., et al.: Alterations of the human gut microbiome in multiple sclerosis. Nat. Commun. 7, 12015 (2016)
27. Miyake, S., et al.: Dysbiosis in the gut microbiota of patients with multiple sclerosis, with a striking depletion of species belonging to clostridia xiva and iv clusters. PLoS ONE 10(9), e0137429 (2015)
28. Cekanaviciute, E., et al.: Gut bacteria from multiple sclerosis patients modulate human t cells and exacerbate symptoms in mouse models. Proc. Natl. Acad. Sci. USA 114(40), 10713–10718 (2017)
29. Chen, J., et al.: Multiple sclerosis patients have a distinct gut microbiota compared to healthy controls. Sci. Rep. 6, 28484 (2016)
30. Giri, S., Mangalam, A.: The Gut Microbiome and Metabolome in Multiple Sclerosis, book section 34. Elsevier Inc. (2019)
31. Nandan, D.: Exploitation of host cell signaling machinery: activation of macrophage phosphotyrosine phosphatases as a novel mechanism of molecular microbial pathogenesis. J. Leukocyte Biol. 67, 464–470 (2000)
32. Knox, P.G., Young, L.S.: Epstein-barr virus infection of cr2-transfected epithelial cells reveals the presence of mhc class ii on the virion. Virology 213(1), 147–157 (1995)
33. Thorley-Lawson, D.A.: Epstein-barr virus: exploiting the immune system. Nat. Rev. Immunol. 1(1), 75–82 (2001)
34. Lee, L.Y., et al.: The staphylococcus aureus map protein is an immunomodulator that interferes with t cell-mediated responses. J. Clin. Invest. 110(10), 1461–1471 (2002)
35. Sasaki, R., et al.: Association between hepatitis b virus and mhc class i polypeptide-related chain a in human hepatocytes derived from human-mouse chimeric mouse liver. Biochem. Biophys. Res. Commun. 464(4), 1192–1195 (2015)
36. Ruiz-Narvaez, E.A., et al.: Mhc region and risk of systemic lupus erythematosus in African American women. Hum. Genet. 130(6), 807–815 (2011)
37. Harriff, M., Purdy, G., Lewinsohn, D.M.: Escape from the phagosome: the explanation for mhc-i processing of mycobacterial antigens? Front. Immunol. 3, 40 (2012)
Epidemic Vulnerability Index for Effective Vaccine Distribution Against Pandemic

Hunmin Lee1, Mingon Kang2, Yingshu Li1, Daehee Seo3, and Donghyun Kim1(B)

1 Georgia State University, Atlanta, GA 30302, USA [email protected], {yili,dhkim}@gsu.edu
2 University of Nevada, Las Vegas, NV 89154, USA [email protected]
3 Sangmyung University, Seoul 03016, Republic of Korea [email protected]
Abstract. As COVID-19 vaccines are distributed worldwide, the numbers of infections and deaths vary depending on the vaccination route. Computing optimal measures that increase the vaccination effect is therefore crucial. In this paper, we propose an Epidemic Vulnerability Index (EVI) that quantitatively evaluates the risk of COVID-19 based on an analysis of the subject’s clinical and social statistical features. Utilizing EVI, we investigate the optimal vaccine distribution route with a heuristic approach in order to maximize the vaccine distribution effect. Our main criteria for determining the vaccination effect are mortality and infection rate, so EVI was designed to minimize those critical factors. We conduct vaccine distribution simulations with nine different scenarios among multiple Agent-Based Models that were constructed with real-world COVID-19 patients’ statistical data. Our results show that vaccine distribution through EVI yields on average 5.0% fewer infection cases, 9.4% fewer death cases, and a 3.5% lower death rate than other distribution methods.

Keywords: COVID-19 · Epidemic Vulnerability Index · Agent Based Model · Clinical data analysis · Social data analysis

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 22–34, 2021. https://doi.org/10.1007/978-3-030-91415-8_3

1 Introduction

The coronavirus (2019-nCoV) is a novel virus that emerged at the end of 2019 and began to be transmitted from person to person at a high rate of infection. As of March 31st 2021, there were 128 million confirmed cases and 2.81 million COVID-related deaths worldwide, increasing by more than 400,000 per day. Meanwhile, vaccines have been developed and the rollout of vaccinations is in progress worldwide. With a limited supply of vaccines, the distribution strategy directly affects mortality and infections, so an effective vaccine distribution that minimizes those factors is crucial. Accordingly, studies have emerged analyzing vaccinations, such as distribution challenges [4,5,23] and distribution strategy [2,4,17,22].

There have also been studies that quantify the risk of the pandemic, which suggested the following criteria and indexes. The Social Vulnerability Index [9] proposed by the CDC has been acknowledged as an influential indicator for defining the risk of COVID-19 for the socially vulnerable, and has been considered when establishing COVID-19 response strategies. Marvel et al. [12] suggested the Pandemic Vulnerability Index, which quantifies the vulnerability associated with certain pandemics using a Bayesian machine learning approach based on US county data. Similarly, the C-19 Index [6] was devised as a risk indicator based on the XGBoost machine learning model, which predicts mortality using past respiratory disease patient data. Furthermore, the COVID-19 Vulnerability Index [1] was proposed utilizing the COVID-19 diagnosis rate, which was derived from the analysis of COVID patient data in the State of Washington. Moreover, Agent Based Models (ABM) that predict the spread of COVID-19 [15,25] were studied to predict variations in future infections in the community.

Previous studies are limited in that they mainly focus on approximating mortality rates from past COVID-19 or respiratory patient data using AI or ABM models [10]; in addition, only a few studies estimate the effective vaccine distribution route and other conditional vaccine distribution effects. In this paper, we propose an Epidemic Vulnerability Index (EVI) for effective vaccine distribution in a reality-based ABM graph network, utilizing past COVID-19 patients’ clinical and social features. Each agent in the ABM has its own statistical features based on its unique clinical and social factors. On this basis, we propose a new index that effectively encompasses these features by analyzing the collected information.
Through nine different vaccine distribution scenarios on the ABM, we empirically show the distribution results with the following evaluation metrics: variations in infection cases, death cases and death rates. As a result, vaccination through EVI was the most effective in disease containment, minimizing the mortality and infection rates, which implies the optimal vaccination route. The content of this paper is organized as follows. In Sect. 2, we show the statistical analysis of the past COVID-19 patients’ data, exploring the relationship and significance of the features related to mortality and infection rate. We also validate the optimality of the selected features through simulations, and define the EVI. In Sect. 3, we conduct nine different vaccine distribution scenarios on the ABM, including distribution through EVI and other indexes, and compare and evaluate the simulation results. In Sect. 4, we conclude the paper by summarizing our work and suggesting the expected effect of its use in the application domain.
2 Methodology

In this section, we analyze the selected clinical and social features with regard to mortality and infection rate. In Subsect. 2.1, we show the steps of computing the mortality considering multiple clinical features. In Subsect. 2.2, we set up the ABM based on real-world statistical data and determine the infection rate using social features.

2.1 Mortality and Clinical Features
The mortality rate varies according to the individual patient’s features. The clinical factors that determine the mortality class are mainly age, sex and comorbidities, which are all considered influential features regarding mortality [14,24,26]. Authoritative medical institutions such as the CDC and WHO have identified age as a salient factor in the mortality rate of COVID-19 patients [24,26], and Fig. 1 shows the trend of exponentially growing mortality rates as age increases in the given nations. The mortality per age group λα is computed as λα = τα / ηα, where ηα is the number of infections in the age group with index α, τα is the number of deaths, 1 ≤ α ≤ 9, α ∈ N, and N denotes the natural numbers.
Fig. 1. Mortality rates by age groups among nations in given period
Apart from age, sex is also regarded as a critical driver of COVID-19 mortality [14]. In the sex distribution of COVID-19 infections and deaths in the US population, the number of confirmed cases was ηF = 6,277,679 for females and ηM = 5,750,585 for males. Expressed as a ratio, these counts give ηF : ηM = 52.2 : 47.8, a gap of 4.4 percentage points, as a recent study shows [26]. Using the conditional probability P(τs | y) = λs, where λs denotes the mortality of a given sex s ∈ sex = {M, F}, τs is the number of deaths of that sex, and y ∈ {ηF, ηM}, we obtain λM = 1.985% and λF = 1.563%. This gives λM : λF = 0.559 : 0.441, i.e. 0.5 ± 0.059. We define these offsets as the constant Ps in Eq. (1). We denote the mortality of age group α for sex s as λ(α,s) and compute it from (Ps + 1) · λα. Since the values span a diverse range of scales across α, we normalize them into [0,1] using Min-Max normalization. Min-Max normalization is known to be susceptible to outliers; however, α is deterministic, so it fits our case. Thus, we compute λ(α,s) in Eq. (2), where ξ(α,s) = {λα · (1 + Ps) | 1 ≤ α ≤ 9, s ∈ sex}.
\[ P_s = \frac{\lambda_s}{\sum_{s \in sex} \lambda_s} - 0.5 \tag{1} \]

\[ \lambda_{(\alpha,s)} = \frac{\lambda_\alpha \cdot (1 + P_s) - \min(\xi_{(\alpha,s)})}{\max(\xi_{(\alpha,s)}) - \min(\xi_{(\alpha,s)})} \tag{2} \]
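Eqs. (1) and (2) can be traced with a few lines of Python. This is a minimal sketch rather than the authors' implementation: the sex-specific mortalities are the values reported in the text, while the age-group mortalities in `lambda_by_age` are hypothetical placeholders chosen only to exercise the normalization.

```python
def sex_offsets(lambda_by_sex):
    """Eq. (1): P_s = lambda_s / sum of lambda over both sexes - 0.5."""
    total = sum(lambda_by_sex.values())
    return {s: lam / total - 0.5 for s, lam in lambda_by_sex.items()}


def normalized_mortality(lambda_by_age, offsets):
    """Eq. (2): min-max normalize xi = lambda_alpha * (1 + P_s) over all (alpha, s)."""
    xi = {
        (alpha, s): lam * (1.0 + p)
        for alpha, lam in lambda_by_age.items()
        for s, p in offsets.items()
    }
    lo, hi = min(xi.values()), max(xi.values())
    return {key: (value - lo) / (hi - lo) for key, value in xi.items()}


# Sex-specific mortality (percent) as reported in the text.
offsets = sex_offsets({"M": 1.985, "F": 1.563})

# HYPOTHETICAL age-group mortalities for three of the nine groups.
lambda_by_age = {1: 0.0001, 5: 0.005, 9: 0.15}
scores = normalized_mortality(lambda_by_age, offsets)

for key in sorted(scores):
    print(key, round(scores[key], 4))
```

Because P_M and P_F are symmetric around zero, the lowest score always belongs to the youngest female group and the highest to the oldest male group, whatever the age-group values are.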
Moreover, we took the comorbidity features of the subject into account. According to the CDC, almost 90% of hospitalized COVID patients have underlying conditions [13], which implies that COVID-19 is more perilous for patients with underlying diseases (e.g. cardiovascular disease, diabetes and chronic lung disease) and suggests that comorbidity is a significant factor affecting mortality. Our dataset [3] is broadly classified into 22 types of diseases by age group (Table 1), which can be further subdivided into more specific diseases in the International Statistical Classification of Diseases and Related Health Problems (ICD) [24] provided by WHO.

Table 1. Type of comorbidity

| Index | Disease type | Index | Disease type |
|---|---|---|---|
| C1 | Influenza and pneumonia | C12 | Cerebrovascular diseases |
| C2 | Chronic lower respiratory diseases | C13 | Other circulatory diseases |
| C3 | Adult respiratory distress syndrome | C14 | Sepsis |
| C4 | Respiratory failure | C15 | Malignant neoplasms |
| C5 | Respiratory arrest | C16 | Diabetes |
| C6 | Other respiratory diseases | C17 | Obesity |
| C7 | Hypertensive diseases | C18 | Alzheimer disease |
| C8 | Ischemic heart disease | C19 | Vascular and unspecified dementia |
| C9 | Cardiac arrest | C20 | Renal failure |
| C10 | Cardiac arrhythmia | C21 | Injury poisoning other events |
| C11 | Heart failure | C22 | Other conditions and causes |
We examined the linear relation between the diseases in Table 1 and λα. Let D = {Ci | 1 ≤ i ≤ 22, i ∈ N} be the diseases in Table 1. The Pearson Correlation Coefficient (pcc(·)) gives pcc(D) ≥ 0.9 for most diseases, except for C17∼C19. This indicates that underlying diseases are mostly proportional to λα, and since the result differs across age ranges, the effect of each disease on λα varies. C17∼C19 showed idiosyncratic trends: C17 (Obesity) had its highest value in the age range 55–64, while C18 and C19 (Alzheimer disease and dementia) grow exponentially as age increases linearly. Other chronic diseases tend to increase exponentially until age group 55–64, and the gradient starts to decline after the age of 65. Table 2 lists the COVID-19 deaths related to underlying diseases by age group.

Table 2. Number of deaths with the underlying disease (C1∼C22) by age group

| Age group | C1 | C2 | C3 | C4 | C5 | ... | C22 |
|---|---|---|---|---|---|---|---|
| Age 0–24 | 175 | 36 | 86 | 153 | 10 | ... | 435 |
| Age 25–34 | 853 | 88 | 328 | 613 | 38 | ... | 1,175 |
| Age 35–44 | 2,174 | 203 | 895 | 1,582 | 88 | ... | 2,631 |
| Age 45–54 | 6,220 | 594 | 2,527 | 4,726 | 234 | ... | 6,997 |
| Age 55–64 | 15,295 | 2,334 | 5,582 | 11,840 | 580 | ... | 17,581 |
| Age 65–74 | 25,749 | 5,577 | 8,367 | 21,324 | 1,062 | ... | 31,390 |
| Age 75–84 | 30,258 | 7,551 | 7,667 | 25,811 | 1,427 | ... | 36,470 |
| Age 85+ | 30,239 | 6,749 | 5,659 | 24,738 | 1,837 | ... | 38,705 |

We computed the weight of the comorbidities from the proportion of deaths in each age group for each underlying disease, which we denote as μ(α,Ci) in Eq. (3). We set the number of deaths with a specific comorbidity across all age groups as τCi, where 1 ≤ i ≤ 22, i ∈ N. To consider these two parameters at the same time, we set the number of deaths with a specific underlying disease in a certain age group as τ(α,Ci), and using these variables we compute the mortality of the comorbidity by age group.

\[ \mu_{(\alpha,C_i)} = \frac{(\tau_{(\alpha,C_i)})^2}{\tau_\alpha \cdot \tau_{C_i}} \tag{3} \]
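Eq. (3) can be read as the product of two shares: τ(α,Ci)/τα (how common the disease is among deaths in the age group) and τ(α,Ci)/τCi (how much of the disease's death toll falls in that age group). The sketch below illustrates this with the C1 column of Table 2. The per-age-group death totals in `tau_alpha` are hypothetical, since the text does not list them.

```python
def comorbidity_weight(tau_alpha_ci, tau_alpha, tau_ci):
    """Eq. (3): mu = tau_(alpha,Ci)^2 / (tau_alpha * tau_Ci)."""
    return tau_alpha_ci ** 2 / (tau_alpha * tau_ci)


# C1 (influenza and pneumonia) column of Table 2, youngest to oldest age group.
c1_deaths = [175, 853, 2174, 6220, 15295, 25749, 30258, 30239]
tau_c1 = sum(c1_deaths)  # deaths with C1 across all age groups

# HYPOTHETICAL total COVID-19 deaths per age group (tau_alpha); not from the paper.
tau_alpha = [500, 2000, 5000, 14000, 36000, 62000, 75000, 80000]

weights = [
    comorbidity_weight(deaths, total, tau_c1)
    for deaths, total in zip(c1_deaths, tau_alpha)
]
for deaths, weight in zip(c1_deaths, weights):
    print(deaths, round(weight, 4))
```

As long as τ(α,Ci) does not exceed either total, each weight stays within [0, 1], which keeps it on a scale comparable to the normalized mortality of Eq. (2).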
As each subject could have more than one comorbidity, we set the maximum number of comorbidities Cmax to three and select the number of comorbidities Ri at random, since we could not access data on the average number of diseases per patient; Ri is a random variable with 0 ≤ Ri ≤ 3. With the selected Ri, we linearly aggregate the mortality λ(α,s) and the comorbidity weights μ(α,Ci), assuming that each disease increases the mortality linearly. As before, we scale the values into [0,1] using min-max normalization, and the total mortality φ is calculated in Eq. (4), where δ(α,s,i) = {λ(α,s) + Σ_{i=1}^{Rj} μ(α,Ci) | 1 ≤ α ≤ 9, s ∈ sex, 1 ≤ i ≤ 22, 1 ≤ j ≤ 3}.

\[ \phi_{(\alpha,s,C_i)} = \frac{\lambda_{(\alpha,s)} + \sum_{i=1}^{R_j} \mu_{(\alpha,C_i)} - \min(\delta_{(\alpha,s,i)})}{\max(\delta_{(\alpha,s,i)}) - \min(\delta_{(\alpha,s,i)})} \tag{4} \]
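The aggregation in Eq. (4) can be sketched as follows. The values of `lam` and `mu`, the population of five agents, and the uniform sampling of comorbidities are all illustrative assumptions; only the cap of three comorbidities and the min-max scaling come from the text.

```python
import random

C_MAX = 3  # maximum number of comorbidities per subject, as in the text


def raw_risk(lam, mus):
    """Numerator term of Eq. (4) before scaling: lambda_(alpha,s) plus the summed weights."""
    return lam + sum(mus)


def min_max(values):
    """Scale values into [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


rng = random.Random(0)

# HYPOTHETICAL inputs for one (age group, sex) combination.
lam = 0.4
mu = {"C1": 0.12, "C16": 0.05, "C17": 0.02}

raw = []
for _ in range(5):
    count = rng.randint(0, C_MAX)           # number of comorbidities, 0..3
    chosen = rng.sample(sorted(mu), count)  # which comorbidities the agent has
    raw.append(raw_risk(lam, [mu[c] for c in chosen]))

phi = min_max(raw)
print([round(p, 3) for p in phi])
```

In a full population the minimum and maximum would be taken over every agent, so an individual's φ depends on who else is in the graph.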
2.2 Infection Rate and Network Centrality
Infection rate is another influential factor to consider in order to quantitatively measure the risk of disease. EVI considers the infection rates of the subjects to maximize the effect of vaccinations, including two main factors: locational features and the influence of each subject within the group. For locational features, we acquired the US COVID-19 county data [16], which contains locational infections, deaths, population density, median age, age variance, race variance, SVI score [9] and GDP per capita. The general method to determine the accumulated infection rate is to divide the number of infected cases by the total population of the region. Infection rates differ between locations, and some counties differ considerably from others, giving a diverse statistical distribution range. Such dissimilarity is triggered by the various factors and relationships that affect the counties. The Shapiro-Wilk normality test for infection rate and mortality yields a p-value ≈ 0 for both, so H0, which assumes that the dataset is normally distributed, is rejected. The skewness of the county infection rate was 14.74, while that of the death rate was 1.82.

Furthermore, the infection rate is also influenced by how each subject is connected within society. In order to estimate this significance, we applied Social Network Analysis (SNA) and created the Agent Based Model (ABM) to measure the centrality of each subject. The ABM has a graph structure, where each vertex indicates an individual agent and each edge a connection link. The graph G = (V, E) consists of a finite set of vertices V = {vk | 1 ≤ k ≤ T, k ∈ N}, where T is the total number of nodes, and a finite set of edges E = {<vi, vj> | (vi, vj) ∈ V, 1 ≤ (i, j) ≤ T, (i, j) ∈ N}, where all edges in G are directed (e.g. <v1, v2> is the directed edge from v1 to v2). The number of edges n(E(i,j)) of a vertex ranges over 1 ≤ n(E(i,j)) ≤ ζ, where ζ is the maximum number of neighbor vertices. As the number of edges to other agents grows, more rapid infection is expected in the society. The graph is directed (Fig. 2), since infection has a direction: it moves from the patient to others.
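The accumulated infection rate and the skewness statistic quoted above can be computed as in the sketch below. The county figures are hypothetical, and the skewness uses the population (Fisher-Pearson) form, which is an assumption because the text does not say which estimator was used. The Shapiro-Wilk test is left out because it is not part of the Python standard library.

```python
def infection_rate(cases, population):
    """Accumulated infection rate: infected cases divided by regional population."""
    return cases / population


def skewness(xs):
    """Population (Fisher-Pearson) skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# HYPOTHETICAL (cases, population) pairs standing in for county records.
counties = [(120, 10_000), (300, 25_000), (90, 8_000), (4_000, 20_000)]
rates = [infection_rate(c, p) for c, p in counties]

print([round(r, 4) for r in rates])
print(round(skewness(rates), 3))  # positive: one county lies far above the rest
```

A strongly positive skewness, as reported for the county infection rates, means a few regions have rates far above the bulk of the distribution.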
Fig. 2. Example of graph network structure, (a) Circular Layout, (b) Random Layout
In a graph network, network centrality can quantitatively determine which nodes are significant and measure their influence within the graph [18]. We denote the node centrality as θ(node), and θ(vi) is computed to empirically determine the influence of the initial spreader [7]. We used five established centrality metrics: Degree centrality [19], Closeness centrality [18], Betweenness centrality [11], Eigenvector centrality [20] and PageRank [21]. A high θ(vi) implies that the node is important within the network; similarly, if a person with high centrality in society catches the disease, it will rapidly infect the community. We show this empirically through simulations in the following subsections.
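As an illustration of one of the centrality metrics named above, the following sketch computes PageRank on a small directed graph by power iteration. The damping factor of 0.85, the iteration count, and the toy graph are assumptions made for illustration; the paper does not state which parameters or library it used.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank.

    adj maps every node to the list of nodes it points to; each target
    must itself be a key of adj.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = adj[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new[w] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[v] / n
                for w in nodes:
                    new[w] += share
        rank = new
    return rank


# Toy directed contact graph: a, b and d all point to c, and c points back to a.
toy = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

Node c collects rank from three neighbors and therefore ends up with the highest score, while b and d, which nobody points to, keep only the baseline share.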
Selecting Optimal Centrality. By applying each centrality metric to the constructed reality-based ABM, which is explained later in this section, we identify which centrality is most relevant to spreading the disease. We measure the time until all vi in the network become confirmed patients, setting the 10 nodes with the highest θ(vi) as initial spreaders. We designed an infection spread simulation algorithm based on Depth-First Search and recursion. Spreading from infected nodes to connected non-infected nodes counts as one time step, and the infection chance is set to 100%. We applied the test to three different populations T (1,000, 5,000 and 10,000 nodes), conducting 50 simulations on each graph and randomly forming a new graph structure in each simulation. The vi were constructed based on the statistical data of each age group, and the edges <vi, vj> were selected according to each age group’s physical contact frequency per day [8]. To form a normal distribution, contact frequencies are randomly allocated within [1, 2 · RCF], where RCF indicates the Rounded Contact Frequency of each node (ζ = 2 · RCF). As the time unit, contagion speed was measured in steps, where moving from one node to another is one step. Considering the worst-case scenario, we selected PageRank PR, as it had the fewest time steps when spreading the disease to all vi ∈ G (Table 3).

Table 3. Average time-step consumption among centrality

| # nodes | Degree | Closeness | Betweenness | Eigenvector | PageRank |
|---|---|---|---|---|---|
| 1,000 nodes | 271.8 (±4.2) | 129.5 (±20.7) | 129.5 (±20.7) | 130.4 (±19.9) | 116.9 (±19.4) |
| 5,000 nodes | 331.4 (±3.3) | 145.5 (±22.1) | 145.5 (±22.1) | 144.4 (±31.1) | 139.4 (±14.3) |
| 10,000 nodes | 332.3 (±2.9) | 143.2 (±12.0) | 143.2 (±12.0) | 156.4 (±11.2) | 103.7 (±10.7) |
| Average | 311.8 (±3.5) | 139.4 (±18.3) | 139.4 (±18.3) | 143.8 (±20.7) | 123.1 (±14.8) |
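The spread experiment can be approximated with the sketch below. It is a simplification under stated assumptions, not the authors' algorithm: infection spreads synchronously from all infected nodes at once instead of through a recursive depth-first traversal, a single hypothetical contact frequency (`rcf=5`) replaces the age-specific values from [8], and out-degree stands in for the centrality score that picks the initial spreaders.

```python
import random


def random_contact_graph(n, rcf, rng):
    """Directed graph in which each node gets 1..2*rcf out-edges to distinct other nodes."""
    adj = {}
    for v in range(n):
        k = min(rng.randint(1, 2 * rcf), n - 1)
        picks = rng.sample(range(n - 1), k)
        # Shift indices >= v by one so that a node never points to itself.
        adj[v] = [u if u < v else u + 1 for u in picks]
    return adj


def steps_to_full_spread(adj, seeds):
    """Synchronous spread with 100% infection chance.

    Returns (time steps used, number of infected nodes).
    """
    infected = set(seeds)
    frontier = set(seeds)
    steps = 0
    while frontier:
        nxt = {w for v in frontier for w in adj[v] if w not in infected}
        if nxt:
            steps += 1
        infected |= nxt
        frontier = nxt
    return steps, len(infected)


rng = random.Random(42)
graph = random_contact_graph(n=1000, rcf=5, rng=rng)

# Stand-in centrality: out-degree. Any of the five scores could be used instead.
seeds = sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:10]
steps, reached = steps_to_full_spread(graph, seeds)
print(steps, reached)
```

Repeating the run with seeds chosen by each centrality metric, and averaging the step counts over freshly generated graphs, gives a comparison in the spirit of Table 3, although the absolute numbers will differ because the stepping rule here is not the one used in the paper.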
In order to conduct the simulations that show the disease propagation in the ABM, we designed a new graph G′ = (V′, E′) with 1 ≤ i, j, k ≤ 300,000. Each vk ∈ V′ carries its own features, which were assigned based on the statistical proportions of the collected real-world dataset, with 0 ≤ Rk ≤ Cmax. In the graph, main clusters C̃x were set, where 1 ≤ x ≤ 5, C̃x = {c̃z | 1 ≤ z ≤ 6} and (x, z) ∈ N; c̃z denotes a small cluster, c̃z ⊂ C̃x, with c̃z = ∪_{k=1}^{T} Vk, and each E(i,j) ∈ E′ is a directed edge. The main reason for dividing the vk into clusters is to disperse the nodes so that they have regional features, as in the real world. Initially, regions that contain a super-spreader start to proliferate the disease within their own region, and the disease then begins to spread to other locations. In order to construct an ABM that is analogous to the real world, we designate the C̃x as samples of the states and the c̃z as samples of the counties in the US. Scaling the total number of states, counties and population down to about 10% preserves the statistical resemblance. For example, the average number of counties in each state is 62, and 6 is approximately 10% of that, so our graph has max z = 6 for each of the max x = 5 main clusters. Figure 3 shows the visualization of the constructed graph networks.
Fig. 3. Graph network structure for disease spread and vaccine distribution simulation
Distributing Vaccines by Order of PageRank. The objective of this simulation is to compare the infection results when vaccines are distributed from the highest to the lowest PageRank θpr(G′) and in the reverse order, to empirically validate which sequence terminates the spread faster. In G′, the infection chance is set to 100%, with 20 initial spreaders. Each simulation was run 100 times; distribution from highest to lowest took 456.607 steps on average, and from lowest to highest 485.642 steps. The distribution of the results for each condition is visualized in Fig. 4. This simulation implies that vaccinating in descending order of θpr(G′) leads to faster containment of the disease. Therefore, the metric representing the infection rate was based on the higher rank of θpr(vk). In Eq. (5), the final EVI is computed after applying standard normalization to the mortality and the infection rate separately, bringing each onto a standardized scale. Finally, we add those two factors and scale them into [0,1] with min-max normalization to obtain a probability-like format for plain implementation.
Fig. 4. PageRank Centrality vaccination distribution time result when distributed by the order of high and low value (a) Density plot, (b) Box plot
\[ EVI = MinMax\left( \frac{\phi_{(\alpha,s,C_k)} - mean(\delta_{(G')})}{\sigma(\delta_{(G')})} + \frac{\theta_{pr}(v_k) - mean(\theta_{pr}(G'))}{\sigma(\theta_{pr}(G'))} \right) \tag{5} \]
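Eq. (5) amounts to z-scoring the two components, adding them, and min-max scaling the sum. The sketch below follows that recipe under two assumptions: the population standard deviation is used (the text does not say whether the population or sample form applies), and the three-agent input is purely illustrative.

```python
from statistics import mean, pstdev


def z_scores(xs):
    """Standard normalization: subtract the mean and divide by the standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def min_max(xs):
    """Scale values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def evi(mortality, pagerank):
    """Eq. (5): MinMax(z(mortality) + z(pagerank)), one value per agent."""
    combined = [a + b for a, b in zip(z_scores(mortality), z_scores(pagerank))]
    return min_max(combined)


# Illustrative three-agent population: total mortality phi and PageRank per agent.
mortality = [0.183, 0.241, 0.698]
pagerank = [0.241, 0.387, 0.442]
scores = evi(mortality, pagerank)
print([round(s, 3) for s in scores])
```

Because both normalizations depend on the whole population, scores computed on a three-agent toy set will not match values computed over the 300,000 agents used in the paper.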
3 Experiment
In this section, we demonstrate the vaccine distribution simulation with the nine vaccine distribution scenarios in Table 4 and compare the effectiveness of EVI with other indicators. Based on G′, θpr(G′), φ(G′) and EVI are computed and allocated to the corresponding features of each vk ∈ G′. An example is shown in Table 5, which presents the overall dataset used to construct the ABM for the final simulation, where Mor denotes mortality and Iftr the infection rate.

Table 4. Simulation types

| Type | Vaccine distribution type |
|---|---|
| Type 1 | No vaccination |
| Type 2 | Random vaccination |
| Type 3 | Vaccination by age |
| Type 4 | Vaccination by comorbidity risk |
| Type 5 | Vaccination by age, comorbidity risk |
| Type 6 | Vaccination by SVI |
| Type 7 | Vaccination by CVI |
| Type 8 | Vaccination by PVI |
| Type 9 | Vaccination by EVI |

Table 5. Calculated feature dataset for 300,000 nodes

| Index | C̃ | c̃ | Age | Sex | R1 | R2 | R3 | ... | Mor | θpr | Iftr | EVI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C̃1 | c̃1 | 0 | M | C1 | None | None | ... | 0.183 | 0.241 | 0.482 | 0.258 |
| 2 | C̃1 | c̃1 | 1 | F | None | C14 | C6 | ... | 0.241 | 0.387 | 0.358 | 0.372 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 300,000 | C̃5 | c̃30 | 85 | M | C22 | C3 | C19 | ... | 0.698 | 0.442 | 0.291 | 0.604 |
In the vaccine distribution simulation, the performance evaluation metrics were the variations in the numbers of vaccinations, deaths, infections, cured and non-infected agents, and in the mortality and infection rates. In the results, we only visualize infection cases, death cases and death rates in Figs. 6, 7 and 8, since all other metrics can be computed from those three (i.e. death cases + cured cases = infection cases). Each simulation was conducted 100 times, and the allocation of the nodes’ edges and features was randomly updated for each simulation. The number of initial patients was set to 20, with 500 vaccines distributed per time unit. Mortality, infection rate and EVI score were calculated from the corresponding features of each of the 300,000 nodes for the 100 simulations. Additionally, the average of the original death rate was 0.0145, which was too low to compute the number of deaths among the population, so we increased it five-fold to 0.0725. For comparison, SVI [9], CVI [6] and PVI [12] were also computed for the same subjects. Comparing the nine distribution scenarios on the evaluation metrics confirmed that the proposed EVI had the best performance, as shown in Figs. 6, 7 and 8. The y-axis is expressed in decimals relative to the no-vaccination scenario (Type 1), which has the value 1 (e.g. 0.435 = 43.5%, Type 1 = 100%). Figure 5 visualizes how the evaluation metrics vary over the time steps as the disease spreads. Overall, EVI yields on average 5.0% fewer infection cases, 9.4% fewer death cases and a 3.5% lower death rate than the other distribution methods.
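The interplay between a priority score and a fixed vaccine budget per time unit can be sketched as below. This is a deliberately reduced model built on assumptions: vaccination is fully protective, infection spreads synchronously with 100% chance, deaths and recoveries are not modeled, and the four-node chain and its priority scores are toy inputs.

```python
def simulate(adj, seeds, priority, doses_per_step):
    """Alternate vaccination and spread until the infection stops.

    adj: node -> list of nodes it can infect (directed edges).
    priority: node -> score; higher scores are vaccinated first.
    Returns (number infected, number vaccinated).
    """
    infected = set(seeds)
    vaccinated = set()
    frontier = set(seeds)
    queue = sorted(adj, key=lambda v: priority[v], reverse=True)
    pos = 0
    while frontier:
        # 1) Vaccinate the next highest-priority agents that are not infected.
        given = 0
        while pos < len(queue) and given < doses_per_step:
            v = queue[pos]
            pos += 1
            if v not in infected:
                vaccinated.add(v)
                given += 1
        # 2) Infected agents from the last step infect unprotected neighbors.
        nxt = {
            w
            for v in frontier
            for w in adj[v]
            if w not in infected and w not in vaccinated
        }
        infected |= nxt
        frontier = nxt
    return len(infected), len(vaccinated)


# Toy chain 0 -> 1 -> 2 -> 3 with patient zero at node 0 and one dose per step.
chain = {0: [1], 1: [2], 2: [3], 3: []}
near_first = simulate(chain, [0], {0: 0, 1: 3, 2: 2, 3: 1}, doses_per_step=1)
far_first = simulate(chain, [0], {0: 0, 1: 1, 2: 2, 3: 3}, doses_per_step=1)
print(near_first, far_first)
```

Prioritizing the agent next to the outbreak stops the chain after one infection, while prioritizing the most distant agent lets the disease advance one more step. Swapping in different priority scores (age, comorbidity risk, SVI, EVI and so on) is what distinguishes the scenarios of Table 4.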
Fig. 5. (a) Visualizations of the cumulative amount of each indicator over time, (b) Visualizations of the variations in each indicator over time
Fig. 6. Number of infection comparison based on 8 simulations (a) Barplot, (b) Boxplot
Fig. 7. Number of death comparison based on 8 simulations (a) Barplot, (b) Boxplot
Fig. 8. Mortality comparison based on 8 simulations (a) Barplot, (b) Boxplot
4 Conclusion
In this paper, we proposed an Epidemic Vulnerability Index that can maximize the vaccination effect by utilizing mortality and infection rate. EVI intuitively shows the risk of COVID-19 and quantifies the relative danger based on the statistics of past COVID-19 patients. Past studies mainly focused on clinical factors when quantifying the risk, whereas EVI also includes social features, such as the significance of the subject within affiliated communities, and applies them to vaccine distribution. To validate EVI, vaccine distribution simulations were performed on 300,000 graph-based nodes that represent reality, created from past COVID patients’ statistical feature data. Conducting nine distribution scenarios, we compared the results in terms of the variations in infection cases, death cases and death rates. The outcomes showed that vaccine distribution based on EVI performed better, with 5.0% fewer infection cases, 9.4% fewer death cases and a 3.5% lower death rate than other vaccine distribution methods. This study could be adapted to pandemic response strategies for other possible epidemics. Moreover, it could be applied to predict regional vaccine demand and to support further decision making, such as preparing medical supplies, by adapting future vaccine distribution strategies to the right time and place.
References

1. Amram, O., Amiri, S., Lutz, R.B., Rajan, B., Monsivais, P.: Development of a vulnerability index for diagnosis with the novel coronavirus, covid-19, in Washington state, USA. Health Place 64, 102377 (2020)
2. Bubar, K.M., et al.: Model-informed covid-19 vaccine prioritization strategies by age and serostatus. Science 371(6532), 916–921 (2021)
3. CDC: Conditions contributing to covid-19 deaths, by state and age, provisional 2020–2021 (2021). https://data.cdc.gov/widgets/hk9y-quqm
4. Corey, L., Mascola, J.R., Fauci, A.S., Collins, F.S.: A strategic approach to covid-19 vaccine r&d. Science 368(6494), 948–950 (2020)
5. Coustasse, A., Kimble, C., Maxik, K.: Covid-19 and vaccine hesitancy: a challenge the united states must overcome. J. Ambul. Care Manag. 44(1), 71–75 (2021)
6. DeCaprio, D., et al.: Building a covid-19 vulnerability index (2020). arXiv preprint arXiv:2003.07347
7. Dekker, A.: Network centrality and super-spreaders in infectious disease epidemiology. In: 20th International Congress on Modelling and Simulation (MODSIM2013) (2013)
8. Del Valle, S.Y., Hyman, J.M., Hethcote, H.W., Eubank, S.G.: Mixing patterns between age groups in social networks. Social Netw. 29(4), 539–554 (2007)
9. Flanagan, B.E., Gregory, E.W., Hallisey, E.J., Heitgerd, J.L., Lewis, B.: A social vulnerability index for disaster management. J. Homeland Secur. Emerg. Manag. 8(1) (2011)
10. Hughes, M.M., et al.: County-level covid-19 vaccination coverage and social vulnerability–united states, december 14, 2020-march 1, 2021. Morb. Mort. Weekly Rep. 70(12), 431 (2021)
11. Leydesdorff, L.: Betweenness centrality as an indicator of the interdisciplinarity of scientific journals. J. Am. Soc. Inf. Sci. Technol. 58(9), 1303–1319 (2007)
12. Marvel, S.W., et al.: The covid-19 pandemic vulnerability index (pvi) dashboard: monitoring county-level vulnerability using visualization, statistical modeling, and machine learning. Environ. Health Perspect. 129(1), 017707 (2021)
13. Medscape, R.F.: Almost 90 % of covid-19 admissions involve comorbidities (2020). https://www.medscape.com/viewarticle/928531
14. Mukherjee, S., Pahan, K.: Is covid-19 gender-sensitive? J. Neuroimmune Pharmacol. 16, 1–10 (2021)
15. Mwalili, S., Kimathi, M., Ojiambo, V., Gathungu, D., Mbogo, R.: Seir model for covid-19 dynamics incorporating the environment and social distancing. BMC Res. Notes 13(1), 1–5 (2020)
16. NYTimes: Coronavirus (covid-19) data in the united states (2021). https://github.com/nytimes/covid-19-data
17. Rastegar, M., Tavana, M., Meraj, A., Mina, H.: An inventory-location optimization model for equitable influenza vaccine distribution in developing countries during the covid-19 pandemic. Vaccine 39(3), 495–504 (2021)
18. Sabidussi, G.: The centrality index of a graph. Psychometrika 31(4), 581–603 (1966)
19. Sharma, D., Surolia, A.: Degree centrality. In: Encyclopedia of Systems Biology, Dubitzky W 558 (2013)
20. Straffin, P.D.: Linear algebra in geography: eigenvectors of networks. Math. Mag. 53(5), 269–276 (1980)
21. Sullivan, D.: What is google pagerank? a guide for searchers & webmasters. Search engine land (2007)
22. Tuite, A.R., Zhu, L., Fisman, D.N., Salomon, J.A.: Alternative dose allocation strategies to increase benefits from constrained covid-19 vaccine supply. Ann. Internal Med. 174(4), 570–572 (2021)
23. Wang, J., Peng, Y., Xu, H., Cui, Z., Williams, R.O.: The covid-19 vaccine race: challenges and opportunities in vaccine formulation. AAPS PharmSciTech 21(6), 1–12 (2020)
24. WHO: International statistical classification of diseases and related health problems (icd) (2021). https://www.who.int/standards/classifications/classification-of-diseases
25. Wolfram, C.: An agent-based model of covid-19. Comp. Syst. 29(1), 87–105 (2020)
26. Yanez, N.D., Weiss, N.S., Romand, J.A., Treggiari, M.M.: Covid-19 mortality risk for older men and women. BMC Public Health 20(1), 1–7 (2020)
Exploiting Multi-granular Features for the Enhanced Predictive Modeling of COPD Based on Chinese EMRs
Qing Zhao1, Renyan Feng2, Jianqiang Li1(B), and Yanhe Jia3
1 Faculty of Information Technology, Beijing University of Technology, Beijing, China
[email protected]
2 Department of Computer Science, Guizhou University, Guizhou, China
3 School of Economics and Management, Beijing Information Science and Technology University, Beijing, China
Abstract. Building a predictive model to understand a patient’s status and the future progression of disease is an important precondition for facilitating preventive care to reduce the burden of chronic obstructive pulmonary disease (COPD). Recently, unsupervised feature learning has become the dominant approach to predictive modeling based on electronic medical records (EMRs), which usually adopts word and concept embeddings as distributed representations of clinical data. Observing that (1) semantic discrimination is limited by the inherently small feature granularity of words and concepts and (2) clinical decision-making is conducted based on a set of attribute-value pairs, e.g., a clinical laboratory test (the attribute) with a numerical or categorical value, this paper proposes a novel predictive modeling approach based on EMRs. In this approach, multi-granular features, i.e., words, concepts, concept relations and attribute-value pairs, are extracted through three subtasks (concept recognition, relation extraction and attribute-value pair extraction) and then combined to derive representations and predictive models of COPD. The approach is generic and can be applied to other diseases, although it is currently limited to Chinese EMRs. In this paper, we focus on COPD risk prediction and conduct extensive experiments on real-world datasets for a comparison study. The results show that our approach outperforms the baselines, which demonstrates the effectiveness of the proposed model.
Keywords: Disease risk prediction · Multi-granular features · Electronic medical records · Deep neural network · Semantic information analysis
1 Introduction
Chronic obstructive pulmonary disease (COPD) is a major cause of morbidity and mortality throughout the world, but it is a preventable and treatable disease [1, 2], and the best strategy for COPD diagnosis, management and prevention is to decrease the probability of exacerbation onset and improve the quality of medical care by early intervention [3].
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 35–45, 2021. https://doi.org/10.1007/978-3-030-91415-8_4
Therefore, building a predictive model to discover the mechanisms of disease development is important for early prevention and intervention. EMRs contain many valuable clinical data sources, including medication, laboratory, imaging and narrative data as well as patient treatment records. These sources provide the potential for constructing a high-quality COPD predictive model. Until now, there have been two main ways to build predictive models from EMRs: hypothesis-driven and data-driven approaches. The former starts from hypotheses proposed by clinical experts; the semantic information contained in EMRs and deductive reasoning are then used to verify these hypotheses [4], and a predictive model is derived from the set of validated propositions. In general, hypothesis-driven approaches cannot make full use of the valuable information contained in EMRs [5]. The data-driven approach uses a labeled EMR dataset to train a machine learning model for disease prediction. The success of predictive models based on traditional machine learning relies heavily on careful, hand-engineered feature selection. Recently, deep learning models have made significant achievements in natural language processing tasks [6–8] and promise to automatically learn a compact and general feature representation from unlabeled data; they have become a dominant unsupervised feature learning approach for building predictive models from EMRs [9, 10]. Deep learning-based approaches to predictive modeling from EMRs usually adopt word and concept embeddings to represent EMRs. However, considering the feature representation only at the word or concept level (where the feature granularity is small) cannot provide enough information for correct clinical decision-making.
For example, for the EMR sentence “ ” (three years ago, the patient had chest tightness and wheezing after physical exercise and was diagnosed with COPD in our hospital), it is impossible to distinguish whether COPD is part of the medical history or a current diagnosis without taking the attribute-value pair (COPD, three years ago) into account. Based on the above considerations, we propose a novel approach named the multi-granular feature-based COPD risk prediction model (MFRP), which extracts and combines words, concepts, concept relations and attribute-value pairs to build a COPD predictive model. Its implementation pipeline includes two stages: (1) multi-granular semantic information extraction and (2) prediction model building. The contributions of this study can be summarized as follows: (1) for the first time, a novel framework is proposed for COPD prediction model building that addresses the identification and representation of multi-granular features from EMRs to learn a high-quality prediction model; (2) to maximize the utilization of the structured information within a sentence, we propose a parallel adaptive convolutional neural network (PACNN) to extract concept relations and attribute-value pairs; (3) an empirical experiment is conducted on real-world datasets, and the results show that MFRP achieves state-of-the-art performance compared with baseline approaches.
2 Methodology
This section presents the overall framework of the proposed MFRP model and describes its subtasks, as shown in Fig. 1. The overall architecture of our model consists of two stages:
1. Multi-granular semantic information extraction, which captures concepts, concept relations and attribute-value pairs from EMRs through three subtasks: (1) concept recognition, (2) relation extraction and (3) attribute-value pair extraction. Moreover, negative words are also considered as supplementary information for our task.
2. Prediction model building: after multi-granular feature extraction, the features are integrated and fed into the parallel adaptive convolutional neural network for COPD risk prediction.
Fig. 1. The overall framework of the multi-granular feature-based COPD prediction model
The data in EMRs include structured data and unstructured data. Structured data contain laboratory tests and basic patient information (e.g., name, gender, age and habits), and unstructured data contain the doctor’s records, the diagnosis and the patient’s description of his/her illness. To standardize the different data structures into uniform structured data, we use three subtasks to extract the valuable information expressed in EMRs and then combine the extraction results with the structured information to train the COPD risk prediction model. The subtasks are as follows:
• Concept recognition: recognizing and classifying the medical concepts in the documents into a set of fixed types.
• Concept relation extraction: extracting the relation between the two concepts of a concept pair, represented as a triple (head concept, relation, tail concept).
• Attribute-value pair extraction: an attribute-value pair includes two elements: the attribute and its corresponding real value. In this paper, attribute-value pairs are divided into two categories: disease-timestamp and test-result. Test-result values are divided into two types: numerical values and categorical values.
Although the task of COPD risk prediction consists of three subtasks, in this paper we mainly focus on relation extraction and attribute-value pair extraction (subtasks 2 and 3) and on integrating multi-granular features for COPD risk prediction model construction. For subtasks 2 and 3, we propose new methods; for subtask 1 (concept recognition), we directly leverage existing methods to identify concepts and concept types.
2.1 Concept Recognition
Medical terminology includes many important clues for downstream tasks, such as concept relation extraction and attribute-value pair extraction. Following [11], concept recognition can be divided into three major stages. (1) Word embedding representations are learned by the word2vec [12] model, which is an efficient way to encode word tokens into low-dimensional vectors. (2) Existing ontologies, which involve many canonical concepts, can effectively guide the learning process of concept recognition. (3) The sequence model BLSTM-CRF is trained to recognize concepts and their corresponding types. After obtaining the concepts and concept types, we calculate a concept vector by combining each concept embedding with its corresponding type embedding, as shown in the following equation:
$$e_i = g_i \oplus g_{type} \tag{1}$$
where $g_i$ is the concept embedding, $g_{type}$ is the concept type embedding, and $e_i$ is the updated concept embedding. Note that words (non-concept words) and concepts are both regarded as word tokens. We define 5 concept types: disease (e.g., hypertension), complaint symptoms (e.g., cough), test symptoms (e.g., double emphysema), tests (including physical examinations and laboratory tests), and treatment (including medicine and nebulization).
2.2 Concept Relation Extraction
Based on the concept recognition in Sect. 2.1, we can extract concepts and concept types directly. In addition, position embeddings, POS embeddings and negative words are also considered important features for concept relation extraction, because relation words between concept pairs are usually expressed by verbs, and negative words may change the polarity of relations (e.g., hypoxemia does not improve after using budesonide). We randomly initialize the matrix of position
embeddings and transform the relative distances into two vectors. A word vector is composed as $w_i = w_i^{d} \oplus w_i^{POS} \oplus 2w_i^{p} \oplus n_i$, where $w_i^{d}$ is the word embedding of $w_i$, $w_i^{POS}$ is the POS tag embedding and $w_i^{p}$ is the position embedding. After the parallel adaptive max-pooling layer (described in Sect. 2.4) generates the relations between concept pairs, the concept pairs and relations compose triples $R_i = (e_1, r_i, e_2) \in R$.
Table 1. Examples of different categories of sentences for concept relation extraction
| Category | Sentence (English gloss) | Triples |
|---|---|---|
| Normal | S1: The patient has had chronic constipation for 6 years and often needs to take duphalac. | {(chronic constipation), (take), (duphalac)} |
| ConceptCoordinate | S2: Chest radiography shows a right pleural lesion and calcification of the aortic valve. | {(chest radiography), (shows), (right pleural lesion)}, {(chest radiography), (shows), (calcification of aortic)} |
| ConceptOverlap | S3: The patient had a cough after catching a cold 3 years ago and improved after taking antibiotics. | {(cold), (had), (cough)}, {(cough), (improved), (antibiotics)}, {(cold), (improved), (antibiotics)} |
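The concatenations used in Eq. (1) and in the word representation of Sect. 2.2 can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are random stand-ins for learned ones, the POS-tag dimension (`TAG_DIM`), the distance clipping (`MAX_DIST`), the one-dimensional negation flag and the reading of the "two vectors" as distances to the head and tail concepts are all assumptions made only for illustration. The word, type and position sizes follow Sect. 3.2.

```python
import random

random.seed(0)  # fixed seed so the toy vectors are reproducible

WORD_DIM = 100   # word token embedding size (Sect. 3.2)
TYPE_DIM = 50    # concept type embedding size (Sect. 3.2)
POS_DIM = 5      # position embedding size (Sect. 3.2)
TAG_DIM = 10     # POS-tag embedding size: assumed, not given in the paper
MAX_DIST = 30    # largest relative distance kept: assumed


def rand_vec(dim):
    """Return a random vector standing in for a learned embedding."""
    return [random.uniform(-0.1, 0.1) for _ in range(dim)]


# Randomly initialised position-embedding table, one row per clipped distance.
position_table = {d: rand_vec(POS_DIM) for d in range(-MAX_DIST, MAX_DIST + 1)}


def relative_distance(token_idx, concept_idx):
    """Signed token distance to a concept, clipped to [-MAX_DIST, MAX_DIST]."""
    return max(-MAX_DIST, min(MAX_DIST, token_idx - concept_idx))


def concept_vector(g_i, g_type):
    """Eq. (1): e_i = g_i (+) g_type, implemented as list concatenation."""
    return g_i + g_type


def token_vector(w_d, w_tag, token_idx, head_idx, tail_idx, is_negation):
    """Word, POS-tag, two position embeddings and a negation flag, concatenated."""
    p_head = position_table[relative_distance(token_idx, head_idx)]
    p_tail = position_table[relative_distance(token_idx, tail_idx)]
    n_i = [1.0 if is_negation else 0.0]  # 1-d negation flag: assumed encoding
    return w_d + w_tag + p_head + p_tail + n_i


e = concept_vector(rand_vec(WORD_DIM), rand_vec(TYPE_DIM))
w = token_vector(rand_vec(WORD_DIM), rand_vec(TAG_DIM), token_idx=4,
                 head_idx=1, tail_idx=7, is_negation=False)
print(len(e), len(w))
```

With these assumed sizes the concept vector has 100 + 50 dimensions and the token vector has 100 + 10 + 2 × 5 + 1 dimensions; other choices for the unspecified sizes would change only the lengths.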
Concept relations in an EMR sentence are usually complicated. According to the presence of coordinate relations and overlapping triples, we divide sentences into three categories: Normal, ConceptCoordinate and ConceptOverlap. If none of the triples have coordinate relations or overlapping concepts (concept A only has a relation with concept B), we regard the sentence as Normal. If some triples have overlapping relations and concepts but do not have overlapping concept pairs, we regard the sentence as ConceptCoordinate (a concept has relations with more than one concept). If some triples have overlapping concepts but do not have overlapping concept pairs, we regard the sentence as ConceptOverlap. Examples are shown in Table 1, where red and blue words denote head and tail concepts and yellow words denote relations. Note that because relation types carry little meaning for disease prediction, we extract relations as features without considering relation types when training the COPD risk prediction model.
2.3 Attribute-Value Pair Extraction
In practical applications, EMRs involve many valuable laboratory and physical examination results and medical histories. An attribute-value pair can be divided into two types: disease-timestamp and test-result. The values in a disease-timestamp pair only include numerical values, and those in a test-result pair include two types: numerical values and categorical values. Each attribute-value pair contains two elements, an
attribute and its corresponding value. In contrast to concept relations, wherein the tail concept rarely changes, the values in an attribute-value pair vary from patient to patient (e.g., the blood pressure of each patient). For numerical values, each quantity can be expressed in different units (e.g., 10 years or 122/70 mmHg). To generate a structure, we first extract a real value and the corresponding unit symbol from an EMR, including ratio values (e.g., 47.6%) and numerical values (e.g., 5 years); given a value $V_i$ and its corresponding unit symbol $U_i$, the updated real value is expressed as $v_i = V_i \oplus u_i$, where $u_i$ is the unit embedding. A categorical value is regarded as a word that has no unit symbol; however, depending on the expression style of the doctor, negative words often occur that change the polarity of a categorical value (e.g., “ ” (The patient’s echocardiography showed no abnormal condition) and “ ” (The patient’s echocardiography was normal)). If a categorical value has a negative word, we first concatenate the negative word and the categorical value; then, the cosine distance is used to calculate the similarity between this categorical value and others (in the experiments, the similarity threshold was varied from 0.8 to 0.95 and finally set to 0.9, which gave a small error). According to the diagnosis guidelines of COPD [13] and domain expert knowledge, we set quantization thresholds for each numerical test value and divide the numerical value of the test result into four levels: low, normal, high and critically high. Given a set of attributes $G = \{G_1, G_2, \ldots, G_n\}$, the corresponding numerical value set $v = \{v_1, v_2, \ldots, v_n\}$ and indicator levels $L = \{L_1, L_2, L_3\}$, a test result with a numerical value is described as $Z = (e_i, v_i, l_i)$, where $e_i$ is an attribute embedding and $l_i$ is an indicator value embedding, e.g.
(FEV1, 46%, high), and the disease timestamp is denoted as $P = (e_i, v_i)$, e.g. (arrhythmia, 2 months). If the attribute of test $G_m$ has a corresponding categorical value $Z_m$ in a sentence, the test result of the categorical value type is expressed as $Q = (e_m, z_m)$, where $e_m$ is an attribute embedding and $z_m$ is a categorical value embedding, e.g. (the bronchodilation test, negative); $Z, P, Q \in U$. The attribute-value features are learned by a parallel adaptive max-pooling layer (the details are described in Sect. 2.4).
2.4 Construction of a COPD Risk Prediction Model
Convolution. The aim of the convolution layer is to capture the important semantic information from an entire EMR text and transform this information into feature maps. Formally, let $s_{1:n}$ denote the $n$ sentences in the EMR text $T$ and $x_{1:m}$ denote the $m$ words in the sentence $s_i \in s_{1:n}$, where each word $x_i \in x_{1:m}$ has dimension $d$, i.e., $x_i \in \mathbb{R}^{d}$. The convolution operation contains a filter $w \in \mathbb{R}^{h \times d}$, where $h$ is the length of the filter. At position $j$, it is expressed as follows, where zero padding is adopted to maintain a consistent volume of input structures:
$$c_j = f(w \cdot x_{j:j+h-1} + b) \tag{2}$$
where $b \in \mathbb{R}^{d}$ is a bias vector, $f(\cdot)$ is a nonlinear function (we use ReLU in this paper), and $x_{j:j+h-1}$ denotes the concatenation of the word embeddings from position $j$ to $j+h-1$. A filter is used for each window size $h$ in the sentence $s_i$ to produce the feature map $c_j \in \mathbb{R}^{n-h+1}$:
$$c_{ij} = f(w_i \cdot x_{j:j+h-1} + b) \tag{3}$$
where the range of $i$ is $1 \le i \le m$. Through the convolution layer, the resulting vectors are obtained as $C = [c_1, c_2, \ldots, c_m]^{T} \in \mathbb{R}^{m \times (n-h+1)}$.
Parallel Adaptive Max-Pooling. The max-pooling operation is used to choose the most significant feature (the maximum value) of each filter. However, a single max-pooling usually loses much of the information in a sentence. In this paper, according to the different subtasks, we utilize two different adaptive max-pooling methods in parallel to capture the complete structured information for relation extraction and attribute-value pair extraction. For the concept relation subtask, we divide each feature map into three segments according to the concept pair, and the max-pooling operation is performed on each segment. The feature map output $c_i$ is divided into three parts $c_{i1}, c_{i2}, c_{i3}$ by the two concepts. Adaptive max-pooling is expressed by the following equation:
$$p_{ij} = \max(c_{ij}) \tag{4}$$
After obtaining the $p_{ij}$ for each feature map, we concatenate all vectors $p_i = [p_{i1}, p_{i2}, p_{i3}]$ $(1 \le i \le m,\ j \in [1, 3])$ to form the vector $p \in \mathbb{R}^{3m}$. According to the relevance between each word and the concepts in a pair, we can find the relation word; thereby, the concept relation features are captured. Attribute-value pair extraction is less complicated than concept relation extraction. The feature map is split into two parts according to the value, $p_i = [p_{i1}, p_{i2}]$ $(1 \le i \le m,\ j \in [1, 2])$, $p \in \mathbb{R}^{2m}$. Then, the nearest disease or test concept is extracted as the attribute of the value. Finally, we obtain the attribute-value pair features.
Softmax. Concatenating the basic patient information embeddings, concept features, concept relation features and attribute-value pair features, we obtain the final single vector $Q \in \mathbb{R}^{3m+2m+d}$, where $m$ is the number of feature maps and $d$ represents both the basic patient information and concept dimensions. Then, we feed it into the softmax layer to obtain the label of the risk level, as shown in the following equations:
$$Q = P \oplus C \oplus R \oplus U \tag{5}$$
$$O = W_s Q + b_s \tag{6}$$
where $P$ is the set of basic patient information embeddings, $C$ is the set of concept embeddings, $R$ is the set of concept relation features and $U$ is the set of attribute-value pair features; $W_s \in \mathbb{R}^{n_i \times (3m+2m+d)}$ is the weight matrix, $b_s \in \mathbb{R}^{n_i}$ is the bias vector, and $O \in \mathbb{R}^{n_i}$ is the final output of the PACNN. The predicted risk label takes values in $[1, n]$, where $n$ denotes the number of COPD risk levels ($n = 5$; see Sect. 3.1).
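The convolution, segment-wise max-pooling and softmax steps of Sect. 2.4 can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' PACNN: a single filter with a scalar bias is used, no zero padding is applied (so the feature map has n − h + 1 entries), the cut positions, the tiny 2-dimensional "embeddings" and the weights are arbitrary values chosen for the example, and the patient-information and concept embeddings that also enter Q in the paper are left out.

```python
import math


def relu(x):
    """ReLU nonlinearity used as f(.) in the convolution."""
    return x if x > 0.0 else 0.0


def conv_feature_map(sentence, filt, bias):
    """One filter of height h slid over the token vectors, followed by ReLU."""
    h = len(filt)
    feature_map = []
    for j in range(len(sentence) - h + 1):
        total = bias
        for k in range(h):
            total += sum(a * b for a, b in zip(filt[k], sentence[j + k]))
        feature_map.append(relu(total))
    return feature_map


def piecewise_max_pool(feature_map, cut_points):
    """Split the map after each cut point and keep the maximum of every part."""
    bounds = [0] + [c + 1 for c in sorted(cut_points)] + [len(feature_map)]
    pooled = []
    for start, end in zip(bounds, bounds[1:]):
        segment = feature_map[start:end]
        pooled.append(max(segment) if segment else 0.0)  # empty segment -> 0
    return pooled


def linear(weights, bias, q):
    """Affine output layer: one score per risk level."""
    return [sum(w * x for w, x in zip(row, q)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable softmax over the risk-level scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [v / norm for v in exps]


# Toy sentence: 6 tokens with 2-dimensional vectors; filter height h = 3.
sentence = [[1, 0], [0, 1], [2, 1], [0, 0], [1, 3], [0, 1]]
filt = [[1, 0], [0, 1], [1, 1]]
feature_map = conv_feature_map(sentence, filt, bias=-1.0)

relation_part = piecewise_max_pool(feature_map, cut_points=[0, 2])  # 3 segments
attribute_part = piecewise_max_pool(feature_map, cut_points=[1])    # 2 segments
q = relation_part + attribute_part                                  # concatenation

n_levels = 5  # number of risk levels, as in Sect. 3.1
weights = [[0.1 * (r + 1)] * len(q) for r in range(n_levels)]       # arbitrary
probs = softmax(linear(weights, [0.0] * n_levels, q))
print(feature_map, relation_part, attribute_part, probs.index(max(probs)))
```

With m filters, the three-segment branch would contribute 3m values and the two-segment branch 2m values, which matches the dimensions quoted in the text.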
3 Experiments
3.1 Datasets
The experimental data of this study were obtained from a real hospital. The datasets include 7026 COPD EMRs; each EMR consists of three parts: admission records,
progress notes and discharge notes. The severity of COPD in this paper is classified into four groups, group A (mild), group B (moderate), group C (severe) and group D (very serious), plus group F (non-COPD), giving five labels in total. The datasets include non-COPD EMRs because some diseases have similar symptoms to COPD but are not diagnosed as COPD through testing (e.g., bronchiectasis). The process of annotating the datasets is guided by a group of medical practitioners.
3.2 Evaluation Criteria and Experimental Settings
The performance of the proposed model is evaluated using standard metrics, including accuracy, precision, recall, F1 measure and AUC (area under the curve; AUC values range from 0.5 to 1, where a value closer to 1 indicates better performance of the prediction model). In the experiments, we choose the best configuration by comparing the results of different hyperparameters. The configuration is as follows: the word token embedding size is set to 100, the concept type embedding size to 50, the position embedding size to 5, and the window size to 3. As appropriate for the different subtasks, we set different batch sizes and numbers of feature maps. For concept relation extraction, the batch size and the number of feature maps are set to 100 and 200, respectively. For attribute-value pair extraction, the batch size is set to 50 and the number of feature maps to 150. We set the learning rate to 0.001 and the dropout rate [14] to 0.4.
Table 2. The performance of subtasks
| Subtasks | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| Concept recognition | 92.46 | 91.61 | 92.03 |
| Concept relation extraction | 86.74 | 86.33 | 86.53 |
| Attribute-value pair extraction | 88.57 | 89.64 | 89.10 |
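As a quick consistency check on Table 2, the F1 measure can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The sketch below only re-derives values already in the table; the dictionary name `table2` is introduced here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from Table 2.
table2 = {
    "Concept recognition": (92.46, 91.61, 92.03),
    "Concept relation extraction": (86.74, 86.33, 86.53),
    "Attribute-value pair extraction": (88.57, 89.64, 89.10),
}

for name, (p, r, reported) in table2.items():
    print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

The recomputed values agree with the reported F1 column to two decimal places.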
As mentioned above, the performance of the proposed MFRP model relies heavily on the results of the subtasks. The performance of the subtasks is provided in Table 2, with precision, recall and F1 measure as the evaluation criteria.
3.3 Overall Evaluation Results
The experiment is designed to compare our MFRP method with the following baselines:
Traditional machine learning approaches: (1) SVM-RFE [15]: this method uses SVM-based recursive feature elimination to extract gene signatures for breast cancer prediction. (2) Random forest regression (RFR) [16]: this method uses patient outcomes (e.g., gender) and random forest regression for disease prediction. In this paper, we test these models on the COPD datasets.
Deep learning approaches: (1) WE+CNN: we test a predictive model that considers only word embedding representations. (2) CE+CNN: to verify the significance of knowledge, we extract concept embeddings from the knowledge graph and combine them with word embeddings for disease prediction. (3) CER+CNN: this model extracts concept and concept relation embeddings to train the disease prediction model without considering attribute-value pair features; it is used to evaluate the effectiveness of concept relation features. The average value of each method over 10-fold cross-validation is reported in Table 3.
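The averaging over 10-fold cross-validation can be sketched as follows. The paper does not say whether the folds were shuffled or stratified, so this sketch simply uses contiguous folds; `evaluate_fold` is a hypothetical callable standing in for training and scoring a model on one split, and the dummy scorer at the end exists only to make the example runnable.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, evaluate_fold, k=10):
    """Average the score returned by evaluate_fold over the k folds."""
    scores = [evaluate_fold(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 7026 EMRs as in Sect. 3.1; the scorer below is a dummy, not a real model.
folds = list(k_fold_indices(7026, k=10))
dummy_mean = cross_validate(7026, lambda tr, te: len(te) / (len(tr) + len(te)))
print(len(folds), [len(te) for _, te in folds], dummy_mean)
```

Each record appears in exactly one test fold, so every EMR contributes to the reported average once.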
Table 3. The average prediction performance of MFRP and the baselines
| Model | Precision (%) | Recall (%) | F1 (%) | Accuracy (%) | AUC (%) |
|---|---|---|---|---|---|
| SVM-RFE | 52.84 | 72.71 | 61.20 | 70.62 | 74.95 |
| RFR | 54.66 | 72.43 | 62.30 | 71.39 | 75.83 |
| WE+CNN | 65.48 | 76.31 | 70.48 | 76.54 | 81.68 |
| CE+CNN | 72.72 | 79.15 | 75.80 | 79.35 | 86.20 |
| CER+CNN | 74.23 | 79.25 | 76.66 | 80.44 | 86.96 |
| MFRP | 77.31 | 80.17 | 78.71 | 82.75 | 90.88 |
We evaluate the predictive performance of MFRP and the baselines. From Table 3, we can observe that the traditional models SVM-RFE and RFR perform significantly worse than the deep learning models. In addition, comparing CE+CNN with WE+CNN, we see that CE+CNN performs better by 3.5% on the F1 measure and 3.7% on the AUC, which illustrates that concept embeddings can help reduce semantic ambiguity in EMRs. Then, from the comparison between CE+CNN and CER+CNN, we find that CER+CNN is better than CE+CNN by 2.4% on the F1 measure and 2.5% on the AUC. This shows that a large amount of significant information exists among concepts in EMRs and that extracting concept relations can help improve semantic understanding. Finally, the MFRP model outperforms all baselines. This model simultaneously considers three different granular features (concepts, concept relations and attribute-value pairs), and its performance demonstrates the effectiveness of combining them.
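The pairwise gaps between the deep learning models can be read directly off Table 3. The short sketch below computes them in percentage points from the F1 and AUC columns as printed; the names `table3` and `gain` are introduced here for illustration. The differences obtained this way are not identical to the percentages quoted in the paragraph above, so the table values are the safer reference when citing the gains.

```python
# F1 (%) and AUC (%) copied from Table 3 for the deep learning models.
table3 = {
    "WE+CNN": {"f1": 70.48, "auc": 81.68},
    "CE+CNN": {"f1": 75.80, "auc": 86.20},
    "CER+CNN": {"f1": 76.66, "auc": 86.96},
    "MFRP": {"f1": 78.71, "auc": 90.88},
}


def gain(better, worse, metric):
    """Difference in percentage points between two rows of the table."""
    return round(table3[better][metric] - table3[worse][metric], 2)


for better, worse in [("CE+CNN", "WE+CNN"), ("CER+CNN", "CE+CNN"), ("MFRP", "CER+CNN")]:
    print(f"{better} vs {worse}: "
          f"F1 +{gain(better, worse, 'f1')}, AUC +{gain(better, worse, 'auc')}")
```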
4 Conclusion
Fully utilizing the valuable semantic information contained in patient EMRs for predictive model construction is a critical problem in the biomedical informatics domain. In contrast to previous predictive methods, in this paper, we introduce an MFRP model
to extract features with different granularity through three subtasks for COPD risk prediction. Concept embeddings can help to address the problem of semantic ambiguity, because Chinese medical terminology is usually composed of several words; for example, “ ” (chronic obstructive pulmonary disease) is composed of “ ” (chronic), “ ” (obstructive) and “ ” (pulmonary disease). Concept relation and attribute-value pair extraction provide richer information about concepts and concept values, which enhances the effectiveness of a predictive model. Additionally, the PACNN model is proposed to extract features according to the different subtasks. We conduct experiments on real-world datasets, and the results show that our model outperforms all the baselines and can provide an essential reference for developing an efficient and reliable COPD prediction system in the Chinese domain.
Acknowledgement. This study is supported by the Beijing Municipal Commission of Education with project No. KZ202010005010 and the Cooperative Research Project of BJUT-NTUT with No. 11001.
References
1. Poon, H., Toutanova, K., Quirk, C.: Distant supervision for cancer pathway extraction from text. In: Pacific Symposium of Biocomputing, pp. 121–131. Big Island of Hawaii (2015)
2. Pauwels, R.A., Rabe, K.F.: Burden and clinical features of chronic obstructive pulmonary disease (COPD). The Lancet 364(9434), 613–620 (2004)
3. Chen, L., Li, Y., Chen, W., et al.: Utilizing soft constraints to enhance medical relation extraction from the history of present illness in electronic medical records. J. Biomed. Inform. 87, 108–117 (2018)
4. Buijsen, J., van Stiphout, R.G., Menheere, P.P.C.A., Lammering, G., Lambin, P.: Blood biomarkers are helpful in the prediction of response to chemoradiation in rectal cancer: a prospective, hypothesis driven study on patients with locally advanced rectal cancer. Radiother. Oncol. 111(2), 237–242 (2014)
5. Oztekin, A., Delen, D., Kong, Z.J.: Predicting the graft survival for heart–lung transplantation patients: an integrated data mining methodology. Int. J. Med. Inform. 78(12), e84–e96 (2009)
6. Luong, T., Cho, K., Manning, C.D.: Neural machine translation. Association for Computational Linguistics, Berlin (2016)
7. Koopman, B., Zuccon, G., Bruza, P., Sitbon, L., Lawley, M.: Information retrieval as semantic inference: a Graph Inference model applied to medical search. Inf. Retri. J. 19(1–2), 6–37 (2015). https://doi.org/10.1007/s10791-015-9268-9
8. Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multi-pooling convolutional neural networks. Association for Computational Linguistics, Beijing, pp. 167–176 (2015)
9. Yang, J.-J., et al.: Emerging information technologies for enhanced healthcare. Comput. Ind. 69, 3–11 (2015)
10. Miotto, R., Li, L., Kidd, B., et al.: Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 6, 26094 (2016). https://doi.org/10.1038/srep26094
11. Zhao, Q., Wang, D., Li, J., Akhtar, F.: Exploiting the concept level feature for enhanced name entity recognition in Chinese EMRs. J. Supercomput. 76(8), 6399–6420 (2019). https://doi.org/10.1007/s11227-019-02917-3
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
13. Chinese Thoracic Society: Guidelines for the management of chronic obstructive pulmonary disease (2013 revision). Chin. J. Tubercul. Resp. Dis. 36(4), 255–264 (2013)
14. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 192–200 (2014)
15. Xu, X., Zhang, Y., Zou, L., Wang, M., Li, A.: A gene signature for breast cancer prognosis using support vector machine. In: 2012 5th International Conference on BioMedical Engineering and Informatics, Chongqing, pp. 928–931 (2012)
16. Zhao, J., Shaopeng, G., McDermaid, A.: Predicting outcomes of chronic kidney disease from EMR data based on random forest regression. Math. Biosci. 310, 24–30 (2019)
Task-Oriented Feature Representation for Spontaneous Speech of AD Patients
Jiyun Li(B) and Peng Huang
College of Computer Science and Technology, Donghua University, Shanghai, China
[email protected]
Abstract. Detecting Alzheimer’s disease (AD) from speech is promising because it demands little cooperation from the patient. Many machine learning speech models for AD classification have been studied recently. Limited by the types of data available given patients’ conditions, most of the models are content-independent. Using only acoustic factors (e.g., speech frequency, energy and loudness) as the feature representation, they generally suffer from low classification accuracy because the algorithmic architecture neglects problem-domain information. In this paper we propose a novel task-oriented speech feature representation method that incorporates the classification task information into the feature representation learning process. A dynamic boundary mechanism is further introduced to reduce the influence of easy-to-classify samples on the loss function. Experimental results on the ADReSS dataset show that the classification accuracy of models (LDA, DT, 1NN, SVM, RT) using our feature representation is higher than that obtained with the five acoustic features in the baseline experiment, and the classification accuracy of two models (DT, 1NN) even exceeds that of the linguistic features in the baseline experiment. The best classification performance, 83.3%, is obtained using our representation with the DT model.
Keywords: Alzheimer · Speech · Task-oriented
1 Introduction
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 46–57, 2021. https://doi.org/10.1007/978-3-030-91415-8_5
Medical images and scales are the main basis for identifying Alzheimer’s disease. Speech-based Alzheimer’s disease recognition is a new field, and using speech signals to recognize Alzheimer’s disease has been shown to be effective [16]. Although the characteristic symptom of Alzheimer’s disease is amnesia, the deterioration of the patient’s language system is an obvious manifestation of the disease. As the disease develops, the systems that control semantics, grammar and pronunciation suffer varying degrees of damage [5]. Researchers have tried to recognize Alzheimer’s patients from the semantic and grammatical information contained in their speech. Such content-related speech detection methods [4,20,21] have poor economy and scalability and cannot adapt to large-scale
detection. Researchers have therefore tried content-independent methods [1,11] to detect Alzheimer’s disease. Most of these methods are based on the frequency and statistical characteristics of speech and achieve good results on controlled speech. Spontaneous speech [2], which is collected in an uncontrolled environment rather than a clinical or otherwise controlled one, is better suited to large-scale detection tasks. However, the performance of content-independent detection methods is poor on spontaneous speech datasets of Alzheimer’s disease. One reason is that the collection environment of spontaneous speech is uncontrolled and the participants speak freely. Another is the insufficient task orientation of the methods. Task orientation can be illustrated as follows: a tester tries to classify apples, bananas and tables. Ordinarily, apples and bananas would be placed in one category; but if we tell the tester that apples and tables belong to one category, the tester will learn to find the similarities between apples and tables and the differences between bananas and apples. Similarly, speech classification needs task orientation. Task-independent feature extraction does not yield task-related information, so we need to represent speech features in a task-oriented way to improve classification accuracy. In this paper, a new loss is used to train a neural network model for the speech feature representation of Alzheimer’s disease patients. Through training, the neural network learns a task-related encoding, which improves speech-based recognition of Alzheimer’s disease. The contributions of this paper are as follows:
1. Task-oriented feature representation is used to improve the recognition performance of content-independent feature representation for Alzheimer’s disease.
2. The triplet loss based on a static boundary is improved, and the Dynamic Boundary triplet (DB-triplet) loss is proposed.
The rest of this paper is organized as follows. Section 2 reviews related work on speech-based recognition of Alzheimer’s disease. Section 3 gives the mathematical description of the problem. Section 4 introduces the model, including the overall architecture, the encoding network and the implementation of task orientation. Section 5 presents the experiments, comparing and discussing the results. Section 6 gives the conclusion and future work.
2 Related Work
Using speech to identify Alzheimer’s disease non-invasively is an economical and scalable approach. The general method is to transform speech into features and then classify those features with a classification model to obtain the recognition result. Shah et al. [19] obtained the linguistic features of speech using the CHAT speech analysis software, and fused the four acoustic feature sets (the AVEC 2013,
48
J. Li and P. Huang
the ComParE 2013, the emo_large, the MFCC 1–16) after feature selection, finally obtaining a highest accuracy of 85% across five ML models. With content-independent acoustic features alone, however, the accuracy was only 65%. Rahmani et al. [17] tried to identify Alzheimer’s disease in Persian speech. They used wavelet packets to extract speech features and an SVM classifier, and obtained 96% recognition accuracy. Although the influence of the dataset cannot be ignored, this result gives initial evidence of the cross-language advantage of acoustic features in Alzheimer’s disease recognition. Edwards et al. [6] improved the recognition accuracy by adding phoneme features. They used five systems to complete the recognition task at multiple scales. Each system is based on different features, and the system based on phonemes and audio achieved the best result, 79%. Rohanian et al. [18] proposed the LSTM-based Multimodal Fusion with Gating model, which integrates language and acoustic features, and obtained 79% accuracy. Again, purely acoustic features obtained only 66% accuracy. Most of the above results rely on mixed features. Although they achieve good improvements, the models perform poorly on content-independent acoustic features alone. In this paper, a deep learning model is used to learn a task-related feature representation of speech, and the proposed DB-triplet loss is used to improve the representation and help the classification model obtain better results.
3 Problem Description
Speech-based Alzheimer’s disease recognition is a spatial mapping problem. We need to map the speech of the detected person to the recognition result (AD or non-AD). In the mathematical description, a speech sample of the detected person is regarded as a vector $\vec{\alpha}$, and the detection result is a scalar $\gamma$. The recognition of Alzheimer’s disease is the process of mapping the vector $\vec{\alpha}$ to the scalar $\gamma$, where the mapping relationship is set to $g$, and the formula is as follows:
$$g: \mathbb{R}^m \to \mathbb{Z}_{0,1}, \quad \vec{\alpha} \mapsto \gamma = g(\vec{\alpha}) \tag{1}$$
In speech recognition, the mapping dimension gap is too large (a speech data vector often has millions of dimensions), which makes the search space of the mapping relationship too complex to obtain an accurate $g$. Therefore, the commonly used method is to map the speech to a subspace to reduce the search space of the mapping relationship, so the above formula can be modified as follows:
$$f: \mathbb{R}^m \to \mathbb{R}^n, \quad \vec{\alpha} \mapsto \vec{\beta} = f(\vec{\alpha}) \quad (m \gg n) \tag{2}$$
$$g: \mathbb{R}^n \to \mathbb{Z}_{0,1}, \quad \vec{\beta} \mapsto \gamma = g(\vec{\beta}) \tag{3}$$
β is the feature representation of speech in subspace, f is the mapping relation between them.
In particular, for longer segments of speech, we need to separate the valid segments from them to form a speech set $A = \{\vec{\alpha}_1, \vec{\alpha}_2, \vec{\alpha}_3, \vec{\alpha}_4, \ldots, \vec{\alpha}_n\}$, and through the mapping relationship $f$ we obtain the mapping set $B = \{\vec{\beta}_1, \vec{\beta}_2, \vec{\beta}_3, \vec{\beta}_4, \ldots, \vec{\beta}_n\}$. A sensitivity value $\lambda$ is added to adjust the sensitivity of the model. The final formula is:
$$y = \frac{1}{n} \sum_{\vec{\beta}_i \in B} g(\vec{\beta}_i) - \lambda \quad (\lambda, y \in \{0, 1\}) \tag{4}$$
y is the output of the model, where 1 represents AD and 0 represents non-AD. When selecting f in Formula (2), traditional models rely mainly on experience and lack task orientation, so the resulting mapping is not adapted to the task. In this paper, a neural network learns the mapping under a dedicated loss function. The learned mapping is more strongly correlated with the task and retains enough task information to make the search for g in Formula (3) more accurate.
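The aggregation in Formula (4) can be illustrated with a minimal Python sketch. It assumes that the classifier g returns 0 or 1 for each valid segment and that the speaker is labelled AD when the mean segment score exceeds λ. The extracted formula only shows the difference between the mean and λ, so this thresholding rule and the function name `classify_speaker` are assumptions made for illustration.

```python
from typing import Sequence


def classify_speaker(segment_labels: Sequence[int], sensitivity: float) -> int:
    """Aggregate per-segment predictions g(beta_i) into one speaker-level label.

    segment_labels: 0/1 outputs of the classifier g, one per valid segment.
    sensitivity:    the value lambda from Formula (4).
    Returns 1 (AD) when the mean segment score exceeds lambda, else 0 (non-AD).
    The "exceeds" rule is an assumption, not taken from the paper.
    """
    if not segment_labels:
        raise ValueError("at least one valid segment is required")
    mean_score = sum(segment_labels) / len(segment_labels)
    return 1 if mean_score - sensitivity > 0 else 0


# Hypothetical speaker with 10 valid segments, 6 of them classified as AD.
example_label = classify_speaker([1, 1, 0, 1, 0, 1, 1, 0, 1, 0], sensitivity=0.51)
print(example_label)
```

With λ close to 0.5 the rule behaves like a majority vote over segments, while λ near 0 or 1 makes the decision depend on only a few segments.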
4 Model Architecture
The model architecture used in this paper is shown in Fig. 1. Speech is input into the encoding network to learn the representation of feature in the subspace, and DB-triplet loss is used to correct this process, which helps the encoding network to find the mapping relationship with the largest task correlation between speech and feature representation.
Fig. 1. Model structure diagram.
4.1 The Encoding Network
Because the information extracted during speech feature representation learning should be content-independent, timing should not influence the encoding. We therefore choose a timing-independent CNN rather than an LSTM as the encoding network.
Fig. 2. The encoding network
The speech sampling rate in the dataset is 44,100 Hz and the maximum speech segment duration is 10 s, that is, 441,000 sampling points, which is a relatively large input. We therefore pad each speech segment to 500,000 samples and fold it into a 500×1000 array as input, and use a Conv1D layer to reduce the dimension of the data. Combined with the dense layer, the network finally transforms the speech data into 64-dimensional vectors. The network architecture is shown in Fig. 2.
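The padding and folding step can be sketched in plain Python as follows. The sketch assumes zero padding at the end of the segment and row-major folding; the paper does not state either choice, and the helper name `pad_and_fold` is hypothetical.

```python
from typing import List, Sequence

ROWS, COLS = 500, 1000      # folded input shape described in the text
TARGET_LEN = ROWS * COLS    # 500,000 samples after padding


def pad_and_fold(samples: Sequence[float], pad_value: float = 0.0) -> List[List[float]]:
    """Pad a segment to 500,000 samples and fold it into 500 rows of 1000."""
    if len(samples) > TARGET_LEN:
        raise ValueError("segment is longer than the padded length")
    # Assumption: padding is appended at the end of the segment.
    padded = list(samples) + [pad_value] * (TARGET_LEN - len(samples))
    # Assumption: row-major folding, so row r holds samples r*1000 .. r*1000+999.
    return [padded[r * COLS:(r + 1) * COLS] for r in range(ROWS)]


# A 10 s segment at 44,100 Hz has 441,000 samples.
segment = [0.1] * (44_100 * 10)
folded = pad_and_fold(segment)
print(len(folded), len(folded[0]))  # 500 1000
```

In this layout the first 441 rows carry signal and the remaining 59 rows contain only padding.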
Fig. 3. Triplet network training process. The left figure shows the distribution of positive and negative samples before training, which is scattered and cannot be separated. During training, positive examples are pulled closer and negative examples are pushed away. The right figure shows the ideal distribution of positive and negative examples after training.
4.2 Task Orientation
To improve the correlation between the feature representation obtained by the encoding network and the task, and ensure that the coding can provide Alzheimer’s disease-related information for the task more accurately, we use triplet loss [10] to train the encoding network. The triplet loss function is as follows:
$$d_p = \|\vec{\beta}_a - \vec{\beta}_p\|_2^2 \tag{5}$$
Fig. 4. The left figure shows how some counterexamples affect the loss function so that some samples are not distributed as expected (some extreme samples have a large effect on the loss). The right figure uses dynamic boundaries to reduce the influence of these samples.
$$d_n = \|\vec{\beta}_a - \vec{\beta}_n\|_2^2 \tag{6}$$
$$l = \sum_{i}^{N} \left[ d_p^i - d_n^i + margin \right]_+ \tag{7}$$
In the formula, $\vec{\beta}_a$ is the anchor sample, $\vec{\beta}_p$ is the positive example of the anchor sample, and $\vec{\beta}_n$ is the counterexample of the anchor sample. The distances $d_p$ and $d_n$ are calculated from these three variables, $margin$ represents the boundary width after classification, and $l$ represents the final loss function. However, the actual situation may not be as ideal as Fig. 3. During training, the situation in Fig. 4 (left) can appear. This happens because some easy-to-classify samples contribute greatly to the loss function in training, so that some hard-to-classify samples are ignored. The contradiction between the static boundary and the dynamic training state is another important cause. We therefore modify the boundary condition to reduce the contribution of such samples to the loss function and introduce a dynamic boundary, as in Fig. 4 (right), which is determined by $Batch_{min}$ (the minimum $d_n$ in each batch of training data). Combined with the corresponding boundary coefficient $\eta$, the following formulas are obtained:
$$MASK = \begin{cases} 1, & (\eta * Batch_{min} - d_n) < 0 \\ 0, & (\eta * Batch_{min} - d_n) \ge 0 \end{cases} \tag{8}$$
$$l = \sum_{i}^{N} \left( d_p^i - d_n^i * MASK + margin \right) \tag{9}$$
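To make the difference between the two losses concrete, the following plain-Python sketch evaluates Formulas (5) to (9) on one batch of triplets. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, embeddings are plain lists of floats, and Formula (9) is implemented exactly as it reads in the text, without the hinge of Formula (7).

```python
from typing import Sequence

Vector = Sequence[float]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance, as in Formulas (5) and (6)."""
    return float(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchors, positives, negatives, margin: float) -> float:
    """Static-boundary triplet loss, Formula (7): sum of [d_p - d_n + margin]_+."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(sq_dist(a, p) - sq_dist(a, n) + margin, 0.0)
    return total


def db_triplet_loss(anchors, positives, negatives, margin: float, eta: float) -> float:
    """Dynamic-boundary triplet loss, Formulas (8) and (9)."""
    d_p = [sq_dist(a, p) for a, p in zip(anchors, positives)]
    d_n = [sq_dist(a, n) for a, n in zip(anchors, negatives)]
    batch_min = min(d_n)  # smallest anchor-negative distance in the batch
    total = 0.0
    for dp_i, dn_i in zip(d_p, d_n):
        mask = 1.0 if eta * batch_min - dn_i < 0 else 0.0  # Formula (8)
        total += dp_i - dn_i * mask + margin               # Formula (9)
    return total


# Hypothetical two-triplet batch in a 2-dimensional embedding space.
anchors = [[0.0, 0.0], [0.0, 0.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[2.0, 0.0], [0.0, 4.0]]
static_loss = triplet_loss(anchors, positives, negatives, margin=5.0)
dynamic_loss = db_triplet_loss(anchors, positives, negatives, margin=5.0, eta=2.0)
print(static_loss, dynamic_loss)
```

In this toy batch the negative distances are 4 and 16, so with η = 2 the dynamic boundary is 8 and only the second triplet keeps its negative-distance term.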
Table 1. The distribution of patients (M = male and F = female)
| Dataset | Age | M (AD) | F (AD) | M (non-AD) | F (non-AD) |
|---|---|---|---|---|---|
| Training set | [50, 55) | 1 | 0 | 1 | 0 |
| | [55, 60) | 5 | 4 | 5 | 4 |
| | [60, 65) | 3 | 6 | 3 | 6 |
| | [65, 70) | 6 | 10 | 6 | 10 |
| | [70, 75) | 6 | 8 | 6 | 8 |
| | [75, 80) | 3 | 2 | 3 | 2 |
| Test set | [50, 55) | 1 | 0 | 1 | 0 |
| | [55, 60) | 2 | 2 | 2 | 2 |
| | [60, 65) | 1 | 3 | 1 | 3 |
| | [65, 70) | 3 | 4 | 3 | 4 |
| | [70, 75) | 3 | 3 | 3 | 3 |
| | [75, 80) | 1 | 1 | 1 | 1 |
| Total | | 35 | 43 | 35 | 43 |
5 Experimentation and Discussion
The feature representation model used in the experiment has three hyper-parameters: the classification boundary distance margin, the dynamic boundary coefficient η, and the recognition sensitivity λ. Margin and η are used in the training of the model and determine how strongly the encoding network is corrected. The recognition sensitivity λ is used in the final classification of long speech, where it provides a threshold for the classification result and needs to be determined experimentally. Besides parameter tuning, model evaluation is an important part of the experiment. We use four indicators, Precision, Recall, F1 and Accuracy, and compare them to evaluate the quality of the model.
Table 2. Comparative experimental results of loss
| Loss | Accuracy |
|---|---|
| Triplet | 0.688 |
| DB-Triplet | 0.813 |
Fig. 5. The grid search results of hyper-parameters based on 1NN
5.1 Dataset
Our dataset comes from the ADReSS Challenge [13], a challenge initiated by DementiaBank. It provides a speech dataset for Alzheimer’s disease that is relatively balanced in age and gender. Table 1 shows the distribution of patients in the dataset. The dataset contains 70 males and 86 females, and the numbers of patients and non-patients of the same sex in each age group are equal, which ensures that the gender and age information in the data will not affect the results. In the dataset, each subject’s speech is cut into several effective segments according to a 65 dB signal energy threshold. The speech of AD patients is cut into 2122 segments, and the speech of non-AD subjects into 1955 segments. Our experiments are based on these segments.
Table 3. Experimental result of accuracy on test set
| Features | LDA | DT | 1NN | SVM | RF | MEAN |
|---|---|---|---|---|---|---|
| emobase [9] | 0.542 | 0.688 | 0.604 | 0.500 | 0.729 | 0.613 |
| ComParE [8] | 0.625 | 0.625 | 0.458 | 0.500 | 0.542 | 0.550 |
| eGeMAPS [7] | 0.583 | 0.542 | 0.688 | 0.563 | 0.604 | 0.596 |
| MRCG [3] | 0.542 | 0.563 | 0.417 | 0.521 | 0.542 | 0.517 |
| Minimal [12] | 0.604 | 0.562 | 0.604 | 0.667 | 0.583 | 0.60 |
| Linguistic [15] | 0.750 | 0.625 | 0.667 | 0.792 | 0.750 | 0.717 |
| 1dCNN-triplet | 0.750 | 0.833 | 0.813 | 0.702 | 0.750 | 0.770 |
5.2 Training Parameters Selection
When selecting the hyper-parameter λ, we found that setting λ to 1 or 0 increases the recognition error rate. We assume that only a certain proportion of the speech produced by patients in the test set shows disease characteristics, not all of it. We therefore set λ to an intermediate value. After many attempts, 0.51 proved to be the best value. We conduct a grid search over margin and η, and use a 1NN (KNN with K = 1) classification model to verify the results. The results of parameter selection are shown in Fig. 5. Finally, we select λ = 0.51, margin = 5 and η = 5.
5.3 Experimental Result
We compared the models trained with triplet loss and with DB-triplet loss using the 1NN classifier and obtained the results in Table 2. The experimental data show that the dynamic boundary improves the classification performance of the model.
Table 4. Comparison results of five models
| Models | Class | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|---|
| BaselineAcous [13] | AD | 0.60 | 0.75 | 0.67 | 0.62 |
| | non-AD | 0.67 | 0.50 | 0.57 | |
| BaselineLing [13] | AD | 0.83 | 0.63 | 0.71 | 0.75 |
| | non-AD | 0.70 | 0.87 | 0.78 | |
| Multi-scale System 1 [6] | AD | 0.59 | 0.71 | 0.64 | 0.60 |
| | non-AD | 0.63 | 0.5 | 0.56 | |
| LSTM with Gating [18] (Acoustic and Lexical) | AD | 0.78 | 0.75 | 0.76 | 0.77 |
| | non-AD | 0.76 | 0.79 | 0.78 | |
| DT (our feature) | AD | 0.88 | 0.84 | 0.85 | 0.83 |
| | non-AD | 0.83 | 0.87 | 0.85 | |
Table 3 compares our feature representation with five acoustic feature representations and one linguistic feature representation on five machine learning models: Linear Discriminant Analysis (LDA), Decision Tree (DT), Nearest Neighbour (1NN, KNN with K = 1), Random Forest (RF) and Support Vector Machine (SVM). Our representation reaches the best performance on three of the models, and its performance on DT exceeds the best result of the linguistic features, reaching 83.3% accuracy. It also outperforms all five acoustic feature sets on every model; the linguistic features are the only exception. For the comparison of recognition results, Table 4 compares five models, including the baseline experiments, on the test dataset. The recognition accuracy and F1 scores of this paper are the best among the five models.
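For reference, the four indicators used in Tables 3 and 4 can be computed from confusion-matrix counts as in the short Python sketch below. The counts in the example are hypothetical and are not taken from the paper; they only show how the indicators relate to each other for a balanced test set of 48 subjects.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy for the positive (AD) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 24 AD and 24 non-AD subjects, 4 errors in each class.
metrics = binary_metrics(tp=20, fp=4, tn=20, fn=4)
print({name: round(value, 3) for name, value in metrics.items()})
```

The metrics for the non-AD class follow from the same function by swapping the roles of the two classes, that is, by calling it with tp and tn exchanged and fp and fn exchanged.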
Fig. 6. We use t-SNE [14] to project the features into two dimensions. The left figure shows the MRCG features of the patient’s speech. The right figure shows the intermediate results of the feature representation method in this paper, where a clustering tendency can be seen.
5.4 Discussion
The emobase, ComParE, eGeMAPS, MRCG and Minimal features in Table 3 are based on acoustic feature extraction and focus on capturing acoustic information in the patient’s voice. Whether that acoustic information can identify Alzheimer’s disease is not considered. We visualize the vectors obtained by these feature encodings in space. As shown in Fig. 6 (left), there is no obvious boundary, and these vectors do not capture much task-related information. When we visualize the distribution of feature vectors during the training of our model, a clustering tendency is easy to see, as shown in Fig. 6 (right). This indicates that our model is effectively learning task-related knowledge. The encoding method in this paper does not completely solve the problem of sample deviation, but it reduces it considerably. We compare the representation distribution before and after the improvement (Fig. 7) and find that our representation separates more samples of different classes and helps the machine learning model classify more accurately. Only five machine learning models are used in this paper to verify the representation method. We believe that other models may achieve better results on this representation, which is left for future research.
Fig. 7. The left figure is the result of 300 rounds of traditional triplet loss training, and the right figure is the result of triplet loss training with dynamic boundaries. The overlapping area in the right figure is significantly smaller, indicating that DB-triplet improves the representation learned by the model.
6 Conclusion and Future Work
Through the above experiments, we conclude that the task-oriented feature representation method improves the recognition accuracy for spontaneous speech of patients with Alzheimer’s disease. Although the representation improves the model, problems remain, such as heterogeneous stickiness of the encoding and similar clustering. Heterogeneous stickiness refers to the encoding direction swinging back and forth during encoding adjustment. Similar clustering means that samples from the same class may be grouped into multiple clusters and cannot be encoded correctly. These problems can serve as starting points for future work.
References
1. Ammar, R.B., Ayed, Y.B.: Speech processing for early Alzheimer disease diagnosis: machine learning based approach. In: 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA), pp. 1–8. IEEE (2018)
2. Bschor, T., Kühl, K.P., Reischies, F.M.: Spontaneous speech of patients with dementia of the Alzheimer type and mild cognitive impairment. Int. Psychogeriatrics 13(3), 289–298 (2001)
3. Chen, J., Wang, Y., Wang, D.: A feature study for classification-based speech separation at low signal-to-noise ratios. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1993–2002 (2014)
4. Chien, Y.W., Hong, S.Y., Cheah, W.T., Yao, L.H., Chang, Y.L., Fu, L.C.: An automatic assessment system for Alzheimer’s disease based on speech using feature sequence generator and recurrent neural network. Sci. Rep. 9(1), 1–10 (2019)
5. Cummings, J.L., Darkins, A., Mendez, M., Hill, M.A., Benson, D.: Alzheimer’s disease and Parkinson’s disease: comparison of speech and language alterations. Neurology 38(5), 680–680 (1988)
6. Edwards, E., Dognin, C., Bollepalli, B., Singh, M.K., Analytics, V.: Multiscale system for Alzheimer’s dementia recognition through spontaneous speech. In: INTERSPEECH, pp. 2197–2201 (2020)
7. Eyben, F., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2015)
8. Eyben, F., Weninger, F., Gross, F., Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 835–838 (2013)
9. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462 (2010)
10. Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 84–92. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_7
11. Lopez-de Ipiña, K., et al.: On automatic diagnosis of Alzheimer’s disease based on spontaneous speech analysis and emotional temperature. Cogn. Comput. 7(1), 44–55 (2015)
12. Luz, S.: Longitudinal monitoring and detection of Alzheimer’s type dementia from spontaneous speech data. In: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS), pp. 45–46. IEEE (2017)
13. Luz, S., Haider, F., de la Fuente, S., Fromm, D., MacWhinney, B.: Alzheimer’s dementia recognition through spontaneous speech: the ADReSS challenge (2020). arXiv preprint arXiv:2004.06833
14. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 1–27 (2008)
15. MacWhinney, B.: Tools for analyzing talk part 2: The CLAN program. Pittsburgh, PA: Carnegie Mellon University (2017). http://talkbank.org/manuals/CLAN.pdf
16. Pulido, M.L.B., Hernández, J.B.A., Ballester, M.A.F., González, C.M.T., Mekyska, J., Smékal, Z.: Alzheimer’s disease and automatic speech analysis: a review. Exp. Syst. Appl. 150, 113213 (2020)
17. Rahmani, M., Momeni, M.: Alzheimer speech signal analysis of Persian speaking Alzheimer’s patients. Comput. Intell. Electric. Eng 11(1), 81–94 (2020)
18. Rohanian, M., Hough, J., Purver, M.: Multi-modal fusion with gating using audio, lexical and disfluency features for Alzheimer’s dementia recognition from spontaneous speech (2021). arXiv preprint arXiv:2106.09668
19. Shah, Z., Sawalha, J., Tasnim, M., Qi, S.A., Stroulia, E., Greiner, R.: Learning language and acoustic models for identifying Alzheimer’s dementia from speech. Front. Comput. Sci. 3, 4 (2021)
20. Thomas, C., Keselj, V., Cercone, N., Rockwood, K., Asp, E.: Automatic detection and rating of dementia of Alzheimer type through lexical analysis of spontaneous speech. In: IEEE International Conference Mechatronics and Automation, 2005, vol. 3, pp. 1569–1574. IEEE (2005)
21. Yuan, J., Bian, Y., Cai, X., Huang, J., Ye, Z., Church, K.: Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer’s disease. In: INTERSPEECH, pp. 2162–2166 (2020)
Identification of Protein Markers Predictive of Drug-Specific Survival Outcome in Cancers
Shuting Lin1, Jie Zhou1, Yiqiong Xiao1, Bridget Neary1, Yong Teng2, and Peng Qiu1,2(B)
1 Georgia Institute of Technology, Atlanta, USA
2 Emory University, Atlanta, USA
Abstract. Novel discoveries of biomarkers predictive of drug-specific responses not only play a pivotal role in revealing the drug mechanisms in cancers, but are also critical to personalized medicine. In this study, we identified drug-specific biomarkers by integrating protein expression data, drug treatment data and survival outcome of 7076 patients from The Cancer Genome Atlas (TCGA). We first defined cancer-drug groups, where each cancer-drug group contains patients with the same cancer and treated with the same drug. For each protein, we stratified the patients in each cancer-drug group by high or low expression of the protein, and applied the log-rank test to examine whether the stratified patients show a significant survival difference. We examined 336 proteins in 98 cancer-drug groups and identified 65 protein-cancer-drug combinations involving 55 unique proteins, where the protein expression levels are predictive of drug-specific survival outcomes. Some of the identified proteins were supported by published literature. Using the gene expression data from TCGA, we found that the mRNA expression of ∼11% of the drug-specific proteins also showed significant correlation with drug-specific survival, and most of these drug-specific proteins and their corresponding genes are strongly correlated.
Keywords: Protein · Drug response · Survival analysis
1 Introduction
The high inter-individual variability in drug response makes it a great challenge to develop personalized treatment strategies for individual patients [1]. Therefore, personalized medicine is a research area of great interest in terms of optimizing therapeutic options and improving patient clinical outcomes. One essential aspect for personalized medicine is to identify biomarkers that are predictive of drug treatment responses [2]. Rapid technological advances in cancerogenic research have facilitated the discovery of genetic variants as predictive and prognostic biomarkers associated with drug efficacy and patient clinical outcomes [3].
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 58–67, 2021. https://doi.org/10.1007/978-3-030-91415-8_6
Drug-Specific Survival Biomakers
59
In the literature, numerous pharmacogenetic studies have investigated the relationship between molecular expression profiles and patient survival outcomes, and identified prognostic biomarkers in cancers [4]. Most of the existing studies chose to include as many subjects relevant to their scopes as possible, while frequently ignoring the fact that these patients might have received different treatments. In our opinion, there are two main reasons for such choices. First, population-based studies with a larger sample size often have increased statistical power for identification of biomarkers [5]. Second, drug treatment data is often either unavailable or in non-standardized formats that are difficult to incorporate. As a result, cancer survival biomarkers identified in existing studies are often general to the cancer being studied, but not specific to any drug treatment. However, studying biomarkers in a drug-specific manner has the potential to reveal the underlying cancer mechanisms and inform designs of personalized medicine. The Cancer Genome Atlas (TCGA) is one of the most powerful cancer genomics programs to date [6] and provides massive data to expand our knowledge of tumourigenesis. TCGA has generated a large public collection of multiple types of omic data on ∼11,000 cancer patients across 33 different cancer types. The omic data types include mutation, copy number variation, methylation, gene expression, miRNA expression and protein expression data. TCGA also provides drug treatment data and survival outcomes of the patients. The drug treatment data in TCGA has nomenclature problems (i.e., alternative names, abbreviations and misspellings), which make bioinformatics analysis difficult. In our previous studies [7–9], we manually standardized the drug names in the drug treatment data, which enabled us to examine the potential of gene copy number and gene expression as biomarkers of drug-specific survival.
Here, we focused on investigating the potential of proteins as drug-specific predictive biomarkers, since proteins are the functional units in the central dogma of molecular biology, and it has already been hypothesized that proteomic profiling more directly addresses biological and pharmacologic problems in cancer [10]. In recent years, proteomics efforts have led to proteins that can serve as cancer biomarkers. Several lines of evidence have shown that the expression level of proteins is frequently associated with drug response. One example is MRP1, which is associated with drug resistance or poor patient outcomes in a variety of cancers [23]. MRP3 is the ABC transporter that is most closely related to MRP1. For both MRP3 and MRP1, the protein expression levels correlated with decreased sensitivity of lung cancer cell lines to doxorubicin [11]. Another well-characterized example is a set of eight protein signatures identified for the prediction of drug response to 5-fluorouracil, including CDH1, CDH2, KRT8, ERBB2, MSN, MVP, MAP2K1, and MGMT. All of these proteins, except for KRT8, are involved in the pathogenesis of colon cancer [12]. In this study, we performed survival analyses on patients who had the same cancer and were exposed to the same drug, and identified proteins whose expression levels are associated with drug-specific survival outcome. Some of the identified protein markers were further supported in the literature, where we found multiple published papers indicating their relationship with drug response in cancers.
60
S. Lin et al.
However, we also found a few drug-specific proteins that are inconsistent with previously reported findings in terms of the direction of correlation with survival outcomes. In addition, using the gene expression data in TCGA, we explored the regulatory mechanism of predictive protein markers by examining their coding genes.
2 Results
2.1 Significant Proteins Predictive of Drug-Specific Survival
To identify proteins correlated with drug-specific survival outcomes, we grouped together patients who suffered from the same cancer and received the same drug, which we call cancer-drug groups. Across the 33 cancer types in TCGA and the 254 unique drug names from our previous manual standardization of the drug treatment data [7–9], a large number of cancer-drug groups contained no patients or very few, because not all drugs were applied to treat all cancer types. We imposed a minimum sample size requirement of 15 and only considered cancer-drug groups whose number of patients exceeded this threshold. A total of 98 cancer-drug groups were therefore considered in the subsequent analysis to identify protein markers for drug-specific survival. Next, we binarized the protein expression data in TCGA, which was needed in our survival analysis. For each of the 336 proteins measured by TCGA, we applied StepMiner [13] to binarize its expression data across all patients in all cancer types. Specifically, for each protein, we sorted its expression data for all patients and then fitted a step function to the sorted data that minimizes the squared error between the original and the fitted values. The step function provided a threshold to binarize the expression of the protein. Finally, we performed survival analysis to evaluate each protein’s ability to predict the survival outcome of patients in each cancer-drug group. Patients in the cancer-drug group were stratified into a highly-expressed class and a lowly-expressed class based on the binarized data of the protein. To minimize undesired statistical bias, we only performed survival analysis on proteins in cancer-drug groups with at least 5 lowly-expressed patients and 5 highly-expressed patients. In total, 17,812 protein-cancer-drug combinations were tested in our analysis, which involved 23 cancer types and 41 drugs.
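The step-function fit described above can be sketched in a few lines of Python. This is a simplified illustration in the spirit of StepMiner, not the StepMiner tool itself: it fits a single step to the sorted values by minimizing the squared error, and it assumes that the threshold is the midpoint between the two fitted levels. The function names are hypothetical.

```python
from typing import List, Sequence


def fit_step_threshold(values: Sequence[float]) -> float:
    """Fit a one-step function to the sorted values and return a threshold."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    # Prefix sums let each candidate split be scored in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    best_sse, best_threshold = float("inf"), xs[0]
    for k in range(1, n):  # xs[:k] is the low level, xs[k:] the high level
        left_sum, right_sum = prefix[k], prefix[n] - prefix[k]
        left_mean, right_mean = left_sum / k, right_sum / (n - k)
        # Squared error of each side around its own mean.
        sse = (prefix_sq[k] - left_sum * left_mean) + (
            prefix_sq[n] - prefix_sq[k] - right_sum * right_mean)
        if sse < best_sse:
            # Assumption: the threshold is the midpoint of the two levels.
            best_sse, best_threshold = sse, (left_mean + right_mean) / 2
    return best_threshold


def binarize(values: Sequence[float], threshold: float) -> List[int]:
    """Label each value 1 (highly expressed) or 0 (lowly expressed)."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical expression values for one protein across six patients.
expression = [0.1, 0.2, 0.15, 2.0, 2.1, 1.9]
threshold = fit_step_threshold(expression)
labels = binarize(expression, threshold)
print(threshold, labels)
```

Because the threshold is fitted once across all patients, a cancer-drug group can end up with very few patients in one class, which is why the minimum of 5 patients per class is checked before testing.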
We applied the log-rank test to determine the statistical significance of the survival difference between the highly-expressed class and the lowly-expressed class. The 90 proteins with an FDR below 0.1 were selected as predictive markers whose expression levels were related to patients’ survival outcome in a drug-specific manner. In order to identify proteins that are specifically related to individual drugs, we performed the same analysis on all patients in each cancer type and identified proteins that are predictive of cancer-specific survival irrespective of drug treatment. Among the 90 proteins significant for drug-specific survival, 25 were also identified in the cancer-specific analysis. In our subsequent analysis, we excluded these 25 and only included the protein markers that were significant in cancer-drug groups but not significant in the corresponding cancer types.
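As an illustration of the test applied to each stratified cancer-drug group, the following Python sketch implements the standard two-group log-rank test using only the standard library. It is a minimal sketch, not the code used in the study: the function name and the toy survival times are hypothetical, and the multiple-testing correction used to obtain the FDR is not shown because the text does not specify it.

```python
import math
from typing import Sequence, Tuple


def logrank_test(times_a: Sequence[float], events_a: Sequence[int],
                 times_b: Sequence[float], events_b: Sequence[int]) -> Tuple[float, float]:
    """Two-group log-rank test.

    events are 1 for an observed death and 0 for censoring.
    Returns the chi-square statistic and its p-value (1 degree of freedom).
    """
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of the event count in group A
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data: the lowly-expressed class dies early, the highly-expressed class late.
low_times, low_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
high_times, high_events = [10, 11, 12, 13, 14], [1, 1, 1, 1, 1]
stat, p_value = logrank_test(low_times, low_events, high_times, high_events)
print(round(stat, 2), p_value < 0.01)
```

In practice one such p-value is obtained for each protein-cancer-drug combination, and the resulting set of p-values is then adjusted for multiple testing.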
Table 1. 65 significant protein-cancer-drug combinations identified in cancer-drug groups
| Cancer-drug groups | Protein markers for drug-specific survival | Number of patients |
|---|---|---|
| BLCA-Gemcitabine | YWHAE, AKT1-3, CDK1, MAPK1 | 75 |
| BRCA-Carboplatin | AKT1 | 24 |
| BRCA-Doxorubicin | CDK1 | 282 |
| CESC-Cisplatin | TP53BP1, Tubulins, BAP1, CTNNB1, COL6A1, CDH1, EIF4G1, HSPA1A, KU80, MRE11, CDH2, SERPINE1, TSC2 | 61 |
| COAD-Oxaliplatin | EIF4EBP1 | 82 |
| HNSC-Carboplatin | BCL2, CLDN7, FOXM1, MSH2 | 22 |
| KIRC-Sorafenib | KDR | 16 |
| LGG-Bevacizumab | EIF4EBP1, ETS1, FASN, TIGAR, TSC2 | 41 |
| LGG-Irinotecan | ETS1, KU80, SRSF1, FYN | 20 |
| LGG-Lomustine | CTNNA3, AR, CDKN1A, BRAF | 31 |
| LUAD-Cisplatin | NF2 | 66 |
| LUSC-Gemcitabine | ABL1 | 26 |
| LUSC-Cisplatin | CCNE2, PEA15 | 54 |
| OV-Docetaxel | CCNE1, RBM15 | 69 |
| OV-Carboplatin | CASP7, RICTOR | 290 |
| OV-Doxorubicin | CDH1, GAB2, RBM15 | 106 |
| OV-Vinorelbine | AKT1-3, CDK1, ERCC5 | 19 |
| PAAD-Gemcitabine | CDKN1B, MAPK11 | 73 |
| PAAD-Fluorouracil | BECN1, GAPDH, ERBB3, CDKN1B, MAPK11, MAPK12, PTEN, SRC, PARP1 | 24 |
| STAD-Cisplatin | DPP4 | 48 |
| STAD-Etoposide | INPP4B | 19 |
As shown in Table 1, a total of 65 significant protein-cancer-drug combinations were identified in 21 cancer-drug groups, involving 55 unique proteins. We found 13 significant proteins predictive of cisplatin response in cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), and 9 protein markers associated with fluorouracil response in pancreatic adenocarcinoma (PAAD). Interestingly, 11 proteins turned out to be significant in multiple cancer-drug groups and may serve as key biomarkers of drug response in multiple cancer types. Among the proteins that were significant in more than one cancer-drug group, we observed that CDH1 was related to the sensitivity to cisplatin in CESC and also associated with the overall survival of doxorubicin-treated patients in ovarian serous cystadenocarcinoma (OV). We also found that CDK1 was correlated with drug response to gemcitabine in bladder urothelial carcinoma (BLCA), doxorubicin in breast invasive carcinoma (BRCA), and vinorelbine in OV.
[Figure 1: two Kaplan-Meier panels. A: "Survival on Fluorouracil by PTEN expression in PAAD". B: "Survival on Fluorouracil by SRC expression in PAAD". Each panel plots survival probability (%) against time for the low-expressed and high-expressed classes. The panels report p = 0.094 and p = 0.021.]
Fig. 1. Kaplan-Meier curves of overall survival for patients treated with fluorouracil at low or high expressed classes stratified by PTEN or SRC in the PAAD.
2.2 Literature Support of Predictive Protein Markers
To assess whether previous research supported the identified protein markers predictive of drug response, we conducted a comprehensive literature survey of the PubMed database for each of the 65 protein-drug combinations. We found supportive evidence for multiple protein-drug combinations in various cancer contexts. In particular, our analysis suggested that high expression of CDKN1B was able to increase drug response to gemcitabine in pancreatic adenocarcinoma (PAAD). This is consistent with previous studies showing that the re-expression of CDKN1B was related to the sensitization of pancreatic cancer cells to gemcitabine, leading to a significant induction of apoptosis, which could be a superior potential treatment for pancreatic cancer [14,15]. Another example with literature support is PTEN. PTEN was first discovered as a tumor suppressor, and its loss of function is strongly associated with tumor growth and survival. Figure 1A shows how PTEN expression correlated with survival of PAAD patients in TCGA: PTEN over-expression led to increased sensitivity to fluorouracil. This observation is supported by previous studies, which showed that PTEN was involved in promoting 5-fluorouracil-induced apoptosis and that reduced expression of PTEN was associated with increased malignancy grade in PAAD, whereas maintenance of PTEN expression showed a trend toward longer survival [16]. In addition, it has been shown that the inhibition of TAP subsequently promoted the expression of PTEN, which increases sensitivity to chemotherapeutic agents in cancer [17]. A third example is VEGFR2, which was previously reported to be predictive of sorafenib efficacy in patients with metastatic renal cell carcinoma (mRCC) and was associated with longer overall survival of patients treated with sorafenib [18].
In our analysis, we found that the repressed VEGFR2 resulted in prolonged survival outcomes of patients exposed to sorafenib in Kidney renal clear cell carcinoma (KIRC), which reveals the potential prediction of VEGFR2 on gemcitabine in other diseases.
[Figure 2 plot residue. Panels A and B: Kaplan-Meier curves, survival probability (%) versus time, for the low-expressed and high-expressed classes of Bcl-2 (p = 0.0041) and BCL2 (p = 0.00013). Panel C: protein expression versus gene expression with a fitted regression line (R² = 0.59031, p < 0.0001).]
Fig. 2. Correlation between Bcl-2 and BCL2 in expression level and survival outcomes. A-B. Kaplan-Meier curves of overall survival for patients treated with fluorouracil at low or high expressed classes stratified by Bcl-2 or BCL2 in the HNSCC. C. Consistency of Bcl-2 and BCL2 correlation in expression levels for patients in HNSCC-carboplatin group.
We also found literature reporting a direction of survival impact opposite to ours for two of the protein-drug combinations we identified. Our analysis suggested that decreased SRC was related to poor overall survival of PAAD patients treated with fluorouracil, as shown in Fig. 1B. However, a recent study that involved fluorouracil and a few other drugs showed that up-regulation of SRC expression in some pancreatic ductal adenocarcinoma (PDAC) patients was associated with relatively poor patient outcome [19]. The second inconsistency was related to BCL2. Our analysis demonstrated that over-expression of BCL2 was associated with better survival of patients with head and neck squamous cell carcinoma (HNSCC) exposed to carboplatin (Fig. 2A), and high expression of the BCL2-coding gene was also associated with prolonged overall survival of patients with HNSCC (Fig. 2B). In contrast, a previous study observed that BCL2 could inhibit apoptosis induced by cisplatin, carboplatin and paclitaxel, making HNSCC cells that express BCL2 resistant to rapamycin, carboplatin and paclitaxel [20]. Despite these inconsistencies in the direction of correlation with survival, the literature did indicate the relevance of our identified proteins to drug responses in cancer patients.
2.3 Correlation Between Predictive Proteins and Their Coding Genes
To understand the roles of drug-specific proteins during carcinogenesis and pharmacotherapy, we investigated the regulatory mechanism of the identified protein markers by examining their corresponding coding genes. We performed survival analysis on the genes encoding the protein markers, in the same cancer-drug context where the protein markers were identified. Specifically, for each of the 65 identified protein-cancer-drug combinations, we extracted the binarized gene expression data of the corresponding gene for patients in that cancer-drug group, stratified the patients into high-expressed and low-expressed classes according to the binarized gene expression data, and performed a log-rank test to examine whether there was a significant difference in survival outcome between the two classes. We applied a p-value threshold of < 0.05 to identify genes whose expression was also predictive of drug-specific survival outcomes. As in our analysis of proteins, survival analysis was only performed on a gene if there were at least 5 highly-expressed patients and 5 lowly-expressed patients in the corresponding cancer-drug group. We identified 7 genes whose expression was also predictive of drug-specific survival in the same context as the associated protein-cancer-drug combination. This result suggests that ∼11% (7 of 65) of the identified proteins also showed significance in their corresponding genes. To elucidate the relationships between significant proteins and their corresponding genes in each cancer-drug group, we examined the correlation between the protein and gene expression levels in each of the 7 significant protein-gene pairs. Correlation analysis was performed between log-transformed gene expression data and protein expression data using the R function ‘lm’.
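The per-gene test described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code (the study used the R package ‘survival’): the function names `logrank_test` and `gene_is_predictive` and the list-based data layout are hypothetical, chosen only for the example.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value).

    times_*  : follow-up times; events_* : 1 if death observed, 0 if censored.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e, _ in data if e}):  # distinct event times
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk in A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk in B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of the deaths in group A
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def gene_is_predictive(times, events, is_high, min_per_class=5, alpha=0.05):
    """Apply the at-least-5-patients-per-class rule, then the p < 0.05 threshold.

    Returns None when the cancer-drug group is too small to be tested.
    """
    high = [(t, e) for t, e, h in zip(times, events, is_high) if h]
    low = [(t, e) for t, e, h in zip(times, events, is_high) if not h]
    if len(high) < min_per_class or len(low) < min_per_class:
        return None
    _, p = logrank_test([t for t, _ in high], [e for _, e in high],
                        [t for t, _ in low], [e for _, e in low])
    return p < alpha
```

Returning `None` for small groups keeps "not tested" distinct from "tested and not significant", which matters when counting how many of the 65 combinations were actually evaluated.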
Among the 7 protein markers whose gene expression also correlated with survival in the same cancer-drug groups, 4 (BCL2, CCNE2, ETS1, GAB2) showed a positive correlation between gene expression level and protein expression level, while the remaining 3 (MAPK3, TIGAR, CTNNB1) showed a negative correlation. We also examined the consistency between the survival analyses based on the proteins and the genes. For a particular protein-cancer-drug combination whose corresponding gene was also predictive of drug-specific survival, we examined the direction of their correlation with survival outcome. If high expression of the protein led to better survival in the cancer-drug group, we considered the protein to be positively correlated with survival outcome. If high expression of the corresponding gene also led to better survival in the cancer-drug group, the gene was also positively correlated with survival outcome. In this case, the protein and its corresponding gene were consistent in their directions of the survival outcome. However, if high protein expression and low gene expression led to better survival, or low protein expression and high gene expression led to better survival, the protein and its corresponding gene were inconsistent in their directions of the survival outcome. As in the correlation analysis above, out of the 7 proteins whose corresponding genes were also predictive of drug-specific survival outcomes, 4 showed consistent survival directions between genes and proteins, whereas the remaining 3 showed inconsistent survival directions. This is not surprising, given mixed reports in the literature on the concordance and discordance between gene expression and protein expression in various contexts [21,22].
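The two comparisons in this section, the sign of the protein-versus-gene relationship and the agreement of survival directions, can be illustrated with a short Python sketch. It stands in for the R linear model used in the study; the function names are hypothetical, and the `log2(x + 1)` transform is an assumption, since the text does not state the base or offset of the log-transformation.

```python
import math


def linear_fit(x, y):
    """Least-squares fit y = slope * x + intercept. Returns (slope, intercept, r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy * sxy / (sxx * syy)


def expression_correlation(gene_expr, protein_expr):
    """Sign of the protein-versus-gene relationship: 'positive' or 'negative'."""
    log_gene = [math.log2(g + 1.0) for g in gene_expr]  # assumed transform
    slope, _, _ = linear_fit(log_gene, protein_expr)
    return "positive" if slope > 0 else "negative"


def survival_directions_consistent(protein_high_is_better, gene_high_is_better):
    """True when high protein and high gene expression point the same way."""
    return protein_high_is_better == gene_high_is_better
```

For a simple regression with one predictor, the sign of the slope and the sign of the Pearson correlation coincide, so the slope alone is enough to label a pair as positively or negatively correlated.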
3 Materials and Methods
3.1 Data Access
TCGA protein expression data and gene expression data were obtained from the Genomic Data Commons (GDC) database using the GDC Data Transfer Tool. Clinical data were also downloaded from GDC, which included patients’ drug treatment records and survival outcomes. After removing duplicates in the molecular data and filtering for samples with treatment and survival data, we finally used a total of 31 cancer types in this study.
3.2 Data Preprocessing
The gene expression data downloaded from TCGA had been normalized by FPKM-UQ, and we subsequently log-transformed them. The protein expression data available from TCGA had already been properly normalized and transformed. For each gene and each protein, we used the StepMiner algorithm [13] to compute a global threshold for all patients across all cancer types. Specifically, we sorted the expression data across all patients from low to high for each gene or protein, and then fitted a step function that minimizes the squared error between the original and the fitted values. Using the threshold, the normalized protein and gene expression data were binarized, so that patients could be divided into two classes (high-expressed class vs. low-expressed class) based on the expression level of each individual protein and gene feature.
3.3 Survival Analysis
For each protein, patients who had the same cancer and received the same drug were stratified into highly- or lowly-expressed classes according to the binarized data of that protein. We used the log-rank test to compare survival between patients in the highly- and lowly-expressed classes. The Benjamini-Hochberg procedure was used to control the false discovery rate (FDR). Proteins with FDR < 0.1 were identified as drug-specific markers whose expression levels were predictive of patients’ survival outcome in a drug-specific manner. Kaplan-Meier analysis and log-rank tests in this study were conducted using the R package ‘survival’.
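Two steps from Sects. 3.2 and 3.3 lend themselves to a short sketch: the step-function threshold used for binarization and the Benjamini-Hochberg adjustment. The Python below is a simplified illustration, not the StepMiner implementation of [13]. Taking the midpoint between the two fitted levels as the threshold is an assumption, and the function names are hypothetical.

```python
def step_threshold(values):
    """Fit a single-step function to the sorted values by least squares.

    Returns the midpoint between the two fitted levels (assumed convention).
    The search is quadratic in the number of values, which keeps the code
    short; prefix sums would make it linear.
    """
    v = sorted(values)
    n = len(v)
    if n < 2:
        raise ValueError("need at least two values")
    best = None  # (sse, low level, high level)
    for k in range(1, n):  # step placed between v[k-1] and v[k]
        low = sum(v[:k]) / k
        high = sum(v[k:]) / (n - k)
        sse = sum((a - low) ** 2 for a in v[:k]) + sum((a - high) ** 2 for a in v[k:])
        if best is None or sse < best[0]:
            best = (sse, low, high)
    return (best[1] + best[2]) / 2.0


def binarize(values, threshold):
    """1 for the high-expressed class, 0 for the low-expressed class."""
    return [1 if a > threshold else 0 for a in values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With the adjusted values in hand, the marker selection rule of this section reduces to keeping the entries whose adjusted value is below 0.1.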
3.4 Literature Search
We searched the PubMed database for articles that mentioned proteins interacting with drugs in the cancer-drug context from which the proteins were identified. We used a Python script with the Bio.Entrez package from Biopython to programmatically search the National Library of Medicine PubMed database (http://www.ncbi.nlm.nih.gov/pubmed). The search keywords were the drug AND the protein marker in all fields, including the title, abstract and main text of the articles.
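A search of this kind might look like the sketch below. It is an illustration only: the authors' script is not shown in the paper, so the helper names `build_query` and `search_pubmed`, the quoting of terms, and the `retmax` value are assumptions. `Entrez.esearch` and `Entrez.read` are the Biopython calls for querying and parsing an E-utilities search.

```python
def build_query(drug, protein):
    """PubMed query for 'drug AND protein' across all fields."""
    return f'"{drug}"[All Fields] AND "{protein}"[All Fields]'


def search_pubmed(drug, protein, email, retmax=20):
    """Return a list of PubMed IDs matching the drug-protein query.

    Requires Biopython and network access. The import is done lazily so that
    build_query can be used without Biopython installed.
    """
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=build_query(drug, protein),
                            retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])
```

A call such as `search_pubmed("gemcitabine", "CDKN1B", "you@example.org")` would then return candidate article IDs for manual review; the email address here is a placeholder.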
4 Conclusion
In this study, we integrated multiple data types in TCGA to perform survival analysis for patients who had the same type of cancer and were exposed to the same drug. This analysis identified predictive protein markers whose expression levels are associated with drug-specific survival outcomes in various cancer types. Notably, our results included proteins that have been previously reported to be predictive biomarkers for drug sensitivity and resistance in cancers, as well as novel ones that have not been proposed in the literature. In addition, we examined gene expression of the identified proteins in terms of both the correlations between their expression levels and their correlations with survival. Overall, the drug-specific proteins identified in this analysis may be effective biomarkers predictive of drug response and survival outcomes in cancers. Further validation of these protein markers can help guide clinical decisions for individual patients.
Acknowledgement. This work was supported by funding from the National Science Foundation (CCF1552784 and CCF2007029). P.Q. is an ISAC Marylou Ingram Scholar, a Carol Ann and David D. Flanagan Faculty Fellow, and a Wallace H. Coulter Distinguished Faculty Fellow.
References
1. Latini, A., Borgiani, P., Novelli, G., Ciccacci, C.: miRNAs in drug response variability: potential utility as biomarkers for personalized medicine. Pharmacogenomics 20(14), 1049–1059 (2019)
2. Li, B., He, X., Jia, W., Li, H.: Novel applications of metabolomics in personalized medicine: a mini-review. Molecules 22(7), 1173 (2017)
3. Arbitrio, M., et al.: Pharmacogenomics biomarker discovery and validation for translation in clinical practice. Clin. Transl. Sci. 14(1), 113–119 (2021)
4. Fu, Q., et al.: miRomics and proteomics reveal a miR-296-3p/PRKCA/FAK/Ras/c-Myc feedback loop modulated by HDGF/DDX5/β-catenin complex in lung adenocarcinoma. Clin. Cancer Res. 23(20), 6336–6350 (2017)
5. Hong, E.P., Park, J.W.: Sample size and statistical power calculation in genetic association studies. Genom. Inf. 10(2), 117 (2012)
6. Tomczak, K., Czerwińska, P., Wiznerowicz, M.: The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. 19(1A), A68 (2015)
7. Spainhour, C., Qiu, P.: Identification of gene-drug interactions that impact patient survival in TCGA. BMC Bioinf. 17(1), 409 (2016)
8. Spainhour, C., Lim, J., Qiu, P.: GDISC: integrative TCGA analysis to identify gene-drug interaction for survival in cancer. Bioinformatics 33(9), 1426–1428 (2017)
9. Neary, B., Zhou, J., Qiu, P.: Identifying gene expression patterns associated with drug-specific survival in cancer patients. Sci. Rep. 11(1), 5004 (2021)
10. Ma, Y., et al.: Predicting cancer drug response by proteomic profiling. Clin. Cancer Res. 12(15), 4583–4589 (2006)
11. Young, L.C., Campling, B.G., Cole, S.P., Deeley, R.G., Gerlach, J.H.: Multidrug resistance proteins mrp3, mrp1, and mrp2 in lung cancer: correlation of protein levels with drug response and messenger rna levels. Clin. Cancer Res. 7(6), 1798–1804 (2001)
12. Ginsburg, G.S., Willard, H.F.: Essentials of Genomic and Personalized Medicine. Academic Press, Cambridge (2009)
13. Sahoo, D., Dill, D.L., Tibshirani, R., Plevritis, S.K.: Extracting binary signals from microarray time-course data. Nucl. Acids Res. 35(11), 3705–3712 (2007)
14. Khan, M.A., Zubair, H., Srivastava, S.K., Singh, S., Singh, A.P.: Insights into the role of microRNAs in pancreatic cancer pathogenesis: potential for diagnosis, prognosis, and therapy. In: Santulli, G. (ed.) microRNA: Cancer. AEMB, vol. 889, pp. 71–87. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23730-5_5
15. Chen, X., et al.: Therapeutic effects of argyrin f in pancreatic adenocarcinoma. Cancer Lett. 399, 20–28 (2017)
16. Ni, S.: Cbx7 suppresses cell proliferation, migration, and invasion through the inhibition of pten/akt signaling in pancreatic cancer. Oncotarget 8(5), 8010 (2017)
17. Tian, Y., et al.: Metformin mediates resensitivity to 5-fluorouracil in hepatocellular carcinoma via the suppression of yap. Oncotarget 7(29), 46230 (2016)
18. Hutson, T.E., et al.: Randomized phase III trial of temsirolimus versus sorafenib as second-line therapy after sunitinib in patients with metastatic renal cell carcinoma. J. Clin. Oncol. 32(8), 760 (2014)
19. Abrams, S.L., et al.: Introduction of wt-tp53 into pancreatic cancer cells alters sensitivity to chemotherapeutic drugs, targeted therapeutics and nutraceuticals. Adv. Biol. Regul. 69, 16–34 (2018)
20. Aissat, N., et al.: Antiproliferative effects of rapamycin as a single agent and in combination with carboplatin and paclitaxel in head and neck cancer cell lines. Cancer Chemother. Pharmacol. 62(2), 305–313 (2008)
21. Gygi, S.P., Rochon, Y., Franza, B.R., Aebersold, R.: Correlation between protein and MRNA abundance in yeast. Molec. Cell. Biol. 19(3), 1720–1730 (1999)
22. Chen, G., et al.: Discordant protein and MRNA expression in lung adenocarcinomas. Molec. Cell. Proteom. 1(4), 304–313 (2002)
23. Munoz, M., Henderson, M., Haber, M., Norris, M.: Role of the MRP1/ABCC1 multidrug transporter protein in cancer. IUBMB Life 59(12), 752–757 (2007)
Diabetic Retinopathy Grading Base on Contrastive Learning and Semi-supervised Learning
Yunchao Gu1,3,4(B), Xinliang Wang1, Junjun Pan1,2, and Zhong Zhou1
1 State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
{guyunchao,wangxinliang,pan_junjun}@buaa.edu.cn
2 Peng Cheng Laboratory, Shenzhen 518000, China
3 Hangzhou Innovation Research Institute, Beihang University, Hangzhou 100191, China
4 Beijing Advanced Innovation Center for Big Data and Brain Computing (BDBC), Beihang University, Beijing 100191, China
Abstract. Diabetic retinopathy (DR) detection based on deep learning is a powerful tool for early screening of DR. Although several automatic DR grading algorithms have been proposed, their performance is still limited by the characteristics of DR lesions and grading criteria, and by coarse-grained image-level labels. In this paper, we propose a novel approach based on contrastive learning and semi-supervised learning to overcome these limitations. We first employ contrastive learning to address the inter-class and intra-class differences in DR grading. This enables the model to identify the unique lesion features on each DR color fundus image and strengthens the feature expression for different kinds of lesions. Then we use a small open-source dataset with pixel-level annotations to train a lesion segmentation model, in order to provide fine-grained pseudo labels for fundus images that have only image-level labels. Meanwhile, we design a pseudo-label attention structure and a deep supervision method to increase the attention of the model to lesion features and improve the grading performance. Experiments on the open-source DR grading datasets EyePACS, Messidor, IDRiD, and FGADR demonstrate the effectiveness of the proposed method and show results superior to previous methods.
Keywords: Deep learning · Diabetic retinopathy · Disease grading · Lesion segmentation
1 Introduction
In recent years, medical image analysis and processing based on deep learning have made remarkable progress [4,16,19]. Disease classification based on convolutional neural networks and lesion recognition based on object detection and semantic segmentation can provide great help for the clinical diagnosis of diseases, especially for disease grading and early screening of fundus diseases [20,25,26]. Among the many fundus diseases, diabetic retinopathy (DR) is one with high incidence that can lead to blindness. However, diagnosing diabetic retinopathy often takes a lot of time and energy from professional ophthalmologists. Therefore, it is necessary to use deep learning methods to grade diabetic retinopathy automatically, to improve diagnostic efficiency and reduce the burden on ophthalmologists.
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 68–79, 2021. https://doi.org/10.1007/978-3-030-91415-8_7
In recent years, many researchers have used deep learning to study DR grading. To find different lesions in fundus images automatically, [17] designed a deep learning method using image patches. Inspired by the visual attention mechanism, [25] proposed Zoom-In-Net, which used a main network to extract the visual features of fundus images, an attention network to direct the model to potential lesion areas, and finally a crop network to further process high-resolution lesion areas, thus improving the accuracy of DR grading. CABNet was designed by [12] to deal with the problems of tiny lesions and unbalanced data distribution, and can be easily embedded in neural networks such as ResNet [14], DenseNet [15] and Xception [8]. [18] used a central sample detector to improve the robustness of the model and adopted an information fusion method based on an attention mechanism to improve the performance of lesion recognition. [27] proposed a collaborative learning method to improve the performance of disease classification and lesion segmentation through semi-supervised learning with an attention mechanism.
Although the above methods can complete automatic DR grading to a certain extent and achieve considerable performance, the overall effect is still limited by (1) the characteristics of DR lesions and grading criteria, and (2) coarse-grained image-level annotation. For the first point, the international standard of diabetic retinopathy classification in Table 1 shows that even when the grade is the same, the expression of lesions can differ. For example, two fundus images of grade 3 may show IRMA lesions on one and intraretinal hemorrhages on the other. That is to say, the DR grading task has both inter-class and intra-class differences. As for the second point, most existing public DR grading datasets only provide image-level annotation, which makes it difficult for a CNN to capture small lesions.

Table 1. Grades of DR according to the international clinical guidelines for diabetic retinopathy [3]. IRMA stands for intraretinal microvascular abnormalities.

| Grade | Grading criterion |
|---|---|
| 0 - No DR | No abnormalities |
| 1 - Mild | Microaneurysms |
| 2 - Moderate | More than just microaneurysms but less than Grade 3 |
| 3 - Severe | Intraretinal hemorrhages or definite venous beading or IRMA |
| 4 - Proliferative DR | Neovascularization or preretinal hemorrhage |

In this paper, we propose a DR grading method based on contrastive learning [6,13] and semi-supervised learning [23] to solve the above problems. Contrastive learning is a kind of representation learning that regards each sample and its augmented samples as positive samples, and other images as negative samples. It pulls positive samples closer together and pushes positive and negative samples apart, which is consistent with the differences between classes of DR. In the implementation, we use the cross-entropy loss to enable the model to classify different grades, and use a contrastive loss to discriminate between different samples and lesion representations. In addition, we train a lesion segmentation model on small-scale fine-grained lesion data, and generate pseudo labels for large-scale image-level fundus images. Moreover, we design a pseudo-label attention module and train the CNN simultaneously with deep supervision to improve the performance of DR grading. The pseudo-label attention method provides fine-grained lesion attention for the DR classification network. The deep supervision method plays a similar role, and supervises the feature map in the deep part of the network to accelerate the convergence of the model. We verify the effectiveness of the proposed method with extensive experiments on four open-source datasets. The rest of the article is structured as follows. Section 2 introduces our contrastive learning and semi-supervised learning methods for DR grading; Sect. 3 introduces the open-source datasets, evaluation metrics and experimental results; Sect. 4 presents the conclusion and discussion.
2 Method
2.1 Contrastive Learning Based DR Grading
Contrastive learning [6,7,13] is a representation learning method and a powerful tool for self-supervised learning [10,11]. It maximizes the distance between heterogeneous samples and minimizes the distance between similar samples by using a contrastive loss. As mentioned in Sect. 1, the DR grading task has both inter-class and intra-class differences. Treating each sample as its own class and maximizing the distance between it and other samples addresses the intra-class differences; treating all samples of one grade as the same class and maximizing their distance from other classes addresses the inter-class differences. We first use contrastive learning to address the intra-class differences. Specifically, the loss is calculated as follows:
$$l(x, x') = -\log \frac{e^{\mathrm{sim}(f(x), f(x'))/t}}{e^{\mathrm{sim}(f(x), f(x'))/t} + \sum_{I \in S^-} e^{\mathrm{sim}(f(x), f(I))/t}}, \tag{1}$$
where $x$ and $x'$ are a fundus image and its augmented form, and $t$ is the temperature parameter. Data augmentation methods include horizontal flip, vertical flip, random rotation, and contrast enhancement. Although the image is changed by augmentation, the lesion features on $x'$ are still consistent with those on $x$. Thus, $x$ and $x'$ can be regarded as one category, and all other samples (such as $I$ in Eq. (1)) and their augmented forms can be regarded as other categories, which form the set $S^-$. The similarity in Eq. (1) is the cosine similarity, $\mathrm{sim}(a, b) = a^{T} b / (\|a\| \|b\|)$. Minimizing the loss increases the similarity of the feature expressions of $x$ and $x'$ and reduces the similarity between $x$ and other fundus images, which addresses the intra-class differences in DR grading (because images with the same grade are also regarded as other categories in contrastive learning). Following [5], we add a projection head consisting of fully connected layers after the last feature map. Eq. (1) gives the loss for one ordered pair; the contrastive loss of a sample and its augmented form is calculated by:

$$L_{CL} = \frac{1}{|S^+|} \sum_{(x, x') \in S^+} \left[ l(x, x') + l(x', x) \right], \tag{2}$$

where $S^+$ is the set of positive pairs, each formed by an image $x$ and its augmented form $x'$. In addition, we use the cross-entropy loss to address the inter-class differences of DR grading, defined as:

$$L_{CE} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right], \tag{3}$$

where $m$ is the total number of samples, $y_i$ is the ground truth and $p_i$ is the prediction of the $i$-th image. The total loss of this part is $L_{CL} + L_{CE}$.

Fig. 1. The overall structure of our proposed method. There are three parts: (a) contrastive learning based lesion segmentation model (UNet structure in blue color); (b) contrastive learning based DR grading network (five stages of ResNet in yellow color) and (c) pseudo-label attention and deep supervision method (five stages of ResNet in green color). Note that (a) needs to be pre-trained with pixel-level annotation and uses the mask ground truth for calculating the BCE loss. When (a) is used as a pseudo-label extractor, its outputs are pseudo labels. (Color figure online)

2.2 Contrastive Learning Based Lesion Segmentation
Semantic segmentation works at a finer granularity, and a segmentation model can locate tiny lesions on fundus images. However, the pixel-level annotation required by semantic segmentation is difficult to obtain, especially for medical images, where it takes a lot of time and experience from professional doctors. Therefore, we adopt a semi-supervised method: we use only a few fine annotations to train the segmentation model, and use pseudo labels to provide fine-grained classification information for image-level data. Among the many semantic segmentation models, UNet adapts well to medical images. Because of the fuzzy boundaries and complex gradients of medical images, it is necessary to keep high-resolution information. Meanwhile, low-resolution information is also needed to provide intuitive semantics. These characteristics match UNet's U-shaped structure and skip connections, so UNet is used as the semantic segmentation model in this paper, as shown in Fig. 1(a). The first part of UNet is an encoder, which extracts features from fundus images, and the second part is a decoder, which decodes abstract features and generates the mask. The input of the encoder is a fundus image, and the feature map carries strong lesion semantic information after layer-by-layer abstraction by convolutional and pooling layers. Therefore, we apply contrastive learning after the last stage of ResNet in the encoder, treating each sample as an independent category as described in Sect. 2.1, so that the network can learn unique lesion features. We do not add extra contrastive learning supervision to the decoder because we think that the decoder is partially influenced by the final binary cross-entropy loss to fit the mask annotation, which only takes values of 0 and 1, and thus loses the true visual features of lesions. The loss of this part is defined as $L_{CL\_Seg} + L_{BCE}$, where $L_{CL\_Seg}$ is the same as Eq. (2) and $L_{BCE}$ is the binary cross-entropy loss between the output of UNet and the ground truth.
2.3 Pseudo-label Attention and Deep Supervision
As shown in Fig. 1(a), the semantic segmentation model described in Sect. 2.2 can generate pseudo labels of six types of lesions for a large number of fundus images that are only labeled at image level. We design a pseudo-label attention structure and a deep supervision method that use the pseudo labels to provide fine-grained lesion information for the CNN in DR grading. The pseudo-label attention structure is shown in Fig. 1(c). In addition to the CNN for extracting image features, we use an additional CNN (called CNN') for extracting the mask features of lesions. The input of CNN' is the pseudo mask, and its feature map is obtained after the first stage of ResNet. Because the values on the pseudo-label mask are only 0 and 1, CNN' cannot directly extract semantically rich information from it. Therefore, we add fundus image features to the CNN' branch. The first part of the pseudo-label attention method combines the lesion location features of the pseudo label with the fundus image features extracted by the CNN, and then continues to abstract them layer by layer through convolutional and pooling layers. This yields an attention map, which we multiply with the feature map generated by the CNN branch, using a residual structure to prevent over-fitting. Finally, after a fully connected layer, we perform DR grading. In addition, we use deep supervision to make the model pay more attention to the lesion area. Because features in the deep layers of a CNN have been highly abstracted, we no longer distinguish specific lesion categories, but merge all lesion areas into one supervision mask and use the BCE loss as the loss function, denoted L_D in Fig. 1(c).
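The contrastive loss of Sect. 2.1 (Eqs. (1) and (2)) can be written out in a few lines of plain Python to make the roles of the positive pair and the negative set concrete. This is a didactic sketch rather than the training code: feature vectors are plain lists standing in for projection-head outputs, the temperature value 0.5 is an assumed default (the text does not give one), and the function names are hypothetical.

```python
import math


def cosine_sim(a, b):
    """sim(a, b) = a^T b / (||a|| ||b||)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def pair_loss(z, z_pos, negatives, t=0.5):
    """Eq. (1): loss for anchor z, its positive z_pos and the negative set."""
    pos = math.exp(cosine_sim(z, z_pos) / t)
    neg = sum(math.exp(cosine_sim(z, z_neg) / t) for z_neg in negatives)
    return -math.log(pos / (pos + neg))


def contrastive_loss(features, augmented, t=0.5):
    """Eq. (2): symmetric pair loss averaged over the positive pairs.

    features[i] and augmented[i] form a positive pair; every other image and
    its augmented form act as negatives for that pair.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        negatives = [features[j] for j in range(n) if j != i]
        negatives += [augmented[j] for j in range(n) if j != i]
        total += pair_loss(features[i], augmented[i], negatives, t)
        total += pair_loss(augmented[i], features[i], negatives, t)
    return total / n
```

The loss is small when each image is close to its own augmented form and far from all other images, and grows when an image resembles another sample more than its own augmentation.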
3 Experimental Results
3.1 Settings
Datasets. In the experiments, we used four open-source datasets: the large-scale dataset EyePACS [1], the high-quality, high-resolution datasets Messidor [9] and IDRiD [21], and FGADR [28], which has pixel-level annotation. EyePACS includes 35,126 fundus images for training and 53,576 images for testing. It consists of five categories: No DR, Mild, Moderate, Severe and Proliferative DR. Although EyePACS has a large amount of data, its collection environment and equipment lead to low image quality. The Messidor dataset contains 1,200 color fundus images with high resolution (2000 × 2000) in four categories labeled from 0 to 3. IDRiD is the dataset of the fundus disease classification competition held by ISBI in 2018. Its quality and resolution are very high, but the amount of data is small, with 413 samples for training and 103 for testing. It has the same five categories as EyePACS. FGADR is an open-source dataset with pixel-level annotations. It has 1,842 color fundus images with both fine-grained and image-level annotations, and the lesion types include MA (microaneurysms), HE (hemorrhages), SE (soft exudates), EX (hard exudates), IRMA and NV (neovascularization).
Metrics. Over years of development in fundus image analysis, relatively fixed evaluation metrics have been established for the different datasets. QWK (quadratic weighted kappa) is used for EyePACS. Messidor is usually treated as a binary classification problem, so Acc (accuracy) and AUC (area under the curve) are used. IDRiD follows its competition rules and uses Acc as the evaluation metric. DICE (Dice similarity coefficient) is used for segmentation tasks on FGADR.
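For readers unfamiliar with the two less common metrics, the following Python sketch computes QWK for integer grades and the Dice coefficient for binary masks using their standard definitions. The function names, the flat-list mask representation and the handling of degenerate cases (returning 1.0 when there is nothing to disagree on) are assumptions made for the example.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Quadratic weighted kappa for integer grades in 0 .. num_classes - 1."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(row[j] for row in observed) for j in range(num_classes)]
    disagreement = chance = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            disagreement += weight * observed[i][j]
            # Expected count if the two ratings were independent.
            chance += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 if chance == 0.0 else 1.0 - disagreement / chance


def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 |A and B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    return 1.0 if size == 0 else 2.0 * intersection / size
```

Because the weights grow with the squared distance between grades, QWK penalizes predicting grade 4 for a grade 0 image far more than predicting grade 1, which suits an ordinal task such as DR grading.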
Table 2. Quantitative evaluation results on the EyePACS dataset. We use different models and metrics to validate the performance of the models.

| Backbone | Method | QWK | Acc |
|---|---|---|---|
| ResNet18 | baseline | 0.844 | 0.837 |
| ResNet18 | Ours-CL | 0.845 | 0.835 |
| ResNet50 | baseline | 0.842 | 0.841 |
| ResNet50 | Ours-CL | 0.846 | 0.843 |
| DenseNet121 | baseline | 0.846 | 0.841 |
| DenseNet121 | Ours-CL | 0.848 | 0.844 |

Table 3. Quantitative evaluation of multi-label classification for six lesions in FGADR. Baseline is training on ResNet18, Ours-CL is our method that adds the contrastive loss to the training process. We show the results for every fold of the cross-validation experiments.

| Backbone | Method | AUC-fold0 | AUC-fold1 | AUC-fold2 | AUC-fold3 | AUC-Average |
|---|---|---|---|---|---|---|
| ResNet50 | baseline | 0.8202 | 0.8299 | 0.8119 | 0.8285 | 0.8226 |
| ResNet50 | Ours-CL | 0.8178 | 0.8334 | 0.8372 | 0.8182 | 0.8266 |
| DenseNet121 | baseline | 0.8198 | 0.8324 | 0.8159 | 0.8142 | 0.8206 |
| DenseNet121 | Ours-CL | 0.8231 | 0.8249 | 0.8229 | 0.8161 | 0.8217 |
3.2 Implementation Details
EyePACS. Because of the large amount of data in this dataset, we train for 50 epochs with a learning rate of 5e-4. The optimizer is Adam, the learning rate schedule is cosine, and the data augmentation includes horizontal flip, vertical flip, random rotation and contrast enhancement.
Messidor. We treat this dataset as a normal/abnormal problem, which is a binary classification task. Samples with grade 0 are normal and all others are abnormal. We use the model trained on EyePACS as the pre-trained model and fine-tune it on Messidor for 30 epochs. Other configurations are the same as above.
IDRiD. We use the model pre-trained on EyePACS and fine-tune it on IDRiD, with the same training strategy as for EyePACS.
FGADR. We train a semantic segmentation model on this dataset, using UNet as the network structure. We train one segmentation model for each lesion type, with a learning rate of 5e-4, the Adam optimizer and the Dice loss. In the multi-label classification task described in Sect. 3.3, the loss is the BCE loss.
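The cosine learning rate schedule mentioned for EyePACS can be illustrated as follows. The paper only states the base rate (5e-4), the epoch count (50) and that the schedule is cosine; annealing down to a minimum of zero and updating once per epoch are assumptions, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(epoch, total_epochs=50, base_lr=5e-4, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (0 .. total_epochs)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate used in each of the 50 training epochs.
schedule = [cosine_lr(epoch) for epoch in range(50)]
```

The rate starts at the base value, falls slowly at first, fastest around the middle of training, and flattens out again near the end.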
[Figure 2 image grid. Columns: fundus image, ground truth, baseline, ours. Rows: neovascularization, IRMA, hard exudate, microaneurysms, hemorrhage, soft exudate.]
Fig. 2. Qualitative evaluation results on FGADR dataset. We compare the segmentation results of baseline method and our proposed method for six types of lesions.
3.3 DR Grading with Contrastive Learning
Table 2 shows the quantitative evaluation results of using contrastive learning for DR grading on the EyePACS dataset. We use CNN models with different structures, including ResNet [14] and DenseNet [15], and verify the models at different depths. With the small model (ResNet18), the advantage of contrastive learning is not obvious: only QWK improves. With larger models, such as ResNet50 and DenseNet121, contrastive learning plays a larger role and improves both metrics. We think this is because contrastive learning strengthens the model's ability to represent images, and the features extracted and abstracted by large models are richer, so better DR grading results can be obtained.

Because EyePACS is limited by its data quality, we also run K-fold cross-validation experiments on the high-quality FGADR dataset to further verify the effectiveness of the method. In addition, since FGADR is labeled with six types of lesions at the pixel level, it is easy to derive image-level multi-label classification labels from these lesion annotations. In this experiment, we treat the task as multi-label classification and let the CNN predict the presence or absence of
each lesion. As shown in Table 3, we record the validation results of each fold in detail, and in most folds contrastive learning gives the better result. The experimental results also show that ResNet50 performs better than DenseNet121. We think this is due to the limited scale of the dataset.

Table 4. Quantitative evaluation results on the FGADR dataset. We use ResNet18 as the backbone of UNet.

Method          IRMA     NV       HE       SE       MA       EX
Baseline-UNet   0.1195   0.1372   0.3385   0.4262   0.1552   0.3156
Ours-UNet-CL    0.1235   0.1644   0.3650   0.4755   0.1664   0.3825

3.4 Lesion Segmentation with Contrastive Learning
As shown in Table 4, we use the DICE score to evaluate the segmentation results for each lesion in the FGADR dataset. Our proposed method clearly improves lesion segmentation, especially for HE, EX and SE, because these lesions have distinctive characteristics and relatively large amounts of data. For the other, rarer lesions, our method still brings a measurable improvement.

Figure 2 shows the qualitative segmentation results on the FGADR dataset. Bounding boxes in four colors mark the differences: yellow boxes show where our method is more refined than the baseline; green boxes show false positives in the baseline prediction; red boxes mark lesion areas that the baseline failed to predict but our method predicted; blue boxes mark lesions that are missing from the ground truth but are predicted by our model. These results suggest that our contrastive learning method makes the model attend to the difference between lesion and background, as well as to the differences between lesion types. To sum up, our proposed method based on contrastive learning enhances the model's ability to distinguish different lesion features and to predict each lesion accurately, which makes the segmentation results both accurate and generalizable.

3.5 Comparison with SOTA Methods
In this section, we compare our method with state-of-the-art methods. The results also verify the effectiveness of the proposed pseudo-label attention and deep supervision. First, we compare with the winners of the IDRiD competition. As shown in Table 5, the first four rows are the top four entries in the competition. Our method is clearly superior to the first-place result. Note that our result comes from a single model without an ensemble, while results submitted to competitions are often produced by multiple models.
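Tables 2 and 5 report the quadratic weighted kappa (QWK). For reference, the sketch below implements the standard QWK definition for ordinal grades in plain Python. The function name is illustrative, the five classes assume the usual DR grades 0 to 4, and this is not the evaluation code used by the authors.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Standard QWK between two integer gradings in range(num_classes).

    Note: the score is undefined (division by zero) when all labels and
    all predictions fall into one single class.
    """
    n = len(y_true)
    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic penalty: grows with the squared grade distance.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


# Made-up gradings for illustration only.
truth = [0, 1, 2, 3, 4, 2, 0]
guess = [0, 1, 2, 4, 4, 1, 0]
print(quadratic_weighted_kappa(truth, guess))
```

Because the penalty is quadratic in the grade distance, confusing grade 0 with grade 4 costs far more than confusing adjacent grades, which suits ordinal DR grading.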
Table 5. Performance comparison on the IDRiD dataset.

Methods          QWK
HarangiM1 [2]    0.5534
Mammoth [2]      0.5437
VRT [2]          0.5922
LzyUNCC [2]      0.7476
ResNet18         0.6117
Ours             0.7930

Table 6. Results of the normal/abnormal task on the Messidor dataset.

Methods            AUC     Acc
VNXK [24]          0.870   0.871
CKML [24]          0.862   0.858
Expert [22]        0.922   –
Zoom-in-Net [25]   0.921   0.905
AFN [18]           0.935   –
CLSC [27]          0.943   0.922
Ours               0.948   0.920
Table 6 shows the experimental results on the Messidor dataset, which we use as a normal/abnormal classification task. We compare our method with other DR grading methods. The first three are the top 3 results in the Kaggle competition [1]. Expert [22] is the classification result of human ophthalmologists. Zoom-in-Net [25] designs a zoom-in attention method that simulates how doctors magnify and observe lesions; note that we compare with the single-model result reported in that paper. AFN [18] uses bounding-box annotation to assist DR grading. CLSC [27] uses pixel-level annotation for collaborative learning. Our method achieves good performance, especially on the AUC metric, which shows that our model can effectively process fundus images with disease.
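The normal/abnormal setup and the AUC metric used here can be illustrated with a short sketch. It assumes the grade-0-versus-rest binarization stated in Sect. 3.2 and computes AUC by pairwise comparison of abnormal and normal samples, which is equivalent to the area under the ROC curve. The function names and the example scores are made up for illustration.

```python
def is_abnormal(grade):
    """Binarize a DR grade: grade 0 is normal (0), any other grade is abnormal (1)."""
    return 1 if grade > 0 else 0


def roc_auc(labels, scores):
    """AUC as the probability that an abnormal sample outscores a normal one.

    Ties between an abnormal and a normal score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one normal and one abnormal sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical grades and predicted abnormality scores.
grades = [0, 0, 2, 3]
scores = [0.1, 0.4, 0.35, 0.8]
labels = [is_abnormal(g) for g in grades]
print(roc_auc(labels, scores))
```

The pairwise form is quadratic in the number of samples, which is acceptable for a dataset the size of Messidor; a rank-based implementation would be preferable for much larger sets.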
4 Discussion
Using deep learning to automatically classify DR is a research hotspot in medical image analysis. In this study, we observe that intra-class differences and inter-class differences coexist under the DR grading standards, and we address both by combining contrastive learning with the cross-entropy loss. To handle the small granularity of lesions in fundus images, we use a semi-supervised learning method: we train the lesion segmentation model with a few pixel-level lesion annotations and generate pseudo labels for fundus images that have only image-level labels. The pseudo-label attention method and the deep supervision method are designed to make the model pay more attention to subtle lesions. Experiments on four open-source datasets verify the effectiveness of our proposed method. In addition, the experimental results show that the boost from our method on the other three datasets is greater than on EyePACS, which may be caused by the low quality and noisy labels of EyePACS. Therefore, in future work, from the data perspective we will build a high-quality large dataset, and from the method perspective we will design a deep learning method that is robust to noise.
References
1. Kaggle diabetic retinopathy detection competition. https://www.kaggle.com/c/diabetic-retinopathy-detection
2. The leaderboard of idrid competition. https://idrid.grand-challenge.org/Leaderboard/
3. Bajwa, M.N., Taniguchi, Y., Malik, M.I., Neumeier, W., Dengel, A., Ahmed, S.: Combining fine- and coarse-grained classifiers for diabetic retinopathy detection. In: Zheng, Y., Williams, B.M., Chen, K. (eds.) MIUA 2019. CCIS, vol. 1065, pp. 242–253. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39343-4_21
4. Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D.: Self-supervised learning for medical image analysis using image context restoration. Med. Image Anal. 58, 101539 (2019)
5. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
6. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are strong semi-supervised learners (2020). arXiv preprint arXiv:2006.10029
7. Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning (2020). arXiv preprint arXiv:2003.04297
8. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017)
9. Decencière, E., et al.: Feedback on a publicly distributed image database: the messidor database. Image Anal. Stereol. 33(3), 231–234 (2014)
10. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430 (2015)
11. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations (2018). arXiv preprint arXiv:1803.07728
12. He, A., Li, T., Li, N., Wang, K., Fu, H.: Cabnet: category attention block for imbalanced diabetic retinopathy grading. IEEE Trans. Med. Imaging 40(1), 143–153 (2020)
13. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
15. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
16. Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020)
17. Lam, C., Yu, C., Huang, L., Rubin, D.: Retinal lesion detection with deep learning using image patches. Invest. Ophthalmol. Vis. Sci. 59(1), 590–596 (2018)
18. Lin, Z., et al.: A framework for identifying diabetic retinopathy based on anti-noise detection and attention-based fusion. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 74–82. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_9
19. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
20. Mitani, A., et al.: Detection of anaemia from retinal fundus images via deep learning. Nat. Biomed. Eng. 4(1), 18–27 (2020)
21. Porwal, P., et al.: Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research. Data 3(3), 25 (2018)
22. Sánchez, C.I., Niemeijer, M., Dumitrescu, A.V., Suttorp-Schulten, M.S., Abramoff, M.D., van Ginneken, B.: Evaluation of a computer-aided diagnosis system for diabetic retinopathy screening on public data. Invest. Ophthalmol. Vis. Sci. 52(7), 4866–4871 (2011)
23. van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Mach. Learn. 109(2), 373–440 (2019). https://doi.org/10.1007/s10994-019-05855-6
24. Vo, H.H., Verma, A.: New deep neural nets for fine-grained diabetic retinopathy recognition on hybrid color space. In: 2016 IEEE International Symposium on Multimedia (ISM), pp. 209–215. IEEE (2016)
25. Wang, Z., Yin, Y., Shi, J., Fang, W., Li, H., Wang, X.: Zoom-in-Net: deep mining lesions for diabetic retinopathy detection. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 267–275. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_31
26. Zhao, H., Li, H., Maurer-Stroh, S., Guo, Y., Deng, Q., Cheng, L.: Supervised segmentation of un-annotated retinal fundus images by synthesis. IEEE Trans. Med. Imaging 38(1), 46–56 (2018)
27. Zhou, Y., et al.: Collaborative learning of semi-supervised segmentation and classification for medical images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2079–2088 (2019)
28. Zhou, Y., Wang, B., Huang, L., Cui, S., Shao, L.: A benchmark for studying diabetic retinopathy: segmentation, grading, and transferability. IEEE Trans. Med. Imaging 40(3), 818–828 (2020)
Reinforcement Learning for Diabetes Blood Glucose Control with Meal Information

Jinhao Zhu1, Yinjia Zhang1, Weixiong Rao1(B), Qinpei Zhao1, Jiangfeng Li1, and Congrong Wang2

1 School of Software Engineering, Tongji University, Shanghai, China
{1931541,yinjiazhang,wxrao,qinpeizhao,lijf}@tongji.edu.cn
2 Shanghai Fourth People's Hospital, Shanghai, China
Abstract. The blood glucose management of diabetics is essentially a control and optimization problem. The blood glucose level of patients is mainly influenced by diet and insulin dose. The goal of blood glucose management is to continuously keep the blood glucose level of a patient in a normal range. Reinforcement learning models show good effectiveness and robustness in dealing with various nonlinear control problems. Because of the importance of diet information in blood glucose management, we introduce a reinforcement learning model for blood glucose management with meal information. A new reward function has been designed, which includes long-term abnormal blood glucose as a penalty. The proposed model has been tested with a T1D simulator. The experimental results indicate that the introduced model is better at avoiding low blood glucose levels and keeps patients at a normal blood glucose level for longer.

Keywords: Blood glucose control · Reinforcement learning · Diabetes simulator · Meal information

1 Introduction
By 2019, about 463 million people worldwide had diabetes, and the prevalence of diabetes will continue to increase [1]. Type 1 diabetes (T1D) and Type 2 diabetes (T2D) are the two major kinds of diabetes. T1D is caused by insufficient insulin secretion due to the destruction of pancreatic beta cells, while T2D is caused by insulin resistance, a condition in which the body fails to respond to insulin properly. In diabetes management, the primary objective is to keep blood glucose (BG) within the normal range and reduce the occurrence of abnormal blood glucose levels. We regard blood glucose between 70 mg/dL and 180 mg/dL as the normal range (i.e., euglycemia). Blood glucose below 70 mg/dL is considered hypoglycemia, while blood glucose above 180 mg/dL is considered hyperglycemia. Without enough insulin, blood glucose stays at a high level for a long time and causes many complications, including cardiovascular disease, damage to the nerves and eyes, and cognitive impairment [2]. On the other hand, excessive insulin intake or inadequate food intake may cause low blood glucose, which carries the risk of loss of consciousness, seizures, coma and death [3].

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 80–91, 2021. https://doi.org/10.1007/978-3-030-91415-8_8

Because there is currently no cure for diabetes [4], an effective diabetes management plan is necessary for diabetics. The plan aims to keep blood glucose within a healthy range by organizing diet, physical activity, sleep and so on. Monitoring fluctuating glucose levels is vital during management. Two main glucose monitoring approaches are available: self-monitoring of blood glucose (SMBG) and continuous glucose monitoring (CGM) [5]. SMBG measures the blood glucose level using one drop of blood from the finger. CGM uses a device with a miniaturized subcutaneous sensor, which tests blood glucose every few minutes and reports the data to a mobile device through wireless communication.

A closed-loop control system called the artificial pancreas (AP) has been developed for diabetics. It consists of three parts: a blood glucose monitor (i.e., CGM), an insulin pump that automatically delivers insulin into the body, and a control algorithm that determines the insulin dose. The development of the control algorithm relieves the burden of the AP. Traditional control algorithms such as Proportional-Integral-Derivative (PID) control, Model Predictive Control (MPC) and fuzzy logic [6] have been employed in the AP. Due to the complexity and non-linearity of the glucose kinetic process in the human body, it is difficult for traditional control methods to establish an appropriate personalized model. Reinforcement learning (RL) is an adaptive optimal control method for nonlinear systems, which has been successfully employed in control problems such as traffic signal control [7], autopilot [8] and diabetes blood glucose control [6,9].
RL can overcome the challenges related to inter- and intra-patient variability and achieve personalization of the insulin treatment. However, the application of RL in blood glucose control still suffers from problems. Reward is enough to drive behavior that exhibits the abilities studied in natural and artificial intelligence [10], so the design of the reward function for blood glucose control needs to be considered carefully. Furthermore, food intake is an important factor in avoiding low blood glucose levels. It is therefore necessary to include meal information in the RL model.

In this work, we propose an RL framework for blood glucose control with meal information (RL-meal). Specifically, we consider historical blood glucose, insulin and dietary information as influencing factors for blood glucose and take them as inputs of the RL model. RL-meal can deal with the temporal and quantitative irregularity of meals by introducing the meal information of diabetics into the state. Our experiments show that our model can greatly reduce the duration of low blood glucose levels. We also introduce a customized reward function that captures long-lasting abnormal blood glucose and applies an additional penalty. This reward function considers the historical blood glucose data and increases the time patients spend at a normal blood glucose level. The experimental results indicate that RL-meal is able to reduce the risk of hypoglycemia and increase the time that blood glucose stays in the normal range.
2 Related Works
The control algorithm is an important part of an AP system, as it makes decisions on insulin and food intake. Steil [11] evaluates the performance of PID for the AP system. The PID controller has been used in the real world: the Medtronic Hybrid Closed-Loop system uses PID with time-varying model parameters as its control algorithm [12,13]. PID requires a set of parameters representing the patient's basal rate to decide the amount of insulin to inject. However, these parameters are difficult to determine, as the majority of patients use different basal rates during the day, and a basal rate that works on one day may not work on another [11]. Another weakness of PID is that it cannot respond to diet properly: it is difficult for PID to control postprandial hyperglycemia and postprandial hypoglycemia [11].

Compared to PID, RL approaches can learn more from the patient's historical data and produce safer policies. A systematic review analyzes 29 articles on the application of RL in diabetes blood glucose control from 1990 to 2019 [9] and concludes that there is a lack of focus on aspects that influence BG levels, such as meal intake and physical activity. Sun et al. propose a dual-mode adaptive basal-bolus advisor for blood glucose control [14] and conclude that RL can provide personalized adaptive insulin optimization and can be applied to glucose control. Daskalaki et al. propose an Actor-Critic based model that predicts the daily average basal rate, with the measured minimum and maximum blood glucose levels of the day as the state [15]. Fox et al. apply a deep Q network with a discrete action space and soft Actor-Critic with a continuous action space to blood glucose control [16,17].

However, Sun et al. and Daskalaki et al. use the RL algorithm to generate an overall basal insulin rate for the whole day, which means they cannot respond in time to sudden changes in blood glucose. Fox et al. propose a real-time blood glucose control algorithm, but they take only blood glucose and insulin injection dose as the state, and the reward function in their work is defective. Existing work pays little attention to the reward function in RL. In particular, food intake information is not well included in the control system.
3 The RL-Meal

3.1 Problem Definition
The primary objective of blood glucose control is to keep the BG level within a normal range and reduce the occurrence of abnormal blood glucose levels. To apply RL algorithms to this objective, we need to formulate the problem as a Markov decision process (MDP). However, it is impracticable to obtain the internal state of diabetics. We can only acquire blood glucose data, insulin injection data, diet data and other data that can be obtained by external measurement. The blood glucose control problem is therefore essentially a partially observable MDP (POMDP), but we can regard it as an MDP given by the five-tuple (S, A, P, R, γ). In this problem, the set of states S contains the observable states of diabetics, and the set of actions A contains the possible insulin injection doses. P is the transition function, which defines the transition from the current state s ∈ S to the next state s′ ∈ S after executing the action a ∈ A. The reward function R : (s, a) → r represents the benefit of taking action a in state s. The discount factor γ ∈ [0, 1] sets the trade-off between the immediate reward and delayed rewards.
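To make the role of γ concrete, the sketch below computes the discounted return of a reward sequence, which is the quantity an RL agent in this MDP tries to maximize. The function name and the toy rewards are illustrative assumptions; the default γ = 0.99 is the discount rate reported in Sect. 4.1.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards r_0 + gamma * r_1 + gamma^2 * r_2 + ...

    Computed backwards so that each step is one multiply-add.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Toy example: the same penalty arriving now versus three steps later.
print(discounted_return([-10.0, 0.0, 0.0, 0.0]))
print(discounted_return([0.0, 0.0, 0.0, -10.0]))
```

With γ close to 1, a penalty that arrives a few steps later is discounted only slightly, so the agent is pushed to avoid delayed consequences of a dose as well as immediate ones.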
Fig. 1. The overall architecture of the RL-meal.
The overall architecture of our model is shown in Fig. 1. The original state consists of three historical records: the blood glucose sequence, the insulin injection sequence and the meal intake sequence. To extract features from the sequence data, we use the Gated Recurrent Unit (GRU), a recurrent neural network that is effective for sequence modeling. The GRU extracts the information in the sequences and passes the resulting state to the actor network and the critic network. The actor network outputs the insulin dose according to its policy distribution, and the critic network estimates the goodness of a given state-action pair. The patient injects the corresponding dose of insulin, which produces a new state sequence for the GRU and a reward for the critic network. The dotted arrows indicate the training process. The critic network and the actor network update their parameters according to the reward from the patient.

3.2 Details
State S: In our setting, the state of a diabetic consists of three parts: blood glucose, insulin injection dose and diet. Specifically, to represent the physical state of the patient, we include the previous four hours of blood glucose data $\mathbf{b}_t$ and insulin data $\mathbf{i}_t$ in the state space. We choose a 4-h window for the blood glucose and insulin data because the insulin in our model is regular insulin, which starts to work within 30 min to an hour after injection and takes about 2–4 h to reach peak effectiveness. Moreover, in order to respond rapidly to food intake that is uncertain in time and quantity, we augment the state space with the previous two hours of carbohydrate intake data $\mathbf{c}_t$. We choose a 2-h window for the carbohydrate data because the peak blood glucose level caused by carbohydrate intake usually occurs within one to two hours. The CGM device samples BG values every five minutes, which gives 48 samples within four hours and 24 samples within two hours. The state at time $t$ is $s_t = [\mathbf{b}_t, \mathbf{i}_t, \mathbf{c}_t]$, where

$$\mathbf{b}_t = [b_{t-47}, b_{t-46}, \ldots, b_t], \quad \mathbf{i}_t = [i_{t-47}, i_{t-46}, \ldots, i_t], \quad \mathbf{c}_t = [c_{t-23}, c_{t-22}, \ldots, c_t].$$

Each $b_t \in \mathbb{N}$ lies in the CGM measurement range of 40 to 400 mg/dL, and $i_t \in \mathbb{R}^{+}$.

Action A: The action is a continuous number representing the insulin injection dose for the patient. At every time step, the trained algorithm gives a recommended insulin injection dose.

Reward Function: To evaluate the goodness of the insulin injection dose in a given state, we need an appropriate reward function related to blood glucose. Blood glucose between 70 mg/dL and 180 mg/dL is regarded as normal, and the scale of the risk of hypoglycemia is very different from that of hyperglycemia. In general, hypoglycemia is more dangerous than hyperglycemia: hyperglycemia does not lead to serious consequences in a short time, but hypoglycemia may lead to headache, stupor and even sudden death within a few hours. To capture these risk characteristics, a risk function of the blood glucose is defined [18]:

$$\mathrm{risk}(b) = 10 \left( 3.5506 \left( \ln(b)^{0.8353} - 3.7932 \right) \right)^{2} \tag{1}$$

where $b$ is the blood glucose level in mg/dL. The risk function is shown in Fig. 2(a): the risk value increases rapidly in the hypoglycemia region but slowly in the hyperglycemia region. For example, risk(70) ≈ risk(280) ≈ 25 and risk(50) ≈ risk(400), which indicates that a low BG level carries a higher risk than a comparably distant high BG level.
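As a quick check of Eq. (1), the sketch below evaluates the risk function at a few glucose values. It assumes the logarithm is natural, which is what reproduces the values quoted in the text; the function name `risk` mirrors the equation, and the list of sample values is arbitrary.

```python
import math


def risk(bg):
    """Risk of a blood glucose value in mg/dL, following Eq. (1).

    Assumes a natural logarithm, which reproduces risk(70) and
    risk(280) of roughly 25 as stated in the text.
    """
    return 10.0 * (3.5506 * (math.log(bg) ** 0.8353 - 3.7932)) ** 2


for bg in (40, 50, 70, 140, 180, 280, 400):
    print(f"{bg:>3} mg/dL -> risk {risk(bg):6.2f}")
```

The asymmetry is visible directly: 70 mg/dL sits at the edge of the normal range yet scores the same risk as 280 mg/dL, which is 100 mg/dL above the upper bound.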
[Figure 2 image. Panels: (a) negative risk function; (b) reward function R(b_t).]
Fig. 2. The risk function and the reward function values with different numbers of abnormal BG values in one hour.
However, this risk function depends only on the blood glucose at one moment. Historical blood glucose information is also a key factor in evaluating the goodness of the insulin injection dose. For example, given a blood glucose sequence that has been high for a long time, we may need to inject more insulin. This kind of sequence feature is ignored by the risk function. Based on the risk function, we design a new reward function that considers the number of abnormal blood glucose readings in the previous hour:

$$R(\mathbf{b}_t) = -\mathrm{risk}(b_t) - \mathrm{duration}(\mathbf{b}_t) \tag{2}$$

$$\mathrm{duration}(\mathbf{b}_t) = \begin{cases} \dfrac{\sum_{i=1}^{11} I_{\mathrm{hyper}}(b_{t-i})\,\mathrm{risk}(b_{t-i})}{12}, & b_t > 180 \\[2ex] \dfrac{\sum_{i=1}^{11} I_{\mathrm{hypo}}(b_{t-i})\,\mathrm{risk}(b_{t-i})}{12}, & b_t < 70 \end{cases} \tag{3}$$

where $b_t$ is the current blood glucose after insulin injection and $\mathbf{b}_t$ is the blood glucose sequence of the previous hour. $I_{\mathrm{hyper}}$ is an indicator function that is one when the blood glucose is greater than 180 and zero otherwise, and $I_{\mathrm{hypo}}$ is an indicator function that is one when the blood glucose is less than 70 and zero otherwise. The duration term is a penalty for long-lasting abnormal blood glucose. We show the hyperglycemia portion of the reward function in Fig. 2(b). The x-axis is the current blood glucose level, and the different lines show the reward for different numbers of abnormal BG values in the previous hour. The more often blood glucose was abnormal in the previous hour, the lower the reward. With this reward function, given the same current blood glucose, a longer duration of abnormal blood glucose in the previous period results in a lower reward, which drives the RL model to choose an insulin dose that shortens the duration of abnormal blood glucose more quickly.

3.3 Algorithm Description
We choose the soft actor-critic (SAC) algorithm for the blood glucose control problem. One of the main reasons is that SAC can be used with a continuous action space. A discrete set of insulin injection doses is impractical as an action space, because an improper insulin dose may lead to serious consequences. Besides, SAC has been shown to be sample-efficient and competitive with other RL algorithms on continuous control problems [19]. SAC consists of two networks, the actor network and the critic network. The actor network takes the state as input and outputs a continuous action (i.e., the insulin injection dose). The critic network is trained to approximate the actual Q value of a given state-action pair. The actual Q value is the long-term benefit of the state-action pair and cannot be obtained directly.
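The Q value that the critic approximates is built from the per-step reward of Eqs. (2) and (3). A minimal sketch of that reward follows. It assumes a natural logarithm in the risk function, a history list ordered oldest to newest with the current reading last, and a duration penalty of zero when the current reading is in the normal range (a case Eq. (3) does not list). All names are illustrative.

```python
import math

HYPO, HYPER = 70, 180  # bounds of the normal range in mg/dL


def risk(bg):
    """Risk function of Eq. (1), with a natural logarithm (assumption)."""
    return 10.0 * (3.5506 * (math.log(bg) ** 0.8353 - 3.7932)) ** 2


def duration(history):
    """Penalty of Eq. (3) over the previous hour of readings.

    history -- CGM readings at 5-minute spacing, oldest first; the last
               element is the current reading b_t.
    """
    current = history[-1]
    previous = history[-12:-1]  # b_{t-11} .. b_{t-1}, at most 11 readings
    if current > HYPER:
        return sum(risk(b) for b in previous if b > HYPER) / 12
    if current < HYPO:
        return sum(risk(b) for b in previous if b < HYPO) / 12
    return 0.0  # assumption: no extra penalty inside the normal range


def reward(history):
    """Reward of Eq. (2): negative current risk minus the duration penalty."""
    return -risk(history[-1]) - duration(history)


just_high = [120] * 11 + [200]  # hyperglycemia started at this reading
long_high = [200] * 12          # hyperglycemia for the whole previous hour
print(reward(just_high), reward(long_high))
```

Both example histories end at the same reading, but the second one receives a lower reward because the abnormal state has persisted, which is the behavior the duration term is meant to encode.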
4 Experiments and Results
In this section, we evaluate the new reward function and the effect of introducing meal information into the RL model.
4.1 Experimental Setup
We use the FDA-approved UVa/Padova T1D simulator [20] in our experiments. The simulator is equipped with 30 virtual patients: 10 children, 10 adolescents and 10 adults. We use an open-source Python implementation of the simulator [21]. Four approaches are compared with RL-meal: basal-bolus, Proportional-Integral-Derivative (PID) control, PID with meal information (PID-MA), and SAC without meal information (SAC).

The basal-bolus control algorithm is designed to mimic how patients with T1D control their blood glucose in reality. The algorithm uses four inputs, denoted by $s_t = [b_t, i_t, c_t, cooldown]$, to calculate the insulin injection dose, where $b_t$ is the current blood glucose, $i_t$ is the last insulin injection, $c_t$ is the carbohydrate the patient takes and $cooldown$ is an error correction term. Besides, basal-bolus requires additional parameters describing the physical condition of the patient: the basal insulin rate $bas$, the correction factor $CF$ and the carbohydrate ratio $CR$, which can be obtained from the T1D simulator. In real life, these additional parameters can be estimated by doctors from the patient's historical diabetes-related data [22]. The insulin injection strategy of basal-bolus is defined in Eq. (4):

$$\pi(s_t) = bas + (c_t > 0) \cdot \left( \frac{c_t}{CR} + cooldown \cdot \frac{b_t - b_g}{CF} \right) \tag{4}$$

where $b_g$ is a glucose target and $cooldown$ is one if there have been no meals in the past three hours and zero otherwise. The correction factor $CF$ describes how much one unit of rapid-acting insulin will generally lower the blood glucose over two to four hours before eating. It should be noted that this parameter is not applicable within a few hours after a meal, which is why the $cooldown$ term disables the correction in that period.

PID is a widely employed algorithm for closed-loop blood glucose control, especially for the artificial pancreas [11]. The PID formula is defined in Eq. (5):

$$I_t = k_p P(b_t) + k_i I(b_t) + k_d D(b_t) \tag{5}$$

where $b_t$ is the blood glucose of the patient at time $t$ and $I_t$ is the insulin injection dose at time $t$. PID aims to make $b_t$ as close as possible to a pre-set blood glucose target $b_{target}$. It is composed of three parts: the proportional part $k_p P(b_t)$, the integral part $k_i I(b_t)$ and the derivative part $k_d D(b_t)$, where $P(b_t) = b_t - b_{target}$, $I(b_t) = \sum_{k=0}^{t} (b_k - b_{target})$ and $D(b_t) = b_t - b_{t-1}$. The weights of the three parts, $k_p$, $k_i$ and $k_d$, are hyper-parameters that must be set manually. To find the best combination of hyper-parameters, we use grid search for each patient. However, the original PID cannot utilize dietary information. In this paper, PID-MA [17] is employed to take advantage of dietary information. PID-MA combines the basal-bolus and PID algorithms: the insulin injection dose is calculated by basal-bolus at meals and by PID at other times.
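The three non-RL baselines can be sketched directly from Eqs. (4) and (5). This is an illustrative reading of the equations, not the authors' implementation: the class and function names, the parameter values, the zero derivative on the first step, the clamping of negative PID doses, and the choice to keep updating the PID state at meal times are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PatientParams:
    bas: float        # basal insulin rate
    cr: float         # carbohydrate ratio CR
    cf: float         # correction factor CF
    bg_target: float  # glucose target b_g in mg/dL


def basal_bolus(bg, carbs, no_recent_meal, params):
    """Insulin dose of Eq. (4); the bolus is added only when carbs > 0."""
    cooldown = 1.0 if no_recent_meal else 0.0
    bolus = carbs / params.cr + cooldown * (bg - params.bg_target) / params.cf
    return params.bas + (bolus if carbs > 0 else 0.0)


class PID:
    """Discrete PID controller of Eq. (5)."""

    def __init__(self, kp, ki, kd, target):
        self.kp, self.ki, self.kd, self.target = kp, ki, kd, target
        self.integral = 0.0
        self.prev_bg = None

    def step(self, bg):
        error = bg - self.target                 # P(b_t)
        self.integral += error                   # I(b_t), running sum
        # D(b_t); taken as zero on the first step (assumption)
        derivative = 0.0 if self.prev_bg is None else bg - self.prev_bg
        self.prev_bg = bg
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def pid_ma(bg, carbs, no_recent_meal, params, pid):
    """PID-MA: basal-bolus at meals, PID at all other times."""
    pid_dose = pid.step(bg)  # keep the PID state current at every step (assumption)
    if carbs > 0:
        return basal_bolus(bg, carbs, no_recent_meal, params)
    return max(0.0, pid_dose)  # insulin cannot be negative (assumption)


# Hypothetical patient parameters and gains, for illustration only.
patient = PatientParams(bas=0.02, cr=10.0, cf=50.0, bg_target=120.0)
controller = PID(kp=0.01, ki=0.0, kd=0.0, target=120.0)
print(pid_ma(170.0, 0.0, True, patient, controller))   # no meal: PID dose
print(pid_ma(170.0, 30.0, True, patient, controller))  # meal: basal-bolus dose
```

In the paper the PID gains are found by per-patient grid search; the gains above are placeholders chosen only to make the arithmetic easy to follow.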
Simulation Details. There is a meal schedule in the simulator for virtual patients. We determine the expected daily carbohydrate consumption by HarrisBenedict equation, which is used to estimate an individual’s daily basal metabolic rate (BMR): BM R = 66.5 + (13.75 ∗ w) + (5.003 ∗ h) − (6.775 ∗ a)
(6)
where w is weight in kg, h is height in cm and a is age. BMR is the expected calories for the individual. We assume that 45% of calories come from carbohydrates and one carbohydrate can generate 4 cal. The amount of food is represented by the number of carbohydrates. After calculating the daily carbohydrate intake of the patient, we divide it into breakfast, lunch, dinner and three snacks with a set of proportions. Each meal has a certain chance of being skipped. The skipping rate is 5% for breakfast, lunch and dinner and 70% for snacks. All the eating behaviors are random in time and quantity. The mealtime follows a truncated normal distribution to ensure a consistent order between meals and the amount follows a normal distribution. Training and Evaluation Parameters. We train the RL models for 300 epochs, while the non-RL algorithms require no training. RL models are trained with batch size 256 and an epoch length of 21 days, which means that we train 6048 steps in one epoch. An experience replay buffer is used to improve samples efficiency and reduce data correlation and its size is 1e6. The discount rate of future returns is 0.99. We use Adam to optimize the RL network parameters with a learning rate of 3e−4, and two layers of GRU cells with a hidden state size of 128 to extract features in historical blood glucose, insulin and dietary data. The networks1 are implemented by Pytorch. We use 14 days of simulations to validate the performance of these approaches. For the RL model, a special model selection method is used to find the best-trained model. Firstly, models with a minimum blood glucose level of less than 30 mg/dL are excluded at validation, and then we choose the model with the highest average reward as the best model. The performances of different approaches are evaluated by two metrics. One is the average risk values (Eq. 1) during the 14-day evaluation. In addition, we also use the percentage of time in different blood glucose ranges as a metric. 
We define three blood glucose ranges: the target range (euglycemia) is between 70 and 180 mg/dL, hypoglycemia is below 70 mg/dL and hyperglycemia is above 180 mg/dL.

4.2 Results
Because meal information has a great influence on blood glucose control, we conduct comparative experiments in two settings: patient states with and without meal information.

¹ https://github.com/JinhaoZHU/RL-BloogGlucoseControl.
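The time-in-range metric defined at the end of Sect. 4.1 can be illustrated with a short sketch. It assumes that the blood glucose trace is sampled at equal intervals, so that the fraction of samples equals the fraction of time; the function name and the dictionary keys are illustrative choices, not taken from the paper's code.

```python
from typing import Dict, Sequence

HYPO_LIMIT = 70.0    # mg/dL; readings below this count as hypoglycemia
HYPER_LIMIT = 180.0  # mg/dL; readings above this count as hyperglycemia


def time_in_ranges(bg_trace: Sequence[float]) -> Dict[str, float]:
    """Return the fraction of samples in each of the three glucose ranges.

    Assumes equally spaced samples, so sample fractions equal time fractions.
    """
    if not bg_trace:
        raise ValueError("bg_trace must not be empty")
    n = len(bg_trace)
    hypo = sum(1 for g in bg_trace if g < HYPO_LIMIT)
    hyper = sum(1 for g in bg_trace if g > HYPER_LIMIT)
    eu = n - hypo - hyper  # 70 <= g <= 180
    return {
        "euglycemia": eu / n,
        "hypoglycemia": hypo / n,
        "hyperglycemia": hyper / n,
    }
```

The three fractions sum to one by construction, which matches the rows of the result tables up to rounding.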
J. Zhu et al.
PID, SAC and SAC+r are the approaches that do not consider meal information, where SAC+r is SAC with the new reward function; this comparison shows the effect of the new reward function in the control model. Basal-bolus, PID-MA and RL-meal are the approaches with meal information, where RL-meal is SAC with the new reward function and meal information. We use risk and the percentage of time in euglycemia, hypoglycemia and hyperglycemia to evaluate the results. The results are the median values of 100 runs of 14-day simulations. Three patients are chosen from 10 adults, 10 adolescents and 10 children respectively. Lower risk, hypoglycemia and hyperglycemia are better, indicating that the model is good at avoiding hypoglycemia and hyperglycemia. Higher euglycemia is better, which satisfies the objective of the control model: to keep the BG level in the normal range as long as possible. Compared with risk, we pay more attention to the time in euglycemia, hypoglycemia and hyperglycemia. The approach with the best median score is in bold.

Table 1. Performances of different approaches without meal information

Patients        Models  Risk  Euglycemia  Hypoglycemia  Hyperglycemia
child#006       PID     16.6  0.6380      0.0812        0.2804
                SAC     10.3  0.6670      0.0270        0.3039
                SAC+r    9.4  0.7088      0.0196        0.2704
adolescent#006  PID     10.4  0.7881      0.0627        0.1484
                SAC      4.6  0.7993      0.0040        0.1965
                SAC+r    4.4  0.8034      0.0014        0.1933
adult#006       PID      7.0  0.7170      0.0223        0.2597
                SAC      5.0  0.7342      0.0082        0.2558
                SAC+r    4.8  0.7586      0.0082        0.2318
Approaches for the State Without Meal Information. To verify the effect of our proposed reward function, we evaluate three approaches without meal information (see Table 1). SAC+r achieves the lowest risk and the highest time in euglycemia for all three patients. Compared to PID and SAC, SAC+r reduces the percentage of time in both hypoglycemia and hyperglycemia on child#006. However, SAC+r yields only a slight improvement on adolescent#006 and adult#006 in terms of the time in hypoglycemia. This is because the original reward function of SAC already controls hypoglycemia well in these two patients. The results indicate that introducing historical glucose information into the reward function can increase the effectiveness of the RL algorithm, especially for people whose blood glucose is difficult to control.
Reinforcement Learning for Blood Glucose Control
Approaches for the State with Meal Information. We investigate the effect of the RL algorithms with meal information on blood glucose control. Table 2 gives the evaluation results of three approaches with meal information. First, the control algorithms PID-MA and RL-meal generally achieve a higher time in euglycemia than the manual basal-bolus control; the only exception is PID-MA on adult#007. For hypoglycemia, RL-meal achieves the lowest or tied-lowest time for five of the six patients; the exception is adolescent#007, where all three approaches keep it at roughly 1% or less. However, RL-meal is usually not the best for hyperglycemia. Since hypoglycemia is considered more dangerous than hyperglycemia, we conclude that RL-meal outperforms PID-MA and basal-bolus. These results suggest that introducing dietary information into the state of the RL algorithm can greatly reduce the risk of hypoglycemia and achieve better blood glucose control.

Table 2. Performances of different approaches with meal information

Patients        Models       Risk  Euglycemia  Hypoglycemia  Hyperglycemia
child#006       Basal-bolus  17.2  0.5646      0.0976        0.3366
                PID-MA       11.1  0.7017      0.0734        0.2248
                RL-meal       6.9  0.7075      0.0051        0.2875
adolescent#006  Basal-bolus   3.1  0.8098      0.0000        0.1888
                PID-MA        2.5  0.8603      0.0000        0.1397
                RL-meal       2.0  0.8598      0.0000        0.1402
adult#006       Basal-bolus   7.7  0.7140      0.0605        0.2259
                PID-MA        9.6  0.8011      0.1118        0.0869
                RL-meal       3.1  0.8078      0.0000        0.1907
child#007       Basal-bolus  10.2  0.7007      0.0757        0.2236
                PID-MA        8.1  0.7678      0.0528        0.1794
                RL-meal       5.0  0.7918      0.0001        0.2081
adolescent#007  Basal-bolus  11.1  0.6253      0.0051        0.3696
                PID-MA       10.0  0.7195      0.0085        0.2720
                RL-meal       7.9  0.7413      0.0105        0.2482
adult#007       Basal-bolus  15.6  0.7209      0.2088        0.0703
                PID-MA       10.3  0.6840      0.1387        0.1773
                RL-meal       3.6  0.8206      0.0064        0.1729
Model Responses to Meal Information. The proposed RL-meal model achieves a lower percentage of time in hypoglycemia than the other approaches with meal information. To investigate the reason for this advantage, we examine the changes in blood glucose and insulin injection after a meal. The average blood glucose and insulin injection of RL-meal and PID-MA after lunch on child#006 are shown in Fig. 3. In Fig. 3(a), we observe that RL-meal and PID-MA reach a similar postprandial maximum blood glucose, but RL-meal has a higher postprandial minimum blood glucose than PID-MA, which reduces the risk of hypoglycemia.
Figure 3(b) shows that RL-meal only provides a large dose of insulin at the beginning while PID-MA provides first insulin according to the basal-bolus and many small doses of insulin during the next four hours. The results indicate that our approach can make better use of meal information and avoid the risk of hypoglycemia caused by repeated insulin injections after the meal.
Fig. 3. Changes of blood glucose and insulin injection of RL-meal and PID-MA after lunch: (a) blood glucose curve after lunch; (b) insulin injection curve after lunch.
5 Conclusion

In this paper, we propose an RL-meal model which introduces an innovative reward function and includes meal information in the state representation. We validate the performance of our model on virtual patients and compare the results with other baseline models. By introducing dietary information into the state and a penalty for prolonged abnormal blood glucose into the reward function, the proposed model achieves more stable and lower-risk blood glucose control. Specifically, the risk of hypoglycemia is reduced and the average time in euglycemia is around 78.81%.

Acknowledgments. This work was partially supported by the National Natural Science Foundation of China (Grant No. 61972286), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100), the Fundamental Research Funds for the Central Universities, and the Natural Science Foundation of Shanghai, China (No. 20ZR1460500).
Predicting Microbe-Disease Association via Tripartite Network and Relation Graph Convolutional Network

Yueyue Wang1, Xiujuan Lei1(B), and Yi Pan2

1 School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
[email protected]
2 Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

Abstract. Much evidence shows that microbes play vital roles in human health and disease. Thus, predicting microbe-disease associations is helpful for disease prevention. In this study, we propose a predictive model called TNRGCN for microbe-disease associations based on a Tripartite Network and a Relation Graph Convolutional Network (RGCN). Firstly, we construct a microbe-disease-drug tripartite network through data processing from four databases. Secondly, we calculate similarity networks for microbes, diseases and drugs via microbe function similarity, disease semantic similarity and Gaussian interaction profile kernel similarity, respectively. Then, we apply Principal Component Analysis (PCA) to the similarity networks to extract the main features of nodes in the tripartite network and input them as initial features to RGCN. Finally, according to the tripartite network and initial features, we apply a two-layer RGCN to predict microbe-disease associations. Compared with other methods, TNRGCN achieves a good performance in cross validation. Meanwhile, case studies for diseases demonstrate that TNRGCN performs well in predicting potential microbe-disease associations.

Keywords: Microbe-disease associations · Tripartite network · Principal component analysis · Relation graph convolutional network
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 92–104, 2021. https://doi.org/10.1007/978-3-030-91415-8_9

1 Introduction

Microbe communities are regarded as a special “organ” of human beings owing to their large number. They reside in different parts of the human body, such as the skin, oral cavity and gastrointestinal tract, and mainly include archaea, viruses, fungi and protozoa [1, 2]. In general, microbes are harmless to human health. They can help develop the immune system [3], promote nutrient absorption [4] and prevent pathogens. For example, short-chain fatty acids produced by intestinal microbial fermentation can provide energy for colonocytes [5]. Probiotics can stimulate the host’s immunity to enhance protection against pathogens by generating immune modulation signals [6].

Microbe communities differ among hosts due to host variability such as diet, genotype and colonization history [7]. They maintain a complex dynamic balance in
the host and have mutualistic interactions with the host. These balances and symbiotic relationships are essential for human health. Changes in ecology or genes can disrupt them and lead to disease [7]. For example, excessive growth of Klebsiella in the intestine can lead to many chronic diseases such as colitis and Crohn’s disease [8]. Studies have found that a low-starch diet can restrain the reproduction of Klebsiella, thereby helping to alleviate Crohn’s disease [9]. Thus, knowing microbe-disease associations is beneficial to disease diagnosis and treatment.

In recent years, two databases for microbe-disease research have been established. The Human Microbe-Disease Association Database (HMDAD) provides 483 microbe-disease association records from 61 publications [10]. Disbiome [11] is a constantly updated database that includes microbe-disease associations and experimental verification methods. Based on these databases, many computational methods for microbe-disease association prediction have been proposed. For example, Chen et al. first proposed a prediction model called KATZHMDA [12]. It scores microbe-disease associations by integrating different path information between two nodes in a microbe-disease heterogeneous network built from known associations and Gaussian interaction profile (GIP) kernel similarities. Li et al. reconstructed the heterogeneous network by integrating normalized GIP kernel similarity and bidirectional recommendations, and then applied the KATZ method to this network for prediction [13]. Wu et al. employed random walk on a heterogeneous network constructed from known associations and cosine similarity, and then used Particle Swarm Optimization to find the optimal parameters of the walk [14]. Qu et al. implemented label propagation on the microbe and disease networks separately, and applied matrix decomposition to the known microbe-disease associations to eliminate noise [15]. Long et al. predicted microbe-disease associations with a deep learning framework based on graph attention networks (GAT). They utilized GAT to learn node representations and combined it with inductive matrix completion to reconstruct microbe-disease associations [16].

These methods have achieved good results, but almost all of them use only HMDAD, and Disbiome is rarely used. Since HMDAD contains only 39 diseases, using this database alone limits the number of diseases that can be predicted. Moreover, with the establishment of databases such as the Microbe-Drug Association Database (MDAD) and the Comparative Toxicogenomics Database (CTD), integrating multiple types of databases can directly or indirectly increase the links between microbes and diseases, which is conducive to association prediction.

Graph Convolution Network (GCN) is an embedding method that extracts features by aggregating neighbors’ information through a convolutional operation. GCN shows good performance in many tasks such as miRNA-disease association prediction [17], protein-phenotype association prediction [18] and metabolite-disease association prediction [19]. However, it treats all nodes and edges in a graph as the same type, and does not aggregate selectively based on the type of edges. Building on GCN, the Relation Graph Convolutional Network (RGCN) [20] considers the type and direction of edges during convolution. Thus, it can be applied to heterogeneous networks containing different types of nodes and edges.

In this study, we design a microbe-disease association prediction model called TNRGCN based on Tripartite Network and RGCN. Firstly, we collect microbe-disease
associations, microbe-drug associations and disease-drug associations from HMDAD, Disbiome, MDAD and CTD, and then construct a microbe-disease-drug tripartite network. Secondly, we calculate similarity networks for microbes, diseases and drugs via microbe function similarity, disease semantic similarity and GIP similarity, respectively. From the similarity networks, we utilize Principal Component Analysis (PCA) to extract the main features of all nodes and input them as initial features to RGCN. Finally, based on the tripartite network and initial features, we introduce a two-layer RGCN to predict microbe-disease associations. The flowchart of TNRGCN is shown in Fig. 1.
Fig. 1. Flowchart of TNRGCN. [Diagram: microbe-disease (HMDAD, Disbiome), microbe-drug (MDAD) and disease-drug (CTD) associations form the microbe-disease-drug tripartite network (input graph); microbe function similarity, disease semantic similarity and drug GIP kernel similarity are each reduced by PCA to microbe, disease and drug initial features; two hidden RGCN layers with ReLU produce output features, and a dot product of microbe and disease features gives the score.]
2 Material and Methods

2.1 Material

The data for this study come from four databases. The microbe-disease associations are collected from two databases: HMDAD and Disbiome. HMDAD contains 483 confirmed microbe-disease associations between 292 microbes and 39 diseases. Disbiome is a constantly updated database of microbe-disease associations. We obtain all records from Disbiome as of December 2020, comprising 1585 microbes, 353 diseases and 8695 microbe-disease associations between them. The microbe-drug associations are collected from MDAD, which includes 5055 microbe-drug associations between 180 microbes and 1388 drugs. The disease-drug associations are collected from CTD, which includes 7119363 records among 12791 drugs and 7098 diseases. The details of the databases are shown in Table 1.

Table 1. Details of four databases.

Database  Microbe  Disease  Drug   Associations
HMDAD     292      39       -      483
Disbiome  1585     353      -      8695
MDAD      180      -        1388   5055
CTD       -        7098     12791  7119363
2.2 Methods

Data Processing

We first combine HMDAD and Disbiome to obtain all microbe-disease associations. For associations duplicated across the databases, we keep only one record. Disease records in Disbiome fall into 17 types, such as Disease or Syndrome, Organism Function and Individual Behavior. By screening and comparing disease IDs with UMLS CUIs in DisGeNET, we keep three types in Disbiome: Disease or Syndrome, Mental or Behavioral Dysfunction and Neoplastic Process. This yields 254 diseases and 7258 microbe-disease associations related to them, involving 1519 microbes. Secondly, we obtain microbe-drug associations from MDAD. For the 1519 microbes associated with diseases, we screen 3783 microbe-drug associations related to them, involving 1181 drugs. Thirdly, for the 1181 drugs and 254 diseases, we obtain disease-drug associations from CTD. Eventually, we obtain 4552 associations between them.

Tripartite Network Construction

After data processing, we construct three adjacency matrices representing these associations. The matrix $A_{dm}$ represents microbe-disease associations: if disease $d_i$ is associated with microbe $m_j$, we set $A_{dm}(i, j) = 1$; otherwise, $A_{dm}(i, j) = 0$. $A_{mu}$ represents microbe-drug associations: if microbe $m_i$ is associated with drug $u_j$, $A_{mu}(i, j)$ is set to 1; otherwise, $A_{mu}(i, j)$ is set to 0. $A_{du}$ represents disease-drug associations: if disease $d_i$ is associated with drug $u_j$, $A_{du}(i, j) = 1$; otherwise, $A_{du}(i, j) = 0$. Since drugs are associated with both microbes and diseases, we construct a tripartite network that indirectly increases microbe-disease associations by introducing drugs.

Feature Initialization

Microbe Similarity Calculating. HMDAD and Disbiome provide the organs in which microbes live and their effects on different organs. We calculate microbe function similarity based on the assumption that microbes share stronger function similarities if they have the same effects on the same organ. If microbe i and microbe j live in the same organ and have the same regulation (increase or decrease), we add 1 to $MF(i, j)$. After traversing all organs, we normalize the microbe function similarity as shown in Eq. (1):

$$MF(i, j) = \frac{MF(i, j) - \min(MF)}{\max(MF) - \min(MF)} \tag{1}$$
where $\max(MF)$ and $\min(MF)$ are the maximum and minimum of matrix $MF$.

Disease Similarity Calculating. We calculate disease semantic similarity according to the MeSH database [21]. In MeSH, each disease is represented as a Directed Acyclic Graph (DAG), which includes the disease and its dependencies among all its ancestors. We calculate the contribution of every element $d$ in the DAG of a disease $D$ as Eq. (2):

$$D_{con}(d) = \begin{cases} 1 & \text{if } d = D \\ \max\{\Delta \times D_{con}(d') \mid d' \in \text{children of } d\} & \text{if } d \neq D \end{cases} \tag{2}$$

Here, $\Delta$ is a semantic contribution decay factor. According to a previous paper, $\Delta$ is set to 0.5 [22]. For a given disease, we obtain its total contribution by adding the contributions of all elements of its DAG, as in Eq. (3):

$$D_{tc}(d) = \sum_{t \in V_d} D_{con}(t) \tag{3}$$
Here, $V_d$ includes disease $d$ and all its ancestors. According to the contribution of every element in a DAG, we can calculate the semantic similarity of two diseases based on the elements shared by their two DAGs. Specifically, the similarity of disease $i$ and disease $j$ is calculated as Eq. (4):

$$D_s(d_i, d_j) = \frac{\sum_{t \in V(d_i) \cap V(d_j)} \left( D_{con}^{(i)}(t) + D_{con}^{(j)}(t) \right)}{D_{tc}(d_i) + D_{tc}(d_j)} \tag{4}$$

where $D_{con}^{(i)}(t)$ is the contribution of element $t$ in the DAG of disease $i$.

Drug Similarity Calculating. Following the assumption that drugs have more functional similarity when they share more neighbor nodes, we calculate the drug GIP kernel similarity based on the disease-drug association network and the microbe-drug association network. For the disease-drug association network, the GIP similarity of drugs $i$ and $j$ is calculated as Eq. (5) and Eq. (6):

$$G_{u1}(i, j) = \exp\left( -\gamma_u \left\| A_{du}(u(i)) - A_{du}(u(j)) \right\|^2 \right) \tag{5}$$

$$\gamma_u = \gamma'_u \Big/ \left( \frac{1}{N_u} \sum_{i=1}^{N_u} \left\| A_{du}(u(i)) \right\|^2 \right) \tag{6}$$

where $A_{du}(u(i))$ represents the $i$-th column of matrix $A_{du}$, and $\gamma_u$ is a normalized kernel bandwidth parameter controlled by the parameter $\gamma'_u$. According to a previous study, we set $\gamma'_u$ to 1 [23]. $N_u = 1181$ is the total number of drugs.
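To make Eqs. (5) and (6) concrete, the following is a minimal pure-Python sketch of the GIP kernel computed over the columns of an association matrix. It is an illustration under stated assumptions rather than the authors' code: the function name `gip_kernel`, the argument `gamma_prime` and the nested-list matrix representation are choices made here for readability.

```python
import math
from typing import List

Matrix = List[List[float]]


def gip_kernel(assoc: Matrix, gamma_prime: float = 1.0) -> Matrix:
    """GIP kernel similarity between the columns of `assoc`.

    `assoc` plays the role of A_du (rows: diseases, columns: drugs), so
    column i is the interaction profile of drug i.
    """
    n_rows, n_cols = len(assoc), len(assoc[0])
    profiles = [[assoc[r][c] for r in range(n_rows)] for c in range(n_cols)]
    # Eq. (6): normalise the bandwidth by the mean squared profile norm.
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n_cols
    if mean_sq_norm == 0:
        raise ValueError("association matrix has no non-zero entries")
    gamma = gamma_prime / mean_sq_norm
    # Eq. (5): Gaussian kernel on the squared distance between profiles.
    sim = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(n_cols):
        for j in range(n_cols):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim
```

For 1181 drugs the double loop is slow in pure Python; a vectorised array library would be the practical choice, but the arithmetic is the same.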
In a similar way, we calculate the drug GIP kernel similarity $G_{u2}$ based on the microbe-drug association network. Then we combine the two similarities to form the drug similarity matrix $G_u$, as in Eq. (7):

$$G_u = (G_{u1} + G_{u2}) / 2 \tag{7}$$
Based on the similarities for microbes, diseases and drugs, we utilize PCA to reduce the dimensionality of each node's similarity profile and feed the result to RGCN as initial features. In this paper, we set the dimension of the initial features to 128.

Predicting Microbe-Disease Associations Based on RGCN

RGCN [20] aggregates selectively according to the type and direction of edges during convolution. Unlike in GCN, edges in RGCN can represent different relations. Specifically, for a node $i$, the convolutional operation is defined as Eq. (8):

$$h_i^{l+1} = \sigma\left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^l h_j^l + W_0^l h_i^l \right) \tag{8}$$

where $h_i^l$ is the hidden feature of node $i$ in the $l$-th layer, $r \in R$ represents the type of edges, $N_i^r$ includes all neighbors of node $i$ under relation $r$, $c_{i,r} = |N_i^r|$ is a normalization constant, $W_r^l$ is the weight corresponding to relation $r$ in the $l$-th layer, $W_0^l$ is the weight of the self-connection, and $\sigma$ is an activation function. In this study, we use ReLU as the activation function.

We construct a two-layer RGCN ($l = 2$). After repeated experiments, the feature dimension is set to 128 in the first layer and 64 in the second layer. In the microbe-disease-drug tripartite network, we consider 6 types of edges: “microbe-influence-disease (in/out)”, “microbe-relate-drug (in/out)” and “drug-treat-disease (in/out)”. Since the tripartite network has no self-loop edges, the convolutional operation only accumulates the features of the neighbor nodes along the different edge types. From the node features after convolution, we compute the dot product of each microbe and disease pair as the predicted score. The model is trained by optimizing the cross entropy loss with the Adam optimizer. The cross entropy loss is given in Eq. (9):

$$Loss = \sum_{(d_i, m_j) \in E} \left[ -label(d_i, m_j) \log\big(p(d_i, m_j)\big) - \big(1 - label(d_i, m_j)\big) \log\big(1 - p(d_i, m_j)\big) \right] \tag{9}$$

where $E$ includes all edges between diseases and microbes, $label(d_i, m_j)$ represents the real label between disease $i$ and microbe $j$, and $p(d_i, m_j)$ is the predicted score.
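The aggregation rule of Eq. (8) can be traced on a toy graph with a short pure-Python sketch. This is not the authors' implementation: the function names, the triple-based edge list and the relation labels used in any call are hypothetical, and the self-connection term is made optional because the text states that only neighbor features are accumulated.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def rgcn_layer(features: Dict[str, Vector],
               edges: List[Tuple[str, str, str]],
               rel_weights: Dict[str, Matrix],
               self_weight: Optional[Matrix] = None) -> Dict[str, Vector]:
    """One relational graph convolution step following Eq. (8) with ReLU.

    `edges` holds (source, relation, target) triples; features flow from
    source to target. `rel_weights` maps each relation to its matrix W_r.
    """
    out_dim = len(next(iter(rel_weights.values())))
    # Group the incoming neighbours of every node by relation: N_i^r.
    neighbours: Dict[Tuple[str, str], List[str]] = {}
    for src, rel, dst in edges:
        neighbours.setdefault((dst, rel), []).append(src)

    new_features: Dict[str, Vector] = {}
    for node, h in features.items():
        # Optional self-connection term W_0 h_i.
        total = matvec(self_weight, h) if self_weight is not None else [0.0] * out_dim
        for (dst, rel), srcs in neighbours.items():
            if dst != node:
                continue
            c = len(srcs)  # c_{i,r} = |N_i^r|
            for src in srcs:
                msg = matvec(rel_weights[rel], features[src])
                total = [t + m / c for t, m in zip(total, msg)]
        new_features[node] = [max(0.0, t) for t in total]  # ReLU
    return new_features
```

Keeping one weight matrix per relation is what distinguishes this step from a plain GCN layer, where all neighbours would share a single matrix.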
3 Experiments and Results

We implement 5-fold cross validation to assess the performance of TNRGCN. We randomly divide all microbe-disease associations into five groups. One group is treated as test samples, and the others are training samples. In each iteration, we regard all known associations as positive samples, and randomly select as many negative samples as positive samples from the training samples. We run 5-fold cross validation 10 times and average the scores. After sorting the scores in descending order, we plot the receiver operating characteristic (ROC) curve and the precision-recall curve for evaluation.
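The splitting procedure can be sketched as follows. The text does not fully specify how negative samples are drawn, so this sketch assumes that they are sampled from disease-microbe pairs with no known association and that their number matches the positive training samples of each fold; the function and variable names are illustrative only.

```python
import random
from typing import Iterator, List, Set, Tuple

Pair = Tuple[int, int]  # (disease index, microbe index)


def five_fold_splits(positives: List[Pair], all_pairs: List[Pair],
                     n_folds: int = 5, seed: int = 0
                     ) -> Iterator[Tuple[List[Pair], List[Pair], List[Pair]]]:
    """Yield (train_pos, train_neg, test_pos) for each fold."""
    rng = random.Random(seed)
    shuffled = positives[:]
    rng.shuffle(shuffled)
    # Deal the shuffled positives into n_folds groups of near-equal size.
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    known: Set[Pair] = set(positives)
    # Assumption: negatives come from pairs with no known association.
    unknown = [p for p in all_pairs if p not in known]
    for k in range(n_folds):
        test_pos = folds[k]
        train_pos = [p for i, fold in enumerate(folds) if i != k for p in fold]
        # Balanced sampling: as many negatives as positive training pairs.
        train_neg = rng.sample(unknown, min(len(train_pos), len(unknown)))
        yield train_pos, train_neg, test_pos
```

Repeating the procedure with different seeds and averaging the scores corresponds to the 10 repetitions mentioned above.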
3.1 Model Analysis

We compare TNRGCN with the following variants:

TNRGCNRF: it initializes features randomly.
BNRGCN: it utilizes a two-layer RGCN to predict microbe-disease associations on the microbe-disease bipartite network, without drugs.
ONRGCN: it utilizes a one-layer RGCN to predict microbe-disease associations on the microbe-disease-drug tripartite network.

As shown in Table 2, using similarities as initial features and adding drug nodes improve the AUC, AUPR and precision of the model. Compared with the one-layer RGCN, the two-layer RGCN achieves better performance.

Table 2. Prediction performance of TNRGCN and its variants.

Variants  AUC     AUPR    Precision  Recall
TNRGCN    0.9038  0.8914  0.8142     0.6722
TNRGCNRF  0.8232  0.8090  0.7355     0.6625
BNRGCN    0.8938  0.8803  0.7922     0.6857
ONRGCN    0.7683  0.7902  0.7216     0.6301
3.2 Comparison with Other Methods

We compare TNRGCN with nine models under 5-fold cross validation: BRWMDA [24], BDSILP [22], BiRWHMDA [25], KATZHMDA [12], NTSHMDA [26], KATZBNRA [27], NCPHMDA [28], PBHMDA [29], and NCPLP [30]. Figure 2 shows the ROC curves of the different models; TNRGCN achieves the best performance. Table 3 lists the area under the precision-recall curve (AUPR); the AUPR of TNRGCN is the highest. We also count the known associations among the 1000 top-scoring microbe-disease pairs of each model. As shown in Fig. 3, TNRGCN recovers more known associations than the other nine models. In addition, we calculate the AUC value for each individual disease. In Fig. 4, we can see that the AUCs of most diseases predicted by TNRGCN are above 0.8. Moreover, the mean and median AUC values of TNRGCN are higher than those of the compared models.
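The per-disease AUC mentioned above can be computed without plotting a curve, using the equivalence between the AUC and the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair. The sketch below is a plain illustration of that definition; the function names and the dictionary layout are assumptions made here, not part of TNRGCN.

```python
from typing import Dict, List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def per_disease_auc(score_rows: Dict[str, List[float]],
                    label_rows: Dict[str, List[int]]) -> Dict[str, float]:
    """Apply roc_auc to the microbe scores of each disease separately."""
    return {d: roc_auc(score_rows[d], label_rows[d]) for d in score_rows}
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a single disease row; a rank-based formulation would scale better for the full score matrix.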
Fig. 2. ROC curve of ten models in 5-fold cross validation.
Table 3. AUPR predicted by ten models in 5-fold cross validation.

Models     AUPR    Models    AUPR
TNRGCN     0.8914  NTSHMDA   0.6828
BRWMDA     0.7469  KATZBNRA  0.6061
BDSILP     0.7564  NCPHMDA   0.7225
BiRWHMDA   0.5593  PBHMDA    0.8577
KATZHMDA   0.5819  NCPLP     0.5722
Fig. 3. The number of correct associations predicted by ten models in 5-fold cross validation. [Bar chart: one bar per model (TNRGCN, BRWMDA, BDSILP, BiRWHMDA, KATZHMDA, NTSHMDA, KATZBNRA, NCPHMDA, PBHMDA, NCPLP); vertical axis from 0 to 500.]
Fig. 4. Distribution of AUC value for all diseases in 5-fold cross validation.
4 Case Studies

To further evaluate the performance of TNRGCN, we select two diseases for case studies. We train the model with all known associations as positive samples and score all unknown associations. After sorting the unknown scores in descending order, we take the top 10 microbes for each disease.

According to the global cancer statistics of 2020, lung cancer has become the second most common cancer, with a high mortality rate [31]. In this study, we select lung cancer for a case study. As shown in Table 4, the top 10 potential microbes related to lung cancer predicted by TNRGCN are all confirmed. For example, by assessing the bacterial microbiota in the respiratory tract, researchers found that Corynebacterium is one of the specific bacteria in patients with lung cancer [32]. Compared with healthy individuals, Faecalibacterium, Ruminococcus, Blautia, Veillonella and Dorea are more abundant in patients with lung cancer [33–35]. In the saliva of patients with central lung cancer, the abundance of Lactobacillus is higher [36]. When evaluating the effect of bronchoalveolar fluid on lung cancer cells, studies found that Pseudomonas is increased in the non-small cell lung cancer bronchoalveolar fluid inhalation group, which can inhibit tumor cell growth [37].

The prevalence of obesity in children is increasing year by year; obesity is related to heart disease and other chronic conditions such as hyperlipidaemia and early atherosclerosis [38]. In the case study of obesity, 8 of the top 10 potential microbes predicted by TNRGCN are confirmed, as shown in Table 5. For example, the growth of Desulfovibrio is a key feature related to obesity [39]. In patients with obesity, Escherichia-Shigella, Parvimonas and Ruminococcus gnavus grow significantly [40–42]. Within the phylum Proteobacteria, Neisseria mucosa is six times more abundant in obese subjects than in normal-weight subjects [43]. Researchers also found that Escherichia coli increases significantly in patients with metabolic syndrome, which includes obesity and insulin resistance [44].
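The ranking step of the case studies (score all unknown pairs, sort in descending order, keep the top 10 per disease) can be sketched as below. The function name and the toy identifiers in the usage note are hypothetical; the scores themselves would come from the trained model.

```python
from typing import List, Sequence, Set


def top_candidates(scores: Sequence[float], microbes: Sequence[str],
                   known: Set[str], k: int = 10) -> List[str]:
    """Return the k highest-scoring microbes for one disease, skipping
    microbes whose association with that disease is already known."""
    unknown = [(s, m) for s, m in zip(scores, microbes) if m not in known]
    # Sort by score only, highest first; sorted() keeps input order on ties.
    ranked = sorted(unknown, key=lambda pair: pair[0], reverse=True)
    return [m for _, m in ranked[:k]]
```

For example, with microbes `["m1", "m2", "m3", "m4"]`, scores `[0.9, 0.2, 0.7, 0.8]` and `known = {"m1"}`, calling the function with `k=2` ranks `m4` before `m3` and never returns `m1`.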
Table 4. Top-10 microbes associated with Lung cancer.

Disease      Microbe           Evidence
Lung cancer  Faecalibacterium  PMID: 33302682
Lung cancer  Lactobacillus     PMID: 32620417
Lung cancer  Ruminococcus      PMID: 33302682
Lung cancer  Blautia           PMID: 32676331
Lung cancer  Corynebacterium   PMID: 32255852
Lung cancer  Fusobacterium     PMID: 33454779
Lung cancer  Veillonella       PMID: 27987594
Lung cancer  Dorea             PMID: 33302682
Lung cancer  Porphyromonas     PMID: 33454779
Lung cancer  Pseudomonas       PMID: 33363002
Table 5. Top-10 microbes associated with Obesity.

Disease  Microbe                   Evidence
Obesity  Actinomyces               PMID: 29922272
Obesity  Shigella                  PMID: 29280312
Obesity  Desulfovibrio             PMID: 31346040
Obesity  Dorea                     PMID: 32708278
Obesity  Escherichia coli          PMID: 26599039
Obesity  Neisseria                 PMID: 21996660
Obesity  Parvimonas                PMID: 27499582
Obesity  Ruminococcus gnavus       PMID: 33482223
Obesity  Roseburia Inulinivorans   unconfirmed
Obesity  Streptococcus salivarius  unconfirmed
5 Conclusion

In this study, we propose a model for microbe-disease association prediction called TNRGCN based on Tripartite Network and RGCN. Firstly, considering that HMDAD contains relatively few microbes, diseases and associations, we integrate HMDAD and Disbiome to obtain more related information. Secondly, we introduce the drug information to increase the indirect associations in the microbe-disease network, thus building a microbe-disease-drug tripartite network. Thirdly, we utilize RGCN on the microbe-disease-drug tripartite network to predict potential microbe-disease associations. TNRGCN has a good performance in 5-fold cross validation. Experiments of case studies further demonstrate the predictive performance of TNRGCN.
Acknowledgment. We thank the financial support from National Natural Science Foundation of China under Grant Nos. 61972451, 61902230, and the Fundamental Research Funds for the Central Universities of China under Grant No. GK201901010.
References

1. Holmes, E., Wijeyesekera, A., Taylor-Robinson, S.D., et al.: The promise of metabolic phenotyping in gastroenterology and hepatology. Nat. Rev. Gastroenterol. Hepatol. 12(8), 458–471 (2015)
2. Sommer, F., Backhed, F.: The gut microbiota - masters of host development and physiology. Nat. Rev. Microbiol. 11(4), 227–238 (2013)
3. Gollwitzer, E.S., Saglani, S., Trompette, A., et al.: Lung microbiota promotes tolerance to allergens in neonates via PD-L1. Nat. Med. 20(6), 642–647 (2014)
4. Gill, S.R., Pop, M., DeBoy, R.T., et al.: Metagenomic analysis of the human distal gut microbiome. Science 312(5778), 1355–1359 (2006)
5. Shoaie, S., Ghaffari, P., Kovatcheva-Datchary, P., et al.: Quantifying diet-induced metabolic changes of the human gut microbiome. Cell Metab. 22(2), 320–331 (2015)
6. Cross, M.L.: Microbes versus microbes: immune signals generated by probiotic lactobacilli and their role in protection against microbial pathogens. FEMS Immunol. Med. Microbiol. 34(4), 245–253 (2002)
7. Dethlefsen, L., McFall-Ngai, M., Relman, D.A.: An ecological and evolutionary perspective on human-microbe mutualism and disease. Nature 449(7164), 811–818 (2007)
8. Yan, Q., Gu, Y., Li, X., et al.: Alterations of the Gut Microbiome in Hypertension. Front. Cell. Infect. Microbiol. 7, 381 (2017)
9. Rashid, T., Ebringer, A., Wilson, C.: The role of Klebsiella in Crohn’s disease with a potential for the use of antimicrobial measures. Int. J. Rheumatol. 2013, 610393 (2013)
10. Ma, W., Zhang, L., Zeng, P., et al.: An analysis of human microbe-disease associations. Brief. Bioinform. 18(1), 85–97 (2017)
11. Janssens, Y., Nielandt, J., Bronselaer, A., et al.: Disbiome database: linking the microbiome to disease. BMC Microbiol. 18(1), 50 (2018)
12. Chen, X., Huang, Y.A., You, Z.H., et al.: A novel approach based on KATZ measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics 33(5), 733–739 (2017)
13. Li, H., Wang, Y.Q., Jiang, J.W., et al.: A novel human microbe-disease association prediction method based on the bidirectional weighted network. Front. Microbiol. 10, 676 (2019)
14. Wu, C.Y., Gao, R., Zhang, D.L., et al.: PRWHMDA: human microbe-disease association prediction by random walk on the heterogeneous network with PSO. Int. J. Biol. Sci. 14(8), 849–857 (2018)
15. Qu, J., Zhao, Y., Yin, J.: Identification and analysis of human microbe-disease associations by matrix decomposition and label propagation. Front. Microbiol. 10, 291 (2019)
16. Long, Y., Luo, J., Zhang, Y., et al.: Predicting human microbe-disease associations via graph attention networks with inductive matrix completion. Brief. Bioinf. 22(3), bbaa146 (2021)
17. Tang, X., Luo, J., Shen, C., et al.: Multi-view multichannel attention graph convolutional network for miRNA-disease association prediction. Brief. Bioinf. (2021). https://doi.org/10.1093/bib/bbab174
18. Liu, L., Mamitsuka, H., Zhu, S.: HPOFiller: identifying missing protein-phenotype associations by graph convolutional network. Bioinformatics (2021). https://doi.org/10.1093/bioinformatics/btab224
19. Lei, X., Tie, J., Pan, Y.: Inferring metabolite-disease association using graph convolutional networks. IEEE/ACM Trans. Comput. Biol. Bioinf. (99), 1 (2021)
20. Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: Gangemi, A., et al. (eds.) ESWC 2018. LNCS, vol. 10843, pp. 593–607. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93417-4_38
21. Lipscomb, C.E.: Medical subject headings (MeSH). Bull. Med. Libr. Assoc. 88(3), 265–266 (2000)
22. Wen, Z., Weitai, Y., Xiaoting, L., et al.: The bi-direction similarity integration method for predicting microbe-disease associations. IEEE Access 6, 38052–38061 (2018)
23. van Laarhoven, T., Nabuurs, S.B., Marchiori, E.: Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27(21), 3036–3043 (2011)
24. Yan, C., Duan, G.H., Wu, F.X., et al.: BRWMDA: Predicting microbe-disease associations based on similarities and bi-random walk on disease and microbe networks. IEEE-ACM Trans. Comput. Biol. Bioinf. 17(5), 1595–1604 (2020)
25. Zou, S., Zhang, J.P., Zhang, Z.P.: A novel approach for predicting microbe-disease associations by bi-random walk on the heterogeneous network. PLoS ONE 12(9), e0184394 (2017)
26. Luo, J.W., Long, Y.H.: NTSHMDA: prediction of human microbe-disease association based on random walk by integrating network topological similarity. IEEE-ACM Trans. Comput. Biol. Bioinf. 17(4), 1341–1351 (2020)
27. Li, S., Xie, M., Liu, X.: A novel approach based on bipartite network recommendation and KATZ model to predict potential micro-disease associations. Front. Genet. 10, 1147 (2019)
28. Bao, W., Jiang, Z., Huang, D.S.: Novel human microbe-disease association prediction using network consistency projection. BMC Bioinf. 18(S16), 543 (2017)
29. Huang, Z.A., Chen, X., Zhu, Z.X., et al.: PBHMDA: path-based human microbe-disease association prediction. Front. Microbiol. 8, 2560 (2017)
30. Yin, M.-M., Liu, J.-X., Gao, Y.-L., et al.: NCPLP: a novel approach for predicting microbe-associated diseases with network consistency projection and label propagation. IEEE Trans. Cybern. (99), 1–9 (2020)
31. Sung, H., Ferlay, J., Siegel, R.L., et al.: Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71(3), 209–249 (2021)
32. Ekanayake, A., Madegedara, D., Chandrasekharan, V., Magana-Arachchi, D.: Respiratory bacterial microbiota and individual bacterial variability in lung cancer and bronchiectasis patients. Indian J. Microbiol. 60(2), 196–205 (2019). https://doi.org/10.1007/s12088-019-00850-w
33. Zhang, M., Zhou, H., Xu, S.S., et al.: The gut microbiome can be used to predict the gastrointestinal response and efficacy of lung cancer patients undergoing chemotherapy. Ann. Palliative Med. 9(6), 4211–4227 (2020)
34. Cheng, C., Wang, Z., Wang, J., et al.: Characterization of the lung microbiome and exploration of potential bacterial biomarkers for lung cancer. Transl. Lung Cancer Res. 9(3), 693–704 (2020)
35. Lee, S.H., Sung, J.Y., Yong, D., et al.: Characterization of microbiome in bronchoalveolar lavage fluid of patients with lung cancer comparing with benign mass like lesions. Lung Cancer 102, 89–95 (2016)
36. Bello, S., Vengoechea, J.J., Ponce-Alonso, M., et al.: Core microbiota in central lung cancer with streptococcal enrichment as a possible diagnostic marker. Arch. Bronconeumol. (Engl. Ed) (2020). https://doi.org/10.1016/j.arbres.2020.05.034
37. Zheng, L., Xu, J., Sai, B., et al.: Microbiome related cytotoxically active CD8+ TIL are inversely associated with lung cancer development. Front Oncol 10, 531131 (2020)
38. Cole, T.J., Bellizzi, M.C., Flegal, K.M., et al.: Establishing a standard definition for child overweight and obesity worldwide: international survey. BMJ 320(7244), 1240–1243 (2000)
39. Petersen, C., Bell, R., Kiag, K.A., et al.: T cell-mediated regulation of the microbiota protects against obesity. Science 365(6451), 340 (2019)
40. Gao, R.Y., Zhu, C.L., Li, H., et al.: Dysbiosis signatures of gut microbiota along the sequence from healthy, young patients to those with overweight and obesity. Obesity 26(2), 351–361 (2018)
41. Andoh, A., Nishida, A., Takahashi, K., et al.: Comparison of the gut microbial community between obese and lean peoples using 16S gene sequencing in a Japanese population. J. Clin. Biochem. Nutr. 59(1), 65–70 (2016)
42. Jie, Z., Yu, X., Liu, Y., et al.: The baseline gut microbiota directs dieting-induced weight loss trajectories. Gastroenterology 160(6), 2029-2042.e2016 (2021)
43. Zeigler, C.C., Persson, G.R., Wondimu, B., et al.: Microbiota in the oral subgingival biofilm is associated with obesity in adolescence. Obesity (Silver Spring) 20(1), 157–164 (2012)
44. Moreno-Indias, I., Sanchez-Alcoholado, L., Perez-Martinez, P., et al.: Red wine polyphenols modulate fecal microbiota and reduce markers of the metabolic syndrome in obese patients. Food Funct. 7(4), 1775–1787 (2016)
Combining Model-Based and Model-Free Reinforcement Learning Policies for More Efficient Sepsis Treatment

Xiangyu Liu¹, Chao Yu¹(B), Qikai Huang²(B), Luhao Wang³, Jianfeng Wu³, and Xiangdong Guan³

¹ Sun Yat-sen University, Guangzhou 510275, China
[email protected], [email protected]
² Department of Orthopedics, Shanghai Pudong Hospital, Fudan University Pudong Medical Center, 2800 Gongwei Road, Pudong, Shanghai 201399, China
³ The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, China
{wanglh36,wujianf,guanxd}@mail.sysu.edu.cn
Abstract. Sepsis is the main cause of mortality in intensive care units (ICUs), but the optimal treatment strategy remains unclear. Managing the treatment of sepsis is challenging because individual patients respond differently to the treatment, which creates a pressing need for personalized treatment strategies. Reinforcement learning (RL) has been widely used to learn optimal strategies for sepsis treatment, especially for the administration of intravenous fluids and vasopressors. RL can be generally categorized into two types of approaches: the model-based and the model-free approaches. It has been shown that model-based approaches, with the prerequisite of accurate estimation of environment models, are more sample efficient than model-free approaches, but at the same time can only achieve inferior asymptotic performance. In this paper, we propose a policy mixture framework to make the best of both model-based and model-free RL approaches to achieve more efficient personalized sepsis treatment. We demonstrate that the policy derived from our framework outperforms policies prescribed by physicians, model-based only methods, and model-free only approaches.

Keywords: Sepsis · ICU · Reinforcement learning · Model-based · Model-free · Policy mixture

1 Introduction
Sepsis is a life-threatening acute organ dysfunction caused by a dysregulated host response to infection, which is a leading cause of mortality and associated healthcare costs in critical care [26]. The management of drug dosage in sepsis treatment is a big challenge because individual patients respond differently to the treatment, and evidence suggests that suboptimal strategies are likely to induce harm to the patients [1,4,12,13]. Among various aspects of treatment, administering intravenous fluids and vasopressors is an effective way for the management of sepsis. While a number of international organizations have devoted significant efforts to provide general guidance for treating sepsis over the past 20 years, physicians in practice still lack universally agreed-upon decision support for sepsis treatment [22], which creates an urgent demand for applying advanced data analysis and machine learning methods to discover more efficient treatment strategies for sepsis patients [35].

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 105–117, 2021. https://doi.org/10.1007/978-3-030-91415-8_10

As a powerful machine learning technique, RL is widely used in sequential decision-making problems to find an optimal policy that maximizes the long-term accumulative reward [29]. The RL technique has gained great attention in the healthcare domain to learn effective treatment strategies [39]. With the available data obtained from freely accessible critical care databases such as the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) [7], recent years have seen an increasing number of studies that applied RL methods in deducing optimal treatment strategies for patients with sepsis [8,11,19,20,29,40].

In general, there are two kinds of learning approaches in the RL domain: model-based and model-free approaches [29]. The model-based approaches usually supply high sample efficiency during learning, but at the same time, model accuracy acts as an essential bottleneck to policy quality, generally resulting in inferior asymptotic performance of model-based approaches compared to their model-free counterparts [16]. In previous works, all the model-based approaches are built upon the complete features that are recorded in the observational history data including demographics, laboratory values, and vital signs, among which some features like weight and height may not have much impact on the mortality of patients.
Since including all these features will introduce significant bias in the model building process, it is crucial to discover those features that have the greatest impact on the mortality of patients for building an accurate transition model. To this end, we extract a set of vital features that have the greatest impact on the mortality of patients using traditional supervised learning methods, and combine these features with the criterion features that define sepsis and septic shock [26], in order to build a Markov Decision Process (MDP) that models the health status transition of patients. The MDP can be solved by applying generalized model-based approaches, e.g., policy iteration and value iteration [29]. Then, in order to further improve the performance of the policy, we use the complete features as supplementary features to derive a model-free policy. Finally, we propose a policy mixture framework that automatically switches between the derived model-based and model-free policies for more efficient sepsis treatment using the gradient boosting algorithm [2].

We apply off-policy evaluation methods to estimate the expected return of our policy. Experimental results show that our policy can achieve the highest expected return and offer higher possibilities for patients to recover, compared to the policies prescribed by physicians, model-based only approaches, and model-free only approaches. Furthermore, the way of combining the two types of policies in our policy mixture framework provides better intuitiveness and explainability compared to the existing RL methods.
This paper is organized as follows: Sect. 2 describes the background of RL and off-policy evaluation. Section 3 introduces the related work in sepsis treatment using RL. Section 4 demonstrates our policy mixture framework in detail. Section 5 shows the results of our experiments. Finally, Sect. 6 presents conclusions and directions for future work.
2 Background

2.1 RL
RL is a type of ML technique that enables an agent to learn in an environment by trial and error using feedback from its experiences [29]. The environment can be represented by an MDP, which is defined by $M = \langle S, A, P, R, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, $P$ is the transition function with $P(s'|s,a)$ denoting the probability of reaching state $s' \in S$ by taking action $a \in A$ in a particular state $s \in S$, $R$ is the reward function with $R(s,a)$ being the expected immediate reward of the agent by taking action $a$ in state $s$, and $\gamma$ is the discount factor. At each time step $t$, the RL agent observes a state $s$ from the state space $S$ and chooses an action $a$ from the action space $A$ based on some policy $\pi(s,a)$, which assigns a probability to actions that will be selected in each state. By taking the action $a$ in state $s$, the agent receives some reward $r = R(s,a)$ and transits to a new state $s'$. The agent's goal is to maximize the expected long-term discounted return $E[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}]$, where $\gamma$ captures the trade-off between immediate and future rewards, and $T$ is the terminal time step. The optimal value function yields the maximum value compared to all other value functions, which is defined as $V^*(s) = \max_\pi E[\sum_t \gamma^t r_t \mid s_0 = s, \pi]$. The optimal action value function indicates the highest value after committing to a particular action, which is defined as $Q^*(s,a) = \max_\pi E[\sum_t \gamma^t r_t \mid s_0 = s, a_0 = a, \pi]$.

In model-based approaches, an agent is given the model dynamics, and the whole MDP can be solved by applying dynamic programming methods, e.g., policy iteration (PI) and value iteration (VI) [29]. PI iterates between two phases: policy evaluation and policy improvement. The policy evaluation phase computes the value function $V(s)$ of the current policy and the policy improvement phase computes an improved policy by maximizing over the value function. This process repeats until converging to an optimal policy. The policy using PI is updated as follows:

$$V(s) \leftarrow \sum_{s',r} P(s'|s,a)\,(r + \gamma V(s')) \tag{1}$$

$$\pi(s) \leftarrow \arg\max_a \sum_{s',r} P(s'|s,a)\,(r + \gamma V(s')) \tag{2}$$
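The two PI phases in Eqs. (1) and (2) can be illustrated on a tiny tabular MDP. The following is a minimal sketch, not the authors' implementation: the three-state `TOY_MDP`, the function name `policy_iteration`, the convergence tolerance, and the choice to evaluate Eq. (1) with the action prescribed by the current policy are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: P[s][a] is a list of (next_state, probability, reward).
Transition = Tuple[int, float, float]

# Toy MDP (an assumption, not from the paper): state 2 is absorbing with zero reward.
TOY_MDP: List[List[List[Transition]]] = [
    [[(0, 1.0, 0.0)], [(1, 1.0, 0.0)]],  # state 0: stay, or move to state 1
    [[(2, 1.0, 1.0)], [(0, 1.0, 0.0)]],  # state 1: finish with reward 1, or go back
    [[(2, 1.0, 0.0)], [(2, 1.0, 0.0)]],  # state 2: terminal
]


def policy_iteration(P: List[List[List[Transition]]], gamma: float = 0.9,
                     tol: float = 1e-10, max_iters: int = 1000):
    """Return (policy, V) for a tabular MDP using policy iteration."""
    n_states, n_actions = len(P), len(P[0])
    policy = [0] * n_states
    V = [0.0] * n_states

    def backup(s: int, a: int) -> float:
        # sum over s' of P(s'|s,a) * (r + gamma * V(s'))
        return sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])

    for _ in range(max_iters):
        # Policy evaluation (Eq. 1), sweeping until the values stop changing.
        while True:
            delta = 0.0
            for s in range(n_states):
                new_v = backup(s, policy[s])
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                break
        # Policy improvement (Eq. 2); switch only on a strict improvement.
        stable = True
        for s in range(n_states):
            best = max(range(n_actions), key=lambda a: backup(s, a))
            if backup(s, best) > backup(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:  # the policy no longer changes, so stop
            break
    return policy, V
```

Stopping once the policy is stable mirrors the stopping rule described later in the Methods section; the strict-improvement check is a design choice that prevents oscillation between equally good actions.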
In model-free approaches, an agent learns how to act without explicitly learning the model’s dynamics. Temporal difference (TD) learning refers to a class of model-free approaches which learn by bootstrapping from the current estimate of the value function [27]. Q-learning [38] is an off-policy TD-based algorithm,
in which the optimal action value function is estimated using the Bellman equation $Q^*(s,a) = R(s,a) + \gamma \max_{a'} E[Q^*(s',a')]$. In recent years, deep Q-network (DQN) [14] has been widely applied in various domains, which combines Q-learning with deep neural network (DNN) to represent the action value function, and learns the optimal policy by minimizing the TD error. In DQN, Q values are frequently overestimated, leading to incorrect predictions and poor policies. To tackle this problem, Double DQN [34] was proposed to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation. The target network in DQN provides a natural candidate for the second value function, without having to introduce additional networks. The loss function in Double DQN is slightly different to the one in DQN as given by:

$$L(\theta) = E\big[\big(R(s,a) + \gamma\, Q(s', \arg\max_{a'} Q(s',a',\theta), \theta') - Q(s,a,\theta)\big)^2\big] \tag{3}$$

where $\theta$ are the parameters of DNN, and $\theta'$ are the parameters used to compute the target. In order to achieve generalized learning across actions without imposing any change to the underlying RL process, dueling deep Q-network (Dueling DQN) [37] was proposed, which explicitly separates the representation of state values and state-dependent action advantages, leading to better policy evaluation in the presence of similar-valued actions. The action value function is given as follows:

$$Q(s,a) = V(s) + \big(A(s,a) - \max_{a' \in A} A(s,a')\big) \tag{4}$$

where $V(s)$ is the value function which represents the quality of the current state, and $A(s,a)$ is the advantage function of action $a$ that represents the quality of the selected action. To accelerate the learning process, Prioritized Experience Replay [24] is used to sample a transition from the training set.

2.2 Off-Policy Evaluation
Evaluating the performance of a new policy $\pi_e$ (i.e., the target policy) using the trajectories sampled by another behavior policy $\pi_b$ (i.e., the physician policy) is termed as off-policy evaluation (OPE) [5,31,32]. It is crucial to obtain reliable estimates of the performance of new policies before actually deploying them, since executing a bad policy in the real world would be costly and even dangerous, especially in safety-critical domains such as the healthcare domain [23]. The importance sampling estimator [6] provides an unbiased estimate of the value of $\pi_e$, but suffers from high variance that easily grows exponentially in horizon. As a variant, weighted importance sampling (WIS) [32] can reduce variance at the cost of adding some bias. Define $\rho_t = \pi_e(a_t|s_t)/\pi_b(a_t|s_t)$ as the per-step importance ratio, $\rho_{1:t} = \prod_{t'=1}^{t} \rho_{t'}$ as the cumulative importance ratio up to step $t$, and $w_t = \sum_{i=1}^{|D|} \rho_{1:t}^{(i)} / |D|$ as the average cumulative importance ratio at horizon $t$ in a data set $D$ ($|D|$ is the number of trajectories in the data set $D$). The trajectory-wise WIS estimator is given by $V_{WIS} = \frac{\rho_{1:H}}{w_H} \big(\sum_{t=1}^{H} \gamma^{t-1} r_t\big)$, where $H$ is the length of a trajectory. Then, the WIS estimator is the average estimate over all trajectories: $WIS(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} V_{WIS}^{(i)}$, where $V_{WIS}^{(i)}$ is the result of applying WIS to the $i$-th trajectory.
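The WIS definition above translates almost directly into code. The sketch below is illustrative only: the `Step` tuple layout and the function name `wis_estimate` are assumptions, and it normalizes by the mean of the full-trajectory ratios, which coincides with $w_H$ when all trajectories have the same length $H$.

```python
from typing import Sequence, Tuple

# One logged step: (pi_e(a_t|s_t), pi_b(a_t|s_t), r_t). This layout is an assumption.
Step = Tuple[float, float, float]


def wis_estimate(trajectories: Sequence[Sequence[Step]], gamma: float = 0.99) -> float:
    """Weighted importance sampling estimate of the target policy's return."""
    ratios = []   # cumulative importance ratio rho_{1:H} per trajectory
    returns = []  # discounted return sum_t gamma^(t-1) r_t per trajectory
    for traj in trajectories:
        rho, ret = 1.0, 0.0
        for t, (p_target, p_behavior, reward) in enumerate(traj):
            rho *= p_target / p_behavior
            ret += (gamma ** t) * reward  # t is zero-based, so this is gamma^(t-1) in 1-based terms
        ratios.append(rho)
        returns.append(ret)

    w = sum(ratios) / len(ratios)  # average cumulative ratio over the data set
    if w == 0.0:
        raise ValueError("target policy has zero probability on every logged trajectory")
    # Average of the trajectory-wise estimates rho_{1:H} / w_H * return.
    return sum(rho / w * ret for rho, ret in zip(ratios, returns)) / len(ratios)
```

When the target and behavior policies agree on every logged step, all ratios equal 1 and the estimate reduces to the plain average of the discounted returns, which is a convenient sanity check.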
3 Related Work
RL has been applied to sepsis treatment by a number of studies in the past years. Komorowski et al. [8] applied a model-based PI approach in a discretized state and action space to learn the optimal sepsis treatment strategy. Raghu et al. [21] directly estimated the transition model in continuous state space, and applied Policy Gradient (PG) [30] and Proximal Policy Optimization (PPO) [25] to derive a treatment strategy. Utomo et al. [33] proposed a graphical model that was able to show transitions of patient health conditions and treatments for better explainability, and applied Monte Carlo (MC) to generate a real-time treatment recommendation. Li et al. [10] provided an online Partially Observable Markov Decision Process (POMDP) solution to take into account uncertainty and history information in sepsis clinical applications. However, one significant shortcoming of these model-based approaches is that they can suffer from model bias, leading to poor accuracy of the transition model [36], and thus can only achieve inferior asymptotic performance.

There have been several attempts at using model-free approaches for sepsis treatment. Komorowski et al. [9] directly applied the on-policy SARSA [28] algorithm to learn the treatment strategy. Raghu et al. [19,20] examined fully continuous state and action space, where strategies are learned directly from the clinical state data. The authors proposed the fully-connected Dueling Double DQN to learn an approximation for the optimal action value function. Futoma et al. [3] used multi-output Gaussian processes and deep RL (DRL) to directly learn from sparsely sampled and frequently missing multivariate time series ICU data. Yu et al. [40] used deep inverse RL to derive an optimal reward function and exposed features that should be considered in the reward function formulation.
However, all these model-free approaches tend to be substantially less sample efficient, due to the fact that the learning signal only consists of rewards and ignores much of the rich domain information contained in state transitions [18]. Most recently, Peng et al. [17] applied the mixture-of-experts framework in sepsis treatment by automatically switching between a kernel learning and a DRL process, depending on the patient’s state trajectory. Results showed that this kind of mixed learning approach could achieve better performance than the strategies by physician, kernel only policies, and DRL only policies. They applied a gating function that combines the features x linearly via weights w, along with a bias term b, and passed them through a logit to get the probability of choosing each policy: pk = sigmoid(wx+b), where pk denotes the assigned probability for choosing the kernel policy. However, the term wx + b is a linear transformation on features x, which still lacks an explicit explainability. Moreover, our work combines the benefits of both model-based and model-free RL in order to derive a final efficient strategy. The combination of these two learning methods also differs from their mixture-of-experts framework.
4 Methods

4.1 Data Preprocessing and Problem Formulation
In this work, the data for our cohort of patients are obtained from the MIMIC database fulfilling the Sepsis-3 criteria [26]. Our cohort consists of 20,376 patients, which are split into three parts: 80% for training, 10% for validation and 10% for testing. For each patient, we extract a set of 53 features (complete features): ALAT, Albumin, AnionGap, ASAT, BANDS, BaseExcess, Bicarbonate, Bilirubin, BUN, Calcium, Chloride, Creatinine, DiasBP, Glucose, HeartRate, Hematocrit, Hemoglobin, INR, IonCalcium, lactate, Magnesium, MeanBP, PaCO2, PaO2, PH, Platelet, Potassium, PT, PTT, RespRate, SODIUM, SpO2, SysBP, TempC, WBC, GCS, age, IsMale, RaceWhite, RaceBlack, Raceaother, RaceHispanic, height, weight, vent, SOFA, lods, sirs, QSOFA, QsofaSysbpScore, QsofaGcsScore, QsofaResprateScore and BloodCulturePositive. The data are aggregated into windows of 4 h, with the mean or sum being recorded when several data points were present in one 4-h window. Features with missing data are imputed by applying the last value carried forward method [15], yielding a 53 × 1 feature vector for each patient at each time step. The data are normalized per-feature to zero mean and unit variance.

The total volume of intravenous fluids and maximum dose of vasopressor administered over each 4-h window define the medical treatment of interest. The dosage for each drug is discretized into 5 bins, resulting in a 5 × 5 action space indexed from 0 to 24. We include a special case of no drug given as index 0. A reward of 15 is assigned to the agent if a patient survives, otherwise a reward of −15 is assigned. With the rewards and transition function, we build a reward function as follows: $R(s,a) = \sum_{r \in \mathcal{R}} r \sum_{s' \in S} P(s'|s,a)$, where $\mathcal{R}$ is the rewards set.

4.2 The Policy Mixture Framework
Figure 1 shows the proposed policy mixture framework. We use two decision trees to select the vital features, and combine them with the criterion features to derive the key features. A model-based policy is learned using PI over the key features and a model-free policy is learned by applying Dueling DQN on the complete features. Finally, we combine the two policies to derive the mixed policy.

Selecting the Vital Features. In order to select the features that have the greatest impact on the mortality of patients, we use decision trees and take the features that are positioned at the first two layers of each tree as the vital features. We train two decision tree models with the entropy and gini criterion, respectively, to select the vital features. Then, we combine them with the criterion features to derive the key features, which include: lods, BUN, GCS, Albumin, AnionGap, PaCO2, HeartRate, WBC, SOFA, QSOFA, MBP and Lactate.

Building the Transition Model. We use the key features to train a clustering model, and empirically cluster the data into 400 clusters, in which each cluster represents a state in the MDP. In the previous work, Komorowski et al. [8] use 750 clusters, because they use 48 features to train a clustering model. However, since the number of key features is not enough to support the same number of states, the model estimation process might be biased, which accordingly impacts the performance of the final policy. Then, we use the actions in the patients' trajectories and the corresponding clustered states to build a transition model of the changes of patients' health status [29].
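The transition-model step above can be sketched as follows. The text does not spell out the estimator, so this sketch assumes the common maximum-likelihood choice of normalized transition counts; the function name `estimate_transition_model` and the `(state, action, next_state)` tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def estimate_transition_model(
    transitions: Iterable[Tuple[int, int, int]]
) -> Dict[Tuple[int, int], Dict[int, float]]:
    """Estimate P(s'|s,a) from observed (state, action, next_state) triples.

    States are cluster indices and actions are discretized dose indices.
    ASSUMPTION: probabilities are plain normalized counts (maximum likelihood).
    """
    counts: Dict[Tuple[int, int], Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for state, action, next_state in transitions:
        counts[(state, action)][next_state] += 1

    model: Dict[Tuple[int, int], Dict[int, float]] = {}
    for state_action, next_counts in counts.items():
        total = sum(next_counts.values())
        model[state_action] = {s2: c / total for s2, c in next_counts.items()}
    return model
```

State-action pairs that never occur in the data are simply absent from the returned dictionary; how to treat them (for example, by excluding those actions in the affected states) is a separate design decision that the sketch leaves open.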
Fig. 1. Overview of the policy mixture framework.
Deriving the Model-Based Policy. To achieve better sample efficiency, PI, a model-based approach, is applied to derive our model-based policy $\pi_{\text{model-based}}(s,a)$ by solving the MDP and predicting the outcomes of the learned treatment strategy. The maximum number of iterations is 1000, and we stop the learning procedure once the policy becomes stable.

Deriving the Model-Free Policy. We adapt Dueling Double DQN on the complete features to learn the model-free policy $\pi_{\text{model-free}}(s,a)$ to achieve better asymptotic performance. For the architecture of the NN in DQN, we empirically choose 10 hidden layers, each containing 1024 neurons, to minimize the TD error of the learned policy. Batch normalization and the ELU activation function are applied to each hidden layer. The learning rate is 1e−4, and the ReduceLROnPlateau [41] scheduler is applied to reduce the learning rate.

Mixing the Policies. For each policy, we obtain the action probability over all actions in each state to derive $\pi_{\text{model-based}}(s,a)$ and $\pi_{\text{model-free}}(s,a)$. Due to the large state space and action space, some actions may never be performed by these policies. To tackle this problem, we soften the policies by taking the suggested action 99% of the time and any of the remaining actions 1% of the time in total. We combine the two policies in the form of a weighted sum as follows:

$$\pi_{\text{mixture}}(s,a) = w \cdot \pi_{\text{model-free}}(s,a) + (1 - w) \cdot \pi_{\text{model-based}}(s,a), \tag{5}$$
where $w$ is a weight vector in the shape of $402 \times 1$ (400 intermediate states and 2 terminal states). We choose WIS as the OPE estimator due to the fact that it is a consistent estimator. Since our goal is to maximize the expected return of the mixed policy, we choose WIS as the objective function, and add a sigmoid function to introduce nonlinearity. Therefore, the objective function in our approach is given by $J = \sigma(WIS) = 1/(1 + e^{-WIS(D)})$. We apply the gradient boosting algorithm to optimize the objective function to find an optimal weight vector $w$. The procedure of updating the weight vector is as follows:

$$\nabla w = \frac{\partial J}{\partial w} = \sigma(WIS)\,(1 - \sigma(WIS))\, \frac{\partial WIS}{\partial \pi_{\text{mixture}}}\, \frac{\partial \pi_{\text{mixture}}}{\partial w}, \tag{6}$$

$$w \leftarrow w + \alpha \cdot \nabla w, \tag{7}$$

where $\alpha$ is the learning rate. Due to the nonconvexity of the WIS-based objective function, we initialize $w$ randomly and optimize the function using three learning rates, 1e−3, 1e−4 and 1e−5, respectively, each running for 100 epochs.
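The softening step and the weighted sum of Eq. (5) can be written down compactly. This is a minimal sketch under stated assumptions: the helper names `soften` and `mix_policies` are hypothetical, the 1% residual mass is spread uniformly over the 24 non-suggested actions, and each per-state weight is assumed to lie in [0, 1] so that the mixture remains a probability distribution. The gradient update of Eqs. (6) and (7) is not reproduced here.

```python
from typing import List


def soften(suggested_action: int, n_actions: int = 25, eps: float = 0.01) -> List[float]:
    """Turn a deterministic action choice into a soft distribution.

    The suggested action keeps probability 1 - eps; the remaining eps is
    split evenly over the other actions (the even split is an assumption).
    """
    probs = [eps / (n_actions - 1)] * n_actions
    probs[suggested_action] = 1.0 - eps
    return probs


def mix_policies(pi_free: List[List[float]], pi_based: List[List[float]],
                 w: List[float]) -> List[List[float]]:
    """Per-state weighted sum: w[s] * pi_free[s][a] + (1 - w[s]) * pi_based[s][a]."""
    return [
        [w_s * p_free + (1.0 - w_s) * p_based for p_free, p_based in zip(row_free, row_based)]
        for w_s, row_free, row_based in zip(w, pi_free, pi_based)
    ]
```

For example, `mix_policies([soften(0)], [soften(1)], [0.25])` gives a one-state mixture in which action 1 (the model-based suggestion) carries most of the probability mass because its weight is 0.75.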
5 Experiment Results
The estimations of the discounted expected return of different policies, i.e. the physician policy, the model-free policy, the model-based policy and the policy derived from our policy mixture framework, are shown in Fig. 2. As can be seen, the policy derived from our framework outperforms the other policies. In particular, the model-based policy with the key features outperforms the policy with the complete features, which supports the claim that the key features can model the environment more efficiently because they can capture the transition probabilities more efficiently than the complete features.
Fig. 2. Expected return of policies.
Figure 3 shows the action matrices performed by these policies. The physician policy tends to use low vasopressor dosage. The model-free policy is similar to the physician policy but also explores some higher vasopressor dosage to achieve better treatment. The model-based policy with complete features and the policy derived from our framework tend to use higher vasopressor-intravenous dosage, generating policies that vary significantly from the physician policy.
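An action matrix of this kind is a 5 × 5 table of how often each (intravenous fluid bin, vasopressor bin) pair is chosen. The sketch below shows one way to tabulate it from a list of action indices. The index layout `index = 5 * iv_bin + vp_bin` is an assumption, since the ordering of the 25 indices is not stated in the text; it is at least consistent with index 0 meaning no drug given. The names `decode_action` and `action_matrix` are hypothetical.

```python
from typing import Iterable, List, Tuple

N_BINS = 5  # five dose bins per drug, giving the 5 x 5 action space


def decode_action(index: int) -> Tuple[int, int]:
    """Map an action index in 0..24 to (iv_bin, vp_bin).

    ASSUMPTION: index = 5 * iv_bin + vp_bin.
    """
    if not 0 <= index < N_BINS * N_BINS:
        raise ValueError(f"action index out of range: {index}")
    return divmod(index, N_BINS)


def action_matrix(actions: Iterable[int]) -> List[List[int]]:
    """Count how often each (iv_bin, vp_bin) pair occurs in a list of actions."""
    matrix = [[0] * N_BINS for _ in range(N_BINS)]
    for index in actions:
        iv_bin, vp_bin = decode_action(index)
        matrix[iv_bin][vp_bin] += 1
    return matrix
```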
Fig. 3. Action matrix of policies: (a) physician, (b) model-free policy, (c) model-based policy, (d) policy mixture. VP: vasopressor dosage, IV: intravenous fluids dosage.
With the transition model and the reward function we obtained, we can analyze the actual effect of the mixed policy. First, we divide the test data set into two parts, in which one set contains the survived patients (1776 in total) under the physician policy, the other contains the deceased patients (247 in total). We analyze the outcomes of our policy on these two kinds of patients. Then, we divide each kind of patients into four categories, 1) being discharged from hospital after recovery; 2) being in a good situation so far (i.e. the reward is non-negative); 3) being in a bad situation (i.e. the reward is negative); and 4) being deceased under our policy. We use the transition model and the reward function to analyze the survival status of patients under the mixed policy. There are two kinds of states that a patient can transit to, namely, the intermediate
states and the terminal states (i.e., the discharged and the deceased states). We can directly calculate the numbers of discharged and deceased patients by counting the patients who transition to the discharged and the deceased state, respectively, after applying the mixed policy. For the patients who transition to intermediate states, we use the reward function to measure how good the situation is: the patient is in a bad situation if the agent receives a negative reward after applying the mixed policy, and in a good situation otherwise. Table 1 shows the survival status of patients under our policy. Of the patients who survive under the physician policy, 31.138% are discharged under our policy, 68.806% are in a good situation, and just 0.056% are in a bad situation. Of the patients who did not survive under the physician policy, 24.696% are discharged under our policy, 74.494% remain in a good situation, and only 0.81% are in a bad situation. Most importantly, on the states currently available in the test data set, our policy lowers the probability of death and helps patients who would otherwise have died under the physician policy to live longer. The policy derived from the policy mixture framework thus gives patients a higher chance of surviving future treatment.

Table 1. Survival status of patients under policy mixture. Rows give the outcome under the physician policy; columns give the outcome under the policy mixture.

| Physician | Discharged | Good situation | Bad situation | Deceased |
|---|---|---|---|---|
| Discharged | 31.138% | 68.806% | 0.056% | 0.0% |
| Deceased | 24.696% | 74.494% | 0.81% | 0.0% |
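The four-way categorization described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the state labels `DISCHARGED` and `DECEASED`, the function names, and the toy inputs are hypothetical, and the sketch only assumes that the transition model yields a next state and the reward function yields a scalar reward for each patient.

```python
from collections import Counter

# Hypothetical labels for the two terminal states; the paper does not name them.
DISCHARGED = "discharged"
DECEASED = "deceased"


def categorize(next_state, reward):
    """Map a predicted next state and its reward to one of the four categories."""
    if next_state == DISCHARGED:
        return "discharged"
    if next_state == DECEASED:
        return "deceased"
    # Intermediate state: a non-negative reward counts as a good situation.
    return "good" if reward >= 0 else "bad"


def outcome_percentages(predictions):
    """Return the percentage of patients in each category.

    `predictions` is a list of (next_state, reward) pairs, one per patient.
    """
    counts = Counter(categorize(state, reward) for state, reward in predictions)
    total = len(predictions)
    return {
        name: 100.0 * counts[name] / total
        for name in ("discharged", "good", "bad", "deceased")
    }


# Toy input (made up for illustration, not data from the paper).
toy = [(DISCHARGED, 1.0), ("s17", 0.2), ("s4", -0.5), ("s9", 0.0)]
print(outcome_percentages(toy))
```

Applying such a routine separately to the two physician-policy groups would give one row of percentages per group, in the layout of Table 1.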
6 Conclusion
In this work, we propose a policy mixture framework that combines two policies learned with two different types of RL algorithms. We use PI to learn a model-based policy and Dueling Double DQN to learn a model-free policy. These two policies are then combined using a gradient boosting algorithm to maximize the expected return of the mixed policy. The OPE results show that blending the model-based and model-free policies yields improved treatment strategies. Furthermore, we examine the outcomes of our policy mixture framework for each patient in the test data set, and the results show that the mixed policy can increase the survival ratio of the patients. Future work will consider other methods for constructing the reward function and for evaluating the final policies, and will test our method on other critical care data sets.
More Efficient Sepsis Treatment
115
References 1. Byrne, L., Van Haren, F.: Fluid resuscitation in human sepsis: time to rewrite history? Ann. Intensive Care 7(1), 1–8 (2017). https://doi.org/10.1186/s13613016-0231-8 2. Friedman, J.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367– 378 (2002) 3. Futoma, J., et al.: Learning to treat sepsis with multi-output Gaussian process deep recurrent Q-networks (2018) 4. Gotts, J., Matthay, M.: Sepsis: pathophysiology and clinical management. BMJ 353, i1585 (2016). https://doi.org/10.1136/bmj.i1585 5. Hanna, J., Stone, P., Niekum, S.: Bootstrapping with models: confidence intervals for off-policy evaluation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 538–546 (2017) 6. Henmi, M., Yoshida, R., Eguchi, S.: Importance sampling via the estimated sampler. Biometrika 94(4), 985–991 (2007) 7. Johnson, A., Pollard, T., Shen, L., Li Wei, L., Feng, M., Ghassemi, M., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3(1), 1–9 (2016) 8. Komorowski, M., Celi, L.A., Badawi, O., Gordon, A., Faisal, A.: The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24(11), 1716–1720 (2018) 9. Komorowski, M., Gordon, A., Celi, L., Faisal, A.: A Markov decision process to suggest optimal treatment of severe infections in intensive care. In: Neural Information Processing Systems Workshop on Machine Learning for Health (2016) 10. Li, L., Komorowski, M., Faisal, A.: The actor search tree critic (ASTC) for off-policy POMDP learning in medical decision making. arXiv preprint arXiv:1805.11548 (2018) 11. Littman, M.: Reinforcement learning improves behaviour from evaluative feedback. Nature 521(7553), 445–451 (2015) 12. Marik, P.: The demise of early goal-directed therapy for severe sepsis and septic shock. Acta Anaesthesiol. Scand. 59(5), 561–567 (2015) 13. Marik, P., Bellomo, R.: A rational approach to fluid therapy in sepsis. BJA Br. J. 
Anaesthesia 116(3), 339–349 (2016) 14. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529– 533 (2015) 15. Nahler, G.: Last value carried forward (LVCF). In: Dictionary of Pharmaceutical Medicine, pp. 105–105. Springer, Vienna (2009). https://doi.org/10.1007/978-3211-89836-9 773 16. Pal, C.V., Leon, F.: Brief survey of model-based reinforcement learning techniques. In: 2020 24th International Conference on System Theory, Control and Computing, pp. 92–97. IEEE (2020) 17. Peng, X., Ding, Y., Wihl, D., Gottesman, O., Komorowski, M., Lehman, L.W., Ross, A., et al.: Improving sepsis treatment strategies by combining deep and kernel-based reinforcement learning. In: AMIA Annual Symposium Proceedings, vol. 2018, p. 887 (2018) 18. Pong, V., Gu, S., Dalal, M., Levine, S.: Temporal difference models: model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081 (2018) 19. Raghu, A., Komorowski, M., Ahmed, I., Celi, L.A., Szolovits, P., Ghassemi, M.: Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602 (2017)
116
X. Liu et al.
20. Raghu, A., Komorowski, M., Celi, L.A., Szolovits, P., Ghassemi, M.: Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In: Machine Learning for Healthcare Conference, pp. 147–163 (2017) 21. Raghu, A., Komorowski, M., Singh, S.: Model-based reinforcement learning for sepsis treatment. arXiv preprint arXiv:1811.09602 (2018) 22. Rhodes, A., Evans, L., Alhazzani, W., Levy, M., Antonelli, M., Ferrer, R., et al.: Surviving sepsis campaign: international guidelines for management of sepsis and septic shock: 2016. Intensive Care Med. 43(3), 304–377 (2017) 23. Roggeveen, L., El Hassouni, A., Ahrendt, J., Guo, T., Fleuren, L., Thoral, P., et al.: Transatlantic transferability of a new reinforcement learning model for optimizing haemodynamic treatment for critically ill patients with sepsis. Artif. Intell. Med. 112, 102003 (2021) 24. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015) 25. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) 26. Singer, M., Deutschman, C., Seymour, C.W., Shankar Hari, M., Annane, D., Bauer, M., et al.: The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 315(8), 801–810 (2016) 27. Sutton, R.: Learning to predict by the methods of temporal differences. Mach. Learn. 3(1), 9–44 (1988) 28. Sutton, R.: Generalization in reinforcement learning: successful examples using sparse coarse coding. In: Advances in Neural Information Processing Systems, pp. 1038–1044 (1996) 29. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018) 30. Sutton, R., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000) 31. 
Thomas, P., Theocharous, G., Ghavamzadeh, M.: High-confidence off-policy evaluation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, pp. 3000–3006 (2015) 32. Thomas, P., Theocharous, G., Ghavamzadeh, M.: High confidence policy improvement. In: International Conference on Machine Learning, pp. 2380–2388 (2015) 33. Utomo, C.P., Li, X., Chen, W.: Treatment recommendation in critical care: a scalable and interpretable approach in partially observable health states. In: 39th International Conference on Information Systems, pp. 1–9 (2018) 34. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, pp. 2094–2100 (2016) 35. Waechter, J., Kumar, A., Lapinsky, S., Marshall, J., Dodek, P., Arabi, Y., et al.: Interaction between fluids and vasoactive agents on mortality in septic shock: a multicenter, observational study. Crit. Care Med. 42(10), 2158–2168 (2014) 36. Wang, T., et al.: Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057 (2019) 37. Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., Freitas, N.: Dueling network architectures for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1995–2003 (2016) 38. Watkins, C.J.C.H.: Learning from delayed rewards. King’s College, Cambridge United Kingdom (1989)
More Efficient Sepsis Treatment
117
39. Yu, C., Liu, J., Nemati, S.: Reinforcement learning in healthcare: a survey. arXiv preprint arXiv:1908.08796 (2019) 40. Yu, C., Ren, G., Liu, J.: Deep inverse reinforcement learning for sepsis treatment. In: 2019 IEEE International Conference on Healthcare Informatics, pp. 1–3. IEEE (2019) 41. Zaheer, M., Reddi, S., Sachan, D., Kale, S., Kumar, S.: Adaptive methods for nonconvex optimization. In: Advances in Neural Information Processing Systems, vol. 31, pp. 9815–9825 (2018)
An Efficient Two-Stage Fusion Network for Computer-Aided Diagnosis of Diabetic Foot

Anping Song1(B), Hongtao Zhu1(B), Lifang Liu2, Ziheng Song3, and Hongyu Jin1

1 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
{Apsong,zhuhongtao66,jinhongyu 2019}@shu.edu.cn
2 Medical Examination Center, Shanghai Municipal Eighth People’s Hospital, Shanghai 200234, China
3 College of Engineering, University of Illinois at Urbana-Champaign, Urbana and Champaign 61820, USA
[email protected]
Abstract. Diabetic Foot (DF) threatens the health of every diabetic patient. Every year, more than one million people worldwide undergo amputation because DF is not diagnosed in time, so diagnosing DF at an early stage is essential. However, inexperienced doctors can easily confuse Diabetic Foot Ulcer (DFU) wounds with other specific ulcer wounds when patients’ health records are lacking, as in underdeveloped areas. In this paper, we propose an efficient two-stage fusion network that fuses global foot features and local wound features to classify DF images and non-DF images. In particular, we apply an object detection module to detect wounds, which assists the classification decision. The fusion network combines two crucial kinds of features, extracted from the foot area and the wound area. Our method is evaluated on a dataset collected by Shanghai Municipal Eighth People’s Hospital. In the training-validation stage, we use 1211 images for 5-fold cross-validation. Our method classifies DF images and non-DF images with an area under the receiver operating characteristic curve (AUC) of 94.87%, accuracy of 88.19%, sensitivity of 84.79%, specificity of 90.63%, and F1-score of 85.68%. Given this performance, the proposed algorithm has great potential in clinical auxiliary diagnosis.

Keywords: Diabetic foot diagnosis · Diabetic foot ulcers detection · Deep learning · Fusion network · Computer-aided diagnosis

This work was supported by the High Performance Computing Center of Shanghai University and Shanghai Engineering Research Center of Intelligent Computing System under Project 19DZ2252600.

© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 118–129, 2021. https://doi.org/10.1007/978-3-030-91415-8_11
1 Introduction
Diabetes mellitus has already become an epidemic. According to the International Diabetes Federation Diabetes Atlas released in 2019 [23], the number of diabetic patients will reach 578 million by 2030. DF is one of the most common and serious diabetic complications. Around 15% to 25% of diabetic patients will suffer from DF, and 33% of DF cases lead to amputation [1,14]. Every year, more than 1 million DF patients worldwide face amputation. Timely diagnosis of DF can help reduce the risk of amputation. However, early diabetic foot has no obvious external wound, and neuropathy caused by diabetes reduces patients’ perception of foot pain. In addition, DFU wounds resemble other chronic wounds (such as pressure ulcers and venous ulcers), which makes them easy to confuse [4]. On the one hand, inexperienced doctors can easily confuse DFU wounds with other specific ulcer wounds when patients’ health records are lacking, as in underdeveloped areas [4]. Patients, for their part, often mistake DFU wounds for common wounds and therefore fail to seek medical treatment in time. On the other hand, not all wounds on the feet of diabetic patients are DFU wounds [18]. Some doctors diagnose DF based only on the patient’s history of diabetes and the wounds on the foot. However, such wounds may be caused by accidents while the vascular condition of the patient’s foot is relatively good, in which case they are probably just common wounds. For these reasons, it is difficult to detect and diagnose DF accurately and in time. In clinical practice, professional diabetic doctors tend to detect and diagnose DF on the basis of the global foot conditions and the local wound conditions. The global foot conditions include the degree of skin wrinkling and abnormalities of the foot, while the local wound conditions include the depth, area and location of the wounds.
In this study, we propose an efficient two-stage fusion network that fuses global foot features and local wound features to classify DF images and non-DF images, and we evaluate it on all the cases in our training-validation set. Examples and numbers of DF and non-DF cases are shown in Fig. 1. To focus on the wound area, we reuse the object detection model from the MICCAI Diabetic Foot Ulcer Challenge 2020 [25] and transfer it to detect the chronic wounds in the DF and non-DF images of our dataset. To focus on the foot area, our method takes advantage of the attention mechanism by adding the convolution block attention module (CBAM) [24] to the network. The experimental results demonstrate that the proposed two-stage fusion network effectively improves classification performance.
2 Related Works
2.1 Computer-Aided Diagnosis of Diabetic Foot
Recent studies on computer-aided diagnosis of DF are based on hardware devices. Fraiwan et al. [5] used a thermal camera to acquire thermal images that indicate the possible development of ulcers. Madarasingha et al. [15] developed a diagnostic device that uses the same imaging technology to detect foot complications and monitor their progress. These studies rely on the theoretical knowledge that a temperature difference of more than 2.2 °C between a region on one foot and the same region on the contralateral foot is considered hyperthermia [16,21]. Monitoring such differences through thermal images has proved to be an efficient way of detecting DFU. However, infrared camera devices are highly sensitive to the external environment, which inevitably introduces errors. Besides, these devices are expensive. By contrast, computer vision methods are more stable and show more potential.

Fig. 1. Examples and numbers of DF images and non-DF images from our dataset. These DF images contain all Wagner grade types of DF. The non-DF images from our dataset mainly contain other chronic wounds from the feet and legs, such as pressure ulcer and venous ulcer of lower extremity.

2.2 Machine Learning Methods in DFU Diagnosis
In early studies, many researchers used machine learning methods to evaluate the condition of diabetic foot. Vardasca et al. [20] used the k-Nearest Neighbour method to classify infrared thermal images for early DFU prevention. Kasbekar et al. [12] used the C5.0 decision tree algorithm to analyze the risk of amputation in diabetic foot patients. Because early researchers had little data, traditional machine learning methods were widely used. As the amount of data has increased, deep learning methods have become more and more popular.

2.3 Deep Learning Methods in DFU Diagnosis
With the rapid development of deep learning, many researchers have applied deep learning methods to detect DFU. Goyal et al. [9] proposed two-tier transfer learning from bigger datasets to train fully convolutional networks to automatically segment ulcers and the surrounding skin. Goyal et al. [6] proposed DFUNet, which classifies normal and abnormal skin patches extracted from DF images. Goyal et al. [8] proposed a real-time method based on Faster-RCNN to detect DFU. Goyal et al. [7] proposed an Ensemble Convolutional Neural Network to recognize ischemia and infection of DFU. Han et al. [10] proposed a real-time detection and localization method for Wagner grades of DF based on refinements of YOLOv3 [17]. Although these studies have made great progress on DFU detection, several problems remain unsolved. First, the performance of existing methods is not satisfactory in complex clinical environments. Second, existing deep learning studies focus on DFU detection, while few pay attention to computer-aided diagnosis of DF.
3 Method and Experiment
The overall framework is shown in Fig. 2. The original input is the image. In the first stage, we use the object detection module to detect wounds. After that, we cut the wound area out and take it as one of the two inputs of fusion network. In the second stage, the fusion network extracts global foot features and local wound features and fuses them to classify DF images and non-DF images.
Fig. 2. The framework of our proposed method. The training process of our method consists of two stages: 1) training the object detection model and transferring it to locate the wound area; 2) training the fusion network with the original image and the wound area. In this figure, “CBAM” means convolution block attention module. “GAP” indicates the global average pooling layer, while “FC” indicates the fully connected layer.
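The hand-off between the two stages, where the detected wound area is cut out and passed to the fusion network together with the original image, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function `crop_wound`, the box format `(x1, y1, x2, y2, score)`, the choice of keeping only the highest-scoring box, and the blank fallback when nothing is detected are all hypothetical.

```python
def crop_wound(image, detections, blank_value=0):
    """Return the wound crop used as the local-branch input.

    `image` is a list of rows (each row a list of pixel values).
    `detections` is a list of (x1, y1, x2, y2, score) boxes in pixel coordinates.
    """
    height, width = len(image), len(image[0])
    if not detections:
        # Assumption: fall back to a blank image of the same size.
        return [[blank_value] * width for _ in range(height)]
    # Assumption: keep only the highest-scoring box.
    x1, y1, x2, y2, _ = max(detections, key=lambda box: box[4])
    # Clamp the box to the image bounds before slicing.
    x1, y1 = max(0, int(x1)), max(0, int(y1))
    x2, y2 = min(width, int(x2)), min(height, int(y2))
    return [row[x1:x2] for row in image[y1:y2]]


# Toy 4 x 6 "image" whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(6)] for r in range(4)]
boxes = [(1, 1, 4, 3, 0.9), (0, 0, 2, 2, 0.4)]
global_input = image                    # first input: the original image
local_input = crop_wound(image, boxes)  # second input: the wound crop
print(local_input)
```

In practice both inputs would be resized to the resolution expected by the feature extractors; that step is omitted here.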
3.1 Fusion Network

Our fusion network can be divided into two parts. The first part extracts global foot features and local wound features; the second part integrates them. On both branches, we take the 101-layer ResNet as the feature extractor and use it to obtain the decisive features of the global foot area and the local wound area, respectively. To make the network focus on the foot area, we add the CBAM into the ResNet on the global branch. The ResNet is composed of one convolution block and four residual blocks, and we add the CBAM after every residual block on the global branch. After extracting features, we use two global average pooling layers to integrate global spatial information: on each branch, a global average pooling layer is connected to the last convolution layer of that branch. Lastly, we concatenate the outputs of the two feature extractor modules and connect them to a fully connected layer for classification.
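The fusion step described above (global average pooling on each branch, concatenation, then one fully connected layer) can be sketched in plain Python. This is only an illustration of the data flow under stated assumptions: the feature maps are tiny made-up arrays rather than ResNet-101 outputs, the weights and the two-class softmax output are hypothetical, and the CBAM blocks are not modelled.

```python
import math


def global_average_pool(feature_maps):
    """Average each channel's H x W map to a single value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def fully_connected(vector, weights, biases):
    """Compute one logit per output class: weights[k] . vector + biases[k]."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Turn logits into probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(global_maps, local_maps, weights, biases):
    """Pool both branches, concatenate the pooled vectors, and classify."""
    fused = global_average_pool(global_maps) + global_average_pool(local_maps)
    return softmax(fully_connected(fused, weights, biases))


# Toy feature maps: 2 channels per branch, each 2 x 2 (values are made up).
global_maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]
local_maps = [[[2.0, 2.0], [2.0, 2.0]], [[4.0, 0.0], [0.0, 0.0]]]
# Hypothetical FC parameters for two classes over the 4 fused features.
weights = [[0.1, 0.0, -0.2, 0.3], [-0.1, 0.2, 0.2, 0.0]]
biases = [0.0, 0.0]
probs = fuse_and_classify(global_maps, local_maps, weights, biases)
print(probs)
```

Because each branch is pooled before concatenation, the two inputs may have different spatial sizes; only the channel counts determine the input width of the fully connected layer.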
3.2 Object Detection Module

The object detection method is the one we used in the MICCAI Diabetic Foot Ulcer Challenge 2020 [25]. Considering that DFU is a type of chronic wound and is similar to other specific chronic wounds, we take the object detection model trained on the MICCAI DFUC2020 dataset and transfer it to detect chronic wounds. The detection results are shown in Fig. 3. The architecture of the object detection network is shown in Fig. 4 [25]. We use the cascade structure [3] as the main framework of the object detection network because of its strong performance. To make the network handle different scales, we use DetNet as the backbone. Wound detection differs from traditional object detection tasks: in the latter, objects can appear anywhere in an image, whereas wounds can only appear on the feet. This makes the task well suited to soft attention. Therefore, we incorporate the structure of the attention branch [22] into DetNet [13].
Fig. 3. Wound detection results of our object detection model. (a) The two columns on the left are DF images from our dataset. (b) The two columns on the right are non-DF images from our dataset.
Fig. 4. The framework of our object detection module. In the Cascade Attention-DetNet (CA-DetNet), “Input” is the input image, “A-DetNet” is the backbone network, “Pool” is region-wise feature extraction, “H” is the network head, “B” is the bounding box and “C” is the classification. “B0” is the proposals in all architectures. In the A-DetNet, “Conv 7 * 7” indicates a 7 × 7 convolution layer (with stride 2), “Maxpooling” indicates a max-pooling layer, “Conv 1 * 1” indicates a 1 × 1 convolution layer, and “FPN” means Feature Pyramid Networks. In the A-ResBody and A-DetBody, DetBlock means a dilated bottleneck or a dilated bottleneck with 1 × 1 convolution projection, and ResBlock means an original bottleneck or a bottleneck with 1 × 1 convolution projection. “UpSampling” indicates an up-sampling layer, and “Sigmoid” means the sigmoid activation function.
3.3 Dataset

Our dataset was collected by professional diabetic doctors from the Shanghai Municipal Eighth People’s Hospital over the past few years, during which our doctors followed the patients’ progress for a long time. The dataset has 1211 images of DF and non-DF cases, including 507 positive samples and 704 negative samples. The DF images contain only DFU wounds. The non-DF images all contain chronic wounds other than DFU, mainly pressure ulcers, venous ulcers of the lower extremity, and wounds from other causes; these chronic wounds are mainly on the feet and legs. The images were taken by professional diabetic foot doctors using an iPad 3. Out-of-focus and blurry images were discarded. A few examples of DF and non-DF images are shown in Fig. 1. Our professional diabetic foot doctors labelled all images in the dataset according to the patients’ disease records. The images in the dataset differ in size. To meet the input requirements of the object detection model, we resize all images to 640 × 480.
4 Results and Analysis
4.1 Experimental Results
For evaluation, we conduct five-fold cross-validation. We use five metrics to measure the classification results of the model: AUC, accuracy, sensitivity, specificity and F1-score. The experimental results are shown in Table 1. Comparing the methods, the AUC of the FusionNet with CBAM is 0.65 percentage points higher than that of the FusionNet without CBAM, which indicates the benefit of the attention mechanism. The AUC of the FusionNet with the CA-DetNet model is 0.43 percentage points higher than that of the FusionNet with the YOLOV3 [17] model. The following section analyzes the impact of object detection quality on the final classification in detail. Among the methods that use a single kind of feature, the AUC of DFUNet [6], which stacks parallel convolutions, is 2.44 percentage points higher than that of ResNet [11] and 0.09 percentage points higher than that of Inception-ResNet-V2 [19]. The DFUNet contains the Inception-ResNet-V2, which also confirms the good performance of ensemble learning. The AUC of DFU QUTNet [2] is 0.42 percentage points higher than that of DFUNet. Compared with the single-feature methods, the AUC of the FusionNet, which uses two kinds of features, is 1.00 percentage points higher than that of the best single-feature method (DFU QUTNet, 93.87%). The proposed FusionNet therefore outperforms all compared methods in AUC.
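For readers who want to reproduce this kind of summary, the threshold-based metrics and the MEAN ± STD aggregation over folds can be sketched as below. This is a minimal sketch under stated assumptions: the confusion counts are made up, DF is taken as the positive class, and the sample standard deviation is used because the paper does not say which variant it reports. AUC is omitted because it requires the predicted scores rather than confusion counts.

```python
import statistics


def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity and F1 in percent (DF = positive)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": 100 * accuracy,
        "sensitivity": 100 * sensitivity,
        "specificity": 100 * specificity,
        "f1": 100 * f1,
    }


def mean_and_std(values):
    """Mean and sample standard deviation across folds (assumed variant)."""
    return statistics.mean(values), statistics.stdev(values)


# Made-up confusion counts (tp, fp, tn, fn) for five validation folds.
folds = [(85, 14, 127, 16), (88, 12, 129, 13), (84, 15, 126, 17),
         (87, 13, 128, 14), (86, 12, 129, 15)]
per_fold = [classification_metrics(*counts) for counts in folds]
for name in ("accuracy", "sensitivity", "specificity", "f1"):
    mean, std = mean_and_std([m[name] for m in per_fold])
    print(f"{name}: {mean:.2f} ± {std:.2f}")
```

With only five folds the standard deviation is itself a noisy estimate, which is worth keeping in mind when comparing methods whose means differ by less than one standard deviation.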
4.2 Detailed Analysis
Table 1. Comparison of classification results of different models on five folds (FusionNet: our fusion network). The results of AUC, accuracy, sensitivity, specificity and F1-score are presented in this table. The results are the combined results of five folds. We show MEAN ± STD (standard deviation) scores, in %, of the five trained models, one per training-validation fold.

| Methods | AUC | Accuracy | Sensitivity | Specificity | F1-score |
|---|---|---|---|---|---|
| FusionNet (CA-DetNet+ResNet w/CBAM) | 94.87 ± 1.04 | 88.19 ± 1.83 | 84.79 ± 4.74 | 90.63 ± 2.91 | 85.68 ± 2.35 |
| FusionNet (CA-DetNet+ResNet w/o CBAM) | 94.22 ± 1.20 | 87.28 ± 1.49 | 84.21 ± 4.95 | 89.49 ± 1.59 | 84.63 ± 2.34 |
| FusionNet (YOLOV3+ResNet w/CBAM) | 94.44 ± 1.18 | 85.04 ± 3.89 | 86.57 ± 7.28 | 86.78 ± 6.27 | 84.44 ± 3.09 |
| Inception-ResNet-v2 | 93.36 ± 1.26 | 84.96 ± 2.43 | 89.12 ± 6.11 | 82.00 ± 7.20 | 83.26 ± 2.06 |
| ResNet101 | 91.01 ± 0.54 | 84.13 ± 0.87 | 83.20 ± 2.24 | 84.81 ± 1.57 | 81.43 ± 1.08 |
| DFUNet | 93.45 ± 1.25 | 86.45 ± 1.58 | 83.99 ± 2.77 | 88.23 ± 4.44 | 83.88 ± 1.29 |
| DFU QUTNet | 93.87 ± 0.13 | 85.78 ± 1.01 | 87.54 ± 3.24 | 84.50 ± 2.57 | 83.72 ± 1.21 |

Table 2. The results of contrast experiments on five folds (two images: the inputs of the global branch and the local branch are both original images; one image: the input of the local branch is blank). The results of AUC, accuracy, sensitivity, specificity and F1-score are presented in this table. The results are the combined results of five folds. We show MEAN ± STD (standard deviation) scores, in %, of the five trained models, one per training-validation fold.

| Methods | AUC | Accuracy | Sensitivity | Specificity | F1-score |
|---|---|---|---|---|---|
| FusionNet w/two images | 92.78 ± 0.84 | 86.11 ± 1.37 | 83.59 ± 4.63 | 87.92 ± 3.18 | 83.39 ± 1.88 |
| FusionNet w/one image | 91.44 ± 0.70 | 83.39 ± 2.53 | 84.96 ± 7.56 | 82.27 ± 7.38 | 81.03 ± 2.68 |

To demonstrate the effectiveness of wound detection, we design a contrast experiment, whose results are shown in Table 2. First, we set the input of the local branch to a blank image, which represents the situation where the object detection model detects nothing. The results show that failed detection reduces every metric except sensitivity relative to the full FusionNet (Table 1). However, the network still achieves good performance, because the original image retains all the information; the only drawback is that the network cannot learn accurate wound features. In addition, we set the input of the local branch to the original image, which represents the situation where the object detection model detects a lot of irrelevant information. The results show that inaccurate detection does not greatly affect diagnostic accuracy. Based on this analysis, we believe that wound features help the network classify more accurately, and that false detections do not undermine the overall good performance of the network.

Next, we consider the impact of different object detection models on the classification results. Comparing the CA-DetNet model with the YOLOV3 model, we find that object detection models with good detection performance lead to similar classification results.

We then consider the influence of global foot features on classification. For DF images, global factors such as the location of the wound, the degree of skin wrinkling and abnormalities of the foot can help doctors diagnose DF. For non-DF images, global information can also help doctors diagnose related diseases. Fig. 5(a) shows a case of venous occlusion of the lower extremity. Judging from the wound alone, it looks like DFU. However, the area around the wound shows that it is clearly not DF: doctors generally rely on the melanin deposition on the skin to make a non-DF diagnosis. To visualize this, we generate the class activation mapping (CAM) [26] from the trained model. Fig. 5(b) shows that the network also tends to focus on the abnormal skin color on the leg when classifying the image as non-DF. This visualization indirectly reflects the quality of the classification. Based on this analysis, we believe that global foot features play an extremely important role in classification.
Fig. 5. A case of venous occlusion of lower extremity. (a) The original image. (b) The visualization result of the original image. For the visualization result, we show the CAM results of FusionNet.
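The class activation mapping [26] used for this kind of visualization applies to networks whose last convolutional layer is followed by global average pooling and a fully connected layer: the map for a class is the sum of the last-layer feature maps, each weighted by that class's fully connected weight for the corresponding channel. The sketch below illustrates this computation with made-up numbers. How the maps and weights are taken from the two-branch FusionNet is not specified in the text, so the inputs here are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the channel maps: CAM[y][x] = sum_k w_k * f_k[y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, channel in zip(class_weights, feature_maps):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]
    return cam


def normalize(cam):
    """Rescale to [0, 1] so the map can be overlaid on the image as a heat map."""
    flat = [value for row in cam for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0] * len(row) for row in cam]
    return [[(value - low) / (high - low) for value in row] for row in cam]


# Toy example: two 2 x 2 feature maps and made-up class weights.
feature_maps = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 4.0], [0.0, 0.0]]]
weights_for_class = [1.0, 0.5]
heat_map = normalize(class_activation_map(feature_maps, weights_for_class))
print(heat_map)
```

The resulting low-resolution map is normally upsampled to the input image size before being overlaid, which is omitted here.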
Finally, we consider the relationship between the two kinds of features and classification. To analyze the impact of global foot features and local wound features on classification, we design another contrast experiment. On the one hand, we remove the wound area from the original image; on the other hand, we cut the wound area out. We take the former as the foot features and the latter as the wound features, and use ResNet to evaluate the contribution of each kind of feature to classification. The experimental results are shown in Table 3. Over the five folds, the AUC of ResNet with foot features is 5.18 percentage points higher than that of ResNet with wound features. This result shows that foot features guide the network to classify more accurately than wound features. Combined with the above analysis, we speculate that the network relies mainly on foot features, supplemented by wound features, to classify DF images and non-DF images. This also indirectly suggests that it is not easy to diagnose DF from wounds alone. Therefore, we suggest that diagnosis should be based on foot features and supplemented by wound features. Our proposed fusion network can effectively combine the advantages of the two kinds of features for classification.

Table 3. The results of contrast experiments on five folds. The results of AUC, accuracy, sensitivity, specificity and F1-score are presented in this table. The results are the combined results of five folds. We show MEAN ± STD (standard deviation) scores, in %, of the five trained models, one per training-validation fold.

| Methods | AUC | Accuracy | Sensitivity | Specificity | F1-score |
|---|---|---|---|---|---|
| FusionNet w/foot features | 87.24 ± 2.71 | 81.55 ± 4.02 | 79.84 ± 4.37 | 80.84 ± 3.16 | 77.30 ± 2.94 |
| FusionNet w/wound features | 82.06 ± 3.65 | 75.94 ± 3.85 | 72.14 ± 3.89 | 78.68 ± 4.11 | 71.53 ± 4.26 |

5 Conclusion and Discussion
For DF, it is important to obtain a diagnosis as soon as possible. Inexperienced doctors are likely to mistake DFU for other chronic wounds when patients’ health records are lacking, as in underdeveloped areas, which leads to misdiagnosis and increases the risk of amputation. In China, patients tend to be triaged to the general practice department of a hospital, and it is hard for them to be treated by professional diabetes doctors, which increases the risk of misdiagnosis. With the development of medical image processing technology, it is beneficial to develop an automatic computer-aided method for diagnosing DF. In this study, we propose an efficient two-stage fusion network to classify DF images and non-DF images. We evaluate our method on our own dataset using five-fold cross-validation to assess the generalization ability of the model, achieving an AUC of 94.87%, accuracy of 88.19%, sensitivity of 84.79%, specificity of 90.63%, and F1-score of 85.68%. At the same time, to better understand the decisions of the deep learning model, we also refine the CAM and show the visualization results, which reveal the important regions. This study has the potential to be put into practice to reduce the burden on clinicians and the pain of patients.
A Heterogeneous Graph Convolutional Network-Based Deep Learning Model to Identify miRNA-Disease Association
Zicheng Che1, Wei Peng1,2(B), Wei Dai1,2, Shoulin Wei1,2, and Wei Lan3
1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650050, China
2 Computer Technology Application Key Lab of Yunnan Province, Kunming University of Science and Technology, Kunming 650050, China
3 Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning 530004, China
Abstract. MiRNAs have been shown to be implicated in human diseases. Disease-related miRNAs are expected to be novel biomarkers for disease therapy and drug development. This work develops a Heterogeneous Graph Convolutional Network-based deep learning model, namely HGCNMDA, to perform a MiRNA-Disease Association prediction task. We construct a three-layer heterogeneous network consisting of a miRNA layer, a disease layer, and a gene layer. Then we prepare two kinds of attributes for every node in the network and refine the nodes into several node types according to their attributes. After that, a heterogeneous graph convolutional network is employed to learn feature representations for miRNAs, diseases, and genes with finer-grained node type and edge type information. Finally, the miRNA-disease associations are recovered by the inner product of the miRNA features and disease features. The experimental results on the human miRNA-disease association dataset show that HGCNMDA achieves better AUC values than five other state-of-the-art models.
Keywords: Heterogeneous network embedding · Graph convolutional network · miRNA · Disease · miRNA-disease association prediction
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 130–141, 2021. https://doi.org/10.1007/978-3-030-91415-8_12
1 Introduction
MicroRNAs (miRNAs) are a group of small non-coding RNAs of about 21 nucleotides. Studies have revealed that miRNAs play an essential role in many critical biological processes and are implicated in human diseases. Hence, disease-related miRNAs are expected to be novel biomarkers for disease therapy and drug development. Recently, a massive number of miRNAs have been sequenced. However, only 1206 miRNAs have been experimentally determined to be associated with 893 diseases according to the data in HMDD 3.0 [1]. A large gap therefore remains between the sequenced miRNAs and their known disease associations. Experimental verification of miRNA-disease associations is time-consuming and costly. Therefore, computational approaches have been proposed to identify these associations [2].
These computational methods are based on the common observation that similar miRNAs are more likely to be associated with the same diseases, and vice versa [2]. Hence, they usually construct a heterogeneous network that includes a miRNA similarity network, a disease similarity network, and a known miRNA-disease association network. Various network-based approaches have been proposed to infer novel miRNA-disease associations from this heterogeneous network. They are generally categorized into three types: path-based methods, network propagation-based methods, and network representation learning-based methods.
Path-based methods score the association between a miRNA and a disease by counting the paths that connect them. Zeng et al. [3] enumerate all paths between miRNAs and diseases in the heterogeneous network and score the links between miRNAs and diseases by linearly combining their path scores. You et al. [4] find the paths between miRNAs and diseases by a depth-first search on the heterogeneous network and, after filtering out long paths, combine the remaining paths into association scores. Path-based methods take a long time to find all paths between miRNAs and diseases and ignore the differences among the paths.
Network propagation-based methods, e.g., [5], usually run a random walk on the heterogeneous network to propagate information and infer potential miRNA-disease associations from similar neighbors. Considering the different structures of the miRNA similarity network and the disease similarity network, Luo et al. [6] introduce an unbalanced bi-random walk model that walks different numbers of steps on the two similarity networks to measure the probability that a miRNA is linked with a disease.
These network propagation-based methods can capture the similarities between miRNAs and diseases based on multiple paths in the heterogeneous network. However, they depend heavily on the reliability of the networks and may introduce a large number of false-positive associations.
The network embedding-based methods learn features in a latent space for miRNAs and diseases and then use these features to reconstruct the miRNA-disease associations and make predictions. Chen et al. [7] develop a Regularized Least Squares (RLS) framework that estimates the probability that each miRNA is related to a given disease by defining a cost function requiring that similar miRNAs (diseases) obtain similar miRNA-disease scores. Lan et al. [8] introduce a Bayesian matrix factorization method that fuses a miRNA functional similarity network, a miRNA sequence similarity network, and a disease semantic similarity network to mine potential associations between miRNAs and diseases. Some studies, e.g., GRNMF [9], DNRLMF-MDA [10], and MDN-NMTF [11], employ matrix factorization to learn representation vectors for miRNAs and diseases. Matrix completion is another popular technique for identifying potential miRNA-disease associations [12–14]. It updates the known miRNA-disease association matrix by defining an objective function that keeps the scores in the predicted matrix close to those in the known one. Li et al. [15] leverage graph convolutional networks (GCNs) to learn latent feature representations of miRNAs and diseases from the miRNA and disease similarity networks separately. They then feed the features into a neural inductive matrix completion model to complete the association matrix.
Although previous works have achieved promising results, most of them rely on a heterogeneous network that consists only of miRNAs, diseases, and their inter- and intra-associations. MiRNAs suppress the expression of their target genes, and the dysfunction of genes usually causes diseases, which implies that genes play a crucial role in bridging miRNAs and diseases. This motivates us to introduce gene information to improve association prediction. Previous methods use gene information to explore the links between miRNAs and diseases in an intuitive way. Zeng et al. [3] measure miRNA similarity based on how significantly two miRNAs share target genes. Peng et al. [11] calculate disease similarity according to the functional similarity of the disease-related genes. Some methods [16, 17] measure the significance of the association between a miRNA and a disease by how relevant their related genes are. Peng et al. [18] and Li et al. [19] adopt a supervised deep-learning framework to determine the labels of miRNA-disease pairs. They design miRNA features and disease features by calculating the associations between miRNAs and genes and between diseases and genes.
Considering the complexity of the relationships among miRNAs, diseases and genes, we construct a three-layer heterogeneous network consisting of a miRNA layer, a disease layer, and a gene layer. In the heterogeneous network, various types of edges connect the nodes within and between the layers. Our previous work [20, 21] performs random walks on the heterogeneous network to mine potential miRNA-disease associations. However, it cannot propagate information along different types of edges automatically and efficiently. This work develops a Heterogeneous Graph Convolutional Network-based deep learning model, namely HGCNMDA, to perform a MiRNA-Disease Association prediction task. We introduce a gene layer and construct a three-layer heterogeneous network.
Then we prepare two kinds of attributes for every node in the network. One is generated by passing the similarity data through a nonlinear layer. The other aggregates the attributes of local neighbors with a graph convolutional network. We refine the nodes in the network into several node types according to their attributes and tailor the connections between these node types based on known associations and their relevance to the prediction task. After that, a heterogeneous graph convolutional network is employed to learn feature representations for miRNAs, diseases, and genes via message passing between nodes, with finer-grained node type and edge type information. Finally, the miRNA-disease associations are recovered by the inner product of the miRNA features and disease features. We test our model on the human miRNA-disease association data set. Compared with other state-of-the-art algorithms, HGCNMDA achieves the best AUC values in our experiments.
2 Materials
We prepared the data sets in a way similar to [18]. Building the heterogeneous network requires six kinds of data: disease-gene associations, miRNA-gene associations, miRNA-disease associations, disease similarity data, miRNA similarity data, and a gene network. The disease-gene associations were taken from the DisGeNET v7.0 database [18]. The disease similarity network was obtained from You et al. [4]. The miRNA-gene associations came from miRWalk2.0 [22]. We used the miRNA similarity matrix generated in [4]. The gene network was downloaded from the STRING database [23]. We removed from the gene network the genes that have no relations with miRNAs. The known human miRNA-disease associations were obtained from the HMDD 3.0 database [1]. In total, 204 diseases, 1789 genes, 243 miRNAs, 3072 validated miRNA-disease associations, 2639 gene-disease associations and 2455 gene-miRNA associations are involved in our experiments. The 3072 validated miRNA-disease associations were used as the positive set.
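The association data described above are naturally stored as binary matrices. The following is a minimal sketch, assuming the validated pairs are already available as zero-based (row, column) index pairs; the function name, the toy sizes and the toy pairs are hypothetical and are not taken from the paper.

```python
import numpy as np


def build_association_matrix(pairs, n_rows, n_cols):
    """Return a binary matrix A with A[i, j] = 1 for every (i, j) in pairs."""
    A = np.zeros((n_rows, n_cols), dtype=np.int8)
    for i, j in pairs:
        A[i, j] = 1  # repeated pairs simply overwrite the same entry
    return A


# Toy sizes and index pairs (hypothetical). The data set described above has
# 243 miRNAs, 204 diseases and 3072 validated miRNA-disease pairs.
n_mirna, n_disease = 4, 3
known_pairs = [(0, 1), (2, 0), (3, 2), (0, 1)]
A_md = build_association_matrix(known_pairs, n_mirna, n_disease)
```

The same helper could be reused for the gene-disease and gene-miRNA associations by changing the dimensions.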
Fig. 1. The overview of HGCNMDA
3 Methods
Our model performs the link prediction task in four steps. First, we build a three-layer heterogeneous network consisting of a miRNA layer, a disease layer, and a gene layer. Second, we prepare the attributes for the nodes. Third, we employ a GCN encoder to learn latent representations of the nodes with finer-grained node type and edge type information. Finally, we make link predictions from the learned embeddings with an edge decoding model. Figure 1 shows the overview of our method.
3.1 Heterogeneous Network Construction
This work constructs a three-layer heterogeneous network consisting of a miRNA layer, a disease layer, and a gene layer. In the heterogeneous network, various types of edges
connect the nodes within and between the layers. In the miRNA layer, miRNAs with similar functions are linked to form a miRNA similarity network. The miRNAs also link with diseases and genes to perform certain functions. Let $G_m \in \mathbb{R}^{n_m \times n_m}$, $G_d \in \mathbb{R}^{n_d \times n_d}$ and $G_g \in \mathbb{R}^{n_g \times n_g}$ represent the miRNA similarity network, the disease similarity network and the gene network, respectively, where $n_m$, $n_d$ and $n_g$ are the numbers of miRNAs, diseases and genes. $A_{md} \in \{0, 1\}^{n_m \times n_d}$ denotes the known miRNA-disease association matrix: $A_{md}(i, j) = 1$ if miRNA $i$ is associated with disease $j$, and $A_{md}(i, j) = 0$ otherwise. Similarly, $A_{mg} \in \{0, 1\}^{n_m \times n_g}$ and $A_{dg} \in \{0, 1\}^{n_d \times n_g}$ denote the miRNA-gene and disease-gene association matrices.
3.2 Node Attributes Preparation
The nodes of the heterogeneous network are miRNAs, diseases and genes. We prepare two kinds of attributes for every node: an initial attribute and an inductive attribute. The initial attributes are generated by passing the similarity data through a nonlinear layer. We generate the initial attributes of diseases ($F_d^1$), miRNAs ($F_m^1$), and genes ($F_g^1$) by separately passing the disease similarity network $G_d$, the miRNA similarity network $G_m$, and the gene network $G_g$ through three one-layer fully connected networks. We introduce inductive attributes for every node with a graph convolutional network. A miRNA node of the heterogeneous network may connect to three different neighbor types, i.e., miRNAs, diseases, and genes. Hence, we implement three separate one-layer graph convolution operations on the miRNA similarity network, the known miRNA-disease association network, and the miRNA-gene association network to produce three kinds of inductive attributes for miRNAs. Mathematically, the inductive attribute of miRNAs ($F_m^2$) generated by a graph convolution operation on the miRNA similarity network is defined as follows.
$$F_m^2 = \mathrm{relu}\left(\tilde{L}_m G_m \theta_m^2 + b_m^2\right) \tag{1}$$
where $\tilde{L}_m = \tilde{D}_m^{-1/2} \tilde{A}_m \tilde{D}_m^{-1/2}$, $\tilde{A}_m = A_m + I_m$, $A_m$ denotes the adjacency matrix of $G_m$, and $[\tilde{D}_m]_{ii} = \sum_j [\tilde{A}_m]_{ij}$. $\theta_m^2 \in \mathbb{R}^{n_m \times h}$ is the weight parameter matrix, with $n_m$ the number of miRNAs and $h$ the number of hidden-layer nodes, and $b_m^2$ is the bias matrix. Similarly, we calculate the inductive attributes of diseases ($F_d^2$) on the disease similarity network ($G_d$) by the following equation.
$$F_d^2 = \mathrm{relu}\left(\tilde{L}_d G_d \theta_d^2 + b_d^2\right) \tag{2}$$
where $\tilde{L}_d = \tilde{D}_d^{-1/2} \tilde{A}_d \tilde{D}_d^{-1/2}$, $\tilde{A}_d = A_d + I_d$, $A_d$ denotes the adjacency matrix of $G_d$, and $[\tilde{D}_d]_{ii} = \sum_j [\tilde{A}_d]_{ij}$. We also run a single-layer graph convolution operation on the miRNA-disease associations to obtain another inductive attribute for miRNAs ($F_m^3$) and diseases ($F_d^3$) (see Eq. (3)).
$$\begin{bmatrix} F_m^3 \\ F_d^3 \end{bmatrix} = \mathrm{relu}\left(\tilde{L}_{md} F_{md}^1 \theta_m^3 + b_m^3\right) = \mathrm{relu}\left(\begin{bmatrix} \tilde{D}_{md(m)}^{-1} F_m^1 & \tilde{D}_{md(m)}^{-1/2} A_{md} \tilde{D}_{md(d)}^{-1/2} F_d^1 \\ \tilde{D}_{md(d)}^{-1/2} A_{md}^T \tilde{D}_{md(m)}^{-1/2} F_m^1 & \tilde{D}_{md(d)}^{-1} F_d^1 \end{bmatrix} \theta_m^3 + b_m^3\right) \tag{3}$$
where $A_{md}$ denotes the known miRNA-disease association matrix, $[\tilde{D}_{md(m)}]_{ii} = \sum_j [A_{md}]_{ij} + 1$, $[\tilde{D}_{md(d)}]_{ii} = \sum_j [A_{md}^T]_{ij} + 1$, and $\tilde{D}_{md} = \begin{bmatrix} \tilde{D}_{md(m)} & 0 \\ 0 & \tilde{D}_{md(d)} \end{bmatrix}$. $A_{md}$ can be normalized as $\tilde{L}_{md} = \tilde{D}_{md}^{-1/2} \tilde{A}_{md} \tilde{D}_{md}^{-1/2}$, reading $\tilde{A}_{md}$ as the block matrix $\begin{bmatrix} I & A_{md} \\ A_{md}^T & I \end{bmatrix}$, whose self-loops account for the $+1$ in the degree definitions. $F_{md}^1 = \begin{bmatrix} F_m^1 & 0 \\ 0 & F_d^1 \end{bmatrix}$. $\theta_m^3$ and $b_m^3$ are parameter matrices. Similarly, the inductive attributes for miRNAs ($F_m^4$) and genes ($F_g^2$) on the miRNA-gene associations are calculated by Eq. (4). The inductive attributes for diseases ($F_d^4$) and genes ($F_g^3$) on the disease-gene associations are calculated by Eq. (5).
$$\begin{bmatrix} F_m^4 \\ F_g^2 \end{bmatrix} = \mathrm{relu}\left(\tilde{L}_{mg} F_{mg}^1 \theta_m^4 + b_m^4\right) \tag{4}$$
$$\begin{bmatrix} F_d^4 \\ F_g^3 \end{bmatrix} = \mathrm{relu}\left(\tilde{L}_{dg} F_{dg}^1 \theta_d^4 + b_d^4\right) \tag{5}$$
Here $\tilde{L}_{mg}$, $\tilde{L}_{dg}$, $F_{mg}^1$ and $F_{dg}^1$ are constructed in the same way as $\tilde{L}_{md}$ and $F_{md}^1$.
3.3 Node Embedding with Graph Convolution Network
After preparing attributes for every node in the heterogeneous network, we learn feature representations (embeddings) for miRNAs, diseases and genes by employing the GCN to propagate and transform information. We refine the nodes in the network into several node types according to their attributes and tailor the connections between these node types based on known associations and their relevance to the prediction task. Therefore, the heterogeneous GCN model can aggregate, for a given node, the attributes of those neighboring nodes that are conducive to exploring potential miRNA-disease associations, and it can handle each edge type independently. In practice, we reconstruct a super-heterogeneous network by extending and grouping miRNAs, diseases and genes into several node types based on their attribute types. The super-heterogeneous network thus consists of eight types of nodes (three types for miRNAs, three types for diseases and two types for genes). We call the eight types of nodes m1, m2, m3, d1, d2, d3, g2 and g3, respectively. m1, m2 and m3 are three miRNA node types containing the same miRNAs with different attributes. d1, d2 and d3 are disease node types, and g2 and g3 are gene node types. In the super-heterogeneous network, there are no links between nodes of the same node type. The connection status between the eight types of nodes in the super-heterogeneous network is described
in Table 1. If the table value is 1, nodes of the two types may connect according to their original similarity network or association matrix; otherwise, nodes of the two types are not connected. Hence, the super-heterogeneous network resembles a multi-relational graph, in which a node of one type connects to nodes of other types through different types of edges. To aggregate features for nodes in the super-heterogeneous network with finer-grained edge type information, we introduce Eq. (6).
$$F'_k(i) = \mathrm{relu}\left(\sum_{l \in R} \sum_{j \in N_i^l} \frac{1}{c_{i,l}} W_l F(j) + W_{0,k} F_k(i) + B_{0,k}\right) \tag{6}$$
$F_k(i)$ is the input feature vector of node $v_i$ of the $k$-th node type, and $F'_k(i)$ is the learned feature vector of node $v_i$ after graph convolution. $l \in R$ ranges over the edge types. $N_i^l$ is the set of neighbors of node $v_i$ that are linked to it by edge type $l$. $F(j)$ is the input feature vector of node $v_j \in N_i^l$. $W_l$ is the weight parameter associated with edge type $l$. $c_{i,l} = |N_i^l|$ is the number of neighbors linked to node $v_i$ by edge type $l$. $W_{0,k}$ is the weight parameter that retains the information of the node itself in node type $k$, and relu is the activation function. After running the relational graph convolution on the super-heterogeneous network, we obtain feature vectors for the nodes of every node type. They are defined as $F = \{F_m^1, F_m^2, F_m^3, F_d^1, F_d^2, F_d^3, F_g^2, F_g^3\}$. We add $F_m^1$, $F_m^2$ and $F_m^3$ to generate the final embedding feature vectors for miRNAs ($H_m$). Similarly, we add $F_d^1$, $F_d^2$ and $F_d^3$ to obtain the final embedding feature vectors for diseases ($H_d$), and add $F_g^2$ and $F_g^3$ to obtain the final embedding feature vectors for genes ($H_g$).
Table 1. The connection status between the eight types of nodes in the super-heterogeneous network (rows and columns: m1, m2, m3, d1, d2, d3, g2, g3; a value of 1 marks a permitted connection)
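The aggregation in Eq. (6) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' PyTorch implementation: the function name and the toy relation are hypothetical, the relation list is not read from Table 1, and features are stored as row vectors, so the product $W_l F(j)$ appears as `F_src @ W`.

```python
import numpy as np


def relational_update(F_self, relations, W_self, b_self):
    """One relation-aware aggregation step in the spirit of Eq. (6).

    F_self    : (n, d) input features of the target node type k
    relations : list of (A, F_src, W) tuples, one per edge type l, where
                A is an (n, n_src) 0/1 adjacency, F_src holds the (n_src, d)
                neighbour features and W is the (d, d_out) weight of that type
    W_self    : (d, d_out) self-connection weight, W_{0,k}
    b_self    : (d_out,) bias, B_{0,k}
    """
    out = F_self @ W_self + b_self
    for A, F_src, W in relations:
        counts = A.sum(axis=1, keepdims=True).astype(float)  # c_{i,l}
        counts[counts == 0] = 1.0  # nodes without such neighbours receive 0
        out = out + (A / counts) @ F_src @ W  # mean over neighbours, then W_l
    return np.maximum(out, 0.0)  # relu


# Tiny hand-checkable example with a single, hypothetical edge type.
F_self = np.zeros((2, 2))
A = np.array([[1, 1], [0, 0]])
F_src = np.array([[2.0, 0.0], [4.0, 2.0]])
eye = np.eye(2)
updated = relational_update(F_self, [(A, F_src, eye)], eye, np.zeros(2))
```

In the toy example, the first node averages its two neighbours and the second node, which has no neighbour of this edge type, keeps only its own (zero) contribution.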
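The inductive attributes of Sect. 3.2 rest on the symmetric normalization used in Eqs. (1) and (2). The sketch below illustrates that normalization and a single GCN layer in NumPy. It is an assumption-laden illustration: the paper does not state how the adjacency matrix is derived from the similarity matrix, so the thresholding at 0.5, the random toy data and the function names are all hypothetical.

```python
import numpy as np


def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)  # at least 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


def gcn_layer(A, X, theta, b):
    """Single graph convolution: relu(L_tilde X theta + b), as in Eq. (1)."""
    return np.maximum(normalized_adjacency(A) @ X @ theta + b, 0.0)


rng = np.random.default_rng(0)
n_m, h = 5, 3                      # toy number of miRNAs and hidden units
G_m = rng.random((n_m, n_m))
G_m = (G_m + G_m.T) / 2            # symmetric toy similarity matrix
np.fill_diagonal(G_m, 1.0)
A_m = (G_m > 0.5).astype(float)    # hypothetical way to obtain an adjacency
np.fill_diagonal(A_m, 0.0)         # self-loops are added inside the layer
theta = rng.normal(size=(n_m, h))  # plays the role of theta_m^2
F_m2 = gcn_layer(A_m, G_m, theta, np.zeros(h))
```

As in Eq. (1), the similarity matrix itself serves as the input feature matrix, which is why `theta` has `n_m` rows.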
3.4 Link Prediction from Embedding
After obtaining the embedding features of miRNAs, diseases and genes, we use an edge decoder to reconstruct the miRNA-disease matrix, the gene-disease matrix and the gene-miRNA matrix. Before decoding, we pass the embedding features of miRNAs ($H_m$), diseases ($H_d$) and genes ($H_g$) through three separate three-layer fully connected networks for nonlinear transformation. The miRNA-disease association matrix is then reconstructed as follows.
$$M_{md} = \phi_m^l(H_m) \times \phi_d^l(H_d)^T \tag{7}$$
To ensure that the embedding features preserve the original network structure as much as possible, we also reconstruct the miRNA-gene association matrix (Eq. (8)) and the disease-gene association matrix (Eq. (9)). The loss function is based on the mean square error between the three reconstructed matrices and the original ones (Eq. (10)).
$$M_{mg} = \phi_m^l(H_m) \times \phi_g^l(H_g)^T \tag{8}$$
$$M_{dg} = \phi_d^l(H_d) \times \phi_g^l(H_g)^T \tag{9}$$
$$Loss = (1 - \alpha)\left\|P_{\delta}(A_{md} - M_{md})\right\|_F^2 + \alpha\left\|P_{\bar{\delta}}(A_{md} - M_{md})\right\|_F^2 + \left\|A_{dg} - M_{dg}\right\|_F^2 + \left\|A_{mg} - M_{mg}\right\|_F^2 + \|W\|^2 \tag{10}$$
where $\phi_m^l(H_m)$, $\phi_d^l(H_d)$ and $\phi_g^l(H_g)$ are the final embeddings of miRNAs, diseases and genes. $P_{\delta}(\cdot)$ and $P_{\bar{\delta}}(\cdot)$ are the projections of a matrix onto the set $\delta$ (positive samples of the training set) and the set $\bar{\delta}$ (negative samples of the training set), respectively. $A_{md}$, $A_{dg}$ and $A_{mg}$ are the true miRNA-disease, gene-disease, and gene-miRNA association matrices. $W$ denotes the model parameters. $\alpha$ is a predefined parameter that balances the loss on positive samples against the loss on negative samples. We minimize the loss to learn the parameters of the model. Our model is implemented in PyTorch and optimized with Adam. The best combination of hyperparameters is as follows: the learning rate is 0.0001, the number of filters of the GCN for generating inductive attributes is 256, and the number of filters of the GCN for network embedding is 256. The single fully connected layers for generating initial attributes have 256 hidden nodes. The three-layer fully connected networks for nonlinear transformation before decoding have 256, 128 and 64 hidden nodes, respectively. We tune the parameters $\alpha$ and epoch for each cross-validation task. In the randomly zeroing cross-validation, $\alpha = 0.4$ and epoch = 200. In the multi-column zeroing and multi-row zeroing cross-validation, $\alpha = 0.1$ and epoch = 100.
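To make the decoder and the loss in Eqs. (7)–(10) concrete, the following NumPy sketch evaluates them on a tiny hand-built example. It is a sketch under stated assumptions rather than the authors' code: the transformed embeddings `Z_m`, `Z_d` and `Z_g` stand in for the outputs of the fully connected networks, the positive and negative training sets are passed as 0/1 masks, and all names and numbers are hypothetical.

```python
import numpy as np


def reconstruction_loss(A_md, M_md, A_dg, M_dg, A_mg, M_mg,
                        pos_mask, neg_mask, alpha, params=()):
    """Weighted squared-error loss in the spirit of Eq. (10)."""
    diff = A_md - M_md
    pos_term = np.sum((diff * pos_mask) ** 2)  # projection onto positives
    neg_term = np.sum((diff * neg_mask) ** 2)  # projection onto negatives
    aux_term = np.sum((A_dg - M_dg) ** 2) + np.sum((A_mg - M_mg) ** 2)
    reg_term = sum(np.sum(W ** 2) for W in params)
    return (1 - alpha) * pos_term + alpha * neg_term + aux_term + reg_term


# Toy transformed embeddings: 2 miRNAs, 2 diseases, 1 gene (hypothetical).
Z_m = np.array([[1.0, 0.0], [0.0, 1.0]])
Z_d = np.array([[0.5, 0.0], [0.5, 1.0]])
Z_g = np.array([[1.0, 0.0]])

# Inner-product decoders, as in Eqs. (7)-(9).
M_md = Z_m @ Z_d.T
M_mg = Z_m @ Z_g.T
M_dg = Z_d @ Z_g.T

A_md = np.eye(2)
A_mg = np.array([[1.0], [0.0]])
A_dg = np.array([[1.0], [0.0]])

# Here every known association is a training positive and every zero entry
# a training negative; a real split would mask out the test fold.
loss = reconstruction_loss(A_md, M_md, A_dg, M_dg, A_mg, M_mg,
                           pos_mask=A_md, neg_mask=1 - A_md, alpha=0.4)
```

A smaller `alpha` puts more weight on fitting the known associations and less on pushing the unlabeled entries toward zero.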
4 Results
To verify the effectiveness of our model, we compared it with five state-of-the-art methods (MDACNN [18], NIMCGCN [15], GCSENet [19], ThrRWMDE [21], and CCA-based [24]). We use cross-validation to evaluate the performance of every method under three different settings: randomly zeroing cross-validation, multi-column zeroing cross-validation and multi-row zeroing cross-validation. For randomly zeroing cross-validation, all known miRNA-disease associations are randomly divided into five non-overlapping parts; one part is used for testing and the remaining parts for training. The columns of the miRNA-disease association matrix correspond to diseases, and the rows correspond to miRNAs. Multi-column (multi-row) zeroing cross-validation randomly divides all columns (rows) of the miRNA-disease association matrix into five non-overlapping parts. It clears all miRNA-disease associations in one-fifth of the columns (rows), which form the test set; the remaining columns (rows) form the training set. We repeat every cross-validation ten times and report the average results in the following sections.
4.1 Randomly Zeroing Cross-Validation
We repeat randomly zeroing five-fold cross-validation on our model and all baselines ten times. For each test, we plot the ROC curves of all methods and calculate the area under the curve (AUC) to evaluate their overall performance. Table 2 shows the average AUC values of every model: HGCNMDA achieves the highest average AUC and thus the best prediction performance. ThrRWMDE, which propagates association information on the heterogeneous network, obtains the second-best performance. However, our model can assign suitable weights to edges of different types in a supervised manner, which makes the embedding more effective in revealing potential miRNA-disease associations.
Table 2. The average AUC values of every model under randomly zeroing, multi-column zeroing and multi-row zeroing cross-validation.
Method      Randomly zeroing     Multi-column zeroing   Multi-row zeroing
ThrRWMDE    0.9094 ± 0.004256    0.5821 ± 0.070164      0.9079 ± 0.005557
CCA-based   0.7210 ± 0.000158    0.5                    0.7109 ± 0.001078
MDACNN      0.8936 ± 0.000044    0.7287 ± 0.00020       0.8906 ± 5.42e-06
GCSENet     0.8164 ± 0.000078    0.7762 ± 0.000142      0.7918 ± 4.90e-05
NIMCGCN     0.8851 ± 0.000047    0.7668 ± 0.006866      0.8783 ± 0.000120
HGCNMDA     0.9213 ± 0.000017    0.8505 ± 0.000677      0.9092 ± 0.000038
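The evaluation protocol described above can be sketched as follows. This is a minimal NumPy illustration under assumptions: the function names, the seeds and the toy matrix are hypothetical, and the AUC is computed directly from its rank interpretation rather than with whatever tool the authors used.

```python
import numpy as np


def random_zeroing_folds(A, n_folds=5, seed=0):
    """Yield (A_train, held_out_pairs): known associations split into folds."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(A == 1)
    pos = pos[rng.permutation(len(pos))]
    for fold in np.array_split(pos, n_folds):
        A_train = A.copy()
        A_train[fold[:, 0], fold[:, 1]] = 0  # hide the test associations
        yield A_train, fold


def column_zeroing_folds(A, n_folds=5, seed=0):
    """Yield (A_train, held_out_columns): whole disease columns are cleared."""
    rng = np.random.default_rng(seed)
    cols = rng.permutation(A.shape[1])
    for held_out in np.array_split(cols, n_folds):
        A_train = A.copy()
        A_train[:, held_out] = 0
        yield A_train, held_out


def auc(scores_pos, scores_neg):
    """Probability that a positive outranks a negative (ties count 1/2)."""
    sp = np.asarray(scores_pos, dtype=float)[:, None]
    sn = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((sp > sn) + 0.5 * (sp == sn)))


# Toy association matrix with 6 miRNAs (rows), 5 diseases (columns)
# and 10 known associations.
A = np.zeros((6, 5), dtype=int)
A[[0, 0, 1, 2, 2, 3, 4, 4, 5, 5], [0, 3, 1, 2, 4, 0, 1, 3, 2, 4]] = 1
```

Multi-row zeroing can be obtained by applying `column_zeroing_folds` to `A.T` and transposing the resulting training matrices back.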
4.2 Multi-column Zeroing Cross-Validation and Multi-row Zeroing Cross-Validation
Multi-column zeroing cross-validation tests whether our model can identify the relevant miRNAs of new diseases. Table 2 shows that HGCNMDA outperforms all five other methods under this cross-validation. It improves the AUC by 7.43 percentage points (0.8505 versus 0.7762) over GCSENet, which achieves the second-best performance. This may be partially attributed to the high quality of the miRNA and disease features that our model extracts from the heterogeneous network by taking the different edge types and node types into account. Note that ThrRWMDE yields very low AUC values when recommending relevant miRNAs for a new disease. It cannot successfully infer miRNA-disease association information because noisy neighbors are introduced during network diffusion. Multi-row zeroing cross-validation evaluates how well our approach predicts diseases for new miRNAs. From Table 2, we observe that all methods perform better under multi-row zeroing cross-validation than under multi-column zeroing cross-validation. This may be attributed to the more reliable miRNA similarity data and miRNA-gene association data that are available. Our model still shows the best performance among the models. The results indicate that our model, which handles neighbor nodes of different types independently, can learn high-quality miRNA and disease features to predict related diseases for new miRNAs.
5 Conclusion
This work develops a heterogeneous graph convolutional network-based deep learning model, namely HGCNMDA, to predict miRNA-disease associations. Compared with previous work, our model introduces an additional gene layer and constructs a three-layer heterogeneous network. Based on the heterogeneous network, our model leverages a graph convolutional network (GCN) to learn feature vectors of miRNAs and diseases for association identification, which allows it to consider both the miRNA/disease attributes and the miRNA-disease associations when learning their features. To prepare the input node attributes for the heterogeneous GCN model, we introduce two types of attributes for every node. One is generated by passing the similarity data through a nonlinear layer, and the other aggregates the attributes of local neighbors with a GCN. Considering that a node affects its neighbors differently depending on its attributes, we refine the nodes in the network into several node types according to their attributes and tailor the connections between these node types based on known associations and their relevance to the prediction task. Therefore, the heterogeneous GCN model can collect, for a given node, the transformed feature vectors of those neighboring nodes that are conducive to exploring potential miRNA-disease associations, and it can handle each edge type independently. We test our model on the human miRNA-disease association data set. Compared with five existing algorithms, HGCNMDA achieves the highest AUC values both when recovering missing miRNA-disease associations and when recommending diseases/miRNAs for new miRNAs/diseases.
Acknowledgement. This work is supported in part by the National Natural Science Foundation of China (No. 61972185, No. 62072124), the Natural Science Foundation of Yunnan Province of China (2019FA024), the Yunnan Key Research and Development Program (2018IA054), and the Yunnan Ten Thousand Talents Plan young.
References
1. Huang, Z., Shi, J., Gao, Y., Cui, C., Zhang, S., Li, J., et al.: HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 47(D1), D1013–D1017 (2019)
2. Zou, Q., Li, J., Song, L., Zeng, X., Wang, G.: Similarity computation strategies in the microRNA-disease network: a survey. Brief. Funct. Genomics 15(1), 55–64 (2016). https://doi.org/10.1093/bfgp/elv024
3. Zeng, X., Zhang, X., Liao, Y., Pan, L.: Prediction and validation of association between microRNAs and diseases by multipath methods. Biochim. Biophys. Acta Gen. Subj. 1860(11 Pt B), 2735–2739 (2016). https://doi.org/10.1016/j.bbagen.2016.03.016
4. You, Z.-H., Huang, Z.-A., Zhu, Z., Yan, G.-Y., Li, Z.-W., Wen, Z., et al.: PBMDA: a novel and effective path-based computational model for miRNA-disease association prediction. PLoS Computat. Biol. 13(3), e1005455 (2017). https://doi.org/10.1371/journal.pcbi.1005455
5. Liu, Y., Zeng, X., He, Z., Zou, Q.: Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans. Computat. Biol. Bioinform. 14(4), 905–915 (2017). https://doi.org/10.1109/TCBB.2016.2550432
6. Luo, J., Xiao, Q.: A novel approach for predicting microRNA-disease associations by unbalanced bi-random walk on heterogeneous network. J. Biomed. Inform. 66, 194–203 (2017). https://doi.org/10.1016/j.jbi.2017.01.008
7. Chen, X., Yan, G.-Y.: Semi-supervised learning for potential human microRNA-disease associations inference. Sci. Rep. 4(1), 1–10 (2014)
8. Lan, W., Wang, J., Li, M., Liu, J., Wu, F.-X., Pan, Y.: Predicting microRNA-disease associations based on improved microRNA and disease similarities. IEEE/ACM Trans. Computat. Biol. Bioinform. 15(6), 1774–1782 (2016)
9. Xiao, Q., Luo, J., Liang, C., Cai, J., Ding, P.: A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics 34(2), 239–248 (2018)
10. Yan, C., Wang, J., Ni, P., Lan, W., Wu, F.-X., Pan, Y.: DNRLMF-MDA: predicting microRNA-disease associations based on similarities of microRNAs and diseases. IEEE/ACM Trans. Computat. Biol. Bioinform. 16(1), 233–243 (2017)
11. Peng, W., Du, J., Dai, W., Lan, W.: Predicting miRNA-disease association based on modularity preserving heterogeneous network embedding. Front. Cell Dev. Biol. 9, 603758 (2021)
12. Li, J.-Q., Rong, Z.-H., Chen, X., Yan, G.-Y., You, Z.-H.: MCMDA: matrix completion for MiRNA-disease association prediction. Oncotarget 8(13), 21187 (2017)
13. Chen, X., Wang, L., Qu, J., Guan, N.-N., Li, J.-Q.: Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics 34(24), 4256–4265 (2018)
14. Chen, X., Sun, L.-G., Zhao, Y.: NCMCMDA: miRNA–disease association prediction through neighborhood constraint matrix completion. Brief. Bioinform. 22(1), 485–496 (2021)
15. Li, J., Zhang, S., Liu, T., Ning, C., Zhang, Z., Zhou, W.: Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics 36(8), 2538–2546 (2020)
16. Jiang, Q., Hao, Y., Wang, G., Juan, L., Zhang, T., Teng, M., et al.: Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst. Biol. 4(1), 1–9 (2010)
17. Shi, H., Xu, J., Zhang, G., Xu, L., Li, C., Wang, L., et al.: Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes. BMC Syst. Biol. 7(1), 1–12 (2013)
18. Peng, J., Hui, W., Li, Q., Chen, B., Hao, J., Jiang, Q., et al.: A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics 35(21), 4364–4371 (2019)
19. Li, Z., Jiang, K., Qin, S., Zhong, Y., Elofsson, A.: GCSENet: a GCN, CNN and SENet ensemble model for microRNA-disease association prediction. PLoS Computat. Biol. 17(6), e1009048 (2021)
20. Peng, W., Lan, W., Yu, Z., Wang, J., Pan, Y.: A framework for integrating multiple biological networks to predict MicroRNA-disease associations. IEEE Trans. Nanobiosci. 16(2), 100–107 (2016)
21. Peng, W., Lan, W., Zhong, J., Wang, J., Pan, Y.: A novel method of predicting microRNA-disease associations based on microRNA, disease, gene and environment factor networks. Methods 124, 69–77 (2017)
22. Dweep, H., Gretz, N.: miRWalk2.0: a comprehensive atlas of microRNA-target interactions. Nat. Methods 12(8), 697 (2015)
23. Szklarczyk, D., Morris, J.H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., et al.: The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. gkw937 (2016)
24. Chen, H., Zhang, Z., Feng, D.: Prediction and interpretation of miRNA-disease associations based on miRNA target genes using canonical correlation analysis. BMC Bioinformatics 20(1), 1–8 (2019)
Identification of Gastric Cancer Immune Microenvironment Related Genes with Poor Prognosis and Tumor Immune Infiltration

Yishu Wang1(B), Lingyun Xu2, Xuehan Tian2, and Zhe Lin2

1 School of Mathematics and Physics, University of Science and Technology, Beijing, China
2 School of Mathematics and Statistics, Qingdao University, Qingdao, China
Abstract. The tumor immune microenvironment (TIME) is increasingly recognized as a key factor in multiple stages of disease progression, such as immune escape, local resistance, cancer development, and distant metastasis, and thereby substantially affects the future development of frontline interventions and prognosis outcomes. The molecular and cellular nature of the TIME influences disease outcome by altering the balance of suppressive versus cytotoxic responses in the vicinity of the tumor. Therefore, exploring the complex interactions between tumors and their immunological microenvironment can help us effectively monitor the host immune system to protect against disease. Gastric cancer is notorious for its poor prognosis. In this study, we adopted two statistical algorithms to identify four important immune microenvironment related genes: three well-known genes supported by increasing evidence and one new key gene, CLRN3, whose high expression correlates with poor prognosis and tumor immune infiltration in gastric cancer. Furthermore, the mRNA levels of CLRN3 in stomach cancer cells were analyzed by RT-qPCR. Gene set enrichment analysis (GSEA) and other analyses showed that high expression of CLRN3 was enriched in immune activities and the cell cycle.

Keywords: Tumor immune microenvironment · Immune-related genes · Prognosis biomarker · Gastric cancer · Cox-univariate survival algorithm · Random forest variable selection · RT-qPCR
Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-3-030-91415-8_13) contains supplementary material, which is available to authorized users.
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 142–152, 2021. https://doi.org/10.1007/978-3-030-91415-8_13

1 Introduction

Gastric cancer (GC) is the second most common cause of cancer-related deaths and is usually initiated by the stepwise accumulation of genetic and epigenetic changes. However, increasing evidence indicates that the tumor immune microenvironment (TIME) plays a critical role in the subsequent development of GC [2]. Recent insights into the tumor microenvironment have begun to uncover the close association between cancer and immune characteristics. However, common immunotherapy
strategies mainly include chimeric antigen receptor-engineered T cells, cancer vaccines, cytokine therapy, and immune checkpoint inhibitors (ICIs), the last of which are currently the most successful immunotherapeutic strategy. Yet because of the complexity of the mechanisms underlying the response to immune checkpoint blockade, only a small percentage of patients respond well to ICIs [3]. Likewise, we found that the immune scores, which indicate the ratios of immune and stromal components of GC patients, were not significantly related to survival (see SuppleFig. 1 in the Supplementary). Thus, deeper genetic and immunological characterization of GC is required to determine the best treatment options. Identifying novel immune-related genes (IRGs) and immune-related therapeutic targets is useful for predicting the outcome of GC patients and for determining new immunotherapy strategies.

In this study, we first applied the ESTIMATE computational method to calculate the proportion of tumor-infiltrating immune cells (TICs) and the immune and stromal components in GC cases from The Cancer Genome Atlas (TCGA) database. Differentially expressed genes (DEGs) were then identified from the raw data by comparing GC samples with respect to their immune and stromal components. These DEGs were subjected to univariate Cox regression to identify candidate immune microenvironment related prognostic genes. To further analyze these candidate genes, we used the random forest variable selection algorithm to filter key genes, resulting in four immune microenvironment related prognostic genes: two immune-related genes, CXCR4 and BMP4, identified by comparison with the Immunology Database and Analysis Portal database [4], and two non-immunological genes, CLRN3 and VHL.

Increasing evidence has demonstrated the harmful roles of CXCR4 and BMP4 in gastric cancer and other malignant tumors. Xiang et al. [16] and Hashimoto et al. [18] found that CXCR4 plays an important role in gastric cancer metastasis: CXCR4 was highly expressed in a highly invasive gastric cancer cell model and in gastric cancer tissues. Furthermore, Ying et al. [17] demonstrated that the expression of CXCR4 could be used as a biomarker to predict malignant features of gastric cancer. Meanwhile, BMP4, a member of the TGF-β family, has been shown to be frequently overexpressed in gastric cancer tissue and to correlate with poor patient prognosis [19], which is consistent with our results. Our statistical model also found one tumor suppressor gene, von Hippel-Lindau (VHL), and one protein-coding gene, Clarin 3 (CLRN3). VHL is located on chromosome 3p25, plays an important role in tumorigenesis, particularly in tumor growth and vascularization, and has been proven to promote gastric cancer cell proliferation [21]. Frequent inactivation of VHL, together with the tumor suppressor genes P15, P16, and P53, has also been suggested to be an important step during oral cancer development [20]. CLRN3 encodes a transmembrane protein whose function is not known. Although CLRN3 was identified as a novel upregulated gene in gastric cancer, its mechanism in tumor progression and its relationship with prognosis remain unclear.

In this article, by integrating two statistical algorithms, we identified one important immune microenvironment related prognostic gene: CLRN3. We then report the correlation of CLRN3 expression with the survival and clinicopathological information of GC patients, and its correlation with the proportion of tumor-infiltrating
immune cells. Furthermore, the aberrant expression of CLRN3 was identified by reverse transcription-quantitative polymerase chain reaction (RT-qPCR).
2 Materials and Methods

2.1 GC Datasets and Immune-Related Genes Datasets (IRG)

Gene expression data, including 408 samples and 19645 observations of gene expression, and clinical data of gastric cancer patients, including 375 tumor samples and 33 normal samples, were downloaded from the TCGA database (https://portal.gdc.cancer.gov/).

2.2 Calculation of Immune Score and Stromal Score

The ESTIMATE and CIBERSORT algorithms were used to calculate the TIC proportion and the ratio of immune and stromal components of GC samples, resulting in the ImmuneScore and StromalScore. To determine the optimal cutoff values for grouping patients, we used the R package maxstat [5]. All samples were divided into high- and low-scoring groups. A higher ImmuneScore or StromalScore represents a larger amount of immune or stromal components in the TIME. ESTIMATEScore is the sum of these two scores. To determine whether there was a correlation between the stromal/immune/ESTIMATE scores and overall survival, we used Kaplan-Meier survival analysis; the results are shown in SuppleFig. 1 (ImmuneScore and StromalScore correlate with GC clinical data and prognosis). As shown in SuppleFig. 1A, the proportion of immune components did not significantly correlate with the overall survival rate, so ImmuneScore could not be established as a prognostic biomarker (p = 0.403 by log-rank test). However, the StromalScore and ESTIMATEScore still showed a positive correlation with the survival rate (SuppleFig. 1B-C), which indicated that the immune components in the TIME were still suitable for indicating the prognosis of GC patients, although a more thorough analysis is required.

2.3 Identification of Differentially Expressed Genes (DEGs)

To ascertain the exact variations of gene profiles in the TIME with respect to immune and stromal components, linear models were used to identify DEGs between the two immune/stromal score groups (high-score group vs. low-score group) using the limma R package [6].
A false discovery rate (FDR) adjusted q < 0.05, combined with an absolute logFC > 1, was set as the threshold for DEG identification. A total of 1169 DEGs were obtained from ImmuneScore, among which 881 were upregulated and 211 were downregulated. Similarly, 1746 DEGs were obtained from the StromalScore, consisting of 1535 upregulated and 288 downregulated genes. We then selected the top 50 upregulated and 50 downregulated genes to plot the heatmap (SuppleFig. 2A-B). The intersection analysis displayed by the Venn plot showed a total of 760 DEGs shared by ImmuneScore and StromalScore (SuppleFig. 2C-D).
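The DEG threshold described above (adjusted q < 0.05 and |logFC| > 1) can be illustrated with a minimal Python sketch. The tuple layout of the input rows and the gene names in the usage note are assumptions made for illustration only; the study itself used the limma R package.

```python
def select_degs(results, q_cutoff=0.05, lfc_cutoff=1.0):
    """Split differential-expression rows into up- and down-regulated genes.

    `results` is an iterable of (gene, logFC, adjusted_q) tuples. This layout
    is a hypothetical stand-in for a limma-style result table.
    """
    up, down = [], []
    for gene, log_fc, q in results:
        # Keep a gene only if it passes both the FDR and the fold-change cutoff.
        if q < q_cutoff and abs(log_fc) > lfc_cutoff:
            (up if log_fc > 0 else down).append(gene)
    return up, down
```

For example, with the hypothetical rows `[("G1", 2.3, 0.001), ("G2", -1.5, 0.01), ("G3", 0.4, 0.001), ("G4", 3.0, 0.2)]`, only G1 (upregulated) and G2 (downregulated) pass both cutoffs.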
2.4 Univariate Cox Survival Analysis

To determine which of these immune-related DEGs are significantly associated with the survival of GC patients, we analyzed the DEGs together with the survival data by univariate Cox survival analysis, using the R package survival. Univariate Cox survival analysis takes survival outcome and survival time as dependent variables and can analyze the influence of each candidate factor on survival time. Tumor samples were grouped into a "high expression group" and a "low expression group" with the median gene expression level set as the cutoff. 20 genes (p < 0.05) were identified as differentially expressed survival-related IRGs, which were considered as candidate immune-related prognostic genes.

2.5 Random Forest Variable Selection

To further filter the prognostic genes, we used the random forest variable selection method based on eight candidate immune-related prognostic genes. Random forest is frequently applied because it achieves high prediction accuracy and is able to identify informative variables [7]. This method does not need the distribution characteristics of parameters to be specified in advance. Here, we adopted the R package varSelRF [8]. SuppleFig. 3 (in the Supplementary) shows the error results for different numbers of trees, according to which we selected a tree number of 100. The varSelRF package helps decide on the number of "relevant variables", resulting in four important genes (CXCR4, BMP4, VHL, CLRN3).

2.6 Reverse Transcription-Quantitative Polymerase Chain Reaction (RT-qPCR)

We gathered cDNA samples stored at −80 °C (Jinke Biotechnology Co., Ltd), which were reverse-transcribed from 30 representative fresh GC samples and adjacent normal stomach samples from the operating room. We performed quantitative real-time PCR using TB Green Premix Ex Taq (Jinke Biotechnology Co., Ltd) and the Applied Biosystems StepOnePlus Real-Time PCR system.
The mRNA expression was normalized to that of β-actin mRNA, and we applied the 2^(−ΔΔCt) method to evaluate the relative expression levels of CLRN3. All the primers were purchased from Jinke. The primers and PCR parameters are presented in SuppleTable 2 (in the Supplementary). All samples and qPCR runs followed the same experimental protocol.
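As a worked illustration of the relative quantification step, the following sketch computes a fold change with the standard 2^(−ΔΔCt) formula. The function name and the Ct values in the usage note are hypothetical; they are not data from this study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^(-ddCt) method.

    Each delta-Ct normalizes the target gene (e.g. CLRN3) to a reference
    gene (e.g. beta-actin); the control would be the adjacent normal tissue.
    """
    delta_sample = ct_target_sample - ct_ref_sample      # dCt in the tumor sample
    delta_control = ct_target_control - ct_ref_control   # dCt in the control sample
    return 2.0 ** -(delta_sample - delta_control)
```

With hypothetical Ct values of 24 and 18 (target and reference in the tumor) and 26 and 18 (target and reference in the control), ΔΔCt = 6 − 8 = −2, so the target gene is 4-fold upregulated in the tumor.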
3 Results

3.1 Correlation of the Expression of CLRN3 with the Survival and Clinic-Pathological Staging of GC Patients

In this study, all GC samples were grouped into a CLRN3 high-expression group and a low-expression group using the median expression level as the cutoff. Survival analysis showed that patients in the high-expression group had a shorter overall survival than those in the low-expression group (Fig. 1A). In particular, the expression of CLRN3 decreased with the degree of cell abnormality and aggressiveness (Fig. 1B), but increased with the age (in years) at which the disease was first diagnosed (Fig. 1C).
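The median split and the survival comparison described above can be sketched in plain Python. This is a minimal two-group log-rank test written for illustration, under the assumption that "high expression" means strictly above the median; the study itself used R, and the function names here are hypothetical.

```python
import math
from statistics import median

def split_by_median(expression):
    """Return True for samples whose expression is above the median."""
    cutoff = median(expression)
    return [x > cutoff for x in expression]

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    `events_*` hold 1 for an observed death and 0 for a censored sample.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp, variance = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

In practice one would pass the survival times and event indicators of the high- and low-expression groups as the two arguments pairs and compare the returned p-value against 0.05.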
3.2 Correlation of CLRN3 with the Proportion of TICs

To further confirm the correlation between the expression of CLRN3 and the immune microenvironment, the proportion of tumor-infiltrating immune subsets was analyzed using CIBERSORT (https://cibersort.stanford.edu/). The CIBERSORT p-value reflects the statistical significance of the results, and a threshold of [...]

[...] 0, (ε, τ)-MSN is a graph in which two vertices u, v are connected if they are connected in ε-MSN and d(u, v) ≤ τ.
S. Knyazev et al.
(ε, τ )-MSN is a generalization of both the MSN and τ -network methods. Indeed, MSN is a special case of (ε, τ )-MSN if we set the parameters to ε = 0 and τ = ∞. Similarly, τ -network is a special case of (ε, τ )-MSN when ε = ∞. An implementation of the (ε, τ )-MSN algorithm as a software tool is freely available on GitHub at https://github.com/Sergey-Knyazev/eMST. The software can accept sequences in FASTA format and compute (ε, τ )-MSN using either of the two genetic metrics of choice, Hamming distance or TN93. The user can also provide their own distance matrix in the list of edges format. For efficiently computing Hamming distance between sequences, we implemented the following speed up technique [24]. Initially, we infer a consensus of all input sequences. Then, for each sequence in the input, we determine a set of positions where each sequence has mutated from the consensus. Finally, for each pair of sequences the Hamming distance is computed in two steps. First, we initialize the value of Hamming distance to be the size of the symmetric difference between the two sets. Second, for each position in the intersection of the two sets, we check if the sequences differ at this position, and if they do, we increment the value of Hamming distance by one.
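The consensus-based speed-up for Hamming distance described above can be sketched as follows. This is an illustrative Python version, not the code of the published tool; the function names are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
from collections import Counter

def consensus(seqs):
    """Most frequent character in each column of the alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def mutation_sets(seqs, cons):
    """For each sequence, the set of positions where it differs from the consensus."""
    return [{i for i, (a, c) in enumerate(zip(s, cons)) if a != c} for s in seqs]

def hamming_via_consensus(s1, s2, m1, m2):
    """Hamming distance between s1 and s2 from their mutation sets m1 and m2."""
    # Positions mutated in exactly one sequence always differ.
    dist = len(m1 ^ m2)
    # Positions mutated in both differ only if the two mutations are different.
    dist += sum(1 for i in m1 & m2 if s1[i] != s2[i])
    return dist
```

The benefit comes from the mutation sets being small when sequences are close to the consensus: each pairwise distance then touches only a few positions instead of the full genome length.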
Algorithm 1. (ε-MSN)

  MSA: multiple sequence alignment of strains
  G: fully connected distance graph obtained from strains
  ε: ε ≥ 0, parameter for ε-MSN

  function addEpsilonEdges(MST, LongestEdge, E, ε)
      for (x, y) ∈ E do
          if d(x, y) ≤ (1 + ε) · LongestEdge(x, y) then
              add (x, y) to MST
          end if
      end for
      return MST
  end function

  function getLongestEdges(MST, E)
      for (x, y) ∈ E do
          LongestEdge(x, y) ← max(e_i) ∀ e_i ∈ MST_{x→y}
      end for
      return LongestEdge
  end function

  procedure eMSN(A = MSA or G, ε)
      If A = MSA, obtain G(V, E) using a distance metric (e.g., Hamming, TN93, etc.)
      MST ← getMST(G)
      LongestEdge ← getLongestEdges(MST, E)
      eMST ← addEpsilonEdges(MST, LongestEdge, E, ε)
  end procedure
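Algorithm 1, combined with the distance threshold τ, can be sketched in Python as below. This is a minimal illustration on a dense distance matrix, not the implementation in the eMST repository; it uses Prim's algorithm for the spanning tree (the source does not say which MST algorithm is used), and all function names are hypothetical.

```python
import math
from itertools import combinations

def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix; returns MST adjacency lists."""
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    adj = {v: [] for v in range(n)}
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            adj[u].append(parent[u])
            adj[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v], parent[v] = dist[u][v], u
    return adj

def longest_edge_on_paths(dist, adj):
    """longest[x][y] is the heaviest edge on the unique MST path from x to y."""
    n = len(dist)
    longest = [[0.0] * n for _ in range(n)]
    for src in range(n):
        stack = [(src, -1)]
        while stack:
            u, prev = stack.pop()
            for v in adj[u]:
                if v != prev:
                    longest[src][v] = max(longest[src][u], dist[u][v])
                    stack.append((v, u))
    return longest

def eps_tau_msn(dist, eps=0.0, tau=math.inf):
    """Edges (x, y) with d <= (1 + eps) * LongestEdge(x, y) and d <= tau."""
    longest = longest_edge_on_paths(dist, minimum_spanning_tree(dist))
    edges = set()
    for x, y in combinations(range(len(dist)), 2):
        d = dist[x][y]
        # eps = infinity means every pair passes the epsilon test (tau-network case).
        passes_eps = math.isinf(eps) or d <= (1 + eps) * longest[x][y]
        if passes_eps and d <= tau:
            edges.add((x, y))
    return edges
```

Setting `eps=0` and `tau=math.inf` keeps exactly the pairs whose distance equals the heaviest edge on their MST path, and `eps=math.inf` keeps every pair within distance τ, which mirrors the two special cases named in the text. Computing all path maxima takes quadratic time and memory, in line with the quadratic behaviour reported in the scalability analysis.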
A Novel Network Representation of SARS-CoV-2 Sequencing Data
3 Results

To demonstrate the usability of the (ε, τ)-MSN methodology, we benchmarked it against other methods. First, we compared ε-MSN and (ε, τ)-MSN with the two state-of-the-art methods for constructing τ-networks (used in HIV-TRACE) and minimum spanning networks (used in GHOST) on COVID-19 sequences available from GISAID using assortativity analysis. Second, we examined ε-MSN, (ε, τ)-MSN, τ-networks, and MSN on their ability to infer transmission events and compared them with other available tools including CS-phylogeny [29], NETWORK5011CS [13], RAxML [31], outbreaker [6], and phybreak [18]. For this test, we used a SARS-CoV-2 sequencing dataset with known ground truth about infective transmission events, and we measured precision and recall of each of the methods when applied to infer these events from the sequencing data. Third, we showed the potential of (ε, τ)-MSN to scale to networks of more than one hundred thousand sequences.

3.1 Datasets

1. For comparison of the methods via assortativity analysis, we used the coast-to-coast (C2C) dataset, which contains 168 SARS-CoV-2 sequences collected from different countries, including 9 sequences from COVID-19 patients identified in Connecticut [11]. Each sample in this dataset has geographical attributes named Continent, Country, and Division.
2. For comparison of precision and recall of the methods in inferring transmission links, we used the Early Transmission Links (ETL) dataset, which consists of 293 global SARS-CoV-2 sequences collected before March 9th, 2020. Each sequence has a known country of origin. This dataset was constructed to match the 25 known country-to-country transmission links that were collected from news articles detailing transmissions prior to the pandemic declaration, in the MIDAS 2019 Novel Coronavirus Repository [29].
3. For scalability analysis, we created datasets consisting of the initial 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, and 100000 SARS-CoV-2 sequences from the masked multiple sequence alignment from GISAID. To generate these datasets, we ordered sequences by date and picked the earliest date at which the number of sequences exceeds the desired number.

3.2 Assortativity Analysis

We ran MSN, ε-MSN, τ-network, and (ε, τ)-MSN on the C2C dataset using TN93 as the measure of genetic relatedness, and we evaluated attribute assortativity for continent, country, and division. Figure 1 shows the dependence of attribute assortativity on τ ∈ [0, 1] for τ-network. The optimal value for continent assortativity is 0.702 when τ is 0.0001. Figure 2 shows the dependence of attribute assortativity on ε ∈ [0, 1] for ε-MSN. The optimal value for continent assortativity is 0.661 when ε is 0.0002. Figure 3 shows the dependence of attribute assortativity on ε ∈ [0, 1] for (ε, τ)-MSN, with the fixed threshold τ = 0.0001 that maximized assortativity in the τ-network analysis from Fig. 1. The optimal value for continent assortativity is 0.7573 when ε is 0.0002. Figure 4 shows the dependence of attribute assortativity on ε ∈ [0, 1] for ε-MSN with Hamming distance instead of TN93. The optimal value for continent assortativity is 0.546 when ε is 0.002. Table 1 shows that the maximum assortativity on the C2C dataset was achieved by (ε, τ)-MSN's mixture of both parameters, with ε = 0.0002 and τ = 0.0001. The resulting continent assortativity value of 0.753 is higher than that of the other methods, and the same is seen for country and division assortativity.
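For readers unfamiliar with attribute assortativity, the sketch below computes it for a categorical vertex attribute on an undirected edge list. It assumes Newman's standard definition of the assortativity coefficient, which the text does not spell out, and the function name is hypothetical.

```python
from collections import defaultdict

def attribute_assortativity(edges, attribute):
    """Newman's assortativity coefficient for a categorical vertex attribute.

    `edges` is a list of undirected (u, v) pairs and `attribute` maps each
    vertex to its label (e.g. continent, country, or division).
    """
    mix = defaultdict(float)
    for u, v in edges:
        a, b = attribute[u], attribute[v]
        # Count each undirected edge in both directions (symmetric mixing matrix).
        mix[(a, b)] += 1.0
        mix[(b, a)] += 1.0
    total = sum(mix.values())
    trace = sum(w for (a, b), w in mix.items() if a == b) / total
    marginal = defaultdict(float)
    for (a, _), w in mix.items():
        marginal[a] += w / total
    expected = sum(p * p for p in marginal.values())
    # Undefined (division by zero) when every vertex carries the same label.
    return (trace - expected) / (1.0 - expected)
```

A value of 1 means that edges only join vertices with the same label, 0 corresponds to random mixing, and negative values mean that edges preferentially join different labels.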
Fig. 1. Attribute assortativity on the C2C dataset for different values of edge threshold τ, using τ-network with TN93 distance.
Fig. 2. Attribute assortativity on the C2C dataset for different values of ε, using ε-MSN with TN93 distance and edge threshold τ = ∞.
Fig. 3. Attribute assortativity on the C2C dataset for different values of ε, using (ε, τ)-MSN with TN93 distance and edge threshold τ = 0.0001. The maximum assortativity occurs when ε = 0.0002.
Fig. 4. Attribute assortativity on the C2C dataset for different values of ε, using ε-MSN with Hamming distance.
Table 1. Attribute assortativity values for the optimal choices of ε and τ for MSN, ε-MSN, (ε, τ)-MSN, and the threshold-based network (τ-network), each using TN93 distance.

                                                  Assortativity
Method        ε        τ        No. of edges   Continent   Country   Division
MSN           0        ∞        717            0.626       0.581     0.360
τ-network     ∞        0.0001   2056           0.702       0.631     0.351
ε-MSN         0.0002   ∞        821            0.661       0.625     0.342
(ε, τ)-MSN    0.0002   0.0001   727            0.753       0.706     0.374
We find that (ε, τ)-MSN performs the best in terms of continent, country, and division assortativity values across all four methods.

3.3 Transmission Network Analysis

We evaluated the precision and recall of (ε, τ)-MSN on the ETL dataset. To evaluate the transmission network quality of (ε, τ)-MSN, we define an undirected transmission link to be the pair of sequence locations of the two vertices connected by an edge in (ε, τ)-MSN. The set of all unique undirected transmission links forms the transmission network. For each method shown in Table 2, we produced its transmission network and evaluated the precision and recall against the known links provided in the ETL dataset. We calculate precision as the ratio of the number of known true links predicted by the method to the total number of predicted links, and we calculate recall as the ratio of the number of known true links predicted to the total number of known true links in the ETL dataset. Table 2 shows that MSN performed best in precision and F1-score. τ-network performed best in recall but not as well in precision or F1-score. ε-MSN and (ε, τ)-MSN performed comparably well to MSN, and together these network-based methods all outperformed the other standard methods being compared.

3.4 Scalability Analysis

To examine the scalability of the proposed methods, we applied the ε-MSN tool to datasets of increasing sizes of up to several hundred thousand sequences. For each of these datasets, we ran ε-MSN in TN93 mode and Hamming distance mode separately and recorded the running times. Figure 5 shows the results of the analysis. We see that ε-MSN has a quadratic runtime in both modes, but that Hamming distance is significantly faster because of its efficient implementation.
Fig. 5. Runtime analysis of ε-MSN on increasing input sizes. ε-MSN is a quadratic algorithm in both TN93 and Hamming distance modes, although Hamming distance, with its efficient implementation, is much faster.

Table 2. Recall and precision comparison across different methods run on the ETL dataset. MSN methods were run using the TN93 distance metric. Recall is defined as the ratio of known true links formed by the tool to the total number of known true links. Precision is defined as the ratio of known true links formed by the tool to the total number of links formed by the tool. F1-Score is defined as twice the product of precision and recall divided by the sum of precision and recall. * The ground truth is only partially known.

Tool                                          Recall*   Precision*   F1-Score*
MSN (ε = 0, τ = ∞, TN93)                      80%       7.6%         0.139
τ-network (ε = ∞, τ = 0.0001, TN93)           96%       2.5%         0.049
ε-MSN (ε = 0.0002, τ = ∞, TN93)               80%       7.4%         0.135
(ε, τ)-MSN (ε = 0.0002, τ = 0.0001, TN93)     72%       6.6%         0.121
CS-phylogeny                                  80%       4.76%        0.090
NETWORK5011CS                                 72%       4.99%        0.093
RAxML                                         64%       4.26%        0.080
Bitrugs                                       52%       3.38%        0.063
Outbreaker                                    28%       5.83%        0.097
Phybreak                                      4%        0.83%        0.076
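The recall, precision, and F1-score definitions used in Table 2 can be written out as a short Python sketch. Links are treated as unordered pairs of locations, as in the definition of an undirected transmission link; the function name and the country pairs in the usage note are hypothetical toy data, not the ETL ground truth.

```python
def link_metrics(predicted, known):
    """Precision, recall and F1-score of predicted undirected transmission links."""
    # frozenset makes ("A", "B") and ("B", "A") the same undirected link.
    pred = {frozenset(link) for link in predicted}
    true = {frozenset(link) for link in known}
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For instance, if a method predicts the links China-Italy, USA-China, and Iran-Canada while the known links are Italy-China, China-USA, Germany-Italy, and Iran-UK, two of three predictions are correct (precision 2/3) and two of four known links are recovered (recall 1/2), giving an F1-score of 4/7.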
4 Conclusion

We have developed two versions of a new network-based tool, ε-MSN and (ε, τ)-MSN, which generalize the minimum spanning network and τ-network approaches to representing genetic relationships.
We compared the proposed tools with other network-based methods using attribute assortativity values. The experiments show that (ε, τ)-MSN with Hamming distance does not perform as well as with TN93 distance. With TN93, (ε, τ)-MSN outperforms all the other methods in continent, country, and division attribute assortativity. Further, we validated multiple tools, including the proposed ones, on known transmission networks. We evaluated recall, precision and F1-score for each tool. We found that network-based tools perform better than the others, including those that are phylogeny-based. The results validated the ε-MSN network and showed that the structure of the ε-MSN network correlates with phylogenetic trees. ε-MSN is interpretable, integrable and scalable. Users of our proposed tools can fit the parameters ε and τ to any dataset using the same methodology we used in our analysis, namely, fixing one parameter and varying the other, then fixing the other and varying the first. Our methodology is implemented in MicrobeTrace [4], a tool currently in use by the CDC for viral outbreak investigation.

Acknowledgement. DN, SK, and AZ were partially supported by NSF grants 1564899 and 16119110 and by NIH grant 1R01EB025022-01. PS was partially supported by NIH grant 1R01EB025022-01 and NSF grant 2047828. SK was partially supported by the GSU Molecular Basis of Disease Fellowship. SM was partially supported by NSF grant 2041984.
References

1. Alexiev, I., et al.: Molecular epidemiological analysis of the origin and transmission dynamics of the HIV-1 CRF01_AE sub-epidemic in Bulgaria. Viruses 13(1), 116 (2021)
2. Alexiev, I., et al.: Molecular epidemiology of the HIV-1 subtype b sub-epidemic in Bulgaria. Viruses 12(4), 441 (2020)
3. Bandelt, H.J., Forster, P., Rohl, A.: Median-joining networks for inferring intraspecific phylogenies. Mol. Biol. Evol. 16(1), 37–48 (1999)
4. Campbell, E.M., et al.: MicrobeTrace: retooling molecular epidemiology for rapid public health response. PLOS Comput. Biol. 17(9), e1009300 (2021)
5. Campbell, E.M., et al.: Detailed transmission network analysis of a large opiate-driven outbreak of HIV infection in the united states. J. Infect. Dis. 216(9), 1053–1062 (2017)
6. Campbell, F., Didelot, X., Fitzjohn, R., Ferguson, N., Cori, A., Jombart, T.: outbreaker2: a modular platform for outbreak reconstruction. BMC Bioinformatics 19(S11) (2018). https://doi.org/10.1186/s12859-018-2330-z
7. Campo, D.S., et al.: Next-generation sequencing reveals large connected networks of intrahost HCV variants. BMC Genomics 15(S5) (2014). https://doi.org/10.1186/1471-2164-15-s5-s4
8. Campo, D.S., et al.: Accurate genetic detection of hepatitis c virus transmissions in outbreak settings. J. Infect. Dis. 213(6), 957–965 (2015)
9. Campo, D.S., Zhang, J., Ramachandran, S., Khudyakov, Y.: Transmissibility of intra-host hepatitis c virus variants. BMC Genomics 18(S10) (2017). https://doi.org/10.1186/s12864-017-4267-4
10. Excoffier, L., Smouse, P.E.: Using allele frequencies and geographic subdivision to reconstruct gene trees within a species: molecular variance parsimony. Genetics 136(1), 343–359 (1994)
11. Fauver, J.R., et al.: Coast-to-coast spread of SARS-CoV-2 in the United States revealed by genomic epidemiology (2020). https://doi.org/10.1101/2020.03.25.20043828
12. Felsenstein, J.: Inferring Phylogenies. Sinauer Associates is an imprint of Oxford University Press, paperback edn., September 2003. https://lead.to/amazon/com/?op=bt&la=en&cu=usd&key=0878931775
13. Forster, P., Forster, L., Renfrew, C., Forster, M.: Phylogenetic network analysis of SARS-CoV-2 genomes. Proc. Natl. Acad. Sci. 117(17), 9241–9243 (2020)
14. Glebova, O., et al.: Inference of genetic relatedness between viral quasispecies from sequencing data. BMC Genomics 18(S10) (2017). https://doi.org/10.1186/s12864-017-4274-5
15. Gonzalez-Reiche, A.S., et al.: Introductions and early spread of SARS-CoV-2 in the New York city area. Science 369(6501), 297–301 (2020)
16. Grande, K.M., Schumann, C.L., Ocfemia, M.C.B., Vergeront, J.M., Wertheim, J.O., Oster, A.M.: Transmission patterns in a low HIV-morbidity state - Wisconsin, 2014–2017. MMWR. Morb. Mortal. Wkly. Rep. 68(6), 149–152 (2019). https://doi.org/10.15585/mmwr.mm6806a5
17. Houldcroft, C.J., Beale, M.A., Breuer, J.: Clinical and biological insights from viral genome sequencing. Nat. Rev. Microbiol. 15(3), 183–192 (2017). https://doi.org/10.1038/nrmicro.2016.182
18. Klinkenberg, D., Backer, J., Didelot, X., Colijn, C., Wallinga, J.: New method to reconstruct phylogenetic and transmission trees with sequence data from infectious disease outbreaks (2016)
19. Knyazev, S., Hughes, L., Skums, P., Zelikovsky, A.: Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Briefings Bioinform. 22(1), 96–108 (2020)
20. Knyazev, S., et al.: Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res. 49, e102 (2021)
21. Longmire, A.G., et al.: Ghost: global hepatitis outbreak and surveillance technology. BMC Genomics 18(S10) (2017). https://doi.org/10.1186/s12864-017-4268-3
22. Melnyk, A., Knyazev, S., Vannberg, F., Bunimovich, L., Skums, P., Zelikovsky, A.: Using earth mover's distance for viral outbreak investigations. BMC Genomics 21(S5) (2020). https://doi.org/10.1186/s12864-020-06982-4
23. Melnyk, A., et al.: Clustering based identification of SARS-CoV-2 subtypes. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds.) ICCABS 2020. LNCS, vol. 12686, pp. 127–141. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-79290-9_11
24. Novikov, D., Knyazev, S., Grinshpon, M., Baykal, P.I., Skums, P., Zelikovsky, A.: Scalable reconstruction of SARS-CoV-2 phylogeny with recurrent mutations. J. Comput. Biol. (to appear)
25. Oster, A.M., et al.: Identifying clusters of recent and rapid HIV transmission through analysis of molecular surveillance data. JAIDS J. Acquir. Immune Defic. Syndr. 79(5), 543–550 (2018)
26. Pond, S.L.K., Weaver, S., Brown, A.J.L., Wertheim, J.O.: HIV-TRACE (TRAnsmission cluster engine): a tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens. Mol. Biol. Evol. 35(7), 1812–1819 (2018)
27. Prabhakaran, S., Rey, M., Zagordi, O., Beerenwinkel, N., Roth, V.: HIV haplotype inference using a propagating Dirichlet process mixture model. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(1), 182–191 (2014)
28. Sanjuán, R., Domingo-Calap, P.: Mechanisms of viral mutation. Cell. Mol. Life Sci. 73(23), 4433–4448 (2016)
29. Skums, P., Kirpich, A., Baykal, P.I., Zelikovsky, A., Chowell, G.: Global transmission network of SARS-CoV-2: from outbreak to pandemic (2020). https://doi.org/10.1101/2020.03.22.20041145
30. Skums, P., et al.: QUENTIN: reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics 34(1), 163–170 (2017)
31. Stamatakis, A.: RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9), 1312–1313 (2014)
32. Tamura, K., Nei, M.: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10, 512–526 (1993)
33. Wymant, C., et al.: PHYLOSCANNER: inferring transmission from within- and between-host pathogen genetic diversity. Mol. Biol. Evol. 35(3), 719–733 (2017). https://doi.org/10.1093/molbev/msx304
Computational Proteomics
A Sequence-Based Antibody Paratope Prediction Model Through Combing Local-Global Information and Partner Features

Shuai Lu1, Yuguang Li2, Xiaofei Nan1, and Shoutao Zhang1(B)

1 Zhengzhou University, Zhengzhou 450001, China
{iexfnan,zhangst}@zzu.edu.cn
2 Zhengzhou University of Light Industry, Zhengzhou 450001, China
Abstract. Antibodies are proteins that play a vital role in the immune system by recognizing and neutralizing antigens. The region of the antibody that binds to an antigen, known as the paratope, mediates antibody-antigen interaction with high affinity and specificity. Accurate prediction of the paratope from the antibody sequence contributes to the design of therapeutic antibodies but remains challenging, and the experimental methods for determining it are time-consuming and expensive. In this article, we propose a sequence-based method for antibody paratope prediction that combines local and global information from the antibody sequence with partner features from the partner antigen sequence. Convolutional Neural Networks (CNNs) with a sliding window over the antibody sequence are used to extract local information. An Attention-based Bidirectional Long Short-Term Memory (Att-BLSTM) network over the antibody sequence is used to extract global information. Because the partner antigen is also vital for paratope prediction, we employ an Att-BLSTM network on the partner antigen sequence as well. The outputs of the CNNs and Att-BLSTM networks are combined by fully-connected networks to predict the antibody paratope. Experiments show that the proposed method achieves superior performance over state-of-the-art sequence-based antibody paratope prediction methods on benchmark datasets.

Keywords: CNN · Bi-LSTM · Attention · Paratope prediction
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 179–190, 2021. https://doi.org/10.1007/978-3-030-91415-8_16

1 Introduction

Antibodies play an important role in the human immune system by binding a wide range of macromolecules with high affinity and specificity. This binding ability is mediated by the interaction of amino acid residues in the paratope region of the antibody [35]. Accurate prediction of the antibody paratope is helpful for studying the antibody-antigen interaction mechanism and for developing antibody designs [13]. With the establishment of antibody expression and purification processes, the number of approved therapeutic antibodies is growing steadily [24]. As a result, correctly determining which amino acid residues belong to the paratope is essential [8].

The antibody paratope region is usually determined by observing the residues that are spatially close to the partner antigen in a structure obtained by X-ray [32], NMR [3] or Cryo-EM [41]. However, these experimental methods are time-consuming and expensive [21]. Therefore, many computational models have been developed to overcome these problems. According to the input features, computational methods for antibody paratope prediction can be classified into two classes: sequence-based and structure-based. As the names imply, sequence-based models use only antibody sequence features, whereas structure-based models also use antibody structure features. Some sequence-based methods take the whole antibody primary sequence as input and output the binding probability of each amino acid residue [2,19]. Other methods use only the residues in the complementarity determining regions (CDRs) and predict their binding probability [6,23]. However, about 20% of the residues that participate in binding fall outside the CDRs [20]. Most structure-based methods treat the antibody structure as a graph and employ graph convolution operations to aggregate the information of spatially neighboring residues [7,25,29]. Apart from those methods, 3D Zernike descriptors have been used to represent sets of adjacent amino acid residues in combination with shallow machine learning models [5]. Although structure-based models can provide a more accurate description of the paratope, sequence information is always available earlier than structure [23].

Global information of the whole sequence has proved useful in many biological analysis tasks such as protein-protein interaction site prediction [39] and protein phosphorylation site prediction [11].
Methods from those works extract global features from the whole protein sequence by TextCNNs [36] or SENet blocks [14] and Bi-LSTM blocks [12]. In this study, we utilize an Attention-based Bidirectional Long Short-Term Memory (Att-BLSTM) network to extract global features from the antibody sequence and partner features from its partner antigen sequence, as Att-BLSTM shows superior performance in several machine learning tasks [4,40].

In this work, we propose a sequence-based method for antibody paratope prediction that utilizes both local-global information and partner features by combining Convolutional Neural Networks (CNNs) and Att-BLSTM networks. For local features, we use CNNs with a fixed sliding window size to consider the neighbor information around a target amino acid residue in the antibody sequence. For global features and partner features, we use Att-BLSTM networks on the antibody sequence and its partner antigen sequence, respectively. All features are then combined and fed into fully-connected networks to predict the probability that each antibody residue belongs to the paratope. We also compare our results with other competing sequence-based paratope predictors, and our method achieves the best performance.
2 Materials and Methods

2.1 Datasets
The datasets used in this study are taken from PECAN [29]. All complexes are filtered so that no two antibody sequences share more than 95% identity. The complexes in the training and validation sets are collected from the training sets of other paratope prediction works [18,19,23] and AbDb [9]. The training set contains 205 complexes, and the validation set consists of 103 complexes used to tune the hyperparameters of our model. A further 152 complexes are used for testing and evaluating the performance of our model. As in other works, a residue of the antibody is defined as belonging to the paratope if any of its heavy atoms is less than 4.5 Å away from any heavy atom of the antigen [18,23,29]. All other residues are negative residues. It should be noted that the structure of the antibody-antigen complex is used only to extract the positive and negative labels of the antibody residues. A summary of the datasets is shown in Table 1.

Table 1. Number of complexes and residues in the datasets.

Dataset      Complexes   Positive residues   Negative residues
Training     205         4449 (5.19%)        81283 (94.81%)
Validation   103         2237 (5.24%)        60584 (94.76%)
Testing      152         3314 (5.19%)        40480 (94.81%)
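The labeling rule described in this section (a residue is positive if any of its heavy atoms lies closer than 4.5 Å to any antigen heavy atom) can be illustrated with a minimal sketch. The function name `label_paratope`, the data layout, and the toy coordinates are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
import math

CUTOFF = 4.5  # distance threshold in angstroms, as stated in the text


def label_paratope(antibody_residues, antigen_atoms, cutoff=CUTOFF):
    """Return one 0/1 label per antibody residue.

    antibody_residues: list of residues, each a list of (x, y, z) heavy-atom coordinates.
    antigen_atoms: flat list of (x, y, z) heavy-atom coordinates of the antigen.
    """
    labels = []
    for atoms in antibody_residues:
        # Positive if ANY heavy atom of the residue is closer than the cutoff
        # to ANY heavy atom of the antigen.
        close = any(math.dist(a, b) < cutoff for a in atoms for b in antigen_atoms)
        labels.append(1 if close else 0)
    return labels


# Hypothetical toy coordinates, not taken from any real complex.
antibody = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],  # residue 0: second atom is 2.5 A from the antigen
    [(10.0, 0.0, 0.0)],                   # residue 1: nearest antigen atom is 6.0 A away
]
antigen = [(4.0, 0.0, 0.0), (20.0, 0.0, 0.0)]

labels = label_paratope(antibody, antigen)
print(labels)
```

The brute-force double loop is quadratic in the number of atoms; for full complexes a spatial index would likely be preferable, but the rule itself stays the same.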
2.2 Input Features
We use residue features consisting of one-hot encoding, evolutionary information, seven additional parameters and predicted structural features of the antibody or antigen sequence. They are described in detail as follows.

One-Hot Encoding of Antibody Sequence. Only the 20 natural types of amino acid residues are considered in our study. Each type is encoded as a 20D vector in which each element is either 1 or 0, and 1 indicates the presence of the corresponding amino acid residue.

Evolutionary Information. Many studies use the evolutionary information of antibody or protein sequences for biological analysis tasks, such as B-cell epitope prediction [31], protein-protein interaction site prediction [39], protein-DNA binding residue prediction [15], protein fold recognition [38] and protein contact map prediction [33]. In this study, we run PSI-BLAST [1] against the non-redundant (nr) [27] database for every antibody and antigen sequence in our
datasets with three iterations and an E-value threshold of 0.001. This yields the position-specific scoring matrix (PSSM) and the position-specific frequency matrix (PSFM). Each amino acid is encoded as a 20D vector representing the probabilities and frequencies of the 20 natural amino acid residues occurring at each position in the PSSM and PSFM, respectively. For a protein sequence with L residues, the PSSM and the PSFM each have L rows and 20 columns. Besides the PSSM and PSFM, two parameters are obtained at each residue position as well: one is related to the column entropy in the multiple sequence alignment, and the other is related to the column gaps in the multiple sequence alignment.

Seven Additional Parameters. These parameters represent physical, chemical and structural properties of each type of amino acid residue and were derived by training artificial neural networks for protein secondary structure prediction [28].

Predicted Structural Features. In this study, we predict local structural features of the antibody or antigen from its sequence with NetSurfP-2.0, a deep learning model trained on several independent datasets [17]. NetSurfP-2.0 returns the solvent accessibility, secondary structure and backbone dihedral angles for each residue of the input sequence, and it achieves state-of-the-art performance on these prediction tasks. We calculate the absolute and relative solvent accessibility (ASA and RSA, respectively), the 8-class secondary structure classification (SS8), and the backbone dihedral angles (φ and ψ) for each residue position of the input antibody or antigen sequence. ASA and RSA represent the solvent accessibility of an amino acid residue. The predicted secondary structure describes the local structural environment of a residue, and φ and ψ describe the relative positions of adjacent residues. The 8-class secondary structures are: 3-helix (G), α-helix (H), π-helix (I), β-strand (E), β-bridge (B), β-turn (T), bend (S) and loop or irregular (L). We use the one-hot encoding of SS8. Together, an 80D vector is used to represent a residue in our study.
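The one-hot component of the residue features can be illustrated with a minimal sketch. The alphabetical residue ordering and the all-zero vector for non-standard residues are assumptions made here for illustration; the text only states that the 20 natural amino acid types are encoded as 20D vectors.

```python
# The 20 natural amino acids in one-letter code; the ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein sequence as a list of 20D 0/1 vectors (one per residue)."""
    rows = []
    for aa in sequence:
        vec = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:
            vec[idx] = 1  # mark the position of the residue type
        # Non-standard residues keep an all-zero vector (assumption).
        rows.append(vec)
    return rows


encoded = one_hot("ACXY")
print(len(encoded), len(encoded[0]))
```

In the full feature vector, each such row would be concatenated with the evolutionary, physicochemical and predicted structural features of the same residue.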
2.3 Input Representation
The antibody paratope prediction problem can be formulated as a binary classification task: deciding whether or not a residue of a given antibody sequence binds to its partner antigen. As described in Sect. 2.2, each residue is encoded as an 80D vector, so each antibody or antigen sequence can be represented as a matrix S that stacks its residues:

$$ S = [r_1, r_2, r_3, \cdots, r_i, \cdots, r_l]^{T}, \quad S \in \mathbb{R}^{l \times d} \tag{1} $$

where $r_i \in \mathbb{R}^{d}$ is the residue feature vector corresponding to the i-th residue in the antibody or antigen sequence, l is the sequence length, and d is the residue feature dimension (80 in this paper).
Fig. 1. Architecture of our method
2.4 Model Architecture
As Fig. 1 shows, our proposed method is mainly composed of three parallel parts: CNNs extract local features from the antibody sequence, an Att-BLSTM network extracts global features from the antibody sequence, and another Att-BLSTM network extracts partner features from the partner antigen sequence. The outputs of these parts are concatenated and fed to fully-connected networks to predict the binding probability of each antibody residue.

CNNs. Convolutional neural networks (CNNs) have been used in bioinformatics tools for protein binding site prediction [37], protein-ligand scoring [30] and protein-compound affinity prediction [16]. In our method, the input of the CNNs is the local information of the target antibody residue, represented as $r_{i-w:i+w}$. That is, the target antibody residue sits at the center, and its 2w neighboring residues provide its context. Where a residue has no neighbors on the left or right, the missing positions are padded with all-zero vectors. The convolutional operation is

$$ r_i = f(W_c r_{i-w:i+w} + b_c) \tag{2} $$

where f is a non-linear activation function (e.g. ReLU), $W_c$ is the weight matrix, $r_{i-w:i+w}$ is the concatenation of the local information of the target antibody residue, and $b_c$ is the bias vector. In Fig. 1, BN denotes a batch normalization layer, and the block is repeated 3 times in our model. A residual connection is also used by adding the inputs to the outputs:

$$ r_i = f(W_c r_{i-w:i+w} + b_c) + r_{i-w:i+w} \tag{3} $$
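The window extraction with zero padding that feeds the CNNs can be sketched as below. The function name `sliding_windows` and the toy one-dimensional features are hypothetical; the window length of 11 reported in the training details would correspond to w = 5.

```python
def sliding_windows(features, w):
    """Build one zero-padded window of 2w + 1 feature vectors per residue.

    features: list of l residue feature vectors, each a list of d floats.
    Returns a list of l windows; window i is centered on residue i.
    """
    d = len(features[0])
    pad = [[0.0] * d for _ in range(w)]  # all-zero vectors for missing neighbors
    padded = pad + [list(row) for row in features] + pad
    return [padded[i:i + 2 * w + 1] for i in range(len(features))]


# Hypothetical 1D features for a 3-residue sequence.
toy = [[1.0], [2.0], [3.0]]
windows = sliding_windows(toy, w=1)
print(windows)
```

Each window can then be flattened or treated as a (2w + 1) x d input to the convolutional layers of Eqs. (2) and (3).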
Fig. 2. Architecture of Att-BLSTM
Att-BLSTM Networks. Attention-based Bidirectional Long Short-Term Memory (Att-BLSTM) networks have been used in chemical and biomedical text processing tasks [22,26]. However, their advantages have not been exploited in biological sequence analysis such as antibody paratope prediction. In this study, Att-BLSTM networks are used to capture global features of antibody and antigen sequences. As shown in Fig. 2, the architecture of Att-BLSTM consists of four parts: input layer, Bi-LSTM layer, attention layer and output layer. The input antibody or antigen sequence is represented as a sequence of residues. The Bi-LSTM layer contains two Long Short-Term Memory (LSTM) networks: a forward one that reads the input residues from the beginning to the end, and a backward one that reads them from the end to the beginning. A standard LSTM contains three gates and a cell memory state to store and access information over time. Typically, an LSTM cell is computed at each time step t as follows:

$$ i_t = \sigma(W_i * [h_{t-1}, r_t] + b_i) \tag{4} $$

$$ f_t = \sigma(W_f * [h_{t-1}, r_t] + b_f) \tag{5} $$

$$ o_t = \sigma(W_o * [h_{t-1}, r_t] + b_o) \tag{6} $$

$$ c_t = f_t * c_{t-1} + i_t * \tanh(W_c * [h_{t-1}, r_t] + b_c) \tag{7} $$

$$ h_t = o_t * \tanh(c_t) \tag{8} $$
where $W_i$, $W_f$, $W_o$ are the weight matrices and $b_i$, $b_f$, $b_o$ are the biases of the input gate, forget gate and output gate, respectively, and $W_c$ and $b_c$ are the weight matrix and bias of the candidate cell state. Here tanh is the element-wise hyperbolic tangent, σ is the logistic sigmoid function, $r_t$, $h_{t-1}$ and $c_{t-1}$ are the inputs, and $h_t$ and $c_t$ are the outputs. The operator $*$ denotes matrix multiplication when a weight matrix is applied to $[h_{t-1}, r_t]$ and element-wise multiplication between gate and state vectors. For the i-th residue in the input antibody or antigen sequence, we combine the output $\overrightarrow{h_i}$ of the forward LSTM and the output $\overleftarrow{h_i}$ of the backward LSTM by concatenating them:

$$ h_i = [\overrightarrow{h_i} \oplus \overleftarrow{h_i}] \tag{9} $$

In the attention layer, let $H \in \mathbb{R}^{l \times 2d}$ be the input sequence

$$ H = [h_1, h_2, h_3, \ldots, h_i, \ldots, h_l]^{T} \tag{10} $$

in which each element is the Bi-LSTM output for the i-th residue, l is the input sequence length, and d is the residue feature dimension. The attention mechanism is calculated as follows:

$$ M = \tanh(H) \tag{11} $$

$$ \alpha = \mathrm{softmax}(W_\mu M) \tag{12} $$

$$ y_g = H \alpha^{T} \tag{13} $$

where tanh is the element-wise hyperbolic tangent, $W_\mu$ is the weight matrix, and α is the attention vector. The output $y_g$, written $ab_g$ for the antibody sequence and $ag_g$ for the antigen sequence, is formed by a weighted sum of the vectors in H.

Fully-Connected Networks. As shown in Figs. 1 and 2, the local features extracted by the CNNs are $r_i$, and the global features derived by the Att-BLSTM networks from the antibody and its partner antigen sequence are $ab_g$ and $ag_g$, respectively. Then $r_i$, $ab_g$ and $ag_g$ are concatenated and fed to the fully-connected networks. The probability $y_i$ that an input antibody residue belongs to the paratope is calculated as

$$ y_i = f(W(r_i \oplus ab_g \oplus ag_g) + b) \tag{14} $$
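The attention pooling of Eqs. (11) to (13) and the concatenation inside Eq. (14) can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the attention weights are taken to be a single vector `w_mu`, the pooled output is computed as the attention-weighted sum of the rows of H, and all numeric values are hypothetical rather than trained parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, w_mu):
    """Attention pooling over Bi-LSTM outputs.

    H: list of l output vectors (one per residue), each of the same width.
    w_mu: weight vector of that width (assumed shape; hypothetical values below).
    Returns (alpha, y): attention weights and the weighted sum of the rows of H.
    """
    M = [[math.tanh(v) for v in h] for h in H]                    # Eq. (11)
    scores = [sum(wk * mk for wk, mk in zip(w_mu, m)) for m in M]
    alpha = softmax(scores)                                       # Eq. (12)
    width = len(H[0])
    y = [sum(alpha[i] * H[i][k] for i in range(len(H))) for k in range(width)]  # Eq. (13)
    return alpha, y


# Hypothetical Bi-LSTM outputs for an antibody (3 residues) and an antigen (2 residues).
H_ab = [[0.5, -0.5], [0.5, -0.5], [0.5, -0.5]]
H_ag = [[1.0, 0.0], [0.0, 1.0]]
w_mu = [1.0, 1.0]

alpha_ab, ab_g = attention_pool(H_ab, w_mu)
alpha_ag, ag_g = attention_pool(H_ag, w_mu)

# Local CNN feature of one target residue (hypothetical), concatenated as in Eq. (14).
r_i = [0.1, 0.2, 0.3]
fc_input = r_i + ab_g + ag_g
print(len(fc_input))
```

Note that `ab_g` and `ag_g` are computed once per sequence and shared by every residue of the antibody, whereas `r_i` changes with the target residue.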
2.5 Training Details
We implement our model using PyTorch v1.4. The training configuration is as follows: loss function: weighted cross-entropy loss as in [10]; optimization: Momentum optimizer with Nesterov accelerated gradients; learning rate: 0.001; batch size: 32; dropout: 0.5; fixed sliding window length: 11. The first fully-connected layer has 512 nodes and the second fully-connected layer has 256 nodes. All these training parameters are taken from our previous work [25]. The training time of each epoch varies roughly from 1 to 2 min, depending on whether the global features are used, on a single NVIDIA RTX2080 GPU.
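A weighted cross-entropy loss for the binary paratope labels can be sketched as follows. The exact weighting scheme of [10] is not given in the text, so weighting positives by the negative-to-positive ratio of the training set (counts from Table 1) is an assumption, as are the function name and the toy probabilities.

```python
import math

# Negative-to-positive ratio of the training set in Table 1 (assumed weighting scheme).
POS_WEIGHT = 81283 / 4449


def weighted_bce(probs, labels, pos_weight, neg_weight=1.0, eps=1e-12):
    """Mean weighted binary cross-entropy over predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -neg_weight * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical predictions for four residues, one of them a paratope residue.
loss = weighted_bce([0.9, 0.2, 0.1, 0.4], [1, 0, 0, 0], POS_WEIGHT)
print(round(loss, 4))
```

Up-weighting the rare positive class keeps the roughly 5% of paratope residues from being dominated by the negatives. In PyTorch, the stated optimizer would most naturally correspond to `torch.optim.SGD` with `nesterov=True`; the momentum coefficient is not reported in the text.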
3 Results

3.1 Model Results
To evaluate the performance of our method and other competing methods, we use standard metrics: the area under the receiver operating characteristic curve (AUC ROC), the area under the precision-recall curve (AUC PR), the Matthews correlation coefficient (MCC) and the F-score. Because our method outputs a probability for each antibody residue, we compute MCC and F-score by predicting residues with probability above 0.5 as paratope. To summarize the performance, all metrics are averaged over all antibodies in the testing set. We repeat the training and testing procedures five times to provide robust estimates of performance. The mean values and standard errors are reported in Table 2.

Table 2 shows the results of our method and other competing sequence-based antibody paratope prediction methods. Only our method reports the AUC PR value, and the F-score is not reported for Fast-Parapred and AG-Fast-Parapred. Parapred uses CNNs and LSTM to predict the binding probability of the residues in the CDRs of the antibody sequence [23]. Fast-Parapred and AG-Fast-Parapred outperform Parapred by leveraging self-attention and cross-modal attention, respectively [6]. proABC-2 is based on a CNN model and takes the heavy chain and light chain as separate inputs [2]. The proABC-2 results are obtained by training on the Parapred set and using a threshold of 0.37. As Table 2 shows, our method performs best on all metrics.

Table 2. Performance comparison with competing methods

Method             AUC ROC          AUC PR           MCC              F-score
Parapred           0.880 ± 0.002    –                0.564 ± 0.007    0.69
Fast-Parapred      0.883 ± 0.001    –                0.572 ± 0.004    –
AG-Fast-Parapred   0.899 ± 0.004    –                0.598 ± 0.012    –
proABC-2           0.91             –                0.56             0.62
Our Method         0.945 ± 0.001    0.644 ± 0.002    0.635 ± 0.005    0.819 ± 0.003
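The thresholded metrics and the mean and standard error over repeated runs can be sketched as follows. The helper names and toy values are hypothetical, and the use of the sample standard deviation for the standard error is an assumption, since the text does not specify it.

```python
import math
import statistics


def confusion(probs, labels, threshold=0.5):
    """Count TP, FP, TN, FN when residues with probability above the threshold are positive."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f_score(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mean_and_se(values):
    """Mean and standard error (sample standard deviation / sqrt(n)) over repeated runs."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical predictions for six residues of one antibody.
probs = [0.9, 0.8, 0.3, 0.6, 0.1, 0.2]
labels = [1, 1, 1, 0, 0, 0]
counts = confusion(probs, labels)
print(counts, f_score(*counts), mcc(*counts))
```

In the evaluation protocol described above, such per-antibody values would be averaged over the testing set, and `mean_and_se` would then be applied to the five repeated runs.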
3.2 Effect of Global Features and Partner Features
Although global features have proved useful in protein-protein interaction site prediction [39] and protein phosphorylation site prediction [11], those methods only use global features from the sequence itself and do not consider partner features from the partner sequence. To measure the effect of global features and partner features, we train our model using different feature combinations. The results are shown in Table 3. Except for AUC ROC, the combination of local-global and partner features achieves the best performance on all metrics. The dataset is imbalanced, and we consider AUC PR the primary metric because it is more sensitive on an imbalanced dataset [34].
The results in Table 3 indicate that both global and partner features help to improve the model performance for antibody paratope prediction, and that the combination of all features performs best. Partner antigen information is also used in AG-Fast-Parapred. AG-Fast-Parapred and Fast-Parapred come from the same work and are trained on the same datasets [6], and Fast-Parapred uses only antibody information. From Table 2, we can see that AG-Fast-Parapred performs better than Fast-Parapred, which also shows the effect of partner features.

Table 3. Performance of our method using different feature combinations

Features        AUC ROC          AUC PR           MCC              F-score
Local           0.930 ± 0.001    0.626 ± 0.001    0.608 ± 0.001    0.805 ± 0.001
Local+Global    0.946 ± 0.001    0.643 ± 0.002    0.633 ± 0.009    0.818 ± 0.005
Local+Partner   0.948 ± 0.002    0.632 ± 0.012    0.626 ± 0.018    0.814 ± 0.009
All             0.945 ± 0.001    0.644 ± 0.002    0.635 ± 0.005    0.819 ± 0.003
4 Conclusion
In this study, we proposed a sequence-based antibody paratope prediction method that leverages local and global features of the antibody and global features of the partner antigen. CNNs are used to extract the local features of the antibody sequence, and Att-BLSTM networks are used to capture the global features of the whole antibody and antigen sequences. We evaluated our model on benchmark datasets, and the results show an improvement in antibody paratope prediction. Moreover, our results indicate that the global features from both the antibody and the antigen are useful for improving performance, and that our model performs best when both are used together.

Although our method outperforms other competing sequence-based methods for antibody paratope prediction, it has some limitations. First, like other sequence-based methods, our program takes a long time to generate sequence-based features by running PSI-BLAST [1] and NetSurfP-2.0 [17]. Second, although the combined features improve the model performance, it is still inferior to structure-based approaches. In our work, we show that combining local-global and partner antigen features can be useful for antibody paratope prediction. In the future, we will focus on how to mine more structural features from the antibody sequence to improve our model performance.
References

1. Altschul, S.F., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)
2. Ambrosetti, F., et al.: proABC-2: PRediction Of AntiBody Contacts v2 and its application to information-driven docking. Bioinformatics, 1–2 (2020). https://doi.org/10.1093/bioinformatics/btaa644
3. Bax, A., Grzesiek, S.: Methodological advances in protein NMR. Accounts Chem. Res. 26(4), 131–138 (1993). https://doi.org/10.1021/ar00028a001
4. Bin, Y., Yang, Y., Shen, F., Xie, N., Shen, H.T., Li, X.: Describing video with attention-based bidirectional LSTM. IEEE Trans. Cybern. 49(7), 2631–2641 (2019). https://doi.org/10.1109/TCYB.2018.2831447
5. Daberdaku, S., Ferrari, C.: Antibody interface prediction with 3D Zernike descriptors and SVM. Bioinformatics 35(11), 1870–1876 (2018). https://doi.org/10.1093/bioinformatics/bty918
6. Deac, A., Velickovic, P., Sormanni, P.: Attentive cross-modal paratope prediction. J. Comput. Biol. 26(6), 536–545 (2019). https://doi.org/10.1089/cmb.2018.0175
7. Del Vecchio, A., Deac, A., Liò, P., Veličković, P.: Neural message passing for joint paratope-epitope prediction. arXiv, pp. 1–5 (2021)
8. Esmaielbeiki, R., Krawczyk, K., Knapp, B., Nebel, J.C., Deane, C.M.: Progress and challenges in predicting protein interfaces. Brief. Bioinform. 17(1), 117–131 (2016). https://doi.org/10.1093/bib/bbv027
9. Ferdous, S., Martin, A.C.R.: AbDb: antibody structure database-a database of PDB-derived antibody structures. Database 2018, 1–9 (2018). https://doi.org/10.1093/database/bay040
10. Fout, A., Byrd, J., Shariat, B., Ben-Hur, A.: Protein interface prediction using graph convolutional networks. In: Conference on Neural Information Processing Systems, pp. 6531–6540 (2017)
11. Guo, L., Wang, Y., Xu, X., Cheng, K.K., Long, Y., Xu, J., Li, S., Dong, J.: DeepPSP: a global-local information-based deep neural network for the prediction of protein phosphorylation sites. J. Proteome Res. 20(1), 346–356 (2021). https://doi.org/10.1021/acs.jproteome.0c00431
12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
13. Hu, D., et al.: Effective optimization of antibody affinity by phage display integrated with high-throughput DNA synthesis and sequencing technologies. PLoS ONE 10(6), 1–17 (2015). https://doi.org/10.1371/journal.pone.0129125
14. Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 42(8), 2011–2023 (2020). https://doi.org/10.1109/TPAMI.2019.2913372
15. Hu, J., Li, Y., Zhang, M., Yang, X., Shen, H.B., Yu, D.J.: Predicting protein-DNA binding residues by weightedly combining sequence-based features and boosting multiple SVMs. IEEE/ACM Trans. Comput. Biol. Bioinf. 14(6), 1389–1398 (2017). https://doi.org/10.1109/TCBB.2016.2616469
16. Karimi, M., Wu, D., Wang, Z., Shen, Y.: DeepAffinity: interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 35(18), 3329–3338 (2019). https://doi.org/10.1093/bioinformatics/btz111
17. Klausen, M.S., et al.: NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins Struct. Funct. Bioinform. 87(6), 520–527 (2019). https://doi.org/10.1002/prot.25674
18. Krawczyk, K., Baker, T., Shi, J., Deane, C.M.: Antibody i-Patch prediction of the antibody binding site improves rigid local antibody-antigen docking. Protein Eng. Des. Sel. 26(10), 621–629 (2013). https://doi.org/10.1093/protein/gzt043
19. Kunik, V., Ashkenazi, S., Ofran, Y.: Paratome: an online tool for systematic identification of antigen-binding regions in antibodies based on sequence or structure. Nucleic Acids Res. 40(W1), 521–524 (2012). https://doi.org/10.1093/nar/gks480
20. Kunik, V., Peters, B., Ofran, Y.: Structural consensus among antibodies defines the antigen binding site. PLoS Comput. Biol. 8(2), e1002388 (2012). https://doi.org/10.1371/journal.pcbi.1002388
21. Kuroda, D., Shirai, H., Jacobson, M.P., Nakamura, H.: Computer-aided antibody design. Protein Eng. Des. Sel. 25(10), 507–521 (2012). https://doi.org/10.1093/protein/gzs024
22. Li, L., Wan, J., Zheng, J., Wang, J.: Biomedical event extraction based on GRU integrating attention mechanism. BMC Bioinform. 19(Suppl 9), 93–100 (2018). https://doi.org/10.1186/s12859-018-2275-2
23. Liberis, E., Velickovic, P., Sormanni, P., Vendruscolo, M., Lio, P.: Parapred: antibody paratope prediction using convolutional and recurrent neural networks. Bioinformatics 34(17), 2944–2950 (2018). https://doi.org/10.1093/bioinformatics/bty305
24. Lu, R.M., Hwang, Y.C., Liu, I.J., Lee, C.C., Tsai, H.Z., Li, H.J., Wu, H.C.: Development of therapeutic antibodies for the treatment of diseases. J. Biomed. Sci. 27(1), 1–30 (2020). https://doi.org/10.1186/s12929-019-0592-z
25. Lu, S., Li, Y., Wang, F., Nan, X., Zhang, S.: Leveraging sequential and spatial neighbors information by using CNNs linked With GCNs for paratope prediction. IEEE/ACM Trans. Comput. Biol. Bioinform., 1 (2021). https://doi.org/10.1109/TCBB.2021.3083001
26. Luo, L., Yang, Z., Yang, P., Zhang, Y., Wang, L., Lin, H., Wang, J.: An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34(8), 1381–1388 (2018). https://doi.org/10.1093/bioinformatics/btx761
27. McGinnis, S., Madden, T.L.: BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32(Web Server issue), 20–25 (2004). https://doi.org/10.1093/nar/gkh435
28. Meiler, J., Müller, M., Zeidler, A., Schmäschke, F.: Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. J. Mol. Model. 7(9), 360–369 (2001). https://doi.org/10.1007/s008940100038
29. Pittala, S., Bailey-Kellogg, C.: Learning context-aware structural representations to predict antigen and antibody binding interfaces. Bioinformatics 36(13), 3996–4003 (2020). https://doi.org/10.1093/bioinformatics/btaa263
30. Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J., Koes, D.R.: Protein-ligand scoring with convolutional neural networks. J. Chem. Inf. Model. 57(4), 942–957 (2017). https://doi.org/10.1021/acs.jcim.6b00740
31. Ren, J., Liu, Q., Ellis, J., Li, J.: Tertiary structure-based prediction of conformational B-cell epitopes through B factors. Bioinformatics 30(12), 264–273 (2014). https://doi.org/10.1093/bioinformatics/btu281
32. Schotte, F., et al.: Watching a protein as it functions with 150-ps time-resolved x-ray crystallography. Science 300(5627), 1944–1947 (2003). https://doi.org/10.1126/science.1078797
33. Skwark, M.J., Raimondi, D., Michel, M., Elofsson, A.: Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput. Biol. 10(11), 1–14 (2014). https://doi.org/10.1371/journal.pcbi.1003889
34. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: International Conference on Machine Learning, pp. 233–240 (2006). https://doi.org/10.1145/1143844.1143874
35. Stave, J.W., Lindpaintner, K.: Antibody and antigen contact residues define epitope and paratope size and structure. J. Immunol. 191(3), 1428–1435 (2013). https://doi.org/10.4049/jimmunol.1203198
36. Vieira, J.P.A., Moura, R.S.: An analysis of convolutional neural networks for sentence classification. In: Conference on Empirical Methods in Natural Language Processing. vol. 2017-Janua, pp. 1–5 (2017). https://doi.org/10.1109/CLEI.2017.8226381
37. Wardah, W., Dehzangi, A., Taherzadeh, G., Rashid, M.A., Khan, M.G., Tsunoda, T., Sharma, A.: Predicting protein-peptide binding sites with a deep convolutional neural network. J. Theor. Biol. 496, 110278 (2020). https://doi.org/10.1016/j.jtbi.2020.110278
38. Yan, K., Wen, J., Xu, Y., Liu, B.: Protein fold recognition based on auto-weighted multi-view graph embedding learning model. IEEE/ACM Trans. Comput. Biol. Bioinform. 5963(c), 1 (2020). https://doi.org/10.1109/tcbb.2020.2991268
39. Zeng, M., Zhang, F., Wu, F.X., Li, Y., Wang, J., Li, M.: Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics 36(4), 1114–1120 (2020). https://doi.org/10.1093/bioinformatics/btz699
40. Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., Xu, B.: Attention-based bidirectional long short-term memory networks for relation classification. In: Annual Meeting of the Association for Computational Linguistics, pp. 207–212 (2016). https://doi.org/10.18653/v1/p16-2034
41. Zhou, Z.H.: Towards atomic resolution structural determination by single-particle cryo-electron microscopy, April 2008. https://doi.org/10.1016/j.sbi.2008.03.004
SuccSPred: Succinylation Sites Prediction Using Fused Feature Representation and Ranking Method
Ruiquan Ge1, Yizhang Luo1, Guanwen Feng2, Gangyong Jia1, Hua Zhang1, Chong Xu1, Gang Xu1, and Pu Wang3(B)
1 School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China [email protected]
2 Xi’an Key Laboratory of Big Data and Intelligent Vision, School of Computer Science and Technology, Xidian University, Xi’an 710071, China
3 Computer School, Hubei University of Arts and Science, Xiangyang 441053, China [email protected]
Abstract. Protein succinylation is a novel type of post-translational modification identified in the recent decade. Experiments have verified that it plays an important role in biological structure and function. However, experimental identification of succinylation sites is time-consuming and laborious, and traditional technology cannot keep up with the rapid growth of sequence data sets. Therefore, we propose a new computational method named SuccSPred to predict succinylation sites in a given protein sequence by fusing many kinds of feature representations with a ranking method. SuccSPred is implemented with a two-step strategy. First, linear discriminant analysis is used to reduce the feature dimensions to prevent overfitting. Subsequently, the predictor is built by combining incremental feature selection with classifiers to identify succinylation sites. After the classifiers were compared using ten-fold cross-validation, the selected model achieved a promising improvement. Comparative experiments showed that SuccSPred significantly outperforms previous tools and has a strong ability to identify succinylation sites in given proteins.

Keywords: Post-translational modification · Lysine succinylation · Feature representation · Pseudo amino acid composition · Machine learning · System biology
1 Introduction Post-translational modification (PTM) is considered an important biological mechanism which impacts the diversified structures and functions of the proteome. Up to now, a large number of PTMs have been studied extensively [1]. In recent years, lysine succinylation was found as a novel widespread reversible protein PTM and has attracted the attention of many researchers [2]. Lysine succinylation plays vital regulatory roles © Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 191–202, 2021. https://doi.org/10.1007/978-3-030-91415-8_17
192
R. Ge et al.
in various cellular processes such as mitochondrial metabolism, autophagy and gene expression. Meanwhile, the dysregulation of lysine succinylation is closely associated with many human diseases including inflammation, cancer, tuberculosis, neurodegenerative, allergic dermatitis etc. [3]. Therefore, it presents an urgent issue to develop accurate and efficient identified tools of succinylation sites so as to recognize and understand its mechanism better. Many studies have confirmed that protein succinylation modification is common in prokaryotes and eukaryotes. Succinylation was first shown to occur at the active site of high serine transsuccinylase. Docosahexaenoic acid (DHA) can promote succinylation of lysine residues, which may have a certain impact on the central nervous system. DHA can promote succinylation. And succinic acid has the same effect in E. coli [4]. In Mycobacterium tuberculosis, succinylated proteins are involved in many processes, including transcription, translation, stress response, protein interactions, etc. [5]. Now, mass spectrometry and high-throughput biological technology have been widely used to identify the sites of PTM. But, due to the time-consuming and laborious traditional experiment methods, many fast bioinformatics tools have been developed to predict succinylation sites in protein to make up the shortcomings of traditional technology in recent years [6, 7]. SucPred is considered as the first predictor for succinylation sites identification with multiple feature descriptors and support vector machine (SVM) classifier [8]. iSuc-PseAAC incorporated the peptide position-specific propensity into the general form of pseudo amino acid composition (Pse-AAC) and employed SVM to predict the succinylation sites in cross-validation dataset [9]. SuccFind was constructed to predict the succinylation sites based on sequence-derived features and evolutionaryderived information of sequence. 
Meanwhile, it adopted a two-step feature selection strategy for further optimization with an SVM classifier [10]. SuccinSite exploited amino acid patterns and properties by combining three sequence encodings with a random forest (RF) classifier; the rules identified by the RF model may help to explain the mechanisms of lysine succinylation [11]. PSSM-Suc transformed the position-specific scoring matrix (PSSM) into bigram features, which carry evolutionary information about amino acids, to predict succinylated residues [12]. Success used structural and evolutionary information of amino acids to extract bigram features for predicting succinylation sites [13]. SucStruct used structural features of amino acids to improve succinylation site prediction and used k-nearest neighbors (KNN) to deal with the imbalanced dataset [14]. Jia et al. proposed two prediction models, pSuc-Lys and iSuc-PseOpt. pSuc-Lys employed PseAAC and an ensemble random forest method [15], while iSuc-PseOpt incorporated sequence-coupling effects into pseudo components and balanced the imbalanced training dataset [16]. SSEvol-Suc incorporated secondary structure and PSSM information through profile bigrams into an AdaBoost classifier for succinylation site prediction, and achieved a significant improvement over the iSuc-PseAAC, iSuc-PseOpt, SuccinSite and pSuc-Lys predictors [17]. PSuccE combined multiple features, employed information gain to select representative features, and predicted lysine succinylation sites with an ensemble of SVM classifiers [18]. GPSuc optimized multiple complementary features using Wilcoxon rank feature selection and employed logistic regression and an RF classifier to make the final prediction [19]. HybridSucc integrated ten types of informative features and then merged a
SuccSPred: Succinylation Sites Prediction
deep neural network (DNN) and a penalized logistic regression (PLR) algorithm into a hybrid-learning architecture to build the model [20]. DeepSuccinylSite is another deep learning tool that identifies succinylation sites using embedding and one-hot encoding [21]. SSKM_Succ employed a two-step feature selection strategy to remove redundant features extracted from three kinds of sequence features. K-means clustering was then used to divide the data into 5 clusters, and an SVM was applied to construct a prediction model for each cluster [22]. IFS-LightGBM combined the LightGBM feature selection method with incremental feature selection (IFS) to select the optimal feature subset from multiple types of feature information [23].
Although many methods perform fairly well in predicting succinylation sites, there is still much room for improvement [24]. Firstly, effective feature extraction methods for protein sequences should be explored, since they directly affect prediction performance. Secondly, designing an effective dimension reduction method, for example through feature selection and ranking strategies, is a key step towards accurate prediction of succinylation sites [25]. Finally, how to make use of the experimental dataset and the various machine learning methods to design an accurate and efficient prediction model is also an important and urgent issue.
2 Material and Method
2.1 Dataset
Fig. 1. Amino acid frequency distribution of succinylation residues against non-succinylation residues. Green bar box represents the frequency of amino acids in positive samples, and the blue represents the frequency of amino acids in negative samples. (Color figure online)
The experimental data for cross-validation were collected from dbPTM [26], which integrates published literature, public resources, and eleven biological databases related
to PTMs. The training samples were obtained from 2599 protein sequences, which included 5049 experimentally verified lysine succinylation sites as positive samples and 5526 non-succinylation sites as negative samples. In addition, the SMOTE algorithm was used to balance the numbers of positive and negative samples [27]. Figure 1 shows the frequency distribution of each amino acid in all samples. Lysine (K) is the most frequent residue, while cysteine (C) and tryptophan (W) are the least frequent. Comparing the two classes, amino acids such as alanine (A), aspartate (D), glutamic acid (E), glycine (G) and lysine (K) are more frequent in positive samples than in negative samples, whereas amino acids such as cysteine (C), phenylalanine (F), histidine (H), methionine (M) and asparagine (N) are less frequent in positive samples. The frequency of isoleucine (I) is almost equal in positive and negative samples.
To examine the local sequence features around prediction sites, two-sample logos of the compositional biases around succinylation and non-succinylation sites were compared, covering the 10 residues upstream and the 10 residues downstream of every K residue in the protein sequence [28]. The importance of each amino acid is shown as a graphical sequence logo (Fig. 2). Significant symbols are plotted with heights proportional to the difference between the two samples. Residues are separated into two groups: (1) enriched in the positive samples; and (2) depleted in the positive samples. Residues such as K, R and A were frequently observed in the enriched section, while H, S, M and Q frequently appeared in the depleted section. The color of each symbol reflects the polarity of the residue side chain.
These observations suggest that combining residue frequency-based encodings with position-specific encodings may be an effective way to extract features for identifying lysine succinylation sites.
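The window extraction and frequency comparison described in this section can be illustrated with a minimal sketch. The function names, the padding symbol X for positions beyond the sequence ends, and the toy sequence are assumptions made for illustration; they are not taken from the SuccSPred code.

```python
from collections import Counter


def lysine_windows(sequence, flank=10, pad="X"):
    """Return (position, window) for every K, with `flank` residues on each side."""
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of `sequence` sits at index i + flank of `padded`,
            # so the slice below is centred on that lysine.
            windows.append((i, padded[i:i + 2 * flank + 1]))
    return windows


def residue_frequencies(windows, pad="X"):
    """Relative frequency of each amino acid over a list of window strings."""
    counts = Counter(ch for w in windows for ch in w if ch != pad)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# Toy sequence (made up) with a short flank so the windows are easy to read.
toy_windows = lysine_windows("MKAAGKLR", flank=2)
toy_freqs = residue_frequencies([w for _, w in toy_windows])
print(toy_windows)
print(sorted(toy_freqs.items()))
```

Computing such frequencies separately for windows around positive and negative sites gives the kind of per-residue comparison summarized in Fig. 1; with flank=10 each window holds 21 residues.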
Fig. 2. Two-sample logos of universal succinylation residues against non-succinylation residues. The upper side of the abscissa axis is the frequency of amino acids in positive samples, and the lower side is the frequency of amino acids in negative samples. The height of the letter indicates the frequency of amino acids. (Color figure online)
2.2 Feature Representation
There are four groups of features used in this study, namely amino acid composition, pseudo amino acid composition, autocorrelation features and profile-based features.
Specifically, amino acid compositions include Kmer, Distance-based Residue (DR) and PseAAC of Distance-Pairs and reduced alphabet scheme (Distance Pair). Pseudo amino acid compositions include parallel correlation pseudo amino acid composition (PC-PseAAC), series correlation pseudo amino acid composition (SC-PseAAC), general parallel correlation pseudo amino acid composition (PC-PseAAC-General) and general series correlation pseudo amino acid composition (SC-PseAAC-General). Autocorrelation features include auto covariance (AC), cross covariance (CC), auto-cross covariance (ACC) and physicochemical distance transformation (PDT). Profile-based features include Top-n-gram, profile-based physicochemical distance transformation (PDT-Profile), distance-based Top-n-gram (DT), profile-based auto covariance (AC-PSSM), profile-based cross covariance (CC-PSSM), profile-based auto-cross covariance (ACC-PSSM), PSSM distance transformation (PSSM-DT) and PSSM relation transformation (PSSM-RT). In total, there are 19 kinds of features [29].
2.3 Computational Model
In this study, eight classifiers are tested on the above feature representations, and the best one is chosen as the prediction engine of the final model. Logistic regression is often used to model the relationship between a binary response and its predictors. The algorithm is easy to implement and works well when the dataset is linearly separable. Naive Bayes (NB) is based on Bayes' theorem and assumes that all predictors are conditionally independent given the class. Support vector machine (SVM) is one of the most popular machine learning algorithms and can be used for regression as well as classification. The kernel trick allows an SVM to handle data that are not linearly separable.
Fig. 3. The framework of the SuccSPred model. Features are generated from the dbPTM dataset by 19 feature extraction methods, and LDA is carried out on each of them separately to obtain 19 new features (F1, F2, ..., F19). The ANOVA feature selection method is then used to rank the 19 features, and the selected features are fed into the Bayes classifier for training to obtain the final model. LDA: linear discriminant analysis.
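To make the first stage of Fig. 3 more concrete, the sketch below computes a normalized k-mer composition, the simplest of the 19 descriptors. It is an illustration only: the paper relies on an existing platform [29] for feature extraction, and the function name, the choice k = 2 and the handling of non-standard residues here are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(peptide, k=2):
    """Return (kmers, vector): normalized counts of every possible k-mer."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    vector = [0.0] * len(kmers)
    n_valid = 0
    for i in range(len(peptide) - k + 1):
        j = index.get(peptide[i:i + k])
        if j is not None:  # skip k-mers containing non-standard symbols
            vector[j] += 1.0
            n_valid += 1
    if n_valid:
        vector = [v / n_valid for v in vector]
    return kmers, vector


kmers, vec = kmer_composition("AKKA", k=2)
print(len(vec), vec[kmers.index("KK")])
```

For k = 2 the descriptor has 20^2 = 400 dimensions, which is the kind of high-dimensional block that the per-feature LDA step in Fig. 3 then compresses.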
Decision Tree is a powerful and popular supervised learning method used for classification and prediction. There are many classical algorithms to create Decision Tree, such
as ID3, C4.5 and CART; the last one is used in this study. Random forest (RF) is an ensemble learning algorithm made up of multiple decision trees, usually trained with the “bagging” method. AdaBoost is an ensemble learning algorithm trained with the “boosting” method, whose goal is to combine weak learners into a strong one. XGBoost is an optimized implementation of the Gradient Boosting Decision Tree (GBDT) with improved performance and speed, and is widely used in data science. LightGBM is also a tree-based machine learning algorithm that uses gradient boosting; it offers fast training, low memory usage and good accuracy.
In this study, a novel model named SuccSPred is proposed to predict lysine succinylation sites in protein sequences based on multiple-feature fusion and machine learning, as shown in Fig. 3. Firstly, protein sequences are represented with 19 kinds of features based on amino acid composition (AAC), autocorrelation, pseudo amino acid composition (PseAAC) and profile-based features [30]. Secondly, linear discriminant analysis (LDA) is used to reduce the dimension of each feature type separately. Subsequently, the features are ranked by ANOVA and selected according to classifier accuracy using the incremental feature selection (IFS) method. Finally, Naive Bayes is selected as the prediction classifier after comparison with several other classifiers in ten-fold cross-validation experiments on the dbPTM dataset. Experimental results demonstrate that SuccSPred is an effective and promising tool for predicting lysine succinylation sites in protein sequences. SuccSPred is freely available at https://github.com/greyspring/SuccSPred.
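The ranking and incremental selection steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each feature is taken to be one-dimensional (as after LDA on a two-class problem), the `evaluate` callback stands in for the cross-validated accuracy of the classifier, and the toy data and accuracy values are made up.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    between = sum(len(g) * (means[label] - grand_mean) ** 2
                  for label, g in groups.items())
    within = sum((v - means[label]) ** 2
                 for label, g in groups.items() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def incremental_feature_selection(columns, labels, evaluate):
    """Rank features by F score, then grow the subset one feature at a time."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: anova_f(columns[j], labels), reverse=True)
    best_subset, best_score = [], float("-inf")
    for m in range(1, len(ranked) + 1):
        score = evaluate(ranked[:m])
        if score > best_score:  # strict ">" keeps the smaller subset on ties
            best_subset, best_score = ranked[:m], score
    return ranked, best_subset, best_score


# Toy data: three one-dimensional features for six samples (made up).
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [5, 1, 3, 3, 5, 1],  # identical class means
    [1, 2, 3, 7, 8, 9],  # well separated classes
    [1, 5, 3, 2, 6, 4],  # weakly separated classes
]
# Hypothetical stand-in for cross-validated accuracy, keyed by subset size.
made_up_accuracy = {1: 0.70, 2: 0.74, 3: 0.74}
ranked, best_subset, best_score = incremental_feature_selection(
    columns, labels, lambda subset: made_up_accuracy[len(subset)])
print(ranked, best_subset, best_score)
```

Keeping the smaller subset when two subset sizes tie is one simple way to express a preference for fewer features; whether SuccSPred breaks ties this way is not stated in the text.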
2.4 Performance Evaluation
For any tested dataset with P real positive instances and N real negative instances, there are four outcomes from a binary classification system, namely true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Then we can get five statistical measures of the performance as shown below:

$$\mathrm{Sp} = \frac{TN}{TN + FP} \tag{1}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN} \tag{2}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\mathrm{F1\text{-}score} = \frac{2TP}{2TP + FP + FN} \tag{4}$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5}$$
Sensitivity (Sn), also known as recall, measures the proportion of positives that are correctly predicted. Specificity (Sp) measures the proportion of negatives that are correctly predicted. Accuracy (Acc) measures the proportion of all instances that are correctly predicted. F1-score is the harmonic mean of precision and recall. Matthews
correlation coefficient (MCC) is regarded as one of the best measures which can be used even if the distribution is very unbalanced between positives and negatives [31]. Generally speaking, a coefficient of −1 indicates that the predictions are completely wrong, 0 indicates that the classification system is no better than random prediction and +1 represents a perfect prediction.
3 Experiment
Fig. 4. Receiver operating characteristics (ROC) curves comparison of multiple types of classifiers. The AUC value represents the area under the ROC curve.
Figure 4 shows the receiver operating characteristic (ROC) curves for the ten-fold cross-validation run in which the Bayes classifier attained its highest accuracy when the number of selected features was K = 8. In this run, the accuracy of the Bayes classifier was 0.7431. The area under the curve (AUC) was 0.7951, 0.7772, 0.8132, 0.7898, 0.8023, 0.6777, 0.7930 and 0.7908 for SVM, random forest, Bayes, logistic regression, AdaBoost, decision tree, XGBoost and LightGBM, respectively, in ten-fold cross-validation. These AUC values indicate that the Bayes classifier performed best and the decision tree performed worst among the eight classifiers. To find the optimal number K of features, comparative experiments were carried out with K = 8, K = 12 and K = 14. As shown in Table 1, with K = 8 the maximum values of accuracy, sensitivity, specificity, Matthews correlation coefficient (MCC) and F1-score were (74.98 ± 0.68)%, (77.31 ± 1.18)%, (72.64 ± 1.64)%, (50.01 ± 1.35)%, and (75.63 ± 0.67)%, respectively. All of these indicators were better with K = 8 than with K = 12 or K = 14. Therefore, the
Table 1. The performance comparison of the different feature number K using Bayes classifier.
K    Acc               Sn                Sp                MCC               F1-score
8    0.7498 ± 0.0068   0.7731 ± 0.0118   0.7264 ± 0.0164   0.5001 ± 0.0135   0.7563 ± 0.0067
12   0.7455 ± 0.0085   0.7716 ± 0.0146   0.7204 ± 0.0217   0.4925 ± 0.0159   0.7517 ± 0.0100
14   0.7455 ± 0.0089   0.7711 ± 0.0146   0.7186 ± 0.0228   0.4907 ± 0.0180   0.7561 ± 0.0083
Table 2. The average and standard deviation of the maximum accuracy of different classifiers in ten-fold cross-validation.
Classifier            Acc               Sn                Sp                MCC               F1-score
SVM                   0.7372 ± 0.0082   0.7422 ± 0.0183   0.7325 ± 0.0170   0.4748 ± 0.0164   0.7393 ± 0.0095
Random forest         0.7141 ± 0.0076   0.6849 ± 0.0208   0.7437 ± 0.0138   0.4294 ± 0.0144   0.7063 ± 0.0100
Bayes                 0.7498 ± 0.0068   0.7731 ± 0.0118   0.7264 ± 0.0164   0.5001 ± 0.0135   0.7563 ± 0.0067
Logistic regression   0.7288 ± 0.0090   0.7329 ± 0.0192   0.7249 ± 0.0166   0.4579 ± 0.0180   0.7307 ± 0.0114
AdaBoost              0.7419 ± 0.0075   0.7475 ± 0.0192   0.7365 ± 0.0175   0.4842 ± 0.0151   0.7441 ± 0.0095
Decision tree         0.6679 ± 0.0117   0.6750 ± 0.0212   0.6612 ± 0.0187   0.3363 ± 0.0232   0.6711 ± 0.0141
XGBoost               0.7248 ± 0.0134   0.7341 ± 0.0233   0.7157 ± 0.0202   0.4500 ± 0.0269   0.7280 ± 0.0145
LightGBM              0.7212 ± 0.0104   0.7297 ± 0.0212   0.7129 ± 0.0189   0.4429 ± 0.0210   0.7243 ± 0.0121
experiment with K = 8 was chosen as the final result, following the Occam’s razor principle [32]. With K = 8, the average and standard deviation of the maximum accuracy of the eight classifiers were computed over 30 repetitions of ten-fold cross-validation. Alongside Table 2, Fig. 5 visualizes the results of the SuccSPred model with K = 8 and ten-fold cross-validation in terms of average accuracy, AUC, sensitivity, specificity, MCC and F1-score. Table 2 and Fig. 5 show that the Bayes and AdaBoost classifiers performed better than the other six classifiers. Several boosting classifiers also obtained relatively good results, whereas the random forest and decision tree classifiers performed relatively poorly. In particular, the decision tree classifier scored lower than the other seven classifiers on all six evaluation indexes.
Furthermore, the SuccSPred model was compared with several popular prediction models and methods, namely IFS-LightGBM, Random Forest (RF) [33], ExtraTree (ET) [34], Gradient Boosting Decision Tree (GBDT) [35], K-nearest neighbor (KNN) [36], XGBoost [37], and Naïve Bayes (NB) [38]. Ten-fold cross-validation was performed on the same succinylation dataset from dbPTM for the comparison, and four evaluation indicators, Accuracy (Acc), Sensitivity (Sn), MCC and F1-score, were used. As shown in Table 3, SuccSPred performed better than the other methods on all four indicators. The accuracy, sensitivity, MCC and F1-score of the SuccSPred model were 0.7498, 0.7731, 0.5001 and 0.7563, respectively. Compared with the IFS-LightGBM model
Fig. 5. Performance comparison among different classifiers in ten-fold cross-validation when 8 features were selected. Most indicators of the Bayes classifier are the highest among the eight classifiers.
Table 3. Performance comparison of SuccSPred with other existing methods.
Classifier      Acc      Sn       MCC      F1-score
IFS-LightGBM    0.7360   0.7223   0.4708   0.7232
RF              0.7182   0.6795   0.4345   0.6972
ET              0.6010   0.5934   0.2013   0.5868
GBDT            0.7267   0.7223   0.4528   0.7162
KNN             0.6596   0.7209   0.3258   0.6691
XGBoost         0.7271   0.7114   0.4529   0.7134
NB              0.6878   0.7724   0.3867   0.7026
SuccSPred       0.7498   0.7731   0.5001   0.7563
which was the strongest of the other seven methods in accuracy, MCC and F1-score, SuccSPred improved the accuracy, sensitivity, MCC and F1-score by 1.38, 5.08, 2.93 and 3.31 percentage points, respectively.
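The quoted improvements are absolute differences between the SuccSPred and IFS-LightGBM rows of Table 3. The short check below recomputes them from the tabulated values; the variable names are illustrative.

```python
# Values copied from the SuccSPred and IFS-LightGBM rows of Table 3.
succspred = {"Acc": 0.7498, "Sn": 0.7731, "MCC": 0.5001, "F1": 0.7563}
ifs_lightgbm = {"Acc": 0.7360, "Sn": 0.7223, "MCC": 0.4708, "F1": 0.7232}

# Absolute gain, expressed in percentage points and rounded to two decimals.
gain_points = {
    name: round((succspred[name] - ifs_lightgbm[name]) * 100, 2)
    for name in succspred
}
print(gain_points)
```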
4 Conclusion
In this work, we proposed a novel method named SuccSPred to predict lysine succinylation sites in protein sequences. SuccSPred fuses 19 kinds of feature encodings extracted from protein sequences according to amino acid composition, autocorrelation, pseudo amino acid composition and profile-based features. After the feature dimension is reduced with linear discriminant analysis, a feature selection technique is employed to find the optimal feature representation. The experimental results based on multiple classifiers show that SuccSPred with the Bayes classifier achieves relatively stable and high performance compared with other existing methods.
However, although SuccSPred achieves relatively good results, its overall advantage over existing methods is modest. In future work, we will explore the optimization of classifiers and the flexible application of feature selection techniques to alleviate the difficulty of obtaining high-quality data. Additionally, with the development of high-throughput sequencing technology, deep learning techniques are also a promising solution to various site prediction problems. With the development of proteomic research technology, new methods may help to reveal the regulatory mechanisms of succinylation in normal physiological processes as well as in pathological mechanisms. Succinylation may become a target for new drugs and provide new ideas for the treatment of diseases.
Acknowledgment. This research was supported in part by the National Natural Science Foundation of China (No. 61702146, 61841104), National Key Research and Development Program of China (No. 2019YFC0118404), Joint Funds of the Zhejiang Provincial Natural Science Foundation of China (No. U1909210, U20A20386), Zhejiang Provincial Natural Science Foundation of China (No. LY21F020017) and Zhejiang Provincial Science and Technology Program in China (No. 2021C01108).
References
1. Meng, X., et al.: Proteome-wide lysine acetylation identification in developing rice (Oryza sativa) seeds and protein co-modification by acetylation, succinylation, ubiquitination, and phosphorylation. Biochim. Biophys. Acta Proteins Proteom. 1866(3), 451–463 (2018)
2. Huang, K.Y., et al.: dbPTM in 2019: exploring disease association and cross-talk of post-translational modifications. Nucleic Acids Res. 47(D1), D298–D308 (2019)
3. Ao, C., Yu, L., Zou, Q.: Prediction of bio-sequence modifications and the associations with diseases. Brief Funct. Genomics 20(1), 1–18 (2021)
4. Kawai, Y., et al.: Formation of Nepsilon-(succinyl)lysine in vivo: a novel marker for docosahexaenoic acid-derived protein modification. J. Lipid. Res. 47(7), 1386–1398 (2006)
5. Xie, L., et al.: First succinyl-proteome profiling of extensively drug-resistant Mycobacterium tuberculosis revealed involvement of succinylation in cellular physiology. J. Proteome Res. 14(1), 107–119 (2015)
6. Li, F., et al.: PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact. Brief Bioinform. 21(3), 1069–1079 (2020)
7. Chen, Z., et al.: Large-scale comparative assessment of computational predictors for lysine post-translational modification sites. Brief Bioinform. 20(6), 2267–2290 (2019)
8. Zhao, X.W., et al.: Accurate in silico identification of protein succinylation sites using an iterative semi-supervised learning technique. J. Theor. Biol. 374, 60–65 (2015)
9. Xu, Y., et al.: iSuc-PseAAC: predicting lysine succinylation in proteins by incorporating peptide position-specific propensity. Sci. Rep. 5, 10184 (2015)
10. Xu, H.D., et al.: SuccFind: a novel succinylation sites online prediction tool via enhanced characteristic strategy. Bioinformatics 31(23), 3748–3750 (2015)
11. Hasan, M.M., et al.: SuccinSite: a computational tool for the prediction of protein succinylation sites by exploiting the amino acid patterns and properties. Mol. Biosyst. 12(3), 786–795 (2016)
12. Dehzangi, A., et al.: PSSM-Suc: accurately predicting succinylation using position specific scoring matrix into bigram for feature extraction. J. Theor. Biol. 425, 97–102 (2017)
13. Lopez, Y., et al.: Success: evolutionary and structural properties of amino acids prove effective for succinylation site prediction. BMC Genomics 19(Suppl 1), 923 (2018)
14. Lopez, Y., et al.: SucStruct: prediction of succinylated lysine residues by using structural properties of amino acids. Anal. Biochem. 527, 24–32 (2017)
15. Jia, J., et al.: pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J. Theor. Biol. 394, 223–230 (2016)
16. Jia, J., et al.: iSuc-PseOpt: identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Anal. Biochem. 497, 48–56 (2016)
17. Dehzangi, A., et al.: Improving succinylation prediction accuracy by incorporating the secondary structure via helix, strand and coil, and evolutionary information from profile bigrams. PLoS One 13(2), e0191900 (2018)
18. Ning, Q., et al.: Detecting succinylation sites from protein sequences using ensemble support vector machine. BMC Bioinform. 19(1), 237 (2018)
19. Hasan, M.M., Kurata, H.: GPSuc: Global Prediction of Generic and Species-specific Succinylation Sites by aggregating multiple sequence features. PLoS One 13(10), e0200283 (2018)
20. Ning, W., et al.: HybridSucc: A Hybrid-learning Architecture for General and Species-specific Succinylation Site Prediction. Genomics Proteomics Bioinform. 18(2), 194–207 (2020)
21. Thapa, N., et al.: DeepSuccinylSite: a deep learning based approach for protein succinylation site prediction. BMC Bioinform. 21(Suppl 3), 63 (2020)
22. Ning, Q., et al.: SSKM_Succ: a novel succinylation sites prediction method incorporating K-means clustering with a new semi-supervised learning algorithm. IEEE/ACM Trans. Comput. Biol. Bioinform. (2020)
23. Zhang, L., et al.: Succinylation site prediction based on protein sequences using the IFS-LightGBM (BO) model. Comput. Math. Methods Med. 2020, 8858489 (2020)
24. Zhu, Y., et al.: Inspector: a lysine succinylation predictor based on edited nearest-neighbor undersampling and adaptive synthetic oversampling. Anal. Biochem. 593, 113592 (2020)
25. Yang, Y., et al.: Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks. BMC Bioinform. 22(1), 171 (2021)
26. Huang, K.Y., et al.: dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins. Nucleic Acids Res. 44(D1), D435–D446 (2016)
27. Blagus, R., Lusa, L.: SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 14, 106 (2013)
28. Vacic, V., Iakoucheva, L.M., Radivojac, P.: Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics 22(12), 1536–1537 (2006)
29. Liu, B.: BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform. 20(4), 1280–1294 (2019)
30. Ge, R., et al.: EnACP: an ensemble learning model for identification of anticancer peptides. Front. Genet. 11, 760 (2020)
31. Boughorbel, S., Jarray, F., El-Anbari, M.: Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS One 12(6), e0177678 (2017)
32. Narain, D., et al.: Structure learning and the Occam’s razor principle: a new view of human function acquisition. Front. Comput. Neurosci. 8, 121 (2014)
33. Bureau, A., et al.: Identifying SNPs predictive of phenotype using random forests. Genet. Epidemiol. 28(2), 171–182 (2005)
34. Maree, R., Geurts, P., Wehenkel, L.: Random subwindows and extremely randomized trees for image classification in cell biology. BMC Cell Biol. 8(Suppl 1), S2 (2007)
35. Zhou, C., et al.: Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree. PLoS One 12(8), e0181426 (2017)
36. Sivaraj, S., Malmathanraj, R., Palanisamy, P.: Detecting anomalous growth of skin lesion using threshold-based segmentation algorithm and Fuzzy K-Nearest Neighbor classifier. J. Cancer Res. Ther. 16(1), 40–52 (2020)
37. Yu, B., et al.: SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting. Bioinformatics 36(4), 1074–1081 (2020)
38. Aydin, Z., et al.: Learning sparse models for a dynamic Bayesian network classifier of protein secondary structure. BMC Bioinform. 12, 154 (2011)
BindTransNet: A Transferable Transformer-Based Architecture for Cross-Cell Type DNA-Protein Binding Sites Prediction
Zixuan Wang1, Xiaoyao Tan1, Beichen Li1, Yuhang Liu1, Qi Shao1, Zijing Li1, Yihan Yang2, and Yongqing Zhang1,3(B)
1 School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China [email protected]
2 International College, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
3 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Abstract. Comprehending DNA-protein binding specificity in diverse cell types is essential for revealing regulatory mechanisms in biological processes. Recently, deep learning has been successfully applied to predict DNA-protein binding sites from large-scale chromatin-profiling data. However, the precise identification of putative binding sites in specific cell types with few labeled samples remains challenging. To this end, we present a novel transferable Transformer-based method, dubbed BindTransNet, for cross-cell-type DNA-protein binding prediction. Transfer learning and a Transformer Encoder are adopted together to capture shared long-range dependencies between motifs across cell types. This design helps our method recognize putative binding sites without massive labeled samples by leveraging these shared features. This work is the first to apply a Transformer to DNA-protein binding site prediction. The method is evaluated on the TFs COREST and SRF in four cell types, giving eight cell-type TF pairs. For both 4-class prediction and binary-level prediction, BindTransNet significantly outperforms several state-of-the-art methods. Moreover, BindTransNet achieves considerable performance improvements by leveraging transfer learning, which is a persuasive indication that BindTransNet can indeed capture shared features available in other cell types.
Keywords: DNA-protein binding prediction · Cross cell-type · Transformer · Transfer learning
This work is supported by the National Natural Science Foundation of China under Grant No. 61702058 and the China Postdoctoral Science Foundation funded project No. 2017M612948.
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 203–214, 2021. https://doi.org/10.1007/978-3-030-91415-8_18
Z. Wang et al.
1 Introduction
Transcription factors (TFs) are vital proteins that affect numerous cellular processes and regulate the activities of downstream genes [1]. These proteins preferentially recognize specific genomic sequences, which are referred to as DNA-protein binding sites or TF binding sites (TFBSs) [2]. Predicting putative binding sites in a particular cell type is a fundamental step in studying molecular and cellular biology [3,4]. Naturally, how to predict DNA-protein binding sites has become an essential issue in bioinformatics [5]. Recently, deep learning has made outstanding achievements in TFBS prediction [6–9]. For example, DeepBind [10] and DeepSEA [11] leverage convolutional neural networks (CNNs) to predict the sequence specificity of DNA-binding proteins and to prioritize functional variants, respectively. DanQ [12] and DeepD2V [13] combine CNNs with bi-directional long short-term memory (Bi-LSTM) to further improve prediction performance by capturing regulatory syntax. Subsequently, several variants [14–16] of these methods were presented. These methods have successfully discovered putative TFBSs, but they cannot predict TF binding sites across cell types. Recently, MTTFsite [17] realized cross-cell-type prediction of TFBSs by employing multi-task learning. Nevertheless, it remains a challenge to exploit shared features, related to long-range dependencies between motif variants across cell types, to predict DNA-protein binding sites in cell types with insufficient labeled samples.
To deal with the drawbacks mentioned above, we present a Transformer-based method, dubbed BindTransNet, for cross-cell-type DNA-protein binding prediction. The method is composed of two stages: (1) Direct learning. The model performs a four-class prediction on a universal dataset comprising one TF in four cell types to capture shared features. (2) Transfer learning. The pre-learned model is then applied to binary-level prediction on a private dataset comprising one cell-type TF pair to fine-tune the learnable parameters. To verify the effectiveness of the presented method, we conducted extensive experiments on 8 ChIP-seq datasets. Experimental results illustrate that BindTransNet yields better performance than several state-of-the-art methods. Moreover, by leveraging transfer learning, our method improves mean accuracy and ROC-AUC by 4.0% and 3.75%, respectively. To sum up, our main contributions are as follows:
• We present a novel technique called BindTransNet for DNA-protein binding sites prediction. BindTransNet improves the capability to learn long-range dependencies between individual nucleotides by employing a Transformer Encoder.
• Transfer learning is utilized to identify putative binding sites in cell lines with insufficient samples by leveraging shared features available across cell types.
• We conduct extensive comparative experiments on ChIP-seq datasets to comprehensively validate the effectiveness of our BindTransNet approach.
DNA-Protein Binding Sites Prediction by BindTransNet
The rest of the paper is organized as follows. In Sect. 2, we present the scheme and components of our architecture. Experimental setup and results are detailed in Sect. 3. Finally, we conclude our study in Sect. 4.
2 Method
2.1 Feature Representation
The input of our model is a DNA sequence represented as an l × 4 one-hot matrix, where l is the length of the input DNA sequence and 4 denotes the number of nucleotide types. The four nucleotides A, T, C and G are represented as [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1], respectively. Because of the limitations of second-generation sequencing technology, the nucleotides at some positions cannot be determined. An undetermined nucleotide is shown as N in the DNA sequence and encoded as [0, 0, 0, 0]. An individual ChIP-seq experiment can be regarded as a cell-type TF pair [18]. Thus, for the four-class prediction task, the label is a 4 × 1 binary vector in which each element corresponds to one cell-type TF pair, while for the binary-level prediction task the label is a binary scalar.
2.2 Model Architecture
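The one-hot representation of Sect. 2.1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `one_hot_encode` and the use of plain Python lists are assumptions, while the column order A, T, C, G and the all-zero row for N follow the text.

```python
# Column order follows the mapping given in Sect. 2.1: A, T, C, G.
BASES = "ATCG"

def one_hot_encode(sequence):
    """Return an l x 4 one-hot matrix (list of rows) for a DNA sequence.

    Undetermined positions ('N') are left as [0, 0, 0, 0].
    The function name is hypothetical; it is not taken from the paper.
    """
    matrix = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        column = BASES.find(base)
        if column >= 0:  # any symbol outside A/T/C/G stays all-zero
            row[column] = 1
        matrix.append(row)
    return matrix

encoded = one_hot_encode("ATNG")
```

Each row of `encoded` corresponds to one position of the sequence, so a sequence of length l yields l rows of width 4.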
2.2.1 Input and Output
A DNA sequence consists of nucleotides arranged in a specific order [19]. Thus, we convert a DNA sequence into a feature matrix $X^{(0)} = [x_1, x_2, ..., x_i, ..., x_l]$, where $x_i$ is the feature vector of the nucleotide at position i. This feature matrix is the input of our neural network. In our study, we address both four-class prediction problems and binary-level prediction problems. According to the task type, the output falls into two categories: (1) Stand-alone output $y$, the predicted occupancy probability for specific TFBSs in the input sequence. (2) One-dimensional output $\hat{y} = [y_1, ..., y_i, ..., y_4]$, where $y_i$ is the probability that the input sequence is a TFBS of the i-th cell-type TF pair. Based on the input $X^{(0)}$ and the output $y$ or $\hat{y}$ defined above, DNA-protein binding sites prediction can be described as finding the inherent mapping $f(\cdot)$ from $X^{(0)}$ to $y$ or $\hat{y}$. In our study, $f(\cdot)$ is a deep neural network.
2.2.2 BindTransNet Architecture
The architecture of BindTransNet is shown in Fig. 1. The model comprises three parts: a convolutional neural network, a Transformer Encoder and a discriminator.
• Convolutional neural network. This part captures local patterns, such as nucleotide arrangements, using a weight-sharing strategy.
Z. Wang et al.
Fig. 1. The BindTransNet model architecture.
• Transformer Encoder. This part models both long-range dependencies and the context information of individual nucleotides with multi-head self-attention and a point-wise fully connected layer. Moreover, the computation can be carried out in parallel, which reduces its cost.
• Discriminator. With a fully connected layer, the last part combines the features from the previous layers and makes the final prediction for the current input sequence.
In the following, we detail the three parts in turn.
2.2.3 Convolutional Neural Network
This part is composed of a 1D convolution operation followed by a rectified linear operation and a batch normalization operation. The computation is described as:
\[ X^{(i)} = \mathrm{BN}(\mathrm{ReLU}(\mathrm{Convolutional}(X^{(i-1)}, W_c^{(i)}, b_c^{(i)}))) \tag{1} \]
where $W_c^{(i)}$ and $b_c^{(i)}$ indicate the weight arrays and the biases of the i-th convolutional layer, respectively. $X^{(i)}$ denotes the output of the i-th convolutional layer. Convolutional(·), ReLU(·) and BN(·) define the convolution operation, the rectified linear operation and the batch normalization operation, respectively.
2.2.4 Transformer Encoder
This part is mainly composed of multi-head self-attention and a point-wise fully connected layer [20]. A residual connection is used in each of these layers:
\[ \tilde{A} = (\tilde{X} + P) + \mathrm{LN}(A) \tag{2} \]
\[ \tilde{T} = T + \mathrm{LN}(T) \tag{3} \]
where $\tilde{X}$ indicates the input of the Transformer Encoder. $P$ denotes the position matrix. $A$ and $\tilde{A}$ refer to the actual output and the identical output of the multi-head self-attention, respectively. $T$ and $\tilde{T}$ represent the actual output and the identical output of the point-wise fully connected layer. These matrices have the same width $d_m$. LN(·) defines the layer normalization operation.
Given a feature matrix $\tilde{X}$, Positional Encoding [20] is used to represent orientations and spatial distances within the sequence:
\[ P = \mathrm{PositionalEncoding}(\tilde{X}) \tag{4} \]
where PositionalEncoding(·) defines the Positional Encoding function. The position matrix $P$ has the same dimension as the feature matrix $\tilde{X}$ so that the two can be added.
After that, multi-head self-attention is adopted to allow the model to jointly attend to information from different representation subspaces at different positions. It maps the key matrix $K$, query matrix $Q$ and value matrix $V$ into $h$ subspaces, each of dimension $d_m / h$. In each subspace, scaled dot-product attention is performed in parallel and yields an intermediate output of dimension $d_m / h$. These intermediate outputs are concatenated to form the final output of the multi-head self-attention:
\[ A = \mathrm{Concat}(H_1, H_2, \ldots, H_s, \ldots, H_h) W_a \tag{5} \]
where $H_s$ indicates the output of the s-th scaled dot-product attention. $W_a$ denotes the weight arrays of the multi-head self-attention. $A$ refers to the output of the multi-head self-attention. Concat(·) defines the concatenation operation.
The computation of the scaled dot-product attention is described as:
\[ H_s = \mathrm{softmax}\left( \frac{Q_s W_q^s (K_s W_k^s)^T}{\sqrt{d_k}} \right) V_s W_v^s \tag{6} \]
where $\frac{1}{\sqrt{d_k}}$ indicates the scaling factor. $Q_s$, $K_s$ and $V_s$ denote the query, key and value of the s-th subspace. $W_q^s$, $W_k^s$ and $W_v^s$ refer to the weight arrays of the s-th subspace.
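Equations (5) and (6) can be illustrated with a small sketch. It is not the paper's PyTorch implementation: it uses plain Python lists, assumes self-attention (query, key and value all derived from the same input x), and the toy weights `w1`, `w2` and `identity` are hypothetical values chosen only to make the shapes easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row of m."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one subspace, in the spirit of Eq. (6)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_transposed)]
    return matmul(softmax_rows(scores), v)

def multi_head(x, heads, w_a):
    """Concatenate all heads column-wise and project with W_a, as in Eq. (5)."""
    outputs = [attention_head(x, *weights) for weights in heads]
    concat = [sum((h[i] for h in outputs), []) for i in range(len(x))]
    return matmul(concat, w_a)

# Toy setting (assumed): l = 3 positions, d_m = 4, h = 2 heads of width d_m / h = 2.
w1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
w2 = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(w1, w1, w1), (w2, w2, w2)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
output = multi_head(x, heads, identity)
```

The output keeps the l × d_m shape of the input, which is what allows the residual addition of Eq. (2).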
Finally, a point-wise fully connected layer is applied to each position separately and identically:
\[ T = \mathrm{ReLU}(A W_f^{(1)} + b_f^{(1)}) W_f^{(2)} + b_f^{(2)} \tag{7} \]
where $W_f^{(1)}$ and $W_f^{(2)}$ indicate the weight arrays of the point-wise fully connected layer. $b_f^{(1)}$ and $b_f^{(2)}$ denote the biases of the point-wise fully connected layer.
2.2.5 Discriminator
This part is composed of a fully connected layer that computes the final output by integrating the outputs of the previous layers. For the binary-level prediction, the sigmoid function is adopted:
\[ y = \mathrm{Sigmoid}(\mathrm{ReLU}(\tilde{T} W_d + b_d) W_b + b_b) \tag{8} \]
where $W_d$ and $b_d$ indicate the common weight arrays and biases of the discriminator, respectively. $W_b$ and $b_b$ denote the weight arrays and biases specific to the binary-level prediction. For the 4-class prediction, the softmax function is adopted:
\[ \hat{y} = \mathrm{softmax}(\mathrm{ReLU}(\tilde{T} W_d + b_d) W_m + b_m) \tag{9} \]
where $W_m$ and $b_m$ indicate the weight arrays and biases specific to the 4-class prediction. The inference step of the network is summarized in Algorithm 1.
2.3 Transfer Learning
A transfer learning strategy [21] is used in our study to leverage the common features shared across cell types and thereby improve DNA-protein binding sites prediction when labeled samples are insufficient. This strategy comprises two stages:
• To capture the common features, the model performs 4-class prediction on a dataset composed of one TF in 4 cell types.
• To fine-tune the learnable parameters, the model performs binary-level prediction on the individual ChIP-seq datasets. The model is initialized from the first stage, and no parameters are frozen during fine-tuning.
2.4 Model Implementation
Both BindTransNet and the comparison methods used in our study are implemented in PyTorch. Adam with default settings is used as the optimizer, and cross-entropy is selected as the loss function. The model is trained on an NVIDIA Tesla T4 for about one hour. The model with the best performance on the test data set is chosen as our final model.
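The relation between the two stages of Sect. 2.3 and the two discriminator heads of Eqs. (8) and (9) can be sketched as below. This is an illustration in plain Python, not the PyTorch model: the input is assumed to be an already flattened feature vector, all weight values are toy numbers, and the names `shared`, `head_m` and `head_b` are hypothetical. The point is only that the shared layer (W_d, b_d) is reused when the 4-class head (W_m, b_m) is swapped for the binary head (W_b, b_b).

```python
import math

def dense(vector, weights, bias):
    """Compute vector @ weights + bias for one feature vector."""
    return [sum(v * w for v, w in zip(vector, col)) + b
            for col, b in zip(zip(*weights), bias)]

def relu(vector):
    return [max(0.0, v) for v in vector]

def softmax(vector):
    peak = max(vector)
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def predict_four_class(features, shared, head_m):
    """Eq. (9): shared layer (W_d, b_d), then the 4-class head (W_m, b_m)."""
    hidden = relu(dense(features, *shared))
    return softmax(dense(hidden, *head_m))

def predict_binary(features, shared, head_b):
    """Eq. (8): the same shared layer, then the binary head (W_b, b_b)."""
    hidden = relu(dense(features, *shared))
    return sigmoid(dense(hidden, *head_b)[0])

# Toy parameters (assumed values, for shape illustration only).
shared = ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0])                 # W_d: 3x2
head_m = ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0] * 4)          # W_m: 2x4
head_b = ([[1.0], [-1.0]], [0.0])                                           # W_b: 2x1
features = [2.0, 2.0, 5.0]

class_probabilities = predict_four_class(features, shared, head_m)  # stage 1 output
binding_probability = predict_binary(features, shared, head_b)      # stage 2 output
```

In an actual fine-tuning run the shared parameters would keep being updated, since the text states that nothing is frozen; the sketch shows only the forward pass.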
Algorithm 1: The Transferable Transformer-based architecture for cross-cell type DNA-protein binding sites prediction.
Input: The one-hot matrix $X^{(0)}$;
Output: stand-alone output $y$ or one-dimensional output $\hat{y}$;
BindTransNet = load(weights, bias);
for i in range $N_{conv}$ do
    Compute the output of the i-th convolutional layer $X^{(i)}$ by Equation (1);
end
Compute the position matrix $P$ by Equation (4);
for i in range $N_{trans}$ do
    Compute original output of self-attention $A$ by Equations (5) and (6);
    Compute identical output of self-attention $\tilde{A}$ by Equation (2);
    Compute original output of point-wise fully connected layer $T$ by Equation (7);
    Compute identical output of point-wise fully connected layer $\tilde{T}$ by Equation (3);
end
if carrying out binary-level prediction then
    Compute the final output $y$ by Equation (8);
else
    Compute the final output $\hat{y}$ by Equation (9);
end

3 Experiment
3.1 Experiment Setup
3.1.1 Dataset
The ENCODE project provides many samples of TF-DNA binding sites obtained by ChIP-seq technology. In our study, we collect 8 ChIP-seq datasets from the ENCODE project to validate the performance of our method. Each ChIP-seq experiment can be regarded as a cell-type TF pair (such as COREST-GM12878). The 8 ChIP-seq datasets contain the human TF COREST from the four cell types GM12878, Helas3, Hepg2 and K562, and the human TF SRF from the four cell types GM12878, H1hesc, Hepg2 and K562. The datasets are freely available at http://cnn.csail.mit.edu/.
3.1.2 Competing Methods
Three of the most well-known methods for predicting DNA-protein binding sites, DeepSEA [11], CNN-Zeng [22] and DanQ [12], are selected to validate the effectiveness and robustness of our strategy.
3.1.3 Evaluation Metrics
In our study, multi-class accuracy, the Jaccard similarity coefficient and the Kappa coefficient are adopted to assess the 4-class prediction performance of BindTransNet. In addition, binary accuracy and ROC-AUC are used to measure binary-level prediction performance [23].
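The three multi-class metrics of Sect. 3.1.3 can be sketched as follows. The paper does not state how the Jaccard coefficient is averaged over classes, so macro averaging is an assumption here, and the labels in the usage lines are made-up toy values, not data from the experiments.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_jaccard(y_true, y_pred):
    """Per-class |true ∩ pred| / |true ∪ pred|, averaged over classes (assumed)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        true_idx = {i for i, t in enumerate(y_true) if t == label}
        pred_idx = {i for i, p in enumerate(y_pred) if p == label}
        scores.append(len(true_idx & pred_idx) / len(true_idx | pred_idx))
    return sum(scores) / len(scores)

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy labels for four classes (hypothetical).
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2]
acc = accuracy(y_true, y_pred)
jac = macro_jaccard(y_true, y_pred)
kap = cohen_kappa(y_true, y_pred)
```

Because Kappa subtracts the agreement expected by chance, a model that favors the majority class can score well on accuracy while scoring poorly on Kappa.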
3.2 Experiment Results and Analysis
3.2.1 Parameter Selection for CNNs
In this section, we study the impact of two factors: (1) the number of convolutional kernels and (2) the number of convolutional layers. The metric is multi-class accuracy.
Fig. 2. The 4-class prediction performance of BindTransNet when varying (a) the number of convolutional kernels and (b) the number of convolutional layers.
Number of Convolutional Kernels. As shown in Fig. 2a, as the number of convolutional kernels increases, the mean accuracy first improves and then starts to deteriorate. Generally, more convolutional kernels lead to better prediction ability, but too many kernels result in over-fitting. Therefore, 32 is a good number of kernels for BindTransNet.
Number of Convolutional Layers. As shown in Fig. 2b, the model with four convolutional layers outperforms the other setups. This result suggests that deeper convolutional layers have a better representation capability for capturing abstract features.
3.2.2 Parameter Selection for Transformer Encoder
In this section, we investigate the influence of two factors: (1) the number of heads in multi-head attention and (2) the number of Transformer Encoders. The metric is multi-class accuracy.
Head in Multi-head Attention. Figure 3a shows that BindTransNet achieves the best performance with 8 heads, which suggests that combining several attention heads is important for DNA-protein binding sites prediction.
Number of Transformer Encoders. Figure 3b shows that the model with 8 Transformer Encoders outperforms the other setups. This is mainly because capturing non-sequential dependencies between motifs is vital for improving DNA-protein binding prediction, and the residual connection provides a more robust learning ability.
Fig. 3. The 4-class prediction performance of BindTransNet when varying (a) the number of heads in multi-head attention and (b) the number of Transformer Encoders.
3.2.3 Comparison of BindTransNet with Other Predictors
To validate the effectiveness of our model on 4-class prediction, we compare BindTransNet with state-of-the-art methods on human ChIP-seq datasets using several multi-class metrics. The comparison result is summarized in Table 1, and the main observations from Table 1 are as follows:

Table 1. Performance comparison between state-of-the-art methods and our presented model on multi-class prediction.

DataSet   Model          Accuracy   Jaccard   Kappa
COREST    DeepSEA        0.543      0.248     0.249
          CNN-Zeng       0.583      0.211     0.167
          DanQ           0.564      0.393     0.272*
          BindTransNet   0.593*     0.422*    0.235
SRF       DeepSEA        0.395      0.186     0.071
          CNN-Zeng       0.410      0.201     0.153*
          DanQ           0.405      0.254     0.063
          BindTransNet   0.415*     0.262*    0.078
Note: * marks the best performer (shown in bold and underlined in the typeset table).
• Across all the ChIP-seq datasets, our method outperforms all competitors in accuracy and Jaccard, which suggests that considering non-sequential long-range dependencies between biological motifs can improve the performance of DNA-protein binding sites prediction. Therefore, the presented method is the best model to use for the source domain.
• Regarding the Kappa metric, our method is inferior to other state-of-the-art methods, mainly because of the imbalanced datasets in various cell types. However, we care more about whether our method can learn common features from large datasets. Therefore, paying more attention to the majority class is more important.
• On the SRF datasets, all methods perform worse than on the COREST datasets, mainly because the sample size of the SRF datasets is smaller than that of the COREST datasets. This indicates the necessity of sufficient labeled samples for supervised learning.
3.2.4 Performance Measure of Transfer Learning
To demonstrate the effectiveness of our transfer learning strategy in further improving binary-level prediction performance, we compare our transfer BindTransNet with other state-of-the-art methods on each ChIP-seq dataset. The metrics are binary accuracy and ROC-AUC.

Table 2. The comprehensive performance comparison between our transfer BindTransNet and other state-of-the-art methods.

DataSet          Metrics    CNN-Zeng   DeepSEA   DanQ     BindTransNet-NT   BindTransNet-T
GM12878-COREST   Accuracy   0.627      0.587     0.620    0.576             0.635*
                 ROC-AUC    0.673      0.642     0.674    0.583             0.677*
Helas3-COREST    Accuracy   0.737      0.742     0.734    0.747             0.768*
                 ROC-AUC    0.820      0.828     0.831    0.838             0.844*
Hepg2-COREST     Accuracy   0.654      0.659     0.671    0.634             0.681*
                 ROC-AUC    0.725      0.718     0.750*   0.703             0.745
K562-COREST      Accuracy   0.723      0.717     0.734    0.693             0.742*
                 ROC-AUC    0.802      0.794     0.807    0.770             0.822*
GM12878-SRF      Accuracy   0.790      0.755     0.801*   0.774             0.787
                 ROC-AUC    0.870      0.860     0.885*   0.860             0.874
H1hesc-SRF       Accuracy   0.848      0.819     0.856*   0.841             0.848
                 ROC-AUC    0.922      0.909     0.928    0.926             0.932*
Hepg2-SRF        Accuracy   0.807      0.800     0.781    0.804             0.815*
                 ROC-AUC    0.886      0.881     0.886    0.882             0.890*
K562-SRF         Accuracy   0.767      0.789*    0.755    0.752             0.780
                 ROC-AUC    0.833      0.850     0.863*   0.844             0.857
Note: * marks the best performer (shown in bold and underlined in the typeset table).
The comprehensive performance assessment of our transferable model is summarized in Table 2, where BindTransNet-T indicates the presented model with transfer learning and BindTransNet-NT denotes the presented model with direct learning. The main observations from Table 2 are as follows:
• BindTransNet-T yields better performance than several state-of-the-art methods in terms of mean accuracy and ROC-AUC. This reveals the difficulty of directly optimizing a supervised deep network on datasets with insufficient samples, and it demonstrates the effectiveness of using datasets with sufficient samples from other cell types to improve the representation ability of the deep network.
• Compared with the direct learning strategy, the transfer learning strategy improves mean accuracy by 4.0% (from 0.728 to 0.757) and mean ROC-AUC by 3.75% (from 0.800 to 0.830), in relative terms. This indicates that some shared features related to long-range dependencies between motif variants help the deep network to find a better local optimum.
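The two percentages quoted above are relative gains over the direct-learning means, not differences in percentage points. A short check, using only the four mean values given in the text (the helper name `relative_gain` is an illustrative assumption):

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

accuracy_gain = relative_gain(0.728, 0.757)  # mean accuracy, NT -> T
auc_gain = relative_gain(0.800, 0.830)       # mean ROC-AUC, NT -> T
```

The first value is about 3.98, which rounds to the reported 4.0%, and the second is 3.75 up to floating-point error.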
4 Conclusion
In this paper, we present a novel method, BindTransNet, for cross-cell type DNA-protein binding sites prediction. The long-range dependencies between motif variants are regarded as universal features shared across cell types. To this end, we design a Transferable Transformer-based method, BindTransNet, to model these shared long-range dependencies. To the best of our knowledge, BindTransNet is the first method utilising a Transformer for cross-cell type DNA-protein binding sites prediction. Extensive experiments on ChIP-seq datasets demonstrate that BindTransNet can outperform several state-of-the-art methods. In addition, we analyse the effectiveness of the transfer learning strategy. In the future, we plan to adopt a pre-trained Transformer to better capture the common long-range dependencies. Finally, we hope that BindTransNet can contribute to a further understanding of regulatory mechanisms at the genomic level.
References
1. Samuel, L., Arttu, J., Laura, C., et al.: The human transcription factors. Cell 172(4), 650–665 (2018)
2. Matthew, S., Tianyin, Z., Lin, Y., et al.: Absence of a simple code: how transcription factors read the genome. Trends Biochem. Sci. 39(9), 381–399 (2014)
3. Anthony, M., Beibei, X., Tsu-Pei, C., et al.: DNA shape features improve transcription factor binding site predictions in vivo. Cell Syst. 3(3), 278–286 (2016)
4. Stormo, G.: Modeling the specificity of protein-DNA interactions. Quant. Biol. 1(2), 115–130 (2013)
5. Yu, L., Chao, H., Lizhong, D., et al.: Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods 166, 4–21 (2019)
6. Yongqing, Z., Shaojie, Q., Shengjie, J., et al.: Identification of DNA-protein binding sites by bootstrap multiple convolutional neural networks on sequence information. Eng. Appl. Artif. Intell. 79, 58–66 (2019)
7. Yongqing, Z., Shaojie, Q., Shengjie, J., et al.: DeepSite: bidirectional LSTM and CNN models for predicting DNA-protein binding. Int. J. Mach. Learn. Cybern. 11(4), 841–851 (2020)
8. Yongqing, Z., Jianrong, Y., Siyu, C., et al.: Review of the applications of deep learning in bioinformatics. Curr. Bioinform. 15(8), 898–911 (2020)
9. Yongqing, Z., Shaojie, Q., Yuanqi, Z., et al.: CAE-CNN: predicting transcription factor binding site with convolutional autoencoder and convolutional neural network. Expert Syst. Appl. 183, 115404 (2021)
10. Babak, A., Andrew, D., Matthew, W., et al.: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33(8), 831–838 (2015)
11. Jian, Z., Olga, T.: Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12(10), 931–934 (2015)
12. Daniel, Q., Xiaohui, X.: DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44(11), e107 (2016)
13. Deng, L., Wu, H., Liu, X., et al.: DeepD2V: a novel deep learning-based framework for predicting transcription factor binding sites from combined DNA sequence. Int. J. Mol. Sci. 22(11), 5521 (2021)
14. Qinhu, Z., Lin, Z., Wenzheng, B., et al.: Weakly-supervised convolutional neural network architecture for predicting protein-DNA binding. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(2), 679–689 (2018)
15. Fang, J., Shaowu, Z., Zhen, C., et al.: An integrative framework for combining sequence and epigenomic data to predict transcription factor binding sites using deep learning. IEEE/ACM Trans. Comput. Biol. Bioinf. (2019)
16. Sirajul, S., Jianqiu, Z., Yufei, H.: Base-pair resolution detection of transcription factor binding site by deep deconvolutional network. Bioinformatics 34(20), 3446–3453 (2018)
17. Zhou, J., Lu, Q., Gui, L., et al.: MTTFsite: cross-cell type TF binding site prediction by using multi-task learning. Bioinformatics 35(24), 5067–5077 (2019)
18. Park, S., Koh, Y., Jeon, H., et al.: Enhancing the interpretability of transcription factor binding site prediction using attention mechanism. Sci. Rep. 10(1), 1–10 (2020)
19. Hongjie, W., Chengyuan, C., Xiaoyan, X., et al.: Unified deep learning architecture for modeling biology sequence. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(5), 1445–1452 (2017)
20. Ashish, V., Noam, S., Niki, P., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
21. Jialin, P.S., Qiang, Y.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009)
22. Haoyang, Z., Matthew, E., Ge, L., et al.: Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics 32(12), i121–i127 (2016)
23. Yuanqi, Z., Meiqin, G., Meng, L., et al.: A review about transcription factor binding sites prediction based on deep learning. IEEE Access 8, 219256–219274 (2020)
Overlapping Protein Complexes Detection Based on Multi-level Topological Similarities
Wenkang Wang1, Xiangmao Meng1, Ju Xiang1,2, and Min Li1(B)
1 Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
2 Department of Basic Medical Sciences, Changsha Medical University, Changsha 410219, China
Abstract. Protein complex detection is an important issue in the field of systems biology, which is crucial to understanding the cellular organization and inferring protein functions. In recent years, various computational methods have been proposed to detect protein complexes from protein-protein interaction (PPI) networks. Unfortunately, most of these methods only use the local information of PPI networks and treat protein complexes as dense subgraphs, ignoring the global topology information of PPI networks. To address these limitations, we propose a new method, named OPCMTS, to detect overlapping protein complexes by simultaneously considering the local and global topological information of the PPI network. First, a local similarity matrix is constructed by calculating the Jaccard coefficients between proteins in the original PPI network. Then, we adopt a hierarchical compression strategy to obtain multiple levels of gradually smaller networks from the original PPI network and apply a network embedding model to learn protein embeddings from the compressed networks at multiple levels. The protein embeddings from these networks are concatenated, and a dimensionality reduction strategy is adopted to remove the redundancy of the concatenated embeddings and generate the final embeddings. Further, a global similarity matrix is constructed by calculating the cosine similarity of the final embeddings. Finally, a core-attachment strategy is used to detect overlapping protein complexes based on the local and global similarity matrices. The experimental results show that the proposed OPCMTS method outperforms five other state-of-the-art methods in terms of F-measure on two yeast datasets.
Keywords: Protein complex · Protein-protein interaction network · Network embedding · Core-attachment · Dimensionality reduction
W. Wang and X. Meng: These authors contributed equally to this work.
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 215–226, 2021. https://doi.org/10.1007/978-3-030-91415-8_19
1 Introduction
Proteins are the essential elements of a cell. Most proteins tend to interact with other proteins to perform certain biological functions [1, 2]. These interacting proteins assemble into protein complexes and detecting protein complexes is especially important for
understanding the cellular organization and inferring protein functions [3, 4]. In the past decades, a variety of methods have been proposed to detect protein complexes [5]. The existing protein complex detection methods can be divided into two categories: experimental methods and computational methods. Experimental methods have high accuracy but are costly and time-consuming, which makes them impractical for the rapidly growing number of protein complexes. With the development of high-throughput technologies, huge amounts of protein-protein interactions (PPIs) have been produced, forming large PPI networks. These networks provide an effective means of detecting protein complexes by computational methods, which complement the experimental methods well. A PPI network can be regarded as an undirected graph, and protein complex detection can be treated as a graph clustering or community mining problem [6–8]. Most of the computational methods mine dense subgraphs as protein complexes by utilizing different types of topological information of PPI networks. For example, several methods have been proposed to detect protein complexes based on cliques and dense subgraphs, such as CMC [9] and SPICI [10]. Wu et al. [11] proposed a protein complex detection algorithm, named COACH, which is based on the core-attachment structures of real protein complexes. Wang et al. [12] developed a method to detect overlapping protein complexes based on core-attachment structure, named EWCA, which uses high-order structural similarity as the metric and performs better than COACH. Besides, a series of clustering algorithms have been proposed to detect protein complexes from PPI networks based on a seed-extension strategy. For example, Nepusz et al. [13] used the seed-extension concept to develop an algorithm named ClusterONE, which can detect overlapping protein complexes. In addition, some techniques from graph clustering and community mining have also been adapted to detect protein complexes.
For example, Nazar et al. [14] developed an algorithm named ProRank to detect protein complexes based on PageRank. It quantifies the importance of each protein through votes from other proteins in the PPI network. In recent years, network embedding methods have developed rapidly and achieved great success in complex network analysis [15–17]. Network embedding aims to learn a low-dimensional vector for each node in a network such that the embedding vectors retain the structural information of the nodes in the original network. In 2014, DeepWalk [18] was proposed to learn node features by combining random walks with the skip-gram model from word2vec, and it achieved outstanding performance in downstream tasks. Later, LINE [19] was proposed to learn node representation vectors by considering first-order and second-order neighbor information. In 2016, Grover et al. [20] developed node2vec, which provides different ways to sample neighborhood nodes. Based on the idea of network embedding, Meng et al. [21] proposed a protein complex detection method named DPCMNE. This algorithm applies a hierarchical compression strategy to the original PPI network and obtains a series of multi-level smaller PPI networks. Then, the DeepWalk model is applied to the compressed networks to generate protein embeddings, and a new weighted PPI network is reconstructed from the protein embedding vectors. This method can capture both local and global topological structure information. However, constructing the network becomes slower as the network scale increases, and the network topological information cannot be represented well by relying only on the information obtained through network embedding.
To address these limitations, in this study we propose a novel method, named OPCMTS, to detect overlapping protein complexes based on multi-level topological similarities. It extracts the local and global topological information of the original PPI network at high speed. Performance evaluations on two yeast datasets show that OPCMTS achieves better performance than five state-of-the-art protein complex detection algorithms.
2 Materials
In this study, we choose two different sources of yeast PPI datasets to detect protein complexes: BioGRID [22] and IntAct [23]. The first yeast dataset is downloaded from the BioGRID database (https://thebiogrid.org, version 3.5.188) and the second yeast dataset is downloaded from the IntAct database (https://www.ebi.ac.uk/intact/downloads, version 4.2.15). After removing duplicate PPIs and self-interacting PPIs, BioGRID contains 5784 proteins and 110597 PPIs while IntAct contains 5466 proteins and 77754 PPIs. We use CYC2008 [24] (http://wodaklab.org/cyc2008/downloads) as gold standard protein complexes. Since some methods can only detect protein complexes with three or more proteins, we filter out complexes with a size less than 3 in the gold standards. As a result, we obtain 236 gold standard protein complexes.
3 Methods
The framework of the proposed OPCMTS method is shown in Fig. 1. It mainly includes three stages: (1) Constructing a local similarity matrix from the PPI network based on neighbor similarity. (2) Constructing a global similarity matrix from the PPI network based on hierarchical compression and network embedding. (3) Detecting overlapping protein complexes based on the multi-level similarities and a core-attachment strategy.
Fig. 1. The framework of OPCMTS.
3.1 Constructing Local Similarity Matrix
Common neighbors can be used to evaluate the reliability of PPIs in a network. Therefore, we construct a local similarity matrix based on common neighbors. We adopt the Jaccard coefficient similarity (JCS) [25] to measure the similarity between a pair of proteins, as shown in formula (1):
\[ JCS(p, q) = \frac{|N(p) \cap N(q)|}{|N(p) \cup N(q)|} \tag{1} \]
where $N(p)$ represents the set of the neighbors of protein p, $|N(p) \cap N(q)|$ represents the number of common neighbors of protein p and protein q, and $|N(p) \cup N(q)|$ represents the number of proteins in the union of $N(p)$ and $N(q)$. The more common neighbors two interacting proteins share, the higher their similarity. Therefore, we calculate the JCS between every two interacting proteins and construct the local similarity matrix $W_{local}$ from the original PPI network.
3.2 Constructing Global Similarity Matrix
Similar to the DPCMNE method, we adopt a hierarchical compression strategy to compress the original PPI network into a series of multi-level networks and apply a network embedding model, such as DeepWalk, to learn protein embeddings from these multi-level networks. We construct the global similarity matrix based on the protein embeddings. The process of constructing the global similarity matrix is as follows.
Firstly, the Louvain method [26] is adopted to generate multi-level networks $G_0, G_1, \ldots, G_n$ by aggregating proteins in the original PPI network. Each protein p in the original PPI network corresponds to one partition in each level of the multi-level networks, $Par_0(p), Par_1(p), \ldots, Par_n(p)$, where in particular $Par_0(p) = p$. Then, the DeepWalk network embedding algorithm is adopted to learn embedding vectors of the nodes in each compressed network. As a result, we obtain a set of embedding vectors for each protein: $E_0(p), E_1(Par_1(p)), \ldots, E_n(Par_n(p))$, where $E_i(Par_i(p))$ represents the embedding vector of protein p in the i-th level network $G_i$. Next, the embedding vectors obtained from each level network are concatenated as the topological feature of each protein p. The concatenated vector $CE(p)$ is defined as follows:
\[ CE(p) = [E_0(p), E_1(Par_1(p)), \ldots, E_n(Par_n(p))] \tag{2} \]
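The concatenation of Eq. (2) and the cosine similarity used for the global similarity matrix in this section can be sketched as follows. This is a simplified illustration: the PCA step is omitted, the embeddings are hand-written toy vectors rather than DeepWalk output, and the protein names, the partition maps and the function names are all hypothetical.

```python
import math

def concatenate_levels(protein, partitions, embeddings):
    """Build CE(p) of Eq. (2) by joining the protein's embedding from every level."""
    vector = []
    for level, level_embeddings in enumerate(embeddings):
        node = partitions[level][protein]  # Par_i(p); at level 0 this is p itself
        vector.extend(level_embeddings[node])
    return vector

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy two-level hierarchy: proteins a and b are merged into g1, protein c into g2.
partitions = [
    {"a": "a", "b": "b", "c": "c"},
    {"a": "g1", "b": "g1", "c": "g2"},
]
embeddings = [
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]},
    {"g1": [1.0, 1.0], "g2": [-1.0, 1.0]},
]

ce = {p: concatenate_levels(p, partitions, embeddings) for p in ("a", "b", "c")}
sim_ab = cosine_similarity(ce["a"], ce["b"])
sim_ac = cosine_similarity(ce["a"], ce["c"])
```

Proteins a and b share a coarse-level node, so their concatenated vectors agree on the second half and `sim_ab` is higher than `sim_ac`; this is how the compressed levels inject global structure into the similarity.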
Then, principal component analysis (PCA) [27] is applied to the concatenated vectors $CE(p)$ to remove redundant information, which speeds up the subsequent calculation steps. The embedding vector after PCA dimensionality reduction is denoted as $E(p)$. Finally, we construct the global similarity matrix $W_{global}$ using the cosine similarity of the final embedding vectors $E(p)$.
3.3 Detecting Protein Complexes Based on Core-attachment
After obtaining the multi-level topological similarities, including the local similarity, the global similarity and the original interactions, we apply a core-attachment strategy [12] to detect overlapping protein complexes. This strategy first detects the core of a protein complex based on the original PPI network. Then, the local and global similarity matrices are jointly used to divide the neighboring proteins of the core into overlapping proteins and peripheral proteins. Finally, a protein complex is generated by combining the core, the overlapping proteins and the peripheral proteins.
3.3.1 Detecting the Core of Protein Complex
In this study, we use neighbor similarity to generate the core of a protein complex. The neighbor similarity (NS) [28] is defined as:
\[ NS(p, q) = \frac{|SN(p) \cap SN(q)|}{|SN(p)| \cdot |SN(q)|} \tag{3} \]
\[ SN(p) = N(p) \cup \{p\} \tag{4} \]
where SN(p) is the combination of neighbors of protein p and p itself. For each protein p, we try to use p as the initial protein to form the core, which means Core(p) = {p} initially. Then, we check the neighbors of protein p. For each neighboring protein q, it will be added to the Core(p) if NS(p, q) is greater than the threshold λ. Since different initial protein may form the same core, duplicate cores will be deleted. 3.3.2 Detecting the Overlapping Proteins For each protein complex core, we establish a set of candidate overlapping proteins (COPs) by filtering the neighbors of the core. The filter condition is defined as follows: COPs(p) = {q|N (q) ∩ Core(p) ≥ 2 }
(5)
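The core construction of Sect. 3.3.1 and the candidate filter of Eq. (5) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the adjacency-dictionary representation and the function names are assumptions, NS is coded exactly as Eq. (3) is printed here, and candidates are assumed to lie outside the core.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical representation: protein -> set of neighbours in an undirected PPI network.
Graph = Dict[str, Set[str]]


def closed_neighbourhood(graph: Graph, p: str) -> Set[str]:
    """SN(p) = N(p) united with {p}, as in Eq. (4)."""
    return graph[p] | {p}


def neighbour_similarity(graph: Graph, p: str, q: str) -> float:
    """NS(p, q), coded as Eq. (3) is printed in this text."""
    sn_p = closed_neighbourhood(graph, p)
    sn_q = closed_neighbourhood(graph, q)
    return len(sn_p & sn_q) / (len(sn_p) * len(sn_q))


def detect_cores(graph: Graph, lam: float) -> Set[FrozenSet[str]]:
    """Grow Core(p) from every protein p; the set of frozensets drops duplicate cores."""
    cores = set()
    for p in graph:
        core = {p} | {q for q in graph[p] if neighbour_similarity(graph, p, q) > lam}
        cores.add(frozenset(core))
    return cores


def candidate_overlapping(graph: Graph, core: FrozenSet[str]) -> Set[str]:
    """COPs of Eq. (5): proteins with at least two neighbours inside the core.

    Assumption: candidates are taken from outside the core.
    """
    return {q for q in graph if q not in core and len(graph[q] & core) >= 2}
```

Storing cores as frozensets in a set makes the duplicate-removal step implicit, since two initial proteins that grow the same core produce equal keys.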
Then, we formulate the following measures to select proteins as the overlapping proteins from the candidate overlapping proteins:

$$w^{in}_{local}(q, Core(p)) = \sum_{t \in Core(p)} W_{local}(q, t) \tag{6}$$

$$w^{in}_{global}(q, Core(p)) = \sum_{t \in Core(p)} W_{global}(q, t) \tag{7}$$

$$w^{out}_{local}(q, Core(p)) = \sum_{t \notin Core(p)} W_{local}(q, t) \tag{8}$$

$$w^{out}_{global}(q, Core(p)) = \sum_{t \notin Core(p)} W_{global}(q, t) \tag{9}$$

$$w^{avg}_{local}(Core(p)) = \frac{\sum_{x \in Core(p),\, y \in Core(p),\, x \neq y} W_{local}(x, y)}{|Core(p)|} \tag{10}$$

$$w^{avg}_{global}(Core(p)) = \frac{\sum_{x \in Core(p),\, y \in Core(p),\, x \neq y} W_{global}(x, y)}{|Core(p)|} \tag{11}$$
where $w^{in}_{local}(q, Core(p))$ and $w^{in}_{global}(q, Core(p))$ measure the strength of the connectivity between q and the inside of Core(p) based on the local similarity matrix W_local and the global similarity matrix W_global, respectively. $w^{out}_{local}(q, Core(p))$ and $w^{out}_{global}(q, Core(p))$ measure the strength of the connectivity between q and the proteins outside Core(p) based on W_local and W_global, respectively. $w^{avg}_{local}(Core(p))$ and $w^{avg}_{global}(Core(p))$ measure the strength of the internal interactions of Core(p) itself. Based on these measures, the filter conditions for overlapping proteins are defined as follows:

$$w^{out}_{local}(q, Core(p)) \geq w^{in}_{local}(q, Core(p)) \tag{12}$$

$$w^{out}_{global}(q, Core(p)) \geq w^{in}_{global}(q, Core(p)) \tag{13}$$

$$w^{in}_{local}(q, Core(p)) \geq w^{avg}_{local}(Core(p)) \tag{14}$$

$$w^{in}_{global}(q, Core(p)) \geq w^{avg}_{global}(Core(p)) \tag{15}$$
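As a rough illustration of Eqs. (6)-(15), the measures and the overlapping-protein test can be written with NumPy as below. The integer protein indices, the dense similarity matrices and the function names are assumptions made for this sketch. The text does not say how conditions (12)-(15) are combined, so requiring all four to hold is also an assumption.

```python
import numpy as np


def w_in(W: np.ndarray, q: int, core: list) -> float:
    """Sum of similarities between q and the core members, Eqs. (6)-(7)."""
    return float(W[q, core].sum())


def w_out(W: np.ndarray, q: int, core: list) -> float:
    """Sum of similarities between q and every protein outside the core, Eqs. (8)-(9)."""
    outside = np.ones(W.shape[0], dtype=bool)
    outside[core] = False
    return float(W[q, outside].sum())


def w_avg(W: np.ndarray, core: list) -> float:
    """Sum over ordered pairs x != y inside the core, divided by |Core|, Eqs. (10)-(11)."""
    sub = W[np.ix_(core, core)]
    return float((sub.sum() - np.trace(sub)) / len(core))


def is_overlapping(W_local: np.ndarray, W_global: np.ndarray, q: int, core: list) -> bool:
    """Assumed reading: conditions (12)-(15) must all hold."""
    return all(
        w_out(W, q, core) >= w_in(W, q, core) >= w_avg(W, core)
        for W in (W_local, W_global)
    )
```

Under this reading, an overlapping protein is one that is at least as strongly tied to the rest of the network as to the core, while still being tied to the core at least as strongly as the core's own average.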
After filtering the proteins in the COPs, the overlapping proteins OP(p) of Core(p) are obtained.

3.3.3 Detecting the Peripheral Proteins

For each protein complex core, we establish a set of candidate peripheral proteins (CPPs) by filtering the neighbors of the core. The filter condition is defined as follows:

$$CPPs(p) = \{q \mid |N(q) \cap Core(p)| \geq 2 \text{ and } (w^{in}_{local} > w^{out}_{local} \text{ or } w^{in}_{global} > w^{out}_{global})\} \tag{16}$$

Then, we propose two new metrics to measure the average connectivity between CPPs(p) and Core(p):

$$w^{avg}_{local}(CPPs(p), Core(p)) = \frac{\sum_{q \in CPPs(p)} w^{in}_{local}(q, Core(p))}{|CPPs(p)|} \tag{17}$$

$$w^{avg}_{global}(CPPs(p), Core(p)) = \frac{\sum_{q \in CPPs(p)} w^{in}_{global}(q, Core(p))}{|CPPs(p)|} \tag{18}$$

$$w^{in}_{local}(q, Core(p)) > w^{avg}_{local}(CPPs(p), Core(p)) \tag{19}$$

$$w^{in}_{global}(q, Core(p)) > w^{avg}_{global}(CPPs(p), Core(p)) \tag{20}$$
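The peripheral-protein filter of Eqs. (16)-(20) can be sketched in the same style. Again this is an assumed reading rather than the authors' code: proteins are integer indices, `adj` is a 0/1 adjacency matrix of the PPI network, candidates are taken from outside the core, and conditions (19) and (20) are both required to hold.

```python
import numpy as np


def w_in(W: np.ndarray, q: int, core: list) -> float:
    """Similarity mass between q and the core."""
    return float(W[q, core].sum())


def w_out(W: np.ndarray, q: int, core: list) -> float:
    """Similarity mass between q and everything outside the core."""
    return float(W[q].sum() - W[q, core].sum())


def peripheral_proteins(adj: np.ndarray, W_local: np.ndarray,
                        W_global: np.ndarray, core: list) -> list:
    """Return PP for one core following Eqs. (16)-(20)."""
    # Eq. (16): at least two core neighbours and a stronger pull towards the core
    # in the local OR the global similarity matrix.
    cpps = [
        q for q in range(adj.shape[0])
        if q not in core
        and adj[q, core].sum() >= 2
        and (w_in(W_local, q, core) > w_out(W_local, q, core)
             or w_in(W_global, q, core) > w_out(W_global, q, core))
    ]
    if not cpps:
        return []
    # Eqs. (17)-(18): average in-connectivity of the candidates.
    avg_local = np.mean([w_in(W_local, q, core) for q in cpps])
    avg_global = np.mean([w_in(W_global, q, core) for q in cpps])
    # Eqs. (19)-(20): keep candidates strictly above both averages (assumed AND).
    return [q for q in cpps
            if w_in(W_local, q, core) > avg_local and w_in(W_global, q, core) > avg_global]
```

One consequence of the strict inequalities is that a core with a single candidate never gains a peripheral protein, because a value cannot exceed its own mean.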
After filtering the proteins in the CPPs, the peripheral proteins PP(p) of Core(p) are generated.

3.3.4 Detecting Protein Complexes

For each initial protein p, the core of the protein complex, the corresponding overlapping proteins and the peripheral proteins are combined to form the final protein complex. Since different cores may eventually form the same complex, redundant complexes are finally removed.

$$Attachment(p) = \{OP(p) \cup PP(p)\} \tag{21}$$

$$ProteinComplex(p) = \{Core(p) \cup Attachment(p)\} \tag{22}$$
4 Results

4.1 Evaluation Metrics

To evaluate the effectiveness of the proposed OPCMTS method, we match the predicted protein complexes with the gold standard protein complexes, and we adopt a commonly used evaluation metric, the overlapping score, to judge the match between a predicted complex and a gold standard complex. For a predicted protein complex pc and a gold standard protein complex gc, the overlapping score is defined as follows:

$$OS(pc, gc) = \frac{|S_{pc} \cap S_{gc}|^2}{|S_{pc}| * |S_{gc}|} \tag{23}$$

where S_pc represents the set of proteins in pc and S_gc represents the set of proteins in gc. The predicted protein complex pc is judged to match the gold standard protein complex gc when the overlapping score between pc and gc is greater than or equal to a threshold θ. Following previous research [29, 30], the threshold θ is set to 0.2. N_pc represents the number of predicted protein complexes that match at least one gold standard complex, while N_gc represents the number of gold standard complexes that match at least one predicted protein complex.

$$N_{pc} = |\{pc \mid pc \in PCs, \exists gc \in GCs, OS(pc, gc) \geq \theta\}| \tag{24}$$

$$N_{gc} = |\{gc \mid gc \in GCs, \exists pc \in PCs, OS(pc, gc) \geq \theta\}| \tag{25}$$

where PCs represents the set of predicted protein complexes, and GCs represents the set of gold standard protein complexes. Precision is the ratio of the number of predicted protein complexes that match a gold standard complex to the total number of predicted protein complexes, and recall is the ratio of the number of gold standard protein complexes that match a predicted complex to the total number of gold standard protein complexes. F-measure is the harmonic mean of precision and recall, which reflects the overall performance of the algorithm.

$$Precision = \frac{N_{pc}}{|PCs|} \tag{26}$$

$$Recall = \frac{N_{gc}}{|GCs|} \tag{27}$$

$$F\text{-}measure = \frac{2 * Precision * Recall}{Precision + Recall} \tag{28}$$
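The evaluation procedure of Eqs. (23)-(28) is compact enough to be written out directly. The sketch below assumes complexes are given as Python sets of protein identifiers; the function names are illustrative and not taken from the paper.

```python
from typing import List, Set, Tuple


def overlapping_score(pc: Set[str], gc: Set[str]) -> float:
    """OS(pc, gc) = |pc & gc|^2 / (|pc| * |gc|), Eq. (23)."""
    shared = len(pc & gc)
    return shared * shared / (len(pc) * len(gc))


def evaluate(predicted: List[Set[str]], gold: List[Set[str]],
             theta: float = 0.2) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) following Eqs. (24)-(28)."""
    # N_pc: predicted complexes matching at least one gold standard complex.
    n_pc = sum(any(overlapping_score(pc, gc) >= theta for gc in gold) for pc in predicted)
    # N_gc: gold standard complexes matched by at least one predicted complex.
    n_gc = sum(any(overlapping_score(pc, gc) >= theta for pc in predicted) for gc in gold)
    precision = n_pc / len(predicted)
    recall = n_gc / len(gold)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Because N_pc and N_gc are counted separately, one predicted complex can match several gold standard complexes (and vice versa), so precision and recall are not tied to a one-to-one assignment.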
4.2 Comparison with Other Methods

To evaluate the effectiveness of the proposed method, we compared OPCMTS with five other state-of-the-art methods: SPICi [10], COACH [11], EWCA [12], ClusterONE [13] and DPCMNE [21], all run with their default parameters. For OPCMTS, we set the embedding dimension d of each single-layer network to 256, the dimension d_pca after PCA to 256, λ to 0.4, and the number of layers n to 3. Table 1 and Table 2 show the performance of all methods on two yeast datasets, BioGRID and IntAct. From Table 1, we can see that OPCMTS obtains the best precision and F-measure. DPCMNE has the highest recall and OPCMTS has the second-best recall. Compared with DPCMNE, EWCA, ClusterONE, SPICi and COACH, OPCMTS improves the F-measure by 22.61%, 9.47%, 38.86%, 79.16% and 259.30%, respectively. From Table 2, we can also see that OPCMTS achieves the highest precision and F-measure. Compared with DPCMNE, EWCA, ClusterONE, SPICi and COACH, OPCMTS improves the F-measure by 18.05%, 5.57%, 42.45%, 95.29% and 181.42%, respectively. In summary, the results indicate that OPCMTS achieves better performance in detecting protein complexes by simultaneously considering the local and global topological information of the PPI network.

Table 1. Performance comparison on BioGRID dataset

| Benchmark | Methods | Npc | Ngc | Recall | Precision | F-measure |
|---|---|---|---|---|---|---|
| CYC2008 | OPCMTS | 681 | 176 | 0.7458 | 0.4924 | 0.5932 |
| CYC2008 | DPCMNE | 266 | 187 | 0.7924 | 0.3482 | 0.4838 |
| CYC2008 | EWCA | 642 | 165 | 0.6992 | 0.4425 | 0.5419 |
| CYC2008 | ClusterONE | 165 | 153 | 0.6483 | 0.3185 | 0.4272 |
| CYC2008 | SPICi | 106 | 119 | 0.5042 | 0.2465 | 0.3311 |
| CYC2008 | COACH | 142 | 87 | 0.3686 | 0.1064 | 0.1651 |

Best and second-best results are bold and italic, respectively.
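The relative F-measure improvements quoted for BioGRID follow from the Table 1 values as (F_OPCMTS − F_baseline) / F_baseline. The short check below reproduces that arithmetic; the dictionary and function names are only illustrative.

```python
# F-measure values copied from Table 1 (BioGRID, CYC2008 benchmark).
f_measure = {
    "OPCMTS": 0.5932,
    "DPCMNE": 0.4838,
    "EWCA": 0.5419,
    "ClusterONE": 0.4272,
    "SPICi": 0.3311,
    "COACH": 0.1651,
}


def relative_improvement(ours: float, baseline: float) -> float:
    """Percentage by which `ours` exceeds `baseline`."""
    return (ours - baseline) / baseline * 100.0


improvements = {
    method: relative_improvement(f_measure["OPCMTS"], value)
    for method, value in f_measure.items()
    if method != "OPCMTS"
}
```

Rounded to two decimals, the computed values agree with the percentages reported in the text for DPCMNE, EWCA, ClusterONE, SPICi and COACH.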
Table 3 shows five examples of protein complexes predicted by OPCMTS that perfectly match the gold standard protein complexes. All of them have an OS score of 1. Specifically, the predicted protein complex (YIR015W, YBL018C, YAL033W, YBR257W, YNL282W, YGR030C, YHR062C, YBR167C, YNL221C) exactly matches the intronic snoRNA processing complex in the CYC2008 gold standard complexes. Table 4 shows five examples of protein complexes predicted by OPCMTS that do not match the gold standard protein complexes. Their OS scores are all less than 0.2, but their p-values are exceedingly small, which indicates that they have important biological significance. For example, the predicted protein complex (YLR115W, YJR119C, YKL059C, YDR195W, YKR002W, YMR061W, YJR093C, YNL317W, YGR156W, YGL044C, YDR228C) has an OS score of only 0.164, yet its p-value in terms of biological process (BP) is 1.55e-20, indicating significant enrichment in mRNA polyadenylation.
Table 2. Performance comparison on IntAct dataset

| Benchmark | Methods | Npc | Ngc | Recall | Precision | F-measure |
|---|---|---|---|---|---|---|
| CYC2008 | OPCMTS | 644 | 170 | 0.7203 | 0.5091 | 0.5966 |
| CYC2008 | DPCMNE | 268 | 183 | 0.7754 | 0.3748 | 0.5054 |
| CYC2008 | EWCA | 668 | 168 | 0.7119 | 0.4684 | 0.5651 |
| CYC2008 | ClusterONE | 166 | 162 | 0.6864 | 0.3013 | 0.4188 |
| CYC2008 | SPICi | 98 | 117 | 0.4958 | 0.2207 | 0.3055 |
| CYC2008 | COACH | 170 | 111 | 0.4703 | 0.1369 | 0.212 |

Best and second-best results are bold and italic, respectively.
Table 3. Five examples of the predicted protein complexes with high OS values

| Predicted complex | Cluster frequency | p-value (BP) | GO term | OS |
|---|---|---|---|---|
| YIR015W, YBL018C, YAL033W, YBR257W, YNL282W, YGR030C, YHR062C, YBR167C, YNL221C | 9 out of 9 genes, 100% | 1.21e-26 | Intronic snoRNA processing | 1 |
| YGL005C, YGL223C, YER157W, YPR105C, YGR120C, YNL041C, YML071C, YNL051W | 8 out of 8 genes, 100% | 7.92e-18 | Intra-Golgi vesicle-mediated transport | 1 |
| YBR126C, YDR074W, YML100W, YMR261C | 4 out of 4 genes, 100% | 2.29e-12 | Trehalose metabolism in response to stress | 1 |
| YDR331W, YLR088W, YHR188C, YLR459W, YDR434W | 5 out of 5 genes, 100% | 5.93e-16 | Attachment of GPI anchor to protein | 1 |
| YHR119W, YAR003W, YPL138C, YLR015W, YKL018W, YBR175W, YBR258C, YDR469W | 8 out of 8 genes, 100% | 1.81e-21 | Histone H3-K4 methylation | 1 |
Table 4. Five examples of the predicted protein complexes with low OS values

| Predicted complex | p-value (BP) | GO term | OS |
|---|---|---|---|
| YNL030W, YBL003C, YBL002W, YBR010W, YDR225W, YNL031C, YDR224C, YPL082C, YOL012C, YGR270W, YML069W, YGL207W, YBR009C | 1.74e-15 | Chromatin assembly or disassembly | 0.103 |
| YGR048W, YDL190C, YDR049W, YML013W, YOL013C, YDL126C, YKL020C, YBR201W, YDR411C, YLR207W, YBR170C | 4.93e-19 | Ubiquitin-dependent ERAD pathway | 0.164 |
| YBR255C-A, Q0250, YNR018W, Q0045, YBL045C, YGL187C, YEL024W, YPR191W, YNL052W, YOR065W, Q0275, YJL166W, YML030W, YGL191W | 7.97e-20 | Electron transport chain | 0.178 |
| YLR115W, YJR119C, YKL059C, YDR195W, YKR002W, YMR061W, YJR093C, YNL317W, YGR156W, YGL044C, YDR228C | 1.55e-20 | mRNA polyadenylation | 0.164 |
| YNR017W, YIL022W, YJL143W, YPL098C, YPL063W, YJL104W, YGR033C, YNL131W, YML054C, YLR008C | 7.12e-21 | Protein import into mitochondrial matrix | 0.15 |
5 Conclusions

Protein complex detection helps to understand biological organization and infer protein functions. Although various computational methods have been proposed to detect protein complexes, most of them consider only local structural information. In this study, we utilize both the local and the global topology information of the PPI network to detect overlapping protein complexes. The local topology information obtained through neighbor similarity is used to construct a local similarity matrix. The global information obtained through multi-level embedding is used to construct a global similarity matrix. Before constructing the global similarity matrix, we apply PCA dimensionality reduction to remove redundant embedding information and speed up the computation. Finally, the core-attachment strategy is applied to identify overlapping protein complexes based on the local and global similarity matrices. The experimental results show that the proposed OPCMTS method achieves the best F-measure compared with five other state-of-the-art methods. This demonstrates the effectiveness of simultaneously considering the local and global topological information of the PPI network in the detection of protein complexes.
Acknowledgments. This work was supported in part by the National Natural Science Foundation of China under Grant No.61832019, the Fundamental Research Funds for the Central Universities, CSU under Grant No.2282019SYLB004, Hunan Provincial Science and Technology Program 2019CB1007.
References

1. Berggård, T., Linse, S., James, P.: Methods for the detection and analysis of protein–protein interactions. Proteomics 7(16), 2833–2842 (2007)
2. Meng, X., Li, W., Peng, X., Li, Y., Li, M.: Protein interaction networks: centrality, modularity, dynamics, and applications. Front. Comp. Sci. 15(6), 1–17 (2020). https://doi.org/10.1007/s11704-020-8179-0
3. Omranian, S., Angeleska, A., Nikoloski, Z.: PC2P: parameter-free network-based prediction of protein complexes. Bioinformatics 37(1), 73–81 (2021)
4. Pan, X., Hu, L., Hu, P., You, Z.H.: Identifying protein complexes from protein-protein interaction networks based on fuzzy clustering and GO semantic information. IEEE/ACM Trans. Comput. Biol. Bioinform. (2021). https://doi.org/10.1109/TCBB.2021.3095947
5. Wu, Z., Liao, Q., Liu, B.: A comprehensive review and evaluation of computational methods for identifying protein complexes from protein–protein interaction networks. Brief Bioinform. 21(5), 1531–1548 (2020)
6. Xiang, J., Zhang, Y., Li, J.M., Li, H.J., Li, M.: Identifying multi-scale communities in networks by asymptotic surprise. J. Stat. Mech. 2019(3), 033403 (2019)
7. Li, M., Li, D., Tang, Y., Wu, F., Wang, J.: CytoCluster: a cytoscape plugin for cluster analysis and visualization of biological networks. Int. J. Mol. Sci. 18(9), 1880 (2017)
8. Wang, J., Liang, J., Zheng, W., Zhao, X., Mu, J.: Protein complex detection algorithm based on multiple topological characteristics in PPI networks. Inf. Sci. 489, 78–92 (2019)
9. Liu, G., Wong, L., Chua, H.N.: Complex discovery from weighted PPI networks. Bioinformatics 25(15), 1891–1897 (2009)
10. Jiang, P., Singh, M.: SPICi: a fast clustering algorithm for large biological networks. Bioinformatics 26(8), 1105–1111 (2010)
11. Wu, M., Li, X., Kwoh, C.K., Ng, S.K.: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinform. 10, 169 (2009)
12. Wang, R., Liu, G., Wang, C.: Identifying protein complexes based on an edge weight algorithm and core-attachment structure. BMC Bioinform. 20, 471 (2019)
13. Nepusz, T., Yu, H., Paccanaro, A.: Detecting overlapping protein complexes in protein-protein interaction networks. Nat. Methods 9(5), 471–472 (2012)
14. Zaki, N., Berengueres, J., Efimov, D.: ProRank: a method for detecting protein complexes. In: 14th Annual Conference on Genetic and Evolutionary Computation, pp. 209–216 (2012)
15. Meng, X., Peng, X., Wu, F.X., Li, M.: Detecting protein complex based on hierarchical compressing network embedding. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 215–218. IEEE (2019)
16. Xiang, J., Zhang, N.R., Zhang, J.S., Lv, X.Y., Li, M.: PrGeFNE: predicting disease-related genes by fast network embedding. Methods 192, 3–12 (2021)
17. Xu, B., et al.: Protein complexes identification based on GO attributed network embedding. BMC Bioinform. 19, 535 (2018)
18. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014)
19. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: Large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015)
20. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016)
21. Meng, X., Xiang, J., Zheng, R., Wu, F., Li, M.: DPCMNE: detecting protein complexes from protein-protein interaction networks via multi-level network embedding. IEEE/ACM Trans. Comput. Biol. Bioinform. (2021). https://doi.org/10.1109/TCBB.2021.3050102
22. Oughtred, R., et al.: The BioGRID interaction database: 2019 update. Nucleic Acids Res. 47(D1), D529–D541 (2019)
23. Orchard, S., et al.: The MIntAct project - IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42(D1), D358–D363 (2014)
24. Pu, S., Wong, J., Turner, B., Cho, E., Wodak, S.J.: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 37(3), 825–831 (2009)
25. Jaccard, P.: The distribution of the flora in the alpine zone. New Phytol. 11(2), 37–50 (1912)
26. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. 2008(10), P10008 (2008)
27. Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemometr. Intell. Lab Syst. 2, 37–52 (1987)
28. Mete, M., Tang, F., Xu, X., Yuruk, N.: A structural approach for finding functional modules from large biological networks. BMC Bioinform. 9(Suppl 9), S19 (2008)
29. Lei, X., Fang, M., Fujita, H.: Moth–flame optimization-based algorithm with synthetic dynamic PPI networks for discovering protein complexes. Knowl. Based Syst. 172, 76–85 (2019)
30. Li, M., et al.: Identification of protein complexes by using a spatial and temporal active protein interaction network. IEEE/ACM Trans. Comput. Biol. Bioinform. 17(3), 817–827 (2020)
LPI-FKLGCN: Predicting LncRNA-Protein Interactions Through Fast Kernel Learning and Graph Convolutional Network

Wen Li, Shulin Wang, and Hu Guo

College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, Hunan, China
Abstract. Predicting lncRNA-protein interactions (LPIs) through computational models can not only help to identify the functions of lncRNAs, but also reduce the large cost in materials and time of experimental identification. In this study, we develop a novel computational model, LPI-FKLGCN, which combines fast kernel learning (FKL) and a multi-layer graph convolution network (GCN) to identify potential lncRNA-protein interactions. The LPI-FKLGCN model fuses multi-source features and similarities by the FKL technique and encodes the embedding representation vectors by the multi-layer graph convolution network. Under 5-fold cross-validation, LPI-FKLGCN obtains an AUPR value of 0.52 and an AUC value of 0.96, which is superior to other methods. In case studies, most of the predicted LPIs are confirmed by newly published biological experiments. These results show that the fusion of multi-source similarities and features, combined with multi-layer embedding vectors from the graph convolution network, can improve the accuracy of LPI prediction, and that LPI-FKLGCN is an efficient and accurate tool for LPI prediction.

Keywords: Graph convolution network · Fast kernel learning · LncRNA-protein interactions

1 Introduction
Due to the rapid development of high-throughput sequencing technology, tens of thousands of human long non-coding RNAs (lncRNAs) have been identified [6]. It has been found that only about 1% of the RNA in the human transcriptome encodes proteins; most of the remainder are lncRNAs, transcripts longer than about 200 nucleotides that are not involved in coding proteins and have been regarded as transcriptional noise [8]. LncRNAs are RNA molecules that act as regulators in the human body and are closely related to cell differentiation, apoptosis, and cancerization. By interacting with related RNA-binding proteins, lncRNAs participate in the regulation of a variety of biological processes and realize their complex and diverse functions. Therefore, identifying the interactions between lncRNAs and proteins is important for further exploring the cellular mechanisms and molecular functions of lncRNAs and for understanding various disease-related biological processes.

Supported by the National Nature Science Foundation of China (Grant Nos. 61702054 and 62072169), the National Key Research and Development Program (Grant No. 2017YFC1311003), and the Training Program for Excellent Young Innovators of Changsha (Grant No. kq2009093).
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 227–238, 2021. https://doi.org/10.1007/978-3-030-91415-8_20

Traditional high-throughput biological methods are costly, whereas computational methods have proved effective: they are not easily influenced by the expression time, tissue specificity and expression level of non-coding RNA, and they can greatly reduce time and cost. Most LPI prediction methods can roughly be classified into two classes. The first class comprises computational models that use only the information contained in the sequences of lncRNAs and proteins to predict whether an RNA and a protein interact. These models often make predictions with statistical or machine learning-based methods. For example, RPISeq encodes the sequences of RNAs and proteins separately and predicts potential LPIs with support vector machine (SVM) and random forest (RF) classifiers [16]. The second class of LPI prediction methods combines the known LPI information to extract features and calculate similarities. It is hypothesized that if a lncRNA (protein) is similar to one side of the interacting lncRNA-protein
Fig. 1. The overall framework of the LPI-FKLGCN model. Green and blue nodes represent lncRNAs and proteins, respectively. The red nodes indicate the aggregation and update of information for the current node in the network. First, different features and similarities from different data sources are fused by fast kernel learning. Next, a set of lncRNA and protein embedding representation vectors is generated through the nonlinear multi-layer graph convolutional network. Finally, the interaction probability score matrix is obtained by a decoder. (Color figure online)
pair, it may also interact with the other side of the interacting lncRNA-protein pair. The similarity information is therefore very important [4]. For example, Zhang et al. developed a method based on Sequence Feature Projection Ensemble Learning (SFPEL-LPI), which improves the prediction performance by combining multiple similarities and multiple sequence-derived features into an ensemble learning framework [25]. To address the lack of negative samples, Zhao et al. integrated network-based random walk with logistic matrix factorization based on semi-supervised machine learning with regularization [26]. Since only a limited number of LPIs are currently known, some methods have emerged that can make predictions without direct LPIs. In recent years, machine learning and deep learning have been combined to improve prediction performance. For example, Fan et al. developed LPI-BLS, which integrates a deep learning-based broad learning system with ensemble logistic regression classifiers to predict LPIs [9].

In this study, we propose a novel prediction model that fuses multi-source biological data by the fast kernel learning (FKL) method to obtain a lncRNA comprehensive similarity and a protein comprehensive similarity, and extracts two groups of embedding representation vectors with a graph convolutional network (GCN). Finally, the LPI probability score matrix is obtained through a decoder. The workflow of LPI-FKLGCN is illustrated in Fig. 1.
2 Materials and Methods

In this section, we first build a heterogeneous network including the lncRNA similarity network, the protein similarity network and the known lncRNA-protein interaction network. Second, we fuse multiple base kernels into a lncRNA comprehensive similarity kernel and a protein comprehensive similarity kernel, respectively. Third, we deploy the multi-layer graph convolution network on the constructed heterogeneous network to generate the embedding representation vectors for lncRNAs and proteins. Combining the known LPIs with the multi-layer embedding vectors can improve the prediction accuracy because proximities of different orders are taken into account. Finally, the lncRNA-protein interaction probability matrix is obtained by decoding.

2.1 Generate Multiple Base Kernels for LncRNAs and Proteins
Sequence Feature Kernels for LncRNAs and Proteins. We extracted the sequence features for lncRNAs and proteins in the same way as in the previous literature [18]. The lncRNA sequences are expressed by Conjoint Triad, and the protein sequences are expressed by Pseudo Position-Specific Score Matrix. The Sequence Feature kernels $K_l^{SF}$ and $K_p^{SF}$ can be extracted by the Radial Basis Function (RBF) kernel.
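For readers unfamiliar with the RBF kernel, a minimal NumPy sketch is given below. It assumes each lncRNA or protein is described by one row of a feature matrix; the bandwidth `gamma` is a free parameter here, since the text does not state the value used.

```python
import numpy as np


def rbf_kernel(X: np.ndarray, gamma: float) -> np.ndarray:
    """K[i, j] = exp(-gamma * ||X[i] - X[j]||^2) for row-wise feature vectors."""
    sq_norms = (X ** 2).sum(axis=1)
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives caused by round-off
    return np.exp(-gamma * sq_dists)
```

The resulting matrix is symmetric with ones on the diagonal, so it can be used directly as a similarity kernel.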
Sequence Similarity Kernels for LncRNAs and Proteins. The lncRNA Sequence Similarity (SS) kernel $K_l^{SS}$ can be calculated from the normalized Smith-Waterman (SW) score as:

$$K_l^{SS}(l_i, l_j) = \frac{SW(S_{l_i}, S_{l_j})}{\sqrt{SW(S_{l_i}, S_{l_i})\, SW(S_{l_j}, S_{l_j})}}$$

where $S_{l_i}$ denotes the sequence of lncRNA $l_i$ and SW(·) represents the Smith-Waterman score. Similarly, by substituting the protein sequences $S_{p_i}$ into the equation above, we obtain the protein Sequence Similarity kernel $K_p^{SS}$.

Gaussian Interaction Profile Kernels for LncRNAs and Proteins. The Gaussian Interaction Profile (GIP) kernel, calculated from the interaction profiles, measures network topology similarity [5,13]. The lncRNA GIP kernel $K_l^{GIP}$ can be calculated by:

$$K_l^{GIP}(l_i, l_j) = \exp(-\gamma_l \|IP(l_i) - IP(l_j)\|^2)$$

where the vector $IP(l_i)$ represents the interaction profile of lncRNA $l_i$, which is the i-th row vector of the interaction matrix B. It indicates which proteins interact with lncRNA $l_i$. Similarly, the vector $IP(p_i)$ is the i-th column vector of the matrix B. The GIP kernel for proteins $K_p^{GIP}$ can be calculated by substituting $IP(p_i)$ into the equation above. $\gamma_l$ and $\gamma_p$ are bandwidth parameters, which can be calculated as:

$$\gamma_l = \gamma'_l \Big/ \left(\frac{1}{m} \sum_{i=1}^{m} \|IP(l_i)\|^2\right)$$

where $\gamma'_l$ is set to 1.
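A small NumPy sketch of the GIP kernel and its bandwidth normalisation follows. The toy interaction matrix and the function name are illustrative assumptions; the sketch also assumes at least one known interaction, since otherwise the mean squared norm is zero and the bandwidth is undefined.

```python
import numpy as np


def gip_kernel(profiles: np.ndarray, gamma_prime: float = 1.0) -> np.ndarray:
    """Gaussian interaction profile kernel over the rows of `profiles`."""
    sq_norms = (profiles ** 2).sum(axis=1)
    # Bandwidth: gamma' divided by the mean squared norm of the profiles.
    gamma = gamma_prime / sq_norms.mean()
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


# Toy interaction matrix B (3 lncRNAs x 3 proteins), for illustration only.
B = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
K_l_gip = gip_kernel(B)      # lncRNA profiles are the rows of B
K_p_gip = gip_kernel(B.T)    # protein profiles are the columns of B
```

Passing B gives the lncRNA kernel and passing its transpose gives the protein kernel, which mirrors the row/column substitution described in the text.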
Expression Kernel for LncRNAs. The lncRNA expression profiles are extracted from the NONCODE database; each profile is a 24-dimensional vector corresponding to 24 cell types. In an earlier study by Chen et al., lncRNA similarity was calculated from the expression profiles of lncRNAs [3]. Here, the lncRNA Expression kernel $K_l^{Exp}$ is likewise extracted with the Radial Basis Function.

GO Kernel for Proteins. Gene ontology (GO) describes biomolecules or gene products in terms of biological processes, molecular functions and cellular components. GO has become a commonly used protein feature in interaction prediction models. We downloaded GO terms from the GOA database [21] to measure the GO similarity between two proteins. The Jaccard similarity, namely the overlap ratio of the GO terms related to two proteins, is used to compute the GO kernel $K_p^{GO}$ as:

$$K_p^{GO}(p_i, p_j) = \frac{|GO_{p_i} \cap GO_{p_j}|}{|GO_{p_i} \cup GO_{p_j}|}$$

where $GO_{p_i}$ is the set of GO terms related to protein $p_i$, ∩ is the intersection of two sets, and ∪ is the union of two sets.

In summary, four base kernels are generated for lncRNAs ($K_l^{SF}$, $K_l^{SS}$, $K_l^{GIP}$, $K_l^{Exp}$) and four base kernels are generated for proteins ($K_p^{SF}$, $K_p^{SS}$, $K_p^{GIP}$, $K_p^{GO}$).

2.2 Kernel Fusion by Fast Kernel Learning
A variety of similarities and features can depict biological entities from different perspectives and thus support better interaction prediction. In this paper, a linear combination strategy is used to quickly fuse multiple similarities and features [18]. In the ideal state, the fused similarity should satisfy $K^{ideal} = B \times B^T$. Once appropriate weights w are determined such that $\min \|K - K^{ideal}\|_F^2$ is attained, the goal of multi-kernel fusion is achieved. To avoid over-fitting, a regularization term $\lambda \|w\|^2$ is added to the fusion process, and the objective function becomes Eq. (1):

$$\min_{w, K} \|K - K^{ideal}\|_F^2 + \lambda \|w\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{4} w_i = 1 \tag{1}$$
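To make the optimisation in Eq. (1) concrete, the sketch below solves it in closed form with NumPy. This swaps in a different technique from the one the paper uses: it assumes the fused kernel is the linear combination K = Σ_i w_i K_i (as the "linear combination strategy" suggests) and enforces only the sum-to-one constraint printed in Eq. (1), which turns the problem into an equality-constrained least-squares problem solvable through its KKT system. The function name is hypothetical.

```python
import numpy as np


def fuse_kernels(kernels: list, K_ideal: np.ndarray, lam: float):
    """Minimise ||sum_i w_i K_i - K_ideal||_F^2 + lam * ||w||^2  s.t.  sum_i w_i = 1."""
    flat = np.stack([K.ravel() for K in kernels])   # one flattened kernel per row
    A = flat @ flat.T                               # Frobenius inner products <K_i, K_j>
    b = flat @ K_ideal.ravel()                      # inner products <K_i, K_ideal>
    r = len(kernels)
    # KKT system: 2 (A + lam I) w + nu * 1 = 2 b   and   1^T w = 1
    kkt = np.zeros((r + 1, r + 1))
    kkt[:r, :r] = 2.0 * (A + lam * np.eye(r))
    kkt[:r, r] = 1.0
    kkt[r, :r] = 1.0
    rhs = np.append(2.0 * b, 1.0)
    w = np.linalg.solve(kkt, rhs)[:r]
    K_fused = np.tensordot(w, np.stack(kernels), axes=1)
    return w, K_fused
```

With a very large regularisation weight the solution tends to uniform weights, while a small weight lets the kernels closest to the ideal kernel dominate. Note that this sketch does not force the weights to be non-negative.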
Here, $\|\cdot\|_F$ denotes the Frobenius norm, K stands for $K_l$ or $K_p$, and w stands for $w_l$ or $w_p$. The trade-off parameter λ is initialized to 2000 in this study. We use the Matlab toolbox CVX to optimize the combination weights w and obtain the comprehensive similarity K for lncRNAs and proteins, respectively.

2.3 Construction of a Heterogeneous LncRNA-Protein Network
To make more accurate predictions, the LPI-FKLGCN model makes full use of multi-source information, such as known LPIs, lncRNA and protein sequences, lncRNA expression profiles and protein GO terms, to generate multiple similarity kernels and feature kernels. The base kernels are combined into a comprehensive similarity matrix by fast kernel fusion. The heterogeneous lncRNA-protein network is constructed from an LPI network, a lncRNA comprehensive similarity network and a protein comprehensive similarity network. Let the matrix $B \in \mathbb{R}^{m \times n}$ be the adjacency matrix of the binary lncRNA-protein interaction network. If lncRNA $l_i$ interacts with protein $p_j$, then B(i, j) = 1 (1 ≤ i ≤ m, 1 ≤ j ≤ n); otherwise B(i, j) = 0. Here m and n are the numbers of lncRNAs and proteins, respectively. The adjacency matrices of the comprehensive similarity networks are denoted as $K_l$ and $K_p$. Finally, the adjacency matrix of the heterogeneous network $M \in \mathbb{R}^{(m+n) \times (m+n)}$ is defined as Eq. (2):

$$M = \begin{bmatrix} K_l & B \\ B^T & K_p \end{bmatrix} \tag{2}$$

2.4 Encoding by Multi-Layer Graph Convolution Network
The graph convolutional network (GCN), which extends convolutional neural networks to graph-structured data, has been applied in biological information processing and has achieved excellent performance. For example, GCNs have been used in drug repositioning [2], microbe-drug association prediction [15], computational drug discovery [20] and drug-disease association prediction [23]. We construct an encoder with a GCN and run it on the heterogeneous network M to learn low-dimensional representations of lncRNAs and proteins. To reflect the contribution of the similarity matrices to the model, we add a penalty factor, so the matrix M is transformed into G. The graph G that is input into the GCN is defined as:

$$G = \begin{bmatrix} \mu \tilde{K}_l & B \\ B^T & \mu \tilde{K}_p \end{bmatrix} \tag{3}$$

where $\tilde{K}_p = D_p^{-1/2} K_p D_p^{-1/2}$, $\tilde{K}_l = D_l^{-1/2} K_l D_l^{-1/2}$, $D_p = diag(\sum_j K_p^{ij})$ and $D_l = diag(\sum_j K_l^{ij})$. μ is a penalty factor indicating the importance of similarity in the process of information propagation. The GCN encoder processing this lncRNA-protein heterogeneous graph is:

$$V^{(l+1)} = f(V^{(l)}, G) = \sigma(D_G^{-1/2} G D_G^{-1/2} V^{(l)} W^{(l)}) \tag{4}$$

where σ(·) is a non-linear activation function, $V^{(h)}$ is the output embedding representation of the h-th layer, $W^{(h)}$ is the trainable weight matrix of the h-th layer, and $D_G = diag(\sum_j G_{ij})$ is the degree matrix of graph G. To accelerate the learning process and improve the generalization performance, exponential linear units (ELUs) are adopted as the non-linear activation function σ(·) for all GCN layers [7]. The initial embedding is defined as:

$$V^{(0)} = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}$$

So, the first-layer embedding of the nodes of the GCN encoder, $V^{(1)} \in \mathbb{R}^{(m+n) \times k}$, is computed as $V^{(1)} = \sigma(D_G^{-1/2} G D_G^{-1/2} V^{(0)} W^{(0)})$, where k is the embedding dimension and $W^{(0)} \in \mathbb{R}^{(m+n) \times k}$ is a weight matrix. For all layers (h = 1, 2, ..., H) of the GCN, the propagation follows Eq. (4). Each time the graph passes through one GCN layer, one set of k-dimensional embedding vectors is generated. The embedding representation vectors of different GCN layers capture neighbor proximities of different orders in the heterogeneous network. To avoid over-fitting, our model applies regular dropout γ and node dropout β to the GCN [1,19]. Node dropout can be thought of as training different models on different small networks, which are integrated to predict unknown LPIs. Finally, the final embedding vectors of lncRNAs $V_L \in \mathbb{R}^{m \times k}$ and of proteins $V_P \in \mathbb{R}^{n \times k}$ are generated by the following equation:

$$\begin{bmatrix} V_L \\ V_P \end{bmatrix} = \sum_{h} a_h V^{h} \tag{5}$$

where $a_h$ is initialized as 1/(h + 1), h = 1, 2, ..., H, and can be learned automatically through the GCN. It should be noted that before GCN training starts, we divide the standard dataset into non-overlapping test and training sets. We do not use any test-set information, and we do not extract features during training. In this way, there is no intersection between the test set and the training set, which prevents possible information leakage.
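The forward pass described by Eqs. (2)-(5) can be sketched in plain NumPy as follows. This is a simplified illustration under several assumptions: the weight matrices are supplied by the caller (they would normally be trained), both dropout schemes are omitted, every row sum used for normalisation is positive, and the function names are hypothetical.

```python
import numpy as np


def sym_normalise(K: np.ndarray) -> np.ndarray:
    """D^{-1/2} K D^{-1/2} with D = diag(row sums); assumes positive row sums."""
    inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    return K * inv_sqrt[:, None] * inv_sqrt[None, :]


def elu(x: np.ndarray) -> np.ndarray:
    """Exponential linear unit with alpha = 1."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))


def encode(B, K_l, K_p, weights, mu, a):
    """Forward pass of the encoder; returns (V_L, V_P)."""
    m, n = B.shape
    # Eq. (3): heterogeneous graph with penalised, normalised similarity blocks.
    G = np.block([[mu * sym_normalise(K_l), B],
                  [B.T, mu * sym_normalise(K_p)]])
    G_hat = sym_normalise(G)                       # D_G^{-1/2} G D_G^{-1/2}
    # Initial embedding V^(0) built from the known interactions.
    V = np.block([[np.zeros((m, m)), B],
                  [B.T, np.zeros((n, n))]])
    layers = []
    for W in weights:                              # Eq. (4), one GCN layer per weight matrix
        V = elu(G_hat @ V @ W)
        layers.append(V)
    fused = sum(a_h * V_h for a_h, V_h in zip(a, layers))   # Eq. (5)
    return fused[:m], fused[m:]
```

The first weight matrix must have shape (m+n, k) and the later ones (k, k), so every layer outputs an (m+n) × k embedding that can be summed with the others.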
2.5 Decoding and Final Prediction
Finally, to predict unknown LPIs, a bilinear decoder with a sigmoid activation is used: $I = \mathrm{sigmoid}(V_L W V_P^{T})$, where $W \in \mathbb{R}^{k\times k}$ is a trainable weight matrix. $I$ is the predicted score matrix whose entry $I(i,j)$ indicates the probability that lncRNA $l_i$ interacts with protein $p_j$. The weight matrices $W^{(h)}$ and $W$ can be initialized by the Xavier method [11]. The benchmark dataset is unbalanced, because the number of known interactions is far smaller than the total number of lncRNA-protein pairs. We therefore optimize our FKLGCN model by minimizing the weighted loss function:

$$\mathrm{Loss} = -\frac{1}{m\times n}\left(\lambda \times \sum_{(i,j)\in\Delta} \log I(i,j) + \sum_{(i,j)\in\nabla} \log\bigl(1 - I(i,j)\bigr)\right) \tag{6}$$
where Δ denotes the set of known interacting lncRNA-protein pairs (positive samples), ∇ denotes the set of all the other lncRNA-protein pairs (negative samples), and λ = |∇| / |Δ|. Following the study [23], the Adam optimizer [12] is adopted to minimize the loss function.
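The decoder and the class-weighted loss of Eq. (6) can be written compactly in NumPy. The sketch below is illustrative only: the function names and the small constant `eps` (added to keep the logarithm finite) are assumptions that do not appear in the paper.

```python
import numpy as np

def decode(V_L, V_P, W):
    """Bilinear decoder: sigmoid(V_L W V_P^T), an m x n matrix of scores."""
    logits = V_L @ W @ V_P.T
    return 1.0 / (1.0 + np.exp(-logits))

def weighted_loss(scores, Y, eps=1e-12):
    """Eq. (6): positives (Y == 1) are weighted by lambda = |negatives| / |positives|."""
    pos = Y == 1
    neg = ~pos
    lam = neg.sum() / pos.sum()
    m, n = Y.shape
    total = lam * np.log(scores[pos] + eps).sum() + np.log(1.0 - scores[neg] + eps).sum()
    return -total / (m * n)

# Toy example: one known interaction among four lncRNA-protein pairs.
Y = np.array([[1, 0], [0, 0]])
V_L = np.zeros((2, 3))
V_P = np.zeros((2, 3))
W = np.eye(3)
scores = decode(V_L, V_P, W)      # zero embeddings give a score of 0.5 everywhere
loss = weighted_loss(scores, Y)   # lambda = 3 in this toy case
```

With λ = |∇| / |Δ|, the positive and negative terms carry the same total weight, which is why the single known pair in the toy example contributes as much to the loss as the three unknown pairs together.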
3 Results
3.1 Datasets and Experimental Environments
Two datasets are used in this study, both previously studied by Zhang et al. and Zheng et al. [24,27]. The benchmark dataset has 4158 experimentally validated LPIs involving 990 lncRNAs and 27 proteins. To further validate the results obtained on the benchmark dataset, the performance tests and comparison tests are also run on a novel dataset, which has 4467 experimentally validated LPIs involving 1050 lncRNAs and 84 proteins. In this section, global 5-fold cross validation (5-fold CV) is adopted to evaluate the performance of the LPI-FKLGCN model. All known LPIs are randomly divided into five equal-sized folds. Each fold is taken in turn as the test sample and the remaining four folds are used as the training sample. Prediction is based on the known interactions in the training sample, and the results are reported for the lncRNA-protein pairs in the test sample. The area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) are adopted as the primary evaluation metrics. In addition, the F1 score (F1), accuracy (ACC), recall (REC), specificity (SPEC) and precision (PRE) are reported for reference. All experiments are run on a 64-bit Windows 10 Professional operating system on a computer with a 3.59 GHz 6-core CPU and 32 GB of memory.
3.2 Parameter Impact Analysis
There are several parameters in our LPI-FKLGCN model. According to the best AUC and AUPR values, the parameter values are selected as μ = 6, κ = 40, υ = 0.01, β = 0.4 and γ = 0.6. In addition, the embedding dimension k defaults to 64 and the number of GCN layers H defaults to 3. The results of the parameter impact test on the benchmark dataset are shown in Fig. 2, and those on the novel dataset are recorded in the Supplementary file. When the parameter values vary over a wide range, the AUC and AUPR values fluctuate little, which indicates that our LPI-FKLGCN model is robust to these parameters.
Fig. 2. The influence of parameters on model performance on the benchmark dataset by 5-fold cross validation.
3.3 The Influence of FKL and GCN in LPI Prediction
To evaluate the effectiveness of FKL, it is compared with two other kernel fusion methods: average similarity fusion (AVG) and Similarity Network Fusion (SNF). In average similarity fusion, the fusion weight of each base kernel is 0.25, and all base kernels are combined by linear weighting. Similarity network fusion, introduced by Wang et al. [22], fuses the sub-networks into one comprehensive network by an iterative non-linear process based on propagation theory. Furthermore, to evaluate the influence of the multi-layer GCN on model performance, we tested the GCN with one, two and three convolutional layers, respectively. The results of the tests for fast kernel learning and the multi-layer GCN are shown in Table 1. As shown, our model with three convolutional layers (FKLGCN 3cov) achieves the highest AUPR and AUC among the five models on both the benchmark dataset and the novel dataset. In conclusion, the prediction performance of LPI-FKLGCN is greatly improved by the introduction of fast kernel fusion and the three-layer GCN.

Table 1. The influence of FKL and GCN, tested by comparison with the model variants by 20 times 5-fold CV on the two datasets.

Dataset            Models        AUPR    AUC     F1      ACC     REC     SPEC    PRE
Benchmark dataset  Fuse AVG^a    0.2795  0.8736  0.2891  0.9220  0.4464  0.9395  0.2138
Benchmark dataset  Fuse SNF^b    0.3434  0.8874  0.3737  0.9351  0.5451  0.9494  0.2843
Benchmark dataset  FKLGCN 1cov   0.5504  0.9476  0.5367  0.9628  0.6064  0.9759  0.4813
Benchmark dataset  FKLGCN 2cov   0.5655  0.9487  0.5444  0.9616  0.6050  0.9733  0.4710
Benchmark dataset  FKLGCN 3cov   0.5928  0.9502  0.5424  0.9705  0.6113  0.9784  0.6041
Novel dataset      Fuse AVG^a    0.4946  0.9422  0.1655  0.9003  0.8451  0.9000  0.0908
Novel dataset      Fuse SNF^b    0.5121  0.9480  0.1782  0.9114  0.8693  0.9114  0.0988
Novel dataset      FKLGCN 1cov   0.4469  0.9600  0.2282  0.9286  0.6566  0.9721  0.1313
Novel dataset      FKLGCN 2cov   0.4890  0.9602  0.2317  0.9328  0.8165  0.9441  0.1350
Novel dataset      FKLGCN 3cov   0.5212  0.9638  0.2362  0.9894  0.8859  0.9400  0.1362

a Fuse AVG denotes fusion with average weights.
b Fuse SNF denotes fusion with SNF (similarity network fusion).

3.4 Comparison with the Baseline Methods
To evaluate the performance of LPI-FKLGCN, we compare our method with five baseline methods for lncRNA-protein interaction prediction. The comparisons are carried out under the same experimental settings on both the benchmark dataset and the novel dataset. The parameters of the comparison methods are the same as those in the original literature. Collaborative Filtering (CF) is widely used in recommendation systems [17]; here it is used to generate the protein similarity network and to construct a weighted bipartite graph that represents the known LPI dataset. Random Walk with Restart (RWR) is a general, flexible, propagation-based approach, which has been used in many fields of biological computation. Recent studies have shown that recommendations based on propagation and random walk are superior to collaborative filtering methods. LncRNA-Protein Bipartite Network Inference (LPBNI) [10] runs a resource allocation (RA) algorithm on a bipartite network to predict potential LPIs. LPI prediction based on a Heterogeneous Network model (LPIHN) [14] constructs a heterogeneous network that integrates a protein-protein interaction (PPI) network, a lncRNA similarity network and an LPI network, and runs a random walk with restart (RWR) on the constructed network. LPI-FKLKRR [18] applies fast kernel learning to integrate heterogeneous kernels and predicts LPIs by kernel ridge regression. We conduct 20 times 5-fold CV for all these methods, and the results are averaged. The results of the comparison on the benchmark dataset and the novel dataset are shown in Table 2.

Table 2. Comparison with the baseline methods by 20 times 5-fold CV on the two datasets.

Dataset            Metric  RWR     CF      LPBNI   LPIHN   FKLKRR  FKLGCN
Benchmark dataset  AUPR    0.2827  0.2357  0.2399  0.3302  0.5506  0.5929
Benchmark dataset  AUC     0.8134  0.7686  0.8451  0.8569  0.7939  0.9502
Benchmark dataset  PRE     0.3538  0.3029  0.4139  0.2888  0.5189  0.6041
Benchmark dataset  ACC     0.9536  0.9506  0.9581  0.9431  0.8702  0.9705
Benchmark dataset  F1      0.3603  0.2992  0.3868  0.3337  0.5083  0.5424
Novel dataset      AUPR    0.2813  0.2628  0.3336  0.2472  0.4861  0.5212
Novel dataset      AUC     0.9282  0.9028  0.9407  0.9315  0.8807  0.9639
Novel dataset      PRE     0.3640  0.3616  0.3913  0.2864  0.6072  0.1363
Novel dataset      ACC     0.9866  0.9869  0.9869  0.9828  0.9567  0.9895
Novel dataset      F1      0.3488  0.3222  0.3938  0.3397  0.4911  0.2362

Our model achieves the best performance on the benchmark dataset, with an AUPR of 0.593 and an AUC of 0.950. According to the precision, accuracy and F1 score, LPI-FKLGCN also outperforms the other models on this dataset. On the novel dataset, LPI-FKLGCN obtains the best AUPR, AUC and accuracy, whereas its precision and F1 score are lower than those of the baseline methods.

3.5 Case Study
In the two datasets used in this study, the known lncRNA-protein interactions are extracted from the database NPInter v2.0. Therefore, we predict all possible associated proteins for a specific lncRNA, NONHSAT022115, and all possible lncRNAs for a specific protein, ENSP00000401371. For each specific lncRNA (protein), all predictions are sorted by score in descending order, and the top 10 potential associations are checked against the latest NPInter database (v4.0) and the latest literature for confirmation. In addition, to test whether our model can predict the proteins that may interact with new lncRNAs (without any known association), all known associations of lncRNA NONHSAT022115 are deleted to simulate a novel lncRNA, and potential proteins are then predicted by our model. Finally, 7 out of 10 predicted proteins associated with the lncRNA NONHSAT145960, 7 out of 10 predicted lncRNAs associated with the protein ENSP00000401371, and 9 out of 10 predicted proteins associated with the lncRNA NONHSAT022115 are confirmed. Thus, the model predicts with high accuracy, and its predictions can guide future biological experiments.
4 Discussion
In this paper, we developed a novel deep learning-based framework to predict possible LPIs. Owing to the introduction of FKL, an intermediate integration technique, the combination weights are optimized automatically according to the importance of each base kernel. By combining the multi-layer embedding vectors from the graph convolutional encoder, proximities of different orders are integrated to improve the accuracy of prediction. Overall, LPI-FKLGCN is a predictive tool for LPIs with high accuracy, high efficiency and robustness. Considering the ongoing study of lncRNA molecular functions, the further enrichment of multi-source heterogeneous databases, and the availability of richer data such as mRNA, miRNA, disease and gene information, we intend to combine more advanced deep learning methods with appropriate machine learning methods and to design more suitable encoders and decoders to better predict potential LPIs.
References
1. van den Berg, R., Kipf, T.N., Welling, M.: Graph convolutional matrix completion (2017)
2. Cai, L., et al.: Drug repositioning based on the heterogeneous information fusion graph convolutional network. Brief. Bioinform. (2021). https://doi.org/10.1093/bib/bbab319
3. Chen, X., Clarence Yan, C., Luo, C., Ji, W., Zhang, Y., Dai, Q.: Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Sci. Rep. 5 (2015). https://doi.org/10.1038/srep11338
4. Chen, X., et al.: Computational models for lncRNA function prediction and functional similarity calculation. Brief. Funct. Genomics 18(1), 58–82 (2019). https://doi.org/10.1093/bfgp/ely031
5. Chen, X., Yan, C.C., Zhang, X., You, Z.H.: Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief. Bioinform. 18(4), 558–576 (2017). https://doi.org/10.1093/bib/bbw060
6. Chen, X., Yan, G.Y.: Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics 29(20), 2617–2624 (2013). https://doi.org/10.1093/bioinformatics/btt426
7. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). In: 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings (2016)
8. Engreitz, J.M., et al.: Local regulation of gene expression by lncRNA promoters, transcription and splicing. Nature 539(7629), 452–455 (2016). https://doi.org/10.1038/nature20149
9. Fan, X.N., Zhang, S.W.: LPI-BLS: Predicting lncRNA-protein interactions with a broad learning system-based stacked ensemble classifier. Neurocomputing 370 (2019). https://doi.org/10.1016/j.neucom.2019.08.084
10. Ge, M., Li, A., Wang, M.: A bipartite network-based method for prediction of long non-coding RNA-protein interactions. Genomics Proteomics Bioinform. 14(1), 62–71 (2016). https://doi.org/10.1016/j.gpb.2016.01.004
11. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. 9, 249–256 (2010)
12. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2015)
13. van Laarhoven, T., Nabuurs, S.B., Marchiori, E.: Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27(21), 3036–3043 (2011). https://doi.org/10.1093/bioinformatics/btr500
14. Li, A., Ge, M., Zhang, Y., Peng, C., Wang, M.: Predicting long noncoding RNA and protein interactions using heterogeneous network model. BioMed Res. Int. 2015 (2015). https://doi.org/10.1155/2015/671950
15. Long, Y., Wu, M., Kwoh, C.K., Luo, J., Li, X.: Predicting human microbe-drug associations via graph convolutional network with conditional random field. Bioinformatics 36(19), 4918–4927 (2020). https://doi.org/10.1093/bioinformatics/btaa598
16. Muppirala, U.K., Honavar, V.G., Dobbs, D.: Predicting RNA-protein interactions using only sequence information. BMC Bioinform. 12(1), 489 (2011). https://doi.org/10.1186/1471-2105-12-489
17. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th International Conference on World Wide Web, WWW 2001, pp. 285–295 (2001). https://doi.org/10.1145/371920.372071
18. Shen, C., Ding, Y., Tang, J., Guo, F.: Multivariate information fusion with fast kernel learning to Kernel Ridge Regression in predicting lncRNA-protein interactions. Front. Genetics 10, 716 (2019). https://doi.org/10.3389/fgene.2018.00716
19. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
20. Sun, M., Zhao, S., Gilvary, C., Elemento, O., Zhou, J., Wang, F.: Graph convolutional networks for computational drug development and discovery (2020). https://doi.org/10.1093/bib/bbz042
21. Wan, S., Mak, M.W., Kung, S.Y.: GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition. J. Theor. Biol. 323, 40–48 (2013). https://doi.org/10.1016/j.jtbi.2013.01.012
22. Wang, B., et al.: Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11(3), 333–337 (2014). https://doi.org/10.1038/nmeth.2810
23. Yu, Z., Huang, F., Zhao, X., Xiao, W., Zhang, W.: Predicting drug-disease associations through layer attention graph convolutional network. Briefings Bioinform. 2020(00), 1–11 (2020). https://doi.org/10.1093/bib/bbaa243
24. Zhang, W., Qu, Q., Zhang, Y., Wang, W.: The linear neighborhood propagation method for predicting long non-coding RNA-protein interactions. Neurocomputing 273, 526–534 (2018). https://doi.org/10.1016/j.neucom.2017.07.065
25. Zhang, W., Yue, X., Tang, G., Wu, W., Huang, F., Zhang, X.: SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions. PLoS Comput. Biol. 14(12) (2018). https://doi.org/10.1371/journal.pcbi.1006616
26. Zhao, Q., Zhang, Y., Hu, H., Ren, G., Zhang, W., Liu, H.: IRWNRLPI: integrating random walk and neighborhood regularized logistic matrix factorization for lncRNA-protein interaction prediction. Front. Genetics 9 (2018). https://doi.org/10.3389/fgene.2018.00239
27. Zheng, X., et al.: Fusing multiple protein-protein similarity networks to effectively predict lncRNA-protein interactions. BMC Bioinform. 18(S12), 420 (2017). https://doi.org/10.1186/s12859-017-1819-1
Biomedical Imaging
Prediction of Protein Subcellular Localization from Microscopic Images via Few-Shot Learning
Francesco Arcamone 1,2, Yanlun Tu 1, and Yang Yang 1,3 (B)
1 Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China
[email protected]
2 Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133 Milan, Italy
3 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai 200240, China
Abstract. Benefitting from the breakthrough development of microscopy imaging techniques, various bio-microscopic images have accumulated rapidly for the past decade. Using computer vision and machine learning methods, biological activities and molecular functions can be interpreted from these images, thus image analysis has become more and more important in current life science research. A prominent difficulty in biological image analysis is the lack of annotation, and the test set even contains data from unseen classes, i.e. the open-set issue. The image-based protein subcellular localization is a typical open-set problem. There are tens of subcellular compartments in cells, while the labeled data may only consist of proteins from several major organelles. Till now, the open-set problem has been rarely studied for biomedical image data. The main goal of this study is to train a few-shot learning model for the recognition of protein subcellular localization from immunofluorescence images. We conduct experiments on a data set collected from Human Protein Atlas (HPA) and the results show that the introduced system can provide accurate results even with a small handful of images for an unknown class in a multi-instance learning scenario.
Keywords: Protein subcellular localization · Few-shot learning · ArcNet · Immunofluorescence images
1 Introduction
Image classification has become an active topic in bioinformatics, and it is a useful way to extract biomedical information from raw images, which can later be converted into valuable knowledge. In particular, for the exploration of the human proteome, the Human Protein Atlas (HPA [12], https://www.proteinatlas.org) is a comprehensive database that maps all the human proteins in cells, tissues, and organs. HPA contains tens of millions of immunofluorescence images from a large number of human tissues and organs, which have been widely adopted as a resource for tasks such as protein subcellular localization [3]. However, image-based protein subcellular localization faces many challenges, such as small data (i.e. few proteins have annotated location information), multi-instance learning (each protein corresponds to multiple images), and imbalanced class distribution (the protein distribution across organelles is uneven) [14,16]. Some previous studies focused on these issues and a few computational methods have been developed [7,9,10,15], while the open-set challenge remains unsolved and has rarely been studied.

Protein subcellular localization is a typical open-set problem, as there are tens of compartments in cells while the labeled dataset often covers only several of them. In a traditional supervised learning scenario, the model is unable to predict unseen classes. In recent years, few-shot learning and zero-shot learning methods have been developed in the computer vision field. Few-shot learning methods mainly focus on training a feature extractor to capture image features that can distinguish images from different classes. In the test phase, there are some unseen classes with a few sample images each. Although these images are not used for training, we can obtain their feature embeddings via the trained feature extractor and compute the similarity between a query sample and the samples in the test set, thus yielding the label for the query sample. In the zero-shot learning case, there is no sample at all for the unseen classes, so the prediction is usually based on the semantic correlation between the labels of the training and test data.

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 241–253, 2021. https://doi.org/10.1007/978-3-030-91415-8_21
For biomedical image analysis, there is usually no semantic information in images or labels, but we can adopt the few-shot learning scheme to predict unseen classes. This study aims to build an effective system that can deal with the open-set problem, i.e. that can correctly recognize proteins located in cellular compartments that are not considered during training. In particular, we leverage the contrastive learning framework to learn feature representations for the immunofluorescence images, aiming to extract features that can differentiate proteins from different locations. We adopt the ArcFace loss function, which is used in face verification [2], and construct a model called ArcNet. A 20-shot learning scheme is used to evaluate the performance. The test results show that the network recognizes not only images with the labels used in training but also images from unknown labels, with accuracies well above random choice, despite the small number of classes available for training.
2 Methods
2.1 Few-Shot Learning
The case studied in this paper involves not only the open-set problem but also the small data problem, given that only a few classes are available and that the numbers of samples per class are relatively highly imbalanced.
Fig. 1. Three images belonging to the same class. Their high variability immediately shows how hard it is for a system to recognize them correctly.
One-shot and few-shot learning techniques have been employed to address these issues [6]. However, cell images belonging to the same class may vary considerably (see Fig. 1), which makes it necessary to resort to a few-shot learning technique. With this technique, a few images are drawn at random from each class, which suits this type of experiment because the images can be used even if they differ completely from one another. The sampled images are combined into an average based on the weighted average of each pixel. The comparison is then made between two arrays, holding respectively the averaged images and one further image from the same class, with one entry for each class of the dataset. The number of shots per class is set to 20 (shown in Fig. 2), since it represents the best trade-off between a number of samples that is fairly easy to obtain for an unknown class and the best accuracy that the architecture can provide.
Fig. 2. The 20-shots method.
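The 20-shot scheme can be illustrated with a small NumPy sketch. It is a toy under stated assumptions rather than the authors' pipeline: `embed` is a hypothetical stand-in for the trained feature extractor (it only flattens and normalizes the image), the support images are averaged with uniform weights, cosine distance is used for the comparison, and the striped toy images replace real microscopy data.

```python
import numpy as np

def embed(image):
    """Hypothetical stand-in for the trained feature extractor: flatten and L2-normalize."""
    v = np.asarray(image, dtype=float).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def class_average(shots):
    """Pixel-wise mean of the support images of one class (uniform weights assumed)."""
    return np.mean(np.stack(shots), axis=0)

def classify(query, support):
    """Return the class whose averaged support image is closest in cosine distance."""
    q = embed(query)
    distances = {label: 1.0 - float(q @ embed(class_average(shots)))
                 for label, shots in support.items()}
    return min(distances, key=distances.get)

def toy_image(label, rng, noise=0.05):
    """Toy 9 x 9 image: one bright row per class plus Gaussian noise."""
    img = np.zeros((9, 9))
    img[label, :] = 1.0
    return img + noise * rng.normal(size=img.shape)

rng = np.random.default_rng(0)
support = {c: [toy_image(c, rng) for _ in range(20)] for c in range(9)}  # 20 shots, 9 classes
query = toy_image(3, rng)
predicted = classify(query, support)  # 9-way decision for one query image
```

Averaging the shots before the comparison damps the variability among images of the same class, which is the motivation given above for using 20 shots instead of one.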
2.2 Contrastive Representation Learning
Contrastive learning is commonly used in computer vision to obtain good representations of images. The main idea is to pull the output feature vectors closer for input image pairs that are labeled as similar, and to push them apart for pairs labeled as dissimilar. This is achieved by jointly tuning two convolutional neural networks (CNNs) that are linked by one loss function; because the two networks share parameters, the architecture is called a Siamese network. The loss function is usually some form of contrastive loss. Specifically, the model takes pairs of images as input. First, the CNNs extract features from each image and output feature vectors. Then, from the distance computed between those feature vectors, the system outputs a scalar distance for each pair in the batch. For each pair, the label takes the value one if the pair contains images from the same class and zero otherwise. The distances and the labels of the pairs are finally fed to the contrastive loss function, which is used to train the model through backpropagation [11]. In our few-shot learning framework, we adopt a special contrastive learning model, which consists of the cosine distance metric and the ArcFace loss function [2]. We call it ArcNet. Specifically, after extracting feature vectors with the CNNs, ArcNet computes the cosine distance between those feature vectors via the ArcFace loss function, adding the margin to the positive pairs and then outputting the cosine distance (between -1 and 1) for each pair in the batch. The ArcFace output and the labels of the pairs are fed to the binary cross-entropy loss function, and the model is likewise optimized through backpropagation [11]. Although the plain cosine distance would already provide somewhat separable feature embeddings, adding the margin to the positive pairs in the ArcFace algorithm enforces a more evident gap between the nearest classes. The whole pipeline is shown in Fig. 3.
Fig. 3. Flowchart of ArcNet.
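The pull and push behaviour of the classic contrastive setting described above can be sketched for a single pair as follows. This is a minimal illustration, not the authors' code: it assumes the Euclidean distance between feature vectors, uses the label convention stated above (1 for a same-class pair), and the margin in the usage lines is a toy value.

```python
import numpy as np

def pair_distance(f1, f2):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return float(np.linalg.norm(np.asarray(f1, dtype=float) - np.asarray(f2, dtype=float)))

def contrastive_pair_loss(f1, f2, same_class, margin):
    """Pull same-class pairs together; push other pairs apart until they exceed the margin."""
    d = pair_distance(f1, f2)
    if same_class:
        return 0.5 * d ** 2                    # any remaining distance is penalized
    return 0.5 * max(0.0, margin - d) ** 2     # no penalty once the pair is far enough apart

anchor = [0.0, 0.0]
loss_same = contrastive_pair_loss(anchor, [0.6, 0.0], same_class=True, margin=1.0)
loss_diff_near = contrastive_pair_loss(anchor, [0.6, 0.0], same_class=False, margin=1.0)
loss_diff_far = contrastive_pair_loss(anchor, [2.0, 0.0], same_class=False, margin=1.0)
```

In a Siamese setup both feature vectors would come from the same network with shared parameters, so the gradient of this loss updates a single set of weights.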
As shown in Fig. 3, following the practice of [2], the logit $\cos\theta_j$ for each class is obtained as $W_j^T x_i$ from the feature $x_i$ and the normalized weight $W$. The angle $\theta_{y_i} = \arccos(\cos\theta_{y_i})$ is then calculated to determine the angular distance between the feature $x_i$ and the ground-truth weight $W_{y_i}$, and a penalizing angular margin $m$ is added to this ground-truth angle. Finally, $\cos(\theta_{y_i} + m)$ is computed and all the logits are multiplied by the feature scale $s$. The logits are then fed to the softmax function and finally evaluated through the cross-entropy loss.
2.3 Loss Functions
In this section, the formal definitions of the conventional contrastive loss and the loss used in ArcNet are given.

Contrastive Loss. The contrastive loss function is formulated in Eq. (1),

$$L(W, (Y, X_1, X_2)) = (1 - Y)\,\frac{1}{2}\,(D_W)^2 + Y\,\frac{1}{2}\,\{\max(0, m - D_W)\}^2, \tag{1}$$

where $Y$ is the label indicating whether $X_1$ and $X_2$ belong to the same class, $m$ is a hyperparameter called the margin, which sets the threshold of similarity between the two elements, and $D_W$ is the distance between the two feature embeddings.

ArcNet Loss. This loss function is motivated by its strong results in face recognition, where it outperformed the existing loss functions, in most cases by a large margin [2]. In short, the function maximizes the inter-class separability through an angular margin similar to that of the SphereFace and CosFace functions [8,13]. First, the function evaluates the angle between each feature and the target weight and adds the angular margin to this target angle, from which the target logit is recovered through the cosine function. Afterwards, the function rescales all the logits according to a fixed feature norm and proceeds through the remaining steps of a softmax loss function, which is described by Eq. (2),

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}}, \tag{2}$$

where $x_i \in \mathbb{R}^d$ denotes the feature embedding of the $i$-th sample, belonging to the $y_i$-th class, $W_j \in \mathbb{R}^d$ is the $j$-th column of the weight matrix $W \in \mathbb{R}^{d\times n}$, and $b_j \in \mathbb{R}^n$ is the bias term, set to 0 here as in the original paper [2]. The batch size and the number of classes are denoted by $N$ and $n$, respectively [2]. The logit $W_j^T x_i$ equals $\|W_j\|\,\|x_i\|\cos\theta_j$, with $\theta_j$ denoting the angle between the weight $W_j$ and the feature $x_i$. The weight norm $\|W_j\|$ is fixed to 1 by L2 normalization, and the embedding feature is likewise L2-normalized and rescaled so that $\|x_i\| = s$. In this way the predictions depend only on the angle between the feature and the weight. The learned embedding features are thus distributed on a hypersphere of radius $s$, i.e.,
$$L_2 = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}}. \tag{3}$$

As the embedding features are distributed around each feature centre on the hypersphere, an additive angular margin penalty $m$ between $x_i$ and $W_{y_i}$ is added in order to simultaneously maximize the intra-class compactness and the inter-class distance, as shown in Eq. (4),

$$L_3 = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \tag{4}$$
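Eq. (4) can be checked numerically with a short NumPy sketch. The code is illustrative only: the function name is hypothetical, the default scale `s = 30` is an assumed value that the text does not specify, and the default margin follows the value of 0.5 reported in the experimental settings.

```python
import numpy as np

def arcface_loss(X, W, y, s=30.0, m=0.5):
    """Additive angular margin softmax loss of Eq. (4).

    X: (N, d) features, W: (d, n) class weights, y: (N,) integer class labels.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm features
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm class weights
    cos = np.clip(Xn @ Wn, -1.0, 1.0)                   # cos(theta_j) for every class
    rows = np.arange(len(y))
    theta_target = np.arccos(cos[rows, y])
    logits = s * cos
    logits[rows, y] = s * np.cos(theta_target + m)      # margin on the target class only
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, y].mean())

# One sample perfectly aligned with its class weight, two classes.
X = np.array([[1.0, 0.0]])
W = np.eye(2)
y = np.array([0])
loss_plain = arcface_loss(X, W, y, s=1.0, m=0.0)          # reduces to Eq. (3)
loss_margin = arcface_loss(X, W, y, s=1.0, m=np.pi / 2)   # target logit pushed down to 0
```

With `m = 0` the function reduces to the normalized softmax of Eq. (3); a positive margin lowers the target logit, so the same sample incurs a larger loss and the feature must move closer to its class weight to compensate.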
3 The Backbone CNN Model
During the contrastive learning, we tune the parameters in CNNs of the Siamese neural network to yield feature representations for the images. Here we employ a classic CNN model, ResNet-18 [4], as the feature extractor, i.e. the Siamese network's backbone model. The model parameters are shown in Table 1.

Table 1. The structure of the backbone CNN model.

Layer            Backbone                                                 Output size
conv1            7 × 7, 64, stride 2                                      112 × 112 × 64
conv2_x          3 × 3 max pooling, stride 2; [3 × 3, 64; 3 × 3, 64] × 2  56 × 56 × 64
conv3_x          [3 × 3, 128; 3 × 3, 128] × 2                             28 × 28 × 128
conv4_x          [3 × 3, 256; 3 × 3, 256] × 2                             14 × 14 × 256
conv5_x          [3 × 3, 512; 3 × 3, 512] × 2                             7 × 7 × 512
Average pool     7 × 7 average pool                                       1 × 1 × 512
Fully connected  512 × 64 fully connections                               1 × 64
4 Results
4.1 Dataset
The images used for this experiment were captured by using the immunofluorescence technique, which visualizes the localization of proteins in cellular compartments and is of extreme importance in the medical and bioinformatics field.
Fig. 4. Some of the most representative samples for each class of the Human Protein Atlas dataset
The samples fall into 18 different classes; 9 classes are used for training and the remaining 9 for testing. The most prominent ones are used for training. Figure 4 shows some representative images for different subcellular localizations. Although the categories used in the testing phase are unknown to the system, it is still able to recognize the images that belong to the same class in the test set, because training on the known data teaches it to identify the distinctive features of each class and to search for the same traits in the unknown images.
4.2 Experimental Settings
We experiment with both the classic contrastive loss and the ArcFace loss. In both experiments, the number of features is 64 (we search over the values 64, 128, 256 and 512 and select the one with the best validation accuracy). Following the practice of previous studies [2], the margins are set to 10 and 0.5, and the learning rates are set to $10^{-7}$ and $10^{-5}$, for these two kinds of methods, respectively. To prevent overfitting, we add a dropout layer with rate 0.1 after each convolutional layer, and another one with rate 0.5 after the fully connected layer. Each architecture is trained with the Adam (ADAptive Moment estimation) optimizer [5,9], which is widely used in computer vision and natural language processing tasks.
4.3 Evaluation
To assess model performance, two common metrics are used, i.e. precision-recall AUC (area under the curve) and total accuracy. The metrics are evaluated via 9-way few-shot classification, i.e., one input image is compared against 9 images, each belonging to a different class, and the class with the smallest distance (the highest similarity) is chosen. For each class of the training set, 80 images are reserved for validation and another 80 images for testing. By forming positive and negative pairs of these images, we can perform 400 9-way few-shot classification tests for both training and testing.
4.4 Batch Selection
A fixed number of training images is used for each class so that every class receives equal representation during optimization, together with an equal number of positive and negative pairs, which are formed for every combination of classes in the training set. Each batch consists of 100 images taken randomly from each of the two classes of a class pair, giving a batch of 200 images. All possible positive and negative pairs are formed among these images in equal number and then fed to the loss functions. Every epoch thus contains one batch for every combination of 2 classes, i.e. (1,2), (1,3), ..., (2,3), (2,4), ..., (8,9), making a total of 36 batches. Class pairs are not repeated in reverse order, e.g., for (1,2) there is no pair (2,1).
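The batch schedule follows directly from enumerating unordered class pairs, as the standard-library sketch below shows. The function name, the placeholder image identifiers and the seed are hypothetical, and the balancing of positive and negative image pairs inside each batch is not shown.

```python
from itertools import combinations
import random

def epoch_batches(images_by_class, per_class=100, seed=0):
    """Yield one batch per unordered class pair, with per_class random images from each class."""
    rng = random.Random(seed)
    for a, b in combinations(sorted(images_by_class), 2):
        batch = [(img, a) for img in rng.sample(images_by_class[a], per_class)]
        batch += [(img, b) for img in rng.sample(images_by_class[b], per_class)]
        yield (a, b), batch

# Placeholder identifiers stand in for the training images of 9 classes.
images_by_class = {c: [f"class{c}_img{i}" for i in range(120)] for c in range(1, 10)}
batches = list(epoch_batches(images_by_class))
class_pairs = [pair for pair, _ in batches]
```

Because `combinations` never returns both (1,2) and (2,1), nine training classes give 9 × 8 / 2 = 36 batches per epoch, matching the count stated above.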
Prediction Results of ArcNet
As aforementioned, we aim to solve the open-set problem in protein subcellular localization, i.e. the classes in the test set are different from those in the training set. The model is evaluated with the aforementioned 20-shot classification technique, which classifies on the basis of a weighted average over each class to account for the intrinsic variance within each class. Table 2 shows the prediction accuracy for each class in the training and test sets; all results are accuracy percentages. The results vary widely across labels and, except for a small number of cases, most accuracies are well above the random-choice level, which is 11.11% in this case. The lowest accuracies cannot be attributed simply to a scarcity of data, but rather to the intrinsic complexity of those classes, for which further training with a higher number of classes would be extremely beneficial [6]. As can be seen, the labels in the test set are completely different from those in the training set, while the performance on the test set is only slightly worse than that on the training set. The close agreement between the accuracy on the training set and on the new test set indicates that the training is effective and that the model generalizes satisfactorily even after training on only 9 classes. The results also suggest that the feature representation learned by ArcNet captures information that can discriminate different locations. Although the 9 classes in the test set are not involved in the training process, using the few-shot learning scheme, the samples in the test set can be recognized with a certain accuracy, especially for intermediate filaments, plasma membrane, and endosomes.

Table 2. Prediction accuracy on locations in the training and test sets.

| Training location | Accuracy (%) | Test location | Accuracy (%) |
|---|---|---|---|
| Mitochondria | 85.0 | Intermediate filaments | 72.5 |
| Nucleus | 55.0 | Plasma membrane | 70.0 |
| Nuclear speckles | 47.5 | Endosomes | 70.0 |
| Nucleoplasm | 37.5 | Peroxisomes | 32.5 |
| Cytosol | 30.0 | Nuclear bodies | 12.5 |
| Nucleoli | 17.5 | Cell junctions | 20.0 |
| Vesicles | 15.0 | Nuclear membrane | 12.5 |
| Endoplasmic reticulum | 10.0 | Centrosome | 10.0 |
| Golgi apparatus | 7.5 | Nucleoli fibrillar center | 2.5 |
Furthermore, we compare few-shot learning with single-shot learning and a nearest neighbor classifier. The results of the three methods are listed in Table 3, where the nearest neighbor method also adopts a 20-shot strategy for inference.

Table 3. The prediction accuracy (%) of three methods

| Data | Single-shot | 20-shot | Nearest neighbor |
|---|---|---|---|
| Training | 33.06 | 36.88 | 29.0 |
| Test | 23.06 | 35.02 | 28.0 |
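The difference between single-shot and 20-shot inference can be illustrated with a minimal sketch. It assumes Euclidean distances between embeddings and a plain (unweighted) mean over the k reference images of each class; the paper speaks of a weighted average without giving the weights, so the uniform weighting, the function name `k_shot_predict` and the toy embeddings are illustrative assumptions only.

```python
import math

def k_shot_predict(query, support):
    """Return the class whose reference embeddings are closest on average.

    query   -- embedding of the input image (sequence of floats)
    support -- dict mapping class name -> list of k reference embeddings
               (k = 1 gives single-shot, k = 20 gives 20-shot inference)
    """
    mean_dist = {
        label: sum(math.dist(query, ref) for ref in refs) / len(refs)
        for label, refs in support.items()
    }
    # The smallest mean distance corresponds to the highest similarity.
    return min(mean_dist, key=mean_dist.get)

# Toy 2-D embeddings for two hypothetical classes.
support = {
    "mitochondria": [[0.0, 0.0], [0.0, 1.0]],
    "nucleus": [[5.0, 5.0], [5.0, 6.0]],
}
print(k_shot_predict([0.2, 0.4], support))
# Single-shot variant: keep only the first reference of each class.
single_shot = {label: refs[:1] for label, refs in support.items()}
print(k_shot_predict([5.0, 5.4], single_shot))
```

Averaging over several references makes the decision less sensitive to an unrepresentative reference image, which is the intuition behind the 20-shot column of Table 3.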
The 20-shot system clearly outperforms the single-shot system in an experiment with the same architecture and the same hyperparameters. This confirms the initial hypothesis that, in a scenario characterized by high variability, a weighted average over several different samples gives better predictions than a single sample. According to the results and the respective charts, the 20-shot method not only improves the overall training and test accuracies with respect to the single-shot algorithm, but also greatly reduces the gap between the results on the training set and on the test set. Moreover, the 20-shot architecture improves over time, whereas the single-shot one remains almost steady and invariant. Some classes have very low accuracy, like Golgi apparatus, which also received low prediction accuracy in previous studies [1,17], perhaps due to the high complexity of the patterns of these locations. As for the nearest neighbor (NN) classifier, the experiment is carried out with the same embeddings of each image of the training and test sets and the same convolutional architecture, except that the last fully connected layer (the one producing the final results for each class) is removed. The average accuracies obtained by NN on the training and test sets are 29.0% and 28.0%, respectively, which are clearly lower than those of ArcNet, suggesting the advantage of the proposed method.
4.6 Comparison Between ContrastiveNet and ArcNet
To assess the impact of the components (e.g. loss function and distance metric) in the contrastive learning model, we compare the conventional contrastive learning model, named ContrastiveNet, with the model using the ArcFace loss function, i.e. ArcNet. Table 4 shows the results of these two methods on five metrics. As can be seen, ArcNet achieves better performance on almost all the metrics. Although the training accuracy of ArcNet is lower, its results on the test set are better than those of ContrastiveNet. Note that the test set is an open set, composed of unknown classes, which suggests the strong generalization ability of ArcNet.

Table 4. Results of ArcNet and ContrastiveNet

| Method | Dataset | Accuracy | AUC | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| ArcNet | Training set | 0.3688 | 0.7518 | 0.8056 | 0.3307 | 0.4689 |
| ArcNet | Test set | 0.3502 | 0.6646 | 0.7669 | 0.1698 | 0.2780 |
| ContrastiveNet | Training set | 0.4306 | 0.7390 | 0.7977 | 0.3401 | 0.4661 |
| ContrastiveNet | Test set | 0.3387 | 0.5873 | 0.6868 | 0.1594 | 0.2587 |
In terms of AUC, the ArcNet architecture also outperforms ContrastiveNet. The comparison shows that ArcNet achieves a better trade-off and better overall performance in distinguishing true and false positives and negatives. In addition, Fig. 5 shows the confusion matrices of the two models. ArcNet performs similarly to ContrastiveNet on the training set. By contrast, on the test set, ArcNet yields far fewer false positives (1616 vs. 2277), and it has more true positives (30784 vs. 30123) and true negatives (5318 vs. 4993) than ContrastiveNet.
Fig. 5. Confusion matrices of the prediction results of ArcNet and ContrastiveNet: (a) ArcNet for training, (b) ArcNet for testing, (c) ContrastiveNet for training, (d) ContrastiveNet for testing.
As can be seen, although the difference in recall (the metric describing how well the architecture recognizes positive samples) is extremely small, ArcNet still achieves better overall performance than ContrastiveNet according to the respective F1-scores.
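The F1-score is the harmonic mean of precision and recall. As a small worked example, the sketch below recomputes the ArcNet F1-scores from the precision and recall values listed in Table 4; the helper name `f1_score` is an illustrative choice, not part of the original implementation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall of ArcNet as reported in Table 4.
arcnet = {"training": (0.8056, 0.3307), "test": (0.7669, 0.1698)}
for split, (p, r) in arcnet.items():
    print(f"ArcNet {split}: F1 = {f1_score(p, r):.4f}")
```

Because the harmonic mean is dominated by the smaller of its two arguments, the low recall values keep the F1-scores well below the precision values.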
5 Conclusions and Future Work
When analyzing the results of the experiment, it is important to keep four factors in mind:
– The number of classes used for this experiment is extremely low, especially in comparison with the original one-shot learning paper [6], which used 30 classes for training alone, against our 9. Much better results can therefore be expected once more classes have been gathered.
– The same holds for the number of elements in each dataset, which is much smaller here than in Omniglot.
– The Omniglot dataset contains only 50 alphabets, while the immunofluorescence images used in this study contain many more cells, all different from one another. This makes a simple classifier impractical, given its inability to work on unseen data and its need to be retrained from scratch whenever a new class is added.
– The images belonging to these datasets are extremely hard to analyze due to their low quality, which adds further difficulty to the prediction system.
Given these premises, this experiment can be regarded as a solid foundation for an instrument that could be very helpful for cell recognition in the human body. With a larger number of classes available for training and testing, it would be possible, for example, to explore in more depth the interdependence between the data quantity, the structure adopted for the CNN, and hyperparameters such as the learning rate or the margin of the loss function. Ideally, this should be done after further investigating the number of classes, and the variety within each of them, that allows the system to reach its maximum performance. Such work could have consequences not only in bioinformatics but also in computer vision, especially given how omnipresent image recognition technologies have become.
The contributions of this study can be summarized as follows.
– To the best of our knowledge, this study is the first to introduce few-shot learning to the open-set problem in protein subcellular localization, and it achieves good results on unseen classes that are not present during training.
– We demonstrate that the contrastive learning scheme is effective for learning the representation of microscopic images, and that the proposed ArcNet improves the prediction accuracy compared with the conventional contrastive loss.
Funding. This work was supported by the National Natural Science Foundation of China (No. 61972251).
References
1. Briesemeister, S., et al.: Yloc-an interpretable web server for predicting subcellular localization. Nucleic Acids Res. 38(suppl 2), W497–W502 (2010)
2. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)
3. Emanuelsson, O., Nielsen, H., Brunak, S., Von Heijne, G.: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300(4), 1005–1016 (2000)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
5. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
6. Koch, G., et al.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2. Lille (2015)
7. Kumar, A., et al.: Automated analysis of immunohistochemistry images identifies candidate location biomarkers for cancers. Proc. Natl. Acad. Sci. 111(51), 18249–18254 (2014)
8. Liu, W., et al.: Sphereface: deep hypersphere embedding for face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220 (2017)
9. Long, W., Yang, Y., Shen, H.B.: Imploc: a multi-instance deep learning model for the prediction of protein subcellular localization based on immunohistochemistry images. Bioinformatics 36(7), 2244–2250 (2020)
10. Newberg, J., Murphy, R.F.: A framework for the automated analysis of subcellular patterns in human protein atlas images. J. Proteome Res. 7(6), 2300–2308 (2008)
11. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)
12. Uhlen, M., et al.: Towards a knowledge-based human protein atlas. Nat. Biotechnol. 28(12), 1248–1250 (2010)
13. Wang, H., et al.: Cosface: large margin cosine loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274 (2018)
14. Xu, Y.Y., Fan, Y., Shen, H.B.: Incorporating organelle correlations into semi-supervised learning for protein subcellular localization prediction. Bioinformatics (14), btw219 (2016)
15. Xu, Y.Y., Yang, F., Zhang, Y., Shen, H.B.: An image-based multi-label human protein subcellular localization predictor (iLocator) reveals protein mislocalizations in cancer tissues. Bioinformatics 29(16), 2032–2040 (2013)
16. Xu, Y.Y., Yang, F., Zhang, Y., Shen, H.B.: Bioimaging-based detection of mislocalized proteins in human cancers by semi-supervised learning. Bioinformatics (Oxford, England) 31, November 2014. https://doi.org/10.1093/bioinformatics/btu772
17. Zhou, H., Yang, Y., Shen, H.B.: Hum-mploc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features. Bioinformatics 33(6), 843–853 (2017)
A Novel Pseudo-Labeling Approach for Cell Detection Based on Adaptive Threshold
Tian Bai1,2(B), Zhenting Zhang1,2, Chen Zhao1,2, and Xiao Luo3
1 College of Computer Science and Technology, Jilin University, Changchun, China [email protected]
2 Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun, China
3 Department of Breast Surgery, China-Japan Union Hospital of Jilin University, Changchun, China
Abstract. Cell detection is not only significant to clinical diagnosis but also a challenging task in the field of computer-aided diagnosis. One reason for this challenge is that it is difficult to obtain sufficient labeled samples to train an excellent detection model. Labeling all cells in an image manually is a time-consuming task. In this article, we propose a semi-supervised learning approach that generates pseudo-labels for unlabeled samples, automatically extracting additional information for retraining the model and thereby reducing the manual labor cost. Differing from former pseudo-labeling methods, great efforts are made to boost the reliability of pseudo-labels. In our model, pseudo-labels are generated according to an adaptive threshold to reduce noisy labels and retain the effective information. Moreover, our model effectively avoids the impact of difficult-to-detect cells and inhomogeneous background in the image by distilling the training data with “patch attention” when leveraging samples with pseudo-labels for retraining. Extensive experiments have been conducted on two datasets to verify the performance of our method. With only 2 labeled images and M unlabeled images in a semi-supervised learning manner, we obtain a performance close to that of supervised learning with 2+M labeled images. It is worth mentioning that our model achieves state-of-the-art results compared with other existing semi-supervised methods.
Keywords: Adaptive threshold · Cell detection · Pseudo-labels · Semi-supervised
This work is supported by the Development Project of Jilin Province of China (Nos. 20200801033GH, YDZJ202101ZYTS128), Jilin Provincial Key Laboratory of Big Data Intelligent Computing (No. 20180622002JC), and the Fundamental Research Funds for the Central University, JLU.
© Springer Nature Switzerland AG 2021
Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 254–265, 2021. https://doi.org/10.1007/978-3-030-91415-8_22
1 Introduction
With the development of deep learning, Convolutional Neural Networks (CNNs) have made huge breakthroughs in medical image analysis, which has made computer-aided diagnosis more widely used in practice [9,10]. However, one point behind these achievements cannot be ignored: these results rely on a large number of manually labeled training samples. Even for experienced experts, manually labeling medical images, especially pathological images, is time-consuming and laborious. A 1000 × 1000 pathological slice usually contains hundreds or even thousands of cells. The varied morphologies of different cell types and the overlap between neighboring cells further increase the difficulty of manual labeling. In contrast, it is relatively easy to obtain a large number of unlabeled pathological slices. Hence, how to use the information provided by unlabeled images to improve the performance of neural networks has become an important problem.
In previous literature, only a small set of labeled images was used for histopathology image analysis in a supervised framework [1,14], so that a large number of unlabeled images, which can also provide beneficial information for model learning, were wasted. Some scholars have studied this issue. Guibas et al. [4] and Tang et al. [12] utilized GANs [3] to generate pseudo-labels for unlabeled images to obtain more training data. Lee and Jeong [6] leveraged the consistency of predictions to improve pseudo-labels by iteratively averaging predictions for microscopic images with scribble annotations. However, the above methods of generating pseudo-labels for unlabeled images share a common problem: the correctness of the pseudo-labels is unclear. Even though most pseudo-labels are reliable, wrong labels are inevitably generated, which leads to noisy annotations and reduces the validity of the retraining procedure. The impact of these noisy labels grows with the number of unlabeled images. A denoising threshold was applied to pseudo-label generation in [16,17] to reduce the noisy labels, but it was a global threshold that treated all unlabeled images equally and could not be adjusted for different images. A high global threshold markedly decreased the number of pseudo-labels, while a low global threshold admitted more noisy labels [17].
To solve this problem, we propose a pseudo-labeling method in this article, namely pseudo-labeling based on adaptive threshold (PLAT). It reduces the generation of noisy labels and improves the accuracy of pseudo-labels by applying an adaptive threshold. In addition, we avoid the negative effects of difficult-to-detect cells and inhomogeneous background in images by applying patch attention, which further improves the reliability of the training data when retraining with unlabeled images. Our algorithm contains two steps. First, we use a small number of labeled samples to pretrain a base model. Then, we input unlabeled images into the base network to generate reliable pseudo-labels based on the adaptive threshold, and retrain the network with labeled and unlabeled images in a semi-supervised manner. The experimental results show that, compared with the base model, the F1 score of the retrained model increases by 9%–13%, which is comparable to the results of supervised learning with all images labeled.
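The two-step procedure can be summarized as a short control-flow sketch. The callables `train` and `pseudo_label` are hypothetical placeholders for the detection network's training routine and for the adaptive-threshold labeling step; the fixed number of `rounds` is likewise an assumption made only for illustration.

```python
def self_training(labeled, unlabeled, train, pseudo_label, rounds=2):
    """Pretrain on labeled data, then alternate pseudo-labeling and retraining.

    train(model, labeled, pseudo) -> model   (model is None for pretraining)
    pseudo_label(model, image)    -> label   (hypothetical labeling step)
    """
    # Step 1: supervised pretraining of the base model on labeled data only.
    model = train(None, labeled, [])
    # Step 2: generate pseudo-labels with the current model and retrain.
    for _ in range(rounds):
        pseudo = [(image, pseudo_label(model, image)) for image in unlabeled]
        model = train(model, labeled, pseudo)
    return model
```

In use, `train` would wrap the optimization of the detection network and `pseudo_label` the thresholded local-maximum extraction; the sketch fixes only the order in which they are called.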
2 Method
2.1 Method Overview
Effective information from unlabeled images can be exploited to improve the performance of the model, which is the main advantage of semi-supervised learning. Consider a labeled dataset $D_L = \{(x_1^l, y_1^l), \ldots, (x_N^l, y_N^l)\}$, which contains $N$ labeled images $x_i^l$ and their labels $y_i^l$, and an unlabeled dataset $D_U = \{x_1^u, \ldots, x_M^u\}$, which contains only $M$ unlabeled images $x_j^u$. Here $x_i^l, x_j^u \in \mathbb{R}^{C \times W \times H}$, where $C$ is the number of channels in the images, and $W$ and $H$ are the width and height, respectively. $y_i^l$ is a binary image of the same size as $x_i^l$, and each pixel with a value of 1 in $y_i^l$ represents the label of a cell. A sample is shown in Fig. 1a. Our objective is to make the model trained on $D_L$ and $D_U$ perform better than the model trained on $D_L$ only. However, simple point annotation is not conducive to learning the morphology, size and texture of cells. In order to locate individual cells more accurately, we generate a proximity map $m_i^l$ for $y_i^l$ as in [15], using (1), where $M(u, v)$ is the value at pixel $(u, v)$, $D(u, v)$ is the Euclidean distance from pixel $(u, v)$ to the closest label, $d$ is the radius parameter, and $\alpha$ is the decay parameter. A pixel has a larger value the closer it is to a label, as shown in Fig. 1b.

Fig. 1. (a) Example and labels from NET. (b) Proximity map. (c) Prediction of base model.

$$M(u, v) = \begin{cases} \dfrac{e^{\alpha\left(1 - \frac{D(u,v)}{d}\right)} - 1}{e^{\alpha} - 1} & \text{if } D(u, v) \le d \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

$$y_j^u = \mathrm{sgn}(LM(p_j^u, r)) \tag{2}$$
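The proximity map of Eq. (1) can be sketched as follows. This is a brute-force illustration that computes $D(u, v)$ by comparing every pixel with every labeled pixel, which is only practical for small images; a distance transform would normally be used instead. The function name `proximity_map` is an illustrative choice, and the default values of `d` and `alpha` are those reported in the experiment details.

```python
import numpy as np

def proximity_map(labels, d=12.0, alpha=3.0):
    """Proximity map M(u, v) of Eq. (1) for a binary label image."""
    labels = np.asarray(labels)
    h, w = labels.shape
    points = np.argwhere(labels == 1)              # (n, 2) labeled pixels
    if len(points) == 0:
        return np.zeros((h, w))                    # no labels: empty map
    rows, cols = np.mgrid[0:h, 0:w]
    grid = np.stack([rows, cols], axis=-1)         # (h, w, 2) pixel coordinates
    # D(u, v): Euclidean distance from each pixel to its closest label.
    diff = grid[:, :, None, :] - points[None, None, :, :]
    dist = np.linalg.norm(diff, axis=-1).min(axis=2)
    prox = (np.exp(alpha * (1.0 - dist / d)) - 1.0) / (np.exp(alpha) - 1.0)
    # Pixels farther than d from every label are set to 0.
    return np.where(dist <= d, prox, 0.0)

# Toy example: a single labeled cell in the middle of a 40 x 40 image.
y = np.zeros((40, 40), dtype=int)
y[20, 20] = 1
m = proximity_map(y)
```

The map equals 1 at a labeled pixel and decays to 0 at distance $d$, so each cell is represented by a smooth blob rather than by a single point.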
First, we obtain the base model $M_{base}$ by training on $D_L$ only, in a supervised manner. Then, $M_{base}$ is used to generate pseudo-labels $y_j^u$ for $x_j^u$. To generate pseudo-labels, we input $x_j^u$ into $M_{base}$ to get the prediction $p_j^u = M_{base}(x_j^u)$, and regard each pixel containing a local maximum in the prediction as the pseudo-label of a cell, as expressed in (2). $LM$ is a function that takes the local maximum values in $p_j^u$, sgn is the sign function, and the parameter $r$ represents the local radius. We also compute the proximity map $m_j^u$ from $y_j^u$ as in pretraining, which is used as the ground truth of $x_j^u$ to retrain $M_{base}$ in a semi-supervised manner. The retrained model $M_{re}$ performs better than $M_{base}$. We use $M_{re}$ to generate pseudo-labels again, and perform a second retraining. Generation and retraining are repeated until the best result is achieved. The general process of our method is shown in Fig. 2. In pretraining and retraining, we use the same regression loss function (3), where $\beta$ and $\lambda$ are used to tune the weights of the losses coming from different parts of $p$, $m$ is a proximity map and $p$ is a predicted map [15].

Fig. 2. Illustration of the proposed semi-supervised learning framework for cell detection. The upper branch represents the pseudo-label generation process, while the lower one is the retraining process. As the retraining progresses, the generalization ability of the lower network gradually becomes stronger. (We use $m^u$ instead of $y^u$ for better viewing, and the following pictures do the same.)

$$l_{reg} = \frac{1}{2 \times W \times H} \sum_{w=1}^{W} \sum_{h=1}^{H} (\beta m_{w,h} + \lambda \bar{m})(m_{w,h} - p_{w,h})^2, \qquad \bar{m} = \frac{1}{W \times H} \sum_{w=1}^{W} \sum_{h=1}^{H} m_{w,h} \tag{3}$$

2.2 Pseudo-Labeling Based on Adaptive Threshold (PLAT)
$M_{base}$ will produce many false results at test time because of the limited number of learning samples in $D_L$, which are not enough for the model to learn the shape, color and geometric characteristics of cells thoroughly. As a result, $M_{base}$ can only detect some easy-to-detect cells and performs poorly in densely-celled and light-colored regions. As can be seen from Fig. 1c, the boundaries of cells in the prediction are blurred, most cells have a variety of shapes, and there is no obvious difference in brightness between the center and the boundary of cells. Even worse, non-zero pixels appear where there are no cells. If the predictions output by $M_{base}$ are used to generate pseudo-labels directly according to (2), a large number of noisy labels will be generated, including labeling background as cells and generating multiple pseudo-labels for one cell. In the process of pseudo-label generation, non-maximum suppression is adopted to set pixels below a certain threshold to 0, which reduces noisy labels significantly. A global threshold, set in advance for all predictions, was employed in [15]. However, it may not be the optimal denoising threshold for all images because of the various cell types and background environments. Therefore, we propose an adaptive threshold $t$ based on epistemic uncertainty [5] to eliminate noisy labels.

$$E(\tilde{p}_{w,h}) = \frac{1}{K} \sum_{k=1}^{K} \tilde{p}^{k}_{w,h}$$

$$t = \frac{1}{W \times H} \sum_{w=1}^{W} \sum_{h=1}^{H} \frac{1}{K} \sum_{k=1}^{K} \left(\tilde{p}^{k}_{w,h} - E(\tilde{p}_{w,h})\right)^2 \tag{4}$$

$$t_j = \frac{t_j}{Max\{t_1, \ldots, t_M\}}$$

The lack of training samples leads to great epistemic uncertainty of $M_{base}$ [5], which can be used to evaluate the complexity of the unlabeled images. To estimate the epistemic uncertainty of $M_{base}$, the following steps are conducted. First, input $x_j^u$ into $M_{base}$ $K$ times with Monte Carlo Dropout (MC-Dropout) [2] to obtain $K$ different predictions $\{\tilde{p}_j^1, \tilde{p}_j^2, \ldots, \tilde{p}_j^K\}$. Then, calculate the variance of the $K$ predictions to represent the uncertainty of each pixel, and use the mean of the variances of all pixels to represent the uncertainty of an image. Finally, we divide by the largest variance among all predictions to normalize the results, and the normalized result is used as the adaptive threshold, as expressed in (4) (for convenience, we omit the subscript $j$ in the first two equations), where $(w, h)$ are the coordinates of a pixel. A large uncertainty means that the cells in the image are more difficult to detect, so a larger threshold is necessary to suppress noisy predictions. We set to zero the pixels in $p_j^u$ whose values are lower than $t_j \times Max(p_j^u)$, where $p_j^u$ is the prediction without dropout layers, and then take the local maxima as pseudo-labels for $x_j^u$. The adaptive threshold varies for each predicted map, depending on the complexity of cellular shape in the image. Images with abundant difficult-to-detect cells get a higher threshold, while images with a simple background get a lower threshold. This addresses the problem of a preset global denoising threshold, which is too high for simple images, so that fewer pseudo-labels are generated, and too low for complex ones, so that massive noisy labels appear. An image with a threshold of 1 will not have any pseudo-labels. Therefore, Eq. (2) can be expanded as follows:

$$y_j^u = \mathrm{sgn}(LM(p_j^u, r, t_j)) \tag{5}$$
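The threshold computation of Eq. (4) and the subsequent suppression step can be sketched as below. The sketch assumes that the $K$ stochastic MC-Dropout predictions for each image are already available as arrays; the function names are illustrative, and the local-maximum extraction $LM$ of Eq. (5) is deliberately left out.

```python
import numpy as np

def image_uncertainty(mc_preds):
    """Mean over pixels of the per-pixel variance of K predictions.

    mc_preds -- array of shape (K, H, W) from K stochastic forward passes.
    """
    mc_preds = np.asarray(mc_preds, dtype=float)
    # np.var uses the 1/K normalization of Eq. (4) by default (ddof=0).
    return float(mc_preds.var(axis=0).mean())

def adaptive_thresholds(uncertainties):
    """Normalize image uncertainties by their maximum to obtain t_j."""
    t = np.asarray(uncertainties, dtype=float)
    peak = t.max()
    return t / peak if peak > 0 else np.zeros_like(t)

def suppress(pred, t_j):
    """Zero the pixels of a prediction that lie below t_j * max(pred)."""
    pred = np.asarray(pred, dtype=float)
    return np.where(pred < t_j * pred.max(), 0.0, pred)

# Toy example: image A has stable predictions, image B has unstable ones.
preds_a = np.full((2, 2, 2), 0.5)
preds_b = np.stack([np.zeros((2, 2)), np.ones((2, 2))])
t = adaptive_thresholds([image_uncertainty(preds_a), image_uncertainty(preds_b)])
cleaned = suppress([[0.2, 1.0], [0.6, 0.4]], 0.5)
```

In the toy example the stable image receives threshold 0 and the unstable one threshold 1, which mirrors the statement that harder images are thresholded more aggressively.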
2.3 Retrain and Patch Attention
Pseudo-labels are taken as centers to generate a proximity map $m^u$ for $y^u$ before retraining, and $M_{base}$ is trained on $D_L$ and $D_U$ again. Since there are many more unlabeled images than labeled images, the influence of false information in pseudo-labels will be amplified. To avoid this, we oversample $D_L$ and set the iteration ratio of unlabeled to labeled images to 1:n in batches. When n > 1, the multiple iterations over labeled images offset the adverse effects of false information in pseudo-labels. The detection capability of $M_{base}$ is limited, especially in densely-celled and light-colored regions. In addition, the adaptive threshold calculated from epistemic uncertainty tends to take a higher value to guarantee high-quality pseudo-labels, so that some difficult-to-detect cells are discarded. Therefore, the number of pseudo-labels is usually smaller than the number of true cells in an unlabeled image. As retraining progresses, the detection ability of $M_{re}$ is gradually enhanced, so some difficult-to-detect cells will appear in the predicted map. However, the lack of corresponding pseudo-labels in $m^u$ causes these cells to be mistakenly treated as background, resulting in fewer and fewer cells being detected by the network. To tackle this problem, we adopt “patch attention” to ensure the effectiveness of the training samples. Patch attention sets to 0 the pixels in the predicted map that are far away from pseudo-labels. We find all pixels with a value of 1 in $y^u$, and crop squares of side length $l$ in the predicted map centered on these pixels. All the cropped regions form a new predicted map, as shown in Fig. 3. In the new map, the pixels outside the squares are set to 0, which avoids the problem that regions where cells actually exist are incorrectly identified as background. It is worth noting that “patch attention” prevents erroneous gradient updates caused by misidentifying difficult-to-detect cells, while not restraining the prediction of these cells.
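Patch attention as described above amounts to masking the predicted map with squares placed on the pseudo-labels. The following is a minimal sketch under the assumption that a square of even side length $l$ covers the pixels from `r - l/2` to `r + l/2 - 1` around a label at row `r` (and likewise for columns); the function name and this centering convention are illustrative choices.

```python
import numpy as np

def patch_attention(pred, pseudo_labels, l=24):
    """Keep the prediction only inside l x l squares around pseudo-labels."""
    pred = np.asarray(pred, dtype=float)
    mask = np.zeros(pred.shape, dtype=bool)
    half = l // 2
    for r, c in zip(*np.nonzero(pseudo_labels)):
        # Squares are clipped at the image border.
        mask[max(r - half, 0):r + half, max(c - half, 0):c + half] = True
    # Pixels outside every square are set to 0.
    return np.where(mask, pred, 0.0)

# Toy example: a constant prediction and one pseudo-label at (25, 25).
pred = np.ones((50, 50))
y_u = np.zeros((50, 50), dtype=int)
y_u[25, 25] = 1
masked = patch_attention(pred, y_u, l=24)
```

Because the masked prediction and the pseudo-label proximity map are both zero outside the squares, those regions contribute no error to the regression loss, so cells without pseudo-labels are not pushed towards background.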
3 Experiment
3.1 Dataset
We evaluate our method on two datasets, NET and CA. NET contains 100 Ki67 immunohistochemically stained pancreatic neuroendocrine tumor microscopic images with a size of 500 × 500, and each image contains hundreds of cells. CA contains 100 hematoxylin and eosin (H&E) stained histology images of colorectal adenocarcinomas [11], also with a size of 500 × 500. Both datasets provide the coordinates of nuclei as true labels, and they are acquired with different staining methods. We group the 100 images as follows: the training set consists of 2 labeled images and M (M = 2, 5, 10, 20, 40) unlabeled images, while the validation set and the test set contain 10 labeled images each. Each image in the training set is cropped into four 256 × 256 images, while the images in the validation set and test set are not cropped, because our model can detect images of any size.
Fig. 3. Visualization of patch attention. (a) Example from NET. (b) Pseudo-labels generated by $M_{base}$. (c) Predicted map generated by $M_{re}$. (d) New predicted map with patch attention on (c). Some difficult-to-detect cells in yellow dotted boxes are discarded by $M_{base}$ in (b) and are detected again by $M_{re}$ in (c). We suppress these pixels to 0 in (c) to avoid real cells being identified as background during retraining. Furthermore, some messy backgrounds in red dotted boxes are removed. (M = 5) (Color figure online)
3.2 Experiment Detail
In our experiments, we use FRCN [15] as the detection network in the model. The hyperparameters are set as follows: $d = 12$ and $\alpha = 3$ in (1), and $K = 10$ and $r = 6$ when generating pseudo-labels. In “patch attention”, each patch is a square area with $l = 24$. The iteration ratio of unlabeled to labeled images is 1:5, and $\beta = 0.2$, $\lambda = 1$ in (3). We use the Adam algorithm to train the model during pretraining and retraining, with $\beta_1 = 0.9$, $\beta_2 = 0.999$, learning rate = 10e-4 and dropout rate = 0.5. Since unlabeled images account for a large proportion, one pass over all unlabeled images counts as an epoch, and training is terminated when the loss on the validation set does not decrease for 20 consecutive epochs. We use precision (P), recall (R), and F1 score as evaluation criteria, and use the Hungarian algorithm to match prediction results with real annotations. When evaluating the model, the threshold for non-maximum suppression is 0.5.
3.3 Experiment Result
Figure 4 shows the differences between pseudo-labels and real labels. It can be clearly observed from Fig. 4c that the pseudo-labels generated by $M_{base}$ are very concentrated, especially for large cells: $M_{base}$ generates multiple pseudo-labels for one cell, and these pseudo-labels are very close to each other. This reflects underfitting of the model caused by insufficient training samples. After the first retraining, we use $M_{re}$ to generate pseudo-labels for the second time. These more accurate pseudo-labels are used for the second retraining, which further improves the performance of the model. Figure 4d also demonstrates that generating pseudo-labels multiple times alleviates the problem of dense distribution and improves the confidence of the pseudo-labels.
Fig. 4. Images and their labels. (a) Examples from NET and CA. (b) Real labels. (c) Pseudo-labels generated by $M_{base}$. (d) Pseudo-labels generated by $M_{re}$. (M = 40)
PLAT is evaluated with different numbers of unlabeled images on NET and CA, and the results are listed in the last column of Table 1. M = 0 means that only 2 labeled images are used for supervised learning, and this result is taken as the baseline of our experiment. It can be observed from Table 1 that, compared with the baseline, the detection performance of the retrained model improves significantly after adding unlabeled images. This is particularly evident on NET, where the F1 score increases by up to 13%. On CA, the F1 score increases by 9%. On NET, the results differ only slightly as the number of unlabeled images increases. This is because the cells in NET have relatively regular shapes and are easy to learn. When M = 2, although only a small number of pseudo-labels are generated, almost all of them are true positives, which enables the model to learn correct information. As more unlabeled images are added, the number of accurate pseudo-labels grows much faster than the number of noisy labels, so our model still obtains stable performance. However, the cells in CA have varied morphologies. When the number of unlabeled images increases, more valid information is added in retraining, making the generalization ability of $M_{re}$ stronger, and the F1 scores gradually increase. In conclusion, our method gives good results on different types of cells. The performance of the model improves on both datasets as the number of unlabeled images increases, and the improvement is more obvious for cells with complex shapes than for regular ones.
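The evaluation protocol (Hungarian matching followed by precision, recall and F1) can be sketched with SciPy's `linear_sum_assignment`. The matching radius `radius` is an assumed parameter, since the text does not state how close a prediction must be to an annotation to count as a hit, and the function name `detection_scores` is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def detection_scores(pred, truth, radius=6.0):
    """Precision, recall and F1 after one-to-one matching of detections.

    pred, truth -- sequences of (row, col) coordinates.
    radius      -- assumed maximum distance for a match to count as a hit.
    """
    pred = np.asarray(pred, dtype=float).reshape(-1, 2)
    truth = np.asarray(truth, dtype=float).reshape(-1, 2)
    tp = 0
    if len(pred) and len(truth):
        # Pairwise Euclidean distances serve as the assignment cost.
        cost = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)   # Hungarian-style matching
        tp = int((cost[rows, cols] <= radius).sum())
    precision = tp / len(pred) if len(pred) else 0.0
    recall = tp / len(truth) if len(truth) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy example: three detections, two annotated cells.
p, r, f1 = detection_scores([(10, 10), (50, 50), (90, 90)], [(11, 10), (52, 50)])
```

The sketch minimizes the total matching distance and then discards matches beyond the radius, which is a simplification; other gating strategies are possible and may count slightly differently in crowded regions.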
Table 1. Comparison of global threshold and adaptive threshold

| Dataset | 2+M | Threshold = 0.0: P | R | F1 | Threshold = 0.5: P | R | F1 | PLAT: P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| NET | 2+0 | – | – | – | – | – | – | 0.760 | 0.773 | 0.766 |
| NET | 2+2 | 0.901 | 0.882 | 0.892 | 0.947 | 0.825 | 0.882 | 0.922 | 0.880 | 0.901 |
| NET | 2+5 | 0.875 | 0.877 | 0.876 | 0.899 | 0.881 | 0.890 | 0.916 | 0.863 | 0.889 |
| NET | 2+10 | 0.909 | 0.853 | 0.880 | 0.907 | 0.870 | 0.888 | 0.897 | 0.887 | 0.892 |
| NET | 2+20 | 0.908 | 0.885 | 0.896 | 0.908 | 0.875 | 0.891 | 0.920 | 0.880 | 0.900 |
| NET | 2+40 | 0.913 | 0.856 | 0.883 | 0.927 | 0.862 | 0.893 | 0.870 | 0.915 | 0.892 |
| CA | 2+0 | – | – | – | – | – | – | 0.626 | 0.757 | 0.685 |
| CA | 2+2 | 0.772 | 0.733 | 0.752 | 0.759 | 0.744 | 0.751 | 0.777 | 0.731 | 0.753 |
| CA | 2+5 | 0.776 | 0.757 | 0.766 | 0.746 | 0.726 | 0.736 | 0.786 | 0.746 | 0.765 |
| CA | 2+10 | 0.779 | 0.714 | 0.745 | 0.817 | 0.674 | 0.739 | 0.782 | 0.738 | 0.760 |
| CA | 2+20 | 0.719 | 0.760 | 0.739 | 0.754 | 0.720 | 0.736 | 0.764 | 0.764 | 0.764 |
| CA | 2+40 | 0.774 | 0.684 | 0.726 | 0.798 | 0.707 | 0.750 | 0.758 | 0.792 | 0.775 |

3.4 Ablation Study
To verify the effectiveness of the adaptive threshold, we compare the results of PLAT with those obtained using a global threshold of 0.0 and a global threshold of 0.5, the conventional cut point for binary classification problems. The comparison is given in Table 1, where the best F1 score is shown in bold. On both NET and CA, PLAT almost always achieves the best F1 score. With a global threshold of 0.5, the F1 scores on CA fluctuate with the number of unlabeled images, and precision (P) is greater than recall (R); the gap is especially large when M = 10 and M = 40. This indicates that using 0.5 as the denoising threshold removes many correctly predicted cells. A similar situation appears on NET. With a global threshold of 0.0, the F1 score on CA decreases as the number of unlabeled images increases, because of excessive noisy annotations. In contrast, the adaptive threshold is adjusted according to the morphology and distribution of the cells and the complexity of the image background, so it retains effective information while reducing noisy labels. We also conduct an ablation experiment that disables “patch attention” to verify its effectiveness. Figure 5 shows that the results with patch attention are almost always better than those without it. Without patch attention, some real cells are incorrectly identified as background during retraining. The predicted map before and after the “patch attention” operation is shown in Fig. 3. Because Mbase performs poorly, some cells, such as those marked by yellow dashed lines, are difficult for it to detect. Therefore, no pseudo-labels are generated for these cells during the first round of pseudo-labeling.
As retraining progresses, the detection ability of Mre gradually improves. Cells that Mbase failed to detect at the beginning now appear in the prediction map, but they have no corresponding pseudo-labels. As a result, these cells would gradually be mistaken for background. We therefore manually set the weakly responsive cells and the background in the predicted maps to 0, which prevents difficult-to-detect cells from being incorrectly learned as background. After the first retraining, these difficult-to-detect cells are identified, and more pseudo-labels are obtained to further improve the performance of the model.
Fig. 5. Ablation study of “patch attention”. “PA” means “Patch Attention”; “+” represents retraining with PA and “−” represents retraining without PA.
3.5 Comparison to Previous Work
We also compare our method with currently popular methods: mean teacher (MT) [13], deep adversarial networks (DAN) [18], cross-consistency training (CCT) [8], and dual-task consistency (DTC) [7]. To ensure a fair comparison, we use identical data grouping for all methods. The comparison results are shown in Table 2. The last column gives the results of supervised learning with 2 + M labeled images. The best results in the table are shown in bold. The experimental results indicate that our pseudo-labeling method achieves the best F1 score among the semi-supervised methods, close to the result of supervised learning. When M = 2, the semi-supervised result even exceeds that of supervised learning. Other semi-supervised methods rely on predicted maps or feature maps to design a new loss function for unsupervised training. Although the supervised loss enforces consistency between predictions and real annotations, it accounts for only a small proportion of model training, and the predictions on unlabeled images still contain a large amount of noisy information. We adopt the same preprocessing for pseudo-labels as for real labels when generating proximity maps. From the perspective of a single cell, real labels and pseudo-labels therefore share the same distribution, so pseudo-labels reproduce the same morphological characteristics of cells as real labels. We also use adaptive thresholds and patch attention to ensure accurate center coordinates of nuclei. The location information and morphological information of the pseudo-labels work together to improve the performance of cell detection.
Table 2. Comparison of our method and other semi-supervised methods (all values are F1 scores)

| Dataset | 2+M | DAN | MT | CCT | DTC | PLAT | Supervised |
|---|---|---|---|---|---|---|---|
| NET | 2+2 | 0.820 | 0.823 | 0.819 | 0.828 | 0.901 | 0.892 |
| NET | 2+5 | 0.868 | 0.834 | 0.817 | 0.889 | 0.889 | 0.898 |
| NET | 2+10 | 0.864 | 0.879 | 0.837 | 0.760 | 0.892 | 0.904 |
| NET | 2+20 | 0.863 | 0.879 | 0.778 | 0.883 | 0.900 | 0.928 |
| NET | 2+40 | 0.851 | 0.888 | 0.794 | 0.877 | 0.892 | 0.933 |
| CA | 2+2 | 0.718 | 0.737 | 0.739 | 0.726 | 0.753 | 0.710 |
| CA | 2+5 | 0.761 | 0.739 | 0.741 | 0.710 | 0.765 | 0.768 |
| CA | 2+10 | 0.745 | 0.734 | 0.707 | 0.756 | 0.760 | 0.780 |
| CA | 2+20 | 0.748 | 0.756 | 0.634 | 0.682 | 0.764 | 0.806 |
| CA | 2+40 | 0.737 | 0.745 | 0.712 | 0.708 | 0.775 | 0.813 |

4 Conclusion
In this paper, we complete the semi-supervised learning task of cell detection in two stages. In the first stage, we propose an adaptive threshold for generating pseudo-labels, which ensures the reliability of the pseudo-labels and thereby reduces the negative impact of noisy labels in retraining. In the second stage, we use “patch attention” to screen the cells participating in retraining at the cellular level, removing some difficult-to-detect cells and protecting real cells from being identified as background. We have conducted extensive comparative experiments on different types of datasets. The experimental results show that our method performs well on different datasets and is applicable to images acquired in diverse ways.
References
1. Bai, T., Xu, J., Xing, F.: Multi-field of view aggregation and context encoding for single-stage nucleus recognition. In: Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (eds.) MICCAI 2020. LNCS, vol. 12265, pp. 382–392. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59722-1_37
2. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learning, pp. 1050–1059. PMLR (2016)
3. Goodfellow, I.J., et al.: Generative adversarial networks. arXiv preprint arXiv:1406.2661 (2014)
4. Guibas, J.T., Virdi, T.S., Li, P.S.: Synthetic medical images from dual generative adversarial networks. arXiv preprint arXiv:1709.01872 (2017)
5. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977 (2017)
6. Lee, H., Jeong, W.K.: Scribble2label: scribble-supervised cell segmentation via self-generating pseudo-labels with consistency (2020)
7. Luo, X., Chen, J., Song, T., Chen, Y., Wang, G., Zhang, S.: Semi-supervised medical image segmentation through dual-task consistency. arXiv preprint arXiv:2009.04448 (2020)
8. Ouali, Y., Hudelot, C., Tami, M.: Semi-supervised semantic segmentation with cross-consistency training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12674–12684 (2020)
9. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
10. Shen, D., Wu, G., Suk, H.I.: Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017)
11. Sirinukunwattana, K., Raza, S.E.A., Tsang, Y.W., Snead, D.R., Cree, I.A., Rajpoot, N.M.: Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans. Med. Imaging 35(5), 1196–1206 (2016)
12. Tang, Y.B., Oh, S., Tang, Y.X., Xiao, J., Summers, R.M.: CT-realistic data augmentation using generative adversarial network for robust lymph node segmentation. In: Medical Imaging 2019: Computer-Aided Diagnosis, vol. 10950, p. 109503V. International Society for Optics and Photonics (2019)
13. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780 (2017)
14. Wang, S., Jia, C., Chen, Z., Gao, X.: Signet ring cell detection with classification reinforcement detection network. In: Cai, Z., Mandoiu, I., Narasimhan, G., Skums, P., Guo, X. (eds.) ISBRA 2020. LNCS, vol. 12304, pp. 13–25. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57821-3_2
15. Xie, Y., Xing, F., Shi, X., Kong, X., Su, H., Yang, L.: Efficient and robust cell detection: a structured regression approach. Med. Image Anal. 44, 245–254 (2018)
16. Xing, F., Bennett, T., Ghosh, D.: Adversarial domain adaptation and pseudo-labeling for cross-modality microscopy image quantification. In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (eds.) MICCAI 2019. LNCS, vol. 11764, pp. 740–749. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32239-7_82
17. Xing, F., Cornish, T.C., Bennett, T.D., Ghosh, D.: Bidirectional mapping-based domain adaptation for nucleus detection in cross-modality microscopy images. IEEE Trans. Med. Imaging, 1 (2020). https://doi.org/10.1109/TMI.2020.3042789
18. Zhang, Y., Yang, L., Chen, J., Fredericksen, M., Hughes, D.P., Chen, D.Z.: Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 408–416. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_47
Parameter Transfer Learning Measured by Image Similarity to Detect CT of COVID-19
Chang Zhao and Shunfang Wang
Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, Yunnan, People’s Republic of China
Abstract. COVID-19 has spread throughout the world since 2019, and the epidemic has placed huge demands on COVID-19 detection performance. This paper proposes the ParNet model for detecting COVID-19 from CT images. ParNet uses parameter transfer learning to initialize its weights from weights pre-trained on ImageNet, and the rationality of this choice is verified theoretically with four image similarity measures that view similarity from different angles: cosine similarity, image average Hash, perceptual Hash, and difference Hash. To improve the performance of ParNet, a parallel channel and spatial attention mechanism replaces the channel attention mechanism, and the Swish activation function replaces the ReLU activation function. Compared with classic and state-of-the-art models, ParNet achieves better performance. Source code is publicly available.
Keywords: Parameter transfer learning · Image similarity · Parallel channel and spatial attention mechanism · Swish activation function
1 Introduction
The COVID-19 epidemic has been breaking out in various parts of the world since 2019. It is difficult to prevent and control because of its extremely contagious nature, the general susceptibility of the population, and its combination with underlying diseases. Severe symptoms can even be life-threatening. Even with strengthened preventive measures and control of transmission routes, the number of new cases worldwide is still increasing every day. As of April 11, 2021, there were 131.4 million cases of COVID-19 infection worldwide and more than one million deaths. The detection of COVID-19 is therefore extremely important, since early detection allows early treatment. The current detection methods for COVID-19 are mainly oropharyngeal swabs, nasopharyngeal swabs, and anal swabs. However, swab collection carries a risk of infection, and the detection cycle is long. In addition, some countries have poor medical facilities and cannot provide effective protection during swab collection, which can lead to infection. In order to detect COVID-19 quickly and accurately, this paper proposes the ParNet model.
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 266–278, 2021. https://doi.org/10.1007/978-3-030-91415-8_23
In the early stage of COVID-19, the disease can be recognized by multi-lobed ground-glass shadows distributed on both sides, at the periphery, or at the rear of the lungs, and in the middle and late stages by the superposition of multi-lobed ground-glass shadows with grid shadows or paving-stone patterns. This judgment is based on image differences in CT: the difference between normal and abnormal areas is clearly visible on CT. Manual CT diagnosis based on morphology exists, but it is time-consuming and error-prone. The ParNet network proposed in this paper performs image-based detection of COVID-19, which avoids the long detection cycle and achieves excellent performance. The main contributions of this paper are as follows:
• ParNet uses parameter transfer learning, which differs from the feature transfer learning of Sakshi Ahuja. Parameter transfer learning is chosen because the main characteristics of COVID-19 on CT are peripheral distribution and ground-glass appearance, which are essentially differences in image texture, edge information, and internal information. Therefore, only initialization weights from a model that already detects images well are needed (the pre-trained weights used in this paper are pre-trained on ImageNet); these weights are then updated by training, and an accurate model is obtained quickly. To highlight the advantage of parameter transfer learning, ParNet trained without transfer learning is compared with ParNet to show that the initialization weights obtained by parameter transfer learning improve ParNet's metrics. To justify parameter transfer learning theoretically, some intermediate-layer feature maps are visualized, and the rationality of parameter transfer learning is verified with four measures: cosine similarity, average Hash (aHash), perceptual Hash (pHash), and difference Hash (dHash).
• The channel attention mechanism Spatial Squeeze and Channel Excitation (cSE) ignores the spatial information of the image. ParNet uses a parallel channel and spatial attention mechanism, which attends to both the channel information and the spatial information of the image and assigns different weights to them, highlighting important features, suppressing secondary features, and further improving the performance of the model.
• ParNet classifies input medical image data that is not linearly separable, so a non-linear function must be used as the activation function. Spatial Squeeze and Channel Excitation (cSE) uses ReLU as the activation function of its first fully connected layer. ParNet replaces this ReLU with the Swish activation function; in our comparison, the Swish activation function performs better than the ReLU activation function.
2 Related Work
2.1 Image and Conclusion Model
Many researchers have proposed solutions for detecting from lung CT images whether a patient has COVID-19. Huazhong University of Science and Technology published and analyzed a dataset of COVID-19 CT images in the open resource of clinical data from patients [1]. Debanjan Konar proposed an automatic diagnosis of COVID-19 in lung CT images based on a semi-supervised shallow learning network, which uses an N-connected second-order neighborhood-based topology to interconnect a parallel trinity of qubit-based layered structures and segments lung CT slices with large local intensity variations [2]. Sakshi Ahuja proposed a three-phase detection model: stationary wavelets are first used for data augmentation, COVID-19 is then detected, and finally the abnormal area of the CT scan image is localized [3]. Varalakshmi Perumal proposed detecting COVID-19 from CXR and CT images based on transfer learning and Haralick features. The article explains the reasons for selecting Haralick features, so the final prediction results have a certain degree of interpretability [4]. Zhou Tao used three pre-trained deep convolutional neural network models, AlexNet, GoogLeNet, and ResNet, to extract features from all images. The three models are combined by relative majority voting into the ensemble classifier EDL-COVID. Compared with the three individual classifiers, the ensemble classifier achieves better detection performance and speed [5]. Parisa Gifani used 15 pre-trained convolutional neural networks for feature learning through transfer learning, and the learned features are fed into an ensemble model for voting. The results show that the proposed model performs well in terms of accuracy and recall [6].
2.2 Attention Model
Attention mechanisms include channel attention, spatial attention, self-attention, category attention, and others. This paper considers channel attention and spatial attention. The channel attention mechanism of Squeeze-and-Excitation Networks uses global average pooling to obtain a global receptive field, turning the information of each feature channel into a single value, and then uses a multilayer perceptron to perform nonlinear transformations that model the correlation between channels. Finally, channel excitation strengthens the important information of the feature map and weakens its secondary information [7]. Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks is more general in form than Squeeze-and-Excitation Networks. From the perspective of context, it makes full use of spatial attention to mine the contextual information of features [8]. Spatial attention processes the features in every channel equally and therefore ignores the information interaction between channels; channel attention processes the information within a channel globally and therefore easily ignores the information interaction in space; mixed attention combines the channel domain, the spatial domain, and other forms of attention into a more comprehensive feature attention method. Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks (scSE) proposes a parallel spatial and channel attention mechanism consisting of two branches, cSE [7] and Channel Squeeze and Spatial Excitation (sSE). cSE redistributes the weights of feature maps in the same way as Squeeze-and-Excitation Networks; sSE compresses the channels, extracts fine-grained image features, and redistributes weights in the spatial dimension; scSE reweights the feature map along both space and channels and adds the two results to produce the final output [9].
2.3 Activation Function
The input data of a deep learning model is mostly not linearly separable, while a stack of layers without activation functions is linear. To enhance the learning ability of the model, a nonlinear activation function is necessary. Sigmoid(x) = 1/(1 + e^(−x)) is smooth and easy to differentiate, but in multi-layer models the gradient becomes very small and easily vanishes [10]. For x > 0, the derivative of ReLU(x) = max(0, x) is always 1, so the gradient does not vanish; for x < 0, however, the derivative is 0, some neurons can no longer be activated, and effective image features cannot be learned [11]. Swish(x) = x * Sigmoid(x) is differentiable everywhere, and its gradient is nonzero for almost all inputs (it vanishes only at a single point near x ≈ −1.28), which mitigates the vanishing-gradient problem and improves model training [12].
3 Method
The structure of ParNet is shown in Fig. 1, where Conv3*3 denotes an ordinary 3*3 convolution kernel, C the number of channels, and S the stride of the convolution kernel. ModConv denotes modular convolution, where K is the kernel size of the depthwise separable convolution and n is the number of kernels in the first 1*1 convolution of the modular convolution [13]. The important parts of ParNet are as follows. The ordinary convolutions used by ParNet only increase or reduce the channel dimension and no longer perform feature extraction. ParNet is initialized with pre-trained weights and then trained further. Depthwise separable convolution extracts features for each channel separately, which reduces the number of trainable parameters without reducing performance [14]. The concurrent spatial and channel squeeze and excitation (scSE) module is added after the depthwise separable convolution. In the channel attention branch of the scSE module, the ReLU activation function of the first fully connected layer is replaced with the Swish activation function [12].
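To make the parameter saving of depthwise separable convolution concrete, the following sketch counts the weights of a standard convolution and of a depthwise convolution followed by a 1*1 pointwise convolution. The layer sizes (3*3 kernel, 64 input channels, 128 output channels) are illustrative assumptions and are not taken from ParNet; biases are ignored.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k depthwise convolution plus a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative sizes only (assumed, not from the paper).
k, c_in, c_out = 3, 64, 128
standard = conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)

print(standard)                         # 73728
print(separable)                        # 8768
print(round(standard / separable, 1))   # 8.4
```

For these assumed sizes the separable variant needs roughly one eighth of the weights of the standard convolution.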
Fig. 1. Schematic diagram of ParNet
3.1 Parameter Transfer Learning
Varalakshmi Perumal proposed detecting COVID-19 from CXR and CT images based on transfer learning and Haralick features. The article explains the reasons for selecting Haralick features, which give the final prediction results some interpretability. Sakshi Ahuja’s transfer learning principle is to exploit the similarity between the source domain (viral pneumonia) and the target domain (COVID-19): the weights pre-trained on viral pneumonia are transferred to the detection of COVID-19, so the model trains faster and finally achieves a relatively satisfactory detection result [4, 20]. Varalakshmi Perumal’s method has strong limitations. It must find diseases with similar features to transfer from; if a new disease has no similar counterpart, transfer learning cannot be performed, which reduces the training speed and accuracy of the model. Building on the transfer learning model of Varalakshmi Perumal, this paper proposes a transfer learning method that does not require training on the features of similar diseases [21]. This paper mainly considers medical morphology. Doctors can determine whether a patient has COVID-19 from CT, and a CT scan is essentially an image. What distinguishes COVID-19 from the viral pneumonia above is the ground-glass feature, which differs markedly from the normal lung area in the image, so weights pre-trained on ImageNet are used for parameter transfer learning [15]. In a deep learning model, the deeper the layer, the more abstract its features. Taking a COVID-19 CT image from the dataset as an example, the low-level features are visualized; the original image and the low-level feature maps are shown in Fig. 2a and Fig. 2b. The visualization covers only the first 8 layers, ordered from top to bottom, with the first 10 feature maps of each layer. It can be clearly seen that ParNet extracts the edge information of the image.
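The idea of parameter transfer, initializing a model from pre-trained weights and then continuing training, can be sketched without any deep learning framework. The snippet below is a minimal illustration under stated assumptions: parameters are stored as NumPy arrays in dictionaries, the layer names and shapes are hypothetical, and layers whose shapes differ (such as a 1000-class ImageNet head versus a 3-class CT head) are assumed to keep their own initialization. It is not the authors' implementation.

```python
import numpy as np


def transfer_parameters(pretrained: dict, target: dict) -> list:
    """Copy each pretrained array whose name and shape match a target parameter.

    Returns the names that were transferred; every other target parameter
    keeps its existing initialization and is learned from scratch.
    """
    transferred = []
    for name, weights in pretrained.items():
        if name in target and target[name].shape == weights.shape:
            target[name] = weights.copy()
            transferred.append(name)
    return transferred


rng = np.random.default_rng(0)

# Hypothetical layer names and shapes, for illustration only.
pretrained = {
    "stem.conv": rng.normal(size=(32, 3, 3, 3)),
    "head.fc": rng.normal(size=(1000, 1280)),   # 1000 ImageNet classes
}
target = {
    "stem.conv": np.zeros((32, 3, 3, 3)),
    "head.fc": np.zeros((3, 1280)),             # 3 CT classes
}

print(transfer_parameters(pretrained, target))  # ['stem.conv']
```

After the copy, all parameters, transferred or not, would be updated during training on the CT data.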
To demonstrate the feasibility of parameter transfer learning, we first confirm that the ImageNet dataset contains no pneumonia or COVID-19 data, which rules out feature transfer learning. We then randomly select the class “dog” from ImageNet and visualize the low-level features of a dog image with ParNet. The original image and the low-level feature maps are shown in Fig. 2c and Fig. 2d. The visualization again covers only the first 8 layers, ordered from top to bottom, with the first 10 feature maps of each layer. It can be clearly seen that ParNet extracts the edge and texture information of the image.
Fig. 2. Comparison between CT of COVID-19 and a random animal picture of dog. (a) Original image of COVID-19. (b) Visualize feature maps of COVID-19. (c) Original image of dog. (d) Visualize feature maps of dog.
The source domain uses image information such as edges and textures to recognize images, and feature map visualization shows that the target domain also extracts information such as edges and textures. This may explain why transfer learning can be performed without using features of similar diseases. The above is only a visual impression. To justify parameter transfer learning more rigorously, we measure image similarity with four methods: cosine similarity, image average Hash, image perceptual Hash, and image difference Hash.
Cosine Similarity. Cosine similarity measures image similarity by the cosine of the angle between two vectors in a vector space, which serves as a measure of the difference between two individuals [16]. It is calculated as follows:
$$\text{Similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \tag{1}$$
where A and B are the vectors of the two images, A_i and B_i are the components of A and B, respectively, and n is the number of components.
The Hash-based methods fall into three categories: image average Hash, image perceptual Hash, and image difference Hash. All three compare the similarity between images. Their early steps rescale the images to eliminate the influence of size and aspect ratio, discard detailed information, and retain information such as structure, light, and shade.
Image Average Hash. Image average Hash removes the high-frequency information of the image (its details) and keeps the low-frequency information (the overall gray-level structure); each pixel of the reduced grayscale image is then compared with the mean gray value to form the Hash [17].
Image Perceptual Hash. Image perceptual Hash transforms the image from the pixel domain to the frequency domain through the discrete cosine transform to obtain the low-frequency information of the image [17, 18].
Image Difference Hash. Image difference Hash generates a difference matrix by comparing adjacent pixels and then maps it to a Hash value for comparison [17, 19].
The closer the similarity value is to 1, the more similar the two compared images are. Table 1 shows the similarity of the original images and the similarity of the feature maps of the first 8 layers, calculated with cosine similarity, image average Hash (aHash), image perceptual Hash (pHash), and image difference Hash (dHash).
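Table 1. Estimation of image similarity through cosine similarity and Hash algorithm. Columns "Layer 1" to "Layer 8" give the similarity between the feature maps of the first to the eighth layer.

| Ways | Original | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 | Layer 6 | Layer 7 | Layer 8 |
|---|---|---|---|---|---|---|---|---|---|
| Cosine similarity | 0.67 | 0.90 | 0.96 | 0.95 | 0.97 | 0.96 | 0.98 | 0.98 | 0.97 |
| aHash | 0.44 | 0.89 | 0.86 | 0.88 | 0.84 | 0.92 | 0.61 | 0.72 | 0.88 |
| pHash | 0.58 | 0.95 | 0.92 | 0.98 | 0.94 | 0.92 | 0.89 | 0.91 | 0.98 |
| dHash | 0.44 | 0.72 | 0.70 | 0.63 | 0.94 | 0.79 | 0.67 | 0.71 | 0.78 |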
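The four similarity measures used in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: images are 2-D grayscale arrays, downsampling is done by block averaging, the hash sizes (8*8 bits, a 32*32 DCT input) follow common practice, the perceptual Hash thresholds at the median of the low-frequency block, and Hash similarity is taken as one minus the normalized Hamming distance so that 1 means identical. All function names are hypothetical.

```python
import numpy as np


def shrink(gray: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Downsample a 2-D grayscale array by averaging equal-sized blocks."""
    h, w = gray.shape
    bh, bw = h // rows, w // cols
    blocks = gray[: bh * rows, : bw * cols].reshape(rows, bh, cols, bw)
    return blocks.mean(axis=(1, 3))


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between the two flattened images, as in Eq. (1)."""
    a, b = a.astype(float).ravel(), b.astype(float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def ahash(gray: np.ndarray) -> np.ndarray:
    """Average Hash: compare each pixel of the 8 x 8 thumbnail with its mean."""
    small = shrink(gray, 8, 8)
    return (small > small.mean()).ravel()


def dhash(gray: np.ndarray) -> np.ndarray:
    """Difference Hash: compare horizontally adjacent pixels of an 8 x 9 thumbnail."""
    small = shrink(gray, 8, 9)
    return (small[:, 1:] > small[:, :-1]).ravel()


def phash(gray: np.ndarray) -> np.ndarray:
    """Perceptual Hash: threshold the 8 x 8 low-frequency block of a 2-D DCT."""
    n = 32
    small = shrink(gray, n, n)
    k = np.arange(n)
    # Unnormalized DCT-II basis: basis[u, x] = cos(pi * (2x + 1) * u / (2n))
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    low = (basis @ small @ basis.T)[:8, :8]
    return (low > np.median(low)).ravel()


def hash_similarity(h1: np.ndarray, h2: np.ndarray) -> float:
    """1 minus the normalized Hamming distance; 1.0 means identical hashes."""
    return 1.0 - np.count_nonzero(h1 != h2) / h1.size


# Synthetic example: an image and a slightly noisy copy of it.
rng = np.random.default_rng(0)
img_a = rng.random((224, 224))
img_b = np.clip(img_a + 0.05 * rng.standard_normal((224, 224)), 0.0, 1.0)

print("cosine", round(cosine_similarity(img_a, img_b), 3))
for name, fn in (("aHash", ahash), ("pHash", phash), ("dHash", dhash)):
    print(name, round(hash_similarity(fn(img_a), fn(img_b)), 3))
```

Applied to a pair of feature maps from the same layer of two different inputs, these functions would yield one entry of a table like Table 1; the exact preprocessing behind the reported values is not specified here.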
3.2 Parallel Channel and Spatial Attention Mechanism
ParNet uses a parallel attention mechanism, that is, a combination of channel and spatial attention. The parallel channel and spatial attention mechanism is shown in Fig. 3, where X represents the input feature map, H its height, W its width, and C its number of channels. After the input feature map U ∈ R^(H×W×C1) is obtained, the input X = [x_1, x_2, …, x_C] of the spatial squeeze and channel excitation branch, where x_i ∈ R^(H×W), is generated through the encoding and decoding block. X is then taken as input, and global average pooling reduces it to a vector V ∈ R^(1×1×C). The two fully connected layers have weights W1 ∈ R^(C×(C/4)) and W2 ∈ R^((C/4)×C): W1 reduces the vector V from size 1×1×C to 1×1×(C/4), and W2 expands it back to 1×1×C, giving the vector v̂. The Sigmoid activation function is applied after the second fully connected layer to restrict the output range to [0, 1]. Finally, the output of channel excitation is obtained [7, 9]:
$$\hat{X}_{cSE} = F_{cSE}(X) = [\mathrm{Sigmoid}(\hat{v}_1)\,x_1,\ \mathrm{Sigmoid}(\hat{v}_2)\,x_2,\ \ldots,\ \mathrm{Sigmoid}(\hat{v}_C)\,x_C] \tag{2}$$
Fig. 3. Parallel channel and spatial attention mechanism
Channel squeeze and spatial excitation compresses the features along the channel dimension and excites them spatially, which facilitates fine-grained image classification. On this branch, the input feature map is written as X = [x_{1,1}, x_{1,2}, …, x_{i,j}, …, x_{H,W}], where x_{i,j} ∈ R^(1×1×C), i ∈ {1, 2, …, H}, and j ∈ {1, 2, …, W}. The squeeze operation is q = W_sq * X, where the weight W_sq ∈ R^(1×1×C×1) and the generated map q ∈ R^(H×W); q_{i,j} is the linear combination of all C channels at spatial position (i, j). The Sigmoid activation function is applied to q to restrict the output range to [0, 1]. Finally, the spatially excited output is obtained [7, 9]:
$$\hat{X}_{sSE} = F_{sSE}(X) = [\mathrm{Sigmoid}(q_{1,1})\,x_{1,1},\ \ldots,\ \mathrm{Sigmoid}(q_{i,j})\,x_{i,j},\ \ldots,\ \mathrm{Sigmoid}(q_{H,W})\,x_{H,W}] \tag{3}$$
Spatial squeeze and channel excitation focuses on channel information: channel excitation assigns a different weight to each channel of the feature map. Multiplying these weights with the input highlights the important channels and suppresses the secondary ones. Channel squeeze and spatial excitation focuses on the spatial information of the input feature map: spatial excitation assigns a different weight to each spatial position. Multiplying these weights with the input highlights the important spatial positions and suppresses the secondary ones. The parallel channel and spatial attention mechanism sums the output of the spatial squeeze and channel excitation branch and the output of the channel squeeze and spatial excitation branch [9], which simultaneously highlights the channel importance and the spatial importance of the input feature map. The formula is as follows:
$$\hat{X}_{scSE} = \hat{X}_{cSE} + \hat{X}_{sSE} \tag{4}$$
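The three formulas of this section can be traced in a small NumPy forward pass. This is a sketch under stated assumptions rather than the ParNet implementation: the feature map has shape (H, W, C), biases are omitted, the weights are random placeholders, and Swish is applied after the first fully connected layer as described for ParNet. The function and variable names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def swish(x):
    return x * sigmoid(x)


def cse(x, w1, w2):
    """Spatial squeeze and channel excitation for x of shape (H, W, C)."""
    v = x.mean(axis=(0, 1))        # global average pooling -> (C,)
    v_hat = swish(v @ w1) @ w2     # C -> C/4 -> C
    return x * sigmoid(v_hat)      # one gate per channel, Eq. (2)


def sse(x, w_sq):
    """Channel squeeze and spatial excitation for x of shape (H, W, C)."""
    q = x @ w_sq                       # 1 x 1 convolution over channels -> (H, W)
    return x * sigmoid(q)[..., None]   # one gate per spatial position, Eq. (3)


def scse(x, w1, w2, w_sq):
    """Sum of the two branches, Eq. (4)."""
    return cse(x, w1, w2) + sse(x, w_sq)


rng = np.random.default_rng(0)
H, W, C = 8, 8, 16                     # illustrative sizes
x = rng.standard_normal((H, W, C))
w1 = 0.1 * rng.standard_normal((C, C // 4))
w2 = 0.1 * rng.standard_normal((C // 4, C))
w_sq = 0.1 * rng.standard_normal(C)

out = scse(x, w1, w2, w_sq)
print(out.shape)   # (8, 8, 16)
```

With all weights set to zero, both gates equal Sigmoid(0) = 0.5, so the two branches each return half of the input and their sum reproduces the input exactly, which is a convenient sanity check for an implementation.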
3.3 Swish Activation Function
The first fully connected layer of the parallel attention mechanism scSE originally uses the ReLU activation function f(x) = max(0, x). ReLU is non-differentiable at zero, and its gradient is 0 when x < 0, so some neurons die and can no longer be activated, which results in a poor fit [11]. The Swish activation function is defined as follows:
$$f(x) = x \cdot \mathrm{Sigmoid}(x) = \frac{x}{1 + e^{-x}} \tag{5}$$
The Swish activation function is differentiable everywhere, and its gradient is nonzero for almost all inputs (it vanishes only at a single point near x ≈ −1.28), so negative inputs still receive gradient. Near zero, Swish is approximately linear, which helps the model train quickly. Swish is unbounded above, so it does not saturate for large positive inputs, which is conducive to training; it is bounded below, which gives a strong regularization effect on negative inputs and also benefits training [12]. For these reasons, the Swish activation function is used to replace the ReLU activation function. In the experimental part, an ablation experiment is used to demonstrate the effectiveness of the Swish activation function.
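The difference between the two activation functions is easiest to see in their gradients. The sketch below, written with NumPy and using illustrative function names, evaluates the ReLU and Swish derivatives at a few sample inputs; the Swish derivative follows from the product rule applied to x * Sigmoid(x).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


def swish(x):
    return x * sigmoid(x)


def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0 by convention)."""
    return (x > 0).astype(float)


def swish_grad(x):
    """d/dx [x * s(x)] = s(x) + x * s(x) * (1 - s(x)), with s = sigmoid."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu_grad(xs))                 # zero for every non-positive input
print(np.round(swish_grad(xs), 3))   # small but nonzero for the negative inputs
```

For the negative sample inputs the ReLU gradient is exactly zero, while the Swish gradient remains nonzero, which is the property the replacement relies on.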
4 Experiment
4.1 Dataset and Code Availability
The CT image dataset comes from http://ictcf.biocuckoo.cn/. The dataset has been confirmed by professional doctors and laboratories, so the source is reliable, and the accompanying article, Open resource of clinical data from patients with pneumonia for the prediction of COVID-19 outcomes via deep learning, was published in Nature Biomedical Engineering. There are three categories in the dataset. The first category is NiCT, the image category in which lung parenchyma is not captured, with a total of 5,705 images. The second category is pCT, in which COVID-19 can be clearly identified, with a total of 4,001 images. The third category is nCT, in which the imaging features of the lungs are benign, with a total of 9,979 images. The dataset has a total of 19,685 images with a size of 224 × 224. The data are augmented with center cropping and horizontal flipping. The dataset is divided into a training set, a validation set, and a test set at a ratio of 7:2:1. The test set has a total of 1,966 images: 570 NiCT, 399 pCT, and 997 nCT. The confusion matrices reported below use the test set, and the derived evaluation indicators are computed from the same data. The source code is publicly available at https://github.com/Zcjiayouya/detection-of-COVID-19.
4.2 Experimental Results
This section consists of four parts, whose purpose is to verify the improvement of the ParNet model over other methods.
Parameter Transfer Learning Measured
• Comparison between ParNet and classic models. The classic models include VGG16, InceptionNetV3, ResNeXt50, MobileNetV3, ShuffleNetV2, and DenseNet. The comparison results are shown in Fig. 4.
• An ablation experiment to verify the superiority of the parallel channel and spatial attention mechanism over the channel attention mechanism, and of the Swish activation function over the ReLU activation function. The comparison results are shown in Fig. 5.
• Comparison between ParNet and the model of the dataset source article, Open resource of clinical data from patients with pneumonia for the prediction of COVID-19 outcomes via deep learning, published in Nature Biomedical Engineering [1]. The comparison results are shown in Table 2.
• Comparison between the model that uses parameter transfer learning and the model that does not, to verify the effectiveness of parameter transfer learning. The comparison results are shown in Table 3.
Fig. 4. Comparison between ParNet and classic models. (a) Type of NiCT. (b) Type of pCT. (c) Type of nCT
Notes: The models from left to right are VGG16, InceptionNetV3, ResNeXt50, MobileNetV3, ShuffleNetV2, DenseNet, and ParNet.
Fig. 5. Ablation experiment on the activation function and the attention mechanism. (a) Type of NiCT. (b) Type of pCT. (c) Type of nCT
Notes: The models from left to right are ParNet, ParNet without sSE, ParNet without Swish, and ParNet without sSE and Swish.
Table 2. Comparison between ParNet and the original dataset model

Type | Method | Sn | Sp | Ac | PPV | NPV | MCC
NiCT | Original | 0.98404 | 0.99644 | 0.99424 | 0.99124 | 0.99554 | 0.98614
NiCT | ParNet | 0.98785 | 0.99928 | 0.99593 | 0.99825 | 0.99499 | 0.98519
NiCT | Improvement | 0.00381 | 0.00284 | 0.00169 | 0.00701 | -0.00055 | -0.00095
pCT | Original | 0.97004 | 0.90684 | 0.91974 | 0.72744 | 0.99164 | 0.79404
pCT | ParNet | 0.99749 | 0.99936 | 0.99898 | 0.99749 | 0.99936 | 0.99622
pCT | Improvement | 0.02745 | 0.09252 | 0.07924 | 0.27005 | 0.00772 | 0.20218
nCT | Original | 0.85474 | 0.99124 | 0.72384 | 0.99004 | 0.87254 | 0.85574
nCT | ParNet | 1.000 | 0.99385 | 0.99695 | 0.99398 | 1.000 | 0.99391
nCT | Improvement | 0.14526 | 0.00261 | 0.27311 | 0.00394 | 0.12746 | 0.13817
Table 3. Comparison between ParNet and the model without using transfer learning

Type | Method | Sn | Sp | Ac | PPV | NPV | MCC
NiCT | ParNet | 0.98785 | 0.99928 | 0.99593 | 0.99825 | 0.99499 | 0.98519
NiCT | Without transfer | 0.97935 | 0.99928 | 0.99388 | 0.99825 | 0.99140 | 0.97560
NiCT | Improvement | 0.0085 | 0.0 | 0.00205 | 0.0 | 0.00359 | 0.00959
pCT | ParNet | 0.99749 | 0.99936 | 0.99898 | 0.99749 | 0.99936 | 0.99622
pCT | Without transfer | 0.99747 | 0.99745 | 0.99746 | 0.98997 | 0.99936 | 0.99150
pCT | Improvement | 0.00002 | 0.00191 | 0.00152 | 0.00752 | 0.000 | 0.00472
nCT | ParNet | 1.000 | 0.99385 | 0.99695 | 0.99398 | 1.000 | 0.99391
nCT | Without transfer | 0.99697 | 0.98874 | 0.99288 | 0.98897 | 0.99690 | 0.98276
nCT | Improvement | 0.00303 | 0.00511 | 0.00407 | 0.00501 | 0.0031 | 0.01115
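The six indicators reported per class in Tables 2 and 3 can be derived from a multi-class confusion matrix by treating one class as positive and the other two as negative. The sketch below assumes the standard one-vs-rest definitions of Sn, Sp, Ac, PPV, NPV, and MCC; the function name and the small confusion matrix are hypothetical and are not the paper's data.

```python
import math

def one_vs_rest_metrics(cm, k):
    """Metrics for class k, where cm[i][j] counts samples of true class i
    that were predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                        # class k predicted as another class
    fp = sum(cm[i][k] for i in range(n)) - tp   # other classes predicted as k
    tn = total - tp - fn - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),                   # sensitivity (recall)
        "Sp": tn / (tn + fp),                   # specificity
        "Ac": (tp + tn) / total,                # accuracy
        "PPV": tp / (tp + fp),                  # positive predictive value
        "NPV": tn / (tn + fn),                  # negative predictive value
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
cm = [[8, 1, 1],
      [0, 9, 1],
      [1, 0, 9]]
metrics = one_vs_rest_metrics(cm, 0)
```

The improvement rows in the tables are then simply the difference between the ParNet value and the baseline value of each indicator.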
5 Conclusion
This paper presents the ParNet model. The main work of this paper is to use parameter transfer learning to detect COVID-19 in CT images and to justify the use of parameter transfer learning through image similarity. The experimental results show that replacing the channel attention mechanism with the parallel channel and spatial attention mechanism, and replacing the ReLU activation function with Swish, gives better performance. COVID-19 is still raging around the world, and it is hoped that ParNet can contribute to the prevention and control of the epidemic.
Acknowledgments. We thank the anonymous reviewers for their valuable suggestions and comments. This work was supported by the National Natural Science Foundation of China (62062067), the Natural Science Foundation of Yunnan Province (2017FA032), and the Training Plan for Young and Middle-aged Academic Leaders of Yunnan Province (2018HB031).
References
1. Ning, W., et al.: Open resource of clinical data from patients with pneumonia for the prediction of COVID-19 outcomes via deep learning. Nat. Biomed. Eng. 4(12), 1–11 (2020)
2. Konar, D., et al.: Auto-diagnosis of COVID-19 using lung CT images with semi-supervised shallow learning network. IEEE Access 99, 1 (2021)
3. Ahuja, S., Panigrahi, B.K., Dey, N., Rajinikanth, V., Gandhi, T.K.: Deep transfer learning-based automated detection of COVID-19 from lung CT scan slices. Appl. Intell. 51(1), 571–585 (2020). https://doi.org/10.1007/s10489-020-01826-w
4. Perumal, V., Narayanan, V., Rajasekar, S.J.S.: Detection of COVID-19 using CXR and CT images using transfer learning and haralick features. Appl. Intell. 51(1), 341–358 (2020). https://doi.org/10.1007/s10489-020-01831-z
5. Zhou, T., Lu, H., Yang, Z., Qiu, S., Huo, B., Dong, Y.: The ensemble deep learning model for novel covid-19 on CT images. Appl. Soft Comput. 98, 106885 (2021)
6. Shalbaf, A., Vafaeezadeh, M.: Automated detection of COVID-19 using ensemble of transfer learning with deep convolutional neural network based on CT scans. Int. J. Comput. Assist. Radiol. Surg. 16(1), 115–123 (2021)
7. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
8. Hu, J., Shen, L., Albanie, S., Sun, G., Vedaldi, A.: Gather-excite: Exploiting feature context in convolutional neural networks. arXiv preprint arXiv:1810.12348
9. Roy, A.G., Navab, N., Wachinger, C.: Concurrent spatial and channel ‘Squeeze & Excitation’ in fully convolutional networks. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. MICCAI 2018. Lecture Notes in Computer Science, vol. 11070, pp. 421–429. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_48
10. Sibi, P., Jones, S.A., Siddarth, P.: Analysis of different activation functions using back propagation neural networks. J. Theor. Appl. Inf. Technol. 47(3), 1264–1268 (2013)
11. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. JMLR Workshop and Conference Proceedings (2011)
12. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. JMLR Workshop and Conference Proceedings (2011)
13. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
14. Howard, A., et al.: Searching for mobilenetv3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324 (2019)
15. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009)
16. Nguyen, H.V., Bai, L.: Cosine similarity metric learning for face verification. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010. LNCS, vol. 6493, pp. 709–720. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19309-5_55
17. Jiaheng, H., Xiaowei, L., Benhui, C., Dengqi, Y.: A comparative study on image similarity algorithms based on hash. J. Dali Univ. 2(12), 32 (2017)
18. Weng, L., Preneel, B.: A secure perceptual hash algorithm for image content authentication. In: De Decker, B., Lapon, J., Naessens, V., Uhl, A. (eds.) Communications and Multimedia Security, pp. 108–121. Springer Berlin Heidelberg, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24712-5_9
19. Wang, X., Pang, K., Zhou, X., Zhou, Y., Li, L., Xue, J.: A visual model-based perceptual image hash for content authentication. IEEE Trans. Inf. Forensics Secur. 10(7), 1336–1349 (2015)
20. Shao, L., Zhu, F., Li, X.: Transfer learning for visual categorization: a survey. IEEE Trans. Neural Netw. Learn. Syst. 26(5), 1019–1034 (2015)
21. Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2021)
A Novel Prediction Framework for Two-Year Stroke Recurrence Using Retinal Images
Yidan Dai1, Yuanyuan Zhuo2, Xingxian Huang2, Haibo Yu2, and Xiaomao Fan1(B)
1 School of Computer Science, South China Normal University, Guangzhou 510631, China
[email protected]
2 Department of Acupuncture and Moxibustion, Shenzhen Traditional Chinese Medicine Hospital, Shenzhen 518033, China
Abstract. Stroke is a malignant disease with a high incidence rate, a high disability rate, and a high mortality rate. In particular, the incidence rate of stroke recurrence is much higher than that of initial stroke, and recurrence places a heavy life and financial burden on patients and their families. In this paper, we propose a novel prediction framework for two-year stroke recurrence based on features extracted from retinal images. Specifically, 425 patients with initial stroke were recruited from Shenzhen Traditional Chinese Medicine Hospital, and their clinical data and retinal images were collected between January 2017 and January 2019. After follow-up, 103 patients had stroke recurrence within 2 years. All collected retinal images are analyzed, and the characteristics of the fundus vessels are extracted by an automatic retinal image analysis system. We employ four widely used machine learning methods, support vector machine (SVM), random forest (RF), logistic regression (LR), and XGBoost, to predict two-year recurrent stroke events. Experimental results show that the proposed prediction framework achieves promising results, with a best prediction accuracy of 84.38%. The framework can potentially be applied in medical information systems to predict malignant events and help medical providers intervene in advance to prevent them.
Keywords: Stroke · Stroke recurrence · Retinal images · Two-year prediction framework · Malignant events
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 279–288, 2021. https://doi.org/10.1007/978-3-030-91415-8_24
1 Introduction
Stroke is generally defined as a brain injury caused by an interruption or substantial reduction in blood supply to the brain, and it is one of the most prevalent cerebrovascular diseases in recent years. It is characterized by a high incidence rate, a high disability rate, and a high mortality rate. Stroke undoubtedly brings a heavy life and financial burden to patients and their families. Furthermore, patients with an initial stroke have a much higher risk of recurrence. It is reported that the survival rate of patients with stroke is about 80–85%, and among the survivors, 15–30% have recurrent events within 2 years [1]. The damage caused by a recurrent stroke is much more serious than that caused by the initial stroke: it causes serious nerve damage, makes treatment more difficult, and raises mortality. Therefore, it is of great clinical significance to propose a prediction framework for stroke recurrence for secondary prevention after the first stroke and to control the risk factors of stroke recurrence.
In recent years, there have been many studies on the prediction of stroke recurrence, most of which focus on analyzing its causes. Hillen et al. found that the causes of stroke recurrence are multifactorial and that most recurrences remain unexplained by conventional risk factors [2]. Kariasa et al. suggested that hypertension and a sedentary lifestyle are the most dominant factors contributing to stroke recurrence [3]. To better prevent stroke recurrence, a panel data predictive model using 12 possible factors was proposed to predict the recurrence of stroke, with an accuracy of 60.40% [4]. Profiles of ambulatory blood pressure (ABP) can help improve the identification of stroke recurrence by capturing the additive effects of individual ABP parameters [5]. These studies attributed the recurrence of stroke to lifestyle and common clinical indicators. However, lifestyle is rather vague and subjective. With advances in technology, retinal examination is able to reflect cerebral blood flow to a large extent, and many studies have investigated the correlation between fundus microvascularity and stroke onset [6, 7]. Meanwhile, our previous study found that a recurrent cerebral infarction predictive model combining retinal images and clinical features could screen out patients with stroke recurrence well [8]. Hence, in this paper, we further propose a prediction framework for two-year stroke recurrence based on retinal images and clinical indicators.
Specifically, 425 patients with initial stroke were recruited from Shenzhen Traditional Chinese Medicine Hospital, and their clinical data and retinal images were collected between January 2017 and January 2019. After follow-up, 103 patients had stroke recurrence within 2 years. All collected retinal images are analyzed, and the characteristics of the fundus vessels are extracted by an automatic retinal image analysis system. We employ four widely used machine learning methods, support vector machine (SVM), random forest (RF), logistic regression (LR), and XGBoost, to predict two-year recurrent stroke events. Experimental results show that the proposed prediction framework achieves promising results. Our main contributions can be summarized as follows: (1) We conduct a pilot study that recruits 425 patients with initial stroke to collect retinal images and clinical indicators, with a follow-up to acquire stroke recurrence information within two years. (2) We propose a novel prediction framework for two-year stroke recurrence with four machine learning methods based on retinal images and clinical indicators, which are much more objective than lifestyle factors. (3) Experimental results show that the proposed framework achieves promising results in screening out patients at high risk of two-year stroke recurrence, with a best accuracy of 84.38%.
The remainder of this paper is structured as follows. Section 2 introduces the materials and methods in detail. The experimental results and discussion are presented in Sect. 3. Section 4 concludes this paper.
2 Materials and Methods
Fig. 1. The prediction framework for two-year stroke recurrence
In this section, the materials and the prediction framework for two-year stroke recurrence are described. The overall prediction framework is shown in Fig. 1.
2.1 Data Collection
A total of 425 stroke patients hospitalized in the acupuncture department of Shenzhen Traditional Chinese Medicine Hospital from January 2017 to December 2020 were recruited for the collection of retinal images and clinical indicators, including 103 patients who had recurrent events within 2 years (IRB No.: SZTCM(2018)75). The retinal images were collected with a Digital Retinal Camera CR-2AF. Then, the automatic retinal image analysis system [9] developed by the Chinese University of Hong Kong was used to analyze the retinal image features related to cerebrovascular diseases, such as arteriovenous diameter and ratio [10], vascular symmetry, angle, bifurcation coefficient [11], soft exudate [12–14], and curvature [15]. The clinical risk indicators collected include general patient information (name, gender, age, body mass index (BMI), sleep, history of smoking and alcohol, family history of stroke, etc.), imaging indices (TOAST typing, carotid ultrasound, cranial MR, etc.), and clinical risk factors (blood pressure, glucose, lipid control, and blood uric acid, coagulation, and homocysteine levels at the time of inclusion in the study). The specific clinical risk indicators included are shown in Table 1.
The inclusion criteria of this paper were as follows:
• Those who met the diagnostic criteria of Western medicine for cerebral infarction disease;
• Those who were within 6 months of onset and had a first cerebral infarction.
The exclusion criteria were as follows:
• Transient cerebral ischemic attack (TIA).
Table 1. Clinical indicators for inclusion in this study

Data type | Detailed variables
Categorical variables | Gender, TOAST typing, carotid artery stenosis, vertebral artery stenosis, intracerebral vascular sclerosis, intracerebral vascular occlusion, family history of stroke, history of smoking, history of alcohol consumption, sleep status, blood pressure control, blood glucose control, hypertriglyceridemia, hypercholesterolemia, hyperuricemia, hyperhomocysteinemia
Continuous variables | Age, BMI, HDL, LDL, prothrombin activity, international normalized ratio, prothrombin ratio, activated partial thromboplastin time, fibrinogen, prothrombin time
• Patients with stroke caused by brain tumor, traumatic brain injury, hematologic disease, etc., confirmed by examination.
• Patients with a combination of serious diseases of the liver, kidney, hematopoietic system, endocrine system, and osteoarthrosis.
• Patients with mental disorders, severe dementia, or consciousness disorders that prevent them from cooperating with fundus image acquisition.
In this paper, observations with more than 50% missing features are removed. To ensure the objectivity of the data, missing values are filled with the column mode for discrete data and with the column mean for continuous data. To make the prediction model more stable, BMI and age are binned. Based on the World Health Organization's obesity indicators for Southeast Asians, BMI is divided into less than 18.5, 18.5–24, and 24 or more, which correspond to underweight, normal weight, and overweight. Since most patients have a stroke recurrence between the ages of 45 and 65, age is divided into less than 45, 45–65, and 65 or more. After preliminary processing, 419 observations with 55 features are obtained.
2.2 Data Preprocessing
The data preprocessing in this paper includes data discretization, oversampling, Min-Max scaling, and label definition. Data discretization can effectively overcome hidden defects in the data, improve the model's ability to classify the samples, and make the model results more stable. Oversampling is applied to address class imbalance in the dataset and to improve model generalization. Min-Max scaling brings the features to the same scale.
(1) Data discretization: in this paper, we use the CART algorithm to discretize continuous variables [16]. CART is a decision tree classification algorithm, so discretizing a single variable with it is equivalent to univariate decision tree classification.
The method for discretizing a continuous variable based on the CART algorithm is as follows: the values are sorted, and the midpoint of each pair of adjacent values is taken in turn as a candidate cut point that splits the dataset in two. The Gini value of each candidate split is calculated, and the point with the greatest decrease in Gini value is selected as the optimal cut point. The resulting subsets are cut according to the same principle until the termination condition is met.
(2) Oversampling: since a classification model trained on imbalanced data is biased towards the majority classes, which have more samples, an oversampling technique is applied to our dataset. In this paper, we use the SMOTE algorithm for oversampling [17]. The basic idea of the SMOTE algorithm is to analyze each minority class sample $x_a$, select one of its K nearest neighbors $x_b$, and then select a random point $x_c$ on the line segment between $x_a$ and $x_b$ as a new synthetic minority class sample.
(3) Min-Max scaling: to remove the effect of different magnitudes, the Min-Max scaling method is used to transform the original data into the range between 0 and 1. It is defined as:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{1}$$
where x refers to the original data, x_norm refers to the normalized data, and x_max and x_min refer to the maximum and minimum values of the original dataset, respectively.
(4) Label definition: in this study, patients with stroke recurrence within two years are labeled 1, and all other patients are labeled 0.
2.3 Feature Selection
In this paper, since all continuous input features are discretized into categorical variables, we use the chi-square test (χ2 test) [18, 19] to evaluate the correlation between the input features and the class labels, and we select features according to the p-value. Features with a p-value < 0.05 are statistically significant. We keep all features with a p-value < 0.05 as selected features, which are shown in Table 2.

Table 2. Selected features with p-value < 0.05

Feature | p-value
Smoking | 0.0001
Drinking | 0.0124
Blood pressure control | 0.0001
Glycated hemoglobin | 0.0016
High triglycerides | 0.0001
HDLC | 0.0002
LDLC | 0.0090
Uric acid | 0.0000
Homocysteine | 0.0000
International normalized ratio | 0.0000
Prothrombin ratio | 0.0000
Thrombin time | 0.0030
LCRA + BJ:BVE_est | 0.0258
LCRVE_est | 0.0004
LVasym_est | 0.0026
LVangle_est | 0.0000
LAangle_est | 0.0000
LBCV_est | 0.0000
LAocclusion_estp | 0.0265
LFDa_est | 0.0000
LFDv_est | 0.0000
RAasym_est | 0.0076
RVangle_est | 0.0000
RAangle_est | 0.0014
RBCV_est | 0.0001
RBCA_est | 0.0050
RAocclusion_estp | 0.0000
RFDa_est | 0.0201
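The screening rule described in Sect. 2.3 can be illustrated with a small Pearson chi-square test of independence between one categorical feature and the binary recurrence label. This is a sketch under stated assumptions rather than the authors' code: it uses no continuity correction, it only implements the closed-form p-value for 1 or 2 degrees of freedom (a two-level or three-level feature against a binary label), and the contingency tables are hypothetical counts.

```python
import math

def chi_square_test(table):
    """Pearson chi-square test on a contingency table
    (rows: feature levels, columns: class labels)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))   # chi-square survival, 1 d.o.f.
    elif df == 2:
        p = math.exp(-stat / 2.0)              # chi-square survival, 2 d.o.f.
    else:
        raise ValueError("closed form implemented only for df = 1 or 2")
    return stat, df, p

# Hypothetical counts: rows = feature level, columns = (no recurrence, recurrence).
associated = [[30, 10], [20, 40]]
unrelated = [[10, 10], [20, 20]]
stat_a, df_a, p_a = chi_square_test(associated)
stat_u, df_u, p_u = chi_square_test(unrelated)
keep_feature = p_a < 0.05   # selection rule from Sect. 2.3
```

In the first table the label distribution differs strongly between the two feature levels, so the p-value falls below 0.05 and the feature would be kept; in the second table the distribution is identical across levels and the statistic is zero.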
2.4 Classification Methods
(1) SVM: SVM is a machine learning algorithm proposed by Vapnik et al. [20]. In essence, it solves a constrained optimization problem by constructing an optimal hyperplane between two classes so as to maximize the margin between them [21]. If the two classes are linearly separable, infinitely many linear classifiers separate them, and SVM selects the hyperplane that maximizes the margin and thereby minimizes the generalization error. If the two classes are not separable, SVM looks for the maximum-margin hyperplane while minimizing a quantity proportional to the number of misclassifications [22].
(2) RF: RF can be considered an advanced bagging method [23]. RF performs well in classification tasks and can be used for both two-class and multi-class problems. It is an ensemble learning algorithm based on classification and regression trees. Based on voting among the trees, RF decides which class each sample belongs to.
(3) LR: LR [24] is a machine learning method used to solve binary classification problems by estimating the probability of an event. Compared with supervised classification methods such as the kernel support vector machine, logistic regression is faster, but its accuracy is lower. LR is a statistical method similar to linear regression, but its output passes through a sigmoid function and is therefore a number between 0 and 1. A value of 0.5 is generally used as the threshold that divides the two categories [25]. Because of the simplicity and effectiveness of its mathematical principle, it has a wide range of practical applications, including disease prediction and credit evaluation.
(4) XGBoost: XGBoost, proposed by Chen and Guestrin [26], is a highly scalable end-to-end tree boosting algorithm that integrates multiple weak classifiers to form a strong classifier. Unlike random forest, which applies bootstrap aggregation to tree learners, the trees of a boosting system are built sequentially: the goal of each tree is to reduce the error of the previous trees. In each successive iteration, a base tree is trained by fitting the gradient. The output of the current tree learner is then combined with the previous prediction to build a new prediction, which is an enhanced version of the previous model [27].
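The sequential idea behind boosting, in which each new tree fits the gradient of the loss for the current prediction, can be shown with a deliberately simplified sketch. It is plain gradient boosting with one-split regression stumps and squared loss on one-dimensional toy data, not XGBoost itself, which additionally uses second-order gradient information and regularization. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (single threshold) on 1-D inputs."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def boost(xs, ys, n_rounds=50, lr=0.1):
    """Build stumps one after another; each fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees = []
    for _ in range(n_rounds):
        # For squared loss 0.5 * (y - f)^2, the negative gradient is y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Combine the new tree with the previous prediction.
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(t(x) for t in trees)

# Toy data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
low, high = model(1), model(6)
```

Each round shrinks the remaining residual by the learning rate, so the ensemble approaches the targets gradually instead of relying on a single strong tree.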
3 Results and Discussion
3.1 Experiment Environment
The proposed approach is implemented on a system with 2 processors, 256 GB of RAM, and a 1 TB hard disk. Each processor has 8 cores. The experiments were run with sklearn 0.24.1.
3.2 Prediction Performance for Two-Year Stroke Recurrence
In this paper, the dataset is randomly split into training and test sets at a ratio of 3:1, and the SVM, RF, LR, and XGBoost models are built and trained respectively. In the modeling process, a grid search combined with ten-fold cross-validation is used to tune the model parameters. The performance of the four methods on the test set is shown in Fig. 2. For RF in scikit-learn, the most reliable model is obtained by setting n_estimators (the number of weak learners) to 200. For LR, the best model sets C (the reciprocal of the regularization coefficient λ) to 100, the penalty term to ‘l2’, the optimization algorithm to ‘liblinear’, and the maximum number of iterations to 1000. For SVM, the most stable model sets the kernel to ‘rbf’, the penalty parameter to 10, and the kernel function parameter to 1. For the XGBoost model, the learning rate is set to 0.1, n_estimators to 170, the maximum depth of the tree to 9, and gamma to 0.1. The accuracy, recall, and precision of RF, SVM, and XGBoost are all above 0.8, which indicates the outstanding classification performance of these three models.
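For reference, the reported settings can be collected in one place, together with a plain-Python ten-fold splitter of the kind a grid search iterates over. This is a sketch, not the authors' script. The dictionary keys are the usual scikit-learn and XGBoost estimator and parameter names; mapping "penalty parameter" to `C` and "kernel function parameter" to `gamma` for the SVM is an assumption, and the helper `k_fold_indices` and the sample count of 100 are hypothetical.

```python
import random

# Settings reported in the text, keyed by the estimator they would be passed to.
# Assumption: the SVM "penalty parameter" is C and the "kernel function
# parameter" is gamma.
REPORTED_PARAMS = {
    "RandomForestClassifier": {"n_estimators": 200},
    "LogisticRegression": {"C": 100, "penalty": "l2",
                           "solver": "liblinear", "max_iter": 1000},
    "SVC": {"kernel": "rbf", "C": 10, "gamma": 1},
    "XGBClassifier": {"learning_rate": 0.1, "n_estimators": 170,
                      "max_depth": 9, "gamma": 0.1},
}

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

# Hypothetical training-set size, used only to demonstrate the splitter.
splits = list(k_fold_indices(100, k=10))
```

In a grid search, each candidate parameter combination is trained on every `train` part and scored on the matching `val` part, and the combination with the best average score is kept.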
Fig. 2. Prediction performance of the proposed framework with four machine learning methods.
Fig. 3. The ROC curve of the proposed framework with four machine learning methods.
To choose the best method, the area under the curve (AUC) of each method is considered first. The AUC is the area enclosed between the receiver operating characteristic (ROC) curve and the coordinate axis. The ROC curve reflects sensitivity and specificity jointly as the decision threshold varies. The closer the AUC is to 1.0, the more reliable the detection method is. Therefore, the AUC is used in this paper for the final selection among the four prediction models. The ROC curves of the four methods are shown in Fig. 3. The AUC values of SVM, RF, LR, and XGBoost are 0.91, 0.92, 0.79, and 0.83, respectively. Since RF has the highest AUC, it is selected as the final model for stroke recurrence prediction.
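The AUC used for model selection has a simple interpretation that can be computed directly: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise form; the labels and scores are hypothetical and unrelated to the study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for six patients (1 = recurrence).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
```

Here eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9. The quadratic loop is fine for illustration; rank-based formulas give the same value more efficiently on large datasets.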
4 Conclusion
In this paper, we proposed a novel prediction framework for two-year stroke recurrence that combines retinal images and clinical indicators. Specifically, 425 patients with initial stroke were recruited from Shenzhen Traditional Chinese Medicine Hospital, and their clinical data and retinal images were collected between January 2017 and January 2019. After follow-up, 103 patients had stroke recurrence within 2 years. All collected retinal images are analyzed, and the characteristics of the fundus vessels are extracted by an automatic retinal image analysis system. We employ four widely used machine learning methods, SVM, RF, LR, and XGBoost, to predict two-year recurrent stroke events. Experimental results show that the proposed prediction framework achieves promising results. The RF model achieves the best performance, with an accuracy, recall, F1 score, and AUC of 84.38%, 83.13%, 84.66%, and 92.36%, respectively. This suggests that the proposed prediction framework could be deployed in a medical system to help physicians screen out patients at high risk of stroke recurrence and intervene in advance to prevent malignant events.
Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 81803952) and the National Key R&D Program of China (No. 2018YFB1800705).
References
1. Grau, A.J., et al.: Risk factors, outcome, and treatment in subtypes of ischemic stroke: the German stroke data bank. Stroke 32(11), 2559–2566 (2001). https://doi.org/10.1161/hs1101.098524
2. Hillen, T., Coshall, C., Tilling, K., Rudd, A.G., McGovern, R., Wolfe, C.D.A.: Cause of stroke recurrence is multifactorial: patterns, risk factors, and outcomes of stroke recurrence in the south London stroke register. Stroke 34(6), 1457–1463 (2003)
3. Kariasa, I.M., Nurachmah, E., Koestoer, R.A.: Analysis of participants’ characteristics and risk factors for stroke recurrence. Enferm. Clin. 29, 286–290 (2019)
4. Li, X., Chang, W., Zhou, S., Wei, F.: The panel data predictive model for recurrence of cerebral infarction with health care data analysis. In: 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 2380–2385. IEEE (2017)
5. Jie, X., et al.: Ambulatory blood pressure profile and stroke recurrence. Stroke Vasc. Neurol. 6(3), 352–358 (2021)
6. Zee, B., et al.: Stroke risk as assessment for the community by automatic retinal image analysis using fundus photograph. Qual. Primary Care 24(3), 114–124 (2016)
7. De Boever, P., et al.: Static and dynamic retinal vessel analyses in patients with stroke as compared to healthy control subjects. Acta Ophthalmologica 94 (2016)
8. Zhuo, Y., Wu, J., Qu, Y., Yu, H., Yuan, W., Yang, Z.: Comparison of prediction models based on risk factors and retinal characteristics associated with recurrence one year after ischemic stroke. J. Stroke Cerebrovasc. Dis. 29(4), 104581 (2020)
9. Zee, C.Y. Lee, J.W., Li, E.Q.: Method and device for retinal image analysis. US (2014). Patent no. US20120257164 A1
10. Knudtson, M.D., Lee, K.E., Hubbard, L.D., Wong, T.Y., Klein, R., Klein, B.E.K.: Revised formulas for summarizing retinal vessel diameters. Curr. Eye Res. 27(3), 143–149 (2003). https://doi.org/10.1076/ceyr.27.3.143.16049
11. Patton, N., Aslam, T., Macgillivray, T., Dhillon, B., Constable, I.: Asymmetry of retinal arteriolar branch widths at junctions affects ability of formulae to predict trunk arteriolar widths. Invest. Ophthalmol. Vis. Sci. 47(4), 1329–1333 (2006)
12. Wong, T.Y., Klein, R., Couper, D.J., Cooper, L.S., Sharrett, A.R.: Retinal microvascular abnormalities and incident stroke: the atherosclerosis risk in communities study. Lancet 358(9288), 1134–1140 (2001)
13. Wong, T.Y., et al.: Cerebral white matter lesions, retinopathy, and incident clinical stroke. J. Am. Med. Assoc. (JAMA) 288, 1–67 (2002)
14. Witt, N., et al.: Abnormalities of retinal microvascular structure and risk of mortality from ischemic heart disease and stroke. Hypertension 47(5), 975–981 (2006)
15. Doubal, F.N., Hokke, P.E., Wardlaw, J.M.: Retinal microvascular abnormalities and stroke: a systematic review. J. Neurol. Neurosurg. Psychiatry 80(2), 158–165 (2009)
16. Li, W.L., Yu, R.H., Wang, X.Z.: Discretization of continuous valued attributes in decision tree generation. In: International Conference on Machine Learning and Cybernetics (2010)
17. Bansal, A., Saini, M., Singh, R., Yadav, J.K.: Analysis of smote: Modified for diverse imbalanced datasets under the IoT environment. Int. J. Inf. Retrieval Res. (IJIRR) 11(2), 15–37 (2021)
18. Bryant, F.B., Satorra, A.: Principles and practice of scaled difference chi-square testing. Struct. Eqn. Model. A Multidisc. J. 19(3), 372–398 (2012)
19. Chen, C., Liang, X.: Feature selection method based on Gini index and chi square test. Comput. Eng. Des. 40(8), 2342–2345 (2019)
20. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (2013). https://doi.org/10.1007/978-1-4757-3264-1
21. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
22. Deka, P.C., et al.: Support vector machine applications in the field of hydrology: a review. Appl. Soft Comput. 19, 372–386 (2014)
23. Rahmati, O., Pourghasemi, H.R., Melesse, A.M.: Application of GIS-based data driven random forest and maximum entropy models for groundwater potential mapping: a case study at Mehran region, Iran. CATENA 137, 360–372 (2016)
24. Nagendran, M., et al.: Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020, 368:m689 (2020)
25. Hefner, J.T., Linde, K.C.: Atlas of Human Cranial Macromorphoscopic Traits. Academic Press (2018)
26. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)
27. Wang, X.-W., Liu, Y.-Y.: Comparative study of classifiers for human microbiome data. Med. Microecol. 4, 100013 (2020)
The Classification System and Biomarkers for Autism Spectrum Disorder: A Machine Learning Approach
Zhongyang Dai1,2, Haishan Zhang1, Feifei Lin3, Shengzhong Feng2(B), Yanjie Wei1(B), and Jiaxiu Zhou4
1 Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Center for High Performance Computing, Shenzhen Institutes of Advanced Technology, Shenzhen, Guangdong, China
[email protected]
2 National Supercomputing Center in Shenzhen, Shenzhen, Guangdong, China
3 Department of Radiology, Shenzhen Children’s Hospital, Shenzhen, Guangdong, China
4 Department of Psychology, Shenzhen Children’s Hospital, Shenzhen, Guangdong, China
Abstract. Autism spectrum disorder (ASD) is a neurodevelopmental disorder. ASD patients usually have difficulty with social communication and daily life, and there are no drugs to cure ASD. A large number of clinical cases show that early intervention is beneficial for ASD patients. Therefore, rapid and accurate diagnosis is of great significance. With the development of artificial intelligence techniques, machine learning (ML) models have been applied to the analysis of ASD. However, revealing biomarkers through ML models is an extremely challenging task. In this paper, a computational protocol is proposed to differentiate ASD from typical development (TD) using resting-state functional magnetic resonance imaging (rs-fMRI) images of the brain and to reveal the brain regions related to ASD. The computational protocol consists of feature selection, model training, and feature analysis. Classification models are constructed with the XGBoost algorithm and show better performance than previous well-known models. By analyzing the input features of the models, the functional connections of the cingulo-opercular network (CON) and the default-mode network (DMN) are found to contribute significantly to the models’ performance, and they can be regarded as biomarkers of ASD.
Keywords: Machine learning · Autism · rs-fMRI · Accurate prediction · Biomarker
Z. Dai and H. Zhang contributed equally to this work.
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 289–299, 2021. https://doi.org/10.1007/978-3-030-91415-8_25

1 Introduction
Autism spectrum disorder (ASD) is a complex neurological and developmental disability that can cause significant social, communication and behavioral challenges [1, 2]. In recent years, the incidence of ASD has been increasing, which has placed a heavy burden on patients' families [3–6]. A large number of clinical trials have shown that early treatment can help improve the prognosis and reduce the disability rate [3, 7–9]. Therefore, rapid and accurate diagnosis of ASD is of high significance. However, there is still no unified and concrete understanding of the exact cause of ASD [2], which makes ASD diagnosis difficult. State and Sestan systematically investigated a set of proteins which may carry risk for ASD [6]. Muller proposed the view of a distributed disorder, arguing that autism is caused by changes in genetics, neuroanatomical features, brain functional organization, acquired education and other aspects [10].
With the advancement of imaging techniques, some researchers have studied brain images to find differences between ASD and typical development (TD) [11, 12]. Resting-state functional MRI (rs-fMRI) is a type of MRI that mainly captures the functional connection (FC) information among brain regions. Some previous works sought to differentiate ASD patients from TD by comparing rs-fMRI images of the brain, and found that the cognitive deficits of ASD patients are associated with underlying brain FC abnormalities [13, 14]. However, it is extremely challenging to manually identify the detailed differences between rs-fMRI images.
Machine learning (ML) is an effective tool for extracting complex patterns and relationships from images and then performing classification. Recently, it has started to be applied to predicting ASD [9, 13, 15–17]. Zhao et al. proposed a multi-level, high-order FC network representation, which can capture complex interactions among brain regions and comprehensively characterize their relationships for better diagnosis of ASD [18]. Wang et al. proposed a multi-center low-rank representation learning (MCLRR) method for ASD diagnosis, aiming to develop a model that can handle heterogeneous data from different centers.
The model’s accuracy on data from 5 institutes of ABIDE is 0.69 [19]. In addition, the generalization of models is crucial, especially for clinical practice. Yahata et al. developed an ML model based on a small number of FCs [7], which generalized remarkably well to two independent validation cohorts in the USA and Japan. He et al. found that the generalization ability of an ML model may be strongly influenced by differences between data sites or cohorts, and could be improved by applying a suitable noise reduction strategy to the rs-fMRI images from different sites [20]. The generalization capacity of the above ML models indicates that there may be some brain features which can be treated as biomarkers of ASD. Unfortunately, it is very difficult to derive those brain biomarkers directly from the ML models.
In this paper, a new computational protocol is proposed to classify ASD and TD using rs-fMRI images. The protocol mainly applies the XGBoost algorithm to integrate the connectivity information of brain regions into a predictive model. Before model training, a variety of feature selection methods were applied to remove uninformative brain region information and improve the effectiveness of model training. In addition, following the protocol, the important connectivity information of brain regions was extracted and the corresponding brain regions were traced back, which could serve as biomarkers for distinguishing ASD from TD.
2 Materials and Methods
2.1 Datasets
All rs-fMRI image samples used in this paper were obtained directly from the Autism Brain Imaging Data Exchange I (ABIDE I) website [21]. ABIDE is a well-known resource in autism research that collects phenotypic and imaging data of ASD patients and typical controls. ABIDE I contains 1112 subjects, including 539 individuals with ASD and 573 TDs, collected and shared by 17 institutes. First, the rs-fMRI images of 54 ASDs and 46 TDs were retrieved from the New York University Langone Medical Center (NYU) site of ABIDE and named Dataset 1; all individuals are under 15 years old. To evaluate the performance of the protocol on multi-site data, two further datasets were generated, i.e., Dataset 2 and Dataset 3. The samples of Dataset 2 were selected from five different centers (LEUVEN, NYU, UCLA, UM and USM), totaling 250 ASD and 218 TD subjects under 18 years old. Dataset 3 contained 505 ASD samples and 530 TD samples from all 17 institutes.
Each dataset, e.g., Dataset 1, was split into a training set, a validation set and a test set. Twenty percent of each dataset was randomly split off as a test set, which was kept strictly separate. The remainder of the dataset was further divided into a training set and a validation set with a ratio of 6:1. The models were optimized on the training set, and their performance was evaluated on the corresponding validation set. Cross-validation was employed in training the models: the samples are split into K parts, the kth part (k ranges from 1 to K) is used as the validation set, and the remaining K−1 parts are used as the training set. Through cross-validation, K models were trained and validated. The models’ hyper-parameters were optimized according to the average of the models’ performance metrics. In this paper, K was set to seven.
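The splitting scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_dataset`, the seed, and the plain (non-stratified) shuffle are hypothetical choices, since the text does not say whether the authors stratified by class or which library they used.

```python
import random


def split_dataset(n_samples, test_fraction=0.2, k_folds=7, seed=0):
    """Hold out a test set, then build K cross-validation folds on the rest.

    Assumption: a plain random shuffle; the paper does not state whether
    the split was stratified by diagnosis or by site.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_test = round(n_samples * test_fraction)
    test_idx = indices[:n_test]      # kept strictly separate from training
    remainder = indices[n_test:]

    # Deal the remaining samples into K nearly equal parts.
    parts = [remainder[i::k_folds] for i in range(k_folds)]

    folds = []
    for k in range(k_folds):
        val_idx = parts[k]           # the kth part validates
        train_idx = [i for j, part in enumerate(parts) if j != k for i in part]
        folds.append((train_idx, val_idx))
    return test_idx, folds


# Dataset 1 has 54 ASD + 46 TD = 100 subjects.
test_idx, folds = split_dataset(100)
```

With K = 7, each fold trains on six parts and validates on one, which is where the 6:1 training-to-validation ratio comes from.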
Finally, the independent test set was used to evaluate the generalization ability of the models.
2.2 Data Preprocessing
The rs-fMRI data pre-processed with the Configurable Pipeline for the Analysis of Connectomes (C-PAC) were downloaded. The image pre-processing pipeline consists of slice timing correction, motion correction and intensity normalization. By applying the pre-processing pipeline, noise signals from head motion, scanner drift, respiration and cardiac pulsation were removed. Then, a regions-of-interest (ROI) atlas was used to extract the Pearson correlation coefficient matrix of the mean time series. The upper triangle of the coefficient matrix was converted to the feature vector. Therefore, the dimension of the feature vectors follows Eq. (1),
$$ S = \frac{(N - 1)\,N}{2} \tag{1} $$
in which S represents the dimension of the feature vectors and N denotes the number of brain regions. In this paper, the Automatic Anatomical Labeling (AAL) atlas was aligned onto the images of the three datasets. The AAL atlas segments the human brain into 116 regions, so the corresponding coefficient matrix has a shape of 116 × 116 and each subject has 6670 features.
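As a rough illustration of how a correlation matrix becomes a feature vector, the sketch below computes Pearson correlations between ROI time series and flattens the strict upper triangle. The function names and the toy time series are hypothetical, and the standard library is used only to keep the example self-contained; it is not the authors' pipeline.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fc_feature_vector(roi_series):
    """Flatten the strict upper triangle of the ROI correlation matrix."""
    n = len(roi_series)
    return [pearson(roi_series[i], roi_series[j])
            for i in range(n) for j in range(i + 1, n)]


def feature_dimension(n_regions):
    """Eq. (1): S = (N - 1) * N / 2."""
    return (n_regions - 1) * n_regions // 2


# Toy data: 4 hypothetical ROIs, 5 time points each.
toy_series = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 4, 3, 2, 1],
    [1, 3, 2, 5, 4],
]
features = fc_feature_vector(toy_series)
```

For the toy input the vector has 4 × 3 / 2 = 6 entries, and `feature_dimension(116)` reproduces the 6670 features per subject quoted for the AAL atlas.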
2.3 Feature Selection
To improve the efficiency of model training and enhance the models’ robustness, a feature selection procedure was performed to identify the attributes important to ASD classification. The least absolute shrinkage and selection operator (Lasso), L1-norm support vector machine (L1-SVM) and random forest (RF) feature selection methods were each applied to the training sets. The Lasso method is based on the L1-norm penalty, whose term is defined as $\varphi(W) = \|W\|_1$, where $W = (\omega_1, \omega_2, \cdots, \omega_n)^T$ is the vector of unknown parameters. Lasso is commonly used as a sparsity-inducing feature selection method for linear multivariate regression. L1-SVM maps the original input space to a hyperplane with maximum margin and eliminates the feature with the least squared weight. RF consists of binary decision trees, and a random selection of features is used to construct each node of a tree. RF can estimate the out-of-bag (OOB) error, where the OOB set is the set of samples not used in training the current tree. First, the OOB error ($\mathrm{err}X^{j}$) of each feature is estimated by RF; then the values of that feature are randomly permuted and the corresponding OOB error ($\mathrm{err}X^{j}_{oob}$) is estimated. The importance score of a particular feature is defined in Eq. (2), which represents the mean over all trees of the difference between the OOB errors before and after the feature values are permuted,
$$ VI(X^{j}) = \frac{1}{nb\_trees} \sum_{t=1}^{nb\_trees} \left| \mathrm{err}X^{j} - \mathrm{err}X^{j}_{oob} \right| \tag{2} $$
where both errors are evaluated on tree t.
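A minimal sketch of the importance score in Eq. (2), followed by a simple ranking step, may help make the definition concrete. The per-tree error values below are made up for illustration, and the helper names are hypothetical; in practice the errors would come from a fitted random forest.

```python
def variable_importance(err_before, err_after):
    """Eq. (2): mean over trees of |OOB error before - OOB error after permutation|."""
    if len(err_before) != len(err_after):
        raise ValueError("need one error pair per tree")
    nb_trees = len(err_before)
    return sum(abs(b - a) for b, a in zip(err_before, err_after)) / nb_trees


def rank_features(scores, top_k):
    """Return the indices of the top_k features, highest score first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:top_k]


# Hypothetical OOB errors for 4 trees, before permuting any feature ...
err_before = [0.30, 0.28, 0.32, 0.30]
# ... and after permuting feature 0, 1 or 2 (illustrative numbers only).
err_after = [
    [0.30, 0.29, 0.32, 0.31],
    [0.40, 0.38, 0.41, 0.39],
    [0.33, 0.30, 0.36, 0.33],
]
scores = [variable_importance(err_before, e) for e in err_after]
ranking = rank_features(scores, top_k=2)
```

Here permuting feature 1 hurts the OOB error most, so it ranks first; a feature whose permutation barely changes the error (feature 0) scores close to zero.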
In this work, the above three feature selection methods were each applied to compute the weight scores of all features and rank them.
2.4 Model Construction and Analysis
XGBoost is a tree boosting method [22], which combines several weak learners into a strong learner as in the following equation,
$$ \varphi(x_i) = \sum_{k=1}^{K} f_k(x_i) \tag{3} $$
where $f_k(\cdot)$ denotes a particular weak learner and K is the number of weak learners. The core of XGBoost is Newton boosting, which optimizes the parameters by minimizing the loss function, as shown in Eq. (4),
$$ L(\theta) = \sum_{i=1}^{n} l(\varphi(x_i), y_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{4} $$
$$ \Omega(f_k) = \gamma T + \frac{1}{2}\,\alpha\,\|\omega\|^{2} \tag{5} $$
where $\Omega(f_k)$ represents the complexity of the k-th tree model, and n and T are the number of samples and the number of nodes of the tree, respectively. ω denotes the weights of the nodes, γ is the penalty on the complexity of a tree with T nodes, and α is the regularization coefficient of $f_k$.
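To make the roles of the terms in Eqs. (3) to (5) concrete, the following sketch evaluates the additive prediction, the per-tree complexity and the regularized objective for toy inputs. It assumes a squared-error loss and represents each weak learner as a plain Python function with a list of node weights; these are illustrative choices, not the XGBoost library's internals.

```python
def ensemble_prediction(trees, x):
    """Eq. (3): the prediction is the sum of the weak learners' outputs."""
    return sum(f(x) for f in trees)


def tree_complexity(node_weights, gamma, alpha):
    """Eq. (5): gamma * T + 0.5 * alpha * ||w||^2, with T = number of weights."""
    T = len(node_weights)
    return gamma * T + 0.5 * alpha * sum(w * w for w in node_weights)


def objective(preds, labels, trees_weights, gamma, alpha, loss):
    """Eq. (4): summed per-sample loss plus summed tree complexity."""
    data_term = sum(loss(p, y) for p, y in zip(preds, labels))
    reg_term = sum(tree_complexity(w, gamma, alpha) for w in trees_weights)
    return data_term + reg_term


def squared_error(p, y):
    # Assumed loss for illustration; the paper does not name the loss used.
    return (p - y) ** 2


# Two hypothetical decision stumps, each with two node weights.
stump_a = lambda x: 0.5 if x[0] > 0 else -0.5
stump_b = lambda x: 0.25 if x[1] > 1 else -0.25
trees = [stump_a, stump_b]
weights = [[0.5, -0.5], [0.25, -0.25]]

samples = [[1.0, 2.0], [-1.0, 0.0]]
labels = [1.0, 0.0]
preds = [ensemble_prediction(trees, x) for x in samples]
total = objective(preds, labels, weights, gamma=1.0, alpha=2.0, loss=squared_error)
```

Raising `gamma` penalizes trees with more nodes, while raising `alpha` shrinks the node weights, so both push the ensemble toward simpler weak learners.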
XGBoost constructs each node as a basic learner, and it can also reduce the correlation between learners by subsampling columns. The functional connection signals of the human brain are weak signals, and the corresponding brain regions can be used to construct the tree structure of XGBoost. The model can then be optimized by minimizing the loss function. In this paper, XGBoost was employed to combine several hundred weak FC signals to strengthen the model performance. In the analysis of brain region biomarkers, the feature vectors of the trained models were analyzed, and the brain regions contributing to them were identified and compared comprehensively. In this way, the crucial regions related to ASD were revealed, which may be possible biomarkers for ASD diagnosis.
3 Results and Discussions
3.1 The Performance of the Classification System on the Single-Site Dataset
In this section, the whole computational protocol was applied to Dataset 1. The effects of the feature selector and the number of features on performance were systematically investigated. The evaluation criteria of the model comprise the confusion matrix, sensitivity (true positive rate, TPR), specificity (true negative rate, TNR), accuracy (ACC), positive predictive value (PPV), negative predictive value (NPV), and F1 score. The best feature selector and its optimized feature vector were then chosen according to their performance in classifying ASD. In addition, the generalization ability of the optimized model was evaluated on the fully independent test set.
The Effect of Feature Selectors on the ASD Classification Performance
In this section, Lasso, L1-SVM and RF were used to construct three kinds of feature selectors. To ensure that enough influencing factors were included, the dimension of the feature vector was set to 300. For each feature selector, the corresponding ensemble feature selection method was compared with single selection, where the ensemble method performed feature selection on the training set samples twenty times, averaged the twenty resulting coefficients for each feature, and selected the 300 most important features. The XGBoost-based models were trained using the chosen features of the training set samples, and their performance was evaluated on the validation set of Dataset 1. To reduce the influence of randomness, each experiment was performed ten times and the average performance was computed. The confusion matrices of the models using the six feature selectors are shown in Fig. 1. A confusion matrix consists of four parts, i.e., true negatives (TN), false positives (FP), false negatives (FN) and true positives (TP). TN means that the prediction is TD and the real label is also TD. FP means that the prediction is ASD but the real label is TD. FN means that the prediction is TD but the real label is ASD. TP means that the prediction is ASD and the real label is also ASD. Here, the sums of twenty experiments for the four quantities are shown in Fig. 1. The values of FP and FN in Fig. 1(a) and (d) are clearly larger than the others, meaning that the Lasso feature selection method is not well suited to the functional connectivity signals between brain regions. The L1-SVM feature selector showed high TP and TN, as well as low FP and FN. Furthermore, comparing Fig. 1(c) and (f), the ensemble L1-SVM exhibited the best performance.
Fig. 1. The confusion matrix of models on the validation set of Dataset 1 based on (a) single Lasso; (b) single RF; (c) single L1-SVM; (d) ensemble Lasso; (e) ensemble RF; (f) ensemble L1-SVM feature selector.
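The six evaluation criteria used in this section all follow from the four confusion-matrix counts. The sketch below shows the standard definitions, with ASD treated as the positive class; the function name and the example counts are hypothetical and are not taken from Fig. 1.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts (ASD = positive)."""
    tpr = tp / (tp + fn)                    # sensitivity
    tnr = tn / (tn + fp)                    # specificity
    ppv = tp / (tp + fp)                    # positive predictive value
    npv = tn / (tn + fn)                    # negative predictive value
    acc = (tp + tn) / (tp + fp + fn + tn)   # accuracy
    f1 = 2 * ppv * tpr / (ppv + tpr)        # harmonic mean of PPV and TPR
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv,
            "NPV": npv, "ACC": acc, "F1": f1}


# Illustrative counts only.
example = classification_metrics(tp=40, fp=5, fn=10, tn=45)
```

Note that the sketch does not guard against empty classes; a zero denominator (for example, no predicted positives) would need separate handling.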
The Effect of Feature Number on the ASD Classification Performance
In this section, the ensemble L1-SVM feature selector was used to construct a series of feature vectors of different sizes and to evaluate the effect of feature size on classification performance. Too few features would reduce the accuracy of predicting ASD, but too many features can increase the influence of noise. To find a favorable size, vectors with 50, 100, 150, 200, 250 and 300 features were used to construct the classification models and compute the corresponding performance. The training steps were consistent with the above section. The performance of those models on the validation sets is shown in Fig. 2. All curves rise first and then fall, indicating that the models’ performance first improved and then declined as the feature number ranged from 50 to 300. When the feature number equals 150, the model has the highest values of TNR, PPV and ACC, which were 0.9370, 0.9374 and 0.9133, respectively. In addition, the NPV and F1-score of the model were 0.9075 and 0.9047 when the feature size was 150, both slightly lower than the corresponding maximum values. Therefore, the optimal feature size for the ensemble L1-SVM feature selector was 150.
To evaluate the generalization ability of the trained model, the ensemble L1-SVM feature selector picked 150 features for each sample of the test set. Those features were then used as input to the trained model to directly predict whether the samples belong to ASD. The TPR and NPV of the trained model on the test set were 0.836 and 0.830, slightly lower than those on the validation set. The ACC, F1-score, TNR and PPV of the trained model on the test set were 0.796, 0.799, 0.756 and 0.774, respectively. These results indicate that the model obtained by our computational protocol generalizes.
Fig. 2. The performance (a) TPR, (b) TNR, (c) PPV, (d) NPV, (e) ACC, (f) F1-score of models on the validation set of Dataset 1, where the ensemble L1-SVM feature selector was used and the feature size ranged from 50 to 300.
3.2 The Performance of the Classification System on Multi-site Datasets
The samples of Dataset 2 come from five different institutes, and Dataset 3 is composed of samples collected from seventeen centers. Compared with Dataset 1, Dataset 2 and Dataset 3 are more heterogeneous. In this section, the same computational protocol as in Sect. 3.1 was applied to Dataset 2 and Dataset 3 to evaluate the performance of the protocol on heterogeneous data. In detail, the ensemble L1-SVM feature selector first picked 150 features for each sample, and XGBoost was then used to train models on those features. We compared the performance of the models in this paper with that of three state-of-the-art models named ML-SVM, MCLRR and DNN [15, 18, 19], which were trained on samples of Dataset 1, Dataset 2 and Dataset 3, respectively. The results are shown in Table 1. The performance of our models was clearly higher than that of the compared ones. For example, the ACC of our model on Dataset 1 is 0.9133, which is about 12.75% higher than that of ML-SVM on Dataset 1. The TPR, TNR and ACC of our model on Dataset 2 were 0.8058, 0.8207 and 0.8133, which were 14.72%, 23.54% and 17.70% higher than the corresponding values of MCLRR on Dataset 2. The ACC and TNR of our model on Dataset 3 were also higher than those of DNN on Dataset 3. These results further indicate that, compared with the above three models, the models of this paper perform very well in predicting ASD.

Table 1. The comparison of our models’ performance with the performance of prior state-of-the-art models.

Model                       TPR     TNR     PPV     NPV     ACC     F1-score
Our model on Dataset 1      0.8874  0.9371  0.9374  0.9075  0.9133  0.9047
Our model on Dataset 2      0.8058  0.8207  0.8185  0.8158  0.8133  0.8095
Our model on Dataset 3      0.7259  0.8224  0.7897  0.7699  0.7765  0.7545
ML-SVM on Dataset 1 [18]    0.82    0.80    0.83    0.78    0.81    0.83
MCLRR on Dataset 2 [19]     0.7024  0.6643  /       /       0.691   /
DNN on Dataset 3 [15]       0.74    0.63    /       /       0.70    /
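The percentage improvements quoted in this section appear to be relative gains over the baseline, computed from the Table 1 values. The short sketch below reproduces that arithmetic; the helper name `relative_gain` and the dictionary layout are illustrative assumptions.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# (our ACC, baseline ACC) pairs taken from Table 1.
acc_pairs = {
    "Dataset 1 vs ML-SVM": (0.9133, 0.81),
    "Dataset 2 vs MCLRR": (0.8133, 0.691),
    "Dataset 3 vs DNN": (0.7765, 0.70),
}

for name, (ours, baseline) in acc_pairs.items():
    print(f"{name}: {relative_gain(ours, baseline):.2f}% higher ACC")
```

Under this reading, the Dataset 1 and Dataset 2 gains come out at about 12.75% and 17.70%, and the Dataset 3 gain at roughly 10.9%. The same formula applied to TPR and TNR on Dataset 2 gives about 14.72% and 23.54%.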
3.3 Analysis of the Biomarkers of ASD
In this section, the input features of the models were traced back and analyzed to reveal the brain regions related to ASD. The features of the Dataset 1 samples were analyzed first. The top 10 input features of the model on Dataset 1, which contributed approximately 68% of the importance among the total of 150 features, were used to reveal the relevant brain regions. The brain regions of the top 10 important features are listed in Table 2. There are five inter-hemispheric FCs, four left intra-hemispheric FCs and one right intra-hemispheric FC in Table 2. Among those 10 FCs, there are 8 FCs whose corresponding brain regions belong to the cingulo-opercular network (CON) [23], which are in bold font. CON is related to many mental diseases and is involved in a wide range of cognitive processes [24–26], e.g., word recognition and alertness. Vaden et al. found that CON activity has broad significance for speech recognition in challenging conditions and partially accounts for why and when people experience speech-recognition impairments [24]. Coste et al. [25] compared the responses of different networks to the target task, and found that CON plays a central role in sustaining alertness. In addition, Barttfeld and coworkers used a modified resting-state paradigm to drive subjects’ attention and provided evidence of a very marked interaction between CON and ASD [26]. Of the above brain regions, 13 belong to the left hemisphere and 7 belong to the right hemisphere. The left hemisphere has more associations with ASD than the right one, which agrees with previous studies [27–29].
Brain regions in Table 2 include the parahippocampal gyrus (PHG), superior parietal gyrus (SPG), middle frontal gyrus (MFG), postcentral gyrus (PoCG), precuneus (PCUN), inferior frontal gyrus, opercular part (IFGoperc), olfactory cortex (OLF), middle frontal gyrus, orbital part (ORBmid), anterior cingulate and paracingulate gyri (ACG), amygdala (AMYG), superior frontal gyrus, orbital frontal cortex (ORBsup), middle temporal gyrus (MTG), superior occipital gyrus (SOG), Heschl gyrus (HES) and angular gyrus (ANG). The PHG belongs to the default-mode network (DMN), which is implicated in supporting memory.
Table 2. The brain regions of the top 10 important functional connection (FC) features. Each row represents an FC connecting brain region 1 to brain region 2. Bold font indicates the brain region belonging to cingulo-opercular network (CON). Letters ‘R’ and ‘L’ refer to Right and Left respectively.

Region 1      Region 2    Importance
PHG.R         SPG.L       0.19
MFG.L         PoCG.R      0.15
PHG.R         PCUN.L      0.07
IFGoperc.R    OLF.L       0.05
ORBmid.L      ACG.L       0.04
AMYG.L        PoCG.L      0.03
ORBsup.L      AMYG.L      0.03
AMYG.L        MTG.R       0.03
SOG.R         HES.R       0.03
IFGoperc.L    ANG.L       0.02
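The hemisphere counts quoted for Table 2 can be re-derived mechanically from the `.L` and `.R` suffixes of the region labels. The sketch below is a small bookkeeping aid under that assumption; the helper names are hypothetical, and the rows are copied from Table 2.

```python
from collections import Counter

# Rows of Table 2: (region 1, region 2, importance).
TOP_FCS = [
    ("PHG.R", "SPG.L", 0.19),
    ("MFG.L", "PoCG.R", 0.15),
    ("PHG.R", "PCUN.L", 0.07),
    ("IFGoperc.R", "OLF.L", 0.05),
    ("ORBmid.L", "ACG.L", 0.04),
    ("AMYG.L", "PoCG.L", 0.03),
    ("ORBsup.L", "AMYG.L", 0.03),
    ("AMYG.L", "MTG.R", 0.03),
    ("SOG.R", "HES.R", 0.03),
    ("IFGoperc.L", "ANG.L", 0.02),
]


def hemisphere(region):
    """Return 'L' or 'R' from a label such as 'PHG.R'."""
    return region.rsplit(".", 1)[1]


def tally(fcs):
    """Count connection types and how often each hemisphere appears."""
    kinds, sides = Counter(), Counter()
    for region_1, region_2, _ in fcs:
        h1, h2 = hemisphere(region_1), hemisphere(region_2)
        sides[h1] += 1
        sides[h2] += 1
        kinds["inter" if h1 != h2 else f"intra-{h1}"] += 1
    return kinds, sides


kinds, sides = tally(TOP_FCS)
```

The tally gives five inter-hemispheric, four left intra-hemispheric and one right intra-hemispheric connection, with 13 left-hemisphere and 7 right-hemisphere region endpoints, matching the counts stated in the text.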
The PHG has been shown to be the primary hub of the DMN in the medial temporal lobe memory system [30]. The SOG may be associated with the quantitative variation in language-related phenotypes in ASD. These findings are consistent with the behavioral phenotype of ASD [31]. The HES and MTG are related to the auditory network, and some previous studies have reported an atypical reduction in FCs between the temporal region and a medial prefrontal area in autism [32]. The ORBsup plays an important role in some other mental diseases such as obsessive-compulsive disorder and geriatric depression [33]. Some other brain regions, such as OLF, SPG, PoCG and ANG, are related to the perception of emotion, the interpretation of sensory information, language performance, and motor coordination [23]. The analysis of brain regions shows that the brain regions involved in cognitive processes and memory behaviors play a crucial role in predicting ASD. Specifically, the brain regions belonging to CON and DMN could be regarded as biomarkers of ASD, e.g., ORBmid, AMYG, MFG, IFGoperc, ACG, SFGdor and SFGmed.
4 Conclusions
In this paper, a new computational protocol was proposed to classify ASD and TD based on rs-fMRI images. The protocol is divided into three main parts. First, the AAL atlas was applied to the preprocessed rs-fMRI images to extract the functional connection features between brain regions. Then, through an ensemble L1-SVM feature selection, 150 features were selected from the initial features as the input features of the models. Finally, the models were trained using the XGBoost algorithm. The results show that the models constructed through our protocol perform very well and generalize on both the single-site dataset and the multi-site datasets. Compared with the prior models, our models on the three datasets improve the prediction accuracy by 12.75%, 17.70% and 10.92%, respectively. Through the analysis of the brain regions of the input features, some regions belonging to CON or DMN were found to have a close link with ASD. In particular, approximately 80% of the most important functional connections involve brain regions belonging to CON. Those brain regions are usually involved in human cognitive processes and memory behaviors, e.g., ORBmid, AMYG, MFG, IFGoperc, ACG, SFGdor and SFGmed. Therefore, CON and DMN could be potential biomarkers for clinical studies of ASD.
Acknowledgements. This work was supported by Strategic Priority CAS Project XDB38050100, the National Key Research and Development Program of China under grant No. 2018YFB0204403, National Science Foundation of China under grant no. U1813203, the Shenzhen Basic Research Fund under grant no. KQTD20200820113106007, JSGG20201102163800001 and RCYX2020071411473419, CAS Key Lab under grant no. 2011DP173015.
References
1. Lord, C., Cook, E.H., Leventhal, B.L., Amaral, D.G.: Autism spectrum disorders. Neuron 28, 355–363 (2000)
2. Lord, C., Elsabbagh, M., Baird, G., Veenstra-Vanderweele, J.: Autism spectrum disorder. Lancet 392, 508–520 (2018)
3. Hazlett, H.C., et al.: Early brain development in infants at high risk for autism spectrum disorder. Nature 542, 348–351 (2017)
4. McElhanon, B.O., McCracken, C., Karpen, S., Sharp, W.G.: Gastrointestinal symptoms in autism spectrum disorder: a meta-analysis. Pediatrics 133, 872–883 (2014)
5. Abrams, D.A., et al.: Underconnectivity between voice-selective cortex and reward circuitry in children with autism. Proc. Natl. Acad. Sci. 110, 12060–12065 (2013)
6. State, M.W., Sestan, N.: The emerging biology of autism spectrum disorders. Science 337, 1301–1303 (2012)
7. Yahata, N., et al.: A small number of abnormal brain connections predicts adult autism spectrum disorder. Nat. Commun. 7, 1–12 (2016)
8. Daniels, A.M., Mandell, D.S.: Explaining differences in age at autism spectrum disorder diagnosis: a critical review. Autism 18, 583–597 (2014)
9. Warren, Z., McPheeters, M.L., Sathe, N., Foss-Feig, J.H., Veenstra-VanderWeele, J.: A systematic review of early intensive intervention for autism spectrum disorders. Pediatrics 127, 1303–1311 (2011)
10. Ruiz-Rizzo, A.L., et al.: Decreased cingulo-opercular network functional connectivity mediates the impact of aging on visual processing speed. Neurobiol. Aging 73, 50–60 (2019)
11. Yang, J., Lee, J.: Different aberrant mentalizing networks in males and females with autism spectrum disorders: evidence from resting-state functional magnetic resonance imaging. Autism 22, 134–148 (2018)
12. Liu, T., Liu, X., Yi, L., Zhu, C., Markey, P.S., Pelowski, M.: Assessing autism at its social and developmental roots: a review of autism spectrum disorder studies using functional near-infrared spectroscopy. Neuroimage 185, 955–967 (2019)
13. Sarraf, S., Tofighi, G.: Classification of Alzheimer’s disease using fMRI data and deep learning convolutional neural networks. arXiv preprint arXiv:1603.08631 (2016)
14. Zhu, Y., Zhu, X., Zhang, H., Gao, W., Shen, D., Wu, G.: Reveal consistent spatial-temporal patterns from dynamic functional connectivity for autism spectrum disorder identification. In: Ourselin, S., Joskowicz, L., Sabuncu, M., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9900, pp. 106–114. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46720-7_13
15. Heinsfeld, A.S., Franco, A.R., Craddock, R.C., Buchweitz, A., Meneguzzi, F.: Identification of autism spectrum disorder using deep learning and the ABIDE dataset. NeuroImage Clin. 17, 16–23 (2018)
16. Sarraf, S., Tofighi, G.: Classification of Alzheimer’s disease structural MRI data by deep learning convolutional neural networks. arXiv preprint arXiv:1607.06583 (2016)
17. Sarraf, S., Tofighi, G.: Initiative ADN DeepAD: Alzheimer’s disease classification via deep convolutional neural networks using MRI and fMRI. BioRxiv 70, 441 (2016)
18. Zhao, F., Zhang, H., Rekik, I., An, Z., Shen, D.: Diagnosis of autism spectrum disorders using multi-level high-order functional networks derived from resting-state functional MRI. Front. Hum. Neurosci. 12, 184 (2018)
19. Wang, M., Zhang, D., Huang, J., Shen, D., Liu, M.: Low-rank representation for multi-center autism spectrum disorder identification. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 647–654. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_73
20. He, Y., Byrge, L., Kennedy, D.P.: Nonreplication of functional connectivity differences in autism spectrum disorder across multiple sites and denoising strategies. Hum. Brain Mapp. 41, 1334–1350 (2020)
21. Abide, I.: http://fcon_1000.projects.nitrc.org/indi/abide/abide_I.html. Accessed 24 June 2016
22. Zhou, Y.R., Li, T.Y., Shi, J.Y., Qian, Z.J.: A CEEMDAN and XGBOOST-Based Approach to Forecast Crude Oil Prices. Hindawi Complexity (2019)
23. Sheffield, J.M., et al.: Fronto-parietal and cingulo-opercular network integrity and cognition in health and schizophrenia. Neuropsychologia 73, 82–93 (2015)
24. Vaden, K.I., Kuchinsky, S.E., Cute, S.L., Ahlstrom, J.B., Dubno, J.R., Eckert, M.A.: The cingulo-opercular network provides word-recognition benefit. J. Neurosci. 33, 18979–18986 (2013)
25. Coste, C.P., Kleinschmidt, A.: Cingulo-opercular network activity maintains alertness. Neuroimage 128, 264–272 (2016)
26. Pablo, B., et al.: State-dependent changes of connectivity patterns and functional brain network topology in autism spectrum disorder. Neuropsychologia 50, 3653–3662 (2012)
27. Daniel, P., Mahajan, R., Crocetti, D., Mejia, A., Mostofsky, S.: Left-hemispheric microstructural abnormalities in children with high-functioning autism spectrum disorder. Autism Res. 8, 61–72 (2015)
28. Perkins, T.J., et al.: Increased left hemisphere impairment in high-functioning autism: a tract based spatial statistics study. Psychiatry Res. Neuroimaging 224, 119–123 (2014)
29. Dawson, G., Warrenburg, S., Fuller, P.: Hemisphere functioning and motor imitation in autistic persons. Brain Cogn. 2, 346–354 (1983)
30. Ward, A.M., Schultz, A.P., Huijbers, W., Van Dijk, K.R.A., Hedden, T., Sperling, R.A.: The parahippocampal gyrus links the default-mode cortical network with the medial temporal lobe memory system. Hum. Brain Mapp. 35, 1061–1073 (2014)
31. Paracchini, S.: Dissection of genetic associations with language-related traits in population-based cohorts. J. Neurodev. Disord. 3, 365–373 (2011)
32. Watanabe, T., Rees, G.: Brain network dynamics in high-functioning individuals with autism. Nat. Commun. 8, 1–14 (2017)
33. Szeszko, P.R., et al.: Orbital frontal and amygdala volume reductions in obsessive-compulsive disorder. Arch. Gen. Psychiatry 56, 913–919 (1999)
LiteTrans: Reconstruct Transformer with Convolution for Medical Image Segmentation
Shuying Xu and Hongyan Quan(B)
School of Computer Science and Technology, East China Normal University, Shanghai, China
[email protected], [email protected]
Abstract. Combining convolution with Transformers has been highly successful in medical image segmentation. However, existing approaches still fall short of highly accurate segmentation of complex, low-contrast anatomical structures at a low computational cost. To address this, we propose LiteTrans, a lightweight Transformer-based medical image segmentation framework that deeply integrates Transformer and CNN in a U-shaped encoder-decoder architecture with skip connections. Inspired by the Transformer, a novel multi-branch module that combines convolution with Local-Global Self-Attention (LGSA) is incorporated into LiteTrans to unify local and non-local feature interactions. In particular, LGSA approximates global self-attention at lower computational complexity. We evaluate LiteTrans through extensive experiments on the Synapse multi-organ and ACDC datasets; the results show that it achieves state-of-the-art performance compared with other segmentation methods, with fewer parameters and lower FLOPs.

Keywords: Medical image segmentation · Transformers · Convolutional neural networks · Local-Global Self-attention · Multi-branch

1 Introduction
Continuously improving segmentation accuracy for complex anatomical structures has become an urgent need in precision medicine and smart medicine [18]. CNNs achieve excellent results in medical image segmentation, but they still do not meet the accuracy requirements on some complex and low-contrast datasets. The recently popular Transformer [27] offers a way to address this problem, and it has achieved good results when applied to computer vision [2,3,36]. However, applying attention directly to an image sharply increases the amount of computation. Reducing the computation incurred by attention is therefore the key to the success of such models.

We propose LiteTrans, which reconstructs the Transformer with convolutions for medical image segmentation. On the one hand, LiteTrans uses convolution to keep the model lightweight; on the other hand, it deeply integrates the advantages of CNN and Transformer for segmentation. Motivated by the convolution operation, we design a novel attention method called LGSA, which reduces the amount of computation by using a local-global approach to approximate global attention. We then build a multi-branch module composed of convolution and LGSA-based self-attention operations, which enables dynamic interaction between convolution and self-attention. Finally, we reconstruct the Transformer block using this module to form the LiteTrans network for medical image segmentation.

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 300–313, 2021. https://doi.org/10.1007/978-3-030-91415-8_26
2 Related Work
The Method of Pure CNN: CNNs handle medical image segmentation very well. One of the best-known models is U-Net [24], and many U-Net variants have been produced. Milletari et al. proposed a similar structure, V-Net [22], which is mainly used for 3D medical image segmentation, uses convolution instead of pooling, and adds residual connections. Jégou et al. developed a densely connected network architecture (DenseNet [12]). Zhou et al. proposed UNet++ [38], which connects all U-Net layers (from one to four), allowing the network to automatically learn the importance of features on different layers. Jha et al. proposed DoubleU-Net [16], which stacks two U-Net network structures [26].

The Method of Combining Attention and CNN: In recent years, adding attention to CNNs has become a trend. Oktay et al. proposed the Attention U-Net [23] model, which uses attention gates in the decoder. Hu et al. proposed SENet [11], which introduced channel attention to the field of image analysis. Kaul et al. proposed FocusNet [17], which mixes spatial attention and channel attention for medical image segmentation. Wang et al. proposed the non-local U-Nets [31], which are equipped with flexible global aggregation blocks, for biomedical image segmentation. There are other excellent works [13,15,28,29,34].

The Method of Pure Transformer: The Transformer [27] proposed by Vaswani et al. is a deep neural network based mainly on the self-attention mechanism. Inspired by the Transformer's great success in natural language processing, researchers have recently applied Transformers to computer vision (CV) tasks [7,10,14,21,25,30,32,37]. Dosovitskiy et al. proposed the Vision Transformer (ViT) [7], which applied a pure Transformer to classification tasks and obtained excellent results. Wang et al. proposed the Pyramid Vision Transformer (PVT) [30], which overcomes the difficulties of porting the Transformer to various dense prediction tasks. Liu et al. proposed the Swin Transformer [21], a hierarchical Transformer whose representation is computed with shifted windows. Cao et al. proposed Swin-Unet [2] for medical image segmentation based on [21].

The Method of Combining CNN and Transformer: Recently, there have been many cases of combining CNNs with Transformers. Wu et al. proposed Lite Transformer [33], which achieves lightweight computation. Wu et al. proposed the Convolutional vision Transformer (CvT) [32], which introduces convolutions into ViT [7]. Chen et al. proposed TransUNet [3], which combines the merits of Transformers and U-Net for medical image segmentation. In addition, there are other excellent works [6,19,20,35,36].
3 Method
Medical image segmentation involves large inputs and demands high accuracy. To meet these requirements, we design a lightweight Transformer segmentation architecture (LiteTrans), which reconstructs the Transformer with convolution and combines convolution and self-attention in depth. In this section, we first propose an approximation scheme for global attention, named Local-Global Self-Attention (LGSA). LGSA achieves more competitive performance than the vanilla Transformer with a lighter attention computation. We then build a multi-branch module composed of convolution and LGSA-based self-attention operations, which enables dynamic interaction between convolution and self-attention. Finally, we reconstruct the Transformer block using this module to form the LiteTrans network for medical image segmentation.
[Figure 1: architecture diagram. Labels in the image: Convolved Feature Embedding; LiteTrans Block (×2 and ×8 stages); Up-sampling; SegmentationHead; Add & Norm; Feed Forward; Multi-Head Self-Attention; depth-wise separable convolutions; Feature of Conv; Feature of Attention; Merge; pooling; Local Attention; Global Attention; Local-Global Self-Attention (LGSA).]

Fig. 1. (a) The pipeline of the proposed LiteTrans, which has a hierarchical multi-level structure. Conv2d 3×3 convolutions scale and expand the features, and the LiteTrans blocks extract multi-scale features. (b) Schematic of the LiteTrans block, which contains two parallel branches: convolution and Local-Global Self-Attention. (c) Details of Local-Global Self-Attention, which simulates global attention by combining local and global attention.
3.1 Revisit Convolution and Self-attention
The Convolution Module. The convolution operation is local, linear, and static. Let the 2D feature map $X \in \mathbb{R}^{H \times W \times C_i}$ be the input, where $H$ is the height, $W$ is the width, and $C_i$ is the number of input channels. Convolution produces the output feature $Y \in \mathbb{R}^{H \times W \times C_o}$, where $C_o$ is the number of output channels:

$$Y_{c_o,i,j} \overset{\mathrm{def}}{=} \sum_{c_i=0}^{C_i} \sum_{(\delta_i,\delta_j)\in\Delta_K} W_{c_o,c_i,\delta_i+\lfloor K/2\rfloor,\delta_j+\lfloor K/2\rfloor}\, X_{c_i,i+\delta_i,j+\delta_j} + B_{c_o}, \tag{1}$$

where $B_{c_o}$ is the bias of the convolution kernel and $W$ is the convolution kernel of size $K \times K$. $\Delta_K \subset \mathbb{Z}^2$ is the set of offsets covered by the sliding window, and $(\delta_i,\delta_j)$ is a displacement drawn from $\Delta_K$, which is written as:

$$\Delta_K = [-\lfloor K/2\rfloor, \cdots, \lfloor K/2\rfloor] \times [-\lfloor K/2\rfloor, \cdots, \lfloor K/2\rfloor]. \tag{2}$$
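As a minimal sketch of Eq. (1), the loops below evaluate the sum directly on nested Python lists. The zero padding at the borders and the function name `conv2d_direct` are assumptions made for illustration; the text does not state how borders are handled.

```python
def conv2d_direct(X, W, B):
    """Direct evaluation of Eq. (1) with zero padding (an assumption).

    X: input,  indexed X[ci][i][j],     shape C_i x H x W
    W: kernel, indexed W[co][ci][a][b], shape C_o x C_i x K x K (K odd)
    B: bias,   indexed B[co]
    Returns Y indexed Y[co][i][j], shape C_o x H x W.
    """
    C_in, H, Wd = len(X), len(X[0]), len(X[0][0])
    C_out, K = len(W), len(W[0][0])
    half = K // 2
    # Delta_K of Eq. (2): all integer offsets in [-K//2, K//2] x [-K//2, K//2]
    offsets = [(di, dj) for di in range(-half, half + 1)
                        for dj in range(-half, half + 1)]
    # Start every output position at the bias B[co]
    Y = [[[B[co] for _ in range(Wd)] for _ in range(H)] for co in range(C_out)]
    for co in range(C_out):
        for i in range(H):
            for j in range(Wd):
                for ci in range(C_in):
                    for di, dj in offsets:
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < Wd:  # outside the map counts as zero
                            Y[co][i][j] += W[co][ci][di + half][dj + half] * X[ci][ii][jj]
    return Y
```

The weights `W` are fixed once training ends and each output only sees a K × K neighbourhood, which is what the text means by static and local.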
The Self-attention Module. The Transformer uses self-attention to compute Value (V), Key (K), and Query (Q) in parallel, and models the context through Q, K, and V:

$$Y \overset{\mathrm{def}}{=} \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{3}$$

where $Q = W^{Q}X$, $K = W^{K}X$, and $V = W^{V}X$. Here $X \in \mathbb{R}^{(H \times W) \times C}$ is the input feature map flattened into a one-dimensional sequence, $W^{Q}$, $W^{K}$, $W^{V}$ are the weights of the linear transformations applied to $X$, and $\sqrt{d_k}$ is the scaling factor of the query-key-value sequence. Thus, Eq. 3 can be written as:

$$Y \overset{\mathrm{def}}{=} \mathrm{softmax}\left(\left(W^{Q}X\right)\left(W^{K}X\right)^{T}\right)W^{V}X = \mathcal{W}(X)X, \tag{4}$$

where $\mathcal{W}(X)$ is the equivalent attention weight, which is computed dynamically from the values of the input elements themselves [4]. Therefore, self-attention is non-linear, non-local, and dynamic. Multi-Head Self-Attention (MHSA) can then be represented as:

$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right). \tag{5}$$

3.2 Local-Global Self-Attention (LGSA)
In this section, inspired by the convolution operation, we propose a novel attention method called LGSA, which reduces the amount of computation by using a local-global approach to approximate global attention. As shown in Fig. 1(c), LGSA is divided into two parts: a local attention operator and a global attention operator. The local attention operator computes self-attention within a local region and then selects representatives of each local region to participate in global attention. In addition, LGSA uses depth-wise separable convolutions [5] to replace the original position-wise linear projection in Multi-Head Self-Attention.

Location Embedding. Compared with the original self-attention, we apply a depth-wise separable convolution [5] to the query, key, and value of self-attention. Depth-wise separable convolution (DWSC) replaces the original linear position embedding, which enhances feature semantic information and maintains feature resolution. As shown in Fig. 1(c), the feature vector first undergoes convolutional position mapping before entering the attention calculation. The DWSC can be expressed as:

$$\hat{Q}, \hat{K}, \hat{V} = \mathrm{DepthWiseConv}(x) = \mathrm{Pointwise}(\mathrm{BatchNorm}(\mathrm{Depthwise}(x))), \tag{6}$$

where Depthwise applies a depth-wise convolution to the input layer, BatchNorm is a normalization operation, and Pointwise applies a point-wise convolution with a kernel of size $1 \times 1 \times C_i$, where $C_i$ is the number of channels.

Local Attention Operator. The local attention operator computes self-attention within a local area and establishes dependencies within the local receptive field. Specifically, we first divide the input feature map into several overlapping partitions with a sliding window; these partitions can easily be merged back. The input is written as:

$$x_{p-1} = \left[x_{p-1}^{1,1}, x_{p-1}^{1,2}, \ldots, x_{p-1}^{h,w}\right], \tag{7}$$

where $h$ and $w$ are the height and width of the current feature map, and $x_{p-1}^{i,j}$ is the feature at position $(i, j)$ from the previous layer (layer $p-1$). Then $x_{p-1}$ is split into several overlapping partitions:
$$x_{p-1}^{*} = \mathrm{Unfold}(x_{p-1}) = \left[x_{p-1}^{*,1,1}, x_{p-1}^{*,1,2}, \ldots, x_{p-1}^{*,\frac{h}{K},\frac{w}{K}}\right],$$

where

$$x_{p-1}^{*,i,j} = \left[x_{p-1}^{i,j}, x_{p-1}^{i,\,j+\frac{h}{K}}, \ldots, x_{p-1}^{i+\frac{(K-1)h}{K},\,j+\frac{(K-1)w}{K}}\right], \tag{8}$$

where $x_{p-1}^{*}$ is the set of partitions, Unfold is a function in the torch package that extracts sliding blocks along the $h$ and $w$ dimensions, and $K$ is the kernel size of Unfold, which also determines the size of each partition. Each partition, for example $x_{p-1}^{*,i,j} \in \mathbb{R}^{(K \times K) \times C}$, then undergoes depth-wise separable convolution (Eq. 6) and Multi-Head Self-Attention (MHSA, Eq. 5). All partitions are then merged back into one feature map, which enters the global attention operator:

$$Q_{p-1}^{*,i,j}, K_{p-1}^{*,i,j}, V_{p-1}^{*,i,j} = \mathrm{DepthWiseConv}\left(x_{p-1}^{*,i,j}\right), \tag{9}$$

$$x_{p}^{*,i,j} = \mathrm{MHSA}\left(Q_{p-1}^{*,i,j}, K_{p-1}^{*,i,j}, V_{p-1}^{*,i,j}\right), \tag{10}$$

$$x_{p} = \mathrm{Fold}\left(\left[x_{p}^{*,1,1}, \ldots, x_{p}^{*,\frac{h}{K},\frac{w}{K}}\right]\right), \tag{11}$$

where Fold(·) is the inverse operation of Unfold(·), which stitches the partitions back to size $h \times w$.

Global Attention Operator. The global attention operator selects representative points of each local region to perform global self-attention. First, we down-sample the $x_p$ obtained from Eq. 11 to get the feature map $y_p$, which reduces the size of the feature map. Then depth-wise separable convolution (Eq. 6) and Multi-Head Self-Attention (MHSA, Eq. 5) are applied to $y_p$. Finally, the feature map is up-sampled back to the original size. This process is as follows:

$$Q_p, K_p, V_p = \mathrm{DepthWiseConv}(\mathrm{pooling}(x_p)), \tag{12}$$

$$z_p = \mathrm{Upsampling}(\mathrm{MHSA}(Q_p, K_p, V_p)) + x_p. \tag{13}$$
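The local-then-global flow can be illustrated with a deliberately simplified, standard-library sketch on a 1-D token sequence. Several choices here are assumptions for illustration and are not taken from the paper: the windows do not overlap, the learned depth-wise separable projections are replaced by identity projections, a single attention head is used, pooling is an average, and up-sampling repeats each representative. The names `attention` and `lgsa_1d` are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(tokens):
    """Single-head self-attention as in Eq. (3), with Q = K = V = tokens
    (identity projections: an assumption replacing the learned projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(w[n] * tokens[n][c] for n in range(len(tokens)))
                    for c in range(d)])
    return out

def lgsa_1d(tokens, K):
    """Local-global attention on a token list whose length is a multiple of K."""
    assert len(tokens) % K == 0
    d = len(tokens[0])
    # Local operator: attention inside each window of K tokens.
    local = []
    for s in range(0, len(tokens), K):
        local.extend(attention(tokens[s:s + K]))
    # Pooling: one representative per window (average pooling, an assumption).
    pooled = [[sum(local[s + n][c] for n in range(K)) / K for c in range(d)]
              for s in range(0, len(local), K)]
    # Global operator: attention among the window representatives only.
    glob = attention(pooled)
    # Repeat each representative K times and add the local result, as in Eq. (13).
    return [[glob[n // K][c] + local[n][c] for c in range(d)]
            for n in range(len(local))]
```

For N tokens this sketch computes roughly N·K local scores plus (N/K)² global scores, compared with N² scores for full attention over the same sequence.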
Complexity Analysis. In the segmentation task, the complexity of self-attention is closely related to the length of the input sequence $N = h \times w$ and the number of channels $C$:

$$\Omega(\mathrm{MHSA}) = 4NC^2 + 2N^2C. \tag{14}$$

The local attention module (Eqs. 7 to 11) splits the feature map ($h \times w$) into $\frac{hw}{K^2}$ partitions of size $K \times K$:

$$\Omega(\text{Local-MHSA}) = 4K\frac{(hw)^2}{K^4}C^2 + 2K^2\frac{hw}{K^2}C = \left(\frac{hw}{K^3}\right)4hwC^2 + 2hwC, \tag{15}$$

where $\frac{hw}{K^3} \le 1$; we therefore assume $\frac{hw}{K^3} = 1$. Thus, Eq. 15 can be transformed into:

$$\Omega(\text{Local-MHSA}) = 4hwC^2 + 2hwC = 4NC^2 + 2NC. \tag{16}$$

For the global attention module:

$$\Omega(\text{Global-MHSA}) = 4MC^2 + 2M^2C \approx MNC, \tag{17}$$

where $M$ is a constant multiple of the size of the sequence after down-sampling. Together, the two operators enable global feature learning with complexity linear in $N$:

$$\Omega(\mathrm{LGSA}) = \Omega(\text{Local-MHSA}) + \Omega(\text{Global-MHSA}) = 4NC^2 + 2NC + MNC. \tag{18}$$
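To get a feel for Eqs. (14) and (18), the two expressions can be evaluated directly. The helper names below are hypothetical, and any concrete values of `N`, `C`, and `M` passed to them are illustrative choices rather than settings reported in the paper.

```python
def omega_mhsa(N, C):
    """Eq. (14): cost of full multi-head self-attention over N tokens."""
    return 4 * N * C**2 + 2 * N**2 * C

def omega_lgsa(N, C, M):
    """Eq. (18): cost of LGSA, where M relates to the down-sampled length."""
    return 4 * N * C**2 + 2 * N * C + M * N * C

def cost_ratio(N, C, M):
    """How many times more expensive full attention is than LGSA."""
    return omega_mhsa(N, C) / omega_lgsa(N, C, M)
```

Because Eq. (14) contains an N² term and Eq. (18) does not, `cost_ratio` grows with the sequence length when `C` and `M` are held fixed, which is the sense in which LGSA is linear in N.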
3.3 Multi-branch Module Composed of Convolution and LGSA
Based on LGSA, we build a multi-branch basic module composed of convolution and self-attention operations, which unifies local and non-local feature interactions. As shown in Fig. 1(b), one branch performs lightweight attention calculations while the other performs convolution, and the resulting attention weights are then added to the convolution weights. Because the attention weights are dynamic and the convolution weights are static, the sum yields a set of weights that simulates dynamic convolution. Our approach can be implemented in a few lines of Python; Algorithm 1 shows PyTorch-style pseudo code.
Algorithm 1. Pseudo code of the multi-branch module in a PyTorch-like style.

    # B: batch size, H: height, W: width
    # C: channel number, M: partition number
    # K: kernel size of Unfold, s: stride, r: scaling ratio

    ################## initialization ##################
    unfold = nn.Unfold(K, dilation, padding, s)
    fold = nn.Fold(K, dilation, padding, s)
    conv = nn.Conv2d(C, C//r, 1)
    # depth-wise separable convolutions, Eq. (6)
    SepConv2d = DepthWiseConv2d(C, C, kernel_size=3, 1)
    # Multi-head Self-Attention, Eq. (5)
    attention = MHSA(img_size, heads)
    downsampling = nn.MaxPool2d(K × K)
    upsample = nn.MaxUnpool2d(K × K)

    ################## forward pass ##################
    out1 = BN(ReLU(conv(x)))
    x_unfolded = unfold(x)  # B, C × K × K, H × W
    x_unfolded = x_unfolded.transpose(1, 2).view(M, B, N, K × K).transpose(2, 3)
    out2 = fold(attention(SepConv2d(x_unfolded)))
    out3 = upsample(attention(SepConv2d(downsampling(out2))))
    out = out1 + out2 + out3
    return out
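The remark that adding attention weights to convolution weights gives a dynamic-convolution-like operator can be checked on a toy 1-D example. In this sketch both branches are written as matrices acting on the same sequence of scalar tokens: a static banded matrix for the convolution and an input-dependent matrix W(X) for attention, following Eq. (4). The identity query/key projections and all function names are assumptions for illustration, not the authors' implementation.

```python
import math

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def conv_matrix(kernel, n):
    """Static banded matrix of a 1-D convolution with zero padding."""
    half = len(kernel) // 2
    return [[kernel[j - i + half] if 0 <= j - i + half < len(kernel) else 0.0
             for j in range(n)] for i in range(n)]

def attention_matrix(x):
    """Input-dependent weights W(X) of Eq. (4) for scalar tokens,
    using identity query/key projections (an assumption)."""
    rows = []
    for q in x:
        scores = [q * k for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def fused_output(x, kernel):
    """Apply the summed (static + dynamic) weight matrix to x."""
    A = conv_matrix(kernel, len(x))
    D = attention_matrix(x)
    fused = [[a + d for a, d in zip(ra, rd)] for ra, rd in zip(A, D)]
    return matvec(fused, x)
```

Since both branches act linearly on the same values, `fused_output(x, kernel)` equals the element-wise sum of the two branch outputs; only the attention part of the fused matrix changes when `x` changes.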
3.4 LiteTrans: Reconstruct Transformer with Multi-branch Module
Based on the multi-branch module described in Sect. 3.3, we reconstruct the Transformer to adapt it to medical image segmentation tasks. The reconstructed Transformer is a lightweight segmentation network that we name LiteTrans. LiteTrans can be applied to longer sequences while maintaining high accuracy.
LiteTrans Block. The LiteTrans block introduces the parallel structure of convolution and LGSA into the Transformer block. It also includes a normalization layer, residual connections, and a feed-forward layer:

$$x_p = \mathrm{LayerNorm}(\mathrm{LGSA}(x_{p-1}) + x_{p-1}), \tag{19}$$

$$z_p = \mathrm{LayerNorm}(\mathrm{FFN}(x_p) + x_p). \tag{20}$$
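The residual and normalization pattern of Eqs. (19) and (20) can be sketched without any deep-learning library. Here `mixer` stands in for LGSA and `ffn` for the feed-forward layer; both are caller-supplied placeholders, and the layer normalization omits the learned scale and shift, which is a simplification.

```python
import math

def layer_norm(v, eps=1e-5):
    """LayerNorm over one token's channels, without learned scale/shift."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def litetrans_block(tokens, mixer, ffn):
    """Post-norm residual block following Eqs. (19)-(20).

    mixer: callable on the whole token list (placeholder for LGSA)
    ffn:   callable on a single token (placeholder for the feed-forward layer)
    """
    mixed = mixer(tokens)
    # Eq. (19): token mixing, residual connection, then normalization
    x = [layer_norm([m + t for m, t in zip(mt, tt)])
         for mt, tt in zip(mixed, tokens)]
    # Eq. (20): feed-forward, residual connection, then normalization
    z = [layer_norm([f + a for f, a in zip(ffn(tok), tok)]) for tok in x]
    return z
```

Normalization is applied after each residual sum, so every output token has approximately zero mean and unit variance across its channels regardless of the scale of the placeholder functions.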
LiteTrans for Medical Image Segmentation. The overall architecture of LiteTrans and the detailed design of the LiteTrans block are shown in Fig. 1(a) and (b). LiteTrans is a convolutional vision Transformer with an encoder-decoder structure and skip connections, and it deeply integrates Transformer and CNN for medical image segmentation. We use a Conv Transformer with a convolutional scaling layer as the encoder, and design a symmetrical decoder from a Conv Transformer with a convolutional expanding layer. The multi-level hierarchical features extracted by convolutional scaling and expanding are fed into the Conv Transformer for global interactive learning.

Loss Function. When training the LiteTrans model, we use the binary cross-entropy (BCE) loss and the Dice–Sørensen loss to measure the discrepancy between the model's prediction and the ground truth. They are expressed as:

$$L_{BCE} = -\sum_{i=1}^{t}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right), \qquad L_{Dice} = 1 - \frac{\sum_{i=1}^{t} y_i p_i + \varepsilon}{\sum_{i=1}^{t}\left(y_i + p_i\right) + \varepsilon}, \tag{21}$$

$$L = \alpha \cdot L_{BCE} + \beta \cdot L_{Dice},$$

where $y_i$ and $p_i$ are the ground-truth label and the predicted probability of pixel $i$, $t$ is the number of pixels, $\varepsilon$ is a smoothing term, and $\alpha$ and $\beta$ weight the two losses.
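A small standard-library sketch of the combined loss follows. It uses the usual negative sign for cross-entropy and keeps the Dice term in the form printed in the text (many implementations multiply the overlap by 2). The clamping constant and the default weights `alpha = beta = 0.5` are hypothetical, since the paper does not report them.

```python
import math

def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy summed over all pixels."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)   # clamp so log() stays finite
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total

def dice_loss(y, p, eps=1e-6):
    """Dice loss: 1 - (sum(y*p) + eps) / (sum(y + p) + eps)."""
    inter = sum(yi * pi for yi, pi in zip(y, p))
    denom = sum(yi + pi for yi, pi in zip(y, p))
    return 1.0 - (inter + eps) / (denom + eps)

def total_loss(y, p, alpha=0.5, beta=0.5):
    """L = alpha * L_BCE + beta * L_Dice; the default weights are hypothetical."""
    return alpha * bce_loss(y, p) + beta * dice_loss(y, p)
```

Here `y` and `p` are flat lists of ground-truth labels and predicted probabilities; a prediction closer to the labels lowers both terms.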
4 Experiments

4.1 Datasets
The proposed LiteTrans is evaluated on the Synapse multi-organ segmentation dataset [9] and the Automated Cardiac Diagnosis Challenge (ACDC) dataset [1].

Synapse Multi-organ Segmentation Dataset [9]: The data comprise reference segmentations for 90 abdominal CT images delineating multiple organs: the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. The CT scans consist of 319–1051 slices of 512 × 512 pixels and have a voxel spatial resolution of ([0.523–0.977] × [0.523–0.977] × 0.5) mm³.

Automated Cardiac Diagnosis Challenge (ACDC) [1]: The ACDC dataset is composed of 150 exams. A series of short-axis slices covers the LV from the base to the apex. The spatial resolution ranges from 1.37 to 1.68 mm²/pixel, and 28 to 40 images cover the cardiac cycle completely or partially.
4.2 Implementation Details
We conducted our experiments mainly on the Synapse multi-organ segmentation dataset, preprocessing both the training and the test data. We randomly selected 30 abdominal CT scans from the Synapse dataset, containing a total of 3779 axial contrast-enhanced abdominal clinical CT images. Of these, 18 scans are used as training data and 12 as test data. For the training data, we convert the scans to NumPy format, clip the intensities to [−125, 275], normalize each 3D image to [0, 1], and extract 2D slices from the 3D volumes. In addition, we augment the data with flips and rotations to increase its diversity. For the test data, we keep the 3D volumes in h5 format.

Our experiments are implemented with Python 3.7 and PyTorch 1.7.1. The default settings are a patch size of 224 × 224 and a batch size of 16. Models are trained with the SGD optimizer with learning rate 0.01, momentum 0.9, and weight decay 1e−4. The default number of training iterations is 20k for the ACDC dataset and 14k for the Synapse dataset [2,3]. All experiments are evaluated with 5-fold cross-validation, and we compute the Dice similarity coefficient (DSC, %) and the average Hausdorff distance (HD) over the 8 structures.

We compare LiteTrans with V-Net [22], DARR [8], U-Net [24], AttnUNet [23], R50-ViT [7], and TransUNet [3]; the results are shown in Table 1. Table 2 compares segmentation accuracy on the ACDC dataset. Tables 1 and 2 show that our network matches the previous state-of-the-art methods (e.g., TransUNet [3]) and is even better on certain individual organs, while the number of parameters of our network is only half that of TransUNet.

Table 1. Segmentation accuracy of different methods on the Synapse multi-organ CT dataset (average Dice similarity coefficient (DSC, %), average Hausdorff distance (HD, mm), and DSC % for each organ).

| Methods | DSC↑ | HD↓ | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| V-Net [22] | 68.81 | – | 75.34 | 51.87 | 77.10 | 80.75 | 87.84 | 40.05 | 80.56 | 56.98 |
| DARR [8] | 69.77 | – | 74.74 | 53.77 | 72.31 | 73.24 | 94.08 | 54.18 | 89.90 | 45.96 |
| U-Net [24] | 76.85 | 39.70 | 89.07 | 69.72 | 77.77 | 68.60 | 93.43 | 53.98 | 86.67 | 75.58 |
| AttnUNet [23] | 75.57 | 36.97 | 55.92 | 63.91 | 79.20 | 72.71 | 93.6 | 49.37 | 87.19 | 74.95 |
| R50-ViT [7] | 71.29 | 32.87 | 73.73 | 55.13 | 75.80 | 72.20 | 91.51 | 45.99 | 81.99 | 73.95 |
| TransUnet [3] | 77.48 | 31.69 | 87.23 | 63.13 | 81.87 | 77.02 | 94.08 | 55.86 | 85.08 | 75.62 |
| LiteTrans (ours) | 77.91 | 29.01 | 85.87 | 62.22 | 83.21 | 77.10 | 94.45 | 57.60 | 86.52 | 76.30 |
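The intensity preprocessing described in the implementation details (clipping to [−125, 275], normalizing to [0, 1], and slicing the volume) can be sketched as follows. Using per-volume min-max scaling for the normalization step is an assumption, since the text does not say how the [0, 1] range is obtained, and the function names are hypothetical.

```python
def clip_and_normalize(volume, lo=-125.0, hi=275.0):
    """Clip intensities to [lo, hi], then min-max scale the volume to [0, 1].

    volume: nested list indexed [slice][row][col]
    """
    clipped = [[[min(max(v, lo), hi) for v in row] for row in sl] for sl in volume]
    flat = [v for sl in clipped for row in sl for v in row]
    vmin, vmax = min(flat), max(flat)
    scale = (vmax - vmin) or 1.0   # constant volume: avoid division by zero
    return [[[(v - vmin) / scale for v in row] for row in sl] for sl in clipped]

def to_slices(volume):
    """Split a 3-D volume into its 2-D slices for training."""
    return [sl for sl in volume]
```

A real pipeline would operate on NumPy arrays as the text describes; plain lists are used here only to keep the sketch dependency-free.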
Table 2. Segmentation accuracy of different methods on the ACDC dataset (average Dice similarity coefficient (DSC, %) and DSC % for each structure).

| Methods | DSC | RV | Myo | LV |
|---|---|---|---|---|
| U-Net [24] | 87.55 | 87.10 | 80.63 | 94.92 |
| AttnUNet [23] | 86.75 | 87.58 | 79.20 | 93.47 |
| R50-ViT [7] | 87.57 | 86.07 | 81.88 | 94.75 |
| TransUNet [3] | 89.71 | 88.86 | 84.53 | 95.73 |
| LiteTrans (ours) | 89.66 | 87.97 | 85.33 | 95.68 |

[Figure 2: segmentation visualizations. Colour legend: aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, stomach.]

Fig. 2. Qualitative comparison of different approaches by visualization. From left to right: (a) GroundTruth, (b) LiteTrans, (c) TransUNet, (d) Attention U-Net, (e) U-Net. Our method predicts fewer false positives and keeps finer details.
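The two metrics reported in Tables 1 and 2 can be sketched with the standard library. The Hausdorff function below is the plain symmetric maximum variant on explicit point sets; the paper does not state which variant (for example a percentile-based one) or which implementation it uses, so this is an illustrative assumption.

```python
import math

def dice_coefficient(pred, target):
    """DSC (%) between two binary masks given as flat 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    size = sum(pred) + sum(target)
    return 100.0 if size == 0 else 100.0 * 2 * inter / size

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # largest distance from a point in src to its nearest point in dst
        return max(min(math.dist(a, b) for b in dst) for a in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))
```

A higher DSC means more overlap, while a lower Hausdorff distance means the predicted boundary stays closer to the reference boundary, which is why the tables mark them with ↑ and ↓.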
4.3 Experiment Results

We quantitatively compared LiteTrans with state-of-the-art medical segmentation methods. Table 1 shows that LiteTrans achieves the best overall segmentation performance, with an average DSC of 77.91% and an average HD of 29.01 mm. Notably, we achieve 57.60% on the pancreas, which is very hard to segment due to its complex shape. LiteTrans still achieves excellent segmentation performance with a small number of parameters, which demonstrates its strength in complex medical image segmentation.

Finally, we visualized the segmentation results of LiteTrans, TransUNet [3], U-Net [24], and AttnUNet [23] and compared them with the ground truth, as shown in Fig. 2. The results of LiteTrans are close to the ground truth. TransUNet [3] locates the organs approximately, but its edge segmentation is not accurate. Overall, compared with these methods, our algorithm achieves state-of-the-art performance with fewer parameters.

4.4 Ablation Study
The Impact of Different Components: To evaluate the effectiveness of each module added in the proposed LiteTrans, we conduct a comprehensive ablation experiment by successively replacing the components. The components include MHSA (Multi-Head Self-Attention), DWSC (depth-wise separable convolution), LGSA (Local-Global Self-Attention), and MBM (multi-branch module). Table 3 lists the results of the different groups.

Table 3. Ablation study on different components.

| Methods | DSC | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| MHSA+DWSC | 76.85 | 81.84 | 66.33 | 80.12 | 73.91 | 93.64 | 55.04 | 86.10 | 72.20 |
| LGSA+DWSC | 76.48 | 77.27 | 68.42 | 86.84 | 78.29 | 93.74 | 56.16 | 86.34 | 72.77 |
| MBM+LGSA+DWSC | 77.91 | 85.87 | 62.22 | 83.21 | 77.10 | 94.45 | 57.60 | 86.52 | 76.30 |
The Impact of Input Image Resolution: As shown in Table 4, when the input size increases from 224 × 224 to 512 × 512, the input feature sequence of LiteTrans becomes longer, which improves the segmentation performance of the model. However, although the segmentation accuracy improves (the average DSC rises from 77.91% to 82.73%), the computational load of the whole network also increases. To keep the algorithm efficient, the experiments in this paper use the 224 × 224 resolution as input.

Table 4. Ablation study on resolution of input image.

| Size | DSC | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| 224×224 | 77.91 | 85.87 | 62.22 | 83.21 | 77.10 | 94.45 | 57.60 | 86.52 | 76.30 |
| 512×512 | 82.73 | 90.23 | 70.82 | 83.47 | 82.64 | 95.30 | 65.32 | 90.27 | 83.81 |
The Impact of Skip Connections: Table 5 reports the metrics for different numbers of skip connections. As the number of skip connections increases, the average DSC continues to rise, and the best performance is achieved with 3 skip connections. Skip connections integrate low-level local details with high-level global abstraction, so that the final decoded image at the original resolution retains more detailed contour information, which plays a key role in the final segmentation prediction and localization.

Table 5. Ablation study on the impact of skip connections.

| Num | DSC | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 67.11 | 79.54 | 42.34 | 65.50 | 63.79 | 91.10 | 45.32 | 79.07 | 70.22 |
| 1 | 76.36 | 87.15 | 59.53 | 80.89 | 78.36 | 93.82 | 55.81 | 84.99 | 70.34 |
| 2 | 77.16 | 87.31 | 64.19 | 81.47 | 76.46 | 93.27 | 56.16 | 85.13 | 73.25 |
| 3 | 77.91 | 85.87 | 62.22 | 83.21 | 77.10 | 94.45 | 57.60 | 86.52 | 76.30 |

5 Conclusion
In this work, we proposed a novel and modular LiteTrans model that reconstructs the Transformer with convolution for medical image segmentation. LiteTrans is an encoder-decoder structure with skip connections. It uses a multi-branch architecture of convolution and attention mechanisms, and it redesigns self-attention using the local-global approach. LiteTrans deeply integrates Transformer and CNN to perform segmentation tasks, combining the advantages of both. Experiments on multi-organ segmentation tasks show that our approach achieves state-of-the-art performance with fewer parameters and lower FLOPs.
References

1. Bernard, O., et al.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans. Med. Imaging 37, 2514–2525 (2018)
2. Cao, H., et al.: Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537 (2021)
3. Chen, J., et al.: TransUNet: transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
4. Chen, X., Wang, H., Ni, B.: X-volution: on the unification of convolution and self-attention. arXiv preprint arXiv:2106.02253 (2021)
5. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017)
6. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803 (2021)
7. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)
8. Fu, S., et al.: Domain adaptive relational reasoning for 3D multi-organ segmentation. In: Martel, A.L. (ed.) MICCAI 2020. LNCS, vol. 12261, pp. 656–666. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59710-8_64
9. Gibson, E., et al.: Multi-organ abdominal CT reference standard segmentations. This data set was developed as part of independent research supported by Cancer Research UK (Multidisciplinary C28070/A19985) and the National Institute for Health Research UCL/UCL Hospitals Biomedical Research Centre (2018)
10. Han, K., et al.: A survey on visual transformer (2020)
11. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
12. Huang, G., Liu, Z., Laurens, V., Weinberger, K.Q.: Densely connected convolutional networks. IEEE Computer Society (2016)
13. Huang, L., Yuan, Y., Guo, J., Zhang, C., Chen, X., Wang, J.: Interlaced sparse self-attention for semantic segmentation (2019)
14. Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., Fu, B.: Shuffle transformer: rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650 (2021)
15. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 603–612 (2019)
16. Jha, D., Riegler, M.A., Johansen, D., Halvorsen, P., Johansen, H.D.: DoubleU-Net: a deep convolutional neural network for medical image segmentation. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), pp. 558–564. IEEE (2020)
17. Kaul, C., Manandhar, S., Pears, N.: FocusNet: an attention-based fully convolutional network for medical image segmentation. In: 2019 IEEE 16th International Symposium on Biomedical Imaging, ISBI 2019, pp. 455–458. IEEE (2019)
18. Lei, T., Wang, R., Wan, Y., Du, X., Meng, H., Nandi, A.: Medical image segmentation using deep learning: a survey (2020)
19. Li, D., et al.: Involution: inverting the inherence of convolution for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12321–12330 (2021)
20. Li, J., Yan, Y., Liao, S., Yang, X., Shao, L.: Local-to-global self-attention in vision transformers. arXiv preprint arXiv:2107.04735 (2021)
21. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows (2021)
22. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 4th International Conference on 3D Vision (3DV), pp. 565–571. IEEE (2016)
23. Oktay, O., et al.: Attention U-Net: learning where to look for the pancreas (2018)
24. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
25. Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16519–16529 (2021)
26. Taghanaki, S.A., Abhishek, K., Cohen, J.P., Cohen-Adad, J., Hamarneh, G.: Deep semantic segmentation of natural and medical images: a review. Artif. Intell. Rev. 54(1), 137–178 (2021)
27. Vaswani, A., et al.: Attention is all you need. arXiv (2017)
28. Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., Shlens, J.: Scaling local self-attention for parameter efficient visual backbones. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12894–12904 (2021)
29. Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.-C.: Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 108–126. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_7
30. Wang, W., Xie, E., Li, X., Fan, D.P., Shao, L.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions (2021)
31. Wang, Z., Zou, N., Shen, D., Ji, S.: Non-local U-Nets for biomedical image segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 6315–6322 (2020)
32. Wu, H., et al.: CvT: introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808 (2021)
33. Wu, Z., Liu, Z., Lin, J., Lin, Y., Han, S.: Lite transformer with long-short range attention. In: International Conference on Learning Representations (2019)
34. Xu, W., Xu, Y., Chang, T., Tu, Z.: Co-scale conv-attentional image transformers (2021)
35. Yu, Q., Xia, Y., Bai, Y., Lu, Y., Yuille, A., Shen, W.: Glance-and-gaze vision transformer. arXiv preprint arXiv:2106.02277 (2021)
36. Zhang, Y., Liu, H., Hu, Q.: TransFuse: fusing transformers and CNNs for medical image segmentation. arXiv preprint arXiv:2102.08005 (2021)
37. Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890 (2021)
38. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 39(6), 1856–1867 (2019)
CFCN: A Multi-scale Fully Convolutional Network with Dilated Convolution for Nuclei Classification and Localization

Bin Xin1, Yaning Yang1, Dongqing Wei2, and Shaoliang Peng1(B)

1 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China {xinbin,yangyn,slpeng}@hnu.edu.cn
2 State Key Laboratory of Microbial Metabolism and College of Life Science and Biotechnology, Shanghai Jiaotong University, Shanghai 200240, China [email protected]
Abstract. Nuclei classification in histology images is a fundamental task in histopathological analysis. However, automated nuclei classification methods usually face problems such as unbalanced samples and large variations in cell morphology, which hinder the training of models. Moreover, many existing methods only classify individual cell patches, which are small pieces of images containing a single cell. When the classification results need to be mapped back to the corresponding positions in the image, accuracy declines rapidly, which makes subsequent recognition difficult. In this paper, we propose a novel multi-scale fully convolutional network, named CFCN, with dilated convolution for fine-grained nuclei classification and localization in histology images. Our network consists of an encoding part and a decoding part. The encoding part uses a cross stage partial network as the backbone for feature extraction, and we apply a cascade dilated convolution module to enlarge the receptive field. The decoding part contains transposed convolution upsampling layers, and a path aggregation network is applied to fuse multi-scale feature maps. Experimental results on a typical histology image dataset show that our proposed network outperforms other state-of-the-art nuclei classification models, with an F1 score of 0.750. Source code is available at https://github.com/BYSora/CFCN.
Keywords: Nuclei classification · Fully convolution network · Dilated convolution · Histopathology

1 Introduction
Classification of nuclei is an essential basis of histopathology image analysis. The classification results provide important guidance for subsequent pathological diagnosis [2]. Compared with the detection of nuclei, fine-grained classification provides pathologists with richer pathological information [6]. However, expert pathologists require significant time to classify nuclei in whole slide images (WSI), which is very inefficient in most cases [10]. Therefore, several automated classification methods have been proposed to alleviate this issue [3,18]. Still, many nuclei classification methods struggle to detect and classify all nuclei accurately [12,16], because cell morphology in histology images is often complex and variable, and even cells of the same class can differ considerably in shape. As shown in Fig. 1, many nuclei are very small, with an area of only a few pixels, and vary widely in shape. In addition, stacks of multiple cells and different focal distances during imaging also make nuclei classification difficult.

© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 314–323, 2021. https://doi.org/10.1007/978-3-030-91415-8_27

Fig. 1. Example of histology images and four types of nuclei. A: epithelial nuclei, B: fibroblast nuclei, C: inflammatory nuclei, D: miscellaneous nuclei.

Early automated classification methods used handcrafted features to detect and classify nuclei. Al-Kofahi et al. [1] proposed a Laplacian of Gaussian filter with a spatial constraint to detect nuclei, and Sharma et al. [13] used an AdaBoost classifier with intensity and texture features to perform nuclei classification and segmentation. These methods rely on handcrafted features and easily miss many irregularly shaped cells, resulting in low accuracy. With the development of deep learning, many Convolutional Neural Network (CNN) models have further improved nuclei classification accuracy [4,5]. However, most of them accept only image patches containing a single cell as input, then use a sliding window strategy to detect and classify nuclei on images and map the classification results to the corresponding locations. Sirinukunwattana et al. [14] proposed SC-CNN, a CNN model with a spatial constraint for nuclei detection, and used another CNN to further classify the nuclei. These methods do not perform well on original images, because the content of a sliding window often does not contain the complete cell morphology, which makes prediction difficult. Wang et al.
[17] proposed a classification reinforcement detection network for signet ring cell detection and applied it to whole slide images. Zhou et al. [20] proposed a sibling Fully Convolution Network (FCN) for nuclei detection and classification, which outputs a score map of the same size as the image to indicate the class and location of the nuclei. However, it tends to detect excessive nuclei and has difficulty localizing them, which means that the model classifies many background pixels as nuclei.

In this paper, we propose a novel fully convolutional network, named CFCN, with dilated convolution for fine-grained classification and localization of nuclei in histology images. It is trained in an end-to-end manner, without extra processing of the images. It takes raw histology images as input and outputs score maps of the same size as the images to indicate nuclei locations. To classify and locate nuclei accurately against a complex background, we use a Path Aggregation Network (PAN) to fuse the high-level semantic features and the low-level spatial features of the network. Cascade dilated convolution is used not only to enlarge the receptive field of the network, but also to obtain multi-scale features. In addition, we use a pretrained nuclei detection model to eliminate the noise caused by background pixels. Our main contributions are summarized as follows:

– We propose a novel fully convolutional network for fine-grained nuclei classification and localization, which is trained in an end-to-end manner and avoids preprocessing of images.
– Our model uses cascade dilated convolution and a path aggregation network to extract multi-scale features and fuse low-level and high-level features, thus balancing the accuracy of classification and localization.
– The effectiveness of our model is validated on a typical nuclei classification dataset, and the experimental results show that it outperforms other state-of-the-art methods.
2 Methods

2.1 CFCN Architecture
Our network is roughly divided into an encoding part and a decoding part. The encoding part uses a cross stage partial backbone network for efficient feature extraction, and cascade dilated convolution is then used to enlarge the receptive field and extract multi-scale features. In the decoding part, we apply transposed convolution upsampling layers to restore the feature maps to the original size and output score maps that indicate nuclei locations. To locate the nuclei accurately, we use a path aggregation network to combine multi-scale spatial and semantic features from the encoding part. Our CFCN architecture is shown in Fig. 2.

Fig. 2. CFCN for nuclei classification and localization. (Legend: Conv 3 × 3 with dilation = n; CSPResNet layer; cascade dilated convolution (CDC); downsampling; upsampling; element-wise summation; concatenation; skip connection; encoding part; decoding part.)

Cross Stage Partial Backbone Network. In the encoding part, the backbone network must effectively extract nuclei features for classification and localization. We apply stacked ResNet [7] layers as our feature extraction network. Each layer consists of nine residual blocks, which are the basic units of ResNet. The residual block has two 3 × 3 convolutional layers followed by batch normalization (BN) and a ReLU activation function, and these convolutional layers are connected by a shortcut path. This structure avoids the vanishing gradient problem during training. The downsampling layer reduces the dimensions of the feature maps to obtain abstract semantic information; it consists of 3 × 3 convolutional layers with a stride of 2. To prevent the loss of detail information caused by downsampling, we add only three downsampling layers to the backbone network. In addition, we apply the Cross Stage Partial (CSP) [15] design to our backbone network for efficient feature extraction. CSP is a network design concept that further increases the feature connections among network layers to improve gradient propagation. The original implementation in [15] passes half of the features through the convolutional layers and then concatenates them with the other half. In our network, we instead duplicate the features and then fuse the processed ("trained") and unprocessed ("untrained") features with a 1 × 1 convolutional layer, so that the network receives the full feature information.

Cascade Dilated Convolution Module. Since the pixels belonging to a single cell account for only a small part of an image, we use three downsampling layers to avoid the loss of detail information caused by downsampling. However, this results in a smaller receptive field, so the network cannot effectively distinguish cells from the background. Therefore, after the backbone network, we add a cascade dilated convolution module [19] to enlarge the overall receptive field. The cascade structure also lets the network extract nuclei features at multiple scales, which increases its robustness. We stack dilated convolutions in cascade mode, as shown in Fig. 2. The top stack, with dilation rates of 1, 2 and 4, has the largest receptive field, and the receptive field of each following stack shrinks as the convolution with the largest dilation rate is removed.

Multi-scale Feature Fusion for Accurate Localization.
In the decoding part, we apply a path aggregation network (PAN) [9] to fuse multi-scale features, so that we can combine high-level semantic information with low-level spatial information for accurate localization. As shown in Fig. 2, the two leftmost paths in the decoding part form the PAN. The bottom-up path (left) combines semantic information with spatial information, and the top-down path (right) spreads spatial information across multiple scales. We then use transposed convolution to progressively upsample the output features of the encoding part to the original size. The final output has k + 1 channels, representing the background and the k nuclei classes respectively. It then passes through a softmax layer, and non-maximum suppression [11] is applied to obtain score maps as the classification result. In each score map, the value of a pixel represents the probability that the corresponding image location belongs to a nucleus.

2.2 Focal Loss for Multi-class Nuclei Classification
We choose Focal Loss [8] as the loss function of CFCN. It not only balances samples from the nucleus and background regions, but also focuses more on hard-to-train samples, so the network learns more from these samples and many irregular cells can be classified as well. The loss of the kth channel, $FL(I_k)$, is defined in (1):

\[
FL(I_k) = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad
P_i =
\begin{cases}
-\alpha_k (1 - p_i)^{\gamma} \log(p_i), & y = 1 \\
-p_i^{\gamma} \log(1 - p_i), & y = 0
\end{cases}
\tag{1}
\]

where $N$ is the total number of pixels in a single channel, $P_i$ is the loss at pixel $i$, $\gamma$ controls the penalty for difficult samples, $p_i$ is the network output at pixel $i$ after the softmax layer, representing the probability that this pixel belongs to a class $k$ nucleus, and $y$ is the corresponding label: $y = 1$ means that the pixel belongs to a class $k$ nucleus, and $y = 0$ means that the pixel belongs to the background or to other nuclei. Unlike the original definition in [8], we modify $\alpha$ to accommodate the proportion of nuclei in different classes. Our final loss is the sum of the losses from all classes, excluding the loss from the background channel.
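As an illustration of Eq. (1), the following is a minimal sketch of the per-channel loss in plain Python. The function names, the clamping constant `eps`, and the default values of `alpha_k` and `gamma` are assumptions made for this example; the paper does not report its values of α or γ, and the released CFCN code may be organized differently.

```python
import math

def focal_loss_channel(probs, labels, alpha_k=0.25, gamma=2.0, eps=1e-7):
    """Mean focal loss over the pixels of one class channel, following Eq. (1).

    probs   -- softmax outputs p_i for class k, one value per pixel
    labels  -- 1 if the pixel belongs to class k, otherwise 0
    alpha_k, gamma, eps -- illustrative values only (not taken from the paper)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        if y == 1:
            total += -alpha_k * (1.0 - p) ** gamma * math.log(p)
        else:
            total += -(p ** gamma) * math.log(1.0 - p)
    return total / len(probs)

def total_loss(prob_channels, label_channels, alphas, gamma=2.0):
    """Sum the channel losses over the k nuclei classes (background excluded)."""
    return sum(
        focal_loss_channel(p, y, a, gamma)
        for p, y, a in zip(prob_channels, label_channels, alphas)
    )

# A confidently correct positive pixel contributes far less than a hard one.
easy = focal_loss_channel([0.9], [1])
hard = focal_loss_channel([0.1], [1])
print(easy < hard)  # True
```

With `gamma = 0` and `alpha_k = 1` the expression reduces to ordinary binary cross-entropy, which is a convenient sanity check for an implementation.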
3 Experiments

3.1 Dataset
We evaluate the performance of our network on a typical nuclei detection and classification dataset [14], which contains 100 hematoxylin and eosin (H&E) stained histopathology images of colorectal adenocarcinoma. The resolution of all images is 500 × 500, with a total of 29756 annotations for the detection task. In addition, 22444 annotations were further classified into four classes: epithelial nuclei, inflammatory nuclei, fibroblast nuclei and miscellaneous nuclei. Since the annotations are given as coordinates of cell centroids, we take a disk with a radius of 3 pixels centered on each coordinate as the training label.
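The conversion from centroid annotations to training labels can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_label_map`, the `(row, col, class_id)` tuple layout, and the rule that a later disk overwrites an earlier one where they overlap are all hypothetical, since the paper does not describe how overlaps are handled.

```python
def make_label_map(height, width, centroids, radius=3):
    """Build a height x width grid of class ids from centroid annotations.

    centroids -- iterable of (row, col, class_id) with class_id in 1..k;
                 0 is kept for the background (hypothetical layout).
    Every pixel within `radius` of a centroid receives that centroid's class.
    """
    label = [[0] * width for _ in range(height)]
    for r0, c0, cls in centroids:
        # Only scan the bounding box of the disk, clipped to the image.
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                if (r - r0) ** 2 + (c - c0) ** 2 <= radius ** 2:
                    label[r][c] = cls  # assumption: later disks overwrite
    return label

label = make_label_map(20, 20, [(10, 10, 2)])
print(sum(row.count(2) for row in label))  # 29 pixels in a radius-3 disk
```

A radius of 3 marks 29 pixels per nucleus, which is still a very small share of a 500 × 500 image; this is the foreground/background imbalance that the focal loss in Sect. 2.2 is meant to address.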
3.2 Implementation Details
Data Augmentation. We performed data augmentation during the training phase to prevent overfitting. Specifically, we cropped the original images into 125 × 125 sub-images and removed the sub-images that have fewer annotations than in the detection task. The ratio of training, testing and validation sets is the same as in [14]. During training, we applied flipping, transposing, rotating, RGB shifting and Gaussian noise. We chose Adam as our optimizer. The learning rate was set to $3 \times 10^{-4}$, with a batch size of 16. It took about 150 epochs for our network to converge.

Pretrained Model. It is worth mentioning that most models use the results of nuclei detection to optimize the classification results. Similarly, in this paper, we use a pretrained detection model to eliminate background noise from the classification results. Specifically, we regard the output of the pretrained model as a binary classification of nuclei versus background, where the value at pixel $i$ represents the probability $P_i(\mathrm{nuclei})$ that the pixel belongs to a nucleus. The output of the classification model represents the conditional probability $P_i(k \mid \mathrm{nuclei})$ that pixel $i$ belongs to category $k$ given that it is detected as a nucleus. Therefore, the probability that pixel $i$ is classified as category $k$ is:

\[
P_i(k) = P_i(\mathrm{nuclei}) \, P_i(k \mid \mathrm{nuclei})
\tag{2}
\]
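Equation (2) amounts to a pixel-wise product of the detection map with each class score map. The sketch below shows this on nested Python lists; the function name `combine_maps` and the list-of-grids layout are assumptions chosen for illustration, not the authors' implementation.

```python
def combine_maps(det_map, cls_maps):
    """Apply Eq. (2) at every pixel.

    det_map  -- H x W grid of P_i(nuclei) from the pretrained detector
    cls_maps -- list of k grids (H x W each) holding P_i(k | nuclei)
    Returns k grids holding P_i(k) = P_i(nuclei) * P_i(k | nuclei).
    """
    return [
        [
            [d * p for d, p in zip(det_row, cls_row)]
            for det_row, cls_row in zip(det_map, cls_map)
        ]
        for cls_map in cls_maps
    ]

# One row, two pixels, two classes. The second pixel is background for the
# detector (probability 0), so its class scores are suppressed entirely.
det = [[0.5, 0.0]]
cls = [[[0.2, 0.9]], [[0.8, 0.1]]]
print(combine_maps(det, cls))
```

The second pixel shows the intended effect: a high class score (0.9) is removed when the detector considers the pixel to be background.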
Evaluation Metrics. We use the weighted F1 score for evaluation and comparison. As in [20], we treat classification results located within a mask area with a radius of 6 pixels centered on a ground truth coordinate as true positives (TP), results outside the mask areas as false positives (FP), and mask areas that do not contain any classification result as false negatives (FN). The weight of each class is the proportion of its samples in the dataset.

3.3 Results
Comparison with Other Methods. We compared our network with other state-of-the-art methods. Experimental results are shown in Table 1. SR-CNN [18] and SC-CNN [14] are patch-based models. They use a sliding window strategy to detect and further classify nuclei in histology images, which means that candidates are first detected and then classified if they are determined to be nuclei. SFCN-OPI [20] is an FCN-based model. It adds sibling branches that detect and classify nuclei simultaneously; these branches share the same FCN layers and use information from the detection data to guide classification. The experimental results show that the FCN-based models are better than the patch-based models, because the former can extract and locate features over the entire image. Our model not only has a higher F1 score than SFCN-OPI, but also has better precision. This means that our model can locate nuclei accurately and ignore more background noise, which could reduce misleading output for expert pathologists in practical applications.
Table 1. Experimental results of our network and comparison with other methods in nuclei classification dataset.

Methods     Precision   Recall   F1
SR-CNN      –           –        0.683
SC-CNN      –           –        0.692
SFCN-OPI    0.718       0.774    0.742
Ours        0.746       0.754    0.750

The – means results were not reported by that method.
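The precision, recall and F1 values in Table 1 rest on the radius-6 matching rule described under Evaluation Metrics. A minimal sketch of such an evaluation is given below. It assumes greedy one-to-one matching (each ground truth point can be claimed by at most one prediction), which is a common convention but is not stated in the paper; all function names are hypothetical.

```python
import math

def match_counts(preds, gts, radius=6.0):
    """Greedily match predicted points to ground truth points of one class.

    A prediction is a TP if an unmatched ground truth lies within `radius`;
    the nearest such ground truth is consumed. Returns (tp, fp, fn).
    """
    unmatched = list(gts)
    tp = 0
    for pr, pc in preds:
        best, best_d = None, radius
        for idx, (gr, gc) in enumerate(unmatched):
            d = math.hypot(pr - gr, pc - gc)
            if d <= best_d:
                best, best_d = idx, d
        if best is not None:
            unmatched.pop(best)
            tp += 1
    return tp, len(preds) - tp, len(unmatched)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_f1(preds_by_class, gts_by_class):
    """Weight each class F1 by its share of the ground truth annotations."""
    total = sum(len(g) for g in gts_by_class.values())
    if total == 0:
        return 0.0
    score = 0.0
    for cls, gts in gts_by_class.items():
        tp, fp, fn = match_counts(preds_by_class.get(cls, []), gts)
        score += len(gts) / total * f1_score(tp, fp, fn)
    return score

print(match_counts([(10, 10), (50, 50)], [(12, 13), (100, 100)]))  # (1, 1, 1)
```

Because the reported F1 is a class-weighted average, it need not equal the harmonic mean of the precision and recall columns in the same row.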
Ablation Study. Our network combines several components for nuclei classification and localization: the cross stage partial network (CSP), cascade dilated convolution (CDC) and the path aggregation network (PAN). We ran several experiments to measure the contribution of each component; note that all variants are combined with the pretrained model. As shown in Table 2, CSP and CDC slightly improve the overall accuracy, which indicates that feature reuse and a multi-scale structure improve the learning ability of the network. PAN combines semantic information and spatial information for accurate localization, so it improves the precision of the FCN.

Table 2. Experimental results of ablation study.

Methods        Precision   Recall   F1
FCN            0.683       0.721    0.703
FCN+CSP        0.691       0.745    0.717
FCN+PAN        0.717       0.728    0.722
FCN+CSP+CDC    0.714       0.756    0.734
Ours           0.746       0.754    0.750
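The multi-scale effect attributed to CDC can be made concrete by computing the receptive field of each cascade branch. The sketch below uses the standard formula for stacked stride-1 convolutions and assumes the branches are the dilation stacks (1, 2, 4), (1, 2) and (1) with 3 × 3 kernels, as described in Sect. 2.1. The sizes are measured on the feature map entering the module, not on the input image.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of stacked stride-1 convolutions with given dilations.

    Each layer widens the field by (kernel - 1) * dilation pixels.
    """
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

rates = [1, 2, 4]
# Drop the largest remaining dilation rate at each step of the cascade.
branches = [rates[:n] for n in range(len(rates), 0, -1)]
for branch in branches:
    print(branch, receptive_field(branch))
# [1, 2, 4] 15
# [1, 2] 7
# [1] 3
```

The three branches therefore see 15 × 15, 7 × 7 and 3 × 3 neighbourhoods of the same feature map, which is one way to read the multi-scale behaviour described for the module.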
Qualitative Results. Figure 3 shows the CFCN prediction results for nuclei classification in histology images. The predicted nuclei are marked by dots, the ground truth is marked by circles, and different colors indicate the classes. Nuclei of the same class are often clustered in uniform areas, and our classification results clearly reflect this. Some nuclei are missed or classified incorrectly, which is due to the complexity of the dataset and irregular cell morphology. In general, our results could provide important clues for further diagnosis by expert pathologists.
Fig. 3. Classification and localization results in histology images. The dots are network output of nuclei location, while circles represent corresponding ground truth. Different colors represent different classes (red: epithelial nuclei, green: fibroblast nuclei, blue: inflammatory nuclei, black: miscellaneous nuclei). (Color figure online)
4 Conclusions
In this paper, we propose a novel fully convolutional network, named CFCN, to classify and locate nuclei in histology images. By enlarging the receptive field and combining multi-scale information while preserving detail information, CFCN can effectively extract cell features and locate cells accurately. In addition, by using focal loss, the model can not only address sample imbalance, but also pay more attention to samples that are difficult to classify. Experimental results on a typical nuclei classification dataset show that our proposed model outperforms other state-of-the-art methods. In future work, our network could be extended to other histopathology image analysis problems.

Acknowledgments. This work was supported by NSFC Grants 61772543, U19A2067; Science Foundation for Distinguished Young Scholars of Hunan Province (2020JJ2009); National Key R&D Program of China 2017YFB0202602, 2018YFC0910405, 2017YFC1311003, 2016YFC1302500; Science Foundation of Changsha kq2004010; JZ20195242029, JH20199142034, Z202069420652; The Funds of Peng Cheng Lab, State Key Laboratory of Chemo/Biosensing and Chemometrics; the Fundamental Research Funds for the Central Universities, and Guangdong Provincial Department of Science and Technology under grant No. 2016B090918122.
References

1. Al-Kofahi, Y., Lassoued, W., Lee, W., Roysam, B.: Improved automatic detection and segmentation of cell nuclei in histopathology images. IEEE Trans. Biomed. Eng. 57(4), 841–852 (2009)
2. Basavanhally, A., et al.: Multi-field-of-view strategy for image-based outcome prediction of multi-parametric estrogen receptor-positive breast cancer histopathology: Comparison to oncotype dx. J. Pathol. Inf. 2, S1 (2011)
3. Chen, H., Dou, Q., Wang, X., Qin, J., Heng, P.A.: Mitosis detection in breast cancer histology images via deep cascaded networks. In: 30th AAAI Conference on Artificial Intelligence (2016)
4. Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 411–418. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40763-5_51
5. Cruz-Roa, A., et al.: Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. In: Medical Imaging 2014: Digital Pathology, vol. 9041, p. 904103. International Society for Optics and Photonics (2014)
6. Fleming, M., Ravula, S., Tatishchev, S.F., Wang, H.L.: Colorectal carcinoma: pathologic aspects. J. Gastrointest. Oncol. 3(3), 153 (2012)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
8. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
9. Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768 (2018)
10. van Muijen, G.N., et al.: Cell type heterogeneity of cytokeratin expression in complex epithelia and carcinomas as demonstrated by monoclonal antibodies specific for cytokeratins nos. 4 and 13. Exp. Cell Res. 162(1), 97–113 (1986)
11. Neubeck, A., Van Gool, L.: Efficient non-maximum suppression. In: 18th International Conference on Pattern Recognition, ICPR 2006, vol. 3, pp. 850–855. IEEE (2006)
12. Park, C., Huang, J.Z., Ji, J.X., Ding, Y.: Segmentation, inference and classification of partially overlapping nanoparticles. IEEE Trans. Pattern Anal. Mach. Intell. 35(3), 1–1 (2012)
13. Sharma, H., et al.: A multi-resolution approach for combining visual information using nuclei segmentation and classification in histopathological images. In: VISAPP, vol. 3, pp. 37–46 (2015)
14. Sirinukunwattana, K., Raza, S.E.A., Tsang, Y.W., Snead, D.R., Cree, I.A., Rajpoot, N.M.: Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans. Med. Imaging 35(5), 1196–1206 (2016)
15. Wang, C.Y., Liao, H.Y.M., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.H.: CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391 (2020)
16. Wang, M., Zhou, X., Li, F., Huckins, J., King, R.W., Wong, S.T.: Novel cell segmentation and online SVM for cell cycle phase identification in automated microscopy. Bioinformatics 24(1), 94–101 (2008)
17. Wang, S., Jia, C., Chen, Z., Gao, X.: Signet ring cell detection with classification reinforcement detection network. In: Cai, Z., Mandoiu, I., Narasimhan, G., Skums, P., Guo, X. (eds.) ISBRA 2020. LNCS, vol. 12304, pp. 13–25. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57821-3_2
18. Xie, Y., Xing, F., Kong, X., Su, H., Yang, L.: Beyond classification: structured regression for robust cell detection using convolutional neural network. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 358–365. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_43
19. Zhou, L., Zhang, C., Wu, M.: D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 182–186 (2018)
20. Zhou, Y., Dou, Q., Chen, H., Qin, J., Heng, P.A.: SFCN-OPI: detection and fine-grained classification of nuclei using sibling FCN with objectness prior interaction. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Image to Image Transfer Makes Malpositioned Teeth Orderly

Sanbi Luo1,2(B)

1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China [email protected]
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Abstract. Deep learning's continuous development has spawned many applications and neural networks that tackle challenging tasks in different fields. Fusing these applications and networks could solve more complex real-world problems. Taking image generation as an example, combining a pure CNN image generator with a single RNN text processor forms a more complex and capable fusion network for text-to-image generation. Generally, a successful basic deep learning application requires at least three elements: a suitable application scenario, an appropriate dataset, and a proper model. Following this principle, in this paper we introduce a new task for virtual orthodontics: a new image-to-image transfer task from malpositioned-teeth image to neat-teeth image. We call it orthodontics transfer. To make up for the lack of datasets, we constructed a paired dataset of orthodontics before and after surgery from real medical cases. We also propose a new orthodontics transfer network that uses teeth-code transfer as a bridge. The experimental results show that our proposed method is effective and can realize the orthodontic effect on teeth photos by image-to-image transfer.

Keywords: Text to image · Image to image · Generative adversarial networks · Deep learning · Orthodontics

1 Introduction
Orthodontic treatment can improve one's dental health, change one's facial appearance for the better, and boost one's self-esteem. Unfortunately, dental anxiety is widespread. It often makes people afraid of dental treatment and prevents them from becoming healthier and more beautiful. Suppose there were a method by which patients, simply by taking photos of their malpositioned teeth in private, could see the beautiful and neat teeth they would have after orthodontic treatment. Then many people might be able to pluck up the courage to accept orthodontic therapy. With the development of deep learning and image processing technology, this dream may become a reality today.

© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 324–335, 2021. https://doi.org/10.1007/978-3-030-91415-8_28
Fig. 1. Different image-to-image transfer tasks. Top left is edges to photo by pix2pix [10], and top right is label map to image via pix2pixHD [29]. The bottom is the malpositioned-teeth-image to neat-teeth-image in this paper, i.e., the comparison effect of teeth before and after orthodontics.
For example, a well-designed image-to-image transfer network could potentially translate a malpositioned-teeth image into a neat-teeth image. Image-to-image translation is the task of taking images from one domain and transforming them so that they have the style (or characteristics) of images from another domain, e.g., gray-scale image to color image, or edge map to photograph. Years of research in computer vision and machine learning have produced powerful image translation approaches [2,10,12,20,26,27,29]. Most recently, methods based on classic GANs [5], including pix2pix [10], pix2pixHD [29], and StarGAN [2], have given promising results and beaten traditional solutions.

However, most existing image-to-image translation frameworks focus on style transfer, such as changing the color or texture of the given images. In this setting, object locations and postures stay consistent between source and target images. For example, in edges-to-photograph translation [10], the parts of the handbag do not change in position or posture, and in label-map-to-image transfer [29], the positions of the furniture in the house do not change either. Unlike these application scenarios, the number, posture and position of teeth differ between malpositioned and neat teeth in orthodontic treatment, as shown in Fig. 1. This is because malpositioned teeth are corrected by changing their positions and postures, and some molars may be removed to free up alveolar space.

Text-to-image synthesis refers to computational methods that transform natural language descriptions into images whose semantics are similar to the descriptions. With the development of deep learning, and especially the emergence of deep generative models, appropriately trained neural networks can generate realistic images [8,13,18,21–24,34,36–38], which inspired us to introduce text information into image-to-image transfer for virtual orthodontics. At the same time, the image-to-text generation networks represented by VQA [1] and image captioning [9] inspired us to generate orderly-teeth image codes from messy-teeth images and their teeth codes.

In this paper, we introduce a new task for virtual orthodontics: a new image-to-image transfer task from malpositioned-teeth image to neat-teeth image. We call it orthodontics transfer. To make up for the lack of datasets, we constructed a paired dataset of orthodontics before and after surgery from real medical cases. We also propose a new orthodontics transfer network (Orthod-GAN) that uses teeth-code transfer as a bridge. The experimental results show that, compared with the baselines, the proposed method is the most effective and better achieves the orthodontic transfer effect on teeth photos by image-to-image translation. The contributions of the paper can be summarized as follows:

1) A new application scenario, orthodontics transfer, is proposed.
2) A new orthodontics transfer architecture (Orthod-GAN) from malpositioned-teeth image to neat-teeth image is proposed.
3) A new paired image dataset named OrthoD, covering orthodontics before and after surgery from real medical cases, is built.
2 Related Works

2.1 Generative Adversarial Network
Generative Adversarial Network (GAN), consisting of a generator network and a discriminator network, is an unsupervised generative learning framework proposed by Goodfellow [5]. Recently, GAN and its variants have achieved impressive performance in tasks such as image generation [11], image editing [25], video generation [28], texture synthesis [14] and feature learning [33]. Recent approaches employ the idea of GAN for conditional image generation, such as image super-resolution [12], image inpainting [3], image-to-image translation [10,29] and text-to-image translation [8,13,18,21–24,34,36–38], as well as for applications in other domains such as semantic segmentation [17], object detection [15,30], music generation [35], VQA [1], image captioning [9] and 3D vision [19,32].

2.2 Image-to-Image Translation
The idea of image-to-image translation goes back to Hertzmann et al.'s Image Analogies [6], which employs a non-parametric texture model [4] learned from a single input-output training image pair. In recent years, with the development of deep learning, researchers have used datasets of input-output examples to learn a parametric translation function with CNNs, e.g., [16]. More recently, with the rise of GAN, GAN-based methods such as pix2pix [10] and pix2pixHD [29] have achieved outstanding performance on image-to-image translation tasks. In pix2pix, conditional adversarial networks are used as a general-purpose solution to image-to-image translation problems. It needs paired data and is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing gray-scale images. In pix2pixHD, a new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional GAN is presented. It can generate images at 2048 × 1024 pixel resolution with a novel adversarial loss. However, none of the above GAN-based image-to-image translation frameworks involves location translation; that is, the object locations in the original and target images are consistent with each other. In our work, we study the location translation task from malpositioned teeth to neat teeth, where the object positions vary noticeably. In this paper, a new end-to-end model, Orthod-GAN, is designed to deal with image-to-image location translation in the virtual orthodontics scenario.

2.3 Text to Image Synthesis
Text-to-image synthesis is a typical application of multimodal deep learning. Generally, the input is a simple text sentence, and the output is a vivid image. The text serves one of two primary purposes. In the first, the text is a conditional input that changes the image's content and color in image manipulation tasks [17,22,24,30]. In the second, the text is the only source input and provides the semantic information for image generation [3,8,11,13–15,18,21,23,25,28,33–38].
2.4 Image to Text Generation
The most representative image-to-text generation task is image captioning [9], which usually uses natural language to express the semantics of the image content. VQA [1] is another task of text generation based on the content of the image. However, unlike image captioning, VQA requires an additional question input, asking questions about the image. Finally, the model needs to generate a suitable text answer based on the image and related questions.
3 Dataset
Today the bottleneck of deep learning is often the limited availability and quality of training data. Different application scenarios often require different training datasets, and orthodontic transfer is no exception. We collected more than 1,000 pairs of real orthodontic medical images. Each pair includes an image of malpositioned teeth before orthodontic treatment and an image of neat teeth after it. These images are all real photos taken by medical staff following the standardized procedures of dental diagnosis and treatment, and they are a necessary part of the orthodontic treatment procedure. Some paired images of the newly constructed dataset OrthoD are shown in Fig. 2.
Fig. 2. A snapshot of our constructed orthodontic dataset OrthoD, in which the odd and even columns show the teeth before and after orthodontics.
There are more than 32 different teeth numbering systems, two of which are in common use today. One is the Universal Numbering System, which has been adopted by the American Dental Association and is used by most general dentists. The other is the two-digit FDI World Dental Federation notation, which is also widely used internationally. We choose the Universal Numbering System as the reference for the tooth numbering of our OrthoD dataset, as shown in Fig. 3, because we consider it more convenient for non-professionals (ordinary people rather than dentists).
Fig. 3. Teeth are numbered according to the Universal Numbering System. "0" means the tooth is missing at this position. Example codes from the figure:
0 0 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 19 20 21 22 23 24 25 26 27 28 29 30 31 0
0 0 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 19 20 21 22 23 24 25 26 27 28 29 30 0 0
In OrthoD, the training set contains more than 1,000 pairs of malpositioned-teeth-images and corresponding neat-teeth-images, and the test set includes dozens of pairs. All the data, including the tooth code documents, is available at https://drive.google.com/drive/folders/1qLTuALoF6nC6mCQXvUbZXbyg2LB7Jkoi?usp=sharing.
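As an illustration of the code format in Fig. 3, the sketch below builds a 32-slot tooth code from the set of teeth that are present. The helper names `encode_teeth` and `missing_teeth` are hypothetical and are not part of the OrthoD release; only the 32-slot layout with "0" for a missing tooth comes from the text.

```python
# Hypothetical helpers for illustration; only the 32-slot layout with 0 for a
# missing tooth is taken from the paper.
NUM_TEETH = 32  # Universal Numbering System: permanent teeth are numbered 1..32


def encode_teeth(present):
    """Return a 32-slot code: slot i holds tooth number i + 1, or 0 if missing."""
    present = set(present)
    out_of_range = sorted(t for t in present if not 1 <= t <= NUM_TEETH)
    if out_of_range:
        raise ValueError(f"tooth numbers out of range: {out_of_range}")
    return [t if t in present else 0 for t in range(1, NUM_TEETH + 1)]


def missing_teeth(code):
    """List the Universal numbers of the teeth marked as missing (0) in a code."""
    if len(code) != NUM_TEETH:
        raise ValueError("a tooth code must have exactly 32 slots")
    return [i + 1 for i, value in enumerate(code) if value == 0]


# First example row of Fig. 3: teeth 3-14 and 19-31 are present.
example = encode_teeth(list(range(3, 15)) + list(range(19, 32)))
print(" ".join(str(v) for v in example))
print(missing_teeth(example))
```

Keeping the code at a fixed length of 32 means that the slot index itself identifies the tooth, so a missing tooth only needs the placeholder value 0.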
4 Our Method: Orthod-GAN
Our goal is to achieve orthodontic transfer from malpositioned teeth to neat teeth. Consider a four-tuple $(I_m, X, Y, I_n)$, where $I_m \in \mathbb{R}^{3 \times H \times W}$ denotes a malpositioned-teeth-image, $X = (x_1, \ldots, x_T)$ represents the teeth-code of $I_m$, $Y = (y_1, \ldots, y_T)$ denotes the teeth-code of $I_n$, and $I_n \in \mathbb{R}^{3 \times H \times W}$ denotes the neat-teeth-image corresponding to $I_m$.
Fig. 4. Framework of the proposed Orthod-GAN. Orthod-GAN completes the orthodontic transfer in three steps: the first step generates the malpositioned-teeth-code (X) from the malpositioned-teeth-image (Im); the second step generates the neat-teeth-code (Y) by combining the malpositioned-teeth-code with the extracted malpositioned-teeth-image features; the third step generates the neat-teeth-image (In) by combining the neat-teeth-code with the extracted malpositioned-teeth-image features. The diagram shows an encoder, ResnetBlocks, a decoder, and a discriminator that judges the generated image against the target neat teeth as real or fake; its legend lists downsampling conv, upsampling conv, Embedding, Linear, Attn_Linear, GRU, and ResnetBlock layers. Example codes from the diagram:
Malpositioned Teeth Code (X): 0 0 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 19 20 21 22 23 24 25 26 27 28 29 0 0 0
Neat Teeth Code (Y): 0 0 3 4 5 6 7 8 9 10 11 12 13 0 0 0 0 0 19 20 21 22 23 24 25 26 27 28 29 0 0 0
As shown in Fig. 4, the proposed Orthod-GAN generates neat teeth from malpositioned teeth in three steps: malpositioned-teeth-image to malpositioned-teeth-code ($G_{I_m\mathrm{to}X}$), malpositioned-teeth-code to neat-teeth-code ($G_{X\mathrm{to}Y}$), and neat-teeth-code to neat-teeth-image ($G_{Y\mathrm{to}I_n}$). Specifically,
$$I_n = G_{Y\mathrm{to}I_n}\big(G_{X\mathrm{to}Y}(G_{I_m\mathrm{to}X}(I_m), I_m), I_m\big). \tag{1}$$
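The data flow of Eq. (1) can be sketched as a plain function composition. This is only an illustration of the order in which the three generators are applied; the function name `orthod_transfer`, the type aliases, and the toy stand-in callables are assumptions and do not reproduce the actual networks.

```python
from typing import Any, Callable, List

Image = Any        # stand-in type for an image tensor (assumption)
Code = List[int]   # tooth code


def orthod_transfer(
    image: Image,
    image_to_code: Callable[[Image], Code],         # plays the role of G_{I_m to X}
    code_to_code: Callable[[Code, Image], Code],    # plays the role of G_{X to Y}
    code_to_image: Callable[[Code, Image], Image],  # plays the role of G_{Y to I_n}
) -> Image:
    """Compose the three generators in the order given by Eq. (1)."""
    x = image_to_code(image)
    y = code_to_code(x, image)      # the source image is fed in again
    return code_to_image(y, image)  # and once more for image synthesis


# Toy stand-ins that only demonstrate the data flow, not real networks.
result = orthod_transfer(
    "photo",
    image_to_code=lambda im: [3, 4, 14],
    code_to_code=lambda x, im: [v for v in x if v != 14],
    code_to_image=lambda y, im: (im, tuple(y)),
)
print(result)
```

Note that the source image is passed to all three stages, which mirrors the repeated $I_m$ argument in Eq. (1).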
4.1 Malpositioned-Teeth-Image to Malpositioned-Teeth-Code
Essentially, generating the malpositioned-teeth-code from the malpositioned-teeth-image is an image-to-text task, similar to image captioning [9], and its processing flow is as follows:
$$\mathrm{RNN}_{\text{initial-input}} = \mathrm{CNN}(I_m), \tag{2}$$
$$x_t = \mathrm{RNN}(W_e X_{t-1}), \quad t \in \{1, \ldots, N\}, \tag{3}$$
where the CNN processes $I_m$ to produce the initial input of the RNN, $X_t$ denotes each tooth code of the malpositioned teeth, and $W_e$ denotes the word embedding. The loss function can then be formulated as
$$L(I, X) = -\sum_{i=1}^{N} \log x_i(X_{i-1}), \tag{4}$$
and in this paper $N$ equals 32, representing the 32 teeth of an adult. The loss is minimized with respect to all the parameters of the image embedder CNN, the code generator RNN, and the word embedding $W_e$.
Special note: in the experiments, we provide the malpositioned-teeth-codes for all the malpositioned-teeth-images in the training and test sets, so as to focus on the generation of neat-teeth-codes and the synthesis of the final neat-teeth-images.
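The loss in Eq. (4) is a summed negative log-likelihood over the code positions. The sketch below reads each term as the probability the model assigns to the correct code value at position i, which is the usual interpretation in captioning models and is an assumption here; the function name `code_nll` and the toy distributions are hypothetical.

```python
import math


def code_nll(step_probs, target_code):
    """Negative log-likelihood of a tooth code, summed over its positions.

    step_probs[i] maps each candidate value at position i to the probability
    the model assigns to it (assumed to come from a softmax over the RNN output).
    """
    if len(step_probs) != len(target_code):
        raise ValueError("need one distribution per code position")
    return -sum(math.log(probs[value]) for probs, value in zip(step_probs, target_code))


# Toy example with 3 positions instead of N = 32.
step_probs = [{0: 0.9, 1: 0.1}, {0: 0.2, 2: 0.8}, {0: 0.5, 3: 0.5}]
loss = code_nll(step_probs, [0, 2, 3])
print(round(loss, 4))
```

The loss is zero only when every position puts probability 1 on the correct value, and it grows as the model becomes less certain about any single tooth.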
4.2 Malpositioned-Teeth-Code to Neat-Teeth-Code
The step from malpositioned-teeth-code to neat-teeth-code is the critical generative component of Orthod-GAN: it carries out the transfer from malpositioned-teeth-image to neat-teeth-image at the level of teeth codes, from $G_{I_m\mathrm{to}X}(I_m)$ to $G_{X\mathrm{to}Y}(G_{I_m\mathrm{to}X}(I_m), I_m)$. Going from malpositioned-teeth-code to neat-teeth-code is essentially a translation problem. However, in orthodontic transfer it is not enough to generate the neat-teeth-code from the malpositioned-teeth-code alone, because that ignores the role of the malpositioned-teeth-image. Therefore, we add the features of the malpositioned-teeth-image. As a result, the pure translation task becomes similar to VQA [1], where V stands for the malpositioned-teeth-image ($I_m$), Q is the malpositioned-teeth-code ($X$), and A is the neat-teeth-code ($Y$) to be generated. Essentially, the problem is to predict a neat-teeth-code sequence $Y$ with the model $G_{X\mathrm{to}Y}$, given the malpositioned-teeth-code $X$ and the malpositioned-teeth-image $I_m$:
$$Y = \arg\max_{Y \in \mathcal{A}} P(Y \mid X, \mathrm{Encoder}(I_m); \theta), \tag{5}$$
where $\mathrm{Encoder}(I_m)$ denotes the image features extracted from $I_m$, $\theta$ represents the parameters of the model $G_{X\mathrm{to}Y}$, and $\mathcal{A}$ is the set of all possible neat-teeth-codes.
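Eq. (5) selects the most probable neat-teeth-code from the candidate set. A brute-force sketch over a small explicit candidate list is shown below. It assumes, purely for illustration, that the conditional probability factorizes over code positions and that the per-position distributions (already conditioned on X and the image features) are given as dictionaries; the function names are hypothetical.

```python
import math


def log_prob(code, step_probs):
    """log P(Y | X, Encoder(I_m)) under a per-position factorization (assumption)."""
    return sum(math.log(probs.get(value, 1e-12)) for probs, value in zip(step_probs, code))


def most_likely_code(candidates, step_probs):
    """Eq. (5) by brute force: return the candidate with the highest log-probability."""
    return max(candidates, key=lambda code: log_prob(code, step_probs))


# Toy 3-slot example: the model prefers to drop the tooth in the last slot.
step_probs = [{3: 0.9, 0: 0.1}, {4: 0.8, 0: 0.2}, {14: 0.3, 0: 0.7}]
candidates = [(3, 4, 14), (3, 4, 0), (0, 4, 0)]
best = most_likely_code(candidates, step_probs)
print(best)
```

Enumerating every possible 32-slot code is not practical, so a real decoder would more likely pick values position by position; the exhaustive search here only makes the arg max explicit.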
4.3 Neat-Teeth-Code to Neat-Teeth-Image
In Orthod-GAN, the orthodontic translation from malpositioned-teeth-image to neat-teeth-image is carried out on teeth-codes by $G_{X\mathrm{to}Y}$. The image synthesis generator $G_{Y\mathrm{to}I_n}$ is then built to translate the generated neat-teeth-codes into neat-teeth-images. This is a text-to-image task. Formally, with the combination of $G_{I_m\mathrm{to}X}$, $G_{X\mathrm{to}Y}$, and $G_{Y\mathrm{to}I_n}$, Orthod-GAN can generate realistic-looking neat-teeth-images $I_n$ from malpositioned-teeth-images $I_m$. However, a potentially crucial risk is that objects such as teeth in $I_n$ may differ from the objects in $I_m$ and eventually cause the transfer to fail. To reduce this risk, we leverage a feature extractor to make maximal use of the input image $I_m$, since $I_m$ is the only real image input to Orthod-GAN. $G_{Y\mathrm{to}I_n}$ in Fig. 4 has two inputs: a malpositioned-teeth-image $I_m$ and the corresponding generated neat-teeth-code $Y$. They are concatenated to synthesize the neat-teeth-image $I_n$. The objective function of $G_{Y\mathrm{to}I_n}$ is
$$L(G_{Y\mathrm{to}I_n}) = \mathbb{E}_{I_m \sim p_I,\, Y \sim p_{data}}\big[\log\big(1 - D_{Y\mathrm{to}I_n}(G_{Y\mathrm{to}I_n}(I_m, Y))\big)\big], \tag{6}$$
where $D_{Y\mathrm{to}I_n}(\cdot)$ denotes the discriminator corresponding to $G_{Y\mathrm{to}I_n}(\cdot)$.
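For a mini-batch, the expectation in Eq. (6) can be estimated by averaging over the discriminator's outputs on generated images. The sketch below assumes that the discriminator returns a probability in [0, 1] that its input is real; the function name `generator_loss` and the clamping constant `eps` are illustrative choices, not details from the paper.

```python
import math


def generator_loss(d_scores, eps=1e-12):
    """Batch estimate of Eq. (6): mean of log(1 - D(G(I_m, Y))).

    d_scores holds the discriminator outputs on generated images, each assumed
    to be a probability in [0, 1]; eps avoids log(0) when a score equals 1.
    """
    if not d_scores:
        raise ValueError("need at least one discriminator score")
    return sum(math.log(max(1.0 - s, eps)) for s in d_scores) / len(d_scores)


# The loss falls as the discriminator is fooled more often (scores near 1).
print(generator_loss([0.1, 0.2]))
print(generator_loss([0.8, 0.9]))
```

The generator minimizes this quantity, so it is rewarded when the discriminator rates its outputs as real.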
Fig. 5. Images generated on our constructed orthodontic dataset by two representative baselines, pix2pix and pix2pixHD, and by our framework Orthod-GAN. Row 1 contains the source malpositioned-teeth-images. The results of pix2pix in row 2 generally fail at orthodontic transfer because some teeth in the images are blurry. The results of pix2pixHD in row 3 generally achieve the transfer, but some objects are blurred in detail. The results of our method Orthod-GAN in row 4 achieve the image-to-image orthodontic transfer from malpositioned teeth to neat teeth most successfully. Row 5 contains the target neat-teeth-images.
5 Experiments
In this section, we compare the performance of our method with that of several baselines from both qualitative and quantitative perspectives, and we further analyze the advantages and limitations of Orthod-GAN.
5.1 Baselines
Orthodontic transfer from malpositioned-teeth-image to neat-teeth-image is a new application scenario. As far as we know, no existing method has been proposed for it. We therefore choose the most typical and representative image-to-image style transfer models, pix2pix [10] and pix2pixHD [29], as baselines. pix2pix is a conditional adversarial network intended as a general-purpose solution to image-to-image style translation problems. pix2pixHD is a method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks.
5.2 Comparison with Baselines
Qualitative Evaluation. Figure 5 shows the qualitative comparison with the baselines. The first to eighth columns show experimental results on different cases from the test set. The input malpositioned-teeth-images are listed in the first row, while the second to fourth rows show the results produced by the baselines and our Orthod-GAN. The fifth row shows the neat-teeth-images after real orthodontic treatment. The second row of Fig. 5 shows that pix2pix fails at orthodontic transfer overall: some teeth in the generated images are blurry, and some adjacent teeth stick together. In the third row, pix2pixHD appears to achieve orthodontic transfer in general, but in detail the objects, including the background, are blurred. In addition, the number, position, and shape of the generated teeth look similar across cases and do not reflect the correlation with the original malpositioned-teeth-images well. As shown in the fourth row of Fig. 5, our method Orthod-GAN achieves the image-to-image orthodontic transfer from malpositioned-teeth-image to neat-teeth-image most successfully, and its generated teeth are neater and clearer than those of the baselines. In addition, the background is more natural; in particular, the color, texture, and other details are more consistent with the original background of the corresponding neat-teeth-image.
Quantitative Evaluation. We use the FID [7] and SSIM [31] metrics to quantitatively evaluate the quality of the images generated by the baselines and our model. FID (Fréchet Inception Distance) measures the distance between the distributions of real and generated images in terms of features extracted by a pretrained network. Lower FID is better: it corresponds to real and generated samples that are more similar as measured by the distance between their activation distributions.
SSIM (structural similarity index measure) measures the similarity between two images. It is calculated on various windows of an image and is based on three comparison measurements: luminance, contrast, and structure. Higher SSIM is better.
Table 1. Benchmarking results of different methods on our constructed orthodontic dataset w.r.t. two metrics, the FID score and the SSIM score. The bold entries indicate the best performance in orthodontic transfer.
Models            FID score   SSIM score
Pix2pix           156.60      0.4542
Pix2pixHD         118.41      0.4779
Our Orthod-GAN    104.70      0.4682
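To make the FID definition concrete, the sketch below computes the Fréchet distance between two Gaussians fitted to feature vectors. It is a simplified illustration, not the evaluation code behind Table 1: it assumes diagonal covariances so that no matrix square root is needed, whereas the standard FID uses full covariance matrices of Inception activations. The function names and the toy feature vectors are hypothetical.

```python
import math


def mean_and_variance(features):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    means = [sum(f[k] for f in features) / n for k in range(dim)]
    variances = [sum((f[k] - means[k]) ** 2 for f in features) / n for k in range(dim)]
    return means, variances


def frechet_distance_diag(real, fake):
    """Frechet distance between two Gaussians with diagonal covariance (assumption).

    ||mu_r - mu_f||^2 + sum_k (v_r[k] + v_f[k] - 2 * sqrt(v_r[k] * v_f[k]))
    """
    mu_r, var_r = mean_and_variance(real)
    mu_f, var_f = mean_and_variance(fake)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_f))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_f))
    return mean_term + cov_term


# Toy 2-D "features"; real FID features would come from a pretrained network.
real = [[0.0, 0.0], [2.0, 2.0]]
fake = [[1.0, 1.0], [3.0, 3.0]]
print(frechet_distance_diag(real, fake))
```

The distance is zero when the two feature sets have identical means and variances, which matches the reading that a lower score indicates more similar distributions.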
Table 1 shows the FID and SSIM scores between real images and generated images, which measure the quality of the generated images. The FID scores in the table also demonstrate that Orthod-GAN is superior to the baselines in orthodontic transfer from malpositioned-teeth-image to neat-teeth-image. However, because orthodontic treatment typically takes years, the backgrounds of the malpositioned-teeth-images and neat-teeth-images, which account for a large proportion of each image, rarely have a high degree of similarity, so SSIM scores are a poor indicator of the quality of orthodontic transfer.
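The sensitivity of SSIM to background changes can be seen from its formula. The sketch below evaluates the SSIM expression once over a whole image treated as a single window, which is a simplification: the metric of [31] averages the same expression over many local windows. The function name `global_ssim`, the flat-list image representation, and the toy pixel values are assumptions; the constants follow the commonly used defaults K1 = 0.01 and K2 = 0.03.

```python
def global_ssim(x, y, data_range=255.0):
    """SSIM computed over one global window (simplified; not windowed SSIM)."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and of equal size")
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Flattened toy grayscale images; the second one is a brightened copy.
image = [0.0, 64.0, 128.0, 255.0]
brighter = [50.0, 114.0, 178.0, 255.0]
print(global_ssim(image, image))
print(global_ssim(image, brighter))
```

Because the means and variances are taken over all pixels in a window, a large background region that changes between the two photos lowers the score even when the teeth themselves match well.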
Fig. 6. Failure cases and limitations. The columns show, from left to right, the input, the generated image, and the target.
Failure Cases and Limitations. Figure 6 shows some failure cases that illustrate the limitations of our Orthod-GAN in image-to-image orthodontic transfer. The first column shows the input malpositioned-teeth-images to be transferred, the second column shows the neat-teeth-images generated by our method Orthod-GAN, and the third column shows the target neat-teeth-images after real orthodontics. Consider the results in the second column. First, the canine teeth framed by the red box in the first row are not completely corrected to the proper position, so we regard this transfer as a failure, or at least as not wholly successful. Second, the black boxes mark molar regions of the generated neat-teeth-images that are blurred: compared with the real neat-teeth-images in the third column, the divisions between adjacent teeth are not clear, and we regard these parts of the transfer as failures too. Finally, the images generated by our method are not as sharp as the target images of real neat teeth in the third column. Although the primary goal of this paper is to make malpositioned teeth neat, improving the resolution of the generated images should also be a goal of this task, and it remains a limitation of our Orthod-GAN.
6 Conclusion
In this paper, we propose a new deep learning application scenario, orthodontic transfer, and a new image-to-image transfer network from malpositioned-teeth-image to neat-teeth-image that uses teeth-code transfer as a bridge (Orthod-GAN). To make up for the lack of datasets, we constructed a paired dataset of teeth before and after orthodontic treatment from real medical cases. Typical image-to-image transfer models are employed as baselines. Experimental results show that our proposed method is more effective and better achieves the orthodontic transfer effect on teeth photos through image-to-image translation. However, Orthod-GAN still fails at orthodontic transfer and produces unclear tooth edges in some cases. These are the limitations of Orthod-GAN and the directions in which future work needs to break through.
References
1. Antol, S., et al.: VQA: visual question answering, pp. 2425–2433 (2015)
2. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified generative adversarial networks for multi-domain image-to-image translation, pp. 8789–8797 (2018)
3. Dolhansky, B., Ferrer, C.C.: Eye in-painting with exemplar generative adversarial networks, pp. 7902–7911 (2018)
4. Efros, A.A., Leung, T.K.: Texture synthesis by non-parametric sampling, vol. 2, pp. 1033–1038 (1999)
5. Goodfellow, I., et al.: Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27 (2014)
6. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies, pp. 327–340 (2001)
7. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 30 (2017)
8. Hong, S., Yang, D., Choi, J., Lee, H.: Inferring semantic layout for hierarchical text-to-image synthesis, pp. 7986–7994 (2018)
9. Hossain, M.Z., Sohel, F., Shiratuddin, M.F., Laga, H.: A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. (CSUR) 51(6), 1–36 (2019)
10. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks, pp. 1125–1134 (2017)
11. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
12. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network, pp. 4681–4690 (2017)
13. Li, B., Qi, X., Lukasiewicz, T., Torr, P.H.: Controllable text-to-image generation. arXiv preprint arXiv:1909.07083 (2019)
14. Li, C., Wand, M.: Precomputed real-time texture synthesis with Markovian generative adversarial networks, pp. 702–716 (2016)
15. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection, pp. 1222–1230 (2017)
16. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation, pp. 3431–3440 (2015)
17. Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408 (2016)
18. Mansimov, E., Parisotto, E., Ba, J.L., Salakhutdinov, R.: Generating images from captions with attention. arXiv preprint arXiv:1511.02793 (2015)
19. Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3D view synthesis, pp. 3500–3509 (2017)
20. Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive learning for unpaired image-to-image translation, pp. 319–345 (2020)
21. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis, pp. 1060–1069 (2016)
22. Reed, S.E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H.: Learning what and where to draw. Adv. Neural Inf. Process. Syst. 29, 217–225 (2016)
23. Rombach, R., Esser, P., Ommer, B.: Network-to-network translation with conditional invertible neural networks. arXiv preprint arXiv:2005.13580 (2020)
24. Sharma, S., Suhubdy, D., Michalski, V., Kahou, S.E., Bengio, Y.: ChatPainter: improving text to image generation using dialogue. arXiv preprint arXiv:1802.08216 (2018)
25. Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., Samaras, D.: Neural face editing with intrinsic image disentangling, pp. 5541–5550 (2017)
26. Togo, R., Ogawa, T., Haseyama, M.: Multimodal image-to-image translation for generation of gastritis images, pp. 2466–2470 (2020)
27. Tomei, M., Cornia, M., Baraldi, L., Cucchiara, R.: Art2Real: unfolding the reality of artworks via semantically-aware image-to-image translation, pp. 5849–5859 (2019)
28. Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation, pp. 1526–1535 (2018)
29. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional GANs, pp. 8798–8807 (2018)
30. Wang, X., Shrivastava, A., Gupta, A.: A-fast-RCNN: hard positive generation via adversary for object detection, pp. 2606–2615 (2017)
31. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
32. Wu, J., Zhang, C., Xue, T., Freeman, W.T., Tenenbaum, J.B.: Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling, pp. 82–90 (2016)
33. Xie, Q., Dai, Z., Du, Y., Hovy, E., Neubig, G.: Controllable invariance through adversarial feature learning. arXiv preprint arXiv:1705.11122 (2017)
34. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks, pp. 1316–1324 (2018)
35. Yang, L.C., Chou, S.Y., Yang, Y.H.: MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847 (2017)
36. Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks, pp. 5907–5915 (2017)
37. Zhang, H., et al.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1947–1962 (2018)
38. Zhang, Z., Xie, Y., Yang, L.: Photographic text-to-image synthesis with a hierarchically-nested adversarial network, pp. 6199–6208 (2018)
MIFS: A Peer-to-Peer Medical Images Storage and Sharing System Based on Consortium Blockchain
Hao Liu, Xia Xiao, Xinglong Zhang, Kenli Li, and Shaoliang Peng (✉)
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
{liuhaohn,slpeng}@hnu.edu.cn
Abstract. As computer vision has continued to make significant breakthroughs in recent years, medical image processing has become a research hotspot. However, hospitals that generate medical images have difficulty sharing this data due to differences in information systems and centralized storage structures. As a result, researchers often have access to only a small number of samples. To address the sharing of medical image resources, we propose MIFS, which stores, retrieves, authorizes, and shares medical images among hospitals. MIFS proposes a peer-to-peer data storage scheme based on a consortium blockchain and an authentication mechanism compatible with the consortium blockchain. On this basis, MIFS proposes a blockchain-based access control scheme and a process for retrieving, authorizing, and sharing medical images. Finally, we implemented and evaluated the system to demonstrate the feasibility of our scheme.
Keywords: Medical image · Peer-to-peer file system · Data sharing · Consortium blockchain
1 Introduction
In recent years, medical imaging has become one of the fastest-growing fields in medical technology; medical images mainly come from imaging devices such as CT, MRI, PET-CT, and ultrasound [1]. Due to the significant advantages of deep neural networks in computer vision, many researchers have started to use them to process medical images. Medical images are mainly stored in tools such as PACS, which implement many of the functions of digital medical imaging. However, the centralized nature of these tools has led to the dispersion of medical imaging resources, difficulties in sharing imaging resources between hospitals, and challenges in data security. The only data that can be obtained on a large scale are easily accessible medical images, such as skin images and CXR [2]. As a result, the analysis and research of medical images using artificial intelligence are limited [3]. The research of T. Wang et al. [4] shows that the size of the training set affects the performance of machine learning, but the training sets in a large number of medical image-based studies are tiny.
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 336–347, 2021. https://doi.org/10.1007/978-3-030-91415-8_29
To promote the mining of more useful information from medical images, a data platform for sharing medical images should be established. However, there are few cases of hospitals building a unified centralized data platform for data sharing. First, the medical images of patients are stored in the PACS of the hospital where each patient is located, and it is challenging to exchange information because each hospital chooses different information systems, standards, and interfaces [5]. Second, if multiple institutions jointly build a centralized data platform, it is difficult to decide which party controls the platform. Finally, data can easily be copied in the process of sharing and circulation. If the ownership, the generator, and the user of data cannot be identified, data authorization cannot be realized well [6].
The data-sharing model based on blockchain technology can effectively solve many problems of data sharing among peers through mechanisms such as the distributed ledger, data privacy and security, data confirmation, and smart contracts [7]. The blockchain data structure and distributed ledger technology ensure that data cannot be tampered with. At the same time, blockchain brings decentralization and traceability to applications. However, distributed ledger technology means that each accounting node of the blockchain system saves the same ledger. If files are stored directly on the blockchain, the blockchain suffers from bloat [8]. Once a blockchain system stores a certain amount of data, it loses its availability [9].
To address the difficulty of sharing medical image resources among hospitals and the shortcomings of existing decentralized storage solutions built on public blockchains, this paper proposes a peer-to-peer medical images storage and sharing system based on a consortium blockchain. The system introduces an authentication mechanism compatible with the consortium blockchain as the basis for identity access and data confirmation. Furthermore, it proposes a scheme for secure retrieval of medical image resources between organizations and a medical image authorization and access control mechanism to facilitate the sharing of medical images among hospitals.
2 Related Work
2.1 PACS
A Picture Archiving and Communication System (PACS) is an information system used by hospitals to manage the medical images generated by medical devices such as CT and MR scanners. PACS can store and manage medical image data, display and process images, and provide a data access interface to the Radiology Information System (RIS) and the Hospital Information System (HIS). The first basic PACS was created in 1972 by Dr. Richard J. Steckel [10]. PACS is mainly built centrally by hospitals, with each hospital building a system of its own. Although PACS can communicate with other applications in the organization, such as HIS, data cannot be securely exchanged between organizations. In recent years, some researchers have built a data-sharing platform based on PACS [11]. However, this platform is only applicable to small-scale services among hospitals, such as remote diagnosis, and pays no attention to security issues such as data confirmation and data privacy.
2.2 Blockchain-Based Storage System
Blockchain technology ensures the security of data storage and transmission through a decentralized peer-to-peer system and can be combined with many fields. According to P. J. Taylor et al. [12], data storage and sharing is the second most popular topic, after IoT technologies, among the technologies combined with blockchain. These studies include blockchain applications, encrypted cloud data search, and file tamper-proofing.
The blockchain data storage model has its limitations. Distributed ledger technology requires each accounting node to keep a copy of the ledger [13], an amount of redundancy that is uncommon even in large-scale traditional distributed systems. One way to solve the blockchain storage performance problem is to reduce this redundancy. M. Dai et al. [8] proposed a distributed storage technique based on network coding: blocks are created and split into sub-blocks, the system encodes the sub-blocks into more sub-blocks, and these are distributed to all other nodes. This approach reduces storage space, but it generates many block requests in the P2P network when blocks are queried, which puts even more pressure on an already inefficient blockchain system. A solution proposed by A. Palai et al. [9] stores the balances of non-empty addresses in a database called the "account tree" and periodically deletes outdated blockchain data. This solution is not universal; it is limited to transactions that change account balances.
Off-chain storage is the dominant approach to solving blockchain storage performance problems. P. Sharma et al. [7] describe the more successful existing off-chain storage applications, IPFS and Swarm, and analyze the layered architectural model they share. In this model, the blockchain is loosely coupled with the storage layer: the two are interconnected only through a web interface, in contrast to a traditional blockchain where each accounting node keeps all the data, which significantly reduces the pressure on the accounting nodes. Most current blockchain-based storage systems are public blockchain systems in which storage resources are exchanged for tokens issued on the public blockchain. For example, Swarm is an off-chain storage solution encouraged by Ethereum [14]. Filecoin, used by IPFS, is a public blockchain token used to purchase storage resources [15]. Storj has built its own public blockchain, also mainly to issue tokens for purchasing storage resources, and it also supports tokens such as Ethereum [16]. There are fewer cases of distributed data storage based on a consortium blockchain that address data authentication and authorization.
3 Specific Design
3.1 System Architecture
MIFS is a peer-to-peer structured storage system for sharing medical image files between hospitals; its system architecture is shown in Fig. 1. The hospitals form a consortium blockchain and, on top of it, a peer-to-peer medical images storage and sharing system (MIFS). MIFS consists of the peer file nodes (FileNodes) of the hospitals. Each FileNode requires the following components.
– Blockchain Data Interface: Provides the system with trusted data access to the consortium blockchain. The smart contract of the consortium blockchain provides a set of transactional operations for manipulating file metadata.
– File Storage: Stores local medical images as well as slices of other medical images in the consortium. Files committed by members are split into file slices and stored on individual peer storage nodes. Other organizations must pass identity authentication and access control before they can access the file slices stored by an organization.
– Identity Authentication: Enables identity-based admission of storage nodes. Only authenticated nodes can join the system. This function requires an identity issuer and an authenticator.
– Access Control: Verifies whether an organization has permission to access a file and authorizes an organization to access a file. This function consists of two parts: a permission grantor and a permission verifier.
– Medical Image Retrieval: Retrieves medical images of interest from the system. This function requires a medical image feature extraction component and a medical image retrieval component.
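The File Storage component splits committed files into slices and spreads them over peer nodes. The sketch below illustrates one way this could look. The fixed slice size, the SHA-256 slice identifiers, and the round-robin placement are all assumptions made for illustration; the text above does not specify how MIFS sizes, names, or places slices.

```python
import hashlib


def split_into_slices(data: bytes, slice_size: int):
    """Cut a file into fixed-size slices (the last one may be shorter)."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]


def place_slices(slices, node_ids):
    """Assign slices to nodes round-robin (hypothetical placement policy).

    Returns (slice_hash, node_id) pairs, where the hash identifies the slice.
    """
    if not node_ids:
        raise ValueError("need at least one storage node")
    return [
        (hashlib.sha256(s).hexdigest(), node_ids[i % len(node_ids)])
        for i, s in enumerate(slices)
    ]


slices = split_into_slices(b"abcdefghij", slice_size=4)
placement = place_slices(slices, ["HospitalA", "HospitalB"])
for slice_hash, node in placement:
    print(node, slice_hash[:12])
```

Recording only the slice hashes and their holders keeps the image data itself off the ledger, which is in line with the off-chain storage approach discussed in Sect. 2.2.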
Fig. 1. General architecture of MIFS.
The metadata that MIFS depends on is stored in the consortium blockchain. Figure 2 shows the data structure of the blockchain storage. Two types of metadata are stored in the blockchain: node metadata and file metadata. Node metadata is the status information of each FileNode, including node id, node address, node port, node load, node debt, and node state. File metadata is the meta information of each file, including file hash, file name, file feature, timestamp, file size, file state, file creator organization, file authorized organizations, and slice storage organizations. This metadata is manipulated through smart contracts, and the records of the changes that the smart contracts make to the metadata are packaged into blocks and stored in the blockchain.
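The two metadata records can be sketched as plain data classes. The field names follow the lists above, while the field types, units, and example values are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeMetadata:
    node_id: str
    address: str
    port: int
    load: float   # current load of the node (unit is an assumption)
    debt: float   # meaning and unit are not specified in the text
    state: str


@dataclass
class FileMetadata:
    file_hash: str
    file_name: str
    file_feature: List[float]   # feature vector used for retrieval (assumed encoding)
    timestamp: int              # assumed to be Unix seconds
    file_size: int              # assumed to be bytes
    file_state: str
    creator_org: str
    authorized_orgs: List[str] = field(default_factory=list)
    slice_storage_orgs: List[str] = field(default_factory=list)


node = NodeMetadata("node-1", "10.0.0.5", 7051, load=0.2, debt=0.0, state="online")
record = FileMetadata(
    "example-hash", "scan_001.dcm", [0.1, 0.2], 1630000000, 1024, "stored", "HospitalA"
)
record.authorized_orgs.append("HospitalB")  # in MIFS this change would go through a smart contract
print(node)
print(record)
```

Keeping the authorized and slice-holding organizations inside the file record means that a single lookup returns everything needed to decide who may read a file and where its slices are held.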
Fig. 2. The data structure of consortium blockchain on which MIFS relies.
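As an illustration only, the two metadata record types listed above could be modeled as follows. The field names, types, and the `is_authorized` helper (including the assumption that a creator can always access its own file) are derived from the prose and are hypothetical; they are not the smart-contract schema used by the authors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeMetadata:
    """Status record of one FileNode (field names are illustrative)."""
    node_id: str
    address: str
    port: int
    load: int = 0          # total size of the slices this node stores, in bytes
    debt: int = 0          # total size of slices its organization placed on others
    state: str = "online"  # assumed state label


@dataclass
class FileMetadata:
    """Meta information of one shared file (field names are illustrative)."""
    file_hash: str
    file_name: str
    feature: List[float]
    timestamp: int
    file_size: int
    state: str
    creator_org: str
    authorized_orgs: List[str] = field(default_factory=list)
    slice_storage_orgs: List[str] = field(default_factory=list)

    def is_authorized(self, org: str) -> bool:
        # Assumption: the creator may always read its own file;
        # every other organization needs an explicit grant.
        return org == self.creator_org or org in self.authorized_orgs
```

Keeping both record types flat mirrors the field lists in the text and makes it easy to see which values an access check or a placement decision would read.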
3.2 Authentication and Access Control
The authentication mechanism is an essential prerequisite for implementing the system and the main feature that distinguishes it from public blockchain systems. A consortium blockchain, by its nature, has its own authentication mechanism. The mechanism proposed in this paper lets the FileNodes and the consortium blockchain share a compatible authentication mechanism. As shown in Fig. 3, it consists of two parts: a certificate authority (CA) and a membership service provider (MSP). Each organization has its own CA, and the FileNodes and the consortium blockchain each have their own MSP. User login, access control between FileNodes, and data interaction between a FileNode and the blockchain are all identity-based. Each organization's CA issues three kinds of identity certificates: User, Admin, and Visitor. The MSP verifies the legitimacy of a visitor's certificate and assigns the visitor an identity.
Fig. 3. The authentication mechanism of MIFS.
The CA issues the User certificate to the organization's users and delivers the public key signed by the corresponding CA to the MSP of the organization's FileNode. A client with the User identity can log in to the FileNode and use the services it provides. The CA issues the Visitor and Admin identities to the FileNode and delivers the public key signed by the corresponding CA to the MSP of the consortium blockchain and to the MSP of each organization's FileNode. A FileNode with the Admin identity can access the consortium blockchain, and a FileNode with the Visitor identity can log in to the FileNodes of other organizations. Any user or node that accesses a node in the system must first be authenticated, after which the MSP grants the visitor the corresponding identity. Before performing any subsequent operation, the user or node must pass access control: the system checks the visitor's identity to decide whether the operation is permitted. The data on which access control depends is stored on the blockchain and read through the blockchain data interface; permissions are then verified against the access control rules.

3.3 Peer-to-Peer File Storage
In the system, the files uploaded by users are encrypted, encoded into file slices, and stored across the FileNodes. The processing flow is shown in Fig. 4.
Step 1, the FileNode caches the file uploaded by a user of its organization locally.
Step 2, the FileNode computes the file hash and extracts the global features of the medical image.
Step 3, the FileNodes that will store the file slices are selected based on the debt and load of each node.
Fig. 4. The process of peer-to-peer file storage in MIFS.
Step 4, the file metadata is stored on the blockchain.
Step 5, the file is encrypted, encoded into three slices, and distributed to the FileNodes selected in Step 3.

In the second step, we extract global features of medical images, which usually have pronounced contour and color features. We therefore use the SaCoCo algorithm [17] to extract the global features; the features can also include user-defined labels.

In the third step, a node's load is the total size of the file slices it stores, while a node's debt is the total size of the file slices its organization has distributed to others. To keep storage fair across organizations, the load and debt of each node should be balanced. Therefore, the node with the highest value of debt minus load is preferred when selecting where to store slices.

In the fifth step, a symmetric encryption algorithm is used to encrypt the file. Each file is encrypted with a randomly generated key, and the key is stored in the organization's FileNode as private data. File slicing can incorporate redundancy to improve file availability: the FileNode uses Reed-Solomon codes [18] to encode the encrypted file into three file slices. The system uses an encoding redundancy of 1/3, meaning that one of the three slices is redundant. By the principle of Reed-Solomon codes, any two of the three slices suffice to recover the original file, so the system can tolerate one FileNode being temporarily unavailable.

The fourth step must be completed before the fifth because, when a FileNode receives a file slice, it first verifies whether it is supposed to store that slice. A FileNode does not store slices that the system does not need. This check mainly prevents malicious nodes from launching garbage-file write attacks, that is, writing a large number of files that do not belong to the system until the FileNode becomes unavailable.
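The placement rule and the any-two-of-three recovery property can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `choose_storers`, `encode`, and `decode` are hypothetical names, and a single XOR parity slice stands in for the Reed-Solomon code, since it is the simplest code in which any two of three slices recover the data.

```python
from typing import Dict, List, Optional, Tuple


def choose_storers(nodes: Dict[str, Tuple[int, int]], count: int = 3) -> List[str]:
    """Pick the `count` nodes with the highest (debt - load).

    `nodes` maps a node id to (debt, load), both in bytes.
    """
    ranked = sorted(nodes, key=lambda n: nodes[n][0] - nodes[n][1], reverse=True)
    return ranked[:count]


def _xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))


def encode(data: bytes) -> List[bytes]:
    """Split `data` into two halves plus one XOR parity slice."""
    if len(data) % 2:
        data += b"\x00"  # pad to even length; the caller keeps the true size
    half = len(data) // 2
    first, second = data[:half], data[half:]
    return [first, second, _xor(first, second)]


def decode(slices: List[Optional[bytes]], size: int) -> bytes:
    """Rebuild the original bytes from any two of the three slices."""
    if sum(s is None for s in slices) > 1:
        raise ValueError("need at least two of the three slices")
    first, second, parity = slices
    if first is None:
        first = _xor(second, parity)
    elif second is None:
        second = _xor(first, parity)
    return (first + second)[:size]
```

In this sketch the encrypted file would be passed to `encode`, and the three resulting slices would go to the nodes returned by `choose_storers`. Whether the uploading organization's own node may be among the storers is not stated in the text and is left open here.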
3.4 Retrieval, Authorization, and Sharing
Medical image retrieval can be based on image tags or image content. In tag-based retrieval, the user tags a medical image when uploading it, and a user who enters a tag retrieves all images in the system that carry the same tag. However, tag-based retrieval cannot reach every image in the system because some medical images may not be tagged. In content-based retrieval, the global features of a medical image are extracted automatically when the user uploads the image to the FileNode. The user supplies a query image, the FileNode extracts its global features, and similarity matching returns the best-matching set of images. Because the global feature extraction algorithm used in the system captures contour and color features, the returned images are similar mainly in contour and color histogram.

Users can only download files for which they have been authorized. The data demander initiates an authorization request; when the data provider grants it, the file is marked as accessible in the blockchain and the file's symmetric key is sent to the demander.

An authorized file is downloaded as follows. The user asks the FileNode of their organization for the file. The FileNode checks whether the target file is cached locally and, if so, returns it to the user. Otherwise, it first caches the target file and then returns it. To cache the file, the FileNode looks up the blockchain to find which organizations' FileNodes store the target file's slices and fetches slices from two of them. It then decodes the slices into the encrypted file and decrypts it with the corresponding symmetric key to obtain the original file.
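The download procedure can be sketched as a single function. This is a hedged outline only: `download` and all of its parameters are hypothetical names, the blockchain lookup is reduced to a dictionary, the slice transfer, decoding, and decryption are passed in as callables, and the authorization check is assumed to have happened before the call. Whether the cache holds plaintext or ciphertext is not stated in the text; plaintext is assumed here.

```python
from typing import Callable, Dict, List, Optional


def download(file_hash: str,
             cache: Dict[str, bytes],
             slice_orgs: Dict[str, List[str]],
             fetch_slice: Callable[[str, str], Optional[bytes]],
             decode: Callable[[List[Optional[bytes]]], bytes],
             decrypt: Callable[[bytes], bytes]) -> bytes:
    """Return the file, caching it locally on the first read."""
    if file_hash in cache:                  # cache hit: no remote access
        return cache[file_hash]

    orgs = slice_orgs[file_hash]            # stand-in for the blockchain lookup
    slices: List[Optional[bytes]] = [None] * len(orgs)
    fetched = 0
    for i, org in enumerate(orgs):
        if fetched == 2:                    # two slices are enough to decode
            break
        slices[i] = fetch_slice(org, file_hash)
        if slices[i] is not None:
            fetched += 1
    if fetched < 2:
        raise IOError("fewer than two slices available")

    plaintext = decrypt(decode(slices))     # decode first, then decrypt
    cache[file_hash] = plaintext
    return plaintext
```

Trying the storing organizations in order and stopping after two successful fetches matches the tolerance of one unavailable FileNode described in Sect. 3.3.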
4 Implementation and Evaluation
4.1 System Implementation
The consortium blockchain network on which the system relies is implemented with Hyperledger Fabric. The authentication mechanism of the peer-to-peer file system is compatible with that of Hyperledger Fabric, so each organization only needs to deploy a CA to issue its identities. Following the general system architecture, we implemented the peer file node in Java. Nodes communicate with each other through a RESTful API to form a peer-to-peer storage system. Each peer file node implements the blockchain data interface, file storage, identity authentication, access control, and image retrieval modules. Each peer file node also provides RESTful interface services to the other peer file nodes in the system and web services to the users of its organization.
Hospital users can log in to their hospital's peer file node through the web client using the User identity issued by the hospital's CA. Through this web application, users can operate the peer file node to store, retrieve, authorize, and share medical image files in the system.

4.2 Performance
In our experiments, we constructed three virtual hospitals to form a consortium, and the nodes to be deployed for each hospital are shown in Table 1.

Table 1. The nodes to be deployed for each hospital.
Node      Quantity  Function
CA        1         Issuing User, Admin, and Visitor identity
Peer      1+        Storing blockchain data and endorsing the results
Orderer   1+        Ordering transactions and keeping data consistency
FileNode  1         Forming a peer-to-peer medical image storage system
For our experiments, all nodes of each virtual hospital were deployed in Docker containers. The machine running these containers ran CentOS 7 and had an Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50 GHz and 36 GB of memory.
Fig. 5. Impact of upload file size on server processing latency.
File Write Performance: We first evaluated the latency of uploading medical image files of different sizes to the peer-to-peer storage system. The latency is the time between the completion of the file upload and the return of the response
from the server. Note that the measured latency excludes the network transmission time of uploading files to the server; the purpose is to test the pressure that different file sizes put on the peer-to-peer file system. We generated medical image test files of different sizes by image cropping. To obtain a more reliable measurement, ten files of each size were generated, and the upload latency for that size was taken as the average over the ten files. We uploaded these files using a JMeter script, and the results are shown in Fig. 5. The figure shows that the latency of processing an uploaded file grows linearly with the file size.
Fig. 6. Time consumption of the server for each stage of processing the uploaded file.
Next, we measured the time the server spends in each stage of processing an uploaded file, using a file size of 1 MB as the benchmark. We used the same method to generate 100 image test files of 1 MB each and uploaded them to the system using a JMeter script. We then computed the average time the server spent in each stage, and the results are shown in Fig. 6. Sending the metadata to the blockchain dominates the server processing latency, accounting for 88.41%. Blockchain read and write performance is therefore still the bottleneck of the whole storage system.

File Read Performance: We used a JMeter script to test the read latency of the system. Again, the latency is the time the server takes to process the download request and excludes the network transfer time of the file. The first time the system reads an authorized file, it has to locate the file slices stored in the system, download them, and assemble them into a file locally. Subsequent reads are served directly from the local cache. The server processing latency therefore differs between “Read Slice” and “Read Cache”, and Fig. 7 shows the test results for these two read types. When reading from the cache, there is almost no correlation between the server processing latency and the file size. In contrast, when reading the file slices for the first time, the server processing latency is linearly correlated with the file size. Thanks to the local cache, the response latency of reading files is in the millisecond range.
Fig. 7. Impact of different download size files on server processing latency under two read types.
5 Conclusion
This paper proposes MIFS, a system for storing, retrieving, authorizing, and sharing medical imaging resources between hospitals. The system uses a consortium blockchain for trusted data sharing among hospitals and builds a peer-to-peer file system for medical image storage on top of this consortium blockchain. Medical image files that hospitals upload to the system are split into file slices and stored across the peer file nodes. The system proposes an authentication mechanism compatible with the consortium blockchain for identity-based access and file authorization. It also proposes an access control scheme to secure file sharing and a process to retrieve, authorize, and share medical images among hospitals to facilitate resource sharing. Finally, this paper implements the peer file node based on Fabric and builds the system, measuring its read and write latency and demonstrating the feasibility of the scheme.

Acknowledgment. This work was supported by National Key R&D Program of China 2017YFB0202602, 2018YFC0910405, 2017YFC1311003, 2016YFC1302500, 2016YFB0200400, 2017YFB0202104; NSFC Grants U19A2067, 61772543, U1435222, 61625202, 61272056; Science Foundation for Distinguished Young Scholars of Hunan Province (2020JJ2009); Science Foundation of Changsha kq2004010; JZ20195242029, JH20199142034, Z202069420652; The Funds of Peng Cheng Lab, State Key Laboratory of Chemo/Biosensing and Chemometrics; the Fundamental Research Funds for the Central Universities, and Guangdong Provincial Department of Science and Technology under grant No. 2016B090918122.
References
1. Erickson, B.J., Korfiatis, P., Akkus, Z., Kline, T.L.: Machine learning for medical imaging, vol. 37, no. 2, pp. 505–515 (2017)
2. Suzuki, K.: Overview of deep learning in medical imaging, vol. 10, no. 3, pp. 257–273 (2017)
3. Willemink, M.J., et al.: Preparing medical imaging data for machine learning, vol. 295, no. 1, pp. 4–15 (2020)
4. Wang, T., et al.: A review on medical imaging synthesis using deep learning and its clinical applications, vol. 22, no. 1, pp. 11–36 (2021)
5. Eichelberg, M., Kleber, K., Kämmerer, M.: Cybersecurity challenges for PACS and medical imaging, vol. 27, no. 8, pp. 1126–1139 (2020)
6. Thabit, R.: Review of medical image authentication techniques and their recent trends, vol. 80, no. 9, pp. 13439–13473 (2021)
7. Sharma, P., Jindal, R., Borah, M.D.: Blockchain technology for cloud storage: a systematic literature review, vol. 53, no. 4, pp. 89:1–89:32 (2020)
8. Dai, M., Zhang, S., Wang, H., Jin, S.: A low storage room requirement framework for distributed ledger in blockchain, vol. 6, pp. 22970–22975 (2018)
9. Palai, A., Vora, M., Shah, A.: Empowering light nodes in blockchains with block summarization. In: 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), pp. 1–5 (2018)
10. Huang, H.K.: PACS and imaging informatics: basic principles and applications. Wiley-Liss, Hoboken, N.J. (2004)
11. Ke, X., Ping, H.: Research of remote image collaboration services based on PACS sharing platform (2016)
12. Taylor, P.J., Dargahi, T., Dehghantanha, A., Parizi, R.M., Choo, K.-K.R.: A systematic literature review of blockchain cyber security, vol. 6, no. 2, pp. 147–156 (2020)
13. Zheng, Q., Li, Y., Chen, P., Dong, X.: An innovative IPFS-based storage model for blockchain. In: IEEE/WIC/ACM International Conference on Web Intelligence (WI) 2018, pp. 704–708 (2018)
14. Swarm: storage and communication infrastructure for a self-sovereign digital society
15. Benet, J.: IPFS - content addressed, versioned, p2p file system (2014)
16. Wilkinson, S., Boshevski, T., Brandoff, J., Buterin, V.: Storj a peer-to-peer cloud storage network (2014)
17. Iakovidou, C., Anagnostopoulos, N., Lux, M., Christodoulou, K., Boutalis, Y., Chatzichristofis, S.A.: Composite description based on salient contours and color information for CBIR tasks, vol. 28, no. 6, pp. 3115–3129 (2019)
18. Plank, J., Simmerman, S., Schuman, C.: Jerasure: a library in C/C++ facilitating erasure coding for storage applications version 1.2 (2008)
StarLace: Nested Visualization of Temporal Brain Connectivity Data
Ming Jing, Yunjing Liu, Xiaoxiao Wang, and Li Zhang(B)
Department of Computer Science and Technology, Qilu University of Technology, Jinan, China
Abstract. A brain connectivity network can effectively express the physical connections between brain regions, the functional connections of the brain, and the characteristics of the measured signals. Brain connectivity networks are usually compact, sometimes even fully connected, and can change in complex ways at any time. Visualizing these data can greatly help the study of brain operating mechanisms and the diagnosis of brain diseases. In this paper, we present the design and implementation of StarLace, a nested visualization tool for brain connectivity data that displays the temporal features of the data. With StarLace, researchers can analyze the temporal brain activity of a specific subject to discover correlations between regions and then identify a reasonable functional partition. Finally, the effectiveness and usability of the method are verified on real data.

Keywords: Nested visualization · Functional connectivity · Temporal data

1 Introduction
In the past ten years, functional neuroimaging technology has developed rapidly, and it has become an important means of studying cognition and clinical brain diseases [1,2]. In particular, resting-state fMRI (rsfMRI) lets researchers obtain information on the spontaneous functional activity of the human brain noninvasively. This helps to understand the pathophysiological mechanisms of major neuropsychiatric diseases (such as schizophrenia, depression, Alzheimer's disease, childhood hyperactivity disorder, epilepsy, stroke, brain trauma, and drug addiction) and is of great value for exploring important clinical issues (such as early diagnosis, drug treatment mechanisms, neurosurgical operation planning, and rehabilitation evaluation).

We can treat a temporal brain connectivity network as an ordinary dynamic network. Dynamic network data and the temporal feature information hidden in it need to be processed through data mining and visual analysis and presented to users in a formal representation. Dynamic network visualization can play a key role in complex network analysis [3]. As the time dimension grows, the amount of dynamic network data also grows rapidly, the visualized data is constantly updated, and new data can have an effect on the existing visualization that is hard to estimate. This characteristic poses challenges for dynamic network visualization [4]. Current layout methods often use node-link diagrams or adjacency matrices [5]. These methods are subject to the limitations of the network diagram itself, and temporal information cannot be displayed efficiently by animation or a timeline axis. Navigation aids or interactive interfaces can mitigate these problems, but they are costly to learn and sometimes increase the burden of understanding the information. In addition, network data visualization methods are mostly limited to specific areas and lack both analysis of the specific visualization application and the necessary guidance on visualization technology. Therefore, a more efficient and visually pleasing dynamic network visualization method is urgently needed.

A brain connectivity network is usually a compact network; some are even fully connected, and the order can change at any time [6]. The weights of its edges can also change at any time. Visual analysis of a brain connectivity network can show the connectivity state of the brain, the connection patterns corresponding to specific functions, differences between individuals, and abnormal conditions, and so display the functional state of the brain efficiently [7].

Based on graph visualization technology, this paper proposes a nested visualization method for brain connectivity data that targets the temporal characteristics of brain activity, so as to display the features of the brain network efficiently. Our main goal is to visualize the functional correlation between brain regions in an efficient way. We extract temporal features of functional correlation from the data and map them onto a node-link graph. To reduce the complexity of the graph, we propose an automatic layout simplification method. We then apply the technique to real data to build an application called StarLace, which aims to assist researchers across disciplines, including neuroscientists, neurosurgeons, and psychologists. StarLace is a web-based analysis tool for brain connectivity research that is easy to use for novices and experts alike.

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 348–359, 2021. https://doi.org/10.1007/978-3-030-91415-8_30
2 Related Work
Brain functional connectivity data is defined by the correlation matrix between specific locations in the brain, the regions of interest (ROIs). It is usually visualized as a three-dimensional node-link graph that reflects the spatial locations of the ROIs in the brain. ROIs are displayed as nodes, and weighted edges represent the strength of the relationship between two nodes. In the 3D node-link graph of functional connectivity data, the number of visible elements can be reduced by hiding edges whose strength is below a certain threshold, but the generated
image will still be cluttered and affected by the inherent shortcomings of 3D rendering, such as occlusion. To facilitate comparison, two connectivity data sets can be overlaid in the 3D brain model, with the color of an edge representing the correlation or anti-correlation between the nodes. However, the clutter and complexity of visual encoding in spatial or volume rendering make it difficult to compare weighted edges accurately.

The two-dimensional node-link graph is another way to represent brain functional connectivity data. The layout of the graph can be computed by multidimensional scaling or a force-directed algorithm. These layout methods reduce clutter by eliminating overlap and reducing the number of long edges [8]. Neuroscientists are usually trained to reason in terms of the spatial regions of the brain, and the spatial context of functional connectivity data is crucial for them to interpret the data effectively. Therefore, a node-link graph is usually accompanied by a spatial rendering of the node positions, and nodes in the two representations are matched one-to-one by region color and label [9]. To preserve the spatial characteristics of the data to some extent, many two-dimensional node-link graphs of connectivity adopt a biological layout, in which the positions of nodes are projected onto a two-dimensional plane defined by two standard anatomical axes (e.g., ventral, anterior, posterior, left, and right). These layouts roughly convey the spatial properties of the data, help scientists navigate the graph, and leave color free to encode other information, such as changes across two states [10].

The functional connectivity matrix [11–13] is also a popular alternative representation of connectivity data, and it is occasionally used in a grid view to illustrate trends across different connectivity data sets [14–16]. To support direct comparison, the correlation coefficients [17] from multiple scan states can be displayed in nested quadrants of a matrix cell, similar to the design of Stein et al. [18]. However, this design makes it difficult to focus on a single scan state.

The fibers that make up the anatomical connections represent physical entities [12] and are often visualized in 3D brain models. Research on fiber connection visualization mainly focuses on similarity clustering, bundling simplification, selection interaction in 3D space, and description [19,20]. Although some work uses non-spatial representations [21] of fiber similarity and aggregation, neuroscientists consider them unhelpful for understanding the data. Anatomical connectivity can be simplified as a fiber density map between ROIs and regarded as a 3D spatial node-link map [22] represented by a combination matrix. So far, the synchronized visualization of structural and functional connectivity has received little attention. Existing visualizations focus on spatial representation rather than on supporting comparison tasks on abstract graphs. Some tools, such as ConnectomeViewer [12], FiberStars [23], and CerebroVis [24], support the visual analysis of brain connections using spatial 3D node links and matrix
representation and volume fiber fractal representation for anatomical connections. Although conveying spatial context is very important in brain connectivity analysis for many tasks, a 2D non-spatial representation with a flexible layout is better suited to conveying differences and changes in connectivity data clearly.
3 Proposed Technology
3.1 Motivation
The general goal of our system is to support the analysis of temporal patterns in brain connectivity networks by combining the correlations between areas over time. We assume that a few tasks of concern are performed in this kind of visualization: (1) Which voxels are highly correlated? (2) How does functional connectivity differ between two people? (3) What is the most distinctive feature of a specific disease? Based on these tasks, our method has the following goals.
(i) Extraction. For a single person, we observe the voxel correlations and display the relationships as groups.
(ii) Comparison. On the one hand, different individuals are selected to compare the differences between samples. On the other hand, comparison with samples from healthy people reveals the functional connections specific to a disease.
(iii) Interaction. The design should be easy to interact with. It should provide an interactive interface that helps users observe the visual result efficiently according to their needs. Objects can be dragged to a wider space to prevent overlap and occlusion. Selecting a specific object shows its details or its trend over time.

3.2 System Overview
Figure 1 shows the pipeline of our system, which contains three major components. The data component loads and pre-processes data on demand. The visualization component computes the position and size of each object. Because the relationships between objects are complex, we apply community detection to produce more readable visualizations. We then visualize all groups as a force-directed network, with focused elements organized by a treemap layout, and embed the temporal information into the links or objects. In the interaction component, users can set filters and select any elements they are interested in, and the related information is displayed automatically for further analysis.
Fig. 1. System overview. The data component deals with pre-processing and filters data according to user’s configuration. The Visualization module supports the temporal display technology. The interaction part provides user interactions to help them observing the result more efficiently.
Fig. 2. Pipeline of data processing and visualization
3.3 Pipeline of Data Processing
The brain connectivity data set is processed by calculating the functional correlation between brain functional areas from the anatomical data and signal data of the brain. The anatomical data we use is a brain atlas that includes anatomical labels (AAL) and coordinates. The signal data is the preprocessed fMRI image data. The data processing pipeline is shown in Fig. 2. Given data with N nodes (AAL regions), the processing mainly includes the following steps:
(1) Data and brain essence network acquisition. The fMRI data has been preprocessed, including realignment, registration, normalization, and smoothing. The preprocessed fMRI data is processed by the spatial group ICA algorithm to obtain a set of brain essence networks.
(2) Brain signal time series calculation. A sliding window is used to compute the mean signal of every brain region at each time step, which yields q sequential values.
(3) Generation of the brain function correlation matrices. To calculate the functional correlation between all brain regions, the sliding window method is applied to the time series data to compute a sequence of correlation coefficient matrices (Ms). Given the window size W and the sliding step size s, m correlation coefficient matrices are calculated, where m ∈ Z satisfies q − s < (m − 1) · s + W ≤ q; that is, windows are taken until the next one would run past the q values. In principle, the window size W must be an integral multiple of the step size s.
(4) Generation of the temporal matrix for visualization. From the sequence of correlation matrices obtained in step (3), a connected graph G of N nodes is extracted. The weight of a node is determined by the weights of all its connected edges. The link between each pair of nodes corresponds to a time series taken from Ms, and this time series determines the weight of the link. The nodes in G are filtered by the threshold Δ and finally visualized. A link En = (Va, Vb, S) carries the sequence S, whose value Si ∈ [−1, 1] is the functional correlation between brain nodes Va and Vb at time step i [25].

3.4 Visual Design
Based on this background, this paper applies an embedded dynamic network visualization method [25], which embeds the time series information into the node-link diagram and integrates a convenient interactive interface for statistical analysis of the temporal information without affecting the network topology. The temporal information is thus fully expressed, which greatly improves how the information is displayed.

Embedded Visualization. The representation of the time direction plays a key role in any visualization of temporal data, and many visual elements have been proposed for it, such as arrows, text, curves, and color.
For the application scenarios in this paper, the arc (curve) representation is simple and efficient, and it expresses direction clearly even when a lot of information is attached [26]. We use a curve (or arc) instead of a straight line, and the direction of the connection is indicated by the clockwise direction of the curve, as shown in Fig. 3. We combine a color wall with the directed curve to display the temporal feature. Each rectangle represents the value at one time point, and its color represents the magnitude of this value. Color is a popular visual element; we draw the rectangles with saturation from low to high to represent values from small to large.
Fig. 3. Visual design of displaying temporal feature from matrixes to nested graph.
The weights of nodes, the connecting edges, and the temporal information in temporal network data are transformed into elements of the visual interface by visual encoding. In the visual analysis system designed in this paper, the visual channels used include the radius of the circle, the fill color and transparency of the circle, the border of the circle, the color and width of the curve, the color and height of the rectangle, and the gap in the rectangle sequence.

Layout Optimization. The discovery and analysis of clusters (or community structure) in network data is a hot topic in social network analysis. We use related techniques to cluster brain connectivity data, which can help to identify brain functional areas. There are two kinds of clustering algorithms: agglomerative and divisive. An agglomerative algorithm works by adding connecting edges to the network step by step; correspondingly, a divisive algorithm works by removing connecting edges from the network step by step. The Girvan-Newman algorithm [27] is a classical divisive clustering algorithm. This paper adapts the algorithm to the characteristics of nested dynamic network visualization. The adjacency matrix used in grouping can be adjusted according to the needs of users.
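For orientation, one splitting step of the classical Girvan-Newman algorithm can be sketched in plain Python. This is the textbook version on an unweighted, undirected graph, not the adapted variant the paper refers to, whose details are not given; the function names are hypothetical, and it is assumed here that the brain graph has already been thresholded into an unweighted adjacency structure.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]


def components(graph: Graph) -> List[Set[int]]:
    """Connected components found by breadth-first search."""
    seen: Set[int] = set()
    parts: List[Set[int]] = []
    for start in graph:
        if start in seen:
            continue
        part, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in part:
                    part.add(w)
                    queue.append(w)
        seen |= part
        parts.append(part)
    return parts


def edge_betweenness(graph: Graph) -> Dict[Tuple[int, int], float]:
    """Shortest-path edge betweenness (Brandes-style accumulation)."""
    score: Dict[Tuple[int, int], float] = {}
    for s in graph:
        dist = {s: 0}
        sigma = {v: 0.0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1.0
        preds: Dict[int, List[int]] = {v: [] for v in graph}
        order: List[int] = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):         # accumulate from the farthest nodes
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = (min(v, w), max(v, w))
                score[edge] = score.get(edge, 0.0) + c
                delta[v] += c
    return score


def girvan_newman_step(graph: Graph) -> List[Set[int]]:
    """Remove highest-betweenness edges until one more component appears."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}
    target = len(components(g)) + 1
    while len(components(g)) < target:
        score = edge_betweenness(g)
        if not score:
            break                         # no edges left to remove
        a, b = max(score, key=score.get)
        g[a].discard(b)
        g[b].discard(a)
    return components(g)
```

Calling `girvan_newman_step` repeatedly on the resulting graph would yield a hierarchy of progressively finer groups; recomputing betweenness after every removal is what distinguishes the algorithm from a one-shot edge ranking.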
StarLace: Nested Visualization of Temporal Brain Connectivity Data
4 Result
The data set used in this experiment is ADHD-200, a data set on attention deficit hyperactivity disorder (ADHD) with control samples, published by the ADHD-200 Consortium.
Fig. 4. Parameter adjustment for sliding window. (1) Parameters: wstep and wsize mean the step of sliding window and size of sliding window respectively. (2) The result of a group in ADHD86 with wstep = 10 while wstep = 5, 10, 20 respectively. (3) The result of a group in ADHD86 with wsize = 10 while wsize = 5, 10, 20 respectively.
Fig. 5. Details of visualization. (1) and (2) are visualization of single sample. (3) and (4) are the result of comparison. (5) and (6) are ADHD and TDC visual results on the brain model respectively when selecting a comparison group.
M. Jing et al.
The ADHD-200 data set used here contains 187 samples, including 96 ADHD samples and 91 TDC samples. Based on link weight filtering, we obtain the dynamic graph data set G := (V, e). In the graph, the nodes represent brain regions r, and the link edges represent the functional correlation c between brain regions, where r ∈ [1, 116] and c ∈ [−1, 1].
Case 1: Single Sample. For single sample data, we focus on the positive correlation between brain regions, that is, which regions show the same changes in blood oxygen concentration. We select a sample, whose initial data volume is 116 × 116 × 172. The correlation matrix M is calculated with a sliding window. The window step and window size of the sliding window can be set in the interactive interface. Then, the brain regions are divided into groups by the grouping technique. Finally, the data with relatively low correlation are removed by setting a threshold. Figure 4 shows the meaning of these two parameters and the visualization effect of different parameter values on a group of areas in the same sample. We can observe that some features (in the red squares) disappear when the sliding window moves too fast. Figures 5(1) and (2) are the visualization results of two samples, ADHD10 and ADHD20. In order to reduce occlusion by labels, we use the region number from 1 to 116 in the AAL template instead of the region name.
Case 2: Comparison. For two samples, we focus on the differences between them, that is, in which regions the correlation is quite different. Compared to the single sample case, the correlation between two time series is calculated, rather than the correlation within a single time series. Since the brain activity of the two samples may be asynchronous, we use dynamic time warping (DTW [28]) to calculate the difference between them. Figures 5(3) and (4) show the comparison of ADHD10 and ADHD20 with TDC1, respectively. The most different group can be observed in the top-left corner. As Fig. 5(5) and (6) show, when a group is selected, the details are displayed on the brain model.
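The two cases above can be illustrated with a minimal sketch. It assumes that a sample is available as a NumPy array of shape (regions, time points), that the correlation is the Pearson coefficient, and that DTW is the textbook dynamic-programming variant with absolute difference as local cost; the function names `sliding_window_correlation`, `threshold_links` and `dtw_distance` are hypothetical and are not taken from the StarLace system.

```python
import numpy as np

def sliding_window_correlation(signals, wsize, wstep):
    """Pearson correlation matrix for every sliding window.

    signals: array of shape (regions, timepoints).
    Returns an array of shape (windows, regions, regions).
    """
    n_time = signals.shape[1]
    starts = range(0, n_time - wsize + 1, wstep)
    return np.stack([np.corrcoef(signals[:, s:s + wsize]) for s in starts])

def threshold_links(corr, tau):
    """Keep links with correlation >= tau; zero out the rest and the diagonal."""
    kept = np.where(corr >= tau, corr, 0.0)
    idx = np.arange(corr.shape[-1])
    kept[..., idx, idx] = 0.0  # self-correlations are not links
    return kept

def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost
            cost[i, j] = d + min(cost[i - 1, j],      # step in a only
                                 cost[i, j - 1],      # step in b only
                                 cost[i - 1, j - 1])  # step in both
    return float(cost[n, m])
```

With `wsize` and `wstep` playing the role of the two parameters in Fig. 4, a larger `wstep` yields fewer windows, which is one way to see why short-lived features can be missed when the window moves too fast.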
5 Expert Evaluation
In order to evaluate the method, we invited 3 neuroscientists facing the problem of dynamic network visualization to use our system. We had face-to-face communication with the experts and recorded their use process. We want to focus on the overall usefulness of the method through open assessment using real data. We focus on the following issues: 1) Are the visual elements we use easy to understand? 2) Are the interactions we use useful and easy to operate? 3) Can our method help experts better understand the data?
Fig. 6. Quantitative evaluation from experts.
All experts were familiar with the background of the comparative data of ADHD children that we provided. We briefly introduced the basic operation interface and explained the points that might cause confusion. Then the data of a normal child and an ADHD child were loaded with the following initial visualization settings: the time range is the whole time span, the correlation threshold is τ ≥ 0.8, and the link frequency is f ≥ 11. The experts showed great interest in the visualization results and carefully observed the whole interface. They then used the mouse to drag and enlarge the points in a certain area, further observed the differences between the two samples, and confirmed the meaning of the different colors with us. Next, the experts adjusted the correlation threshold to change the number of elements on the screen. They then selected other data for visualization and used other interaction methods. Finally, the 3 experts scored the three questions above, as shown in Fig. 6. We also received the following feedback: 1) Lines and squares are easy to understand. The interface is beautiful and the graphic design is easy to understand. 2) It is the first time they have seen this kind of layout. It is very intuitive and efficient. 3) Grouping is very useful. Whichever way of grouping is used, the nodes with high correlation are placed in one group, which makes it more convenient to distinguish meaningful differences. 4) The interaction design is targeted, which is very helpful for viewing the results. 5) The time direction is not intuitive enough; sometimes it takes extra effort to identify.
6 Conclusion
In this paper, we proposed a novel nested visualization method to display the temporal feature of brain connectivity data. We presented an overview of the method, the temporal feature extraction and drawing algorithm, and the layout optimization method. Applied to real brain connectivity data, the proposed method showed some advantages in analyzing temporal characteristics. On the other hand, since a 2D graph offers limited visual space, the method falls short when dealing with big or complex data, although filtering can reduce the burden of visualization. In the future, we will introduce other optimization algorithms to improve the performance.
Acknowledgment. This work was supported by the National Natural Science Foundation of China (61902202), International Cooperation Foundation of Qilu University of Technology (45040118) and National Key Research and Development Program of China (2019YFB2102600).
References
1. Cao, C., Slobounov, S.: Alteration of cortical functional connectivity as a result of traumatic brain injury revealed by graph theory, ICA, and sLORETA analyses of EEG signals. IEEE Trans. Neural Syst. Rehabil. Eng. 18(1), 11–19 (2009)
2. Yeom, H.G., Kim, J.S., Chung, C.K.: User-state prediction using brain connectivity. In: 2019 19th International Conference on Control, Automation and Systems (ICCAS) 2019, pp. 1096–1097 (2019)
3. Dosenbach, N.U., et al.: Prediction of individual brain maturity using fMRI. Science 329(5997), 1358–1361 (2010)
4. Ji, J., Chen, Z., Yang, C.: Convolutional neural network with sparse strategies to classify dynamic functional connectivity. IEEE J. Biomed. Health Inf. 1 2021
5. Qi, S., Meesters, S., Nicolay, K., ter Haar Romeny, B.M., Ossenblok, P.: The influence of construction methodology on structural brain network measures: a review. J. Neurosci. Meth. 253, 170–182 (2015)
6. Yang, X., Shi, L., Daianu, M., Tong, H., Liu, Q., Thompson, P.: Blockwise human brain network visual comparison using NodeTrix representation. IEEE Trans. Vis. Comput. Graph. 23(1), 181–190 (2017)
7. Wang, Y., Chen, X., Liu, B., Liu, W., Shiffrin, R.M.: Understanding the relationship between human brain structure and function by predicting the structural connectivity from functional connectivity. IEEE Access 8, 209926–209938 (2020)
8. Böttger, J., Schäfer, A., Lohmann, G., Villringer, A., Margulies, D.S.: Three-dimensional mean-shift edge bundling for the visualization of functional connectivity in the brain. IEEE Trans. Vis. Comput. Graph. 20(3), 471–480 (2014)
9. Chen, W., Shi, L., Chen, W.: A survey of macroscopic brain network visualization technology. Chin. J. Electron. 27(5), 889–899 (2018)
10. Alper, B., Bach, B., Henry Riche, N., Isenberg, T., Fekete, J.-D.: Weighted graph comparison techniques for brain connectivity analysis. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems 2013, pp. 483–492 (2013)
11. Achard, S., Salvador, R., Whitcher, B., Suckling, J., Bullmore, E.: A resilient, low-frequency, small-world human brain functional network with highly connected association cortical hubs. J. Neurosci. 26(1), 63–72 (2006)
12. Gerhard, S., Daducci, A., Lemkaddem, A., Meuli, R., Thiran, J.-P., Hagmann, P.: The connectome viewer toolkit: an open source framework to manage, analyze, and visualize connectomes. Front. Neuroinform. 5, 3 (2011)
13. Sanz-Arigita, E.J., et al.: Loss of small-world networks in Alzheimer's disease: graph analysis of fMRI resting-state functional connectivity. PloS one 5(11), e13788 (2010)
14. Bassett, D.S., Brown, J.A., Deshpande, V., Carlson, J.M., Grafton, S.T.: Conserved and variable architecture of human white matter connectivity. Neuroimage 54(2), 1262–1279 (2011)
15. Kramer, M.A., Eden, U.T., Lepage, K.Q., Kolaczyk, E.D., Bianchi, M.T., Cash, S.S.: Emergence of persistent networks in long-term intracranial EEG recordings. J. Neurosci. 31(44), 15757–15767 (2011)
16. Ginestet, C.E., Simmons, A.: Statistical parametric network analysis of functional connectivity dynamics during a working memory task. Neuroimage 55(2), 688–704 (2011)
17. Thomason, M.E., et al.: Resting-state fMRI can reliably map neural networks in children. Neuroimage 55(1), 165–175 (2011)
18. Stein, K., Wegener, R., Schlieder, C.: Pixel-oriented visualization of change in social networks. In: 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2010), pp. 233–240. IEEE (2010)
19. Moberts, B., Vilanova, A., van Wijk, J.J.: Evaluation of fiber clustering methods for diffusion tensor imaging. In: VIS 05. IEEE Visualization 2005, pp. 65–72 (2005)
20. Sherbondy, A., Akers, D., Mackenzie, R., Dougherty, R., Wandell, B.: Exploring connectivity of the brain's white matter with dynamic queries. IEEE Trans. Vis. Comput. Graph. 11(4), 419–430 (2005)
21. Jianu, R., Demiralp, C., Laidlaw, D.: Exploring 3D DTI fiber tracts with linked 2D representations. IEEE Trans. Vis. Comput. Graph. 15(6), 1449–1456 (2009)
22. Hagmann, P., et al.: Mapping the structural core of human cerebral cortex. PLoS Biol. 6(7), e159 (2008)
23. Franke, L., et al.: Fiberstars: visual comparison of diffusion tractography data between multiple subjects. In: 2021 IEEE 14th Pacific Visualization Symposium (PacificVis), pp. 116–125 (2021)
24. Pandey, A., et al.: CerebroVis: designing an abstract yet spatially contextualized cerebral artery network visualization. IEEE Trans. Vis. Comput. Graph. 26(1), 938–948 (2020)
25. Jing, M., Li, X., Zhang, L.: Interactive temporal display through collaboration networks visualization. Inf. Vis. 18(2), 268–280 (2019)
26. Holten, D., Isenberg, P., van Wijk, J.J., Fekete, J.-D.: An extended evaluation of the readability of tapered, animated, and textured directed-edge representations in node-link graphs. In: Proceedings of the 2011 IEEE Pacific Visualization Symposium (PACIFICVIS 2011), pp. 195–202. IEEE Computer Society, Washington, D.C., USA (2011)
27. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
28. Wang, W., Lyu, G., Shi, Y., Liang, X.: Time series clustering based on dynamic time warping. In: 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), pp. 487–490 (2018)
Batch Weighted Nuclear-Norm Minimization for Medical Image Sequence Segmentation
Kele Xu1, Zijian Gao1(B), Jilong Wang2, Yang Wen3, Ming Feng2, Changjian Wang1, and Yin Wang2
1 National University of Defense Technology, Changsha, China [email protected]
2 Tongji University, Shanghai, China
3 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
Abstract. Recently, convolutional neural networks have shown superior performance for biomedical image sequence segmentation. Most of the current state-of-the-art segmentation methods are designed for segmentation in static frames. However, misleading or missing features in static images may severely degrade segmentation performance. Effective incorporation of shape priors can alleviate this issue, but this has been under-explored for deep models in previous attempts. In this paper, we explore the use of a continuity-based prior, either temporal or spatial, with the aim of improving the robustness of medical image sequence segmentation. Specifically, we first propose a nuclear norm minimization (NNM) based regularizer. However, the nuclear norm tends to over-shrink the rank components, and all singular values are regularized equally. The singular values should be treated differently, as neighboring frames are more similar than distant frames. To rectify this weakness of NNM, we further propose to utilize weighted nuclear norm minimization (WNNM), which achieves a better matrix rank approximation than NNM and avoids over-regularization. To empirically investigate the effectiveness and robustness of the proposed sequential segmentation approach, we have performed extensive experiments on different imaging modalities. The results demonstrate that our method provides better robustness against missing features.
Keywords: Biomedical image sequence segmentation · Low-rank modeling · Weighted nuclear-norm minimization
1 Introduction
Motivated by numerous practical applications, image segmentation has witnessed dramatic progress in the past decades [1]. As a subfield of image segmentation, biomedical image segmentation has many potential applications in various scenarios [7,11,14], as it has a great impact on further quantitative clinical analysis, including measurement of disease progression, treatment planning and clinical monitoring. It is desirable to design a robust medical image segmentation approach with high accuracy and confidence. Recently, deep learning-based methods, especially deep convolutional neural networks, have become the methodology of choice for medical image segmentation [9]. Most of the current state-of-the-art medical image segmentation models are designed for the segmentation of a single static frame without considering contextual information, which severely limits their application in clinical settings. In addition, the segmentation performance of deep models can be corrupted by missing or misleading features, which occur frequently during medical imaging. For example, segmentation of the tongue surface in B-mode ultrasound images remains far from resolved due to faint or missing contours when the ultrasound waves are close to parallel to the tongue surface (which can cause discontinuities) [13]. Other imaging artifacts, such as noise contamination, occlusion and signal dropout, further degrade segmentation performance. A similar phenomenon can be found in the segmentation of the left ventricle in echocardiography images [15]. Many medical image segmentation methods have been proposed [12] that focus on learning a shape prior from a large annotated dataset, but such a dataset is not easily available in practice. Moreover, it is not known whether the learned shape prior generalizes to new images. How to incorporate different prior information into deep models has not been fully explored in previous studies.
K. Xu and Z. Gao contributed equally to this work.
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 360–371, 2021. https://doi.org/10.1007/978-3-030-91415-8_31
To improve the segmentation performance, a straightforward and intuitive approach is to leverage continuity to constrain the potential shape space, since the shapes are similar between adjacent frames. In this paper, we propose continuity as a prior for segmentation, with the goal of taking sequential context information into account. Specifically, many medical image sequences acquired by static medical imaging devices have obvious low-rank properties, and background modeling can be conducted by low-rank matrix approximation (LRMA). Here, we first propose a novel loss function based on nuclear norm minimization (NNM), which is easy to implement and can be applied to existing segmentation architectures without modification. However, NNM tends to over-shrink the rank components, and all singular values are regularized equally. In practical settings, the singular values should be treated differently because they have different meanings. To rectify this weakness of NNM and avoid over-regularization, we further propose to employ weighted nuclear norm minimization (WNNM), which can achieve a better matrix rank approximation than NNM and provide frame-specific segmentation. In brief, our contributions can be summarized as follows:
– We first show that the output of a deep learning-based segmentation model will form a low-rank matrix. Based on the low-rank property, we explore the use of the weighted nuclear norm to regularize shape continuity in medical image segmentation tasks.
– The weighted nuclear norm is represented as a new loss function that can be easily integrated into segmentation models to characterize continuity (either temporal or spatial). The proposed loss function is easy to implement and can be applied to existing segmentation architectures without any modification.
– In an extensive empirical evaluation on different datasets (ranging from 2D serial data to 3D datasets), we demonstrate that our regularizer can provide stable segmentation results across different medical imaging modalities.
2 Methodology
2.1 Background and Notations
In the field of medical image analysis, segmentation is an indispensable task with many practical application scenarios. As mentioned above, CNNs can provide better segmentation performance than traditional approaches. Following this line, in this work we address the segmentation problem with deep CNNs, using a standard U-Net architecture. We assume that $D = \{(x_i, y_i)\}_{i=1}^{n}$ is a labeled dataset containing n labeled samples. Here, $x_i$ is an input image and $y_i$ is the corresponding labeled output. Our goal is to learn a deep neural network f, which is a function used to predict segmentation results. It is worth noting that, in general, the labeled data are fed into the segmentation model in random order during the training phase.
2.2 Loss Function
CNN-based segmentation models can be divided into two main classes: pixel-based and image-based methods. The loss function has a significant impact on the performance of a deep CNN. Several loss functions have been used for the segmentation task, such as the cross-entropy loss, the Dice coefficient loss and the recently proposed active contour loss. We discuss these mainstream loss functions below, before delving further into our method.
Cross-Entropy Loss: For the pixel-based methods, the segmentation model aims to classify each pixel into different objects, thus converting the segmentation problem into a classification problem. The binary cross-entropy (BCE) loss is a widely used pixel-wise loss function for segmentation models, which can be written as:
$$\mathrm{Loss}_{BCE}(T, P) = -\frac{1}{m} \sum_{i=1}^{m} \left[ T_i \log(P_i) + (1 - T_i) \log(1 - P_i) \right], \tag{1}$$
where T is the ground truth (or expert annotation, anatomical priors), P is the segmentation result, m is the number of pixels and i indexes the pixels. Sustained efforts have been made to improve the BCE-based loss function, while few attempts have been made to exploit the geometric detail of objects (such as the temporal/spatial continuity of the object).
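As an illustration of Eq. (1), the following is a minimal NumPy sketch. The function name `bce_loss` and the clipping constant `eps` (added only to avoid `log(0)`) are assumptions of this sketch and not part of the original formulation.

```python
import numpy as np

def bce_loss(T, P, eps=1e-7):
    """Binary cross-entropy of Eq. (1), averaged over all m pixels.

    T: ground-truth mask with values in {0, 1}.
    P: predicted foreground probabilities in [0, 1], same shape as T.
    eps: clipping constant (an assumption of this sketch) to keep log finite.
    """
    T = np.asarray(T, dtype=float).ravel()
    P = np.clip(np.asarray(P, dtype=float).ravel(), eps, 1.0 - eps)
    return float(-np.mean(T * np.log(P) + (1.0 - T) * np.log(1.0 - P)))
```

Each pixel contributes independently to this loss, which is why it carries no information about the shape or continuity of the segmented object.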
Dice Score Coefficient Loss: The Dice coefficient (DC) is widely used in U-Net like models and provides a simple way to measure the overlap ratio between the segmentation map and the ground truth. U-Net belongs to the image-based segmentation methods, which take an image as input and output the segmentation of that image (in general, the input and output images are of the same size). Compared to pixel-wise approaches, U-Net like models can provide better performance while retaining simplicity. Specifically, given two sets of pixels T and P, the DC can be written as:
$$DC(T, P) = 2 \cdot \frac{\sum_{i=1}^{m} (T_i \times P_i)}{\sum_{i=1}^{m} (T_i + P_i)}, \tag{2}$$
where T is the ground truth and P is the segmentation result. However, the DC loss function ignores the region outside the target. In this case, small segmented objects may appear around the boundaries.
Active Contour Loss: Neither the CE nor the DC loss function takes geometrical information into consideration. A new loss function, named the deep active contour (AC) loss, was proposed in [4]. This loss function not only considers the geometrical information of the region, but also aims to preserve the shape information. The AC loss can be formulated as:
$$\mathrm{Loss}_{AC} = \mathrm{Length} + \lambda \cdot \mathrm{Region}, \tag{3}$$
where
$$\mathrm{Length} = \int_{C} |\nabla u| \, ds, \tag{4}$$
$$\mathrm{Region} = \int_{\Omega} \left( (c_1 - v)^2 - (c_2 - v)^2 \right) u \, dx, \tag{5}$$
where $c_1$ and $c_2$ are the mean values outside and inside the image to be segmented, respectively, v is the ground truth and u is the predicted value. More details can be found in [4]. In the following, we employ the aforementioned three loss functions as baselines for the segmentation task.
2.3 Measuring Similarity with Matrix Rank
Most previous research focuses on improving the segmentation performance on a single frame, while the problem of sequential segmentation has been under-explored. In practical medical imaging, images are often presented as a sequence, in either temporal or spatial order. A straightforward way to improve sequential segmentation performance is to incorporate continuity-based constraints into the model, as the deformation or motion of human organs is physical in nature. Take the 3D segmentation task as an example, where adjacent slices are presented in spatial order. The shapes in adjacent slices are similar and thus form a low-dimensional subspace, and such shape prior information can be used to guide the segmentation.
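The intuition that similar shapes in adjacent frames span a low-dimensional subspace can be checked numerically. The sketch below is only an illustration on synthetic masks: the helpers `stack_frames`, `disk_mask` and `leading_energy`, the disk shapes and the random comparison data are all assumptions, not data or code from the paper.

```python
import numpy as np

def stack_frames(frames):
    """Flatten each H x W frame and stack the results as columns of an (H*W) x K matrix."""
    return np.stack([np.asarray(f, dtype=float).ravel() for f in frames], axis=1)

def disk_mask(size, cx, cy, r):
    """Binary mask of a filled disk, used here as a stand-in for an organ shape."""
    yy, xx = np.mgrid[:size, :size]
    return ((xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2).astype(float)

def leading_energy(M):
    """Fraction of the squared singular-value energy captured by the largest singular value."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

K = 6
# Identical frames: every column is the same, so the matrix has rank 1.
identical = stack_frames([disk_mask(32, 16, 16, 8) for _ in range(K)])
# Slowly moving shape: adjacent frames overlap strongly, so the matrix is close to low-rank.
shifted = stack_frames([disk_mask(32, 14 + k, 16, 8) for k in range(K)])
# Unrelated frames: Gaussian noise has no shared structure and is full rank.
rng = np.random.default_rng(0)
unrelated = stack_frames(rng.standard_normal((K, 32, 32)))

print(np.linalg.matrix_rank(identical), leading_energy(shifted), leading_energy(unrelated))
```

In this toy setting, most of the energy of the slowly moving sequence sits in a single singular value, whereas the unrelated frames spread their energy roughly evenly over all singular values.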
For the training of a deep segmentation model, we suppose that the outputs on a batch are in temporal or spatial order. We denote the output for the k-th frame as $p_k \in \mathbb{R}^{H \times W}$ (H is the height and W is the width of the frame). As mentioned above, many medical image sequences have a clear low-rank property, for example ultrasound tongue image sequences and left ventricle echocardiography image sequences. As the sequential similarity is available, we can add a regularizer to enforce similarity between the segmentation results of adjacent frames. To measure the similarity, different distance functions can be utilized, including the $\ell_1$ and $\ell_2$ distances. In this paper, as multiple adjacent frames are available, we employ $\mathrm{rank}(\mathbf{P})$ as the measure of similarity, where $\mathbf{P} = [P_1, P_2, ..., P_K]$ and $P_k$ is the flattened version of $p_k$. A lower rank corresponds to higher prediction similarity, and minimizing $\mathrm{rank}(\mathbf{P})$ enforces that the predictions on the batch samples are as similar as possible. In practical settings, $\mathbf{P}$ will hardly be perfectly low-rank due to noise contamination and other imaging artifacts. Thus, we assume that the following model is more faithful to the real situation:
$$\mathbf{P} = \mathbf{X} + \mathbf{E}, \tag{6}$$
where $\mathbf{X}$ is the low-rank component and $\mathbf{E}$ is the noise or error induced by the incomplete boundaries and misleading features within the sequence. For the sequential segmentation task, it is crucial to recover the low-rank structure. To find the low-rank approximation, we aim to solve the following optimization problem:
$$\arg\min \|\mathbf{P} - \mathbf{X}\|_F^2 + \lambda \cdot \mathrm{rank}(\mathbf{P}), \tag{7}$$
where $\|\cdot\|_F$ denotes the Frobenius norm and λ is a positive constant. Essentially, the rank of the output matrix describes the degrees of freedom of the predicted output. By enforcing the low-rank property of the outputs, the deep segmentation model can handle global changes of objects, including translation, rotation, scaling and principal deformation. At the same time, the deep model can also handle local variation caused by image defects (e.g., noise or occlusion during medical imaging). However, the matrix rank is a discrete operator, which is difficult to optimize. Moreover, the rank is too rigid as a regularizer. Here, inspired by the approach proposed in [15], we replace the rank with the nuclear norm of the matrix, which is the sum of its singular values: $\|\mathbf{X}\|_* = \sum_i |\sigma_i(\mathbf{X})|_1$, where $\sigma_i(\mathbf{X})$ denotes the i-th singular value of $\mathbf{X}$. Then, we can reformulate Eq. (7) as:
$$\arg\min \|\mathbf{P} - \mathbf{X}\|_F^2 + \lambda \|\mathbf{X}\|_*. \tag{8}$$
2.4 Weighted Nuclear Norm Minimization
NNM penalizes the singular values of $\mathbf{X}$ equally, thus each frame is assigned equal weight in the regularization process. This is not very reasonable in practical settings, because different singular values may have different importance. The segmentation results of adjacent frames should be more similar than those of distant frames. To treat each frame differently, we explore the use of weighted nuclear norm minimization (WNNM) to regularize $\mathbf{X}$, and we formulate the problem as follows:
$$\arg\min \|\mathbf{P} - \mathbf{X}\|_F^2 + \lambda \|\mathbf{X}\|_{w,*}, \tag{9}$$
where $\|\mathbf{X}\|_{w,*} = \sum_i |w_i \, \sigma_i(\mathbf{X})|_1$ and $w_i$ is the weight assigned to the i-th singular value. Here, we assume that the weights w follow a Gaussian distribution, with higher weights for adjacent frames and lower weights for distant ones. We employ a deep neural network for the training. Unlike previous studies, for each batch we feed adjacent frames (or adjacent slices for 3D segmentation) for training and inference. It is worth noting that, within the batch, the relative order of adjacent frames must not be changed, although such an operation would not change the low-rank property.
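The two regularizers of Eqs. (8) and (9) can be evaluated with a singular value decomposition. The following NumPy sketch is illustrative only: the function names are hypothetical, and `gaussian_weights` is just one possible reading of "weights following a Gaussian distribution" (a Gaussian-shaped profile over the singular-value index with an assumed width `sigma`), since the exact weighting scheme is not spelled out in the text.

```python
import numpy as np

def nuclear_norm(X):
    """Nuclear norm: the sum of the singular values of X."""
    return float(np.linalg.svd(X, compute_uv=False).sum())

def weighted_nuclear_norm(X, w):
    """Weighted nuclear norm: sum_i |w_i * sigma_i(X)|.

    The singular values returned by NumPy are sorted in descending order,
    so w[0] is applied to the largest singular value.
    """
    s = np.linalg.svd(X, compute_uv=False)
    w = np.asarray(w, dtype=float)
    if w.shape != s.shape:
        raise ValueError("need exactly one weight per singular value")
    return float(np.sum(np.abs(w * s)))

def gaussian_weights(k, sigma=1.0):
    """Hypothetical Gaussian-shaped weight profile w_i = exp(-i^2 / (2 sigma^2)), i = 0..k-1."""
    i = np.arange(k)
    return np.exp(-(i ** 2) / (2.0 * sigma ** 2))
```

For example, for `X = np.diag([3.0, 2.0, 1.0])` the nuclear norm is 6, and with all weights equal to one the weighted version reduces to the same value, which shows that NNM is the special case of WNNM with uniform weights.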
2.5 Sequential Segmentation with WNNM-Based Regularizer
For the sequential segmentation task, the total loss function is formulated as:
$$\sum_{i=1}^{m} \mathrm{Loss} + \lambda \|\mathbf{X}\|_{w,*}, \tag{10}$$
where m is the number of samples and Loss can be one of the widely used loss functions for the segmentation task, including the CE, DC and AC losses [4]. $\|\mathbf{X}\|_{w,*}$ is the weighted nuclear norm of the segmentation results for each batch. As can be seen from Eq. (10), we explore replacing the rank with the weighted nuclear norm $\|\mathbf{X}\|_{w,*}$. In fact, WNNM is widely used in low-rank modeling, including matrix completion and robust principal component analysis. It has three advantages in our setting. First, the WNNM term can be easily incorporated into the deep segmentation model, and a standard stochastic gradient descent optimizer can be used to train the model. Second, small noise contamination in the shape may dramatically increase $\mathrm{rank}(\mathbf{X})$, while $\|\mathbf{X}\|_{w,*}$ changes little in this case, which limits the effect of noise. Third, WNNM achieves a better matrix rank approximation than NNM and avoids over-regularization.
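To make Eq. (10) concrete, the sketch below evaluates the total loss for one batch of ordered frames. It makes several assumptions that are not stated in the text: the Dice loss is taken as 1 − DC with DC as in Eq. (2), the regularizer is computed on the matrix of flattened batch predictions, and the function names, `eps` and the default `lam` are hypothetical. It only computes the loss value in NumPy; training as described in the paper would additionally require an automatic differentiation framework.

```python
import numpy as np

def dice_coefficient(T, P, eps=1e-7):
    """Dice coefficient of Eq. (2); eps (an assumption) avoids division by zero."""
    T = np.asarray(T, dtype=float).ravel()
    P = np.asarray(P, dtype=float).ravel()
    return float(2.0 * np.sum(T * P) / (np.sum(T + P) + eps))

def total_loss(targets, preds, w, lam=0.1):
    """Sum of per-frame Dice losses plus lam times the weighted nuclear norm.

    targets, preds: arrays of shape (K, H, W) holding K frames in sequence order.
    w: one weight per singular value, i.e. min(H*W, K) entries.
    """
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    # Segmentation term: Dice loss (1 - DC) summed over the frames of the batch.
    seg = sum(1.0 - dice_coefficient(t, p) for t, p in zip(targets, preds))
    # Regularizer: flatten each frame into a column of an (H*W) x K matrix.
    batch = preds.reshape(len(preds), -1).T
    s = np.linalg.svd(batch, compute_uv=False)
    w = np.asarray(w, dtype=float)
    if w.shape != s.shape:
        raise ValueError("need exactly one weight per singular value")
    reg = float(np.sum(np.abs(w * s)))
    return float(seg + lam * reg)
```

Setting `lam=0` recovers the plain segmentation loss, so the regularizer can be switched on or off without touching the network architecture, in line with the claim that no architectural modification is needed.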
3 Experimental Results
In this section, we evaluate the proposed method on both synthetic and real datasets. To demonstrate the advantages of the proposed regularizer, we compare the results of the same U-Net architecture using the aforementioned loss functions, before and after applying the proposed constraint. The CE, DC and AC losses are used as baselines, with ResNet34 as the backbone architecture of the U-Net. We choose the hyper-parameter λ empirically and use the same set of values for all experiments. The effect of λ on the segmentation performance will be further elaborated in the following part. To reduce overfitting, we apply widely used data augmentation methods, such as horizontal and vertical flipping, rotation by a random angle and Gaussian blur with a certain probability. We use the Adam optimizer to update the weights of the network with an initial learning rate of $5 \times 10^{-4}$.
Fig. 1. Example using the toy data. Top two rows: the results using the BCE loss with/without the proposed regularizer. Rows 3–4: the results using the DC loss with/without the proposed regularizer. Bottom two rows: the results using the AC loss with/without the proposed regularizer.
3.1 Synthetic Dataset
To clearly illustrate our approach, we synthesized a sequence of images. As can be seen in Fig. 1, the dark region in the heart shape is the object to be segmented. Different situations are synthesized, including occlusion or deletion, and Gaussian noise contamination. Here, we aim to segment the heart shapes in all images based on the prior that the shapes are similar to each other in these images. As shown in Figure 1, the results using the proposed regularizer provide robust segmentation results for local defects in the image compared to the results without such a regularizer.
3.2 Ultrasound Tongue Contour Extraction
In this paper, we explored tongue surface extraction from real B-mode ultrasound tongue image sequences. In the field of clinical linguistics and phonetics, tongue contour extraction is a prerequisite for analyzing ultrasound images. However, manual extraction is extremely time-consuming and error-prone. Consequently, extracting tongue contours from ultrasound images remains a non-trivial task. In fact, tongue shape modeling is of great interest for studies of speech production, and accurate modeling can be helpful for the treatment of speech pathology, language learning and rehabilitation, and silent speech recognition [5]. In our experiments, we explored different loss functions with and without the proposed regularizer and report their performance. The predicted curve and the human annotation can be compared without point-wise pre-alignment using the mean sum of distances (MSD) as the evaluation measure [8]. As can be seen from Table 1, better performance is obtained with the proposed regularizer. Moreover, the smaller the standard deviation of the metric, the more stable the performance. As shown in Table 1, the standard deviation without the constraint is clearly higher than that with the proposed regularizer, which shows the importance of the proposed regularizer for improving the robustness of deep models. Samples of the segmentation results are given in Fig. 2. As can be seen from the figure, contours in the ultrasound image sequence are faint or missing in places, and traditional loss functions are not able to address this issue completely. Notably, better performance is obtained with the proposed nuclear norm-based regularizer. Moreover, compared with the other combinations, the combination of the DC loss and the proposed regularizer achieves the best performance [6].
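For readers unfamiliar with MSD, the sketch below shows one commonly used symmetric definition from the tongue-contour literature: the distance from each point on one curve to its nearest point on the other curve, summed in both directions and divided by the total number of points. This exact form is an assumption here, since the text only cites [8]; the function name `mean_sum_of_distances` is hypothetical.

```python
import numpy as np

def mean_sum_of_distances(U, V):
    """Symmetric nearest-point distance between two curves.

    U: array of shape (n, 2) with the points of the first contour.
    V: array of shape (m, 2) with the points of the second contour.
    The curves may have different numbers of points and need no alignment.
    """
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    # Pairwise Euclidean distances, shape (n, m).
    d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=2)
    # Nearest neighbour in V for each point of U, and vice versa.
    return float((d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(U) + len(V)))
```

Because each point is matched to its nearest neighbour on the other curve, no point-wise correspondence between prediction and annotation has to be established beforehand.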
Fig. 2. Sample predictions for the ultrasound tongue image sequences. Upper panels are obtained using the U-Net with DC loss; lower panels are obtained using the U-Net with DC loss and the regularizer. The green lines are the predicted results. (Color figure online)
3.3
LiTS 2017 CT Dataset
In the previous experiments, we demonstrated the advantages of our method for the 2D medical image sequence segmentation task. In practice, our method can be naturally extended to 3D segmentation tasks. For the 3D segmentation task, we assume that the segmentation results in adjacent slices have similar shapes, and therefore our method can be directly applied to the 3D segmentation task without modification. In this part, we conduct experiments on the MICCAI 2017 LiTS challenge dataset [3], which consists of 131 and 70 contrast-enhanced 3D abdominal CT scans for training and testing, respectively. The segmentation is highly challenging due to various misleading features in the images. The segmentation results with and without the proposed regularizer are given in Table 2. The experiments show that the algorithm with the proposed regularizer provides better performance than that without the regularizer. We speculate that the main reason behind this phenomenon is that the added regularizer leads to better regularization of the deep model.

Table 1. Quantitative evaluation of the segmentation results with/without the proposed regularizer on ultrasound tongue image data. The values in each row represent the mean ± standard deviation calculated over all frames in each sequence. Lower MSD indicates better performance.

Loss        Mean sum of distance
BCE         3.243 ± 0.366
DC          2.922 ± 0.122
AC          3.170 ± 0.165
BCE+WNNM    3.220 ± 0.126
DC+WNNM     2.890 ± 0.025
AC+WNNM     3.125 ± 0.002

Table 2. Quantitative evaluation of the segmentation results with/without the proposed regularizer on LiTS 2017 CT dataset. The values in each row represent the mean ± standard deviation calculated over all frames in each sequence. As the model with AC loss cannot converge, only the results of BCE and DC are presented. Lower Hausdorff distance indicates better performance.

Loss        Hausdorff dist.
BCE         9.285 ± 1.380
DC          8.274 ± 0.509
BCE+NNM     7.623 ± 0.970
DC+NNM      6.657 ± 0.297
BCE+WNNM    7.043 ± 0.832
DC+WNNM     5.922 ± 0.382
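Table 2 reports the Hausdorff distance between predicted and reference segmentations. The following is a minimal sketch of the symmetric Hausdorff distance between two point sets, assuming each segmentation boundary is available as a list of pixel coordinates. The function names and the toy contours are illustrative only; the evaluation code actually used for the table is not given in the text.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worse of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy boundary points (hypothetical, not taken from any dataset).
pred = [(0, 0), (0, 1), (1, 1)]
truth = [(0, 0), (0, 1), (3, 1)]
hd = hausdorff(pred, truth)
print(hd)
```

Because the measure is driven by the single worst-matched point, one outlying boundary pixel is enough to raise it, which is why it complements overlap-based scores.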
Table 3. Quantitative evaluation of the segmentation results with/without the proposed regularizer on BraTS 2019 validation dataset. The means of the Dice coefficient and Hausdorff distance are reported in this table. Here, ET denotes enhancing tumor, WT denotes whole tumor and TC denotes tumor core. Higher Dice coefficient and lower Hausdorff distance are desirable.

            Dice coefficient          Hausdorff distance
Loss        ET      WT      TC        ET      WT      TC
BCE         0.722   0.784   0.891     6.434   7.825   7.973
DC          0.736   0.802   0.872     6.584   8.048   8.656
BCE+NNM     0.741   0.828   0.884     5.995   7.496   7.543
DC+NNM      0.750   0.824   0.902     5.891   7.516   7.570
BCE+WNNM    0.765   0.836   0.901     5.468   7.324   7.301
DC+WNNM     0.763   0.834   0.905     5.463   7.154   7.167

3.4 BraTS 2019 Dataset
We also conducted a further experiment on a 3D brain tumor segmentation dataset (the BraTS 2019 dataset [2,10]), which consists of MRI volumes with different modalities (T1, T1c, T2). The goal of the segmentation is to identify the whole tumor (WT), tumor core (TC) and enhancing tumor (ET) within the images. We followed the official preprocessing operations of BraTS 2019: all MRI volumes have been co-registered to the same anatomical template, interpolated to the same resolution, and skull-stripped. All samples were normalized to zero mean and unit variance. We set the training batch size to 32 with a patch size of 128 × 128. The Dice score, sensitivity, specificity and Hausdorff distance (95%) are used as the evaluation metrics. The experimental results are shown in Tables 3 and 4. Similar to the previous experiments, better performance is obtained with the proposed regularizer.

Table 4. Quantitative evaluation of the segmentation results with/without the proposed regularizer on BraTS 2019 validation dataset. The means of the Dice coefficient and Hausdorff distance are reported in this table. Here, ET denotes enhancing tumor, WT denotes whole tumor and TC denotes tumor core. Higher Dice coefficient and lower Hausdorff distance are desirable.

            Dice coefficient          Hausdorff distance
Loss        ET      WT      TC        ET      WT      TC
BCE         0.722   0.784   0.891     6.434   7.825   7.973
DC          0.736   0.802   0.872     6.584   8.048   8.656
BCE+NNM     0.741   0.828   0.884     5.995   7.496   7.543
DC+NNM      0.750   0.824   0.902     5.891   7.516   7.570
BCE+WNNM    0.765   0.836   0.901     5.468   7.324   7.301
DC+WNNM     0.763   0.834   0.905     5.463   7.154   7.167
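The Dice coefficient reported above measures the overlap between a predicted mask and a reference mask. The sketch below computes it for flattened binary masks; the function name, the toy masks, and the convention that two empty masks score 1.0 are assumptions made here for illustration and are not taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 |P ∩ T| / (|P| + |T|) for two flattened binary masks."""
    pred_set = {i for i, v in enumerate(pred) if v}
    truth_set = {i for i, v in enumerate(truth) if v}
    if not pred_set and not truth_set:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    overlap = len(pred_set & truth_set)
    return 2 * overlap / (len(pred_set) + len(truth_set))


# Toy masks (hypothetical): two of three foreground pixels agree.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
dice = dice_coefficient(pred_mask, true_mask)
print(round(dice, 3))
```

For multi-class problems such as ET/WT/TC, the same function would be applied once per region on the corresponding binary mask.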
3.5 Effect of λ
The most important parameter in our method is the weight λ of the nuclear norm regularization in Eq. (8). To better illustrate the influence of λ, we evaluated the effect of the regularization weight using MSD. The larger λ is, the lower rank(Y) will be, and a lower rank(Y) forces the segmentation maps from the deep models to be more similar to each other. Figure 3 shows the influence of λ on the MSD score averaged over the tested sequences from the ultrasound tongue image dataset. As can be seen from the figure, when λ increases within a proper range, both the mean and the variance of the MSD score decrease, which demonstrates the advantage of the proposed regularizer. However, after the turning point the accuracy decreases, since over-regularization introduces a large bias in shape estimation. A similar phenomenon can be found on the other datasets used in our experiments, and we keep λ consistent across all the experiments (set to $10^{-4}$).
Fig. 3. Effect of varying the parameter λ on MSD score. The markers and bars denote the mean values and standard deviations of MSD, respectively.
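The link between low rank and similar segmentation maps can be illustrated with a small sketch. It assumes a matrix Y whose two columns are flattened segmentation maps of adjacent frames and computes the plain (unweighted) nuclear norm through the closed-form eigenvalues of the 2 × 2 Gram matrix. The function name and the toy maps are hypothetical, and the weighting used by the actual regularizer in Eq. (8) is not reproduced here.

```python
import math


def nuclear_norm_two_columns(y1, y2):
    """Sum of singular values of the n x 2 matrix Y = [y1, y2].

    The singular values of Y are the square roots of the eigenvalues of the
    2 x 2 Gram matrix [[a, c], [c, b]], which have a closed form.
    """
    a = sum(v * v for v in y1)
    b = sum(v * v for v in y2)
    c = sum(u * v for u, v in zip(y1, y2))
    mean = (a + b) / 2
    spread = math.sqrt(((a - b) / 2) ** 2 + c * c)
    lam_large = mean + spread
    lam_small = max(mean - spread, 0.0)  # guard against round-off below zero
    return math.sqrt(lam_large) + math.sqrt(lam_small)


frame = [1.0, 0.0, 1.0, 0.0]
same = nuclear_norm_two_columns(frame, frame)                      # rank 1
different = nuclear_norm_two_columns(frame, [0.0, 1.0, 0.0, 1.0])  # rank 2
print(same, different)
```

With equal column energies, identical maps give the smaller value, so penalizing the nuclear norm with a larger λ pushes adjacent predictions towards each other.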
4 Conclusion
In this paper, we proposed a regularizer to enforce the similarity between segmentation maps of adjacent frames (or adjacent slices for 3D segmentation), and the regularizer is able to take into account the contextual information of the segmented objects. Such a regularizer can be easily incorporated into deep models based on low-rank modeling and weighted nuclear-norm minimization. We use the basic segmentation model U-Net as a baseline and demonstrate that our proposed regularizer can be incorporated into mainstream loss functions without modification. We evaluated the proposed regularizer on large-scale datasets across different imaging modalities, and the experimental results demonstrate that it provides better accuracy and lower variance than the relevant baselines, while being more robust to missing features.
References
1. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
2. Bakas, S., et al.: Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 170117 (2017)
3. Bilic, P., et al.: The liver tumor segmentation benchmark (LiTS). arXiv preprint arXiv:1901.04056 (2019)
4. Chen, X., Williams, B.M., Vallabhaneni, S.R., Czanner, G., Williams, R., Zheng, Y.: Learning active contour models for medical image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11632–11640 (2019)
5. Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J.M., Brumberg, J.S.: Silent speech interfaces. Speech Commun. 52(4), 270–287 (2010)
6. Feng, M., Wang, Y., Xu, K., Wang, H., Ding, B.: Improving ultrasound tongue contour extraction using U-Net and shape consistency-based regularizer. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6443–6447. IEEE (2021)
7. Gu, Z., et al.: CE-Net: context encoder network for 2D medical image segmentation. IEEE Trans. Med. Imaging 38(10), 2281–2292 (2019)
8. Li, M., Kambhamettu, C., Stone, M.: Automatic contour tracking in ultrasound images. Clin. Linguist. Phonetics 19(6–7), 545–554 (2005)
9. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
10. Menze, B.H., et al.: The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34(10), 1993–2024 (2014)
11. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. IEEE (2016)
12. Novikov, A.A., Major, D., Wimmer, M., Lenis, D., Bühler, K.: Deep sequential segmentation of organs in volumetric medical scans. IEEE Trans. Med. Imaging 38(5), 1207–1215 (2018)
13. Stone, M.: A guide to analysing tongue motion from ultrasound images. Clin. Linguist. Phonetics 19(6–7), 455–501 (2005)
14. Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J.N., Wu, Z., Ding, X.: Embracing imperfect datasets: a review of deep learning solutions for medical image segmentation. Med. Image Anal. 63, 101693 (2020)
15. Zhou, X., Huang, X., Duncan, J.S., Yu, W.: Active contours with group similarity. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2969–2976 (2013)
Drug Screening and Drug-Drug Interaction Prediction
Predicting Drug Drug Interactions by Signed Graph Filtering-Based Convolutional Networks

Ming Chen¹, Yi Pan²,³(✉), and Chunyan Ji³

¹ Department of Artificial Intelligence, Hunan Normal University, Changsha, Hunan, China
² Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
³ Department of Computer Science, Georgia State University, Atlanta, GA, USA
Abstract. Drug drug interactions (DDIs) are crucial for drug research and pharmacology. Recently, graph neural networks (GNNs) have handled these interactions successfully and shown great predictive performance, but most computational approaches are built on an unsigned graph that commonly represents assortative relations between similar nodes. Semantic correlations between drugs, such as degressive effects or even adverse drug reactions (ADRs), should be disassortative. This kind of DDIs network can be represented as a signed graph taking drug profiles as node attributes, but negative edges bring challenges to node embedding methods. We first propose a signed graph filtering-based convolutional network (SGFCN) for drug representations, which integrates both signed graph structures and drug profiles. Node features, as graph signals, are propagated and aggregated with dedicated spectral filters that capture both the assortativity and disassortativity of drug pairs. Furthermore, we put forward an end-to-end learning framework for DDIs by training SGFCN together with a joint discriminator under a problem-specific loss function. Compared with signed spectral embedding and graph convolutional networks, results on two prediction problems show that SGFCN is encouraging in terms of metric indicators and still achieves a considerable level of performance with a small-size model.
Keywords: Drug drug interactions · Signed graph neural networks · Node embedding · Graph filtering

1 Introduction
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 375–387, 2021. https://doi.org/10.1007/978-3-030-91415-8_32

Drug drug interactions (DDIs), which are defined as the effect produced by a combination of two or more drugs [2], have become a significant problem for drug administration and patient safety. DDIs are mainly the results of alterations in the pharmacokinetic or pharmacodynamic properties of drugs. In some conditions, DDIs may result in adverse drug reactions (ADRs), which are considered to be a serious health hazard that can affect the health of patients and even cause death. Therefore, DDI identification has become an urgent task before the clinical use of drugs. However, traditional experimental methods, such as detection of transporter-related interactions, are costly and time-consuming. Moreover, only a few DDIs can be identified during the drug development process (usually in the clinical trial phase), while most DDIs are reported after drug approval and found in post-marketing surveillance. Computational approaches provide a promising alternative to discover potential DDIs on a large scale for further screening and have gained a lot of attention from both academia and industry [18].

Recently, deep learning technologies have achieved great success in drug-related graph data analysis [18,19]. One advantage is their end-to-end learning frameworks, where graph (or node) representations and the downstream tasks are jointly solved. Deep learning also improves DDI prediction, for example through DeepWalk [8], graph auto-encoders [2,10], graph neural networks (for example, graph convolutional networks (GCN) [9,11] and graph attention networks [1]), knowledge graph-based deep learning [2,8], and so on. Their results are significantly better than those of shallow models such as random walk, tensor decomposition, matrix factorization, label propagation, and spectral graph theories. Among existing work, graph neural networks (GNNs) are among the most popular and powerful tools [15,19]. A GNN layer can be interpreted as a combination of traditional feature transformation with neural networks and graph filtering that explicitly integrates both node attributes and graph structure.
Some recent studies show that the filter in generic GNNs mainly retains the commonality of node features and inevitably ignores the difference, so that the learned representations of connected nodes become similar [16]. This mechanism may work well for assortative networks, i.e., networks in which similar nodes tend to connect with each other. However, semantic correlations between drugs, such as those arising from incoherent action modes on targets [5,13], degressive effects [12,17] or even adverse drug reactions [9] in clinical reports, should be disassortative. They indicate relations between node pairs with significant difference or even inconsistency, and can be naturally represented as negative links. For prediction problems on signed DDIs networks, the key issue is to capture the structure information and also integrate drug profiles, but negative edges bring challenges to aggregating signals in GNNs. In this study, we propose a signed graph filtering-based method for drug representation and, further, an end-to-end learning framework for DDIs. Our contributions are listed as follows.

• We propose the first signed graph filtering-based convolutional network (SGFCN) to learn drug representations. We treat a drug relation network as the overlap of an assortative graph and a disassortative graph, and dedicate spectral filters to modeling the commonality and difference of features. Taking node attributes as graph signals, SGFCN integrates the signed network structure and drug profiles into the embedding process.
• We put forward an end-to-end learning framework for DDIs on signed networks, and then apply SGFCN to predict signed DDIs and the existence of drug side effects. Instead of using a two-stage model, our method jointly solves node embedding and correlation discrimination.
• Experiments on two DDIs prediction problems are conducted to verify the effectiveness and robustness of the proposed SGFCN-based learning framework. Compared with two baseline solutions, i.e., signed spectral embedding (SSE) and GCN, the improvements of SGFCN are encouraging in terms of metric indicators, and it still achieves considerable precision with a small-size model.
2 Signed DDIs Networks and Prediction Problems
A DDIs network is denoted by $G = (V, A, X)$, where $V = \{D_1, \cdots, D_n\}$ is a vertex set composed of $n$ drugs, and the adjacency matrix $A$ represents the correlation of drug pairs. Node attributes $X = \{X_1, \cdots, X_n\} \in \mathbb{R}^{n \times d}$ represent $d$-dimensional drug profiles, which can be chemical structures, related proteins, side effects, and other properties. For a specific drug pair $v_i$ and $v_j$, the type of drug interaction is encoded by $A_{ij}$. Most existing machine learning-based approaches are designed for conventional binary prediction, which only indicates how likely a pair of drugs is to produce a DDI. In this case, $A_{ij}$ takes values in $\{0, 1\}$, where 1 means that there is an interaction while 0 means unknown. It is more beneficial to know whether a DDI is positive or negative, especially when optimizing patient care, establishing drug dosage, or identifying drug resistance to therapy. From a pharmacological perspective, a linked drug pair has some common attributes, such as targets, corresponding to a symptom. Two interacting drugs may change each other's pharmacological behaviors or effects, e.g., increasing or decreasing serum concentration, or even causing adverse drug reactions. A positive interaction hints at a consistent effect from these attributes, while a negative interaction reflects their inconsistency. It is reasonable to treat drug pairs as assortative in the former case and disassortative in the latter. This kind of DDIs is naturally modeled by signed networks, as shown in Fig. 1. The first example is the signed networks in [12,17], where enhancive DDIs and degressive DDIs obtained from DrugBank are labeled by ‘+’ and ‘−’ links, respectively. For example, the serum concentration of Quinine increases when it is taken with Aprepitant, whereas its serum concentration decreases when taken with Mitotane. The problem is to determine the link (or sign) between a drug pair, i.e., ‘?’ $\in \{+, -, 0\}$ in Fig. 1.
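As a small illustration of this encoding, the sketch below builds a signed adjacency matrix from the Quinine example mentioned above. The helper name `signed_adjacency`, the triple format, and the choice to store interactions symmetrically are assumptions made here for illustration.

```python
def signed_adjacency(drugs, interactions):
    """Return A with A[i][j] in {+1, -1, 0} for '+', '-' and unknown links."""
    index = {name: k for k, name in enumerate(drugs)}
    n = len(drugs)
    A = [[0] * n for _ in range(n)]
    for drug_1, drug_2, sign in interactions:
        i, j = index[drug_1], index[drug_2]
        # Assumption: interactions are stored symmetrically.
        A[i][j] = A[j][i] = sign
    return A


drugs = ["Quinine", "Aprepitant", "Mitotane"]
interactions = [
    ("Quinine", "Aprepitant", +1),  # serum concentration increases
    ("Quinine", "Mitotane", -1),    # serum concentration decreases
]
A = signed_adjacency(drugs, interactions)
for row in A:
    print(row)
```

The entry for the pair (Aprepitant, Mitotane) stays 0, which corresponds to the unknown ‘?’ link that the predictor has to decide.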
Another example models side effects on signed networks, and the task is to discriminate whether a drug pair has adverse drug reactions. Since side effects are usually caused by disassortative relations, the corresponding link between two drugs is marked as ‘−1’. Assortative relations are exploited to help discriminate negative links [9]. A drug pair is positively linked when there is no report of side effects and the similarity of the drug profiles is very high. Since positive links are generated as pseudo labels, the problem becomes determining whether the link between a drug pair is negative, i.e., ‘?’ $\in \{-, 0\}$ in Fig. 1.
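The pseudo-labelling rule for this second problem can be sketched as follows, assuming binary fingerprint vectors as drug profiles and Jaccard similarity as the similarity measure. The function names, the very short toy fingerprints and the default threshold value are illustrative assumptions.

```python
def jaccard(fp_a, fp_b):
    """Jaccard similarity of two equal-length binary fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


def link_label(fp_a, fp_b, has_side_effect, mu=0.8):
    """-1 for a reported side effect, +1 for a similar pair, 0 otherwise."""
    if has_side_effect:
        return -1
    return 1 if jaccard(fp_a, fp_b) > mu else 0


# Toy 6-bit fingerprints (hypothetical; real profiles are much longer).
fp_1 = [1, 1, 1, 1, 1, 0]
fp_2 = [1, 1, 1, 1, 1, 1]
fp_3 = [1, 0, 0, 0, 0, 0]
print(link_label(fp_1, fp_2, False), link_label(fp_1, fp_3, False),
      link_label(fp_1, fp_2, True))
```

A reported side effect always takes precedence, so the similarity test only ever creates positive pseudo labels between pairs without a report.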
3 The Proposed SGFCN for DDIs Prediction
Fig. 1. The prediction framework for DDIs problems defined on signed networks. $A$ is the signed adjacency matrix, and $X$ is the drug profile matrix. SGFCN is employed to embed drugs, and the results are denoted by $\{Z_1, \cdots, Z_n\}$. A DDI is discriminated from joint embedding features such as $Z_i \| Z_j = (z_{i1}, \cdots, z_{iL}, z_{j1}, \cdots, z_{jL})$, of which the link type is problem-specific.
As far as we know, there is no research that applies signed GNNs to DDIs prediction problems defined on signed networks. Our goal is to design such a method for drug embedding, and also to propose an end-to-end learning framework for DDIs. Since various GNN models share similar feature transformations while adopting different designs for the aggregation operation, the key lies in how to properly model both positive and negative relations in the aggregating functions. Inspired by GCN and its graph filtering view, we propose a signed GNN called SGFCN, as described in Algorithm 1. Figure 1 shows our prediction framework, in which SGFCN and the DDI discriminator are jointly trained in the end-to-end DDIs learning system.

3.1 Graph Convolutional Networks and Graph Filtering
A typical GNN layer consists of two operations: 1) feature aggregation, which corresponds to an aggregating function used to fuse information between a node and its neighbors to obtain a new feature with the same dimension; 2) feature transformation via learnable parameter matrices, which transforms the feature of each node and generates a new feature with a different dimension. For example, GCN employs the following process in each layer: $X_{out} = \hat{A} X_{in} \Theta$, which is activated by $\sigma$ to output the nonlinear mapping. The initial input $X_{in}$ of the first layer usually takes node attributes or derived network structure. The linear operation with $\Theta \in \mathbb{R}^{d_{in} \times d_{out}}$ conducts feature transformation, and aggregation is implemented via $\hat{A}$. Here, $\hat{A}$ is a normalized version of the non-negative adjacency matrix with a self-loop, defined by $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ with $\tilde{A} = A + I$ and $\tilde{D} = \mathrm{diag}(\sum_j \tilde{A}_{1j}, \cdots, \sum_j \tilde{A}_{nj})$.
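The two operations can be made concrete with a dependency-free sketch of one GCN layer on a three-node path graph. The helper names, the toy features and weights, and the use of ReLU as the activation σ are assumptions made for illustration; they are not taken from any particular library.

```python
import math


def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]


def normalized_adjacency(A):
    """Symmetric normalization of A + I by the degrees of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(A, X, Theta):
    """Aggregate with the normalized adjacency, transform, then apply ReLU."""
    H = matmul(matmul(normalized_adjacency(A), X), Theta)
    return [[max(v, 0.0) for v in row] for row in H]


A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # unsigned path graph 0 - 1 - 2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 2-dimensional node features
Theta = [[1.0], [-1.0]]                    # toy weights mapping 2 -> 1
A_hat = normalized_adjacency(A)
X_out = gcn_layer(A, X, Theta)
print(X_out)
```

The aggregation step keeps the feature dimension (3 × 2 here), and only the multiplication by Θ changes it (to 3 × 1), which mirrors the split into operations 1) and 2).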
The aggregating process can be interpreted as graph filtering in terms of graph signal processing (GSP). In GSP [3], a signal on a graph $G$ with $n$ vertices is a vector $x \in \mathbb{R}^n$ whose $i$th entry $x_i$ is defined on vertex $i$. A graph filter, denoted by $\mathcal{G}$, is used to transform a graph signal $x$ into another one, i.e., $y = \mathcal{G}x$, which integrates graph structure and node attributes. In applications, feature matrices such as $X_{in}$ and $X_{out}$ are generalized graph signals. $\mathcal{G}$ is derived by adjusting the frequency strength in the spectral domain of the whole graph. Consider a real, symmetric and positive semi-definite graph matrix $R \in \mathbb{R}^{n \times n}$ defined on $G$. Its spectral decomposition is $R = U \mathrm{diag}(\lambda_1, \cdots, \lambda_n) U^T$, where each eigenvalue $\lambda_i \geq 0$ is called a graph frequency in GSP, and $U = (u_1, \cdots, u_n)$ contains the corresponding eigenvectors. A graph filter can be defined by choosing a frequency weight function $g(\lambda_k)$, i.e., $\mathcal{G} = U \mathrm{diag}(g(\lambda_1), \cdots, g(\lambda_n)) U^T$. In GCN, the graph matrix $\tilde{L} = I - \hat{A}$ is used for spectral decomposition, and the graph filter sets $g(\lambda_k) = 1 - \lambda_k$ [7].

3.2 Drug Representation via Signed Graph Filtering-Based Convolutional Networks (SGFCN)
Negative links bring challenges since the existing spectral filters in GSP usually assume that edge weights on $G$ are non-negative. To solve the problem, we treat a signed graph as the overlap of two unsigned graphs, and divide it into $G^+ = (V, A^+, X)$ and $G^- = (V, A^-, X)$, where $A^+_{ij} = \max\{A_{ij}, 0\}$ and $A^-_{ij} = \max\{-A_{ij}, 0\}$. $D^+$ and $D^-$ are their degree matrices, respectively, defined by $D^+_{ii} = \sum_{j=1}^{n} A^+_{ij}$ and $D^-_{ii} = \sum_{j=1}^{n} A^-_{ij}$. $G^+$ models assortative drug pairs, while $G^-$ represents disassortative relations. We define two informative filters $\mathcal{G}^+$ and $\mathcal{G}^-$ to capture both the assortativity and disassortativity of drug pairs.

• For $G^+$, $A^+$ is employed instead of $A$ in GCN, and the aggregation operator becomes $\mathcal{G}^+ = (\tilde{D}^+)^{-1/2} (A^+ + I) (\tilde{D}^+)^{-1/2}$, where $\tilde{D}^+ = D^+ + I$. This low-pass filter mainly retains the commonality of positively linked nodes and inevitably ignores the difference, so that the learned representations of connected nodes become similar [7].
• For $G^-$, the signless Laplacian $Q^- = D^- + A^-$ is a better choice, as reported for spectral clustering with the eigenvectors of the $K$ smallest eigenvalues, which identifies clusters where the number of edges has to be larger between clusters than inside clusters [4]. Our goal is to develop filters that make small eigenvalues play a stronger role. We choose the normalized version of $Q^-$ and set $g(\lambda) = \lambda^-_{\max} - \lambda$, where $\lambda^-_{\max}$ is the maximum eigenvalue. As a result, we obtain $\mathcal{G}^- = (\lambda^-_{\max} - 1) I - (D^-)^{-1/2} A^- (D^-)^{-1/2}$. The formula shows that $\mathcal{G}^-$ captures the difference between the central node and its negative neighbors.
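The closed form of the negative filter follows in two steps from the definitions above. The sketch below assumes that every drug has at least one negative link, so that $D^-$ is invertible, and writes $\bar{Q}^-$ (a symbol introduced here only for illustration) for the normalized signless Laplacian.

```latex
\begin{align*}
\bar{Q}^- &= (D^-)^{-1/2}\,(D^- + A^-)\,(D^-)^{-1/2}
           = I + (D^-)^{-1/2} A^- (D^-)^{-1/2}
           = U\,\mathrm{diag}(\lambda_1, \cdots, \lambda_n)\,U^T, \\
\mathcal{G}^- &= U\,\mathrm{diag}(\lambda^-_{\max} - \lambda_1, \cdots, \lambda^-_{\max} - \lambda_n)\,U^T
           = \lambda^-_{\max} I - \bar{Q}^-
           = (\lambda^-_{\max} - 1)\,I - (D^-)^{-1/2} A^- (D^-)^{-1/2}.
\end{align*}
```

Applied to a signal $x$, the $i$th entry of $\mathcal{G}^- x$ is $(\lambda^-_{\max} - 1) x_i$ minus a degree-normalized sum over the negative neighbors of node $i$, which is the difference between a central node and its negative neighbors.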
Algorithm 1. SGFCN Embedding Generation.
Input: A DDIs network $G$ with $n$ nodes and a link matrix represented by $A = A^+ - A^-$; the drug profile matrix $X$; number of layers $L$; the parameter matrices $\Theta^{NN} = \{\Theta^{NN(1)}, \cdots, \Theta^{NN(L)}\}$ in neural networks.
Output: Low-dimensional representations $Z_1, \cdots, Z_n$.
Initialization: $X^{(0)} = X$;
For $l \in \{0, \cdots, L - 1\}$
    calculate $X_i^{pos(l+1)}$, $X_i^{neg(l+1)}$ using Eq. (1–2);
    update $X^{(l+1)}$ by $X_i^{(l+1)} \leftarrow X_i^{pos(l+1)} \| X_i^{neg(l+1)}$;
Return $Z_i \leftarrow X_i^{(L)}$, $i = 1, \cdots, n$.
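One pass of the loop in Algorithm 1 can be sketched without any framework, assuming the positive branch aggregates over $A^+$ with self-loops and the negative branch subtracts a normalized sum over negative neighbors from a transformed copy of the node's own feature. The helper names and the toy parameters are hypothetical, and giving nodes without negative links a zero normalizer is an assumption made here to avoid division by zero.

```python
import math


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]


def sym_normalize(M):
    """D^{-1/2} M D^{-1/2} with D the row sums of M (zero degree -> 0)."""
    inv = [1 / math.sqrt(d) if d > 0 else 0.0 for d in (sum(r) for r in M)]
    n = len(M)
    return [[inv[i] * M[i][j] * inv[j] for j in range(n)] for i in range(n)]


def sgfcn_layer(A, X, theta_pos, theta_neg0, theta_neg1):
    """One signed layer: tanh of each branch, then row-wise concatenation."""
    n = len(A)
    A_pos_loop = [[max(A[i][j], 0) + (i == j) for j in range(n)]
                  for i in range(n)]
    A_neg = [[max(-a, 0) for a in row] for row in A]
    pos = matmul(matmul(sym_normalize(A_pos_loop), X), theta_pos)
    neg_self = matmul(X, theta_neg0)
    neg_nbr = matmul(matmul(sym_normalize(A_neg), X), theta_neg1)
    out = []
    for i in range(n):
        x_pos = [math.tanh(v) for v in pos[i]]
        x_neg = [math.tanh(s - t) for s, t in zip(neg_self[i], neg_nbr[i])]
        out.append(x_pos + x_neg)  # X_i^{pos} || X_i^{neg}
    return out


A = [[0, 1, -1], [1, 0, 0], [-1, 0, 0]]    # node 0: '+' with 1, '-' with 2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [[1.0], [0.0]]                     # toy weights: keep first feature
out = sgfcn_layer(A, X, theta, theta, theta)
print(out)
```

In this toy case nodes 0 and 2 are negative neighbors that share the same first feature, so the negative branch of node 0 evaluates to zero: it responds to differences across negative links rather than to shared values.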
Furthermore, we design a spatial-type signed GNN layer to avoid spectral decomposition in $\mathcal{G}^-$. We set a learnable parameter matrix for each part of $\mathcal{G}^-$ and eliminate the constant coefficient $\lambda^-_{\max}$ by integrating it into the parameters. As a result, signal aggregation and feature transformation are jointly calculated from:

$$X^{pos(l+1)} = \sigma\left(\left[(\tilde{D}^+)^{-1/2} (A^+ + I) (\tilde{D}^+)^{-1/2}\right] X^{(l)} \Theta^{pos(l)}\right), \tag{1}$$

$$X^{neg(l+1)} = \sigma\left(X^{(l)} \Theta_0^{neg(l)} - (D^-)^{-1/2} A^- (D^-)^{-1/2} X^{(l)} \Theta_1^{neg(l)}\right). \tag{2}$$

Here, $X^{(l)} \in \mathbb{R}^{n \times d_l}$, and $\Theta^{pos(l)}, \Theta_0^{neg(l)}, \Theta_1^{neg(l)} \in \mathbb{R}^{d_l \times \frac{1}{2} d_{l+1}}$. Similar to GCN, an activation function $\sigma$ is employed to obtain a non-linear mapping. In this study, tanh is employed to retain negative values from Eq. (2). By stacking multiple layers, we obtain a signed graph filtering-based neural network, called SGFCN, whose forward process is described in Algorithm 1. $X^{(0)}$ is directly initialized as the drug profiles $X$, or coupled with the network structure derived from other methods. The output of the $l$th layer is the concatenated feature derived from $G^+$ and $G^-$, i.e., $X^{(l)} = X^{pos(l)} \| X^{neg(l)}$, and the weight matrices are $\Theta^{NN(l)} = \{\Theta^{pos(l)}, \Theta_0^{neg(l)}, \Theta_1^{neg(l)}\}$. SGFCN inherits advantages from generic GNNs in several aspects: 1) both the graph structure information and node attributes are integrated into node embedding; 2) spectral decomposition is avoided, and Eq. (1–2) show localization in the vertex domain. In addition, each convolution layer deals with the positive and negative edges separately, but it passes their concatenation to the next layer. As a result, the signed graph structure information is well utilized.

3.3 The End-to-End Learning Framework for DDIs on Signed Networks
We put forward an end-to-end DDIs learning framework by jointly training SGFCN and a DDI discriminator. We train the whole predictor with a problem-specific loss function. In our experiments, a softmax regression classifier with the regression coefficients $\Theta^R$ acts as the discriminator, taking the embedding results $(Z_i, Z_j)$ as input. Here, $Z_i$ is the embedding result of $v_i$ via SGFCN with weight matrix parameters $\Theta^{NN} = \{\Theta^{NN(1)}, \cdots, \Theta^{NN(L)}\}$. $\Theta^R = \{\Theta_1^R, \cdots, \Theta_{|S|}^R\}$ contains the regression coefficients, where $\Theta_k^R$ holds the coefficients for the $k$th type of link. Let $s_{ij} \in S$ denote which type of link exists between $v_i$ and $v_j$, where $S$ depends on the prediction task. Problem-dependent loss functions are employed.

• Signed DDIs prediction. As introduced in the first example of Sect. 2, the goal is to predict signed links. We train models with randomly sampled ‘0’ edges, i.e., $S = \{+, -, 0\}$. The loss function is calculated as follows:

$$L(\Theta^{NN}, \Theta^R) = \sum_{ij} -\omega_{s_{ij}} \sum_{t \in S} I(s_{ij} = t) \log \frac{\exp([Z_i \| Z_j] \Theta_t^R)}{\sum_{q=1}^{|S|} \exp([Z_i \| Z_j] \Theta_q^R)}. \tag{3}$$

$\omega_{s_{ij}}$ denotes the weight associated with link type $s_{ij}$. $I(\cdot)$ returns 1 if the given condition is true, and 0 otherwise.

• Side effect discrimination. As introduced in the second example of Sect. 2, the task is to discriminate negative links with the help of positive links. Hence, we employ the binary loss function and set $S = \{-, 0\}$, of which ‘0’ corresponds to drug pairs without side effects reported in the dataset. Even though SGFCN aggregates messages from both positive and negative neighbors, in the loss function we take the positive links as a special kind of ‘0’ edges. Our loss function is calculated as follows:

$$L(\Theta^{NN}, \Theta^R) = -\sum_{i,j} \sum_{t \in S} I(s_{ij} = t) \log\left(\mathrm{softmax}_t([Z_i \| Z_j] \Theta_t^R)\right), \tag{4}$$

where $\mathrm{softmax}_t(\cdot)$ is the probability of sign $t$ from the softmax operation. Since positive links are significantly fewer than negative edges, we randomly sample ‘0’ edges to keep the different link types balanced.
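The weighted softmax loss for signed link prediction can be sketched in plain Python, assuming each training pair carries two embedding vectors and the index of its link type in S. The function names, the toy embeddings and the all-zero coefficients are illustrative assumptions, chosen so that the expected value is easy to check by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pair_logits(z_i, z_j, theta_r):
    """One logit per link type from the concatenated pair embedding."""
    z = z_i + z_j  # list concatenation plays the role of Z_i || Z_j
    return [sum(a * b for a, b in zip(z, coeffs)) for coeffs in theta_r]


def weighted_loss(pairs, theta_r, class_weights):
    """Sum over pairs of -w[label] * log softmax probability of the label."""
    loss = 0.0
    for z_i, z_j, label in pairs:
        probs = softmax(pair_logits(z_i, z_j, theta_r))
        loss -= class_weights[label] * math.log(probs[label])
    return loss


# Toy setting: 2-dimensional embeddings, link types indexed 0:'+', 1:'-', 2:'0'.
theta_r = [[0.0] * 4 for _ in range(3)]  # untrained, all-zero coefficients
pairs = [([0.1, 0.2], [0.3, 0.4], 0), ([0.5, 0.1], [0.0, 0.2], 2)]
loss = weighted_loss(pairs, theta_r, class_weights=[1.0, 1.0, 1.0])
print(loss)
```

With all-zero coefficients every link type receives probability 1/3, so the loss equals 2 log 3; training lowers it by moving probability mass onto the observed link types. Setting a larger weight for a rare type makes its misclassification cost more.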
4 Experimental Settings and Analyses
4.1 Datasets and Signed DDIs Networks
We conduct experiments on two prediction problems defined on signed networks. Two datasets are collected, whose statistical properties, including the number of links and the average degree of nodes, are listed in Table 1.

Dataset1. It includes 1562 small-molecule drugs collected by Shi et al. [12] from the DrugBank database [6]. 180576 DDIs are labeled according to their descriptions. For example, the serum concentration of Quinine increases when it is taken with Aprepitant, whereas its serum concentration decreases when taken with Mitotane. The former DDI is positive while the latter one is labeled as a negative link. This dataset provides drug binding proteins (DBP) and chemical structures as drug profiles. DBP refers to 1213 drug targets and 429 non-target proteins. Chemical structures are collected from PubChem fingerprints and represented as an 881-dimensional binary vector. The problem is to determine the link type between a drug pair.

Dataset2. It is provided by Liu et al. [9], where 548 drug nodes and 56883 drug pairs with side effects were collected from TWOSIDES. If there are side effects between two drugs, the corresponding link in the signed DDIs network is marked as ‘−1’. Two drugs have a positive link when there is no report of side effects between them and $S_{ij} > \mu$, where $S_{ij}$ is the Jaccard similarity of their 881-dimensional chemical substructures and $\mu$ is the threshold. The task is to discriminate whether a drug pair has side effects.

Table 1. Statistics of signed DDIs networks

Properties                    Dataset1               Dataset2
Number of drugs               1562                   548
Number of links: total/+/−    180576/125298/55278    56883/8299/48584
Average degree: total/+/−     231.2/160.4/70.8       207.60/30.29/177.31

4.2 Training Settings
We choose GCN and signed spectral embedding (SSE) as two baselines. GCN is a popular deep learning method for drug discovery. The original GCN is designed for unsigned networks, but here it is jointly trained with a discriminator for signed networks, using a problem-specific strategy to deal with link signs. SSE is a baseline method for signed network embedding problems. It maps the DDI network into a space consisting of eigenvectors of the signed graph Laplacian. In our experiments, we choose the symmetric normalization of $L = D^+ - A^+ + D^- - A^-$, which agrees with our signed graph filters. Since spectral decomposition cannot be solved jointly with training the edge discriminator, the whole training process is two-stage. All algorithms are implemented in PyTorch. For GCN, we follow the version in PyTorch Geometric, a geometric deep learning extension library. This toolbox also provides GNN APIs for our SGFCN code. Both SGFCN and GCN employ 2 convolutional layers and an Adam optimizer with learning rate 0.01. For SGFCN and GCN, we set $X^{(l)} \in \mathbb{R}^{n \times d_{emb}}$ on each layer, and $d_{emb} \in \{4, 8, 16, 32, 64, 128, 256\}$ are tested. We tried different numbers of training epochs and found that 200 is enough to get good results. 5-fold cross-validation is used in one run, and all results are the average of 10 runs. Some special settings are listed as follows:

• Signed DDIs prediction on dataset1. SGFCN and GCN take drug profiles as the initial features of nodes. GCN is trained without consideration of any link signs. Since the original SSE does not integrate node features, we encode drug profiles into edge attributes here with the Jaccard score. Drug pairs with higher scores are added into the positive edge set to train a model. In our experiments, we try threshold values in $\{0.5, 0.6, 0.7, 0.8\}$. We find that 0.8 is the best one for chemical features and set 0.5 for the DBP case.
• Side effect discrimination on dataset2. Two drugs are positively linked when there is no report of side effects between them and their Jaccard similarity is greater than 0.8. GCN is trained with two strategies to deal with link signs, denoted by GCN(N) and GCN(B), respectively. GCN(N) trains GCN by taking only negative links as an unsigned graph, where 1 means two drugs have a side effect, while GCN(B) additionally considers similar drug pairs (i.e., positive edges) but ignores all link signs.

4.3 Results and Analyses
We employ AUC and F1 indicators to evaluate experimental results. A higher value of these metrics indicates better performance. AUC is the area under the receiver operating characteristic (ROC) curve, which illustrates the true-positive rate versus the false-positive rate at different cutoffs. The F1 score is the harmonic mean of precision and recall. Precision is defined as the number of true positives over the number of true positives plus the number of false positives, and recall is defined as the number of true positives over the number of true positives plus the number of false negatives.

Table 2. The best values of different methods

Dataset1    GCN            SSE      SGFCN
AUC         0.87           0.87     0.91
F1          0.76           0.799    0.834

Dataset2    GCN(N|B)       SSE      SGFCN
AUC         0.926|0.948    0.922    0.958
F1          0.947|0.956    0.904    0.959
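For reference, the two indicators can be computed for a binary task as in the sketch below. It uses the rank interpretation of AUC, i.e., the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The function names and the toy labels and scores are hypothetical, and how the metrics were aggregated for the three-class problem is not specified in the text.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and their harmonic mean for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (hypothetical).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.3, 0.6, 0.8]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
area = auc(y_true, scores)
print(round(f1, 3), round(area, 3))
```

F1 depends on the chosen decision threshold through `y_pred`, whereas AUC only depends on the ordering of the scores, which is why the two are reported side by side.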
Several observations can be made from the results summarized in Table 2 and Fig. 2.

• Table 2 gives an overview of the three methods in terms of their best values. SGFCN is encouraging, since it surpasses both SSE and GCN. The potential factors are analyzed as follows. Different from SSE, SGFCN not only jointly trains the embedding method and the DDIs discriminator, but also directly integrates graph structures and drug profiles into the node embedding process. Compared with GCN, which ignores link signs, SGFCN follows the signs to aggregate messages from different kinds of neighbors and, as a result, makes better use of signed graph structures. Especially on the second dataset, SGFCN shows the capability of discriminating drug pairs with side effects. Compared with GCN on the unsigned graph with only negative edges, the superior performance of GCN(B) hints that highly similar drugs can help discriminate drug pairs with side effects. However, GCN(B) is still inferior to SGFCN, which additionally exploits link signs. This further verifies the advantage of SGFCN in capturing graph structure information from both the assortativity and disassortativity of drug pairs.
• Figure 2 shows the effect of drug profiles and embedding dimensions on SGFCN. Firstly, SGFCN always maintains the evaluation indicators at a high level regardless of the kind of drug profiles, and the largest difference among the best values of SGFCN is less than 0.5 percentage points. Secondly, on both datasets, SGFCN with a very low dimension is slightly inferior to its best values but still surpasses the best values of GCN and SSE. In addition, the best values of SGFCN generally occur when both chemical structure and DBP are used as drug attributes. In terms of a single kind of feature, the performance of the algorithms with the DBP feature is generally superior to that with chemical structures, which may exhibit higher inconsistency with the DDIs network.

From these observations, SGFCN can be summarized as follows: 1) it is an effective method to model both the assortativity and disassortativity of drug pairs; 2) it is robust regardless of the initial node inputs and embedding dimension, and shows considerable performance even when the model size is very small.
Fig. 2. The effect of drug profiles and embedding dimensions on SGFCN. (a) and (b) show the results on dataset1, varying with both drug profiles and embedding dimensions. ‘C’ represents chemical structure. ‘P’ means binding proteins of drugs. ‘CP’ combines chemical structures and DBP. (c) shows the results on dataset2.
To test the generalization performance of SGFCN, we use SGFCN trained on dataset1 to predict new DDIs from the new version of DrugBank. The dataset contains 180,576 annotated drug-drug interaction pairs among 1562 drugs and 1,038,565 unlabeled drug pairs that may contain DDIs. The best low-dimension SGFCN models (i.e., d_emb ∈ {16, 32, 64}) are trained. We then collect the DDIs with the highest scores. For an unobserved drug pair, a higher '+' or '−' score indicates a higher probability that the two drugs interact. We observe that far more negative DDIs than positive DDIs are predicted with high scores. Table 3 shows the top 5 new positive and negative DDIs, which are not available in the dataset used in Sect. 4.3. We search for evidence of these newly predicted DDIs in the DrugBank database (V5) [14]. 'Null' means that DrugBank contains no description.
Table 3. Top 5 new positive or negative DDIs predicted by SGFCN

| Drug 1, Drug 2 | Sign | Descriptions in DrugBank |
|---|---|---|
| Chloroquine, Magnesium trisilicate | − | Magnesium trisilicate can cause a decrease in the absorption of Chloroquine resulting in a reduced serum concentration and potentially a decrease in efficacy |
| Acarbose, Dienogest | − | The therapeutic efficacy of Acarbose can be decreased when used in combination with Dienogest |
| Chloroquine, Vigabatrin | − | The therapeutic efficacy of Vigabatrin can be decreased when used in combination with Chloroquine |
| Chloroquine, Naratriptan | − | Null |
| Stavudine, Dienogest | − | Null |
| Toremifene, Sofosbuvir | + | The serum concentration of Sofosbuvir can be increased when it is combined with Toremifene |
| Brompheniramine, Desloratadine | + | Brompheniramine may increase the central nervous system depressant (CNS depressant) activities of Desloratadine |
| Sofosbuvir, Cobicistat | + | The serum concentration of Sofosbuvir can be increased when it is combined with Cobicistat |
| Naloxegol, Avibactam | + | Null |
| Stanozolol, Cobicistat | + | Null |

5 Conclusions
Some semantic correlations between drugs, such as degressive effects or even adverse side reactions (ADR), can be represented by negative edges, but they bring challenges to graph filtering-based deep learning methods. We proposed SGFCN, an effective and robust method for representing drugs modeled on signed networks by capturing both the assortativity and the disassortativity of drug pairs. Similar to GCN, SGFCN is inspired by graph filtering but ultimately realizes node feature transformation localized in the vertex domain. We further put forward an end-to-end DDI learning framework that jointly trains SGFCN and a DDIs discriminator. Compared with SSE and GCN, the results on two prediction problems confirm that SGFCN achieves encouraging performance in predicting new DDIs and keeps considerable precision at low embedding dimensions. In future research, we will extend our study in the following aspects: 1) generate multi-view signed DDI networks with multi-modal drug-related features,
for example, by taking drug-target interaction and protein-protein interaction (PPI) into consideration; 2) extend our method to cold-start DDI prediction problems with unknown drugs.

Acknowledgments. This work is supported by the National Natural Science Foundation of China under Grant No. 62077014 and by the Shenzhen KQTD Project (No. KQTD20200820113106007). The authors acknowledge Molecular Basis of Disease (MBD) at Georgia State University for supporting this research.
References

1. Bang, S., et al.: Polypharmacy side effect prediction with enhanced interpretability based on graph feature attention network. Bioinformatics (2021). https://doi.org/10.1093/bioinformatics/btab174
2. Dai, Y., et al.: Drug-drug interaction prediction with Wasserstein adversarial autoencoder-based knowledge graph embeddings. Briefings Bioinform. 1–15 (2020). https://doi.org/10.1093/bib/bbaa256
3. Dong, X., et al.: Graph signal processing for machine learning: a review and new perspectives. IEEE Sign. Process. Mag. 37(6), 117–127 (2020). https://doi.org/10.1109/MSP.2020.3014591
4. Gionis, A., et al.: Mining signed networks: theory and applications. In: Proceedings of the World Wide Web Conference, pp. 309–310 (2020). https://doi.org/10.1145/3366424.3383113
5. Hu, B., Wang, H., Yu, Z.: Drug side-effect prediction via random walk on the signed heterogeneous drug network. Molecules 24(20), 3668 (2019). https://doi.org/10.3390/molecules24203668
6. Law, V., et al.: Drugbank 4.0: shedding new light on drug metabolism. Nucl. Acids Res. 42(1), D1091–D1097 (2014). https://doi.org/10.1093/nar/gkt1068
7. Li, Q., et al.: Label efficient semi-supervised learning via graph filtering. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9582–9591 (2019). https://doi.org/10.1109/CVPR.2019.00981
8. Lin, X., et al.: KGNN: knowledge graph neural network for drug-drug interaction prediction. In: Proceedings of Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), pp. 2739–2745 (2020). https://doi.org/10.24963/ijcai.2020/380
9. Liu, T., Cui, J., Zhuang, H., Wang, H.: Modeling polypharmacy effects with heterogeneous signed graph convolutional networks. Appl. Intell. 51(11), 8316–8333 (2021). https://doi.org/10.1007/s10489-021-02296-4
10. Ma, T., et al.: Drug similarity integration through attentive multi-view graph autoencoders. In: Proceedings of Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pp. 3477–3483 (2018). https://doi.org/10.24963/ijcai.2018/483
11. Zitnik, M., Agrawal, M., Leskovec, J.: Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34(13), i457–i466 (2018). https://doi.org/10.1093/bioinformatics/bty294
12. Shi, J.-Y., Mao, K.-T., Yu, H., Yiu, S.-M.: Detecting drug communities and predicting comprehensive drug–drug interactions via balance regularized semi-nonnegative matrix factorization. J. Cheminform. 11(1), 1–16 (2019). https://doi.org/10.1186/s13321-019-0352-9
13. Torres, N.B., Altafini, C.: Drug combinatorics and side effect estimation on the signed human drug-target network. BMC Syst. Biol. 10(74) (2016). https://doi.org/10.1186/s12918-016-0326-8
14. Wishart, D.S., et al.: Drugbank 5.0: a major update to the drugbank database for 2018. Nucl. Acids Res. 46(D1), D1074–D1082 (2017). https://doi.org/10.1093/nar/gkx1037
15. Wu, Z., et al.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32(1), 4–24 (2021). https://doi.org/10.1109/TNNLS.2020.2978386
16. Xu, B., et al.: Graph convolutional networks using heat kernel for semi-supervised learning. In: Proceedings of Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1928–1934 (2020). https://doi.org/10.24963/ijcai.2019/267
17. Yu, H., et al.: Predicting and understanding comprehensive drug-drug interactions via semi-nonnegative matrix factorization. BMC Syst. Biol. 12(Suppl 1), 14 (2018). https://doi.org/10.1186/s12918-018-0532-7
18. Zhang, T., Leng, J., Ying, L.: Deep learning for drug drug interaction extraction from the literature: a review. Briefings Bioinform. 21(5), 1609–1627 (2020). https://doi.org/10.1186/s12859-020-03724
19. Zhang, Z., Cui, P., Zhu, W.: Deep learning on graphs: a survey. IEEE Trans. Knowl. Data Eng. 32(1), 4–24 (2020). https://doi.org/10.1109/TKDE.2020.2981333
Drug-Target Interaction Prediction Based on Gaussian Interaction Profile and Information Entropy

Lina Liu¹, Shuang Yao², Zhaoyun Ding³, Maozu Guo⁴, Donghua Yu⁵(✉), and Keli Hu⁵,⁶

1 Soochow University, Suzhou, China [email protected]
2 China Jiliang University, Hangzhou, China
3 National University of Defense Technology, Changsha, China [email protected]
4 Beijing University of Civil Engineering and Architecture, Beijing, China [email protected]
5 Shaoxing University, Shaoxing, China
6 Information Technology R&D Innovation Center of Peking University, Shaoxing, China
Abstract. Identifying drug-target interactions (DTIs) is an important component of drug discovery and development. However, it is a complex process that is time-consuming, costly, and often inefficient, with a low success rate, especially when wet-lab experimental methods are used. In contrast, numerous computational methods show great vitality and advantages. For these methods, the precise calculation of drug-drug and target-target similarities is a basic requirement for accurate DTI prediction. In this paper, an improved Gaussian interaction profile similarity and a similarity fusion coefficient based on information entropy are proposed; the improved similarity is fused with other similarities to enhance the performance of DTI prediction methods. Experimental results on all four benchmark datasets (NR, GPCR, IC, and Enzyme) show that the improved similarity enhances the prediction performance of all six comparison methods.

Keywords: Drug discovery and design · Drug-target interaction · Gaussian interaction profile · Information entropy · Similarity computing
Supported by the National Natural Science Foundation of China (No. 62002227, 62031003) and Zhejiang Provincial Natural Science Foundation of China (Grant No. LY20F020011).
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 388–399, 2021. https://doi.org/10.1007/978-3-030-91415-8_33

1 Introduction

Drug-target interaction (DTI) refers to a drug reacting with a target and triggering a certain form of positive biological response, such as modifying the function and/or activity of the target, to achieve the control, prevention, cure, or diagnosis of diseases. DTI prediction is an important step in drug discovery and design and has an important impact on the drug development cycle, development cost, and success rate. Wet-lab experimental methods are extremely costly, time-consuming, and labor-intensive, with a low success rate [7,24], whereas computational methods show great vitality and advantages [29]. For example, COVID-19, the disease caused by the SARS-CoV-2 virus, seriously threatens human health [10]. In January 2020, Wu et al. [32] published its genome sequence data (GenBank No: MN908947), and Xu et al. [35] then discovered by bioinformatic analysis that the spike glycoprotein (S protein) of SARS-CoV-2 interacts with the angiotensin converting enzyme (ACE2) protein molecule on the surface of the host cell. Subsequently, many candidate drugs obtained through drug repurposing methods were reported, and chloroquine is a relatively successful case [30].

Computational methods for predicting DTI are mainly divided into ligand-based, docking-based, and chemogenomic-based methods. Ligand-based methods mainly compare and analyze the similarity between the candidate compound and the known ligands of a target, as in quantitative structure-activity relationship (QSAR) analysis [25]. However, it is usually impossible to find enough known ligands for a target. Docking-based methods mainly use dynamic simulation, as in Dock [1] and AutoDock [23]. These methods are limited to protein receptors (targets) whose 3D structures are known, yet many targets, such as most membrane-bound proteins and ion channel proteins, have no resolved 3D structure. Compared with the aforementioned methods, chemogenomic-based methods are less restrictive: they can use only genomes and protein sequences instead of 3D structures, work with a small number of known ligands, or even handle unknown targets (i.e., new targets). Therefore, this kind of method is becoming increasingly popular. It is mainly divided into 3 categories [4,5,39]: network-based methods [20,33], machine learning-based methods [3,8,37], and others [14].

In this paper, we improve the calculation of drug-drug and target-target similarities in DTI prediction. The main contributions are as follows:

1. An improved method for calculating the Gaussian interaction profile similarity is proposed, which takes the interaction states in the interaction profile into account to improve the discrimination between different states.
2. A similarity fusion coefficient based on information entropy is proposed. Because information entropy measures the degree of uncertainty, the coefficient can adaptively adjust the contributions of the Gaussian interaction profile similarity and the other similarities in the fusion.
3. The experimental results show that the improved similarity proposed in this paper improves the DTI prediction performance on all 4 benchmark datasets and for all 6 comparison methods, namely ALADIN [2], BLMMLKNN [38], MLKNNSC [28], BLMNII [22], NRLMF [19], and WKNN [17].
2 Related Works
Accurate calculation of drug-drug and target-target similarity is the basis of most DTI prediction methods. Most commonly, the SIMCOMP algorithm is used to calculate the drug-drug compound similarity (called SIMCOMP similarity), and the normalized Smith-Waterman score serves as the target-target sequence similarity (called SW similarity) [18,26,28,34]. Other features or data have also been used to calculate similarity: for example, drug-drug similarity based on ATC codes [28], molecular fingerprints [18,26], SMILES [31], and drug side effects [15], and target-target similarity based on semantic ontology [18,28], gene sequences [13], and binary characteristics of pseudo-amino acid composition [12]. These similarities are based on characteristics of the drug (target) itself and lack the topological information of the drug-target interaction network. The Gaussian interaction profile similarity was introduced to solve this problem [6,9,11,16,21,22]; it emphasizes the contribution of different elements in corresponding positions of the drug (target) profiles. Generally, the Gaussian interaction profile similarity, which carries topological information, needs to be fused with other similarities, such as SIMCOMP and SW. The most common way is to weight these similarities equally [11,16,21,22] or to establish linear weights through multiple kernel learning [9].
3 Method

3.1 Problem Formalization
The DTI prediction problem can be regarded as a classification problem that infers the interaction confidence score between a drug and a target. Let $Y \in \mathbb{R}^{m \times n}$ be the interaction matrix between $m$ drugs and $n$ targets, with $y_{ij} = 1$ if drug $d_i$ interacts with target $t_j$ and $y_{ij} = 0$ otherwise. DTI prediction makes use of the known interactions $Y$ together with the drug-drug similarities $S_d \in \mathbb{R}^{m \times m}$ and the target-target similarities $S_t \in \mathbb{R}^{n \times n}$ to predict new interactions between candidate drug compounds and candidate targets that have no currently known interaction.

3.2 The Gaussian Interaction Profile Similarity Improvement
The drug interaction profile is the interaction relationship data between a drug and all targets, and the Gaussian interaction profile similarity is a value calculated with a Gaussian function on the drug interaction profiles to measure the similarity of the topological structure of two drugs. In the calculation of the Gaussian interaction profile similarity, only positions where the two profiles are in different states contribute, while positions in the same state do not. However, the same state covers two cases with opposite meanings, namely $(y_{ik}, y_{jk}) = (0, 0)$ and $(y_{ik}, y_{jk}) = (1, 1)$. It is therefore necessary to take these cases into account to increase the discrimination of the Gaussian interaction profile similarity. This paper redefines the Gaussian interaction profile parameter $\gamma_d$ as follows, with the sum running over all $n_t$ targets:

$$\gamma_d = \gamma'_d \Big/ \Big( \sum_{k=1}^{n_t} f(y_{ik}, y_{jk}) + 1 \Big), \qquad \gamma'_d = 1 \tag{1}$$

where

$$f(a, b) = \begin{cases} 1, & a = 1,\ b = 1 \\ 1/2, & a = 0,\ b = 0 \\ 0, & \text{others} \end{cases}$$

Then, the improved Gaussian interaction profile similarity of the drugs $d_i, d_j$ is calculated as follows:

$$s^g_{ij} = \exp\left( -\gamma_d \lVert y_{d_i} - y_{d_j} \rVert^2 \right) \tag{2}$$

Similarly, the improved Gaussian interaction profile similarity of the targets $t_i, t_j$ can be calculated.

3.3 Similarity Fusion Coefficient Based on Information Entropy
To fuse the Gaussian interaction profile similarity with the SIMCOMP (SW) similarity, an appropriate fusion coefficient needs to be determined. The widely used average weight coefficient does not adequately reflect the contribution of the Gaussian interaction profile similarity in the fusion of different similarities. Therefore, to obtain a more accurate similarity after fusion, this paper proposes a similarity fusion coefficient based on information entropy. First, define the information entropy $H(X)$ of the Gaussian interaction profile of the drugs $d_i, d_j$:

$$H(X) = H(x_1 x_2 \cdots x_n) = H(x_1) + H(x_2 \mid x_1) + \cdots + H(x_i \mid x_{i-1} \cdots x_1) + \cdots + H(x_n \mid x_{n-1} \cdots x_1) \tag{3}$$

where $x_k = (y_{ik}, y_{jk})$ represents the interaction relationship between the drugs $d_i, d_j$ and the target $t_k$, and $X$ is a representation of the Gaussian interaction profile between the drugs $d_i, d_j$, as shown in Table 1.

Table 1. The Gaussian interaction profile between drugs $d_i$, $d_j$

|       | $t_1$    | $t_2$    | ··· | $t_n$    |
|-------|----------|----------|-----|----------|
| $d_i$ | $y_{i1}$ | $y_{i2}$ | ··· | $y_{in}$ |
| $d_j$ | $y_{j1}$ | $y_{j2}$ | ··· | $y_{jn}$ |
| $X$   | $x_1$    | $x_2$    | ··· | $x_n$    |

Then, to calculate the information entropy $H(x_k)$, we need to enumerate all cases of $x_k = (y_{ik}, y_{jk})$. We therefore define the binary variable $a_{ij}$ to represent the following interaction relationship:

$$a_{ij} = \begin{cases} T, & \text{interaction between } d_i, t_j \\ F, & \text{no interaction between } d_i, t_j \end{cases} \tag{4}$$

It should be noted that $a_{ij} = F$ means that there is no interaction, not that the interaction is unknown. Thus, $y_{ij} = 1$ corresponds to $a_{ij} = T$, whereas $y_{ij} = 0$ corresponds to two cases, $a_{ij} = T$ or $a_{ij} = F$. Therefore, according to the definition of information entropy, $H(x_k)$ can be calculated as follows:

$$H(x_k) = -\sum_{b, c \in \{T, F\}} p(a_{ik} = b, a_{jk} = c) \log p(a_{ik} = b, a_{jk} = c) \tag{5}$$
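As a numerical illustration of Eq. (5), the sketch below computes the joint entropy of a single pair state $x_k$. The text does not say how $p(a_{ik} = b, a_{jk} = c)$ is estimated, so the sketch rests on assumptions that are not from the paper: an observed interaction ($y = 1$) is certainly T, an unobserved entry ($y = 0$) is T with a hypothetical prior probability `q`, the two drugs are treated as independent, and the logarithm is taken in base 2. All function names are illustrative.

```python
import math
from itertools import product


def state_probability(y: int, a: str, q: float) -> float:
    """P(a | y) under the assumed model: y = 1 is certainly 'T';
    y = 0 is 'T' with hypothetical prior q and 'F' otherwise."""
    if y == 1:
        return 1.0 if a == "T" else 0.0
    return q if a == "T" else 1.0 - q


def pair_state_entropy(y_ik: int, y_jk: int, q: float = 0.1) -> float:
    """Joint entropy H(x_k) of Eq. (5), in bits, for x_k = (y_ik, y_jk).
    Independence between the two drugs is an assumption of this sketch."""
    entropy = 0.0
    for b, c in product("TF", repeat=2):
        p = state_probability(y_ik, b, q) * state_probability(y_jk, c, q)
        if p > 0.0:  # the term 0 * log 0 is taken as 0
            entropy -= p * math.log2(p)
    return entropy


h_11 = pair_state_entropy(1, 1)  # both interactions observed
h_10 = pair_state_entropy(1, 0)  # one observed, one unresolved
h_00 = pair_state_entropy(0, 0)  # both unresolved
```

Under these assumptions a shared known interaction carries no uncertainty, while a shared zero carries twice the uncertainty of a mixed pair, because both entries remain unresolved.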
Finally, considering the rate of uncertainty reduction as the Gaussian interaction profile of the drugs $d_i, d_j$ goes from completely unknown to its current state, this paper defines the similarity fusion coefficient $\lambda_{ij}$ as follows:

$$\lambda_{ij} = \frac{H(X^0) - H(X)}{H(X^0) - H(X^1)} \tag{6}$$

where $X^0$ represents the all-0 state of the Gaussian interaction profile and $X^1$ represents the all-1 state. Similarly, the similarity fusion coefficient of the targets $t_i, t_j$ can be calculated.

3.4 Similarity Fusion
Based on the drug SIMCOMP (or target Smith-Waterman) similarity, the improved Gaussian interaction profile similarity of Eq. (2), and the fusion coefficient of Eq. (6), the fused similarity $\hat{s}_{ij}$ is calculated as follows:

$$\hat{s}_{ij} = (1 - \lambda_{ij})\, s_{ij} + \lambda_{ij}\, s^g_{ij} \tag{7}$$

where $s_{ij}$ is the SIMCOMP (SW) similarity of the drugs $d_i, d_j$ (targets $t_i, t_j$) and $s^g_{ij}$ is their Gaussian interaction profile similarity.
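The pipeline of Sects. 3.2 to 3.4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it reads Eq. (1) as dividing $\gamma'_d$ by the weighted count of shared states plus one, it takes the entropies in Eq. (6) as given inputs rather than deriving them, and the function names, toy profiles, base similarity, and entropy values are all made up for illustration.

```python
import math


def shared_state_weight(y_i, y_j):
    """Sum over k of f(y_ik, y_jk): 1 for a shared 1, 1/2 for a shared 0, else 0."""
    weight = 0.0
    for a, b in zip(y_i, y_j):
        if a == 1 and b == 1:
            weight += 1.0
        elif a == 0 and b == 0:
            weight += 0.5
    return weight


def improved_gip_similarity(y_i, y_j, gamma_prime=1.0):
    """Eqs. (1)-(2), with gamma_d read as gamma_prime / (sum_k f + 1).
    The division is this sketch's assumption about Eq. (1)."""
    gamma_d = gamma_prime / (shared_state_weight(y_i, y_j) + 1.0)
    squared_distance = sum((a - b) ** 2 for a, b in zip(y_i, y_j))
    return math.exp(-gamma_d * squared_distance)


def fusion_coefficient(h_x0, h_x, h_x1):
    """Eq. (6): relative reduction of uncertainty, starting from the all-0 state."""
    return (h_x0 - h_x) / (h_x0 - h_x1)


def fused_similarity(s_ij, s_g_ij, lam):
    """Eq. (7): convex combination of the base and the profile similarity."""
    return (1.0 - lam) * s_ij + lam * s_g_ij


# Toy interaction profiles of two drugs over five targets (made-up data).
y_a = [1, 1, 0, 0, 0]
y_b = [1, 0, 0, 0, 1]
s_g = improved_gip_similarity(y_a, y_b)
lam = fusion_coefficient(h_x0=10.0, h_x=6.0, h_x1=0.0)  # hypothetical entropies
s_fused = fused_similarity(s_ij=0.3, s_g_ij=s_g, lam=lam)  # 0.3: hypothetical SIMCOMP value
```

When $\lambda_{ij} = 0$ the fused value falls back to the base similarity, and when $\lambda_{ij} = 1$ it equals the profile similarity, so the coefficient decides how much weight the interaction profile receives.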
4 Result Analysis and Discussion

4.1 Datasets
In this paper, the four benchmark datasets, Enzyme, IC, GPCR, and NR, are from Yamanishi et al. [36] and have been regarded as gold standard datasets for comparing DTI prediction algorithms. The numbers of drugs $m$, targets $n$, and known DTIs in these datasets are summarized in Table 2. The sparsity value, the last column of Table 2, equals the number of known DTIs divided by the total number of drug-target pairs $m \times n$ and thus reflects the proportion of positive samples. The low sparsity values of the four datasets confirm the serious imbalance between positive and negative samples. In addition, each dataset provides two similarity matrices: the drug-drug SIMCOMP similarity matrix $S_d \in \mathbb{R}^{m \times m}$ and the target-target SW similarity matrix $S_t \in \mathbb{R}^{n \times n}$.

Table 2. Summary of four benchmark datasets

| Dataset | Drugs | Targets | Interactions | Sparsity value |
|---------|-------|---------|--------------|----------------|
| Enzyme  | 445   | 664     | 2926         | 0.010          |
| IC      | 210   | 204     | 1476         | 0.034          |
| GPCR    | 223   | 95      | 635          | 0.030          |
| NR      | 54    | 26      | 90           | 0.064          |
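The sparsity values in Table 2 can be checked with a short calculation. The sketch below recomputes two candidate ratios from the drug, target, and interaction counts in the table: the share of known DTIs among all $m \times n$ pairs, which is the quantity that matches the tabulated values after rounding, and the ratio of known to unknown pairs for comparison. The dictionary layout and function names are illustrative.

```python
# Counts copied from Table 2: (drugs m, targets n, known interactions).
DATASETS = {
    "Enzyme": (445, 664, 2926),
    "IC": (210, 204, 1476),
    "GPCR": (223, 95, 635),
    "NR": (54, 26, 90),
}


def sparsity(m: int, n: int, known: int) -> float:
    """Share of known DTIs among all m * n drug-target pairs."""
    return known / (m * n)


def known_to_unknown(m: int, n: int, known: int) -> float:
    """Ratio of known DTIs to pairs without a known interaction."""
    return known / (m * n - known)


# Sparsity rounded to three decimals, as in the last column of Table 2.
table = {name: round(sparsity(*counts), 3) for name, counts in DATASETS.items()}

for name, counts in DATASETS.items():
    print(f"{name:7s} sparsity={sparsity(*counts):.3f} "
          f"known/unknown={known_to_unknown(*counts):.3f}")
```

For the larger datasets the two ratios are close, but for the small NR dataset they differ visibly in the third decimal, which is why the denominator matters when reproducing the table.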
4.2 Evaluation Metrics and Experimental Setup
To evaluate the performance of the prediction methods, computational experiments are conducted on the four benchmark datasets, and the area under the precision-recall curve (AUPR), which heavily punishes highly ranked false positive predictions [27], is used as the metric. ALADIN [2], BLMMLKNN [38], MLKNNSC [28], BLMNII [22], NRLMF [19], and WKNN [17] serve as comparison methods. It is worth noting that this paper proposes an improved similarity rather than a complete DTI prediction method. Therefore, we replace the corresponding similarity in each of those methods with the improved similarity. Cross-validation (CV) over drugs and/or targets is performed on the drug-target interaction matrix $Y$. Considering the particularity of operating on this matrix, we use the same experimental setup as Liu et al. [17] and distinguish three DTI prediction settings according to whether the drug and target involved in a test pair are included in the training set:

– S2: predict the DTIs between test drugs $\bar{D}$ and training targets $T$.
– S3: predict the DTIs between training drugs $D$ and test targets $\bar{T}$.
– S4: predict the DTIs between test drugs $\bar{D}$ and test targets $\bar{T}$.

where $\bar{D}$ is a set of test drugs disjoint from the training drug set $D$ (i.e., $\bar{D} \cap D = \emptyset$), and $\bar{T}$ is a set of test targets disjoint from $T$.

Because the numbers of drugs, targets, and known DTIs are small, a different CV scheme is needed for each of the 3 prediction settings. In S2, drug-wise CV is applied: one drug fold, together with the corresponding rows of $Y$, is separated for testing. In S3, target-wise CV is used: one target fold, together with the corresponding columns of $Y$, is left out for testing. Block-wise CV is applied to S4: a drug fold and a target fold are split off together with the interactions between them (a sub-matrix of $Y$) for testing, and the interactions between the remaining drugs and targets are used for training. Two repetitions of 10-fold CV are applied to S2 and S3, and two repetitions of 3-fold block-wise CV, which contains 9 block folds generated by 3 drug folds and 3 target folds, are applied to S4. Each method has a set of parameters to be optimized, such as γ, α, and λ in BLMNII and k and the cut-off threshold in MLKNNSC. Therefore, each method is assigned the same parameter search range in the CV, and the optimal parameters are obtained by an inner CV; that is, for each test fold, we perform CV again on the training data to obtain the optimal parameters. To obtain repeatable experimental results, we set the same random seed wherever random initialization is required.
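The three settings can be made concrete with a small index-level sketch. It assumes that drugs and targets are addressed by the row and column indices of Y and shows how one test fold is carved out in S2, S3, and S4. The helper names, the way folds are dealt out, and the seeds are illustrative assumptions and are not taken from Liu et al. [17].

```python
import random


def make_folds(n_items: int, n_folds: int, seed: int = 0):
    """Shuffle indices 0..n_items-1 and deal them into n_folds disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatability
    return [indices[f::n_folds] for f in range(n_folds)]


def split_pairs(m, n, test_drugs=(), test_targets=()):
    """Return (train_pairs, test_pairs) as lists of (drug, target) indices.

    S2: pass only test_drugs   -> held-out rows against training targets.
    S3: pass only test_targets -> training drugs against held-out columns.
    S4: pass both              -> held-out block; every row and column that
                                  touches a held-out item leaves the training set.
    """
    test_drugs, test_targets = set(test_drugs), set(test_targets)
    train_drugs = [i for i in range(m) if i not in test_drugs]
    train_targets = [j for j in range(n) if j not in test_targets]
    train = [(i, j) for i in train_drugs for j in train_targets]
    rows = sorted(test_drugs) if test_drugs else train_drugs
    cols = sorted(test_targets) if test_targets else train_targets
    test = [(i, j) for i in rows for j in cols]
    return train, test


m, n = 6, 4  # toy interaction matrix with 6 drugs and 4 targets
drug_folds = make_folds(m, 3)
target_folds = make_folds(n, 2, seed=1)
s2_train, s2_test = split_pairs(m, n, test_drugs=drug_folds[0])
s3_train, s3_test = split_pairs(m, n, test_targets=target_folds[0])
s4_train, s4_test = split_pairs(m, n, drug_folds[0], target_folds[0])
```

Iterating over every combination of a drug fold and a target fold in the S4 call yields the block folds; with 3 drug folds and 3 target folds this gives the 9 blocks described above.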
4.3 Results
In this section, we compare and analyze the performance gain brought by the improved similarity. As mentioned above, this paper proposes an improved similarity, not a complete DTI prediction method. However, the improved similarity can be applied to almost all DTI prediction methods that require drug-drug and target-target similarities. Therefore, we simply replace the original similarity with the improved similarity and observe whether the prediction performance of these methods improves. In the following, GF denotes the combination of the improved Gaussian interaction profile similarity and the fusion coefficient based on information entropy. Figure 1 shows the AUPR values on the 4 benchmark datasets in S2, namely predicting interactions between unknown drugs and known targets. The vertical axis of each histogram represents the AUPR value, and the horizontal axis represents the datasets. The original method is called base and marked in blue, and the method with the improved similarity is called base+GF and marked in red, as shown in the legend. On all 4 datasets, every red bar is higher than the corresponding blue bar, which shows that the improved similarity improves the prediction performance of all six comparison methods, regardless of whether the method is the best or the worst. Specifically, on the NR dataset, WKNN is the best base method with an AUPR of 0.500, and the improved method, WKNN+GF, still gains 0.102, reaching 0.602. The worst-performing method is ALADIN, whose AUPR is only 0.433; the improved method, ALADIN+GF, still improves slightly, reaching 0.462. In terms of the size of the improvement, WKNN and BLMNII gain more than the other methods. This also shows that although WKNN is currently the best-performing prediction method, the improved similarity, +GF, still brings it a large performance improvement.
Fig. 1. Results of comparison methods in terms of AUPR in S2. Panels: (a) ALADIN, (b) BLMNII, (c) BLMMLKNN, (d) MLKNNSC, (e) NRLMF, (f) WKNN. (Color figure online)

Figure 2 shows the comparison of the AUPR values on the 4 benchmark datasets in S3, namely predicting interactions between known drugs and unknown targets. In the S3 setting, the improved similarity also improves the prediction performance of all six comparison methods, regardless of whether the method is the best or the worst. For WKNN and BLMNII, however, the improvement in S3 is less significant than that in S2. We also found that all methods have superior prediction performance on the IC dataset, and the improved similarity, +GF, still obtains a large performance improvement.

Fig. 2. Results of comparison methods in terms of AUPR in S3. Panels: (a) ALADIN, (b) BLMNII, (c) BLMMLKNN, (d) MLKNNSC, (e) NRLMF, (f) WKNN.

S4 is one of the most challenging prediction settings. Unlike S2 and S3, in which either the drugs or the targets are known, S4 has to predict the interactions between unknown drugs and unknown targets. Because of this challenge, almost all prediction methods have low AUPR values on all four benchmark datasets (see Fig. 3). Nevertheless, the improved similarity, +GF, still works: for every one of the 4 benchmark datasets and 6 comparison methods, it increases the AUPR value.
Fig. 3. Results of comparison methods in terms of AUPR in S4. Panels: (a) ALADIN, (b) BLMNII, (c) BLMMLKNN, (d) MLKNNSC, (e) NRLMF, (f) WKNN.
In short, in the 3 different prediction scenarios, the improved similarity, +GF, is effective on all 4 benchmark datasets and all 6 comparison methods.
5 Conclusion

In this paper, we propose an improved algorithm for similarity calculation that involves two aspects: an improved calculation of the Gaussian interaction profile similarity and a similarity fusion coefficient based on information entropy. Together they make the fused drug-drug and target-target similarities more accurate. The experimental results on 6 comparison methods show that the improved similarity improves the prediction performance of all of them. In the future, we will explore more kinds of similarity fusion methods, not limited to fusing the Gaussian interaction profile similarity with one other similarity.
References

1. Allen, W.J., Balius, T.E., Mukherjee, S., et al.: Dock 6: impact of new features and current docking performance. J. Comput. Chem. 36(15), 1132–1156 (2015)
2. Buza, K., Peska, L.: ALADIN: a new approach for drug–target interaction prediction. In: Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Džeroski, S. (eds.) ECML PKDD 2017. LNCS (LNAI), vol. 10535, pp. 322–337. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71246-8_20
3. Chen, R., Liu, X., Jin, S., et al.: Machine learning for drug-target interaction prediction. Molecules 23(9), 2208 (2018)
4. Chen, X., Yan, C.C., Zhang, X., et al.: Drug-target interaction prediction: databases, web servers and computational models. Brief. Bioinform. 17(4), 696–712 (2016)
5. Cheng, T., Hao, M., Takeda, T., et al.: Large-scale prediction of drug-target interaction: a data-centric review. AAPS J. 19(5), 1264–1275 (2017)
6. Chu, Y., Kaushik, A.C., Wang, X., et al.: DTI-CDF: a cascade deep forest model towards the prediction of drug-target interactions based on hybrid features. Brief. Bioinform. 22(1), 451–462 (2021)
7. Dickson, M., Gagnon, J.P.: Key factors in the rising cost of new drug discovery and development. Nat. Rev. Drug Discov. 3(5), 417–429 (2004)
8. Ding, H., Takigawa, I., Mamitsuka, H., et al.: Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Brief. Bioinform. 15(5), 734–747 (2014)
9. Ding, Y., Tang, J., Guo, F.: Identification of drug-target interactions via fuzzy bipartite local model. Neural Comput. Appl. 32, 10303–10319 (2020)
10. Gorbalenya, A.E.: Severe acute respiratory syndrome-related coronavirus-the species and its viruses, a statement of the coronavirus study group. BioRxiv online (2020). https://doi.org/10.1101/2020.02.07.937862
11. Hao, M., Wang, Y., Bryant, S.H.: Improved prediction of drug-target interactions using regularized least squares integrating with kernel fusion technique. Anal. Chim. Acta 909, 41–50 (2016)
12. He, Z., Zhang, J., Shi, X.H., et al.: Predicting drug-target interaction networks based on functional groups and biological features. PLoS One 5(3), e9603 (2010)
13. Jain, E., Bairoch, A., Duvaud, S., et al.: Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinform. 10, 136 (2009)
14. Keiser, M.J., Setola, V., Irwin, J.J., et al.: Predicting new molecular targets for known drugs. Nature 462(7270), 175–181 (2009)
15. Kuhn, M., Campillos, M., Letunic, I., et al.: A side effect resource to capture phenotypic effects of drugs. Mol. Syst. Biol. 6(1), 343 (2010)
16. van Laarhoven, T., Nabuurs, S.B., Marchiori, E.: Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27(21), 3036–3043 (2011)
17. Liu, B., Pliakos, K., Vens, C., Tsoumakas, G.: Drug-target interaction prediction via an ensemble of weighted nearest neighbors with interaction recovery. Appl. Intell. 1–23 (2021). https://doi.org/10.1007/s10489-021-02495-z
18. Liu, H., Zhang, W., Nie, L., et al.: Predicting effective drug combinations using gradient tree boosting based on features extracted from drug-protein heterogeneous network. BMC Bioinform. 20, 645 (2019)
19. Liu, Y., Wu, M., Miao, C., Zhao, P., Li, X.L.: Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLoS Comput. Biol. 12(2), e1004760 (2016)
20. Lotfi Shahreza, M., Ghadiri, N., Mousavi, S.R., et al.: A review of network-based approaches to drug repositioning. Brief. Bioinform. 19(5), 878–892 (2018)
21. Luo, H., Wang, J., Li, M., et al.: Drug repositioning based on comprehensive similarity measures and Bi-random walk algorithm. Bioinformatics 32(17), 2664–2671 (2016)
22. Mei, J.P., Kwoh, C.K., Yang, P., et al.: Drug-target interaction prediction by learning from local information and neighbors. Bioinformatics 29(2), 238–245 (2013)
23. Morris, G.M., Huey, R., Lindstrom, W., et al.: AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30(16), 2785–2791 (2009)
24. Paul, S.M., Mytelka, D.S., Dunwiddie, C.T., et al.: How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nat. Rev. Drug Discov. 9(3), 203–214 (2010)
25. Perkins, R., Fang, H., Tong, W., et al.: Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environ. Toxicol. Chem. 22(8), 1666–1679 (2003)
26. Pliakos, K., Vens, C., Tsoumakas, G.: Predicting drug-target interactions with multi-label classification and label partitioning. IEEE/ACM Trans. Comput. Biol. Bioinf. 18(4), 1596–1607 (2021)
27. Schrynemackers, M., Küffner, R., Geurts, P.: On protocols and measures for the validation of supervised methods for the inference of biological networks. Front. Genet. 4, 262 (2013)
28. Shi, J.Y., Yiu, S.M., Li, Y., et al.: Predicting drug-target interaction for new drugs using enhanced similarity measures and super-target clustering. Methods 83, 98–104 (2015)
29. Sydow, D., Burggraaff, L., Szengel, A., et al.: Advances and challenges in computational target prediction. J. Chem. Inf. Model. 59(5), 1728–1742 (2019)
30. Wang, M., Cao, R., Zhang, L., et al.: Remdesivir and chloroquine effectively inhibit the recently emerged novel coronavirus (2019-nCov) in vitro. Cell Res. 30, 269–271 (2020)
31. Weininger, D.: Smiles, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28(1), 31–36 (1988)
32. Wu, F., Zhao, S., Yu, B., et al.: A new coronavirus associated with human respiratory disease in China. Nature 579(7798), 265–269 (2020)
33. Wu, Z., Li, W., Liu, G., et al.: Network-based methods for prediction of drug-target interactions. Front. Pharmacol. 9, 1134 (2018)
34. Xia, L.Y., Yang, Z.Y., Zhang, H., et al.: Improved prediction of drug-target interactions using self-paced learning with collaborative matrix factorization. J. Chem. Inf. Model. 59(7), 3340–3351 (2019)
35. Xu, X., Chen, P., Wang, J., et al.: Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission. Sci. China Life Sci. 63(3), 457–460 (2020)
36. Yamanishi, Y., Araki, M., Gutteridge, A., et al.: Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24(13), i232–i240 (2008)
37. Yu, D., Liu, G., Zhao, N., et al.: FPSC-DTI: drug-target interaction prediction based on feature projection fuzzy classification and super cluster fusion. Molecular Omics 16(6), 583–591 (2020)
38. Zhang, W., Liu, F., Luo, L., et al.: Predicting drug side effects by multi-label learning and ensemble learning. BMC Bioinform. 16, 365 (2015)
39. Zhou, L., Li, Z., Yang, J., et al.: Revealing drug-target interactions with computational models and algorithms. Molecules 24(9), 1714 (2019)
A Deep Learning Approach Based on Feature Reconstruction and Multi-dimensional Attention Mechanism for Drug-Drug Interaction Prediction

Jiang Xie1(B), Jiaming Ouyang1, Chang Zhao1, Hongjian He1, and Xin Dong2(B)

1 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
[email protected]
2 School of Medicine, Shanghai University, Shanghai 200444, China
Abstract. Drug-drug interactions occur when two or more drugs are taken simultaneously or successively. Early discovery of drug-drug interactions can effectively prevent medical accidents and reduce medical costs. Many methods for discovering drug-drug interactions already exist, but they still leave considerable room for performance improvement. We propose a new deep learning approach named FM-DDI, based on feature reconstruction and a multi-dimensional attention mechanism, for drug-drug interaction prediction. The feature reconstruction extracts low-dimensional but informative vector representations of drug features from heterogeneous data sources, which helps prevent information loss. The deep neural network model based on the multi-dimensional attention mechanism gives high weight to critical feature dimensions, which allows it to learn critical information effectively. FM-DDI achieves substantial performance improvement over several state-of-the-art methods for drug-drug interaction prediction. The results indicate that FM-DDI can provide a valuable tool for extracting and learning drug features to predict new drug-drug interactions.

Keywords: Drug-drug interactions · Feature reconstruction · Multi-dimensional attention mechanism

1 Introduction
Drug-drug interactions (DDIs) refer to the phenomenon that one drug changes the pharmacological effect of the other drug when two or more drugs are taken simultaneously or successively [4]. DDIs may cause unexpected adverse drug side effects [3]. Early discovery of DDIs can effectively prevent medical accidents and reduce medical costs. Early researchers collected drug data from the literature, reports, etc., to predict DDIs, and then some researchers proposed machine learning methods to predict DDIs [10].

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 400–410, 2021. https://doi.org/10.1007/978-3-030-91415-8_34
Current machine-learning-based DDI prediction methods fall roughly into three categories: similarity-based, network-based, and classification-based methods.

Similarity-based methods assume that drugs with similar properties tend to interact with the same drugs [15]. Early research used molecular structure similarity to identify new DDIs [15]. More recent work constructed four sub-models from the features of each drug and combined them in a joint deep neural network (DNN) to predict DDI-related events [2].

Network-based methods embed a graph into a low-dimensional space that retains the structural information of the graph, and then use the learned low-dimensional representation as features for prediction. Decagon, a graph convolutional neural network, was designed to run on large multimodal graphs [22]. Building on this model, the tri-graph information propagation (TIP) model improved prediction accuracy as well as time and space efficiency [17].

Classification-based methods formulate DDI prediction as a binary classification task: predicting whether a drug pair interacts. Drug information is represented as features, and the interactions between drugs are represented as class labels. Earlier studies applied integrative methods that combined chemical, biological, phenotypic, and network data to predict potential DDIs [20]. Another machine learning method integrated heterogeneous information from different databases and predicted adverse drug reactions (ADRs) of combined medication by building highly credible negative samples [21].

Although these works have made significant progress in DDI prediction, new difficulties have emerged as the amount of data increases. One challenge is selecting effective drug features from a large amount of data.
We propose a new deep learning approach named FM-DDI based on feature reconstruction and multi-dimensional attention mechanism (Attention-DNN) for DDI prediction. The feature reconstruction extracts low-dimensional but informative vector representations of features for the drug from heterogeneous data sources such as drug molecular fingerprints and correlation information. The low-dimensional feature vectors learned by the feature reconstruction capture drug features of different scales and dimensions, which can prevent information loss.
2 Methods

2.1 Data Description and Preprocessing
Four sets of data (substructure, target, enzyme, pathway) compiled by previous research [2] are used as benchmarks to evaluate the performance of the proposed FM-DDI. These data sets contain 74,528 known DDIs among 572 drugs from DrugBank 4.0 [8], together with the SMILES information of the drugs and the 1162 targets (T), 202 enzymes (E), and 957 pathways (P) associated with them. In previous studies, the chemical structure information of a drug generally came from a single type of chemical substructure representation, namely one molecular fingerprint. However, the complementarity between multiple types of features helps improve performance [1]. In this paper, we regard different types of substructures as drug features and extract multiple chemical substructure representations for each drug.
We use the open-source Chemistry Development Kit (CDK) [14], which is commonly used in DDI prediction, to generate the various substructures. For each type of substructure information obtained through CDK, a drug is represented as a binary vector of length L_m, where m denotes the specific substructure extraction method. The value (1 or 0) at each index indicates whether the corresponding substructure feature bit is present. As shown in Table 1, three types of chemical substructure, Klekota-Roth, Daylight, and MACCS, are used in this study. The corresponding chemical substructure features are named S_K, S_D, and S_M, with vector lengths of 4860, 1024, and 166, respectively. From the association information between each drug and its targets (T), enzymes (E), and pathways (P), three further drug feature vectors are generated, with 1162, 202, and 957 bits, respectively. These associations are also expressed as binary relationships, where "1" represents a known interaction and "0" represents an unknown interaction. Each drug is therefore represented by six feature vectors: three substructure representations and three association representations. As shown in Fig. 1A, we feed these six feature vectors into the feature reconstruction.

Table 1. The three chemical substructure features.

Feature name   Substructure   Length   Description
S_K            Klekota-Roth   4860     Fingerprint defined by Klekota and Roth
S_D            Daylight       1024     Considers paths of a given length
S_M            MACCS          166      The popular MACCS keys described by MDL

2.2 Feature Reconstruction
Feature selection has been successfully applied in many bioinformatics research fields, such as sub-Golgi protein localization [9]. This paper proposes an improved form of feature selection, feature reconstruction, which extracts low-dimensional but informative features. As shown in Fig. 1B, we first select the important features of the drug with random forest (RF) and Light Gradient Boosting Machine (LGBM). The K most important feature bits are used to generate the cross features. The cross features and the original features are then merged to produce the synthetic features. Finally, a similarity matrix is calculated separately from each of the six groups of synthetic features. A useful property of RF and LGBM is that they indicate which features are the most important for classification [6]. We use these two machine learning methods to calculate the importance of features from different angles.
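The steps in this paragraph, which the rest of this section details, can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the raw importance scores are toy numbers standing in for the outputs of trained RF and LGBM models, a cross bit is taken to be the logical AND of two top-K bits (one reading of "exist together"), and the Jaccard similarity of two all-zero vectors is defined as 0.

```python
from itertools import combinations


def min_max(scores):
    """Scale raw importance scores to [0, 1] (largest -> 1, smallest -> 0)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # every bit is equally important
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def combined_importance(rf_scores, lgbm_scores):
    """Average the two normalised importance vectors, bit by bit."""
    return [(r + l) / 2
            for r, l in zip(min_max(rf_scores), min_max(lgbm_scores))]


def synthetic_feature(bits, importance, k):
    """Append K(K-1)/2 cross bits to the original binary feature vector."""
    top = sorted(range(len(bits)), key=lambda i: importance[i], reverse=True)[:k]
    # Assumption: a cross bit is 1 only when both top-K bits are 1.
    cross = [bits[i] & bits[j] for i, j in combinations(sorted(top), 2)]
    return list(bits) + cross


def jaccard(u, v):
    """Jaccard similarity of two binary vectors (0.0 if both are all zero)."""
    both = sum(a & b for a, b in zip(u, v))
    either = sum(a | b for a, b in zip(u, v))
    return both / either if either else 0.0


def similarity_matrix(features):
    """Pairwise drug-drug similarity matrix for one feature type."""
    return [[jaccard(u, v) for v in features] for u in features]


# Toy data: 3 drugs, 4 feature bits, K = 3 (the paper uses K = 20).
rf_raw = [0.10, 0.40, 0.05, 0.30]   # hypothetical RF importances
lgbm_raw = [12, 30, 3, 21]          # hypothetical LGBM importances
importance = combined_importance(rf_raw, lgbm_raw)
drugs = [[1, 1, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
synthetic = [synthetic_feature(d, importance, k=3) for d in drugs]
similarity = similarity_matrix(synthetic)
```

With K = 20 on the 4860-bit S_K vector, the same construction yields 190 cross bits and a 5050-bit synthetic feature, matching the sizes reported in the text.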
Fig. 1. The pipeline of the FM-DDI. A. Data preparation: FM-DDI extracts various structural information of drugs and associations related to targets, enzymes, pathways, and known DDIs. B. Feature reconstruction: RF and LGBM are used to select important features. The most important K-bit features are used to generate the cross features. Then the cross feature and the original feature are merged to produce the synthetic features. The similarity matrix is calculated separately through the six groups of synthetic features. C. Attention-DNN construction: In Attention-DNN, the attention mechanism assigns weights to six-dimensional features to accurately predict DDI.
Taking S_K as an example, we score the 4860 feature bits of S_K after training RF and LGBM, and normalize the scores to quantify their relative importance. After normalization, every importance value lies between 0 and 1: the most important feature bit has an importance value of 1, and the least important feature bit has an importance value of 0. We then average the two values to obtain the final importance value $I_i$ of each feature bit:

$$ I_i = \frac{I_i^{R} + I_i^{L}}{2} \tag{1} $$

where $I_i^{R}$ is the RF importance value of the i-th feature bit, $I_i^{L}$ is the LGBM importance value of the i-th feature bit, and $I_i$ is the final importance value of the i-th feature bit. As shown in Fig. 2, we calculate the importance of every bit of the six feature types in the same way.

We propose cross features to represent the co-existence of two critical features. As shown in Fig. 3, we select the K most important bits (K = 20) from the 4860-bit S_K, combine these K bits pairwise to generate a K(K−1)/2-bit (190-bit) cross feature, and merge it with the original 4860-bit feature into a new 5050-bit synthetic feature. Each bit of the cross feature indicates whether two of the top-K features exist together. We apply the same method to all six feature types to obtain their corresponding synthetic features.

To reduce the high dimensionality and sparsity of the synthetic features and to give the Attention-DNN a unified input, we use the synthetic features to generate drug similarity matrices. As shown in Fig. 1B, after the synthetic feature is extracted, S_K grows from the original 4860 bits to 5050 bits, and more than 98% of its elements have a value of 0. To reduce this sparsity and obtain features on a uniform scale, we compress the features: instead of using the bit vectors as input, we use the Jaccard similarity metric [19] to calculate pairwise drug-drug similarities. We thus obtain six 572 × 572 drug similarity matrices and use them as the input of the Attention-DNN model.

Fig. 2. Ranked feature importance for the six feature types.

Fig. 3. The synthetic feature generation process, taking S_K as an example.

2.3 Attention-DNN Construction
A traditional DNN has certain shortcomings, such as the inability to select effective information from multi-dimensional features. The attention mechanism can compensate for this deficiency by focusing on local information [5]. Therefore, after the feature reconstruction, we introduce the attention mechanism into the DNN to strengthen the contribution of critical feature dimensions. We propose a deep neural network based on a multi-dimensional attention mechanism, named Attention-DNN. Unlike traditional attention, the Attention-DNN attends to the six feature dimensions of the input, namely S_K, S_D, S_M, T, E, and P, rather than to each feature bit. As shown in Fig. 4, the Attention-DNN applies attention to the DNN and uses the multi-dimensional attention mechanism to assign a weight to each of the six feature dimensions. The weight represents the importance of the dimension, and the weights of the six dimensions sum to 1. The larger the weight of a feature dimension, the more important it is, and the more attention it should receive. The Attention-DNN is described by:

$$ a = \sigma(W(x)) \tag{2} $$

$$ y = a \ast x \tag{3} $$

where $x$ is the original input feature, $W(x)$ is a fully connected layer, $\sigma$ is the Softmax function, which makes the outputs sum to 1 across the dimensions, $a$ is the attention value, and $y$ is the output feature carrying the weight information. After calculating the weight of each dimension with Softmax, we fuse the original features with the weights by element-wise multiplication to obtain the new features.
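Equations (2) and (3) can be illustrated with a small standard-library sketch. The text does not specify how the six inputs are laid out or what shape the fully connected layer has, so the following are assumptions made only for illustration: the six feature blocks are concatenated before the fully connected layer, the layer produces one logit per dimension, and every value in a block is scaled by that block's weight. The function name `dimension_attention` and the toy weights are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def dimension_attention(blocks, weights, bias):
    """Weight each feature block by a softmax attention score.

    blocks  : list of feature blocks (one per dimension) for one sample
    weights : one row per dimension, each row spanning the flattened input
    bias    : one bias value per dimension
    """
    x = [v for block in blocks for v in block]            # concatenate blocks
    logits = [sum(w * v for w, v in zip(row, x)) + b      # W(x): one logit per dimension
              for row, b in zip(weights, bias)]
    a = softmax(logits)                                   # Eq. (2)
    y = [[a_d * v for v in block]                         # Eq. (3): scale each block
         for a_d, block in zip(a, blocks)]
    return a, y


# Toy input: six dimensions (S_K, S_D, S_M, T, E, P) with two values each.
blocks = [[0.2, 0.4], [0.9, 0.7], [0.1, 0.1], [0.5, 0.5], [0.0, 0.3], [0.6, 0.2]]
# Hypothetical weights: the logit of dimension d is simply the sum of block d.
weights = [[1.0 if j // 2 == d else 0.0 for j in range(12)] for d in range(6)]
bias = [0.0] * 6
attention, weighted = dimension_attention(blocks, weights, bias)
```

In a trained network the weights would be learned jointly with the rest of the model; the fixed weights here only make the behaviour easy to follow.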
Fig. 4. Traditional DNN (three layers) and the Attention-DNN.
The Attention-DNN of FM-DDI has 65 output neurons, representing the 65 DDI types considered in this study [2]. The activity value of each output neuron lies between 0 (there is no interaction between the drug pair) and 1 (the interaction between the drug pair has the highest activity), and can be interpreted as a probability [16]. The output neuron with the highest activity determines the DDI type predicted for a given drug pair.
3 Experiments and Results

3.1 Evaluation Metrics
We evaluate the prediction performance of FM-DDI using a five-fold cross validation procedure, in which 80% of the drug pairs are randomly selected as the training set, and the remaining 20% of the drug pairs are used as the test set. In order to evaluate the performance objectively, we only use the training set during the feature reconstruction.
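The five-fold protocol can be sketched as follows. This is a minimal illustration: the function name, the fixed seed, and the use of integer indices for drug pairs are assumptions, and the fold sizes are exactly 80%/20% only when the number of pairs is divisible by five.

```python
import random


def five_fold_splits(n_pairs, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each cross-validation fold."""
    idx = list(range(n_pairs))
    random.Random(seed).shuffle(idx)                 # random assignment of pairs
    folds = [idx[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        # Anything fitted on data (e.g. feature importances) should use `train` only.
        yield train, test


splits = list(five_fold_splits(100))
```

Each drug pair appears in the test set of exactly one fold, so every pair is predicted once by a model that never saw it during training.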
Our task is multi-class classification. We adopt accuracy (ACC), area under the ROC curve (AUC), area under the precision–recall curve (AUPR), F1 score, precision (Pre), and recall (Rec) as the evaluation metrics:

$$ ACC = \frac{TP + TN}{TP + FP + TN + FN} \tag{4} $$

$$ AUC = \sum_{i=1}^{N} TP_i \, \Delta FP_i \tag{5} $$

$$ AUPR = \sum_{j=1}^{N} Pre_j \, \Delta Rec_j \tag{6} $$

$$ F1 = \frac{2 \cdot Rec \cdot Pre}{Rec + Pre} \tag{7} $$

$$ Pre = \frac{TP}{TP + FP} \tag{8} $$

$$ Rec = \frac{TP}{TP + FN} \tag{9} $$

where TP means true positive, TN means true negative, FP means false positive, FN means false negative, i indexes the true-positive/false-positive operating points, and j indexes the precision/recall operating points. Note that in an imbalanced learning problem, F1 and AUPR are the most informative evaluation metrics, as they provide more comprehensive measures than the other metrics [18].

3.2 Experimental Setup
Here we consider the number of layers, the optimizer, the learning rate, and the dropout rate, all of which may affect the performance of the FM-DDI model. First, we discuss the number of layers in the Attention-DNN. We set the rule that each layer has half as many neurons as the previous layer. We consider 2, 3, 4, and 5 hidden layers and adopt the three-layer structure because it achieves the best performance. To optimize the model, we use the Adam optimizer [7] to train for up to 100 epochs with a learning rate of 0.3, and stop training if the validation loss does not decrease for 10 epochs [11]. This strategy helps prevent overfitting while considerably speeding up the training process. We apply standard dropout [13] to the hidden layer units so that the model generalizes well to unobserved drug pairs. We vary the dropout rate from 0 to 0.5 in steps of 0.1 and obtain the highest ACC when the dropout rate is 0.3.

3.3 Influence of Feature Reconstruction
In this section, we examine whether the feature reconstruction helps predict DDIs. We evaluate the performance of FM-DDI by training the model with different values of K, namely 0, 5, 10, 15, 20, and 25. Setting K to 0 means that no feature reconstruction is performed. As shown in Table 2, the experiments clearly support the effectiveness of the feature reconstruction and show that the best overall results are obtained when K is set to 20.
Table 2. The performance with different sizes of K on the Attention-DNN.

K    Cross features   ACC      ROC-AUC   PR-AUC   F1       Rec      Pre
0    0                0.8955   0.9977    0.9515   0.7888   0.8254   0.7909
5    10               0.8960   0.9982    0.9517   0.7923   0.8293   0.7916
10   45               0.8967   0.9983    0.9510   0.8008   0.8308   0.8079
15   105              0.8968   0.9983    0.9525   0.7950   0.8450   0.8112
20   190              0.9070   0.9985    0.9618   0.8267   0.8507   0.8226
25   300              0.9071   0.9983    0.9601   0.8134   0.8454   0.8201

3.4 Influence of Multi-dimensional Attention Mechanism
Besides the feature reconstruction, we evaluate the effectiveness of the multi-dimensional attention mechanism for DDI prediction by comparing the traditional DNN with the Attention-DNN. As shown in Table 3, the Attention-DNN performs better than the DNN. The results validate the ability of the Attention-DNN to identify critical feature dimensions.

Table 3. The performance of Attention-DNN and DNN.

Method          ACC      ROC-AUC   PR-AUC   F1       Rec      Pre
Attention-DNN   0.9070   0.9985    0.9618   0.8267   0.8507   0.8226
DNN             0.8934   0.9981    0.9499   0.8039   0.8312   0.7879

3.5 Comparison with Existing State-of-the-Art Methods
Besides the traditional DNN, we compare FM-DDI with two recent competing methods (DDIMDL [2] and DeepDDI [12]) and two traditional machine learning methods (RF and k-nearest neighbors (KNN)) to evaluate the performance of FM-DDI. Figure 5 shows that FM-DDI scores higher than the other methods on all evaluation metrics. The ACC, AUC, AUPR, F1, Rec and Pre obtained by FM-DDI are 0.9070, 0.9985, 0.9618, 0.8267, 0.8507 and 0.8226, respectively, which are better than those of DDIMDL (0.8852, 0.9976, 0.9208, 0.7585, 0.8471 and 0.7182), DeepDDI (0.8371, 0.9961, 0.8899, 0.6848, 0.7275 and 0.6611), and DNN (0.8797, 0.9963, 0.9134, 0.7223, 0.8047 and 0.7027). Compared with DDIMDL, ACC, AUC, AUPR, F1, Rec and Pre increase by 2.4%, 0.1%, 4.5%, 9%, 0.4% and 14.5%, respectively. In addition, FM-DDI is also superior to the two traditional machine learning methods, RF and KNN.
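The quoted percentages appear to be relative gains over DDIMDL rather than absolute differences. A quick check with the scores from the paragraph above (the helper name `relative_gain` is illustrative) gives values that agree with the quoted figures to within rounding:

```python
# Scores quoted in the text for FM-DDI and DDIMDL.
fm_ddi = {"ACC": 0.9070, "AUC": 0.9985, "AUPR": 0.9618,
          "F1": 0.8267, "Rec": 0.8507, "Pre": 0.8226}
ddimdl = {"ACC": 0.8852, "AUC": 0.9976, "AUPR": 0.9208,
          "F1": 0.7585, "Rec": 0.8471, "Pre": 0.7182}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


gains = {name: relative_gain(fm_ddi[name], ddimdl[name]) for name in fm_ddi}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```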
Moreover, the precision-recall curves of the above methods are shown in Fig. 6. The area under the precision-recall curve of FM-DDI is larger than that of all other methods. These results improve on previous reports and show that FM-DDI can effectively predict DDIs.
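For reference, the area under a precision-recall curve in the sense of Eq. (6) can be approximated by summing precision times the change in recall over the operating points. The sketch below handles a single class (one-vs-rest); how the 65 per-class values are averaged is not stated in the text, so no averaging is shown. The function names and the rectangle-rule convention (recall starting at 0) are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 where undefined)."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1


def aupr(points):
    """Rectangle-rule area under a precision-recall curve.

    points: (recall, precision) operating points sorted by increasing recall.
    Computes sum_j Pre_j * (Rec_j - Rec_{j-1}) with Rec_0 = 0.
    """
    area, prev_rec = 0.0, 0.0
    for rec, pre in points:
        area += pre * (rec - prev_rec)
        prev_rec = rec
    return area


pre, rec, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
area = aupr([(0.5, 1.0), (1.0, 0.5)])
```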
Fig. 5. The performance of different methods.
Fig. 6. The precision-recall curves of different methods.
4 Conclusion
Early discovery of DDIs can effectively prevent medical accidents and reduce medical costs. Many methods for discovering DDIs already exist, but they still leave considerable room for performance improvement, and selecting effective drug features from a large amount of data is therefore of great significance. We propose a new deep learning approach named FM-DDI, based on feature reconstruction and a multi-dimensional attention mechanism, to predict DDIs. We first extract three chemical substructure representations and other drug information as drug features. We then perform the feature reconstruction to obtain the similarity matrices. Based on these features, we construct the Attention-DNN to predict the interactions between drugs. FM-DDI achieves substantial performance improvement over several state-of-the-art methods, indicating that it can provide a valuable tool for extracting and learning drug features to predict new DDIs.
References

1. Chu, Y., et al.: DTI-MLCD: predicting drug-target interactions using multilabel learning with community detection method. Briefings Bioinf. 22(3), bbaa205 (2021)
2. Deng, Y., Xu, X., Qiu, Y., Xia, J., Zhang, W., Liu, S.: A multimodal deep learning framework for predicting drug-drug interaction events. Bioinformatics 36(15), 4316–4322 (2020)
3. Edwards, I.R., Aronson, J.K.: Adverse drug reactions: definitions, diagnosis, and management. Lancet 356(9237), 1255–1259 (2000)
4. Foucquier, J., Guedj, M.: Analysis of drug combinations: current methodological landscape. Pharmacol. Res. Perspect. 3(3), e00149 (2015)
5. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
6. Kastrin, A., Ferk, P., Leskošek, B.: Predicting potential drug-drug interactions on topological and semantic similarity features using statistical learning. PLOS ONE 13(5), e0196865 (2018)
7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Law, V., et al.: DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 42(D1), D1091–D1097 (2014)
9. Lv, Z., Wang, P., Zou, Q., Jiang, Q.: Identification of sub-Golgi protein localization by use of deep representation learning features. Bioinformatics 36(24), 5600–5609 (2020)
10. Percha, B., Garten, Y., Altman, R.B.: Discovery and explanation of drug-drug interactions via text mining. Pac. Symp. Biocomput. 2012, 410–421 (2012)
11. Prechelt, L.: Early stopping - but when? In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 53–67. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_5
12. Ryu, J.Y., Kim, H.U., Lee, S.Y.: Deep learning improves prediction of drug-drug and drug-food interactions. Proc. Nat. Acad. Sci. 115(18), E4304–E4311 (2018)
13. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
14. Steinbeck, C., Han, Y., Kuhn, S., Horlacher, O., Luttmann, E., Willighagen, E.: The chemistry development kit (CDK): an open-source java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci. 43(2), 493–500 (2003)
15. Vilar, S., Harpaz, R., Uriarte, E., Santana, L., Rabadan, R., Friedman, C.: Drug-drug interaction through molecular structure similarity analysis. J. Am. Med. Inform. Assoc. 19(6), 1066–1074 (2012)
16. Wan, E.A.: Neural network classification: a Bayesian interpretation. IEEE Trans. Neural Netw. 1(4), 303–305 (1990)
17. Xu, H., Sang, S., Lu, H.: Tri-graph information propagation for polypharmacy side effect prediction. arXiv preprint arXiv:2001.10516 (2020)
18. Zeng, M., Zhang, F., Wu, F.X., Li, Y., Wang, J., Li, M.: Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics 36(4), 1114–1120 (2020)
19. Zhang, P., Wang, F., Hu, J., Sorrentino, R.: Label propagation prediction of drug-drug interactions based on clinical side effects. Sci. Rep. 5(1), 1–10 (2015)
20. Zhang, W., Chen, Y., Liu, F., Luo, F., Tian, G., Li, X.: Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinform. 18(1), 1–12 (2017)
21. Zheng, Y., Peng, H., Zhang, X., Zhao, Z., Yin, J., Li, J.: Predicting adverse drug reactions of combined medication from heterogeneous pharmacologic databases. BMC Bioinform. 19(19), 49–59 (2018)
22. Zitnik, M., Agrawal, M., Leskovec, J.: Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34(13), i457–i466 (2018)
OrgaNet: A Deep Learning Approach for Automated Evaluation of Organoids Viability in Drug Screening

Xuesheng Bian1,2, Gang Li3, Cheng Wang1,2(B), Siqi Shen1, Weiquan Liu1, Xiuhong Lin1, Zexin Chen4, Mancheung Cheung5, and XiongBiao Luo1

1 Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University, Xiamen 361005, China
[email protected], [email protected]
2 National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
3 Otolaryngology-Head and Neck Surgery, Nanfang Hospital, Southern Medical University, Guangzhou 510000, China
4 Zhuhai UM Science and Technology Research Institute, The University of Macau, Zhuhai 519000, China
5 Accurate International Biotechnology (GZ) Company, Guangzhou 510000, China
Abstract. An organoid, a 3D in vitro cell culture, closely resembles the tissue or organ in vivo from which it is derived, which makes organoids widely used in personalized drug screening. Although organoids play an essential role in drug screening, existing methods struggle to evaluate organoid viability accurately and therefore remain limited in robustness and accuracy. Determination of adenosine triphosphate (ATP) is a mature way to analyze cell viability and is commonly used in drug screening. However, the ATP bioluminescence technique has an inherent flaw: all living cells are lysed during ATP determination. It is therefore an end-point method, which assesses cell viability only in the current state and cannot track how cell viability changes before and after medication. In this paper, we propose a deep learning based framework, OrgaNet, for evaluating organoid viability from organoid images. It is a straightforward and repeatable solution that promotes the reliability of drug screening. OrgaNet consists of three parts: a feature extractor, which extracts the representation of organoids; a multi-head classifier, which improves feature robustness through supervised learning; and a scoring function, which measures organoid viability through contrastive learning. To optimize OrgaNet, we constructed the first dedicated dataset, annotated by seven experienced experts. Experiments demonstrate that OrgaNet shows great potential in organoid viability evaluation: it provides an alternative way to evaluate organoid viability, and its predictions correlate highly with the ATP bioluminescence technique.

Availability: https://github.com/541435721/OrgaNet

Keywords: Organoid · Drug screening · Microscopy image · ATP

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 411–423, 2021. https://doi.org/10.1007/978-3-030-91415-8_35

1 Introduction
Since 2019, a new coronavirus disease (COVID-19) has ravaged the globe, leading to a considerable loss of life and property. Clinically, a large number of drugs have been applied to treat COVID-19, but they have not achieved reliable results [4,19]. Individual differences between patients further increase the challenge of treatment, since different patients may respond differently to the same drug. If COVID-19 patients try a variety of drugs with unclear efficacy, they may face a higher risk. For the safety of patients, it is therefore necessary to create a safe, efficient, and accurate personalized drug screening model instead of having patients take drugs directly. The existing drug screening methods, in vitro tests and in vivo tests, each have their own advantages and disadvantages: 1) In vitro tests can be precisely controlled, but they cannot reflect the actual state of the derived organism [5,14]; 2) In vivo tests reflect the effects of drugs on the whole organism, but their cost is much higher. An organoid, a 3D in vitro cell culture, combines the merits of both in vitro and in vivo tests. Organoids can better simulate the histogenesis, and even the physiological and pathological states, of the derived organs or tissues [17]. Organoids therefore yield more realistic measurements of drug efficacy and toxicity in drug screening, helping to formulate an effective personalized medication plan. In practice, various factors affect organoid culture; for example, there are significant individual differences between the organoid samples used for drug screening. However, the selected organoid samples must be consistent before dosing. At present, the selection of organoid samples relies on manual observation (which depends on only a few geometric parameters and ambiguous morphological features) or on cell counting (an organoid is a cell complex, for which a precise cell count is hard to obtain), neither of which is rigorous.
Ignoring the above uncertainty may lead drug screening to generate unreliable results. In drug screening, the ATP bioluminescence technique is usually used to evaluate cell viability the after drug [11] (the process is shown in Fig. 1). To measure the bioluminescence value and detect ATP, the cell membranes and cell walls should be dissolved to release the ATP. There are three important steps performing drug screening with ATP: (1) Detecting total ATP under different concentrations of different drugs with ATP bioluminescence technique, then regarding the total ATP as the viability of living cells. (2) Fitting the doseresponse curve of a particular drug based on the amount of ATP at different concentrations. (3) Calculating the drug concentration when the cell viability remain 50% and take it as the drug efficacy evaluation value IC50 . Essentially, the lower the IC50 value, the better drug efficacy. However, there are two limitations of ATP bioluminescence technique: (1) ATP bioluminescence technique cannot accurately describe the effectiveness and toxicity of drugs as there are normal tissue organoids and focal tissue organoids in the sample. The global statistics of the cell viability cannot distinguish whether
OrgaNet: A Deep Learning Approach for Automated Evaluation
413
Fig. 1. The processes of ATP bioluminescence technique and its limitations. (1) Collect target tissue from a living body. (2) Process the derived tissue and extract the cell suspension. (3) Move the cells into the well plate and culture for specific days. (4) Select organoid samples to add different drugs and determine the cell viability of different drugs at different concentrations. (5) Fit the drug concentration and viability curve based on the cell viability determined in the previous step. (6) Calculate the drug efficacy evaluation value (IC50 ). This figure is created with BioRender.com.
the drug influences the healthy tissue or the lesion. (2) It is difficult to estimate the true IC50 value because the viability of living cells after medication changes dynamically. Specifically, researchers obtain different cell viability values at different times, which affects the fitting result of the dose-response curve and the calculation of IC50. The cell viability keeps changing, but the ATP bioluminescence technique cannot track how cell viability changes over time. Thus, given these two limitations, we need a method that evaluates organoid viability quickly, accurately, and repeatedly. Specifically, the ideal method helps select suitable organoid samples for drug screening, reduces the uncertainty in drug screening, and also provides a better way to observe the detailed process of change in organoid viability. High-throughput microscopy imaging is an important way to observe the growth of organoids. Of note, researchers roughly judge the viability of organoids from their morphology [16]. Therefore, the growth characteristics of organoids can be discovered from a visual perspective. Recently, computer vision technology has seen many successful applications in the medical field [3,8]. In particular, combined with deep learning methods represented by convolutional neural networks (CNNs), medical image processing technology has advanced greatly. However, few artificial intelligence (AI) based studies focus on organoid-based drug screening.
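Steps (2) and (3) of the ATP-based screening procedure described earlier in this section can be illustrated with a small sketch. It is not the fitting procedure used in practice: instead of fitting a full dose-response curve, it assumes viability is given as a fraction of the untreated control and simply interpolates linearly on the log-concentration axis. The function name and the readings are hypothetical.

```python
import math


def ic50_by_interpolation(concentrations, viabilities):
    """Estimate IC50 by linear interpolation on log10(concentration).

    concentrations: increasing, positive drug concentrations.
    viabilities:    viability at each concentration as a fraction of the
                    untreated control (1.0 means 100%).
    Returns the concentration where viability crosses 0.5, or None if
    the measurements never cross 50%.
    """
    points = list(zip(concentrations, viabilities))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        # First interval in which viability falls through 0.5.
        if v_lo >= 0.5 >= v_hi and v_lo != v_hi:
            t = (v_lo - 0.5) / (v_lo - v_hi)
            log_c = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    return None


# Hypothetical readings, not data from the paper.
conc = [0.1, 1.0, 10.0, 100.0]
viab = [0.95, 0.80, 0.20, 0.05]
ic50 = ic50_by_interpolation(conc, viab)
```

Because the viability readings change with the measurement time, repeating this calculation on readings taken at different times would generally give different IC50 values, which is the second limitation discussed above.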
X. Bian et al.
It is worth noting that the viability evaluation of organoids is a challenging task for the following reasons: (1) The manually labeled samples are divided into several grades, which makes it difficult to judge the viability difference between two organoids of the same grade. If the number of grades is increased, more errors may be introduced. (2) All the organoids are distributed to seven experts for labeling. The differences in the expertise of these experts cause inconsistent labeling, i.e., noisy labels. Therefore, traditional supervised learning is not completely applicable to our task of AI-based organoid viability evaluation. To solve the above challenges, we propose a novel neural network framework, OrgaNet, to evaluate organoid viability and alleviate the limitations of the ATP bioluminescence technique. First, we introduce a multi-head classifier to alleviate the negative impact of the uncertain annotations created by different experts. Second, we use an entropy minimization strategy to improve the clustering performance of the samples. Finally, inspired by the idea of contrastive learning, we use a scoring function to map the input organoid to a scalar in the range (0, 1.0) that represents its viability. Experimental results reveal that the viability predicted by OrgaNet is highly consistent with the values given by the ATP bioluminescence technique. Meanwhile, our proposed method can be performed repeatedly without affecting the subsequent drug screening process. In summary, the contributions of this paper are outlined below:
1. To our knowledge, we are the first to propose a dedicated dataset for organoid viability evaluation. All data are meticulously annotated by seven experienced experts and can be used for organoid detection, viability evaluation, etc.
2. We propose the first deep neural network, OrgaNet, for accurately quantifying the viability of organoids, which can be used in drug screening.
3. Experiments show that the viability predicted by our method is highly correlated with the values measured by the traditional ATP method.
2 Related Work
In recent years, deep learning has become one of the most popular approaches in computer vision and has brought impressive performance improvements on traditional vision tasks [7,9]. Many studies have demonstrated that convolutional neural networks (CNNs) effectively extract features with high discriminative ability for visual tasks, called deep features. The representation ability of deep features is much stronger than that of traditional image features, such as SIFT [10], LBP [13], etc. Deep features are learned automatically by deep neural networks, eliminating the need for manually designed feature descriptors. Many networks designed for image recognition [7] have been successfully applied as backbone networks for object detection [15], semantic segmentation [9], and other tasks. In particular, for tasks with little data, using pre-trained parameters to initialize the feature extractor also achieves strong generalization ability [24]. Meanwhile, the pre-training strategy significantly reduces the demand for training data and the risk of overfitting.
Fig. 2. Process of acquiring the high-throughput organoid image. (1) 96-well plate. (2) Diagram of capturing the organoid image by microscope. (3) 3D structure of organoids. (4) Field of view in a single well. (5) Imaging at different focal planes. (6) Images under each FOV. (7) Panoramic image. (8) Z-stack Images. (9) High-throughput image.
Although supervised learning, a typical deep learning method, has achieved remarkable performance in various computer vision tasks, it relies on a large amount of accurately labeled training data. In practice, especially in biomedicine, producing a large number of accurate annotations costs a lot of labor and time, severely restricting the progress of deep learning technology in this field. In particular, divergent diagnoses of the same case by different experts make it hard to obtain precise labels. As CNNs have strong learning ability, these uncertain annotations easily make the network learn redundant representations irrelevant to the current task, causing the network to overfit. Previous studies have proposed a series of solutions for noisy labels, including loss function design [6], sample selection [12], noise models [21], and so on. Deep learning methods are popular in many fields. However, few studies have been proposed for organoid applications. Chen et al. proposed an AI-based model to evaluate tumor spheroid behavior in 3D culture [2]. Moreover, Bian et al. used a deep learning model to achieve organoid detection and tracking in high-throughput images of organoids [1].
3 Proposed Method

3.1 Dataset Construction
High-Throughput Microscope Image Capture. All organoids are cultured on 96-well plates for 8–9 days. Since the camera cannot cover the entire well in a single shot, it takes multiple shots to obtain a panoramic view of the organoids in each well. The microscope divides each well into 4 × 4 fields of view (FOVs), and the area of each field is 1389 µm × 1389 µm (with overlap). To image organoids on different focal planes clearly, we collect images at different focal lengths under each FOV (the resolution of the image obtained under each field of view and focal length is 1992 × 1992). Finally,
we stitch all patches from each FOV under each focal plane into a panoramic image and project all panoramic images into a high-throughput image. The high-throughput microscopy imaging process is illustrated in Fig. 2.

Image Annotation. To reduce the tedious work of labeling a large number of samples and to avoid misjudgments caused by personal tendencies, we invited seven experienced experts to label our data in detail. We obtained 174 high-throughput images in total, and all organoids in each high-throughput image were manually annotated by the experts with bounding boxes. In total, we obtained 6696 organoids from the 174 high-throughput images. According to the diameter, outer wall shape, light transmittance, growth potential, and other factors of the organoids, the experts comprehensively graded the organoids on a six-level scale: 0, 1, 2, 3, 4, and 5. To deal with the divergence among the annotations of the seven experts, we divided the 174 high-throughput images into eight groups containing 13, 23, 23, 23, 23, 23, 23, and 23 images. The 13 images of group 1 are shared by all experts, and the remaining seven groups are assigned one to each of the seven experts. Each expert thus annotates 36 (13 + 23) high-throughput images, which contain 1269 organoid patches. Since the organoids are manually annotated by 7 experts, there is considerable divergence in the labels; that is, different experts give different evaluations of the same organoid. Such divergence is quite common in biomedicine. By analysing the annotations of the 13 images shared by the 7 experts, we found that the annotation differences are very serious, and only a tiny part of the samples reach a consensus.

3.2 Problem Formulation
Here, we introduce some notation to describe the research problem mathematically. Let $X \subset \mathbb{R}^d$ be the data space and $Y = \{0, 1, 2, 3, 4, 5\}$ be the label space. Given training data $D = \{x_i, y_i\}$ ($y_i^j$ is the label given by the $j$-th expert), we aim to find a model $F(x, \theta) \in \mathbb{R}$ such that if the viability of organoid $x_u$ is worse than that of $x_v$, then $F(x_u, \theta) < F(x_v, \theta)$. For subjective or objective reasons, the grades given to the same sample by different experts may vary: $\exists\, y_i^s \neq y_i^r$ where $s \neq r$. Thus, we regard this problem as learning an organoid representation and a scoring function under inconsistent labels.

3.3 OrgaNet and Loss Function
To evaluate organoid viability from a visual perspective, we need to extract the most discriminative organoid representations, which should then be accurately quantified by a scoring function. In particular, we must consider two challenges: (1) The annotations of different experts are inconsistent, and different experts assign different grades to the same organoid. (2) As the annotations of experts are discrete, it is difficult to distinguish which one is
Fig. 3. The architecture of OrgaNet.
better between two organoids given the same grade by experts. To address the first challenge, we propose a multi-head classifier that simulates the different habits of each expert, preventing the model from biasing the evaluation toward specific experts. To address the second challenge, we introduce a contrastive learning strategy that focuses on pairs of organoids with apparent differences, in order to learn organoid representations with strong discriminative ability for organoid viability. In addition, to make the representations more robust, we take the similarity between organoid samples into consideration and introduce an unsupervised strategy that pulls similar organoids together and pushes distinct organoids apart in the feature space. Our proposed OrgaNet consists of three parts: (1) a CNN-based feature extractor, used to extract the discriminative representation for organoid viability evaluation; (2) a multi-head classifier, used to simulate the evaluating habits of different experts to enhance the robustness of the feature extractor; (3) a scoring function, a fully connected network, used to integrate organoid features and output organoid viability. The detailed network structure of OrgaNet is shown in Fig. 3.

Feature Extractor. As CNNs have a strong ability to extract image representations, we use a convolutional neural network as our feature extractor to learn a highly discriminative representation for organoid viability evaluation. The feature extractor (G) in OrgaNet is pluggable: it can be replaced by any other feature extractor without modifying the overall network architecture. We obtain the deep features (h) of the input organoid (x) through the feature extractor, h = G(x). We used four different CNNs to validate the performance: Vgg [20], ResNet [7], GoogleNet [22], and Inception [23]. The convolutional layers are reused as our feature extractor, and the fully connected layers and the classifier are removed.

Multi-head Classifier. Since seven experts individually annotate the 174 high-throughput images, there are many individual differences in the annotations. Some samples, graded by all seven experts, have six or fewer different labels. The remaining samples are annotated by one of the seven experts.
Inspired by the noise model hypothesis [21], which assumes that a corrupted label is obtained by mapping the true label through a label transition matrix, we suppose that the grades from different experts are mapped from shared features, and that the disagreement arises from the different weights each expert gives to each feature component. Since the experts are mainly concerned with the organoid diameter, outer wall shape, light transmittance, etc., we hypothesize that the image features extracted by different experts are shared, and we regard the judgments of the experts as different mappings of the same representation. Therefore, we designed a multi-head classifier to simulate the labeling process of the experts. We use 7 fully connected networks ($C_j(h)$, where $j \in \{0, 1, 2, 3, 4, 5, 6\}$ stands for the ID of the expert) to simulate the decisions of the 7 experts respectively, and output 7 corresponding predictions $\hat{y}_j$ for each input organoid. Here, the input organoid representation (h) is shared by all classifiers, which means that the image features extracted by the seven experts are the same. Through iterative optimization, the feature extractor learns the organoid representation on which the seven experts rely heavily. Finally, we take these learned shared features as the basis for the viability evaluation of organoids.

Scoring Function. We aim to map any organoid to a scalar value that evaluates its viability. However, the existing data have no precise annotations that reflect organoid viability (only coarse and subjective grades), so it is difficult to optimize the network directly through supervised learning to output evaluation values. We found that the viability of organoids graded by the same expert is positively correlated with the grades; that is, an organoid with a higher grade has better viability than an organoid with a lower grade.
In addition, different experts agreed on the relative order of the same organoid pair. Although the experts gave different grades to the two organoids, the organoid with worse viability was always given a lower grade than the better one. Therefore, we introduce contrastive learning into our proposed model. Given any pair of organoid samples $\langle (x_u, y_u), (x_v, y_v) \rangle$, where $x_u$ and $x_v$ are sampled organoids, and the scoring function $M$: if $y_u^s < y_v^s$, then $M(G(x_u)) < M(G(x_v))$; if $y_u^s + 2 < y_v^r$ and $s \neq r$, then $M(G(x_u)) < M(G(x_v))$. Here, $s$ and $r$ are sampled from $\{0, 1, 2, 3, 4, 5, 6\}$ and stand for the IDs of the experts. The scoring function is a fully connected network.

Entropy Minimization Based Neighbour Constraints. It is not ideal to rely only on a classifier trained on uncertain annotations by supervised training. Consequently, we take the similarity between samples into account. We believe that similarity in the image space should be preserved in the feature space; that is, if $x_u$ and $x_v$ are similar or have similar expert scores, $h_u$ and $h_v$ should be close, and vice versa. Inspired by [18], we introduce the idea of clustering from unsupervised learning to impose additional constraints on the feature space. For the input organoid samples, we calculate the feature distance of all organoid pairs. To aggregate similar organoids and keep those with significant differences distant, we minimize the entropy of the feature distance. In this way, we can fully explore the potential relationships between samples. In addition, by combining supervised and unsupervised learning through the idea of multi-task learning, we make the most of all training data. This strategy enhances the generalization ability of the features and suppresses the risk of overfitting.

Loss Function. To describe in detail the loss function used to optimize OrgaNet, we restate some notation: $G$ represents the feature extractor mentioned above; $C_j$ represents the multi-head classifier, where the subscript $j$ is the index of the head; $M$ represents the scoring function, which outputs the viability assessment value of organoids. We feed in a batch of $b$ training samples $\{(x_1, y_1^*), (x_2, y_2^*), \ldots, (x_b, y_b^*)\}$, where $x_i$ represents the input sample and $y_i^*$ represents the annotation of $x_i$ by one or more experts. $h_i = G(x_i)$ is the representation of the sample $x_i$. Our loss function consists of three weighted terms, detailed in Eq. (1). The first term $L_{cls}$ is the cross-entropy loss for the multi-head classifier; $\mathbb{I}(x)$ in Eq. (2) is the indicator function, which is 1 if $x$ is true and 0 otherwise. The second term $L_{cluster}$ is the neighbour constraint loss, where $p_u^v$ represents the affinity of the samples $x_u$ and $x_v$. The third term $L_{comp}$ is the contrastive loss. $\alpha$ and $\beta$ are two hyper-parameters that balance the importance of the terms.

$$ L = L_{cls} + \alpha L_{cluster} + \beta L_{comp} \tag{1} $$

$$ L_{cls} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \mathbb{I}(\exists\, y_i^j) \times \left(1 - y_i^j \log\left(C_j(G(x_i))\right)\right) \tag{2} $$

$$ L_{cluster} = -\frac{1}{n} \sum_{u=1}^{n} \sum_{v=1, v \neq u}^{n} p_u^v \log(p_u^v), \quad \text{where } p_u^v = \frac{\exp(h_u \times h_v^{\top})}{Z_u}, \quad Z_u = \sum_{v=1, v \neq u}^{n} \exp(h_u \times h_v^{\top}) \tag{3} $$

$$ L_{comp} = \frac{1}{|S \neq 0|} \sum_{u=1}^{n} \sum_{v=1, v \neq u}^{n} \mathrm{sign}(x_u, x_v) \left(M(h_u) - M(h_v)\right) \tag{4} $$

$$ S_{u,v} = \mathrm{sign}(x_u, x_v) = \begin{cases} 1, & \text{if } y_u^j < y_v^j \text{ or } y_u^* + 2 < y_v^* \\ 0, & \text{if } y_u^* = y_v^* \\ -1, & \text{if } y_u^j > y_v^j \text{ or } y_u^* > 2 + y_v^* \end{cases} \tag{5} $$
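The three loss terms above can be sketched in plain Python to make the bookkeeping concrete. This is an illustrative sketch under several assumptions, not the authors' implementation: the classification term is written as the standard masked cross-entropy, each sample in the comparison term carries a single hypothetical `(expert_id, grade)` pair, features are plain lists of floats, and the default values of `alpha` and `beta` are placeholders since the text does not state them. All function names are hypothetical.

```python
import math


def cls_loss(probs, labels):
    """Masked multi-head cross-entropy, averaged over samples.

    probs[i][j][c]: probability head j assigns grade c to sample i.
    labels[i][j]:   grade from expert j, or None when expert j did not
                    label sample i (the indicator in Eq. (2)).
    """
    total = 0.0
    for p_i, y_i in zip(probs, labels):
        for p_ij, y_ij in zip(p_i, y_i):
            if y_ij is not None:
                total -= math.log(p_ij[y_ij])
    return total / len(labels)


def cluster_loss(feats):
    """Mean entropy of the softmax affinities p_u^v (needs >= 2 samples)."""
    n = len(feats)
    total = 0.0
    for u in range(n):
        sims = [sum(a * b for a, b in zip(feats[u], feats[v]))
                for v in range(n) if v != u]
        top = max(sims)  # shift by the max for numerical stability
        exps = [math.exp(s - top) for s in sims]
        z = sum(exps)
        total -= sum((e / z) * math.log(e / z) for e in exps if e > 0)
    return total / n


def pair_sign(label_u, label_v):
    """Eq. (5), assuming one (expert_id, grade) pair per sample."""
    (s, y_u), (r, y_v) = label_u, label_v
    gap = 0 if s == r else 2  # different experts need a gap of more than 2
    if y_u + gap < y_v:
        return 1
    if y_u > y_v + gap:
        return -1
    return 0


def comp_loss(scores, labels):
    """Comparison term of Eq. (4), averaged over pairs with nonzero sign."""
    total, nonzero = 0.0, 0
    for u in range(len(scores)):
        for v in range(len(scores)):
            if u == v:
                continue
            s = pair_sign(labels[u], labels[v])
            if s != 0:
                nonzero += 1
                total += s * (scores[u] - scores[v])
    return total / nonzero if nonzero else 0.0


def total_loss(probs, head_labels, feats, scores, pair_labels,
               alpha=1.0, beta=1.0):
    """Eq. (1); alpha and beta defaults are placeholders."""
    return (cls_loss(probs, head_labels)
            + alpha * cluster_loss(feats)
            + beta * comp_loss(scores, pair_labels))
```

With this sign convention, a pair that is ordered correctly by the scoring function contributes a negative value to the comparison term, so minimizing the term pushes the score of the better organoid above that of the worse one.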
4 Experiments and Results

4.1 Evaluation Metrics
Since we lack a gold standard for organoid viability, we validate our method by distinguishing the better and the worse organoid in each organoid pair,
which depends on the predicted scores. We can also use the viability ranking among multiple organoids to test the discriminative ability of the scoring function on multiple targets. In this paper, we take the metrics commonly used in classification tasks as the evaluation metrics for organoid pairs, and use ranking metrics for the viability evaluation of multiple organoids. For organoid pairs, we use Precision, Recall, and F1-score. For multiple organoids, we use Spearman's Footrule (SF) (Eq. (6)) and Average Overlap (AO) (Eq. (7)). Moreover, we collect 20 organoid images and measure ATP with the ATP bioluminescence technique to validate the correlation between the scores predicted by OrgaNet and the values from biological methods.

$$ SF = \sum_{i=0}^{N-1} \left| i - \mathrm{argmin}(x)(x_i) \right| \tag{6} $$

$$ AO = \frac{1}{N} \sum_{i}^{N} \left| \mathrm{range}(i) \;\&\; \mathrm{argmin}(x)[:i] \right| \tag{7} $$
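The two ranking metrics can be sketched as follows. This sketch uses the standard definitions rather than a literal transcription of Eqs. (6) and (7): the footrule is the sum of absolute rank displacements, divided here by its maximum value so that it lies in [0, 1], and the overlap at each depth is divided by that depth before averaging. Both normalizations are assumptions, and the function names are hypothetical.

```python
def ranking(scores):
    """Item indices sorted from the highest to the lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def spearman_footrule(pred_scores, true_scores):
    """Sum of absolute rank displacements, normalized to [0, 1]."""
    n = len(pred_scores)
    pred_pos = {item: pos for pos, item in enumerate(ranking(pred_scores))}
    true_pos = {item: pos for pos, item in enumerate(ranking(true_scores))}
    raw = sum(abs(pred_pos[i] - true_pos[i]) for i in range(n))
    max_raw = (n * n) // 2  # largest footrule distance between two rankings
    return raw / max_raw if max_raw else 0.0


def average_overlap(pred_scores, true_scores):
    """Mean over depths d of |top-d(pred) & top-d(true)| / d."""
    n = len(pred_scores)
    if n == 0:
        return 0.0
    pred, true = ranking(pred_scores), ranking(true_scores)
    total = 0.0
    for depth in range(1, n + 1):
        shared = len(set(pred[:depth]) & set(true[:depth]))
        total += shared / depth
    return total / n
```

Under these definitions, a perfect ranking gives a footrule of 0 and an average overlap of 1, which matches the direction of the values in Table 1 (lower footrule and higher overlap are better).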
4.2 Results
Since the only indicators that experts directly quantify are the width and height of organoids, we compare the predictions of the proposed OrgaNet with results that rely only on width and height. In addition, we use various mature network backbones as the feature extractor of OrgaNet to compare their performance. We collect another 20 images as the test set to validate the performance of OrgaNet. These 20 images contain 1853 organoids, which were scored by the seven experts, who reached a consensus through cooperation. According to Table 1, the deep learning-based scoring function performs better than the methods that depend directly on width or height. The results also differ across the network architectures used as feature extractors: ResNet [7] and GoogleNet [22] performed well on this task (Table 1). In addition, we introduce transfer learning and find that pre-training is slightly better than training from scratch. The ROC curves of OrgaNet with different feature extractors are illustrated in Fig. 4(a). Moreover, we use OrgaNet to predict the viability of the 20 organoid images sent for ATP detection, and compare the viability predicted by OrgaNet with the ATP values. The R-square measure between ATP and our combined features (the number of organoids, the sum of predicted scores, the mean of predicted scores, and the area proportion of organoids) reaches 0.83. We visualize the correlation between these features and ATP in Fig. 4(b).
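For reference, the R-square measure mentioned above can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a simplification: it fits a straight line to a single feature by least squares, whereas the paper combines four features, and the numbers are made up purely for illustration. The function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, fitted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up numbers for illustration only, not measurements from the paper.
score_sum = [1.0, 2.0, 3.0, 4.0]   # e.g. sum of predicted scores per image
atp = [2.1, 3.9, 6.2, 7.8]         # e.g. ATP reading per image
slope, intercept = fit_line(score_sum, atp)
fitted = [slope * s + intercept for s in score_sum]
r2 = r_squared(atp, fitted)
```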
Table 1. Results of OrgaNet with different settings

Method          Precision  Recall  F1-score  Spearman's footrule  Average overlap
Only width      0.849      0.894   0.871     0.865                0.615
Only height     0.857      0.906   0.881     0.95                 0.594
With Vgg16      0.926      0.934   0.93      0.518                0.723
With Resnet50   0.967      0.972   0.969     0.094                0.975
With GoogleNet  0.969      0.979   0.974     0.036                0.991
With inception  0.97       0.953   0.961     0.046                0.988
Fig. 4. (a) ROC curves of OrgaNet with different feature extractors. (b) Correlation between ATP and the predicted viability. The thick red line represents the ATP value and the thick blue line represents the fitted value of our proposed combined features. The thin lines are the components of the combined features. (Color figure online)
5 Conclusion
In this study, we construct the first dedicated dataset for organoid viability evaluation. Moreover, we propose a novel network, OrgaNet, to evaluate the viability of organoids. OrgaNet is based on microscopic images of organoids, so AI-based viability evaluation does not affect the development of organoids before or after drug treatment. OrgaNet can be used for organoid selection and for repeatedly observing the changes in organoid activity after drug treatment. It can also serve as a supplement to traditional ATP detection methods, which helps improve the reliability of experimental results in drug screening. In addition, OrgaNet is the first application of artificial intelligence technology to organoid viability evaluation and obtains satisfactory results, which demonstrates the great potential of artificial intelligence technology in this field. The combination of artificial intelligence technology and organoid technology will further promote the development of organoid drug screening.
Acknowledgements. This work is supported by China Postdoctoral Science Foundation (No. 2021M690094) and the China Fundamental Research Funds for the Central Universities (No. 20720210074).
References

1. Bian, X., Li, G., et al.: A deep learning model for detection and tracking in high-throughput images of organoid. Comput. Biol. Med. 134, 104490 (2021)
2. Chen, Z., Ma, N., et al.: Automated evaluation of tumor spheroid behavior in 3D culture using deep learning-based recognition. Biomaterials 272, 120770 (2021)
3. Christ, P.E., et al.: Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 415–423. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_48
4. Cunningham, A.C., Goh, H.P., et al.: Treatment of COVID-19: old tricks for new challenges (2020)
5. Emami, J., et al.: In vitro-in vivo correlation: from theory to applications. J. Pharm. Pharm. Sci. 9(2), 169–189 (2006)
6. Ghosh, A., Kumar, H., Sastry, P.: Robust loss functions under label noise for deep neural networks. In: AAAI, vol. 31 (2017)
7. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
8. Kuhlman, B., Bradley, P.: Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 20(11), 681–697 (2019)
9. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
10. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
11. Maehara, Y., Anai, H., et al.: The ATP assay is more sensitive than the succinate dehydrogenase inhibition test for predicting cell viability. Eur. J. Cancer Clin. Oncol. 23(3), 273–276 (1987)
12. Malach, E., Shalev-Shwartz, S.: Decoupling "when to update" from "how to update". In: NIPS, pp. 960–970 (2017)
13. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
14. Polli, J.E., et al.: Novel approach to the analysis of in vitro-in vivo relationships. J. Pharm. Sci. 85(7), 753–760 (1996)
15. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016)
16. Rios, A.C., Clevers, H.: Imaging organoids: a bright future ahead. Nat. Meth. 15(1), 24–26 (2018)
17. Rossi, G., et al.: Progress and potential in organoid research. Nat. Rev. Genet. 19(11), 671–687 (2018)
18. Saito, K., Kim, D., Sclaroff, S., et al.: Semi-supervised domain adaptation via minimax entropy. In: ICCV, pp. 8050–8058 (2019)
19. Shen, C., Wang, Z., Zhao, F., et al.: Treatment of 5 critically ill patients with COVID-19 with convalescent plasma. JAMA 323(16), 1582–1589 (2020)
20. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Computer Science (2014)
21. Sukhbaatar, S., Bruna, J., Paluri, M., et al.: Training convolutional networks with noisy labels (2014)
22. Szegedy, C., Liu, W., et al.: Going deeper with convolutions. In: CVPR, pp. 1–9 (2015)
23. Szegedy, C., Vanhoucke, V., Ioffe, S., et al.: Rethinking the inception architecture for computer vision. In: CVPR, pp. 2818–2826 (2016)
24. Yosinski, J., Clune, J., et al.: How transferable are features in deep neural networks? In: NIPS, pp. 3320–3328 (2014)
HGDD: A Drug-Disease High-Order Association Information Extraction Method for Drug Repurposing via Hypergraph

Shanchen Pang1, Kuijie Zhang1, Shudong Wang1, Yuanyuan Zhang1,2(B), Sicheng He1, Wenhao Wu1, and Sibo Qiao1

1 College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, Shandong, China
2 School of Information and Control Engineering, Qingdao University of Technology, Qingdao, Shandong, China
Abstract. Traditional drug research and development (R&D) methods are characterized by high risk and low efficiency. Drug repurposing provides a feasible way to improve the efficiency and safety of drug R&D. Since high-precision prediction of drug-disease associations can help quickly locate potential new treatment uses for existing drugs, accurately predicting drug-disease associations is the key to drug repurposing. In this paper, we propose a method to extract high-order drug-disease association information using a hypergraph, named Drug-Disease association prediction based on HyperGraph (HGDD). Specifically, HGDD first extracts the network topology information from the drug-disease association network based on a random walk strategy as the initial features of the nodes. Then, HGDD constructs the drug-disease hypergraph network based on the drug-disease association network. Finally, HGDD uses a hypergraph neural network (HGNN) to aggregate higher-order information on the hypergraph and predict the associations between drugs and diseases. Compared with other traditional drug repurposing methods, HGDD shows substantial performance improvement. The area under the precision-recall curve (AUPR) of HGDD is significantly higher than that of the other control experiments. A case study also shows that HGDD can discover new associations that do not exist in our dataset. These results indicate that HGDD is a reliable method for drug-disease association prediction.

Keywords: Drug-disease association prediction · Drug repurposing · High-order information · Hypergraph neural network
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 424–435, 2021. https://doi.org/10.1007/978-3-030-91415-8_36

1 Introduction

Drug research and development (R&D) is a process with a high failure rate, high cost, and slow progress, which consumes a great deal of manpower and material resources [1]. To alleviate this problem, the medical community has taken up the idea of finding new uses for old drugs, that is, drug repurposing [2]. One benefit of this strategy is that the tolerance, delivery characteristics, and kinetics of these drugs in the human body have long been known, so the drugs can be used more effectively and safely [3]. Another benefit is that drug repurposing requires less time and lower economic cost than developing a new drug from scratch [2–4]. How to accurately predict drug-disease associations is the core issue of drug repurposing [5]. Better drug-disease association prediction methods can reduce unnecessary experimental consumption, improve drug development efficiency, and save costs. With the improvement of drug databases and breakthroughs in graph-related algorithms, more and more association prediction methods have been proposed. These methods can be roughly divided into two categories. One is to add related networks, such as drug-target association networks, drug molecular structure similarity networks, drug side-effect networks, etc., to enrich the drug-disease association network information as much as possible and thereby improve the prediction accuracy [6]. The other is to improve the graph information extraction algorithm to mine as much useful information as possible [7]. Multi-network prediction has gradually become the mainstream method of drug-disease association prediction [6]. The deepDTnet model proposed by Zeng et al. [8] even uses 15 related networks. However, more related networks do not necessarily mean better performance. Zong et al. [9] found that adding different types of related objects and the corresponding connections does not always improve prediction performance. This is easy to understand: whenever a new network is added, it inevitably dilutes the information of the previous networks. To ensure prediction quality, the number of networks should still be controlled. Wang et al. [10] argue that although information from multiple sources is helpful for prediction, the data are usually not fully available, and the same type of data in different databases is not aligned.
Therefore, only a few drugs and diseases with rich information can be used for multi-network prediction. Until the various databases are well shared, it is more practical to develop a high-precision single-network prediction method. As drug-disease-related networks have been added, drug-disease association prediction algorithms have also developed continuously. They can be roughly divided into four categories: machine learning-based methods, network diffusion-based methods, matrix decomposition-based methods [11, 12], and deep learning-based methods [7, 13, 14]. Among them, deep learning algorithms are extremely powerful. The application of deep learning to graphs is the graph neural network [15]. A graph neural network (GNN) uses graph operators to aggregate node information on the graph network [16]. It is easily embedded in an end-to-end architecture and maintains a high degree of interpretability. Recently, it has shown convincing performance in biomedical network analysis. For example, the NIMCGCN model proposed by Li et al. [17] uses graph convolutional networks (GCN) to learn the latent feature representations of miRNAs and diseases from similarity networks, and then obtains the reconstructed association matrix through a neural inductive matrix completion model. Yu et al. [5] added an attention mechanism to the GCN model, which further improved the GCN-based model. Later, combining graph representation learning with graph neural networks was found to be powerful. Graph representation learning is responsible for extracting node structure information, and the graph neural network aggregates
the structure information, so that model performance is further improved. Lin et al. [18] used this approach to complete the drug interaction network, and Wang et al. [10] used it for drug repurposing. Both achieved excellent results. However, the GNN model and its derivatives are prone to over-smoothing [19], which limits them to aggregating local neighbor information and makes it difficult to capture high-order drug-disease association information. In this paper, considering the problems that drug repurposing faces with multi-source data and with different algorithms, we propose a hypergraph-based [20] drug-disease association prediction method named HGDD. Because most drugs and their related databases are currently not aligned, HGDD uses only the drug-disease association network to train the model. The model can therefore be applied quickly to various drug databases without searching for related networks in different databases. To address the inability of GNNs to capture high-order information, we construct a high-order drug-disease association network as a hypergraph and directly aggregate high-order association information through the hypergraph convolution operator [21]. We train and validate the model on the drug-disease association data integrated by Yu et al. [5] from the CTD database. The results show that HGDD outperforms current methods and can accomplish the task of drug repurposing.
2 Methods
In this section, we introduce the HGDD model. In Sect. 2.1, we introduce the basic concepts of hypergraphs and their advantages in extracting high-order association information. In Sect. 2.2, we give the overall framework of the model. In Sects. 2.3, 2.4 and 2.5, we describe the algorithm in detail.
2.1 Hypergraph Introduction
We can understand hypergraphs from three aspects: the data structure of a hypergraph, the role of the hypergraph, and why the hypergraph is introduced into the drug-disease association network. Let $HG = (V, E, W)$ denote a hypergraph with node set $V$, hyperedge set $E$ and hyperedge weight set $W$. A hypergraph can be expressed as a $|V| \times |E|$ incidence matrix $H$ whose entries are defined as
\[ h(v, e) = \begin{cases} 1, & \text{if } v \in e \\ 0, & \text{if } v \notin e \end{cases} \tag{1} \]
For $v \in V$, its degree is defined as $d(v) = \sum_{e \in E} w(e)\, h(v, e)$. For $e \in E$, its degree is defined as $d(e) = \sum_{v \in V} h(v, e)$. Furthermore, $D_e$ and $D_v$ denote the diagonal matrices of hyperedge degrees and node degrees, respectively. The intuitive difference between a graph and a hypergraph is shown in Fig. 1. Figure 1(a) shows the structure and data representation of a graph, and Fig. 1(b) shows the structure and data representation of a hypergraph. The most significant difference is that a hyperedge in a hypergraph can connect two or more nodes.
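The definitions of $H$, $d(v)$ and $d(e)$ can be illustrated with a minimal sketch. The toy nodes, hyperedges and unit weights below are illustrative assumptions, not data from the paper.

```python
# Minimal sketch: incidence matrix H and the degrees d(v), d(e) of a toy hypergraph.
# The node names, hyperedges and unit weights are illustrative assumptions.

def incidence_matrix(nodes, hyperedges):
    """Return the |V| x |E| matrix H with h(v, e) = 1 if v is in e, else 0."""
    return [[1 if v in e else 0 for e in hyperedges] for v in nodes]

def node_degrees(H, weights):
    """d(v) = sum over hyperedges e of w(e) * h(v, e)."""
    return [sum(w * h for w, h in zip(weights, row)) for row in H]

def hyperedge_degrees(H):
    """d(e) = sum over nodes v of h(v, e)."""
    return [sum(col) for col in zip(*H)]

nodes = ["v1", "v2", "v3", "v4"]
hyperedges = [{"v1", "v2", "v3"}, {"v3", "v4"}]  # the first hyperedge joins three nodes
weights = [1.0, 1.0]

H = incidence_matrix(nodes, hyperedges)
d_v = node_degrees(H, weights)
d_e = hyperedge_degrees(H)
```

Here node v3 belongs to both hyperedges, so its degree is 2, while the first hyperedge has degree 3 because it connects three nodes at once.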
Fig. 1. Different representations of graphs and hypergraphs.
In many graph tasks, the hypergraph structure is used to model high-order correlations among data. Hypergraph learning was first introduced in [22] as an information propagation process on the hypergraph structure. The purpose of transductive inference on hypergraphs is to minimize the label differences between nodes that are strongly connected on the hypergraph [21]. These strongly correlated nodes are connected through hyperedges, so that network information is propagated as a whole. A graph is generally used to describe binary relationships among nodes, but the drug-disease relationship is not a simple relationship between two nodes: one disease can be treated by multiple drugs, and one drug can treat multiple diseases. Drugs that can treat the same disease have some similar properties, and the same holds for diseases. If drugs with similar properties are connected by a hyperedge, the whole network can be represented by a high-order graph. The high-order association information is then captured by hypergraph learning to achieve high-precision drug-disease association prediction.
2.2 Overview
The overview of HGDD is shown in Fig. 2. The input is a drug-disease association matrix $A \in \mathbb{R}^{N \times M}$ with $N$ drugs and $M$ diseases. The output is the association score matrix $Y \in \mathbb{R}^{N \times M}$. Each $y_{ij} \in Y$ is a value in $(0, 1)$ indicating the predicted association score between drug $u_i$ and disease $d_j$. To capture the high-order association information between drugs and diseases, we designed a three-step framework: 1) Encode the neighborhood structure information of drugs and diseases as $X^{(0)}$ from the drug-disease association graph $G$, then construct the drug-disease hypergraph $HG$ based on $G$. 2) Use the hypergraph convolution operator to aggregate the node features on $HG$, extract high-order association information and generate node embeddings. 3) Calculate the drug-disease association score through the inner product operation.
In general, we first capture the association information from the low-order network through graph representation learning, and then input it as node features into the hypergraph to extract the high-order association information, so as to achieve the purpose of association prediction. Through these three steps, HGDD can mine the topological structure information of the drug-disease network as much as possible.
Fig. 2. Framework of the HGDD.
2.3 Initialization of Drug-Disease Hypergraph
A slight change in molecular structure can cause a huge change in molecular function [23]. However, such a small change is not well reflected in similarity calculations based on molecular structure. Therefore, using molecular structure as a drug feature may not accurately describe the relationships between drugs. In contrast, which diseases a drug can treat, and which drugs can treat a disease, describe the features of drugs and diseases well. This view is intuitive and effective. Therefore, we use graph representation learning to capture the neighborhood structure information of nodes as their initial features. HGDD uses Node2Vec to capture network structure information and describe the topological environment of drug nodes and disease nodes. The task of Node2Vec can be summarized as a mapping $V \to \mathbb{R}^{(N+M) \times c}$, where $c$ is the configurable dimension of the initial node features. Given a drug-disease association network $G$, there are $N$ drugs $U = \{u_1, u_2, \ldots, u_N\}$ and $M$ diseases $D = \{d_1, d_2, \ldots, d_M\}$. The node set of $G$ is $V = \{u_1, u_2, \ldots, u_N, d_1, d_2, \ldots, d_M\}$. Through Node2Vec, we obtain the initial features of the drug nodes and disease nodes, denoted by $X^{(0)} = [x_{u_1}, \ldots, x_{u_N}, x_{d_1}, \ldots, x_{d_M}]$. HGDD builds the drug-disease hypergraph from the association matrix of the drug-disease network. Let $A \in \mathbb{R}^{N \times M}$ be the association matrix of $G$. If drug $u_i$ can treat disease $d_j$, $A_{ij}$ is 1; otherwise it is 0. For any drug $u_i$ that can treat $k$ ($k \geq 1$) diseases $D_{u_i} = \{d_j \mid A_{ij} = 1\}$, let $D_{u_i}$ and $u_i$ form the hyperedge $E_i = D_{u_i} \cup \{u_i\}$. Since it contains the diseases treated by the same drug, we call $E_i$ a disease hyperedge, and define $u_i$ as the bond node of hyperedge $E_i$. The same process is applied to each disease $d_j$ to obtain the hyperedge $E_j$. Because it contains drugs that can treat the same disease, we call $E_j$ a drug hyperedge.
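The hyperedge construction just described can be sketched as follows. This is a minimal illustration under stated assumptions: nodes are indexed with drugs first and diseases second (matching the order of $V$), and the function names and the toy matrix are hypothetical rather than taken from the HGDD implementation.

```python
# Sketch of the hyperedge construction on a toy association matrix.
# Node indices: drugs 0..N-1, diseases N..N+M-1 (an assumed convention).

def build_hyperedges(A):
    """Return a list of (bond_node, members) pairs, one per hyperedge."""
    n_drugs, n_diseases = len(A), len(A[0])
    hyperedges = []
    # Disease hyperedge E_i: drug u_i plus every disease it treats.
    for i in range(n_drugs):
        diseases = {n_drugs + j for j in range(n_diseases) if A[i][j] == 1}
        if diseases:  # the drug must treat k >= 1 diseases
            hyperedges.append((i, diseases | {i}))
    # Drug hyperedge E_j: disease d_j plus every drug that treats it.
    for j in range(n_diseases):
        drugs = {i for i in range(n_drugs) if A[i][j] == 1}
        if drugs:
            hyperedges.append((n_drugs + j, drugs | {n_drugs + j}))
    return hyperedges

def incidence_from_hyperedges(n_nodes, hyperedges):
    """Build the |V| x |E| incidence matrix from the hyperedge member sets."""
    return [[1 if v in members else 0 for _, members in hyperedges]
            for v in range(n_nodes)]

# Toy example: 2 drugs (nodes 0, 1) and 3 diseases (nodes 2, 3, 4).
A = [[1, 1, 0],
     [0, 1, 1]]
edges = build_hyperedges(A)
H = incidence_from_hyperedges(5, edges)
```

In this toy case, drug 0 is the bond node of the disease hyperedge {0, 2, 3}, and disease node 3 is the bond node of the drug hyperedge {0, 1, 3}; node 3 therefore links the two disease hyperedges.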
We also define $d_j$ as the bond node of hyperedge $E_j$. Through the above process, we obtain the drug-disease association hypergraph shown in Fig. 3, and then initialize the node features of the hypergraph as $X^{(0)}$. For a hypergraph, the bond node is the information communication bridge between hyperedges. For example, the bond node $v_1$ of the drug hyperedge $E_4$ in Fig. 3 is also the
Fig. 3. Construction of the hypergraph based on the drug-disease association network.
node of the disease hyperedges $E_1$, $E_2$ and $E_3$. Therefore, information can be transmitted among the hyperedges $E_1$, $E_2$, $E_3$ and $E_4$ through the bond node.
2.4 Extraction of High-Order Information from Hypergraph
To aggregate the high-order association information of the drug-disease association network, we introduce the hypergraph neural network (HGNN) model [21] into HGDD. HGNN is a multi-layer neural network framework that aggregates node feature information over the hypergraph structure and generates node embeddings. In each layer it uses the hypergraph convolution operator to aggregate information; the node features reconstructed through the hypergraph structure are then the input of the next layer. Specifically, given a network with the corresponding hypergraph incidence matrix $H$, the layer-wise propagation rule of HGNN is formulated as
\[ X^{(l+1)} = f\left(X^{(l)}, H\right) = \sigma\left( D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2} X^{(l)} \Theta^{(l)} \right), \tag{2} \]
where $X^{(l)}$ is the embedding of the node features in layer $l$, $D_v$ is the node degree matrix, $D_e$ is the hyperedge degree matrix, $W$ is the hyperedge weight matrix, $\Theta^{(l)}$ is the trainable weight matrix of the $l$-th layer, and $\sigma(\cdot)$ is the non-linear activation function. We assume that all hyperedge weights are the same, so the hyperedge weight matrix $W$ becomes the identity matrix $I$. HGDD uses exponential linear units as the non-linear activation function in all graph convolutional layers, which can significantly improve the robustness and accuracy of the model. From the spatial perspective, it is easy to understand the advantages of the hypergraph convolution operator over the graph convolution operator. The graph convolution operator is as follows:
\[ X^{(l+1)} = f\left(X^{(l)}, A\right) = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X^{(l)} \Theta^{(l)} \right), \tag{3} \]
\[ \tilde{A} = \begin{bmatrix} 0 & A \\ A^{T} & 0 \end{bmatrix} + I, \qquad
\tilde{D} = \begin{bmatrix}
\sum_{u \in V} \tilde{A}(v_0, u) & & & \\
 & \sum_{u \in V} \tilde{A}(v_1, u) & & \\
 & & \ddots & \\
 & & & \sum_{u \in V} \tilde{A}(v_{N+M}, u)
\end{bmatrix}, \]
where $\tilde{A}$ is the adjacency matrix of the drug-disease association graph with self-loops, and $\tilde{D}$ is the node degree matrix of $\tilde{A}$. Obviously, $\tilde{D}$ plays a normalizing role, and $\tilde{A}$ guides information aggregation between a node and its neighbors. The graph convolution operator aggregates first-order neighbor information in each iteration, but over-fitting occurs after three iterations, which means that information from neighbors beyond the third order cannot be aggregated. This makes the graph convolution operator unable to capture the overall structure information of the network. In the hypergraph convolution operator, $D_v$ and $D_e$ also play a normalizing role, and $HH^{T}$ guides information aggregation. From the perspective of matrix multiplication, $(HH^{T})_{(i,j)}$ is the number of hyperedges shared by node $i$ and node $j$. If one of the two nodes is a drug and the other is a disease, this value represents the correlation between them. If both nodes are drugs (or both are diseases), this value represents the similarity between them. This allows a node to break through the limitations of the network topology and directly select similar or related nodes for information aggregation. The more hyperedges two nodes share, the more information they exchange under the guidance of $HH^{T}$. In this way, high-order association information is better aggregated.
2.5 Association Prediction
After obtaining the embedding vectors of the drugs and diseases, we use the inner product to calculate the predicted association confidence score between drug $u_i$ and disease $d_j$ as follows:
\[ y_{(i,j)} = \mathrm{sigmoid}\left( x_{u_i}^{T} x_{d_j} \right), \tag{4} \]
where $x_{u_i}$ and $x_{d_j}$ are the embedding representations of drug $u_i$ and disease $d_j$, respectively, and $y_{(i,j)} \in (0, 1)$ is their association score.
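The propagation rule of Eq. (2), the shared-hyperedge count $HH^{T}$ and the score of Eq. (4) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: $W = I$, every node and hyperedge has non-zero degree, and the toy incidence matrix, features and identity weight matrix `Theta` are made up for illustration.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def hypergraph_conv(X, H, Theta):
    """One layer of Eq. (2) with W = I: ELU(Dv^-1/2 H De^-1 H^T Dv^-1/2 X Theta)."""
    d_v = [sum(row) for row in H]            # node degrees (unit hyperedge weights)
    d_e = [sum(col) for col in zip(*H)]      # hyperedge degrees
    dv_inv_sqrt = diag([1.0 / math.sqrt(d) for d in d_v])
    de_inv = diag([1.0 / d for d in d_e])
    G = matmul(matmul(matmul(matmul(dv_inv_sqrt, H), de_inv), transpose(H)), dv_inv_sqrt)
    Z = matmul(matmul(G, X), Theta)
    return [[elu(z) for z in row] for row in Z]

def score(x_u, x_d):
    """Eq. (4): sigmoid of the inner product of two embeddings."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(x_u, x_d))))

# Toy hypergraph: nodes 0-1 are drugs, nodes 2-4 are diseases (assumed layout).
H = [[1, 0, 1, 1, 0],
     [0, 1, 0, 1, 1],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 1, 0, 0, 1]]
X0 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0], [0.2, 0.8]]  # made-up features
Theta = [[1.0, 0.0], [0.0, 1.0]]                                    # identity weights

X1 = hypergraph_conv(X0, H, Theta)
S = matmul(H, transpose(H))   # S[i][j] = number of hyperedges shared by nodes i and j
y = score(X1[0], X1[3])       # predicted score for drug node 0 and disease node 3
```

In this toy hypergraph the two drugs (nodes 0 and 1) share one hyperedge, while drug node 0 and disease node 3 share two, so the latter pair exchanges more information in a single layer.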
We treat the associated drug-disease pairs as positive samples and all other pairs as negative samples. The set of positive samples is $S^{+}$, and the set of negative samples is $S^{-}$. Since there are far more negative samples than positive samples, directly using the cross-entropy loss function leads to poor results. Here we use a weighted cross-entropy loss function:
\[ Loss = -\frac{1}{N \times M} \left( \lambda \times \sum_{(i,j) \in S^{+}} \log y_{(i,j)} + \sum_{(i,j) \in S^{-}} \log\left(1 - y_{(i,j)}\right) \right), \tag{5} \]
where $\lambda$ is the penalty weight applied when a positive sample is misclassified. $\lambda$ should be as close as possible to $|S^{-}| / |S^{+}|$, where $|S^{+}|$ and $|S^{-}|$ are the numbers of positive and negative samples, respectively. The presence of $\lambda$ makes the misclassification of positive samples incur a greater loss, thus reducing the impact of data imbalance.
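A minimal sketch of the weighted cross-entropy loss of Eq. (5) is given below. The function names and the toy matrices are illustrative assumptions, and the sketch assumes every score lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def balance_weight(A):
    """lambda = |S-| / |S+| for a 0/1 association matrix A."""
    pos = sum(sum(row) for row in A)
    neg = len(A) * len(A[0]) - pos
    return neg / pos

def weighted_cross_entropy(Y, A, lam):
    """Eq. (5): positive pairs (A_ij = 1) are weighted by lam, negative pairs by 1."""
    n, m = len(A), len(A[0])
    total = 0.0
    for i in range(n):
        for j in range(m):
            if A[i][j] == 1:
                total += lam * math.log(Y[i][j])
            else:
                total += math.log(1.0 - Y[i][j])
    return -total / (n * m)

# Toy example: 2 positives and 4 negatives, with uninformative scores of 0.5.
A = [[1, 0, 0],
     [0, 0, 1]]
Y = [[0.5, 0.5, 0.5],
     [0.5, 0.5, 0.5]]
lam = balance_weight(A)
loss = weighted_cross_entropy(Y, A, lam)
```

With $\lambda = 2$ in this toy case, the two positive pairs contribute as much to the loss as the four negative pairs, which is the balancing effect described above.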
3 Experiment and Discussion
3.1 Datasets and Settings
In our experiment, we used the drug-disease association data integrated from the CTD database by Yu et al. [5] (referred to as the CTD dataset), as shown in Table 1. The CTD dataset contains 269 drugs and 598 diseases. There are 18416 known associations between drugs and diseases, accounting for 11.4% of all 269 × 598 = 160862 possible drug-disease pairs, so the ratio of positive to negative samples is about 1:8. We divided the CTD dataset at a ratio of 4:1 for training and validation.
Table 1. Summary of dataset
Association prediction can be regarded as a binary classification problem, and AUC, F-measure and AUPR are commonly used indicators for evaluating prediction models. AUPR takes both recall and precision into account and is more convincing on datasets with unbalanced positive and negative samples, so we use AUPR as the main evaluation criterion. We also considered sensitivity (SN, also called recall) and specificity (SP).
3.2 Comparison with Other Models
In this section, we compare HGDD with six representative models to demonstrate its advantages in drug repurposing. The comparison models are the SCMFDD [11] and BNNR [12] models based on matrix decomposition and matrix completion, the deepDR [13] model based on a multi-modal deep autoencoder and a collective variational autoencoder, the NIMCGCN [17] and LAGCN [5] models based on graph convolutional networks, and the Node2Vec+GCN model based on graph representation learning and a graph convolutional network. We use the CTD dataset in all experiments. Each experiment was repeated five times, and the results were averaged. The experimental results are shown in Table 2.
Table 2. Experimental comparison between HGDD model and other models
It can be seen from Table 2 that the HGDD model is superior to the other models on most evaluation indicators. The SCMFDD and BNNR models, which are based on matrix decomposition and matrix completion, achieve a much higher AUPR than the NIMCGCN model based on the graph convolution operator, and the SCMFDD model exceeds all models in SP. This reflects the great potential of matrix decomposition and matrix completion in drug repurposing. The NIMCGCN, LAGCN and Node2Vec+GCN models are all based on graph convolution operators, but the AUPR of Node2Vec+GCN is much higher than that of the other two. Both NIMCGCN and LAGCN use the row vectors of the adjacency matrix as node features, and these node features contain only the first-order structure information of the network. Node2Vec+GCN obtains node features through Node2Vec, and these features contain the local neighborhood information of the network. The richer node information greatly improves the model. With this richer node information, HGDD performs information aggregation on the high-order network, which further improves the model.
3.3 Case Study
In this section, we apply the HGDD model to the existing database to predict new associations that are not recorded in it. Since all known associations are used to construct the predictive model, the new associations can only be verified through other literature. We ranked the drug-disease pairs that are labeled false in the database but predicted to be true, and selected the top ten as examples, as shown in Table 3. Six of them have been documented in other literature.
Table 3. Experimental comparison between HGDD model and other models
For example, dexamethasone is a corticosteroid used to treat endocrine, rheumatic, collagen, skin, allergic, ophthalmic, gastrointestinal, respiratory, hematologic, neoplastic, edematous and other diseases. HGDD predicts that it can also be used to treat precocious puberty. Progesterone is a hormone naturally present in the female body. It is essential for endometrial receptivity, embryo implantation and successful pregnancy. HGDD also predicts an association that matches the reported neuroprotective effect of progesterone in an in vivo model of retinitis pigmentosa. These cases show that the HGDD model has high practical value and can help complete the task of drug repurposing.
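The candidate selection used in the case study (rank the pairs labeled false by their predicted score and keep the top ten) can be sketched as follows. The function name, toy scores and toy labels are illustrative assumptions.

```python
def top_k_candidates(scores, A, k=10):
    """Rank drug-disease pairs with label 0 by predicted score, highest first."""
    candidates = [
        (scores[i][j], i, j)
        for i in range(len(A))
        for j in range(len(A[0]))
        if A[i][j] == 0          # keep only pairs not recorded in the database
    ]
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [(i, j, s) for s, i, j in candidates[:k]]

# Toy example with 2 drugs and 2 diseases; pair (0, 0) is a known association.
A = [[1, 0],
     [0, 0]]
scores = [[0.9, 0.8],
          [0.2, 0.6]]
top2 = top_k_candidates(scores, A, k=2)
```

Known associations are excluded before ranking, so a high score for pair (0, 0) does not crowd out unrecorded candidates.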
4 Conclusion
In this paper, we propose the HGDD model to mine high-order drug-disease association information. The HGDD model first introduces the concept of the hypergraph into the drug-disease association network, then uses the hypergraph convolution operator to aggregate high-order drug-disease association information, and finally completes the task of drug repurposing. Experiments show that the HGDD model achieves the best results among the compared models on most evaluation indicators. In addition to its ability to aggregate high-order association information, another advantage of the HGDD model is that it needs only the drug-disease association network to complete the prediction. This makes the method easy to implement and suitable for various drug databases without searching for additional information. HGDD still has some limitations. First, it cannot be used on overly sparse drug-disease association networks. Information is exchanged between hyperedges through shared nodes, so in an overly sparse network the information in a hyperedge cannot be transmitted, which degrades the prediction results. Second, HGDD can only predict associations between the drugs and diseases included in training. Each time a new drug is added, the entire hypergraph structure changes and the HGDD model needs to be retrained. The model also has considerable room for future development. First, as hypergraph research progresses, more and more hypergraph information aggregation methods are being proposed. Some of them may surpass the hypergraph convolution operator, and the model can be further improved by replacing the operator. Second, although the model completes drug-disease association prediction, it does not use the features of the drugs and diseases themselves, only the structural information of the network. From a broader point of view, the model performs association prediction on a bipartite graph, so it is highly extensible.
References 1. DiMasi, J.A., Grabowski, H.G., Hansen, R.W.: Innovation in the pharmaceutical industry: new estimates of R&D costs. J. Health Econ. 47, 20–33 (2016) 2. Li, J., Zheng, S., Chen, B., et al.: A survey of current trends in computational drug repositioning. Brief. Bioinform. 17(1), 2–12 (2016) 3. Pushpakom, S., Iorio, F., Eyers, P.A., et al.: Drug repurposing: progress, challenges and recommendations. Nat. Rev. Drug Discov. 18(1), 41–58 (2019)
4. Rudrapal, M., Khairnar, S.J., Jadhav, A.G.: Drug Repurposing (DR): an emerging approach in drug discovery. In: Badria, F.A. (ed.) Drug Repurposing - Hypothesis, Molecular Aspects and Therapeutic Applications. IntechOpen, London (2020) 5. Yu, Z., Huang, F., Zhao, X., et al.: Predicting drug-disease associations through layer attention graph convolutional network. Brief. Bioinform. 22(4) (2020) 6. Emig, D., Ivliev, A., Pustovalova, O., et al.: Drug target prediction and repositioning using an integrated network-based approach. PLoS ONE 8(4), e60618 (2013) 7. Luo, H., Li, M., Yang, M., et al.: Biomedical data and computational models for drug repositioning: a comprehensive review. Brief. Bioinform. 22(2), 1604–1619 (2021) 8. Zeng, X., Zhu, S., Lu, W., et al.: Target identification among known drugs by deep learning from heterogeneous networks. Chem. Sci. 11(7), 1775–1797 (2020) 9. Zong, N., Wong, R.S.N., Yu, Y., et al.: Drug-target prediction utilizing heterogeneous biolinked network embeddings. Brief. Bioinform. 22(1), 568–580 (2021) 10. Wang, B., Lyu, X., Qu, J., et al.: GNDD: a graph neural network-based method for drugdisease association prediction. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1253–1255. IEEE (2019) 11. Zhang, W., Yue, X., Lin, W., et al.: Predicting drug-disease associations by using similarity constrained matrix factorization. BMC Bioinformatics 19(1) (2018). https://doi.org/10.1186/ s12859-018-2220-4 12. Yang, M., Luo, H., Li, Y., et al.: Drug repositioning based on bounded nuclear norm regularization. Bioinform. (Oxford Engl.) 35(14), i455–i463 (2019) 13. Zeng, X., Zhu, S., Liu, X., et al.: deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinform. (Oxford Engl.) 35(24), 5191–5198 (2019) 14. Zhang, Z.-C., Zhang, X.-F., Wu, M., et al.: A graph regularized generalized matrix factorization model for predicting links in biomedical bipartite networks. 
Bioinformatics 36(11), 3474–3481 (2020) 15. Chen, F., Wang, Y.-C., Wang, B., et al.: Graph representation learning: a survey. APSIPA Trans. Signal Inf. Process. 9, e15 (2020) 16. Zhou, J., Cui, G., Zhang, Z., et al.: Graph neural networks: a review of methods and applications. AI Open 1, 57–81 (2018) 17. Li, J., Zhang, S., Liu, T., et al.: Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics 36(8), 2538–2546 (2020) 18. Lin, X., Quan, Z., Wang, Z.-J., et al.: KGNN: knowledge graph neural network for drug-drug interaction prediction. In: des Jardins, M., Bessiere, C. (eds.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, California, pp. 2739–2745 (2020) 19. Nt, H., Maehara, T.: Revisiting graph neural networks: all we have is low-pass filters. arXiv, abs/1905.09550 (2019) 20. Bretto, A.: Hypergraph Theory: An Introduction. MATHENGIN. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-00080-0 21. Feng, Y., You, H., Zhang, Z., et al.: Hypergraph neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 3558–3565 (2019) 22. Zhou, D., Huang, J., Schölkopf, B.: Learning with hypergraphs: clustering, classification, and embedding. In: Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, 4–7 December 2006, vol. 19, pp. 1601–1608 (2006) 23. Cai, M., Gao, Z., Zhou, X., et al.: A small change in molecular structure, a big difference in the AIEE mechanism. Phys. Chem. Chem. Phys. 14(15), 5289–5296 (2012) 24. Mansour, A.M., Sheheitli, H., Kucukerdonmez, C., et al.: Intravitreal dexamethasone implant in retinitis pigmentosa-related cystoid macular edema. Retina (Philadelphia Pa.) 38(2), 416– 423 (2018)
25. Xiang, S., He, L., Ran, X., et al.: Primary glucocorticoid resistance syndrome presenting as pseudo-precocious puberty and galactorrhea. Sichuan da xue xue bao. Yi xue ban = J. Sichuan Univ. Med. Sci. Ed. 39(5), 861–864 (2008) 26. Sánchez-Vallejo, V., Benlloch-Navarro, S., López-Pedrajas, R., et al.: Neuroprotective actions of progesterone in an in vivo model of retinitis pigmentosa. Pharmacol. Res. 99, 276–288 (2015) 27. Wang, X., Ji, C., Zhang, H., et al.: Identification of a small-molecule compound that inhibits homodimerization of oncogenic NAC1 protein and sensitizes cancer cells to anticancer agents. J. Biol. Chem. 294(25), 10006–10017 (2019) 28. Bourque, F., Karama, S., Looper, K., et al.: Acute tamoxifen-induced depression and its prevention with venlafaxine. Psychosomatics 50(2), 162–165 (2009) 29. Jonsson, I.-M., Verdrengh, M., Brisslert, M., et al.: Ethanol prevents development of destructive arthritis. Proc. Natl. Acad. Sci. U.S.A. 104(1), 258–263 (2007)
IDOS: Improved D3DOCK on Spark
Yonghui Cui1, Zhijian Xu2, and Shaoliang Peng1(B)
1 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China {yonghui,slpeng}@hnu.edu.cn
2 Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China [email protected]
Abstract. Virtual molecular docking is a computational method used in computer-aided drug discovery that calculates the binding affinity of a small-molecule drug candidate to a target protein and greatly reduces the time and cost of suggesting new potential pharmaceuticals. D3DOCK is a suite of automated docking tools based on AutoDock Vina that was developed for sensitively investigating the effects of halogen bonds in drug discovery. In this study, we developed IDOS, a high-throughput and scalable virtual docking system. We use the open-source Hadoop framework and the Spark paradigm for distributed computing on a private cloud platform. IDOS can work on a single node as well as on distributed nodes. Compared with the stand-alone version of D3DOCK, it achieved almost linear acceleration on our Spark cluster, with a parallel efficiency of more than 80% and a 5x to 6x speedup depending on the number of worker nodes. Moreover, given the wide use of cloud computing, IDOS can be easily installed and employed for docking once the IDOS Docker image has been uploaded to the shared Docker Hub repository. We have also developed an MPI version of the docking workflow for high-performance computers (HPC). However, compared with an HPC system, deploying IDOS on a cheap and easily obtained cloud computing cluster is more convenient and allows more researchers to run docking with it. IDOS is freely available at https://github.com/codedinner/IDOS.
Keywords: Molecular docking · Virtual drug screening · Big data · Distributed computing · Spark
1 Introduction
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 436–447, 2021. https://doi.org/10.1007/978-3-030-91415-8_37
Virtual docking of compounds to protein receptors is a computational approach that is widely used in industrial and academic research laboratories. The goal of virtual docking is to allow quick, efficient, and affordable identification of small chemical compounds that bind to specific proteins and hence are capable of imparting important biological and pharmaceutical effects [7,8]. In targeted drug screening, virtual docking greatly improves the efficiency and speed of new drug research and development and reduces its cost. Several docking programs are available, both open-source and proprietary (for a review of some of these programs see [5]). With the development of supercomputers, virtual screening began to be widely applied in drug discovery, and great progress was made in high-throughput virtual screening and parallel screening algorithms. In recent years, the emergence of cluster computers and cloud computing platforms has further promoted its development [6,10,11,16]. Some works use an MPI version of AutoDock4 [4] to enable rapid screening of large databases on HPC systems. Kento Aoyama et al. [1] introduced an HPC container workflow that provides customized container image configurations for different HPC systems based on the HPC Container Maker framework, and applied it to MEGADOCK, a high-performance protein-protein docking application. For cloud computing architectures, Sally R. Ellingson et al. [6] implemented the molecular docking program AutoDock on a cloud platform using the Hadoop framework, based on the MapReduce paradigm of distributed computing. Experiments were conducted on the cloud computer Kandinsky at Oak Ridge National Laboratory, where the high-throughput virtual docking software AutoDockCloud achieved a speedup of 450. Aoyama et al. [13] also used cloud computing for molecular docking. We have deployed a virtual drug screening system on the Spark computing platform. The molecular docking program it improves is D3DOCK [3], but it can be replaced with other docking programs, which shows that our framework is general. D3DOCK is virtual screening software based on AutoDock Vina, developed by the Shanghai Pharmaceutical Research Institute.
Two new halogen bond scoring functions have been developed for it: the knowledge-based halogen bond scoring function XBPMF [9] and the quantitative halogen bond scoring function XBScoreQM [15]. They can effectively characterize halogen bonding systems and make up for the inability of current virtual screening methods to handle such systems effectively.
2 Distributed (Spark) Implementation
Distributed systems allow us to improve computing performance by increasing the number of machines. To process large-scale data in an acceptable time, distributed computing is necessary. Large-scale drug screening is intrinsically a batch job, so the distributed computing platform should provide high throughput to reduce processing time. In addition, the speedup should scale almost linearly with the number of nodes in the cluster. Apache Spark is an open-source big data processing framework for distributed environments and has an active community. According to evaluations of big data platform performance on a large number of standard benchmarks [12,15], Spark provides better throughput and scalability for both batch processing and stream processing. Because of its implementation, Spark is very fast, up to 100 times faster than predecessors such as Apache Hadoop [14]. It improves on storage and fault tolerance mechanisms and tries to reduce runtime overhead in order to achieve a speedup that is linear in the number of nodes in the cluster, which gives it high scalability.
2.1 Cluster Design
We implemented our docking system on Apache Spark, which serves as the batch processing engine. Figure 1 shows the architecture of IDOS, where each outer rectangle represents a machine. The Spark driver, HDFS NameNode and cluster manager are the coordinators on the master node. On each worker node, there is a Spark executor, an HDFS DataNode and a DOCK worker, which is a shell script interface that can run D3DOCK. The workflow we designed is as follows: the system first uses HDFS to distribute the small-molecule ligand files among the processing nodes, and then uses Spark to call the DOCK worker instance on the node where the data is located. The DOCK worker then starts molecular docking. Finally, the docking results can be stored on HDFS for any downstream analysis, or sent to the master node for aggregation.
Fig. 1. IDOS architecture. A master node (rectangle on the left) with a number of worker nodes (rectangles on the right). The Spark driver and HDFS NameNode are placed on the master node. Each worker node contains an HDFS DataNode, a Spark executor, and an instance of the DOCK worker. The Spark executor communicates with the DataNode and the DOCK worker through the storage disk. The DOCK worker calls D3DOCK for docking. After the job has finished, executors aggregate results to the HDFS DataNode.
Figure 2 shows the workflow for each work node in our design. As shown in the figure, four entities have separate memory address spaces. • HDFS DataNode: The only entity that always uses storage is the datanode that stores SDF file blocks. • Spark Executor: Spark executor acts as an manager, divides the block into several parts that make up RDDs, and send them to dock worker for processing. It also calls a mappartitions() function for each partition.
IDOS: Improved D3DOCK on Spark
• DOCK worker: The interface between the Spark executor and D3DOCK.
• D3DOCK: The lowest-level molecular docking program; it may be replaced by any other docking software.
Fig. 2. Worker node workflow. Horizontal and vertical lines separate different hardware and software units, respectively.
2.2 Receptor Data Broadcast and Assemble
The docking receptor file, in PDB format, needs to be uploaded to HDFS, where it is converted into an RDD through the textFile() function in Spark. The receptor RDD is cached into a string container that preserves the order of the data when it is stored. The container holding the cached receptor RDD is then packaged as a broadcast variable in Spark. In this way, each computing node obtains the same docking receptor data from the driver node, and the sequential storage of the container ensures the consistency of the receptor data. The receptor broadcast work is assigned to the master node: the receptor file data is packaged into a Java ArrayList, wrapped again into a Spark broadcast variable, and distributed to each worker node. After receiving the wrapped broadcast variable, the worker node unpacks the data in order and writes it to the node's file system through the BufferedWriter class.
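The order-preserving round trip described above can be illustrated without Spark. The following is a minimal Python sketch, not the authors' Java implementation: a plain list stands in for the ArrayList wrapped in the broadcast variable, and the helper names `pack_receptor`, `unpack_receptor` and `demo_round_trip` are hypothetical.

```python
import os
import tempfile


def pack_receptor(pdb_path):
    """Read the receptor file into an ordered list of lines (stand-in for the ArrayList)."""
    with open(pdb_path, "r", encoding="utf-8") as handle:
        return handle.read().splitlines()


def unpack_receptor(lines, out_dir, task_id):
    """Write the received lines, in order, to a node-local PDB file and return its path."""
    out_path = os.path.join(out_dir, f"receptor_{task_id}.pdb")
    with open(out_path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return out_path


def demo_round_trip():
    """Pack a two-line toy receptor, unpack it again, and return both texts."""
    original = "ATOM      1  N   MET A   1\nATOM      2  CA  MET A   1\n"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "receptor.pdb")
        with open(src, "w", encoding="utf-8") as handle:
            handle.write(original)
        copy_path = unpack_receptor(pack_receptor(src), tmp, task_id=0)
        with open(copy_path, "r", encoding="utf-8") as handle:
            return original, handle.read()
```

Because the list keeps insertion order, the file written on the receiving side is line-for-line identical to the uploaded receptor, which is the consistency property the text relies on.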
Y. Cui et al.
2.3 Ligand Data Division and Task Distribution
Plan-A: Row Conversion
Ligand Index Construction. The original file consists of a large number of ligand records. A single ligand record occupies multiple lines, and the number of lines is not fixed. The end of each ligand record is marked by the flag “$$$$”. By constructing the index, each ligand record is converted to a record that occupies one line, and its index number is the serial number of the ligand in the original file. Ligand index construction is the basis of ligand data division: Spark divides data by rows by default, so converting each ligand record to a single row ensures that no record is split across different nodes during division.
RDD Data Division. The ligand file in SDF format contains the data of multiple ligand molecules. For high-throughput virtual screening, the ligands need to be divided evenly into several parts and sent to several computing tasks. In the Spark computing engine, data partitioning maps the ligand data by text lines. Each ligand record, originally spread over multiple lines, is therefore mapped into one long string according to the ligand index created before, with the separator between the original lines replaced by ‘~’. A HashMap container stores the ligand data: the key is the ligand index, and the value is the long string built by the above mapping. Each entry of the hash table is then stored in a Tuple2 data structure, whose first value is the key of the entry and whose second value is its value. Each Tuple2 represents the data of one docking molecule, and all Tuple2 objects are loaded into a list container. The list is converted into a pair RDD through the parallelizePairs() function, and the slices parameter numSlices is specified to divide the data. As shown in Fig. 3, after the ligand index is constructed, each row of the indexed RDD represents one ligand, and the number of rows equals the number of ligands to be docked in the job. So that the ligand data can be restored accurately, the original line breaks are marked by the separator ‘~’ within the single line. The ligand data is divided into partitions by specifying the numSlices parameter of the parallelizePairs() function. The default method is average division, that is, the ligand data is divided evenly among the partitions.
RDD Data Assemble. For molecular docking, the long strings generated in the ligand division stage need to be assembled and converted back into the original SDF text format, as shown in Fig. 4. In each assigned computation task, the string is split at the separator ‘~’, and the resulting pieces are collected in an ArrayList container. Each element of the container corresponds to one line of the original SDF text file. The Java buffered character output stream BufferedWriter is used to write the contents of the container, in order, into a new blank file. The process ID number is added to the file name to
Fig. 3. Ligand division process. The indexed data contains multiple ligands, which are divided into n (n = number of partitions) smaller indexed data groups, each containing several ligands.
distinguish different docking processes, so that independent docking processes each use their own docking files, which serves load balancing. Ligand assembly runs on the worker node. On the worker node where the partition is located, the ligand division results are split at the separator ‘~’, stored in order in a string array, and then placed in the ArrayList container. Finally, the BufferedWriter class writes the data to a new blank file. When naming the file, the process ID number is used to distinguish different partitions.
Plan-B: InputFormat Implementation
The default file split method of Hadoop is based on the concept of a “line”. When data is read from HDFS, the key processed in the map() function in Hadoop is the offset of the line, and the value is the content of the line. In an SDF file, however, each ligand occupies multiple lines, and adjacent ligand records are delimited by the separator “$$$$”. If the Plan-A processing described above is not carried out, Spark may split a ligand record across different partitions during partitioning, which would result in the loss of ligand data. Another solution is to implement the InputFormat, RecordReader and LineReader in Hadoop. When dividing the data, Hadoop then splits the file according to a user-defined separator. The advantage of this method is that the data does not need to be converted into one line per record before division. Instead, when the file is read, the source file can be divided evenly into several small data groups by specifying the number of file splits.
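The Plan-A steps (index construction, division, assembly) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' Spark code: lists of tuples stand in for the pair RDD, the function names `index_ligands`, `partition` and `assemble` are hypothetical, the even split is one simple choice, and the sketch assumes that the character ‘~’ never occurs inside the SDF data.

```python
RECORD_END = "$$$$"  # SDF record terminator
LINE_SEP = "~"       # separator used in place of the original line breaks


def index_ligands(sdf_text):
    """Turn multi-line SDF records into (index, single-line string) pairs.

    Lines after the last terminator (an unterminated record) are ignored.
    """
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == RECORD_END:
            records.append((len(records), LINE_SEP.join(current)))
            current = []
    return records


def partition(records, num_slices):
    """Split the indexed records into num_slices contiguous, near-equal parts."""
    n = len(records)
    return [records[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def assemble(part):
    """Restore the SDF text of one partition by splitting at the separator."""
    lines = []
    for _, packed in part:
        lines.extend(packed.split(LINE_SEP))
    return "\n".join(lines) + "\n" if lines else ""
```

Since every ligand is a single element after indexing, no partition boundary can fall inside a record, and concatenating the assembled partitions reproduces the original file.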
Fig. 4. Ligand assembly process. Taking ‘Partition 1’ as an example, the indexed data group in ‘Partition 1’ is assembled into a file in an executor.
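The idea behind Plan-B, reading whole records up to a custom delimiter instead of single lines, can be sketched as a small generator. This is a simplified Python stand-in for a Hadoop RecordReader, not the authors' implementation; the name `read_records` and the chunk size are hypothetical, and the sketch treats the delimiter as the text “$$$$” followed by a newline.

```python
import io


def read_records(stream, delimiter="$$$$\n", chunk_size=64):
    """Yield delimiter-terminated records from a text stream, reading in chunks.

    The buffer carries over between chunks, so a delimiter that straddles a
    chunk boundary is still detected.
    """
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while True:
            end = buffer.find(delimiter)
            if end < 0:
                break
            cut = end + len(delimiter)
            yield buffer[:cut]
            buffer = buffer[cut:]
    if buffer.strip():
        yield buffer  # trailing record without a final delimiter
```

A caller could wrap any file-like object, for example `list(read_records(io.StringIO(sdf_text)))`, and then group the resulting records into as many splits as requested. Unlike Plan-A, no intermediate one-line encoding is needed.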
2.4 Parallelization
Spark takes each data partition as the unit of parallelism. If the number of partitions equals the total number of CPU cores in the computing cluster, each core can process its own partition, so the computation is fully parallel. The mapPartitions() operator is used to specify the computation steps for each data partition. The operator takes the partition as the processing unit and can process all data in the partition at one time. The processing of each partition is specified by overriding the call() function of the FlatMapFunction class. In the call() function, the broadcast variable from the receptor broadcast module is obtained, and the receptor data is saved as a PDB file in the temporary directory of the computing node where the task runs. The ligand assembly process also takes place in the call() function. After the pair of docking files (receptor and ligand) has been prepared, the molecular docking module is called through the Process class in Java.
2.5 Docking and Aggregate
The docking module and the docking result aggregation module are deployed on the worker node. The DOCK worker receives three parameters: the absolute path of the receptor file, the absolute path of the ligand file, and the path where the docking result is saved. The specific molecular docking program can be set in the DOCK worker; the program used in this work is D3DOCK, which can be replaced by other docking programs. After the docking job is completed, each worker thread aggregates the docking results to HDFS through the FileSystem abstract class of Hadoop. The docking log is saved on the disk of each node
through Linux input-output redirection, and then aggregated into HDFS in the same way as the docking results.
2.6 Some Positive Measures
An important feature of Spark is that it can save computation results to memory or disk for subsequent operations to read. This is RDD caching, also called persistence (Spark provides the persist() and cache() functions to cache an RDD). When an RDD is cached, each Spark worker node saves the partition data it has computed to memory, for faster reuse when other operations are performed on the same data.
The Java Process.waitFor() method waits until the child process has finished running. If the child process has already terminated, the method returns immediately; otherwise, calling it blocks the current thread until the child process exits. The standard I/O streams of the child process (stdin, stdout and stderr) are redirected to the parent process and accessed through getOutputStream(), getInputStream() and getErrorStream(). If the parent does not promptly write the input stream or read the output streams of the child process, the child process may block or even deadlock. It is therefore better to actively store the output data of the child process instead of leaving it in the buffer.
The running environments of the master node and the worker nodes are configured in a Docker image, which mainly includes the environment configuration of the Hadoop and Spark clusters and of the molecular docking program. The Docker image can be freely migrated to cloud machines, is easy to deploy, and scales well.
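The advice on child-process output can be illustrated with Python's standard `subprocess` module, used here as an analogue of the Java Process class rather than as the authors' code. In this sketch the child's stdout and stderr go directly to a log file, so the child cannot stall on a full pipe buffer while the parent waits. The helper names `build_command` and `run_docking` and the script name in the usage note are hypothetical.

```python
import subprocess
import sys


def build_command(script, receptor_path, ligand_path, result_path):
    """Assemble the three-parameter call described for the DOCK worker."""
    return [script, receptor_path, ligand_path, result_path]


def run_docking(command, log_path):
    """Run one docking command and send stdout and stderr straight to a log file.

    Writing to a file instead of a pipe means the parent never has to drain
    the child's output while it is blocked waiting for the child to exit.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        completed = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode
```

A call might look like `run_docking(build_command("./dock_worker.sh", "/tmp/receptor.pdb", "/tmp/ligands_1234.sdf", "/tmp/result_1234"), "/tmp/dock_1234.log")`, where all paths are placeholders. The resulting log file can afterwards be uploaded alongside the docking results.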
3 Results
3.1 Experimental Setup
We used Apache Spark 2.2.4 for the batch experiments. In distributed mode, we deployed IDOS on an Apache Spark cluster of seven nodes: one master (driver) node and six worker nodes. Each cluster node has a configuration similar to that of the single-node setup. Each node uses an Intel(R) Xeon(R) CPU E7-8890 v3 with 18 virtual CPUs, a clock frequency of 2.50 GHz, 32 GB of memory and a cache size of 46080 KB. The master node sits at the center of a star topology and is connected to the six worker nodes to test the acceleration performance. With the number of docking tasks fixed, we repeat the experiments on different numbers of worker nodes to evaluate the strong scalability of the tool, and we enlarge the number of docking tasks in proportion to the number of worker nodes to evaluate the weak scalability of IDOS.
3.2 The Accuracy
We prepared 1000 protein and small-molecule pairs to test the correctness of the docking results of IDOS against D3DOCK. First, we used D3DOCK to dock the 1000 test cases sequentially, and then used IDOS to complete the docking in parallel. Table 1 shows the differences in RMSD and binding energy between the two runs. The maximum difference in RMSD is no more than 0.25 Å and the maximum difference in binding energy is no more than 3.89 kcal/mol, while the average differences are 0.035 Å and 1.72 kcal/mol, respectively. Therefore, the docking results of IDOS are reliable.
Table 1. Docking results
       RMSD (Å)   Binding energy (kcal/mol)
Ave    0.035      1.72
Min    0.003      0.52
Max    0.25       3.89
3.3 Speed Up Performance
Figure 5(a) and Fig. 5(b) show the speedup and the acceleration efficiency of the tool, with single-node performance as the reference point. As expected, when the number of nodes increases sixfold, the docking performance increases almost sixfold, showing that IDOS achieves near-linear speedup. The acceleration efficiency is more than 0.8. Comparing the single-node run with the cluster runs exposes the overhead of the distributed implementation: the experiments show that IDOS docks about 5–6x faster than D3DOCK. In summary, IDOS achieved almost linear speedup on the Spark cluster, with a parallel efficiency of more than 80%, and realized a 5–6x acceleration on the laboratory cluster. In addition, the acceleration of Plan-A is better than that of Plan-B.
3.4 Scalability
Scalability describes whether the program still performs well as the scale of the problem, the amount of data, or the number of cluster machines increases. First, 1000 docking tasks were prepared to test the performance of IDOS on 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 worker nodes, respectively. This setting fixes the scale of the problem. As the cluster size of IDOS, that is, the number of processes, increases, the speedup efficiency fluctuates little, indicating that IDOS has good strong scalability. Figure 6(a) shows the test results. Then, we prepared 100, 200, 300, 400, 500, 600, 700, 800, 900 and 1000 tasks for 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 nodes, respectively, that is, the number of tasks = the
Fig. 5. Speed-up performance of Plan-A (Row Conversion) and Plan-B (InputFormat Implementation): (a) speed-up, (b) efficiency. The red lines represent the ideal speed-up and the ideal efficiency, respectively. (Color figure online)
Fig. 6. (a) Strong scalability: 1000 docking tasks on different numbers of worker nodes in the cluster. (b) Weak scalability of the application on a Spark cluster.
number of nodes × 100. As the number of nodes increases, the number of tasks increases in proportion. This setting increases the cluster size and the problem size at the same rate, so the experiment tests the weak scalability of IDOS, as shown in Fig. 6(b). According to the results shown in Fig. 6, IDOS performs well in both strong and weak scalability.
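For reference, the quantities reported in Sects. 3.3 and 3.4 follow the usual definitions of speedup and parallel efficiency. The short Python sketch below states them explicitly; the function names are hypothetical and the timings in the usage note are made-up placeholders, not measurements from this paper.

```python
def speedup(t_single, t_parallel):
    """Speedup: single-node run time divided by the run time on the cluster."""
    return t_single / t_parallel


def strong_efficiency(t_single, t_parallel, nodes):
    """Strong scaling: fixed problem size, efficiency = speedup / node count."""
    return speedup(t_single, t_parallel) / nodes


def weak_efficiency(t_base, t_scaled):
    """Weak scaling: workload grows with the node count, so the ideal run time
    stays constant and efficiency = base time / time on the scaled problem."""
    return t_base / t_scaled
```

For example, with a hypothetical single-node time of 600 s and a six-node time of 120 s, `speedup(600, 120)` is 5.0 and `strong_efficiency(600, 120, 6)` is about 0.83, which is the kind of value the text summarizes as "more than 0.8".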
4 Conclusion and Discussion
In this paper, we introduced IDOS (Improved D3DOCK on Spark), a large-scale virtual drug screening system that runs on a single node as well as on an Apache Spark cluster. One of our main purposes is to improve the docking speed of D3DOCK. As far as we know, IDOS is the first molecular docking software implemented on a Spark cluster, and it can easily run on Linux, Windows and macOS. In our experiments, IDOS is about 5–6 times faster than the original D3DOCK. Another purpose is to design a framework suitable for molecular docking in distributed and cloud computing environments. We ported D3DOCK into the IDOS framework and provide an interface that can run other docking software in IDOS. We uploaded the image deployed with the IDOS framework to Docker Hub, which is convenient for
scholars to deploy. There are two directions for future work. First, our work in this paper is based on batch processing, but stream processing can overlap communication time and processing time [2]. Therefore, we can develop a molecular docking system based on stream processing and compare it with the batch processing system. Second, we plan to deploy IDOS on a public cloud, providing an access interface and a cloud docking service to external users.
Acknowledgments. This work was supported by National Key R&D Program of China 2017YFB0202602, 2018YFC0910405, 2017YFC1311003, 2016YFC1302500, 2016YFB0200400, 2017YFB0202104; NSFC Grants U19A2067, 61772543, U1435222, 61625202, 61272056; Science Foundation for Distinguished Young Scholars of Hunan Province (2020JJ2009); Science Foundation of Changsha kq2004010; JZ20195242029, JH20199142034, Z202069420652; The Funds of Peng Cheng Lab, State Key Laboratory of Chemo/Biosensing and Chemometrics; the Fundamental Research Funds for the Central Universities, and Guangdong Provincial Department of Science and Technology under grant No. 2016B090918122.
References
1. Aoyama, K., Watanabe, H., Ohue, M., Akiyama, Y.: Multiple HPC environments-aware container image configuration workflow for large-scale all-to-all protein–protein docking calculations. In: Panda, D.K. (ed.) SCFA 2020. LNCS, vol. 12082, pp. 23–39. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-48842-0_2
2. Carbone, P., Ewen, S., Fóra, G., Haridi, S., Richter, S., Tzoumas, K.: State management in Apache Flink®: consistent stateful distributed stream processing. Proc. VLDB Endow. 10(12), 1718–1729 (2017)
3. Cheng, Q., et al.: mD3DOCKxb: a deep parallel optimized software for molecular docking with Intel Xeon Phi coprocessors. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 725–728. IEEE (2015)
4. Collignon, B., Schulz, R., Smith, J.C., Baudry, J.: Task-parallel message passing interface implementation of Autodock4 for docking of very large databases of compounds using high-performance super-computers. J. Comput. Chem. 32(6), 1202–1209 (2011)
5. Cross, J.B., et al.: Comparison of several molecular docking programs: pose prediction and virtual screening accuracy. J. Chem. Inf. Model. 49(6), 1455–1474 (2009)
6. Ellingson, S.R., Baudry, J.: High-throughput virtual molecular docking: hadoop implementation of AutoDock4 on a private cloud. In: Proceedings of the 2nd International Workshop on Emerging Computational Methods for the Life Sciences, pp. 33–38 (2011)
7. Ke, N., Baudry, J., Makris, T.M., Schuler, M.A., Sligar, S.G.: A retinoic acid binding cytochrome P450: CYP120A1 from synechocystis sp. PCC 6803. Arch. Biochem. Biophys. 436(1), 110–120 (2005)
8. Kitchen, D.B., Decornez, H., Furr, J.R., Bajorath, J.: Docking and scoring in virtual screening for drug discovery: methods and applications. Nat. Rev. Drug Disc. 3(11), 935–949 (2004)
9. Liu, Y., Xu, Z., Yang, Z., Chen, K., Zhu, W.: A knowledge-based halogen bonding scoring function for predicting protein-ligand interactions. J. Mol. Model. 19(11), 5015–5030 (2013). https://doi.org/10.1007/s00894-013-2005-7
10. Malysiak-Mrozek, B., Danilowicz, P., Mrozek, D.: Efficient 3D protein structure alignment on large Hadoop clusters in Microsoft Azure cloud. In: Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D. (eds.) BDAS 2018. CCIS, vol. 928, pp. 33–46. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99987-6_3
11. Mrozek, D., Suwala, M., Malysiak-Mrozek, B.: High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model. Knowl. Inf. Syst. 60(1), 145–178 (2018). https://doi.org/10.1007/s10115-018-1245-3
12. Nasiri, H., Nasehi, S., Goudarzi, M.: A survey of distributed stream processing systems for smart city data analytics. In: Proceedings of the International Conference on Smart Cities and Internet of Things, pp. 1–7 (2018)
13. Ohue, M., Aoyama, K., Akiyama, Y.: High-performance cloud computing for exhaustive protein-protein docking. arXiv preprint arXiv:2006.08905 (2020)
14. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Inc. (2012)
15. Yang, Z., et al.: A quantum mechanics-based halogen bonding scoring function for protein-ligand interactions. J. Mol. Model. 21(6), 1–21 (2015). https://doi.org/10.1007/s00894-015-2681-6
16. Yueli, D., Quan, G., Bin, S.: A molecular docking platform based on Hadoop. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 284–287. IEEE (2017)
Biomedical Data
A New Deep Learning Training Scheme: Application to Biomedical Data
Jianhong Cheng1,2, Qichang Zhao1, Lei Xu1, and Jin Liu1(B)
1 Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
[email protected]
2 Institute of Guizhou Aerospace Measuring and Testing Technology, Guiyang 550009, People’s Republic of China
Abstract. Improving the performance of deep learning algorithms and reducing training time are ongoing challenges in bioinformatics. Strategies such as learning rate adjustment and early stopping, combined with cross-validation training, have been proposed to address these challenges. These approaches still take considerable training time and face bottlenecks in improving performance under traditional cross-validation training settings. In this study, we propose a successive cross-validation training strategy for biomedical data and develop a new training scheme that uses weight transfer learning to improve performance with reduced training time. We design and perform multiple experiments in three different domains to evaluate the proposed training scheme. The deep learning models include DeepDTA for drug-target affinity prediction, RAAU-Net for glioma segmentation, and DeepCaps for image classification. Experimental results demonstrate that the proposed training scheme not only outperforms the existing scheme in both performance and efficiency in bioinformatics but can also be easily generalized to a variety of intelligent tasks.
Keywords: Biomedical data · Deep learning · Cross-validation · Weight transfer learning
1 Introduction
Drug discovery, semantic segmentation, and image classification are crucial applications of artificial intelligence in bioinformatics, precision medicine, autonomous driving, and other fields. Although these applications have achieved great success through numerous deep learning technologies [5,14], training a deep learning model usually requires continuous iterative optimization, which is computationally expensive and time-consuming.
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 451–459, 2021. https://doi.org/10.1007/978-3-030-91415-8_38
Artificial neural networks (ANNs), especially convolutional neural networks (CNNs), have recently achieved impressive success in many intelligent tasks such as drug-target predictions [16,18], medical image segmentations [9],
and image classifications [12,26]. While ANNs have become the best-known benchmark in deep learning and have surpassed human performance in certain domains [6,7,17,20,22], their training process is usually based on cross-validation (CV) and iterative optimization on a limited number of samples. CV is one of the most common techniques in machine learning and is used for evaluating models or fine-tuning hyperparameters. In the absence of an independent validation dataset, k-fold cross-validation (k-FCV) is usually used to estimate the performance of prediction models. For example, Deepak et al. used 5-FCV on an MRI dataset from figshare to evaluate their proposed classification system [8]. Yang et al. proposed a CNN model based on transfer learning for glioma grading and evaluated the performance of the model using 5-FCV [24]. Iesmantas et al. employed a convolutional capsule network (CapsNet) trained with cross-validation for the classification of breast cancer histology images [11]. Esteva et al. utilized a CNN to classify a three-class disease partition and a nine-class disease partition and demonstrated that the overall accuracy of the CNN trained with 9-FCV is better than that of two dermatologists [10]. 3-FCV and 4-FCV have also been used to evaluate models [25]. CV is also used to find the optimal parameters and to prevent overfitting, and numerous outstanding works train with CV to obtain the optimal parameters of their models. A multiple instance learning-based deep learning system was presented for diagnosing tissue types [3]; a total of 44,732 whole slide images from 15,187 patients were evaluated, 15% of which were used for tuning the hyperparameters. Our previous study proposed RAAU-Net for brain tumor segmentation and trained the network with 5-FCV on the training data to tune the hyperparameters [4].
These networks adjust the learning rate and prevent overfitting by checking whether the validation loss decreases within a fixed number of epochs. Compared with training for a fixed number of epochs, these strategies save some time and avoid unnecessary training epochs. However, challenges remain when cross-validation experiments are performed: insufficient data utilization and a high training time cost. In this study, to tackle these challenges and make effective use of the limited dataset, we propose a new deep learning training scheme based on the cross-validation training strategy. The proposed training scheme employs weight transfer to accelerate network training and trains successively across the cross-validation folds to improve network performance. To evaluate the proposed training scheme, multiple experiments are designed and performed on three benchmark tasks: drug discovery [21], medical image segmentation [4], and image classification [19]. Experimental results show that the proposal can improve the performance of deep learning models and reduce their training time.
2 Successive k-fold Cross-Validation
CV is a resampling procedure used to evaluate the generalization ability of prediction models and to prevent overfitting on a limited data sample [2]. It is similar to repeated random subsampling, but the sampling method
Fig. 1. Comparison diagram between k-FCV and Sk-FCV. a) is the traditional k-fold cross-validation, namely k-FCV. b) is the proposed training scheme, namely Sk-FCV. [Figure labels: Average Result; Weights transfer; Result. Legend: Training data, Validation data, Testing data.]
is executed in such a way that no two validation subsets overlap. In k-FCV, the available learning data is randomly divided into k disjoint subsets of approximately equal size. Of these, k−1 subsets form the training set and are used for training models, and the remaining subset is used as the validation set. This procedure is repeated in turn until each of the k subsets has been used as the validation set. In practice, if there is no independent testing set, a fraction of about 10%–30% of the training set is needed to fine-tune the hyperparameters and select models, and the validation set is used to measure the performance of the models. If there is an independent testing set, the validation set is used to fine-tune the hyperparameters and select the optimal model, and the model is applied to the testing set to measure the performance. Finally, the average of the k performance measurements on the k validation sets or on the independent testing set usually serves as the cross-validated performance. For simplicity, we consider the scenario with an independent testing set in this study. Figure 1a) illustrates this scenario for k-FCV.
Based on the above analysis, two challenges are apparent: 1) holding out part of the data for model selection leads to insufficient data utilization; and 2) each fold requires training from scratch, resulting in a high training time cost. In this study, we improve the k-FCV training strategy with weight transfer learning and propose a successive k-fold cross-validation (Sk-FCV) training scheme to address these challenges. Weight transfer learning is introduced into the cross-validation procedure to improve performance with reduced training time. As shown in Fig. 1b), we randomly split the available learning data into k disjoint parts of approximately equal size, k−1 parts of which are used as training data while the remaining one is used as validation data.
The model is trained fold by fold as in k-FCV, but the trained weights of each fold are used as the initial weights of the next fold, in turn, until the last fold. Then, we use independent testing data to evaluate the performance
of the final trained model. Since it is similar to k-FCV, we call our proposed training scheme successive k-fold cross-validation. Compared with k-FCV, Sk-FCV has two main advantages for deep learning tasks. 1) The whole training dataset is fully utilized for training and validation. That is, more training samples participate in training and each sample guides parameter tuning, which helps improve the prediction accuracy of the model. 2) The scheme takes full advantage of weight transfer as a kind of knowledge, improving both the generalization performance of the network and its training speed.
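The difference between k-FCV and Sk-FCV comes down to which weights each fold starts from. The following Python sketch shows this with a deliberately tiny stand-in model, a single weight fitted by gradient descent to y ≈ w·x, instead of a deep network. All names (`k_fold_indices`, `train`, `mse`, `cross_validate`) and the toy data in the usage note are assumptions made for illustration; this is not the authors' Keras code.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint, near-equal folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def train(weight, xs, ys, epochs, lr=0.01):
    """Toy trainer: gradient descent on the mean squared error of y ≈ weight * x."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        weight -= lr * grad
    return weight


def mse(weight, xs, ys):
    """Mean squared error of the toy model on the given samples."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, k, successive, init_weight=0.0, epochs=5):
    """Run k folds; with successive=True each fold starts from the previous weights."""
    folds = k_fold_indices(len(xs), k)
    weight, val_scores = init_weight, []
    for held_out in range(k):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        val_idx = folds[held_out]
        # The only difference between the two schemes is the starting point.
        start = weight if successive else init_weight
        weight = train(start, [xs[i] for i in train_idx], [ys[i] for i in train_idx], epochs)
        val_scores.append(mse(weight, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return weight, val_scores
```

With toy data such as `xs = [0.1 * i for i in range(1, 21)]` and `ys = [3.0 * x for x in xs]`, the successive variant ends closer to the true weight for the same number of epochs per fold. Note that in the successive variant later validation folds have already served as training data in earlier folds, which is consistent with the text's use of an independent testing set for the final evaluation.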
3 Experiments and Results
To evaluate the effectiveness of the proposed training scheme, we conduct multiple experiments on two biomedical benchmarks: a biomedical regression task and medical image segmentation. To validate its practicability, we further generalize it to image classification.
3.1 Experimental Datasets
In this study, we test our training scheme on several benchmark datasets. The KIBA dataset [21] is used for the biomedical regression task. The BraTS 2018 dataset, which is derived from the Brain Tumor Segmentation Challenge (BraTS) 2018 [1], is used for tumor segmentation. The CIFAR10 [13], MNIST [15], and Fashion MNIST [23] datasets are used for image classification. The detailed statistics of the datasets are shown in Table 1.
Table 1. Summary of the datasets used in this study.
Task                              Dataset         Training data   Testing data
Drug-target affinity prediction   KIBA            98,545          19,709
Tumor segmentation                BraTS 2018      285             66
Image classification              CIFAR10         50,000          10,000
Image classification              MNIST           60,000          10,000
Image classification              Fashion MNIST   60,000          10,000
3.2 Benchmark Methods
In the experiments, we evaluate the proposed training scheme with a benchmark method for each task. DeepDTA [18], which comprises two CNN blocks that learn high-level representations from drug SMILES strings and protein sequences, is used to predict binding affinity on the KIBA dataset. Our previous RAAU-Net [4] is used as the tumor segmentation model on the BraTS 2018 dataset. The deep capsule network (DeepCaps) model [19] is used for image classification on the CIFAR10, MNIST, and Fashion MNIST datasets.
3.3 Experimental Settings
All experiments are run on an NVIDIA TITAN V GPU, and all networks are developed with the Keras library. The parameter settings of the three networks are presented in Table 2. For example, Adam is used as the optimizer with an initial learning rate of 5e−4 for RAAU-Net. The early stopping strategy is adopted to avoid overfitting; for RAAU-Net, training is terminated if the validation loss does not decrease within 50 epochs. The other settings of the three networks, such as the loss function, are the same as in the references [4,18,19]. Dice, accuracy, and mean square error (MSE) are used to evaluate the performance of the segmentation, classification, and regression tasks, respectively, and the training time of each network is recorded. Furthermore, we conduct three comparison experiments for each task: 3-FCV vs S3-FCV, 5-FCV vs S5-FCV, and 10-FCV vs S10-FCV.
Table 2. Parameter settings of three networks.
Network    Optimizer   Learning rate   Early stopping (epochs)
DeepDTA    Adam        1e−3            5
RAAU-Net   Adam        5e−4            50
DeepCaps   Adam        1e−3            20
3.4 Drug-Target Affinity Prediction
We test the proposed training scheme using the DeepDTA [18] method with the KIBA dataset [21] in drug-target affinity prediction, which is an important part of the drug discovery process and is regarded as a regression task. The comparison results are presented in Table 3. In terms of training time, S3-FCV, S5-FCV, and S10-FCV save 44.98%, 60.55%, and 55.13% compared with 3-FCV, 5-FCV, and 10-FCV, respectively. The model trained with S10-FCV obtains the lowest MSE, 0.17, achieving state-of-the-art performance in drug-target affinity prediction. Therefore, the proposed training scheme is also effective in computational biology.
3.5 Tumor Segmentation
In the tumor segmentation task, three sub-regions are considered for evaluation: whole tumor (WT), tumor core (TC), and enhancing tumor (ET). Table 4 presents the comparison results of RAAU-Net trained using our proposed Sk-FCV and traditional k-FCV. The results show that RAAU-Net trained with Sk-FCV not only improves segmentation performance in all three sub-regions but also reduces training time compared with the model trained with k-FCV. More specifically, the Dice score of RAAU-Net trained using S3-FCV is 0.70 (WT), 2.48 (TC), and 3.04 (ET) percentage points higher than that using traditional 3-FCV, and the corresponding training time is reduced by 25.19%.
J. Cheng et al.
Table 3. Training time and MSE of DeepDTA trained using different training schemes.

Training scheme  Time (h)  MSE
3-FCV            2.49      0.27
S3-FCV           1.37      0.18
5-FCV            3.65      0.24
S5-FCV           1.44      0.21
10-FCV           9.16      0.23
S10-FCV          4.11      0.17
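The time savings quoted in Sect. 3.4 follow directly from the training times in Table 3. The short check below recomputes them as the relative reduction (baseline − proposed) / baseline; the variable and function names are illustrative only.

```python
# Training times in hours, copied from Table 3 (DeepDTA on KIBA).
# Each entry is (k-FCV time, Sk-FCV time) for the given number of folds k.
TIMES_H = {
    3: (2.49, 1.37),
    5: (3.65, 1.44),
    10: (9.16, 4.11),
}


def time_saving_percent(baseline_h, proposed_h):
    """Relative reduction in training time, in percent."""
    return 100.0 * (baseline_h - proposed_h) / baseline_h


savings = {k: round(time_saving_percent(*t), 2) for k, t in TIMES_H.items()}
for k, s in savings.items():
    print(f"S{k}-FCV saves {s:.2f}% of the {k}-FCV training time")
```

The same formula applied to Table 4 gives the 25.19% and 41.41% reductions reported for RAAU-Net.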
With traditional cross-validation, RAAU-Net trained with 10-FCV achieves competitive segmentation performance, with average Dice scores of 87.33% for WT, 78.93% for TC, and 70.22% for ET, but it requires 63.88 h of training. When the model is trained with S10-FCV, it saves 41.41% of the training time and improves the Dice score by 0.18 (WT), 1.18 (TC), and 1.84 (ET) percentage points compared with the model trained with 10-FCV. Therefore, our proposed Sk-FCV achieves high segmentation performance with low training overhead.

Table 4. Training time and Dice score of RAAU-Net trained using different training schemes.

Training scheme  Time (h)  Dice WT (%)  Dice TC (%)  Dice ET (%)
3-FCV            21.87     86.46        75.63        69.15
S3-FCV           16.36     87.16        78.11        72.19
5-FCV            38.62     86.56        77.51        69.27
S5-FCV           19.63     87.80        78.57        72.87
10-FCV           63.88     87.33        78.93        70.22
S10-FCV          37.43     87.51        80.11        72.06
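For readers unfamiliar with the metric reported in Table 4, the Dice score of a predicted mask A against a reference mask B is 2|A∩B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the function name and the convention for two empty masks are assumptions, not details taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: the prediction marks two voxels, one of which is correct.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

In a multi-region setting such as WT, TC, and ET, the score would be computed once per sub-region mask and then averaged over cases.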
3.6 Image Classification
To examine whether the scheme generalizes to image classification, we test it using the DeepCaps model on three benchmark datasets: CIFAR10, MNIST, and Fashion MNIST. Figure 2 compares the Sk-FCV and k-FCV training schemes in terms of training time and classification accuracy. As shown in Fig. 2a)–c), the training time of Sk-FCV is much shorter than that of k-FCV. Compared with 3-FCV, S3-FCV reduces the training time by 44.82%, 45.51%, and 43.43% on the three datasets, respectively. Compared with 5-FCV, S5-FCV reduces the training time by 37.23%, 46.94%, and 46.99%, respectively. Likewise, S10-FCV reduces the training time by 61.86%, 57.23%, and 64.69% relative to 10-FCV. As the number of folds increases, the time advantage of the proposed training scheme tends to become more pronounced. From Fig. 2d), we can see that the classification accuracy of the model trained using Sk-FCV also improves on all three datasets. Therefore, the proposed training scheme not only enhances generalization but also accelerates training in the image classification domain.
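One way to see why carrying weights from fold to fold can shorten training is a toy experiment. The sketch below is an assumption-laden illustration, not the authors' implementation: it reads Sk-FCV as "each fold starts from the weights learned in the previous fold" (based on the paper's description of k-FCV combined with weight transfer for successive training), uses a one-parameter linear model fitted by gradient descent, and counts update steps instead of hours. All names are hypothetical.

```python
def fit(w, xs, ys, lr=0.1, tol=1e-6, max_steps=10_000):
    """Gradient descent on the MSE of y ≈ w * x; returns (w, update steps used)."""
    n = len(xs)
    for step in range(max_steps):
        grad = 2.0 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
        if abs(grad) < tol:
            return w, step
        w -= lr * grad
    return w, max_steps


def cross_validate(xs, ys, k, carry_weights):
    """Total update steps over k folds, with or without weight carry-over."""
    total, w = 0, 0.0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        if not carry_weights:
            w = 0.0  # conventional k-FCV: re-initialise for every fold
        w, steps = fit(w, [xs[i] for i in train], [ys[i] for i in train])
        total += steps
    return total


# Synthetic, noise-free data y = 2x, for illustration only.
xs = [i / 10 for i in range(1, 11)]
ys = [2.0 * x for x in xs]
cold = cross_validate(xs, ys, k=5, carry_weights=False)
warm = cross_validate(xs, ys, k=5, carry_weights=True)
print(f"update steps without carry-over: {cold}, with carry-over: {warm}")
```

In this toy setting the first fold costs the same either way, and later folds start close to the optimum, so the total number of updates drops. Real networks with validation-based early stopping behave less cleanly, but the mechanism is the same.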
[Fig. 2, panels a)–c): bar charts of training time (h) for k-FCV and Sk-FCV on CIFAR10, MNIST, and Fashion MNIST, with one legend entry per fold; a) 3-FCV vs S3-FCV, b) 5-FCV vs S5-FCV, c) 10-FCV vs S10-FCV.]

d) Classification accuracy of DeepCaps on three datasets with different training schemes.

Training scheme  CIFAR10  MNIST   Fashion MNIST
3-FCV            89.41%   99.11%  93.72%
S3-FCV           91.13%   99.14%  93.87%
5-FCV            90.14%   99.09%  94.24%
S5-FCV           91.59%   99.30%  94.72%
10-FCV           91.00%   99.17%  94.38%
S10-FCV          91.36%   99.25%  94.61%

Fig. 2. Performance of DeepCaps on three datasets. a) Training time of S3-FCV vs 3-FCV; b) training time of S5-FCV vs 5-FCV; c) training time of S10-FCV vs 10-FCV; d) classification accuracy of DeepCaps on the three datasets with different training schemes.
4 Conclusion and Future Work
In this study, we propose a fast and effective training scheme, namely Sk-FCV, which combines k-FCV with weight transfer learning to train deep learning models successively. Comprehensive experiments are carried out on three networks and multiple datasets from three different domains, and the empirical results demonstrate that the Sk-FCV scheme can improve predictive performance while reducing training time compared with k-FCV. Especially in bioinformatics and computational biology applications, where data samples are relatively scarce and valuable, Sk-FCV can make full use of the sample information to exploit the capacity of the models. More importantly, the method can be readily applied to other learning tasks as an alternative to traditional cross-validation training. In future work, we will analyze this scheme in more depth with respect to the data distribution.

Acknowledgments. This work is funded partially by the National Natural Science Foundation of China under Grant No. 61802442, No. 61877059, the Natural Science Foundation of Hunan Province under Grant No. 2019JJ50775, the 111 Project (No. B18059), the Hunan Provincial Science and Technology Program (No. 2018WK4001), and the Hunan Provincial Science and Technology Innovation Leading Plan (No. 2020GK2019).
References
1. Bakas, S., et al.: Advancing the Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 170117 (2017)
2. Berrar, D.: Cross-validation. In: Ranganathan, S., Gribskov, M., Nakai, K., Schönbach, C. (eds.) Encyclopedia of Bioinformatics and Computational Biology, pp. 542–545. Academic Press, Oxford (2019)
3. Campanella, G., et al.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25(8), 1301–1309 (2019)
4. Cheng, J., Liu, J., Liu, L., Pan, Y., Wang, J.: Multi-level glioma segmentation using 3D U-Net combined attention mechanism with atrous convolution. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1031–1036. IEEE (2019)
5. Cheng, J., et al.: Automated diagnosis of COVID-19 using deep supervised autoencoder with multi-view features from CT images. IEEE/ACM Trans. Comput. Biol. Bioinf. (2021). https://doi.org/10.1109/TCBB.2021.3102584
6. Cheng, J., et al.: Prediction of glioma grade using intratumoral and peritumoral radiomic features from multiparametric MRI images. IEEE/ACM Trans. Comput. Biol. Bioinf. (2020). https://doi.org/10.1109/TCBB.2020.3033538
7. Cheng, J., et al.: Multimodal disentangled variational autoencoder with game theoretic interpretability for glioma grading. IEEE J. Biomed. Health Inform. (2021). https://doi.org/10.1109/JBHI.2021.3095476
8. Deepak, S., Ameer, P.: Brain tumor classification using deep CNN features via transfer learning. Comput. Biol. Med. 111, 103345 (2019)
9. Dolz, J., Gopinath, K., Yuan, J., Lombaert, H., Desrosiers, C., Ayed, I.B.: HyperDense-Net: a hyper-densely connected CNN for multi-modal image segmentation. IEEE Trans. Med. Imaging 38(5), 1116–1126 (2018)
10. Esteva, A., et al.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115–118 (2017)
11. Iesmantas, T., Alzbutas, R.: Convolutional capsule network for classification of breast cancer histology images. In: Campilho, A., Karray, F., ter Haar Romeny, B. (eds.) ICIAR 2018. LNCS, vol. 10882, pp. 853–860. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93000-8_97
12. Kooi, T., et al.: Large scale deep learning for computer aided detection of mammographic lesions. Med. Image Anal. 35, 303–312 (2017)
13. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
14. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
15. LeCun, Y., Cortes, C., Burges, C.: MNIST handwritten digit database. AT&T Labs. http://yann.lecun.com/exdb/mnist. Accessed Feb 2010
16. Lee, I., Keum, J., Nam, H.: DeepConv-DTI: prediction of drug-target interactions via deep learning with convolution on protein sequences. PLOS Comput. Biol. 15(6), e1007129 (2019)
17. Liu, J., Zeng, D., Guo, R., Lu, M., Wu, F.X., Wang, J.: MMHGE: detecting mild cognitive impairment based on multi-atlas multi-view hybrid graph convolutional networks and ensemble learning. Clust. Comput. 24(1), 103–113 (2021)
18. Öztürk, H., Özgür, A., Özkirimli, E.: DeepDTA: deep drug-target binding affinity prediction. Bioinformatics 34(17), i821–i829 (2018)
19. Rajasegaran, J., Jayasundara, V., Jayasekara, S., Jayasekara, H., Seneviratne, S., Rodrigo, R.: DeepCaps: going deeper with capsule networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10725–10733 (2019)
20. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
21. Tang, J., et al.: Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J. Chem. Inf. Model. 54(3), 735–743 (2014)
22. Wang, Y., Liu, J., Xiang, Y., Wang, J., Chen, Q., Chong, J.: MAGE: automatic diagnosis of autism spectrum disorders using multi-atlas graph convolutional networks and ensemble learning. Neurocomputing (2021). https://doi.org/10.1016/j.neucom.2020.06.152
23. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
24. Yang, Y., et al.: Glioma grading on conventional MR images: a deep learning study with transfer learning. Front. Neurosci. 12, 804 (2018)
25. Yuan, Y., Bar-Joseph, Z.: Deep learning for inferring gene relationships from single-cell expression data. Proc. Natl. Acad. Sci. 116(52), 27151–27158 (2019)
26. Zhang, J., Xie, Y., Wu, Q., Xia, Y.: Medical image classification using synergic deep learning. Med. Image Anal. 54, 10–19 (2019)
EEG-Based Emotion Recognition Fusing Spacial-Frequency Domain Features and Data-Driven Spectrogram-Like Features
Chen Wang1, Jingzhao Hu1, Ke Liu1, Qiaomei Jia1, Jiayue Chen1, Kun Yang1,2, and Jun Feng1(B)
1 The School of Information Science and Technology, Northwest University, Xi’an 710127, Shaanxi, China
[email protected], [email protected]
2 The School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK
Abstract. Research on emotion recognition based on EEG (electroencephalogram) signals has gradually become a hot spot in the field of artificial intelligence applications. Existing recognition methods mainly either design traditional hand-extracted features for machine learning or extract EEG features fully automatically with deep learning. However, a single type of feature cannot fully represent the emotional information contained in EEG signals. Traditional hand-extracted features may lose much of the hidden information contained in raw signals, while automatically extracted features do not contain prior knowledge. In this context, this paper proposes a multi-input Y-shaped EEG-based emotion recognition neural network that fuses spacial-frequency domain features and data-driven spectrogram-like features. It can effectively extract information in three domains (time, space, and frequency) from raw EEG signals. Moreover, this paper also proposes a novel EEG feature mapping method. The experimental results show that the method achieves state-of-the-art EEG emotion recognition accuracy on the established DEAP benchmark dataset. The average emotion recognition rates are 71.25%, 71.33%, and 71.1% in valence, arousal, and dominance, respectively.
Keywords: Emotion recognition · EEG · Deep neural networks · Data-driven · Feature fusion
1 Introduction
Emotions are a means for humans to express and transmit various attitudes, and their recognition is extremely important in daily life [15]. For instance, monitoring the emotions of the elderly in a nursing home makes it possible to regulate the temperature and lighting of a room in real time. Likewise, by detecting the emotions of children with depression or autism, they can be treated and protected more promptly and appropriately [2,11]. In particular, the emotional state of humans can be recognized more accurately from EEG (electroencephalogram) signals, because they are not affected by subjective factors that may bias emotion recognition and are not easy to disguise [1,23]. EEG-based emotion recognition has received increasing attention from researchers in the fields of human-computer interaction, affective computing, and biological information recognition [17,22]. In addition, with the development of deep learning and neural networks, these techniques have come to play an important role in many areas of artificial intelligence research, and they also provide powerful technical support for EEG signal processing and emotion recognition [13,14]. In this work, we focus on a multi-input neural network model with feature fusion for EEG-based emotion recognition.

Supported by the National Key Research and Development Program of China under grant 2017YFB1002504.
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 460–470, 2021. https://doi.org/10.1007/978-3-030-91415-8_39

In general, emotion recognition methods can be divided into two major schools. One is based on the design and extraction of traditional manual features; for example, Chao Hao et al. [6] achieved an accuracy of 67.7% by using their proposed multiband feature matrix to recognize emotions from EEG signals. The other is based on deep learning and neural networks, where emotions are classified using deep features extracted automatically by increasingly deep networks; for example, Zheng Wei-Long et al. [24] obtained a mean recognition accuracy of 72.93% with EmotionMeter, a multimodal emotion recognition deep neural network that combines brain waves with eye movements. However, both kinds of methods have shortcomings. Hand-extracted features easily lose rich information about human emotion, so the resulting recognition accuracy is limited [18]. Deep features extracted by neural networks do improve accuracy, but they are hard to interpret and lack prior knowledge [3,4,8]. Thus, this work combines spatial-frequency domain features designed with prior knowledge of emotional information and data-driven features automatically extracted from raw EEG signals to detect human emotions.

In this paper, we propose SFMS-Net, a multi-input Y-shaped deep neural network that fuses spatial-frequency domain features and data-driven spectrogram-like features for EEG-based emotion recognition. It takes Spacial-Frequency Matrices (SFMs) with prior knowledge as the first input of the network. At the same time, scaling convolutional layers automatically extract data-driven spectrogram-like features, which are used as the other input of the network. Our contributions are summarized as follows:
– We propose a novel method of EEG-based emotional feature mapping with prior knowledge and construct the emotional features SFMs.
– A multi-input neural network, SFMS-Net, that fuses spacial-frequency domain features and data-driven spectrogram-like features is proposed for EEG-based emotion recognition.
– SFMS-Net covers information from three domains of the raw EEG signals (time, space, and frequency) and exploits their complementarity. The proposed network substantially improves the performance of subject-independent emotion classification.
2 Methodology
The EEG-based emotion recognition method proposed in this paper consists of three steps: (1) constructing spacial-frequency feature matrices (SFMs); (2) using scaling convolutional layers to automatically extract data-driven spectrogram-like features; and (3) fusing the SFMs and the data-driven features. The details are as follows.
2.1 Spacial-Frequency Matrix
To divide the raw EEG signals into different frequency bands, we first apply the Fourier transform. We then take an energy perspective and compute the power spectral density (PSD) in each frequency band [5]. Owing to the strong nonlinearity of EEG signals, a logarithmic operation is applied to the PSD in each frequency band, which yields the differential entropy (DE) [16]. The DE is given by

$$\mathrm{DE} = -\int_{a}^{b} p(x)\,\log\bigl(p(x)\bigr)\,dx \tag{1}$$

where $p(x)$ is the probability density function of the continuous signal and $[a, b]$ is the interval over which the signal takes values. For a segment of EEG signal that approximately obeys the Gaussian distribution $N(\mu, \sigma_i^2)$, the DE is given by [20]:

$$\mathrm{DE} = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma_i^2}} \log\!\left(\frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma_i^2}}\right) dx = \frac{1}{2}\log\bigl(2\pi e\sigma_i^2\bigr) \tag{2}$$

To better integrate the spatial positions and frequency information of the EEG channels, we map the 32 EEG channel positions into a 9 × 9 matrix according to the international 10–20 standard of EEG electrode positions. Each frequency band is mapped into such a 9 × 9 matrix. The band matrices are then spliced, from left to right and top to bottom, into a spacial-frequency feature matrix (SFM). The construction process is shown in Fig. 1. In this way, we extract the EEG emotional SFMs with prior knowledge and use them as one input of SFMS-Net.
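The construction just described can be sketched in plain Python. The sketch makes several assumptions that are not stated in the text: the number of frequency bands (four here), the tiling width (two band matrices per row), and above all the electrode layout, which is a hypothetical placeholder and not the real 10–20 coordinates. The DE uses the closed form of Eq. (2) with the sample variance of the signal.

```python
import math
from statistics import pvariance


def differential_entropy(samples):
    """Closed form of Eq. (2): 0.5 * log(2 * pi * e * variance)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * pvariance(samples))


def band_map(values, positions, size=9):
    """Place one value per channel on a size x size grid (zeros elsewhere)."""
    grid = [[0.0] * size for _ in range(size)]
    for value, (row, col) in zip(values, positions):
        grid[row][col] = value
    return grid


def build_sfm(de_per_band, positions, bands_per_row=2, size=9):
    """Tile the per-band grids left to right, then top to bottom."""
    maps = [band_map(v, positions, size) for v in de_per_band]
    blank = [[0.0] * size for _ in range(size)]
    while len(maps) % bands_per_row:  # pad the last row of tiles if needed
        maps.append(blank)
    sfm = []
    for start in range(0, len(maps), bands_per_row):
        tiles = maps[start:start + bands_per_row]
        for r in range(size):
            sfm.append([x for tile in tiles for x in tile[r]])
    return sfm


# Hypothetical placeholder layout: 32 distinct grid cells as (row, col),
# NOT the real 10-20 electrode coordinates.
POSITIONS = [divmod(i, 9) for i in range(0, 64, 2)]

# Made-up DE values for 4 bands x 32 channels, used only to check shapes.
de = [[band + 1 + ch / 100 for ch in range(32)] for band in range(4)]
sfm = build_sfm(de, POSITIONS)
print(len(sfm), len(sfm[0]))
```

With four bands and two tiles per row, the resulting SFM is an 18 × 18 matrix; a different number of bands or tiling width changes only the outer shape.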
Fig. 1. The process of spacial-frequency matrix construction.
2.2 Scaling Convolutional Layer
From the perspective of time-frequency analysis and deep learning, we use scaling convolutional layers to automatically extract time-frequency information from the raw EEG signals [10]. A scaling convolutional layer accepts raw signals of any length. First, the cross-correlation between the EEG signal and an initial scaling convolution kernel is computed. To ensure that the output length matches the input length after each cross-correlation operation, the length of the initial scaling convolution kernel must be odd. The kernel is then downsampled to change its size, and the cross-correlation is repeated. These operations continue until the kernel reaches the lower size bound set for downsampling. In this way, we obtain a set of data-driven spectrogram-like features extracted automatically by many independent scaling convolutional layers. Figure 2 shows the process of a scaling layer, which can be expressed as

$$H^{\mathrm{output}}(l) = \delta\bigl(\mathrm{bias}(l) + \mathrm{downSample}(\mathrm{weight}, l) \otimes H^{\mathrm{input}}\bigr) \tag{3}$$

where $H^{\mathrm{input}}$ is the input of the scaling layer, a one-dimensional signal of shape (time steps, 1), and $H^{\mathrm{output}}$ is the matrix of activations, a data-driven spectrogram-like feature map of shape (time steps, scaling levels). Here, bias holds the biases of the kernels generated by scaling the basic kernel, $\delta(\cdot)$ denotes an activation function, weight is the basic kernel from which the other kernels are scaled, and $l$ is a hyper-parameter that controls the scaling level. The operator $\otimes$ is a valid cross-correlation, defined as

$$(f \otimes g)[n] = \sum_{m=0}^{N-1} f[m]\, g[(m+n) \bmod N] \tag{4}$$

Next, we use an independent scaling convolutional layer for each channel to extract the data-driven spectrogram-like features. A three-dimensional feature tensor containing the information of all channels is then obtained by stacking these features along the channel dimension. This is the second feature incorporated in the EEG-based emotion recognition network, and we use it as the other input of SFMS-Net.
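A single-channel scaling layer in the spirit of Eqs. (3) and (4) can be sketched as follows. Several details are assumptions made for illustration, since the text does not specify them: the kernel is downsampled by averaging neighbouring pairs, ReLU stands in for the activation δ, the lower bound on the kernel length is 2, and the cross-correlation is the circular form written in Eq. (4) with the kernel implicitly zero-padded to the signal length (which keeps the output length equal to the input length for any kernel size).

```python
def cross_correlate(kernel, signal):
    """Eq. (4), with the kernel treated as zero beyond its own length."""
    n_samples = len(signal)
    return [
        sum(k * signal[(m + n) % n_samples] for m, k in enumerate(kernel))
        for n in range(n_samples)
    ]


def halve(kernel):
    """Assumed downsampling: average neighbouring pairs (a trailing odd tap is dropped)."""
    return [(kernel[i] + kernel[i + 1]) / 2.0 for i in range(0, len(kernel) - 1, 2)]


def scaling_layer(signal, weight, biases=None, min_len=2):
    """Return H_output as a list of rows with shape (time steps, scaling levels)."""
    kernels = [list(weight)]
    while len(kernels[-1]) // 2 >= min_len:
        kernels.append(halve(kernels[-1]))
    if biases is None:
        biases = [0.0] * len(kernels)  # one bias per scaling level
    columns = []
    for kernel, bias in zip(kernels, biases):
        response = cross_correlate(kernel, signal)
        columns.append([max(0.0, bias + r) for r in response])  # ReLU as delta
    return [list(row) for row in zip(*columns)]


# Tiny example: a 4-sample signal and a 4-tap basic kernel give 2 scaling levels.
H = scaling_layer([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0, 0.0])
```

Each column of `H` is the response at one kernel scale, so the output resembles a spectrogram with time along the rows and scale along the columns. With the kernel size of 65 reported in the experiments, this halving rule would give six scaling levels, though the actual number depends on the downsampling rule used by the authors.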
Fig. 2. The process of scaling layers [10].
2.3 Proposed SFMS-Net
Fig. 3 shows our proposed multi-input Y-shaped EEG-based emotion recognition network. It is a dual-branch deep neural network with two inputs. The first input is the raw EEG signal with 32 channels; passing it through 32 scaling convolutional layers yields a three-dimensional data-driven spectrogram-like feature. The second input is the spacial-frequency feature matrix (SFM), which carries electrode position and frequency information and is extracted based on prior knowledge. Three different convolutional layers and a global pooling layer then perform feature transformations on them, giving the prior feature vector f1 and the data-driven feature vector f2. Next, we concatenate the two feature vectors along the first dimension into a fusion vector F and send it to the Dense layers for deep fusion, which extracts high-order semantic features. Finally, the result is passed to the softmax activation function, the last layer of SFMS-Net, to classify emotions. The design of the entire network fuses spacial-frequency domain features, which contain prior knowledge, with data-driven features automatically extracted by the scaling convolutional layers. On the one hand, the SFM features carry the spatial and frequency information of the EEG signals. On the other hand, the data-driven features carry the time-frequency information of the EEG signals. The proposed network can therefore integrate different types of features and cover the time, space, and frequency information of EEG signals.
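The fusion step at the end of the two branches (concatenate f1 and f2, apply dense layers, then softmax) can be illustrated with a few lines of plain Python. This is a minimal sketch with a single dense layer and made-up numbers; the real network's layer sizes and weights are not given in the text, and all names are hypothetical.

```python
import math


def dense(x, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(f1, f2, weights, biases):
    """Concatenate the two branch vectors, apply one dense layer, then softmax."""
    fused = list(f1) + list(f2)  # fusion vector F
    return softmax(dense(fused, weights, biases))


# Made-up numbers: 2-d prior features, 3-d data-driven features, 2 classes.
f1 = [0.2, -0.1]
f2 = [0.5, 0.0, 0.3]
W = [[1.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 1.0, 1.0]]
b = [0.0, 0.0]
probs = fuse_and_classify(f1, f2, W, b)
```

Concatenation keeps the two feature types separate until the dense layers, which lets the network learn how much weight to give prior-knowledge features versus data-driven ones.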
3 Experiments
In this section, we will introduce the EEG-based emotion dataset and the experiment settings and analyze the performance of the proposed model.
Fig. 3. The proposed EEG-based emotion recognition network fusing spacial-frequency domain features and data-driven features.
3.1 Dataset
The proposed emotion recognition model is evaluated on the DEAP dataset [12], which contains multi-modal physiological signals recorded from 32 subjects while they watched 40 different music videos. The EEG signals were collected with a Biosemi ActiveTwo system at a sampling rate of 512 Hz, using 32 electrodes placed according to the international 10–20 standard. The SAM emotion model is used for emotion description along four dimensions: arousal, valence, dominance, and liking. Volunteers gave a self-assessment score of up to 9 in each dimension. If the score was less than 5, the label was set to “Negative Emotion”; if it was greater than or equal to 5, the label was set to “Positive Emotion”. Some pre-processing has been applied to the raw EEG signals. A 4–45 Hz band-pass filter is used to eliminate noise, and the EEG data are down-sampled to 128 Hz. Each trial is 63 s long, of which the first 3 s are the baseline. We therefore removed the first 3 s and used only the remaining 60 s of EEG data. We also applied z-score normalization to the data, which additionally serves as baseline removal.
3.2 Experiment Settings and Evaluation
The proposed method is evaluated using 5-fold cross-validation, with a training-to-validation ratio of 8:2. The average accuracy is used to evaluate performance. We do not segment the EEG signals into shorter windows, so the training set is not augmented. The total number of samples is 1280 (32 subjects × 40 trials).
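The labeling and pre-processing rules of Sect. 3.1 and the sample counts above can be checked with a short sketch. It assumes that z-score normalization is applied per channel and per trial, which the text does not specify, and it uses a synthetic signal in place of real DEAP data; the function names are illustrative.

```python
from statistics import mean, pstdev

FS = 128        # sampling rate after down-sampling (Hz)
BASELINE_S = 3  # leading baseline to discard (s)


def binarize(score):
    """Scores below 5 map to 0 (negative emotion), otherwise 1 (positive emotion)."""
    return 0 if score < 5 else 1


def drop_baseline(channel):
    """Remove the first 3 s of one channel of a trial."""
    return channel[BASELINE_S * FS:]


def zscore(channel):
    """Zero-mean, unit-variance normalisation (assumed per channel and trial)."""
    mu, sigma = mean(channel), pstdev(channel)
    return [(x - mu) / sigma for x in channel]


def fold_sizes(n_samples, k):
    """(training, validation) sizes for each of k equal folds."""
    val = n_samples // k
    return n_samples - val, val


n_samples = 32 * 40                             # subjects x trials
trial = [float(i % 7) for i in range(63 * FS)]  # synthetic 63 s channel
clean = zscore(drop_baseline(trial))
train_n, val_n = fold_sizes(n_samples, 5)
print(len(clean), train_n, val_n)
```

Each fold therefore trains on 1024 trials and validates on 256, which matches the stated 8:2 ratio, and every retained channel has 7680 samples (60 s at 128 Hz).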
The final hyperparameters of the experiment are set as follows: the batch size is 64 trials and the base learning rate is 0.001. The size of the initial scaling convolution kernel is 65, the optimizer is Adam, and cross entropy is used as the loss function. All experiments in this paper were implemented with the PyTorch deep learning framework on Ubuntu 18.04.
3.3 Experimental Results
Table 1 compares the experimental results of our proposed method with those of several existing algorithms. On the DEAP dataset, the mean emotion recognition accuracy reaches 71.25%, 71.33%, and 71.10% in the valence, arousal, and dominance dimensions, respectively. Compared with the study of Li et al. [21], the accuracy in the three dimensions is improved by 12.85, 7.13, and 5.30 percentage points, respectively. All studies chosen for comparison used the same emotion labeling method and address subject-independent emotion classification. In addition, the trained SFMS-Net model can be applied to recognize the emotions of any subject. Table 1 also shows that our proposed method achieves higher accuracy than the existing algorithms. We attribute this result mainly to the fusion of two different types of EEG emotional features, spacial-frequency domain features and data-driven features.

Table 1. The performance of the proposed model.

Studies               Features           Classifiers  Valence  Arousal  Dominance
Chao, H. et al. [6]   MFM                CapsNet      0.6673   0.6828   0.6725
Li et al. [21]        DBN                SVM          0.5840   0.6420   0.6580
Chen et al. [7]       PSD+AR             H-ATT-BGRU   0.6790   0.6650   –
Yang et al. [19]      VAE                SVM          0.688    0.670    –
Gupta, R. et al. [9]  Graph-theoretic    RVM          0.690    0.670    –
Ours                  SFM+Scaling layer  MLP          0.7125   0.7133   0.7110
In addition, we visualized the proposed SFM features as heat maps for the different emotion categories. In Fig. 4, (a), (c), and (e) are SFM features of positive emotions, and (b), (d), and (f) are heat maps of negative emotions. Figure 4 shows obvious differences between the SFM feature maps of different emotions. For positive emotions, the SFM heat maps have higher energy in the theta and alpha bands, whereas for negative emotions the energy in these bands is relatively low. The differences within each emotion category are very small. These maps suggest that the energy in the theta and alpha bands of the brain increases when positive emotions are induced.
Fig. 4. Heat maps of the SFM features of positive and negative emotions: (a), (c), and (e) are SFM features of positive emotions, and (b), (d), and (f) are SFM features of negative emotions.
3.4 Ablation Studies
To verify the effectiveness of fusing spacial-frequency domain features and data-driven features, we carry out the following ablation experiments. First, a typical convolutional neural network (CNN) with three two-dimensional convolutional layers, a pooling layer, and two fully connected layers is used as the baseline. Then the SFM feature block and the scaling convolutional layer are each added to the network to extract the corresponding EEG emotion features. Finally, both modules are added at the same time for EEG-based emotion recognition. As before, we performed 5-fold cross-validation, and the results of the ablation experiments are shown in Table 2. Whether the SFM feature block or the scaling convolutional layer is added, the recognition performance improves over the CNN baseline. Moreover, emotion recognition performance is best when the two different types of features are fused. In addition, to verify the stability of our method, we draw a box plot of the cross-validation results of the ablation experiments. As shown in Fig. 5, our method shows more stable performance across folds, which makes it more suitable for a real-time brain-computer emotional interaction interface.
Table 2. The performance enhanced by the SFM feature block and the scaling convolutional layer.

Model          Valence  Arousal  Dominance
CNN            0.6286   0.6541   0.6683
SFM-CNN        0.6374   0.6655   0.6711
Scaling layer  0.7113   0.6999   0.7078
SFMS-Net       0.7125   0.7133   0.7110

Fig. 5. Box plot of the accuracies in the ablation experiments.
4 Conclusion
EEG-based emotion recognition is a core issue in the development of biological information recognition and artificial intelligence technology, and it can make human-computer interaction more intelligent and emotionally aware. In this paper, the proposed emotion recognition network SFMS-Net fuses spacial-frequency domain features and data-driven features by combining SFM features designed with prior knowledge and scaling convolutional layers that automatically extract time-frequency information. After deep feature fusion, state-of-the-art emotion recognition performance is achieved. This method mitigates the lack of prior EEG emotion knowledge, the reliance on strong assumptions, and the weak retention of task-related features in EEG emotion recognition. In the future, we can also add the correlation between EEG channels to further improve recognition performance and make brain-computer emotional interaction smarter.

Acknowledgment. This work was supported by the National Key Research and Development Program of China under grant 2017YFB1002504.
References
1. Zhang, Y., et al.: Neural complexity in patients with poststroke depression: a resting EEG study. J. Affect. Disord. 188, 310–318 (2015)
2. Huang, H., Xie, Q., Pan, J., He, Y., Wen, Z., Yu, R., Li, Y.: An EEG-based brain computer interface for emotion recognition and its application in patients with disorder of consciousness. IEEE Trans. Affect. Comput. 1 (2019). https://doi.org/10.1109/TAFFC.2019.2901456
3. Asghar, M.A., Khan, M.J., Amin, Y., Rizwan, M., Rahman, M., Badnava, S., Mirjavadi, S.S., et al.: EEG-based multi-modal emotion recognition using bag of deep features: an optimal feature selection approach. Sensors 19(23), 5218 (2019)
4. Bashivan, P., Rish, I., Yeasin, M., Codella, N.: Learning representations from EEG with deep recurrent-convolutional neural networks. Comput. Sci. arXiv preprint arXiv:1511.06448 (2015)
5. Bashivan, P., Rish, I., Yeasin, M., Codella, N.: Learning representations from EEG with deep recurrent-convolutional neural networks. arXiv preprint arXiv:1511.06448 (2015)
6. Chao, H., Dong, L., Liu, Y., Lu, B.: Emotion recognition from multiband EEG signals using CapsNet. Sensors 19(9), 2212 (2019)
7. Chen, J.X., Jiang, D.M., Zhang, Y.N.: A hierarchical bidirectional GRU model with attention for EEG-based emotion classification. IEEE Access 7, 118530–118540 (2019)
8. Fu, B., Li, F., Niu, Y., Wu, H., Shi, G.: Conditional generative adversarial network for EEG-based emotion fine-grained estimation and visualization. J. Vis. Commun. Image Representation 74, 102982 (2021)
9. Gupta, R., Laghari, K.U.R., Falk, T.H.: Relevance vector classifier decision fusion and EEG graph-theoretic features for automatic affective state characterization. Neurocomputing 174, 875–884 (2016)
10. Hu, J., Wang, C., Jia, Q., Bu, Q., Sutcliffe, R., Feng, J.: ScalingNet: extracting features from raw EEG data for emotion recognition. Neurocomputing 463, 177–184 (2021). https://www.sciencedirect.com/science/article/pii/S0925231221012029
11. Huang, H., et al.: An EEG-based brain computer interface for emotion recognition and its application in patients with disorder of consciousness. IEEE Trans. Affect. Comput. (2019)
12. Koelstra, S., et al.: DEAP: a database for emotion analysis; using physiological signals. IEEE Trans. Affect. Comput. 3(1), 18–31 (2011)
13. Liu, W., Zheng, W.L., Lu, B.L.: Multimodal emotion recognition using multimodal deep learning. arXiv preprint arXiv:1602.08225 (2016)
14. Mammone, N., Ieracitano, C., Morabito, F.C.: A deep CNN approach to decode motor preparation of upper limbs from time-frequency maps of EEG signals at source level. Neural Netw. 124, 357–372 (2020)
15. Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M.: A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3(1), 42–55 (2012)
16. Soroush, M.Z., Maghooli, K., Setarehdan, S.K., Nasrabadi, A.M.: Emotion classification through nonlinear EEG analysis using machine learning methods. Int. Clin. Neurosci. J. 5(4), 135 (2018)
17. Wang, F., Wu, S., Zhang, W., Xu, Z., Coleman, S.: Emotion recognition with convolutional neural network and EEG-based EFDMs. Neuropsychologia 146(10), 107506 (2020)
18. Wang, F., et al.: Emotion recognition with convolutional neural network and EEG-based EFDMs. Neuropsychologia 146, 107506 (2020)
19. Yang, H., Lee, C.: An attribute-invariant variational learning for emotion recognition using physiology. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1184–1188 (2019)
20. Zhang, G., et al.: A review of EEG features for emotion recognition. Sci. Sinica Informationis 49(9), 1097–1118 (2019)
21. Zhang, P., Li, X., Hou, Y., Yu, G., Song, D., Hu, B.: EEG based emotion identification using unsupervised deep feature learning (2015)
22. Zhang, T., Cui, Z., Xu, C., Zheng, W., Yang, J.: Variational pathway reasoning for EEG emotion recognition. In: AAAI, pp. 2709–2716 (2020)
23. Zheng, W.L., Zhu, J.Y., Lu, B.L.: Identifying stable patterns over time for emotion recognition from EEG. IEEE Trans. Affect. Comput. 10(3), 417–429 (2017)
24. Zheng, W., Liu, W., Lu, Y., Lu, B., Cichocki, A.: EmotionMeter: a multimodal framework for recognizing human emotions. IEEE Trans. Syst. Man Cybern. 49(3), 1110–1122 (2019)
ECG Arrhythmia Detection Based on Hidden Attention Residual Neural Network

Yuxia Guan1, Jinrui Xu1, Ning Liu1, Jianxin Wang1, and Ying An2(B)

1 School of Computer Science and Engineering, Central South University, Changsha 410083, China
2 Big Data Institute, Central South University, Changsha 410083, China
[email protected]
Abstract. Arrhythmias such as atrial fibrillation (AFIB), atrial flutter (AFL), and ventricular fibrillation (VFIB) are early indicators of stroke and sudden cardiac death, and the electrocardiogram (ECG) provides vital information for arrhythmia detection. In recent years, deep learning has been widely used in automated arrhythmia detection to reduce manual intervention and improve analysis efficiency. Most existing methods extract features directly from one-dimensional ECG signals. However, since the information that one-dimensional ECG signals can provide is often insufficient, the performance of these methods is limited. To solve this problem, we transform ECG signals into images. A Hidden Attention Residual Neural Network (HA-ResNet) is then proposed for the discrimination of normal rhythm (NSR) from AFIB, AFL, and VFIB. In this model, an SE (Squeeze-and-Excitation) block is used to strengthen essential features, while a BConvLSTM is employed to capture the temporal dependency among ECG signals, thereby effectively improving the classification performance. We evaluate our method on a combined dataset with segment durations of two seconds and five seconds. We achieve an accuracy of 99.2% and an F1-score of 96% for two-second ECG segments, and an accuracy of 99.3% and an F1-score of 96.7% for five-second ECG segments.
Keywords: ECG classification · Arrhythmia · Deep learning

1 Introduction
Arrhythmia is an abnormal heartbeat rhythm that may endanger life [1]. ECG plays a vital role in arrhythmia diagnosis. However, due to the complexity and non-linearity of ECG signals, manual analysis of ECG is difficult and time-consuming [2]. Therefore, automatic detection of arrhythmia based on ECG has become a hot research topic. Traditional arrhythmia detection mostly depends on manual feature extraction by signal processing and statistical techniques [3]. Martis et al. [4] extracted features with higher-order statistics (HOS) cumulants and then classified ECG signals with a KNN classifier. Elhaj et al. [5] used the discrete wavelet transform (DWT) to extract linear features, and the nonlinear HOS cumulant method to extract nonlinear features. These methods are complicated, and the feature design requires a high level of expertise and experience from researchers [6]. In addition, hand-crafted features may change due to noise, scaling, and translation.

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 471–483, 2021. https://doi.org/10.1007/978-3-030-91415-8_40

In recent years, several deep learning based methods have been proposed to detect arrhythmia. Zhu et al. [7] developed a 94-layer deep neural network for four-class classification of ECG signals, which used the Squeeze-and-Excitation (SE) residual neural network [8] to improve the model's representation capacity by performing dynamic channel-wise feature recalibration. Fan et al. [9] proposed a multi-scale fusion of deep convolutional neural networks to detect atrial fibrillation from single-lead ECG records. Petmezas et al. [10] presented a deep model combining CNN and LSTM to classify four types of heart rhythms. Hannun et al. [11] used a deep convolutional neural network with 16 residual blocks to classify 12 rhythms from a single lead. This research shows that deep neural networks can automatically learn complex representative features, thus reducing the excessive dependence on manual feature extraction.

However, in most of the above-mentioned studies, the relevant features are extracted directly from the one-dimensional ECG signals, which can easily miss important concealed information. Recently, some scholars have tried to provide more diverse information to the model by converting the one-dimensional ECG signals into two-dimensional images. For example, Huang et al. [12] used the short-time Fourier transform (STFT) to transform ECG signals into time-frequency spectrograms and classified five types of arrhythmia with a 2D-CNN, achieving an 8.07% improvement in accuracy compared with the one-dimensional method. Alqudah et al. [13] also used the STFT to transform ECG signals into images and obtained good performance in arrhythmia classification. Naz et al. [14] transformed ECG data into binary images and used a 2D-CNN to detect ventricular arrhythmias. Unfortunately, these existing approaches fail to consider the temporal dependence of ECG signals, which could strengthen useful features.

In this paper, a hidden attention residual network model is proposed to discriminate normal rhythm (NSR) from AFIB, AFL, and VFIB using two-dimensional images of ECG segments. Figure 1 shows the overall flow chart.
Fig. 1. The overall schematic of our method
The main contributions of this work can be summarized as follows:
• We use four different two-dimensional transformation methods to transform ECG signals into images, providing richer feature information.
• We propose a novel hidden attention residual network model for automated arrhythmia classification, which integrates an SE block and a BConvLSTM [15] to enhance the ability to capture the key features of arrhythmias and their temporal dependencies from ECG signals.

The structure of the paper is as follows: Sect. 2 describes the methods. The experiments and results are presented in Sect. 3. Section 4 discusses the classification of ECG records, and Sect. 5 summarizes this paper.
2 Methods
Figure 2 shows the architecture of our HA-ResNet model. In our model, the original one-dimensional ECG signals are first converted into two-dimensional images and fed to a 2-D convolution layer followed by a batch normalization (BN) layer and a max-pooling layer for shallow feature extraction. Then, four HA modules (HAMs) with different kernel sizes and SE ratios are deployed to further capture the diverse deep features in the ECG data. Finally, the deep feature representation is passed through an average pooling layer and a softmax layer to obtain the final classification result.
Fig. 2. The architecture of our HA-ResNet model
2.1 Data Preprocessing
We use four standard two-dimensional conversion methods to convert the original one-dimensional data into two-dimensional images. The specific methods are described as follows.
Image Formation by Recurrence Plot (RP). The RP [16] can reveal the internal structure of a time series and provide a priori knowledge of similarity, information content, and predictability. Thus, it is particularly suitable for short time-series data and can reflect the stationarity and inherent similarity of time series. Let $q(t) \in \mathbb{R}^d$ be a multivariate time series. The recurrence plot is defined as:

$$RP_{i,j} = \theta\left(\varepsilon - \lVert q(i) - q(j) \rVert\right) \tag{1}$$

where $\varepsilon$ is a threshold and $\theta$ is the Heaviside function. The image formed by RP is shown in Fig. 3 (taking the 2 s dataset as an example).
Fig. 3. Display of two-second segments of different types of arrhythmias on RP images
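As a rough illustration of Eq. (1), the recurrence matrix can be computed with a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function name `recurrence_plot`, the threshold value, the sine test signal, and the choice of treating a single-lead segment as a series with d = 1 (no time-delay embedding) are assumptions made here for illustration.

```python
import numpy as np

def recurrence_plot(q, eps):
    """Binary recurrence matrix R[i, j] = theta(eps - ||q(i) - q(j)||)."""
    q = np.asarray(q, dtype=float)
    if q.ndim == 1:
        q = q[:, None]                      # assumption: scalar series, d = 1
    # Pairwise Euclidean distances between all states q(i) and q(j).
    diff = q[:, None, :] - q[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Heaviside step; theta(0) = 1 is a convention chosen for this sketch.
    return (dist <= eps).astype(np.uint8)

# Hypothetical stand-in for a short ECG segment.
signal = np.sin(np.linspace(0, 4 * np.pi, 50))
rp = recurrence_plot(signal, eps=0.1)
```

The result is a symmetric 0/1 image with ones on the main diagonal, which can be saved or resized like any other single-channel image before being fed to a 2-D network.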
Image Formation by Gramian Angular Field (GAF). The GAF [17] creates a matrix of temporal correlations for each pair $(x_i, x_j)$. First, it rescales the time series to a range $[a, b]$, where $-1 \le a < b \le 1$. Then it computes the polar coordinates of the scaled time series by taking the arccos. Finally, it computes the cosine of the sum of the angles for the Gramian Angular Summation Field (GASF) or the sine of the difference of the angles for the Gramian Angular Difference Field (GADF):

$$
\begin{aligned}
\tilde{x}_i &= a + (b - a) \times \frac{x_i - \min(x)}{\max(x) - \min(x)}, && \forall i \in \{1, \ldots, n\} \\
\phi_i &= \arccos(\tilde{x}_i), && \forall i \in \{1, \ldots, n\} \\
\mathrm{GASF}_{i,j} &= \cos(\phi_i + \phi_j), && \forall i, j \in \{1, \ldots, n\} \\
\mathrm{GADF}_{i,j} &= \sin(\phi_i - \phi_j), && \forall i, j \in \{1, \ldots, n\}
\end{aligned}
\tag{2}
$$

The image formed by GAF is shown in Fig. 4 (taking the 2 s dataset as an example).

Fig. 4. Display of two-second segments of different types of arrhythmias on GAF images
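The three GAF steps (rescale, take the arccos, combine the angles) map directly onto array operations. The sketch below is illustrative only: the function name, the default range [-1, 1], and the toy input are assumptions, and the GADF uses the sine of the angular difference as stated in the text. It also assumes the series is not constant, since the rescaling would otherwise divide by zero.

```python
import numpy as np

def gramian_angular_fields(x, a=-1.0, b=1.0):
    """Return (GASF, GADF) for a 1-D series rescaled to [a, b] within [-1, 1]."""
    x = np.asarray(x, dtype=float)
    # Step 1: min-max rescaling to [a, b].
    x_scaled = a + (b - a) * (x - x.min()) / (x.max() - x.min())
    x_scaled = np.clip(x_scaled, -1.0, 1.0)       # guard against rounding error
    # Step 2: polar angle of each sample.
    phi = np.arccos(x_scaled)
    # Step 3: pairwise sum / difference of angles.
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf

# Toy series standing in for an ECG segment.
gasf, gadf = gramian_angular_fields([0.0, 1.0, 3.0, 2.0])
```

By construction the GASF is symmetric and the GADF is antisymmetric with a zero diagonal, which is a quick sanity check when wiring this into a preprocessing pipeline.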
Image Formation by Markov Transition Field (MTF). The MTF [17] discretizes a time series $X = \{x_1, x_2, \ldots, x_n\}$ into bins $q_k$. It then computes the Markov transition matrix of the discretized time series. Finally, it spreads out the transition matrix to a field in order to reduce the loss of temporal information:

$$
\mathrm{MTF} =
\begin{bmatrix}
w_{lk \mid x_1 \in q_l,\, x_1 \in q_k} & \cdots & w_{lk \mid x_1 \in q_l,\, x_n \in q_k} \\
w_{lk \mid x_2 \in q_l,\, x_1 \in q_k} & \cdots & w_{lk \mid x_2 \in q_l,\, x_n \in q_k} \\
\vdots & \ddots & \vdots \\
w_{lk \mid x_n \in q_l,\, x_1 \in q_k} & \cdots & w_{lk \mid x_n \in q_l,\, x_n \in q_k}
\end{bmatrix}
\tag{3}
$$

where $w_{lk}$ is the frequency with which a point in quantile $q_k$ is followed by a point in quantile $q_l$. The image formed by MTF is shown in Fig. 5 (taking the 2 s dataset as an example).

Fig. 5. Display of two-second segments of different types of arrhythmias on MTF images
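The MTF construction (quantile binning, first-order transition matrix, spreading along time) can be sketched as follows. The function name, the number of bins, and the toy input are assumptions. The sketch also indexes the transition matrix by source bin (rows) and destination bin (columns) with row normalization, which is the transpose of the $w_{lk}$ indexing used in Eq. (3); either convention yields an image of the same size.

```python
import numpy as np

def markov_transition_field(x, n_bins=4):
    """MTF[i, j] = transition probability from the bin of x[i] to the bin of x[j]."""
    x = np.asarray(x, dtype=float)
    # Interior quantile edges, then a bin index in 0..n_bins-1 for every sample.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)
    # Count transitions between consecutive samples.
    W = np.zeros((n_bins, n_bins))
    for src, dst in zip(bins[:-1], bins[1:]):
        W[src, dst] += 1
    # Row-normalise counts into probabilities (rows with no visits stay zero).
    row_sums = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # Spread the transition matrix over all pairs of time indices.
    return W[bins[:, None], bins[None, :]]

mtf = markov_transition_field(np.arange(8.0), n_bins=4)
```

For the monotone toy input each bin either stays put or moves to the next bin, so the field contains only the values 0, 0.5, and 1.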
Image Formation by Short-Time Fourier Transform (STFT). The STFT is a method derived from the discrete Fourier transform to analyze the instantaneous frequency and the instantaneous amplitude of a localized wave with time-varying characteristics. In non-stationary signal analysis, it is assumed that the signal is approximately stationary within the span of a temporal window of finite support. The image formed by STFT is shown in Fig. 6 (taking the 2 s dataset as an example).

Fig. 6. Display of two-second segments of different types of arrhythmias on STFT images
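The windowed-stationarity idea behind the STFT can be shown with a short NumPy sketch: the signal is cut into overlapping frames, each frame is tapered and Fourier-transformed, and the magnitudes are stacked into a time-frequency image. The window length, hop size, Hann taper, and the 10 Hz test tone are assumptions for illustration; the paper does not state its STFT settings.

```python
import numpy as np

def stft_magnitude(x, fs, win_len=50, hop=25):
    """Magnitude spectrogram of x: rows are frequencies, columns are time frames."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(win_len)                      # taper each short frame
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[k * hop:k * hop + win_len] * window
                       for k in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T      # shape (n_freqs, n_frames)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)      # frequency of each row in Hz
    times = (np.arange(n_frames) * hop + win_len / 2) / fs   # frame centres in s
    return freqs, times, spec

fs = 250                                   # sampling rate used for the segments
t = np.arange(2 * fs) / fs                 # a 2 s segment, 500 samples
x = np.sin(2 * np.pi * 10 * t)             # hypothetical 10 Hz test tone
freqs, times, spec = stft_magnitude(x, fs)
```

A longer window gives finer frequency resolution at the cost of time resolution, so the window length is a natural hyperparameter to tune for 2 s versus 5 s segments.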
2.2 Feature Extraction
Embedding Layer: After the ECG data are transformed, we input the images into the embedding layer. The embedding layer includes a 2-D convolution layer, a BN layer, and a max-pooling layer, through which we extract shallow features. We then extract deep features through four HAM blocks with different parameters.

HAM: The lower-left corner of Fig. 2 shows the architecture of the HAM. First, we use two 2-D convolution blocks to extract additional features. In each convolution block, the convolution operation is followed by a BN layer and a ReLU layer, which has been reported to be a better practice than using the convolution layer alone.

SE Block. We adopt the SE block (lower-right corner of Fig. 2) to assign different weights to different channels of the feature maps, so that the model learns to focus more on the crucial features and pay less attention to the less critical ones. In addition, the SE module provides an adaptive way to aggregate features extracted from the ECG signals, which results in a better feature representation.

BConvLSTM. After feature extraction by the SE block, we focus on the temporally dependent features of ECG signals, using a BConvLSTM to consider the data dependency in both directions. First, we merge the output of the SE block ($X_e$) and the input of the SE block ($X_i$), and then feed the result to the BConvLSTM layer. ConvLSTM [18] introduces convolution operations into the input-to-state and state-to-state transitions. It consists of an input gate $i_t$, an output gate $o_t$, a forget gate $f_t$, and a memory cell $C_t$. The input, output, and forget gates act as controlling gates to access, update, and clear the memory cell.
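Before turning to the ConvLSTM equations, the channel reweighting performed by the SE block described above can be sketched in NumPy. This follows the generic squeeze-and-excitation recipe of [8] (global average pooling, a two-layer bottleneck, a sigmoid gate); the function name, the reduction ratio `r`, and the random weights are assumptions and do not reflect the trained model.

```python
import numpy as np

def se_block(feature_maps, w1, b1, w2, b2):
    """Reweight the channels of an (H, W, C) feature map.

    w1 has shape (C, C // r) and w2 has shape (C // r, C), where r is the
    reduction ratio of the bottleneck.
    """
    squeezed = feature_maps.mean(axis=(0, 1))             # squeeze: one value per channel
    hidden = np.maximum(0.0, squeezed @ w1 + b1)          # excitation: ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # sigmoid gate in (0, 1)
    return feature_maps * weights, weights                # broadcast over H and W

rng = np.random.default_rng(0)
C, r = 8, 4                                               # hypothetical sizes
x = rng.standard_normal((5, 5, C))
w1, b1 = 0.1 * rng.standard_normal((C, C // r)), np.zeros(C // r)
w2, b2 = 0.1 * rng.standard_normal((C // r, C)), np.zeros(C)
out, weights = se_block(x, w1, b1, w2, b2)
```

Each channel is scaled by a single learned factor between 0 and 1, so the block changes how strongly channels contribute without altering the spatial layout of the feature maps.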
ConvLSTM can be formulated as follows (for convenience, we remove the subscript and superscript from the parameters):

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
\tag{4}
$$

where $*$ and $\circ$ denote the convolution operator and the Hadamard product, respectively, following [18]. $X_t$ is the input tensor (in our case $X_e$ and $X_i$), $H_t$ is the hidden state tensor, $C_t$ is the memory cell tensor, $W_{x*}$ and $W_{h*}$ are 2D convolution kernels corresponding to the input and hidden state, respectively, and $b_i$, $b_f$, $b_o$, and $b_c$ are the bias terms. In the standard ConvLSTM, only forward dependencies are processed. However, all the information in the sequence should be fully considered, so it may be beneficial to consider backward dependencies as well. Therefore, we adopt a BConvLSTM, which uses two ConvLSTMs to process the input data along forward and backward paths and then makes a decision for the current input by considering the data dependencies in both directions.
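To make the gate updates concrete, the sketch below implements one ConvLSTM step in the formulation of [18] (Hadamard peephole terms, output gate driven by the updated cell state) and runs it in both time directions. It is a simplified, single-channel illustration: the helper names, the 3 x 3 kernels, the random parameters, and in particular the summation used to combine the forward and backward hidden states are assumptions, since the paper does not specify how the two directions are merged.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same' 2-D cross-correlation (the deep-learning 'convolution'), odd-sized kernel."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x_t, h_prev, c_prev, p):
    """One ConvLSTM update: '*' is conv2d_same, the Hadamard product is elementwise '*'."""
    i_t = sigmoid(conv2d_same(x_t, p["Wxi"]) + conv2d_same(h_prev, p["Whi"])
                  + p["Wci"] * c_prev + p["bi"])
    f_t = sigmoid(conv2d_same(x_t, p["Wxf"]) + conv2d_same(h_prev, p["Whf"])
                  + p["Wcf"] * c_prev + p["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(conv2d_same(x_t, p["Wxc"])
                                       + conv2d_same(h_prev, p["Whc"]) + p["bc"])
    o_t = sigmoid(conv2d_same(x_t, p["Wxo"]) + conv2d_same(h_prev, p["Who"])
                  + p["Wco"] * c_t + p["bo"])
    return o_t * np.tanh(c_t), c_t

def run_convlstm(seq, p):
    """Run the cell over a (T, H, W) sequence and return all hidden states."""
    h = np.zeros_like(seq[0], dtype=float)
    c = np.zeros_like(seq[0], dtype=float)
    outs = []
    for x_t in seq:
        h, c = convlstm_step(x_t, h, c, p)
        outs.append(h)
    return np.stack(outs)

def bconvlstm(seq, p_fwd, p_bwd):
    """Forward and backward passes; summing them is an assumption of this sketch."""
    fwd = run_convlstm(seq, p_fwd)
    bwd = run_convlstm(seq[::-1], p_bwd)[::-1]     # re-align backward states in time
    return fwd + bwd

def init_params(rng, shape, ksize=3):
    """Random hypothetical parameters for gates i, f, c, o."""
    p = {}
    for g in "ifco":
        p["Wx" + g] = 0.1 * rng.standard_normal((ksize, ksize))
        p["Wh" + g] = 0.1 * rng.standard_normal((ksize, ksize))
        p["b" + g] = 0.0
    for g in "ifo":
        p["Wc" + g] = 0.1 * rng.standard_normal(shape)   # peephole weights
    return p

rng = np.random.default_rng(0)
seq = rng.standard_normal((6, 4, 4))               # toy sequence of 6 maps
p_fwd, p_bwd = init_params(rng, (4, 4)), init_params(rng, (4, 4))
out = bconvlstm(seq, p_fwd, p_bwd)
```

In a framework such as Keras the same idea is usually expressed by wrapping a ConvLSTM layer in a bidirectional wrapper; the explicit loop here is only meant to show which state each term of Eq. (4) reads.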
Average Pooling Layer: We adopt the standard average pooling layer to downsample the features obtained from the HAMs, which reduces the time of the classification step and improves the generalizability of the model.

2.3 Classification
After feature extraction, we use a softmax layer to classify AFIB, NSR, VFIB, and AFL. Softmax maps the outputs of multiple neurons to values in (0, 1) that sum to 1, so they can be interpreted as probabilities. When selecting the output node, we choose the node with the maximum probability as the predicted class.
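The softmax mapping and the arg-max decision can be illustrated with a small example. The logits and the ordering of the class labels below are hypothetical; only the softmax formula itself is standard.

```python
import numpy as np

CLASSES = ["AFIB", "AFL", "NSR", "VFIB"]        # assumed label order

def softmax(logits):
    """Map raw scores to probabilities that sum to 1."""
    z = logits - np.max(logits)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0, 0.1])        # hypothetical network outputs
probs = softmax(logits)
predicted = CLASSES[int(np.argmax(probs))]      # class with the largest probability
```

Because softmax is monotone, the predicted class is the same as the class with the largest raw score; the probabilities matter mainly for the training loss and for reporting confidence.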
3 Experiment

3.1 Data Source
We obtain the ECG signals from the Creighton University ventricular tachyarrhythmia database (CUDB), the MIT-BIH atrial fibrillation database (AFDB), and the MIT-BIH arrhythmia database (MITDB). The sampling frequency of the ECG signals in CUDB and AFDB is 250 Hz, so we resample the ECG signals in MITDB from 360 Hz to 250 Hz. Subsequently, the Daubechies 6 wavelet is employed to denoise the signals and remove the baseline [19]. The denoised signals are segmented into 2-s and 5-s segments, respectively, yielding two corresponding datasets. The details of the two datasets are given in Table 1.

Table 1. Overview of the data used in this study
Data | Taken from | Total number of original 2-s segments | Total number of original 5-s segments
NSR | CUDB, MITDB | 902 | 361
AFIB | AFDB, MITDB | 18804 | 7407
AFL | AFDB, MITDB | 1840 | 736
VFIB | CUDB | 163 | 65

3.2 Implementation Details
In our study, all the models are implemented with Python 3.6.0 and trained with the Adam optimizer with a learning rate of 0.0001. The batch size is set to 8 for all methods. The proposed network is implemented and trained using the Keras 2.2.2 framework, and all experiments are performed on a server with an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10 GHz, 126 GB of memory, and six GeForce RTX cards. We randomly split the dataset into training, validation, and test sets with a ratio of 0.6 : 0.2 : 0.2, and train each prediction model with 5-fold cross-validation to enhance the model's generalization performance. Four measures are adopted to evaluate the performance of the models: Accuracy, Precision, Recall, and F1-score. For all models, we repeat the experiments 10 times and report the mean evaluation metrics for testing performance.
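The four measures can be computed per class from a confusion matrix in a one-vs-rest fashion. The sketch below assumes the usual definitions (the paper does not spell them out) and uses the record-level confusion matrix reported in Table 6 of Sect. 4 as input; with these definitions the computed values agree with the per-class figures listed in that table after rounding to one decimal.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest accuracy, precision, recall and F1 (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp            # predicted as the class but actually another
    fn = cm.sum(axis=1) - tp            # belonging to the class but predicted as another
    tn = total - tp - fp - fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Confusion matrix from Table 6, class order AF, NSR, VFIB.
cm = [[29, 1, 0],
      [3, 32, 1],
      [3, 1, 35]]
acc, prec, rec, f1 = per_class_metrics(cm)
print(np.round(100 * np.vstack([acc, prec, rec, f1]).T, 1))
```

Macro-averaging these per-class values (a plain mean over classes) gives the "Average" row of such a table.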
3.3 Results
Overall Performance. We first compare the overall performance of our model when four different two-dimensional signal conversion methods are used. The results on the 2 s segment and 5 s segment datasets are given in Table 2 and Table 3. It can be observed that HA-ResNet+RP and HA-ResNet+GAF clearly outperform the other comparison methods. Among them, HA-ResNet+RP obtains the highest F1-score of 96.0% on the 2 s segment dataset. On the 5 s segment dataset, the F1-score of HA-ResNet+GAF reaches 96.7%, which is higher than that of HA-ResNet+RP. This may be because GAF is more sensitive to long signal segments than RP. In the following comparative experiments, we only report the best results obtained by our proposed model.

Table 2. The results of different two-dimensional methods on 2 s datasets
Method | Accuracy(%) | Precision(%) | Recall(%) | F1-Score(%)
HA-ResNet+RP | 99.2 | 96.2 | 95.4 | 96.0
HA-ResNet+STFT | 91.5 | 67.1 | 60.0 | 63.0
HA-ResNet+GAF | 97.7 | 95.9 | 92.3 | 94.1
HA-ResNet+MTF | 95.9 | 86.4 | 82.2 | 84.2

Table 3. The results of different two-dimensional methods on 5 s datasets
Method | Accuracy(%) | Precision(%) | Recall(%) | F1-Score(%)
HA-ResNet+RP | 98.4 | 95.4 | 92.3 | 93.7
HA-ResNet+STFT | 93.5 | 89.1 | 76.9 | 78.8
HA-ResNet+GAF | 99.3 | 96.1 | 97.3 | 96.7
HA-ResNet+MTF | 94.7 | 82.0 | 76.6 | 79.1
To demonstrate the effectiveness of our proposed model, we further compare HA-ResNet with the following five state-of-the-art methods. Desai et al. [20] applied Recurrence Quantification Analysis (RQA) features to classify four classes of ECG beats. They used Decision Tree (DT), Random Forest (RAF), and Rotation Forest (ROF) ensemble methods to select the best classifier. Acharya et al. [21] used thirteen nonlinear features of ECG beats. The extracted features were ranked using ANOVA and subjected to automated classification using the K-Nearest Neighbor (KNN) and Decision Tree (DT) classifiers. Acharya et al. [2] adopted an 11-layer CNN to automatically classify the four classes of ECG signals (NSR, AFIB, AFL, and VFIB), which greatly simplified the system flow. Fujita et al. [22] presented a 6-layer deep convolutional neural network (CNN) for automatic ECG pattern classification of four classes. Sree et al. [23] extracted 18 non-linear features from the cumulant images determined from the ECG segments, and significant features were selected using the t-test. The selected features were used to train several classifiers.

Table 4. Summary of selected studies conducted for the detection of arrhythmia on similar tasks
Methods, Year | Database | Special Characteristics | ECG rhythms | Classifier | Performance
Desai et al. 2016 [20] | MITDB, AFDB, CUDB | QRS detection performed; 3858 ECG beats | AFIB, AFL, VFIB, NSR | Decision Tree, Random Forest, and Rotation Forest | Accuracy = 98.37%
Acharya et al. 2016 [21] | MITDB, AFDB, CUDB | QRS detection performed; 614526 ECG beats | AFIB, AFL, VFIB, NSR | ANOVA ranking and decision tree | Accuracy = 96.3%
Acharya et al. 2017 [2] | MITDB, AFDB, CUDB | No QRS detection performed; 21709 2 s ECG segments; 8683 5 s ECG segments | AFIB, AFL, VFIB, NSR | CNN | 2 s: Accuracy = 92.5%, F1-Score = 71%; 5 s: Accuracy = 94.9%, F1-Score = 72%
Fujita et al. 2019 [22] | MITDB, AFDB, CUDB | No QRS detection performed; 25459 2 s ECG segments | AFIB, AFL, VFIB, NSR | CNN | Accuracy = 97.78%
Sree et al. 2021 [23] | MITDB, AFDB, CUDB | No QRS detection performed; 21709 2 s ECG segments; 8683 5 s ECG segments | AFIB, AFL, VFIB, NSR | HOS, Adasyn, Random Forest | 2 s: Accuracy = 93.94%, F1-Score = 73%; 5 s: Accuracy = 95.62%, F1-Score = 78%
Our Method | MITDB, AFDB, CUDB | No QRS detection performed; 21709 2 s ECG segments; 8569 5 s ECG segments | AFIB, AFL, VFIB, NSR | HA-ResNet | 2 s: Accuracy = 99.2%, F1-Score = 96%; 5 s: Accuracy = 99.3%, F1-Score = 96.7%

Table 4 shows the comparison results of our model and the other baselines. Although [20] and [21] obtained good performance on the classification of four arrhythmias, they relied on QRS detection and performed ECG beat classification. [23] performed ECG segment classification without QRS detection, but obtained only 73% on F1-score. On the one hand, the feature learning ability of the machine learning method is insufficient; on the other hand, the dependence of feature selection on manual intervention also significantly limits its performance. At the same time, we found that [2] adopted an 11-layer CNN to detect arrhythmia, but its performance was poor, even lower than that of [23], which used a random forest. [22] used the continuous wavelet transform (CWT) for optimal extraction of the necessary features before classification by CNN. However, the performance gains are still limited. The main reason is that a CNN lacks the ability to learn temporal dependencies, which affects the model's overall performance. In contrast, our approach transforms ECG signals into images to provide richer feature information. Moreover, we present the hidden attention module to capture local features and temporal dependencies in the ECG more efficiently, thus
achieving the best performance. In general, the existing works [2,20–22] on the same task and the same original datasets extract features directly from the one-dimensional ECG signal, whereas our method obtains better performance by using two-dimensional representations of the ECG signal. This demonstrates the advantage of our overall scheme.

Benefits of Hidden Attention Module. To investigate the benefit of the HA-ResNet on the performance of ECG classification, we compare our model with two of its variants. One is the basic ResNet, which is obtained by removing the hidden attention module (HAM) from our model. The other is the SEResNet model, which is obtained by removing the BConvLSTM module from the HAM. Table 5 shows the results of the three models on the 2 s dataset and the 5 s dataset. As we can see from the table, the extra modules added to the ResNet allow us to achieve substantially better performance than traditional models. Taking the results on the 5 s dataset as an example, the ResNet obtains 85.99% on F1-score. When the SE block is added, the SEResNet model obtains a 2.02% improvement on F1-score compared with ResNet. With the further addition of the BConvLSTM, the HA-ResNet achieves the best performance, with a further improvement of 8.64% on F1-score over the SEResNet. This shows that the SE block and the BConvLSTM layer can effectively enhance the feature capture ability of the model, and further demonstrates the effectiveness of our proposed HA-ResNet for improving the classification performance of ECG arrhythmia.

Table 5. Performance comparison of HA-ResNet and its modifications
Dataset | Methods | Class | Accuracy(%) | Precision(%) | Recall(%) | F1-Score(%)
2 s dataset | ResNet | AFIB | 98.96 | 98.4 | 99 | 98.7
2 s dataset | ResNet | AFL | 85.32 | 90 | 85.3 | 87.6
2 s dataset | ResNet | NSR | 96.69 | 99.4 | 96.7 | 98
2 s dataset | ResNet | VFIB | 90.9 | 90.9 | 90.9 | 90.9
2 s dataset | ResNet | Average | 92.97 | 94.7 | 92.98 | 93.8
2 s dataset | SEResNet | AFIB | 99.02 | 98.4 | 99 | 98.7
2 s dataset | SEResNet | AFL | 85.86 | 91.1 | 85.9 | 88.4
2 s dataset | SEResNet | NSR | 97.79 | 98.3 | 97.8 | 98.1
2 s dataset | SEResNet | VFIB | 90.91 | 90.91 | 90.9 | 90.9
2 s dataset | SEResNet | Average | 93.4 | 94.7 | 93.4 | 94
2 s dataset | HA-ResNet | AFIB | 98.4 | 98.9 | 99.2 | 99.1
2 s dataset | HA-ResNet | AFL | 98.6 | 92.5 | 90.5 | 91.5
2 s dataset | HA-ResNet | NSR | 99.9 | 99.4 | 97.8 | 98.6
2 s dataset | HA-ResNet | VFIB | 99.9 | 93.9 | 93.9 | 93.9
2 s dataset | HA-ResNet | Average | 99.2 | 96.2 | 95.4 | 96
5 s dataset | ResNet | AFIB | 94.46 | 98.11 | 95.6 | 96.84
5 s dataset | ResNet | AFL | 96.56 | 61.49 | 97.85 | 85.52
5 s dataset | ResNet | NSR | 98.02 | 96.3 | 72.41 | 78.75
5 s dataset | ResNet | VFIB | 99.88 | 100 | 88.67 | 92.86
5 s dataset | ResNet | Average | 97.23 | 86.47 | 88.13 | 85.99
5 s dataset | SEResNet | AFIB | 95.45 | 98.04 | 96.74 | 97.39
5 s dataset | SEResNet | AFL | 96.5 | 68.92 | 87.93 | 77.27
5 s dataset | SEResNet | NSR | 98.89 | 93.15 | 82.93 | 87.74
5 s dataset | SEResNet | VFIB | 99.83 | 100 | 81.25 | 89.66
5 s dataset | SEResNet | Average | 97.67 | 90.03 | 87.21 | 88.01
5 s dataset | HA-ResNet | AFIB | 98.6 | 99.26 | 99.12 | 99.19
5 s dataset | HA-ResNet | AFL | 98.83 | 92.57 | 93.84 | 93.2
5 s dataset | HA-ResNet | NSR | 99.82 | 97.26 | 98.61 | 97.93
5 s dataset | HA-ResNet | VFIB | 99.94 | 100 | 92.86 | 96.3
5 s dataset | HA-ResNet | Average | 99.3 | 97.27 | 96.11 | 96.65
4 Discussion
To demonstrate the clinical practicality of the method presented in this paper, we evaluate our best scheme on 105 ECG records. We use a sliding window to extract five-second segments, with consecutive windows overlapping by 2.5 s. The model votes on the classification results of all segments and then selects the category with the most votes as the category of the record (since AFIB and AFL may occur in one record, we uniformly mark these two types as AF). From Table 6, we can see that our approach achieves an average performance of more than 90% on all evaluation metrics, which indicates that our method is also effective in the classification of ECG records.

Table 6. Confusion Matrix and Performance Measures obtained using the HA-ResNet for the two-second ECG dataset
Original/Predicted | AF | NSR | VFIB | Accuracy(%) | Precision(%) | Recall(%) | F1-Score(%)
AF | 29 | 1 | 0 | 93.3 | 82.9 | 96.7 | 89.2
NSR | 3 | 32 | 1 | 94.3 | 94.1 | 88.9 | 91.4
VFIB | 3 | 1 | 35 | 95.2 | 97.2 | 89.7 | 93.3
Average | – | – | – | 94.3 | 91.4 | 91.8 | 91.3
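The record-level procedure described in this section (5 s windows overlapping by 2.5 s, majority vote, AFIB and AFL merged into AF) can be sketched as follows. The function names, the tie-breaking rule, and the toy threshold classifier standing in for the trained HA-ResNet are assumptions made for illustration; the sketch also assumes every record is at least one window long.

```python
from collections import Counter

import numpy as np

MERGE = {"AFIB": "AF", "AFL": "AF"}     # AFIB and AFL are both counted as AF

def sliding_segments(record, fs=250, win_s=5.0, overlap_s=2.5):
    """Cut a record into 5 s windows whose neighbours overlap by 2.5 s."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)
    return [record[s:s + win] for s in range(0, len(record) - win + 1, step)]

def classify_record(record, classify_segment, fs=250):
    """Label a whole record by majority vote over its segments."""
    labels = [classify_segment(seg) for seg in sliding_segments(record, fs)]
    votes = Counter(MERGE.get(lab, lab) for lab in labels)
    return votes.most_common(1)[0][0]   # on a tie, the label seen first wins

def toy_classifier(seg):
    """Hypothetical stand-in for the trained model: thresholds the segment mean."""
    return "VFIB" if seg.mean() > 0.5 else "NSR"

record = np.concatenate([np.zeros(3125), np.ones(1875)])   # 20 s at 250 Hz
segments = sliding_segments(record)
label = classify_record(record, toy_classifier)
```

Majority voting makes the record-level decision robust to a few misclassified segments, at the cost of hiding short episodes that occupy only a minority of the windows.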
5 Conclusion
This paper proposes a framework for arrhythmia detection. In this framework, we convert one-dimensional ECG signals into two-dimensional images, and then construct a Hidden Attention Residual Neural Network combining SE blocks and BConvLSTM to enhance the ability to capture essential features and temporal dependencies in ECG signals. The experimental results demonstrate the effectiveness of our method. In the future, we will try a multimodal fusion of 2D representations of ECG to further improve the performance of our method.

Acknowledgement. This work has been supported by the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1909208), and the Natural Science Foundation of Hunan Province in China (2018JJ2534).
References
1. Wang, J.: Automated detection of atrial fibrillation and atrial flutter in ECG signals based on convolutional and improved Elman neural network. Knowl.-Based Syst. 193, 105446 (2020)
2. Acharya, U.R., Fujita, H., Lih, O.S., Hagiwara, Y., Tan, J.H., Adam, M.: Automated detection of arrhythmias using different intervals of tachycardia ECG segments with convolutional neural network. Inf. Sci. 405, 81–90 (2017)
3. Bhaskar, N.A.: Performance analysis of support vector machine and neural networks in detection of myocardial infarction. Procedia Comput. Sci. 46, 20–30 (2015)
4. Martis, R.J., Acharya, U.R., Prasad, H., Chua, C.K., Lim, C.M., Suri, J.S.: Application of higher order statistics for atrial arrhythmia classification. Biomed. Signal Process. Control 8(6), 888–900 (2013)
5. Elhaj, F.A., Salim, N., Harris, A.R., Swee, T.T., Ahmed, T.: Arrhythmia recognition and classification using combined linear and nonlinear features of ECG signals. Comput. Methods Programs Biomed. 127, 52–63 (2016)
6. Sidek, K.A., Khalil, I., Jelinek, H.F.: ECG biometric with abnormal cardiac conditions in remote monitoring system. IEEE Trans. Syst. Man Cybern. Syst. 44(11), 1498–1509 (2014)
7. Zhu, J., Zhang, Y., Zhao, Q.: Atrial fibrillation detection using different duration ECG signals with se-resnet. In: 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pp. 1–5. IEEE (2019)
8. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
9. Fan, X., Yao, Q., Cai, Y., Miao, F., Sun, F., Li, Y.: Multiscaled fusion of deep convolutional neural networks for screening atrial fibrillation from single lead short ECG recordings. IEEE J. Biomed. Health Inf. 22(6), 1744–1753 (2018)
10. Petmezas, G., et al.: Automated atrial fibrillation detection using a hybrid CNN-LSTM network on imbalanced ECG datasets. Biomed. Signal Process. Control 63, 102194 (2021)
11. Hannun, A.Y., et al.: Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat. Med. 25(1), 65–69 (2019)
12. Huang, J., Chen, B., Yao, B., He, W.: ECG arrhythmia classification using STFT-based spectrogram and convolutional neural network. IEEE Access 7, 92871–92880 (2019)
13. Alqudah, A.M., Qazan, S., Al-Ebbini, L., Alquran, H., Qasmieh, I.A.: ECG heartbeat arrhythmias classification: a comparison study between different types of spectrum representation and convolutional neural networks architectures. J. Ambient Intell. Humanized Comput. 1–31 (2021). https://doi.org/10.1007/s12652-021-03247-0
14. Naz, M., Shah, J.H., Khan, M.A., Sharif, M., Raza, M., Damaševičius, R.: From ECG signals to images: a transformation based approach for deep learning. PeerJ Comput. Sci. 7, e386 (2021)
15. Song, H., Wang, W., Zhao, S., Shen, J., Lam, K.M.: Pyramid dilated deeper convlstm for video salient object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 715–731 (2018)
16. Eckmann, J.P., Kamphorst, S.O., Ruelle, D., et al.: Recurrence plots of dynamical systems. World Sci. Ser. Nonlinear Sci. Ser. A 16, 441–446 (1995)
17. Wang, Z., Oates, T.: Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
18. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, pp. 802–810 (2015)
19. Alfaouri, M., Daqrouq, K.: ECG signal denoising by wavelet transform thresholding. Am. J. Appl. Sci. 5(3), 276–281 (2008)
20. Desai, U., Martis, R.J., Acharya, U.R., Nayak, C.G., Seshikala, G., Shetty, K.R.: Diagnosis of multiclass tachycardia beats using recurrence quantification analysis and ensemble classifiers. J. Mech. Med. Biol. 16(01), 1640005 (2016)
21. Acharya, U.R., et al.: Automated characterization of arrhythmias using nonlinear features from tachycardia ECG beats. In: 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 000533–000538. IEEE (2016)
22. Fujita, H., Cimr, D.: Decision support system for arrhythmia prediction using convolutional neural network structure without preprocessing. Appl. Intell. 49(9), 3383–3391 (2019). https://doi.org/10.1007/s10489-019-01461-0
23. Sree, V., et al.: A novel machine learning framework for automated detection of arrhythmias in ECG segments. J. Ambient Intell. Humanized Comput. 12(11), 10145–10162 (2021)
EEG-Based Depression Detection with a Synthesis-Based Data Augmentation Strategy

Xiangyu Wei1, Meifei Chen1, Manxi Wu2, Xiaowei Zhang1(B), and Bin Hu1,3(B)

1 Gansu Provincial Key Laboratory of Wearable Computing, School of Information Science and Engineering, Lanzhou University, Lanzhou, China
{xywei2020,chenmf20,zhangxw,bh}@lzu.edu.cn
2 Vivo Mobile Communication, Shenzhen, China
3 CAS Center for Excellence in Brain Science and Institutes for Biological Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Abstract. Recently, electroencephalography (EEG) has been widely used in depression detection. Researchers have successfully used machine learning methods to build depression detection models based on EEG signals. However, the scarcity of samples and individual differences in EEG signals limit the generalization performance of machine learning models. This study proposed a synthesis-based data augmentation strategy to improve the diversity of raw EEG signals and train more robust classifiers for depression detection. Firstly, we use the determinantal point processes (DPP) sampling method to investigate the individual differences of the raw EEG signals and generate a more diverse subset of subjects. Then we apply the empirical mode decomposition (EMD) method to the subset and mix the intrinsic mode functions (IMFs) to synthesize augmented EEG signals under the guidance of the diversity of subjects. Experimental results show that, compared with the traditional signal synthesis methods, the classification accuracy of our method can reach 75%, which substantially improves the generalization performance of classifiers for depression detection. Moreover, DPP sampling yields relatively higher classification accuracy compared to prevailing approaches.

Keywords: Depression detection · Electroencephalography · Data augmentation · Signal synthesis

1 Introduction
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 484–496, 2021. https://doi.org/10.1007/978-3-030-91415-8_41
Depression has become a global mental illness that endangers human health [1–3]. The impact is far-reaching, bringing a heavy material and spiritual burden to the patient’s family and society. A number of relevant studies have shown that by collecting biological information of subjects and extracting relevant features, we can analyze changes in internal factors, such as emotions and stress
levels, and predict whether they suffer from depression [4,5]. In particular, EEG can provide more detailed and complex information for the depression detection task. Moreover, because EEG signals cannot be intentionally changed or hidden, EEG-based depression detection yields more effective and reliable results [6]. Many researchers have investigated different feature extraction and classification methods for depression detection based on EEG signals [7–9]. Most of them used traditional machine learning methods or neural networks to classify EEG signals [10–12].
The performance of machine learning methods depends on a large amount of labeled training data to avoid over-fitting. However, most EEG datasets for depression detection are limited in size, ranging from tens to hundreds of subjects, because of the lengthy and expensive acquisition process and the reluctance of patients to share data. It is therefore difficult for machine learning models to obtain the best results on small EEG datasets, because the models cannot be sufficiently trained. In addition, the EEG signals of different subjects under the same label may vary greatly [13,14]. These individual differences can mislead the classification and further degrade the generalization performance of machine learning models trained on a limited number of trials or samples [15].
To solve the two problems mentioned above, we propose a novel EEG signal synthesis algorithm for depression detection, called subject-diversity-based empirical mode decomposition (SD-EMD). Our main contributions are as follows:
• The determinantal point processes (DPP) algorithm can effectively identify the diversity and differences within a dataset, which provides efficient and accurate support for sampling, marginalization and other inference tasks. In this study, we used the DPP algorithm to calculate a DPP kernel matrix that captures the individual differences in EEG signals. We then applied this difference information to sample subjects and construct a more diverse subset of subjects.
• We adopt the empirical mode decomposition (EMD) method to decompose the original EEG signals of the sampled subjects into multiple intrinsic mode functions (IMFs), and then recombine the IMFs of different subjects to synthesize augmented signals under the guidance of subject diversity. The synthesized EEG signals can thus be regarded as belonging to a new augmented subject.
• The diverse augmented dataset can sufficiently train machine learning models and make them more robust. Moreover, experimental results show that the SD-EMD method can effectively improve the performance of these models in depression detection.
The rest of the paper is organized as follows: Related work on data augmentation methods is reviewed in Sect. 2. The framework of our method is given in Sect. 3. EEG data acquisition and preprocessing as well as the experimental results are presented and discussed in Sect. 4. The conclusion is given in Sect. 5.
2 Related Work
To increase the amount and diversity of existing data, data augmentation is often applied as a preprocessing step when building machine learning models. Data augmentation is the process of generating new samples by transforming training data in order to improve the accuracy and robustness of classifiers. Data augmentation methods can produce a large amount of data in a relatively simple and convenient way. Their effectiveness has been shown in many applications, such as the data augmentation used with AlexNet for image classification [16]. Data augmentation can also be used to expand time series, for example by using the fast Fourier transform or the wavelet transform to reconstruct signals and increase the amount of data. However, these approaches are not adequate for extracting features and reconstructing new, effective EEG frames, because EEG signals have characteristics that distinguish them from other time series, such as nonlinearity and nonstationarity [17,18]. Hence, many researchers have focused on EEG signal synthesis that extracts correlation information in order to reconstruct raw EEG signals.
For the depression detection task, different data augmentation methods based on EEG signals have also been proposed and analyzed in the literature. Time-domain augmentation is the most direct approach to EEG signal synthesis. Most such methods operate directly on the raw EEG signals, for example by adding Gaussian noise, spike noise, step trend noise or slope trend noise. However, those approaches neglect the temporal features of EEG signals. Window cropping is another simple time-domain EEG signal reconstruction method [19]. It randomly extracts contiguous slices from the raw EEG signals; the step size of the slices and the overlap between windows are adjustable parameters. It keeps the temporal features of EEG signals but fails to process the spectral features. Window warping is an augmentation method for raw EEG signals that is similar to dynamic time warping (DTW). It selects a random time range and then compresses it (down-sampling) or stretches it (up-sampling) while keeping the other time frames unchanged. These methods lose the time-domain or frequency-domain characteristics of EEG signals, so the EEG signals cannot be accurately reconstructed.
The empirical mode decomposition algorithm is suitable for non-stationary signals, whose frequency structure changes subtly within short periods. The EMD method can effectively and losslessly decompose and re-synthesize EEG signals [20]. However, the EMD algorithm only considers a single subject and does not systematically address the problem of individual differences among subjects. Therefore, to introduce individual differences into the augmentation of EEG data, increase the diversity of augmented signals, and train more robust classifiers for depression detection, we combined the EMD method with a subject sampling strategy based on the DPP algorithm.
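As a concrete illustration of the window cropping idea reviewed in this section, the following sketch extracts overlapping slices from a toy signal. It is deterministic (a fixed step rather than random start positions) to keep it easy to check, and the function name, window length and step are hypothetical choices rather than settings from any cited work.

```python
def window_crop(signal, window, step):
    """Return slices of `signal`, each `window` samples long, taken every `step` samples."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, step)]

# Toy 10-sample signal; window of 4 samples, step of 2 (so neighbours overlap by 2).
toy = list(range(10))
crops = window_crop(toy, window=4, step=2)
```

Each crop keeps the local temporal order of the samples, which is why the text describes this family of methods as preserving temporal features.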
3 Method
In this section, the subject-diversity-based empirical mode decomposition (SD-EMD) method is proposed to augment raw EEG signals and obtain a more diverse dataset, which contributes to training a more robust depression detection model and further improving the generalization performance of classifiers. The proposed method mainly considers three aspects. First, the time-domain and frequency-domain features of the raw EEG signals can be preserved by the augmented signals. Second, the augmented signals can cover the unexplored input space while maintaining correct labels. Third, more diverse signals are generated to address the problem caused by individual differences. For the depressed patients, we first use the DPP sampling method to explore the individual differences of the raw EEG signals and select a more diverse subset of subjects. Then the EMD algorithm is performed on the signals of the selected subjects, and the corresponding IMFs are mixed to synthesize an augmented EEG signal according to the characteristics of the raw EEG signals. We present the framework of the SD-EMD algorithm below (Fig. 1). The signal synthesis process for normal controls is the same as that for depressed patients.
Fig. 1. The subject-diversity-based empirical mode decomposition method.
Assume the raw EEG signal dataset of depressed patients $X_{ori}$ has an instance $X_i^s \in \mathbb{R}^{D \times 1}$, where $i = 1 \ldots N$ denotes the $i$-th subject, $s = 1 \ldots S$ denotes the $s$-th channel EEG signal of the subject, and $D$ is the sample dimension.
3.1 Individual Diversity Sampling
The first step is to find the DPP relationship $K_{i,j}^{s}$ between the $i$-th subject’s and the $j$-th subject’s $s$-th channel EEG signal:
$$K_{i,j}^{s} = e^{-\left\| X_i^s,\, X_j^s \right\|_2} \tag{1}$$
Then we can get the average DPP relationship $K_{i,j}$ between the $i$-th subject and the $j$-th subject:
$$K_{i,j} = \frac{1}{S} \sum_{s=1}^{S} K_{i,j}^{s} \tag{2}$$
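Equations (1) and (2) can be illustrated with a small sketch. It assumes that the exponent in Eq. (1) is the Euclidean distance between the two channel signals; the function names and the toy data are hypothetical and not taken from the paper.

```python
import math

def channel_kernel(x, y):
    """Eq. (1), assuming the exponent is the Euclidean distance between x and y."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-dist)

def subject_kernel(subj_i, subj_j):
    """Eq. (2): average the per-channel kernel values over the S channels."""
    values = [channel_kernel(x, y) for x, y in zip(subj_i, subj_j)]
    return sum(values) / len(values)

# Toy subjects: S = 2 channels, D = 3 samples per channel (made-up numbers).
subject_a = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
subject_b = [[0.0, 1.1, 0.0], [1.0, 0.1, 1.0]]    # close to subject_a
subject_c = [[2.0, -1.0, 3.0], [-2.0, 4.0, 0.0]]  # far from subject_a

k_ab = subject_kernel(subject_a, subject_b)
k_ac = subject_kernel(subject_a, subject_c)
```

Under this assumption, similar subjects receive a kernel value close to 1 and dissimilar subjects a value close to 0, which is the property the sampling step relies on.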
Meanwhile, we define the DPP relationship between the $i$-th subject and itself as $K_{i,i} = \frac{1}{N}$, where $N$ is the number of subjects in the original EEG dataset $X_{ori}$. The second step is to sample $M$ subjects as a basic subset of the augmented dataset, $X_{aug}^{basic}$. We use the DPP sampling algorithm to calculate the DPP kernel matrix $K$. For any subset of the original dataset $B \subset X_{ori}$, we have:
$$p\left(B \subseteq X_{aug}^{basic}\right) = \det\left(K_B\right) \tag{3}$$
where $K_B$ is the square submatrix of $K$ indexed by the rows and columns corresponding to the elements of $B$, $\det$ is the determinant, and $p$ is the probability. If $B = \{X_i, X_j\}$, then $p\left(B \subseteq X_{aug}^{basic}\right) = p\left(\{X_i, X_j\} \subseteq X_{aug}^{basic}\right) = \det\left(K_B\right)$, that is:
$$\begin{aligned} p\left(\{X_i, X_j\} \subset X_{aug}^{basic}\right) &= \begin{vmatrix} K_{i,i} & K_{i,j} \\ K_{j,i} & K_{j,j} \end{vmatrix} \\ &= K_{i,i} K_{j,j} - K_{i,j} K_{j,i} \\ &= p\left(X_i \in X_{aug}^{basic}\right) p\left(X_j \in X_{aug}^{basic}\right) - K_{i,j}^{2} \end{aligned} \tag{4}$$
Since $p\left(\{X_i, X_j\} \subset X_{aug}^{basic}\right) = p\left(X_i \in X_{aug}^{basic}\right) p\left(X_j \in X_{aug}^{basic}\right) - K_{i,j}^{2}$, the more similar the $i$-th subject and the $j$-th subject are, the larger $K_{i,j}$ is and the smaller $p\left(\{X_i, X_j\} \subset X_{aug}^{basic}\right)$ will be; that is, the probability that the two subjects are selected at the same time is smaller. Consequently, two subjects that are selected into the basic subset $X_{aug}^{basic}$ at the same time differ greatly. Through the matrix $K$ we can get a basic subset of the augmented dataset $X_{aug}^{basic}$:
$$X_{aug}^{basic} = \{X_1, X_2, \ldots, X_M\} \tag{5}$$
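To make Eqs. (3) and (4) concrete, the following sketch evaluates the determinant of the 2×2 submatrix for two pairs of subjects. The kernel values are made-up numbers chosen for illustration, not taken from the paper's data, and the helper names are hypothetical.

```python
def submatrix(K, indices):
    """K_B: rows and columns of K restricted to the subjects in `indices`."""
    return [[K[r][c] for c in indices] for r in indices]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def pair_inclusion_prob(K, i, j):
    """Eq. (4): probability that subjects i and j are both selected."""
    return det2(submatrix(K, [i, j]))

# Hypothetical symmetric kernel for three subjects.
# Subjects 0 and 1 are similar (large K[0][1]); subject 2 differs from both.
K = [
    [0.50, 0.45, 0.05],
    [0.45, 0.50, 0.05],
    [0.05, 0.05, 0.50],
]

p_similar = pair_inclusion_prob(K, 0, 1)     # 0.5 * 0.5 - 0.45 ** 2
p_different = pair_inclusion_prob(K, 0, 2)   # 0.5 * 0.5 - 0.05 ** 2
```

The similar pair is far less likely to be selected together than the dissimilar pair, which is the repulsion effect that the text uses to build a diverse subset.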
3.2 Signal Decomposition and Recombination
First, we need to decompose the EEG signals of all subjects in $X_{aug}^{basic}$ into IMFs. In this paper, we stipulate that all EEG signals are decomposed into $M$ IMFs, which is similar to a previous study [21]. To decompose EEG signals, the first step is to find all the local maximum and minimum points in the signal $s_j(t)$, where initially we set $s_0(t) = X_i^s$ and $j = 1 \ldots M$. Then we obtain the upper and lower envelopes, denoted as $u_j(t)$ and $v_j(t)$, and calculate the mean value of the upper and lower envelopes $m_j(t)$:
$$m_j(t) = \frac{u_j(t) + v_j(t)}{2} \tag{6}$$
The second step is to find the difference between the signal and the mean envelope:
$$h_j(t) = s_j(t) - m_j(t) \tag{7}$$
If $h_j(t)$ satisfies the characteristics of IMFs, it becomes a decomposed $IMF_{i,s}^{j}(t)$. Otherwise, let $s_{j+1}(t) = h_j(t)$, and repeat steps 1 and 2 for $s_{j+1}(t)$ until $h_j(t)$ meets the characteristics of IMFs. At this time, we remove $IMF_{i,s}^{j}(t)$ and denote the EEG signal margin as:
$$r_{i,s}^{j}(t) = r_{i,s}^{j-1}(t) - IMF_{i,s}^{j}(t) \tag{8}$$
where $r_{i,s}^{1}(t) = X_i^s$. After decomposing $M$ rounds, we get $r_{i,s}^{M}(t)$ and $IMF_{i,s}^{M}(t)$. According to the sampling order of the DPP sampling method, we select the $i$-th IMF of the $s$-th channel EEG signal of the $i$-th subject. We then use all the selected IMFs and the final residual to synthesize the $s$-th channel EEG signal for a new augmented subject $k$:
$$X_{aug}^{k,s}(t) = \sum_{i=1}^{M} IMF_{i,s}^{i}(t) + r_{M,s}^{M}(t) \tag{9}$$
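A minimal sketch of the recombination in Eq. (9) follows, assuming the IMFs and the final residual of every sampled subject are already available for one channel. The data layout (`imfs[i][j]` holds IMF number j+1 of sampled subject number i+1) and the function name are hypothetical.

```python
def synthesize_channel(imfs, residuals):
    """Eq. (9): sum the i-th IMF of the i-th subject, then add the last subject's residual.

    imfs[i][j]   -> one IMF (list of samples) of sampled subject i, for one channel
    residuals[i] -> final residual (list of samples) of sampled subject i
    """
    m = len(imfs)                      # M sampled subjects, M IMFs each
    out = list(residuals[-1])          # r_{M,s}^M(t), residual of the last sampled subject
    for i in range(m):
        picked = imfs[i][i]            # IMF_{i,s}^i(t)
        for t in range(len(out)):
            out[t] += picked[t]
    return out

# Toy example: M = 2 subjects, 2 IMFs each, 3 samples per signal (made-up numbers).
imfs = [
    [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]],    # subject 1: IMF 1, IMF 2
    [[2.0, 2.0, 2.0], [0.1, 0.2, 0.3]],     # subject 2: IMF 1, IMF 2
]
residuals = [[9.0, 9.0, 9.0], [10.0, 10.0, 10.0]]

augmented = synthesize_channel(imfs, residuals)
```

In the toy example the output mixes IMF 1 of subject 1, IMF 2 of subject 2 and the residual of subject 2, so the synthesized channel carries components from both subjects.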
Due to the characteristics of the EMD method, we can quickly obtain an augmented dataset that is several times larger than the raw EEG dataset. The whole algorithm flow of our method is as follows.
Algorithm 1: SD-EMD
Input: EEG dataset $X_{ori}$
Output: augmented EEG dataset $X_{aug}$
1: for k = 1, 2, ..., K (the number of subjects to synthesize) do
2:   Use (1)-(5) to sample $X_{ori}$ to obtain subset $X_{aug}^{basic}$
3:   for i = 1, 2, ..., M (the number of subjects in $X_{aug}^{basic}$) do
4:     for s = 1, 2, ..., S (the number of channels) do
5:       for j = 1, 2, ..., M (the number of IMFs) do
6:         Use (6)-(8) to decompose $X_i^s$ to obtain $IMF_{i,s}^{j}(t)$ and $r_{i,s}^{j}(t)$
7:         If $r_{i,s}^{j}(t)$ cannot be decomposed, set all $IMF_{i,s}^{j}(t)$ to zero
8:       end for
9:       $X_{aug}^{k,s}(t) = \sum_{i=1}^{M} IMF_{i,s}^{i}(t) + r_{M,s}^{M}(t)$
10:     end for
11:   end for
12: end for
13: return $X_{aug}$
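Step 6 of Algorithm 1 relies on the sifting operation of Eqs. (6) and (7). The sketch below performs a single sifting pass in plain Python. It is a simplification under stated assumptions: the envelopes are built by linear interpolation between extrema (EMD implementations commonly use cubic splines), the signal end points are treated as extrema, no IMF stopping criterion is checked, and all names are hypothetical.

```python
def local_extrema(x):
    """Indices of local maxima and minima; end points are added to both lists (assumption)."""
    maxima, minima = [0], [0]
    for t in range(1, len(x) - 1):
        if x[t] > x[t - 1] and x[t] >= x[t + 1]:
            maxima.append(t)
        elif x[t] < x[t - 1] and x[t] <= x[t + 1]:
            minima.append(t)
    maxima.append(len(x) - 1)
    minima.append(len(x) - 1)
    return maxima, minima

def linear_envelope(x, knots):
    """Piecewise-linear envelope through the samples at `knots` (stand-in for a spline)."""
    env = [0.0] * len(x)
    for left, right in zip(knots, knots[1:]):
        for t in range(left, right + 1):
            w = (t - left) / (right - left)
            env[t] = (1 - w) * x[left] + w * x[right]
    return env

def sift_once(s):
    """One sifting pass: h(t) = s(t) - m(t), with m(t) the mean envelope."""
    if len(s) < 3:
        raise ValueError("need at least three samples")
    maxima, minima = local_extrema(s)
    upper = linear_envelope(s, maxima)                        # u_j(t)
    lower = linear_envelope(s, minima)                        # v_j(t)
    mean_env = [(u + v) / 2 for u, v in zip(upper, lower)]    # Eq. (6)
    h = [a - m for a, m in zip(s, mean_env)]                  # Eq. (7)
    return h, mean_env

signal = [0.0, 2.0, 0.0, -2.0, 0.0, 2.0, 0.0, -2.0, 0.0]
h, mean_env = sift_once(signal)
```

A full decomposition would repeat this pass until the candidate satisfies the IMF conditions, subtract the resulting IMF as in Eq. (8), and continue on the remainder.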
4 Experiments and Results
4.1 Data Acquisition and Preprocessing
In this study, 170 subjects (81 depressed patients and 89 normal controls) were recruited from the psychiatric hospital. All the participants, aged 18 to 65, were right-handed and of normal intelligence. Overall, there was no significant difference in age and gender distribution between the two groups. Medicated subjects or patients with other mental system abnormalities were excluded to ensure that the main difference between the test group and the normal control group was the depressive disorder. Written informed consent was obtained from all participants before the experiment. Every subject received interviews and scale-based diagnoses from professional doctors before data collection. The doctors used “The Mini-International Neuropsychiatric Interview” (MINI) and the “Patient Health Questionnaire” (PHQ-9) for examination [22,23]. Consent forms and study design were approved by the local Ethics Committee for Biomedical Research at the Beijing Anding Hospital, Capital Medical University, in accordance with the Code of Ethics of the World Medical Association (Declaration of Helsinki).
Because prefrontal EEG signals are closely related to emotions, we adopted the prefrontal three-electrode EEG acquisition system developed by the Gansu Provincial Key Laboratory of Wearable Computing in China to collect EEG signals (Fig. 2). The sampling frequency of the system is 250 Hz and the passband is 0.5–50 Hz. After one minute of relaxation, we recorded a 90-s resting-state EEG data segment with eyes closed. After the acquisition, EEG signals were high-pass filtered with a 1 Hz cutoff frequency and low-pass filtered with a 40 Hz cutoff frequency to remove power-line interference. Prefrontal EEG signals are usually affected by electrooculogram (EOG) artifacts, so we used the discrete wavelet transform and Kalman filtering to remove them. The features extracted from the EEG signals include Ruili entropy, C0 complexity, correlation, center frequency, and the maximum value and mean value of the frequency domain.
Fig. 2. The prefrontal three-electrode EEG acquisition system and positions of three electrodes.
4.2 Experimental Settings
In our study, we used Support Vector Machine (SVM), K Nearest Neighbor (KNN), Naive Bayes (NB) and Logistic Regression (LR) to detect depression. Accuracy and F1-score were adopted to evaluate the performance of the SD-EMD method. We used five-fold cross-validation to evaluate the proposed method. The original samples (170 subjects) were divided into five subsets, each containing 34 subjects. Each subset was used in turn as the test set for depression detection, and the remaining four subsets (136 subjects) were used as the training set to be augmented. We carried out the proposed signal synthesis method on depressed patients and normal controls in the training set respectively.
In Sect. 4.3, we used the SD-EMD method and other signal synthesis methods to augment the EEG signals of the 136 subjects in the training set, and the augmented signals were then used to train the four classifiers mentioned above. The mainstream signal synthesis methods include Gaussian noise, replacement, scaling and dynamic time warping.
Gaussian Noise. Gaussian noise is added to the raw EEG signals [24].
Replacement. The EEG signals of subject A are divided evenly in time into five segments, denoted as A1, A2, A3, A4 and A5. The five segments are shuffled and rearranged to generate a new EEG signal [25].
Scaling. Signals are reduced or enlarged along the y-axis by a certain factor. In this experiment, because the raw EEG signals are physiological signals of the human body, they should not be enlarged or reduced too much, so the factors are 0.95 and 1.05 respectively [26].
Dynamic Time Warping (DTW). For the EEG data of two subjects with one-to-one corresponding electrode order, the dynamic time warping algorithm warps the EEG signals to obtain augmented data of the two subjects and simultaneously yields the DTW value. The DTW value indicates the similarity between EEG signals. In this paper, the DTW augmentation experiment was carried out. Based on the maximum and minimum DTW values, we obtained augmented signals for 136 subjects. We also combined the DTW algorithm with the DPP subject sampling strategy and used it as a comparative method [27].
In Sect. 4.4, we expanded the original training set N times through the SD-EMD method and other subject sampling methods and compared the effects of different subject sampling strategies on signal synthesis, so the number of subjects in the augmented dataset is 136N.
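The three simplest baselines described in this subsection (Gaussian noise, replacement and scaling) can be sketched as follows. This is an illustrative sketch only: the noise standard deviation, the random seed and the function names are assumptions, and only the five-segment split and the 0.95 and 1.05 factors come from the text.

```python
import random

def add_gaussian_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise; `sigma` is an assumed value, not from the paper."""
    return [x + rng.gauss(0.0, sigma) for x in signal]

def replacement(signal, rng, n_segments=5):
    """Split the signal into equal-length segments in time and shuffle their order.

    Trailing samples are dropped if the length is not divisible by n_segments.
    """
    seg_len = len(signal) // n_segments
    segments = [signal[k * seg_len:(k + 1) * seg_len] for k in range(n_segments)]
    rng.shuffle(segments)
    return [x for seg in segments for x in seg]

def scale(signal, factor):
    """Shrink or enlarge the amplitude (y-axis) by a constant factor."""
    return [factor * x for x in signal]

rng = random.Random(0)                      # fixed seed for reproducibility
raw = [float(t) for t in range(10)]         # toy 10-sample signal

noisy = add_gaussian_noise(raw, sigma=0.1, rng=rng)
shuffled = replacement(raw, rng)
smaller, larger = scale(raw, 0.95), scale(raw, 1.05)
```

None of these transformations uses information from other subjects, which is the gap that the subject-diversity-guided synthesis is meant to address.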
4.3 Comparison with Mainstream Signal Synthesis Methods
In this section, we will compare the classification performance between the SD-EMD method and commonly used data augmentation methods.

Table 1. Comparison of classification results between the SD-EMD method and mainstream signal synthesis methods (%).

| Method | | NB Accuracy | NB F1 | KNN Accuracy | KNN F1 | SVM Accuracy | SVM F1 | LR Accuracy | LR F1 |
|---|---|---|---|---|---|---|---|---|---|
| Non-augmented | Mean | 64.52 | 60 | 66.13 | 65 | 63.68 | 64 | 62.48 | 63 |
| Replacement | Mean | 66.46 | 64 | 68.72 | 62 | 66.27 | 65 | 66.46 | 67 |
| Scaling(0.95) | Mean | 66.96 | 67 | 69.2 | 63 | 67.91 | 70 | 67.14 | 67 |
| Scaling(1.05) | Mean | 67.11 | 65 | 70.31 | 65 | 67.25 | 67 | 68.57 | 65 |
| DTW(min) | Mean | 67.73 | 66 | 69.65 | 69 | 67.51 | 65 | 66.75 | 69 |
| DTW(max) | Mean | 65.35 | 63 | 66.73 | 65 | 67.87 | 62 | 67.14 | 67 |
| Gaussian Noise | Mean | 66.15 | 60 | 68.74 | 61 | 69.1 | 65 | 67.23 | 76.85 |
| | SD | 15.25 | 5.58 | 10.3 | 5.46 | 8.26 | 7.74 | 14.32 | 6.2 |
| DPP+DTW | Mean | 69.01 | 68 | 68.18 | 66 | 69.1 | 70 | 68.94 | 68 |
| | SD | 8.85 | 4.76 | 6.69 | 3.27 | 8.21 | 2.88 | 5.62 | 3.57 |
| Our Method | Mean | 75.68 | 71 | 72.59 | 73 | 77.23 | 75 | 76.85 | 72 |
| | SD | 1.67 | 5.7 | 0.76 | 4.24 | 1.15 | 3.29 | 1.07 | 5.5 |

The main difference between this work and other signal synthesis methods applied in classification experiments is that we use individual diversity information as a reference to guide the synthesis. The final results are the mean and standard deviation (SD) of the accuracy and F1-score for the different classifiers. As shown in Table 1, across the four classifiers and when augmenting the same number of subjects, the classification accuracy of our method improves by about 8% on average compared with the other augmentation methods. Our method achieves the highest accuracy, and the F1-score increases by 5–14%. The experimental results show that our algorithm performs better because we introduce information from multiple subjects to alleviate the impact of individual differences, and the augmented dataset enables the model to be fully trained. The standard deviation of our method is smaller than that of the others, which also indicates that our algorithm has certain advantages in terms of stability. The experimental results also show that the accuracy and F1-score of the DTW method with the DPP subject sampling strategy are lower than those of the proposed method. This is probably because EMD is more suitable for EEG signal synthesis than conventional methods and can decompose and reassemble the EEG signal without loss. In short, our method achieves better results than the other signal synthesis methods.
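For reference, accuracy and F1-score for a binary task, and their mean and SD across folds, can be computed as in the sketch below. The labels and per-fold values are invented toy numbers, and using the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 1 = depressed, 0 = normal control.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)      # 4 of 6 predictions are correct
f1 = f1_score(y_true, y_pred)       # precision and recall are both 2/3

# Hypothetical per-fold accuracies from five-fold cross-validation.
fold_accuracies = [0.70, 0.75, 0.80, 0.75, 0.75]
mean_acc = statistics.mean(fold_accuracies)
sd_acc = statistics.stdev(fold_accuracies)   # sample SD (n - 1 in the denominator)
```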
4.4 Comparison with Other Subject Sampling Strategies
We have already shown that the synthesis-based data augmentation method can improve the performance of EEG-based depression recognition. To further explore whether different subject sampling strategies affect signal synthesis, we also conducted comparative experiments on random sampling, accept-reject sampling and DPP sampling.
Random Sampling. Also known as uniform sampling, this method stipulates that all data in the dataset have the same probability of being selected; it is a completely equal-probability sampling method [28].
Accept-Reject Sampling. If the target data distribution P(x) is difficult to sample from directly, we instead use a distribution Q(x) that is easier to sample from [29].
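The accept-reject idea can be sketched on a discrete toy problem. The target weights, the uniform proposal and the function name below are assumptions made for illustration; the text does not specify the distributions P(x) and Q(x) that were used.

```python
import random

def accept_reject_sample(target, rng):
    """Draw one index from `target` (a list of probabilities) via a uniform proposal Q.

    A candidate x drawn from Q is accepted with probability P(x) / (c * Q(x)),
    where c = max_x P(x) / Q(x) guarantees that the ratio never exceeds 1.
    """
    n = len(target)
    q = 1.0 / n                        # uniform proposal Q(x)
    c = max(target) / q                # envelope constant
    while True:
        x = rng.randrange(n)           # candidate from Q
        if rng.random() < target[x] / (c * q):
            return x

rng = random.Random(42)
target = [0.1, 0.2, 0.7]               # hypothetical target distribution P(x)
draws = [accept_reject_sample(target, rng) for _ in range(20000)]
freq = [draws.count(k) / len(draws) for k in range(len(target))]
```

With enough draws, the empirical frequencies approach the target probabilities even though every candidate comes from the uniform proposal.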
Fig. 3. Comparison between random, accept-reject and DPP subject sampling strategies with (a) KNN and (b) SVM classifier.
We used random sampling and accept-reject sampling to sample 15 subjects from the training dataset. The subsequent operations are the same as for the DPP subject sampling strategy. In Fig. 3, Nx means that the number of samples in the extended dataset is N times that of the original dataset (136 subjects); for example, 2x means that the extended dataset has 272 subjects. We can conclude from Fig. 3 that random subject sampling and accept-reject subject sampling are inferior to DPP subject sampling because they do not incorporate the individual diversity information provided by the DPP algorithm. The DPP subject sampling strategy achieves better classification performance at a small augmentation scale. When the size of the augmented dataset reaches about 6–7 times that of the original dataset, the classification performance tends to be stable. By contrast, with the random or accept-reject subject sampling strategies, the augmented dataset must reach 9–10 times the original size to achieve stable classification performance. The DPP subject sampling strategy achieved good classification performance in all aspects.
5 Conclusion
The mismatch between the amount of data actually available and the amount needed by machine learning methods hampers the development of depression detection models. Moreover, individual differences in EEG signals further lower the generalization performance of the models. To solve these problems, we proposed the subject-diversity-based empirical mode decomposition (SD-EMD) method. The SD-EMD method can not only synthesize augmented signals without loss but also add the diversity information of subjects to the data augmentation process, so as to obtain a more robust depression detection model. The experimental results show the superiority of our method in improving the classification accuracy and generalization performance of the model when compared with mainstream signal synthesis methods and other subject sampling strategies. In the future, we will seek to use other data augmentation methods, such as generative adversarial networks, in conjunction with the DPP subject sampling strategy to generate more effective EEG samples and improve the performance of EEG-based depression detection.
Acknowledgement. This work was supported in part by the National Key R&D Program of China (Grant No. 2019YFA0706200), in part by the National Natural Science Foundation of China (Grant No. 62072219, 61632014), and in part by the National Basic Research Program of China (973 Program, Grant No. 2014CB744600).
References
1. Sharma, M., Achuth, P., Deb, D., Puthankattil, S.D., Acharya, U.R.: An automated diagnosis of depression using three-channel bandwidth-duration localized wavelet filter bank with EEG signals. Cogn. Syst. Res. 52, 508–520 (2018)
2. Kessler, R.C., Chiu, W.T., Demler, O., Walters, E.E.: Prevalence, severity, and comorbidity of 12-month DSM-IV disorders in the National Comorbidity Survey Replication. Arch. Gen. Psychiatry 62(6), 617–627 (2005)
3. Hardeveld, F., et al.: Recurrence of major depressive disorder across different treatment settings: results from the NESDA study. J. Affect. Disord. 147(1–3), 225–231 (2013)
4. Kreezer, G.: The electro-encephalogram and its use in psychology. Am. J. Psychol. 51(4), 737–759 (1938)
5. Hosseinifard, B., Moradi, M.H., Rostami, R.: Classifying depression patients and normal subjects using machine learning techniques and nonlinear features from EEG signal. Comput. Methods Program. Biomed. 109(3), 339–345 (2013)
6. Acharya, U.R., et al.: A novel depression diagnosis index using nonlinear features in EEG signals. Eur. Neurol. 74(1–2), 79–83 (2015)
7. Hinrikus, H., et al.: Electroencephalographic spectral asymmetry index for detection of depression. Med. Biol. Eng. Comput. 47(12), 1291–1299 (2009)
8. Shen, J., Zhang, X., Hu, B., Wang, G., Ding, Z.: An improved empirical mode decomposition of electroencephalogram signals for depression detection. IEEE Trans. Affect. Comput. (2019)
9. Zhang, X., Shen, J., ud Din, Z., Liu, J., Wang, G., Hu, B.: Multimodal depression detection: fusion of electroencephalography and paralinguistic behaviors using a novel strategy for classifier ensemble. IEEE J. Biomed. Health Inf. 23(6), 2265–2275 (2019)
10. Ay, B., et al.: Automated depression detection using deep representation and sequence learning with EEG signals. J. Med. Syst. 43(7), 1–12 (2019)
11. Tian, C., Xu, Y., Zuo, W.: Image denoising using deep CNN with batch renormalization. Neural Netw. 121, 461–473 (2020)
12. Zhang, X., Li, J., Hou, K., Hu, B., Shen, J., Pan, J.: EEG-based depression detection using convolutional neural network with demographic attention mechanism. In: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 128–133. IEEE (2020)
13. Jorm, A.F., et al.: MRI hyperintensities and depressive symptoms in a community sample of individuals 60–64 years old. Am. J. Psychiatry 162(4), 699–705 (2005)
14. Siegel, M.J., Bradley, E.H., Gallo, W.T., Kasl, S.V.: The effect of spousal mental and physical health on husbands’ and wives’ depressive symptoms, among older adults: longitudinal evidence from the health and retirement survey. J. Aging Health 16(3), 398–425 (2004)
15. Van Putten, M.J., Olbrich, S., Arns, M.: Predicting sex from brain rhythms with deep learning. Sci. Rep. 8(1), 1–7 (2018)
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
17. Paris, A., Atia, G.K., Vosoughi, A., Berman, S.A.: A new statistical model of electroencephalogram noise spectra for real-time brain-computer interfaces. IEEE Trans. Biomed. Eng. 64(8), 1688–1700 (2016)
18. Lotte, F.: Generating artificial EEG signals to reduce BCI calibration time. In: 5th International Brain-Computer Interface Workshop, pp. 176–179 (2011)
19. Le Guennec, A., Malinowski, S., Tavenard, R.: Data augmentation for time series classification using convolutional neural networks. In: ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data (2016)
20. Dinarès-Ferran, J., Ortner, R., Guger, C., Solé-Casals, J.: A new method to generate artificial frames using the empirical mode decomposition for an EEG-based motor imagery BCI. Front. Neurosci. 12, 308 (2018)
21. Zhang, Z., et al.: A novel deep learning approach with data augmentation to classify motor imagery signals. IEEE Access 7, 15945–15954 (2019)
22. Sheehan, D.V., et al.: The mini-international neuropsychiatric interview (MINI): the development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. J. Clin. Psychiatry 59(20), 22–33 (1998)
23. Kroenke, K., Spitzer, R.L., Williams, J.B.: The PHQ-9: validity of a brief depression severity measure. J. Gen. Intern. Med. 16(9), 606–613 (2001)
24. Molla, M.K., Tanaka, T., Rutkowski, T.M., Cichocki, A.: Separation of EOG artifacts from EEG signals using bivariate EMD. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 562–565. IEEE (2010)
25. Liu, P., Wang, X., Xiang, C., Meng, W.: A survey of text data augmentation. In: 2020 International Conference on Computer Communication and Network Security (CCNS), pp. 191–195. IEEE (2020)
26. Li, K., Shapiai, M.I., Adam, A., Ibrahim, Z.: Feature scaling for EEG human concentration using particle swarm optimization. In: 2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 1–6. IEEE (2016)
27. Yamauchi, T., Xiao, K., Bowman, C., Mueen, A.: Dynamic time warping: a single dry electrode EEG study in a self-paced learning task. In: 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 56–62. IEEE (2015)
28. Flandrin, P., Rilling, G., Goncalves, P.: Empirical mode decomposition as a filter bank. IEEE Sig. Process. Lett. 11(2), 112–114 (2004)
29. Bajaj, V., Pachori, R.B.: Classification of seizure and nonseizure EEG signals using empirical mode decomposition. IEEE Trans. Inf. Technol. Biomed. 16(6), 1135–1142 (2011)
Sequencing Data Analysis
Joint CC and Bimax: A Biclustering Method for Single-Cell RNA-Seq Data Analysis
He-Ming Chu, Xiang-Zhen Kong(B), Jin-Xing Liu, Juan Wang, Sha-Sha Yuan, and Ling-Yun Dai
School of Computer Science, Qufu Normal University, Rizhao 276826, Shandong, China
[email protected]
Abstract. One of the important aims of analyzing single-cell RNA sequencing (scRNA-seq) data is to discover new cell subtypes by clustering. In scRNA-seq data, many genes behave similarly under different conditions (cells). Traditional clustering algorithms cannot obtain high-quality clusters on scRNA-seq data. The biclustering algorithm, however, has become a more powerful data mining tool, which can cluster genes and conditions (cells) simultaneously. In this paper, we propose a novel biclustering algorithm named JCB (Joint CC and Bimax). The algorithm is based on two classic biclustering algorithms: Cheng and Church’s algorithm (CC) and the Binary Inclusion-Maximal biclustering algorithm (Bimax). The main idea of the JCB method is to combine the “mean squared residual (MSR)” proposed in CC with the model of Bimax. It obtains biclusters by iterating over the rows and columns of the data matrix with the MSR, and it also benefits from the simple model of Bimax. We evaluate the proposed method by carrying out extensive experiments on three scRNA-seq datasets and comparing JCB with six other biclustering algorithms. The experimental results show that the proposed method outperforms the others.
Keywords: Biclustering · Mean squared residual · Single-cell RNA sequencing data
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 499–510, 2021. https://doi.org/10.1007/978-3-030-91415-8_42
1 Introduction
Single-cell RNA sequencing (scRNA-seq) technology has become the first choice for processing single-cell data and can reveal the differences between cells. Moreover, scRNA-seq technology is a novel and powerful tool that can discover cellular heterogeneity and new cell types in scRNA-seq data [1–3], and it has broad applications in bioinformatics. A fundamental step of scRNA-seq analysis is analyzing cells by clustering [4]. Clustering groups a set of elements so that cells or genes in the same cluster are more similar to each other than to elements in other clusters [5]. In addition, clustering can identify similar cells using distances such as the Euclidean distance and the Manhattan distance [6]. The K-means clustering [7] and spectral clustering are widely used [8]. With these methods, elements are divided into disjoint clusters. However, overlapping clusters cannot be distinguished. To improve the analysis of the
scRNA-seq data, the bi-clustering algorithm that can cluster genes and conditions at the same time has received more attention in practical applications [9]. Biclustering the matrix is to obtain biclusters that involves the similar genes. Padilha et al. argue that the biclustering can be classified as greedy algorithm, divideand-conquer algorithm, exhaustive enumeration algorithm and distribution parameter identification algorithm [10]. The best locally optimal solution is obtained for each iteration of the greedy algorithm. For example, Cheng Y and Church GM proposed the Cheng and Church’s algorithm(CC) to obtain the optimal clusters[11]. The algorithm is based on MSR removal or clustering of rows and columns. Ben-Dor A et al. proposed an Order-Preserving Submatrix (OPSM) algorithm that searches for clusters of the same rows under the same conditions [12]. The Iterative Signature Algorithm (ISA), proposed by Bergmann S et al., randomly selects a row as a seed to find bicluster according to iterating rows and columns [13]. The divide-and-conquer algorithm can be described that a question is divided into several small questions until the question cannot be divided, integrating with the solutions of those small questions to obtain the optimal solution at last. Bimax is proposed by Preli´c A et al., which is one of the classicals divideand-conquer algorithms [14]. Its model is the same as the method proposed by Tanay et al. [15], and the biclusters contain elements 0. The Factor Analysis for Bicluster Acquisition (FABIA) is one of the distribution parameters identification algorithms, which is proposed by Hochreiter S et al. The algorithm assumes a multiplicative model and uses a factor analysis approach together with an expectation-maximization algorithm to fit it to data [3, 16]. Another biclustering algorithm is called QUalitative BIClustering (QUBIC) which belongs to the same class as FABIA. QUIBIC is proposed by Li G et al. 
and it obtains biclusters by discretizing the input data and building a graph [17]. Each algorithm has its own strengths as well as shortcomings, and the shortcomings can be mitigated by combining the characteristics of other algorithms.

In this paper, we propose a new biclustering algorithm, the JCB algorithm, which joins the structure of Bimax with the MSR proposed in CC. The contributions of our method are as follows:
• According to the MSR value, similar elements carrying important information are clustered and elements carrying unimportant information are discarded. Since the MSR value measures the variation of the interaction between the genes and the conditions (cells) [18], we process the data matrix directly through the MSR value, without converting it to a binary matrix, and obtain clusters similar to those of Bimax.
• The JCB algorithm preserves good clusters from the original data matrix. JCB uses the similar row elements that have already been obtained to assign similar rows to the same cluster.

The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 describes the JCB method. Section 4 presents the experiments, including the description of the datasets, the parameter selection and the experimental results. Section 5 concludes the paper and suggests directions for future work.
Joint CC and Bimax: A Biclustering Method
2 Related Works
In this section, we describe the Bimax algorithm and the MSR method in detail. Bimax is based on the divide-and-conquer approach. The CC algorithm was the first biclustering algorithm applied to gene expression data, and the MSR plays a central role in it. To overcome the shortcomings of the Bimax algorithm, the MSR is used to improve the quality of the clusters. The MSR method from the CC algorithm is therefore combined with the Bimax algorithm to form a new biclustering algorithm, the JCB algorithm.

2.1 Bimax Model
The Bimax algorithm is illustrated in Fig. 1. Bimax converts the original dataset into a binary matrix. The columns of the binary matrix are divided into two subsets, $C_U$ and $C_V$. According to a seed that is randomly selected from the set of rows, the rows of the matrix $M_{i \times j}$ are rearranged: first the genes that respond only to conditions in $C_U$, then the genes that respond to conditions in both $C_U$ and $C_V$, and finally the genes that respond only to conditions in $C_V$. The set of rows is thus divided into three types. $G_U$ is the set of rows that respond only to the conditions in $C_U$, $G_W$ is the set of rows that respond to conditions in both $C_U$ and $C_V$, and $G_V$ is the set of rows that respond only to the conditions in $C_V$.
Fig. 1. The model of the Bimax algorithm. Here, the first row is selected as the template.
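The column and row partition described for the Bimax model can be sketched as a single division step. This is only a minimal illustration under stated assumptions: the function name `bimax_partition`, the handling of all-zero rows, and the small example matrix are hypothetical and are not taken from the paper or from the original Bimax implementation.

```python
def bimax_partition(matrix, template_row):
    """One Bimax-style division step on a binary matrix (illustrative sketch).

    Columns are split into C_U (template row has 1) and C_V (template row has 0).
    Rows are split into G_U (1s only in C_U), G_W (1s in both C_U and C_V)
    and G_V (1s only in C_V). Assumption: all-zero rows join no group.
    """
    n_cols = len(matrix[0])
    template = matrix[template_row]
    c_u = [j for j in range(n_cols) if template[j] == 1]
    c_v = [j for j in range(n_cols) if template[j] == 0]

    g_u, g_w, g_v = [], [], []
    for i, row in enumerate(matrix):
        responds_u = any(row[j] == 1 for j in c_u)
        responds_v = any(row[j] == 1 for j in c_v)
        if responds_u and responds_v:
            g_w.append(i)
        elif responds_u:
            g_u.append(i)
        elif responds_v:
            g_v.append(i)
    return c_u, c_v, g_u, g_w, g_v


# Hypothetical 4 x 4 binary matrix; row 0 is used as the template.
example = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
c_u, c_v, g_u, g_w, g_v = bimax_partition(example, template_row=0)
print(c_u, c_v, g_u, g_w, g_v)
```

In the full algorithm this step would be applied recursively to the resulting submatrices; the sketch stops after one division to keep the partition itself visible.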
More specifically, the sets U and V are processed iteratively until the two subsets contain only nonzero elements or the maximum number of iterations is reached. In particular, if the set $G_W$ exists, the subset in the next iteration includes the elements in $G_W$, so that duplicate elements are avoided. The time complexity of Bimax is $O(nm\min\{n, m\})$. In general, if the matrix has only two biclusters that contain no element responding under $C_U$ and $C_V$ simultaneously, the time complexity is $O(nm\beta\min\{n, m\})$, where $\beta$ represents the number of biclusters. This was proved by Prelić et al.

2.2 The Mean Square Residual
In this section, we introduce the steps for computing the MSR. The measure was proposed by Cheng and Church, who showed that the genes obtained with low MSR values are similar [11]. In other words, similar conditions of the same gene are clustered into column clusters based on MSR values. Formally, let $E$ be the scRNA-seq data matrix with elements $e_{ij}$, where $i$ indexes the rows and $j$ indexes the columns. For a subset of genes $I$, the average of column $j$ over the rows in $I$ is defined as

$$e_{Ij} = \frac{\sum_{i \in I} e_{ij}}{|I|}, \tag{1}$$

where $e_{ij}$ is the element in the $i$th row and the $j$th column. In the same way, for a subset of conditions $J$, the average of row $i$ over the columns in $J$ is defined as

$$e_{iJ} = \frac{\sum_{j \in J} e_{ij}}{|J|}. \tag{2}$$

Next, we define $e_{IJ}$, which represents the average value of the submatrix:

$$e_{IJ} = \frac{\sum_{i \in I, j \in J} e_{ij}}{|I||J|}. \tag{3}$$

Furthermore, we define the MSR value:

$$\mathrm{MSR} = \frac{\sum_{i \in I, j \in J} RS_{ij}^{2}}{|I||J|}, \tag{4}$$

where $RS_{ij}^{2}$ is the squared residual score of the element in the $i$th row and the $j$th column, and $RS_{ij}$ is defined as follows:

$$RS_{ij} = e_{ij} - e_{Ij} - e_{iJ} + e_{IJ}. \tag{5}$$
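Equations (1) to (5) can be checked numerically with a short sketch. The function name `mean_squared_residue` and the two small matrices are assumptions made for illustration; the code only follows the formulas above and is not the authors' implementation.

```python
def mean_squared_residue(matrix, rows, cols):
    """MSR of the submatrix selected by the row subset I and column subset J."""
    n_rows, n_cols = len(rows), len(cols)
    # e_iJ: mean of row i over the selected columns J (Eq. 2)
    row_mean = {i: sum(matrix[i][j] for j in cols) / n_cols for i in rows}
    # e_Ij: mean of column j over the selected rows I (Eq. 1)
    col_mean = {j: sum(matrix[i][j] for i in rows) / n_rows for j in cols}
    # e_IJ: mean of the whole submatrix (Eq. 3)
    all_mean = sum(row_mean.values()) / n_rows

    total = 0.0
    for i in rows:
        for j in cols:
            # RS_ij = e_ij - e_Ij - e_iJ + e_IJ (Eq. 5)
            residue = matrix[i][j] - col_mean[j] - row_mean[i] + all_mean
            total += residue ** 2
    return total / (n_rows * n_cols)  # Eq. 4


# Hypothetical data: a perfectly additive pattern, and the same pattern with one outlier.
additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
noisy = [[1.0, 2.0, 3.0], [2.0, 9.0, 4.0], [4.0, 5.0, 6.0]]

msr_additive = mean_squared_residue(additive, [0, 1, 2], [0, 1, 2])
msr_noisy = mean_squared_residue(noisy, [0, 1, 2], [0, 1, 2])
print(msr_additive, msr_noisy)
```

A submatrix whose entries follow a purely additive row-plus-column pattern has an MSR of zero up to rounding, while the outlier raises the score, which is why a low MSR is read as a sign of coherent rows and columns.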
3 JCB Algorithm
Figure 2 illustrates the JCB algorithm. We suppose that there are three biclusters in the original matrix. First, the algorithm receives the original matrix, which is the pre-processed scRNA-seq data. In Fig. 2(b), an element of the original matrix is chosen at random to calculate the MSR. All conditions in the row vector that contains the selected element are clustered according to their MSR values: if the MSR values of these elements are less than a threshold ε, the elements are placed in the same cluster. We suggest selecting the seed several times. Finally, the row vector with the smallest mean value is selected, which indicates the optimal cluster for that row. Note that each set of rows is initialized as empty. In Fig. 2(c), the row average value is compared with the row threshold ε to find similar rows. Row vectors are clustered together when the difference between their means is less than the threshold ε. If the row vector currently being examined can be assigned to a cluster, it is assigned to that cluster; otherwise, the next row vector is examined. The JCB algorithm does not stop until all row vectors have been examined. In Fig. 2(d), to obtain more accurate clusters, the clusters are output in the form of a maximum matrix once the row clustering is finished. In summary, the JCB algorithm consists of two main steps: filtering the conditions and categorizing the similar genes. Algorithm 1, presented in Table 1, shows how the conditions are filtered, and Algorithm 2, shown in Table 2, describes how similar genes are categorized.
Fig. 2. The steps of the JCB method.
In Algorithm 1, the thresholds ε and t represent the column threshold and the number of cycles, respectively. We set the number of cycles in order to determine the optimal column cluster, that is, to obtain the row vector with the lowest mean value. Algorithm 1 does not stop until the optimal column cluster has been calculated for all rows.

Table 1. The pseudocode for filtering the conditions.

The purpose of Algorithm 2 is to cluster the optimal column clusters into optimal row clusters. The main idea of obtaining row clusters is to evaluate the similarity between row clusters. When the similarity between two column clusters is less than the threshold β, these two column clusters belong to the same row cluster.

To obtain the time complexity of the algorithm, let m be the number of genes (rows) and n the number of conditions (columns). We calculate the time complexity of the JCB algorithm in two parts, Algorithm 1 and Algorithm 2. In Algorithm 1, each row requires selecting among its n columns, and the operation is executed for every row. Each row iteration takes O(n) time, and there are m rows, so the time complexity of Algorithm 1 is O(mn). Algorithm 2 partitions all of the rows. Each row is processed at least once to find similar genes, and each row is compared (m − 1) times to determine its category. Therefore, the time complexity of Algorithm 2 is O(m(m − 1)).
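The row categorization step can be sketched as a greedy grouping of rows by their mean values. This is a hedged illustration only, since the pseudocode of Table 2 is not reproduced here: the function name `group_rows_by_mean`, the use of the first member of a cluster as its reference, and the rule that an unmatched row starts a new cluster are all assumptions.

```python
def group_rows_by_mean(row_means, threshold):
    """Greedily group rows whose mean is within `threshold` of a cluster's first row.

    Assumptions (not from the paper): the first row of each cluster acts as its
    reference, and a row that matches no existing cluster starts a new one.
    """
    clusters = []  # each cluster is a list of row indices
    for i, mean_i in enumerate(row_means):
        for cluster in clusters:
            reference_mean = row_means[cluster[0]]
            if abs(mean_i - reference_mean) < threshold:
                cluster.append(i)
                break
        else:
            # no existing cluster is close enough
            clusters.append([i])
    return clusters


# Hypothetical row means of the filtered column clusters.
means = [0.10, 0.12, 0.50, 0.11, 0.52]
clusters = group_rows_by_mean(means, threshold=0.05)
print(clusters)
```

Each of the m rows is compared with at most m − 1 earlier rows, which matches the O(m(m − 1)) bound stated for Algorithm 2.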
Table 2. The pseudocode for categorizing the similar genes.
4 Experiments
4.1 Datasets
In this paper, two mouse datasets and one human dataset are used to evaluate the JCB algorithm and to compare it with six other algorithms. The datasets are Deng [16], Pollen [19] and Treutlin [20]. These datasets were pre-processed before use [21]. More precisely, we organize each dataset into a data matrix. To evaluate the algorithms effectively, the datasets all have category labels. The first dataset, Pollen, has 120 genes and 10365 conditions. The second dataset, Treutlin, has 80 genes and 959 conditions. The third dataset, Deng, has 135 genes and 12548 conditions. More details of the datasets are shown in Table 3.

Table 3. Details of the three datasets

| Name | Researcher | ID | Cell types | Species |
|---|---|---|---|---|
| Treutlin | Barbara Treutlein | GSE52583 | 5 | Mouse |
| Pollen | Alex A Pollen | SRP041736 | 11 | Human |
| Deng | Qiaolin Deng | GSE45719 | 7 | Mouse |
4.2 Evaluation Metric
In this paper, three metrics, Precision [22], Accuracy (ACC) [23, 24] and the F1 measure (F1) [25], are adopted to evaluate the clustering performance. The closer a metric is to 1, the better the performance. Precision is the proportion of true positive predictions out of all predictions made [26, 27]. It is calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{6}$$

where TP represents the number of labels in the bicluster and FP represents the number of labels in the dataset. Accuracy is calculated as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{7}$$

where TN represents the number of unlabeled genes in the dataset that are identified by the algorithm, and FN represents the number of unlabeled genes in the bicluster. Finally, the F1 measure is defined as follows:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{recall}}{\mathrm{Precision} + \mathrm{recall}}, \tag{8}$$

where recall represents the ratio of the number of correct categories to all correct numbers in the cluster. It is calculated as follows:

$$\mathrm{recall} = \frac{TP}{TP + FN}. \tag{9}$$
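Equations (6) to (9) translate directly into code. The sketch below assumes that the four counts TP, FP, TN and FN are already available for a bicluster; the function name `clustering_metrics`, the zero-denominator convention and the example counts are hypothetical.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def clustering_metrics(tp, fp, tn, fn):
    """Precision (Eq. 6), Accuracy (Eq. 7), F1 (Eq. 8) and recall (Eq. 9) from the four counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    accuracy = _ratio(tp + tn, tp + tn + fp + fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts for one bicluster.
scores = clustering_metrics(tp=40, fp=10, tn=45, fn=5)
print(scores)
```

Because F1 is the harmonic mean of Precision and recall, it always lies between the two, which is a quick sanity check when reading the result tables.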
4.3 Parameters Selection
Different datasets require different thresholds, so an appropriate parameter pair must be selected for each dataset. Depending on the role of the threshold and the precision of the element values in the data, we test parameter pairs in order, starting from (0.001, 0.001). We chose the square root of the precision as the initial threshold value; the precision of the element values of the data in this paper is $1 \times 10^{-6}$, which gives an initial value of 0.001. If the result for a parameter pair is too low, we ignore that pair. When a parameter pair produces a higher value, we also test the adjacent parameter pairs, because the value may be the maximum. The optimal parameter pairs are listed in Table 4. For comparison, the six algorithms CC, Bimax, FABIA, OPSM, QUBIC and ISA were run with optimal parameters set according to [24, 28]. In addition, the evaluation method has a parameter α, which is used to obtain the first convergent algorithm. This makes it easier to identify the best-performing algorithm among the seven algorithms.
Table 4. The parameters selected for the different datasets.

| Threshold | Treutlin | Pollen | Deng |
|---|---|---|---|
| δ | 0.2 | 0.06 | 0.01 |
| ε | 0.2 | 0.06 | 0.07 |
4.4 Biclustering Results
Because the row vectors are selected at random, the JCB algorithm can obtain a better result when it is run many times. We report the higher values of the average F1 measure, the average Precision and the average Accuracy. A single value represents the performance of a single bicluster, whereas the average value represents the performance of all biclusters; the average value is therefore more meaningful than the single value. Tables 5, 6 and 7 show the experimental results on the Treutlin, Pollen and Deng datasets, respectively. It can be seen from Tables 5 to 7 that OPSM is the first to converge. In these three tables, F1, Pre and Acc denote the F1 measure, Precision and Accuracy values, respectively. In Table 5, the JCB algorithm can be considered the best algorithm because its evaluation result is the highest. In Table 6, the JCB algorithm has higher F1 and Precision values, but its Accuracy value is low. Given the definition of Accuracy, a higher Precision combined with a lower Accuracy indicates a smaller bicluster. Furthermore, Table 6 shows that the gap between the JCB algorithm and OPSM is not very large when α is equal to 3. In Table 7, when α is equal to 4 and 5, the result of JCB is second only to that of FABIA. When α is equal to 6, the result of JCB exceeds that of FABIA but is lower than that of Bimax. However, the gap between JCB and Bimax is not large, and the biclusters obtained by JCB are larger than those obtained by Bimax. This shows that the JCB algorithm performs better than the other algorithms. That is to say, when the biclusters are of about the same size, the performance of JCB is relatively high, and when another algorithm achieves a slightly higher result than JCB, the biclusters it obtains are smaller than those obtained by JCB.

The experimental analysis shows that the Bimax algorithm simply arranges the elements by randomly selecting seeds, without further analyzing whether the elements have similar rows. The CC algorithm, on the other hand, is a greedy algorithm that selects clusters based solely on MSR values and thresholds, a structure that has the disadvantage of being time-consuming on large datasets. The JCB algorithm uses the MSR value to classify the clusters and combines it with the structure of the Bimax algorithm. Therefore, the JCB algorithm has the speed of the Bimax algorithm and the accuracy of the CC algorithm.
Table 5. The result of seven algorithms in the Treutlin dataset.

| Method | α=4 F1 | α=4 Pre | α=4 Acc | α=5 F1 | α=5 Pre | α=5 Acc | α=6 F1 | α=6 Pre | α=6 Acc |
|---|---|---|---|---|---|---|---|---|---|
| JCB | 0.8602 | 0.7574 | 0.7548 | 0.9203 | 0.8522 | 0.8525 | 0.9586 | 0.919 | 0.9192 |
| CC | 0.6970 | 0.5453 | 0.1612 | 0.8080 | 0.6974 | 0.2054 | 0.8794 | 0.8073 | 0.2387 |
| ISA | 0.5093 | 0.3430 | 0.0860 | 0.5880 | 0.4185 | 0.1046 | 0.6617 | 0.4974 | 0.1244 |
| OPSM | 0.8558 | 0.7554 | 0.1511 | 0.8609 | 0.7634 | 0.1527 | 0.8648 | 0.7695 | 0.1539 |
| QUBIC | 0.8440 | 0.7387 | 0.6332 | 0.8800 | 0.8933 | 0.6710 | 0.9064 | 0.8346 | 0.7128 |
| Bimax | 0.7995 | 0.6663 | 0.2221 | 0.7995 | 0.6663 | 0.2221 | 0.8256 | 0.7033 | 0.2344 |
| FABIA | 0.7780 | 0.6441 | 0.5121 | 0.8432 | 0.7344 | 0.5801 | 0.8919 | 0.8086 | 0.6348 |
Table 6. The result of seven algorithms in the Pollen dataset.

| Method | α=1 F1 | α=1 Pre | α=1 Acc | α=2 F1 | α=2 Pre | α=2 Acc | α=3 F1 | α=3 Pre | α=3 Acc |
|---|---|---|---|---|---|---|---|---|---|
| JCB | 0.7388 | 0.5858 | 0.1420 | 0.955 | 0.9139 | 0.1448 | 0.9983 | 0.9966 | 0.1631 |
| CC | 0.5429 | 0.3731 | 0.1751 | 0.8983 | 0.6527 | 0.2972 | 0.9357 | 0.8815 | 0.4045 |
| ISA | 0.4240 | 0.2707 | 0.1012 | 0.6780 | 0.5140 | 0.1980 | 0.8914 | 0.8049 | 0.2915 |
| OPSM | 0.6214 | 0.4521 | 0.1058 | 0.8481 | 0.7592 | 0.1902 | 1 | 1 | 0.2350 |
| QUBIC | 0.6513 | 0.4870 | 0.2527 | 0.7577 | 0.6225 | 0.3275 | 0.9279 | 0.8671 | 0.4423 |
| Bimax | 0.7361 | 0.5823 | 0.5630 | 0.9530 | 0.9101 | 0.8798 | 0.9973 | 0.9946 | 0.9615 |
| FABIA | 0.6382 | 0.4725 | 0.044 | 0.8805 | 0.7910 | 0.0712 | 0.9598 | 0.9242 | 0.0835 |
Table 7. The result of seven algorithms in the Deng dataset.

| Method | α=4 F1 | α=4 Pre | α=4 Acc | α=5 F1 | α=5 Pre | α=5 Acc | α=6 F1 | α=6 Pre | α=6 Acc |
|---|---|---|---|---|---|---|---|---|---|
| JCB | 0.8865 | 0.7962 | 0.7919 | 0.9422 | 0.8903 | 0.8842 | 0.9736 | 0.9485 | 0.9419 |
| CC | 0.8679 | 0.7719 | 0.4909 | 0.9326 | 0.8774 | 0.5578 | 0.9734 | 0.9505 | 0.6003 |
| ISA | 0.7811 | 0.6408 | 0.5746 | 0.8461 | 0.7333 | 0.6575 | 0.8957 | 0.8957 | 0.7273 |
| OPSM | 0.7240 | 0.5882 | 0.5882 | 0.7458 | 0.6176 | 0.6176 | 0.7458 | 0.6176 | 0.6176 |
| QUBIC | 0.7312 | 0.5903 | 0.2662 | 0.7764 | 0.6381 | 0.3176 | 0.8196 | 0.6977 | 0.3511 |
| Bimax | 0.8185 | 0.6928 | 0.2612 | 0.8865 | 0.7961 | 0.3002 | 0.9811 | 0.9629 | 0.3631 |
| FABIA | 0.9058 | 0.8336 | 0.8078 | 0.9469 | 0.9017 | 0.8727 | 0.9715 | 0.942 | 0.9182 |
5 Conclusion
In this paper, we propose a new biclustering method, the JCB algorithm, which joins the structure of Bimax with the MSR proposed in CC. With the MSR, the JCB algorithm preserves good biclusters in the original dataset. Moreover, by analyzing the time complexity of the JCB algorithm and comparing it with the Bimax algorithm, we find that the JCB algorithm is superior to the Bimax algorithm in terms of running speed and efficiency. However, the accuracy of the JCB algorithm decreases as the noise increases. Although the accuracy is relatively high, information is easily missed, which is caused by the random selection of elements in the JCB model. As information becomes more complex, datasets also grow larger. The JCB algorithm therefore needs further study so that it can process scRNA-seq data with more noise. We will also consider improving the speed of the algorithm while addressing the loss of information.

Acknowledgments. This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61702299, 61872220, 62172253.
References
1. Islam, S., et al.: Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Meth. 11(2), 163 (2014)
2. Ciortan, M., Defrance, M.: Contrastive self-supervised clustering of scRNA-seq data. BMC Bioinform. 22(1), 1–27 (2021)
3. Wang, C.Y., Gao, Y.-L., Liu, J.-X., Kong, X.-Z., Zheng, C.-H.: Single-cell RNA sequencing data clustering by low-rank subspace ensemble framework. IEEE/ACM Trans. Comput. Biol. Bioinform. (2020)
4. Kim, J., Stanescu, D.E., Won, K.J.: CellBIC: bimodality-based top-down clustering of single-cell RNA sequencing data reveals hierarchical structure of the cell type. Nucleic Acids Res. 46(21), e124–e124 (2018)
5. Hanafi, S., Palubeckis, G., Glover, F.: Bi-objective optimization of biclustering with binary data. Inf. Sci. 538, 444–466 (2020)
6. Wang, H., et al.: Clustering by pattern similarity in large data sets. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (2002)
7. Qin, J., et al.: Distributed k-means algorithm and fuzzy c-means algorithm for sensor networks based on multiagent consensus theory. IEEE Trans. Cybern. 47(3), 772–783 (2016)
8. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
9. Pontes, B., Giraldez, R., Aguilar-Ruiz, J.S.: Biclustering on expression data: a review. J. Biomed. Inform. 57, 163–180 (2015)
10. Padilha, V.A., Campello, R.J.G.B.: A systematic comparative evaluation of biclustering techniques. BMC Bioinform. 18(1), 55 (2017)
11. Cheng, Y., Church, G.M.: Biclustering of expression data. In: ISMB 2000 (2000)
12. Ben-Dor, A., et al.: Discovering local structure in gene expression data: the order-preserving submatrix problem. In: Proceedings of the 6th Annual International Conference on Computational Biology (2002)
13. Bergmann, S., Ihmels, J., Barkai, N.: Iterative signature algorithm for the analysis of large-scale gene expression data. Phys. Rev. E 67(3), 031902 (2003)
14. Prelić, A., et al.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9), 1122–1129 (2006)
15. Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(Suppl 1), S136–S144 (2002)
16. Deng, Q., et al.: Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343(6167), 193–196 (2014)
17. Li, G., et al.: QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Res. 37(15), e101–e101 (2009)
18. Saber, H.B., Elloumi, M.: A comparative study of clustering and biclustering of microarray data. Int. J. Comput. Sci. Inf. Technol. 6(6), 93 (2014)
19. Pollen, A.A., et al.: Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32(10), 1053 (2014)
20. Treutlein, B., et al.: Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509(7500), 371–375 (2014)
21. Liu, J.X., Wang, C.Y., Gao, Y.L., et al.: Adaptive total-variation regularized low-rank representation for analyzing single-cell RNA-seq data. Interdiscip. Sci. Comput. Life Sci. 13, 476–489 (2021)
22. Buckland, M., Gey, F.: The relationship between recall and precision. J. Am. Soc. Inf. Sci. 45(1), 12–19 (1994)
23. Cai, D., et al.: Non-negative matrix factorization on manifold. In: 2008 8th IEEE International Conference on Data Mining. IEEE (2008)
24. Varshavsky, R., Linial, M., Horn, D.: Compact: a comparative package for clustering assessment. In: Chen, G., Pan, Yi., Guo, M., Jian, Lu. (eds.) Parallel and Distributed Processing and Applications - ISPA 2005 Workshops, pp. 159–167. Springer, Heidelberg (2005). https://doi.org/10.1007/11576259_18
25. Lipton, Z.C., Elkan, C., Naryanaswamy, B.: Optimal thresholding of classifiers to maximize F1 measure. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15–19, 2014. Proceedings, Part II, pp. 225–239. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44851-9_15
26. de Castro, P.A., et al.: Applying biclustering to perform collaborative filtering. In: 7th International Conference on Intelligent Systems Design and Applications, ISDA 2007. IEEE (2007)
27. Hanczar, B., Nadif, M.: Precision-recall space to correct external indices for biclustering. In: International Conference on Machine Learning. PMLR (2013)
28. Verma, N.K., et al.: BIDEAL: a toolbox for bicluster analysis - generation, visualization and validation. SN Comput. Sci. 2(1), 1–15 (2021)
Improving Protein-protein Interaction Prediction by Incorporating 3D Genome Information

Zehua Guo1,2,3, Kai Su1,2, Liangjie Liu1,2, Xianbin Su4, Mofan Feng1,2, Song Cao5, Mingxuan Zhang6, Runqiu Chi1,2, Luming Meng7, Guang He1,2(B), and Yi Shi1,2(B)

1 Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, 1954 Huashan Road, Shanghai 200030, China {guozehua,sukai980112,liuliangjie,fmf.von,heguang,yishi}@sjtu.edu.cn
2 Shanghai Key Laboratory of Psychotic Disorders, and Brain Science and Technology Research Center, Shanghai Jiao Tong University, 1954 Huashan Road, Shanghai 200030, China
3 Department of Instrument Science and Engineering, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
4 Key Laboratory of Systems Biomedicine, Ministry of Education. Shanghai Center for Systems Biomedicine, Shanghai Jiaotong University, Shanghai 200240, China [email protected]
5 School of Medicine, University of California San Diego, La Jolla 92093, USA [email protected]
6 Weill Graduate School of Medical Science, Cornell University, New York 14853, USA [email protected]
7 College of Biophotonics, South China Normal University, Guangzhou 510631, China [email protected]
Abstract. Numerous computational methods have been proposed to predict protein-protein interactions (PPIs); none of them, however, considers the original DNA loci of the interacting proteins from the perspective of the 3D genome. Here, we trace the DNA origins of interacting proteins in the 3D genome and find that incorporating 3D genome information into existing PPI prediction methods, i.e., Auto Covariance (AC) and Conjoint Triad (CT) coding based support vector machine (SVM) and ensemble extreme learning machine (EELM), further improves the predictions in terms of accuracy and area under the ROC curve (AUC). Combining this with our previous discoveries, we conjecture that this is because co-localized DNA elements may increase the probability that their downstream RNA elements and protein elements are also co-localized.
Keywords: PPI · Interactome · Hi-C · 3D genome · Machine learning
Z. Guo, K. Su, L. Liu, X. Su, M. Feng, S. Cao, M. Zhang—authors contribute equally as co-first authors. © Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 511–520, 2021. https://doi.org/10.1007/978-3-030-91415-8_43
1 Introduction
In most cellular processes, from DNA transcription and replication to signaling cascades and metabolic cycles, and many additional processes, proteins carry out their cellular functions by coordinating with other proteins [1]. It is therefore important to know the specific nature of these protein-protein interactions (PPIs). At any time, a human cell contains over 100,000 binary interactions between proteins [2]. Only a small fraction of these interactions, however, has been experimentally identified [3], and identification lags behind the generation of sequencing information, which grows exponentially. Biologically, this is due to the dynamic nature of these interactions: many of them are transient, and others occur only in certain cellular contexts or at particular times in development [3]. Technologically, it is due to the low throughput and inherent imperfection of empirical PPI identification experiments. For example, the yeast two-hybrid (Y2H) system [4] and co-immunoprecipitation (coIP) coupled with mass spectrometry [5] are two widely adopted methods, and both are prone to false discoveries because everything from the choice of reagents to the cell type used and the experimental conditions can influence the final outcome [6]. To bridge the gap between the ensemble of in situ PPIs and the identified ones, accurate and efficient computational methods are required, as their predictions can either be used directly or guide the labor-intensive empirical methods. In the past two decades, numerous computational approaches for discovering protein interactions have been developed. A PPI prediction method is usually determined by two factors: the first is the encoding scheme, i.e., what information is adopted and how it is encoded for the target protein or protein pair; the other is the mathematical learning model being employed.
By combining these two factors, computational PPI prediction approaches can be further categorized into four classes: network topology based, genomic context and structural information based, text mining based, and machine learning based approaches that utilize heterogeneous genomic or proteomic features. Many studies have demonstrated that these PPI prediction tools are important for conducting new research in protein-protein interaction analysis [1, 7–11]. It has been reported that genes that are proximate to each other in terms of linear genomic distance could lead to their protein counterparts interacting with each other [12]. This suggests that genes that are proximate in 3D genomic space may obey the same rule, and chromatin conformation capture technologies developed in recent years, such as Hi-C [13, 14] and ChIA-PET [15], provide an excellent opportunity to investigate this conjecture systematically. To the best of our knowledge, no existing PPI prediction method considers the genomic 3D distance of the corresponding gene pairs. Therefore, if gene-gene 3D distances are indeed correlated with protein-protein interaction, they should contribute to PPI prediction. In this work, we trace the DNA origins of interacting proteins in the context of the 3D genome and find that incorporating 3D genome information further improves existing PPI prediction methods in terms of accuracy. Furthermore, we build on our previous discoveries, namely that somatic co-mutation DNA loci tend to form Somatic Co-mutation Hotspots (SCHs) in 3D genome space [16], which was recently supported by Akdemir et al. [17], and that the 3D genome contributes to the distribution of immunogenic neoantigens [18]. On this basis, we conjecture that the co-localization of DNA elements can increase the probability of the co-localization of their downstream elements, including RNAs, proteins, and even metabolic molecules. We hypothesize that the evolution of the 3D genome separates inherently correlated DNA elements linearly while still keeping them and their downstream products proximate in 3D space, either within the nucleus or outside of it in the cytoplasm.
2 Methods
2.1 PPI and 3D Genome Data
We collected and curated five representative PPI datasets, namely BioGRID [19], HI2014 [20], HPRDall [21], iRefWeb [22], and Clarivate MetaCore. The positive samples are interacting protein-protein pairs, and the negative samples are drawn from all the non-PPIs with different subcellular locations. For the 3D genome data, we collected eight Hi-C datasets, including hESC and IMR90 [23, 24]. The datasets were normalized using the KRNorm method and curated so that the intra-chromosomal heatmaps have a bin resolution of 40 kb and the inter-chromosomal heatmaps have a bin resolution of 500 kb.

2.2 Chromatin 3D Modeling
The Hi-C contact frequency data from the hESC and IMR90 cell lines, generated by Bin Ren's lab, served as our source of chromatin 3D conformation data [23]. For each cell line dataset, we applied a whole-genome 3D modeling algorithm for the human genome, based on molecular dynamics (MD), at a resolution of 500 kb (bin size). Each bin is coarse-grained by the algorithm as one bead, and the intact genome is modeled as 23 polymer chains represented by bead-on-the-string structures [25, 26]. Two main factors affect the modeling results: the chromatin connectivity, which constrains sequentially neighboring beads to close spatial proximity, and the chromatin activity, which ensures that active regions are more likely to be located close to the center of the cell nucleus [25]. The chromatin activity was estimated as the compartment degree, which can be calculated directly from the Hi-C matrix [27]. With all the beads assigned distances to the nuclear center, the conformation of the chromatin was optimized from random initial structures.

2.3 Encoding Scheme and Prediction Methods
Feature representation is the key to PPI prediction, as it determines the upper bound of the trained model [28, 29]. Here, Auto Covariance (AC) and Conjoint Triad (CT) are employed to transform the protein sequences into feature vectors.
And 3D genomic locations of the proteins are based on the above chromatin 3D modeling results. Specifically, for a protein sequence, auto covariance(AC) can reveal the interactions between amino acids, consider the neighboring effect and describe the pattern of different variables interacting [30, 31]. Previous study has proven that AC can be able to avoid generating over-sized variants [30]. Generally, a protein sequence characteristic is determined by the physicochemical properties of its amino acid. These physicochemical properties
514
Z. Guo et al.
are the basis of PPI, including hydrophobicity, hydrophilicity, volumes of side chains of amino acids, polarity, polarizability, solvent-accessible surface area (SASA) and net charge index (NCI) of side chains of amino acids. Table S1 showed the corresponding values for each amino acid. Before protein sequence translated, these property values were normalized in z-score method to zero mean and unit standard deviation according to Eq. (1) [30, 32]:
Pij =
Pij − Pj (i = 1, 2, . . . , 20; j = 1, 2, . . . , 7.) Sj
(1)
where $P_{ij}$ is the j-th property value of the i-th amino acid, $P_j$ is the mean of the j-th property over the 20 amino acids in Table S1 and $S_j$ is the corresponding standard deviation. Each protein sequence was then translated into seven vectors of normalized values $P_{ij}$ in the property space. Given a protein P of length N, the AC variables are calculated according to Eq. (2) [31]:

$$AC(lag, j) = \frac{1}{N - lag} \sum_{i=1}^{N-lag} \left(P_{ij} - \frac{1}{N}\sum_{i=1}^{N} P_{ij}\right) \times \left(P_{(i+lag)j} - \frac{1}{N}\sum_{i=1}^{N} P_{ij}\right) \tag{2}$$

In Eq. (2), $P_{ij}$ denotes the normalized value of property j for the residue at sequence position i. lag is the distance between two residues and captures the neighboring effect of amino acids (lag = 1, 2, …, lg); lg is the maximum lag, which is set to 30 following the relevant research. The number of AC variables, D, is therefore D = lg × q, where q is the number of amino acid properties, which is 7 in this study. In this way, a protein pair, obtained by concatenating the features of the two proteins, is represented by 420 numbers (2 × 30 × 7).

Conjoint triad (CT) is a sliding-window encoding of the sequence information that treats every three consecutive amino acids as a unit. The 20 amino acids are divided into 7 classes [33], and the protein sequence is transformed into a sequence of class numbers. A window of three numbers slides from the N- to the C-terminus one step at a time, and all coding units are extracted. Each unit is regarded as a feature, so a protein sequence has 343 (7 × 7 × 7) features, and the frequency of each unit is the value of the corresponding feature. Thus, each protein pair is represented by a 686-dimensional vector [34]. The feature space is as follows:

$$\left\{\begin{array}{llll} f_1 = 111 & f_8 = 121 & \ldots & f_{337} = 177 \\ f_2 = 211 & f_9 = 221 & \ldots & f_{338} = 277 \\ \ldots & \ldots & \ldots & \ldots \\ f_7 = 711 & f_{14} = 721 & \ldots & f_{343} = 777 \end{array}\right\} \tag{3}$$
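The two encodings described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the property table `TOY_PROPS` holds made-up numbers for four residues and two properties (Table S1 is not reproduced here), the population standard deviation is assumed for Eq. (1), and the seven-class grouping in `CT_CLASSES` is the grouping commonly used for conjoint triads, which may differ from the one in [33]. All function names are hypothetical.

```python
import statistics

# Made-up property values (2 properties, 4 residues) for illustration only.
# The study uses 7 properties for all 20 amino acids (Table S1).
TOY_PROPS = {"A": [1.0, 2.0], "C": [2.0, 0.0], "D": [3.0, 4.0], "E": [6.0, 2.0]}

# Assumed seven-class grouping of the 20 amino acids (classes 1..7).
CT_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: k for k, group in enumerate(CT_CLASSES, start=1) for aa in group}


def zscore_table(props):
    """Eq. (1): normalize each property column to zero mean, unit std."""
    n_props = len(next(iter(props.values())))
    out = {r: [] for r in props}
    for j in range(n_props):
        col = [props[r][j] for r in props]
        mean, sd = statistics.fmean(col), statistics.pstdev(col)
        for r in props:
            out[r].append((props[r][j] - mean) / sd)
    return out


def auto_covariance(seq, table, max_lag):
    """Eq. (2): one value per (property, lag); requires len(seq) > max_lag."""
    n = len(seq)
    n_props = len(next(iter(table.values())))
    feats = []
    for j in range(n_props):
        vals = [table[a][j] for a in seq]   # property j along the sequence
        mean = sum(vals) / n
        for lag in range(1, max_lag + 1):
            s = sum((vals[i] - mean) * (vals[i + lag] - mean)
                    for i in range(n - lag))
            feats.append(s / (n - lag))
    return feats


def conjoint_triad(seq):
    """Count every window of three class numbers; index follows Eq. (3)."""
    counts = [0] * 343
    codes = [CLASS_OF[a] for a in seq]
    for a, b, c in zip(codes, codes[1:], codes[2:]):
        counts[(a - 1) + 7 * (b - 1) + 49 * (c - 1)] += 1  # f_1 = 111 -> index 0
    return counts
```

With the real table (7 properties, lg = 30), `auto_covariance` would return 210 values per protein, and concatenating the vectors of the two proteins gives the 420 AC and 686 CT features quoted in the text.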
Protein-protein interaction prediction methods (SVM and EELM) were implemented in Python with all parameters set to their defaults. Support vector machine (SVM) is a supervised learning method that is often used for classification and regression problems. Previous research has demonstrated its efficacy in many real applications, including target detection, disease diagnosis and bioinformatics. SVM performs well when the number of samples is limited, which makes it suitable for PPI prediction [35]. SVM constructs a hyperplane between two groups and maximizes the margin between them. The solution is obtained by minimizing the following cost function:

$$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \tag{4}$$

subject to $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, 2, \ldots, n$, where $w, x_i \in \mathbb{R}^d$ (d being the number of features), $b \in \mathbb{R}$, $\|w\|^2 = w^T w$, C is the regularization penalty term, $\xi_i$ measures the degree of misclassification of data point i, and $y_i$ is the corresponding sample label (stored as 0 or 1; the constraint above assumes the usual ±1 encoding). In this study, the kernel function is the radial basis function, and the SVC classifier in sklearn (a Python package) is used as the binary classifier.

Extreme learning machine (ELM), proposed by Huang et al. [36], is a single-hidden-layer feedforward neural network for classification and regression. It offers good generalization performance, easy implementation and faster training than backpropagation networks, because the input weights and hidden-layer biases are randomly assigned rather than trained. The ELM classifier with h hidden-layer nodes is as follows [37]:

$$\sum_{i=1}^{h} \beta_i f(w_i \cdot x_j + b_i) = y_j \tag{5}$$
where $w_i$ and $b_i$ are the hidden-layer weights and biases, $\beta_i$ is the output weight, $x_j$ and $y_j$ represent the input and the output, and f(x) is the activation function. The loss function is:

$$\sum_{j=1}^{n}\left(\sum_{i=1}^{h}\beta_i f(w_i \cdot x_j + b_i) - l_j\right)^2 \tag{6}$$

where $l_j$ denotes the sample label, and $\beta_i$ is determined uniquely once w and b have been assigned randomly. Building on ELM, the ensemble extreme learning machine (EELM) applies ensemble learning by training multiple ELMs on the same dataset [34]. These ELMs share the same structure but have different randomly assigned parameters, and the final result is determined by a vote among the ELMs. You et al. [34] reported that EELM achieves better performance, reliability and stability. The overall EELM structure is shown in Fig. 1.
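To make Eqs. (5) and (6) and the voting step concrete, the following is a small pure-Python sketch of an ELM and a majority-vote ensemble. It is an illustration only, not the implementation used in the study: the sigmoid activation, the uniform initialization in [-1, 1], the small ridge term added for numerical stability, the 0.5 decision threshold and all names (`ELM`, `eelm_predict`, `_solve`) are assumptions.

```python
import math
import random


def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


class ELM:
    """Single hidden layer; w and b are random, only beta is fitted."""

    def __init__(self, n_hidden, seed, ridge=1e-3):
        self.h, self.rng, self.ridge = n_hidden, random.Random(seed), ridge

    def _hidden(self, x):
        # f(w_i . x + b_i) with a sigmoid activation (assumed)
        return [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
                for w, b in zip(self.w, self.b)]

    def fit(self, xs, labels):
        d = len(xs[0])
        self.w = [[self.rng.uniform(-1, 1) for _ in range(d)] for _ in range(self.h)]
        self.b = [self.rng.uniform(-1, 1) for _ in range(self.h)]
        hid = [self._hidden(x) for x in xs]
        # Least squares for Eq. (6) via (H^T H + ridge I) beta = H^T l
        hth = [[sum(row[p] * row[q] for row in hid) + (self.ridge if p == q else 0.0)
                for q in range(self.h)] for p in range(self.h)]
        htl = [sum(row[p] * l for row, l in zip(hid, labels)) for p in range(self.h)]
        self.beta = _solve(hth, htl)
        return self

    def predict(self, x):
        score = sum(b * g for b, g in zip(self.beta, self._hidden(x)))
        return 1 if score >= 0.5 else 0


def eelm_predict(models, x):
    """Majority vote over ELMs that differ only in their random parameters."""
    votes = sum(m.predict(x) for m in models)
    return 1 if 2 * votes > len(models) else 0
```

An ensemble could then be built as `[ELM(n_hidden, seed=s).fit(X, y) for s in range(k)]` with an odd k to avoid ties; the values of `n_hidden` and k are not given in this section.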
3 Results

We computed the Auto Covariance (AC) and Conjoint Triad (CT) encodings for each protein pair, including positive PPI pairs and negative (non-PPI) pairs, as basic features, and applied EELM and SVM as prediction models. We then added the 3D genome features, i.e., the coordinates and radius positions of the genome 3D models built from the hESC and IMR90 Hi-C data, and re-applied EELM and SVM to compare the baseline PPI predictions with the 3D genome assisted ones under a 5-fold cross-validation scheme. As Fig. 2 and Table 1 show, in all three PPI datasets, i.e., BioGRID, HI2014 and MetaCore, the predictions generated after adopting genome 3D information outperform the baseline predictions, indicating that the proximity of the DNA loci contributes to an increased probability that a protein pair interacts.
Fig. 1. Structure diagram of ensemble extreme learning machine
4 Discussion

In this work, we traced back the DNA origins of interacting proteins in the context of the 3D genome and discovered that (1) if a gene pair is more proximate in the 3D genome, their corresponding proteins are more likely to interact; (2) signal peptide involvement in PPI affects the corresponding gene-gene proximity in 3D genome space; and (3) by incorporating 3D genome information, existing PPI prediction methods can be further improved in terms of accuracy. Combining these findings with our previous discoveries, we conjecture that the colocalization of DNA elements may lead to an increased probability of collision among their downstream elements, including RNAs, proteins and metabolic molecules. We believe it is 3D genome evolution that separates inherently (functionally) correlated DNA elements linearly to avoid local bash failure, while still keeping them and their downstream products proximate in 3D space, either within the nucleus or outside of it in the cytoplasm. More detailed investigation is needed either to further prove the 3D genome driven co-localization theory or to use this theory in assisting 3D genome related research.
Fig. 2. PPI prediction comparisons between the baseline predictions (EELM and SVM with AC and CT feature encoding) and the 3D genome assisted predictions (EELM + 3D Genome and SVM + 3D Genome). a and b: ROC and AUPR curves on the BioGRID PPI dataset. c and d: ROC and AUPR curves on the HI2014 PPI dataset. e and f: ROC and AUPR curves on the MetaCore PPI dataset.

Table 1. PPI prediction results for the baseline predictions (EELM and SVM with AC and CT feature encoding) and the 3D genome assisted predictions (EELM + 3D Genome and SVM + 3D Genome) in the three datasets

Dataset    Method           Precision  Recall  F1-measure  AUPR    AUC
BioGRID    SVM              0.8510     0.6863  0.8221      0.8781  0.7874
BioGRID    3D Genome-SVM    0.8508     0.7004  0.8233      0.8642  0.8151
BioGRID    EELM             0.8091     0.6218  0.7894      0.8437  0.7511
BioGRID    3D Genome-EELM   0.7768     0.7299  0.7892      0.8486  0.7561
HI2014     SVM              0.8927     0.7993  0.8769      0.9240  0.8266
HI2014     3D Genome-SVM    0.8890     0.8227  0.8749      0.9159  0.8574
HI2014     EELM             0.8573     0.7478  0.8447      0.9105  0.8113
HI2014     3D Genome-EELM   0.8684     0.7316  0.8463      0.9132  0.8171
MetaCore   SVM              0.8615     0.8285  0.8630      0.8890  0.8046
MetaCore   3D Genome-SVM    0.8814     0.7826  0.8645      0.8931  0.8308
MetaCore   EELM             0.8338     0.7367  0.8353      0.8881  0.7788
MetaCore   3D Genome-EELM   0.7867     0.8217  0.8289      0.9132  0.8171
References

1. Zahiri, J., Bozorgmehr, J.H., Masoudi-Nejad, A.: Computational prediction of protein-protein interaction networks: algorithms and resources. Curr. Genomics 14(6), 397–414 (2013)
2. Venkatesan, K., et al.: An empirical framework for binary interactome mapping. Nat. Methods 6(1), 83–90 (2009)
3. Bonetta, L.: Protein-protein interactions: interactome under construction. Nature 468(7325), 851–854 (2010)
4. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y.: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. U. S. A. 98(8), 4569–4574 (2001)
5. Gavin, A.C., et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415(6868), 141–147 (2002)
6. van den Berg, D.L., et al.: An Oct4-centered protein interaction network in embryonic stem cells. Cell Stem Cell 6(4), 369–381 (2010)
7. Shoemaker, B.A., Panchenko, A.R.: Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLOS Comput. Biol. 3(4), e43 (2007)
8. Tuncbag, N., Kar, G., Keskin, O., Gursoy, A., Nussinov, R.: A survey of available tools and web servers for analysis of protein-protein interactions and interfaces. Brief Bioinformatics 10(3), 217–232 (2009)
9. Li, X., Wu, M., Kwoh, C.K., Ng, S.K.: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics 11(Suppl 1), S3 (2010)
10. Skrabanek, L., Saini, H.K., Bader, G.D., Enright, A.J.: Computational prediction of protein-protein interactions. Mol. Biotechnol. 38(1), 1–17 (2008)
11. Raman, K.: Construction and analysis of protein-protein interaction networks. Autom. Exp. 2(1), 2 (2010)
12. Santoni, D., Castiglione, F., Paci, P.: Identifying correlations between chromosomal proximity of genes and distance of their products in protein-protein interaction networks of yeast. PLOS ONE 8(3) (2013)
13. Lieberman-Aiden, E., et al.: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326(5950), 289–293 (2009)
14. Rao, S.S.P., et al.: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. 159, 1665 (2014). Cell 162(3), 687–688 (2015)
15. Fullwood, M.J., Ruan, Y.: ChIP-based methods for the identification of long-range chromatin interactions. J. Cell Biochem. 107(1), 30–39 (2009)
16. Shi, Y., Su, X.B., He, K.Y., Wu, B.H., Zhang, B.Y., Han, Z.G.: Chromatin accessibility contributes to simultaneous mutations of cancer genes. Sci. Rep. 6, 35270 (2016)
17. Akdemir, K.C., et al.: Somatic mutation distributions in cancer genomes vary with three-dimensional chromatin structure. Nat. Genet. 52(11), 1178–1188 (2020)
18. Shi, Y., et al.: DeepAntigen: a novel method for neoantigen prioritization via 3D genome and deep sparse learning. Bioinformatics 36(19), 4894–4901 (2020)
19. Oughtred, R., et al.: The BioGRID interaction database: 2019 update. Nucleic Acids Res. 47(D1), D529–D541 (2019)
20. Ideker, T., Valencia, A.: Bioinformatics in the human interactome project. Bioinformatics 22(24), 2973–2974 (2006)
21. Keshava Prasad, T.S., et al.: Human protein reference database—2009 update. Nucleic Acids Res. 37(Database issue), D767–D772 (2009)
22. Turner, B., et al.: iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence. Database (Oxford) 2010, baq023 (2010)
23. Dixon, J.R., et al.: Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485(7398), 376–380 (2012)
24. Rao, S.S., et al.: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159(7), 1665–1680 (2014)
25. Shi, Y., et al.: DeepAntigen: a novel method for neoantigen prioritization via 3D genome and deep sparse learning. Bioinformatics 36(19), 4894–4901 (2020)
26. Shi, Y., et al.: A novel neoantigen discovery approach based on chromatin high order conformation. BMC Med. Genomics 13 (2020)
27. Xie, W.J., Meng, L., Liu, S., Zhang, L., Cai, X., Gao, Y.Q.: Structural modeling of chromatin integrates genome features and reveals chromosome folding principle. Sci. Rep. 7(1), 2818 (2017)
28. Park, Y., Marcotte, E.M.: Flaws in evaluation schemes for pair-input computational predictions. Nat. Methods 9(12), 1134–1136 (2012)
29. Pei, F., Shi, Q., Zhang, H., Bahar, I.: Predicting protein-protein interactions using symmetric logistic matrix factorization. J. Chem. Inf. Model. 61(4), 1670–1682 (2021)
30. Guo, Y., Yu, L., Wen, Z., Li, M.: Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 36(9), 3025–3030 (2008)
31. Chen, H., et al.: Systematic evaluation of machine learning methods for identifying human–pathogen protein–protein interactions. Brief Bioinformatics 22(3), bbaa068 (2021)
32. Liu, B.: BioSeq-analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinformatics 20(4), 1280–1294 (2019)
33. Sun, T., Zhou, B., Lai, L., Pei, J.: Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics 18(1), 1–8 (2017)
34. You, Z.-H., Lei, Y.-K., Zhu, L., Xia, J., Wang, B.: Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics 14(8), 1–11 (2013). https://doi.org/10.1186/1471-2105-14-S8-S10
35. Hamp, T., Rost, B.: Evolutionary profiles improve protein–protein interaction prediction from sequence. Bioinformatics 31(12), 1945–1950 (2015)
36. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006)
37. Wang, L., You, Z.-H., Huang, Y.-A., Huang, D.-S., Chan, K.C.: An efficient approach based on multi-sources information to predict circRNA–disease associations using deep convolutional neural network. Bioinformatics 36(13), 4038–4046 (2020)
Boosting Metagenomic Classification with Reads Overlap Graphs

M. Cavattoni and M. Comin

Department of Information Engineering, University of Padua, 35100 Padua, Italy
Abstract. Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomically annotate its reads to identify the species it contains. Most of the methods currently available focus on the classification of reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of sensitivity (the actual number of classified reads) their performance is often poor. One of the reasons is that the reads in a sample can be very different from the corresponding reference genomes, e.g. viral genomes are usually highly mutated. To address this issue, in this paper we propose ClassGraph, a new taxonomic classification method that makes use of the reads overlap graph and applies a label propagation algorithm to refine the result of existing tools. We evaluated its performance on simulated and real datasets against several taxonomic classification tools, and the results showed improved sensitivity and F-measure while preserving high precision. ClassGraph is able to improve the classification accuracy especially on difficult cases like Virus and real datasets, where traditional tools are not able to classify many reads.

Availability: https://github.com/CominLab/ClassGraph
Keywords: Metagenomic reads classification · Reads graph · Label propagation

1 Introduction
Metagenomics is the study of heterogeneous microbial samples (e.g. soil, water, human microbiome) extracted directly from the natural environment, with the primary goal of determining the taxonomic identity of the microorganisms residing in the samples. It represents an evolutionary shift in focus, from the study of individual microbes to that of complex microbial communities. Classical genomics-based approaches require prior cloning and culturing for further investigation [16]. However, not all Bacteria can be cultured, and the advent of metagenomics made it possible to bypass this difficulty. Microbial communities can be analyzed and compared through the detection and quantification of the species they contain [6,19]. In this paper, we focus on the detection of species in a sample using a set of reference genomes, e.g. Bacteria and Virus, a problem known as taxonomic binning of reads [21].

© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 521–533, 2021. https://doi.org/10.1007/978-3-030-91415-8_44

Several methods have been developed in recent years, and they can be broadly divided into two categories: (1) alignment-based methods and (2) sequence-composition-based methods, which rely on nucleotide composition (e.g. k-mer usage). Traditionally, the first strategy was to use BLAST [1] to align each read against all sequences in GenBank. Later, faster methods were deployed for this task; popular examples are MegaBlast [25] and Megan [10]. However, as the reference databases and the size of sequencing datasets have grown, alignment has become computationally infeasible, leading to the development of metagenomic classifiers that provide much faster results. The fastest and most promising approaches are based on sequence composition [6,13]. The basic principle can be summarized as follows: the genome of each reference organism is represented by some of its k-mers together with the taxonomic label of the organism, and the reads are then searched against this k-mer database and classified. For example, Kraken [23] constructs a data structure that is an augmented taxonomic tree in which a list of significant k-mers is associated with each taxon. Clark [18] uses a similar approach, building databases of species- or genus-level specific k-mers and discarding any k-mers mapping to higher levels. Several other composition-based methods have been proposed over the years. In [7] the number of unassigned reads is decreased through reads overlap detection and species imputation. Centrifuge and Kraken 2 [12,24] reduce the size of the k-mer database by using an FM-index and minimizers, respectively. The sensitivity can be improved by filtering uninformative k-mers [17,20] or by using spaced seeds instead of k-mers [5]. The precision of these methods is as good as that of MegaBlast [25], while the processing speed is much faster [13].

The major problem with these reference-based metagenomic classifiers is that the genomes of most microbes in a metagenomic sample can be taxonomically distant from those present in the existing reference databases. This is even more important for viral genomes, where the mutation and recombination rates are very high and, as a consequence, the viral reference genomes are usually very different from other viral genomes of the same species. For these reasons, most reference-based metagenomic classification methods do not perform well when the sample under examination contains strains that differ from the genomes used as references. Indeed, CLARK [18] and Kraken 2 [24], for example, report precision above 95% on many datasets. In terms of sensitivity, i.e. the percentage of reads correctly classified, however, both Clark and Kraken 2 usually show performance between 50% and 60%, and sometimes, on real metagenomes, just 20% of the reads can be assigned to some taxa [13].
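The k-mer lookup principle summarized earlier in this section can be illustrated with a toy classifier. The sketch below is a deliberately simplified assumption, not how Kraken, CLARK or any other cited tool works internally: it keeps only k-mers that occur in a single reference (loosely in the spirit of discriminative k-mers) and assigns a read to the taxon with the most hits. The sequences, taxon names and function names are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference genome contains it."""
    index = {}
    for taxon, genome in references.items():
        for km in kmers(genome, k):
            index.setdefault(km, set()).add(taxon)
    return index


def classify(read, index, k):
    """Taxon with the most unambiguous k-mer hits, or None if unclassified."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxa = index.get(read[i:i + k])
        if taxa and len(taxa) == 1:          # ignore k-mers shared by several taxa
            hits[next(iter(taxa))] += 1
    return hits.most_common(1)[0][0] if hits else None


# Toy reference "genomes" (made up for the example)
REFS = {"taxA": "ACGTACGTGG", "taxB": "TTGACCATGA"}
INDEX = build_index(REFS, k=4)
```

A read whose k-mers are all absent from the index stays unclassified (`None`), which is exactly the situation, frequent for divergent strains, that lowers sensitivity.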
In this paper we address this problem and propose ClassGraph, a metagenomic taxonomic refinement tool that makes use of the reads connectivity information from the reads overlap graph to classify unlabelled reads. Recently, reads overlap graphs have been used in the context of reference-free reads classification [2,3] and contig taxonomic classification [14,26]. ClassGraph utilizes the result of an existing taxonomic classification tool and a label propagation algorithm to predict the labels of reads that could not be classified. The novel paradigm exploited by ClassGraph is able to further improve the classification accuracy especially on difficult cases like Virus and on real datasets (see Sect. 3).
2 Methods
As previously mentioned, the purpose of ClassGraph is to improve the classification produced by pre-existing taxonomic classification tools. An overview of ClassGraph can be found in Fig. 1. In the pre-processing phase a set of metagenomic reads is classified with a given tool. From the same input a reads graph is built based on read overlaps. Then, the initial taxonomic labels are assigned to the nodes of the graph. Next, the graph is reduced and simplified by ClassGraph. Finally, in the label propagation phase, the labels are spread through the graph in order to classify the reads that were not classified at the beginning.
Fig. 1. The workflow of ClassGraph.
2.1 Pre-processing
As stated above, ClassGraph requires two input files: one representing the reads graph and the other containing the result of the classification process. We use SGA [22], an assembler based on Overlap-Layout-Consensus (OLC), to build the reads graph. SGA owes its efficiency to an FM-index generated directly from the reads, and it can handle very large read datasets. In a reads graph, the reads are represented as nodes and their overlaps as edges. An edge is created if the overlap between two reads has length at least L. Specifically, two reads X and Y overlap when a suffix of X matches a prefix of Y or vice versa. An edge is also added when the prefix of a read matches the reverse-complemented prefix of another read. During graph construction, all transitive edges are deleted. We chose SGA because it is very efficient; however, other popular assemblers such as SPAdes [4], or any other tool producing a graph in the asqg format, could be used.

We then need to select a classifier to obtain the initial taxonomic labels. We chose to test the most widely used tools: CLARK [18], Kraken [23], Kraken 2 [24], and Centrifuge [12]. However, any other taxonomic classification tool could be used in the ClassGraph workflow. Before classifying, each tool must first build its internal database from a set of reference genomes and their taxonomy, downloaded from NCBI. All tools are tested using the same set of reference genomes (see Results).

2.2 Graph Transformation and Reduction
ClassGraph has been devised for use with paired-end reads. In this case the taxonomic classification tools collapse each pair $(R_{i/1}, R_{i/2})$, returning as output a single read $R_i$ together with its label l. SGA, on the other hand, keeps the reads separate, matching each read with a node in the graph. Since two paired reads belong to the same DNA fragment, they are guaranteed to belong to the same species. For this reason ClassGraph modifies the SGA graph by collapsing the nodes representing paired reads, which requires redefining the edges. Let $E_{i/1} = \{e_{i/1,1}, \ldots, e_{i/1,k}\}$ and $E_{i/2} = \{e_{i/2,1}, \ldots, e_{i/2,j}\}$ be the sets of edges directly connected to the nodes $R_{i/1}$ and $R_{i/2}$ in the SGA graph. In ClassGraph all the edges in the union $E_i = E_{i/1} \cup E_{i/2}$ are assigned to the read $R_i$. Every edge has a weight attribute defined as follows: let $R_i$ and $R_j$ be two reads linked by an edge, let $L = |R|$ be the standard length of a read and $o_{ij}$ the length of the overlap between $R_i$ and $R_j$. The weight of the edge between $R_i$ and $R_j$ is computed as $w_{e_{ij}} = o_{ij} / L$. If in the SGA graph there are two edges $e_1$ and $e_2$ linking reads belonging to two pairs of paired-end reads $(R_{1/1}, R_{1/2})$ and $(R_{2/1}, R_{2/2})$, in the ClassGraph graph they are merged into a single edge e, whose weight is computed as $w_e = \max\{w_{e_1}, w_{e_2}\}$. We choose the maximum of $\{w_{e_1}, w_{e_2}\}$ because a large overlap between reads might indicate that the reads are from the same species.
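The suffix-prefix overlap test, the edge weight $o_{ij}/L$ and the collapsing of mate pairs can be sketched as follows. This is a simplified illustration, not the ClassGraph or SGA code: it ignores reverse-complement overlaps and transitive-edge removal, uses a naive overlap search instead of an FM-index, and assumes mates are named with `/1` and `/2` suffixes. The function names and the edge-list input format are hypothetical.

```python
def overlap_len(x, y, min_len):
    """Length of the longest suffix of x equal to a prefix of y (0 if < min_len)."""
    for n in range(min(len(x), len(y)), min_len - 1, -1):
        if x[-n:] == y[:n]:
            return n
    return 0


def fragment_of(read_id):
    """'r7/1' -> 'r7' (assumed naming convention for paired-end mates)."""
    return read_id.rsplit("/", 1)[0]


def collapse(edges, read_len):
    """Collapse mates into one node per fragment.

    edges: iterable of (read_a, read_b, overlap_length) from the read-level graph.
    Returns {frozenset({frag_a, frag_b}): weight} where the weight is the
    maximum of overlap / read_len over all merged edges.
    """
    graph = {}
    for a, b, o in edges:
        fa, fb = fragment_of(a), fragment_of(b)
        if fa == fb:
            continue                          # mates of one fragment: no self-loop
        key = frozenset((fa, fb))
        graph[key] = max(graph.get(key, 0.0), o / read_len)
    return graph
```

For example, two read-level edges between fragments r1 and r2 with overlaps of 80 and 50 bases on 100-base reads are merged into one edge of weight 0.8.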
After the graph construction and the assignment of the corresponding tax-ids as labels to the nodes, the graph is reduced. In particular all the edges linking two already labelled nodes are deleted in order to reduce the memory usage and running time. An example of graph construction and reduction can be found in Fig. 2 (a).
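The reduction step amounts to a single filter over the edge set. A minimal sketch, assuming the graph is stored as a dictionary keyed by unordered node pairs and the initial labels as a dictionary from node to tax-id (both representations and the name `reduce_graph` are assumptions made for illustration):

```python
def reduce_graph(graph, labels):
    """Drop every edge whose two endpoints are both already labelled.

    graph:  {frozenset({u, v}): weight}
    labels: {node: tax_id} for the nodes labelled by the initial classifier
    """
    return {edge: w for edge, w in graph.items()
            if not all(node in labels for node in edge)}
```

Edges with at least one unlabelled endpoint are kept, since they are the only ones along which a label can still be propagated.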
Fig. 2. ClassGraph reduction (a) and label propagation (b). Panel titles: (a) Reads Graph Reduction; (b) ClassGraph V1 and V2: Label Propagation.
2.3 Label Propagation
The label propagation is implemented as an iterative algorithm that proceeds in levels. Level 0 is formed by all non-isolated nodes having an initial label. A node belongs to level i if it is labelled at the i-th iteration. At the i-th iteration, only the nodes that have no label and are directly linked to at least one node labelled at iteration i − 1 are labelled. The number of iterations must be given as input to the program. Once a node has been classified, its label cannot be changed. We implemented two versions of the propagation algorithm, referred to from now on as ClassGraph V1 and ClassGraph V2. In ClassGraph V1, at the i-th iteration the nodes of level i − 1 send to their unlabelled neighbours the information (l, w), where l is the label of the sending node and w the weight of the edge between the two nodes. At the end of this phase every node at level i will have received a list of the form $\{(l_1, \sum_{i=1}^{k} w_i), \ldots, (l_k, \sum_{i=t}^{y} w_i)\}$, i.e. each received label paired with the sum of the weights of the edges that delivered it. The label actually assigned to the receiving node is the one that maximizes this sum. An example of label propagation can be found in Fig. 2 (b). All edges already explored are removed from the graph. The pseudo-code of ClassGraph V1 is given in Algorithm 1.
Algorithm 1. Label Propagation V1

Level_0 = {all labelled nodes with at least one edge}
Level_1 = ∅
for i = 1, 2, ..., #iter do
    for all j ∈ Level_{i−1} do
        for all edges e_jk do
            Send to the node k the label l_j with weight w_jk
            Add node k to Level_i
        end for
        Delete all the edges of node j
    end for
    for all j ∈ Level_i do
        For each label l_k received, compute the sum of its weights
        Find the label l_max with the maximum weight
        l_j = l_max
    end for
    Level_{i−1} = Level_i
    Level_i = ∅
end for
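Algorithm 1 can be turned into a short runnable sketch. The version below is one possible reading, not the ClassGraph source: the graph is assumed to be a symmetric adjacency dictionary, messages are sent only to unlabelled neighbours (as stated in the text), ties between labels are broken arbitrarily, and the loop stops early when no new node is reached. The names `build_adj` and `propagate_v1` are hypothetical.

```python
from collections import defaultdict


def build_adj(edges):
    """Symmetric adjacency {node: {neighbour: weight}} from (u, v, w) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def propagate_v1(adj, labels, n_iter):
    """Level-by-level label propagation; existing labels are never changed."""
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}   # copy: edges get deleted
    labels = dict(labels)
    level = [u for u in labels if adj.get(u)]           # level 0: labelled, non-isolated
    for _ in range(n_iter):
        received = defaultdict(lambda: defaultdict(float))
        for j in level:
            for k, w in adj[j].items():
                if k not in labels:
                    received[k][labels[j]] += w          # send (l_j, w_jk), sum per label
            for k in list(adj[j]):                       # delete all edges of node j
                del adj[k][j]
            adj[j].clear()
        if not received:
            break                                        # nothing left to label
        for k, scores in received.items():
            labels[k] = max(scores, key=scores.get)      # label with the largest sum
        level = list(received)                           # newly labelled nodes
    return labels
```

Because the weights are summed per label, two weaker edges carrying the same label can outvote a single stronger edge carrying a different one.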
In ClassGraph V2, on the other hand, the propagation occurs on two levels: at the i-th iteration not only do the nodes at level i − 1 propagate their labels, but the nodes at level i also exchange information with each other. With reference to the example in Fig. 2 (b), let k and h be the two grey nodes, both at the same level i and directly linked by an edge of weight $w_{kh} = 0.8$. Every time node k receives a pair $(l_j, w_{jk})$, this information is also forwarded to h with a smaller weight, $(l_j, w_{jk} \cdot w_{hk} \cdot \beta)$, where β = 0.9 is a correction factor that reduces the impact of label propagation among nodes of the same level i. In summary, two directly connected nodes at the same level i exchange the lists $\{(l_1, w_1), \ldots, (l_j, w_y)\}$ obtained from their neighbours at level i − 1, with each original weight multiplied by the two factors described above. The information exchanged between nodes at levels i − 1 and i, and the way the final label of a node is selected, are the same as in V1. Fig. 2 (b) shows how the two label propagation methods can produce different labellings.
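A single V2 iteration can be sketched as below. This reflects one interpretation of the description above and is not taken from the ClassGraph code: only the messages received directly from level i − 1 are forwarded (forwarded messages are not forwarded again), the graph is a symmetric adjacency dictionary, and the function name `propagate_v2_step` is hypothetical. β = 0.9 is the value given in the text.

```python
from collections import defaultdict

BETA = 0.9  # correction factor from the text


def propagate_v2_step(adj, labels, level):
    """One V2 iteration.

    adj:    symmetric adjacency {node: {neighbour: weight}}
    labels: {node: label} for all nodes labelled so far
    level:  nodes labelled at the previous iteration (level i - 1)
    Returns {node: label} for the nodes labelled in this iteration (level i).
    """
    # Step 1 (same as V1): direct messages from level i-1, summed per label
    direct = defaultdict(lambda: defaultdict(float))
    for j in level:
        for k, w in adj[j].items():
            if k not in labels:
                direct[k][labels[j]] += w
    # Step 2: level-i neighbours forward their direct messages to each other,
    # scaled by the connecting edge weight and by BETA
    total = {k: defaultdict(float, scores) for k, scores in direct.items()}
    for k, scores in direct.items():
        for h, w_kh in adj[k].items():
            if h in direct and h != k:
                for label, w in scores.items():
                    total[h][label] += w * w_kh * BETA
    return {k: max(s, key=s.get) for k, s in total.items()}
```

In a small example where node k is reached by label 1 with weight 0.5, node h by label 2 with weight 0.9, and k and h are linked with weight 0.8, V1 would assign label 1 to k, whereas this V2 step assigns label 2 to both nodes, because the forwarded weight 0.9 · 0.8 · 0.9 = 0.648 exceeds 0.5.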
3 Results
In order to evaluate the effectiveness of ClassGraph we compared it with several other tools on simulated and real metagenomic datasets. For the initial taxonomic reads classification we selected the popular tools: Centrifuge [12], CLARK [18], Kraken [23], and Kraken 2 [24]. Note that taxonomic reads classification is substantially different from taxonomic contig classification [21]. Although there exist several tools like GraphBin2 [15], METAMVGL [26], VContact2 [11], and many others, they serve a different purpose and they cannot be directly compared with taxonomic reads classification tools.
The simulated dataset was obtained with the same procedure used by the authors of Kraken 2 [24] in their strain exclusion experiment. We first downloaded from NCBI the reference genomes of Archaea, Viruses and Bacteria. From these we obtained the so-called origin strains by removing from the initial set the genomes of 40 Bacteria and 10 Viruses. The origin strains were then used for testing, while all the remaining reference genomes were used to build the databases of the different classifiers. The idea behind this experiment is to simulate the typical situation in which the reference database contains a different strain of the same species as the one to be classified. Starting from the selected origin strains we used Mason 2 [9] to simulate a dataset of 3M paired-end reads. Specifically, Mason 2's mason_simulator command was used with the Illumina error profile to simulate sequencing errors, so that the generated sequences contain 0.4% mismatches, 0.005% insertions and 0.005% deletions.

On this dataset two experiments were conducted. First, we ran ClassGraph starting from the complete output of the different classifiers and compared its performance with the original taxonomic classification. In the second experiment, we simulated a situation where only a low percentage of initial labels is available, to test the robustness of ClassGraph. To do this we retained different percentages of the labels assigned by the initial taxonomic classification: 100%, 75%, 50% and 25%. The files thus modified were given as input to ClassGraph, which had to re-assign the missing labels.

For the experiments on real data, we used a paired-end read collection from the Human Microbiome Project (HMP). The real metagenome SRR1804065 is a DNA stool sample that has been previously studied in [8,20]. Since the "ground truth" is not available for a real metagenome, we used the same evaluation procedure as Kraken 2 [24]: the dataset was filtered to keep only the reads whose ground truth could be established using BLAST, by selecting the reads with a high-confidence match (at least 95% identity) to some reference genome. The tools are evaluated and compared, using the same evaluation metrics as in [20,24], on the basis of four standard metrics: Sensitivity, Precision, F1-Score and the Pearson Correlation Coefficient (PCC) of the abundance profile.

3.1 Performance Evaluations
In this section we analyze the performance obtained with ClassGraph by comparing it with other tools. In all the experiments the input of ClassGraph consisted of the output of one of the taxonomic classification tools and of SGA. The overlap parameter -m of SGA was always set to 35, as in [8]. The number of label propagation iterations in ClassGraph was fixed to 20. In general, we observed that most of the nodes are labelled within the first few iterations. In each figure two results are reported for ClassGraph, one for each version of the label propagation algorithm.
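The four metrics used throughout this section can be computed as in the sketch below. The exact definitions are an assumption based on the wording of the paper and on common practice for read classifiers: sensitivity is the fraction of all reads that are correctly classified, precision is the fraction of classified reads that are correct, F1 is their harmonic mean, and PCC is the Pearson correlation between true and estimated per-taxon abundances. The function names are hypothetical.

```python
import math


def read_metrics(truth, predicted):
    """Return (sensitivity, precision, f1).

    truth:     {read: taxon}
    predicted: {read: taxon or None}  (None or missing = unclassified)
    """
    classified = [r for r in truth if predicted.get(r) is not None]
    correct = sum(predicted[r] == truth[r] for r in classified)
    sensitivity = correct / len(truth)
    precision = correct / len(classified) if classified else 0.0
    total = precision + sensitivity
    f1 = 2 * precision * sensitivity / total if total else 0.0
    return sensitivity, precision, f1


def pearson(xs, ys):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Under these definitions, labelling previously unclassified reads can only raise sensitivity if the new labels are correct, while precision drops whenever a newly assigned label is wrong, which is the trade-off discussed in the remainder of this section.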
Fig. 3. Refinement taxonomic classification results of ClassGraph for the Bacteria dataset evaluated at species level. Each graph reports the Sensitivity, Precision, F1-score and PCC values at the species level of the original tool compared with the scores obtained after applying ClassGraph. Panels: (a) Centrifuge, (b) Clark, (c) Kraken, (d) Kraken 2.
Figures 3 and 4 show the results obtained at species level for Bacteria and Viruses, respectively, on the simulated dataset. For Bacteria (Fig. 3), the performance of ClassGraph is consistent with that of all the other tools. In particular, sensitivity, F1-Score and PCC always showed a slight increase, and precision a slight decrease. In any case the differences are very small, which can be explained by the fact that all the classifiers have already labelled most of the reads, leaving ClassGraph little room to extend the classification and make significant improvements. The case of Viruses (Fig. 4) is completely different. The high genetic variability of Viruses makes classification harder, and for this reason the sensitivity of all tools is generally low (20%–30%). ClassGraph, however, was able to correctly classify a substantial number of unlabelled reads, significantly increasing the sensitivity of all the classifiers (above 50%). The F1-Scores obtained by ClassGraph are also the highest among all the considered tools: while sensitivity grows, the precision of ClassGraph remains in line with that of the other classifiers. We also notice a general increase in PCC, i.e. a better estimation of the abundance ratios. In this experiment version 2 of the label propagation has a slight advantage over V1. Similar results can be observed for the classification at genus level.
Fig. 4. Refinement taxonomic classification results of ClassGraph for the Virus dataset evaluated at species level. Panels: (a) Centrifuge, (b) Clark, (c) Kraken, (d) Kraken 2.
Figure 5 reports the results of the second experiment on the simulated dataset. It shows the trend of the F1-Score at species level for Bacteria when we vary the percentage of labels kept from the initial taxonomic classification. The actual performance of the classifiers is therefore the one shown in blue in the 100% group. The other values associated with the classifiers represent the performance calculated after removing different percentages of labels from their output, and serve to show the robustness of ClassGraph in these different scenarios. We can notice that decreasing the number of initial labels barely affects the performance of ClassGraph in terms of F1-Score, which stays above 80% for all the tools and for all the percentages of initially labelled reads. This can be explained by looking at the trends of sensitivity and precision. ClassGraph is able to maintain the level of precision of the initial taxonomic classification tool. In addition, the connectivity of the graph allows ClassGraph to classify almost all the nodes even when starting from a small number of labelled ones, which also keeps the sensitivity high. These two facts combined explain the trend of the F1-Score. Figure 6 illustrates the trend of the F1-Score for Viruses at species level while varying the percentage of initial taxonomic classification labels. In this case ClassGraph improves the F1-Score with respect to the original taxonomic labels (100%), and it maintains this high performance even when the initial number of labels is reduced. We can notice that in these experiments, too, the second version of the label propagation algorithm obtains better performance.
M. Cavattoni and M. Comin
Fig. 5. Refinement taxonomic classification F1-Score of ClassGraph for the Bacteria dataset evaluated at species level, while varying the percentage of initial labels. Panels: (a) Centrifuge, (b) Clark, (c) Kraken, (d) Kraken 2.
Fig. 6. Refinement taxonomic classification F1-Score of ClassGraph for the Virus dataset evaluated at species level, while varying the percentage of initial labels. Panels: (a) Centrifuge, (b) Clark, (c) Kraken, (d) Kraken 2.
Fig. 7. Refinement taxonomic classification results of ClassGraph for the real dataset SRR1804065. Panels: (a) Centrifuge, (b) Clark, (c) Kraken, (d) Kraken 2.
The results on the real dataset are reported in Fig. 7. In general we can notice that, thanks to ClassGraph, the sensitivity more than doubles with respect to the initial classifiers. On the other hand, ClassGraph loses some points in terms of precision. However, the increase in sensitivity is always much bigger than the decrease in precision, and for this reason the F1-Score significantly increases too. The value of the PCC always remains very close to 100%. The difference between the two versions of ClassGraph is clearer in the case of the real dataset: by using V2 we improved all three parameters of sensitivity, precision and F1-Score in all cases, as reported in Fig. 7. In terms of the computational resources required for the classification of a dataset of 6M reads, Kraken 1 needs 72 GB of RAM and 330 s, Kraken 2 needs 9.8 GB of RAM and 119 s, and ClassGraph needs 16 GB of RAM and 610 s. The current implementation of ClassGraph is a proof of concept, and there is room for improvement in both memory and time. In general we can observe that, even when starting from the classification of state-of-the-art tools such as Kraken 2 and Centrifuge, ClassGraph is able to significantly increase the number of correctly classified reads, i.e. the sensitivity. This behaviour is desirable when processing difficult real metagenomic datasets, or viral datasets, where the number of reads classified by traditional tools is particularly low.
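The trade-off described above (a large gain in sensitivity against a small loss in precision) can be made concrete with the usual definition of the F1-Score as the harmonic mean of the two quantities. The following is a minimal sketch; the function name `f1_score` and the numeric values are hypothetical and are not read from Fig. 7.

```python
def f1_score(sensitivity: float, precision: float) -> float:
    """Harmonic mean of sensitivity (recall) and precision."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


# Hypothetical values chosen only to illustrate the effect; not the paper's data.
before = f1_score(sensitivity=0.30, precision=0.95)  # original classifier
after = f1_score(sensitivity=0.65, precision=0.90)   # after refinement

print(f"F1 before: {before:.3f}, F1 after: {after:.3f}")
```

Because the harmonic mean is dominated by the smaller of its two arguments, raising a low sensitivity moves the F1-Score much more than a small drop in an already high precision.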
4 Conclusions
In this paper we presented ClassGraph, a metagenomic taxonomic classification refinement tool based on the reads overlap graph that can classify unlabelled reads. It uses the taxonomic classification results of existing tools and a label propagation algorithm to predict the taxonomic labels of reads that could not be classified. We have shown that ClassGraph is able to improve the performance of other popular classifiers in terms of sensitivity and F1-Score. ClassGraph is particularly effective in the case of viral datasets which, due to their greater genetic variability, are in general more difficult to classify. ClassGraph was also tested on a real dataset, showing encouraging results: in this case, too, F1-Score and sensitivity increased, and the PCC remained close to 100%. On the other hand, when using ClassGraph the precision tends to decrease slightly. For this reason, in the future it would be interesting to add a new pre-processing phase to the ClassGraph pipeline in order to detect and delete ambiguous labels.
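To illustrate the general idea of propagating labels over a reads overlap graph, the sketch below applies a generic majority-vote propagation. It is not the V1 or V2 algorithm of ClassGraph, whose details are not given in this section; the function name `propagate_labels`, the read identifiers and the taxon names are all hypothetical.

```python
from collections import Counter


def propagate_labels(adjacency, labels, max_rounds=10):
    """Assign each unlabelled node the majority label of its labelled neighbours.

    adjacency: dict mapping a read id to the list of reads it overlaps with.
    labels:    dict mapping already classified reads to a taxon.
    """
    labels = dict(labels)  # work on a copy, leave the input untouched
    for _ in range(max_rounds):
        updates = {}
        for node, neighbours in adjacency.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if votes:
                # sorted() makes the tie-breaking deterministic
                updates[node] = max(sorted(votes), key=lambda lab: votes[lab])
        if not updates:
            break  # nothing left to propagate
        labels.update(updates)
    return labels


# Toy overlap graph: a chain r1-r2-r3-r4 and an isolated read r5.
graph = {
    "r1": ["r2"],
    "r2": ["r1", "r3"],
    "r3": ["r2", "r4"],
    "r4": ["r3"],
    "r5": [],
}
result = propagate_labels(graph, {"r1": "taxon_A"})
print(result)
```

In this toy graph the single initial label reaches every read connected to `r1`, while the isolated read `r5` stays unclassified, which mirrors why graph connectivity matters for sensitivity.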
ScDA: A Denoising AutoEncoder Based Dimensionality Reduction for Single-cell RNA-seq Data
Xiaoshu Zhu1, Yongchang Lin1, Jian Li1, Jianxin Wang2, and Xiaoqing Peng3(B)
1 School of Computer Science and Engineering, Yulin Normal University, Yulin 537000, Guangxi, China
2 Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, Hunan, China
3 Center for Medical Genetics, School of Life Sciences, Central South University, Changsha 400083, Hunan, China
[email protected]
Abstract. Single-cell RNA-seq (scRNA-seq) data provides a higher resolution of cellular heterogeneity. However, scRNA-seq data also brings computational challenges because of its high dimension, high noise, and high sparseness. Dimension reduction is a crucial way to denoise the data and greatly reduce the computational complexity by representing the original data in a low-dimensional space. In this study, to achieve an accurate low-dimensional representation, we proposed a denoising AutoEncoder based dimensionality reduction method for scRNA-seq data (ScDA), combining the denoising function with the AutoEncoder. ScDA is a deep unsupervised generative model, which models the dropout events and denoises the scRNA-seq data. Meanwhile, ScDA can extract the nonlinear features of the original data by maximizing the distribution similarity before and after dimensionality reduction. Tested on 16 scRNA-seq datasets, ScDA provides superior average performance compared with 3 clustering methods, especially on large-scale datasets.
Keywords: Dimensional reduction · Denoising AutoEncoder · Single-cell RNA-seq data · Kullback–Leibler divergence · Hierarchical clustering
1 Introduction
With the rapid development of single-cell RNA-sequencing (scRNA-seq) technology, thousands of single cells can be sequenced simultaneously, and the gene expression level of each cell can be accurately measured. As a result, cellular heterogeneity can be explicitly revealed in scRNA-seq data [1, 2]. However, in scRNA-seq data the number of genes is much larger than that of cells and commonly exceeds 20000, which brings challenges for similarity measurement and computational overhead [3–5]. Meanwhile, a large number of dropout events arise from transcription failure or shallow sequencing depth, which introduces high noise into scRNA-seq data [6, 7].
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 534–545, 2021. https://doi.org/10.1007/978-3-030-91415-8_45
Notably, the increasing number of samples in scRNA-seq data leads to another challenge: the current clustering methods suited to small-scale scRNA-seq data become inefficient because of their computational cost [8, 9]. For example, the SC3 [10] method performed well on scRNA-seq data of fewer than 1000 samples but barely worked on larger ones. Moreover, increasing efforts show that the analysis of scRNA-seq data plays an important role in the research of cell differentiation. For instance, cell types are identified by clustering cells and, based on that, cell Trajectory Inference (TI) is performed by characterizing the dynamic changes of gene expression [11–13]. The commonly used clustering methods for scRNA-seq data include k-means [10], hierarchical clustering (HC) [14, 15], DBSCAN [16], and Louvain community detection [17].
To overcome the above-mentioned challenges, some special clustering methods for large-scale scRNA-seq data have been proposed recently [18–20]. They focus on the low-dimensional embedded representation and fall into two main categories: projection-based methods and subspace division-based methods.
In the projection-based clustering methods, the original scRNA-seq data is projected into a low-dimensional space to extract the latent features. For example, VASC [8] used mean square error (MSE) and Kullback–Leibler (KL) divergence to construct the loss function, added a dropout layer in the Encoder to model the noise, and introduced a zero-inflated negative binomial (ZINB) model to handle the large number of zeros. SHARP [21] partitioned the large-scale dataset into B blocks and performed a sparse random projection (RP) algorithm k times to project the original D dimensions into d dimensions. The resulting k d-dimensional datasets were clustered using HC, and the k clustering results were integrated by the weight-based wMetaC. scDeepCluster [20] combined the ZINB with an AutoEncoder to learn the low-dimensional representation.
Random Gaussian noise was incorporated into the Encoder, and the ZINB loss was integrated into the Decoder. KL divergence (clustering loss) was used to cluster the embedded points.
In the subspace division-based clustering methods, the original scRNA-seq data is divided into several subspaces in which the important features are selected. For example, SSRE [22] selected the significant genes according to Laplacian scores and simultaneously learned the sparse subspace representation of each cell. The similarity matrix was calculated by integrating Spearman correlation, Pearson correlation coefficient, and cosine similarity, and spectral clustering was performed on it. ENCORE [23] constructed an approximate density matrix by calculating the kernel density value of each gene, and grouped the informative genes with similar density profiles into the same subspaces by Gaussian mixture models (GMM). The similarity matrix in the different subspaces was obtained with the Pearson correlation coefficient. ENCORE constructed a consensus-factor matrix across subspaces based on the cluster-based similarity-partitioning algorithm (CSPA) to improve the clustering efficiency.
Therefore, for scRNA-seq data with high dimension and high noise, how to reduce the dimension while preserving the informative features and avoiding randomness is a challenge [24, 25].
To efficiently represent large-scale scRNA-seq data in a latent feature space, we proposed a dimensionality reduction approach, called ScDA (a denoising AutoEncoder for scRNA-seq data), in which the denoise function was incorporated into the AutoEncoder model. Here, to obtain reconstructed data in the Decoder with a small error, a similar distribution, and less noise, we designed an objective function combining three components: MSE, KL divergence, and the denoise function. Firstly, the scRNA-seq data were preprocessed to remove outliers and filter ineffective genes using binning and entropy. Binning discretizes the continuous feature values and reduces the influence of the different gene expression levels. The entropy of each gene in the box was calculated, and the genes with low entropy were removed. Then, the AutoEncoder was constructed to reduce the dimension. To improve the performance of the AutoEncoder, the denoising regular term was introduced, so that the model can learn the rules for denoising when noise is added to the original data. Finally, HC was used to cluster the cells. To evaluate the performance, we compared ScDA with 3 clustering methods on 16 real scRNA-seq datasets. The experimental results showed that ScDA provided superior average performance and superior performance on most datasets compared with the other methods, especially on large-scale datasets.
2 Methods and Materials
A growing body of research shows that neural networks achieve excellent performance on large-scale, high-dimensional datasets. The AutoEncoder is a typical neural network with the advantage of nonlinear feature extraction in unsupervised learning. By adding noise to the input of the AutoEncoder, the model learns to reduce the noise and becomes more robust. Thus, ScDA had the following advantages: (1) the structure of the model was simple and a lower computational cost was achieved; (2) it did not need extra parameters to train the model, except that the threshold on the number of genes in preprocessing was set to 2000; (3) to effectively remove noise and improve the data quality, several denoising strategies were applied. For example, the outliers were identified and smoothened; the genes lacking the separability needed for clustering were removed by calculating the entropy of each gene in the box; ZINB was used to handle the large number of zeros; and noise was introduced into the Encoder to train the model, so that the data generated in the Decoder would be denoised.
2.1 ScDA
In ScDA, a gene expression matrix X with m genes (columns) and n cells (rows) was the original input. Let x_ij be the element of matrix X, denoting the expression level of the j-th gene in the i-th cell. Then, genes were sorted and partitioned into 20 boxes, and the ineffective genes were removed according to the entropy score. A new gene expression matrix X' with m' genes and n cells was obtained. Noise ε was added to the gene expression matrix X', and the resulting gene expression matrix X'' was input into the AutoEncoder. The parameters of the AutoEncoder were optimized using the mini-batch stochastic gradient descent method, and the learning rate of the model was set to 0.001. An overview of the ScDA method is shown in Fig. 1.
Outliers Smoothen. The outliers were smoothened. The mean expression of a gene across all cells was calculated as μ, and the variance of its expression across all cells was calculated as σ. If the gene expression level was outside the range [μ − 2σ, μ + 2σ], it was smoothened to the boundary value of this interval.
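The smoothing rule above can be sketched as a simple clipping step. This is a minimal illustration rather than the authors' code: the function name `smooth_outliers` and the example values are hypothetical, and σ is computed here as the population standard deviation, which is an assumption (the text refers to it as the variance).

```python
from statistics import mean, pstdev


def smooth_outliers(values, k=2.0):
    """Clip one gene's expression values to the interval [mu - k*sigma, mu + k*sigma]."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: sigma is the population standard deviation
    lower, upper = mu - k * sigma, mu + k * sigma
    return [min(max(v, lower), upper) for v in values]


# Hypothetical expression of one gene across ten cells, with a single extreme value.
gene = [1.0] * 9 + [10.0]
smoothed = smooth_outliers(gene)
print(smoothed)
```

Values inside the interval are left unchanged; only the extreme value is pulled back to the upper boundary.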
Fig. 1. An overview of ScDA. (a) The gene expression matrix with m genes and n cells as input; (b) Outliers were smoothened into a boundary threshold; (c) Ineffective genes were removed by calculating the entropy score in the box; (d) ZINB was implemented to model the dropout events; (e) Max-min normalization was performed; (f) Denoising AutoEncoder was designed in which the objective function combines MSE, KL divergence, and denoise function; (g) HC was performed to cluster cells.
Binning. Ineffective genes were removed by binning. Binning can reduce the influence of different gene expression levels; the number of boxes was set to an empirical value. All of the genes were sorted and divided into 20 sub-intervals (boxes) according to their attribute values, and attribute values in the same box were regarded as belonging to the same category. The entropy of every gene was calculated using formula (1). If the entropy score was less than 0.3, the gene was removed.
$$H(X) = -\sum_{x \in X} p(x) \cdot \log p(x) \qquad (1)$$
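One possible reading of this filtering step is sketched below. It assumes that each gene's values across cells are discretized into 20 equal-width boxes, that p(x) is the fraction of cells falling in each box, and that the logarithm is natural; none of these details is stated explicitly in the text. The names `binned_entropy` and `filter_genes` and the toy data are hypothetical.

```python
import math
from collections import Counter


def binned_entropy(values, n_bins=20):
    """Shannon entropy (natural log) of one gene's values after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant gene carries no information
    width = (hi - lo) / n_bins
    # Map each value to a box index in 0..n_bins-1 (the maximum goes to the last box).
    boxes = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(boxes).values())


def filter_genes(expression, threshold=0.3, n_bins=20):
    """Keep the genes whose binned entropy reaches the threshold."""
    return [gene for gene, vals in expression.items()
            if binned_entropy(vals, n_bins) >= threshold]


# Toy data: one almost constant gene and one gene that varies across 40 cells.
expression = {
    "nearly_flat": [0.0] * 39 + [1.0],
    "variable": [float(i) for i in range(40)],
}
kept = filter_genes(expression)
print(kept)
```

The almost constant gene has an entropy well below 0.3 and is dropped, while the gene spread evenly over the boxes is kept.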
Normalization. The max-min normalization was implemented. The gene expression level was normalized into [0, 1] by formula (2):
$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (2)$$
where x_max is the maximum of the dataset, and x_min is the minimum of the dataset.
Dropout Imputation. ZINB was used to model the zero inflation and improve the accuracy and robustness. The basic idea of the zero-inflated model was to divide the count of events (feature value) into two cases: a Bernoulli process with feature value 0, and a Poisson or Negative Binomial process with feature value 0 or a positive integer. Therefore, ZINB was used to impute the "false zeros" arising from dropout events.
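The two-case idea behind the zero-inflated model can be written as a probability mass function: with probability π a zero comes from the Bernoulli (dropout) component, and otherwise the count follows a Negative Binomial distribution. The sketch below uses the common mean/dispersion parameterization (mean `mu`, dispersion `theta`), which is an assumption, since the section does not state how ScDA parameterizes the model; the function names and parameter values are hypothetical.

```python
import math


def nb_pmf(k, mu, theta):
    """Negative Binomial pmf with mean mu > 0 and dispersion theta > 0."""
    log_p = (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
             + theta * math.log(theta / (theta + mu))
             + k * math.log(mu / (theta + mu)))
    return math.exp(log_p)


def zinb_pmf(k, mu, theta, pi):
    """Zero-inflated NB: extra probability mass pi is placed on k = 0."""
    base = (1.0 - pi) * nb_pmf(k, mu, theta)
    return pi + base if k == 0 else base


# Hypothetical parameters: mean 2, dispersion 1.5, dropout probability 0.3.
p_zero_nb = nb_pmf(0, mu=2.0, theta=1.5)
p_zero_zinb = zinb_pmf(0, mu=2.0, theta=1.5, pi=0.3)
total = sum(zinb_pmf(k, 2.0, 1.5, 0.3) for k in range(500))
print(f"P(0) NB: {p_zero_nb:.3f}, P(0) ZINB: {p_zero_zinb:.3f}")
```

The zero-inflated version always assigns more probability to an observed zero than the plain Negative Binomial, which is what allows part of the zeros to be attributed to dropout rather than to true absence of expression.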
Denoising Autoencoder Model. The denoising AutoEncoder (DA) model was designed to reduce the dimension. The core idea of the AutoEncoder was the unsupervised feature extraction of a hidden layer in the neural network. In the AutoEncoder, the input and output were assumed to be as consistent as possible, that is, the loss of the model was minimized. The details of DA were as follows.
Input. The data was divided into N batches and input, as described in formula (3):
$$\{x^{(n)} \in R^{v}\}_{n=1}^{N} \qquad (3)$$
where $x^{(n)}$ was a batch, $R^{v}$ represented the dataset, and N represented the number of batches.
AutoEncoder model. The AutoEncoder was modeled as an Encoder and a Decoder, denoted as formula (4):
$$X = \sigma_a(W_a \cdot x + b_a) \in R^{u}, \qquad \hat{x} = \sigma_s(W_s \cdot X + b_s) \in R^{v} \qquad (4)$$
where $x$ was the input, $X$ was the output of the hidden layer, and $\hat{x}$ was the reconstructed output; $W_a$ and $b_a$ were the parameters of the Encoder, and $W_s$ and $b_s$ were the parameters of the Decoder; $\sigma_a$ and $\sigma_s$ represented the activation functions of the Encoder and Decoder, respectively.
Objective function. Different objective functions have been proposed according to different loss criteria, such as energy, entropy, and KL divergence. In ScDA, the objective function was designed by combining MSE, KL divergence, and a denoise function. MSE measures the difference between the predicted value and the true one: the smaller the MSE, the smaller the loss of energy. Therefore, the objective function based on MSE was constructed as formula (5):
$$\min J(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad (5)$$
where N was the number of samples, and $y_i$ and $\hat{y}_i$ were the input and output, respectively. KL divergence described the similarity of distribution between the reconstructed dataset and the input, and was denoted as formula (6):
$$D_{KL} = \sum_{n=1}^{N} p(y_n) \cdot \log \frac{p(y_n)}{p(\hat{y}_n)} \qquad (6)$$
So, the objective function introducing KL divergence was denoted as formula (7):
$$\min J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ (y_i - \hat{y}_i)^2 + p(y_i) \cdot \log \frac{p(y_i)}{p(\hat{y}_i)} \right] \qquad (7)$$
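As a rough numerical illustration of an objective that adds a KL term to the squared error, consider the sketch below. The text does not specify how p(y) is obtained, so the sketch assumes that the input and the reconstruction are turned into distributions by dividing by their sums; the function names and the example vectors are hypothetical.

```python
import math


def to_distribution(values):
    """Normalize positive values so that they sum to 1 (assumed definition of p)."""
    total = sum(values)
    return [v / total for v in values]


def objective(y, y_hat):
    """Average over samples of squared error plus the KL term p * log(p / q)."""
    p = to_distribution(y)
    q = to_distribution(y_hat)
    total = 0.0
    for yi, yhi, pi, qi in zip(y, y_hat, p, q):
        squared_error = (yi - yhi) ** 2
        kl_term = pi * math.log(pi / qi) if pi > 0 else 0.0
        total += squared_error + kl_term
    return total / len(y)


# Hypothetical input and reconstruction (all entries positive).
y = [0.2, 0.5, 0.3]
y_hat = [0.25, 0.45, 0.3]
loss_perfect = objective(y, y)
loss_approx = objective(y, y_hat)
print(loss_perfect, loss_approx)
```

A perfect reconstruction gives a loss of exactly zero, and any deviation in value or in distribution increases it, which is the behaviour the combined objective is meant to encode.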
To improve the performance of the AutoEncoder and enable the model to learn the rules for denoising, noise following the same distribution as the original data was added, and the denoise function was applied. Let the added noise be $\varepsilon^{(n)}$; the input was then described as formula (8):
$$\tilde{x}^{(n)} = x^{(n)} + \varepsilon^{(n)} \qquad (8)$$
Then, the AutoEncoder model was correspondingly changed to formula (9):
$$X = \sigma_a(W_a \cdot \tilde{x} + b_a), \qquad \hat{x} = \sigma_s(W_s \cdot X + b_s) \qquad (9)$$
The corresponding objective function was given by formula (10):
$$\min J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{x}^{(n)} - x^{(n)} \right\|^2 + \lambda \cdot R(\theta) \qquad (10)$$
where $X$ was the output of the hidden layer, $\hat{x}^{(n)}$ was the desired output, and $x^{(n)}$ was the input. The final objective function incorporating the denoise function was denoted as formula (11):
$$\min J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ p(y_i) \cdot \log \frac{p(y_i)}{p(\hat{y}_i)} + (\hat{y}_i - y_i)^2 \right] + \lambda \cdot R(\theta) \qquad (11)$$
Notably, the input of the model was the noised data, while the desired output was the de-noised data.
Clustering. Based on the low-dimensional data, the similarity matrix was calculated with the Pearson correlation coefficient, and HC was used to cluster the cells.
2.2 Evaluation Metric
To test the performance of ScDA, we calculated two classical clustering indices: Normalized Mutual Information (NMI) [26] and Adjusted Rand Index (ARI) [27]. NMI measures the consistency between the predicted clusters and the labels, and ranges from 0 to 1. ARI measures how likely a pair of samples with the same true label is to be assigned to the same predicted cluster, and ranges from −1 to 1. The higher the NMI or ARI, the better the clustering performance.
2.3 Datasets
We downloaded 16 real scRNA-seq datasets, of which 10 were large-scale datasets. They came from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/) and the European Bioinformatics Institute (EMBL-EBI) (https://www.ebi.ac.uk/). The true labels were provided with these datasets and were only used to calculate the evaluation metrics, so these datasets were considered gold-standard datasets. The datasets are listed in detail in Table 1.
Table 1. The description of the sixteen real scRNA-seq datasets.
| Accession ID | Dataset | Platform | Tissue | #Cell types | #Cells | #Genes | Reference |
|---|---|---|---|---|---|---|---|
| E-MTAB-3321 | Goolam | SMARTSeq2 | Mus musculus | 5 | 124 | 40315 | [28] |
| GSE60749 | Kumar | SMARTer C1 | Mus musculus | 3 | 268 | 45686 | [29] |
| GSE60749 | Kumar TCC | SMARTer C1 | Mus musculus | 3 | 268 | 803405 | [29] |
| GSE83139 | Wang | Smart-Seq2 | Homo sapiens | 8 | 479 | 20490 | [30] |
| GSE102299 | Wallrapp | Smart-Seq2 | Mus musculus | 3 | 752 | 45686 | [31] |
| GSE57872 | Patel | SMART-Seq | Homo sapiens | 7 | 864 | 65218 | [32] |
| GSE92332 | Haber | 10X | Mus musculus | 9 | 1522 | 20108 | [33] |
| E-MTAB-3929 | Petropoulos | Smart-Seq2 | Homo sapiens | 5 | 1529 | 65218 | [34] |
| GSE65525 | Klein | Droplet-seq | Mus musculus | 3 | 2717 | 24175 | [35] |
| GSE108097 | Han | Microwell-seq | Mouse bladder | 16 | 2746 | 20670 | [36] |
| GSE81076 | Grun | CEL-Seq | Homo sapiens | 5 | 3083 | 45686 | [37] |
| GSE98561 | Cao | sci-RNA-seq | Worm neuron | 10 | 4186 | 13488 | [38] |
| GSE127005 | Spallanzani | InDrops | Mus musculus | 3 | 5287 | 23725 | [39] |
| GSE110679 | Zemmour | InDrops | Mus musculus | 3 | 6106 | 23725 | [40] |
| GSE127892 | Sala | SmartSeq2 | Mus musculus | 32 | 10801 | 29474 | [41] |
| GSE81905 | Shekhar | Drop-seq | mouse | 19 | 27499 | 13166 | [42] |
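Since the true labels of these datasets are used only to compute the evaluation metrics of Sect. 2.2, it may help to see how one of them is obtained. The sketch below computes the Adjusted Rand Index with the standard pair-counting formula; the function name and the toy label vectors are hypothetical, and a library routine would normally be used in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts in the contingency table of the two labelings."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs placed together in both labelings, in the true one, in the predicted one.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy example: the second true cluster is split into two predicted clusters.
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
# Cluster names do not matter: a pure relabeling is a perfect match.
ari_relabelled = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
print(ari_split, ari_relabelled)
```

The index depends only on which cells are grouped together, not on the names given to the clusters, which is why it is suitable for comparing a clustering with reference cell types.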
3 Results
To test the performance of ScDA, we compared it with DBSCAN, k-means, and VASC. The first two are typical clustering methods, and the last is a state-of-the-art single-cell clustering algorithm. In ScDA, the Encoder and the Decoder both had 3 layers, with 256, 128, 64 and 64, 128, 256 neurons, respectively. The activation function in the hidden layers was ReLU, and the dropout rate was 0.5. The random seed of the experiment was set to 0. The parameters were initialized using a Gaussian distribution with a mean of 0 and a variance of 0.1. We used mini-batch stochastic gradient descent (Adam) as the optimizer, with the training batch size set to 128 and the learning rate set to 0.001.
3.1 Comparison of the Clustering Performance
We compared the clustering performance of DBSCAN, k-means, VASC, and ScDA in terms of NMI and ARI, as shown in Tables 2 and 3. From Tables 2 and 3, we observed that ScDA outperformed the other methods. ScDA achieved the best NMI on 12 datasets. In particular, among the 10 large-scale scRNA-seq datasets with more than 1000 cells, ScDA achieved the best NMI on 7 datasets. Compared with VASC, ScDA improved the average NMI by 6.08% and the average ARI by 0.13%. However, due to the problems of local minima, over-fitting, and sample dependence, the neural network did not perform well on some datasets. For example, the ARI on the Han dataset was only 0.546.
Table 2. The NMI of four clustering methods
| Datasets | DBSCAN | k-means | VASC | ScDA |
|---|---|---|---|---|
| Goolam | 0.879 | 0.888 | 0.928 | 0.905 |
| Kumar | 0.999 | 0.969 | 0.963 | 1.000 |
| Kumar TCC | 0.975 | 0.984 | 0.995 | 1.000 |
| Wang | 0.870 | 0.863 | 0.804 | 0.875 |
| Wallrapp | 0.882 | 0.849 | 0.876 | 0.886 |
| Patel | 0.811 | 0.806 | 0.761 | 0.835 |
| Haber | 0.820 | 0.799 | 0.675 | 0.828 |
| Petropoulos | 0.797 | 0.778 | 0.805 | 0.801 |
| Klein | 0.833 | 0.813 | 0.865 | 0.845 |
| Han | 0.611 | 0.606 | 0.553 | 0.624 |
| Grun | 0.764 | 0.764 | 0.726 | 0.791 |
| Cao | 0.730 | 0.718 | 0.665 | 0.758 |
| Spallanzani | 0.875 | 0.865 | 0.875 | 0.882 |
| Zemmour | 0.858 | 0.870 | 0.906 | 0.879 |
| Sala | 0.564 | 0.559 | 0.465 | 0.579 |
| Shekhar | 0.685 | 0.703 | 0.665 | 0.715 |
| Average | 0.805 | 0.796 | 0.773 | 0.820 |
Table 3. The ARI of four clustering methods
| Datasets | DBSCAN | k-means | VASC | ScDA |
|---|---|---|---|---|
| Goolam | 0.929 | 0.931 | 0.883 | 0.945 |
| Kumar | 0.987 | 0.976 | 0.976 | 1.000 |
| Kumar TCC | 0.998 | 0.965 | 0.997 | 1.000 |
| Wang | 0.852 | 0.852 | 0.836 | 0.866 |
| Wallrapp | 0.842 | 0.828 | 0.864 | 0.871 |
| Patel | 0.789 | 0.755 | 0.785 | 0.792 |
| Haber | 0.781 | 0.792 | 0.694 | 0.795 |
| Petropoulos | 0.768 | 0.751 | 0.816 | 0.783 |
| Klein | 0.789 | 0.815 | 0.889 | 0.816 |
| Han | 0.539 | 0.503 | 0.608 | 0.546 |
| Grun | 0.758 | 0.718 | 0.741 | 0.765 |
| Cao | 0.676 | 0.651 | 0.736 | 0.689 |
| Spallanzani | 0.842 | 0.845 | 0.883 | 0.861 |
| Zemmour | 0.865 | 0.820 | 0.916 | 0.865 |
| Sala | 0.546 | 0.564 | 0.486 | 0.564 |
| Shekhar | 0.688 | 0.702 | 0.686 | 0.715 |
| Average | 0.781 | 0.769 | 0.794 | 0.795 |
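The average improvements over VASC quoted in Sect. 3.1 are relative differences between the "Average" rows of Tables 2 and 3. The short check below reproduces them; the helper name `relative_improvement` is hypothetical, while the four numbers are the reported averages.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Reported averages: ScDA vs. VASC.
nmi_gain = relative_improvement(0.820, 0.773)
ari_gain = relative_improvement(0.795, 0.794)
print(f"NMI: {nmi_gain:.2f}%  ARI: {ari_gain:.2f}%")
```

The gain in ARI is therefore marginal compared with the gain in NMI, which is worth keeping in mind when reading the claim that ScDA outperforms the other methods.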
3.2 Comparison of the Run Time
In addition, we compared the run time of DBSCAN, k-means, VASC, and ScDA, as shown in Table 4.
Table 4. The run time of four clustering methods
| Datasets | DBSCAN | k-means | VASC | ScDA |
|---|---|---|---|---|
| Goolam | 12.1 | 93.5 | 30.8 | 27.2 |
| Kumar | 46.1 | 67.7 | 133.8 | 102.5 |
| Kumar TCC | 646.2 | 631.2 | 862.7 | 846.2 |
| Wang | 4.94 | 70.2 | 3588.1 | 197.6 |
| Wallrapp | 241.9 | 239.0 | 25398.7 | 406.7 |
| Patel | 265.9 | 277.3 | 401.4 | 354.4 |
| Haber | 461.8 | 310.6 | 608.6 | 567.4 |
| Petropoulos | 775.0 | 711.6 | 54402.8 | 976.8 |
| Klein | 51.8 | 60.7 | 162.8 | 145.5 |
| Han | 1591.9 | 1611.9 | 1831.5 | 1679.5 |
| Grun | 1479.7 | 1342.3 | 1764.8 | 1562.4 |
| Cao | 1813.0 | 1738.6 | 2105.6 | 1896.7 |
| Spallanzani | 902.0 | 883.1 | 17827.9 | 1236.8 |
| Zemmour | 1183.4 | 1240.3 | 17277.5 | 1405.5 |
| Sala | 1447.7 | 1523.4 | 52557.8 | 1698.1 |
| Shekhar | 775.3 | 616.1 | 38745.7 | 869.9 |
| Sum | 11698.74 | 11417.5 | 217700.5 | 13973.2 |
From Table 4, it was found that ScDA incurred a higher computational cost than DBSCAN and k-means, because HC applies a "bottom-up" or "top-down" strategy to obtain the hierarchical relationships among clusters. However, ScDA required less run time than VASC, especially on large-scale datasets.
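The overall cost relations stated above can be quantified from the "Sum" row of Table 4. The snippet below is a small arithmetic check; the variable names are hypothetical, and the totals are taken from that row with the first value read as the DBSCAN column.

```python
# Total run time over the 16 datasets (the "Sum" row of Table 4).
totals = {"DBSCAN": 11698.74, "k-means": 11417.5, "VASC": 217700.5, "ScDA": 13973.2}

speedup_vs_vasc = totals["VASC"] / totals["ScDA"]        # how many times faster than VASC
overhead_vs_kmeans = totals["ScDA"] / totals["k-means"]  # cost relative to k-means

print(f"ScDA vs VASC: {speedup_vs_vasc:.1f}x faster in total")
print(f"ScDA vs k-means: {overhead_vs_kmeans:.2f}x the total run time")
```

In total, ScDA is roughly an order of magnitude faster than VASC while costing about a fifth more than k-means.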
4 Conclusion
With the increase of large-scale scRNA-seq data, various special clustering methods have been proposed, which focus on dimensional reduction. However, a significant problem is that scRNA-seq data contains high noise and many dropout events, which limits the accuracy and robustness of clustering. The AutoEncoder is a classical neural network model with the advantage of accurately extracting nonlinear latent features, which helps to address these problems. Thus, we propose a novel AutoEncoder model, ScDA, which combines the denoise function with the AutoEncoder. In ScDA, the main work is how to denoise scRNA-seq data and improve the accuracy and robustness. To remove the noise in scRNA-seq data and improve the data quality, four steps were performed: outliers were smoothened to conform to the distribution; genes were sorted and partitioned into 20 boxes, and the entropy of each gene was calculated to filter out the ineffective genes; ZINB was used to model the dropout events and impute the "false" zeros; and normalization was performed to compress the values into the same interval. Furthermore, to improve the accuracy and robustness, a DA model was designed. We constructed the objective function by combining MSE, KL divergence, and the denoise function, so that the reconstructed output was similar to the input in both value and distribution. As a result, this model could further reduce noise and enhance robustness. ScDA achieved close-to-the-best clustering performance by combining various denoising methods. In particular, ScDA performed well on large-scale scRNA-seq datasets. However, the large number of parameters and the low convergence rate of the neural network require a large computational overhead, and future work should focus on how to optimize the model and reduce this overhead.
Funding Statement. This research was supported by the National Natural Science Foundation of China (No. 61762087, 61702555, 61772557), Hunan Provincial Science and Technology Program (2018WK4001), 111 Project (No. B18059), Guangxi Natural Science Foundation (No. 2018JJA170175).
Authors' Contributions. Conceptualization, X.Z., X.P., and J.W.; Methodology, X.Z., Y.L., and X.P.; Software, Y.L., and J.L.; Writing-Original Draft Preparation, X.Z., J.W., and X.P.; Visualization, Y.L.; Funding Acquisition, X.Z., and X.P.
References 1. Vitak, S.A., et al.: Sequencing thousands of single-cell genomes with combinatorial indexing. Nat. Methods 14(3), 302 (2017) 2. Stuart, T., Satija, R.: Integrative single-cell analysis. Nat. Rev. Genet. 20(5), 257–272 (2019) 3. Laehnemann, D., Kster, J., Szczurek, E., Mccarthy, D.J., Schnhuth, A.: Eleven grand challenges in single-cell data science. Genome Biol. 21(1), 31 (2020)
544
X. Zhu et al.
4. Wolf, F.A., et al.: PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20(1), 59 (2019) 5. Taiyun, K., Chen, I.R., Lin, Y., Wang, Y.Y., Yang, J., Yang, P.: Impact of similarity metrics on single-cell RNA-seq data clustering. Brief. Bioinform. 20(6), 2316–2326 (2018) 6. Eling, N., Morgan, M.D., Marioni, J.C.: Challenges in measuring and understanding biological noise. Nat. Rev. Genet. 20(9), 536–548 (2019) 7. Andrews, T.S., Hemberg, M., Birol, I.: Dropout-based feature selection for scRNASeq. Bioinformatics 35(16), 2865–2867 (2018) 8. Wang, D.: VASC: Dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genomics Proteomics Bioinformatics 16(5), 320–331 (2018) 9. Raphael, P., Li, Z., Kuang, R.: Machine learning and statistical methods for clustering singlecell RNA-sequencing data. Brief. Bioinform. 4, 4 (2019) 10. Kiselev, V.Y., et al.: SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14(5), 483 (2017) 11. Jia, C., Hu, Y., Derek, K., Junhyong, K., Li, M., Zhang, N.R.: Accounting for technical noise in differential expression analysis of single-cell RNA sequencing data. Nucleic Acids Res. 19, 10978 (2017) 12. Liu, Z., et al.: Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nat. Commun. 8(1), 22 (2017) 13. Saelens, W., Cannoodt, R., Todorov, H., Saeys, Y.: A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37(5), 547–554 (2019) 14. Llorens-Bobadilla, E., Zhao, S., Baser, A., Saiz-Castro, G., Zwadlo, K., Martin-Villalba, A.: Single-cell transcriptomics reveals a population of dormant neural stem cells that become activated upon brain injury. Cell Stem Cell 17(3), 329–340 (2015) 15. Spyros, D., et al.: Hayden, Barres BA, Quake SR: A survey of human brain transcriptome diversity at the single cell level. Proc. Natl. Acad. Sci. 112(23), 7285–7290 (2015) 16. 
GiniClust3: a fast and memory-efficient tool for rare cell type identification. BMC Bioinformatics 21(1), 158 (2020) 17. Zhu, X., Zhang, J., Xu, Y., Wang, J., Peng, X., Li, H.-D.: Single-cell clustering based on shared nearest neighbor and graph partitioning. Interdisc. Sci.: Computat. Life Sci. (2020) 18. Yip, S.H., Chung, S.P., Wang, J.: Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data. Brief. Bioinform. 4, 4 (2018) 19. Becht, E., et al.: Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019) 20. Tian, T., Wan, J., Song, Q., Wei, Z.: Clustering single-cell RNA-seq data with a model-based deep learning approach. Nat. Mach. Intell. 1(4), 191–198 (2019) 21. Wan, S., Kim, J., Won, K.J.: SHARP: hyper-fast and accurate processing of single-cell RNAseq data via ensemble random projection. Genome Res. 30(2), gr.254557.254119 (2020) 22. Liang, Z., Li, M., Zheng, R., Tian, Y., Wang, J.: SSRE: cell type detection based on sparse subspace representation and similarity enhancement. Genomics Proteomics Bioinformatics S1762–0229(21), 00038–33 (2020) 23. Song, J., Liu, Y., Zhang, X., Wu, Q., Yang, C.: Entropy subspace separation-based clustering for noise reduction (ENCORE) of scRNA-seq data. Nucleic Acids Res. 49(3), e18 (2020)‘ 24. Kiselev, V.Y., Andrews, T.S., Hemberg, M.: Challenges in unsupervised clustering of singlecell RNA-seq data. Nat. Rev. Genet. 20(5), 273–282 (2019) 25. Soneson, C., Robinson, M.D.: Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15(4), 255–261 (2018) 26. Estevez, P.A., Tesmer, M., Perez, C.A., Zurada, J.M.: Normalized mutual information feature selection. IEEE Trans. Neural Netw. 20(2), 189–201 (2009) 27. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
ScDA: A Denoising AutoEncoder Based Dimensionality Reduction
545
28. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell 165(1), 61–74 (2016) 29. Kumar, R.M., et al.: Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature 16(7529), 56–61 (2014) 30. Wang, Y.J., et al.: Single-cell transcriptomics of the human endocrine pancreas. Diabetes 65(10), 3028 (2016) 31. Wallrapp, A., et al.: The neuropeptide NMU amplifies ILC2-driven allergic lung inflammation. Nature 549(7672), 351–356 (2017) 32. Patel, A.P., et al.: Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344(6190), 1396–1401 (2014) 33. Haber, A.L., et al.: A single-cell survey of the small intestinal epithelium. Nature 551(7680), 333–339 (2017) 34. Petropoulos, S., et al.: Single-cell RNA-Seq reveals lineage and x chromosome dynamics in human preimplantation embryos. Cell 165(4), 1012–1026 (2016) 35. Klein, A., et al.: Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161(5), 1187–1201 (2015) 36. Han, X., Wang, R., Zhou, Y., Fei, L., Guo, G.: Mapping the mouse cell atlas by Microwell-Seq. Cell 172(5), 1091-1107.e1017 (2018) 37. Grün, D., et al.: De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19(2), 266–277 (2016) 38. Cao, J., et al.: Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357(6352), 661 (2017) 39. Spallanzani, R.G., Zemmour, D., Xiao, T., Jayewickreme, T., Mathis, D.: Distinct immunocyte-promoting and adipocyte-generating stromal components coordinate adipose tissue immune and metabolic tenors. Sci. Immunol. 4(35), eaaw3658 (2019) 40. Zemmour, D., Zilionis, R., Kiner, E., Klein, A.M., Mathis, D., Benoist, C.: Single-cell gene expression reveals a landscape of regulatory T cell phenotypes shaped by the TCR. Nat. Immunol. 19(3), 291–301 (2018) 41. 
Frigerio, C.S., et al.: The major risk factors for Alzheimer’s disease: age, sex, and genes modulate the microglia response to Aβ plaques. Cell Rep. 27(4), 1293-1306.e1296 (2019) 42. Shekhar, K., et al.: Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166(5), 1308-1323.e1330 (2016)
Others
PickerOptimizer: A Deep Learning-Based Particle Optimizer for Cryo-Electron Microscopy Particle-Picking Algorithms
Hongjia Li1,2, Ge Chen2,3, Shan Gao1,2, Jintao Li1, and Fa Zhang1(B)
1 High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China {lihongjia18z,gaoshan,jtli,zhangfa}@ict.ac.cn
2 University of Chinese Academy of Sciences, Beijing, China
3 Domain-Oriented Computing Technology Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China [email protected]
Abstract. Cryo-electron microscopy single particle analysis requires tens of thousands of particle projections for the structural determination of macromolecules. To free researchers from laborious particle picking work, a number of fully automatic and semi-automatic particle picking approaches have been proposed. However, due to the presence of carbon and different types of high-contrast contaminations, these approaches tend to select a non-negligible number of false-positive particles, which affects the subsequent 3D reconstruction. To overcome this limitation, we present a deep learning-based particle pruning algorithm, PickerOptimizer, to separate erroneously picked particles from the correct ones. PickerOptimizer trains a convolutional neural network based on transfer learning techniques, where the pre-trained model maintains strong generalization ability and can be quickly adapted to the characteristics of a new dataset. Here, we build the first cryo-EM dataset for image classification pre-training, which contains particles, carbon regions and high-contrast contaminations from 14 different EMPIAR entries. PickerOptimizer works by fine-tuning the pre-trained model with only a few manually labeled samples from new datasets. The experiments carried out on several public datasets show that PickerOptimizer is a very efficient approach for particle post-processing, achieving F1 scores above 90%. Moreover, the method is able to identify false-positive particles more accurately than other pruning strategies. A case study further shows that PickerOptimizer can improve conventional particle pickers and complement deep learning-based ones. The source code, pre-trained models and datasets are available at https://github.com/LiHongjia-ict/PickerOptimizer/.
Keywords: Cryo-electron microscopy · Particle pruning · Deep learning · Transfer learning
© Springer Nature Switzerland AG 2021
Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 549–560, 2021.
https://doi.org/10.1007/978-3-030-91415-8_46
1 Introduction
With the development of cryo-electron microscopy (cryo-EM) single particle analysis (SPA), many high-resolution three-dimensional (3D) structures of proteins and macromolecular complexes have been reported [1,2]. Reconstructing the 3D structure of a macromolecule generally requires tens of thousands of high-quality single-particle projections, so manual picking is time-consuming and tedious, and may introduce manual bias into the procedure. In contrast, automatic particle picking algorithms are well suited to this high-throughput task, since they can quickly collect a massive number of particles with decent performance. Nevertheless, the performance of these algorithms is degraded by the low signal-to-noise ratio and by the presence of high-contrast artifacts and contaminants in the micrographs. The current mainstream particle picking algorithms can be divided into two categories: conventional particle pickers and deep learning-based ones. Conventional pickers, such as Relion [3,4], generally use basic morphological image features or template-matching algorithms to identify particles. Because the features used for picking are extremely simple, these methods suffer from high false-positive rates, typically ranging from 10% to more than 25% [5]. Deep learning-based pickers alleviate the problem to a certain extent, benefiting from the powerful learning capability of convolutional neural networks (CNNs). These pickers typically either provide general models trained with large datasets, such as crYOLO [6] and PIXER [7], or train a new model for a specific dataset with manually labeled positive samples (e.g. particles) and/or negative samples (e.g. carbon, contaminants or background regions), such as Topaz [8]. These learning-based particle pickers are more robust to false positives but still produce wrong picks.
As a consequence, it is common practice in the field to perform several pre-processing or post-processing steps to remove incorrectly selected particles. Some work has already been proposed to reduce the false-positive rate of particle picking. For example, the em hole finder [9] program and EMHP [10] are designed to prevent particle selection in carbon regions based on morphological image processing, image filtering and thresholding operations. MAPPOS [11] is a bagging classifier employed to predict which already picked particles are good and which are false positives, based on image features such as phase symmetry or dark dot dispersion. Limited by the weak representation ability of traditional image features and by some specific shortcomings, these methods have not been widely adopted by researchers. Benefiting from the powerful feature mining and learning potential of deep learning, some deep learning-based methods have been proposed. Sanchez-Garcia et al. [12] proposed Deep Consensus, a deep learning-based approach for particle pruning in cryo-EM, which works by computing a smart consensus over the output of different particle-picking algorithms, resulting in a set of particles with a lower false-positive ratio than the initial set obtained by the pickers. However, users are required to provide particle selection results from at least two different algorithms, which is time-consuming and laborious. Sanchez-Garcia et al. [13] developed MicrographCleaner, a deep learning package designed to discriminate between regions of micrographs that are suitable for particle picking and those that are not. The method works in an automated fashion by providing a general model trained on a dataset of 539 manually segmented micrographs. However, the performance of this method is directly affected by the consistency between the training dataset and the new dataset. In response to these challenges, we aim to provide a model that has strong generalization ability and can be quickly adapted to the characteristics of a new dataset with minimal human intervention. Therefore, we have developed PickerOptimizer, a deep learning-based particle pruning algorithm that classifies the preliminarily selected particles into true-positive particles and false-positive particles. The optimizer is trained based on transfer learning techniques, where knowledge from a large-scale natural image-classification task is leveraged to obtain feature extraction ability. Considering the huge difference between cryo-EM images and natural images, we constructed a cryo-EM dataset for image classification which contains positive samples (particles) and negative samples (carbon regions and high-contrast contaminations) collected from 14 different EMPIAR [14] entries. The classifier is therefore first pre-trained with a combination of a natural image dataset and a cryo-EM image dataset, and then fine-tuned with only a few manually labeled samples from the new dataset to adapt to new features. PickerOptimizer was evaluated with several well-known public datasets, achieving F1 scores above 90%. Moreover, we compared our method with a commonly used particle pruning algorithm, MicrographCleaner, on dealing with different types of contaminations, where PickerOptimizer achieved better or equivalent performance.
A use case study further shows that PickerOptimizer is able to improve conventional particle pickers and complement deep learning-based ones, since it can mask out most false-positive particles while not affecting the true-positive ones. The source code, pre-trained models and datasets are available at https://github.com/LiHongjia-ict/PickerOptimizer/.
2 Materials and Methods
2.1 Classification Datasets
In this work, to give the model strong generalization ability and let it adapt quickly to the characteristics of cryo-EM images, the training data for the pre-trained model come from two sources: natural images and cryo-EM images. For natural images, we chose ImageNet [15], one of the most widely used large-scale datasets for benchmarking image classification algorithms. The dataset contains about 1.2 million images divided into 1000 categories, enabling the model to learn basic image features. However, the images in ImageNet are quite different in nature from micrographs. We therefore constructed a cryo-EM dataset for image classification to familiarize the model with the characteristics of micrographs. The dataset contains positive samples (particles) and negative samples (carbon or contaminant regions) manually selected from 14 different EMPIAR entries, as shown in Table 1. The whole dataset contains 3600 images and is divided into
36 categories (14 kinds of particles, 14 kinds of high-contrast contaminants and 8 kinds of carbon regions) with 100 images in each category. Since the pre-trained model has strong feature extraction capabilities and is familiar with the features of micrographs, only a small amount of data, such as dozens of images, is required for fine-tuning on a new dataset. With the new data, the classifier can quickly learn the new features and obtain the capability to rule out false-positive particles from all picked particles in the new dataset.

Table 1. Details of the 14 public datasets in the cryo-EM dataset.
Dataset | Biological structure | Reference
EMPIAR-10406 | 70S ribosome | [16]
EMPIAR-10059 | NCP-CHD4 complexes | [17]
EMPIAR-10285 | P-Rex1–G-beta-gamma signaling scaffold | [18]
EMPIAR-10333 | Human FACT | [19]
EMPIAR-10590 | Human BAF complex | [20]
EMPIAR-10283 | Mammalian ATP synthase tetramer | [21]
EMPIAR-10454 | Saccharomyces cerevisiae fatty acid synthase complex | [22]
EMPIAR-10470 | Saccharomyces cerevisiae fatty acid synthase complex | [22]
EMPIAR-10099 | Hrd1 and Hrd3 complex | [23]
EMPIAR-10350 | LetB from E. coli | [24]
EMPIAR-10399 | Arabinofuranosyltransferase AftD | [25]
EMPIAR-10063 | Activated NAIP2/NLRC4 Inflammasome | [26]
EMPIAR-10097 | Influenza Hemagglutinin Trimer | [27]
EMPIAR-10077 | Elongation factor SelB | [28]
2.2 Algorithm
Figure 1A shows the architecture of PickerOptimizer. The neural network mainly consists of two parts: a feature extractor and a classifier. Here, we adopt residual blocks to construct a basic feature extractor comprising several layers, including convolution (Conv), max-pooling (MaxPooling) and residual blocks (ResBlock). The key residual block is shown in Fig. 1B; it contains composite operations such as convolution, batch normalization (BN) and rectified linear units (ReLU). The feature extractor is initialized with the weights of the pre-trained model, which is trained on the natural and cryo-EM datasets, and is then fine-tuned with the new dataset to obtain better extraction capabilities for new features. A given input sample, represented as a 2D array in $\mathbb{R}^{n \times n}$, first passes through the shallow Conv and max-pooling layers, and the extracted features are used as input for the following residual blocks. After a series of convolution blocks, the input patch is represented as a set of highly abstract feature maps for final classification.
[Figure 1. (A) Feature extractor: Input → Conv → MaxPooling → Residual block → … → Residual block; classifier: GAP → Linear → class output. (B) Residual block: input → 3x3 Conv → BatchNorm → ReLU → 3x3 Conv → BatchNorm → ReLU → output.]
Fig. 1. The architecture of PickerOptimizer. (A) The model framework of PickerOptimizer. (B) The composite function of each layer in the residual block, including convolution (Conv), batch normalization (BN) and rectified linear units (ReLU).
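The classifier stage in Fig. 1A (GAP followed by a single linear layer) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the feature-map shape (512 channels of 7 × 7) and the helper names are assumptions chosen only to show how global average pooling shrinks the input of the linear layer, and therefore its parameter count, compared with a fully connected layer applied to flattened feature maps.

```python
# Illustrative sketch only: all sizes below are hypothetical, not from the paper.

def global_average_pool(feature_maps):
    """Average each 2D feature map to one value, giving a 1D vector of length C."""
    pooled = []
    for fmap in feature_maps:                  # fmap is a list of rows
        total = sum(sum(row) for row in fmap)
        count = sum(len(row) for row in fmap)
        pooled.append(total / count)
    return pooled


def linear(vector, weights, bias):
    """Plain fully connected layer: one score per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def fc_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features


# Assumed shape of the final feature maps: C channels of H x W, 3 classes.
C, H, W, NUM_CLASSES = 512, 7, 7, 3

params_flatten_fc = fc_params(C * H * W, NUM_CLASSES)  # FC on flattened maps
params_gap_fc = fc_params(C, NUM_CLASSES)              # GAP first, then FC

# Tiny forward pass: two 2x2 feature maps, two classes.
maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]
vec = global_average_pool(maps)
scores = linear(vec, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.5])

print(params_flatten_fc, params_gap_fc)  # 75267 1539
print(vec, scores)                       # [4.0, 1.0] [4.0, 1.5]
```

Under these assumed sizes the pooled head needs roughly fifty times fewer parameters, which is the kind of reduction that matters when only a few dozen labeled images are available for fine-tuning.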
The classifier part is designed for a specific dataset. The weights of the classifier depend only on the new data, and the weights of the pre-trained classifier are discarded. Usually, a fully connected (FC) layer is used to map the final feature maps to a categorical vector that gives the probability assigned to each class. To increase non-linearity, a traditional CNN typically contains multiple FC layers. However, FC layers hold most of the parameters of the network, which can easily cause overfitting, and our method in particular strives to train the model with as little data as possible. Therefore, to reduce the number of model parameters and avoid overfitting, global average pooling (GAP) is introduced to replace the first FC layer [29]. GAP averages each feature map over its whole spatial extent, so the set of feature maps becomes a 1D vector. The last FC layer, which has fewer parameters, then maps this 1D vector to the category vector.
2.3 Training Details
In this work, the neural networks of PickerOptimizer are implemented with PyTorch [30]. The training of the model is divided into two steps. The first step is to train a pre-trained model with strong generalization ability, and the second step is to fine-tune the model so that it can extract new features.
The pre-training of the neural network was carried out with both ImageNet and the cryo-EM dataset. For simplicity, we directly reuse the pre-trained ResNet [31] parameters provided by PyTorch, which were obtained by training on the ImageNet dataset, as the initial parameters. On this basis, the model is re-trained with the cryo-EM dataset. The network is trained for 200 epochs on one 2080TI GPU with a batch size of 256. Based on training experience, we used Stochastic Gradient Descent (SGD) [32] with momentum as the optimizer; the learning rate is set to 0.1 and scaled down by a factor of 0.1 after every 7 epochs. We use a weight decay of 0.0001 and a momentum of 0.9. A weighted cross-entropy loss is used.
For fine-tuning, the network is trained for 30 epochs on one 2080TI GPU with a batch size of 30. The learning rate is set to 0.01 and scaled down by a factor of 0.1 after every 7 epochs. The same SGD with momentum and weighted cross-entropy loss are used. Since only a small amount of data is used for fine-tuning, data augmentation, including random horizontal flips and random rotations, is applied before the training images are fed to the network.
2.4 Evaluation Metrics
To quantify the performance of PickerOptimizer, we chose the F1 score, which reflects the balance between recall and precision, as the metric. Since a dataset may contain two categories (particles and either carbon regions or high-contrast contaminations) or three categories (particles, carbon regions and high-contrast contaminations), we chose the macro-F1 metric, which weighs the F1 achieved on each label equally, as follows:

$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2TP}{2TP + FN + FP} \qquad (1)$$

$$\mathrm{Macro\text{-}F1} = \frac{1}{N} \sum_{i=1}^{N} F1_i \qquad (2)$$

In Eq. 1, $TP$ denotes true positives, $FN$ false negatives and $FP$ false positives. In Eq. 2, $N$ is the number of categories and $F1_i$ is the F1 score achieved on the $i$-th category.
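As a worked illustration of Eqs. 1 and 2, the following minimal sketch computes the per-class F1 from TP, FP and FN counts and averages it over the categories. The function names and the toy labels are hypothetical, and returning 0 for a class with no true or predicted samples is an assumed convention that the paper does not specify.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one category, using 2TP / (2TP + FN + FP) as in Eq. 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fn + fp
    # Assumed convention: F1 is 0 when the class never occurs at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores, as in Eq. 2."""
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy three-class example (hypothetical labels, two samples per class).
labels = ["particle", "carbon", "contaminant"]
y_true = ["particle", "particle", "carbon", "carbon", "contaminant", "contaminant"]
y_pred = ["particle", "carbon", "carbon", "carbon", "contaminant", "particle"]

per_class = {c: f1_per_class(y_true, y_pred, c) for c in labels}
score = macro_f1(y_true, y_pred, labels)
print(per_class)
print(round(score, 4))  # 0.6556
```

Here the per-class scores are 0.5 (particle), 0.8 (carbon) and 2/3 (contaminant), and each contributes equally to the average regardless of how many samples the class has.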
3 Results
3.1 Classification Performance
The performance of our approach has been assessed on 14 publicly available datasets, as shown in Table 2. To verify the performance improvement brought by the transfer learning techniques and the constructed cryo-EM dataset, we trained three types of models: NoPre means training the neural network from scratch, while Pre and Pre+ mean fine-tuning pre-trained models. The pre-trained model of Pre comes from the natural image dataset ImageNet, and the pre-trained model of Pre+ comes from the combination of natural images and cryo-EM images. Our method strives to use minimal data to achieve the best classification performance; therefore, we further analyze the performance of the classifier trained with different amounts of data. We set three data sizes, denoted as 10 shots, 20 shots and 30 shots, corresponding to 10, 20 and 30 images in each category, respectively.
It should be noted that the pre-trained model of Pre+ is trained with the cryo-EM dataset, which contains the public datasets shown in Table 2. Therefore, to avoid overlap between the data used for pre-training and for fine-tuning, the pre-training data for each experiment contain 13 of the datasets in the cryo-EM dataset. For example, to obtain the classifier for EMPIAR-10406, ImageNet and the whole cryo-EM dataset except EMPIAR-10406 are used for pre-training. Since the amount of data required for fine-tuning is relatively small, to avoid the bias introduced by manual selection of the training set, we randomly sample the training set from the whole dataset and use the remaining data as the validation set. Furthermore, a classifier obtained from a single random sample may not be representative; therefore, we sampled multiple times (about ten thousand times) to obtain multiple classifiers and calculated the average accuracy (macro-F1 score) of these classifiers as the overall correct rate of the model.

Table 2. The classification performance (macro-F1 score) of PickerOptimizer on different datasets.
Dataset | NoPre, 10 shots | NoPre, 20 shots | NoPre, 30 shots | Pre, 10 shots | Pre, 20 shots | Pre, 30 shots | Pre+, 10 shots | Pre+, 20 shots | Pre+, 30 shots
Three classes
10406 | 55.73 | 59.98 | 66.24 | 83.65 | 91.57 | 93.79 | 91.47 | 94.91 | 94.62
10059 | 57.52 | 66.77 | 74.82 | 87.70 | 93.92 | 95.75 | 98.99 | 97.97 | 97.53
10285 | 65.68 | 82.01 | 93.49 | 92.00 | 96.45 | 97.64 | 97.75 | 98.85 | 98.87
10333 | 73.66 | 81.47 | 91.60 | 91.80 | 95.91 | 97.61 | 97.54 | 98.81 | 98.55
10590 | 81.36 | 84.61 | 88.07 | 94.65 | 96.58 | 97.69 | 98.99 | 99.30 | 99.30
10283 | 62.32 | 67.71 | 78.72 | 89.00 | 94.29 | 96.04 | 96.25 | 96.97 | 96.63
10454 | 38.13 | 45.81 | 56.83 | 52.10 | 64.59 | 78.80 | 75.20 | 84.10 | 87.38
10470 | 33.60 | 40.75 | 51.65 | 51.21 | 67.20 | 81.17 | 78.30 | 86.05 | 89.18
10099 | 80.61 | 89.67 | 96.05 | 88.86 | 95.25 | 97.61 | 96.68 | 98.26 | 98.28
10350 | 69.71 | 82.76 | 90.24 | 89.67 | 94.17 | 96.45 | 95.91 | 98.17 | 98.10
10399 | 92.05 | 93.85 | 96.08 | 91.73 | 95.24 | 97.61 | 96.91 | 98.21 | 98.39
10063 | 84.75 | 94.14 | 97.51 | 92.31 | 97.70 | 97.88 | 97.75 | 98.69 | 98.63
10097 | 76.96 | 84.11 | 90.11 | 82.28 | 89.56 | 94.14 | 94.52 | 96.54 | 96.61
10077 | 54.20 | 76.47 | 90.34 | 91.48 | 95.51 | 95.81 | 95.71 | 96.74 | 96.58
Two classes
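The evaluation protocol described above (hold one EMPIAR entry out of pre-training, draw k random images per category for fine-tuning, validate on the rest, and average over repeated draws) can be sketched as follows. This is only an outline under stated assumptions: `evaluate` is a hypothetical callback standing in for fine-tuning and scoring a classifier, and the function names, seed and repeat count are illustrative rather than taken from the paper.

```python
import random


def pretraining_entries(all_entries, held_out):
    """Entries used for cryo-EM pre-training: everything except the target."""
    return [entry for entry in all_entries if entry != held_out]


def k_shot_split(labels, k, rng):
    """Pick k random samples per category for fine-tuning; the rest validate."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train = []
    for indices in by_class.values():
        train.extend(rng.sample(indices, k))
    chosen = set(train)
    val = [idx for idx in range(len(labels)) if idx not in chosen]
    return sorted(train), val


def mean_score(labels, k, n_repeats, evaluate, seed=0):
    """Average the score returned by `evaluate` over repeated random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train_idx, val_idx = k_shot_split(labels, k, rng)
        scores.append(evaluate(train_idx, val_idx))
    return sum(scores) / len(scores)


# Hypothetical target dataset: 3 categories with 100 images each.
labels = ["particle"] * 100 + ["carbon"] * 100 + ["contaminant"] * 100
train_idx, val_idx = k_shot_split(labels, k=10, rng=random.Random(0))

entries = ["EMPIAR-10406", "EMPIAR-10059", "EMPIAR-10285"]  # subset of Table 1
pretrain = pretraining_entries(entries, held_out="EMPIAR-10406")

# Placeholder scorer (assumption): reports the validation fraction only.
dummy = mean_score(labels, k=10, n_repeats=5,
                   evaluate=lambda tr, va: len(va) / len(labels))
```

With 10 shots per category this yields 30 fine-tuning images and 270 validation images per draw; in a real run the placeholder scorer would be replaced by fine-tuning the pre-trained model and computing the macro-F1 on the validation indices.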
It can be seen from Table 2 that, compared with NoPre, which is trained from scratch, the two fine-tuned models show great advantages, improving the macro-F1 score by 20% to 30%. This is in line with expectations: since the amount of training data is relatively small, the NoPre models can only learn limited knowledge and are easily overfitted. In contrast, owing to the powerful generalization ability of the pre-trained model, Pre and Pre+ achieve macro-F1 scores of more than 90% in most cases. Furthermore, benefiting from the unique features learned from the cryo-EM dataset, the Pre+ model achieves a relatively higher macro-F1 score. It is worth noting that the Pre+ model is also more robust to the amount of training data. With 20 shots and 30 shots, its macro-F1 scores are almost the same. With 10 shots, its macro-F1 score drops much less than for NoPre and Pre, and in some cases (e.g. 10059) is even higher than with 20 or 30 shots.
3.2 Comparison with Other Pruning Approaches
Pollutants in micrographs can usually be divided into three types: carbon regions, high-contrast contaminations, or both. Here, we analyzed the performance of PickerOptimizer in these three situations (as shown in Fig. 2). In addition, we chose MicrographCleaner, a commonly used pruning approach, as a comparison. The particles selected by the Relion autopicker are used as the initially picked particles to be optimized. Fig. 2 shows the particles picked by the Relion autopicker (the first row), the particles post-processed with MicrographCleaner (the second row) and the particles post-processed with PickerOptimizer (the last row).
[Figure 2: 3 × 3 grid of micrographs; rows labeled RA, RA-MC and RA-PO.]
Fig. 2. The comparison of MicrographCleaner (MC) and PickerOptimizer (PO) on dealing with different types of pollutants, including carbon (the first column), ice-contaminated areas (the second column) and both (the third column). The first row corresponds to particles picked by the Relion autopicker (RA), the second row corresponds to the remaining particles after applying MicrographCleaner (RA-MC) and the last row corresponds to the remaining particles after applying PickerOptimizer (RA-PO).
It can be seen that the performance of the Relion autopicker is quite poor in all three cases. It cannot distinguish between positive and negative samples, and incorrectly recognizes many carbon regions and high-contrast contaminations as particles. Therefore, further particle optimization is required. In the case where only a carbon region exists (the first column in Fig. 2), both MicrographCleaner and PickerOptimizer perform excellently and avoid the particles picked in the carbon region without ruling out true-positive particles. However, when there are many ice contaminations in the micrographs (the second column in Fig. 2), although MicrographCleaner can filter out most of the negative samples, some obvious ice pollution is still left. PickerOptimizer performs better in this case, since almost all contaminations and nearby affected particles are identified and ruled out, as shown in the last row of Fig. 2. In the presence of both carbon and contaminations (the third column in Fig. 2), both MicrographCleaner and PickerOptimizer again perform well on the carbon area, but MicrographCleaner is less effective than PickerOptimizer in dealing with contaminations, since some ice is still erroneously identified as particles. Our method avoids ice-contaminated areas more accurately. Here, we used the recommended default threshold of 0.2 in all MicrographCleaner experiments.
3.3 Use Case
[Figure 3 panels, by column: Raw, RA, RA-PO; CA (Thr=0.5), CA (Thr=0.3), CA-PO (Thr=0.3); T (Thr=2), T (Thr=0), T-PO (Thr=0).]
Fig. 3. PickerOptimizer improves particle picking on the EMPIAR-10590 dataset. Coordinates selected with the Relion autopicker (RA), the Cryolo pre-trained general model (CA) and Topaz (T) are displayed in columns one to three, respectively. The top row images correspond to the raw micrograph and the remaining particles after applying a higher threshold to the low-threshold Cryolo general and Topaz solutions. The second row images correspond to the low-threshold Cryolo general and Topaz solutions, and the last row images correspond to the remaining particles after applying PickerOptimizer (PO) to the low-threshold Cryolo general and Topaz solutions.
In this section, we present an example in which both conventional and deep learning-based particle pickers struggle to identify particles in problematic regions (carbon areas and high-contrast contaminations), and thus both could benefit from PickerOptimizer. Here we chose the Topaz and Cryolo particle pickers as deep learning representatives. Topaz acts as the semi-automatic particle picker and was trained with about 800 particles picked from 40 micrographs. The Cryolo general model, which does not require any training, was employed as a fully automatic one. The Relion autopicker was chosen as the representative of conventional particle pickers. As illustrated in Fig. 3, both the Relion and the Cryolo particle pickers tend to pick particles located in the carbon and ice-contaminated areas, whereas the Topaz particle picker avoids most of the carbon region, although it still selects many false positives in ice-contaminated areas. It is interesting to note that, although the number of particles picked at the carbon area/edge and in ice-contaminated areas can be decreased by using stricter thresholds, this comes at the cost of ruling out true-positive particles, especially for Topaz. Moreover, the pickers still incorrectly select many small contaminants as particles, such as those marked with red boxes in Fig. 3. In contrast, PickerOptimizer is able to rule out those false-positive particles while not affecting the true-positive ones, as shown in the third row of Fig. 3; hence it can be used as a complement to any particle picker independently of threshold decisions.
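The difference between tightening a picker threshold and pruning with a classifier can be made concrete with a toy sketch. Everything in it is hypothetical: the pick identifiers, picker scores and predicted classes are invented for illustration, and the predicted class simply stands in for the output of a fine-tuned classifier.

```python
# Toy picks: (pick id, picker score, class predicted by a pruning classifier).
# All values are hypothetical and chosen only to illustrate the trade-off.
picks = [
    ("A", 0.9, "particle"),
    ("B", 0.4, "particle"),     # true particle with a low picker score
    ("C", 0.8, "contaminant"),  # ice contaminant with a high picker score
    ("D", 0.2, "carbon"),
]


def threshold_filter(picks, threshold):
    """Keep picks whose picker score reaches the threshold."""
    return [pid for pid, score, _ in picks if score >= threshold]


def classifier_prune(picks):
    """Keep picks that the classifier labels as particles, whatever their score."""
    return [pid for pid, _, predicted in picks if predicted == "particle"]


kept_by_threshold = threshold_filter(picks, 0.5)
kept_by_pruning = classifier_prune(picks)
print(kept_by_threshold)  # ['A', 'C']
print(kept_by_pruning)    # ['A', 'B']
```

In this toy case the stricter threshold discards particle B and still keeps contaminant C, whereas pruning by predicted class keeps both particles and drops both negatives, which mirrors the behavior described for Fig. 3.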
4 Conclusions
In this work, we present a deep learning-based particle pruning algorithm, PickerOptimizer, to separate erroneously picked particles from correct ones. PickerOptimizer implements a classification network with residual blocks, and the network is trained using transfer learning techniques. To obtain the pre-trained model, we constructed the first cryo-EM dataset for image classification, which contains samples from 14 public datasets. The model is pre-trained with the combination of a natural image dataset and this cryo-EM image dataset to gain powerful feature extraction capabilities. The final classifier is obtained by fine-tuning the pre-trained model with minimal data from new datasets. The experiments carried out on several public datasets show that PickerOptimizer is a very efficient approach for particle post-processing, achieving F1 scores above 90%. Moreover, we have compared PickerOptimizer with other pruning strategies and shown that it works better than, or at least as well as, commonly applied methods. A use case further shows that PickerOptimizer is able to improve conventional particle pickers and complement deep learning-based ones, thereby promoting subsequent processing.
Acknowledgments. The research is supported by the National Key Research and Development Program of China (No. 2017YFA0504702), the NSFC projects grants (61932018, 62072441 and 62072280), and Beijing Municipal Natural Science Foundation Grant (No. L182053).
SkeIn: Sketchy-Intensive Reading Comprehension Model for Multi-choice Biomedical Questions
Jing Li¹(✉), Shangping Zhong¹, Kaizhi Chen¹, and Taibiao Li²
¹ College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China, {N190320027,spzhong,ckz}@fzu.edu.cn
² The Fifth Hospital of Xiamen, Xiamen 361101, China
Abstract. Recent advances in Pre-trained Language Models (PrLMs) have driven general-domain multi-choice Machine Reading Comprehension (MRC) to a new level. However, they perform much worse on domain-specific MRC such as biomedicine, owing to the lack of effective matching networks that capture the relationships among documents, question and candidate options. In this paper, we propose a Sketchy-Intensive (SkeIn) reading comprehension model, which simulates the cognitive thinking process of humans to solve Chinese multi-choice biomedical MRC questions: (1) obtaining a general impression of the content with a sketchy reading process; (2) capturing dedicated information and the relationships among documents, question and candidate options with an intensive reading process, and making the final prediction. Experimental results show that our SkeIn model achieves substantial improvements over competing baseline PrLMs, with average accuracy improvements of +4.03% dev/+3.43% test, +2.69% dev/+3.22% test, and +5.31% dev/+5.25% test over directly fine-tuning BERT-Base, BERT-wwm-ext and RoBERTa-wwm-ext-large, respectively, indicating the effectiveness of SkeIn in enhancing the general performance of PrLMs on biomedical MRC tasks.
Keywords: Biomedical machine reading comprehension · Matching network · Multi-head attention · Fine-tuning PrLMs
1 Introduction
Biomedical Machine Reading Comprehension (BMRC) has been a popular and challenging task in Natural Language Understanding (NLU), aiming to teach machines to read and understand text materials and answer the relevant biomedical questions. Such a task has significant potential to benefit computer-aided diagnosis in real-world clinical scenarios, and has made rapid progress with the wide availability of MRC datasets [7,11,13,18] and deep-learning models [1,4,15], especially PrLMs such as BERT [3] and RoBERTa [9]. Most previous work on MRC runs along two lines: (1) training a more powerful PrLM, such as BioBERT [8] and CMedBERT [20]. However, the training process is time-consuming and resource-demanding, and for domain-specific tasks like Chinese BMRC the available corpora are considerably deficient; (2) designing a matching network for PrLMs to solve downstream tasks, such as MMM [6] and DCMN [19]. With the development of newer and more powerful PrLMs, the previous matching network patterns either bring very limited improvements [12,19] or are even beaten by the PrLMs [3,17], motivating us to explore a more effective pattern that supports PrLMs and further enhances their performance on BMRC tasks.
We thank the anonymous reviewers for their insightful comments and suggestions. This work is supported by the National Natural Science Foundation of China (NSFC No. 61972187).
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 561–571, 2021. https://doi.org/10.1007/978-3-030-91415-8_47
Table 1. An example of questions in the NMLEC.
In this paper, we mainly focus on BMRC with Chinese biomedical multi-choice questions, and propose a Sketchy-Intensive (SkeIn) reading comprehension model to solve these questions. The proposed model follows a human-like cognitive thinking process of reading comprehension: it first (1) reads through the whole content sketchily and obtains a general impression of documents, question and candidate options with an encoder module, then (2) re-reads intensively and captures dedicated information from the overall impression with a dual-path multi-head co-attention module, reconsiders the details, and ultimately makes the answer prediction. The details of the SkeIn model are given in Sect. 3.
We evaluate our SkeIn model on the five categories of the National Medical Licensing Examination in China (NMLEC)¹, which comprise over 136k multi-choice questions designed by human medical experts; Table 1 shows an example. We also compare SkeIn against seven strong baselines, and the experimental results show that our proposed model achieves substantial improvements, with average accuracy improvements of +4.03% dev/+3.43% test, +2.69% dev/+3.22% test, and +5.31% dev/+5.25% test over directly fine-tuning BERT-Base, BERT-wwm-ext and RoBERTa-wwm-ext-large, respectively, indicating the effectiveness and superiority of SkeIn on BMRC tasks. The major contributions can be summarized as follows: (1) We propose a novel Sketchy-Intensive reading comprehension model that performs a human-like cognitive thinking process of reading comprehension to solve Chinese multi-choice BMRC tasks; (2) We conduct detailed experiments with five state-of-the-art PrLMs as baselines, and also compare our model to the strongest competitor on five biomedical datasets for a fair comparison; (3) The experimental results show that our proposed SkeIn achieves substantial improvements over five strong baseline PrLMs, confirming the enhancements brought by our SkeIn model.
2 Dataset
2.1 Task Formulation
The NMLEC is an annual licensing examination that evaluates the professional knowledge and skills of those who want to be medical practitioners in China. As shown in Table 1, the task is made up of the following three parts:
– Question: problem description, either a single statement of a specific biomedical concept or a long paragraph describing a clinical scenario.
– Candidate Options: each question contains five candidate options, with one correct or best option and four incorrect or partially correct options.
– Question-Relevant Documents: a collection of supporting text material extracted from a large collection of documents, which contains information that helps answer questions correctly.
The goal is to read the documents and determine the one correct or best answer to the question among the five candidate options.
2.2 Data Analysis
We use the NMLEC as the data source of questions, and collect questions from the Internet and published materials. After removing duplicated or incomplete questions, we randomly split them into training, development and test sets with a ratio of 8:1:1, resulting in over 130k multi-choice questions in five biomedical categories: Clinic, Stomatology, Public Health, Traditional Chinese Medicine, and Integrated Traditional Chinese and Western Medicine (denoted as Chinese Western Medicine). The overall data statistics are summarized in Table 2 and Table 3.
¹ http://www.nmec.org.cn/Pages/ArticleList-12-0-0-1.html

Table 2. Number of questions in the training, development and test sets.
Category                        Train     Dev      Test
Clinic                          26,913    3,365    3,363
Stomatology                     21,159    2,645    2,645
Public health                   14,818    1,852    1,854
Traditional Chinese medicine    24,692    3,086    3,087
Chinese Western medicine        21,406    2,676    2,675
Total                           108,988   13,624   13,624
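The cleaning and 8:1:1 split described above can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`question`, `options`), the completeness rule (exactly five options), the seed, and the function name `clean_and_split` are hypothetical and are not taken from the paper.

```python
import random

def clean_and_split(records, seed=13):
    """Drop duplicated/incomplete records, shuffle, and cut into 8:1:1.

    Assumption: a record is "complete" if it has a question and five options.
    """
    seen, kept = set(), []
    for rec in records:
        options = rec.get("options", [])
        complete = bool(rec.get("question")) and len(options) == 5
        key = (rec.get("question"), tuple(options))
        if complete and key not in seen:
            seen.add(key)
            kept.append(rec)
    random.Random(seed).shuffle(kept)      # random split, reproducible via seed
    n_train = len(kept) * 8 // 10
    n_dev = len(kept) // 10
    train = kept[:n_train]
    dev = kept[n_train:n_train + n_dev]
    test = kept[n_train + n_dev:]
    return train, dev, test

# Toy pool: 1,000 distinct questions, one duplicate, one incomplete item.
pool = [{"question": f"q{i}", "options": list("ABCDE")} for i in range(1000)]
pool.append(dict(pool[0]))
pool.append({"question": "q-incomplete", "options": ["A", "B"]})
train, dev, test = clean_and_split(pool)
print(len(train), len(dev), len(test))
```

Integer arithmetic is used for the cut points so that the three subsets always partition the cleaned pool, whatever its size.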
Table 3. Statistics of questions and options. Length is calculated in characters and vocabulary size is measured in words by Pkuseg [10].
Category                        Question Len     Option Len      Vocab Size
                                Avg.     Max.    Avg.    Max.
Clinic                          46.51    332     9.07    100     46,175
Stomatology                     37.12    341     9.71    101     41,178
Public health                   36.76    352     9.72    125     35,790
Traditional Chinese medicine    32.24    340     7.25    200     38,904
Chinese Western medicine        30.63    280     7.77    130     38,187
We prepare supporting text materials with over 1 million articles of real-world facts from Chinese Wikipedia dumps². Then we produce question-relevant documents with a distributed search and analytics engine, ElasticSearch³, which supports very fast full-text searches. To measure the relevance of documents and search queries, we use BM25 ranking [14] as the similarity scoring function, which is defined as:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{m} \mathrm{IDF}(q_i) \cdot \frac{(k_1 + 1) \cdot f(q_i, D)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{D_{len}}{avgd_{len}}\right)}, \tag{1}$$

$$\mathrm{IDF}(q_i) = \ln\left(\frac{M - m(q_i) + 0.5}{m(q_i) + 0.5} + 1\right), \tag{2}$$

where $q_i$ and $f(q_i, D)$ represent the $i$-th search query term of $Q$ and its term frequency in the document $D$, respectively. $\mathrm{IDF}(q_i)$ represents the Inverse Document Frequency weight of the query term $q_i$. $D_{len}$ and $avgd_{len}$ are the length of the document $D$ and the average document length in the document collection, respectively. $M$ is the total number of documents in the collection, and $m(q_i)$ is the number of documents containing $q_i$. We keep the hyper-parameters fixed at $b = 0.75$ and $k_1 = 1.2$ in Elasticsearch; $b$ determines the effect of $D_{len}$ relative to $avgd_{len}$, and $k_1$ determines the term frequency saturation characteristics. The stronger the relevance between a document and a query, the larger the BM25 score. Specifically, for each question and its candidate options, we perform retrieval for each candidate option in turn, concatenating it with the question to form a search query to Elasticsearch; this is repeated for all options. The document with the highest BM25 score returned by each query is selected as supporting material for the MRC task.
² https://dumps.wikimedia.org/
³ https://www.elastic.co/
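To make Eqs. (1) and (2) and the per-option retrieval concrete, the scoring can be re-implemented in a few lines of plain Python. This is a sketch for illustration only, not Elasticsearch's implementation: documents and queries are assumed to be pre-tokenized lists of strings, and the names `bm25_scores` and `retrieve_per_option` as well as the toy data are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document against the query terms (Eqs. 1-2)."""
    M = len(docs)                                   # number of documents
    avgdlen = sum(len(d) for d in docs) / M         # average document length
    # m(q): number of documents that contain term q
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            f = tf[q]                               # term frequency f(q, D)
            idf = math.log((M - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5) + 1)
            denom = f + k1 * (1 - b + b * len(d) / avgdlen)
            score += idf * (k1 + 1) * f / denom
        scores.append(score)
    return scores

def retrieve_per_option(question, options, docs):
    """For each option, query with question + option; keep the top document."""
    picked = []
    for option in options:
        scores = bm25_scores(question + option, docs)
        picked.append(max(range(len(docs)), key=scores.__getitem__))
    return picked

# Toy corpus (hypothetical): one short "article" per drug.
docs = [
    ["aspirin", "relieves", "pain"],
    ["insulin", "lowers", "glucose"],
    ["penicillin", "treats", "infection"],
]
question = ["which", "drug", "is", "correct"]
options = [["aspirin"], ["insulin"], ["penicillin"]]
print(retrieve_per_option(question, options, docs))
```

In the toy corpus none of the question terms occurs in any document, so each query is decided by its option term alone; with real questions, terms shared across documents would also contribute to the score.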
3 Method
In this paper, the Chinese multi-choice BMRC task can be described as a triplet $\langle D, Q, O \rangle$, where $Q$ represents a Question, $D$ represents the collection of question-relevant Documents retrieved by ElasticSearch, and $O$ are the candidate Options. The goal is to determine the one correct/best answer among the five candidate options $O$. As shown in Fig. 1, the Sketchy-Intensive reading comprehension model simulates the cognitive thinking process of humans to solve multi-choice MRC questions: (1) a BERT-like encoder forms a global sequence representation of the input text, which simulates examinees first reading through the whole examination question sketchily and obtaining an overall impression; (2) a dual-path multi-head co-attention module captures the connections of core information among documents, question and candidate options, then fuses them into (3) a decoder that simulates human readers re-reading and reconsidering the content intensively, based on the overall impression, to aggregate information and make the final decision.
Fig. 1. Overall architecture of SkeIn model.
3.1 Encoder
A BERT-like PrLM is used as the encoder of our model to encode the input tokens of documents, question and each candidate option into contextualized representations. We denote $D = [d_1, d_2, ..., d_m]$, $Q = [q_1, q_2, ..., q_n]$, and $O = [o_1, o_2, ..., o_k]$ as the token sequences of documents, question and each candidate option, respectively, where $m$, $n$ and $k$ are their lengths in tokens. The input is encoded as $E = \mathrm{Encode}(D \oplus Q \oplus O)$, where $\oplus$ denotes concatenation and the encoding function $\mathrm{Encode}(\cdot)$ returns the last layer's output representations $[e_1, e_2, ..., e_{m+n+k}]$, which consist of the feature vectors of the input tokens in a fixed dimension $Dim_{model}$.
3.2 Dual-Path Multi-Head Co-Attention
Inspired by previous work on the multi-head attention mechanism and reading strategies [16,22], we split the last layer's output representation $E$ into $E^{D} = [e^{d}_{1}, e^{d}_{2}, ..., e^{d}_{l_d}]$ and $E^{QO} = [e^{qo}_{1}, e^{qo}_{2}, ..., e^{qo}_{l_{qo}}]$ according to position information, where $e^{d}_{i}$ and $e^{qo}_{j}$ are the $i$-th and $j$-th token representations of the documents and the question-option, respectively. Both sequences are padded or truncated to the maximum lengths $l_d$ and $l_{qo}$ in a mini-batch. Then, we feed $E^{D}$ and $E^{QO}$ into an extra one-layer dual-path multi-head co-attention module, which calculates the attention representations along two paths: (1) the documents-centered path takes $E^{D}$ as $Q$ and $E^{QO}$ as $K$ and $V$, while (2) the question- and option-centered path takes $E^{QO}$ as $Q$ and $E^{D}$ as $K$ and $V$, where $Q$, $K$ and $V$ represent Query, Key and Value, respectively:

$$\mathrm{Attention}\left(E^{D}, E^{QO}, E^{QO}\right) = \mathrm{SoftMax}\left(\frac{E^{D} \left(E^{QO}\right)^{T}}{\sqrt{Dim_k}}\right) E^{QO}. \tag{3}$$

The multi-head attention representation $\mathrm{MultiHead}(\cdot)$ is calculated as:

$$\mathrm{MultiHead}\left(E^{D}, E^{QO}, E^{QO}\right) = \mathrm{Concat}\left(\mathrm{Head}_1, \ldots, \mathrm{Head}_h\right) W^{O}, \tag{4}$$

$$\mathrm{Head}_i = \mathrm{Attention}\left(E^{D} W_i^{Q}, E^{QO} W_i^{K}, E^{QO} W_i^{V}\right), \tag{5}$$

where $W_i^{Q} \in \mathbb{R}^{Dim_{model} \times Dim_q}$, $W_i^{K} \in \mathbb{R}^{Dim_{model} \times Dim_k}$, $W_i^{V} \in \mathbb{R}^{Dim_{model} \times Dim_v}$, and $W^{O} \in \mathbb{R}^{h Dim_v \times Dim_{model}}$ are parameter matrices, $Dim_q$, $Dim_k$ and $Dim_v$ are the dimensions of the $Q$, $K$ and $V$ vectors, respectively, and $h$ is the number of heads.
The dual-path multi-head co-attention module $\mathrm{SkeIn}(\cdot)$ calculates the question- and option-centered document representations $\mathrm{MultiHead}_{QO}$ and the documents-centered question and option representations $\mathrm{MultiHead}_{D}$ as:

$$\mathrm{MultiHead}_{QO} = \mathrm{MultiHead}\left(E^{QO}, E^{D}, E^{D}\right), \quad \mathrm{MultiHead}_{D} = \mathrm{MultiHead}\left(E^{D}, E^{QO}, E^{QO}\right), \tag{6}$$

where $\mathrm{MultiHead}_{QO}$ simulates human readers re-reading the detailed information in the documents with the impression of the question and option, and $\mathrm{MultiHead}_{D}$ simulates them then reconsidering the connection between question and option with a deeper understanding of the documents. Finally, the $\mathrm{Fuse}(\cdot, \cdot)$ function fuses all the key information by applying mean pooling to the sequence output representations of $\mathrm{MultiHead}(\cdot)$, and then aggregates the two pooled output representations to make the final decision:

$$\mathrm{SkeIn} = \mathrm{Fuse}\left(\mathrm{MultiHead}_{D}, \mathrm{MultiHead}_{QO}\right). \tag{7}$$

3.3 Decoder
We pass the output representations of $\mathrm{SkeIn}(\cdot)$ to a decoder, which computes the probability distribution over the candidate options and takes the option with the highest probability as the predicted answer. Specifically, given a triplet $\langle D, Q, O \rangle$, we denote $O_i$ as the $i$-th candidate option, and the objective function is computed as:

$$\mathrm{Loss}\left(O_r \mid D, Q\right) = -\log \frac{\exp\left(W^{T} \mathrm{SkeIn}_r\left(E^{D}, E^{QO_r}\right)\right)}{\sum_{i=1}^{s} \exp\left(W^{T} \mathrm{SkeIn}_i\left(E^{D}, E^{QO_i}\right)\right)}, \tag{8}$$

where $\mathrm{SkeIn}_i \in \mathbb{R}^{l}$ is the output representation of $\mathrm{SkeIn}(\cdot)$ for the $i$-th option, and $O_r$ is the correct answer. $W \in \mathbb{R}^{l}$ represents learnable parameters, and $s$ is fixed to 5, the number of options.
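The flow of Eqs. (3), (6), (7) and (8) can be traced with a small pure-Python sketch. It is deliberately simplified and should not be read as the authors' implementation: it uses a single head with identity projections (so the $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ matrices of Eqs. (4) and (5) are omitted), and it assumes that Fuse concatenates the two mean-pooled vectors, since the aggregation step is not specified above. All function names and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention (Eq. 3) on lists of vectors."""
    dim_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim_k)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(seq):
    """Average a sequence of vectors into one vector."""
    return [sum(col) / len(seq) for col in zip(*seq)]

def skein_fuse(e_d, e_qo):
    """Both attention paths (Eq. 6), then pooling and concatenation (assumed Fuse)."""
    multi_head_d = attention(e_d, e_qo, e_qo)    # documents-centered path
    multi_head_qo = attention(e_qo, e_d, e_d)    # question/option-centered path
    return mean_pool(multi_head_d) + mean_pool(multi_head_qo)

def option_loss(fused_per_option, w, correct):
    """Eq. 8: negative log-probability of the correct option."""
    logits = [sum(wi * xi for wi, xi in zip(w, fused))
              for fused in fused_per_option]
    return -math.log(softmax(logits)[correct])

# Toy input: 3 document tokens, 2 question-option tokens per option, dim = 2.
e_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_qo_per_option = [[[0.1 * i, 1.0], [1.0, 0.2 * i]] for i in range(5)]
fused = [skein_fuse(e_d, e_qo) for e_qo in e_qo_per_option]
w = [0.5, -0.5, 0.25, 0.75]                      # stands in for learned W
print(option_loss(fused, w, correct=2))
```

With $s = 5$ options that all produce the same fused vector, the loss reduces to $\ln 5$, which is a convenient sanity check for an implementation of Eq. (8).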
4 Experiment
4.1 Experimental Settings
We use three competitive baseline PrLMs as the encoder: Chinese BERT-Base (denoted as BERT-Base) [3], Chinese BERT-Base with whole word masking and pre-trained on larger corpora (denoted as BERT-wwm-ext) [2], and the robustly optimized BERT, RoBERTa-wwm-ext-large [9]. We also compare the performance of SkeIn against seven strong baselines: Random, IR Baseline, BERT-Base, Multilingual Uncased BERT-Base (denoted as BERT-Base-Multilingual) [3], BERT-wwm-ext, Chinese RoBERTa-wwm-ext and Chinese RoBERTa-wwm-ext-large [9]. Our implementations of SkeIn and of the baseline models are based on PyTorch Lightning [5] and UER-py [21], respectively. All models are fine-tuned for 15 epochs with a maximum sequence length of 512, an initial learning rate in {2e−6, 4e−6}, a warm-up rate of 0.1 and a batch size of 5, and we keep the default values for the other hyper-parameters [3]. Accuracy (%) is used as the metric to evaluate the performance of the different models. Instead of human performance, we report the human pass line (60) for a fair comparison, because human performance varies widely, from almost full marks to failing the exam.
4.2 Results
Table 4 and Table 5 show the experimental results on five Chinese multi-choice BMRC datasets, where Cli, Sto, PH, TCM and CWM denote Clinic, Stomatology, Public Health, Traditional Chinese Medicine, and Chinese Western Medicine, respectively. The Random method selects options according to a random distribution, and the IR Baseline selects options according to the ranking of the scores returned by ElasticSearch; for the other methods, we give the results of fine-tuning the PrLMs directly. As we can see, among the directly fine-tuned PrLMs, RoBERTa-wwm-ext-large and BERT-wwm-ext perform better than the other models on the five biomedical datasets, while our SkeIn model shows substantial improvements over the various strong baseline models, with average improvements of +4.03% dev/+3.43% test when using BERT-Base as the encoder, and +2.69% dev/+3.22% test and +5.31% dev/+5.25% test when using the most competitive baseline models, BERT-wwm-ext and RoBERTa-wwm-ext-large, as encoders, respectively. We believe that SkeIn improves performance because PrLMs are limited by the self-attention mechanism, which does not carefully consider the relationships among documents, question and options. In contrast, our proposed SkeIn model can effectively capture and utilize such relationships through the dual-path multi-head co-attention module, which leads to better performance.

Table 4. Test set performance in accuracy (%) on the five datasets.
Model                             Cli     Sto     PH      TCM     CWM
Random                            19.61   19.43   20.13   20.11   20.16
IR Baseline                       30.10   26.12   23.26   32.13   28.93
BERT-Base                         48.30   40.08   37.40   49.14   45.14
BERT-Base-Multilingual            47.68   38.76   36.70   46.61   42.86
BERT-wwm-ext                      50.89   42.05   40.04   54.94   50.04
RoBERTa-wwm-ext                   51.97   40.88   38.91   49.82   46.00
RoBERTa-wwm-ext-large             53.22   43.75   38.75   48.65   50.11
SkeIn (BERT-Base)                 51.83   43.26   42.36   52.28   47.48
SkeIn (BERT-wwm-ext)              54.72   46.25   43.12   57.89   52.07
SkeIn (RoBERTa-wwm-ext-large)     55.88   47.54   43.50   59.44   54.39
Human Pass Line                   60
Table 5. Performance comparison in accuracy (%) on the development and test sets.
Model                     Cli             Sto             PH              TCM             CWM
                          Dev     Test    Dev     Test    Dev     Test    Dev     Test    Dev     Test
BERT-Base                 47.26   48.30   40.53   40.08   38.99   37.40   48.51   49.14   44.32   45.14
+SkeIn                    51.98   51.83   44.07   43.26   40.83   42.36   52.66   52.28   50.21   47.48
BERT-wwm-ext              50.27   50.89   43.26   42.05   41.75   40.04   54.57   54.94   49.89   50.04
+SkeIn                    54.83   54.72   46.57   46.25   42.56   43.12   56.90   57.89   52.34   52.07
RoBERTa-wwm-ext-large     53.25   53.22   44.92   43.75   39.10   38.75   47.99   48.65   50.49   50.11
+SkeIn                    55.49   55.88   49.11   47.54   43.86   43.50   59.01   59.44   54.84   54.39
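The average improvements quoted in the text can be re-derived from the Table 5 figures. The short check below is only an arithmetic illustration; the dictionary layout and the function name `average_gain` are hypothetical.

```python
# (dev, test) accuracy per dataset, in the order Cli, Sto, PH, TCM, CWM (Table 5).
baseline = {
    "BERT-Base": [(47.26, 48.30), (40.53, 40.08), (38.99, 37.40),
                  (48.51, 49.14), (44.32, 45.14)],
    "BERT-wwm-ext": [(50.27, 50.89), (43.26, 42.05), (41.75, 40.04),
                     (54.57, 54.94), (49.89, 50.04)],
    "RoBERTa-wwm-ext-large": [(53.25, 53.22), (44.92, 43.75), (39.10, 38.75),
                              (47.99, 48.65), (50.49, 50.11)],
}
with_skein = {
    "BERT-Base": [(51.98, 51.83), (44.07, 43.26), (40.83, 42.36),
                  (52.66, 52.28), (50.21, 47.48)],
    "BERT-wwm-ext": [(54.83, 54.72), (46.57, 46.25), (42.56, 43.12),
                     (56.90, 57.89), (52.34, 52.07)],
    "RoBERTa-wwm-ext-large": [(55.49, 55.88), (49.11, 47.54), (43.86, 43.50),
                              (59.01, 59.44), (54.84, 54.39)],
}

def average_gain(base_rows, skein_rows):
    """Mean (dev, test) accuracy gain over the five datasets, to 2 decimals."""
    n = len(base_rows)
    dev = sum(s[0] - b[0] for b, s in zip(base_rows, skein_rows)) / n
    test = sum(s[1] - b[1] for b, s in zip(base_rows, skein_rows)) / n
    return round(dev, 2), round(test, 2)

for name in baseline:
    print(name, average_gain(baseline[name], with_skein[name]))
# BERT-Base (4.03, 3.43)
# BERT-wwm-ext (2.69, 3.22)
# RoBERTa-wwm-ext-large (5.31, 5.25)
```

The per-dataset gains vary considerably: for RoBERTa-wwm-ext-large, the TCM dataset alone contributes gains of about 11 points on both dev and test.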
Fig. 2. Performance on knowledge driven and case analysis questions.
In the NMLEC, questions are divided into five types: A1, B1, A2, A3, and A4, which can be summarized as Knowledge Driven Questions (A1+B1) and Case Analysis Questions (A2+A3+A4). We conduct experiments on both question types, and the evaluation results are shown in Fig. 2. The results demonstrate that our SkeIn model achieves varying degrees of performance improvement on both question types, while most models achieve better performance on Case Analysis Questions, as they usually involve simpler concepts, which makes details easier to capture. The SkeIn model gives a major boost in performance on the five biomedical datasets, especially in solving Knowledge Driven Questions. This shows that by integrating the relationships and detailed information among documents, question and options into the PrLMs, the model can capture more semantic information, which is helpful in solving Knowledge Driven Questions. In contrast, solving Case Analysis Questions needs sophisticated reasoning ability, as they are usually more difficult and involve complex medical scenarios.
5 Conclusion
In this paper, we propose a Sketchy-Intensive (SkeIn) reading comprehension model to tackle machine reading comprehension with biomedical multi-choice questions. Inspired by the cognitive thinking process that humans use to solve reading comprehension questions, the proposed model implicitly takes advantage of dedicated information and the relationships among documents, question and candidate options with a matching network, which is able to cooperate with state-of-the-art pre-trained language models such as BERT and RoBERTa. The experimental results consistently confirm the enhancements and effectiveness of the SkeIn model on five multi-choice biomedical machine reading comprehension tasks. In the future, we will explore how structural knowledge such as Knowledge Graphs can further enhance the performance and reasoning ability of models.
References
1. Chen, D., Fisch, A., Weston, J., Bordes, A.: Reading Wikipedia to answer open-domain questions. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1870–1879. Association for Computational Linguistics (July 2017)
2. Cui, Y., et al.: Pre-training with whole word masking for Chinese BERT. arXiv (2019)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs] (May 2019)
4. Ding, M., Zhou, C., Chen, Q., Yang, H., Tang, J.: Cognitive graph for multi-hop reading comprehension at scale. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2694–2703. Association for Computational Linguistics (July 2019). https://doi.org/10.18653/v1/P19-1259
5. Falcon, WA: PyTorch lightning. GitHub (March 2019). https://github.com/PyTorchLightning/pytorch-lightning
6. Jin, D., Gao, S., Kao, J.Y., Chung, T., Hakkani-tur, D.: MMM: multi-stage multi-task learning for multi-choice reading comprehension. arXiv:1910.00458 [cs] (November 2019). Accepted by AAAI 2020
7. Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.: RACE: large-scale reading comprehension dataset from examinations. arXiv:1704.04683 [cs] (December 2017). EMNLP 2017
8. Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2019)
9. Liu, Y.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv (2019)
10. Luo, R., Xu, J., Zhang, Y., Ren, X., Sun, X.: PKUSEG: a toolkit for multi-domain Chinese word segmentation. CoRR abs/1906.11455 (2019)
11. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. Association for Computational Linguistics (November 2016)
12. Ran, Q., Li, P., Hu, W., Zhou, J.: Option comparison network for multiple-choice reading comprehension. arXiv:1903.03033 [cs] (March 2019)
13. Richardson, M., Burges, C.J., Renshaw, E.: MCTest: a challenge dataset for the open-domain machine comprehension of text. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 193–203. Association for Computational Linguistics (October 2013)
14. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retrieval 3(4), 333–389 (2009)
15. Shin, H.C., et al.: BioMegatron: larger biomedical domain language model. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4700–4706. Association for Computational Linguistics (November 2020)
16. Vaswani, A., et al.: Attention is all you need. arXiv:1706.03762 [cs] (December 2017)
17. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237 [cs] (January 2020). Pretrained models and code are available at https://github.com/zihangdai/xlnet
18. Yang, Z., et al.: HotpotQA: a dataset for diverse, explainable multi-hop question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, Brussels, Belgium. Association for Computational Linguistics (October 2018). https://doi.org/10.18653/v1/D18-1259
19. Zhang, S., Zhao, H., Wu, Y., Zhang, Z., Zhou, X., Zhou, X.: DCMN+: dual co-matching network for multi-choice reading comprehension. arXiv:1908.11511 [cs] (January 2020). Accepted by AAAI 2020
20. Zhang, T., Wang, C., Qiu, M., Yang, B., He, X., Huang, J.: Knowledge-empowered representation learning for Chinese medical reading comprehension: task, model and resources. arXiv:2008.10327 [cs] (August 2020)
21. Zhao, Z., et al.: UER: an open-source toolkit for pre-training models. In: EMNLP/IJCNLP (2019)
22. Zhu, P., Zhao, H., Li, X.: DUMA: reading comprehension with transposition thinking. arXiv:2001.09415 [cs] (September 2020)
DNA Image Storage Using a Scheme Based on Fuzzy Matching on Natural Genome
Jitao Zhang¹,², Shihong Chen³,⁴,⁵, Haoling Zhang²,³,⁴,⁶, Yue Shen²,³,⁴,⁵,⁶(✉), and Zhi Ping²,³,⁴,⁶(✉)
¹ College of Life Sciences, University of Chinese Academy of Sciences, Beijing 101408, China
² BGI-Shenzhen, Shenzhen 518083, China
³ Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen 518120, China
⁴ George Church Institute of Regenesis, BGI-Shenzhen, Shenzhen 518120, China
⁵ China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
⁶ Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
{shenyue,pingzhi}@genomics.cn
Abstract. Among the many emerging storage technologies, DNA storage has great potential owing to its high data storage density and low maintenance cost. However, DNA synthesis and sequencing, the two enabling technologies for DNA storage, are costly and inefficient at writing and reading information, which delays the commercialization of DNA storage. Considering the high cost of DNA synthesis, we devise a DNA storage system based on natural genomes that compresses images using fuzzy matching and image processing technology, which can reduce the cost of storing images in the DNA medium. With our DNA storage scheme, the number of nucleotide sequences to be synthesized can be reduced by about 90%, and the visual quality of the retrieved images is comparable to that obtained with conventional algorithms. Furthermore, because the index sequences generated by fuzzy matching do not depend on one another, our scheme is more robust than DNA storage schemes that directly use conventional algorithms to compress images. Finally, we investigate the factors that may influence the fuzzy matching of images, including genome size, GC content and relative entropy, which can be used to design a criterion for selecting better genomes for a given image.
Keywords: DNA-based data storage · DNA synthesis · Fuzzy matching · Super-resolution · Image denoising
1 Introduction With the rapid development of information technology, the ability of creating data becomes more and more powerful. According to previous report, the total amount of global data is predicted to grow from 45 Zettabytes (ZB) to 175 ZB in 2025 [1]. © Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 572–583, 2021. https://doi.org/10.1007/978-3-030-91415-8_48
DNA Image Storage Using a Scheme
573
Tremendous growth of global data demands a much higher information storage capacity. DNA-based data storage is an emerging technology for its nonvolatility, remarkable durability, incomparable storage density and capability for cost-efficient information duplication [2–8]. The general steps of DNA storage include encoding, writing (DNA synthesis), reading (DNA sequencing) and decoding. The binary data stream of digital files is transcoded into DNA sequences and these sequences will be synthesized into oligonucleotides (oligos) or double-stranded DNA fragments for storage in the writing procedure. These sequences will be sequenced by utilizing sequencers and then decoded into original files when data retrieval. Although DNA-based data storage has significant advantages over traditional storage technologies, high cost of synthesis is a key factor to hinder its commercialization [6, 9]. Another issue of DNA storage is that DNA synthesis currently is an error-prone process [5, 8–10], which may result in difficulties of accurate data retrieval. These errors can be addressed by introducing error-correction code (ECC) into DNA storage, which can reconstruct the missing information or correct errors but at the cost of additional DNA synthesis [4, 11]. Another strategy to avoid errors and maintain low synthesis cost was proposed by Tabatabaei et al. [12] and it was called DNA punch cards. It was designed to modify the topology of native DNA sequences to store data via enzymatic nicking. Although this strategy is benefit to the cost reduction of DNA synthesis, it greatly sacrifices information density, which is supposed to be one of the major advantages of DNA storage [13]. In this paper, we propose a novel DNA storage scheme based on native genomes to archive images. 
The principle of the scheme is to search a native genome for the sequence most similar to each encoded nucleotide sequence of an image (hereinafter referred to as fuzzy matching); the indices of the matched sequences in the genome are recorded and then encoded into nucleotide sequences for synthesis. The difference between the matched genome sequences and the encoded nucleotide sequences of an image may cause noise or distortion during information retrieval, but this can be reduced by image processing techniques. To improve the visual quality of retrieved noisy images, an image denoising method and a super-resolution (SR) method are introduced into our DNA storage scheme. Because the scheme applies both fuzzy matching and image downsampling, the data of an image is compressed into index sequences, and the amount of nucleotide sequence synthesized to store the image can be reduced by up to 90.62%. Common image compression algorithms can achieve even better compression rates, but the binary digits of a compressed image depend on one another, which can result in catastrophic error propagation when undesired errors occur. In comparison, the oligos representing indices of matched sequences in our scheme are not interdependent, which makes image retrieval more robust. Furthermore, we analyzed three factors that may affect the quality of image retrieval.
2 Data and Software Availability
2.1 Image Dataset
The images used in this study come from three datasets: DIV2K [14], Set5 [15] and Set14 [16]. DIV2K is an image dataset with 2K resolution used in some computer vision
J. Zhang et al.
challenges from 2017 to 2018; it includes 800 training images, 100 testing images and 100 validation images. The other two datasets contain 5 and 14 images, respectively. The images in these datasets are 24-bit images, most of which are color images. They are commonly used in computer vision research.
2.2 Genome Files and Source Code
Genome files are reference sequences downloaded from the NCBI genome database. The genome used to compare the transforming rules is that of Saccharomyces cerevisiae S288C, which has a genome size of 12.16 Mb and a GC content of 38.2%. Twenty genomes and 100 genomes are used in the experiments investigating the influence of genome size and GC content, respectively; their GC content ranges from 20% to 70%. The source code includes the implementation of fuzzy matching, the image processing algorithms and the simulation experiments. The list of genome files and the source code are available in the GitHub repository https://github.com/zhangjitaoBGI/DNA-storage-based-on-reference-genome.
2.3 Testing Environment
Google Colab Pro; GPU: Tesla V100; Python 3.7.10; PyTorch 1.7.1; CUDA 10.1.
3 Methods
3.1 Overview of Genome-Based DNA Storage System
To compress information and thereby reduce the number of nucleotides to be synthesized, we devised a genome-based DNA storage system that stores images by searching a known natural genome for nucleotide sequences similar to the encoded sequences of an image. As Fig. 1 shows, an image is first downsampled using the bicubic kernel function [17] to reduce the data size. The downsampled image is divided into blocks by a square matrix, and the pixels in each block are represented by a nucleotide sequence of a natural genome. Then, the indices of the genome sequences are recorded and further encoded into nucleotide sequences for synthesis by any reported encoding method. More specifically, the pixels in each block are rearranged into a mean-difference sequence, which consists of the mean of these pixels and the differences between the mean and each pixel. To select a genome sequence to represent each mean-difference sequence, all nucleotide sequences of a genome are converted into digital sequences according to some transforming rules. Combined with their indices, these digital sequences form a digital sequence dictionary, and the indices of the selected digital sequences are recorded for data retrieval. Apart from the indices of an image, other information necessary for decoding (e.g., the taxonomy ID (txid) of the genome and the transforming rules) forms the file header, which is encoded into nucleotide sequences by the same encoding method. To retrieve a stored image, the DNA strands related to the image are sequenced and decoded to generate the indices of genome sequences and the file header. Based on the genome
information and transforming rules in the file header, the digital sequence dictionary can be reconstructed, and the indices can be looked up in it to find all mean-difference sequences of the stored image. By adding each difference to the mean within each sequence, the pixels can be calculated and the stored image recovered. To improve the visual quality of reconstructed images, image denoising is used to reduce the noise caused by fuzzy matching. Then, an SR method based on deep learning is applied to scale the denoised image back to its original size. As long as each index in the digital sequence dictionary is shorter than each mean-difference sequence, the data of the downsampled image is compressed, which reduces the number of nucleotide sequences to be synthesized.
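The block-to-sequence step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the rounding of the mean and the sign convention (difference = pixel minus mean) are assumptions, and the clipping range [−31, 32] is the one given in Sect. 3.2.

```python
import numpy as np

DIFF_MIN, DIFF_MAX = -31, 32  # difference range given in Sect. 3.2


def block_to_mean_difference(block):
    """Turn a block of 8-bit pixels into [mean, d_1, ..., d_m]."""
    pixels = np.asarray(block, dtype=np.int64).ravel()
    # Assumption: the mean is rounded to the nearest integer.
    mean = int(np.rint(pixels.mean()))
    # Assumption: each difference is "pixel minus mean", clipped to the range.
    diffs = np.clip(pixels - mean, DIFF_MIN, DIFF_MAX)
    return np.concatenate(([mean], diffs))


def mean_difference_to_block(sequence, shape):
    """Recover the pixels by adding each difference to the mean."""
    sequence = np.asarray(sequence, dtype=np.int64)
    pixels = np.clip(sequence[0] + sequence[1:], 0, 255)
    return pixels.reshape(shape).astype(np.uint8)


block = np.array([[100, 104], [98, 102]], dtype=np.uint8)
sequence = block_to_mean_difference(block)
restored = mean_difference_to_block(sequence, block.shape)
```

For a smooth block such as this one the round trip is exact; blocks with strong contrast lose information through clipping even before fuzzy matching adds its own error.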
Fig. 1. The workflow of the DNA-based storage scheme. Lower: the overall procedures to store and read an image. Upper: an example of transforming a block of pixels into an index in a natural genome.
3.2 Transforming Nucleotide Sequences into Digital Sequences
As shown in Fig. 1, a key step in our DNA-based data storage system is fuzzy matching, whose goal is to find the most similar oligo in a given genome for each mean-difference sequence of an image. Because the human visual system is more sensitive to luminance than to chrominance, the RGB channels of color images are converted into YCbCr channels. The pixels in each of the three channels are divided into blocks by one matrix per channel, and the mean-difference sequences of the three channels are each fuzzy matched in the same genome. Here, we take one channel as an example to show the details of fuzzy matching. In the process of fuzzy matching, a mean-difference sequence corresponding to one block of an image comprises one mean and m differences, which are integers in the ranges [0, 255] and [−31, 32] (differences outside this interval are assigned to the nearest bound, −31 or 32),
respectively. For convenience, 31 is added to every difference so that the interval becomes [0, 63]. To map each integer to a nucleotide sequence, a quadruplet and a triplet are selected to represent a mean and a difference, respectively. As shown in Fig. 2a, transforming a genome into a digital sequence dictionary takes two steps. The first step transforms the genome into an oligo dictionary: each oligo is generated by segmenting the genome sequence with a sliding window whose step size is 1 nt and whose window size is 4 + 3m nt (m is the number of pixels in one block as in Fig. 1, e.g., 4 or 16). The second step converts the oligo dictionary into a digital sequence dictionary, for which three different rules are designed (Fig. 2b). Two kinds of strategies, encoding and mapping, are used in these three rules. Encoding means that a nucleotide sequence is directly encoded into an integer based on a simple encoding scheme.
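The two steps above, using only the direct encoding strategy, can be sketched as follows. This is an illustration under stated assumptions rather than the published code: the base-to-digit assignment (A, C, G, T as 0 to 3), the function names, the restriction to the letters A, C, G and T, and the squared-error distance used to pick the "most similar" entry are all hypothetical choices. Targets are expected as a mean in [0, 255] followed by differences already shifted into [0, 63].

```python
import numpy as np

BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code


def kmer_to_int(kmer):
    """Read a k-mer as a base-4 number (quadruplet: 0-255, triplet: 0-63)."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_VALUE[base]
    return value


def build_digital_dictionary(genome, m):
    """Slide a (4 + 3m)-nt window with step 1 and encode every oligo."""
    window = 4 + 3 * m
    rows = []
    for start in range(len(genome) - window + 1):
        oligo = genome[start:start + window]
        mean = kmer_to_int(oligo[:4])
        diffs = [kmer_to_int(oligo[4 + 3 * i:7 + 3 * i]) for i in range(m)]
        rows.append([mean] + diffs)
    return np.array(rows, dtype=np.int64)


def fuzzy_match(dictionary, target):
    """Return the index of the dictionary entry closest to the target."""
    # Assumption: similarity is measured by the sum of squared differences.
    distances = ((dictionary - np.asarray(target)) ** 2).sum(axis=1)
    return int(distances.argmin())


dictionary = build_digital_dictionary("ACGTACGTAA", m=1)
best_index = fuzzy_match(dictionary, [110, 30])
```

Only `best_index` has to be stored; the decoder rebuilds the same dictionary from the genome and reads the matched digital sequence back at that position. A brute-force scan like this one is fine for a toy genome but would need an indexed search for dictionaries with millions of entries.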
Fig. 2. Transforming nucleotide sequences into digital sequences. a. An oligo dictionary generated by segmenting a genome is transformed into a digital sequence dictionary to accomplish fuzzy matching. b. The principle of three rules applied to transform oligo dictionaries. The encoding scheme is used to directly encode integers and the two mapping lists are used to map an integer to a k-mer. c. An example shows an oligo in the oligo dictionary is transformed into different digital sequences according to three rules. d. The average peak signal-to-noise ratio (PSNR) of reconstructed images and average bytes of pixel ranking lists when three rules are used to fuzzy match.
The other strategy, mapping, means that a quadruplet or triplet is mapped to an integer according to a related mapping list. By collecting a k-mer [18] ranking list of the genome
and a pixel ranking list of the image and establishing the mapping relations between them, a mapping list can be generated. The k-mer ranking lists comprise a quadruplet list and a triplet list, which are generated by counting the relative frequencies of the quadruplets and triplets of the oligos in an oligo dictionary and ranking them in descending order. The pixel ranking lists likewise comprise two lists, one of means and one of differences, which are generated by counting the relative frequencies of the means and differences of all mean-difference sequences in one channel. Since means are mapped to quadruplets and differences to triplets, a quadruplet-mean mapping list and a triplet-difference mapping list can be generated from these four lists. According to these mapping lists, an oligo dictionary can be transformed into a digital sequence dictionary. Each oligo has two sections, and the three rules transform an oligo into a digital sequence by applying different strategies to the two sections. Figure 2c shows an example of transforming a nucleotide sequence under the different rules.
3.3 Image Denoising
When stored images are obtained by sequencing and decoding, they need to be denoised to reduce the information loss introduced by the encoding and fuzzy matching procedures. The image denoising scheme used in this study is based on the method presented by Jeremy Howard in his introduction to the deep learning framework fastai [19]. The noisy input images are divided into a training set and a test set in a ratio of 9:1, with a batch size of 32 and an image size of 128. U-net, a widely used image denoising model [20], is constructed using the pre-trained model ResNet34 [21]; such a network can be trained end-to-end from very few images and is fast.
To enable the trained model to perceive the difference between the generated image and the target image, perceptual loss [22] is introduced into the image denoising model.
3.4 Image Super-Resolution
Image super-resolution recovers high-resolution (HR) images with better visual quality and refined details from low-resolution images [17, 23]. In our scheme, the scaling factor is ×2, so a denoised image has half the width and height of its original because of the downsampling process. To scale the denoised images back to their original size, a deep-learning SR model, Enhanced SRGAN (ESRGAN) [24], is introduced. SRGAN [25] is a generative adversarial network (GAN) for the SR task, which can generate photo-realistic natural images from downsampled images. The ESRGAN model improved the structure of SRGAN and won first place in the PIRM2018-SR Challenge. Considering its open availability, stability and remarkable performance, ESRGAN is selected for our SR task.
3.5 The Compression Rate
In image compression, the compression ratio (CR) is defined as the ratio between the uncompressed image size and the compressed image size [26]. In this scheme, the compression rate is the reciprocal of CR, defined as the ratio between the data
size of the indices of an image and the data size of the pixels of the image (the unit of data size is the byte). The data of an image to be stored is reduced when the image is processed by downsampling and fuzzy matching. In total, the compression rate of the image can be calculated by the following formula:

$$c = \frac{1}{s^{2}} \times \frac{1}{3} \times \left( \frac{\log_2 v_1 / 8}{x^{2}} + \frac{\log_2 v_2 / 8}{y^{2}} + \frac{\log_2 v_3 / 8}{z^{2}} \right) = \frac{1}{24 s^{2}} \left( \frac{\log_2 v_1}{x^{2}} + \frac{\log_2 v_2}{y^{2}} + \frac{\log_2 v_3}{z^{2}} \right) \qquad (1)$$

where $s$ is the scaling factor of an image; $x$, $y$ and $z$ are the side lengths of the matrices used to divide the Y, Cb and Cr color channels of the image, respectively; and $v_1$, $v_2$ and $v_3$ are the numbers of sequences in the digital sequence dictionaries corresponding to the Y, Cb and Cr channels.
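As a worked check of Eq. (1), the short sketch below evaluates it with the settings reported in Sect. 4.3 (×2 downsampling, 2 × 2, 4 × 4 and 4 × 4 blocks, 3-byte indices, i.e. 2^24 addressable sequences per dictionary). The function names are hypothetical, and the second helper simply counts index bytes for a downsampled image under the assumption that the image sides are multiples of the block sides.

```python
import math


def compression_rate(s, sides, dictionary_sizes):
    """Eq. (1): index bytes divided by pixel bytes for a 24-bit image."""
    per_channel = sum(math.log2(v) / side ** 2
                      for side, v in zip(sides, dictionary_sizes))
    return per_channel / (24 * s ** 2)


def index_bytes(width, height, sides, bytes_per_index=3):
    """Bytes needed for the indices of all blocks in the three channels."""
    return sum((width // side) * (height // side) * bytes_per_index
               for side in sides)


rate = compression_rate(2, (2, 4, 4), (2 ** 24,) * 3)
size = index_bytes(128, 128, (2, 4, 4))
print(f"Eq. (1): {rate:.5f}, index bytes: {size}")

# File header sizes (bytes) and the average BMP size reported in Sect. 4.3.
for rule, header in (("A", 13), ("B", 188), ("C", 516)):
    print(rule, f"{(size + header) / 196662:.4f}")
```

Eq. (1) gives 0.09375 for these settings, and the index count comes to 18,432 bytes, in line with the figures in Sect. 4.3; adding the per-rule file header to the numerator gives the slightly larger rates reported there for rules B and C.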
4 Result
4.1 The Comparison Among Three Transforming Rules
To evaluate these rules, an experiment was performed to measure the noise introduced by the fuzzy matching process and the data size of the pixel ranking lists on which the rules depend. PSNR measures the difference between an image to be stored and the noisy image generated from it by fuzzy matching. A higher PSNR implies a smaller difference and a better fuzzy matching result. Both k-mer ranking lists and pixel ranking lists are necessary to construct the mapping lists in rule B and rule C, but the k-mer ranking lists can be recovered from the genome information and parameters saved in the file header, so they need not be stored. The data set used in this experiment was the training set of DIV2K, and the downsampled image size was 128 × 128. The average PSNR over the data set is shown in Fig. 2d. Rule C clearly yields a higher PSNR than the other rules, but another factor, the size of the pixel ranking lists it requires, must be taken into consideration. In rule B, the pixel ranking list is a difference list, which needs to be stored. Rule C additionally requires a mean list to be stored, so the sequence recording the pixel ranking lists becomes longer. There is thus a tradeoff between PSNR and the size of the pixel ranking lists to be stored: a higher PSNR requires more nucleotide sequences. Rule B significantly enhances the PSNR of the reconstructed images at the cost of storing a relatively small pixel ranking list, whereas rule C is a better choice for large images. Considering that the image size in our data set is 128 × 128, images generated by rule B are used in the subsequent image processing experiments.
4.2 Image Processing
When images stored in the form of oligos are retrieved, the noise introduced by fuzzy matching needs to be removed.
To reduce the noise in reconstructed images, a model based on the U-net architecture and perceptual loss [19] was introduced into the DNA storage system. As Fig. 3a shows, compared with the original images, there are some distortions
at the edges of objects in the stored images; many of them are eliminated in the denoised images, resulting in better visual quality and better PSNR and structural similarity (SSIM) [27]. Apart from image denoising, ESRGAN was used to scale the denoised images to the size of the original images before downsampling. As shown in Fig. 3b, the scaled images are realistic and their visual quality is close to that of images compressed by the JPEG algorithm. The PSNR and SSIM of the scaled images, two common metrics of image difference, are much lower than those of the JPEG images, but these two metrics cannot precisely reflect the visual quality of reconstructed images, and subjective human visual perception is more important in practice [24]. As SR methods develop, it may become feasible to reconstruct images from lower-resolution inputs with larger scaling factors, which would result in a lower compression rate.
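For reference, the PSNR values quoted in this section follow the standard definition, which can be sketched as below. This is a generic implementation for 8-bit images, not code taken from the paper's repository; the function name and the `peak` parameter are illustrative.

```python
import numpy as np


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        # Identical images, as for the originals labeled (∞ | 1) in Fig. 3.
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))


original = np.zeros((4, 4), dtype=np.uint8)
noisy = np.ones((4, 4), dtype=np.uint8)
value = psnr(original, noisy)
```

Because PSNR depends only on the mean squared error, two reconstructions with the same PSNR can look quite different, which is the limitation the text raises when comparing the scaled images with the JPEG images.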
[Fig. 3 image omitted. Panel a columns: stored image, denoised image, original image. Panel b columns: scaled image, image in JPEG format, original image. Each image is labeled with (PSNR | SSIM). Labels in extraction order: (29.50 | 0.88), (31.09 | 0.94), (33.04 | 0.85), (34.60 | 0.90), (26.22 | 0.86), (27.63 | 0.93), (22.43 | 0.64), (29.03 | 0.88), (∞ | 1), (28.97 | 0.84), (34.31 | 0.93), (∞ | 1), (∞ | 1), (∞ | 1), (∞ | 1).]
Fig. 3. a. Comparison among stored images, denoised images and original images before processing. b. Comparison among images scaled using the ESRGAN model, JPEG images with a compression rate of about 10% and uncompressed original images.
4.3 The Compression Rate of the Genome-Based DNA Storage System
To save the cost of DNA synthesis, image data is compressed in our DNA storage system. The information to be stored is divided into a file header and index sequences. The file header includes a fixed-length part and a run-length part. The fixed-length part records the txid of the genome used in fuzzy matching and the parameters used to construct the digital sequence dictionaries, while the run-length part keeps the pixel ranking lists. The compression rate of this system depends on the transforming rule and the scaling factor. In this experiment, the size of the original images was 256 × 256, the downsampling factor was ×2, the matrices used to divide each image's three channels were 2 × 2, 4 × 4 and 4 × 4, respectively, and the data size of each index in the digital sequence dictionaries was 3 bytes (the number of sequences in each dictionary constructed from the genome of Saccharomyces cerevisiae S288C was about 12
million and 2^24 could cover the indices of all sequences). The fixed-length parts of the three rules were 13 bytes, 16 bytes and 16 bytes, respectively, and the run-length parts are shown in Fig. 2d. In total, the file headers of the three rules were 13 bytes, 188 bytes and 516 bytes, respectively, and the size of the index sequences was 18,432 bytes. The average data size of the 800 images used to compare the three rules was 196,662 bytes, and the compression rates of the three rules were 9.38%, 9.47% and 9.63%, respectively. The stored images were in BMP format, and the data size of each image could be reduced by more than 90% under the above conditions, which implies that the synthesis cost of storing images can be greatly reduced. If the downsampling factor or the matrix size used in each image channel increases for a given genome, the compression rate and synthesis cost will further decrease.
4.4 Factors Influencing the Result of Image Fuzzy Matching in Genomes
Although the procedure of fuzzy matching has been developed, the factors that may affect its result need to be investigated to establish a criterion for selecting a better genome. Genome size is obviously an important factor, because a nucleotide sequence with fewer differences is more likely to be found in a larger genome containing more kinds of nucleotide sequences. GC content is also a common feature of a genome, which determines the proportions of the four nucleotides. In addition, the mapping lists used in fuzzy matching are based on the distributions of k-mers and pixels. When an image is matched in different genomes, the difference between the frequency distribution of the image's pixels and the frequency distribution of each genome's k-mers varies, which may also influence the matching result. Relative entropy, a measure of the difference between a probability distribution and a reference probability distribution, is introduced to explore the influence of this factor.
A smaller relative entropy implies a smaller difference between two probability distributions. In these investigations, images were fuzzy matched in sub-dictionaries containing a specific number of digital sequences, generated by randomly selecting digital sequences from the whole digital sequence dictionary constructed from each genome. This strategy precisely controls the size of the dictionary used in fuzzy matching. In the experiment of Fig. 4a, 20 genomes and 200 images from the DIV2K dataset were selected for fuzzy matching, and five genome sizes were explored. The average PSNR of the 200 images matched in each sub-dictionary was calculated to reflect the result of fuzzy matching. As genome size increases, the mean PSNR over the twenty genomes also increases. At the same time, a larger genome also lengthens the indices in the digital sequence dictionary, resulting in a higher compression rate and synthesis cost. The tradeoff between PSNR and compression rate should therefore be considered when choosing the size of genomes. The effect of GC content is shown in Fig. 4b. One hundred genomes were divided into 10 groups based on their GC content, and the 800 images in the DIV2K training set were fuzzy matched in each of the 100 genomes. The size of each genome's sub-dictionary was fixed at 4^11. The PSNR of each group is the mean of the PSNR values of its ten genomes. In Fig. 4b, the histogram of the 10 groups is shaped like a saddle. When GC content moves from 50% toward 20% or 70%, the average PSNR of the reconstructed images in each group rises, which shows that genomes with GC content close to 50% are not a good choice and genomes with relatively extreme GC content should be preferred.
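The two experimental controls used above, the GC content of a genome and a fixed-size random sub-dictionary, can be sketched as follows. Both helpers are hypothetical illustrations: ignoring non-ACGT symbols when computing GC content and sampling without replacement under a fixed seed are assumptions, not details stated in the paper.

```python
import numpy as np


def gc_content(sequence):
    """Fraction of G and C among the A, C, G and T bases of a sequence."""
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (sequence.count("G") + sequence.count("C")) / counted


def sample_sub_dictionary(dictionary, size, seed=0):
    """Randomly keep `size` entries; return their positions and the rows."""
    dictionary = np.asarray(dictionary)
    rng = np.random.default_rng(seed)
    # Assumption: entries are drawn without replacement.
    keep = np.sort(rng.choice(len(dictionary), size=size, replace=False))
    return keep, dictionary[keep]


full = np.arange(40).reshape(20, 2)   # toy stand-in for a digital dictionary
positions, sub = sample_sub_dictionary(full, size=8)
fraction = gc_content("GGCCAATT")
```

A sub-dictionary of 4^11 entries can be addressed with 22-bit indices, since 4^11 equals 2^22; this is why a larger dictionary raises the compression rate even as it improves PSNR.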
Apart from those two factors, the relationship between PSNR and relative entropy is shown in Fig. 4c. When nucleotide sequences were transformed into digital sequences, the 64 triplets of a genome were mapped to the differences in [0, 63] according to their relative frequencies. The relative entropy between each of the 800 images and a genome was calculated based on the data in Fig. 4b, and the PSNR and relative entropy obtained when the 800 images were fuzzy matched in each of the 100 genomes are shown in Fig. 4c. There is a relatively strong negative correlation between PSNR and relative entropy (R² is 0.7738), but choosing a genome for a given image based only on relative entropy is not a reliable strategy. Although these three factors provide a reference when choosing genomes for a given image, more features of genomes and images need to be collected to develop a better scheme for this task.
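A minimal sketch of the relative entropy computation is given below. The paper does not spell out the direction of the divergence, the logarithm base or the treatment of zero frequencies, so the choices here (image differences as P, genome triplets as Q, base 2, a small floor on Q) are assumptions, as are the function names. Both distributions are sorted in descending order because the mapping lists pair values by frequency rank.

```python
import numpy as np


def ranked_distribution(counts):
    """Normalize counts to relative frequencies, sorted in descending order."""
    counts = np.asarray(counts, dtype=np.float64)
    return np.sort(counts / counts.sum())[::-1]


def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(P || Q) in bits."""
    p = np.asarray(p, dtype=np.float64)
    # Assumption: zero reference frequencies are floored to avoid division by 0.
    q = np.clip(np.asarray(q, dtype=np.float64), eps, None)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


difference_counts = [50, 30, 15, 5]   # toy counts of pixel differences
triplet_counts = [40, 30, 20, 10]     # toy counts of genome triplets
divergence = relative_entropy(ranked_distribution(difference_counts),
                              ranked_distribution(triplet_counts))
```

In the setting of Fig. 4c each distribution would have 64 entries, one per triplet or per shifted difference value; the toy counts above only keep the example short.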
[Fig. 4 image omitted. Panel a: PSNR versus genome size (sub-dictionary sizes 4^7, 4^8, 4^9, 4^10 and 4^11). Panel b: PSNR versus GC content (20 to 70, in steps of 5). Panel c: PSNR versus relative entropy, with R² = 0.7738.]
Fig. 4. Three factors influencing the fuzzy matching result. a. The average PSNR of reconstructed images at different genome sizes. The genome size is the number of nucleotide sequences in the sub-dictionaries constructed from each of the 20 genomes used in this experiment. The box plot for each genome size summarizes the PSNR of the 20 genomes (each genome's PSNR is the average PSNR of 200 reconstructed images when the downsampled images were matched in the sub-dictionary of that genome). b. The average PSNR of reconstructed images when images were fuzzy matched in genomes with different GC content. c. The average PSNR of reconstructed images and the average relative entropy between the genome and the 800 images when the images in the dataset were fuzzy matched in each of the 100 genomes. R² shows the correlation between PSNR and relative entropy.
5 Conclusion
Based on readily available natural genomes, we designed a DNA storage system to store images in a lossy compression manner, which can mitigate the prohibitively high cost of
DNA synthesis until DNA synthesis technology makes great progress. With our scheme, the number of nucleotide sequences needed to store an image in DNA can be reduced by about 90%, and the visual quality of recovered images is comparable to that of images compressed by the JPEG algorithm. We also explored three factors that may affect the fuzzy matching of images, but more features of images and genomes need to be explored to devise a strategy for selecting better genomes for a given image. We also note that the visual quality of recovered images can be further improved, and more customized image processing algorithms should be developed.
Acknowledgment. This work was supported by the National Key Research and Development Program of China (No. 2020YFA0712100) and the Guangdong Provincial Key Laboratory of Genome Read and Write (No. 2017B030301011). We are thankful for the computing resources provided by China National GeneBank (CNGB).
References
1. Reinsel, D., Gantz, J., Rydning, J.: Data age 2025: the digitization of the world from edge to core. Seagate Data Age, 1–28 (2018)
2. Church, G.M., Gao, Y., Kosuri, S.: Next-generation digital information storage in DNA. Science 337(6102), 1628 (2012)
3. Goldman, N., et al.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494(7435), 77–80 (2013)
4. Grass, R.N., Heckel, R., Puddu, M., Paunescu, D., Stark, W.J.: Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. 54(8), 2552–2555 (2015)
5. Tabatabaei Yazdi, S.M.H., Yuan, Y., Ma, J., Zhao, H., Milenkovic, O.: A rewritable, random-access DNA-based storage system. Sci. Rep. 5(1), 1–10 (2015)
6. Zhirnov, V., Zadegan, R.M., Sandhu, G.S., Church, G.M., Hughes, W.L.: Nucleic acid memory. Nat. Mater. 15(4), 366 (2016)
7. Erlich, Y., Zielinski, D.: DNA Fountain enables a robust and efficient storage architecture. Science 355(6328), 950–954 (2017)
8. Ceze, L., Nivala, J., Strauss, K.: Molecular digital data storage using DNA. Nat. Rev. Genet. 20(9), 456–466 (2019)
9. Dong, Y., Sun, F., Ping, Z., Ouyang, Q., Qian, L.: DNA storage: research landscape and future prospects. Natl. Sci. Rev. 7(6), 1092–1107 (2020)
10. Bornholt, J., Lopez, R., Carmean, D.M., Ceze, L., Seelig, G., Strauss, K.: A DNA-based archival storage system. In: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 637–649 (2016)
11. Li, B., Ou, L., Du, D.: Image-based approximate DNA storage system. arXiv preprint arXiv:2103.02847 (2021)
12. Tabatabaei, S.K., et al.: DNA punch cards for storing data on native DNA sequences via enzymatic nicking. Nat. Commun. 11(1), 1–10 (2020)
13. Han, M., Chen, W., Song, L., Li, B., Yuan, Y.: DNA information storage: bridging biological and digital world. Synthetic Biol. J. 1–14 (2021)
14. Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: dataset and study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 126–135 (2017)
15. Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In: Proceedings of the 23rd British Machine Vision Conference (BMVC), pp. 135.1–135.10. BMVA Press (2012)
16. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Boissonnat, J.-D., et al. (eds.) Curves and Surfaces 2010. LNCS, vol. 6920, pp. 711–730. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27413-8_47
17. Wang, Z., Chen, J., Hoi, S.C.: Deep learning for image super-resolution: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 1, 3365–3387 (2020)
18. Compeau, P.E., Pevzner, P.A., Tesler, G.: Why are de Bruijn graphs useful for genome assembly? Nat. Biotechnol. 29(11), 987 (2011)
19. Howard, J.: https://nbviewer.jupyter.org/github/fastai/course-v3/blob/master/nbs/dl1/lesson7-superres.ipynb. Accessed 19 Jul 2021
20. Komatsu, R., Gonsalves, T.: Comparing U-net based models for denoising color images. AI 1(4), 465–486 (2020)
21. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241 (2015)
22. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
23. Anwar, S., Khan, S., Barnes, N.: A deep journey into super-resolution: a survey. ACM Comput. Surv. (CSUR) 53(3), 1–34 (2020)
24. Wang, X., et al.: ESRGAN: enhanced super-resolution generative adversarial networks. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11133, pp. 63–79. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11021-5_5
25. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690 (2017)
26. Yu, H., Winkler, S.: Image complexity and spatial information. In: 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX), pp. 12–17. IEEE (2013)
27. Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, pp. 1398–1402 (2003)
Prediction of Virus-Receptor Interactions Based on Similarity and Matrix Completion
Lingzhi Zhu1,2, Guihua Duan1(B), Cheng Yan1,3, and Jianxin Wang1
1 Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China [email protected]
2 School of Computer Science and Engineering, Hunan Institute of Technology, Hengyang 421008, China
3 School of Information Science and Engineering, Hunan University of Chinese Medicine, Changsha 410208, China
Abstract. Viral infectious diseases threaten human health and global security through rapid transmission and severe fatalities. Receptor binding is the first step of viral infection. Identifying hidden virus-receptor interactions opens new perspectives for understanding virus-receptor interaction mechanisms and further helps develop economical and effective treatments for viral infectious diseases. As a cost-effective strategy, computational methods are adopted to predict potential virus-receptor interactions. However, a small number of missing values in the raw biological data and zero rows and columns in the virus-receptor interaction matrix affect the stability and precision of such prediction methods. Therefore, filling in these missing values and zero rows and columns is imperative. In this study, a novel predictive model (PreVRIs) is proposed to predict hidden virus-receptor interactions. In PreVRIs, the viral protein sequence similarity matrix and the viral genomic sequence similarity matrix are first computed from viral protein sequences downloaded from the UniProt database and viral RefSeq genomes obtained from the Reference Sequence (RefSeq) database, respectively. We apply the Gaussian radial basis function (GRB) to fill in the missing values in both matrices, which are then fused into an integrated viral similarity kernel by the similarity kernel learning method (SKL). Second, we calculate the receptor sequence similarity and the receptor protein-protein interaction network similarity based on the amino acid sequence information and the human protein-protein interaction network. Missing values in the receptor protein-protein interaction network similarity matrix are also filled in with GRB, and the two receptor similarities are likewise combined into an integrated receptor similarity kernel by SKL.
Finally, we fill in the zero rows and columns in the interaction matrix with the K-nearest neighbor (KNN) preprocessing method and use a matrix completion model to predict hidden virus-receptor interactions. On the viralReceptor sup dataset and the viralReceptor dataset, 10-fold cross-validation (10-fold CV) experimental results show that the area under curve
© Springer Nature Switzerland AG 2021 Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 584–595, 2021. https://doi.org/10.1007/978-3-030-91415-8_49
(AUC) values of PreVRIs are 0.9106 and 0.9252, respectively, which are consistently better than those of the other related models. In addition, a case study confirms the effectiveness of PreVRIs in predicting hidden virus-receptor interactions.
Keywords: Virus-receptor interactions · Gaussian radial basis · Similarity kernel learning · Similarity · Matrix completion
1 Introduction
Humans can be vulnerable to attack from hundreds of viruses [1]. Above all, some pandemic viruses, such as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) [2], Zika virus [3], Ebola virus [4] and Severe Acute Respiratory Syndrome Coronavirus [5], are emerging at a historically unprecedented rate. For example, the recent global outbreak of Coronavirus Disease 2019 (COVID-19) has emerged as a frequently fatal respiratory-tract infection caused by the newly identified SARS-CoV-2 [6]. This indicates that viral infectious diseases remain a serious health threat worldwide [1]. There is growing evidence that the binding of SARS-CoV-2 to its receptor, angiotensin converting enzyme 2, is the initial event of viral infection [2]. To systematically analyze this interaction mechanism, Zhang et al. [8] proposed a mammalian virus-host receptor interaction dataset (viralReceptor). Based on the viralReceptor dataset, Yan et al. [7] extracted 211 human virus-receptor interactions as a benchmark dataset and proposed a Laplacian regularized least squares model, IILLS, to predict potential virus-receptor interactions. However, noise and disturbances in the virus similarity networks and the receptor similarity networks affect the prediction performance of IILLS. To reduce them, our previous work, NERLS, was proposed based on Network Enhancement and Regularized Least Squares [9]. In NERLS, the Network Enhancement method is utilized to reduce the noise and disturbances of the virus similarity networks and the receptor similarity networks, and the regularized least squares algorithm is used to predict potential virus-receptor interactions. However, neither model considers missing values in the virus and receptor similarities or zero rows and columns in the virus-receptor interaction matrix. To solve this problem, a new computational model (PreVRIs) is proposed.
The advantages of PreVRIs are as follows: (1) 371 virus-receptor interactions are collected as a supplement dataset of viralReceptor (viralReceptor sup). (2) We utilize the Gaussian radial basis function (GRB) to handle missing values in the virus and receptor similarities and use the similarity kernel learning method (SKL) [10] to integrate different similarity kernels into an integrated similarity kernel. (3) The K-nearest neighbor (KNN) preprocessing method is applied to fill in zero rows and columns in the interaction matrix, and a matrix completion model is adopted to predict potential virus-receptor interactions.
2 Materials
To improve the performance of potential virus-receptor interaction prediction models, 371 interactions between 181 viruses and 117 receptors are collected from popular databases such as viralReceptor [8], ViralZone [11], and UniProt [12]. The virus-receptor interaction information is collected as follows. First, we download 212 interactions between 105 viruses and 74 receptors from the viralReceptor dataset and use the TaxIDs and names of the 105 viruses to search the UniProtKB database and related publications, obtaining 57 virus-receptor interactions. The Kyoto Encyclopedia of Genes and Genomes (KEGG) Virus database [13] records 152 human diseases caused by 141 viruses. After comparing these 141 viruses with the 105 viruses above, we remove 93 redundant viruses and obtain 48 new viruses. According to the TaxIDs and names of the 48 viruses, we obtain 65 known virus-receptor interactions from the UniProtKB database and related publications. In addition, we collect 28 viruses and 37 virus-receptor interactions from the ViralZone dataset. As a result, 371 known virus-receptor interactions are chosen as the supplement dataset of viralReceptor (viralReceptor sup). The statistics of the viralReceptor sup dataset and the viralReceptor dataset are shown in Table 1.

Table 1. The statistics for the viralReceptor sup and viralReceptor datasets
Datasets | Viruses | Receptors | Interactions
The viralReceptor sup dataset | 181 | 117 | 371
The viralReceptor dataset | 104 | 74 | 211

3 PreVRIs for Discovering Virus-Receptor Interactions
3.1 Virus Similarity Kernel
For viruses, we construct the viral protein sequence similarity kernel and the viral genomic sequence similarity kernel. Let $V = \{v_1, v_2, v_3, \ldots, v_{nv}\}$ be a set of viruses, $R = \{r_1, r_2, r_3, \ldots, r_{nr}\}$ be a set of receptors, and $Y \in \mathbb{R}^{nv \times nr}$ be a known virus-receptor interaction matrix.
Viral Protein Sequence Similarity Kernel. The binding of the viral proteins to their corresponding receptors is the initial event of viral infection [2]. So the viral protein sequence similarity plays an essential role in predicting potential virus-receptor interactions. We download amino acid sequences of viral binding proteins from UniProtKB and use their normalized Smith-Waterman score to construct the viral protein sequence similarity matrix $V_{ProSeq} \in \mathbb{R}^{nv \times nv}$ as follows,
$$V_{ProSeq}(v_i, v_j) = \frac{SW(v_i, v_j)}{\sqrt{SW(v_i, v_i)}\,\sqrt{SW(v_j, v_j)}} \tag{1}$$
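The normalization in Eq. (1) can be illustrated with a minimal Python sketch. It assumes the raw Smith-Waterman scores are already available as a square matrix; the function name `normalize_sw`, the toy scores, and the convention that a missing sequence yields an all-zero row are assumptions made for illustration, not details given in the text.

```python
import math

def normalize_sw(sw):
    """Normalize a square matrix of raw Smith-Waterman scores as in Eq. (1)."""
    n = len(sw)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(sw[i][i]) * math.sqrt(sw[j][j])
            # Assumption: a zero self-score marks a missing sequence,
            # so the corresponding entries are left at 0.
            sim[i][j] = sw[i][j] / denom if denom > 0 else 0.0
    return sim

# Toy raw scores (hypothetical); the third virus has no sequence.
raw_scores = [
    [100.0, 30.0, 0.0],
    [30.0, 64.0, 0.0],
    [0.0, 0.0, 0.0],
]
v_proseq = normalize_sw(raw_scores)
```

With this normalization, every virus that has a sequence gets a self-similarity of 1, and off-diagonal values are scaled by the geometric mean of the two self-scores.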
Given that 13 of the 181 viral binding protein sequences are missing, we use the i-th row vector of $V_{ProSeq}$ to represent the similarities between virus $v_i$ and all other viruses. The distance between the i-th row vector and the j-th row vector is then also used to represent the viral protein sequence similarity between virus $v_i$ and virus $v_j$. The Gaussian radial basis similarity matrix $G_{V_{ProSeq}}$ is constructed by GRB as below,
$$G_{V_{ProSeq}}(v_i, v_j) = \exp\left(\frac{\|V_{ProSeq}(v_i) - V_{ProSeq}(v_j)\|^2}{-2\sigma^2}\right) \tag{2}$$
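As a hedged illustration of Eq. (2), the sketch below computes the Gaussian radial basis similarity between the row vectors of a similarity matrix. The function name `grb_kernel` and the default bandwidth `sigma=1.0` are assumptions; the text does not state which value of σ is used.

```python
import math

def grb_kernel(sim, sigma=1.0):
    """Gaussian radial basis similarity between rows of `sim`, as in Eq. (2).

    sigma is a free bandwidth parameter; its default here is an assumption.
    """
    n = len(sim)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between the i-th and j-th row profiles.
            sq_dist = sum((a - b) ** 2 for a, b in zip(sim[i], sim[j]))
            g[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return g

# Toy similarity matrix (hypothetical values).
toy_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
g_toy = grb_kernel(toy_sim)
```

Rows with similar profiles get a value close to 1, so two viruses can be scored as similar even when their direct sequence comparison is unavailable.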
Next, $G_{V_{ProSeq}}$ is used to fill in the missing elements of $V_{ProSeq}$ to construct the filled viral protein sequence similarity kernel $K_{V_1} \in \mathbb{R}^{nv \times nv}$ as follows,
$$K_{V_1}(v_i, v_j) = \begin{cases} V_{ProSeq}(v_i, v_j), & \text{if } V_{ProSeq}(v_i, v_j) \neq 0 \\ G_{V_{ProSeq}}(v_i, v_j), & \text{otherwise} \end{cases} \tag{3}$$
Viral Genomic Sequence Similarity Kernel. According to the assumption that similar viruses show similar k-mer patterns [14], the k-mer similarity can be used to infer the correlation of viruses. First, we download the viral RefSeq genomes from the Reference Sequence (RefSeq) database at the U.S. National Center for Biotechnology Information (NCBI) [15]. Based on viral genomic k-mer frequencies, we use the $d_2^*$ oligonucleotide frequency measure [14] to calculate the distance between k-mer frequency vectors as,
$$d_2^*(v_i, v_j) = \frac{1}{2}\left[1 - \frac{\sum_{w \in \Omega_k} \frac{(N_w^i - EN_w^i)(N_w^j - EN_w^j)}{\sqrt{EN_w^i\, EN_w^j}}}{\sqrt{\sum_{w \in \Omega_k} \frac{(N_w^i - EN_w^i)^2}{EN_w^i}}\,\sqrt{\sum_{w \in \Omega_k} \frac{(N_w^j - EN_w^j)^2}{EN_w^j}}}\right] \tag{4}$$
in which a word $w = w_1 w_2 w_3 \ldots w_k$ represents a k-mer, each element of the word $w$ is in the set $\Omega = \{A, C, G, T\}$, and $\Omega_k$ is the set of all k-mers. For the genomic sequence of the virus $v_i$, $N_w^i$ is the number of occurrences of the word $w$ and $EN_w^i$ is the expected number of occurrences of the word $w$. With $k$ set to 6 based on the existing study [14], the genomic sequence similarity matrix $V_{genSeq} \in \mathbb{R}^{nv \times nv}$ can be constructed as,
$$V_{genSeq}(v_i, v_j) = \frac{1}{2}\left[1 + \frac{\sum_{w \in \Omega_k} \frac{(N_w^i - EN_w^i)(N_w^j - EN_w^j)}{\sqrt{EN_w^i\, EN_w^j}}}{\sqrt{\sum_{w \in \Omega_k} \frac{(N_w^i - EN_w^i)^2}{EN_w^i}}\,\sqrt{\sum_{w \in \Omega_k} \frac{(N_w^j - EN_w^j)^2}{EN_w^j}}}\right] \tag{5}$$
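A minimal Python sketch of the similarity in Eq. (5), which equals $1 - d_2^*$ from Eq. (4), is given below. The text does not specify how the expected counts $EN_w$ are obtained, so this sketch assumes an independent-base (zeroth-order) background model estimated from each sequence; that model, the helper names, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Observed k-mer counts N_w over all overlapping windows of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def expected_counts(seq, k):
    """Expected k-mer counts EN_w under an i.i.d. base model (an assumption)."""
    freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    n_windows = len(seq) - k + 1
    expected = {}
    for word in product("ACGT", repeat=k):
        prob = 1.0
        for base in word:
            prob *= freq[base]
        expected["".join(word)] = n_windows * prob
    return expected

def genomic_similarity(seq_i, seq_j, k):
    """Similarity of Eq. (5): 0.5 * (1 + centred, normalized k-mer correlation)."""
    n_i, n_j = kmer_counts(seq_i, k), kmer_counts(seq_j, k)
    e_i, e_j = expected_counts(seq_i, k), expected_counts(seq_j, k)
    num = s_i = s_j = 0.0
    for w in e_i:
        if e_i[w] == 0 or e_j[w] == 0:
            continue  # word cannot occur under one background model; skip it
        d_i, d_j = n_i[w] - e_i[w], n_j[w] - e_j[w]
        num += d_i * d_j / math.sqrt(e_i[w] * e_j[w])
        s_i += d_i * d_i / e_i[w]
        s_j += d_j * d_j / e_j[w]
    if s_i == 0 or s_j == 0:
        return 0.5  # degenerate case: treat the correlation as zero
    return 0.5 * (1.0 + num / (math.sqrt(s_i) * math.sqrt(s_j)))

# Toy genomes (hypothetical); k = 2 keeps the example small, the text uses k = 6.
sim_same = genomic_similarity("ACGTACGTTGCA", "ACGTACGTTGCA", 2)
sim_ab = genomic_similarity("ACGTACGTTGCA", "AACCGGTTACGT", 2)
sim_ba = genomic_similarity("AACCGGTTACGT", "ACGTACGTTGCA", 2)
```

By the Cauchy-Schwarz inequality the normalized correlation lies in [-1, 1], so the similarity stays in [0, 1] and is 1 for identical k-mer profiles.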
Considering that 21 of the 181 viral genomic sequences are also missing, we use GRB on the rows of $V_{genSeq}$ to obtain the Gaussian radial basis similarity matrix $G_{V_{genSeq}}$ as below,
$$G_{V_{genSeq}}(v_i, v_j) = \exp\left(\frac{\|V_{genSeq}(v_i) - V_{genSeq}(v_j)\|^2}{-2\sigma^2}\right) \tag{6}$$
Similarly, $G_{V_{genSeq}}$ is also used to fill in the missing elements of $V_{genSeq}$ to construct the genomic sequence similarity kernel $K_{V_2} \in \mathbb{R}^{nv \times nv}$ as follows,
$$K_{V_2}(v_i, v_j) = \begin{cases} V_{genSeq}(v_i, v_j), & \text{if } V_{genSeq}(v_i, v_j) \neq 0 \\ G_{V_{genSeq}}(v_i, v_j), & \text{otherwise} \end{cases} \tag{7}$$
3.2 Receptor Similarity Kernel
For receptors, the receptor protein-protein interaction network (PPI) similarity kernel and the receptor sequence similarity kernel can be constructed.
Receptor Protein-Protein Interaction Network Similarity Kernel. Neighbors of receptors in the PPI have an important impact on the measure of receptor similarity. After mapping the names or GeneIDs of receptors to the PPI, all neighbors of the receptors are chosen and the log likelihood scores (LLS) between proteins are downloaded. The LLS between proteins are normalized as below,
$$LLS^*(m, n) = \frac{LLS(m, n)}{LLS_{max}} \tag{8}$$
Let $N_i$ be the set of the nearest neighbors of receptor $r_i$ and $N_j$ be the set of the nearest neighbors of receptor $r_j$; then the receptor protein-protein interaction network similarity matrix $R_{PPI}$ can be computed as follows,
$$R_{PPI}(r_i, r_j) = \begin{cases} LLS^*(r_i, r_j), & \text{if } (r_i \in N_i) \cap (r_j \in N_j) \\ \dfrac{LLS^*(r_i, m) + LLS^*(m, r_j)}{2}, & \text{if } m \in (N_i \cap N_j) \\ \dfrac{LLS^*(r_i, m) + LLS^*(m, n) + LLS^*(n, r_j)}{3}, & \text{if } (m \in N_i) \cap (n \in N_j) \end{cases} \tag{9}$$
Considering that 7 of the 117 receptors are not in the PPI, we also use GRB on the rows of $R_{PPI}$ to obtain the Gaussian radial basis similarity matrix $G_{R_{PPI}}$ as below,
$$G_{R_{PPI}}(r_i, r_j) = \exp\left(\frac{\|R_{PPI}(r_i) - R_{PPI}(r_j)\|^2}{-2\sigma^2}\right) \tag{10}$$
Similarly, $G_{R_{PPI}}$ is also used to fill in the missing elements of $R_{PPI}$ to construct the receptor protein-protein interaction network similarity kernel $K_{R_1} \in \mathbb{R}^{nr \times nr}$,
$$K_{R_1}(r_i, r_j) = \begin{cases} R_{PPI}(r_i, r_j), & \text{if } R_{PPI}(r_i, r_j) \neq 0 \\ G_{R_{PPI}}(r_i, r_j), & \text{otherwise} \end{cases} \tag{11}$$
Receptor Sequence Similarity Kernel. In addition, amino acid sequences of receptors are downloaded from the KEGG GENE database, and their normalized Smith-Waterman scores are used to calculate the receptor sequence similarity and construct the receptor sequence similarity kernel $K_{R_2} \in \mathbb{R}^{nr \times nr}$ as follows,
$$K_{R_2}(r_i, r_j) = R_{seq}(r_i, r_j) = \frac{SW(r_i, r_j)}{\sqrt{SW(r_i, r_i)}\,\sqrt{SW(r_j, r_j)}} \tag{12}$$
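The fill-in rule shared by Eqs. (3), (7) and (11) can be sketched as follows. The sketch reads the rule as keeping every non-zero similarity and replacing zero (missing) entries by the Gaussian radial basis value; the function names, the bandwidth `sigma=1.0`, and the toy matrix are assumptions made for illustration.

```python
import math

def grb_kernel(sim, sigma=1.0):
    """Gaussian radial basis similarity between rows of `sim` (Eqs. 2, 6, 10)."""
    n = len(sim)
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(sim[i], sim[j]))
                     / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]

def fill_missing(sim, sigma=1.0):
    """Keep known (non-zero) similarities; fill zero entries with the GRB value."""
    g = grb_kernel(sim, sigma)
    n = len(sim)
    return [
        [sim[i][j] if sim[i][j] != 0 else g[i][j] for j in range(n)]
        for i in range(n)
    ]

# Toy similarity matrix (hypothetical): the third item has no data at all.
r_ppi = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
k_r1 = fill_missing(r_ppi)
```

In this toy case the known value 0.8 is preserved, while every entry in the all-zero row and column receives a strictly positive value derived from the row profiles.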
3.3 Similarity Kernel Learning
Fusing different features and different similarities can improve the prediction performance. In this study, we integrate two similarity kernels into an integrated similarity kernel by SKL [10]. For viruses, $K_{V_1}$ and $K_{V_2}$ are combined into an integrated virus similarity kernel $K_V \in \mathbb{R}^{nv \times nv}$ as follows,
$$K_V = \sum_{i=1}^{2} \mu_i^v K_{V_i} \tag{13}$$
It is generally considered that $K_V$ should be close to the interaction matrix $Y$. The virus interaction matrix $Y^v$ is defined as,
$$Y^v = Y Y^T \tag{14}$$
The distance between $K_V$ and $Y^v$ is minimized to find $\mu^v \in \mathbb{R}^{2 \times 1}$ as below,
$$\min_{\mu^v} \|K_V - Y^v\|_F^2 \tag{15}$$
in which $\|K_V - Y^v\|_F$ is the Frobenius norm of the difference between $K_V$ and $Y^v$. To avoid overfitting, we add the regularization parameter $\lambda$ and the regularization term $\|\mu^v\|$ as follows,
$$\min_{\mu^v} \|K_V - Y^v\|_F^2 + \lambda \|\mu^v\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{2} \mu_i^v = 1, \quad \mu_i^v \geq 0, \; i = 1, 2 \tag{16}$$
in which $\lambda$ is initialized to $2 \times 10^4$ based on the previous study [10]. The CVX toolbox in MATLAB R2015b is used to solve the optimization problem and obtain the integration parameter $\mu^v \in \mathbb{R}^{1 \times 2}$ for $K_{V_1}$ and $K_{V_2}$. The integrated virus similarity kernel $K_V$ can then be defined as follows,
$$K_V = \sum_{i=1}^{2} \mu_i^v K_{V_i} \tag{17}$$
Similarly, we can also define the integrated receptor similarity kernel as,
$$K_R = \sum_{i=1}^{2} \mu_i^r K_{R_i} \tag{18}$$
in which the integration parameter $\mu^r \in \mathbb{R}^{1 \times 2}$ for $K_{R_1}$ and $K_{R_2}$ is obtained by the same method.
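For the two-kernel case, the constrained problem of Eq. (16) reduces to a one-dimensional quadratic, so it can be illustrated without a solver. The sketch below is not the CVX-based procedure described in the text: it substitutes a closed-form solution, writing $\mu = (t, 1 - t)$, minimizing over $t$, and clipping $t$ to $[0, 1]$. The function names and toy matrices are assumptions made for illustration.

```python
def frob_inner(a, b):
    """Frobenius inner product of two equally sized matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def target_from_interactions(y):
    """Y^v = Y Y^T as in Eq. (14)."""
    return [[sum(p * q for p, q in zip(ri, rj)) for rj in y] for ri in y]

def learn_two_kernel_weights(k1, k2, target, lam):
    """Closed-form minimizer of Eq. (16) for two kernels (swapped in for CVX).

    With mu = (t, 1 - t), D = K1 - K2 and E = K2 - target, the objective is
    ||t*D + E||_F^2 + lam * (t^2 + (1 - t)^2), a convex quadratic in t.
    """
    d = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(k1, k2)]
    e = [[b - t for b, t in zip(r2, rt)] for r2, rt in zip(k2, target)]
    denom = frob_inner(d, d) + 2.0 * lam
    t = (lam - frob_inner(d, e)) / denom if denom > 0 else 0.5
    t = min(1.0, max(0.0, t))  # enforce mu_i >= 0; the weights sum to 1 by construction
    return t, 1.0 - t

# Toy data (hypothetical): three viruses, two receptors.
y = [[1, 0], [1, 0], [0, 1]]
y_v = target_from_interactions(y)
k_a = [[float(v) for v in row] for row in y_v]           # matches the target exactly
k_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights_no_reg = learn_two_kernel_weights(k_a, k_b, y_v, lam=0.0)
weights_strong_reg = learn_two_kernel_weights(k_a, k_b, y_v, lam=2e4)
```

Without regularization all weight goes to the kernel that matches the target; with a large λ such as $2 \times 10^4$ the weights are pulled toward equal values, which is the expected effect of the $\|\mu\|_2^2$ penalty.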
3.4 Matrix Completion for Discovering Virus-Receptor Interactions
Motivated by a matrix completion model for drug repositioning [16], we utilize matrix completion to uncover hidden virus-receptor interactions. Based on the integrated virus similarity kernel and the integrated receptor similarity kernel, PreVRIs fills in zero rows and zero columns in the virus-receptor interaction matrix with the KNN preprocessing method, respectively. Specifically, PreVRIs utilizes the integrated virus similarity matrix and the filled virus-receptor interaction matrix to construct one block adjacency matrix, and uses the integrated receptor similarity matrix and the virus-receptor interaction matrix to construct another block adjacency matrix.
The KNN Preprocessing Method. To fill in zero rows and zero columns in the virus-receptor interaction matrix, the KNN preprocessing method is employed. The i-th row vector represents the interaction values between the i-th virus and all receptors, and the j-th column vector represents the interaction values between all viruses and the j-th receptor. Let $P$ denote the K nearest neighbors of the i-th virus in the integrated virus similarity kernel.
$$S_v = \sum_{v_p \in P} K_V(v_i, v_p) \tag{19}$$
$$Y_1(v_i, :) = \frac{\sum_{v_p \in P} K_V(v_i, v_p) * Y_1(v_p, :)}{S_v} \tag{20}$$
in which $Y_1(v_i, :)$ is the i-th row in the virus-receptor interaction matrix and the denominator is the normalization term. After the KNN preprocessing step, we get an updated virus-receptor interaction matrix $Y_1$ and a block adjacency matrix $B_1 \in \mathbb{R}^{nv \times (nv + nr)}$ as below,
$$B_1 = \begin{bmatrix} K_V & Y_1 \end{bmatrix} \tag{21}$$
Similarly, we can also utilize $Q$ to denote the K nearest neighbors of the j-th receptor in the integrated receptor similarity kernel.
$$S_r = \sum_{r_q \in Q} K_R(r_j, r_q) \tag{22}$$
$$Y_2(:, r_j) = \frac{\sum_{r_q \in Q} K_R(r_j, r_q) * Y_2(:, r_q)}{S_r} \tag{23}$$
Another updated virus-receptor interaction matrix $Y_2$ is also obtained by the KNN preprocessing step. Another block adjacency matrix $B_2 \in \mathbb{R}^{(nv + nr) \times nr}$ is defined as,
$$B_2 = \begin{bmatrix} K_R \\ Y_2 \end{bmatrix} \tag{24}$$
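The KNN preprocessing of Eqs. (19), (20), (22) and (23) can be sketched in Python as below. The sketch makes several assumptions that the text leaves open: only all-zero rows (or columns) are modified, a virus is not counted among its own neighbors, and neighbor profiles are taken from the original interaction matrix. The function names and toy data are hypothetical.

```python
def knn_fill_zero_rows(y, sim, k):
    """Fill all-zero rows of y with a similarity-weighted mean of neighbor rows."""
    n = len(y)
    filled = [list(row) for row in y]
    for i in range(n):
        if any(y[i]):
            continue  # the row already has known interactions; leave it unchanged
        # K nearest neighbors of item i by similarity, excluding i itself.
        neighbors = sorted(
            (p for p in range(n) if p != i),
            key=lambda p: sim[i][p],
            reverse=True,
        )[:k]
        s = sum(sim[i][p] for p in neighbors)  # normalization term (Eq. 19 / 22)
        if s == 0:
            continue  # no informative neighbor; keep the zero row
        filled[i] = [
            sum(sim[i][p] * y[p][c] for p in neighbors) / s
            for c in range(len(y[i]))
        ]
    return filled

def knn_fill_zero_columns(y, sim, k):
    """Column version: transpose, fill rows with the receptor kernel, transpose back."""
    transposed = [list(col) for col in zip(*y)]
    return [list(row) for row in zip(*knn_fill_zero_rows(transposed, sim, k))]

# Toy data (hypothetical): three viruses, two receptors; virus 2 has no known receptor.
y_toy = [[1, 0], [0, 1], [0, 0]]
kv_toy = [[1.0, 0.2, 0.6], [0.2, 1.0, 0.2], [0.6, 0.2, 1.0]]
y1_toy = knn_fill_zero_rows(y_toy, kv_toy, k=2)
```

In the toy case the empty row becomes a weighted mix of the two neighbor rows, with the more similar virus contributing three quarters of the weight.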
BNNR Model. In this subsection, we utilize bounded nuclear norm regularization (BNNR) [17] to recover unknown elements of the block adjacency matrix $B_1$ and discover potential virus-receptor interactions. The new block adjacency matrix is computed as,
$$\begin{bmatrix} K_V^* & Y_1^* \end{bmatrix} \tag{25}$$
BNNR is also applied to recover unknown elements of the block adjacency matrix $B_2$ and discover potential virus-receptor interactions as below,
$$\begin{bmatrix} K_R^* \\ Y_2^* \end{bmatrix} \tag{26}$$
Finally, the two predictive virus-receptor interaction matrices are combined into a final predictive matrix by taking their mean,
$$Y^* = \frac{Y_1^* + Y_2^*}{2} \tag{27}$$
4 Experimental Results and Discussion
4.1 Performance Evaluation
10-fold CV is implemented to evaluate the prediction performance of PreVRIs. In 10-fold CV, all known virus-receptor interactions are randomly divided into 10 approximately equal parts. Each part is left out in turn as the test samples, and the remaining parts are used as the training samples. All unknown virus-receptor interactions are the candidate samples.
4.2 Comparison with Related Methods
In this study, we compare PreVRIs with four related methods, namely NERLS [9], IILLS [7], BRWH [18], and LapRLS [19], in 10-fold CV. On the viralReceptor sup dataset, the predictive results of the five methods are shown in Fig. 1. As shown in Fig. 1, PreVRIs achieves the best result with an AUC value of 0.9106, while NERLS, IILLS, BRWH, and LapRLS achieve AUC values of 0.8893, 0.8568, 0.7811, and 0.7655 in 10-fold CV, respectively. Based on the viralReceptor dataset, we also compare the prediction performance of PreVRIs with NERLS, IILLS, BRWH, and LapRLS. In Fig. 2, PreVRIs obtains an AUC value of 0.9252, which is better than those of the other four methods (NERLS: 0.893, IILLS: 0.8675, BRWH: 0.7959, and LapRLS: 0.7577). In summary, PreVRIs can be considered a more promising method for predicting potential virus-receptor interactions than the four related methods on both the viralReceptor sup dataset and the viralReceptor dataset.
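The evaluation protocol can be illustrated with a short Python sketch: one helper splits the known interactions into 10 folds, and another computes AUC as the probability that a held-out positive is scored above a candidate negative, which equals the area under the ROC curve. The helper names, the random seed, and the toy scores are assumptions; the exact ranking protocol used in the experiments is not reproduced here.

```python
import random

def k_fold_splits(known_pairs, k=10, seed=0):
    """Randomly divide known interactions into k approximately equal folds."""
    pairs = list(known_pairs)
    random.Random(seed).shuffle(pairs)
    return [pairs[i::k] for i in range(k)]

def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparisons (ties count one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy example: 211 hypothetical known (virus, receptor) index pairs.
known = [(i, i % 7) for i in range(211)]
folds = k_fold_splits(known, k=10)

# Toy predicted scores for held-out positives and candidate negatives.
toy_auc = auc([0.9, 0.3], [0.5, 0.1])
```

In the toy scores three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. In each round one fold would be hidden from training, scored together with the unknown pairs, and the per-fold AUC values averaged.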
[Figure: ROC curves in 10-fold CV, true positive rate versus false positive rate. PreVRIs (AUC = 0.9106), NERLS (AUC = 0.8893), IILLS (AUC = 0.8568), BRWH (AUC = 0.7811), LapRLS (AUC = 0.7655).]
Fig. 1. The AUC curves of five methods on the viralReceptor sup dataset
[Figure: ROC curves in 10-fold CV, true positive rate versus false positive rate. PreVRIs (AUC = 0.9252), NERLS (AUC = 0.893), IILLS (AUC = 0.8675), BRWH (AUC = 0.7959), LapRLS (AUC = 0.7577).]
Fig. 2. The AUC curves of five methods on the viralReceptor dataset
4.3 Case Study
This subsection describes a case study that demonstrates the prediction performance of PreVRIs in a real application. All known interactions are regarded as the training samples, and PreVRIs is utilized to predict the scores of hidden interaction pairs of viruses and receptors. The hidden interactions are ranked from high to low according to the predicted score. We select the top 10 hidden interactions of PreVRIs to analyse the prediction capability. Table 2 shows that 4 of the top 10 hidden interactions are validated by existing literature in PubMed. For instance, Bovine alphaherpesvirus 1 (BoHV-1) belongs to the neurotropic alphaherpesviruses and tends to infect sensory neurons through the binding of its glycoprotein D to the receptor nectin cell adhesion molecule 2 [20]. Rhesus rotavirus uses integrin subunit beta 7 as the viral receptor, resulting in CHO K1 cell infection [21]. The binding of its spike protein VP4 to integrin subunit beta 7 results in increased infectivity in CHO cells [21]. An investigation of a solute carrier family 20 member 1/solute carrier family 20 member 2 chimera, in which the solute carrier family 20 member 1 backbone harbors G120-V141 of solute carrier family 20 member 2, indicates Amphotropic murine leukemia virus receptor function upon human solute carrier family 20 member 1 [22]. Coxsackievirus A16 (CVA16), a causative agent of hand, foot and mouth disease (HFMD), can use an entry mechanism via an identified functional receptor, P-selectin glycoprotein ligand 1, for infection of host cells [23].

Table 2. Top-10 predictive results of PreVRIs
Rank | Virus name | Receptor name | References
1 | Bovine alphaherpesvirus 1 | nectin cell adhesion molecule 2 | Rudd et al. (2021)
2 | B virus (Macacine alphaherpesvirus 1) | TNF receptor superfamily member 14 | Unknown
3 | Rhesus rotavirus | Integrin subunit beta 7 | PMID:16298987
4 | Lymphocytic choriomeningitis mammarenavirus | lysosomal associated membrane protein 1 | Unknown
5 | Human betaherpesvirus 7 | C-type lectin domain family 4 member M | Unknown
6 | Human papillomavirus type 6b | CD151 molecule (Raph blood group) | Unknown
7 | Marburg marburgvirus | C-type lectin domain family 4 member G | Unknown
8 | Sin Nombre virus (Sin Nombre orthohantavirus) | jumonji domain containing 6, arginine demethylase and lysine hydroxylase | Unknown
9 | Amphotropic murine leukemia virus | Solute carrier family 20 member 1 | PMID:21586110
10 | Coxsackievirus A16 | Selectin P ligand | PMID:19543284
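The ranking step of the case study can be sketched as follows: score every virus-receptor pair, discard the pairs already known, and keep the highest-scoring remainder. The function name `top_hidden_interactions` and the toy matrices are hypothetical and only illustrate the selection logic.

```python
def top_hidden_interactions(scores, known, top_n=10):
    """Rank unknown (virus, receptor) pairs by predicted score, highest first."""
    candidates = [
        (scores[i][j], i, j)
        for i in range(len(scores))
        for j in range(len(scores[i]))
        if known[i][j] == 0  # only pairs without a known interaction
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(i, j, s) for s, i, j in candidates[:top_n]]

# Toy predicted scores and known interactions (hypothetical values).
toy_scores = [[0.9, 0.4, 0.1], [0.2, 0.8, 0.7]]
toy_known = [[1, 0, 0], [0, 1, 0]]
top2 = top_hidden_interactions(toy_scores, toy_known, top_n=2)
```

The returned tuples hold the virus index, the receptor index and the predicted score, which would then be mapped back to names and checked against the literature as in Table 2.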
5 Conclusion
In our study, a new model, PreVRIs, has been proposed for discovering potential virus-receptor interactions; it can efficiently handle missing values in the virus and receptor similarity matrices. Furthermore, PreVRIs can integrate the viral protein sequence similarity, the viral genomic sequence similarity, the receptor sequence similarity and the receptor protein-protein interaction network similarity. In addition, PreVRIs can efficiently fill in zero rows or columns in the interaction matrix and improve the prediction performance. To validate the performance of PreVRIs, 10-fold CV and a case study are conducted. Experimental results show that PreVRIs is effective compared with four related models.
Acknowledgement. This work is supported in part by the National Natural Science Foundation of China (No. 61772552, No. 61832019, and No. 61962050), 111 Project (No. B18059), Hunan Provincial Science and Technology Program (No. 2018WK4001), the Science and Technology Foundation of Guizhou Province of China under Grant No. [2020]1Y264, the Hengyang Civic Science and Technology Program (202010031491), and the Aid Program Science and Technology Innovative Research Team of Hunan Institute of Technology.
References
1. Geoghegan, J.L., Senior, A.M., Di Giallonardo, F., Holmes, E.C.: Virological factors that increase the transmissibility of emerging human viruses. Proc. Nat. Acad. Sci. 113(15), 4170–4175 (2016)
2. Zhou, P., et al.: A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579(7798), 270–273 (2020)
3. Mlakar, J., et al.: Zika virus associated with microcephaly. New Engl. J. Med. 374(10), 951–958 (2016)
4. Maganga, G.D., et al.: Ebola virus disease in the Democratic Republic of Congo. New Engl. J. Med. 371(22), 2083–2091 (2014)
5. Ge, X.Y., et al.: Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor. Nature 503(7477), 535–538 (2013)
6. Wu, F., et al.: A new coronavirus associated with human respiratory disease in China. Nature 579(7798), 265–269 (2020)
7. Yan, C., Duan, G., Wu, F.X., Wang, J.: IILLS: predicting virus-receptor interactions based on similarity and semi-supervised learning. BMC Bioinf. 20(23), 651 (2019)
8. Zhang, Z., et al.: Cell membrane proteins with high N-glycosylation, high expression and multiple interaction partners are preferred by mammalian viruses as receptors. Bioinformatics 35(5), 723–728 (2019)
9. Zhu, L., Yan, C., Duan, G.: Prediction of virus-receptor interactions based on improving similarities. J. Comput. Biol. 28(7), 650–659 (2021)
10. He, J., Chang, S.F., Xie, L.: Fast kernel learning for spatial pyramid matching. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7. IEEE, Alaska (2008)
11. Masson, P., et al.: ViralZone: recent updates to the virus knowledge resource. Nucleic Acids Res. 41(D1), D579–D583 (2012)
12. Bairoch, A., et al.: The Universal Protein Resource (UniProt). Nucleic Acids Res. 33(suppl 1), D154–D159 (2005)
13. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., Morishima, K.: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45(D1), D353–D361 (2017)
14. Ahlgren, N.A., Ren, J., Lu, Y.Y., Fuhrman, J.A., Sun, F.: Alignment-free oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 45(1), 39–53 (2017)
15. O’Leary, N.A., et al.: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44(D1), D733–D745 (2016)
16. Yang, M., Luo, H., Li, Y., Wu, F.X., Wang, J.: Overlap matrix completion for predicting drug-associated indications. PLoS Comput. Biol. 15(12), e1007541 (2019)
17. Yang, M., Luo, H., Li, Y., Wang, J.: Drug repositioning based on bounded nuclear norm regularization. Bioinformatics 35(14), i455–i463 (2019)
18. Luo, H., Wang, J., Li, M., et al.: Drug repositioning based on comprehensive similarity measures and bi-random walk algorithm. Bioinformatics 32(17), 2664–2671 (2016)
19. Xia, Z., Wu, L.Y., Zhou, X., Wong, S.T.: Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Syst. Biol. 4, S6 (2010)
20. Rudd, J.S., Musarrat, F., Kousoulas, K.G.: Development of a reliable bovine neuronal cell culture system and labeled recombinant bovine herpesvirus type-1 for studying virus-host cell interactions. Virus Res. 293, 198255 (2021)
21. Graham, K.L., Fleming, F.E., Halasz, P., et al.: Rotaviruses interact with alpha4beta7 and alpha4beta1 integrins by binding the same integrin domains as natural ligands. J. Gen. Virol. 86(Pt 12), 3397–3408 (2005). https://doi.org/10.1099/vir.0.81102-0
22. Bøttger, P., Pedersen, L.: Mapping of the minimal inorganic phosphate transporting unit of human PiT2 suggests a structure universal to PiT-related proteins from all kingdoms of life. BMC Biochemistry 12(1), 1–16 (2011)
23. Nishimura, Y., Shimojima, M., Tano, Y., et al.: Human P-selectin glycoprotein ligand-1 is a functional receptor for enterovirus 71. Nat. Med. 15(7), 794–797 (2009)
An Efficient Greedy Incremental Sequence Clustering Algorithm
Zhen Ju1,2, Huiling Zhang1,2, Jingtao Meng2, Jingjing Zhang1,2, Xuelei Li2, Jianping Fan2, Yi Pan2, Weiguo Liu3, and Yanjie Wei2(✉)
1 University of Chinese Academy of Sciences, Beijing 100049, China
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518005, China {xl.li,yj.wei}@siat.ac.cn
3 Shandong University, Jinan 250100, China
Abstract. Gene sequence clustering is a basic and important task in computational biology and bioinformatics, used for the study of phylogenetic relationships, gene function prediction, and more. With the rapid growth of the amount of biological data (gene/protein sequences), clustering faces growing challenges in efficiency and precision. For example, gene databases contain many redundant sequences that provide no valid information but consume computing resources. Widely used greedy incremental clustering tools improve efficiency at the cost of precision. To design a balanced gene clustering algorithm that is both fast and precise, we propose a modified greedy incremental sequence clustering tool that introduces a pre-filter, a modified short word filter, a new data packing strategy, and GPU acceleration. Experimental evaluations on four independent datasets show that the proposed tool can cluster datasets with precisions of 99.99%. Compared with the results of CD-HIT, Uclust, and Vsearch, the number of redundant sequences produced by the proposed method is four orders of magnitude smaller. In addition, on the same hardware platform, our tool is 40% faster than the second-fastest tool. The software is available at https://github.com/SIAT-HPCC/gene-sequence-clustering.
Keywords: Greedy incremental alignment · OneAPI · Gene clustering · Filtering
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 596–607, 2021. https://doi.org/10.1007/978-3-030-91415-8_50
1 Introduction
The data generation ability of new sequencing technology has outpaced Moore’s Law and imposes a substantial burden on the research communities that use such genomics resources. Therefore, it is of great importance to build non-redundant sequence databases from massive genomics data using clustering analysis. Most previous works on sequence clustering focused on the greedy incremental alignment-based (GIA) algorithm [4–6,11]. The original GIA algorithm is often time-consuming, so a pre-align filter was proposed to reduce the number of sequences passed to dynamic programming-based alignment [9]. In contrast to alignment-based clustering, alignment-free methods do not rely on any alignment and are therefore more efficient [12,13]. Recently, deep learning (DL) based unsupervised methods have also been used to solve clustering problems [7,8]. Since alignment is the most reliable way to measure sequence similarity, the GIA algorithms are still the most widely used tools for sequence clustering.
The efficiency of the GIA tools can be improved via the pre-align filtering strategy. The higher the rejection rate of the filter, the faster the tool runs. However, aggressive filtering generally results in poor clustering precision. In the worst case, more than 95% of the results are redundant sequences. There are two challenges in achieving an efficient and precise GIA clustering tool. (1) Modify and improve the pre-align filter. Since there is a trade-off between the false-negative rate and the rejection rate of the filter, one needs to design the filter carefully so as not to increase the false-negative rate. (2) Speed up sequence alignment via parallelization, which can significantly improve the efficiency of the clustering.
In this paper, we propose an efficient and precise GIA-based clustering tool by introducing the following modifications to the original algorithm. (1) A pre-filter with time complexity of O(1) is designed to reduce the number of sequences; (2) A short word based filter with a distance constraint is introduced to further improve the rejection rate without increasing the false-negative rate; (3) Data packing is modified so that more efficient dynamic programming-based alignment can be achieved; (4) The clustering algorithm is implemented via heterogeneous parallelism. The proposed clustering tool is compared with the three most widely used clustering tools, CD-HIT, Uclust, and Vsearch. The results on four independent datasets confirm that our tool is the most efficient and also achieves the highest precision.
2 Background
Gene clustering tools/algorithms can be roughly divided into two classes: alignment-based and alignment-free clustering tools [6,7,12]. Alignment-free tools are efficient since no sequence alignment is needed. Wei et al. proposed an alignment-free clustering method in 2012, which measures similarity by k-tuples [13]. Steinegger and Söding implemented Linclust, which clusters sequences by finding identical k-mers [12]. James et al. implemented MeShCluster, which is a DL-based sequence clustering tool [7]. A comprehensive review of recent DL-based clustering methods can be found in Karim et al. [8]. Holm and Sander first implemented the GIA-based clustering tool in 1998 [6]. Li and Godzik implemented CD-HIT in 2006, which improved Holm’s algorithm by introducing a short word filter before alignment [9]. Edgar developed Uclust in 2010, which aligns only the sequence pairs most likely to be similar after filtering, rather than every pair that passes the filter [4]. Later, in 2016, Rognes et al. developed Vsearch, an open-source alternative to Uclust [11].
Alignment-based tools are the most widely used for sequence clustering. Due to the high time complexity of dynamic programming in the alignment algorithms, many acceleration strategies have been proposed. For example, the Four-Russians algorithm, proposed in 1970, accelerates the alignment process by dividing the dynamic programming matrix into small blocks and computing each block in advance. Loving et al. proposed a bit-parallel algorithm, BitPAl, in 2014, which uses one bit to represent a base in the dynamic programming matrix [10]. In 2014, NVIDIA officially proposed NVBIO, a GPU-accelerated C++ framework for high-throughput sequence analysis covering both short- and long-read alignment. Ahmed et al. improved NVBIO by packing data and implemented the fastest sequence alignment library in 2019 [1].
Gene clustering tools often use a fast pre-align filter to reduce the number of sequence pairs passed to pairwise sequence alignment. One example is the short word filter, which is based on the idea that similar sequences share some short words. Li et al. improved the short word filter by raising the threshold to increase the rejection rate [9]. However, this strategy also increases the false-negative rate and eventually reduces the precision. In 2015, Xin et al. proposed a shifted Hamming distance (SHD) filtering algorithm inspired by the pigeonhole principle [14]. In 2018, Chan et al. reduced the computation steps of SHD and relied on Xeon Phi for acceleration [3]. Acceleration of the SHD algorithm using FPGAs was also proposed by Alser et al. [2].
CD-HIT, Uclust, and Vsearch [4,5,11] are three widely used clustering tools, which have different clustering mechanisms and are employed to analyze different datasets [15]. CD-HIT is an open-source tool based on the original GIA clustering algorithm. It can process both nucleotide and amino acid sequences. CD-HIT improves the rejection rate of the short word filter by increasing the threshold and the length of the short word. Uclust is similar to CD-HIT, but only aligns the first few sequence pairs that are most likely to be similar after filtering. Uclust provides a free 32-bit package, while its 64-bit version is not free. Vsearch is a free, open-source, 64-bit tool, which uses the same alignment algorithm as CD-HIT but does not support amino acid sequence analysis.
3 Methods and Evaluation Metrics
The process of the original GIA clustering is as follows: (1) Sort sequences by length, and let the longest one be the representative of the first cluster; (2) Enter the next sequence into the filter; if it does not pass the filter, let it be a new representative; (3) Align the sequence that passed the filter with all representative sequences; if it is not similar to any representative, it becomes a new representative. The clustering method of our tool is based on the original GIA algorithm, and four additional modifications are introduced to improve the clustering efficiency. The modifications are base-count based pre-filtering, modified short-word-based filtering, data packing and GPU-based parallelization.
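The three steps of the original GIA clustering can be sketched in Python as below. This is a minimal illustration, not the tool's implementation: the filter and the similarity test are passed in as functions, and `toy_identity` is a hypothetical position-wise stand-in for a real alignment score.

```python
def greedy_incremental_clustering(sequences, passes_filter, is_similar):
    """Greedy incremental clustering: longest sequences become representatives first."""
    clusters = []  # list of [representative, members]
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            rep = cluster[0]
            # Run the (expensive) similarity test only if the cheap filter passes.
            if passes_filter(rep, seq) and is_similar(rep, seq):
                cluster[1].append(seq)
                break
        else:
            # Rejected by the filter or dissimilar to every representative.
            clusters.append([seq, [seq]])
    return clusters

def toy_identity(rep, seq):
    """Position-wise identity; a simplistic stand-in for alignment (assumption)."""
    matches = sum(1 for a, b in zip(rep, seq) if a == b)
    return matches / max(len(rep), len(seq))

toy_sequences = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGT"]
toy_clusters = greedy_incremental_clustering(
    toy_sequences,
    passes_filter=lambda rep, seq: True,  # no filtering in this toy run
    is_similar=lambda rep, seq: toy_identity(rep, seq) >= 0.8,
)
```

Because every incoming sequence is compared only with representatives, the number of alignments depends on the number of clusters rather than on the number of sequences already processed, and any pair rejected by the filter skips alignment entirely.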
3.1 Pre-filtering
Pre-filtering assumes that similar sequence pairs share many identical bases. For example, the sequences “ACCA” and “AAGG” share two identical bases (two As). The detailed algorithm is illustrated in Fig. 1.
[Fig. 1 content. Text sequence: ACGCTCACGT (10); pattern sequence: ACGAATACGT (10); LCS: ACGTACGT (8); similarity: 0.8. Base counting: text A 2, C 4, G 2, T 2; pattern A 4, C 2, G 2, T 2. Base summing: min(2,4) + min(4,2) + min(2,2) + min(2,2) = 2 + 2 + 2 + 2 = 8. Compare: cutoff = 10 * 0.8 = 8; base sum >= 8, PASS.]
Fig. 1. Illustration of pre-filtering algorithm; green bases indicate the same bases shared by text and pattern sequences.
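The base-count comparison illustrated in Fig. 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and taking the cutoff relative to the shorter of the two sequences (rounded up) is an assumption that the text does not spell out.

```python
import math
from collections import Counter


def base_sum(text: str, pattern: str) -> int:
    """Sum, over A/C/G/T, of the smaller of the two per-sequence base counts."""
    text_counts = Counter(text)
    pattern_counts = Counter(pattern)
    return sum(min(text_counts[b], pattern_counts[b]) for b in "ACGT")


def passes_prefilter(text: str, pattern: str, similarity: float) -> bool:
    """Keep a pair only if its base sum reaches the similarity cutoff."""
    # Assumption: the cutoff is taken relative to the shorter sequence.
    length = min(len(text), len(pattern))
    # The small epsilon guards against floating-point noise in length * similarity.
    cutoff = math.ceil(length * similarity - 1e-9)
    return base_sum(text, pattern) >= cutoff


# Sequences taken from Fig. 1.
TEXT = "ACGCTCACGT"
PATTERN = "ACGAATACGT"
print(base_sum(TEXT, PATTERN), passes_prefilter(TEXT, PATTERN, 0.8))
```

Because the base sum is an upper bound on the LCS length, a pair rejected here cannot reach the cutoff in a later alignment.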
Pre-filtering can be divided into three steps. First, the numbers of As, Cs, Gs, and Ts in the text sequence and in the pattern sequence are counted as the “base counts”. Second, for each base, the smaller of the two base counts is taken, and these minima are added up to give the base sum. Third, the base sum is compared with the similarity cutoff to determine whether the text sequence and the pattern sequence can be similar. In Fig. 1, the text and pattern sequences are each 10 bases long and the length of their LCS is 8. The cutoff is 10 * 0.8 = 8, which equals the base sum of 8, so this pair of sequences passes pre-filtering. The computational complexity of the pre-filtering step is O(1), so it can reduce the number of sequence pairs without noticeably increasing the computational cost.
3.2 Modified Short Word Filtering
A short word, also called a seed, is a subsequence of a fixed number of bases. Many clustering tools rely on short word filtering [4,5,11], assuming that similar gene sequences share a sufficient number of short words. In the original short word filtering, for a short word of length W, a gene sequence of length L contains (L − W + 1) short words. Let the similarity be S; then a pair of similar sequences has at least L ∗ S identical bases, so at most (L − L ∗ S) bases differ. Since each differing base affects at most W short words, the number of short words that differ between the text and pattern sequences is at most (L − L ∗ S) ∗ W. The minimum number of short words shared by the text and pattern sequences is therefore (L − W + 1) − (L − L ∗ S) ∗ W. If the number of shared short words is at least this minimum, the two sequences are considered likely to be similar.
In Fig. 2, according to the traditional short word filtering, the pattern and text sequences are considered likely to be similar, since they share 3 short words (indicated by blue boxes), which equals the minimum of 3. However, the LCS length of 8 is smaller than the similarity cutoff of 9 (L ∗ S), indicating that they are not similar. To solve this problem, we introduce an additional distance constraint into the short word filtering. If the distance between the positions of a shared short word in the text and pattern sequences is larger than the allowed distance shift, defined as (1 − S) ∗ L, then this pair of short words is excluded. For example, in Fig. 2, the distance between the two CAAAs in the text and pattern sequences is 4, which is larger than the allowed distance shift of 1, so they are excluded. Two of the 3 shared words found by the original short word filtering are excluded after applying the distance constraint (indicated by the red arrows). Since only one shared word remains, which is below the minimum of 3, these two sequences are excluded from further analysis.
[Fig. 2 content. Text sequence: AAAACCAAAA (10); pattern sequence: CCAAAAAAAA (10); LCS: AAAAAAAA (8); similarity: 0.9. Short words of the text sequence: AAAA, AAAC, AACC, ACCA, CCAA, CAAA, AAAA. Short words of the pattern sequence: CCAA, CAAA, AAAA, AAAA, AAAA, AAAA, AAAA.]
Fig. 2. Modified short word filtering. The length of the text sequence and pattern sequence is 10, and the short word length is 4. Each sequence can generate 7 short words.
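The original and the distance-constrained short word counts for the Fig. 2 example can be reproduced with the sketch below. It is an interpretation, not the authors' code: the function names are hypothetical, shared words are counted as distinct words (which matches the count of 3 reported for Fig. 2), and rounding L ∗ S up and the allowed shift down to integers are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional


def word_positions(seq: str, w: int) -> Dict[str, List[int]]:
    """Map each short word of length w to the positions where it starts."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - w + 1):
        positions[seq[i:i + w]].append(i)
    return positions


def min_shared_words(length: int, w: int, similarity: float) -> int:
    """(L - W + 1) - (L - L*S) * W, with L*S rounded up (an assumption)."""
    mismatches = length - math.ceil(length * similarity - 1e-9)
    return (length - w + 1) - mismatches * w


def shared_words(text: str, pattern: str, w: int,
                 max_shift: Optional[int] = None) -> int:
    """Count distinct words found in both sequences.

    With max_shift set, a word counts only if some occurrence in the text
    lies within max_shift positions of some occurrence in the pattern.
    """
    text_pos = word_positions(text, w)
    pattern_pos = word_positions(pattern, w)
    count = 0
    for word, t_positions in text_pos.items():
        p_positions = pattern_pos.get(word, [])
        if max_shift is None:
            count += int(bool(p_positions))
        else:
            count += int(any(abs(i - j) <= max_shift
                             for i in t_positions for j in p_positions))
    return count


# Sequences and parameters taken from Fig. 2.
TEXT, PATTERN, W, S = "AAAACCAAAA", "CCAAAAAAAA", 4, 0.9
allowed_shift = math.floor((1 - S) * len(TEXT) + 1e-9)
print(min_shared_words(len(TEXT), W, S),
      shared_words(TEXT, PATTERN, W),
      shared_words(TEXT, PATTERN, W, max_shift=allowed_shift))
```

With these inputs the unconstrained count meets the minimum, while the constrained count falls below it, which is the behaviour described for Fig. 2.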
3.3 Data Packing
In sequence clustering, dynamic programming-based sequence alignment consumes over ninety percent of the running time. Dynamic programming is memory-intensive, and a good data packing technique can significantly improve its speed. DNA and RNA sequences are made up of 5 symbols: the nucleotides A, C, G, and T/U (T in DNA and U in RNA), plus N (unknown base). In principle, 3 bits are enough to represent these 5 bases [1]. In practice, however, a 4-bit representation is widely used, for example in GASAL2, a very fast sequence alignment library [1]. For sequence clustering, the length of the LCS is what matters, and the nucleotide N does not affect the result and can be ignored. By ignoring N, one can pack the data further using a 2-bit representation, as shown in Fig. 3.
[Fig. 3 content. 8-bit (ASCII) representation: N C G T A N G T = 01001110 01000011 01000111 01010100 01000001 01001110 01000111 01010100; A C N T A C G N = 01000001 01000011 01001110 01010100 01000001 01000011 01000111 01001110. 4-bit codes (A 0000, C 0001, G 0010, T 0011, N 0100): NC GT AN GT = 01000001 00100011 00000100 00100011; AC NT AC GN = 00000001 01000011 00000001 00100100. 2-bit codes (A 00, C 01, G 10, T 11), with N dropped: CGTA GTAC TACG = 01101100 10110001 11000110.]
Fig. 3. Illustration of data packing
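The 2-bit packing of Fig. 3 can be sketched as follows. The code table is the one shown in the figure; the function name, the mapping of U to the same code as T, and left-aligning a final partial byte are assumptions made for this illustration.

```python
# 2-bit codes from Fig. 3; mapping U like T is an assumption for RNA input.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11, "U": 0b11}


def pack_2bit(sequence: str) -> bytes:
    """Pack a nucleotide string into 2 bits per base, dropping N."""
    # Any symbol without a code (such as N) is skipped.
    bases = [CODE[b] for b in sequence.upper() if b in CODE]
    packed = bytearray()
    for start in range(0, len(bases), 4):
        chunk = bases[start:start + 4]
        byte = 0
        for code in chunk:
            byte = (byte << 2) | code
        # Left-align a final partial byte; the caller must track the true
        # base count, because the padding bits look like A (00).
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


# The two 8-base sequences of Fig. 3 shrink from 16 bytes to 3 bytes.
packed = pack_2bit("NCGTANGT" + "ACNTACGN")
print([format(b, "08b") for b in packed])
```

Four bases per byte means the dynamic programming kernel reads a quarter of the memory that an 8-bit encoding would require, which is the effect the data packing aims for.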
3.4 Parallelization
Fig. 4. Illustration of our clustering process; each bar represents a sequence; sequences in the same color are similar.
The original GIA method takes one of the remaining sequences and aligns it with one representative sequence at a time, whereas our method aligns all the remaining sequences with one representative sequence at a time. As shown in Fig. 4, the alignment of the text sequence with the other sequences is parallelized, as in the “for loop”. Each thread in the loop is assigned to a CUDA core. Since alignment algorithms that are fast on a CPU may not be efficient on a GPU, we evaluated several alignment algorithms on the GPU and found that the BitPAl [10] and k-band alignment [3] algorithms are efficient on the GPU for different datasets.
3.5 Evaluation Metrics
The following metrics are used to evaluate the performance of the proposed method:
(1) Precision for clustering. Precision is the fraction of relevant sequences among the retrieved instances; it can be used to measure the quality of clustering. To generate the gold standard clustering results, sequences are aligned pair by pair with the Smith-Waterman algorithm, and then the redundant sequences in the results can be found. Let the number of redundant sequences in the clustering results be $R$ and the number of non-redundant sequences be $N$. Precision is then defined as
\[ \mathrm{Precision} = \frac{N}{N + R} \tag{1} \]
(2) Rejection rate for pre-filtering. Rejection rate is a measure of filtering efficiency. Let the number of sequence pairs entering the pre-filtering step be $N_{pre}$ and the number of rejected pairs be $R_{pre}$. The rejection rate is then
\[ RR_{pre} = \frac{R_{pre}}{N_{pre}} \tag{2} \]
(3) Rejection rate improvement for modified short word filtering. It measures the improvement of the modified short word filtering over the original algorithm. Let $R_s$ and $R_p$ denote the numbers of sequence pairs rejected by the original short word filter and by the modified short word filter, respectively. The rejection rate improvement is defined as
\[ RR_{mod} = \frac{R_p}{R_s} \tag{3} \]
(4) Speedup, which is used to measure the impact of heterogeneous acceleration. Let the running time of the oneAPI version of our tool on the CPU be $T_c$ and that on the GPU be $T_g$. Speedup is defined as
\[ \mathrm{Speedup} = \frac{T_c}{T_g} \tag{4} \]
4 Results and Discussion
4.1 Datasets and Experimental Setting
We have used four widely used 16S rRNA datasets as test data for evaluating the proposed algorithm, as shown in Table 1. The sizes of the datasets range from 150 MB to 1.135 GB, with as many as 711,278 sequences. All datasets are downloaded from GREENGENES (https://GREENGENES.lbl.gov). The proposed algorithm is developed in CUDA, and the other three tools are CPU-based. The hardware information of the CPU and GPU platforms is shown in Table 2. We also developed a oneAPI version, which allows the same code to run on both CPU and GPU. Detailed hardware information for the oneAPI platforms can also be found in Table 2.

Table 1. Datasets used for evaluation
Dataset       Count      Length       Size
NCBI          97,413     1251–1598    150 MB
SILVA         381,226    1251–1672    581 MB
GREENGENES    406,997    1251–1829    630 MB
RDP           711,278    1251–1672    1135 MB

Table 2. Hardware environment of CPU, GPU, and oneAPI platform.
Platform      Processor            Frequency    TDP
CPU           Intel i7-9700K       3.6 GHz      95 W
GPU           NVIDIA GTX1060       1.7 GHz      120 W
OneAPI CPU    Intel i5-1135G7      2.4 GHz      28 W
OneAPI GPU    Intel Iris Xe Max    1.6 GHz      25 W

4.2 Performance Improvement by Filtering
A pre-filtering step and a modified short word filtering step are applied in the proposed clustering algorithm. First, the rejection rate $RR_{pre}$ is used to evaluate the pre-filtering, with the 4 datasets in Table 1 as input. As Fig. 5(a) shows, when the similarity reaches 0.9, the pre-filter rejects more than 1% of the sequence pairs, and when the similarity reaches 0.99, it rejects 50% of the pairs. The pattern is similar for all 4 datasets: the rejection rate of pre-filtering increases as the similarity increases. The time complexity of the pre-filter is O(1), which is consistent with our experiments, in which the impact of this step on the running time is less than 1%. The pre-filtering step is simple yet efficient, and it can also be used by any other clustering tool.
Fig. 5. (a) Rejection rate of pre-filtering on different datasets. (b) Rejection rate improvement of the modified short word filter. The x-axis denotes similarity, and the y-axis denotes rejection rate.
The modified short word filter is compared with the original short word filter. The rejection rate improvement $RR_{mod}$ is used for the comparison, with the 4 datasets in Table 1 as input. The results are shown in Fig. 5(b). When the similarity is larger than 0.9, the number of rejected sequence pairs increases significantly. The time complexity of the improved algorithm is the same as that of the original algorithm, so our improved filter performs better without increasing the cost.
4.3 Performance Improvement by Data Packing
GASAL2 is a very efficient sequence alignment library [1]. To evaluate the effect of data packing on clustering performance, the proposed data packing is implemented in GASAL2, and the results of the modified GASAL2 with data packing are compared with those of the original GASAL2. The running times of the original and modified GASAL2 are compared in Fig. 6(a). The speedup is defined as the ratio of the running time of the original GASAL2 to that of the modified GASAL2. As shown in Fig. 6(a), the speedups for NCBI, SILVA, GREENGENES, and RDP are 1.25, 1.33, 1.22, and 1.21, respectively. A speedup of at least twenty percent can thus be achieved using the modified data packing strategy in GASAL2. Since packing data does not require redesigning the algorithm, it can be easily applied to other clustering tools.
Fig. 6. (a) Running times of the GASAL2 algorithm with modified data packing strategy and the original GASAL2. (b) Heterogeneous speedup based on different datasets.
4.4 Heterogeneous Parallelism
CUDA and ROCm are the most widely used traditional heterogeneous acceleration platforms, but they support only GPU computing, which makes it difficult to directly evaluate the impact of migrating serial code to a heterogeneous platform. Intel and the Khronos SYCL Working Group have built a unified, open, standards-based programming model, oneAPI, which supports major hardware and provides a unified programming model and API to users. In this paper, we relied on Intel oneAPI, which integrates different hardware and hides the differences. oneAPI-based software can run on CPUs, GPUs, and FPGAs without major code modification. We have implemented our clustering algorithm with the oneAPI model. We ran the oneAPI version of the proposed tool on the CPU and on the GPU (see Table 2) and measured the speedup. As Fig. 6(b) shows, with the datasets in Table 1 as input, the GPU accelerates the application by 10 to 70 times under a similar TDP. The computing power of the GPU is much higher than that of the CPU, which makes it possible to get precise clustering results.
4.5 Comparison with Other Clustering Tools
We have selected 3 widely used clustering tools, CD-HIT, Uclust, and Vsearch, to compare with our clustering method. Fig. 7(a) shows that the precision of the other tools improves with increasing similarity, while that of our algorithm stays close to 1 for all similarity thresholds, indicating that our algorithm can guarantee high precision regardless of the similarity threshold. The running times of each tool on different datasets are shown in Fig. 7(b). It can be seen that our tool is the fastest one. Together with Table 2, this shows that our tool runs the fastest while using cheaper hardware. There is a turning point in the running time curves. This is due to the characteristics of greedy incremental clustering. The higher the similarity, the more non-redundant sequences there are, and the more sequence pairs need to be aligned. But as the similarity increases, the rejection rate of the filtering algorithm also rises, which reduces the number of sequence pairs to be aligned. Before the turning point, the growth in the number of sequence pairs to align dominates and the running time increases. After the turning point, the filter rejects more sequence pairs, so the running time decreases. Therefore, the further left the turning point, the higher the efficiency of the filtering algorithm. It can be seen that Uclust uses a more radical filtering algorithm, CD-HIT uses a balanced one, and our tool uses the most conservative one. The turning point of Vsearch is not obvious, but the efficiency of its filtering algorithm is between those of Uclust and CD-HIT.
Fig. 7. (a) The precision of different tools on different datasets. (b) Running time of different tools on different datasets.
5 Conclusion
In this paper, we proposed a highly efficient and precise greedy incremental method for gene sequence clustering that uses a pre-filter, a modified short word filter, a modified data-packing strategy, and heterogeneous acceleration. In addition, we take advantage of the oneAPI programming model with Data Parallel C++ to build code that can run on both CPU and GPU without major reprogramming, so that a fair comparison can be performed. The proposed algorithm is superior to three widely used GIA-based tools in both clustering precision and running speed. A similar algorithm can also be designed for protein sequences, which will be addressed in the future.
Acknowledgment. This work was partly supported by the National Key Research and Development Program of China under Grant No. 2018YFB0204403, Strategic Priority CAS Project XDB38050100, National Science Foundation of China under grant no. U1813203, the Shenzhen Basic Research Fund under grant no. RCYX2020071411473419, KQTD20200820113106007 and JCYJ20180507182818013, CAS Key Lab under grant no. 2011DP173015. We would like to thank Intel for the tech support and resources such as oneAPI DevCloud in this study.
References
1. Ahmed, N., Lévy, J., Ren, S., Mushtaq, H., Bertels, K., Al-Ars, Z.: Gasal2: a GPU accelerated sequence alignment library for high-throughput NGS data. BMC Bioinf. 20(1), 1–20 (2019)
2. Alser, M., Hassan, H., Kumar, A., Mutlu, O., Alkan, C.: Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics 35(21), 4255–4263 (2019)
3. Chan, Y., Xu, K., Lan, H., Schmidt, B., Peng, S., Liu, W.: Myphi: efficient levenshtein distance computation on xeon phi based architectures. Current Bioinf. 13(5), 479–486 (2018)
4. Edgar, R.C.: Search and clustering orders of magnitude faster than blast. Bioinformatics 26(19), 2460–2461 (2010)
5. Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W.: Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012)
6. Holm, L., Sander, C.: Removing near-neighbour redundancy from large protein sequence collections. Bioinf. (Oxford, England) 14(5), 423–429 (1998)
7. James, B.T., Luczak, B.B., Girgis, H.Z.: Meshclust: an intelligent tool for clustering DNA sequences. Nucleic acids Res. 46(14), e83–e83 (2018)
8. Karim, M.R., et al.: Deep learning-based clustering approaches for bioinformatics. Briefings Bioinf. 22(1), 393–415 (2021)
9. Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
10. Loving, J., Hernandez, Y., Benson, G.: Bitpal: a bit-parallel, general integer-scoring sequence alignment algorithm. Bioinformatics 30(22), 3166–3173 (2014)
11. Rognes, T., Flouri, T., Nichols, B., Quince, C., Mahé, F.: Vsearch: a versatile open source tool for metagenomics. PeerJ 4, e2584 (2016)
12. Steinegger, M., Söding, J.: Clustering huge protein sequence sets in linear time. Nat. Commun. 9(1), 1–8 (2018)
13. Wei, D., Jiang, Q., Wei, Y., Wang, S.: A novel hierarchical clustering algorithm for gene sequences. BMC Bioinf. 13(1), 1–15 (2012)
14. Xin, H., et al.: Shifted hamming distance: a fast and accurate simd-friendly filter to accelerate alignment verification in read mapping. Bioinformatics 31(10), 1553–1560 (2015)
15. Zou, Q., Lin, G., Jiang, X., Liu, X., Zeng, X.: Sequence clustering in bioinformatics: an empirical study. Briefings Bioinf. 21(1), 1–10 (2020)
Correlated Evolution in the Small Parsimony Framework
Brendan Smith1, Cristian Navarro-Martinez1, Rebecca Buonopane1, S. Ashley Byun1(B), and Murray Patterson2(B)
1 Fairfield University, Fairfield, CT 06824, USA
{brendan.smith1,cristian.navarro-martinez,rebecca.buonopane}@student.fairfield.edu, [email protected]
2 Georgia State University, Atlanta, GA 30303, USA
[email protected]
Abstract. When studying the evolutionary relationships among a set of species, the principle of parsimony states that a relationship involving the fewest number of evolutionary events is likely the correct one. Due to its simplicity, this principle was formalized in the context of computational evolutionary biology decades ago by, e.g., Fitch and Sankoff. Because the parsimony framework does not require a model of evolution, unlike maximum likelihood or Bayesian approaches, it is often a good starting point when no reasonable estimate of such a model is available. In this work, we devise a method for detecting correlated evolution among pairs of discrete characters, given a set of species on these characters, and an evolutionary tree. The first step of this method is to use Sankoff’s algorithm to compute all most parsimonious assignments of ancestral states (of each character) to the internal nodes of the phylogeny. Correlation between a pair of evolutionary events (e.g., absent to present) for a pair of characters is then determined by the (co-) occurrence patterns between the sets of their respective ancestral assignments. We implement this method: parcours (PARsimonious CO-occURrenceS) and use it to study the correlated evolution among vocalizations and morphological characters in the Felidae family, revealing some interesting results. parcours is freely available at https://github.com/murraypatterson/parcours.
Keywords: Parsimony · Evolution · Correlation
1 Introduction
Research supported by: Fairfield University Science Institute grants for SAB and MP; Georgia State University startup grant for MP; Fredrickson Family Innovation Grant for SAB.
© Springer Nature Switzerland AG 2021. Y. Wei et al. (Eds.): ISBRA 2021, LNBI 13064, pp. 608–619, 2021. https://doi.org/10.1007/978-3-030-91415-8_51
The principle of parsimony first appeared in the context of computational evolutionary biology in [8,15]. A few years later, Sankoff (and Rousseau) generalized this to allow the association of a (different) cost to each transition between a pair of states [31]. A few years after that, Felsenstein noticed that parsimony could produce misleading results when the evolutionary rate of change on the branches of the phylogeny is high [12]. Because parsimony does not take branch length into account, it ignores the fact that many changes on a long branch, while being far from parsimonious, may not be so unlikely in this case, resulting in a bias coined as “long branch attraction”. This paved the way for a proposed refinement to parsimony known as the maximum likelihood method [10,12]. Because of an explosion in molecular sequencing data, and the sophisticated understanding of the evolutionary rate of change in this setting, maximum likelihood has become the de facto framework for inferring phylogeny. Popular software tools which implement maximum likelihood include RAxML [33], PHYLIP [11] and the Bio++ library [16]. Even more recently, some Bayesian approaches have appeared, which sample the space of likely trees using the Markov-Chain Monte Carlo (MCMC) method, resulting in tools such as MrBayes [18] and Beast [7]. However, in cases where there is no model of evolution, or no reasonable estimation of the rate of evolutionary change on the branches of the phylogeny, parsimony is a good starting point. In fact, even in the presence of such a model, there are still conditions under which a maximum likelihood phylogeny is always a maximum parsimony phylogeny [34]. Finally, when the evolutionary rate of change of different characters is heterogeneous, maximum parsimony has even been shown to perform substantially better than maximum likelihood and Bayesian approaches [22]. A perfect example is cancer phylogenetics [17,32], where little is known about the modes of evolution of cancer cells in a tumor.
Due to its importance, this setting has seen a renewed interest in high-performance methods for computing maximum parsimonies in practice, such as SCITE [20] and SASC [4]. The setting of the present work is an evolutionary study of vocalizations and morphological characters [3] among members of the family Felidae. Felids possess a range of intraspecific vocalizations for close, medium, and long range communication. There is no evidence that Felid vocalizations are learned: it is more likely that these calls are genetically determined [9,26]. While there are 14 major discrete and graded calls documented in Felidae, not all calls are produced by all species [28]. In [29], the authors map some of these calls to a molecular phylogeny of Felidae, to show that it was consistent with what was previously known, strengthening the argument that vocalizations are genetically determined. Here we consider a similar approach, but because (a) an obvious model of evolution is lacking in this case; (b) the possibility that vocalizations within a given group of species can evolve at considerably different rates [29]; and (c) that rates for specific characters can differ between different lineages within that group [29], it follows from [22] that parsimony is more appropriate than maximum likelihood or a Bayesian approach. In this work, we develop a general framework to determine correlated (or co-) evolution among pairs of characters in a phylogeny, when no model of evolution is present. We then use this framework to understand how these vocalizations and
morphological characters may have evolved within Felidae, and how they might correlate (or have co-evolved) with each other. The first step of this approach is to infer, for each character, the set of all most parsimonious assignments of ancestral states (small parsimonies) in the phylogeny. Then, for each character, we construct from its set of small parsimonies, a consensus map (see Sect. 3) for each (type of) evolutionary event (e.g., absent to present) along branches of the phylogeny. Correlation between a pair of evolutionary events (for a pair of characters) is then determined by how much their respective consensus maps overlap. We implement this approach in a tool called parcours, and use it to detect correlated evolution among 14 Felid vocalizations and 10 morphological characters, obtaining results that are consistent with the literature [29] as well as some interesting associations. While various methods for detecting correlated evolution exist, they tend to use only phylogenetic profiles [24], or are based on maximum likelihood [2,6], where a model of evolution is needed. Methods that determine co-evolution in the parsimony framework exist as well [25], however they are aimed at reconstructing ancestral gene adjacencies, given extant co-localization information. Finally, while some of the maximum likelihood software tools have a “parsimony mode”, e.g., [11,16], the character information must be encoded using restrictive alphabets, and there is no easy or automatic way to compute all most parsimonious ancestral states for a character—something which is central to our framework. On the contrary, parcours takes character information as a column-separated values (CSV) file, inferring the alphabet from this input, and efficiently computes all small parsimonies, automatically. 
In summary, our contribution is a methodology for inferring correlation among pairs of characters when no model of evolution is available, and implemented into an open-source software that is so easy to use that it could also be used as a pedagogical tool. This paper is structured as follows. In Sect. 2, we provide the background on parsimony, as well as our approach for efficiently computing all most parsimonious ancestral states in a phylogeny, given a set of extant states. In Sect. 3, we present our approach for computing correlation between pairs of characters from all such parsimonies. In Sect. 4, we describe an experimental analysis of the implementation, parcours, of our method on Felid vocalizations and morphological characters. Finally, in Sect. 5, we conclude the paper with a discussion of the results obtained, and future directions.
2 Small Parsimony
In computing a parsimony, the input is typically character-based : involving a set of species, each over a set of characters (e.g., weight category), where each character can be in a number of states (e.g., low, medium, high, etc.). The idea is that each species is in one of these states for each character—e.g., in the Puma, the weight (one of the 10 morphological characters) is “high”—and we want to understand what states the ancestors of these species could be in. Given a phylogeny on these species: a tree where all of these species are leaves, a small parsimony is an assignment of states to the internal (ancestral) nodes of this
tree which minimizes the number of changes of state among the characters along branches of the tree. We illustrate this with the following example. Suppose we have the four species: Puma, Jaguarundi, Cheetah (of the Puma lineage), and Pallas cat (outgroup species from the nearby Leopard cat lineage), alongside the phylogeny depicted in Fig. 1a, implying the existence of the ancestors X, Y and Z. We are given some character, e.g., weight, which can be in one of a number of states taken from alphabet Σ = {low, high, unknown}, which we code here as {0, 1, 2} for compactness. If weight is high only in Puma and Cheetah as in Fig. 1b, then the assignment of 0 (low) to all ancestors would be a small parsimony. This parsimony has two changes of state: a change 0 → 1 on the branches Z → Puma and Y → Cheetah—implying convergent increase of weight in these species. Another small parsimony is that weight is high in Y and Z (low in X)—implying that high weight was ancestral (in Y ) to the Puma, Jaguarundi and Cheetah, and later decreased in the Jaguarundi. A principled way to infer all small parsimonies is to use Sankoff’s algorithm [31], which also makes use of a cost matrix δ, depicted in Fig. 1c, that encodes the evolutionary cost of each state change along a branch. Since the change 0 → 1 (low to high) costs δ0,1 = 1, and vice versa (δ1,0 = 1), it follows that each of the small parsimonies mentioned above have an overall cost, or score of 2 in this framework. A simple inspection of possibilities shows that 2 is the minimum score of any assignment of ancestral states.
Fig. 1. A (a) phylogeny and (b) the extant state of character weight in four species; and (c) the cost δi,j of the change i → j from state i in the parent to state j in the child along a branch of the phylogeny, e.g., δ1,2 = 1 (δ2,1 = ∞).
Sankoff’s algorithm [31] computes all (small) parsimonies given a phylogeny and the extant states of a character in a set of species, and a cost matrix δ, e.g., Fig. 1. The algorithm has a bottom-up phase (from the leaves to the root of the phylogeny), and then a top-down phase. The first (bottom-up) phase is to compute $s_i(u)$: the minimum score of an assignment of ancestral states in the subtree of the phylogeny rooted at node $u$ when $u$ has state $i \in \Sigma$, according to the recurrence:
\[ s_i(u) = \sum_{v \in C} \min_{j \in \Sigma} \{ s_j(v) + \delta_{i,j} \} \tag{1} \]
where $C$ is the set of children of node $u$ in the phylogeny. The idea is that the score is known for any extant species (a leaf node in the phylogeny), and
is coded as $s_i(u) = 0$ if species $u$ has state $i$, and $\infty$ otherwise. The score for the internal nodes is then computed according to Eq. 1 in a bottom-up dynamic programming fashion, starting from the leaves. The boxes in Fig. 2 depict the result of this first phase on the instance of Fig. 1, for example:
\[ \begin{aligned} s_0(X) &= \min \{ s_0(\text{Pallas}) + \delta_{0,0},\ s_1(\text{Pallas}) + \delta_{0,1},\ s_2(\text{Pallas}) + \delta_{0,2} \} \\ &\quad + \min \{ s_0(Y) + \delta_{0,0},\ s_1(Y) + \delta_{0,1},\ s_2(Y) + \delta_{0,2} \} \\ &= \min \{ 0 + 0,\ \infty + 1,\ \infty + 1 \} + \min \{ 2 + 0,\ 1 + 1,\ \infty + 1 \} \\ &= \min \{ 0, \infty, \infty \} + \min \{ 2, 2, \infty \} = 0 + 2 = 2 \end{aligned} \tag{2} \]
In general, for a given character with $|\Sigma| = k$, this procedure would take time $O(nk)$, where $n$ is the number of species, since we compute $nk$ values, and computing each one takes (amortized) constant time.
Fig. 2. Graph structure resulting from Sankoff’s algorithm on the instance of Fig. 1. The boxes contain the values of si (u) computed in the bottom-up phase, for each state i ∈ Σ and internal node u ∈ {X, Y, Z} of the phylogeny. The arrows connecting these boxes were computed in the top-down phase.
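As a further check of the recurrence in Eq. 1, the same computation can be carried out for state 1 at the root. This worked example is a sketch built from the leaf states of Fig. 1 and the costs quoted in the text ($\delta_{1,0} = 1$, $\delta_{1,2} = 1$); it assumes $\delta_{1,1} = 0$, in line with $\delta_{0,0} = 0$ in Eq. 2.

```latex
\begin{aligned}
s_1(X) &= \min \{ s_0(\text{Pallas}) + \delta_{1,0},\ s_1(\text{Pallas}) + \delta_{1,1},\ s_2(\text{Pallas}) + \delta_{1,2} \} \\
       &\quad + \min \{ s_0(Y) + \delta_{1,0},\ s_1(Y) + \delta_{1,1},\ s_2(Y) + \delta_{1,2} \} \\
       &= \min \{ 0 + 1,\ \infty + 0,\ \infty + 1 \} + \min \{ 2 + 1,\ 1 + 0,\ \infty + 1 \} \\
       &= 1 + 1 = 2
\end{aligned}
```

The result agrees with the value of 2 shown for state 1 at the root X in Fig. 2.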
After the bottom-up phase, we know the minimum score of any assignment of ancestral states, but we do not yet have an ancestral assignment of states. Here, since X is the root in Fig. 1a, we see from Fig. 2 that the minimum score for this example is 2, as we saw earlier. Note that this minimum may not be unique; indeed, here s0 (X) = s1 (X) = 2, meaning that in at least one parsimony, weight is low (0) in X, and in at least one other parsimony, weight is high (1) in X (but never unknown, i.e., s2 (X) = ∞). Now, to reconstruct one of these ancestral assignments of minimum score, we first assign to the root, one of the states i ∈ Σ for which si is minimum. We then determine those states in each child of the root from which si can be derived (these may not be unique either), and assign those states accordingly. We continue, recursively, in a top-down fashion until we reach the leaves of the tree, having assigned all states at this point. For example, s0 (X) = 2, and can be derived in Eq. 2 from s1 (Y ) (and s0 (Pallas)), which is in turn derived from s1 (Z). This corresponds to the second parsimony mentioned earlier, where high weight was ancestral in Y , and later decreased in the Jaguarundi. Notice that s0 (X) can also be derived in Eq. 2 from s0 (Y ),
which is in turn derived from s0 (Z). This is the first parsimony where weight is low (0) in all ancestors. One can compactly represent all parsimonies as a graph structure with a box for each si at each internal node of the phylogeny (e.g., boxes of Fig. 2), and an arrow from box si (u) to sj (v) for some node u and its child v in the phylogeny, whenever si (u) can be derived from sj (v) (e.g., arrows of Fig. 2). A parsimony is then some choice of exactly one box in this graph for each internal node of the phylogeny in such a way that they form an underlying directed spanning subtree in this graph, in terms of the arrows which join the boxes. This spanning subtree will in fact have the same topology as the phylogeny, and the choice of box will correspond to the choice of ancestral state of the corresponding internal nodes of this phylogeny. In Fig. 2, for example, the spanning subtree s0 (X) → s0 (Y) → s0 (Z) corresponds to the parsimony where weight is low in all ancestors, and s0 (X) → s1 (Y) → s1 (Z) corresponds to the parsimony where high weight was ancestral in Y, and decreased in the Jaguarundi. Implicitly the leaves are also included in these spanning subtrees, but since there is no choice of state for extant species, they are left out for simplicity; see [5] for another example of this graph structure, with the leaves included. Notice from Fig. 2 that there is a third solution s1 (X) → s1 (Y) → s1 (Z), corresponding to a parsimony where high weight was ancestral (in X) to all species here, and then decreased in both the Pallas cat and the Jaguarundi. Note that, for internal nodes u other than the root, a state i ∈ Σ for which si (u) is not minimized can appear in a solution, e.g., s0 (Y). In general, for |Σ| = k, the procedure for building this graph structure (e.g., in Fig. 2) would take time O(nk^2), since at each node u, each of the k values si (u) is derived from O(k) values at the (amortized constant number of) children of u.
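Both phases described above can be sketched in Python on the weight example of Fig. 1. This is an illustrative sketch, not the parcours implementation: the function names are hypothetical, the tree is hard-coded from the branches named in the text, and the cost entries not quoted in the text (the diagonal and δ2,0) are assumptions.

```python
import math
from itertools import product

INF = math.inf
STATES = (0, 1, 2)  # 0 = low, 1 = high, 2 = unknown

# DELTA[i][j]: cost of a change from parent state i to child state j.
# Entries quoted in the text: d(0,0)=0, d(0,1)=d(1,0)=d(0,2)=d(1,2)=1,
# d(2,1)=inf. Assumed here: d(1,1)=d(2,2)=0 and d(2,0)=inf.
DELTA = [[0, 1, 1],
         [1, 0, 1],
         [INF, INF, 0]]

# Phylogeny and extant weight states of the running example.
CHILDREN = {"X": ["Pallas", "Y"], "Y": ["Z", "Cheetah"], "Z": ["Puma", "Jaguarundi"]}
LEAF_STATE = {"Pallas": 0, "Puma": 1, "Jaguarundi": 0, "Cheetah": 1}


def bottom_up(node, scores):
    """Fill scores[node][i] = s_i(node) for the subtree rooted at node (Eq. 1)."""
    if node in LEAF_STATE:
        scores[node] = [0 if i == LEAF_STATE[node] else INF for i in STATES]
        return scores
    for child in CHILDREN[node]:
        bottom_up(child, scores)
    scores[node] = [
        sum(min(scores[c][j] + DELTA[i][j] for j in STATES) for c in CHILDREN[node])
        for i in STATES
    ]
    return scores


def assignments(node, state, scores):
    """Yield every optimal assignment of internal nodes below node, given its state."""
    if node in LEAF_STATE:
        yield {}
        return
    child_options = []
    for child in CHILDREN[node]:
        best = min(scores[child][j] + DELTA[state][j] for j in STATES)
        # Follow every "arrow": each child state that attains the minimum.
        child_options.append([
            sub
            for j in STATES
            if scores[child][j] + DELTA[state][j] == best
            for sub in assignments(child, j, scores)
        ])
    for combo in product(*child_options):
        merged = {node: state}
        for sub in combo:
            merged.update(sub)
        yield merged


def all_parsimonies(root="X"):
    scores = bottom_up(root, {})
    best = min(scores[root])
    found = [a for i in STATES if scores[root][i] == best
             for a in assignments(root, i, scores)]
    return scores, found


scores, parsimonies = all_parsimonies()
print(scores["X"], parsimonies)
```

On this instance the sketch returns three assignments of states to X, Y, and Z, which correspond to the three spanning subtrees discussed for Fig. 2.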
3 Correlation
We want to determine correlated evolution among pairs of characters. Suppose we have a second character, dental profile (number of teeth), which can have states from {28, 30, unknown}, which we again code as {0, 1, 2} for compactness. Here, the dental profile is “fewer” (0) only in the Pallas cat, as depicted in Fig. 3a, and we want to understand how this is correlated with the weight of Fig. 1b. The idea is that we first construct likely hypotheses for what ancestral states each character $c$ might have, namely the set $P(c)$ of all small parsimonies. Then we determine whether there are any correlated changes in state of pairs of characters along branches of the phylogenetic tree, given their sets of parsimonies. While a drawback of parsimony is that there are potentially many solutions, without any a priori knowledge of ancestral state, choosing one solution (or a subset of solutions) is an arbitrary decision. Since such a decision could bias any downstream analysis, our approach attempts to avoid this by using the entire set of parsimonies. Let $a = \alpha \to \beta$ be a change in state (for some pair $\alpha, \beta$ of states) of character $c$. Let $B_p(a)$ be the multiset (even though each element will have multiplicity 0 or 1) of branches in the phylogenetic tree where change $a$ occurs
B. Smith et al.
Fig. 3. The (a) extant state of character dental in four species and (b) the structure resulting from Sankoff’s algorithm on these extant states along with the phylogeny of Fig. 1a and cost of Fig. 1c. Values s2 have been omitted for compactness, since they are all ∞, like in Fig. 2.
in parsimony $p \in P(c)$. The consensus map $C_c(a)$ for change $a$ of character $c$ is then the multiset resulting from

$$\biguplus_{p \in P(c)} B_p(a), \tag{3}$$

in preserving the information of all parsimonies. Finally, the correlation between some change $a$ of character $c$ and some change $b$ of a character $d$ is then the weighted Jaccard index [19] (also known as the Ružička index [30])

$$\frac{|C_c(a) \cap C_d(b)|}{|C_c(a) \cup C_d(b)|}. \tag{4}$$
For example, given the characters weight (Fig. 1b) and dental profile (Fig. 3a), we compute the correlation of the change $0 \to 1$ (low to high in weight, and 28 to 30 in dental) in both of these characters in the phylogeny of Fig. 1a (according to the cost of Fig. 1c) as follows. Character weight has the set $P(\text{weight}) = \{p_1, p_2, p_3\}$ of three parsimonies, with $B_{p_1}(0 \to 1) = \{Z \to \text{Puma}, Y \to \text{Cheetah}\}$, $B_{p_2}(0 \to 1) = \{X \to Y\}$, and $B_{p_3}(0 \to 1) = \emptyset$. It follows from Eq. 3 that the consensus map $C_{\text{weight}}(0 \to 1) = \{Z \to \text{Puma}, Y \to \text{Cheetah}, X \to Y\}$. By inspecting Fig. 3b, character dental has the set $P(\text{dental}) = \{q_1, q_2\}$ of two parsimonies, with $B_{q_1}(0 \to 1) = \{X \to Y\}$ and $B_{q_2}(0 \to 1) = \emptyset$, hence $C_{\text{dental}}(0 \to 1) = \{X \to Y\}$. It then follows from Eq. 4 that the correlation of this co-event is $1/3 \approx 0.33$. The correlation of the co-event $1 \to 0$ in both weight and dental is also 1/3, while the correlation of $0 \to 1$ in weight and $1 \to 0$ in dental (and vice versa) is 0, because there is no overlap between the sets $B_p(0 \to 1)$ and $B_q(1 \to 0)$ for any pair of parsimonies $p \in P(\text{weight})$ and $q \in P(\text{dental})$ (and vice versa). For completeness, the correlation of any combination of event $0 \to 2$ or $2 \to 0$ with any other event in either weight or dental is 0, because none of these events happen in any parsimony of weight or dental. We use the weighted Jaccard index because it measures the amount of event co-occurrence, normalized by the amount of independent occurrences of either event in the set of all parsimonies, taking multiplicities
into account. If one wanted to focus on just the events on the different branches (without multiplicity), one could use the (unweighted) Jaccard index, which is Eq. 4 where all multisets are treated as sets (all non-zero multiplicities cast to 1).
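The worked example above can be checked with a short Python sketch that treats consensus maps as multisets via `collections.Counter`. The function names, the encoding of a branch as a `(parent, child)` tuple, and the choice to return 0 when neither event ever occurs are assumptions made for illustration; the branch sets for the change 1 → 0 are inferred from the description of the parsimonies rather than quoted from the text.

```python
from collections import Counter


def consensus_map(branch_sets):
    """Multiset sum over all parsimonies of the branches where a change occurs (Eq. 3)."""
    total = Counter()
    for branches in branch_sets:
        total.update(branches)
    return total


def weighted_jaccard(c1, c2):
    """Eq. 4: intersection takes minimum multiplicities, union takes maximum."""
    union = sum((c1 | c2).values())
    if union == 0:
        return 0.0  # convention chosen here for events that occur in no parsimony
    return sum((c1 & c2).values()) / union


def jaccard(c1, c2):
    """Unweighted variant: every non-zero multiplicity is cast to 1."""

    def as_set(c):
        return Counter({branch: 1 for branch, mult in c.items() if mult > 0})

    return weighted_jaccard(as_set(c1), as_set(c2))


# One branch set per parsimony (p1, p2, p3 for weight; q1, q2 for dental).
WEIGHT_UP = consensus_map([{("Z", "Puma"), ("Y", "Cheetah")}, {("X", "Y")}, set()])
DENTAL_UP = consensus_map([{("X", "Y")}, set()])
# Inferred branch sets for the change 1 -> 0.
WEIGHT_DOWN = consensus_map(
    [set(), {("Z", "Jaguarundi")}, {("X", "Pallas"), ("Z", "Jaguarundi")}]
)
DENTAL_DOWN = consensus_map([set(), {("X", "Pallas")}])
```

Here `weighted_jaccard(WEIGHT_UP, DENTAL_UP)` and `weighted_jaccard(WEIGHT_DOWN, DENTAL_DOWN)` both evaluate to 1/3, in line with the example. In the second case the branch Z → Jaguarundi has multiplicity 2 in the consensus map of weight, so the unweighted `jaccard` gives 1/2 instead, which shows how the two indices can differ.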
4 Experimental Analysis
We used the following data to validate our approach. The phylogeny is that of the family Felidae from Johnson et al. [21], where Leopard and Jaguar are “swapped”¹, due to more recent findings [13]. The set of extant states is for the 14 vocalizations documented in Sunquist and Sunquist [35], and the 10 morphological characters compiled from the various sources within [3]. Finally, the cost for changing to a different state is 1, and to an unknown state is ∞ (e.g., Fig. 1c). Since unknown states are artifacts of the collection process, by assigning a high evolutionary cost to the change of any unknown to some known state (which makes little evolutionary sense), we mitigate the propagation of any unknown state to ancestral nodes in any parsimony, whenever some known state (e.g., 0 or 1) can explain the data. For example, in the instance of Fig. 1, if the weight was instead unknown in the Jaguarundi and Pallas cat, then it would have only the unique parsimony where all ancestors X, Y and Z have state 1 (high). This is why we use the approach of Sankoff [31] (instead of, e.g., Fitch [15]): we need these transition-specific costs. Our approach, implemented in the software tool parcours, begins with the efficient computation of all parsimonies described in Sect. 2, followed by the computation of correlation from these parsimonies, as described in Sect. 3. Since parcours starts with the same input as Sankoff’s algorithm (e.g., Fig. 1), we input the Felidae phylogeny and the extant states of the 24 characters mentioned above. The cost matrix mentioned above is computed automatically by parcours, this being the default when no specific cost matrix is provided. From this, parcours returns the correlations of all possible events between all pairs of characters according to Eq. 4. Since each correlation is, by the definition in Eq. 4, a value between 0 and 1, we report in Table 1 all events for all pairs of characters that had a correlation of 0.5 or greater. Figure 4 shows some consensus maps on the tree for selected characters of Table 1.
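To make the effect of the cost matrix concrete, the following Python sketch checks the unknown-state example by brute force, scoring every assignment of states to X, Y and Z. It is only a sanity check, not the algorithm used by parcours. The tree topology is inferred from the running example, and the matrix `COST` is an assumption: it follows the reading in which a change from unknown (2) to a known state costs ∞, and it further assumes that a change from a known state to unknown costs 1, which the text does not state explicitly.

```python
import math
from itertools import product

INF = math.inf

# Assumed cost matrix: COST[i][j] is the cost of a change i -> j, with 2 = unknown.
COST = [
    [0, 1, 1],
    [1, 0, 1],
    [INF, INF, 0],
]

# Hypothetical toy phylogeny as (parent, child) branches.
EDGES = [
    ("X", "Pallas"), ("X", "Y"),
    ("Y", "Cheetah"), ("Y", "Z"),
    ("Z", "Puma"), ("Z", "Jaguarundi"),
]
# Weight is unknown (2) in the Pallas cat and the Jaguarundi, high (1) elsewhere.
LEAVES = {"Pallas": 2, "Cheetah": 1, "Puma": 1, "Jaguarundi": 2}
INTERNAL = ("X", "Y", "Z")


def total_cost(assignment):
    """Sum of branch costs for one assignment of states to internal nodes."""
    state = {**LEAVES, **assignment}
    return sum(COST[state[u]][state[v]] for u, v in EDGES)


def brute_force_parsimonies():
    """Score all 3^3 ancestral assignments and keep the cheapest ones."""
    scored = [
        (total_cost(dict(zip(INTERNAL, states))), states)
        for states in product(range(3), repeat=len(INTERNAL))
    ]
    best = min(score for score, _ in scored)
    return best, [states for score, states in scored if score == best]
```

Under these assumptions, `brute_force_parsimonies()` returns a single optimal assignment in which X, Y and Z all have state 1, which agrees with the unique parsimony described above. Any assignment that places the unknown state at an internal node incurs an infinite cost on the branch to a leaf with a known state.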
5 Discussion
Understanding the evolution of complex traits such as vocalization and morphological characters remains a key challenge in evolutionary studies. We discuss the results of our experimental analysis on these characters, and future directions, below. The gurgle and prusten are two of the major friendly, close range
¹ See example at https://github.com/murraypatterson/parcours/tree/main/felidae.
Table 1. Pairs of correlated events. Pairs $t_1$, $t_2$ of events (columns 3–4) in pairs $c_1$, $c_2$ of characters (columns 1–2) with correlation (Eq. 4) of 0.5 or greater (column 5). Jaccard index (see Sect. 3) is also included for reference (column 6). Vocalizations (gurgle, prusten, grunt, roar) are all binary (0 or 1), while the remaining morphological characters have categorical ranks.

|   | $c_1$ | $c_2$ | $t_1$ | $t_2$ | Correlation | Jaccard |
|---|-------|-------|-------|-------|-------------|---------|
| 1 | Gurgle | Prusten | 1 → 0 | 1 → 0 | 1.0 | 1.0 |
| 2 | Gurgle | Gestation | 1 → 0 | 3 → 1 | 1.0 | 1.0 |
| 3 | Prusten | Gestation | 0 → 1 | 3 → 1 | 1.0 | 1.0 |
| 4 | Grunt | Roar | 0 → 1 | 0 → 1 | 1.0 | 1.0 |
| 5 | Gestation | Weight | 4 → 1 | 3 → 2 | 0.5 | 0.5 |
| 6 | Weight | Skull L. | 4 → 3 | 3 → 2 | 0.57 | 0.5 |
| 7 | Weight | Skull L. | 1 → 2 | 1 → 2 | 0.64 | 1.0 |
| 8 | Skull W. | Gestation | 2 → 3 | 1 → 3 | 0.71 | 1.0 |
| 9 | sk.L./sk.W. | Dental | 4 → 3 | 1 → 2 | 0.67 | 0.67 |
Fig. 4. The consensus maps (Eq. 3) corresponding to the pairs of events in: (a) rows 1 and 4 of Table 1, where (+) is 0 → 1 and (−) is 1 → 0; and (b) row 8 of Table 1, where the number of lines on a branch indicates multiplicity. Note here, that the Bay cat lineage (grayed out) was excluded (automatically by parcours) from analyses due to lack of data on species of that group.
vocalizations in Felidae. These vocalizations are short, largely atonal calls produced during friendly approaches and close contact situations. Each felid species uses only one of these homologous, close-range calls [27]. The gurgle is the most widely distributed call, found in all felid species except for members of Pantherinae, and is considered to be ancestral. Along the lineage leading to Pantherinae, the prusten appears to have replaced the gurgle about 8–3 myr BP (node 1 [21] to 33) [29]. The complete correlation of a loss of gurgle and gain of prusten (row 1 of Table 1, and Fig. 4a) is consistent with the idea that these calls are functionally equivalent vocalizations, one being replaced as another is gained successively over time [29]. Also, parcours linked an increase of gestation period [3] to the
loss of the gurgle, and gain of the prusten, respectively (rows 2–3 of Table 1). Finally, the concomitant gain of grunt and roar (row 4 of Table 1 and Fig. 4a) is consistent with the observation that the grunt typically occurs in a roaring sequence [35]. The parcours tool detected a correlation of 0.5 (row 5 of Table 1) between a large increase in gestation period (rank 4 to 1) and a moderate increase in weight (rank 3 to 2). This is largely consistent with observations that body weight and gestation period in mammals are loosely correlated [1]. Felids exhibit considerable variability in body weight (>200 kg to 30 cm to