English Pages 851 [844] Year 2022
LNCS 13394
De-Shuang Huang · Kang-Hyun Jo · Junfeng Jing · Prashan Premaratne · Vitoantonio Bevilacqua · Abir Hussain (Eds.)
Intelligent Computing Theories and Application 18th International Conference, ICIC 2022 Xi’an, China, August 7–11, 2022 Proceedings, Part II
Lecture Notes in Computer Science Founding Editors Gerhard Goos Karlsruhe Institute of Technology, Karlsruhe, Germany Juris Hartmanis Cornell University, Ithaca, NY, USA
Editorial Board Members Elisa Bertino Purdue University, West Lafayette, IN, USA Wen Gao Peking University, Beijing, China Bernhard Steffen TU Dortmund University, Dortmund, Germany Moti Yung Columbia University, New York, NY, USA
More information about this series at https://link.springer.com/bookseries/558
Editors De-Shuang Huang Tongji University Shanghai, China
Kang-Hyun Jo University of Ulsan Ulsan, Korea (Republic of)
Junfeng Jing Xi’an Polytechnic University Xi’an, China
Prashan Premaratne The University of Wollongong North Wollongong, NSW, Australia
Vitoantonio Bevilacqua Polytechnic University of Bari Bari, Italy
Abir Hussain Liverpool John Moores University Liverpool, UK
ISSN 0302-9743 ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-13828-7 ISBN 978-3-031-13829-4 (eBook)
https://doi.org/10.1007/978-3-031-13829-4
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The International Conference on Intelligent Computing (ICIC) was started to provide an annual forum dedicated to the emerging and challenging topics in artificial intelligence, machine learning, pattern recognition, bioinformatics, and computational biology. It aims to bring together researchers and practitioners from both academia and industry to share ideas, problems, and solutions related to the multifaceted aspects of intelligent computing.
ICIC 2022, held in Xi’an, China, during August 7–11, 2022, constituted the 18th International Conference on Intelligent Computing. It built upon the success of the previous ICIC events held at various locations in China (2005–2008, 2010–2016, 2018–2019, 2021) and in Ulsan, South Korea (2009), Liverpool, UK (2017), and Bari, Italy (2020).
This year, the conference concentrated mainly on the theories, methodologies, and emerging applications of intelligent computing. Its aim was to unify the picture of contemporary intelligent computing techniques as an integral concept that highlights the trends in advanced computational intelligence and bridges theoretical research with applications. Therefore, the theme for this conference was “Advanced Intelligent Computing Technology and Applications”. Papers focused on this theme were solicited, addressing theories, methodologies, and applications in science and technology.
ICIC 2022 received 449 submissions from authors in 21 countries and regions. All papers went through a rigorous peer-review procedure and each paper received at least three review reports. Based on the review reports, the Program Committee finally selected 209 high-quality papers for presentation at ICIC 2022, which are included in three volumes of proceedings published by Springer: two volumes of Lecture Notes in Computer Science (LNCS) and one volume of Lecture Notes in Artificial Intelligence (LNAI).
Among the 449 submissions to the conference were 57 submissions for the six special sessions and nine workshops featured at ICIC this year. All these submissions were reviewed by members of the main Program Committee, and 22 high-quality papers were selected for presentation at ICIC 2022 and included in the proceedings based on the topic. This volume of Lecture Notes in Computer Science (LNCS) includes 72 papers.
The organizers of ICIC 2022, including the EIT Institute for Advanced Study, Xi’an Polytechnic University, Shenzhen University, and the Guangxi Academy of Sciences, made an enormous effort to ensure the success of the conference. We hereby would like to thank the members of the Program Committee and the referees for their collective effort in reviewing and soliciting the papers. In particular, we would like to thank all the authors for contributing their papers. Without the high-quality submissions from the authors, the success of the conference would not have been possible. Finally, we are especially grateful to the International Neural Network Society and the National Science Foundation of China for their sponsorship.
June 2022
De-Shuang Huang Kang-Hyun Jo Junfeng Jing Prashan Premaratne Vitoantonio Bevilacqua Abir Hussain
Organization
General Co-chairs
De-Shuang Huang, Tongji University, China
Haiyan Wang, Xi’an Polytechnic University, China
Program Committee Co-chairs
Kang-Hyun Jo, University of Ulsan, South Korea
Junfeng Jing, Xi’an Polytechnic University, China
Prashan Premaratne, University of Wollongong, Australia
Vitoantonio Bevilacqua, Polytechnic University of Bari, Italy
Abir Hussain, Liverpool John Moores University, UK
Organizing Committee Co-chairs
Pengfei Li, Xi’an Polytechnic University, China
Kaibing Zhang, Xi’an Polytechnic University, China
Lei Zhang, Xi’an Polytechnic University, China
Organizing Committee
Hongwei Zhang, Xi’an Polytechnic University, China
Minqi Li, Xi’an Polytechnic University, China
Zhaoliang Meng, Xi’an Polytechnic University, China
Peng Song, Xi’an Polytechnic University, China
Award Committee Co-chairs
Kyungsook Han, Inha University, South Korea
Valeriya Gribova, Far Eastern Branch of the Russian Academy of Sciences, Russia
Tutorial Co-chairs
Ling Wang, Tsinghua University, China
M. Michael Gromiha, Indian Institute of Technology Madras, India
Publication Co-chairs
Michal Choras, Bydgoszcz University of Science and Technology, Poland
Hong-Hee Lee, University of Ulsan, South Korea
Laurent Heutte, Université de Rouen Normandie, France
Special Session Co-chairs
Yu-Dong Zhang, University of Leicester, UK
Vitoantonio Bevilacqua, Polytechnic University of Bari, Italy
Hee-Jun Kang, University of Ulsan, South Korea
Special Issue Co-chairs
Yoshinori Kuno, Saitama University, Japan
Phalguni Gupta, Indian Institute of Technology Kanpur, India
International Liaison Co-chair
Prashan Premaratne, University of Wollongong, Australia
Workshop Co-chairs
Jair Cervantes Canales, Autonomous University of Mexico State, Mexico
Chenxi Huang, Xiamen University, China
Dhiya Al-Jumeily, Liverpool John Moores University, UK
Publicity Co-chairs
Chun-Hou Zheng, Anhui University, China
Dhiya Al-Jumeily, Liverpool John Moores University, UK
Jair Cervantes Canales, Autonomous University of Mexico State, Mexico
Sponsors and Exhibits Chair
Qinghu Zhang, Tongji University, China
Program Committee
Abir Hussain, Liverpool John Moores University, UK
Angelo Ciaramella, Parthenope University of Naples, Italy
Antonino Staiano, Parthenope University of Naples, Italy
Antonio Brunetti, Polytechnic University of Bari, Italy
Bai Xue, Institute of Software, CAS, China
Baitong Chen, Xuzhou No. 1 Peoples Hospital, China
Ben Niu, Shenzhen University, China
Bin Liu, Beijing Institute of Technology, China
Bin Qian, Kunming University of Science and Technology, China
Bin Wang, Anhui University of Technology, China
Bin Yang, Zaozhuang University, China
Bingqiang Liu, Shandong University, China
Binhua Tang, Hohai University, China
Bo Li, Wuhan University of Science and Technology, China
Bo Liu, Academy of Mathematics and Systems Science, CAS, China
Bohua Zhan, Institute of Software, CAS, China
Changqing Shen, Soochow University, China
Chao Song, Harbin Medical University, China
Chenxi Huang, Xiamen University, China
Chin-Chih Chang, Chung Hua University, Taiwan, China
Chunhou Zheng, Anhui University, China
Chunmei Liu, Howard University, USA
Chunquan Li, Harbin Medical University, China
Dah-Jing Jwo, National Taiwan Ocean University, Taiwan, China
Dakshina Ranjan Kisku, National Institute of Technology Durgapur, India
Daowen Qiu, Sun Yat-sen University, China
Dhiya Al-Jumeily, Liverpool John Moores University, UK
Domenico Buongiorno, Politecnico di Bari, Italy
Dong Wang, University of Jinan, China
Dong-Joong Kang, Pusan National University, South Korea
Dunwei Gong, China University of Mining and Technology, China
Eros Gian Pasero, Politecnico di Torino, Italy
Evi Sjukur, Monash University, Australia
Fa Zhang, Institute of Computing Technology, CAS, China
Fabio Stroppa, Stanford University, USA
Fei Han, Jiangsu University, China
Fei Guo, Central South University, China
Fei Luo, Wuhan University, China
Fengfeng Zhou, Jilin University, China
Gai-Ge Wang, Ocean University of China, China
Giovanni Dimauro, University of Bari, Italy
Guojun Dai, Hangzhou Dianzi University, China
Haibin Liu, Beijing University of Technology, China
Han Zhang, Nankai University, China
Hao Lin, University of Electronic Science and Technology of China, China
Haodi Feng, Shandong University, China
Ho-Jin Choi, Korea Advanced Institute of Science and Technology, South Korea
Hong-Hee Lee, University of Ulsan, South Korea
Hongjie Wu, Suzhou University of Science and Technology, China
Hongmin Cai, South China University of Technology, China
Jair Cervantes, Autonomous University of Mexico State, Mexico
Jian Huang, University of Electronic Science and Technology of China, China
Jian Wang, China University of Petroleum (East China), China
Jiangning Song, Monash University, Australia
Jiawei Luo, Hunan University, China
Jieren Cheng, Hainan University, China
Jing Hu, Wuhan University of Science and Technology, China
Jing-Yan Wang, Abu Dhabi Department of Community Development, UAE
Jinwen Ma, Peking University, China
Jin-Xing Liu, Qufu Normal University, China
Ji-Xiang Du, Huaqiao University, China
Joaquin Torres-Sospedra, Universidade do Minho, Portugal
Juan Liu, Wuhan University, China
Junfeng Man, Hunan First Normal University, China
Junfeng Xia, Anhui University, China
Jungang Lou, Huzhou University, China
Junqi Zhang, Tongji University, China
Ka-Chun Wong, City University of Hong Kong, Hong Kong, China
Kanghyun Jo, University of Ulsan, South Korea
Kyungsook Han, Inha University, South Korea
Lejun Gong, Nanjing University of Posts and Telecommunications, China
Laurent Heutte, Université de Rouen Normandie, France
Le Zhang, Sichuan University, China
Lin Wang, University of Jinan, China
Ling Wang, Tsinghua University, China
Li-Wei Ko, National Yang Ming Chiao Tung University, Taiwan, China
Marzio Pennisi, University of Eastern Piedmont, Italy
Michael Gromiha, Indian Institute of Technology Madras, India
Michal Choras, Bydgoszcz University of Science and Technology, Poland
Mine Sarac, Stanford University, USA, and Kadir Has University, Turkey
Mohd Helmy Abd Wahab, Universiti Tun Hussein Onn Malaysia, Malaysia
Na Zhang, Xuzhou Medical University, China
Nicholas Caporusso, Northern Kentucky University, USA
Nicola Altini, Polytechnic University of Bari, Italy
Peng Chen, Anhui University, China
Pengjiang Qian, Jiangnan University, China
Phalguni Gupta, GLA University, India
Ping Guo, Beijing Normal University, China
Prashan Premaratne, University of Wollongong, Australia
Pu-Feng Du, Tianjin University, China
Qi Zhao, University of Science and Technology Liaoning, China
Qingfeng Chen, Guangxi University, China
Qinghua Jiang, Harbin Institute of Technology, China
Quan Zou, University of Electronic Science and Technology of China, China
Rui Wang, National University of Defense Technology, China
Ruiping Wang, Institute of Computing Technology, CAS, China
Saiful Islam, Aligarh Muslim University, India
Seeja K. R., Indira Gandhi Delhi Technical University for Women, India
Shanfeng Zhu, Fudan University, China
Shanwen Wang, Xijing University, China
Shen Yin, Harbin Institute of Technology, China
Shihua Zhang, Academy of Mathematics and Systems Science, CAS, China
Shihua Zhang, Wuhan University of Science and Technology, China
Shikui Tu, Shanghai Jiao Tong University, China
Shitong Wang, Jiangnan University, China
Shixiong Zhang, Xidian University, China
Shunren Xia, Zhejiang University, China
Sungshin Kim, Pusan National University, South Korea
Surya Prakash, Indian Institute of Technology Indore, India
Takashi Kuremoto, Nippon Institute of Technology, Japan
Tao Zeng, Guangzhou Laboratory, China
Tatsuya Akutsu, Kyoto University, Japan
Tieshan Li, University of Electronic Science and Technology of China, China
Valeriya Gribova, Institute of Automation and Control Processes, Far Eastern Branch of the Russian Academy of Sciences, Russia
Vincenzo Randazzo, Politecnico di Torino, Italy
Waqas Haider Bangyal, University of Gujrat, Pakistan
Wei Chen, Chengdu University of Traditional Chinese Medicine, China
Wei Jiang, Nanjing University of Aeronautics and Astronautics, China
Wei Peng, Kunming University of Science and Technology, China
Wei Wei, Tencent Technology, Norway
Wei-Chiang Hong, Asia Eastern University of Science and Technology, Taiwan, China
Weidong Chen, Shanghai Jiao Tong University, China
Weihong Deng, Beijing University of Posts and Telecommunications, China
Weixiang Liu, Shenzhen University, China
Wen Zhang, Huazhong Agricultural University, China
Wenbin Liu, Guangzhou University, China
Wen-Sheng Chen, Shenzhen University, China
Wenzheng Bao, Xuzhou University of Technology, China
Xiangtao Li, Jilin University, China
Xiaodi Li, Shandong Normal University, China
Xiaofeng Wang, Hefei University, China
Xiao-Hua Yu, California Polytechnic State University, USA
Xiaoke Ma, Xidian University, China
Xiaolei Zhu, Anhui Agricultural University, China
Xiaoli Lin, Wuhan University of Science and Technology, China
Xiaoqi Zheng, Shanghai Normal University, China
Xin Yin, Laxco Inc., USA
Xin Zhang, Jiangnan University, China
Xinguo Lu, Hunan University, China
Xingwen Liu, Southwest Minzu University, China
Xiujuan Lei, Shaanxi Normal University, China
Xiwei Liu, Tongji University, China
Xiyuan Chen, Southeast University, China
Xuequn Shang, Northwestern Polytechnical University, China
Xuesong Wang, China University of Mining and Technology, China
Xuesong Yan, China University of Geosciences, China
Xu-Qing Tang, Jiangnan University, China
Yan-Rui Ding, Jiangnan University, China
Yansen Su, Anhui University, China
Yi Gu, Jiangnan University, China
Yi Xiong, Shanghai Jiao Tong University, China
Yizhang Jiang, Jiangnan University, China
Yong-Quan Zhou, Guangxi University for Nationalities, China
Yonggang Lu, Lanzhou University, China
Yoshinori Kuno, Saitama University, Japan
Yu Xue, Huazhong University of Science and Technology, China
Yuan-Nong Ye, Guizhou Medical University, China
Yu-Dong Zhang, University of Leicester, UK
Yue Ming, Beijing University of Posts and Telecommunications, China
Yunhai Wang, Shandong University, China
Yupei Zhang, Northwestern Polytechnical University, China
Yushan Qiu, Shenzhen University, China
Zhanheng Chen, Shenzhen University, China
Zhan-Li Sun, Anhui University, China
Zhen Lei, Institute of Automation, CAS, China
Zhendong Liu, Shandong Jianzhu University, China
Zhenran Jiang, East China Normal University, China
Zhenyu Xuan, University of Texas at Dallas, USA
Zhi-Hong Guan, Huazhong University of Science and Technology, China
Zhi-Ping Liu, Shandong University, China
Zhiqiang Geng, Beijing University of Chemical Technology, China
Zhongqiu Zhao, Hefei University of Technology, China
Zhu-Hong You, Northwestern Polytechnical University, China
Zhuo Wang, Hangzhou Dianzi University, China
Zuguo Yu, Xiangtan University, China
Contents – Part II
Biomedical Data Modeling and Mining
A Comparison Study of Predicting lncRNA-Protein Interactions via Representative Network Embedding Methods
  Guoqing Zhao, Pengpai Li, and Zhi-Ping Liu
GATSDCD: Prediction of circRNA-Disease Associations Based on Singular Value Decomposition and Graph Attention Network
  Mengting Niu, Abd El-Latif Hesham, and Quan Zou
Anti-breast Cancer Drug Design and ADMET Prediction of ERα Antagonists Based on QSAR Study
  Wentao Gao, Ziyi Huang, Hao Zhang, and Jianfeng Lu
Real-Time Optimal Scheduling of Large-Scale Electric Vehicles Based on Non-cooperative Game
  Rong Zeng, Hao Zhang, Jianfeng Lu, Tiaojuan Han, and Haitong Guo
TBC-Unet: U-net with Three-Branch Convolution for Gliomas MRI Segmentation
  Yongpu Yang, Haitao Gan, and Zhi Yang
Drug–Target Interaction Prediction Based on Graph Neural Network and Recommendation System
  Peng Lei, Changan Yuan, Hongjie Wu, and Xingming Zhao
NSAP: A Neighborhood Subgraph Aggregation Method for Drug-Disease Association Prediction
  Qiqi Jiao, Yu Jiang, Yang Zhang, Yadong Wang, and Junyi Li
Comprehensive Evaluation of BERT Model for DNA-Language for Prediction of DNA Sequence Binding Specificities in Fine-Tuning Phase
  Xianbao Tan, Changan Yuan, Hongjie Wu, and Xingming Zhao
Identification and Evaluation of Key Biomarkers of Acute Myocardial Infarction by Machine Learning
  Zhenrun Zhan, Tingting Zhao, Xiaodan Bi, Jinpeng Yang, and Pengyong Han
Glioblastoma Subtyping by Immuogenomics
  Yanran Li, Chandrasekhar Gopalakrishnan, Jian Wang, Rajasekaran Ramalingam, Caixia Xu, and Pengyong Han
Functional Analysis of Molecular Subtypes with Deep Similarity Learning Model Based on Multi-omics Data
  Shuhui Liu, Zhang Yupei, and Xuequn Shang
Predicting Drug-Disease Associations by Self-topological Generalized Matrix Factorization with Neighborhood Constraints
  Xiaoguang Li, Qiang Zhang, Zonglan Zuo, Rui Yan, Chunhou Zheng, and Fa Zhang
Intelligent Computing in Computational Biology
iEnhancer-BERT: A Novel Transfer Learning Architecture Based on DNA-Language Model for Identifying Enhancers and Their Strength
  Hanyu Luo, Cheng Chen, Wenyu Shan, Pingjian Ding, and Lingyun Luo
GCNMFCDA: A Method Based on Graph Convolutional Network and Matrix Factorization for Predicting circRNA-Disease Associations
  Dian-Xiao Wang, Cun-Mei Ji, Yu-Tian Wang, Lei Li, Jian-Cheng Ni, and Bin Li
Prediction of MiRNA-Disease Association Based on Higher-Order Graph Convolutional Networks
  Zhengtao Zhang, Pengyong Han, Zhengwei Li, Ru Nie, and Qiankun Wang
SCDF: A Novel Single-Cell Classification Method Based on Dimension-Reduced Data Fusion
  Chujie Fang and Yuanyuan Li
Research on the Potential Mechanism of Rhizoma Drynariae in the Treatment of Periodontitis Based on Network Pharmacology
  Caixia Xu, Xiaokun Yang, Zhipeng Wang, Pengyong Han, Xiaoguang Li, and Zhengwei Li
Predicting Drug-Disease Associations via Meta-path Representation Learning based on Heterogeneous Information Networks
  Meng-Long Zhang, Bo-Wei Zhao, Lun Hu, Zhu-Hong You, and Zhan-Heng Chen
An Enhanced Graph Neural Network Based on the Tissue-Like P System
  Dongyi Li and Xiyu Liu
Cell Classification Based on Stacked Autoencoder for Single-Cell RNA Sequencing
  Rong Qi, Chun-Hou Zheng, Cun-Mei Ji, Ning Yu, Jian-Cheng Ni, and Yu-Tian Wang
A Novel Cuprotosis-Related Gene Signature Predicts Survival Outcomes in Patients with Clear-Cell Renal Cell Carcinoma
  Zhenrun Zhan, Pengyong Han, Xiaodan Bi, Jinpeng Yang, and Tingting Zhao
Identification of miRNA-lncRNA Underlying Interactions Through Representation for Multiplex Heterogeneous Network
  Jiren Zhou, Zhuhong You, Xuequn Shang, Rui Niu, and Yue Yun
ACNN: Drug-Drug Interaction Prediction Through CNN and Attention Mechanism
  Weiwei Wang and Hongbo Liu
Elucidating Quantum Semi-empirical Based QSAR, for Predicting Tannins’ Anti-oxidant Activity with the Help of Artificial Neural Network
  Chandrasekhar Gopalakrishnan, Caixia Xu, Yanran Li, Vinutha Anandhan, Sanjay Gangadharan, Meshach Paul, Chandra Sekar Ponnusamy, Rajasekaran Ramalingam, Pengyong Han, and Zhengwei Li
Drug-Target Interaction Prediction Based on Transformer
  Junkai Liu, Tengsheng Jiang, Yaoyao Lu, and Hongjie Wu
Protein-Ligand Binding Affinity Prediction Based on Deep Learning
  Yaoyao Lu, Junkai Liu, Tengsheng Jiang, Shixuan Guan, and Hongjie Wu
Computational Genomics and Biomarker Discovery
Position-Defined CpG Islands Provide Complete Co-methylation Indexing for Human Genes
  Ming Xiao, Ruiying Yin, Pengbo Gao, Jun Yu, Fubo Ma, Zichun Dai, and Le Zhang
Predicting the Subcellular Localization of Multi-site Protein Based on Fusion Feature and Multi-label Deep Forest Model
  Hongri Yang, Qingfang Meng, Yuehui Chen, and Lianxin Zhong
Construction of Gene Network Based on Inter-tumor Heterogeneity for Tumor Type Identification
  Zhensheng Sun, Junliang Shang, Hongyu Duan, Jin-Xing Liu, Xikui Liu, Yan Li, and Feng Li
A Novel Synthetic Lethality Prediction Method Based on Bidirectional Attention Learning
  Fengxu Sun, Xinguo Lu, Guanyuan Chen, Xiang Zhang, Kaibao Jiang, and Jinxin Li
A Novel Trajectory Inference Method on Single-Cell Gene Expression Data
  Daoxu Tang, Xinguo Lu, Kaibao Jiang, Fengxu Sun, and Jinxin Li
Bioinformatic Analysis of Clear Cell Renal Carcinoma via ATAC-Seq and RNA-Seq
  Feng Chang, Zhenqiong Chen, Caixia Xu, Hailei Liu, and Pengyong Han
The Prognosis Model of Clear Cell Renal Cell Carcinoma Based on Allograft Rejection Markers
  Hailei Liu, Zhenqiong Chen, Chandrasekhar Gopalakrishnan, Rajasekaran Ramalingam, Pengyong Han, and Zhengwei Li
Membrane Protein Amphiphilic Helix Structure Prediction Based on Graph Convolution Network
  Baoli Jia, Qingfang Meng, Qiang Zhang, and Yuehui Chen
The CNV Predict Model in Esophagus Cancer
  Yun Tian, Caixia Xu, Lin Li, Pengyong Han, and Zhengwei Li
TB-LNPs: A Web Server for Access to Lung Nodule Prediction Models
  Huaichao Luo, Ning Lin, Lin Wu, Ziru Huang, Ruiling Zu, and Jian Huang
Intelligent Computing in Drug Design
A Targeted Drug Design Method Based on GRU and TopP Sampling Strategies
  Jinglu Tao, Xiaolong Zhang, and Xiaoli Lin
KGAT: Predicting Drug-Target Interaction Based on Knowledge Graph Attention Network
  Zhenghao Wu, Xiaolong Zhang, and Xiaoli Lin
MRLDTI: A Meta-path-Based Representation Learning Model for Drug-Target Interaction Prediction
  Bo-Wei Zhao, Lun Hu, Peng-Wei Hu, Zhu-Hong You, Xiao-Rui Su, Dong-Xu Li, Zhan-Heng Chen, and Ping Zhang
Single Image Dehazing Based on Generative Adversarial Networks
  Mengyun Wu and Bo Li
K-Nearest Neighbor Based Local Distribution Alignment
  Yang Tian and Bo Li
A Video Anomaly Detection Method Based on Sequence Recognition
  Lei Yang and Xiaolong Zhang
Drug-Target Binding Affinity Prediction Based on Graph Neural Networks and Word2vec
  Minghao Xia, Jing Hu, Xiaolong Zhang, and Xiaoli Lin
Drug-Target Interaction Prediction Based on Attentive FP and Word2vec
  Yi Lei, Jing Hu, Ziyu Zhao, and Siyi Ye
Unsupervised Prediction Method for Drug-Target Interactions Based on Structural Similarity
  Xinyuan Zhang, Xiaoli Lin, Jing Hu, and Wenquan Ding
Drug-Target Affinity Prediction Based on Multi-channel Graph Convolution
  Hang Zhang, Jing Hu, and Xiaolong Zhang
An Optimization Method for Drug-Target Interaction Prediction Based on RandSAS Strategy
  Huimin Xiang, AoXing Li, and Xiaoli Lin
A Novel Cuprotosis-Related lncRNA Signature Predicts Survival Outcomes in Patients with Glioblastoma
  Hongyu Sun, Xiaohui Li, Jin Yang, Yi Lyu, Pengyong Han, and Jinping Zheng
Arbitrary Voice Conversion via Adversarial Learning and Cycle Consistency Loss
  Jie Lian, Pingyuan Lin, Yuxing Dai, and Guilin Li
MGVC: A Mask Voice Conversion Using Generating Adversarial Training
  Pingyuan Lin, Jie Lian, and Yuxing Dai
Covid-19 Detection by Wavelet Entropy and Genetic Algorithm
  Jia-Ji Wan, Shu-Wen Chen, Rayan S. Cloutier, and Hui-Sheng Zhu
COVID-19 Diagnosis by Wavelet Entropy and Particle Swarm Optimization
  Jia-Ji Wang
Theoretical Computational Intelligence and Applications
An Integrated GAN-Based Approach to Imbalanced Disk Failure Data
  Shuangshuang Yuan, Peng Wu, Yuehui Chen, Liqiang Zhang, and Jian Wang
Disk Failure Prediction Based on Transfer Learning
  Guangfu Gao, Peng Wu, Hui Li, and Tianze Zhang
Imbalanced Disk Failure Data Processing Method Based on CTGAN
  Jingbo Jia, Peng Wu, Kai Zhang, and Ji Zhong
SID2T: A Self-attention Model for Spinal Injury Differential Diagnosis
  Guan Wang, Yulin Wu, Qinghua Sun, Bin Yang, and Zhaona Zheng
Predicting Protein-DNA Binding Sites by Fine-Tuning BERT
  Yue Zhang, Yuehui Chen, Baitong Chen, Yi Cao, Jiazi Chen, and Hanhan Cong
i6mA-word2vec: A Newly Model Which Used Distributed Features for Predicting DNA N6-Methyladenine Sites in Genomes
  Wenzhen Fu, Yixin Zhong, Baitong Chen, Yi Cao, Jiazi Chen, and Hanhan Cong
Oxides Classification with Random Forests
  Kai Xiao, Baitong Chen, Wenzheng Bao, and Honglin Cheng
Protein Sequence Classification with LetNet-5 and VGG16
  Zheng Tao, Zhen Yang, Baitong Chen, Wenzheng Bao, and Honglin Cheng
SeqVec-GAT: A Golgi Classification Model Based on Multi-headed Graph Attention Network
  Jianan Sui, Yuehui Chen, Baitong Chen, Yi Cao, Jiazi Chen, and Hanhan Cong
Classification of S-succinylation Sites of Cysteine by Neural Network
  Tong Meng, Yuehui Chen, Baitong Chen, Yi Cao, Jiazi Chen, and Hanhan Cong
E. coli Proteins Classification with Naive Bayesian
  Yujun Liu, Jiaxin Hu, Yue Zhou, Wenzheng Bao, and Honglin Cheng
COVID-19 and SARS Virus Function Sites Classification with Machine Learning Methods
  Hongdong Wang, Zizhou Feng, Baitong Chen, Wenhao Shao, Zijun Shao, Yumeng Zhu, and Zhuo Wang
Identification of Protein Methylation Sites Based on Convolutional Neural Network
  Wenzheng Bao, Zhuo Wang, and Jian Chu
Image Repair Based on Least Two-Way Generation Against the Network
  Juxi Hu and Honglin Cheng
Prediction of Element Distribution in Cement by CNN
  Xin Zhao, Yihan Zhou, Jianfeng Yuan, Bo Yang, Xu Wu, Dong Wang, Pengwei Guan, and Na Zhang
An Ensemble Framework Integrating Whole Slide Pathological Images and miRNA Data to Predict Radiosensitivity of Breast Cancer Patients
  Chao Dong, Jie Liu, Wenhui Yan, Mengmeng Han, Lijun Wu, Junfeng Xia, and Yannan Bin
Bio-ATT-CNN: A Novel Method for Identification of Glioblastoma
  Jinling Lai, Zhen Shen, and Lin Yuan
STE-COVIDNet: A Multi-channel Model with Attention Mechanism for Time Series Prediction of COVID-19 Infection
  Hongjian He, Xinwei Lu, Dingkai Huang, and Jiang Xie
KDPCnet: A Keypoint-Based CNN for the Classification of Carotid Plaque
  Bindong Liu, Wu Zhang, and Jiang Xie
Multi-source Data-Based Deep Tensor Factorization for Predicting Disease-Associated miRNA Combinations
  Sheng You, Zihan Lai, and Jiawei Luo
Correction to: Multi-source Data-Based Deep Tensor Factorization for Predicting Disease-Associated miRNA Combinations
  Sheng You, Zihan Lai, and Jiawei Luo
Author Index
Biomedical Data Modeling and Mining
A Comparison Study of Predicting lncRNA-Protein Interactions via Representative Network Embedding Methods

Guoqing Zhao, Pengpai Li, and Zhi-Ping Liu(B)

School of Control Science and Engineering, Shandong University, Jinan 250061, Shandong, China
[email protected]
Abstract. Network embedding has recently become an important representation technique and an effective way to handle the heterogeneity of data relations in non-Euclidean learning. It aims to learn low-dimensional latent representations of the nodes in a network, and the learned representations can be used as efficient features for various network-based tasks, such as classification, clustering, link prediction and visualization. In recent years, various low-dimensional graph embedding methods have been proposed. Yet few of them, especially the newly available ones, have been analyzed in a systematic experiment on the prediction of lncRNA-protein interactions (LPIs). Here, we divide these methods into three categories, i.e., factorization-based, random walk-based and deep learning-based methods, and select six representative methods from them for predicting LPIs. These state-of-the-art network embedding methods are then evaluated on five benchmark datasets, three from human and two from plants. Experimental results demonstrate that recent network embedding methods, e.g., metapath2vec, achieve better prediction performance. The data and code in this study are available at: https://github.com/zpliulab/Bconstract_embedding.

Keywords: lncRNA-protein interaction prediction · Network embedding · Data integration · Benchmark datasets · Comparison study
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 3–13, 2022. https://doi.org/10.1007/978-3-031-13829-4_1

1 Introduction

A network is a collection of nodes and edges. Many complex systems can be represented in the form of networks, such as social networks, biological networks and information networks [1]. By analyzing graph tasks in molecular interaction networks, we can predict lncRNA-protein interactions (LPIs) to better understand the regulation of cellular biological processes [2], predict drug-disease associations to provide important information for drug discovery and drug repositioning [3], and predict protein-protein interactions to catalyze genomic interpretation [4]. Such predictions help clinical treatment and thereby promote the development of digital health. Network embedding is a graph representation technique that learns the features underlying the network architecture and attributes. To analyze graph tasks, Cai et al. [5], Goyal et al. [6] and Wang et al. [7] have briefly summarized the techniques, applications and performances of network embedding.

To our knowledge, DeepWalk is the first network embedding method that employs the idea of natural language processing (NLP) [8], which learns word representations from sentences. node2vec then improves DeepWalk by introducing breadth-first search and depth-first search to generate node sequences. Subsequently, metapath2vec proposes a random walk method based on meta-paths, which addresses the problem of processing heterogeneous network structures. At the same time, various new network embedding methods have been proposed, such as HARP [9], DGI [10] and GAE [11]. As typical kinds of non-Euclidean data, graphs and networks have made embedding methods attract much attention in machine learning theories and applications [12].

However, few studies have analyzed the performance of multiple embedding methods in unified experiments, especially for emerging methods in biological networks such as lncRNA-protein interactions [13]. Su et al. [14] discussed how network embedding approaches were applied to biomedical networks and how they accelerated downstream tasks in biomedical science, but they did not implement experiments with the network embedding methods or analyze their performance on real data. Nelson et al. [15] applied two methods to protein network alignment, community detection, and protein function prediction. However, the methods were tested on only a few datasets, and thus the performances were not convincing enough. In addition, because this is a hot topic in network analysis, many newly published methods have become available. Therefore, a comparison study of these network representation methods in a specific application scenario, such as LPI prediction, is needed.
In this study, we first group the available network embedding methods into three major categories: those based on factorization, on random walks, and on deep learning. Then we select six representative methods from the three categories to perform a comparison study on link prediction tasks. In line with our research interests, we use them to conduct experiments on three human LPI datasets and two plant LPI datasets. In these comparisons, the recently proposed network embedding methods, e.g., metapath2vec [16], demonstrate better prediction performance, which indicates the rationality of integrating multiple features in the network representation and highlights the importance of developing novel feature extraction techniques for network-structured data.
2 Materials and Methods

2.1 Datasets

The data used in this study are extracted from Peng et al. [17]. There are five benchmark datasets in total, and Table 1 shows their details. Datasets 1, 2, and 3 contain human LPIs, and Datasets 4 and 5 contain plant LPIs. Dataset 1 is obtained by restricting organisms and ncRNA species in NPInter (http://www.bioinfo.org/NPInter/). Dataset 2 also extracts Homo sapiens ncRNA-protein interactions from NPInter and filters out the lncRNAs that interact with only one protein. Dataset 3 consists of experimentally determined interactions among 1114 lncRNAs and 96 proteins. By contrast, Dataset 4 contains 948 Arabidopsis thaliana LPIs, and Dataset 5 has 22133 Zea mays LPIs, both from http://bis.zju.edu.cn/PlncRNADB/.

Table 1. The statistics of LPI data.

Dataset     lncRNAs   Proteins   LPIs
Dataset 1   935       59         3479
Dataset 2   885       84         3265
Dataset 3   990       27         4158
Dataset 4   109       35         948
Dataset 5   1704      42         22133
2.2 Survey of Network Embedding Methods

For simplicity, we divide network embedding methods into three categories and introduce them according to the techniques used. The fundamental goal of network embedding methods is to learn representations of nodes while preserving as much information about the network as possible. The methods differ in which aspect of the network's structural information they preserve.

Factorization-Based Methods. Factorization-based network embedding methods represent a network property (e.g., existing edges) in the form of a matrix and factorize this matrix to obtain node embeddings [6]. The purpose of this kind of method is to use matrix decomposition to represent nodes as low-dimensional vectors for downstream tasks while preserving the network structure encoded in the matrix. There are many matrix decomposition methods, such as GF [18], GraRep [19], HOPE [20] and SVD [21]. Among them, SVD, which often focuses on factorizing the first-order data matrix [22], extracts bias information and latent variables from LPIs.

Random Walk-Based Methods. In recent years, methods based on random walks have attracted widespread attention and have been applied to various tasks, such as identifying important proteins [23], predicting lncRNA-disease associations [24], predicting Parkinson's disease genes [25], and network alignment [26]. Therefore, we focus on the methods in this category. Random walk-based methods imitate the approach used in natural language processing to generate context [27]: the context consists of sequences of nodes generated by walking randomly in the network. These sequences are then fed into skip-gram [28] to learn the representations of nodes.
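The walk-then-skip-gram idea described above can be illustrated with a minimal sketch. It assumes a toy undirected lncRNA-protein graph stored as an adjacency dictionary; the node names and the helper functions `random_walk` and `context_pairs` are hypothetical and are not taken from the authors' code, and the skip-gram training step itself is not shown.

```python
import random


def random_walk(adj, start, length, rng):
    """Uniform random walk: repeatedly move to a random neighbor of the last node."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


def context_pairs(walk, window):
    """(center, context) pairs within `window` steps, as a skip-gram model would consume."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Toy bipartite graph (hypothetical nodes), stored in both directions.
adj = {
    "lnc1": ["protA", "protB"],
    "lnc2": ["protB"],
    "protA": ["lnc1"],
    "protB": ["lnc1", "lnc2"],
}
rng = random.Random(0)
walk = random_walk(adj, "lnc1", 6, rng)
pairs = context_pairs(walk, 2)
```

Each walk plays the role of a sentence and each node the role of a word, so the resulting pairs are the training examples from which node vectors would be learned.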
DeepWalk [29] is the first method proposed to employ random walk-based techniques for network embedding. The DeepWalk algorithm mainly consists of two parts: a random walk sequence generator and an update process. The random walk sequence generator first randomly samples a root node for a walk in the graph, and then repeatedly samples a node uniformly at random from the neighbors of the most recently visited node until a set maximum length is reached. For a generated random walk sequence, DeepWalk takes windows of a specified size to the left and right of a center node and uses the Skip-Gram algorithm to optimize the model and learn node representations. The variant methods node2vec [30] and metapath2vec [16] appeared successively.

Node2vec extends DeepWalk by changing the way random walk sequences are generated. In DeepWalk, the next node in the random walk sequence is selected uniformly at random, whereas node2vec introduces two parameters that steer the walk between breadth-first search (BFS) and depth-first search (DFS). BFS focuses on adjacent nodes and depicts a relatively local network representation. The nodes in BFS generally appear many times, thereby reducing the variance of the neighbor nodes that characterize the central node. In contrast, DFS reflects the homogeneity between nodes at a higher level. That is to say, BFS can explore the structural properties of the graph, while DFS can explore the similarity in content.

Compared with node2vec, metapath2vec uses meta-path-based random walks to build heterogeneous neighborhoods for each vertex. When vertex type and edge type are ignored, p represents the transition probability from a vertex to its neighbor vertices. However, Dong et al. [16] showed that random walks on heterogeneous networks were biased towards certain highly visible types of vertices whose paths dominated the network, and a certain percentage of these paths pointed to a small number of nodes. In view of this, the authors proposed a random walk method based on meta-paths to generate the neighborhood context for the Skip-Gram model. This random walk method can simultaneously capture the semantic and structural relationships between different types of vertices, which facilitates feeding heterogeneous network structures into the Skip-Gram model of metapath2vec. Furthermore, it uses the Skip-Gram model for training so that the representations of nodes in the same context are more similar, and the representations of the remaining nodes tend to be orthogonal. More importantly, it also takes the node type into account when doing negative sampling. In other words, metapath2vec captures multiple types of information from both the architecture and the elements of the network. However, because this method is designed for networks in various fields rather than specifically for biological networks, it still does not include the intrinsic information of nodes in a biological information network, such as their sequence or structure information.

Deep Learning-Based Methods. Deep learning has become ubiquitous in many fields owing to its powerful learning ability [31], and network embedding is no exception; representative deep learning-based methods include LINE [32] and SDNE [33].
LINE explicitly defines two functions for the first-order and second-order proximities, and minimizes the combination of the two functions. In SDNE, the first-order similarity requires connected nodes to have similar representations, and the second-order similarity requires nodes that share the same neighbor nodes to have similar representation vectors.
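To make the two notions of proximity concrete, the following sketch computes them on a toy graph. It is only an illustration under stated assumptions: first-order proximity is taken as the presence of an edge, and second-order proximity is approximated by the Jaccard overlap of neighbor sets, which is a simple stand-in and not the objective that LINE or SDNE actually optimize. The node names and function names are hypothetical.

```python
def first_order(adj, u, v):
    """1.0 if u and v are directly connected, else 0.0."""
    return 1.0 if v in adj.get(u, set()) else 0.0


def second_order(adj, u, v):
    """Jaccard overlap of the neighbor sets of u and v (illustrative proxy)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


# Toy lncRNA-protein edges (hypothetical), stored as an undirected graph.
edges = [
    ("lnc1", "protA"), ("lnc1", "protB"),
    ("lnc2", "protA"), ("lnc2", "protB"),
    ("lnc3", "protC"),
]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

same_neighbors = second_order(adj, "lnc1", "lnc2")  # share protA and protB
no_overlap = second_order(adj, "lnc1", "lnc3")      # share nothing
```

Here lnc1 and lnc2 are not connected to each other, so their first-order proximity is zero, yet they interact with exactly the same proteins, which is the situation second-order proximity is meant to capture.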
Fig. 1. The flowchart of predicting lncRNA-protein interactions via representative network embedding methods.
2.3 LncRNA-Protein Interactions Prediction

To compare different network embedding methods in the prediction of LPIs, we evaluate the performance of the 6 representative methods reviewed above on 5 LPI datasets. The details of the 6 methods are shown in Table 2, including the year of release, the source of context nodes, the embedding learning method and the original reference. Figure 1 shows the flowchart of our comparison study of LPI predictions by these network embedding methods. To avoid leaking the edges in the validation set, we eliminate the positive edges in the validation set when constructing the adjacency matrix or the random walks. We take each lncRNA as a starting point and walk 200 times, with a walk length of 100. Specifically, 80% of the edges in the network graph are treated as the training set and the remaining 20% as the test set. In addition, for each dataset, all possible lncRNA-protein pairs other than the positive samples are treated as negative samples. The node representations learned by every model are 128-dimensional vectors, and the node vectors of a lncRNA and a protein are concatenated to obtain a 256-dimensional edge representation. Finally, the edge representations are fed into an SVM classifier to predict interactions. The SVM uses a radial basis function (RBF) kernel with a penalty coefficient of 5. We repeat this process five times and take the average as the final result.
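The data preparation steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy interaction list, the zero-valued placeholder embeddings and the helper names `split_edges`, `negative_pairs` and `edge_feature` are all assumptions, and the SVM step is omitted (with a library such as scikit-learn it would presumably correspond to an RBF-kernel classifier with the penalty parameter set to 5).

```python
import random


def split_edges(edges, train_fraction=0.8, seed=0):
    """Shuffle the positive edges and split them into train and test sets."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def negative_pairs(lncrnas, proteins, positives):
    """All lncRNA-protein pairs that are not known positive interactions."""
    known = set(positives)
    return [(l, p) for l in lncrnas for p in proteins if (l, p) not in known]


def edge_feature(lnc_vec, prot_vec):
    """Concatenate two node vectors into one edge representation."""
    return list(lnc_vec) + list(prot_vec)


# Hypothetical toy data: 5 lncRNAs, 4 proteins, 10 known interactions.
lncrnas = [f"lnc{i}" for i in range(5)]
proteins = [f"prot{j}" for j in range(4)]
positives = [(l, p) for i, l in enumerate(lncrnas)
             for j, p in enumerate(proteins) if (i + j) % 2 == 0]

train_edges, test_edges = split_edges(positives)
negatives = negative_pairs(lncrnas, proteins, positives)

# Placeholder 128-dimensional embeddings; a real run would learn these
# from the training edges only, so that test edges are not leaked.
embedding = {node: [0.0] * 128 for node in lncrnas + proteins}
lnc, prot = train_edges[0]
feature = edge_feature(embedding[lnc], embedding[prot])
```

Keeping the split ahead of embedding learning mirrors the leakage precaution in the text: held-out positive edges should not appear in the adjacency matrix or in the walks.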
Table 2. A summary of six representative network embedding methods.

Method category      Method        Year  Source of context nodes            Embedding learning method          Reference
Factorization-based  SVD           2012  \                                  Matrix decomposition               [21]
Random walk-based    DeepWalk      2014  Random walks                       Skip-gram with hierarchical method [29]
                     node2vec      2016  Biased random walk                 Skip-gram with negative sampling   [30]
                     metapath2vec  2017  Meta-path and biased random walk   Skip-gram with negative sampling   [16]
Deep learning-based  LINE          2015  1st-order and 2nd-order neighbor   Single-layer neural network        [32]
                     SDNE          2016  1st-order and 2nd-order neighbor   Deep autoencoder                   [33]
3 Results and Discussion

To evaluate the prediction performance of the selected network embedding methods on the 5 benchmark datasets, we employ precision (PRE), recall (REC), specificity (SPE), accuracy (ACC), F1-score (F1) and AUC as performance evaluation metrics. The corresponding ROC curves are shown in Fig. 2. The mean, variance and standard deviation of AUC are shown in Table 3. Table 4 shows the overall performance of the different LPI prediction methods on the five datasets.

Table 3. The mean, variance and standard deviation of AUC on five datasets.

Method category      Method        Mean   Variance       Standard deviation
Factorization-based  SVD           0.925  1.656 × 10^-3  4.070 × 10^-2
Random walk-based    DeepWalk      0.931  1.543 × 10^-3  3.929 × 10^-2
                     node2vec      0.923  1.629 × 10^-3  4.036 × 10^-2
                     metapath2vec  0.943  1.064 × 10^-3  3.262 × 10^-2
Deep learning-based  LINE          0.920  2.541 × 10^-3  5.040 × 10^-2
                     SDNE          0.913  1.243 × 10^-3  3.525 × 10^-2
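The threshold-based metrics named above (PRE, REC, SPE, ACC and F1) follow from the four confusion matrix counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the experiments, and AUC is left out because it requires ranked scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion matrix counts (assumes nonzero denominators)."""
    pre = tp / (tp + fp)                    # precision
    rec = tp / (tp + fn)                    # recall (sensitivity)
    spe = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    f1 = 2 * pre * rec / (pre + rec)        # harmonic mean of PRE and REC
    return {"PRE": pre, "REC": rec, "SPE": spe, "ACC": acc, "F1": f1}


# Hypothetical counts for a balanced test set of 100 positives and 100 negatives.
m = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```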
Fig. 2. The ROC curves of six methods based on metapath2vec, DeepWalk, node2vec, LINE, SVD and SDNE on the five datasets.
As shown in Fig. 2, the recently proposed methods generally show better prediction performance on all 5 datasets. For instance, metapath2vec consistently outperforms the other methods on every dataset, and its AUC value is always the best among these methods. Moreover, Table 3 shows that the variance and standard deviation of its AUC are the lowest across the five datasets, which indicates that its AUC is concentrated and stable.

Table 4. The prediction performance of six methods on five datasets.

Data       Category             Method        PRE    REC    SPE    ACC    F1     AUC
Dataset 1  Factorization-based  SVD           0.836  0.977  0.808  0.892  0.901  0.943
           Random walk-based    DeepWalk      0.870  0.917  0.863  0.890  0.893  0.951
                                node2vec      0.881  0.881  0.880  0.880  0.880  0.948
                                metapath2vec  0.885  0.930  0.878  0.904  0.907  0.963
           Deep learning-based  LINE          0.863  0.920  0.853  0.886  0.890  0.945
                                SDNE          0.868  0.962  0.854  0.908  0.913  0.939
Dataset 2  Factorization-based  SVD           0.852  0.995  0.827  0.911  0.918  0.953
           Random walk-based    DeepWalk      0.899  0.910  0.898  0.904  0.904  0.963
                                node2vec      0.904  0.865  0.908  0.887  0.884  0.958
                                metapath2vec  0.883  0.972  0.871  0.921  0.925  0.969
           Deep learning-based  LINE          0.872  0.948  0.861  0.904  0.908  0.955
                                SDNE          0.866  0.962  0.851  0.907  0.912  0.939
Dataset 3  Factorization-based  SVD           0.737  0.800  0.714  0.757  0.767  0.844
           Random walk-based    DeepWalk      0.835  0.705  0.861  0.783  0.764  0.854
                                node2vec      0.791  0.716  0.811  0.763  0.751  0.847
                                metapath2vec  0.816  0.775  0.824  0.800  0.794  0.879
           Deep learning-based  LINE          0.772  0.703  0.793  0.748  0.736  0.820
                                SDNE          0.758  0.806  0.741  0.774  0.780  0.847
Dataset 4  Factorization-based  SVD           0.862  0.914  0.853  0.883  0.887  0.942
           Random walk-based    DeepWalk      0.895  0.902  0.894  0.898  0.898  0.945
                                node2vec      0.873  0.875  0.873  0.874  0.874  0.917
                                metapath2vec  0.915  0.889  0.917  0.903  0.902  0.953
           Deep learning-based  LINE          0.893  0.898  0.892  0.895  0.895  0.943
                                SDNE          0.846  0.861  0.844  0.852  0.853  0.905
Dataset 5  Factorization-based  SVD           0.880  0.853  0.884  0.868  0.866  0.943
           Random walk-based    DeepWalk      0.860  0.871  0.858  0.865  0.865  0.944
                                node2vec      0.871  0.858  0.873  0.866  0.864  0.945
                                metapath2vec  0.875  0.876  0.874  0.875  0.875  0.950
           Deep learning-based  LINE          0.871  0.836  0.876  0.856  0.853  0.938
                                SDNE          0.855  0.863  0.854  0.858  0.859  0.934

In addition, in the experiments, we find that the newly proposed method takes less time to learn node representations than previous traditional methods such as LINE and DeepWalk. Specifically, SVD, DeepWalk, node2vec, metapath2vec, LINE and SDNE need 32 s, 671 s, 322 s, 421 s, 3959 s and 373 s, respectively, to learn the node representations on Dataset 1. The results indicate that the newly proposed network embedding methods, e.g., metapath2vec, are efficient and effective, and deserve in-depth study and use in building LPI prediction methods.
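The summary statistics in Table 3 can be cross-checked against the per-dataset AUC values in Table 4. The short sketch below does this with Python's standard `statistics` module; it assumes the population variance (dividing by the number of datasets, 5), which is the convention that appears to match the reported values. The dictionary names are hypothetical.

```python
import statistics

# AUC on Datasets 1 to 5, copied from Table 4.
auc = {
    "SVD":          [0.943, 0.953, 0.844, 0.942, 0.943],
    "DeepWalk":     [0.951, 0.963, 0.854, 0.945, 0.944],
    "node2vec":     [0.948, 0.958, 0.847, 0.917, 0.945],
    "metapath2vec": [0.963, 0.969, 0.879, 0.953, 0.950],
    "LINE":         [0.945, 0.955, 0.820, 0.943, 0.938],
    "SDNE":         [0.939, 0.939, 0.847, 0.905, 0.934],
}

# Mean, population variance and population standard deviation per method.
stats = {
    method: (
        statistics.fmean(values),
        statistics.pvariance(values),
        statistics.pstdev(values),
    )
    for method, values in auc.items()
}

# The method whose AUC varies least across the five datasets.
most_stable = min(stats, key=lambda method: stats[method][1])
```

Under this convention the smallest variance belongs to metapath2vec, consistent with the stability claim in the text.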
4 Conclusion

In this paper, we conducted extensive experiments to compare network embedding methods in the prediction of LPIs. Specifically, we tested 6 representative methods based on different network representation strategies on 5 benchmark datasets. The results demonstrated that recently proposed methods, e.g., metapath2vec, are time-saving and effective on both human and plant datasets. We hope our results will provide a direction for future network embedding development in bioinformatics, especially for molecular interaction prediction. Methods such as metapath2vec, which integrate multilevel attributes of the network into the low-dimensional mapping, achieve higher prediction performance. This points to the need to develop more efficient network representation models that extract more important network features for improving LPI prediction.

Acknowledgements. This work was partially supported by National Natural Science Foundation of China (No. 61973190); National Key Research and Development Program of China (No. 2020YFA0712402); Shandong Provincial Key Research and Development Program (Major Scientific and Technological Innovation Project 2019JZZY010423); Natural Science Foundation of Shandong Province of China (ZR2020ZD25); the Innovation Method Fund of China (Ministry of Science and Technology of China, 2018IM020200); the Tang Scholar and the program of Qilu Young Scholar of Shandong University.
References

1. Leskovec, J., Sosič, R.: SNAP: A general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. 8, 1–20 (2016)
2. Xiao, Y., Zhang, J., Deng, L.: Prediction of lncRNA-protein interactions using HeteSim scores based on heterogeneous networks. Sci. Rep. 7, 3664 (2017)
3. Zhang, W., et al.: Predicting drug-disease associations and their therapeutic function based on the drug-disease association bipartite network. Methods 145, 51–59 (2018)
4. Li, T., et al.: A scored human protein–protein interaction network to catalyze genomic interpretation. Nat. Methods 14, 61–64 (2017)
5. Cai, H., Zheng, V.W., Chang, K.C.-C.: A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Trans. Knowl. Data Eng. 30, 1616–1637 (2018)
6. Goyal, P., Ferrara, E.: Graph embedding techniques, applications, and performance: A survey. Knowl.-Based Syst. 151, 78–94 (2018)
7. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29, 2724–2743 (2017)
8. Qiu, J., et al.: Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 459–467. ACM (2018). https://doi.org/10.1145/3159652.3159706
9. Chen, H., Perozzi, B., Hu, Y., Skiena, S.: HARP: Hierarchical representation learning for networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32(1). https://ojs.aaai.org/index.php/AAAI/article/view/11849
10. Veličković, P., et al.: Deep graph infomax. In: Proceedings of the Seventh International Conference on Learning Representations, vol. 46 (2019)
11. Veličković, P., et al.: Graph Attention Networks. arXiv:1710.10903 [cs, stat] (2018)
12. Xue, G., et al.: Dynamic network embedding survey. Neurocomputing 472, 212–223 (2022)
13. Wang, Y., et al.: De novo prediction of RNA–protein interactions from sequence information. Mol. BioSyst. 9, 133–142 (2013)
14. Su, C., Tong, J., Zhu, Y., Cui, P., Wang, F.: Network embedding in biomedical data science. Brief. Bioinform. 21, 182–197 (2020)
15. Nelson, W., et al.: To embed or not: Network embedding as a paradigm in computational biology. Front. Genet. 10, 381 (2019)
16. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: Scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144. ACM (2017). https://doi.org/10.1145/3097983.3098036
17. Peng, L., Tan, J., Tian, X., Zhou, L.: EnANNDeep: An ensemble-based lncRNA–protein interaction prediction framework with adaptive k-nearest neighbor classifier and deep models. Interdiscip. Sci. Comput. Life Sci. 14, 209–232 (2022)
18. Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V., Smola, A.J.: Distributed large-scale natural graph factorization. In: Proceedings of the 22nd International Conference on World Wide Web - WWW 2013, pp. 37–48. ACM Press (2013). https://doi.org/10.1145/2488388.2488393
19. Cao, S., Lu, W., Xu, Q.: GraRep: Learning graph representations with global structural information. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 891–900. ACM (2015). https://doi.org/10.1145/2806416.2806512
20. Ou, M., Cui, P., Pei, J., Zhang, Z., Zhu, W.: Asymmetric transitivity preserving graph embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1105–1114. ACM (2016). https://doi.org/10.1145/2939672.2939751
21. Chen, T., et al.: SVDFeature: A toolkit for feature-based collaborative filtering. J. Mach. Learn. Res. 13, 3619–3622 (2012)
22. Dai, W., et al.: Matrix factorization-based prediction of novel drug indications by integrating genomic space. Comput. Math. Methods Med. 2015, 1–9 (2015)
23. Lei, X., Yang, X., Fujita, H.: Random walk based method to identify essential proteins by integrating network topology and biological characteristics. Knowl.-Based Syst. 167, 53–67 (2019)
24. Xie, G., Huang, B., Sun, Y., Wu, C., Han, Y.: RWSF-BLP: A novel lncRNA-disease association prediction model using random walk-based multi-similarity fusion and bidirectional label propagation. Mol. Genet. Genomics 296(3), 473–483 (2021). https://doi.org/10.1007/s00438-021-01764-3
25. Peng, J., Guan, J., Shang, X.: Predicting Parkinson's disease genes based on Node2vec and autoencoder. Front. Genet. 10, 226 (2019)
26. Gu, S., Milenkovic, T.: Graphlets versus node2vec and struc2vec in the task of network alignment. arXiv:1805.04222 [physics] (2018)
27. Zhang, Y., Tang, M.: Consistency of random-walk based network embedding algorithms. arXiv:2101.07354 [cs, stat] (2021)
28. Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., Murphy, K.: Machine Learning on Graphs: A Model and Comprehensive Taxonomy. arXiv:2005.03675 [cs, stat] (2021)
29. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. ACM (2014). https://doi.org/10.1145/2623330.2623732
30. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. ACM (2016). https://doi.org/10.1145/2939672.2939754
31. Shinde, P.P., Shah, S.: A review of machine learning and deep learning applications. In: Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–6. IEEE (2018). https://doi.org/10.1109/ICCUBEA.2018.8697857
32. Tang, J., et al.: LINE: Large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. International World Wide Web Conferences Steering Committee (2015). https://doi.org/10.1145/2736277.2741093
33. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234. ACM (2016). https://doi.org/10.1145/2939672.2939753
GATSDCD: Prediction of circRNA-Disease Associations Based on Singular Value Decomposition and Graph Attention Network

Mengting Niu1,2, Abd El-Latif Hesham3, and Quan Zou1,2(B)

1 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
[email protected]
2 Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
3 Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef 62511, Egypt
Abstract. As research deepens, it has become clear that circular RNAs (circRNAs) have important effects on many human physiological and pathological pathways. Studying the association of circRNAs with diseases not only helps to study biological processes, but also provides new directions for the diagnosis and treatment of diseases. However, it is relatively inefficient to verify the association of circRNAs with diseases by biotechnology alone. This paper proposes a computational method, GATSDCD, based on a graph attention network (GAT) and a neural network (NN) to predict circRNA-disease associations. GATSDCD combines similarity features and semantic features of circRNAs and diseases as raw features. The original features are then denoised using singular value decomposition to better represent circRNAs and diseases. Further, using the obtained circRNA and disease features as node attributes, a graph attention network is used to construct feature vectors in subgraphs and extract deep embedded features. Finally, a neural network is applied to predict potential associations. The experimental results show that the GATSDCD model outperforms existing methods in multiple aspects and is an effective method for identifying circRNA-disease associations. Case studies also demonstrate that GATSDCD can effectively identify circRNAs associated with gastric and breast cancers.

Keywords: circRNA-disease association · Graph attention network · Singular value decomposition · Neural network
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 14–27, 2022. https://doi.org/10.1007/978-3-031-13829-4_2

1 Introduction

CircRNAs are noncoding single-stranded RNA molecules that are produced by back-splicing and form a circular structure through covalent bonds [1]. They exist widely in a variety of biological cells [2], and are characterized by structural stability, sequence conservation and cell- or tissue-specific expression [3]. Recent studies have found that circRNAs can function in a variety of ways, such as acting as miRNA sponges [4] and interacting with miRNAs [5], and participating in the regulation of gene transcription [6], the cell cycle or physiological processes such as aging [7]. They are closely related to the regulation of major threats to human health such as cancer and heart disease. CircRNAs play indispensable parts in the biogenesis and development of certain diseases (such as tumors, atherosclerosis, diabetes, and nervous system diseases) [8–10]. This not only gives us a detailed understanding of circRNAs, but also gives us meaningful guidance for diagnosis, treatment and prevention. For example, overexpression of hsa_circ_0119412 promotes cervical cancer progression by targeting miR-217 to upregulate the anterior gradient [11]. Circ-OPHN1 inhibits trophoblast proliferation, migration and invasion by mediating the miR-558/THBS2 axis [12].

With the development of testing technology and the deepening of circRNA research, more and more circRNA molecular functions have been confirmed [13], and a variety of functional databases have been developed, such as circBase [14], CircInteractome [15], circRNADb [16], and CIRCpedia [17]. Regarding circRNA-disease research, many biological studies have been used to discover new associations, and related databases have been built from these data, such as CircR2Disease [18] and circAtlas [19]. With the support of these data, computational models for association prediction based on data mining have appeared. The original research approach was to treat association prediction as a recommendation problem. Xiujuan Lei et al. constructed a computational model based on a collaborative filtering recommendation system to solve this problem [20]. Hang Wei decomposed the interaction spectrum between circRNAs and diseases based on a matrix factorization algorithm [21]. Most approaches are computational models based on similarity matrices of circRNAs and diseases. Yaojia Chen predicted associations using a relational graph convolutional network and a DistMult decoder [22].
Niu et al. employed the TuMarkov neural network algorithm to predict unknown associations [23]. Hang Wei used matrix factorization to predict associations based on circRNA-disease interaction profiles [24]. K. Deepthi used the similarity matrix to propose a novel circRNA-disease association prediction method that relies on autoencoders and deep neural networks [25]. Guanghui Li used network-consistent projection to identify novel circRNA-disease associations [26]. Furthermore, several studies have used random walks on the network to infer potential associations [27]. Although current methods contribute significantly to predicting potential associations, they suffer from sparsity and cannot accurately describe the association characteristics.

In our study, we propose a hybrid computational framework, GATSDCD (as shown in Fig. 1), based on a graph attention network and a neural network to predict associations of circRNAs with diseases. It integrates similarity data of circRNAs and semantic features of diseases as original features. Singular value decomposition (SVD) is then used to reduce the noise in the original features. Furthermore, deep information about circRNAs and diseases is extracted using graph attention networks. A neural network (NN) is then used to predict unknown associations. To verify the capability of GATSDCD, fivefold cross-validation (FFCV) is performed in the experiments. The results show that GATSDCD outperforms other state-of-the-art models. Case studies of gastric cancer and breast cancer demonstrate that GATSDCD is an effective method for inferring potential disease-related circRNAs.
M. Niu et al.
Fig. 1. The overall framework of GATSDCD. (A) Data preprocessing: The similarity and semantic characteristics of circRNA and disease were taken as the original features. (B) Singular value decomposition is used to denoise the original features. Furthermore, graph attention network is used to extract topological features. (C) GATSDCD for predicting circRNA-disease association.
2 Materials and Methods
2.1 Datasets
CircR2Disease is a database dedicated to collecting experimentally validated circRNA-disease associations, and it provides an open-source platform for studying the mechanisms of disease-related circular RNAs [18]. From this platform, we obtained 739 experimentally confirmed circRNA-disease relationships involving 661 circRNAs and 100 diseases. We removed redundant data and retained only human-related diseases. Eventually, we obtained 650 associations between 585 circular RNAs and 88 diseases.
2.2 Feature Representation
Disease Attribute Feature. In this section, we constructed two disease semantic similarities and GIPs to describe the similarity between diseases. We used the MeSH [28] database to compute the semantic similarity SV1 of diseases. In MeSH, the relationships among diseases are represented by a directed acyclic graph (DAG) [23], namely $DAG_{DA} = (DA, N_{DA}, E_{DA})$, where $N_{DA}$ is the set containing node DA and its ancestor nodes, and $E_{DA}$ is the set of all edges in the DAG. The semantic similarity SV1 between diseases DA and DB, summed over the nodes t shared by the two DAGs, is then:
$$SV_1(DA, DB) = \frac{\sum_{t \in T_{DA} \cap T_{DB}} \left(D_{DA}(t) + D_{DB}(t)\right)}{DV(DA) + DV(DB)} \tag{1}$$
GATSDCD: Prediction of circRNA-Disease Associations
Among them, DV(DA) and DV(DB) are the semantic values of diseases DA and DB, respectively (computed as in Eq. (3)), and $D_{DA}(t)$ and $D_{DB}(t)$ are the semantic contributions of disease t to DA and DB. The semantic value of disease DA is the sum of the semantic contributions of all nodes in $DAG_{DA}$. For a disease D1 in $DAG_{DA}$, its semantic contribution to DA is:
$$D_{DA}(D1) = \begin{cases} 1, & \text{if } D1 = DA \\ \max\{\rho \cdot D_{DA}(D') \mid D' \in \text{children of } D1\}, & \text{if } D1 \neq DA \end{cases} \tag{2}$$
where ρ is the semantic contribution factor, which is generally 0.5. The semantic value DV(DA) of disease DA is then obtained from Eq. (3), where the sum runs over all nodes of $DAG_{DA}$:
$$DV(DA) = \sum_{D1 \in T_{DA}} D_{DA}(D1) \tag{3}$$
SV1 assumes that diseases at the same level have the same contribution, so the number of diseases in the DAG is ignored. However, an uncommon disease should obtain a higher contribution value. Therefore, we constructed a second disease similarity, SV2, which assumes that the semantic contribution of each layer in the disease DAG is different. Based on this assumption, the semantic similarity SV2(C, D) between disease C and disease D is defined as:
$$SV_2(C, D) = \frac{\sum_{t \in T_{C} \cap T_{D}} \left(D2_{C}(t) + D2_{D}(t)\right)}{DV(C) + DV(D)} \tag{4}$$
Among them, DV(C) and DV(D) are the semantic values of diseases C and D, and $D2_C(t)$ and $D2_D(t)$ are the semantic contributions of t to C and D. If disease C appears less frequently in the DAGs of all diseases than another disease D, then C is more specific and more closely related to the diseases in its DAG, so its semantic contribution should be larger. Based on this assumption, the semantic contribution $D2_C(t)$ of disease t to C is:
$$D2_C(t) = -\log\left(\frac{\mathrm{num}(DAG(t))}{\mathrm{num}(Diseases)}\right) \tag{5}$$
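The semantic similarity of Eqs. (1)–(3) and the contribution of Eq. (5) can be illustrated on a toy example. The following is a minimal sketch, not the authors' implementation: the four-term hierarchy and all function names are hypothetical, and it assumes that num(DAG(t)) counts the disease DAGs containing t and that num(Diseases) is the total number of diseases.

```python
import math

# Hypothetical toy hierarchy (child -> parents); not taken from MeSH.
PARENTS = {
    "neoplasms": [],
    "digestive_neoplasms": ["neoplasms"],
    "stomach_neoplasms": ["digestive_neoplasms"],
    "intestinal_neoplasms": ["digestive_neoplasms"],
}
RHO = 0.5  # semantic contribution factor stated in the text


def contributions(disease):
    """Eq. (2): contribution of every node in the DAG of `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in PARENTS[node]:
            value = RHO * contrib[node]
            # Keep the maximum over all children that lead to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_value(disease):
    """Eq. (3): sum of the contributions over the DAG of `disease`."""
    return sum(contributions(disease).values())


def sv1(a, b):
    """Eq. (1): semantic similarity from the nodes shared by both DAGs."""
    ca, cb = contributions(a), contributions(b)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (semantic_value(a) + semantic_value(b))


def sv2_contribution(term):
    """Eq. (5), assuming num(DAG(t)) = number of disease DAGs containing t."""
    dag_count = sum(term in contributions(d) for d in PARENTS)
    return -math.log(dag_count / len(PARENTS))


similarity = sv1("stomach_neoplasms", "intestinal_neoplasms")
```

In this toy hierarchy the two leaf diseases share the nodes digestive_neoplasms and neoplasms, so their SV1 is 1.5 / 3.5, while a term that occurs in every DAG receives a contribution of zero under Eq. (5).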
Not all diseases have DAGs in MeSH, so to obtain a more comprehensive disease characterization, we adopted GIPs. The Gaussian kernel function, also known as the radial basis function, is a monotonic function of the Euclidean distance between two vectors and can be used to measure the degree of similarity between samples. The GIPs GD(A, B) between disease A and disease B is defined as:
$$GD(A, B) = \exp\left(-\theta_d \lVert V(A) - V(B) \rVert^2\right) \tag{6}$$
where $\theta_d$ is the bandwidth of the GIP kernel, which controls the scope of the kernel function:
$$\theta_d = \frac{1}{\frac{1}{m}\sum_{i=1}^{m} \lVert V(d(i)) \rVert^2} \tag{7}$$
where m is the number of rows of M. The semantic similarity of diseases is calculated from the DAG, but in practice the semantic values of some diseases cannot be obtained. Therefore, for diseases A and B, if semantic values are available, the average of SV1(A, B) and SV2(A, B) is used as their similarity; otherwise, their similarity is given by the Gaussian interaction profile kernel similarity between them. The final disease similarity between diseases A and B is therefore:
$$DSim(A, B) = \begin{cases} \dfrac{SV_1(A, B) + SV_2(A, B)}{2}, & \text{if A and B have semantic similarity} \\ GD(A, B), & \text{otherwise} \end{cases} \tag{8}$$
CircRNA Feature Representation. In this section, we computed the GIPs of circRNAs. Similar to diseases, the Gaussian interaction profile kernel similarity GR(c(i), c(j)) between circRNAs c(i) and c(j) is calculated with Eqs. (9) and (10):
$$GR(c(i), c(j)) = \exp\left(-\theta_c \lVert V(c(i)) - V(c(j)) \rVert^2\right) \tag{9}$$
$$\theta_c = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \lVert V(c(i)) \rVert^2} \tag{10}$$
$\theta_c$ and n have the same meaning as $\theta_d$ and m.
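The Gaussian interaction profile kernel and the fused disease similarity described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: V(·) is taken to be the corresponding row or column of a binary association matrix, the bandwidth uses the usual normalisation by the mean squared profile norm, and the toy matrix and function names are hypothetical.

```python
import numpy as np


def gip_kernel(profiles):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    sq_norms = (profiles ** 2).sum(axis=1)
    theta = 1.0 / sq_norms.mean()  # bandwidth (assumes a non-zero mean norm)
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-theta * sq_dist)


def fuse_disease_similarity(sv1, sv2, gd, has_semantic):
    """Average SV1 and SV2 where semantic values exist, else fall back to GIP."""
    return np.where(has_semantic, (sv1 + sv2) / 2.0, gd)


# Hypothetical association matrix: 4 circRNAs (rows) x 3 diseases (columns).
A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

GR = gip_kernel(A)    # circRNA-circRNA similarity
GD = gip_kernel(A.T)  # disease-disease similarity
```

Each kernel matrix is symmetric with ones on the diagonal, since every profile is at distance zero from itself.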
2.3 Singular Value Decomposition for Feature Noise Reduction
When training a model on data, the feature dimension is often too high, that is, there are redundant features, and the features are sometimes correlated with one another [29, 30]. Therefore, we use principal component analysis (PCA) [31] to reduce noise; the new features are called principal components. In this study, the data matrix was subjected to singular value decomposition (SVD) [32] to obtain the principal components: the eigenvectors of the covariance matrix of X are the right singular vectors V obtained from the SVD of X. Figure 2 illustrates the relationship between PCA and SVD.
Fig. 2. The relationship between PCA and SVD.
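The relationship between PCA and SVD can be checked numerically. The sketch below is illustrative only: the random feature matrix and the function name are hypothetical, and it assumes the features are mean-centred before the decomposition.

```python
import numpy as np


def pca_via_svd(X, n_components):
    """Project X onto its leading principal components using an SVD."""
    Xc = X - X.mean(axis=0)  # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]  # right singular vectors = principal axes
    return Xc @ components.T, components, S


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))  # hypothetical feature matrix
Z, components, S = pca_via_svd(X, n_components=3)

# Eigenvalues of the sample covariance equal the squared singular values / (n - 1).
cov_eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
```

Keeping only the leading components discards the directions with the smallest variance, which is the sense in which the decomposition acts as noise reduction here.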
In this paper, the input matrices are $c = \{c_1, c_2, \cdots, c_m\}$ and $d = \{d_1, d_2, \cdots, d_n\}$, which are the original features of circRNAs and diseases. Principal component analysis is then implemented using singular value decomposition (SVD). The final circRNA and disease feature vectors are transformed into $c' = \{c'_1, c'_2, \cdots, c'_c\}$ and $d' = \{d'_1, d'_2, \cdots, d'_c\}$, where c is the number of circRNA and disease features.
2.4 Graph Attention Network Embedding Features
Recently, graph convolutional networks have achieved great success in bioinformatics [33]. The graph attention network (GAT) can be seen as a variant of the graph convolutional network (GCN), an NN architecture based on the graph structure [34]. The graph attention network requires a graph to be constructed in advance as input. This section builds a graph based on circRNA-disease associations, where the input node features are $F = \{F_1, F_2, \ldots, F_n\}$ ($F_i \in \mathbb{R}^m$) and the structural features are
$$N = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$$
where A is the circRNA-disease association matrix. Here n is the number of nodes (the number of all circular RNAs and all diseases), m is the number of features of each node, and F represents the features of all nodes. To convert the input into the output, we obtain the output feature F' from the input F through at least one linear transformation, namely:
$$F' = F \cdot W_F \tag{11}$$
where $W_F \in \mathbb{R}^{m \times m'}$ is a learnable weight matrix that maps the m input features to the m' output features. The first step is to learn the importance of the neighbors of a given node. Because different nodes have different importance, an attention mechanism is applied to the nodes. This paper introduces a multi-head attention mechanism, which makes the learning process of the model more stable. The attention mechanism calculates the attention coefficient $e_{ij}$ of the surrounding nodes. The non-normalized attention coefficient $e_{ij}$ of the association pair of circRNA $c_i$ and disease $d_j$, computed from the current node i and its first-order neighbor node j, is expressed as follows:
$$e_{ij}(c_i, d_j) = \alpha(W F_i, W F_j) \tag{12}$$
Among them, α is the shared attention calculation function, that is, $\alpha: \mathbb{R}^{m'} \times \mathbb{R}^{m'} \to \mathbb{R}$.
Then, to make the non-normalized attention coefficients easier to calculate and compare, we apply softmax to normalize them over all adjacent nodes j of i:
$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})} \tag{13}$$
Among them, $N_i$ is the neighbor node set of circRNA $c_i$.
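The attention coefficients of Eqs. (12) and (13) can be sketched for a single attention head as follows. This is a minimal NumPy sketch rather than the authors' code: it assumes the scoring function is the standard GAT form, LeakyReLU applied to a weight vector times the concatenated projected features, and the toy association matrix, random features, and function names are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def attention_coefficients(F, W, a, adjacency):
    """Row i holds alpha_ij over the neighbours j of node i (zero elsewhere).

    Assumes every node has at least one neighbour.
    """
    H = F @ W  # projected node features, shape (n, m')
    m_out = H.shape[1]
    # a^T [h_i || h_j] splits into a part for node i and a part for node j.
    src = H @ a[:m_out]
    dst = H @ a[m_out:]
    e = leaky_relu(src[:, None] + dst[None, :])
    e = np.where(adjacency > 0, e, -np.inf)  # mask non-neighbours
    e = e - e.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(e)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
A = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)  # 3 circRNAs x 2 diseases
N = np.block([[np.zeros((3, 3)), A], [A.T, np.zeros((2, 2))]])
F = rng.normal(size=(5, 4))  # 5 nodes with m = 4 input features
W = rng.normal(size=(4, 3))  # projection to m' = 3 features
a = rng.normal(size=6)       # attention weight vector of length 2 m'

alpha = attention_coefficients(F, W, a, N)
aggregated = alpha @ (F @ W)  # attention-weighted neighbourhood features
```

Because the graph is bipartite, each circRNA node attends only to disease nodes and vice versa, and each row of `alpha` sums to one over the node's neighbours.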
The output of the linear layer generally needs to pass through a nonlinear activation function. In this paper, the LeakyReLU activation function is used with a negative slope of 0.2. Combining Eq. (12) and Eq. (13), the complete attention mechanism is:
$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\alpha^T [W F_i \,\Vert\, W F_j]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\alpha^T [W F_i \,\Vert\, W F_k]\right)\right)} \tag{14}$$
where T stands for transpose, "||" means concatenation of vectors, and α is the weight coefficient vector of the graph attention layer. The coefficient $\alpha_{ij}$ represents the contribution of the features of node j to node i. In the whole calculation process, the contribution of each neighbor node k of node i to i must be calculated. Next, we fuse the neighborhood representations of the nodes according to their attention coefficients. For the embedding of a given node, the projected features of its neighbors are combined using these different weights. For the results computed under K independent attention mechanisms, the K outputs are averaged instead of concatenated, and the calculation formula is as follows:
$$F'_i = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in N_i} \alpha_{ij}^{k} W^{k} F_j\right) \tag{15}$$
$F'_i$ is the feature vector that is passed to the output layer after feature extraction by the graph attention network. K is the number of independent attention mechanisms and k is their index; σ(·) is the activation function; $\alpha_{ij}^{k}$ is the attention coefficient of node i to node j under the k-th attention mechanism. After L GAT layers, the final node features are obtained, defined as $F' = \{c_1, c_2, \cdots, c_{N_c}, d_1, d_2, \cdots, d_{N_d}\}$.
2.5 Neural Network for Prediction
In this section, we employed a neural network (NN) [35] to construct the GATSDCD model. The output of the k-th layer of the NN is:
$$x^{(k+1)} = \sigma\left(W^{(k)} x^{(k)} + b^{(k)}\right) \tag{16}$$
where $x^{(0)}$ is the input, $x^{(0)} = F'$; σ is the LeakyReLU activation function; and $W^{(k)}$ and $b^{(k)}$ are the weight and bias parameters of the k-th layer. In the k-th (last) layer, the output score is computed as:
$$f(c, d) = x^{(k+1)} = \sigma\left(W^{(k)} x^{(k)} + b^{(k)}\right) \tag{17}$$
In GATSDCD, known circRNA-disease pairs were regarded as positive data and labeled as 1. We then randomly chose the same number of pairs from the unknown associations and labeled them as 0. Finally, the loss function is defined as:
$$L = -\frac{1}{N}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y \log f(c, d) + (1 - y)\log\left(1 - f(c, d)\right) + \lambda\lVert\cdot\rVert^{2}\right] \tag{18}$$
where N is the number of training samples and λ is the regularization control factor.
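The balanced sampling and the loss described above can be sketched as follows. This is a hedged illustration, not the authors' training code: the toy association matrix and function names are hypothetical, and the regularization term is assumed to be an L2 penalty on the network weights, which the text does not spell out.

```python
import numpy as np


def balanced_samples(A, rng):
    """Known pairs as positives plus an equal number of random unknown pairs."""
    positives = np.argwhere(A == 1)
    unknown = np.argwhere(A == 0)
    chosen = rng.choice(len(unknown), size=len(positives), replace=False)
    negatives = unknown[chosen]
    pairs = np.vstack([positives, negatives])
    labels = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
    return pairs, labels


def regularized_log_loss(y, scores, weights, lam, eps=1e-12):
    """Binary cross-entropy plus an assumed L2 penalty on the weight matrices."""
    scores = np.clip(scores, eps, 1.0 - eps)  # avoid log(0)
    cross_entropy = -np.mean(y * np.log(scores) + (1 - y) * np.log(1 - scores))
    l2_penalty = sum(float(np.sum(W ** 2)) for W in weights)
    return cross_entropy + lam * l2_penalty


rng = np.random.default_rng(0)
A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 0]])  # hypothetical 4 circRNAs x 3 diseases
pairs, labels = balanced_samples(A, rng)
loss = regularized_log_loss(labels, np.full(len(labels), 0.5), weights=[], lam=0.01)
```

With uninformative scores of 0.5 and no weights to penalise, the loss reduces to log 2, which is a convenient sanity check for the cross-entropy term.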
2.6 Evaluation Criteria
To evaluate the capability of GATSDCD, we used FFCV for validation. We computed accuracy, precision, recall, and F1-score, plotted the receiver operating characteristic (ROC) curve, and calculated the area under the ROC curve (AUC). Moreover, the area under the precision-recall (PR) curve (AUPR) was also employed to estimate the overall performance of GATSDCD.
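The evaluation metrics listed above can be computed as in the following sketch. It is illustrative only: the decision threshold of 0.5, the toy labels and scores, and the function names are assumptions, and the AUC is computed through its rank interpretation rather than by integrating a plotted curve.

```python
import numpy as np


def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall and F1 at an assumed decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


def auc_by_ranking(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def five_fold_indices(n_samples, rng):
    """Shuffle the sample indices and split them into five test folds."""
    return np.array_split(rng.permutation(n_samples), 5)


y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2])
accuracy, precision, recall, f1 = classification_metrics(y_true, scores)
auc = auc_by_ranking(y_true, scores)
folds = five_fold_indices(len(y_true), np.random.default_rng(0))
```

In a five-fold run, each fold in turn serves as the test set while the remaining four are used for training, and the metrics are averaged over the folds.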
3 Experiments and Results
3.1 GATSDCD Performance
To evaluate the performance of GATSDCD, we performed FFCV on our data. The results are reported in Table 1. The GATSDCD model has an average accuracy of 0.88602, precision of 0.92106, recall of 0.89438, F1-score of 0.88798, AUC of 0.9481, and AUPR of 0.91842. We also plotted the ROC and PR curves of the GATSDCD model, as shown in Fig. 3. The ROC and PR curves of the five folds are similar, which indicates that the prediction performance of GATSDCD is stable and that the model can improve the prediction of potential disease-related circRNAs.
Table 1. Results of FFCV based on CircR2Disease dataset
Test fold   Accuracy   Precision   Recall    F1-score   AUC     AUPR
1           0.881      0.916       0.918     0.927      0.943   0.9109
2           0.8923     0.9243      0.9029    0.8719     0.943   0.9173
3           0.8831     0.9208      0.8901    0.8652     0.952   0.9244
4           0.8891     0.923       0.87      0.8789     0.947   0.9196
5           0.8841     0.9203      0.8904    0.8963     0.953   0.9199
Ave         0.8860     0.9210      0.89438   0.8879     0.948   0.9184
Fig. 3. The AUPR and AUROC values of the GATSDCD model.
Fig. 4. The comparison results of different parameter values.
3.2 Impact of Parameters
In this section, we analyze the impact of several important hyperparameters on the performance of GATSDCD: the number of GAT layers, the number of attention heads in GAT, the regularization factor, and the dropout rate. Figure 4 shows the AUC and AUPR under different parameter values. As reported in previous studies, an appropriate increase in the number of layers of a GNN improves performance, but an excessively deep network can degrade it. We therefore compared different numbers of GAT layers. We studied the effect of the number of GAT layers L on the performance of GATSDCD when the number of GAT heads K was 1. We varied L ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and measured the AUC and AUPR of GATSDCD on the dataset (Fig. 4A). The AUC of GATSDCD decreases as L increases, whereas the AUPR is relatively stable when the number of layers is less than 5, reaches its maximum at 5 layers, and then begins to decrease. Therefore, this paper takes L = 5. In addition, we studied the influence of the number of graph attention heads K on the final performance of GATSDCD when the number of GAT layers L is 5. We varied K ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and measured the AUC and AUPR of GATSDCD on the dataset (Fig. 4B). The AUC shows an overall downward trend as K increases, while the AUPR is largest when the number of attention heads is 6 and then begins to decrease. Taking all factors into consideration, this paper sets the number of GAT layers to 5 and the number of graph attention heads to 6.
In addition, we analyzed the regularization factor λ (Fig. 4C). GATSDCD achieved the best AUC and AUPR at λ = 0.01. We used the Adam optimizer and set the number of iterations to 1000. We also analyzed the dropout rate of the GAT, setting it to each value in {0.2, 0.4, 0.6, 0.8} and recording the results (Fig. 4D). A dropout rate of 0.4 gave the best results.
3.3 Ablation Study
We also evaluated the effect of the different components of the GATSDCD model: SVD feature denoising, deep feature extraction by GAT, and the NN classifier. We removed individual components one at a time for the ablation study. In particular, we define the variants of GATSDCD as follows. GATSDCD without feature denoising (n-SVD): the SVD part is removed, and the circRNA and disease features are used directly, without denoising, as the initial node features. GATSDCD without GAT (n-GAT): GAT is removed from GATSDCD, and the features after SVD are concatenated and fed into the neural network. GATSDCD without NN (n-NN): prediction scores are computed with a dot product instead of a two-layer NN. The results are shown in Fig. 5. The AUC and AUPR of GATSDCD without feature noise reduction are the lowest, indicating that denoising the fused similarity features before using them as initial node features substantially improves the capability of the model. The performance of GATSDCD without GAT and of GATSDCD without NN drops by about 10%. Thus, GATSDCD combines the strengths of these parts to achieve the best performance.
Fig. 5. The comparison results of ablation study (bar chart of AUC and AUPR values under 5-fold CV for n-SVD, n-GAT, n-NN, and GATSDCD).
3.4 Performance Comparison with Other Methods
As several models have been constructed to predict circRNAs associated with diseases, we compared GATSDCD with other state-of-the-art methods. Because those methods used different evaluation criteria and datasets, for ease of comparison we selected 9 methods that used the same dataset and employed AUC as the evaluation metric. We compared GATSDCD with the following methods: DWNN-RLS [36], iGRLCDA [37], KATZHCDA [38], NCPCDA [26], AERF [39], Wang’s method [40], iCircDA-MF [24], and GCNCDA [20]. From the results of 5-fold cross-validation, we report the best and average performance of GATSDCD, denoted as GATSDCD-best and GATSDCD-average, respectively. The results are shown in Table 2.
Table 2. Performance comparison of different classification models
Models             AUC
DWNN-RLS           0.8854
PWCDA              0.89
KATZHCDA           0.7936
NCPCDA             0.9201
RWRHCD             0.666
Wang’s method      0.8667
iCircDA-MF         0.9178
GATSDCD-best       0.9537
GATSDCD-average    0.9481
The results show that GATSDCD outperforms the other predictors in AUC, with GATSDCD-best and GATSDCD-average values of 0.9537 and 0.9481, respectively. GATSDCD is clearly better than ICFCDA, DWNN-RLS, KATZHCDA, RWRHCD, iCDA-CGR, and other algorithms based on label propagation, random walks, and mathematical-statistical models. Compared with the GNN-based GCNCDA, GATSDCD also has a clear advantage. This is because our model first filters out important features through SVD and then learns deep features using a graph attention network, which is more efficient than statistical analysis methods. This also demonstrates the effectiveness of GATSDCD.
3.5 Case Study
To further demonstrate the ability of GATSDCD to infer disease-associated circRNAs, we performed case studies of gastric and breast cancers. We first trained GATSDCD with the known association data and then used the resulting model to make predictions. Finally, the prediction scores were ranked, and the predicted results were verified by searching the literature and databases. Gastric cancer ranks fourth in global incidence among malignant tumors and third in the number of deaths. Gastric cancer often occurs in specific regions, and more than half of the patients diagnosed with gastric cancer are from East Asia, such as Japan and China [41]. For gastric cancer, we present the top 20 predicted associations, both verified and unverified (Table 3). Of the top 20 candidates with the highest scores, 16 have been confirmed in the literature; the other 4 have not yet been experimentally verified.
Table 3. The top 20 gastric cancer related candidate circRNAs
Rank   circRNA               PMID          Rank   circRNA               PMID
1      circ-ABCB10           28744405      11     circDENND4C           28739726
2      hsa_circ_0000911      28744405      12     hsa_circ_0006528      28803498
3      hsa_circ_0000732      28744405      13     circ-Foxo3            27886165
4      circ-TTBK2            Unconfirmed   14     hsa_circ_0008945      28744405
5      hsa_circRNA_100269    28657541      15     circLARP4             28893265
6      hsa_circRNA_006054    28484086      16     hsa_circRNA_104821    28484086
7      hsa_circ_0018293      28744405      17     hsa_circRNA_103110    28484086
8      hsa_circ_0072765      Unconfirmed   18     circHIPK3             Unconfirmed
9      hsa_circ_0000893      28744405      19     hsa_circRNA_406697    Unconfirmed
10     hsa_circ_0003159      28618205      20     hsa_circ_0000190      28130019
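The ranking step of the case study can be sketched as follows. This is a hypothetical illustration: the circRNA names and scores are invented, and excluding circRNAs whose association with the disease is already known before ranking is an assumption about the procedure.

```python
import numpy as np


def top_candidates(scores, known, names, k):
    """Rank circRNAs with no known association to the disease by predicted score."""
    candidates = np.flatnonzero(known == 0)
    order = candidates[np.argsort(-scores[candidates], kind="stable")]
    return [(names[i], float(scores[i])) for i in order[:k]]


# Hypothetical predicted scores for one disease and its known-association flags.
names = ["circ_A", "circ_B", "circ_C", "circ_D"]
scores = np.array([0.9, 0.2, 0.7, 0.8])
known = np.array([1, 0, 0, 0])

ranking = top_candidates(scores, known, names, k=2)
```

The resulting list would then be checked by hand against the literature, as is done for the candidates in Table 3.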
4 Conclusion
Identifying associations between circRNAs and diseases will help biologists understand the pathogenesis of diseases and promote their treatment. In this study, we propose a computational model, GATSDCD, based on graph attention networks. GATSDCD uses similarity features of circRNAs and diseases as original features. The original features are denoised by principal component analysis implemented with singular value decomposition. GAT is then used to aggregate the feature information of adjacent nodes and to automatically learn useful latent features. Lastly, GATSDCD uses a neural network to predict potential circRNA-disease associations. In FFCV, GATSDCD achieved an average AUC of 0.9481 and an average AUPR of 0.9184. In addition, case studies of gastric cancer and breast cancer also demonstrate that GATSDCD can predict potential disease-related circular RNAs with high accuracy. Nevertheless, GATSDCD can still be improved. Considering only the similarity features of circRNAs and diseases is incomplete and affects the performance of the model. Therefore, more biological information, such as circRNA-miRNA associations or circRNA sequences, will be used in further studies to construct more precise node features. Moreover, because the numbers of known and unknown associations differ greatly, we will build a better negative sampling strategy to improve the performance of GATSDCD.
Acknowledgments. The work was supported by the National Natural Science Foundation of China (No. 62131004, No. 61922020, No. 61872114), the Sichuan Provincial Science Fund for Distinguished Young Scholars (2021JDJQ0025), and the Special Science Foundation of Quzhou (2021D004).
Competing Interests. The authors have declared no competing interests.
References
1. Kristensen, L.S., Andersen, M.S., Stagsted, L.V., Ebbesen, K.K., Hansen, T.B., Kjems, J.: The biogenesis, biology and characterization of circular RNAs. Nat. Rev. Genet. 20(11), 675–691 (2019)
2. Ye, C.Y., Chen, L., Liu, C., Zhu, Q.H., Fan, L.: Widespread noncoding circular RNAs in plants. New Phytol. 208(1), 88–95 (2015)
3. Chen, L.-L.: The biogenesis and emerging roles of circular RNAs. Nat. Rev. Mol. Cell Bio. 17(4), 205–211 (2016)
4. Kulcheski, F.R., Christoff, A.P., Margis, R.: Circular RNAs are miRNA sponges and can be used as a new class of biomarker. J. Biotechnol. 238, 42–51 (2016)
5. Jiao, J., et al.: Development of a two-in-one integrated assay for the analysis of circRNA-microRNA interactions. Biosens. Bioelectron. 178, 113032 (2021)
6. Zhao, Z.-J., Shen, J.: Circular RNA participates in the carcinogenesis and the malignant behavior of cancer. RNA Biol. 14(5), 514–521 (2017)
7. Qu, S., et al.: The emerging landscape of circular RNA in life processes. RNA Biol. 14(8), 992–999 (2017)
8. Zhou, Z., Sun, B., Huang, S., Zhao, L.: Roles of circular RNAs in immune regulation and autoimmune diseases. Cell Death Dis. 10(7), 1–13 (2019)
9. Liang, Z.-Z., Guo, C., Zou, M.-M., Meng, P., Zhang, T.-T.: circRNA-miRNA-mRNA regulatory network in human lung cancer: An update. Cancer Cell Int. 20(1), 1–16 (2020)
10. Wang, K., Gao, X.-Q., Wang, T., Zhou, L.-Y.: The function and therapeutic potential of circular RNA in cardiovascular diseases. Cardiovasc. Drugs and Ther., 1–18 (2021)
11. Lv, Y., Wang, M., Chen, M., Wang, D., Luo, M., Zeng, Q.: hsa_circ_0119412 overexpression promotes cervical cancer progression by targeting miR-217 to upregulate anterior gradient 2. J. Clin. Lab. Anal. 36, e24236 (2022)
12. Li, Y., Chen, J., Song, S.: Circ-OPHN1 suppresses the proliferation, migration, and invasion of trophoblast cells through mediating miR-558/THBS2 axis. Drug Dev. Res. (2022)
13. Wang, S., et al.: Exosomal circRNAs as novel cancer biomarkers: Challenges and opportunities. Int. J. Biol. Sci. 17(2), 562 (2021)
14. Glažar, P., Papavasileiou, P., Rajewsky, N.: circBase: A database for circular RNAs. RNA 20(11), 1666–1670 (2014)
15. Dudekula, D.B., Panda, A.C., Grammatikakis, I., De, S., Abdelmohsen, K., Gorospe, M.: CircInteractome: A web tool for exploring circular RNAs and their interacting proteins and microRNAs. RNA Biol. 13(1), 34–42 (2016)
16. Chen, X., Han, P., Zhou, T., Guo, X., Song, X., Li, Y.: circRNADb: A comprehensive database for human circular RNAs with protein-coding annotations. Sci. Rep. 6(1), 1–6 (2016)
17. Dong, R., Ma, X.-K., Li, G.-W., Yang, L.: CIRCpedia v2: An updated database for comprehensive circular RNA annotation and expression comparison. Genomics Proteomics Bioinf. 16(4), 226–233 (2018)
18. Fan, C., Lei, X., Fang, Z., Jiang, Q., Wu, F.-X.: CircR2Disease: A manually curated database for experimentally supported circular RNAs associated with various diseases. Database 2018 (2018)
19. Wu, W., Ji, P., Zhao, F.: CircAtlas: An integrated resource of one million highly accurate circular RNAs from 1070 vertebrate transcriptomes. Genome Biol. 21(1), 1–14 (2020)
20. Lei, X., Fang, Z., Guo, L.: Predicting circRNA–disease associations based on improved collaboration filtering recommendation system with multiple data. Front. Genet. 10, 897 (2019)
21. Wang, H., Tang, J., Ding, Y., Guo, F.: Exploring associations of non-coding RNAs in human diseases via three-matrix factorization with hypergraph-regular terms on center kernel alignment. Briefings Bioinf. 22(5), bbaa409 (2021)
22. Chen, Y., Wang, Y., Ding, Y., Su, X., Wang, C.: RGCNCDA: Relational graph convolutional network improves circRNA-disease association prediction by incorporating microRNAs. Comput. Biol. Med. 143, 105322 (2022)
23. Niu, M., Zou, Q., Wang, C.: GMNN2CD: Identification of circRNA–disease associations based on variational inference and graph Markov neural networks. Bioinformatics 28, 2246–2253 (2022)
24. Wei, H., Liu, B.: iCircDA-MF: Identification of circRNA-disease associations based on matrix factorization. Brief. Bioinform. 21(4), 1356–1367 (2020)
25. Deepthi, K., Jereesh, A.: An ensemble approach for CircRNA-disease association prediction based on autoencoder and deep neural network. Gene 762, 145040 (2020)
26. Li, G., Yue, Y., Liang, C., Xiao, Q., Ding, P., Luo, J.: NCPCDA: Network consistency projection for circRNA–disease association prediction. RSC Adv. 9(57), 33222–33228 (2019)
27. Lei, X., Bian, C.: Integrating random walk with restart and k-Nearest neighbor to identify novel circRNA-disease association. Sci. Rep. 10(1), 1–9 (2020)
28. Lowe, H.J., Barnett, G.O.: Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. JAMA 271(14), 1103–1108 (1994)
29. Niu, M., Lin, Y., Zou, Q.: sgRNACNN: Identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks. Plant Mol. Biol. 105(4–5), 483–495 (2021). https://doi.org/10.1007/s11103-020-01102-y
30. Ao, C., Zou, Q., Yu, L.: NmRF: Identification of multispecies RNA 2’-O-methylation modification sites from RNA sequences. Briefings Bioinf. 23(1), bbab480 (2022)
31. Destefanis, G., Barge, M.T., Brugiapaglia, A., Tassone, S.: The use of principal component analysis (PCA) to characterize beef. Meat Sci. 56(3), 255–259 (2000)
32. Stewart, G.W.: On the early history of the singular value decomposition. SIAM Rev. 35(4), 551–566 (1993)
33. Niu, M., Zou, Q., Lin, C.: CRBPDL: Identification of circRNA-RBP interaction sites using an ensemble neural network approach. PLoS Comput. Biol. 18(1), e1009798 (2022)
34. Wang, X., He, X., Cao, Y., Liu, M., Chua, T.-S. (eds.): KGAT: knowledge graph attention network for recommendation. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019)
35. Kong, Y., Gao, J., Xu, Y., Pan, Y., Wang, J., Liu, J.: Classification of autism spectrum disorder by combining brain connectivity and deep neural network classifier. Neurocomputing 324, 63–68 (2019)
36. Yan, C., Wang, J., Wu, F.-X.: DWNN-RLS: Regularized least squares method for predicting circRNA-disease associations. BMC Bioinformatics 19(19), 73–81 (2018)
37. Zhang, H.-Y., et al.: iGRLCDA: Identifying circRNA–disease association based on graph representation learning. Briefings Bioinf. 23, bbac083 (2022). https://doi.org/10.1093/bib/bbac083
38. Fan, C., Lei, X., Wu, F.-X.: Prediction of CircRNA-disease associations using KATZ model based on heterogeneous networks. Int. J. Biol. Sci. 14(14), 1950 (2018)
39. Deepthi, K., Jereesh, A.: Inferring potential CircRNA–disease associations via deep autoencoder-based classification. Mol. Diagn. Ther. 25(1), 87–97 (2021)
40. Wang, L., You, Z.-H., Huang, Y.-A., Huang, D.-S., Chan, K.C.: An efficient approach based on multi-sources information to predict circRNA–disease associations using deep convolutional neural network. Bioinformatics 36(13), 4038–4046 (2020)
41. Hartgrink, H.H., Jansen, E.P., van Grieken, N.C., van de Velde, C.J.: Gastric cancer. The Lancet 374(9688), 477–490 (2009)
Anti-breast Cancer Drug Design and ADMET Prediction of ERα Antagonists Based on QSAR Study
Wentao Gao1, Ziyi Huang2, Hao Zhang1,3(B), and Jianfeng Lu1,3
1 CIMS Research Center, Tongji University, Shanghai 201804, China
2 Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
3 Engineering Research Center of Enterprise Digital Technology, Ministry of Education, Shanghai 201804, China
Abstract. The development of breast cancer is closely related to the ERα gene, which has been identified as an important target for the treatment of breast cancer. An effective quantitative structure-activity relationship (QSAR) model of compounds can predict the biological activity of new compounds well and support the research and development of anti-breast cancer drugs. However, screening potential compounds on biological activity alone is not enough; the ADMET properties of drugs also need to be considered. In this paper, based on the existing data set, we perform hierarchical clustering on 729 variables, calculate the Pearson correlation coefficient between them and the pIC50 value of biological activity, and screen out five variables that have a significant impact on biological activity. We perform multiple linear regression on these five molecular descriptors and the biological activity values, and then optimize it with multiple stepwise regression to establish a QSAR model. Furthermore, Fisher discriminant analysis is used to classify and predict the ADMET properties of the new compounds. Both models have good statistical parameters and reliable prediction ability. As a result, we conclude that Oc1ccc(cc1)C2=C(c3ccc(C=O)cc3)c4ccc(F)cc4OCC2 and other compounds not only have high biological activity but also have good ADMET properties, and could be used as potential anti-breast cancer drug compounds. These results provide a theoretical basis for the development and validation of new anti-breast cancer drugs in the future.
Keywords: Pearson correlation coefficient · QSAR · Fisher discriminant · ADMET properties
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 28–40, 2022. https://doi.org/10.1007/978-3-031-13829-4_3
1 Introduction
Breast cancer has become the leading cancer type and disease burden for females around the world, with a higher fatality rate [1]. In recent years, the incidence of breast cancer has gradually increased among female malignancies, seriously threatening women's physical and mental health [2]. The demand for anti-breast cancer drugs has increased rapidly because of the higher incidence of breast cancer, while the tendency towards long-term use of anti-breast cancer drugs has led to their further development. Some studies have recently indicated that ERα (estrogen receptor alpha) plays a very important role in breast growth [2]. Therefore, ERα is regarded as an effective target for the treatment of breast cancer and is used to define the histological subtype and guide treatment options. Compounds that antagonize ERα, such as tamoxifen and raloxifene, could be potential drug candidates for the treatment of breast cancer. However, their effectiveness is limited by acquired resistance [3]. It is necessary to find new compounds that can help identify potential therapeutic targets for breast cancer and promote its treatment. In current pharmaceutical research, the screening of potentially active compounds is usually implemented by building prediction models of the compounds' activities. The quantitative structure–activity relationship (QSAR) model is one of the major computational tools in chemical and pharmaceutical science and has been widely used in recent years with the expansion of data sets [4]. Some previous studies have focused on using combinatorial chemistry to find new compounds [5], but the enormous investments have not resulted in satisfactory quantities and qualities of new pharmaceutical molecular entities. This paper starts from a series of compounds acting on targets relevant to the disease and their biological activity data, and constructs a QSAR model for the compounds with a series of molecular structural descriptors as the independent variables and the biological activity values of the compounds as the dependent variables.
Consequently, the biological activity of new compounds can be predicted, or the structural optimization of existing active compounds can be guided. In addition, this paper considers that, besides good biological activity, a compound must have favorable pharmacokinetic properties and safety in humans to become a viable drug candidate [6]. We select five properties: permeability of intestinal epithelial cells, metabolic stability, cardiotoxicity, human oral bioavailability and genotoxicity, collectively called ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties. In this feature selection study, analyzing the mathematical correlation between the biological activity of the compounds and their molecular descriptors could help researchers find ERα antagonists and develop effective medicines to treat breast cancer.

In this paper, the main contributions are as follows:
1) We use multiple stepwise regression to optimize the original model, making it more interpretable, avoiding over-fitting and reducing model complexity.
2) We perform a secondary screening of anti-breast cancer drug candidates with high biological activity, taking ADMET properties as the optimization targets.
3) We use a half-interval (binary) search to select the threshold of the Fisher discriminant model, which reduces time complexity and improves efficiency.
2 Related Work

The research and development of anti-breast cancer drugs has been an active field for decades and has gradually entered the era of targeted therapy. The current direction is mainly to discover lead compounds from large compound libraries through high-throughput screening against the drug targets of a specific disease, and then to optimize them step by step into innovative drugs [5, 7]. Compared with traditional drug screening methods, high-throughput screening has the advantages of small reaction volume, automation and sensitive detection. Zeng et al. [8] established an anti-enteroviral drug screening model targeting EV71 3C protease, and applied the model to a compound library containing 20,000 small molecules for primary screening and re-screening on a high-throughput drug screening platform. Sagara et al. [9] established a new high-throughput screening platform that quantifies lipid droplets in PSCs, which may be useful for discovering new compounds that attenuate PSC activation. Even so, high-throughput screening still has obvious limitations: the available screening models are limited, and they cannot fully reflect the comprehensive pharmacological effects of drugs. In view of this, we use the quantitative structure-activity relationship (QSAR) to construct a correlation model between chemical structure and biological activity. When the receptor structure is unknown, the QSAR approach is considered an accurate and efficient method for drug design. Huang et al. [10] used a QSAR approach to investigate the potential structural descriptors of the solubilizing addends related to the inhibitory activities on various types of lung cancer cells. Wang et al.
[11] studied the binding mode of 3-phenylsulfonylaminopyridine derivatives to PI3Kα through three-dimensional quantitative structure-activity relationship (3D-QSAR) analysis, molecular docking and molecular dynamics simulation, and designed five new PI3Kα inhibitors with better biological activity. However, the predictive ability of the QSAR method is limited to some extent by the experimental data. Moreover, even if the QSAR method identifies a compound with good biological activity, this alone cannot determine whether the compound can serve as a drug candidate. Accordingly, this paper proposes a QSAR model for the quantitative prediction of compound biological activity and a classification model for ADMET properties, so that drug screening is optimized from the perspective of ADMET properties; this improves both the overall quality of the drug candidates and the success rate of drug development. Tong et al. [6] used the HQSAR and Topomer CoMFA methods to establish QSAR models of 75 1-(3,3-diphenylpropyl)-piperidinyl and urea derivatives, and obtained good statistical parameters and reliable prediction ability. Surflex-Dock and ADMET prediction were then used for molecular docking and for predicting the oral bioavailability and toxicity of the designed drug molecules. Similar to [6], we build a QSAR model, but we then use multiple stepwise regression analysis to optimize the feature selection of the original model, selecting the optimal subset of molecular descriptors and predicting the biological activity of the test-set compounds more accurately. Combined with the ADMET characteristics of the
compounds, the test set is classified by the Fisher discriminant method so as to further identify and screen the compounds. The compounds screened by the model have both good biological activity and good ADMET properties, and can serve as reliable ERα antagonist candidates.
3 Method

In this paper, based on the existing data set, the molecular descriptors of the 1,974 compounds are clustered by R-type hierarchical clustering, and representative molecular descriptors are selected from each class as independent variables. Multiple linear regression is performed on these molecular descriptors and the biological activity of the compounds, and a QSAR model is established to find the target compounds. In QSAR modeling, pIC50 is often used to represent biological activity, and a higher value means better compound potency [12]. Based on the QSAR model, the biological activity values of new compounds with unknown properties are predicted. However, screening compounds with the QSAR model alone is not enough, so the Fisher discriminant method is used to predict the ADMET properties of the new compounds. Considering both biological activity and ADMET properties, we obtain potentially active compounds for the treatment of breast cancer.

3.1 Dataset and Data Processing

The dataset of this paper comes from the Huawei Cup Mathematical Modeling Competition in 2021. The file “ERα_activity” contains the biological activity data of 1,974 compounds against ERα. The file “Molecular_Descriptor.xlsx” contains the values of 729 molecular descriptor variables for the same 1,974 compound samples. The file “ADMET.xlsx” contains the data of five ADMET properties of the compounds. We divide the 1,974 compounds into a training set of 1,500 compounds and a test set of 474 compounds. Both files also contain 50 compounds of unknown properties, which we need to predict. The data are cleaned first: if a molecular descriptor is 0 for all compounds, its column is deleted and the descriptor is not considered in the analysis.
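The cleaning rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `drop_all_zero_columns`, the in-memory list layout and the toy values are hypothetical, and the way the competition files are actually loaded is not shown.

```python
def drop_all_zero_columns(names, rows):
    """Remove descriptor columns whose value is 0 for every compound."""
    keep = [j for j in range(len(names)) if any(row[j] != 0 for row in rows)]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows


# Toy data (hypothetical values): 4 compounds, 3 descriptors; "nB" is all zero.
names = ["nAcid", "nB", "ALogP"]
rows = [[0, 0, 1.2], [1, 0, -0.3], [0, 0, 0.8], [2, 0, 0.1]]

clean_names, clean_rows = drop_all_zero_columns(names, rows)
print(clean_names)  # ['nAcid', 'ALogP']
```

A column is kept as soon as one compound has a non-zero value, so descriptors that are constant but non-zero survive this step and would need separate handling.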
The structural and property parameters of different compounds influence biological activity to different degrees and are measured on different scales. Let the variable matrix be X and the matrix of pIC50 be Y. Among different molecular descriptor variables, the proximity between continuous attributes is usually expressed by the difference between attribute values. Unlike the Euclidean and Minkowski distances, the Pearson correlation coefficient does not depend on the scale of the variables, so it can relate descriptors measured on different scales to ERα biological activity. Therefore, Pearson correlation is used to measure the importance of molecular descriptors of different
compounds on biological activity, and the correlation coefficients between the variables and pIC50 are calculated:

\[
\mathrm{corr}(x, y) = \frac{\frac{1}{n-1}\sum_{k=1}^{n}(x_k-\bar{x})(y_k-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(x_k-\bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(y_k-\bar{y})^2}} \tag{1}
\]
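As a minimal sketch of Eq. (1) and of the ranking by absolute correlation that follows, the code below computes the coefficient for each descriptor column and sorts the descriptors. The function names, the dictionary layout and the toy values are assumptions made for illustration; a constant column is given a coefficient of 0 here to avoid division by zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient as in Eq. (1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    if sx == 0 or sy == 0:  # constant column: correlation undefined, use 0
        return 0.0
    return cov / (sx * sy)


def rank_descriptors(columns, y):
    """Sort descriptors by |corr| with the activity values, largest first."""
    scores = {name: pearson(col, y) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy data (hypothetical): three descriptors and four pIC50 values.
pic50 = [1.0, 2.0, 3.0, 4.0]
columns = {"d1": [2, 4, 6, 8], "d2": [4, 3, 2, 1], "d3": [1, 3, 2, 4]}
ranked = rank_descriptors(columns, pic50)
print([name for name, _ in ranked])  # ['d1', 'd2', 'd3']
```

In the toy data `d1` and `d2` are perfectly correlated and anti-correlated with the activity, while `d3` has a coefficient of 0.8, so it is ranked last.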
Accordingly, the correlation coefficients between all variables and biological activity are derived and ranked. The closer the Pearson correlation coefficient is to 1, the stronger the influence of the variable on biological activity, and the variables with the most significant influence are screened out. A stronger positive correlation indicates that a compound containing the molecular descriptor is more effective in inhibiting the activity of ERα; conversely, descriptors with a strong negative correlation should be avoided as far as possible. We take 20 of these molecular descriptors as examples and draw the heat map in Fig. 1.

Fig. 1. Pearson correlation coefficient heat map (descriptors on both axes: nAcid, ALogP, ALogp2, AMR, apol, naAromAtom, nAromBond, nAtom, nHeavyAtom, nH, nC, nN, nO, nS, nP, nF, nCl, nBr, nI, nX; color scale from -0.45 to 1.00)
Afterwards, the heat map of the correlation coefficient matrix shows the correlation between variables through the depth of color. Taking the absolute value of the correlation coefficient between the biological activity of the 1,974 compounds and each molecular descriptor, and sorting the results, gives the following table:
Table 1. The top 20 molecular descriptors with the most significant impact on bioactivity

Molecular descriptor    Correlation coefficient (absolute value)
MDEC-23                 0.538047798
MLogP                   0.529321142
LipoaffinityIndex       0.491854942
maxsOH                  0.466620545
…                       …
CrippenLogP             0.412300157
maxHsOH                 0.408760763
From Table 1, we find that MDEC-23 determines the biological activity of the compound on ERα more than any other molecular descriptor. It can also be concluded that atom-type electrotopological state descriptors, including LipoaffinityIndex, maxsOH, minsOH, minsssN, SwHBa, maxHsOH and nHaaCH, play a greater role than other molecular descriptors in establishing the QSAR equation of ERα for breast cancer treatment. They provide important implications for the applicability of ERα bioactivity as a therapeutic target for breast cancer. When a compound adequately exhibits the pharmacological information of the above molecular descriptor types, it shows good biological activity against the breast cancer therapeutic target ERα.

3.2 Hierarchical Clustering

If all the highly correlated molecular descriptors are selected when establishing the QSAR model, the model becomes redundant and difficult to fit and analyze. In addition, these molecular descriptors are correlated with one another. Therefore, we adopt R-type clustering to group the variables. Common methods for clustering variables include the shortest distance method, the longest distance method and the class average method [13]. Since the correlation among these molecular descriptors is generally high, we use the longest distance method in this paper and define the distance between two classes of variables as:

\[
R(G_1, G_2) = \max_{x_j \in G_1,\, x_k \in G_2} \{ d_{jk} \} \tag{2}
\]

where $d_{jk} = 1 - |r_{jk}|$ or $d_{jk}^2 = 1 - r_{jk}^2$, and $r_{jk}$ is the correlation coefficient between variables $x_j$ and $x_k$. Thus $R(G_1, G_2)$ is determined by the two variables with the least similarity in the two classes.
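The longest distance rule of Eq. (2) can be sketched as a small agglomerative procedure on the distance d_jk = 1 - |r_jk|. This is an illustrative, unoptimized version; the function name `complete_linkage`, the stopping rule (a fixed number of classes) and the toy correlation matrix are assumptions, not details given in the paper.

```python
def complete_linkage(dist, n_clusters):
    """Agglomerative clustering of variables with the longest distance rule.

    dist: symmetric matrix (list of lists) of pairwise distances d_jk.
    Returns a list of classes, each a sorted list of variable indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def class_distance(g1, g2):
        # Eq. (2): the largest pairwise distance between members of g1 and g2.
        return max(dist[j][k] for j in g1 for k in g2)

    while len(clusters) > n_clusters:
        # Merge the two classes whose class distance is smallest.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = class_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy correlation matrix (hypothetical): variables 0,1 and 2,3 are strongly related.
r = [
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, -0.85],
    [0.20, 0.10, -0.85, 1.00],
]
dist = [[1 - abs(v) for v in row] for row in r]
clusters = complete_linkage(dist, n_clusters=2)
print(clusters)  # [[0, 1], [2, 3]]
```

Because the distance uses the absolute value of the correlation, the strongly negatively correlated pair (2, 3) ends up in the same class, which matches the aim of grouping redundant descriptors.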
3.3 Model Building

After this processing, the most representative variable is extracted from each class by considering the properties of each molecular descriptor. Multiple linear regression is then performed on the five extracted molecular descriptors, which can
reduce the complexity of the model and avoid over-fitting and the difficulty of system analysis caused by too many independent variables. According to the clustering results, the values of minsssN, MDEC-23, BCutP-1H, maxsOH and Hmin are selected as the inputs, and the pIC50 values of the compounds as the output. The multiple linear regression model is:

\[
Y = \mu_0 + \mu_1 X_1 + \mu_2 X_2 + \mu_3 X_3 + \mu_4 X_4 + \mu_5 X_5 \tag{3}
\]
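A least-squares fit of a model of the form of Eq. (3), together with R², the F statistic, the MSE and the maximum deviation, can be sketched with NumPy as follows. The data are synthetic, the coefficient values are hypothetical, and the maximum deviation is taken here to be the largest absolute prediction error, which is an assumption about how P is defined. The p-value belonging to F is omitted because it requires an F-distribution routine.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y = mu0 + mu1*x1 + ...; returns coefficients, R^2 and F."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # prepend the intercept column
    mu, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ mu
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    f_stat = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))
    return mu, r2, f_stat


def evaluate(mu, X, y):
    """Mean squared error and maximum absolute deviation on held-out data."""
    pred = np.column_stack([np.ones(len(X)), X]) @ mu
    err = y - pred
    return float(np.mean(err ** 2)), float(np.max(np.abs(err)))


# Synthetic data (hypothetical coefficients), 150 training and 50 test samples.
rng = np.random.default_rng(0)
true_mu = np.array([2.0, 0.5, -0.3, 0.8, 0.1, -0.6])
X = rng.normal(size=(200, 5))
y = true_mu[0] + X @ true_mu[1:] + rng.normal(scale=0.1, size=200)

mu, r2, f_stat = fit_mlr(X[:150], y[:150])
mse, max_dev = evaluate(mu, X[150:], y[150:])
print(np.round(mu, 2), round(r2, 3), round(mse, 4), round(max_dev, 3))
```

Fitting on one part of the data and reporting MSE and maximum deviation on the rest mirrors the training and test roles of the 1,500 and 474 compounds.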
The multiple linear regression function is trained on the 1,500 training samples. To assess the stability of the constructed model [14], three statistics are used to test the significance of the regression model: the coefficient of determination R², the F value, and the probability p corresponding to the F value. After passing the significance test, the QSAR model is evaluated on the 474 test samples. The reliability of the model is determined by comparing the predicted values with the true values, using the mean squared error (MSE) and the maximum deviation P. The MSE describes the deviation between the real data and the predicted data, while the maximum deviation P reflects the worst prediction and indicates whether the results contain anomalies or outliers. The smaller the MSE and the maximum deviation, the better the model.

3.4 Multiple Stepwise Regression

Because R-type clustering has limitations and not every variable has a significant effect on the model results, we introduce stepwise regression analysis. Multiple stepwise regression is a regression method that screens variables: it builds a regression model from a set of candidate variables, automatically identifying influential variables and eliminating insignificant ones. In this case, the idea is as follows:
(1) The five molecular descriptors (independent variables) are introduced into the regression equation one at a time, in order of the significance of their effect on the dependent variable, biological activity.
(2) When an independent variable already in the equation becomes insignificant because a later variable has been introduced, it is eliminated.
(3) Introducing an independent variable into the regression equation, or eliminating one from it, counts as one step of the stepwise regression.
(4) A significance test is carried out at each step, so that before each new significant variable is introduced the regression equation contains only variables that have a significant effect on biological activity.
(5) The process is repeated until no insignificant variable remains to be removed from the regression equation and no significant variable remains to be introduced.
Stepwise regression charts are then generated from the stepwise regression analysis. The five molecular descriptors are input into the regression equation in turn,
and an F-test is performed after each variable is introduced; all molecular descriptors already in the equation are tested one by one at the significance level α = 0.05. When a previously introduced molecular descriptor becomes insignificant (p > 0.05) because of a later one, it is removed, and so on. The analysis shows that the independent variable Hmin has a high p-value and is not statistically significant. After it is removed, the new regression model is:

\[
Y = \mu_0 + \mu_1 X_1 + \mu_2 X_2 + \mu_3 X_3 + \mu_4 X_4 \tag{4}
\]
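The stepwise procedure can be sketched with partial F-tests as below. This is a simplified illustration rather than the exact procedure used for the results: instead of p-values it uses fixed F-to-enter and F-to-remove thresholds of 3.84, which approximate the 5% critical value of an F(1, ν) distribution for large ν, and the data, function names and thresholds are all assumptions.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the chosen columns plus intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)


def partial_f(X, y, cols, j):
    """F statistic of variable j given the other variables in cols (j must be in cols)."""
    full = rss(X, y, cols)
    reduced = rss(X, y, [c for c in cols if c != j])
    dof = len(y) - len(cols) - 1
    return (reduced - full) / (full / dof)


def stepwise(X, y, f_enter=3.84, f_remove=3.84, max_steps=100):
    """Forward selection with backward elimination after every step."""
    selected = []
    for _ in range(max_steps):
        changed = False
        # Forward step: introduce the most significant remaining variable.
        rest = [j for j in range(X.shape[1]) if j not in selected]
        if rest:
            f_vals = {j: partial_f(X, y, selected + [j], j) for j in rest}
            best = max(f_vals, key=f_vals.get)
            if f_vals[best] > f_enter:
                selected.append(best)
                changed = True
        # Backward step: eliminate a variable that has become insignificant.
        if selected:
            f_vals = {j: partial_f(X, y, selected, j) for j in selected}
            worst = min(f_vals, key=f_vals.get)
            if f_vals[worst] < f_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return sorted(selected)


# Synthetic data (hypothetical): columns 0-3 drive y, column 4 is unrelated noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 1.0 + X[:, 0] + X[:, 1] + X[:, 2] + X[:, 3] + rng.normal(scale=0.5, size=300)
selected = stepwise(X, y)
print(selected)
```

The four informative columns are always retained in this toy setting; the unrelated column is usually left out, playing the role that Hmin plays in the text, although a pure-noise variable can still pass a 5% test by chance.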
3.5 Fisher Discrimination

In actual drug development, it is necessary to evaluate the safety and pharmacokinetics of drugs. The properties of compounds (taking small intestinal epithelial cell permeability as an example) are classified according to the molecular descriptor information. We use the Fisher discriminant, which projects data that cannot easily be classified onto a direction along which the projected classes are separated from each other. For this problem, let the two populations be X1 and X2, and assume that their second-order moments exist. The idea of the Fisher discriminant criterion is to reduce the multivariate observation X to a univariate observation Y, that is, to find an optimal projection direction along which the values y generated by the populations X1 and X2 are separated as much as possible. The Fisher discriminant rule is:

\[
\begin{cases} x \in X_1, & \text{when } W(x) \ge 0 \\ x \in X_2, & \text{when } W(x) < 0 \end{cases} \tag{5}
\]

The permeability of small intestinal epithelial cells is divided into good and bad, marked as 0 and 1 respectively. Taking 1,500 compounds with known ADMET properties as learning samples, the Fisher discriminant function is obtained through training in the spirit of pattern recognition, and a reasonable threshold value is set, i.e. the optimal direction is found to give the best separation. The process is shown in Fig. 2.
Fig. 2. Fisher discriminant model flowchart
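A two-class Fisher discriminant of the kind described in this section can be sketched with NumPy. The sketch uses the textbook direction w = Sw⁻¹(m1 − m2) and, as a simplifying assumption, places the decision boundary at the midpoint of the two class means; the paper instead tunes the threshold separately. The function names and the synthetic two-dimensional data are hypothetical.

```python
import numpy as np


def fisher_direction(X1, X2):
    """Projection direction w = Sw^{-1} (m1 - m2) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix, summed over both classes.
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(Sw, m1 - m2)
    return w, m1, m2


def W(x, w, m1, m2):
    """Discriminant function of Eq. (5): >= 0 assigns x to X1, < 0 to X2."""
    return float(w @ (x - 0.5 * (m1 + m2)))


# Synthetic two-class data (hypothetical), 50 samples per class.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 2.0], size=(50, 2))
X2 = rng.normal(loc=[-2.0, -2.0], size=(50, 2))

w, m1, m2 = fisher_direction(X1, X2)
predicted_first = [W(x, w, m1, m2) >= 0 for x in X1]
print(sum(predicted_first), "of", len(X1), "class-1 samples assigned to X1")
```

Projecting every sample with `w @ x` gives the one-dimensional scores on which a threshold can then be tuned.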
After establishing the model, the next step is to select a suitable threshold value for discriminant classification. Subject to obtaining the best classification of small intestinal epithelial cell permeability, a threshold is sought that maximally reflects the differences between individuals of different classes. Here, a half-interval search is adopted to determine the threshold value, and the steps are as follows:
Step 1: Set the search interval. Find the maximum value ymax and the minimum value ymin of the regression points of the 1,500 samples; the search interval is then [ymin, ymax].
Step 2: Take the middle value ymid of the search interval, and calculate the accuracy p when the threshold is set to this value.
Step 3: Take the middle values ymid1 and ymid2 of [ymin, ymid] and [ymid, ymax] respectively, and calculate the corresponding accuracies p1 and p2.
Step 4: Compare p, p1 and p2. If p and p2 are the larger values, the half-interval search continues on the interval [ymid, ymid2]; symmetrically, if p and p1 are the larger values, it continues on [ymid1, ymid].
Step 5: Repeat the process until the optimal threshold yc is found.
After the half-interval search, the threshold is set to 1.38. If the discriminant value of a compound is greater than 1.38, the compound is judged to have good small intestinal epithelial cell permeability; otherwise, it is judged to have poor permeability. The discriminant equations and thresholds of the other properties are obtained in the same way.
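Steps 1 to 5 can be sketched as follows. The stopping tolerance, the tie-breaking rule (the right-hand interval is chosen when p1 and p2 are equal), the bookkeeping of the best threshold seen so far and the toy scores are all assumptions added to make the sketch runnable; samples with a score above the threshold are counted as class 1.

```python
def accuracy(scores, labels, threshold):
    """Fraction of samples classified correctly when score > threshold means class 1."""
    hits = sum((s > threshold) == bool(l) for s, l in zip(scores, labels))
    return hits / len(scores)


def half_interval_search(scores, labels, tol=1e-3):
    """Half-interval search for a threshold with high classification accuracy."""
    lo, hi = min(scores), max(scores)                 # Step 1: search interval
    best_t = (lo + hi) / 2
    best_p = accuracy(scores, labels, best_t)
    while hi - lo > tol:
        mid = (lo + hi) / 2                           # Step 2
        mid1, mid2 = (lo + mid) / 2, (mid + hi) / 2   # Step 3
        p = accuracy(scores, labels, mid)
        p1 = accuracy(scores, labels, mid1)
        p2 = accuracy(scores, labels, mid2)
        for t, acc in ((mid, p), (mid1, p1), (mid2, p2)):
            if acc > best_p:                          # remember the best threshold seen
                best_t, best_p = t, acc
        # Step 4: continue on the side whose quarter point scored better.
        if p2 >= p1:
            lo, hi = mid, mid2
        else:
            lo, hi = mid1, mid
    return best_t, best_p                             # Step 5: search finished


# Toy projected scores (hypothetical): class 0 lies low, class 1 lies high.
scores = [0, 1, 2, 3, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, acc = half_interval_search(scores, labels)
print(threshold, acc)  # 4.5 1.0
```

Each pass shrinks the interval to a quarter of its length, so only a few accuracy evaluations are needed. Accuracy is not guaranteed to have a single peak as a function of the threshold, so the search should be read as a heuristic rather than an exhaustive optimum.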
4 Experimental Results

4.1 MLR Results

For the data set used in this paper, the constant term μ0 of the linear regression equation is 2.4371; the regression coefficients of minsssN, MDEC-23, BCutP-1H and maxsOH are positive, and the regression coefficient of Hmin is negative, which is consistent with the correlation coefficient matrix. The statistics of the regression model are: coefficient of determination R² = 0.45, F statistic 322.2681, p-value corresponding to F of 1.6830 × 10^-252, and error variance 1.1162. Because p is less than 0.05 and all regression coefficients are within their confidence intervals, the multiple linear regression equation is statistically significant. The residual plots are shown in Fig. 3.
Fig. 3. Residual plots between true and predicted values for the QSAR model. Panel (a) is a residual case order plot (residuals against case number).
From the figure, it is easy to see that the residuals between the true values and the fitted values of the regression are mostly concentrated between -2 and 2, and only a very small number of points deviate greatly from the regression model. The maximum deviation P is 3.8690. We then process these outliers to obtain a more reliable model. Finally, the regression model is applied to predict the biological activity of the 50 compounds by multiple linear regression.

4.2 Results of Stepwise Regression

We now analyze the optimized model. The parameters of the regression equation Y = μ0 + μ1 X1 + μ2 X2 + μ3 X3 + μ4 X4 are μ0 = 2.3322, μ1 = 0.0397, μ2 = 0.1172, μ3 = 0.2400, μ4 = 0.1754. The four regression coefficients are all positive, which is consistent with the positive correlation between the four molecular descriptors and the biological activity values. The test statistics of the model are: coefficient of determination R² = 0.45, F statistic 401.8485, p-value corresponding to F of 2.9162 × 10^-252, and error variance 1.1171. The coefficient of determination, the error variance and the p-value have changed little from the original model, whereas the F value has increased markedly from 322.2681 to 401.8485. Meanwhile, the maximum deviation P decreases to 3.7792, so we conclude that the model is more significant and more accurate after the variable Hmin is removed. Figure 4 shows the true and predicted biological activity values of the 1,974 compounds. Most of the samples are concentrated around the line y = x, so the predicted pIC50 values of the compounds are close to the experimental values. The optimized QSAR model therefore requires less computation and also has better prediction ability.
Based on this model, the biological activities of the 50 compounds are predicted and ranked, and we conclude that OC(=O)\C=C\c1ccc(cc1)C2=C(CCSc3cc(F)ccc23)c4ccc(O)cc4 could be considered a potential drug candidate.
Fig. 4. Plots of the true versus predicted pIC50 values for the training and test set
4.3 Optimization of Candidate Compounds Based on Fisher Discriminant

The Fisher discriminant is used to predict the ADMET properties of compounds. After obtaining the discriminant function, we calculate the difference between the observed values and the values predicted by the model and plot the corresponding residuals in Fig. 5. The residual distribution in this plot is reasonable, which indicates that the model is usable.

Fig. 5. Residual plots between true and predicted values for ADMET properties. Panel (a) is a residual case order plot (residuals against case number).
At the same time, we obtain the receiver operating characteristic (ROC) curve of the classification model, as shown in Fig. 6. The 474 test samples are then used to evaluate the accuracy of the discriminant function, which reaches 83.54%.
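For readers who want to reproduce this kind of evaluation, an ROC curve and its area can be computed from discriminant scores as sketched below. The scores and labels are hypothetical toy values, not the paper's data, and the function names are illustrative.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold over all observed scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        pts.append((fp / neg, tp / pos))
    return pts


def auc(pts):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Toy scores and true classes (hypothetical).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
print(round(area, 4))  # 0.8889
```

In the toy data, eight of the nine positive-negative pairs are ordered correctly, which gives an area of 8/9.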
Fig. 6. ROC curve of ADMET properties based on Fisher discriminant
According to the established QSAR model, the biological activity values of new compounds can be predicted. Although the prediction results have a high degree of confidence, biological activity is not the only criterion for drug selection in actual medicinal practice, and the ADMET properties of the compound also need to be considered. Therefore, taking into account the ADMET properties predicted by the Fisher discriminant model, the final compounds suitable for medicinal use are obtained as follows: Oc1ccc(cc1)C2=C(c3ccc(C=O)cc3)c4ccc(F)cc4OCC2, CCCC(CCC)(c1ccc(O)c(C)c1)c2cccs2, etc.
5 Conclusion

In this study, 729 molecular descriptors are screened, and the original model is improved by stepwise regression. Finally, four molecular descriptors are selected to establish a QSAR model related to compound biological activity values. According to this model, compounds with high biological activity are predicted, and the ADMET properties of compounds are taken as the optimization target in combination with the actual drug development situation. The Fisher discriminant is introduced to predict the ADMET properties of compounds. In the end, considering the above two points, compounds with high biological activity that meet the drug requirements are successfully predicted, which provides a reference for the research and development of anti-breast cancer drugs.

Acknowledgments. Research work in this paper is supported by the National Natural Science Foundation of China (Grant No. 72171173) and Shanghai Science and Technology Innovation Action Plan (No. 19DZ1206800).
References

1. Chou, J., Shen, R., Zhou, H.B., et al.: OBHS impairs the viability of breast cancer via decreasing ERα and Atg13. Biochem. Biophys. Res. Commun. 573, 69–75 (2021)
2. Yu, J., Li, F., Li, Y., et al.: The effects of hsa_circ_0000517/miR-326 axis on the progression of breast cancer cells and the prediction of miR-326 downstream targets in breast cancer. Pathol.-Res. Pract. 227, 153638 (2021)
3. Munne, P.M., Martikainen, L., Räty, I., et al.: Compressive stress-mediated p38 activation required for ERα+ phenotype in breast cancer. Nat. Commun. 12(1), 1–17 (2021)
4. Cherkasov, A., Muratov, E.N., Fourches, D., et al.: QSAR modeling: Where have you been? Where are you going to? J. Med. Chem. 57(12), 4977–5010 (2014)
5. Sun, J., Wang, Y.J., He, Z.G.: Biodistribution chromatography: High-throughput screening in drug membrane permeability and activity. Prog. Chem. 18(0708), 1002 (2006)
6. Tong, J.B., Zhang, X., Bian, S., Luo, D.: Drug design, molecular docking, and ADMET prediction of CCR5 inhibitors based on QSAR study. Chin. J. Struct. Chem. 41(02), 1–13 (2022)
7. Ran, B.B., et al.: Research progress on tumor protein biomarkers using high-throughput proteomics based on mass spectrometry. Chin. J. Clin. Oncol. 47(08), 411–417 (2020)
8. Zeng, S.N., Li, Q.W., Pan, T., et al.: Establishment and application of high-throughput screening model for antiviral agents targeting EV71 3C (pro). Prog. Biochem. Biophys. 44(9), 776–782 (2017)
9. Sagara, A., Nakata, K., Yamashita, T., et al.: New high-throughput screening detects compounds that suppress pancreatic stellate cell activation and attenuate pancreatic cancer growth. Pancreatology 21(6), 1071–1080 (2021)
10. Huang, H.J., Chetyrkina, M., Wong, C.W., et al.: Identification of potential descriptors of water-soluble fullerene derivatives responsible for antitumor effects on lung cancer cells via QSAR analysis. Comput. Struct. Biotechnol. J. 19, 812–825 (2021)
11. Wang, X.C., Yang, M.C., Zhang, M.X., et al.: 3D-QSAR, molecular docking and molecular dynamics simulations of 3-Phenylsulfonylaminopyridine derivatives as novel PI3Kα inhibitors. Chin. J. Struct. Chem. 40(12), 1567–1585 (2021)
12. Zekri, A., Harkati, D., Kenouche, S., et al.: QSAR modeling, docking, ADME and reactivity of indazole derivatives as antagonizes of estrogen receptor alpha (ER-α) positive in breast cancer. J. Mol. Struct. 1217, 128442 (2020)
13. He, L., Jurs, P.C.: Assessing the reliability of a QSAR model’s predictions. J. Mol. Graph. Model. 23(6), 503–523 (2005)
14. Putri, D.E.K., Pranowo, H.D., Wijaya, A.R., et al.: The predicted models of anti-colon cancer and anti-hepatoma activities of substituted 4-anilino coumarin derivatives using quantitative structure-activity relationship (QSAR). J. King Saud Univ.-Sci. 34(3), 101837 (2022)
Real-Time Optimal Scheduling of Large-Scale Electric Vehicles Based on Non-cooperative Game

Rong Zeng, Hao Zhang(B), Jianfeng Lu, Tiaojuan Han, and Haitong Guo
CIMS Research Center, Tongji University, Shanghai 201804, China [email protected]
Abstract. This paper uses a non-cooperative game to solve the optimal scheduling problem of charging and discharging large-scale electric vehicles that support V2G (Vehicle to Grid) in a microgrid. Firstly, the price calculation model of the new energy microgrid and the charging and discharging process model of electric vehicles are constructed. Then, the objective function of minimizing the charging cost of electric vehicles is proposed. After that, since each electric vehicle aims to minimize its own charging cost during charging, the scheduling process is modeled as a non-cooperative game between electric vehicles, and this game has a unique Nash equilibrium. A Nash equilibrium solution method based on the broadcast program is designed in this paper. Finally, simulation shows that each electric vehicle constantly adjusts its charging strategy to minimize its charging cost during the game. The charging and discharging strategy of the electric vehicle population reaches a stable Nash equilibrium, and the optimization goal is achieved. The proposed algorithm can also reduce the total imported electricity and the comprehensive operating costs of the microgrid, taking into account the interests of both electric vehicles and the microgrid.

Keywords: Optimal scheduling · Non-cooperative game · Microgrid · Electric vehicle
1 Introduction

A microgrid integrates all kinds of distributed power sources, loads, energy storage devices and control devices into a miniature energy supply system that can switch smoothly between grid-connected and off-grid modes. As a bridge between distributed power sources and the large power grid, a microgrid can effectively improve the utilization rate of distributed power [1]. A new energy microgrid includes many new energy power sources, such as solar and wind power, to provide low-carbon electricity. However, the generating capacity of new energy sources is affected by environment and climate and is intermittent and random to some degree, which adds complexity to the operation management of the microgrid. Therefore, for a new energy microgrid, energy storage equipment is a key factor in ensuring stable operation.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 41–52, 2022. https://doi.org/10.1007/978-3-031-13829-4_4
As a new demand for green consumption, the electric vehicle (EV) has the major advantage of making travel more economical and low-carbon, and it has been promoted and popularized rapidly in recent years. In addition, with the help of Vehicle to Grid (V2G) technology, an EV can be regarded as distributed energy storage equipment with spatio-temporal characteristics [2]. By controlling their charging and discharging behavior, EVs participate in the dispatching operation of the power grid [3] and improve the stability and economy of power grid operation [4, 5]. Because EV charging behavior is random and intermittent, without a reasonable distribution and guidance strategy the disorderly charging and discharging of a large number of EVs will cause load growth. Charging at peak load will further widen the peak-valley difference of the system, increase the power supply pressure on the power grid, and weaken the ability of the smart grid to serve EVs [6]. Therefore, orderly scheduling of the charging and discharging of electric vehicles is needed. At present, many scholars have studied the scheduling problem of electric vehicles in microgrids. Tookanlou et al. [7] proposed a control protocol to coordinate the day-ahead charging plan of electric vehicles to shift loads and achieve valley filling. Amamra et al. [8] proposed an optimal advance scheduling strategy that reduces the energy bill and mitigates the peak load constraint of the microgrid through the lithium-ion batteries of electric vehicles. However, day-ahead scheduling is not flexible enough and its resistance to disturbance is poor, so the scheduling result may contain a large error. Du et al. [9] proposed an intelligent multi-microgrid energy management method based on a deep neural network (DNN) and reinforcement learning to achieve optimal scheduling of electric vehicles in the microgrid. Li et al.
[10] defined the charge-discharge scheduling problem of electric vehicles as a constrained Markov decision process (CMDP) and found a charging and discharging scheduling strategy that minimizes the charging cost based on deep reinforcement learning. However, such methods must interact with the environment many times to obtain feedback for updating the model. The state space here is large, which leads to high computational cost, long training time and low learning efficiency, and the “curse of dimensionality” is prone to occur [11]. Ma et al. [12] applied a non-cooperative game to study the charging strategy of electric vehicles and proposed a distributed broadcast program algorithm to solve this problem. However, the discharging process of EVs is not considered. This paper analyzes the charging and discharging scheduling of electric vehicles to balance the power supply and demand of the microgrid. On the basis of minimizing the charging cost of EVs, the total electricity purchased by the microgrid from the large grid and the comprehensive operation cost are reduced. Following [12], the paper introduces a non-cooperative game process into the scheduling problem and uses a “broadcast” method to solve for the Nash equilibrium of the game. When the game reaches the Nash equilibrium, the optimal scheduling problem is solved. Unlike [12], this paper considers the discharging behavior of electric vehicles, as well as the influence of their charging and discharging behavior on the pricing strategy of the microgrid. The remainder of the paper is organized as follows: Sect. 2 establishes the pricing model of the microgrid and the mathematical model of the charging and discharging behavior of electric vehicles. In Sect. 3, the optimization objective function is proposed. Section 4
introduces the non-cooperative game process in scheduling problem and the process of solving Nash equilibrium using broadcast program algorithm. Simulation experiments are provided in Sect. 5. Section 6 summarizes the full text.
2 Mathematical Models of New Energy Microgrid and Electric Vehicle Charging and Discharging Behavior

In this paper, a new energy microgrid with large-scale bidirectional charging piles is considered. The electricity price of the microgrid is affected by the electricity obtained from the new energy generation side, the quantity of electricity charged and discharged by electric vehicles, and the electricity imported from the large grid or other microgrids. In this scenario, the charging and discharging behavior of electric vehicles in the microgrid affects the electricity price of the microgrid, and the electricity price in turn changes the charging and discharging strategy of the electric vehicles. The microgrid adopts a time-of-use electricity price, which is adjusted every hour. The selling price function of the microgrid and the model of the charging and discharging behavior of electric vehicles are described below.

2.1 The Price Function of Selling Electricity of New Energy Microgrid

The microgrid control center is the guarantee of the safe and economic operation of the microgrid. It is responsible for the energy scheduling of the entire microgrid area and tries to balance the supply and demand of the microgrid. When the new energy power generation in the microgrid cannot meet the total load in the region, the microgrid control center purchases the required electricity from the large grid. The electricity price of the microgrid increases with the electricity imported from the large grid or other microgrids. In our model, the control center of the microgrid mainly adjusts the electricity price per hour.
The price function $\rho(h)$ is given as follows:

$$\rho(h) = \rho\left(\frac{1}{d_e}\left(\eta_b \cdot d_b + \eta_d \cdot d_d + \eta_i \cdot d_i\right)\right) \times \omega_{rf\_price} \tag{1}$$

where $d_b$ is the basic electricity load at the h-th moment, $d_e$ is the electric quantity obtained from the new energy generation side, $d_d$ is the electric quantity charged and discharged by EVs, $d_i$ is the electricity imported from large grids or other microgrids, and $\omega_{rf\_price}$ is the reference electricity price, which is constant [13]. $\eta_b$, $\eta_d$ and $\eta_i$ are the weight coefficients of the three kinds of electricity. When the weight coefficients are not properly selected, the game has difficulty reaching the equilibrium point. So that the game can reach a stable equilibrium point, this paper sets $\eta_b = \eta_d = \eta_i = 1$.
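2.2 Modeling of Electric Vehicle Charging and Discharging Behavior

The mathematical model of the charging and discharging behavior of electric vehicles is given in Eqs. (2)–(5):

$$
\begin{aligned}
J_m(\hat{p}) &= \sum_{h=1}^{H}\left\{\rho(h)\,\hat{p}_m^h \cdot \lambda_m^h + \delta\left(\hat{p}_m^h \cdot \lambda_m^h - \mathrm{avg}(\hat{p}^h)\right)^2\right\} \\
&= \sum_{h=1}^{H}\left\{\rho\left(\frac{1}{d_e}\left(\eta_b \cdot d_b + \eta_d \cdot \sum_{m=1}^{M}\hat{p}_m^h \cdot \lambda_m^h + \eta_i \cdot d_i\right)\right) \times \omega_{rf\_price}\,\hat{p}_m^h \cdot \lambda_m^h + \delta\left(\hat{p}_m^h \cdot \lambda_m^h - \mathrm{avg}(\hat{p}^h)\right)^2\right\}
\end{aligned}
\tag{2}
$$

$$\hat{p}_m^h = p_m^h \cdot s_m^h \tag{3}$$

$$\mathrm{avg}(\hat{p}^h) = \frac{1}{M}\sum_{m=1}^{M}\hat{p}_m^h \cdot \lambda_m^h \tag{4}$$

$$\bar{p}^h = \lim_{M\to\infty}\mathrm{avg}(\hat{p}^h) = \lim_{M\to\infty}\frac{1}{M}\sum_{m=1}^{M}\hat{p}_m^h \cdot \lambda_m^h \tag{5}$$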
Equation (2) is the charging cost $J_m$ of the m-th EV. $s_m^h$ is the charging strategy of the m-th electric vehicle at the h-th moment, taking the value 1 (charge), 0 (remain idle) or –1 (discharge). $p_m^h$ is the charge and discharge quantity of the m-th electric vehicle at the h-th moment. $\lambda_m^h$ indicates whether the m-th electric vehicle is in the network at the h-th moment: $\lambda_m^h = 1$ means the electric vehicle is connected to the microgrid, and $\lambda_m^h = 0$ means the electric vehicle is in the running state. $\hat{p}^h$ is the charge and discharge control strategy of the EV population at the h-th moment, and $\delta$ represents the penalty on deviation from the average strategy trajectory of the population; in the experiment its value is 0.0125 [16]. Equation (4) is the average charge and discharge strategy of the EV population at the h-th moment, and Eq. (5) is the average charge-discharge strategy trajectory of the whole EV population. When the population of electric vehicles is infinite, the following limits hold [14]:

$$
\begin{cases}
\lim\limits_{M\to\infty}\dfrac{\eta_b \cdot d_b + \eta_i \cdot d_i}{M} = d_h \\[2ex]
\lim\limits_{M\to\infty}\dfrac{d_e}{M} = e
\end{cases}
\tag{6}
$$
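In this limit, the charging cost in Eq. (2) becomes

$$J_m(\hat{p}) = \sum_{h=1}^{H}\left\{\rho\left(\frac{1}{e}\left(d_h + \eta_d \cdot \bar{p}^h\right)\right) \times \omega_{rf\_price}\,\hat{p}_m^h \cdot \lambda_m^h + \delta\left(\hat{p}_m^h \cdot \lambda_m^h - \bar{p}^h\right)^2\right\} \tag{7}$$

$$\bar{p}^h = \lim_{M\to\infty}\mathrm{avg}(\hat{p}^h) = \lim_{M\to\infty}\frac{1}{M}\sum_{m=1}^{M}\hat{p}_m^h \cdot \lambda_m^h \tag{8}$$

where $\bar{p}^h$ is the average charge-discharge strategy trajectory of the whole EV population.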
3 Optimization Objective

This paper studies the optimal scheduling problem. The decision makers are the electric vehicle users, and the decision interval is one hour. The goal of the interaction is the lowest electricity cost for the EV users, expressed as:

$$J_{cost} = \sum_{h=1}^{H}\rho(h) \cdot EV_m^h \tag{9}$$

$$\text{s.t.}\quad EV_m^{\min} \le EV_m^{a} + \sum_{t=0}^{h} p_m^t \cdot s_m^t \cdot \lambda_m^t \le EV_m^{\max}, \quad \forall h \in H \tag{10}$$

$$EV_m^{\min} = 10\%\,Cap_m, \quad EV_m^{\max} = Cap_m \tag{11}$$

$$EV_m^{a} \ge EV_m^{\min} \tag{12}$$

$$EV_m^{T} \le EV_m^{a} + EV_m \le EV_m^{\max} \tag{13}$$

$$80\%\,EV_m^{\max} \le EV_m^{T} \le 100\%\,EV_m^{\max} \tag{14}$$

where $EV_m^{\min}$ is the minimum state of charge of the EV, $EV_m^{\max}$ is the maximum state of charge of the EV, $EV_m^{a}$ is the initial charge of the m-th EV on arrival, $EV_m^{T}$ is the target energy level of the battery of the m-th EV when it leaves, and $Cap_m$ is the battery capacity. Equation (10) is the constraint on the energy level of the EV at any time, Eq. (11) sets the upper and lower limits of the EV state of charge, Eq. (12) is the constraint on the energy level of the EV at arrival, and Eqs. (13) and (14) are constraints on the target energy level of the EV.
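The constraints in Eqs. (10)–(14) can be checked mechanically for a candidate schedule. The following is a minimal sketch, not the authors' implementation: the function names, argument names and example numbers are hypothetical, and energy is assumed to be expressed in the same unit (for example kWh) throughout.

```python
from itertools import accumulate


def energy_trajectory(ev_arrival, power, strategy, connected):
    """Energy level after each time step: arrival level plus the running sum of p * s * lambda."""
    steps = (p * s * c for p, s, c in zip(power, strategy, connected))
    return [ev_arrival + total for total in accumulate(steps)]


def is_feasible(ev_arrival, power, strategy, connected, capacity, ev_target):
    """Check a schedule against the constraints of Eqs. (10)-(14) (hypothetical helper)."""
    ev_min, ev_max = 0.10 * capacity, capacity            # Eq. (11)
    if ev_arrival < ev_min:                               # Eq. (12)
        return False
    if not (0.8 * ev_max <= ev_target <= ev_max):         # Eq. (14)
        return False
    trajectory = energy_trajectory(ev_arrival, power, strategy, connected)
    if any(level < ev_min or level > ev_max for level in trajectory):  # Eq. (10)
        return False
    return trajectory[-1] >= ev_target                    # Eq. (13): target reached at departure


# Hypothetical example: 50 kWh battery, 20 kWh on arrival, 10 kWh per step, target 45 kWh.
power = [10, 10, 10, 10]
connected = [1, 1, 1, 1]
print(is_feasible(20, power, [1, 1, 1, 0], connected, capacity=50, ev_target=45))    # charge three steps
print(is_feasible(20, power, [-1, -1, 0, 0], connected, capacity=50, ev_target=45))  # discharges below 10%
```

The first schedule stays within the bounds and reaches the target; the second drops below the 10% floor and is rejected.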
4 Decentralized Electric Vehicle Control Method Based on Non-cooperative Game

In the charging and discharging process, the behavior of the EVs changes the pricing strategy of the microgrid, and the pricing strategy in turn changes the EVs' decisions to charge and discharge. EVs compete with each other during this process; therefore, the real-time optimal scheduling model proposed in this paper contains a non-cooperative game.

4.1 Non-cooperative Game Model

According to non-cooperative game theory, every EV is regarded as a participant in the game. Each EV is assumed to remain individually rational and to always aim at minimizing its own electricity purchase cost during the game. Therefore, the interaction between electric vehicles can be described as the following non-cooperative game model.
1) Players: the electric vehicles in the microgrid scenario.
2) Strategies: $s_m^h$ is the charging strategy of electric vehicle m at the h-th moment, taking the value 1 (charge), 0 (remain idle) or –1 (discharge). $\hat{p}_m$ represents the strategy set adopted by the m-th electric vehicle, and $\hat{p}_{-m}$ represents the charge and discharge strategy set of the remaining EV population except for the m-th EV.
3) Payoffs: the cost of an electric vehicle is given by Eq. (2). According to the electricity price information of the microgrid, each EV adjusts its charging and discharging plan to maximize its own interests.

It is assumed that the optimal response strategy of the m-th vehicle to the remaining population of electric vehicles is $\hat{p}_m^{*}(\hat{p}_{-m})$:

$$
\begin{aligned}
\hat{p}_m^{*}(\hat{p}_{-m}) &= \arg\min_{\hat{p}_m} J_m(\hat{p}_m; \hat{p}_{-m}) \\
\hat{p}_{-m} &= \{\hat{p}_k;\ k \in M,\ k \ne m\}
\end{aligned}
\tag{15}
$$

According to the definition of Nash equilibrium in game theory, there is the following theorem:

Theorem 4.1. If no EV can obtain benefits by unilaterally changing its own strategy, then the charge and discharge control set of the EV population $\{\hat{p}_m^{*};\ m \in M\}$ is a Nash equilibrium, that is:

$$J_m(\hat{p}_m^{*}; \hat{p}_{-m}^{*}) \le J_m(\hat{p}_m; \hat{p}_{-m}^{*}), \quad \forall m \in M \tag{16}$$

Therefore, the Nash equilibrium can be considered as the solution to the optimization problem of each EV with the strategies of the other EVs fixed. When the number of EVs is infinite, there is the following lemma:

Lemma 4.1. For a large population of electric vehicles, the charging and discharging strategy set $\hat{p}$ is a Nash equilibrium if and only if:

1) For all electric vehicles $m \in M$ and the trajectory $z_h$, their charge and discharge control set minimizes their charging cost

$$J_m(\hat{p}_m; z) = \sum_{h=1}^{H}\left\{\rho\left(\frac{1}{e}\left(d_h + \eta_d z_h\right)\right) \times \omega_{rf\_price}\,\hat{p}_m^h \cdot \lambda_m^h + \delta\left(\hat{p}_m^h \cdot \lambda_m^h - z_h\right)^2\right\} \tag{17}$$

2) $z_h = \bar{p}^h$ is the average trajectory of the optimal charge and discharge strategy set of all electric vehicles in the EV population, which holds for all charging and discharging times $h \in H$.

The proof of Lemma 4.1 is as follows:

$$\lim_{M\to\infty}\frac{1}{M}\left(\hat{p}_m^h + \sum_{k=1,\,k\ne m}^{M}\hat{p}_k^{h*}\right) = \lim_{M\to\infty}\frac{1}{M}\sum_{k=1}^{M}\hat{p}_k^{h*} = \bar{p}^{h*} = z_h^{*} \tag{18}$$

$$
\begin{aligned}
J_m(\hat{p}_m; \hat{p}_{-m}^{*}) &= \sum_{h=1}^{H}\left\{\rho\left(\frac{1}{e}\left(d_h + \eta_d \cdot \bar{p}^{h*}\right)\right) \times \omega_{rf\_price}\,\hat{p}_m^h \cdot \lambda_m^h + \delta\left(\hat{p}_m^h \cdot \lambda_m^h - \bar{p}^{h*}\right)^2\right\} \\
&= J_m(\hat{p}_m; z^{*})
\end{aligned}
\tag{19}
$$
When the EV population grows and tends to infinity, the influence of a single EV's charge-discharge strategy on the average charge-discharge strategy trajectory of the whole EV population can be ignored. Therefore, for every $m \in M$, the optimal charge-discharge strategy set of each EV minimizes its charging cost. As can be seen from Eq. (19), it also minimizes the charging cost of the EV population, so the optimal charge-discharge strategy set of the EV population is a Nash equilibrium; that is, the existence of a Nash equilibrium solution is proved. It is also known that the Nash equilibrium solution of this form is unique [15]. According to this lemma, when the non-cooperative game reaches Nash equilibrium, the proposed optimal scheduling problem is solved.

4.2 Broadcast Programming for Strategy Solving

As can be seen from the above, when the charging cost of the EV population reaches its minimum, the game converges to the Nash equilibrium and the optimal scheduling problem is solved. In this paper, a decentralized control broadcast program algorithm is used to solve for the Nash equilibrium of the non-cooperative game. Every EV is a player in the game. The algorithm collects the optimal charging and discharging strategies of all EVs and broadcasts the aggregate EV demand and the predicted base load demand. After obtaining the base load demand and the aggregate EV demand, each EV chooses the charging strategy that minimizes its own electricity expenditure. The game continues until the charging strategy of every EV, or the total electricity bill, no longer changes. The game is a non-cooperative and selfish interactive process, in which each EV decides its charging plan by knowing the average charging strategy trajectory of all other EVs. In the broadcast program, each connected EV shares the base load demand information and tracks the average charging strategy trajectory of all EVs. Each EV determines its charging strategy in relation to the charging strategies adopted by the other electric vehicles, so as to reduce its own charging cost as far as possible. Nash equilibrium is reached when the population charging cost is at its minimum. The basic steps of the decentralized control broadcast program algorithm for strategy solving are shown in Table 1 [12].
5 Experimental Results

5.1 Evaluation Index

The algorithm in this paper considers the discharge behavior of electric vehicles, whereas in the original algorithm in [12] electric vehicles can only be charged. To evaluate the performance of the improved algorithm in this paper against the original game algorithm, the following three indicators are used.
Table 1. Procedure of broadcast program algorithm

Step 1: The utility broadcasts the predicted base load $d_h$ to all electric vehicles.
Step 2: Each EV proposes a charging control strategy that minimizes its charging cost, Eq. (2). The charging cost depends on the aggregated EV population demand broadcast by the utility.
Step 3: The utility collects all proposed charge and discharge control strategies of the electric vehicles and updates the aggregated EV demand. The updated aggregated demand is broadcast to all electric vehicles.
Step 4: Repeat Steps 2 and 3 until the optimal charging and discharging strategy of each EV no longer changes. The value of δ is 0.0125.
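The loop in Table 1 can be sketched as a best-response iteration. The code below is an illustrative simplification under several assumptions that are not taken from the paper: the strategy of each EV is relaxed to a continuous hourly energy value (negative means discharge) without power or battery bounds, every EV is connected at all hours, the function ρ is taken to be the identity so that the price is `price_slope * (base_load + z)` with `price_slope` standing in for ω_rf_price / e, and each EV must meet a total energy need. All function and parameter names are hypothetical.

```python
def best_response(need, z, price, delta):
    """Minimise sum_h price[h]*u[h] + delta*(u[h]-z[h])**2 subject to sum(u) == need.

    Closed form from the Lagrangian: u[h] = z[h] - (price[h] - mean_price)/(2*delta) + shift.
    """
    hours = len(z)
    mean_price = sum(price) / hours
    shift = (need - sum(z)) / hours
    return [z[h] - (price[h] - mean_price) / (2 * delta) + shift for h in range(hours)]


def broadcast_program(base_load, needs, price_slope=0.0125, delta=0.0125,
                      max_iter=100, tol=1e-9):
    """Iterate Steps 1-4 of Table 1 on the simplified model described above."""
    hours, count = len(base_load), len(needs)
    z = [0.0] * hours                      # broadcast average EV strategy
    schedules = [[0.0] * hours for _ in needs]
    for _ in range(max_iter):
        # Steps 1 and 3: broadcast base load and current average strategy (as a price signal).
        price = [price_slope * (base_load[h] + z[h]) for h in range(hours)]
        # Step 2: every EV computes its own cost-minimising schedule.
        schedules = [best_response(need, z, price, delta) for need in needs]
        # Step 3: collect the proposals and update the average strategy.
        new_z = [sum(s[h] for s in schedules) / count for h in range(hours)]
        change = max(abs(a - b) for a, b in zip(new_z, z))
        z = new_z
        if change < tol:                   # Step 4: stop when strategies no longer change
            break
    return schedules, z


base_load = [3.0, 1.0, 2.0, 4.0]           # hypothetical per-EV base load for four hours
schedules, z = broadcast_program(base_load, needs=[2.0, 4.0])
print([round(d + a, 6) for d, a in zip(base_load, z)])  # total load per hour after the game
```

In this simplified setting the iteration contracts when `price_slope / (2 * delta)` lies between 0 and 2, and the resulting total load is flat across hours (valley filling with discharge at the peak); other parameter choices may oscillate, which is why the loop is capped by `max_iter`.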
(1) Comprehensive operation cost of microgrid $J_{total}$

In the process of vehicle-network interaction, it is expected that the comprehensive operation cost of the microgrid is the lowest, that is, the electricity cost on the load side is the lowest and the electricity revenue on the microgrid side is the largest. Therefore, this paper selects the comprehensive operating cost as an index of the algorithm. It is given by:

$$\min\ obj = J_{total} = J_{cost} - J_{sell} = \sum_{h=1}^{H}\rho(h)\left[EV_M^h + D_h - Eg(h) + \gamma(h)\right] \tag{20}$$
(2) Imported electricity of microgrid $\gamma_{total}$

When the microgrid purchases power from the large grid, power purchase cost, dispatching cost and transportation loss cost are incurred. The operating principle of the new energy microgrid studied in this paper is to achieve the internal supply-demand balance of the microgrid, as far as possible, by generating electricity from new energy and scheduling the charge and discharge of electric vehicles. Therefore, the imported electric quantity of the microgrid is also one of the most important indicators of the optimization effect. It is given by:

$$\gamma_{total} = \sum_{h=1}^{H}\gamma(h) = \sum_{h=1}^{H}\max\left(EV_M^h + D_h - Eg(h),\ 0\right) \tag{21}$$
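Equation (21) is a clipped sum and is straightforward to evaluate. The sketch below uses hypothetical names and hypothetical hourly values; a negative EV demand stands for net discharge.

```python
def imported_electricity(ev_demand, base_load, generation):
    """Sum over hours of max(EV demand + base load - generation, 0), as in Eq. (21)."""
    return sum(max(ev + load - gen, 0)
               for ev, load, gen in zip(ev_demand, base_load, generation))


# Hypothetical three-hour example (kWh): surplus in hour 1, deficits in hours 2 and 3.
print(imported_electricity([10, -5, 20], [30, 40, 50], [50, 30, 60]))  # 0 + 5 + 10 = 15
```

Hours with a generation surplus contribute nothing, so discharging EVs during deficit hours directly lowers the index.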
(3) Battery loss factor $\beta_{loss}$

In the process of vehicle-network interaction, battery loss during charging and discharging of electric vehicles is also an important evaluation index. It can be expressed as follows:

$$\beta_{loss} = \frac{\sum_{h=1}^{H}\sum_{m=1}^{M} p_m^h \cdot s_m^h \cdot \lambda_m^h / p_{\max}^h}{\sum_{h=1}^{H}\sum_{m=1}^{M}\lambda_m^h} \tag{22}$$
where $p_{\max}^h$ represents the absolute value of the maximum charge-discharge power of the electric vehicle at the h-th moment.

5.2 Experimental Results

Considering the pricing of the new energy microgrid in grid-connected mode, the improved game theory algorithm is compared with the original game theory algorithm in [12] to verify the improvement of the improved non-cooperative game decentralized control algorithm on each performance index. In addition, the improved game theory algorithm is also compared with common methods such as reinforcement learning and day-ahead optimal scheduling. The same vehicle-network interaction scene is used for all methods, including the same power generation (537 kW), the same base load curve and the same EV on-grid situation; the game models are solved with the designed broadcast program. The resulting vehicle-network interaction indices are shown in Table 2.

Table 2. Comparison of the vehicle-network interaction effect

| Interaction mechanism | $J_{total}$ / Yuan | $\gamma_{total}$ / kWh | $\beta_{loss}$ |
|---|---|---|---|
| Improved game theory | 30 | 124 | 0.256 |
| Game theory | 67 | 163 | 0.248 |
| Reinforcement learning | 49 | 193 | 0.346 |
| Day-ahead optimal scheduling | 27 | 110 | 0.798 |
As can be seen from Table 2, compared with the original game theory algorithm, the improved non-cooperative game decentralized control algorithm imports less electricity because EVs can discharge, and the comprehensive operation cost of the microgrid is correspondingly lower. At the same time, as the decision space of the EV population becomes more complex and each EV can choose among more behaviors, the battery utilization rate is higher and the battery charge-discharge loss coefficient is also higher. Compared with the original algorithm, the improved non-cooperative game decentralized control algorithm reduces the overall operating cost by 55.22% and the imported power by 23.93%, while the battery loss factor increases by only 0.008. The overall cost and imported electricity of the improved game theory algorithm are also lower than those of reinforcement learning. Compared with day-ahead centralized dispatching, the battery loss coefficient decreases (Figs. 1 and 2).
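The percentages quoted above follow directly from the values in Table 2. A short check, with a hypothetical helper name:

```python
def relative_reduction(original, improved):
    """Percentage by which `improved` is lower than `original`."""
    return (original - improved) / original * 100


cost_reduction = relative_reduction(67, 30)      # comprehensive operation cost, Yuan
import_reduction = relative_reduction(163, 124)  # imported electricity, kWh
loss_increase = 0.256 - 0.248                    # battery loss factor

print(round(cost_reduction, 2), round(import_reduction, 2), round(loss_increase, 3))
```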
Fig. 1. Graph of vehicle network interaction process under improved non-cooperative game decentralized control algorithm
Fig. 2. Graph of vehicle network interaction process under decentralized control algorithm of original non-cooperative game
It can be seen from the figure above that the base load curve at 17:00 to 20:00 is greater than the power generation curve, which is the peak time of electricity consumption. In the improved non-cooperative game decentralized control algorithm, electric vehicles discharge at the peak time of electricity consumption due to the additional discharge
function. The original non-cooperative game decentralized control algorithm can only choose to remain idle in the peak hours of power consumption. The aggregate power demand of the improved algorithm decreases correspondingly in the peak hours of load, so the total imported electricity and the comprehensive operating cost also decrease. In general, the improved non-cooperative game decentralized control algorithm is better than the original algorithm without discharge function.
6 Conclusion

The charging and discharging strategy of electric vehicles affects the pricing strategy of the microgrid, and the pricing strategy of the power grid in turn affects the charging cost of electric vehicles. Based on this characteristic of vehicle-network interaction, the problem was modeled as a decentralized control problem based on a non-cooperative game, aiming at the lowest electricity cost of electric vehicles. Finally, a broadcast program algorithm was designed to solve the large-scale vehicle-network interaction process of electric vehicles. The simulation results show that in the interaction between electric vehicles and the microgrid, electric vehicles constantly adjust their charging and discharging strategies to minimize their charging costs, which is a dynamic learning process, and the game in the population of electric vehicles finally reaches a stable Nash equilibrium. Moreover, the improved non-cooperative game algorithm, which takes into account the electric vehicle discharge function and the real-time electricity price mechanism of the microgrid in grid-connected mode, achieves a better vehicle-network interaction effect than the original game theory algorithm.

Acknowledgement. Research work in this paper is supported by the National Natural Science Foundation of China (Grant No. 71871160) and Shanghai Science and Technology Innovation Action Plan (No. 19DZ1206800).
References

1. Marzband, M., Javadi, M., Pourmousavi, S.A., Lightbody, G.: An advanced retail electricity market for active distribution systems and home microgrid interoperability based on game theory. Electr. Power Syst. Res. 157, 187–199 (2018)
2. Saboori, H., Jadid, S., Savaghebi, M.: Optimal management of mobile battery energy storage as a self-driving, self-powered and movable charging station to promote electric vehicle adoption. Energies 14(3), 736 (2021)
3. Dequan, H.U., Guo, C., Qinbo, Y.U., Yang, X.: Bi-level optimization strategy of electric vehicle charging based on electricity price guide. Electr. Power Constr. 39(1), 48–53 (2018)
4. Suganya, S., Raja, S.C., Srinivasan, D., Venkatesh, P.: Smart utilization of renewable energy sources in a microgrid system integrated with plug-in hybrid electric vehicles. Int. J. Energy Res. 42(3), 1210–1224 (2017)
5. Zhang, W., Wang, J.: Research on V2G control of smart microgrid. In: Proceedings of the 2020 International Conference on Computer Engineering and Intelligent Control (ICCEIC), pp. 216–219 (2020)
6. Chukwu, U.C.: The impact of load patterns on power loss: A case of V2G in the distribution network. In: Proceedings of the 2020 Clemson University Power Systems Conference (PSC), pp. 1–4 (2020)
7. Tookanlou, M.B., Kani, S.A.P., Marzband, M.: An optimal day-ahead scheduling framework for E-mobility ecosystem operation with drivers’ preferences. IEEE Trans. Power Syst. 36(6), 5245–5257 (2021)
8. Amamra, S.A., Marco, J.: Vehicle-to-grid aggregator to support power grid and reduce electric vehicle charging cost. IEEE Access 7, 178528–178538 (2019)
9. Du, Y., Li, F.: Intelligent multi-microgrid energy management based on deep neural network and model-free reinforcement learning. IEEE Trans. Smart Grid 11, 1066–1076 (2019)
10. Li, H., Wan, Z., He, H.: Constrained EV charging scheduling based on safe deep reinforcement learning. IEEE Trans. Smart Grid 11, 2427–2439 (2019)
11. Cheng, L., Yu, T., Zhang, X., Yin, L.: Machine learning for energy and electric power systems: State of the art and prospects. Dianli Xitong Zidonghua/Autom. Electr. Power Syst. 43(1), 15–31 (2019)
12. Ma, Z.: Decentralized valley-fill charging control of large-population plug-in electric vehicles. In: Proceedings of the Control & Decision Conference. IEEE (2012)
13. Li, Y., et al.: Optimal scheduling of isolated microgrid with an electric vehicle battery swapping station in multi-stakeholder scenarios: A bi-level programming approach via real-time pricing. Appl. Energy 232, 54–68 (2018)
14. Ma, Z., Callaway, D., Hiskens, I.: Decentralized charging control for large populations of plug-in electric vehicles. In: Proceedings of the IEEE International Conference on Control Applications. IEEE (2011)
15. Ma, Z., Callaway, D.S., Hiskens, I.A.: Decentralized charging control of large populations of plug-in electric vehicles. IEEE Trans. Control Syst. Technol. 21(1), 67–78 (2012)
TBC-Unet: U-net with Three-Branch Convolution for Gliomas MRI Segmentation

Yongpu Yang1, Haitao Gan1,2(B), and Zhi Yang1,2

1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
[email protected]
2 State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan 430062, China
Abstract. Segmentation networks with encoder and decoder structures provide remarkable results in the segmentation of gliomas MRI. However, the network loses small-scale tumor feature information during the encoding phase due to the limitations of the traditional 3 × 3 convolutional layer, decreasing network segmentation accuracy. We designed a three-branch convolution module (TBC module) to replace the traditional convolutional layer to address the problem of small-scale tumor information loss. The TBC module is divided into three branches, each of which extracts image features using a different convolutional approach before fusing the three branches’ features as the TBC module’s output. The TBC module enables the model to learn richer small-scale tumor features during encoding. Furthermore, since the tumor area in an MRI only accounts for around 2% of the whole image, there is a problem with pixel category imbalance. We construct a new loss function to address the problem of category imbalance. Extensive experiments on BraTS datasets demonstrate that the proposed method achieves very competitive results with the state-of-the-art approaches.

Keywords: Gliomas · MRI segmentation · Small-scale tumor · TBC module · Loss function
1 Introduction

Glioma is the most common primary cranial brain tumor, arising from cancerous changes in the glial cells of the brain and spinal cord. Glioma development, like that of other cancers, is primarily linked to genetics and environmental factors. The World Health Organization (WHO) divides gliomas into two types [1]: high-grade gliomas and low-grade gliomas. High-grade gliomas usually consist of poorly differentiated or undifferentiated cells that grow and spread very quickly and are malignant. Low-grade gliomas consist of well-differentiated cells that are usually well identified, and their tumor cells grow more slowly [2]. The commonly used treatments for gliomas are surgery, radiotherapy, and chemotherapy. However, since glioma has such a high recurrence rate, the treatment effect is limited. Once a patient is diagnosed, the survival period is often less than 14 months [3], which makes the disease extremely harmful.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 53–65, 2022. https://doi.org/10.1007/978-3-031-13829-4_5
Traditional machine learning segmentation methods and deep learning segmentation methods are the two primary kinds of MRI-based brain glioma image segmentation approaches. Traditional machine learning segmentation methods require human intervention, cannot fully automate segmentation, have poor robustness, and produce segmentation results with large errors. Deep learning segmentation methods, on the other hand, eliminate these downsides and greatly improve segmentation performance. Deep learning segmentation mainly relies on fully convolutional neural networks with U-shaped structures, such as the U-net model, which consists of an encoder and a decoder [4]. In the encoder of the U-net model, the feature information of the image is extracted using consecutive convolutional layers and pooling layers. The decoder then upsamples the target details and fuses them with the encoder's high-resolution features through skip connections, minimizing the loss of detail information. Because the U-net model performed well in medical image segmentation, researchers enhanced it and developed a number of new algorithms, including Res-unet [5], Unet++ [6], and Unet3plus [7].

Because the tumor regions in glioma MRI are small, encoder downsampling causes features in these small-scale tumor regions to disappear, resulting in insufficient feature extraction from small-scale tumor regions. Since the tumor area in glioma MRI is very small and very irregular in shape, and the U-net encoder consists of cascaded 3 × 3 convolutional layers, some features of small-scale tumors are lost, which affects the performance of the network [8]. To improve the feature extraction ability of the model, we propose a model named TBC-Unet, which makes the following two contributions: 1) In the U-net model, we designed a new TBC module to replace the traditional 3 × 3 convolutional layers. The TBC convolution block consists of three parallel branches, each with a different shape of convolution, which can extract wider and deeper semantic features. 2) We designed a new loss function to address the problem of pixel category imbalance in images.

The remaining portion of the paper is organized as follows. The related work is discussed in Sect. 2. Our proposed TBC-Unet method is described in Sect. 3. Section 4 presents the experimental setup in detail, including datasets, evaluation metrics, and network comparisons. The conclusions are given in Sect. 5.
2 Related Work

Traditional machine learning algorithms were most often used in early image segmentation research. Examples of conventional machine learning methods include the region growing method [9], the watershed algorithm [10], and the fuzzy C-means clustering method [11]. Although these traditional methods are reasonably simple to apply, they ignore the image's spatial information and only employ the image's surface information. As a result, they are unsuitable for segmentation tasks that need a large amount of semantic information. Convolutional neural networks have become an essential method of image segmentation thanks to the breakthrough development of deep learning in the area of computer vision. They can fully use the semantic information of images to achieve image semantic segmentation. Some outstanding segmentation methods, such as SegNet [12],
Mask-RCNN [13], and DeepLab V3 [14], have been presented to meet the increasingly complicated issue of image segmentation. Deep neural networks are often employed in medical image segmentation because of their powerful feature extraction capability. Wu [15] proposed an iterative convolutional neural network (CNN) based method to segment cell membranes in 2015. While this method enhanced cell membrane segmentation accuracy, it lacked spatial continuity and was inefficient. Long et al. [16] proposed a CNN-based fully convolutional neural network (FCN) to segment medical images. The encoding and decoding structure of the FCN model provided a solution to image spatial continuity and improved medical image segmentation accuracy. For medical image segmentation, Ronneberger et al. [4] proposed U-net, a fully convolutional neural network. The U-net model differed from the FCN model in that the decoding and encoding phases employed the same number of convolutional layers, and the decoding and encoding layers were fused by skip connections. The U-net model excelled in medical image segmentation because of its innovative network topology. However, the U-net model's decoding and encoding phases employed typical 3 × 3 convolutional layers, resulting in inadequate feature extraction during the network's encoding step. Chen et al. [17] proposed a neural network based on spatial channel convolution to overcome this issue. This network could extract the mapping relationship of spatial information between pixels, which aided feature extraction during encoding. Chen et al. [18] proposed the Drinet network, which integrated three common network structures: Densenet [19], Inception [20], and Residual [21]. This combination also enhanced feature extraction during network encoding.

Some novel convolution methods have been proposed by researchers to enable the network to extract richer features. In Inception-v3 [22], the 7 × 7 convolutions were replaced by a sequence of 1 × 7 and 7 × 1 convolutions. However, the authors found that such a replacement was not equivalent, as it did not work well on the low-level layers. EDANet [23] used a similar method to decompose the 3 × 3 convolutions, resulting in a 1/3 saving in the number of parameters and required computations with minor performance degradation. Many excellent medical image segmentation models are now built on the U-net model. Since the encoders of these models employ standard 3 × 3 convolutional layers, the models suffer from insufficient feature extraction. To increase segmentation performance, the feature extraction ability of the models should be improved during the encoding process.
3 Proposed Method

Figure 1 shows the structure of the TBC-Unet we designed. Similar to U-net, TBC-Unet has an encoder and a decoder, with skip connections fusing the two together. In this section, we go over its network structure and the loss function we designed in detail.
3.1 TBC Module

In the U-net model, the encoder consists of consecutive convolutional layers and max-pooling layers. Because consecutive convolutional operations result in the loss of image details in the U-net model, we employed the TBC module to replace the traditional 3 × 3 convolutional layers. Figure 2 shows the structure of the TBC module. It is divided into three branches: (1) a 3 × 1 convolutional layer; (2) a 1 × 3 convolutional layer; (3) a third branch consisting of two parts, a 3 × 3 convolutional layer and a depthwise convolutional layer. The outputs of the three branches are feature matrices of the same shape.
Fig. 1. TBC-unet model. (Legend: TBC module; skip connection; max pool 2x2; up-conv 2x2; conv 1x1.)
For example, when the first and second branches of the TBC module each output a feature matrix with 64 channels, the 3 × 3 convolutional layer of the third branch outputs a feature matrix with 32 channels. This feature matrix is then passed through a depthwise convolution, and the output of the 3 × 3 convolutional layer is finally fused with the output of the depthwise convolution layer through a skip connection. The resulting feature matrix of the third branch has the same shape as those of the other two branches. Using the TBC convolution module, the network can integrate shallow and deep features to learn richer feature information.
TBC-Unet: U-net with Three-Branch Convolution
The TBC module’s underlying principle is simple: if a two-dimensional matrix has a rank of 1, it can be represented as the product of a non-zero column matrix and a non-zero row matrix. On this theoretical basis, we first substituted the traditional 3 × 3 convolutional layer directly with two parallel 3 × 1 and 1 × 3 layers; however, this substitution had a significant disadvantage: a lot of spatial information is lost during encoding. To compensate for this flaw, we added a third branch based on the GhostNet [24] model. Figure 3 shows the GhostNet model’s main structure. The TBC module can obtain more detailed information than the typical 3 × 3 convolutional layer. The results of the subsequent comparison tests showed that our network outperformed a number of frequently used segmentation networks.
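The rank-1 principle can be illustrated with a toy example in plain Python. The kernel values and the test image below are arbitrary choices for illustration. Note that the sketch applies the 3 × 1 and 1 × 3 kernels one after the other, which is the classical separable case, whereas the TBC module uses them as parallel branches.

```python
def outer(col, row):
    """Product of a 3x1 column and a 1x3 row: a rank-1 3x3 kernel."""
    return [[c * r for r in row] for c in col]


def correlate2d(image, kernel):
    """'Valid' 2D cross-correlation, the operation deep-learning
    libraries usually call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


col = [1, 2, 1]                  # 3x1 kernel (vertical smoothing)
row = [-1, 0, 1]                 # 1x3 kernel (horizontal gradient)
kernel_3x3 = outer(col, row)     # rank-1 3x3 kernel (a Sobel filter)

# Arbitrary 5x6 integer test image.
image = [[(3 * i + 5 * j * j) % 7 for j in range(6)] for i in range(5)]

full = correlate2d(image, kernel_3x3)
separable = correlate2d(correlate2d(image, [[c] for c in col]), [row])
print(full == separable)
```

The two results agree because the kernel has rank 1. A general learned 3 × 3 kernel has rank up to 3 and cannot be written as a single outer product, which is consistent with the information loss the authors report for the two-branch substitution.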
Fig. 2. TBC module. Blocks shown: feature map, 3x1 Conv, 1x3 Conv, 3x3 Conv, 1x1 DW, concat, feature map.
Fig. 3. GhostNet main structure (3x3 Conv, linear operations, concat). Wavelet transform, affine transform, and other intermediate linear operations are available in GhostNet’s main structure.
3.2 Loss Function

From the perspective of a single pixel, glioma segmentation is a binary classification task, since everything other than glioma is background. As a result, Binary Cross Entropy (BCE) can be used as the loss function, as shown in (1).

$$\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right) \tag{1}$$
where $y_i$ represents the true label value, $\hat{y}_i$ represents the value predicted by the model, and n is the number of training batches. Glioma MRI suffers from unbalanced pixel categories: the tumor area accounts for only roughly 2% of the overall image, while the background accounts for the rest. If BCE alone is used as the loss function, the network will learn background features while ignoring tumor region features during training, which degrades the network’s overall segmentation performance. The Dice loss function is particularly good at dealing with category imbalance and is given in (2).

$$L_{Dice} = 1 - \frac{2\sum_i y_i \hat{y}_i + \varepsilon}{\sum_i y_i + \sum_i \hat{y}_i + \varepsilon} \tag{2}$$
In the formula, $y_i$ represents the true label value, $\hat{y}_i$ represents the value predicted by the model, and ε is a smoothing term that prevents the numerator and denominator from being zero. Based on the two considerations above, we combined these two loss functions into a new loss function, shown in (3).

$$\mathrm{Loss} = \lambda\,\mathrm{BCE} + L_{Dice} \tag{3}$$

where λ is a hyperparameter.
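A minimal plain-Python sketch of Eqs. (1) to (3) is given below, on a flattened list of pixels. It is not the authors' training code. The clamping constant `eps`, the value of `smooth` (standing in for ε), and the default `lam=0.5` are assumptions chosen only for illustration, since the text does not report them.

```python
import math


def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy, Eq. (1). Predictions are clamped to
    [eps, 1 - eps] so that log(0) never occurs (an added safeguard)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def dice_loss(y_true, y_pred, smooth=1e-6):
    """Dice loss, Eq. (2); `smooth` plays the role of epsilon."""
    intersection = sum(y * p for y, p in zip(y_true, y_pred))
    return 1.0 - (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)


def combined_loss(y_true, y_pred, lam=0.5):
    """Eq. (3): lambda * BCE + Dice loss. lam=0.5 is a placeholder value."""
    return lam * bce(y_true, y_pred) + dice_loss(y_true, y_pred)


# About 2% foreground, as described in the text: 2 tumor pixels out of 100.
y_true = [1, 1] + [0] * 98
all_background = [0.0] * 100              # a model that predicts only background
perfect = [float(v) for v in y_true]      # a model that predicts the labels exactly

print(bce(y_true, all_background), dice_loss(y_true, all_background))
print(combined_loss(y_true, perfect))
```

In this toy case the background-only prediction misses the whole tumor, yet its BCE stays modest because 98 of the 100 pixels are classified correctly, while its Dice loss is close to its maximum of 1. This is the imbalance argument the section makes for adding the Dice term.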
4 Experiments and Results

4.1 Dataset

We ran numerous tests on the BraTs [25] dataset to verify the TBC-Unet model and evaluated the results. The training set of BraTs2018, with 285 samples (210 HGG and 75 LGG), was used as our training set. The test set consists of an additional 50 samples taken from BraTs2019, namely the cases it adds relative to BraTs2018 (49 HGG and 1 LGG). Each sample contains four modalities: Flair, T1, T1ce, and T2. The dimensions of each MRI volume are 244 × 244 × 155, where the first two numbers give the width and height of the image and the third gives the number of slices. Each 3D MRI was sliced along the axis and decomposed into 155 images, and the black slices were discarded. The number of black slices varied from patient to patient, and so did the number of usable slices. After processing all patients’ brain MRIs, there were 18,923 images in the training set and 3,138 images in the test set. The images used for the experiments are shown in Fig. 4. Peritumoral edema (ED), enhancing tumor (ET), and non-enhancing tumor (NET) are the three main components of gliomas. The three most common segmentation tasks are whole tumor (WT, ED+ET+NET), enhancing tumor (ET), and tumor core (TC, ET+NET). Among the three tasks, tumor core (TC) and enhancing tumor (ET) are small-scale tumors compared to the whole tumor (WT).

4.2 Metrics for Evaluation

The performance of our model is evaluated using the Dice correlation coefficient, precision, and sensitivity. The Dice correlation coefficient is calculated as shown in (4):

$$\mathrm{Dice} = \frac{2TP}{FN + FP + 2TP} \tag{4}$$
Here TP, FP, and FN denote the numbers of true positive, false positive, and false negative samples, respectively. Precision denotes the percentage of predicted positive samples that are actually positive. It is calculated as shown in (5).

$$\mathrm{Precision} = \frac{TP}{FP + TP} \tag{5}$$
Sensitivity measures the proportion of the actual positive region that the model segments correctly. It is computed from the numbers of true positives and false negatives, as shown in (6).

$$\mathrm{Sensitivity} = \frac{TP}{FN + TP} \tag{6}$$
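The three metrics in Eqs. (4) to (6) can be computed from pixel-wise counts as sketched below. The flattened masks `truth` and `pred` are made-up toy data, and the function names are hypothetical.

```python
def confusion_counts(truth, pred):
    """Count true positives, false positives and false negatives
    for two flattened binary masks of equal length."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, fp, fn


def dice(tp, fp, fn):
    return 2 * tp / (fn + fp + 2 * tp)      # Eq. (4)


def precision(tp, fp):
    return tp / (fp + tp)                   # Eq. (5)


def sensitivity(tp, fn):
    return tp / (fn + tp)                   # Eq. (6)


# Toy masks; zero denominators (empty masks) are not handled in this sketch.
truth = [0, 1, 1, 1, 0, 0, 1, 0]
pred = [1, 1, 1, 0, 1, 0, 1, 0]

tp, fp, fn = confusion_counts(truth, pred)
print(tp, fp, fn)
print(dice(tp, fp, fn), precision(tp, fp), sensitivity(tp, fn))
```

Here TP = 3, FP = 2 and FN = 1, so the Dice value is 6/9, precision is 3/5 and sensitivity is 3/4. A real evaluation would also need to define the result for slices in which the tumor class is absent.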
Fig. 4. From left to right, T1, T1ce, T2, and Flair are the four distinct modalities presented. The fifth image is the ground truth, where green represents edema, yellow represents enhancing tumors, and red represents non-enhancing tumors. (Color figure online)
4.3 Experiment Detail

Hyperparameters are parameters that are set before training a deep learning model, and appropriate hyperparameters have a significant influence on experimental results. Choosing their values, however, often requires extensive experience. We train U-net, DeepResUnet, Unet++, Dense_Unet, Unet3plus, and TBC-Unet with a learning rate of 0.0003 and a batch_size of 8, using Adam as the optimizer. The experiments run on PyTorch on a machine with an Intel(R) Core(TM) i9-10940X CPU @ 3.30 GHz 3.31 GHz, 64 GB RAM, an NVIDIA GeForce RTX 3090, and 64-bit Windows 10. The development software platform is PyCharm with Python 3.9.

4.4 Ablation Study

Our proposed method is based on U-Net, so U-Net is the most fundamental baseline model. To improve the feature extraction ability of the model, we use the TBC module instead of the original 3 × 3 convolutional layers. To demonstrate the effectiveness of the TBC module in the TBC-Unet model, we conduct a series of ablation studies. We take out the branches of the TBC module separately for experiments, which gives the following six cases:

1) Use the first branch of the TBC module to replace the traditional 3 × 3 convolutional layer, called “TBC1-Unet”;
2) Use the second branch of the TBC module to replace the traditional 3 × 3 convolutional layer, called “TBC2-Unet”;
3) Use the third branch of the TBC module to replace the traditional 3 × 3 convolutional layer, called “TBC3-Unet”;
4) Use the first and second branches of the TBC module to replace the traditional 3 × 3 convolutional layer, called “TBC12-Unet”;
5) Use the first and third branches of the TBC module to replace the traditional 3 × 3 convolutional layer, called “TBC13-Unet”;
6) Use the second and third branches of the TBC module to replace the traditional 3 × 3 convolutional layer, called “TBC23-Unet”.

Tables 1, 2, and 3 show the results of the ablation experiments. The results show that TBC-Unet has the best performance: when the TBC module has all three branches at the same time, the feature extraction ability is the strongest and the performance is the best.

4.5 Results

In this study, TBC-Unet and five state-of-the-art segmentation models (U-net [4], DeepResUnet [26], Unet++ [6], Dense_Unet [27], and Unet3plus [7]) were tested on the BraTs dataset, and the experimental results were analyzed quantitatively and qualitatively. Tables 4, 5, and 6 compare the proposed TBC-Unet quantitatively with the five state-of-the-art models on the BraTs dataset. The Dice correlation coefficient and Precision results are shown in Tables 4 and 5. As the tables show, our model achieves the highest Dice and Precision values on all three segmentation tasks.
As shown in Table 6, the sensitivity of our model in segmenting tumor core (TC) and enhancing tumor (ET) is higher than that of the other five models, and its sensitivity on all three segmentation tasks is higher than that of the U-net method. Tables 4, 5, and 6 together show that TBC-Unet outperforms the other five models in segmenting tumor core (TC) and enhancing tumor (ET). Compared with the U-net model, on the TC and ET regions our model increases Dice by 5.8 and 1.5 percentage points, Precision by 6.2 and 2.9 percentage points, and Sensitivity by 2.0 and 4.6 percentage points, respectively. To make the segmentation results easier to inspect, we randomly selected 6 images from the test set for qualitative comparison, as shown in Fig. 5. As the figure shows, our model segments the tumor region with the highest accuracy. Our model outperforms the other five models in segmenting tumor core (TC) and enhancing tumor (ET), which are small-scale tumors. When a small-scale tumor appears in the image, as shown in the seventh row of Fig. 5, the TBC-Unet model segments it better, demonstrating the model’s superiority. In conclusion, the TBC-Unet model segments small-scale tumors more accurately than the five state-of-the-art segmentation models. This shows that the TBC module we designed can increase the model’s attention to small-scale tumors and improve its feature extraction ability for them.
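The improvements over U-net quoted above are absolute differences between table entries. The short check below copies the U-net and TBC-Unet scores from Tables 4 to 6 and recomputes the gains in percentage points. The helper name `gain_in_points` is hypothetical.

```python
# (U-net, TBC-Unet) scores copied from Tables 4-6.
scores = {
    ("Dice", "TC"): (0.8215, 0.8797),
    ("Dice", "ET"): (0.7688, 0.7838),
    ("Precision", "TC"): (0.8533, 0.9152),
    ("Precision", "ET"): (0.7745, 0.8037),
    ("Sensitivity", "TC"): (0.9014, 0.9211),
    ("Sensitivity", "ET"): (0.8295, 0.8758),
}


def gain_in_points(baseline, ours):
    """Absolute improvement, expressed in percentage points."""
    return round((ours - baseline) * 100, 1)


gains = {key: gain_in_points(*pair) for key, pair in scores.items()}
for (metric, region), value in gains.items():
    print(f"{metric} {region}: +{value} points")
```

The recomputed values agree with the figures quoted in the text, with the Sensitivity gain on TC (1.97 points) rounding to 2.0.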
Table 1. Results measured using the Dice correlation coefficient.

| Method | WT_Dice | TC_Dice | ET_Dice |
| --- | --- | --- | --- |
| TBC1-Unet | 0.6837 | 0.7153 | 0.6357 |
| TBC2-Unet | 0.6748 | 0.7262 | 0.6548 |
| TBC3-Unet | 0.7668 | 0.7536 | 0.7216 |
| TBC12-Unet | 0.7451 | 0.7215 | 0.6975 |
| TBC13-Unet | 0.7846 | 0.7921 | 0.7648 |
| TBC23-Unet | 0.7952 | 0.7783 | 0.7712 |
| U-net | 0.8387 | 0.8215 | 0.7688 |
| TBC-Unet | 0.8484 | 0.8797 | 0.7838 |
Table 2. Results measured using Precision.

| Method | WT_Precision | TC_Precision | ET_Precision |
| --- | --- | --- | --- |
| TBC1-Unet | 0.6983 | 0.7326 | 0.6187 |
| TBC2-Unet | 0.7135 | 0.7232 | 0.6657 |
| TBC3-Unet | 0.7753 | 0.7818 | 0.7437 |
| TBC12-Unet | 0.7624 | 0.7413 | 0.7332 |
| TBC13-Unet | 0.8167 | 0.8678 | 0.7533 |
| TBC23-Unet | 0.8367 | 0.8565 | 0.7471 |
| U-net | 0.8561 | 0.8533 | 0.7745 |
| TBC-Unet | 0.8711 | 0.9152 | 0.8037 |
Table 3. Results using sensitivity measures.

| Method | WT_sensitivity | TC_sensitivity | ET_sensitivity |
| --- | --- | --- | --- |
| TBC1-Unet | 0.7033 | 0.7426 | 0.6927 |
| TBC2-Unet | 0.6973 | 0.7286 | 0.7143 |
| TBC3-Unet | 0.7725 | 0.7917 | 0.7542 |
| TBC12-Unet | 0.7463 | 0.7571 | 0.7332 |
| TBC13-Unet | 0.8428 | 0.8667 | 0.8327 |
| TBC23-Unet | 0.8575 | 0.8754 | 0.8416 |
| U-net | 0.8670 | 0.9014 | 0.8295 |
| TBC-Unet | 0.8736 | 0.9211 | 0.8758 |
Table 4. Results measured using the Dice correlation coefficient.

| Method | WT_Dice | TC_Dice | ET_Dice |
| --- | --- | --- | --- |
| U-net | 0.8387 | 0.8215 | 0.7688 |
| DeepResUnet | 0.8463 | 0.8319 | 0.7819 |
| Unet++ | 0.8465 | 0.8604 | 0.7771 |
| Dense_Unet | 0.8393 | 0.8703 | 0.7716 |
| Unet3plus | 0.8183 | 0.7645 | 0.7368 |
| TBC-Unet | 0.8484 | 0.8797 | 0.7838 |
Table 5. Results measured using Precision.

| Method | WT_Precision | TC_Precision | ET_Precision |
| --- | --- | --- | --- |
| U-net | 0.8561 | 0.8533 | 0.7745 |
| DeepResUnet | 0.8422 | 0.8735 | 0.7720 |
| Unet++ | 0.8303 | 0.8851 | 0.7579 |
| Dense_Unet | 0.8619 | 0.9012 | 0.7862 |
| Unet3plus | 0.7887 | 0.7510 | 0.7133 |
| TBC-Unet | 0.8711 | 0.9152 | 0.8037 |
Table 6. Results using sensitivity measures.

| Method | WT_sensitivity | TC_sensitivity | ET_sensitivity |
| --- | --- | --- | --- |
| U-net | 0.8670 | 0.9014 | 0.8295 |
| DeepResUnet | 0.8939 | 0.8956 | 0.8485 |
| Unet++ | 0.9002 | 0.9121 | 0.8551 |
| Dense_Unet | 0.8738 | 0.9146 | 0.8351 |
| Unet3plus | 0.9016 | 0.9032 | 0.8071 |
| TBC-Unet | 0.8736 | 0.9211 | 0.8758 |
Fig. 5. Predictions of our proposed model and the state-of-the-art models on six unseen images from test data; (row-wise) 1: FLAIR images, 2: U-net, 3: DeepResUnet, 4: Unet++, 5: Dense_Unet, 6: Unet3plus, 7: TBC-Unet, 8: Ground Truth. Color code is the same as that in Fig. 4.
5 Conclusion

We proposed the TBC-Unet deep neural network model, an upgraded version of the U-net model. The feature extraction capability of the TBC module we designed compensates for the shortcomings of feature extraction with traditional convolutional layers. According to the experimental results, the TBC-Unet model outperforms commonly used medical image segmentation models. In future work, we will continue to optimize the model and reduce its computation without degrading segmentation performance.
Acknowledgements. The work was supported by the High-level Talents Fund of Hubei University of Technology under grant No. GCRC2020016, Open Funding Project of the State Key Laboratory of Biocatalysis and Enzyme Engineering No. SKLBEE2020020 and SKLBEE2021020.
References

1. Louis, D.N., Perry, A., Reifenberger, G., et al.: The 2016 World Health Organization classification of tumors of the central nervous system: A summary. Acta Neuropathol. 131(6), 803–820 (2016)
2. Gonbadi, F.B., Khotanlou, H.: Glioma brain tumors diagnosis and classification in MR images based on convolutional neural networks. In: Proceedings of the 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), pp. 1–5. IEEE (2019)
3. Van Meir, E.G., Hadjipanayis, C.G., Norden, A.D., et al.: Exciting new advances in neuro-oncology: The avenue to a cure for malignant glioma. CA Cancer J. Clin. 60(3), 166–193 (2010)
4. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
5. Xiao, X., et al.: Weighted res-unet for high-quality retina vessel segmentation. In: Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), pp. 327–331. IEEE (2018)
6. Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: Stoyanov, D., Taylor, Z., Carneiro, G., Syeda-Mahmood, T., Martel, A., Maier-Hein, L., Tavares, J.M.R.S., Bradley, A., Papa, J.P., Belagiannis, V., Nascimento, J.C., Lu, Z., Conjeti, S., Moradi, M., Greenspan, H., Madabhushi, A. (eds.) DLMIA/ML-CDS 2018. LNCS, vol. 11045, pp. 3–11. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00889-5_1
7. Huang, H., et al.: Unet 3+: A full-scale connected unet for medical image segmentation. In: Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1055–1059. IEEE (2020)
8. Gu, Z., Cheng, J., Fu, H., et al.: Ce-net: Context encoder network for 2d medical image segmentation. IEEE Trans. Med. Imaging 38(10), 2281–2292 (2019)
9. Deng, W., et al.: MRI brain tumor segmentation with region growing method based on the gradients and variances along and inside of the boundary curve. In: Proceedings of the 2010 3rd International Conference on Biomedical Engineering and Informatics, pp. 393–396. IEEE (2010)
10. Kaleem, M., Sanaullah, M., Hussain, M.A., Jaffar, M.A., Choi, T.-S.: Segmentation of brain tumor tissue using marker controlled watershed transform method. In: Chowdhry, B.S., Shaikh, F.K., Hussain, D.M.A., Uqaili, M.A. (eds.) IMTIC 2012. CCIS, vol. 281, pp. 222–227. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28962-0_22
11. Menon, N., et al.: Brain tumor segmentation in MRI images using unsupervised artificial bee colony algorithm and FCM clustering. In: Proceedings of the 2015 International Conference on Communications and Signal Processing (ICCSP), pp. 0006–0009. IEEE (2015)
12. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
13. He, K.M., et al.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969. IEEE (2017)
14. Chen, L.C., Papandreou, G., Schroff, F., et al.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
15. Wu, X.D.: An iterative convolutional neural network algorithm improves electron microscopy image segmentation. arXiv preprint arXiv:1506.05849
16. Long, J., et al.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. IEEE (2015)
17. Chen, Y., Wang, K., Liao, X., et al.: Channel-Unet: A spatial channel-wise convolutional neural network for liver and tumors segmentation. Front. Genet. 10, 1110 (2019)
18. Chen, L., Bentley, P., Mori, K., et al.: DRINet for medical image segmentation. IEEE Trans. Med. Imaging 37(11), 2453–2462 (2018)
19. Huang, G., et al.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. IEEE (2017)
20. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. IEEE (2015)
21. He, K.M., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. IEEE (2016)
22. Szegedy, C., et al.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. IEEE (2016)
23. Lo, S.Y., et al.: Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In: Proceedings of the ACM Multimedia Asia, pp. 1–6
24. Han, K., et al.: Ghostnet: More features from cheap operations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1580–1589. IEEE (2019)
25. Menze, B.H., Jakab, A., Bauer, S., et al.: The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34(10), 1993–2024 (2014)
26. Zhang, Z., Liu, Q., Wang, Y.: Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett. 15(5), 749–753 (2018)
27. Kaku, A., Hegde, C.V., Huang, J., et al.: DARTS: DenseUnet-based automatic rapid tool for brain segmentation. arXiv preprint arXiv:1911.05567
Drug–Target Interaction Prediction Based on Graph Neural Network and Recommendation System

Peng Lei1(B), Changan Yuan2,3, Hongjie Wu4, and Xingming Zhao5

1 Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
2 Guangxi Academy of Science, Nanning 530007, China
3 Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Guangxi Academy Sciences, Nanning, China
4 School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
5 Institute of Science and Technology for Brain Inspired Intelligence (ISTBI), Fudan University, Shanghai 200433, China
Abstract. Drug therapy is an important means of curing diseases, and identifying the interactions between drugs and target proteins is key to developing new drugs. However, because biological experimental methods are limited in throughput, precision, and cost, verifying a large number of drug-target interactions experimentally involves a degree of blindness and is difficult to carry out widely in practice. Driven by information science, intelligent information processing technologies such as machine learning, data mining, and mathematical statistics have been developed and applied rapidly. Predicting the interactions between drugs and target proteins through computer simulation can reduce research and development cost, shorten development time, and reduce the blindness of new drug development. It is of great significance for new drug research and development and for the improvement of human medical treatment. However, existing drug-target interaction (DTI) prediction methods suffer from low accuracy and a high false positive rate. In this paper, a new DTI prediction method, GCN_NFM, is proposed by combining a graph neural network with a recommendation system. The framework first learns low-dimensional representations of drug entities and protein entities with a graph convolutional network (GCN), and then integrates multimodal information through a neural factorization machine (NFM). The results show that, under 5-fold cross-validation, the area under the receiver operating characteristic curve (AUROC) obtained by this method is 0.9457, indicating that GCN_NFM can effectively and robustly capture undiscovered DTIs.

Keywords: Drug discovery · Drug-target interactions · Graph neural network · Recommendation system
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 66–78, 2022. https://doi.org/10.1007/978-3-031-13829-4_6
1 Introduction

Finding a new drug takes more than 10 years and costs more than $2.6 billion [1]. In recent years, many AI drug-development start-ups have emerged and successfully applied deep learning to assist new drug development, which has greatly reduced time and cost [2, 3]. When deep learning is used in drug development, one of the topics of greatest interest is drug-target interaction prediction. Diseases are usually attributed to target proteins in the disease pathway. A drug can regulate such a target protein, which is equivalent to cutting off the disease pathway, and thereby cure the disease. One of the main mechanisms of drug action is the “lock and key” theory [4]: the target protein is the “lock”, and the drug is the appropriate “key” that unlocks it. The degree to which lock and key match is called binding affinity. Drug-target interaction (DTI) measures the binding affinity between drug molecules and protein targets. Therefore, a deep learning model that accurately predicts the binding affinity between drug molecules and protein targets would greatly benefit drug discovery [5]. More specifically, virtual screening and drug repurposing are the two main applications based on DTI. Virtual screening helps to identify ligand candidates that can bind to target proteins, while drug repurposing finds new therapeutic uses for existing drugs [38–41]. In the field of DTI prediction, traditional computational methods fall into two main categories: ligand-based methods and structure-based methods [42]. However, structure-based methods are not applicable when the three-dimensional structure of the target protein is unknown, and ligand-based methods have limited predictive ability when few ligands are known to bind the target protein [7–11].
In recent years, the widespread recognition of data-driven methods has led to machine learning algorithms being widely used in biomolecular correlation prediction [12–15]. There are four main categories of in silico methods: machine learning based methods, network-based methods, matrix factorization based methods, and deep learning based methods [16–18]. For example, DTINet [19], proposed by Luo et al., applied an unsupervised method to learn low-dimensional feature representations of drugs and target proteins from heterogeneous data and predicted DTIs using inductive matrix completion. Wen et al. used unsupervised learning to extract representations from the original input descriptors to predict DTIs [20]. Ding et al. used substructure fingerprints, physical and chemical properties of organisms, and DTIs as input features, and further used an SVM for classification [21]. Thafar et al. combined graph embedding and similarity-based techniques for DTI prediction [22]. Deep learning has achieved great success on Euclidean data, but more and more applications need to analyze non-Euclidean data [23]. Interest in extending deep learning to graph data is growing [24]. Driven by deep learning, researchers have designed graph neural network (GNN) architectures based on the ideas of CNNs, LSTMs, and deep autoencoders. Previous work has shown that graph neural networks perform well on DTIs [25, 26], but modeling only the relationships among DTIs cannot fully mine the hidden information in graph data. Therefore, it is necessary to explore the deep information of drugs and target proteins through a graph neural network. Specifically, this study uses a graph convolution network (GCN) to learn a low-dimensional representation of the graph. Unlike ordinary feature representations, the features extracted by a GCN retain the structural information of the graph, which is very helpful for the accuracy of subsequent training.

In DTI prediction, another suitable technology is the recommendation system. A recommendation system is essentially a technical means of helping users find the information they are interested in among a large amount of information when their needs are unclear. It combines user information, item information, and the user’s past behavior towards items, and uses machine learning to build a user interest model that provides accurate personalized recommendations [43, 44]. For DTI prediction with a recommendation system, users can be modeled as drugs and items as targets. A mainstream recommendation method, collaborative filtering, has been integrated into network-based methods, for example dual regularized one-class collaborative filtering [27]. Traditional DTI prediction can be understood as directly taking the molecular fingerprints of drug molecules and the vector representations of target proteins and applying a basic classifier for binary classification [45]; its prediction efficiency and accuracy are low. The proposed GCN_NFM model is novel in using a graph neural network for feature extraction and recommendation-system techniques for DTI prediction, in order to address the low accuracy and high false positive rate of other models.
2 Materials and Methods

2.1 Datasets

The drug-target interaction data used in this study come from DrugBank 5.0 [28]. DrugBank 5.0 is an open, free, and comprehensive database that includes constantly updated drug molecular structures, mechanisms, and drug-target interactions. We downloaded 11,396 known DTIs from DrugBank 5.0, involving 984 drugs and 635 proteins; they are used as the benchmark data set and as positive samples in training. The protein sequence information used in this study comes from the STRING database [30], which can be searched online for known protein interactions. At present, the newest version is 11.5.

2.2 Attribute Representation

2.2.1 Drug Attribute Representation

One of the main difficulties in comparing the similarity of two compounds is the complexity of the task, which depends on the complexity of the molecular characterization. To make the calculation tractable, a degree of simplification or abstraction is required. A molecular fingerprint [34] is an abstract representation of a molecule: it transforms (encodes) the molecule into a bit string (bit vector), which makes molecules easy to compare [46]. In this experiment, qualitative molecular fingerprint descriptors were selected to characterize drug compounds numerically. The molecular fingerprint has become one of the most effective numerical characterizations of drugs because it converts a molecular structure into binary fingerprint features according to the structural fragments it contains [29], as shown in Fig. 1. The key of this method is to detect the presence of specific fragments in the molecular structure of a drug compound and then to encode these fragments as numbers mapped to a binary string, through a hash algorithm or a dictionary-based method, so that the molecule is characterized numerically as an ordered digital fingerprint sequence [47, 48]. It is worth mentioning that drug compound molecules with similar structures are likely to have similar biological activities [35].
Fig. 1. Illustration of the molecular fingerprinting of drugs
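The hash-based encoding described in Sect. 2.2.1 can be illustrated with a deliberately simplified toy. It is not a real cheminformatics fingerprint and not the descriptor used by the authors. As an assumption for illustration only, "fragments" here are text substrings of a SMILES string, whereas real fingerprints enumerate chemical substructures. The bit length of 1024 and the function names are also hypothetical.

```python
import hashlib


def hashed_fingerprint(smiles, n_bits=1024, max_len=3):
    """Toy hashed fingerprint: every substring of the SMILES text of up to
    `max_len` characters is treated as a fragment and hashed to one bit."""
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            digest = hashlib.md5(fragment.encode("utf-8")).digest()
            # A stable hash is used so the bit positions do not change between runs.
            bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto similarity of two bit vectors: shared bits / bits set in either."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


ethanol = hashed_fingerprint("CCO")
propanol = hashed_fingerprint("CCCO")
print(tanimoto(ethanol, ethanol), tanimoto(ethanol, propanol))
```

Because every text fragment of "CCO" also occurs in "CCCO", every bit set for ethanol is also set for propanol, so their similarity is high. This is the sense in which structurally similar molecules receive similar fingerprints.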
2.2.2 Protein Attribute Representation

Protein sequence information is extracted from the STRING database. Proteins are important biological macromolecules [49]. All proteins are polymers of 20 different amino acids, which are divided into four groups: (Ala, Val, Leu, Ile, Met, Phe, Trp, Pro), (Gly, Ser, Thr, Cys, Asn, Gln, Tyr), (Arg, Lys, His), and (Asp, Glu). Then, using the k-mer method [31] with k set to 3, each protein sequence is converted into a 64-dimensional (4 × 4 × 4) feature vector by calculating the frequency of each subsequence in the whole protein sequence. The dimension follows from representing each residue by its group, which gives 4 × 4 × 4 possible 3-mers.

2.3 Graph Convolutional Network

Graph neural networks are divided into five categories [23]: graph convolution networks (GCN), graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks. This paper uses a graph convolution network for feature extraction. The graph convolution network (GCN) [32] is a semi-supervised method that represents topological links as a topological graph. The input of a GCN is the structure of the graph and the features of each node, and the output includes node-level results, graph-level results, and node-level pooling information. It is therefore widely used for non-Euclidean data [50].
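The 3-mer protein encoding of Sect. 2.2.2 can be sketched as follows. The four groups are the ones listed in the text, written with one-letter amino acid codes. The group labels "0" to "3", the normalization by the number of windows, the skipping of unknown residue letters, and the example sequence are assumptions for illustration.

```python
from itertools import product

# The four amino-acid groups from Sect. 2.2.2, as one-letter codes.
GROUPS = {
    "0": "AVLIMFWP",   # Ala, Val, Leu, Ile, Met, Phe, Trp, Pro
    "1": "GSTCNQY",    # Gly, Ser, Thr, Cys, Asn, Gln, Tyr
    "2": "RKH",        # Arg, Lys, His
    "3": "DE",         # Asp, Glu
}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}
K = 3
KMERS = ["".join(p) for p in product("0123", repeat=K)]   # 4 * 4 * 4 = 64 k-mers


def kmer_features(sequence):
    """Frequency of each group-level 3-mer in a protein sequence.
    Letters outside the 20 standard residues are skipped (an assumption)."""
    reduced = "".join(RESIDUE_TO_GROUP[aa] for aa in sequence.upper()
                      if aa in RESIDUE_TO_GROUP)
    counts = dict.fromkeys(KMERS, 0)
    windows = max(len(reduced) - K + 1, 0)
    for i in range(windows):
        counts[reduced[i:i + K]] += 1
    return [counts[m] / windows if windows else 0.0 for m in KMERS]


features = kmer_features("MKTAYIAKQR")   # made-up 10-residue example
print(len(features), sum(features))
```

The example sequence reduces to the group string "0210100212", which contains eight overlapping 3-mers, two of which are "021".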
A graph convolution network extends the convolution operation from traditional data (such as images) to graph data. The core idea is to learn a mapping function f(·) through which a node $v_i$ in the graph aggregates its own feature $x_i$ and the features $x_j$ of its neighbors ($j \in N(v_i)$) to generate a new representation of $v_i$. Graph convolution networks are the basis of many complex graph neural network models, including autoencoder-based models, generative models, and spatio-temporal networks. GCN methods can be divided into two categories: spectral-based and spatial-based. Spectral-based methods define graph convolution by introducing a filter from the perspective of graph signal processing, where the graph convolution operation is interpreted as removing noise from the graph signal. Spatial-based methods represent graph convolution as aggregating feature information from a node's neighborhood. When a graph convolution network operates at the node level, graph pooling modules can be interleaved with the graph convolution layers to coarsen the graph into high-level substructures. This paper adopts the spatial-based method [51, 52]. We do not use the spectral-based method because its graph convolution structure in the spectral domain is fixed: adding or deleting nodes or connections invalidates the previously trained model. Moreover, the original spectral-based convolution network has high time complexity because it needs to decompose the Laplacian matrix. Although the improved version does not need matrix decomposition, it has fewer parameters to learn, so the model complexity is reduced and its expressive power is insufficient. Our model needs to be debugged many times in the training stage, and the spectral-based method would be too inefficient, so we choose the spatial-based method.
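The neighborhood aggregation at the heart of spatial-based graph convolution can be sketched in a few lines. This is a simplified stand-in, not the layer used in GCN_NFM: it takes a plain mean over a node and its neighbors and omits the learned weight matrix and nonlinearity of a real layer. The graph, node names and feature values are hypothetical.

```python
def aggregate(features, neighbors):
    """One spatial aggregation step: each node's new vector is the mean of
    its own vector and its neighbours' vectors (no learned weights)."""
    updated = {}
    for node, own in features.items():
        group = [own] + [features[n] for n in neighbors.get(node, [])]
        updated[node] = [sum(column) / len(group) for column in zip(*group)]
    return updated


# Toy undirected drug-target graph with made-up edges and 2-d features.
neighbors = {
    "drug_a": ["target_x"],
    "drug_b": ["target_x", "target_y"],
    "target_x": ["drug_a", "drug_b"],
    "target_y": ["drug_b"],
}
features = {
    "drug_a": [1.0, 0.0],
    "drug_b": [0.0, 1.0],
    "target_x": [1.0, 1.0],
    "target_y": [0.0, 0.0],
}

hidden = aggregate(features, neighbors)
print(hidden)
```

Because each node only reads from its direct neighbors, adding a node or an edge changes the dictionaries but not the function, which is the flexibility argument made above for the spatial-based method.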
2.4 Neural Factorization Machine Neural factorization machine(NFM) is a cutting-edge model in the recommendation system and an improvement of the traditional factorization machines(FM) model. NFM is mainly oriented to the intersection problem of sparse features, which seamlessly combines the linear intersection of FM to second-order features with the nonlinear intersection of neural network to higher-order features. Although the traditional FM model considers the combination characteristics, its essence is still a linear model, and the representation ability of the model is limited after all. Some subsequent models try to introduce DNN on the basis of FM to strengthen the nonlinear ability of the model, but this kind of model is sensitive to parameters and difficult to train. NFM also introduces DNN based on FM and uses nonlinear structure to learn more data information. Different from the previous model, NFM uses Bi-linear interaction structure to process the second-order cross information, so that the information of cross features can be better learned by DNN structure and reduce the difficulty of DNN learning higher-order cross feature information. Reducing the burden of DNN means that a deeper network structure is no longer needed, so the amount of model parameters is reduced and the model training is more convenient. The structure of NFM model is as follows (Fig. 2):
Drug–Target Interaction Prediction Based on Graph Neural Network
Fig. 2. The illustration of the neural factorization machines model
2.5 Architecture
The inputs of the model are the molecular fingerprints of the processed drug molecules, the feature vectors of the target proteins extracted by the k-mer method, and the DTIs extracted from DrugBank 5.0, which cover 984 drug molecules and 635 target proteins. We propose a new DTI prediction method, GCN_NFM, that combines a graph neural network with a recommendation-system model: the framework first learns low-dimensional representations of drug and protein entities with a graph neural network and then integrates multimodal information through a neural factorization machine (NFM). The structure of the model is shown in Fig. 3. First, drug molecules are represented by molecular fingerprints and the feature vectors of protein molecules are obtained by the k-mer method; an undirected graph is then built from these nodes and the known DTIs. Next, the embeddings of drug molecules and target proteins are learned by the GCN. Finally, the embeddings are combined with the original features of the drug and protein molecules, and NFM performs link prediction to produce the result. In short, this paper makes two contributions: (i) the proposed model uses a GCN to learn low-dimensional node representations rather than relying only on the original features; (ii) the proposed model uses the NFM framework from recommendation systems for link prediction. The proposed method achieves an area under the receiver operating characteristic curve (AUROC) of 0.9457 under 5-fold cross-validation. In addition, we compare the proposed method with several recent methods, and the results indicate that GCN_NFM can effectively and robustly capture undiscovered DTIs.
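As an illustration of the k-mer step, the sketch below turns a protein sequence into a fixed-length frequency vector. The value of k, the 20-letter amino-acid alphabet and the normalization are assumptions made for this example; the section does not state which settings GCN_NFM uses, and the sequence is invented.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: 20 standard residues


def kmer_frequencies(sequence, k=2, alphabet=AMINO_ACIDS):
    """Normalised k-mer counts in a fixed lexicographic k-mer order."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    total = 0
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in index:  # skip windows with non-standard symbols
            counts[index[kmer]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(index)
    return [c / total for c in counts]


# Hypothetical sequence fragment; k = 2 gives a 400-dimensional vector.
vector = kmer_frequencies("MKVLAAGMKV", k=2)
```

Every protein is mapped to a vector of the same length regardless of sequence length, which is what allows the vectors to serve as node features in the graph.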
P. Lei et al.
Fig. 3. The flowchart of the GCN_NFM model. (a) A bipartite graph of DTIs. Solid black lines denote known DTIs, and dashed lines denote latent DTIs. (b) An adjacency graph constructed from the bipartite graph, in which green nodes are drugs and orange nodes are targets. (c) The left part represents the information extracted by the GCN, and the right part represents the original features (molecular fingerprints and protein descriptors). (d) The integration of multimodal information by NFM.
3 Result and Discussion
3.1 Evaluation Criteria
Evaluation criteria used in our experiment include overall prediction accuracy (Accu.), sensitivity (Sen.), specificity (Spec.), precision (Prec.) and Matthews correlation coefficient (MCC). The calculation formulas are listed below:
\[ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
\[ \text{sensitivity} = \frac{TP}{TP + FN} = \text{recall} \tag{2} \]
\[ \text{specificity} = \frac{TN}{TN + FP} \tag{3} \]
\[ \text{precision} = \frac{TP}{TP + FP} \tag{4} \]
\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5} \]
where true positive (TP) is the number of drug-target pairs correctly classified as interacting; false positive (FP) is the number of samples incorrectly classified as
interacting; true negative (TN) is the number of samples correctly classified as non-interacting; false negative (FN) is the number of samples incorrectly classified as non-interacting. To display the results intuitively, we use the receiver operating characteristic (ROC) curve [33] to evaluate the ability of the classifier and calculate the area under the curve (AUC).
3.2 Performance Evaluation of GCN_NFM Using 5-Fold Cross-Validation
To evaluate the stability and robustness of GCN_NFM, we use 5-fold cross-validation. The original dataset is randomly divided into five parts; in each round, four parts form the training set and the remaining part is used as the test set. The procedure is repeated five times, and the average over the five rounds is taken as the final evaluation index of the model, which reduces the influence of any single split and helps reveal over-fitting or under-fitting. The ROC curve reflects the ability of the model, and AUC is the area under the ROC curve. We use the AUC value as the evaluation standard because two ROC curves cannot always be ranked by inspection, whereas AUC is a single value and the classifier with the larger AUC is better. The closer the ROC curve is to the upper left corner, the better the model performs and the higher its AUC. Our experimental results are shown in Table 1. GCN_NFM performs well on all indicators, and the results of the individual folds are close to one another, indicating that the model has good stability and robustness.
Table 1. Five-fold cross-validation results by GCN_NFM model
| Fold | Spec. (%) | Sen. (%) | Prec. (%) | MCC (%) | Acc. (%) | AUC (%) |
| 0 | 93.66 | 82.33 | 92.77 | 77.81 | 88.95 | 95.13 |
| 1 | 93.48 | 83.52 | 92.68 | 78.05 | 89.36 | 94.35 |
| 2 | 94.08 | 85.30 | 92.86 | 77.63 | 89.63 | 93.89 |
| 3 | 93.52 | 84.67 | 93.09 | 77.55 | 88.84 | 94.67 |
| 4 | 93.35 | 83.88 | 92.55 | 77.13 | 88.76 | 94.80 |
| Average | 93.62 ± 0.28 | 83.94 ± 1.14 | 92.79 ± 0.20 | 77.63 ± 0.34 | 89.11 ± 0.37 | 94.57 ± 0.47 |
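The criteria in Eqs. (1)–(5) and the fold averaging in Table 1 can be sketched as follows. The confusion-matrix counts are hypothetical, the per-fold values are copied from Table 1, and the use of the sample standard deviation (n − 1 denominator) is an assumption that happens to reproduce the reported Average row.

```python
import math
import statistics


def classification_metrics(tp, tn, fp, fn):
    """Eqs. (1)-(5) computed from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


def mean_and_std(values):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical counts, not taken from the paper.
example = classification_metrics(tp=80, tn=90, fp=10, fn=20)

# Per-fold AUC and Spec. values copied from Table 1.
auc_folds = [95.13, 94.35, 93.89, 94.67, 94.80]
spec_folds = [93.66, 93.48, 94.08, 93.52, 93.35]
auc_mean, auc_std = mean_and_std(auc_folds)
spec_mean, spec_std = mean_and_std(spec_folds)
print(f"AUC  {auc_mean:.2f} ± {auc_std:.2f}")
print(f"Spec {spec_mean:.2f} ± {spec_std:.2f}")
```

With these inputs the two printed lines agree with the Average row of Table 1 (94.57 ± 0.47 and 93.62 ± 0.28).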
3.3 Comparison of GCN_NFM with Different Machine Learning Algorithms
In this experiment, we not only embed the features through the graph neural network and integrate them with the original features, but also use the NFM model commonly used in recommendation systems. By comparing GCN_NFM with other classification algorithms, namely logistic regression (LR), k-nearest neighbor (KNN), gradient boosting decision tree (GBDT) and random forest (RF), we can see its advantages directly. To make the comparison
fairer and more objective, all classification algorithms use their default parameters. Table 2 lists the five-fold cross-validation results of the models. GCN_NFM outperforms the other models on every metric except sensitivity, where KNN is higher.
Table 2. Comparison of different machine learning models
| Models | Spec. (%) | Sen. (%) | Prec. (%) | MCC (%) | Acc. (%) | AUC (%) |
| LR | 68.51 ± 1.49 | 76.57 ± 1.01 | 70.86 ± 1.21 | 45.23 ± 2.47 | 72.54 ± 1.23 | 78.26 ± 0.78 |
| KNN | 49.15 ± 2.69 | 92.99 ± 0.68 | 64.67 ± 1.09 | 46.90 ± 1.82 | 71.07 ± 1.15 | 82.63 ± 0.46 |
| GBDT | 89.41 ± 0.26 | 80.54 ± 0.65 | 88.38 ± 0.19 | 70.23 ± 0.41 | 84.98 ± 0.23 | 91.62 ± 0.38 |
| RF | 87.38 ± 0.17 | 71.42 ± 1.15 | 84.38 ± 0.21 | 71.58 ± 0.48 | 86.54 ± 0.33 | 88.39 ± 0.50 |
| GCN_NFM | 93.62 ± 0.28 | 83.94 ± 1.14 | 92.79 ± 0.20 | 77.63 ± 0.34 | 89.11 ± 0.37 | 94.57 ± 0.47 |
The above results can be explained as follows: (i) for logistic regression, the DTI data are highly complex, which makes it difficult to find a linear decision surface, so the model fits the features poorly; (ii) the k-nearest neighbor method obtains its information from the nearest neighbors, and because our preprocessing had already fused the attributes of adjacent nodes into each sample, its classification efficiency and accuracy are low; (iii) gradient boosting decision tree and random forest are ensemble classifiers. Although they can make up for the shortcomings of a single classifier, they are still limited and their classification accuracy remains insufficient. The GCN_NFM model used in this experiment learns low-dimensional representations of drug and protein entities through the graph neural network, integrates them with the original features, and combines the multimodal information through the neural factorization machine (NFM), so it achieves better results.
3.4 Comparison of GCN_NFM with Existing State-of-the-Art Prediction Methods
To evaluate the proposed method further, we compare it with other advanced methods. Like our method, these methods use a specific feature extraction method and a link prediction method to predict drug-target interactions. The methods proposed by Chen et al. [36] and Ji et al. [37] consider the network information of nodes. Although they can represent the local information of nodes in the network, our GCN_NFM model extracts node information more fully, and its AUROC and ACC are higher than those of the other methods, as shown in Table 3.
Table 3. Comparison of existing state-of-the-art prediction methods
| Methods | Datasets | AUROC | ACC |
| Chen et al. methods | DrugBank | 0.9206 | 0.8545 |
| Ji et al. methods | DrugBank | 0.9233 | 0.8583 |
| GCN_NFM | DrugBank | 0.9457 | 0.8911 |
4 Conclusions
We propose a new model, GCN_NFM, for predicting drug-target interactions. The model addresses both feature extraction and link prediction. For feature extraction, a graph convolution network extracts node features, which are fused with the original features. The NFM model from recommendation systems is then used for link prediction to improve prediction accuracy. Experiments show that GCN_NFM has good stability and robustness. Compared with traditional models that treat link prediction as plain binary classification, our model extracts neighborhood information through the graph neural network and predicts links through the NFM model, which improves prediction accuracy. Under 5-fold cross-validation, the area under the receiver operating characteristic curve (AUROC) obtained by this method is 0.9457, and the results indicate that GCN_NFM can effectively and robustly capture undiscovered DTIs.
Acknowledgements. This work was supported by the grant of National Key R&D Program of China (No. 2018YFA0902600 & 2018AAA0100100) and partly supported by National Natural Science Foundation of China (Grant nos. 61732012, 62002266, 61932008, and 62073231), and Introduction Plan of High-end Foreign Experts (Grant no. G2021033002L) and, respectively, supported by the Key Project of Science and Technology of Guangxi (Grant no. 2021AB20147), Guangxi Natural Science Foundation (Grant nos. 2021JJA170204 & 2021JJA170199) and Guangxi Science and Technology Base and Talents Special Project (Grant nos. 2021AC19354 & 2021AC19394).
References
1. Mullard, A.: New drugs cost US$2.6 billion to develop. Nat. Rev. Drug Discov. 13 (2014)
2. Nic, F.: How artificial intelligence is changing drug discovery. Nature 557(7707), S55 (2018)
3. Smalley, E.: AI-powered drug discovery captures pharma interest. Nat. Biotechnol. 35 (2017)
4. Gschwend, D.A., Good, A.C., Kuntz, I.D.: Molecular Docking Towards Drug Discovery, vol. 9, issue 2, pp. 175–186. Wiley (1996)
5. Mayr, A., Klambauer, G., Unterthiner, T., et al.: Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem. Sci. 9(24), 5441–5451 (2018)
6. Sydow, D., Burggraaff, L., Szengel, A., et al.: Advances and challenges in computational target prediction. J. Chem. Inf. Model. 59(5), 1728–1742 (2019)
7. Li, J., Zheng, S., Chen, B., et al.: A survey of current trends in computational drug repositioning. Brief. Bioinform. 17(1), 2–12 (2016)
8. Napolitano, F., Zhao, Y., Moreira, V.M., et al.: Drug repositioning: a machine-learning approach through data integration. J. Cheminform. 5(1), 1–9 (2013)
9. Wu, C., Gudivada, R.C., Aronow, B.J., et al.: Computational drug repositioning through heterogeneous network clustering. BMC Syst. Biol. 7(5), 1–9 (2013)
10. Kinnings, S.L., Liu, N., Buchmeier, N., et al.: Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and extensively drug resistant tuberculosis. PLoS Comput. Biol. 5(7), e1000423 (2009)
11. Liu, Z., Fang, H., Reagan, K., et al.: In silico drug repositioning–what we need to know. Drug Discov. Today 18(3–4), 110–115 (2013)
12. Bagherian, M., Sabeti, E., Wang, K., et al.: Machine learning approaches and databases for prediction of drug–target interaction: a survey paper. Brief. Bioinform. 22(1), 247–269 (2021)
13. Agamah, F.E., Mazandu, G.K., Hassan, R., et al.: Computational/in silico methods in drug target and lead prediction. Brief. Bioinform. 21(5), 1663–1675 (2020)
14. Manoochehri, H.E., Nourani, M.: Drug-target interaction prediction using semi-bipartite graph model and deep learning. BMC Bioinform. 21(4), 1–16 (2020)
15. D’Souza, S., Prema, K.V., Balaji, S.: Machine learning models for drug–target interactions: current knowledge and future directions. Drug Discov. Today 25(4), 748–756 (2020)
16. Xue, H., Li, J., Xie, H., et al.: Review of drug repositioning approaches and resources. Int. J. Biol. Sci. 14(10), 1232 (2018)
17. Luo, H., Li, M., Yang, M., et al.: Biomedical data and computational models for drug repositioning: a comprehensive review. Brief. Bioinform. 22(2), 1604–1619 (2021)
18. Yella, J.K., Yaddanapudi, S., Wang, Y., et al.: Changing trends in computational drug repositioning. Pharmaceuticals 11(2), 57 (2018)
19. Luo, Y., Zhao, X., Zhou, J., et al.: A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun. 8(1), 1–13 (2017)
20. Wen, M., Zhang, Z., Niu, S., et al.: Deep-learning-based drug–target interaction prediction. J. Proteome Res. 16(4), 1401–1409 (2017)
21. Ding, Y., Tang, J., Guo, F.: Identification of drug-target interactions via multiple information integration. Inf. Sci. 418, 546–560 (2017)
22. Thafar, M.A., Olayan, R.S., Ashoor, H., et al.: DTiGEMS+: drug–target interaction prediction using graph embedding, graph mining, and similarity-based techniques. J. Cheminform. 12(1), 1–17 (2020)
23. Wu, Z., Pan, S., Chen, F., et al.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst. 32(1), 4–24 (2020)
24. Cheng, T., Hao, M., Takeda, T., et al.: Large-scale prediction of drug-target interaction: a data-centric review. AAPS J. 19(5), 1264–1275 (2017)
25. Zhao, T., Hu, Y., Valsdottir, L.R., et al.: Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief. Bioinform. 22(2), 2141–2150 (2021)
26. Torng, W., Altman, R.B.: Graph convolutional neural networks for predicting drug-target interactions. J. Chem. Inf. Model. 59(10), 4131–4149 (2019)
27. Lim, H., Poleksic, A., Yao, Y., et al.: Large-scale off-target identification using fast and accurate dual regularized one-class collaborative filtering and its application to drug repurposing. PLoS Comput. Biol. 12(10), e1005135 (2016)
28. Wishart, D.S., Feunang, Y.D., Guo, A.C., et al.: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46(D1), D1074–D1082 (2018)
29. Li, Y., Liu, X., You, Z.H., et al.: A computational approach for predicting drug–target interactions from protein sequence and drug substructure fingerprint information. Int. J. Intell. Syst. 36(1), 593–609 (2021)
30. Szklarczyk, D., Morris, J.H., Cook, H., et al.: The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. 2016, gkw937 (2016)
31. Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013)
32. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
33. Zweig, M.H., Campbell, G.: Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin. Chem. 39(4), 561–577 (1993)
34. Rogers, D., Hahn, M.: Extended-connectivity fingerprints. J. Chem. Inf. Model. 50(5), 742–754 (2010)
35. Maggiora, G., Vogt, M., Stumpfe, D., et al.: Molecular similarity in medicinal chemistry: miniperspective. J. Med. Chem. 57(8), 3186–3204 (2014)
36. Chen, Z.H., You, Z.H., Guo, Z.H., et al.: Prediction of drug–target interactions from multimolecular network based on deep walk embedding model. Front. Bioeng. Biotechnol. 8, 338 (2020)
37. Ji, B.Y., You, Z.H., Jiang, H.J., et al.: Prediction of drug-target interactions from multimolecular network based on LINE network representation method. J. Transl. Med. 18(1), 1–11 (2020)
38. Shen, Z., Zhang, Q., Han, K., et al.: A deep learning model for RNA-protein binding preference prediction based on hierarchical LSTM and attention network. IEEE/ACM Trans. Comput. Biol. Bioinf. (2020)
39. Zhang, Q., Shen, Z., Huang, D.S.: Predicting in-vitro transcription factor binding sites using DNA sequence+ shape. IEEE/ACM Trans. Comput. Biol. Bioinf. 18(2), 667–676 (2019)
40. Shen, Z., Deng, S.P., Huang, D.S.: Capsule network for predicting RNA-protein binding preferences using hybrid feature. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(5), 1483–1492 (2019)
41. Zhu, L., Li, N., Bao, W., et al.: Learning regulatory motifs by direct optimization of Fisher Exact Test Score. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 86–91. IEEE (2016)
42. Shen, Z., Deng, S.P., Huang, D.S.: RNA-protein binding sites prediction via multi scale convolutional gated recurrent unit networks. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(5), 1741–1750 (2019)
43. Zhang, Q., Zhu, L., Bao, W., et al.: Weakly-supervised convolutional neural network architecture for predicting protein-DNA binding. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(2), 679–689 (2018)
44. Zhang, Q., Zhu, L., Huang, D.S.: High-order convolutional neural network architecture for predicting DNA-protein binding sites. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(4), 1184–1192 (2018)
45. Zhang, Q., Shen, Z., Huang, D.S.: Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci. Rep. 9(1), 1–12 (2019)
46. Xu, W., Zhu, L., Huang, D.S.: DCDE: an efficient deep convolutional divergence encoding method for human promoter recognition. IEEE Trans. Nanobiosci. 18(2), 136–145 (2019)
47. Shen, Z., Bao, W., Huang, D.S.: Recurrent neural network for predicting transcription factor binding sites. Sci. Rep. 8(1), 1–10 (2018)
48. Zhang, H., Zhu, L., Huang, D.S.: DiscMLA: an efficient discriminative motif learning algorithm over high-throughput datasets. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(6), 1810–1820 (2016)
49. Zhu, L., Zhang, H.B., Huang, D.: LMMO: a large margin approach for refining regulatory motifs. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(3), 913–925 (2017)
50. Shen, Z., Zhang, Y.H., Han, K., et al.: miRNA-disease association prediction with collaborative matrix factorization. Complexity 2017, 1–9 (2017)
51. Zhu, L., Zhang, H.B., Huang, D.S.: Direct AUC optimization of regulatory motifs. Bioinformatics 33(14), i243–i251 (2017)
52. Zhang, H., Zhu, L., Huang, D.S.: WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci. Rep. 7(1), 1–12 (2017)
NSAP: A Neighborhood Subgraph Aggregation Method for Drug-Disease Association Prediction
Qiqi Jiao1, Yu Jiang1, Yang Zhang3, Yadong Wang1,2, and Junyi Li1(B)
1 School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China
[email protected]
2 Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, Heilongjiang, China
3 College of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China
Abstract. Exploring the association between drugs and diseases can help accelerate drug development. To investigate this association, this paper constructs a network composed of different types of nodes and proposes NSAP, a model based on neighborhood subgraph prediction. The model captures local and global information around the target node through metagraphs and contextual graphs, respectively, and generates information-rich node representations. In both metagraphs and contextual graphs, the model exploits the graph structure to generate edge weights automatically, which better reflects how strongly each neighbor node is associated with the target node. An attention mechanism then aggregates the node representations generated by different metapaths, so that the final node representation incorporates different semantic information. For edge prediction, a decoder computes a correlation score for each drug-disease node pair. Experimental comparison with a state-of-the-art method confirms that our model is effective. The data and code are available at: https://github.com/jqq125/NSAP.
Keywords: Drug disease association prediction · Heterogeneous network · Network representation method · Attention mechanism · Link prediction
1 Introduction
Drug-disease association prediction explores possible unknown indications for drugs on the basis of the existing relations between drugs and diseases [1, 2]. Domestic research on drug-disease association prediction started slowly, but it has developed steadily over the last few years. The related research relies mainly on computational techniques and clinical trials: the former is oriented towards prediction, while the latter is oriented towards verification. The work in this paper concerns drug-disease association prediction based on computational techniques, which is also known as computational drug repositioning.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 79–91, 2022. https://doi.org/10.1007/978-3-031-13829-4_7
Computational drug repositioning is an important task in pharmaceutical research. What motivates this study? First, from a cost point of view, traditional drug development takes decades and costs a fortune to bring a new drug to the market [3, 4], yet the number of approved drugs has not increased significantly despite the huge investment of time and funds. Computational drug repositioning takes less time and carries relatively low cost and risk. Second, for sudden outbreaks of large-scale epidemic diseases, traditional drug research and development alone cannot provide effective control and intervention quickly. The gap between the urgency of disease development and the lag in drug supply makes computational drug repositioning an increasingly important strategy, and it allows researchers to shift their focus to studying existing antiviral drugs. Last but not least, rare and orphan diseases receive less attention because of the small size of the potential market, but the affected populations may benefit from drug repositioning in the future [2]. As biological methods and related technologies improve, large amounts of biomedical data (for instance, protein information, drug information and drug-protein associations) have been included in open databases, which accelerates research in biology. In addition, techniques such as machine learning, neural networks, data mining and network analysis enable researchers to model and analyze multi-source biomedical data. This helps to explore the intrinsic relationships of diseases and the mechanisms of action of drugs, and accelerates research on drug-disease association prediction.
Much research on drug-disease association prediction has emerged in recent years. Some studies manually extract features from data sources and combine them with machine learning classifiers for training and prediction. However, these approaches have unavoidable problems, such as low efficiency caused by excessive manual feature extraction and statistical bias caused by the uneven progress of research on different diseases and drugs. Efficiency can be improved by building a network that combines multi-source data and generating node representations. Network-based approaches that generate node embeddings are known as network representation learning methods. Network representation learning generates node and graph structure representations by encoding latent information about the local and global structure of the network. Common network representation methods are based on random walks, matrix factorization or graph neural networks. Classical random-walk-based methods include Deepwalk [5], Node2vec [6] and TP-NRWRH [7]. Classical matrix-factorization-based methods include TADW [8], GraRep [9] and NRLMF [10]. Classical graph-neural-network-based methods include GCN [11], GAT [12] and GraphSage [13]. Random-walk-based approaches are mostly designed for homogeneous networks. They cannot treat different types of nodes and edges differently, although these types contribute differently to the learned graph representation. Matrix-factorization-based methods must factorize the representation matrix of the entire graph, which results in excessively high computational complexity.
Moreover, learning the representation of a new node requires factorizing the whole graph matrix again, which hinders scalability. Graph-neural-network-based methods can not only use node properties and structures to map nodes from non-Euclidean space to Euclidean space, but also feed the resulting representations into subsequent tasks. Starting from the drug-disease association prediction problem, we propose a new method for generating node representations. To take advantage of the known associations among multiple data sources, the nodes selected in this paper include not only drug nodes and disease nodes, but also target nodes and gene nodes. Introducing multiple node types helps to mine semantic information and capture the topology. We propose a new model named NSAP, a neighborhood subgraph aggregation method for link prediction. Link prediction has two core tasks: generating high-quality representations of the nodes in the network, and determining whether an edge exists between a node pair. To accomplish these two tasks, we introduce two kinds of neighborhood subgraphs, namely metagraphs and contextual graphs. The specific research work is summarized below:
(i) Reduced loss of information during model integration. Compared with methods that learn from each data source separately and then integrate the results, the proposed model directly integrates multi-source data into one heterogeneous network for learning, which avoids information loss before model integration as far as possible.
(ii) Improved quality of node representations. We propose a link prediction architecture based on neighborhood graphs. For each node under study, we extract the third-order neighborhood graph, generate the metagraph and contextual graph from it, and then aggregate the information from the metagraph and contextual graph to generate node representations. Experimental results show that our model performs well.
(iii) Better capture of the likelihood that an edge exists between a node pair. The structure of the neighborhood subgraph is used to generate edge weights automatically, which measures the degree of association between different node pairs more accurately and benefits the subsequent edge prediction task.
2 Dataset
The data used in this project comprise four types of entities: drugs, proteins, diseases and genes. They are derived from the following databases: DrugBank [14], UniProt [15], MalaCards [16], repoDB [17], DisGeNET [18] and ClinicalTrials.gov [19]. DrugBank is a comprehensive drug database; UniProt is a widely used protein database; MalaCards is a database of human diseases and their annotations; DisGeNET contains a large amount of variant and gene information related to human diseases; repoDB is a standard drug repositioning dataset; and ClinicalTrials.gov is a website about clinical studies. After
crawling the above databases, we obtained 1482 drug nodes, 793 disease nodes, 2077 protein nodes and 6365 gene nodes. We also obtained the edge data: 11540 edges between drugs and diseases, 11407 edges between drugs and proteins, and 18844 edges between diseases and genes. Table 1 summarizes this information.
Table 1. Biolink dataset information
| Relations (A–B) | Number of A | Number of B | Number of A–B |
| Drug-disease | 1482 | 793 | 11540 |
| Drug-protein | 1482 | 2077 | 11407 |
| Disease-gene | 793 | 6365 | 18844 |
In every network, the weight of an edge is set to 1 if the two nodes are connected and to 0 otherwise. Each node in the network can carry a vector as its feature.
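The binary edge weighting can be sketched as follows, assuming each relation is stored as a list of node-identifier pairs. The identifiers and the helper name are hypothetical and are not taken from the released NSAP code.

```python
def binary_adjacency(row_ids, col_ids, edges):
    """Return a |rows| x |cols| 0/1 matrix for one relation type.
    Repeated edges still yield weight 1."""
    row_pos = {node: i for i, node in enumerate(row_ids)}
    col_pos = {node: j for j, node in enumerate(col_ids)}
    matrix = [[0] * len(col_ids) for _ in row_ids]
    for a, b in edges:
        matrix[row_pos[a]][col_pos[b]] = 1
    return matrix


# Hypothetical drug-disease relation with one duplicated record.
drugs = ["u1", "u2", "u3"]
diseases = ["s1", "s2"]
drug_disease_edges = [("u1", "s1"), ("u3", "s2"), ("u1", "s1")]
adj = binary_adjacency(drugs, diseases, drug_disease_edges)
```

One such matrix per relation in Table 1 (drug-disease, drug-protein, disease-gene) is enough to describe the heterogeneous network in this form.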
3 Method
This section introduces the components and implementation details of the NSAP model. NSAP consists of four main parts: neighborhood graph generation, metagraph and contextual graph extraction, metagraph and contextual graph aggregation, and link prediction between drug-disease node pairs. Figure 1 shows the framework of the proposed NSAP model.
3.1 Neighborhood Graph Extraction
The neighborhood graph is extracted from the existing connections of the target node in the dataset. The neighborhood radius R is the path length travelled from the target node. The Nth-order neighborhood graph covers the neighbors of the target node from the first hop to the Nth hop. Figure 1(a) shows the third-order neighborhood graph that we use. A neighborhood subgraph is a subgraph of a neighborhood graph, and there are two types: the metagraph and the contextual graph. The metagraph is the homogeneous graph generated by walking from the target node along a specific metapath, and the contextual graph is the neighborhood subgraph around the K-hop neighbors of the target node, where K is the length of the metapath. Before describing metagraph extraction, we briefly explain its rationale. Our goal is to find new association edges between drugs and diseases. In a heterogeneous network, nodes of the same type can be associated with each other through nodes of other types. Exploiting these diverse relationships among nodes alleviates the sparseness of the existing edges to a certain extent and captures many potential correlations, which improves prediction. The rationale is as follows: a drug may be used to treat multiple diseases; drugs that target the same protein may function similarly; and multiple diseases may be related to the same gene, in which case they may be cured by the same drug. As Fig. 2(c)
Fig. 1. The framework of the proposed NSAP model. (a) Neighborhood graph extraction. (b) Metagraph and contextual graph extraction. (c) Aggregation of the metagraph and contextual graph to generate the embedding of the target node. (d) Link prediction between drug-disease node pairs, where h_u is the embedding of the drug node and h_v is the embedding of the disease node.
shows, disease s5 is known to be associated with gene Gene3, but few drugs are currently used for disease s5. To further explore potential drugs, we consider the relationship between diseases and genes. Given that both disease s5 and disease s6 are related to gene Gene3, and disease s6 can be cured by drug u3, disease s5 may also be cured by drug u3.
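The inference in this example can be written as a short sketch: follow disease-gene-disease links to find related diseases, then collect the drugs of those diseases. The edge lists below contain only the three associations named in the text, and the function name is hypothetical; this illustrates the reasoning, not the NSAP scoring procedure.

```python
from collections import defaultdict


def candidate_drugs(disease, disease_gene, drug_disease):
    """Drugs linked to other diseases that share a gene with `disease`,
    excluding drugs already linked to `disease` itself."""
    gene_to_diseases = defaultdict(set)
    for d, g in disease_gene:
        gene_to_diseases[g].add(d)
    disease_to_drugs = defaultdict(set)
    for u, d in drug_disease:
        disease_to_drugs[d].add(u)

    # Step 1: diseases reachable via disease-gene-disease.
    related = set()
    for d, g in disease_gene:
        if d == disease:
            related |= gene_to_diseases[g] - {disease}
    # Step 2: drugs known for those related diseases.
    found = set()
    for other in related:
        found |= disease_to_drugs[other]
    return found - disease_to_drugs[disease]


# Associations named in the example: s5-Gene3, s6-Gene3 and u3-s6.
disease_gene = [("s5", "Gene3"), ("s6", "Gene3")]
drug_disease = [("u3", "s6")]
suggested = candidate_drugs("s5", disease_gene, drug_disease)
```

For s5 the sketch returns the single candidate u3, matching the conclusion drawn from Fig. 2(c).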
Fig. 2. Explanation of the theoretical basis for meta-path selection.
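The reasoning behind the disease-gene-disease metapath can be illustrated with a small sketch. The function name `candidate_drugs` and the edge lists are hypothetical and only mirror the s5/s6/Gene3/u3 example; this is not the authors' implementation.

```python
from collections import defaultdict


def candidate_drugs(disease, disease_gene, drug_disease):
    """Drugs that treat another disease sharing a gene with `disease`."""
    gene_to_diseases = defaultdict(set)
    disease_to_genes = defaultdict(set)
    for d, g in disease_gene:
        gene_to_diseases[g].add(d)
        disease_to_genes[d].add(g)

    disease_to_drugs = defaultdict(set)
    for u, d in drug_disease:
        disease_to_drugs[d].add(u)

    candidates = set()
    for g in disease_to_genes[disease]:
        # Other diseases related to the same gene.
        for other in gene_to_diseases[g] - {disease}:
            candidates |= disease_to_drugs[other]
    # Drop drugs already known to treat the query disease.
    return candidates - disease_to_drugs[disease]


# Toy edges mirroring the example above (identifiers are illustrative).
disease_gene = [("s5", "Gene3"), ("s6", "Gene3")]
drug_disease = [("u3", "s6")]
print(candidate_drugs("s5", disease_gene, drug_disease))
```

In this toy network, u3 is proposed for s5 only because s5 and s6 share Gene3; a learned model replaces this hard rule with weighted aggregation.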
Q. Jiao et al.
Based on the above assumptions, we can mine the metapath patterns corresponding to drugs and diseases from multi-source heterogeneous networks. Sun et al. define a metapath as $P = A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_{n-2}} A_{n-1} \xrightarrow{R_{n-1}} A_n$, where $A_i \in \mathcal{A}$ denotes a node type and $R_j \in \mathcal{R}$ denotes an edge type. We mined four metapath patterns: two between drugs and two between diseases, as shown in Table 2. Building a heterogeneous network by fusing multi-source data makes it possible to distinguish the importance of the association information carried by different types of paths, which reflects the advantage of heterogeneous networks.
Table 2. The metapaths used in our model
Metapath category    Metapath (abbreviation)
Drug                 Drug-protein-drug (utu)
Drug                 Drug-disease-drug (usu)
Disease              Disease-drug-disease (sus)
Disease              Disease-gene-disease (sgs)
3.2 Metagraph and Contextual Graph Extraction
Metagraph Extraction. A metagraph is a homogeneous graph generated by walking from a target node along a specific metapath. It is used to mine semantic information and to capture topology information around the target node. The number of steps taken along the metapath is denoted by K, and its value usually does not exceed 4, because it is generally believed that if two nodes are related, the shortest path between them is usually no longer than 4. Figure 3(c) shows the walk from a drug target node along the metapath selected in Fig. 3(b) with K = 2, and Fig. 3(d) shows the corresponding metagraph.
Fig. 3. Metagraph description.
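A minimal sketch of metagraph extraction for a K = 2 metapath such as drug-protein-drug (utu) is given below. The adjacency dictionary, the node-type map and the function names are assumptions made for illustration; the sketch also assumes a symmetric metapath, as in Table 2, so that metagraph edges can be stored as unordered pairs.

```python
def metapath_neighbors(adj, node_type, start, metapath):
    """Nodes reached from `start` by walks whose node types follow `metapath`."""
    if node_type[start] != metapath[0]:
        return set()
    frontier = {start}
    for wanted in metapath[1:]:
        # Advance one hop, keeping only neighbors of the required type.
        frontier = {nbr for node in frontier for nbr in adj[node]
                    if node_type[nbr] == wanted}
    return frontier - {start}


def build_metagraph(adj, node_type, metapath):
    """Homogeneous edge set linking nodes joined by at least one metapath instance."""
    edges = set()
    for node in adj:
        for other in metapath_neighbors(adj, node_type, node, metapath):
            edges.add(frozenset((node, other)))
    return edges


# Toy heterogeneous network: drugs u1..u3 and proteins t1, t2 (illustrative).
adj = {
    "u1": ["t1"], "u2": ["t1", "t2"], "u3": ["t2"],
    "t1": ["u1", "u2"], "t2": ["u2", "u3"],
}
node_type = {"u1": "drug", "u2": "drug", "u3": "drug",
             "t1": "protein", "t2": "protein"}
utu = ["drug", "protein", "drug"]
metagraph = build_metagraph(adj, node_type, utu)
```

Here u1 and u3 are not linked in the metagraph because they share no protein, while u2 is linked to both.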
Contextual Graph Extraction. A contextual graph is the first-order neighborhood graph around a Kth-hop neighbor of the center node; this Kth-hop neighbor is also called a context node. The contextual graph covers the specific first-order neighbors of the context node in the neighborhood graph and thus shows the topology around the context node. Figure 1(b) shows a contextual graph extracted from the neighborhood graph, where R1 and R2 denote the neighborhood range of the contextual graph, with R1 = K − 1 and R2 = K + 1. Nodes drawn as dotted circles in the contextual graph are context nodes, which are shared by the metagraph and the contextual graph. These context nodes show how the metagraph and the contextual graph are connected to each other.
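The relation between the context node at hop K and the range R1 = K − 1 to R2 = K + 1 can be checked with a small sketch. The graph, the choice of context node and the helper names below are illustrative assumptions, not the authors' code.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def contextual_graph(adj, context_nodes):
    """Edges between each context node and its first-order neighbors."""
    return {(c, nbr) for c in context_nodes for nbr in adj[c]}


# Toy neighborhood graph around target drug u1 (illustrative identifiers).
adj = {
    "u1": ["t1"],
    "t1": ["u1", "u2"],
    "u2": ["t1", "s1", "t2"],
    "s1": ["u2"],
    "t2": ["u2"],
}
K = 2
dist = hop_distances(adj, "u1")
context_nodes = [n for n, d in dist.items() if d == K]  # here: ["u2"]
edges = contextual_graph(adj, context_nodes)
neighbor_hops = {dist[nbr] for _, nbr in edges}
```

In this example every neighbor of the context node lies at hop K − 1 or K + 1 from the target, which matches the range given by R1 and R2.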
3.3 Metagraph and Contextual Graph Aggregation
Node Mapping. Before aggregating the metagraph and the contextual graph, we first map the different types of nodes. Because nodes of different types usually have different properties and initially lie in different feature spaces, they need to be projected into the same feature space to eliminate these differences. A node u of type A is mapped as follows:
$$h_u = T_A \cdot h_u^{0} \tag{1}$$
where $h_u^{0}$ is the original feature vector of node u, $T_A$ is the mapping matrix for nodes of type A, and $h_u$ is the feature vector of node u after mapping.
Metagraph Aggregation. Metagraph aggregation focuses on the same-type nodes obtained from the selected metapaths. By fusing the information of these nodes, we can capture the semantic information implied by the metapaths, as well as the topology around the target node, thereby enriching the features of the target node. First, we select a metapath for the target node to obtain nodes of the same type. By encoding these same-type nodes, we obtain a comprehensive feature $h_u^{m_i}$ for the particular metapath $p_i$. The encoding is given in Eq. (2):
$$h_u^{m_i} = \mathrm{SUM}\left(\alpha \cdot h_v\right), \quad \forall v \in G_u^{m_i} \tag{2}$$
where v is a same-type neighbor of the target node u reached by walking along the metapath $p_i$, and $\alpha$ is the corresponding weight of the neighbor node, whose value is generated automatically from the metagraph structure. To aggregate the node features generated by the different metapaths of the target node, we need to calculate the weight of each metapath in the aggregation. First, we calculate the average score $c_{m_i}$ of the same-type nodes under the metapath $p_i$. Then we apply softmax to obtain the normalized score $\beta_{m_i}$; softmax makes the weights of important metapaths more prominent. These normalized scores are then used to aggregate the features of the target node under the different metapaths to acquire
a comprehensive feature $h_u^{m}$ of the target node u. The aggregation method is shown in Eq. (3):
$$\begin{aligned} c_{m_i} &= \frac{1}{|V_A|} \sum_{u \in V_A} \mathrm{ActivateFunc}\left(r^{T} \cdot h_u^{m_i}\right) \\ \beta_{m_i} &= \mathrm{softmax}_{p_i}\left(c_{m_i}\right) = \frac{\exp\left(c_{m_i}\right)}{\sum_{p_j \in P} \exp\left(c_{m_j}\right)} \\ h_u^{m} &= \sum_{p_i \in P} \beta_{m_i} \cdot h_u^{m_i} \end{aligned} \tag{3}$$
where r is the learnable weight parameter. In addition, a multi-head attention mechanism can be introduced to make the learning process more stable. The specific approach is as follows:
$$z_u^{m} = \Big\Vert_{n=1}^{N} \sigma\left(\sum_{p_i \in P} [\alpha_{p_i}]_n \cdot h_u^{m_i}\right) \tag{4}$$
where N is the number of attention heads and $\alpha_{p_i}$ is the attention score of the metapath $p_i$.
Contextual Graph Aggregation. Contextual graph aggregation is primarily used to aggregate global topology information around the target node. Because context nodes are shared by the metagraph and the contextual graph, they can serve as aggregation anchors for the contextual graph to which each context node belongs. To aggregate all the context nodes into the features of the contextual graph, we build a virtual node and connect all the context nodes to it. The operation is divided into two steps. First, for the contextual graph $G_u^{c_i}$ generated from a specific metapath $p_i$, we encode each node in it to obtain the comprehensive feature $h_u^{c_i}$ of the contextual graph. The encoding is as follows:
$$h_u^{c_i} = \mathrm{SUM}\left(\gamma \cdot h_v\right), \quad \forall v \in G_u^{c_i} \tag{5}$$
where v is a neighbor of node u of the specific type, and $\gamma$ is the corresponding weight of the neighbor node v, whose value is generated automatically from the contextual graph structure. Second, a self-attention mechanism is used to aggregate the features of the contextual graphs guided by different metapaths into the global feature of node u. The aggregation method is shown in Eq. (6):
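$$\begin{aligned} r_{c_i} &= \frac{1}{|V_A|} \sum_{u \in V_A} \mathrm{ActivateFunc}\left(q^{T} \cdot h_u^{c_i}\right) \\ \beta_{c_i} &= \mathrm{softmax}_{c_i}\left(r_{c_i}\right) = \frac{\exp\left(r_{c_i}\right)}{\sum_{p_j \in P} \exp\left(r_{c_j}\right)} \\ h_u^{c} &= \sum_{p_i \in P} \beta_{c_i} \cdot h_u^{c_i} \end{aligned} \tag{6}$$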
where q is the learnable weight parameter, $r_{c_i}$ is the average score of the same-type nodes under the metapath $p_i$, $\beta_{c_i}$ is the score normalized by softmax, and $h_u^{c}$ is the feature of node u under the contextual graph.
Feature Fusion. Finally, the metagraph aggregation feature $h_u^{m}$ and the contextual graph aggregation feature $h_u^{c}$ are fused to obtain the final representation $h_u$ of node u. This feature expresses both the semantic information and the topology information around the target node. The fusion method is shown in Eq. (7):
$$h_u = \alpha_1 \cdot h_u^{m} + \alpha_2 \cdot h_u^{c} \tag{7}$$
where $\alpha_1$ and $\alpha_2$ are the learnable weights of the metagraph and the contextual graph, and $h_u$ is the final feature of the target node u.
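The metapath-level weighting and the final fusion can be sketched in plain Python as follows. This is only an illustration under stated assumptions: tanh stands in for the unspecified ActivateFunc, the weights r, α1 and α2 are fixed numbers rather than learned parameters, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def metapath_weights(features, r):
    """features[i][u] is the feature of node u under metapath i; returns beta."""
    scores = []
    for per_node in features:
        # Average activation score over all nodes of the type (tanh is assumed).
        avg = sum(math.tanh(dot(r, h)) for h in per_node) / len(per_node)
        scores.append(avg)
    return softmax(scores)


def combine(features, betas, u):
    """Weighted sum of node u's per-metapath features."""
    dim = len(features[0][u])
    return [sum(b * feats[u][d] for b, feats in zip(betas, features))
            for d in range(dim)]


def fuse(h_m, h_c, alpha1, alpha2):
    """Fusion of metagraph and contextual graph features (Eq. 7)."""
    return [alpha1 * x + alpha2 * y for x, y in zip(h_m, h_c)]


# Two metapaths, two nodes, two-dimensional features (toy values).
features = [
    [[1.0, 0.0], [1.0, 0.0]],   # features under metapath 1
    [[0.0, 0.0], [0.0, 0.0]],   # features under metapath 2
]
r = [1.0, 0.0]
betas = metapath_weights(features, r)
h_m = combine(features, betas, u=0)
h_u = fuse(h_m, [0.0, 1.0], alpha1=0.5, alpha2=0.5)
```

The same two functions cover Eq. (3) and Eq. (6), since both compute an average score per metapath, normalize it with softmax and take a weighted sum.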
3.4 Link Prediction
Our goal is to train an end-to-end model for predicting associations between node pairs, which differs from previous models. Our model design considers both node representation generation and edge prediction, so that the two parts can benefit each other. For edge prediction, an association score is calculated by the decoder, which is directly set to the inner product here:
$$\mathrm{score}(u, s) = \sigma\left(h_u \cdot h_s\right) \tag{8}$$
A binary cross-entropy function is used as the loss function, which is as follows:
$$\mathcal{L} = -\sum_{(u,s) \in \Omega} \log \mathrm{score}(u, s) - \sum_{(u,s) \in \Omega^{-}} \log\left(1 - \mathrm{score}(u, s)\right) \tag{9}$$
Here $\Omega$ represents the set of edges that exist in the training set, and $\Omega^{-}$ represents the set of edges obtained by negative sampling of drug and disease node pairs. The size of $\Omega$ is equal to the size of $\Omega^{-}$. Because there is currently no gold-standard set of negative samples in this field, we randomly select the negative samples from unlabeled samples, which is the negative sampling method used by most previous methods.
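A minimal sketch of the inner-product decoder and a binary cross-entropy loss over positive and negative pairs is shown below. It assumes that σ is the logistic sigmoid and that the loss is averaged over all pairs; both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(h_u, h_s):
    """Association score from the inner product of two embeddings."""
    return sigmoid(sum(a * b for a, b in zip(h_u, h_s)))


def bce_loss(emb, positives, negatives):
    """Binary cross-entropy over observed and sampled node pairs."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for u, s in positives:
        loss -= math.log(score(emb[u], emb[s]) + eps)
    for u, s in negatives:
        loss -= math.log(1.0 - score(emb[u], emb[s]) + eps)
    return loss / (len(positives) + len(negatives))


# Toy embeddings: u1 is aligned with s1 and opposed to s2.
emb = {"u1": [2.0, 0.0], "s1": [2.0, 0.0], "s2": [-2.0, 0.0]}
good = bce_loss(emb, positives=[("u1", "s1")], negatives=[("u1", "s2")])
bad = bce_loss(emb, positives=[("u1", "s2")], negatives=[("u1", "s1")])
```

The loss is small when positive pairs have aligned embeddings and negative pairs do not, and large in the opposite case.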
4 Experiment
In this section, several related methods are chosen to evaluate the effectiveness of our proposed model experimentally.
4.1 Comparison Methods
To further confirm the performance of our proposed model, some classical approaches based on attention mechanisms and metapaths have been chosen, including the following:
Metapath2vec [20]: One of the earliest metapath-based embedding models. It employs a metapath-guided random walk to create the neighborhood context for Skip-Gram, which can capture relationships among different kinds of vertices.
GAT [12]: It employs an attention mechanism to perform convolution on a homogeneous graph to obtain node representations.
HAN [21]: It uses metapath-based walks to transform heterogeneous networks into homogeneous networks, then uses a self-attention mechanism to generate node embeddings under specific metapaths, and finally aggregates the node embeddings generated by different metapaths to obtain the final node representations.
MAGNN [22]: It encodes the internal nodes of the metapath through a metapath instance encoder, captures the semantic information of the metapath, and generates the representation of the metapath instance. In addition, it uses an attention mechanism to weight the aggregation of information from different metapaths to generate the final node representations.
FactorHNE [23]: It uses graph factorization to decouple the multiple semantic factor graphs implied by the metapath, then combines the neighbor information in each factor graph through a self-attention mechanism, and finally aggregates all the metapath vectors and uses the resulting vectors to score disease-gene relationship pairs for prediction.
4.2 Comparison of Results
We carry out experiments on NSAP and all baseline models with the Biolink dataset. The collected drug-disease pairs are treated as positive samples and all other unconnected drug-disease pairs are treated as negative samples.
Before the model is trained, we preprocess the data. The positive samples are divided into a training set, a validation set and a test set with a ratio of 6:2:2. This is a typical allocation ratio, and such a division can reflect the performance of the model more accurately while reducing information leakage. The ratio of positive to negative samples is 1:1, and the negative samples are randomly sampled. For the traditional model Metapath2vec, we set the random walk parameters as follows: the window size is 5, the walk length is 10, each node performs 5 walks, and the final node embedding dimension is 64. For GNN models that use neighborhood sampling (MAGNN and NSAP), the number of sampled neighbors is set to 150. The Adam optimizer is used with a learning rate of 0.005 and an L2 penalty weight of 0.001. We use the same training set, validation set and test set for all models.
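The 6:2:2 split and the 1:1 random negative sampling described above can be sketched as follows. The function names, the fixed seed and the toy identifiers are illustrative assumptions; enumerating all unconnected pairs is only practical for small examples.

```python
import random


def split_positives(pairs, seed=0):
    """Shuffle positive pairs and split them 6:2:2 into train/val/test."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def sample_negatives(drugs, diseases, positives, n, seed=0):
    """Randomly draw n unconnected drug-disease pairs as negative samples."""
    rng = random.Random(seed)
    known = set(positives)
    candidates = [(u, s) for u in drugs for s in diseases
                  if (u, s) not in known]
    return rng.sample(candidates, n)


# Toy data: 5 drugs, 4 diseases, 10 known associations (illustrative).
drugs = [f"u{i}" for i in range(5)]
diseases = [f"s{j}" for j in range(4)]
positives = [(u, s) for u in drugs for s in diseases[:2]]
train, val, test = split_positives(positives)
negatives = sample_negatives(drugs, diseases, positives, n=len(positives))
```

Sampling as many negatives as there are positives gives the 1:1 ratio used in the experiments.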
Fig. 4. The performance of NSAP against other baseline models.
We employ the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) as the evaluation metrics. As Fig. 4 illustrates, NSAP outperforms the other models on both metrics. This suggests that the model can not only take advantage of the existing connections between different types of nodes to fully mine the topology around the target nodes, but also use different types of edges to fuse different semantic information.
4.3 Parameter Sensitivity Analysis
This section discusses the effects of two hyperparameters on model results: the number of attention heads and the number of sampled neighbors. Here, we use AUROC as the evaluation metric. Using the right number of attention heads can make the learning process more stable. Figure 5(a) shows that the model performs best when the number of attention heads is 4. In addition, due to the influence of prior knowledge, the number of neighbors differs between nodes. The model can only balance this difference well if the number of sampled neighbors is set to a suitable value. Figure 5(b) shows that 150 sampled neighbors is the most appropriate setting, because the evaluation metric is highest at this value.
Fig. 5. Sensitivity analysis of parameters.
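For reference, the AUC (AUROC) metric used in the comparison and in the sensitivity analysis can be computed from labels and scores as the fraction of positive-negative pairs that are ranked correctly. The sketch below is a simple quadratic-time illustration with a hypothetical function name, not the evaluation code used in the experiments.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
mixed = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
```

In the second call, three of the four positive-negative pairs are ordered correctly.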
5 Conclusion
In this article, a new method called NSAP for node feature learning in heterogeneous networks has been proposed. This method addresses some of the limitations of existing methods. First, to better learn the topology around the target node, this paper proposes to extract the metagraph and the contextual graph from the neighborhood graph; the former focuses on the local information of the node, while the latter focuses on the global information of the node. In addition, in the metagraph and the contextual graph, the graph structure is used to acquire the edge weights automatically, which better reflects the degree of association between different neighbor nodes and the target node. Finally, we aggregate the node representations obtained under the different metapaths, so that the final node representation incorporates different semantics. The test results confirm that our proposed method is effective and improves on the best existing methods. In future work, we will optimize the neighbor sampling strategy to better balance the influence of data differences on model performance.
Acknowledgements. This work was supported by grants from the National Key R&D Program of China (2021YFA0910700), Shenzhen science and technology university stable support program (GXWD20201230155427003-20200821222112001), Shenzhen Science and Technology Program (JCYJ20200109113201726), and Guangdong Basic and Applied Basic Research Foundation (2021A1515012461 and 2021A1515220115).
Authors’ Contributions. QJ designed the study, performed bioinformatics analysis and drafted the manuscript. All of the authors performed the analysis and participated in the revision of the manuscript. JL and YW conceived of the study, participated in its design and coordination and drafted the manuscript. All authors read and approved the final manuscript. Additional Files. All additional files are available at: https://github.com/jqq125/NSAP Competing Interests. The authors declare that they have no competing interests.
References
1. Ashburn, T.T., Thor, K.B.: Drug repositioning: identifying and developing new uses for existing drugs. Nat. Rev. Drug Discov. 3(8), 673–683 (2004)
2. Jarada, T.N., Rokne, J.G., Alhajj, R.: A review of computational drug repositioning: strategies, approaches, opportunities, challenges, and directions. J. Cheminform. 12(1), 1–23 (2020). https://doi.org/10.1186/s13321-020-00450-7
3. Li, J., et al.: A survey of current trends in computational drug repositioning. Brief. Bioinform. 17(1), 2–12 (2015). https://doi.org/10.1093/bib/bbv020
4. Sadeghi, S.S., Keyvanpour, M.R.: An analytical review of computational drug repurposing. IEEE/ACM Trans. Comput. Biol. Bioinform. 1–1 (2019)
5. Perozzi, B., et al.: DeepWalk: online learning of social representations. In: Macskassy, S.A., et al. (eds.) The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, 24–27 August 2014, pp. 701–710. ACM, New York, NY, USA (2014). https://doi.org/10.1145/2623330.2623732
6. Grover, A., Leskovec, J.: Node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939754
7. Liu, H., et al.: Inferring new indications for approved drugs via random walk on drug-disease heterogenous networks. BMC Bioinformatics. 17, 17, 539 (2016). https://doi.org/10.1186/s12859-016-1336-7
8. Yang, C., et al.: Network representation learning with rich text information. In: Proceedings of the 24th International Conference on Artificial Intelligence, pp. 2111–2117. AAAI Press (2015)
9. Cao, S., et al.: GraRep: learning graph representations with global structural information. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 891–900. Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2806416.2806512
10. Liu, Y., et al.: Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLOS Comput. Biol. 12 (2016)
11. Kipf, T.N., Welling, M.: Semi-Supervised Classification with Graph Convolutional Networks. ICLR (2017). https://doi.org/10.48550/arXiv.1609.02907
12. Velickovic, P., et al.: Graph attention networks. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings. OpenReview.net (2018)
13. Hamilton, W.L., et al.: Inductive representation learning on large graphs. In: NIPS (2017)
14. Wishart, D.S., et al.: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46(Database-Issue), D1074–D1082 (2018). https://doi.org/10.1093/nar/gkx1037
15. Bateman, A., et al.: UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49(Database-Issue), D480–D489 (2021). https://doi.org/10.1093/nar/gkaa1100
16. Rappaport, N., et al.: MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 45(Database-Issue), D877–D887 (2017). https://doi.org/10.1093/nar/gkw1012
17. Brown, A.S., Patel, C.J.: A standard database for drug repositioning. Sci. Data. 4, 170029 (2017)
18. González, J.P., et al.: The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 48(Database-Issue), D845–D855 (2020). https://doi.org/10.1093/nar/gkz1021
19. Huser, V., et al.: ClinicalTrials.gov: Adding Value through Informatics. In: AMIA 2015, American Medical Informatics Association Annual Symposium, 14–18 Nov 2015. AMIA, San Francisco, CA, USA (2015)
20. Dong, Y., et al.: metapath2vec: scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13–17 August 2017, pp. 135–144. ACM, Halifax, NS, Canada (2017). https://doi.org/10.1145/3097983.3098036
21. Wang, X., et al.: Heterogeneous Graph Attention Network. WWW 2019, The Web Conference 2019, 13–17 May 2019, pp. 2022–2032. ACM, San Francisco, CA, USA (2019). https://doi.org/10.1145/3308558.3313562
22. Fu, X., et al.: MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding. WWW 2020: The Web Conference 2020, 20–24 April 2020, pp. 2331–2341. ACM/IW3C2, Taipei, Taiwan (2020). https://doi.org/10.1145/3366423.3380297
23. He, M., et al.: Factor graph-aggregated heterogeneous network embedding for disease-gene association prediction. BMC Bioinf. 22(1), 165 (2021). https://doi.org/10.1186/s12859-021-04099-3
Comprehensive Evaluation of BERT Model for DNA-Language for Prediction of DNA Sequence Binding Specificities in Fine-Tuning Phase
Xianbao Tan1(B), Changan Yuan2,3, Hongjie Wu4, and Xingming Zhao5
1 Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
2 Guangxi Academy of Science, Nanning 530007, China
3 Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Guangxi Academy Sciences, Nanning, China
4 School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
5 Institute of Science and Technology for Brain Inspired Intelligence (ISTBI), Fudan University, Shanghai 200433, China
Abstract. Deciphering the language of DNA has always been one of the difficult problems that informatics methods need to deal with. To meet this challenge, many deep learning models have been proposed. Among them, DNA-language models based on pre-trained Bidirectional Encoder Representations from Transformers (BERT) are among the methods with excellent recognition accuracy. However, most studies focus on the design of the model structure, while for pre-trained DNA-language models such as BERT, there are relatively few studies on the influence of the fine-tuning stage on model performance. To this end, we select DNABERT, the first pre-trained BERT model for DNA-language, and analyze its fine-tuning performance under different parameter settings in motif mining tasks, which are among the most classic tasks for predicting DNA sequence binding specificities. Furthermore, we divide the datasets into different types and compare the fine-tuning results with the performance of previously existing models. The results show that in the fine-tuning phase, different hyper-parameter combinations and types of dataset do have a significant impact on model performance.
Keywords: Motif mining · Sequence binding specificities · Fine-tuning analysis · DNA-language model · BERT
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 92–102, 2022. https://doi.org/10.1007/978-3-031-13829-4_8
1 Introduction
In DNA there exist short, repetitive sequences that indicate sequence-specific binding sites of proteins; these are called motifs [1]. Motif mining is one of the most
representative tasks for predicting DNA sequence binding specificities [36, 37]. Although these sequences do not preserve genetic information, they play an important role in gene expression as regulatory codes [2, 38, 39]. Motif mining aims to recognize these key elements more accurately and efficiently. It relied on experimental methods at the beginning [3–5], which are very expensive and time consuming [6, 35]. With the rise of deep learning algorithms, more and more computational methods have been proposed and applied to motif mining. Because our experiments also use models other than BERT for comparison, including DeepBind, DanQ, ECLSTM and ECBLSTM, we introduce these models in chronological order.
The deep learning networks first applied to motif mining were Convolutional Neural Networks (CNN) [7], Recurrent Neural Networks (RNN) [8] and their hybrids. DeepBind [9] is the first deep learning approach for this task; it uses a single-layer convolutional neural network to convolve the input DNA sequences and learns signal detectors that recapitulate known motifs. After DeepBind, many CNN-based motif mining methods were proposed, such as DeepSea, DeepSNR, Dilated [12] and others [40–42]. In addition, many studies have focused on the synergy of cis-regulatory elements over long distances and found that different biological contexts also affect the roles played by cis-regulatory elements [13–15]. This suggests that polysemy and distant semantic relationships, which are also features of natural language, should be taken into consideration when dealing with motifs, which are one kind of cis-regulatory element. Therefore, RNN-based motif mining methods have also been researched and designed. DanQ [17] is a typical example of such a network. DanQ adds a long short-term memory (LSTM) network on the basis of DeepBind.
This additional LSTM layer is used to improve the recognition and extraction of long-distance dependencies between motif elements. ECLSTM [18] aims to improve Full Connection LSTM (FCLSTM) by embedding a set of one-dimensional convolutional layers in the LSTM model. The original LSTM can only obtain unidirectional information. FCLSTM improves this by using a sliding window and a fully connected layer. ECLSTM further converts the fully connected layers into convolutional layers, allowing a hierarchical decomposition of the raw data and combinations of lower-level features. ECBLSTM goes one step further than ECLSTM, replacing the LSTM network with a Bi-LSTM [19] to obtain information in both temporal directions.
In addition to deep learning networks based on CNN and RNN [43, 44], motif mining algorithms have made new progress with the development of transfer learning and the Transformer [20]. Devlin et al. [21] proposed the pre-trained BERT model. Inspired by the Cloze task [22, 25], BERT adopts a masked language model (MLM) that randomly masks tokens in the input sequences and predicts them from their contexts. Experimental results show that BERT performs well on different natural language datasets [23, 24]. Based on the BERT network, Ji et al. [26] further proposed DNABERT, the first BERT model for DNA-language. DNABERT adjusts the length of the input sequences, forces the model to adapt to the DNA scenario, and predicts several consecutive tokens. DNABERT
X. Tan et al.
can be applied to various downstream tasks related to DNA sequences, including motif mining. DNABERT has been shown to be more effective than previous methods [28, 30, 31] at predicting proximal and core promoter regions in the EPDnew dataset [27], and more accurate than previous models [9, 17, 33, 34] on transcription factor discrimination tasks based on the ENCODE database [32]. In general, pre-trained models are more widely applicable than other models. However, before these algorithms are applied to downstream tasks such as motif mining, fine-tuning must be done to allow them to learn more details, especially for the rather new BERT algorithm. Therefore, the performance of the pre-trained model in different fine-tuning environments needs to be evaluated comprehensively. We choose motif mining as the target task for the prediction of DNA sequence binding specificities, and select DNABERT as the representative BERT model for DNA-language to conduct our fine-tuning analysis. We set up a series of experiments to find out how the fine-tuning phase influences model performance. We address the above challenges by (i) testing diverse learning rates in the fine-tuning phase and observing the corresponding results; (ii) experimenting with different k values of the k-mer embedding; (iii) classifying the datasets into different types to evaluate the performance of DNABERT in comparison with other models.
2 Materials and Methods
In this study, we explore the performance of DNABERT with different learning rates and k values and over different datasets in the fine-tuning phase. In this section, we first present the datasets used for fine-tuning, then introduce the DNABERT model used in this study, and finally describe the training and fine-tuning details.
2.1 Dataset
The deep learning models are evaluated on data from ChIP-seq experiments. We use 32 ChIP-seq experiments from the ENCODE project, which assay the binding of different transcription factors. We use the same division strategy as DeepBind and divide the peak data into three lists: A, B, and C. A is the set of the top 500 even-numbered peaks in the ranked list of detected peaks, B is the set of the top 500 odd-numbered peaks, and C is the set of remaining peaks. During training, we use lists A and C for model training or fine-tuning, and then use list B for validation. Positive examples in this binary classification task consist of 101 bp regions centered on each ChIP-seq peak. The negative examples are generated by shuffling the positive sequences. All sequences are represented by k-mer embedding as model inputs.
2.2 Model Architectures
DNABERT is inspired by BERT, a transformer-based contextualized language representation model that achieves state-of-the-art performance in many NLP tasks. As shown in Fig. 1, the transformer unit is the key structure in DNABERT, and each k-mer embedded
Fig. 1. Model structure of DNABERT
sequence will be represented as a matrix M. The transformer captures contextual information by performing the multi-head self-attention mechanism on M:
$$\mathrm{MultiHead}(M) = \mathrm{Concat}\left(head_1, \ldots, head_h\right) W^{O}$$
where
$$head_i = \mathrm{softmax}\left(\frac{M W_i^{Q} \left(M W_i^{K}\right)^{T}}{\sqrt{d_k}}\right) \cdot M W_i^{V}$$
$W^{O}$ and $\{W_i^{Q}, W_i^{K}, W_i^{V}\}_{i=0}^{h}$ are learned parameters for linear projection. Each head calculates the next hidden states of M by first computing the attention scores between every two tokens and then using them as weights to sum up the rows of $M W_i^{V}$. MultiHead concatenates the results of h independent heads with different sets of $\{W_i^{Q}, W_i^{K}, W_i^{V}\}$. The entire procedure is performed L times, with L being the number of layers.
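The multi-head self-attention step can be sketched with plain Python lists as follows. This is an illustrative sketch with tiny hand-picked matrices and hypothetical helper names; it is not the DNABERT implementation, which relies on optimized tensor libraries.

```python
import math


def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(M, Wq, Wk, Wv):
    """softmax(M Wq (M Wk)^T / sqrt(d_k)) . M Wv for one head."""
    Q, K, V = matmul(M, Wq), matmul(M, Wk), matmul(M, Wv)
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = [[v / math.sqrt(d_k) for v in row] for row in matmul(Q, K_t)]
    return matmul(softmax_rows(scores), V)


def multi_head(M, heads, Wo):
    """Concatenate the outputs of all heads token by token, then project with Wo."""
    outputs = [attention_head(M, *weights) for weights in heads]
    concat = [sum((o[i] for o in outputs), []) for i in range(len(M))]
    return matmul(concat, Wo)


# Two tokens with two-dimensional embeddings and identity projections (toy values).
I2 = [[1.0, 0.0], [0.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
single = attention_head(M, I2, I2, I2)
Wo = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
out = multi_head(M, [(I2, I2, I2), (I2, I2, I2)], Wo)
```

With identity projections, each output row of a head is simply the attention weights of that token, so each row sums to one and each token attends most strongly to itself.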
2.3 Training and Fine-Tuning
Before fine-tuning DNABERT, we must pre-train the model. We use the same pre-training steps as BERT. A large amount of DNA sequence data is used so that the model captures basic syntax and semantics in the pre-training stage. For each sequence, 15% of the regions are masked randomly, and DNABERT predicts them from the remaining regions. In the pre-training phase, we use the same data and training steps as DNABERT to obtain the same pre-trained model. Next comes the fine-tuning phase. We first observe the influence of different learning rates and of k-mer embeddings with different k values in the fine-tuning step. To select the learning rate, we test three learning rates, $2 \times 10^{-4}$, $2 \times 10^{-5}$ and $2 \times 10^{-6}$, with k = 6. Then we use the best learning rate to further explore the impact of different k values on model performance. After a representative hyperparameter combination has been identified, further comparisons with other models are made. To compare with the pre-existing models, we use the same ChIP-seq dataset, using A and C for fine-tuning and B for testing. The fine-tuning process is consistent with the DNABERT article. Finally, we divide the whole dataset into four parts according to size and training difficulty, and we evaluate the performance of DNABERT on the different parts. Each fine-tuning result on each dataset is evaluated using 3-fold cross-validation to improve the reliability of the results and exclude chance effects. To train DeepBind, DanQ, ECLSTM and ECBLSTM for comparison, we follow the work in DeepRam in training and evaluating these models, using their results as a comparison for the performance of the DNABERT model on the same datasets.
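The k-mer inputs and the A/B/C peak lists used for fine-tuning can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes that peaks are numbered from 1 in ranked order and that negatives come from a plain random shuffle of each positive sequence, since the exact shuffling scheme is not specified here.

```python
import random


def kmer_tokens(seq, k):
    """Overlapping k-mers of a DNA sequence, e.g. ATGCA -> ATG, TGC, GCA for k = 3."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def split_peaks(ranked_peaks):
    """Split ranked peaks (best first, numbered from 1) into lists A, B and C."""
    a = ranked_peaks[1:1000:2]   # top 500 even-numbered peaks
    b = ranked_peaks[0:1000:2]   # top 500 odd-numbered peaks
    c = ranked_peaks[1000:]      # all remaining peaks
    return a, b, c


def shuffled_negative(seq, rng):
    """Negative example obtained by shuffling the bases of a positive sequence."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


rng = random.Random(0)
tokens = kmer_tokens("ATGCA", 3)
A, B, C = split_peaks(list(range(1, 1201)))  # peak ranks 1..1200 as stand-ins
negative = shuffled_negative("ACGTACGT", rng)
```

A 101 bp positive region yields 96 overlapping tokens for k = 6, which is why the choice of k changes the input length only slightly.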
Briefly, DeepRam first randomly selects 40 hyperparameter settings for each dataset and measures them using the AUC values in 3-fold cross-validation. It then uses the best hyperparameter combination to select an appropriate model, obtains the optimal learning rate by tracing the changes in model performance over 400,000 iterations, and finally reports the final model performance. To compare the performance of these algorithms, we use ACC, F1, MCC and AUC as evaluation indicators.
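For reference, ACC, F1 and MCC can be computed from the confusion counts of a binary classifier as in the sketch below. The function name and the zero-division conventions are assumptions made for illustration.

```python
import math


def binary_metrics(labels, preds):
    """Return (ACC, F1, MCC) for binary labels and predictions in {0, 1}."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    acc = (tp + tn) / len(labels)
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return acc, f1, mcc


acc, f1, mcc = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

Unlike ACC, MCC uses all four confusion counts symmetrically, which makes it informative even when the two classes behave very differently.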
3 Results and Analysis
We first study two hyper-parameters in the fine-tuning phase, the learning rate and the k value of the k-mer embedding, to find the hyper-parameter combination with the best model performance.
3.1 Relatively Small Learning Rate Leads to Better Performance
In this experiment, we find that for the DNABERT model, a smaller learning rate tends to achieve better results in the fine-tuning stage. As shown in Fig. 2, although $2 \times 10^{-4}$ is itself a relatively small learning rate, and although the learning rate in the fine-tuning stage first increases linearly and then decreases linearly, training still diverges on some datasets, which means DNABERT fails to learn those datasets
Comprehensive Evaluation of BERT Model for DNA-Language
Fig. 2. Evaluation results under different learning rates. Panels (a), (b) and (c) show similar results: outliers appear when the learning rate is set to 2 × 10^-4 or 2 × 10^-6, while the best performance is obtained when the learning rate is 2 × 10^-5.
at such a learning rate. For the two smaller learning rates, 2 × 10^-5 and 2 × 10^-6, the fine-tuning results are significantly better. In particular, when the learning rate is 2 × 10^-5, the outliers in the boxplots disappear, giving better ACC, F1 and MCC results. DNABERT is a pre-trained model that has already learned a large number of features from the source domain during pre-training. In the fine-tuning phase, where the model learns from the target domain, we speculate that a smaller learning rate is more conducive to convergence, whereas a large learning rate makes it easier for the optimization to jump out of the region around the global minimum of the loss function and diverge.
3.2 DNABERT with Different k Values of k-mer Embedding Achieves Similar Performance
We test the effect of four k values of the k-mer embedding on model performance: k = 3, k = 4, k = 5 and k = 6. As shown in Fig. 3, the distributions of ACC, F1 and MCC over all datasets are relatively similar in the four cases. Note, however, that there is an outlier for both k = 5 and k = 6, which may be caused by a degree of overfitting resulting from the larger k value. As discussed in the following part, DNABERT does not achieve leading performance on every dataset, but this does not affect the overall excellent performance
X. Tan et al.
Fig. 3. Evaluation results for different k values. DNABERT obtains similar performance for k = 4 and k = 6, while the values for k = 6 are clustered more tightly around the average.
of the DNABERT model. Taking into account the experimental results of the DNABERT article, we select k = 6, whose results are concentrated more closely around the average, for the subsequent experiments.
3.3 DNABERT Achieves Outstanding Performance Overall
(a) AUC values distribution of different models (b) Stacked bar plots of different types of dataset
Fig. 4. Comparison of DNABERT and other models on ChIP-seq datasets. (a) The AUC of DNABERT is significantly higher overall than that of the other algorithms. (b) DNABERT shows a greater advantage on the “large&complex” and “small&complex” datasets, and a smaller one on the other types of dataset.
We test the DNABERT model on 32 ChIP-seq datasets and compare the results with the performance of the earlier models reported in DeepRam. Fig. 4 shows that DNABERT performs outstandingly overall, and Fig. 4(a) shows boxplots of the AUC values of DeepBind, DanQ, ECLSTM, ECBLSTM and DNABERT on all datasets. DNABERT surpasses the other models in all indicators, with AUC values mostly above 0.9, which shows its excellent discrimination ability in motif mining tasks. To examine how DNABERT differs from the other models in motif mining tasks, we further classify the datasets on the same basis as in the DeepRam experiment. First, the datasets are divided by size into “large” and “small” types, with a threshold of 10000. Second, according to training difficulty, which is based on the AUCs of DeepBind, DanQ, ECLSTM and ECBLSTM, the “large” and “small” types are further divided into four types: “large&complex”, “large&simple”, “small&complex” and “small&simple”. Based on this division, we further analyze the performance of DNABERT and the other models. As shown in Fig. 4(b), the performance lead of DNABERT appears mainly on the “complex” tasks, while its lead on the other types of datasets is smaller. Regardless of dataset size, when a motif mining task is more difficult, DNABERT captures the sequence information better, benefiting from the depth of its network and the sequence knowledge obtained in pre-training, and it achieves an overall performance that the other models do not reach. This advantage is less prominent on the “simple” datasets, where DNABERT and the other models perform similarly.

Table 1. AUC increase of DNABERT compared with other models

Dataset type       DNABERT AUC    Average AUC of other models    Increase (%)
Large & complex    0.948          0.784                          20.9
Large & simple     0.988          0.968                          2.1
Small & complex    0.946          0.748                          26.5
Small & simple     0.916          0.864                          6.0
Table 1 shows this more clearly: the AUC of DNABERT is more than 20% higher than the average of the other algorithms on the “large&complex” and “small&complex” datasets, whereas on the “simple” datasets the increase drops to 2.1% and 6.0%. Because DNABERT is a pre-trained model intended to be applicable to a wide range of datasets, its performance on some datasets, even after fine-tuning, may be less satisfying than that of pre-existing models trained specifically on those datasets. Even so, DNABERT performs strongly in motif mining tasks overall.
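The percentages in Table 1 are consistent with a relative increase of the DNABERT AUC over the average AUC of the other models. The short sketch below reproduces that arithmetic from the AUC values in the table; the function name is an assumption made for this example, and the formula is inferred from the reported numbers rather than stated by the authors.

```python
def relative_increase_percent(dnabert_auc, baseline_auc):
    """Relative AUC increase of DNABERT over the baseline average, in percent."""
    return (dnabert_auc - baseline_auc) / baseline_auc * 100.0


# (DNABERT AUC, average AUC of the other models), copied from Table 1.
table1 = {
    "Large & complex": (0.948, 0.784),
    "Large & simple": (0.988, 0.968),
    "Small & complex": (0.946, 0.748),
    "Small & simple": (0.916, 0.864),
}

for dataset_type, (dnabert, others) in table1.items():
    print(f"{dataset_type}: {relative_increase_percent(dnabert, others):.1f}%")
```

Rounded to one decimal place, the four values agree with the "Increase (%)" column of Table 1.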
4 Conclusion
In this work, we explore different factors that may influence the performance of DNABERT and compare its motif mining performance with that of DeepBind, DanQ, ECLSTM and ECBLSTM on the ChIP-seq datasets, further analyzing the performance differences across the different datasets used in the fine-tuning phase. The results show that DNABERT, as a BERT model for DNA-language, has a great performance advantage in motif mining tasks as long as fine-tuning is done appropriately. This requires a relatively small learning rate in the fine-tuning stage and an appropriate k value when embedding the input sequences. Moreover, we find that DNABERT shows greater performance advantages on complex datasets of different sizes, while its performance on simple datasets is not much better than that of the pre-existing algorithms. This has practical significance for applying the algorithm. If simple datasets are to be trained in a short time on machines of ordinary performance, the pre-existing models can meet the requirements: their structures are simpler, so good results can be obtained without time-consuming and complex training. For complex datasets, if good performance is required, DNABERT, or BERT more generally, is the better choice.
Acknowledgements. This work was supported by the National Key R&D Program of China (No. 2018YFA0902600 & 2018AAA0100100), partly supported by the National Natural Science Foundation of China (Grant nos. 61732012, 62002266, 61932008, and 62073231) and the Introduction Plan of High-end Foreign Experts (Grant no. G2021033002L), and, respectively, supported by the Key Project of Science and Technology of Guangxi (Grant no. 2021AB20147), the Guangxi Natural Science Foundation (Grant nos. 2021JJA170204 & 2021JJA170199) and the Guangxi Science and Technology Base and Talents Special Project (Grant nos. 2021AC19354 & 2021AC19394).
References 1. D’haeseleer, P.: What are DNA sequence motifs? Nat. Biotechnol. 24, 423–425 (2006) 2. Nirenberg, M., Leder, P.: RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc. Natl. Acad. Sci. USA 53, 1161–1168 (1965) 3. Galas, D.J., Schmitz, A.: DNAase footprinting a simple method for the detection of proteinDNA binding specificity. Nucleic. Acids Res. 5(9), 3157–3170 (1978) 4. Hellman, L., Fried, M.: Electrophoretic mobility shift assay (EMSA) for detecting protein– nucleic acid interactions. Nat. Protoc. 2, 1849–1861 (2007) 5. Schenborn, E., Groskreutz, D.: Reporter gene vectors and assays. Mol. Biotechnol. 13, 29–44 (1999) 6. Trabelsi, A., Chaabane, M., Ben-Hur, A.: Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities. Bioinformatics 35(14), i269–i277 (2019) 7. LeCun, Y.: Gradient-based learning applied to document recognition. Proc. IEEE. 86, 2278– 2324 (1998) 8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
Comprehensive Evaluation of BERT Model for DNA-Language
101
9. Alipanahi, B., Delong, A., Weirauch, M.: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015) 10. Zhu, L., Zhang, H.B., Huang, D.S.: Direct AUC optimization of regulatory motifs. Bioinformatics 33(14), i243–i251 (2017) 11. Shen, Z., Zhang, Y.H., Han, K.S., Nandi, A.K., Honig, B., Huang, D.S.: miRNA-disease association prediction with collaborative matrix factorization. Complexity. 2017(2017), 1–9 (2017) 12. Gupta, A., Rush, A.M.: Dilated convolutions for modeling long-distance genomic dependencies. arXiv:1710.01278 (2017) 13. Davuluri, R.V.: The functional consequences of alternative promoter use in mammalian genomes. Trends Genet. 24, 167–177 (2008) 14. Gibcus, J.H., Dekker, J.: The context of gene expression regulation. F1000 Biol. Rep. 4, 8 (2012) 15. Vitting-Seerup, K., Sandelin, A.: The landscape of isoform switches in human cancers. Mol. Cancer Res. 15, 1206–1220 (2017) 16. Zhang, H.B., Zhu, L., Huang, D.S.: WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci. Rep. 7 (2017) 17. Quang, D., Xie, X.: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016) 18. Zhou, Y.X., Hefenbrock, M., Huang, Y.R., Riedel, T., Beigl, M.: Automatic Remaining Useful Life Estimation Framework with Embedded Convolutional LSTM as the Backbone. ECML PKDD 2020: Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track, pp. 461–477 (2020) 19. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Sig. Process. 45(11), 2673–2681 (1997) 20. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017) (2017) 21. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of Deep Bidirectional Transformers for Language Understanding (2018) 22. 
Taylor, W.L.: Cloze procedure: a new tool for measuring readability. J. Bull. 30(4), 415–433 (1953) 23. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: Glue: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. 2018a, pp. 353–355 (2018) 24. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: Squad: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392(2016) 25. Zhu, L., Zhang, H.B., Huang, D.S.: LMMO: a large margin approach for optimizing regulatory motifs. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(3), 913–925 (2018) 26. Ji, Y.R., Zhou, Z.H., Liu, H., Davuluri, R.V.: DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37(15), 2112–2120 (2021) 27. Dreos, R., Ambrosini, G., Périer, R.C., Bucher, P.: EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res. 41(D1), D157–D164 (2013) 28. Oubounyt, M., Louadi, Z., Tayara, H., Chong, K.T.: DeePromoter: robust promoter predictor using deep learning. Front Genet. 10, 286 (2019) 29. Zhang, H.B., Zhu, L., Huang, D.S.: DiscMLA: An efficient discriminative motif learning algorithm over high-throughput datasets. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(6), 1810–1820 (2018)
102
X. Tan et al.
30. Solovyev, V., Kosarev, P., Seledsov, I.: Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 7(S10) (2006) 31. Davuluri, R.V.: Application of FirstEF to find promoters and first exons in the human genome. Current Protocols Bioinform. 1, 4.7.1–4.7.10 (2003) 32. The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012) 33. Zhang, Y., Qiao, S., Ji, S., Li, Y.: DeepSite: bidirectional LSTM and CNN models for predicting DNA–protein binding. Int. J. Mach. Learn. Cybern. 11(4), 841–851 (2019). https:// doi.org/10.1007/s13042-019-00990-x 34. Khamis, A.M., et al.: A novel method for improved accuracy of transcription factor binding site prediction. Nucleic Acids Res. 46(12), e72 (2018) 35. Shen, Z., Zhang, Q., Han, K., Huang, D.S.: A deep learning model for RNA-protein binding preference prediction based on hierarchical LSTM and attention network. IEEE/ACM Trans. Comput. Biol. Bioinform. 19 (2020) 36. Zhang, Q., Shen, Z., Huang, D.S.: Predicting in-vitro transcription factor binding sites using DNA sequence shape. IEEE/ACM Trans. Comput. Biol. Bioinform. 18 (2019) 37. Shen, Z., Deng, S.P., Huang, D.S.: Capsule network for predicting RNA-Protein binding preferences using hybrid feature. IEEE/ACM Trans. Comput. Biol. Bioinform. 17 (2019) 38. Zhu, L., Bao, W.Z., Huang, D.S.: Learning TF binding motifs by optimizing fisher exact test score. IEEE/ACM Trans. Comput. Biol. Bioinform. (2016) 39. Shen, Z., Deng, S.P., Huang, D.S.: RNA-Protein binding sites prediction via multi-scale convolutional gated recurrent unit networks. IEEE Trans. Comput. Biol. Bioinform. 17 (2019) 40. Zhang, Q.H., Zhu, L., Bao, W.Z., Huang, D.S.: Weakly-supervised convolutional neural network architecture for predicting protein-DNA binding. IEEE/ACM Trans. Comput. Biol. Bioinform. 17 (2020) 41. 
Zhang, Q.H., Zhu, L., Huang, D.S.: High-order convolutional neural network architecture for predicting DNA-protein binding sites. IEEE/ACM Trans. Comput. Biol. Bioinform. 16 (2019) 42. Zhang, Q.H., Shen, Z., Huang, D.S.: Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci. Rep. 9, 8484 (2019) 43. Xu, W.X., Zhu, L., Huang, D.S.: DCDE: an efficient deep convolutional divergence encoding method for human promoter recognition. IEEE Trans. Nanobiosci. 18(2), 136–145 (2019) 44. Shen, Z., Bao, W.Z., Huang, D.S.: Recurrent neural network for predicting transcription factor binding sites. Sci. Rep. 8, 15270 (2018)
Identification and Evaluation of Key Biomarkers of Acute Myocardial Infarction by Machine Learning
Zhenrun Zhan1,2, Tingting Zhao1,2, Xiaodan Bi1,2, Jinpeng Yang1,2, and Pengyong Han1(B)
1 Changzhi Medical College, Changzhi 046000, Shanxi, China
2 Heping Hospital Affiliated to Changzhi Medical College, Changzhi 046000, Shanxi, China
Abstract. Acute myocardial infarction (AMI) is a severe disease that can occur in all age groups. About 8.5 million patients die of this disease every year. Although the diagnostic technology for AMI is relatively mature, it still has many limitations. We aim to use comprehensive bioinformatics and machine learning algorithms to study the potential molecular mechanisms of AMI and to seek new prevention and treatment strategies. Methods: The expression profiles of GSE66360 and GSE48060 were downloaded from the Gene Expression Omnibus database, the microarray datasets were integrated, and differentially expressed genes were obtained for further bioinformatic analysis. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Disease Ontology (DO) and Gene Set Enrichment Analysis (GSEA) analyses were performed on the differentially expressed genes using R software. The Lasso algorithm was then used to identify AMI-related essential genes in the training set, and these genes were validated in the test set. The analysis of potential mechanisms in the development of AMI covered the expression differences of the essential genes, differences in immune cell infiltration, correlations among immune cells, and correlations between the essential genes and immune cells in normal and AMI samples. Results: Five essential genes were screened: CLEC4D, CSF3R, SLC11A1, CLEC12A, and TAGAP. Their expression differed between normal and AMI samples, and the genes can be used as diagnostic factors in patients. Normal and AMI samples also showed significant differences in immune infiltration, and the expression of the essential genes was closely related to the abundance of immune cell infiltration.
Conclusion: In this study, five essential genes were screened, and the underlying molecular mechanisms of AMI pathogenesis were analyzed, which may provide theoretical support for the diagnosis, prevention, prognosis evaluation and targeted immune therapy of AMI patients.
Keywords: Acute myocardial infarction · Machine learning · Biomarkers · Immune infiltration
Z. Zhan and T. Zhao—Contributed to the work equally © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 103–115, 2022. https://doi.org/10.1007/978-3-031-13829-4_9
1 Introduction
Unstable ischemic syndrome can cause acute myocardial infarction (AMI) [1]. The most common initiating cause is the rupture or erosion of vulnerable, lipid-rich atheromatous coronary plaques, which exposes the core and matrix materials of the plaques to circulating blood [2]. In the past ten years, AMI has accounted for the largest number of deaths from disease worldwide [3]. More than 2.4 million patients die each year from the disease in the United States, and one-third of patients die yearly from AMI in developed countries [4]. At the same time, the global burden of cardiovascular diseases and AMI is shifting to low-income and middle-income countries, where more than 80% of cardiovascular deaths worldwide occur [5, 6]. With the development of early reperfusion and therapeutic drugs, the incidence of severe complications of AMI has been significantly reduced [7], and patient mortality has decreased significantly. Nevertheless, there is still considerable room for improvement. In recent years, the development and improvement of second-generation sequencing technology have made it possible to screen out disease characteristic genes via bioinformatics analysis [8, 9]. Some studies also use the well-known LASSO machine learning algorithm to build models that identify disease biomarkers, which makes the results more reliable [10]. The Lasso (least absolute shrinkage and selection operator) algorithm is a shrinkage estimator [11] that obtains a relatively refined model by constructing a penalty function. It selects variables while estimating parameters and handles the multicollinearity problem in regression analysis better. The model is taken to be optimal at the lambda value that gives the minimum cross-validation error.
Here, the Lasso machine learning algorithm was combined with differential gene expression analysis to identify diagnostic biomarkers of AMI and to build a reliable and interpretable prediction model. Two open-access AMI patient datasets were downloaded from the GEO database. First, the DEGs were screened, and GO, KEGG, DO, and GSEA enrichment analyses were performed. The essential genes related to AMI were then identified in the training set by the LASSO algorithm and verified in the test set. The expression differences of the essential genes, differences in immune cell infiltration, correlations among immune cells, and correlations between the essential genes and immune cells in normal and AMI samples were studied. This work provides a reference for in-depth study of the occurrence and development, diagnosis, prevention, prognosis evaluation and targeted immune therapy of AMI.
2 Materials and Methods
2.1 Data Collection
NCBI [12] was used to find appropriate datasets, and the GEOquery [13] package was used to download the chip data of the GSE66360 and GSE48060 datasets, which include AMI samples and normal samples. The annotation information for the chip probes of each platform was then obtained from the GEO database. When converting probe IDs to gene symbols, if a single gene symbol corresponded to multiple probes, the average expression level of those probes was taken as the gene expression level.
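The probe-to-gene conversion described above can be sketched in a few lines. This is an illustrative Python version, not the authors' R code; the probe IDs, gene symbols and the dictionary-based data layout are hypothetical and chosen only to show the averaging rule.

```python
from collections import defaultdict


def collapse_probes_to_genes(probe_expr, probe_to_gene):
    """Average the expression rows of all probes that map to the same gene symbol.

    probe_expr:    {probe_id: [expression value per sample]}
    probe_to_gene: {probe_id: gene_symbol}; probes without a symbol are skipped.
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_expr.items():
        gene = probe_to_gene.get(probe_id)
        if gene:
            grouped[gene].append(values)
    # zip(*rows) iterates over samples, so each gene gets one mean per sample.
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical probes: p1 and p2 both map to GENE_A, p4 has no annotation.
expr = {"p1": [2.0, 4.0], "p2": [4.0, 8.0], "p3": [1.0, 1.0], "p4": [9.0, 9.0]}
annotation = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
print(collapse_probes_to_genes(expr, annotation))
```

Averaging is one common way to resolve many-to-one probe mappings; alternatives such as keeping the probe with the highest mean expression would change the resulting gene-level matrix.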
2.2 DEG Screening
The differential expression of mRNA was studied using the limma package of R. “Adjusted P < 0.05 and [log2(FC) > 1 or log2(FC) ≤ −1]” was defined as the screening threshold for differential mRNA expression. The R package “pheatmap” was used for visualization [14].
2.3 GO, KEGG, DO and GSEA Enrichment Analysis
The clusterProfiler package was employed for Gene Ontology (GO) analysis (adjusted P < 0.05), Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis, Gene Set Enrichment Analysis (GSEA) and Disease Ontology (DO) analysis of the candidate mRNAs [15].
2.4 Screening and Identification of Gene Prediction Model for Early Diagnosis
The GSE66360 dataset was used as the training set and the GSE48060 dataset as the test set. The glmnet package was used to construct a binomial LASSO model in the training set and to identify candidate hub genes [15]. Differential expression analysis was then performed on the candidate hub genes in the test set to obtain the critical genes, that is, the candidates that are also differentially expressed in the test set. Next, ROC curves were constructed and the area under the curve (AUC) was calculated to estimate the prediction accuracy of the model in the training and test sets. The pROC package was used to evaluate the diagnostic value of the critical genes [16]. The e1071 and caret packages were used for recursive feature selection among the differential genes, and the resulting data were used to screen the best gene signature.
2.5 The Immune Cell Infiltration Analysis
We obtained an immune cell infiltration matrix through CIBERSORT analysis, assessed the proportions of immune cells in acute myocardial infarction and normal samples [17], and conducted correlation analysis of the immune cells. The ggplot2 package was then used to draw heatmaps and reveal the characteristics of immune cell infiltration in infarcted myocardial tissue.
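To make the variable-selection behaviour of the binomial LASSO step in Sect. 2.4 concrete, the sketch below fits an L1-penalized logistic regression by proximal gradient descent in plain Python. It is not glmnet, which uses coordinate descent along a path of lambda values; the toy data, the step size, the iteration count and the two lambda values are assumptions chosen purely for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=500):
    """Minimise mean log-loss + lam * ||w||_1 by proximal gradient descent.

    X: list of samples (each a list of expression values), y: 0/1 labels.
    The intercept is not penalised. Returns (weights, intercept).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        residuals = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
            for row, yi in zip(X, y)
        ]
        grad_w = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(d)]
        grad_b = sum(residuals) / n
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[2.0, 0.1], [1.5, -0.1], [1.8, 0.05], [-2.0, 0.1], [-1.5, -0.1], [-1.8, 0.05]]
y = [1, 1, 1, 0, 0, 0]
weights, intercept = fit_l1_logistic(X, y, lam=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("selected feature indices:", selected)
```

Features whose coefficients are driven exactly to zero are dropped from the model, which is how the penalty performs gene selection; a larger lambda removes more features, and a sufficiently large one removes all of them.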
3 Results
3.1 Preprocessing and Analysis of AMI-Related Differentially Expressed Genes
The workflow of this study is shown in the flowchart (Fig. 1). After gene annotation and data standardization, the GSE66360 dataset contains 99 circulating endothelial cell samples: 49 from acute myocardial infarction patients and 50 normal controls. The GSE48060 dataset contains 52 blood samples: 31 from acute myocardial infarction patients and 21 normal controls. Differential expression analysis of the GSE66360 dataset yielded 27 differentially expressed genes, of which 21 were up-regulated and six were down-regulated; all are marked with gene names in the volcano map (Fig. 2A). The smaller the adjusted P value, the more significant the differential expression of the gene. The pheatmap package of R was used to draw the heat map of the differentially expressed genes (Fig. 2B).
Fig. 1. The flowchart of this study.
Fig. 2. The differential gene expression of GSE66360. (A) Volcano map of the DEGs; red and green indicate up-regulated and down-regulated genes, respectively. (B) Heat map of normal samples and AMI samples, with high expression in red and low expression in blue. (Color figure online)
3.2 GO, KEGG, DO and GSEA Enrichment Analysis of Differential Genes
The clusterProfiler package performed GO enrichment analysis of the 27 DEGs. Using the Benjamini-Hochberg correction method, we set adjusted P-values of
0.079680048, LAC > 6.533333333, network > 7.72106676, (b) filter betweenness > 7.495325508, closeness > 0.702364865, degree > 15.5, Eigenvector > 0.176887855, LAC > 10.330316745, network > 12.38486444 (c) PPI network diagram of key targets
3.6 GO and KEGG Pathway Analysis
Enrichment analysis was performed for the key targets using the Metascape database. The GO analysis items included 537 biological processes (BP), 14 cell components (CC), and 29 molecular functions (MF), together with 209 KEGG pathways. At the same time, select P
0 is the distance function that measures the distance between $s(R_k(n))$ and $s(R_k(m))$, and $f_{-1} = 0$. We employ Dynamic Time Warping (DTW) [31] to measure the distance between $s(R_k(n))$ and $s(R_k(m))$ because their sizes are different:
$$g(s(R_k(n)), s(R_k(m))) = \frac{\max(s(R_k(n)), s(R_k(m)))}{\min(s(R_k(n)), s(R_k(m)))} - 1 \qquad (3)$$
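The DTW comparison behind Eq. (3) can be sketched as follows. The sketch assumes, following the original struc2vec formulation, that the ratio max/min − 1 is applied element-wise to pairs of (positive) degree values inside a standard DTW recursion; Eq. (3) as printed writes it at the level of whole sequences, so this reading and the function names are assumptions.

```python
def degree_cost(a, b):
    """Element-wise cost in the spirit of Eq. (3): max(a, b) / min(a, b) - 1, for a, b > 0."""
    return max(a, b) / min(a, b) - 1.0


def dtw(seq_a, seq_b, cost=degree_cost):
    """Classic dynamic time warping distance between two sequences of any lengths."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            table[i][j] = step + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]


# Ordered degree sequences of different lengths (toy values).
print(dtw([1, 2], [1, 2, 2]))
print(dtw([1, 1], [2]))
```

Because the cost is a ratio, a degree difference of 1 versus 2 counts for more than 101 versus 102, and identical degrees cost nothing.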
Calculate the similarity of degree distribution between all node pairs in the graph, and build a multi-layer weighted graph based on the similarity. In the same layer, the edge weight between nodes n and m is defined as:
$$\omega_k(n, m) = e^{-f_k(n, m)}, \quad k = 0, 1, \cdots, k^* \qquad (4)$$
where $k^*$ is the diameter of the similarity network. Directed edges connect the copies of the same node in different layers: each node n in layer k is connected to its copies in layers k − 1 and k + 1. The edge weights between layers are defined as follows:
$$\omega(n_k, n_{k-1}) = \log(\Gamma_k(n) + e), \qquad \omega(n_k, n_{k+1}) = 1 \qquad (5)$$
where $\Gamma_k(n)$ is the number of edges connected to n in layer k whose weight is greater than the average weight. We use a biased random walk on the multi-layer weighted graph to generate node sequences. At each step, the walk stays in the current layer with probability q, in which case the probability of moving from node n to node m in layer k is:
$$p_k(n, m) = \frac{e^{-f_k(n, m)}}{Z_k(n)} \qquad (6)$$
Z. Sun et al.
where $Z_k(n) = \sum_{m \neq n} e^{-f_k(n, m)}$ is the normalization factor of node n in layer k. With probability 1 − q, the walk jumps to another layer. In that case, the probabilities of jumping to layers k + 1 and k − 1 are:
$$p_k(n_k, n_{k+1}) = \frac{\omega(n_k, n_{k+1})}{\omega(n_k, n_{k+1}) + \omega(n_k, n_{k-1})}, \qquad p_k(n_k, n_{k-1}) = 1 - p_k(n_k, n_{k+1}) \qquad (7)$$
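The transition probabilities in Eqs. (4), (6) and (7) can be sketched as two small functions. This is a minimal illustration under assumed inputs: the structural distances are passed in as a dictionary for one node in one layer, the node names are hypothetical, and the layer-jump function takes the two inter-layer edge weights as plain numbers.

```python
import math


def within_layer_probs(f_row):
    """Eq. (6): p_k(n, m) = exp(-f_k(n, m)) / Z_k(n) over all nodes m != n.

    f_row maps every other node m to the structural distance f_k(n, m).
    """
    weights = {m: math.exp(-f) for m, f in f_row.items()}  # edge weights, Eq. (4)
    z = sum(weights.values())                              # normalization factor Z_k(n)
    return {m: w / z for m, w in weights.items()}


def layer_jump_probs(weight_up, weight_down):
    """Eq. (7): probabilities of jumping to layer k + 1 and to layer k - 1."""
    p_up = weight_up / (weight_up + weight_down)
    return p_up, 1.0 - p_up


# Hypothetical distances from node "a" to nodes "b" and "c" in one layer.
print(within_layer_probs({"b": 0.0, "c": 1.0}))
print(layer_jump_probs(1.0, 3.0))
```

Structurally closer nodes (smaller distance) receive a larger share of the within-layer probability mass, which is what biases the walk toward nodes with a similar neighborhood structure.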
We start from the bottom layer and run random walks from randomly selected nodes. Each walk is 80 steps long and follows the probabilities in Eqs. (6) and (7), and 20 random walk sequences are generated for each node. The Skip-Gram [32] model is trained on the generated node sequences to produce 128-dimensional features for each node.
2.5 Constructing Patient Features
To describe patients better, we integrate the vectors of their mutated genes. For patients with different cancer types, we divide similar genes into corresponding clusters. For each patient, we create a new 128-dimensional vector for each cluster by fusing the vectors of the mutated genes in that cluster, and we concatenate the vectors of all clusters into a new vector that represents the patient’s features. When sorting the gene mutation data, we find that gene mutation frequencies differ significantly between cancer types: some genes mutate only in specific cancers, while others are mutated in all cancer types. We define a gene weight to increase the impact of genes that are significantly mutated in the cancer type in question. The weight is defined as:
$$\omega(n) = \frac{c_i(n)}{s(n)} \qquad (8)$$
where s(n) is the total count of gene n across the 14 cancer types, and $c_i(n)$ is the count of gene n in cancer type i. The patient features are then updated according to these weights.
2.6 Supervised Classification Model
We use the supervised classification algorithm LightGBM to classify patients. LightGBM is a framework that implements the GBDT (Gradient Boosting Decision Tree) algorithm [33, 34]. Its histogram-based approach discretizes continuous floating-point feature values into k integers and constructs a histogram of width k. In cancer classification, we regard tumors of a specific cancer type as positive samples and tumors of other cancer types as negative samples. In this study, the AUC value, namely the area under the receiver operating characteristic (ROC) [35] curve, is selected as the evaluation index for classification performance. We calculate the true positive rate (TPR) and false positive rate (FPR) at varying thresholds and obtain the ROC curve according to the following equation:
$$\begin{cases} TPR = \dfrac{TP}{TP + FN} \\ FPR = \dfrac{FP}{TN + FP} \end{cases} \qquad (9)$$
Construction of Gene Network Based on Inter-tumor Heterogeneity
where FN is the number of positive samples incorrectly identified as negative, FP is the number of negative samples incorrectly identified as positive, and TN and TP are the numbers of negative and positive samples identified correctly.
2.7 Unsupervised Classification Model
To cluster patients with diverse subtypes of the same cancer, we use the unsupervised OPTICS density clustering method. The algorithm has two main parameters: the neighborhood radius ε and the minimum number of neighbors MinPts. Unlike DBSCAN, OPTICS is designed to be less sensitive to the initial hyperparameter settings. For each temporary cluster, the algorithm checks whether its internal points are core points and merges the temporary cluster corresponding to each chosen core point with the current temporary cluster to create a new temporary cluster. The iteration continues until the points in the temporary cluster are either not core points or already lie within the density range. For a specific cancer type, the patients are divided into n clusters by setting ε and MinPts.
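Returning to the evaluation index of Sect. 2.6, the ROC construction in Eq. (9) can be sketched by sweeping the decision threshold over the predicted scores and integrating with the trapezoidal rule. The function names and toy scores below are assumptions for illustration and do not come from the RFNE code.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from Eq. (9), one per distinct score threshold.

    y_true holds 0/1 labels and must contain at least one positive and one negative.
    A sample is predicted positive when its score is >= the threshold.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Toy example: one of the two negatives outranks one of the two positives.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6]
print(auc(roc_points(labels, scores)))
```

The resulting AUC equals the fraction of positive-negative pairs in which the positive sample receives the higher score, which gives a quick sanity check on small examples.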
3 Results
3.1 Pan-Cancer Classification
In this study, gene expression, methylation, and gene mutation data of 14 cancer types were downloaded from TCGA’s official website: Bladder urothelial carcinoma (BLCA), Breast invasive carcinoma (BRCA), Cervical squamous cell carcinoma (CESC), Colon adenocarcinoma (COAD), Head and neck squamous cell carcinoma (HNSC), Kidney renal clear cell carcinoma (KIRC), Liver hepatocellular carcinoma (LIHC), Lung adenocarcinoma (LUAD), Lung squamous cell carcinoma (LUSC), Rectum adenocarcinoma (READ), Skin cutaneous melanoma (SKCM), Stomach adenocarcinoma (STAD), Thyroid carcinoma (THCA) and Uterine corpus endometrial carcinoma (UCEC). All the data are normalized, and the methylation sites located in the same gene are averaged. A total of 5290 samples were obtained. We obtain a feature vector for each patient by integrating the vectors of the mutated genes and the network embedding, so that each patient is represented by a 1280-dimensional vector. To verify the quality of the RFNE method, we use the LightGBM classification algorithm to make predictions for patients, with the learned patient features as input. Taking THCA as an example, we label patients with THCA as positive samples and patients with the 13 other cancers as negative samples. We run 5-fold cross-validation 100 times and take the average as the final AUC value. The AUC of the 14 cancer types ranges from 0.85 to 0.99, and the average AUC is close to 0.93. These high AUC values show that our algorithm predicts the cancer type of patients well. We compare the RFNE method with three state-of-the-art methods: NES, NBS and ECC. As shown in Fig. 2, for CESC, LUAD and STAD our method scores lower than NBS but higher than the other two methods, and for the other 11 cancer types it outperforms NES, NBS and ECC. On the whole, our method has the advantage.
Fig. 2. Comparison of ROC curves of our RFNE method and three state-of-the-art methods: NES, NBS, ECC for 14 cancer types.
3.2 Cancer Subtype Classification
Based on the results above, we believe that patients with similar clinical information tend to cluster together. This means that we can obtain an optimal clustering of patients through an unsupervised learning method, that is, subdivide patients into subtypes. Here, we use the unsupervised OPTICS (density-based clustering) method to cluster the patients. The number of patient clusters for each malignancy is roughly the same as the number of subtypes reported in the medical literature on tumor subtype identification. Taking THCA as an example, it is mainly divided into two subtypes: differentiated thyroid cancer and undifferentiated thyroid cancer. We generated two groups of patients with thyroid cancer (THCA) in the data. Fig. 3 shows the clustering results of
Construction of Gene Network Based on Inter-tumor Heterogeneity
353
darker colors. The results showed that paired patients had higher clustering. The KaplanMeier survival chart in Fig. 3 shows that there are obvious differences in survival time and survival probability between the identified subtypes. For example, the survival curve shows that the survival time of patients with two THCA subtypes is significantly different with log-rank P = 0.03. This shows that RFNE has good efficiency. Among the 14 cancer types, most of the cancer subtypes determined by RFNE are strongly associated with the survival rate of patients (P < 0.05).
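For readers unfamiliar with the survival curves mentioned above, the following is a minimal sketch of the Kaplan-Meier estimator for one patient group. The function name `kaplan_meier` and the toy follow-up data are hypothetical and are not taken from the study; the log-rank test that compares two such curves is not shown.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each time with a death.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy subtype with five patients; two of them are censored.
for t, s in kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
    print(t, round(s, 3))
```

Computing one such curve per identified subtype gives the kind of plot shown on the right side of Fig. 3.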
Fig. 3. Stratified results of tumor mutations in three cancer types: THCA, CESC, and HNSC. The left side depicts patient clustering, where a darker region suggests that the corresponding patients should be classified into the same subtype. The right side shows the survival analysis of patients with different subtypes; the P-value is obtained using the log-rank test.
4 Discussion

The network-embedding method of constructing a gene network based on random forest offers a new approach to precision medicine. We use an unsupervised random forest to construct a gene similarity network based on the assumption that two genes are similar if they fall into the same leaf node of the same tree. A structural feature learning representation model (struc2vec) is used to create gene features, and patient features are obtained by combining them with gene mutation information. The ROC curves show that classification based on the constructed patient vectors is effective. Finally, we also classify patients with specific cancer types into clinically relevant subtypes through the unsupervised OPTICS clustering algorithm. The clinical data show that the survival time of patients with different subtypes is significantly different.
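The leaf co-occurrence assumption can be sketched as follows. This is an illustrative sketch, not the RFNE implementation: the function name `leaf_cooccurrence_similarity` and the input layout (one list of leaf indices per tree) are assumptions, and the similarity is taken to be the fraction of trees in which two genes share a leaf.

```python
def leaf_cooccurrence_similarity(leaf_ids):
    """leaf_ids[t][g] is the index of the leaf that gene g reaches in tree t.

    Returns an n_genes x n_genes matrix whose entry (i, j) is the fraction of
    trees in which genes i and j end up in the same leaf.
    """
    n_trees = len(leaf_ids)
    n_genes = len(leaf_ids[0])
    counts = [[0] * n_genes for _ in range(n_genes)]
    for tree in leaf_ids:
        for i in range(n_genes):
            for j in range(n_genes):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Two trees, three genes: genes 0 and 1 share a leaf in the first tree,
# genes 1 and 2 share a leaf in the second tree.
sim = leaf_cooccurrence_similarity([[0, 0, 1], [2, 3, 3]])
print(sim[0][1], sim[1][2], sim[0][2])  # 0.5 0.5 0.0
```

Thresholding such a matrix is one simple way to turn the similarities into edges of a gene network.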
Although this study provides an effective method for tumor stratification, there are still some limitations. In the current framework, we only use mRNA expression, methylation, and gene mutation data. Integrating other types of data, such as miRNA and copy number variation, may further improve the RFNE model. Within this framework, other methods can also be used to solve the problem of cancer classification. For example, a graph convolution neural network [36] could be used to improve the prediction accuracy, and hierarchical clustering, K-means, and other clustering methods could replace the clustering step. In conclusion, this study provides a new approach to precision medicine.

Funding: This work has been supported by the National Natural Science Foundation of China (61902216, 61972236 and 61972226), and the Natural Science Foundation of Shandong Province (No. ZR2018MF013).

Code Availability: The code of RFNE is available on GitHub: github.com/Feng Li12/RFNE.
References

1. Hanahan, D., Weinberg, R.A.: Hallmarks of cancer: the next generation. Cell 144(5), 646–674 (2011)
2. Dobin, A., et al.: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1), 15–21 (2013)
3. Anders, S., Pyl, P.T., Huber, W.: HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2), 166–169 (2015)
4. Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45(10), 1113–1120 (2013)
5. Zhao, L., Lee, V.H., Ng, M.K., Yan, H., Bijlsma, M.F.: Molecular subtyping of cancer: current status and moving toward clinical applications. Brief. Bioinform. 20(2), 572–584 (2019)
6. Cheng, F., Jia, P., Wang, Q., Lin, C.-C., Li, W.-H., Zhao, Z.: Studying tumorigenesis through network evolution and somatic mutational perturbations in the cancer interactome. Mol. Biol. Evol. 31(8), 2156–2169 (2014)
7. Liu, H., Zhao, R., Fang, H., Cheng, F., Fu, Y., Liu, Y.-Y.: Entropy-based consensus clustering for patient stratification. Bioinformatics 33(17), 2691–2698 (2017)
8. Cancer Genome Atlas Research Network: Integrated genomic analyses of ovarian carcinoma. Nature 474(7353), 609 (2011)
9. Levine, D.A.: Integrated genomic characterization of endometrial carcinoma. Nature 497(7447), 67–73 (2013)
10. Esteva, F.J., et al.: Prognostic role of a multigene reverse transcriptase-PCR assay in patients with node-negative breast cancer not receiving adjuvant systemic therapy. Clin. Cancer Res. 11(9), 3315–3319 (2005)
11. Hoadley, K.A., et al.: Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 158(4), 929–944 (2014)
12. Koboldt, D., et al.: Comprehensive molecular portraits of human breast tumours. Nature 490(7418), 61–70 (2012)
13. Cheng, F., et al.: A gene gravity model for the evolution of cancer genomes: a study of 3,000 cancer genomes across 9 cancer types. PLoS Comput. Biol. 11(9), e1004497 (2015)
14. Hofree, M., Shen, J.P., Carter, H., Gross, A., Ideker, T.: Network-based stratification of tumor mutations. Nat. Methods 10(11), 1108–1115 (2013)
15. Liu, C., Han, Z., Zhang, Z.-K., Nussinov, R., Cheng, F.: A network-based deep learning methodology for stratification of tumor mutations. Bioinformatics 37(1), 82–88 (2021)
16. Liu, C., et al.: Computational network biology: data, models, and applications. Phys. Rep. 846, 1–66 (2020)
17. Peng, J., Guan, J., Shang, X.: Predicting Parkinson's disease genes based on node2vec and autoencoder. Front. Genet. 10, 226 (2019)
18. Zeng, X., et al.: Target identification among known drugs by deep learning from heterogeneous networks. Chem. Sci. 11(7), 1775–1797 (2020)
19. Zong, N., Kim, H., Ngo, V., Harismendy, O.: Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations. Bioinformatics 33(15), 2337–2344 (2017)
20. Wang, B., et al.: Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11(3), 333–337 (2014)
21. Lee, J.-H., et al.: Integrative analysis of mutational and transcriptional profiles reveals driver mutations of metastatic breast cancers. Cell Discov. 2(1), 1–14 (2016)
22. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
23. Chen, X., Liu, X.: A weighted bagging LightGBM model for potential lncRNA-disease association identification. In: Qiao, J., Zhao, X., Pan, L., Zuo, X., Zhang, X., Zhang, Q., Huang, S. (eds.) BIC-TA 2018. CCIS, vol. 951, pp. 307–314. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-2826-8_27
24. Dassun, J.C., Reyes, A., Yokoyama, H., Dolendo, M.: Ordering points to identify the clustering structure algorithm in fingerprint-based age classification. Virtutis Incunabula 2(1), 17–27 (2015)
25. Tomczak, K., Czerwińska, P., Wiznerowicz, M.: The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. 19(1A), A68 (2015)
26. Zhu, Y., Qiu, P., Ji, Y.: TCGA-assembler: open-source software for retrieving and processing TCGA data. Nat. Methods 11(6), 599–600 (2014)
27. Freund, Y., Mason, L.: The alternating decision tree learning algorithm. In: ICML, pp. 124–133. Citeseer (1999)
28. Altman, N.S.: An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46(3), 175–185 (1992)
29. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016)
30. Ribeiro, L.F., Saverese, P.H., Figueiredo, D.R.: struc2vec: Learning node representations from structural identity. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394 (2017)
31. Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In: KDD Workshop, Seattle, WA, USA, pp. 359–370 (1994)
32. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
33. Chen, T., et al.: Xgboost: extreme gradient boosting. R package version 0.4-2 1(4), 1–4 (2015)
34. Rao, H., et al.: Feature selection based on artificial bee colony and gradient boosting decision tree. Appl. Soft Comput. 74, 634–642 (2019)
35. Yang, S., Berdine, G.: The receiver operating characteristic (ROC) curve. Southwest Respiratory Critical Care Chronicles 5(19), 34–36 (2017)
36. He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., Wang, M.: Lightgcn: Simplifying and powering graph convolution network for recommendation. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 639–648 (2020)
A Novel Synthetic Lethality Prediction Method Based on Bidirectional Attention Learning

Fengxu Sun, Xinguo Lu(B), Guanyuan Chen, Xiang Zhang, Kaibao Jiang, and Jinxin Li

College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
[email protected]
Abstract. Simultaneous mutations in synthetic lethal genes can lead to cancer cell apoptosis, which can be exploited in targeted cancer therapy. However, high-throughput wet-laboratory screening methods are expensive and time-consuming. Computational methods are therefore a good complement for the prediction of synthetic lethality. Recently, graph embedding-based methods have been developed to predict synthetic lethal gene pairs. Here, we propose a novel synthetic lethality prediction method based on bidirectional attention learning. By aggregating biological multi-omics data, we construct the node embedding representation and the graph link representation respectively. The correlation between gene pairs with these two feature representations is calculated using a multilayer perceptron as a decoder. Gene pairs with a high correlation score are predicted as potential synthetic lethal pairs.

Keywords: Synthetic lethality · Bidirectional attention learning · Multi-omics data · Graph representation
1 Introduction

The identification of synthetic lethal (SL) interactions is extremely important in cancer therapy, as cancer treatments based on the SL concept produce fewer adverse effects [1]. However, high-throughput wet-lab screening methods are expensive and time-consuming [2]. Therefore, computational methods are an effective complement for the prediction of synthetic lethality. Many computational methods have been proposed to identify potential SL pairs on different gene expression levels from different platforms [3]. These methods for predicting genetic interactions are built on many kinds of biological data, including metabolic modeling, evolutionary characteristics, transcriptomic profiles, and interaction networks. Previous methods mainly use one of the genomic, epigenomic and transcriptomic levels from different platforms [4]. However, an individual data source may focus on a specific biological function, and gene interaction prediction on individual data sources may not reveal potential associations. Hence, there is a need to develop synthetic lethal gene discovery methods that can effectively represent and integrate the diverse data sources.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 356–363, 2022. https://doi.org/10.1007/978-3-031-13829-4_30
Recently, several matrix factorization and graph embedding-based methods [5] have been developed to predict synthetic lethal pairs. For example, Huang et al. present a framework to learn representations of genes based on graph regularized self-representative matrix factorization (GRSMF) [6]. Subsequently, Liany et al. propose a collective tri-matrix factorization model, named g-CMF, which integrates multiple heterogeneous data sources for predicting SL relationships based on collective matrix factorization [6]. Furthermore, Cai et al. proposed a dual-dropout graph convolutional network (DDGCN) to learn gene embeddings in the SL network for SL prediction [7]. These methods usually reconstruct gene signatures through a series of manipulations to predict potential synthetic lethal pairs. However, in the process of feature reconstruction, these methods often focus only on the synthetic lethality associations, while ignoring rich biological multi-omics data such as protein sequences [8], gene ontology [9] and protein interaction networks [10]. Although the matrix factorization method SL2MF uses GO and PPI data [11], it only calculates the similarity of genes in these two data sets and cannot dynamically use the GO and PPI features to learn the embedded representation of nodes. Here, we propose a synthetic lethality prediction method based on feature similarity and bidirectional attention. By aggregating biological multi-omics data, we construct the node embedding representation and the graph link representation respectively. Then, we construct the synthetic lethality prediction method with a bidirectional attention model, in which the correlation between gene pairs with these two feature representations is calculated using a multilayer perceptron as a decoder. Gene pairs with a high correlation score are predicted as potential synthetic lethal pairs.
2 Methods

We propose a synthetic lethality prediction method based on feature similarity and bidirectional attention, named BSGAT. As shown in Fig. 1, BSGAT mainly consists of three steps. The first step constructs features based on the amino acid sequence of the protein. The second step constructs features based on the topology of the proteins corresponding to the genes. The third step performs bidirectional attention feature learning between these two kinds of features. Finally, the correlation between the feature representations of a gene pair is calculated using a multilayer perceptron as a decoder; the higher the correlation, the more likely the two genes form a potential synthetic lethal pair.
Fig. 1. The overall framework of the synthetic lethality prediction method based on bidirectional attention learning.
2.1 Construction of Protein Amino Acid Sequence Features

Let $T = \{t_1, t_2, \ldots, t_{n_s}\}$ be the residue sequence of a protein. We convert the residue sequence into a randomly initialized embedding $S^{(1)} = \{s_1^{(1)}, s_2^{(1)}, \ldots, s_{n_s}^{(1)}\}$, where $s_i^{(1)} \in \mathbb{R}^{d_s}$, $d_s$ represents the dimension of the initial embedding, and $S^{(1)} \in \mathbb{R}^{d_s \times n_s}$ represents the feature matrix of the protein sequence. On the basis of the initial embedding, this method obtains the amino acid sequence feature matrix of all proteins, takes the amino acid sequence features as the initial input of the model, and updates them during training through several CNN layers with the nonlinear activation function ReLU. In order to obtain the amino acid sequence representation of all proteins, this method adds an average pooling layer at the end of the convolutional neural network. For the embedding representation $S^{(2)} = \{s_1^{(2)}, s_2^{(2)}, \ldots, s_{n_s}^{(2)}\}$ of a protein after the CNN, its final feature $s^*$ is represented as:

$$ s^* = \frac{1}{n_s} \sum_{i=1}^{n_s} s_i^{(2)} \tag{1} $$

We write $s^*_i \in \mathbb{R}^{d_s}$ for the amino acid sequence feature of the $i$-th protein. Hence, we obtain the feature representation $S = \{s^*_1, s^*_2, \ldots, s^*_n\}$ of the amino acid sequences of all the proteins.

2.2 Construction of Protein Topological Features

We randomly initialize the feature representation $P = \{p_1, p_2, \ldots, p_n\}$ of proteins, where $p_i \in \mathbb{R}^{d_1}$, $d_1$ represents the dimension of the initial embedding, and $n$ represents the number of proteins. We adopt a linear transformation to transform the initial feature representation into $P'$:

$$ P' = W P + b_1 \tag{2} $$
where $W \in \mathbb{R}^{d_2 \times d_1}$ is the weight matrix, $d_2$ is the feature dimension after the linear transformation, and $b_1$ is the bias. Next, we use the normalized attention coefficients to update the hidden vector of each vertex as its output feature:

$$ \hat{p}_i = \sigma\Big( \sum_{j \in \mathcal{N}_i} \alpha_{ij}\, p'_j \Big) \tag{3} $$

where $p'_j$ is the transformed feature of the $j$-th node, $\mathcal{N}_i$ is the set of neighbors of the $i$-th node, and $\alpha_{ij}$ is the attention coefficient between the $i$-th and the $j$-th node. Then, to stabilize the learning process of self-attention, the output features of $K$ independent graph attention layers are combined by multi-head attention into $p_i^{cat}$:

$$ p_i^{cat} = \big\Vert_{k=1}^{K}\, \sigma\Big( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, p'_j \Big) \tag{4} $$

Here, $\Vert$ represents the vector concatenation operation, $p_i^{cat} \in \mathbb{R}^{K d_1}$ is the concatenated feature vector of the $K$ attention layers, and $\alpha_{ij}^{k}$ denotes the attention coefficient between the $i$-th and $j$-th vectors in the $k$-th attention layer. Finally, the feature vector $p_i^{cat}$ is transformed using a multilayer perceptron (MLP):

$$ p_i^{*} = \mathrm{ReLU}(W_a\, p_i^{cat}) \tag{5} $$
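The multi-head aggregation of Eqs. (3) and (4) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the BSGAT implementation: the attention coefficients are taken as given (the text does not spell out how they are computed), ReLU is assumed for the activation $\sigma$, and the names `attention_head` and `multi_head` are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def attention_head(alpha, feats, i, neighbors, act=relu):
    """One head, as in Eq. (3): act( sum_{j in N_i} alpha[i][j] * feats[j] )."""
    dim = len(feats[0])
    out = [0.0] * dim
    for j in neighbors[i]:
        for d in range(dim):
            out[d] += alpha[i][j] * feats[j][d]
    return [act(v) for v in out]


def multi_head(alphas, feats, i, neighbors, act=relu):
    """K heads, as in Eq. (4): concatenate the outputs of all heads for node i."""
    cat = []
    for alpha in alphas:  # one coefficient matrix per attention head
        cat.extend(attention_head(alpha, feats, i, neighbors, act))
    return cat


# Two nodes with 2-dimensional (already transformed) features and K = 2 heads.
feats = [[1.0, 0.0], [0.0, 2.0]]
neighbors = {0: [0, 1], 1: [0, 1]}
alphas = [
    [[0.5, 0.5], [0.5, 0.5]],   # head 1: uniform attention
    [[1.0, 0.0], [0.0, 1.0]],   # head 2: each node attends only to itself
]
print(multi_head(alphas, feats, 0, neighbors))  # [0.5, 1.0, 1.0, 0.0]
```

The concatenated vector has length K times the per-head dimension, which is why a further linear layer such as the one in Eq. (5) is needed to bring it back to a fixed feature size.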
2.3 Bidirectional Attention Learning

Bidirectional attention is described as:

$$ A_s = \sigma\big( [S^{(1)} W_s \,\Vert\, E_s]\, a_s \big) \tag{6} $$

$$ A_p = \sigma\big( [P^{(1)} W_p \,\Vert\, E_p]\, a_p \big) \tag{7} $$

Here $W_s, W_p \in \mathbb{R}^{d \times d}$ are weight matrices and $\Vert$ is the vector concatenation operation. $a_s \in \mathbb{R}^{2d \times d}$ and $a_p \in \mathbb{R}^{2d \times d}$ represent the attention matrices in the attention mechanism. $\sigma$ indicates that normalization is performed on each column. $A_s \in \mathbb{R}^{d \times d}$ and $A_p \in \mathbb{R}^{d \times d}$ represent the normalized attention coefficients. Then, the amino acid features and PPI features of genes are expressed as:

$$ \tilde{S} = S^{(1)} \cdot A_s \tag{8} $$

$$ \tilde{P} = P^{(1)} \cdot A_p \tag{9} $$

where $\tilde{s} \in \mathbb{R}^{d}$ and $\tilde{p} \in \mathbb{R}^{d}$ represent the amino acid sequence feature and the PPI feature of a gene after attention, respectively. Then, the outputs of the $M$-layer attention mechanism are transformed into dimension $d$ through a single-layer neural network:

$$ S^{out} = W_{cs}\, \tilde{S} \tag{10} $$
$$ P^{out} = W_{cp}\, \tilde{P} \tag{11} $$

where $\tilde{P}$ is the PPI feature after attention, $W_{cs}, W_{cp} \in \mathbb{R}^{d \times Md}$ are weight matrices, and $S^{out}$ and $P^{out}$ are the final representation features. The predicted score matrix $Y$ for the final synthetic lethal interaction is expressed as follows:

$$ Y = W_o\, [S^{out} \,\Vert\, P^{out}] + b_o \tag{12} $$

Here, $W_o \in \mathbb{R}^{d \times 2d}$ and $b_o \in \mathbb{R}^{d}$ are the weight matrix and bias. $\sigma$ is the sigmoid function. The value $y_{ij}$ in the prediction matrix $Y$ represents the prediction score between the $i$-th gene and the $j$-th gene.
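Two building blocks of this section, the column-wise normalization used for the attention coefficients and the concatenate-then-linear decoder of Eq. (12), can be sketched as below. This is a hedged illustration only: a softmax is assumed for the column normalization (the text only says that each column is normalized), the decoder is shown for a single gene, and the names `column_softmax` and `decode` as well as all numbers are hypothetical.

```python
import math


def column_softmax(mat):
    """Normalize each column of a matrix so that its entries sum to 1
    (softmax is an assumption; the text only states column normalization)."""
    rows, cols = len(mat), len(mat[0])
    out = [[0.0] * cols for _ in range(rows)]
    for c in range(cols):
        col = [mat[r][c] for r in range(rows)]
        m = max(col)                         # subtract the max for stability
        exps = [math.exp(v - m) for v in col]
        total = sum(exps)
        for r in range(rows):
            out[r][c] = exps[r] / total
    return out


def decode(w_o, b_o, s_out, p_out):
    """W_o [s_out || p_out] + b_o for one gene, as in Eq. (12)."""
    x = s_out + p_out                        # list concatenation = vector join
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(w_o, b_o)]


print(column_softmax([[0.0, 1.0], [0.0, 1.0]]))   # [[0.5, 0.5], [0.5, 0.5]]

# d = 2: W_o has shape d x 2d, b_o has length d.
w_o = [[1.0, 0.0, 0.0, 1.0],
       [0.0, 1.0, 1.0, 0.0]]
b_o = [0.5, -0.5]
print(decode(w_o, b_o, [1.0, 2.0], [3.0, 4.0]))    # [5.5, 4.5]
```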
3 Experiment and Result

3.1 Dataset

The experiment uses three data sources: a synthetic lethal dataset, a PPI dataset, and a protein sequence dataset. The synthetic lethal dataset is downloaded from SynLethDB, which contains 6375 interaction associations on 19667 genes. The PPI dataset is built from interaction data in STRING and is represented as a graph. The protein sequence data is downloaded from the UniProtKB database.

3.2 Experimental Setup

We utilize features learned by the CNN and the GAT to represent gene features from different data sources. The proposed method represents the protein associations between genes as a graph and uses the GAT to learn the feature representation of genes, which can extract various information from the graph, such as the degree of gene correlation. The amino acid sequence of each protein is used to generate the feature representation of the protein with the CNN. Finally, the method integrates the two representations of genes using a bidirectional attention neural network and predicts the interactions of gene pairs at the corresponding positions. Given a set of synthetic lethal interaction labels, the training objective of this method is to minimize the loss function and optimize all weight matrices and bias vectors in the GAT, the CNN, and the feature similarity-based bidirectional attention neural network using backpropagation. In this experiment, the learning rate is empirically set to 0.005. Based on the known synthetic lethal label data in the SynLethDB dataset, we use 5-fold cross-validation to evaluate the predictive performance of our method. For each dataset to be predicted, we randomly divide the known synthetic lethal associations into 5 subsets of equal size. In each fold, one of the subsets is selected as the test set of the model, and the remaining 4 subsets are used as the training set. To avoid the bias caused by the random split of the data and to make the results more reliable, we repeat the 5-fold cross-validation ten times independently and take the average of the resulting 50 evaluation results as the final performance.
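The evaluation protocol, ten independent repetitions of 5-fold cross-validation giving 50 train/test splits, can be sketched as follows. This is a minimal standard-library sketch; the function name `repeated_kfold`, the seed, and the sample count are hypothetical, and model training and scoring are left out.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for n_repeats independently
    shuffled k-fold partitions of range(n_samples)."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                       # new random split per repetition
        folds = [idx[f::n_splits] for f in range(n_splits)]
        for f in range(n_splits):
            test = folds[f]
            train = [i for g in range(n_splits) if g != f for i in folds[g]]
            yield train, test


splits = list(repeated_kfold(n_samples=20))
print(len(splits))  # 50 evaluations: 10 repetitions x 5 folds
```

The final performance is then the mean of the metric (for example AUPR, AUC, or F1) over all 50 splits.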
3.3 Performance Comparison of Different Models

We compare our method with other state-of-the-art methods, including biological network-based methods (DeepWalk, LINE), matrix factorization-based methods (SL2MF, GRSMF, CMF) and feature reconstruction-based methods (DDGCN, VGAE).

Table 1. Performance comparison of different methods under 5-CV

Method     AUPR     AUC      F1
DeepWalk   0.0010   0.5260   0.0005
LINE       0.0214   0.7409   0.0842
SL2MF      0.2824   0.8529   0.4363
GRSMF      0.1465   0.9043   0.2688
CMF        0.0011   0.5572   0.0028
VGAE       0.2667   0.8601   0.4663
DDGCN      0.3242   0.8763   0.4724
BSGAT      0.3348   0.8974   0.4639
Table 1 shows the performance of different methods on the SynLethDB dataset. As shown in the table, BSGAT ranks first in AUPR, about one percentage point ahead of the second-best method, DDGCN. The AUC of BSGAT ranks second among all methods and is less than one percentage point below the 0.9043 of the first-place method GRSMF. Its F1 score ranks third, behind DDGCN (0.4724) and VGAE (0.4663), and is also less than one percentage point below the best value.

3.4 Case Study

To investigate the ability of the proposed model to identify novel synthetic lethal interactions, we conduct a case study using all synthetic lethal pairs in SynLethDB. The model is trained with the known synthetic lethal genes in SynLethDB. The model with the best performance under 5-fold cross-validation is used to score the degree of synthetic lethal association of genes and then to predict potential synthetic lethal pairs among the unknown gene associations. Among the 1000 synthetic lethal pairs predicted by the model, 10 gene pairs were verified in SynLethDB 2.0 (Table 2). We also found that many of these synthetic lethal pairs have been verified by wet-lab experiments. For example, USP1 and TP53 in row 9 have been validated in RNAi wet-lab experiments [12], KRAS and DDR1 in row 7 have been validated by shRNA high-throughput screening [13], and E2F1 and KRAS in row 8 have been confirmed by an siRNA screening method [3]. The association of KRAS and SRP9 in row 5 was validated by a CRISPR screen-based approach [14]. Indeed, KRAS is a commonly mutated oncogene in human cancers and is considered a high-priority synthetic lethal therapeutic target, and in this case study, multiple synthetic lethal associations involving KRAS were validated. In conclusion, this model is an effective synthetic lethality prediction tool, which can help biologists to screen for synthetic lethality-related effects.
Table 2. Predicted synthetic lethal correlations confirmed by BSGAT in SynLethDB

Number   Gene1    Gene2   PubMed ID
1        BID      KRAS    24104479
2        KRAS     SSH3    24104479
3        KIT      ABL1    26637171
4        MAPK1    TP53    23728082
5        SRP9     KRAS    28700943
6        PTEN     CHEK1   28319113
7        KRAS     DDR1    24104479
8        E2F1     KRAS    22613949
9        USP1     TP53    23284306
10       PDGFRB   KIT     26637171
4 Conclusions

Existing synthetic lethality prediction methods mostly rely on synthetic lethality associations when building models, while ignoring the diverse biological data of the genes themselves. In this study, gene features are first constructed by a convolutional neural network based on the amino acid sequence of the protein corresponding to the gene. A graph attention neural network is then used to obtain the feature representation of genes on the protein interaction structure. Then, the bidirectional attention between the two features is calculated to capture the feature representation of genes on the two kinds of biological data. Finally, potential synthetic lethal pairs are predicted using the integrated feature embedding. To demonstrate the effectiveness of the presented method and verify the performance of the model, the model is experimentally validated on SynLethDB and compared with multiple benchmark methods. The experimental results show that BSGAT can make good use of the information in different data sources and predict potential synthetic lethal pairs. Finally, to verify the ability of this model to predict potential synthetic lethal interactions, we conduct a case study on SynLethDB, a comprehensive database of synthetic lethal labels, and many results predicted by the model are verified in SynLethDB 2.0. The experimental results show that BSGAT can efficiently identify potential synthetic lethal pairs on the synthetic lethal benchmark dataset.

Acknowledgements. This work was supported by the Natural Science Foundation of China (Grant No. 61972141) and the Natural Science Foundation of Hunan Province, China (Grant No. 2021JJ30144).
References

1. Chan, D.A., Giaccia, A.J.: Harnessing synthetic lethal interactions in anticancer drug discovery. Nat. Rev. Drug Discov. 10(5), 351–364 (2011)
2. Du, D., Roguev, A., Gordon, D.E., et al.: Genetic interaction mapping in mammalian cells using CRISPR interference. Nat. Methods 14(6), 577–580 (2017)
3. Luo, J., Emanuele, M.J., Li, D., et al.: A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene. Cell 137(5), 835–848 (2009)
4. Zhong, W., Sternberg, P.W.: Genome-wide prediction of C. elegans genetic interactions. Science 311(5766), 1481–1484 (2006)
5. Long, Y., Wu, M., Liu, Y., et al.: Graph contextualized attention network for predicting synthetic lethality in human cancers. Bioinformatics 37(16), 2432–2440 (2021)
6. Huang, J., Wu, M., Lu, F., et al.: Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization. BMC Bioinformatics 20(19), 1–8 (2019)
7. Cai, R., Chen, X., Fang, Y., et al.: Dual-dropout graph convolutional network for predicting synthetic lethality in human cancers. Bioinformatics 36(16), 4458–4465 (2020)
8. Clark, W.T., Radivojac, P.: Analysis of protein function and its prediction from amino acid sequence. Proteins Struct. Funct. Bioinform. 79(7), 2086–2096 (2011)
9. Ashburner, M., Ball, C.A., Blake, J.A., et al.: Gene ontology: tool for the unification of biology. Nat. Genet. 25(1), 25–29 (2000)
10. Keshava Prasad, T.S., Goel, R., Kandasamy, K., et al.: Human protein reference database—2009 update. Nucleic Acids Res. 37(suppl_1), D767–D772 (2009)
11. Liu, Y., Wu, M., Liu, C., et al.: SL2MF: Predicting synthetic lethality in human cancers via logistic matrix factorization. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(3), 748–757 (2019)
12. Kipf, T.N., Welling, M.: Variational graph auto-encoders (2016)
13. Vizeacoumar, F.J., Arnold, R., Vizeacoumar, F.S., et al.: A negative genetic interaction map in isogenic cancer cell lines reveals cancer cell vulnerabilities. Mol. Syst. Biol. 9(1), 696 (2013)
14. Martin, T.D., Cook, D.R., Choi, M.Y., et al.: A role for mitochondrial translation in promotion of viability in K-Ras mutant cells. Cell Rep. 20(2), 427–438 (2017)
A Novel Trajectory Inference Method on Single-Cell Gene Expression Data

Daoxu Tang, Xinguo Lu(B), Kaibao Jiang, Fengxu Sun, and Jinxin Li

College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
[email protected]
Abstract. Recent advances in single-cell RNA sequencing (scRNA-seq) allow researchers to study cellular differentiation and heterogeneity at the single-cell level. Analysis of single-cell gene expression data taken during cell differentiation provides insight into the dynamic processes of cells and helps to understand the mechanisms of normal development and disease. In this work, we present a novel trajectory inference method which adopts cluster ensembles with an elastic principal graph optimization strategy to improve the robustness of lineage predictions. To evaluate the performance of this method, we carry out experiments on published datasets that include one or multiple lineage bifurcations. The experimental results show that the proposed method can be effectively used for robust lineage reconstruction. Moreover, the reconstructed trajectories enable us to build a flat tree plot which arranges cells on parallel branches in pseudotime order. The flat tree plot helps us further understand cell state transitions.

Keywords: Single cell RNA-seq · Consensus clustering · Cell lineage reconstruction · Trajectory inference · Elastic principal graph
1 Introduction

Cell differentiation and proliferation are often accompanied by cell state transitions. The study of the trajectory of cell differentiation has always been an important topic in biology. Analyzing trajectories of cell differentiation can help researchers understand how lineages differentiate into specific cell types from embryonic development. During cell differentiation, macro-regulation of genes causes cells to exhibit strong heterogeneity and asynchrony [1]. With the development of high-throughput sequencing technology, scRNA-seq has been applied in various disciplines including biology, pharmacology, and clinical medicine, and it allows researchers to study the fundamental mechanisms underlying normal development and diseases [2, 3]. It also provides cell-level resolution, which offers new insights into cellular heterogeneity. From the computational viewpoint, dimensionality reduction, clustering analysis and trajectory inference are the three most common but most challenging data analysis tasks for scRNA-seq data [4]. Trajectory inference (TI) methods, also known as pseudotime ordering, have recently emerged as an effective way to analyze scRNA-seq data. They aim to infer cell lineage relationships when cells undergo specific cell type transitions based on single-cell gene expression data.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 364–373, 2022. https://doi.org/10.1007/978-3-031-13829-4_31

Many algorithms have been proposed specifically for the task of lineage reconstruction. Most of the computational approaches can be divided into MST (minimum spanning tree)-based and PG (principal graph)-based methods. As a general analysis tool, Monocle [1] utilizes independent component analysis (ICA) for dimensionality reduction and then builds a minimum spanning tree (MST) to connect all cells. However, because all cells are involved in the constructed MST, the computation is time-consuming and the structure is quite intricate. Besides, it cannot work well without relevant prior knowledge. To address this problem, Monocle2 [5] utilizes the advanced machine learning technique reverse graph embedding (RGE) to learn explicit principal graphs from single-cell transcriptomic data and detect branching events without prior knowledge. Other methods, such as Waterfall [6] and TSCAN [7], use dimensionality reduction algorithms to obtain a lower-dimensional embedding of the data, build an MST on the cluster centers, and then project each cell onto the tree to obtain the pseudotime of cells. Another class of methods uses principal graphs or principal curves to reconstruct cell developmental trajectories. As a kind of graph-based data approximator, principal graphs have been widely used in computer science and other related disciplines; trajectory inference methods of this class include STREAM [8] and MERLoT [9]. Although many methods have made significant progress in lineage reconstruction, trajectory inference algorithms often suffer from dimensionality reduction techniques that are limited in their ability to reconstruct complex trajectory structures. On the other hand, traditional clustering algorithms such as k-means, hierarchical clustering and spectral clustering are usually affected by noise and high dimensionality [10].
Recently, intensive efforts have been devoted to consensus clustering methods because of their superior stability and robustness. Many researchers have used ensemble clustering to analyze gene expression data and obtained good results by combining the outputs of different clusterings through a cluster ensemble model. We therefore have reason to believe that ensemble clustering can effectively identify cell subpopulations and, in turn, support the reconstruction of a more robust trajectory structure. Here, we present an efficient and stable trajectory inference (TI) method that combines consensus clustering with an elastic principal graph optimization strategy to improve the robustness of lineage predictions. Consensus clustering aggregates multiple clusterings into a highly stable partition, which helps to cope with noise, outliers and batch effects and to cluster cells precisely. Like TSCAN, a previous cluster-based method, our algorithm first clusters the cells (here by consensus clustering) to obtain cluster centers. We assume that cells belonging to the same cluster share the same cell state and that cell state changes occur between different clusters. Then, an MST is constructed on the cluster centers and used to initialize the elastic principal graph. Finally, we apply the elastic principal graph algorithm to the MST to obtain an optimized tree structure. The experimental results show that the presented method can effectively reconstruct cellular trajectories on previously published scRNA-seq datasets with both linear and branching structures.
D. Tang et al.
2 Methods

2.1 Cell Partition and Clustering

Given a gene expression matrix $G_{M \times N}$ with M genes measured in N cells, each entry represents the expression level of a gene in a cell. We aim to obtain a cell sample set (CST) containing K different subsamples of the N cells, using a combination of uniform and non-uniform sampling. For small datasets, we use uniform, probability-based systematic sampling. For large datasets, we apply a non-uniform, density-based sampling strategy whose implementation is available on the Python Package Index [11]. Specifically, the local density of a cell is the number of cells within a fixed-size neighborhood of it in the high-dimensional gene expression space. The probability of each cell being selected, $P(C_i)$, is calculated as follows:

$$P(C_i) = \begin{cases} 0, & OD > LD_i \\ 1, & OD \le LD_i \le TD \\ TD / LD_i, & LD_i > TD \end{cases} \tag{1}$$

where $LD_i$ represents the local density of the ith cell, and OD (outlier density) and TD (target density) are computed as particular percentiles of the distribution of local densities. Given the local densities, we can draw a subsample of the cells that preserves most of the coverage of the gene expression space. To improve the robustness of the single-cell clustering, we repeat this sampling K times and obtain a cell sample set in which each subsample has $N'$ ($N' < N$) cells.

2.2 Consensus Clustering

Although many current semi-supervised clustering methods can achieve good results without being given the number of clusters, they usually rely on another form of restriction (adjusting parameters, setting thresholds). Consensus clustering provides reliability metrics for determining the likely number of clusters in a dataset. We adopted consensus clustering to aggregate the K different partitions and used a cluster ensemble model to find the final consensus partition that is most similar to all of them.
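The density-based selection rule of Eq. (1) in Sect. 2.1 can be illustrated with a minimal Python sketch. It is not the package cited in [11]: the function names, the brute-force neighbour count, the nearest-rank percentile, and the default percentiles (`od_pct=1`, `td_pct=50`) are all assumptions made for illustration.

```python
import math
import random

def local_densities(cells, radius):
    """For each cell, count the other cells within `radius` (Euclidean distance)."""
    densities = []
    for i, a in enumerate(cells):
        count = sum(
            1 for j, b in enumerate(cells)
            if i != j and math.dist(a, b) <= radius
        )
        densities.append(count)
    return densities

def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]); an assumed convention."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def selection_probability(ld, od, td):
    """Eq. (1): 0 below the outlier density, 1 up to the target density, TD/LD above it."""
    if ld < od:
        return 0.0
    if ld <= td:
        return 1.0
    return td / ld

def density_subsample(cells, radius, od_pct=1, td_pct=50, seed=0):
    """Return the indices of the cells kept in one density-based subsample."""
    rng = random.Random(seed)
    densities = local_densities(cells, radius)
    od = percentile(densities, od_pct)   # hypothetical outlier percentile
    td = percentile(densities, td_pct)   # hypothetical target percentile
    return [
        i for i, ld in enumerate(densities)
        if rng.random() < selection_probability(ld, od, td)
    ]
```

Calling `density_subsample` K times with different seeds would give K subsamples in the spirit of the cell sample set described above; dense regions are thinned towards the target density while sparse regions are kept.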
For each partition, we obtain a binary similarity matrix from the corresponding clustering result:

$$B(C_i, C_j) = \begin{cases} 1, & \text{if the two cells belong to the same cluster} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

That is, B is the (N × N) binary similarity matrix whose entry is equal to 1 if the two cells belong to the same cluster, and 0 otherwise. The consensus matrix C is then constructed by averaging these binary similarity matrices. According to the above definition, $C_{i,j}$ can be written as follows:

$$C_{i,j} = \frac{TB_{i,j}}{K} \tag{3}$$
where $TB_{i,j}$ is the number of times the two cells are assigned to the same cluster and K denotes the total number of clusterings. Each entry therefore estimates the probability that the two cells belong to the same cluster. We build a hypergraph based on the consensus matrix C, and then use three cluster ensemble algorithms, CSPA (cluster-based similarity partitioning algorithm), HGPA (hypergraph partitioning algorithm) and MCLA (meta-clustering algorithm), to partition the hypergraph separately. The final consensus labels are those of the result with the best normalized mutual information with respect to the different partitions.

2.3 Initial Tree Structure and Elastic Principal Graph Embedding

Because previous studies have shown that principal graph optimization may converge to local minima, building a good initial structure is critical for the quality of the inferred trajectory. For a given consensus clustering result, we take the cluster centers as the nodes of a graph and connect them by a minimum spanning tree (MST), that is, the tree connecting all nodes with the smallest sum of edge weights. Constructing an MST requires a distance measure between nodes (in this case, cell clusters). To obtain a better initial tree structure and to support the subsequent trajectory inference, we use a Mahalanobis-like distance between clusters. This distance, a covariance-scaled Euclidean distance, captures both the differences and the associations between sample features. The distance between a pair of clusters is computed as follows:

$$d_m(C_{ki}, C_{kj}) = (X_{ki} - X_{kj})\,(Y_{ki} + Y_{kj})^{-1}\,(X_{ki} - X_{kj})^{T}, \quad ki, kj < K \tag{4}$$

where $X_{ki}$ and $X_{kj}$ denote the centers of clusters ki and kj, respectively, and $Y_{ki}$ and $Y_{kj}$ represent the covariance matrices of the corresponding clusters.
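As an illustration of Eqs. (2)-(4), the sketch below averages co-membership matrices into a consensus matrix, evaluates the Mahalanobis-like distance between two cluster centers, and builds an MST with Prim's algorithm. It is a simplified sketch under stated assumptions, not the authors' implementation: it assumes every partition labels all N cells (the subsampling of Sect. 2.1 is ignored), it omits the hypergraph-based CSPA/HGPA/MCLA step, and all function names are hypothetical.

```python
import numpy as np

def consensus_matrix(partitions):
    """Average the binary co-membership matrices of K partitions (Eqs. 2-3).

    `partitions` has shape (K, N); row k holds the cluster label of every cell
    in the k-th clustering (assumption: all cells are labelled in every row).
    """
    partitions = np.asarray(partitions)
    k, n = partitions.shape
    total = np.zeros((n, n))
    for labels in partitions:
        # 1 where two cells share a cluster label, 0 otherwise.
        total += (labels[:, None] == labels[None, :]).astype(float)
    return total / k

def mahalanobis_like(x_i, x_j, cov_i, cov_j):
    """Eq. (4): centre difference scaled by the inverse of the summed covariances."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    pooled = np.asarray(cov_i, dtype=float) + np.asarray(cov_j, dtype=float)
    return float(diff @ np.linalg.inv(pooled) @ diff)

def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix; returns edges (u, v)."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        u, v = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges
```

In a full pipeline, `mahalanobis_like` would be evaluated for every pair of consensus clusters to fill the distance matrix passed to `minimum_spanning_tree`, and the resulting tree would serve as the initial structure for the elastic principal graph.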
The initial tree structure may not be smooth enough, so we apply the elastic principal graph algorithm [12, 13] (Epigraph), which penalizes edges and branches in the tree through an elastic energy, to obtain an optimized tree structure. A smooth trajectory structure is obtained once the algorithm converges. Elastic principal graphs are data approximators that approximate the distribution of data points. Concisely, an elastic principal graph is a graph whose vertices represent data points and whose edges represent the connections between them. The graph is fitted to the data by minimizing the sum of the data approximation term and the graph elastic energy, which is computed as follows:
$$U^{\phi}(X, G) = \frac{1}{|X|}\sum_{j=1}^{|V|}\sum_{i:P(i)=j}\min\left(\lVert X_i - \phi(V_j)\rVert^2, R_0^2\right) + \sum_{E^{(i)}}\left[\lambda + \alpha\left(\max\left(2, \deg(E^{(i)}(0)), \deg(E^{(i)}(1))\right) - 2\right)\right]\lVert \phi(E^{(i)}(0)) - \phi(E^{(i)}(1))\rVert^2 + \mu\sum_{S^{(j)}}\left\lVert \phi(S^{(j)}(0)) - \frac{1}{\deg(S^{(j)}(0))}\sum_{i=1}^{\deg(S^{(j)}(0))}\phi(S^{(j)}(i))\right\rVert^2 \tag{5}$$

where $X = \{X_i\}$, $i = 1, \ldots, |X|$ is the set of data points; $E^{(i)}$ represents an edge of the graph connecting the two vertices $E^{(i)}(0)$ and $E^{(i)}(1)$; $S^{(j)}(0), \ldots, S^{(j)}(k)$ denotes the set of vertices of a star; $\deg(V_i)$ is a function that returns the order k of the star whose central vertex is $V_i$; $\phi(V_j)$ defines the position of the jth vertex of the graph; $P(i) = j$ indicates that data point $X_i$ is assigned to vertex $V_j$; and $R_0$, $\lambda$, $\mu$ and $\alpha$ are parameters that adjust the positions of the vertices and the connectivity of the edges in the graph.

2.4 Cell Pseudotime Ordering

The main goal of trajectory inference is to assign a pseudotime to each individual cell. Once the pseudotime is determined, the biological progress of each cell is known. Based on the trajectory structure obtained above, we derive the ordering at the cellular level by projecting cells onto the main path and onto each branching path. For this purpose, we use Epigraph (implemented in the ElPiGraph.R module) to obtain the pseudotime of each cell. For each path, the pseudotime of a cell is defined as the arc length between the starting point of the curve and the projection of the cell onto the curve. After pseudotime has been assigned to every cell, the cell with the lowest pseudotime may be regarded as the starting point of the differentiation path.
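The arc-length definition of pseudotime in Sect. 2.4 can be sketched in a few lines of Python. This is only an illustration of the projection idea for a single path given as a polyline; it is not the ElPiGraph.R routine, and the function names are hypothetical.

```python
import math

def project_onto_path(point, path):
    """Arc length from path[0] to the point of the polyline closest to `point`."""
    best_dist, best_arc, travelled = math.inf, 0.0, 0.0
    for a, b in zip(path, path[1:]):
        seg = [bj - aj for aj, bj in zip(a, b)]
        seg_len2 = sum(s * s for s in seg)
        if seg_len2 == 0:
            t = 0.0
        else:
            # Position of the orthogonal projection along the segment, clamped to [0, 1].
            dot = sum((pj - aj) * sj for pj, aj, sj in zip(point, a, seg))
            t = max(0.0, min(1.0, dot / seg_len2))
        proj = [aj + t * sj for aj, sj in zip(a, seg)]
        dist = math.dist(point, proj)
        seg_len = math.sqrt(seg_len2)
        if dist < best_dist:
            best_dist, best_arc = dist, travelled + t * seg_len
        travelled += seg_len
    return best_arc

def pseudotime(cells, path):
    """Pseudotime of each cell along one path, measured from the path's first node."""
    return [project_onto_path(cell, path) for cell in cells]
```

For a branching tree, the same projection would be applied to the main path and to each branch separately, with the root node of the tree as `path[0]`.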
3 Experiments and Results

3.1 Datasets and Data Preprocessing

To evaluate the robustness of our method, we analyzed three scRNA-seq datasets and one synthetic dataset for which the cell types were known or confirmed by previous studies. The three scRNA-seq datasets came from Trapnell et al. [1], Klein et al. [14] and Guo et al. [15]. For the first dataset, we analyzed a subset of human skeletal muscle myoblasts (HSMM) containing about 271 cells collected at 0, 24, 48 and 72 h. The second dataset contains 2717 single cells describing the differentiation of mouse embryonic stem cells after LIF withdrawal. The Guo dataset contains 438 cells and the expression levels of 48 selected genes from mouse embryos across 7 stages. We used the dyngen data simulation tool to generate a synthetic dataset of 700 cells divided into three cell types. In the following, the three scRNA-seq datasets are referred to as HSMM, Klein and Guo, respectively. To alleviate the effect of drop-out events on the subsequent analyses [16] and to improve the quality of the scRNA-seq data, we apply data preprocessing, including quality control, log-normalization and batch correction, to all four datasets [17]. After that, we filter out a subset of cells and genes, and the remaining cells and genes are used for downstream processing. Because raw single-cell gene expression data are generally high-dimensional, it is difficult to capture the intrinsic structure and characteristics of the data directly.

3.2 Evaluation Metrics

The accuracy of the cell pseudotime is evaluated with the Pearson correlation r, which is calculated by the following equation:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \tag{6}$$
where $\bar{x}$ is the mean of the vector x (true sampling time) and $\bar{y}$ is the mean of the vector y (inferred pseudotime). A higher coefficient indicates that the inferred pseudotime is more accurate.

3.3 Reconstruction of Cell Lineage Trees

We used three scRNA-seq datasets and one synthetic dataset to evaluate the performance of the proposed method. All datasets were preprocessed (quality control, batch-effect correction) so that reliable lineage trees could be reconstructed. For each dataset, we first selected 1000 variable genes and mapped them onto 30 principal components by PCA; this step captures the overall expression profile of the most representative genes. To visualize the dimensionality reduction results, we further mapped them onto a 2-dimensional subspace by MLLE or SE. We first analyzed the HSMM dataset. As can be seen in Fig. 1, three cell clusters were identified by spectral clustering in HSMM, which mainly correspond to three different cell states. We inferred a linear trajectory from the green cluster, through the orange cluster, to the blue cluster. The green dots indicate the initial distribution of cells, which gradually transition to orange and finally settle in blue. The transitions between the three colors represent transitions between the different cell states. For the Klein dataset, we identified four cell clusters and obtained a multi-branched trajectory structure after applying EPG. Figure 2 shows that S1 is the starting point and that S0 and S3 are branch points, which indicates that the cells pass through a progressive transition state on the second and fourth days, and similarly on the fourth and seventh days. This further reflects the multi-directional differentiation characteristics of mouse embryonic stem cells. For the Guo dataset, we identified two important bifurcation points, S0 and S3 (Fig. 3), consistent with the two bifurcation events identified by SCUBA [18]. This suggests that cells undergo cellular differentiation during these two stages.
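The gene selection and PCA step described in Sect. 3.3 can be sketched as follows. This is a minimal illustration under assumptions the text does not state: genes are ranked by raw variance, the matrix is laid out as cells × genes (the transpose of the M × N matrix of Sect. 2.1), and the function names are hypothetical.

```python
import numpy as np

def top_variable_genes(expr, n_genes):
    """Column indices of the `n_genes` most variable genes in a cells x genes matrix."""
    variances = expr.var(axis=0)
    return np.argsort(variances)[::-1][:n_genes]

def pca(expr, n_components):
    """Project cells onto the leading principal components via SVD of the centred data."""
    centred = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy usage with random data standing in for a log-normalized expression matrix.
rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(50, 200))            # 50 cells, 200 genes
toy_genes = top_variable_genes(toy_expr, 100)    # the paper selects 1000 genes
toy_embedding = pca(toy_expr[:, toy_genes], 30)  # 30 principal components, as in the text
```

The resulting embedding (one row per cell) is the kind of input on which clustering and the 2-dimensional visualization would then operate.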
Fig. 1. Reconstruction of the differentiation trajectory of HSMM cells
To further test the robustness of the method on a more complex branching structure, we conducted an experiment on a synthetic dataset. As shown in Fig. 4, the tree has two bifurcation events. The main path S2-S0 bifurcates into two paths: one runs from cell state 1 through state 2 to state 3, and the other differentiates from state 2 to state 3. This shows that our method also reconstructs multi-branch structures well.
Fig. 2. Reconstruction of the differentiation trajectory of Klein cells
Fig. 3. Reconstruction of the differentiation trajectory on Guo dataset
Fig. 4. Reconstruction of the differentiation trajectory on synthetic dataset
3.4 Performance Comparison

We evaluate our method on three scRNA-seq datasets and one synthetic dataset with three different correlation measures, and compare it with four widely used trajectory inference methods: DPT [19], Monocle2 [5], Slingshot [20] and TSCAN [7]. For the other methods, we used the default parameters from previous studies. As Table 1 shows, our method achieves the highest Pearson correlation on the HSMM dataset and is nearly tied with Monocle2 on the Synthetic dataset (0.909 vs. 0.912). On the Guo dataset, DPT has a higher Pearson correlation than the other methods. On the Klein dataset, Slingshot performs best, but it does not perform well on the other datasets. This suggests that the trajectory inference method should be chosen according to the topology of the data. Monocle2 is close to our method overall on these four datasets (average 0.761 vs. 0.776), indicating that our method can effectively reconstruct cell trajectories.

Table 1. Performance comparison with Pearson correlation

Dataset     Ours    DPT     Monocle2  Slingshot  TSCAN
HSMM        0.437   0.243   0.415     0.320      NaN
Klein       0.859   0.803   0.828     0.945      0.754
Guo         0.899   0.963   0.888     0.346      0.439
Synthetic   0.909   0.859   0.912     0.743      0.815
Avg.        0.776   0.717   0.761     0.588      0.502
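The Pearson correlation of Eq. (6) and the averages in Table 1 can be checked with a short Python sketch. The values are copied from Table 1; the helper names and the `nan_as_zero` switch are assumptions introduced here for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (Eq. 6)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)

# Pearson correlations from Table 1 (order: HSMM, Klein, Guo, Synthetic).
table1 = {
    "Ours":      [0.437, 0.859, 0.899, 0.909],
    "DPT":       [0.243, 0.803, 0.963, 0.859],
    "Monocle2":  [0.415, 0.828, 0.888, 0.912],
    "Slingshot": [0.320, 0.945, 0.346, 0.743],
    "TSCAN":     [math.nan, 0.754, 0.439, 0.815],
}

def average(scores, nan_as_zero=True):
    """Mean score; a missing value is either counted as 0 or skipped."""
    if nan_as_zero:
        values = [0.0 if math.isnan(s) else s for s in scores]
    else:
        values = [s for s in scores if not math.isnan(s)]
    return sum(values) / len(values)
```

With these numbers, the TSCAN average of 0.502 reported in Table 1 is reproduced only when the missing HSMM value is counted as zero; skipping it instead would give roughly 0.669 over the remaining three datasets.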
4 Conclusion

Advanced scRNA-seq technology enables researchers to study cell developmental progression, dynamic state transitions and interactions at single-cell resolution. High dimensionality, high noise and high throughput are characteristic of scRNA-seq data and pose great challenges to computational models. In addition, prior knowledge such as the number of clusters, the starting cells and the cell labels is not easy to obtain. In this work, we develop a novel computational model for pseudotime inference from scRNA-seq data. First, the stability and robustness of consensus clustering reduce the impact of noise and high dimensionality, helping us to divide cells accurately into subpopulations (clusters). Next, we use a Mahalanobis-like distance to measure the distance between cell clusters; it effectively captures the differences between clusters and works well in practice. Then, we construct a minimum spanning tree to identify the potential global structure of the data, which can speed up the convergence of principal graph inference. Finally, we use the elastic principal graph adjustment strategy to optimize the tree structure and obtain smooth trajectories. We analyzed different single-cell datasets with both linear and bifurcating structures, and the results showed that our method could accurately capture the branching structure. In addition, a comparison with four other methods showed that our method can reconstruct cell lineage differentiation trajectories. This performance can help us to further understand cell fate transitions.

Acknowledgements. This work was supported by Natural Science Foundation of China (Grant No. 61972141) and Natural Science Foundation of Hunan Province, China (Grant No. 2021JJ30144).
References

1. Trapnell, C., et al.: The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32(4), 381–386 (2014)
2. Baslan, T., Hicks, J.: Unravelling biology and shifting paradigms in cancer with single-cell sequencing. Nat. Rev. Cancer 17(9), 557 (2017)
3. Guo, M., et al.: Single cell RNA analysis identifies cellular heterogeneity and adaptive responses of the lung at birth. Nat. Commun. 10(1), 37 (2019)
4. Luecken, M.D., Theis, F.J.: Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15(6), e8746 (2019)
5. Qiu, X., et al.: Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14(10), 979–982 (2017)
6. Jaehoon, S., et al.: Single-cell RNA-seq with Waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell 17(3), 360–372 (2015)
7. Zhicheng, J.: TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44(13), e117 (2016)
8. Chen, H., et al.: Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun. 10(1), 1903 (2019)
9. Parra, R.G., et al.: Reconstructing complex lineage trees from scRNA-seq data using MERLoT. Nucleic Acids Res. 47(17), 8961–8974 (2019)
10. Stegle, O., Teichmann, S.A., Marioni, J.C.: Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16(3), 133–145 (2015)
11. Giecold, G., et al.: Robust lineage reconstruction from high-dimensional single-cell data. Nucleic Acids Res. 44(14), e122 (2016)
12. Gorban, A., Zinovyev, A.: Elastic principal graphs and manifolds and their practical applications. Computing 75(4), 359–379 (2005)
13. Gorban, A.N., Sumner, N.R., Zinovyev, A.Y.: Topological grammars for data approximation. Appl. Math. Lett. 20(4), 382–386 (2007)
14. Klein, A., et al.: Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161(5), 1187–1201 (2015)
15. Guoji, G., et al.: Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Dev. Cell 18(4), 675–685 (2010)
16. Kharchenko, P.V., Silberstein, L., Scadden, D.T.: Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014)
17. Jia, J., Chen, L.: Single-cell RNA sequencing data analysis based on non-uniform ε-neighborhood network. Bioinformatics 38(9), 2459–2465 (2022)
18. Marco, E., et al.: Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc. Natl. Acad. Sci. 111(52), E5643–E5650 (2014)
19. Haghverdi, L., Büttner, M., Wolf, F.A., et al.: Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13(10), 845–848 (2016)
20. Street, K., et al.: Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19(1), 477 (2018)
Bioinformatic Analysis of Clear Cell Renal Carcinoma via ATAC-Seq and RNA-Seq

Feng Chang1,2, Zhenqiong Chen1,2, Caixia Xu2, Hailei Liu1,2, and Pengyong Han2(B)

1 Department of Urology, Heping Hospital Affiliated to Changzhi Medical College, Changzhi 046000, Shanxi, China
2 Changzhi Medical College, Changzhi 046000, China
Abstract. This study explores the functions related to the open chromatin state in clear cell renal cell carcinoma (ccRCC) using assay for transposase-accessible chromatin using sequencing (ATAC-seq) data and transcriptome sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA) database. Peaks of the ATAC-seq data were annotated, and gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed on the annotated genes. Most of the peaks in the ATAC-seq data of ccRCC were distributed in promoter regions: peaks at distances of ≤1 kb, 1–2 kb and 2–3 kb from the transcription start site accounted for 38.27%, 6.24% and 4.3%, respectively, while 17.25% were located in distal intergenic regions, consistent with the two typical distributions of open chromatin regions. The KEGG and functional analyses showed that the annotated genes were significantly enriched in the ErbB, MAPK and Rap1 signalling pathways, among others. DUSP9, HS6ST2 and MUC15 were markedly down-regulated in ccRCC. These three genes may serve as prognostic indicators for clear cell renal cell carcinoma and provide support for treatment and research.

Keywords: Renal clear cell carcinoma · TCGA · ATAC-seq
1 Introduction

Clear cell renal cell carcinoma (ccRCC) is the most common malignant type of kidney cancer, accounting for 75% of renal cancers [1]; its pathogenesis has not been fully elucidated. The clinical manifestations are not obvious. Most patients develop symptoms only at an advanced stage, and the tumour is not sensitive to radiotherapy or chemotherapy. Surgery is still the primary method of treatment. Therefore, it is necessary to discover new therapeutic targets and prognostic markers to assist in diagnosis and treatment. The Cancer Genome Atlas is an international database that hosts different omics data on different types of cancer. ATAC-seq is a method for probing local chromatin accessibility, which is a marker of transcriptional activity [2]. Compared with other technologies, ATAC-seq, an open chromatin sequencing method, requires fewer cells, is highly reproducible, and is highly consistent with other experimental methods. It can identify and screen potential targets for possible disease treatments discovered through transcriptome sequencing. In this paper, ATAC-seq and RNA-seq data of ccRCC were used to analyze chromatin accessibility, to screen key genes, and to examine the changes in related functions and pathways and the potential mechanisms of the key genes in clear cell renal cell carcinoma.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 374–382, 2022. https://doi.org/10.1007/978-3-031-13829-4_32
2 Materials and Methods

2.1 Download of ATAC-Seq and Transcriptome Data of Renal Clear Cell Carcinoma Tissue Samples

Seventy-two paracancerous (tumour-adjacent) tissue samples and 539 KIRC (kidney renal clear cell carcinoma) samples were obtained from the TCGA database. Normalized ATAC-seq count matrices and RNA-seq counts for all tumour samples were downloaded from the TCGA website.

2.2 Quality Control

R 4.1.2, BSgenome and karyoploteR were used to visualize the chromosome coverage of the KIRC peaks; the ChIPseeker, TxDb and clusterProfiler packages were used to summarize, classify and visualize the positional features of the peaks.

2.3 Data Analysis

RNA-seq data were analyzed with the edgeR package. Gene Ontology and KEGG enrichment analyses were performed in R 4.1.2.

2.4 Correlation Analysis

The limma package was used to filter the correlation results. The correlation results were then annotated to determine whether the genes conform to the distribution of genomic features.
3 Results

3.1 Quality of ATAC-Seq Data in Renal Clear Cell Carcinoma

The peaks identified in the ATAC-seq data were evenly distributed across chromosomes (Fig. 1). Most of the peaks were located in promoter regions: peaks at distances of ≤1 kb, 1–2 kb and 2–3 kb from the transcription start site (TSS) accounted for 38.27%, 6.24% and 4.3%, respectively, while 17.25% were located in distal intergenic regions, consistent with the two typical distributions of open chromatin regions (Fig. 2 and Fig. 3). Open chromatin peaks are primarily located near the TSS, which is consistent with the characteristics of chromatin openness (Fig. 4).
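The distance-to-TSS binning reported in Sect. 3.1 can be illustrated with a simplified Python sketch. It is not ChIPseeker, which was used in the study and which also handles strand and genomic feature classes such as distal intergenic regions; here peaks and TSSs are reduced to integer coordinates on one chromosome, and all names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

def distance_to_nearest_tss(peak_mid, tss_sorted):
    """Absolute distance (bp) from a peak midpoint to the closest TSS.

    `tss_sorted` is a sorted list of TSS coordinates on the same chromosome.
    """
    k = bisect_left(tss_sorted, peak_mid)
    candidates = tss_sorted[max(0, k - 1):k + 1]   # neighbours on either side
    return min(abs(peak_mid - t) for t in candidates)

def distance_bin(distance):
    """Assign a distance in bp to one of the bins used in the text."""
    if distance <= 1000:
        return "<=1 kb"
    if distance <= 2000:
        return "1-2 kb"
    if distance <= 3000:
        return "2-3 kb"
    return ">3 kb"

def bin_fractions(peak_mids, tss_sorted):
    """Fraction of peaks falling into each distance bin."""
    counts = Counter(
        distance_bin(distance_to_nearest_tss(p, tss_sorted)) for p in peak_mids
    )
    return {label: n / len(peak_mids) for label, n in counts.items()}
```

Applied to real peak calls, the fractions returned by `bin_fractions` would correspond to percentages of the kind quoted above, up to the strand-aware promoter definition that a dedicated annotation package applies.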
F. Chang et al.
Fig. 1. Genome map of clear cell renal cell carcinoma ATAC-seq data.
Fig. 2. Peaks coverage map of ATAC-seq data for clear cell renal cell carcinoma.
Fig. 3. Heat map of the distribution of peaks from the transcription start site in ATAC-seq data of renal clear cell carcinoma.
Fig. 4. Venn diagram of ATAC-seq data of ccRCC.
3.2 GO and KEGG Pathway Enrichment Analysis

Cancer-related GO terms and KEGG signalling pathways were significantly enriched in ccRCC. The ErbB, MAPK and Rap1 signalling pathways were enriched in ccRCC. The enriched GO terms included ubiquitin-mediated proteolysis, proteasomal protein catabolism, intercellular connections, transcriptional regulator activity, and phosphate nucleotide metabolism (Fig. 5 and Fig. 6).

3.3 Correlation Between the Key Genes and the Peaks

Three differentially expressed genes (DUSP9, HS6ST2, MUC15) were selected for correlation analysis, and the ATAC-seq peaks were correlated with the expression of these three genes (Fig. 7). The expression of the three genes in ccRCC was significantly lower than in normal tissues (Fig. 8).
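Enrichment analyses of the kind reported in Sect. 3.2 are commonly based on an over-representation test with the hypergeometric distribution. The sketch below shows that test in plain Python; it is an illustration only, the function name and arguments are hypothetical, and the multiple-testing correction and gene-set definitions used in the actual R analysis are not reproduced.

```python
from math import comb

def enrichment_p_value(n_background, n_in_term, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value for over-representation.

    n_background: number of annotated genes in the background
    n_in_term:    number of background genes annotated to the GO term or pathway
    n_selected:   number of genes in the list of interest (e.g. peak-annotated genes)
    n_overlap:    number of selected genes that belong to the term
    Returns P(X >= n_overlap) when n_selected genes are drawn without replacement.
    """
    upper = min(n_in_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A small p-value means that the term contains more of the selected genes than expected by chance; in practice the p-values of all tested terms would be adjusted for multiple testing before pathways are called enriched.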
Fig. 5. Bubble chart of GO terms for annotated peaks in ccRCC
Fig. 6. KEGG bubble chart of annotated peaks in ccRCC
4 Discussion

Renal cancer accounts for 2–3% of malignancies. Approximately 30% of renal cancer patients already have metastases at diagnosis, and some patients relapse after surgery and other forms of treatment [3, 4]. The mechanisms underlying the pathogenesis of renal cancer, its poor sensitivity to chemotherapy and radiotherapy, and its high postoperative recurrence rate remain unclear. This paper analyzed the signalling and functional pathways related to ccRCC by integrating ATAC-seq and RNA-seq data. Chromatin accessibility is the state in which regulatory factors bind to open chromatin. ATAC-seq is an epigenetic technology that has emerged in recent years for studying chromatin openness. With information on open regions and active regulatory sequences on chromatin, we can infer the possible transcription factors and their dynamics in specific physiological processes on a genome-wide scale [2]. Most genomic DNA is tightly wound around nucleosomes, and the naked DNA regions without nucleosomes are called open chromatin. Transcription factors (TFs) can approach and bind in these open chromatin regions to carry out transcription and DNA replication. DUSP9, HS6ST2 and MUC15 are all expressed at low levels in ccRCC, suggesting that these three genes may act as tumour suppressors in renal cancer and that their down-regulation may promote tumour proliferation, invasion and metastasis through different signalling pathways. DUSP9 dephosphorylates MAPK and thereby returns it to an inactive state [5]. Several studies have shown that DUSP9 is a tumour suppressor gene in gastric, skin, and lung cancer [6–9]. The expression of DUSP9 is low in renal cancer cells, which indicates that the down-regulation of DUSP9 promotes the occurrence and development of tumours [10]. This is consistent with our finding that DUSP9 is expressed at low levels in ccRCC. Yue et al. demonstrated that MUC15 is expressed at lower levels in renal cancer than in normal tissues, and that MUC15 can inhibit the invasion and metastasis of renal cancer cells in vitro and in vivo [11].
Fig. 7. Correlation between DUSP9, HS6ST2, MUC15 and peak.
Fig. 8. Expression of DUSP9, HS6ST2, MUC15 in normal and tumor tissues.
However, MUC15 is highly expressed in tumours such as thyroid cancer and glioma [12–14]. The expression of HS6ST2 was reduced by around 50% in renal cancer cells, and its expression in tumour tissue was significantly lower than that in normal tissue [15]. HS6ST2 is a glycolysis-related gene that is expressed at lower levels in tumour tissues than in normal tissues [16]. However, HS6ST2 is highly expressed in gastric, colon, breast, and other tumours [17–19] and may be involved in the carcinogenesis of these tumours as a tumour-promoting gene. The mechanism of cancer development is complicated and involves chromatin status, miRNAs, gene mutations and other factors [20–24]. Regulatory networks linking genes and diseases play an important role in disease. Algorithms have been developed to screen miRNA-target interactions, whereas the detailed regulatory mechanisms associated with ATAC-seq peaks remain to be explored [25–29]. These three genes may serve as biomarkers and potential therapeutic targets for predicting the prognosis of ccRCC.

Acknowledgement. This study was supported by Provincial Science and Technology Grant of Shanxi Province (20210302124588), Science and technology innovation project of Shanxi province universities (2019L0683).
References

1. Inamura, K.: Renal cell tumors: understanding their molecular pathological epidemiology and the 2016 WHO classification. Int. J. Mol. Sci. 18(10), 2195 (2017)
2. Hendrickson, D.G., Soifer, I., Wranik, B.J., Botstein, D., McIsaac, R.S.: Simultaneous profiling of DNA accessibility and gene expression dynamics with ATAC-seq and RNA-seq. In: von Stechow, L., Delgado, A.S. (eds.) Computational Cell Biology: Methods and Protocols, pp. 317–333. Springer, New York (2018). https://doi.org/10.1007/978-1-4939-8618-7_15
3. Motzer, R.J.: Prognostic nomogram for sunitinib in patients with metastatic renal cell carcinoma. Cancer 113, 155–158 (2010)
4. Nerich, V., et al.: Clinical impact of targeted therapies in patients with metastatic clear-cell renal cell carcinoma. Oncotargets Ther. 7, 365–374 (2014)
5. Chen, H.-F., Chuang, H.-C., Tan, T.-H.: Regulation of dual-specificity phosphatase (DUSP) ubiquitination and protein stability. Int. J. Mol. Sci. 20(11), 2668 (2019)
6. Wu, F., et al.: Epigenetic silencing of DUSP9 induces the proliferation of human gastric cancer by activating JNK signaling. Oncol. Rep. 34, 121–128 (2015)
7. Low, H.B., Zhang, Y.: Regulatory roles of MAPK phosphatases in cancer. Immune Netw. 16(2), 85–98 (2016)
8. Qiu, Z., Liang, N., Huang, Q., Sun, T., Wang, Q.: Downregulation of DUSP9 promotes tumor progression and contributes to poor prognosis in human colorectal cancer. Front. Oncol. 10, 547011 (2020)
9. Xia, L., Wang, H., Xiao, H., Lan, B., Liu, J., Yang, Z.: EEF1A2 and ERN2 could potentially discriminate metastatic status of mediastinal lymph node in lung adenocarcinomas harboring EGFR 19Del/L858R mutations. Thorac. Cancer 11, 2755–2766 (2020)
10. Zhou, L., et al.: Integrated profiling of MicroRNAs and mRNAs: MicroRNAs located on Xq27.3 associate with clear cell renal cell carcinoma. PLoS ONE 5, e15224 (2010)
11. Yue, Y., Hui, K., Wu, S., Zhang, M., Fan, J.: MUC15 inhibits cancer metastasis via PI3K/AKT signaling in renal cell carcinoma. Cell Death Dis. 11, 336 (2020)
12. Choi, C., et al.: Promotion of tumor progression and cancer stemness by MUC15 in thyroid cancer via the GPCR/ERK and integrin-FAK signaling pathways. Oncogenesis 7, 1–13 (2018)
13. Huang, J., et al.: Overexpression of MUC15 activates extracellular signal-regulated kinase 1/2 and promotes the oncogenic potential of human colon cancer cells. Carcinogenesis 30, 1452–1458 (2009)
14. Cheng, M., Liu, L.: MUC15 promotes growth and invasion of glioma cells by activating Raf/MEK/ERK pathway. Clin. Exp. Pharmacol. Physiol. 47(6), 1041–1048 (2020)
15. Liep, J., Kilic, E., Meyer, H.A., Busch, J., Rabien, A.: Cooperative effect of miR-141-3p and miR-145-5p in the regulation of targets in clear cell renal cell carcinoma. PLoS ONE 11, e0157801 (2016)
16. Xing, Q., Zeng, T., Liu, S., Cheng, H., Ma, L., Wang, Y.: A novel 10 glycolysis-related genes signature could predict overall survival for clear cell renal cell carcinoma. BMC Cancer 21(1), 381 (2021)
17. Jin, Y., He, J., Du, J., Zhang, R.X., Yao, H.B., Shao, Q.S.: Overexpression of HS6ST2 is associated with poor prognosis in patients with gastric cancer. Oncol. Lett. 14, 6191–6197 (2017)
18. Nishio, K., et al.: Overexpression of heparan sulfate 6-o-sulfotransferase-2 in colorectal cancer. Mol. Clin. Oncol. 1, 845–850 (2013)
19. Pollari, S., et al.: Heparin-like polysaccharides reduce osteolytic bone destruction and tumor growth in a mouse model of breast cancer bone metastasis. Mol. Cancer Res. 10, 597–604 (2012)
20. Zhong, T., Li, Z., You, Z.-H., Nie, R., Zhao, H.: Predicting miRNA–disease associations based on graph random propagation network and attention network. Brief. Bioinf. 23(2), bbab589 (2022)
21. Li, Z.-W., Zhong, T.-B., Huang, D.-S., You, Z.-H., Nie, R.: Hierarchical graph attention network for miRNA-disease association prediction. Mol. Ther. 30, 1775–1786 (2022)
22. Li, Z.-W., Li, J.-S., Nie, R., You, Z.-H., Bao, W.-Z.: A graph auto-encoder model for miRNA-disease associations prediction. Brief. Bioinf. 22(4), bbaa240 (2021)
23. Nie, R., Li, Z.-W., You, Z.-H., Bao, W.-Z., Li, J.-S.: Efficient framework for predicting miRNA-disease associations based on improved hybrid collaborative filtering. BMC Med. Inf. Decis. Making 21(S1), 254 (2021)
24. Liu, B.-L., Zhu, X.-Y., Zhang, L., Liang, Z.-Z., Li, Z.-W.: Combined embedding model for MiRNA-disease association prediction. BMC Bioinf. 22, 161 (2021)
25. Zhang, L., Liu, B.-L., Li, Z.-W., Zhu, X.-Y., Liang, Z.-Z., An, J.-Y.: Predicting miRNA-disease associations by multiple meta-paths fusion graph embedding model. BMC Bioinf. 21, 470 (2020)
26. Li, J.-S., Li, Z.-W., Nie, R., You, Z.-H., Bao, W.-Z.: FCGCNMDA: predicting MiRNA-disease associations by applying fully connected graph convolutional networks. Mol. Genet. Genomics 295(5), 1197–1209 (2020)
27. Li, Z.-W., Nie, R., You, Z.-H., Cao, C., Li, J.-S.: Using discriminative vector machine model with 2DPCA to predict interactions among proteins. BMC Bioinf. 20(Suppl. 25), 694–702 (2019)
28. Li, Z.-W., You, Z.-H., Chen, X., Nie, R., An, J.-Y.: In silico prediction of drug-target interaction networks based on drug topological structure and protein sequences. Sci. Rep. 9, 2045–2322 (2017)
29. Li, Z.-W., et al.: Accurate prediction of protein-protein interactions by integrating potential evolutionary information embedded in PSSM profile and discriminative vector machine classifier. Oncotarget 8(14), 23638–23649 (2017)
The Prognosis Model of Clear Cell Renal Cell Carcinoma Based on Allograft Rejection Markers

Hailei Liu1,2, Zhenqiong Chen1,2, Chandrasekhar Gopalakrishnan4, Rajasekaran Ramalingam4, Pengyong Han2(B), and Zhengwei Li3

1 Department of Urology, Heping Hospital Affiliated to Changzhi Medical College, Changzhi 046000, Shanxi, China
2 Central Lab, Changzhi Medical College, Changzhi 046000, China
[email protected]
3 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
4 Quantitative Biology Lab, Department of Biotechnology, School of Bio Sciences and Technology, Vellore Institute of Technology (Deemed to be University), Vellore 632014, Tamil Nadu, India
Abstract. Renal cell carcinoma (RCC), also known as renal adenocarcinoma or hypernephroma, is the most common form of cancer arising in the kidney. About 9 out of 10 kidney malignancies are RCC, and it accounts for 4.2% of all cancer types. Patients with RCC tend to have a poor prognosis, with survival probability declining as the disease advances. However, current detection procedures cannot precisely predict prognostic outcomes for RCC patients in due time. It is also well established that the immunosuppression induced to prevent allograft rejection can promote tumor progression. Conversely, the biomarkers involved in allograft rejection can be used to assess cancer prognosis. Based on this notion, the present study aims to formulate immune-response-based prognostic biomarkers that help clinicians assess and detect RCC prognosis effectively. Methods: Biomarkers derived from allograft rejection were used as prognostic markers in RCC and were supported by a series of statistical analyses performed on the kidney renal clear cell carcinoma (KIRC) cohort of The Cancer Genome Atlas (TCGA). Results: Based on differential gene expression analysis between the diseased and control groups, a prognostic signature consisting of 14 allograft rejection associated genes (ARGs), namely CCL22, CSF1, CXCL13, ETS1, FCGR2B, GBP2, HLA-E, IL4R, MAP4K1, ST8SIA4, TAP2, TIMP1, ZAP70 and TLR6, was delineated and validated with a series of statistical analyses to assess its robustness. Consistent findings from univariate and multivariate regression analysis, survival analysis and risk prediction analysis indicate that this set of genes can be used as biomarkers to aid RCC prognosis. The Cox regression model based on these markers achieved the largest area under the curve (AUC = 0.8) in the receiver operating characteristic (ROC) analysis. Conclusions: The immune-system-based prognostic model formulated here can be used to predict survival outcomes and immunotherapeutic responses of RCC patients.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 383–393, 2022. https://doi.org/10.1007/978-3-031-13829-4_33
Keywords: Renal clear cancer · Allograft rejection · Prognosis model
1 Introduction

Renal cell carcinoma is the most common type of malignancy arising in the kidney [1]. According to the 2022 cancer statistics, kidney cancer accounts for 29.6% of occurrence and 6.3% of mortality in the USA. It arises when cells lining the kidney's tubules start dividing uncontrollably. RCC spreads quickly and often reaches the lungs and adjacent organs. The precise mechanism behind RCC remains obscure, but several risk factors are suspected, including family history, hypertension, smoking, obesity and prolonged abuse of prescribed medications. Its symptoms include blood in the urine, a lump in the abdomen, persistent pain, anemia, elevated blood calcium, fatigue and unexplained weight loss. As RCC advances through its stages, the chances of survival narrow. The therapeutic strategies currently available against RCC consist of surgery, chemotherapy, immunotherapy, targeted therapy and adjuvant therapy. Although an extensive range of therapeutic strategies is available against this debilitating disease, the overall prognosis of RCC patients remains dismal, especially in late-stage RCC. Hence, a robust set of prognostic biomarkers is required to better assess RCC progression in patients.

Further, studies have shown that inducing broad immunosuppression to control allograft rejection poses an increased risk of oncogenesis and cancer progression. Conversely, treating patients with immune-activating drugs such as pembrolizumab has resulted in allograft rejection [2]. Based on these observations, it is reasonable to surmise that the genes involved in allograft rejection are also involved in cancer progression. Hence, a subset of allograft rejection immunomodulator genes could be viable biomarkers for monitoring cancer prognosis, especially in RCC patients, whose prognosis is poor. With increasingly large amounts of genomic and oncogenic information available to biologists, this task should be feasible. In the present study, with the help of robust computational tools and pipelines, we systematically searched this extensive genomic data collection to isolate a set of distinct allograft rejection genes that are statistically associated with RCC progression. These genes could be used to monitor and assess the prognosis of RCC patients.
2 Materials and Methods

2.1 Data Acquisition and Preprocessing

The genomic information of kidney cancer patients, consisting of transcriptional gene expression profiles, mutation patterns and related clinical information, was retrieved from The Cancer Genome Atlas (TCGA). All data were subjected to initial screening in R.
2.2 Differentially Expressed Allograft Rejection Associated Genes (DE-ARGs) and Functional Enrichment Analyses

To better delineate the genes involved in RCC progression, gene expression data of cancer and normal tissues were obtained from the TCGA kidney cancer cohort and subjected to differential gene expression analysis with the 'edgeR' R package [3]. The screening criteria were log2(fold change) > 2 and P-value < 0.05. The set of allograft rejection associated genes (ARGs) was obtained from the Broad Institute. The ARGs that were significantly differentially regulated between the kidney cancer and normal expression profiles were retained for further analysis, with a false discovery rate (FDR) < 0.05 taken as statistically significant.

2.3 Independent Prognostic Value of the Immune-Associated Prognostic Signature

To identify whether the DE-ARGs are associated with patient survival, univariate Cox regression analysis was performed with the criterion P < 0.01 using the "survival" R package. Subsequently, multivariate Cox regression analysis was conducted to construct the prognostic signature. Multivariate Cox regression analysis with a forward stepwise procedure was performed to investigate whether the risk score is an independent prognostic factor. The allograft rejection associated risk score and other clinical variables with P < 0.05 were identified as independent prognostic risk factors. The risk score formula is as follows:

$$\text{Risk score} = \sum_{i} \mathrm{Coef}(\mathrm{ARG}_i) \times \mathrm{Exp}(\mathrm{ARG}_i)$$

Here, Coef(ARG_i) is the coefficient of the i-th ARG and Exp(ARG_i) is its expression. Based on the median risk score, the patients were divided into high-risk and low-risk groups, and Kaplan-Meier survival analysis was performed to estimate the survival rate of both groups using the "survival" and "survminer" R packages. The area under the ROC curve was used to evaluate the accuracy of the prediction model.

2.4 Statistical Analysis

The differences between variables were determined by chi-square and Student's t-tests. The Kaplan–Meier survival curves were compared using the log-rank test. P < 0.05 indicated statistical significance. R 4.1.2 was used for all analyses.
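The risk score and median split of Sect. 2.3 can be illustrated with a minimal Python sketch. It assumes that the score is the sum of coefficient times expression over the signature genes; the gene names, coefficients, expression values and patient identifiers below are hypothetical and are not taken from the fitted model.

```python
from statistics import median


def risk_score(coefs, expr):
    """Sum of coefficient * expression over the signature genes."""
    return sum(coefs[g] * expr[g] for g in coefs)


def split_by_median(scores):
    """Label a patient 'high' if the score exceeds the cohort median, else 'low'."""
    cut = median(scores.values())
    return {pid: ("high" if s > cut else "low") for pid, s in scores.items()}


# Hypothetical coefficients and expression values, for illustration only.
coefs = {"GENE_A": 0.5, "GENE_B": -0.25}
cohort = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "P2": {"GENE_A": 4.0, "GENE_B": 0.0},
    "P3": {"GENE_A": 1.0, "GENE_B": 2.0},
    "P4": {"GENE_A": 6.0, "GENE_B": 4.0},
}
scores = {pid: risk_score(coefs, e) for pid, e in cohort.items()}
groups = split_by_median(scores)
```

How ties at the median are assigned is a design choice; here they fall into the low-risk group.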
3 Results

3.1 Identification and Functional Analyses of DE-ARGs

Initially, to delineate the genes that are differentially expressed in kidney cancer relative to normal tissues, the criteria log2 fold change (logFC) > 2 and adjusted P < 0.05 were applied in the R package 'edgeR'. Accordingly, 7369 DEGs were filtered, of which 1903 genes were significantly elevated and 5456 genes were significantly suppressed in kidney cancer samples compared to normal samples (Figs. 1A, B). Subsequently, 118 genes from this list were identified as allograft rejection genes (based on the allograft rejection gene list from the Broad Institute) (Fig. 1C). These 118 genes were used for further analysis.
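The filtering and intersection step can be sketched in Python as follows. This is an illustration only: it assumes that the fold-change threshold is applied to the absolute value of logFC (since both elevated and suppressed genes are reported), and the gene names and statistics are hypothetical.

```python
def select_de_args(results, arg_set, lfc_cut=2.0, fdr_cut=0.05):
    """Return (up, down) lists of ARGs passing the logFC and FDR thresholds.

    results maps gene -> (log2 fold change, adjusted P-value).
    """
    up, down = [], []
    for gene, (log_fc, fdr) in results.items():
        if gene not in arg_set or fdr >= fdr_cut or abs(log_fc) <= lfc_cut:
            continue
        (up if log_fc > 0 else down).append(gene)
    return sorted(up), sorted(down)


# Hypothetical differential expression table and ARG list.
results = {
    "G1": (3.1, 0.001),   # elevated, significant
    "G2": (-2.5, 0.01),   # suppressed, significant
    "G3": (2.4, 0.2),     # fails the FDR threshold
    "G4": (1.0, 0.001),   # fails the fold-change threshold
    "G5": (4.0, 0.0001),  # significant but not in the ARG list
}
arg_set = {"G1", "G2", "G3", "G4"}
up, down = select_de_args(results, arg_set)
```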
Fig. 1. Differentially expressed allograft rejection associated genes. (A) Heatmap of top 15 up- and down-regulated genes between normal and tumor tissues. (B) Volcano plot for DEGs between normal and tumor tissues. (C) Venn diagram for intersections between DEGs and ARGs.
3.2 Risk Prediction Analysis

Next, the influence of these DE-ARGs on renal cancer was analyzed with univariate and multivariate Cox regression analysis, in which the association of the aforementioned 118 genes with the survival of renal cancer patients was tested statistically. Univariate (one gene at a time) and multivariate (all the genes) Cox regression analysis was also used to analyze the influence of the risk score and clinicopathological factors such as age, gender, pathological grade and clinical stage on the survival of patients with renal cancer (Fig. 2).
Fig. 2. Establishment and verification of the prognostic signature of allograft rejection genes in the TCGA database: Univariate and multivariate Cox regression analysis. Here T, M and N are stages of cancer.
The results showed that 55 DE-ARGs were significantly related to the prognosis of KIRC patients (p < 0.05). To make the analysis statistically robust, the 14 most strongly correlated prognosis-related DE-ARGs (CCL22, CSF1, CXCL13, ETS1, FCGR2B, GBP2, HLA-E, IL4R, MAP4K1, ST8SIA4, TAP2, TIMP1, ZAP70, TLR6) were retained for multivariate Cox regression analysis and used to construct the allograft rejection-related risk model (Fig. 3). The risk score was calculated for each patient, and patients were then divided into a high-risk group (n = 265) and a low-risk group (n = 265), with the median risk score as the cut-off.
Fig. 3. Heatmap of the expression of the 14 ARGs in the high- and low-risk groups; distribution of risk scores between the two groups; distribution of survival status and OS time.
3.3 Survival Analysis

The multivariate Cox regression model based on the 14-gene DE-ARG signature was used to compute risk scores that divide kidney cancer patients into high-risk and low-risk groups. To further assess the risk prediction model, Kaplan-Meier survival analysis was performed for both groups. The results indicate that low-risk patients survive significantly longer than high-risk patients (p < 0.001) (Fig. 4). Further evaluation of the DE-ARG-based risk prediction model gave an area under the ROC curve of 0.797, which supports the predictive accuracy of the model (Fig. 5).
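For readers unfamiliar with the Kaplan-Meier estimator behind Fig. 4, a minimal Python sketch is given here. It is not the "survival" R package used in the study; the follow-up times and event indicators are hypothetical. At each event time the survival estimate is multiplied by (1 − deaths / number at risk).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times: follow-up times; events: 1 if death was observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        removed = sum(1 for tt, _ in data if tt == t)
        if deaths:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed  # deaths and censored cases both leave the risk set
        i += removed
    return curve


# Hypothetical group: the patient at t = 2 is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Running the function once per risk group gives the two curves that the log-rank test then compares.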
Fig. 4. Kaplan–Meier curves for OS outcomes in the high- and low-risk score groups.
Fig. 5. Evaluation of prognostic accuracy of the risk model and other clinicopathological characteristics. ROC curves of risk score (AUC = 0.797) and other clinical features (age, gender, grade, stage, T, N and M).
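As an illustration of how an AUC such as the one in Fig. 5 is obtained, the following Python sketch computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. It assumes a plain binary outcome; whether the study used a time-dependent ROC is not stated, and the scores and labels below are hypothetical.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of positive > score of negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores and outcomes (1 = event, 0 = no event).
value = auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 0, 1, 0, 0])
```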
4 Discussion

Allograft rejection and cancer progression, though seemingly unrelated, share an intricate and elusive immunomodulatory mechanism. Many studies have corroborated this striking overlap between the two phenomena [4]. Reports have shown that allogeneic grafts suffer notably from lymphocyte-mediated immune rejection, which is closely associated with activation of the ER stress response in host lymphocytes, more specifically CD8 T cells [5]. Interestingly, the same CD8 T cells are suggested to be a crucial immune checkpoint in ccRCC tumorigenesis [6]. Case studies have shown that immunotherapy administered to a patient with metastatic cutaneous squamous-cell carcinoma and a solid-organ transplant led to both tumor suppression and allograft rejection. Moreover, studies have shown that initiating immune checkpoint inhibitors in patients with solid organ transplants is associated with a high rate of allograft rejection. This interplay between allograft rejection and cancer progression, with immune regulation as the common factor, prompted us to look for allograft rejection-based biomarkers that could aid the prognosis of ccRCC.
Accordingly, with the help of differential gene expression analysis and Cox regression, we formulated a risk assessment model based on the allograft rejection gene set to monitor the prognosis of ccRCC patients. The model appears statistically robust, as corroborated by the survival analysis and the area under the ROC curve. Furthermore, the relevance of the 14 delineated allograft rejection genes to ccRCC prognosis is supported by prior studies, which elucidate their role in immuno-oncological modulation and reaffirm their importance as biomarkers for monitoring ccRCC progression in patients. First, the chemokine CCL22 was found to control the migration of regulatory T cells [7], while CSF1 was found to promote ccRCC oncogenesis10. CXCL13, found predominantly on M2 macrophages, is also implicated in ccRCC invasion, migration and EMT [8]. The transcription factor ETS1 is implicated in tumorigenesis in general, and in most cancer types it is associated with reduced survival [9]. HLA-E is a known biomarker of ccRCC [10], and an IL4R polymorphism is implicated in an increased risk of ccRCC in a Chinese population [11]. Further, MAP4K1 was found to promote AML by regulating DNA damage and repair pathways and the MAPK pathway [12]. ST8SIA4 was found to be overexpressed in RCC cell lines, and its ectopic expression regulated the migration, proliferation and invasion of RCC cells [13]. TAP2 is associated with antigen presentation by MHC class I [14], while TIMP1 overexpression in the aorta decreased the severity of allograft vasculopathy in mice [15]. Reports have also implicated TLR6 in various experimental analyses of renal ischemia/reperfusion injury [16]. The mechanism of cancer development is complicated and involves miRNAs, gene mutations and other factors [17–22]. The regulatory network between genes and diseases plays an important role in disease. Algorithms have been developed to screen miRNA-target interactions, while predictive models at other regulatory levels remain to be explored [23–26].

From the literature surveyed above, it is apparent that a majority of these genes are implicated in ccRCC progression, which supports including them in the prognostic model, while the remaining genes are associated with immune modulation. Therefore, delineating the 14 genes as potential biomarkers for assessing ccRCC progression appears to be a viable proposition, and the risk prediction model based on the 14 genes can be employed for efficient prediction and evaluation of ccRCC prognosis.

Acknowledgement. This study was supported by Provincial Science and Technology Grant of Shanxi Province (20210302124588), Science and technology innovation project of Shanxi province universities (2019L0683).
References

1. Lee, S., Ku, J.Y., Kang, B.J., Kim, K.H., Kim, S.: A unique urinary metabolic feature for the determination of bladder cancer, prostate cancer, and renal cell carcinoma. Metabolites 11, 591 (2021)
2. Aguirre, L.E., Guzman, M.E., Lopes, G., Hurley, J.: Immune checkpoint inhibitors and the risk of allograft rejection: a comprehensive analysis on an emerging issue. The Oncologist (2018)
3. Smyth, G.K.: [BioC] combining differential gene expression on 2 reference transcriptomes: EdgeR analysis
4. Land, W.G., Agostinis, P., Gasser, S., Garg, A.D., Linkermann, A.: DAMP-induced allograft and tumor rejection: the circle is closing. Am. J. Transp. 16, 3322–3337 (2016)
5. Shi, Y., Lu, Y., Zhu, C., Luo, Z., You, J.: Targeted regulation of lymphocytic ER stress response with an overall immunosuppression to alleviate allograft rejection. Biomaterials 272, 120757 (2021)
6. Wu, K., Zheng, X., Yao, Z., Zheng, Z., Zheng, J.: Accumulation of CD45RO+CD8+ T cells is a diagnostic and prognostic biomarker for clear cell renal cell carcinoma. Aging 13, 14304–14321
7. 32 - modulation of autoimmunity and allograft rejection by viral expression of interleukin-35. Canadian Journal of Diabetes (2016)
8. Xie, Y., Chen, Z., Zhong, Q., Zheng, Z., Xie, W.: M2 macrophages secrete CXCL13 to promote renal cell carcinoma migration, invasion, and EMT. Cancer Cell Int. 21(1), 677 (2021)
9. Dittmer, J.: The role of the transcription factor Ets1 in carcinoma. Semin. Cancer Biol. 35, 20–38 (2015)
10. Chu, G., Jiao, W., Yang, X., Liang, Y., Niu, H.: C3, C3AR1, HLA-DRA, and HLA-E as potential prognostic biomarkers for renal clear cell carcinoma. Trans. Andrology Urol. 9, 2640–2656 (2020)
11. Zhang, Z., Yadi, Q., Wang, M., Haiyan, Y., Qian, F.: Polymorphism rs4787951 in IL-4R contributes to the increased risk of renal cell carcinoma in a Chinese population. Gene 685, 242–247 (2019)
12. Ling, Q., Li, F., Zhang, X., Mao, S., Jin, J.: MAP4K1 functions as a tumor promotor and drug mediator for AML via modulation of DNA damage/repair system and MAPK pathway. EBioMedicine 69, 103441 (2021)
13. Pan, Y., et al.: Long noncoding RNA HOTAIR promotes renal cell carcinoma malignancy through alpha-2, 8-sialyltransferase 4 by sponging microRNA-124. Cell Prolif. 51, e12507 (2018)
14. Chevrier, D., et al.: Effects of MHC-encoded TAP1 and TAP2 gene polymorphism and matching on kidney graft rejection. Transplantation 60, 292–295 (1995)
15. Remes, A., Franz, M., Zaradzki, M., Borowski, C., Arif, R.: AAV-mediated TIMP-1 overexpression in aortic tissue reduces the severity of allograft vasculopathy in mice. J. Heart Lung Transpl. 39, 389–398 (2020)
16. Hoffmann, U., et al.: Impact of toll-like receptor 2. Accessed 21 Nov 2016
17. Li, Z.-W., Zhong, T.-B., Huang, D.-S., You, Z.-H., Nie, R.: Hierarchical graph attention network for miRNA-disease association prediction. Molecular Therapy, Advance Access (2022)
18. Zhong, T.-B., Li, Z.-W., You, Z.-H., Nie, R., Zhao, H.: Predicting miRNA-disease associations based on graph random propagation network and attention network. Briefings in Bioinformatics, Advance Access (2022)
19. Li, Z.-W., Li, J.-S., Nie, R., You, Z.-H., Bao, W.-Z.: A graph auto-encoder model for miRNA-disease associations prediction. Briefings Bioinform. 22(4), bbaa240 (2021)
20. Nie, R., Li, Z.-W., You, Z.-H., Bao, W.-Z., Li, J.-S.: Efficient framework for predicting miRNA-disease associations based on improved hybrid collaborative filtering. BMC Med. Inform. Decis. Making 21(S1), 254 (2021)
21. Liu, B.-L., Zhu, X.-Y., Zhang, L., Liang, Z.-Z., Li, Z.-W.: Combined embedding model for MiRNA-disease association prediction. BMC Bioinform. 22, 161 (2021)
22. Zhang, L., Liu, B.-L., Li, Z.-W., Zhu, X.-Y., Liang, Z.-Z., An, J.-Y.: Predicting miRNA-disease associations by multiple meta-paths fusion graph embedding model. BMC Bioinform. 21, 470 (2020)
23. Li, J.-S., Li, Z.-W., Nie, R., You, Z.-H., Bao, W.-Z.: FCGCNMDA: predicting MiRNA-disease associations by applying fully connected graph convolutional networks. Mol. Genet. Genomics 295(5), 1197–1209 (2020)
24. Li, Z.-W., Nie, R., You, Z.-H., Cao, C., Li, J.-S.: Using discriminative vector machine model with 2DPCA to predict interactions among proteins. BMC Bioinform. 20(Suppl 25), 694–702 (2019)
25. Li, Z.-W., You, Z.-H., Chen, X., Nie, R., An, J.-Y.: In silico prediction of drug-target interaction networks based on drug topological structure and protein sequences. Sci. Rep. 9, 2045–2322 (2017)
26. Li, Z.-W., et al.: Accurate prediction of protein-protein interactions by integrating potential evolutionary information embedded in PSSM profile and discriminative vector machine classifier. Oncotarget 8(14), 23638–23649 (2017)
Membrane Protein Amphiphilic Helix Structure Prediction Based on Graph Convolution Network

Baoli Jia1,2, Qingfang Meng1,2(B), Qiang Zhang3, and Yuehui Chen1,2

1 School of Information Science and Engineering, University of Jinan, Jinan 250022, China
[email protected]
2 Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan 250022, China
3 Institute of Jinan Semiconductor Elements Experimentation, Jinan 250014, China
Abstract. The amphiphilic helix structure in membrane proteins is involved in membrane-related biological processes and is of considerable research significance. In this paper, we construct a new amphiphilic helix dataset containing 70 membrane proteins with a total of 18,458 amino acid residues. We extract three commonly used protein features and predict the amphiphilic helix structure of membrane proteins using a graph convolutional neural network. With the newly constructed dataset, the prediction accuracy for the amphiphilic helix structure of membrane proteins is improved under rigorous 10-fold cross-validation.

Keywords: Membrane protein · Amphiphilic helix · Structure prediction · Graph convolutional network
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 394–404, 2022. https://doi.org/10.1007/978-3-031-13829-4_34

1 Introduction

The cell membrane is a biological barrier that separates the inside of the cell from the outside. It consists of a phospholipid bilayer and a large number of membrane proteins. Membrane proteins are proteins that cross or adhere to the cell membrane. They are the main carriers of biomembrane function and account for about 20–30% of all proteins in the human genome [1]. Membrane proteins play important roles in many biological processes such as cell signal transduction, cell recognition and cell communication. Currently, half of the molecular drugs are related to membrane proteins [2]. Membrane proteins can be roughly divided into two categories according to their secondary structure: α-helix membrane proteins and β-sheet membrane proteins. This paper mainly studies the former. There are two major directions in the study of membrane protein structure, namely the transmembrane helix (TMH) and the amphiphilic helix (AH). Research on TMH is relatively mature [3, 4], while studies on AH are relatively few. This article focuses on AH. An AH fragment is usually short, but it plays an important role in membrane proteins. For example, it can regulate interactions with other membrane proteins, sense the curvature of biomembranes [5], participate in the regulation of membrane tubule formation [6], and play a role in identifying orientation in the targeted positioning of drugs [7]. Because AH structure is difficult to measure experimentally, computational prediction of AH has become increasingly important.

Over the past few decades, several methods have been proposed to predict the amphipathic helix structure of membrane proteins. The Helical Wheel [8, 9] is a graphical method proposed by Schiffer and Edmundson that projects the helix of a protein sequence onto a plane and can be used to identify the distribution of hydrophobic and hydrophilic amino acids. Although it visualizes AH in an intuitive way, it has no predictive function and can only serve as a tool for verifying whether an α-helix is amphipathic. David Eisenberg put forward the concept of the Hydrophobic Moment [10, 11], which quantifies the amphipathicity of AH. Since it can only find AH of fixed length, this method is not practical. Subsequently, Martin G. Roberts proposed the depth-weighted inserted hydrophobicity (DWIH) [12], which solves this problem. AmphipaSeeK [13] is an online web server developed by Nicolas Sapay of the Institute of Protein and Biochemistry of the French National Center for Scientific Research. It is the earliest machine learning method in this field. Because of its small dataset and the imbalance between positive and negative samples, its prediction accuracy is poor. In 2020, Feng proposed a deep learning-based prediction model, MemBrain3.1 [14]. The model consists of a residual neural network and an uneven-thresholds decision algorithm, and achieves better classification accuracy. Inspired by these developments, we construct a standard AH dataset, extract three commonly used protein features, and predict the amphiphilic helix structure of membrane proteins with a graph convolutional neural network.
2 Materials and Methods

2.1 Dataset

The only published AH dataset, the one used by AmphipaSeeK, contains just 8 membrane protein sequences. Because training data has a great impact on the accuracy of machine learning methods, a new public AH dataset is urgently needed. In this paper, a dataset of 70 membrane proteins is constructed as follows. First, all α-helix membrane proteins are downloaded from the PDBTM protein database [15] to form the candidate membrane protein dataset. Then, following the most stringent de-redundancy criterion for homologous protein sequences, the CD-HIT [16] tool is used to cluster the protein sequences with a 30% sequence identity cutoff, which yields a non-redundant membrane protein dataset. Finally, using the OPM database [17] and the PDB database [18], together with the relevant literature and 3D structures, the helices labeled as amphiphilic helices in these two databases are taken as the standard data. The dataset contains a total of 18,458 amino acid residues: the 1,050 AH residues are positive samples and the 17,408 non-AH residues are negative samples.
2.2 Node Features Extraction

We employ three groups of protein features to train our model: secondary structure, Hidden Markov Model profile and hydrophobicity scale.

Secondary structure (SS) is a commonly used protein structural feature. According to the conformation of the protein main chain, it is roughly divided into three states: α-helix, β-sheet and random coil. We derive the secondary structure from PSIPRED [19], one of the most accurate predictors, whose output contains these 3 states. We encode them with a 3-dimensional one-hot vector, so the feature is represented by an L × 3 matrix, where L is the length of the protein sequence.

The Hidden Markov Model (HMM) profile is derived from the Hidden Markov Model and contains a lot of evolutionary information. In this paper, the HMM profile is generated by running the HHblits [20] homologous sequence search tool. Its underlying statistical model captures the probability of sequence mutation and improves the sensitivity of searching for similar sequences. The query sequence is searched against the Uniclust30 database [21] with three iterations and an E-value threshold of 0.001. The generated HMM profile is represented as an L × 30 matrix.

The hydrophobicity scale is a value describing the relative hydrophobicity of amino acids, which plays an important role in this field. We adopt the Eisenberg scale [10] in this study. With the above feature extraction methods, we therefore obtain a 34-dimensional node feature vector (3 + 30 + 1) for each amino acid in the protein sequence.

2.3 Graph Feature Extraction

In this section, the chaos game representation is used to convert the secondary structure into a time series, and an adjacency matrix is then constructed from the horizontal visibility graph of that series as the graph feature of the membrane protein.

Chaos game representation (CGR) is a method first proposed by Jeffrey [22] to visualize DNA sequences. Later, Yang [23] extended this method to protein sequences and proposed a CGR based on secondary structure. The type, order and position of amino acids are essential for predicting AH, and this method can reflect the structural and sequential information hidden in the sequence. We start with an equilateral triangle of unit side length, each vertex representing one of the letters of the secondary structure, namely H, E and C. For each letter of a given secondary structure sequence, we draw a point within the triangle. The first point is placed midway between the center of the triangle and the vertex corresponding to the first letter of the sequence, and the i-th point is then placed midway between the (i − 1)-th point and the vertex corresponding to the i-th letter. The coordinates of point i are given by Eqs. (1) and (2):

$$x_i = 0.5 \times \left(x_{i-1} + c_{jx}\right) \tag{1}$$

$$y_i = 0.5 \times \left(y_{i-1} + c_{jy}\right) \tag{2}$$

where $(c_{jx}, c_{jy})$ are the coordinates of the vertex corresponding to the i-th letter, namely $(0, 0)$, $(1, 0)$ and $(0.5, 0.5\sqrt{3})$. The x-coordinates of the CGR points shown in Fig. 1 are then taken as a time series, as shown in Fig. 2.
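The midpoint construction described above can be sketched in Python as follows. The sketch assumes the unit triangle with vertices (0, 0), (1, 0) and (0.5, 0.5√3) and starts from its centroid; which of the letters H, E and C is assigned to which vertex is not stated in the text, so the mapping below is a hypothetical choice.

```python
import math

# Assumed letter-to-vertex assignment (not specified in the text).
VERTICES = {"H": (0.0, 0.0), "E": (1.0, 0.0), "C": (0.5, 0.5 * math.sqrt(3))}


def cgr_points(ss):
    """Return the CGR points of a secondary structure string over H, E, C."""
    # Start at the centroid of the triangle.
    x = sum(v[0] for v in VERTICES.values()) / 3
    y = sum(v[1] for v in VERTICES.values()) / 3
    points = []
    for letter in ss:
        cx, cy = VERTICES[letter]
        # Move halfway from the previous point to the vertex of this letter.
        x, y = 0.5 * (x + cx), 0.5 * (y + cy)
        points.append((x, y))
    return points


def cgr_x_series(ss):
    """x-coordinates of the CGR points, used as the time series."""
    return [p[0] for p in cgr_points(ss)]


series = cgr_x_series("HHE")
```

The returned list has one value per residue, so a protein of length L yields a time series of length L.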
Fig. 1. The CGR of predicted secondary structure for protein 1LGH.
Fig. 2. The time series (CGRX) represents the x-coordinate of the points in Fig. 1
In recent years, complex network theory is not only used to deal with time series, but also widely used in many bioinformatics problems [24]. Horizontal visibility graph (HVG) can be used to represent time series by complex network theory [25]. Let {Xi }N i=1 is a time series, visibility algorithm maps each point in the time series to a node in HVG. According to the visibility criterion, two adjacent nodes are bound to be connected. For two non-adjacent nodes xi and xj , if any node xn between these two nodes satisfies xn < min xi , xj , then node i and node j are connected nodes. After completing the construction of the horizontal visibility graph, define a limited penetrable distance Lp . Then, we draw a horizontal line between any two nodes. If the number of times the horizontal line is truncated satisfies n ≤ Lp , it means that there is an edge connection between the two nodes. The limited penetrable horizontal visibility graph [26] (LPHVG) increases the number of connections between nodes, allowing shorter protein sequences to reflect topological statistics. Figure 3 shows the limited penetrable horizontal visibility graph when Lp = 1. Solid lines represent connections between adjacent nodes, and dotted lines represent edge connections between
non-adjacent nodes. The weight assigned to each edge is given by Eq. 3:

$$W_{ij} = \begin{cases} \dfrac{x_i - x_j}{i - j} + 1, & \text{node } i \text{ is connected to node } j \\ 0, & \text{node } i \text{ is not connected to node } j \end{cases} \quad (3)$$
The construction of the weighted horizontal visibility graph (WHVG) not only reflects whether the nodes are connected, but also reflects the distance between different nodes, which can better represent the dynamic characteristics of the complex network.
Fig. 3. The diagram of the Limited Penetrable Horizontal Visibility Graph.
The LPHVG determines whether there is an edge between two nodes, and the WHVG determines the strength of that connection; combining the two gives the weighted and limited penetrable horizontal visibility graph (WLPHVG). For a protein sequence of length L, we obtain an L × L adjacency matrix.
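The connectivity rule of the limited penetrable horizontal visibility graph can be sketched as follows. The sketch builds only the unweighted adjacency matrix; an intermediate point is treated as blocking when it is not strictly below both endpoints, which is the complement of the HVG criterion stated above. The function name `lphvg_adjacency` is hypothetical, and edge weights are deliberately left out.

```python
def lphvg_adjacency(series, lp=0):
    """Build the 0/1 adjacency matrix of a limited penetrable HVG.

    With lp=0 this reduces to the ordinary horizontal visibility graph.
    """
    n = len(series)
    adj = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            threshold = min(series[i], series[j])
            # Count intermediate points that block the horizontal line.
            blockers = sum(
                1 for k in range(i + 1, j) if series[k] >= threshold
            )
            if blockers <= lp:
                adj[i][j] = adj[j][i] = 1
    return adj
```

For the short series `[3, 1, 2, 4]`, the plain HVG has five edges, and allowing one penetration (`lp=1`) adds the edge between the second and fourth points. The double loop is quadratic in the sequence length, which is a simplicity-first choice that is acceptable for sequences of protein length.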
3 Prediction Model

3.1 Graph Convolutional Network Based Model

Protein molecules are irregular data structures, and ordinary convolutional networks perform poorly on such irregular data. The graph convolutional network (GCN) is designed to extract features from graph data, which can then be used for node classification. It has been successfully applied in other areas of bioinformatics, such as protein function prediction [27], protein solubility prediction [28] and protein-protein interaction site prediction [29]. Given a protein sequence with L amino acids, the node features of the protein are represented by the matrix $X \in \mathbb{R}^{L \times F}$, the graph structure is represented by the adjacency matrix $A \in \mathbb{R}^{L \times L}$, and $D \in \mathbb{R}^{L \times L}$ is the diagonal node degree matrix. Kipf and Welling [30] introduced a simple layer-wise propagation rule for neural network models that operate directly on graphs, as shown in Eq. 4:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \quad (4)$$
where $\tilde{A} = A + I_L$ is the sum of the original adjacency matrix $A$ and the identity matrix $I_L$, $\tilde{D}_{ii} = \sum_k \tilde{A}_{ik}$, and $W^{(l)} \in \mathbb{R}^{F \times F}$ are the trainable weight matrices. $H^{(l)} \in \mathbb{R}^{L \times F}$ is the activation matrix of the l-th layer, with initial state $H^{(0)} = X$. A two-layer GCN is constructed. Based on the above propagation rule, the overall forward propagation formula is Eq. 5:

$$Z = f(X, A) = \sigma\left(\tilde{A}\, \sigma\left(\tilde{A} X W^{(0)}\right) W^{(1)}\right) \quad (5)$$

where $Z \in \mathbb{R}^{L \times C}$ represents the class probability of each node, and C is the number of classes.
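To make the two-layer forward pass concrete, the following dependency-free sketch applies the symmetric normalization of Eq. 4 in both layers. Several choices here are assumptions rather than details from the paper: ReLU is used for the hidden activation and a row-wise softmax for the output, following Kipf and Welling [30], and all function names are hypothetical. A practical model would use a deep learning framework instead of nested lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    a_tilde = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    deg = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m):
    out = []
    for row in m:
        shift = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(x, adj, w0, w1):
    """Two-layer GCN forward pass: softmax(A_hat ReLU(A_hat X W0) W1)."""
    a_hat = normalized_adjacency(adj)
    hidden = relu(matmul(matmul(a_hat, x), w0))
    return softmax_rows(matmul(matmul(a_hat, hidden), w1))
```

Each row of the returned matrix sums to one and can be read as the class probabilities of one residue.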
Fig. 4. The overall framework of our model.
Our model is shown in Fig. 4. The input to the model consists of two types of features: the node features and the graph features from the previous subsection. The graph convolutional network aggregates protein structure information from nodes and edges during iteration. For node features, we use sliding windows to fuse the secondary structure, HMM profile and hydrophobicity scale and so extract the neighbor features of amino acids. Specifically, if the length of the sliding window is 7, the features of the amino acids at positions i − 3, i − 2, i − 1, i, i + 1, i + 2 and i + 3 are taken as local context features for the i-th residue. For amino acids that lack neighbors on the left or right side of the window, the missing features are padded with zeros. The sliding window length is generally set between 7 and 15 according to the length of AH. Experiments show that the best performance is obtained with a window length of 13, for which the final node feature dimension is 442.

3.2 Model Training and Evaluation

Due to the limited dataset, this paper uses strict 10-fold cross-validation to train and evaluate the model. The dataset is divided based on membrane protein sequences, and
the ratio of training set, validation set, and test set is 8:1:1. The key hyperparameters of the model are determined by performance on the validation set. To reduce the complexity of the model, the number of GCN layers is set to 2; the first layer has 256 nodes and the second layer has 2 nodes. To reduce the risk of overfitting, the dropout rate is set to 0.4. Since the positive and negative samples of the dataset are imbalanced, the cross-entropy loss is used as the loss function, with the class weights set to 7:1 according to the ratio of positive and negative samples. The model is trained for 20 epochs using the Adam optimizer with a learning rate of 0.001 and a batch size of 1.

3.3 Evaluation Indicators

The dataset studied in this paper is imbalanced, so the evaluation indicators should focus on positive examples. Therefore, precision, recall, and F-measure are used to evaluate the performance of the model. Precision, the proportion of correctly predicted AH residues among all residues predicted as positive, is given by Eq. 6:

$$\text{precision} = \frac{TP}{TP + FP} \quad (6)$$
Recall, the proportion of correctly predicted AH residues among all AH residues, is given by Eq. 7:

$$\text{recall} = \frac{TP}{TP + FN} \quad (7)$$
F-measure, the harmonic mean of precision and recall, is given by Eq. 8:

$$F\text{-measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (8)$$
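The three indicators are straightforward to compute from residue-level counts, where TP, FP and FN denote true positives, false positives and false negatives. The sketch below is a minimal illustration; the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not one stated in the paper.

```python
def precision(tp, fp):
    """Fraction of predicted AH residues that are truly AH (Eq. 6)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true AH residues that are recovered (Eq. 7)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall (Eq. 8)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Consistency check against the values reported for a window length of 13.
print(round(f_measure(0.2266, 0.5057), 4))  # 0.313
```

The last line reproduces the reported F-measure of 0.3130 from the reported precision and recall, which is a quick way to check a results table for transcription errors.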
4 Results and Discussion

4.1 Node Features Importance Comparison

To assess the importance of the node features, we test individual features and different combinations of node features on the dataset; the results are shown in Fig. 5. Secondary structure performed worst, with an F-measure of 7.8%. The hydrophobicity scale also performed poorly, with an F-measure of 14.65%. This is not surprising: these two features have very small dimensions (3-dimensional secondary structure and 1-dimensional hydrophobicity scale), so they carry little information for the model to learn from. The HMM profile obtained the highest F-measure among the individual features, 26.59%. When the features are combined, performance improves. The combination of SS, HMM profile, and hydrophobicity scale obtained the best performance. These features contain the structural information, evolutionary information and physicochemical properties of amino acids, respectively, which maximizes feature diversity.
Fig. 5. The comparison results of different combinations of node features.
4.2 The Length of Sliding Window Comparison

To select the length of the sliding window, windows of different lengths (7, 9, 11, 13 and 15) are used and the performance of the model is observed. According to the results in Table 1, the model performs best when the sliding window length is 13, with precision, recall and F-measure of 22.66%, 50.57% and 31.30%, respectively.

Table 1. The comparison results of different length of sliding window.

The length of sliding window    Precision    Recall    F-measure
7                               0.2045       0.4133    0.2736
9                               0.2044       0.4429    0.2797
11                              0.2028       0.4219    0.2740
13                              0.2266       0.5057    0.3130
15                              0.2023       0.3857    0.2654
Protein length has a great impact on the quality of the prediction results. If the sliding window is too short, the neighborhood information of amino acids cannot be fully extracted. If the sliding window is too long, the extracted information becomes redundant and the computational complexity increases. The results suggest that the AH segments in our dataset are relatively long, so a large sliding window is needed to extract the neighbor features of amino acids.
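The windowing scheme from Sect. 3.1 can be sketched as below. The reported dimension of 442 for a window of 13 implies 34 features per residue (442 = 13 × 34); the sketch takes the per-residue feature vectors as given and only performs the concatenation with zero padding. The function name `window_features` is hypothetical.

```python
def window_features(residue_features, window=13):
    """Concatenate each residue's features with those of its neighbours.

    Positions that fall outside the sequence are padded with zeros.
    """
    if window % 2 == 0:
        raise ValueError("window length must be odd")
    half = window // 2
    length = len(residue_features)
    zero = [0.0] * len(residue_features[0])
    rows = []
    for i in range(length):
        row = []
        for k in range(i - half, i + half + 1):
            # Use the neighbour's features if it exists, otherwise zeros.
            row.extend(residue_features[k] if 0 <= k < length else zero)
        rows.append(row)
    return rows
```

With 34 features per residue, `window_features(features, 13)` returns rows of length 442; for the first residue, the leading 6 × 34 entries are zero padding.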
4.3 Comparison with Existing Methods

In this section, we compare our model with three typical methods on our dataset, namely the hydrophobic moment plot, AmphipaSeek and MemBrain3.1. The hydrophobic moment plot cannot directly predict AH, so we use PSIPRED to predict helices and then determine whether each helix is an AH according to its hydrophobic moment. For AmphipaSeek, we submit the sequences of our dataset to its released web server to predict AH. For MemBrain3.1, we download its program and run it on our dataset. Table 2 shows the performance of the four methods on our dataset.

Table 2. The comparison results between AmphipaSeek, hydrophobic moment plot, MemBrain3.1 and our model.

Method                      Precision    Recall    F-measure
AmphipaSeek                 0.1199       0.0819    0.0973
Hydrophobic moment plot     0.0838       0.5219    0.1444
MemBrain3.1                 0.3783       0.3124    0.3422
Our model                   0.2266       0.5057    0.3130
AmphipaSeek performs worst, with precision, recall and F-measure of 11.99%, 8.19% and 9.73%, respectively. This is probably because the two features it uses (the LRG matrix, which predicts the secondary structure, and the PHAT matrix, which predicts the transmembrane helix) are not accurate enough, which results in insufficient learning. The hydrophobic moment plot is also unsatisfactory, with precision, recall and F-measure of 8.38%, 52.19% and 14.44%, respectively. MemBrain3.1 performs well, with precision, recall and F-measure of 37.83%, 31.24% and 34.22%, respectively. Our model also performs well, with precision, recall and F-measure of 22.66%, 50.57% and 31.30%, respectively. Our recall is higher than that of MemBrain3.1, while our precision is lower. This indicates that our model is better than MemBrain3.1 at identifying positive samples and worse at distinguishing negative samples. In general, our model has a strong advantage in identifying positive samples but needs to be strengthened for negative samples. The model proposed in this paper is effective for the prediction of the amphiphilic helix structure of membrane proteins.

4.4 Conclusion

Accurately predicting the membrane protein amphipathic helix structure is still a complex and challenging problem. In this paper, a novel method for predicting the membrane protein amphipathic helix structure is proposed. This method differs from existing methods in the following ways. (1) We extract three different types of features to characterize AH, which enriches the diversity of features. (2) We use CGR to extract time series from the secondary structure of proteins, which contains the sequence information
of proteins. (3) The graph features of the proteins are obtained by mapping the time series to a complex network using the limited penetrable horizontal visibility graph. (4) A graph convolutional network is employed to predict the membrane protein amphipathic helix structure. The experimental results show that the method provides an effective tool for predicting the amphipathic helix structure of membrane proteins. Although our model improves the prediction accuracy to a certain extent, there is still room for improvement. In the future, we will continue to expand the dataset, extract more types of features to characterize AH, and further improve the prediction accuracy of AH.

Acknowledgment. This work was supported by the National Natural Science Foundation of China (Grant No. 61671220), the University Innovation Team Project of Jinan (2019GXRC015), and the Natural Science Foundation of Shandong Province, China (Grant No. ZR2021MF036).
References

1. Smith, S.M.: Strategies for the purification of membrane proteins. Methods Mol. Biol. 681, 485–496 (2011)
2. Cuthbertson, J., Sansom, M.: Structural bioinformatics and molecular simulations: looking at membrane proteins. Biochemist 4, 21–24 (2004)
3. Feng, S.H., Zhang, W.X., Yang, J., et al.: Topology prediction improvement of α-helical transmembrane proteins through helix-tail modeling and multiscale deep learning fusion. J. Mol. Biol. 432(4), 1279–1296 (2019)
4. Tsirigos, K.D., Govindarajan, S., Bassot, C., et al.: Topology of membrane proteins-predictions, limitations and variations. Curr. Opin. Struct. Biol. 50, 9–17 (2018)
5. Drin, G., Casella, J.F., Gautier, R., et al.: A general amphipathic α-helical motif for sensing membrane curvature. Nat. Struct. Mol. Biol. 14(2), 138–146 (2007)
6. Brady, J.P., Claridge, J.K., Smith, P.G., et al.: A conserved amphipathic helix is required for membrane tubule formation by Yop1p. Proc. Natl. Acad. Sci. 112(7), 639–648 (2015)
7. Milletti, F.: Cell-penetrating peptides: classes, origin, and current landscape. Drug Discov. Today 17(15), 850–860 (2012)
8. Schiffer, M., Edmundson, A.B.: Use of helical wheels to represent the structure of proteins and to identify segments with helical potential. Biophys. J. 7(2), 121–135 (1967)
9. Rodaway, A., Sternberg, M., Bentley, D.L.: Similarity in membrane proteins. Nature 342(6250), 624 (1989)
10. Eisenberg, D., Schwarz, E., Komaromy, M., et al.: Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol. 179(1), 125–142 (1984)
11. Eisenberg, D., Weiss, R.M., Terwilliger, T.C.: The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature 299(5881), 371–374 (1982)
12. Roberts, M.G., Phoenix, D.A., Pewsey, A.R.: An algorithm for the detection of surface active α helices with the potential to anchor proteins at the membrane interface. Bioinformatics 13(1), 99–106 (1997)
13. Sapay, N., Guermeur, Y., Deléage, G.: Prediction of amphipathic in-plane membrane anchors in monotopic proteins using a SVM classifier. BMC Bioinform. 7(1), 1–11 (2006)
14. Feng, S.H., et al.: Ab-initio membrane protein amphipathic helix structure prediction using deep neural networks. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics/IEEE, p. 99. ACM (2020)
15. Tusnády, G.E., Zsuzsanna, D., István, S.: PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res. 33(suppl_1), D275–D278 (2005)
16. Li, W.Z., Adam, G., et al.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
17. Lomize, M.A., et al.: OPM database and PPM web server: resources for positioning of proteins in membranes. Nucleic Acids Res. 40(D1), 370–376 (2011)
18. Sussman, J.L., Lin, D., Jiang, J., et al.: Protein data bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr. A 54(6–1), 1078–1084 (2010)
19. Daniel, W.A., et al.: Scalable web services for the PSIPRED protein analysis workbench. Nucleic Acids Res. 41(W1), W349–W357 (2013)
20. Remmert, M., Biegert, A., Hauser, A., et al.: HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9(2), 173–175 (2012)
21. Milot, M., von den Driesch Lars, Clovis, G., et al.: Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45(D1), D170–D176 (2017)
22. Jeffrey, H.J.: Chaos game representation of gene structure. Nucleic Acids Res. 18(8), 2163–2170 (1990)
23. Yang, J.Y., Peng, Z.L., Chen, X.: Prediction of protein structural classes for low-homology sequences based on predicted secondary structure. BMC Bioinform. 11(1), 1–10 (2010)
24. Olyaee, M.H., Yaghoubi, A., Yaghoobi, M.: Predicting protein structural classes based on complex networks and recurrence analysis. J. Theor. Biol. 404, 375–382 (2016)
25. Luque, B., Lacasa, L., Ballesteros, F.: Horizontal visibility graphs: exact results for random time series. Phys. Rev. E 80(4), 046103 (2009)
26. Gao, Z.K., Cai, Q., Yang, Y.X.: Multiscale limited penetrable horizontal visibility graph for analyzing nonlinear time series. Sci. Rep. 6(1), 1–7 (2016)
27. Gligorijević, V., Renfrew, P.D., Kosciolek, T., et al.: Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12(1), 1–14 (2021)
28. Chen, J., Zheng, S., Zhao, H., et al.: Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map. J. Cheminform. 13(1), 1–10 (2021)
29. Yuan, Q., Chen, J., Zhao, H., et al.: Structure-aware protein-protein interaction site prediction using deep graph convolutional network. Bioinformatics 38(1), 125–132 (2022)
30. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
The CNV Predict Model in Esophagus Cancer

Yun Tian1, Caixia Xu1, Lin Li1, Pengyong Han1(B), and Zhengwei Li2

1 Changzhi Medical College, Changzhi 046000, China
[email protected]
2 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
Abstract. Copy number variations (CNVs) are critical factors in esophageal cancer carcinogenesis. The present study identified molecular signatures that predict prognosis in esophageal cancer by comprehensively analyzing copy number and gene expression data. Methods: Esophageal cancer expression profiles, CNVs, and clinical data from The Cancer Genome Atlas (TCGA) dataset were collected. Univariate Cox survival analysis, multivariate Cox survival analysis, the chi-square test, Kaplan-Meier (KM) survival curves, and receiver operating characteristic (ROC) analysis were employed to model gene signatures and evaluate their performance. Results: 649 CNV-related differentially expressed genes obtained from the TCGA esophageal cancer dataset were associated with several cancer pathways and functions. A prognostic gene set of 3 genes was screened to classify patients into high-risk and low-risk groups. The 3-gene signature developed in this study achieved a higher AUC. Conclusion: The current study demonstrates an esophageal cancer CNV gene signature for assessing the prognosis of esophageal cancer patients, which may advance the clinical application of predictive assessment.

Keywords: Copy number variations · Predict model · Esophagus carcinoma
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 405–414, 2022. https://doi.org/10.1007/978-3-031-13829-4_35

1 Introduction

Esophageal adenocarcinoma (ES) ranks among the 10 most prevalent cancers and is a major cause of cancer-related death [1]. Although immunotherapy has emerged as a potential treatment, curative treatments for ES remain to be explored [2]. In addition, high recurrence rates make long-term survival difficult. Genetic and epigenetic molecular alterations play a central role in ES development [3]. Therefore, a better understanding of the molecular mechanisms driving the occurrence and development of ES is crucial. Predictive models based on ES drivers can be useful for precise treatment and prognosis evaluation. Copy number variations (CNVs) are amplified or deleted DNA segments larger than 1 kb [4], and they play an essential role in cancer pathogenesis. Sites of copy number variation provide hotspots for somatic alterations in cancer. CNVs can lead to the activation of oncogenes or the inactivation of tumour suppressor genes, thereby driving cancer development [5]. Studies have shown that various CNVs are associated with the carcinogenesis of cancers such as thyroid cancer [6, 7]. Frequent CNVs in cancer cell subsets can remodel ES heterogeneity, suggesting a pivotal role for CNVs in ES carcinogenesis and progression [8, 9]. We used RNA-seq and CNV profiles to build a predictive model of ES based on CNV-driven genes. Our study may improve understanding of the potential mechanisms and provide novel therapeutic targets for ES therapy.
2 Materials and Methods

2.1 Data Collection

161 RNA-seq profiles of EC patients and corresponding normal samples, clinical data and CNV data were retrieved from TCGA (from inception to March 1, 2022).

2.2 Differentially Expressed Genes (DEGs) Screened Between Cancer and Para-Neoplasm Tissues

The edgeR package was used to screen genes critical for ES carcinogenesis between tumour and para-neoplasm tissues from TCGA [10]. A false discovery rate (FDR) < 0.01 and |log2(fold change [FC])| > 2 were adopted as thresholds.

2.3 DNA CNVs Annotation and Related Analysis

GRCh38 was adopted as the reference genome for the genes in the CNV regions. The copy number variation rates of genes within samples were calculated. The chi-square test was then used to compare the rate of CNV changes between normal and tumor samples, and genes with an adjusted P value less than 0.05 were selected for further analysis.

2.4 Establishment of Risk Prediction Model

CNV genes associated with prognosis were filtered to build a prognostic evaluation model. Genes with P < 0.01 in univariate Cox proportional hazards regression analysis were chosen for subsequent analysis. Multivariate Cox regression analysis was employed to generate the coefficient parameters. The log-rank test and Kaplan-Meier (KM) curves were used to compare the overall survival (OS) of the two subgroups.

2.5 Independence of Risk Score from Other Clinical Features

Univariate and multivariate analyses of clinical information and gene expression were used to assess whether the risk score was independent of other clinical indices.

2.6 Functional Enrichment and GO Analysis

To explore potential biological functions, Gene Ontology (GO) analysis was performed with the ClueGO app [11], and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment was performed with the clusterProfiler package in R [12].
2.7 Statistical Analysis

All analyses were performed using R 4.1.2. P values less than 0.05 were considered statistically significant.
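The DEG thresholds of Sect. 2.2 amount to a simple filter over a differential expression result table. The sketch below is written in Python for illustration only; the study itself used edgeR in R. The input is assumed to be a list of `(gene, log2FC, FDR)` tuples, the gene names in the usage are placeholders, and applying the fold-change cutoff to the absolute value is an assumption made so that down-regulated genes can pass the filter.

```python
def filter_degs(results, fdr_cutoff=0.01, lfc_cutoff=2.0):
    """Split significant genes into up- and down-regulated lists.

    `results` is an iterable of (gene, log2_fold_change, fdr) tuples.
    """
    up, down = [], []
    for gene, log2_fc, fdr in results:
        # Keep genes that pass both the FDR and the fold-change threshold.
        if fdr < fdr_cutoff and abs(log2_fc) > lfc_cutoff:
            (up if log2_fc > 0 else down).append(gene)
    return up, down


# Placeholder rows, not data from the study.
table = [
    ("GENE_A", 3.1, 0.001),
    ("GENE_B", -2.7, 0.004),
    ("GENE_C", 2.5, 0.2),
    ("GENE_D", 0.8, 0.0001),
]
up_genes, down_genes = filter_degs(table)
```

Here `GENE_C` is dropped for its FDR and `GENE_D` for its small fold change, leaving one up-regulated and one down-regulated gene.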
3 Results

3.1 Differential Gene Analysis of Esophageal Cancer

951 DEGs were screened between tumour and para-neoplasm tissues: 281 genes were up-regulated and 670 genes were down-regulated (Fig. 1).
Fig. 1. Heatmap and volcano of the DEG.
3.2 Identification of CNV-Driven Genes in ES Patients

3789 ES-associated CNV genes were screened (adjusted P < 0.05). The chromosomal positions of the ES-associated CNVs are shown in Fig. 2. Intersecting these genes with the RNA-seq differential genes yielded 57 genes. KEGG-enriched signaling pathways include neuropeptide signaling pathways (Fig. 3), and GO analysis found that the genes were mainly enriched in physiological processes such as G protein-coupled receptors, neuropeptide receptor binding, and hormone binding (Fig. 4).
Fig. 2. Circos plot of ES-related CNVs. The twenty-four chromosomes are in the outer circle; the inner circle shows the CNVs, and the blue dots represent CNV deletions. ES, Esophagus carcinoma.
Fig. 3. GO analysis.
Fig. 4. KEGG analysis.
3.3 Screening of Prognostic CNV Driver Genes

In the intersection of the differential RNA-seq genes and the differential CNV driver genes, 57 driver genes differed between tumour and para-neoplasm tissues. In univariate analysis, 4 CNV genes were found to be potential OS prognostic biomarkers (P < 0.01) (Fig. 5). After multivariate analysis, 3 CNV-driven genes were retained as potential prognostic biomarkers for OS (P < 0.01).
Fig. 5. Univariate cox regression.
Establishment of a Prognostic Model

ES patients were stratified into two groups according to a risk scoring model. We established the model on the 3 CNV driver genes using the coefficients from multivariate Cox regression analysis. The prognostic prediction model is:

Risk score = (−0.167 × HCN1 mRNA level) + (−0.169 × KCTD8 mRNA level) + (0.279 × RXFP3 mRNA level).

Unlike HCN1 and KCTD8, RXFP3 had a hazard ratio greater than 1, suggesting that higher expression of this gene is associated with shorter OS. Patients were stratified into high-risk (77 cases) and low-risk (82 cases) groups based on the risk scores. The higher the risk score, the shorter the OS (Fig. 4A). The distributions of the risk scores and of the CNV-driven genes were plotted (Fig. 6A–C). Except for RXFP3, the expression levels of CNV driver genes increased with higher risk scores, suggesting that these CNV driver genes are factors for high-risk patients. Compared with low-risk patients, high-risk patients have shorter survival (Fig. 7), and the predictive value (AUC) of the model is around 0.7 (Fig. 8).
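The risk score is a linear combination of expression levels, so it can be illustrated in a few lines. The coefficients below are the ones given in the formula above; everything else is an assumption for illustration. In particular, the text does not state which cutoff separates the two groups, so the median split used here is hypothetical, as are the function names.

```python
import math
from statistics import median

# Coefficients from the multivariate Cox model reported in the text.
COEFFICIENTS = {"HCN1": -0.167, "KCTD8": -0.169, "RXFP3": 0.279}


def risk_score(expression):
    """Linear risk score from a dict of gene -> mRNA level."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def hazard_ratios():
    """Hazard ratio per unit of expression, HR = exp(coefficient)."""
    return {gene: math.exp(coef) for gene, coef in COEFFICIENTS.items()}


def stratify(scores, cutoff=None):
    """Label each score 'high' or 'low'; the median cutoff is an assumption."""
    cutoff = median(scores) if cutoff is None else cutoff
    return ["high" if s > cutoff else "low" for s in scores]
```

Since HR = exp(coefficient), the positive coefficient of RXFP3 gives a hazard ratio above 1 (about 1.32), while the negative coefficients of HCN1 and KCTD8 give hazard ratios below 1.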
Fig. 6. Construction of CNV predict model.
Fig. 7. Survival curve of the predict model.
Fig. 8. ROC prediction value.
4 Discussion

ES remains a leading cause of cancer-related death worldwide and imposes a heavy public health burden [13]. Unravelling the role of CNVs in esophageal carcinogenesis is vital for early diagnosis, prevention, and prognosis evaluation of ES. Single nucleotide mutations play a crucial role in carcinogenesis, a multi-step process [14]. We comprehensively analyzed CNV and gene expression data to screen hub CNV driver genes associated with ES survival and to establish prognostic signatures. Multivariate Cox analysis indicated that the predictive risk score can be used as an independent indicator of OS. Survival analysis demonstrated that the risk score evaluation model might help improve the prediction of OS in ES patients. HCN1, KCTD8, and RXFP3 were related to the prognosis of ES patients. HCN1, which codes for hyperpolarization-activated cyclic nucleotide-gated channel subunits, is associated with low survival rates in breast and colorectal cancer [15]. KCTD8, which encodes a potassium channel, was frequently methylated in locally advanced disease [16]. RXFP3 has been shown to be a potential epigenetic marker for endometrial cancer [17]. A five-CNV-gene signature has been validated for evaluating the prognosis of breast cancer patients [18]. A model constructed from four CNVs could potentially provide a method for detecting lung cancer [19]. The mechanism of cancer development is complicated and involves miRNAs, gene mutations, and other factors [20–24]. The regulatory network between genes and diseases plays an essential role in disease. Algorithms have been developed to screen miRNA-target interactions, while the detailed regulatory mechanism of CNVs remains to be explored [25–29]. We investigated the association between ES patients' survival and risk scores, and our results indicate that the expression of RXFP3 might be correlated with the poor survival of the high-risk group. In conclusion, CNV driver genes associated with ES survival have been identified. We established an OS prognostic prediction model for ES based on CNV-driven genes. These results will help to better understand ES occurrence from a CNV perspective and may help predict the prognosis of ES patients.

Acknowledgement. This study was supported by the Provincial Science and Technology Grant of Shanxi Province (20210302124588) and the Science and Technology Innovation Project of Shanxi Province Universities (2019L0683).
References

1. Radani, N., et al.: Analysis of fecal, salivary and tissue microbiome in barrett esophagus, dysplasia and esophageal adenocarcinoma. Gastro Hep Advances (2022)
2. Raimondi, A., et al.: The Emerging Role of Immunotherapy in Gastroesophageal Cancer: State of Art and Future Perspective (2022)
3. Maslyonkina, K.S., Konyukova, A.K., Alexeeva, D.Y., Sinelnikov, M.Y., Mikhaleva, L.M.: Barrett's esophagus: The pathomorphological and molecular genetic keystones of neoplastic progression. Cancer Med. 11, 447–478 (2022)
4. Srivastava, S., Phadke, S.R.: Low-pass genome sequencing: a good option for detecting copy number variations
5. Shahrisa, A., Tahmaseby, M., Ansari, H., Mohammadi, Z., Carloni, V., Asl, J.M.: The pattern of gene copy number variations (CNVs) in hepatocellular carcinoma; in silico analysis (2021)
6. McKelvey, B.A., Zeiger, M.A., Umbricht, C.B.: Characterization of TERT and BRAF copy number variation in papillary thyroid carcinoma: an analysis of the cancer genome atlas study. Genes Chromosom. Cancer 60, 403–409 (2021)
7. Nazha, B., et al.: Circulating tumor DNA (ctDNA) in patients with advanced adrenocortical carcinoma (2021)
8. Fan, X., et al.: CSMD1 mutation related to immunity can be used as a marker to evaluate the clinical therapeutic effect and prognosis of patients with esophageal cancer. Int. J. Gen. Med. 14, 8689 (2021)
9. Maity, A.K., et al.: Novel epigenetic network biomarkers for early detection of esophageal cancer. Clin. Epigenetics 14, 1–14 (2022)
10. Smyth, G.K.: edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139 (2010)
11. Gabriela, B., Bernhard, M., Hubert, H., Pornpimol, C., Marie, T.: ClueGO: a cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics (Oxford, England) (2009)
12. Yu, G.: ClusterProfiler: an universal enrichment tool for functional and comparative study (2018)
13. Li, S., et al.: Changing trends in the disease burden of esophageal cancer in china from 1990 to 2017 and its predicted level in 25 years. Cancer Med. 10, 1889–1899 (2021)
14. Firigato, I., López, R.V., Curioni, O.A., De Antonio, J., Gattás, G.F., de Toledo Gonçalves, F.: Many hands make light work: CNV of GSTM1 effect on the oral carcinoma risk. Cancer Epidemiol. 78, 102150 (2022)
15. Phan, N.N., Huynh, T.T., Lin, Y.-C.: Hyperpolarization-activated cyclic nucleotide-gated gene signatures and poor clinical outcome of cancer patient. Transl. Cancer Res. 6 (2017)
16. Daniunaite, K., et al.: Promoter methylation of PRKCB, ADAMTS12, and NAALAD2 is specific to prostate cancer and predicts biochemical disease recurrence. Int. J. Mol. Sci. 22, 6091 (2021)
17. Huang, Y.W., et al.: Hypermethylation of CIDEA and RXFP3 as potential epigenetic markers for endometrial cancer
18. Establishment of a novel CNV-related prognostic signature predicting prognosis in patients with breast cancer. J. Ovarian Res. 14 (2021)
19. Daping, Y., et al.: Copy number variation in plasma as a tool for lung cancer prediction using extreme gradient boosting (XGBoost) classifier. Thorac. Cancer 11(1), 95–102 (2020). https://doi.org/10.1111/1759-7714.13204
20. Zhong, T.-B., Li, Z.-W., You, Z.-H., Nie, R., Zhao, H.: Predicting miRNA-disease associations based on graph random propagation network and attention network. Brief. Bioinform. Advance Access (2022)
21. Li, Z.-W., Zhong, T.-B., Huang, D.-S., You, Z.-H., Nie, R.: Hierarchical graph attention network for miRNA-disease association prediction. Mol. Ther., Advance access (2022)
22. Li, Z.-W., Li, J.-S., Nie, R., You, Z.-H., Bao, W.-Z.: A graph auto-encoder model for mirna-disease associations prediction. Brief. Bioinform. 22(4), bbaa240 (2021)
23. Nie, R., Li, Z.-W., You, Z.-H., Bao, W.-Z., Li, J.-S.: Efficient framework for predicting miRNA-disease associations based on improved hybrid collaborative filtering. BMC Medical Inform. Decis. Mak. 21(S1), 254 (2021)
24. Liu, B.-L., Zhu, X.-Y., Zhang, L., Liang, Z.-Z., Li, Z.-W.: Combined embedding model for mirna-disease association prediction. BMC Bioinform. 22, 161 (2021)
25. Zhang, L., Liu, B.-L., Li, Z.-W., Zhu, X.-Y., Liang, Z.-Z., An, J.-Y.: Predicting miRNA-disease associations by multiple meta-paths fusion graph embedding model. BMC Bioinform. 21, 470 (2020)
26. Li, J.-S., Li, Z.-W., Nie, R., You, Z.-H., Bao, W.-Z.: FCGCNMDA: predicting MiRNA-disease associations by applying fully connected graph convolutional networks. Mol. Genet. Genom. 295(5), 1197–1209 (2020)
27. Li, Z.-W., Nie, R., You, Z.-H., Cao, C., Li, J.-S.: Using discriminative vector machine model with 2DPCA to predict interactions among proteins. BMC Bioinform. 20(Suppl 25), 694–702 (2019)
28. Li, Z.-W., You, Z.-H., Chen, X., Nie, R., An, J.-Y.: In silico prediction of drug-target interaction networks based on drug topological structure and protein sequences. Sci. Rep. 9, 2045–2322 (2017)
29. Li, Z.-W., et al.: Accurate prediction of protein-protein interactions by integrating potential evolutionary information embedded in PSSM profile and discriminative vector machine classifier. Oncotarget 8(14), 23638–23649 (2017)
TB-LNPs: A Web Server for Access to Lung Nodule Prediction Models

Huaichao Luo1,2, Ning Lin3, Lin Wu4, Ziru Huang1, Ruiling Zu2, and Jian Huang1(B)

1 School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
[email protected]
2 Department of Clinical Laboratory, School of Medicine, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
3 School of Healthcare Technology, Chengdu Neusoft University, Sichuan, China
4 College of International College of Digital Innovation, Chiang Mai University, Chiang Mai, Thailand
Abstract. A large number of lung nodule prediction models have been developed by the scientific community, such as the Brock University (BU) model and the Mayo Clinic (MC) model, which are easy for the general public and researchers to apply. However, few existing web servers combine these models. TB-LNPs (Tool Box of Lung Nodule Predictors) is a web-based tool that provides fast and safe access to published models. TB-LNPs consists of four sections: ‘Home’, ‘About Us’, ‘Manual’, and ‘Tool Box of Lung Nodule Predictions’. We provide an extensive manual for TB-LNPs. In addition, in the ‘Tool Box of Lung Nodule Predictors’ section, we reconstructed six published models in R and built the web server with Spring Boot. TB-LNPs provides fast, interactive, and safe functions using asynchronous JavaScript and Data-Oriented Security Architecture. TB-LNPs bridges the gap between lung nodule prediction models and end users, thus maximizing the value of lung nodule prediction models. TB-LNPs is available at http://i.uestc.edu.cn/TB-LNPs.
Keywords: Lung nodule · Web server · Prediction models
1 Introduction
Lung cancer remains the leading cause of cancer death in the United States and worldwide. An estimated 228,820 adults were projected to be diagnosed with lung cancer in the United States in 2020 [1]. While immunotherapy and other treatment options have improved recently, the 5-year survival rate for lung cancer remains at 21.7%, primarily because most lung cancers are diagnosed at an advanced stage [2]. Early diagnosis can significantly improve outcomes: the survival rate for patients with stage IA1 non-small cell cancer is 92% [2].
H. Luo, N. Lin, L. Wu, Z. Huang and R. Zu contributed equally to the work presented here and should therefore be regarded as equivalent authors.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 415–420, 2022. https://doi.org/10.1007/978-3-031-13829-4_36
Early lung cancer diagnosis is possible in two ways. First, low-dose computed tomography (LDCT) screening was shown to reduce lung cancer deaths by 20% in the US National Lung Screening Trial (NLST) [3]. Second, cancer may be discovered as an incidental finding during imaging performed for unrelated reasons. An estimated 1.57 million patients with pulmonary nodules are identified incidentally on chest CT every year in the United States, and about 30% of chest CT reports describe indeterminate pulmonary nodules [4]. In recent years, a growing number of prediction models have been published. Logistic regression–based methods, such as the Mayo and Brock risk models, are recommended by some guidelines [5, 6]. The Department of Veterans Affairs (VA) model was developed using data from 375 patients across 10 VA sites as part of a prospective study assessing the accuracy of CT compared with PET for the evaluation of lung nodules [7, 8]. Our group has published several models that use XGBoost and support vector machine algorithms to discriminate between malignant and benign nodules [9, 10]. However, many models are difficult for the public to apply, so we constructed a web server to fill this gap.
2 Materials and Methods
2.1 Implementations
All users have free access to the TB-LNPs website. The website was developed using Spring Boot (https://spring.io/projects/spring-boot), which simplifies the initial setup and development of new Spring applications. We also implemented cross-platform calls to R, integrating several existing R algorithms into one Java system. Consequently, TB-LNPs combines the statistical analysis strengths of R with the simplicity and stability of Spring Boot. During development, asynchronous and cluster processing were used to handle high concurrency, which improves the stability and practicability of the system and prevents it from shutting down when a large number of requests reach the server at the same time. To protect data, DOSA (Data-Oriented Security Architecture) was used. DOSA can also register various information about data, including security attributes, build a logical data resource pool from it, manage data, and provide data services. No login is required to access any feature of TB-LNPs (Fig. 1). Three canonical lung cancer diagnosis models, the Brock University (BU) model [11], the Mayo Clinic (MC) model [12], and the Veterans Affairs (VA) model [7], were reconstructed and are available in TB-LNPs. Moreover, two novel models from our group are included in TB-LNPs [9, 10]. TB-LNPs features are grouped into four tabs: Home, About Us, Manual, and Tool Box of Lung Nodule Predictions. The Tool Box of Lung Nodule Predictions tab provides the key interactive functions corresponding to different feature inputs (Fig. 2).
2.2 Functionalities and Documentation
We provide default inputs for every model to facilitate a quick start. Furthermore, we added a bar plot to show the predicted value.
TB-LNPs documentation is available and can be accessed by clicking the ‘Manual’ link in the top right navigation bar. The documentation includes a description of each feature function, an introduction to the parameters of each feature, and the results of each analysis. In addition, TB-LNPs provides a link to ‘About Us’ in the top right navigation bar for quick access to information about our group.
The SCHC (Sichuan Hospital of Cancer) model was constructed using eXtreme Gradient Boosting (XGBoost) based on the patient’s age, nodule/mass size, and platelet characteristics. The user should first enter the required features in the box. Age represents the age of the patient in years; pPLT represents the platelet count in a platelet-rich plasma sample (×10^9/L); pPCT represents the plateletcrit in a platelet-rich plasma sample; size represents the largest diameter of the nodule/mass in millimeters (mm); bPCT represents the plateletcrit in a whole blood sample. The probability of malignancy is displayed as a bar after clicking the ‘get result’ link at the bottom of the web page. The blank box displays “malignant” if the probability is greater than the cut-off value; otherwise it displays “benign”.
The BU model, also known as the Brock model, was developed from the Pan-Canadian Early Detection of Lung Cancer Study, which produced two sets of models: a parsimonious model and a full model. BU calculates the probabilities using multivariable logistic regression based on the variables age, gender, family history of lung cancer, presence of COPD, nodule size, nodule location, nodule count, and nodule characteristics. The user should first enter the required features in the box. Age is the age in years; nodule count represents the number of nodules/masses on imaging; size represents the largest diameter of the nodule/mass in millimeters (mm); the PSN, Gender, COPD, GGN, Up, and Spiculation boxes are selectable from a list. After clicking ‘get result’ at the bottom of the web page, the probability of malignancy is displayed as a bar.
The BU simple model is the parsimonious version of the BU model, which includes size, gender, up, and spiculation. The user should first enter the required features in the box. Size represents the largest diameter of the nodule/mass in millimeters (mm); the Gender, Up, and Spiculation boxes are selectable from a list. After clicking ‘get result’ at the bottom of the web page, the probability of malignancy is displayed as a bar. The VA model was developed using data from the Department of Veterans Affairs (VA); it calculates the probability of malignancy in patients with solitary pulmonary nodules (SPNs) using multivariable logistic regression based on the patient’s smoking history, age, and nodule size. The TCRnodseek (TCR nodule seek) model integrates TCR diversities and clinical information to classify indeterminate lung nodules as benign or malignant. The TCRnodseek model was constructed with Support Vector Machines.
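The logistic-regression models described above (BU, BU simple, VA) all turn a weighted sum of predictors into a probability, and the thresholding step described for the SCHC model then maps that probability to a label. The following is a minimal sketch of that calculation only. The intercept, coefficients, feature names, and cut-off are hypothetical placeholders chosen for illustration; they are not the published BU, MC, or VA parameters, and this is not the TB-LNPs implementation (which the paper states is written in R and Java).

```python
import math

def logistic_probability(intercept, coefficients, features):
    """Return 1 / (1 + exp(-z)) with z = intercept + sum(b_i * x_i)."""
    z = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))

def label(probability, cutoff):
    """Report 'malignant' only when the probability exceeds the cut-off."""
    return "malignant" if probability > cutoff else "benign"

# Hypothetical values for illustration only (NOT taken from any published model).
HYPOTHETICAL_INTERCEPT = -6.0
HYPOTHETICAL_COEFFICIENTS = {"age": 0.04, "size_mm": 0.12, "spiculation": 0.8}
HYPOTHETICAL_CUTOFF = 0.5

patient = {"age": 65, "size_mm": 18, "spiculation": 1}
p = logistic_probability(HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_COEFFICIENTS, patient)
print(f"probability of malignancy: {p:.3f} -> {label(p, HYPOTHETICAL_CUTOFF)}")
```

Keeping the coefficients in a dictionary keyed by feature name mirrors the way each model on the site asks for a different set of inputs; swapping in another model would only change the intercept and the dictionary.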
3 Discussion
Lung cancer is a common clinical problem that often presents as a pulmonary nodule or mass. Accurate assessment of pulmonary nodules/masses is crucial to the diagnosis and treatment of patients. A number of clinical models combining patient clinical characteristics with other characteristics have been developed recently to evaluate whether pulmonary nodules/masses are malignant or benign, but few have been externally validated [13]. About eight models have been externally validated: the
Gurney, MC, Herder, VA, Peking University People’s Hospital (PKUPH), BU, Thoracic Research Evaluation and Treatment (TREAT), and Bayesian Inference Malignancy Calculator (BIMC) models [5, 7, 8, 12, 14–18]. With the exception of the PKUPH model, probability calculators built from these models are available online for clinical use. However, as far as we are aware, no web server combines them into one site. Here, a tool for estimating the probability of malignancy has been developed. The tool box of lung nodule predictors is a scalable, one-stop web platform that includes the SCHC model, the BU model, the BU simple model, the MC model (Mayo Clinic model), the VA model, and the TCRnodseek model (TCR nodule seek model). The SCHC model combines platelet features with nodule imaging features to estimate the likelihood of malignancy. The BU model, the BU simple model, the MC model, and the VA model calculate the probability of malignancy based on chest radiographic features and clinical characteristics. The TCRnodseek model provides a probability of malignancy based on clinical characteristics and TCR information. TB-LNPs provides a simple website that clinicians and patients with suspected malignancy can use to estimate the probability of malignancy, and it enables patients and clinicians without any programming skills to perform their own analysis. As more robust models are published and validated, we will attempt to add them to the all-in-one TB-LNPs web server, which will likely improve malignancy probability estimation in the near future. In conclusion, we have constructed a web server (TB-LNPs) that bridges the gap between lung nodule prediction models and end users, thus maximizing the value of lung nodule prediction models. TB-LNPs is available at http://i.uestc.edu.cn/TB-LNPs.
Fig. 1. The workflow of TB-LNPs. The project was completed in four steps, from model search to the website.
Fig. 2. An example of the TB-LNPs web server with default inputs (A-G).
Acknowledgements. HC.L. and Z.H. built the server base system; L.W. and HC.L. designed the user interface; J.H. obtained funding and supervised the project, and oversaw the manuscript preparation.
Funding. This study was supported by grants from the Sichuan Medical Association Research project (S20087), Sichuan Cancer Hospital Outstanding Youth Science Fund (YB2021033), and the National Natural Science Foundation of China (62071099).
References
1. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics, 2020. CA Cancer J. Clin. 70, 7–30 (2020)
2. Massion, P.P., et al.: Assessing the accuracy of a deep learning method to risk stratify indeterminate pulmonary nodules. Am. J. Respir. Crit. Care Med. 202, 241–249 (2020)
3. Aberle, D.R., et al.: Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. 365, 395–409 (2011)
4. Gould, M.K., et al.: Recent trends in the identification of incidental pulmonary nodules. Am. J. Respir. Crit. Care Med. 192, 1208–1214 (2015)
5. McWilliams, A., et al.: Probability of cancer in pulmonary nodules detected on first screening CT. N. Engl. J. Med. 369, 910–919 (2013)
6. Hawkins, S., et al.: Predicting malignant nodules from screening CT scans. J. Thorac. Oncol. 11, 2120–2128 (2016)
7. Gould, M.K., Ananth, L., Barnett, P.G.: A clinical model to estimate the pretest probability of lung cancer in patients with solitary pulmonary nodules. Chest 131, 383–388 (2007)
8. Kymes, S.M., Lee, K., Fletcher, J.W.: Assessing diagnostic accuracy and the clinical value of positron emission tomography imaging in patients with solitary pulmonary nodules (SNAP). Clin. Trials 3, 31–42 (2006)
9. Zu, R., et al.: A new classifier constructed with platelet features for malignant and benign pulmonary nodules based on prospective real-world data. J. Cancer 13, 2515–2527 (2022)
10. Luo, H., Zu, R., Li, Y., Huang, J.: Characteristics and diagnostic significance of peripheral blood T-cell receptor repertoire features in patients with indeterminate lung nodules. Available at SSRN: https://ssrn.com/abstract=3978572 (2022)
11. Chung, K., et al.: Brock malignancy risk calculator for pulmonary nodules: validation outside a lung cancer screening population. Thorax 73, 857–863 (2018)
12. Swensen, S.J., et al.: The probability of malignancy in solitary pulmonary nodules. Application to small radiologically indeterminate nodules. Arch. Intern. Med. 157, 849–855 (1997)
13. Choi, H.K., Ghobrial, M., Mazzone, P.J.: Models to estimate the probability of malignancy in patients with pulmonary nodules. Ann. Am. Thorac. Soc. 15, 1117–1126 (2018)
14. Herder, G.J., et al.: Clinical prediction model to characterize pulmonary nodules: validation and added value of 18F-fluorodeoxyglucose positron emission tomography. Chest 128, 2490–2496 (2005)
15. Gurney, J.W., Swensen, S.J.: Solitary pulmonary nodules: determining the likelihood of malignancy with neural network analysis. Radiology 196, 823–829 (1995)
16. Gurney, J.W.: Determining the likelihood of malignancy in solitary pulmonary nodules with Bayesian analysis. Part I. Theory. Radiology 186, 405–413 (1993)
17. Soardi, G.A., Perandini, S., Motton, M., Montemezzi, S.: Assessing probability of malignancy in solid solitary pulmonary nodules with a new Bayesian calculator: improving diagnostic accuracy by means of expanded and updated features. Eur. Radiol. 25(1), 155–162 (2014). https://doi.org/10.1007/s00330-014-3396-2
18. Deppen, S.A., et al.: Predicting lung cancer prior to surgical resection in patients with lung nodules. J. Thorac. Oncol. 9, 1477–1484 (2014)
Intelligent Computing in Drug Design
A Targeted Drug Design Method Based on GRU and TopP Sampling Strategies
Jinglu Tao1,2,3, Xiaolong Zhang1,2,3(B), and Xiaoli Lin1,2,3
1 College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei, China
{xiaolong.zhang,linxiaoli}@wust.edu.cn
2 Hubei Key Laboratory of Intelligent Information Processing and Realtime Industrial System, Wuhan, Hubei, China
3 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
Abstract. Deep learning algorithms can improve the efficiency of drug design. This paper proposes a targeted drug design model based on the gated recurrent unit (GRU) neural network, which is trained on a large number of drug molecules obtained from the Chembl database in order to generate a generic and unbiased molecular library. To improve the efficiency and accuracy of the trained model, a fine-tuning strategy is used to train on the active compounds of the target protein. In addition, a TopP sampling strategy is used to sample molecular tokens, reducing the number of generated molecules that are invalid or that duplicate existing drug molecules. Finally, the novel coronavirus 3CLpro protease is selected to verify the effectiveness of the proposed model. Molecular docking results show that the molecules generated by the proposed model have lower average binding energies than the existing active compounds.
Keywords: Drug design · Fine-tuning · Molecular docking · COVID-19
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 423–437, 2022. https://doi.org/10.1007/978-3-031-13829-4_37
1 Introduction
Traditional drug design is a long and expensive process. Computer-aided drug design methods can accelerate it by improving the efficiency of selecting drug molecules for a given target: they can generate a large number of new drug candidates and evaluate molecules stored in a predefined drug library by invoking high-throughput screening (HTS) [1] and virtual screening (VS) [2]. Machine learning has been increasingly used in molecule generation [3], mainly through autoencoders, generative adversarial networks, chemical space exploration based on continuous encoding of molecules, recurrent neural networks, and other methods. Autoencoders are used to encode molecules into a continuous vector space [4]. The generative adversarial network (GAN) is a widely adopted algorithm [5]. Chemical space exploration based on continuous encoding of molecules enables efficient gradient-based search [6]. The recurrent neural network (RNN) has long demonstrated
its ability to predict the next token from a sequence [7]. The gated recurrent unit (GRU) is a recurrent neural network that can be used to process sequential data. It can remember events that occurred earlier in the data sequence, and it is designed to address the long-term dependency and gradient problems of backpropagation, which improves training efficiency [8]. A GRU can be used not only to generate canonical SMILES strings but can also be fine-tuned by transfer learning [9]. Fine-tuning is one of the most commonly used techniques for dealing with data scarcity: it helps the model extract sample features and generate targeted drug molecules when the number of samples is insufficient. In this paper, GRU and fine-tuning strategies are used to design targeted drug generation models. To determine whether a generated drug molecule interacts with the target, molecular docking analysis is performed to assess the degree of binding. AutoDock Vina is a tool for docking and virtual screening of proteins and drug molecules [10]; it provides good docking results, with high average accuracy of binding pattern prediction and fast search speed. Based on our previous work [11–13], this paper proposes a drug design model based on the GRU neural network that combines a fine-tuning strategy with a TopP sampling strategy. In addition, a targeted drug design method (GFTP) based on the COVIDVS drug screening model is presented. A series of verification experiments has been carried out on the 3-chymotrypsin-like protease (3CLpro) [14] of the novel coronavirus. Drug molecules obtained from the Chembl database are used to train the drug design model so that it generates a large number of drug molecules covering a wide chemical space. The fine-tuning strategy is used to train on active compounds against the target, and the TopP sampling strategy is used to generate SMILES with higher feature similarity to known ligands. Furthermore, drug molecules generated after fine-tuning are scored and screened with the COVIDVS drug screening model to select active compounds against the 3CLpro protease. Finally, molecular docking analysis is performed to evaluate the interaction between the generated molecules and the protease.
2 Methods
2.1 Dataset
During training, the input drug molecules are represented as SMILES, a “linear notation” that expresses the structure of a molecule in a single line of text. To obtain enough data, the latest version of the dataset, Chembl29, is obtained from the Chembl database (www.ebi.ac.uk/chembl, version 29). More than 2.08 million SMILES strings are extracted with a script. These SMILES are preprocessed to retain molecules ranging in length from 34 to 74 tokens. Then entries with duplicates, salts, stereochemical information, etc. are removed. Eventually, 937,543 SMILES remain for the experiments. From these selected SMILES, the model can learn the rules by which SMILES strings correspond to actual chemical structures [15]. To generate drug molecules against the 3CLpro protease, the model is fine-tuned with 86 SMILES of drug molecules that are effective against the 3CLpro protease. They comprise a collection of 70 active molecules against the novel coronavirus [16–20], and 20
experimentally screened active compounds in the ReFRAME library [21, 22]. Finally, 86 active compounds are obtained by removing the repeated molecules. An overview of the datasets used in this paper is shown in Table 1.

Table 1. Dataset overview.
| Dataset | Number | Description |
|---|---|---|
| Chembl_cleansed | 937543 | Molecules after Chembl29 pre-treatment, which are used for training model to generate an untargeted generic library of compounds |
| Finetune_3CL | 86 | Active compounds obtained from [16–20] and ReFRAME [21, 22], which are used for fine-tuning to generate drug molecules for 3CLpro protease |
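The cleaning steps described in Sect. 2.1 (length filter of 34 to 74 tokens, removal of duplicates, salts, and stereochemical information) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' script: the tokenization rule, the use of "." to detect salts, and the choice to discard (rather than strip) entries containing the stereo markers "@", "/" and "\" are all assumptions, and duplicates are detected by exact string match without canonicalization, which a real pipeline would typically perform with a cheminformatics toolkit.

```python
import re

# Assumed tokenization: bracket atoms, two-letter halogens and two-digit
# ring closures count as one token; every other character is one token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into tokens under the assumed rule above."""
    return TOKEN_PATTERN.findall(smiles)

def keep(smiles, min_tokens=34, max_tokens=74):
    """Return True if the entry passes the salt, stereo and length filters."""
    if "." in smiles:                       # disconnected parts, e.g. salts
        return False
    if any(ch in smiles for ch in "@/\\"):  # stereochemistry markers
        return False
    return min_tokens <= len(tokenize(smiles)) <= max_tokens

def clean(smiles_list, min_tokens=34, max_tokens=74):
    """Filter a list of SMILES and drop exact duplicates, keeping order."""
    seen, kept = set(), []
    for smiles in smiles_list:
        if smiles not in seen and keep(smiles, min_tokens, max_tokens):
            seen.add(smiles)
            kept.append(smiles)
    return kept

# Tiny demonstration with relaxed length limits so short strings pass.
print(clean(["CCO", "CCO", "CC.Cl", "C[C@H](N)O"], min_tokens=1, max_tokens=10))
```

The default limits match the 34 to 74 token range given in the text; they are parameters so that the same function can be tried on short toy inputs.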
2.2 Targeted Drug Generation Process
This paper proposes a targeted drug design model (GFTP) for generating drug molecules, and the novel coronavirus 3CLpro protease is selected to validate the model. The model uses the gated recurrent unit (GRU) to generate a generic, untargeted chemical library. The fine-tuning strategy and the TopP sampling strategy are adopted to generate targeted drug molecules against 3CLpro. Furthermore, the COVIDVS model [23] is used to screen the generated drug molecules, and molecular docking is used to test the screened drug molecules. In this way, compounds with activity against the given target protein can be found. The four modules of GFTP are described below:
(1) Targeted drug training model based on GRU and a fine-tuning strategy: First, the model is trained on drug molecules obtained from the Chembl database to learn the rules that correspond to the actual chemical structure of drug molecules. Then a fine-tuning strategy is adopted: the saved model data are reloaded and trained on active compounds against the target to learn the properties of the specific drugs.
(2) Molecular generation module based on the TopP sampling strategy: When designing drug molecules, the TopP sampling strategy is used to select the next token of the generated molecule sequence until a new drug molecule is obtained.
(3) COVIDVS drug screening: This module is an anti-novel-coronavirus prediction model that scores the drug molecules generated after fine-tuning. A higher score indicates that the drug molecule is more effective against the virus.
(4) Molecular docking: Molecular docking is a process of molecular recognition during ligand-receptor interaction, which shows the binding effect of a drug molecule on a target protein. In this paper, drug molecules with good scores are docked with 3CLpro, and the docking results are evaluated by their binding free energy and interaction pairs.
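Module (2) relies on TopP sampling to pick the next token. The sketch below shows the standard top-p (nucleus) sampling procedure: rank tokens by probability, keep the smallest set whose cumulative probability reaches p, and sample from that set in proportion to the original probabilities. It is an assumption that the paper's TopP strategy follows this standard form; the function name, the threshold values, and the toy vocabulary are hypothetical and chosen only for illustration.

```python
import random

def top_p_sample(probabilities, p=0.9, rng=random):
    """Sample an index from the smallest top-ranked set whose mass reaches p."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for index in ranked:
        nucleus.append(index)
        cumulative += probabilities[index]
        if cumulative >= p:
            break
    # random.choices renormalizes the weights of the retained tokens.
    weights = [probabilities[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Hypothetical next-token distribution over a toy vocabulary.
vocabulary = ["C", "N", "O", "E"]
next_token_probs = [0.60, 0.25, 0.10, 0.05]

rng = random.Random(0)  # seeded only to make the demonstration repeatable
sampled = [vocabulary[top_p_sample(next_token_probs, p=0.8, rng=rng)]
           for _ in range(10)]
print(sampled)
```

With p = 0.8 in this toy distribution, only "C" and "N" can be drawn, because their combined probability (0.85) already reaches the threshold; low-probability tokens are cut off, which is the mechanism by which nucleus sampling avoids unlikely continuations.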
The process of targeted drug design method is shown in Algorithm 1.
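Algorithm 1 normalizes each SMILES string by adding 'G' at the beginning and 'E' at the end, filling with 'A', and converting the result to a one-hot vector. A minimal sketch of that normalization follows. It treats every character as one token for brevity, which is an assumption (the tokenization used by the authors is not given in this section), and the helper names and the padded length are hypothetical.

```python
def normalize(smiles, max_len):
    """Add start token 'G' and end token 'E', then pad with 'A' to max_len."""
    framed = "G" + smiles + "E"
    if len(framed) > max_len:
        raise ValueError("sequence longer than max_len")
    return framed + "A" * (max_len - len(framed))

def build_vocabulary(sequences):
    """Collect the sorted set of characters that occur in the sequences."""
    return sorted(set("".join(sequences)))

def one_hot(sequence, vocabulary):
    """Encode each character as a 0/1 list with a single 1 at its index."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [[1 if index[ch] == j else 0 for j in range(len(vocabulary))]
            for ch in sequence]

# Hypothetical example: pad a short SMILES string to length 8.
padded = normalize("CCO", max_len=8)
vocab = build_vocabulary([padded])
encoded = one_hot(padded, vocab)
print(padded, vocab, len(encoded), len(encoded[0]))
```

The start and end tokens let a generator begin from 'G' and stop when 'E' is produced, which is how step (5) of Algorithm 1 initializes and terminates a sequence.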
Algorithm 1: GFTP: Targeted Drug Design Method
Inputs:
  Chembl_cleansed: Original training dataset obtained from the Chembl database;
  Finetune_3CL: Fine-tuned dataset for the 3CLpro protease;
  Generate_num: Number of SMILES generated.
Output: Newly generated drug molecules targeting 3CLpro protease.
Begin:
(1) For SMILES in Chembl_cleansed, do:
      Normalization: Add ‘G’ at the beginning, ‘E’ at the end, and fill with ‘A’.
      Convert SMILES to one hot vector Ox.
    End
(2) Train Ox and save model data M.
(3) For SMILES in Finetune_3CL, do:
      Normalization: Add ‘G’ at the beginning, ‘E’ at the end, and fill with ‘A’.
      Convert SMILES to one hot vector Oy.
    End
(4) Load model data M and train Oy.
(5) For i←1 to Generate_num:
      T = ‘G’.
      When Tk is not ‘E’ or len(T)

|  | Number of SMILES | Average binding energy |
|---|---|---|
| 0.8 | 11 | −9.164 |
| 0.7–0.8 | 9 | −8.889 |
| 0.6–0.7 | 38 | −8.708 |
| 0.5–0.6 | 238 | −8.622 |
| 0.4–0.5 | 401 | −7.871 |
Table 5. Top 10 drug molecules in terms of binding energy
| Name | SMILES sequences of drug molecules | Binding energy |
|---|---|---|
| Molecule 1 | O=C1c2ccccc2C(=O)c2c1ccc(C(Cc1ccc(Cl)cc1)NCc1ccccc1)c2O | −11.2 |
| Molecule 2 | CN1CCN(C(=O)c2ccc(NC(=O)c3cc(CC(=O)N4CCN(Cc5ccccc5)CC4)cc(N(C)C)c3)cc2)CC1 | −10.8 |
| Molecule 3 | CC(C)(N)C(=O)NC1CCN(c2cc(NC(=O)c3ccc(C#N)cc3)ccc2Oc2ccccc2C(=O)NCC(F)(F)F)CC1 | −10.8 |
| Molecule 4 | CN1CCN(c2ccc(Nc3nc4c(C(F)(F)F)cccc4n3C3CCN(Cc4ccccc4)CC3)cc2)CC1 | −10.6 |
| Molecule 5 | Cc1ccc(C=C2SC(=N)N(c3ccc(CCNC(=O)CCCCNC(=O)c4cc(-c5ccccc5C)on4)cc3)C2=O)cc1 | −10.6 |
| Molecule 6 | O=C(Cn1c(=O)oc2ccccc21)NCCCC(=O)N1CCN(C2CCN(c3ccc(C(F)(F)F)cc3)c3ccccc32)CC1 | −10.5 |
| Molecule 7 | CC(C)N(CCCOc1ccc(C(=O)NCc2ccc(F)cc2)cc1)C(=O)C1CCN(C(=O)Nc2ccc(C(F)(F)F)cc2)CC1 | −10.5 |
| Molecule 8 | CN(C)C(=O)c1cccc2oc(-c3ccc(NC(=O)c4cccc(S(=O)(=O)N5CCCC5)c4)cc3)nc12 | −10.5 |
| Molecule 9 | CC(COc1ccc(F)cc1)NC(=O)c1ccc(Cl)c(S(=O)(=O)N2CCc3ccccc3C2)c1 | −10.3 |
| Molecule 10 | CC(C)CN(C)CC1=C2C=C(C(=O)N3CCC(NC(=O)c4cc(-c5cccc(C(C)(C)C)c5)on4)CC3)C=C2C=C1 | −10.3 |
An atomic contact between a protein fragment and the ligand is counted as an interaction pair when the interatomic distance is no more than 5 Å [28]. To illustrate the binding of the generated drug molecules to the protease, the two molecules ranked highest by binding energy (molecule 1 and molecule 2) are selected. The docking results are shown in Figs. 6 and 7, and the interaction pairs of the molecules with the 3CLpro protease are shown in Tables 6 and 7. Fig. 6 shows that molecule 1 has six interaction pairs; the NH1 atoms in interaction pair 1 and interaction pair 3 have different coordinates because they belong to amino acid residues on different chains. Molecule 2 in Fig. 7 has five interaction pairs; the OG atoms in interaction pair 1 and interaction pair 3 have different coordinates because they belong to amino acid residues on different chains. The docking results in Tables 6 and 7 indicate that every interaction pair between the drug molecule and the protein fragment satisfies the requirement of an interatomic distance of less than 5 Å.

Fig. 6. Molecular structure of molecule 1 with docking effect

Table 6. Interaction pair and binding distance of molecule 1
| Interaction pair | Label at the interaction pair (coordinates) | Binding distance (Å) |
|---|---|---|
| 1 | O1(XYZ:−6.391,4.530,25.956)-ChainA/NH1(XYZ:−8.005,3.386,23.051) | 3.5 |
| 2 | O1(XYZ:−6.391,4.530,25.956)-O(XYZ:−5.988,6.015,22.754) | 3.6 |
| 3 | O2(XYZ:−3.561,4.702,30.517)-ChainB/NH1(XYZ:−2.767,3.721,33.717) | 3.4 |
| 4 | O3(XYZ:−4.204,2.078,31.076)-ChainB/NH1(XYZ:−2.767,3.721,33.717) | 3.4 |
| 5 | O3(XYZ:−4.204,2.078,31.076)-NZ(XYZ:−6.362,1.371,32.750) | 2.8 |
| 6 | H14(XYZ:−6.439,−0.272,30.692)-OE1(XYZ:−7.905,−1.406,32.550) | 2.6 |
Fig. 7. Molecular structure of molecule 2 with docking effect
Table 7. Interaction pair and binding distance of molecule 2
| Interaction pair | Label at the interaction pair (coordinates) | Binding distance (Å) |
|---|---|---|
| 1 | O2(XYZ:−5.919,−7.042,29.785)-ChainA/OG(XYZ:−8.872,−6.194,29.852) | 3.1 |
| 2 | O2(XYZ:−5.919,−7.042,29.785)-N(XYZ:−8.146,−9.188,30.916) | 3.3 |
| 3 | H10(XYZ:−4.407,−6.019,27.576)-ChainB/OG(XYZ:−2.509,−6.309,27.278) | 1.9 |
| 4 | O1(XYZ:−6.957,0.023,25.614)-NZ(XYZ:−4.259,1.229,24.185) | 3.1 |
| 5 | N1(XYZ:−11.414,0.509,28.085)-O(XYZ:−13.335,2.061,26.318) | 3.0 |
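The binding distances in Tables 6 and 7 can be checked directly from the listed coordinates, since each distance is the Euclidean distance between the two atoms of a pair. The short sketch below recomputes three of the reported pairs and applies the 5 Å criterion; the coordinates and reported distances are copied from the tables, while the function and variable names are hypothetical.

```python
import math

MAX_PAIR_DISTANCE = 5.0  # Å, the interaction-pair criterion cited in the text

def distance(atom_a, atom_b):
    """Euclidean distance between two (x, y, z) coordinates in Å."""
    return math.dist(atom_a, atom_b)

# (description, ligand atom, protein atom, distance reported in the table)
PAIRS = [
    ("molecule 1, pair 1: O1 - ChainA/NH1",
     (-6.391, 4.530, 25.956), (-8.005, 3.386, 23.051), 3.5),
    ("molecule 1, pair 5: O3 - NZ",
     (-4.204, 2.078, 31.076), (-6.362, 1.371, 32.750), 2.8),
    ("molecule 2, pair 3: H10 - ChainB/OG",
     (-4.407, -6.019, 27.576), (-2.509, -6.309, 27.278), 1.9),
]

for name, ligand_atom, protein_atom, reported in PAIRS:
    computed = distance(ligand_atom, protein_atom)
    within = computed <= MAX_PAIR_DISTANCE
    print(f"{name}: computed {computed:.2f} Å, reported {reported} Å, "
          f"within 5 Å: {within}")
```

For these three pairs the recomputed distances agree with the tabulated values to within rounding, and all fall below the 5 Å threshold.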
4 Conclusion
This paper proposes a targeted drug design model (GFTP) based on the GRU neural network. To improve the efficiency and accuracy of the model, a fine-tuning strategy is introduced to train on the active compounds of the target protein. A TopP sampling strategy is used to sample molecular tokens, which generates a large number of new drug molecules that are structurally similar to the training drug molecules and occupy an overlapping chemical space. Combined with the COVIDVS drug screening model, the generative model can accurately generate drugs against the 3CLpro protease target. Furthermore, molecular docking is used to validate the performance of the model on the specific target. At present, the number of known drug molecules for some viruses is insufficient, which affects the model results to some extent when fine-tuning is performed. In future work, more proteases will be selected to validate the effectiveness of the proposed model.
Acknowledgements. The authors thank the members of Machine Learning and Artificial Intelligence Laboratory, School of Computer Science and Technology, Wuhan University of Science and Technology, for their helpful discussion within seminars. This work was supported by National Natural Science Foundation of China (No. 61972299, 61502356).
References 1. Okea, A., Sahin, D., Chen, X., Shang, Y.: High throughput screening for drug discovery and virus detection. Comb. Chem. High Throughput Screen. 25(9), 1518–1533 (2021) 2. Evanthia, L., George, S., Demetrios, V., Zoe, C.: Structure-based virtual screening for drug discovery: principles, applications and recent advances. Current Top. Med. Chem. 14(16), 1923–1938 (2014) 3. Hartenfeller, M., Proschak, E., Andreas Schüller, Schneider, G.: Concept of combinatorial de novo design of drug-like molecules by particle swarm optimization. Chem. Biol. Drug Des. 72(1), 16–26 (2010) 4. Cwla, B., Ys, C., Yd, D., Uy, E.: Asrnn: a recurrent neural network with an attention model for sequence labelling–science direct. Knowl.-Based Syst. 212, 106548 (2021) 5. Goodfellow, I., et al.: Generative adversarial nets. Neural Inf. Process. Syst. 2(14), 2672–2680 (2014) 6. Gómez-Bombarelli, R., et al.: Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018) 7. Wu, J., Hu, C., Wang, Y., Hu, X., Zhu, J.: A hierarchical recurrent neural network for symbolic melody generation. IEEE Trans. Cybern. 50(6), 2749–2757 (2020) 8. Fabio, B., Marcello, F., Riccardo, S.: On the stability properties of gated recurrent units neural networks. Syst. Control Lett. 157 (2021) 9. Pan, X.: De novo molecular design of caspase-6 inhibitors by a gru-based recurrent neural network combined with a transfer learning approach. Pharmaceuticals 14(12), 1249 (2021) 10. Morris, G.M., et al.: AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30(16), 2785–2791 (2009) 11. Lin, X.L., Zhang, X.L.: Prediction of hot regions in PPIs based on improved local community structure detecting. IEEE/ACM Trans. Comput. Biol. Bioinform. 15(5), 1470–1479 (2018) 12. Zhang, X.L., Lin X.L., et al.: Efficiently predicting hot spots in PPIs by combining random forest and synthetic minority over-sampling technique. 
IEEE/ACM Trans. Comput. Biol. Bioinform. 16(3), 774–781 (2019) 13. Lin, X.L., Zhang, X.L., Xu, X.: Efficient classification of Hot spots and Hub protein interfaces by recursive feature elimination and gradient boosting. IEEE/ACM Trans. Comput. Biol. Bioinform. 17(5), 1525–1534 (2020) 14. Behzadipour, Y., Gholampour, M., Pirhadi, S.: Viral 3CLpro as a target for antiviral intervention using milk-derived bioactive peptides. Int. J. Pept. Res. Ther. 27, 2703–2716 (2021) 15. Gupta, A., Müller, A.T., Huisman, B., Fuchs, J.A., Schneider, P., Schneider, G.: Generative recurrent networks for de novo drug design. Mol. Inform. 37(1–2), 1700111 (2018) 16. Jeon, S., Ko, M., Lee, J., Choi, I., Kim, S.: Identification of antiviral drug candidates against sars-cov-2 from fda-approved drugs. Antimicrob. Agents Chemother. 64(7) (2020) 17. Weston, S., et al.: Broad anti-coronavirus activity of food and drug administration-approved drugs against sars-cov-2 in vitro and sars-cov in vivo. J. Virol. 94(21), e01218-e1220 (2020) 18. Touret, F., et al.: In vitro screening of a fda approved chemical library reveals potential inhibitors of sars-cov-2 replication. Sci. Rep. 10(1), 13093 (2020) 19. Fintelman-Rodrigues, N., et al.: Atazanavir, alone or in combination with ritonavir, inhibits sars-cov-2 replication and proinflammatory cytokine production. Antimicrob. Agents Chemother. 64(10), e00825–20 (2020) 20. Yamamoto, N., Matsuyama, S., Hoshino, T., Yamamoto, N.: Nelfinavir inhibits replication of severe acute respiratory syndrome coronavirus 2 in vitro. bio Rxiv (2020). https://doi.org/10. 1101/2020.04.06.026476
KGAT: Predicting Drug-Target Interaction Based on Knowledge Graph Attention Network
Zhenghao Wu1,2,3, Xiaolong Zhang1,2,3(B), and Xiaoli Lin1,2,3
1 College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei, China
{Xiaolong.zhang,linxiaoli}@wust.edu.cn
2 Hubei Key Laboratory of Intelligent Information Processing and Realtime Industrial System, Wuhan, Hubei, China
3 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
Abstract. Prediction of drug-target interactions (DTIs) is an important topic in bioinformatics and plays an important role in drug discovery. Although many machine learning methods have been successfully applied to DTI prediction, traditional approaches mostly either use chemical structure information alone or construct heterogeneous graphs that integrate multiple data sources, and they ignore the interaction relationships among sample entities (e.g., drug-drug pairs). The knowledge graph attention network (KGAT) uses biomedical knowledge bases and entity interaction relationships to construct a knowledge graph and transforms the DTI problem into a link prediction problem for nodes in the knowledge graph. KGAT distinguishes the importance of features by assigning attention weights to neighborhood nodes and learns vector representations by aggregating neighborhood nodes. The feature vectors are then fed into the prediction model for training, and the parameters of the prediction model are updated by gradient descent. The experimental results show the effectiveness of KGAT.
Keywords: Drug-target interaction · Drug discovery · Knowledge graph · Knowledge graph attention network · Gradient descent
1 Introduction
A drug-target interaction (DTI) occurs when a drug associates with a target and causes a change in its behavior. A target is any part of the organism that is associated with a drug and can produce physiological changes in response to it [1]. The most common targets are proteins, enzymes and ion channels. DTI prediction is an important reference for the discovery of new drugs and the avoidance of drug side effects [2]. Traditional biological approaches to drug discovery are time-consuming and costly [3]. Artificial intelligence-based approaches can greatly accelerate the process of drug discovery and narrow the scope of clinical trials, thereby significantly reducing the cost.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 438–450, 2022.
https://doi.org/10.1007/978-3-031-13829-4_38
The previous methods for DTI prediction can be divided into ligand-based methods, docking methods and chemogenomic methods [4]. Chemogenomic approaches are currently the most widely used. They use broad biological datasets to unify drugs and other entities in a common setting to infer possible interactions [5]. Most existing DTI prediction methods focus on integrating multiple data sources to obtain drug features, including similarity features [6–8], adverse effects or side effects [9] and multi-task learning [10]. These methods rely on the assumption that drugs with similar representations will have similar DTIs. At the same time, some computational approaches combine popular embedding methods [11–13] that automatically learn drug representations and then model DTIs through specific operations such as matrix decomposition, random walks and graph neural networks [14]. Despite the good results achieved by the above methods, an overlooked shortcoming is that they treat each DTI as an independent data sample and do not take into account correlations between sample entities (e.g., drug-drug pairs).
Knowledge graphs are essentially knowledge bases for semantic networks, which are simply multi-relational graphs [15]. A knowledge graph (KG) represents drugs, targets and other attribute features as nodes and the relationships between entities as edges connecting the nodes, providing new ideas for DTI prediction. Most existing knowledge graph-based works obtain vector representations of drugs and targets with various graph embedding methods, such as DeepWalk, Node2Vec, RotatE [16], TransE, RDF2Vec [17] and ComplEx [18]. These methods learn latent embedding vectors of nodes directly, but they are limited in accessing the rich neighbor information of knowledge graph entities.
Machine learning has been widely used in bioinformatics, and we have done considerable work in this area, such as LCSD [19], SRF [20] and SVM-RFE [21], which are efficient for predicting protein interactions. Building on our previous work, this paper proposes the knowledge graph attention network (KGAT), which captures the higher-order structure and semantic relationships of the knowledge graph. The framework is made up of three main modules. The first module extracts the DTIs and combines them with the bioinformatics knowledge base to build knowledge graphs rich in structure and attribute features. The second module extracts the higher-order structure and semantic information of drugs and targets with KGAT and obtains feature representations of drugs and targets by aggregating neighborhood representations with attention weights. The third module feeds training samples into two fully connected layers and a SoftMax layer for training while the parameters are updated; the trained model is then used to predict unknown drug-target interactions.
2 Methods
The overall framework of the experiment is shown in Fig. 1 and can be divided into three parts:
• Transform bioinformatic databases and construct knowledge graphs by using Bio2RDF scripts.
• Learn vector representations of drugs and targets in knowledge graphs with attribute nodes and neighborhood structure by using KGAT.
• Divide the sample set into a training set and a test set, train the model using the training set and evaluate the model performance with the test set.
Fig. 1. The workflow of DTI prediction.
KGAT captures higher-order features from the knowledge graph by assigning attention weights to neighborhood nodes and aggregating the neighborhood nodes' embedding representations. The sample feature vectors are then fed into the prediction model. After two fully connected layers and a SoftMax layer, the model outputs a prediction score.
2.1 Construction of Knowledge Graph
We collected the latest raw data from the bioinformatics databases and constructed two knowledge graphs, kg_Drugbank and kg_KEGG, based on Drugbank (https://www.drugbank.com/) [22] and KEGG (https://www.kegg.jp/) [23]. A sample knowledge graph is shown in Fig. 2. Drugs, targets and their attributes are represented as nodes, and edges represent relationships between nodes. The raw data was converted to RDF triples by Bio2RDF (https://github.com/bio2rdf/bio2rdf-scripts/), a tool that links biological entity data according to its specific naming rules [24]. Finally, a SPARQL query was performed to filter out invalid triples (duplicate, redundant or incomplete triples) and select those that were useful for predicting DTI. The specific information of the constructed knowledge graphs is shown in Table 1. There are two knowledge graphs. The datasets extracted from Drugbank are based
Fig. 2. Sample knowledge graph.
on kg_Drugbank for the experiments, and Yamanishi_08 is based on kg_KEGG. The reason for conducting experiments on both knowledge graphs is that the datasets come in two formats: one is based on Drugbank-id and the other on KEGG-id. Samples would be lost if one id format were mapped directly to the other database. As can be seen from Table 1, the knowledge graphs are very rich in entities and triples.
Table 1. Details of the knowledge graph.
KG name       Data source      Entity    Triples
kg_KEGG       KEGG_drug        129910    362870
kg_KEGG       KEGG_gene        429987    1122765
kg_KEGG       KEGG_disease     42026     108565
kg_KEGG       KEGG_pathway     12150     40552
kg_Drugbank   Drugbank 5.1.9   2261199   8806549
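The triple-filtering step of Sect. 2.1 can be illustrated with a small sketch. The paper performs this step with a SPARQL query; the Python below is a swapped-in stand-in that only shows the idea of dropping incomplete and duplicate (head, relation, tail) triples. The function name `filter_triples` and the toy identifiers are hypothetical, and the paper's criteria for "redundant" or "useful" triples are not specified, so they are not modelled here.

```python
from typing import Iterable, List, Optional, Tuple

RawTriple = Tuple[Optional[str], Optional[str], Optional[str]]


def filter_triples(triples: Iterable[RawTriple]) -> List[Tuple[str, str, str]]:
    """Drop incomplete and duplicate triples, keeping first-seen order."""
    seen = set()
    kept = []
    for head, relation, tail in triples:
        # A triple is treated as incomplete if any component is missing or empty.
        if not head or not relation or not tail:
            continue
        key = (head, relation, tail)
        # Exact duplicates are kept only once.
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept


# Toy input with hypothetical identifiers (not taken from Drugbank or KEGG).
raw = [
    ("drug:D1", "targets", "protein:T1"),
    ("drug:D1", "targets", "protein:T1"),  # duplicate
    ("drug:D2", "targets", None),          # incomplete
    ("drug:D2", "hasCategory", "cat:C1"),
]
clean = filter_triples(raw)
```

Keeping the first occurrence and preserving order makes the result deterministic, which is convenient when the filtered triples are later mapped to integer ids.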
2.2 Knowledge Graph Attention Network
After the construction of the knowledge graphs, the next and most important task is to learn vector representations that carry attribute and structural information from the knowledge graph. The model recursively propagates and aggregates the embedding representations of neighborhood nodes, and then adds an attention network to the propagation
process to learn the weights of neighborhood nodes and thereby distinguish the importance of different neighborhood nodes.
Receptive Field. Inspired by convolutional neural networks, our model focuses on the neighborhood of drugs and targets in the dataset when extracting structural similarity features from the knowledge graph; this neighborhood is called the Receptive Field (RF). Unlike a convolutional neural network, our method does not only focus on direct neighbors but extends to n hops (the value of n can be freely adjusted depending on the size of the KG) to extract higher-order structures and semantic relations. In other words, the model can customize the depth of aggregation.
Neighborhood Sampling. Typically, knowledge graphs are non-Euclidean graphs, where the type and number of neighboring nodes around each sample node are not the same. To fit the fixed computational pattern of the neural network and for efficiency, our model selects a fixed number of neighbor nodes instead of using the full neighborhood. In other words, our model can customize the breadth of aggregation to suit the actual situation. The selected neighborhood nodes will contain duplicate items if the neighborhood size $|N_{neigh}(e)| < k$ (where k is our defined aggregation breadth). An example of neighborhood sampling with parameters n = 2 and k = 2 is shown in Fig. 3.
Fig. 3. Aggregation example diagram.
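The receptive field and fixed-size neighborhood sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the adjacency dictionary `KG`, the function names and the fallback for isolated nodes are hypothetical. It samples without replacement when a node has at least k neighbors and with replacement otherwise, which is what produces the duplicate items mentioned in the text.

```python
import random

# Toy adjacency list: entity -> neighboring entities (hypothetical data).
KG = {
    "d1": ["e1", "e2", "t1"],
    "t1": ["d1"],
    "e1": ["d1"],
    "e2": ["d1"],
}


def sample_neighbors(kg, entity, k, rng):
    """Return exactly k neighbors of `entity`."""
    neighbors = kg.get(entity, [])
    if not neighbors:
        # Assumption: an isolated node falls back to itself.
        return [entity] * k
    if len(neighbors) >= k:
        return rng.sample(neighbors, k)               # no duplicates needed
    return [rng.choice(neighbors) for _ in range(k)]  # duplicates when |N(e)| < k


def receptive_field(kg, entity, n_hops, k, rng):
    """Layer 0 is the sample node; layer i holds the sampled i-hop entities."""
    layers = [[entity]]
    for _ in range(n_hops):
        frontier = []
        for node in layers[-1]:
            frontier.extend(sample_neighbors(kg, node, k, rng))
        layers.append(frontier)
    return layers


layers = receptive_field(KG, "t1", n_hops=2, k=2, rng=random.Random(0))
```

With n = 2 and k = 2 the layers have sizes 1, 2 and 4, so every sample node yields tensors of the same shape regardless of its true degree.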
Attention Embedding Propagation. Our model builds upon the architecture of graph neural networks to recursively propagate attribute features and structural similarity along high-order connectivity. Moreover, following the idea of the graph attention network, we generate attention weights for cascade propagation to reveal the importance of this connectivity. Attention embedding propagation is composed of three components: KG propagation, attention network and feature vector aggregation. We describe these first and then discuss how to generalize them to multiple layers.
KG Propagation. An entity can be involved in multiple triples, serving as a bridge that propagates information. Taking $e_1 \xrightarrow{r_2} d_1 \xrightarrow{r_1} t_1$ and $e_2 \xrightarrow{r_3} d_1 \xrightarrow{r_1} t_1$ as an example, drug $d_1$ takes attributes $e_1$ and $e_2$ to enrich its own features, and then $d_1$ participates as a neighbor node in the feature representation of the target $t_1$, which allows $t_1$ to extract higher-order structure features beyond its direct neighbors in the knowledge graph. This aggregation approach implements the cascading propagation of attribute and structure features based on the graph structure rather than being limited to direct neighbor nodes, which allows features to be extracted more fully from the graph structure. We use $N(h) = \{(h, r, t) \mid (h, r, t) \in KG\}$ to denote the set of triples where h is the head entity, r represents the relationship between entities and t represents the tail entity. To describe the first-order connectivity structure of entity h, we compute the neighborhood vector representation $e_{N(h)}$:
$$e_{N(h)} = \sum_{(h,r,t) \in N(h)} \pi(h, r, t)\, e_t \qquad (1)$$
where $\pi(h, r, t)$ controls the decay factor for each propagation on edge (h, r, t), indicating the importance of features being propagated from t to h conditioned on relation r.
Attention Network: $\pi(h, r, t)$ is implemented via a relational attention mechanism, which is formulated as follows:
$$\pi(h, r, t) = (W_r e_t)^{T} \tanh(W_r e_h + e_r) \qquad (2)$$
where we select tanh [25] as the nonlinear activation function and $W_r$ is a parameter matrix. This makes the attention score dependent on the distance between $e_h$ and $e_t$ in the space of relation r: the closer the nodes are in the relation space, the higher the attention weights will be. Thereafter, the attention weight coefficients are normalized by employing the SoftMax function:
$$\pi(h, r, t) = \frac{\exp(\pi(h, r, t))}{\sum_{(h, r', t') \in N(h)} \exp(\pi(h, r', t'))} \qquad (3)$$
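As a worked instance of the relational attention score, its SoftMax normalization and the attention-weighted neighborhood vector, the following NumPy sketch may help. It assumes that each sampled neighbor comes with its own relation embedding and relation matrix, stacked along the first axis; the array shapes, toy sizes and function names are illustrative assumptions rather than details given in the paper.

```python
import numpy as np


def attention_scores(e_h, e_t, e_r, W_r):
    """Unnormalized scores (W_r e_t)^T tanh(W_r e_h + e_r), one per neighbor.

    e_h: (d,) head embedding; e_t: (k, d) tail embeddings;
    e_r: (k, d_r) relation embeddings; W_r: (k, d_r, d) relation matrices.
    """
    proj_t = np.einsum("kij,kj->ki", W_r, e_t)  # W_r e_t for every neighbor
    proj_h = np.einsum("kij,j->ki", W_r, e_h)   # W_r e_h for every neighbor
    return np.sum(proj_t * np.tanh(proj_h + e_r), axis=1)


def softmax(scores):
    """SoftMax normalization, shifted by the maximum for numerical stability."""
    shifted = np.exp(scores - np.max(scores))
    return shifted / shifted.sum()


def neighborhood_vector(e_h, e_t, e_r, W_r):
    """Attention weights and the attention-weighted sum of neighbor embeddings."""
    pi = softmax(attention_scores(e_h, e_t, e_r, W_r))
    return pi, pi @ e_t


rng = np.random.default_rng(0)
k, d, d_r = 3, 4, 5  # arbitrary toy sizes
e_h = rng.normal(size=d)
e_t = rng.normal(size=(k, d))
e_r = rng.normal(size=(k, d_r))
W_r = rng.normal(size=(k, d_r, d))
pi, e_N = neighborhood_vector(e_h, e_t, e_r, W_r)
```

Because the weights are non-negative and sum to one, the neighborhood vector is a convex combination of the neighbor embeddings, so a neighbor with a higher score simply contributes a larger share.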
Thus, the final attention score suggests which neighboring nodes should be given more attention to capture features. When performing forward propagation, attention suggests focusing on part of the data, which can be seen as the interpretation behind the suggestion.
Feature Vector Aggregation: The final step is to aggregate the entity representation $e_h$ and its neighborhood representation $e_{N(h)}$ as the new representation of entity h. More formally, $e_h^{(1)} = f(e_h, e_{N(h)})$, where f represents the aggregation method, and we aggregate vectors by concatenation:
$$aggre_{concat} = \sigma\left(W \cdot \mathrm{concat}(e_h, e_{N(h)}) + b\right) \qquad (4)$$
In summary, the advantage of attention embedding propagation lies in using the knowledge graph to associate drugs, targets and their surrounding attributes, assigning different attention weights to different adjacent nodes through the attention mechanism, and aggregating feature vectors to represent sample nodes.
High-Order Propagation: Our model allows the depth of feature propagation to be customized; higher-order feature propagation recursively performs the first-order feature propagation operation. During the recursion, the l-th operation is:
$$e_h^{(l)} = f\left(e_h^{(l-1)}, e_{N(h)}^{(l-1)}\right) \qquad (5)$$
where $e_h^{(l-1)}$ is the vector representation of entity h before the l-th operation and $e_{N(h)}^{(l-1)}$ is the (l − 1)-hop neighborhood vector of entity h. The neighborhood representation is obtained as follows:
$$e_{N(h)}^{(l-1)} = \sum_{(h,r,t) \in N(h)} \pi(h, r, t)\, e_t^{(l-1)} \qquad (6)$$
where $\pi(h, r, t)$ represents the attention weights and $e_t^{(l-1)}$ represents the vector representation of the (l − 1)-hop neighborhood entity. Algorithm 1 shows the main process of the KGAT algorithm: M represents the set of drug-target pairs, G represents the knowledge graph, N(h) is the RF of sample node h, L is the depth of aggregation, k represents the breadth of aggregation, aggre() is the aggregation algorithm, and $\pi(h, r, t)$ is the attention weight of each neighboring node. The algorithm returns the final vector representations $e_h$ of the samples.
Line 3: Determine the L-hop receptive field RF(h) of sample node h.
Line 4: Obtain the initial embedding of N(h) according to the Embedding function.
Line 7: Obtain the aggregation vector for the (l − 1)-hop receptive field.
Line 8: The vector of sample h after its l-th update is obtained by aggregating its vector after the (l − 1)-th update with the vectors of the (l − 1)-hop receptive field.
By using KGAT, we can capture higher-order neighbor features and structural similarity features from the KG. The algorithm updates sample vector representations by aggregating neighborhood nodes. The attention network assigns attention weights to help the model distinguish the importance of different neighborhood nodes.
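The recursive propagation described in this section can be sketched end to end on a toy graph. This is a simplified illustration, not a reproduction of Algorithm 1: it updates every entity at each layer instead of only the sampled receptive field, uses tanh for the activation in the concatenation aggregator (the paper does not name it), and all data, sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding size, arbitrary for this sketch

# Toy graph: head -> list of (relation, tail), already sampled to k = 2 neighbors.
neighbors = {
    "t1": [("targeted_by", "d1"), ("targeted_by", "d1")],
    "d1": [("has_attr", "e1"), ("has_attr", "e2")],
    "e1": [("attr_of", "d1"), ("attr_of", "d1")],
    "e2": [("attr_of", "d1"), ("attr_of", "d1")],
}
relations = ["targeted_by", "has_attr", "attr_of"]

emb = {e: rng.normal(size=d) for e in neighbors}         # initial embeddings
rel_emb = {r: rng.normal(size=d) for r in relations}     # relation embeddings e_r
W_rel = {r: rng.normal(size=(d, d)) for r in relations}  # relation matrices W_r
W_agg = 0.5 * rng.normal(size=(d, 2 * d))                # aggregator weight W
b_agg = np.zeros(d)                                      # aggregator bias b


def attention(h_vec, rel, t_vec):
    """Relational attention score (W_r e_t)^T tanh(W_r e_h + e_r)."""
    return (W_rel[rel] @ t_vec) @ np.tanh(W_rel[rel] @ h_vec + rel_emb[rel])


def propagate_once(prev):
    """One layer: attention-weighted neighborhood vector, then concat aggregation."""
    new = {}
    for h, nbrs in neighbors.items():
        scores = np.array([attention(prev[h], r, prev[t]) for r, t in nbrs])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        e_N = sum(w * prev[t] for w, (_, t) in zip(weights, nbrs))
        new[h] = np.tanh(W_agg @ np.concatenate([prev[h], e_N]) + b_agg)
    return new


def kgat_embed(initial, depth):
    """Apply the first-order propagation `depth` times."""
    current = initial
    for _ in range(depth):
        current = propagate_once(current)
    return current


final = kgat_embed(emb, depth=2)
```

After two layers the vector of `t1` depends on `e1` and `e2` even though they are not its direct neighbors, which is the higher-order effect the text describes.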
2.3 Model Prediction
Our model obtains feature representations of the samples containing higher-order neighborhood information by KGAT. DTI prediction is treated as a binary classification task where we aggregate all sample entities and their topological neighborhoods to predict the interaction values between drug-target pairs. More precisely, in the KGAT layer, multi-layer aggregation is used to update the vector representations of sample entities so that attribute and structural similarity features in the knowledge graph are captured with attention-based weighting. For each batch of input samples, we concatenate the feature vectors of each drug-target pair, use the concatenated feature vectors as the x-values of the neural network and the true interaction values of the sample pairs as the y-values, and train the neural network model. The parameters are updated via gradient descent [27] on a binary cross-entropy loss function [26]. The loss function measures the error between the true value and the predicted value, and gradient descent searches for the parameters that minimize it. The model fits our problem best under these parameters.
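The prediction stage described above (concatenated drug-target vectors, two fully connected layers, a SoftMax layer, cross-entropy loss and gradient descent) can be sketched in plain NumPy. This is an illustrative sketch under assumptions: the hidden size, ReLU activation, learning rate and synthetic data are not from the paper, and full-batch gradient descent is used for simplicity. With two output classes, SoftMax cross-entropy coincides with binary cross-entropy.

```python
import numpy as np


def init_params(d_in, d_hidden, rng):
    return {
        "W1": rng.normal(scale=0.1, size=(d_in, d_hidden)), "b1": np.zeros(d_hidden),
        "W2": rng.normal(scale=0.1, size=(d_hidden, 2)), "b2": np.zeros(2),
    }


def forward(params, X):
    H = np.maximum(0.0, X @ params["W1"] + params["b1"])  # FC layer 1 + ReLU (assumed)
    logits = H @ params["W2"] + params["b2"]              # FC layer 2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return H, P / P.sum(axis=1, keepdims=True)            # SoftMax probabilities


def cross_entropy(P, y):
    return float(-np.mean(np.log(P[np.arange(len(y)), y] + 1e-12)))


def train(X, y, steps=200, lr=0.1, seed=0):
    params = init_params(X.shape[1], 16, np.random.default_rng(seed))
    losses = []
    for _ in range(steps):
        H, P = forward(params, X)
        losses.append(cross_entropy(P, y))
        dlogits = P.copy()
        dlogits[np.arange(len(y)), y] -= 1.0  # gradient of the loss w.r.t. logits
        dlogits /= len(y)
        dH = dlogits @ params["W2"].T
        dH[H <= 0.0] = 0.0                    # backpropagate through ReLU
        grads = {"W2": H.T @ dlogits, "b2": dlogits.sum(axis=0),
                 "W1": X.T @ dH, "b1": dH.sum(axis=0)}
        for name in params:
            params[name] -= lr * grads[name]  # gradient descent update
    return params, losses


# Synthetic stand-ins for KGAT drug and target vectors and 0/1 interaction labels.
rng = np.random.default_rng(1)
drug_vecs = rng.normal(size=(200, 4))
target_vecs = rng.normal(size=(200, 4))
X = np.concatenate([drug_vecs, target_vecs], axis=1)  # concatenated pair features
y = (X[:, 0] + X[:, 4] > 0).astype(int)
params, losses = train(X, y)
```

The second column of the SoftMax output can be read as the predicted interaction score for each drug-target pair.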
3 Experiments
This section describes the datasets, the evaluation methods, the experimental results and the specific parameter settings used in the experiments. The results of the comparison with previous methods are also given.
3.1 Datasets
Based on the two knowledge graphs, the datasets are also divided into two types. One is the Drugbank-id based group, which includes seven datasets. The other is the widely used standard dataset based on the KEGG-id. The details of the two kinds of datasets are shown in Table 2 and Table 3:
Table 2. The details of Yamanishi_08 datasets.
Dataset        –        Drug   Target   Positive interactions
Yamanishi_08   Enzyme   445    664      2926
Yamanishi_08   IC       210    204      1476
Yamanishi_08   GPCR     223    95       635
Yamanishi_08   NR       54     26       90
Yamanishi_08 and Drugbank_fda are widely used datasets in the field of DTI prediction.
Table 3. The details of Drugbank datasets.
Dataset    Group             Drug   Target   Positive interactions
Drugbank   Drugbank_fda      1482   1408     9881
Drugbank   Approved          2251   2686     8957
Drugbank   Investigational   1958   2558     6660
Drugbank   Experimental      4564   2948     8191
Drugbank   Withdrawn         2252   2689     8961
Drugbank   Illicit           98     94       368
Drugbank   Nutraceutical     96     799      1022
To enrich the experiments, six further datasets were extracted from Drugbank. They are divided by the drugs' "Group" property into Approved, Investigational, Experimental, Withdrawn, Illicit and Nutraceutical drugs. It can be seen from Table 2 and Table 3 that the datasets include only positive samples, as there are no identified negative samples. We use Borderline-SMOTE to synthesize negative samples. The Borderline-SMOTE sampling algorithm divides positive samples into three classes (Safe, Danger and Noise) and finally selects negative samples from the Danger class:
Safe: Fewer than half of the neighbor samples are positive.
Danger: More than half of the neighbor samples are positive.
Noise: All the neighbor samples are positive.
Our model applies the Borderline-SMOTE oversampling algorithm to synthesize negative samples at a ratio of 1:1 with the positive samples on each dataset.
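The three categories listed above can be expressed as a small rule over the labels of a sample's nearest neighbors. The sketch below follows the rule exactly as stated in this section; it is not a full Borderline-SMOTE implementation (no neighbor search or synthetic interpolation). The function name is hypothetical, and treating exactly half as Danger is an assumption, since the text does not cover that case.

```python
def borderline_category(neighbor_labels):
    """Classify a sample from its neighbors' labels (1 = positive, 0 = otherwise)."""
    m = len(neighbor_labels)
    if m == 0:
        raise ValueError("at least one neighbor is required")
    positives = sum(neighbor_labels)
    if positives == m:
        return "noise"   # all neighbors are positive
    if 2 * positives >= m:
        return "danger"  # at least half are positive (the tie is an assumption)
    return "safe"        # fewer than half are positive


categories = [
    borderline_category([1, 1, 1, 1]),
    borderline_category([1, 1, 1, 0]),
    borderline_category([1, 0, 0, 0]),
]
```

Only samples in the Danger category, which lie near the class boundary, would then be used as seeds for synthesizing new samples.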
3.2 Experimental Settings
The experiment generates two knowledge graphs, kg_Drugbank and kg_KEGG, using the Bio2RDF scripts based on two bioinformatic databases. A total of 11 different datasets are evaluated on the two KGs after removing duplicate, redundant and invalid triples and aligning formats. For each dataset, Borderline-SMOTE is used to generate negative samples at a ratio of 1:1 with the positive samples, and then the sample sets are divided into training, validation and test sets at a ratio of 8:1:1. In applying the KGAT model for the neighborhood representation of sample nodes, the experiment sets the depth of neighborhood aggregation to 2 and the breadth of aggregation to k = 12 after experimental comparison. Finally, the most influential features are passed through two fully connected layers and a SoftMax layer to train the neural network.
3.3 Results
We performed experiments on kg_Drugbank for the Drugbank_fda, Approved, Experimental, Investigational, Withdrawn, Illicit and Nutraceutical datasets. In addition, Yamanishi_08 was evaluated on kg_KEGG for comparison with previous methods.
Evaluation: To evaluate the model's performance and to compare it with previous methods, the experiment used 10-fold cross-validation and calculated the averages of AUC (area under the ROC curve), AUPR (area under the precision-recall curve), ACC (accuracy) and F1-score (the harmonic mean of precision and recall) over the 10 folds.
Results on kg_KEGG: Experiments are carried out on the Yamanishi_08 dataset. Since the dataset is represented by KEGG-id, the experiment is conducted on kg_KEGG. A comparison of the results with previous methods is shown in Fig. 4. To illustrate the performance of KGAT, the experiments demonstrate its advantage over previous methods in the field of DTI prediction in recent years.
KGAT is compared with DDR, TRIMODEL, NRLMF, DINLMF and DTIGEMS. Figure 4 illustrates their prediction performance on the Yamanishi_08 dataset. The mean AUPR values of the KGAT model under 10-fold cross-validation are 97.2%, 97.0%, 94.6% and 90.2% on the Enzyme, IC, GPCR and NR datasets, respectively. In addition, the mean AUC values under 10-fold cross-validation are 97.8%, 96.9%, 95.2% and 92.7% on the Enzyme, IC, GPCR and NR datasets. This provides further evidence that our model performs well and is of certain research value.
Results on kg_Drugbank: kg_Drugbank is constructed from the Drugbank database, and experiments are conducted on seven different datasets. The results are presented in Table 4. We applied 10-fold cross-validation on each dataset to reduce the effect of chance. The performance of the model grows with the number of samples because the model can learn more features. The model performs best on Drugbank_fda, reaching an AUC of 98.24%. These datasets further validate the performance of KGAT.
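The four evaluation measures named in Sect. 3.3 can be computed as in the following dependency-free sketch. It is an illustration, not the authors' evaluation code: AUC is computed from pairwise comparisons of positive and negative scores, AUPR is approximated by average precision (one common estimator, assumed here), and the 0.5 decision threshold and toy scores are assumptions.

```python
def accuracy_and_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return acc, f1


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
acc, f1 = accuracy_and_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
aupr = average_precision(y_true, scores)
```

Under 10-fold cross-validation these functions would be applied to each held-out fold and the ten values averaged.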
Fig. 4. Comparison results for KGAT and the previous methods in terms of AUPR using the Yamanishi_08 datasets.
Table 4. DTI prediction results on kg_Drugbank.
Dataset           AUC      AUPR     F1       ACC
Drugbank_fda      0.9824   0.9792   0.9476   0.9472
Approved          0.9743   0.9697   0.9334   0.9234
Investigational   0.9698   0.9660   0.9377   0.9321
Experimental      0.9642   0.9473   0.9484   0.9254
Withdrawn         0.9632   0.9423   0.9243   0.9345
Illicit           0.9032   0.7552   0.7249   0.8193
Nutraceutical     0.9267   0.7730   0.7694   0.8597
4 Conclusion
This paper proposes a new DTI prediction model, KGAT. By selectively aggregating neighborhood nodes with attention weights, KGAT can learn the higher-order topological and semantic features of the knowledge graphs and obtain the feature representations of drugs and targets. The acquired feature representations are then fed into the predictive model for training. The advantage of our model is that it focuses on higher-order structure, so
it can capture more useful features to help generate vector representations of examples. At the same time, the attention network helps to distinguish the importance of different features. There is still room for improvement. For example, the constructed knowledge graph does not contain enough attributes. kg_Drugbank is converted from the Drugbank bioinformatics database, which is rich in drug attribute features but lacking in target attributes, so more useful features are expected to be added to the knowledge graph in the future.
Acknowledgements. The authors thank the members of Machine Learning and Artificial Intelligence Laboratory, School of Computer Science and Technology, Wuhan University of Science and Technology, for their helpful discussion within seminars. This work was supported by National Natural Science Foundation of China (No. 61972299, 61502356).
References
1. Vuignier, K., Schappler, J., Veuthey, J.L., Carrupt, P.A., Martel, S.: Drug–protein binding: a critical review of analytical tools. Anal. Bioanal. Chem. 398(1), 53–66 (2010)
2. Ezzat, A., Wu, M., Li, X., Kwoh, C.K.: Computational prediction of drug-target interactions via ensemble learning. Methods Mol. Biol. (Clifton, N.J.) 1903, 239–254 (2019)
3. Zhao, T., Hu, Y., Valsdottir, L.R., Zang, T., Peng, J.: Identifying drug-target interactions based on graph convolutional network and deep neural network. Brief. Bioinform. 22(2), 2141–2150 (2020)
4. Ezzat, A., Wu, M., Li, X., Kwoh, C.K.: Computational prediction of drug-target interactions using chemogenomic approaches: an empirical survey. Brief. Bioinform. 20(4), 1337–1357 (2019)
5. Yan, G., Wang, X., Chen, Z., Wu, X., Yang, Z.: In-silico adme studies for new drug discovery: from chemical compounds to Chinese herbal medicines. Curr. Drug Metab. 18(999), 535–549 (2017)
6. Vilar, S., Harpaz, R., Uriarte, E., et al.: Drug-drug interaction through molecular structure similarity analysis. J. Am. Med. Inform. Assoc. 19(6), 1066–1074 (2012)
7. Baskaran, S., Panchavarnam, P.: Data integration using through attentive multi-view graph auto-encoders. Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol. 5, 344–349 (2019)
8. Ryu, J.Y., Kim, H.U., Sang, Y.L.: Deep learning improves prediction of drug–drug and drug–food interactions. Proc. Natl. Acad. Sci. U.S.A. 115(18), 4304–4311 (2018)
9. Zhu, J., Liu, Y., Wen, C.: MTMA: multi-task multi-attribute learning for the prediction of adverse drug-drug interaction. Knowl. Based Syst. 199, 105978–105988 (2020)
10. Wang, S., Shan, P., Zhao, Y., Zuo, L.: MLRDA: GanDTI: a multi-task neural network for drug-target interaction prediction. Comput. Biol. Chem. 92(9), 4518–4524 (2021)
11. Zhe, Q., Xuan, L., Wang, Z.J., Yan, L., Li, K.: A system for learning atoms based on long short-term memory recurrent neural networks. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 728–733. IEEE (2018)
12. Xia, L.I., Liu, C., Zhang, Y., Jiang, S.: Cross-lingual semantic sentence similarity modeling based on local and global semantic fusion. J. Chin. Inf. Process., 526–533 (2019)
13. Quan, Z., Wang, Z.J., Le, Y., Yao, B., Li, K., Yin, J.: An efficient framework for sentence similarity modeling. IEEE/ACM Trans. Audio, Speech Lang. Process. 27(4), 853–865 (2019)
14. Chen, J., Gong, Z., Wang, W., Wang, C., Liu, W.: Adversarial caching training: unsupervised inductive network representation learning on large-scale graphs. IEEE Trans. Neural Netw. Learn. Syst. 99, 1–12 (2021)
15. Ehrlinger, L.: Towards a definition of knowledge graphs. In: Joint Proceedings of the Posters and Demos Track of 12th International Conference on Semantic Systems – SEMANTiCS 2016 and 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS16), vol. 48, pp. 1–4 (2016)
16. Sun, Z., Deng, Z.H., Nie, J.Y., Tang, J.: RotatE: knowledge graph embedding by relational rotation in complex space. In: 7th International Conference on Learning Representations, pp. 978–991 (2019)
17. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9981, pp. 498–514. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46523-4_30
18. Trouillon, T., Bouchard, G.: Complex embeddings for simple link prediction. In: International Conference on Machine Learning, vol. 48, pp. 2071–2080 (2016)
19. Lin, X.L., Zhang, X.L.: Prediction of hot regions in PPIs based on improved local community structure detecting. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(5), 1470–1479 (2018)
20. Zhang, X.L., Lin, X.L., et al.: Efficiently predicting hot spots in PPIs by combining random forest and synthetic minority over-sampling technique. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(3), 774–781 (2019)
21. Lin, X.L., Zhang, X.L., Xu, X.: Efficient classification of hot spots and hub protein interfaces by recursive feature elimination and gradient boosting. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(5), 1525–1534 (2020)
22. Wishart, D.S., Knox, C.: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, 668–672 (2006)
23. Kanehisa, M., Miho, F.: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, 353–361 (2017)
24. Shaban-Nejad, A., Baker, C.J.O., Haarslev, V., Butler, G.: The FungalWeb ontology: semantic web challenges in bioinformatics and genomics. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 1063–1066. Springer, Heidelberg (2005). https://doi.org/10.1007/11574620_78
25. Fan, E.: Extended tanh-function method and its applications to nonlinear equations. Phys. Lett. A. 277, 212–218 (2000)
26. Caballero, R., Molina, J.: Cross entropy for multiobjective combinatorial optimization problems with linear relaxations. Eur. J. Oper. Res. 243(2), 362–368 (2015)
27. Burges, C., Shaked, T., Renshaw, E.: Learning to rank using gradient descent. In: International Conference on Machine Learning, pp. 89–96 (2005)
MRLDTI: A Meta-path-Based Representation Learning Model for Drug-Target Interaction Prediction
Bo-Wei Zhao1,2,3, Lun Hu1,2,3(B), Peng-Wei Hu1,2,3, Zhu-Hong You4(B), Xiao-Rui Su1,2,3, Dong-Xu Li1,2,3, Zhan-Heng Chen5, and Ping Zhang6
1 The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
[email protected]
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
4 School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
[email protected]
5 College of Computer Science and Engineering, Shenzhen University, Shenzhen 518060, China
6 Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
Abstract. Predicting the relationships between drugs and targets is a crucial step in the course of drug discovery and development. Computational prediction of associations between drugs and targets greatly enhances the probability of finding new interactions by reducing the cost of in vitro experiments. In this paper, a Meta-path-based Representation Learning model, namely MRLDTI, is proposed to predict unknown drug-target interactions (DTIs). Specifically, we first design a random walk strategy with a meta-path to collect the biological relations of drugs and targets. Then, the representations of drugs and targets are captured by a heterogeneous skip-gram algorithm. Finally, a machine learning classifier is employed by MRLDTI to discover novel DTIs. Experimental results indicate that MRLDTI performs better than several state-of-the-art models under ten-fold cross-validation on the gold standard dataset.
Keywords: Drug repositioning · Computational prediction · Drugs · Targets · DTIs
1 Introduction
Identification of drug-target interactions plays an important role in the course of discovering new drugs and biological mechanisms. Traditional experiments such as in vitro experiments are time-consuming and costly [1, 2]. Therefore, in silico experiments have become another popular way to discover new drugs, as they are low-cost and easy to implement.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 451–459, 2022.
https://doi.org/10.1007/978-3-031-13829-4_39
Computational prediction methods have been widely and successfully used for in silico experiments, owing to rapid advances in computer technology [3–17]. For instance, Luo et al. [18] proposed a network-based model, DTINet, which constructs a heterogeneous network for DTI prediction by aggregating multiple drug-related and target-related networks to obtain the representations of drugs and targets, from which DTIs are predicted. Hu et al. [19] developed a method called DDTF to predict DTIs. DDTF first constructs an unreliable similarity network by nonnegative matrix factorization, then represents the network according to several similarity matrices, and finally predicts latent drug-target pairs. NeoDTI [20] is a nonlinear end-to-end model designed for DTI prediction. It first aggregates the neighbor information of nodes, then uses a network topology-preserving learning approach to learn the features of drugs and targets, and finally completes the DTI prediction task. Recently, representation learning-based methods have been successfully applied to predict relationships between biological molecules; they consider biological features from biological networks more comprehensively and thereby predict DTIs more accurately [21–39]. Chen et al. [40] designed a representation learning-based model that constructs multiple molecular networks to reveal new DTIs. LGDTI [41] learns local and global structure features with different graph representation learning methods, and then feeds the two sets of structural features together into a random forest classifier to predict DTIs. However, the above methods still have two limitations. First, when biological knowledge is lacking, the features of the affected nodes are assigned static values, which reduces the accuracy of the model in DTI prediction.
Second, these methods fail to account for the heterogeneity of biological networks when learning the representations of drugs and targets. To address these issues, we propose a Meta-path-based Representation Learning model, called MRLDTI, for DTI prediction. More concretely, we first design a random walk strategy with a meta-path to obtain the biological relations between drugs and targets [42]. After that, a heterogeneous skip-gram algorithm is used to learn the representations of drugs and targets. Finally, a machine learning classifier is applied to complete the DTI prediction task. The experiments demonstrate that MRLDTI achieves satisfactory performance compared with state-of-the-art methods on the benchmark dataset.
2 Methods
2.1 The Framework of MRLDTI
MRLDTI consists of the following four steps: (i) constructing a drug-related network that includes drug-drug, drug-target, and target-target networks; (ii) designing a random walk with a meta-path to collect paths for drugs and targets; (iii) learning the representations of drugs and targets with the heterogeneous skip-gram algorithm; and (iv) identifying unknown DTIs. Accordingly, we define a graph $G(V, E)$, where $V = \{V^{dr}, V^{tr}\}$ and $E$ represent all nodes and edges of the drug-related and protein-related networks, $E^{dt} \subseteq E$ denotes the edges of the drug-target network, and $|V|$ is the number of drugs and targets.
2.2 Gathering the Meta-path Relations of Drugs and Targets
In a biological network, nodes are of different types and the relationships between them are complex. We therefore need to consider how to better learn the representations of drugs and targets, and thereby improve the accuracy of the DTI prediction model. In particular, a random walk strategy S : drug − target − target − drug is designed to capture the relations between biological nodes [42–44]. Let the matrix $P = \{p_i\}$ denote the paths of all nodes collected by this strategy. Each step of a path is drawn with the following probability:

\[
\operatorname{prob}(v_{i+1} \mid v_i, S) =
\begin{cases}
\dfrac{1}{|N(\phi(v_i))|} & \text{if } [v_i, v_j] \in E,\ \phi(v_i) = t_j \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

where $t \in \{drug, target\}$ represents the type of node $v_i$ and $N(\phi(v_i))$ denotes the set of neighbor nodes of $v_i$ that have type $t_j$. In this way, the biological relation of each node is gathered by Eq. 1 as a path $p_i$ in $P$.

2.3 Learning the Representations of Drugs and Targets
This step learns the representations of drugs and targets. We use the heterogeneous skip-gram model to learn deep representations of drugs and targets by maximizing the following objective:

\[
\arg\max_{r_i} \sum_{v_i \in V} \sum_{v_j \in N(\phi(v_i))} \log \operatorname{prob}(v_j \mid v_i, P)
\tag{2}
\]

where $r_i$ is the representation of node $v_i$ and $\operatorname{prob}(v_j \mid v_i, P)$ denotes the conditional probability of node $v_j$ given node $v_i$ in the path matrix $P$. To optimize Eq. 2, this probability is computed as:

\[
\operatorname{prob}(v_j \mid v_i, P) = \frac{e^{r_j \cdot r_i}}{\sum_{v_k \in N(\phi(v_i))} e^{r_k \cdot r_i}}
\tag{3}
\]
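The meta-path-guided walk of Sect. 2.2 can be illustrated with a minimal Python sketch. The toy network, the node names, and the function names are hypothetical; cycling the meta-path to extend a walk and stopping at a dead end are assumptions borrowed from common metapath2vec-style implementations, not details stated in the paper.

```python
import random

# Hypothetical toy network: node -> type, plus an undirected edge list.
NODE_TYPE = {"d1": "drug", "d2": "drug", "t1": "target", "t2": "target"}
EDGES = [("d1", "t1"), ("t1", "t2"), ("t2", "d2"), ("d1", "d2")]


def build_adjacency(edges):
    """Return node -> list of neighbours for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def metapath_walk(start, metapath, adj, node_type, walk_length, rng):
    """Walk from `start`, choosing uniformly among the neighbours whose type
    is the next type on the meta-path (Eq. 1); stop early at a dead end."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first meta-path type")
    # The first and last types of the meta-path coincide, so it can be cycled.
    cycle = metapath[:-1]
    walk = [start]
    while len(walk) < walk_length:
        next_type = cycle[len(walk) % len(cycle)]
        candidates = [v for v in adj.get(walk[-1], []) if node_type[v] == next_type]
        if not candidates:
            break  # no neighbour of the required type: transition probability 0
        walk.append(rng.choice(candidates))  # probability 1 / |candidates|
    return walk


adj = build_adjacency(EDGES)
S = ["drug", "target", "target", "drug"]
walk = metapath_walk("d1", S, adj, NODE_TYPE, walk_length=7, rng=random.Random(0))
print(walk)
```

In this toy graph every step has exactly one admissible neighbour, so the walk is deterministic; on a real network the collected walks would form the rows of P and be passed to a skip-gram trainer.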
Finally, the matrix $R = \{r_i\} \in \mathbb{R}^{|V| \times d}$ describes the representations of drugs and targets, where $d = 128$ is the representation dimension.

2.4 Discovering Unknown DTIs
In this section, DTI prediction is carried out with a machine learning classifier, namely the Gradient Boosting Decision Tree (GBDT) classifier [45]. In particular, for each drug-target pair in $E^{dt}$, the representations $X = [R(V^{dr}), R(V^{tr})]$ serve as the input features of the classifier. We also tune the GBDT classifier to obtain the best performance in discovering unknown DTIs; the best setting uses 999 trees.
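As a rough illustration of how the classifier input could be assembled, the sketch below concatenates a drug embedding and a target embedding into one feature vector per pair. The helper names and the toy 4-dimensional embeddings are hypothetical, and sampling an equal number of unobserved pairs as negatives is an assumption; the paper does not describe its negative sampling.

```python
import random


def pair_features(emb, drug, target):
    """Concatenate the drug and target embeddings into one feature vector."""
    return list(emb[drug]) + list(emb[target])


def build_dataset(emb, drugs, targets, known_dtis, rng):
    """Known DTIs are positives; an equal number of unobserved drug-target
    pairs is sampled as negatives (an assumption made for this sketch)."""
    positives = sorted(known_dtis)
    unobserved = [(d, t) for d in drugs for t in targets if (d, t) not in known_dtis]
    negatives = rng.sample(unobserved, len(positives))
    X = [pair_features(emb, d, t) for d, t in positives + negatives]
    y = [1] * len(positives) + [0] * len(negatives)
    return X, y


rng = random.Random(0)
drugs, targets = ["d1", "d2"], ["t1", "t2"]
# Toy embeddings with d = 4 instead of the paper's d = 128.
emb = {node: [rng.random() for _ in range(4)] for node in drugs + targets}
known = {("d1", "t1"), ("d2", "t2")}
X, y = build_dataset(emb, drugs, targets, known, rng)
print(len(X), len(X[0]), y)
```

The resulting X and y could then be passed to any gradient-boosting implementation, for example scikit-learn's GradientBoostingClassifier with n_estimators=999; which library the authors used is not stated.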
3 Results
3.1 Evaluation Criteria
As the dataset for MRLDTI, we select the gold standard dataset from [18], called L-Dataset, which contains 1923 DTIs, 10036 drug-drug interactions, and 7363 target-target interactions. We use a ten-fold cross-validation (10-fold CV) strategy to evaluate the performance of MRLDTI. Multiple metrics serve as evaluation indicators, namely Accuracy, Matthews correlation coefficient (MCC), Precision, Recall, F1-score, the Receiver Operating Characteristic (ROC) curve and the area under the ROC curve (AUC), and the Precision-Recall (PR) curve and the area under the PR curve (AUPR).

\[
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
\tag{4}
\]

\[
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\tag{5}
\]

\[
Precision = \frac{TP}{TP + FP}
\tag{6}
\]

\[
Recall = \frac{TP}{TP + FN}
\tag{7}
\]

\[
F1\text{-}score = \frac{2TP}{2TP + FP + FN}
\tag{8}
\]
where TP and TN represent the numbers of true positives and true negatives, and FP and FN represent the numbers of false positives and false negatives, respectively.

3.2 Performance Evaluation of MRLDTI
In this work, we evaluate the performance of MRLDTI on the L-Dataset under 10-fold CV. The experimental results are presented in Table 1 and Fig. 1. MRLDTI achieves high values on all indicators for DTI prediction, and the standard deviations of Accuracy, MCC, Precision, Recall, and F1-score are 0.80%, 1.62%, 1.72%, 1.68%, and 0.79%, respectively. In addition, the proposed MRLDTI model achieves high AUC and AUPR values of 95.64% and 95.75%, respectively. In short, MRLDTI yields excellent performance in discovering unknown DTIs.

Table 1. Performance evaluation of MRLDTI under 10-fold CV.

Fold   Accuracy (%)    MCC (%)         Precision (%)   Recall (%)      F1-score (%)
0      89.09           78.19           89.53           88.60           89.06
1      89.09           78.22           87.94           90.67           89.29
2      89.61           79.32           87.68           92.23           89.90
3      89.87           79.74           89.64           90.10           89.87
4      89.09           78.20           88.27           90.10           89.18
5      87.53           75.07           87.89           86.98           87.43
6      90.36           80.78           91.89           88.54           90.19
7      90.10           80.28           91.85           88.02           89.89
8      89.84           79.78           91.80           87.50           89.60
9      89.06           78.16           90.32           87.50           88.89
Avg.   89.36 ± 0.80    78.77 ± 1.62    89.68 ± 1.72    89.02 ± 1.68    89.33 ± 0.79
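The Avg. row can be checked directly from the per-fold values. The short sketch below recomputes the Accuracy entry; it assumes the "±" term is the sample standard deviation (n − 1 in the denominator), which is the convention that reproduces the reported figure.

```python
from statistics import mean, stdev

# Per-fold Accuracy (%) values taken from Table 1.
accuracy = [89.09, 89.09, 89.61, 89.87, 89.09, 87.53, 90.36, 90.10, 89.84, 89.06]

# statistics.stdev is the sample standard deviation (divides by n - 1).
summary = f"{mean(accuracy):.2f} ± {stdev(accuracy):.2f}"
print(summary)  # 89.36 ± 0.80
```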
Fig. 1. The ROC and PR curves of MRLDTI under 10-fold CV.
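For reference, Eqs. 4–8 can be computed from the four confusion-matrix counts as in the sketch below. The function name and the example counts are hypothetical, and returning 0.0 when a denominator vanishes is a convention chosen here, not one taken from the paper.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, MCC, Precision, Recall and F1-score from confusion counts."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Hypothetical counts for one cross-validation fold.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```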
3.3 Comparison with State-of-the-Art Models
To evaluate the effectiveness of MRLDTI, we compare it on the L-Dataset with five state-of-the-art models proposed for DTI prediction: HNM [46], MSCMF [47], LPMIHN [48], DTINet [18], and NeoDTI [20]. The results are presented in Table 2, from which we make the following observations: (1) Owing to its capacity to adaptively learn the heterogeneous drug-target network, MRLDTI consistently outperforms the baseline models, which demonstrates its high performance. (2) MRLDTI takes the heterogeneity of biological networks into account when capturing the representations of drugs and targets. Overall, MRLDTI shows promise for accurate DTI prediction, and our expectation regarding its effectiveness is confirmed.
Table 2. Comparison with state-of-the-art models on the gold standard dataset.

Models    AUC      AUPR
HNM       0.8374   0.5876
MSCMF     0.8408   0.6048
LPMIHN    0.9025   0.7935
DTINet    0.9137   0.8154
NeoDTI    0.9438   0.8645
MRLDTI    0.9564   0.9675
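The AUC values compared above summarize how well predicted scores rank true interactions above non-interactions. A minimal sketch of that computation follows, using the pairwise (Mann-Whitney) formulation; the function name and the toy labels and scores are hypothetical, and the paper does not say how its AUC values were computed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    receives the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive-negative pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)  # 0.75
```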
4 Conclusion
In this paper, we propose a meta-path-based representation learning model, termed MRLDTI, for DTI prediction. Our analysis of existing DTI prediction models suggests that they rely too heavily on biological knowledge and do not take the heterogeneity of biological networks into account when learning the representations of drugs and targets. MRLDTI first conducts a meta-path-based random walk for drugs and targets on the drug-related network. Second, a heterogeneous skip-gram algorithm is applied to obtain the representations of drugs and targets. Finally, the GBDT classifier is used to discover potential DTIs. Experiments on a real-world dataset demonstrate that MRLDTI achieves the best performance when compared with several state-of-the-art drug repositioning models. In future work, we are interested in exploring more meta-paths and more types of biomolecules to further improve the performance of MRLDTI [49–51].

Acknowledgments. This work was supported in part by the Natural Science Foundation of Xinjiang Uygur Autonomous Region under Grant 2021D01D05, in part by the Pioneer Hundred Talents Program of Chinese Academy of Sciences, in part by the National Natural Science Foundation of China under Grants 61702444, 62002297, and 61902342, in part by the Awardee of the NSFC Excellent Young Scholars Program under Grant 61722212, and in part by the Tianshan Youth - Excellent Youth under Grant 2019Q029.
References
1. Su, X., Hu, L., You, Z., Hu, P., Wang, L., Zhao, B.: A deep learning method for repurposing antiviral drugs against new viruses via multi-view nonnegative matrix factorization and its application to SARS-CoV-2. Brief. Bioinform. 23, bbab526 (2022)
2. Zhao, B.-W., Hu, L., You, Z.-H., Wang, L., Su, X.-R.: HINGRL: predicting drug–disease associations with graph representation learning on heterogeneous information networks. Brief. Bioinform. 23, bbab515 (2022)
3. Hu, L., Chan, K.C.: A density-based clustering approach for identifying overlapping protein complexes with functional preferences. BMC Bioinform. 16, 1–16 (2015)
4. Hu, L., Chan, K.C.: Extracting coevolutionary features from protein sequences for predicting protein-protein interactions. IEEE/ACM Trans. Comput. Biol. Bioinf. 14, 155–166 (2016)
5. You, Z.-H., Li, X., Chan, K.C.: An improved sequence-based prediction protocol for protein-protein interactions using amino acids substitution matrix and rotation forest ensemble classifiers. Neurocomputing 228, 277–282 (2017)
6. Ezzat, A., Wu, M., Li, X.-L., Kwoh, C.-K.: Computational prediction of drug–target interactions using chemogenomic approaches: an empirical survey. Brief. Bioinform. 20, 1337–1357 (2019)
7. Hu, L., Yuan, X., Liu, X., Xiong, S., Luo, X.: Efficiently detecting protein complexes from protein interaction networks via alternating direction method of multipliers. IEEE/ACM Trans. Comput. Biol. Bioinf. 16, 1922–1935 (2018)
8. Guo, Z.-H., Yi, H.-C., You, Z.-H.: Construction and comprehensive analysis of a molecular association network via lncRNA–miRNA–disease–drug–protein graph. Cells 8, 866 (2019)
9. Hu, L., Chan, K.C., Yuan, X., Xiong, S.: A variational Bayesian framework for cluster analysis in a complex network. IEEE Trans. Knowl. Data Eng. 32, 2115–2128 (2019)
10. Wang, L., You, Z.-H., Li, Y.-M., Zheng, K., Huang, Y.-A.: GCNCDA: a new method for predicting circRNA-disease associations based on graph convolutional network algorithm. PLoS Comput. Biol. 16, e1007568 (2020)
11. Yi, H.-C., You, Z.-H., Wang, M.-N., Guo, Z.-H., Wang, Y.-B., Zhou, J.-R.: RPI-SE: a stacking ensemble learning framework for ncRNA-protein interactions prediction using sequence information. BMC Bioinform. 21, 1–10 (2020)
12. Wang, L., You, Z.-H., Huang, D.-S., Zhou, F.: Combining high speed ELM learning with a deep convolutional neural network feature encoding for predicting protein-RNA interactions. IEEE/ACM Trans. Comput. Biol. Bioinf. 17, 972–980 (2018)
13. Hu, L., Wang, X., Huang, Y., Hu, P., You, Z.-H.: A novel network-based algorithm for predicting protein-protein interactions using gene ontology. Front. Microbiol. 2441 (2021)
14. Pan, X., Hu, L., Hu, P., You, Z.-H.: Identifying protein complexes from protein-protein interaction networks based on fuzzy clustering and GO semantic information. IEEE/ACM Trans. Comput. Biol. Bioinform. (2021)
15. Hu, L., Zhang, J., Pan, X., Yan, H., You, Z.-H.: HiSCF: leveraging higher-order structures for clustering analysis in biological networks. Bioinformatics 37, 542–550 (2021)
16. Li, Z., Hu, L., Tang, Z., Zhao, C.: Predicting HIV-1 protease cleavage sites with positive-unlabeled learning. Front. Genet. 12, 658078 (2021)
17. Hu, L., Yang, S., Luo, X., Yuan, H., Sedraoui, K., Zhou, M.: A distributed framework for large-scale protein-protein interaction data analysis and prediction using MapReduce. IEEE/CAA J. Automatica Sinica 9, 160–172 (2021)
18. Luo, Y., et al.: A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun. 8, 1–13 (2017)
19. Hu, P., Huang, Y., You, Z., Li, S., Chan, K.C.C., Leung, H., Hu, L.: Learning from deep representations of multiple networks for predicting drug–target interactions. In: Huang, D.-S., Jo, K.-H., Huang, Z.-K. (eds.) ICIC 2019. LNCS, vol. 11644, pp. 151–161. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26969-2_14
20. Wan, F., Hong, L., Xiao, A., Jiang, T., Zeng, J.: NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug–target interactions. Bioinformatics 35, 104–111 (2019)
21. Hu, A.L., Chan, K.C.: Utilizing both topological and attribute information for protein complex identification in PPI networks. IEEE/ACM Trans. Comput. Biol. Bioinf. 10, 780–792 (2013)
22. Hu, L., Chan, K.C.: Fuzzy clustering in a complex network based on content relevance and link structures. IEEE Trans. Fuzzy Syst. 24, 456–470 (2015)
23. Xing, W., et al.: A gene–phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach. Bioinformatics 34, i386–i394 (2018)
24. Hu, P., Huang, Y.-A., Chan, K.C., You, Z.-H.: Learning multimodal networks from heterogeneous data for prediction of lncRNA-miRNA interactions. IEEE/ACM Trans. Comput. Biol. Bioinform. (2019)
25. Jiang, H.-J., You, Z.-H., Hu, L., Guo, Z.-H., Ji, B.-Y., Wong, L.: A highly efficient biomolecular network representation model for predicting drug-disease associations. In: Huang, D.-S., Premaratne, P. (eds.) ICIC 2020. LNCS (LNAI), vol. 12465, pp. 271–279. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60796-8_23
26. Hu, L., Yang, S., Luo, X., Zhou, M.: An algorithm of inductively identifying clusters from attributed graphs. IEEE Trans. Big Data (2020)
27. Wang, L., You, Z.-H., Li, J.-Q., Huang, Y.-A.: IMS-CDA: prediction of CircRNA-disease associations from the integration of multisource similarity information with deep stacked autoencoder model. IEEE Trans. Cybern. 51, 5522–5531 (2020)
28. Guo, Z.-H., et al.: MeSHHeading2vec: a new method for representing MeSH headings as vectors based on graph embedding algorithm. Brief. Bioinform. 22, 2085–2095 (2021)
29. Wang, L., You, Z.-H., Huang, Y.-A., Huang, D.-S., Chan, K.C.: An efficient approach based on multi-sources information to predict circRNA–disease associations using deep convolutional neural network. Bioinformatics 36, 4038–4046 (2020)
30. Hu, L., Zhang, J., Pan, X., Luo, X., Yuan, H.: An effective link-based clustering algorithm for detecting overlapping protein complexes in protein-protein interaction networks. IEEE Trans. Netw. Sci. Eng. 8, 3275–3289 (2021)
31. Wang, L., You, Z.-H., Zhou, X., Yan, X., Li, H.-Y., Huang, Y.-A.: NMFCDA: combining randomization-based neural network with non-negative matrix factorization for predicting CircRNA-disease association. Appl. Soft Comput. 110, 107629 (2021)
32. Hu, L., Zhao, B.-W., Yang, S., Luo, X., Zhou, M.: Predicting large-scale protein-protein interactions by extracting coevolutionary patterns with MapReduce paradigm. In: 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 939–944. IEEE (2021)
33. Zhao, B.-W., You, Z.-H., Hu, L., Wong, L., Ji, B.-Y., Zhang, P.: A multi-graph deep learning model for predicting drug-disease associations. In: Huang, D.-S., Jo, K.-H., Li, J., Gribova, V., Premaratne, P. (eds.) ICIC 2021. LNCS (LNAI), vol. 12838, pp. 580–590. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-84532-2_52
34. Hu, L., Wang, X., Huang, Y.-A., Hu, P., You, Z.-H.: A survey on computational models for predicting protein–protein interactions. Brief. Bioinform. 22, bbab036 (2021)
35. Su, X.-R., You, Z.-H., Yi, H.-C., Zhao, B.-W.: Detection of drug-drug interactions through knowledge graph integrating multi-attention with capsule network. In: Huang, D.-S., Jo, K.-H., Li, J., Gribova, V., Premaratne, P. (eds.) ICIC 2021. LNCS (LNAI), vol. 12838, pp. 423–432. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-84532-2_38
36. Zhao, B.-W., You, Z.-H., Wong, L., Zhang, P., Li, H.-Y., Wang, L.: MGRL: predicting drug-disease associations based on multi-graph representation learning. Front. Genet. 12, 491 (2021)
37. Su, X., et al.: SANE: a sequence combined attentive network embedding model for COVID-19 drug repositioning. Appl. Soft Comput. 111, 107831 (2021)
38. Zhang, H.-Y., Wang, L., You, Z.-H., Hu, L., Zhao, B.-W., Li, Z.-W., Li, Y.-M.: iGRLCDA: identifying circRNA–disease association based on graph representation learning. Brief. Bioinform. 23, bbac083 (2022)
39. Su, X.-R., Huang, D.-S., Wang, L., Wong, L., Ji, B.-Y., Zhao, B.-W.: Biomedical knowledge graph embedding with capsule network for multi-label drug-drug interaction prediction. IEEE Trans. Knowl. Data Eng. (2022)
40. Chen, Z.-H., You, Z.-H., Guo, Z.-H., Yi, H.-C., Luo, G.-X., Wang, Y.-B.: Prediction of drug-target interactions from multi-molecular network based on deep walk embedding model. Front. Bioeng. Biotechnol. 8, 338 (2020)
41. Zhao, B.-W., et al.: A novel method to predict drug-target interactions based on large-scale graph representation learning. Cancers 13, 2111 (2021)
42. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: Scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144 (2017)
43. Wang, L., You, Z.-H., Huang, D.-S., Li, J.-Q.: MGRCDA: metagraph recommendation method for predicting CircRNA-disease association. IEEE Trans. Cybern. (2021)
44. Li, J., Wang, J., Lv, H., Zhang, Z., Wang, Z.: IMCHGAN: inductive matrix completion with heterogeneous graph attention networks for drug-target interactions prediction. IEEE/ACM Trans. Comput. Biol. Bioinf. 19, 655–665 (2021)
45. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 1189–1232 (2001)
46. Wang, W., Yang, S., Zhang, X., Li, J.: Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics 30, 2923–2930 (2014)
47. Zheng, X., Ding, H., Mamitsuka, H., Zhu, S.: Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1025–1033 (2013)
48. Yan, X.-Y., Zhang, S.-W., Zhang, S.-Y.: Prediction of drug–target interaction by label propagation with mutual interaction information derived from heterogeneous network. Mol. BioSyst. 12, 520–531 (2016)
49. Guo, Z.-H., You, Z.-H., Yi, H.-C.: Integrative construction and analysis of molecular association network in human cells by fusing node attribute and behavior information. Mol. Therapy-Nucl. Acids 19, 498–506 (2020)
50. Yi, H.-C., You, Z.-H., Guo, Z.-H., Huang, D.-S., Chan, K.C.: Learning representation of molecules in association network for predicting intermolecular associations. IEEE/ACM Trans. Comput. Biol. Bioinform. (2020)
51. Guo, Z.-H., You, Z.-H., Wang, Y.-B., Huang, D.-S., Yi, H.-C., Chen, Z.-H.: Bioentity2vec: attribute-and behavior-driven representation for predicting multi-type relationships between bioentities. GigaScience 9, giaa032 (2020)
Single Image Dehazing Based on Generative Adversarial Networks
Mengyun Wu1,2 and Bo Li1,2(B)
1 Wuhan University of Science and Technology, Wuhan 430070, China [email protected]
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan University of Science and Technology, Wuhan 430070, China
Abstract. Optical imaging technologies generally produce inferior images in foggy conditions, with low contrast, significant color distortion, and severe loss of image detail, which limits real-world applications. Most defogging methods use atmospheric scattering models to estimate atmospheric light; however, the complexity of the atmospheric environment prevents them from obtaining the parameters reliably. Learning-based strategies yield more natural results but still suffer from color distortion and incomplete defogging. This paper presents a single-image defogging model based on a generative adversarial network that removes haze directly in an end-to-end manner and employs a multi-scale structure. An attention mechanism is also introduced, together with an edge loss function, to improve haze removal efficiency. The network outperforms some classic defogging models on some datasets.
Keywords: Single image dehazing · Multi-scale · Fusion · Attention mechanism
1 Introduction
Single-image dehazing is an active topic in image processing that has attracted much attention. To remove haze from an image, several models have been presented, such as the histogram equalization algorithm [1–3], the Retinex algorithm [4–6], the haze removal technique based on the wavelet transform [7, 8], and the haze removal approach of [9]. These approaches can improve image visibility by boosting contrast, but the fog still remains. Other methods use atmospheric scattering models to generate fog-free images and then restore them. The dark channel prior [10, 11] has been proposed to improve defogging efficiency [12, 13] and quality [14, 15], but its effect is reduced when the scene target is similar to the ambient light, as is the case for the sky. Choi [16] also employed a complicated fusion scheme to recover fogged images. In recent years, generative adversarial networks (GANs) have become popular for learning end-to-end mappings. Dehazing-GAN is a GAN-based defogging technique suggested by Zhu et al. [17]. Du et al. [18] introduced a defogging approach based on a generative adversarial network that directly learns the nonlinear mapping between fogged and clear images. Li et al. [19] proposed GAN-based encoder and decoder networks for image defogging that incorporate VGG features and an L1-regularized gradient. Dong et al. [20] presented a Fused Discriminator Generative Adversarial Network (FD-GAN) for defogging, which employs image frequency information as a prior. In this paper, we propose an end-to-end Multi-Scale Fusion guided Generative Adversarial Network (MSF-GAN) to eliminate fog in images, where the multi-scale network helps to reduce spatial dependencies between features, avoid gradient explosion, and simplify computation.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 460–469, 2022. https://doi.org/10.1007/978-3-031-13829-4_40
2 GAN
GANs have been widely used in single-image deblurring. A GAN contains a generator and a discriminator. The generator receives random noise as input and generates new data samples by learning the probability distribution of the real data. The discriminator is a classifier that decides whether a sample is real or was produced by the generator. Training a GAN is a game between the generator and the discriminator, which ends when the discriminator can no longer distinguish the images produced by the generator from real images. The objective function is defined as:

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]
\tag{1}
\]

Compared to GAN, CGAN additionally supplies a condition $y$ to both the generator and the discriminator. The objective function of CGAN is shown in Eq. (2):

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]
\tag{2}
\]

where the random noise $z$ and the condition $y$ are combined in the hidden layer of the generator, which then outputs the conditional generated data $G(z \mid y)$. The real data $x$ and the condition $y$ are combined as the input of the discriminator $D$.
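To make the value function in Eq. (1) concrete, the sketch below estimates V(D, G) from discriminator outputs on a batch of real and generated samples. The function name and the probability values are hypothetical; the expectations are replaced by batch means.

```python
from math import log


def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G): mean log D(x) over real samples plus
    mean log(1 - D(G(z))) over generated samples."""
    real_term = sum(log(p) for p in d_real) / len(d_real)
    fake_term = sum(log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator scores real high and fake low.
confident = gan_value([0.9, 0.9], [0.1, 0.1])
print(undecided, confident)
```

The undecided case evaluates to −log 4, the value of the game at its equilibrium, while a confident discriminator pushes V closer to 0, which is why D maximizes V and G minimizes it.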
3 MSF-GAN
In this paper, we propose the MSF-GAN network. The generator uses a hybrid multi-scale encoder consisting of convolutional networks in three separate scale spaces. The network creates the multi-scale space by preprocessing the blurred images with a Gaussian pyramid. Feature fusion is then performed between the scale spaces, and a spatial attention mechanism is introduced to reconstruct features according to the haze concentration in each region. The decoder is composed of single-size convolutional layers, equal in number to those of the encoder. The discriminator takes the generated image and the real image as input and discriminates between them. Figure 1 shows the architecture of the proposed MSF-GAN.
[Figure 1 diagram: input image I with its 1/2-scale and 1/4-scale versions; feature fusion; dehazed output I_Dehaze; discriminator output 1 or 0. Legend: convolution module, spatial attention, channel attention, pixel-wise summation, skip connection, decoder, deconvolution, discriminator.]
Fig. 1. MSF-GAN based on multi-scale and feature fusion
3.1 Parameters
The blurred image is down-sampled through a Gaussian pyramid to 1/2 and 1/4 of its original size. The images at the three scales are used as input to the network, which adopts three consecutive convolutional modules, each with two convolutional layers, to extract features. The resulting feature maps are zero-padded to preserve the image size; the parameters are listed in Table 1. The decoder is also composed of three deconvolution modules, which correspond to the encoder part. Its parameters are set in Table 2.

Table 1. Parameter settings for the convolutional layers of the convolution model

Convolution model    Structure       Number of kernels   Kernel size
Layer1    Conv1      Conv+Relu       16                  3
          Conv2      Conv+BN+Relu    32                  3
Layer2    Conv3      Conv+BN+Relu    64                  3
          Conv4      Conv+BN+Relu    64                  3
Layer3    Conv5      Conv+BN+Relu    128                 3
          Conv6      Conv+BN+Relu    256                 3
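As a small worked example of what Table 1 implies, the sketch below counts the trainable convolution parameters of one encoder branch. It assumes the six layers are chained in order, that the input has 3 channels (the 256 × 256 × 3 generator input mentioned in Sect. 4.2), and that each convolution has a bias term; batch-normalization parameters are ignored. None of these details are stated explicitly in the paper.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of trainable parameters of a 2-D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Output channels per layer, from Table 1; 3 input channels are assumed (RGB).
input_channels = 3
kernels = [16, 32, 64, 64, 128, 256]
kernel_size = 3

per_layer = []
in_ch = input_channels
for out_ch in kernels:
    per_layer.append(conv_params(in_ch, out_ch, kernel_size))
    in_ch = out_ch  # the next layer consumes this layer's output

total = sum(per_layer)
print(per_layer)  # [448, 4640, 18496, 36928, 73856, 295168]
print(total)      # 429536
```

Under these assumptions the last layer alone accounts for roughly two thirds of the branch's convolution parameters.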
Table 2. Decoder parameter settings

Convolution model    Structure       Number of kernels   Kernel size
Layer1    Conv1      Conv+Relu       256                 3
          Conv2      Conv+BN+Relu    128                 3
Layer2    Conv3      Conv+BN+Relu    64                  3
          Conv4      Conv+BN+Relu    64                  3
Layer3    Conv5      Conv+BN+Relu    32                  3
          Conv6      Conv+BN+Relu    16                  3
3.2 Attention Mechanism
The channel attention module converts a feature map of size N × C × H × W into one of size N × C × 1 × 1, from which the channel weights are obtained, where N is the batch size, C the number of channels, and H and W the height and width of the feature map. In the channel attention module, both global maximum pooling and global average pooling are used to extract the features of each channel (Fig. 2). The two pooling operations are defined as follows:

\[
y_{max} = \max(x_{i,j})
\tag{3}
\]

\[
y_{avg} = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{i,j}
\tag{4}
\]

where $x_{i,j}$ is the feature point in a particular channel.
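The two pooling operations in Eqs. (3) and (4) reduce each H × W channel to a single number. A minimal sketch on nested lists is shown below; the function names and the toy feature values are hypothetical, and how the two descriptors are combined into channel weights is not covered here.

```python
def global_max_pool(channel):
    """Largest value in an H x W channel (Eq. 3)."""
    return max(max(row) for row in channel)


def global_avg_pool(channel):
    """Mean of all values in an H x W channel (Eq. 4)."""
    height, width = len(channel), len(channel[0])
    return sum(sum(row) for row in channel) / (height * width)


def channel_descriptors(features):
    """Map C channels of size H x W to C pairs (y_max, y_avg)."""
    return [(global_max_pool(ch), global_avg_pool(ch)) for ch in features]


# Toy feature map with C = 2 channels of size 2 x 2.
features = [
    [[1.0, 2.0], [3.0, 6.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
descriptors = channel_descriptors(features)
print(descriptors)  # [(6.0, 3.0), (4.0, 1.0)]
```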
Fig. 2. Channel attention mechanism
The channel attention module has two primary purposes: first, to separate channels containing more fog-related features from channels containing fewer fog-related features; second, to reduce the computational expense by assigning small weights to the channels containing fewer fog-related features. Since the spatial attention mechanism focuses on the haze correlation of spatial regions, an average pooling operation is used to extract the background information of the features. A 1 × 1 convolution is then applied to weight the original features and obtain the new ones. The spatial attention mechanism is illustrated in Fig. 3. The spatial attention network weights are calculated as follows:

\[
f_n = (\vartheta + 1) f_n
\tag{5}
\]
Fig. 3. Spatial attention mechanism
where $\vartheta$ is the weighting factor. Thus

\[
f_n^k = (\vartheta + 1)\left[ f_n^{k+1}\uparrow + f_{n-1}^k \right]
\tag{6}
\]
where $f_n^k$ is the output feature map of the n-th layer in the k-th scale space and ↑ is the deconvolution operation.

3.3 Discriminators
The discriminator network determines whether its input is a real image or a generator output. The discriminator network is presented in Fig. 4.
Fig. 4. The network structure of the conditional discriminator
As illustrated in Fig. 4, I is the input fogged image, J represents the real image, and J* denotes the generated image. The pairs [I, J] and [I, J*] are the inputs of the discriminator. The parameters of the discriminator are listed in Table 3.

Table 3. Discriminator parameters

Discriminant   Structure             Number of convolution kernels   Size of the convolution kernel
Conv1          Conv+LeakyRelu        32                              3
Conv2          Conv+BN+LeakyRelu     64                              3
Conv3          Conv+BN+LeakyRelu     128                             3
Conv4          Conv+LeakyRelu        256                             3
Conv5          Conv+Sigmoid          1                               3
3.4 Loss Function
The L1 loss maintains the pixel similarity between the input and output images, which helps bring the output image closer to the target one. The L1 loss is therefore adopted, as stated below:

\[
L_1 = \frac{1}{N} \sum_{i=1}^{N} \| G(I_i) - J_i \|_1
\tag{7}
\]

where N indicates the number of pixels in an image, $G(I_i)$ denotes the i-th pixel of the defogged image, and $J_i$ represents the i-th pixel of the standard image. Since high-frequency information is easily lost in deep feature learning, an edge loss is used to improve the detail and to preserve the high-frequency features. The edge loss is defined as:

\[
L_{edge} = \sqrt{(lap(J) - lap(J^*))^2 + \varepsilon^2}
\tag{8}
\]

In Eq. (8), J is the ground truth image and J* the generated image; lap(J) and lap(J*) denote the edge maps of J and J*, respectively. The overall loss function of the generator is then expressed as:

\[
L_G = \lambda_1 L_1 + \lambda_2 L_{edge} + \lambda_3 L_{CGAN}
\tag{9}
\]

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the coefficients of the respective loss terms. After extensive experimental tests, it is determined that $\lambda_1 = 1$, $\lambda_2 = 1$, and $\lambda_3 = 2$. The final goal is to optimize the generator G by minimizing Eq. (9).
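The three loss terms can be sketched on small single-channel images stored as nested lists. Several details below are assumptions rather than facts from the paper: the edge map is taken to be a 4-neighbour Laplacian evaluated on interior pixels only, the edge loss is written in the Charbonnier form (square root of the squared difference plus ε²) and averaged over pixels, and ε = 1e-3 is an arbitrary choice. The adversarial term is passed in as a number because it depends on the discriminator.

```python
from math import sqrt


def l1_loss(pred, target):
    """Mean absolute difference over all pixels (Eq. 7)."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def laplacian(img):
    """4-neighbour Laplacian on interior pixels (border handling is assumed)."""
    h, w = len(img), len(img[0])
    return [
        [img[i - 1][j] + img[i + 1][j] + img[i][j - 1] + img[i][j + 1] - 4 * img[i][j]
         for j in range(1, w - 1)]
        for i in range(1, h - 1)
    ]


def edge_loss(pred, target, eps=1e-3):
    """Charbonnier-style distance between the two Laplacian edge maps."""
    lp, lt = laplacian(pred), laplacian(target)
    terms = [sqrt((a - b) ** 2 + eps ** 2) for ra, rb in zip(lp, lt) for a, b in zip(ra, rb)]
    return sum(terms) / len(terms)


def generator_loss(l1, edge, cgan, lambdas=(1.0, 1.0, 2.0)):
    """Weighted sum of the three terms (Eq. 9), with the weights of Sect. 3.4."""
    return lambdas[0] * l1 + lambdas[1] * edge + lambdas[2] * cgan


target = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
pred = [[0.0] * 3 for _ in range(3)]
l1 = l1_loss(pred, target)
edge = edge_loss(pred, target)
print(l1, edge, generator_loss(l1, edge, cgan=0.7))
```

In this toy case a single missing bright pixel costs only 1/9 in L1 but about 4 in the edge term, which illustrates why an edge loss penalizes lost high-frequency detail more strongly.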
4 Experiments
4.1 Data Sets
The method is evaluated on two datasets: RESIDE (2018a) and NTIRE'18 (Ancuti and Timofte 2018). The training datasets are SOTS and NTIRE'18, where SOTS contains 600 indoor and 600 outdoor synthetic blurred images; the images of SOTS and NTIRE'18 are resized to 256 × 256. Two evaluation criteria, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are used for comparison.

4.2 Results
The input and output of the generator have size 256 × 256 × 3, while the input of the discriminator has size 256 × 256 × 6 and its output has size 64 × 64 × 1. We set $\lambda_1 = 100$, $\lambda_2 = 100$, and $\lambda_3 = 0.1$ in the experiments. The scale space has 3 levels: the original scale, 1/2 scale, and 1/4 scale. The numbers of feature fusion and channel attention mechanisms are set to 10 and 3, respectively. In addition, MSF-GAN is trained with a learning rate of 2 × 10^-4. With the above settings, the network is iterated 30 times. In total, our model runs 319,000 iterations on a Windows 10 system with an NVIDIA 1080Ti GPU, which takes 105 h. Table 4 shows part of the defogging results of the proposed MSF-GAN on the SOTS synthetic dataset. Figure 5 displays part of the outdoor defogging results on the NTIRE18 dataset. It can be concluded that the images defogged by MSF-GAN closely match the standard images in both thick and thin fog regions, which further indicates that MSF-GAN is robust on real fogged images.

Table 4. PSNR and SSIM of MSF-GAN on the SOTS and NTIRE18 datasets

Dataset      SOTS indoor/outdoor   NTIRE18 indoor/outdoor
PSNR (dB)    37.63/35.85           32.56/31.18
SSIM         0.9996/0.9967         0.9943/0.9958
Fig. 5. Dehazing results of MSF-GAN in SOTS part
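For readers who want to reproduce the PSNR figures reported in this section, a minimal sketch of the standard PSNR definition is given here. It assumes 8-bit images (peak value 255) stored as nested lists; the function name and image representation are illustrative and are not taken from the paper. SSIM is more involved and is not sketched.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [(a - b) ** 2
                      for row_a, row_b in zip(reference, restored)
                      for a, b in zip(row_a, row_b)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means a smaller mean squared error relative to the peak intensity, which is why the defogged image closest to the standard image scores highest.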
We also compare networks that contain different numbers of spatial attention mechanisms at a single spatial scale, as well as networks with and without the channel attention mechanism. Models 1–3 are networks with 1–3 spatial attention mechanisms added on each spatial scale. Model 2 is the original MSF-GAN. Model 4 is the network with the channel attention mechanism removed (Table 5).

Table 5. Comparison of improvements on the public synthetic dataset and before improvements

| Models | PSNR (dB) | SSIM | Processing time (s) |
|---|---|---|---|
| Model 1 | 35.82 | 0.9976 | 0.034 |
| Model 2 (MSF-GAN) | 36.41 | 0.9981 | 0.048 |
| Model 3 | 36.45 | 0.9983 | 0.057 |
| Model 4 | 29.37 | 0.9735 | 0.031 |
It can be seen that model 2 and model 3 are better than model 1. However, the running time of model 3 (0.057 s) is higher than that of model 2 (0.048 s). Model 4 has the shortest running time, but its PSNR and SSIM are lower than those of model 2. Model 2 is therefore used in this paper.

The proposed method is compared with other models, namely DCP, AOD, CAP, GFN, and DCPDN, as shown in Fig. 6. To some extent, DCP and CAP cause color distortion and blurring in certain blurred regions, while CAP cannot effectively defog using a linear model. AOD leaves haze residues in some areas. GFN and DCPDN use CNN learning schemes; GFN can remove haze to some extent but introduces excessive contrast; DCPDN tends to overestimate atmospheric light, leading to localized overbrightness in certain blurred parts.
Fig. 6. Visual and quantitative comparison of the synthetic dataset.
The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) values in Table 6 indicate that the proposed method has significant advantages over the other methods.
Table 6. Comparison with other methods on the public combined test dataset

| Method | RESIDE (PSNR/SSIM) | TERI 18 (PSNR/SSIM) |
|---|---|---|
| DCP | 18.3421/0.8675 | 14.2139/0.6239 |
| AOD | 19.7841/0.8667 | 15.2506/0.6120 |
| CAP | 18.6520/0.8549 | 15.0241/0.6452 |
| DCPDN | 17.9772/0.8523 | 13.8660/0.6862 |
| GFN | 21.8109/0.8610 | 15.0669/0.6183 |
| MSF-GAN (ours) | 24.0501/0.9203 | 19.1562/0.8022 |
5 Conclusion

The MSF-GAN presented in this paper efficiently extracts features at different scale spaces for fusion and reconstructs features through multiple spatial attention networks and a channel attention network. The final experimental results show that the proposed method improves on several state-of-the-art techniques in haze removal.
K-Nearest Neighbor Based Local Distribution Alignment

Yang Tian1,2 and Bo Li1,2(B)

1 Wuhan University of Science and Technology, Wuhan 430070, China
[email protected]
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan University of Science and Technology, Wuhan 430070, China
Abstract. When massive labeled data are unavailable, domain adaptation can transfer knowledge from a different source domain. Many recent domain adaptation methods focus merely on extracting domain-invariant features by minimizing the global distribution divergence between domains, while ignoring local distribution alignment. To solve the problem of incomplete distribution alignment, we propose a K-nearest neighbors based local distribution alignment method, in which Maximum Mean Discrepancy (MMD) is adopted as the transfer loss function to reduce the global distribution discrepancy, and a K-nearest neighbors based transfer loss function is devised to minimize the local distribution difference so that the source and target domains are aligned completely. Through local distribution alignment, the proposed method helps avoid the incomplete alignment of MMD and improves recognition accuracy. Experiments on multiple transfer learning datasets show that the proposed method performs comparatively well.

Keywords: Domain adaptation · Maximum mean discrepancy (MMD) · Local distribution alignment
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 470–480, 2022. https://doi.org/10.1007/978-3-031-13829-4_41

1 Introduction

The explosive growth of data has gradually improved the performance of neural networks, but it also raises a problem: most of the data lack labels. Neural networks can indeed recognize the labeled training set well. However, they perform poorly on target data classification when there is a discrepancy between the training set (source domain) and the test set (target domain), known as domain shift [1], which leads to low recognition accuracy of the trained model. Reducing the difference between the source and target domains is the key to solving domain shift, and domain adaptation [2] is usually adopted for this purpose. Domain adaptation mainly deals with the situation where the source domain and the target domain have different marginal distributions. It can reduce the impact of domain shift and improve the performance of the model on the target domain.

In recent years, with the development of neural networks, some studies have shown that the image features extracted by neural networks offer high precision [3, 4], and deep domain adaptation has gradually become a hot issue. Most deep domain adaptation methods concentrate on the distribution discrepancy metric, extracting domain-invariant features that minimize the domain discrepancy in order to solve the domain shift problem. For example, the widely used MMD defines the distance between the means of the source and target domain data as the distribution discrepancy metric. Conditional maximum mean discrepancy (CMMD) [27] argues that the calculation of MMD is too coarse and raises MMD to the class level. Multi-kernel maximum mean discrepancy (MK-MMD) [16] upgrades MMD from a single kernel to multiple kernels, which enhances its representation ability. Weighted maximum mean discrepancy (WMMD) [28] adds the concept of probability distribution alignment to the computation of MMD, which increases the calculation accuracy. Soft weighted maximum mean discrepancy (SWMMD) [29] introduces pseudo-labels on the basis of WMMD, which further improves the recognition rate. Although these MMD-based methods perform well, they still do not address the local distribution alignment problem.

Besides, many researchers have proposed transfer models. The earliest deep network adaptation method can be traced back to the domain adaptive neural network (DaNN) [5] proposed in 2014, whose network structure consists of only two layers: a feature layer and a classifier layer. DaNN adds an MMD adaptation layer after the feature projection to calculate the difference between the source and target domains. However, the network is too shallow to extract discriminative features, so DaNN cannot solve the domain shift problem well. Tzeng et al. presented deep domain confusion (DDC) [6], where AlexNet [7] is adopted to replace the simple network structure in DaNN.
Later, some researchers used the deep residual network (ResNet) [9] and the generative adversarial network (GAN) [10] as transfer networks with an MMD-based transfer loss for domain adaptation tasks [11–15]. Apart from the change of transfer model, MMD has been widely used in the above methods as the loss metric. However, MMD has its own limitation: it only calculates the global mean difference and ignores the local distribution, which is useful for deep learning.

This paper proposes a KNN-based local distribution alignment to solve the above-mentioned problem by exploiting the similarity of same-class data in the source and target domains, which helps to improve the performance of neural networks on the target domain. We also evaluate our approach on benchmark domain adaptation tasks such as digit recognition (MNIST, USPS) and object recognition (COIL20, Office, Caltech) against some state-of-the-art deep models.
2 Related Work

2.1 Domain Adaptation

The source domain with labels is defined as $D_S = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, and the target domain without labels is $D_T = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, where S and T denote data from the source and target domains respectively, x and y denote a data point and its label, and n is the number of data points. A model for domain adaptation usually consists of two parts: a feature extraction module and a classifier module. The feature extraction module is defined as $\phi(\cdot)$, which transforms the data x into the data feature $\phi(x)$. The classifier is denoted $clf(\cdot)$, and the data feature $\phi(x)$ is transformed into the label $clf(\phi(x)) = y$.

The similarity between the source and target domain data is that their category information is exactly the same, i.e. $y_i^s = y_i^t$ $(i = 1, 2, 3, \ldots, n_{class})$, which means that the source and target domain data share the same class information. The difference between the source and target domain data is that they have different data distributions, $P(x^s) \neq P(x^t)$, which is the reason that the classifier $clf_S(\cdot)$ trained on the source domain data performs poorly on the target domain. The goal of domain adaptation is then to transfer the model $(\phi(\cdot), clf(\cdot))$ from $clf(\phi(x^s)) = y^s$ to $clf(\phi(x^t)) = y^t$, so that the model works well on the target domain.

2.2 MMD

MMD is adopted as the global transfer loss function. MMD is used to align the data distributions of the source domain and the target domain. In deep domain adaptation, MMD is usually introduced as a loss function that constrains the outputs of the feature extraction module. It is formulated as:

$$MMD(D_S, D_T) = \left\| \frac{1}{n_s}\sum_{i}^{n_s} \phi(x_i^s) - \frac{1}{n_t}\sum_{j}^{n_t} \phi(x_j^t) \right\|_2 \qquad (1)$$

When the same feature extraction module is used by the source domain and the target domain, MMD aims to obtain a feature extraction module $\phi(\cdot)$ by minimizing $MMD(D_S, D_T)$:

$$\min_{\phi} L_{MMD} = MMD(D_S, D_T) \qquad (2)$$

When the optimal solution of $MMD(D_S, D_T)$ is obtained, the feature extraction module $\phi(\cdot)$ extracts features from both the source and the target data with matching means, $E(\phi(x^s)) \approx E(\phi(x^t))$. As a result, it can be considered that the source domain feature distribution $D_S = \{(\phi(x_i^s), y_i^s)\}_{i=1}^{n_s}$ and the target domain feature distribution $D_T = \{(\phi(x_i^t), y_i^t)\}_{i=1}^{n_t}$ share the same data distribution. The feature extraction module $\phi(\cdot)$ maps the source domain data to the source domain features $\phi(x^s)$, and the source domain label information $y^s$ is then used to train a classifier $clf(\cdot)$. Since the source domain and the target domain now share the same feature distribution, the $clf(\cdot)$ trained on the source domain data can also perform well on the target domain data. Accordingly, the global transformation of the model has been accomplished.
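The mean-matching form of MMD in Eq. (1) can be sketched in a few lines of pure Python. This is a minimal illustration that assumes the features have already been extracted and are stored as lists of equal-length vectors; the function names are hypothetical, and kernel-based MMD variants are not covered.

```python
import math


def mean_vector(features):
    """Component-wise mean of a list of feature vectors."""
    count = len(features)
    dim = len(features[0])
    return [sum(vec[d] for vec in features) / count for d in range(dim)]


def linear_mmd(source_feats, target_feats):
    """Euclidean distance between the source and target feature means."""
    mu_s = mean_vector(source_feats)
    mu_t = mean_vector(target_feats)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))


# Two point sets with identical means but different layouts give a value of 0,
# which is the kind of purely global alignment this measure captures.
same_mean = linear_mmd([[-1.0, 0.0], [1.0, 0.0]], [[0.0, -1.0], [0.0, 1.0]])
```

Because only the two means enter the computation, the measure is blind to how individual points are arranged around those means.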
3 Methods

MMD only calculates the global mean difference to align the global mean points, which ignores the alignment of the local data points. As shown in Fig. 1, the blue and the orange observations represent the source and the target domain data, respectively, and the black lines represent the classifier. Although the transferred model can identify a part of the target domain data, the difference in data distribution between the source and the target domain means that the source domain data are correctly identified while the model does not work well on the target domain data. When MMD is applied as the global transfer loss function, the distance between the mean centers of the source and target domains can be minimized. After that, the point groups in which the source and the target domain data overlap can be accurately identified by the classifier, but the rest of the target domain data will be misidentified.
Fig. 1. Local distribution alignment problem.
To solve the problem, a simple scheme is proposed to reduce the distance between observations in the source domain and their nearest neighbors in the target domain, such as the distances A1, B1, B2, B3, C2 in Fig. 2.
Fig. 2. Local data point distance analysis.
Motivated by this idea, this paper proposes the K-nearest neighbor based local distribution alignment for domain adaptation, where the KNN condition is used to determine the nearest neighbor relation. Data with different labels usually have different characteristics, so the features extracted from data with different labels differ and form different data clusters. For any observation in a data cluster belonging to the source domain, its nearest neighbors can be obtained by KNN and its category information is known from the source domain. According to the KNN criterion, its category information is related to its K nearest neighbors. Thus, a neighbor relationship array can be constructed from the distances between the observation and its K nearest neighbors:

$$D_{Neighbors}(x, y) = [d_1(x, x_1), d_2(x, x_2), d_3(x, x_3), \ldots, d_k(x, x_k)] \qquad (3)$$

In this paper, the Euclidean distance is selected to construct the neighbor relationship array, so d is formulated as follows:

$$d_i = \left\| \phi(x^s) - \phi(x_i) \right\|_2 \quad (i = 1, 2, 3, \ldots, k) \qquad (4)$$
Since some of the nearest neighbors come from the target domain, category information is not available for all of the nearest neighbors; it is only available for the neighbors from the source domain. So, the mean of the distances to the observations with the same category label in the neighbor relationship array is calculated as:

$$d_{same}(x, y) = \frac{1}{n}\sum_{i} \omega_i d_i \qquad (5)$$

where $\omega_i$ is determined by the category information of the nearest neighbors. When the observation x selected from the source domain and its nearest neighbor have the same category information, the value of $\omega_i$ is set to 1, otherwise it is 0:

$$\omega_i = \begin{cases} 1, & \text{only if } y = y_i \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

Similarly, the mean of the distances to the observations with a different category in the neighbor relationship array can be determined:

$$d_{diff}(x, y) = \frac{1}{n}\sum_{i} w_i d_i \qquad (7)$$

where $w_i$ is the opposite of $\omega_i$:

$$w_i = \begin{cases} 1, & \text{only if } y \neq y_i \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

Besides, there are data points from the target domain in the neighbor relationship array. Source domain data points that form non-overlapping data clusters should have the same category information if they are in the same neighbor relationship array. Similarly, the data points from the target domain should also share the same category information with the source domain data points in the calculated neighbor relationship array. In this case, the mean of the distances to the observations from the target domain in the neighbor relationship array is calculated as:

$$d_{target}(x, y) = \frac{1}{n}\sum_{i} \tau_i d_i \qquad (9)$$

The value of $\tau_i$ depends on whether the point in the neighbor relationship array comes from the target domain:

$$\tau_i = \begin{cases} 1, & \text{only if } x_i \in D_T \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$
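The neighbor relationship array and the three mean distances of Eqs. (3)–(10) can be sketched for a single source observation as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are plain lists, each candidate neighbor is a hypothetical `(feature, label, is_target)` tuple with `label=None` for target points, the normalizer n is taken to be the number of retrieved neighbors, and target points contribute only to the target term because their labels are unknown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors, as in Eq. (4)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def local_distances(x_feat, x_label, candidates, k):
    """Return (d_same, d_diff, d_target) for one source observation.

    candidates: list of (feature, label, is_target) tuples (illustrative layout).
    """
    # Neighbor relationship array: the k closest candidates (Eq. (3)).
    neighbors = sorted(candidates, key=lambda c: euclidean(x_feat, c[0]))[:k]
    n = len(neighbors)  # assumed normalizer
    d_same = d_diff = d_target = 0.0
    for feat, label, is_target in neighbors:
        d = euclidean(x_feat, feat)
        if is_target:
            d_target += d   # tau_i = 1, Eq. (10)
        elif label == x_label:
            d_same += d     # omega_i = 1, Eq. (6)
        else:
            d_diff += d     # w_i = 1, Eq. (8)
    return d_same / n, d_diff / n, d_target / n


# Usage: one same-class source neighbor, one different-class source neighbor,
# one target neighbor, and a far-away point excluded by k = 3.
example = local_distances(
    [0.0, 0.0], 0,
    [([1.0, 0.0], 0, False), ([0.0, 2.0], 1, False),
     ([3.0, 0.0], None, True), ([10.0, 10.0], 0, False)],
    k=3,
)
```

In a training setting these quantities would be computed on the network's feature outputs for each mini-batch; a brute-force sort is used here only for clarity.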
In summary, to reduce the number of data points with a different category in the neighbor relationship array, the optimization goal is to maximize $d_{diff}(x, y)$:

$$\max_{\phi}\; d_{diff}(x_i, y) \quad (x_i \in D_S) \qquad (11)$$

With $d_{same}(x, y)$ fixed, the number of same-category source domain data points in the neighbor relationship array stays close to K:

$$\text{keep}\; d_{same}(x, y) \qquad (12)$$

Since the feature extraction module gradually learns the unique features of the different categories during training with the labeled source domain data, the features extracted from the source domain data automatically form their own data clusters. However, either expanding or shrinking $d_{same}(x, y)$ has a negative impact on the size of the source domain data clusters, so $d_{same}(x, y)$ is left unchanged in order to keep the number of same-category data points. To align the target domain data distribution with the source domain data distribution completely, $d_{target}(x, y)$ needs to be minimized:

$$\min_{\phi}\; d_{target}(x_i, y) \quad (x_i \in D_S) \qquad (13)$$

Finally, the local transfer loss is summarized as

$$L_{local} = \sum_{i}^{n}\left( d_{target}(x_i, y) + \alpha\, d_{diff}(x_i, y) \right) = \frac{1}{n}\sum_{i}^{n}\left( \sum_{j} \tau_j \left\| \phi(x_i^s) - \phi(x_j) \right\|_2 + \alpha \sum_{k} w_k \left\| \phi(x_i^s) - \phi(x_k) \right\|_2 \right) \qquad (14)$$

where $\alpha > 0$ is a hyper-parameter. Equation (14) can be used as the transfer loss together with MMD. So, the whole domain adaptation model is formulated as

$$\min_{\phi}\; \gamma L_{MMD} + \beta L_{local} + L_{clf} \qquad (15)$$

where $L_{clf}$ is the classification loss, and $\gamma > 0$ and $\beta > 0$ are the trade-off parameters between the transfer losses and the classification loss.
4 Experiment and Analysis

The proposed method has been tested on three datasets: MNIST&USPS [30, 31], COIL20 [34] and Office-Caltech10 [32, 33]. The relevant information of the datasets is shown in Table 1.

Table 1. Datasets.

| Dataset | Category | Samples |
|---|---|---|
| MNIST + USPS | Digit recognition | 79298 |
| COIL20 | Object recognition | 1440 |
| Office-caltech10 | Object recognition | 4110 |

In order to validate the performance of the proposed method, this paper conducts comparison experiments with DDC [6], domain-adversarial neural network (DANN) [17], deep adaptation network (DAN) [15], adversarial discriminative domain adaptation (ADDA) [12], coupled generative adversarial network (CoGAN) [14], joint geometrical and statistical alignment (JGSA) [22], domain invariant and class discriminative (DICD) [23], domain-irrelevant class clustering (DICE) [24], linear discriminant analysis via pseudo labels (LDAPL) [25], correlation alignment (CORAL) [8], manifold embedded distribution alignment (MEDA) [26], deep reconstruction-classification network (DRCN) [35], and asymmetric tri-training domain adaptation (ATDA) [36].

In the experiment on MNIST&USPS, the proposed method adopts LeNet as the basic model. It defines the output of the fully connected layer before the classifier layer as the output feature, and calculates the transfer losses between the source and target domains on this feature. The experiments are carried out 10 times, and the mean of the corresponding best accuracies is taken as the final experimental result. During training, the batch size is set to 64, the initial learning rate is 10^-3, and Adam is chosen as the optimization method. The performance of our method on MNIST & USPS is shown in Table 2. Table 2 shows that, compared with the method using only MMD, the average accuracy of the proposed method increases by 13.22%. On the U → M domain adaptation task, the proposed method has obvious advantages, such as 4.36% more than ADDA. In terms of overall accuracy, the proposed method performs better than ADDA and CoGAN.

Table 2. Accuracy (%) on digit recognition tasks for unsupervised domain adaptation. ("–" means that we did not find the result on the task).

| Method | M→U | U→M | Average |
|---|---|---|---|
| LeNet | 53.61 | 63.11 | 58.63 |
| MMD only | 80.61 | 72.20 | 76.18 |
| DDC | – | – | 79.1 |
| DANN | 78.1 | 75 | 76.55 |
| DRCN | 91.8 | 73.7 | 82.75 |
| ATDA | 93.17 | 84.14 | 88.65 |
| ADDA | 89.2 | 89.5 | 89.3 |
| CoGAN | 90.4 | 88.4 | 89.4 |
| The proposed | 84.95 | 94.06 | 89.50 |
For experiments on the COIL20 dataset, the proposed method uses a convolutional neural network with two convolutional layers and three fully connected layers. The method defines the output of the penultimate fully connected layer as the output feature and adds the transfer loss to this layer. During training, the batch size, learning rate, optimization method and other settings are the same as in the experiment on MNIST & USPS. The performance on the COIL20 dataset is shown in Table 3, which shows that the proposed method obtains good results on COIL20 and transfers well from the source domain to the target domain.

Table 3. Accuracy (%) on COIL20 for unsupervised domain adaptation

| Method | C1 → C2 | C2 → C1 | Average |
|---|---|---|---|
| JGSA | 95.4 | 93.9 | 94.7 |
| DICD | 95.69 | 93.33 | 94.51 |
| DICE | 99.7 | 99.7 | 99.7 |
| LDAPL | 99.44 | 100 | 99.72 |
| The proposed | 100 | 100 | 100 |
In the experiments on the Office31 dataset, the proposed method adopts ResNet50. The method defines the output of the layer before the classifier layer as the final feature layer and adds a transfer loss to it. The batch size, the initial learning rate, and the optimization method are set to 32, 2 × 10^-5 and SGD, respectively. The momentum parameter is 0.9, and the weight decay parameter is 5 × 10^-4. The proposed method uses a 10-category dataset extracted from Office31 for the experiments, and the other methods use either this dataset or a dataset composed of Decaf6 features extracted from the 10-category dataset. The experimental results are shown in Table 4, from which it can be found that the proposed method is better than the other methods on the two tasks A → D and A → W. Compared with these methods on the six domain adaptation tasks, the average accuracy of the proposed method improves by 2.45% over DDC and 1.75% over JGSA, and is superior to that of MEDA.

Table 4. Accuracy (%) on Office-Caltech10 for unsupervised domain adaptation

| Method | A→W | A→D | W→A | W→D | D→A | D→W | Average |
|---|---|---|---|---|---|---|---|
| CORAL | 74.6 | 84.1 | 81.2 | 100 | 85.5 | 99.3 | 87.45 |
| DICD | 81.4 | 83.4 | 89.7 | 100 | 92.1 | 99 | 90.93 |
| DDC | 86.1 | 89.0 | 84.9 | 100 | 89.5 | 98.2 | 91.28 |
| JGSA | 81.0 | 88.5 | 90.7 | 100 | 92.0 | 99.7 | 91.98 |
| DICE | 86.4 | 89.8 | 90.7 | 100 | 92.5 | 99 | 93.06 |
| DAN | 91.8 | 91.7 | 92.1 | 100 | 90 | 98.5 | 94.01 |
| MEDA | 88.1 | 88.1 | 99.4 | 99.4 | 93.2 | 97.6 | 94.30 |
| The proposed | 93.89 | 94.26 | 92.17 | 99.36 | 90.81 | 96.94 | 94.57 |
Finally, the proposed method performs better on the MNIST&USPS dataset, the COIL20 dataset, and Office-Caltech10. T-SNE is used to visualize the U → M task, as shown in Fig. 3, where the red parts represent the source domain data and the blue parts represent the target domain data. The left image is the visualization obtained when only the global transfer loss function MMD is used as the transfer loss, and the right image is the result of the proposed method. Figure 3 suggests that using only MMD as the global transfer function can align the mean centers, but the target domain data cluster is not completely aligned with the source domain data cluster, which leads to a poor recognition rate on the target domain. Moreover, the proposed method can effectively overcome this problem, as validated by Fig. 3.
Fig. 3. T-SNE (red represents source domain data, blue represents target domain data)
5 Conclusion

In this paper, a K-nearest neighbors based local distribution alignment method is proposed, which addresses the poor local distribution alignment that results from using only MMD as the global transfer loss function. The proposed method effectively aligns the overall data distributions of the source domain and the target domain, and improves the target domain recognition rate. Certainly, there is still a small part of the data that cannot be aligned correctly. Future work will therefore consider how to align that part of the target domain data to the right source domain data cluster.
References

1. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 213–226. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_16
2. Pan, S., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009)
3. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J.: Decaf: A deep convolutional activation feature for generic visual recognition. In: 31st International Conference on Machine Learning on Proceedings, pp. 647–655. JMLR, New York (2014)
4. Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 512–519. IEEE, Piscataway (2014)
5. Ghifary, M., Kleijn, W.B., Zhang, M.: Domain adaptive neural networks for object recognition. In: Pham, D.-N., Park, S.-B. (eds.) PRICAI 2014. LNCS (LNAI), vol. 8862, pp. 898–904. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-13560-1_76
6. Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: maximizing for domain invariance. arXiv:1412.3474. Accessed 17 March 2020
7. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Annual Conference on Neural Information Processing System, pp. 1097–1105. MIT, Cambridge (2012)
8. Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: AAAI Conference on Artificial Intelligence, pp. 2058–2065. AAAI, Palo Alto (2016)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. IEEE, Piscataway (2016)
10. Goodfellow, I.J., et al.: Generative adversarial nets. In: Proc. of the 27th International Conference on Neural Information Processing Systems, pp. 2672–2680. MIT, Cambridge (2014)
11. Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: Annual Conference on Neural Information Processing System, pp. 343–351. MIT, Cambridge (2016)
12. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2962–2971. IEEE, Piscataway (2017)
13. Shen, J., Qu, Y., Zhang, W., Yu, Y.: Wasserstein distance guided representation learning for domain adaptation. arXiv:1707.01217v2. Accessed 21 Nov 2017
14. Liu, M., Tuzel, O.: Coupled generative adversarial networks. In: Annual Conference on Neural Information Processing System, pp. 469–477. MIT, Cambridge (2016)
15. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning, pp. 97–105. JMLR, New York (2015)
16. Gretton, A., et al.: Optimal kernel choice for large-scale two-sample tests. In: Annual Conference on Neural Information Processing System, pp. 1205–1213. MIT, Cambridge (2012)
17. Ganin, Y., et al.: Domain-adversarial training of neural networks. In: Csurka, G. (ed.) Domain Adaptation in Computer Vision Applications. ACVPR, pp. 189–209. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58347-1_10
18. Zellinger, W., Grubinger, T., Lughofer, E., Natschläger, T., Saminger-Platz, S.: Central moment discrepancy (CMD) for domain-invariant representation learning. arXiv:1702.08811v3. Accessed 2 May 2019
19. Long, M., Wang, J., Jordan, M.: Deep transfer learning with joint adaptation networks. In: International Conference on Machine Learning, pp. 2208–2217. JMLR, New York (2017)
20. Wang, W., et al.: A Unified Joint Maximum Mean Discrepancy for Domain Adaptation. arXiv:2101.09979. Accessed 25 Jan 2021
21. Zhang, L., Wang, S., Huang, G.B., Zuo, W., Yang, J., Zhang, D.: Manifold Criterion Guided Transfer Learning via Intermediate Domain Generation. arXiv:1903.10211v1. Accessed 25 Mar 2019
22. Zhang, J., Li, W., Ogunbona, P.: Joint geometrical and statistical alignment for visual domain adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1859–1867. IEEE, Piscataway (2017)
23. Long, M., Wang, J., Ding, G., Sun, J., Yu, P.S.: Transfer joint matching for unsupervised domain adaptation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1410–1417. IEEE, Piscataway (2014)
24. Liang, J., He, R., Sun, Z., Tan, T.: Aggregating randomized clustering-promoting invariant projections for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 41(5), 1027–1042 (2019)
25. Sanodiya, R., Yao, L.: Linear discriminant analysis via pseudo labels: a unified framework for visual domain adaptation. IEEE Access 8, 200073–200090 (2020)
26. Wang, J., Chen, Y., Yu, H., Wenije, F.: Visual domain adaptation with manifold embedded distribution alignment. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 402–410. ACM, New York (2018)
27. Long, M., Wang, J., Ding, G., Sun, J., Yu, P.S.: Transfer feature learning with joint distribution adaptation. In: 2013 IEEE International Conference on Computer Vision, pp. 2200–2207. IEEE, Piscataway (2013)
28. Yan, H., Ding, Y., Li, P., Wang, Q., Xu, Y., Zuo, W.: Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 945–954. IEEE, Piscataway (2017)
29. Ren, C., Ge, P., Yang, P., Yan, S.: Learning target-domain-specific classifier for partial domain adaptation. IEEE Trans. Neural Netw. Learning Syst. 32(5), 1989–2001 (2021)
30. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
31. Hull, J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)
32. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 213–226. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_16
33. Griffin, G., Holub, A., Perona, P.: Caltech-256 Object Category Dataset. http://resolver.caltech.edu/CaltechAUTHORS:CNS-TR-2007-001. Accessed 19 April 2007
34. Nene, S.A., Nayar, S.K., Murase, H.: Columbia object image library (coil-100) (1996)
35. Ghifary, M., Bastiaan Kleijn, W., Zhang, M., Balduzzi, D., Li, W.: Deep reconstruction-classification networks for unsupervised domain adaptation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 597–613. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_36
36. Saito, K., Ushiku, Y., Harada, T.: Asymmetric tri-training for unsupervised domain adaptation. In: International Conference on Machine Learning, pp. 2988–2997. ACM, New York (2017)
A Video Anomaly Detection Method Based on Sequence Recognition
Lei Yang1,2 and Xiaolong Zhang1,2(B)
1 Hubei Key Laboratory of Intelligent Information Processing and Real-Time Industrial Systems, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
[email protected]
2 College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
Abstract. To address the issues caused by the diversity and complexity of anomalous events in video, a sequence recognition-based video anomaly detection method is proposed that better extracts feature vectors and carves anomaly boundaries to improve detection accuracy. To avoid annotating the anomalous clips in the training videos, which is very time consuming, weakly labeled videos are used for training. Each normal or abnormal video is treated as a whole sequence to accomplish the anomaly detection task. First, the frame rate and size of the video are unified, and the video is decomposed into RGB frames and optical flow frames. Next, the I3D model is used as a feature extraction model to extract the feature vector of the video from the decomposed video frames. The feature vector is then input into the Bi-LSTM model to learn the contextual information between video clips, and the hidden layer states of the Bi-LSTM model are encoded as the feature of the video. Finally, the encoded feature vectors are input to SR-Net to obtain the video anomaly score, which is used to detect whether there are anomalous events in the video. Theoretical analysis and experimental results show that the proposed method achieves a video-level AUC of 85.5% and a false alarm rate of 0.8% on the UCF-Crime dataset. Compared with previous algorithms for anomalous event detection based on multi-instance learning, the proposed detection algorithm has a higher accuracy and a lower false alarm rate. The proposed method provides better detection results in video anomalous event detection tasks.
Keywords: Videos · Anomaly detection · Sequence recognition · Computer vision · Monitor
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 481–495, 2022. https://doi.org/10.1007/978-3-031-13829-4_42

1 Introduction
Video anomaly event detection is an important and challenging task in computer vision. Surveillance cameras are increasingly used in streets, intersections, banks, and shopping centers to improve public safety. However, manually reviewing all surveillance video is very difficult: it requires a great deal of time and human resources and is still not as effective as desired. Therefore, automatic video anomaly event detection algorithms have become an urgent need. A video anomaly detection system should react to anomalous events in a timely manner and determine the time period in which these events occur. Video anomaly detection can therefore be understood as a coarse-level video understanding task that filters anomalous patterns out from normal patterns.
Early video anomalous event detection algorithms were designed to detect specific anomalous events, such as violence in video [1]. Because anomalous events are so diverse, it is difficult to enumerate all of their types, so algorithms that can only detect specific anomalous events are of limited use in practice. Anomaly detection algorithms should therefore be event-independent; that is, the less supervision an anomalous event detection task requires, the better. However, this goal is almost impossible to achieve in practice.
Previous anomaly detection algorithms assumed that a small number of initial videos contain only normal events, from which a dictionary of normal events can be built [2]. The main idea of this type of algorithm is that anomalous events cannot be accurately reconstructed from the normal event dictionary. These algorithms rest on the assumption that any event that deviates from the normal pattern is an abnormal event. However, defining a normal event set that contains all possible normal patterns or behaviors is difficult, or almost impossible. This type of method therefore produces a high false alarm rate for normal events that were not involved in the construction of the normal event dictionary.
Currently, most video anomalous event detection algorithms are based on weak supervision [3].
In most of these weakly supervised approaches, the video anomaly detection task is accomplished by multiple instance learning (MIL). However, MIL-based approaches fragment the connections between video clips and do not take full advantage of the contextual information between them. To solve these problems of existing anomalous event detection algorithms, we formulate video anomaly detection as a weakly supervised learning problem following the binary classification paradigm, where only video-level labels are involved in the training stage. In this paper, we propose a video anomaly event detection method based on sequence recognition. We consider each normal or abnormal video as a sequence and, over all video sequences, automatically learn a sequence recognition model that assigns a higher anomaly score to sequences containing anomalous events. This approach avoids fragmenting the contextual information between video clips. We use a more complete feature extraction model, I3D, to obtain more expressive features, and we design a more reasonable boundary fitting model to delineate the boundary of anomalous video sequences.
2 Related Work
With the increasing demand for security, video anomaly detection has become a hot research topic in computer vision, and a number of results have been achieved in detecting abnormal events in surveillance videos. In recent years, the rapid development of deep learning and convolutional neural networks (CNNs) has provided new approaches to video anomaly detection. Tran et al. [4] proposed a deep 3D convolutional neural network (3D Convolutional Network, C3D). The model takes a video block as input and directly obtains its temporal and spatial features. Based on the C3D network, researchers have proposed a series of improved anomaly detection algorithms. Chu et al. [5] combined C3D with sparse coding: sparse coding results generated from handcrafted features are used as input to guide unsupervised feature learning with C3D, and during the iterative process the trained C3D network can in turn generate sparsely encoded spatio-temporal features. Sultani et al. [6] proposed an algorithm based on multi-instance learning. The algorithm departs from the common unsupervised setting of anomaly detection and adopts a weakly supervised setting, which to some extent overcomes the limitation of the small datasets previously available for anomaly detection. Zhong et al. [7] were the first to use a graph convolutional network (GCN) to correct noisy labels in the field of video analysis. They designed graph convolutional networks based on feature similarity and temporal consistency to propagate supervisory signals from high-confidence segments to low-confidence segments. This makes it straightforward to apply fully supervised action classifiers, such as C3D, to weakly supervised anomaly detection and to make maximal use of these well-established classifiers.
In addition to the above algorithms, which improve on the C3D feature extraction model, many other anomaly detection algorithms perform well. For example, Ionescu et al. [8] proposed a CAE model, an object-centric convolutional autoencoder. The algorithm detects foreground objects frame by frame with the SSD algorithm, and K-means clustering is performed after the convolutional autoencoder produces the feature vectors. The test samples are classified by k SVM models, and the category with the maximum classification score determines the anomaly classification. Gong et al. [9] proposed the Mem-AE model, which addresses a drawback of the reconstruction loss of the autoencoder. A memory module is added to the autoencoder to constrain the hidden feature vector Z generated by the encoder, so that the model fits normal samples more closely in video anomaly detection. Rodrigues et al. [10] proposed a multi-timescale (MT) model. The model captures the spatio-temporal dynamics at different time scales by running a sliding window on the input signal and averaging multiple predicted values to obtain the final predicted anomaly score at each moment. The MPED-RNN network proposed by Morais et al. [11] was the first to use biological semantic information. The model uses 2D human skeletal trajectories as features, combining global movement and local pose to describe the bounding box of the human body in the frame. It applies the rich semantic information of skeletal features to the task of detecting abnormal events in human behavior.
3 Proposed Anomaly Detection Method
3.1 Feature Extraction
To make use of both the appearance and motion information of the videos, the inflated 3D (I3D) network, pretrained on the Kinetics dataset [12], is used as the feature extraction network. Previous C3D feature extraction models relied heavily on convolution kernels of a single size. Such a kernel can only process the input information step by step rather than in batch, so the target feature set is generated only after repeating the single-step operation several times, and the resulting feature set is mostly a uniform distribution of the same target features. In contrast, the inception module in the I3D feature extraction model used in this paper uses multiple channels composed of convolution kernels of different sizes. Each channel is computed separately, and the information obtained is finally concatenated and fused. The fused result is no longer a uniform distribution of the same target features; instead, the features with stronger relevance are selected from the original target features and integrated, generating multiple feature subsets. This strengthens the connection and influence of related feature information and weakens weakly related feature information, which improves the purity of the feature results, so that the final features are highly useful and representative.
I3D has a two-stream network structure. In this paper, the input video is divided into video frames for feature extraction. The I3D features are of two types: those obtained from the stream that takes RGB image sequences as input are called I3D_RGB, and those obtained from the stream that takes as input the optical flow image sequence computed by the DIS dense optical flow algorithm are called I3D_Flow. We first extract the RGB image sequence and the optical flow image sequence of the video separately, and then use the I3D network to obtain the appearance features (I3D_RGB) and motion features (I3D_Flow) of the video, respectively.
To analyze the effect of different features on the accuracy of the anomalous event detection task, we conducted comparison experiments with different feature combinations. Based on these experiments, we decided to use only the RGB features of the video as the input of the sequence recognition network, taking the output of the penultimate layer of I3D_RGB as the features of the video.
3.2 Bi-LSTM Model
To allow the neural network model to make judgments based on longer sequences, rather than being restricted to the limited temporal receptive field of Conv3D, we need to make full use of the contextual information of the video. A recurrent neural network (RNN) is a sequence model capable of storing historical states. However, a multilayer RNN is often limited in computing contextual information because of exploding and vanishing gradients. LSTM is a variant of RNN designed mainly to solve the problem of long-range gradient computation in RNNs. In the LSTM structure, the state update at time t, where the hidden layer vector is $h_t$, is shown in Eq. 1.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{C}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
C_t &= i_t * \tilde{C}_t + f_t * C_{t-1} \\
h_t &= o_t * \tanh(C_t)
\end{aligned}
\tag{1}
$$
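The gate updates of Eq. 1 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names (`lstm_step`, `matvec`), the dictionary layout of the parameters, and the use of nested lists instead of a tensor library are all assumptions made here for readability.

```python
import math


def sigmoid(z):
    """Logistic function used by the input, forget and output gates."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update following Eq. 1.

    params maps each gate name ('i', 'f', 'o', 'c') to a (W, U, b) triple;
    this layout is a hypothetical choice made for the sketch.
    """
    def pre_activation(gate):
        W, U, b = params[gate]
        wx, uh = matvec(W, x_t), matvec(U, h_prev)
        return [p + q + r for p, q, r in zip(wx, uh, b)]

    i_t = [sigmoid(z) for z in pre_activation("i")]        # input gate
    f_t = [sigmoid(z) for z in pre_activation("f")]        # forget gate
    o_t = [sigmoid(z) for z in pre_activation("o")]        # output gate
    c_tilde = [math.tanh(z) for z in pre_activation("c")]  # candidate cell state
    # C_t = i_t * C~_t + f_t * C_{t-1}, element-wise
    c_t = [i * g + f * c for i, g, f, c in zip(i_t, c_tilde, f_t, c_prev)]
    # h_t = o_t * tanh(C_t), element-wise
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Running the same step once over the segment sequence in order and once in reverse order, and combining the two hidden states per segment, gives the bidirectional variant discussed next.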
Here, $i_t$, $f_t$, $o_t$, and $C_t$ are the output values of the input gate, the forget gate, the output gate, and the memory cell, respectively. σ is the sigmoid function. W, U, and b are the parameters of the LSTM neural network. We also believe that the detection of anomalous events in video is related not only to the video clips before the anomalous event occurs, but also to the video clips after it. A unidirectional LSTM network cannot compute the contextual information of the video feature vectors in reverse order. Therefore, to make full use of the contextual information between video clips, a Bi-LSTM layer is used in place of the LSTM layer in the sequence recognition model. The forward and reverse sequences are combined as the output, as shown in Eq. 2, where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the computed results of the Bi-LSTM forward and reverse hidden layers, respectively. The Bi-LSTM can be computed and updated in both directions. This bidirectional LSTM network structure can provide complete contextual information for the video.
3.3 Sequence Recognition Network Model
The overall network structure of the sequence recognition model is shown in Fig. 1. Only video-level labels are required to train the sequence recognition model for video abnormal event detection. The role of the two-layer LSTM network in the model is to allow the neural network to make judgments based on longer sequences, rather than being restricted to the limited receptive field of Conv3D. After the two LSTM layers, the output of the hidden layer is taken as the feature encoding of the video and scored once by regression. Tanh is used as the activation function of the FC layer. The sequence recognition algorithm is shown in Table 1. The main process is as follows.
(1) The features of the penultimate layer of the I3D network are extracted as the feature vectors of the videos using the Inflated 3D ConvNet model pre-trained on the Kinetics dataset. For each video, we obtain feature vectors $X_{t,i}$ of length D for k segments.

$$
X_t = [X_{t,1}, \cdots, X_{t,k}], \quad X_{t,i} \in \mathbb{R}^{D} \tag{3}
$$

In Eq. 3, $X_t$ denotes the feature vector of a video, $X_{t,i}$ denotes the feature vector corresponding to the ith segment of the video, k denotes the number of segments into which the video is divided, and D denotes the length of the feature vector of each segment. We empirically set k to 32 and D to 1024.
Table 1. Sequence recognition algorithm
Input: Video RGB image sequence
Output: Probability of abnormal events in the video
Begin
    Video preprocessing: unify frame rate and resolution
    Generate the video RGB image sequence set S = (S_1^RGB, S_2^RGB, S_3^RGB, …, S_n^RGB)
    For each image sequence S_i^RGB in the image sequence set S do
        Input the RGB image sequence S_i^RGB into the pre-trained I3D feature extraction network
        X_t represents the feature vector of the video
        Input the feature vector X_t into the Bi-LSTM module to obtain its context information
        X'_t represents the encoded video feature vector
        Input X'_t into SR-Net to obtain the probability of abnormal events in the video
    end for
End
(2) The feature vector is used as input to the Bi-LSTM network, and the final output is the encoded feature vector. The cell state and hidden state at the first moment are initialized as follows.

$$
c_0 = f_{init.c}\left(\frac{1}{k}\sum_{i=1}^{k} X_{t,i}\right) \quad \text{and} \quad h_0 = f_{init.h}\left(\frac{1}{k}\sum_{i=1}^{k} X_{t,i}\right) \tag{4}
$$
In Eq. 4, $c_0$ denotes the cell state at the first moment in the LSTM network, and $h_0$ denotes the hidden state at the first moment in the LSTM network.
(3) The encoded feature vector matrix is fed into the SR-Net model for training, and the final score $s_i$ of the video is obtained, where $s_i$ indicates the probability of the video being classified as anomalous. According to the SR-Net model, we create a mapping function between $X_i^{FC}$, which represents the last FC layer, and the anomaly score $s_i$. It can be expressed as follows.

$$
s_i = \frac{1}{1 + \exp\left(W_{FC} X_i^{FC} + b_{FC}\right)} \tag{5}
$$

In Eq. 5, $W_{FC}$ and $b_{FC}$ are the learnable parameters of the last FC layer, and $X_i^{FC}$ denotes the feature vector of the last FC layer for the ith input video.
(4) We consider the anomalous event detection problem as a weakly supervised learning problem that follows the binary classification paradigm. We therefore use the cross-entropy loss function commonly used for binary classification tasks, which can be expressed as follows.

$$
\ell(x, y) = L = \{l_1, \cdots, l_N\}, \quad l_n = -w_n\left[y_n \cdot \log x_n + (1 - y_n) \cdot \log(1 - x_n)\right] \tag{6}
$$

In Eq. 6, $y_n$ denotes the label of sample n, which is 1 for the positive class and 0 for the negative class; $x_n$ denotes the probability that sample n is predicted to be of the positive class; and $w_n$ denotes the weight assigned to sample n.
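The arithmetic in steps (2) to (4) can be illustrated with a short standard-library sketch. It is only a sketch under stated assumptions: the text does not specify the initialization functions of Eq. 4, so only the mean-pooled argument they receive is computed; `anomaly_score` follows Eq. 5 exactly as printed (because the FC parameters are learned, this differs from the usual logistic form only in the sign of those parameters); and the clamp `eps` in `bce_losses` is a numerical safeguard added here, not part of Eq. 6. All function names are hypothetical.

```python
import math


def mean_pool(segments):
    """Average k segment vectors: the (1/k) * sum_i X_{t,i} term of Eq. 4."""
    k = len(segments)
    return [sum(column) / k for column in zip(*segments)]


def anomaly_score(x_fc, w_fc, b_fc):
    """Eq. 5 as printed: s = 1 / (1 + exp(W_FC . X_FC + b_FC))."""
    z = sum(w * x for w, x in zip(w_fc, x_fc)) + b_fc
    return 1.0 / (1.0 + math.exp(z))


def bce_losses(preds, labels, weights=None, eps=1e-12):
    """Per-sample weighted binary cross-entropy terms l_n of Eq. 6.

    eps clamps predictions away from 0 and 1 to avoid log(0); this
    safeguard is an assumption of the sketch.
    """
    if weights is None:
        weights = [1.0] * len(preds)
    losses = []
    for x, y, w in zip(preds, labels, weights):
        x = min(max(x, eps), 1.0 - eps)
        losses.append(-w * (y * math.log(x) + (1 - y) * math.log(1.0 - x)))
    return losses
```

How the per-sample terms are reduced to a single training loss (for example by averaging) is not stated in the text, so the sketch returns them unreduced.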
Fig. 1. Sequence recognition model network structure diagram
3.4 Abnormal Event Location
A video anomaly detection system should respond to anomalous events in a timely manner and determine the time period in which they occur. The sequence recognition-based algorithm proposed in this paper differs from multi-instance learning-based algorithms: it cannot locate the time period in which anomalous events occur from the segments with high anomaly scores in a bag of samples. Therefore, we need to design a reasonable method to locate the time period in which abnormal events occur. In the proposed algorithm, each video is divided into 32 non-overlapping segments, but the anomalous events in each segment may account for only a small fraction of it. In other words, the characteristics of the anomalous frames in a segment can be overwhelmed by those of the normal frames, which causes segments containing anomalies to tend to be regarded as normal patterns. To address this problem, we refer to the segment-based video detection algorithm [18] when locating anomalous segments. Instead of dividing the video into segments with a fixed number of frames, we extract features by dividing the video evenly into 32 segments, as in the training and testing process. Since our model is trained and tested with feature matrices of size 32 × 1024, we designed the following method for locating anomalous events in the video.
Fig. 2. Schematic diagram of abnormal event fragment detection
As shown in Fig. 2, we first set up a detection window of size 32 × 1024 to detect whether an abnormal event occurs in the input video sequence, and initialize the detection window to 0. We then input the feature vectors of the video into the detection window segment by segment, and each time a 1024-dimensional segment feature vector is input we compute the anomaly score of the sequence in the detection window. If the anomaly score is close to 0, we consider that the video clip corresponding to the input feature vector does not contain anomalous events; if the anomaly score is close to 1, we consider that an anomalous event occurs in the video clip corresponding to the input feature vector.
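The detection-window procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: `score_fn` is a hypothetical stand-in for the trained sequence recognition model, the decision `threshold` of 0.5 is an assumption (the text only speaks of scores close to 0 or close to 1), and the function name `locate_anomalies` is invented for the example.

```python
from collections import deque


def locate_anomalies(segment_features, score_fn, window_len=32,
                     feat_dim=1024, threshold=0.5):
    """Slide a zero-initialized window over the segment features.

    After each new segment is pushed in, the whole window is scored and the
    score is attributed to that segment. Returns (score, is_anomalous) pairs.
    """
    # Window of window_len zero vectors; deque drops the oldest entry on append.
    window = deque([[0.0] * feat_dim for _ in range(window_len)],
                   maxlen=window_len)
    results = []
    for feature in segment_features:
        if len(feature) != feat_dim:
            raise ValueError("segment feature has unexpected length")
        window.append(feature)
        score = score_fn(list(window))  # hypothetical model call
        results.append((score, score >= threshold))
    return results
```

With a trained model, `score_fn` would take the 32 × 1024 window and return the anomaly score of Eq. 5; the per-segment flags then mark the time period in which the anomaly occurs.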
4 Experiments
4.1 Dataset
The sequence recognition model proposed in this paper was evaluated mainly on the following two larger anomalous event detection datasets.
(1) UCF-Crime dataset [6]: This dataset contains untrimmed videos with varying scenes, content, and duration. It has two main advantages: first, the number of videos and their total duration are much greater than in previous datasets, and second, the types of abnormal events it contains are richer. The dataset covers 13 types of abnormal events and contains 1900 videos in total, of which 950 are abnormal videos and 950 are normal videos. The training set contains 1610 videos (800 normal videos and 810 abnormal videos) and the test set contains 290 videos (150 normal videos and 140 abnormal videos).
(2) ShanghaiTech dataset: This is a medium-sized dataset containing 437 videos, including 130 anomalous events from 13 scenes. In the original division of the dataset, all training videos are normal, which is not suitable for binary classification tasks. Therefore, we repartition the dataset by moving a random portion of the abnormal videos from the test set into the training set and, conversely, moving a randomly selected portion of the normal videos from the training set into the test set. Both the training set and the test set contain all 13 scenes.
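The ShanghaiTech repartitioning described above could be sketched as below. The fractions moved in each direction are not given in the text, so `frac_to_train` and `frac_to_test` are hypothetical parameters, as are the function name and the fixed random seed. The sketch also does not enforce that all 13 scenes appear in both splits; a per-scene (stratified) version of the same swap would be needed for that.

```python
import random


def repartition(train_normal, test_normal, test_abnormal,
                frac_to_train, frac_to_test, seed=0):
    """Swap random portions between splits so both contain both classes.

    frac_to_train: fraction of abnormal test videos moved into training.
    frac_to_test: fraction of normal training videos moved into testing.
    Both fractions are assumptions; the source does not state them.
    """
    rng = random.Random(seed)
    abnormal = list(test_abnormal)
    normal = list(train_normal)
    rng.shuffle(abnormal)
    rng.shuffle(normal)
    n_abnormal_moved = int(len(abnormal) * frac_to_train)
    n_normal_moved = int(len(normal) * frac_to_test)
    train = {"normal": normal[n_normal_moved:],
             "abnormal": abnormal[:n_abnormal_moved]}
    test = {"normal": list(test_normal) + normal[:n_normal_moved],
            "abnormal": abnormal[n_abnormal_moved:]}
    return train, test
```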
4.2 Analysis of the Proposed Method
Feature Extraction. First, we use the video RGB features (I3D_RGB) or the optical flow features (I3D_Flow) alone as input to investigate the effect of using appearance features or motion features alone on the experimental results. Then we use combined features as input to study the effect of different feature combinations. We use two combinations of RGB and optical flow features: the fused features obtained by adding and averaging them (I3D_Merge), and the spliced features obtained by concatenating them (I3D_Cat). The I3D_RGB, I3D_Flow, and I3D_Merge features are all 1024-dimensional, while the I3D_Cat features have 2048 dimensions. On the UCF-Crime dataset, we use the different features as input and three fully connected layers as the classification model; the detection accuracies obtained are shown in Table 2.

Table 2. Effect of different features on AUC values on the UCF-Crime dataset

| Method | AUC (%) |
| --- | --- |
| C3D | 74.44 |
| I3D_RGB | 80.93 |
| I3D_Flow | 68.52 |
| I3D_Cat | 79.73 |
| I3D_Merge | 80.58 |
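The two fusion schemes compared in Table 2 differ only in how the per-segment RGB and optical flow vectors are combined. A minimal sketch, with hypothetical function names and plain lists standing in for feature tensors:

```python
def merge_features(rgb, flow):
    """I3D_Merge: element-wise add and average; dimension is unchanged."""
    if len(rgb) != len(flow):
        raise ValueError("RGB and flow features must have the same length")
    return [(r + f) / 2.0 for r, f in zip(rgb, flow)]


def cat_features(rgb, flow):
    """I3D_Cat: concatenate; dimension is the sum of both inputs."""
    return list(rgb) + list(flow)
```

For two 1024-dimensional inputs, `merge_features` returns 1024 values and `cat_features` returns 2048, matching the dimensions stated above.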
As shown in Table 2, when we use the I3D_Flow features as input to the three fully connected layers, the detection accuracy is lower, even lower than that obtained with the C3D network (68.52% versus 74.44%). A likely reason is the large number of shaky shots and split-screen switching shots in the UCF-Crime dataset. The video frames corresponding to these shots strongly interfere with the optical flow features, so the detection accuracy is not high. The remaining three features yield similar AUC values for detecting anomalous events, among which I3D_RGB performs best, with an AUC 1.2 percentage points higher than I3D_Cat and 0.35 percentage points higher than I3D_Merge. This also shows that the direct use of optical flow features can harm the detection results to some extent. The I3D_Flow features contribute little to improving detection accuracy, and using them in a fused feature for anomaly detection even doubles the training time and slows down the convergence of the algorithm. Considering both time overhead and detection accuracy, we finally decided to use only the I3D_RGB features, obtained with RGB image sequences as input, as the features for our experiments.
Ablation Experiments. To verify that adding the Bi-LSTM module and the SR-Net module does help to improve the detection accuracy, we carry out the ablation experiments shown in Table 3. The ablation experiments include methods based on the I3D feature extractor and multi-instance learning. The I3D + MIL method achieves 80.9% video-level AUC on the UCF-Crime dataset, compared with 74.4% for C3D + MIL, which shows that using the I3D model for feature extraction improves the experimental results. The proposed method based on the C3D feature extractor and the Bi-LSTM sequence recognition model obtains 78.43% video-level AUC on the UCF-Crime dataset, which shows that the proposed sequence recognition model improves the overall performance over MIL. Finally, the proposed method based on the I3D feature extractor and the Bi-LSTM sequence recognition model obtains 85.5% video-level AUC on the UCF-Crime dataset. These experiments show that the Bi-LSTM module and SR-Net module proposed in this paper are effective in improving the detection accuracy.

Table 3. AUC and FAR under different combinations

| Method | AUC (%) | FAR (%) |
| --- | --- | --- |
| C3D + MIL | 74.4 | 1.9 |
| I3D + MIL | 80.9 | 1.2 |
| C3D + Bi-LSTM + SR-Net | 78.43 | 1.4 |
| I3D + Bi-LSTM + SR-Net (Ours) | 85.5 | 0.8 |
Comparison with Existing Methods. A robust anomalous event detection algorithm should have high accuracy (AUC) and a low false alarm rate (FAR). As shown in Table 4, we compare the proposed method with existing methods on the UCF-Crime dataset.

Table 4. AUC and FAR of different methods on the UCF-Crime dataset

| Method | AUC (%) | FAR (%) |
| --- | --- | --- |
| Hasan et al. [13] | 50.6 | 27.2 |
| Lu et al. | 65.51 | 3.1 |
| Sultani et al. [6] | 74.44 | 1.9 |
| C3D + TCN [14] | 78.66 | - |
| TSN-RGB | 82.12 | **0.1** |
| TSN-Optical Flow | 78.08 | 1.1 |
| Not only look [15] | 82.44 | - |
| Ours | **85.5** | 0.8 |
We compare against representative autoencoder-based algorithms on the UCF-Crime dataset, for example the algorithm based on a fully convolutional feed-forward deep autoencoder proposed by Hasan et al. [13] and the dictionary learning-based algorithm proposed by Lu et al. The detection results of currently available weakly supervised learning-based algorithms on the UCF-Crime dataset are also compared. To exclude the influence of parameter settings, we uniformly adopt the video-level AUC and the false alarm rate (FAR) at a threshold of 0.5 as evaluation criteria, where the optimal AUC and FAR values are bolded. From Table 4, we can clearly see that the AUC values of the autoencoder-based algorithms are consistently and substantially lower than those of the weakly supervised learning-based algorithms on the UCF-Crime dataset. Comparing the methods proposed by Sultani et al. [6] and Zhong et al. [7], we find that capturing the long-range temporal information of the video with a temporal convolutional network (TCN) also improves detection accuracy. This demonstrates that the short-range spatio-temporal features obtained by 3D convolutional networks alone are not well suited to identifying anomalous events in real-world surveillance video; long-range temporal information is more useful for the recognition network to understand the occurrence of certain behaviors. The video-level AUC of the proposed algorithm is 85.5% on the UCF-Crime dataset, an improvement over the previous methods of Sultani et al. [6] and Zhong et al. [7]. To reflect the accuracy of the algorithm more intuitively, we plot the ROC curve of the proposed algorithm on the UCF-Crime dataset, as shown in Fig. 3. The receiver operating characteristic (ROC) curve is plotted with the false positive rate on the horizontal axis and the true positive rate on the vertical axis, and the area under the ROC curve is defined as the AUC. Therefore, the closer the ROC curve is to the upper left corner of the plot, the better the detection performance. The ROC curve of the proposed algorithm lies very close to the upper left corner, indicating a clear advantage in accuracy.
Fig. 3. ROC curve of the model on the UCF_Crime dataset
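For readers who want to reproduce the two evaluation criteria, the following sketch computes the AUC by the trapezoidal rule over the ROC curve and a false alarm rate at a fixed threshold. It is an illustration under assumptions: the function names are hypothetical, labels are taken to be 1 for abnormal and 0 for normal videos, and FAR is assumed here to mean the fraction of normal videos whose score reaches the threshold, a definition the text does not spell out.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve (FPR on x, TPR on y), trapezoidal rule."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required to compute AUC")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Process all samples tied at the same score as one ROC point.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr, tpr = fp / negatives, tp / positives
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return auc


def false_alarm_rate(scores, labels, threshold=0.5):
    """Fraction of normal samples (label 0) scored at or above threshold."""
    normal_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not normal_scores:
        raise ValueError("no normal samples")
    return sum(s >= threshold for s in normal_scores) / len(normal_scores)
```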
We also conduct comparative experiments on the ShanghaiTech dataset to compare the AUC and FAR values of different algorithms. As shown in Table 5, the proposed algorithm also achieves relatively good results compared with existing methods on the ShanghaiTech dataset.

Table 5. AUC and FAR of different methods on the ShanghaiTech dataset

| Method | AUC (%) | FAR (%) |
| --- | --- | --- |
| Liu et al. [16] | 72.8 | - |
| Liu et al. [17] | 76.8 | - |
| Sultani et al. [6] | 82.21 | 0.24 |
| TSN-RGB | 84.4 | - |
| C3D-TCN [14] | 82.5 | - |
| Our Method | 86.63 | 0.19 |
Abnormal Event Location. The effect of abnormal event location is shown in Fig. 4a, b, and c. We take videos from the UCF-Crime dataset to carry out the time period location experiment for video abnormal events. In Fig. 4a, the Explosion008 video in the UCF-Crime dataset is used as the experimental sample. Before the explosion occurs, the anomaly score of the detection window is close to 0. When the explosion occurs, the anomaly score quickly rises to a level close to 1. When the explosion is over and the scene is calm again, the anomaly score falls back to close to 0. In Fig. 4b, the RoadAccidents015 video is used as the experimental sample. While the vehicles in the picture are driving normally, the anomaly score of the detection window is close to 0. When the vehicles collide and an accident occurs, the anomaly score rises rapidly, and as the road blockage becomes more and more serious, the anomaly score becomes higher and higher. In Fig. 4c, Normal_Videos140 is used as the experimental sample. The anomaly score of the detection window stays close to 0 throughout, since no anomalous events occur.
Fig. 4. The effect of abnormal event fragment detection: a) Explosion008, b) RoadAccidents015, c) Normal_Videos140
5 Conclusions
This paper proposes a method to detect anomalous events in videos. It first extracts a more complete feature vector of the video using the feature extraction network I3D and then captures the past and future features at the current moment t using a Bi-LSTM network. SR-Net is used to make judgments based on longer sequences rather than being restricted to a limited receptive field. The use of these two modular networks helps to improve the detection ability. A sequence recognition model is designed to carve the boundaries of abnormal videos and build discriminative classification models by using the information both before and after each video clip in the training process. Experimental results demonstrate the effectiveness of the proposed video anomaly event detection method.
Acknowledgement. The authors thank the members of Machine Learning and Artificial Intelligence Laboratory, School of Computer Science and Technology, Wuhan University of Science and Technology, for their helpful discussion within seminars. This work was supported in part by National Natural Science Foundation of China (61972299, U1803262).
Drug-Target Binding Affinity Prediction Based on Graph Neural Networks and Word2vec
Minghao Xia1, Jing Hu1,2, Xiaolong Zhang1,2(B), and Xiaoli Lin1,2
1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
[email protected], {hujing,xiaolong.zhang,linxiaoli}@wust.edu.cn
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, China
Abstract. Predicting drug-target interaction (DTI) is important for drug development because drug-target interaction affects the physiological function and metabolism of the organism through bonding reactions. Binding affinity is the most important of the many factors affecting drug-target interaction, so predicting binding affinity is key to drug redirection and new drug development. This paper proposes a drug-target binding affinity (DTA) model based on graph neural networks and word2vec. In this model, a word embedding method converts target/protein sequences into sentences of words to capture the local chemical information of the targets/proteins. Drug molecules, given as Simplified Molecular Input Line Entry System (SMILES) strings, are converted into graphs. After feature fusion, DTA is predicted by graph convolutional networks. We conduct experiments on the Kiba and Davis datasets, and the experimental results show that the proposed method significantly improves the prediction performance of DTA.
Keywords: Drug-target interaction · Binding affinity · Drug redirection · Graph neural networks · Word2vec
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 496–506, 2022. https://doi.org/10.1007/978-3-031-13829-4_43
1 Introduction
In the research and development of new drugs, traditional wet experiments are inefficient, costly, and time-consuming [1, 2]. At the same time, with the continuous tightening of medicine-related regulations, the number of newly approved drugs is decreasing, and the difficulty and cost of developing new drugs are increasing year by year. Expensive and lengthy drug development processes can be avoided by finding new uses for already approved drugs [3]. To reuse drugs effectively, it is necessary to understand which proteins are targeted by which drugs. Predicting the strength of drug-target pair interactions can facilitate drug redirection and new drug screening. Methods for predicting drug-target relationships can be classified into two categories. One is to use binary classification to predict drug-target interactions, and the
other is to use regression to predict drug-target affinity. In DTI prediction based on binary classification, many studies have begun to use deep learning techniques, including restricted Boltzmann machines [4], deep neural networks [5, 6], stacked autoencoders [7, 8], and deep belief networks [9]. However, binary classification ignores an important piece of information about drug-target interactions, namely binding affinity. Binding affinity describes the strength of the interaction between a drug-target pair and is usually expressed in measures such as the dissociation constant (Kd), the inhibition constant (Ki), or the half maximal inhibitory concentration (IC50) [10]. IC50 depends on the concentrations of the target and the ligand, and low IC50 values signal strong drug-target binding. Likewise, low Ki values indicate high binding affinity. Kd and Ki values are usually reported as pKd or pKi, the negative logarithms of the dissociation and inhibition constants.
Among the early machine learning methods for predicting drug-target binding affinity, the Kronecker Regularized Least Squares (KronRLS) algorithm [11] achieved good results. This method uses the Smith-Waterman (S-W) algorithm [12] and the PubChem structural clustering tool to calculate similarity scores for proteins and drugs, respectively, and represents proteins and drugs by the resulting similarity score matrices. Another method, SimBoost [13], is based on gradient boosting. It relies on feature engineering of compounds and proteins, using similarity and network-based features to predict DTA. The currently popular approach is to input protein and drug sequences (1D representations) into deep learning models to predict DTA. For example, DeepDTA [14] uses CNN blocks to learn representations of raw protein sequences and SMILES strings and feeds the combined representations into a block of fully connected layers to obtain drug-target binding affinity scores.
The WideDTA [15] model is an extension of DeepDTA that uses four text-based sources of information, namely protein sequences, drug SMILES strings, protein domains and motifs, and ligand maximum common substructures [16], to predict binding affinity. Deep learning models have achieved good results in predicting DTA. However, these models represent drugs as strings, ignoring the fact that drug molecules are atoms connected in space by chemical bonds. When a SMILES string is treated as plain text, structural information about the molecule is lost, which may impair the predictive power of the model. At the same time, it is difficult to obtain the 3D structure [17] of proteins with current technology, and obtaining the spatial structure of large proteins is particularly challenging. We therefore seek a method that can effectively extract and use drug and target information.
In this paper, we propose a drug-target binding affinity prediction model based on graph neural networks and word vectors. This neural network architecture starts from one-dimensional drug and target inputs, converts drugs into two-dimensional molecular graphs, and processes protein sequences into embedding matrices composed of word vectors. The drug and target features are then extracted using deep neural networks and fed into fully connected layers to predict DTA. We conduct experiments on two benchmark datasets, Davis [18] and Kiba [19], which are commonly used for drug-target binding affinity prediction. Compared with other drug-target binding affinity prediction models evaluated on the same datasets, our model performs better.
2 Materials and Methods
2.1 Overview of Our Model
We propose a deep learning model for predicting DTA based on graphs and word vectors. The model represents drug SMILES strings as molecular graphs, extracts atomic features of the molecular graphs using RDKit [20], and inputs them into graph convolutional neural networks for feature extraction. In addition, although the three-dimensional structure of proteins cannot be obtained, the model uses word embedding to represent target proteins. Compared with the traditional natural encoding, word embedding captures the context of each amino acid, which improves the use of protein sequence information to a certain extent. The experimental results show that our model improves on previous models on almost all reported metrics.
2.2 Dataset
We evaluated our proposed model on two datasets, the kinase dataset Davis and the Kiba dataset, which are widely used as benchmarks for evaluating DTA prediction. The Davis dataset contains selectivity assays of the kinase protein family and the relevant inhibitors, together with their respective dissociation constant (Kd) values. The binding affinities were obtained by measuring the Kd values between 68 drugs and 442 targets, and they range from 5.0 to 10.8 (in pKd units). For the drug molecules of the Davis dataset, the maximum SMILES length is 103 and the average length is 64. The maximum protein sequence length in Davis is 2549 and the average length is 788. We transform Kd into pKd so that it fits our model better:
$$pK_d = -\log_{10}\left(\frac{K_d}{10^{9}}\right) \qquad (1)$$
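As a quick check of Eq. (1), the following minimal sketch converts a few Kd values to pKd. It assumes that Kd is given in nanomolar, which is what the division by 10^9 suggests; the function name `kd_nm_to_pkd` and the example values are illustrative and are not taken from the paper.

```python
import math


def kd_nm_to_pkd(kd_nm: float) -> float:
    """Apply Eq. (1): pKd = -log10(Kd / 1e9), with Kd assumed to be in nM."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm / 1e9)


# Illustrative values only: weaker binders (large Kd) map to small pKd.
for kd_nm in (10000.0, 100.0, 1.0):
    print(f"Kd = {kd_nm:>8.1f} nM  ->  pKd = {kd_nm_to_pkd(kd_nm):.2f}")
```

Under this assumption, a Kd of 10000 nM corresponds to a pKd of about 5, which matches the lower end of the affinity range reported for Davis.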
The Kiba dataset was derived from a method called Kiba, which introduced the Kiba score to integrate Kd, Ki, and IC50 measurements into a single drug-target bioactivity score. The processed dataset contains 2111 drugs and 229 targets, with a total of 118,254 affinity scores. Affinities in the Kiba dataset range from 0.0 to 17.2. The drug SMILES strings have a maximum length of 590 and an average length of 58. The maximum length of the target protein sequences is 4128, and the average length is 728 (Table 1).
Table 1. Summary of the datasets.
Dataset   Proteins   Compounds   Interactions
Davis     442        68          30056
Kiba      229        2111        118254
2.3 Drug Representation
The Simplified Molecular Input Line Entry System (SMILES) is a specification that unambiguously describes the chemical structure of a compound as a one-dimensional ASCII string. It was developed by David Weininger [21] and Arthur Weininger in the late 1980s, and later modified and extended by others. SMILES supports fast retrieval and substructure search, and it can easily describe small-molecule drugs. Existing models usually apply natural language processing (NLP) techniques to drug SMILES strings and input them into convolutional neural networks for training. SMILES strings can also be converted into two-dimensional structure diagrams through structure diagram generation algorithms [22], so that the model can learn the characteristics of compounds from two-dimensional graph data. To describe a node in the graph, we use a set of atomic properties from DeepChem [23]. We represent each node as a multidimensional binary feature vector expressing five pieces of information: the atom symbol, the number of adjacent atoms, the number of adjacent hydrogen atoms, the implicit valence of the atom, and whether the atom is in an aromatic structure. We converted the SMILES codes into the corresponding molecular graphs and extracted atomic features using the open-source cheminformatics software RDKit.
2.4 Protein Representation
Targets/proteins are usually represented as amino acid sequences (e.g., MKKHHDSRREQ…). We first use word embeddings to encode amino acid sequences into embedding matrices, which are then fed into CNN blocks to obtain the local chemical information of the target/protein. Since individual amino acids carry little meaning on their own, we apply a fixed-length N-gram splitting method to segment sequences into meaningful “biological words”.
The sequence here refers to the fixed-length input protein sequence rather than the full sequence; it is preprocessed by truncating long sequences and padding short ones with 0s. Fixed-length N-gram splitting divides a protein sequence into a sequence of N-grams, and each N-gram is considered a “biological word”. Compared with natural encoding, it reflects the immediate context of each amino acid. Since there are 20 standard amino acids, the maximum number of possible N-grams is 20^N. As a trade-off between training feasibility and vocabulary size, we set N = 3. Specifically, given a protein sequence L = “MKKFFDSR”, the sequence is segmented by fixed-length 3-gram splitting, where each 3-gram is a biological word composed of 3 amino acids. The result of the segmentation is L = {“MKK”, “KKF”, “KFF”, “FFD”, “FDS”, “DSR”}. We map each biological word to an embedding vector by looking it up in a pre-trained embedding dictionary [24] of 9048 words obtained from Swiss-Prot, which contains 560,118 manually annotated sequences. With 3-grams, we convert each protein sequence into a matrix in which each row is the embedding of one biological word. The matrix is then fed into a CNN to extract contextual information about the target protein.
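The segmentation described above can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing code: the function names, the `PAD_WORD` token, and the choice to pad at the word level are assumptions, while the default length of 2000 follows the protein sequence length listed in the training settings.

```python
from typing import List

PAD_WORD = "<pad>"  # hypothetical padding token, assumed to map to a zero vector


def split_ngrams(sequence: str, n: int = 3) -> List[str]:
    """Return the overlapping n-grams (stride 1) of an amino acid sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def to_fixed_length_words(sequence: str, max_len: int = 2000, n: int = 3) -> List[str]:
    """Truncate to max_len residues, split into n-grams, and pad to a fixed word count."""
    words = split_ngrams(sequence[:max_len], n)
    target = max_len - n + 1  # number of n-grams in a full-length sequence
    return words + [PAD_WORD] * (target - len(words))


print(split_ngrams("MKKFFDSR"))
```

For the example sequence in the text, `split_ngrams` yields the six biological words MKK, KKF, KFF, FFD, FDS, and DSR. Each word would then be looked up in the pre-trained embedding dictionary to form one row of the embedding matrix.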
2.5 Deep Learning on Molecular Graphs
After representing drug compounds as graphs, we need an algorithm that can efficiently extract features from graph data. The success of convolutional neural networks in computer vision, speech recognition, and natural language processing has inspired research on graph convolutions. Traditional convolution is powerful on Euclidean data but performs poorly on non-Euclidean data, largely because it cannot maintain translation invariance in non-Euclidean space. The GCN model was developed to extend convolution to non-Euclidean data structures such as graphs. In this paper, we propose a new DTA prediction model based on graph neural networks and traditional CNNs. Figure 1 shows a schematic of the model. For proteins, we use word2vec to embed the segmented words and apply several 2D CNN blocks to the text to learn sequence representation vectors. Specifically, each protein sequence is first regarded as a sentence, every three adjacent amino acids are regarded as a word, and each protein sentence has a fixed length. The protein sentences are then transformed into embedding matrices through a pre-trained embedding dictionary, where each word is represented by a 100-dimensional vector. We next use three 2D convolutions to learn different levels of abstract features from the input. Finally, a max-pooling layer is applied to obtain the representation vector of the input protein sequence.
Fig. 1. Architecture of our method, which takes drug-target pairs as input and outputs the binding affinity of each pair.
For drugs, we use five atomic features from DeepChem to represent drugs as molecular graphs and use RDKit to extract the atomic features. The resulting features are input into the GCN for training, and global max pooling is applied to the output to obtain the representation vector of the drug molecule. In this work, we focus on predicting a continuous value that indicates the level of interaction between a drug and a protein sequence. Each drug is represented as a graph and each protein as a sentence. The graph of a given drug is G = (V, E), where V is the set of
N nodes and E is the set of edges represented by the adjacency matrix A. A multi-layer graph convolutional network (GCN) takes as input a node feature matrix $X \in \mathbb{R}^{N \times C}$ ($N = |V|$; C is the number of features per node) and an adjacency matrix $A \in \mathbb{R}^{N \times N}$, and produces a node-level output $Z \in \mathbb{R}^{N \times F}$ (F is the number of output features per node). The layer-wise propagation rule can be written in the normalized form:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \qquad (2)$$
where $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph with added self-connections, $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $H^{(l)} \in \mathbb{R}^{N \times C}$ is the matrix of activations in the l-th layer with $H^{(0)} = X$, $\sigma$ is the activation function, and $W^{(l)}$ is a learnable parameter matrix. The layer-wise convolution operation can be approximated as:
$$Z = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta \qquad (3)$$
where $\Theta \in \mathbb{R}^{C \times F}$ (F is the number of filters or feature maps) is the parameter matrix of the filter. However, the output of the GCN model is the node-level matrix $Z \in \mathbb{R}^{N \times F}$. To make the GCN suitable for learning a representation vector for the whole drug molecule graph, we add a global max-pooling layer [25] after the last GCN layer. In our model, we use three consecutive GCN layers. The output channel dimension of each GCN layer is twice that of the previous layer, and each layer is activated by the ReLU [26] function. The global max-pooling layer after the three GCN layers aggregates the node representations over the entire graph. Finally, two linear transformation layers convert the pooled output into a 128-dimensional drug feature tensor.
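The propagation rule in Eq. (2) followed by global max pooling can be illustrated with a small, dependency-free sketch. Everything concrete here is an assumption made for illustration: the toy three-atom graph, the two-dimensional one-hot node features, and the weight values are invented, and a real model would use a deep learning framework with learned weights and the five atomic features described earlier.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two dense matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Compute D~^{-1/2} (A + I) D~^{-1/2} with D~_ii = sum_j (A + I)_ij."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # always >= 1 because of the self-loops
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_hat: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """One layer of Eq. (2) with ReLU as the activation function."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def global_max_pool(h: Matrix) -> List[float]:
    """Aggregate node-level features into one graph-level vector (column-wise max)."""
    return [max(col) for col in zip(*h)]


# Toy graph (assumption): heavy atoms of ethanol, C-C-O, as nodes 0-1-2.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # one-hot atom symbol: [C, O]
w = [[0.5, -1.0, 1.0], [1.0, 0.5, -0.5]]      # arbitrary weights, 2 -> 3 features

a_hat = normalized_adjacency(adj)
graph_vector = global_max_pool(gcn_layer(a_hat, x, w))
print(graph_vector)
```

Stacking three such layers with doubling output widths and then applying two linear layers would mirror the drug branch described above, with the pooled vector serving as the drug representation.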
3 Experiments and Results
3.1 Evaluation Metrics
We used two metrics commonly used in regression tasks to evaluate performance: Mean Squared Error (MSE) and Concordance Index (CI). The mean squared error measures the difference between the predicted values and the true values and is calculated as:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (P_i - Y_i)^2 \qquad (4)$$
where P is the predicted value of affinity, Y is the true value of affinity, and n is the number of samples. The concordance index measures whether the predicted binding affinity values of two random drug-target pairs are in the same order as their true values and is calculated as:
$$CI = \frac{1}{Z} \sum_{y_i > y_j} h(p_i - p_j) \qquad (5)$$
where $p_i$ is the predicted value for the larger affinity $y_i$, $p_j$ is the predicted value for the smaller affinity $y_j$, and Z is the normalization constant. $h(x)$ is the step function [27]:
$$h(x) = \begin{cases} 1, & x > 0 \\ 0.5, & x = 0 \\ 0, & x < 0 \end{cases} \qquad (6)$$
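Equations (4) to (6) translate directly into code. The sketch below is a straightforward, quadratic-time illustration rather than an optimized implementation, and it assumes that the normalization constant Z is the number of pairs with $y_i > y_j$, which is the usual convention but is not spelled out in the text. The function names and sample values are illustrative.

```python
from typing import Sequence


def mse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean squared error, Eq. (4)."""
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(pred, true)) / len(pred)


def step(x: float) -> float:
    """Step function h(x), Eq. (6)."""
    if x > 0:
        return 1.0
    if x == 0:
        return 0.5
    return 0.0


def concordance_index(pred: Sequence[float], true: Sequence[float]) -> float:
    """Concordance index, Eq. (5), with Z assumed to be the number of pairs y_i > y_j."""
    total, pairs = 0.0, 0
    for i in range(len(true)):
        for j in range(len(true)):
            if true[i] > true[j]:
                pairs += 1
                total += step(pred[i] - pred[j])
    if pairs == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return total / pairs


# Illustrative values only.
y_true = [5.0, 6.0, 7.0]
y_pred = [5.1, 5.9, 7.2]
print(mse(y_pred, y_true), concordance_index(y_pred, y_true))
```

A perfectly ordered set of predictions gives a CI of 1, a fully reversed ordering gives 0, and constant predictions give 0.5.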
Generally speaking, the smaller the MSE and the larger the CI, the better the result.
3.2 Results and Discussion
We evaluate the performance of our model on the benchmark datasets. To learn a generalized model, we randomly divide each dataset into six equal parts. One part serves as an independent test set, and the remaining parts are used to determine the hyperparameters. We perform a grid search over the hyperparameters to determine the best settings. The hyperparameter combination that gives the best mean MSE on the validation set is selected to build the model that is evaluated on the test set. The final parameter settings for our model are listed in Table 2.
Table 2. The detailed training settings of our method.
Parameter                    Setting
Learning rate                0.0005
Batch size                   512
Epoch                        500
Length of protein sequence   2000
CNN kernel size              23
GCN/CNN layers               3
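The data split and hyperparameter selection described above can be sketched as follows. This is a minimal outline under stated assumptions: the five non-test parts are used as cross-validation folds (the text only says they are used to determine hyperparameters), `evaluate` is a hypothetical user-supplied function that trains a model and returns its validation MSE, and the search-space values in the demo are placeholders rather than the grid used in the paper.

```python
import random
from itertools import product
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def six_part_split(n_samples: int, seed: int = 0) -> Tuple[List[int], List[List[int]]]:
    """Shuffle sample indices and split them into one test part and five remaining parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    parts = [indices[k::6] for k in range(6)]  # part sizes differ by at most one
    return parts[0], parts[1:]


def grid(search_space: Dict[str, Sequence]) -> Iterator[Dict[str, object]]:
    """Enumerate every combination of hyperparameter values."""
    keys = list(search_space)
    for values in product(*(search_space[k] for k in keys)):
        yield dict(zip(keys, values))


def select_hyperparameters(folds: List[List[int]], search_space: Dict[str, Sequence],
                           evaluate: Callable[[List[int], List[int], dict], float]):
    """Return the combination with the lowest mean validation MSE across the folds."""
    best_params, best_score = None, float("inf")
    for params in grid(search_space):
        scores = []
        for k, val_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            scores.append(evaluate(train_idx, val_idx, params))
        mean_mse = sum(scores) / len(scores)
        if mean_mse < best_score:
            best_params, best_score = params, mean_mse
    return best_params, best_score


# Demo with a stand-in for real training (placeholder scoring, not a model).
test_idx, folds = six_part_split(60)
space = {"learning_rate": [0.005, 0.0005], "batch_size": [128, 512]}
dummy = lambda train_idx, val_idx, p: p["learning_rate"] + 1.0 / p["batch_size"]
print(select_hyperparameters(folds, space, dummy))
```

The test indices are never passed to `evaluate`, so the selected settings are chosen without looking at the held-out part.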
As benchmarks for comparison, we used the machine learning algorithms KronRLS and SimBoost, which take protein and compound similarity matrices as input. The pairwise similarities of proteins and ligands were calculated using the Smith-Waterman (S-W) and PubChem Sim algorithms, respectively. As deep learning baselines, we use DeepDTA and its extension WideDTA. They represent drugs and proteins as one-dimensional sequences and feed them into two separate CNN blocks for training. Among more recent deep learning models, GraphDTA converts drug molecules into graph representations and feeds protein sequences directly into 1D convolutional neural networks.
Table 3. Prediction performance on the Davis dataset.
Method      Protein rep.    Compound rep.   CI      MSE
KronRLS     S-W             Pubchem-Sim     0.871   0.379
SimBoost    S-W             Pubchem-Sim     0.872   0.282
DeepDTA     1D              1D              0.878   0.261
WideDTA     1D + PDM        1D + LMCS       0.886   0.262
GraphDTA    1D              GIN             0.893   0.229
OurMethod   2D + word2vec   GCN             0.892   0.225
Table 3 compares the performance of our model on the Davis dataset with that of the existing benchmark models. Our model greatly improves MSE and CI over the traditional machine learning methods KronRLS and SimBoost. Compared with the classical deep learning algorithm DeepDTA, our MSE of 0.225 is about 14% lower than its 0.261, while our CI remains high at 0.892. Relative to WideDTA (MSE 0.262), the MSE reduction on Davis is also about 14%. Compared with the more recent deep learning method GraphDTA, our model lowers the MSE to 0.225, the lowest value among the compared methods, while maintaining a similar CI (0.892 versus 0.893). On the Kiba dataset (Table 4), our model also performs well, achieving the best MSE and CI among all benchmark models.
Table 4. Prediction performance on the Kiba dataset.
Method      Protein rep.    Compound rep.   CI      MSE
KronRLS     S-W             Pubchem-Sim     0.782   0.411
SimBoost    S-W             Pubchem-Sim     0.836   0.222
DeepDTA     1D              1D              0.863   0.194
WideDTA     1D + PDM        1D + LMCS       0.875   0.179
GraphDTA    1D              GAT_GCN         0.891   0.139
OurMethod   2D + word2vec   GCN             0.895   0.137
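Improvements in MSE are easiest to compare when the convention is made explicit. The sketch below computes the relative reduction, (baseline − ours) / baseline, from the Davis MSE values reported in Table 3. The choice of this convention is an assumption made for illustration; other conventions, such as baseline / ours − 1, give somewhat different percentages (about 16% instead of about 14% for DeepDTA).

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# MSE values on the Davis dataset, copied from Table 3.
davis_mse = {
    "KronRLS": 0.379,
    "SimBoost": 0.282,
    "DeepDTA": 0.261,
    "WideDTA": 0.262,
    "GraphDTA": 0.229,
}
our_mse = 0.225

for method, baseline in davis_mse.items():
    print(f"{method:9s} MSE {baseline:.3f} -> {our_mse:.3f}: "
          f"{100 * relative_reduction(baseline, our_mse):.1f}% lower")
```

The same function can be applied to the Kiba values in Table 4 by swapping in that table's MSE column.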
Tables 3 and 4 show that every deep learning model achieves a lower MSE on the Kiba dataset than on the Davis dataset, probably because the Kiba dataset is larger. We can also observe that deep learning-based methods are clearly better than traditional machine learning-based methods at predicting DTA. GraphDTA, which represents drugs as molecular graphs, performs better than DeepDTA, which directly processes linear input data. While representing drugs as molecular graphs, our model converts target proteins into sentences consisting of “biological words”, and uses a pre-trained
dictionary to generate embedding matrices, achieving better results than the GraphDTA model.
Fig. 2. Predictions from our model against measured binding affinity values.
Figure 2 shows scatter plots of predicted against measured values for the two datasets, where p is the predicted value and m is the measured value. The closer the predictions are to the measured values, the better the model, so the sample points should fall near the line p = m. For the Davis dataset, the dense region of pKd values lies between 5 and 6 on the x-axis, which is consistent with the actual data distribution of the dataset. For the Kiba dataset, the dense region of Kiba scores lies between 10 and 14 on the x-axis. For both datasets, the sample points lie close to the line p = m, which indicates that our model has good predictive performance.
4 Conclusions
Accurate prediction of DTA is a crucial and challenging task in drug discovery. In this work, we propose a method for predicting DTA based on graphs and word vectors, which represents drugs as graphs and feeds the atomic features of the drug graphs into a GCN for training. Our model uses word2vec to convert protein sequences into sentences consisting of “biological words” and feeds the embedding matrix into a 2D convolutional neural network for training. Experimental results show that our model not only predicts drug-target affinity better than non-deep learning models but also outperforms competing deep learning methods. On two independent benchmark datasets, our model performs well on almost all evaluation metrics. Beyond the drug-target binding affinity problem, our method can also be extended to other areas of data mining and bioinformatics, which is a direction for future research.
Acknowledgements. This work was supported by the National Natural Science Foundation of China (No. 61972299).
References
1. Ezzat, A., Wu, M., Li, X.L., et al.: Computational prediction of drug–target interactions using chemogenomic approaches: an empirical survey. Brief. Bioinform. 20(4), 1337–1357 (2019)
2. Chen, X., Yan, C.C., et al.: Drug-target interaction prediction: databases, web servers and computational models. Brief. Bioinform. 17(4), 696–712 (2016)
3. Strittmatter, S.M.: Overcoming drug development bottlenecks with repurposing: old drugs learn new tricks. Nat. Med. 20(6), 590–591 (2014)
4. Wang, Y., Zeng, J.: Predicting drug-target interactions using restricted Boltzmann machines. Bioinformatics 29, 126–134 (2013)
5. Tian, K., Shao, M., et al.: Boosting compound-protein interaction prediction by deep learning. Methods Companion Methods Enzymol. 110, 64–72 (2016)
6. Wan, F., Zeng, J.: Deep learning with feature embedding for compound-protein interaction prediction. bioRxiv (2016)
7. Hu, P., Chan, K.C., You, Z.H.: Large-scale prediction of drug-target interactions from deep representations. In: International Joint Conference on Neural Networks, pp. 1236–1243 (2016)
8. Wang, L., et al.: A computational-based method for predicting drug-target interactions by using stacked autoencoder deep neural network. J. Comput. Biol. 25(3), 361–373 (2018)
9. Wen, M., et al.: Deep-learning-based drug–target interaction prediction. J. Proteome Res. 16(4), 1401–1409 (2017)
10. Cer, R.Z., et al.: IC50-to-Ki: a web-based tool for converting IC50 to Ki values for inhibitors of enzyme activity and ligand binding. Nucl. Acids Res. 37 (2009)
11. Pahikkala, T., et al.: Toward more realistic drug–target interaction predictions. Brief. Bioinform. 16(2), 325–337 (2014)
12. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
13. He, T., Heidemeyer, M., Ban, F., Cherkasov, A., Ester, M.: SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. J. Cheminform. 9(1), 24 (2017)
14. Öztürk, H., Özgür, A., Ozkirimli, E.: DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 34(17), i821–i829 (2018)
15. Öztürk, H., Özgür, A., Ozkirimli, E.: WideDTA: prediction of drug-target binding affinity. arXiv (2019)
16. Woźniak, M., Wołos, A., Modrzyk, U., Górski, R., et al.: Linguistic measures of chemical diversity and the “keywords” of molecular collections. Sci. Rep. 8(1), 7598 (2018)
17. Fout, A., Byrd, J., Shariat, B., Ben-Hur, A.: Protein interface prediction using graph convolutional networks (2017)
18. Davis, M.I., Hunt, J.P., Herrgard, S., Ciceri, P., Wodicka, L.M., Pallares, G., et al.: Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 29(11), 1046–1051 (2011)
19. Tang, J., et al.: Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J. Chem. Inf. Model. 54(3), 735–743 (2014)
20. Landrum, G.: RDKit: Open-source cheminformatics (2006)
21. Weininger, D.: SMILES: a chemical language and information system. J. Chem. Inf. Comput. Sci. 28(1), 31–36 (1988)
22. Helson, H.E.: Structure diagram generation. Rev. Comput. Chem. 13, 313–398 (2007)
23. Ramsundar, B., Eastman, P., Walters, P., Pande, V.: Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More. O’Reilly Media (2019)
24. Asgari, E., Mofrad, M.R.K.: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10(11), e0141287 (2015)
25. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
26. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning, pp. 807–814. Omnipress (2010)
27. Pahikkala, T., et al.: Toward more realistic drug–target interaction predictions. Brief. Bioinform. 16(2), 325–337 (2014)
Drug-Target Interaction Prediction Based on Attentive FP and Word2vec
Yi Lei1, Jing Hu2,3,4(B), Ziyu Zhao2, and Siyi Ye1
1 International School, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
2 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
[email protected]
3 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, China
4 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
Abstract. The study of drug-target interactions (DTIs) plays a crucial role in identifying new drug targets, drug discovery, and drug improvement. However, determining DTIs experimentally is time-consuming and labor-intensive. Therefore, more and more researchers are turning to computational methods to predict drug-target interactions. This paper uses neural network models to study DTI prediction further. We use the Word2vec model to extract protein features and the Attentive FP model to extract drug features, and we evaluate the results with several widely used evaluation metrics. The experimental results show that our method works well on the human and C. elegans datasets.
Keywords: Drug-target interaction · Graph neural network · Word2vec · Attentive FP
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 507–516, 2022. https://doi.org/10.1007/978-3-031-13829-4_44
1 Introduction
In genomic drug discovery, detecting drug-target interactions (DTIs) is a highly significant area of research, which can lead to the identification of new drugs or of novel targets for existing drugs [1–3]. Although research on DTI prediction has developed rapidly after decades of effort, the experimental determination of DTIs remains very challenging and expensive, even with evolving techniques. It is therefore necessary and urgent to design effective and accurate computational prediction methods. In recent years, many machine learning methods have been applied in bioinformatics and related fields. For example, Chao Wang [4] applied metric learning to pedestrian re-identification to improve accuracy. Di Wu [5] introduced a deep attention architecture with multi-scale deep supervision to improve the efficiency of person re-identification.
Shihao Zhang [6] applied an Attention-Guided Network (AG-Net) to retinal image segmentation. Similarly, many machine learning-based drug-target interaction analysis algorithms have emerged. After realizing that sequence-based CPI models may suffer from inappropriate datasets, hidden ligand biases, and inappropriate dataset splitting, Lifan Chen constructed a new dataset specifically for CPI prediction, proposed a transformer neural network named TransformerCPI, and used a label reversal experiment to test whether the model truly learns interaction features [7]. Masashi Tsubaki studied end-to-end representation learning for compounds and proteins and developed a more competitive CPI prediction method by combining graph neural networks for compounds with convolutional neural networks for proteins [8]. Hui Liu integrated the chemical structures, chemical expression profiles, and side effects of compounds, along with amino acid sequences, protein-protein interaction networks, protein functional annotations, and other resources, into a systematic screening framework, aiming to establish a reliable set of CPI-negative samples by in silico screening as a valuable complement to existing compound-protein databases [9]. These studies are very innovative, but the classification performance of their machine learning models still leaves room for improvement. This paper employs graph attention mechanisms and deep neural networks to predict whether an unknown drug interacts with a target. Experimental results show that our method achieves better predictive performance.
2 Method
2.1 Datasets
Although existing datasets all yield high reported performance, they have hidden problems. Because many datasets used to evaluate machine learning-based CPI prediction methods [10–12] contain positive samples and randomly generated negative samples, the randomly generated negatives may include unknown positives. This leads to poor classifier performance on real test datasets. Screening for accurate negative samples is therefore an essential step in creating a high-confidence CPI dataset [13]. For this reason, this paper uses the same dataset as Masashi Tsubaki [8], since it contains highly confident negative samples of compound-protein pairs obtained with a systematic screening framework, and its positive samples were retrieved from two manually curated databases (DrugBank 4.1 [14] and Matador [15]). The human dataset used in the experiment contains 3369 positive interactions between 1052 unique compounds and 852 unique proteins; the C.elegans dataset contains 4000 positive interactions between 1434 unique compounds and 2504 unique proteins [8].
2.2 Representing Drug Molecules
The SMILES [16] format is a common standard used in many cheminformatics software packages and was developed by David Weininger and Arthur Weininger in the 1980s. There
are several ways to process molecules in this format. For example, Zaynab Mousavian [17] used the rcdk [18] package to encode each drug into an 881-dimensional binary vector. S. M. Hasan Mahmud [19] used drug IDs to collect molecular structures from the DrugBank database and then used the Rcpi toolkit [20] to convert the chemical structures into 193-dimensional feature vectors. Here, this paper uses a graph neural network model named Attentive FP [21] for molecular representation, which introduces an attention mechanism at the intramolecular level so that the method can focus on the parts of the molecule most relevant to its action on the target. The main steps are as follows:
(1) We use RDKit [22] to convert the molecular formula in SMILES format into a mol object, analyze the atoms and bonds in the molecule, collect their features, and represent them with one-hot encoding (see Tables 1 and 2). Note that the indices of the two atoms at either end of each bond must be saved so that the molecular graph can be generated later.
(2) Taking the 36-dimensional atom feature vectors, the 10-dimensional bond feature vectors, and the graph as input, several atom-embedding layers produce atom state vectors that incorporate progressively more neighborhood information. Several molecule-embedding layers then generate the state vector of the entire molecule. The resulting 128-dimensional state vector encodes the structural information of the molecular graph.

Table 1. Initial atomic features.
| Atom feature | Size | Description |
|---|---|---|
| Atom symbol | 14 | [N,C,O,P,S,F,Cl,Na,I,Br,Mg,Ca,K,Co] (one-hot) |
| Degree | 6 | Number of covalent bonds [0,1,2,3,4,5] (one-hot) |
| Hybridization | 6 | [sp,sp2,sp3,sp3d,sp3d2,other] (one-hot) |
| Aromaticity | 1 | Whether the atom is part of an aromatic system [0/1] (one-hot) |
| Hydrogens | 6 | Number of connected hydrogens [0,1,2,3,4,5] (one-hot) |
| Chirality | 1 | Whether the atom is a chiral center [0/1] (one-hot) |
| Chirality type | 2 | [R,S] (one-hot) |
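As an illustration of how the entries of Table 1 add up to a 36-dimensional atom vector, the following is a minimal dependency-free sketch. The function and constant names are hypothetical and are not taken from the authors' code. The atom properties are passed in directly rather than read from an RDKit mol object, and mapping unlisted hybridizations to "other" is an assumption.

```python
# Category lists copied from Table 1; all names below are illustrative only.
ATOM_SYMBOLS = ["N", "C", "O", "P", "S", "F", "Cl", "Na", "I", "Br", "Mg", "Ca", "K", "Co"]
DEGREES = [0, 1, 2, 3, 4, 5]
HYBRIDIZATIONS = ["sp", "sp2", "sp3", "sp3d", "sp3d2", "other"]
NUM_HYDROGENS = [0, 1, 2, 3, 4, 5]
CHIRALITY_TYPES = ["R", "S"]


def one_hot(value, choices):
    """Return a 0/1 list with a single 1 at the position of value (all zeros if absent)."""
    return [1 if value == choice else 0 for choice in choices]


def atom_features(symbol, degree, hybridization, aromatic, num_hs,
                  chiral_center, chirality_type=None):
    """Concatenate the Table 1 encodings into one 36-dimensional vector."""
    if hybridization not in HYBRIDIZATIONS:
        hybridization = "other"  # assumption: unlisted hybridizations map to "other"
    return (
        one_hot(symbol, ATOM_SYMBOLS)              # 14
        + one_hot(degree, DEGREES)                 # 6
        + one_hot(hybridization, HYBRIDIZATIONS)   # 6
        + [int(aromatic)]                          # 1
        + one_hot(num_hs, NUM_HYDROGENS)           # 6
        + [int(chiral_center)]                     # 1
        + one_hot(chirality_type, CHIRALITY_TYPES) # 2
    )


# Example: an aromatic carbon with two heavy-atom neighbors and one hydrogen.
vector = atom_features("C", 2, "sp2", True, 1, False)
print(len(vector))
```

The same pattern would apply to the 10-dimensional bond vector, using the category lists of the bond feature table instead.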
Table 2. Initial bond features.
| Bond feature | Size | Description |
|---|---|---|
| Bond type | 4 | [Single, double, triple, aromatic] (one-hot) |
| Conjugation | 1 | Whether the bond is conjugated [0/1] (one-hot) |
| Ring | 1 | Whether the bond is in a ring [0/1] (one-hot) |
| Stereo | 4 | [StereoNone, StereoAny, StereoZ, StereoE] (one-hot) |
2.3 Representing Target Proteins
To extract informative protein features, we embed protein sequences with Word2vec, a successful word embedding technique widely used in NLP tasks [23]. This paper uses the skip-gram model from Word2vec. Its architecture is similar to CBOW, but instead of predicting the center word from its context, it predicts the context words from the center word [23]. It is trained as follows: the output of the hidden layer is the word vector of each word, and a softmax function in the output layer then generates a probability distribution over the words that may appear within the context window of the current word; the word with the highest probability is taken as the predicted word [24]. The training complexity of this architecture is

$$Q = C \times (D + D \times \log_2 V) \tag{1}$$
Here C is the maximum distance between words, D is the dimensionality of the word vectors, and V is the vocabulary size [23]. Following Fangping Wan [10], this paper regards each protein sequence as a sentence read in the order in which amino acids are arranged during protein biosynthesis, starting from the first, second, and third amino acid residues of the N-terminus, respectively. For each starting position, every three consecutive non-overlapping residues form a word, and trailing residues that cannot form a word are discarded, so each protein sequence yields three word sequences, one per starting position. Each word is equipped with a formal embedding and auxiliary embeddings, which are trained with the skip-gram Word2vec model to learn relationships between contexts. The protein feature vector is obtained by summing and averaging the embeddings over the three word sequences. The dimension of our word embedding is 100, so the 100-dimensional word vectors represent the feature information of each protein sequence, and the final protein feature vector is also 100-dimensional.
2.4 Representation of Drug-Target Pair
This paper uses an approach similar to that of Shuangjia Zheng [25] to combine drug and target features.
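As an illustration of the word construction described in Sect. 2.3, the sketch below splits a sequence into non-overlapping three-residue words for the three starting positions and averages the word vectors. The function names and the toy embedding table are hypothetical. Averaging over all words of the three frames is one possible reading of "summing and averaging" and is an assumption, not the authors' confirmed procedure.

```python
def split_into_words(sequence, k=3):
    """Return k word lists, one per starting offset 0..k-1.

    Words are consecutive, non-overlapping k-residue chunks; trailing residues
    that cannot form a complete word are discarded.
    """
    frames = []
    for offset in range(k):
        usable = (len(sequence) - offset) // k * k  # residues that fit whole words
        frames.append([sequence[i:i + k] for i in range(offset, offset + usable, k)])
    return frames


def average_embedding(frames, word_vectors, dim):
    """Average the vectors of all known words across the frames.

    word_vectors is a plain dict {word: list of floats}, a stand-in for a
    trained skip-gram model (assumption). Unknown words are skipped.
    """
    total = [0.0] * dim
    count = 0
    for words in frames:
        for word in words:
            vec = word_vectors.get(word)
            if vec is None:
                continue
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total


frames = split_into_words("MKVLAAGI")
print(frames)
```

In practice the dictionary would be replaced by vectors learned with a skip-gram Word2vec implementation, with `dim` set to 100 as in the text.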
The 128-dimensional molecular state vector output by the Attentive FP model is passed through a linear transformation layer to obtain a new 128-dimensional vector. An ELU activation function is then applied, and some features are dropped by a dropout layer (with the dropout rate set to 0.2) to improve robustness. A second linear transformation layer then produces the final molecular feature vector, and a ReLU activation function is applied to increase the nonlinearity of the model. The word vector is first normalized and passed through an ELU activation function, and the original 100-dimensional word vector is then linearly transformed into a 128-dimensional feature vector, matching the size of the molecular feature vector. The two 128-dimensional vectors are concatenated into a 256-dimensional vector that contains both the drug's molecular information and the target information. This vector is linearly transformed to 64 dimensions, passed through a ReLU activation function and a dropout layer, and then linearly transformed into a 2-dimensional vector; the sigmoid function is used to determine whether the drug and the target interact.
2.5 Evaluation Criteria
To evaluate the classification performance of the deep neural network constructed in this study, we employ several widely used measures, including precision, recall, accuracy, and area under the ROC curve [26]. The area under the ROC curve is the most important evaluation criterion.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{4}$$
Table 3. Confusion matrix.
| Predicted \ Actual | True | False |
|---|---|---|
| Positive | True positive (TP) | False positive (FP) |
| Negative | False negative (FN) | True negative (TN) |

Here TP denotes a drug-target pair that interacts and is correctly predicted to interact, and FP denotes a pair that does not interact but is incorrectly predicted to interact. FN denotes a pair that interacts but is incorrectly predicted not to interact, and TN denotes a pair that does not interact and is correctly predicted not to interact (as shown in Table 3).
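The measures of Sect. 2.5 can be computed directly from the confusion-matrix counts. The following is a minimal sketch with hypothetical function names and made-up example counts. The AUC helper uses the pairwise-ranking definition (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half), which equals the area under the ROC curve but is not necessarily how the authors computed it.

```python
def classification_metrics(tp, fp, tn, fn):
    """Recall, precision and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"recall": recall, "precision": precision, "accuracy": accuracy}


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 0/1, scores are predicted interaction probabilities.
    This quadratic-time version is meant for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up counts, for illustration only.
print(classification_metrics(tp=90, fp=10, tn=85, fn=15))
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```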
3 Experiment
3.1 Performance on Human Dataset and C.elegans Dataset
This paper uses the Human dataset and the C.elegans dataset to train the model. Each dataset is divided into five subsets using the KFold method in sklearn.model_selection, and five-fold cross-validation is performed: each time, one subset is used as the test set and the remaining four subsets are used as the training set. Table 4 shows the results on the training and test sets of the two datasets. Under the same representation method and neural network model, the test results on the C.elegans dataset are slightly better than those on the Human dataset in recall, precision, and accuracy, suggesting that this dataset contains information better suited to the method.

Table 4. Prediction results on the Human and C.elegans datasets.
| Dataset | Recall | Precision | Accuracy | AUC |
|---|---|---|---|---|
| Human_train | 0.987 | 0.986 | 0.987 | 0.999 |
| Human_test | 0.944 | 0.935 | 0.94 | 0.987 |
| C.elegans_train | 0.984 | 0.986 | 0.986 | 0.999 |
| C.elegans_test | 0.95 | 0.956 | 0.958 | 0.986 |
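The five-fold protocol described above can be sketched without any dependencies as follows. This is an illustrative re-implementation of the splitting idea, not the scikit-learn `KFold` code itself. The function name, the shuffling, and the seed are assumptions, since the text does not state whether the data were shuffled.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each sample appears in exactly one test fold; fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting

    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    for i in range(n_splits):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for fold, (train, test) in enumerate(k_fold_indices(12)):
    print(fold, len(train), len(test))
```

The per-fold metrics would then be averaged, which is one way to arrive at mean ± standard deviation figures such as those reported for the comparison with baselines.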
Figures 1 and 2 show the ROC curves based on the Human and C.elegans datasets, respectively. It can be seen that the ROC curves of both are very close to the upper left corner, and the average AUC value of the Human test dataset is 0.987. The average AUC value of the C.elegans test dataset is 0.986, which indicates that the method has good classification performance.
Fig. 1. The ROC curve of one random prediction on human
Fig. 2. The ROC curve of one random prediction on C.elegans
3.2 Comparison with the Results of Existing Papers
The results of this study are compared with those reported for existing methods, namely K nearest neighbors (KNN), random forest (RF), L2-logistic (L2), support vector machines (SVM), GraphDTA, GCN, CPI-GNN, and DrugVQA. The models are compared on standard evaluation criteria: AUC, precision, and recall. Since our dataset is the same as Lifan Chen's [7], we use the baseline results collected in that work, as shown in Tables 5 and 6. Our model achieves the highest AUC on both datasets, that is, it shows a higher ability to capture the characteristics of interactions between compounds and proteins (Tables 5 and 6).

Table 5. Comparison results of the proposed model and baselines on the human dataset.
| Method | AUC | Precision | Recall |
|---|---|---|---|
| KNN | 0.86 | 0.927 | 0.798 |
| RF | 0.94 | 0.897 | 0.861 |
| L2 | 0.911 | 0.913 | 0.967 |
| SVM | 0.91 | 0.966 | 0.969 |
| GraphDTA | 0.960 ± 0.005 | 0.882 ± 0.040 | 0.912 ± 0.040 |
| GCN | 0.956 ± 0.004 | 0.862 ± 0.006 | 0.928 ± 0.010 |
| CPI-GNN | 0.97 | 0.918 | 0.923 |
| DrugVQA (VQA-seq) | 0.964 ± 0.005 | 0.897 ± 0.004 | 0.948 ± 0.003 |
| AttentiveFP_W2V | 0.978 ± 0.010 | 0.935 ± 0.019 | 0.944 ± 0.030 |

Table 6. Comparison results of the proposed model and baselines on the C.elegans dataset.
| Method | AUC | Precision | Recall |
|---|---|---|---|
| KNN | 0.858 | 0.801 | 0.827 |
| RF | 0.902 | 0.821 | 0.844 |
| L2 | 0.892 | 0.89 | 0.877 |
| SVM | 0.894 | 0.785 | 0.818 |
| GraphDTA | 0.974 ± 0.004 | 0.927 ± 0.015 | 0.912 ± 0.023 |
| GCN | 0.975 ± 0.004 | 0.921 ± 0.008 | 0.927 ± 0.006 |
| CPI-GNN | 0.978 | 0.938 | 0.929 |
| AttentiveFP_W2V | 0.986 ± 0.005 | 0.956 ± 0.014 | 0.953 ± 0.005 |
4 Conclusion
This paper proposes a deep neural network that combines an attention-based graph neural network with Word2vec to predict interactions between drugs and targets. According to the properties of the molecule, we extract its molecular fingerprint and convert it into one-hot encoding to represent the molecular feature vector. For proteins, we train a word vector for each protein by treating the protein as a sentence and every three residues as a word. In the neural network, this paper uses activation functions and multi-layer linear transformations to integrate the collected feature vectors several times and uses the sigmoid function to determine whether there is an interaction between the drug and the target.
This paper compared the results of the research with the data in existing papers and found that the proposed model achieves the highest AUC on both datasets and the highest precision and recall on the C.elegans dataset. Furthermore, the overall prediction performance is more robust on the C.elegans dataset. We believe that predicting drug-target interactions with deep neural networks can effectively save time in screening drug compounds or antibodies and mitigate the high-risk and high-cost problems of traditional drug development. In the future, we will continue to collect and publish more experimental data based on more extensive testing.
Acknowledgment. This work is supported by the National Natural Science Foundation of China (No. 61972299).
References
1. Masoudi-Nejad, A., Mousavian, Z., Bozorgmehr, J.H.: Drug-target and disease networks: polypharmacology in the post-genomic era. In Silico Pharmacol. 1(1), 17 (2013)
2. Hasan Mahmud, S.M., Chen, W., Meng, H., Jahan, H., Yongsheng Liu, S.M., Hasan, M.: Prediction of drug-target interaction based on protein features using undersampling and feature selection techniques with boosting. Anal. Biochem. 589, 113507 (2020)
3. Mahmud, S.M.H., et al.: iDTi-CSsmoteB: identification of drug-target interaction based on drug chemical structure and protein sequence using XGBoost with over-sampling technique SMOTE. IEEE Access 7(2019), 48699–48714 (2019)
4. Wang, C., Pan, Z., Li, X.: Multilevel metric rank match for person re-identification. Cogn. Syst. Res. 65(2021), 98–106 (2020)
5. Wu, D., Wang, C., Wu, Y., Wang, Q.: Attention deep model with multi-scale deep supervision for person re-identification. IEEE Trans. Emerg. Top. Comput. Intell. 5(1), 70–78 (2021)
6. Zhang, S., et al.: Attention guided network for retinal image segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11764, pp. 797–805. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32239-7_88
7. Chen, L., et al.: TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 36, 4406–4414 (2020)
8. Tsubaki, M., Tomii, K., Sese, J.: Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 35(2), 309–318 (2019)
9. Liu, H., Sun, J., Guan, J., Zheng, J., Zhou, S.: Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics 2015(12), i221–i229 (2015)
10. Wan, F., Zeng, J.: Deep learning with feature embedding for compound-protein interaction prediction (2016)
11. Tian, K., Shao, M., Wang, Y., Guan, J., Zhou, S.: Boosting compound-protein interaction prediction by deep learning. Methods 110, 64–72 (2016)
12. Hamanaka, M., et al.: CGBVS-DNN: prediction of compound-protein interactions based on deep learning. Mol. Inf. 36(1–2), 1600045 (2017)
13. Ding, H., Takigawa, I., Mamitsuka, H., Zhu, S.: Similarity-based machine learning methods for predicting drug–target interactions: a brief review. Briefings in Bioinf. 15(5), 734–747 (2014)
14. Wishart, D.S., et al.: DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36(Database issue), D901–D906 (2008)
15. Gunther, S., et al.: SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res. 36(Database), D919–D922 (2007)
16. Weininger, D.: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inform. Model. 28(1), 31–36 (1988)
17. Mousavian, Z., Khakabimamaghani, S., Kavousi, K., Masoudi-Nejad, A.: Drug–target interaction prediction from PSSM based evolutionary information. J. Pharmacol. Toxicol. Methods 78, 42–51 (2016)
18. Guha, R.: Chemical informatics functionality in R. J. Stat. Softw. 18(5), 359–361 (2007)
19. Hasan Mahmud, S.M., Chen, W., Jahan, H., Dai, B., Din, S.U., Dzisoo, A.M.: DeepACTION: a deep learning-based method for predicting novel drug-target interactions. Anal. Biochem. 610, 113978 (2020)
20. Cao, D.-S., Xiao, N., Xu, Q.-S., Chen, A.F.: Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics 31(2), 279–281 (2015)
21. Xiong, Z., et al.: Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63(16), 8749–8760 (2020)
22. Landrum, G.: RDKit: open-source cheminformatics from machine learning to chemical registration. Abstracts of Papers Am. Chem. Soc. 258 (2019)
23. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. Computer Science abs/1301.3781 (2013)
24. Zhang, Y., et al.: SPVec: a word2vec-inspired feature representation method for drug-target interaction prediction. Front. Chem. 7, 895 (2019)
25. Zheng, S., Li, Y., Chen, S., Jun, X., Yang, Y.: Predicting drug–protein interaction using quasi-visual question answering system. Nat. Mach. Intell. 2(2), 134–140 (2020)
26. Gan, H., Hu, J., Zhang, X., Huang, Q., Zhao, J.: Accurate prediction of hot spots with greedy gradient boosting decision tree. In: Huang, D.-S., Jo, K.-H., Zhang, X.-L. (eds.) ICIC 2018. LNCS, vol. 10955, pp. 353–364. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-95933-7_43
Unsupervised Prediction Method for Drug-Target Interactions Based on Structural Similarity Xinyuan Zhang1 , Xiaoli Lin1,2,3(B) , Jing Hu1,2,3 , and Wenquan Ding1 1 College of Computer Science and Technology, Wuhan University of Science and Technology,
Wuhan, Hubei, China {201903164171,linxiaoli,hujing}@wust.edu.cn 2 Hubei Key Laboratory of Intelligent Information Processing and Realtime Industrial System, Wuhan, Hubei, China 3 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
Abstract. Predicting drug-target interactions is important in drug discovery and drug repositioning. Discovering drug-target interactions by computational methods still has great potential and advantages in many respects, such as the ability to focus on drugs or target proteins of interest. This paper proposes an unsupervised clustering model (OpBGM) based on structural similarity, which combines the OPTICS and BGMM algorithms. First, the required PDB files are obtained. Then, interaction pairs are defined and extracted from each PDB file. The interactions are encoded, and their dimensionality is reduced using the PCA algorithm. OPTICS is used to detect and remove noise, and BGMM is used to extract significant interaction pairs. Potential binding sites and drug similarities are discovered through the interaction pairs. In addition, a target protein is randomly selected to dock with each drug in one cluster. Clusters with an average affinity below −6 kcal/mol account for 82.73% of all clusters, which shows the feasibility of the proposed prediction method. Keywords: Drug-target interactions · Target identification · Drug repurposing · OpBGM · Clustering
1 Introduction
A protein-ligand complex is formed by reversible non-covalent interactions between two molecules, such as hydrogen bonds [1], hydrophobic forces, and π-π interactions [2]. A non-covalent interaction is one that does not involve a covalent bond [3]. Proteins are an indispensable part of the human body and play a key role in various life activities [4, 5]. Representative protein classes include G protein-coupled receptors [6, 7], enzymes, and so on. With the development of intelligent computing technology, more and more small-molecule ligands and their target interactions have been found, which can be obtained from various protein or drug databases. Studies on drug-target interactions are very meaningful and address three questions: (1) for a given protein and drug, where on the protein surface small molecules can bind; (2) given a protein, which drugs may interact with it; (3) conversely, given a ligand, which proteins may interact with it [8].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 517–532, 2022. https://doi.org/10.1007/978-3-031-13829-4_45
In recent years, many methods for predicting drug-target interactions have been proposed. Karasev et al. [9] developed an algorithm that scores the positions of amino acids in a protein sequence to predict protein-ligand interactions. Keum et al. [10] used a semi-supervised learning model based on support vector machines for potential drug discovery. Olayan et al. [11] proposed a new algorithm based on random forests, which aggregates similarity criteria between molecules to predict drug-target interactions. With the development of neural networks, prediction methods based on neural network models have also appeared. For example, Verma et al. [12] designed a new framework based on a deep neural network (DNN). There are also prediction methods based on convolutional neural networks (CNN) [13]. Huang et al. [14] designed a drug-target interaction transformer called MolTrans to predict whether an interaction is possible. Research on new drugs is costly and requires significant investment in manpower, development technologies, and equipment. The prediction of drug-target interactions provides a preliminary judgment on the feasibility of drugs. Although there are many methods for drug-target interaction prediction, most of them are semi-supervised or fully supervised training schemes, and unsupervised approaches to drug-target prediction remain comparatively scarce [15]. Unsupervised learning also has a wide range of applications in the biological sciences, especially when features are numerous.
For example, deep contextual language models, which are widely used in unsupervised learning, have been combined with neural networks to discover knowledge about the intrinsic biological characteristics of proteins [16]. In this paper, based on the known three-dimensional structures of protein-ligand interactions, an unsupervised clustering model (OpBGM) is used to cluster the interaction pairs in order to discover similar ligands. From these similar ligands, similar target proteins can be identified, which helps to discover other feasible options for the treatment of a given disease. Finally, docking verification is carried out. The whole process consists of the following six parts: (1) extracting data based on disease; (2) acquiring interaction pairs; (3) coding and optimization; (4) clustering based on the OpBGM model; (5) extracting protein-ligand information; (6) docking proteins with ligands for validation.
2 Methods
In this paper, the process of predicting drug-target interactions includes database mapping, extraction of interaction pairs, dataset processing and optimization, clustering, and docking verification. The PDB files are obtained from the MalaCards, DrugBank, UniProt, and RCSB PDB databases through tag extraction and a web crawler. The detailed process is shown in Fig. 1.
Fig. 1. The process of predicting drug-target interactions
2.1 Datasets
Obtaining Dataset About Drugs and Targets of Related Diseases
The MalaCards database [17] records key information about various human diseases. For each disease, the database provides a dedicated web page that integrates the information that has been collected about the disease. This paper searches for drugs associated with heart disease. Drug information is obtained through the CAS number, the unique identifier of each drug. A Python script using a crawler package extracts the text of the HTML tags in the page source, yielding the names, CAS numbers, and PubChem IDs of all drugs. The DrugBank database [18] integrates detailed data on various drugs, including their chemical information, biological information, and interactions between drugs and target proteins. First, the HTML file of the DrugBank database is downloaded. Regular expressions are then used to match the content of the HTML tags to obtain information about all drugs related to heart disease. The information about the target proteins of each drug is extracted, including the UniProt ID of the target protein and the DrugBank ID and CAS number of the drug.
Obtaining PDB Files
UniProt (Universal Protein) keeps very detailed records of proteins, including extensive information about their biological functions drawn from the literature. By mapping the CAS numbers of the extracted drugs to the CAS numbers in the UniProt database, the UniProt numbers of the target proteins related to the drugs can be obtained. Then, the corresponding PDB files can be obtained according to the UniProt numbers. Most PDB files contain three-dimensional information on the interaction between a protein and a ligand [19]. One representative PDB file number is taken and stored for each protein, and a total of 935 PDB files are extracted. Finally, the required PDB files are downloaded through the batch download script provided on the RCSB PDB website. Excluding the files without ligands, there are 118 files with one ligand and 117 files with two ligands, both of which account for a high percentage of cases. In addition, there are also many proteins with more than 15 ligands, which have more interacting binding sites. The detailed statistics are shown in Fig. 2. The horizontal axis represents the number of ligands in a PDB file, and the vertical axis represents the number of PDB files with that number of ligands.
Fig. 2. Statistics on the number of ligands in PDB files
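A per-file ligand count of the kind summarized in Fig. 2 could be obtained along the following lines. This is a minimal sketch that reads the fixed-width `HETATM` records of the PDB format (residue name in columns 18-20, chain identifier in column 22, residue sequence number in columns 23-26). The function name is hypothetical, and treating each distinct non-water `HETATM` residue as one ligand is an assumption, since the paper does not state its counting rule.

```python
def count_ligands(pdb_text, ignore=("HOH",)):
    """Count distinct non-water HETATM residues in PDB-format text.

    A ligand instance is identified by (residue name, chain ID, residue number).
    Residue names listed in ignore (water by default) are skipped.
    """
    ligands = set()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        res_name = line[17:20].strip()   # columns 18-20
        chain_id = line[21:22]           # column 22
        res_seq = line[22:26].strip()    # columns 23-26
        if res_name in ignore:
            continue
        ligands.add((res_name, chain_id, res_seq))
    return len(ligands)
```

Applying the function to every downloaded file and tallying the results with `collections.Counter` would give a histogram like the one in Fig. 2.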
2.2 Extraction of Interaction Pairs
An interaction pair is defined as the interaction between a protein fragment and a ligand atom [20]. First, the PDB files need to be transformed into mol2 files with Sybyl type information. A mol2 file retains the relevant information of the PDB file and adds Sybyl type information to each atom in the protein-ligand complex. OpenBabel [21] is used to convert the obtained PDB files into mol2 files in batches. Hydrogen atoms are removed from the obtained files. After discarding the files that could not be converted and those without ligands, a total of 864 mol2 files are obtained. A potential interaction pair contains a protein fragment and the Sybyl type of a ligand atom. The protein fragments in each file need to be traversed. Proteins are composed of 20 amino acids, and the adjacent atoms within each amino acid are fixed. It is therefore necessary to identify which atoms within each of the 20 amino acids are adjacent and to collect all protein fragments of an amino acid into a two-dimensional array, which is recorded in a dictionary keyed by amino acid name. Then, each mol2 file is traversed. First, the amino acid atom set and the small-molecule atom set of each file are extracted according to their type, and water molecules are removed from the small-molecule set. The set of amino acid atoms is traversed, and all protein fragments contained in it are looked up in the dictionary of adjacent protein fragments. Each fragment is then matched against every atom of the small-molecule ligands, and the distance between their three-dimensional coordinates is checked against a threshold of 5 Å [20]. If the distance is less than 5 Å, the two are considered to potentially interact, and they are recorded as a potential interaction pair. Finally, the interaction pairs of proteins and ligands in all mol2 files are obtained, with a total of 423,639 lines of data covering 315 ligand types.
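The distance test at the heart of this extraction can be sketched as follows. The data layout (dictionaries with `atoms`, `coords`, `sybyl`, and `xyz` keys) and the function name are hypothetical. The text does not say how the distance between a three-atom fragment and a ligand atom is measured, so the sketch assumes the distance from the nearest fragment atom; other conventions, such as requiring all three atoms to be within the cutoff, are equally plausible.

```python
import math


def find_interaction_pairs(fragments, ligand_atoms, cutoff=5.0):
    """Return (fragment atom names, ligand Sybyl type) for pairs closer than cutoff.

    fragments:    list of {"atoms": (name1, name2, name3), "coords": [xyz1, xyz2, xyz3]}
    ligand_atoms: list of {"sybyl": str, "xyz": (x, y, z)}
    Distances are in angstroms. Assumption: a fragment is as far from a ligand
    atom as its nearest atom is.
    """
    pairs = []
    for fragment in fragments:
        for ligand in ligand_atoms:
            nearest = min(math.dist(xyz, ligand["xyz"]) for xyz in fragment["coords"])
            if nearest < cutoff:
                pairs.append((fragment["atoms"], ligand["sybyl"]))
    return pairs


fragments = [{"atoms": ("N", "CA", "C"),
              "coords": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.5, 1.0, 0.0)]}]
ligand_atoms = [{"sybyl": "O.3", "xyz": (0.0, 0.0, 4.0)},
                {"sybyl": "C.ar", "xyz": (20.0, 0.0, 0.0)}]
print(find_interaction_pairs(fragments, ligand_atoms))
```

For files with many atoms, a spatial index such as a k-d tree would avoid the all-pairs loop, at the cost of an extra dependency.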
Fig. 3. Column bar chart of ligand types in interaction pairs
Table 1. Statistical table of ligand types in interaction pairs.
| Ligands | Number | Ratio |
|---|---|---|
| MSE | 64432 | 15.21% |
| HEM | 23860 | 5.63% |
| NAP | 18223 | 4.30% |
| SO4 | 17668 | 4.17% |
| GOL | 13374 | 3.16% |
| NAG | 12040 | 2.84% |
| NAD | 11840 | 2.79% |
| FAD | 11227 | 2.65% |
| CIT | 8827 | 2.08% |
| CA | 8675 | 2.05% |
| MLY | 8550 | 2.02% |
| ATP | 7821 | 1.85% |
| SAH | 7132 | 1.68% |
| ADP | 6676 | 1.58% |
| EDO | 6382 | 1.51% |
Some of the ligands involved in the interaction pairs are shown in Fig. 3. The vertical axis is the ligand type, and the horizontal axis is the number of interaction pairs of each ligand. The interaction pairs of the ligand MSE account for the largest proportion. The detailed numbers and proportions of interaction pairs are recorded in Table 1.
2.3 Prediction of Interaction Pairs Based on OpBGM Model
Preprocessing
Good clustering results require good features. The three atoms of the protein fragment and the Sybyl type of the ligand atom in each interaction pair are selected as features. These features are categorical, with no inherent order among their values. To map them into Euclidean space and make it easier to calculate the similarity between individuals in a cluster, one-hot encoding is used, giving a feature matrix with a total of 83 features. High-dimensional data can reduce clustering accuracy and strongly affect the Silhouette coefficient, so the PCA algorithm [22] is used to reduce the dimensionality while retaining at least 85% of the valid information of the original dataset. As a result, the dimensionality is reduced from 83 to 20 new features, which retain 85.4% of the effective information.
Prediction Based on OpBGM Model
To eliminate noise points and improve intra-cluster similarity, a hybrid clustering model (OpBGM), which combines OPTICS [23] and the Bayesian Gaussian Mixture Model (BGMM) [24], is used to obtain similar drugs or ligands. Table 2 shows the steps of the OpBGM algorithm. First, OPTICS is used to cluster the original dataset, which identifies the noise points among the original interaction pairs. The noise points are removed to improve the clustering. With the criterion of retaining about 90% of the dataset, the minimum number of neighborhood samples of a core point is set to 40, and the neighborhood distance between two samples is set to 0.95. This yields 40,163 noise points, which correspond to interaction pairs that occur relatively rarely and differ from the other interaction pairs. Then, BGMM clustering is performed to assign a label to each data point, and the significant interactions are discovered. The BGMM can infer an approximate posterior distribution over the parameters of a Gaussian mixture distribution.
Table 2. The steps of OpBGM algorithm

Input: MOL2 files of all protein-ligand interactions to be used.
Output: A label for each interaction pair of a protein fragment and a ligand atom Sybyl type.
Begin
1. Extract the protein fragments in each MOL2 file one by one.
2. Compute the distance between each protein fragment of an amino acid and each ligand atom, and store every pair with a distance of less than 5 angstroms as a potential interaction pair.
3. Integrate all potential interaction pairs stored from the MOL2 files, and extract the ligand atom Sybyl types and protein fragments as clustering features.
4. Map the Sybyl types and protein fragments into Euclidean space by One-Hot encoding.
5. Reduce the dimension of the dataset obtained in the previous step by PCA, retaining 85% of the information.
6. Use the OPTICS clustering algorithm to preliminarily cluster the dataset, and eliminate the interaction data judged to be noise points.
7. Perform PCA dimensionality reduction again on the dataset after eliminating the noise points, retaining 85% of the information.
8. Cluster the processed dataset with the Bayesian Gaussian mixture model to predict the labels, and record the labels.
End
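Steps 4 to 8 of Table 2 can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the reported neighborhood value 0.95 is assumed to correspond to the `max_eps` parameter of OPTICS, and a DBSCAN-style cluster extraction is assumed for deciding which points are noise.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import OneHotEncoder


def encode_and_reduce(pairs, variance=0.85):
    """Steps 4-5 (and 7): one-hot encode the categorical pairs, then keep
    enough principal components to explain `variance` of the variance."""
    one_hot = OneHotEncoder().fit_transform(pairs).toarray()
    return PCA(n_components=variance, svd_solver="full").fit_transform(one_hot)


def opbgm_sketch(pairs, n_clusters, min_samples=40, max_eps=0.95, seed=0):
    """Return (labels, noise_mask); noise points keep the label -1.

    `pairs` holds one row per interaction pair: three fragment atom names
    followed by the Sybyl type of the ligand atom.
    """
    pairs = np.asarray(pairs)
    # Step 6: OPTICS marks low-density points as noise (label -1).
    # Assumption: DBSCAN-style extraction with eps equal to max_eps.
    optics = OPTICS(min_samples=min_samples, max_eps=max_eps,
                    cluster_method="dbscan")
    noise = optics.fit(encode_and_reduce(pairs)).labels_ == -1
    # Steps 7-8: re-encode the retained pairs and cluster them with BGMM.
    reduced = encode_and_reduce(pairs[~noise])
    bgmm = BayesianGaussianMixture(n_components=n_clusters, random_state=seed)
    labels = np.full(len(pairs), -1)
    labels[~noise] = bgmm.fit_predict(reduced)
    return labels, noise
```

With the settings reported in the text, the call would look like `opbgm_sketch(pairs, n_clusters=110)`, where the default `min_samples=40` and `max_eps=0.95` mirror the values given above.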
3 Experimental Results

3.1 Analysis of Clustering

In this paper, the OpBGM algorithm is used for clustering, and the Silhouette Coefficient is selected as the criterion for judging the clustering results. 100,000 data points are randomly selected to calculate the Silhouette score. Five bar charts of the Silhouette score at 50, 80, 110, 140, and 170 clusters are generated, as shown in Fig. 4. The vertical axis indicates the cluster number sorted by Silhouette score, and the horizontal axis indicates the Silhouette coefficient of the cluster. In Fig. 4, the red dotted line is the total Silhouette score of the clustering. When the number of clusters is 50, 80, 110, 140, and 170, the Silhouette score is 0.432, 0.587, 0.676, 0.741, and 0.792, respectively. The more clusters there are, the higher the average Silhouette score and the overall similarity within clusters, which is a normal phenomenon in the case of sparse sample labels. The total Silhouette score keeps increasing as the number of clusters increases, but from 110 clusters onward it grows more slowly. After OPTICS clustering and noise elimination, the Silhouette scores for 50, 80, 110, 140, and 170 clusters are 0.511, 0.680, 0.761, 0.823, and 0.867, respectively, which are increases of 0.079, 0.094, 0.085, 0.081, and 0.075 compared with those before the noise elimination. According to the elbow method, 110 is selected as the number of clusters used in the following, and the results for 110 clusters are analyzed here. Figure 5 shows the Silhouette score with noise points removed for 110 clusters.
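The evaluation described above, scoring several candidate cluster counts on a random subsample, can be sketched as follows. The helper name and the toy data are hypothetical; scikit-learn's `silhouette_score` with its `sample_size` argument is used here as one plausible way to score a random subset, and is not necessarily what the authors used.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import BayesianGaussianMixture


def silhouette_per_cluster_count(X, cluster_counts, sample_size, seed=0):
    """Fit a BGMM for each candidate cluster count and compute the
    Silhouette score on a random subsample of `sample_size` points."""
    scores = {}
    for k in cluster_counts:
        model = BayesianGaussianMixture(n_components=k, random_state=seed)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels, sample_size=sample_size,
                                     random_state=seed)
    return scores


# Toy data: three well-separated 2-D blobs stand in for the reduced features.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in centers])
scores = silhouette_per_cluster_count(X, cluster_counts=(2, 3, 4), sample_size=150)
```

For the setting in the text, `cluster_counts` would be `(50, 80, 110, 140, 170)` and `sample_size` would be `100000`; the elbow is then read off from how quickly the scores stop improving.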
Fig. 4. Silhouette score at 50, 80, 110, 140, 170 clusters (Color figure online)
Fig. 5. Silhouette score with noise points removed for 110 clusters
Figure 6 gives the similarity of some of the clusters when 110 clusters are used. The blue part represents the percentage of similarity, and the orange part represents the percentage of dissimilarity. A small number of clusters, such as clusters 1, 6, and 27, have low similarity. Rare Sybyl types such as Se, Au, Hg, and Du often appear in these clusters. Because these types are rare in PDB files, they are clustered together on their Sybyl features, resulting in low similarity. The interaction pairs in clusters 3, 7, and 9 are very similar. For example, all pairs in cluster 3 combine the protein fragment {‘CG’, ‘CD1’, ‘CD2’} with C.3, and cluster 7 consists of the combinations of {‘CB’, ‘CA’, ‘C’} and C.3, each forming a cluster of its own. In addition, these protein fragments and Sybyl types occur much more frequently than other types, and the occurrence probability of these interaction pairs is also very high, which leads to convergence to their centers. OpBGM determines the cluster number of each interaction pair from its probability under each component of the Bayesian Gaussian mixture distribution, selecting the cluster with the highest probability as the final cluster number. Each distribution also calculates a weight based on the number of samples in its neighborhood to indicate the importance of the model. Figure 7 shows, for different weights, the percentage of clusters relative to the total number of distributions. Clusters are divided according to the weight of each distribution. The weights of most distributions are relatively even, concentrated between 0.003 and 0.02, and weights of 0.003–0.06 are the most common, which also indicates a good clustering effect.
Fig. 6. The pie chart of similarity of partial clusters with 110 clusters (Color figure online)
The label of each interaction pair is obtained through clustering, and the interaction pairs in each cluster are traversed with a threshold of 20%. If an interaction pair accounts for more than 20% of the total number of interaction pairs in a cluster, it is judged to be a significant interaction pair. Once the traversal is over, 104 interaction pairs are finally obtained, including 46 protein fragments and 7 types of Sybyl atoms. The interaction relationships between them are shown in Fig. 8. The blue units are protein fragments, the magenta units are Sybyl types of ligand atoms, and the green lines indicate possible interactions.
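The 20% rule for selecting significant interaction pairs can be sketched in a few lines. The function name and the tuple layout of a pair (three fragment atoms followed by the Sybyl type) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def significant_pairs(pairs, labels, threshold=0.20):
    """Return the pairs that make up more than `threshold` of their cluster."""
    by_cluster = defaultdict(list)
    for pair, label in zip(pairs, labels):
        by_cluster[label].append(pair)
    significant = set()
    for members in by_cluster.values():
        for pair, count in Counter(members).items():
            # Strictly greater than the threshold, as stated in the text.
            if count / len(members) > threshold:
                significant.add(pair)
    return significant


# Toy example: cluster 0 has two kinds of pairs, cluster 1 has one.
pairs = ([("CG", "CD1", "CD2", "C.3")] * 8
         + [("CB", "CA", "C", "C.3")] * 2
         + [("N", "CA", "C", "O.2")] * 5)
labels = [0] * 10 + [1] * 5
result = significant_pairs(pairs, labels)
```

In the toy example the second pair makes up exactly 20% of cluster 0 and is therefore not selected, because the rule requires more than 20%.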
Fig. 7. Gaussian Bayesian mixed distribution weights
Fig. 8. Significant interaction relationships (Color figure online)
It can be seen from Fig. 8 that most of the protein fragments in the representative interaction relationships are side-chain fragments, which is consistent with the fact that most amino acids bind ligands through their side chains. In addition, the frequency of the Sybyl type C.3 is very high because ligands contain a large number of sp3-hybridized C atoms.
3.2 Docking Verification

In this paper, AutoDock Vina [25] is used to dock each ligand with its target protein, and a Python script is used for batch docking. The docking process consists of the following steps. First, the target proteins are extracted from the PDB files and converted to PDBQT files using AutoDockTools, with hydrogen atoms added to the proteins. Then, the mol2 files of the ligands are downloaded and converted to PDBQT files, with hydrogen atoms added to the ligands. In addition, the grid box, which is the most likely binding region between the target and the ligand, is extracted. Finally, the two PDBQT files and the grid box information are used to dock the processed target protein with the ligand. The binding affinity between the target protein and the ligand is calculated, and the docking results are judged by this binding affinity. The binding affinities of docking are shown in Fig. 9. The horizontal axis represents the cluster number, and the vertical axis is the number of drug-target pairs in each binding affinity range. Most binding affinity values lie between −8.5 kcal/mol and −5.5 kcal/mol, concentrated around −7 kcal/mol, indicating that the clustering results are reasonable to some extent. Pairs with an affinity greater than −5.5 kcal/mol have an unsatisfactory docking effect and account for a small part of the whole. Detailed statistics are shown in Table 3.
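The final docking step and the binning of affinities can be sketched as below. This is a hedged illustration rather than the authors' script: the helper names are hypothetical, the `vina` executable is assumed to be installed and on the PATH, the PDBQT inputs are assumed to have been prepared beforehand, and the inclusive interval boundaries are an assumption.

```python
import subprocess


def vina_command(receptor, ligand, center, size, out):
    """Build an AutoDock Vina command line for one receptor-ligand pair."""
    cmd = ["vina", "--receptor", receptor, "--ligand", ligand, "--out", out]
    for axis, c, s in zip("xyz", center, size):
        cmd += [f"--center_{axis}", str(c), f"--size_{axis}", str(s)]
    return cmd


def best_affinity(pdbqt_text):
    """Return the first (best) affinity in kcal/mol from Vina output, or None."""
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            return float(line.split()[3])
    return None


def affinity_interval(affinity):
    """Assign an affinity to one of four intervals (boundaries assumed inclusive)."""
    if affinity > -5.5:
        return "AFFI > -5.5"
    if affinity > -7:
        return "-7 < AFFI <= -5.5"
    if affinity >= -8.5:
        return "-8.5 <= AFFI <= -7"
    return "AFFI < -8.5"


def dock(receptor, ligand, center, size, out):
    """Run Vina on prepared PDBQT files and return the best affinity."""
    subprocess.run(vina_command(receptor, ligand, center, size, out), check=True)
    with open(out) as handle:
        return best_affinity(handle.read())
```

Batch docking then amounts to calling `dock` in a loop over the prepared receptor-ligand pairs and counting the results of `affinity_interval` per cluster.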
Fig. 9. Statistics of binding affinity of docking
Table 3. Statistics of the number of drug-target pairs in each affinity interval (AFFI in kcal/mol)

Cluster number    AFFI > −5.5    −7 < AFFI ≤ −5.5    −8.5 ≤ AFFI ≤ −7    AFFI < −8.5
0–9               263            315                 314                 178
10–19             319            328                 363                 119
20–29             363            373                 434                 158
30–39             302            321                 306                 188
40–49             279            329                 243                 169
50–59             407            295                 224                 105
60–69             290            347                 276                 181
70–79             240            241                 248                 88
80–89             176            212                 205                 103
90–99             293            265                 235                 112
100–109           156            209                 268                 63
Figure 10 shows four docking results with representative drug-target interactions. The specific binding residues of the target protein are shown in purple, the small-molecule drug in blue, and the polar contacts formed when the drug binds the target as yellow dashed lines. The number marked above each contact is the distance between the two atoms. The binding sites, distances, and affinities corresponding to Fig. 10 are summarized in Table 4 and Table 5. From Fig. 10 and Table 5, it can be seen that the three-dimensional distance between the target protein atom and the ligand atom is less than 5 angstroms, which also demonstrates the effectiveness of the proposed model for predicting the interaction pairs.
(a) Interaction between CHOLESTEROL OXIDASE and APR
(b) Interaction between RNA-DIRECTED RNA POLYMERASE and SAE
(c) Interaction between Cytochrome P450 19A1 and ASD
(d) Interaction between prostaglandin F synthase and FRM
Fig. 10. 3D structure visualization of docking results (Color figure online)
Table 4. Binding sites of representative docking results

Targets                        Drugs    Atoms at the binding site
Cytochrome P450 19A1           ASD      OD2(XYZ: 88.230, 49.521, 51.205)-O(XYZ: 87.641, 49.566, 48.022)
Cytochrome P450 19A1           ASD      NH1(XYZ: 87.027, 58.064, 40.490)-O(XYZ: 86.069, 58.665, 43.376)
CHOLESTEROL OXIDASE            APR      HN(XYZ: −17.415, 4.468, 17.819)-O(XYZ: −18.794, 6.974, 16.791)
CHOLESTEROL OXIDASE            APR      O(XYZ: −21.718, 6.032, 17.055)-O(XYZ: −18.794, 6.974, 16.791)
Prostaglandin F Synthase       FRM      NE2(XYZ: 86.269, 29.454, −2.233)-O(XYZ: 87.966, 30.981, −4.506)
Prostaglandin F Synthase       FRM      NE2(XYZ: 87.660, 31.614, −7.817)-O(XYZ: 87.966, 30.981, −4.506)
RNA-DIRECTED RNA POLYMERASE    SAE      N(XYZ: 12.826, −41.099, 6.629)-O(XYZ: 13.517, −43.695, 5.519)
RNA-DIRECTED RNA POLYMERASE    SAE      OG(XYZ: 11.510, −50.823, −2.869)-NH(XYZ: 11.737, −50.359, −0.888)
Table 5. Distance and affinity of representative docking results

Targets                        Drugs    Distance (Å)    Affinity (kcal/mol)
Cytochrome P450 19A1           ASD      3.2             −10.4
Cytochrome P450 19A1           ASD      2.8             −10.4
CHOLESTEROL OXIDASE            APR      3.0             −10.7
CHOLESTEROL OXIDASE            APR      3.1             −10.7
Prostaglandin F Synthase       FRM      3.2             −11.3
Prostaglandin F Synthase       FRM      3.4             −11.3
RNA-DIRECTED RNA POLYMERASE    SAE      2.9             −10.1
RNA-DIRECTED RNA POLYMERASE    SAE      2.0             −10.1
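As a worked check, a contact distance such as those in Table 5 can be computed from a coordinate pair in Table 4 as the Euclidean distance between the two atoms. The sketch below uses the first Cytochrome P450 19A1 row; the function name is hypothetical.

```python
import math


def contact_distance(protein_xyz, ligand_xyz):
    """Euclidean distance in angstroms between two atoms given as (x, y, z)."""
    return math.dist(protein_xyz, ligand_xyz)


# First row of Table 4: OD2 of the protein and O of the ligand ASD.
d = contact_distance((88.230, 49.521, 51.205), (87.641, 49.566, 48.022))
print(round(d, 1))   # 3.2
print(d < 5.0)       # True: within the 5 angstrom criterion
```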
4 Conclusion

This paper proposes a new unsupervised model (OpBGM), which combines OPTICS and BGMM, to predict drug-target interactions. To improve prediction performance, OPTICS is used to remove noise points, which improves the similarity within clusters. The Silhouette score is used to evaluate the clustering effect when the real labels are unknown. Because encoding the discrete feature values produces too many encoded features, the PCA dimension reduction algorithm is used. The extraction of representative interaction fragments by clustering helps to predict more specific binding sites of drugs and targets. In addition, the docking of targets with small molecules validates the rationality of drug-target binding and the specific binding sites. Currently, some drugs have few PDB files, which may affect prediction performance. The next step will be to add other three-dimensional spatial features to the protein fragments to better characterize the interaction pairs. A correlation score may also be introduced, which would help to make a comprehensive judgment about drug-target combinations based on the multiplicity of spatial features.

Acknowledgements. The authors thank the members of Machine Learning and Artificial Intelligence Laboratory, School of Computer Science and Technology, Wuhan University of Science and Technology, for their helpful discussion within seminars. This work was supported by National Natural Science Foundation of China (No. 61972299).
References

1. Itoh, Y., Nakashima, Y., Tsukamoto, S., et al.: N+-C-H···O Hydrogen bonds in protein-ligand complexes. Sci. Rep. 9(1), 767 (2019)
2. Kumar, K., Woo, S.M., Siu, T., et al.: Cation–π interactions in protein–ligand binding: theory and data-mining reveal different roles for lysine and arginine. Chem. Sci. 9(10), 2655–2665 (2018)
3. Lin, X.L., Zhang, X.L.: Prediction of hot regions in PPIs based on improved local community structure detecting. IEEE/ACM Trans. Comput. Biology Bioinf. 15(5), 1470–1479 (2018)
4. Lin, X.L., Zhang, X.L., Xu, X.: Efficient classification of hot spots and hub protein interfaces by recursive feature elimination and gradient boosting. IEEE/ACM Trans. Comput. Biology Bioinf. 17(5), 1525–1534 (2020)
5. Driver, M.D., Williamson, M.J., Cook, J.L., et al.: Functional group interaction profiles: a general treatment of solvent effects on non-covalent interactions. Chem. Sci. 11(17), 4456–4466 (2020)
6. Basith, S., Cui, M., Macalino, S., et al.: Exploring G Protein-Coupled Receptors (GPCRs) ligand space via cheminformatics approaches: impact on rational drug design. Front. Pharmacol. 9, 128 (2018)
7. Warner, K.D., Hajdin, C.E., Weeks, K.M.: Principles for targeting RNA with drug-like small molecules. Nat. Rev. Drug Discov. 17(8), 547–558 (2018)
8. Hwang, H., Dey, F., Petrey, D., et al.: Structure-based prediction of ligand–protein interactions on a genome-wide scale. Proc. Natl. Acad. Sci. 114(52), 13685–13690 (2017)
9. Karasev, D., Sobolev, B., Lagunin, A., et al.: Prediction of protein-ligand interaction based on the positional similarity scores derived from amino acid sequences. Int. J. Mol. Sci. 21(1), 24 (2020)
10. Keum, J., Nam, H.: SELF-BLM: prediction of drug-target interactions via self-training SVM. PLoS ONE 12(2), e0171839 (2017)
11. Olayan, R.S., Ashoor, H., Bajic, V.B.: DDR: efficient computational method to predict drug–target interactions using graph mining and machine learning approaches. Bioinformatics 34(7), 1164–1173 (2018)
12. Verma, N., Qu, X., Trozzi, F., et al.: SSnet: a deep learning approach for protein-ligand interaction prediction. Int. J. Mol. Sci. 22(3), 1392 (2021)
13. Hu, S., Zhang, C., Chen, P., et al.: Predicting drug-target interactions from drug structure and protein sequence using novel convolutional neural networks. BMC Bioinform. 20, 689 (2019)
14. Huang, K., Xiao, C., Glass, L.M., et al.: MolTrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics 37(6), 830–836 (2021)
15. Hameed, P.N., Verspoor, K., Kusljic, S., et al.: A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration. BMC Bioinform. 19(1), 129 (2018)
16. Rives, A., Meier, J., Sercu, T., et al.: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 118(15), e2016239118 (2021)
17. Rappaport, N., Nativ, N., Stelzer, G., et al.: MalaCards: an integrated compendium for diseases and their annotation. Database 2013, bat018 (2013)
18. Wishart, D.S., Feunang, Y.D., Guo, A.C., et al.: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucl. Acids Res. 46(D1), D1074–D1082 (2018)
19. Burley, S.K., Berman, H.M., Christie, C., et al.: RCSB protein data bank: sustaining a living digital data resource that enables breakthroughs in scientific research and biomedical education. Protein Sci. 27(1), 316–330 (2018)
20. Wei, X., Wu, X., Cheng, Z., et al.: Botanical drugs: a new strategy for structure-based target prediction. Brief. Bioinform. 23(1), bbab425 (2022)
21. O’Boyle, N.M., Banck, M., James, C.A., et al.: Open babel: an open chemical toolbox. J. Cheminform. 3(1), 33 (2011)
22. Bhattacharya, S., Singh, S., Kaluri, R., Maddikunta, P.K.R., et al.: A novel PCA-Firefly based XGBoost classification model for intrusion detection in networks using GPU. Electronics 9(2), 219 (2020)
23. Li, P., Sun, M., Wang, Z., et al.: OPTICS-based unsupervised method for flaking degree evaluation on the murals in mogao grottoes. Sci. Rep. 8(1) (2018)
24. Ma, Z., Lai, Y., Kleijn, W.B., et al.: Variational Bayesian learning for Dirichlet process mixture of inverted Dirichlet distributions in Non-Gaussian image feature modeling. IEEE Trans. Neural Netw. Learn. Syst. 30(2), 449–463 (2019)
25. Nguyen, N.T., Nguyen, T.H., Pham, T.N.H., et al.: Autodock vina adopts more accurate binding poses but autodock4 forms better binding affinity. J. Chem. Inf. Model. 60(1), 204–211 (2020)
Drug-Target Affinity Prediction Based on Multi-channel Graph Convolution

Hang Zhang1, Jing Hu1,2,3(B), and Xiaolong Zhang1,2,3

1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
{hujing,xiaolong.zhang}@wust.edu.cn
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, China
3 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
Abstract. Computer-aided drug design with high performance is a promising field, and the prediction of drug-target affinity is an important part of computer-aided drug design. As a family of deep learning models, graph neural networks have been gradually applied in drug and protein research due to their excellent performance in structural feature learning. In the new field of drug-target affinity prediction, graph neural networks also have great potential. In this paper, a novel approach for drug-target affinity prediction based on a multi-channel graph convolution network is proposed. The method encodes drug and protein sequences into corresponding node adjacency matrices. The adjacency matrices, together with the physical and chemical characteristics of the drug and protein sequences, are used as the inputs of the model to construct a multi-channel graph convolution network that aggregates the information of nodes at different distances. The drug feature and target feature vectors are concatenated, and the concatenated vector is then converted to the predicted value by the fully connected layer. The experimental results on the Davis and KIBA datasets show that the proposed method outperforms most relevant methods. Although the experimental results are slightly worse than those of GraphDTA, they show that the proposed method can improve the prediction of drug-target affinity to a certain extent and aggregate more information from nodes of higher-order proximity in the graphs.

Keywords: Prediction of drug target affinity · Graph convolutional neural network · Nodes of high order proximity · Multichannel graph convolutional networks
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 533–546, 2022. https://doi.org/10.1007/978-3-031-13829-4_46

1 Introduction

Developing new drugs takes a lot of money and time. According to statistics, an FDA-approved drug costs about $2.6 billion and takes 17 years to develop. Finding new uses for approved drugs can avoid this expensive and time-consuming development process [1–3]. To repurpose approved drugs effectively, researchers need to understand which proteins are targets of which drugs. High-throughput screening tests can detect drug-target affinity, but these tests are costly and time-consuming [4, 5]. Moreover, the large number of drug-like compounds and potential protein targets makes thorough screening difficult [6–8]. However, computational models based on existing drug-target experiments can effectively estimate the interaction strength of new drug-target pairs, so this kind of method is becoming popular.

At present, many methods have been used to predict drug-target interaction, which greatly promotes the development of drug-target interaction research. Pahikkala et al. used the Kronecker Regularized Least Squares (KronRLS) algorithm, which computes a pairwise kernel K as the Kronecker product of the drug-drug and protein-protein kernels [9]. He et al. proposed the SimBoost method, which uses affinity similarities between drugs and targets to construct new features, to predict the affinity between an unknown drug and target. All the methods mentioned above are traditional machine learning methods [10].

With the improvement of the accuracy of neural networks and the increasing precision requirements of drug design, deep learning methods have also been applied to the scoring and prediction of protein-ligand interactions. Hakime Öztürk et al. proposed the DeepDTA method to predict the affinity of a drug to a protein target [11]. They used the SMILES [12] (simplified molecular input line entry specification) strings of drug molecules and the protein sequences as the inputs of the model, constructed two convolutional neural networks to extract the representations of drugs and proteins respectively, and finally combined the two representations to predict the affinity between drugs and protein targets. Hakime Öztürk [13] proposed the WideDTA method, a further improvement on DeepDTA. The model takes the ligand SMILES (LS), ligand max common substructure [14] (LMCS), protein sequence [15] (PS), and protein motifs and domains (PMD) as input. After convolutional neural network training, the representation vectors are concatenated and passed through fully connected layers to obtain the predicted values.

Although the methods mentioned above predict significantly better than traditional machine learning methods, representing drug molecules and protein sequences as strings is not a natural form of expression. Recently, graph neural networks have been widely used in different fields. They place no restriction on the size of the input graph and can express the structure of drug molecules and proteins more faithfully than string inputs, so they can extract deeper molecular information in a more flexible form. The PADME model designed by Q. Feng [16] utilizes molecular graph convolution in drug-target interaction prediction, demonstrating the potential of graph convolutional neural networks in drug discovery. Like PADME, T. Nguyen et al. proposed the GraphDTA method, which takes atoms as graph nodes and chemical bonds as graph edges to construct a drug molecule graph [17]. The drug molecule graphs and protein sequences are the inputs of the network; after training and concatenation, the predicted affinity values of drugs and targets are obtained. Compared with other methods, GraphDTA clearly improves the prediction performance of drug-target interaction. However, the model has only three graph convolution layers, which makes it difficult to aggregate the information of similar but distant nodes. Although increasing the number of graph convolution layers allows such information to be aggregated, the node representations are gradually projected to a steady state. Therefore, the number of graph convolution layers is limited, and it is difficult to aggregate high-order similar nodes. It is thus necessary to develop a computationally efficient convolutional architecture that uses the information of high-order neighboring nodes through appropriate aggregators while maintaining the heterogeneity of nodes. Zhou et al. proposed a multichannel graph convolutional network (MCGCN) model that aggregates high-order information by enriching the number of input channels [18].

In this paper, the multi-channel graph convolutional neural network is applied to the prediction of drug-target affinity to further optimize the experimental results. Compared with other methods, the proposed method first avoids the problem caused by too many convolution layers by aggregating the node information of different distances in each channel, and so learns more comprehensive graph data. Second, the proportion of aggregated node information of each channel is adjusted by parameters to make the model more rational.
2 Methods

In this experiment, we referred to a variety of deep learning methods for predicting drug-target affinity and obtained the prediction results by extracting the representations of drug molecules and proteins separately and then concatenating them. Compared with other methods, the innovation of this experiment lies in the introduction of multi-channel graph convolution into the training on drug molecular graph data. Compared with traditional graph convolution, multi-channel graph convolution can better capture structural information at different distances in the drug molecular graph.

2.1 Molecular Representation

In the experimental datasets, the model input for drugs is obtained from the simplified molecular input line entry specification (SMILES). SMILES enables molecular data to be read by computers for efficient applications such as fast retrieval and substructure search. The compound SMILES strings of the Davis dataset are extracted from the PubChem compound database according to their PubChem CIDs. For the KIBA dataset, the ChEMBL IDs are first converted to PubChem CIDs, and the SMILES strings are then extracted through the corresponding CIDs. From the SMILES representation, molecular graphs can be constructed with atoms as nodes and chemical bonds as edges. In the experiment, the number of atoms in the drug molecule, the set of atom pairs at both ends of the chemical bonds, and the physical and chemical characteristics of each atom are taken as the input representation of the drug molecule. To ensure that the node features are fully considered in the graph convolution process, self-loops are added to the graph structure to improve the representation of drug molecules. The graph construction for molecular features is shown in Fig. 1. The node features of the atoms are listed in Table 1 and are the same as those in DGraphDTA.

Fig. 1. Graph construction for molecular graph

Table 1. Node features (atom)

Feature                                                                                                               Dimension
One-hot encoding of the atom element                                                                                  44
One-hot encoding of the degree of the atom in the molecule, which is the number of directly-bonded neighbors (atoms)  11
One-hot encoding of the total number of H bound to the atom                                                           11
One-hot encoding of the number of implicit H bound to the atom                                                        11
Whether the atom is aromatic                                                                                          1

A contact map is one of the outputs of structure prediction methods, usually in matrix form. Assume that the length of the protein sequence is L; then the predicted contact map M is a matrix with L rows and L columns, where each element m_ij of M indicates whether the corresponding residue pair, namely residues i and j, is in contact. In general, two residues are in contact if the Euclidean distance between their Cβ atoms (Cα atoms in the case of glycine) is less than a specified threshold. In this study, the open-source method PconsC4 is used to predict contact maps efficiently and quickly. PconsC4 uses the U-Net [19] architecture, which operates on 72 features calculated from each position in a multiple sequence alignment. PconsC4 outputs the contact probability of each residue pair, and a threshold of 0.5 is then applied to obtain the contact map of size (L, L), which also serves as the adjacency matrix of the protein graph. PSSM [20] (position-specific scoring matrix) is a common protein representation in proteomics. In a PSSM, each residue position is scored according to the sequence alignment results, and the scores are used as residue node features. In this experiment, the PSSM and the physicochemical properties of each residue node are taken as the features of the protein sequence. The specific features of these nodes are shown in Table 2 (Fig. 2).
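The two graph inputs described in this section can be sketched as adjacency matrices. In the minimal NumPy example below, the function names, the hand-written bond list, and the toy probability matrix are illustrative assumptions; in practice the bonds would come from a parsed SMILES string and the probabilities from a contact predictor. Treating a probability exactly equal to the threshold as a contact is also an assumption.

```python
import numpy as np


def drug_adjacency(num_atoms, bonds):
    """Adjacency matrix of a molecular graph with self-loops on the diagonal."""
    A = np.eye(num_atoms)
    for i, j in bonds:
        A[i, j] = A[j, i] = 1.0   # chemical bonds are undirected edges
    return A


def protein_adjacency(contact_prob, threshold=0.5):
    """Binarize an (L, L) contact probability matrix into a contact map."""
    return (np.asarray(contact_prob) >= threshold).astype(float)


# Ethanol (SMILES "CCO"): heavy atoms C0, C1, O2 with bonds C0-C1 and C1-O2.
A_drug = drug_adjacency(3, [(0, 1), (1, 2)])

# Toy contact probabilities for a protein with three residues.
probs = [[1.0, 0.8, 0.1],
         [0.8, 1.0, 0.6],
         [0.1, 0.6, 1.0]]
A_protein = protein_adjacency(probs)
```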
Fig. 2. Graph construction for protein graph
Table 2. Node features (residue)

Feature                                                                             Dimension
One-hot encoding of the residue symbol                                              21
Position-specific scoring matrix (PSSM)                                             21
Whether the residue is aliphatic                                                    1
Whether the residue is aromatic                                                     1
Whether the residue is polar neutral                                                1
Whether the residue is acidic charged                                               1
Whether the residue is basic charged                                                1
Residue weight                                                                      1
The negative of the logarithm of the dissociation constant for the -COOH group      1
The negative of the logarithm of the dissociation constant for the -NH3 group       1
The negative of the logarithm of the dissociation constant for any other group in   1
The pH at the isoelectric point                                                     1
Hydrophobicity of residue (pH = 2)                                                  1
2.2 Multichannel Graph Convolution Structure

In recent years, the success of convolutional neural networks in computer vision, speech recognition, and natural language processing has stimulated research on graph neural networks. Graph neural networks solve two main problems that arise when convolutional neural networks are extended to graphs: (1) forming receptive fields in graphs whose data points are not arranged on a Euclidean grid; (2) pooling graphs for downsampling. After years of rapid development, graph neural networks have produced many powerful variants, such as the Graph Convolution Network (GCN), Graph Attention Network (GAT), and Graph Isomorphism Network (GIN), which are very effective for graph feature extraction. For GCN, each layer performs the convolution operation in (1):

$$H^{l+1} = f\left(H^{l}, A\right) = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{l} W^{l+1}\right) \tag{1}$$

In the equation, $A$ is the adjacency matrix of the protein graph with shape $(n, n)$, $n$ is the number of nodes in the graph, $\tilde{A} = A + I$, where $I$ is the identity matrix, $\tilde{D}$ is the diagonal node degree matrix calculated from $\tilde{A}$ with the same shape as $A$, $W^{l+1}$ is the weight matrix of layer $l + 1$, $H^{l}$ is the output of the previous layer with shape $(n, F_l)$, $F_l$ is the number of output channels in layer $l$, and $H^{0} = X$, where $X$ is the input feature matrix of the nodes.

In essence, graph convolutional networks treat the network structure as a computational graph and train the entire neural network model in an end-to-end manner. By adopting an appropriate message passing mechanism in each convolution layer of the graph convolutional network, each node can aggregate attribute information from adjacent nodes in the network. As the depth of the graph convolutional network increases, the nodes aggregate information from other nodes of higher-order proximity. During this process, however, the node representation is projected to a steady state after several aggregation steps. Therefore, the number of layers in existing graph convolutional networks should not be too large. In practical applications, nodes with the same or similar structural roles may be far away from each other in the network, and graph convolutional networks with limited depth cannot aggregate the information of such nodes. Therefore, this paper does not increase the depth of the graph neural network but enriches the information channels instead, that is, it uses a multi-channel graph convolutional network to support information aggregation of any order through the network. Like graph convolution, the multi-channel graph convolutional network uses (2) to implement message passing:

$$H_k = \begin{cases} X, & k = 0 \\ \sigma\left(\hat{A} H_{k-1} W_{k-1}\right), & k \in [1, l] \end{cases} \tag{2}$$

Specifically, the number of layers of the multi-channel graph convolutional neural network is $l$. $H_0 = X$ means that the feature matrix $X$ is the input of the model. In addition, $H_k \in \mathbb{R}^{N \times d_k}$ is the output node representation of layer $k$ and the input node representation of layer $k + 1$, so the node information is aggregated through the message passing model. $\sigma$ represents the message propagation function that aggregates information through the network, $\hat{A}$ represents the renormalized adjacency matrix, that is, the matrix $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ appearing in (1), and $W_{k-1}$ is the weight matrix of the $k$-th layer.
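The layer-wise propagation of a GCN, with the renormalized adjacency matrix applied once per layer, can be sketched in NumPy as follows. The function names are hypothetical, and ReLU is assumed for the nonlinearity σ, which the text does not specify.

```python
import numpy as np


def normalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


def gcn_forward(A, X, weights):
    """Stack GCN layers: H_0 = X, H_k = sigma(A_hat H_{k-1} W_{k-1})."""
    A_hat = normalized_adjacency(A)
    H = X
    for W in weights:
        H = np.maximum(A_hat @ H @ W, 0.0)         # sigma = ReLU (assumption)
    return H


# Path graph with three nodes, four input features, two layers.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
weights = [rng.standard_normal((4, 8)), rng.standard_normal((8, 2))]
H = gcn_forward(A, X, weights)
```

Each additional layer lets a node see neighbors one hop further away, which is why deep stacks are needed to reach distant nodes in this formulation.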
Fig. 3. Multi-channel convolution architecture
The multi-channel convolution architecture is shown in Fig. 3. The model takes the feature matrix $X \in \mathbb{R}^{N \times d}$ as input, each row of which represents the features of a node, and node information is aggregated separately in each channel. Specifically, the propagation network in channel $k$ corresponds to a specific matrix $\hat{A}^k$, which is the $k$th power of the normalized adjacency matrix $\hat{A}$. The forward propagation of the model is shown in (3):

$$H = \mathrm{AGG}\left(\hat{A} X W_1, \hat{A}^2 X W_2, \hat{A}^3 X W_3, \ldots\right) \tag{3}$$

In the equation, $\hat{A}^i X W_i$ represents a high-order GCN channel that gathers information from the $i$th-order neighbors, and AGG aggregates the node information from all channels. In this paper, we chose the summation operator as the aggregation function for two reasons. First, GCN aggregation schemes can generally be viewed as functions over the set of a node's neighboring nodes, and among the common aggregation functions only summation captures the complete set, so summation distinguishes different network structures better than other operators [21]. Second, summation yields a weighted sum over the different convolution channels, which can amplify relatively important information. Therefore, the forward propagation model can be calculated as (4):

$$H = \sum_{i=1}^{k} \hat{A}^i X W_i \tag{4}$$

In the equation, $k$ represents the total number of channels of the model, and $W_i$ represents the learnable weight. $W_i$ can also be regarded as a pre-processing operation on the node features in each channel. To reduce the number of model parameters and avoid overfitting, the experiment uses a shared weight $W_s$ for the different channels of the model. At the same time, the nonlinear function $\sigma$ was used in the experiment to improve the expressive power of the model, and the parameter $\alpha$ was used to adjust the balance between different channels. Finally, the forward propagation of the model is rewritten as follows:

$$H = \sigma\left(\sum_{i=1}^{k} \alpha \hat{A}^i X W_s\right) \tag{5}$$
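The forward propagation in (5) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the choice of ReLU for σ, the use of one coefficient per channel (`alphas`) instead of a single α, and the toy graph are all assumptions made for illustration.

```python
import numpy as np


def normalized_adjacency(adj):
    """Renormalized adjacency D^{-1/2} (A + I) D^{-1/2} (self-loops added)."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))  # degrees >= 1 due to self-loops
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def multi_channel_gcn(adj, x, w_shared, alphas):
    """Sum of alpha_i * A_hat^i X W_s over channels i = 1..k, followed by ReLU."""
    a_hat = normalized_adjacency(adj)
    propagated = x @ w_shared          # shared weights applied once for all channels
    out = np.zeros_like(propagated)
    for alpha in alphas:               # the i-th pass yields A_hat^i X W_s
        propagated = a_hat @ propagated
        out += alpha * propagated
    return np.maximum(out, 0.0)        # ReLU stands in for sigma (assumption)


# Toy example (hypothetical data): a 4-node path graph with 3 features per node.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w_shared = rng.normal(size=(3, 2))
h = multi_channel_gcn(adj, x, w_shared, alphas=[1.0, 0.5, 0.25])
print(h.shape)  # (4, 2)
```

Because the weight matrix is shared, the product $X W_s$ is computed once and then multiplied repeatedly by $\hat{A}$, so no explicit matrix power is needed.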
2.3 Model Structure

Figure 4 shows the complete structure of the model. Experiments showed that the best results were obtained with a 3-channel graph convolution for the drug data and a 3-layer graph convolution for the protein data. The drug molecule graph data and the protein graph data are fed into the convolutional layers of the model. The representation vectors of the drug molecule and the protein sequence are then each obtained through a pooling layer followed by two fully connected layers. Finally, the two vectors are concatenated and passed through two fully connected layers to obtain the predicted value of the model.
Fig. 4. The complete model structure
3 Results and Discussion

3.1 Dataset

To compare with other drug-target affinity prediction models such as GraphDTA, WideDTA and DeepDTA, the Davis [22] and KIBA [22] datasets were selected for training and testing. The Davis dataset contains selected entities from the kinase protein family and their related inhibitors, together with the respective dissociation constants: 442 kinase proteins, 68 related inhibitors, and the dissociation constants for 30056 interactions. The KIBA dataset differs from the Davis dataset in that it contains bioactivities of kinase inhibitors from different sources, including $K_i$, $K_d$, and $IC_{50}$, which are combined into scores that the model is trained on and predicts. The KIBA dataset initially contained 467 targets and 52,498 drugs; He et al. filtered it to keep only drugs and targets with at least 10 interactions, resulting in 229 unique proteins and 2111 unique drugs. Table 3 summarizes the proteins, compounds, and interactions in the two datasets.

Table 3. Dataset

| Dataset | Proteins | Compounds | Binding entities |
|---------|----------|-----------|------------------|
| Davis   | 442      | 68        | 30056            |
| KIBA    | 229      | 2111      | 118254           |
For the Davis dataset, the dissociation constant $K_d$ was converted to logarithmic space to obtain $pK_d$ as the affinity value to be predicted, as shown in (6):

$$pK_d = -\log_{10}\left(\frac{K_d}{10^9}\right) \tag{6}$$

He et al. took the negative of each KIBA score, selected the minimum among the negated values, and added the absolute value of that minimum to all negated values to construct the final form of the KIBA score.

3.2 Metrics

Concordance index [23] (CI) and mean square error [24] (MSE), which are also used by other state-of-the-art methods, are both applied in the experiment. The CI is obtained through (7) and measures whether the predicted affinities of two drug-target pairs are ranked in the same order as their actual affinities. The greater the value, the more consistent the predicted values are with the actual values.

$$CI = \frac{1}{Z} \sum_{d_x > d_y} h\left(b_x - b_y\right) \tag{7}$$
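Equation (7), together with the step function $h$ defined in (8), can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used in the experiments; the function names are hypothetical, and $Z$ is assumed to be the number of pairs with $d_x > d_y$, which is the usual choice but is not stated explicitly in the text.

```python
import numpy as np


def concordance_index(y_true, y_pred):
    """CI over all pairs (x, y) whose true affinities satisfy d_x > d_y."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    score, pairs = 0.0, 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:                 # d_x > d_y
                diff = y_pred[i] - y_pred[j]          # b_x - b_y
                score += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)  # h(.)
                pairs += 1
    # Z is assumed to be the number of comparable pairs.
    return score / pairs if pairs else float("nan")


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted affinities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical affinities: a perfectly ranked prediction gives CI = 1.
print(concordance_index([5.0, 6.2, 7.1], [5.3, 6.0, 7.4]))  # 1.0
```

The double loop is quadratic in the number of samples, which is fine for illustration; large test sets would call for a vectorized or sorted variant.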
In the equation, $b_x$ is the predicted value for the larger affinity $d_x$, $b_y$ is the predicted value for the smaller affinity $d_y$, $Z$ is a normalization constant, and $h(x)$ is the step function shown in (8):

$$h(x) = \begin{cases} 1, & x > 0 \\ 0.5, & x = 0 \\ 0, & x < 0 \end{cases} \tag{8}$$

[...] P > 0.05, indicating that there was no bias of clinical characteristics in the sample grouping.

3.2 Construction and Validation of Risk Models

Twenty-one prognosis-related lncRNAs were screened via univariate and lasso regression analysis (Fig. 1A–C). Associations between the expression levels of prognosis-related lncRNAs and overall survival (OS) were estimated using Kaplan-Meier survival analysis (Fig. 2A). ROC analysis was performed at 1, 2, and 3 years to assess the predictive accuracy of the prognostic model (Fig. 2B). Risk curves, risk scatter plots and risk heatmaps were plotted to assess the ability of the model to discriminate between high- and low-risk groups (Fig. 2C).
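Table 1. The clinical characteristics of patients in different cohorts.

| Clinical characteristic | Training group (n = 112) | Verification group (n = 47) | P      |
|-------------------------|--------------------------|-----------------------------|--------|
| Age                     |                          |                             |        |
| [...] 65                | 42                       | 13                          |        |
| Gender                  |                          |                             | 0.2666 |
| Female                  | 43                       | 13                          |        |
| Male                    | 69                       | 43                          |        |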
3.3 Construction of a Cuprotosis-Related lncRNA Prognostic Model

We analyzed the co-expression of cuprotosis-related genes and lncRNAs (Fig. 3A) and identified 559 lncRNAs related to cuprotosis. Univariate Cox regression analysis revealed that 21 lncRNAs were associated with the prognosis of GBM patients. Multivariate Cox regression analysis then identified 6 cuprotosis-associated lncRNAs (LINC02328, AC005229.4, CYTOR, AC019254.1, DLEU1, GNG12-AS1), which were used to construct the predictive signature. The risk score is calculated as follows: Risk score = (LINC02328 × 0.715) + (AC005229.4 × 0.824) + (CYTOR × 0.276) + (AC019254.1 × 1.042) + (DLEU1 × −1.663) + (GNG12-AS1 × −0.959). To determine which cuprotosis-related lncRNAs correlated with which cuprotosis-related genes, we made a correlation heatmap (Fig. 3B).

3.4 Enrichment Analysis of Cuprotosis-Related Genes

KEGG and GO analyses were performed on the risk differential genes associated with cuprotosis. KEGG pathway analysis showed that these genes were mainly enriched in the IL-17, chemokine, and TNF signaling pathways, among others (Fig. 4A). In the GO analysis, the biological process category was enriched in signaling receptor activator activity, receptor-ligand activity, etc.; the cellular component category was enriched in the collagen-containing extracellular matrix, vesicle lumen, etc.; and the molecular function category was enriched in cell chemotaxis, leukocyte chemotaxis, etc. (Fig. 4B).
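The risk score formula from Sect. 3.3 is a weighted sum of lncRNA expression levels, which the following short Python sketch makes concrete. The coefficients are those reported in the text; the function name and the example expression values are hypothetical and serve only as an illustration.

```python
# Coefficients of the six-lncRNA signature, as reported in Sect. 3.3.
COEFFICIENTS = {
    "LINC02328": 0.715,
    "AC005229.4": 0.824,
    "CYTOR": 0.276,
    "AC019254.1": 1.042,
    "DLEU1": -1.663,
    "GNG12-AS1": -0.959,
}


def risk_score(expression):
    """Weighted sum of the expression levels of the six signature lncRNAs."""
    return sum(coef * expression[name] for name, coef in COEFFICIENTS.items())


# Hypothetical expression profile of one patient (values are made up).
patient = {
    "LINC02328": 1.2, "AC005229.4": 0.4, "CYTOR": 2.0,
    "AC019254.1": 0.1, "DLEU1": 0.8, "GNG12-AS1": 0.5,
}
print(round(risk_score(patient), 4))
```

Positive coefficients raise the score as expression increases, whereas the negative coefficients of DLEU1 and GNG12-AS1 lower it.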
Fig. 1. Univariate and lasso regression screened the best prognosis-related lncRNA.
3.5 Mutational Signature of GBM

Seventy samples in the high-risk group and 81 samples in the low-risk group were included in the mutational signature analysis. The mutation rates of TP53, PTEN, and EGFR ranked in the top three. Patients were divided into high- and low-TMB groups by calculating the tumor mutational burden (TMB), i.e., the number of mutations per megabase, for each GBM sample. (Figs. 5A–B) The mutation frequency of the low-risk group was higher than that of the high-risk group: mutations were present in 88.89% of patients in the low-risk group and 80% of patients in the high-risk group. In both groups, missense mutations were the most common mutation type. Kaplan-Meier analysis was used to evaluate the potential correlation between TMB and survival in GBM; TMB was associated with survival (P = 0.001, Fig. 5C), and patients with high TMB had a better prognosis. In a combined analysis of risk group and mutation burden, risk group and TMB together (P < 0.001, Fig. 5C) were also
Fig. 2. (A) The risk model of the training group and the validation group (P < 0.05). (B) The ROC curve suggests that the risk model has good short-term and long-term predictive values between the training and verification groups. (C) The risk curve and risk status show the survival status of the patient as the score increases. The risk heatmap shows the expression of prognosis-related genes in the high- and low-risk groups.
Fig. 3. (A) Sankey diagram of prognostic cuproptosis-based lncRNAs. (B) Correlation heatmap of six cuproptosis-related lncRNAs and cuproptosis-related genes.
Fig. 4. GO and KEGG analyses of cuproptosis-related genes in cancer. (A) KEGG analysis of cuproptosis-related genes. (B) GO analysis of cuproptosis-related genes. BP, biological process; CC, cellular component; MF, molecular function.
associated with survival, and patients with high TMB and low risk had a better prognosis.

3.6 GBM Immune Function, Immune Escape and Immunotherapy Analysis

Analysis of immune-related functions was performed, and cytolytic activity, T cell co-stimulation, APC co-stimulation, CCR, etc. were found. There was a difference (P [...]
[...] 120 ms) [25, 26]. The short-latency potentials are strongly time-locked to the stimulus and have a stable signal, so they can reliably monitor spinal cord function. Components with latencies longer than 45 ms have higher variability because they are susceptible to cognitive factors [11]. Because SEPs are time-locked to the stimuli and the various subcomponents of the SEP correlate with different anatomical structures of the sensory conduction pathway, SEPs can effectively monitor the integrity of the sensory conduction pathway and can identify lesions in different locations and structures within the pathway. However, intraoperatively acquired somatosensory evoked potential signals contain a large amount of noise, including not only electromagnetic interference from the patient and the vicinity of the evoked potential monitor, but also various endogenous artifacts, including oculomotor, myoelectric, cardiac, and electroencephalographic artifacts [27–30]. This results in a low signal-to-noise ratio (SNR) of the SEP signal and can therefore affect the reliability of latency and wave amplitude measurements. If the SNR of the SEP signal can be improved by identifying and excluding components associated with noise or artifacts, an accurate SEP-based differential diagnosis of spinal cord injury and spinal cord shock during the acute period may be achieved. Among the Intraoperative Electrophysiological Monitoring (IEM) techniques currently in clinical use, the SEP quantifies the wave amplitude and latency of the primary cortical response and uses their changes to detect transection of the spinal cord (a 50% reduction in wave amplitude or a 10% delay in latency is generally used as the criterion for identifying spinal cord injury) [31, 32].
The time-domain features of SEPs, on the other hand, are subject to operating room noise, which may reduce detection accuracy; moreover, focusing on a single component misses a large amount of detailed information in the SEP waveform [33, 34]. Therefore, frequency-domain and time-frequency-domain analysis methods have also been used to extract different features of the SEP signal.

2.2 Self-attention and Vision Transformer

The Vision Transformer (ViT) [35] showed that a Transformer can learn high-level image features by applying an attention mechanism across different patches of an image. This approach outperforms CNNs after pre-training on large datasets. The literature [36] suggests that strong augmentation, finely tuned hyperparameters and token-based distillation can be used to improve data efficiency. While the Transformer model has been widely used for speech signal and image processing, to the best of our knowledge, a fully attention-based model built on the Transformer architecture has not been investigated for SEP signal classification. Our approach is inspired by ViT: we use the spectrogram as input and pay close attention to understanding how this technique generalizes to new domains.
3 Materials and Methods

3.1 Dataset

We use two datasets to train and validate the model. Dataset 1 contains 1615 patients diagnosed with spinal cord injury, with SEP diagnosis, at the Affiliated Beijing Boai Hospital of China Rehabilitation Research Center and the Peking Union Medical College Hospital between January 2006 and December 2018. In Table 1, the percentage of females with spinal cord injury is 97.34%, which is consistent with the World Spinal Cord Injury Foundation [37]. Dataset 2 contains 3584 patients diagnosed with spinal cord shock, with SEP diagnosis, at the same two hospitals between January 2006 and December 2018. SEP signals were gathered at 21 key points for each patient.

The first dataset (DS-1), shown in Table 1, includes 43 males and 1572 females, with an average age of (26 ± 9) years. The injury locations of the spinal cord injury patients in this dataset are classified into five groups: cervical spine (CS), upper thoracic spine (UTS), middle thoracic spine (MTS), lower thoracic spine (LTS), and lumbar spine (LS). The cervical group includes segments C5–C8, the upper thoracic group segments T1–T4, the middle thoracic group segments T5–T9, the lower thoracic group segments T10–T12, and the lumbar group segments L1–L5.

Table 1. DS-1: Spinal cord injury dataset

| Variable         | CS      | UTS       | MTS       | LTS      | LS       |
|------------------|---------|-----------|-----------|----------|----------|
| Age (years)      | 37 ± 11 | 25 ± 8    | 26 ± 11   | 26 ± 13  | 34 ± 16  |
| Gender           | 2M, 94F | 19M, 771F | 12M, 446F | 8M, 154F | 2M, 107F |
| # Individuals    | 96      | 790       | 458       | 162      | 109      |
| # SEP Key Points | 21      | 21        | 21        | 21       | 21       |
The second dataset (DS-2), shown in Table 2, includes 2151 males and 1433 females, with an average age of (46 ± 32) years. According to the clinical diagnosis of the spinal shock patients in this dataset, individuals were separated into three groups: cervical spine (CS), thoracic spine (TS), and lumbar spine (LS). The cervical group includes three segments: C4, C5, and C6. The thoracic group includes all twelve segments: T1–T12. The lumbar group includes five segments: L1–L5.
Table 2. DS-2: Spinal cord shock dataset

| Variable         | CS          | TS        | LS          |
|------------------|-------------|-----------|-------------|
| Age (years)      | 46 ± 21     | 43 ± 26   | 45 ± 33     |
| Gender           | 1078M, 661F | 119M, 57F | 1192M, 715F |
| # Individuals    | 1739        | 176       | 1907        |
| # SEP Key Points | 21          | 21        | 21          |
| # SEP Signals    | 511K        | 44K       | 560K        |
3.2 Model Architecture

For a SEP signal, we multiply the signal by a window function at each sampling time point and then apply the discrete Fourier transform, which gives the frequency components of the signal within that short period. This process is essentially equivalent to calculating the squared magnitude of the short-time Fourier transform (STFT) of the signal $s(t)$, i.e., for a window width $\omega$, $\mathrm{spectrogram}(t, \omega) = |\mathrm{STFT}(t, \omega)|^2$.
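The windowed transform described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the Hann window, the 1 kHz sampling rate and the synthetic sine input are not taken from the paper, and the function name is hypothetical. The 30 ms window and 10 ms stride follow the pre-processing settings listed in Table 4.

```python
import numpy as np


def spectrogram(signal, window_len, stride):
    """Squared magnitude of the STFT, one row per time window."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(window_len)                  # Hann window (assumption)
    n_frames = 1 + (len(signal) - window_len) // stride
    frames = np.stack([
        signal[i * stride: i * stride + window_len] * window
        for i in range(n_frames)
    ])
    stft = np.fft.rfft(frames, axis=1)               # DFT of each windowed frame
    return np.abs(stft) ** 2                         # shape: (n_frames, window_len // 2 + 1)


# Synthetic 1 s test signal: a 100 Hz sine sampled at 1 kHz (assumed rate).
fs = 1000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 100 * t)
spec = spectrogram(sig, window_len=30, stride=10)    # 30 ms window, 10 ms stride
print(spec.shape)  # (98, 16)
```

Each row of the result is one time window and each column one frequency bin, which matches the $T \times F$ layout that the Transformer input uses.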
Fig. 3. Transformer encoder architecture used by SID2T
Let $X \in \mathbb{R}^{T \times F}$ denote the spectrogram that is input to the Transformer model, where $t \in [1, T]$ indexes the time windows and $f \in [1, F]$ indexes the frequencies. A linear projection matrix $W_0 \in \mathbb{R}^{F \times d}$ is used to map the spectrogram in the frequency domain to a higher dimension $d$. In order to learn global features representing the entire spectrogram, a learnable class embedding $X_{class} \in \mathbb{R}^{1 \times d}$ is concatenated with the input along the time dimension. A learnable position embedding matrix $X_{pos} \in \mathbb{R}^{(T+1) \times d}$ is then added. The input representation to the Transformer encoder is given in Eq. (1):

$$X_0 = [X_{class}; X W_0] + X_{pos} \tag{1}$$
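As a shape check for Eq. (1), the following NumPy sketch builds the encoder input from a spectrogram. The function name and the random initial values are hypothetical; in the actual model $W_0$, $X_{class}$ and $X_{pos}$ would be learned parameters.

```python
import numpy as np


def embed_input(x, w0, x_class, x_pos):
    """Eq. (1): project each time window, prepend the class token, add positions."""
    tokens = x @ w0                                        # (T, F) @ (F, d) -> (T, d)
    return np.concatenate([x_class, tokens], axis=0) + x_pos   # (T + 1, d)


# Hypothetical sizes: T = 98 time windows, F = 40 features, d = 64.
rng = np.random.default_rng(0)
T, F, d = 98, 40, 64
x = rng.normal(size=(T, F))
w0 = rng.normal(size=(F, d))
x_class = rng.normal(size=(1, d))
x_pos = rng.normal(size=(T + 1, d))
x0 = embed_input(x, w0, x_class, x_pos)
print(x0.shape)  # (99, 64)
```

The class token adds one row, which is why the position embedding has $T + 1$ rows.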
The projected frequency-domain features are then input into a sequential Transformer encoder made up of N multi-head attention (MSA) and multi-layer perceptron (MLP) blocks. In the $l$-th Transformer block, queries, keys and values are calculated as $Q = X_l W_Q$, $K = X_l W_K$ and $V = X_l W_V$, respectively. The self-attention (SA) is calculated as in Eq. (2):

$$\mathrm{SA}(X_l) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_h}}\right) V \tag{2}$$

Equation (3) shows the MSA operation, which is obtained by linearly projecting the concatenated outputs of the $k$ attention heads using another matrix $W_P \in \mathbb{R}^{k d_h \times d}$:

$$\mathrm{MSA}(X_l) = [\mathrm{SA}_1(X_l); \mathrm{SA}_2(X_l); \ldots; \mathrm{SA}_k(X_l)] W_P \tag{3}$$
Although PreNorm structures tend to be easier to train, their results are usually not as good as those of PostNorm [38, 39]. As shown in Fig. 3, our default setting uses the PostNorm [1] Transformer architecture, in which Layer Normalization (LN) [41] is applied after the MSA and MLP blocks, in contrast to PreNorm [40], where LN is applied first. As in a typical Transformer, we use the Gaussian Error Linear Unit (GELU) [42] activation in all MLP blocks. The output of the $l$-th Transformer block is given in Eqs. (4) and (5):

$$\tilde{X}_l = \mathrm{LN}\left(\mathrm{MSA}(X_{l-1}) + X_{l-1}\right), \quad l = 1, \ldots, L \tag{4}$$

$$X_l = \mathrm{LN}\left(\mathrm{MLP}(\tilde{X}_l) + \tilde{X}_l\right), \quad l = 1, \ldots, L \tag{5}$$
In the output layer, the class embedding is fed to a linear classifier, which outputs the probability that the SEP signal corresponds to spinal cord injury or spinal cord shock. In ViT, the global features of an image are obtained by the self-attention mechanism after the image is segmented into patches. In our model, as in ViT, SEP signals are segmented by time window, and the attention mechanism operates in the time domain, so that signals in different time windows attend to each other to form the overall representation in the class embedding. The size of the model can be adjusted by tuning the parameters of the Transformer. Following [36], we fix the number of consecutive Transformer encoder blocks to 12 and let $d / k$ be 64, where $d$ is the embedding dimension and $k$ is the number of attention heads. By changing the number of heads $k$, we get three different models, as shown in Table 3.

Table 3. Model parameters

| Model   | Dim | MLP-Dim | Heads | Layers | # Parameters |
|---------|-----|---------|-------|--------|--------------|
| SID2T-1 | 64  | 256     | 1     | 12     | 607K         |
| SID2T-2 | 128 | 512     | 2     | 12     | 2,394K       |
| SID2T-3 | 192 | 768     | 3     | 12     | 5,361K       |
4 Experiments and Results

4.1 Experiment Setup

To further evaluate the performance of the SID2T model, the proposed model is compared with other existing methods. We combine the two datasets detailed in Sect. 3.1 and described in Table 1 and Table 2, and then train our models on the combined dataset. To allow a fair comparison on the differential diagnosis of spinal cord injury and spinal cord shock, support vector machine (SVM), random forest (RF), convolutional neural network (CNN) and long short-term memory (LSTM) are selected as comparison algorithms, and a 10-fold cross-validation test is performed on the same dataset. Before the experiment, the comparison algorithms were trained on the datasets described in Table 1 and Table 2. Each dataset is split into three parts: 80% is used as the training set, 10% as the validation set, and the remaining 10% as the test set. For clarity, the hyperparameters used in all experiments are reported in Table 4.

Table 4. Hyperparameters used in all experiments

| Stage             | Parameter          | Value        |
|-------------------|--------------------|--------------|
| Training          | Training steps     | 23,000       |
|                   | Batch size         | 512          |
|                   | Optimizer          | AdamW        |
|                   | Learning rate      | 0.001        |
|                   | Schedule           | Cosine       |
|                   | Warmup epochs      | 10           |
| Regularization    | Weight decay       | 0.1          |
|                   | Label smoothing    | 0.1          |
|                   | Dropout            | 0            |
| Pre-processing    | Time window length | 30 ms        |
|                   | Time window stride | 10 ms        |
|                   | #DCT Features      | 40           |
| Data augmentation | Time shift [ms]    | [−100, 100]  |
|                   | Resampling         | [0.85, 1.15] |
|                   | #Time masks        | 2            |
|                   | #Frequency masks   | 2            |
4.2 Performance Measures

In this paper, we use the following measures to evaluate the performance of the SID2T model in assessing the patient's spinal cord. A true-positive (TP) disease prediction is
defined as the number of patients in the predicted set who are also in the spinal cord injury set. The false-positive (FP) count is the number of patients in the predicted set but not in the spinal cord injury set. The false-negative (FN) count is the number of patients that are not in the predicted set but are in the spinal cord injury set. Correspondingly, the true-negative (TN) count is the number of patients that are in neither the predicted set nor the spinal cord injury set. The root mean square (RMS) is used as an evaluation function to evaluate the SID2T model. Overall accuracy (OA) is calculated separately for each dataset. In addition, sensitivity (Sens) and specificity (Spec) are used to evaluate the accuracy of the prediction. They are defined in Eqs. (6)–(8):

$$OA = \frac{TP + TN}{TP + FN + FP + TN}, \quad OA \in [0, 1] \tag{6}$$

$$Sens = \frac{TP}{TP + FN}, \quad Sens \in [0, 1] \tag{7}$$

$$Spec = \frac{TN}{FP + TN}, \quad Spec \in [0, 1] \tag{8}$$
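Equations (6)–(8) can be computed directly from predicted and true labels, as in the following plain-Python sketch. The function name and the label encoding (1 for spinal cord injury, 0 for spinal cord shock) are assumptions made for illustration, and the sketch assumes that both classes occur in the true labels so that no denominator is zero.

```python
def diagnostic_metrics(y_true, y_pred):
    """Overall accuracy, sensitivity and specificity from binary labels.

    Label 1 denotes spinal cord injury (positive class) and label 0 denotes
    spinal cord shock (assumed encoding).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "OA": (tp + tn) / (tp + fn + fp + tn),   # Eq. (6)
        "Sens": tp / (tp + fn),                  # Eq. (7)
        "Spec": tn / (fp + tn),                  # Eq. (8)
    }


# Hypothetical labels for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(diagnostic_metrics(y_true, y_pred))
# {'OA': 0.75, 'Sens': 0.75, 'Spec': 0.75}
```

Reporting sensitivity and specificity alongside OA matters here because the two classes are of different sizes, so accuracy alone can hide poor performance on the smaller class.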
4.3 Results

Table 5 shows the results. For our own results, we report a 95% confidence interval for the mean accuracy across all three model assessments. With considerable gains on both validation and test datasets, our top models meet or surpass prior state-of-the-art accuracy. In general, Transformers benefit more from large volumes of data.

Table 5. Accuracy (%) on differential diagnosis of spinal injury and spinal shock

| Model          | DS-1 OA      | DS-1 Sens    | DS-1 Spec    | DS-2 OA      | DS-2 Sens    | DS-2 Spec    |
|----------------|--------------|--------------|--------------|--------------|--------------|--------------|
| SVM            | 89.60        | 92.70        | 88.40        | 88.20        | 91.80        | 88.30        |
| RF             | 86.30        | 89.20        | 87.10        | 84.70        | 88.60        | 89.40        |
| CNN            | 95.40        | 93.60        | 93.70        | 94.80        | 95.40        | 95.70        |
| LSTM           | 97.20        | 96.80        | 97.10        | 95.10        | 95.20        | 95.10        |
| SID2T-3 (Ours) | 97.49 ± 0.15 | 97.36 ± 0.13 | 96.91 ± 0.11 | 96.56 ± 0.07 | 95.41 ± 0.09 | 96.57 ± 0.08 |
| SID2T-2 (Ours) | 97.27 ± 0.08 | 97.29 ± 0.14 | 96.98 ± 0.09 | 96.43 ± 0.08 | 96.37 ± 0.09 | 95.98 ± 0.14 |
| SID2T-1 (Ours) | 97.26 ± 0.18 | 97.13 ± 0.08 | 96.82 ± 0.07 | 96.08 ± 0.10 | 96.29 ± 0.08 | 96.47 ± 0.11 |
Notably, on datasets with more than 1350K samples, the SID2T model outperforms CNN and LSTM, and deep neural network models outperform the SVM and RF models on average. Compared with the most advanced methods, the SID2T model achieves accurate differential diagnosis of spinal cord injury and spinal cord shock.
5 Conclusions

In this paper, we explore the direct application of the Transformer model to the differential diagnosis of spinal cord injury and spinal shock in the acute phase. We propose the SID2T attention model, which takes the spectrogram of the SEP signal as input, applies attention across time windows, and classifies the signal as spinal cord injury or spinal cord shock. The advantages of the SID2T model over previous works for rapid differential diagnosis in the acute phase of spinal cord injury and spinal shock are experimentally verified. These findings suggest that Transformer research in other disciplines could pave the way for further work on the differential diagnosis of spinal cord injury and spinal cord shock. Transformers, in particular, benefit from large-scale pre-training, with model compression reducing latency by 5.5 times and sparsity and hardware codesign reducing energy by up to 4059 times. In future work, we will try to pretrain the SID2T model on a larger dataset and use it in other areas of medical differential diagnosis.

Acknowledgement. This work was supported by the talent project of “Qingtan Scholar” of Zaozhuang University, Jiangsu Provincial Natural Science Foundation, China (No. SBK2019040953), Youth Innovation Team of Scientific Research Foundation of the Higher Education Institutions of Shandong Province, China (No. 2019KJM006), and the Key Research Program of the Science Foundation of Shandong Province (ZR2020KE001).
References

1. van Den Hauwe, L., Sundgren, P.C., Flanders, A.E.: Spinal trauma and spinal cord injury (SCI). In: Diseases of the Brain, Head and Neck, Spine 2020–2023, pp. 231–240 (2020)
2. Freund, P., Curt, A., Friston, K., Thompson, A.: Tracking changes following spinal cord injury: insights from neuroimaging. Neuroscientist 19(2), 116–128 (2013)
3. World Spinal Cord Injury Foundation. Spinal cord injury assessment (2022). https://www.wscif.org/standards/scia
4. Nuwer, M.R.: Fundamentals of evoked potentials and common clinical applications today. Electroencephalogr. Clin. Neurophysiol. 106(2), 142–148 (1998)
5. Al-Nashash, H., et al.: Spinal cord injury detection and monitoring using spectral coherence. IEEE Trans. Biomed. Eng. 56(8), 1971–1979 (2009)
6. Rossini, P.M., Rossi, S.: Clinical applications of motor evoked potentials. Electroencephalogr. Clin. Neurophysiol. 106(3), 180–194 (1998)
7. Morishita, Y., Hida, S., Naito, M., Matsushima, U.: Evaluation of cervical spondylotic myelopathy using somatosensory-evoked potentials. Int. Orthop. 29(6), 343–346 (2005)
8. Ryan, T.P., Britt, R.H.: Spinal and cortical somatosensory evoked potential monitoring during corrective spinal surgery with 108 patients. Spine 11(4), 352–361 (1986)
9. Nuwer, M.R., Dawson, E.G., Carlson, L.G., Kanim, L.E.A., Sherman, J.E.: Somatosensory evoked potential spinal cord monitoring reduces neurologic deficits after scoliosis surgery: results of a large multicenter survey. Electroencephalogr. Clin. Neurophysiol./Evoked Potentials Section 96(1), 6–11 (1995)
10. Berthier, E., Tuijman, F., Mauguiere, F.: Diagnostic utility of somatosensory evoked potentials (SEPs) in presurgical assessment of cervical spondylotic myelopathy. Neurophysiologie Clinique/Clin. Neurophysiol. 26(5), 300–310 (1996)
11. Cruccu, G., et al.: Recommendations for the clinical use of somatosensory-evoked potentials. Clin. Neurophysiol. 119(8), 1705–1719 (2008)
12. Hu, Y., Liu, H., Luk, K.D.: Time–frequency analysis of somatosensory evoked potentials for intraoperative spinal cord monitoring. J. Clin. Neurophysiol. 28(5), 504–511 (2011)
13. Mehta, S.S., Lingayat, N.S.: Biomedical signal processing using SVM. In: 2007 IET-UK International Conference on Information and Communication Technology in Electrical Sciences (ICTES 2007), pp. 527–532. IET (2007)
14. Kai, F., Qu, J., Chai, Y., Dong, Y.: Classification of seizure based on the time-frequency image of EEG signals using HHT and SVM. Biomed. Signal Process. Control 13, 15–22 (2014)
15. Rojo-Álvarez, J.L., Camps-Valls, G., Martínez-Ramón, M., Soria-Olivas, E., Navia-Vázquez, A., Figueiras-Vidal, A.R.: Support vector machines framework for linear signal processing. Signal Process. 85(12), 2316–2326 (2005)
16. Rojo-Álvarez, J.L., Martínez-Ramón, M., Muñoz-Marí, J., Camps-Valls, G.: A unified SVM framework for signal estimation. Digit. Signal Process. 26, 1–20 (2014)
17. Omidvar, M., Zahedi, A., Bakhshi, H.: EEG signal processing for epilepsy seizure detection using 5-level Db4 discrete wavelet transform, GA-based feature selection and ANN/SVM classifiers. J. Ambient. Intell. Humaniz. Comput. 12(11), 10395–10403 (2021)
18. Kropf, M., Hayn, D., Schreier, G.: ECG classification based on time and frequency domain features using random forests. In: 2017 Computing in Cardiology (CinC), pp. 1–4. IEEE (2017)
19. Hayashi, N., Nishijo, H., Ono, T., Endo, S., Tabuchi, E.: Generators of somatosensory evoked potentials investigated by dipole tracing in the monkey. Neuroscience 68(2), 323–338 (1995)
20. Peterson, N.N., Schroeder, C.E., Arezzo, J.C.: Neural generators of early cortical somatosensory evoked potentials in the awake monkey. Electroencephalogr. Clin. Neurophysiol./Evoked Potentials Section 96(3), 248–260 (1995)
21. Sonoo, M., Genba-Shimizu, K., Mannen, T., Shimizu, T.: Detailed analysis of the latencies of median nerve somatosensory evoked potential components, 2: analysis of subcomponents of the P13/14 and N20 potentials. Electroencephalogr. Clin. Neurophysiol./Evoked Potentials Section 104(4), 296–311 (1997)
22. Suzuki, I., Mayanagi, Y.: Intracranial recording of short latency somatosensory evoked potentials in man: identification of origin of each component. Electroencephalogr. Clin. Neurophysiol./Evoked Potentials Section 59(4), 286–296 (1984)
23. Emerson, R.G.: Anatomic and physiologic bases of posterior tibial nerve somatosensory evoked potentials. Neurologic Clin. 6(4), 735–749 (1988)
24. Lee, E.-K., Seyal, M.: Generators of short latency human somatosensory-evoked potentials recorded over the spine and scalp. J. Clin. Neurophysiol. 15(3), 227–234 (1998)
25. Schomer, D.L., Da Silva, F.L.: Niedermeyer’s Electroencephalography: Basic Principles, Clinical Applications, and Related Fields. Lippincott Williams & Wilkins (2012)
26. Lorenz, J., Grasedyck, K., Bromm, B.: Middle and long latency somatosensory evoked potentials after painful laser stimulation in patients with fibromyalgia syndrome. Electroencephalogr. Clin. Neurophysiol./Evoked Potentials Section 100(2), 165–168 (1996)
27. Guérit, J.-M.: Neuromonitoring in the operating room: why, when, and how to monitor? Electroencephalogr. Clin. Neurophysiol. 106(1), 1–21 (1998)
28. Gunnarsson, T., Krassioukov, A.V., Sarjeant, R., Fehlings, M.G.: Real-time continuous intraoperative electromyographic and somatosensory evoked potential recordings in spinal surgery: correlation of clinical and electrophysiologic findings in a prospective, consecutive series of 213 cases. Spine 29(6), 677–684 (2004)
29. Schneider, R., Bures, C., Lorenz, K., Dralle, H., Freissmuth, M., Hermann, M.: Evolution of nerve injury with unexpected EMG signal recovery in thyroid surgery using continuous intraoperative neuromonitoring. World J. Surg. 37(2), 364–368 (2013)
30. Phelan, E., et al.: Continuous vagal IONM prevents recurrent laryngeal nerve paralysis by revealing initial EMG changes of impending neuropraxic injury: a prospective, multicenter study. Laryngoscope 124(6), 1498–1505 (2014)
31. MacDonald, D.B., et al.: Recommendations of the international society of intraoperative neurophysiology for intraoperative somatosensory evoked potentials. Clin. Neurophysiol. 130(1), 161–179 (2019)
32. Nuwer, M.R.: Spinal cord monitoring with somatosensory techniques. J. Clin. Neurophysiol. 15(3), 183–193 (1998)
33. Li, R., et al.: Utility of somatosensory and motor-evoked potentials in reflecting gross and fine motor functions after unilateral cervical spinal cord contusion injury. Neural Regen. Res. 16(7), 1323 (2021)
34. Hu, Y., Luk, K.D., Lu, W.W., Leong, J.C.: Comparison of time–frequency analysis techniques in intraoperative somatosensory evoked potential (SEP) monitoring. Comput. Biol. Med. 32(1), 13–23 (2002)
35. Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G.: Not all images are worth 16 × 16 words: dynamic transformers for efficient image recognition. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
36. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
37. World Spinal Cord Injury Foundation. Spinal cord injury ABC (2022). https://www.wscif.org/sciabc/en-us
38. He, R., Ravula, A., Kanagal, B., Ainslie, J.: RealFormer: transformer likes residual attention. arXiv preprint arXiv:2012.11747 (2020)
39. Liu, L., Liu, X., Gao, J., Chen, W., Han, J.: Understanding the difficulty of training transformers. arXiv preprint arXiv:2004.08249 (2020)
40. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
41. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
42. Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with Gaussian error linear units (2016)
Predicting Protein-DNA Binding Sites by Fine-Tuning BERT

Yue Zhang1, Yuehui Chen2, Baitong Chen3, Yi Cao4(B), Jiazi Chen5, and Hanhan Cong6

1 School of Information Science and Engineering, University of Jinan, Jinan, China
2 School of Artificial Intelligence Institute and Information Science and Engineering, University of Jinan, Jinan, China
3 Xuzhou First People’s Hospital, Xuzhou, China
4 Shandong Provincial Key Laboratory of Network Based Intelligent Computing (School of Information Science and Engineering), University of Jinan, Jinan, China
[email protected]
5 Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-shi, Fukuoka, Japan
6 School of Information Science and Engineering, Shandong Normal University, Jinan, China
Abstract. The study of Protein-DNA binding sites is one of the fundamental problems in genome biology research. It plays an important role in understanding gene expression and transcription, in biological research, and in drug development. In recent years, language representation models have achieved remarkable results in the field of Natural Language Processing (NLP) and have received extensive attention from researchers. Bidirectional Encoder Representations from Transformers (BERT) has been shown to achieve state-of-the-art results in other domains, using the concept of word embedding to capture the semantics of sentences. On small datasets, previous models often cannot capture the upstream and downstream global information of DNA sequences well, so it is reasonable to apply the BERT model to the training of DNA sequences. Models pre-trained on large datasets and then fine-tuned on specific datasets achieve excellent results on different downstream tasks. In this study, we first regard DNA sequences as sentences and tokenize them using the K-mer method, then utilize BERT to convert the fixed-length tokenized sentences into matrices and perform feature extraction, and finally perform classification. We compare this method with current state-of-the-art models, and the DNABERT method performs better, with average improvements of 0.013537, 0.010866, 0.029813, 0.052611, and 0.122131 in ACC, F1-score, MCC, Precision, and Recall, respectively. Overall, one of the advantages of BERT is that the pre-training strategy speeds up the convergence of the network in transfer learning and improves the learning ability of the network. The DNABERT model has advantageous generalization ability on other DNA datasets and can be utilized on other sequence classification tasks.

Keywords: Protein-DNA binding sites · Transcription factor · Traditional machine learning · Deep learning · Transformers · BERT
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 663–669, 2022. https://doi.org/10.1007/978-3-031-13829-4_57
1 Introduction

A Protein-DNA binding site refers to a fragment of a protein macromolecule that specifically [1] binds to a DNA sequence of approximately 4–30 bp [2–4] in length. Transcription factors, as a common type of protein macromolecule, are an important subject of Protein-DNA binding site prediction; when transcription factors bind to these specific regions, the sites are called transcription factor binding sites (TFBS) [5, 6]. During the transcription of a gene, a transcription factor, as a protein macromolecule, binds specifically to a segment of DNA sequence, and that region forms the transcription factor binding site. Transcription factors are of great importance in gene regulation, transcription, biological research, and drug design [7–9]. Therefore, accurate prediction of Protein-DNA binding sites is very important for understanding genomes, describing gene-specific functions, etc. [10, 11].

In the past decades, sequencing was performed using traditional biological methods, especially ChIP-seq [12] sequencing technology, which greatly increased the quantity and quality of available sequences and laid the foundation for subsequent studies. With the development of sequencing technology, the number of genomic sequences has increased dramatically, and traditional biological sequencing techniques are costly and slow. Machine learning [13] ideas have therefore been applied to Protein-DNA binding site prediction. For example, Wong et al. proposed the kmerHMM [14] model based on hidden Markov models (HMMs) and belief propagation, and Li et al. [15] proposed the fusion pseudo nucleic acid composition (PseNAC) model based on SVM. However, with the gradual accumulation of sequences, traditional machine learning methods cannot meet the requirements in terms of prediction accuracy and computational speed, while deep learning has performed well in other fields such as machine vision [2, 16, 17], so researchers have gradually applied deep learning to bioinformatics [4, 18–20]. DeepBind applied convolutional neural networks to Protein-DNA binding site prediction for the first time, and Zeng et al. further explored the number of convolutional layers and pooling methods to validate the value of the Convolutional Neural Network (CNN) for Protein-DNA binding sites. KEGRU is a framework based entirely on RNNs, using a Bidirectional Gated Recurrent Unit (Bi-GRU) and K-mer embedding. DanQ utilizes a hybrid neural network combining a CNN and a Recurrent Neural Network (RNN), adding Bi-directional Long Short-Term Memory (Bi-LSTM) layers to better learn long-distance dependencies in sequences.

In our work, we utilize DNABERT for feature extraction and fully connected layers for classification. First, we segment the DNA sequences using the K-mer representation. In contrast to the one-hot encoding commonly used in previous deep learning work, we only segment the sequence, and then add position information to the processed data to form the input to BERT. Feature extraction is then performed using BERT, which is based on the multi-head self-attention mechanism, with input data of dimension 101 × 768 and output data of the same dimensionality. Finally, the extracted features are fed into the fully connected layer and activated using the softmax function for binary classification. To verify the generalization ability of the model, we used the fine-tuned model to predict transcription factor datasets from different cell lines and verified the effectiveness of the model.
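The final classification step mentioned above (a fully connected layer followed by softmax) can be illustrated with a minimal sketch. The feature vector, weights, and bias below are hypothetical values chosen only for illustration; the paper does not give the parameters of its classification head, and the real features are 768-dimensional.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(features, weights, bias):
    """Fully connected layer: one output logit per row of `weights`."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

# Hypothetical 4-dimensional feature vector (stand-in for a BERT feature).
features = [0.2, -1.0, 0.5, 0.3]
# Hypothetical weights for two classes: index 0 = non-binding, index 1 = binding.
weights = [[0.1, 0.4, -0.2, 0.0],
           [-0.3, -0.5, 0.6, 0.2]]
bias = [0.0, 0.1]

logits = linear(features, weights, bias)
probs = softmax(logits)
# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

With two classes, the softmax probability of the positive class equals a sigmoid applied to the difference of the two logits, so this head is equivalent to a single-logit sigmoid classifier.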
2 Materials and Methods

2.1 Benchmark Dataset

To better evaluate the performance of the model, we selected 45 public transcription factor ChIP-seq datasets of Broad cell lines from the ENCODE dataset, which were previously utilized in the DeepBind, CNN-Zeng, and DeepSEA model frameworks. Each DNA sequence sample has a length of 101 bp, and the ratio of positive to negative samples is approximately 1:1. These data can be found at http://cnn.csail.mit.edu/motif_discovery/.

2.2 Model

Tokenization. We tokenize DNA sequences with K-mers: each deoxyribonucleotide base is concatenated with its subsequent bases, which integrates richer contextual information for each base. Different K values correspond to different tokenizations of DNA sequences. We set K to 6, i.e., {ACGTACGT} is tokenized as {ACGTAC, CGTACG, GTACGT}. In the vocabulary, in addition to all permutations represented by K-mers, five special tokens are included: the classification token CLS inserted at the head, the SEP token inserted after each sentence, the MASK token that masks words, the padding token PAD, and the UNK token that stands for unknown items in the sequence. When K = 6, there are 4^6 + 5 tokens.

The DNABERT Model. BERT is a Transformer-based pre-trained language representation model that is a milestone in NLP. It introduces the idea of pre-training and fine-tuning: after pre-training on a large amount of data, an additional output layer is added and fine-tuned on a small amount of task-specific data to obtain state-of-the-art performance in downstream tasks. The innovation of BERT is the masked language model (MLM) technique, which uses a bidirectional Transformer for language modeling; the bidirectional model outperforms the unidirectional model in language representation. BERT models can also be used in question-answering systems, language analysis, document clustering, and many other tasks. We believe that BERT can be applied to Protein-DNA binding site prediction to better capture the hidden information in DNA sequences, as shown in Fig. 1.
Fig. 1. DNABERT framework.
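The K-mer tokenization described in Sect. 2.2 can be sketched as follows. This is an illustrative sketch, not the authors' code: the bracketed token spellings follow the common BERT convention and are an assumption (the text names the tokens only as CLS, SEP, MASK, PAD, and UNK), and the helper names `kmer_tokenize`, `build_vocab`, and `encode` are hypothetical.

```python
from itertools import product

# Special tokens; the bracketed spellings are an assumed convention.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def kmer_tokenize(sequence, k=6):
    """Slide a window of width k over the sequence with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_vocab(k=6, alphabet="ACGT"):
    """Map every special token and every possible k-mer to an integer id."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + kmers)}

def encode(sequence, vocab, k=6):
    """Wrap the k-mers in [CLS] ... [SEP]; unseen k-mers map to [UNK]."""
    tokens = ["[CLS]"] + kmer_tokenize(sequence, k) + ["[SEP]"]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

vocab = build_vocab()
print(kmer_tokenize("ACGTACGT"))  # ['ACGTAC', 'CGTACG', 'GTACGT']
print(len(vocab))                 # 4101, i.e. 4**6 + 5
```

A sequence of length L yields L − K + 1 overlapping K-mers, so a 101 bp sample produces 96 six-mers before any special tokens are added.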
3 Result and Discussion

3.1 Competing Methods

To ensure a fair comparison, we compared the DNABERT model with three deep learning-based models, namely DeepBind, DanQ, and WSCNNLSTM. The comparison shows that the DNABERT model performs better on the evaluation metrics we used. Table 1 shows the performance of DNABERT on the dataset of each cell line we selected. As can be seen from Table 1, DNABERT scores higher than existing models on ACC, F1-score, MCC, Precision, and Recall. On average, ACC is 0.013537 higher than that of the other methods, F1-score is 0.010866 higher, MCC is 0.029813 higher, and Precision and Recall are 0.052611 and 0.122131 higher, respectively. The experimental results show that our method is superior to existing networks. Table 1 is the setting of hyper-parameters in the experiment.
Table 1. Comparison of performance on datasets of cell lines.

BERT      ACC      AUC      F1       MCC      Precision  Recall
Dnd41     0.89524  0.94062  0.89501  0.79390  0.89867    0.89524
Gm12878   0.88167  0.92133  0.88121  0.76934  0.88769    0.88167
H1sec     0.77026  0.81595  0.76376  0.57290  0.80364    0.77024
Helas3    0.84735  0.88263  0.84583  0.70885  0.86164    0.84735
Hepg2     0.89043  0.93070  0.89013  0.78514  0.89473    0.89043
Hmec      0.88357  0.91528  0.88316  0.77254  0.88900    0.88357
Hsmm      0.89062  0.93426  0.89031  0.78579  0.89518    0.89062
Huvec     0.83400  0.86503  0.83225  0.68245  0.84860    0.83400
K562      0.61842  0.62076  0.57777  0.30206  0.69262    0.61842
Nha       0.87029  0.90167  0.86962  0.74823  0.87798    0.87029
Nhdfa     0.87213  0.91073  0.87149  0.75176  0.87967    0.87213
Nhek      0.80832  0.83796  0.80481  0.64008  0.83221    0.80832
Nhlf      0.84788  0.87823  0.84663  0.70735  0.85957    0.84788
Oste      0.88605  0.92901  0.88565  0.77758  0.89155    0.88605
4 Conclusion

In recent years, Transformer-based models have achieved state-of-the-art performance in the field of NLP. As research progressed, researchers transferred them to other fields and achieved equally desirable results. In our work, we demonstrate that the performance of DNABERT for Protein-DNA binding site prediction exceeds that of other existing tools. Owing to the sequence similarity between genomes, the DNABERT pre-trained model makes it possible to transfer biological information between them. DNA sequences cannot be interpreted directly by a machine, and DNABERT offers a solution to the problem of deciphering the language of non-coding DNA, capturing the hidden syntax and semantics in DNA sequences and showing excellent results. Although DNABERT performs well in predicting Protein-DNA binding sites, there is room for further improvement. The CLS token represents the global information of the sequence, and the remaining tokens represent the features of each part of the sequence; processing them separately may better capture the sequence features and achieve better results. Nevertheless, the BERT pre-training method currently achieves the most advanced performance for Protein-DNA binding site prediction, and the use of DNABERT brings the perspective of high-level language modeling to genomic sequences, providing new advances and insights for the future of bioinformatics.

Acknowledgments. This work was supported in part by the University Innovation Team Project of Jinan (2019GXRC015), the Natural Science Foundation of Shandong Province, China (Grant No. ZR2021MF036).
References

1. Rohs, R., Jin, X., West, S.M., Joshi, R., Honig, B., Mann, R.S.: Origins of specificity in protein-DNA recognition. Annu. Rev. Biochem. 79, 233–269 (2010). https://doi.org/10.1146/annurev-biochem-060408-091030
2. Jordan, M.I., LeCun, Y., Solla, S.A. (eds.): Advances in Neural Information Processing Systems: Proceedings of the First 12 Conferences. MIT Press, Cambridge (2004)
3. Liu, Y., et al.: RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019)
4. Liu, Y., Zhu, Y.-H., Song, X., Song, J., Yu, D.-J.: Why can deep convolutional neural networks improve protein fold recognition? A visual explanation by interpretation. Brief. Bioinform. 22, bbab001 (2021). https://doi.org/10.1093/bib/bbab001
5. Karin, M.: Too many transcription factors: positive and negative interactions. New Biol. 2, 126–131 (1990)
6. Latchman, D.S.: Transcription factors: an overview. Int. J. Biochem. Cell Biol. 29, 1305–1312 (1997). https://doi.org/10.1016/s1357-2725(97)00085-x
7. Jolma, A., et al.: DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013). https://doi.org/10.1016/j.cell.2012.12.009
8. Tuupanen, S., et al.: The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nat. Genet. 41, 885–890 (2009). https://doi.org/10.1038/ng.406
9. Wasserman, W.W., Sandelin, A.: Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 5, 276–287 (2004). https://doi.org/10.1038/nrg1315
10. Lambert, S.A., et al.: The human transcription factors. Cell 172, 650–665 (2018). https://doi.org/10.1016/j.cell.2018.01.029
11. Basith, S., Manavalan, B., Shin, T.H., Lee, G.: iGHBP: computational identification of growth hormone binding proteins from sequences using extremely randomised tree. Comput. Struct. Biotechnol. J. 16, 412–420 (2018). https://doi.org/10.1016/j.csbj.2018.10.007
12. Furey, T.S.: ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat. Rev. Genet. 13, 840–852 (2012). https://doi.org/10.1038/nrg3306
13. Manavalan, B., Shin, T.H., Lee, G.: DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest. Oncotarget 9, 1944–1956 (2017). https://doi.org/10.18632/oncotarget.23099
14. Wong, K.-C., Chan, T.-M., Peng, C., Li, Y., Zhang, Z.: DNA motif elucidation using belief propagation. Nucleic Acids Res. 41, e153 (2013). https://doi.org/10.1093/nar/gkt574
15. Li, L., et al.: Sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel SVM. BMC Bioinform. 15, 340 (2014). https://doi.org/10.1186/1471-2105-15-340
16. Angermueller, C., Pärnamaa, T., Parts, L., Stegle, O.: Deep learning for computational biology. Mol. Syst. Biol. 12, 878 (2016). https://doi.org/10.15252/msb.20156651
17. Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649 (2013). https://doi.org/10.1109/ICASSP.2013.6638947
18. Hong, J., et al.: Convolutional neural network-based annotation of bacterial type IV secretion system effectors with enhanced accuracy and reduced false discovery. Brief. Bioinform. 21, 1825–1836 (2020). https://doi.org/10.1093/bib/bbz120
19. Hong, J., et al.: Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning. Brief. Bioinform. 21, 1437–1447 (2020). https://doi.org/10.1093/bib/bbz081
20. Min, S., Kim, H., Lee, B., Yoon, S.: Protein transfer learning improves identification of heat shock protein families. PLoS ONE 16, e0251865 (2021). https://doi.org/10.1371/journal.pone.0251865
i6mA-word2vec: A Newly Model Which Used Distributed Features for Predicting DNA N6-Methyladenine Sites in Genomes

Wenzhen Fu1, Yixin Zhong2, Baitong Chen3, Yi Cao4(B), Jiazi Chen5, and Hanhan Cong6

1 School of Information Science and Engineering, University of Jinan, Jinan, China
2 School of Artificial Intelligence Institute and Information Science and Engineering, University of Jinan, Jinan, China
3 Xuzhou First People’s Hospital, Xuzhou, China
4 Shandong Provincial Key Laboratory of Network Based Intelligent Computing (School of Information Science and Engineering), University of Jinan, Jinan, China
[email protected]
5 Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-shi, Fukuoka, Japan
6 School of Information Science and Engineering, Shandong Normal University, Jinan, China
Abstract. DNA N6-methyladenine (6mA) is a widespread and widely studied epigenetic modification that plays a vital role in cell growth and development. 6mA is involved in many cellular processes, such as the regulation of gene expression and the crosstalk between transposons and histone modification. Therefore, the prediction of 6mA sites is very significant in biological research. Unfortunately, existing biological experimental methods are expensive in both time and money, and they cannot meet the needs of current research. It is therefore high time to develop a targeted and efficient computational model. This paper proposes an intelligent and efficient computational model, i6mA-word2vec, for the discrimination of 6mA sites. In our work, we use word2vec from the field of natural language processing to carry out distributed feature encoding. The word2vec model automatically represents the target class topic. The extracted feature space is then fed into a convolutional neural network as prediction input. The experimental results show that our computational model has better performance.

Keywords: Methylation of DNA N6-methyladenine · word2vec · Deep learning
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 670–679, 2022. https://doi.org/10.1007/978-3-031-13829-4_58

1 Introduction

DNA N6-methyladenine (6mA) is a modification in which the sixth nitrogen atom of adenine is methylated. It plays a crucial role in various life activities of eukaryotes [1–4]. It was first observed in bacteria [5]. Compared with 5mC, 6mA remains rarely studied. In earlier studies, researchers believed that DNA N6 methylation sites occur only in prokaryotes [6]. However, as research has deepened, scientists have also found 6mA sites in animal and plant species (such as Arabidopsis, mice, zebrafish, etc.). Recently, researchers found that DNA 6mA sites are also widely present in the genomes of humans [7] and rice [8]. These findings further explain the multiple functions of 6mA in eukaryotes, such as the regulation of gene expression and the crosstalk between transposons and histone modification.

Several high-throughput experimental methods now exist to identify DNA methylation sites, including methylated DNA immunoprecipitation [9], capillary electrophoresis with laser-induced fluorescence [10], liquid chromatography-tandem mass spectrometry, and single-molecule real-time sequencing [11]. They can detect DNA methylation sites at single-nucleotide resolution, but they cannot detect DNA methylation sites across the whole genome. Moreover, sequencing by experimental methods is expensive and time-consuming. It is therefore of great interest to build an efficient computational prediction model to compensate for the shortcomings of the experiments.

Machine learning performs specific tasks by extracting sequence features to build models. It has been widely used in biological problems such as post-transcriptional RNA recognition [12–14], promoter discovery [15–17], and nucleotide modification prediction [18–20]. Studies have shown that the occurrence of 6mA is closely related to the properties of its surrounding sequences, which play an essential role in the catalytic process dependent on methyltransferase and demethylase [10, 21, 22].

As the first tool for 6mA site recognition, i6mA-Pred [23] can identify DNA 6mA sites in the rice genome. It makes full use of both nucleotide chemical properties and position-specific nucleotide frequencies. Chen et al. used the support vector machine (SVM) as the classifier. A further notable contribution of that work is a high-quality rice 6mA benchmark dataset, which contains 880 positive samples and 880 negative samples. Huang et al. built a model called 6mA-Rice-Pred [24], using the same dataset as i6mA-Pred to identify 6mA sites. For feature encoding, they used binary encoding, K-mer encoding, and Markov features. The MM-6mAPred tool uses a Markov model to identify 6mA sites in the rice genome based on the dependency information between adjacent nucleotides. In addition, based on physicochemical properties and seven kinds of sequence-oriented information, the method named 6mA-Finder [25] obtains the optimal feature group using a recursive feature elimination strategy. However, it has not been tested on independent datasets. The model called iDNA6mA-PseKNC predicts 6mA sites based on the mouse genome, using PseKNC features and a support vector machine (SVM). However, instead of using the same species to evaluate their prediction model, the authors used eight other species (nematodes, Arabidopsis, Escherichia coli, acid bacteria, alteromonadaceae bacteria, seaweed polycyclic seaweed worm, gastrococcus xanthoma and melon sheath). Compared with the previous research, iDNA6mA introduced deep learning methods. Based on the rice dataset, it used MBE coding to convert DNA sequences into 164-dimensional feature vectors to create the final model. The technique named iDNA6mA-Rice constructs a training dataset containing 154000 6mA sites and 154000 non-6mA sites. It adopts three encodings (MBE, KNC and natural vector) and uses random forest as its classifier. Unlike the existing methods, this method uses independent datasets to evaluate the model. On the basis of the iDNA6mA-Rice dataset, the SNNRice6mA model adopts MBE coding and a deep learning architecture, evaluated with ten-fold cross-validation. Unfortunately, the robustness of this model was also not evaluated on independent datasets. The model called 6mAPred-FO is developed using a fusion optimization protocol. This method is based on nucleotide position specificity (NPS) and PseKNC coding. It then uses analysis of variance to select the optimal features, which are sent to the support vector machine (SVM) classifier. i6mAFuse is built on the datasets of two species (RC and FV). Sequences with 65% consistency with other sample sequences are removed by CD-HIT. Six different feature encodings (KMER, DPCP, EIIP, MBE, KNC and TPCP) and random forest are used to build the model, and the 6mA prediction probabilities are combined by linear regression.

The latest deep learning technology can further improve the prediction accuracy of 6mA sites. In our work, we propose a new prediction model, i6mA-word2vec, which uses embedding technology from natural language processing to encode features and learn the features hidden between 6mA sites and non-6mA sites in multiple species. Among word vector embeddings, word2vec is considered one of the best embedding methods. It is used for various classification and detection tasks in bioinformatics. The features obtained by distributed feature encoding are sent to a convolutional neural network to identify 6mA sites. The experimental results show that our proposed method is better than the existing computational models.
2 Materials and Methods

2.1 Dataset

The datasets used in our model are 6mA datasets from rice DNA sequences. Dataset1 was provided by Chen et al. [23], and dataset2 comes from Lv et al. Dataset1 was obtained from the National Center for Biotechnology Information (NCBI) in 2019, and CD-HIT was used to remove sequences with more than 60% homology. Formulas (1) and (2) give the mathematical representation of dataset1 and dataset2, respectively:

$$S_1 = S_1^{+} \cup S_1^{-} \tag{1}$$

$$S_2 = S_2^{+} \cup S_2^{-} \tag{2}$$

where $S_1$ is the rice dataset composed of 1760 samples, $S_1^{+}$ is the positive subset of 880 samples containing 6mA sites, and $S_1^{-}$ is the negative subset of 880 samples. Dataset2, denoted $S_2$, contains 308000 samples, comprising 154000 positive samples and 154000 negative samples (Table 1).
Table 1. Benchmark dataset

Dataset       Positive  Negative  Total   Species
i6mA-Pred     880       880       1760    Rice
6mA-rice-Lv   154000    154000    308000  Rice
2.2 Feature Encoding

In the field of natural language processing, word2vec is one of the most efficient embedding technologies. It learns the context of words, so the resulting distributed features can encode different linguistic regularities and patterns. To date, it has been widely used in bioinformatics. In the word2vec model, there are two ways to learn word context: Continuous Bag-of-Words (CBOW) and continuous Skip-Gram. The difference between the two is that, during learning, the CBOW model predicts the current word from the words in its context, while the Skip-Gram model predicts the context from the current word. The Skip-Gram model tends to perform better for uncommon words. Since our research focuses on frequently used words, we use the CBOW model for word2vec learning.

In the first step of constructing the word2vec model, consecutive k-mers in the DNA sequence are regarded as words, and the word2vec model is trained on the dataset of existing species. We use the CBOW method to train the word2vec model, which predicts the current word w(t) from the context words within a predefined window around it. The learned representation of the word at position t, i.e. w(t), depends on the window size. In the experiment, we choose a window size of 5. The input of the CBOW-based model is:

$$\sum_{k=-2,\; k \neq 0}^{2} w(t + k) \tag{3}$$

In the word2vec model, each k-mer is represented by a 100-dimensional vector, and the corresponding feature matrix is generated by concatenating these vectors. Therefore, the sequence window of each nucleotide is represented by an (L − k + 1) × 100 matrix, where 100 is the dimension of the word vectors and L is the length of the DNA sequence (Figs. 1 and 2).
Fig. 1. CBOW model
Fig. 2. Word2vec model
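To make the CBOW input of Formula (3) concrete, the sketch below builds (context, target) training pairs from the k-mer words of a sequence, using two words on each side of the target (a window of 5 including the target). It is an illustration only: the example sequence and the helper names `kmer_words` and `cbow_examples` are hypothetical, and the actual embedding training (done in practice with a word2vec library) is not reproduced here.

```python
def kmer_words(seq, k):
    """Split a DNA sequence into overlapping k-mer words (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def cbow_examples(words, half_window=2):
    """Build (context_words, target_word) pairs.

    The context of the word at position t is w(t + j) for
    j in -half_window..half_window with j != 0, truncated at the
    ends of the sequence.
    """
    examples = []
    for t, target in enumerate(words):
        context = [words[t + j]
                   for j in range(-half_window, half_window + 1)
                   if j != 0 and 0 <= t + j < len(words)]
        examples.append((context, target))
    return examples

seq = "ACGTACGTAC"                 # hypothetical 10-nt sequence
words = kmer_words(seq, k=4)       # L - k + 1 = 7 words
examples = cbow_examples(words)

embedding_dim = 100                # vector size stated in the text
matrix_shape = (len(words), embedding_dim)  # (L - k + 1) x 100
```

For an interior position such as t = 3, the context is the four neighbouring 4-mers and the target is the word at t; at the ends of the sequence fewer context words are available.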
2.3 Convolutional Neural Network

In recent years, the convolutional neural network (CNN) has been widely used in bioinformatics and computational biology. We use the distributed feature representation generated by word2vec as the input of the CNN model. Various hyperparameters are tuned during the learning process, including the number of convolution layers, the number of filters, the convolution filter size, the max-pooling size, and the dropout probability after convolution. To obtain better performance, we rely on classic evaluation metrics. In our work, the network consists of two convolution layers. We use a nonlinear activation function, the rectified linear unit (ReLU), in all layers. In addition, to avoid overfitting, we use dropout regularization with a dropout probability of 0.25. A fully connected layer follows, and the sigmoid function is used to predict whether a site is a 6mA site; its output lies in the range [0, 1]. These functions are defined as follows:

$$\mathrm{ReLU}(z) = \max(0, z) \tag{4}$$

$$\mathrm{Sigmoid}(z) = \frac{1}{1 + e^{-z}} \tag{5}$$
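The two activation functions in Formulas (4) and (5) can be written directly in a few lines. This is a minimal sketch; the branch on the sign of z in `sigmoid` is an implementation choice made here to avoid overflow for large negative inputs, not something described in the paper.

```python
import math

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), computed without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) < 1
    return ez / (1.0 + ez)
```

For example, `relu(-2.0)` gives 0.0, `relu(3.0)` gives 3.0, and `sigmoid(0.0)` gives 0.5, the midpoint of the output range.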
3 Prediction Accuracy Assessment

In our research, we use indicators adopted by many researchers to evaluate the success rate of our computational model, including accuracy (ACC), sensitivity (Sn), specificity (Sp) and Matthews correlation coefficient (MCC):

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{6}$$

$$Sn = \frac{TP}{TP + FN} \times 100\% \tag{7}$$

$$Sp = \frac{TN}{TN + FP} \times 100\% \tag{8}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}} \tag{9}$$

Here TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. We set the threshold for judging whether a site is a 6mA site to 0.5. We use the scikit-learn package in Python to calculate these evaluation indicators.
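As an illustration of how these four indicators follow from thresholded predictions, the sketch below uses the standard definitions of ACC, Sn, Sp and MCC in plain Python. It is a stand-in for the scikit-learn calls mentioned in the text, not the authors' code; the labels, probabilities, and function names are hypothetical, and it assumes both classes are present so that no denominator in Sn or Sp is zero.

```python
import math

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding predicted probabilities."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def evaluate(tp, fp, tn, fn):
    """Return ACC, Sn, Sp (in percent) and MCC (in [-1, 1])."""
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}

# Hypothetical labels and sigmoid outputs for six samples.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.2, 0.7, 0.1]
counts = confusion_counts(y_true, y_prob)
scores = evaluate(*counts)
```

On a balanced dataset, ACC is the average of Sn and Sp, which gives a quick sanity check on reported results.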
4 Result and Discussion

4.1 Comparison with the Different k Values Based on the Same Dataset

In our experiments, we found that the value of k greatly impacts the prediction of 6mA sites. We therefore varied k in the i6mA-word2vec model. For a more objective comparison, we used the dataset provided by the iDNA6mA-Rice method. The dataset contains 154000 positive samples containing 6mA sites and 154000 negative samples without 6mA sites, all from the rice genome. When preprocessing the rice dataset, we carried out experiments with different k values, namely 3-mers, 4-mers and 5-mers. The results show that, among these k-mer segmentations, 4-mers has the best success rate. Table 2 records the performance of our model under different k values (Fig. 3).
Table 2. The performance of i6mA-word2vec and the different k values in the model based on the same dataset

k        Sn (%)  Sp (%)  ACC (%)  MCC
3-mers   81.03   82.11   82.15    0.620
4-mers   79.90   83.93   88.20    0.625
5-mers   79.86   83.31   79.77    0.615
Fig. 3. Comparison with the different k-mers based on the same dataset (bar chart of Sn, Sp, ACC and MCC for 3-mers, 4-mers and 5-mers)
4.2 Comparison with the Different Datasets Based on the Same k Value

In the experiment, we further analyzed the impact of different datasets on the model’s performance. We used 4-mers to experiment on the i6mA-Pred dataset and the 6mA-Rice-Lv dataset. Figure 4 shows the confusion matrices for the two datasets.
Fig. 4. Comparison with the different datasets based on the 4-mers
4.3 Comparison with the Existing Classical Methods

In addition, we compare our method with existing methods such as i6mA-Pred and 6mA-RicePred. Our approach achieves the highest accuracy of the three methods (Table 3).

Table 3. Detailed performance comparison

Method          Sn (%)  Sp (%)  Acc (%)  MCC
i6mA-Pred       83.41   83.64   83.52    0.67
6mA-RicePred    84.89   89.66   87.27    0.75
i6mA-word2vec   79.90   83.93   88.20    0.625
Fig. 5. Comparison with the existing classical methods (bar chart of Sn, Sp, ACC and MCC for i6mA-Pred, 6mA-RicePred and i6mA-word2vec)
5 Conclusion

In this study, we propose a more intelligent model for detecting DNA N6 methylation sites, namely i6mA-word2vec. The model has two critical steps: distributed feature encoding and classification. In the first step, we use word2vec from natural language processing to process the data and obtain a distributed feature encoding. We then feed the extracted feature space into the CNN model as input for prediction. The model we designed improves on various evaluation metrics. Future research should focus on developing models that automatically extract features from datasets. Our research can help reduce the difficulty of manual feature selection and classification model selection in each epigenetic prediction task.

Acknowledgments. This work was supported in part by the University Innovation Team Project of Jinan (2019GXRC015), the Natural Science Foundation of Shandong Province, China (Grant No. ZR2021MF036).
References
1. Luo, G.Z., Blanco, M.A., Greer, E.L., et al.: DNA N(6)-methyladenine: a new epigenetic mark in eukaryotes? Nat. Rev. Mol. Cell Biol. 16, 705–710 (2015)
2. Liu, B., et al.: iRO-3wPseKNC: identify DNA replication origins by three-window-based PseKNC. Bioinformatics 34(18), 3086–3093 (2018). https://doi.org/10.1093/bioinformatics/bty312
3. Wahab, A., Ali, S.D., Tayara, H., et al.: iIM-CNN: intelligent identifier of 6mA sites on different species by using convolution neural network. IEEE Access 7, 178577–178583 (2019)
4. Alam, W., Ali, S.D., Tayara, H., et al.: A CNN-based RNA N6-methyladenosine site predictor for multiple species using heterogeneous features representation. IEEE Access 8, 138203–138209 (2020)
5. Dunn, D.B., Smith, J.D.: Occurrence of a new base in the deoxyribonucleic acid of a strain of bacterium coli. Nature 175, 336–337 (1955)
6. Vanyushin, B.F., Tkacheva, S.G., Belozersky, A.N.: Rare bases in animal DNA. Nature 225(5236), 948–949 (1970)
7. Xiao, C.L., Zhu, S., He, M., et al.: N6-methyladenine DNA modification in the human genome. Mol. Cell 71(2), 306–318 (2018)
8. Chao, Z., Wang, C., Liu, H., et al.: Identification and analysis of adenine N6-methylation sites in the rice genome. Nat. Plants 4, 554–563 (2018)
9. Pomraning, K.R., Smith, K.M., Freitag, M.: Genome-wide high throughput analysis of DNA methylation in eukaryotes. Methods 47(3), 142–150 (2009)
10. Krais, A.M., et al.: Genomic N6-methyladenine determination by MEKC with LIF. Electrophoresis 31, 3548–3551 (2010)
11. Flusberg, B.A., Webster, D.R., Lee, J.H., et al.: Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods 7(6), 461–465 (2010)
12. de Araujo Oliveira, J.V., et al.: SnoReport 2.0: new features and a refined Support Vector Machine to improve snoRNA identification. BMC Bioinform. 17(18), 464 (2016)
13. Gupta, Y., et al.: ptRNApred: computational identification and classification of posttranscriptional RNA. Nucleic Acids Res. 42(22), e167 (2014)
14. Jana, H., Hofacker, I.L., Stadler, P.F.: SnoReport: computational identification of snoRNAs with unknown targets. Bioinformatics 2, 158–164 (2008)
15. Song, K.: Recognition of prokaryotic promoters based on a novel variable-window Z-curve method. Nucleic Acids Res. 40(3), 963–971 (2012)
16. Umarov, R.Kh., Solovyev, V.V.: Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS One 12(2), e0171410 (2017)
17. Wu, Q., Wang, J., Yan, H.: An improved position weight matrix method based on an entropy measure for the recognition of prokaryotic promoters. Int. J. Data Min. Bioinform. 5(1), 22 (2011)
18. Chen, W., et al.: iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 33, 3518–3523 (2017)
19. He, W., et al.: 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction. Bioinformatics 35, 593–601 (2018)
20. Liu, Z., Xiao, X., Qiu, W.R., et al.: IDNA-methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem. 474, 69 (2015)
21. Fu, Y., Luo, G.Z., Chen, K., et al.: N6-methyldeoxyadenosine marks active transcription start sites in Chlamydomonas. Cell 161(4), 879–892 (2015)
22. Iyer, L.M., Abhiman, S., Aravind, L.: Natural history of eukaryotic DNA methylation systems. Progress Mol. Biol. Transl. Sci. 101(101), 25–104 (2011)
23. Chen, W., Lv, H., Nie, F., et al.: i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics 35(11), 2796–2800 (2019)
24. Huang, Q., Zhang, J., Wei, L., et al.: 6mA-RicePred: a method for identifying DNA N6-methyladenine sites in the rice genome based on feature fusion. Front. Plant Sci. 11, 4 (2020)
25. Xu, H., Hu, R., Jia, P., et al.: 6mA-Finder: a novel online tool for predicting DNA N6-methyladenine sites in genomes. Bioinformatics 36(10), 3257–3259 (2020)
Oxides Classification with Random Forests
Kai Xiao1, Baitong Chen2, Wenzheng Bao1, and Honglin Cheng1(B)
1 School of Information Engineering, Xuzhou University of Technology, Xuzhou 221018, China
[email protected]
2 Xuzhou First People’s Hospital, Xuzhou 221000, China
Abstract. Oxides are divided into binary and ternary oxides according to the number of their component elements. Binary oxides are widely used in industrial catalysis, and ternary oxides have broad prospects in the field of new energy. This paper studies an oxide classification method based on machine learning. To simplify the calculation, the way oxides are described is redefined. The prediction results are evaluated by accuracy, F1 score and recall, and the best algorithm is selected. The results show that the random forest has a good classification effect and is of great significance to the identification of oxides.
Keywords: Oxide · Classification algorithm · Machine learning · Random forests
1 Introduction
Metal oxides are compounds consisting of oxygen in combination with metallic elements, such as Fe2O3 and Na2O. Oxides include basic oxides, acidic oxides, peroxides, superoxides and amphoteric oxides. Metal oxides are widely used in daily life. Quicklime is a common desiccant. Fe2O3 is often used as a red pigment. Some catalysts used in industry are also metal oxides. Metal oxides play an important role in the field of catalysis, where they are widely used as main catalysts, cocatalysts and supports. Binary oxides contain two elements including oxygen, such as Fe2O3 and MgO. Oxides containing three elements are called ternary oxides, which are relatively rare. They often involve elements at the boundary between metals and nonmetals, such as Si and Ge, and can be written in the form of salts or oxides, such as Ca1Fe1O2 and Ba4Ga2O7. Ternary oxides can be used as composite electrode materials and for treating special pollutants in wastewater. They can also be used to make lubricating materials, and have excellent high-temperature stability and self-lubricating performance. Ternary oxides are widely used in the fields of environmental protection, catalysis and new energy. Because binary and ternary oxides are both widely applied, distinguishing the two quickly and effectively has become a problem to be solved. With the rapid development of computer science, machine learning has matured and is used in many areas. Among its algorithms, k-nearest neighbor (KNN), random forest, naive Bayes, ensemble learning, discriminant analysis and the support vector machine are among the most studied. Therefore, this experiment applies several typical machine learning algorithms and evaluates the final recognition results with several evaluation indicators. In this article, we focus on the classification of binary and ternary oxides. We use data such as atomic number, electron number and electronegativity, analyze them with a variety of machine learning methods, and compare the results of each algorithm to achieve efficient classification of binary and ternary oxides. The results show that the random forest has the highest accuracy in distinguishing binary and ternary oxides and that accurate and convenient identification is achieved, which provides a reference for distinguishing oxides.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 680–686, 2022. https://doi.org/10.1007/978-3-031-13829-4_59
Fig. 1. Flow chart of the experiment
2 Methods and Materials
2.1 Materials
The data of this experiment came from the Citrine Informatics database, and the data set consisted of 271 binary and 1009 ternary metal oxides. The database covers six structural families: antifluorite, corundum, halite, rutile, perovskite and spinel. By magnetic state, the database consists of ferromagnetic, antiferromagnetic and nonmagnetic compounds. In addition to oxygen, the 1280 metal oxides are composed of three alkali metal elements, four alkaline earth metal elements, 22 transition metal elements and five other metal elements (Al, Ga, In, Ge, Sn), which is sufficient to indicate the diversity of structures and the generality of the experimental results.
Fig. 2. Data set proportion
2.2 Characteristic Variables
The first step in building a machine learning model is to select suitable features as inputs, which should accurately describe metal oxides and have obvious physical significance. Although traditional methods can describe oxides effectively, they lack obvious physical significance and are complicated to calculate. To this end, this paper redefines six characteristics $X_i$ ($i = 1, 2, \ldots, 6$) to uniquely describe the metal oxides. The first and second characteristics are the total numbers of atoms and electrons per unit:

$$X_1 = N_t \tag{1}$$

$$X_2 = \sum_i E_i \tag{2}$$

where $N_t$ is the total number of atoms per cell and $E_i$ is the number of electrons of atom $i$. The characteristics of an oxide must be related to the oxygen element. The third and fourth characteristics are defined as follows:

$$X_3 = \chi_O - \chi_n \tag{3}$$

$$X_4 = N(O)_t \tag{4}$$

where $\chi_O$ is the electronegativity of the O atom on the Pauling scale, $\chi_n$ is the electronegativity of the atom nearest to the O atom, and $N(O)_t$ is the total number of oxygen atoms in each unit. The other characteristics are defined as follows:

$$X_5 = \frac{N(O)_t}{N_t} \tag{5}$$

$$X_6 = \frac{N(O)_t \times E_O}{E_t} \tag{6}$$

where $E_O$ is the number of electrons in a single O atom, and $E_t$ is taken to be the total number of electrons per unit, by analogy with $N_t$.
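The six characteristics can be illustrated with a short sketch. It makes several assumptions that are not stated in the paper: the counting unit is the formula unit rather than a crystallographic cell, $E_t$ equals $X_2$, and the element nearest to oxygen is passed in by hand because it depends on the crystal structure. The function name `oxide_features` and the small lookup tables are hypothetical.

```python
# Atomic numbers (electrons per neutral atom) and Pauling electronegativities
# for a few elements; a real study would use a complete table.
ATOMIC_NUMBER = {"O": 8, "Mg": 12, "Ca": 20, "Fe": 26}
PAULING_EN = {"O": 3.44, "Mg": 1.31, "Ca": 1.00, "Fe": 1.83}


def oxide_features(composition, nearest_to_oxygen):
    """Return [X1, ..., X6] for a composition such as {"Fe": 2, "O": 3}.

    `nearest_to_oxygen` names the element closest to O (assumed known).
    """
    n_t = sum(composition.values())                                     # X1
    e_t = sum(ATOMIC_NUMBER[el] * n for el, n in composition.items())   # X2
    x3 = PAULING_EN["O"] - PAULING_EN[nearest_to_oxygen]                # X3
    n_o = composition["O"]                                              # X4
    x5 = n_o / n_t                                                      # X5
    x6 = n_o * ATOMIC_NUMBER["O"] / e_t                                 # X6
    return [n_t, e_t, x3, n_o, x5, x6]


print(oxide_features({"Fe": 2, "O": 3}, nearest_to_oxygen="Fe"))
```

For Fe2O3 this gives five atoms and 76 electrons per formula unit, so that $X_5 = 3/5$ and $X_6 = 24/76$.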
2.3 Methods
In this experiment, six classification models, namely k-nearest neighbor, random forest, naive Bayes, ensemble learning, discriminant analysis and support vector machine, were used to analyze and process the data. The k-fold cross-validation method is adopted. For small-sample data, the test result of this method is more reliable than that of a single division of the original sample into a training set and a test set. The value of k is set to 10: the data are divided into 10 parts, and each part in turn serves as the test set while the remaining nine parts serve as the training set. The grid search method is used to determine the optimal hyperparameters.
2.4 Parameter Optimization
Given that ACC is our objective function, the classification algorithm may overfit the prediction model to achieve the highest ACC. Therefore, we randomly re-divided the training data set 10 times to generate 10 feature input sets for each classifier. For example, the optimal SVM parameters C and G are obtained on each of the different feature sets, and the C and G chosen in this way are used to develop the final prediction model. This random cross-validation technique helps to avoid overfitting. Finally, the average performance obtained from cross-validation is compared in order to select the best model for the experiment.
2.5 Model Execution
All cross-validation and performance evaluations are performed in MATLAB 2019. We calculate three different scoring functions (F1, ACC and recall) for evaluation.
2.6 Evaluation Index
The accuracy of the classification results was selected as an evaluation index of model performance, and the confusion matrix was established to evaluate the actual classification ability of the model. All classification results fall into four cases: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). In this experiment, four evaluation indexes were applied to evaluate the performance of the model.
Accuracy refers to the proportion of correct results predicted by the model:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

Precision is the proportion of samples predicted as positive that are actually positive:

$$\text{precision} = \frac{TP}{TP + FP} \tag{8}$$

Recall is the proportion of actual positive samples that the model correctly predicts:

$$\text{recall} = \frac{TP}{TP + FN} \tag{9}$$
Precision and recall sometimes conflict with each other. To take both into account, the F1 value is also used as one of the evaluation indexes. The F1 value is the harmonic mean of precision and recall:

$$F1 = 2 \times \frac{1}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} \tag{10}$$
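Equations (7) to (10) translate directly into code. The following is a minimal sketch; the function name `classification_scores` and the example counts are hypothetical, and the sketch assumes that at least one sample is predicted positive and at least one is actually positive, so that no denominator is zero.

```python
def classification_scores(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F1) as defined in Eqs. (7)-(10)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (7)
    precision = tp / (tp + fp)                   # Eq. (8)
    recall = tp / (tp + fn)                      # Eq. (9)
    f1 = 2 / (1 / precision + 1 / recall)        # Eq. (10), harmonic mean
    return accuracy, precision, recall, f1


# Hypothetical counts, for illustration only.
print(classification_scores(tp=8, tn=5, fp=2, fn=1))
```

Because the harmonic mean is dominated by the smaller of its two arguments, a classifier cannot reach a high F1 value by maximizing precision or recall alone.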
3 Model Results and Analysis
3.1 Performance Evaluation of the Model
Using the feature inputs redefined according to the physical properties of oxides and six machine learning models, namely KNN, naive Bayes (NB), ensemble learning (ENS), SVM, discriminant analysis (DAC) and random forest (RT), we studied the role of each model in oxide classification. We conducted 10 random 10-fold cross-validations for each model to obtain the experimental results, and compared the performance of the models. Table 1 shows that RT has superior performance. The average ACC of KNN, NB, ENS, SVM, DAC and RT was 96.8, 86.7, 96.1, 89.0, 87.5 and 98.4%, respectively. The results show that all the classifiers perform well in oxide classification.
3.2 Model Selection
As mentioned in the establishment of the model, six machine learning algorithms are applied to classify oxides and three evaluation functions are used to evaluate the performance of the model. The results are shown in Table 1. Among them, the accuracy of KNN, ENS and RT is very high. They all perform well, but RT is clearly superior. In terms of recall and F1 value, RT reaches 98.0% and 0.98, which is also higher than the other two algorithms. Therefore, among the six classifiers, we choose the random forest, whose performance is better than that of the other classifiers.

Table 1. Model training results

Machine learning   Accuracy/%   Recall/%   F1
KNN                96.8         97.9       0.97
RT                 98.4         98.0       0.98
NB                 86.7         86.4       0.91
ENS                96.0         95.1       0.97
DAC                87.5         87.9       0.92
SVM                89.0         88.8       0.93
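The repeated 10-fold cross-validation protocol behind Table 1 can be sketched in plain Python as follows. This is only an illustration of the index bookkeeping, not the MATLAB code used in the experiments: the function names are hypothetical, and `evaluate` stands for any user-supplied routine that trains a classifier on the training indices and returns a score on the test indices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for one random k-fold partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_cv(n_samples, evaluate, k=10, repeats=10):
    """Average `evaluate(train, test)` over `repeats` random k-fold partitions."""
    scores = []
    for rep in range(repeats):
        # A different seed per repetition gives a different random partition.
        for train, test in k_fold_indices(n_samples, k, seed=rep):
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)


# Dummy scorer: the fraction of the 1280 oxides that lands in each test fold.
print(repeated_cv(1280, lambda train, test: len(test) / 1280))
```

Averaging over several random partitions reduces the dependence of the reported score on one particular split.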
4 Conclusion
In the experiment, a method for the classification of binary and ternary oxides is presented. In the past, the characteristic inputs of oxides were cumbersome to calculate and lacked obvious physical significance, so the experiment redefined six characteristics to uniquely describe the metal oxides. To classify accurately, six machine learning algorithms were compared and three evaluation functions were used to evaluate the performance of the models. The proposed method performs well in experiments and has high classification accuracy. We will continue to look for more easily calculated characteristic inputs to describe oxides, and to try more classification methods to further improve classification performance.
In summary, the experimental method has achieved relatively stable and excellent performance, but there is still room for improvement. As research in machine learning deepens and its methods become more efficient, scholars can design new and efficient classifiers for oxides. The method proposed in this paper can classify oxides efficiently and save time and cost.
Although this paper has achieved good results in the classification of oxides, there are still some shortcomings. The method proposed in this paper applies only to binary and ternary oxides, so the scope of application of the classifier is small. Therefore, more kinds of data can be collected in future work to broaden the scope of application of the classifier. The classifier used in this paper is the random forest. However, with the development of computer technology, more efficient classification methods, such as deep learning and convolutional neural networks, have emerged and been applied in practice. Therefore, in future work, we should use these new classification methods to improve the classification ability of the model.
Acknowledgement. This work was supported by the Natural Science Foundation of China (No. 61902337), the Fundamental Research Funds for the Central Universities (2020QN89), the Xuzhou science and technology plan project (KC19142, KC21047), the Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), the Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016) and Young talents of science and technology in Jiangsu. Baitong Chen and Kai Xiao can be treated as the co-first authors.
Availability of Data and Materials. The data used in this study are available upon request.
Ethics Approval and Consent to Participate. Not applicable.
Consent for Publication. Not applicable.
Competing Interests. The authors declare that they have no competing interests.
Protein Sequence Classification with LetNet-5 and VGG16
Zheng Tao1, Zhen Yang1, Baitong Chen2, Wenzheng Bao1, and Honglin Cheng1(B)
1 School of Information Engineering, Xuzhou University of Technology, Xuzhou 221018, China
[email protected]
2 Xuzhou First People’s Hospital, Xuzhou 221000, China
Abstract. Classification of protein sequences is an important method to predict the structure and function of novel protein sequences. Determining protein function plays a very important role in promoting both disease prevention and drug development. With the continuous development of bioinformatics and the large accumulation of related data, predicting the function of unknown proteins with computational methods has become an important research topic in bioinformatics in the post-genomic era, and the classification of protein sequences has become one of the primary tasks of current life science research. In this paper, we use two classical classification networks, LeNet-5 and VGG16, to study the classification of protein sequences.
Keywords: Protein sequences · Classification · LeNet-5 · VGG16
1 Introduction
Recently, Google trained a deep learning model called ProtCNN that can accurately predict the function of protein sequences, allowing more unknown protein sequences to be annotated. These annotations are evaluated on a rigorous benchmark constructed from the mainstream protein family database Pfam, which records a series of protein families and their functional annotations. That research expanded the coverage of protein sequences in the Pfam database by 9.5%. These efforts show that applying artificial intelligence to computational proteomics is a novel and promising approach. Extracting structure and biological function from the sharply expanding volume of protein sequence data is an important challenge in the post-genomic era. The protein structure type intuitively describes the complete spatial folding pattern of a protein; it is an important source of information for explaining protein structure and function, and provides a theoretical basis for the development of related biotechnology. Classical biological experimental methods for determining the type of protein structure are time-consuming, laborious and expensive. Therefore, developing rapid and effective tools for protein structure class prediction is very meaningful work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 687–696, 2022. https://doi.org/10.1007/978-3-031-13829-4_60
This paper focuses on the classification of protein sequences. Prior to the experiments, we adapted the protein sequence data to the input requirements of LeNet-5 and VGG16 by visualizing a series of protein sequences as images. To simplify the processing of the experimental data, we divided the dataset in advance with Python and converted it into a format similar to that of handwritten-digit datasets. Next, the LeNet-5 and VGG16 networks were trained on the prepared datasets, and the training results are shown as plots. Finally, the performance of the models can be evaluated by Sp, Sn, Acc and MCC.
2 Methods and Materials
2.1 Data
The six datasets used in this paper are protein sequences from six species: BS, CG, EC, GK, MT and ST. They contain 3,142 samples for BS, 2,104 for CG, 13,184 for EC, 412 for GK, 1,730 for MT and 396 for ST. The ratio of positive to negative samples is 1:1. In the experiments, we divided each dataset into training, validation and test sets at a ratio of 8:1:1. Samples from each dataset are shown in Figs. 1 (BS), 2 (CG), 3 (EC), 4 (GK), 5 (MT) and 6 (ST).
Fig. 1. BS
Fig. 2. CG
Fig. 3. EC.
Fig. 4. GK
Fig. 5. MT
Fig. 6. ST
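The 8:1:1 division described in Sect. 2.1 can be sketched as follows. The function name `split_811`, the fixed random seed and the rounding rule (training and validation sizes rounded down, the remainder assigned to the test set) are assumptions made for illustration; the sketch also does not enforce the 1:1 class balance within each subset.

```python
import random


def split_811(samples, seed=0):
    """Shuffle `samples` and split them into train/validation/test at 8:1:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # whatever remains
    return train, val, test


# Using sample indices of the BS dataset (3,142 samples) as stand-ins.
train, val, test = split_811(range(3142))
print(len(train), len(val), len(test))
```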
2.2 Methods
In this paper, two deep learning classification models, LeNet-5 and VGG16, were used to study the binary classification problem for the six species mentioned above. The first is the LeNet-5 model, a convolutional neural network designed by Yann LeCun in 1998, which was widely used by American banks to recognize the handwritten digits on checks. It was one of the most representative experimental systems among early convolutional neural networks. LeNet-5 has 7 layers (excluding the input layer), each containing a different number of trainable parameters, as shown in Fig. 7 below.
Fig. 7. The LeNet-5 network structure diagram
In LeNet-5, there are 2 convolutional layers, 2 subsampling (pooling) layers and 3 fully connected layers. The convolutional layers all use 5 × 5 kernels that slide one pixel at a time, and each feature map shares a single kernel. The value of each upper-layer node is multiplied by the parameter on its connection; these products are summed together with a bias parameter, the sum is passed to the activation function, and the output of the activation function is the value of the next-layer node. The subsampling layers use a 2 × 2 input field, that is, four nodes of the previous layer form the input of one node of the next layer. The input fields do not overlap, that is, the window slides 2 pixels at a time. The four input nodes of each subsampling node are averaged, the average is multiplied by one parameter and a bias parameter is added, and the result is the input of the activation function, whose output is the value of the next-layer node. To meet the requirements of the experiment, we slightly fine-tune the output part of LeNet-5. The input image of LeNet-5 matches the topology of the network. Feature extraction and pattern classification are performed simultaneously and learned jointly in training, while weight sharing reduces the number of training parameters, making the network structure simpler and more adaptable.
The second model is VGG16. The VGG convolutional neural network was proposed by the University of Oxford in 2014. When it was proposed, it quickly became one of the most popular convolutional neural network models of the time, owing to its simplicity and utility, and it showed very good results in both image classification and object detection tasks. In the 2014 ILSVRC competition, VGG achieved a Top-5 accuracy of 92.3%. The VGG16 network structure contains 16 weight layers, namely 13 convolutional layers and 3 fully connected layers, together with 5 pooling layers; the pooling and activation layers are not counted. Figure 8 shows the network structure of VGG16.
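The LeNet-5 convolution and subsampling steps described in this section can be illustrated with a short pure-Python sketch. It is not the network used in the experiments: the 32 × 32 input size (that of the classic LeNet-5), the tanh activation, the constant kernel values and all function names are assumptions made for illustration.

```python
import math


def conv2d_valid(image, kernel, bias=0.0):
    """Slide a square kernel over the image one pixel at a time (no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k)) + bias
             for c in range(out_w)]
            for r in range(out_h)]


def subsample_2x2(fmap, weight=1.0, bias=0.0):
    """Average each non-overlapping 2 x 2 block, then scale it and add a bias."""
    return [[weight * (fmap[r][c] + fmap[r][c + 1]
                       + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0 + bias
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


def activate(fmap):
    """Apply the activation function (tanh here) to every node."""
    return [[math.tanh(v) for v in row] for row in fmap]


image = [[1.0] * 32 for _ in range(32)]      # dummy 32 x 32 input
kernel = [[0.01] * 5 for _ in range(5)]      # dummy 5 x 5 kernel
c1 = activate(conv2d_valid(image, kernel))   # 5 x 5 convolution, stride 1
s2 = activate(subsample_2x2(c1))             # 2 x 2 subsampling, stride 2
print(len(c1), len(c1[0]), len(s2), len(s2[0]))
```

Under these assumptions the convolution shrinks each side by 4 pixels and the subsampling halves it, which is why the feature maps become smaller from layer to layer.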
Fig. 8. The VGG16 network structure diagram
Since the default input size of VGG16 is 224 × 224 × 3, and to meet the need for binary classification in this work, we slightly adjust the input and output parts of VGG16 in the experiments. VGG16 replaces a convolutional layer with a large kernel by several stacked convolutional layers with small 3 × 3 kernels, which on the one hand reduces the number of parameters and on the other hand performs more nonlinear mappings, increasing the fitting power of the network. The authors of VGG16 argue that two stacked 3 × 3 convolutions have a receptive field equivalent to that of one 5 × 5 convolution, while three stacked 3 × 3 convolutions have a receptive field equivalent to that of one 7 × 7 convolution. Because the convolutional layers mainly expand the number of channels while the pooling layers reduce the width and height, the architecture can become deeper and wider while the growth in computation remains under control.
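The receptive-field argument can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the function names and the channel count of 64 are chosen only for illustration.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 k x k convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Number of weights in stacked k x k conv layers with C input and output channels."""
    return num_layers * kernel * kernel * channels * channels


print(stacked_receptive_field(2), stacked_receptive_field(3))
print(conv_weights(3, 64, num_layers=3), conv_weights(7, 64))
```

With C channels, three stacked 3 × 3 layers use 27C² weights, whereas a single 7 × 7 layer covering the same receptive field uses 49C².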
3 Results and Discussions
3.1 LeNet-5 Experimental Result
After 100 epochs, we obtained plots of the various metrics for the protein sequence classification of the six species, shown in Figs. 9 (BS), 10 (CG), 11 (EC), 12 (GK), 13 (MT) and 14 (ST).
Fig. 9. BS experimental results
Fig. 10. CG experimental results.
Fig. 11. EC experimental results
Fig. 12. GK experimental results
Fig. 13. MT experimental results
Fig. 14. ST experimental results
From the above six plots of experimental results, we can obtain the following:
1. The loss on each validation set shows an upward trend, which suggests that the training overfits.
2. The accuracy of the six species ranks as BS > MT > EC > ST > CG > GK.
3. The precision of the six species ranks as MT > BS > ST ≥ EC > CG > GK.
4. The recall of the six species ranks as MT ≥ BS > EC > CG ≥ ST > GK.
Therefore, it can be inferred from these four results that the LeNet-5 model classified the MT and BS species better than the GK and CG species.
3.2 VGG16 Experimental Result
After 100 epochs, we obtained plots of the various metrics of the six species’ protein sequences classified at 10 epochs, shown in Fig. 15 (BS), Fig. 16 (CG), Fig. 17 (EC), Fig. 18 (GK), Fig. 19 (MT) and Fig. 20 (ST).
Fig. 15. BS experimental results
Fig. 16. CG experimental results
Fig. 17. EC experimental results
Fig. 18. GK experimental results
Fig. 19. MT experimental results
Fig. 20. ST experimental results
From Figs. 15, 16, 17, 18, 19 and 20, we can make several observations:
1. The loss plots of the six species suggest that the training on the CG, GK and MT species may have encountered bottlenecks.
2. The six species show similar accuracy, except for the lower accuracy on the ST species.
3. The precision of the six species ranks as GK ≥ CG > BS ≥ EC > ST > MT = 0.
4. From these results, we can speculate that VGG16 classified the MT species worst and the GK and CG species best.
3.3 Comparison of the Experimental Results Between LeNet-5 and VGG16
The comparison of the above experimental results shows that the accuracy and precision of LeNet-5 are overall much higher than those of VGG16, while its recall is lower. ROC curves and AUC values were therefore introduced to further compare the results of LeNet-5 and VGG16. The ROC curves are shown in Fig. 21 (BS), Fig. 22 (CG), Fig. 23 (EC), Fig. 24 (GK), Fig. 25 (MT) and Fig. 26 (ST).
Fig. 21. ROC curve of BS
Fig. 22. ROC curve of CG
Fig. 23. ROC curve of EC
Fig. 24. ROC curve of GK
Fig. 25. ROC curve of MT
Fig. 26. ROC curve of ST
Based on the ROC curves and AUC values of the LeNet-5 and VGG16 models on the test sets of the above six species, the AUC values of LeNet-5 are higher than those of VGG16. We therefore conclude that the LeNet-5 model works better in classifying the protein sequences of these six species. Both the LeNet-5 model and the VGG16 model have been applied to the classification of protein sequences in this work, and several factors should be taken into account. First, the form of the data had to be redefined, and different forms of data may influence the final results. Second, the division of the dataset may be another significant factor. Third, modifying the output of LeNet-5 and the input and output of VGG16 may also influence the experimental results. Therefore, in subsequent experiments, these factors need to be examined separately to determine whether each operation affects the experimental results.
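The AUC used in the comparison above equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following sketch computes it by direct pairwise comparison, counting ties as one half; the function name `auc_score` and the example scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc_score(scores, labels):
    """Area under the ROC curve from predicted scores and 0/1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count positive/negative pairs that are ranked correctly (ties count 0.5).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores, for illustration only.
print(auc_score([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a higher AUC indicates the better model regardless of the decision threshold.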
4 Conclusions
Classification of protein sequences is an important method to predict the structure and function of novel protein sequences. Determining protein function plays a very important role in promoting both disease prevention and drug development. Many results have shown that a large number of protein sequences are closely related, and the classification of protein sequences has obvious practical significance, so protein classification is a very important problem in current protein research. With the continuous deepening of protein research and the improvement of research methods and technologies, the amount of protein sequence data to be classified has also increased sharply, so it is urgent to study protein sequence classification algorithms with higher accuracy and efficiency.
Acknowledgement. This work was supported by the Natural Science Foundation of China (No. 61902337), the Fundamental Research Funds for the Central Universities (2020QN89), the Xuzhou science and technology plan project (KC19142, KC21047), the Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), the Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016) and Young talents of science and technology in Jiangsu. Zheng Tao, Zhen Yang, and Baitong Chen can be treated as the co-first authors.
References 1. Zheng, M., Kahrizi, S.: Protein molecular defect detection method based on a neural network algorithm. Cell Mol. Biol. (Noisy-le-Grand, France) 66(7), 76 (2020). https://doi.org/10. 14715/cmb/2020.66.7.13 2. Cheng, J., Tegge, A.N., Baldi, P.: Machine learning methods for protein structure prediction. IEEE Rev. Biomed. Eng. 1(2008), 41–49 (2008). https://doi.org/10.1109/RBME.2008.200 8239 3. Gharib, T.F., Salah, A., Salem, M.: PSISA: an algorithm for indexing and searching protein structure using suffix arrays. In: Annual Conference on Computers World Scientific and Engineering Academy and Society (WSEAS) (2008) 4. Wei, Z.: A summary of research and application of deep learning. Int. Core J. Eng. 5(9), 167–169 (2019) 5. Xiao, Y., et al.: Assessment of differential gene expression in human peripheral nerve injury. BMC Genomics 3(1), 28 (2002) 6. Gupta, R., et al.: Time-series approach to protein classification problem: WaVe-GPCR: wavelet variant feature for identification and classification of GPCR. Eng. Med. Biol. Mag. IEEE 28(4), 32–37 (2009) 7. Hasan, M.M., Manavalan, B., Shoombuatong, W., et al.: i6mA-Fuse: improved and robust prediction of DNA 6mA sites in the Rosaceae genome by fusing multiple feature representation. Plant Mol. Biol. 103(1), 225–234 (2020) 8. Nazari, I., Tahir, M., Tayara, H., et al.: iN6-Methyl (5-step): identifying RNA N6methyladenosine sites using deep learning mode via Chou’s 5-step rules and Chou’s general PseKNC. Chemomet. Intell. Lab. Syst. 193, 103811 (2019) 9. Oubounyt, M., Louadi, Z., Tayara, H., et al.: Deep learning models based on distributed feature representations for alternative splicing prediction. IEEE Access PP(99), 1 (2018) 10. Tahir, M., Hayat, M., Chong, K.T.: A convolution neural network-based computational model to identify the occurrence sites of various RNA modifications by fusing varied features. Chemom. Intell. Lab. Syst. (2021). https://doi.org/10.1016/j.chemolab.2021.104233 11. 
Tahir, M., Tayara, H., Hayat, M., et al.: kDeepBind: prediction of RNA-Proteins binding sites using convolution neural network and k-gram features. Chemom. Intell. Lab. Syst. 208(7457) (2021)
12. Zhang, Y., Hamada, M.: DeepM6ASeq: prediction and characterization of m6A-containing sequences using deep learning. BMC Bioinform. 19, S19 (2018). https://doi.org/10.1186/s12 859-018-2516-4 13. Tahir, M., Hayat, M.: iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou’s PseAAC. Mol. BioSyst. 12(8), 2587 (2016)
SeqVec-GAT: A Golgi Classification Model Based on Multi-headed Graph Attention Network
Jianan Sui1, Yuehui Chen2, Baitong Chen3, Yi Cao4(B), Jiazi Chen5, and Hanhan Cong6
1 School of Information Science and Engineering, University of Jinan, Jinan, China
2 School of Artificial Intelligence Institute and Information Science and Engineering, University of Jinan, Jinan, China
3 Xuzhou First People’s Hospital, Xuzhou, China
4 Shandong Provincial Key Laboratory of Network Based Intelligent Computing (School of Information Science and Engineering), University of Jinan, Jinan, China
[email protected]
5 Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-shi, Fukuoka, Japan
6 School of Information Science and Engineering, Shandong Normal University, Jinan, China
Abstract. The Golgi apparatus is also known as the Golgi complex or Golgi body. It is one of the components of the endomembrane system in eukaryotic cells. The main function of the Golgi apparatus is to process, sort, and transport proteins synthesized by the endoplasmic reticulum, and then deliver them to specific parts of the cell or secrete them outside the cell. Dysregulation of the Golgi apparatus can cause neurodegenerative diseases. The classification of Golgi proteins is particularly important for the development of drugs to treat the corresponding diseases, but existing methods are time-consuming and laborious. In this paper, we utilize the SeqVec model to extract the features of Golgi proteins and utilize a multi-headed graph attention network as the classification model. The experimental results show that the predictive classification is better than that of the machine learning methods commonly used in Golgi protein classification. The final experimental results are Acc 98.44%, F1-score 0.9844, Sn 92.31%, Sp 100%, MCC 0.9515, AUROC 0.9615.
Keywords: Golgi · SeqVec · Graph attention network · Machine learning
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 697–704, 2022. https://doi.org/10.1007/978-3-031-13829-4_61
1 Introduction
The Golgi apparatus (GA) is an important eukaryotic organelle involved in the metabolism of numerous proteins. Golgi proteins are mainly composed of two types: cis-Golgi proteins and trans-Golgi proteins [1]. The main task of the cis-Golgi is to accept proteins and the main task of the trans-Golgi is to release synthesized proteins. Studies have shown that functional defects of the Golgi apparatus in cells can lead to the development of certain diseases such as diabetes [2], Parkinson’s disease [3], Alzheimer’s disease [4]
and some cancers such as breast cancer and lung cancer [5]. Therefore, the design of an efficient Golgi protein classification prediction model is of far-reaching significance for the development of therapeutic drugs for the corresponding diseases [6]. With the maturity and continuous development of machine learning methods, they are increasingly being used for protein classification prediction tasks [7]. Van Dijk et al. determined sub-Golgi protein localization using an SVM as the prediction algorithm [12]. Ding et al. proposed a PseAAC model combined with a modified Mahalanobis discriminant to predict Golgi protein types, with an overall accuracy of 74.7% [13]. They then developed a support vector machine-based method that used ANOVA for feature selection to optimize the feature set, and the model accuracy reached 85.4% [14]. Jiao and Du extracted features using positional specific physicochemical properties (PSPCP) of amino acid residues and selected significant features by ANOVA; the model accuracy reached 86.9% [15]. Lv et al. developed a state-of-the-art Golgi protein classifier, rfGPT, with a prediction accuracy of 90.5%, which ranked among the best sub-Golgi predictors at that time [16]. Zhao et al. extracted features using PseAAC and FUNDES with a prediction accuracy of 78.4% [17]. Yang et al. proposed a CSP-based feature extraction method combined with the synthetic minority oversampling technique (SMOTE) and a feature selection method called RF-RFE, and the model accuracy reached 88.5% [18]. Among recent classifiers for Golgi proteins, the RF_Bert model established by Cui et al. used a pretrained Bert model to extract the sequence features of Golgi proteins and established a sub-Golgi protein localizer based on the RF model, which achieved better results than traditional machine learning models, with a prediction accuracy of 95.31%. To achieve accurate classification of Golgi proteins, we propose the SeqVec-GAT model. In this experiment, we utilized the SeqVec model to extract 1024-dimensional features of Golgi proteins and assigned the extracted features to the nodes of the graph. We utilized a graph attention network as the classification model, and in order to improve the benefits brought by self-attention, a three-headed graph attention network was used. Adam was used as the optimizer, the learning rate was set to 0.001, and the number of iterations was 350. The final experimental results were Acc 98.44%, F1-score 0.9844, Sn 92.31%, Sp 100%, MCC 0.9515, AUROC 0.9615. The experimental results show that the predictive classification is better than that of the machine learning methods commonly used in Golgi protein classification (Fig. 1).
Fig. 1. Work flow chart of SeqVec-GAT
2 Materials and Methods
2.1 Data
The data for this experiment were obtained from the dataset constructed by Yang et al., which contains 304 Golgi protein sequences, of which 217 are negative samples and 87 are positive samples. A separate test set containing 64 Golgi protein sequences was used to verify the performance of the model; its ratio of positive to negative samples is about 1:5. In Golgi protein classification, different feature extraction methods can greatly affect model accuracy. In previous models, feature extraction was mainly based on the physicochemical properties of amino acids, such as PSSM and PsePSSM. With the maturity of deep learning, deep models are increasingly utilized for feature extraction. Rao et al. proposed Tasks Assessing Protein Embeddings (TAPE), a Bert-model-based benchmark for evaluating protein embeddings [19], and in 2019 the model was utilized for feature extraction. We used the SeqVec (Sequence-to-Vector) embedding [20], which models the amino acids of Golgi proteins as continuous vectors, to extract their features. SeqVec was obtained by training the ELMo language model [21], taken from NLP, on protein sequence datasets. The results show that this embedding method performs well on different prediction tasks.
2.2 SeqVec-GAT Classifier
In this paper, we utilized a multi-headed graph attention network (GAT) for our experiments. We extracted features of the Golgi protein sequences using SeqVec, which yields a 304 × 1024 feature matrix, and then assigned each feature vector to a node of the constructed graph. We then fed this graph structure into our graph attention network model, which uses two graph attention layers. Let the feature vector corresponding to any node $v_i$ in the graph at layer $l$ be $h_i \in \mathbb{R}^{d^{(l)}}$, where $d^{(l)}$ denotes the feature length of the node, as shown in Fig. 2.
Fig. 2. Graph attention layer
Suppose the central node is $v_i$. We define the weight coefficient from a neighboring node $v_j$ to $v_i$ as:
$$e_{ij} = a(W h_i, W h_j) \tag{1}$$
where $W \in \mathbb{R}^{d^{(l+1)} \times d^{(l)}}$ is the weight matrix of the feature transformation of the nodes in this layer, and $a(\cdot)$ is the function that calculates the correlation of two nodes. A single fully connected layer is chosen here:
$$e_{ij} = \mathrm{LeakyReLU}\left(a^{T} [W h_i \,\|\, W h_j]\right) \tag{2}$$
The activation function of our graph attention network is the LeakyReLU function. In order to better assign weights, we normalize the correlations computed with all neighbors in a uniform way, specifically by softmax normalization:
$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{v_k \in N(v_i)} \exp(e_{ik})} \tag{3}$$
Here $\alpha_{ij}$ is the weight coefficient. Eq. (3) guarantees that the weight coefficients of all neighbors sum to 1, and Eq. (4) gives the complete formula of the weight coefficient:
$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T} [W h_i \,\|\, W h_j]\right)\right)}{\sum_{v_k \in N(v_i)} \exp\left(\mathrm{LeakyReLU}\left(a^{T} [W h_i \,\|\, W h_k]\right)\right)} \tag{4}$$
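The attention coefficients of Eqs. (2)–(4) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the toy feature vectors, the matrix `W`, the vector `a`, the LeakyReLU slope of 0.2 and the inclusion of the node itself in its neighborhood are all assumptions made only for illustration.

```python
import math

def leaky_relu(x, slope=0.2):
    # The slope 0.2 is an assumed value; the paper does not state it.
    return x if x > 0 else slope * x

def matvec(W, h):
    # Multiply matrix W (list of rows) by vector h.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def attention_weights(i, neighbors, H, W, a):
    """Return alpha_ij for every j in `neighbors` of node i (Eqs. 2-4)."""
    Wh_i = matvec(W, H[i])
    scores = []
    for j in neighbors:
        Wh_j = matvec(W, H[j])
        concat = Wh_i + Wh_j  # list concatenation plays the role of [Wh_i || Wh_j]
        scores.append(leaky_relu(sum(ak * ck for ak, ck in zip(a, concat))))
    # Softmax over the neighborhood; subtracting the max improves numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy graph: 3 nodes with 3-dimensional features, projected to 2 dimensions.
H = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0]]
W = [[0.1, 0.2, 0.3], [0.4, 0.0, -0.1]]
a = [0.5, -0.5, 0.25, 0.75]
alpha = attention_weights(0, [0, 1, 2], H, W, a)
print(alpha)
```

Because of the softmax in Eq. (3), the returned coefficients are positive and sum to 1 over the neighborhood.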
The model uses the Adam optimizer with a learning rate of 0.001 and 350 training iterations. In order to further improve the expressive power of the attention layer, we use a multi-headed attention mechanism with $K = 3$ heads, and the new feature vector of node $v_i$ is:
$$h_i' = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{v_j \in N(v_i)} \alpha_{ij}^{(k)} W^{(k)} h_j\Big) \tag{5}$$
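The multi-head update of Eq. (5) can be sketched as follows, assuming the per-head attention coefficients have already been computed. The function name `multi_head_update`, the toy matrices and the choice of the logistic sigmoid for σ are assumptions for illustration; the paper does not specify which nonlinearity σ denotes.

```python
import math

def sigmoid(x):
    # Assumed choice for the nonlinearity sigma in Eq. (5).
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, h):
    # Multiply matrix W (list of rows) by vector h.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def multi_head_update(alphas, Ws, neighbor_feats):
    """Concatenate K heads: head k outputs sigma(sum_j alpha_ij^(k) W^(k) h_j)."""
    out = []
    for alpha_k, W_k in zip(alphas, Ws):
        agg = [0.0] * len(W_k)
        for a_kj, h_j in zip(alpha_k, neighbor_feats):
            Wh = matvec(W_k, h_j)
            agg = [s + a_kj * v for s, v in zip(agg, Wh)]
        out.extend(sigmoid(v) for v in agg)  # concatenation over heads
    return out

# Hypothetical example: K = 3 heads, two neighbors, 2-dimensional features.
neighbor_feats = [[1.0, 0.0], [0.0, 1.0]]
Ws = [
    [[1.0, 0.0], [0.0, 1.0]],   # head 1: identity projection
    [[0.5, 0.5], [0.5, -0.5]],  # head 2
    [[0.0, 1.0], [1.0, 0.0]],   # head 3
]
alphas = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]  # each row sums to 1
h_new = multi_head_update(alphas, Ws, neighbor_feats)
print(len(h_new), h_new)
```

With three heads that each output two values, the concatenated vector has length 6, which shows how concatenation multiplies the output dimension by K.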
In order to assess the effectiveness of this algorithm, we use six classifiers, namely Random Forest, KNN, GBDT, RF_Bert, LightGBM and XGBoost, as comparison experiments, to check the degree of improvement of our model's classification performance over these more traditional classifiers.
2.3 Evaluation Metrics and Methods
In this experiment, accuracy (Acc), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC) and F1-score are utilized to evaluate the performance of the prediction system [23]. They were calculated as follows:
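$$Sp = \frac{TN}{TN + FP} \tag{6}$$
$$Sn = \frac{TP}{TP + FN} \tag{7}$$
$$Acc = \frac{TP + TN}{TP + FN + TN + FP} \tag{8}$$
$$F1 = \frac{2 \times TP}{2 \times TP + FN + FP} \tag{9}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FN) \times (TN + FP)}} \tag{10}$$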
For a binary classification problem, the actual values are only positive and negative instances. If an instance is positive and is predicted to be positive, it is a true positive (TP); if it is negative and is predicted to be positive, it is a false positive (FP); if it is negative and is predicted to be negative, it is a true negative (TN); and if it is positive and is predicted to be negative, it is a false negative (FN). Sn and Sp are, respectively, the proportions of correct predictions for the positive and negative cases. The F1-score reflects the robustness of the model: the higher the score, the more robust the model. Acc is the percentage of all predictions that are correct. When the dataset is unbalanced, Acc cannot reliably assess the quality of the classification results; in this case, MCC can be used instead. The horizontal axis of the ROC curve is the false positive rate (FPR), i.e., the proportion of negative samples judged as positive, and the vertical axis is the true positive rate (TPR), i.e., the proportion of positive samples judged as positive. AUC stands for Area Under Curve, which is the area between the ROC curve and the x axis (FPR axis). The larger the AUC value, the better the model.
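As a worked illustration of Eqs. (6)–(10), the sketch below computes the five metrics from the four confusion-matrix counts. The function name `binary_metrics` and the example counts (a hypothetical 64-sample test set with 13 positives and 51 negatives) are assumptions chosen only for illustration, not values reported by the authors.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Compute Sn, Sp, Acc, F1 and MCC from confusion-matrix counts (Eqs. 6-10)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * tp / (2 * tp + fn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fn) * (tn + fp))
    # Guard against a zero denominator when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "F1": f1, "MCC": mcc}

# Hypothetical counts: 12 of 13 positives and all 51 negatives classified correctly.
m = binary_metrics(tp=12, fn=1, tn=51, fp=0)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

On an unbalanced set like this one, Acc is dominated by the majority class, while MCC uses all four counts, which is why the text recommends it in that setting.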
3 Results
To prove the validity of this model, we compare its prediction performance with that of other models. Table 1 shows the comparison of our model with six other models in Sn, Sp, Acc, MCC, F1-score and AUROC. Our results in Sn and Sp are better than those of most models. Our model has the highest F1-score, indicating that it is more robust. In terms of Acc and MCC, our model is also superior to the best previous model, RF_Bert, indicating that the overall accuracy of our predictor is the best. Our model also performs well in AUROC, about 0.02 higher than the RF_Bert model, indicating that it has the best generalization performance among the compared models.

Table 1. Comparison with other models

| Model | Sn (%) | Sp (%) | Acc (%) | MCC | F1 | AUROC |
|---|---|---|---|---|---|---|
| GBDT | 69.23 | 98.04 | 92.19 | 0.7526 | 0.7826 | 0.9532 |
| KNN | 30.77 | 98.04 | 84.38 | 0.4319 | 0.4444 | 0.7745 |
| XGBoost | 53.84 | 98.04 | 89.06 | 0.6312 | 0.6667 | 0.9321 |
| LightGBM | 53.85 | 92.16 | 84.38 | 0.4906 | 0.5833 | 0.9020 |
| RF | 30.77 | 100.00 | 85.94 | 0.5114 | 0.4706 | 0.8982 |
| RF_Bert | 84.62 | 98.04 | 95.31 | 0.8520 | 0.9521 | 0.9434 |
| This work | 92.31 | 100.00 | 98.44 | 0.9515 | 0.9844 | 0.9615 |
4 Conclusion
In this paper, we proposed a new Golgi protein classification and prediction model based on a graph attention network. The model utilizes SeqVec to extract features and assigns the extracted features to the nodes of a graph. It is built from two graph attention layers, and in order to further improve the expressive power of the attention layer, we utilized a three-headed graph attention network. The classification performance improved considerably compared with previous methods; the final experimental results are: Acc 98.44%, Sn 92.31%, Sp 100%, MCC 0.9515, AUROC 0.9615. However, our model relies on the construction of the graph structure: a proper graph can make the model achieve unexpectedly good classification results, while an improper one can make the model much less effective. Therefore, the next step is to design a general graph structure and improve the stability of the model. Judging from these results, graph neural networks have great prospects in Golgi protein classification.
Acknowledgments. This work was supported in part by the University Innovation Team Project of Jinan (2019GXRC015) and the Natural Science Foundation of Shandong Province, China (Grant No. ZR2021MF036).
References 1. Hoyer, S.: Is sporadic Alzheimer disease the brain type of non-insulin dependent diabetes mellitus? A challenging hypothesis. J. Neural Transm. 105(4–5), 415–422 (1998) 2. Rose, D.R.: Structure, mechanism and inhibition of Golgiα-mannosidase II. Curr. Opin. Struct. Biol. 22(5), 558–562 (2012) 3. Gonatas, N.K., Gonatas, J.O., Stieber, A.: The involvement of the Golgi apparatus in the pathogenesis of amyotrophic lateral sclerosis, Alzheimer’s disease, and ricin intoxication. Histochem. Cell Biol. 109(5–6), 591–600 (1998) 4. Yang, W., et al.: A brief survey of machine learning methods in protein sub-Golgi localization. Curr. Bioinform. 14(3), 234–240 (2019) 5. Wang, Z., Ding, H., Zou, Q.: Identifying cell types to interpret scRNA-seq data: how, why and more possibilities. Briefings Funct. Genomics. 19(4), 286–291 (2020) 6. Yuan, L., Guo, F., Wang, L., Zou, Q.: Prediction of tumor metastasis from sequencing data in the era of genome sequencing. Brief. Funct. Genomics 18(6), 412–418 (2019) 7. Hummer, B.H., Maslar, D., Gutierrez, M.S., de Leeuw, N.F., Asensio, C. S.: Differential sorting behavior for soluble and transmembrane cargoes at the trans-Golgi network in endocrine cells. Mol. Biol. Cell mbc-E19 (2020) 8. Deng, S., Liu, H., Qiu, K., You, H., Lei, Q., Lu, W.: Role of the Golgi apparatus in the blood-brain barrier: golgi protection may be a targeted therapy for neurological diseases. Mol. Neurobiol. 55(6), 4788–4801 (2018) 9. Villeneuve, J., Duran, J., Scarpa, M., Bassaganyas, L., Van Galen, J., Malhotra, V.: Golgi enzymes do not cycle through the endoplasmic reticulum during protein secretion or mitosis. Mol. Biol. Cell 28(1), 141–151 (2017) 10. Hou, Y., Dai, J., He, J., Niemi, A.J., Peng, X., Ilieva, N.: Intrinsic protein geometry with application to non-proline cis peptide planes. J. Math. Chem. 57(1), 263–279 (2019) 11. 
Wei, L., Xing, P., Tang, J., Zou, Q.: PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Trans. Nanobiosci. 16(4), 240– 247 (2017) 12. van Dijk, A.D.J., et al.: Predicting sub-Golgi localization of type II membrane proteins. Bioinformatics 24(16), 1779–1786 (2008) 13. Ding, H., et al.: Identify Golgi protein types with modified mahalanobis discriminant algorithm and pseudo amino acid composition. Protein Pept. Lett. 18(1), 58–63 (2011) 14. Ding, H., et al.: Prediction of Golgi-resident protein types by using feature selection technique. Chemom. Intell. Lab. Syst. 124, 9–13 (2013) 15. Jiao, Y.S., Du, P.F.: Predicting Golgi-resident protein types using pseudo amino acid compositions: approaches with positional specific physicochemical properties. J. Theor. Biolo. 391, 35–42 (2016) 16. Lv, Z., et al.: A random forest sub-Golgi protein classifier optimized via dipeptide and amino acid composition features. Frontiers in bioengineering and biotechnology 7, 215 (2019) 17. Zhao, W., et al.: Predicting protein sub-Golgi locations by combining functional domain enrichment scores with pseudo-amino acid compositions. J. Theor. Biol. 473, 38–43 (2019) 18. Yang, R., Zhang, C., Gao, R., Zhang, L.: A novel feature extraction method with feature selection to identify Golgi–resident protein types from imbalanced data. Int. J. Mol. Sci. 17(2), 218 (2016) 19. Heinzinger, M., Ahmed Elnaggar, Y., Wang, C.D., et al.: Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinf. 20(1), 1–17 (2019) 20. Peters, M.E., Neumann, M., Iyyer, M., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365. 2018
21. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 22. Zeng, X., Lin., W., Guo, M., Zou, Q.: A comprehensive overview and evaluation of circular RNA detection tools, PLoS Comput. Biol., 13(6), Art. no. e1005420 (2017) 23. Wei, L., Xing, P., Su, R., Shi, G., Ma, Z.S., Zou, Q.: CPPred–RF: a sequence-based predictor for identifying cell–penetrating peptides and their uptake efficiency. J. Proteome Res. 16(5), 2044–2053 (2017) 24. Wei, L., Xing, P., Zeng, J., Chen, J., Su, R., Guo, F.: Improved prediction of protein–protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med. 83, 67–74 (2017) 25. Hu, Y., Zhao, T., Zhang, N., Zang, T., Zhang, J., Cheng, L.: Identifying diseases-related metabolites using random walk. BMC Bioinf. 19(S5), 116 (2018) 26. Zhang, M., et al.: MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters. Bioinformatics 35(17), 2957–2965 (2019) 27. Song, T., Rodriguez-Paton, A., Zheng, P., Zeng, X.: Spiking neural P systems with colored spikes. IEEE Trans. Cogn. Devel. Syst. 10(4), 1106–1115 (2018) 28. Lin, X., Quan, Z., Wang, Z.-J., Huang, H., Zeng, X.: A novel molecular representation with BiGRU neural networks for learning atom. Briefings Bioinf. Art. no. bbz125 (2019)
Classification of S-succinylation Sites of Cysteine by Neural Network
Tong Meng1, Yuehui Chen2, Baitong Chen3, Yi Cao4(B), Jiazi Chen5, and Hanhan Cong6
1 School of Information Science and Engineering, University of Jinan, Jinan, China
2 School of Artificial Intelligence Institute and Information Science and Engineering, University of Jinan, Jinan, China
3 Xuzhou First People’s Hospital, Xuzhou, China
4 Shandong Provincial Key Laboratory of Network Based Intelligent Computing (School of Information Science and Engineering), University of Jinan, Jinan, China
[email protected]
5 Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-shi, Fukuoka, Japan
6 School of Information Science and Engineering, Shandong Normal University, Jinan, China
Abstract. S-succinylation of proteins is a significant and common post-translational modification (PTM) that takes place on cysteine. PTMs play an important role in many biological processes and are closely related to many human diseases. Hence, identifying the S-succinylation sites of cysteine is pivotal in biology and disease research. However, traditional experimental methods are expensive and time-consuming, so some researchers have proposed machine learning (ML) methods to deal with the problem of PTM recognition. In particular, deep learning methods have also been applied to this field. In this work, we put forward a convolutional neural network (CNN) to identify hidden S-succinylation sites. We utilized human and mouse datasets, aiming to predict the S-succinylation sites existing in humans, and verified the predictions by leave-one-out (LOO) validation. More specifically, five metrics are utilized to assess the prediction performance of the classifier. In general, the CNN model that we propose achieves better prediction performance.
Keywords: Cysteine succinylation · Convolutional neural network · Machine learning · Protein post-translational modification
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 705–714, 2022. https://doi.org/10.1007/978-3-031-13829-4_62
1 Introduction
Throughout biological development, protein post-translational modifications (PTMs) coordinate the activity of most proteins. It is precisely because of the post-translational modification of proteins that the classical one-gene, one-protein relationship is broken, which increases the complexity of the processes of human life. PTMs can also increase the functional diversity of the proteome in two main ways: on the one hand, they can covalently add functional groups or proteins; on the other hand, they can regulate
the proteolysis of subunits or the degradation of the whole protein [1]. At present, more than 600 post-translational modifications have been identified, including ubiquitination, phosphorylation, nitrosylation, proteolysis, lipidation, methylation, acetylation and glycosylation, which affect almost all aspects of cell biology and the pathogenesis of many diseases. A team at the University of Chicago first reported lysine succinylation as a protein post-translational modification in Nature Chemical Biology in 2010. Protein succinylation refers to the process in which succinyl group donors (such as succinyl coenzyme A) covalently bind succinyl groups to lysine residues of substrate proteins in an enzymatic or non-enzymatic manner. Compared with methylation and acetylation, lysine succinylation can cause larger changes in protein properties. This is because the succinylated lysine group is given two negative charges, so its charge changes from +1 to −1, which is a larger change than that caused by acetylation (+1 to 0) or monomethylation (no change). Moreover, succinylation adds a structurally larger group and causes greater changes in protein structure and function. At the same time, succinyl-CoA is a cofactor for the enzymatic regulation of succinylation. As an intermediate product of important metabolic reactions, succinyl-CoA appears in the TCA cycle, in porphyrin synthesis and in the metabolism of some branched-chain amino acids. Its stable state is essential for maintaining normal cell physiological activities, and mutations that affect the metabolism of succinyl-CoA are likely to cause disease. As a post-translational modification, lysine succinylation initially required a large amount of mass spectrometry and protein sequence alignment to be identified [2].
In 2013, Park et al. identified 2565 protein succinylation sites in 779 proteins, and their experimental results showed that lysine-succinylated enzymes had potential effects on the metabolic functions of mitochondria, such as the degradation of amino acids, the tricarboxylic acid cycle and the metabolism of fatty acids [3]. Lysine succinylation widely exists in eukaryotic and prokaryotic cells, and as a new post-translational modification of proteins, it plays an indispensable role [4, 5]. Specific lysine residues in proteins can covalently bind to succinyl groups, which may lead to significant chemical changes in proteins during succinylation [6]. Cysteine (Cys) is an amino acid with low frequency in most proteins (about 1%–2%) [7], but Cys has high thiol reactivity (affinity and redox sensitivity). It often plays a crucial role in the structure and function of proteins as a site of redox catalysis, metal binding and allosteric regulation, and participates in the regulation of physiological processes such as cell recognition and signal transduction [8]. The Cys thiol group is very sensitive to changes in the local cellular environment and is prone to a series of non-enzymatic or enzyme-catalyzed post-translational modifications, which quickly and dynamically regulate protein configuration and activity and can even damage protein function; this is closely related to many important human diseases. Identifying succinylation substrate proteins and succinylation sites is vital for understanding the molecular mechanism of succinylation in biological systems, so this field has attracted more and more attention [9–12].
In this paper, we developed a new predictor of S-succinylation sites in proteins, utilizing S-succinylation site datasets of human and mouse cysteine. We utilize traditional methods such as SVM and deep learning methods such as CNN to identify potential sites. Meanwhile, the EAAC encoding method is utilized to describe the characteristics of amino acid sequences, and the model is evaluated by leave-one-out (LOO) cross-validation. The flow chart is shown in Fig. 1. The results show that our CNN model performs better on the independent test set, with AUC and ACC values reaching 93.7% and 87.5%, respectively.
Fig. 1. The activity flow of classifier development.
2 Methods
Our workflow is shown in Fig. 1. First, we preprocess the data to refine the dataset. Second, feature encoding converts the input data into feature vectors. Next, different models are trained on the training set. Finally, the trained models are validated on the independent test data, and five metrics are utilized to evaluate the performance of the predictor.
2.1 Feature Encoding
EAAC Encoding. In this study, EAAC encoding, proposed by Chen et al., was utilized to encode the amino acid sequences of proteins. Since each modification site has 20 amino acid residues around it, the encoding method can reflect the frequency of its residues. In detail, we set a sliding window of size 8 and let it slide continuously from the head to the end of the input amino acid sequence. Then we calculate the frequency of the 20 amino acids in each window of 8 residues [23]. Therefore, we can calculate the feature size by the following formulas:
$$N_s = L - L_s + 1 \tag{1}$$
$$D_{eaac} = N_s \times 20 \tag{2}$$
where $L$ indicates the length of each input amino acid sequence fragment, $L_s$ represents the size of the sliding window, $N_s$ is the number of sliding windows, and $D_{eaac}$ represents the dimension of the feature vector. Since all peptides have 51 residues, each amino acid sequence in the dataset is transformed into a 44 × 20 matrix (44 = 51 − 8 + 1) and then flattened into an 880-dimensional vector.
EBAG+Profile Encoding. Because different amino acid residues have different physical and chemical properties, these properties can be utilized as a basis for grouping amino acid residues. This encoding method based on attribute grouping is called EBAG [24, 25]. Since the 20 amino acid residues contain hydrophobic, polar, acidic and basic groups, they can be divided into four groups by the EBAG method. In addition to these four groups, there is a fifth group, “X”, which represents intervals in the amino acid sequence that have no physical and chemical properties but can be utilized as a basis for determining whether a site can be modified. The groups are shown in Table 1.

Table 1. Groups generated by the EBAG method.

| Group | Amino acid residues | Label |
|---|---|---|
| C1 | A, F, G, I, L, M, P, V, W | Hydrophobic |
| C2 | C, N, Q, S, T, Y | Polar |
| C3 | D, E | Acidic |
| C4 | H, K, R | Basic |
| C5 | X | Intervals |
After performing the attribute grouping, the frequency of each residue is calculated by utilizing Profile encoding, and the frequency sequence of each peptide is generated. Each sample peptide contains 51 amino acid residues, and each frequency can be calculated by the following formula:
$$F_j = C_j / L_p \tag{3}$$
where $L_p$ is the length of the peptide, $j$ is the type of amino acid residue, and $C_j$ is the number of times it appears in the peptide. After that, the sample sequence can be transformed into the feature vector $V$ by the following formula:
$$V = [F_1, F_2, F_3, \cdots, F_{20}] \tag{4}$$
This is a new strategy that combines Profile encoding with EBAG. First, a 51-residue peptide is converted by EBAG into a new sequence over the 5 groups. Second, Profile encoding is applied to the resulting EBAG sequence, and the calculated frequency is assigned to every residue.
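The two encodings described in Sect. 2.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the example peptide and the column order of the 20 amino acids are hypothetical, and `ebag_profile_encode` reflects one possible reading of the EBAG+Profile description, in which each position receives the frequency of its group within the peptide.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the 20 residues

def eaac_encode(peptide, window=8):
    """Slide a window over the peptide and record the 20 residue frequencies per window."""
    features = []
    for start in range(len(peptide) - window + 1):  # N_s = L - L_s + 1 windows
        counts = Counter(peptide[start:start + window])
        features.extend(counts[aa] / window for aa in AMINO_ACIDS)
    return features  # N_s * 20 values

# Groups from Table 1 of the EBAG method.
EBAG_GROUPS = {"C1": "AFGILMPVW", "C2": "CNQSTY", "C3": "DE", "C4": "HKR", "C5": "X"}
RESIDUE_TO_GROUP = {aa: g for g, members in EBAG_GROUPS.items() for aa in members}

def ebag_profile_encode(peptide):
    """Map residues to EBAG groups, then assign each position its group's frequency."""
    groups = [RESIDUE_TO_GROUP[aa] for aa in peptide]
    counts = Counter(groups)
    return [counts[g] / len(peptide) for g in groups]

# Hypothetical 51-residue peptide with a cysteine at the central position.
peptide = "A" * 25 + "C" + "K" * 25
eaac = eaac_encode(peptide)
profile = ebag_profile_encode(peptide)
print(len(eaac), len(profile))
```

For a 51-residue peptide and a window of 8, the EAAC vector has (51 − 8 + 1) × 20 = 880 entries, matching Eqs. (1) and (2).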
2.2 Construction of Classifier Support Vector Machine. SVM is an algorithm widely utilized in classification problems in different fields. It can map samples to high-dimensional feature spaces. It is a classical supervised classifier based on Vapnik Chervonenkis dimension theory and structural risk minimization principle [26], and has better generalization ability. Model f(x) = yˆ = x·w + b is a linear prediction model for support vector machine (SVM) learning. When the data is linearly separable, SVM optimizes the parameters w through the maximum margin hyperplane given by the following: w·x+b=1
(5)
w · x + b = −1
(6)
Then, we utilize the following formula to prevent training scores from exceeding the scope: yi (w · xi ) ≥ 1 where i ∈ [1, n]
(7)
where n is the number of training samples. And the following formula has predicted class of a sample: n αiyiK(xi, x) + b (8) f (x) = i=1
where $x_i$ is the $i$th training sample, $y_i$ is its label and $K$ is the kernel function. The weight is $\alpha_i$; a training sample with $\alpha_i \ne 0$ is a support vector, and $b$ is the learned bias term.

Random Forest. RF is a bagging-type ensemble method that integrates many decision trees. RF rests on four basic ideas: the random selection of data samples, the construction of decision trees, the random selection of candidate features, and the forest prediction strategy. To see how an ensemble combines its members so that it can outperform a single learner, suppose the ensemble is $\{h_1, h_2, \ldots, h_T\}$, where $h_i(x)$ is the output of $h_i$ on the sample $x$. Voting is a frequently used combination strategy in classification tasks. Let the set of class labels be $\{c_1, c_2, \ldots, c_N\}$. For convenience of discussion, the output of $h_i$ on the sample $x$ is written as an $N$-dimensional vector $(h_i^1(x), h_i^2(x), \ldots, h_i^N(x))^T$, where $h_i^j(x)$ is the output of $h_i$ for class $c_j$. The voting rule is:
$$H(x) = \begin{cases} c_j, & \text{if } \sum_{i=1}^{T} h_i^j(x) > 0.5 \sum_{k=1}^{N} \sum_{i=1}^{T} h_i^k(x) \\ \text{reject}, & \text{otherwise} \end{cases} \tag{9}$$
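The voting rule in Eq. (9) can be sketched as follows. This is an illustration under the assumption that every learner casts a 0/1 vote per class; majority_vote and the toy votes are hypothetical.

```python
def majority_vote(votes, classes):
    """votes[i][j] is h_i^j(x): the vote of learner i for class j."""
    totals = [sum(column) for column in zip(*votes)]  # votes per class
    threshold = 0.5 * sum(totals)                     # half of all votes
    for label, total in zip(classes, totals):
        if total > threshold:
            return label
    return "reject"


votes = [[1, 0], [1, 0], [0, 1]]  # three learners, two classes
print(majority_vote(votes, ["c1", "c2"]))
```

With an even split of votes no class exceeds the threshold, so the function returns "reject", as Eq. (9) prescribes.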
That is, if a class obtains more than half of the votes, it is the predicted class; otherwise the prediction is rejected. In our experiment, an RF model with 1000 decision trees was established after repeated trials.

K-Nearest Neighbor Algorithm. KNN is a supervised learning algorithm that can be used for classification and regression. Its basic idea is that a sample belongs to the category of its nearest neighbors in the feature space; in other words, the class of a sample is decided by the class of the single nearest sample or of a few nearest samples. KNN has three basic elements: the selection of the K value, the distance measure and the classification decision rule. The Euclidean distance is commonly used in KNN to measure the distance between two points in space; in two-dimensional space it is calculated as follows:
$$\rho = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{10}$$
In our work, we built a kNN model with k = 5 neighbors and the Euclidean distance as the distance measure.

Artificial Neural Network. An ANN consists of an input layer, which receives data from external sources; one or more hidden layers, which process the data; and an output layer, which provides one or more outputs of the network. Here, we developed a neural network classifier with three layers: an input layer, a hidden layer and an output layer. The first layer receives the 51 residues of a peptide fragment, the hidden layer that follows is composed of 100 neurons, and the last layer outputs a numerical vector. The sigmoid function is then used as the activation function to obtain a probability score for the modification of s-succinylation sites. The sigmoid formula is as follows:
$$\phi(x) = \frac{1}{1 + e^{-x}} \tag{11}$$
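Eq. (11) can be evaluated directly, but the naive form overflows for large negative inputs. The sketch below is one common way to avoid that; it is an illustration, not part of the authors' pipeline.

```python
import math


def sigmoid(x):
    """phi(x) = 1 / (1 + exp(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3))
```

An output above 0.5 would typically be read as a positive (modified) site, although the text does not state the threshold used.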
Convolutional Neural Network. Deep learning is another approach used in this field besides the traditional ML methods discussed above. CNN, which has a convolutional structure, is one of the most successful deep learning algorithms. Notably, convolutional structures can reduce the amount of memory occupied by deep networks. In our experiments, we combine EAAC encoding with CNN and establish two models: a one-dimensional model, CNN1eaac, and a two-dimensional model, CNN2eaac.
2.3 Performance Evaluation of Predictors
LOO Cross Validation. Cross validation (CV) is a widely used model selection method. In this work, because the amount of data is relatively small, we use leave-one-out (LOO) cross validation to evaluate the performance of the classifiers. The idea of CV is to divide a data set into k small data sets: one is chosen as the test set and the remaining k − 1 are used as the training set, then the next one is chosen as the test set with the other k − 1 as the training set, and so on. The LOO method is a special case of cross validation in which, as the name suggests, k equals the number of samples in the dataset. In our paper k is 8: each time, one is chosen as the test set and the remaining seven serve as the training set. In this way, we obtain the estimate closest to the expected performance of a model trained on the entire data set.
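The splitting scheme described above can be sketched as a small generator. The function name leave_one_out is hypothetical, and the sketch only produces index splits; training and scoring a classifier on each split is left out.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_index) for every sample in turn."""
    for test_index in range(n_samples):
        train = [i for i in range(n_samples) if i != test_index]
        yield train, test_index


# k = 8 as in the text: every split has 7 training items and 1 test item.
for train, test in leave_one_out(8):
    print("test:", test, "train:", train)
```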
3 Results and Discussion
3.1 Performance Comparison of LOO Cross Validation Under Different Classifiers
Because many predictors suffer from overfitting, previous studies have used cross validation to address it. To compare the performance of different classifiers, we built a LOO cross validation model for each classifier: RF, ANN, SVM and kNN. For each model, we divide the data set into 8 parts, use 7 parts for training, and leave the remaining part as the test set for validation. The final model is then tested on an independent dataset. Our indicators are AUC, Acc, MCC, Sn and Sp. Table 2 shows the results.

Table 2. The results of different classifiers under leave-one-out CV.
Classifier   AUC     Acc     MCC     Sn     Sp
RF           0.625   62.5%   0.378   25%    100%
ANN          0.812   62.5%   0.258   50%    75%
SVM          0.500   50.0%   0       0      100%
kNN          0.687   62.5%   0.258   50%    75%
The AUC score of ANN is 0.812, which is 0.1–0.3 higher than the scores of the other three classifiers. Therefore, under the EAAC encoding mode, ANN has a better classification effect on small-sample data.
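The indicators Acc, Sn, Sp and MCC in Table 2 follow from confusion-matrix counts. The sketch below uses the standard definitions; the counts in the example are an inference, assuming the eight evaluated samples consist of 4 positives and 4 negatives, which is consistent with the RF row but is not stated in the text.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (Acc, Sn, Sp, MCC) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)  # sensitivity: recall on positives
    sp = tn / (tn + fp)  # specificity: recall on negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc


# Counts consistent with the RF row, assuming 4 positives and 4 negatives.
print(binary_metrics(tp=1, fn=3, tn=4, fp=0))
```

Setting MCC to 0 when the denominator vanishes is a common convention; it covers the case of a classifier that predicts a single class for every sample.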
3.2 Different Feature Extraction Methods Produce Different Prediction Results
The score of a prediction method is affected not only by the choice of validation scheme but also by the encoding scheme. We therefore trained the above classifiers with two different encodings, EAAC and EBAG+Profile. Table 3 shows the results.

Table 3. The results of different classifiers under the EAAC and EBAG+Profile encoding methods.
Classifier   Method         AUC     Acc      MCC      Sn       Sp
ANN          EAAC           0.812   62.50%   0.258    50.00%   75.00%
             EBAG+Profile   0.312   37.50%   −0.258   25.00%   50.00%
RF           EAAC           0.625   62.50%   0.378    25.00%   100.00%
             EBAG+Profile   0.687   62.50%   0.377    25.00%   100.00%
SVM          EAAC           0.500   50.00%   0        0        100.00%
             EBAG+Profile   0.750   50.00%   0        0%       100.00%
kNN          EAAC           0.687   62.50%   0.258    50.00%   75.00%
             EBAG+Profile   0.687   62.50%   0.258    75.00%   50.00%
As can be seen from Table 3, the encoding mode has a substantial impact on the classification performance of the classifiers. For example, with EAAC encoding the AUC of the ANN classifier reaches 0.812, while with EBAG+Profile encoding it is only 0.312. RF, SVM and the other classifiers show a similar situation. Under different encoding methods, not only the AUC but also the other metrics differ considerably, which further verifies that the score of a prediction method is greatly influenced by the encoding method. In addition, most of the classifiers using EAAC encoding perform better than the same classifier using the other encoding method. Besides, no matter which classifier is used, the EAAC scheme only captures information from peptides containing s-succinylation.
3.3 The Superior Performance of Deep Learning
Our experiments show that the classification performance of the above traditional classifiers is mediocre. Therefore, we used deep learning to establish two convolutional neural network models based on EAAC encoding: a one-dimensional convolutional model, CNN1eaac, and a two-dimensional convolutional model, CNN2eaac. CNN1eaac includes the following 7 layers. First, the 44 × 20 feature matrix generated by EAAC, which is the input to the entire network, is transformed by a convolution layer into intermediate-level features, followed by a pooling layer. Similar to the first two layers, the third and fourth layers apply convolution again, which can improve the expressivity of the classifier. The fifth layer is a flatten layer for unified multi-dimensional
output, in which the multidimensional feature matrix is transformed into a 1 × 64 feature vector. Next is the fully connected layer, whose 64 neuron units use the rectified linear unit (ReLU) as the activation function. The last layer is the output layer, whose probability score is produced by the sigmoid activation. In addition, we constructed a two-dimensional CNN model named CNN2eaac, consisting of two convolutional layers, two pooling layers, a flatten layer, a fully connected layer, and an output layer with the sigmoid activation function. After establishing the two models, we trained them and tested them on an independent dataset. To analyze their performance more directly, we drew two bar charts of the two evaluation indexes that best show the performance, AUC and MCC. The results are shown in Fig. 2.
[Figure 2: two bar charts, "Average AUC" and "Average MCC", each on a 0 to 1 axis, for the classifiers RF, ANN, SVM, KNN, CNN1D and CNN2D.]
Fig. 2. The average performances of different classifiers.
As can be seen from the two charts, the deep learning model CNN1eaac has the highest AUC, followed by the deep learning model CNN2eaac. The traditional classifiers clearly lag behind the two deep models. Similarly, in terms of MCC, the scores of the two CNN models are still higher than those of the traditional classifiers. This indicates that the DL method can obtain better classification results because of its excellent fitting ability. CNN1eaac thus achieves the best performance in PTM prediction, with an AUC of 93.7% and an MCC of 0.774.
4 Conclusion
In this study, although the amount of data is very small, we developed a novel and effective deep learning model based on CNN for predicting cysteine s-succinylation sites. The deep learning method reaches a classification accuracy that the traditional machine learning methods cannot achieve. The results indicate that our deep learning model performs better than traditional machine learning, because a deep neural network has stronger expressive ability and therefore fits the data better.
References
1. Modification, P.T., Protein, F., Contents, P., et al.: Overview of post-translational modifications (PTMs) (2015)
2. Zhang, Z., Tan, M., Xie, Z., et al.: Identification of lysine succinylation as a new post-translational modification. Nat. Chem. Biol. 7(1), 58–63 (2011)
3. Park, J., Yue, C., Tishkoff, D.X., et al.: SIRT5-mediated lysine desuccinylation impacts diverse metabolic pathways. Mol. Cell 50(6), 919 (2013)
4. Weinert, B., et al.: Lysine Succinylation is a frequently occurring modification in prokaryotes and eukaryotes and extensively overlaps with acetylation. Cell Rep. 4(4), 842–851 (2013)
5. Xie, Z., et al.: Lysine Succinylation and lysine Malonylation in histones. Mol. Cell Proteom. Mcp. 11(5), 100–107 (2012)
6. Papanicolaou, K.N., O’Rourke, B., Foster, D.B.: Metabolism leaves its mark on the powerhouse: recent progress in post-translational modifications of lysine in mitochondria. Front Physiol. 5(5), 301 (2013)
7. Kim, H.J., Ha, S., Lee, H.Y., Lee, K.J.: Mass Spectrom. Rev. 34(2), 184–208 (2015)
8. Pace, N.J., Weerapana, E.: ACS Chem. Biol. 8(2), 283–296 (2013)
9. Jia, J., Liu, Z., Xiao, X., Liu, B.: pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J. Theor. Biol. 394, 223–230 (2016)
10. Chou, K.C.: Impacts of bioinformatics to medicinal chemistry. Med Chem. 11, 218–234 (2015)
11. Jia, J., Liu, Z., Xiao, X.: iSuc-PseOpt: identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Anal. Biochem. 497, 48–56 (2016)
12. Xu, Y.: Recent progress in predicting posttranslational modification sites in proteins. Curr. Top. Med. Chem. 16, 591–603 (2016)
E. coli Proteins Classification with Naive Bayesian
Yujun Liu, Jiaxin Hu, Yue Zhou, Wenzheng Bao(B), and Honglin Cheng(B)
Xuzhou University of Technology, Xuzhou 221018, China
[email protected], [email protected]
Abstract. E. coli is a normal inhabitant of the intestines of animals. It is the most dominant and numerous bacterium in the guts of humans and many animals, and it lives mainly in the large intestine. It yields a wide variety of products, which have an important impact on the metabolism and intestinal function of the human body and generally promote human metabolism. Because current bioanalytical methods and techniques rest on an incomplete understanding of E. coli, test results are inaccurate and show a certain deviation. Therefore, three methods for predicting protein sequences in E. coli are proposed, and their results are compared to identify the best one. With E. coli as the research object, the naive Bayesian, Gaussian process and decision tree methods were used to analyze E. coli by separating unit points from multiple points. The accuracy on the test set is 98.37%. The experimental results show that the grey relational degree-naive Bayesian model performs much better than the other two methods.
Keywords: E. coli · Naive Bayes · Decision Tree · Gaussian Process · PHATE dimensionality reduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 715–721, 2022. https://doi.org/10.1007/978-3-031-13829-4_63
1 Introduction
Escherichia coli is a bacterium closely related to our daily life and a member of the Enterobacteriaceae. Under normal circumstances, most E. coli are very “disciplined” and do not harm our health [1]. On the contrary, they competitively resist the attack of pathogenic bacteria and help synthesize vitamin K, forming a mutually beneficial symbiotic relationship with the human body [2]. Only under special circumstances, such as reduced immunity or long-term lack of stimulation of the intestinal tract, does E. coli migrate to places outside the intestinal tract, such as the appendix, urethra, gallbladder and bladder, resulting in infection of the corresponding parts or systemic disseminated infection. Therefore, most E. coli are usually regarded as opportunistic pathogens [3]. With the development of machine learning technology, machine learning models have been applied to E. coli analysis [4]. However, there are few methods for studying the prediction accuracy of protein sequences in E. coli [5]. Nakashima proposed the method of amino acid composition (AAC) in 1986 [6]. This method, which represents a protein sequence as a 20-dimensional vector, has been widely used by biologists. However, it only considers the frequency of amino acids in the protein sequence and ignores information such as the amino acid sequence relationship and spatial structure, so its prediction accuracy is not high [7]. Based on knowledge of the protein synthesis and sorting mechanism, Matsuda divided the protein sequence into three parts: N-terminal, middle and C-terminal [8]. From these three parts, information such as amino acid distance and frequency was extracted, and the amino acid composition method was fused in to construct a new feature vector describing the protein. In addition to these common methods of constructing protein feature vectors, scholars at home and abroad have also tried to integrate a variety of sequence features to construct feature vectors. In 2018, Chen Yuxiang and Li Yanru extracted information features related to the local structure of the protein main chain, screened the extracted features with the incremental feature selection method, and identified the secreted and non-secreted proteins of Plasmodium [9]. In 2019, Liu Qinghua and others fused the features of Di density, autocorrelation coefficient and amino acid composition to obtain a 190-dimensional feature vector, and reduced its dimension with linear discriminant analysis [10]. Finally, the SVM method was used to predict the subcellular localization of proteins. Experiments on Gram-negative and Gram-positive data sets achieved high prediction accuracy. To improve the accuracy of protein feature extraction in E. coli, we propose to separate protein sequences in E. coli using the naive Bayes, Gaussian process and decision tree methods.
First, dimensionality reduction analysis of the protein separation sites is carried out: Python is used to separate unit points and multiple points and obtain the corresponding sequence positions, and the 80 most important dimensions are extracted from the 625-dimensional fixed-length data set, giving a data set of 249 samples with 80 dimensions each. Second, the n residues before and after each single functional site of the protein are extracted (sequence feature extraction). In this experiment, the 10 sequence letters before and the 10 after the sequence position pointed to by the corresponding unit point were taken as the positive and negative sites, and X was used as padding where there were too few letters. Then, the positive and negative points before and after the unit point are selected to obtain the positive and negative data sets. Finally, the corresponding algorithm is written to obtain the final vector set. Data sets are added by combining the four functional sites, and the model is then trained to evaluate the classification effect. The accuracy on the test set is 98.37%. The experimental results show that the grey relational degree-naive Bayesian model performs much better than the other two methods.
2 Data Collection and Processing
2.1 Data Set
The benchmark data set for this experiment was obtained with a crawler from a protein database website, which provides the sequence and characteristic sites of each protein (https://www.uniprot.org/uniprot/?query=streptococcus+salivarius&sort=created&desc=no). The dataset contains 249 protein sequences of Escherichia coli,
including 147 positive samples and 102 negative samples. To avoid overfitting, we randomly extract a fixed test set of 50 protein sequences that are not included in the training set during machine learning. The ratio between the two classes in the test set is approximately 2:1: there are 17 positive samples and 33 negative samples. Feature extraction from the initial data is what we need to consider in the classification process. Choosing an appropriate feature extraction method can greatly enrich the data and represent the corresponding indicators more accurately, which underpins the accuracy of the subsequent classification. Previous research found that feature extraction is mainly based on the autocorrelation of the sequences pointed to by the functional sites of the protein and on the physicochemical properties of the amino acid composition, using methods such as AC, ACC, CC, DP, DR, KMER and SC-PseAAC. We obtained good results when we used machine learning to perform linear fitting.
2.2 Dataset Construction
The initial protein sequences are represented by letters and vary in length, and the position pointed to by each unit point is represented by a number. Therefore, each sequence must be converted into a fixed-length sequence while retaining as much information as possible. After investigation, the 10 sequence letters before and the 10 after the sequence position pointed to by the corresponding unit point were taken as the positive and negative sites, respectively, and X was used as padding where there were too few letters. At the same time, the different functional sites are separated. There are four functional sites in total, namely Active site, Binding site, Metal binding and Site, and a separate data set is established for each. That is, each sample is converted into two sequences of 10 letters, and the last column is the category label: proteins with functional sites are marked with 1, and proteins without functional sites are marked with 0.
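The fixed-length conversion described above can be sketched as follows. Several details are assumptions made for illustration only: positions are taken to be 1-based, the site residue itself is excluded, and padding is added on the side where the sequence runs out. The function name and the example sequence are hypothetical.

```python
def flanking_windows(sequence, position, flank=10, pad="X"):
    """Return the `flank` letters before and after a 1-based site position."""
    index = position - 1  # convert to a 0-based index (assumed convention)
    before = sequence[max(0, index - flank):index]
    after = sequence[index + 1:index + 1 + flank]
    # Pad with X on the outer side so both windows have exactly `flank` letters.
    return before.rjust(flank, pad), after.ljust(flank, pad)


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
print(flanking_windows(sequence, 3))
```

A site near the start of the sequence yields a left window that is mostly padding, and a site at the last residue yields a right window made entirely of X.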
3 Model Building
3.1 The Establishment of Grey Relational Degree-Naive Bayesian Model
The grey relational degree is a simple classification model that helps avoid overfitting [11]. The grey relational degree model is used to pre-classify the obtained vector set, and its output is added to the data set to enrich the data and improve the accuracy of subsequent training.
3.2 Decision Tree Model Establishment
Entropy is a concept in information theory that measures the uncertainty of things: the more uncertain things are, the greater their entropy. When every outcome has the same probability, the randomness is greatest, so the entropy is also greatest.
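The statement about entropy can be checked numerically with the standard Shannon definition. The text does not say which logarithm base is used; base 2 is assumed here.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))  # two equally likely outcomes
print(entropy([0.9, 0.1]))  # a more predictable split
```

The uniform distribution gives the larger value, which is the property a decision tree exploits when it picks the split that reduces entropy the most.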
3.3 Gaussian Process Modeling
A Gaussian process is a set of random variables in a continuous domain (time or space), any finite subset of which has a joint Gaussian distribution [12–14]. In this method, the unknown target variable is represented as a function of the known input variables. The mean function and covariance function jointly determine its properties. The covariance function describes how the target variable is expected to change when the input value changes.
3.4 Feature Fusion
Previous studies have shown that fusing multiple features gives a better feature representation of biological information. By applying the 11 feature extraction methods above, 4 sets of data were obtained, one per functional site, in which each sequence has the dimension length of the corresponding method. The four functional sites can be paired in 6 ways, and the corresponding extracted features are fused by horizontal concatenation. Finally, 6 groups of data sets, each with 11 extraction methods, are obtained; the influence of dimension on the classification results is not considered (Fig. 1).
Fig. 1. Data synthesis flow chart.
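The pairing and horizontal splicing steps can be sketched as below. The sketch assumes that the two feature matrices being fused have one row per sample in the same order, which the text does not spell out; fuse and the toy feature values are hypothetical.

```python
from itertools import combinations

SITES = ["Active site", "Binding site", "Metal binding", "Site"]


def fuse(features_a, features_b):
    """Horizontally concatenate two feature matrices row by row."""
    return [row_a + row_b for row_a, row_b in zip(features_a, features_b)]


pairs = list(combinations(SITES, 2))
print(len(pairs))  # number of unordered pairs of four sites

# Hypothetical toy features: 2 samples with 3 and 2 dimensions.
active = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
metal = [[1.0, 0.0], [0.0, 1.0]]
print(fuse(active, metal))
```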
3.5 Boosting
To improve the classification accuracy, this paper uses a Boosting strategy to promote two weak learners into strong learners. First, the base learners are established, that is, the three classification machine learning algorithms used in this paper. Then, the results obtained from the grey relational degree are formed into a new one-dimensional feature and merged into the basic data set, after which classification training is carried out through machine learning. The accuracy of the learner aggregated by Boosting improves from 0.804 for a single machine learning model to 0.985. Model testing, which takes possible overfitting into account, shows that the grey relational degree improves the accuracy of prediction.
4 Results
The goal is to find a machine learning method with good classification accuracy and to facilitate the analysis and localization of different sites. In this experiment, three kinds of learners were used, and comparison showed that the grey relational degree set based on naive Bayes classification had the highest accuracy. Three functional sites, Binding site, Metal binding and Active site, were selected for classification accuracy analysis (Table 1).

Table 1. Classification accuracy of each extraction method for each site combination.
Extraction method     Active binding   Active metal   Binding metal
AC                    0.9514           0.9268         0.9325
ACC                   0.9417           0.9756         0.9263
CC                    0.9611           0.9674         0.9202
DP                    0.6116           0.6991         0.7239
DR                    0.6699           0.7073         0.7361
KMER                  0.7087           0.7398         0.7177
PC-PseAAC             0.9805           0.9837         0.9987
PC-PseAAC-General     0.9805           0.9837         0.9987
PDT                   0.5825           0.6747         0.5951
SC-PseAAC             0.9029           0.9837         0.9571
SC-PseAAC-General     0.9029           0.9837         0.9571
After naive Bayes classification was performed on the vector sets of the eleven feature extraction methods, the classification accuracies in Table 1 were obtained. Based on these results, we chose the PC-PseAAC extraction method to extract the protein feature site data set, since it gives the highest classification accuracy. A high accuracy indicates that a positive sample with a label of 1 is highly reliable, which means that the effect of this functional site on the protein is strong. Comparing the different site combinations shows that the accuracy of the combination of Active site and Metal binding is higher than that of the other two combinations, so the validity of these two sites is high. Next, the remaining combinations are analyzed by controlling variables. When each of these two main functional sites is separately combined with the Binding site, the combination containing the Active site is found to be more reliable, so it can be concluded that, among the three functional sites, the Active site has the most important influence on the protein. Subsequent research can therefore focus on locating the Active site.
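The comparison above can be reproduced from the values in Table 1. The sketch below ranks extraction methods and site combinations by their mean accuracy; using the mean as the summary statistic is an assumption, since the text does not say how the combinations were compared.

```python
# Accuracies copied from Table 1: (Active binding, Active metal, Binding metal).
TABLE_1 = {
    "AC": (0.9514, 0.9268, 0.9325),
    "ACC": (0.9417, 0.9756, 0.9263),
    "CC": (0.9611, 0.9674, 0.9202),
    "DP": (0.6116, 0.6991, 0.7239),
    "DR": (0.6699, 0.7073, 0.7361),
    "KMER": (0.7087, 0.7398, 0.7177),
    "PC-PseAAC": (0.9805, 0.9837, 0.9987),
    "PC-PseAAC-General": (0.9805, 0.9837, 0.9987),
    "PDT": (0.5825, 0.6747, 0.5951),
    "SC-PseAAC": (0.9029, 0.9837, 0.9571),
    "SC-PseAAC-General": (0.9029, 0.9837, 0.9571),
}
COLUMNS = ("Active binding", "Active metal", "Binding metal")


def mean(values):
    return sum(values) / len(values)


# Best extraction method by mean accuracy over the three combinations.
best_method = max(TABLE_1, key=lambda name: mean(TABLE_1[name]))

# Mean accuracy of each site combination over all extraction methods.
column_means = {
    column: mean([row[i] for row in TABLE_1.values()])
    for i, column in enumerate(COLUMNS)
}
best_column = max(column_means, key=column_means.get)
print(best_method, best_column)
```

PC-PseAAC and PC-PseAAC-General have identical rows in Table 1, so the first of the two is reported.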
5 Conclusion
In recent years, proteomics has gradually become a focus of molecular biology research, and the study of protein sequences has become a popular branch of it. This paper carries out research on protein interaction prediction based on traditional machine learning and deep neural networks, starting from the common problems of existing protein interaction prediction methods. In terms of feature learning, protein sequence coding methods are studied. According to the inherent characteristics of protein sequence data, a new protein sequence feature learning algorithm is proposed, and its accuracy and reliability are verified on the standard data set and the external data set. The procedure is as follows. Dimensionality reduction analysis of the protein separation sites is carried out by separating unit points and multi-site points and obtaining the corresponding sequence positions. Then the 10 sequence letters before and after the corresponding single functional site of the protein are taken as the positive and negative sites, with X used as padding where there are too few letters. Next, the positive and negative points before and after the unit point are selected to obtain the positive and negative data sets. Finally, the corresponding algorithm is written to obtain the final set of vectors. Data sets are added for the different functional site combinations, and models are then trained to evaluate the classification effect. In this experiment, a classification model combining grey relational degree with naive Bayes, based on the PC-PseAAC feature extraction algorithm, is used. It achieved good results, and the Active site was finally selected as the most effective of the four functional sites in the experiment.
However, continuously combining functional sites causes the number of combinations to grow exponentially with the number of detection sites. Surveying all sites therefore remains difficult and time-consuming, but the relative accuracy is an advantage for precise positioning. We expect that this method and idea can become a powerful tool for bioinformatics and protein research in the future.
Acknowledgement. This work was supported by the Natural Science Foundation of China (No. 61902337), the Fundamental Research Funds for the Central Universities (2020QN89), the Xuzhou science and technology plan project (KC19142, KC21047), the Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), the Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016) and Young talents of science and technology in Jiangsu. Wenzheng Bao and Honglin Cheng can be treated as the co-corresponding authors. Yujun Liu, Jiaxin Hu, Yue Zhou can be treated as the co-first authors.
References
1. Nakashima, H., Nishikawa, K., Ooi, T.: The folding type of a protein is relevant to the amino acid composition. J. Biochem. 99(1), 153–162 (1986)
2. Chou, K.C.: Prediction of protein structural classes and subcellular locations. Curr. Protein Pept. Sci. 1(2), 171–208 (2000)
3. Matsuda, S., Vert, J.P., Saigo, H., et al.: A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci. 14(3), 2804–2813 (2005)
4. Chen, Y., Li, Y., Zhang, Y., et al.: Identification of malaria parasite secretion proteins using main chain local structure association features. J. Ding Inner Mongolia Univ. Technol. 37(5), 332–339 (2018)
5. Liu, Q., Lai, Y., Ding, H., et al.: Prediction of protein subcellular localization based on SVM. Comput. Eng. Appl. 55(11), 136–141 (2019)
6. Jiang, L.: Research on Naïve bayes classifier and its improved algorithm. China University of Geosciences (2009)
7. Lu, J., Wei, Y.: Engineering uncertainty analysis based on Gaussian process machine learning. Mechatron. Eng. Technol. 51(02), 7–8 (2022)
8. Huang, Z., Liang, Y.: Research of data mining and web technology in university discipline construction decision support system based on MVC model. Libr. Hi Tech (2019). https://doi.org/10.1108/LHT-09-2018-0131
9. Hwang, S., Han, S.R., Lee, H.: A study on the improvement of injection molding process using CAE and decision-tree. Korea Acad.-Ind. Cooperation Soc. 4 (2021). https://doi.org/10.5762/KAIS.2021.22.4.580
10. Song, M., Zhao, J., Gao, X.: Research on entity relation extraction in education field based on multi-feature deep learning (2020). https://doi.org/10.1145/3422713.3422723
11. Guo, Y., Yu, L., Wen, Z., et al.: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 36(9), 3025–3030 (2008)
12. Roy, S., Martinez, D., Platero, H., et al.: Exploiting amino acid composition for predicting protein-protein interactions. PLoS One 4(11), e7813 (2009)
13. Yu, G., et al.: Dynamic simulation of soil water-salt using BP neural network model and grey correlation analysis. Trans. Chin. Soc. Agric. Eng. 25(11), 74–79 (2009)
14. Chen, F., et al.: A diagnosis method of vibration fault of steam turbine based on information entropy and grey correlation analysis. IOP Conf. Ser. Earth Environ. Sci. 714(4), 042055 (2021)
COVID-19 and SARS Virus Function Sites Classification with Machine Learning Methods
Hongdong Wang1, Zizhou Feng1, Baitong Chen2, Wenhao Shao3, Zijun Shao1, Yumeng Zhu1, and Zhuo Wang1(B)
1 School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
[email protected]
2 Xuzhou First People’s Hospital, Xuzhou, China
3 Inspur Electronic Information Industry Co., Ltd., Jinan, China
Abstract. The COVID-19 virus and the SARS virus are two related coronaviruses. In recent years, the increasingly serious epidemic has drawn worldwide attention and has had a significant impact on daily life. We therefore propose a link analysis of the two viruses. We obtained all the required COVID-19 and SARS virus data from the UniProt database website and then preprocessed the data. In predicting the binding sites of the COVID-19 and SARS viruses, the task is to judge the validity of the two types of binding site. To address this problem, we used Adaboost, voting-classifier and SVM classifiers and compared the different classifier strategies experimentally. Among the sites, the Metal binding site improves the accuracy of protein binding site prediction most clearly. This work provides assistance for bioinformatics research.
Keywords: COVID-19 and SARS virus · Adaboost · SVM · Metal binding site
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 722–730, 2022. https://doi.org/10.1007/978-3-031-13829-4_64

1 Introduction

The COVID-19 virus and the SARS virus have had a great impact on human society in the 21st century, and the two are related: the COVID-19 virus is a sister virus of the SARS coronavirus [1–3]. COVID-19 mainly manifests as fever, fatigue, and dry cough. Upper respiratory symptoms such as nasal congestion and runny nose are rare, and hypoxia may occur. About half of the patients developed dyspnea more than a week later, and severe cases rapidly progressed to acute respiratory distress syndrome, septic shock, refractory metabolic acidosis, and coagulation dysfunction. A small number of patients became critically ill and even died [4–7]. While COVID-19 is still raging, looking for similarities between COVID-19 and SARS is a suitable way to find a way forward. A breakthrough here could bring great progress in biomedicine.

With the development of machine learning technology, machine learning models have been applied to research on protein analysis [8]. However, few studies have so far classified COVID-19 site types. This article mainly focuses on the analysis of the binding sites of COVID-19. Binding means that a certain amino acid residue on the protein sequence reacts biochemically with a ligand; such a residue is called a "binding site", and any other residue a "non-binding site". We put the two binding sites, Active_site and Metal_binding, into three algorithm models (AdaBoost, voting classifier and SVM) for training and prediction, and compared their effectiveness and accuracy. The final experiments show that the Metal_binding binding site is very useful for studying the COVID-19 and SARS viruses: its effectiveness is higher and more stable.
2 Materials and Methods

2.1 Data

The benchmark data set for this experiment was constructed from UniProt and contains the protein sequences of SARS and COVID-19. We used a series of tools, such as a crawler, to obtain all protein sequences, functions and binding sites of the SARS and COVID-19 viruses from the UniProt database website. Two binding sites, Active_site and Metal_binding, were selected in this experiment [9, 10]. The total size of the Active_site and Metal_binding data is 71. Feature extraction from the initial data is an important step in the classification process. Choosing an appropriate feature extraction method greatly enriches the available information and thereby supports the subsequent improvement of classification accuracy.

2.2 Data Processing

We preprocessed the SARS and COVID-19 protein data. First, we used Python to split the protein data by single site and extract the binding sites to find the key parts we need. We then intercepted the protein sequence to a fixed length, so as to obtain an equal-length protein sequence fragment for each binding site of each protein.

2.3 Vectorization of Sequence Features

Machine learning classification algorithms mainly rely on a feature set constructed from the structural and functional properties of the protein, and achieve satisfactory classification results by constructing a discriminative feature set. However, building discrete models or vectors that reflect sequence pattern information and retain key sequence information is a difficult task.
Therefore, we used the tool Pse-In-One [11, 12], which can generate different feature vectors according to different features. It can extract features not only from protein sequences but also from DNA and RNA sequences, and it includes 8 kinds of protein feature extraction methods based on sequence or pseudo sequence composition, covering 547 amino acid physicochemical properties. We used 11 feature extraction methods from this tool, including the following.

Auto covariance

$$r(k) = \frac{1}{n}\sum_{t=k+1}^{n}\left(Z_t - \bar{Z}\right)\left(Z_{t-k} - \bar{Z}\right) \tag{1}$$

Cross covariance

$$CC(u, v, d) = \sum_{i=1}^{L-d}\left(I_u(R_i) - \bar{I}_u\right)\left(I_v(R_{i+d}) - \bar{I}_v\right)/(L - d) \tag{2}$$

Basic kmer

$$V(kmer-1) = [x_1, x_2, \cdots, x_{20}]^{2} \tag{3}$$

$$V(kmer-2) = [y_1, y_2, \cdots, y_{400}]^{2} \tag{4}$$

Parallel correlation pseudo amino acid composition

$$H_1(i) = \frac{H_1^0(i) - \sum_{i=1}^{20}\frac{H_1^0(i)}{20}}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_1^0(i) - \sum_{i=1}^{20}\frac{H_1^0(i)}{20}\right]^2}{20}}} \tag{5}$$

$$H_2(i) = \frac{H_2^0(i) - \sum_{i=1}^{20}\frac{H_2^0(i)}{20}}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_2^0(i) - \sum_{i=1}^{20}\frac{H_2^0(i)}{20}\right]^2}{20}}} \tag{6}$$

$$M(i) = \frac{M^0(i) - \sum_{i=1}^{20}\frac{M^0(i)}{20}}{\sqrt{\frac{\sum_{i=1}^{20}\left[M^0(i) - \sum_{i=1}^{20}\frac{M^0(i)}{20}\right]^2}{20}}} \tag{7}$$

General parallel correlation pseudo amino acid composition

$$H_u(i) = \frac{H_u^0(i) - \sum_{i=1}^{20}\frac{H_u^0(i)}{20}}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_u^0(i) - \sum_{i=1}^{20}\frac{H_u^0(i)}{20}\right]^2}{20}}} \tag{8}$$
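The auto covariance, cross covariance and basic kmer features above can be illustrated with a minimal Python sketch. It is not the Pse-In-One implementation: the function names are hypothetical, the kmer vector is assumed to hold relative frequencies, and the inputs `z`, `iu` and `iv` are assumed to be per-residue values of a physicochemical property.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def auto_covariance(z, k):
    """Eq. (1): r(k) = (1/n) * sum_{t=k+1..n} (z_t - mean)(z_{t-k} - mean)."""
    n = len(z)
    mean = sum(z) / n
    # 1-based t = k+1..n corresponds to 0-based indices k..n-1
    return sum((z[t] - mean) * (z[t - k] - mean) for t in range(k, n)) / n


def cross_covariance(iu, iv, d):
    """Eq. (2): covariance between property u at residue i and property v at residue i+d."""
    length = len(iu)
    mean_u = sum(iu) / length
    mean_v = sum(iv) / length
    total = sum((iu[i] - mean_u) * (iv[i + d] - mean_v) for i in range(length - d))
    return total / (length - d)


def kmer_composition(seq, k):
    """Eqs. (3)-(4): one entry per length-k word (20 entries for k=1, 400 for k=2).

    Assumption: entries are relative frequencies of overlapping words.
    """
    words = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = max(len(seq) - k + 1, 1)  # guard against sequences shorter than k
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing non-standard letters
            counts[word] += 1
    return [counts[w] / total for w in words]
```

For example, `kmer_composition(fragment, 2)` returns a 400-dimensional vector for any fragment, so fragments of equal length map to feature vectors of equal size.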
We extracted features from the collected data of the two binding sites, Active_site and Metal_binding, converted them into feature vectors, labeled them with 0 and 1, and integrated them into positive and negative data samples. These samples were put into the AdaBoost, voting classifier and SVM algorithm models for training and prediction, so as to determine the relative effectiveness of the two binding sites.
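The fixed-length interception of Sect. 2.2 and the 0/1 labeling described above might look like the following sketch. The flank size of 7, the padding letter `X`, the function names and the choice of Metal_binding as label 1 are assumptions made for illustration; the paper does not state the window length or padding it used.

```python
def extract_window(sequence, site_position, flank=7, pad="X"):
    """Return the fragment of length 2*flank+1 centred on a 1-based site position.

    Positions near either end of the sequence are padded with `pad` so that
    every fragment has the same length.
    """
    if not 1 <= site_position <= len(sequence):
        raise ValueError("site position lies outside the sequence")
    center = site_position - 1                      # convert to 0-based index
    padded = pad * flank + sequence + pad * flank   # pad both ends
    return padded[center:center + 2 * flank + 1]


def build_samples(metal_fragments, active_fragments):
    """Label Metal_binding fragments 1 and Active_site fragments 0 (assumed convention)."""
    return [(f, 1) for f in metal_fragments] + [(f, 0) for f in active_fragments]
```

Each labeled fragment can then be passed to a feature extraction method to obtain the numeric vector used for training.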
3 Classification Algorithms

3.1 Adaboost Algorithm

AdaBoost is the most representative algorithm among the Boosting methods [13–16]. In each round, the method reduces the weights of the correctly classified samples and increases the weights of the wrongly classified samples, so that the classifier is gradually improved during the iterations, and finally all the classifiers are linearly combined to obtain the final classifier [17]. Because AdaBoost adaptively adjusts the weight distribution of the samples, setting the weights of wrongly classified samples high and those of correctly classified samples low, it is called "Adaptive Boosting". The AdaBoost algorithm proceeds as follows:

(1) Initialize the weights of the training examples, with a total of N training examples:

$$W_1 = (w_{11}, w_{12}, \cdots, w_{1N}), \quad w_{1i} = \frac{1}{N}, \quad i = 1, 2, \cdots, N \tag{9}$$

(2) Carry out a total of M rounds of learning. The m-th round is as follows:

A) Learn the base classifier $G_m$ using the training examples with weight distribution $W_m$.

B) Calculate the error rate of the base classifier obtained in the previous step:

$$e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} w_{mi} I(G_m(x_i) \neq y_i) = \frac{\sum_{i=1}^{N} w_{mi} I(G_m(x_i) \neq y_i)}{\sum_{i=1}^{N} w_{mi}}, \quad \sum_{i=1}^{N} w_{mi} = 1 \tag{10}$$

C) Calculate the weight coefficient of $G_m$:

$$\alpha_m = \frac{1}{2}\log\frac{1 - e_m}{e_m} \tag{11}$$

D) Update the weight coefficients of the training examples:

$$w_{m+1,i} = \frac{w_{mi}\exp(-\alpha_m y_i G_m(x_i))}{Z_m} \tag{12}$$

$$Z_m = \sum_{i=1}^{N} w_{mi}\exp(-\alpha_m y_i G_m(x_i)) \tag{13}$$

E) Repeat A) to D) to obtain a series of weight parameters $\alpha_m$ and base classifiers $G_m$. Linearly combine the base classifiers according to the weight parameters to obtain the final classifier:

$$f(x) = \sum_{m=1}^{M}\alpha_m G_m(x) \tag{14}$$

$$G(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\left(\sum_{m=1}^{M}\alpha_m G_m(x)\right) \tag{15}$$
3.2 Voting Classifier Algorithm

The idea of VotingClassifier is to combine conceptually different machine learning classifiers and use majority voting (hard voting) or the average predicted probability (soft voting) to predict class labels [18]. Such a classifier is useful for a set of equally well-performing models, because it balances their individual weaknesses. For a binary classification problem with several base models, we can build a voting classifier from these models and take the category with the most votes as the predicted category, that is, the minority obeys the majority. In majority class label voting (majority/hard voting), the models vote directly without distinguishing the relative importance of their results, and the class with the most votes is the final predicted class. In weighted average probability voting (soft voting), weights are added on top of hard voting, and different weights can be set for different models to reflect the importance of each model.

3.3 Methods and Evaluation Indicators

For a binary classification problem, the actual value is either positive or negative, and the predicted result takes only the two values 0 and 1. If an instance is positive and is predicted as positive, it is a true positive (TP); if it is negative but predicted as positive, it is a false positive (FP); if it is negative and predicted as negative, it is a true negative (TN); and if it is positive but predicted as negative, it is a false negative (FN). There are many evaluation indicators, such as accuracy (ACC), AUC (area under the ROC curve), F1-score, sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC).
The horizontal axis of the ROC curve is the FPR, that is, the proportion of negative samples that are judged to be positive, and the vertical axis is the TPR, that is, the proportion of positive samples that are judged to be positive. AUC is the area under the curve, and the larger the value, the better the classification effect of the model. The F1-score reflects the robustness of the model: the higher the score, the more robust the model. Sn represents the proportion of all positive examples that are correctly classified, which measures the ability of the classifier to identify positive examples. Sp represents the proportion of all negative examples that are correctly classified, which measures the ability of the classifier to identify negative examples. Here, this paper adopts ACC as the evaluation index of the classifier.

$$ACC = \frac{TP + TN}{TP + TN + FN + FP} \tag{16}$$

$$F1\_score = \frac{2TP}{2TP + FN + FP} \tag{17}$$

$$Sn = \frac{TP}{TP + FN} \tag{18}$$

$$Sp = \frac{TN}{TN + FP} \tag{19}$$
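Equations (16) to (19) can be computed directly from the four confusion counts. The sketch below assumes labels coded as 1 (positive) and 0 (negative); the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the paper specifies.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    # Assumed convention: report 0.0 when the denominator is empty
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "ACC": _ratio(tp + tn, tp + tn + fn + fp),   # Eq. (16)
        "F1": _ratio(2 * tp, 2 * tp + fn + fp),      # Eq. (17)
        "Sn": _ratio(tp, tp + fn),                   # Eq. (18)
        "Sp": _ratio(tn, tn + fp),                   # Eq. (19)
    }
```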
In the detection of binding site validity for SARS-CoV-2 and SARS, selecting appropriate evaluation metrics to assess model performance is an essential step. This paper studies the relative effectiveness of the two binding sites Active_site and Metal_binding. We set Metal_binding as the positive sample and Active_site as the negative sample. The larger the result, the higher the effectiveness of the positive sample Metal_binding; conversely, the smaller the result, the more effective the negative sample Active_site. The procedure for evaluating the effectiveness of the two binding sites is shown in Fig. 1.
Fig. 1. Validity analysis of binding sites of COVID-19 and SARS
4 Results and Discussion

Data sets were constructed with the different feature extraction methods. To illustrate the efficiency of the models, each feature extraction method was combined with each model, and the prediction accuracies were compared across models. Specifically, the prediction results of AdaBoost, VotingClassifier and SVM are compared. Figure 2 shows the accuracy of predictions made with the AdaBoost machine learning model for the different extraction methods. In this model, different Pse-In-One feature vector conversions give different accuracy rates, and the classifier parameters also differ, so a comprehensive comparison is required. Figure 2 shows that the highest accuracy achieved by the various feature extraction methods is around 0.75 to 0.85, and in most cases the accuracy is greater than 0.5, so it can be judged that the positive sample Metal_binding is the more effective.

Fig. 2. 11 Feature Extraction Methods of Adaboost Algorithm

When the VotingClassifier machine learning model is used to predict data from the different extraction methods, the final voting classifier is obtained from the predictions of six classifiers: Logistic Regression, Random Forest, naive Bayes, DecisionTreeClassifier, KNeighborsClassifier and SVC. Its characteristic is that the models are independent of each other, and there is no correlation between their results. The more similar the fused models are, the worse the fusion effect will be. The voting method ("voting") clearly outperforms the six base models, because this ensemble method improves the robustness of the model. Among the base models, KNeighborsClassifier performs worst, with accuracy as low as 0.68 and 0.69 under the AC and CC feature extraction methods. Of course, due to the aforementioned disadvantage of the voting method, the error of KNeighborsClassifier is also carried into the voting model.

Figure 2 shows the prediction accuracy obtained with the SVM machine learning model for the data extracted by the different extraction methods, plotted point by point. The experiment shows that the classification surface needs to separate the two categories correctly and maximize the classification margin. The margin differs for each data set, so a comprehensive comparison is needed. Similar extraction methods have no effect on the results of the model. The classification model cannot classify when the data set does not contain two classes. To keep the conditions for obtaining the results consistent, the three machine learning models were each run multiple times: the result with the highest accuracy in each run was taken as the result of that run, and the average over ten runs was taken as the final result of the model. The comparison of the experiments shows that, of the two binding sites Active_site and Metal_binding, the Metal_binding binding site is the more effective. Its accuracy is higher and more stable, and it can be used in targeted research on the COVID-19 and SARS viruses.
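The hard and soft voting schemes of Sect. 3.2, as used for the VotingClassifier results above, can be sketched as follows. This is an illustrative stand-alone version with hypothetical function names, not the library implementation; it assumes each base model supplies either a class label or a list of class probabilities.

```python
from collections import Counter


def hard_vote(labels):
    """Majority vote over the class labels predicted by the base models."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities, weights=None):
    """Weighted average of per-model class-probability lists.

    Returns the index of the class with the highest averaged probability,
    together with the averaged probabilities.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    n_classes = len(probabilities[0])
    avg = [sum(w * p[c] for w, p in zip(weights, probabilities)) / total
           for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg
```

The two schemes can disagree: one very confident model can outweigh two weakly confident ones under soft voting, while hard voting counts each model once. Raising the weights of selected models shifts the soft vote toward them.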
5 Conclusion

With the development of technology, researchers have been able to understand protein sequences more deeply and to use computer algorithms to compare DNA and protein sequences in order to detect the evolutionary relationships among structure, function and sequence. Genome sequencing generates large amounts of DNA sequence data and biological information, which have been applied to study gene function and to predict previously unknown gene functions. In recent years, the increasingly serious epidemic has become a focus of worldwide attention. Analyzing the protein sequences and site characteristics of the COVID-19 and SARS viruses can further help in the fight against the epidemic and in research on effective vaccines. However, because of the complex structure of viral proteins, they are difficult to study with traditional classification and prediction tools, so machine learning is better suited to this task. By continuously adjusting parameters, the accuracy of the experimental model can be further improved. This paper first preprocesses the COVID-19 and SARS protein data obtained from UniProt, then applies 11 different feature extraction methods in Pse-In-One to the protein sequences of the binding sites and converts them into numerical vectors that AdaBoost, the voting classifier and SVM can classify, and finally puts them into the three models for training and prediction, fully mining the available information. It was found that the Metal_binding binding site is more effective and stable for studying the COVID-19 and SARS viruses. The algorithm models in this paper handle the two-class problem and make better use of cascaded weak classifiers, which improves the accuracy of training and prediction, and the generalization ability is relatively strong. There are still some shortcomings: the best split point of the current classifier has to be re-selected, and if the source data are unbalanced, the classification accuracy can be reduced. In future experiments, we will work on improving the processing of data sets, reducing class imbalance, adding enough training examples, avoiding excessive data augmentation, dividing the data set reasonably, and continuously improving the generalization of the model. In the future, this method can contribute more to the study of proteins, to vaccine development, and to biomedicine.

Acknowledgement. This work was supported by the Natural Science Foundation of China (No. 61902337), the fundamental Research Funds for the Central Universities, 2020QN89, Xuzhou science and technology plan project, KC19142, KC21047, Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016) and Young talents of science and technology in Jiangsu.
References
1. Yang, W., et al.: A brief survey of machine learning methods in protein sub-Golgi localization. Curr. Bioinf. 14(3), 234–240 (2019)
2. Hoyer, S.: Is sporadic Alzheimer disease the brain type of non-insulin dependent diabetes mellitus? A challenging hypothesis. J. Neural Transm. 105(4–5), 415–422 (1998)
3. Rose, D.R.: Structure, mechanism and inhibition of Golgi α-mannosidase II. Curr. Opin. Struct. Biol. 22(5), 558–562 (2012)
4. Gonatas, N.K., Gonatas, J.O., Stieber, A.: The involvement of the Golgi apparatus in the pathogenesis of amyotrophic lateral sclerosis, Alzheimer’s disease, and ricin intoxication. Histochem. Cell Biol. 109(5–6), 591–600 (1998)
5. Elsberry, D.D., Rise, M.T.: Techniques for treating neuro degenerative disorders by infusion of nerve growth factors into the brain. U.S. Patents US6042579A (1998)
6. Yuan, L., Guo, F., Wang, L., Zou, Q.: Prediction of tumor metastasis from sequencing data in the era of genome sequencing. Brief. Funct. Genom. 18(6), 412–418 (2019)
7. Hummer, B.H., Maslar, D., Soltero-Gutierrez, M., de Leeuw, N.F., Asensio, C.S.: Differential sorting behavior for soluble and transmembrane cargoes at the trans-Golgi network in endocrine cells. Molecul. Biol. Cell 31(3), 157–166 (2020)
8. Deng, S., Liu, H., Qiu, K., You, H., Lei, Q., Lu, W.: Role of the Golgi apparatus in the blood-brain barrier: Golgi protection may be a targeted therapy for neurological diseases. Mol. Neurobiol. 55(6), 4788–4801 (2018)
9. Villeneuve, J., Duran, J., Scarpa, M., Bassaganyas, L., Van Galen, J., Malhotra, V.: Golgi enzymes do not cycle through the endoplasmic reticulum during protein secretion or mitosis. Mol. Biol. Cell 28(1), 141–151 (2017)
10. Hou, Y., Dai, J., He, J., Niemi, A.J., Peng, X., Ilieva, N.: Intrinsic protein geometry with application to non-proline cis peptide planes. J. Math. Chem. 57(1), 263–279 (2019)
11. Wei, L., Xing, P., Tang, J., Zou, Q.: PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Trans. Nano Biosci. 16(4), 240–247 (2017)
12. van Dijk, A.D.J., et al.: Predicting sub-Golgi localization of type II membrane proteins. Bioinformatics 24(16), 1779–1786 (2008)
13. Ding, H., et al.: Identify Golgi protein types with modified mahalanobis discriminant algorithm and pseudo amino acid composition. Protein Pept. Lett. 18(1), 58–63 (2011)
14. Ding, H., et al.: Prediction of Golgi-resident protein types by using feature selection technique. Chemomet. Intell. Lab. Syst. 124, 9–13 (2013)
15. Jiao, Y.-S., Pu-Feng, D.: Predicting Golgi-resident protein types using pseudo amino acid compositions: approaches with positional specific physicochemical properties. J. Theor. Biol. 391, 35–42 (2016)
16. Jiao, Y.-S., Du, P.-F.: Prediction of Golgi-resident protein types using general form of Chou’s pseudo-amino acid compositions: approaches with minimal redundancy maximal relevance feature selection. J. Theor. Biol. 402, 38–44 (2016)
17. Lv, Z., et al.: A random forest sub-Golgi protein classifier optimized via dipeptide and amino acid composition features. Front. Bioeng. Biotechnol. 7, 215 (2019)
18. Rao, R., et al.: Evaluating protein transfer learning with tape. Adv. Neural Inf. Process. Syst. 32, 9689 (2019)
Identification of Protein Methylation Sites Based on Convolutional Neural Network
Wenzheng Bao, Zhuo Wang, and Jian Chu(B)
Xuzhou University of Technology, Xuzhou, China
[email protected]
Abstract. Protein is an important component of all human cells and tissues. Protein post-translational modification (PTM) refers to the chemical modification of a protein after translation, which changes the biochemical characteristics of the protein by adding chemical groups to one or more amino acid residues under the catalysis of enzymes. Protein methylation is a common post-translational modification, in which a methyl group is transferred to specific amino acid residues under the catalysis of a methyltransferase. Protein methylation is involved in a variety of biological regulation processes, and an in-depth understanding of it helps to clarify its molecular mechanism and its functional roles in cells. Abnormal post-translational modification of proteins can lead to changes in protein structure and function, which are related to the occurrence and development of human diseases. The traditional experimental methods are time-consuming and laborious. In this paper, the features of protein methylation sites from six species are extracted, and a convolutional neural network is used for classification. An appropriate learning rate is selected when training the network to inhibit over-fitting. With sufficient iterations, a good classification result is finally obtained. The AUC values obtained in this experiment are BS: 0.945, CG: 0.665, GK: 0.952, MT: 0.957, ST: 1.0. This provides theoretical guidance for subsequent research on protein methylation site recognition.
Keywords: Protein methylation · Convolutional neural network · Site recognition
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 731–738, 2022. https://doi.org/10.1007/978-3-031-13829-4_65

1 Introduction

Protein post-translational modification (PTM) refers to the post-translational chemical modification of proteins, which changes the biochemical characteristics of proteins by adding chemical groups to one or more amino acid residues under the catalysis of enzymes. PTM can increase the variety of proteins and make their regulation more precise, their structures more complex and their functions more complete. Common post-translational modifications include ubiquitination, acetylation, phosphorylation and methylation. Arginine and lysine are the amino acids most frequently modified by methylation and demethylation. Arginine methylation has two types of modification, monomethylation and dimethylation, and dimethylation includes asymmetric methylation and symmetric methylation. The ε-amino group of lysine has three methylation patterns: monomethylation, dimethylation and trimethylation. Protein methylation is usually catalyzed by arginine methyltransferases and lysine methyltransferases. Protein methylation is involved in the regulation of multiple biological processes, such as gene transcription, RNA processing and transport, protein interaction, and DNA damage repair, and is closely related to a variety of diseases. Abnormal post-translational modification of proteins can lead to conformational changes and functional disorders of proteins, which can lead to the occurrence and development of human diseases. Therefore, identifying modification sites helps to understand the molecular mechanisms of proteins and to reduce the occurrence of diseases.

At first, traditional experimental methods, such as site-directed mutagenesis, immune antibodies and radioactive tracer technology, were used to identify PTMs, which requires a lot of manpower and funds. Now, with the development of science and technology, bioinformatics methods can analyze experimental data quickly and effectively. The bioinformatics approach in this experiment uses a convolutional neural network to learn from the experimental data, find underlying patterns, and provide reasonable reference information for further research.

In order to accurately identify protein methylation sites, this paper uses a convolutional neural network as the classification model. First, dimensionality reduction is performed on the protein methylation sites, and single sites and multiple sites are separated to obtain the corresponding sequences. Then, a 28 * 28 feature matrix is extracted and input into the convolutional neural network. ReLU is used as the activation function, the Adam optimizer is used, and the learning rate is set to 0.001. After 500 iterations on the six-species dataset, the final AUC values are BS: 0.945, CG: 0.665, GK: 0.952, MT: 0.957, ST: 1.0, and very good classification results are obtained.
2 Data Collection and Processing

2.1 Data Set

For the subsequent classification, this experiment collected the data set of protein methylation sites through a protein sequence feature extraction algorithm. In this experiment, the protein sequence is represented by letters, and the position of each single site is represented by numbers. Different functional sites were separated, giving the following four functional sites: active site, binding site, metal binding site and low binding. Proteins with functional sites were labeled 1, and those without functional sites were labeled 0. The data set used in this paper contains a total of 20,968 feature matrices of protein methylation sites. The training set and the test set are divided 9 : 1, with a positive-to-negative ratio of 1 : 1 and a feature matrix size of 28 * 28, as shown in Table 1.

Table 1. The information of datasets

            BS      CG      EC      GK     MT     ST
Positive    1571    1052    6592    206    865    198
Negative    1571    1052    6592    206    865    198

2.2 Convolutional Neural Network

A CNN is essentially a multi-layer perceptron that uses local connections and shared weights. On the one hand, this reduces the number of weights and facilitates network optimization. On the other hand, it reduces the risk of over-fitting. The advantage of a convolutional neural network is that it can directly use the image as the input of the network. Even when the reasoning rules are unclear and the images are distorted, it can extract image features and has good robustness. The network structure includes convolution layers, downsampling layers and a fully connected layer. The convolution layer enhances the signal features in the original data and reduces noise through the convolution operation. The downsampling layer sub-samples the image according to its local correlation, and finally the fully connected layer outputs the features of the feature matrix. An input feature matrix is converted into the corresponding matrix, whose elements are the corresponding pixel values. The convolution kernel is also called the filter. Assuming that the size of the input matrix is W, the size of the convolution kernel is k, the step size is s, and the zero-padding width is p, the size w of the feature map generated after convolution is:

$$w = (W + 2p - k)/s + 1 \tag{1}$$
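As a quick check of Eq. (1), the following sketch computes the feature-map size for a few settings. The function name is hypothetical, and the use of floor division when the stride does not divide evenly is an assumption borrowed from common deep learning frameworks rather than something stated in the paper.

```python
def conv_output_size(W, k, s=1, p=0):
    """Feature-map size from Eq. (1): w = (W + 2p - k) / s + 1.

    Assumption: floor division is used when (W + 2p - k) is not a multiple of s.
    """
    if s < 1:
        raise ValueError("stride must be at least 1")
    if k > W + 2 * p:
        raise ValueError("kernel is larger than the padded input")
    return (W + 2 * p - k) // s + 1


# A 32 * 32 input with a 3 * 3 kernel, stride 1:
print(conv_output_size(32, 3, s=1, p=0))  # no padding shrinks the map
print(conv_output_size(32, 3, s=1, p=1))  # padding of 1 keeps the size
```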
For each convolution layer, the input is:

“ ” (2)

and the output is:

$$Y = \varphi(V) \tag{3}$$

Each convolution layer has a different weight matrix W, and W, X and Y are all in matrix form. For the last fully connected layer, taken to be layer L, with output $\hat{y}^{L}$ in vector form and expected output d, the total error is:

$$E = \frac{1}{2}\left\| d - \hat{y}^{L} \right\|_2^2 \tag{4}$$

The CNN is trained by the gradient descent algorithm and back propagation. The gradient formula for the convolution layer and pooling layer is as follows:

$$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial V_{ij}}\frac{\partial V_{ij}}{\partial W_{ij}} = \delta_{ij}\frac{\partial V_{ij}}{\partial W_{ij}} \tag{5}$$

The model uses the Adam optimizer. The convolution layers are activated by the ReLU function, and softmax is used to activate the fully connected layer that outputs the results.
2.3 Overfitting Inhibition

In this experiment, inputting the feature matrix of protein methylation sites into the convolutional neural network resulted in over-fitting. Random inactivation (dropout) and batch normalization were used in the model to alleviate the over-fitting and improve the generalization ability of the model, which then performed better on the test set.

Using Dropout to Suppress Overfitting. Random inactivation reduces the mutual dependence between nodes and regularizes the neural network by randomly setting part of the weights or outputs of the hidden layers to zero during learning. In convolutional neural networks, random inactivation refers either to random connection inactivation, in which some elements of the convolution kernel are randomly set to zero, or to random spatial inactivation, in which an entire feature map channel is set to zero in the multi-channel case.

The neural network without dropout:

$$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)} \tag{6}$$

$$y_i^{(l+1)} = f\left(z_i^{(l+1)}\right) \tag{7}$$

The neural network with dropout:

$$r_j^{(l)} \sim \mathrm{Bernoulli}(p) \tag{8}$$

$$\tilde{y}^{(l)} = r^{(l)} * y^{(l)} \tag{9}$$

$$z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)} \tag{10}$$

$$y_i^{(l+1)} = f\left(z_i^{(l+1)}\right) \tag{11}$$

Using Batch Normalization to Suppress Overfitting. A BN layer is added, and the input of each layer is normalized in batches to alleviate over-fitting and increase the generalization ability of the network. Each dimension is normalized:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E\left[x^{(k)}\right]}{\sqrt{\mathrm{Var}\left[x^{(k)}\right]}} \tag{12}$$

In addition, the reconstruction parameters γ and β are introduced:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)} \tag{13}$$

$$\gamma^{(k)} = \sqrt{\mathrm{Var}\left[x^{(k)}\right]} \tag{14}$$

$$\beta^{(k)} = E\left[x^{(k)}\right] \tag{15}$$

The batch normalization operation alleviated the over-fitting and improved the generalization ability of the model in this experiment.
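Equations (8) to (9) and (12) to (15) can be illustrated on a plain list of activations. This sketch is only meant to make the formulas concrete and is not the layer implementation used in the experiments: the function names are hypothetical, `p` is read as the probability of keeping a unit (as Eq. (8) suggests), and the small constant `eps` is an added assumption for numerical stability that does not appear in Eq. (12).

```python
import math
import random


def dropout(y, p, rng=random):
    """Eqs. (8)-(9): draw a mask r_j ~ Bernoulli(p) and return r * y elementwise.

    Here p is the probability that a unit is kept.
    """
    r = [1 if rng.random() < p else 0 for _ in y]
    return [ri * yi for ri, yi in zip(r, y)]


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Eqs. (12)-(13) for one feature dimension over a batch of values x."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    x_hat = [(v - mean) / math.sqrt(var + eps) for v in x]   # Eq. (12), with eps assumed
    return [gamma * v + beta for v in x_hat]                 # Eq. (13)
```

Choosing `gamma` and `beta` as in Eqs. (14) and (15), with `eps=0`, maps the normalized values back to the original activations, which is why these two parameters are described as reconstruction parameters.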
Network Structure and Parameters. Network structure and parameters have a great influence on the performance of convolutional neural networks. The following parameters were selected through a large number of experiments. Since the feature matrix of the data is 28 * 28, it is resized to 32 * 32 before being input to the convolutional neural network. The network uses 3 * 3 convolution kernels activated by the ReLU function. The first layer uses 32 convolution kernels and adds a BN layer, followed by a global average pooling layer. The second layer uses 64 convolution kernels and adds a BN layer, followed by a global average pooling layer. The third layer uses 128 convolution kernels and adds a BN layer, followed by a global average pooling layer. Finally, the prediction results are output by the fully connected layer, which is activated by the softmax function. The training batch size is 32. The best model and weights obtained in training are saved.
3 Evaluation Indicators and Methodologies

This experiment draws the ROC curve to reflect the prediction results. TP denotes true positives, FN false negatives, FP false positives and TN true negatives. The formulas are as follows:

$$Precision = \frac{TP}{TP + FP} \tag{16}$$

$$Recall = \frac{TP}{TP + FN} \tag{17}$$

$$TPR = \frac{TP}{TP + FN} \tag{18}$$

$$FPR = \frac{FP}{FP + TN} \tag{19}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{20}$$
The true positive rate TPR is the proportion of all actual positive instances that are predicted as positive, and the false positive rate FPR is the proportion of all actual negative instances that are predicted as positive. AUC refers to the area under the ROC curve. FPR reflects the degree of false alarms of the model, and TPR reflects the coverage of the model's predictions. The higher the TPR and the lower the FPR, the higher the AUC value and the better the prediction performance of the model.
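The relationship between TPR, FPR and AUC can be made concrete with a short sketch. It assumes labels coded as 1 and 0, a real-valued score per sample, and at least one sample of each class; the function names are hypothetical and the area is computed with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / negatives, tp / positives))   # Eqs. (19) and (18)
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A model that ranks every positive above every negative yields an AUC of 1.0, while overlapping scores lower the value.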
4 Results
In this experiment, the model was trained on the protein methylation site data. The data were divided into a training set and a test set at a ratio of 9:1. The learning rate was set to 0.0001. The Adam optimizer
W. Bao et al.
was used to run 500 iterations. The training set was used for model training, and the test set was used to assess the generalization performance of the model. The results are shown in Fig. 1 and Fig. 2, where the abscissa is the number of iterations and the ordinate is the accuracy or the loss of the model, respectively.
Fig. 1. The accuracy of this model
Fig. 2. The loss of this model
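A 9:1 division of the data, as used in this experiment, might be sketched as follows. The paper does not say how the split was drawn, so the shuffling, the fixed seed and the function name are assumptions made for illustration.

```python
import random


def train_test_split(samples, test_fraction=0.1, seed=0):
    """Shuffle the samples and split them into (train, test) lists."""
    rng = random.Random(seed)      # fixed seed: reproducible split (assumption)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of 1000 sample indices.
train, test = train_test_split(range(1000))
print(len(train), len(test))
```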
It can be seen that the model reaches an accuracy of nearly 100% on the training set and about 80% on the test set, with some fluctuation. The loss decreases rapidly during the first 100 iterations; in subsequent iterations it no longer decreases, and the model shows signs of over-fitting. To better compare the recognition of protein methylation sites in the six species and to assess the classification performance of the convolutional neural network, ROC curves were drawn and the corresponding AUC values were calculated. The results are shown in Fig. 3. They indicate that, with suitable parameters, the convolutional neural network is effective at identifying protein methylation sites and has good application prospects.
Identification of Protein Methylation Sites
Fig. 3. The ROC curves of six datasets
The convolutional neural network achieves a very high AUC value in classification, which means that it can effectively predict protein methylation sites. Across species, performance on the ST data is higher than on the other data sets, and performance on the CG data is lower. This shows that the convolutional neural network is particularly effective at predicting the protein methylation sites of the ST species. With reasonable training parameters, the convolutional neural network is of considerable significance for follow-up studies of protein methylation site recognition.
5 Conclusion
In recent years, protein methylation has become an active research direction in bioinformatics. In this paper, a convolutional neural network is used to identify protein methylation sites. First, the characteristics of protein methylation sites were analyzed. A protein sequence learning method was used to extract the feature matrix, which was then input into the convolutional neural network. Over-fitting was alleviated through random inactivation (dropout) and batch normalization, and an appropriate learning rate and a sufficient number of iterations were selected to train the model and perform classification. Finally, the ROC curve was drawn to evaluate the results. The experiment achieved good classification results, which shows that the convolutional neural network is effective for identifying protein methylation sites, although selecting the parameters for training was difficult. The convolutional neural network has clear advantages in protein methylation site recognition and has good research prospects for subsequent protein research and bioinformatics.

Acknowledgement. This work was supported by the Natural Science Foundation of China (No. 61902337), the Fundamental Research Funds for the Central Universities (No. 2020QN89), the Xuzhou Science and Technology Plan Project (Nos. KC19142, KC21047), the Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), the Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016), and Young Talents of Science and Technology in Jiangsu.
References
1. Yang, W.: A brief survey of machine learning methods in protein sub-Golgi localization. Curr. Bioinf. 14(3), 234–240 (2019)
2. Hoyer, S.: Is sporadic Alzheimer disease the brain type of non-insulin dependent diabetes mellitus? A challenging hypothesis. J. Neural Transm. 105(4–5), 415–422 (1998). https://doi.org/10.1007/s007020050067
3. Rose, D.R.: Structure, mechanism and inhibition of Golgi α-mannosidase II. Curr. Opin. Struct. Biol. 22(5), 558–562 (2012)
4. Gonatas, N.K., Gonatas, J.O., Stieber, A.: The involvement of the Golgi apparatus in the pathogenesis of amyotrophic lateral sclerosis, Alzheimer’s disease, and ricin intoxication. Histochem. Cell Biol. 109, 591–600 (1998). https://doi.org/10.1007/s004180050257
5. Elsberry, D.D., Rise, M.T.: Techniques for treating neuro degenerative disorders by infusion of nerve growth factors into the brain. U.S. Patents US6042579A, 5 August 1998
6. Yuan, L., Guo, F., Wang, L., Zou, Q.: Prediction of tumor metastasis from sequencing data in the era of genome sequencing. Brief. Funct. Genomics 18(6), 412–418 (2019)
7. Hummer, B.H., Maslar, D., Gutierrez, M.S., de Leeuw, N.F., Asensio, C.S.: Differential sorting behavior for soluble and transmembrane cargoes at the trans-Golgi network in endocrine cells. Mol. Biol. Cell (2020). mbc-E19
8. Deng, S., Liu, H., Qiu, K., You, H., Lei, Q., Lu, W.: Role of the Golgi apparatus in the blood-brain barrier: Golgi protection may be a targeted therapy for neurological diseases. Mol. Neurobiol. 55(6), 4788–4801 (2018). https://doi.org/10.1007/s12035-017-0691-3
9. Villeneuve, J., Duran, J., Scarpa, M., Bassaganyas, L., Van Galen, J., Malhotra, V.: Golgi enzymes do not cycle through the endoplasmic reticulum during protein secretion or mitosis. Mol. Biol. Cell 28(1), 141–151 (2017)
10. Hou, Y., Dai, J., He, J., Niemi, A.J., Peng, X., Ilieva, N.: Intrinsic protein geometry with application to non-proline cis peptide planes. J. Math. Chem. 57(1), 263–279 (2019). https://doi.org/10.1007/s10910-018-0949-7
11. Wei, L., Xing, P., Tang, J., Zou, Q.: PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Trans. Nanobiosci. 16(4), 240–247 (2017)
Image Repair Based on Least Two-Way Generation Against the Network
Juxi Hu1,2 and Honglin Cheng2(B)
1 China University of Mining and Technology, Xuzhou 221000, China
2 Xuzhou University of Technology, Xuzhou 221018, China
[email protected]
Abstract. With the rapid development of deep learning in artificial intelligence, its application in computer vision has received more and more attention. Image restoration is an important application in the field of image generation: the missing information in an image is filled in and restored from the contextual and perceptual information of the image. Traditional physical and mathematical image restoration algorithms attend to contextual information but ignore perceptual information, using the pixels around the area to be restored to fill it in. For a simple natural landscape or background, the restoration effect is acceptable. However, traditional methods do not perform well on images such as faces and bodies, which carry a large amount of perceptual information. Researchers have since developed image restoration based on generative adversarial networks (GANs): the generator produces a large number of fake images, the discriminator learns to tell real from fake, and the relevant part of a generated image is then cropped out for restoration. However, the images generated by a traditional GAN are of limited quality, and vanishing gradients are prone to appear. The improved method presented here performs image restoration with a least squares generative adversarial network. The network trains better, the repair effect is better, the quality of the generated images is higher, and vanishing gradients are easier to overcome. Visually, the restored results are judged to be correct images, both in terms of the contextual information of color-connected pixels and in terms of the perceptual information of the object.
Keywords: Deep learning · Image restoration · Generative adversarial network · Least squares
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 739–746, 2022. https://doi.org/10.1007/978-3-031-13829-4_66

1 Introduction
Computer vision covers many directions, including image recognition, image classification, and image generation; image repair is one of the applications of image generation. In the field of computer vision, traditional image repair schemes based on mathematics and physics are gradually being eliminated, and image repair schemes based on deep learning are gradually gaining a place in the field. Their principle is based on the information of adjacent pixels in the image, that is, contextual information, together with perceptual information, which determines whether the result looks normal, and the repair effect is better.
As deep learning has gained an advantage in most areas of computer vision, researchers have found through experiments that traditional models encounter bottlenecks, while deep learning has made breakthroughs by using more complex models. Traditional image repair based on physics and mathematics suffers from a lack of perceptual information, which is difficult to remedy. Deep learning outperforms these traditional methods both in the final processing results and in the processing itself. Traditional image repair focuses on using existing pixels to fill the pixels to be repaired, exploiting the contextual information contained in the image to repair the edge information. When the area to be repaired is small and its surroundings are simple, the repair effect is still good. However, when the area to be repaired is large or its surroundings are complex, the repair effect is unsatisfactory.
As research on generative adversarial networks has matured, it has become increasingly feasible to repair images with fake images produced by such a network. Some problems remain, however: the discriminator suffers from vanishing gradients, and the quality of the generated images is not high. In this paper, the generative adversarial network is improved by replacing its original cross-entropy cost function with a least squares cost function, which can effectively address vanishing gradients and produce images of higher quality.
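The difference between the two cost functions can be sketched on plain lists of discriminator outputs. The least squares form below follows the LSGAN objective of Mao et al. [9] with target labels a = 0 (fake), b = 1 (real) and c = 1 (the value the generator wants the discriminator to output); these label values are a common choice and an assumption here, since the paper does not state which coding it uses.

```python
import math


def ce_discriminator_loss(d_real, d_fake):
    """Cross-entropy discriminator loss; inputs are probabilities in (0, 1)."""
    real_term = sum(-math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def ls_discriminator_loss(d_real, d_fake, a=0.0, b=1.0):
    """Least squares discriminator loss: push real towards b, fake towards a."""
    real_term = sum((d - b) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - a) ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term


def ls_generator_loss(d_fake, c=1.0):
    """Least squares generator loss: push D(G(z)) towards c."""
    return 0.5 * sum((d - c) ** 2 for d in d_fake) / len(d_fake)


# Hypothetical discriminator outputs for two real and two fake images.
d_real, d_fake = [0.9, 0.8], [0.3, 0.1]
print(ce_discriminator_loss(d_real, d_fake))
print(ls_discriminator_loss(d_real, d_fake))
print(ls_generator_loss(d_fake))
```

The least squares generator loss penalizes every fake sample in proportion to its distance from the target value, so it keeps providing a gradient even for samples the discriminator already classifies confidently.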
2 Method and Material
2.1 Generative Adversarial Network
Image repair based on a generative adversarial network is divided into two parts. The first part is the generative adversarial network itself (an improved DCGAN is used here), whose model diagram is shown in Fig. 1. It consists of a generator model and a discriminator model. The generator accepts random noise and, through continuous training, generates fake pictures that are repeatedly regenerated. During image repair, the input of the generator is its previous output. The function by which the generator produces a fake picture is denoted G(z). The discriminator determines whether an input picture is real or fake: if it is real, the output tends to 1; if it is fake, the output tends to 0. During image repair, perceptual information is accounted for through the discriminator's judgment of whether the picture is real.
The second part is the image repair model. Using the generative adversarial network model described above, the information of the picture to be repaired is input into the generator, fake pictures are generated and judged, and through the continued game between generator and discriminator a seemingly correct, flawless picture is finally generated. This picture is the completed repair. The overall flowchart is shown in Fig. 1.
Fig. 1. DCGAN’s network model.
2.2 Mask Treatment
To obtain the projection of the area to be repaired onto the fake image, a mask operation is applied. The main function of the mask is to cover specific areas of the image to be repaired. Take a matrix of the same size as the picture to be repaired that is 0 inside the area to be repaired and 1 elsewhere; multiplying it element-wise with the picture to be repaired yields a picture in which the area to be repaired is blanked out. Conversely, multiplying the complementary matrix, which is 1 inside the area to be repaired and 0 elsewhere, with the fake picture produced by the generator model yields the projection of the area to be repaired onto the fake picture. With these products, the image repair process can be expressed as:

$$L_c = G\left(z^{(i)}\right) * \mathrm{MASK} + (1 - \mathrm{MASK}) * x \tag{1}$$

Here, $L_c$ is the repaired picture, composed of the mask area of the fake picture and the part of the picture to be repaired that lies outside the mask. $G(z^{(i)})$ is the fake picture generated by the generator model; multiplying it by MASK gives the fake picture restricted to the mask area. $x$ is the picture to be repaired; multiplying it by $1 - \mathrm{MASK}$ gives $x$ with the mask area removed. MASK has the same size as the picture to be repaired; it is 1 over the area to be repaired and 0 over all other areas. In image repair, the generator model in theory cannot directly generate an image of exactly the size of the area to be repaired, so the full-size generated image is multiplied by the mask matrix, while the picture to be repaired is multiplied by $1 - \mathrm{MASK}$, which is 0 for the area to be repaired and 1 for the remaining area.

2.3 The Generator Model
Building the generator takes five steps. Step 1: the initial input is passed through a fully connected layer, which connects each node of the layer to all nodes of the previous layer, so that the input noise is converted without loss into a tensor of the required size; the original 100-dimensional input becomes an 8192 × 1 tensor after the fully connected layer. Step 2: using the NumPy reshape method in Python, the tensor is reshaped to 4 × 4 × 512. Step 3: the number of transposed convolution layers is set to 4, and the corresponding convolution kernels are configured with a side length of 5 and a stride of 2. Step 4: to keep the feature-map size consistent under the convolution operation, the input is padded with zeros; the padding mode is "same". Step 5: because a very small slope is wanted for negative values, the activation function is uniformly set to the leaky rectified linear unit (LeakyReLU). After these five steps, the output size is twice that of the input, and the number of channels is reduced. Repeating the steps 3 times makes the output 8 times larger than the input, and the number of output channels is finally reduced to 3. To ensure that the values passed on lie between −1 and 1, the output is normalized to [−1, 1] by using tanh as the final activation function (Fig. 2).
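The shape bookkeeping of the generator can be checked with a few lines of arithmetic. The sketch below relies on the convention, used for example by Keras transposed convolutions, that with "same" padding the output size equals the input size times the stride. Halving the channel count at each layer is an assumption borrowed from the usual DCGAN layout; the text only says that the channels decrease and end at 3. Because the text mentions both 4 transposed convolution layers and 3 repetitions, both cases are evaluated.

```python
LATENT_DIM = 100        # stated in the text
BASE = (4, 4, 512)      # stated in the text: target of the reshape


def fc_output_units(shape=BASE):
    """Units the fully connected layer must output before the reshape."""
    h, w, c = shape
    return h * w * c


def upsample_same(size, stride=2):
    """Spatial size after a transposed convolution with 'same' padding."""
    return size * stride


def generator_shapes(n_layers, base=BASE, out_channels=3):
    """Feature-map shapes after each transposed convolution layer.

    Assumption: channels are halved per layer and set to `out_channels`
    in the last layer (DCGAN convention, not stated in the paper).
    """
    h, w, c = base
    shapes = [base]
    for i in range(n_layers):
        h, w = upsample_same(h), upsample_same(w)
        c = out_channels if i == n_layers - 1 else c // 2
        shapes.append((h, w, c))
    return shapes


print(fc_output_units())        # size of the flattened tensor
print(generator_shapes(3))      # three doublings: 8 times larger
print(generator_shapes(4))      # four doublings: 16 times larger
```

Three doublings take the 4 × 4 map to 32 × 32, which matches the "8 times larger" statement, while four layers would give 64 × 64.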
Fig. 2. The constructed generator model.
3 Results
3.1 The Operation of the Image Repair Model
Image repair feeds the known parts of the picture to be repaired to the generator model and joins the generated result to the original to complete the repair. The repair diagrams are shown in Figs. 6–18. There are two points to note during training.
First, by continually training the generator, it becomes able to generate the best fake picture; analyzing the parameters of the generator model and the fake picture at this point yields a fixed weight matrix that produces the best fake picture. Second, this weight matrix is then used to optimize the image repair repeatedly.
Specifically, completing an image repair requires six steps. Step 1: iterate image generation repeatedly to complete the training of the image generation model. Step 2: stop optimizing image generation; at this point the image generation model is fixed and no longer needs to be optimized. Step 3: repeatedly optimize the input of the image generation model, which is now the quantity to be optimized. Step 4: optimize the input until the generated picture is judged to be as similar as possible to the given picture. Step 5: execute the model to generate a realistic fake picture. Step 6: by the mask operation, cut the corresponding area out of the best fake picture and join it with the picture to be repaired.
Select some images to complete and place them in the corresponding folder. After the repair, the results can be seen in Fig. 3. From these 20 pictures it is easy to see that, starting from a poorly repaired first picture, the image improves steadily as the discriminator judges and the generator continues to generate, finally forming the picture 0950.png. Strictly speaking, the repaired image is not exactly the same as the original input image, but the repair makes the damaged image look like a normal image, which fulfils the purpose of this article. The input image is shown in Fig. 4, and the image after the mask area is added is shown in Fig. 5. The repair process is shown in Fig. 6. The fake image generated for the mask area is shown in Fig. 7, where the fake picture is multiplied by the mask to obtain the repair region, which is then filled into the original image to produce the repair.
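Steps 2 to 6 can be sketched on a toy problem: the generator is held fixed, only its input z is optimized so that the generated picture agrees with the known pixels, and the result is composed with the damaged picture as in Eq. (1). Everything in this sketch is hypothetical: the "generator" is a fixed linear map rather than a trained network, pictures are flat lists of four pixels, and only the contextual term on known pixels is minimized (the discriminator-based perceptual term is left out).

```python
def generate(weights, z):
    """Toy stand-in for a trained generator: a fixed linear map G(z) = W z."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in weights]


def contextual_loss(weights, z, x, mask):
    """Squared error on known pixels only (mask = 1 marks damaged pixels)."""
    g = generate(weights, z)
    return sum(((1 - m) * (gi - xi)) ** 2 for gi, xi, m in zip(g, x, mask))


def optimize_latent(weights, x, mask, z, lr=0.1, steps=500):
    """Gradient descent on z; the generator weights are never changed."""
    z = list(z)
    for _ in range(steps):
        g = generate(weights, z)
        resid = [(1 - m) * (gi - xi) for gi, xi, m in zip(g, x, mask)]
        grad = [2 * sum(row[j] * r for row, r in zip(weights, resid))
                for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


def compose(fake, x, mask):
    """Eq. (1): fake pixels inside the mask, original pixels outside."""
    return [f * m + (1 - m) * xi for f, xi, m in zip(fake, x, mask)]


# Toy data: 4 pixels, 2 latent dimensions, pixel 2 is damaged.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
original = generate(W, [0.5, -0.25])
mask = [0, 0, 1, 0]
damaged = [xi * (1 - m) for xi, m in zip(original, mask)]

z_hat = optimize_latent(W, damaged, mask, z=[0.0, 0.0])
repaired = compose(generate(W, z_hat), damaged, mask)
print(repaired)
```

Because the toy generator can reproduce the original exactly, the damaged pixel is recovered; with a real network the generated content is only plausible, not identical, as the text notes.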
Fig. 3. A summary of the repair images.
Fig. 4. The input image.
Fig. 5. The image after the mask is added.
Fig. 6. Pictures of the image repair process.
Fig. 7. Replacement of the damaged area of the original image.
3.2 OpenCV Built-In Image Repair Algorithms
Algorithm based on the Navier-Stokes equations [10]. The main points considered when the algorithm was proposed are: detect the edge area of the region to be repaired and continue content along the detected edges into the region; and detect the area outside the edges of the region to be repaired and remove noise from the edge content of the detection result.
Telea's algorithm based on the fast marching method (FMM) [11]. The key points considered when the algorithm was proposed are: first detect the information around the outside of the region to be repaired, repair step by step according to the detected information, and gradually push inward, iterating until the repair is completed.
The original image is shown in Fig. 8 and the mask area in Fig. 9; the repair results of the two methods are shown in Fig. 10 and Fig. 11. The results clearly show that both methods fill the region with surrounding pixels and miss an important element, the nose.
Fig. 8. The original image to be repaired.
Fig. 9. The repair mask.
Fig. 10. Repair result of the NS method.
Fig. 11. Repair result of the FMM (TELEA) method.
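The "start from the boundary and push inward" idea behind such classical methods can be illustrated with a deliberately simplified fill. This is not Telea's fast marching method nor the Navier-Stokes algorithm (in OpenCV these are available through `cv2.inpaint` with the flags `cv2.INPAINT_TELEA` and `cv2.INPAINT_NS`); it is a hypothetical layer-by-layer average that only shows why such methods reproduce surrounding pixels and cannot invent a missing structure such as a nose.

```python
def onion_peel_fill(image, mask):
    """Fill masked pixels from the boundary inward.

    image: 2-D list of floats; mask: 2-D list, 1 = pixel to repair.
    Each pass fills the unknown pixels that touch a known pixel with
    the mean of their known 4-neighbours.
    """
    h, w = len(image), len(image[0])
    img = [row[:] for row in image]
    known = [[mask[r][c] == 0 for c in range(w)] for r in range(h)]
    while True:
        updates = []
        for r in range(h):
            for c in range(w):
                if known[r][c]:
                    continue
                nbrs = [img[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < h and 0 <= cc < w and known[rr][cc]]
                if nbrs:
                    updates.append((r, c, sum(nbrs) / len(nbrs)))
        if not updates:          # nothing left that touches a known pixel
            break
        for r, c, value in updates:
            img[r][c] = value
            known[r][c] = True
    return img


# Hypothetical one-row "image" with a three-pixel hole.
row_image = [[1.0, 0.0, 0.0, 0.0, 3.0]]
row_mask = [[0, 1, 1, 1, 0]]
print(onion_peel_fill(row_image, row_mask))
```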
3.3 Comparison with Image Repair by an Ordinary Generative Adversarial Network
The repair effect of an ordinary generative adversarial network is shown in the following figures. The original image is shown in Fig. 12 and the mask in Fig. 13; the result of repair by the ordinary generative adversarial network is shown in Fig. 14. Because of limited computer performance, a more mature network could not be trained, but the completed repair clearly shows that the network recognizes the missing nose and the corners of the two eyes and fills them in well. Nevertheless, the nose in the repaired image is still faint. The result of repair with the improved LSGAN is shown in Fig. 15.
Fig. 12. The original image to be repaired.
Fig. 13. The repair mask.
Fig. 14. Repair result of the ordinary GAN.
Fig. 15. Picture generated by the improved GAN repair.
4 Conclusion
Experiments show that the traditional OpenCV built-in algorithms, including the Navier-Stokes and Telea algorithms, repair well when the damaged area is small and the background carries no semantic meaning, but repair poorly when the damaged area is large and contains information that must be recognized through common sense. The ordinary generative adversarial network repairs well, but the details of its repair are not as good as those of the improved least squares generative adversarial network. An ordinary GAN, when used for training and image repair, is suitable for general, common images and gives satisfactory results, and it can repair well even when the number of related pictures is small. However, its disadvantages are also obvious: edge discontinuities appear relatively often in repaired images, and it is more prone to vanishing gradients. The image repair model based on LSGAN combines the advantages of convolutional neural networks and generative adversarial networks, and is in general more mature and stable than the ordinary GAN in terms of the repair effect. However, in the process of image repair, a small number of repairs were found to be inconsistent with the original, with a noticeable gap between the contextual and perceptual information of the repaired area and that of the original area; this remains to be solved. In addition, for complex images, contextual and perceptual information alone cannot achieve the best repair effect, and other perspectives should also be considered.

Acknowledgement. This work was supported by the Jiangsu Provincial Natural Science Foundation (No. SBK2019040953) and the Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016).
References
1. Bertalmio, M., Bertozzi, A.L., Sapiro, G., et al.: Navier-Stokes, fluid dynamics, and image and video inpainting. In: Computer Vision and Pattern Recognition, pp. 355–362 (2001)
2. Radford, A., Metz, L., Chintala, S., et al.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv: Learning (2015)
3. Pathak, D., Krahenbuhl, P., Donahue, J., et al.: Context encoders: feature learning by inpainting. In: Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)
4. Criminisi, A., Perez, P., Toyama, K.: Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 13(9), 1200–1212 (2004)
5. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Trans. Graph. 26(3), 4 (2007)
6. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial networks. Adv. Neural Inf. Process. Syst. 3, 2672–2680 (2014)
7. Radford, A., Metz, L., Chintala, S., et al.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv: Learning (2015)
8. Yeh, R., Chen, C., Lim, T.Y., et al.: Semantic image inpainting with perceptual and contextual losses (2016)
9. Mao, X., Li, Q., Xie, H., et al.: Least squares generative adversarial networks (2016)
10. 李率杰, 李鹏, 冯兆永, et al.: Image inpainting algorithm based on the Navier-Stokes equations. Journal of Sun Yat-sen University (Natural Science Edition) 51(1), 9–13 (2012) (in Chinese)
11. Telea, A.: An image inpainting technique based on the fast marching method. J. Graph. Tools 9(1), 23–34 (2004). https://doi.org/10.1080/10867651.2004.10487596
Prediction of Element Distribution in Cement by CNN
Xin Zhao1, Yihan Zhou2, Jianfeng Yuan1, Bo Yang1, Xu Wu3, Dong Wang1,5, Pengwei Guan6, and Na Zhang1,4(B)
1 Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan 250022, Shandong, China
[email protected]
2 Department of Computer Science, Southern Methodist University, Dallas, TX 75205, USA
3 Shandong Provincial Key Laboratory of Preparation and Measurement of Building Materials, University of Jinan, Jinan 250022, Shandong, China
4 Shandong Key Laboratory of Intelligent Buildings Technology, Shandong Jianzhu University, Jinan, China
5 School of Information Science and Engineering, University of Jinan, Jinan 250022, Shandong, China
6 Shandong Qiuqi Analytical Instrument Co., Ltd., Zaozhuang, China
Abstract. Cement-based materials are widely used in today’s society, and their quality directly determines the quality of buildings. It is therefore important to improve the physical properties of cement. To study high-performance cement, researchers commonly use a scanning electron microscope (SEM) for element analysis to explore the factors that affect cement performance. The SEM is a common tool for studying the internal microstructure and physical properties of materials, and it produces high-quality images. Nevertheless, it has inherent disadvantages. First, it is expensive, and many research institutions cannot afford one. Second, even when an instrument can be rented, its use consumes considerable time and money. To address these problems, we adopt deep learning, one of the most widely used technologies in artificial intelligence, which has developed rapidly and is widely used in materials science. In this paper, we propose a new method for simple and rapid analysis of the element distribution in cement. The method provides a function similar to that of a scanning electron microscope: a convolutional neural network takes a backscattered electron (BSE) image of cement as input and outputs distribution maps of selected elements. In this way, the distribution of elements of interest in cement samples can be obtained quickly, reducing the time and cost of element analysis in physical experiments.
Keywords: Cement element prediction · Convolutional neural network · Energy Dispersive Spectroscopy Prediction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 747–756, 2022. https://doi.org/10.1007/978-3-031-13829-4_67
1 Introduction
Cement is the most widely used building material in the world and has a crucial impact on the quality of construction projects. To study high-performance cement, many teams worldwide are exploring the factors that affect its performance. In recent years, computer modeling for auxiliary analysis has become popular. Researchers usually use an SEM to obtain secondary electron images, back-scattered electron (BSE) images, and energy dispersive spectroscopy (EDS) images of cement samples for analysis. An SEM yields accurate high-resolution images. However, the equipment is expensive, and many research institutions cannot afford to buy it. Even if it can be rented, it costs a lot of time and money, which creates difficulties for researchers. Researchers therefore want to use other techniques in place of the SEM to achieve the same function. Many researchers use the Monte Carlo method to simulate the random diffraction of the electron beam under different experimental conditions in an SEM. However, the Monte Carlo method can only simulate simple SEM images. Some works predict SEM images from the physics of electron-solid interaction, but they are limited to secondary electron or back-scattered images. EDS prediction is an important and difficult problem for which no good solution exists. In this paper, we address this problem with a convolutional neural network, a model widely used in computer vision. The model takes a BSE image of cement as input and extracts its features through the convolutional neural network. So why can this model predict EDS images?
In recent years, deep learning has made great progress in computer vision, including but not limited to object detection, image segmentation, face recognition, and autonomous driving. At the core of these advances is the convolutional neural network. It is suitable for the current problem for the following reasons. On the one hand, BSE and EDS images are different images formed by installing different detectors in the SEM. The BSE image shows the micro-morphology of the cement surface, and the EDS images show the distribution of elements on the cement surface. There is an internal relationship between them: the cement presents different microscopic appearances at different positions because of the different distribution of elements. The convolutional neural network is good at discovering this kind of relationship, which makes it a promising way to predict element distribution. On the other hand, it has several general advantages that help it perform well on the current problem:
• It has strong abstract coding and learning ability: it can capture very complex underlying patterns in data, so it can encode complex speech or images and shows high prediction accuracy in practice.
• It has wide coverage and good adaptability: a neural network with many layers and large width can, in theory, fit any function and so solve very complex problems.
• It has good portability: a fully trained neural network can still achieve good results on data it has not seen.
• Compared with traditional machine learning, the hidden layers of a neural network in deep learning significantly reduce the need for feature engineering.
• When new data are added, the model is easy to update.

Based on these advantages of deep learning and its excellent performance in computer vision, we use a convolutional neural network to encode BSE images. We design a convolutional neural network that extracts fine-grained information by using down-sampling, up-sampling, and connections between deep and shallow feature maps, which overcomes the disadvantage that ordinary neural networks extract only global information. The method takes cement as the experimental object, and the designed network is used to infer and analyze the distribution of elements in cement samples.
2 Related Work
2.1 SEM Prediction
SEM images have been simulated with the Monte Carlo method by studying the electron-solid interaction. This requires two things: first, a theoretical description of electron scattering in solids and of the cascade processes; second, a reasonable description of the geometric boundary of the sample [1]. The main SEM/X-ray image data of cement-based materials required by the CEMHYD3D model are the surface area fraction and the auto-correlation function of each phase; Mohamed et al. [2] predict these two quantities with a neural network. Johnsen et al. [3] developed a Monte Carlo simulation program, MCSEM, to simulate scanning electron microscope (SEM) image formation for arbitrary sample structures (such as the layout structure of a wafer or photomask). However, they did not propose a method to predict EDS. Our sample is cement, and acquiring SEM images of it requires special specimen preparation and polishing techniques [4, 5].
2.2 Convolutional Neural Networks
A convolutional neural network is a deep neural network with a convolutional structure. LeNet, proposed by LeCun et al. [6], is generally considered to be the first convolutional neural network. AlexNet [7] won the ImageNet competition in 2012; it used local response normalization to reduce the error rate and dropout to prevent
X. Zhao et al.
overfitting. In theory, the deeper the neural network, the stronger its expressive ability and the better its performance. VGG [8] was therefore proposed in 2014. It was a very deep network at that time, reaching 19 layers. In this network, small convolution kernels are used in place of large ones, and 1 × 1 kernels are introduced, which greatly reduces the number of parameters and improves the network performance. However, as networks became deeper and deeper, a degradation problem was observed: the fitting ability decreases when the network becomes deeper. To solve this problem, He et al. proposed ResNet [9] in 2015. The network adopts a residual structure that changes the function the network should learn from H(x) to F(x), where H(x) = F(x) + x. ResNet greatly improves the expressive ability of deep neural networks and brought convolutional neural networks into a new era. In 2016, DenseNet [10] put forward a new idea that starts from neither the depth nor the width of the network: it reuses the features of every layer, which alleviates the vanishing-gradient problem and strengthens the propagation of features through the network. As models grew larger, the number of parameters and the size of the model structure greatly limited the practical application of convolutional neural networks, and attention turned to lightweight models. MobileNet [11] is a classic lightweight neural network model with great influence in academia and industry. Its core idea is the depthwise separable convolution. In addition, MobileNet uses two hyperparameters to control the width of the network and the resolution of the input image, and thus the size of the model.
In 2017, ShuffleNet [12] combined depthwise separable convolution with group convolution and designed its network structure on the basis of the residual block, reducing the amount of computation while maintaining high performance. EfficientNet [13] uses an AutoML method to adjust the depth, width, and resolution of the model adaptively and achieves good results. In summary, the existing work on simulating the scanning electron microscope and the rapid development of convolutional neural networks in recent years have laid a solid foundation for our work.
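To make two of the ideas surveyed above concrete, the following minimal sketch counts the weights of a standard convolution and of a depthwise separable convolution, and writes the residual shortcut H(x) = F(x) + x as a plain function. The function names and the channel sizes (64 and 128) are illustrative assumptions and are not taken from any of the cited papers.

```python
def standard_conv_params(k, c_in, c_out):
    """Number of weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


def residual_output(x, f):
    """Residual shortcut H(x) = F(x) + x, applied element-wise to a list."""
    return [fx + xi for fx, xi in zip(f(x), x)]


# Illustrative layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable, separable / standard)
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which is the kind of saving lightweight models rely on.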
3 Methodology
This section describes the method in detail. We propose a new method for predicting the distribution of elements in cement. The method consists of the following steps: (1) preparing hardened cement samples, (2) acquiring the backscattered electron (BSE) image and the element distribution (EDS) maps of the cement, (3) designing and training the network, and (4) predicting with the trained model. The complete experimental process is shown in Fig. 1 and is explained step by step below.
Fig. 1. Flowchart of our method
Data Preparation. After a series of processes such as cement mixing, hydration, and demoulding, we put the sample into absolute ethanol to stop hydration, which gives hardened cement of different ages. The prepared cement sample is then encapsulated in resin to keep it dry and prevent oxidation until it is used.
Image Acquisition. The surface of the sample is polished and leveled, and then the backscattered electron image and the distribution maps (EDS) of several main elements are acquired with a scanning electron microscope. The sampling time of each EDS image is 70 frames. The BSE image and EDS images are shown in Fig. 2. The BSE image shows the microscopic characterization of the cement, and the EDS images show the distribution of elements in the cement. According to the imaging principle of EDS, the gray value of a pixel represents the level of element content. These paired images are the data on which the neural network is trained.
Fig. 2. Data
Network Structure. As a branch of artificial neural networks, the convolutional neural network is widely used in vision and has achieved good results, thanks to its strong encoding and feature extraction ability. We therefore also use a convolutional neural network to realize our idea. The features of each layer in the network are three-dimensional, h × w × c, where c is the number of feature channels. Furthermore, convolution has translation invariance, so the response at a position after convolution depends only on relative coordinates. The components of the whole network are convolution, pooling, activation functions, and batch normalization. Our network does not contain any fully connected layer and is composed of convolutional layers only, so it can operate on inputs of any size. In our experiments, to increase the amount of data, we use random cropping with a crop size of 32 × 32, but when the model is applied it can accept inputs of any size. Neural networks usually solve single-value problems: an input image corresponds to a single output. Our problem is different. Each pixel of the input image has a corresponding output pixel. We call this kind of problem a dense prediction problem. In the design of a convolutional layer, the stride affects the size of the output feature map. With a stride larger than 1, the height and width of the input are reduced proportionally, and we can no longer obtain an output for every pixel. Therefore, after the network has down-sampled several times, we up-sample to restore the resolution. There are many ways of up-sampling, such as interpolation and deconvolution. The stride of up-sampling is the same as that of down-sampling, which ensures a one-to-one correspondence between up-sampling and down-sampling.
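The random-crop augmentation described above can be sketched as follows. This is a dependency-free illustration, not the implementation used in the experiments: the function name random_paired_crop and the representation of images as lists of rows are assumptions. The one point it is meant to show is that the same window must be cut from the BSE image and from its EDS label so that pixels stay aligned.

```python
import random


def random_paired_crop(bse, eds, size=32, rng=random):
    """Cut the same size x size window from a BSE image and its EDS map.

    Both images are lists of rows (lists of pixel values) with identical shape.
    """
    height, width = len(bse), len(bse[0])
    if len(eds) != height or len(eds[0]) != width:
        raise ValueError("BSE and EDS images must have the same shape")
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = rng.randint(0, height - size)    # randint includes both end points
    left = rng.randint(0, width - size)

    def crop(img):
        return [row[left:left + size] for row in img[top:top + size]]

    return crop(bse), crop(eds)


# Toy data: a 64 x 100 "BSE" image and an "EDS" map that is its negation,
# so pixel alignment is easy to verify after cropping.
rng = random.Random(0)
bse = [[r * 100 + c for c in range(100)] for r in range(64)]
eds = [[-v for v in row] for row in bse]
x, y = random_paired_crop(bse, eds, size=32, rng=rng)
```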
Once the output has the same resolution as the input image, we can compute a pixel-level loss and train the network by backpropagation. After each convolution, we apply batch normalization, which normalizes the activations to a mean of 0 and a variance of 1 and speeds up convergence. Pixel-level prediction is more complex than ordinary prediction. Ordinary prediction tasks rely mainly on global information. For example, in object detection the network pays more attention to the general outline and structure of the object than to the details within it. For dense prediction, global information alone is not enough for the network to predict every pixel. In a convolutional neural network, the receptive field grows as the network deepens. The deep feature maps therefore contain global information about the input image, while the shallow feature maps contain local detail. How, then, can local and global information be combined to achieve dense prediction? A straightforward idea is to up-sample the output of the last down-sampling layer repeatedly, with the same stride as the down-sampling, so that the network forms a symmetrical structure. Feature maps with the same resolution are then concatenated along the channel dimension.
Therefore, we first repeat convolution and pooling to obtain feature maps of size 8 × 8, 4 × 4, and 2 × 2. The 2 × 2 feature map is then up-sampled by a factor of four and the 4 × 4 feature map by a factor of two, and both are concatenated with the 8 × 8 feature map along the channel dimension, which gives a feature map of size 8 × 8 × c. This feature map is up-sampled by a factor of four to restore the resolution of the input image, which yields the prediction of the network. The complete network structure is shown in Fig. 3.
Fig. 3. Model structure
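The resolution bookkeeping of this multi-scale fusion can be traced with a small shape-level sketch. It is only an illustration under strong assumptions: learned convolutions are replaced by 2 × 2 average pooling, up-sampling is nearest-neighbour, each scale has a single channel, and the final prediction layer is replaced by a channel average. None of these choices is stated in the paper, and the function names are hypothetical.

```python
def avg_pool2(fmap):
    """Halve the height and width of one 2-D map with 2 x 2 average pooling."""
    return [
        [(fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]


def upsample(fmap, factor):
    """Nearest-neighbour up-sampling of one 2-D map by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse(image):
    """Return (fused_channels, output) for a single-channel 32 x 32 input."""
    f16 = avg_pool2(image)
    f8 = avg_pool2(f16)
    f4 = avg_pool2(f8)
    f2 = avg_pool2(f4)
    # Bring the deeper maps back to 8 x 8 and stack all three as channels.
    fused = [f8, upsample(f4, 2), upsample(f2, 4)]
    # Stand-in for the prediction layer: average the channels, then up-sample x4.
    mean = [[sum(ch[r][c] for ch in fused) / len(fused) for c in range(8)]
            for r in range(8)]
    return fused, upsample(mean, 4)


image = [[1.0] * 32 for _ in range(32)]
fused, output = fuse(image)
print(len(fused), len(fused[0]), len(fused[0][0]))  # channels, height, width
print(len(output), len(output[0]))                  # output height, width
```

The output has the same 32 × 32 resolution as the input, which is what allows a loss to be computed for every pixel.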
4 Result
With the network structure defined above, we can start training. We use the MSE loss as the loss function and Adam as the optimizer. To prevent underfitting, we use a learning rate decay strategy. With this design, the loss reaches a low value on the training set and the test set simultaneously. We then use the trained model to predict a random image from the test set. Because we normalize the labels during training, we need to de-normalize the predictions. The prediction results and original labels are shown in Fig. 4, Fig. 5, and Fig. 6.
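The training ingredients named above can be illustrated with a few helper functions. This is a sketch under stated assumptions, not the configuration used in the experiments: the labels are assumed to be 8-bit gray values scaled by 255, and the decay schedule (halving the learning rate every 10 epochs) is a hypothetical choice, since the paper does not specify either.

```python
def normalize(label, max_value=255.0):
    """Scale gray values to [0, 1] (assumes 8-bit labels)."""
    return [[v / max_value for v in row] for row in label]


def denormalize(pred, max_value=255.0):
    """Map network outputs back to the original gray-value range."""
    return [[v * max_value for v in row] for row in pred]


def mse(pred, target):
    """Mean squared error over all pixels of two equally sized images."""
    diffs = [(p - t) ** 2
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    return sum(diffs) / len(diffs)


def step_decay(initial_lr, epoch, drop=0.5, every=10):
    """Hypothetical schedule: multiply the learning rate by `drop` every `every` epochs."""
    return initial_lr * drop ** (epoch // every)


label = [[0, 51, 255]]
restored = denormalize(normalize(label))
print(restored)
print(mse([[1.0, 2.0]], [[1.0, 4.0]]))
print([step_decay(1e-3, e) for e in (0, 9, 10, 25)])
```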
Figure 4 shows the prediction results for the major element calcium. Because the predicted image and the original label are produced by different imaging methods, their image styles differ, but the element distributions they show are very similar, and the distribution is our focus. The experiment therefore shows that our method is effective for calcium.
Fig. 4. Experiment result of calcium.
Fig. 5. Experiment result of silicon.
Calcium is the main element in cement, and it is vital to predict the calcium content accurately. However, cement contains not only major elements such as calcium and silicon but also trace elements such as magnesium. Figure 5 and Fig. 6 show the experimental results of this method on silicon and magnesium, respectively. For silicon (Fig. 5), since the major elements are diffused throughout the cement, we only need to pay attention to the places where the element gathers, and in this sense the method is also effective in predicting the distribution of silicon. For magnesium (Fig. 6), the method is not effective. Our analysis of the reasons is as follows. In cement, magnesium is a trace element with low content. Therefore, only a few areas in the element distribution map show specific characteristics, which
Fig. 6. Experiment result of magnesium.
is difficult for neural networks to learn. The network can obtain a very low loss value simply by predicting an all-zero map.
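A small worked example makes this failure mode concrete. The numbers are invented for illustration: a 32 × 32 normalized label in which only a 2 × 2 cluster of pixels is bright stands in for a trace-element map, and an all-zero prediction is scored against it with the MSE.

```python
def mse(pred, target):
    """Mean squared error over all pixels of two equally sized images."""
    diffs = [(p - t) ** 2
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    return sum(diffs) / len(diffs)


size = 32
# Hypothetical trace-element label: 4 bright pixels out of 1024.
target = [[0.0] * size for _ in range(size)]
for r in range(2):
    for c in range(2):
        target[r][c] = 1.0

all_zero = [[0.0] * size for _ in range(size)]
loss = mse(all_zero, target)   # 4 / 1024
print(loss)
```

Although the all-zero map misses every pixel that matters, its loss is below 0.004, so the optimizer has little incentive to learn the sparse structure.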
5 Conclusion
In this paper, we use a CNN to predict the distribution of elements in hydrated cement samples. Experiments show that our method works well for calcium and silicon but not for the trace element magnesium.
6 Future Work
In this paper, the major elements calcium and silicon and the trace element magnesium in cement are studied experimentally. The main elements in cement include calcium, silicon, aluminum, and iron, and common trace elements include sodium, magnesium, sulfur, and potassium. Predicting cement properties by element analysis depends on all of these elements simultaneously, so our future work is to extend the method in this paper to the other elements. For trace elements, however, the method has some problems. In addition to extending the method to the remaining major elements, we will therefore continue to explore how to improve and apply the technique for trace elements.
Acknowledgements. This work was supported by National Natural Science Foundation of China under Grant No. 61872419, No. 62072213, No. 61873324, No. 61903156. Shandong Provincial Natural Science Foundation under Grant No. ZR2020KF006, No. ZR2019MF040, No. ZR2018LF005. Taishan Scholars Program of Shandong Province, China, under Grant No. tsqn201812077. “New 20 Rules for University” Program of Jinan City under Grant No. 2021GXRC077. Shandong Key Laboratory of Intelligent Buildings Technology (Grant No. SDIBT2021004). Opening Fund of Shandong Provincial Key Laboratory of Network based Intelligent Computing.
References
1. Li, Y.G., Mao, S.F., Ding, Z.J.: Monte Carlo simulation of SEM and SAM images. IntechOpen (2011)
2. Mohamed, A.R., El Kordy, A., Elsalamawy, M.: Prediction of SEM-X-ray images’ data of cement-based materials using artificial neural network algorithm. Alexandria Eng. J. 53(3), 607–613 (2014)
3. Johnsen, K.P., Frase, C.G., Bosse, H., Gnieser, D.: SEM image modeling using the new Monte Carlo model MCSEM. In: Proceedings of SPIE, the International Society for Optical Engineering. Society of Photo-Optical Instrumentation Engineers (2010)
4. Stutzman, P.E., Clifton, J.R., et al.: Specimen preparation for scanning electron microscopy. In: Proceedings of the International Conference on Cement Microscopy, vol. 21, pp. 10–22. International Cement Microscopy Association (1999)
5. Bentz, D.P., Stutzman, P.E., Haecker, C.J., Remond, S., et al.: SEM/X-ray imaging of cement-based materials. In: Proceedings of the 7th Euroseminar on Microscopy Applied to Building Materials, pp. 457–466 (1999)
6. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012)
8. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
10. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
11. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
12. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018)
13. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019)
An Ensemble Framework Integrating Whole Slide Pathological Images and miRNA Data to Predict Radiosensitivity of Breast Cancer Patients
Chao Dong1, Jie Liu1, Wenhui Yan1, Mengmeng Han1, Lijun Wu1,2, Junfeng Xia1, and Yannan Bin1(B)
1 Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei 230601, Anhui, China
[email protected]
2 Key Laboratory of High Magnetic Field and Ion Beam Physical Biology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, Anhui, China
Abstract. Breast cancer (BRCA) is the leading cause of cancer death among women worldwide, and radiotherapy (RT) is a common treatment for BRCA patients. However, not all BRCA patients benefit from RT because of the heterogeneity of BRCA. In this study, we present an ensemble framework, named EnWM, which integrates whole slide pathological images (WSI) with microRNA (miRNA) data to predict radiosensitivity effectively. Firstly, a 7-dimensional WSI feature vector was acquired with a feature extraction network and analysis of variance, and a 7-dimensional miRNA feature vector was obtained based on differential expression analysis and analysis of variance. Secondly, we obtained six individual models by training three different machine learning classifiers with the WSI and miRNA vectors (modalities), respectively. The individual model with the best performance in each modality was selected, and the calibrated output probabilities of these two individual models were combined as a new feature vector for the final model. Finally, the 2-dimensional vector was put into a logistic regression classifier to construct the ensemble model EnWM. The comprehensive results demonstrated that EnWM could provide accurate radiosensitivity prediction for BRCA patients with WSI and miRNA data.
Keywords: Breast cancer · Radiosensitivity · Whole slide pathological image · miRNA · Ensemble learning
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 757–766, 2022. https://doi.org/10.1007/978-3-031-13829-4_68
1 Introduction
Breast cancer (BRCA) is the most common cancer among women worldwide [1], and radiotherapy (RT) combined with mastectomy is one of the most common treatment options for BRCA patients [2]. As well as reducing the risk of BRCA recurrence, RT has been shown to reduce mortality of BRCA patients [3, 4]. However, due to the high
C. Dong et al.
heterogeneity of BRCA, adjuvant RT has different effects in different patients and can even cause adverse consequences (RT injury, secondary cancer, and so on) [5, 6]. Thus, it is essential to design computational tools that predict the response of BRCA patients to RT and forecast who would benefit from RT and achieve long-term survival. Several studies have proposed and validated radiosensitivity signatures based on genomic data [7–10]. Wen et al. established a prognostic model for BRCA patients with RT based on the immune infiltration status; the model included seven genes and succeeded in predicting RT benefits [6]. Chen et al. constructed a prediction model with six genes that could accurately predict the radiosensitivity of BRCA patients [2]. Although these methods have achieved good results, there is still room for improving prediction performance. With the development of sequencing technologies, many researchers have in recent years used microRNA (miRNA) to study the radiosensitivity of cancer patients [11]. As a group of short non-coding RNAs, miRNAs play an essential role in regulating most biological processes of BRCA [12, 13]. Some studies have reported that miRNAs (e.g. miR-18b, miR-3609 and miR-137) are involved in progression, metastasis, or survival in BRCA [14–16]. Besides, the dysregulation of miRNAs affects the radiation response of BRCA cells [17, 18]. Therefore, the differential expression of miRNAs between long-term and short-term survival patients could be identified as an important modulator in evaluating the radiation and treatment responses of BRCA patients [19]. Furthermore, whole slide pathological images (WSI) contain rich phenotypic information (cell size and shape, nuclei distribution and texture) that reveals molecular profiles of disease progression and guides clinical diagnosis and treatment [20, 21]. For example, Sun et al.
employed WSI and genomic data with multiple kernel learning for BRCA survival prediction [22]. With the development of artificial intelligence, more and more computational technologies for WSI analysis are being applied to the diagnosis of BRCA. In recent years, the convolutional neural network (CNN), which can automatically learn and quantify representative features from WSI, has been widely used in image recognition and processing [21, 23]. Zhu et al. proposed a survival analysis model for lung cancer patients based on WSI combined with a CNN [24]. To date, there are no studies using WSI data for radiosensitivity prediction in BRCA patients. Considering the effective information that can be extracted from WSI and miRNA data, we propose an ensemble framework, named EnWM, that combines WSI and miRNA data for radiosensitivity prediction in BRCA patients. Firstly, we preprocessed the WSI and miRNA data and performed WSI and miRNA feature selection. Secondly, the WSI and miRNA features were combined with different machine learning algorithms, including Gaussian naive Bayes (GNB), logistic regression (LR), and multi-layer perceptron (MLP), to obtain different base models. Then, for each modality (WSI or miRNA), the base model with the best performance was selected, and the two calibrated probabilities of these base models were combined into a new feature vector. Finally, the vector was input into LR to construct the prediction model EnWM. The framework of EnWM is shown in Fig. 1.
An Ensemble Framework Integrating Whole Slide Pathological Images
Fig. 1. An overview of EnWM. (A) WSI features are obtained in two major stages: (1) WSIs are clustered into three classes using the K-means algorithm, and three feature extraction network models (FENs) are trained. FEN1 is chosen to extract WSI features based on its performance. (2) For each patch feature, the median over the patches of a patient is taken as the WSI feature (16 dimensions); then, 7 optimal WSI features are selected by the feature selection algorithm. Here, n represents the number of patches of each patient. (B) miRNA features are obtained based on differential expression analysis and feature selection. (C) The prediction model is constructed by combining the outputs of the two base models with LR.
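The median aggregation in stage (2) of panel (A) can be sketched in a few lines. The function name patient_feature and the toy numbers are illustrative assumptions; in the framework each patch would carry a 16-dimensional feature produced by the feature extraction network.

```python
from statistics import median


def patient_feature(patch_features):
    """Collapse an n x d list of patch features into one d-dimensional vector.

    Each output component is the median of that component over the n patches.
    """
    if not patch_features:
        raise ValueError("patient has no patches")
    return [median(column) for column in zip(*patch_features)]


# Toy patient with n = 3 patches and d = 2 feature components.
patches = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
print(patient_feature(patches))
```

The median is less sensitive than the mean to a few atypical patches, which may be one reason for choosing it, although the paper does not state the motivation.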
2 Materials and Methods
2.1 Dataset and Preprocessing
In this study, WSI and miRNA data of female BRCA patients treated with RT were downloaded from The Cancer Genome Atlas (TCGA) [25] with the GDC data transfer tool of TCGA and the R package TCGAbiolinks [26], respectively. We excluded patients whose survival time was less than 30 days to avoid the influence of deaths from unrelated causes [27]. Two hundred and seventy-six patients were retained for further analyses. These patients were then divided into long-term and short-term survivors according to 3-year survival. Long-term survivors (≥3 years) were marked with 1 and regarded as positives, i.e., patients who were sensitive to RT and could benefit from it. Meanwhile, those patients with the short-term survivor ( 1 and adjusted P < 0.05 [31]. Therefore, miRNAs related to radiosensitivity could be identified. Then, based on ANOVA, the p value (P) between positives and negatives for each miRNA feature was used to obtain the optimal miRNA feature set. Finally, we obtained 7 optimal miRNA features to form a 7-dimensional miRNA vector.
2.4 Ensemble Model Construction
Ensemble learning is an effective method that combines different learning algorithms to achieve better performance. In this study, ensemble learning was used to integrate WSI with miRNA data to predict radiosensitivity for BRCA patients. Firstly, six base models were constructed by combining the optimal WSI vector and the optimal miRNA vector with GNB, LR, and MLP, respectively. For each feature modality, the base model with the best performance was chosen as the optimal base model for the final model. Then, the output probabilities of the WSI and miRNA base models were calibrated by isotonic regression [32]. Finally, we concatenated the two calibrated probabilities into a new 2-dimensional feature vector and constructed the final predictor based on LR.
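The calibration step can be illustrated with a dependency-free sketch of isotonic regression using the pool-adjacent-violators procedure. This is not the implementation used in the study, which is not described beyond the citation; the function names, the step-function lookup, and the toy scores are assumptions made for illustration. The fitted mapping is a non-decreasing function from raw scores to observed positive rates.

```python
from collections import defaultdict


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit; returns a list of (upper_score, value) steps."""
    grouped = defaultdict(list)
    for score, label in zip(scores, labels):
        grouped[score].append(label)       # pool samples with identical scores
    blocks = []                            # each block: [label_sum, count, upper_score]
    for score in sorted(grouped):
        ys = grouped[score]
        blocks.append([float(sum(ys)), len(ys), score])
        # Merge backwards while the previous block's mean exceeds the current mean.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(model, score):
    """Value of the first step whose upper bound covers `score` (clipped at the ends)."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


# Toy base-model scores and true labels (1 = long-term survivor).
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
truth = [0, 1, 0, 0, 1, 1]
model = fit_isotonic(raw, truth)
print(model)
print(calibrate(model, 0.25))
```

In the framework, one such mapping would be fitted per base model, and the two calibrated probabilities of a patient would form the 2-dimensional input of the final LR classifier. Library implementations usually interpolate between steps rather than using a pure step function.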
2.5 Evaluation Metrics
In order to assess the performance of different models systematically, a series of common metrics are employed, including Sensitivity (SEN), Specificity (SPE), Accuracy (ACC), Precision (PRE) and F1-score (F1). The computational formulas of these metrics are shown as follows:
\[ \mathrm{SEN} = \frac{TP}{TP + FN} \tag{1} \]
\[ \mathrm{SPE} = \frac{TN}{TN + FP} \tag{2} \]
\[ \mathrm{PRE} = \frac{TP}{TP + FP} \tag{3} \]
\[ \mathrm{F1} = \frac{2TP}{2TP + FP + FN} \tag{4} \]
\[ \mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \tag{5} \]
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Besides, we used the AUC, the area under the receiver operating characteristic (ROC) curve, to evaluate the comprehensive performance of the models. On the training set, 5-fold cross-validation was employed.
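Equations (1)–(5) and the AUC translate directly into code. In the sketch below, the confusion counts are hypothetical: the paper does not report confusion matrices, and the counts were chosen only because, after rounding, they happen to reproduce the test-set row of WSI-GNB in Table 2. The AUC is computed with the rank-based definition, i.e., the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.

```python
def classification_metrics(tp, tn, fp, fn):
    """SEN, SPE, PRE, F1 and ACC as defined in Eqs. (1)-(5)."""
    return {
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "PRE": tp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
    }


def auc(scores, labels):
    """Fraction of positive-negative pairs ranked correctly (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts for 37 positives and 37 negatives.
metrics = classification_metrics(tp=23, tn=24, fp=13, fn=14)
print({name: round(value, 3) for name, value in metrics.items()})
print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```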
3 Results and Discussion
3.1 Predicting Radiosensitivity with WSI Information
Considering the important patient information provided by WSI, we applied a CNN and K-means to extract WSI features of BRCA patients. We used the K-means method to group the patches of the training set into 3 categories and trained 3 corresponding FEN models (FEN1, FEN2, FEN3), respectively. The FEN1 model, which achieved the best prediction accuracy of 0.691, was chosen as the WSI feature extraction model. As described in the Methods section, we obtained a 7-dimensional WSI vector and put the vector into GNB, LR and MLP to construct three base models (i.e., WSI-GNB, WSI-LR, and WSI-MLP). The performance of these models is shown in Table 2. On the training set, WSI-GNB achieves an SPE of 0.553, PRE of 0.564, ACC of 0.564, and AUC of 0.605, and outperforms the other base models overall. Meanwhile, WSI-MLP has the best SEN value (0.980) but the worst SPE value (0.050). This result shows that most negative samples have been predicted as positives, and that the MLP algorithm is unsuitable for predicting radiosensitivity with WSI features. On the test set, similar to the results on the training set, WSI-GNB has the best performance on SPE, PRE, ACC, and AUC, while WSI-MLP achieves the best performance on SEN and F1. Overall, the base model WSI-GNB is superior to the other models on both the training and test sets.
Table 2. The performance of base models combining WSI features with different algorithms on the training and test sets.
Dataset   Model     SEN    SPE    PRE    F1     ACC    AUC
Training  WSI-GNB   0.574  0.553  0.564  0.568  0.564  0.605
Training  WSI-LR    0.583  0.505  0.539  0.560  0.545  0.545
Training  WSI-MLP   0.980  0.050  0.508  0.669  0.515  0.515
Test      WSI-GNB   0.622  0.649  0.639  0.630  0.635  0.626
Test      WSI-LR    0.622  0.405  0.511  0.561  0.514  0.530
Test      WSI-MLP   0.946  0.135  0.522  0.673  0.541  0.554
Note: the maximum value of each evaluation metric on the training and test sets is bolded
3.2 Predicting Radiosensitivity with miRNA Information
As stated in previous studies, miRNAs correlate closely with the radiation responses of BRCA patients [18], and the dysregulation of miRNAs can affect the radiation response [16, 17]. Based on differential expression analysis and the feature selection method ANOVA, a 7-dimensional miRNA vector was obtained and then put into different algorithms to construct three base models (miRNA-GNB, miRNA-LR and miRNA-MLP). The performance of these models is shown in Table 3.
On the training set, miRNA-LR performs the best among these models on SEN = 0.662, F1 = 0.676, ACC = 0.688 and AUC = 0.747. In addition, miRNA-GNB has the best performance on SPE = 0.801 and PRE = 0.726. On the test set, miRNA-LR is superior to the other models on the two global measurements ACC = 0.595 and AUC = 0.682. Meanwhile, miRNA-GNB achieves the best performance on SPE, PRE and ACC, and miRNA-MLP has the best performance on SEN and F1. As AUC is the most important evaluation metric for classification tasks in machine learning, we chose miRNA-LR, which has the highest AUC on both the training set and the test set, for further analysis and for the construction of the final model.
Table 3. The performance of base models combining miRNA features with different algorithms on the training and test sets.
Dataset   Model       SEN    SPE    PRE    F1     ACC    AUC
Training  miRNA-GNB   0.483  0.801  0.726  0.572  0.644  0.707
Training  miRNA-LR    0.662  0.712  0.704  0.676  0.688  0.747
Training  miRNA-MLP   0.612  0.672  0.656  0.627  0.643  0.728
Test      miRNA-GNB   0.378  0.811  0.667  0.483  0.595  0.666
Test      miRNA-LR    0.514  0.676  0.613  0.559  0.595  0.682
Test      miRNA-MLP   0.622  0.432  0.523  0.568  0.527  0.658
Note: the maximum value of each evaluation metric on the training and test sets is bolded
3.3 Predicting Radiosensitivity Based on Integration of WSI and miRNA Information
In consideration of the data quantity and feature descriptors provided by the WSI and miRNA modalities, we combined the two modalities to extract more feature information and improve the performance of the radiosensitivity prediction method. According to the aforementioned comparisons, WSI-GNB and miRNA-LR are the best-performing models in their respective feature modalities. To verify the effectiveness of the ensemble method integrating WSI and miRNA features, the calibrated output probabilities of WSI-GNB and miRNA-LR were combined into a 2-dimensional probability vector. Then, this vector was put into LR to train the final ensemble model EnWM. Figure 2 compares the performance of the ensemble method with that of WSI-GNB and miRNA-LR. On the training set, the AUC increases by 1.6%, which shows the high performance of our ensemble method (Fig. 2A). In Fig. 2B, the result suggests that, compared with any single modality, the integration of WSI and miRNA features improves the performance considerably on the test set: the AUC value of EnWM increases by 8.6% and 3% compared with WSI-GNB and miRNA-LR, which use only WSI or miRNA data, respectively. Besides, the performance of EnWM, WSI-GNB, and miRNA-LR on the measurements of SEN, SPE, PRE, F1, and ACC is shown in Fig. 3. We find that the overall
Fig. 2. ROC curves of different models on training (A) and test sets (B).
performance of the model EnWM integrating WSI and miRNA features is better than that of the models using a single modality. This suggests that the integration of WSI and miRNA modalities provides more feature information and can improve the prediction performance.
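The final stacking step, in which LR is trained on the 2-dimensional vector of calibrated probabilities, can be sketched without any library as batch gradient descent on the log-loss. The data below are invented toy values, and the function names, learning rate, and number of epochs are assumptions; the study's actual LR settings are not reported.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(model, x):
    """Probability of the positive class for one 2-dimensional input."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy inputs: (calibrated WSI probability, calibrated miRNA probability).
stacked = [(0.2, 0.1), (0.3, 0.4), (0.4, 0.2), (0.6, 0.7), (0.7, 0.9), (0.8, 0.6)]
outcome = [0, 0, 0, 1, 1, 1]
model = fit_logistic(stacked, outcome)
print(predict_proba(model, (0.9, 0.9)), predict_proba(model, (0.1, 0.1)))
```

Because the meta-classifier sees only two inputs, its learned weights can be read as the relative trust placed in each modality.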
Fig. 3. The performance of different models on the test set.
4 Conclusion
In this study, we present an ensemble-learning-based framework, EnWM, that efficiently integrates WSI with miRNA features and predicts radiosensitivity more accurately for BRCA patients. The results show that EnWM outperforms the models that use only WSI or only miRNA data. Furthermore, EnWM shows a significant advantage in integrating
features from two different modalities and achieves good performance on radiosensitivity prediction for BRCA. Although EnWM performs well, there remains considerable room for improvement in further study. Firstly, accurate annotation by pathologists of the normal and cancerous subregions in WSI is important for extracting exact information about BRCA patients. Secondly, more BRCA samples with RT and higher-quality data are crucial for EnWM to achieve higher performance.
Funding. This work was supported by the National Natural Science Foundation of China (11835014, 62072003, and U19A2064) and the Education Department of Anhui Province (KJ2020A0047).
Bio-ATT-CNN: A Novel Method for Identification of Glioblastoma
Jinling Lai1, Zhen Shen2, and Lin Yuan1(B)
1 School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, Shandong, China [email protected]
2 School of Computer and Software, Nanyang Institute of Technology, Changjiang Road 80, Nanyang 473004, Henan, China
Abstract. Over the years, many learning methods with remarkable results have been proposed across many fields. Much attention has been paid to the attention-based convolutional neural network (ATT-CNN), which has achieved success in image processing, computer vision and natural language processing. However, insufficient interpretability remains a significant obstacle to the application of deep neural networks, especially in predicting disease outcomes, and ATT-CNN cannot be applied directly to this task. We therefore propose a new method, named Bio-ATT-CNN, which distinguishes long-term survival (LTS) from non-LTS, using glioblastoma multiforme (GBM) as the prediction task. Traditional models cannot be applied directly to biological data, whereas this model is well suited to biological data and can identify essential biological associations of the disease.
Keywords: ATT-CNN · GBM
1 Introduction
Many traditional machine learning methods offer the ability to model non-linear relationships [1]. These methods can learn explanatory representations, especially when processing complicated structures [2]. The attention-based convolutional neural network (ATT-CNN) has achieved great success in image processing and recognition, handling grid-structured inputs such as images and efficiently capturing local dependencies [3]. In recent years, some studies have applied such methods to bioinformatics problems, such as protein–RNA binding and other biological tasks, including predicting the sequences of RNA-binding proteins [4]. Conventional neural networks, rather than ATT-CNN, have been widely used to solve biological problems because biological array data are typically not in grid form. Recently, many papers have applied conventional neural networks to biological data to predict survival after cancer treatment. However, few papers have applied ATT-CNN to biological data [5].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 767–776, 2022. https://doi.org/10.1007/978-3-031-13829-4_69
The lack of interpretability is a major reason why few papers use ATT-CNN in medicine, yet biological interpretation is essential for understanding the mechanisms of complex diseases. We adopt Gradient-weighted Class Activation Mapping (Grad-CAM) [6, 7] to increase the interpretability of ATT-CNN [8]. In this paper, we propose a novel method [9]: an interpretable ATT-CNN model built on biological data [10] for medical applications. As model input, we use images of biological pathways, which are embedded in a low-dimensional space. With this representation, the experimental performance is better than that of other models. The method is used to predict long-term survival in patients with glioblastoma multiforme (GBM) [11], one of the most malignant diseases [12], and it helps make the prediction understandable. We expect the ATT-CNN model to outperform other methods and, combined with Grad-CAM, to be interpretable by making the identification of important biological pathways visible.
2 Preliminaries
2.1 Data
We use several types of data from GBM [13]: mRNA expression, copy number variation (CNV) and DNA methylation, denoted by $G, C, M \in \mathbb{R}^{n \times r}$, respectively, where $n$ and $r$ are the numbers of samples and genes. The data were obtained from the cBioPortal database. Long-term survival (LTS) is defined as survival > 2 years and non-LTS as survival ≤ 2 years.
2.2 Pathway
In this paper, the gene-level data are converted to pathway-level data. To do so, we extract pathway information and the associated genes from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. After removing disease-related pathways, 146 pathways remain. For a given pathway $p_i$, the mRNA expression data of its genes are extracted from the mRNA expression matrix $G$, yielding an intermediate matrix $B \in \mathbb{R}^{n \times r_i}$, where $r_i$ is the number of genes in pathway $p_i$; the rows of $B$ are samples and the columns are the genes of the pathway. Principal component analysis (PCA) decomposes $B$ into uncorrelated components, giving $G_{p_i} \in \mathbb{R}^{n \times q}$, where $q$ is the number of principal components. The same procedure is applied to the CNV and DNA methylation matrices, giving $C_{p_i}, M_{p_i} \in \mathbb{R}^{n \times q}$. Repeating this process for all 146 pathways gives $G_p, C_p, M_p \in \mathbb{R}^{n \times 146q}$ [14]. Finally, we rearrange these matrices per sample into a combined matrix $K_{s_j} \in \mathbb{R}^{146 \times 3q}$ for each sample $s_j$, which we call the pathway image and which is the input to the ATT-CNN model. In this paper, we use q = 1–5 principal components. For example, for q = 2 a sample is represented by a 146 × 6 matrix: the first and second columns are mRNA expression, the third and fourth are CNV, and the fifth and sixth are DNA methylation, each pair holding the first two principal components of that omics type [15–17].
2.3 Comparison with Methods
In this paper, the performance of the model is compared with several methods. Four methods are used as baselines for ATT-CNN [18], including support vector machines (SVM), fully connected neural networks and the recently proposed Multi-omics Integrative Net (MiNet) [19]. All four use the same experimental settings as ATT-CNN [20]. Polynomial and sigmoid kernels are used for the SVM. The fully connected neural network has layers of 10 k, 7.5 k, 5 k, 2.5 k and 500 nodes [21, 22], and the dropout rate is 80% [23]. The MiNet model is computed with the concordance index [24–26].
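The pathway-image construction of Sect. 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy matrices, the three example pathways (the paper uses 146) and the helper names `top_pc_scores` and `pathway_images` are hypothetical, and PCA is approximated with power iteration and deflation so that the sketch needs only the standard library.

```python
import random


def center_columns(B):
    """Subtract each column mean so that PCA works on centered data."""
    n, r = len(B), len(B[0])
    means = [sum(row[j] for row in B) / n for j in range(r)]
    return [[row[j] - means[j] for j in range(r)] for row in B]


def top_pc_scores(B, q, iters=300):
    """Scores of the n samples on the first q principal components (n x q).

    Uses power iteration with deflation on the scatter matrix X^T X.
    """
    X = center_columns(B)
    n, r = len(X), len(X[0])
    scatter = [[sum(X[s][a] * X[s][b] for s in range(n)) for b in range(r)]
               for a in range(r)]
    rng = random.Random(0)
    scores = [[0.0] * q for _ in range(n)]
    for k in range(q):
        v = [rng.random() + 0.1 for _ in range(r)]
        for _ in range(iters):
            w = [sum(scatter[a][b] * v[b] for b in range(r)) for a in range(r)]
            norm = sum(x * x for x in w) ** 0.5
            if norm < 1e-12:  # no variance left: this component is all zeros
                v = [0.0] * r
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * scatter[a][b] * v[b]
                         for a in range(r) for b in range(r))
        for s in range(n):
            scores[s][k] = sum(X[s][a] * v[a] for a in range(r))
        # Deflate so that the next pass finds the next component.
        scatter = [[scatter[a][b] - eigenvalue * v[a] * v[b] for b in range(r)]
                   for a in range(r)]
    return scores


def pathway_images(omics, pathways, q):
    """One image per sample: a row per pathway, q columns per omics type."""
    n = len(omics[0])
    images = [[[] for _ in pathways] for _ in range(n)]
    for p, genes in enumerate(pathways):
        for matrix in omics:  # column order: mRNA expression, CNV, DNA methylation
            sub = [[row[g] for g in genes] for row in matrix]
            scores = top_pc_scores(sub, q)
            for s in range(n):
                images[s][p].extend(scores[s])
    return images


# Toy data (hypothetical): 6 samples, 5 genes, 3 pathways given as gene indices.
rng = random.Random(1)
G, C, M = ([[rng.gauss(0, 1) for _ in range(5)] for _ in range(6)] for _ in range(3))
pathways = [[0, 1, 2], [1, 3], [2, 3, 4]]
images = pathway_images([G, C, M], pathways, q=2)
print(len(images), len(images[0]), len(images[0][0]))
```

With 146 pathways and q = 2, each image in this layout would have 146 rows and 6 columns, matching the 146 × 6 example in the text.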
Fig. 1. ATT-CNN model
2.4 ATT-CNN Architecture
The network uses convolutional layers with 32 and 64 filters of size 3 × 3, followed by 4 × 2 max pooling and dropout with a rate of 50% [27]. There are two attention modules, whose outputs are combined and fed into a dilated convolution with 32 filters of size 2 × 2. The output of the dropout layer is finally flattened into a vector and passed to a fully connected layer with 64 nodes, followed by a softmax layer. To compute channel attention, we aggregate the spatial information of a feature map using both average-pooling and max-pooling. This generates two different spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$, which are fed to a shared network that produces the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. The shared network is a multi-layer perceptron (MLP) with one hidden layer, whose activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$ [28, 29], where $r$ is the reduction ratio, which reduces the parameter overhead. Finally, the two outputs are merged by element-wise summation. SGD is used as the optimizer with a learning rate of 0.001 and a decay of 1e-3 [30]. Fig. 1 illustrates the ATT-CNN model.
2.5 Grad-CAM
Grad-CAM is used to identify the key pixels associated with LTS in GBM patients [31, 32]. It is computed from the gradient of the score $y^c$ for each class $c$ with respect to the feature maps $A^k$ [33]:

$$w_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$

where $k = 1, \dots, K$ indexes the $K$ feature maps and $Z$ is the number of pixels in a feature map. The class activation map is then

$$L^c = \mathrm{ReLU}\left( \sum_k w_k^c A^k \right)$$

that is, $L^c$ is the ReLU of the weighted sum of the feature maps. ReLU is used to highlight the pixels that contribute positively to the class score.
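The two Grad-CAM steps, averaging the gradients over pixels to obtain one weight per feature map and then applying ReLU to the weighted sum of the maps, can be written out directly. The sketch below assumes the gradients of the class score with respect to each feature map are already available (in a real model they would come from backpropagation); the function name `grad_cam` and the toy arrays are illustrative only.

```python
def grad_cam(feature_maps, gradients):
    """Return (weights, heatmap) for one class.

    feature_maps[k][i][j] is A^k_ij; gradients[k][i][j] is d y^c / d A^k_ij.
    """
    K = len(feature_maps)
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    Z = H * W  # number of pixels in one feature map
    # w_k^c: gradient averaged over all pixels of feature map k.
    weights = [sum(sum(row) for row in gradients[k]) / Z for k in range(K)]
    # L^c: ReLU of the weighted sum of the feature maps, pixel by pixel.
    heatmap = [
        [max(0.0, sum(weights[k] * feature_maps[k][i][j] for k in range(K)))
         for j in range(W)]
        for i in range(H)
    ]
    return weights, heatmap


# Toy example with K = 2 feature maps of size 2 x 2 (values are made up).
A = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[4.0, 3.0], [2.0, 1.0]],
]
dY_dA = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[-0.25, -0.25], [-0.25, -0.25]],
]
weights, heatmap = grad_cam(A, dY_dA)
print(weights)  # [0.5, -0.25]
print(heatmap)  # [[0.0, 0.25], [1.0, 1.75]]
```

The top-left pixel has a negative weighted sum (0.5 · 1 − 0.25 · 4 = −0.5), so ReLU sets it to zero; only pixels that push the class score up remain visible in the map.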
3 Experimental Results
3.1 Data
Three types of TCGA multi-omics data from GBM, computed at the gene level, are used in this paper: mRNA expression, CNV and DNA methylation. The mRNA expression data cover 12042 genes in 528 cases, the CNV data cover 24776 genes in 577 cases, and the DNA methylation data cover 11807 genes. DNA methylation was measured with two arrays, used for 285 and 155 cases, respectively; after removing 5 duplicate cases, 435 DNA methylation samples remain. There are 8037 genes common to the three omics types. After removing patients who were still alive at a last follow-up of ≤ 2 years, the LTS and non-LTS groups have 55 and 232 cases. Class weights are applied in the model according to the ratio of the numbers of samples in the two groups. Overall, 4989 unique genes appear in the 146 KEGG pathways, with 68 genes per pathway on average. PCA is applied separately to each omics type to convert gene-level information into pathway-level information. If genes belonging to a given pathway are not available, PCA is performed without them. The numbers of missing genes per pathway are 13 for mRNA expression, 2 for CNV and 20 for DNA methylation. The average age is 48 years in the LTS group and 61 years in the non-LTS group. Because this age difference between the LTS and non-LTS groups is significant, age is included in the ATT-CNN model. GBM has a worse prognosis, while it is also frequently seen in younger patients [16, 34]; this distinction is important.
3.2 Comparison with Benchmark Methods
In this paper, ATT-CNN is compared with the four benchmark methods described above, with predictive performance as the primary criterion. In addition, we downloaded data for other cancers, namely kidney cancer and low grade glioma (LGG), which also include RNAseq gene expression, CNV and DNA methylation and are provided by TCGA. Because these cancers are less malignant than GBM [14], LTS is redefined as survival > 3 years. After removing individuals who were alive with a follow-up of < 3 years, the LTS and non-LTS groups have 154 and 69 cases for kidney cancer and 156 and 75 cases for LGG [18]. Class weights are again applied according to the ratio of the numbers of samples, and age is also included in the model for LGG. With the 3-year threshold, GBM has 23 LTS cases and 256 non-LTS cases. In this experiment, the average AUC over 5-fold cross-validation is reported. For the other cancers, the CNN performs better than the other methods, although the SVM with an RBF kernel also performs better than the others on LGG [35], as shown in Fig. 2.
Fig. 2. Predictive performance of different methods
3.3 Model Performance
The ATT-CNN model is used to distinguish the LTS and non-LTS groups in GBM from pathway images [36], evaluated in a 5-fold cross-validation scheme. Each image has 146 rows, one per pathway, and 3 × q columns, where q is the number of PCs [8]. For example, for q = 2 the columns are the first two PCs of each omics type in the order mRNA expression, CNV and DNA methylation. Pathway images of different sizes are assessed with q = 1 through 5 [4, 8]. In this experiment, pathway images combining mRNA expression, CNV and DNA methylation with two PCs are tested [21, 17]. Different column orders give different results: the order CNV, DNA methylation, mRNA expression achieves an AUC of 0.738; the order DNA methylation, CNV, mRNA expression achieves 0.744; and the order mRNA expression, DNA methylation, CNV achieves 0.755. We attribute this to the relationships between the different data types [37].
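The cohort definition and class weighting described in Sect. 3.1 can be sketched as follows. The labelling rule follows the text (LTS is survival above the threshold; patients still alive with follow-up at or below the threshold are excluded). The inverse-frequency form of the class weights is an assumption: the paper states only that the weights follow the ratio of the group sizes. The function names and the four example patients are hypothetical.

```python
def label_lts(survival_years, deceased, threshold=2.0):
    """Return 1 for LTS, 0 for non-LTS, or None if the patient is excluded."""
    if survival_years > threshold:
        return 1
    if deceased:
        return 0
    # Alive with follow-up <= threshold: the outcome is unknown, so exclude.
    return None


def class_weights(counts):
    """Inverse-frequency weights (assumed form): total / (n_classes * n_c)."""
    total = sum(counts.values())
    return {c: total / (len(counts) * n) for c, n in counts.items()}


# Hypothetical patients as (follow-up or survival time in years, deceased flag).
patients = [(3.1, True), (0.8, True), (1.5, False), (2.4, False)]
labels = [label_lts(t, d) for t, d in patients]
kept = [y for y in labels if y is not None]

# Group sizes reported for GBM: 55 LTS and 232 non-LTS cases.
weights = class_weights({1: 55, 0: 232})
print(labels, kept, weights)
```

With these group sizes the minority LTS class receives a weight about 4.2 times that of the non-LTS class (232 / 55), which offsets the imbalance during training.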
Fig. 3. Performance comparison of convolutional neural network models with a combination of two omics types. AUC, area under the curve; CNV, copy number variation; DM, DNA methylation
Fig. 4. Performance comparison of models tested with the top principal components at five image sizes (q = 1 to 5). AUC, area under the curve
To determine which data type is more informative to the model, we test combinations of two omics types generated with q = 2 [38, 25]. The combination of mRNA expression and CNV and the combination of CNV and DNA methylation achieve AUCs of 0.749 and 0.748, respectively, while the combination of mRNA expression and DNA methylation achieves 0.704. With a single data type, ATT-CNN achieves AUCs of 0.699 for mRNA expression, 0.715 for CNV and 0.699 for DNA methylation. Combining data types therefore gives better performance [39, 26]. However, DNA methylation alone shows no marked improvement in predictive power over the other single data types. These results are shown in Fig. 3 and Fig. 4.
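The AUC values above are averages over cross-validation folds. As a reminder of what is being averaged, the sketch below computes AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count one half) and then takes the mean over folds. The scores and labels are made-up values, and the function names are illustrative; this is not the evaluation code used in the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(folds):
    """Average AUC over folds, each given as a (scores, labels) pair."""
    values = [auc(scores, labels) for scores, labels in folds]
    return sum(values) / len(values)


# Two hypothetical folds (a 5-fold scheme would supply five such pairs).
fold_a = ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
fold_b = ([0.7, 0.2], [1, 0])
print(auc(*fold_a), auc(*fold_b), mean_auc([fold_a, fold_b]))  # 0.75 1.0 0.875
```

Because AUC depends only on the ranking of scores, it is unaffected by the class imbalance between the LTS and non-LTS groups, which makes it a reasonable summary for cohorts such as 55 versus 232 cases.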
4 Discussion
Much attention has been paid to GBM, one of the most malignant cancers. Although its annual incidence is less than 10 per 100 000 people, it must still be taken seriously. The model can help to better understand the underlying biological meaning: an insightful analysis of patient data can offer valuable information about the complex biology. In the past few years, ATT-CNN has achieved great success in computer vision [40, 24], but it has some disadvantages when applied to bioinformatics problems. Biological data come in a non-grid format, and when such data are fed to the model, the findings are often difficult to interpret. To use ATT-CNN on biological data, we represent the data in a different way [22, 41], as pathway images that summarize the activity of the associated pathways. This representation effectively strengthens predictive power and adds interpretable information to the model. It can also filter out uninformative patient-specific information. Across the pathways, the number of genes ranges from 10 to 393, while the size of the pathway image is fixed by the 146 pathways. If data are missing, PCA is performed without them, which greatly reduces the burden of handling missing data. In this paper, the model predicts LTS. The average AUC is 0.753 when three omics types are used with the first two PCs of each omics type. With two omics types, the average AUC is 0.749 for mRNA expression and CNV [15, 42] and 0.748 for CNV and DNA methylation, whereas the combination of mRNA expression and DNA methylation gives an average AUC of 0.704. In conclusion, using multiple data types markedly improves performance. For biological interpretation, class activation maps are obtained by applying Grad-CAM, yielding a matrix that indicates the degree of activation [43]. An analysis is then performed to test the difference between the two groups.
5 Conclusion
In this paper, we have presented Bio-ATT-CNN, a new method built on a new biological representation. The model predicts survival for patients diagnosed with GBM or non-primary GBM better than several other methods. In addition, applying Grad-CAM to the pathway images makes it possible to identify the informative parts of the image. In summary, this paper demonstrates the potential of applying ATT-CNN together with Grad-CAM to biological data. The model can uncover complex biological associations of the disease, and it can also classify the LTS and non-LTS groups. A limitation of the model is that identifying individual pathways requires close alignment of the pathways. This could be improved over the next 10 years.
Funding. This work was supported by the National Natural Science Foundation of China (Grant nos. 62002189, 62102200) and by the Natural Science Foundation of Shandong Province, China (Grant no. ZR2020QF038).
References
1. Ampie, L., Woolf, E.C., Dardis, C.: Immunotherapeutic advancements for glioblastoma. Front. Oncol. 5, 12 (2015)
2. Bengio, Y., Goodfellow, I., Courville, A.: Deep Learning, vol. 1. MIT Press, Cambridge (2017)
3. Du, J., et al.: Convolution-based neural attention with applications to sentiment classification. IEEE Access 7, 27983–27992 (2019)
4. Cai, W., Wei, Z.: Remote sensing image classification based on a cross-attention mechanism and graph convolution. IEEE Geosci. Remote Sens. Lett. (2020)
5. Evans, R., et al.: De novo structure prediction with deep learning based scoring. Annu. Rev. Biochem. 77(363–382), 6 (2018)
6. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
7. Jiang, H., et al.: A multi-label deep learning model with interpretable Grad-CAM for diabetic retinopathy classification. In: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE (2020)
8. Yun, S., et al.: Graph transformer networks. Adv. Neural Inf. Process. Syst. 32 (2019)
9. Kamilaris, A., Prenafeta-Boldú, F.X.: Deep learning in agriculture: a survey. Comput. Electron. Agric. 147, 70–90 (2018)
10. Liu, J., et al.: Deep adversarial graph attention convolution network for text-based person search. In: Proceedings of the 27th ACM International Conference on Multimedia (2019)
11. Pusey, C.D., et al.: Plasma exchange in focal necrotizing glomerulonephritis without anti-GBM antibodies. Kidney Int. 40(4), 757–763 (1991)
12. Terzopoulos, D., Vasilescu, M.: Sampling and reconstruction with adaptive meshes. In: CVPR (1991)
13. Holland, E.C.: Glioblastoma multiforme: the terminator. Proc. Natl. Acad. Sci. 97(12), 6242–6244 (2000)
14. Wang, L., et al.: Graph attention convolution for point cloud semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)
15. Yuan, L., et al.: Module based differential coexpression analysis method for type 2 diabetes. BioMed. Res. Int. 2015 (2015)
16. Yuan, L., et al.: Nonconvex penalty based low-rank representation and sparse regression for eQTL mapping. IEEE/ACM Trans. Comput. Biol. Bioinf. 14(5), 1154–1164 (2016)
17. Yuan, L., Yuan, C.A., Huang, D.S.: FAACOSE: a fast adaptive ant colony optimization algorithm for detecting SNP epistasis. Complexity 2017 (2017)
18. Wu, J.: Introduction to convolutional neural networks. National Key Lab for Novel Software Technology, Nanjing University, China, 5(23), p. 495 (2017)
19. Lomonaco, V., et al.: CVPR 2020 continual learning in computer vision competition: approaches, results, current challenges and future directions. Artif. Intell. 303, 103635 (2022)
20. Vedaldi, A., Lenc, K.: Matconvnet: convolutional neural networks for matlab. In: Proceedings of the 23rd ACM International Conference on Multimedia (2015)
21. Li, Z., et al.: A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. (2021)
22. Yuan, L., et al.: Integration of multi-omics data for gene regulatory network inference and application to breast cancer. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(3), 782–791 (2018)
23. Learning, D.: Deep learning. High-dimensional fuzzy clustering (2020)
24. Yuan, L., Huang, D.-S.: A network-guided association mapping approach from DNA methylation to disease. Sci. Rep. 9(1), 1–16 (2019)
25. Yuan, L., et al.: A novel computational framework to predict disease-related copy number variations by integrating multiple data sources. Front. Genet. 12 (2021)
26. Yuan, L., et al.: A machine learning framework that integrates multi-omics data predicts cancer-related LncRNAs. BMC Bioinf. 22(1), 1–18 (2021)
27. Hellmark, T., Segelmark, M.: Diagnosis and classification of Goodpasture's disease (anti-GBM). J. Autoimmun. 48, 108–112 (2014)
28. Selvaraju, R.R., et al.: Grad-CAM: why did you say that? arXiv preprint arXiv:1611.07450 (2016)
29. Zhang, Y., et al.: Grad-CAM helps interpret the deep learning models trained to classify multiple sclerosis types using clinical brain magnetic resonance imaging. J. Neurosci. Methods 353, 109098 (2021)
30. Golestaneh, S.A., Karam, L.J.: Spatially-varying blur detection based on multiscale fused and sorted transform coefficients of gradient magnitudes. In: CVPR (2017)
31. Chen, L., et al.: Adapting Grad-CAM for embedding networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2020)
32. Choi, J., Choi, J., Rhee, W.: Interpreting neural ranking models using grad-cam. arXiv preprint arXiv:2005.05768 (2020)
33. Selvaraju, R.R., et al.: Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
34. Wu, Z., et al.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32(1), 4–24 (2020)
35. Joo, H.-T., Kim, K.-J.: Visualization of deep reinforcement learning using grad-CAM: how AI plays atari games? In: 2019 IEEE Conference on Games (CoG). IEEE (2019)
36. Zheng, H., et al.: Learning multi-attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
37. Yu, A.W., et al.: Qanet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541 (2018)
38. Ohgaki, H., Kleihues, P.: Genetic pathways to primary and secondary glioblastoma. Am. J. Pathol. 170(5), 1445–1453 (2007)
39. Chen, Y., et al.: Dynamic convolution: attention over convolution kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
40. Hirschman, I.I., Widder, D.V.: The convolution transform. Courier Corporation (2012)
41. Bruna, J., Mallat, S.: Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1872–1886 (2013)
42. Chen, L.-C., et al.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
43. Huang, C.-Z.A., et al.: Counterpoint by convolution. arXiv preprint arXiv:1903.07227 (2019)
STE-COVIDNet: A Multi-channel Model with Attention Mechanism for Time Series Prediction of COVID-19 Infection Hongjian He, Xinwei Lu, Dingkai Huang, and Jiang Xie(B) School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China [email protected]
Abstract. The outbreak of COVID-19 has had a significant impact on the world. Predicting COVID-19 infection can guide the distribution of medical supplies and help prevent further transmission. However, the spread of COVID-19 is affected by various factors, so the prediction results of previous studies are of limited use in practice. In this paper, a multi-channel deep learning model that combines multiple factors, including space, time, and environment (STE-COVIDNet), is proposed to predict COVID-19 infection accurately. The attention mechanism is applied to score each environmental factor, reflecting its impact on the spread of COVID-19, and to obtain environmental features. Experiments on data from 48 states in the United States show that STE-COVIDNet outperforms other advanced prediction models. In addition, we analyze the attention weights of the environmental factors of the 48 states obtained by STE-COVIDNet. We find that the same environmental factors have inconsistent effects on COVID-19 transmission in different regions and at different times, which explains why researchers obtain varying results when studying the impact of environmental factors on the transmission of COVID-19 using data from different areas. STE-COVIDNet offers a degree of explainability and can adapt to environmental changes, which ultimately improves predictive performance.
Keywords: Prediction of COVID-19 infection · Deep learning · Multi-channels · Attention mechanism · Environmental impact · STE-COVIDNet
1 Introduction
Since its first appearance in December 2019, COVID-19 has swept across the world and affected the environment [24], the economy [12], education [7], and more. Many scientists have tried to predict the trend of COVID-19 in response to this sudden epidemic. Short-term forecasting, which covers days or weeks, helps allocate medical supplies over a short period [13].
This work was supported by the National Nature Science Foundation of China under grant 61873156. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 777–792, 2022. https://doi.org/10.1007/978-3-031-13829-4_70
Data-driven machine learning algorithms have played an essential role in the short-term prediction of COVID-19 infection. Traditional models such as Cubist Regression (CUBIST) [20], Random Forest (RF) [3], Ridge Regression (RIDGE) [14], Support Vector Regression (SVR) [9], and stacking-ensemble [31] have been successfully applied in this area [22]. In particular, the AutoRegressive Integrated Moving Average model (ARIMA) was used in 147 countries [12]. However, such classical machine learning methods must be analyzed individually to select the optimal parameters for each dataset from a specific country or region, which makes them lack adaptability. Recently, deep learning algorithms have been adopted to solve the adaptability problem and improve the performance of time series forecasting. Long Short-Term Memory (LSTM) demonstrated the transferability of deep learning models on COVID-19 prediction tasks: it was trained on datasets consisting of positive case data from Italy and the United States and then validated on Germany and France [11]. The LSTM model and its variants have also been applied in China [25], India [21], and other regions [1, 2]. However, the above models only considered infection within an individual region while ignoring the spread between regions. In the last two years, researchers have begun to consider the impact of time and space on the spread of the epidemic. Using intrastate and interstate population movement data and U.S. COVID-19 case data, Google researchers constructed a spatiotemporal graph neural network model with a Multilayer Perceptron (MLP) and a Graph Convolutional Network (GCN) to predict the number of COVID-19 cases in the coming day [16]. On a larger scale, the STAN [10] model used a Graph Attention Network (GAT) and a gated recurrent unit (GRU) to make more accurate predictions about the number of active cases up to 20 days into the future. 
IT-GCN [33] implemented a more complex spatial COVID-19 forecasting model. It built a spatial relationship map by introducing ARIMA into GCN to measure the degree of infection between cities.

Although these models performed well initially, their predictive performance often declined over time. The reason is that environmental information was rarely used to train the early models, so they had difficulty adapting to environmental changes and were limited in practice [15]. In particular, the advent of vaccines has affected the spread of COVID-19 [30], making the environment for virus transmission different from before. With the attention mechanism, deep learning models can identify the factors most relevant to the prediction task, so it is possible to build an interpretable COVID-19 prediction model that adapts to environmental changes.

In this work, we propose STE-COVIDNet, a time series prediction model that fuses spatial, temporal, and environmental information to predict COVID-19 infection. STE-COVIDNet considers the effects of multiple environmental factors on the spread of COVID-19, including weather, in-state population mobility, and vaccination. All of the factors are fused through an attention mechanism, so that the impact of each environmental factor on the spread of COVID-19 can be assessed and the interpretability of the model improved. Comparative experiments on the U.S. dataset demonstrate that STE-COVIDNet outperforms other advanced models. By analyzing the attention weights of the environmental factors, we found that in January 2021 the areas affected by rainfall were concentrated in the central and western United States, while the east coast was affected by the maximum temperature. In October 2021, most areas were affected by rainfall. In terms of travel factors, staying at home has been a critical factor in the spread of COVID-19. For vaccination, the higher the vaccination rate, the more significant the impact of the vaccine on the spread of COVID-19. In general, the influence of the same environmental factor on the spread of COVID-19 is inconsistent across regions and time, which is consistent with existing studies.

This paper is organized as follows. The details of STE-COVIDNet are introduced in Sect. 2, and the datasets are described in Sect. 3. The experiments and analysis of STE-COVIDNet on these datasets are presented in Sect. 4. Finally, Sect. 5 concludes the paper.
2 Materials and Methods
Fig. 1. The schematic flow of STE-COVIDNet.
STE-COVIDNet includes three feature channels and a fusion prediction module (shown in Fig. 1). The three feature channels are responsible for spatial, temporal, and environmental factors, respectively, extracting features and analyzing the spread of COVID-19 from different perspectives. The spatial feature channel uses GAT to extract features of the spatial spread of COVID-19 from our constructed spatial relationship graph. The temporal feature channel uses LSTM to extract temporal features that reflect the temporal regularity of the spread from past COVID-19 infections. The environmental feature channel selects the critical environmental factors in the spread of COVID-19 through the attention mechanism and obtains environmental features by fusing the COVID-19 infection situation with environmental information. The fusion prediction module uses the temporal, spatial, and environmental features to achieve multi-channel COVID-19 prediction.
The model uses the past t days to predict infection in the next l days, where t and l are both set to 7 (see Sect. 3.3). The incident rate was chosen as the prediction target. Because the population size of each region is different, cumulative cases are not of the same order of magnitude, which makes it difficult to compare the severity of outbreaks. The incident rate expresses the number of cases per 100,000 people, thus eliminating the population differences between states.

2.1 Spatial Feature Channel

As the number of cases increases in one area, so does the number of cases in neighboring areas as the population moves. The spatial feature channel captures the spatial features affecting the spread of COVID-19. Here, a spatial relationship graph is designed to describe the relationship between regions on a geographic map, with each node representing a region and an edge between two nodes indicating that the two regions are geographically adjacent. The case value of region i on past day j is denoted $x_j^i$, and the corresponding node feature $X^i = \mathrm{mean}(x_1^i, x_2^i, \ldots, x_t^i)$ is the average infection status of the region over the past t days. Stacking the features of all nodes yields the feature matrix X of the spatial relationship graph.

In the spatial feature channel, we apply GAT [29] to obtain spatial features from the feature matrix X and the adjacency matrix A of the spatial relationship graph. GAT learns the correlation between two nodes through an attention mechanism. The learned attention reflects the aggregation weight between two nodes and represents the intensity of COVID-19 transmission between the two areas. The spatial feature of region i is denoted $S_h^i$ and calculated as follows:

$$S_h^i = \mathrm{GAT}(X, A)_i \qquad (1)$$
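As a minimal sketch of how the inputs to Eq. (1) might be assembled, the Python below converts case counts to incident rates per 100,000 people, averages them over the past t days to form the node features, and builds a symmetric adjacency matrix from a list of bordering regions. The three-region toy data, the function names, and the optional self-loops are assumptions made for illustration and are not taken from the paper; the GAT layer itself is omitted.

```python
from statistics import mean

def incident_rate(cases, population):
    """Number of cases per 100,000 people."""
    return cases * 100_000 / population

def node_features(series_by_region):
    """Node feature X^i = mean(x_1^i, ..., x_t^i) for every region i."""
    return {region: mean(series) for region, series in series_by_region.items()}

def adjacency_matrix(regions, borders, self_loops=True):
    """Symmetric 0/1 matrix A with A[a][b] = 1 when regions a and b are adjacent."""
    index = {region: k for k, region in enumerate(regions)}
    n = len(regions)
    A = [[0] * n for _ in range(n)]
    for a, b in borders:
        A[index[a]][index[b]] = 1
        A[index[b]][index[a]] = 1
    if self_loops:  # assumption: self-loops are a common GAT convention
        for k in range(n):
            A[k][k] = 1
    return A

# Hypothetical toy data: three regions, t = 3 days of case counts.
population = {"A": 1_000_000, "B": 500_000, "C": 2_000_000}
cases = {"A": [100, 200, 300], "B": [50, 50, 50], "C": [0, 200, 400]}
rates = {r: [incident_rate(c, population[r]) for c in cases[r]] for r in cases}

X = node_features(rates)
A = adjacency_matrix(["A", "B", "C"], borders=[("A", "B"), ("B", "C")])
```

In this toy example region B borders both A and C, so its row of A is fully populated, while A and C are connected only through B.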
2.2 Temporal Feature Channel

The temporal feature is the most basic feature in time series forecasting and reflects how the data change over a period of time. As one of the most popular models for extracting temporal features, LSTM has been widely used to predict COVID-19 infection [11, 25]. Following previous research, our temporal feature channel consists of an LSTM. The LSTM takes the infection status of the past t days, $X^i = \{x_1^i, x_2^i, \ldots, x_t^i\}$, as input and outputs $T_h^i$, the hidden-layer output at the last time step. $T_h^i$ is the temporal feature and encodes the infection situation of the previous t days in the region. It is calculated as in Eq. (2):

$$T_h^i = \mathrm{LSTM}(x_1^i, x_2^i, \ldots, x_t^i)_t \qquad (2)$$
2.3 Environmental Feature Channel

In addition to spatial and temporal factors, the spread of COVID-19 is also influenced by environmental factors such as weather [17], intrastate population movement [26], and vaccination [30]. Models that ignore environmental factors are difficult to apply in practice [15]. Furthermore, simply concatenating COVID-19 information with multiple kinds of environmental information does not help model performance. Our environmental feature channel is therefore designed separately to extract environmental features and evaluate the impact of each factor on the transmission of COVID-19. The channel consists of attention mechanism modules [28] and attends to n environmental factors using the COVID-19 infection data of the previous t days. The impact of each environmental factor on the spread of COVID-19 is assessed with an attention score.

In general, the attention mechanism uses an element of one set to assign attention weights to another set or to itself. It enables the infection data to focus selectively on environmental factors and to assign high weights to the factors that are critical in predicting COVID-19. In the environmental channel, the COVID-19 infection situation of the past t days, $X^i$, is used to calculate the queries, and the feature matrix $Env^i$, which consists of the various environmental factors, is used to calculate the keys and values. The queries and key-value pairs are then mapped to an output matrix as in Eq. (5). In this way, the potential relationship between environmental factors and the spread of COVID-19 is captured. The environmental feature $E_h^i$ is obtained by adding the infection situation to the attention output matrix. It is computed as follows:

$$X_h^i = X^i W^X \qquad (3)$$

$$Env_h^i = Env^i W^E \qquad (4)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (5)$$

$$E_h^i = X_h^i + \mathrm{Attention}(X_h^i W^Q, Env_h^i W^K, Env_h^i W^V) \qquad (6)$$

where Q, K, and V are the sets of queries, keys, and values, respectively, T denotes the transpose operation, and $d_k$ is the scale of the keys. $\mathrm{softmax}(QK^T/\sqrt{d_k})$ represents the result of the attention weight assignment. $W^X$, $W^E$, $W^Q$, $W^K$, and $W^V$ are learned parameter matrices.

2.4 Fusion Prediction Module

At the end of STE-COVIDNet is the fusion prediction module. The spatial, temporal, and environmental features are fused by concatenation, and the predicted values for the next l days are then output by an MLP. The fusion prediction is computed as follows:

$$p^i_{t+1, t+2, \ldots, t+l} = \mathrm{MLP}(\mathrm{concat}(T_h^i, S_h^i, E_h^i)) \qquad (7)$$

Dropout layers are added to avoid overfitting and to improve the model's generalization ability.
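Returning to the environmental feature channel of Sect. 2.3, Eqs. (3) to (6) can be illustrated with a small pure-Python sketch. It assumes that the infection history is a single row vector of length t and that the environmental matrix has one row per factor (n rows of length t), so that the softmax yields one weight per factor. These shapes, the random untrained weights, and all function names are assumptions for illustration only; they are not the authors' implementation.

```python
import math
import random

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Eq. (5): softmax(Q K^T / sqrt(d_k)) V, also returning the weights."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

def environmental_feature(x, env, W_X, W_E, W_Q, W_K, W_V):
    X_h = matmul([x], W_X)      # Eq. (3), shape 1 x d
    Env_h = matmul(env, W_E)    # Eq. (4), shape n x d
    out, weights = attention(matmul(X_h, W_Q), matmul(Env_h, W_K), matmul(Env_h, W_V))
    E_h = [a + b for a, b in zip(X_h[0], out[0])]  # Eq. (6), residual addition
    return E_h, weights[0]

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
t, n, d = 7, 10, 4                      # days, factors, hidden size (toy values)
x = [random.random() for _ in range(t)]  # hypothetical standardized infection history
env = random_matrix(n, t)                # hypothetical standardized factor histories
W_X, W_E = random_matrix(t, d), random_matrix(t, d)
W_Q, W_K, W_V = random_matrix(d, d), random_matrix(d, d), random_matrix(d, d)

E_h, factor_weights = environmental_feature(x, env, W_X, W_E, W_Q, W_K, W_V)
```

Under these assumptions `factor_weights` holds one non-negative weight per environmental factor and sums to 1, which is the quantity that an interpretability analysis would inspect.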
To relate the three kinds of features to the daily change of COVID-19 and to better accomplish the prediction task, we adopted first-order difference prediction. The model does not directly predict the infection over the next l days, but rather the daily increase in infection over the next l days compared with the previous day. The infection situation of the following l days is then calculated by cumulatively summing the actual infection situation $x_t^i$ on day t and the predicted values:

$$y^i_{t+k} = x^i_t + \sum_{j=1}^{k} p^i_{t+j} \quad (1 \le k \le l) \qquad (8)$$

Finally, the Mean Square Error (MSE) is applied as the objective function to train STE-COVIDNet:

$$\mathrm{MSELoss} = \frac{1}{l} \sum_{k=1}^{l} \left(x^i_{t+k} - y^i_{t+k}\right)^2 \qquad (9)$$
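The first-order difference scheme of Eqs. (8) and (9) can be sketched in a few lines of Python. The numbers below are made up for illustration, and the function names are hypothetical.

```python
from itertools import accumulate

def restore_from_differences(x_t, increments):
    """Eq. (8): y_{t+k} = x_t + sum of the first k predicted increments."""
    return [x_t + partial_sum for partial_sum in accumulate(increments)]

def mse_loss(actual, predicted):
    """Eq. (9): mean squared error over the l forecast days."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

x_t = 100.0                # hypothetical incident rate on day t
p = [2.0, 3.0, -1.0]       # hypothetical predicted daily increments (l = 3)
y = restore_from_differences(x_t, p)
loss = mse_loss([101.0, 105.0, 106.0], y)
```

Because each forecast is anchored to the last observed value, an error in one predicted increment carries over to all later days of the same window.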
3 Datasets

3.1 COVID-19 Cases Dataset

The COVID-19 infection dataset came from the Novel Coronavirus Visualization Dashboard data repository 2019, which is operated by the Johns Hopkins Center for Systems Science and Engineering [8]. This dataset describes global COVID-19 infections. Its records include cumulative COVID-19 cases, cumulative recoveries, cumulative deaths, and so on. It has been updated daily since the onset of COVID-19. The COVID-19 situation in each U.S. state has been recorded explicitly in this dataset since April 12, 2020. In this paper, the data of 48 states (excluding Hawaii and Alaska, which lack spatial relationships) from May 1, 2020 to October 31, 2021 were selected and named the COVID-19 Dataset.

3.2 Environmental Information Datasets

In addition to COVID-19 infection data, three kinds of environmental data are considered in our model: weather, in-state population mobility, and vaccination.

The weather data came from the Global Historical Climatology Network (GHCN)-Daily database [19]. In 2011, this database became the official database for all U.S. daily climate data. Rainfall, average temperature, maximum temperature, and minimum temperature were used as environmental factors. The average over all climate stations in a state was taken as the state's weather value. The data of the 48 states from May 1, 2020 to October 31, 2021 were collected as the Weather Dataset.
The in-state population mobility data were collected from Google Maps. Google researchers also used these data in the COVID-19 forecasting task [16]. The dataset records the rate of change for different in-state mobility destinations relative to a baseline period from January 3 to February 6, 2020. Since February 15, 2020, rates of change have been recorded for six mobility destinations: retail and recreation, grocery and pharmacy, parks, transit stations, workplaces, and residences. All six types of mobility were considered as in-state population mobility factors. The data from May 1, 2020 to October 31, 2021 were used in this study and called the In-state Population Mobility Dataset.

The vaccination data came from the Vaccination Information Dataset [23]. Starting January 12, 2021, COVID-19 vaccination was implemented nationwide in the United States. The dataset records daily vaccinations at multiple locations. The data on the number of fully vaccinated people per 100 in the 48 states from January 29, 2021 to October 31, 2021 were collected and named the Vaccination Dataset.

To sum up, the environmental factors that we finally considered are shown in Table 1.

Table 1. The environmental factors considered.

| Dataset | Name | Description |
|---|---|---|
| Weather Dataset | PRCP | Precipitation |
| | TAVG | Average Temperature |
| | TMAX | Maximum Temperature |
| | TMIN | Minimum Temperature |
| In-state Population Mobility Dataset | RaR | Retail and Recreation percent change from baseline |
| | GaP | Grocery and Pharmacy percent change from baseline |
| | P | Parks percent change from baseline |
| | T | Transit stations percent change from baseline |
| | W | Workplaces percent change from baseline |
| | R | Residences percent change from baseline |
| Vaccination Dataset | V | Number of fully vaccinated people per 100 |
3.3 Data Preprocessing

The datasets were divided into two collections based on vaccination, as shown in Table 2. The first is Dataset_2020, which contains the COVID-19 Dataset, the Weather Dataset, and the In-state Population Mobility Dataset from May 1, 2020 to January 31, 2021. There was no vaccine during the training period of this collection, and vaccination in the validation and testing periods was ignored because only a few days of vaccination data were available and the vaccination rate was only 1.7%. The other collection is Dataset_2021, which consists of the COVID-19 Dataset, the Weather Dataset, the In-state Population Mobility Dataset, and the Vaccination Dataset from January 29, 2021 to October 31, 2021.

Following common practice, both t and l are set to 7, so a 14-day sliding window with a stride of 1 is used for grouping (as shown in Fig. 2). The split into training, validation, and test sets is shown in Table 2. With this partitioning, the outputs on the test set of Dataset_2020 cover January 2021, and the outputs on the test set of Dataset_2021 cover October 2021. This allowed us to analyze specifically the impact of environmental factors on the spread of COVID-19 in January 2021 and October 2021.

Table 2. Description of the datasets and their division by time interval.

| Datasets | Set | Temporal coverage | Number of days (groups) | Environmental datasets (dimensions) |
|---|---|---|---|---|
| Dataset_2020 | Train | 2020.5.1 – 2020.11.29 | 213 (200) | Weather Dataset (4); In-state Population Mobility Dataset (6) |
| | Val | 2020.11.30 – 2020.12.24 | 25 (12) | |
| | Test | 2020.12.25 – 2021.1.31 (output: 2021.1.1 – 2021.1.31) | 38 (25) | |
| Dataset_2021 | Train | 2021.1.29 – 2021.8.29 | 213 (200) | Weather Dataset (4); In-state Population Mobility Dataset (6); Vaccination Dataset (1) |
| | Val | 2021.8.30 – 2021.9.23 | 25 (12) | |
| | Test | 2021.9.24 – 2021.10.31 (output: 2021.10.1 – 2021.10.31) | 38 (25) | |
Fig. 2. Schematic diagram of the sliding window.
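The grouping described above can be checked with a short Python sketch: a 14-day window (t = 7 input days followed by l = 7 target days) slides over each split with a stride of 1. The function name and the use of day indices instead of real data are assumptions for illustration.

```python
def sliding_windows(series, t=7, l=7, stride=1):
    """Split a daily series into (past t days, next l days) pairs."""
    width = t + l
    return [
        (series[start:start + t], series[start + t:start + width])
        for start in range(0, len(series) - width + 1, stride)
    ]

# Day indices stand in for real data; the lengths follow Table 2.
train = sliding_windows(list(range(213)))
val = sliding_windows(list(range(25)))
test = sliding_windows(list(range(38)))
group_counts = (len(train), len(val), len(test))
```

A split of N days yields N - 14 + 1 groups, which reproduces the 200, 12, and 25 groups listed in Table 2. In the 38-day test split the targets of the first window start on day index 7 and those of the last window end on day index 37, so the outputs span the final 31 days of the split.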
In addition, the Z-score is applied to standardize the data:

$$z = \frac{x - \mu}{\sigma} \qquad (10)$$

The mean μ and standard deviation σ of the training set are used to normalize the entire dataset, and the inverse Z-score operation then restores the output values of the model.
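A minimal sketch of this standardization step follows. The statistics are fitted on the training values only and reused for the other splits and for the inverse transform. The class name, the toy values, and the choice of the population standard deviation are assumptions; the paper does not say which estimator was used.

```python
from statistics import mean, pstdev

class ZScore:
    """Standardize with training-set statistics, Eq. (10), and invert later."""

    def fit(self, train_values):
        self.mu = mean(train_values)
        self.sigma = pstdev(train_values)  # assumption: population standard deviation
        return self

    def transform(self, values):
        return [(v - self.mu) / self.sigma for v in values]

    def inverse(self, z_values):
        return [z * self.sigma + self.mu for z in z_values]

train_values = [10.0, 20.0, 30.0]   # hypothetical training incident rates
test_values = [40.0, 25.0]          # hypothetical test incident rates

scaler = ZScore().fit(train_values)
z_train = scaler.transform(train_values)
z_test = scaler.transform(test_values)
restored = scaler.inverse(z_test)
```

Fitting on the training set alone keeps information from the validation and test periods out of the normalization constants.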
4 Experiments

4.1 Experimental Setup

Our model is implemented with PyTorch. It consists of one LSTM layer, two GAT layers, and one attention layer. Each module has 64-dimensional hidden and output layers. The 64-dimensional features obtained from each channel are concatenated and fed into the MLP to complete the prediction task. AdamW is used as the optimizer, with a learning rate of 5e−3 and a weight decay coefficient of 5e−4. The model is trained for 1000 steps with a batch size of 100 and a dropout rate of 0.3.

The following metrics are used to evaluate our model: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and the average Concordance Correlation Coefficient (CCC). The experiments are repeated five times, and the average value is taken as the final result.

4.2 Comparison with Existing Models

Since the outbreak of COVID-19, there have been many efforts to predict infection. Here, STE-COVIDNet is compared with advanced models on Dataset_2020 and Dataset_2021, respectively. The models for comparison are as follows:

1. LSTM, BiLSTM, and GRU [25]: Deep learning models for temporal data, where BiLSTM and GRU are variants of LSTM. Their number of hidden layers was set to 3.
2. LSTM_Env, BiLSTM_Env, and GRU_Env: Environmental information was added to the original LSTM, BiLSTM, and GRU models. Their parameters are the same as those of LSTM, BiLSTM, and GRU.
3. STAN [10]: It captures the spatial information of the areas through GAT, uses GRU to extract the temporal information, and predicts the number of active cases in the next few days in combination with the SIR model. The number of GRU layers was set to 1 and the number of GAT layers to 2.
In addition, a baseline named Previous Days assumes that the infection growth over the next l days is 0, so the infection rate on each of the next l days equals the value on day t. This baseline helps us understand how many valid features the models are able to extract from the provided dataset. As shown in Table 3 and Table 4, on Dataset_2020 STE-COVIDNet improved the MAE, MAPE, RMSE, and CCC metrics by 5.50, 0.1101%, 6.55, and 0.0009, respectively, compared with the second-best model (GRU). On Dataset_2021, STE-COVIDNet also had the best MAE and MAPE.
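The Previous Days baseline and the four evaluation metrics can be sketched as follows. The CCC shown is the standard concordance correlation coefficient computed with population moments; how the paper averages CCC across samples is not specified, so that step is left out. All values and function names are hypothetical.

```python
import math

def previous_days_baseline(x_t, l=7):
    """Zero-growth forecast: repeat the value of day t for the next l days."""
    return [x_t] * l

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def ccc(actual, pred):
    """Concordance correlation coefficient with population variance and covariance."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(pred) / n
    var_a = sum((a - mean_a) ** 2 for a in actual) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, pred)) / n
    return 2 * cov / (var_a + var_p + (mean_a - mean_p) ** 2)

actual = [102.0, 104.0, 106.0, 108.0]       # hypothetical incident rates
baseline = previous_days_baseline(100.0, l=4)

baseline_mae = mae(actual, baseline)
baseline_mape = mape(actual, baseline)
baseline_rmse = rmse(actual, baseline)
perfect_ccc = ccc(actual, actual)
```

Within a single window the constant baseline has zero variance, so its CCC on that window alone is 0; a meaningful CCC needs predictions pooled over many windows or regions.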
Table 3. The performance of various methods for predicting incident rates in Dataset_2020.

| Model | MAE | MAPE (%) | RMSE | CCC |
|---|---|---|---|---|
| Previous Days | 228.15 ± 0.00 | 3.3210 ± 0.0000 | 262.19 ± 0.00 | 0.8534 ± 0.0000 |
| LSTM | 63.16 ± 2.21 | 0.9222 ± 0.0250 | 82.41 ± 2.58 | 0.9652 ± 0.0017 |
| BiLSTM | 68.35 ± 2.32 | 1.0050 ± 0.0338 | 87.89 ± 2.67 | 0.9618 ± 0.0038 |
| GRU | 59.78 ± 1.52 | 0.8809 ± 0.0272 | 77.87 ± 1.96 | 0.9658 ± 0.0054 |
| LSTM_Env | 85.99 ± 0.74 | 1.2243 ± 0.0097 | 108.47 ± 0.77 | 0.9486 ± 0.0025 |
| BiLSTM_Env | 87.16 ± 1.85 | 1.2467 ± 0.0200 | 109.61 ± 1.91 | 0.9445 ± 0.0037 |
| GRU_Env | 77.16 ± 1.08 | 1.1104 ± 0.0102 | 97.92 ± 0.91 | 0.9541 ± 0.0041 |
| STE-COVIDNet | 54.28 ± 1.46 | 0.7708 ± 0.0123 | 71.32 ± 1.85 | 0.9667 ± 0.0053 |
It is worth noting that LSTM_Env, BiLSTM_Env, and GRU_Env perform worse than LSTM, BiLSTM, and GRU, respectively, which means that simply adding environmental information does not allow a model to make full use of environmental factors. In STE-COVIDNet, the environmental features are extracted by the attention mechanism, and the model outperforms the other models. This shows that the attention mechanism is an effective way to improve model performance by fully considering the environmental features.

Unlike the above models, STAN aims to predict the number of active cases rather than the infection rate. Therefore, additional experiments were conducted to compare STE-COVIDNet with STAN on the prediction of the number of active cases. These experiments were performed only on Dataset_2020 and not on Dataset_2021, because the COVID-19 dataset stopped recording active case numbers on March 7, 2021.

Table 4. The performance of various methods for predicting incident rates in Dataset_2021.

| Model | MAE | MAPE (%) | RMSE | CCC |
|---|---|---|---|---|
| Previous Days | 122.88 ± 0.00 | 0.9221 ± 0.0000 | 142.55 ± 0.00 | 0.8503 ± 0.0000 |
| LSTM | 41.93 ± 1.19 | 0.3029 ± 0.0073 | 51.07 ± 1.32 | 0.8946 ± 0.0089 |
| BiLSTM | 42.74 ± 0.97 | 0.3146 ± 0.0068 | 52.69 ± 1.28 | 0.8954 ± 0.0017 |
| GRU | 39.29 ± 0.81 | 0.2902 ± 0.0048 | 48.76 ± 0.96 | 0.8797 ± 0.0104 |
| LSTM_Env | 44.58 ± 1.22 | 0.3292 ± 0.0088 | 55.46 ± 1.26 | 0.8825 ± 0.0115 |
| BiLSTM_Env | 44.76 ± 0.92 | 0.3248 ± 0.0074 | 56.18 ± 1.21 | 0.8773 ± 0.0166 |
| GRU_Env | 46.24 ± 0.85 | 0.3324 ± 0.0061 | 56.24 ± 0.91 | 0.8780 ± 0.0191 |
| STE-COVIDNet | 37.93 ± 0.56 | 0.2787 ± 0.0037 | 50.44 ± 1.94 | 0.8840 ± 0.0094 |
As shown in Table 5, although STE-COVIDNet had a lower CCC than STAN, it performed significantly better than STAN in MAE, MAPE, and RMSE, with lower errors on all three. In general, STE-COVIDNet still outperformed STAN in predicting the number of active cases. This experiment also shows that STE-COVIDNet can be applied to different time series prediction tasks.

Table 5. The performance of STAN and STE-COVIDNet for predicting active cases.

| Model | MAE | MAPE (%) | RMSE | CCC |
|---|---|---|---|---|
| STAN | 12983 ± 2560 | 9.3526 ± 0.6278 | 23637 ± 4557 | 0.5555 ± 0.0195 |
| STE-COVIDNet | 6777 ± 138 | 8.9115 ± 0.2315 | 13190 ± 530 | 0.4516 ± 0.0311 |
4.3 Ablation Experiments

The ablation experiments were conducted on Dataset_2020 to verify the effectiveness of multi-channel fusion. Dataset_2020 was chosen because it is essentially unaffected by vaccines, the epidemic developed rapidly, and the number of cases changed greatly during this period, which makes it easier to evaluate the predictive performance of the model. As shown in Table 6, among the three separate channels, the temporal feature channel is the most effective, followed by the environmental feature channel, and the spatial feature channel is the least effective. Adding the environmental channel improves on the corresponding single-channel approaches. The best MAE, MAPE, and RMSE are achieved when all three channels are fused. This suggests that environmental factors can improve COVID-19 prediction performance and that a fusion of multiple factors can be more effective in predicting COVID-19 infection.

Table 6. Comparison of prediction performance of various channel combinations. T: temporal feature channel; S: spatial feature channel; E: environmental feature channel.

| Channels | MAE | MAPE (%) | RMSE | CCC |
|---|---|---|---|---|
| S | 67.24 ± 2.80 | 0.9422 ± 0.0337 | 86.86 ± 3.12 | 0.9589 ± 0.0040 |
| T | 54.80 ± 0.39 | 0.7932 ± 0.0067 | 72.61 ± 0.48 | 0.9675 ± 0.0027 |
| E | 60.41 ± 1.92 | 0.8554 ± 0.0265 | 79.04 ± 2.15 | 0.9696 ± 0.0023 |
| S+T | 57.51 ± 1.12 | 0.8241 ± 0.0098 | 75.22 ± 1.29 | 0.9653 ± 0.0025 |
| S+E | 56.40 ± 1.38 | 0.8050 ± 0.0143 | 73.78 ± 1.51 | 0.9671 ± 0.0027 |
| T+E | 54.58 ± 1.15 | 0.7797 ± 0.0080 | 72.10 ± 1.17 | 0.9699 ± 0.0027 |
| S+T+E | 54.28 ± 1.46 | 0.7708 ± 0.0123 | 71.32 ± 1.85 | 0.9667 ± 0.0053 |
4.4 Explainability Analysis

To obtain an interpretable model, the environmental factors that influence the spread of COVID-19 should be identified, and the attention weights of these factors in the environmental channel should be studied. Our model outputs the attention weights of the environmental channel on the test dataset.

As shown in Fig. 3(a), the effect of each environmental factor on the transmission of COVID-19 was inconsistent across states. Overall, however, the number of people staying at home had the biggest impact in January 2021. Home isolation is also recognized as the most effective way to prevent the transmission of COVID-19 [32]. In addition, some areas were also affected by rainfall and the maximum temperature. As shown in Fig. 4(a), the east coast was most affected by the maximum temperature, while the most pronounced effects of rainfall were concentrated in the central and western United States. Comparing Fig. 4(a) with Fig. 4(b), it can be seen that, except for South Carolina, the distribution of the areas affected by the maximum temperature is almost the same as that of the high-temperature zones on the east coast. This finding agrees with a study that found that a rise in maximum temperature reduces the incidence rate of COVID-19 [27]. Additionally, comparing Fig. 4(a) with Fig. 4(c), the areas affected by rainfall in Fig. 4(a) are concentrated in the western and central United States, which happen to be the areas with the heaviest or least rainfall in Fig. 4(c). This is consistent with previous studies stating that insufficient or excessive rainfall can affect the spread of COVID-19 [5, 6].

The same experiments were also performed on Dataset_2021, which includes vaccination data among the environmental factors, as shown in Fig. 3(b). Comparing Fig. 3(b) with Fig. 3(a), the spread of COVID-19 during October 2021 was influenced by the environment differently than in January 2021.

Firstly, in terms of climate, the impact of rainfall increased significantly compared with January 2021, while the impact of the maximum temperature decreased. One study suggests that rain affects the spread of COVID-19 because it forces people to stay at home [18]. Meanwhile, researchers have also suggested that the optimal temperature for COVID-19 to spread in the United States is between 3 and 17 degrees Celsius [4]. Notably, the average temperature in the U.S. during this period was higher than in January 2021, and the average maximum temperature was above 20 degrees Celsius. The average number of rainy days increased from 14 to 22. These observations explain why rainfall became a more significant factor than temperature in some areas during this period. Secondly, among the travel factors, staying at home remained an important factor affecting the transmission of COVID-19. Finally, comparing Fig. 4(d) with Fig. 4(e), regions that were highly affected by vaccines tend to have high vaccination rates, and vice versa (circled regions in Fig. 4(d, e)).
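One way to read such attention weights, sketched here under stated assumptions, is to take the highest-weighted factor for each state and the mean weight of each factor across states. The factor abbreviations follow Table 1 for Dataset_2020, but the two placeholder states and all weight values are made up for illustration; real values would come from the trained model's environmental channel.

```python
FACTORS = ["PRCP", "TAVG", "TMAX", "TMIN", "RaR", "GaP", "P", "T", "W", "R"]

def dominant_factor(weights, factors=FACTORS):
    """Name of the factor with the largest attention weight."""
    best = max(range(len(weights)), key=lambda k: weights[k])
    return factors[best]

def mean_weight_per_factor(weights_by_state, factors=FACTORS):
    """Average attention weight of each factor over all states."""
    n_states = len(weights_by_state)
    return {
        name: sum(w[k] for w in weights_by_state.values()) / n_states
        for k, name in enumerate(factors)
    }

# Made-up weights for two placeholder states (each row sums to 1).
weights_by_state = {
    "state_1": [0.05, 0.05, 0.20, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.40],
    "state_2": [0.35, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.25],
}

dominant = {state: dominant_factor(w) for state, w in weights_by_state.items()}
average = mean_weight_per_factor(weights_by_state)
overall_top = max(average, key=average.get)
```

In this toy example the per-state leaders differ (residences for one state, precipitation for the other), while residences has the highest average weight, which mirrors the kind of state-level and overall reading applied to Fig. 3.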
Fig. 3. The weights of environmental factors on the Dataset_2020 and Dataset_2021 of 48 states. The vertical axis represents 48 states, and the horizontal axis represents environmental factors (see Table 1 for details).
Comparing Fig. 3 with Fig. 4, the attention weights are dynamic and vary with time and region. This helps explain why different results are obtained in different areas and at different times, and why the same environmental factor has different degrees of influence on the transmission of COVID-19 [17].

Through experiments on Dataset_2020 and Dataset_2021, we identified the key environmental factors affecting the spread of COVID-19 and obtained results consistent with existing studies, indicating that STE-COVIDNet is reliable. In addition, the weights of these key factors change over time and across regions, indicating that STE-COVIDNet is able to adapt to environmental changes and is an interpretable model. Our analyses show that although weather affects the transmission of COVID-19, its effect is still small compared with that of home isolation. Vaccination plays an important role in the spread of COVID-19 but requires high vaccination rates. So, home isolation remains the best way to contain the spread of COVID-19.
Fig. 4. Geographic distribution of attention score and related factors.
5 Conclusion

STE-COVIDNet combines temporal, spatial, and environmental factors to predict the spread of COVID-19. In particular, it can adjust the weight of each factor to reflect changes in the environment and thereby improve predictive performance. Firstly, feature extraction is performed through three channels for the time series prediction of COVID-19 cases. Secondly, integrating environmental information through an attention mechanism increases the explainability of the model and enables accurate predictions of both the infection rate and the number of cases. Finally, the attention mechanism is used to analyze data such as climate, vaccines, and intrastate population mobility to determine which factors significantly affect the spread of COVID-19 in the current period.

The experiments demonstrated that STE-COVIDNet has better predictive performance than the existing models. Moreover, we found that the influence of environmental factors on the spread of COVID-19 varies over time and across regions. This phenomenon explains why previous studies on environmental impacts have yielded inconsistent results.

STE-COVIDNet performed well in predicting both incident rates and active cases, indicating its potential in time series prediction. Therefore, STE-COVIDNet can be used in analysis tasks for Omicron and other infectious diseases, provided that sufficient data are available. In the future, STE-COVIDNet can be improved by using an attention mechanism and an orthogonal loss to measure the difference between the three features and to provide solutions for tasks in which a region lacks a certain feature.
References

1. Alassafi, M.O., Jarrah, M., Alotaibi, R.: Time series predicting of COVID-19 based on deep learning. Neurocomputing 468, 335–344 (2022)
2. ArunKumar, K., Kalaga, D.V., Kumar, C.M.S., Kawaji, M., Brenza, T.M.: Forecasting of COVID-19 using deep layer recurrent neural networks (RNNs) with gated recurrent units (GRUs) and long short-term memory (LSTM) cells. Chaos Solit. Fractals 146, 110861 (2021)
3. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
4. Bukhari, Q., Massaro, J.M., D'Agostino, R.B., Khan, S.: Effects of weather on coronavirus pandemic. Int. J. Environ. Res. Public Health 17(15), 5399 (2020)
5. Chan, A.Y., Kim, H., Bell, M.L.: Higher incidence of novel coronavirus (COVID-19) cases in areas with combined sewer systems, heavy precipitation, and high percentages of impervious surfaces. Sci. Total Environ. 820, 153227 (2022)
6. Chien, L.C., Chen, L.W.: Meteorological impacts on the incidence of COVID-19 in the US. Stoch. Environ. Res. Risk Assess. 34(10), 1675–1680 (2020)
7. Cummings, C., Dunkle, J., Koller, J., Lewis, J.B., Mooney, L.: Social work students and COVID-19: impact across life domains. J. Soc. Work Educ. 1–13 (2021)
8. Dong, E., Du, H., Gardner, L.: An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 20(5), 533–534 (2020)
9. Drucker, H., Burges, C., Kaufman, L., Smola, A., Vapnik, V.: Linear support vector regression machines. Adv. Neural Inf. Process. Syst. 9, 155–161 (1996)
10. Gao, J., et al.: STAN: spatio-temporal attention network for pandemic prediction using real-world evidence. J. Am. Med. Inform. Assoc. 28(4), 733–743 (2021)
11. Gautam, Y.: Transfer learning for COVID-19 cases and deaths forecast using LSTM network. ISA Trans. 124, 41–56 (2021)
12. Hasan, M., Mahi, M., Sarker, T., Amin, M., et al.: Spillovers of the COVID-19 pandemic: impact on global economic activity, the stock market, and the energy sector. J. Risk Financ. Manag. 14(5), 200 (2021)
13. Hernandez-Matamoros, A., Fujita, H., Hayashi, T., Perez-Meana, H.: Forecasting of COVID-19 per regions using ARIMA models and polynomial functions. Appl. Soft Comput. 96, 106610 (2020)
14. Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
15. Ioannidis, J.P., Cripps, S., Tanner, M.A.: Forecasting for COVID-19 has failed. Int. J. Forecast. 38(2), 423–438 (2020)
16. Kapoor, A., et al.: Examining COVID-19 forecasting using spatio-temporal graph neural networks. arXiv preprint arXiv:2007.03113 (2020)
17. McClymont, H., Hu, W.: Weather variability and COVID-19 transmission: a review of recent research. Int. J. Environ. Res. Public Health 18(2), 396 (2021)
18. Menebo, M.M.: Temperature and precipitation associate with COVID-19 new daily cases: a correlation study between weather and COVID-19 pandemic in Oslo, Norway. Sci. Total Environ. 737, 139659 (2020)
19. Menne, M.J., Durre, I., Vose, R.S., Gleason, B.E., Houston, T.G.: An overview of the global historical climatology network-daily database. J. Atmos. Ocean. Tech. 29(7), 897–910 (2012)
20. Quinlan, J.R.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 236–243 (1993)
21. Reddy, K.S.S., Reddy, Y.P., Rao, C.M.: Recurrent neural network based prediction of number of COVID-19 cases in India. Materials Today: Proceedings, pp. 1–4 (2020)
22. Ribeiro, M.H.D.M., Da Silva, R.G., Mariani, V.C., Dos Santos Coelho, L.: Short term forecasting COVID-19 cumulative confirmed cases: perspectives for Brazil. Chaos Solit. Fractals 135, 109853 (2020)
23. Ritchie, H., et al.: Coronavirus pandemic (COVID-19). Our World in Data (2020). https://ourworldindata.org/coronavirus
24. SanJuan-Reyes, S., Gómez-Oliván, L.M., Islas-Flores, H.: COVID-19 in the environment. Chemosphere 263, 127973 (2021)
25. Shahid, F., Zameer, A., Muneeb, M.: Predictions for COVID-19 with deep learning models of LSTM, GRU and BI-LSTM. Chaos Solit. Fractals 140, 110212 (2020)
26. Tang, K.H.D.: Movement control as an effective measure against COVID-19 spread in Malaysia: an overview. J. Public Health 30(3), 583–586 (2022)
27. Tobías, A., Molina, T.: Is temperature reducing the transmission of COVID-19? Environ. Res. 186, 109553 (2020)
28. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
29. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
30. Voysey, M., et al.: Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK. Lancet 397(10269), 99–111 (2021)
31. Wolpert, D.H.: Stacked generalization. Neural Netw. 5(2), 241–259 (1992)
32. Yang, W.: Modeling COVID-19 pandemic with hierarchical quarantine and time delay. Dyn. Games Appl. 11(4), 892–914 (2021)
33. Yu, Z., Zheng, X., Yang, Z., Lu, B., Li, X., Fu, M.: Interaction-temporal GCN: a hybrid deep framework for COVID-19 pandemic analysis. IEEE Open J. Eng. Med. Biol. 2, 97–103 (2021)
KDPCnet: A Keypoint-Based CNN for the Classification of Carotid Plaque

Bindong Liu1, Wu Zhang2, and Jiang Xie1(B)

1 School of Computer Engineering and Science, Shanghai University, Shanghai, China
[email protected]
2 School of Mechanics and Engineering Science, Shanghai University, Yanchang Road, Shanghai, China
Abstract. Classification of carotid plaque echogenicity in ultrasound images is an important task for identifying plaques prone to rupture, and thus for early risk estimation of cardiovascular and cerebrovascular events. However, it is difficult for conventional classification methods to locate the plaque area and extract plaque features, because the carotid plaque occupies a very small proportion of the entire ultrasound image and its boundary is fuzzy. In addition, the image usually needs to be resized before being fed to the neural network, resulting in information loss. In this work, a keypoint-based dual-branch carotid plaque echogenicity classification network (KDPCnet) is proposed to solve these problems. Our model consists of two parts. First, a lightweight sub-network is applied to identify the plaque's center point. Then, a dual-branch classification sub-network integrates global information from the entire ultrasound image with local detail information of the plaque, without reducing the resolution of the plaque area or changing its aspect ratio. On a dataset of 1898 carotid plaque ultrasound images from the cooperation hospital, five-fold cross-validation results show that KDPCnet outperforms other advanced classification models and that keypoint localization can effectively assist carotid plaque echogenicity classification.

Keywords: Carotid plaque · Ultrasound · Keypoint localization · Classification
This work was supported by the National Natural Science Foundation of China under Grant 61873156.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 793–806, 2022. https://doi.org/10.1007/978-3-031-13829-4_71

1 Introduction

Cardiovascular disease (CVD) has become one of the leading causes of death worldwide [23]. As an important cause and predictor of CVD, carotid plaque has been studied extensively. Carotid plaque results from the interaction between modified lipids, extracellular matrix, monocyte-derived macrophages, and activated vascular smooth muscle cells that accumulate in the arterial wall [6]. When vulnerable carotid plaques rupture, atherothrombotic emboli consisting of clumps of platelet aggregates or plaque fragments may travel into the brain, occluding smaller arteries and resulting in a transient ischaemic attack
or stroke [4]. Early and accurate identification of high-risk plaques is therefore necessary. The current clinical examination methods for carotid plaque mainly include ultrasound, CT angiography, magnetic resonance angiography, and digital subtraction angiography. Among these, ultrasound is widely used to diagnose carotid plaque due to its low cost, convenience, and safety. Sonographers usually make diagnoses based on the sonographic characteristics of plaque in ultrasound images, which is relatively subjective and depends heavily on the clinical experience of the sonographer. Many computer-aided diagnosis (CAD) systems have been proposed for the objective classification of carotid plaque to make up for this deficiency.

In machine learning methods, researchers typically focus on extracting features that reflect the properties of plaques. For retrieval and classification of carotid plaque ultrasound images, texture, shape, morphology, histogram, and correlogram features have been used, and their combination proved to perform better than any single feature [2]. Four decomposition schemes (the discrete wavelet transform, the stationary wavelet transform, wavelet packets, and the Gabor transform) have been studied to discriminate between symptomatic and asymptomatic cases [22]; wavelet packet analysis produced the highest overall classification performance. The discrete Fréchet distance features (DFDFs) of each plaque have also been shown to be good features for identifying plaque echo types [8].

In deep learning methods, image features are extracted from the data directly, alleviating the burden of designing specific features and classifiers. Researchers have attempted to apply deep learning to carotid plaque classification.
A simple CNN model was built to identify the different plaque compositions (lipid core, fibrous cap, and calcified tissue), achieving a correlation of about 0.90 with the clinical assessment [11]. Ma et al. [15] trained VGG and SVM on three different regions of interest (ROIs) for carotid plaque datasets. Later, the same research team redesigned the spatial pyramid pooling (SPP) structure according to the characteristics of carotid artery plaque and proposed multilevel strip pooling (MSP), which improved the accuracy of echo type classification on plaque ROIs to 92% [14].

Most of the machine learning models and some deep learning models in carotid plaque classification research are trained with ultrasound images that contain only the ROI. This is a consequence of the nature of carotid plaque ultrasound images: the plaque area accounts for a tiny proportion of the entire ultrasound image, about 2%. In contrast, the median scale of object instances relative to the image is 54.4% in the ImageNet dataset [20] (a widely used natural image dataset). Therefore, natural image classification methods that directly process whole images are not well suited to carotid plaque classification in ultrasound. In machine learning, if features are extracted from the entire image to make the final prediction, the plaque features are diluted by the unrelated areas. In deep learning, a too-small ROI is likewise not conducive to plaque localization and feature extraction. The disadvantage of small ROIs becomes more salient because the quality of ultrasound images is degraded by signal attenuation, speckle noise, shadowing and enhancement artifacts, poor contrast, and a low signal-to-noise ratio. Moreover, the original ultrasound image has a high resolution and must be resized before being fed into the network, which loses part of the original information. To solve those problems, some researchers
attempt to segment or detect plaque first, and then use the ROI images generated by the first step for subsequent classification [1, 24, 27]. However, it is difficult to obtain accurate segmentation or detection labels because the boundary of the plaque is very fuzzy in most cases. Inaccurate labels introduce inherent errors into subsequent steps, and these additional annotations are tedious and time-consuming to produce.

In this study, we treat a plaque as a point and propose a keypoint-based dual-branch carotid plaque diagnosis network (KDPCnet) for automatic and accurate carotid plaque echogenicity classification. Compared with segmentation and detection annotations, the keypoint annotation of a plaque (the plaque's center point) is much easier. Annotators are likely to be unsure about the boundary of some plaques, but the center of the plaque can generally be found easily, which reduces label error within the dataset compared with segmentation or detection labels. KDPCnet consists of two parts: (1) a simplified keypoint localization sub-network that detects the center point of the plaque, and (2) an orthogonalization-guided dual-branch classification sub-network that classifies the plaque echogenicity of the carotid ultrasound image. In the first part, we consider the difference between carotid ultrasound images and natural images, and construct a simple but efficient keypoint localization network. To further improve the localization performance, the distribution radius of the keypoint heatmap labels is chosen according to the actual distribution of carotid plaque sizes. In the second part, in addition to the global branch that extracts global features under the guidance of localization information, a local branch is designed to extract high-definition detail from the plaque area of the original ultrasound image, which has not been resized. An orthogonalization module is applied to reduce the redundancy between global and local features.
The experimental results demonstrate that our model accurately classifies plaque echogenicity and has superior performance compared with existing CAD methods. The rest of this article is organized as follows: Sect. 2 introduces the dataset used in this study and illustrates the proposed method. Section 3 describes the experimental setup, performance metrics, and experimental results. The conclusion and discussion are presented in Sect. 4.
2 Materials and Methods

2.1 Dataset

In this study, 1898 carotid plaque ultrasound images of 204 patients were collected from the cooperation hospital. Each image has a resolution of 540 × 740 and contains only one plaque, presented in a longitudinal view of the vessel. Each plaque is labeled as hyperechoic, mixed-echoic, or hypoechoic, corresponding to low, medium, or high clinical risk, respectively. In total, the dataset contains 539 hyperechoic, 597 mixed-echoic, and 762 hypoechoic plaque images. Some examples are shown in Fig. 1. The category labels were carefully assigned by an ultrasound physician with at least 10 years of clinical experience. In addition, the center of each plaque was annotated as a point with the annotation tool LabelMe. The carotid ultrasound images were acquired in B-mode using a general electric MyLab Twice system equipped with a 7–13 MHz transducer.

Fig. 1. Some typical examples of our dataset. (a), (b), and (c) are ultrasound images of a hyperechoic plaque, a mixed-echoic plaque, and a hypoechoic plaque, respectively. In general, hyperechoic plaques are bright, hypoechoic plaques are darker, and mixed-echoic plaques lie between the two. The boundary of a hyperechoic plaque is usually smooth, while those of hypoechoic and mixed-echoic plaques are rougher. However, the characteristics of some plaques are not so obvious, and doctors need to make comprehensive judgments based on professional knowledge.

All the carotid ultrasound
scans were performed between 2020 and 2021. Each volunteer signed informed consent, and this study was approved by the Ethics Committee.

2.2 Method

Because it is difficult to identify small plaques directly from high-resolution carotid ultrasound images, KDPCnet follows the idea of locating first and then classifying, as shown in Fig. 2. KDPCnet treats a plaque as a point, so the plaque localization problem is transformed into the problem of localizing the plaque center, for which a simplified localization sub-network is designed. On the basis of the localization results, an orthogonalization-guided dual-branch classification sub-network is built to extract the global and local features of the plaque for echogenicity classification; within it, an orthogonalization module is applied to reduce the redundancy between global and local features.

Simplified Keypoint Localization Sub-network. As the first part of the whole framework, the localization sub-network plays an important role: its accuracy directly affects the classification performance of the model. It must also be lightweight, because a heavy localization sub-network followed by a classification sub-network would make the whole model slow and hard to train. The hourglass module [16] is chosen as the basis of our localization sub-network, and we adapt it to the nature of carotid ultrasound. The stacked hourglass network [16] was introduced in 2016 for human pose estimation and has since become a popular backbone network [3, 12, 17, 26]. It is built by placing 8 hourglass modules end-to-end, each with four downsampling layers. However, locating a plaque in a carotid ultrasound image differs from locating a person's joints in a complex environment: the plaque always lies within the carotid vessel, and there are far fewer plaques than human joints.
So a simple model is enough to locate plaque in ultrasound images. A single hourglass module with three downsampling layers is used in our localization sub-network, which significantly reduces the total number of parameters compared with the original stacked hourglass model. Furthermore, the input image is downsampled from 256 × 256 to 64 × 64 by convolutional and max pooling layers before being fed to the localization sub-network. This makes the network more efficient and easier to train.
Fig. 2. Schematic representation of the proposed KDPCnet. The whole network consists of two parts: localization and classification. The original ultrasound image is resized and then fed into the localization network. The localization part consists of an hourglass module with three downsampling layers, named simplified keypoint localization sub-network, which produces a heatmap representing the location of plaque center point. In this part, convolutional and max pooling layers are used to process features down to a very low resolution. At max pooling step, the network branches off and applies more convolutions at the original pre-pooled resolution. After reaching the lowest resolution, the network begins the top-down sequence of upsampling and combination of features across scales. The classification part consists of global and local branches, named orthogonalization-guided dual-branch classification sub-network. The plaque localization heatmap and the resized ultrasound image are concatenated together to form the input of the global branch. The red square area around plaque on the original ultrasound image is cropped as the input of the local branch according to the localization heatmap. Orthogonalization module is applied before the classification layer. Conv layers refer to Fig. 3.
A plaque point annotation is converted to a heatmap following CornerNet [10], which gives a label suitable for the localization sub-network. For each plaque, only the center of the plaque is a positive location; all other locations are negative. During training, instead of penalizing all negative locations equally, we reduce the penalty given to negative locations within a radius of the positive location, because as long as the predicted point lies in or near the plaque area it still represents the approximate location of the plaque, which is sufficient to assist plaque echogenicity classification. The amount of penalty reduction is given by an unnormalized 2D Gaussian. Let D_xy be the value of the ground-truth heatmap at location (x, y). The generated heatmap label is formulated as follows:

$$D_{xy} = \begin{cases} \exp\left(-\dfrac{(x-x_0)^2+(y-y_0)^2}{2\,(\mathrm{radius}/3)^2}\right), & (x-x_0)^2+(y-y_0)^2 < \mathrm{radius}^2 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where (x0, y0) is the coordinate of the plaque center.

Orthogonalization-Guided Dual-Branch Classification Sub-network. It is hard to extract features of a small plaque from an entire ultrasound image. Moreover, the differences between plaques of different echo types are usually subtle. Because of limited memory and computing power, the input image is usually reduced in size, which loses detail and hampers plaque classification. A dual-branch CNN sub-network is therefore proposed for carotid plaque classification, as shown in Fig. 2. It contains a global branch guided by localization information and a local branch rich in detailed information about the plaque area. An orthogonalization module is applied to reduce the redundancy between the information of the two branches.
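Returning to the heatmap label of Eq. (1), the following is a minimal sketch in plain Python of how such a label could be generated. The function name `gaussian_heatmap`, the 64 × 64 grid, the center, and the radius used in the example are illustrative assumptions, not values or code from the paper; the standard deviation radius/3 follows the CornerNet convention that Eq. (1) appears to adopt.

```python
import math


def gaussian_heatmap(height, width, center, radius):
    """Build a heatmap label as a list of rows, following Eq. (1).

    center is (x0, y0) in pixel coordinates. The standard deviation is
    radius / 3 (CornerNet convention, assumed here). Locations at a
    distance of `radius` or more from the center are set to 0.
    """
    x0, y0 = center
    sigma = radius / 3.0
    heatmap = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d2 = (x - x0) ** 2 + (y - y0) ** 2
            if d2 < radius ** 2:
                heatmap[y][x] = math.exp(-d2 / (2.0 * sigma ** 2))
    return heatmap


# Hypothetical example: a 64 x 64 label with the plaque center at (20, 30).
hm = gaussian_heatmap(height=64, width=64, center=(20, 30), radius=12)
print(hm[30][20])  # value at the annotated center
```

The label equals 1 at the annotated center, decays smoothly around it, and is exactly 0 outside the chosen radius, which is what lets the radius control how much area around the center the localization network is encouraged to attend to.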
Fig. 3. The structure of a single branch network in the classification sub-network.
Specifically, the plaque localization heatmap generated by the localization sub-network and the resized ultrasound image are concatenated as the input of the global branch. According to the localization heatmap, a plaque patch of size 286 × 286 is cropped from the original image as the input of the local branch. Both branches have a similar structure but separate parameters. Each branch consists of four convolution layers (convolution 1 uses 7 × 7 kernels with a stride of 2, and convolutions 2–4 use 3 × 3 kernels with a stride of 1) and three fully connected layers (with 2048, 512, and 2 units), as shown in Fig. 3. A 3 × 3 pooling layer is used after each convolution. The global and local features are concatenated and then passed through three fully connected layers to generate the final prediction. We add a classification loss to both the global and local branches, which prevents the model from relying on only one branch and encourages fuller feature extraction.

The inputs of both the global and local branches contain the plaque area, so they share a lot of redundant information. If both branches learn similar knowledge, the dual-branch design becomes meaningless and may even decrease model performance. To reduce the redundancy of the information learned by the two branches, we introduce an orthogonalization module inspired by [13]. Denote the flattened feature after pool4 of the global branch as F1 = [α1, α2, ..., αn], and the flattened feature after pool4 of the local branch as F2 = [β1, β2, ..., βn]. The orthogonalization module is implemented by an orthogonal loss (L_orth) that encourages F1 and F2 to be independent:

$$L_{orth} = \cos\theta = \frac{F_1 \cdot F_2}{\lVert F_1 \rVert \, \lVert F_2 \rVert} \tag{2}$$
θ is the angle between the two feature vectors. Since F1 and F2 have passed through the ReLU activation function, all their elements are non-negative, so cos θ ≥ 0; minimizing cos θ is therefore sufficient to drive F1 and F2 toward orthogonality.

Loss Function. The mean squared error loss is applied in the localization sub-network. The localization loss is formulated as follows:

$$L_{localization} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(P_{ij} - G_{ij}\right)^2 \tag{3}$$

where P is the predicted heatmap of size H × W, P_ij is the predicted probability that the point with coordinates (i, j) is the plaque center, and G is the ground-truth heatmap label.

The loss of the classification sub-network (L_classification) consists of the global branch classification loss (L_G), the local branch classification loss (L_L), the fusion classification loss (L_F), and the orthogonal loss (L_orth). We apply the cross-entropy loss for L_G, L_L, and L_F. L_classification is formulated as follows:

$$L_{classification} = w_1 L_G + w_2 L_L + w_3 L_F + w_4 L_{orth} \tag{4}$$

$$L_G = -\sum_{i=0}^{2} y_i \log\left(p_{gi}\right) \tag{5}$$

$$L_L = -\sum_{i=0}^{2} y_i \log\left(p_{li}\right) \tag{6}$$

$$L_F = -\sum_{i=0}^{2} y_i \log\left(p_{fi}\right) \tag{7}$$

where pg = [pg0, pg1, pg2], pl = [pl0, pl1, pl2], and pf = [pf0, pf1, pf2] represent the outputs of the global branch, local branch, and fusion branch, respectively, and y = [y0, y1, y2] is the ground-truth label of plaque echogenicity. w1, w2, w3, and w4 are the adjustable weights of the loss terms. In this study, we set w1 = 0.1, w2 = 1, w3 = 0.1, and w4 = 0.4 by experimental analysis (as shown in Table 4). The total loss of the entire network is formulated as follows:

$$L_{total} = L_{localization} + L_{classification} \tag{8}$$
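To make the weighted classification loss concrete, the sketch below evaluates it for a single sample in plain Python: three cross-entropy terms for the global, local, and fusion outputs plus the cosine-based orthogonal loss on the two flattened feature vectors. The function names, the small `eps` stabilizer, and the toy inputs are assumptions for illustration; an actual implementation would presumably use PyTorch tensors and batched losses.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))


def orthogonal_loss(f1, f2, eps=1e-12):
    """Cosine of the angle between the global and local feature vectors."""
    dot = sum(a * b for a, b in zip(f1, f2))
    norm1 = math.sqrt(sum(a * a for a in f1))
    norm2 = math.sqrt(sum(b * b for b in f2))
    return dot / (norm1 * norm2 + eps)


def classification_loss(y, p_g, p_l, p_f, f1, f2, weights=(0.1, 1.0, 0.1, 0.4)):
    """Weighted sum of the three cross-entropy terms and the orthogonal loss.

    The default weights are the values the paper reports choosing
    (w1 = 0.1, w2 = 1, w3 = 0.1, w4 = 0.4).
    """
    w1, w2, w3, w4 = weights
    return (w1 * cross_entropy(y, p_g)
            + w2 * cross_entropy(y, p_l)
            + w3 * cross_entropy(y, p_f)
            + w4 * orthogonal_loss(f1, f2))


# Toy example (hypothetical values): true class 0, identical branch outputs,
# and two non-negative feature vectors that are already orthogonal.
label = [1, 0, 0]
probs = [0.5, 0.25, 0.25]
loss = classification_loss(label, probs, probs, probs, f1=[1.0, 0.0], f2=[0.0, 1.0])
print(round(loss, 4))
```

Because the features are non-negative after ReLU, the orthogonal term lies between 0 and 1, so it only adds a penalty when the two branches activate on the same feature dimensions.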
3 Experiments and Results

3.1 Implementation Detail

KDPCnet is trained in three steps. In the first step, the classification sub-network weights are frozen and the localization sub-network is trained with a learning rate of 0.001 for 100 epochs. In the second step, the localization sub-network weights are frozen and the classification sub-network is trained with a learning rate of 0.0001 for 50 epochs. In the last step, the whole network is fine-tuned with a learning rate of 0.0001 for 150 epochs. Adam is used as the optimizer with a batch size of 8. The intensity of each pixel is normalized to [0, 1]. All models are implemented in PyTorch 1.7.0 on an NVIDIA GeForce RTX 3090. Five-fold cross-validation is performed in each experiment: 80% of the data is used for training and 20% for testing. We apply stratified sampling over patients for each category so that the data distributions of the training and test sets are similar. Patients in the training set are excluded from the test set to ensure the reliability of the experiments.

3.2 Evaluation Metrics

The Euclidean distance between the predicted point and the ground-truth point is computed to evaluate localization performance. The plaque center point is considered correctly predicted if the distance is less than 30 pixels, and wrongly predicted otherwise. We call the corresponding accuracy Acc30; Acc100 is defined similarly with a threshold of 100 pixels. We use accuracy, precision, recall (sensitivity), and F1-score to evaluate classification performance. They are expressed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{11}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12}$$

where FP, TP, TN, and FN indicate false positives, true positives, true negatives, and false negatives, respectively. Precision denotes the proportion of cases classified as positive that are truly positive. Recall measures the ability to correctly recognize positive cases. The F1-score is the harmonic mean of precision and recall and balances the two in a single score. All of these metrics take values between 0 and 1, where 1 represents the best prediction and indicates that the predicted classification output is identical to the ground truth.

3.3 Result

In this section, the performance of our localization network is evaluated first. We explore the effect of the radius of the heatmap labels on localization performance and compare the localization performance of hourglass networks with different levels of complexity. Then the performance of our classification network is evaluated. Models with different crop sizes of the local branch input are compared. Ablation experiments are performed on each module of the classification network to verify its contribution to our model. Finally, we compare our model with other popular classification models.

Localization Results. The radius of the Gaussian distribution of the heatmap label directly determines how much area the model should focus on around the center of the plaque. A larger radius means that the model should focus on a larger area.
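The localization metrics defined in Sect. 3.2 (mean Euclidean distance, Acc30, and Acc100) can be sketched as follows. This is a minimal illustration in plain Python; the function name `localization_metrics` and the sample points are hypothetical, and the sketch assumes predicted and ground-truth centers are given in the same pixel coordinate system, which the text does not specify further.

```python
import math


def localization_metrics(pred_points, true_points, thresholds=(30, 100)):
    """Return the mean Euclidean distance and the accuracy (in percent)
    at each distance threshold, e.g. Acc30 and Acc100."""
    distances = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred_points, true_points)
    ]
    mean_distance = sum(distances) / len(distances)
    accuracy = {
        thr: 100.0 * sum(d < thr for d in distances) / len(distances)
        for thr in thresholds
    }
    return mean_distance, accuracy


# Hypothetical predictions and annotations (pixel coordinates):
# the three distances are 5, 40, and 150 pixels.
predicted = [(10, 10), (50, 50), (200, 200)]
annotated = [(13, 14), (50, 90), (50, 200)]
mean_dist, acc = localization_metrics(predicted, annotated)
print(mean_dist, acc[30], acc[100])
```

A prediction counts toward Acc30 only if it falls strictly within 30 pixels of the annotated center, so Acc30 reflects fine localization while Acc100 reflects whether the approximate position was found.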
To understand the distribution of plaque length in carotid artery ultrasound images, we measure the length of each plaque in the dataset. The plaque length distribution is shown in Fig. 4. The carotid plaques range in length from 30 pixels to 382 pixels. Let L be the length of the longest plaque (382 pixels). We generate Gaussian heatmap labels with radii of 2L, 7/4L, 6/4L, 5/4L, L, 3/4L, 2/4L, and 1/4L, and evaluate the localization network under each setting, as shown in Table 1. A radius that is too large or too small leads to relatively poor performance. The radius has a limited effect on Acc100 but influences Acc30 considerably. This means that the localization network can find the approximate position of the plaque regardless of the radius, whereas the precision of the localization depends on it. Finally, 191 pixels (1/2L) is chosen as the radius of the heatmap.
Fig. 4. The plaque size histogram of our dataset.
Table 1. Performance of the localization sub-networks with different radii

| Radius (pixels) | Distance (pixels) | Acc30 (%) | Acc100 (%) |
|---|---|---|---|
| 1/4L (96) | 20.89 | 82.55 | 97.62 |
| 1/2L (191) | 20.05 | 86.75 | 97.40 |
| 3/4L (287) | 20.22 | 86.08 | 97.63 |
| L (382) | 21.29 | 82.55 | 97.62 |
| 5/4L (478) | 20.88 | 83.61 | 98.05 |
| 6/4L (573) | 22.07 | 82.15 | 97.52 |
| 7/4L (669) | 22.28 | 82.45 | 98.11 |
| 2L (764) | 22.57 | 80.07 | 97.95 |
Classification Result. The orthogonalization-guided dual-branch classification sub-network consists of local and global branches. The local branch crops a square area around the detected center point on the original ultrasound image as its input. If the cropped area is too large, much irrelevant information is introduced into the network. However, a too-small crop may yield an incomplete ROI, and the plaque may even fall partly outside the cropped area because of localization error. Therefore, it is necessary to select an appropriate side length for the cropped area. Taking the length of the longest plaque (L) as the benchmark, classification performance is studied for cropped areas of size W × W (W = 540, the width of the original image), 5/4L × 5/4L (477), L × L (382), 3/4L × 3/4L (286), 2/4L × 2/4L (191), and 1/4L × 1/4L (95). The experimental results are shown in Table 2. In the end, 286 is chosen as the side length of the ROI area.

Table 2. Impact of crop size

| Crop size | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| 95 × 95 | 84.54 | 84.22 | 83.65 | 83.09 |
| 191 × 191 | 85.04 | 85.12 | 84.42 | 84.31 |
| 286 × 286 | 85.37 | 85.87 | 84.50 | 84.53 |
| 382 × 382 | 83.54 | 83.88 | 82.89 | 82.82 |
| 477 × 477 | 80.64 | 81.46 | 79.38 | 78.84 |
| 540 × 540 | 76.89 | 77.40 | 75.48 | 74.76 |
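The cropping step behind Table 2 can be sketched as follows: take the peak of the localization heatmap, map it back to original-image coordinates, and cut a fixed-size square around it. This is a minimal illustration in plain Python, not the authors' code. The helper names, the 64 × 64 heatmap size, and the policy of shifting the box so that it stays inside the image are assumptions; the source does not say how crops near the border are handled. The image size (540, 740) takes the width of 540 from the text and assumes the remaining dimension is the height.

```python
def heatmap_peak(heatmap):
    """Return (x, y) of the largest value in a heatmap given as a list of rows."""
    best_value, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


def scale_to_image(point, heatmap_size, image_size):
    """Map a heatmap coordinate to original-image coordinates.

    Sizes are (width, height) pairs.
    """
    (x, y), (hm_w, hm_h), (img_w, img_h) = point, heatmap_size, image_size
    return x * img_w / hm_w, y * img_h / hm_h


def crop_box(center, image_size, crop=286):
    """Square box (left, top, right, bottom) of side `crop` around `center`,
    shifted where necessary so that it stays inside the image (assumed policy)."""
    (cx, cy), (img_w, img_h) = center, image_size
    half = crop // 2
    left = min(max(int(round(cx)) - half, 0), img_w - crop)
    top = min(max(int(round(cy)) - half, 0), img_h - crop)
    return left, top, left + crop, top + crop


# Hypothetical example: a 64 x 64 heatmap whose peak is at (x=32, y=16).
heatmap = [[0.0] * 64 for _ in range(64)]
heatmap[16][32] = 1.0
center = scale_to_image(heatmap_peak(heatmap), (64, 64), (540, 740))
box = crop_box(center, (540, 740))
print(center, box)
```

Keeping the crop side fixed means the local branch always receives a patch at the original resolution and aspect ratio, and a crop of 286 pixels leaves some margin for localization error around the plaque center.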
We performed ablation experiments on the components of KDPCnet; the results are shown in Table 3. To justify the dual-branch design of the classification network, models that contain only the global branch or only the local branch are trained separately and compared with KDPCnet. When training the network containing only the global branch, we follow the training steps of the proposed network. When training the network with only the local branch, the localization network is first trained with a learning rate of 0.001 for 100 epochs, and the classification network is then trained with a learning rate of 0.0001 for 200 epochs. The accuracy of the models that contain only the global branch or only the local branch is lower than that of the model that contains both. To verify the effectiveness of the orthogonalization module, we remove it from KDPCnet. Experiments show that the orthogonalization module improves the model accuracy by 2.38%.

Table 3. Ablation study

| Global branch | Local branch | Orthogonalization module | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| √ | √ | √ | 85.37 | 85.87 | 84.50 | 84.53 |
| × | √ | × | 82.69 | 83.05 | 81.84 | 81.74 |
| √ | × | × | 78.50 | 77.94 | 77.55 | 76.73 |
| √ | √ | × | 82.99 | 83.66 | 82.16 | 81.59 |

In order to obtain good model performance, we experimented with different values of the loss weights (w1, w2, w3, and w4), as shown in Table 4. Selecting the optimal weights by exhaustive enumeration would incur a huge experimental cost, so the weights are tuned sequentially. All the loss weights are set to 1 in a benchmark model for comparison. w1 and w2 are determined first, followed by w3, and finally w4. From Table 3, it can be seen that the classification performance of using only the local branch is much better than that of using only the global branch. It can be speculated that the quality of the features obtained by the local branch is higher than that of the global branch, so a large weight is given to the local branch loss (w2) and a small weight to the global branch loss (w1): w1, w2, w3, and w4 are set to 0.1, 1, 1, and
1, respectively. For comparison, we also run an experiment with w1 = 1, w2 = 0.1, w3 = 1, and w4 = 1. The results show that it is reasonable to give the local branch loss a large weight and the global branch loss a small weight. We also explore the weight relationship between the fusion branch and the other branches by setting w1, w2, w3, and w4 to 0.1, 1, 0.1, and 1, respectively. The results show that this setting is effective.

Table 4. Model performance with different loss weights

| w1 | w2 | w3 | w4 | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 82.87 | 82.88 | 81.80 | 81.60 |
| 0.1 | 1 | 1 | 1 | 84.15 | 84.49 | 82.76 | 82.77 |
| 1 | 0.1 | 1 | 1 | 82.86 | 83.29 | 81.87 | 81.78 |
| 0.1 | 1 | 0.1 | 1 | 84.82 | 84.82 | 83.73 | 83.64 |
| 0.1 | 1 | 0.1 | 0.2 | 85.36 | 84.90 | 84.26 | 84.30 |
| 0.1 | 1 | 0.1 | 0.4 | 85.37 | 85.87 | 84.50 | 84.53 |
| 0.1 | 1 | 0.1 | 0.6 | 84.54 | 84.15 | 83.60 | 83.52 |
| 0.1 | 1 | 0.1 | 0.8 | 84.13 | 83.74 | 82.88 | 82.97 |
The inputs of both the global and local branches contain the central part of the plaque, which leads to much redundant information in the global and local features. The orthogonalization module can effectively reduce this redundancy, so we conduct more detailed experiments on the weight of the orthogonal loss. The settings of w1, w2, and w3 are fixed, and w4 is set to 0.2, 0.4, 0.6, and 0.8 in turn. The results show that the performance is best when w4 = 0.4. Finally, the loss weights are set to w1 = 0.1, w2 = 1, w3 = 0.1, and w4 = 0.4.

We also compare the performance of our model with other popular classification methods: AlexNet [9], ResNeXt50 [25], ResNet50 [5], DenseNet169 [7], MobileNet-v2 [19], EfficientNet-b7 [21], and Conformer-S [18]. Among them, Conformer-S is a model that embodies the idea of a global and local branch design. All compared models are pre-trained on ImageNet and fine-tuned on our dataset, whereas our model is trained from scratch on our dataset. As shown in Table 5, the classification accuracy and F1-score of our model are 2.86 and 2.83 percentage points higher than those of the second-best model, respectively.
Table 5. Comparison with other popular classification methods

| Models | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| AlexNet | 67.96 | 68.08 | 67.03 | 66.08 |
| ResNet50 | 78.71 | 78.57 | 77.60 | 77.27 |
| ResNeXt50 | 81.77 | 82.69 | 80.68 | 80.65 |
| DenseNet169 | 82.51 | 82.77 | 81.72 | 81.70 |
| MobileNet-v2 | 80.13 | 80.92 | 79.05 | 78.60 |
| EfficientNet-b7 | 78.28 | 78.53 | 77.19 | 76.79 |
| Conformer-S | 77.59 | 78.04 | 76.60 | 76.28 |
| KDPCnet | 85.37 | 85.87 | 84.50 | 84.53 |
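The accuracy, precision, recall, and F1-score columns reported in the tables above follow Eqs. (9)–(12), which are written for the binary case. For the three echogenicity classes, one plausible reading is that precision, recall, and F1-score are computed per class and then macro-averaged, while accuracy is the fraction of correctly classified images; the source does not state the averaging scheme, so the sketch below is an assumption in that respect. The function name and the toy labels are hypothetical.

```python
def macro_metrics(y_true, y_pred, num_classes=3):
    """Accuracy plus macro-averaged precision, recall, and F1-score.

    Labels are integer class indices in [0, num_classes).
    Macro-averaging is an assumption, not stated in the paper.
    """
    # confusion[t][p] counts samples of true class t predicted as class p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1

    accuracy = sum(confusion[c][c] for c in range(num_classes)) / len(y_true)

    precisions, recalls, f1_scores = [], [], []
    for c in range(num_classes):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        fn = sum(confusion[c]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return (accuracy,
            sum(precisions) / num_classes,
            sum(recalls) / num_classes,
            sum(f1_scores) / num_classes)


# Toy labels (hypothetical): 0 = hyperechoic, 1 = mixed-echoic, 2 = hypoechoic.
truth = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
accuracy, precision, recall, f1 = macro_metrics(truth, preds)
print(accuracy, precision, recall, f1)
```

With macro-averaging, each class contributes equally regardless of how many images it has, which matters here because the three classes are not equally represented in the dataset (539, 597, and 762 images).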
4 Conclusion and Discussion

In this article, a classification model for carotid plaque ultrasound images based on keypoint detection is proposed. Since carotid plaques account for a small proportion of the entire image, we follow the idea of locating first and classifying afterwards. Different from previous methods, a plaque is treated as a point, and the problem of plaque localization is transformed into a keypoint localization problem. In a carotid ultrasound image, the plaque boundary is usually blurred, but the location of the plaque is relatively easy to determine. Compared with segmentation and detection labels, point labels significantly reduce labeling workload and label error. The Gaussian heatmap label is centered on the keypoint and decays gradually toward the surrounding area, which also fits the blurred boundary of carotid plaque well. A detailed study is conducted on the configuration of the localization model, and a lightweight and accurate localization sub-network is designed according to the characteristics of the ultrasound images. The experimental results show that the localization accuracy (Acc30) reaches 86.75%, which indicates that it is feasible to treat the plaque as a point for model design. Based on the localization information, an orthogonalization-guided dual-branch classification sub-network is built and obtains a classification accuracy of 85.37%. This classification sub-network contains global and local branches and adopts an orthogonalization module to reduce the redundancy between global and local features. The dual-branch design is effective for classifying plaque echogenicity for three main reasons. First, the local branch directly extracts features from the plaque area of the original ultrasound image, which avoids the information loss caused by resizing, and the extracted features contain relatively little irrelevant information.
Second, in addition to the plaque area, other areas also contain valid information of plaque echogenicity, for example, a long artifact that often appears below the hyperechoic plaque, and other features that are difficult to find by the human eye. These features can be effectively captured under the supervision of plaque echogenicity labels and the constraints of orthogonalization modules. Finally, when the localization is inaccurate, the global branch provides sufficient image features to complete the echo classification of plaques. The proposed
model can output not only the echo type of the plaque but also the location of the plaque, which enhances the reliability of the model. This study was conducted on a relatively small single-plaque dataset, and it shows the potential of keypoint localization in assisting plaque echogenicity classification. In future work, we will continue to collect data, especially multi-plaque ultrasound images, to improve the performance of the proposed model and to extend the design idea of this study to plaque echogenicity classification in multi-plaque ultrasound images.
Multi-source Data-Based Deep Tensor Factorization for Predicting Disease-Associated miRNA Combinations

Sheng You1(B), Zihan Lai2, and Jiawei Luo2

1 National Supercomputing Centre in Changsha, Hunan University, Changsha 410082, China
[email protected]
2 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China

Abstract. MicroRNAs (miRNAs) play a significant role in the occurrence and development of complex diseases. The regulatory level of multiple miRNAs is stronger than that of a single miRNA. Therefore, using miRNA combinations to treat complex diseases has become a promising strategy, which provides great insights for exploring disease-associated miRNA combinations and comprehending the miRNA synergistic mechanism. However, current research mainly focuses on binary miRNA-disease associations, or merely on synergetic miRNAs for specific diseases, which may lead to an incomplete understanding of the pathogenesis of complex diseases. In this work, we present a novel tensor factorization model, named MAGTF, to predict disease-associated miRNA combinations. MAGTF exploits a graph attention neural network to learn the node features over multi-source similarity networks. Then, a feature aggregation module is applied to capture the heterogeneous features over the miRNA-disease association network. The learned features are spliced to reconstruct the association tensor for predicting disease-associated miRNA combinations. Empirical results showed the powerful predictive ability of the proposed model. An ablation study indicated the contribution of each module in MAGTF. Moreover, case studies further demonstrated the effectiveness of MAGTF in identifying potential disease-associated miRNA combinations.

Keywords: Disease-associated miRNA combinations · Multi-source data · Deep learning · Tensor factorization
1 Introduction

MicroRNAs (miRNAs) are a kind of endogenous non-coding RNAs with about 22–25 nucleotides, which are widely involved in the occurrence and development of human diseases by regulating gene transcription [1]. In recent years, more and more researchers have discovered the role of miRNAs in various stages of cells, such as proliferation, differentiation and apoptosis, which has attracted extensive attention [2–4]. As miRNAs play an important role in the pathogenesis of diseases, it is of great significance to take miRNAs as disease-related biomarkers and therapeutic tools.

The original version of this chapter was revised: the last name of Jiawei Luo was misspelled. This was corrected. The correction to this chapter is available at https://doi.org/10.1007/978-3-031-13829-4_73

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022
D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, pp. 807–821, 2022. https://doi.org/10.1007/978-3-031-13829-4_72
In addition to identifying dysregulated miRNAs during the development of complex diseases, researchers have found that a single miRNA can regulate the occurrence and development of multiple human diseases, and that a combination of multiple miRNAs may also act synergistically on the same disease [5, 6]. Currently, the synergistic effects of miRNAs on diseases are widely accepted. Studies have also verified the existence of disease-associated miRNA synergistic combinations. For example, the synergistic effect of miR-21 and miR-1 on myocardial cell apoptosis, myocardial hypertrophy and fibrosis has been functionally verified [7]. The modules composed of miR-124, miR-128 and miR-137 are co-expressed in nerve cells, but are simultaneously lost in glioblastoma [8]. Researchers have also revealed that the regulatory level of two miRNAs is stronger than that of a single miRNA [9].

As biological data accumulate, computational methods have become promising complements to miRNA synergy studies, making it possible to effectively integrate multi-source information and discover functional synergistic miRNA modules. Li et al. [10] proposed Mirsynergy, a deterministic overlapping clustering framework, to identify miRNA co-regulatory modules. However, current studies mainly focus on the synergism of miRNAs or on association prediction between one miRNA and one disease [11]. Based on the above observations, using miRNA combinations as biomarkers is meaningful for understanding the synergistic molecular mechanisms of miRNAs and for the diagnosis and treatment of diseases.

Modeling the associations between miRNA combinations and diseases is a multi-dimensional problem, which is very challenging. N-way tensors can effectively represent data that have more than two modes and are widely used in biomedicine, data mining and social networks. Liu et al. [12] integrated multi-view miRNA and disease information to discover potential disease-associated miRNA pairs based on a tensor decomposition model. While the traditional tensor factorization method is efficient at representing multi-way data, it fails to capture the complex structural information in biological networks. Sun et al. [13] introduced a deep tensor decomposition model, DTF, which combined the tensor decomposition model with a deep neural network to predict drug synergies. Liu et al. [14] proposed NTDMDA, a miRNA-gene-disease association prediction algorithm based on neural tensor decomposition. Although these methods can learn the complicated relationships in the network and improve the predictive ability of the model, they either cannot fully exploit the rich biological data or simply apply deep learning models such as the multi-layer perceptron to learn node features. Furthermore, current studies mainly focus on the synergistic effect of miRNA combinations or on binary miRNA-disease associations, and therefore cannot be applied to the task of predicting disease-associated miRNA combinations.

To tackle the above issues, we propose a novel tensor factorization framework based on multi-source data, namely MAGTF, to predict disease-associated miRNA combinations. First, a graph attention neural network is applied to learn the miRNA and disease features over multi-source biological networks. Then, the multiple miRNA and disease features are aggregated by a multi-layer perceptron, respectively. Third, a heterogeneous feature learning module is used to capture features over the miRNA-disease association network. Finally, the concatenated global representations of miRNAs and diseases are
reconstructed into a miRNA-miRNA-disease association tensor for prediction. Experimental results show that MAGTF is superior to the comparison methods, which verifies the effectiveness of the model.
2 Materials and Method

2.1 miRNA-miRNA-Disease Association Tensor

Following Liu et al. [12], we collect miRNA-miRNA synergetic interactions, defined between miRNAs from the same miRNA family, from miRBase v22 [15], and download the miRNA-disease associations from HMDD v3.2 [16]. The miRNA-miRNA-disease association tensor $X$ is constructed from the miRNAs shared by the miRNA-miRNA and miRNA-disease interactions, i.e., $X_{ijk} = 1$ when miRNA $i$ is related to miRNA $j$ and both miRNA $i$ and miRNA $j$ are related to disease $k$; otherwise, $X_{ijk} = 0$. Finally, we obtain 15484 known interactions, covering 325 miRNAs and 283 diseases.

2.2 miRNA Similarity and Disease Similarity

A large amount of biological data provides a valuable resource for computational methods to discover disease-associated miRNA combinations. In this subsection, we introduce several similarity calculation methods for miRNAs and diseases.

miRNA Sequence Similarity. The miRNA sequence is the unique identifier of a miRNA. We use the miRNA sequences downloaded from miRBase v22 [15], apply the Needleman-Wunsch global alignment algorithm, and obtain the miRNA sequence similarity MS.

Disease Semantic Similarity. Semantic similarity plays an important role in biomedical research. Following Wang et al., we first describe the disease-related data in the MeSH database [17] as a hierarchical directed acyclic graph (DAG). Then, we calculate the similarity score based on the semantic contribution of each node in the DAG structure. The disease semantic similarity is denoted as DS.

Gaussian Kernel Similarity for miRNAs and Diseases. Gaussian kernel similarity can effectively calculate the similarity between nonlinear data [18, 19]. Given the miRNA-disease association matrix MD, the Gaussian kernel similarity for miRNAs can be expressed as:

$$MG(m_i, m_j) = \exp\left(-\gamma_m \lVert MD[:, i] - MD[:, j] \rVert^2\right) \tag{1}$$

where $MD[:, i]$ denotes the $i$-th column of the association matrix and $\gamma_m$ is the kernel bandwidth, which can be calculated as follows:

$$\gamma_m = \frac{1}{n} \sum_{i=1}^{n} \lVert MD[:, i] \rVert^2 \tag{2}$$

where $n$ is the number of miRNAs. Similarly, the disease Gaussian kernel similarity matrix is DG.

Target-Based Similarity for miRNAs and Diseases. Following Xiao et al. [20], we download the gene interaction score IS from HumanNet [21], and calculate the similarity of the shared genes between miRNAs and diseases, respectively. The miRNA functional similarity matrix is denoted as MF and the target-based disease similarity matrix is DF.
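The Gaussian kernel similarity of Eqs. (1) and (2) can be illustrated with a minimal sketch. The list-of-lists layout, the function name `gaussian_kernel_similarity` and the toy profiles are assumptions made for illustration and are not taken from the paper's implementation. The bandwidth follows Eq. (2) as printed; some Gaussian interaction profile kernel formulations in the literature use the reciprocal of this mean instead, which would be a one-line change.

```python
import math


def gaussian_kernel_similarity(profiles):
    """Gaussian kernel similarity between association profiles.

    profiles[i] is the association profile of entity i, e.g. MD[:, i]
    for miRNA i (a hypothetical list-of-lists layout).
    """
    n = len(profiles)
    # Eq. (2) as printed: bandwidth = mean squared norm of the profiles.
    gamma = sum(sum(v * v for v in p) for p in profiles) / n
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)  # Eq. (1)
    return sim


# Toy data: three miRNAs described by their associations with three diseases.
toy_profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
MG = gaussian_kernel_similarity(toy_profiles)
```

In this toy example the first two miRNAs share the same profile and therefore receive similarity 1, while the third miRNA, whose profile is disjoint from theirs, receives a similarity close to 0.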
2.3 Method Overview In this work, we propose a new tensor factorization model, called MAGTF, which consists of four parts (see Fig. 1). First, we calculate the miRNA and disease similarity from multi-source data. Second, we use an attentive graph learning module to aggregate the neighborhood information over multiple similarity networks and put the learned features into a multi-layer perceptron to obtain the similarity features for miRNAs and diseases, respectively. Then, we exploit a graph learning module to learn the heterogeneous miRNA and disease features over miRNA-disease association network. Finally, we concatenate the learned features and reconstruct the association tensor for prediction.
Fig. 1. The overall workflow of the proposed method MAGTF.
Similarity Network Attentional Learning Module. A typical graph neural network consists of propagation and perceptron layers. Given that $H^{(l)}$ represents the node features of the $l$-th layer, the propagation layer can be expressed as:

$$\tilde{H}^{(l)} = P H^{(l)} \tag{3}$$

where $P$ denotes the propagation matrix. Then, a single-layer perceptron is applied to each node:

$$H^{(l+1)} = \sigma\left(\tilde{H}^{(l)} W^{(l)}\right) \tag{4}$$

where $W^{(l)}$ is the weight matrix; $\sigma(\cdot)$ is the activation function.
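The two-step layer of Eqs. (3) and (4) can be sketched in a few lines. The text does not specify the propagation matrix $P$ or the activation $\sigma$; the row-normalized adjacency with self-loops and the ReLU used below are assumptions chosen only to make the example concrete, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalized_adjacency(adj):
    """One common choice of P (an assumption here): D^-1 (A + I)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]


def gnn_layer(P, H, W):
    """One propagation step followed by a single-layer perceptron."""
    H_tilde = matmul(P, H)    # Eq. (3): propagate features over the graph
    HW = matmul(H_tilde, W)   # Eq. (4): per-node linear map ...
    return [[max(0.0, v) for v in row] for row in HW]  # ... with ReLU as sigma


# Toy graph with two connected nodes and two-dimensional features.
adj = [[0, 1], [1, 0]]
P = row_normalized_adjacency(adj)
H0 = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[1.0, -1.0], [1.0, 1.0]]
H1 = gnn_layer(P, H0, W0)
```

Because both nodes average the same neighborhood, they end up with identical features after one layer, which shows how propagation smooths features across connected nodes.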
Following the observation of Thekumparampil et al. [22] that it is the propagation layer rather than the perceptron layer that plays a significant role in graph neural networks, we apply the simplified attention propagation layer AGN to adaptively learn the more relevant neighbor features in the similarity network. AGN contains only one scalar parameter, and the single-layer perceptron of the traditional graph neural network is removed to reduce the complexity of the model. For the input similarity network, the $l$-th propagation layer is denoted as follows:

$$H^{(l+1)} = \alpha^{(l)} H^{(l)} \tag{5}$$

where $\alpha^{(l)} \in \mathbb{R}^{n \times n}$ is the attention propagation matrix. Specifically, for node $i$, the neighbor feature aggregation can be defined as:

$$h_i^{(l+1)} = \sum_{j \in N(i) \cup \{i\}} \alpha_{ij}^{(l)} h_j^{(l)} \tag{6}$$

where $N(i)$ denotes the neighbors of node $i$. For nodes $i$ and $j$, we apply the attention mechanism to calculate the attention coefficient:

$$\alpha_{ij}^{(l)} = \underset{j \in N(i) \cup \{i\}}{\operatorname{softmax}}\left(\beta^{(l)} \cos\left(h_i^{(l)}, h_j^{(l)}\right)\right) \tag{7}$$

where $\cos(x, y) = x^{T} y / (\lVert x \rVert \lVert y \rVert)$ and $\lVert \cdot \rVert$ is the L2 norm. Finally, we get the output features of the nodes:

$$z_i^{(l+1)} = \operatorname{softmax}\left(h_i^{(l+1)} W^{(l+1)}\right) \tag{8}$$

For the input miRNA sequence, functional and Gaussian kernel similarity networks, we obtain the output features $MS^{(l)}$, $MF^{(l)}$, $MG^{(l)}$, respectively. For the disease semantic, Gaussian kernel and target-based similarity networks, the output features are expressed as $DS^{(l)}$, $DG^{(l)}$, $DT^{(l)}$. To fuse the multi-source features, we first splice the learned miRNA features, and then use a multi-layer perceptron to obtain the multi-source similarity features of miRNAs. The fused feature can be expressed as follows:

$$M_1 = \Phi_m^{(l)}\left(MS^{(l)} \,\Vert\, MF^{(l)} \,\Vert\, MG^{(l)}\right) \tag{9}$$

where $\Vert$ denotes the concatenation operation; $\Phi_m^{(l)}(\cdot)$ is the multi-layer perceptron for miRNA feature learning. Similarly, the multi-source similarity features of diseases can be described as:

$$D_1 = \Phi_d^{(l)}\left(DS^{(l)} \,\Vert\, DT^{(l)} \,\Vert\, DG^{(l)}\right) \tag{10}$$

where $\Phi_d^{(l)}(\cdot)$ is the multi-layer perceptron for disease feature learning.

Heterogeneous Information Network Feature Learning Module. Heterogeneous information network (HIN) is widely used in many fields since it can model heterogeneous data flexibly. However, due to the complexity of node and edge types in HIN, how to learn the heterogeneous features is still challenging. Graph convolutional neural
networks have achieved great success in semi-supervised learning of graphs, but most of them are difficult to extend to large-scale graph learning, or can only learn on fixed graphs. Therefore, we adopt a feature learning module GraphSAGE [23] to adaptively learn the features of miRNA and disease over miRNA-disease association network. For the input miRNA-disease association network, the feature aggregation steps can be divided into the following two steps: (1) Sample the neighbor nodes. According to the input adjacency matrix, we first sample the nodes randomly and the number of neighbors sampled for each layer is no more than Sk . (2) Neighbor feature aggregation. We use the aggregator to obtain feature representation of nodes. For node i, the mean feature aggregation of the l-th layer can be expressed as:
$$h_i^{(l+1)} \leftarrow \sigma\left(W^{(l)} \cdot \mathrm{MEAN}\left(\{h_i^{(l)}\} \cup \{h_j^{(l)}, \forall j \in N(i)\}\right)\right) \tag{11}$$

Specifically, the features of miRNAs and diseases are randomly initialized in the miRNA-disease association network. Then, the features of miRNA-related diseases are aggregated into the miRNA nodes using Eq. (11), and the features of disease-related miRNAs are aggregated into the disease nodes in the same way. Finally, we obtain the miRNA features $M_2$ and disease features $D_2$.

Deep Tensor Factorization Prediction Module. After deriving the miRNA similarity feature $M_1$ and heterogeneous feature $M_2$, and the disease similarity feature $D_1$ and heterogeneous feature $D_2$, we splice the miRNA features and the disease features, respectively, to obtain the global miRNA representation $M$ and disease representation $D$:

$$M = M_1 \oplus M_2 \tag{12}$$

$$D = D_1 \oplus D_2 \tag{13}$$

Then, the reconstructed miRNA-miRNA-disease association tensor $\hat{X}$ is obtained as follows:

$$\hat{X} = [[M, M, D]] \tag{14}$$

$$M = \Phi_m^{(l)}\left(AGN_m^{(l)}\left(M_1^{(0)}\right) \oplus SAGE_m^{(l)}\left(M_2^{(0)}\right)\right) \tag{15}$$

$$D = \Phi_d^{(l)}\left(AGN_d^{(l)}\left(D_1^{(0)}\right) \oplus SAGE_d^{(l)}\left(D_2^{(0)}\right)\right) \tag{16}$$

where $\Phi_m$ and $\Phi_d$ are multilayer perceptrons; $AGN(\cdot)$ is the similarity network attentional learning module; $SAGE(\cdot)$ is the heterogeneous information network feature learning module; $\oplus$ denotes the concatenation operation. For the input miRNA similarity feature $M_1^{(0)}$ and heterogeneous feature $M_2^{(0)}$, we apply the AGN and SAGE module to learn the
features, then splice them to obtain the global feature matrix of miRNAs $M$. The global feature of diseases $D$ is calculated in the same way. We use the mean square error, minimizing the Frobenius norm of the difference between the association tensor $X$ and the predicted tensor $\hat{X}$, as the loss function of our model. The loss is formulated as:

$$\min_{M, M, D} \; \lVert A_\Omega(X - \hat{X}) \rVert_F^2 + \lVert A_\Omega(X - \hat{X}) \rVert_F^2 + \lambda\left(\lVert \Gamma_m \rVert_F^2 + \lVert \Theta_d \rVert_F^2\right) \tag{17}$$

where $\Gamma_m = \{W_m^{(1)}, \cdots, W_m^{(l)}, b_m^{(1)}, \cdots, b_m^{(l)}\}$ and $\Theta_d = \{W_d^{(1)}, \cdots, W_d^{(l)}, b_d^{(1)}, \cdots, b_d^{(l)}\}$ are the parameters involved in training. For MAGTF, the number of training iterations is set to 50, and the model is optimized with an Adam optimizer with a learning rate of 0.001. In particular, MAGTF consists of an adaptive attention propagation network AGN, a multi-layer perceptron and a heterogeneous information network learning module GraphSAGE. It is worth noting that the initial features of the network in MAGTF are randomly initialized from the standard normal distribution. We set the input feature dimension D of miRNA and disease to 128.
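The three building blocks described in this section (the attention propagation of Eqs. (5)–(7), the mean aggregation of Eq. (11) and the reconstruction of Eq. (14)) can be sketched together as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, ReLU stands in for the unspecified activation, zero vectors are given cosine similarity 0, a single node list is used instead of separate miRNA and disease node sets, and $[[M, M, D]]$ is read as the standard sum of rank-one terms of a CP model.

```python
import math
import random


def cosine(x, y):
    nx = math.sqrt(sum(v * v for v in x))
    ny = math.sqrt(sum(v * v for v in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0  # assumption: zero vectors get similarity 0
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)


def agn_layer(H, neighbors, beta):
    """Eqs. (5)-(7): attention propagation with one scalar parameter beta."""
    out = []
    for i, h_i in enumerate(H):
        group = sorted(set(neighbors[i]) | {i})          # N(i) united with {i}
        logits = [beta * cosine(h_i, H[j]) for j in group]
        top = max(logits)
        weights = [math.exp(s - top) for s in logits]    # numerically stable softmax
        total = sum(weights)
        alpha = [w / total for w in weights]
        out.append([sum(a * H[j][c] for a, j in zip(alpha, group))
                    for c in range(len(h_i))])
    return out


def sage_mean_layer(H, neighbors, W, sample_size, rng):
    """Eq. (11): mean over the node and at most sample_size sampled neighbors."""
    out = []
    for i, h_i in enumerate(H):
        nbrs = list(neighbors[i])
        if len(nbrs) > sample_size:
            nbrs = rng.sample(nbrs, sample_size)
        group = [h_i] + [H[j] for j in nbrs]
        mean = [sum(col) / len(group) for col in zip(*group)]
        # Row-vector convention (mean times W), ReLU standing in for sigma.
        out.append([max(0.0, sum(m * w for m, w in zip(mean, col)))
                    for col in zip(*W)])
    return out


def cp_reconstruct(M, D):
    """Eq. (14): X_hat[i][j][k] = sum_r M[i][r] * M[j][r] * D[k][r]."""
    return [[[sum(mi * mj * dk for mi, mj, dk in zip(M[i], M[j], D[k]))
              for k in range(len(D))]
             for j in range(len(M))]
            for i in range(len(M))]


# Toy graph: three nodes on a path, two-dimensional features.
H = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H_att = agn_layer(H, neighbors, beta=1.0)
H_sage = sage_mean_layer(H, neighbors, [[1.0, 0.0], [0.0, 1.0]], 5, random.Random(0))
X_hat = cp_reconstruct([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
```

Since the same factor matrix $M$ is used for both miRNA modes, the reconstructed tensor is symmetric in its first two indices, which matches the unordered nature of a miRNA pair.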
3 Experiments

3.1 Experimental Setup

To comprehensively evaluate the performance of MAGTF, we conduct five-fold cross-validation experiments and compare it with three state-of-the-art methods. The association tensor is very sparse: it contains many unobserved entries, the great majority of which are negative samples. To address the imbalance between positive and negative samples, we randomly choose as many negative samples as there are positive samples from the missing entries of the association tensor. The positive and negative samples are then each divided into five parts. For each fold, four parts of the positive and negative samples are used as the training set, and the remaining part is used as the testing set. To reduce the variance that the choice of negative samples may cause, we report the average performance over 10 repetitions with different negative samples. We mainly choose metrics that are typical for classification tasks: the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC).

3.2 Baselines

Currently, there are many outstanding tensor completion methods for predicting triple associations in bioinformatics. miRCom [12] and GraphTF [24] are two existing algorithms for predicting disease-associated miRNA combinations, and DTF [13] is an advanced deep tensor factorization model. Therefore, we take these models as baselines.

miRCom is a tensor completion framework, which solves the problem of the sparse original association tensor by introducing multi-source information of miRNAs and
diseases, and combines CP tensor decomposition and matrix decomposition to predict disease-associated miRNA pairs.

DTF is a deep tensor factorization model that integrates the tensor decomposition method and a deep neural network to predict drug synergy. DTF extracts potential drug features through a tensor factorization model, and then constructs a binary classifier using a deep neural network to predict drug synergy.

GraphTF uses a graph attention network to learn the features over miRNA and disease similarity networks, then uses a tensor factorization model to reconstruct the tensor for predicting disease-associated miRNA pairs.

3.3 Experimental Results

The experimental results of MAGTF and the three comparison methods are shown in Fig. 2. MAGTF is superior to the other comparison algorithms in AUC, and the performance of MAGTF and miRCom is about the same in AUPR. From the results, we can draw the following conclusions:

(1) Compared with MAGTF, DTF simply takes concatenated data as the input feature and then puts it into the deep neural network for prediction, ignoring the rich network relationships among the associated data. In contrast, MAGTF uses the attention mechanism to learn node features by assigning different weights to neighbors, which can effectively preserve the network topology.

(2) MAGTF uses 6 kinds of miRNA and disease similarity data, GraphTF exploits only 3 kinds, and miRCom uses 9 kinds of similarity information. The results of these methods support the view that the fusion of multi-source data is conducive to improving the model performance. GraphTF and MAGTF have achieved good performance, which also shows the advantages of deep learning models in capturing nonlinear relationships in networks.

(3) Compared with GraphTF, MAGTF uses the simplified attention propagation layer and the feature learning method of the heterogeneous information network for prediction, which yields a significant improvement in performance, indicating the effectiveness of introducing multi-source similarity data and heterogeneous information.
Fig. 2. (a) The ROC curves of MAGTF and compared methods. (b) The precision-recall curves of MAGTF and compared methods.
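The evaluation protocol of Sect. 3.1 (balanced negative sampling, five-fold splitting of positives and negatives, and ROC-AUC) can be sketched as below. The function names, the toy triples and the tensor shape are hypothetical; ROC-AUC is computed from its pairwise-ranking definition, and PR-AUC is omitted for brevity. Whether entries with $i = j$ are excluded from the negative pool is not stated in the text, so this sketch does not exclude them.

```python
import random


def sample_balanced_negatives(positives, shape, rng):
    """Draw as many unobserved (i, j, k) entries as there are positives."""
    n_mirna, _, n_disease = shape
    observed = set(positives)
    negatives = set()
    # Assumes the tensor has at least len(positives) unobserved entries.
    while len(negatives) < len(observed):
        t = (rng.randrange(n_mirna), rng.randrange(n_mirna), rng.randrange(n_disease))
        if t not in observed:
            negatives.add(t)
    return sorted(negatives)


def five_fold_splits(positives, negatives, rng, k=5):
    """Split positives and negatives into k parts each; yield (train, test)."""
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[(t, 1) for t in pos[f::k]] + [(t, 0) for t in neg[f::k]]
             for f in range(k)]
    for f in range(k):
        train = [s for g in range(k) if g != f for s in folds[g]]
        yield train, folds[f]


def roc_auc(labels, scores):
    """Probability that a positive outranks a negative (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy tensor of shape 4 x 4 x 3 with five known (miRNA, miRNA, disease) triples.
rng = random.Random(0)
positives = [(0, 1, 0), (1, 0, 0), (2, 3, 1), (3, 2, 1), (0, 2, 2)]
negatives = sample_balanced_negatives(positives, (4, 4, 3), rng)
splits = list(five_fold_splits(positives, negatives, rng))
```

Repeating the whole procedure with different random seeds and averaging the resulting scores would correspond to the 10 repetitions with different negative samples described in the setup.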
3.4 Ablation Study

MAGTF is mainly composed of two modules, i.e., the similarity network attentional learning module and the heterogeneous information network feature learning module. To explore the influence of the two modules on model performance, we design two variant models, MAGTF_agn and MAGTF_sage. MAGTF_agn only uses multi-source similarity networks to learn miRNA and disease features for tensor reconstruction; MAGTF_sage only learns the features of miRNAs and diseases over the miRNA-disease heterogeneous network. It can be seen from Table 1 that the performance of MAGTF_agn and MAGTF_sage is degraded to some extent when part of the model is removed, which indicates that both the similarity networks and the heterogeneous network contain much information and contribute substantially to MAGTF. The result shows that the combination of similarity features and heterogeneous network features makes the model achieve the best performance, demonstrating the effectiveness of each module in our proposed model.

Table 1. The influence of each MAGTF module on model performance.

Indicator   MAGTF_agn   MAGTF_sage   MAGTF
AUC         0.9751      0.9723       0.9832
AUPR        0.9738      0.9682       0.9853
3.5 Running Time Analysis

To evaluate the efficiency of MAGTF, we compare the running time of MAGTF with that of the other comparison algorithms over ten repetitions of five-fold cross-validation. All the methods are implemented on a machine equipped with one NVIDIA 2060 GPU and one 2.90 GHz AMD Ryzen 7 4800H with Radeon Graphics CPU with 16 GB memory. As shown in Table 2, the running time of DTF, MAGTF and GraphTF is shorter than that of miRCom, indicating that the deep tensor factorization model can reduce the complexity of model calculation in the optimization process compared with the mathematical calculation model. The reason for the longer running time of DTF may be that there are many neural units in the hidden layers of the model, which increases the complexity of model calculation. MAGTF has an obvious advantage in running time over miRCom (1830 s versus 25757 s). In addition, the running time of MAGTF is shorter than that of GraphTF, which further demonstrates the effectiveness of the simplified calculation of the attention mechanism in the attentional learning module of MAGTF.
Table 2. The running time of different methods.

Methods   Time (s)
DTF       23015
GraphTF   3397
miRCom    25757
MAGTF     1830
3.6 Case Study

To further validate the ability of MAGTF to predict miRNA-miRNA-disease associations, case studies of two common complex human diseases, namely Brain Neoplasms and Kidney Neoplasms, are conducted in this section. Specifically, for a given ternary association $\langle m_1, m_2, d \rangle$, we validate it by verifying the decomposed pairwise associations, i.e., $\langle m_1, m_2 \rangle$, $\langle m_1, d \rangle$ and $\langle m_2, d \rangle$. We verify the top-20 predictions with the prominent miRNA-disease association database dbDEMC 3.0 [25].

Brain Neoplasms refers to the uncontrollable growth or abnormality of brain cells, which is a great threat to human health [26]. Table 3 lists the top-20 Brain Neoplasms-associated miRNA combinations predicted by MAGTF, from which we can see that some predicted miRNA combinations are confirmed by the literature to have a synergistic role. For example, the combination of miR-194 and miR-660 has been shown to have a synergetic role in the diagnosis of non-small cell lung cancer [27].
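The ranking step behind such a case study can be sketched as follows. This is an illustrative assumption about how top-ranked candidates might be extracted from a predicted tensor, not the authors' procedure: the function names are hypothetical, pairs are treated as unordered ($i < j$), and already known associations are excluded from the candidate list.

```python
def top_k_pairs(X_hat, disease, known, k=20):
    """Rank unordered miRNA pairs (i < j) for one disease by predicted score."""
    n = len(X_hat)
    candidates = [(X_hat[i][j][disease], i, j)
                  for i in range(n) for j in range(i + 1, n)
                  if (i, j, disease) not in known and (j, i, disease) not in known]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(i, j, score) for score, i, j in candidates[:k]]


def pairwise_checks(m1, m2, d):
    """Decompose a ternary association into the three pairwise ones to verify."""
    return [("miRNA-miRNA", m1, m2), ("miRNA-disease", m1, d), ("miRNA-disease", m2, d)]


# Toy predicted tensor for three miRNAs and one disease.
X_hat = [[[0.0], [0.9], [0.2]],
         [[0.9], [0.0], [0.7]],
         [[0.2], [0.7], [0.0]]]
ranked = top_k_pairs(X_hat, disease=0, known={(0, 1, 0)}, k=2)
checks = pairwise_checks(1, 2, 0)
```

Each entry of `ranked` would then be looked up manually or in a curated database through the three pairwise associations returned by `pairwise_checks`.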
Table 3. Top-20 Brain Neoplasms-associated miRNA combinations predicted by MAGTF.

Rank  miRNA i          miRNA j          Evidence
1     hsa-mir-548j     hsa-mir-518f     PMID: 23071543
2     hsa-mir-194-1    hsa-mir-660      PMID: 26213369
3     hsa-mir-548a-3   hsa-mir-518c     PMID: 23071543
4     hsa-mir-1302-1   hsa-mir-515-2    Unconfirmed
5     hsa-mir-411      hsa-mir-92b      Unconfirmed
6     hsa-mir-374b     hsa-mir-181a-1   PMID: 28392786
7     hsa-mir-133a-1   hsa-mir-518f     PMID: 34475038
8     hsa-mir-548j     hsa-mir-1302-1   Unconfirmed
9     hsa-mir-548a-3   hsa-mir-302a     PMID: 28237528
10    hsa-mir-548a-3   hsa-mir-369      Unconfirmed
11    hsa-mir-516b-2   hsa-mir-105-2    PMID: 32356618
12    hsa-mir-410      hsa-mir-1302-1   Unconfirmed
13    hsa-mir-105-2    hsa-mir-1302-1   Unconfirmed
14    hsa-mir-660      hsa-mir-369      Unconfirmed
15    hsa-mir-380      hsa-mir-514a-2   Unconfirmed
16    hsa-mir-23a      hsa-mir-1302-1   Unconfirmed
17    hsa-mir-548a-3   hsa-mir-1302-1   Unconfirmed
18    hsa-mir-548a-1   hsa-mir-518a-1   PMID: 23071543
19    hsa-mir-548a-3   hsa-mir-517b     PMID: 28823541
20    hsa-mir-516b-2   hsa-mir-518f     PMID: 32626972
Kidney Neoplasms is one of the most common cancers, with a higher risk in men than in women [28]. Table 4 lists the top-20 Kidney Neoplasms-associated miRNA combinations predicted by MAGTF, and some of these miRNA combinations are verified.
Table 4. Top-20 Kidney Neoplasms-associated miRNA combinations predicted by MAGTF.

Rank  miRNA i          miRNA j          Evidence
1     hsa-mir-548j     hsa-mir-518f     PMID: 23071543
2     hsa-mir-521-2    hsa-mir-660      Unconfirmed
3     hsa-mir-548b     hsa-mir-518a-2   PMID: 23071543
4     hsa-mir-551b     hsa-mir-548b     Unconfirmed
5     hsa-mir-487b     hsa-mir-518f     Unconfirmed
6     hsa-mir-133a-1   hsa-mir-518f     PMID: 34475038
7     hsa-mir-660      hsa-mir-194-1    PMID: 26213369
8     hsa-mir-101-2    hsa-mir-660      PMID: 32290510
9     hsa-mir-548b     hsa-mir-548j     PMID: 23150165
10    hsa-mir-372      hsa-mir-518f     Unconfirmed
11    hsa-mir-181a-1   hsa-mir-518f     Unconfirmed
12    hsa-mir-19b-1    hsa-mir-548j     Unconfirmed
13    hsa-mir-320d-1   hsa-mir-518f     PMID: 29193229
14    hsa-mir-214      hsa-mir-550b-1   PMID: 26329304
15    hsa-mir-660      hsa-mir-181a-1   PMID: 23510112
16    hsa-mir-329-2    hsa-mir-548b     Unconfirmed
17    hsa-mir-376b     hsa-mir-518f     Unconfirmed
18    hsa-mir-1285-2   hsa-mir-518f     Unconfirmed
19    hsa-mir-124-3    hsa-mir-518f     Unconfirmed
20    hsa-mir-128-2    hsa-mir-548b     PMID: 28836962
To illustrate the applicability of MAGTF to a new disease that has no known associated miRNAs, we carry out another case study on Breast Neoplasms. Note that all the known associations with Breast Neoplasms are removed before training, so that it can be regarded as a new disease. Table 5 lists the top-20 Breast Neoplasms-associated miRNA combinations predicted by MAGTF, from which we can see that almost all miRNA combinations are verified in the literature. In general, all the case studies show the powerful predictive ability of MAGTF. It can not only discover new associations, but is also reliable in predicting miRNA combinations associated with a new disease.
Table 5. Top-20 Breast Neoplasms-associated miRNA combinations predicted by MAGTF.

Rank  miRNA i          miRNA j          Evidence
1     hsa-mir-125a     hsa-let-7a-1     PMID: 26130254
2     hsa-let-7a-3     hsa-mir-125a     PMID: 26130254
3     hsa-mir-125a     hsa-let-7g       PMID: 34003899
4     hsa-let-7i       hsa-mir-125b-2   PMID: 25815883
5     hsa-let-7g       hsa-mir-106b     PMID: 30687004
6     hsa-let-7a-3     hsa-mir-106b     PMID: 30687004
7     hsa-let-7i       hsa-mir-17       PMID: 30687004
8     hsa-mir-106b     hsa-let-7a-1     PMID: 30687004
9     hsa-mir-125b-2   hsa-let-7a-1     PMID: 26130254
10    hsa-let-7i       hsa-mir-125a     Unconfirmed
11    hsa-mir-10a      hsa-mir-1-1      PMID: 30582949
12    hsa-mir-20a      hsa-mir-125a     PMID: 33948855
13    hsa-let-7g       hsa-mir-17       Unconfirmed
14    hsa-mir-29a      hsa-mir-17       PMID: 34278196
15    hsa-let-7a-3     hsa-mir-125b-2   PMID: 31758976
16    hsa-let-7a-3     hsa-mir-15a      PMID: 30389902
17    hsa-mir-1-1      hsa-mir-125a     PMID: 29391047
18    hsa-mir-99a      hsa-let-7i       PMID: 31483222
19    hsa-let-7g       hsa-mir-99a      PMID: 22562236
20    hsa-let-7i       hsa-mir-106b     PMID: 31443156
4 Conclusion

Discovering disease-associated miRNA combinations is a promising strategy for disease treatment, as it helps to understand the synergism of miRNAs and the pathogenesis of diseases. In this paper, we propose a novel tensor factorization model, MAGTF, which integrates a graph learning module for predicting disease-associated miRNA combinations. MAGTF can not only effectively predict potential disease-associated miRNA combinations, but also extract useful information from multiple networks to enhance model performance. The experimental results under five-fold cross-validation show the effectiveness of MAGTF. The ablation study confirms the importance of each module in MAGTF. The case studies also verify the ability of MAGTF to discover potential disease-associated miRNA combinations and its applicability to new diseases. In the future, more biological auxiliary information and feature fusion strategies remain to be explored. More effort will be put into investigating other graph learning modules to improve model performance.
Funding. This work has been supported by the Natural Science Foundation of China (Grant nos. 61873089 and 62032007).
References

1. Bartel, D.P.: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116(2), 281–297 (2004)
2. Bushati, N., Cohen, S.M.: microRNA functions. Annu. Rev. Cell Dev. Biol. 23, 175–205 (2007)
3. Alvarez-Garcia, I., Miska, E.A.: MicroRNA functions in animal development and human disease. Development 132, 4653–4662 (2005)
4. Petrocca, F., et al.: E2F1-regulated microRNAs impair TGFβ-dependent cell-cycle arrest and apoptosis in gastric cancer. Cancer Cell 13(3), 272–286 (2008)
5. Skommer, J., Rana, I., Marques, F., Zhu, W., Du, Z., Charchar, F.: Small molecules, big effects: the role of microRNAs in regulation of cardiomyocyte death. Cell Death Dis. 5(7), e1325–e1325 (2014)
6. Latronico, M.V., Catalucci, D., Condorelli, G.: Emerging role of microRNAs in cardiovascular biology. Circ. Res. 101(12), 1225–1236 (2007)
7. Zhu, W., et al.: Dissection of protein interactomics highlights microRNA synergy. PLoS ONE 8(5), e63342 (2013)
8. Bhaskaran, V., et al.: The functional synergism of microRNA clustering provides therapeutically relevant epigenetic interference in glioblastoma. Nat. Commun. 10(1), 1–13 (2019)
9. Doench, J.G., Sharp, P.A.: Specificity of microRNA target selection in translational repression. Genes Dev. 18(5), 504–511 (2004)
10. Li, Y., Liang, C., Wong, K.-C., Luo, J., Zhang, Z.: Mirsynergy: detecting synergistic miRNA regulatory modules by overlapping neighbourhood expansion. Bioinformatics 30(18), 2627–2635 (2014)
11. Zhou, F., Yin, M.-M., Jiao, C.-N., Zhao, J.-X., Zheng, C.-H., Liu, J.-X.: Predicting miRNA-disease associations through deep autoencoder with multiple kernel learning. IEEE Trans. Neural Netw. Learn. Syst. (2021)
12. Liu, P., Luo, J., Chen, X.: miRCom: tensor completion integrating multi-view information to deduce the potential disease-related miRNA-miRNA pairs. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 1747–1759 (2020)
13. Sun, Z., Huang, S., Jiang, P., Hu, P.: DTF: deep tensor factorization for predicting anticancer drug synergy. Bioinformatics 36(16), 4483–4489 (2020)
14. Liu, Y., Luo, J., Wu, H.: miRNA-disease associations prediction based on neural tensor decomposition. In: International Conference on Intelligent Computing, pp. 312–323 (2021)
15. Kozomara, A., Birgaoanu, M., Griffiths-Jones, S.: miRBase: from microRNA sequences to function. Nucleic Acids Res. 47(D1), D155–D162 (2019)
16. Huang, Z., et al.: HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 47(D1), D1013–D1017 (2019)
17. Lipscomb, C.E.: Medical subject headings (MeSH). Bull. Med. Libr. Assoc. 88(3), 265 (2000)
18. Ji, C., Gao, Z., Ma, X., Wu, Q., Ni, J., Zheng, C.: AEMDA: inferring miRNA–disease associations based on deep autoencoder. Bioinformatics 37(1), 66–72 (2021)
19. Yin, M.-M., Liu, J.-X., Gao, Y.-L., Kong, X.-Z., Zheng, C.-H.: NCPLP: a novel approach for predicting microbe-associated diseases with network consistency projection and label propagation. IEEE Trans. Cybern. 52, 5079–5087 (2020)
20. Xiao, Q., Luo, J., Liang, C., Cai, J., Ding, P.: A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics 34(2), 239–248 (2018)
21. Hwang, S., et al.: HumanNet v2: human gene networks for disease research. Nucleic Acids Res. 47(D1), D573–D580 (2019)
22. Thekumparampil, K.K., Wang, C., Oh, S., Li, L.-J.: Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735 (2018)
23. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
24. Luo, J., Lai, Z., Shen, C., Liu, P., Shi, H.: Graph attention mechanism-based deep tensor factorization for predicting disease-associated miRNA-miRNA pairs. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 189–196 (2021)
25. Xu, F., et al.: dbDEMC 3.0: functional exploration of differentially expressed miRNAs in cancers of human and model organisms. bioRxiv (2022)
26. Farmanfarma, K.K., Mohammadian, M., Shahabinia, Z., Hassanipour, S., Salehiniya, H.: Brain cancer in the world: an epidemiological review. World Cancer Res. J. 6(5), 1–5 (2019)
27. Zhou, C., et al.: Combination of serum miRNAs with Cyfra21-1 for the diagnosis of non-small cell lung cancer. Cancer Lett. 367(2), 138–146 (2015)
28. Chow, W.-H., Dong, L.M., Devesa, S.S.: Epidemiology and risk factors for kidney cancer. Nat. Rev. Urol. 7(5), 245–257 (2010)
Correction to: Multi-source Data-Based Deep Tensor Factorization for Predicting Disease-Associated miRNA Combinations
Sheng You, Zihan Lai, and Jiawei Luo
Correction to: Chapter “Multi-source Data-Based Deep Tensor Factorization for Predicting Disease-Associated miRNA Combinations” in: D.-S. Huang et al. (Eds.): Intelligent Computing Theories and Application, LNCS 13394, https://doi.org/10.1007/978-3-031-13829-4_72
In the original version of this chapter, the last name of Jiawei Luo was misspelled. This was corrected.
The updated original version of this chapter can be found at https://doi.org/10.1007/978-3-031-13829-4_72 © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 D.-S. Huang et al. (Eds.): ICIC 2022, LNCS 13394, p. C1, 2022. https://doi.org/10.1007/978-3-031-13829-4_73
Author Index
Alanezi, Ahmad III-277 Alatrany, Abbas Saad III-129 Alatrany, Saad S. J. III-129 Alejo, R. I-169, III-67 Al-Jumaili, Zaid III-277 Al-Jumaily, Dhiya III-129 Al-Jumeily, Dhiya III-220, III-277 Anandhan, Vinutha II-289 Bai, Yunyi III-698, III-709 Bao, Wenzheng II-680, II-687, II-715, II-731 Basseur, Matthieu I-125 Bassiouny, Tarek III-277 Berloco, Francesco III-242 Bevilacqua, Vitoantonio III-242 Bi, Xiaodan I-701, II-103, II-260 Bin, Yannan II-757 Blacklidge, Rhys III-394 Cai, Junchuang I-27 Cao, Dehua III-463 Cao, Yi II-663, II-670, II-697, II-705 Castorena, C. M. I-169 Cervantes, Jair I-391 Cervantes, Jared I-391 Chai, Jie III-341 Chang, Feng II-374 Chen, Baitong II-663, II-670, II-680, II-687, II-697, II-705, II-722 Chen, Cheng II-153 Chen, Debao I-112 Chen, Guang Yi I-330, I-420 Chen, Guanyuan II-356 Chen, Guolong I-292 Chen, Jianyong I-27, I-41 Chen, Jiazi II-663, II-670, II-697, II-705 Chen, Junxiong I-673 Chen, Mingyi I-535 Chen, Peng I-753, I-772, I-787 Chen, Shu-Wen II-588 Chen, Wen-Sheng I-267 Chen, Xiangtao III-288 Chen, Ying III-198, III-729 Chen, Yuanyuan III-802
Chen, Yuehui II-334, II-394, II-615, II-663, II-697, II-705 Chen, Zhang I-245 Chen, Zhan-Heng I-739, II-220, II-451 Chen, Zhenqiong II-374, II-383 Cheng, Honglin II-680, II-687, II-715, II-739 Cheng, Li-Wei I-726 Cheng, Long III-755 Cheng, Meili I-701 Cheng, Zhiyu I-444, III-150 Choraś, Michał III-257 Chou, Hsin-Hung I-726 Chu, Jian II-731 Chu, Po-Lun I-726 Cloutier, Rayan S. II-588 Colucci, Simona III-242 Cong, Hanhan II-663, II-670, II-697, II-705 Cong, Hongshou I-772 Cuc, Nguyen Thi Hoa III-544 Cui, Xinchun I-412 Cui, Xiuming I-412 Dai, Lai I-673 Dai, Yuxing I-430, I-494, II-569, II-579 Dai, Zhenmin I-379 Dai, Zichun II-319 Dalui, Mamata I-811 Das, Sunanda I-811 del Razo-López, F. III-67 Ding, Bowen I-68 Ding, Pingjian II-153 Ding, Wenquan II-517 Dmitry, Yukhimets III-504 Dong, Chao II-757 Dong, Chenxi I-401, III-117 Dong, Yahui III-380 Du, Jianzong I-412 Du, Yanlian I-51 Duan, Hongyu II-345 Fan, Jingxuan I-506 Fan, Wei III-3 Fang, Ailian III-141 Fang, Chujie II-196
Fang, Liang-Kuan III-626 Fang, Yu I-339 Feng, Cong III-604, III-662 Feng, Yue I-412 Feng, Zejing I-51 Feng, Zizhou II-722 Fengqi, Qiu III-106 Filaretov, Vladimir III-55, III-93 Fu, Qiming III-234 Fu, Wenzhen II-670 Gan, Haitao II-53 Gangadharan, Sanjay II-289 Gao, Guangfu II-628 Gao, Lichao III-729 Gao, Peng III-626 Gao, Pengbo II-319 Gao, Wentao II-28 Gao, Xiaohua I-701 Gao, Yun III-353 García-Lamont, Farid I-391 Ge, Lina I-638, III-802 Ghali, Fawaz III-183 Ghanem, Fahd A. III-304 Gong, Xiaoling III-32 Gong, Zhiwen I-444, III-209 Gopalakrishnan, Chandrasekhar II-116, II-289, II-383 Granda-Gutíerrez, E. E. I-169, III-67 Gu, Yi III-719, III-755 Guan, Pengwei II-747 Guan, Shixuan II-310 Guan, Ying III-18 Gubankov, Anton III-93 Guo, Haitong I-13, II-41 Guo, Jiamei III-198 Guo, Xiangyang III-267 Gupta, Rachna I-739 Gurrib, Ikhlaas III-589 Ha, CheolKeun III-80 Ha, Cheolkeun III-544 Han, Mengmeng II-757 Han, Pengyong II-103, II-116, II-181, II-207, II-260, II-289, II-374, II-383, II-405, II-556 Han, Tiaojuan I-3, I-13, II-41 Han, Xiulin I-306 Hao, Bibo I-739 Hao, Zongjie III-463
Haoyu, Ren III-106 Harper, Matthew III-183 He, Chunlin III-3 He, Hongjian II-777 He, Zhi-Huang I-317 Hesham, Abd El-Latif II-14 Hong, Sun-yan I-160 Hsieh, Chin-Chiang I-726 Hsieh, Sun-Yuan I-726 Hu, Fan III-741 Hu, Jiaxin II-715 Hu, Jing II-496, II-507, II-517, II-533 Hu, Juxi II-739 Hu, Lun I-739, II-220, II-451 Hu, Peng-Wei II-451 Hu, Pengwei I-739 Hu, Rong III-473 Hu, Xiangdong I-627 Hu, Yunfan I-149 Hu, Zhongtian III-234 Huang, Dingkai II-777 Huang, Huajuan I-80, I-97, III-769, III-785, III-860 Huang, Jian II-415 Huang, Jiehao I-549 Huang, Kuo-Yuan I-726 Huang, Lei I-306 Huang, Qinhua III-162 Huang, Youjun III-729 Huang, Ziru II-415 Huang, Ziyi II-28 Hussain, Abir III-129, III-220 Hussain, Abir Jaafar III-277 Il, Kim Chung III-55 Ji, Cun-Mei II-166, II-245, III-639 Ji, Xuying I-112 Jia, Baoli II-394 Jia, Jingbo II-638 Jiang, Kaibao II-356, II-364 Jiang, Peng III-463 Jiang, Tengsheng II-302, II-310 Jiang, Yizhang III-741 Jiang, Yu II-79 Jianzhou, Chen III-106 Jiao, Erjie III-198 Jiao, Qiqi II-79 Jin, Bo I-401, III-117
Jing, Qu I-617 Joshi, Rajashree I-739 Kamalov, Firuz III-589 Kang, Hee-Jun III-518, III-529 Kesarwani, Abhishek I-811 Khan, Md Shahedul Islam III-170 Khan, Wasiq III-183, III-220, III-277 Kisku, Dakshina Ranjan I-811 Koepke, David I-739 Komuro, Takashi I-472, I-483 Kozik, Rafał III-257 Krzyzak, Adam I-330 Kuang, Jinjun I-564, I-589 Lai, Jinling II-767 Lai, Zihan II-807 Lam, Dinh Hai III-544 Le, Duc-Vinh III-80 Le, Tien Dung III-518, III-529 Lee, Hong-Hee III-484 Lee, Mark III-394 Lei, Peng II-66 Lei, Yi II-507 Lei, Yuan-Yuan III-626 Leng, Qiangkui III-198 Li, AoXing II-547 Li, Bin II-166, III-18 Li, Bo II-460, II-470 Li, Dong-Xu II-451 Li, Dongyi II-233 Li, Feng II-345 Li, Gang III-44 Li, Guilin II-569 Li, Hui I-137, I-196, I-209, I-221, II-628 Li, Ji III-315 Li, Jianqiang I-27, I-41 Li, Jing I-579 Li, Jinxin II-356, II-364 Li, Juan III-380 Li, Jun III-341 Li, Junyi II-79 Li, Lei II-166 Li, Liang I-306 Li, Lin II-405 Li, Pengpai II-3 Li, Qiang III-409 Li, Renjie III-353 Li, Rujiang I-306 Li, Shi I-181
Li, Wenqiang I-535 Li, Xiaoguang II-138, II-207 Li, Xiaohui II-556 Li, Xin I-209 Li, Xin-Lu III-626 Li, Yan II-345 Li, Yanran II-116, II-289 Li, Ya-qin I-160 Li, Yaru III-18 Li, Yuanyuan II-196 Li, Zhang III-106 Li, Zhaojia I-685 Li, Zhengwei II-181, II-207, II-289, II-405 li, Zhengwei II-383 Li, Zhipeng III-304 Lian, Jie II-569, II-579 Liang, Yujun III-315 Liang, Zhiwei III-846 Lin, Fangze III-492 Lin, Ke III-3 Lin, Ning II-415 Lin, Pingyuan I-430, I-494, II-569, II-579 Lin, Qiuzhen I-27, I-41 Lin, Xiaoli II-423, II-438, II-496, II-517, II-547 Lin, Zeyuan I-673 Ling, Ying III-891 Liu, Bindong II-793 Liu, Guodong III-409 Liu, Hailei II-374, II-383 Liu, Hao III-719 Liu, Hongbo II-278 Liu, Jie II-757 Liu, Jin-Xing II-345 Liu, Juan III-463 Liu, Jun III-150, III-209 Liu, Junkai II-302, II-310 Liu, Qilin III-615 Liu, Ruoyu III-719 Liu, Shuhui II-126 Liu, Si III-604, III-662 Liu, Tao I-506 Liu, Xiaoli I-412 Liu, Xikui II-345 Liu, Xiyu II-233 Liu, Xujie I-430, I-494 Liu, Yonglin I-412 Liu, Yujun II-715 Liu, Yunxia III-604, III-672 Liu, Yuqing III-198
Liu, Zhi-Hao III-639 Liu, Zhi-Ping II-3 Liu, Zhiyang III-448 Lu, Jianfeng I-3, I-13, I-673, I-685, II-28, II-41, III-814 Lu, Kun I-772, I-787 Lu, Lei I-245 Lu, Wang III-106 Lu, Weizhong III-234 Lu, Xingmin III-423 Lu, Xinguo II-356, II-364 Lu, Xinwei II-777 Lu, Yaoyao II-302, II-310 Lu, Yonggang III-18 Luna, Dalia I-391 Luo, Hanyu II-153 Luo, Huaichao II-415 Luo, Lingyun II-153 Luo, Qifang III-830, III-846, III-860, III-876, III-891 Luo, YiHua I-444 Luo, Jiawei II-807 Lv, Gang I-522 Lv, Jiaxing I-772, I-787 Lyu, Yi II-556 Ma, Fubo II-319 Ma, Jinwen III-267 Ma, Zhaobin I-68 Ma, Zuchang I-522 McNaughton, Fiona I-739 Mei, Jing I-739 Meng, Qingfang II-334, II-394 Meng, Tong II-705 Mi, Jian-Xun III-353 Min, Xiao I-363 Min, Xu I-739 Ming, Zhong I-27, I-41 Miranda-Piña, G. I-169, III-67 Moussa, Sherif III-589 Nagahara, Hajime I-472 Nazir, Amril III-589 Nguyen, Duy-Long III-484 Ni, Jian-Cheng II-166, II-245 Nian, Fudong I-522 Nie, Ru II-181 Ning, Wei III-492 Niu, Mengting II-14
Niu, Rui II-270 Niu, Zihan III-331 Ouyang, Weimin III-162
Pan, Binbin I-267 Pan, Yuhang I-258 Pang, Baochuan III-463 Paul, Meshach II-289 Pawlicki, Marek III-257 Pei, ZhaoBin III-684 Peng, Yanfei I-181 Ponnusamy, Chandra Sekar II-289 Premaratne, Prashan III-394 Protcenko, Alexander III-55 Pu, Quanyi I-352 Qi, Miao I-506, III-380 Qi, Rong II-245 Qian, Bin III-473 Qian, Pengjiang III-698, III-709, III-741 Qiao, Li-Juan III-639 Qiu, Shengjie I-535, I-549 Qiu, Zekang I-663 Qu, Qianhui III-719 Ramalingam, Rajasekaran II-116, II-289, II-383 Rendón, E. I-169, III-67 Renk, Rafał III-257
Sang, Yu I-181 Sarem, Mudar I-535, I-564, I-589 Shan, Chudong I-663 Shan, Wenyu II-153 Shang, Junliang II-345 Shang, Li I-456, I-464 Shang, Xuequn II-126, II-270, III-170 Shao, Wenhao II-722 Shao, Zijun II-722 Shen, Yijun I-51 Shen, Zhen II-767 Sheng, Qinghua I-412 Shi, Chenghao III-830 Shi, Yan III-860 Song, Wei III-423 Su, Xiao-Rui II-451 Su, Yanan III-719 Sui, Jianan II-697 Sun, Fengxu II-356, II-364
Sun, Feng-yang III-654 Sun, Hongyu II-556 Sun, Hui III-380 Sun, Lei I-444, III-209 Sun, Pengcheng I-549 Sun, Qinghua II-650 Sun, Shaoqing I-306 Sun, Yining I-522 Sun, Zhan-li I-456, I-464 Sun, Zhensheng II-345 Sun, Zhongyu I-579 Szczepański, Mateusz III-257 Taleb, Hamdan III-304 Tan, Ming III-435 Tan, Xianbao II-92 Tang, Daoxu II-364 Tang, Yuan-yan III-435 Tang, Zeyi I-430, I-494 Tang, Zhonghua III-830 Tao, Dao III-785 Tao, Jinglu II-423 Tao, Zheng II-687 Tian, Wei-Dong III-341, III-367 Tian, Yang II-470 Tian, Yu I-245 Tian, Yun II-405 Topham, Luke III-220 Tran, Huy Q. III-544 Tran, Quoc-Hoan III-484 Truong, Thanh Nguyen III-518, III-529 Tsunezaki, Seiji I-483 Tun, Su Wai I-472 Tun, Zar Zar I-483 Valdovinos, R. M. I-169, III-67 Van Nguyen, Tan III-544 Vladimir, Filaretov III-504 Vo, Anh Tuan III-518, III-529 Vu, Huu-Cong III-484 Wan, Jia-Ji II-588 Wang, Bing I-753 Wang, Chao I-292 Wang, Chaoxue I-258 Wang, Dian-Xiao II-166 Wang, Dong II-747 Wang, Guan II-650 Wang, Haiyu I-653 Wang, Han I-277
Wang, Hongdong II-722 Wang, Hui-mei I-160 Wang, Jia-Ji II-600 Wang, Jian II-116, II-615, III-32 Wang, Jian-Tao III-435 Wang, Jianzhong I-506 Wang, Jing I-412 Wang, Jin-Wu I-379 Wang, Kai I-317 Wang, Qiankun II-181 Wang, Ruijuan I-221 Wang, Shaoshao I-277 Wang, Shengli I-430, I-494 Wang, Weiwei II-278 Wang, Wenyan I-772, I-787 Wang, Xiao-Feng I-317, III-435 Wang, Xue III-814 Wang, Xuqing III-719 Wang, Yadong II-79 Wang, Ying III-684 Wang, Yingxin I-673 Wang, Yonghao III-672 Wang, Yuli III-234 Wang, Yu-Tian II-166, II-245, III-639 Wang, Zhe I-638 Wang, Zhenbang I-258 Wang, Zhipeng II-207 Wang, Zhuo II-722, II-731 Waraich, Atif III-220 Wei, Xiuxi I-80, I-97, III-769, III-785, III-876 Wei, Yixiong III-331 Wei, Yuanfei III-891 Wei, Yun-Sheng I-317 Weise, Thomas III-448 Win, Shwe Yee I-483 Wu, Daqing III-267 Wu, Geng I-401, III-117 Wu, Hongje III-615 Wu, Hongjie I-352, I-799, II-66, II-92, II-302, II-310, III-234, III-304 Wu, Lijun II-757 Wu, Lin II-415 Wu, Mengyun II-460 Wu, Peng II-615, II-628, II-638 Wu, Qianhao III-331 Wu, Shuang III-367 Wu, Xiaoqiang I-41 Wu, Xu II-747 Wu, Yulin II-650
Wu, Zhenghao II-438 Wu, Zhi-Ze I-317 Wu, Zhize III-448 Wu, Ziheng I-772, I-787 Xia, Junfeng II-757 Xia, Luyao I-673, I-685 Xia, Minghao II-496 Xiahou, Jianbing I-494 Xiang, Huimin II-547 Xiao, Di III-463 Xiao, Kai II-680 Xiao, Min III-170 Xiao, Ming II-319 Xie, Daiwei I-379 Xie, Jiang II-777, II-793 Xie, Kexin I-267 Xie, Lei I-292 Xie, Wen Fang I-420 Xie, Wenfang I-330 Xie, Yonghui I-221 Xie, Zhihua III-409 Xieshi, Mulin I-430, I-494 Xin, Sun I-617 Xin, Zhang III-106 Xing, Yuchen I-137 Xu, Caixia II-116, II-207, II-289, II-374, II-405 Xu, Cong I-763 Xu, Hui I-506, III-380 Xu, Lang I-444 Xu, Li-Xiang III-435 Xu, Mang III-698, III-709 Xu, Mengxia III-814 Xu, Xian-hong III-654 Xu, Xuexin I-430, I-494 Xu, Youhong I-799 Xu, Yuan I-535, I-549, I-564, I-589 Xue, Guangdong III-32 Xuwei, Cheng III-106 Yan, Jun III-234 Yan, Rui II-138 Yan, Wenhui II-757 Yang, Bin I-579, II-650 Yang, Bo II-747 Yang, Chang-bo I-160 Yang, Changsong III-234 Yang, Chengyun I-245 Yang, Hongri II-334
Yang, Jin II-556 Yang, Jing III-435 Yang, Jinpeng I-701, II-103, II-260 Yang, Lei II-481 Yang, Liu I-627 Yang, Weiguo I-579 Yang, Wuyi I-233 Yang, Xi I-627 Yang, Xiaokun II-207 Yang, Yongpu II-53 Yang, Yuanyuan III-473 Yang, Zhen II-687 Yang, Zhi II-53 Yao, Jian III-741 Ye, Lvyang III-769 Ye, Siyi II-507 Yin, Ruiying II-319 Yixiao, Yu I-605 You, Sheng II-807 You, Zhu-Hong II-220, II-451 You, Zhuhong II-270 Yu, Changqing I-763 Yu, Chuchu I-80 Yu, Jun II-319 Yu, Naizhao I-363 Yu, Ning II-245 Yu, Xiaoyong I-292 Yu, Yangming I-401, III-117 Yu, Yixuan III-876 Yuan, Changan I-352, I-799, II-66, II-92, III-55, III-93, III-304, III-504 Yuan, Changgan III-615 Yuan, Jianfeng II-747 Yuan, Lin II-767 Yuan, Qiyang I-339 Yuan, Shuangshuang II-615 Yuan, Zhiyong I-663 Yukhimets, Dmitry III-93 Yun, Yue II-270 Yupei, Zhang II-126 Zaitian, Zhang III-106 Zeng, Anping III-3 Zeng, Rong I-13, II-41 Zeng, Rong-Qiang I-125 Zha, Zhiyong I-401, III-117 Zhai, Pengjun I-339 Zhan, Zhenrun II-103, II-260 Zhang, Aihua I-277
Zhang, Bingjie III-32 Zhang, Dacheng III-473 Zhang, Fa II-138 Zhang, Fan I-258 Zhang, Guifen III-802 Zhang, Hang II-533 Zhang, Hao I-3, I-13, I-638, I-685, II-28, II-41, III-814 Zhang, Hongbo I-306 Zhang, Hongqi III-331 Zhang, JinFeng I-444, III-209 Zhang, Jinfeng III-150 Zhang, Jun I-753, I-772, I-787 Zhang, Kai II-638 Zhang, Le II-319 Zhang, Lei I-245 Zhang, Le-Xuan III-626 Zhang, Liqiang II-615 Zhang, Lizhu III-44 Zhang, Meng-Long II-220 Zhang, Na II-747 Zhang, Ping II-451 Zhang, Qiang II-138, II-394 Zhang, Qin I-627 Zhang, Shanwen I-306, I-763 Zhang, Tao I-233 Zhang, Tian-Yu III-341 Zhang, Tian-yu III-367 Zhang, Tianze II-628 Zhang, TianZhong III-209 Zhang, Wensong III-557, III-572 Zhang, Wu II-793 Zhang, Xiang I-430, I-494, II-356 Zhang, Xiaolong II-423, II-438, II-481, II-496, II-533 Zhang, Xiaozeng III-141 Zhang, Xin I-68 Zhang, Xing III-557, III-572 Zhang, Xinghui I-753 Zhang, Xinyuan II-517 Zhang, Yan I-112 Zhang, Yang II-79 Zhang, Yanni I-506 Zhang, Yin I-196 Zhang, Yu I-233 Zhang, Yuan I-739 Zhang, Yue II-663 Zhang, Yupei III-170 Zhang, Yuze I-456, I-464
Zhang, Yu-Zheng III-367 Zhang, Zhengtao II-181 Zhao, Bo-Wei I-739, II-220, II-451 Zhao, Guoqing II-3 Zhao, Hao I-522 Zhao, Hongguo III-604, III-672 Zhao, Jianhui I-663 Zhao, Liang I-363 Zhao, Meijie III-288 Zhao, Tingting II-103, II-260 Zhao, Wenyuan I-663 Zhao, Xin II-747 Zhao, Xingming I-352, I-799, II-66, II-92, III-304 Zhao, Xinming III-615 Zhao, Yuan I-772, I-787 Zhao, Zhong-Qiu III-341, III-367 Zhao, Ziyu II-507 Zhaobin, Pei I-605, I-617 Zheng, Aihua I-292 Zheng, Chunhou I-753, II-138 Zheng, Chun-Hou II-245, III-639 Zheng, Dulei I-339 Zheng, Jinping II-556 Zheng, Xiangwei I-412 Zheng, Yunping I-535, I-549, I-564, I-589 Zheng, Yu-qing III-654 Zheng, Zhaona II-650 Zhirabok, Alexey III-55 Zhong, Guichuang I-535 Zhong, Ji II-638 Zhong, Lianxin II-334 Zhong, Yixin II-670 Zhou, Hongqiao III-331 Zhou, Jiren II-270 Zhou, Yaya III-170 Zhou, Yihan II-747 Zhou, Yongquan III-802, III-830, III-846, III-860, III-876, III-891 Zhou, Yue II-715 Zhou, Zhangpeng I-430, I-494 Zhu, Fazhan I-772, I-787 Zhu, Hui-Sheng II-588 Zhu, Jun-jun III-367 Zhu, Qingling I-27, I-41 Zhu, Xiaobo III-615 Zhu, Yan I-160 Zhu, Yumeng II-722 Zhuang, Liying I-412 Zitong, Yan III-106
Zong, Sha III-409 Zou, Feng I-112 Zou, Le I-317 Zou, Pengcheng I-97 Zou, Quan II-14
Zou, Zhengrong III-492 Zou, Zirui I-549 Zu, Ruiling II-415 Zuev, Alexander III-55 Zuo, Zonglan II-138