English Pages 840 [835] Year 2023
LNCS 14088
De-Shuang Huang · Prashan Premaratne · Baohua Jin · Boyang Qu · Kang-Hyun Jo · Abir Hussain (Eds.)
Advanced Intelligent Computing Technology and Applications 19th International Conference, ICIC 2023 Zhengzhou, China, August 10–13, 2023 Proceedings, Part III
Lecture Notes in Computer Science Founding Editors Gerhard Goos Juris Hartmanis
Editorial Board Members Elisa Bertino, Purdue University, West Lafayette, IN, USA Wen Gao, Peking University, Beijing, China Bernhard Steffen , TU Dortmund University, Dortmund, Germany Moti Yung , Columbia University, New York, NY, USA
The series Lecture Notes in Computer Science (LNCS), including its subseries Lecture Notes in Artificial Intelligence (LNAI) and Lecture Notes in Bioinformatics (LNBI), has established itself as a medium for the publication of new developments in computer science and information technology research, teaching, and education. LNCS enjoys close cooperation with the computer science R & D community, the series counts many renowned academics among its volume editors and paper authors, and collaborates with prestigious societies. Its mission is to serve this international community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and postproceedings. LNCS commenced publication in 1973.
Editors
De-Shuang Huang, Department of Computer Science, Eastern Institute of Technology, Zhejiang, China
Prashan Premaratne, University of Wollongong, North Wollongong, NSW, Australia
Baohua Jin, Zhengzhou University of Light Industry, Zhengzhou, China
Boyang Qu, Zhong Yuan University of Technology, Zhengzhou, China
Kang-Hyun Jo, University of Ulsan, Ulsan, Korea (Republic of)
Abir Hussain, Department of Computer Science, Liverpool John Moores University, Liverpool, UK
ISSN 0302-9743
ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-981-99-4748-5
ISBN 978-981-99-4749-2 (eBook)
https://doi.org/10.1007/978-981-99-4749-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023, corrected publication 2023

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
The International Conference on Intelligent Computing (ICIC) was started to provide an annual forum dedicated to emerging and challenging topics in artificial intelligence, machine learning, pattern recognition, bioinformatics, and computational biology. It aims to bring together researchers and practitioners from both academia and industry to share ideas, problems, and solutions related to the multifaceted aspects of intelligent computing. ICIC 2023, held in Zhengzhou, China, August 10–13, 2023, constituted the 19th International Conference on Intelligent Computing. It built upon the success of ICIC 2022 (Xi’an, China), ICIC 2021 (Shenzhen, China), ICIC 2020 (Bari, Italy), ICIC 2019 (Nanchang, China), ICIC 2018 (Wuhan, China), ICIC 2017 (Liverpool, UK), ICIC 2016 (Lanzhou, China), ICIC 2015 (Fuzhou, China), ICIC 2014 (Taiyuan, China), ICIC 2013 (Nanning, China), ICIC 2012 (Huangshan, China), ICIC 2011 (Zhengzhou, China), ICIC 2010 (Changsha, China), ICIC 2009 (Ulsan, South Korea), ICIC 2008 (Shanghai, China), ICIC 2007 (Qingdao, China), ICIC 2006 (Kunming, China), and ICIC 2005 (Hefei, China). This year, the conference concentrated mainly on theories and methodologies as well as emerging applications of intelligent computing. Its aim was to unify the picture of contemporary intelligent computing techniques as an integral concept that highlights the trends in advanced computational intelligence and bridges theoretical research with applications. Therefore, the theme for this conference was “Advanced Intelligent Computing Technology and Applications”. Papers that focused on this theme were solicited, addressing theories, methodologies, and applications in science and technology. ICIC 2023 received 828 submissions from 12 countries and regions. All papers went through a rigorous peer-review procedure and each paper received at least three review reports. 
Based on the review reports, the Program Committee finally selected 337 high-quality papers for presentation at ICIC 2023 and inclusion in five volumes of proceedings published by Springer: three volumes of Lecture Notes in Computer Science (LNCS) and two volumes of Lecture Notes in Artificial Intelligence (LNAI). This volume, LNCS 14088, includes 68 papers. The organizers of ICIC 2023, including Eastern Institute of Technology, China; Zhongyuan University of Technology, China; and Zhengzhou University of Light Industry, China, made an enormous effort to ensure the success of the conference. We hereby would like to thank the members of the Program Committee and the referees for their collective effort in reviewing and soliciting the papers. In particular, we would like to thank all the authors for contributing their papers. Without the high-quality submissions from the authors, the success of the conference would not have been possible. Finally, we are especially grateful to the International Neural Network Society and the National Science Foundation of China for their sponsorship.
June 2023
De-Shuang Huang · Prashan Premaratne · Boyang Qu · Baohua Jin · Kang-Hyun Jo · Abir Hussain
Organization
General Co-chairs
De-Shuang Huang, Eastern Institute of Technology, China
Shizhong Wei, Zhengzhou University of Light Industry, China

Program Committee Co-chairs
Prashan Premaratne, University of Wollongong, Australia
Baohua Jin, Zhengzhou University of Light Industry, China
Kang-Hyun Jo, University of Ulsan, Republic of Korea
Abir Hussain, Liverpool John Moores University, UK

Organizing Committee Co-chair
Hui Jing, Zhengzhou University of Light Industry, China

Organizing Committee Members
Fubao Zhu, Zhengzhou University of Light Industry, China
Qiuwen Zhang, Zhengzhou University of Light Industry, China
Haodong Zhu, Zhengzhou University of Light Industry, China
Wei Huang, Zhengzhou University of Light Industry, China
Hongwei Tao, Zhengzhou University of Light Industry, China
Weiwei Zhang, Zhengzhou University of Light Industry, China

Award Committee Co-chairs
Michal Choras, Bydgoszcz University of Science and Technology, Poland
Hong-Hee Lee, University of Ulsan, Republic of Korea
Tutorial Co-chairs
Yoshinori Kuno, Saitama University, Japan
Phalguni Gupta, Indian Institute of Technology Kanpur, India

Publication Co-chairs
Valeriya Gribova, Far Eastern Branch of Russian Academy of Sciences, Russia
M. Michael Gromiha, Indian Institute of Technology Madras, India
Boyang Qu, Zhengzhou University, China

Special Session Co-chairs
Jair Cervantes Canales, Autonomous University of Mexico State, Mexico
Chenxi Huang, Xiamen University, China
Dhiya Al-Jumeily, Liverpool John Moores University, UK

Special Issue Co-chairs
Kyungsook Han, Inha University, Republic of Korea
Laurent Heutte, Université de Rouen Normandie, France

International Liaison Co-chair
Prashan Premaratne, University of Wollongong, Australia

Workshop Co-chairs
Yu-Dong Zhang, University of Leicester, UK
Hee-Jun Kang, University of Ulsan, Republic of Korea
Publicity Co-chairs
Chun-Hou Zheng, Anhui University, China
Dhiya Al-Jumeily, Liverpool John Moores University, UK
Jair Cervantes Canales, Autonomous University of Mexico State, Mexico

Exhibition Contact Co-chair
Fubao Zhu, Zhengzhou University of Light Industry, China
Program Committee Members
Abir Hussain, Liverpool John Moores University, UK
Antonio Brunetti, Polytechnic University of Bari, Italy
Antonino Staiano, Università di Napoli Parthenope, Italy
Bin Liu, Beijing Institute of Technology, China
Bin Qian, Kunming University of Science and Technology, China
Bin Yang, Zaozhuang University, China
Bing Wang, Anhui University of Technology, China
Binhua Tang, Hohai University, China
Bingqiang Liu, Shandong University, China
Bo Li, Wuhan University of Science and Technology, China
Changqing Shen, Soochow University, China
Chao Song, Harbin Medical University, China
Chenxi Huang, Xiamen University, China
Chin-Chih Chang, Chung Hua University, Taiwan
Chunhou Zheng, Anhui University, China
Chunmei Liu, Howard University, USA
Chunquan Li, University of South China, China
Dahjing Jwo, National Taiwan Ocean University, Taiwan
Dakshina Ranjan Kisku, National Institute of Technology Durgapur, India
Dan Feng, Huazhong University of Science and Technology, China
Daowen Qiu, Sun Yat-sen University, China
Dharmalingam Muthusamy, Bharathiar University, India
Dhiya Al-Jumeily, Liverpool John Moores University, UK
Dong Wang, University of Jinan, China
Dunwei Gong, China University of Mining and Technology, China
Eros Gian Pasero, Politecnico di Torino, Italy
Evi Sjukur, Monash University, Australia
Fa Zhang, Beijing Institute of Technology, China
Fengfeng Zhou, Jilin University, China
Fei Guo, Central South University, China
Gaoxiang Ouyang, Beijing Normal University, China
Giovanni Dimauro, University of Bari, Italy
Guoliang Li, Huazhong Agricultural University, China
Han Zhang, Nankai University, China
Haibin Liu, Beijing University of Technology, China
Hao Lin, University of Electronic Science and Technology of China, China
Haodi Feng, Shandong University, China
Hongjie Wu, Suzhou University of Science and Technology, China
Hongmin Cai, South China University of Technology, China
Jair Cervantes, Autonomous University of Mexico State, Mexico
Jixiang Du, Huaqiao University, China
Jing Hu, Wuhan University of Science and Technology, China
Jiawei Luo, Hunan University, China
Jian Huang, University of Electronic Science and Technology of China, China
Jian Wang, China University of Petroleum, China
Jiangning Song, Monash University, Australia
Jinwen Ma, Peking University, China
Jingyan Wang, Abu Dhabi Department of Community Development, UAE
Jinxing Liu, Qufu Normal University, China
Joaquin Torres-Sospedra, Universidade do Minho, Portugal
Juan Liu, Wuhan University, China
Jun Zhang, Anhui University, China
Junfeng Xia, Anhui University, China
Jungang Lou, Huzhou University, China
Kachun Wong, City University of Hong Kong, China
Kanghyun Jo, University of Ulsan, Republic of Korea
Khalid Aamir, University of Sargodha, Pakistan
Kyungsook Han, Inha University, Republic of Korea
L. Gong, Nanjing University of Posts and Telecommunications, China
Laurent Heutte, Université de Rouen Normandie, France
Le Zhang, Sichuan University, China
Lejun Gong, Nanjing University of Posts and Telecommunications, China
Liang Gao, Huazhong Univ. of Sci. & Tech., China
Lida Zhu, Huazhong Agriculture University, China
Marzio Pennisi, University of Eastern Piedmont, Italy
Michal Choras, Bydgoszcz University of Science and Technology, Poland
Michael Gromiha, Indian Institute of Technology Madras, India
Ming Li, Nanjing University, China
Minzhu Xie, Hunan Normal University, China
Mohd Helmy Abd Wahab, Universiti Tun Hussein Onn Malaysia, Malaysia
Nicola Altini, Polytechnic University of Bari, Italy
Peng Chen, Anhui University, China
Pengjiang Qian, Jiangnan University, China
Phalguni Gupta, GLA University, India
Prashan Premaratne, University of Wollongong, Australia
Pufeng Du, Tianjin University, China
Qi Zhao, University of Science and Technology Liaoning, China
Qingfeng Chen, Guangxi University, China
Qinghua Jiang, Harbin Institute of Technology, China
Quan Zou, University of Electronic Science and Technology of China, China
Rui Wang, National University of Defense Technology, China
Saiful Islam, Aligarh Muslim University, India
Seeja K. R., Indira Gandhi Delhi Technical University for Women, India
Shanfeng Zhu, Fudan University, China
Shikui Tu, Shanghai Jiao Tong University, China
Shitong Wang, Jiangnan University, China
Shixiong Zhang, Xidian University, China
Sungshin Kim, Pusan National University, Republic of Korea
Surya Prakash, IIT Indore, India
Tatsuya Akutsu, Kyoto University, Japan
Tao Zeng, Guangzhou Laboratory, China
Tieshan Li, University of Electronic Science and Technology of China, China
Valeriya Gribova, Institute of Automation and Control Processes, Far Eastern Branch of Russian Academy of Sciences, Russia
Vincenzo Randazzo, Politecnico di Torino, Italy
Waqas Haider, Kohsar University Murree, Pakistan
Wen Zhang, Huazhong Agricultural University, China
Wenbin Liu, Guangzhou University, China
Wensheng Chen, Shenzhen University, China
Wei Chen, Chengdu University of Traditional Chinese Medicine, China
Wei Peng, Kunming University of Science and Technology, China
Weichiang Hong, Asia Eastern University of Science and Technology, Taiwan
Weidong Chen, Shanghai Jiao Tong University, China
Weiwei Kong, Xi’an University of Posts and Telecommunications, China
Weixiang Liu, Shenzhen University, China
Xiaodi Li, Shandong Normal University, China
Xiaoli Lin, Wuhan University of Science and Technology, China
Xiaofeng Wang, Hefei University, China
Xiao-Hua Yu, California Polytechnic State University, USA
Xiaoke Ma, Xidian University, China
Xiaolei Zhu, Anhui Agricultural University, China
Xiangtao Li, Jilin University, China
Xin Zhang, Jiangnan University, China
Xinguo Lu, Hunan University, China
Xingwei Wang, Northeastern University, China
Xinzheng Xu, China University of Mining and Technology, China
Xiwei Liu, Tongji University, China
Xiyuan Chen, Southeast Univ., China
Xuequn Shang, Northwestern Polytechnical University, China
Xuesong Wang, China University of Mining and Technology, China
Yansen Su, Anhui University, China
Yi Xiong, Shanghai Jiao Tong University, China
Yu Xue, Huazhong University of Science and Technology, China
Yizhang Jiang, Jiangnan University, China
Yonggang Lu, Lanzhou University, China
Yongquan Zhou, Guangxi University for Nationalities, China
Yudong Zhang, University of Leicester, UK
Yunhai Wang, Shandong University, China
Yupei Zhang, Northwestern Polytechnical University, China
Yushan Qiu, Shenzhen University, China
Yunxia Liu, Zhengzhou Normal University, China
Zhanli Sun, Anhui University, China
Zhenran Jiang, East China Normal University, China
Zhengtao Yu, Kunming University of Science and Technology, China
Zhenyu Xuan, University of Texas at Dallas, USA
Zhihong Guan, Huazhong University of Science and Technology, China
Zhihua Cui, Taiyuan University of Science and Technology, China
Zhiping Liu, Shandong University, China
Zhiqiang Geng, Beijing University of Chemical Technology, China
Zhongqiu Zhao, Hefei University of Technology, China
Zhuhong You, Northwestern Polytechnical University, China
Contents – Part III
Biomedical Data Modeling and Mining

A Segmentation Method of 3D Liver Image Based on Multi-scale Feature Fusion and Coordinate Attention Mechanism
  Meng Zhang, Xiaolong Zhang, He Deng, and Hongwei Ren
A Prior-Guided Generative Adversarial Net for Semantically Strict Ultrasound Images Augmentation
  Ruiguo Yu, Pan Sun, Xuewei Li, Ruixuan Zhang, Zhiqiang Liu, and Jie Gao
DETA-Net: A Dual Encoder Network with Text-Guided Attention Mechanism for Skin-Lesions Segmentation
  Cong Shen, Xinyue Wang, Jijun Tang, and Zhijun Liao
PEA-U-Net: Parallel Embedded Attention for Liver and Tumor Segmentation in CT Volumes
  Weinian Cao, Shengxiang Rao, Lijun Luo, Huijuan Zhang, and Changqing Yin
Federated Semi-supervised Medical Image Segmentation Based on Asynchronous Transmission
  Fangbo Liu and Feng Yang
Generative Adversarial Network-Based Data Augmentation Method for Anti-coronavirus Peptides Prediction
  Jiliang Xu, Chungui Xu, Ruifen Cao, Yonghui He, Yannan Bin, and Chun-Hou Zheng
Prediction of Cancer Driver Genes Based on Pyramidal Dynamic Mapping Algorithm
  Pi-Jing Wei, Shu-Li Zhou, Rui-Fen Cao, Yansen Su, and Chun-Hou Zheng
Identification of CircRNA-Disease Associations from the Integration of Multi-dimensional Bioinformatics with Graph Auto-encoder and Attention Fusion Model
  Lin Yuan, Jiawang Zhao, Zhen Shen, Wendong Yu, Hongwei Wei, Shengguo Sun, Xingang Wang, and Yushui Geng
An Improved Method for CFNet Identifying Glioma Cells
  Lin Yuan, Jinling Lai, Zhen Shen, Wendong Yu, Hongwei Wei, Ling Zhao, Zhijie Xu, Xingang Wang, and Yushui Geng
Extraction of Relationship Between Esophageal Cancer and Biomolecules Based on BioBERT
  Dayu Tan, Yang Yang, Minglu Wang, Pengpeng Wang, Lejun Zhang, Tseren-Onolt Ishdorj, and Yansen Su
BIJE: A Joint Extraction Model for Biomedical Information Extraction
  Yansen Su, Pengpeng Wang, Shuna Cui, Fei Xu, and Tseren-Onolt Ishdorj
MGCPI: A Multi-granularity Neural Network for Predicting Compound-Protein Interactions
  Peixuan Lin, Likun Jiang, Fatma S. Ahmed, Xinru Ruan, Xiangrong Liu, and Juan Liu
SSTVC: Carotid Plaque Classification from Ultrasound Images Using Self-supervised Triple-View Contrast Learning
  Cheng Li, Xiaoyue Fang, Ran Zhou, Zhi Yang, and Haitao Gan
Multi-level Subgraph Representation Learning for Drug-Disease Association Prediction Over Heterogeneous Biological Information Network
  Bo-Wei Zhao, Xiao-Rui Su, Yue Yang, Dong-Xu Li, Peng-Wei Hu, Zhu-Hong You, and Lun Hu
An Unsupervised Domain Adaptive Network Based on Category Prototype Alignment for Medical Image Segmentation
  Mei Yu, Zhiyuan Xu, Jie Gao, Jian Yu, and Mankun Zhao
A Novel Graph Representation Learning Model for Drug Repositioning Using Graph Transition Probability Matrix Over Heterogenous Information Networks
  Dong-Xu Li, Xun Deng, Bo-Wei Zhao, Xiao-Rui Su, Guo-Dong Li, Zhu-Hong You, Peng-Wei Hu, and Lun Hu
MORGAT: A Model Based Knowledge-Informed Multi-omics Integration and Robust Graph Attention Network for Molecular Subtyping of Cancer
  Haobo Shi, Yujie Gu, Hengyuan Zhang, Xuan Li, and Yangkun Cao
Biomedical Informatics Theory and Methods

Explainable Stuttering Recognition Using Axial Attention
  Yu Ma, Yuting Huang, Kaixiang Yuan, Guangzhe Xuan, Yongzi Yu, Hengrui Zhong, Rui Li, Jian Shen, Kun Qian, Bin Hu, Björn W. Schuller, and Yoshiharu Yamamoto
Optimizing Cardiac Surgery Risk Prediction: An Machine Learning Approach with Counterfactual Explanations
  Dengkang Qin, Mengxue Liu, Zheng Chen, and Qian Lei
Patient Mortality Prediction Based on Two-Layer Attention Neural Network
  Lin Wang, Zhengzhong Wang, Quanrun Song, Changtong Ding, Xiaoning Li, Xiangwei Zhang, and Shichao Geng
Identifying Drug–Target Interactions Through a Combined Graph Attention Mechanism and Self-attention Sequence Embedding Model
  Kang Wang, Jing Hu, and Xiaolong Zhang
An Omics-Based Metastasis Prediction Model for Osteosarcoma Patients Using Multi-scale Attention Network
  Ning Wang and Yizhang Jiang
Spectral Clustering of Single-Cell RNA-Sequencing Data by Multiple Feature Sets Affinity
  Yang Liu, Feng Li, Junliang Shang, Daohui Ge, Qianqian Ren, and Shengjun Li
Seizure Prediction Based on Multidimensional EEG Spatial Matrix and Residual Network Structure
  Jiahao Zhang, Qingfang Meng, and Zewen Wang
LANCMDA: Predicting MiRNA-Disease Associations via LightGBM with Attributed Network Construction
  Xu-Ran Dou, Wen-Yu Xi, Tian-Ru Wu, Cui-Na Jiao, Jin-Xing Liu, and Ying-Lian Gao
DBL-MPE: Deep Broad Learning for Prediction of Response to Neo-adjuvant Chemotherapy Using MRI-Based Multi-angle Maximal Enhancement Projection in Breast Cancer
  Zihan Cao, Zhenwei Shi, XiaoMei Huang, Chu Han, Xinyu Dong, Zhihe Zhao, Dan Wang, Peng Xu, Zaiyi Liu, and Wenbin Liu
Fed-CSA: Channel Spatial Attention and Adaptive Weights Aggregation-Based Federated Learning for Breast Tumor Segmentation on MRI
  Xinyu Dong, Zhenwei Shi, XiaoMei Huang, Chu Han, Zihan Cao, Zhihe Zhao, Dan Wang, Peng Xu, Zaiyi Liu, and Wenbin Liu
Identify Complex Higher-Order Associations Between Alzheimer’s Disease Genes and Imaging Markers Through Improved Adaptive Sparse Multi-view Canonical Correlation Analysis
  Yi-Ming Wang, Xiang-Zhen Kong, Bo-Xin Guan, Chun-Hou Zheng, and Ying-Lian Gao
A Deep Learning Approach Incorporating Data Missing Mechanism in Predicting Acute Kidney Injury in ICU
  Yuan Zhang, Zhengbo Zhang, Xiaoli Liu, Lei Zha, Fengcong, Xiaorui Su, Bowei Zhao, Lun Hu, and Pengwei Hu
Medical Image Segmentation Based on Federated Distillation Optimization Learning on Non-IID Data
  Fangbo Liu and Feng Yang
Spatial Domain Identification Based on Graph Attention Denoising Auto-encoder
  Yue Gao, Dai-Jun Zhang, Cui-Na Jiao, Ying-Lian Gao, and Jin-Xing Liu

Intelligent Computing in Computational Biology

Molecular Identification Using Deep Learning Method
  Mingxiang Gao and Bo Li
RareDR: A Drug Repositioning Approach for Rare Diseases Based on Knowledge Graph
  Yuehan Huang, Shuting Jin, Xinyu Yu, Changzhi Jiang, Zhengqiu Yu, Xiangrong Liu, and Shaohui Huang
Multi-omics Cancer Subtype Recognition Based on Multi-kernel Partition Aligned Subspace Clustering
  Jian Liu, Long Hou, and Shuguang Ge
Prediction of LncRNA-Protein Interactions Based on Multi-kernel Fusion and Graph Auto-Encoders
  Dongdong Mao, Cong Shen, Ruilin Wu, Yuyang Han, Yankai Wu, Jinxuan Wang, Jijun Tang, and Zhijun Liao
LXLMEPS: Leveraging the XGB-lCE-Based Model for Early Prediction of Sepsis
  Zhang Leyi, Long Yingjie, Hu Yingbiao, and Li Huinian
DeepMAT: Predicting Metabolic Pathways of Compounds Using a Message Passing and Attention-Based Neural Networks
  Hayat Ali Shah, Juan Liu, Zhihui Yang, and Jing Feng
SpliceSCANNER: An Accurate and Interpretable Deep Learning-Based Method for Splice Site Prediction
  Rongxing Wang, Junwei Xu, Xiaodi Huang, Wangjing Qi, and Yanju Zhang
TransOrga: End-To-End Multi-modal Transformer-Based Organoid Segmentation
  Yiming Qin, Jiajia Li, Yulong Chen, Zikai Wang, Yu-An Huang, Zhuhong You, Lun Hu, Pengwei Hu, and Feng Tan
GPU Optimization of Biological Macromolecule Multi-tilt Electron Tomography Reconstruction Algorithm
  Zi-Ang Fu, Xiaohua Wan, and Fa Zhang
Multi-task Question Generation Based Data Augmentation for Biomedical Answer Generation
  Junting Zhao, Jun Bai, Wenge Rong, Yuanxin Ouyang, and Zhang Xiong
Prediction of circRNA-Binding Protein Site Based on Hybrid Neural Networks and Recurrent Forests Method
  Zewen Wang, Qingfang Meng, Qiang Zhang, and Jiahao Zhang
An Improved Variational Autoencoder-Based Clustering Method for Pan-Cancer Diagnosis and Subtyping
  Binhua Tang and Jiafei Nie
A Stacking-Based Ensemble Learning Predictor Combined with Particle Swarm Optimizer for Identifying RNA Pseudouridine Sites
  Xiao Wang, Pengfei Li, Lijun Han, and Rong Wang
GeneSpider: Inferring Gene Regulation Relationships Through Graph Neural Network from Single-Cell RNA Sequence Data
  Zhihua Du, Xing Zhong, Min Fang, and Jianqiang Li
Attention-Aware Contrastive Learning for Predicting Peptide-HLA Binding Specificity
  Pengyu Luo, Yuehan Huang, Xinyi Zhang, Lian Shen, Yuan Lin, Xiangrong Liu, and Xiaoyang Huang
A Transformer-Based Deep Learning Approach with Multi-layer Feature Processing for Accurate Prediction of Protein-DNA Binding Residues
  Haipeng Zhao, Baozhong Zhu, Tengsheng Jiang, Zhiming Cui, and Hongjie Wu
TAPE-Pero: Using Deep Representation Learning Model to Identify and Localize Peroxisomal Proteins
  Jianan Sui, Yuehui Chen, Yi Cao, and Yaou Zhao
Classification of Coding and Non-coding Genes in Paeonia Lactiflora Pall Based on Machine Learning
  Bolun Yang, Yuehui Chen, Yaou Zhao, and Yi Cao
Accurate Identification of Submitochondrial Protein Location Based on Deep Representation Learning Feature Fusion
  Jianan Sui, Yuehui Chen, Yi Cao, and Yaou Zhao
Identification of Active and Binding Sites with Multi-dimensional Feature Vectors and K-Nearest Neighbor Classification Algorithm
  Baichuan Zhang, Zhuo Wang, Wenzheng Bao, and Honglin Cheng
Mit Protein Transformer: Identification Mitochondrial Proteins with Transformer Model
  Baichuan Zhang, Luying He, Qi Wang, Zhuo Wang, Wenzheng Bao, and Honglin Cheng
Plant Vacuole Protein Classification with Ensemble Stacking Model
  Xunguang Ju, Kai Xiao, Luying He, Qi Wang, Zhuo Wang, and Wenzheng Bao
De Novo Drug Design Using Unified Multilayer Simple Recurrent Unit Model
  Zonghao Li, Jing Hu, and Xiaolong Zhang
DTI-MACF: Drug-Target Interaction Prediction via Multi-component Attention Network
  Jiejin Deng, Yijia Zhang, Jing Zhang, Yaohua Pan, and Mingyu Lu
Intelligent Computing in Drug Design

Drug-Target Interaction Prediction Based on Knowledge Graph and Convolutional Neural Network Integrated with CBAM Module
  Zhongyu He
Deep Learning-Based Prediction of Drug-Target Binding Affinities by Incorporating Local Structure of Protein
  Runhua Zhang, Baozhong Zhu, Tengsheng Jiang, Zhiming Cui, and Hongjie Wu
Drug-Target Interaction Prediction Based on Interpretable Graph Transformer Model
  Baozhong Zhu, Runhua Zhang, Tengsheng Jiang, Zhiming Cui, and Hongjie Wu
NIEE: Modeling Edge Embeddings for Drug-Disease Association Prediction via Neighborhood Interactions
  Yu Jiang, Jingli Zhou, Yong Zhang, Yulin Wu, Xuan Wang, and Junyi Li
A Novel Descriptor and Molecular Graph-Based Bimodal Contrastive Learning Framework for Drug Molecular Property Prediction
  Zhengda He, Linjie Chen, Hao Lv, Rui-ning Zhou, Jiaying Xu, Yadong Chen, Jianhua Hu, and Yang Gao
Multi-objective Optimization-Based Approach for Detection of Breast Cancer Biomarkers
  Jiaxin Yang, Chuanyuan Wang, Duanchen Sun, and Zhi-Ping Liu
MOFNet: A Deep Learning Framework of Integrating Multi-omics Data for Breast Cancer Diagnosis
  Chunxiao Zhang, Pengpai Li, Duanchen Sun, and Zhi-Ping Liu
EEG Convolutional Sparse Transformer for Epilepsy Detection and Related Drug Classification
  Zhengda He, Linjie Chen, Hao Lv, Rui-ning Zhou, Jiaying Xu, Yadong Chen, Jianhua Hu, and Yang Gao
Adopting Autodock Koto for Virtual Screening of COVID-19
  Zhangfan Yang, Kun Cao, Junkai Ji, Zexuan Zhu, and Jianqiang Li
An Efficient Drug Design Method Based on Drug-Target Affinity
  Haoran Liu, Xiaolong Zhang, Xiaoli Lin, and Jing Hu
xxii
Contents – Part III
Drug-Target Affinity Prediction Based on Self-attention Graph Pooling and Mutual Interaction Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776 Xizi Wang, Jing Hu, and Xiaolong Zhang DU-DANet: Efficient 3D Automatic Brain Tumor Segmentation Based on Dual Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791 Zhenhua Cai, Xiaoli Lin, Xiaolong Zhang, and Jing Hu Drug-Target Interaction Prediction Based on Knowledge Graph Embedding and BiLSTM Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803 Yiwen Zhang and Mengqi Cheng Correction to: SpliceSCANNER: An Accurate and Interpretable Deep Learning-Based Method for Splice Site Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . Rongxing Wang, Junwei Xu, Xiaodi Huang, Wangjing Qi, and Yanju Zhang
C1
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
Biomedical Data Modeling and Mining
A Segmentation Method of 3D Liver Image Based on Multi-scale Feature Fusion and Coordinate Attention Mechanism
Meng Zhang1,2,3, Xiaolong Zhang1,2,3(B), He Deng1,2,3, and Hongwei Ren4
1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei, China
[email protected]
2 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
3 Hubei Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, Hubei, China
4 Tianyou Hospital Affiliated to Wuhan University of Science and Technology, Wuhan, Hubei, China
Abstract. Organs in 3D liver images are highly similar in appearance, and U-Net fuses features of different semantic levels through a simple connection; both factors limit the segmentation accuracy of the network. To solve these problems, this paper proposes a 3D liver semantic segmentation method based on multi-scale feature fusion and a coordinate attention mechanism. First, a multi-scale feature fusion module is used in the encoder of U-Net to capture multi-scale features. Then, a coordinate attention mechanism is used to fuse low-level and high-level features and locate regions of interest. Finally, a deep supervision mechanism improves the segmentation of edge details. The experimental results show that on the LiTS dataset, the Dice similarity coefficient (DSC) of this method reaches 96.5%; compared with the U3-Net + DC method, the DSC increases by 0.1% and the relative volume difference (RVD) decreases by 1.09%. On the CHAOS dataset, the DSC of this method reaches 96.8%, which is 0.2% higher than that of CANet. On the MRI dataset of a hospital, the DSC of this method reaches 97.2%.
Keywords: 3D liver image · semantic segmentation · multi-scale feature fusion · coordinate attention · deep supervision
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 3–15, 2023. https://doi.org/10.1007/978-981-99-4749-2_1
1 Introduction
Liver cancer is one of the most common cancers. Because its early symptoms are unclear, it is difficult to detect, and it causes a large number of deaths every year. Computed Tomography (CT) [1] and Magnetic Resonance Imaging (MRI) [2] are the main medical imaging methods for detecting and diagnosing liver cancer in clinical practice. Currently, in clinical diagnosis in hospitals, doctors in professional imaging departments manually
label the liver and liver tumor sites to provide a reference for treatment plans. However, this approach is influenced by subjective factors and by the experience of the experts, and because the amount of image data is large, it requires a great deal of time. Applying artificial intelligence algorithms to assist in accurate liver segmentation is therefore an important and challenging task.
Liver segmentation methods are divided into traditional machine learning methods and deep learning methods. Traditional machine learning methods include the threshold method, the region growing method, and the pixel clustering method. They mainly use attributes such as the gray values of medical images to compute statistical features, build models, and finally segment the images. However, such methods require carefully designed algorithms, manual intervention, and professional knowledge, and they are sensitive to factors such as gray values and liver shape, which limits their development.
With the development of artificial intelligence, Convolutional Neural Networks (CNNs) [3], represented by the Visual Geometry Group Network (VGGNet) [4] and the Residual Network (ResNet) [5], have performed well in the field of medical image processing. Because a CNN can only extract local features, Long et al. [6] proposed Fully Convolutional Networks (FCN) for semantic segmentation, which can take input images of any resolution. Considering characteristics of medical images such as simple semantics and small data volume, Ronneberger et al. [7] proposed the U-Net model. Compared with FCN, U-Net uses skip connections to connect feature maps of the same layer, which compensates for the features lost during subsampling and ensures that the final recovered feature map incorporates more low-level features and finer detail. Many researchers have proposed new models based on U-Net, such as U-Net++ [8] and U-Net 3+ [9].
Li et al. [10] proposed H-DenseUNet, a novel hybrid densely connected network that effectively combines 2D intra-slice information and 3D inter-slice information. Szegedy et al. [11] proposed a deep convolutional neural network called Inception. Dou et al. [12] proposed a 3D deeply supervised convolutional network with three output branches to improve segmentation accuracy. Hou et al. [13] proposed coordinate attention, building on SENet [14] and CBAM [15]; it encodes channel relationships and long-range dependencies with precise positional information, helping networks locate targets of interest more accurately. Jiang et al. [16] proposed an attention hybrid connection network that combines soft and hard attention mechanisms with long and short skip connections. This work applied the attention mechanism to liver tumor segmentation for the first time, demonstrating the effectiveness of attention in medical image segmentation. It also proposed a cascaded network consisting of a liver localization network, a liver segmentation network, and a tumor segmentation network.
The above methods have made progress, but shortcomings remain. First, they cannot fully extract the local and spatial features of the image. Second, low-level and high-level features are fused only through a simple connection, which introduces a large number of redundant features. In addition, the coordinate attention mechanism described above is defined in 2D; this paper extends it to 3D space. Therefore, this paper proposes a model called MCNet, whose novelty lies in a multi-scale feature fusion module and a coordinate attention mechanism. The multi-scale feature fusion module can extract features from different receptive fields, and
can fully extract context information and local information in 3D space. The coordinate attention mechanism can effectively reduce redundant information at the skip connections and refine the edge information of the liver. The main contributions of this paper are as follows:
1) A 3D liver image segmentation model based on multi-scale feature fusion and a coordinate attention mechanism is proposed.
2) To fully utilize 3D spatial information, an improved multi-scale feature fusion module is added to obtain a larger receptive field and extract multi-scale features.
3) To fully fuse low-level and high-level features, a coordinate attention mechanism is added to the skip connections, and a deep supervision mechanism is introduced to improve segmentation performance.
2 Method
2.1 MCNet Structure
Based on the U-Net network, this paper designs the MCNet network. MCNet consists of an encoder and a decoder: the encoder is responsible for feature extraction, and the decoder is responsible for feature map recovery and feature localization. The network model of MCNet is shown in Fig. 1.
Fig. 1. Architecture of MCNet
The encoder has four layers. Each layer consists of a multi-scale feature fusion module, batch normalization, a Rectified Linear Unit (ReLU) activation function, and a pooling operation. The decoder also has four layers, and each layer consists of a transposed convolution, a residual block, batch normalization, and a ReLU activation function. In the U-Net network model, the feature map obtained by the encoder and the feature map obtained by the decoder are directly connected at the skip connection. This connection is too simple
and will result in a large amount of redundant information. In this paper, the coordinate attention mechanism is added to the direct connection of the U-Net network. Finally, a deep supervision mechanism is added to alleviate the vanishing or exploding gradients caused by the small amount of data.
2.2 Multi-scale Feature Fusion Module
In the U-Net model, the receptive fields of the two convolution operations of the encoder are insufficient to capture the context and boundary information of the liver image. Therefore, this paper designs a multi-scale feature fusion module to expand the receptive field. A commonly used multi-scale feature fusion module places a pooling operation and three convolution operations with kernel sizes of 1 × 1, 3 × 3, and 5 × 5 in parallel and splices their outputs. Although this yields feature maps under three different receptive fields, its feature representation ability is still insufficient. The multi-scale feature fusion module designed in this paper makes feature fusion more effective. Its structure is shown in Fig. 2.
Fig. 2. Multi-scale Feature Fusion Module
In the multi-scale feature fusion module, the input feature map $X$ first passes through a 1 × 1 × 1 convolution that adjusts the number of channels, giving the feature map $X_1$. The feature map $X_2$ is obtained by applying a 3 × 3 × 3 convolution to $X_1$. Then $X_1$ and $X_2$ are concatenated and fed into a 5 × 5 × 5 convolution to obtain the feature map $X_3$. Finally, $X_1$, $X_2$, and $X_3$ are concatenated and fed into a 7 × 7 × 7 convolution to obtain the feature map $X_4$, as shown in Eq. (1).
$$
\begin{cases}
X_1 = \mathrm{Conv}_1(X), \\
X_2 = \mathrm{Conv}_3(X_1), \\
X_3 = \mathrm{Conv}_5(\mathrm{concat}(X_1, X_2)), \\
X_4 = \mathrm{Conv}_7(\mathrm{concat}(X_1, X_2, X_3)),
\end{cases}
\tag{1}
$$
where $\mathrm{Conv}_i$ is an $i \times i \times i$ convolution operation. Finally, $X_1$, $X_2$, $X_3$, and $X_4$ under different receptive fields are concatenated to obtain a feature map with four times as many channels as $X_1$, followed by a 1 × 1 × 1 convolution that fuses the features and adjusts the channels to give the final output feature map $O$, as shown in Eq. (2).
$$
O = \mathrm{Conv}(\mathrm{concat}(X_1, X_2, X_3, X_4)) \tag{2}
$$
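The channel bookkeeping implied by Eqs. (1) and (2) can be traced with a short sketch. It assumes that every branch outputs the same number of channels as $X_1$ (consistent with the statement that the concatenation has four times the channels of $X_1$) and that bias terms are ignored. The names `fusion_channel_plan` and `weight_count` are illustrative and do not come from the original work.

```python
def fusion_channel_plan(in_channels, mid_channels, out_channels):
    """Trace (name, kernel size, input channels, output channels) per convolution.

    Assumption: each branch X1..X4 produces `mid_channels` channels.
    """
    return [
        # X1 = Conv1(X): the 1x1x1 convolution adjusts the channel count.
        ("Conv1", 1, in_channels, mid_channels),
        # X2 = Conv3(X1)
        ("Conv3", 3, mid_channels, mid_channels),
        # X3 = Conv5(concat(X1, X2)): two branches are concatenated.
        ("Conv5", 5, 2 * mid_channels, mid_channels),
        # X4 = Conv7(concat(X1, X2, X3)): three branches are concatenated.
        ("Conv7", 7, 3 * mid_channels, mid_channels),
        # O = Conv(concat(X1, X2, X3, X4)): 1x1x1 fusion of four branches.
        ("Fuse", 1, 4 * mid_channels, out_channels),
    ]


def weight_count(kernel, c_in, c_out):
    """Number of weights of a k x k x k 3D convolution, bias ignored."""
    return kernel ** 3 * c_in * c_out


plan = fusion_channel_plan(in_channels=16, mid_channels=32, out_channels=64)
for name, kernel, c_in, c_out in plan:
    print(name, c_in, "->", c_out, "weights:", weight_count(kernel, c_in, c_out))
```

Under these assumptions the 7 × 7 × 7 branch receives three concatenated branches and holds most of the weights, which illustrates the cost of the larger receptive field.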
2.3 Coordinate Attention Mechanism
In U-Net, the low-level features and high-level features are connected directly, which produces a large number of redundant features and cannot highlight the importance of regions of interest. Therefore, we add a coordinate attention mechanism to the skip connection to fuse features effectively, reduce redundant features, and locate regions of interest. As shown in Fig. 3, the coordinate attention module has two inputs, where $f$ is the output feature map of the encoder and $g$ is the up-sampled feature map of the decoder. First, $f$ and $g$ each undergo a 1 × 1 × 1 convolution that adjusts the number of channels, and they are then summed to obtain the fused feature map $x$. With the three pooling kernels (D, 1, 1), (1, H, 1), and (1, 1, W), $x$ becomes three separate feature maps $x_1$, $x_2$, $x_3$ with direction-specific information, so as to capture long-range dependencies. The outputs of the c-th channel at depth d, height h, and width w are shown in Eqs. (3), (4), and (5).
$$
x_1 = z_c^d = \frac{1}{H \times W} \sum_{0 \le i < H} \sum_{0 \le j < W} x_c(d, i, j) \tag{3}
$$
[…]
$$
a_{uk}^{q,l} = \frac{I\left(\mathrm{sim}\left(h_u^{q,l}, h_k^{q,l}\right) > \gamma^q\right)}{\sum_{k \in N^q(u^q) \cup \{u^q\}} I\left(\mathrm{sim}\left(h_u^{q,l}, h_k^{q,l}\right) > \gamma^q\right)} \tag{5}
$$
where $a_{uk}^{q,l}$ represents the attention weight between node $u^q$ and node $k$ in layer $l$ of RGAT for omics $q$, node $k$ is a neighbor of node $u^q$, that is, $k \in N^q(u^q)$, and $\gamma^q$ is the threshold value. $I(\cdot)$ returns the original value if the condition in parentheses is true and 0 otherwise. The similarity-based attention mechanism normalizes the weights of a node and its neighbors to keep the model stable and prevent gradient explosion. The
198
H. Shi et al.
graph structure of each layer is obtained by applying edge deletion to the graph structure of the previous layer. This ensures that layer l + 1 does not have more edges than layer l. After several training cycles, the graph structure becomes stable. The robust structure can weaken the influence of noisy data on the model. During training, the model parameters are updated in the direction of the real data, which helps avoid overfitting.
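The thresholded, normalized attention described above can be illustrated with a minimal sketch. The similarity function is not specified in the surviving text, so cosine similarity is assumed here; the function names and the toy graph are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors (assumed choice of sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def robust_attention(features, neighbors, u, gamma):
    """Attention weights of node u over its neighbors and itself.

    I(.) keeps the similarity when it exceeds gamma and returns 0
    otherwise, so a weight of 0 corresponds to a pruned edge. The kept
    values are then normalized to sum to 1.
    """
    kept = {}
    for k in list(neighbors[u]) + [u]:
        s = cosine_similarity(features[u], features[k])
        kept[k] = s if s > gamma else 0.0
    total = sum(kept.values())
    return {k: (v / total if total > 0 else 0.0) for k, v in kept.items()}


# Toy example: node "c" is dissimilar to "a", so its edge is pruned.
features = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
neighbors = {"a": ["b", "c"]}
weights = robust_attention(features, neighbors, "a", gamma=0.5)
```

Applying the same rule layer by layer, with each layer starting from the edges that survived the previous one, gives the non-increasing edge count described in the text.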
Fig. 2. Illustration of RGAT. (A) The process of attention calculation. (B) The process of judging false edges by similarity
We use weighted summation to aggregate information and update node features. The formula is as follows:
$$
h_u^{q,l+1} = \sigma\left(\sum_{k \in N^q(u^q) \cup \{u^q\}} a_{uk}^{q,l} W^{q,l} h_k^{q,l}\right) \tag{6}
$$
where $W^{q,l}$ is the parameter matrix. A new node representation is obtained by aggregating the information of the node and its neighborhood. RGAT applies dropout in each layer to prevent the model from overfitting prematurely. RGAT is trained separately on each omics dataset, and a feature representation is obtained for each dataset. These representations are then integrated based on the central dogma, as explained in detail in the next section.
Knowledge-Informed Multi-omics Integration Module. To integrate genomics, epigenomics, and transcriptomics data, MORGAT constructs the knowledge-informed multi-omics integration module. According to the central dogma of molecular biology, DNA is transcribed into RNA (gene expression), which is then translated into proteins. These proteins catalyze chemical reactions that produce or act on various metabolites. By regarding this process as a time series, we can arrange the various omics data in the order of genomics, epigenomics, transcriptomics, proteomics, and metabolomics. In this study, we focus on genomics, epigenomics, and transcriptomics. To capture highly complementary information among omics data and improve the overall performance of the model, we employ the bidirectional long short-term memory (bi-lstm) model [23]. The bi-lstm model is composed of a forward lstm [24] and a reverse lstm, which allows us to obtain
MORGAT: A Model Based Knowledge-Informed Multi-omics Integration
199
both forward and backward information sequences. lstm, a type of recurrent neural network, is composed of three gating mechanisms: the forget gate, the input gate, and the output gate. These gates are responsible for controlling the forgetting of information from the previous moment, the memorization of information from the current moment, and the transmission of information to the next moment, respectively. lstm outputs the hidden layer state at each time step. The forget gate, input memory gate, and output gate are calculated from the hidden layer state of the previous time step and the current input. The formula for the forward lstm is as follows:
$$
i_t = \sigma\left(W_{xi} x_t + W_{hi} \overrightarrow{h_{t-1}} + W_{ci} c_{t-1} + b_i\right) \tag{7}
$$
$$
f_t = \sigma\left(W_{xf} x_t + W_{hf} \overrightarrow{h_{t-1}} + W_{cf} c_{t-1} + b_f\right) \tag{8}
$$
$$
c_t = \sigma\left(f_t c_{t-1} + i_t \tanh\left(W_{xc} x_t + W_{hc} \overrightarrow{h_{t-1}} + b_c\right)\right) \tag{9}
$$
$$
o_t = \sigma\left(W_{xo} x_t + W_{ho} \overrightarrow{h_{t-1}} + W_{co} c_t + b_o\right) \tag{10}
$$
$$
\overrightarrow{h_t} = o_t \tanh(c_t) \tag{11}
$$
where $\sigma$ is the logistic sigmoid function and $\overrightarrow{h_t}$ denotes the hidden vector at time $t$ in forward order. $i_t$, $f_t$, $o_t$, and $c_t$ are the input gate, forget gate, output gate, and cell vectors at time $t$, which have the same dimension as $\overrightarrow{h_t}$. Reverse lstm is a variant of lstm that processes input sequences in reverse order, allowing it to capture information that is complementary to that captured by the forward lstm. In our approach, we use the forward lstm to obtain the feature representations of the omics data and the reverse lstm to obtain the complementary feature representations, which are then combined to form the fusion feature representation. The formula for the fusion feature representation is as follows:
$$
H = \left(\Big\Vert_{t=1}^{T} \overrightarrow{h_t}\right) \Big\Vert \left(\Big\Vert_{t=1}^{T} \overleftarrow{h_t}\right) \tag{12}
$$
where $\overleftarrow{h_t}$ represents the hidden state of the reverse lstm at time $t$, $T$ is the length of the sequence, equal to the number of omics $Q$, and $H$ is the feature representation after the integration of multi-omics.
Feature Attribution Module. The Integrated Gradients algorithm [25] combines gradient-based and back-propagation-based approaches. The algorithm's formula is expressed as follows:
$$
\mathrm{IntegratedGradients}_i = (X_i - X'_i) \times \int_{a=0}^{1} \frac{\partial F\left(X' + a \times (X - X')\right)}{\partial X_i} \, da \tag{13}
$$
where $X$ is the current input, $X'$ is the baseline input, and $F$ is the function that represents the model. The subscript $i$ denotes the i-th feature in $X$. The formula is based mainly on Taylor expansion. It integrates the gradients along the path between the baseline value and the input value. A contribution greater than 0 indicates a positive role. We use the cross-entropy loss function and adopt Adam to minimize it.
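The integral in Eq. (13) is usually approximated numerically. The sketch below uses a midpoint Riemann sum along the straight path from the baseline to the input and estimates the gradient by central finite differences, which stands in for back-propagation so that the example needs no deep learning library. The step count, the function names, and the toy model are assumptions made for illustration only.

```python
def integrated_gradients(f, x, baseline, steps=200, eps=1e-5):
    """Approximate Eq. (13) for a scalar-valued function f of a list of floats."""
    n = len(x)
    diff = [xi - bi for xi, bi in zip(x, baseline)]
    avg_grad = [0.0] * n
    for s in range(steps):
        a = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        point = [bi + a * di for bi, di in zip(baseline, diff)]
        for i in range(n):
            up = list(point)
            down = list(point)
            up[i] += eps
            down[i] -= eps
            # Central-difference estimate of dF/dX_i at this path point.
            avg_grad[i] += (f(up) - f(down)) / (2 * eps) / steps
    # Scale the path-averaged gradient by (X_i - X'_i).
    return [di * gi for di, gi in zip(diff, avg_grad)]


def toy_model(v):
    # Hypothetical differentiable model used only for illustration.
    return v[0] ** 2 + 3.0 * v[1]


attributions = integrated_gradients(toy_model, [2.0, 1.0], [0.0, 0.0])
```

A useful sanity check is that the attributions sum to approximately $F(X) - F(X')$, the completeness property of Integrated Gradients.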
Fig. 3. Results of ablation experiment on BRCA and LUAD. The reconstituted omics orders were M_C_G, M_G_C, and C_G_M.
3 Experiments
In this section, we conducted classification and ablation experiments and identified biomarkers for each molecular subtype of cancer based on the weights obtained by feature attribution. Two classification experiments compared our model with other models, and two ablation experiments were performed. The ablation experiments covered three aspects: removing the robust structure, removing the knowledge-informed multi-omics integration, and disrupting the sequence obtained according to the central dogma. The correct omics order was C_M_G (C represents CNV, M represents DNA Methylation, and G represents Gene expression). We evaluated the performance of MORGAT and the other models on BRCA and LUAD through 5-fold cross-validation. Classification performance was measured in terms of balanced accuracy (BACC), Matthews correlation coefficient (MCC), Kappa coefficient (Kappa), macro-averaged F1 score (macro-F1), and macro-averaged precision (macro-precision).
3.1 Experimental Results
Classification Performance. We compared the classification performance of MORGAT with RF, XGBoost, OmiEmbed [26], GCN, and MOGONET [27] on BRCA and LUAD. Tables 3 and 4 provide the details of the results, which are the means over the 5 folds. The sample distribution in BRCA is unbalanced; nevertheless, as Table 3 shows, MORGAT outperforms the other models in all metrics. The average results for BACC, macro-precision, macro-F1, MCC, and Kappa are 0.867, 0.902, 0.874, 0.873, and 0.873, respectively, which are 4–5% higher than those of the second-best model. Notably, MORGAT performs particularly well in MCC and Kappa, indicating its advantage over the other models in handling unbalanced datasets. LUAD is a small-sample dataset, and MORGAT also outperforms the other models on it. The average results of BACC, macro-precision, macro-F1, MCC, and Kappa reach 0.888, 0.890, 0.884, 0.829, and 0.825, respectively, which are 4–7% higher than those of the second-best model.
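Three of the metrics above (BACC, Kappa, and MCC) can be computed directly from a multi-class confusion matrix. The sketch below uses their standard definitions and is not taken from the authors' code; the function name and the matrix layout are assumptions.

```python
import math


def classification_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    true_counts = [sum(confusion[i]) for i in range(k)]
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    correct = sum(confusion[i][i] for i in range(k))
    cross = sum(t * p for t, p in zip(true_counts, pred_counts))

    # Balanced accuracy: mean per-class recall over classes that occur.
    recalls = [confusion[i][i] / true_counts[i] for i in range(k) if true_counts[i] > 0]
    bacc = sum(recalls) / len(recalls)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = correct / total
    expected = cross / total ** 2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 0.0

    # Multi-class Matthews correlation coefficient.
    denom = math.sqrt(
        (total ** 2 - sum(t * t for t in true_counts))
        * (total ** 2 - sum(p * p for p in pred_counts))
    )
    mcc = (correct * total - cross) / denom if denom > 0 else 0.0
    return {"BACC": bacc, "Kappa": kappa, "MCC": mcc}


metrics = classification_metrics([[8, 2], [1, 9]])
```

Because BACC averages recall per class, and Kappa and MCC both correct for the class distribution, these metrics are less flattering to a majority-class predictor than plain accuracy, which is why they are informative on unbalanced data such as BRCA.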
Results of Ablation Experiment. As depicted in Fig. 3, MORGAT achieves the best performance across all evaluated metrics. The ablation experiment reveals that removing RGAT leads to greater performance degradation than the other ablation settings. Additionally, removing bi-lstm results in the third-highest performance
Table 3. Prediction performances under five evaluation metrics of six methods on BRCA

| Method | BACC | macro-precision | macro-F1 | MCC | Kappa |
|---|---|---|---|---|---|
| RF | 0.733 ± 0.055 | 0.765 ± 0.098 | 0.744 ± 0.073 | 0.783 ± 0.051 | 0.780 ± 0.053 |
| XGBoost | 0.750 ± 0.048 | 0.835 ± 0.090 | 0.765 ± 0.049 | 0.817 ± 0.032 | 0.814 ± 0.033 |
| OmiEmbed | 0.791 ± 0.071 | 0.886 ± 0.070 | 0.817 ± 0.060 | 0.817 ± 0.053 | 0.811 ± 0.053 |
| GCN | 0.822 ± 0.101 | 0.838 ± 0.112 | 0.822 ± 0.102 | 0.839 ± 0.053 | 0.838 ± 0.053 |
| MOGONET | 0.788 ± 0.079 | 0.834 ± 0.097 | 0.784 ± 0.075 | 0.830 ± 0.023 | 0.827 ± 0.022 |
| MORGAT | 0.867 ± 0.058 | 0.902 ± 0.025 | 0.874 ± 0.046 | 0.873 ± 0.036 | 0.873 ± 0.036 |
Table 4. Prediction performances under five evaluation metrics of six methods on LUAD

| Method | BACC | macro-precision | macro-F1 | MCC | Kappa |
|---|---|---|---|---|---|
| RF | 0.792 ± 0.033 | 0.823 ± 0.052 | 0.787 ± 0.047 | 0.712 ± 0.068 | 0.697 ± 0.074 |
| XGBoost | 0.796 ± 0.094 | 0.811 ± 0.114 | 0.796 ± 0.102 | 0.707 ± 0.160 | 0.698 ± 0.160 |
| OmiEmbed | 0.842 ± 0.068 | 0.854 ± 0.067 | 0.838 ± 0.069 | 0.768 ± 0.100 | 0.760 ± 0.101 |
| GCN | 0.823 ± 0.075 | 0.829 ± 0.075 | 0.820 ± 0.077 | 0.733 ± 0.117 | 0.729 ± 0.118 |
| MOGONET | 0.829 ± 0.061 | 0.860 ± 0.054 | 0.825 ± 0.069 | 0.760 ± 0.094 | 0.743 ± 0.106 |
| MORGAT | 0.888 ± 0.055 | 0.890 ± 0.066 | 0.884 ± 0.062 | 0.829 ± 0.090 | 0.825 ± 0.091 |
degradation in BRCA and the second-highest in LUAD, indicating the effectiveness of the robust structure and the knowledge-informed integration. Furthermore, the three scrambled orders result in a significant performance drop, indicating that the sequential order based on the central dogma is the most effective.
Fig. 4. Numerical distribution heatmaps of significant features of omics data on BRCA
3.2 The Significant Genes Identified by MORGAT
Through feature attribution, we obtained the contribution of different omics features to the molecular typing of cancer. For each molecular subtype sample set, we selected the top 100 features from each omics as the candidate set. Because the training was
conducted using 5-fold cross-validation, we removed features with a frequency less than 3 from the candidate set. We used genes and sites in candidate sets to create heatmaps (Figs. 4 and 5). Figure 4 shows that CNV is more effective in distinguishing between Basal, Her2, and LumB. DNA Methylation is more effective in distinguishing between Basal, LumA, and LumB. Gene expression is effective in distinguishing all molecular subtypes of BRCA. As can be seen from Fig. 5, DNA Methylation and Gene expression can very well distinguish each molecular subtype of LUAD. CNV is more effective in discriminating between PI and PP.
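The candidate filtering described above (top features per fold, then a frequency cut-off across the five folds) can be sketched as follows. The sketch assumes that features are ranked by their raw attribution score within each fold and that "frequency" means the number of folds in which a feature reaches the top list; the function name and the feature names are hypothetical placeholders.

```python
from collections import Counter


def stable_features(fold_scores, top_n=100, min_folds=3):
    """Keep features that rank in the top_n of at least min_folds folds.

    fold_scores: one dict per fold, mapping feature name -> attribution score.
    """
    counts = Counter()
    for scores in fold_scores:
        ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
        counts.update(ranked[:top_n])
    return sorted(name for name, c in counts.items() if c >= min_folds)


# Toy example with three placeholder features and top_n=2.
fold_scores = [
    {"g1": 0.9, "g2": 0.8, "g3": 0.1},
    {"g1": 0.7, "g3": 0.6, "g2": 0.2},
    {"g1": 0.5, "g2": 0.4, "g3": 0.3},
    {"g2": 0.9, "g3": 0.8, "g1": 0.1},
    {"g1": 0.9, "g2": 0.3, "g3": 0.2},
]
selected = stable_features(fold_scores, top_n=2, min_folds=3)
```

Requiring agreement across folds trades a smaller candidate set for features whose importance does not depend on a particular train/test split.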
Fig. 5. Numerical distribution heatmaps of significant features of omics data on LUAD
After conducting a heatmap analysis, we were able to determine which omics were most effective in identifying a molecular subtype. Based on these findings, we selected the five genes or sites with the highest contribution in these omics as biomarkers for the subtype (Tables 5 and 6).

Table 5. Biomarkers of BRCA molecular subtypes

| Omics | LumA | LumB | Basal | Normal | Her2 |
|---|---|---|---|---|---|
| CNV | — | CTTN, MRPL21, TESMIN, LTO1, PPP6R3 | TARS2, ENSA, SF3B4, RPRD2, HORMAD1 | — | GRB7, BTBD17, KIF19, CD300LF, CSH1 |
| DNA Methylation | GHITM, PIGB, SYNJ2, TLR5, ICOSLG | CPT1A, CUX1, ZMYM4, NCALD | KCNJ1, RALB, FOXO3 | — | — |
| Gene expression | IL6ST, BEX1, ESR1, GFRA1, FAM83E | NDC80, ELF5, CENPA, KIF4A, DEPDC1 | DSC3, KRT14, PTPRZ1, JUN, PIK3C2G | — | BUB1, KIF23, ATAD2, FAM83D |
Most of the genes selected in BRCA have been shown to be related to the development or prognosis of BRCA. Those that have not previously been noted are also potentially associated with BRCA. For example, GHITM (TMBIM5), which is localized in the mitochondrial inner membrane, may be related to the imbalance of mitochondrial function in breast cancer cells [28]. It regulates mitochondrial metabolism [29] and oxidative phosphorylation [30] and is also involved in protecting cells from oxidative stress and apoptosis. ENSA is a protein-coding gene involved in the regulation of protein phosphatase 2A (PP2A), which plays an important role in cellular processes such as cell proliferation, signal transduction, and apoptosis, and is considered to be a tumor suppressor that is functionally inactivated in cancer [31]. ENSA can inhibit the activity of PP2A [32], which may be related to the abnormal proliferation and survival of breast cancer cells.

Table 6. Biomarkers of LUAD molecular subtypes

| Omics | TRU | PI | PP |
|---|---|---|---|
| CNV | — | OPN1MW, TKTL1, HCN3, ARHGAP4, PDZD4 | C16orf89, PLAAT4, ROS1, CEACAM7, SHE |
| DNA Methylation | AGAP1, TPO, OPN3, SLC6A3, DPP6 | AP2A2, GCNT3, PFKP, LRIG1, MBTD1 | ITGB2, FNDC1, TNC, STEAP1, HLA-DRA |
| Gene expression | C16orf89, ROS1, SHE, CEACAM7, PLAAT4 | SFT2D2, EFR3A, APLP2, WIPI2 | PAEP, S100P, FGG, CYP24A1, BPIFA1 (LUNX) |
Most of the genes selected in LUAD have been shown to have a relationship with the development or prognosis of LUAD. Similar to BRCA, many genes that have not been identified as biomarkers in previous studies are potentially associated with LUAD. For example, the OXR1 gene plays a role in protecting cells from oxidative stress [33]. In cancer, oxidative stress caused by reactive oxygen species (ROS) can contribute to cancer development by stimulating tumorigenesis, promoting the proliferation of cancer cells, and leading to cell death [34]. The OXR1 gene, which is associated with oxidative stress, may have potential as a biomarker for LUAD. WIPI2 is an integral part of the autophagy mechanism, which primarily regulates intracellular degradation processes [35]. In LUAD, autophagy is essential for maintaining glucose homeostasis and tumor growth [36], making autophagy-related genes potential biomarkers for LUAD.
4 Conclusion
In this study, we propose a framework (MORGAT) based on knowledge-informed multi-omics integration and robust attention networks. MORGAT integrates omics data as a time series according to the central dogma of molecular biology, and can better capture
potential association features among omics. Additionally, the internal relations of the omics data are captured by RGAT, which removes wrong edges in the graph and enhances the robustness of the model. Finally, the Integrated Gradients algorithm is used to determine the contribution of each feature, allowing for the identification of biomarkers for each subtype. We select three kinds of omics, CNV, DNA methylation, and Gene expression, to conduct experiments on BRCA and LUAD. Compared with the other five methods, MORGAT has the best molecular subtype classification performance. Through ablation experiments, we demonstrate the effectiveness of the substructures in MORGAT. These results show that MORGAT can effectively predict the subtypes of cancers and identify important multi-omics features.
Acknowledgments. The key data of this research was downloaded from UCSC Xena. We thank Chaoyi Yin of Jilin University for her help in data processing.
References
1. Grizzi, F., Chiriva-Internati, M.: Cancer: looking for simplicity and finding complexity. Cancer Cell Int. 6, 4 (2006). https://doi.org/10.1186/1475-2867-6-4
2. Lee, Y.-M., Oh, M.H., Go, J.-H., Han, K., Choi, S.-Y.: Molecular subtypes of triple-negative breast cancer: understanding of subtype categories and clinical implication. Genes Genom. 42, 1381–1387 (2020). https://doi.org/10.1007/s13258-020-01014-7
3. Nicholson, J.K., Wilson, I.D.: Opinion: understanding 'global' systems biology: metabonomics and the continuum of metabolism. Nat. Rev. Drug Discov. 2(8), 668–676 (2003). https://doi.org/10.1038/nrd1157
4. Knox, S.S.: From 'omics' to complex disease: a systems biology approach to gene-environment interactions in cancer. Cancer Cell Int. 10, 11 (2010). https://doi.org/10.1186/1475-2867-10-11
5. Yang, Z., Michailidis, G.: A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics 32(1), 1–8 (2016). https://doi.org/10.1093/bioinformatics/btv544
6. Mo, Q., Shen, R., Guo, C., Vannucci, M., Chan, K.S., Hilsenbeck, S.G.: A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics 19(1), 71–86 (2018). https://doi.org/10.1093/biostatistics/kxx017
7. Zhang, L., et al.: Deep learning-based multi-omics data integration reveals two prognostic subtypes in high-risk neuroblastoma. Front. Genet. 9, 477 (2018). https://doi.org/10.3389/fgene.2018.00477
8. Chaudhary, K., Poirion, O.B., Lu, L., Garmire, L.X.: Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clin. Cancer Res. 24(6), 1248–1259 (2018). https://doi.org/10.1158/1078-0432.CCR-17-0853
9. Rappoport, N., Shamir, R.: Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res. 46(20), 10546–10562 (2018). https://doi.org/10.1093/nar/gky889
10. Sun, D., Wang, M., Li, A.: A multimodal deep neural network for human breast cancer prognosis prediction by integrating multi-dimensional data. IEEE/ACM Trans. Comput. Biol. Bioinform. (2018). https://doi.org/10.1109/TCBB.2018
11. Sharifi-Noghabi, H., Zolotareva, O., Collins, C.C., Ester, M.: MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics 35(14), i501–i509 (2019). https://doi.org/10.1093/bioinformatics/btz318
12. Xu, J., et al.: A hierarchical integration deep flexible neural forest framework for cancer subtype classification by integrating multi-omics data. BMC Bioinform. 20(1), 527 (2019). https://doi.org/10.1186/s12859-019-3116-7
13. Ning, M., Lo, E.H.: Opportunities and challenges in omics. Transl. Stroke Res. 1(4), 233–237 (2010). https://doi.org/10.1007/s12975-010-0048-y
14. Yang, Z.-Y., Liang, Y., Zhang, H., Chai, H., Zhang, B., Pen, C.: Robust sparse logistic regression with the lq (0 < q < 1) regularization for feature selection using mRNA data. IEEE Access PP, 68586–68595 (2018). https://doi.org/10.1109/ACCESS.2018.2880198
15. Momeni, Z., et al.: A survey on single and multiomics data mining methods in cancer data classification. J. Biomed. Inform. 107, 103466 (2020). https://doi.org/10.1016/j.jbi.2020.103466
16. Parker, J.S., et al.: Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27(8), 1160–1167 (2009). https://doi.org/10.1200/JCO.2008.18.1370
17. Cancer Genome Atlas Research Network: Comprehensive molecular profiling of lung adenocarcinoma. Nature 511(7511), 543–550 (2014). https://doi.org/10.1038/nature13385
18. Buhmann, M.D.: Radial Basis Functions: Theory and Implementations. Cambridge Monographs on Applied and Computational Mathematics (2003). https://doi.org/10.1017/CBO9780511543241
19. Rappoport, N., Shamir, R.: NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics 35(18), 3348–3356 (2019). https://doi.org/10.1093/bioinformatics/btz058
20. Chen, Y., et al.: Understanding and improving graph injection attack by promoting unnoticeability. In: International Conference on Learning Representations (ICLR 2022) (2022). arXiv:2202.08057
21. Veličković, P., et al.: Graph attention networks. In: Proceedings of the International Conference on Learning Representations (ICLR 2018) (2018). https://doi.org/10.17863/CAM.48429
22. Zhang, X., Zitnik, M.: GNNGuard: defending graph neural networks against adversarial attacks. In: Neural Information Processing Systems (NIPS 2020) (2020). arXiv:2006.08149v3
23. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005). https://doi.org/10.1016/j.neunet.2005.06.042
24. Graves, A., Mohamed, A., Hinton, G.: Speech Recognition with Deep Recurrent Neural Networks (ICASSP 2013) (2013). https://doi.org/10.1109/ICASSP.2013.6638947
25. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: International Conference on Machine Learning (ICML 2017) (2017). arXiv:1703.01365v2
26. Zhang, X., Xing, Y., Sun, K., Guo, Y.: OmiEmbed: a unified multi-task deep learning framework for multi-omics data. Cancers (Basel) 13(12), 3047 (2021). https://doi.org/10.3390/cancers13123047
27. Wang, T., et al.: MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat. Commun. 12(1), 3445 (2021). https://doi.org/10.1038/s41467-021-23774-w
28. Srinivasan, S., Guha, M., Kashina, A., Avadhani, N.G.: Mitochondrial dysfunction and mitochondrial dynamics-the cancer connection. Biochim. Biophys. Acta Bioenerg. 1858(8), 602–614 (2017). https://doi.org/10.1016/j.bbabio.2017.01.004
29. Seitaj, B., et al.: Transmembrane BAX Inhibitor-1 Motif Containing Protein 5 (TMBIM5) sustains mitochondrial structure, shape, and function by impacting the mitochondrial protein synthesis machinery. Cells 9(10), 2147 (2020). https://doi.org/10.3390/cells9102147
30. Patron, M., et al.: Regulation of mitochondrial proteostasis by the proton gradient. EMBO J. 41(16), e110476 (2022). https://doi.org/10.15252/embj.2021110476
31. Seshacharyulu, P., Pandey, P., Datta, K., Batra, S.K.: Phosphatase: PP2A structural importance, regulation and its aberrant expression in cancer. Cancer Lett. 335(1), 9–18 (2013). https://doi.org/10.1016/j.canlet.2013.02.036
32. Lacerda, J.T., et al.: Lack of TRPV1 channel modulates mouse gene expression and liver proteome with glucose metabolism changes. Int. J. Mol. Sci. 23(13), 7014 (2022). https://doi.org/10.3390/ijms23137014
33. Matsui, A., et al.: Oxidation resistance 1 functions in the maintenance of cellular survival and genome stability in response to oxidative stress-independent DNA damage. Genes Environ. 42(1), 29 (2020). https://doi.org/10.1186/s41021-020-00168-w
34. Hayes, J.D., Dinkova-Kostova, A.T., Tew, K.D.: Oxidative stress in cancer. Cancer Cell 38(2), 167–197 (2020). https://doi.org/10.1016/j.ccell.2020.06.001
35. Polson, H.E., et al.: Mammalian Atg18 (WIPI2) localizes to omegasome-anchored phagophores and positively regulates LC3 lipidation. Autophagy 6(4), 506–522 (2010). https://doi.org/10.4161/auto.6.4.11863
36. Guo, J.Y., White, E.: Autophagy, metabolism, and cancer. Cold Spring Harb. Symp. Quant. Biol. 81, 73–78 (2016). https://doi.org/10.1101/sqb.2016.81.030981
Biomedical Informatics Theory and Methods
Explainable Stuttering Recognition Using Axial Attention

Yu Ma1, Yuting Huang1, Kaixiang Yuan1, Guangzhe Xuan1, Yongzi Yu1, Hengrui Zhong1, Rui Li2, Jian Shen1(B), Kun Qian1(B), Bin Hu1(B), Björn W. Schuller3,4, and Yoshiharu Yamamoto5

1 School of Medical Technology, Beijing Institute of Technology, Beijing, China
{shenjian,qian,bh}@bit.edu.cn
2 School of Information Science and Engineering, Lanzhou University, Lanzhou, China
3 GLAM – Group on Language, Audio, and Music, Imperial College London, London, UK
4 Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany
5 Educational Physiology Laboratory, Graduate School of Education, University of Tokyo, Tokyo, Japan
Abstract. Stuttering is a complex speech disorder that disrupts the flow of speech, and recognizing persons who stutter (PWS) and understanding their significant struggles is crucial. With advancements in computer vision, deep neural networks offer potential for recognizing stuttering events through image-based features. In this paper, we extract image features based on the Wavelet Transformation (WT) and the Histogram of Oriented Gradients (HOG) from audio signals. We also generate explainable images using Gradient-weighted Class Activation Mapping (Grad-CAM) as input for our final recognition model, an axial attention-based EfficientNetV2, which is trained on the Kassel State of Fluency Dataset (KSoF) to perform 8-class recognition. Our model improves the unweighted average recall (UAR) by 4.4 percentage points over the ComParE 2022 baseline (34.6% vs. 30.2%), demonstrating that the axial attention-based EfficientNetV2, combined with the explainable input, has the capability to detect and recognise multiple types of stuttering.

Keywords: Stuttering Recognition · Speech · Wavelet Transformation · Histogram of Oriented Gradient
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 209–220, 2023. https://doi.org/10.1007/978-981-99-4749-2_18

1 Introduction

As artificial intelligence develops, machine learning based methods can help treatment specialists track the progress of a patient’s disorder [1–3]. Among the disorders studied in the audio field, stuttering is a multifaceted speech disorder that affects more than 70 million individuals worldwide [4]. The condition can result in various challenges, including the repetition and prolongation of sounds [5], and can significantly impact an individual’s quality of life and communication abilities. Previous research has revealed that the most common treatment for stuttering is speech therapy [6], which
typically includes the use of behavioral and cognitive approaches [7]. Various strategies can be employed to improve fluency, such as word pronunciation facilitation, speech rate reduction, and chorusing. To some extent, medication can also be used for neurogenic stuttering, although it is not the first choice [6]. However, speech therapy sessions are private and often come at a high cost, making them unaffordable for some PWS. Therefore, the development of an automatic stuttering recognition system using machine learning and deep learning techniques is of significant importance.

Early research on this topic focused mainly on a series of spectrogram-based features [8, 9], such as linear predictive cepstral coefficients (LPCCs) and Mel-frequency cepstral coefficients (MFCCs). Moreover, recent research has shown that WT images, which reflect time-frequency dynamics [10], have garnered considerable attention in diverse physiological signal processing applications. These techniques can be used to extract information about the fundamental mechanisms of stuttering and offer valuable insights into how the disorder progresses over time.

In recent decades, deep neural networks have been widely used in tasks such as health care monitoring [11, 12] and depression recognition [13–15]. Notably, in the fields of computer vision and signal processing, a wide range of features are being employed as input to classification models [16–18], with encouraging results [19–21]. In addition, the histogram of oriented gradients is widely applied to capture the variation in pixel intensity in an image. For instance, in the audio field, Demir et al. [22] have explored the extraction of the HOG feature and the Local Binary Pattern (LBP) feature for snoring classification.
Compared with the challenge baseline and state-of-the-art deep spectral features, the fusion of these two texture features improved the unweighted average recall by 23.1% and 8.3%, respectively.

Despite the numerous potential applications, stuttering recognition has garnered less attention, especially from the perspective of using multiple image features, which can reflect the dynamic process of signals, as input to deep neural networks. Current applications of machine learning in stuttering recognition mainly focus on data features rather than image features [23]. One of the most popular networks for this purpose is the Artificial Neural Network (ANN) [24], which can be implemented using various architectures such as Long Short-Term Memory (LSTM) [25] and the Multi-layer Perceptron [26]. In addition to ANNs, other algorithms such as Support Vector Machines (SVM) [27], k-Nearest Neighbors (k-NN) [28], Linear Discriminant Analysis (LDA) [29], and Gaussian Mixture Models (GMM) [30] have also been employed in this field. These algorithms are evaluated for their efficacy and performance before being applied to stuttering recognition. By applying machine learning algorithms to analyze speech signals and detect patterns in the data, speech-language pathologists can gain a better understanding of how stuttering affects an individual’s speech patterns and monitor its progression over time. Most importantly, this data can then be leveraged to create personalized treatment plans that cater to the unique needs of each patient, leading to more efficacious and tailored therapeutic outcomes.

Moreover, the explainability of recognition models has drawn little attention in previous work. The ability of Grad-CAM [31] to provide clear and interpretable explanations of deep neural networks makes it a valuable tool for researchers and practitioners
working in various domains. It can generate high-quality visualizations that highlight the regions of input images that are important for making class-specific predictions. One of the major advantages of Grad-CAM is its ability to provide an interpretable explanation of the network’s decision-making process, which can help users gain insight into how the network processes input data. Furthermore, it can be applied to a wide range of tasks, including object recognition, medical image analysis, and speech recognition. The use of Grad-CAM can also facilitate the development of more transparent and interpretable models, which is becoming increasingly important in many fields, especially in healthcare, where the reliability and transparency of decision-making systems are crucial.

To this end, we propose a framework for stuttering recognition that utilizes explainable input images and efficient recognition models. The inclusion of explainable input images helps to ensure that the recognition models are transparent and can be more easily understood by clinicians and patients.

The rest of the paper is organised as follows. Section 2 introduces the models for explainable stuttering recognition. Section 3 provides our experimental results and discussions. Finally, Sect. 4 concludes this paper and outlines future work.
Fig. 1. Explainable framework with an axial attention based EfficientNetV2 for stuttering recognition Model Configurations.
Our proposed explainable framework (shown in Fig. 1) for stuttering recognition consists of the following steps:
1. The features of stuttering are extracted with WT and HOG. Specifically, we first extract the WT images and then extract the HOG features from those WT images. Moreover, we fuse these two kinds of images at the pixel level to build more effective representations.
2. Based on the trained models, MobileNetV2 [32] and ResNet50 [33], we obtain explainable images using Grad-CAM [31] to explore the inherent characteristics of different categories. Moreover, we fuse the two Grad-CAM outputs based on MobileNetV2 and ResNet50 with U2Fusion [34], which combines the regions of interest generated by the single models and refines our explainable results.
3. We propose an axial attention based EfficientNetV2 [35], including row and column attention parts, to extract the information from the fused explainable images and conduct the final stuttering recognition.
2 Feature Extraction

In this paper, as outlined, we utilised WT and HOG for signal processing and feature extraction in both the time and frequency domains. Our motivation is as follows. First of all, compared with traditional audio features, images can reflect the dynamic process of signals. Specifically, we extracted HOG features from the WT images, which makes their pixel-level fusion feasible in terms of texture correspondence. Given the audio dataset $D = \{X_1, X_2, \dots, X_n\}$ of stuttering, where $X_i$ is the $i$-th audio signal in the dataset, the two feature images extracted from this signal are fused as follows:

$$F_i = \lambda T_i + (1 - \lambda) H_i, \quad \lambda \in [0, 1], \quad i = 1, \dots, n, \tag{1}$$

where $T_i$, $H_i$ and $n$ are the WT image, the HOG image and the number of subjects, respectively, and $\lambda$ is a weighting parameter that controls the contribution of the two features to the fusion. In the current experiment, we set this parameter to 0.5, i.e., we consider the two features to have the same importance. In future experiments, we will vary this parameter and use other fusion strategies to explore how the contribution of different features affects the results. The whole algorithm for feature extraction is shown in Algorithm 1.

Algorithm 1: Feature Extraction and Fusion for Stuttering Recognition.
Input: a stuttering dataset $D = \{X_1, X_2, \dots, X_n\}$, where $X_i$ is the $i$-th audio signal in the dataset and $n$ is the number of subjects contained in $D$.
for each audio $X_i$ in dataset $D$ do
    Calculate the image set $T$ of the WT feature
    for each image $T_i$ in set $T$ do
        Calculate the HOG feature $H_i$
    end
    for each HOG feature $H_i$ do
        Fuse $T_i$ and $H_i$ on the pixel level: $F_i = \lambda T_i + (1 - \lambda) H_i$
    end
end
Output: Fused feature images $F_i$ derived from WT and HOG.
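The steps of Algorithm 1 and the fusion in Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the paper renders WT images with Matplotlib and does not specify its HOG settings, whereas here the WT image is a plain array of Mexican-hat wavelet responses and the HOG image is a simplified per-cell orientation-histogram map without block normalisation. The function names (`mexican_hat`, `wt_image`, `hog_image`, `fuse`), the scale range, and the cell size are all hypothetical choices.

```python
import numpy as np

def mexican_hat(points, scale):
    """Mexican hat (Ricker) wavelet sampled at `points` positions for one scale."""
    t = np.arange(points) - (points - 1) / 2.0
    x = t / scale
    norm = 2.0 / (np.sqrt(3.0 * scale) * np.pi ** 0.25)  # one common normalisation
    return norm * (1.0 - x ** 2) * np.exp(-0.5 * x ** 2)

def wt_image(signal, scales):
    """WT image T_i: one row of |wavelet response| per scale, scaled to [0, 1]."""
    rows = []
    for s in scales:
        kernel = mexican_hat(min(10 * int(s), len(signal)), s)
        rows.append(np.abs(np.convolve(signal, kernel, mode="same")))
    img = np.stack(rows)
    return img / (img.max() + 1e-12)

def hog_image(img, cell=8, bins=9):
    """Simplified HOG-style image H_i (assumption: no block normalisation).
    Each cell is filled with the strength of its dominant orientation bin."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0           # unsigned orientation
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    out = np.zeros_like(mag)
    h, w = img.shape
    for r in range(0, h, cell):
        for c in range(0, w, cell):
            m = mag[r:r + cell, c:c + cell].ravel()
            b = bin_idx[r:r + cell, c:c + cell].ravel()
            hist = np.bincount(b, weights=m, minlength=bins)
            out[r:r + cell, c:c + cell] = hist.max()
    return out / (out.max() + 1e-12)

def fuse(t_img, h_img, lam=0.5):
    """Pixel-level fusion of Eq. (1): F = lam * T + (1 - lam) * H."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if t_img.shape != h_img.shape:
        raise ValueError("T and H must have the same resolution")
    return lam * t_img + (1.0 - lam) * h_img

# Toy signal standing in for one audio clip X_i (synthetic, not KSoF data).
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 256)) + 0.1 * rng.standard_normal(256)
T = wt_image(signal, scales=np.arange(1, 33))
H = hog_image(T)
F = fuse(T, H, lam=0.5)   # lam = 0.5 as in the current experiment
```

Because the HOG image is computed from the WT image, both share one pixel grid, which is what makes the weighted sum in `fuse` well defined. Setting `lam` to 1 or 0 recovers the WT-only or HOG-only image.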
2.1 Explainable Module

Explainability is urgently needed in deep learning, especially in the detection and recognition of various diseases, which motivates us to include an explainable module as one of the key components of our stuttering recognition framework. In order to obtain effective
explainable images for the stuttering signals, we first train two models, MobileNetV2 and ResNet50, and then apply Grad-CAM to obtain the visual explanation images. We choose Grad-CAM as the interpretation method because we want to explore the image regions that are important for the different stuttering categories in the deep neural network. Using Grad-CAM, we visualise the category-specific regions in the related image. These localised regions in the explainable module carry information about the underlying attention for the different categories, thus making it possible to interpret the subsequently predicted stuttering category. Although it remains a challenge to clarify what biological information these regions might embody, this differentiation could offer new insights for the interpretability of physiological signals.

2.2 Axial Attention Based EfficientNetV2

EfficientNetV2 has considerable advantages in terms of training speed and number of parameters. While retaining these advantages, we would like to optimise the classification model more specifically for the image features we extract for the stuttering classification task. To this end, we apply axial attention [36] to EfficientNetV2; the resulting structure is shown in Fig. 2. Adding axial attention is intended to improve the stuttering classification model by improving the extraction of interpretable image features along both the row and column directions.
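To make the row and column directions concrete, the following NumPy sketch shows the general idea of axial attention [36]: self-attention is applied along one spatial axis at a time instead of over all pixel pairs. It is a minimal single-head sketch under assumptions, not the Axial-Fused-MBConv block of Fig. 2. The projection matrices are random placeholders, and chaining row and column attention with residual connections is an assumed design, since the paper does not spell out how the two parts are combined.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head attention over the second-to-last axis: (..., L, C) -> (..., L, C)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def row_attention(feat, params):
    """feat: (H, W, C). Each row is a sequence of W positions; rows do not interact."""
    return self_attention(feat, *params)

def col_attention(feat, params):
    """Each column is a sequence of H positions; columns do not interact."""
    out = self_attention(np.swapaxes(feat, 0, 1), *params)   # (W, H, C)
    return np.swapaxes(out, 0, 1)                            # back to (H, W, C)

def axial_attention(feat, params_row, params_col):
    """Assumed combination: row attention, then column attention, both residual."""
    y = feat + row_attention(feat, params_row)
    return y + col_attention(y, params_col)

# Toy feature map with random placeholder weights (hypothetical sizes).
rng = np.random.default_rng(0)
n_rows, n_cols, channels = 4, 5, 8
feat = rng.standard_normal((n_rows, n_cols, channels))
params_row = tuple(rng.standard_normal((channels, channels)) / np.sqrt(channels) for _ in range(3))
params_col = tuple(rng.standard_normal((channels, channels)) / np.sqrt(channels) for _ in range(3))
out = axial_attention(feat, params_row, params_col)
```

For a feature map with H rows and W columns, full self-attention compares H·W positions with each other, while the two axial passes compare each position only with the W positions in its row and the H positions in its column, which is the main reason axial attention is cheaper on image-like inputs.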
Fig. 2. Structure of the Axial-Fused-MBConv.
3 Experiment and Result

3.1 Dataset

We utilised KSoF [37, 38], a therapy-based dataset that contains more than 5,500 clips of persons who stutter, labelled with stuttering-related event types including blocks, prolongations, sound repetitions, word repetitions, and others. The clips are taken from 214 audio recordings featuring 37 speakers, of whom 28 are male and 9 are female. The recordings were made during therapy sessions and are available on request for research purposes. To keep our results comparable with those of researchers who do not have access to the test set of the ComParE 2022 challenge [38], 80% of the original training set is used for learning and the remaining 20% as the development set, while the original development set is used for testing. We conduct stuttering recognition with 8 classes, including blocks, prolongations, sound repetitions, word repetitions, fillers, and no disfluencies.

3.2 Feature Extraction

Our motivation for extracting these two features is mainly that picture-based features can compensate for the shortcomings of traditional acoustic features. For instance, picture-based features can better describe the dynamic processes and the time-frequency distribution of the signals. The extracted feature images have a resolution of 618 × 306. For the WT, we chose the “mexh” (Mexican hat) wavelet function, and we loaded the audio data and sampling rate with Librosa [39]. We plotted the WT images with Matplotlib [40], with time on the X-axis and frequency on the Y-axis. All feature images are drawn at the same resolution and can therefore be fed into different models such as ResNet and VGG.

3.3 Explainable Output

In this study, we applied Grad-CAM to two separate deep learning models, MobileNetV2 and ResNet50. This was done to generate two types of explainable images that provide a better representation of specific categories.
The decision to use two models rather than one was based on two main factors. Firstly, explainable images have the potential to provide a more effective representation of specific categories than previous features. However, without cross-model evaluation, we cannot verify the output of a single model. By using two different models, we ensure that our results are not biased by the behaviour of a single model but are validated across two independent models. Secondly, Grad-CAM can visualise the regions of input images that are critical to class-specific predictions. Using this functionality, we identified the most significant regions of the input images and generated explainable images that clearly visualise the regions contributing most to the stuttering classification.

To enhance the reliability of these localised regions, we fused the two types of explainable images generated by MobileNetV2 and ResNet50. This fusion enabled us to combine the strengths of both models and generate a more comprehensive representation
of the important regions of the input images. By combining these two approaches, we were able to generate a set of explainable images that provided us with valuable insights into the underlying mechanisms of stuttering classification. The use of multiple models and the fusion of their outputs not only enhanced the reliability of our findings but also provided a more robust framework for future research in the field of speech-language pathology and stuttering recognition. The explainable images from two single models and their fusion by U2Fusion are shown in Fig. 3.
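For readers unfamiliar with how Grad-CAM [31] arrives at such region maps, the core computation can be sketched as follows: the gradients of the class score with respect to a convolutional layer's feature maps are averaged spatially to give one weight per channel, and the weighted sum of feature maps is passed through a ReLU. The sketch below uses hand-made toy activations and gradients rather than a trained MobileNetV2 or ResNet50, and the final scaling to [0, 1] is a common visualisation convention assumed here. The U2Fusion step [34] is a learned network and is not reproduced.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map for one class.
    activations: (K, h, w) feature maps of one convolutional layer.
    gradients:   (K, h, w) derivatives of the class score w.r.t. those maps."""
    alpha = gradients.mean(axis=(1, 2))                    # one weight per channel
    cam = np.tensordot(alpha, activations, axes=(0, 0))    # weighted sum -> (h, w)
    cam = np.maximum(cam, 0.0)                             # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam                 # scale to [0, 1] for display

# Toy example (hypothetical): the class score is mean(A[0]) - mean(A[1]),
# so its gradient is +1/(h*w) for channel 0 and -1/(h*w) for channel 1.
h, w = 4, 6
A = np.zeros((2, h, w))
A[0, 1, 2] = 3.0      # channel 0 fires strongly at one location
A[1] = 1.0            # channel 1 is uniformly active
G = np.broadcast_to(np.array([1.0, -1.0])[:, None, None] / (h * w), A.shape)
cam = grad_cam(A, G)
```

In this toy case the map is non-zero only at position (1, 2), the single location where the positively weighted channel outweighs the negatively weighted one. In practice the map is upsampled to the input resolution and overlaid on the input image.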
Fig. 3. Explainable images from different models and their fusion.
3.4 Stuttering Recognition

We compared our recognition results with the baseline; the image-based classification model outperforms the baseline on this 8-class recognition task. In addition, we selected models from the classical visual domain to explore their potential in stuttering recognition, a particular audio classification task, with the aim of broadening the range of approaches for identifying this disorder. We evaluated our fused images with the axial attention based EfficientNetV2. The model was trained for 50 epochs with the Adam optimiser and an initial learning rate of 0.001. Table 1 presents the performance of our model and of several models built for comparison:
No Grad-CAM: Our proposed framework without the explainable module.
Axial-VGG: VGG16 with an axial attention module.
Sing.ResNet: An axial attention based EfficientNetV2 with the explainable images from ResNet50 only.
Sing.MobileNet: An axial attention based EfficientNetV2 with the explainable images from MobileNetV2 only.
EfficientNet: EfficientNet as the final recognition model, replacing our proposed axial attention based EfficientNetV2.
Axial-EfficientNet: Our proposed framework in this paper.
Moreover, we evaluated the contributions of WT and HOG to the classification model separately.
Fig. 4. Experimental results of different models.
Table 1. Performance results [%] of axial attention based EfficientNetV2 vs other models. Devel: Development set. Test: Test set. Acc.: Accuracy. We compare our test UAR to the Dev baseline used in ComParE 2022 [38], which is 30.2%.

Model                 Devel           Test
                      Acc.    UAR     Acc.    UAR
No Grad-CAM           56.6    38.4    48.7    29.3
Axial-VGG             66.9    44.6    42.1    24.0
Sing.MobileNet        72.9    62.5    31.6    29.4
Sing.ResNet           66.1    43.5    45.1    26.5
EfficientNet          71.3    58.0    43.4    31.9
Axial-EfficientNet    71.9    60.0    43.0    34.6

Feature               Devel           Test
                      Acc.    UAR     Acc.    UAR
Sing.WT               48.9    31.3    49.6    31.6
Sing.HOG              54.0    31.0    48.5    30.5
3.5 Discussion

The experimental results presented in Table 1 show that our proposed framework achieves the best test UAR (34.6%) among all models evaluated in the study. On the development set, Sing.MobileNet reaches a higher UAR (62.5% vs. 60.0%), but its test UAR drops to 29.4%. To further validate the effectiveness of our approach, we compared our results with the development-set baseline used in ComParE 2022 (30.2%) and achieved an increase in UAR of 4.4 percentage points. The accuracy and UAR of the different methods can be seen in Fig. 4, which shows the accuracy and UAR results for the training, validation, and testing phases across 50 epochs.

Furthermore, we conducted experiments to investigate the impact of fused explainable images on the overall performance of our framework. On the test set, the fused explainable images outperformed the single-model explainable images in UAR (34.6% vs. 29.4% and 26.5%), which supports our claim that fusing explainable images from two models is an effective way to extract and inherit vital information and to improve overall performance.

The improved performance of our framework can be attributed to several key factors. Firstly, the incorporation of explainable images provided a more interpretable representation of the important features that contributed to stuttering classification. Secondly, the use of two independent models and the fusion of their outputs allowed us to reduce potential biases and improve the robustness of our results. Lastly, our approach combines image-based feature engineering with an attention-augmented recognition model. Most importantly, our approach has the potential to support the diagnosis and treatment of stuttering, ultimately enhancing the quality of life of people who stutter.
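Since the comparison rests on UAR rather than accuracy, a short worked example may help. UAR is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. The sketch below uses made-up labels for illustration only. It also computes the gain over the baseline from the values in Table 1: the difference between the test UAR of 34.6% and the 30.2% baseline is 4.4 percentage points, which corresponds to roughly 14.6% in relative terms.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class contributes equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Made-up imbalanced example: a model that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
acc = accuracy(y_true, y_pred)                     # 0.8
uar = unweighted_average_recall(y_true, y_pred)    # (1.0 + 0.0) / 2 = 0.5

# Gain over the ComParE 2022 baseline, using the values reported in Table 1.
ours, baseline = 34.6, 30.2
absolute_gain = ours - baseline                    # in percentage points
relative_gain = 100.0 * absolute_gain / baseline   # in percent of the baseline
```

The toy example shows why UAR is the stricter measure on imbalanced data such as stuttering events: always predicting the majority class gives 80% accuracy but only 50% UAR.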
4 Conclusion and Future Work

In this study, we extracted WT and HOG features and fused them as the input of our explainable module, whose explainable output images serve as the input of our stuttering recognition model. The results on the KSoF dataset demonstrate that our method outperforms the compared models in test UAR. Furthermore, applying Grad-CAM to audio data classification enhances the interpretability and visualisability of the model’s features during prediction, which may improve the model’s reliability and scalability in practical applications. The localised regions identified in the interpretable module contain potentially significant attention features for different categories, thus making it possible to explain the predicted categories, which might be especially valuable in the medical field. In future work, we will explore the regions visualised by Grad-CAM to explain the underlying nuances of different stuttering categories from the perspective of physiological signals. Moreover, other explainable models also need to be explored to advance the interpretability and transparency of both features and models. Explainability is urgently needed in deep learning, especially in the detection and recognition of various diseases. We take a step forward with an efficient explainable model and hope this work offers some clues in this direction.

Acknowledgements. This work was partially supported by the National Key Research and Development Program of China (Grant No. 2019YFA0706200), the Project funded by China Postdoctoral Science Foundation (Grant No. 2021M700423), the Ministry of Science and Technology of the People’s Republic of China (No. 2021ZD0201900, 2021ZD0200601), the National Natural Science Foundation of China (No. 62227807, 62272044, 62072219), the National High-Level Young Talent Project, the BIT Teli Young Fellow Program from the Beijing Institute of Technology,
China, the Natural Science Foundation of Gansu Province, China (No. 22JR5RA401), the Fundamental Research Funds for the Central Universities (No. lzujbky-2022-ey13), the JSPS KAKENHI (No. 20H00569), the JST Mirai Program (No. 21473074), and the JST MOONSHOT Program (No. JPMJMS229B), Japan.
References

1. Hu, B., Shen, J., Zhu, L., Dong, Q., Cai, H., Qian, K.: Fundamentals of computational psychophysiology: theory and methodology. IEEE Trans. Comput. Soc. Syst. 9(2), 349–355 (2022)
2. Shen, J., Zhang, X., Hu, B., Wang, G., Ding, Z., Hu, B.: An improved empirical mode decomposition of electroencephalogram signals for depression detection. IEEE Trans. Affect. Comput. 13(1), 262–271 (2022)
3. Zhang, X., Shen, J., ud Din, Z., Liu, J., Wang, G., Hu, B.: Multimodal depression detection: fusion of electroencephalography and paralinguistic behaviors using a novel strategy for classifier ensemble. IEEE J. Biomed. Health Inform. 23(6), 2265–2275 (2019)
4. Banerjee, N., Borah, S., Sethi, N.: Intelligent stuttering speech recognition: a succinct review. Multimed. Tools Appl. 81, 1–22 (2022)
5. Lickley, R.: Disfluency in typical and stuttered speech. Fattori Sociali E Biologici Nella Variazione Fonetica-Social and Biological Factors in Speech Variation (2017)
6. Junuzovic-Zunic, L., Sinanovic, O., Majic, B.: Neurogenic stuttering: etiology, symptomatology, and treatment. Med. Arch. 75(6), 456 (2021)
7. Catalano, G., Robben, D.L., Catalano, M.C., Kahn, D.A.: Olanzapine for the treatment of acquired neurogenic stuttering. J. Psychiatr. Pract.® 15(6), 484–488 (2009)
8. Oue, S., Marxer, R., Rudzicz, F.: Automatic dysfluency detection in dysarthric speech using deep belief networks. In: Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, pp. 60–64 (2015)
9. Sheikh, S.A., Sahidullah, M., Hirsch, F., Ouni, S.: StutterNet: stuttering detection using time delay neural network. In: 29th European Signal Processing Conference (EUSIPCO), pp. 426–430 (2021)
10. Qian, K., et al.: A bag of wavelet features for snore sound classification. Ann. Biomed. Eng. 47(4), 1000–1011 (2019)
11. Qian, K., Zhang, Z., Yamamoto, Y., Schuller, B.W.: Artificial intelligence Internet of Things for the elderly: from assisted living to health-care monitoring. IEEE Signal Process. Mag. 38(4), 78–88 (2021)
12. Qian, K., et al.: Computer audition for healthcare: opportunities and challenges. Front. Digit. Health 2, 5 (2020)
13. Shen, J., Zhao, S., Yao, Y., Wang, Y., Feng, L.: A novel depression detection method based on pervasive EEG and EEG splitting criterion. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1879–1886. IEEE (2017)
14. Shen, J., et al.: An optimal channel selection for EEG-based depression detection via kernel-target alignment. IEEE J. Biomed. Health Inform. 25(7), 2545–2556 (2020)
15. Yang, M., Ma, Y., Liu, Z., Cai, H., Hu, X., Hu, B.: Undisturbed mental state assessment in the 5G era: a case study of depression detection based on facial expressions. IEEE Wirel. Commun. 28(3), 46–53 (2021)
16. Zhang, K., et al.: Research on mine vehicle tracking and detection technology based on YOLOv5. Syst. Sci. Control Eng. 10(1), 347–366 (2022)
17. Shen, J., et al.: Exploring the intrinsic features of EEG signals via empirical mode decomposition for depression recognition. IEEE Trans. Neural Syst. Rehabil. Eng. 31, 356–365 (2022)
18. Shen, J., et al.: Depression recognition from EEG signals using an adaptive channel fusion method via improved focal loss. IEEE J. Biomed. Health Inform. 27, 3234–3245 (2023)
19. Rosenberg, J., et al.: Conflict processing networks: a directional analysis of stimulus-response compatibilities using MEG. PLoS ONE 16(2), e0247408 (2021)
20. Dong, Q., et al.: Integrating convolutional neural networks and multi-task dictionary learning for cognitive decline prediction with longitudinal images. J. Alzheimer’s Dis. 75(3), 971–992 (2020)
21. Wu, Y., et al.: Person reidentification by multiscale feature representation learning with random batch feature mask. IEEE Trans. Cogn. Dev. Syst. 13(4), 865–874 (2020)
22. Demir, F., Sengur, A., Cummins, N., Amiriparian, S., Schuller, B.W.: Low level texture features for snore sound discrimination. In: 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 413–416 (2018)
23. Barrett, L., Hu, J., Howell, P.: Systematic review of machine learning approaches for detecting developmental stuttering. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 1160–1172 (2022)
24. Howell, P., Sackin, S.: Automatic recognition of repetitions and prolongations in stuttered speech. In: Proceedings of the First World Congress on Fluency Disorders, vol. 2, pp. 372–374. University Press Nijmegen, Nijmegen, The Netherlands (1995)
25. Gupta, S., Shukla, R.S., Shukla, R.K., Verma, R.: Deep learning bidirectional LSTM based detection of prolongation and repetition in stuttered speech using weighted MFCC. Int. J. Adv. Comput. Sci. Appl. 11(9), 1–12 (2020)
26. Świetlicka, I., Kuniszyk-Jóźkowiak, W., Smołka, E.: Artificial neural networks in the disabled speech analysis. Comput. Recogn. Syst. 3, 347–354 (2009)
27. Ravikumar, K.M., Rajagopal, R., Nagaraj, H.: An approach for objective assessment of stuttered speech using MFCC features. ICGST Int. J. Digit. Signal Process. 9(1), 19–24 (2009)
28. Chee, L.S., Ai, O.C., Hariharan, M., Yaacob, S.: MFCC based recognition of repetitions and prolongations in stuttered speech using k-NN and LDA. In: 2009 IEEE Student Conference on Research and Development (SCOReD), pp. 146–149. IEEE (2009)
29. Ai, O.C., Hariharan, M., Yaacob, S., Chee, L.S.: Classification of speech dysfluencies with MFCC and LPCC features. Expert Syst. Appl. 39(2), 2157–2165 (2012)
30. Mahesha, P., Vinod, D.: Support vector machine-based stuttering dysfluency classification using GMM supervectors. Int. J. Grid Util. Comput. 6(3–4), 143–149 (2015)
31. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 618–626 (2017)
32. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520 (2018)
33. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
34. Xu, H., Ma, J., Jiang, J., Guo, X., Ling, H.: U2Fusion: a unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 502–518 (2020)
35. Tan, M., Le, Q.: EfficientNetV2: smaller models and faster training. In: International Conference on Machine Learning (ICML), pp. 10096–10106 (2021)
36. Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T.: Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180 (2019)
37. Bayerl, S.P., von Gudenberg, A.W., Hönig, F., Nöth, E., Riedhammer, K.: KSoF: the Kassel state of fluency dataset–a therapy centered dataset of stuttering. arXiv preprint arXiv:2203.05383 (2022)
38. Schuller, B.W., et al.: The ACM Multimedia 2022 Computational Paralinguistics Challenge: Vocalisations, Stuttering, Activity, & Mosquitoes, pp. 1–5. arXiv preprint arXiv:2205.06799 (2022)
39. McFee, B., et al.: librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, vol. 8, pp. 18–25 (2015)
40. Hunter, J.D.: Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9(03), 90–95 (2007)
Optimizing Cardiac Surgery Risk Prediction: A Machine Learning Approach with Counterfactual Explanations

Dengkang Qin1, Mengxue Liu2, Zheng Chen1(B), and Qian Lei2(B)

1 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, People’s Republic of China
[email protected]
2 Department of Anesthesiology, Sichuan Academy of Medical Sciences and Sichuan Provincial People’s Hospital, University of Electronic Science and Technology of China, Chengdu, Sichuan, People’s Republic of China
[email protected]
Abstract. Cardiac surgery is a high-risk procedure: postoperative complications can be severe and even fatal. Predicting surgical risk can guide treatment planning for high-risk cardiac surgery and thereby reduce the risk of postoperative complications, a topic that has attracted widespread attention from cardiac surgeons. The most commonly used method, EuroSCORE, suffers from low prediction accuracy and is poorly targeted at postoperative complications. In this paper, we developed a machine learning (ML) model that predicts adverse outcomes (AO) after cardiac surgery with high accuracy, and we demonstrated the clinical interpretability of the model using explainable artificial intelligence (XAI) based on counterfactual explanations (CE). A total of 2324 patients who had undergone cardiac surgery with cardiopulmonary bypass support at a single center were included in this study and divided into two groups: non-AO (n = 2148) and AO (n = 176). Our ML model using perioperative data showed the best prediction performance (AUC = 0.769) compared with models using EuroSCORE (AUC = 0.663) and EuroSCORE covariates (AUC = 0.710). The CE method applied to the ML model showed how operation duration, ASA class, BMI, Lac on entering the ICU, and PLT value increase the risk of adverse outcomes following surgery. In addition, sufficiency and necessity metrics were used to give CE a better explanation of feature importance. Our results suggest that machine learning models hold promise for improving the risk assessment of adverse outcomes after cardiac surgery, and that counterfactual explanation methods provide more detailed and practical explanations, which are more useful for medical professionals.

Keywords: Adverse outcome prediction · Counterfactual explanations · Explainable machine learning
Qin and Liu contributed to the work equally and should be regarded as co-first authors. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 221–232, 2023. https://doi.org/10.1007/978-981-99-4749-2_19
1 Introduction

Despite advancements in surgical techniques, anesthetic procedures, and postoperative care, cardiac surgery still carries an inherent risk of perioperative mortality and morbidity. Hence, precise preoperative evaluation of surgical risk is of great importance for shared decision-making among medical professionals, care teams, and patients, and for perioperative planning and risk mitigation. Of the several predictive models established to determine the risk of heart surgery [1–5], the European System for Cardiac Operative Risk Evaluation (EuroSCORE) stands out as the most influential and widely used cardiac surgical risk assessment system. Over the last few years, experts from various regions of the world have reported several issues with EuroSCORE, including insufficient risk prediction [6] or risk overestimation [7, 8], demographic and geographic constraints [9, 10], and poor precision in predicting severe complications after cardiac surgery [11].

Recently, machine learning (ML) has emerged as a powerful tool for analyzing vast amounts of medical data and supporting precise decision-making in disease diagnosis and treatment. ML methods have demonstrated significantly higher accuracy than clinical reference tools in predicting mortality [12, 13]. Commonly used algorithms include logistic regression, support vector machines, decision trees, naive Bayes, and neural networks. However, when the data are large and complex, ensemble learning algorithms such as Random Forest, XGBoost, and LightGBM have achieved relatively higher accuracy and are increasingly being used in clinical practice [14–16]. The uptake of ML models in medicine has nevertheless fallen short of expectations because of their lack of explainability. Consequently, researchers have paid increasing attention to this issue and proposed explainability methods.
Presently, the main explainability algorithms used for disease prediction models are of the post-hoc type, which provides explanations in terms of the input features of a trained model. Model-agnostic post-hoc methods, such as Shapley Additive Explanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME), are particularly popular because they are not tied to a specific model and can be applied to any machine learning model. These methods all aim to attribute the prediction to individual features, yielding a feature importance score for each feature to explain the model. In recent years, sample-based interpretability methods have gradually emerged, particularly Counterfactual Explanations (CE) [17]. To a certain extent, CE is more closely aligned with the judgement and understanding of medical professionals. These methods aim to answer the question “How can the features be changed as little as possible to reverse the model’s prediction?” This approach can provide insight into what changes in a patient’s features would lead to a different prediction outcome and may be more applicable to clinical decision-making.

Prior research has predominantly focused on quantifying the risk of mortality after cardiac surgery using predictive models and machine learning techniques. Furthermore, the interpretability of these models could be improved to make them more suitable for clinical use. To address this gap, our study aimed to develop an explainable machine learning model capable of predicting AO, including severe complications and mortality, after heart surgery. We used data collected from hospital admission to discharge from critical care in a prospectively collected, single-center cohort. Subsequently,
we combined the best-performing model with interpretability methods, specifically counterfactual explanations. This approach enables clinicians to identify the factors that have the greatest impact on the model’s predicted risk of cardiac surgery, which enhances its acceptability and usability as a clinical decision support tool. The main findings and contributions of this article are listed below:
• We have developed an ML model that can more accurately predict AO following cardiac surgery.
• We employed interpretability methods to identify the factors contributing to the model’s predictions of AO after surgery.
• To offer a better explanation of feature importance, we used the counterfactual explanations method with the sufficiency and necessity metrics, which are more closely aligned with the medical perspective and can assist clinicians in diagnosis.
2 Methods

2.1 Study Design and the Related Dataset

Data from consecutive patients who underwent cardiac surgery with cardiopulmonary bypass between January 2013 and December 2018 in a hospital with more than 3000 beds were included in this single-center cohort study. The study was approved by the hospital Human Research Ethics Committee in April 2020 (No. 2020–54), which waived the need for informed consent because of the observational nature of the study. Reporting of this study complies with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. The EuroSCORE was calculated using the coefficients described in the literature [1]. We used socio-demographic characteristics, comorbidities, preoperative status data, surgical data, intensive care unit data, the EuroSCORE, and the 17 EuroSCORE subitems as predictors. We combined acute myocardial infarction (AMI), cardiac arrest, stroke, chronic kidney disease requiring dialysis, and death into a single dichotomous postoperative feature, which served as the model output for prediction (adverse outcome, AO, versus non-adverse outcome, non-AO). Perioperative care (including anesthesia, monitoring techniques, normothermic cardiopulmonary bypass, and critical care) was standardized for all patients.

2.2 Data Processing and Modeling

Considering the large number of features and the many missing values, as well as the superior performance of ensemble learning methods in classification tasks compared with other machine learning methods, we ultimately chose LightGBM as the prediction model. The light gradient boosting machine (LightGBM) [18] is a highly efficient gradient boosting decision tree (GBDT) that can perform regression and classification tasks and also supports categorical features and missing values. It uses a histogram-based decision tree algorithm, which has low memory usage and low split-finding complexity.
This makes it an effective solution to the efficiency and scalability challenges that traditional GBDT models face when dealing with high feature dimensions and large data volumes.
Three models were trained, differing in their input features. The model that uses only the EuroSCORE value as the input feature for LightGBM is denoted LightGBM-Eurolinear. The model trained jointly on all 17 EuroSCORE subitem features is denoted LightGBM-EuroSCORE, and the model trained on all available features is denoted LightGBM-all. Prior to the actual training process, we used the fast and lightweight AutoML (FLAML) [19] library to select the set of hyperparameters with the minimum loss on the dataset under the objective function. To create a more reliable prediction model and to avoid biased results and overfitting, we used a stratified 5-fold sampling technique to partition the dataset into a training set and a test set, and assigned weights to the two sample classes during the training process.

2.3 Model Explainability and Counterfactual Analysis

Compared with the difficult task of developing self-explanatory ML models, post-hoc explainability techniques can be applied easily. Post-hoc explainability methods fall into two categories: feature-attribution-based methods such as LIME and SHAP, and sample-based explanation methods such as counterfactual explanations. In this article, we primarily used the counterfactual explanation method to explain the model, while also employing LIME and SHAP for comparison to demonstrate the advantages of the counterfactual approach.

LIME (Local Interpretable Model-agnostic Explanations) [20] is a local surrogate method. It generates a new dataset by slightly perturbing the input and observing the resulting changes in the output of the black-box model. This new dataset is then used to train an interpretable model that approximates the predictions of the black-box model. LIME computes the importance of each feature by analyzing the coefficients of the surrogate model.
The magnitude of these coefficients indicates the degree to which each feature contributes to the prediction of the surrogate model.

The SHAP (Shapley Additive Explanations) method [21] attributes features by estimating their Shapley values. The Shapley value [22] is a concept from coalitional game theory that determines the contribution of each feature value to the prediction. Specifically, given a set of current feature values (i.e. a coalition), the contribution of a feature value to the difference between the actual prediction and the average prediction is its estimated Shapley value. Feature attributions are calculated for each instance, and the average absolute Shapley value across all instances is used as the importance score of each feature, which forms the global explanation.

Methods that rely on feature attribution essentially generate an importance index for each feature in a given sample. However, these explanations may lack persuasiveness and offer little guidance on prospective changes, particularly in the field of medicine. As a post-hoc explainability technique, the sample-based counterfactual explanation method may offer a novel perspective. This method generates counterfactual examples for a given sample, in which a minimal modification of the features changes the predicted outcome. By explaining the model through the changes between the sample and its counterfactual examples, this approach provides a different kind of explanation from static numerical feature importance. The concept of counterfactual explanations was introduced by Wachter et al. [17] in 2017, who abstracted the process of generating
counterfactual samples and transformed it into an optimization problem, expressed mathematically as follows:

\[ c = \arg\min_{c} \; \mathrm{yloss}(f(c), y) + |x - c| \tag{1} \]
where the first term, yloss, pushes the counterfactual example $c$ towards a different prediction from that of the original example $x$, while the second term keeps $c$ close to $x$. However, Wachter’s method generates only a single counterfactual example for each original example $x$, which is insufficient in many cases. To tackle this issue, the Diverse Counterfactual Explanations (DICE) [23] approach uses determinantal point processes (DPP) to generate and evaluate multiple counterfactual explanations. The core idea is to change the output of the machine learning model through perturbation in a way that is diverse (multiple counterfactual examples can be generated for any machine learning model) and feasible (simple constraints on features are supported). DICE casts the search for counterfactual explanations as the following optimization problem:

\[ C(x) = \arg\min_{c_1,\ldots,c_k} \; \frac{1}{k}\sum_{i=1}^{k} \mathrm{yloss}(f(c_i), y) + \frac{\lambda_1}{k}\sum_{i=1}^{k} \mathrm{dist}(c_i, x) - \lambda_2\, \mathrm{dpp\_diversity}(c_1,\ldots,c_k) \tag{2} \]

where $c_i$ is a counterfactual example, $k$ is the total number of counterfactual examples generated, $f(\cdot)$ is the machine learning model to be interpreted, $\mathrm{yloss}(\cdot)$ is the distance between the prediction of model $f(\cdot)$ for $c_i$ and the required output $y$, $\mathrm{dist}(\cdot)$ measures the distance between the counterfactual example $c_i$ and the original input $x$, $\mathrm{dpp\_diversity}$ (see Eq. 3) is the diversity indicator, and $\lambda_1$ and $\lambda_2$ are the weights of the last two terms.

\[ \mathrm{dpp\_diversity} = \det(K), \quad \text{where } K_{i,j} = \frac{1}{1 + \mathrm{dist}(c_i, c_j)} \tag{3} \]
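The objective in Eq. 2 and the diversity term in Eq. 3 can be evaluated for a candidate set of counterfactuals as in the minimal sketch below. It only scores a given set and does not perform the optimization itself. The L1 distance, the default weights, and the function names are assumptions made for illustration; the text does not specify the distance, the form of yloss, or the weight values used in the study.

```python
def l1_distance(a, b):
    """Assumed distance: sum of absolute feature differences."""
    return sum(abs(u - v) for u, v in zip(a, b))


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular kernel, e.g. duplicated counterfactuals
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def dpp_diversity(cfs, dist=l1_distance):
    """Eq. 3: det(K) with K[i][j] = 1 / (1 + dist(c_i, c_j))."""
    kernel = [[1.0 / (1.0 + dist(ci, cj)) for cj in cfs] for ci in cfs]
    return determinant(kernel)


def dice_objective(cfs, x, predict, target, yloss,
                   lambda1=0.5, lambda2=1.0, dist=l1_distance):
    """Value of the Eq. 2 objective for one candidate set (lower is better)."""
    k = len(cfs)
    loss_term = sum(yloss(predict(c), target) for c in cfs) / k
    proximity_term = lambda1 / k * sum(dist(c, x) for c in cfs)
    return loss_term + proximity_term - lambda2 * dpp_diversity(cfs, dist)


# Toy usage with a hypothetical model and a squared-error yloss.
toy_value = dice_objective(
    cfs=[[0.0, 0.0], [1.0, 0.0]],
    x=[0.0, 0.0],
    predict=lambda c: sum(c),
    target=1.0,
    yloss=lambda p, t: (p - t) ** 2,
)
```

Two identical counterfactuals give a singular kernel and a diversity of zero, which is why the diversity term rewards sets whose members differ from one another.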
2.4 Quantitative Metrics of Model Explainability

The explainable artificial intelligence (XAI) field has not yet reached a consensus on standardized evaluation metrics, so auxiliary criteria such as interpretability are not easily quantified. In conjunction with the explainability methods used in this article, we introduce the sufficiency and necessity indicators of features [24], which evaluate the interpretability of the model by constraining each feature within a sample-based method. Assuming that there are $N$ samples and that $n$ counterfactual examples are generated for each sample, the necessity and sufficiency of the feature $x_j$ of sample $x$ are defined as follows:

(1) Necessity: fix all other features in the sample and allow only the feature $x_j$ to change, which can be written as

\[ \mathrm{Necessity} = \frac{\sum_{i,\, x_j = a} \mathbf{1}(c_i)}{n \cdot N} \tag{4} \]
(2) Sufficiency: fix the feature $x_j$ and allow all other features to change, which can be written as

\[ \mathrm{Sufficiency} = \frac{\sum_{i} \mathbf{1}(c_i)}{n \cdot N} - \frac{\sum_{i,\, x_j = a} \mathbf{1}(c_i)}{n \cdot N} \tag{5} \]
Necessity is calculated by allowing only a single feature to change and measuring how often this alone reverses the model’s prediction (from AO to non-AO). In reality, it is often difficult to change a single feature without affecting other features. Necessity can therefore be used to measure the degree to which a feature should be considered under ideal conditions of cardiac surgery. Sufficiency is calculated by restricting a certain feature to remain unchanged. A higher sufficiency value means that fewer counterfactual samples are generated under this restriction, which indicates that the feature is often changed together with multiple other features in the counterfactual samples that reverse the model’s prediction. Thus, sufficiency can serve as a necessary supplement, and a higher sufficiency value implies a lower likelihood of mutual influence between features. By comparing the feature importance rankings obtained from these indicators with those obtained from the interpretability methods used in this paper, the differences between the various explainability methods can be evaluated.
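As a worked illustration of Eqs. 4 and 5, the sketch below computes both metrics from counts of valid counterfactuals. The function and argument names are hypothetical, and the counts in the usage lines are invented for the example; they are not results from the study.

```python
def necessity(valid_only_feature_free, n_per_sample, n_samples):
    """Eq. 4: share of valid counterfactuals found when only x_j may change."""
    return valid_only_feature_free / (n_per_sample * n_samples)


def sufficiency(valid_unconstrained, valid_feature_fixed, n_per_sample, n_samples):
    """Eq. 5: drop in the share of valid counterfactuals once x_j is held fixed."""
    total = n_per_sample * n_samples
    return valid_unconstrained / total - valid_feature_fixed / total


# Invented counts: N = 10 samples, n = 10 counterfactuals requested per sample.
example_necessity = necessity(30, n_per_sample=10, n_samples=10)
example_sufficiency = sufficiency(100, 40, n_per_sample=10, n_samples=10)
```

In this invented case, changing the feature alone reverses the prediction for 30 of 100 requested counterfactuals, and holding it fixed removes 60 of the 100 counterfactuals that were otherwise found.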
3 Results

3.1 Characteristics of Patients

From January 2013 to December 2018, 2324 cardiac procedures with cardiopulmonary bypass were performed (57.2% of patients were female). Among them, 176 patients (7.6%; 95% CI: 6.5–8.7) had a postoperative adverse outcome and 88 (3.8%; 95% CI: 3.0–4.6) died during the in-hospital stay. The mean (standard deviation) age was 46.3 (17.2) years, and the mean EuroSCORE was 1.6 (1.6). Univariate analysis showed a significant difference in adverse outcome for 46 variables, including age, hypertension, ASA class, NYHA class, LVEF, heart rate, EuroSCORE, preoperative laboratory analysis, intraoperative data, and ICU data at entry.

3.2 Receiver Operating Characteristic Analysis

The performance of each model was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC) on the validation dataset. Stratified 5-fold sampling was performed for the three models, and the average ROC curves of the five rounds are plotted in Fig. 1. The figure displays the ROC curves of the EuroSCORE, Eurolinear, and ML models tested on the validation dataset for predicting AO in all patients included in this study. In addition, 2000 samples were drawn for each model to calculate the 95% confidence intervals (CI) of the AUCs of the Eurolinear, EuroSCORE, and ML models on the validation dataset. The ML model achieved the highest AUC (0.769 (0.728–0.812)), while the AUCs of EuroSCORE (0.710 (0.642–0.757)) and Eurolinear (0.663 (0.659–0.698)) were significantly lower than that of the ML model (P < 0.001).
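The stratified 5-fold partitioning and the class weighting mentioned for these experiments could be sketched as follows, using the reported class sizes (176 AO, 2148 non-AO). The round-robin fold assignment and the "balanced" weighting formula are assumptions made for illustration; the text does not state which weighting scheme the authors used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign indices to folds so each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def balanced_class_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1
    return {label: len(labels) / (len(counts) * count)
            for label, count in counts.items()}


# Class sizes reported in the paper: 176 AO (label 1) and 2148 non-AO (label 0).
labels = [1] * 176 + [0] * 2148
folds = stratified_folds(labels)
weights = balanced_class_weights(labels)
```

Each fold serves once as the held-out set while the other four are used for training, so every patient is scored exactly once by a model that did not see them.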
Fig. 1. Receiver operating characteristic curves showing the performance of LightGBM-Eurolinear (only EuroSCORE), LightGBM-EuroSCORE (EuroSCORE covariates), and LightGBM-all (all features) in predicting postoperative adverse outcome.
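For readers who want to see how an AUC and its 95% CI can be obtained, the sketch below computes the AUC as a pairwise ranking probability and estimates the interval with a percentile bootstrap over 2000 resamples. Treating the "2000 samples" as bootstrap resamples is an assumption; the toy labels and scores are invented and are not study data.

```python
import random


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(labels, scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed resampling scheme)."""
    roc_auc(labels, scores)  # fail early if a class is missing
    rng = random.Random(seed)
    indices = range(len(labels))
    aucs = []
    while len(aucs) < n_resamples:
        sample = [rng.choice(indices) for _ in indices]
        resampled_labels = [labels[i] for i in sample]
        if len(set(resampled_labels)) < 2:
            continue  # skip resamples that contain a single class
        aucs.append(roc_auc(resampled_labels, [scores[i] for i in sample]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_resamples)]
    upper = aucs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented toy predictions, only to show the call pattern.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.50, 0.90]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ci = bootstrap_auc_ci(toy_labels, toy_scores)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a sketch but would normally be replaced by a rank-based computation on a cohort of this size.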
3.3 Diverse Counterfactual Explanations

Multiple experiments were conducted to explain the model. In the counterfactual explanation, more attention was given to patients who suffered postoperative complications, exploring how their features could be perturbed so that, under this model, a postoperative adverse outcome would be avoided. The LightGBM-Eurolinear model was selected for comparison, and the samples that were True Positive under the LightGBM-all model but predicted as Negative by the LightGBM-Eurolinear model were selected for counterfactual explanation. The DICE method calculates feature importance as the number of times a feature appears in the generated counterfactual examples, divided by the maximum number of counterfactual examples that could be generated in theory. We calculated the feature importance scores of the top features under the three methods (see Fig. 2), computed the sufficiency and necessity scores of each feature, and ranked the features accordingly (see Table 1). Furthermore, the average rank difference between feature importance and sufficiency (or necessity) was computed for each method. The results showed that LIME, SHAP, and DICE had average rank differences of 5.5, 3.6, and 3.1 for necessity, and 4.7, 7.1, and 8.6 for sufficiency, respectively. It can be observed that (1) LIME performs better in sufficiency, and (2) SHAP and DICE exhibit similar performance in sufficiency and necessity ranking, but DICE performs better in necessity. Although each of the three has its own advantages, only sample-based methods such as DICE can calculate the sufficiency and necessity indicators.
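The counting rule for DICE feature importance described above can be sketched as follows. Representing patients as dictionaries, the tolerance for deciding that a feature changed, and the feature names in the toy data are all assumptions made for illustration.

```python
def dice_feature_importance(originals, counterfactuals, n_per_sample, tol=1e-9):
    """Share of possible counterfactuals in which each feature was changed.

    originals: one feature dict per explained sample.
    counterfactuals: for each sample, a list of counterfactual feature dicts.
    n_per_sample: counterfactuals requested per sample; the theoretical
    maximum is n_per_sample * len(originals).
    """
    max_counterfactuals = n_per_sample * len(originals)
    counts = {name: 0 for name in originals[0]}
    for original, generated in zip(originals, counterfactuals):
        for counterfactual in generated:
            for name, value in original.items():
                if abs(counterfactual[name] - value) > tol:
                    counts[name] += 1
    return {name: count / max_counterfactuals for name, count in counts.items()}


# Invented toy data with two hypothetical features.
toy_originals = [
    {"operation_duration": 5.0, "bmi": 30.0},
    {"operation_duration": 4.0, "bmi": 28.0},
]
toy_counterfactuals = [
    [{"operation_duration": 3.0, "bmi": 30.0},
     {"operation_duration": 3.5, "bmi": 25.0}],
    [{"operation_duration": 2.0, "bmi": 28.0}],  # only one found for sample 2
]
toy_importance = dice_feature_importance(
    toy_originals, toy_counterfactuals, n_per_sample=2
)
```

Dividing by the theoretical maximum rather than by the number of counterfactuals actually found means that samples for which the search fails lower every feature's score.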
Fig. 2. Importance values of the top features under three interpretation methods (LIME, SHAP, and DICE), based on the LightGBM-all model. Indicators suffixed with (AFTER) are postoperative indicators.

Table 1. Ranking of characteristics for different interpretable methods. Necessity and sufficiency are given as score and rank; feature importance is given as rank.

| Feature | Necessity (score) | Necessity (rank) | Sufficiency (score) | Sufficiency (rank) | LIME (rank) | SHAP (rank) | DICE (rank) |
|---|---|---|---|---|---|---|---|
| Operation duration | 0.302 | 2 | 0.123 | 8 | 1 | 2 | 1 |
| Lac entering ICU | 0.354 | 1 | 0.104 | 13 | 5 | 1 | 4 |
| ASA | 0.137 | 13 | 0.095 | 15 | 9 | 4 | 2 |
| CPB | 0.281 | 3 | 0.097 | 14 | 4 | 7 | 7 |
| Blood transfusion-quantity | 0.208 | 6 | 0.132 | 7 | 7 | 3 | 14 |
| Creatinine^a | 0.130 | 14 | 0.638 | 1 | 2 | 13 | 12 |
| Lac after CPB | 0.191 | 7 | 0.264 | 6 | 8 | 8 | 13 |
| hs-CRP^a | 0.154 | 11 | 0.439 | 3 | 12 | 5 | 10 |
| NEUT% | 0.167 | 10 | 0.091 | 17 | 11 | 11 | 8 |
| PLT^a | 0.184 | 8 | 0.123 | 9 | 13 | 6 | 11 |
| BNP | 0.094 | 17 | 0.473 | 2 | 6 | 12 | 15 |
| Age | 0.148 | 12 | 0.082 | 19 | 14 | 10 | 9 |
| HCT | 0.264 | 5 | 0.091 | 17 | 17 | 9 | 6 |
| BMI | 0.269 | 4 | 0.104 | 12 | 18 | 15 | 3 |
| PLT | 0.174 | 9 | 0.111 | 11 | 19 | 19 | 5 |
| CK-MB | 0.123 | 15 | 0.426 | 5 | 10 | 16 | 18 |
| Urea^a | 0.106 | 16 | 0.123 | 9 | 16 | 18 | 15 |
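One plausible reading of the "average rank difference" reported in Sect. 3.3 is the mean absolute difference between a method's importance rank and the necessity (or sufficiency) rank of each feature. The sketch below applies that reading to the necessity and importance ranks listed in Table 1. Both the formula and its application to the listed rows are assumptions: the ranks in the table go up to 19, which suggests that some features are not listed, so the outputs need not reproduce every reported average.

```python
def mean_abs_rank_difference(ranks_a, ranks_b):
    """Average absolute difference between two rankings of the same features."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rankings must cover the same features")
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b)) / len(ranks_a)


# Rank columns transcribed from Table 1, in row order (17 listed features).
necessity_rank = [2, 1, 13, 3, 6, 14, 7, 11, 10, 8, 17, 12, 5, 4, 9, 15, 16]
importance_rank = {
    "LIME": [1, 5, 9, 4, 7, 2, 8, 12, 11, 13, 6, 14, 17, 18, 19, 10, 16],
    "SHAP": [2, 1, 4, 7, 3, 13, 8, 5, 11, 6, 12, 10, 9, 15, 19, 16, 18],
    "DICE": [1, 4, 2, 7, 14, 12, 13, 10, 8, 11, 15, 9, 6, 3, 5, 18, 15],
}

rank_gap_to_necessity = {
    method: mean_abs_rank_difference(ranks, necessity_rank)
    for method, ranks in importance_rank.items()
}
```

A smaller value means that a method's importance ranking agrees more closely with the necessity ranking; the same function can be applied to the sufficiency ranks.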
4 Discussion

The objective of this study was to use a machine learning model that incorporates preoperative, intraoperative, and postoperative factors to predict AO in patients undergoing cardiac surgery. The main results were: (I) compared with the EuroSCORE risk assessment, the machine learning model showed good predictive ability in risk assessment, with a higher AUC; (II) the top-ranking risk factors under the DICE, SHAP, and LIME methods were similar but showed certain critical differences. However, the counterfactual explanation method DICE can supplement the interpretability of the machine learning model through necessity and sufficiency, making it easier for clinicians to understand.

Because cardiac surgery affects multiple systems in the body, postoperative mortality rates are high. We attempt to use various assessment methods to identify the patients who are most likely to experience AO after surgery, so that preventive measures can be taken to reduce the occurrence of such AO. The specialty of cardiac surgery has always been at the forefront of using risk prediction models. The two most commonly used models for risk stratification in cardiac surgery are EuroSCORE and the Society of Thoracic Surgeons (STS) risk scores, which are based on logistic regression models [1, 2, 25]. Currently available risk stratification tools provide a cross-sectional view of a patient’s disease, but are often not sensitive enough for clinical application because of differences in target populations, surgical types, and outcome indicators [25]. Machine learning models overcome this limitation by making full use of patient data for analysis. They can quickly generate predictive models for many clinical problems [26] and can learn and update in real time as parameters change, providing strong support for the clinical decisions of doctors and patients [27].
Our study found that machine learning models predict risk more accurately than the EuroSCORE evaluation, which is consistent with previous research findings [13, 16]. The LightGBM-all model achieved the highest AUC value of 0.769, followed by LightGBM-EuroSCORE with an AUC value of 0.710 and LightGBM-Eurolinear with an AUC value of 0.663. The improvement in AUC suggests that features other than EuroSCORE, particularly those related to cardiac surgery, are valuable for the decision making of the LightGBM-all model. This finding agrees with the study of Fan, Y. et al. [28], which suggested that there may be some
influential factors that are not collected by EuroSCORE but have an impact on mortality risk, such as preoperative blood loss, surgery time, and cross-clamp time.

Previous studies on using machine learning models for cardiac risk assessment have mainly relied on model-agnostic post-hoc interpretability methods such as LIME or SHAP. The core idea behind these methods is to attribute the prediction to each feature and obtain a feature importance score for each one. Although these methods can explain the important features of black-box models, their style of explanation matches how computer developers understand a model and, to some extent, still differs from a medical perspective. Counterfactual explanation methods are based on sample-specific interpretability. In this study, the method generates different versions of a patient’s features (counterfactual explanation samples) and demonstrates how a patient who would have had a poor postoperative outcome can achieve a reversal of the model prediction (predicted as non-AO) with minimal feature changes. This kind of interpretability is, to some extent, more in line with the judgment and understanding of medical professionals. After all, one of the purposes of risk prediction is to intervene precisely on the factors that have the greatest impact on patients’ AO and to improve their prognosis.

In this study, the top-ranked risk factors identified by the DICE, SHAP, and LIME methods were similar. To further explain the model, we introduced the concepts of necessity and sufficiency based on DICE. When a certain feature is important enough in a given sample, perturbing only that feature is sufficient to achieve the counterfactual effect. Such features tend to be more necessary. Experimental results showed that the trends of the feature importance obtained by the three methods and of the necessity measure are roughly similar, and ‘Operation duration’ is considered the most influential feature for AO in patients.
Sufficiency represents the degree of interaction between a feature and other features. The more sufficient a feature is, the more features are involved in its changes, and changing that feature will be restricted by other features. For example, ‘Creatinine after surgery’ has a high degree of sufficiency, which may be related to factors such as creatinine before surgery, CPB time, aortic occlusion time, and erythrocyte suspension transfusion quantity. Reducing ‘Creatinine after surgery’ would require changes to multiple features. In general, for both feature attribution methods and sample-based counterfactual methods, the importance of features in high-dimensional data is not clearly reflected in sufficiency and necessity. Despite its many shortcomings, the DICE method of explanation has greater potential in medicine than SHAP and LIME, and further development is needed to achieve clinical application.

The current study has several limitations. Firstly, the EuroSCORE used in this study was calculated retrospectively, which may introduce some bias. Secondly, our results may not be generalizable because of the relatively small sample size, partially missing data, and lack of external validation. Additionally, there may be confounding variables that were not included in the analysis but do predict the outcome. Finally, using data from a single center for both development and validation may limit the generalizability of our findings, and prospective multicenter studies are necessary in the future.
5 Conclusions

Studies in the literature have applied ML methods to risk estimation in cardiac surgery, but they have mainly focused on predicting patient mortality, and few of them combine ML with XAI. Even when XAI methods are used, most of these studies explain the model by obtaining feature importance. The present study is, to our knowledge, the first to combine machine learning with counterfactual-explanation-based XAI techniques to predict AO following cardiac surgery and to determine the risk factors for patients before and after cardiac surgery based on various indicators. The results of this study will help clinicians identify individuals at risk by paying attention to the patient’s surgical parameters.

Acknowledgments. This work is supported by the Natural Science Foundation of Sichuan Province (2022NSFSC0503), Sichuan Science and Technology Program (2022ZHCG0007), the National Natural Science Foundation of China (82202071), Sichuan Provincial Science and Technology Program (2022YFS0301, 2023YFS0036), the Science and Technology Project of the Health Planning Committee of Sichuan (20ZD011), and Chengdu Science and Technology Program (2021-YF05-00640-SN).
References

1. Nashef, S.A., et al.: European system for cardiac operative risk evaluation (EuroSCORE). Eur. J. Cardiothorac. Surg. 16(1), 9–13 (1999)
2. Nashef, S.A., et al.: EuroSCORE II. Eur. J. Cardiothorac. Surg. 41(4), 734–744 (2012). Discussion 744-5
3. Parsonnet, V., Dean, D., Bernstein, A.D.: A method of uniform stratification of risk for evaluating the results of surgery in acquired adult heart disease. Circulation 79(6 Pt 2), I3-12 (1989)
4. Tu, J.V., Jaglal, S.B., Naylor, C.D.: Multicenter validation of a risk index for mortality, intensive care unit stay, and overall hospital length of stay after cardiac surgery. Circulation 91(3), 677–684 (1995). Steering Committee of the Provincial Adult Cardiac Care Network of Ontario
5. Edwards, F.H., et al.: The Society of Thoracic Surgeons National Cardiac Surgery Database: current risk assessment. Ann. Thorac. Surg. 63(3), 903–908 (1997)
6. Siregar, S., et al.: Performance of the original EuroSCORE. Eur. J. Cardiothorac. Surg. 41(4), 746–754 (2012)
7. Parolari, A., et al.: Performance of EuroSCORE in CABG and off-pump coronary artery bypass grafting: single institution experience and meta-analysis. Eur. Heart J. 30(3), 297–304 (2009)
8. Basraon, J., et al.: Comparison of risk scores to estimate perioperative mortality in aortic valve replacement surgery. Ann. Thorac. Surg. 92(2), 535–540 (2011)
9. Zheng, Z., et al.: The Chinese coronary artery bypass grafting registry study: how well does the EuroSCORE predict operative risk for Chinese population? Eur. J. Cardiothorac. Surg. 35(1), 54–58 (2009)
10. Yap, C.H., et al.: Validation of the EuroSCORE model in Australia. Eur. J. Cardiothorac. Surg. 29(4), 441–446 (2006). Discussion 446
11. Toumpoulis, I.K., et al.: Does EuroSCORE predict length of stay and specific postoperative complications after cardiac surgery? Eur. J. Cardiothorac. Surg. 27(1), 128–133 (2005)
12. Meyer, A., et al.: Machine learning for real-time prediction of complications in critical care: a retrospective study. Lancet Respir. Med. 6(12), 905–914 (2018)
13. Allyn, J., et al.: A comparison of a machine learning model with EuroSCORE II in predicting mortality after elective cardiac surgery: a decision curve analysis. PLoS ONE 12(1), e0169772 (2017)
14. Rufo, D.D., et al.: Diagnosis of diabetes mellitus using gradient boosting machine (LightGBM). Diagnostics (Basel) 11(9), 1714 (2021)
15. Tseng, P.Y., et al.: Prediction of the development of acute kidney injury following cardiac surgery by machine learning. Crit. Care 24(1), 478 (2020)
16. Zeng, X., et al.: Prediction of complications after paediatric cardiac surgery. Eur. J. Cardiothorac. Surg. 57(2), 350–358 (2020)
17. Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harv. JL Tech. 31, 841 (2018)
18. Qi, M.: LightGBM: a highly efficient gradient boosting decision tree. In: Neural Information Processing Systems (2017)
19. Wang, C., et al.: FLAML: a fast and lightweight AutoML library (2019)
20. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
21. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
22. Shapley, L.S.: A value for n-person games. Technical report, Rand Corp Santa Monica CA (1952)
23. Mothilal, R.K., Sharma, A., Tan, C.: Explaining machine learning classifiers through diverse counterfactual explanations. In: FAT* 2020: Conference on Fairness, Accountability, and Transparency (2020)
24. Mothilal, R.K., et al.: Towards Unifying Feature Attribution and Counterfactual Explanations: Different Means to the Same End (2020)
25. Ad, N., et al.: Comparison of EuroSCORE II, original EuroSCORE, and the society of thoracic surgeons risk score in cardiac surgery patients. Ann. Thorac. Surg. 102(2), 573–579 (2016)
26. Shamout, F., Zhu, T., Clifton, D.A.: Machine learning for clinical outcome prediction. IEEE Rev. Biomed. Eng. 14, 116–126 (2021)
27. Thorsen-Meyer, H.C., et al.: Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records. Lancet Digit. Health 2(4), e179–e191 (2020)
28. Fan, Y., et al.: Development of machine learning models for mortality risk prediction after cardiac surgery. Cardiovasc. Diagn. Ther. 12(1), 12–23 (2022)
Patient Mortality Prediction Based on Two-Layer Attention Neural Network

Lin Wang1, Zhengzhong Wang1(B), Quanrun Song1, Changtong Ding1, Xiaoning Li1, Xiangwei Zhang2, and Shichao Geng3(B)

1 School of Information Science and Engineering, Shandong Normal University, Jinan 250014, Shandong, China
[email protected]
2 Department of Thoracic Surgery, Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan 250014, Shandong, China
3 School of Journalism and Communication, Shandong Normal University, Jinan 250014, Shandong, China
[email protected]
Abstract. With the development of medical informatization, electronic medical records have become an important part of hospital information systems, and using them for patient mortality prediction can contribute to further improvement of clinical auxiliary diagnosis decision-making systems. Existing models for mortality risk prediction achieve good performance; however, they predominantly use only a single aspect of the data. In this study, a scalable two-layer attention mechanism neural network is proposed to predict patient mortality by focusing on the patient diagnostic code, drug code, surgical code, and global condition. The first layer of the attention network uses three independent long short-term memory networks to learn the attention features of the diagnosis, medication, and surgery of a patient's single admission, and outputs three feature matrices. These three feature matrices are spliced and passed to the second layer of the attention network, which uses the transformer encoder. Finally, a fully connected layer is used to obtain the mortality prediction of the patient. The feasibility of the model is demonstrated by comparison with baseline methods and by ablation experiments. In conclusion, the proposed model can comprehensively consider the information of patient diagnosis, medication, and surgery, and has good practicability and scalability.

Keywords: Patient mortality prediction · Attention Neural Network · LSTM · Transformer
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 233–245, 2023. https://doi.org/10.1007/978-981-99-4749-2_20

1 Introduction
Electronic medical records (EMRs) are the foundation of information construction in medical institutions. An excellent EMR system plays an important role in eliminating isolated information islands in hospitals, as well as in strengthening medical record quality control, clinical path management, diagnosis and treatment safety, and mobile medical care. Thus, it determines the efficiency and quality of medical services and is the foundation of medical safety. Although the original intention of establishing EMRs was to make safe, reliable, real-time patient health records available anytime and anywhere medical treatment is needed, EMRs also have rich secondary use value [1].

Applying artificial intelligence technology to EMR data for auxiliary clinical decision-making is an important branch of their secondary use, and both traditional machine learning and deep learning methods have been applied in this regard. Although traditional machine learning methods (such as random forests) have achieved important results in clinical auxiliary decision-making, they have shortcomings, such as heavy data preprocessing and unsatisfactory accuracy. As the amount of information contained in EMRs increases, deep neural networks with more parameters can be used to learn the patient information contained in EMRs for risk prediction (such as patient death prediction). Hence, deep neural networks have gained increasing attention for the secondary utilisation of EMRs. Effective patient risk prediction methods can lay the foundation for further implementation and improvement of clinical auxiliary diagnosis decision-making systems and reduce the possibility of medical accidents caused by human error.

Traditional machine learning techniques, such as logistic regression, support vector machines, and random forests, have been applied to analyse EMR data. However, these methods cannot fully utilise the contextual relationships in EMR data and may require considerable data preprocessing when faced with unstructured data. With the rapid development of deep learning technology, numerous studies have applied it to EMR data processing. Among them, recurrent neural networks (RNNs) have achieved initial success in predicting patient mortality; however, their practicability remains poor.
In this study, a two-layer attention neural network model is proposed to predict patient mortality by training the model on EMR data. The input of the model is the sequence of relevant information for a patient in one admission; the two-layer attention mechanism learns the patient's diagnosis codes, drug codes, and operation codes, together with the context features summarised from the three types of codes, and outputs the patient's mortality risk. The experimental results show that the proposed two-layer attention neural network model performs well and is feasible.

The first contribution of this study is a new two-layer attention mechanism neural network model that can capture the local and global attention features of patient admission information and effectively predict patient mortality. Second, the first layer of the proposed attention mechanism can flexibly process a variety of sequence information according to the actual scenario, and thus has good practicability and scalability.

The remainder of this paper is organised as follows. Section 2 presents the progress and limitations of neural networks in processing EMR data. Section 3 describes the two-layer attention neural network model for predicting the risk of death in patients. Section 4 presents the comparison and ablation experiments and their results. Finally, Sect. 5 summarises the study findings.
2 Related Work
As EMRs are serialised and contain time information, the attention mechanism neural network has become a powerful tool for processing such data.
With the rapid development of deep learning, the attention mechanism has become one of the core technologies in natural language processing and image coding. The attention mechanism assigns high weights to important information and low weights to irrelevant information, and thus offers higher robustness and scalability [2, 3]. In the development of attention neural networks, representative network types include recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and self-attention networks. An RNN is a neural network model that can process sequence data. Unlike a traditional feedforward neural network, an RNN considers the correlation between input elements rather than treating them as independent, and it models sequence data by using the output of the previous moment as an input at the current moment. However, RNNs suffer from vanishing and exploding gradients. LSTM is a variant of the RNN that addresses these problems, which cannot be solved within the plain RNN structure, by using gating structures, namely the input gate, forget gate, and output gate. The gating structure controls the information flow in LSTM more effectively and alleviates the gradient problems of traditional RNNs. The self-attention mechanism was first proposed in the field of natural language processing (NLP) to solve the problem of feature extraction from long text sequences. Owing to its powerful performance, it has since been widely applied in other artificial intelligence fields, such as image processing and speech processing. The core idea of the self-attention mechanism is to map each element of the input sequence through three weight matrices and then determine the importance of each element to every other element (including itself) by calculating their similarity.
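The core idea of self-attention described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy input, and the identity projection matrices are hypothetical, and the scaling by the square root of the key dimension follows the transformer formulation [19].

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (output, weights) for single-head scaled dot-product self-attention."""
    # Map every element through the three weight matrices.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Similarity of each element to every element, including itself.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of the value vectors.
    return matmul(weights, v), weights


# Toy sequence of three 2-dimensional elements; identity projections (hypothetical).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1 and tells how strongly one element attends to the others; in this toy input the third element has the largest dot product with itself and therefore receives its own largest weight.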
Representative examples of self-attention neural networks include the Transformer and its variant BERT (Bidirectional Encoder Representations from Transformers). The complexity and temporal nature of the data contained in EMRs pose many challenges for neural network models, including how to fully utilise the rich data contained in EMRs and how to enable the model to learn their contextual features effectively. In view of the limitations of traditional machine learning algorithms in dealing with these challenges, recurrent neural networks [4, 5], long short-term memory networks [6–8], and gated recurrent units [9] have been applied to the processing of EMR data. Compared with traditional machine learning algorithms, these models extract features from sequential data more effectively and have therefore shown better performance on EMR data [10–13]. Neural network models based on the attention mechanism aim to learn a weight for each patient visit and to simulate the diagnostic thinking of doctors in real clinical practice by assigning different amounts of attention. RETAIN [14] is an interpretable model for risk prediction that not only learns the weight of each visit but also assigns a weight to each diagnostic code within a single visit; it uses two RNNs to learn these weights separately. Although RETAIN offers some interpretability, its performance is still unsatisfactory. Li et al. proposed a Transformer-based prediction model that learns from patient visit data to predict the likelihood of multiple conditions occurring during future visits [15]. Shahid et al. used LSTM, GRU, and BiLSTM methods to make predictions from COVID-19 patient data and validated the effectiveness of these methods [16]. Thorsen-Meyer et al. trained LSTM models on intensive care unit patient data to predict patient mortality [17]. These common methods are suitable for a single type of unstructured data, but their performance is unsatisfactory when they are applied naively to multiple types of unstructured data. To fully utilise the rich data contained in EMRs, particularly patient diagnosis, medication, and surgery information, a two-layer attention neural network model is proposed herein to predict patient mortality risk.
3 Methods
This section introduces the task of this study and the two-layer attention neural network model.

3.1 Task Definition
The diagnosis coding sequence $x_t$, drug coding sequence $x_p$, and operation coding sequence $x_q$ of a patient admitted to a hospital constitute the input $X = \{x_t, x_p, x_q\}$ of the two-layer attention neural network model; the model outputs the patient's mortality risk.

3.2 Feature Representation of Local Patient Information Based on LSTM
LSTM [18], proposed by Hochreiter and Schmidhuber, is a recurrent neural network designed specifically to solve the long-term dependency problem of a general RNN. In this study, LSTM networks were used to extract the local attention features of patient information: the patient diagnosis code, drug code, and surgical code sequences were fed into three independent LSTM networks. The diagnostic coding sequence $x_t$ is used as an example to introduce the calculation process of LSTM.

When $x_t$ passes into an LSTM cell, the forget gate decides what information should be retained. The previous hidden state $h_{t-1}$ and the current input $x_t$ are passed together to the sigmoid function, whose output $f_t$ lies between 0 and 1. A value closer to 0 indicates that the information should be discarded, and a value closer to 1 indicates that it should be retained.

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1}$$

Subsequently, the input gate used to update the cell state is computed. First, the previous hidden state $h_{t-1}$ and the current input $x_t$ are passed to the sigmoid function to obtain a value $i_t$ between 0 and 1 that determines which information to update. Second, $h_{t-1}$ and $x_t$ are passed to the tanh function to create a new vector of candidate values $\tilde{C}_t$. Finally, the sigmoid output $i_t$ is multiplied by the tanh output $\tilde{C}_t$; the sigmoid output determines which information in the tanh output is important and must be preserved.

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2}$$

$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{3}$$
The new cell state is obtained by first multiplying the previous cell state $C_{t-1}$ element-wise by the output of the forget gate $f_t$; where $f_t$ is close to 0, the corresponding information is discarded from the new cell state. The output of the input gate, $i_t \odot \tilde{C}_t$, is then added to obtain the cell state $C_t$:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$
The output gate determines the value of the next hidden state, which contains information from the previous inputs. First, the previous hidden state $h_{t-1}$ and the current input $x_t$ are passed to the sigmoid function to obtain $o_t$; next, the newly obtained cell state $C_t$ is passed to the tanh function. Finally, the tanh output is multiplied element-wise by $o_t$ to determine the hidden state $h_t$, which is used as the output of the current cell and passed to the next time step together with the new cell state.

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{5}$$

$$h_t = o_t \odot \tanh(C_t) \tag{6}$$
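One time step of Eqs. (1)–(6) can be traced with the following pure-Python sketch. It is an illustration under stated assumptions rather than the authors' PyTorch code: the helper names, the `params` dictionary layout, and the all-zero toy parameters are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, z, b):
    """Compute W z + b for a weight matrix (list of rows) and vectors z, b."""
    return [sum(wi * zi for wi, zi in zip(row, z)) + bi for row, bi in zip(w, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (1)-(6)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(params["W_f"], z, params["b_f"])]      # Eq. (1)
    i_t = [sigmoid(v) for v in affine(params["W_i"], z, params["b_i"])]      # Eq. (2)
    c_hat = [math.tanh(v) for v in affine(params["W_c"], z, params["b_c"])]  # Eq. (3)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]     # Eq. (4)
    o_t = [sigmoid(v) for v in affine(params["W_o"], z, params["b_o"])]      # Eq. (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                       # Eq. (6)
    return h_t, c_t


# Toy setting: hidden size 1, input size 2, all weights and biases zero (hypothetical).
zero = {name: [[0.0, 0.0, 0.0]] for name in ("W_f", "W_i", "W_c", "W_o")}
zero.update({name: [0.0] for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step(x_t=[0.3, -0.7], h_prev=[0.0], c_prev=[1.0], params=zero)
```

With all parameters at zero, every gate equals 0.5 and the candidate value is 0, so the previous cell state 1.0 is halved to 0.5 and the hidden state becomes 0.5 · tanh(0.5). This makes the role of the forget gate in Eq. (4) easy to check by hand.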
The patient diagnosis coding, drug coding, and surgical coding sequences are input into three independent LSTM units; the three feature matrices obtained through the above process are spliced into a complete matrix INPUT_T and sent to the second layer of the attention network.

3.3 Global Patient Information Feature Representation Based on Transformer Encoder
The transformer [19], proposed in 2017, is one of the most powerful classes of models developed thus far; it learns context by tracking relationships in sequence data. Typically, the transformer is composed of an encoder and a decoder; in this study, its encoder is used as the second-layer attention network of the entire model. The transformer encoder primarily consists of multi-head self-attention and a feed-forward network. The former captures the relationships between features, whereas the latter performs further encoding. In contrast to LSTM, which processes the input sequence in order, the transformer relies on its self-attention mechanism. The network input INPUT_T is multiplied by three different weight matrices to obtain the query matrix (Q), key matrix (K), and value matrix (V), from which the weighted feature matrix is calculated [19]:

$$Z_i = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \quad i \in \{1, 2, 3, 4, 5, 6, 7, 8\} \tag{7}$$

where $d_k$ is the dimension of the key vectors. Multi-head self-attention integrates $h$ different feature matrices $Z_i$. In this study, $h$ was set to 8; i.e., eight feature matrices $Z_i$ were spliced and passed through a fully connected layer to obtain the output $Z$. The calculation process of multi-head self-attention is shown in Fig. 1. The global attention mechanism of the second layer summarises the patient information features.

Fig. 1. The multi-head self-attention

3.4 Patient Mortality Prediction Model Based on Two-Layer Attention Neural Network
As shown in Fig. 2, the model predicts the risk of death in patients from three perspectives, i.e., diagnostic coding, drug coding, and surgical coding. The MIMIC dataset was used in this study. The diagnostic, drug, and surgical codes were mapped to word vectors and then input to the first layer of the attention mechanism, i.e., three independent one-way LSTM networks, which output three feature matrices. The three feature matrices were spliced and input into the second layer of the attention mechanism, i.e., the transformer encoder, which calculates the self-attention of all encoded word vectors. The final output was the prediction result.

Fig. 2. Patient mortality prediction model based on two-layer attention neural network

The model first used the continuous bag of words (CBOW) model to map the input codes to word vectors. In this study, diagnostic, drug, and surgical coding were used to simulate the actual treatment process of a patient admission. Three matrices containing word vectors were input into three independent two-layer LSTM units; their outputs were concatenated into a complete feature matrix that was input into the second-layer attention mechanism, composed of a four-layer transformer encoder. Finally, the prediction result was output through the feed-forward neural network.

The ratio of nondead to dead samples in the entire dataset was approximately 89:11. To manage the impact of data imbalance on model performance, focal loss [20] was chosen as the loss function for model training. Focal loss was proposed by Lin et al. [20], with Kaiming He among the authors, and was originally used in the imaging field to solve model performance problems caused by data imbalance. The commonly used cross-entropy loss function is

$$L = \begin{cases} -\log y', & y = 1 \\ -\log(1 - y'), & y = 0 \end{cases} \tag{8}$$

where $y'$ is the output of the activation function, which lies between 0 and 1, and $y$ is the sample label. Under ordinary cross-entropy, for positive samples, the greater the output probability, the smaller the loss; for negative samples, the smaller the output probability, the smaller the loss. Focal loss adds the factors $\alpha$ and $\gamma$: $\alpha$ balances the importance of positive and negative samples, and $\gamma$ adjusts the rate at which the weights of easy samples decrease [20]. It is expressed as

$$L_{\mathrm{focalloss}} = \begin{cases} -\alpha (1 - y')^{\gamma} \log y', & y = 1 \\ -(1 - \alpha)\,(y')^{\gamma} \log(1 - y'), & y = 0 \end{cases} \tag{9}$$

The entire training process of the model was based on the Adam optimiser, which combines the advantages of two optimisation algorithms, namely, the adaptive gradient algorithm (AdaGrad) and root mean squared propagation (RMSProp). The update step size is calculated by jointly considering the first moment estimate of the gradient (i.e., the mean of the gradient) and the second moment estimate (i.e., the uncentered variance of the gradient). Nadam adds Nesterov momentum to Adam to improve the convergence speed.

Because the first layer of the attention mechanism can use multiple LSTM units to process patient information from different angles, the model is easy to extend, can adapt flexibly to different usage environments, and has strong practicability.
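The difference between the cross-entropy loss of Eq. (8) and the focal loss of Eq. (9) can be checked numerically with a short sketch for a single sample. The values `alpha=0.25` and `gamma=2.0` are assumptions taken from the defaults reported in the focal loss paper [20]; the source does not state which values were used in this study.

```python
import math


def cross_entropy(p, y):
    """Eq. (8): p is the predicted probability of the positive class, y is 0 or 1."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Eq. (9): alpha and gamma are assumed values, not taken from the source."""
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


# An easy positive (p = 0.9) versus a hard positive (p = 0.1).
easy_ce, hard_ce = cross_entropy(0.9, 1), cross_entropy(0.1, 1)
easy_fl, hard_fl = focal_loss(0.9, 1), focal_loss(0.1, 1)
print(f"cross-entropy easy/hard ratio: {easy_ce / hard_ce:.4f}")
print(f"focal loss    easy/hard ratio: {easy_fl / hard_fl:.4f}")
```

The modulating factor $(1 - y')^{\gamma}$ shrinks the loss of well-classified samples much more than that of misclassified ones, so the easy-to-hard ratio is far smaller under focal loss. Setting `gamma=0` and `alpha=0.5` reduces Eq. (9) to half of Eq. (8).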
4 Results and Discussion

4.1 Training Preparation
The experimental data set comprised 44,407 samples; the number of samples marked as not dead was 39,528, the number of samples marked as dead was 4,879, and the ratio was 89:11. The dataset was divided into three subsets, namely, the training, validation, and test sets. The data in each subset accounted for 60%, 20%, and 20% of the total dataset, respectively; the proportion of nondead and dead samples in each subset was consistent with the total dataset. Details of the data distribution are presented in Table 1.

Table 1. Data distribution

Subset            Number of surviving samples   Number of dead samples
Training set      23,718                        2,928
Validation set    7,905                         975
Test set          7,905                         976
Total data set    39,528                        4,879
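A 60/20/20 split that keeps the class proportions of each subset consistent with the whole dataset, as described for Table 1, can be sketched in pure Python as follows. The function name, the integer rounding rule, and the synthetic labels are assumptions; the source does not describe how the split was implemented, so the exact counts in Table 1 may follow a different rounding.

```python
import random


def stratified_split(labels, percents=(60, 20, 20), seed=0):
    """Split sample indices into train/validation/test, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Integer arithmetic avoids floating-point surprises in the subset sizes.
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Synthetic labels with the 89:11 class ratio (0 = survived, 1 = died).
labels = [0] * 890 + [1] * 110
train, val, test = stratified_split(labels)
dead_fraction = [sum(labels[i] for i in part) / len(part) for part in (train, val, test)]
```

Splitting each class separately is what keeps the share of dead samples at about 11% in every subset, which matters for an imbalanced dataset because a purely random split could leave the smaller subsets with too few positive samples.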
Model Settings. The open-source Python machine learning library PyTorch was used to build the two-layer attention neural network, and the length of the word vectors input to the model was set to 128. The model consisted of three mutually independent LSTM units, each with two stacked layers, and the feature dimension of the hidden layer nodes was set to 128. A four-layer transformer encoder received the output of the LSTM units, and its number of attention heads was set to 8. Finally, two linear layers output the prediction results.

Data Preprocessing. MIMIC [21] is a large-scale public intensive care medical information database provided by the Massachusetts Institute of Technology. MIMIC-III, which contains data from 2001 to 2012 collected from the MetaVision and CareVue systems, was used in this study. The relational database management system MySQL was used to filter and merge the data contained in the MIMIC database into a data table with the structure presented in Table 2, which served as the experimental dataset.

Table 2. Experimental dataset structures

Field                   Type of data   Explanation
SUBJECT_ID              INT            A unique identifier that specifies an individual patient
HADM_ID                 INT            Represents a single patient’s admission to the hospital
ICD9_CODE               VARCHAR        Contains the actual code corresponding to the diagnosis assigned to the patient for the given row
NDC                     VARCHAR        The National Drug Code
PROCEDURE_ICD9_CODE     VARCHAR        The ICD-9 code for the given procedure
HOSPITAL_EXPIRE_FLAG    TINYINT        Whether the patient died within the given hospitalization

Optimiser Selection. During the experiments, various optimisers were applied to improve the model performance. Table 3 lists the selected optimisers and model training times. The specifications of the experimental environment were: Intel Xeon(R) E5-1650 [email protected] GHz, 64 GB memory, and NVIDIA GeForce GTX 1080Ti GPU. With the model input and parameters unchanged, the optimiser was controlled as the only variable, and the impact of different optimisers on the model training time was tested. From the perspective of training time, Nadam (42.4 min, the shortest in Table 3) was found to be the most suitable for the proposed model.

Table 3. Optimiser selection

Optimiser   Model training time (min)
Adam        45
Adadelta    45.2
ASGD        43.4
Adamax      43.8
AdamW       44.9
RAdam       46.9
Nadam       42.4

Loss Function Selection. Among the 44,407 samples contained in the dataset, the ratio of the number of nondead samples to dead samples was 89:11; thus, the dataset was imbalanced.
Table 4. Loss function results

Loss function        Accuracy   Precision   Recall   F1     ROC_AUC
Cross-entropy loss   0.95       0.82        0.71     0.76   0.85
Focal loss           0.95       0.79        0.77     0.78   0.87
To verify the effectiveness of focal loss on unbalanced datasets, the loss function was controlled as the only variable, with the model input and parameters unchanged. The experimental results are listed in Table 4. Compared with the commonly used cross-entropy loss function, focal loss raised recall from 0.71 to 0.77, F1 from 0.76 to 0.78, and ROC_AUC from 0.85 to 0.87, at the cost of a drop in precision from 0.82 to 0.79; accuracy was unchanged at 0.95.

4.2 Comparison Experiments
To verify the effectiveness of the model, the following methods were used for comparison with the model proposed in this study.

Random Forest [22]. It is an ensemble learning algorithm that integrates multiple decision trees to make predictions. For classification problems, the prediction is a vote over the predictions of all decision trees.

RETAIN [14]. The model uses two sets of attention networks, namely, forward and reverse.

LSTM. LSTM networks have been widely used to process serialised EMRs. In contrast to the two-layer attention model, in the comparison experiment the vector sequences of diagnostic, drug, and surgical coding were input directly into the LSTM, and the results were compared with those of the two-layer attention model.

Bidirectional LSTM (BiLSTM). BiLSTM contains a forward LSTM and a backward LSTM, which can better capture bidirectional semantics.

Gated Recurrent Unit (GRU). In comparison with LSTM, GRU can achieve comparable results, but its simpler structure makes it easier to train.

The experimental results are listed in Table 5. The two-layer attention model achieved the highest recall (0.77), F1 (0.78), and ROC_AUC (0.87) and tied with RETAIN for the highest accuracy (0.95); only RETAIN achieved higher precision (0.82 versus 0.79). Unlike the baselines, which deal with the diagnostic, drug, and surgical codes directly, the two-layer attention model generates attention features both locally and globally, and can therefore better mine the internal connections within the diagnostic code, drug code, and surgical code sequences as well as the interrelationships among the three.

Table 5. Results of comparison experiments

Model                       Accuracy   Precision   Recall   F1     ROC_AUC
Random forest               0.92       0.72        0.54     0.62   0.76
RETAIN                      0.95       0.82        0.69     0.75   0.84
LSTM                        0.94       0.75        0.67     0.71   0.82
BiLSTM                      0.94       0.76        0.64     0.69   0.81
GRU                         0.94       0.74        0.69     0.72   0.83
Two-layer attention model   0.95       0.79        0.77     0.78   0.87
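The F1 column of Table 5 is the harmonic mean of the precision and recall columns. The following sketch recomputes it from the tabulated values as a consistency check; the function name and the dictionary layout are illustrative choices, and the numbers are copied from Table 5.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 5.
table5 = {
    "Random forest": (0.72, 0.54, 0.62),
    "RETAIN": (0.82, 0.69, 0.75),
    "LSTM": (0.75, 0.67, 0.71),
    "BiLSTM": (0.76, 0.64, 0.69),
    "GRU": (0.74, 0.69, 0.72),
    "Two-layer attention model": (0.79, 0.77, 0.78),
}

for model, (precision, recall, reported) in table5.items():
    recomputed = f1_score(precision, recall)
    print(f"{model}: recomputed F1 = {recomputed:.3f}, reported F1 = {reported:.2f}")
```

Because the precision and recall in the table are already rounded to two decimals, the recomputed values can differ from the reported F1 in the last digit. On an imbalanced dataset such as this one, F1 and recall are more informative than accuracy, since always predicting survival would already reach an accuracy of about 0.89.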
4.3 Ablation Experiments
In this study, ablation experiments were performed by removing the first and the second layer of the attention mechanism in turn to demonstrate the effectiveness of both layers. The experimental results are listed in Table 6. The results indicate that removing either layer of the two-layer attention network degrades the model performance: without the first layer, recall falls from 0.77 to 0.47 and F1 from 0.78 to 0.60; without the second layer, recall falls to 0.66 and F1 to 0.71. Evidently, both the local feature extraction performed by the first-layer attention network on the diagnostic, drug, and surgical codes and the global feature extraction performed by the second-layer attention network contribute to the good performance of the model.

Table 6. Results of ablation experiments

Model                                     Accuracy   Precision   Recall   F1     ROC_AUC
Model without first layer of attention    0.93       0.81        0.47     0.60   0.73
Model without second layer of attention   0.94       0.76        0.66     0.71   0.82
Two-layer attention model                 0.95       0.79        0.77     0.78   0.87
Overall, Table 5 presents the performance of the proposed two-layer attention model and other baseline models on the MIMIC dataset; the two-layer attention model exhibits excellent performance and obtains the highest scores on most of the metrics. The ablation experiment results presented in Table 6 suggest that the two-layer attention network contributes to the success of the entire model.
5 Conclusions
The prediction of death risk from EMR data is an important branch of the secondary utilisation of EMRs. Most current mortality risk prediction methods focus on improving model performance and ignore the multilevel and multifaceted data contained in EMRs; therefore, the applicability of existing models is often limited. To solve this problem, this study proposed a two-layer attention model. It learns the context characteristics of the diagnosis, medication, and operation information of a patient’s single admission through the first layer of the attention network, then learns the aggregated information features of the three parts through the second layer, and finally outputs the predicted patient mortality. The performance of the proposed two-layer attention model was evaluated on the MIMIC dataset. The findings indicate that the proposed model outperforms commonly used models, including LSTM and GRU.
References
1. MIT Critical Data: Secondary Analysis of Electronic Health Records. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43742-2
2. Chaudhari, S., Mithal, V., Polatkan, G., et al.: An attentive survey of attention models. ACM Trans. Intell. Syst. Technol. (TIST) 12(5), 1–32 (2021)
3. Hao, S., Lee, D.-H., Zhao, D.: Sequence to sequence learning with attention mechanism for short-term passenger flow prediction in large-scale metro system. Transp. Res. Part C: Emerg. Technol. 107, 287–300 (2019)
4. Che, Z., Purushotham, S., Cho, K., et al.: Recurrent neural networks for multivariate time series with missing values. arXiv e-prints (2016)
5. Hewamalage, H., Bergmeir, C., Bandara, K.: Recurrent neural networks for time series forecasting: current status and future directions. Int. J. Forecast. 37, 388–427 (2021)
6. Baytas, I.M., Cao, X., Xi, Z., et al.: Patient subtyping via time-aware LSTM networks. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM (2017)
7. Maragatham, G., Devi, S.: LSTM model for prediction of heart failure in big data. J. Med. Syst. 43, 1–13 (2019)
8. Lu, W., Ma, L., Chen, H., et al.: A clinical prediction model in health time series data based on long short-term memory network optimized by fruit fly optimization algorithm. IEEE Access 8, 136014–136023 (2020)
9. Khoshnevisan, F., Ivy, J., Capan, M., Arnold, R., Huddleston, J., Chi, M.: Recent temporal pattern mining for septic shock early prediction. In: 2018 IEEE International Conference on Healthcare Informatics (ICHI) (2018)
10. Park, H.J., Jung, D.Y., Ji, W., et al.: Detection of bacteremia in surgical in-patients using recurrent neural network based on time series records: development and validation study. J. Med. Internet Res. 22(8), e19512 (2020)
11. Reddy, B.K., Dursun, D.: Predicting hospital readmission for lupus patients: an RNN-LSTM-based deep-learning methodology. Comput. Biol. Med. 101, 199–209 (2018)
12. Yang, Y., Fasching, P.A., Tresp, V.: Predictive modeling of therapy decisions in metastatic breast cancer with recurrent neural network encoder and multinomial hierarchical regression decoder. In: 2017 IEEE International Conference on Healthcare Informatics (ICHI). IEEE (2017)
13. Hung, C.Y., Chen, W.C., Lai, P.T., et al.: Comparing deep neural network and other machine learning algorithms for stroke prediction in a large-scale population-based electronic medical claims database. In: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE (2017)
14. Choi, E., Bahadori, M.T., Sun, J., et al.: RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
15. Li, Y., Rao, S., Solares, J.R.A., et al.: BEHRT: transformer for electronic health records. Sci. Rep. 10(1), 1–12 (2020)
16. Shahid, F., Zameer, A., Muneeb, M.: Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM. Chaos Solitons Fractals 140, 110212 (2020)
17. Thorsen-Meyer, H.C., Nielsen, A.B., Nielsen, A.P., et al.: Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records. Lancet Digit. Health 2(4), e179–e191 (2020)
18. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
19. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. arXiv (2017)
20. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 2999–3007 (2017)
21. Johnson, A.E.W., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016)
22. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Identifying Drug–Target Interactions Through a Combined Graph Attention Mechanism and Self-attention Sequence Embedding Model

Kang Wang, Jing Hu(B), and Xiaolong Zhang

School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China
{wangkang,hujing,xiaolong.zhang}@wust.edu.cn
Abstract. Identifying drug–target interactions (DTIs) is a critical part of the drug discovery and drug development processes. Although wet lab-based methods are still the most reliable way to determine DTIs, their cost and time requirements are prohibitive. Therefore, it is particularly important to develop an effective computational method to predict DTIs. Here, we built an end-to-end deep learning framework with Simplified Molecular-Input Line-Entry System (SMILES) strings and protein sequences as raw data, and introduced a graph neural network and a graph attention mechanism to learn the features of the molecular graphs converted from SMILES. We used Word2vec to process protein sequences and combined it with self-attention sequence embedding models to extract the semantic features of protein sequences. For each group of control experiments, we used the area under the ROC curve and the area under the PR curve as the main evaluation indicators, and the mean of five-fold cross-validation as the final result. The results showed that the model performs well on the C. elegans and human benchmark datasets.

Keywords: attention mechanism · drug–target interactions · graph neural network · Word2vec
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 246–257, 2023. https://doi.org/10.1007/978-981-99-4749-2_21
1 Introduction
Prediction of drug–target interactions (DTIs) provides valuable information for understanding drug efficacy and side effects, which is crucial for drug discovery and drug development [1]. However, methods that determine DTIs through biochemical experiments are expensive, time-consuming, and cumbersome [2]. In the post-genomic era, as the Human Genome Project has progressed, public databases storing information on known drug targets have been established and released [3], such as DrugBank [4], Kyoto Encyclopedia of Genes and Genomes (KEGG) [5], Protein Data Bank (PDB) [6], and PubChem [7]. These constantly improving genomic, chemical, and pharmacological databases support the prediction of DTIs with computational methods [8], which is one of the reasons for the rapid development of computational DTI prediction. Traditional computational prediction methods for DTIs are mainly divided into three
categories: ligand-based methods, docking-based methods, and literature text mining-based methods [9]. Ligand-based virtual screening can be applied effectively to DTI prediction, but it has a limitation: it usually relies on ligand information for known targets [3, 8]. Campillos et al. [10] used the similarity of the side effects of known drugs to infer the molecular activity between drugs and targets; however, when the side-effect information of a drug is unknown, this method cannot predict new DTIs. In the docking-based approach, dynamic simulation is the main method for identifying DTIs. It requires the three-dimensional structure of the protein, which is unknown for many proteins and is complicated and time-consuming to obtain [2]. Finally, the literature text mining-based method uses co-occurrence in the literature to mine implicit relationships between drugs and targets, but it cannot discover new interactions [3, 9].
With the continuous improvement of data on drug targets and interactions, and the continuous development of computer hardware and machine learning algorithms, machine learning-based methods for predicting DTIs have developed rapidly. DTI prediction is usually regarded as a binary classification task: does the drug interact with the target or not? Mousavian et al. [11] characterized drug molecules as 881-bit PubChem [12] substructure fingerprints and extracted features from the Position-Specific Scoring Matrix (PSSM) [13], which contains biological evolution information, mapping them to 400-dimensional protein feature vectors. A support vector machine (SVM) [14] was then used for classification.
The resulting area under the ROC curve (AUC) was greatly improved compared with previous methods, but the area under the precision–recall curve (AUPR) was not ideal. Similarly, Wang et al. [15] and Li et al. [3] also used PubChem fingerprints and the PSSM; the difference is that they used the invariant moment algorithm to extract protein features from the PSSM. Wang et al. used DeepLSTM as a DTI prediction classifier for the first time, whereas Li et al. used rotation forest (RF) as the classifier [16]. Peng et al. [8] performed feature extraction on heterogeneous networks that contained rich information such as drugs, target proteins, diseases, and side effects, and obtained good prediction results. However, building heterogeneous networks relies on association data between drugs and diseases, drugs and side effects, and proteins and diseases. In recent years, Word2vec [17] has been used to characterize drugs and target proteins [1, 18–20]. Wan and Zeng [1] used Morgan fingerprints [21] together with latent semantic analysis [22] to learn the characteristics of drug molecules, and learned target protein features using the skip-gram [23] model of Word2vec. Zhang et al. [20] directly applied the Word2vec method to extract features from the Simplified Molecular Input Line Entry System (SMILES) and protein sequences, and used these features in machine learning algorithms. These works demonstrated that Word2vec can learn low-dimensional, efficient features of SMILES and protein sequences. The attention mechanism performs very well on sequence data: it makes the model focus on the most relevant parts of the input to obtain better predictions [24, 25]. Lin et al. [26] proposed a self-attention mechanism applied to sentence embedding, which outperformed other sentence embedding methods and was
interpretable on three tasks: author description, sentiment classification, and textual entailment. Veličković et al. [27] applied attention mechanisms to graph-structured data, and their graph attention networks (GAT) performed well in node classification tasks. Xiong et al. [25] proposed Attentive FP, which uses the graph attention mechanism to characterize small drug molecules at the atomic and molecular levels. Inspired by the above work, we converted the SMILES of small molecules into graph-structured data and processed them with different graph neural networks. For proteins, we used Word2vec's skip-gram model to train word vectors and build word embedding matrices, and then combined BiLSTM [28] with the self-attention mechanism [24, 26] to learn the semantic features of protein sequences.
2 Methods
2.1 Datasets
Previously, many of the negative samples used to evaluate DTI prediction models were taken from unknown interactions [3, 11, 15]; however, such negative samples may in fact be DTIs that have not yet been discovered. Liu et al. [29] created human and C. elegans datasets using highly confident negative samples of compound–protein pairs obtained by a systematic screening framework. The human dataset contains 3,369 positive pairs between 1,052 unique compounds and 852 unique proteins. The C. elegans dataset contains 4,000 positive pairs between 1,434 unique compounds and 2,504 unique proteins [30]. We evaluated our model on these two benchmark datasets using the data provided by Tsubaki et al. [30].
2.2 Drug Module
1) Node features and bond features of molecules: The raw drug data are SMILES strings, which are converted into molecular graphs using RDKit [31]. The required information is obtained from the graph structure and comprises three parts: node features (corresponding to atomic features), edge features (corresponding to bond features), and an adjacency matrix that records the connections between nodes. Based on previous studies [25, 32] and the statistics of all compound molecules in the datasets used, we determined the following atomic and bond features. The value ranges of the atom type, the degree, and the number of attached hydrogens were fixed according to how frequently they occur in the datasets, and each was one-hot encoded. In addition, the hybridization type, aromaticity, the presence of chirality, and the chirality type are all common choices of atomic features. Details are shown in Table 1.
The bond features are divided into four categories: GetType (the bond type), GetStereo (the stereo configuration), ISinRing (whether the bond is in a ring), and GetIsConjugated (whether the bond is conjugated). Together they span 10 dimensions, as shown in Table 2.
Table 1. Atomic Features

| Atom feature | Detailed information | Dim |
|---|---|---|
| Type | C, O, N, S, F, P, Cl, Na, W, Br, H, Ca, K, I, Al, Zn, As, Cu, Li, other | 20 |
| Degree | 0, 1, 2, 3, 4, 5 | 6 |
| TotalNumHs | 0, 1, 2, 3, 4, 5 | 6 |
| Hybridization | sp, sp2, sp3, sp3d, sp3d2, other | 6 |
| Aromaticity | 0/1 | 1 |
| Chirality | 0/1 | 1 |
| Chirality type | R/S | 2 |
Table 2. Bond Features

| Bond feature | Detailed information | Dim |
|---|---|---|
| GetType | SINGLE, DOUBLE, TRIPLE, AROMATIC | 4 |
| GetStereo | STEREONONE, STEREOANY, STEREOZ, STEREOE | 4 |
| ISinRing | 0/1 | 1 |
| GetIsConjugated | 0/1 | 1 |
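The one-hot layout of Tables 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names (`one_hot`, `atom_features`, `bond_features`) are hypothetical, and the functions take plain values so that the example needs no chemistry toolkit. In practice the values would typically be read from RDKit atom and bond objects (for example via `GetSymbol`, `GetDegree`, `GetTotalNumHs`, `GetBondType`, `IsInRing`). Summing the Dim column of Table 1 gives a 42-dimensional atom vector, which the sketch reproduces.

```python
# Vocabularies follow Tables 1 and 2; the helper functions are illustrative assumptions.
ATOM_TYPES = ["C", "O", "N", "S", "F", "P", "Cl", "Na", "W", "Br", "H",
              "Ca", "K", "I", "Al", "Zn", "As", "Cu", "Li", "other"]
DEGREES = [0, 1, 2, 3, 4, 5]
NUM_HS = [0, 1, 2, 3, 4, 5]
HYBRIDIZATIONS = ["sp", "sp2", "sp3", "sp3d", "sp3d2", "other"]
CHIRALITY_TYPES = ["R", "S"]
BOND_TYPES = ["SINGLE", "DOUBLE", "TRIPLE", "AROMATIC"]
BOND_STEREO = ["STEREONONE", "STEREOANY", "STEREOZ", "STEREOE"]


def one_hot(value, choices, fallback=None):
    """Return a one-hot list; values outside `choices` map to `fallback` if given."""
    if value not in choices and fallback is not None:
        value = fallback
    return [1 if value == c else 0 for c in choices]


def atom_features(symbol, degree, num_hs, hybridization, aromatic, chiral_tag=None):
    """Atom vector with 20 + 6 + 6 + 6 + 1 + 1 + 2 = 42 entries (Table 1)."""
    return (one_hot(symbol, ATOM_TYPES, "other")
            + one_hot(degree, DEGREES)
            + one_hot(num_hs, NUM_HS)
            + one_hot(hybridization, HYBRIDIZATIONS, "other")
            + [int(aromatic)]
            + [int(chiral_tag is not None)]          # chirality present or not
            + one_hot(chiral_tag, CHIRALITY_TYPES))  # all zeros when achiral


def bond_features(bond_type, stereo, in_ring, conjugated):
    """Bond vector with 4 + 4 + 1 + 1 = 10 entries (Table 2)."""
    return (one_hot(bond_type, BOND_TYPES)
            + one_hot(stereo, BOND_STEREO)
            + [int(in_ring), int(conjugated)])


# Aromatic carbon in benzene: two heavy-atom neighbours, one hydrogen, sp2, achiral.
carbon = atom_features("C", 2, 1, "sp2", True)
ring_bond = bond_features("AROMATIC", "STEREONONE", True, True)
print(len(carbon), len(ring_bond))  # 42 10
```

Mapping rare elements and unusual hybridizations to an explicit "other" slot keeps the vector length fixed for molecules outside the training vocabulary.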
2) Molecular characterization: An efficient molecular characterization method benefits machine learning predictions and helps medicinal chemists gain new insights from the rapidly growing body of pharmacological data. PubChem fingerprints and Morgan fingerprints (also known as extended-connectivity fingerprints), among others, have been used to characterize drug molecules in many studies and are widely used in tasks such as machine learning and similarity search [21]. However, focusing on the most task-relevant parts of the input data to achieve better predictions is both more interesting and more challenging. Xiong et al. [25] proposed Attentive FP, a graph neural network architecture that uses a graph attention mechanism to represent molecules. The molecular graph treats atoms as nodes and bonds as edges. First, initial state vectors are generated for the target atom and its neighboring atoms. Then, through multiple attention layers, the attention mechanism aggregates neighborhood information and focuses on the most relevant parts to obtain the atom embeddings. These are combined into a molecule embedding, again using an attention mechanism over stacked attention layers, and the prediction is finally made through a fully connected layer. We aim to predict whether a drug will interact with its target and also to identify the atoms or functional groups of the drug molecule that are most relevant to the prediction. Based on the above studies, we used Attentive FP as our drug molecular characterization method; the drug representation is shown in Fig. 2. In addition, we also used GAT
and Morgan fingerprints, which were used in previous studies, to characterize drug molecules for comparison.
2.3 Target Module
1) Protein word embedding matrix: The Word2vec technique is widely used in natural language processing tasks. It is an unsupervised learning method capable of translating words into high-quality real-valued embeddings [23]. Many researchers have applied Word2vec to characterize target sequences as target embeddings, and their results have proved its effectiveness [18–20]. Word2vec comes in two variants, the CBOW (Continuous Bag of Words) model and the skip-gram model. The former predicts the central word from the context, whereas the latter predicts the context from the central word [21, 23]. Compared with the CBOW model, the skip-gram model pays more attention to the context order of the central word [19]. In previous work on representing protein sequences, the skip-gram model performed better, so we applied the skip-gram model for pre-training. In addition, negative sampling was used in place of the hierarchical softmax of the classic skip-gram model, which significantly improves the training speed [23]. We took the target sequences of all FDA-approved drugs in DrugBank as the corpus. For each protein, the original sequence is Sentence1, and the sequences starting from the second and third amino acid residues are Sentence2 and Sentence3; each sentence is then split into words of three consecutive amino acid residues. Similar to the hyperparameters set in previous studies, we set the size of the context window to 12 and the number of negative samples to 15. After pre-training with skip-gram, each word can be represented as a vector of the specified dimension d. This provides a mapping dictionary for representing protein sequences as word embedding matrices.
Assuming a protein sequence P of length l, we slide a window of length 3 over the sequence from the beginning to select words. To ensure that each word can be mapped to a word vector w, we continued to train the pre-trained model on the protein data in the public dataset. The protein is then represented as in Eq. (1); the process is illustrated in Fig. 1.

$$P = (w_1, w_2, w_3, \ldots, w_{l-2}) \tag{1}$$
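The two tokenization steps described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' preprocessing code: the function names and the example sequence are hypothetical, and dropping trailing residues that do not fill a whole word is an assumption the text does not spell out.

```python
def shifted_sentences(sequence, k=3):
    """Build one sentence of non-overlapping k-residue words per start offset.

    Offsets 0, 1 and 2 correspond to Sentence1, Sentence2 and Sentence3.
    Trailing residues that do not fill a whole word are dropped (an assumption).
    """
    sentences = []
    for offset in range(k):
        shifted = sequence[offset:]
        words = [shifted[i:i + k] for i in range(0, len(shifted) - k + 1, k)]
        sentences.append(words)
    return sentences


def sliding_words(sequence, k=3):
    """Overlapping k-residue words with stride 1: length l gives l - k + 1 words."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "MKTAYIAKQR"  # hypothetical 10-residue sequence
corpus = shifted_sentences(seq)
words = sliding_words(seq)
print(corpus[0])   # ['MKT', 'AYI', 'AKQ']
print(corpus[1])   # ['KTA', 'YIA', 'KQR']
print(len(words))  # 8, i.e. l - 2 as in Eq. (1)
```

The shifted sentences form the Word2vec training corpus, while the stride-1 words are the ones looked up in the trained dictionary to build the embedding matrix.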
2) Bidirectional long short-term memory and self-attention mechanism: After the above processing, each protein is expressed as a two-dimensional matrix of shape (l−2)-by-d. For convenient batch training, we examined the distribution of protein sequence lengths in the datasets, set a maximum length MaxL, and truncated or zero-padded each word embedding matrix to that length. We then processed the protein word embedding matrix with a bidirectional long short-term memory network (BiLSTM) to obtain its semantic information.
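A minimal sketch of the truncation and zero-padding step, assuming the embedding matrix is stored as a list of rows; the function name `pad_or_truncate` and the toy values are hypothetical.

```python
def pad_or_truncate(matrix, max_len, dim):
    """Return exactly max_len rows: cut long inputs, append zero rows to short ones."""
    rows = [list(row) for row in matrix[:max_len]]
    rows.extend([0.0] * dim for _ in range(max_len - len(rows)))
    return rows


embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # hypothetical: 3 words, d = 2
padded = pad_or_truncate(embedding, max_len=5, dim=2)
cut = pad_or_truncate(embedding, max_len=2, dim=2)
print(len(padded), padded[-1])  # 5 [0.0, 0.0]
print(len(cut))                 # 2
```

With every protein mapped to a MaxL-by-d matrix, a batch can be stacked into a single tensor before it is passed to the BiLSTM.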
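The hidden state $hs_i$ is obtained by concatenating the forward hidden state $\overrightarrow{hs_i}$ and the backward hidden state $\overleftarrow{hs_i}$. We set the number of hidden units in each direction to h, so the hidden states of the MaxL positions form a MaxL-by-2h matrix, abbreviated as H. H contains the dependencies between amino acid residues, so it is an informative carrier. The backward LSTM reads the sequence from its end, so its state at position i is computed from the state at position i+1.

$$hs_i = \left[\overrightarrow{hs_i}, \overleftarrow{hs_i}\right] \tag{2}$$

$$\overrightarrow{hs_i} = \overrightarrow{\mathrm{LSTM}}\left(w_i, \overrightarrow{hs_{i-1}}\right) \tag{3}$$

$$\overleftarrow{hs_i} = \overleftarrow{\mathrm{LSTM}}\left(w_i, \overleftarrow{hs_{i+1}}\right) \tag{4}$$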
Next, we use the attention mechanism to identify the amino acid residues that contribute most to the DTI prediction task, which is helpful for drug design and related work. We therefore introduce a multi-head self-attention mechanism: taking $H^T$ as input, we compute a linear combination through an MLP and apply softmax so that the weights sum to 1.

$$A = \mathrm{softmax}\left(\mathrm{MLP}\left(H^{T}\right)\right) \tag{5}$$

$$A = (a_1, a_2, a_3, \ldots, a_m) \tag{6}$$
A consists of m groups of weight vectors a, where m is the number of attention heads, and A is a weight matrix of dimension m-by-MaxL. The embedding of the protein sequence can be obtained by multiplying the weight matrix A and the state vector H. The dimension of matrix representation is m-by-2h. The process is shown in Fig. 1.
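The pooling in Eqs. (5) and (6) can be sketched with plain Python lists. The exact form of the MLP is not given in the text; the sketch assumes the two-layer, bias-free form `W2 · tanh(W1 · H^T)` from the structured self-attentive sentence embedding of Lin et al. [26], and all sizes and weights below are random placeholders.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, W1, W2):
    """H: MaxL x 2h, W1: d_a x 2h, W2: m x d_a. Returns (A, M)."""
    hidden = [[math.tanh(v) for v in row] for row in matmul(W1, transpose(H))]  # d_a x MaxL
    scores = matmul(W2, hidden)             # m x MaxL
    A = [softmax(row) for row in scores]    # each head's weights sum to 1 over positions
    M = matmul(A, H)                        # m x 2h protein representation
    return A, M


rng = random.Random(0)
max_len, two_h, d_a, heads = 6, 4, 3, 2     # toy sizes, chosen only for illustration
H = [[rng.uniform(-1, 1) for _ in range(two_h)] for _ in range(max_len)]
W1 = [[rng.uniform(-1, 1) for _ in range(two_h)] for _ in range(d_a)]
W2 = [[rng.uniform(-1, 1) for _ in range(d_a)] for _ in range(heads)]
A, M = attention_pool(H, W1, W2)
print(len(A), len(A[0]))  # 2 6  (m x MaxL)
print(len(M), len(M[0]))  # 2 4  (m x 2h)
```

Because each row of A is a probability distribution over sequence positions, the largest entries of a row indicate which residues that head attends to most strongly.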
Fig. 1. The process of converting protein sequences into word embedding matrices.
2.4 Classifier
Attentive FP characterizes the drug as an informative vector d. For the protein, we treated the m-by-2h feature matrix as m vectors, summed them, and normalized the result to obtain the one-dimensional protein vector t. The drug and protein vectors were concatenated and fed to the classifier, an MLP that returns the DTI probability y_pre. We treated DTI prediction as a binary classification task, so we used the binary cross-entropy loss. The training objective was to minimize the loss
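J; in addition, to prevent overfitting, we used dropout and L2 regularization. Figure 2 shows the overall framework. In Eq. (7), m denotes the number of training samples and $y^{(i)}$ is the label of the i-th drug–target pair.

$$J = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_{\Theta}(d_i, t_i) - \left(1 - y^{(i)}\right)\log\left(1 - h_{\Theta}(d_i, t_i)\right)\right] \tag{7}$$

$$h_{\Theta}(d_i, t_i) = \mathrm{sigmoid}\left(y_{pre}^{(i)}\right) \tag{8}$$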
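Equations (7) and (8) can be checked numerically with a short sketch. The function names are hypothetical, and the clipping constant `eps` is an added safeguard against `log(0)` that is not part of the paper's formula.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(logits, labels, eps=1e-12):
    """Mean binary cross-entropy over a batch, following Eqs. (7) and (8)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0); an assumption
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# A logit of 0 gives probability 0.5, so every sample contributes log(2).
print(round(bce_loss([0.0, 0.0], [1, 0]), 4))  # 0.6931
# Confident, correct predictions give a loss close to zero.
print(bce_loss([8.0, -8.0], [1, 0]) < 0.001)   # True
```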
Fig. 2. Overall framework of the model. We extract graph-structured data from the molecular graph, and extract features from the protein sequence to build the protein embedding matrix. We then apply the Attentive layer to the drug, and the BiLSTM and attention mechanism to the protein, to obtain their feature vectors. These are concatenated as the input of the batch normalization layer; an MLP is used as the classifier, with dropout to reduce over-fitting, and finally outputs the predicted value Y_pre.
3 Results
3.1 Evaluation Indicators
We used AUC and AUPR as the main evaluation metrics, along with precision (Pre) and recall (Recall), and took the average over five-fold cross-validation as the final result. In the formulas below, true positives (TP) are positive samples (drug–target pairs that interact) correctly predicted as positive; false positives (FP) are negative samples (pairs with no interaction) incorrectly predicted as positive; true negatives (TN) are negative samples correctly predicted as negative; and false negatives (FN) are positive samples incorrectly predicted as negative. Furthermore, we calculated the areas under the ROC and PR curves, that is, AUC and AUPR, to evaluate the performance of the method.

$$\mathrm{Pre} = \frac{TP}{TP + FP} \tag{9}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$
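As a small worked example of Eqs. (9) and (10), the sketch below counts the four confusion-matrix cells for a made-up set of labels and predictions. The function names and data are hypothetical; returning 0.0 for an empty denominator is a convention chosen here, not something the paper specifies.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 marks an interacting pair."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up labels
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]  # made-up thresholded predictions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)                     # 3 1 1 3
print(precision(tp, fp), recall(tp, fn))  # 0.75 0.75
```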
3.2 Prediction Results for C. elegans and Human
Tables 3 and 4 present the five-fold cross-validation results of our model on the C. elegans and human datasets, using the mean as the final result to avoid discrepancies due to chance. The overall prediction performance of the model was better on C. elegans than on human, which is consistent with previous studies and is likely related to the larger size of the C. elegans dataset. On the C. elegans dataset, the values of AUC, AUPR, Recall, and Pre were 0.987, 0.980, 0.949, and 0.948, with standard deviations of 0.003, 0.008, 0.013, and 0.017, respectively. On the human dataset, the values of AUC, AUPR, Recall, and Pre were 0.982, 0.980, 0.936, and 0.931, with standard deviations of 0.003, 0.005, 0.011, and 0.020, respectively.

Table 3. Results of Five-fold Cross-validation on C. elegans

| K_fold | AUC | AUPR | Recall | Pre |
|---|---|---|---|---|
| 1 | 0.986 | 0.979 | 0.942 | 0.950 |
| 2 | 0.990 | 0.982 | 0.958 | 0.950 |
| 3 | 0.988 | 0.987 | 0.938 | 0.966 |
| 4 | 0.989 | 0.986 | 0.941 | 0.954 |
| 5 | 0.982 | 0.967 | 0.968 | 0.921 |
| Avg | 0.987 ± 0.003 | 0.980 ± 0.008 | 0.949 ± 0.013 | 0.948 ± 0.017 |
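The "Avg" row of Table 3 can be reproduced from the per-fold values. The paper does not say which standard deviation estimator was used; in this sketch the sample standard deviation (n − 1 denominator, `statistics.stdev`) matches all four reported values, so that choice is an inference rather than a stated fact.

```python
from statistics import mean, stdev

# Per-fold values copied from Table 3 (C. elegans).
folds = {
    "AUC":    [0.986, 0.990, 0.988, 0.989, 0.982],
    "AUPR":   [0.979, 0.982, 0.987, 0.986, 0.967],
    "Recall": [0.942, 0.958, 0.938, 0.941, 0.968],
    "Pre":    [0.950, 0.950, 0.966, 0.954, 0.921],
}

summary = {name: (round(mean(v), 3), round(stdev(v), 3)) for name, v in folds.items()}
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.3f} ± {sd:.3f}")
# AUC: 0.987 ± 0.003
# AUPR: 0.980 ± 0.008
# Recall: 0.949 ± 0.013
# Pre: 0.948 ± 0.017
```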
Table 4. Results of Five-fold Cross-validation on Human

| K_fold | AUC | AUPR | Recall | Pre |
|---|---|---|---|---|
| 1 | 0.978 | 0.971 | 0.923 | 0.921 |
| 2 | 0.982 | 0.981 | 0.952 | 0.936 |
| 3 | 0.980 | 0.981 | 0.935 | 0.935 |
| 4 | 0.983 | 0.982 | 0.942 | 0.903 |
| 5 | 0.986 | 0.986 | 0.930 | 0.958 |
| Avg | 0.982 ± 0.003 | 0.980 ± 0.005 | 0.936 ± 0.011 | 0.931 ± 0.020 |
3.3 Comparison of Different Methods on C. elegans and Human
We conducted four sets of controlled experiments to explore the effects of the multi-head attention mechanism and of different drug representation methods on the DTI prediction task. The first group was the original method. In the second group, we did not use graph neural networks to characterize molecules but used Morgan fingerprints instead, which are among the most commonly used molecular fingerprints in similarity search and virtual screening tasks and have outperformed other types of fingerprints. After experimental comparison, we used RDKit to generate Morgan fingerprints with a radius of 2 and a length (nBits) of 512 bits. The chemical structure of each drug was provided to RDKit in SMILES format, and each drug was encoded into a 512-bit fingerprint feature vector. Each bit is interpretable, as shown in the figure. The third group used GAT, which performs well on node classification tasks. The fourth group aimed to explore the effect of the multi-head attention mechanism on processing protein word embedding matrices, so we removed the attention layer from the target module. The results of the control experiments on the two datasets are shown in Tables 5 and 6. The original method was overall better than the other control methods, the one exception being precision on the human dataset, where the Morgan fingerprint + MLP drug module scored higher. Notably, the multi-head attention mechanism had the greatest impact on the model: after its removal, the AUC and AUPR decreased by about 3% on the C. elegans dataset (0.987 to 0.961 and 0.980 to 0.948) and by about 7% on the human dataset (0.982 to 0.918 and 0.980 to 0.905).

Table 5. Comparison of Different Methods on C. elegans

| Drug Module | Target Module | AUC | AUPR | Recall | Pre |
|---|---|---|---|---|---|
| Attentive FP | BiLSTM + Attention | 0.987 ± 0.003 | 0.980 ± 0.008 | 0.949 ± 0.013 | 0.948 ± 0.017 |
| Morgan Fingerprint + MLP | BiLSTM + Attention | 0.981 ± 0.004 | 0.979 ± 0.005 | 0.928 ± 0.023 | 0.935 ± 0.012 |
| GAT | BiLSTM + Attention | 0.978 ± 0.003 | 0.975 ± 0.008 | 0.932 ± 0.017 | 0.926 ± 0.012 |
| Attentive FP | Without Attention | 0.961 ± 0.003 | 0.948 ± 0.005 | 0.894 ± 0.021 | 0.887 ± 0.010 |
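The size of the drop caused by removing the attention layer can be recomputed from the mean values in Tables 5 and 6. The sketch below treats the "reduction" as a relative decrease, which is an assumption about how the percentages in the text were derived; the helper name `relative_drop` is hypothetical.

```python
def relative_drop(full, ablated):
    """Relative decrease, in percent, from the full model to the ablated one."""
    return 100.0 * (full - ablated) / full


# (full model, model without attention), means taken from Tables 5 and 6.
results = {
    ("C. elegans", "AUC"): (0.987, 0.961),
    ("C. elegans", "AUPR"): (0.980, 0.948),
    ("human", "AUC"): (0.982, 0.918),
    ("human", "AUPR"): (0.980, 0.905),
}

drops = {key: round(relative_drop(*pair), 1) for key, pair in results.items()}
for (dataset, metric), drop in drops.items():
    print(f"{dataset} {metric}: {drop}%")
# C. elegans AUC: 2.6%
# C. elegans AUPR: 3.3%
# human AUC: 6.5%
# human AUPR: 7.7%
```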
3.4 Comparison of Several Existing Methods
Finally, we compared our method with three traditional machine learning methods previously evaluated on the same datasets, namely K-nearest neighbors (KNN), random forest (RF), and L2 logistic regression (L2) [29], and with three deep learning methods proposed in recent years. Tsubaki et al. [30] proposed the CPI–GNN model, which uses a GNN to process molecular graphs and a CNN to process protein sequences. Zheng et al. [33] used protein three-dimensional structure information to represent proteins as two-dimensional distance maps and regarded them as images. They regarded drug SMILES as the question and built a visual question answering system, Drug-VQA, to predict drug–protein interactions.
Table 6. Comparison of Different Methods on Human

| Drug Module | Target Module | AUC | AUPR | Recall | Pre |
|---|---|---|---|---|---|
| Attentive FP | BiLSTM + Attention | 0.982 ± 0.003 | 0.980 ± 0.005 | 0.936 ± 0.011 | 0.931 ± 0.020 |
| Morgan Fingerprint + MLP | BiLSTM + Attention | 0.975 ± 0.002 | 0.975 ± 0.003 | 0.892 ± 0.028 | 0.947 ± 0.019 |
| GAT | BiLSTM + Attention | 0.972 ± 0.001 | 0.973 ± 0.003 | 0.891 ± 0.047 | 0.930 ± 0.029 |
| Attentive FP | Without Attention | 0.918 ± 0.008 | 0.905 ± 0.014 | 0.858 ± 0.019 | 0.839 ± 0.021 |
Chen et al. [34] regarded both drugs and proteins as sequences and built a transformer-based model, TransformerCPI, to predict drug–protein interactions. Table 7 compares our method with these existing methods. The results show that our method is competitive with the state of the art: it achieves the highest AUC on the human dataset and an AUC within 0.001 of the best on C. elegans.

Table 7. Comparison of Several Existing Methods

| Methods | C. elegans AUC | C. elegans Recall | C. elegans Pre | Human AUC | Human Recall | Human Pre |
|---|---|---|---|---|---|---|
| KNN | 0.858 | 0.827 | 0.801 | 0.860 | 0.798 | 0.927 |
| RF | 0.902 | 0.844 | 0.821 | 0.940 | 0.861 | 0.897 |
| L2 | 0.892 | 0.877 | 0.890 | 0.911 | 0.867 | 0.913 |
| CPI–GNN | 0.978 | 0.929 | 0.938 | 0.970 | 0.923 | 0.918 |
| Drug-VQA | - | - | - | 0.979 ± 0.003 | 0.961 ± 0.002 | 0.954 ± 0.003 |
| TransformerCPI | 0.988 ± 0.002 | 0.953 ± 0.005 | 0.952 ± 0.006 | 0.973 ± 0.002 | 0.925 ± 0.006 | 0.916 ± 0.006 |
| Our Methods | 0.987 ± 0.003 | 0.949 ± 0.013 | 0.948 ± 0.017 | 0.982 ± 0.003 | 0.936 ± 0.011 | 0.931 ± 0.020 |
4 Conclusion
Deep learning is often referred to as a black-box algorithm. Many previous studies only explored whether a drug and a target interact, and it was difficult to explain how the drug and the target each contribute to the prediction. In this study, we used simple and readily accessible raw data (SMILES, protein sequences, and labels) and introduced the graph attention network and multi-head attention mechanism to make the DTI task interpretable, while obtaining better performance than the control methods. In follow-up research, we will consider using more informative protein three-dimensional structures, and will try to use the three-dimensional structures predicted by AlphaFold
[35] to make up for missing three-dimensional structures. In addition, if known interaction mechanisms can be used, it would be more practical to predict which specific functional groups of drug molecules interact with which amino acid residues.
Acknowledgment. This work is supported by the National Natural Science Foundation of China (No. 61972299).
References
1. Wan, F., Zeng, J.M.: Deep learning with feature embedding for compound-protein interaction prediction. bioRxiv, p. 086033 (2016)
2. Mahmud, S.M.H., Chen, W., Jahan, H., et al.: DeepACTION: a deep learning-based method for predicting novel drug-target interactions. Anal. Biochem. 610, 113978 (2020)
3. Li, Y., Liu, X., You, Z.H., et al.: A computational approach for predicting drug–target interactions from protein sequence and drug substructure fingerprint information. Int. J. Intell. Syst. 36(1), 593–609 (2021)
4. Wishart, D.S., Knox, C., Guo, A.C., et al.: DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucl. Acids Res. 36(Suppl_1), D901–D906 (2008)
5. Kanehisa, M., Goto, S.: KEGG: kyoto encyclopedia of genes and genomes. Nucl. Acids Res. 28(1), 27–30 (2000)
6. Burley, S.K., Berman, H.M., Bhikadiya, C., et al.: RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucl. Acids Res. 47(D1), D464–D474 (2019)
7. Kim, S., Thiessen, P.A., Bolton, E.E., et al.: PubChem substance and compound databases. Nucl. Acids Res. 44(D1), D1202–D1213 (2016)
8. Peng, J., Li, J., Shang, X.: A learning-based method for drug-target interaction prediction based on feature representation learning and deep neural network. BMC Bioinform. 21(13), 1–13 (2020)
9. Zhu, S., Okuno, Y., Tsujimoto, G., et al.: A probabilistic model for mining implicit ‘chemical compound–gene’ relations from literature. Bioinformatics 21(Suppl_2), ii245–ii251 (2005)
10. Campillos, M., Kuhn, M., Gavin, A.C., et al.: Drug target identification using side-effect similarity. Science 321(5886), 263–266 (2008)
11. Mousavian, Z., Khakabimamaghani, S., Kavousi, K., et al.: Drug–target interaction prediction from PSSM based evolutionary information. J. Pharmacolog. Toxicolog. Methods 78, 42–51 (2016)
12. https://pubchem.ncbi.nlm.nih.gov/
13. Gribskov, M., McLachlan, A.D., Eisenberg, D.: Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. 84(13), 4355–4358 (1987)
14. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
15. Wang, Y.B., You, Z.H., Yang, S., et al.: A deep learning-based method for drug-target interaction prediction based on long short-term memory neural network. BMC Med. Inform. Decis. Mak. 20(2), 1–9 (2020)
16. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998)
17. Mikolov, T., Sutskever, I., Chen, K., et al.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
18. Asgari, E., Mofrad, M.R.K.: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10(11), e0141287 (2015)
19. Jaeger, S., Fulle, S., Turk, S.: Mol2vec: unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58(1), 27–35 (2018)
20. Zhang, Y.F., Wang, X., Kaushik, A.C., et al.: SPVec: a Word2vec-inspired feature representation method for drug-target interaction prediction. Front. Chem. 7, 895 (2020)
21. Rogers, D., Hahn, M.: Extended-connectivity fingerprints. J. Chem. Inf. Model. 50(5), 742–754 (2010)
22. Deerwester, S., Dumais, S.T., Furnas, G.W., et al.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
23. Mikolov, T., Chen, K., Corrado, G., et al.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
24. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
25. Xiong, Z., Wang, D., Liu, X., et al.: Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63(16), 8749–8760 (2019)
26. Lin, Z., Feng, M., Santos, C.N., et al.: A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017)
27. Veličković, P., Cucurull, G., Casanova, A., et al.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
28. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
29. Liu, H., Sun, J., Guan, J., et al.: Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics 31(12), i221–i229 (2015)
30. Tsubaki, M., Tomii, K., Sese, J.: Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 35(2), 309–318 (2019)
31. https://www.rdkit.org/
32. Nguyen, T., Le, H., Quinn, T.P., et al.: GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics 37(8), 1140–1147 (2021)
33. Zheng, S., Li, Y., Chen, S., et al.: Predicting drug–protein interaction using quasi-visual question answering system. Nat. Mach. Intell. 2(2), 134–140 (2020)
34. Chen, L., Tan, X., Wang, D., et al.: TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 36(16), 4406–4414 (2020)
35. Jumper, J., Evans, R., Pritzel, A., et al.: Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021)
An Omics-Based Metastasis Prediction Model for Osteosarcoma Patients Using Multi-scale Attention Network
Ning Wang(B) and Yizhang Jiang
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, Jiangsu, China
[email protected]
Abstract. Osteosarcoma is a malignant cancer that is commonly found in children and adolescents. Metastasis is the leading cause of death for patients with osteosarcoma, which highlights the need for accurate metastasis prediction. Although many methods for metastasis prediction have been proposed, most of them use only a single type of genomic or methylation data and have not explored the potential of multi-scale and multi-omics data for predicting osteosarcoma metastasis. To address this, we used the min-redundancy max-relevance algorithm to select the most important features from multi-omics data, comprising copy number variation data, DNA methylation data, and RNA gene sequencing data. We also balanced the data samples using the SMOTE and ENN algorithms. Finally, we developed a metastasis prediction model called MSA-CNN, which uses a one-dimensional multi-scale convolution network and a one-dimensional convolution block attention module (CBAM1D), and trained it with multi-omics data. Our performance indexes indicate that the MSA-CNN model predicts the metastasis of osteosarcoma patients more accurately.
Keywords: Osteosarcoma Metastasis Prediction · Multi-omics Data · Multi-scale Convolution Network · Attention Mechanism
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 258–267, 2023. https://doi.org/10.1007/978-981-99-4749-2_22
1 Introduction
Osteosarcoma [1] is an aggressive cancerous bone tumor that is frequently observed in children and teenagers. It is highly malignant and progresses rapidly. Before the 1970s, the main treatment for osteosarcoma patients was surgical resection, and the five-year survival rate was less than 20% [2]. Over the last few decades, progress in surgical procedures and chemotherapy has raised the five-year survival rate of people diagnosed with osteosarcoma from about 20% to approximately 60% [3]. However, about 20% of osteosarcoma patients were found to have lung metastasis at early diagnosis, and 40% of patients had metastasis at the late stage. Only 20% of patients with metastatic osteosarcoma survive for five years, and the majority of deaths are due to the spread of the cancer to other parts of the body [4]. With the development of molecular biology
and information science, the relationship between osteosarcoma metastasis and gene expression data has been discovered [5]. It is feasible to utilize machine learning and deep learning methodologies on omics data to accurately forecast the probability of metastasis in individuals diagnosed with osteosarcoma. cancer prediction models often use supervised learning techniques like Support Vector Machine (SVM). 64 feature genes are predicted by SVM and key genes related to metastasis are revealed [6]. LASSO logistic regression[7],COX-PH risk model [8] and other machine learning models are also used to predict the recurrence and metastasis of osteosarcoma based on gene expression data. However, most osteosarcoma prediction models only use one kind of gene expression data, and do not take multi-group data into account in model construction. Advancements in next-generation sequencing technology have allowed for the integration of diverse types of gene expression data, including DNA methylation, copy number variation, and RNA sequencing, in predicting cancer prognosis from a multi-dimensional perspective [9]. Multi-omics data is often high-dimensional, redundant, and has limited sample size, posing challenges for predictive modeling. To address this issue, feature selection and extraction methods are commonly utilized to overcome the curse of dimensionality and improve the learning and training of prediction models. These methods include: filtering method, embedded method, wrapper method and hybrid method [10]. In disease prediction, samples of different types of cases may be unbalanced, and unbalanced multiomics data may lead to over-fitting of the model. The under-sampling method, such as Tomek Links [11], removes samples from most classes, while the under-sampling method considers synthesizing a few classes of samples, such as SMOTE [12] and ADASYN [13]. 
Choosing a sampling method suited to the data and the problem can alleviate the imbalance of multi-omics data and help the model avoid over-fitting. Neural networks have advanced rapidly in recent years, leading to significant breakthroughs in areas such as natural language processing, medical image recognition, and intelligent medical treatment. Compared with traditional machine learning and statistical methods, deep learning has demonstrated notable advantages in tackling classification and prediction challenges in multi-omics. It has enabled more accurate feature extraction, leading to improved predictive performance [14, 15]. Moon et al. [16] introduced a framework for multi-omics data integration and classification based on an attention mechanism, which revealed the relationships among multi-omics data and achieved good model performance. Albaradei et al. [17] implemented a multi-omics prediction model of pan-cancer metastasis based on a convolutional variational autoencoder (CVAE), which showed that a deep neural network outperforms other machine learning methods in metastasis prediction. Deep learning-based methods have made remarkable achievements in survival prediction and pan-cancer classification using multi-omics data, but the prediction of cancer metastasis using multi-omics data remains an area worth exploring. In this paper, we regard the prediction of metastasis in osteosarcoma as a classification problem and propose a novel multi-omics data-based metastasis prediction model called MSA-CNN, which integrates attention mechanisms and multi-scale networks. The proposed prediction pipeline is shown in Fig. 1. This study integrated three different types of data
(including copy number variation data, RNA gene sequencing data, and DNA methylation data) from the TARGET OS database. To reduce the high dimensionality of these omics data, a feature selection approach called the max-relevance and min-redundancy algorithm (mRMR) was employed. A hybrid sampling method was used to rebalance the osteosarcoma omics dataset. Our metastasis prediction model employs a multi-scale network and a one-dimensional convolutional block attention module (CBAM1D) to capture features from one-dimensional omics data. The CBAM1D module weights the importance of features, which enhances the accuracy of predictions. Experimental validation shows that our proposed model outperforms the compared methods and achieves more accurate predictions of osteosarcoma metastasis.
Fig. 1. Diagram of proposed prediction pipeline
2 Materials and Method
2.1 Datasets and Preprocessing
The multi-omics data used in this research were obtained from the TARGET (Therapeutically Applicable Research To Generate Effective Treatments) database. Specifically, the data of 78 patients diagnosed with osteosarcoma were downloaded from the TARGET OS database (https://ocg.cancer.gov/programs/target). The metastasized and primary groups contain 21 and 57 cases, respectively. The 78 samples contain information on 60,447 genes for CNV data, 59,956 genes for RNA gene sequencing data, and 385,252 CpG sites for DNA methylation. In the preprocessing phase, we first deleted features whose value was zero in more than 20% of osteosarcoma patients. Second, to reduce the dimensions of the omics data features, we employed a two-tailed t-test to identify the features that differed significantly between the two groups. We set different thresholds for the different datasets (0.1 for CNV, 0.05 for DNA methylation, and 0.1 for RNA-seq). After preprocessing, the feature dimensions are 8867 for CNV, 2369 for RNA-seq, and 24394 for DNA methylation.
2.2 Feature Selection
Omics data typically contain a large number of features, or variables, which can outnumber the available samples. The high dimensionality of the data makes it difficult to analyze and model effectively. It can result in overfitting and negatively impact the performance of the model. In our study, we employed the max-relevance and min-redundancy algorithm (mRMR) developed by Peng et al. [18] to identify and choose the most informative and relevant features from the omics data. The minimum redundancy
and maximum relevance algorithm selects, from the omics data, a feature subset with high relevance to the class label and low redundancy among features. mRMR calculates the relevance between omics feature $o_i$ and class label $c$ through mutual information as follows:

$$I(o_i, c) = \iint p(o_i, c) \log \frac{p(o_i, c)}{p(o_i)\,p(c)} \, do_i \, dc \tag{1}$$

where $o_i$ and $c$ are random variables with probability density functions $p(o_i)$ and $p(c)$ and joint density $p(o_i, c)$. The mRMR algorithm maximizes the relevance between the feature subset and the class, and minimizes the redundancy between features. The formulas are as follows:

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{o_i \in S} I(o_i; c) \tag{2}$$

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{o_i, o_j \in S} I(o_i, o_j) \tag{3}$$

where $S$ represents the set of selected features, $c$ represents the target category, $I(o_i; c)$ represents the mutual information between omics feature $i$ and the target category $c$, and $I(o_i, o_j)$ represents the mutual information between omics feature $i$ and omics feature $j$. The feature score is the difference between the relevance value and the redundancy value, and the feature subset is selected according to the feature score. The feature score is calculated as:

$$\max \Phi(D, R), \quad \Phi = D - R \tag{4}$$
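The criterion in Eqs. (1)-(4) can be illustrated with a small sketch. It is not the authors' implementation: it assumes features that have already been discretized (the toy columns `g1`, `g2`, `g3` are hypothetical), estimates mutual information from counts, and uses the common greedy incremental search, in which each step adds the feature with the largest relevance minus mean redundancy with the features already chosen.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences, as in Eq. (1)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy mRMR: add, one at a time, the feature with the largest
    relevance minus mean redundancy with the features already chosen."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(
                    mutual_information(col, features[s]) for s in selected
                ) / len(selected)
            score = relevance[name] - redundancy  # D - R, Eq. (4)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical toy data: g2 duplicates g1, g3 is less relevant but not redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [1, 0, 0, 0, 0, 1, 1, 1],
}
selected = mrmr_select(features, labels, 2)
print(selected)
```

In this toy case the duplicate `g2` is skipped in favor of `g3`, which is the behavior the redundancy term is meant to produce. Continuous omics values would need binning or a density estimate before mutual information can be computed this way.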
In our study, we use the mRMR algorithm to calculate the feature scores of the omics data. From each kind of omics data, the 1024 features most conducive to the prediction of osteosarcoma metastasis are selected. We then fuse the three kinds of omics data by concatenation (cascading) and standardize the data with the Z-score.
2.3 Hybrid Sampling
Multi-omics datasets often contain different numbers of samples per class, resulting in an uneven distribution of data across classes. On imbalanced datasets, the performance of models can be adversely affected, especially in predicting the less represented classes. Two methods are often used to solve this problem: over-sampling and under-sampling. In our study, a combination of over-sampling and under-sampling techniques was employed to obtain a balanced representation of the omics data for osteosarcoma. The SMOTE algorithm uses linear interpolation to generate synthetic data points, which are appended to the original dataset, effectively enlarging the sample size of the underrepresented class. The flow of the algorithm is as follows:
1) For each sample $X_i$ from the minority class, find its k nearest neighbors based on Euclidean distance, denoted as $X_{i(near)}$.
2) Several samples $X_{i(nn)}$ are randomly selected from the k nearest neighbors $X_{i(near)}$ according to the oversampling rate N.
3) Generate a random number ω, where ω ∈ [0, 1]. A new sample is synthesized as follows:

$$X_{new} = X_i + \omega \left( X_{i(nn)} - X_i \right) \tag{5}$$
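The three steps above can be sketched in a few lines. This is a simplified illustration rather than the reference SMOTE implementation: the function name `smote`, its parameters, and the toy points are assumptions, and instead of applying an oversampling rate N to every minority sample it draws a fixed number of synthetic points from randomly chosen minority samples.

```python
import math
import random


def smote(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (Eq. 5)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Step 1: k nearest neighbours of x by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x), key=lambda p: math.dist(x, p)
        )[:k]
        # Step 2: pick one neighbour at random.
        x_nn = rng.choice(neighbours)
        # Step 3: interpolate with a random weight omega in [0, 1).
        omega = rng.random()
        synthetic.append(tuple(a + omega * (b - a) for a, b in zip(x, x_nn)))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, k=2, n_new=6, seed=0)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new data stay inside the region already occupied by the minority class.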
Edited Nearest Neighbor (ENN) [19] is an under-sampling technique. It examines each sample in the majority class and finds its K nearest neighbors. If most of the nearest neighbors do not belong to the same class as the sample being examined, the sample is deleted. This process is repeated until all samples in the majority class have been examined. In this paper, we combined the SMOTE and ENN algorithms to perform hybrid sampling on the osteosarcoma omics data after the feature selection described in the previous section.
2.4 Network Architecture
In this paper, as shown in Fig. 2, we designed a novel shallow neural network architecture, namely MSA-CNN, to predict the metastasis of osteosarcoma patients based on multi-omics data. The MSA-CNN model includes four one-dimensional convolution layers, two dropout layers, two fully connected layers, one one-dimensional max-pooling layer and one one-dimensional CBAM module.
Fig. 2. MSA-CNN model
Multi-scale Convolution Layer
In recent years, more and more CNN models have been applied in bioinformatics. As shown in Fig. 2, our model includes four convolution layers, which extract features from the input data. We use one-dimensional convolutional layers to extract local features from the osteosarcoma omics data. The input data processed as described in the previous section are fed into a one-dimensional convolutional layer. Then, the output features of this first layer are simultaneously fed into three one-dimensional convolutional
layers with a kernel size of 3. These three convolution layers have different dilation rates: 1, 2 and 3. The purpose of setting different dilation rates is to build dilated convolutions with sparse kernels, which enlarge the receptive field of each output unit without increasing the amount of computation. Different receptive fields capture feature information at different scales, which helps feature extraction from the omics data.
Attention Mechanism
The attention mechanism mimics human observation by focusing on important parts to form an overall impression from a vast amount of information. For one-dimensional omics data, each CNN layer outputs a feature map of size C × L, where C is the number of channels and L is the length of the data. The two domains that the attention mechanism can focus on are the channel domain and the spatial domain. The channel attention mechanism learns and assigns weights to the different channels of a feature map, so that the neural network pays more attention to the feature channels with greater weights. The spatial attention mechanism learns a weight vector over the L positions of the one-dimensional feature map, so that the neural network focuses on the feature points with higher weights. In our study, we modify the channel and spatial attention modules proposed in CBAM [20] so that they can be applied to the prediction task on one-dimensional omics data. The CBAM1D module calculates the attention weights for the channel and spatial domains respectively and weights the feature map accordingly. Its structure is shown in Fig. 3.
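The effect of the dilation rate on the receptive field can be checked with a small, dependency-free sketch. The helper names and the toy signal are assumptions for illustration; a real model would use a deep learning framework's one-dimensional convolution layer instead.

```python
def receptive_field(kernel_size, dilation):
    """Number of input positions covered by one output unit."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D cross-correlation with a dilated kernel."""
    span = receptive_field(len(kernel), dilation)
    return [
        sum(w * signal[start + j * dilation] for j, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]  # three weights in every branch

fields = [receptive_field(3, d) for d in (1, 2, 3)]
outputs = {d: dilated_conv1d(signal, kernel, d) for d in (1, 2, 3)}
print(fields)      # [3, 5, 7]
print(outputs[2])  # [-4, -4, -4, -4]
```

Each branch multiplies only three weights per output, yet the branches with dilation 2 and 3 see 5 and 7 input positions, which is why the parallel branches capture features at different scales at the same cost per output.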
Fig. 3. CBAM1D module
The channel attention and the one-dimensional spatial attention are calculated sequentially in the CBAM1D module. In the channel attention mechanism, two one-dimensional features of size C × 1 are obtained through global max pooling and global average pooling, and are then fed into a shared two-layer neural network. The two output vectors are added, and the channel attention $M_c$ is obtained through the sigmoid activation function. The input feature $F$ is multiplied by the channel attention weight $M_c$ to obtain the channel-weighted feature $F'$. In the one-dimensional spatial attention mechanism, the feature map $F'$ obtained in the previous step is processed using average
pooling and maximum pooling along the channel dimension to obtain two 1 × L feature vectors, which are concatenated along the channel dimension and fed into a one-dimensional convolutional layer to obtain the spatial attention $M_s$; finally, $M_s$ is multiplied with $F'$ to obtain the weighted feature map $F''$. The formulas are as follows:

$$F' = M_c(F) \otimes F \tag{6}$$

$$F'' = M_s(F') \otimes F' \tag{7}$$
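Equations (6) and (7) can be traced on a tiny feature map. The sketch below is a plain-Python illustration under assumptions, not the authors' CBAM1D code: the weights `w1`, `w2` and `kernel` are hypothetical fixed values instead of learned parameters, the shared network uses a ReLU hidden layer as in the original CBAM, and the spatial convolution uses zero padding.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def shared_mlp(vec, w1, w2):
    """Two-layer network with ReLU: C inputs -> hidden units -> C outputs."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feat, w1, w2):
    """M_c: one weight per channel from average- and max-pooled descriptors."""
    avg = [sum(ch) / len(ch) for ch in feat]
    mx = [max(ch) for ch in feat]
    return [
        sigmoid(a + m)
        for a, m in zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))
    ]


def spatial_attention(feat, kernel):
    """M_s: one weight per position from a 1D convolution over the
    channel-wise average and maximum (zero padded)."""
    length, pad = len(feat[0]), len(kernel[0]) // 2
    avg = [sum(ch[i] for ch in feat) / len(feat) for i in range(length)]
    mx = [max(ch[i] for ch in feat) for i in range(length)]
    weights = []
    for i in range(length):
        s = 0.0
        for row, taps in zip((avg, mx), kernel):
            for j, w in enumerate(taps):
                pos = i + j - pad
                if 0 <= pos < length:
                    s += w * row[pos]
        weights.append(sigmoid(s))
    return weights


def cbam1d(feat, w1, w2, kernel):
    mc = channel_attention(feat, w1, w2)
    f1 = [[mc[c] * v for v in ch] for c, ch in enumerate(feat)]   # Eq. (6)
    ms = spatial_attention(f1, kernel)
    return [[ms[i] * v for i, v in enumerate(ch)] for ch in f1]   # Eq. (7)


# Hypothetical feature map with C = 2 channels and L = 4 positions.
feat = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
w1 = [[0.5, -0.5]]                           # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]                         # 1 hidden unit -> 2 channels
kernel = [[0.2, 0.5, 0.2], [0.1, 0.3, 0.1]]  # taps for the avg and max rows
refined = cbam1d(feat, w1, w2, kernel)
```

The output has the same C × L shape as the input. Because both attention maps pass through a sigmoid, every value is scaled by a factor between 0 and 1, first per channel and then per position.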
3 Experiment
3.1 Evaluation Indicators
In our study, we used stratified 5-fold cross-validation (5-fold CV) to test the performance of MSA-CNN. The dataset was divided into five subsets. In each round, four of these subsets were used for training the model, while the fifth subset was reserved for testing the model's performance. Stratification keeps the class distribution of each fold consistent with that of the original dataset. The model's performance was evaluated using several metrics, including Accuracy (ACC), Recall (REC), Precision (PRE), and F1-score (F1). The formulas are as follows:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

$$REC = \frac{TP}{TP + FN} \tag{9}$$

$$PRE = \frac{TP}{TP + FP} \tag{10}$$

$$F1 = \frac{2TP}{2TP + FP + FN} \tag{11}$$
where TP is the number of samples correctly predicted to have metastasis, FP is the number of samples incorrectly predicted to have metastasis, TN is the number of samples correctly predicted to be primary, and FN is the number of samples incorrectly predicted to be primary.
3.2 Results
Our osteosarcoma metastasis prediction model is implemented in Python 3 using the PyTorch framework. The program was run on a Windows 11 PC equipped with an Intel Core i7-12700F processor, 16 GB of RAM, and an NVIDIA RTX 3080 GPU with 12 GB of memory. We used the Adam optimizer to train the MSA-CNN model with a learning rate of 0.005 and a weight decay of 0.001 for 100 epochs.
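The evaluation metrics of Eqs. (8)-(11) can be computed directly from the confusion counts. The sketch below is illustrative: the function name and the toy labels are assumptions, with label 1 standing for the metastasis class and 0 for the primary class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """ACC, REC, PRE and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "ACC": (tp + tn) / len(pairs),
        "REC": tp / (tp + fn) if tp + fn else 0.0,
        "PRE": tp / (tp + fp) if tp + fp else 0.0,
        "F1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Hypothetical fold: 3 metastasis and 5 primary samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics["ACC"])  # 0.75
```

Here TP = 2, FN = 1, FP = 1 and TN = 4, so recall, precision and F1 are all 2/3 while accuracy is 0.75. In a 5-fold setting these values would be computed per fold and then averaged.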
To demonstrate the contribution of specific components of the MSA-CNN model, two models were designed for ablation experiments. The details of the models are as follows:
MS-CNN: Remove the CBAM module from MSA-CNN and flatten the features directly into the fully connected layer for model output.
ACNN: Replace the three parallel one-dimensional convolutional layers with a single one-dimensional convolutional layer, and feed the convolutional features to the CBAM module.
In this study, three different types of omics data (copy number variation data, DNA methylation data, and RNA gene sequencing data) were first used separately to train and test the models. Subsequently, the three types of data were integrated to train and test the models. To compare the performance of the models, the evaluation metrics are averaged over the 5-fold CV. Table 1 shows the predictive capability of the MSA-CNN model, Table 2 that of the MS-CNN model, and Table 3 that of the ACNN model. Comparing the data sources and models, it can be observed that, first, the use of multi-omics data (combining copy number variation data, DNA methylation data, and RNA sequencing data) has led to improved prediction performance of the models. Secondly, the multi-scale convolution layer and the CBAM1D module extract the omics data features of osteosarcoma and enable more accurate prediction of osteosarcoma metastasis.

Table 1. MSA-CNN experimental data
Datatype         ACC     REC     PRE     F1
Multi-omics      98.34   1       98.34   99.14
CopyNumber       94      94      1       96.48
DNAmethylation   90.26   92.72   96.92   94.22
RNA-gene         93.34   98.18   94.86   96.56
Table 2. MS-CNN experimental data
Datatype         ACC     REC     PRE     F1
Multi-omics      96.68   96.36   1       98.08
CopyNumber       96.36   1       96.36   98.08
DNAmethylation   93.46   96.36   96.64   96.42
RNA-gene         95.02   1       95.02   97.42
Table 3. ACNN experimental data
Datatype         ACC     REC     PRE     F1
Multi-omics      96.66   98.18   98.18   98.18
CopyNumber       98.18   1       98.18   99.04
DNAmethylation   92.06   96.52   95.14   95.66
RNA-gene         95.02   1       95.02   97.42

To showcase the efficacy and practicality of the MSA-CNN model in predicting osteosarcoma metastasis, several classification methods were trained and tested using multi-omics data. Specifically, the MS-CNN model, ACNN model, CNN-LSTM model, CNN model, SVM model, and XGBoost model were implemented and compared against the proposed method. To ensure objectivity and fairness, all classification algorithms were trained using the same multi-omics dataset. Table 4 shows the results of the comparative experiments. As shown in Table 4, the MSA-CNN model achieves the highest accuracy and F1-score among the compared methods and shares the highest recall, although it does not achieve the best precision.

Table 4. Comparative experiments with other classification methods
Model      ACC     REC     PRE     F1
MSA-CNN    98.34   1       98.34   99.14
MS-CNN     96.68   96.36   1       98.08
ACNN       96.66   98.18   98.18   98.18
CNN-LSTM   93.6    1       93.6    96.68
CNN        95.02   98.18   96.68   97.32
SVM        82.43   90      62      71.68
XGBoost    74.35   60      28.6    38.18
4 Conclusion
In this paper, MSA-CNN was used to predict osteosarcoma metastasis based on multi-omics data. Copy number variation data, RNA gene sequencing data, and DNA methylation data are processed through feature selection and hybrid sampling. With the fused multi-omics data as input, the model achieves more accurate prediction of osteosarcoma metastasis status than the other methods. Despite the strong predictive performance of the model for osteosarcoma metastasis and the use of 5-fold cross-validation, the multi-omics dataset of osteosarcoma remains limited in size, and the generalization ability of the algorithm still needs to be verified. Data augmentation and improving the model's generalization capability will be future research directions. Moreover, the dimensionality of the multi-omics data is reduced by feature selection in the preprocessing step before the data are input to the model, which may result in the loss of some features related to osteosarcoma metastasis. How to better utilize and process the original multi-omics data is still a problem worth studying.
References
1. Ritter, J., Bielack, S.: Osteosarcoma. Ann. Oncol. 21, vii320–vii325 (2010)
2. Rosen, G., et al.: The rationale for multiple drug chemotherapy in the treatment of osteogenic sarcoma. Cancer 35, 936–945 (1975)
3. Allison, D.C., et al.: A meta-analysis of osteosarcoma outcomes in the modern medical era. Sarcoma 2012 (2012)
4. PosthumaDeBoer, J., et al.: Molecular alterations as target for therapy in metastatic osteosarcoma: a review of literature. Clin. Exp. Metas. 28, 493–503 (2011). https://doi.org/10.1007/s10585-011-9384-x
5. Khanna, C., et al.: Metastasis-associated differences in gene expression in a murine model of osteosarcoma. Can. Res. 61, 3750–3759 (2001)
6. He, Y., Ma, J., Ye, X.: A support vector machine classifier for the prediction of osteosarcoma metastasis with high accuracy. Int. J. Mol. Med. 40, 1357–1364 (2017)
7. Dong, S., et al.: A risk score model for the prediction of osteosarcoma metastasis. FEBS Open Bio 9, 519–526 (2019)
8. Zhang, M., Liu, Y., Kong, D.: Identifying biomolecules and constructing a prognostic risk prediction model for recurrence in osteosarcoma. J. Bone Oncol. 26, 100331 (2021)
9. Chai, H., et al.: Integrating multi-omics data through deep learning for accurate cancer prognosis prediction. Comput. Biol. Med. 134, 104481 (2021)
10. Bhandari, N., et al.: Comprehensive survey of computational learning methods for analysis of gene expression data in genomics. arXiv e-prints. arXiv-2202 (2022)
11. Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. SMC-6, 769–772 (1976)
12. Chawla, N.V., et al.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
13. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. Presented at the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (2008)
14. Leng, D., et al.: A benchmark study of deep learning-based multi-omics data fusion methods for cancer. Genome Biol. 23, 1–32 (2022)
15. Albaradei, S., et al.: Machine learning and deep learning methods that use omics data for metastasis prediction. Comput. Struct. Biotechnol. J. 19, 5008–5018 (2021)
16. Moon, S., Lee, H.: MOMA: a multi-task attention learning algorithm for multi-omics data interpretation and classification. Bioinformatics 38, 2287–2296 (2022)
17. Albaradei, S., et al.: MetaCancer: a deep learning-based pan-cancer metastasis prediction model developed using multi-omics data. Comput. Struct. Biotechnol. J. 19, 4404–4411 (2021)
18. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005)
19. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 3, 408–421 (1972)
20. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_1
Spectral Clustering of Single-Cell RNA-Sequencing Data by Multiple Feature Sets Affinity
Yang Liu, Feng Li(B), Junliang Shang, Daohui Ge, Qianqian Ren, and Shengjun Li
School of Computer Science, Qufu Normal University, Rizhao 276826, China
[email protected]
Abstract. Cell clustering is a critical stage in the analysis of single-cell RNA-sequencing (scRNA-seq) data. The quality of feature selection, which comes first in unsupervised clustering, directly affects the quality of the analysis that follows. It is difficult to choose high-quality features since the gene expression data from scRNA-seq are high dimensional. Feature extraction is often used on gene expression data to choose highly expressed features, that is, subsets of the original features. The typical ways of selecting features are either to retain a fixed percentage or to set a threshold number based on experience. Because these approaches are subjective, it is difficult to guarantee that they yield the best clustering results. In this study, we propose a feature selection method, scMFSA, which overcomes the one-dimensional shortcoming of the traditional PCA method by selecting multiple top-level feature sets. The similarity matrix constructed from each feature set is enhanced by affinity to optimize the feature learning. Finally, experiments are carried out on real scRNA-seq datasets using the features discovered by scMFSA. The findings indicate that, when paired with clustering methods, the features chosen by scMFSA can increase the accuracy of clustering results. As a result, scMFSA can be an effective tool for researchers analyzing scRNA-seq data.
Keywords: scRNA-seq · feature extraction · fusion · clustering
1 Introduction
Recent advances in scRNA-seq technology have enabled researchers to efficiently analyze gene expression in single cells in a high-throughput manner [1]. Compared with bulk RNA sequencing, scRNA-seq provides information on the expression profile of individual cells, helps to understand transcriptional regulation at single-cell resolution, and reveals the heterogeneity of individual cells. scRNA-seq thus offers more biological information and effective help for exploring disease mechanisms or drug development. The primary objective of single-cell RNA-sequencing data processing is the grouping of individual cells according to their transcriptomes.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 268–278, 2023. https://doi.org/10.1007/978-981-99-4749-2_23
However, scRNA-seq data have the properties of high dimensionality, sparsity and noise, which pose a challenge to cell
clustering. Many steps affect the subsequent clustering results, such as preprocessing, dimension reduction, feature selection and data fusion. Feature selection is a significant step that can have a decisive influence on the clustering results. A good feature set should contain as much useful information as possible and exclude features affected by noise. Currently, popular feature selection methods include PCA [2], SC3 [3], Seurat [4], CIDR [5], PanoView [6], SCANPY [7], TSCAN [8], GiniClust [9] and so on. At present, most scRNA-seq clustering tools include a feature selection step. However, these steps are only simple unsupervised feature selection methods, which are not suitable for all scRNA-seq datasets. For example, SC3 cannot provide stable clustering results when processing large-scale datasets; Seurat selects genes with high cell-to-cell variation, which performs well on large datasets but poorly on small ones; GiniClust only shows strong performance on datasets that contain rare cell types. In addition to these simple methods, there are some relatively complex feature selection methods. RgCop [10], a novel method based on a regularized copula, captures the multivariate dependence among a set of genes in gene expression data. It uses an L1 regularization term to penalize the redundancy coefficients of genes for gene selection. The intrinsic entropy model [11] identifies informative genes from the general principles of statistics. FEAST [12] uses the F-statistic to rank the features, and then uses the silhouette coefficient [13] to optimize the feature set based on the results of the initial hierarchical clustering. In this study, a feature extraction method, scMFSA, based on the fusion of the affinity matrices of multiple feature sets is proposed. Based on the selected multiple feature sets, an affinity network is fused to learn complementary information from the expression data. In the process of iterative fusion, weak connections are removed and strong connections are strengthened. The error caused by subjectively extracting a single feature set is reduced, and a valuable, high-quality feature set is obtained before spectral clustering. The extracted features can be used in combination with the spectral clustering method to accurately identify cell populations. The high accuracy of scMFSA in identifying cell populations is demonstrated by experimental results on four real scRNA-seq datasets.
2 Materials and Methods
2.1 Overview
The input of the scMFSA method is the scRNA-seq data matrix $X \in \mathbb{R}^{n \times m}$, where n represents the number of cells and m represents the number of genes. The cells in each dataset have known labels, and the original scRNA-seq datasets come from public data repositories. First, the scRNA-seq data are preprocessed. The intrinsic entropy (IE) model is used to remove noise and select highly informative features, which realizes the preliminary screening of features. A cell-to-cell similarity matrix is constructed from each selected feature set. Then, the affinity matrices obtained from the feature sets are fused into an affinity network, which is further used to extract suitable features.
The cells are finally clustered using the spectral clustering technique. The workflow of the scMFSA method is illustrated in Fig. 1.
Fig. 1. The framework of the scMFSA
2.2 Data Preprocessing
In the scMFSA model, the data are preprocessed as follows. The original scRNA-seq data are used as input to the scMFSA method and normalized using the Linnorm transformation method. Filtering is performed on cells and genes separately. Specifically, we select genes with a minimum read count greater than 0.3 in at least 10% of all cells, and select cells with more than 1000 expressed genes (non-zero values). Log2 normalization is applied to the transformed matrix after adding one as a pseudo count [10].
2.3 Feature Ordering and Feature Set Selection
In the scMFSA method, we use the intrinsic entropy model to preliminarily screen features. The intrinsic entropy model is derived from information theory through the entropy decomposition formula, which divides the total entropy of a variable into intrinsic entropy and extrinsic entropy (EE). The entropy of each gene is estimated from the degree of fluctuation of its expression data and provides information for the subsequent clustering analysis. It is calculated by Eq. (1):

$$E_{int}^{x} = E_{tot}^{x} - E_{ext}^{x} = -\int_{-\infty}^{+\infty} p(x) \ln p(x)\, dx - \iint p(x, Z) \ln \frac{p(x, Z)}{p(x)\,p(Z)}\, dx\, dZ \tag{1}$$
where p(x, Z) is the joint probability distribution function of x and Z, and p(x) and p(Z) are the probability distribution functions of x and Z, respectively. After that, consensus clustering is used: the original dataset is perturbed by resampling, and cluster analysis is conducted on each resampled subset using the K-means method. Since single-cell data are highly noisy, this first clustering reduces the interference of noise between samples. The multiple clustering results are then evaluated jointly to give a consensus result. Based on the consensus clustering results, the significance of each feature is calculated by the F-statistic, and the features are ranked accordingly. Some studies have shown that consensus clustering steps can not only improve the signal but also provide more unique features [12]. Based on the F-statistic ranking, the top i feature subsets with different numbers of features (default i = 1, 2, 3, 4, 5) are selected. In order to reduce the complexity of feature set combination, two combination methods are set according to the number of features. For features
48 h) are tested to determine the most accurate prediction period. The contributions of this paper mainly have the following three aspects:
(1) We utilize XGBoost to determine the optimal feature set and analyze important features that contribute to the prediction.
(2) The LCE model is applied to predict sepsis in ICU patients and compared against other models to assess its advantages.
(3) Our research findings demonstrate that the LCE model outperforms other models with high ACC, AUROC, and AUPR values of 97.5%, 99.2%, and 98.3%, respectively.
Z. Leyi et al.
Fig. 1. The main process of sepsis prediction experiment. (Flowchart labels: Data, Data preprocessing, XGBoost, Reduce dimension, Time delay effect, Predict sepsis with LCE, Model analysis, Conclusion; decision branches Yes/No.)
2 Materials and Methods
2.1 Datasets
The dataset used in this study is the open-source dataset released for the PhysioNet/Computing in Cardiology Challenge 2019 [14]. This dataset includes data from two hospital systems, Beth Israel Deaconess Medical Center (Hospital System A) and Emory University Hospital (Hospital System B), which were de-identified and labelled using Sepsis-3 clinical criteria [15]. The dataset includes 8 vital sign variables, 26 laboratory variables, and 6 demographic variables. In total, the data include more than 2.5 million hourly time windows and 15 million data points.
2.2 Feature Selection with XGBoost
XGBoost is an optimized distributed gradient boosting library. It combines many weak classifiers into a robust classifier and adds a regularization term to the loss function to reduce the risk of overfitting. If the tree model to be trained in the t-th iteration is $f_t(x)$, the prediction is calculated as:

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{1}$$
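In formula (1), $\hat{y}_i^{(t)}$ represents the predicted result of sample i after the t-th iteration, $\hat{y}_i^{(t-1)}$ represents the prediction result of the previous t−1 trees, and $f_t(x_i)$ is the function of the t-th tree. During the training process of XGBoost, we use an exact greedy algorithm to determine the optimal split node:

$$L_{split} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma \tag{2}$$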
LXLMEPS: Leveraging the XGB-lCE-Based Model
It starts with a single leaf and iteratively adds branches to the tree. $I_L$ and $I_R$ are the sample sets of the left and right nodes after splitting, $I = I_L \cup I_R$, and $g_i$ and $h_i$ are the first- and second-order gradients of the loss for sample i. λ and γ are the penalty parameters. $L_{split}$ represents the gain score of each split of the tree, and the final feature importance score is calculated as the average gain. The higher the feature importance score of XGBoost, the more important and effective the corresponding feature is.
2.3 Principle of Local Cascade Ensemble (LCE)
LCE [13] is a novel machine learning method that combines the strengths of Random Forest and XGBoost and uses complementary diversification techniques to obtain predictors that generalize better. LCE is an advanced hybrid (explicit and implicit) version of the implicit cascade generalization method. It integrates an explicit boosting-bagging approach to handle the bias-variance trade-off and an implicit divide-and-conquer approach (decision trees) to learn different parts of the training data. First, LCE reduces the bias of the decision tree divide-and-conquer method by using a boosting-based classifier as the base classifier (Hb in Fig. 2). The boosting is propagated down the tree by adding the class probabilities of the base classifiers to the dataset as new attributes. Then, the overfitting caused by the boosted decision tree is mitigated by bagging. Bagging reduces variance by creating multiple predictors from random samples drawn with replacement from the original dataset (see D1 … D2 in Fig. 2).
Fig. 2. Local Cascade Ensemble (LCE). Hi: base classifier trained on a dataset at a tree depth of i (Hb: eXtreme Gradient Boosting). Di: dataset at a tree depth of i.
For the training dataset $(x_1, \ldots, x_n)$, the performance of the learner $H_t(x)$ is reinforced through multiple rounds of iteration. From the previous round, the learner $H_{t-1}(x)$ and the loss function $L(y, H_{t-1}(x))$ are available. The current round trains the weak learner $h_t(x)$ to minimize the loss function, which can be expressed as (3), where y is the target value.

$$h_t(x) = \arg\min_{h \in H} L\left(y, H_{t-1}(x) + h(x)\right) \tag{3}$$

With the root mean square error as the objective function, the weak learner $h_t(x)$ can be expressed as follows, where $r_t = y - H_{t-1}(x)$ is the residual left by the previous round.

$$h_t(x) = \arg\min_{h \in H} \left(r_t - h(x)\right)^2 \tag{4}$$

The learner of this round of iteration is then obtained.

$$H_t(x) = H_{t-1}(x) + h_t(x) \tag{5}$$
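The iteration in Eqs. (3) to (5) can be sketched with a very small weak learner. The example below is a toy under stated assumptions, not the boosting used inside LCE: the weak learner is a one-feature threshold stump, the data are made up, and the names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return the threshold stump h(x) that minimizes the sum of (r - h(x))^2."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else left_mean
        error = sum(
            (r - (left_mean if x <= threshold else right_mean)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds):
    """Build H_t = H_{t-1} + h_t, fitting each h_t to the current residuals."""
    learners = []

    def predict(x):
        return sum(h(x) for h in learners)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]  # r_t = y - H_{t-1}(x)
        learners.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up one-dimensional data that a single stump cannot fit exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]
error_one_round = mse(boost(xs, ys, rounds=1), xs, ys)
error_many_rounds = mse(boost(xs, ys, rounds=20), xs, ys)
```

Each additional round fits what the previous learners left unexplained, so the training error shrinks as rounds are added. Practical implementations usually also apply a learning rate and regularization, which are omitted here.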
The hybrid ensemble approach LCE allows for balancing bias and variance while benefiting from improved generalization capabilities by explicitly creating different training sets (bagging, augmentation). Furthermore, the implicit divide-and-conquer strategy of LCE ensures that classifiers are learned on different parts of the training data. Therefore, LCE has good predictive performance.

2.4 Performance Evaluation Metrics

To evaluate the predictive performance of the model, we compute accuracy, precision, recall, and F1-score. We then plot the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve and calculate AUROC (area under the ROC curve) and AUPR (area under the PR curve).
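For reference, the threshold-based metrics and AUROC named above can be written out directly from their definitions. The following is a minimal sketch with made-up labels and scores; the function names are hypothetical, and AUROC is computed as the fraction of positive-negative pairs that the score ranks correctly (ties counted as one half), which is equivalent to the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auroc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly by the score."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels, hard predictions and scores for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
metrics = classification_metrics(y_true, y_pred)
area = auroc(y_true, y_score)
```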
3 Experiments and Results

3.1 Data Processing

Since the data distribution of the entire data set is quite sparse (shown in Fig. 3(a)) and the value ranges of different features vary greatly, we need to preprocess the data before conducting experiments. First, we used the fillna and dropna functions from the pandas library to handle missing values. Then, we applied a logarithmic transformation to bring the data closer to a normal distribution. Because the continuous variables in the experimental data fluctuate over a wide range, the variables are standardized. The numbers of positive and negative samples are 15284 and 750935, respectively, so the data set is strongly imbalanced. To avoid bias in the prediction results, we use the RandomUnderSampler class from the imbalanced-learn (imblearn) library, which is compatible with scikit-learn, for random undersampling. After data processing, the correlation heat map in Fig. 3(b) shows that almost no features are highly correlated.

3.2 Feature Selection

After the data preprocessing stage, a total of 18 data features remain (as depicted in Fig. 3(b), where 0 and 1 represent the gender feature). Due to the considerable sample size of this study, certain indicators may exhibit statistical significance but lack clinical significance. To address this, we employ XGBoost for feature selection.
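The transformation and undersampling steps of Sect. 3.1 can be illustrated without any third-party library. The sketch below is a plain-Python stand-in, not the pandas and RandomUnderSampler calls used in the study; the function names, the use of log(1 + x), the fixed seed and the toy data are all assumptions for illustration.

```python
import math
import random


def log_standardize(values):
    """Apply log(1 + x), then rescale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mean = sum(logged) / len(logged)
    std = math.sqrt(sum((v - mean) ** 2 for v in logged) / len(logged))
    std = std or 1.0  # a constant column stays at zero instead of dividing by zero
    return [(v - mean) / std for v in logged]


def random_undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 5 positive and 95 negative samples.
rows = list(range(100))
labels = [1] * 5 + [0] * 95
balanced_rows, balanced_labels = random_undersample(rows, labels)
scaled = log_standardize([1.0, 10.0, 100.0, 1000.0])
```

Undersampling discards most of the majority class, so results can vary with the random seed; repeating the experiment with several seeds is one way to check stability.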
Fig. 3. (a) Data display of the dataset. (b) The correlation heatmap between various features.
Through this process, we reduce the feature set from all 18 features down to as few as the top 10 features, and each candidate feature set is used as input to the model. By comparing the resulting ACC and AUROC values, we can identify the optimal feature set. The outcomes of this comparison are illustrated in Fig. 4. The accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) across the different feature sets show that the maximum ACC and AUROC are achieved with 12 features. Compared with using all 18 features, this reduced set of 12 features increases ACC and AUROC by 1.9% and 0.6%, respectively. These 12 features, namely Age, White Blood Cell Count (WBC), Platelets, Hospital Admission Time (HospAdmTme), Glucose, Hour, Hematocrit (Hct), Blood Urea Nitrogen (BUN), Temperature, Creatinine, Hemoglobin (Hgb), and Chloride, are presented in Fig. 5.
Fig. 4. ACC and AUROC scores for different feature sets.
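The comparison behind Fig. 4 amounts to ranking features by importance and scoring nested top-k subsets. The sketch below shows that loop under stated assumptions: `importances` stands for any importance ranking (for example the average gain from XGBoost), `evaluate` is a hypothetical callback that would train a model on the given features and return a score such as ACC, and the toy numbers are invented.

```python
def best_top_k(importances, evaluate, k_min):
    """Score the top-k feature subsets for k = k_min..all and return the best one."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    results = {k: evaluate(ranked[:k]) for k in range(k_min, len(ranked) + 1)}
    best_k = max(results, key=results.get)
    return ranked[:best_k], results


# Invented importances; the callback rewards informative features and
# charges a small cost per feature, so the score peaks at an intermediate k.
toy_importances = {"a": 0.40, "b": 0.30, "c": 0.20, "d": 0.05, "e": 0.03, "f": 0.02}


def toy_evaluate(features):
    return sum(toy_importances[f] for f in features) - 0.04 * len(features)


selected, scores_by_k = best_top_k(toy_importances, toy_evaluate, k_min=2)
```

In a real experiment the callback would use held-out data or cross-validation, so that the chosen subset is not tuned to the samples used for the final evaluation.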
3.3 Prediction Results

We conducted training and testing on a dataset comprising 40,336 patients from two hospital systems: Beth Israel Deaconess Medical Center (Hospital System A) and Emory University Hospital (Hospital System B). We trained and evaluated the LCE model along with several other models widely used for classification tasks. The scores of the experimental results are shown in Table 1, while the ROC and PR curves are presented in Fig. 6.
Fig. 5. (a) Feature importance of the dataset based on XGBoost. Specific gravity is the proportion of the number of times a feature appears in a tree; (b) The top 12 features with the highest contribution.

Table 1. ACC, AUROC, AUPR, PRE, REC, and F1 scores for different models.

Model   ACC     AUROC   AUPR    PRE     REC     F1
LCE     0.975   0.992   0.983   0.964   0.973   0.960
XGB     0.901   0.959   0.914   0.892   0.864   0.872
RF      0.957   0.987   0.975   0.933   0.961   0.931
NBC     0.721   0.748   0.609   0.649   0.402   0.497
SVM     0.746   0.664   0.658   0.743   0.344   0.471
LG      0.745   0.740   0.636   0.755   0.347   0.475
Fig. 6. (a) Receiver Operating Characteristic (ROC) Curve and (b) Precision-Recall (PR) Curve.
Among the six models evaluated, LCE demonstrated the best performance, achieving an accuracy of 97.5%, with AUROC and AUPR scores of 99.2% and 98.3%, respectively. Random Forest and XGB also performed well in this task. Random Forest achieved an accuracy of 95.7%, with AUROC and AUPR scores of 98.7% and 97.5%, respectively. However, algorithms such as Naive Bayes Classifier (NBC), Logistic Regression (LG), and Support Vector Machines (SVM) performed poorly on this dataset, with none of their metrics surpassing 80%. To position our approach against existing work, we compared it with several recent methods for sepsis prediction, as listed in Table 2. Table 2 shows that our approach achieves higher ACC, AUROC, and AUPR values than the other methods, although the methods were evaluated on different data sources.

Table 2. Comparison with cutting-edge methodologies for sepsis prediction.

Methods               Data source                                     ACC     AUROC    AUPR
MGP-RNN [16]          EHR data from a quaternary academic hospital    -       88%      -
TCN [17]              PhysioNet Challenge 2019                        95.5%   91%      68%
CNN-LSTM [18]         Multiple Danish Hospitals                       -       85.6%    -
MGP-TCN [19]          MIMIC-III database                              -       86%      40%
LiSep LSTM [11]       MIMIC-III database                              -       83.06%   -
This work (XGB-LCE)   PhysioNet Challenge 2019                        97.5%   99.2%    98.3%
To determine the period during which the LCE model predicts sepsis most accurately from the patient's health status, we partitioned the dataset into five small test sets based on the duration of the ICU stay: 0–6, 7–12, 13–24, 25–48, and >48 h. To ensure the stability of the results, we conducted 10 repeated experiments for each period, using random data for training and testing the model. The average results of the 10 experiments are shown in Fig. 7. The results indicate that sepsis prediction accuracy varies across time intervals, with the highest accuracy achieved within 13–24 h and a decrease in accuracy observed after 48 h. This finding can provide valuable insights to ICU medical professionals, enabling them to allocate their time more efficiently for sepsis detection and leading to more precise diagnostic outcomes.
Fig. 7. ACC and AUROC scores for different periods.
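The partition by ICU stay can be expressed as a small bucketing function. The sketch below is illustrative only: the record layout (ICU hours, true label, predicted label), the function names, and the sample records are assumptions, and the period labels use a plain hyphen.

```python
PERIODS = [(6, "0-6"), (12, "7-12"), (24, "13-24"), (48, "25-48")]


def icu_period(hours):
    """Map an ICU length of stay in hours to one of the five periods."""
    for upper, label in PERIODS:
        if hours <= upper:
            return label
    return ">48"


def accuracy_by_period(records):
    """records: iterable of (icu_hours, y_true, y_pred) tuples."""
    hits, totals = {}, {}
    for hours, y_true, y_pred in records:
        label = icu_period(hours)
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + (y_true == y_pred)
    return {label: hits[label] / totals[label] for label in totals}


# Invented records for illustration.
records = [(3, 1, 1), (5, 0, 1), (10, 0, 0), (20, 1, 1), (30, 1, 0), (60, 0, 0)]
period_accuracy = accuracy_by_period(records)
```

Averaging such per-period scores over repeated runs, as done in the experiment above, gives the kind of summary plotted in Fig. 7.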
4 Discussion

In the feature selection stage, the results indicate that the best prediction performance is achieved when the feature set contains Age, WBC, Platelets, HospAdmTme, Glucose, Hour, Hct, BUN, Temp, Creatinine, Hgb, and Chloride. These 12 features include two criteria of the SIRS [20], namely Temp (temperature) and WBC (leukocyte count), as well as two indicators of the SOFA [20], namely Platelets and Creatinine. We observed that the accuracy for feature sets of 10 to 18 features is above 95%. The contribution of all 18 features is shown in Fig. 8. We find that the features ranked 13–18 include the remaining two features of the SIRS scoring standard, HR and Hgb, as well as MAP in the SOFA scoring standard. This highlights the practical significance of feature selection using XGBoost on datasets.
Fig. 8. The top 18 features with the highest contribution.
In addition to the indicators that already appear in the existing rating standards, we also identify some important indicators among these features. Among the top five contributing features, a demographic variable and a laboratory variable deserve attention: age and blood glucose. Age can influence the risk and severity of sepsis, as older adults and young children are more vulnerable to sepsis than other adults [21–23]. Hyperglycemia, a state of elevated blood sugar levels, can increase the risk of sepsis and also affect its prognosis [24]. In sepsis patients, hyperglycemia can intensify the inflammatory response, worsening organ damage and increasing the risk of multiple organ dysfunction syndrome [25]. The analysis above shows that the features selected by XGB do correspond to factors associated with the onset of sepsis. Among these 12 features, those that are not currently included in the existing rating systems have reference value as indicators.

Furthermore, another strong impression we gained from this experiment is that the three highest-performing models are all constructed from decision trees. Tree-based algorithms exhibit enhanced expressive capabilities when confronted with complex data features. They excel at capturing intricate relationships among features and therefore hold an advantage in handling complex feature sets. In contrast, algorithms such as Naive Bayes Classifier (NBC), Logistic Regression (LG), and Support Vector Machines (SVM) rely on specific assumptions regarding the underlying data distribution during the prediction process. Deviations of the actual data distribution from these assumptions can lead to performance degradation. Tree-based algorithms make weaker assumptions about the data distribution and are suitable for more diverse data types. Notably, the LCE model distinguishes itself by integrating the advantages of random forest and XGB, leveraging complementary diversification techniques to enhance generalization and optimize prediction performance.
However, we must acknowledge that the LCE model has its limitations. The LCE model we employed is akin to a black box. While it delivers high accuracy in sepsis prediction, it may be difficult for medical researchers to explain the relationship between the sepsis prediction and each characteristic variable. Therefore, to enhance the transparency of the sepsis detection system and facilitate medical decision-making, our future efforts will be dedicated to augmenting this model with additional explainable artificial intelligence techniques.
5 Conclusion

In this study, we propose an XGB-LCE model for predicting sepsis in ICU patients. By performing dimensionality reduction using XGB and comparing the effects of different feature sets on prediction results, we provide valuable insights for future sepsis evaluation criteria. Additionally, our experimental results demonstrate the feasibility of using LCE for sepsis prediction, with this model offering a rapid and accurate prediction method. We also investigated the impact of different time periods on sepsis prediction accuracy, and our findings indicate that sepsis detection accuracy decreases after the patient has been in the ICU for 48 h. It is our hope that the conclusions drawn from this study will guide the development of early sepsis prediction methods, ultimately advancing the field of medical care and benefiting the public.
Acknowledgments. I wish to express my sincere gratitude to my supervisor and my teams who have made this paper possible. Their unwavering support and invaluable contributions have been instrumental in the successful completion of my research.
References

1. Singer, M., et al.: The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 315(8), 801–810 (2016)
2. Rudd, K.E., et al.: Global, regional, and national sepsis incidence and mortality, 1990–2017: analysis for the Global Burden of Disease study. Lancet 395(10219), 200–211 (2020)
3. Seymour, C.W., et al.: Time to treatment and mortality during mandated emergency care for sepsis. New Engl. J. Med. 376(23), 2235–2244 (2017)
4. Lambden, S., Laterre, P.F., Levy, M.M., Francois, B.: The SOFA score—development, utility and challenges of accurate assessment in clinical trials. Crit. Care 23(1), 1–9 (2019)
5. Henry, K.E., Hager, D.N., Pronovost, P.J., Saria, S.: A targeted real-time early warning score (TREWScore) for septic shock. Sci. Transl. Med. 7(299), 299ra122–299ra122 (2015)
6. McCoy, A., Das, R.: Reducing patient mortality, length of stay and readmissions through machine learning-based sepsis prediction in the emergency department, intensive care unit and hospital floor units. BMJ Open Qual. 6(2), e000158 (2017)
7. Nemati, S., Holder, A., Razmi, F., Stanley, M.D., Clifford, G.D., Buchman, T.G.: An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit. Care Med. 46(4), 547 (2018)
8. Delahanty, R.J., Alvarez, J., Flynn, L.M., Sherwin, R.L., Jones, S.S.: Development and evaluation of a machine learning model for the early identification of patients at risk for sepsis. Ann. Emerg. Med. 73(4), 334–344 (2019)
9. Maitra, S., Som, A., Bhattacharjee, S.: Accuracy of quick Sequential Organ Failure Assessment (qSOFA) score and systemic inflammatory response syndrome (SIRS) criteria for predicting mortality in hospitalized patients with suspected infection: a meta-analysis of observational studies. Clin. Microbiol. Infect. 24(11), 1123–1129 (2018)
10. Giannini, H.M., et al.: A machine learning algorithm to predict severe sepsis and septic shock: development, implementation and impact on clinical practice. Crit. Care Med. 47(11), 1485 (2019)
11. Fagerström, J., Bång, M., Wilhelms, D., Chew, M.S.: LiSep LSTM: a machine learning algorithm for early detection of septic shock. Sci. Rep. (1), (2019)
12. Kam, H.J., Kim, H.Y.: Learning representations for the early detection of sepsis with deep neural networks. Comput. Biol. Med. 89, 248–255 (2017)
13. Fauvel, K., Fromont, É., Masson, V., Faverdin, P., Termier, A.: XEM: an explainable-by-design ensemble method for multivariate time series classification. Data Min. Knowl. Discov. 36(3), 917–957 (2022)
14. Reyna, M.A., Josef, C.S., Jeter, R., Shashikumar, S.P., Sharma, A.: Early prediction of sepsis from clinical data: the PhysioNet/Computing in Cardiology Challenge 2019. Crit. Care Med. 48(2), 1 (2019)
15. Seymour, C.W., et al.: Assessment of clinical criteria for sepsis: for the Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA 315(8), 762–774 (2016)
16. Bedoya, A.D., et al.: Machine learning for early detection of sepsis: an internal and temporal validation study. JAMIA Open 3(2), 252–260 (2020)
17. Kok, C., et al.: Automated prediction of sepsis using temporal convolutional network. Comput. Biol. Med. 127, 103957 (2020)
18. Lauritsen, S.M., et al.: Early detection of sepsis utilizing deep learning on electronic health record event sequences. Artif. Intell. Med. 104, 101820 (2020)
19. Moor, M., Horn, M., Rieck, B., Roqueiro, D., Borgwardt, K.: Early recognition of sepsis with Gaussian process temporal convolutional networks and dynamic time warping (2019)
20. Marik, P.E., Taeb, A.M.: SIRS, qSOFA and new sepsis definition. J. Thoracic Disease 9(4), 943 (2017)
21. Starr, M.E., Saito, H.: Sepsis in old age: review of human and animal studies. Aging Disease 5(2), 126 (2014)
22. Tran, D.D., Groeneveld, A., Van der Meulen, J., Nauta, J., Strack van Schijndel, R., Thijs, L.: Age, chronic disease, sepsis, organ system failure, and mortality in a medical intensive care unit. Crit. Care Med. 18(5), 474–479 (1990)
23. Emr, B.M., Alcamo, A.M., Carcillo, J.A., Aneja, R.K., Mollen, K.P.: Pediatric sepsis update: how are children different? Surg. Infect. 19(2), 176–183 (2018)
24. Ali, N.A., et al.: Glucose variability and mortality in patients with sepsis. Crit. Care Med. 36(8), 2316 (2008)
25. Van Cromphaut, S., Vanhorebeek, I., Van den Berghe, G.: Glucose metabolism and insulin resistance in sepsis. Curr. Pharm. Des. 14(19), 1887–1899 (2008)
DeepMAT: Predicting Metabolic Pathways of Compounds Using a Message Passing and Attention-Based Neural Networks

Hayat Ali Shah, Juan Liu(B), Zhihui Yang, and Jing Feng

School of Computer Science, Institute of Artificial Intelligence, Wuhan University, Wuhan, China
{hayatali,liujuan,zhy,gfeng}@whu.edu.cn
Abstract. The study of compounds and their metabolic pathways is crucial for predicting the presence of compounds in metabolic pathways based on their molecular properties, which can be used for drug design and metabolic pathway reconstruction. To accurately reconstruct metabolic pathways, it is necessary to predict which compounds belong to specific pathways. While several computational methods have been proposed for this task, they can only map compounds to metabolic pathway classes and not to actual metabolic pathways. Similarity-based and feature-engineering-based methods have also been proposed to predict actual metabolic pathways. However, problems arise when the similarity score is below 50%, and similarity score calculations can be computationally intensive and time-consuming, especially when dealing with large datasets. To address these limitations, this paper proposes a message-passing-based neural network (DeepMAT) that integrates a message passing neural network (MPNN) and a multi-head attention Transformer encoder to predict the actual metabolic pathways in which compounds are involved. The purpose of integrating the multi-head attention Transformer encoder is to compute the overall information of the entire graph by calculating the influence of each node's neighbors. Experimental results show that integrating a message-passing network into a Transformer-style architecture is more expressive and outperforms other methods.

Keywords: Metabolic pathway · SMILES · Molecule object · Deep learning
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 428–446, 2023. https://doi.org/10.1007/978-981-99-4749-2_37

1 Introduction

Metabolism is the sum of all chemical reactions that occur in an organism to maintain its living state. The chemical reactions of metabolism are organized into metabolic pathways, in which molecules (substrates) are transformed into other molecules (products) with the help of specific enzymes. The analysis of metabolic pathways is very important for discovering, designing and reconstructing metabolic pathways. To date, many metabolic pathways have been displayed and stored in public databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [1, 2], based on their biological functions. These pathways are categorized into 12 different classes. These classes
are Energy metabolism, Carbohydrate metabolism, Nucleotide metabolism, Amino acid metabolism, Lipid metabolism, Glycan biosynthesis and metabolism, Metabolism of other amino acids, Metabolism of cofactors and vitamins, Metabolism of terpenoids and polyketides, Xenobiotics biodegradation and metabolism, Biosynthesis of other secondary metabolites, and Chemical structure transformation maps. The metabolic pathways contained in these classes provide valuable information on metabolomics. However, many metabolic pathways remain unknown, and many enzymatic reactions are still missing, even in well-studied species. For example, humans have many unknown metabolic pathways and missing reaction steps [3]. Efficiently and correctly mapping compounds to the metabolic pathways in which they are involved is of great significance for the screening of drug candidate compounds [4] and the identification of hidden reactions [5], and may help to reconstruct incomplete pathways [6]. In vivo prediction of metabolites of metabolic pathways is the most widely used method to identify whether a metabolite is involved in a metabolic pathway. Although this method gives reliable results, it requires skilled manpower and a costly, time-consuming protocol, which is a burden when researchers need to discover drugs within a set time frame [7]. Alternatively, computational methods can reduce the use of animal experiments and speed up the development of drugs to treat many diseases [8]. In particular, machine learning models have become a standard tool for proposing new biological mechanisms in metabolic pathway analysis, given the number of metabolites and the cost of modelling. Therefore, machine learning methods are important alternatives for predicting metabolic pathways. Researchers have proposed several machine learning methods in the past to predict metabolic pathways of compounds [9–15]. Amongst them, Cai et al.
[9] proposed using the nearest neighbor algorithm to map small chemical molecules to the metabolic pathway classes they may belong to. The authors collected 2,764 compounds from 11 metabolic pathway classes. Their experimental results showed that 73.3% of small molecules were correctly mapped to metabolic pathways. Hu et al. [10] predicted metabolic pathways based on the chemical-chemical interactions of 8,686 compounds, of which 3,137 compounds have known metabolic pathways and the remaining 5,549 compounds have unknown metabolic pathway types, with an accuracy of 77.97%. Gao et al. [11] predicted metabolic pathways based on interaction information between chemicals (compounds) and proteins (enzymes) by constructing a dataset consisting of 3,348 small molecules and 654 enzymes. The proposed hybrid model takes molecules and enzymes as nodes and edges. Evaluated with the jackknife test, the model reached a performance of 79.56%. Peng et al. [12] proposed using the AdaBoost algorithm to map small molecules to possible metabolic pathway types based on physicochemical properties. Their method correctly matched 83.88% of small molecules to the relevant pathway classes. This line of work was further extended by Baranwal et al., who introduced neural networks for predicting metabolic pathways by integrating graph neural networks (GNNs) and random forests (RFs). They constructed a dataset of 6,669 compounds from 11 types of metabolic pathways. These compounds were classified into their corresponding pathway types based on global molecular features and molecular descriptors with an accuracy of 97.61%, a precision of 91.61% and a recall of 92.50% [13].
Inspired by the work of Baranwal et al. [13], Yang et al. recently proposed a hybrid GAT model that uses the dataset of [13] to predict metabolic pathway classes by combining global and local features of compounds. Their work improved the performance of classifying compounds into their corresponding metabolic pathway types [15]. Despite the good performance metrics of these methods and the progress of computational methods for metabolic pathway analysis, these methods do not address the actual metabolic pathways: a compound that belongs to a certain metabolic pathway class does not necessarily belong to all the metabolic pathways in that class. For example, the compound "alpha-D-glucose" belongs to the class of carbohydrate metabolism, but it belongs to only a few of the metabolic pathways in that class, such as galactose metabolism, glycolytic metabolism, and fructose metabolism, and not to others in the same class, such as the Propanoate and Butanoate metabolic pathways. To address this problem, Jia et al. [14] targeted the actual metabolic pathways present in KEGG based on the compounds' similarity scores and feature engineering. Their work paired 168 metabolic pathways with 5,641 compounds, where seven features represented each compound and metabolic pathway pair. To classify the constructed pairs, a Random Forest (RF) was trained and evaluated with 93.43% accuracy, 94.37% precision, 92.35% recall and 93.35% F1-score. This performance is excellent compared to many previous methods. However, similarity-based methods encounter difficulties in mapping compounds to metabolic pathways when the similarity score is below 50%. Furthermore, feature engineering of compounds becomes tedious and time-consuming as the amount of data increases.
All previous methods introduce computational tools that have shown excellent performance in predicting metabolic pathways, but they encounter three major limitations. First, most of them do not predict the actual metabolic pathway to which a compound belongs. Second, if the similarity score is below 50%, the association between compounds and pathways becomes unreliable. Third, feature engineering is an iterative, resource-intensive and time-consuming process, and its cost grows as the number of compounds in pathway databases increases. The DeepMAT model was developed to overcome these limitations in predicting a compound's metabolic pathway. The model consists of a message-passing-based neural network that directly extracts node and edge features, and a multi-head attention mechanism that is applied to the graph representation of a molecule to enhance the ability of the model to extract important information from the graph, which improves its ability to predict the metabolic pathway of a compound. The message-passing mechanism allows the network to compute the influence of each node and to extract comprehensive information about the entire graph, including information about each node's neighbors. DeepMAT uses the SMILES string of a compound to predict its actual metabolic pathway instead of just its pathway type. This is achieved by pairing the compound with its corresponding pathway as a sample for a binary classification problem. The results of this study show that DeepMAT outperforms other methods in accurately predicting the metabolic pathway of each compound.
2 Material

We extracted 12 different types of metabolic pathways from the KEGG database. Each pathway type consists of several metabolic pathways. We retrieved 180 metabolic pathways from the 12 metabolic pathway types and obtained 2856 compounds from these metabolic pathways. Each compound belongs to one or more metabolic pathways. Compounds belonging to a single pathway are paired with that pathway, and each pair is a dataset sample. Because many compounds belong to multiple metabolic pathways, the task is naturally a multi-label classification problem [16]. However, we follow the scheme of a previous study [14], in which the multi-label classification problem is transformed into a binary classification problem. In this work, we paired the compounds with their respective metabolic pathways. For example, compound "C02320" belongs to the metabolic pathways "map00480" and "map00983". We paired the compound with each pathway separately and used each pair as a sample for a binary classification problem, which classifies input samples into a positive and a negative class. In this way we constructed 4304 positive samples. However, no negative samples are reported in the KEGG database. Therefore, we randomly generated pairs in which the compound does not belong to the pathway. For example, the compound "acetyl phosphate" belongs to the pyruvate metabolic pathway, so this pair is labelled as a positive sample. The same compound does not belong to the Propanoate pathway, so its pair with the Propanoate pathway is labelled as a negative sample. All negative samples were constructed this way, and 4304 negative samples were randomly selected for analysis.
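The construction of positive and negative samples described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function name `build_pairs`, the fixed seed, and the tiny mapping (which mixes the identifiers and pathway names quoted in the text) are chosen only for the example.

```python
import random


def build_pairs(compound_pathways, all_pathways, seed=0):
    """Return (positives, negatives) as lists of (compound, pathway) pairs."""
    rng = random.Random(seed)
    # Every known membership becomes one positive sample.
    positives = [
        (compound, pathway)
        for compound, pathways in compound_pathways.items()
        for pathway in sorted(pathways)
    ]
    # Every pairing that is not a known membership is a candidate negative.
    candidates = [
        (compound, pathway)
        for compound, pathways in compound_pathways.items()
        for pathway in all_pathways
        if pathway not in pathways
    ]
    # Draw as many negatives as positives to keep the two classes balanced.
    negatives = rng.sample(candidates, min(len(positives), len(candidates)))
    return positives, negatives


compound_pathways = {
    "C02320": {"map00480", "map00983"},
    "acetyl phosphate": {"pyruvate metabolism"},
}
all_pathways = ["map00480", "map00983", "pyruvate metabolism", "propanoate metabolism"]
positives, negatives = build_pairs(compound_pathways, all_pathways)
```

Note that a randomly drawn negative pair only reflects the absence of a recorded membership in the database, so some negatives may in fact be undiscovered positives.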
3 Method

We propose the DeepMAT framework, based on message passing and a Transformer encoder, to predict the metabolic pathways of compounds. A detailed overview of DeepMAT is shown in Fig. 1. DeepMAT combines two models: a message-passing network and a multi-head attention Transformer encoder. First, we apply a message-passing neural network to compute an embedding vector for each input compound. Second, we add an encoder layer with a Transformer-like structure, which is more suitable for learning representations of the input. Finally, a prediction layer is built to calculate the scores of input compounds for predicting metabolic pathways.

3.1 Problem Formulation

The input of the model is a set of n compounds $C = \{c_1, c_2, \ldots, c_n\}$ and a set of metabolic pathway types $T = \{t_1, t_2, \ldots, t_m\}$, where each pathway type is divided into various metabolic pathways $P = \{p_1, p_2, \ldots, p_k\}$. For each compound, we know that it belongs to a specific subset of metabolic pathways within each class. This problem can be modelled as a binary classification task. The model's output is a set of pairs (C, P), where C is the set of compounds and P is the set of metabolic pathways.
Fig. 1. An illustration of DeepMAT architecture predicting metabolic pathway of compounds. DeepMAT is mainly based on Multi-head attention Transformer encoder to capture the structural information of molecules.
3.2 Data Representation

Molecules can be represented in various ways for analysis, such as Mol representation [17], graphical representation [18], and NLP representation [19]. The choice depends on the computational tools used and on the type of molecular representation the task requires. This work uses the Simplified Molecular Input Line Entry System (SMILES), a textual representation of molecules. There are several ways to represent SMILES for metabolomics analysis, such as the n-gram approach [19], where a SMILES string is split into multiple words that a word2vec [20] model converts into embeddings. The n-gram representation of SMILES is widely used in bioinformatics, for example for compound-protein interaction [21] and molecular property prediction. SMILES can also be expressed as a graph: the SMILES string of a molecule is converted into a graph according to specific grammar rules. This representation is more compatible with machine learning methods, especially graph neural networks (GNNs). This study processes SMILES strings through RDKit to generate molecular descriptors, which are then used as input features for building models.

3.3 DeepMAT Input

DeepMAT takes the SMILES string for each compound in the metabolic pathway and converts it into a molecule object. The graph from the molecule object represents atom
and bond features. The atom featurizer defines methods for extracting various features of an atom, such as its symbol, number of hydrogens, hybridization, and number of valence electrons. The bond featurizer extracts features of the bonds between atoms in a molecule, such as the bond type (single, double, triple, or aromatic) and whether the bond is conjugated. Each featurizer has a specific set of allowed values, such as the valence and hybridization state of atoms and the type and conjugation state of bonds. These featurizers then generate feature vectors for specific molecules, which are used as input to the DeepMAT model.

3.4 Message-Passing Neural Network

Message-passing neural networks (MPNNs) are used for the SMILES strings of compounds because they are particularly well suited to graph-structured data, which is the natural representation of a molecule. A SMILES string represents a molecule as a graph, with atoms as nodes and bonds as edges. MPNNs are designed to operate on graph-structured data by passing messages between atoms along bonds. This allows the network to learn the relationships between atoms in the molecule and make predictions about its properties. The message-passing process is repeated several times, allowing the network to extract increasingly complex structural information. We apply a directed message-passing neural network to the graph G constructed from the SMILES string. G has two main kinds of features: atom (node) features $x_v$ and bond (edge) features $e_{vw}$. The MPNN consists of a message-passing phase, which works as follows. The message-passing phase is a critical step in many graph-based machine learning algorithms, especially those that use neural network architectures. At this stage, each node in the graph sends messages to and receives messages from its neighbors. These messages can contain various types of information, such as edge weights, node characteristics, or other relevant data.
The messages are computed by a message function $M_t$. As a result of this communication, each node becomes aware of its vertex neighborhood, that is, the set of adjacent nodes directly connected to it by edges. The message-passing phase is typically repeated multiple times, with each iteration making the nodes more aware of their surroundings. For example, after two message-passing rounds each node has received information from its second-order neighborhood, which includes the nodes two edges away from it in the graph.

More precisely, the message-passing phase is defined in terms of the message function $M_t$ and the update function $U_t$, and runs for T time steps. During the message-passing phase, the hidden state $h_v^t$ of each node is updated based on the message $m_v^{t+1}$ according to Eqs. 1 and 2:

$$m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw}) \tag{1}$$

where $m_v^{t+1}$ is the updated message for node v in the graph G, $N(v)$ is the set of neighbors of v, $M_t$ is the message function, $h_v^t$ and $h_w^t$ are the hidden states of nodes v and w at iteration t, and $e_{vw}$ is the edge feature. The hidden state of node v is then updated according to the message $m_v^{t+1}$ as follows:

$$h_v^{t+1} = U_t(h_v^t, m_v^{t+1}) \tag{2}$$
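The two equations above can be sketched in a few lines of plain Python. This is a minimal illustration and not the DeepMAT implementation: the function names `message_passing_step`, `toy_message`, and `toy_update` are hypothetical, and the toy message and update functions stand in for the learned $M_t$ and $U_t$.

```python
def message_passing_step(hidden, edges, message_fn, update_fn):
    """One round of Eqs. (1)-(2) on an undirected graph.

    hidden: dict mapping node -> list of floats (h_v^t)
    edges:  dict mapping (v, w) -> edge feature e_vw, one entry per bond
    """
    # Build the neighbor lists N(v) from the undirected edge list.
    neighbors = {v: [] for v in hidden}
    for (v, w), e_vw in edges.items():
        neighbors[v].append((w, e_vw))
        neighbors[w].append((v, e_vw))

    new_hidden = {}
    for v, h_v in hidden.items():
        # Eq. (1): sum the messages M_t(h_v, h_w, e_vw) over all neighbors w.
        m_v = [0.0] * len(h_v)
        for w, e_vw in neighbors[v]:
            msg = message_fn(h_v, hidden[w], e_vw)
            m_v = [a + b for a, b in zip(m_v, msg)]
        # Eq. (2): combine the old state with the aggregated message.
        new_hidden[v] = update_fn(h_v, m_v)
    return new_hidden


# Hypothetical stand-ins for the learned functions: the message is the
# neighbor state scaled by the bond weight, the update is a plain average.
def toy_message(h_v, h_w, e_vw):
    return [e_vw * x for x in h_w]


def toy_update(h_v, m_v):
    return [0.5 * a + 0.5 * b for a, b in zip(h_v, m_v)]


# A three-atom chain 0 - 1 - 2 with one-dimensional hidden states.
hidden = {0: [1.0], 1: [0.0], 2: [3.0]}
edges = {(0, 1): 1.0, (1, 2): 2.0}
hidden_next = message_passing_step(hidden, edges, toy_message, toy_update)
print(hidden_next)
```

All new states are computed from the states of the previous round, so the order in which nodes are visited does not affect the result.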
H. A. Shah et al.
During the node update stage, the hidden state of each node is updated via a gated recurrent unit (GRU) according to the following equation:

$$h_v^{t+1} = \mathrm{GRU}(h_v^t, m_v^{t+1}) \tag{3}$$

where $m_v^{t+1}$ is the input and $h_v^t$ is the hidden state of the node. The message-passing phase captures the local structural information of the molecular graph. The GRU is commonly used as the update function in message-passing neural networks. For metabolic pathway prediction, the GRU processes the molecular graph during message passing by iterating through each node and updating its hidden state. The new hidden state encodes information from the previous hidden state and the current input. A GRU has two main components: an update gate and a reset gate. The update gate determines how much of the previous hidden state is kept and how much new information is added to the new hidden state. The reset gate controls how much of the previous hidden state is ignored when the new input is used to compute the new hidden state.

We extended the above MPNN by employing a Transformer encoder with a multi-head attention mechanism. The attention mechanism has been widely used on molecular problems and has often performed better than other neural network architectures.

3.5 Transformer

The Transformer was first proposed to model long-range dependencies in machine translation [22]. Variants of the Transformer have shown expressive power and improvements on many molecular prediction tasks [23]. GNNs and MPNNs are also widely used for molecular problems. However, when these networks are used to classify molecular graphs, each node only aggregates the features of its neighborhood within k hops because of the recursive aggregation scheme. This makes it difficult for MPNNs and GNNs to learn representations of long-range features. Therefore, in this work, we integrate the Transformer encoder into the MPNN to capture long-range features.
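The k-hop limitation described above can be made concrete with a small sketch. It only tracks which nodes can influence a given node after a number of aggregation rounds; the chain-shaped graph and the helper name `receptive_field` are assumptions made for illustration.

```python
def receptive_field(adjacency, source, rounds):
    """Nodes whose initial features can reach `source` after `rounds`
    rounds of neighbor aggregation."""
    reached = {source}
    for _ in range(rounds):
        # Each round pulls in the direct neighbors of everything seen so far.
        reached |= {w for v in reached for w in adjacency[v]}
    return reached


# Hypothetical chain of 8 atoms: 0 - 1 - 2 - ... - 7.
n_atoms = 8
chain = {
    v: [w for w in (v - 1, v + 1) if 0 <= w < n_atoms]
    for v in range(n_atoms)
}

for k in (1, 2, 4):
    print(k, sorted(receptive_field(chain, 0, k)))
```

With four rounds, atom 0 still receives nothing from atoms 5 to 7, whereas a self-attention layer lets every atom attend to every other atom in a single step.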
The Transformer architecture used in this work consists of multi-head attention, normalization layers, feed-forward layers, average pooling layers, and prediction layers. The architecture of the Transformer is shown in Fig. 1.

3.6 Multi-head Attention

We integrate a Transformer encoder containing multi-head attention, which requires vectorized inputs. Therefore, we extract vectors from the nodes of the graph as queries (Q), keys (K), and values (V), and then apply multi-head attention to extract a global representation of the nodes. Combining local and global information extraction enhances the representation of query compounds for predicting metabolic pathways. More precisely, multi-head attention projects the embedding matrix, which is the output of the MPNN, into multiple subspaces and integrates the results, calculated as follows:

$$A = \mathrm{MHA}(Q, K, V) = \mathrm{Concat}(h_1, h_2, \ldots, h_n)\, W^O \tag{4}$$

$$\text{where } h_i = \mathrm{Attention}(H W_i^Q, H W_i^K, H W_i^V) \tag{5}$$

In these equations, $h_i$ denotes the i-th self-attention head, $H$ is the embedding matrix, and $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ are the projection parameter matrices. The attention function is applied in parallel to the projected queries, keys, and values to produce $d_v$-dimensional output values. An overview of multi-head attention is shown in Fig. 2.

After the multi-head attention block, the output of the attention layer is added to the original input tensor and normalized using layer normalization. Layer normalization normalizes each feature vector along the last dimension using its mean and variance, improving model stability and convergence. It also helps mitigate the effects of covariate shift and vanishing gradients that can arise during training. The normalized output is passed through a feed-forward neural network consisting of linear transformations and a ReLU activation function, computed as follows:

$$\mathrm{FFN}(A) = \mathrm{ReLU}(A W_1 + b_1)\, W_2 + b_2 \tag{6}$$

where $W_1$ and $W_2$ are the weight matrices and $b_1$ and $b_2$ are the bias vectors. The output of the feed-forward network is then added to its input again and normalized. The output of the final normalization is passed through global average pooling to produce a single output vector.
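A minimal pure-Python sketch of Eqs. (4) and (5) is given below. It assumes standard scaled dot-product attention inside each head, uses random weights in place of trained parameters, and picks small hypothetical sizes (5 atoms, $d_{\text{model}} = 8$, 2 heads with $d_k = d_v = 4$). The helper names are illustrative and do not come from the DeepMAT code.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(h_matrix, heads, w_o):
    """Eqs. (4)-(5): `heads` is a list of (W_q, W_k, W_v) projection triples."""
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = attention(
            matmul(h_matrix, w_q), matmul(h_matrix, w_k), matmul(h_matrix, w_v)
        )
        outputs.append(out)
    # Concatenate the per-head outputs along the feature axis, then project.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(h_matrix))]
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
n_atoms, d_model, n_heads, d_k = 5, 8, 2, 4
H = random_matrix(n_atoms, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
    for _ in range(n_heads)
]
W_O = random_matrix(n_heads * d_k, d_model, rng)

A = multi_head_attention(H, heads, W_O)
print(len(A), len(A[0]))  # one d_model-sized row per atom
```

Every row of the attention weight matrix sums to one, so each atom's new representation is a weighted mixture of the value vectors of all atoms in the molecule.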
Fig. 2. Illustration of Multi-head attention for prediction of metabolic pathway
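The steps that follow the attention block (residual connection, layer normalization, the feed-forward network of Eq. (6), a second residual connection and normalization, and global average pooling) can be sketched as follows. The weights and inputs are small hypothetical values chosen for illustration, layer normalization is shown without learnable scale and shift parameters, and the function names are not taken from the DeepMAT code.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def linear(vec, weight, bias):
    """vec @ weight + bias, with `weight` stored as a list of rows."""
    return [
        sum(x * w for x, w in zip(vec, col)) + b
        for col, b in zip(zip(*weight), bias)
    ]


def ffn(vec, w1, b1, w2, b2):
    """Eq. (6): ReLU(vec W1 + b1) W2 + b2."""
    hidden = [max(0.0, x) for x in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)


def encoder_tail(inputs, attended, w1, b1, w2, b2):
    """Residual + norm, feed-forward, residual + norm, then average pooling."""
    rows = []
    for x, a in zip(inputs, attended):
        y = layer_norm([xi + ai for xi, ai in zip(x, a)])
        z = layer_norm([yi + fi for yi, fi in zip(y, ffn(y, w1, b1, w2, b2))])
        rows.append(z)
    # Global average pooling over the node axis: one vector per molecule.
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical two-node example with two features per node.
inputs = [[1.0, 2.0], [3.0, 5.0]]
attended = [[0.5, 0.5], [1.0, 0.0]]
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

pooled = encoder_tail(inputs, attended, w1, b1, w2, b2)
print(pooled)
```

Pooling over the node axis makes the output size independent of the number of atoms, which is what allows molecules of different sizes to share one prediction layer.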
3.7 Final Prediction

The prediction layer of the model uses a softmax function to calculate the probability score for each metabolic pathway and compound pair. The softmax activation function maps the output representation of the Transformer encoder to a probability between 0 and 1, which represents the likelihood that a given compound belongs to a specific metabolic pathway. A score of 1 indicates a 100% probability that the compound belongs to that metabolic pathway, and a score of 0 indicates a 0% probability.
4 Experiment

This work presents DeepMAT, a model that combines message-passing neural networks and Transformer encoders with multi-head attention blocks to predict the metabolic pathways of compounds. The compounds used in this study are represented as SMILES strings and are processed using the RDKit package. This package converts SMILES strings into molecular objects, from which molecular features can be calculated. The features are described in Table 1.

Table 1. Features description of nodes and edges

| Features | Description |
| --- | --- |
| Atom type | C, N, O, P, Br, H |
| Hybridization | Sp, Sp2, Sp3, Sp4 |
| Bonds | Number of bonds in which the atom is involved |
| Aromaticity | Whether the atom is part of an aromatic system |
| Number of valences | 0, 1, 2, 3, 4 |
| Number of hydrogens | 0, 1, 2, 3, 4 |
| Bond type | Single, double, triple, aromatic |
| Conjugated | True, False |
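One common way to turn categorical features such as those in Table 1 into model inputs is one-hot encoding. The sketch below assumes that scheme; the allowed values are copied from Table 1 as printed, while the dictionary layout, the function names, and the hand-written atom and bond records are hypothetical. In practice the record values would be read from a cheminformatics toolkit such as RDKit rather than typed by hand.

```python
# Allowed values copied from Table 1; the dictionary layout is an assumption.
ATOM_FEATURES = {
    "symbol": ["C", "N", "O", "P", "Br", "H"],
    "hybridization": ["sp", "sp2", "sp3", "sp4"],
    "n_valence": [0, 1, 2, 3, 4],
    "n_hydrogens": [0, 1, 2, 3, 4],
}
BOND_FEATURES = {
    "bond_type": ["single", "double", "triple", "aromatic"],
    "conjugated": [True, False],
}


def one_hot(value, allowed):
    """Return a one-hot list; values outside `allowed` map to all zeros."""
    return [1.0 if value == item else 0.0 for item in allowed]


def encode(record, feature_sets):
    """Concatenate the one-hot encodings of every feature in `feature_sets`."""
    vector = []
    for name, allowed in feature_sets.items():
        vector.extend(one_hot(record[name], allowed))
    return vector


# Hypothetical hand-written records for one carbon atom and one double bond.
carbon = {"symbol": "C", "hybridization": "sp2", "n_valence": 4, "n_hydrogens": 1}
double_bond = {"bond_type": "double", "conjugated": False}

atom_vector = encode(carbon, ATOM_FEATURES)
bond_vector = encode(double_bond, BOND_FEATURES)
print(len(atom_vector), len(bond_vector))
```

Each feature contributes exactly one non-zero entry, so the atom vector has four ones and the bond vector has two.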
The node feature V represents the atomic features, while the edge feature E represents the bond features. The Python package RDKit computes all of these characteristics. Atoms are denoted by their element symbols, for example H for hydrogen and O for oxygen. Valence is the number of electrons in an atom's outermost shell that participate in the formation of chemical bonds. Hydrogen is the lightest atom, with a single electron that forms a covalent bond, which is necessary to build molecular structures. Hybridization makes electron pairing possible for the formation of chemical bonds. Single, double, triple, and aromatic bonds are the possible bond types. The final feature is conjugation, that is, the coupling of p orbitals with delocalized electrons in the molecule.

4.1 Hyperparameters

In developing a neural network model, selecting appropriate hyperparameters plays a crucial role in achieving high performance. Here, the hyperparameters include the message units, the number of message-passing steps, the dense units, and the hidden size. The number of message-passing steps is set to 4, the message units to 64, and the dense units to 512. The readout phase divides the aggregated nodes into subgraphs and applies Transformer encoding and average pooling. The model employs a two-layer architecture and uses a binary cross-entropy loss function with the Adam optimizer and a learning rate of 5e−3. The readout function uses a two-layer neural network with a ReLU activation function, and the model is trained for binary classification over 200 epochs.

4.2 Model Training

DeepMAT is trained in a supervised manner to learn whether compound and metabolic pathway pairs are positive or negative samples. A positive sample means that the compound belongs to a metabolic pathway, and a negative sample means that the compound does not belong to any metabolic pathway. The dataset is balanced: 4304 samples are positive and 4304 samples are negative. We randomly split the dataset into training, validation, and test sets: 6886 samples belong to the training set, 861 to the validation set, and 861 to the test set. Furthermore, we use 10-fold cross-validation to check the effectiveness of the model. The training dataset is divided into ten equal-sized subsets; each subset is used once as the test set while the remaining subsets are used for training, resulting in ten different success rates. We run several rounds of DeepMAT during training and testing, with the performance metrics shown in Figs. 3 and 4.
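The split sizes and the 10-fold procedure described above can be reproduced on sample indices alone. The sketch below is a minimal illustration under stated assumptions: the shuffling seed, the helper names, and the interleaved fold assignment are hypothetical, and because 6886 is not divisible by ten the folds differ in size by at most one sample.

```python
import random


def split_dataset(n_samples, n_train, n_valid, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


def k_fold(indices, k=10):
    """Yield (train_part, held_out_part) pairs, one per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield rest, held_out


# Sample counts from the text: 4304 positive and 4304 negative pairs.
train, valid, test = split_dataset(4304 + 4304, n_train=6886, n_valid=861)
print(len(train), len(valid), len(test))

# A model would be fitted on each training part and scored on the held-out part.
fold_sizes = [len(held_out) for _, held_out in k_fold(train, k=10)]
print(fold_sizes)
```

Fixing the seed keeps the split reproducible across runs, which matters when several rounds of training are compared.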
Fig. 3. Training and validation Area Under the Curve (AUC) of the proposed model.
Fig. 4. Training and validation loss of the proposed model
5 Result and Discussion

We proposed DeepMAT, a SMILES-based deep learning model, to predict the presence of compounds in their actual metabolic pathways based on molecular properties. The descriptors contain chemical features of compounds such as the number of atoms, valence, aromaticity, number of bonds, bond type, conjugation, hybridization, and number of hydrogens. To evaluate the effectiveness of the proposed DeepMAT model, we employed four widely used performance metrics: accuracy, precision, recall, and F1-score. We also conducted two comparison studies with previous methods. The first compares our model with other methods in terms of datasets, feature extraction methods, and final predictions, as presented in Table 2. The second compares our model with previously published methods in terms of performance metrics, as illustrated in Fig. 5. Additionally, we conducted ablation studies to compare the performance of a generic MPNN and DeepMAT on the same dataset, and we investigated the impact of the selected hyperparameters on the performance of the proposed model.

5.1 Evaluation

The proposed model is evaluated using standard performance metrics: accuracy, precision, recall, and F1-score. These metrics assess the model's ability to classify samples correctly and to avoid false positives and false negatives. They are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$

where TP is the number of true positive samples (positive samples that the model classifies correctly), TN is the number of true negative samples (negative samples that the model classifies correctly), FP is the number of false positive samples (negative samples that the model incorrectly classifies as positive), and FN is the number of false negative samples (positive samples that the model incorrectly classifies as negative).
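Eqs. (7) to (10) translate directly into code. The confusion-matrix counts below are hypothetical values chosen for illustration and are not results from this study; the function name is likewise an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the F1-score is the harmonic mean of precision and recall, it always lies between the two.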
Our study uses a dataset of 17440 instances, with an equal proportion of positive and negative instances. The model was trained and evaluated for a total of five rounds. The best performance was achieved in the third round, with an accuracy of 95.27%, a precision of 94.05%, a recall of 93.98%, and an F1-score of 93.67%. The average performance across all rounds was an accuracy of 95.14%, a precision of 93.89%, a recall of 93.73%, and an F1-score of 93.54%. These results demonstrate the proposed model's robustness and effectiveness in predicting compounds' metabolic pathways.

5.2 Comparison with Other Methods

DeepMAT is compared with previous methods for predicting the metabolic pathways of compounds. Most previous methods predict pathway types, but the method of [15] is an exception, as it predicts actual metabolic pathways based on similarity scores. Table 2 compares all the methods based on their implementation mechanisms.

Table 2. Comparison of computational tools predicting metabolic pathways.

| Method | Number of molecules | Feature engineering required | Computational tools | Predicting |
| --- | --- | --- | --- | --- |
| Cai et al. | 2764 | Yes | Nearest neighbor algorithm | Pathway class |
| Hu et al. | 3137, 5559 | Yes | Multi-target model | Pathway class |
| Gao et al. | 3348 | No | Hybrid model | Pathway class |
| Peng et al. | - | Yes | NNA, SVM, AdaBoost | Pathway class |
| Baranwal et al. | 6669 | No | GNN + RF | Pathway class |
| Jia et al. | 5641 | Yes | RF | Actual pathway |
| Yang et al. | 6669 | No | GAT | Pathway class |
| Ours | 2856 | No | DeepMAT (MPNN + Transformer encoder) | Actual pathway |
Table 2 compares the methods used to predict the metabolic pathways of compounds. It lists the number of molecules used in each study, whether feature engineering is required, the computational tools used, and what the model predicts (pathway class or actual pathway). The proposed method, DeepMAT, combines an MPNN with a Transformer encoder, does not require feature engineering, and predicts a compound's actual metabolic pathway. We also compare DeepMAT with previous methods based on performance metrics, as illustrated in Fig. 5.
Fig. 5. Comparison of DeepMAT and previous methods for predicting metabolic pathways
Figure 5 compares the performance of the proposed DeepMAT method with that of previous methods using four performance metrics: accuracy, precision, recall, and F1-score. The results show that DeepMAT outperforms the other methods on three metrics. The next best-performing method is Jia et al., which has the highest precision.
5.3 Comparison of Prediction of Actual Pathway and Pathway Classes

This paper discusses various techniques employed for predicting metabolic pathways. Specifically, it focuses on two distinct approaches: those predicting metabolic pathway classes and those predicting actual metabolic pathways. We compare these two approaches and evaluate their relative strengths and limitations. The comparison is presented in Table 3, which indicates which approach is more suitable for a given application.

Table 3. Strengths of predicting actual pathway and pathway classes

| Strength of method | Actual metabolic pathway prediction | Pathway classes prediction |
| --- | --- | --- |
| Increased accuracy | Yes | No |
| Better understanding of compound function | Yes | No |
| Better prioritization of compounds | Yes | No |
| Better utilization of data | Yes | Yes |
| More actionable information | Yes | No |
| A better understanding of the metabolic pathway | Yes | No |
| Better identification of unknown compound | Yes | No |
| Simplicity | No | Yes |
| Broader perspective | No | Yes |
| Easier data analysis | No | Yes |
Actual metabolic pathway prediction is more accurate than pathway class prediction because it involves identifying the specific enzymes and reactions involved in the metabolism of a compound rather than simply classifying it into a general pathway group. This allows for a better understanding of the compound's function and better prioritization of compounds for further study. Actual metabolic pathway prediction also makes better use of data, as it can provide more actionable information for drug discovery and a better understanding of metabolic pathways in general. It can also aid in the identification of unknown compounds. On the other hand, pathway class prediction offers simplicity and a broader perspective, as it groups compounds into general categories rather than identifying specific enzymes and reactions. It also allows for easier data analysis.
5.4 Comparison of DeepMAT with the Previous Method

DeepMAT is compared with previous work that uses message-passing neural networks for molecular property prediction. Yang et al. [24] developed a message-passing neural network and applied it to the SMILES structures of molecules to learn molecular representations for property prediction. Kim et al. [25] developed a web server based on a Transformer encoder to learn molecular representations for molecular property prediction. In contrast, our proposed model, DeepMAT, integrates a message-passing neural network and a Transformer encoder to predict metabolic pathways. The comparison of a single message-passing neural network [24], a Transformer encoder [25], and our proposed model, based on AUC, is shown in Fig. 6.
Fig. 6. Comparison of DeepMAT with previous methods.
DeepMAT outperforms the other two methods, each of which uses a single model for molecular property prediction.

5.5 Ablation Study

An ablation study is a useful tool for understanding the contribution of individual components of a model, such as the MPNN and the Transformer encoder, to its overall performance. It is carried out by training and evaluating several variations of the model, each with a different component removed or altered, and comparing their performance with that of the full model. For the MPNN combined with a Transformer encoder, we used an ablation study to investigate the contribution of multi-head attention, the number of message-passing steps, and the number of attention heads to the performance of the model.

Impact of Multi-head Attention. The proposed model, DeepMAT, combines a message-passing neural network (MPNN) and a Transformer encoder with multi-head attention in order to extract rich compound information. The MPNN component captures the local structural information of the compounds by passing messages between atoms and bonds in the graph. With multi-head attention, the Transformer encoder component captures the global structural information of the compounds by learning the long-range dependencies in the SMILES string representation of the compounds. We conducted an ablation study comparing the performance of DeepMAT with that of a generic MPNN model, which uses only the MPNN component, on the same dataset with the same parameter settings. The results, shown in Fig. 7, indicate that DeepMAT outperforms the generic MPNN model. This highlights the benefit of incorporating the Transformer encoder component to capture both local and global structural information of compounds.
Fig. 7. The performance comparison of a generic Message passing neural network (MPNN) model with the proposed DeepMAT model for predicting metabolic pathways of compounds.
Impact of Message-Passing Steps and Attention Heads. We also conducted experiments to investigate the impact of two important hyperparameters on the performance of the proposed DeepMAT model: the number of message-passing steps and the number of attention heads. We tested 2, 4, 6, and 8 message-passing steps and 4, 6, 8, and 12 attention heads. The model was trained and evaluated for each combination of these hyperparameters, and the results were recorded. The performance of the DeepMAT model was then analyzed and visualized using performance metrics, as shown in Figs. 8A and B. The results indicate that the performance of DeepMAT for predicting the metabolic pathways of compounds is highest with 4 message-passing steps and 8 attention heads. More specifically, performance remains relatively stable around 4 message-passing steps, with only a slight decrease as the number of message-passing steps increases further. For the attention heads, the model's performance improves as the number of heads increases, with the highest performance recorded at 8 attention heads.

Fig. 8. (A) illustrates the model's performance with different message-passing steps, while (B) illustrates the model's performance with varying heads of attention.
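The sweep over message-passing steps and attention heads amounts to a small grid search. The sketch below only shows the bookkeeping: the value ranges are those stated in the text, but `placeholder_score` is a hypothetical stand-in that was constructed to peak at the configuration reported as best, so the sketch runs without a trained model. It does not reproduce any measurement from this study.

```python
from itertools import product

# Value ranges stated in the text.
MESSAGE_STEPS = (2, 4, 6, 8)
ATTENTION_HEADS = (4, 6, 8, 12)


def grid_search(evaluate):
    """Score every (steps, heads) pair and return the best pair and all scores."""
    scores = {
        (steps, heads): evaluate(steps, heads)
        for steps, heads in product(MESSAGE_STEPS, ATTENTION_HEADS)
    }
    best = max(scores, key=scores.get)
    return best, scores


# Placeholder scorer (NOT a measurement). A real run would train the model
# with the given configuration and return a validation metric instead.
def placeholder_score(steps, heads):
    return -abs(steps - 4) - abs(heads - 8)


best, scores = grid_search(placeholder_score)
print(len(scores), best)
```

Keeping the evaluation function as a parameter lets the same loop be reused with any metric, for example validation AUC or F1-score.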
6 Conclusion

In this paper, we proposed DeepMAT, an expressive architecture for predicting compound metabolic pathways. The architecture combines a message-passing neural network (MPNN) with a Transformer encoder and is intended to extract rich information from chemical graphs and predict the metabolic pathway of query compounds rather than simply the pathway class. In addition to predicting actual metabolic pathways, DeepMAT demonstrated acceptable performance on standard performance metrics. This suggests that DeepMAT has potential applications in identifying new compounds involved in metabolic pathways. Based on chemical and protein interactions, the proposed design may be extended to predict metabolic pathways for chemical processes. This would give a more complete understanding of metabolic pathways and their underlying processes, which may assist in the identification of novel medications.

Acknowledgments. We are very thankful to Wuhan University for its generous support in conducting this research.

Authors' Contributions. JL proposed the idea; HAS collected the data, planned the formulation of the dataset, implemented the experimental study, and wrote the manuscript; ZY and JF discussed the outline of the manuscript.

Funding. This work was funded by the National Key R&D Program of China (2019YFA0904303), the Major Projects of Technological Innovation in Hubei Province (2019AEA170), and the Frontier Projects of Wuhan for Application Foundation (2019010701011381).

Availability of Data and Materials. The datasets used and analyzed during the current study are available on GitHub at https://github.com/Hayatalishah4272/DeepMAT_pathway.
Ethics Approval and Consent to Participate. Not applicable.

Consent for Publication. Not applicable.

Conflicts of Interest. None declared.

Competing Interests. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References

1. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., Kanehisa, M.: KEGG: Kyoto encyclopedia of genes and genomes. Nucl. Acids Res. 27(1), 29–34 (1999). https://doi.org/10.1093/nar/27.1.29
2. Okuda, S., et al.: KEGG Atlas mapping for global analysis of metabolic pathways. Nucl. Acids Res. 36(Web Server issue), 423–426 (2008). https://doi.org/10.1093/nar/gkn282
3. Kotera, M., Tabei, Y., Yamanishi, Y., Tokimatsu, T., Goto, S.: Supervised de novo reconstruction of metabolic pathways from metabolome-scale compound sets. Bioinformatics 29(13), 135–144 (2013). https://doi.org/10.1093/bioinformatics/btt244
4. Nakamura, M., Hachiya, T., Saito, Y., Sato, K., Sakakibara, Y.: An efficient algorithm for de novo predictions of biochemical pathways between chemical compounds. BMC Bioinform. 13(Suppl 17), S8 (2012). https://doi.org/10.1186/1471-2105-13-s17-s8
5. Inokuma, Y., Nishiguchi, S., Ikemoto, K., Fujita, M.: Shedding light on hidden reaction pathways in radical polymerization by a porous coordination network. Chem. Commun. 47(44), 12113–12115 (2011). https://doi.org/10.1039/c1cc15053g
6. Shah, H.A., Liu, J., Yang, Z., Feng, J.: Review of machine learning methods for the prediction and reconstruction of metabolic pathways. Front. Mol. Biosci. 8(June), 1–11 (2021). https://doi.org/10.3389/fmolb.2021.634141
7. Xavier, F.G., Balu, A., Seetharaman, S., Lakshmikandhan, A., Lawrence, A.A.E.: Alternatives to in vivo experiments – a pandect. Res. J. Pharm. Technol. 12(9), 4575–4577 (2019). https://doi.org/10.5958/0974-360X.2019.00786.8
8. Sorguven, E., Bozkurt, S., Baldock, C.: Computer simulations can replace in-vivo experiments for implantable medical devices. Phys. Eng. Sci. Med. 44(1), 1–5 (2021). https://doi.org/10.1007/s13246-021-00978-4
9. Cai, Y.D., et al.: Prediction of compounds' biological function (metabolic pathways) based on functional group composition. Mol. Divers. 12(2), 131–137 (2008). https://doi.org/10.1007/s11030-008-9085-9
10. Hu, L.L., Chen, C., Huang, T., Cai, Y.D., Chou, K.C.: Predicting biological functions of compounds based on chemical-chemical interactions. PLoS ONE 6(12), e29491 (2011). https://doi.org/10.1371/journal.pone.0029491
11. Gao, Y.F., Chen, L., Cai, Y.D., Feng, K.Y., Huang, T., Jiang, Y.: Predicting metabolic pathways of small molecules and enzymes based on interaction information of chemicals and proteins. PLoS ONE 7(9), 1–9 (2012). https://doi.org/10.1371/journal.pone.0045944
12. Peng, C.-R., Lu, W.-C., Niu, B., Li, M.-J., Yang, X.-Y., Wu, M.-L.: Predicting the metabolic pathways of small molecules based on their physicochemical properties. Protein Pept. Lett. 19(12), 1250–1256 (2012). https://doi.org/10.2174/092986612803521585
13. Baranwal, M., Magner, A., Elvati, P., Saldinger, J., Violi, A., Hero, A.O.: A deep learning architecture for metabolic pathway prediction. Bioinformatics 36(2010), 1–7 (2019). https://doi.org/10.1093/bioinformatics/btz954
14. Jia, Y., Zhao, R., Chen, L.: Similarity-based machine learning model for predicting the metabolic pathways of compounds. IEEE Access 8, 130687–130696 (2020). https://doi.org/10.1109/access.2020.3009439
15. Yang, Z., Liu, J., Shah, H.A., Feng, J.: A novel hybrid framework for metabolic pathways prediction based on the graph attention network. BMC Bioinform. 23, 1–14 (2022). https://doi.org/10.1186/s12859-022-04856-y
16. Baranwal, M., et al.: A deep learning architecture for metabolic pathway prediction. Bioinformatics 36(8), 2547–2553 (2020). https://doi.org/10.1093/bioinformatics/btz954
17. David, L., Thakkar, A., Mercado, R., Engkvist, O.: Molecular representations in AI-driven drug discovery: a review and practical guide. J. Cheminform. 12(1), 1–22 (2020). https://doi.org/10.1186/s13321-020-00460-5
18. Hirohara, M., Saito, Y., Koda, Y., Sato, K., Sakakibara, Y.: Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinform. 19(Suppl 19), 83–94 (2018). https://doi.org/10.1186/s12859-018-2523-5
19. Arús-Pous, J., et al.: Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11(1), 1–13 (2019). https://doi.org/10.1186/s13321-019-0393-0
20. Zhang, Y.F., et al.: SPVec: a Word2vec-inspired feature representation method for drug-target interaction prediction. Front. Chem. 7(January), 1–11 (2020). https://doi.org/10.3389/fchem.2019.00895
21. Lim, S., et al.: A review on compound-protein interaction prediction methods: Data, format, representation and model. Comput. Struct. Biotechnol. J. 19, 1541–1556 (2021). https://doi.org/10.1016/j.csbj.2021.03.004
22. Furfari(tony), F.A.: The transformer. IEEE Ind. Appl. Mag. 8(1), 8–15 (2002). https://doi.org/10.1109/2943.974352
23. Deng, D., Lei, Z., Hong, X., Zhang, R., Zhou, F.: Describe molecules by a heterogeneous graph neural network with transformer-like attention for supervised property predictions. ACS Omega 7(4), 3713–3721 (2022). https://doi.org/10.1021/acsomega.1c06389
24. Yang, K., et al.: Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59(8), 3370–3388 (2019). https://doi.org/10.1021/acs.jcim.9b00237
25. Kim, H., Lee, J., Ahn, S., Lee, J.R.: A merged molecular representation learning for molecular properties prediction with a web-based service. Sci. Rep. 11(1), 1–9 (2021). https://doi.org/10.1038/s41598-021-90259-7
SpliceSCANNER: An Accurate and Interpretable Deep Learning-Based Method for Splice Site Prediction

Rongxing Wang1, Junwei Xu1, Xiaodi Huang1, Wangjing Qi1, and Yanju Zhang1,2,3(B)

1 Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin 541004, Guangxi, China
[email protected]
2 College of Computer Science and Technology, Huaqiao University, Xiamen 361021, China
3 Xiamen Key Laboratory of CVPR, Huaqiao University, Xiamen 361021, China

Abstract. The identification of splice sites is significant to the delineation of gene structure and the understanding of the complicated alternative mechanisms underlying gene transcriptional regulation. Currently, most existing approaches predict splice sites using deep learning-based strategies. However, they may fail to assign high weights to important segments of sequences to capture distinctive features. Moreover, they often apply neural networks only as a 'black box', drawing criticism for the lack of reasoning behind their decisions. To address these issues, we present a novel method, SpliceSCANNER, to predict canonical splice sites by integrating an attention mechanism with a convolutional neural network (CNN). Furthermore, we adopted gradient-weighted class activation mapping (Grad-CAM) to interpret the results derived from the models. We trained ten models for donor and acceptor sites on five species. Experiments demonstrate that SpliceSCANNER outperforms state-of-the-art methods on most of the datasets. Taking human data as an example, it achieves accuracies of 96.36% and 95.77% for donor and acceptor sites, respectively. Finally, the cross-organism validation results illustrate that it has outstanding generalizability, indicating its powerful ability to annotate canonical splice sites for poorly studied species. We anticipate that it can mine potential splicing patterns and bring new advancements to the bioinformatics community. SpliceSCANNER is freely available as a web server at http://www.bioinfo-zhanglab.com/SpliceSCANNER/.

Keywords: Splice site prediction · CNN · Attention mechanism · Interpretation
The original version of this chapter was revised: the link in the abstract section is not valid anymore. This has been corrected. The correction to this chapter is available at https://doi.org/10.1007/978-981-99-4749-2_69
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023, corrected publication 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 447–459, 2023. https://doi.org/10.1007/978-981-99-4749-2_38

1 Introduction

Splicing is a vital process during gene expression, in which introns are removed and the flanking exons are concatenated to form mature RNAs, which are then used to synthesize proteins and other functional units. Hence, precisely locating the exon-intron boundaries, called splice sites, is essential for understanding the transcriptional mechanisms within a cell and for delineating the components that constitute a gene. In addition, as a majority of genes contain multiple exons, many of them undergo alternative splicing by selecting different combinations of splice sites, which leads to the versatility of the transcriptome. Defects in alternative splicing and mutations at splice sites can cause diseases and promote disease progression [1]. Furthermore, the identification and characterization of splice patterns will provide insights into the role of splicing in downstream regulation.

Currently, up to 99% of known splice sites have dinucleotide patterns with GT and AG at the exon-intron and intron-exon boundaries, respectively, which are known as canonical donor and acceptor sites [2]. However, the presence of these patterns is not sufficient for splice site identification, since an enormous number of the GT and AG dinucleotides in the reference genome are not related to splicing. Moreover, although several splicing signals are observed around the boundaries, for example a pyrimidine-rich region near the acceptor site and a short consensus sequence near the donor site, how these consensus sequences contribute to splicing is still poorly understood [3]. With the advance of sequencing technology, more reference genomes are becoming available, and it is necessary to precisely and efficiently discriminate real splice sites from pseudo ones to promote the understanding of splicing.

Splice sites can be predicted from either RNA or DNA. Approaches that handle RNA are mainly based on alignment: RNA sequences are aligned to the reference genome, and a potential splice site is reported if an RNA sequence spans an exon-exon junction. Such tools include TopHat [4] and HISAT [5] for second-generation RNA-seq data, and minimap2 [6] and deSALT [7] for third-generation data. Many computational approaches have been proposed to predict splicing from DNA sequences.
In the early stage, machine learning methods were predominant, but their feature extraction and selection processes are knowledge-based, burdensome and less straightforward. Consequently, recent methods adopt deep learning techniques to solve this problem. The main architecture used is the CNN because of its strong pattern recognition ability. CNNs have achieved great success in various genomics studies [8, 9] and have gradually been applied to splice site detection. For instance, SpliceRover [10] builds its models on a CNN that consists of alternating convolutional, dropout and max-pooling layers as well as fully connected layers, and concludes with a Softmax classifier. SpliceRover improves prediction by reducing the false discovery rate. SpliceFinder [11] uses one convolutional layer with 50 kernels together with two fully connected layers, and achieves better prediction performance by decreasing the number of false positives. DeepSplicer [12] employs three convolutional layers with 50 kernels each and a dense layer; it also reduces false positives to obtain competitive accuracy. The latest tool, EnsembleSplice [13], is an ensemble learning architecture of four different CNN sub-models whose predictions are fed into a logistic regression predictor. EnsembleSplice has been reported to be more accurate than several of the aforementioned methods on Homo sapiens and Arabidopsis thaliana datasets.

Although these deep learning methods perform well in splice site detection, some problems remain. First, they may ignore the importance of key sequence regions, such as the pyrimidine-rich regions used for detecting acceptor sites, and probably fail to learn the features of these regions with high weights, which leaves room for improving performance. Second, most methods use the CNN only as a 'black box' and are criticized for offering limited interpretation of their decision-making mechanisms.

In this paper, we propose a novel CNN-based method that fully leverages the attention mechanism for splice site prediction, termed SpliceSCANNER. We build our models on a CNN and employ an attention mechanism for better performance, since attention has been widely used in diverse application fields. We first use a convolutional layer to capture coarse features of the input sequences. Then, CBAM [14] is introduced after the max-pooling layer to refine the features from the channel and spatial perspectives. Subsequently, the refined features are fed to two alternating convolutional and max-pooling layers for further feature extraction and selection. Experimental results show that SpliceSCANNER predicts splice sites more accurately than state-of-the-art methods on most of the datasets. For instance, on the human dataset it outperforms the recently released EnsembleSplice by 0.9 and 1.02 percentage points in accuracy for donor and acceptor sites, respectively. Moreover, to lower the barrier to adopting our models for splice site prediction, and unlike previous methods, we provide an interpretable Grad-CAM [15] based method that gives detailed insight into the models' decision-making and explores the underlying rules of DNA samples without any prior knowledge. We choose Grad-CAM because it produces higher fidelity scores and is more class-discriminative and efficient than DeepLIFT when identifying which parts of the input data drive the decisions made by CNN-based models [16]. The visualizations depict the weight at each position of the samples and reveal similar patterns across organisms, showing how SpliceSCANNER makes its prediction decisions. Furthermore, to probe the generalizability of our models, cross-organism validation experiments were conducted. The results show that our models trained on Homo sapiens generalize well.
The main contributions of this work are summarized as follows:
• We propose a CNN-based method integrated with an attention mechanism for accurately predicting canonical splice sites. For convenient use, a free and user-friendly online web server is established for splice site detection.
• Our method achieves the best accuracy for predicting both donor and acceptor sites on most of the organisms compared with four cutting-edge approaches.
• We first utilize Grad-CAM to interpret the results from our method without changing the network architecture or retraining the models.
• Our method has prominent generalizability and is better able to annotate newly sequenced or poorly studied genomes of different organisms.
2 Materials and Methods

2.1 Data Preprocessing

In this study, donor and acceptor datasets of five organisms, namely Homo sapiens, Arabidopsis thaliana, Oryza sativa japonica, Drosophila melanogaster and Caenorhabditis elegans, were downloaded from previous literature [17]. Each dataset contains equal numbers of positive and negative samples. A positive sample is involved in the splicing machinery and contains 602 nucleotides: 300 upstream of the splice site, the dinucleotide of the splice site itself, and 300 downstream. A negative sample has the same sequence structure as a positive one but is not involved in splicing. As shown in Fig. 1 (top left), each dataset contains a certain number of DNA sequences and no additional information about the chromosome or position to which the sequences belong. It is divided into three subsets, a training dataset (70%), a validation dataset (20%) and an independent test dataset (10%), by applying the split function provided by the sklearn library twice, with test sizes of 0.1 and 0.22. Then, the sequences are converted to matrices using one-hot encoding, so that each sequence becomes a 602 × 4 matrix.
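The split and the encoding described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the helper names (`one_hot`, `split_70_20_10`), the A/C/G/T column order and the random seed are assumptions, and the shuffle-and-slice split only mimics two successive hold-out splits of the kind the sklearn split function performs. Holding out 10% first and then 22% of the remaining 90% gives 0.9 × 0.22 ≈ 19.8% for validation and 0.9 × 0.78 ≈ 70.2% for training, which matches the stated 7:2:1 ratio.

```python
import random

NUCLEOTIDES = "ACGT"  # column order is an assumption; the text does not state it


def one_hot(sequence):
    """Encode a DNA string as a len(sequence) x 4 matrix of 0/1 rows."""
    rows = []
    for base in sequence.upper():
        # Any symbol outside A/C/G/T (for example N) becomes an all-zero row.
        rows.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return rows


def split_70_20_10(samples, seed=0):
    """Two successive hold-out splits: 10% test, then 22% of the rest for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(0.10 * len(shuffled))
    test_part, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = round(0.22 * len(rest))
    val_part, train_part = rest[:n_val], rest[n_val:]
    return train_part, val_part, test_part


train_set, val_set, test_set = split_70_20_10(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 702 198 100

print(one_hot("GTA"))  # [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
```

A full 602-nucleotide sample passed to `one_hot` yields the 602 × 4 matrix used as model input.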
[Figure 1 (graphic not reproduced). Panels: data preprocessing (the datasets of the five organisms are split 7:2:1 into training, validation and testing datasets and one-hot encoded); model construction (convolution, max-pooling, CBAM, convolution, max-pooling, flatten, Softmax output neurons); model interpretation (Grad-CAM: the feature maps of the last convolutional layer are weighted, averaged and normalized into a weight distribution curve; axes: Position, 100–600, and Weight, 0.0–1.0).]
Fig. 1. The workflow of SpliceSCANNER for predicting splice sites. The schematic displays the three stages in the construction of SpliceSCANNER: data preprocessing, model construction and model interpretation. ⊗ and ⊕ denote element-wise multiplication and addition, respectively.
2.2 Model Construction

Overview

SpliceSCANNER is based on a CNN. As shown in Fig. 1 (top right), it receives encoded sequences as inputs and uses deep learning to extract and select features from the flanking regions of splice sites. Given an encoded sequence s, the model computes a predicted score f(s), which represents the binary classification result for this sequence, namely whether or not the input sequence contains a true splice site. The first layer is the input layer, which standardizes the encoded sequences. Several convolutional layers follow to extract features. Each convolutional layer contains filters that capture discriminative features for splice site detection. A convolutional layer is followed by an activation function, which guarantees the nonlinearity of the model. Generally, the ReLU function is chosen as the activation function to filter out needless information and retain significant signals. To alleviate overfitting and abstract the previously learned features, a pooling layer is often used to select among the features obtained from the previous layer. Generally, maximum or average pooling contributes to a smoother representation of the feature maps. A max-pooling layer takes the maximum value of the signal within a series of non-overlapping windows.
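As a small illustration of the two operations just described, the following sketch applies ReLU and non-overlapping max-pooling to a short list of activations. The function names and the toy values are illustrative assumptions; a real model applies the same operations to multi-channel feature maps.

```python
def relu(values):
    """Element-wise ReLU: negative activations are set to zero."""
    return [max(0.0, v) for v in values]


def max_pool_1d(values, window=2):
    """Maximum over consecutive non-overlapping windows; a trailing partial window is dropped."""
    n_windows = len(values) // window
    return [max(values[i * window:(i + 1) * window]) for i in range(n_windows)]


activations = relu([-1.5, 0.3, 2.0, -0.2, 0.7, 0.1])
pooled = max_pool_1d(activations, window=2)
print(activations)  # [0.0, 0.3, 2.0, 0.0, 0.7, 0.1]
print(pooled)       # [0.3, 2.0, 0.7]
```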
After several convolutional and pooling layers, there may be one or more fully connected layers, which fuse the class-discriminative local information. Finally, the outputs of the last fully connected layer are fed to a Softmax layer and activated by the Softmax function.

Convolutional Block Attention Module

To enhance prediction performance by capturing more informative features of the sample sequences, CBAM [14], which consists of a channel attention module (CAM) and a spatial attention module (SAM), is introduced into the CNN model. As illustrated in Fig. 2a, an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ given as input is refined sequentially by CAM and SAM, which generate a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$, respectively. The overall process of CBAM can be summarized as:

$$F' = M_c(F) \otimes F \tag{1}$$

$$F'' = M_s(F') \otimes F' \tag{2}$$

where $M_c(F)$ is the channel attention map of $F$, $M_s(F')$ is the spatial attention map of the channel-refined feature map $F'$, $F''$ is the final refined output, and $\otimes$ is element-wise multiplication, during which channel attention values are broadcast along the spatial dimensions, and vice versa. Figure 2 shows the computation process of each attention module.
Fig. 2. (a) Diagram of the convolutional block attention module. CBAM consists of two sub-modules: (b) the channel attention module and (c) the spatial attention module. The intermediate input feature map is refined sequentially by CAM and SAM, which compute channel attention and spatial attention, respectively. ⊗ and ⊕ denote element-wise multiplication and addition.
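To make the data flow of Fig. 2 concrete, here is a minimal NumPy sketch of CBAM that follows Eqs. (1)–(4). It is an illustration under stated assumptions rather than the authors' implementation: the weights are random instead of learned, the tensor sizes are arbitrary, the ReLU inside the shared MLP is taken from the CBAM paper [14] (it is not written out in Eq. (3)), and zero padding is assumed so that the 7 × 7 convolution preserves the spatial size.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(F, W0, W1):
    """Eq. (3): shared MLP applied to max- and average-pooled channel descriptors.

    F: (C, H, W); W0: (C // r, C); W1: (C, C // r). Returns shape (C, 1, 1).
    """
    max_desc = F.max(axis=(1, 2))   # (C,)
    avg_desc = F.mean(axis=(1, 2))  # (C,)

    def mlp(v):
        # One hidden layer; the ReLU after W0 is an assumption taken from [14].
        return W1 @ np.maximum(W0 @ v, 0.0)

    return sigmoid(mlp(max_desc) + mlp(avg_desc))[:, None, None]


def spatial_attention(F, kernel):
    """Eq. (4): k x k convolution over the stacked channel-wise max and mean maps.

    F: (C, H, W); kernel: (2, k, k) with odd k. Returns shape (1, H, W).
    """
    stacked = np.stack([F.max(axis=0), F.mean(axis=0)])         # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, H, W = stacked.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None, :, :]


def cbam(F, W0, W1, kernel):
    F_channel = channel_attention(F, W0, W1) * F                # Eq. (1)
    return spatial_attention(F_channel, kernel) * F_channel     # Eq. (2)


rng = np.random.default_rng(0)
C, H, W, r = 8, 12, 1, 4  # illustrative sizes only
F = rng.normal(size=(C, H, W))
refined = cbam(F,
               W0=rng.normal(size=(C // r, C)),
               W1=rng.normal(size=(C, C // r)),
               kernel=rng.normal(size=(2, 7, 7)))
print(refined.shape)  # (8, 12, 1)
```

Because both attention maps pass through a sigmoid, every entry of the output is the corresponding input entry scaled by a factor between 0 and 1, so CBAM reweights features without changing the shape of the feature map.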
In the CAM (Fig. 2b), to aggregate the spatial information of a feature map $F$, max-pooling and average-pooling operations are first performed to extract the max-pooled features $F_c^{\max}$ and the average-pooled features $F_c^{\mathrm{avg}}$, respectively. Then, to create the channel attention map $M_c$, these two types of features are forwarded to a shared network whose backbone is a multi-layer perceptron (MLP) with one hidden layer. After that, the two output feature vectors are merged by element-wise addition. To summarize, the channel attention of a feature map is calculated as:

$$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{MaxPool}(F)) + \mathrm{MLP}(\mathrm{AvgPool}(F))) = \sigma(W_1(W_0(F_c^{\max})) + W_1(W_0(F_c^{\mathrm{avg}}))) \tag{3}$$

where $\sigma(\cdot)$ is the sigmoid function, and $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ ($r$ is the reduction ratio) are the MLP weights, which are shared for $F_c^{\max}$ and $F_c^{\mathrm{avg}}$.

In the SAM (Fig. 2c), to aggregate the channel information of the channel-refined feature map $F'$, two pooling operations are executed to generate two 2D feature maps, $F_s^{\max} \in \mathbb{R}^{1 \times H \times W}$ and $F_s^{\mathrm{avg}} \in \mathbb{R}^{1 \times H \times W}$. They are then concatenated and convolved by a standard convolutional layer, producing a 2D spatial attention map. In brief, the computation of spatial attention can be formulated as:

$$M_s(F') = \sigma(f^{7 \times 7}([\mathrm{MaxPool}(F'); \mathrm{AvgPool}(F')])) = \sigma(f^{7 \times 7}([F_s^{\max}; F_s^{\mathrm{avg}}])) \tag{4}$$
where $f^{7 \times 7}$ denotes a convolution operation with a kernel size of 7 × 7.

Model Implementation

A CNN is a type of feed-forward neural network with a deep structure and convolutional computation. In this work, we integrate CBAM into a CNN to construct the splice site prediction model. The implementation steps are as follows:

(1) In the first convolutional layer, 32 convolution kernels of size 7 × 4 traverse the entire input matrix, and the results are passed through the ReLU activation function, yielding the primary feature maps of the samples.
(2) The primary feature maps are then downsampled by max-pooling with a window size of 2 × 1 and a stride of 2.
(3) CBAM is introduced after the pooling layer. It attends to the feature maps from both the channel and spatial perspectives and learns from these two dimensions with high weights, producing refined feature maps.
(4) After CBAM, two convolutional layers alternate with two pooling layers to extract deeper features. To reduce the risk of overfitting, each pooling layer is followed by dropout with a rate of 0.2.
(5) Next, the feature maps are flattened into a 1D vector, which is fed consecutively to two fully connected layers with 120 and 32 neurons.
(6) In the output layer, the Softmax function is applied to two neurons, corresponding to the two classes of true and false splice sites. The neuron with the larger probability determines the prediction.

Therefore, for each sequence s, the predicted score can be calculated as:

$$f(s) = \mathrm{den}_3(\mathrm{den}_2(\mathrm{den}_1(\mathrm{mpool}_3(\mathrm{conv}_3(\mathrm{mpool}_2(\mathrm{conv}_2(\mathrm{atten}(\mathrm{mpool}_1(\mathrm{conv}_1(s)))))))))) \tag{5}$$

where den, mpool, conv and atten denote fully connected layers, max-pooling layers, convolutional layers and the convolutional block attention module, respectively.
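The tensor shapes implied by steps (1) to (3) can be traced with a few lines of arithmetic. The sketch below is hedged: stride 1 and no padding in the first convolution are assumptions (the text does not state them), the helper `out_len` is hypothetical, and the trace stops after CBAM because the kernel sizes of the later layers are not given.

```python
def out_len(length, kernel, stride=1, padding=0):
    """Output length of a sliding window: floor((length + 2*padding - kernel) / stride) + 1."""
    return (length + 2 * padding - kernel) // stride + 1


SEQ_LEN, N_BASES, N_KERNELS = 602, 4, 32

# Step (1): 7 x 4 kernels; stride 1 and no padding are assumptions.
conv1_shape = (out_len(SEQ_LEN, 7), out_len(N_BASES, 4), N_KERNELS)

# Step (2): max-pooling with a 2 x 1 window and stride 2 along the sequence axis.
pool1_shape = (out_len(conv1_shape[0], 2, stride=2), conv1_shape[1], N_KERNELS)

# Step (3): CBAM only rescales values, so the shape is unchanged.
cbam_shape = pool1_shape

print(conv1_shape)  # (596, 1, 32)
print(pool1_shape)  # (298, 1, 32)
print(cbam_shape)   # (298, 1, 32)
```

Under these assumptions the 7 × 4 kernel spans all four one-hot columns, so the first layer effectively acts as a 1D convolution along the sequence.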
Performance Metrics and Evaluation

In this study, seven statistical measures, namely accuracy (Acc), specificity (Sp), sensitivity (Sn), precision (Pre), F1-score (F1), Matthews correlation coefficient (MCC) and the AUC value, are used to assess the performance of our method. For performance comparison, four cutting-edge CNN-based methods are evaluated on the independent test sets and in cross-organism validation experiments.

2.3 Model Interpretation

Previous studies of splice site prediction mainly focused on improving model performance and rarely explained the results derived from CNNs. Literature devoted to evaluating model interpretation methods reveals that Grad-CAM performs better overall than DeepLIFT in terms of fidelity, contrastivity and running time; higher fidelity helps to build trust in the prediction model, and higher contrastivity indicates that the model is more class-discriminative [16]. Additionally, as noted by Teng et al. [18], Grad-CAM can localize important regions of a feature map, which is consistent with the purpose of the attention mechanism: to strengthen the regions of interest. Based on these conclusions, to tackle the challenge of model interpretability, we employ Grad-CAM to visualize the importance of the features learned by the model according to the magnitude of the gradient signal.

As depicted in Fig. 1 (bottom), we first compute the weights of the feature maps. Specifically, for a class c, we calculate the gradient of its prediction score $S^c$ with respect to each feature map $A^k$ of the last convolutional layer in the CNN. The gradient is then averaged over each channel by the size of the feature map, which resembles the global average pooling operation, to obtain the weight of each feature map. Mathematically, given a feature map $A^k$ of size $Z$ ($Z = u \times v$), the weight $\alpha_k^c$, which represents the importance of $A^k$ for a target class c, is calculated as:

$$\alpha_k^c = \frac{1}{Z} \sum_{i=1}^{u} \sum_{j=1}^{v} \frac{\partial S^c}{\partial A_{ij}^k} \tag{6}$$

where $\partial S^c / \partial A_{ij}^k$ denotes the gradient of the prediction score $S^c$ for class c with respect to feature map $A^k$ at row i and column j. Afterwards, a weighted combination of $\alpha_k^c$ and $A^k$ is computed and activated by the ReLU function, yielding the class-discriminative result $L^c_{\text{Grad-CAM}}$ for class c:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A^k\Big) \tag{7}$$
Based on Grad-CAM, we propose an approach to interpret the results without modifying the model's original structure and parameters. It includes four steps:

(1) Load the trained model and classify the DNA sequences. Then, extract the gradient information of the feature maps in the last convolutional layer and calculate their weights with Grad-CAM.
(2) Resize the feature maps of the last convolutional layer to the initial input size using bilinear interpolation. Then, multiply the feature maps by their weights and merge the results to generate the weighted class activation maps.
(3) After producing activation maps of size 602 × 4, average the values across the columns, so that each map has the same length as the original DNA sequence.
(4) Lastly, sum all activation maps and normalize the result to obtain the weight of each position in the sequence, and draw the weight distribution curve.
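The computation in Eqs. (6) and (7) and the aggregation in steps (2) to (4) can be sketched in NumPy as below. Several points are assumptions made for illustration: the gradients are passed in as an array (in practice they come from the deep learning framework's automatic differentiation), the maps are combined and column-averaged before being resized with 1D linear interpolation (a simplified stand-in for the bilinear resizing of step (2)), and min-max scaling is used for the normalization, which the text does not specify. The function names are hypothetical.

```python
import numpy as np


def grad_cam(feature_maps, gradients):
    """Eqs. (6)-(7) for one sample.

    feature_maps, gradients: arrays of shape (K, u, v); gradients[k, i, j]
    holds the derivative of the class score with respect to A^k at (i, j).
    """
    alpha = gradients.mean(axis=(1, 2))              # Eq. (6): one weight per map
    cam = np.tensordot(alpha, feature_maps, axes=1)  # sum_k alpha_k * A^k, shape (u, v)
    return np.maximum(cam, 0.0)                      # Eq. (7): ReLU


def position_weights(cams, seq_len=602):
    """Resize each map to the sequence length, sum over samples, then normalize."""
    total = np.zeros(seq_len)
    dst = np.linspace(0.0, 1.0, num=seq_len)
    for cam in cams:
        profile = cam.mean(axis=1)                   # average across columns
        src = np.linspace(0.0, 1.0, num=profile.size)
        total += np.interp(dst, src, profile)        # linear resize along the sequence
    span = total.max() - total.min()
    # Min-max scaling to [0, 1] is an assumption about the normalization used.
    return (total - total.min()) / span if span > 0 else np.zeros(seq_len)


rng = np.random.default_rng(1)
cams = [grad_cam(rng.normal(size=(16, 70, 1)), rng.normal(size=(16, 70, 1)))
        for _ in range(5)]
weights = position_weights(cams)
print(weights.shape)  # (602,)
```

Plotting `weights` against positions 1 to 602 would give a curve of the kind described as the weight distribution curve.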
3 Results and Discussion

3.1 Performance Analysis

Performance of SpliceSCANNER

Table 1 shows the performance metrics of SpliceSCANNER for predicting donor and acceptor splice sites on five organisms using models trained under ten-fold cross-validation.

Table 1. Performance metrics (%) of SpliceSCANNER on five organisms.

| Site | Organism | Acc | Sp | Sn | Pre | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Donor | H. sapiens | 96.36 | 95.98 | 96.73 | 96.01 | 96.37 | 92.72 | 99.14 |
| Donor | A. thaliana | 95.59 | 96.64 | 94.55 | 96.56 | 95.55 | 91.21 | 98.80 |
| Donor | O. sativa japonica | 94.92 | 94.66 | 95.18 | 94.69 | 94.94 | 89.85 | 98.96 |
| Donor | D. melanogaster | 94.39 | 92.96 | 95.82 | 93.15 | 94.47 | 88.81 | 98.67 |
| Donor | C. elegans | 97.23 | 96.67 | 97.80 | 96.70 | 97.25 | 94.47 | 99.62 |
| Acceptor | H. sapiens | 95.77 | 96.06 | 95.48 | 96.04 | 95.76 | 91.54 | 98.96 |
| Acceptor | A. thaliana | 95.19 | 95.01 | 95.37 | 95.03 | 95.20 | 90.38 | 98.75 |
| Acceptor | O. sativa japonica | 94.43 | 95.27 | 93.60 | 95.19 | 94.39 | 88.88 | 98.62 |
| Acceptor | D. melanogaster | 94.62 | 93.87 | 95.37 | 93.96 | 94.66 | 89.24 | 98.85 |
| Acceptor | C. elegans | 97.80 | 98.39 | 97.21 | 98.37 | 97.79 | 95.61 | 99.71 |
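The metrics in Table 1 are linked by their standard confusion-matrix definitions, which the following sketch uses to cross-check one row. It assumes that the test set contains equal numbers of positive and negative samples (the datasets are described as balanced, but the exact composition of the test split is not stated); the function name `metrics_from_rates` is hypothetical. Under that assumption Acc, Pre, F1 and MCC follow from Sn and Sp alone.

```python
import math


def metrics_from_rates(sn, sp):
    """Derive Acc, Pre, F1 and MCC (as fractions) from sensitivity and specificity,
    assuming equal numbers of positive and negative samples."""
    tp, fn = sn, 1.0 - sn   # per unit of positive samples
    tn, fp = sp, 1.0 - sp   # per unit of negative samples
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Acc": acc, "Pre": pre, "F1": f1, "MCC": mcc}


# H. sapiens donor row of Table 1: Sn = 96.73%, Sp = 95.98%.
derived = metrics_from_rates(sn=0.9673, sp=0.9598)
reported = {"Acc": 96.36, "Pre": 96.01, "F1": 96.37, "MCC": 92.72}
for name, value in derived.items():
    print(f"{name}: derived {100 * value:.3f}, reported {reported[name]:.2f}")
```

For this row the derived values agree with the reported ones to within about 0.01 percentage points, which is the size of discrepancy expected from rounding Sn and Sp to two decimals.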
Comparison with Baseline

To demonstrate the effectiveness of the attention mechanism, we performed a comparative analysis with baseline models whose architecture and parameters are identical to those of SpliceSCANNER but without CBAM. As seen in Table 2, SpliceSCANNER achieves higher accuracy than the baseline on all ten datasets. This indicates that CBAM contributes to the performance improvement in splice site detection.

Table 2. Comparing accuracy (%) of baseline with SpliceSCANNER on five organisms.

| Organism | Donor: baseline | Donor: SpliceSCANNER | Acceptor: baseline | Acceptor: SpliceSCANNER |
|---|---|---|---|---|
| H. sapiens | 96.15 | 96.36 | 95.70 | 95.77 |
| A. thaliana | 95.48 | 95.59 | 95.09 | 95.19 |
| O. sativa japonica | 94.88 | 94.92 | 94.07 | 94.43 |
| D. melanogaster | 93.77 | 94.39 | 93.08 | 94.62 |
| C. elegans | 96.79 | 97.23 | 97.72 | 97.80 |

Comparison with Other Approaches

To offer more insight into the performance and reliability of SpliceSCANNER, we compared its accuracy with four state-of-the-art approaches: SpliceRover, SpliceFinder, DeepSplicer and EnsembleSplice. As shown in Table 3, SpliceSCANNER yields the highest accuracy on eight of the ten datasets. Close inspection reveals a notable difference between SpliceSCANNER and EnsembleSplice, which is in general the second-best tool, on the H. sapiens dataset: the accuracy of SpliceSCANNER is 0.9 and 1.02 percentage points higher than that of EnsembleSplice for donor and acceptor sites, respectively. For the H. sapiens acceptor dataset, SpliceSCANNER correctly predicts up to 472 more sites than EnsembleSplice. Scrutiny of these sites shows that a positive sample containing the subsequence "TCTACCGGGATTTCTAGAGCAGCCTTGTGAGA", which aligns uniquely to chromosome 3 of the reference genome (GRCh38.p12), is predicted correctly only by SpliceSCANNER. For a more convincing conclusion, we also compared MCC as a measure of classifier performance in Table 4, which shows that SpliceSCANNER achieves the highest values on eight datasets, indicating the reliability of our method.

Table 3. Comparing accuracy (%) of four methods with SpliceSCANNER on five organisms.

| Site | Organism | SpliceRover | SpliceFinder | DeepSplicer | EnsembleSplice | SpliceSCANNER |
|---|---|---|---|---|---|---|
| Donor | H. sapiens | 93.50 | 93.05 | 95.54 | 95.46 | 96.36 |
| Donor | A. thaliana | 92.37 | N/A | 93.22 | 95.09 | 95.59 |
| Donor | O. sativa japonica | N/A | N/A | 95.11 | N/A | 94.92 |
| Donor | D. melanogaster | N/A | 92.94 | 93.81 | N/A | 94.39 |
| Donor | C. elegans | N/A | N/A | 96.92 | N/A | 97.23 |
| Acceptor | H. sapiens | 94.55 | 93.20 | 94.64 | 94.75 | 95.77 |
| Acceptor | A. thaliana | 90.76 | N/A | 92.04 | 94.94 | 95.19 |
| Acceptor | O. sativa japonica | N/A | N/A | 93.12 | N/A | 94.43 |
| Acceptor | D. melanogaster | N/A | 92.74 | 94.70 | N/A | 94.62 |
| Acceptor | C. elegans | N/A | N/A | 96.68 | N/A | 97.80 |

Note: N/A means that the method has not trained the specific model for this organism.
Table 4. Comparing MCC (%) of four methods with SpliceSCANNER on five organisms.

| Site | Organism | SpliceRover | SpliceFinder | DeepSplicer | EnsembleSplice | SpliceSCANNER |
|---|---|---|---|---|---|---|
| Donor | H. sapiens | 87.44 | 86.28 | 91.08 | 90.92 | 92.72 |
| Donor | A. thaliana | 85.64 | N/A | 86.50 | 90.19 | 91.21 |
| Donor | O. sativa japonica | N/A | N/A | 90.22 | N/A | 89.85 |
| Donor | D. melanogaster | N/A | 86.06 | 87.61 | N/A | 88.81 |
| Donor | C. elegans | N/A | N/A | 93.85 | N/A | 94.47 |
| Acceptor | H. sapiens | 89.47 | 86.50 | 89.27 | 89.50 | 91.54 |
| Acceptor | A. thaliana | 82.81 | N/A | 84.11 | 89.89 | 90.38 |
| Acceptor | O. sativa japonica | N/A | N/A | 86.29 | N/A | 88.88 |
| Acceptor | D. melanogaster | N/A | 85.50 | 89.41 | N/A | 89.24 |
| Acceptor | C. elegans | N/A | N/A | 93.36 | N/A | 95.61 |

3.2 Interpretation and Visualization

In our work, we first exploited Grad-CAM to interpret the decisions made by SpliceSCANNER when identifying splice sites, without any prior knowledge. Figure 3 shows the distribution of the normalized weight at each position of the sample sequences. The middle positions of each curve, namely the splice site at the 301st and 302nd nucleotides, receive the lowest weight, which implies that they contribute little to predicting splice sites. Meanwhile, the areas closest to the splice sites receive the highest weight, causing a drastic fluctuation in the middle of each curve. This suggests that each model pays more attention to the section around the splice site than to the more distant regions on either side when extracting and selecting effective features for recognizing splice sites. This phenomenon is consistent with the hypothesis proposed by Zuallaert et al. [10].

Moreover, the donor curves (Fig. 3a) show that the region to the right of the splice site gains a higher total weight than the left, particularly for O. sativa japonica and A. thaliana. Based on empirical analysis, we attribute this to the conservation of splicing mechanisms: the intronic region of a donor sample is less involved in alternative splicing and consequently contributes more to discerning donor sites. Conversely, for the acceptor curves (Fig. 3b), the left region receives a higher total weight than the right, especially in H. sapiens and A. thaliana, indicating that the left part, the intronic region, offers more discernible features for characterizing acceptor sites. In addition, branch point sites are known to be generally located 20 to 50 nucleotides upstream of the 3' ends of introns, and they can bolster the recognition of acceptor sites [19]. Consistently, higher weight is indeed assigned to these locations than to the outlying segments, implying that the model identifies this biological signal. Interestingly, in the curves of C. elegans, the flanking regions around the splice sites have much lower weights than in other species, which may be related to the compactness of its small genome, including small introns and closely spaced genes [20].
[Figure 3 (graphic not reproduced). Panels: (a) Donor, (b) Acceptor. Axes: Position (100–600) and Weight (0.0–1.0). Legend: H. sapiens, A. thaliana, O. sativa japonica, D. melanogaster, C. elegans.]
Fig. 3. Weight distribution for five organisms in splice site samples. Each point on the curve represents the weight of the corresponding position. The larger the weight, the more important the position for splicing detection.
3.3 Generalization Analysis

To further explore the generalizability of SpliceSCANNER, cross-organism validation experiments were performed. As the most widely and deeply studied species, H. sapiens plays a significant role in biological research. We therefore tested the models trained on H. sapiens on the test datasets of the other four organisms. As shown in Fig. 4, SpliceSCANNER achieves the highest accuracy among the five approaches except on C. elegans, on which DeepSplicer performs slightly better. Therefore, we believe that SpliceSCANNER generalizes well and is, in general, better suited than the other methods to annotating newly sequenced or poorly studied genomes of various organisms.

[Figure 4 (graphic not reproduced). Panels: (a) Donor, (b) Acceptor. Axes: organism (A. thaliana, O. sativa japonica, D. melanogaster, C. elegans) and Acc (%), 80–92. Legend: SpliceRover, SpliceFinder, DeepSplicer, EnsembleSplice, SpliceSCANNER.]
Fig. 4. Accuracy of cross-organism validation by models trained on H. sapiens. To make it easy to observe the differences of results, accuracy starts at 80%. The higher cross-validation accuracy a method obtains, the better generalizability it has.
4 Conclusions

In this work, we have developed SpliceSCANNER, a deep learning-based method for accurate prediction of canonical splice sites. To establish effective and generalizable models, we first downloaded datasets of five organisms from published literature. Then, we constructed ten CNN-based models for these datasets. To capture more informative features for characterizing splice sites, CBAM was introduced into each model. Finally, to provide detailed insight into the classification results, Grad-CAM was employed to represent the importance of the features learned by the models without modifying the network structure or parameters.
Experimental results show that the proposed method has better performance and generalizability for detecting splice sites. It achieves higher accuracy than the baseline without CBAM. Compared with several state-of-the-art approaches, it is more accurate and generalizes better on most of the species. The visualizations obtained for interpreting the decision-making confirm that SpliceSCANNER focuses on the key regions around splice sites and assigns high weights to them when learning their features, which is consistent with previous literature. In short, the proposed SpliceSCANNER method is accurate and interpretable for predicting splice sites. In the future, we will continue to explore this field by constructing hybrid models that combine machine learning with deep learning techniques. Moreover, given the advent of the Transformer, we would like to construct a model based on it. We hope our work will benefit gene regulation analysis and the bioinformatics community.

Acknowledgements. This work is supported by Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP2004), National Natural Science Foundation of China (61862017), and Innovation Project of GUET (Guilin University of Electronic Technology) Graduate Education (2022YCXS063).
References

1. Wang, G.-S., Cooper, T.A.: Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet. 8, 749–761 (2007)
2. Burset, M., Seledtsov, I.A., Solovyev, V.V.: SpliceDB: database of canonical and noncanonical mammalian splice sites. Nucl. Acids Res. 29, 255–259 (2001)
3. Pertea, M., Lin, X., Salzberg, S.L.: GeneSplicer: a new computational method for splice site prediction. Nucl. Acids Res. 29, 1185–1190 (2001)
4. Trapnell, C., Pachter, L., Salzberg, S.L.: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009)
5. Kim, D., Langmead, B., Salzberg, S.L.: HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015)
6. Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018)
7. Liu, B., Liu, Y., Li, J., Guo, H., Zang, T., Wang, Y.: deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index. Genome Biol. 20, 1–14 (2019)
8. Wang, S., et al.: CnnPOGTP: a novel CNN-based predictor for identifying the optimal growth temperatures of prokaryotes using only genomic k-mers distribution. Bioinformatics 38, 3106–3108 (2022)
9. Hernández, D., Jara, N., Araya, M., Durán, R.E., Buil-Aranda, C.: PromoterLCNN: a light CNN-based promoter prediction and classification model. Genes 13, 1126 (2022)
10. Zuallaert, J., Godin, F., Kim, M., Soete, A., Saeys, Y., De Neve, W.: SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics 34, 4180–4188 (2018)
11. Wang, R., Wang, Z., Wang, J., Li, S.: SpliceFinder: ab initio prediction of splice sites using convolutional neural network. BMC Bioinform. 20, 1–13 (2019)
12. Akpokiro, V., Oluwadare, O., Kalita, J.: DeepSplicer: an improved method of splice sites prediction using deep learning. In: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 606–609. IEEE (2021)
13. Akpokiro, V., Martin, T., Oluwadare, O.: EnsembleSplice: ensemble deep learning model for splice site prediction. BMC Bioinform. 23, 413 (2022)
14. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_1
15. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
16. Shun, K.T.T., Limanta, E.E., Khan, A.: An evaluation of backpropagation interpretability for graph classification with deep learning. In: 2020 IEEE International Conference on Big Data (Big Data), pp. 561–570. IEEE (2020)
17. Albaradei, S., et al.: Splice2Deep: an ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA. Gene 763, 100035 (2020)
18. Teng, Q., Liu, Z., Song, Y., Han, K., Lu, Y.: A survey on the interpretability of deep learning in medical diagnosis. Multimed. Syst. 28, 1–21 (2022)
19. Nazari, I., Tayara, H., Chong, K.T.: Branch point selection in RNA splicing using deep learning. IEEE Access 7, 1800–1807 (2018)
20. Blumenthal, T., Spieth, J.: Gene structure and organization in Caenorhabditis elegans. Curr. Opin. Genet. Dev. 6, 692–698 (1996)
TransOrga: End-To-End Multi-modal Transformer-Based Organoid Segmentation
Yiming Qin1, Jiajia Li1,2, Yulong Chen1, Zikai Wang2, Yu-An Huang3, Zhuhong You3, Lun Hu4, Pengwei Hu4, and Feng Tan5(B)
1 Shanghai Jiao Tong University, Shanghai, China
[email protected]
2 Shanghai Artificial Intelligence Research Institute, Shanghai, China
3 Northwestern Polytechnical University, Xi’an, China
4 Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Shanghai, China
5 Merck KGaA, Darmstadt, Germany
[email protected]
Abstract. Organoid research plays an important role in drug screening and disease modeling. Obtaining accurate information about organoid morphology, number, and size is fundamental to this research. However, previous methods either rely on fluorescence labeling, which can harm organoids, or suffer from limited accuracy and robustness. In this paper, we are the first to introduce the Transformer architecture into the organoid segmentation task and propose an end-to-end multi-modal method named TransOrga. To enhance accuracy and robustness, we use a multi-modal feature extraction module to blend spatial and frequency domain features of organoid images. Furthermore, we propose a multi-branch aggregation decoder that learns diverse contexts from various Transformer layers to predict the segmentation mask progressively. In addition, we design a series of losses, including focal loss, dice loss, compact loss, and auxiliary loss, to supervise our model to predict more accurate segmentation results with rational sizes and shapes. Our extensive experiments demonstrate that our method outperforms the baselines in organoid segmentation and provides an automatic, robust, and fluorescent-free tool for organoid research.
Keywords: Organoid segmentation · Transformer · Multi-modal
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 460–472, 2023. https://doi.org/10.1007/978-981-99-4749-2_39

1 Introduction
Organoids are multicellular 3D structures that are derived from stem cells and self-organize into functional tissues that resemble human organs [1]. They are considered an important tool for disease modeling, drug screening, and regenerative medicine. Organoids are embedded in a biological matrix and can be generated from various tissues, such as the liver, lung, and brain. They mimic the in vivo organ development process, allowing researchers to study organogenesis and disease progression in a controlled environment. Furthermore, organoids can be genetically manipulated, allowing researchers to study the effects of specific gene mutations on disease development and progression [2, 3]. The detection of organoids in the culture medium is the foundation for these studies.
Researchers face several challenges when trying to identify organoids in a large number of microscope images. First, the size and shape of organoids vary, both among organoids of the same tissue type and between different tissue types in the same culture medium. Second, a single microscope image contains many small organoids, which are difficult to detect. Finally, confounding factors such as air bubbles and nutrients in the biological culture medium interfere with organoid recognition.
Traditional methods involve the fluorescent labeling of organoids [4–6], followed by manual labeling or image analysis methods based on manually set parameters [7, 8] for organoid segmentation. However, some fluorescent dyes can be toxic to cells, leading to cell damage or death and affecting cellular dynamics. In addition, photobleaching occurs when the fluorescent dye molecules lose their ability to fluoresce after prolonged exposure to light, which leads to loss of signal and decreased imaging sensitivity. Therefore, an automatic organoid segmentation method that is accurate, robust, reproducible, and fluorescent-free is necessary.
Some researchers have investigated the recognition of organoids from brightfield or phase-contrast organoid images, for example with an adaptive thresholding method [9], which requires the parameters to be adjusted manually for each image. With the development of deep learning, convolutional neural networks (CNNs) have been introduced into organoid segmentation [10–13]. However, these methods still face challenges in robustness and accuracy because of their network structure. CNN-based methods typically use a pyramid structure (resolution compression), which leads to information loss during feature extraction.
In contrast, the Transformer has no pyramid structure, which reduces information loss and allows rich information to be extracted. Inspired by this, we introduce the Transformer into the organoid segmentation task and propose TransOrga, an end-to-end multi-modal Transformer-based organoid segmentation method. TransOrga mainly contains a multi-modal feature extraction module (MFE), a Transformer-based encoder, and a multi-branch decoder. Given brightfield or phase-contrast organoid images, the MFE module obtains a saliency map based on the Fourier transform, which reflects the frequency domain information of the organoid images and helps extract image features and suppress noise. To integrate these data from different domains, MFE then fuses the information of the organoid image and its saliency map. We then feed the fused features into a Transformer-based encoder to learn richer and deeper representations using its self-attention mechanism. To make use of the encoded features more efficiently, we propose a multi-branch aggregation decoder that learns diverse contexts from various Transformer layers for organoid segmentation. We also introduce several loss functions, including focal loss, dice loss, compact loss, and auxiliary loss, to supervise our model to predict more accurate segmentation results with rational sizes and shapes. Our extensive experiments demonstrate that our method outperforms the state-of-the-art baselines qualitatively and quantitatively.
In summary, the contributions of this paper are as follows:
• To the best of our knowledge, we are the first to introduce the Transformer architecture into the organoid segmentation task, and we propose TransOrga, an end-to-end multi-modal Transformer-based method that bridges the spatial and frequency domain features for organoid segmentation.
• We propose a multi-branch aggregation decoder that learns diverse contexts from various Transformer layers for organoid segmentation.
• To ensure accurate segmentation with rational sizes and shapes, we design a set of losses, including focal loss, dice loss, compact loss, and auxiliary loss.
2 Related Work
2.1 Organoid Segmentation
Traditional methods usually label organoids with fluorescence to aid organoid segmentation. For example, organoids have been labeled with genetically modified fluorescent proteins in previous studies [4–6] for segmentation, but this approach may alter the intrinsic cellular dynamics of the original sample and lead to cumulative toxicity over longer growth times. As a result, studying organoids using brightfield or phase-contrast images has become more common. OrganoSeg [9] applied adaptive thresholding to segment organoids in brightfield images: the image is first smoothed, then binarized with a threshold computed by an adaptive threshold function, and the segmentation result is finally extracted from the binary image. In recent years, deep learning methods have been introduced into organoid segmentation. OrgaQuant [10] introduced vanilla convolutional neural networks for segmenting human intestinal organoids in brightfield images. Furthermore, some researchers used the Unet [36] structure instead of the vanilla CNN structure [11–13] to improve performance. However, traditional methods rely on manual parameter tuning, and existing deep learning methods are primarily based on convolutional neural networks, which may struggle to segment small objects accurately and robustly. In this work, we present TransOrga, a multi-modal Transformer-based approach that avoids the information loss that may occur with a pyramid structure (such as Unet) and improves segmentation accuracy and robustness.
2.2 Transformer
The Transformer network was originally developed for natural language processing (NLP) tasks. Vaswani et al. [14] first proposed the Transformer, based on the self-attention mechanism, for machine translation. BERT [15] and GPT [16] demonstrated the power of the Transformer on NLP tasks.
Inspired by the success of Transformer networks in NLP, researchers have studied how to apply Transformers to computer vision (CV) tasks. Carion et al. [17] combined a CNN and a Transformer for object detection. Zhu et al. [18] introduced deformable attention modules to replace the original multi-head attention mechanism and improve detection performance. Dosovitskiy et al. [19] proposed ViT, which applies a pure Transformer directly to sequences of image patches for image classification and achieved state-of-the-art (SOTA) performance. Zheng et al. [20] proposed a Transformer-based semantic segmentation network that uses ViT to extract features and an aggregation module for pixel-wise segmentation. Wang et al. [21] introduced MaX-DeepLab, an approach for panoptic segmentation that leverages a mask Transformer to predict results directly, bypassing the need for surrogate sub-tasks such as box detection. Wang et al. [22] presented a method for generating instance prediction results from a sequence of input images. Huang et al. [23] estimated 3D hand poses from point sets using a Transformer network. METRO [24] reconstructed 3D human pose and mesh from a single RGB image with a Transformer. Liu et al. [25] introduced LSTR, which leverages a Transformer network to learn the global context for curve lane detection. For medical images, Valanarasu et al. [26] proposed a Gated Axial-Attention model to address the difficulty of training Transformers on the small number of medical images that are typically available. Swin-Unet [27] is a Unet-like pure Transformer for medical image segmentation.
Unlike in these tasks, the objects in organoid images are numerous and small, and the background is noisy, for example because of air bubbles in the culture medium. Motivated by this, we are the first to introduce the Transformer structure into the organoid segmentation task, and we propose TransOrga, which uses multi-modal data from both the spatial and frequency domains together with multi-layer features for robust segmentation.
Fig. 1. Structure of proposed TransOrga. Given the organoid image as input, TransOrga obtains the saliency map containing the frequency features and then fuses it with the organoid image. The Transformer-based encoder learns multi-layer features from the fused features to capture contextual information. Finally, the multi-branch aggregation decoder leverages diverse context information from various Transformer layers to predict the organoid segmentation mask progressively. A set of losses are designed to ensure accurate and rational segmentation results.
3 TransOrga
The overview of TransOrga is shown in Fig. 1. TransOrga mainly contains three parts: the MFE module, the Transformer-based encoder, and the multi-branch decoder. Given the organoid image $I_s$ as input, MFE obtains its saliency map $S_s$, which contains the frequency domain features, and extracts and fuses the features of the organoid image and the saliency map into $O_s$ (Sect. 3.1), which is fed into the Transformer-based encoder. Our encoder learns multi-layer features from the fused features to capture contextual information for segmentation (Sect. 3.2). Subsequently, the multi-branch aggregation decoder leverages diverse context information from various Transformer layers to predict the organoid segmentation mask $P$ progressively (Sect. 3.3). Furthermore, based on the characteristics of organoids, we introduce a set of losses to enhance the performance (Sect. 3.4).
Fig. 2. Structure of the multi-modal feature extraction module (MFE). The input is $I_s$, the saliency map is $S_s$, and the fused output is $O_s$. The Fourier transform and the inverse Fourier transform are denoted by FFT and IFFT, respectively. The Gaussian smoothing filter is denoted by $G$. An additive operation is denoted by the symbol ⊕. $L(\cdot)$ denotes the matrix logarithmic operation.
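The processing chain summarized in Fig. 2 and formalized in Eqs. (1)–(3) of Sect. 3.1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the zero padding of the mean filter, the Gaussian width `sigma`, and the treatment of the phase spectrum as the imaginary part of the exponent are assumptions made for illustration only.

```python
import numpy as np


def box_mean_3x3(x):
    """3x3 mean filter with stride 1 and padding 1 (zero padding is assumed)."""
    padded = np.pad(x, 1, mode="constant")
    h, w = x.shape
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0


def gaussian_smooth(x, sigma=2.0):
    """Separable Gaussian filter G; sigma is a hypothetical choice."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, x)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, rows)


def spectral_residual_saliency(image, eps=np.exp(-10.0), sigma=2.0):
    """Saliency map S_s of a 2D grayscale image I_s (sketch of Eqs. (1)-(3))."""
    spectrum = np.fft.fft2(image)
    amplitude = np.abs(spectrum) + eps           # Eq. (1): A_s, offset by e^-10
    phase = np.angle(spectrum)                   # Eq. (2): P_s
    log_amplitude = np.log(amplitude)            # L_s
    residual = log_amplitude - box_mean_3x3(log_amplitude)  # R_s
    # Assumption: the phase enters as the imaginary part of the exponent.
    e_s = np.fft.ifft2(np.exp(residual + 1j * phase))
    return gaussian_smooth(np.abs(e_s) ** 2, sigma)          # Eq. (3)


# Toy example: one bright blob on a dark background.
image = np.zeros((64, 64))
image[20:28, 30:38] = 1.0
saliency = spectral_residual_saliency(image)
# Channel-wise concatenation of image and saliency map, as fed to the fusion conv.
fused_input = np.stack([image, saliency])        # shape (2, 64, 64)
```

In the module described in the paper, the concatenated tensor is then passed through a 3 × 3 convolution with 3 output channels, a normalization layer, and a ReLU; that learned part is omitted from the sketch.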
3.1 Multi-modal Feature Extraction
In the process of obtaining organoid images, external environmental interference and instability of the imaging equipment can easily cause additional noise, while the size and shape of cells have a significant impact on deep segmentation models. The Fourier transform can help separate noise from useful information in the images. The frequency domain information it provides can help the network learn features that are less sensitive to small changes in the input image [28]. This, in turn, allows for better generalization when dealing with new and unseen data. Motivated by this, we design a multi-modal feature extraction module (MFE) to combine the multi-modal information from the organoid image. Figure 2 illustrates the overview of the MFE module. Let $I_s$ denote the input organoid image. We calculate the amplitude spectrum $A_s$ and phase spectrum $P_s$ of $I_s$ as follows:

$$A_s = \mathcal{A}(\mathrm{FFT}(I_s)) + e^{-10}, \tag{1}$$

$$P_s = \mathcal{P}(\mathrm{FFT}(I_s)), \tag{2}$$

where FFT represents the Fourier transform operation, $A_s$ and $P_s$ are the amplitude and phase spectra of the input image $I_s$, computed by $\mathcal{A}(\cdot)$ and $\mathcal{P}(\cdot)$, respectively, and $e^{-10}$ is a small constant that keeps the subsequent logarithm well defined. The logarithmic amplitude $L_s$ of the input organoid image $I_s$ is obtained by applying a logarithmic operation to the amplitude spectrum $A_s$. To obtain the spectral residual of the image, the mean spectrum is removed. In this study, we use a mean filter $v_1(\cdot)$ with a kernel size of 3 × 3, a stride of 1, and a padding of 1 to filter $L_s$ and obtain the mean spectrum $\bar{L}_s$. The residual spectrum $R_s$ is then calculated as the difference between $L_s$ and $\bar{L}_s$, followed by an inverse Fourier transform combined with the phase spectrum $P_s$ to obtain $E_s$. Finally, $E_s$ is passed to the Gaussian smoothing filter to get the saliency map $S_s$. The output saliency map $S_s$ is calculated as follows,

$$S_s = G * \left\| \mathrm{IFFT}\left(\exp(R_s + P_s)\right) \right\|^2, \tag{3}$$

where IFFT denotes the inverse Fourier transform, $G$ denotes the Gaussian smoothing filter, and $P_s$ enters the exponential as the phase (imaginary) term. To facilitate the learning process, we concatenate the organoid image $I_s$ and the saliency map $S_s$ along the channel dimension. We then apply a convolution layer with a 3 × 3 kernel size and 3 output channels, a normalization layer, and a ReLU activation function to compute the output $O_s$.

3.2 Encoding
After generating the fused feature map $O_s$ with 3 channels, we split the image into 16 × 16 patches, which are then transformed into a one-dimensional embedding sequence by the patch embedding and position embedding modules [19]. We refer to the learned features as $Z$. To learn these features, we use multiple stacked Transformer layers, each of which has a global receptive field, which addresses the limited receptive field of existing convolutional encoders. The encoder stacks $L_e$ Transformer layers, each consisting of Multi-head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. At each layer $l$, we compute a triple (query $Q$, key $K$, value $V$) from the features $Z^{l-1} \in \mathbb{R}^{L \times C}$ of layer $l-1$ to serve as input to the self-attention block, as follows,

$$Q = Z^{l-1} W_Q, \quad K = Z^{l-1} W_K, \quad V = Z^{l-1} W_V, \tag{4}$$

where $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$ are the learnable weight parameters of the three linear mapping layers, and $d$ is the feature dimension of the triple. Subsequently, the self-attention mechanism (SA) can be expressed as,

$$\mathrm{SA}(Z^{l-1}) = Z^{l-1} + \mathrm{softmax}\left(\frac{Z^{l-1} W_Q \left(Z^{l-1} W_K\right)^{\top}}{\sqrt{d}}\right) \left(Z^{l-1} W_V\right). \tag{5}$$

MSA extends this to $m$ independent SA operations and maps their joint output: $\mathrm{MSA}(Z^{l-1}) = [\mathrm{SA}_1(Z^{l-1}); \mathrm{SA}_2(Z^{l-1}); \cdots; \mathrm{SA}_m(Z^{l-1})]\, W_O$, where $W_O \in \mathbb{R}^{md \times C}$ and $d$ is usually set to $C/m$. The output of the MSA is then transformed by the MLP block, with a residual connection, to give the output of layer $l$ as follows,

$$Z^{l} = \mathrm{MSA}(Z^{l-1}) + \mathrm{MLP}(\mathrm{MSA}(Z^{l-1})) \in \mathbb{R}^{L \times C}. \tag{6}$$

Furthermore, layer normalization is applied before the MSA and MLP blocks, but it is omitted from the formula for brevity. In this paper, $\{Z^1, Z^2, \cdots, Z^{L_e}\}$ denote the features of the Transformer layers.
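The encoder layer of Eqs. (4)–(6) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the residual term of Eq. (5) is applied once after the output projection $W_O$ so that the shapes ($L \times C$) agree, the MLP is a two-layer ReLU network with hidden width $4C$, layer normalization is omitted as in the formulas, and `init_layer` with its random initialization is hypothetical.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the row maximum subtracted for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention_head(z, w_q, w_k, w_v):
    """Attention term of Eq. (5) for one head: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v                # Eq. (4), each of shape (L, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (L, L) attention weights
    return weights @ v                                 # (L, d)


def msa(z, heads, w_o):
    """Multi-head self-attention: concatenate m heads and project with W_O."""
    joint = np.concatenate([attention_head(z, *h) for h in heads], axis=-1)
    # Assumption: the residual is added after the projection so shapes match.
    return z + joint @ w_o


def encoder_layer(z, heads, w_o, w1, w2):
    """Eq. (6): Z^l = MSA(Z^{l-1}) + MLP(MSA(Z^{l-1})), without layer norm."""
    a = msa(z, heads, w_o)
    return a + np.maximum(a @ w1, 0.0) @ w2            # two-layer ReLU MLP (assumed)


def init_layer(rng, c, m):
    """Hypothetical random initialization with d = C / m."""
    d = c // m
    heads = [tuple(rng.normal(0.0, 0.1, (c, d)) for _ in range(3)) for _ in range(m)]
    w_o = rng.normal(0.0, 0.1, (m * d, c))
    w1 = rng.normal(0.0, 0.1, (c, 4 * c))
    w2 = rng.normal(0.0, 0.1, (4 * c, c))
    return heads, w_o, w1, w2


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))                      # toy sequence: L = 16, C = 8
layer = init_layer(rng, c=8, m=2)
features = encoder_layer(tokens, *layer)               # same shape as the input
```

For a 512 × 512 input split into 16 × 16 patches, the sequence length would be $L = 1024$ rather than the toy value used here. Without position embeddings the layer is permutation-equivariant, which is one reason the position embedding module is needed.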
3.3 Decoding
The encoder in our proposed model is based on the Transformer architecture without the pyramid structure, as illustrated in Fig. 1. It consists of 12 layers whose output features all have the same scale ($\frac{HW}{256} \times 768$, i.e., one token per 16 × 16 patch). To enhance interaction between different layers, we introduce a multi-branch aggregation design that progressively fuses the corresponding contexts of different layers, as shown in Fig. 1. We select the encoder features of layers 2, 5, 8, and 11, i.e., $Z^2$, $Z^5$, $Z^8$, and $Z^{11}$, for multi-branch organoid segmentation. Each branch focuses on one selected layer. We first reshape the features from the 2D shape ($\frac{HW}{256} \times C$) to the 3D shape ($\frac{H}{16} \times \frac{W}{16} \times C$). Then, we pass them through three convolution layers with kernel sizes of 1 × 1, 3 × 3, and 3 × 3, respectively. To enhance interactions across branches, we then introduce a top-down aggregation structure that fuses the features of upper and lower layers by element-wise summation and increases their spatial resolution by an upsampling operation. After obtaining four groups of features, we concatenate them along the channel dimension and restore them to the input size using convolution and upsampling operations. Finally, we obtain the predicted organoid segmentation mask $P$, which has 2 channels.

3.4 Losses
Considering the characteristics of organoids, we design a series of loss functions.
Focal loss $\mathcal{L}_f$ [29] penalizes misclassified pixels. Because of the imbalanced ratio between organoid pixels and all other pixels, we opt for focal loss instead of cross-entropy loss, as follows,

$$\mathcal{L}_f = -\alpha (1 - p_t)^{\gamma} \log(p_t), \tag{7}$$

$$p_t = \begin{cases} \hat{p} & \text{if } \operatorname{argmax}(P) = 1 \\ 1 - \hat{p} & \text{otherwise,} \end{cases} \tag{8}$$

where $P$ is the predicted segmentation mask, $\operatorname{argmax}(\cdot)$ returns the channel index of the maximum value, and $\hat{p}$ is the predicted probability of the organoid class. The hyperparameters $\alpha$ and $\gamma$ are the sample weight and the weight for hard cases, respectively.
Dice loss $\mathcal{L}_d$ [30] measures the similarity between the predicted and ground-truth segmentation masks. Dice loss is calculated as the ratio of the overlap between the predicted and ground-truth segmentation masks to their union as follows,

(9)

where $P$ and $G$ are the predicted and ground-truth segmentation masks.
Compact loss $\mathcal{L}_c$ measures the compactness of a shape. Since organoids are typically circular or elliptical, we adopt the Isoperimetric Quotient (IQ) [31]. IQ is defined as the ratio of a shape’s area to the square of its perimeter. Inspired by [32], the compact loss is the reciprocal of IQ as follows,

(10)

where $p$ is the predicted segmentation result, evaluated over all pixels of the result, $p_h$ and $p_v$ are the gradients at each pixel in the horizontal and vertical directions, and a small hyperparameter is included to avoid numerical instability.
Auxiliary loss $\mathcal{L}_a$ enhances feature representation learning. We propose an auxiliary loss that uses features extracted from hidden layers. We select the features of the middle layers of our model to calculate the focal loss as follows,

(11)

where $\phi(\cdot)$ is the hidden layer and $\lambda_i$ represents the weight assigned to the $i$-th layer.
Finally, we combine the previous losses with weights to form the final loss,

$$\mathcal{L} = \lambda_f \mathcal{L}_f + \lambda_d \mathcal{L}_d + \lambda_c \mathcal{L}_c + \lambda_a \mathcal{L}_a. \tag{12}$$
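The loss terms of Sect. 3.4 can be illustrated with the NumPy sketch below, which relies on commonly used forms and is not the authors' implementation. The assumptions are: $p_t$ is selected with the ground-truth label, as in the standard focal loss [29]; `alpha=0.25` and `gamma=2.0` are illustrative values not taken from the paper; the dice loss uses the common $1 - 2|P \cap G| / (|P| + |G|)$ form; the compact loss is the reciprocal isoperimetric quotient $P^2 / (4\pi A)$ with perimeter and area estimated from forward differences of a soft mask; and the auxiliary term is a weighted sum of focal losses on hypothetical intermediate predictions. Only the weights 1.0, 0.5, 0.5, and 0.3 come from Sect. 4.1.

```python
import numpy as np


def focal_loss(p_hat, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean of -alpha * (1 - p_t)^gamma * log(p_t); p_t chosen by the label (assumed)."""
    p_t = np.where(target == 1, p_hat, 1.0 - p_hat)
    p_t = np.clip(p_t, eps, 1.0)
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t)))


def dice_loss(p_hat, target, eps=1e-7):
    """Common soft dice loss 1 - 2|P ∩ G| / (|P| + |G|) (assumed form)."""
    inter = np.sum(p_hat * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p_hat) + np.sum(target) + eps))


def compact_loss(p_hat, eps=1e-12):
    """Reciprocal isoperimetric quotient P^2 / (4 * pi * A) of a soft mask."""
    grad_h = np.diff(p_hat, axis=1, append=p_hat[:, -1:])   # horizontal gradient
    grad_v = np.diff(p_hat, axis=0, append=p_hat[-1:, :])   # vertical gradient
    perimeter = np.sum(np.sqrt(grad_h ** 2 + grad_v ** 2 + eps))
    area = np.sum(np.abs(p_hat)) + eps
    return float(perimeter ** 2 / (4.0 * np.pi * area))


def total_loss(p_hat, target, aux_preds=(), aux_weights=(),
               lam_f=1.0, lam_d=0.5, lam_c=0.5, lam_a=0.3):
    """Weighted sum of the four terms; weights follow the values in Sect. 4.1."""
    aux = sum(w * focal_loss(a, target) for a, w in zip(aux_preds, aux_weights))
    return (lam_f * focal_loss(p_hat, target) + lam_d * dice_loss(p_hat, target)
            + lam_c * compact_loss(p_hat) + lam_a * aux)


# Toy masks with equal area: a compact square and an elongated bar.
square = np.zeros((64, 64))
square[24:40, 24:40] = 1.0
bar = np.zeros((64, 64))
bar[30:34, :] = 1.0
prediction = np.where(square == 1, 0.9, 0.1)    # confident, correct soft prediction
loss_value = total_loss(prediction, square)
```

On these toy masks the compact term is smaller for the square than for the bar of equal area, which is the behaviour the compactness prior is meant to encourage.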
4 Experiment and Evaluation
In this section, we compare our proposed method with the SOTA methods both qualitatively and quantitatively. Additionally, we conduct an ablation study to demonstrate the effectiveness of our approach.

4.1 Implementation Details
We implement our approach using PyTorch [33] and train and test our model on the OrganoID dataset [13]. The input size of our model is 512 × 512 and the batch size is 4. We employ the stochastic gradient descent (SGD) algorithm with a stochastic weight averaging strategy to optimize our model. The initial learning rate is 0.01, and we multiply it by 0.1 every 10 epochs. We train our model on an NVIDIA RTX 3090 with 24 GB of memory for 80 epochs. During training, we set $\lambda_f$ to 1.0, $\lambda_d$ to 0.5, $\lambda_c$ to 0.5, and $\lambda_a$ to 0.3.

4.2 Qualitative and Quantitative Comparison
We compare our method with SOTA segmentation methods. (1) SegNet [34] uses a symmetric encoder-decoder structure with pooling-index up-sampling for image segmentation. (2) A-Unet [35] adds an attention gate model to Unet [36], making the model focus on the target structures. (3) OrganoID [13] is based on Unet [36] for organoid segmentation. We modify the input module of all baselines to be compatible with the OrganoID dataset, and all models are trained successfully.
For the evaluation metrics, we select precision, recall, F1-score, mean intersection-over-union (mIoU), and DICE, which are common metrics for segmentation tasks. Precision is the proportion of true positive predictions among all positive predictions, whereas recall is the proportion of true positive predictions among all ground-truth positives. F1-score is the harmonic mean of precision and recall. To calculate mIoU, we average the IoU over all classes. IoU is computed as $\frac{TP}{TP+FP+FN}$, where TP is true positive, FP is false positive, and FN is false negative. DICE is computed as $\frac{2TP}{2TP+FP+FN}$, which is the ratio of the intersection of the two sets to their average size. Note that both mIoU and DICE range from 0 to 1, where a value of 1 indicates perfect overlap between the predicted and ground-truth segmentation masks.
The segmentation results are presented in Fig. 3. We compare the baselines and our method on organoids from different tissues, including salivary adenoid cystic carcinoma (ACC), colon epithelia (Colon), lung epithelia (Lung), and pancreatic ductal adenocarcinoma (PDAC). Because of noisy backgrounds, such as air bubbles, the baseline methods produce broken and incorrect segmentation results that include noise. Moreover, the segmentation results of the baselines contain irregular shapes that do not conform to reasonable organoid shapes, with fragmented boundaries and holes inside the organoids, as shown for PDAC in Fig. 3. Our proposed method overcomes these limitations and produces segmentation results with regular shapes that conform to reasonable organoid shapes. We also show more results of our method on different tissues in Fig. 4.
Moreover, we evaluate the segmentation results quantitatively, as shown in Table 1, which compares the performance of our method against the baselines on the different tissues. Our approach achieves the best Dice, mIoU, and F1-score on three of the four tissues and is within 0.002 of the best baseline (A-Unet) on the fourth, indicating better overall robustness and accuracy.
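The metric definitions of Sect. 4.2 can be made concrete with a small NumPy sketch for binary masks. The function names are illustrative, the handling of empty masks is an assumption, and `mean_iou` assumes that the class average is taken over the organoid and background classes.

```python
import numpy as np


def confusion_counts(pred, truth):
    """Pixel-wise TP, FP, and FN for binary masks."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, fp, fn


def segmentation_metrics(pred, truth):
    """Precision, recall, F1, IoU, and Dice of the positive class."""
    tp, fp, fn = confusion_counts(pred, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Convention (assumed): two empty masks count as a perfect match.
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "iou": iou, "dice": dice}


def mean_iou(pred, truth):
    """Average IoU over the organoid and background classes (assumed two classes)."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    foreground = segmentation_metrics(pred, truth)["iou"]
    background = segmentation_metrics(~pred, ~truth)["iou"]
    return (foreground + background) / 2


truth = np.array([[1, 1, 0, 0]])
pred = np.array([[1, 0, 1, 0]])
metrics = segmentation_metrics(pred, truth)   # TP = 1, FP = 1, FN = 1
```

For a single class, IoU and Dice are linked by IoU = Dice / (2 − Dice). The Dice and mIoU columns of Table 1 are consistent with this relation, for example 0.798 / (2 − 0.798) ≈ 0.664.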
Fig. 3. Visualization comparison of various methods on the testing dataset.

4.3 Ablation Study
In this section, we evaluate the effectiveness of the proposed components. Firstly, we remove the multi-modal inputs, denoted as ours w/o multi-modal. In this experiment, we use only the organoid image to generate the segmentation results. The absence of multi-modal inputs decreases our model’s robustness, leading to misidentification of the culture medium or air bubbles as organoids. We also evaluate feature fusion with a more complex network, for example ResNet [37], denoted as ours w/ResNet. In this experiment, we feed the spatial and frequency domain data to ResNet to extract and fuse these features into 3-channel features. However, the segmentation result is very poor, as ResNet destroys the positional relationship of the patches, making the Transformer learn invalid features from the patch sequences. Moreover, we eliminate the compact loss (denoted as ours w/o $\mathcal{L}_c$) and the auxiliary loss (denoted as ours w/o $\mathcal{L}_a$). The absence of $\mathcal{L}_c$ leads to irregular shapes, including fragmented boundaries and holes that violate rational organoid shapes. Additionally, some small organoids are lost without $\mathcal{L}_a$. The quantitative results in Table 2 show that each proposed component has a positive effect on organoid segmentation.
Fig. 4. Results of our method on different tissues.
Table 1. Quantitative results of the baselines and our method on different tissues. We compute Dice, mIoU, Precision, Recall, and F1-score of the different methods on each tissue.

Tissue  Model           Dice↑   mIoU↑   Precision↑  Recall↑  F1-score↑
ACC     SegNet [34]     0.798   0.664   0.579       0.803    0.630
ACC     A-Unet [35]     0.884   0.791   0.671       0.952    0.783
ACC     OrganoID [13]   0.848   0.736   0.622       0.866    0.716
ACC     Ours            0.913   0.840   0.791       0.903    0.843
Colon   SegNet [34]     0.864   0.761   0.742       0.769    0.738
Colon   A-Unet [35]     0.877   0.781   0.645       0.952    0.764
Colon   OrganoID [13]   0.867   0.766   0.674       0.850    0.745
Colon   Ours            0.919   0.851   0.786       0.918    0.844
Lung    SegNet [34]     0.877   0.781   0.903       0.730    0.801
Lung    A-Unet [35]     0.948   0.900   0.892       0.946    0.917
Lung    OrganoID [13]   0.911   0.836   0.794       0.938    0.858
Lung    Ours            0.946   0.898   0.921       0.910    0.915
PDAC    SegNet [34]     0.875   0.778   0.740       0.855    0.783
PDAC    A-Unet [35]     0.889   0.801   0.763       0.875    0.806
PDAC    OrganoID [13]   0.859   0.752   0.702       0.836    0.752
PDAC    Ours            0.898   0.814   0.778       0.885    0.821
Table 2. Quantitative results of the ablation study on all tissues. Our proposed components contribute to organoid segmentation.

Model                  Dice    mIoU    Precision  Recall  F1-score
Ours w/ResNet          0.476   0.312   0.078      0.008   0.012
Ours w/o Multi-modal   0.879   0.785   0.691      0.917   0.781
Ours w/o L_c           0.843   0.728   0.595      0.943   0.719
Ours w/o L_a           0.881   0.788   0.688      0.934   0.785
Ours                   0.916   0.846   0.813      0.901   0.850
5 Conclusion
In this paper, we are the first to introduce the Transformer architecture into organoid segmentation, and we propose TransOrga, an end-to-end multi-modal organoid segmentation method based on the Transformer and on spatial and frequency domain features. Given the organoid image, TransOrga obtains its saliency map, which contains frequency features, and combines it with the organoid image using the MFE module. The multi-branch aggregation decoder is designed to learn diverse contexts from various Transformer layers to predict the segmentation mask progressively. Additionally, a set of losses, including focal loss, dice loss, compact loss, and auxiliary loss, supervises TransOrga to predict more accurate segmentation results with rational sizes and shapes. The extensive experiments demonstrate that our proposed method outperforms the baselines in organoid segmentation, providing an automatic, robust, and fluorescent-free tool for organoid research. In future work, we will extend our method to organoid tracking in order to capture the division, differentiation, and movement of organoids.
Research Grants. This work was supported by the Xinjiang Tianchi Talents Program (E33B9401).
References
1. Kretzschmar, K., Clevers, H.: Organoids: modeling development and the stem cell niche in a dish. Dev. Cell 38(6), 590–600 (2016)
2. Dutta, D., Heo, I., Clevers, H.: Disease modeling in stem cell-derived 3D organoid systems. Trends Mol. Med. 23(5), 393–410 (2017)
3. Sachs, N., et al.: A living biobank of breast cancer organoids captures disease heterogeneity. Cell 172(1–2), 373–386 (2018)
4. Kim, S., et al.: Comparison of cell and organoid-level analysis of patient-derived 3D organoids to evaluate tumor cell growth dynamics and drug response. SLAS Discov. 25(7), 744–754 (2020)
5. Dekkers, J.F., et al.: High-resolution 3D imaging of fixed and cleared organoids. Nat. Protoc. 14(6), 1756–1771 (2019)
6. Hof, L., et al.: Long-term live imaging and multiscale analysis identify heterogeneity and core principles of epithelial organoid morphogenesis. BMC Biol. 19, 1–22 (2021)
7. Mead, B.E., et al.: Screening for modulators of the cellular composition of gut epithelia via organoid models of intestinal stem cell differentiation. Nat. Biomed. Eng. 6(4), 476–494 (2022)
8. Brandenberg, N., et al.: High-throughput automated organoid culture via stem-cell aggregation in microcavity arrays. Nat. Biomed. Eng. 4(9), 863–874 (2020)
9. Borten, M.A., et al.: Automated brightfield morphometry of 3D organoid populations by OrganoSeg. Sci. Rep. 8(1), 5319 (2018)
10. Kassis, T., et al.: OrgaQuant: human intestinal organoid localization and quantification using deep convolutional neural networks. Sci. Rep. 9(1), 1–7 (2019)
11. Kok, R.N.U., et al.: OrganoidTracker: efficient cell tracking using machine learning and manual error correction. PLoS ONE 15(10), e0240802 (2020)
12. Larsen, B.M., et al.: A pan-cancer organoid platform for precision medicine. Cell Rep. 36(4), 109429 (2021)
13. Matthews, J.M., et al.: OrganoID: a versatile deep learning platform for tracking and analysis of single-organoid dynamics. PLOS Comput. Biol. 18(11), e1010584 (2022)
14. Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
15. Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
16. Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)
17. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
18. Zhu, X., et al.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR (2021)
19. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. In: ICLR (2021)
20. Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)
21. Wang, H., et al.: MaX-DeepLab: end-to-end panoptic segmentation with mask transformers. In: CVPR (2021)
22. Wang, Y., et al.: End-to-end video instance segmentation with transformers. In: CVPR (2021)
23. Huang, L., Tan, J., Liu, J., Yuan, J.: Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 17–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2_2
24. Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers. In: CVPR (2021)
25. Liu, R., et al.: End-to-end lane shape prediction with transformers. In: CVPR (2021)
26. Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M.: Medical transformer: gated axial-attention for medical image segmentation. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol. 12901, pp. 36–46. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87193-2_4
27. Cao, H., et al.: Swin-unet: Unet-like pure transformer for medical image segmentation. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds.) ECCV 2022. LNCS, vol. 13803, pp. 205–218. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-25066-8_9
28. Li, J., et al.: CDX-NET: cross-domain multi-feature fusion modeling via deep neural networks for multivariate time series forecasting in AIOps. In: ICASSP (2022)
29. Lin, T.-Y., et al.: Focal loss for dense object detection. In: ICCV (2017)
30. Milletari, F., Navab, N., Ahmadi, S.-A.: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 3DV (2016)
31. Li, W., Goodchild, M.F., Church, R.: An efficient measure of compactness for two-dimensional shapes and its application in regionalization problems. IJGIS 27, 1227–1250 (2013)
32. Liu, Q., Dou, Q., Heng, P.-A.: Shape-aware meta-learning for generalizing prostate MRI segmentation to unseen domains. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12262, pp. 475–485. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59713-9_46
33. Paszke, A., et al.: Automatic differentiation in PyTorch (2017)
34. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. TPAMI 39, 2481–2495 (2017)
35. Oktay, O., et al.: Attention U-net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)
36. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
37. He, K., et al.: Deep residual learning for image recognition. In: CVPR (2016)
GPU Optimization of Biological Macromolecule Multi-tilt Electron Tomography Reconstruction Algorithm

Zi-Ang Fu1, Xiaohua Wan2, and Fa Zhang2(B)

1 School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, Gansu, China
[email protected]
2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
{wanxiaohua,zhangfa}@ict.ac.cn
Abstract. Three-dimensional (3D) reconstruction in cryo-electron tomography (cryo-ET) plays an important role in studying in situ biological macromolecular structures at the nanometer level. Owing to the limited tilt angle, 3D reconstruction in cryo-ET always suffers from a “missing wedge” problem, which causes severe accuracy degradation. Multi-tilt reconstruction is an effective method to reduce artifacts and suppress the effect of the missing wedge. As the number of tilt series increases, the large data size causes high computation and memory overhead. Limited by memory, multi-tilt reconstruction cannot be performed in parallel on GPUs, especially when the image size reaches 1 K, 2 K, or even larger. To optimize large-scale multi-tilt reconstruction in cryo-ET, we propose a new GPU-based large-scale multi-tilt tomographic reconstruction algorithm (GM-SIRT). Furthermore, we design a two-level data partition strategy in GM-SIRT to greatly reduce the memory required throughout the reconstruction process. Experimental results show that GM-SIRT performs significantly better than DM-SIRT, the distributed multi-tilt reconstruction algorithm for CPU clusters. The acceleration ratio is over 300%, and the memory requirement drops to about one-third of that of DM-SIRT when the image size reaches 2 K.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 473–484, 2023. https://doi.org/10.1007/978-981-99-4749-2_40

1 Introduction

Cryo-EM (electron microscopy), in which biological samples are frozen at ultra-low temperature before imaging, has become popular for studying the structures of protein macromolecules at near-atomic resolution [1]. Cryo-electron tomography (cryo-ET) is an indispensable cryo-EM tool for structural biology, used to visualize and understand macromolecular complexes at sub-molecular resolution [2]. Many biological structures, such as SARS-CoV-2, can be interpreted in their native state by cryo-ET [3–5]. In cryo-ET, the three-dimensional (3D) density of a biological sample is generated from a series of two-dimensional (2D) micrographs, i.e. a tilt series, acquired at different orientations by tilting the specimen around an axis. Because of equipment limitations, the angular tilt range during data collection is limited to −70° to +70°. The resulting incomplete data, known as the “missing wedge”, cause severe artifacts and seriously degrade the accuracy of reconstruction.

One solution to compensate for the “missing wedge” is to acquire multiple tilt series by rotating the sample in a plane. Performing 3D reconstruction with multi-tilt data reduces the impact of the “missing wedge” problem, and multi-tilt tomographic reconstruction has recently proved effective in improving resolution [6, 7]. However, multi-tilt reconstruction requires the data of all axes in each iteration, so as the size of the reconstructed macromolecules increases, the huge memory and computing resources required limit the application of multi-tilt reconstruction algorithms.

To address the inability of multi-tilt reconstruction to handle large-scale data, the DM-SIRT algorithm [8] was proposed in 2019. The main workflow of DM-SIRT is to divide the whole data set into subsets that are reconstructed separately, and then to combine all the sub-solutions to obtain the final result. However, two obstacles prevent DM-SIRT from achieving high-performance parallel multi-tilt reconstruction. First, when the data resolution reaches 1 K and the number of tilts exceeds 4, the time required for reconstruction on a CPU cluster is unacceptable. Second, DM-SIRT requires huge memory resources, which seriously affects its performance: when the image size reaches 2 K, the memory occupied by a single thread reaches 20 GB.

To solve these problems in DM-SIRT, we propose a new parallel framework named GM-SIRT (GPU-based Multi-tilt SIRT) that performs multi-tilt reconstruction in an acceptable time. Furthermore, a two-level data partition strategy is designed in GM-SIRT to greatly reduce the memory required throughout the reconstruction process.

The rest of this article is organized as follows. Section 2 reviews related work, Sect. 3 proposes GM-SIRT and explains its principle and algorithm, Sect. 4 presents the experimental results, and Sect. 5 summarizes and discusses the paper.
2 Related Works

2.1 “Missing Wedge”

In cryo-ET, a single-tilt series is taken with an electron microscope by tilting a biological sample to different angles around an axis perpendicular to the electron beam. However, owing to the physical limitations of the microscope, the sample can only be rotated within ±70°, at small increments of 1–2°, which results in the “missing wedge” problem: the acquired data cannot cover all angles. The resulting lack of information in Fourier space can seriously affect the accuracy of the final 3D reconstruction, as shown in Fig. 1.

For a double-tilt series, the sample is first tilted around a single fixed axis, and then the axis is rotated by 90° to obtain the second tilt series. Multi-tilt series are obtained by increasing the number of axes, which can be extended to 8-tilt or 16-tilt series. Multi-tilt data can be considered a combination of N single-tilt data sets with different rotation axes. When N is 2, the two tilt axes are always perpendicular to each other. For multi-tilt reconstruction, the images from different tilts can be aligned to a global coordinate system to ensure the precision of reconstruction. Common values of N include 2, 4, 8, and 16. The method proposed in this paper can perform 3D reconstruction of large-scale data with more than 8 tilts.
Fig. 1. Missing information in Fourier Space [9]
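As a rough illustration of how much information a single-axis tilt series leaves unsampled, the sketch below computes the fraction of Fourier space inside the missing wedge for a tilt range of ±α, using the idealized geometric relation (90° − α)/90°. The function name is illustrative and does not come from the paper, and the estimate ignores sampling density and specimen thickness.

```python
def missing_wedge_fraction(max_tilt_deg: float) -> float:
    """Fraction of Fourier space left unsampled by a single-axis tilt
    series covering -max_tilt_deg .. +max_tilt_deg (idealized geometry)."""
    if not 0.0 <= max_tilt_deg <= 90.0:
        raise ValueError("max_tilt_deg must lie in [0, 90]")
    # The sampled angular range is 2 * max_tilt out of 180 degrees.
    return (90.0 - max_tilt_deg) / 90.0


# Tilt ranges mentioned in the text: +/-70 deg (microscope limit)
# and +/-60 deg (range used for the data set in Sect. 4.1).
for alpha in (60.0, 70.0):
    print(f"+/-{alpha:.0f} deg: {missing_wedge_fraction(alpha):.1%} missing")
```

Additional tilt axes fill part of this wedge, which is the motivation for the multi-tilt acquisition described above.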
2.2 Multi-tilt Reconstruction

One possible solution to the “missing wedge” problem is multi-tilt reconstruction. After collecting data from different rotation tilts, the missing information of one single-tilt series can be partially filled in by the other tilt series. The amount of missing information decreases as the number of tilts increases, as shown in Fig. 1, so the missing information can be restored to attain a high-resolution 3D model.

Traditional iterative algorithms such as SIRT [10], SART [11], ASART [12], and ART [13] have been widely adopted for 3D reconstruction in cryo-ET. They share a common idea: correct the model based on the difference between the projections generated by the computed model and those of the real model, as in Eq. (1).

$$x_i^{k+1} = \delta_U\left(P - R x^k\right) + x_i^k \tag{1}$$

Here $x_i^{k+1}$ denotes the value of pixel $i$ of the reconstruction model at the $(k+1)$-th iteration. $Rx^k$ refers to the projections of $x^k$ at all angles, where $R$ is the projection weight contribution matrix. $P - Rx^k$ denotes the difference between the true projection $P$ and the generated projection $Rx^k$. The function $\delta$ performs the inverse projection operation on this difference, which yields the correction value at each pixel of the 3D model.

In fact, the traditional multi-tilt reconstruction algorithm is similar to the gradient descent (GD) algorithm in the field of artificial intelligence, where $U$ in Eq. (1) denotes the set of projection data involved in the correction, similar to the “batch” in GD. When $U$ is large, or even the entire data set is used to update each pixel of the model, the result is highly accurate but quite time-consuming to compute; SIRT uses all data in each iteration. When $U$ is small, or only one projection point is used (e.g. ART [13]), the iteration is faster, but the result is affected by random errors. ART uses only one ray per iteration, and SART lies between ART and SIRT. Some algorithms (e.g. DWE [14], CAV [15]) attach weights to different pixels and perform weighted iterative updates.
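The update in Eq. (1) can be sketched on a toy problem. The code below is a minimal NumPy illustration of a SIRT-style iteration in which every ray contributes to every update (U is the full data set). The row and column normalization, the relaxation factor `relax`, and the 2 × 2 test image are assumptions chosen for illustration; they are not taken from the paper or from any cryo-ET package.

```python
import numpy as np


def sirt(R, p, n_iter=200, relax=1.0):
    """Toy SIRT: each iteration back-projects the residual of all rays."""
    row_sums = R.sum(axis=1)
    col_sums = R.sum(axis=0)
    # Normalize per ray and per pixel; empty rays/pixels get zero weight.
    inv_row = np.where(row_sums > 0, 1.0 / np.maximum(row_sums, 1e-12), 0.0)
    inv_col = np.where(col_sums > 0, 1.0 / np.maximum(col_sums, 1e-12), 0.0)
    x = np.zeros(R.shape[1])
    for _ in range(n_iter):
        residual = p - R @ x                      # P - R x^k
        x = x + relax * inv_col * (R.T @ (inv_row * residual))
    return x


# Hypothetical 2x2 image (4 pixels) seen by 5 rays:
# two row sums, two column sums, and one diagonal.
R = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 1]], dtype=float)
x_true = np.array([1.0, 2.0, 3.0, 4.0])
p = R @ x_true

x_rec = sirt(R, p)
print(np.max(np.abs(x_rec - x_true)))  # reconstruction error on the toy data
```

Replacing the full residual with the residual of a single ray per update would give an ART-style iteration, which is the trade-off between accuracy and speed discussed above.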
2.3 Consensus Optimization Strategies for Reconstruction

The DM-SIRT algorithm converts multi-tilt reconstruction into a consensus optimization problem. It divides the data into subsets and performs local optimization as shown in Eqs. (2) and (3), where $z_i^k$ is obtained by initializing $x_i^k$.

$$v_i^k = \alpha_i W_i^T\left(p_i - W_i z_i^k\right) + z_i^k \tag{2}$$

$$x_i^{k+1} = \rho\left(2v_i^k - z_i^k\right) + x_i^k \tag{3}$$

Finally, the global optimal solution is found by combining the local solutions. Each sub-problem is solved with the SIRT algorithm, and the global optimal solution is the mean of the local optimal solutions. The detailed algorithm and proof can be found in [8].
This method solves the problem of the rapid growth in computing resources required by traditional SIRT algorithms. However, in DM-SIRT each node reconstructs a complete large model from partial data, and the global solution is the weighted sum of all locally optimal solutions. As the size of the macromolecule increases, DM-SIRT requires pixel-by-pixel computation of the whole 3D model in each thread. Each thread therefore needs huge memory and time resources, which lowers efficiency. With large-scale data of 1 K resolution and more than four tilts, DM-SIRT cannot deliver satisfactory performance.

This work modifies the DM-SIRT algorithm to run 3D reconstruction of large-scale data on GPUs. We modify the grouping strategy and use GPUs to achieve pixel-level parallelism within the optimization task of a single subset. We also propose a multi-level reconstruction strategy that reduces the amount of computation each thread undertakes and increases parallelism. A significant increase in computing speed is achieved compared with DM-SIRT, which makes multi-tilt reconstruction usable for current ultra-large-scale electron microscopy data.
3 Methodology

3.1 GPU-Based Multi-tilt Reconstruction Algorithm

$P \in \mathbb{Q}^{n \times n \times m}$ denotes the projections of the model, where $n \times n$ is the size of a single projection and $m$ is the number of projections. $x \in \mathbb{Q}^{l \times w \times h}$ denotes the grayscale value of each 3D pixel of the model, where $l$, $w$, and $h$ are the length, width, and height of the model, respectively, and $\mathbb{Q}$ is the set of rational numbers. $r_{ij}$ is the influence weight of the $j$-th pixel of the 3D model on the $i$-th projection point, and $R$ denotes the weight matrix. The goal of the tomographic reconstruction algorithm is then to solve Eq. (4) for $x$, given $P$ and $R$.

$$P = Rx \tag{4}$$

Solving Eq. (4) directly via $x = R^{-1}P$ requires the inverse matrix of $R$, which is very time-consuming to compute. We therefore use an optimization strategy, i.e. we solve Eq. (5) to obtain the optimal $x$.

$$x^{*} = \arg\min_{x} f(x) = \arg\min_{x} \|Rx - P\|^{2} \tag{5}$$
$f(x)$ can be considered as the loss function between the current projection $Rx$ and the real projection $P$. To perform multi-tilt reconstruction by processing all tasks in parallel, the whole dataset is divided into smaller subsets for local optimization. The loss of the whole system is the sum of the losses in all subsets, as in Eq. (6).

$$F(x) = \sum_{i=1}^{n} f_i(x) \tag{6}$$
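For reference, the link between the least-squares objective in Eq. (5) and the iterative correction in Eq. (1) can be made explicit. The short derivation below is standard calculus; the step size $\lambda$ is a symbol introduced here for illustration and does not appear in the paper.

```latex
\begin{aligned}
f(x) &= \|Rx - P\|^{2}, \\
\nabla f(x) &= 2R^{T}(Rx - P), \\
x^{k+1} &= x^{k} - \tfrac{\lambda}{2}\,\nabla f(x^{k})
         = x^{k} + \lambda R^{T}\bigl(P - Rx^{k}\bigr).
\end{aligned}
```

In this reading, $R^{T}$ plays the role of the inverse projection $\delta$ in Eq. (1), up to the per-ray and per-pixel normalization that SIRT-type methods apply, and restricting the residual to the rays of one subset gives the gradient of the corresponding $f_i$ in Eq. (6).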
Each subset performs SIRT iterations separately, and the local optimization yields a local model. The optimization goal of every subset is to reconstruct the same 3D model. In other words, DM-SIRT aims to compute the global optimum $x^{*}$ that minimizes $F(x)$. Equation (7) is the standard form of a proximal mapping function, and Eq. (5) can be expressed in the standard form of proximal optimization.

$$G_i(x) = \arg\min_{v}\left\{\frac{\|v - x\|^{2}}{2\rho^{2}} + g_i(v)\right\} \tag{7}$$
According to the work of Buzzard et al. [16] and DM-SIRT, a distributed optimization problem satisfying Eq. (7) can be considered a consensus optimization problem. Each subset is optimized separately as independent data, and a global solution is sought that fits all subsets as well as possible. Buzzard et al. proposed a theoretical framework for solving such multi-agent consensus optimization problems, from which an iterative strategy for multi-tilt reconstruction can be derived. The update strategy for each subset in GM-SIRT (GPU-based Multi-tilt Reconstruction Algorithm) is given by Eqs. (8) and (9).

$$w_i^{k+1} = \rho\left(2x_i^{k+1} - x_i^{k}\right) + (1 - \rho)\,w_i^{k} \tag{8}$$

$$x_i^{k+1} = \sum_{j} \mathrm{SIRT}\left(s_{ij}^{k},\, p_i\right) \tag{9}$$
Each thread loads a subset of the data and maintains an intermediate vector $w$, which is initialized to $2w_i - x_i$ at each iteration in the subset. The global optimal solution of the next round is the average of the $w_i$ over all threads. $w_i$ is updated as in Eq. (8). $x_i^{k+1}$ is the local solution of the $i$-th subset at the $(k+1)$-th iteration; the traditional SIRT method is used for the local optimization, as in Eq. (9). The SIRT update follows Eq. (2), which means that each sub-model is corrected using the difference between the real projections and the computed projections generated from $x_i^k$.

A further data partitioning strategy is designed to improve parallelism. The tasks of each node are assigned to GPUs, and each GPU reconstructs some slices of the node's model with SIRT. The slices computed on the individual GPUs are then combined to generate a complete result. $s_{ij}^{k}$ denotes the $j$-th slice of model $x_i^k$. The operator over $j$ in Eq. (9) does not denote summation but appending: it combines all slices into one complete reconstruction model.
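One possible reading of Eqs. (8) and (9) is sketched below in NumPy on a toy volume. It is not the authors' implementation: the starting point of the local SIRT (taken here to be $w_i$), the number of sweeps, the toy ray matrices, and all function names are assumptions made for illustration, and how the averaged model is fed back into the subsets is not specified in this section, so the sketch only reports the average.

```python
import numpy as np


def sirt_sweeps(R, p, x0, n_sweeps=5, relax=1.0):
    """Run a few SIRT sweeps on one slice, starting from x0.
    Assumes every ray and every pixel has a non-zero weight sum."""
    inv_row = 1.0 / R.sum(axis=1)
    inv_col = 1.0 / R.sum(axis=0)
    x = np.array(x0, dtype=float)
    for _ in range(n_sweeps):
        x = x + relax * inv_col * (R.T @ (inv_row * (p - R @ x)))
    return x


def local_update(start, R, p, n_sweeps=5):
    """Eq. (9): reconstruct each slice independently, then append the slices."""
    return np.stack([sirt_sweeps(R, p_j, s_j, n_sweeps)
                     for s_j, p_j in zip(start, p)])


def outer_iteration(w, x_old, R_list, p_list, rho=0.5):
    """One outer iteration over all subsets.
    Returns the new w_i, the new x_i, and the averaged global model."""
    w_new, x_new = [], []
    for w_i, x_i, R_i, p_i in zip(w, x_old, R_list, p_list):
        x_next = local_update(w_i, R_i, p_i)   # assumption: SIRT starts from w_i
        w_new.append(rho * (2.0 * x_next - x_i) + (1.0 - rho) * w_i)  # Eq. (8)
        x_new.append(x_next)
    return w_new, x_new, np.mean(w_new, axis=0)


# Toy data (hypothetical): 3 slices of 4 pixels, two subsets with 3 rays each.
R_full = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0],
                   [0, 1, 0, 1], [1, 0, 0, 1]], dtype=float)
R_list = [R_full[[0, 1, 4]], R_full[[2, 3, 4]]]
x_true = np.arange(12.0).reshape(3, 4)
p_list = [x_true @ R_i.T for R_i in R_list]

w = [np.zeros_like(x_true) for _ in R_list]
x = [np.zeros_like(x_true) for _ in R_list]
for _ in range(20):
    w, x, model = outer_iteration(w, x, R_list, p_list)
print(model.shape)  # one averaged model with the same shape as the volume
```

In a multi-GPU setting, each call inside `local_update` would correspond to the work of one GPU on its slice set, and `np.stack` stands in for the gather step that appends the slices.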
3.2 Multi-tilt Reconstruction Framework

The input of the algorithm is an initial model and projections at different angles, and the main flow is shown in Fig. 2. Local optimization requires a precomputed initial model, and the SIRT algorithm requires the projection data and the angle data. For current 4-tilt, 8-tilt, or 16-tilt data, high-precision results are generally produced after about 25 rounds of iterations.

Fig. 2. Multi-tilt reconstruction framework for producing a 3D result from projection data.
The pseudocode of the program on a single GPU is shown in Algorithm 1. The initial model is computed using the back-projection algorithm (BPT). The data are divided into n subsets, and each node generates an initial 3D model $z_i$ (corresponding to $x_i^k$ in Fig. 2) by initializing the model $w_i$. The tasks are then divided according to the number of GPUs in the node (M denotes the number of GPUs). $z_i$ is divided into slices of the same size, which are assigned to different GPUs (the “scatter” at line 7), and the SIRT algorithm is then used to compute the partial models separately. After local optimization, the iteration results $v_i$ are gathered to the main thread (the “gather” at line 9). At the end of the iteration, $w_i$ is updated with the models $z_i$ and $v_i$. The global iteration result is obtained as the average of all $w_i$ and compared with the result of the previous iteration. The algorithm proceeds until the result no longer changes.
In GM-SIRT, the data are divided hierarchically. Firstly, the data are divided by tilts without overlap. Secondly, each subset is divided into several slices. One subset will be processed by one node, and one slice set will be processed by one GPU. Slice division is generally done according to the range of the z-axis. This algorithm can parallelize the reconstruction at the pixel level with multiple GPUs, and handle large-scale data of more than 1 K size on 16 tilts.
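The hierarchical division described above can be sketched as simple index bookkeeping. The helper names, the balanced contiguous split, and the example sizes below are assumptions for illustration only; the paper does not specify how uneven remainders are distributed.

```python
def split_range(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous, non-overlapping ranges."""
    base, extra = divmod(n_items, n_parts)
    ranges, start = [], 0
    for k in range(n_parts):
        stop = start + base + (1 if k < extra else 0)
        ranges.append(range(start, stop))
        start = stop
    return ranges


def two_level_plan(n_tilts, n_nodes, n_slices, gpus_per_node):
    """Level 1: tilts -> nodes (no overlap). Level 2: z-slices -> GPUs."""
    plan = {}
    for node, tilts in enumerate(split_range(n_tilts, n_nodes)):
        for gpu, z in enumerate(split_range(n_slices, gpus_per_node)):
            # Each GPU sees the node's tilts but only its own z-range.
            plan[(node, gpu)] = (list(tilts), (z.start, z.stop))
    return plan


# Hypothetical configuration: 4 tilts on 2 nodes, 4 GPUs per node,
# and a volume with 1024 slices along the z-axis.
plan = two_level_plan(n_tilts=4, n_nodes=2, n_slices=1024, gpus_per_node=4)
for key in sorted(plan):
    print(key, plan[key])
```

Because each GPU only holds its own z-range of the node's model, the per-device memory shrinks roughly in proportion to the number of GPUs per node under this scheme.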
4 Results and Discussions

4.1 Data Introduction

The dataset used in this paper, EEL-Crosscut, was acquired by the National Center for Microscopy and Imaging Research (NCMIR) in the USA using a 300 kV transmission electron microscope. There are 16 tilts of data in total, with a and b separated by 90°, and c and d separated by 90°, as shown in Fig. 3; the other tilts are arranged similarly. The projection angles for each tilt range from −60° to 60° at 1° intervals, so 121 images are taken per tilt. The projection size is 4096 × 4096, and the pixel size is 1.36 nm. The original dataset was compressed into compact datasets of 1 K × 1 K or 2 K × 2 K, and the angular data were formatted with the TxBR software [17].

Fig. 3. Multi-tilt data acquisition.

4.2 Experimental Results

Four methods were analyzed in this article. The first is the traditional iterative SIRT algorithm, which uses all data in each iteration without grouping. SIRT has the highest accuracy but needs much time and storage to produce a reconstruction result. The second is the DM-SIRT algorithm, with the total number of subsets set to 20, running on the Tianhe-II supercomputer. The third is the GPU-based multi-tilt reconstruction algorithm (GM-SIRT) proposed in this paper, running on Tianhe-II GPU nodes with NVIDIA Tesla K80 and Tesla V100 GPUs, respectively. The fourth is the simple BPT algorithm, which is not iterative and is used to produce the initial model for the iterations. The iteration step size of all iterative methods is set to 0.5. The 16-tilt projections of 1 K size were reconstructed for the first time with the GM-SIRT algorithm, and the results are shown in Fig. 4.
Fig. 4. The multi-tilt reconstruction slice. (a) Original projection. (b) GM-SIRT for 16-tilt data, 9 iterations. (c) GM-SIRT for 16-tilt data, 99 iterations.
Reconstruction Precision. Figure 4 shows the reconstruction results of different methods. We choose the normalized correlation coefficient (NCC) to evaluate reconstruction precision. NCC is calculated as in Eq. (10).

$$\mathrm{NCC}(I_1, I_2) = \frac{\sum (I_1 - \mu_1)(I_2 - \mu_2)}{\sqrt{\sum (I_1 - \mu_1)^{2} \sum (I_2 - \mu_2)^{2}}} \tag{10}$$
NCC compares the similarity between the projections of the real model and those of the reconstruction result, and its value represents the precision of the algorithm. NCC is mapped to [0, 1]; a higher NCC value means higher similarity between projections and therefore higher reconstruction precision. SIRT is one of the most accurate algorithms, but it consumes significant time and memory. GM-SIRT is more efficient while maintaining SIRT-level accuracy. The BPT algorithm is used to compute the initial model, and all algorithms use the same parameters. As shown in Fig. 5, the difference in NCC between SIRT and GM-SIRT is quite small at all 121 angles (Fig. 5(a)), with GM-SIRT slightly worse at some angles (Fig. 5(b)). The NCC values of DM-SIRT and GM-SIRT are too similar to be distinguished, so they are not drawn separately in Fig. 5(a). Compared with DM-SIRT, the GM-SIRT algorithm achieves satisfactory accuracy more efficiently.
Fig. 5. The NCC values of different methods. (a) NCC at all 121 angles. (b) NCC in the region where the difference between GM-SIRT and SIRT is largest.
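The NCC of Eq. (10) can be computed in a few lines. The sketch below is a plain NumPy version in which the sums run over all pixels of the two images; it returns the raw coefficient in [−1, 1], since the mapping to [0, 1] mentioned in the text is not specified. The function name and test images are illustrative.

```python
import numpy as np


def ncc(img1, img2):
    """Normalized correlation coefficient between two images of equal shape."""
    a = np.asarray(img1, dtype=float)
    b = np.asarray(img2, dtype=float)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    da = a - a.mean()                      # I1 - mu1
    db = b - b.mean()                      # I2 - mu2
    denom = np.sqrt((da ** 2).sum() * (db ** 2).sum())
    if denom == 0.0:
        raise ValueError("NCC is undefined for a constant image")
    return float((da * db).sum() / denom)


# Hypothetical projections: a random image and a noisy copy of it.
rng = np.random.default_rng(0)
real_projection = rng.random((64, 64))
reprojection = real_projection + 0.05 * rng.standard_normal((64, 64))
print(ncc(real_projection, reprojection))
```

Because the coefficient is invariant to brightness offsets and positive contrast scaling, it compares the structure of the two projections rather than their absolute gray levels.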
Table 1. Running time comparison between DM-SIRT and GM-SIRT.

Nodes                 2        4        8
CPU cores             48       96       192
GPUs                  8        16       32
DM-SIRT (min)         565      290      175
GM-SIRT (min)         55       42       42
Acceleration ratio    10.27    6.04     4.17
Performance Results. We compare the performance of DM-SIRT and GM-SIRT. Both algorithms run on the Tianhe-II supercomputer, with DM-SIRT on CPU nodes and GM-SIRT on V100 nodes. The 4-tilt data are used for the iterative tests. As shown in Table 1, GM-SIRT requires significantly less running time than DM-SIRT. Even in the worst case, the GPU acceleration ratio is over 400%.

Scalability. We run GM-SIRT on K80 and V100 nodes separately; the running times are shown in Table 2. The experiments were performed using 1 K × 1 K projections. As the numbers of nodes and tilts increase, the data of each tilt are always processed by 2 GPUs. As shown in Table 2, the performance of GM-SIRT grows linearly with the amount of data, which demonstrates the scalability of GM-SIRT.

Table 2. Running time of GM-SIRT on Tianhe-II.

Tilts               1     2     4      8      16
Nodes               1     1     2      4      4 (K80) / 8 (V100)
GPUs                2     4     8      16     16 (K80) / 32 (V100)
Time (min), K80     77    75    102    106    178
Time (min), V100    36    37    42     45     48
Storage Occupation. GM-SIRT can process large-scale data and reconstruct high-precision models with less memory. The memory required by the two methods is shown in Table 3. Both methods use 4-tilt data, and each tilt is divided into four subsets. When the image size is 512 × 512, DM-SIRT occupies less memory, but as the size increases to 2 K, the space required by DM-SIRT exceeds 20 GB, while the space required by GM-SIRT remains within 10 GB. When DM-SIRT performs high-precision reconstruction, the memory required by a single thread increases sharply, which limits the scalability of the algorithm. In GM-SIRT, by contrast, the memory required by a single thread grows slowly as the image size increases.

Table 3. Memory occupation at different resolutions.

Resolution      GM-SIRT (MB)    DM-SIRT (MB)
512 × 512       243.0           65
1024 × 1024     1028.7          4853
2048 × 2048     10134.5         29318
5 Conclusions

Based on DM-SIRT, this paper designs a GPU-based multi-tilt reconstruction algorithm (GM-SIRT) that improves the performance of multi-tilt reconstruction without loss of precision. As the image size increases, DM-SIRT becomes too time-consuming for large-scale 8-tilt or 16-tilt data, and its memory requirements become unacceptable. Compared with DM-SIRT, GM-SIRT can handle larger-scale data. In addition, GM-SIRT adopts a two-level data division strategy to reduce the memory requirements.

Acknowledgement. This work is supported by NSFC Grants #61932018, 32241027 and 62072441.
References

1. Tegunov, D., Liang, X., Dienemann, C., Cramer, P., Mahamid, J.: Multi-particle cryo-EM refinement with M visualizes ribosome-antibiotic complex at 3.5 Å in cells. Nat. Methods 18(2), 186–193 (2021)
2. Briggs, J.A.: Structural biology in situ–the potential of subtomogram averaging. Curr. Opin. Struct. Biol. 23(2), 261–267 (2013)
3. Turonova, B., Sikora, M., Schurmann, C., Hagen, W., Beck, M.: In situ structural analysis of SARS-CoV-2 spike reveals flexibility mediated by three hinges. Science 370(6513), 203–208 (2020)
4. Ke, Z., Oton, J., Qu, K., et al.: Structures and distributions of SARS-CoV-2 spike proteins on intact virions. Nature 588, 498–502 (2020)
5. Yao, H., Song, Y., Chen, Y., et al.: Molecular architecture of the SARS-CoV-2 virus. Cell 183(3), 730–738 (2020)
6. Phan, S., Boassa, D., Nguyen, P., Wan, X., et al.: 3D reconstruction of biological structures: automated procedures for alignment and reconstruction of multiple tilt series in electron tomography. Adv. Struct. Chem. Imag. 2(1), 8 (2017)
7. Xiao, W., Sabne, A., Kisner, S., et al.: High performance model-based image reconstruction. ACM SIGPLAN Not. 51(8), 1–12 (2016)
8. Wang, Z., Zhang, J., Gao, W., et al.: A consensus framework of distributed multiple-tilt reconstruction in electron tomography. J. Comput. Biol. 27(2), 212–222 (2020)
9. Frank, J.: Electron Tomography: Methods for Three-Dimensional Visualization of Structures in the Cell, 2nd edn. Springer, New York (2006). https://doi.org/10.1007/978-0-387-69008-7
10. Sorzano, C., Marabini, R., Boisset, N., et al.: The effect of overabundant projection directions on 3D reconstruction algorithms. J. Struct. Biol. 133(2–3), 108–118 (2021)
11. Andersen, A.H., Kak, A.C.: Simultaneous algebraic reconstruction technique (SART): a superior implementation of the ART algorithm. Ultrason. Imag. 6(1), 81–94 (1984)
12. Wan, X., Zhang, F., Chu, Q., et al.: Three-dimensional reconstruction using an adaptive simultaneous algebraic reconstruction technique in electron tomography. J. Struct. Biol. 175(3), 277–287 (2011)
13. Marabini, R., Herman, G.T., Carazo, J.M.: 3D reconstruction in electron microscopy using ART with smooth spherically symmetric volume elements (blobs). Ultramicroscopy 72(1–2), 53–65 (1998)
14. Echebest, N., Guardarucci, M.T., Scolnik, H., et al.: An accelerated iterative method with diagonally scaled oblique projections for solving linear feasibility problems. Ann. Oper. Res. 138, 235–257 (2005)
15. Censor, Y., Dan, G., Gordon, R.: Component averaging: an efficient iterative parallel algorithm for large and sparse unstructured problems. Parallel Comput. 27(6), 777–808 (2011)
16. Buzzard, G.T., Chan, S.H., Sreehari, S., et al.: Plug-and-play unplugged: optimization-free reconstruction using consensus equilibrium. SIAM J. Imag. Sci. 11(3), 2001–2020 (2018)
17. Lawrence, A., Bouwer, J.C., Perkins, G., et al.: Transform-based back projection for volume reconstruction of large format electron microscope tilt series. J. Struct. Biol. 154(2), 144–167 (2006)
Multi-task Question Generation Based Data Augmentation for Biomedical Answer Generation

Junting Zhao, Jun Bai, Wenge Rong(B), Yuanxin Ouyang, and Zhang Xiong

School of Computer Science and Engineering, Beihang University, Beijing, China
{zhaojt0705,ba1_jun,w.rong,oyyx,xiongz}@buaa.edu.cn
Abstract. Limited by the corpus size and the annotation cost, biomedical question answering (BioQA) is a task of great research value. To generate professional biomedical answers, we first propose a text-to-text multi-task question generation model, which improves the accuracy of domain question generation with two auxiliary tasks. Based on this, a multi-task QA pipeline system with filtering is designed to synthesize high-quality biomedical data. Then, we use three data augmentation strategies to conduct generative BioQA experiments on original and synthetic data. The results on the factoid BioASQ 7b, 8b, and 9b datasets demonstrate the effectiveness of our method.

Keywords: Biomedical Answer Generation · Multi-task Learning · Data Augmentation
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 485–496, 2023. https://doi.org/10.1007/978-981-99-4749-2_41

1 Introduction

The Biomedical Question Answering (BioQA) task aims to obtain answers to given questions from biomedical knowledge. Professionals and the general public can effectively learn and understand obscure biomedical knowledge through a BioQA system [8]. Current BioQA research tends to focus on extractive tasks [18], which are often limited by the size of the corpus and by expensive training costs. In contrast, generative QA models can understand questions and then generate answers from the original text. Such approaches offer higher diversity and can better adapt to the different needs of users [9].

Many current state-of-the-art QA models rely on pre-trained language models. However, most of them are pre-trained on general-domain corpora. Limited by the amount of data and the need for domain expertise, performance on existing biomedical tasks is often not as good as in the general domain [2]. Researchers may therefore consider applying data augmentation methods to BioQA tasks.

Data augmentation can increase the diversity of training samples on the basis of existing data. Methods such as lexical substitution focus on obtaining a copy that differs little from the original data [4]. However, when applied to data-scarce tasks, such methods do little to improve downstream QA tasks that use large pre-trained models. As a result, improving QA performance through automatically generated synthetic data is a long-term research goal. Question Generation (QG) can be applied to data augmentation in QA tasks and can help corpus development [12]. For example, existing samples can be combined with synthetic data produced by a QG system to train the QA model, or synthetic data alone can be used for training to improve QA performance in few-shot or zero-shot settings [1].

In this paper, we focus on biomedical answer generation, aiming to improve the performance of BioQA models under scarce knowledge through data augmentation based on multi-task question generation. We propose a multi-task QG model for generative data augmentation. Answer extraction (AE) and QA, which are closely related to the QG task, are used as auxiliary tasks to improve the prediction accuracy and generalization ability of the model [5]. Furthermore, we investigate the impact of three different ways of applying synthetic data on answer generation performance in the biomedical domain and demonstrate the effectiveness of our proposed method through experiments.
2 Related Work

Because of the limited data scale in the biomedical domain, pre-trained language models are widely used for BioQA tasks to achieve the best performance. Based on BioBERT, transfer learning is performed on large-scale general datasets, and the last layer is then modified for different downstream tasks, which significantly reduces the cost of BioQA systems [18]. Subsequently, the authors of [6] showed that the negative transfer caused by mixed-domain pre-training may hinder the performance of limited-domain tasks. Most existing research focuses on extractive QA tasks, in which the diversity of answers and the costs of manual labeling and training remain challenging [6, 18]. In contrast, generative QA tasks, which offer higher flexibility, urgently need breakthroughs [8].

Data augmentation has attracted interest in the NLP field owing to the popularity of pre-trained models that require large amounts of data [4]. In this field, the commonly used lexical substitution method attempts to replace existing words in the text without changing the meaning of the sentence, while random noise injection adds noise to the text to make the model more robust to perturbations [16]. Generative methods can also be used for data augmentation, since they can generate additional training data from existing data [12]. This direction remains relatively under-explored, possibly because the discreteness of text rules out continuous noise, which makes invariance harder to maintain.

Earlier QG research focused on template-based or rule-based approaches [7]. More recently, sequence-to-sequence neural network models have gradually become mainstream. The authors of [3] introduced an attention-based model for QG in reading comprehension. In recent years, pre-trained language models have transferred knowledge from data-rich pre-training tasks to downstream tasks, achieving state-of-the-art performance on QG [10].
3 Methodology
3.1 Task Definition and Framework Overview
Given a biomedical knowledge paragraph $P = [s_1, s_2, \ldots, s_l]$ and a question $Q = [q_1, q_2, \ldots, q_m]$ about entity names (e.g., names of diseases, drugs, or genes), numbers, or similar expressions in the text, we aim to generate an appropriate answer $A = [a_1, a_2, \ldots, a_n]$ according to the question Q and the context P. To achieve this goal, we first propose a multi-task QG model as a unified text-to-text framework that automatically generates the desired output according to the specific format of each task. On this basis, we use a multi-task QA pipeline system to augment and filter the BioQA dataset on a large amount of unlabeled biomedical text. Finally, three data augmentation strategies are used to fine-tune the pre-trained language model to obtain the best performance on generative factoid BioQA tasks. An overview of the proposed framework is shown in Fig. 1.
[Fig. 1 diagram labels: Input (Answer Extraction, Question Generation, Question Answering); Position Embedding; T5-Encoder; T5-Decoder; MultiTask-T5; Output]
Fig. 1. Overview of our multi-task QG based data augmentation model framework. The inputs of the answer extraction, question generation and question answering tasks are trained jointly in our multi-task system to obtain the task-specific outputs $y^{AE}$, $y^{QG}$ and $y^{QA}$. $H_e$ is the representation of the input sequence after passing through the encoder.
3.2 Multi-task Question Generation
Embedding Layer. Our model uses the same SentencePiece tokenizer as T5 [13] for data preprocessing, which takes the input sentence as a whole and then splits it into pieces. When encoding each word, in addition to the token embedding, the model uses a simplified relative position embedding to encode the position of each term. The position embedding, in scalar form, is added to the logits used to calculate the attention weights:
$$\mathrm{softmax}\left(QK^{T} + \text{position bias}\right)V \tag{1}$$
J. Zhao et al.
where the attention for a set of queries Q is computed simultaneously, with K and V being the corresponding keys and values, respectively.
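As a rough illustration of Eq. (1), the following sketch computes single-head attention with an additive position bias in plain Python. The helper `toy_relative_bias` is hypothetical: it looks up one scalar per clipped relative offset, which is a simplification assumed here for readability and not necessarily the bucketing scheme used by T5 or by the authors.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_position_bias(Q, K, V, bias):
    """Return softmax(Q K^T + bias) V and the attention weights (single head)."""
    weights = []
    for i, q in enumerate(Q):
        # Dot-product logit plus the scalar position bias for the pair (i, j).
        logits = [sum(a * b for a, b in zip(q, k)) + bias[i][j]
                  for j, k in enumerate(K)]
        weights.append(softmax(logits))
    dim_v = len(V[0])
    output = [[sum(w[j] * V[j][d] for j in range(len(V))) for d in range(dim_v)]
              for w in weights]
    return output, weights

def toy_relative_bias(n, table):
    """Hypothetical bias: one scalar per clipped offset (j - i); table has odd length."""
    max_dist = (len(table) - 1) // 2
    return [[table[max(-max_dist, min(max_dist, j - i)) + max_dist]
             for j in range(n)] for i in range(n)]
```

With an all-zero bias table the weights depend only on the dot products; a large entry for offset 0 pushes each position to attend mostly to itself.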
[Fig. 2 example inputs and outputs of MultiTask-T5]
AE input: “extract answers: The role of LOX and LOXL2 in scar formation after glaucoma surgery. …… a humanized monoclonal antibody derived from GS-607601.” Output: “LOXL2”
QG input: “answer: LOXL2 context: The role of LOX and LOXL2 in scar formation after glaucoma surgery. …… a humanized monoclonal antibody derived from GS-607601.” Output: “What is the drug target for Simtuzumab?”
QA input: “question: What is the drug target for Simtuzumab? context: The role of LOX and LOXL2 in scar formation after glaucoma surgery. …… a humanized monoclonal antibody derived from GS-607601.” Output: “LOXL2”
Fig. 2. The task format of our multi-task model. The prefix of AE task is “extract answers:”; for QG task is “answer: {answer} context: {context}”; and QA task is input in the format of “question: {question} context: {context}”.
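The three input formats in Fig. 2 can be sketched as a small formatting helper. The function name `format_input` and its argument names are assumptions made for illustration; only the prefix strings come from the caption.

```python
def format_input(task, context, answer=None, question=None):
    """Build the text-to-text input string for one of the three tasks."""
    if task == "AE":
        # Answer extraction: the passage alone, behind the AE prefix.
        return f"extract answers: {context}"
    if task == "QG":
        # Question generation: a given answer plus its context.
        if answer is None:
            raise ValueError("QG needs an answer")
        return f"answer: {answer} context: {context}"
    if task == "QA":
        # Question answering: a question plus its context.
        if question is None:
            raise ValueError("QA needs a question")
        return f"question: {question} context: {context}"
    raise ValueError(f"unknown task: {task}")
```

Keeping the prefix logic in one place makes it easy to mix samples from all three tasks in a single training set, since every sample is just a string pair.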
The representation $X_i$ of each word $w_i$ is equal to the sum of its token embedding $TE(w_i)$ and position embedding $PE(w_i)$:
$$X_i = TE(w_i) + PE(w_i) \tag{2}$$
Task-Specific Layer. In addition to the QG task, AE and QA, which correlate strongly with QG, are used as auxiliary tasks to improve the performance of biomedical QG. The QA task aims to generate relevant answers given biomedical passages and questions. AE, on the other hand, is a text-to-format task. Sentences that may contain answers are highlighted one by one with a special token <hl> before being input into the model, so that the text-to-text pre-trained model can pay more attention to the highlighted sentence. Each extracted answer is followed by a <sep> token; if multiple answers are extracted from the same sentence, the <sep> token separates them. The input of the model is defined as:
$$\mathrm{Input} = \{P^{T}, X^{T}, y^{T}\} \tag{3}$$
where $T \in \{AE, QA, QG\}$ is the task corresponding to each training sample, and $P^{T}$ is the prompt text, which uses a task-specific prefix to exploit the rich semantic information stored in the pre-trained model. The relevant format is shown in Fig. 2. $X^{T}$ is the input of each subtask, and its content depends on the specific task. In the AE task, the input is an unstructured biomedical passage. The QA task takes the biomedical context and a professional question about the text as input. In the QG task, the input is the concatenation of a biomedical answer and the associated knowledge text. They are arranged as follows:
$$X^{AE} = [s_1, s_2, \ldots, s_l] \tag{4}$$
$$X^{QA} = [q_1, q_2, \ldots, q_m, s_1, s_2, \ldots, s_l] \tag{5}$$
$$X^{QG} = [a_1, a_2, \ldots, a_n, s_1, s_2, \ldots, s_l] \tag{6}$$
[Fig. 3 diagram labels. Left (Synthetic QA Generation, from PubMed text): 1 Answer Extraction, 2 Question Generation, 3 Question Answering, 4 Roundtrip Filtering, SYN Data. Right (Data Augmentation): T5 trained with SRC+SYN, SRC → SYN, and SYN → SRC to produce the Biomedical Answer.]
Fig. 3. Left: Synthetic QA data generation process using our multi-task QA pipeline system. Right: From top to bottom are the generative QA model training process using SRC + SYN, SRC → SYN and SYN → SRC strategies. SRC is the original BioASQ dataset, SYN is the synthetic dataset, and T5 is the model used to generate answers.
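The four-step pipeline on the left of Fig. 3 can be sketched as follows. The three model calls are passed in as plain callables with hypothetical names, and the filtering criterion (keep a pair only if the QA answer matches the extracted answer after simple normalization) is an assumption; the excerpt does not spell out the exact rule used by the authors.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())

def generate_synthetic_qa(passages, extract_answers, generate_question, answer_question):
    """Run AE -> QG -> QA on each passage and keep roundtrip-consistent pairs."""
    synthetic = []
    for passage in passages:
        for answer in extract_answers(passage):              # step 1: answer extraction
            question = generate_question(answer, passage)    # step 2: question generation
            predicted = answer_question(question, passage)   # step 3: question answering
            # Step 4: roundtrip filtering (assumed exact match after normalization).
            if normalize(predicted) == normalize(answer):
                synthetic.append({"context": passage,
                                  "question": question,
                                  "answer": answer})
    return synthetic
```

In practice each callable would wrap the same multi-task model with a different input prefix; injecting them as arguments keeps the sketch independent of any particular library.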
Accordingly, $y^{T}$ is the output sequence corresponding to task T, whose form depends on the specific task.
Loss. Our multi-task model's loss is computed with a unified cross-entropy loss function for each training sample. Given a training sample, the loss function is as follows:
$$L_\theta = -\sum_{l=1}^{L^{T}} \log P_\theta\left(y_l^{T} \mid y_{<l}^{T}, \ldots\right)$$
[…] 0. In our model, the number of attention layers is set to 2, with each layer containing a two-head self-attention module; the outputs of the self-attention heads are then concatenated so that each amino acid is eventually represented as a continuous vector. As shown in Fig. 2, given an attention vector, we generate a comparison view of the input peptide sequence by masking a certain proportion of its amino acids. To investigate the effect of masking different parts, we tried three masking strategies:
(1) Maximal attention masking: the r% of amino acid residues with the largest attention weights are masked. In each round, this generates an augmented view that differs most from the original view of the input peptide sequence.
(2) Minimal attention masking: the r% of amino acid residues with the smallest attention weights are masked. In each round, this generates an augmented view that differs least from the original view of the input peptide sequence.
(3) Random masking: r% of the amino acid residues are randomly selected for masking. This strategy is frequently used in previous studies and serves as a comparison for the attention-based masking approaches.
Generating augmented views with attention-aware masking helps reveal important amino acids. This approach also enriches the diversity of negative samples, which facilitates the learning of expressive and robust representations of peptide sequences.
Contrastive Learning. The goal of contrastive learning is to capture the true data structure from large-scale unlabelled data. In a latent vector space, contrastive learning maximises the similarity between the contrast-view representations of positive samples and minimises the similarity between those of negative samples. As a result, the pretrained Transformer encoder outputs different feature representations based on different
P. Luo et al.
Fig. 2. Description of the three masking strategies used to generate the comparison view.
instances [15]. Formally, the input peptide sequence and its augmented view are transformed into vectors $h_i$ and $h_i'$, which are then mapped to $z_i$ and $z_i'$ by a non-linear projection layer. Next, the similarity $\mathrm{sim}(z_i, z_i')$ between the two projection vectors is calculated. Cosine similarity is a common way to assess the similarity of two views of the same sample, and we use it here. Finally, we use the Normalized Temperature-scaled Cross Entropy (NT-Xent) as the loss function for contrastive learning:
$$L_{cs} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_i')/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \tag{3}$$
where $\mathbb{1}_{[k \neq i]}$ is an indicator function that takes the value 1 when $k \neq i$ and 0 otherwise; $\tau$ denotes the temperature coefficient; and N is the size of the mini-batch.
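A minimal sketch of the NT-Xent loss, written in plain Python for a mini-batch of N projected vectors and their N augmented views. It follows the standard formulation (negative log of the positive-pair term over all other views) and averages over the 2N anchors; the default temperature of 0.5 is an assumption, as the excerpt does not state the value used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(z, z_aug, tau=0.5):
    """Mean NT-Xent loss; z[i] and z_aug[i] form the positive pair."""
    n = len(z)
    views = list(z) + list(z_aug)            # 2N views in total
    total = 0.0
    for i, anchor in enumerate(views):
        pos = (i + n) % (2 * n)              # index of the matching view
        numerator = math.exp(cosine(anchor, views[pos]) / tau)
        # Sum over every other view k != i (the positive is included).
        denominator = sum(math.exp(cosine(anchor, views[k]) / tau)
                          for k in range(2 * n) if k != i)
        total += -math.log(numerator / denominator)
    return total / (2 * n)
```

The loss is small when each view is closest to its own partner and grows when partners are mismatched, which is the behaviour the pre-training objective relies on.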
2.4 Transfer to pHLA Binding Prediction
LSTM is a type of recurrent neural network [16]. Its design is well suited to modelling sequential data such as text and time series. Bi-directional LSTMs (Bi-LSTMs) capture text patterns better by combining forward and backward LSTMs [17], and have been successfully used for antibacterial and antifungal peptide prediction [18, 19]. We constructed a Bi-LSTM with an attention mechanism: after sequence embedding, the HLA alleles are input into the Bi-LSTM to obtain a better feature representation of the HLA alleles. We then concatenated the peptide sequence representations obtained through pre-training with the HLA representations obtained through the Bi-LSTM and input them into the prediction module for pHLA binding prediction. In the transfer learning phase, we did not freeze the parameters of the pre-trained Transformer used to encode the peptide sequences. Therefore, the learnable parameters of the peptide sequence encoder were also fine-tuned during training on the downstream task to achieve better prediction performance.
Attention-Aware Contrastive Learning for Predicting Peptide-HLA
3 Experiments and Results
3.1 Comparison of ACLPHLA with Existing Methods
To validate ACLPHLA, we compared it with six baseline methods from the IEDB, the IEDB-recommended method NetMHCpanEL [12], the method TransPHLA published in 2022, and two recently published attention-based methods (ACME [8] and DeepAttentionPan [4]). The six baseline methods are ANN [20], Consensus [21], NetMHCcons [22], NetMHCpanBA [12], PickPocket [6] and NetMHCstabpan [23], all of which can be downloaded from the IEDB¹. It is worth noting that not every method is compatible with every peptide length and HLA allele. With the exception of NetMHCpanBA, NetMHCpanEL, TransPHLA and our method, these methods all have different limitations. Therefore, not every method can be compared using our two test sets. We employ the area under the receiver operating characteristic curve (AUC), accuracy (ACC), Matthews correlation coefficient (MCC), and F1-score (F1) as performance evaluation metrics. The results of ACLPHLA compared with other models are shown in Table 2 and Table 3. Overall, the ACLPHLA model showed significant improvements in all four metrics compared to the baseline methods, and achieved better results in comparison with the latest methods. Moreover, compared to the latest method TransPHLA, the ACLPHLA model performed similarly on the independent test dataset and outperformed it on the external test dataset. These results suggest that introducing pre-training with contrastive learning helps, to some extent, to improve the performance of the model on the pHLA binding prediction task.

Table 2. Comparison experiments on the independent dataset.
Methods          AUC    ACC    MCC    F1
NetMHCpan_EL     0.956  0.795  0.643  0.745
NetMHCpan_BA     0.955  0.802  0.649  0.757
PickPocket       0.924  0.702  0.493  0.584
NetMHCstabpan    0.916  0.790  0.622  0.744
TransPHLA        0.978  0.928  0.857  0.928
ACLPHLA          0.977  0.930  0.864  0.931
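For reference, the four evaluation metrics named above can be computed from binary labels, hard predictions and scores as in the sketch below. It uses the standard textbook definitions; the function names are illustrative and this is not the authors' evaluation code.

```python
import math

def binary_metrics(y_true, y_pred):
    """ACC, MCC and F1 from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"ACC": acc, "MCC": mcc, "F1": f1}

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is fine for a small check; a rank-based implementation would be preferable for test sets of realistic size.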
3.2 Attention-Aware Masking Performed Better than Random Masking
To investigate which masking strategy is best for generating contrast views, we ran peptide sequence representation learning with three different masking strategies (minimum attention, maximum attention and random masking), and then evaluated
¹ http://tools.iedb.org/main/tools-api/
Table 3. Comparison experiments on the external dataset.
Methods            AUC    ACC    MCC    F1
NetMHCpan_EL       0.941  0.733  0.548  0.638
NetMHCpan_BA       0.935  0.735  0.548  0.643
NetMHCcons         0.933  0.736  0.549  0.645
ANN                0.926  0.731  0.540  0.638
ACME               0.925  0.691  0.481  0.557
Consensus          0.920  0.733  0.541  0.642
PickPocket         0.910  0.610  0.342  0.370
NetMHCstabpan      0.904  0.704  0.486  0.595
DeepAttentionPan   0.647  0.549  0.134  0.319
TransPHLA          0.950  0.878  0.765  0.869
ACLPHLA            0.954  0.885  0.772  0.881
them in a downstream task. We also compared the proposed model ACLPHLA with the baseline model ACLPHLA-base, which was not pre-trained with contrastive learning. The baseline model encodes the peptide sequences directly using the Transformer, and the other structures and parameters used during training remain the same as in ACLPHLA. The experimental results are shown in Table 4. It can be observed that minimum attention masking outperforms the other two masking strategies on nearly all metrics. The results suggest that masking amino acids with low attention weights helps the model focus on certain important amino acids when generating comparison views, resulting in more informative latent representations. In addition, model performance was generally better than that of the baseline model without contrastive pre-training, regardless of the masking approach taken. This suggests that pre-training based on contrastive learning resulted in better generalisation of the model and improved the prediction accuracy of peptide-HLA binding.

Table 4. Effect of different masking strategies on model ACLPHLA performance.
                   Independent test set          External test set
Masking strategy   AUC    ACC    MCC    F1       AUC    ACC    MCC    F1
Minimum masking    0.977  0.930  0.864  0.931    0.954  0.885  0.772  0.881
Maximum masking    0.963  0.919  0.846  0.928    0.933  0.878  0.776  0.865
Random masking     0.943  0.887  0.852  0.927    0.941  0.879  0.760  0.872
ACLPHLA-base       0.941  0.896  0.841  0.921    0.939  0.870  0.745  0.865
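The three masking strategies compared in Table 4 can be sketched as one function that takes a peptide and one attention weight per residue. The mask token "X", the rounding rule (round down, mask at least one residue) and the function name are assumptions for illustration; the excerpt only specifies that a proportion r% of residues is masked.

```python
import random

def mask_peptide(peptide, attention, ratio, strategy, mask_token="X", rng=None):
    """Return a masked copy of `peptide` (one attention weight per residue)."""
    if len(peptide) != len(attention):
        raise ValueError("need one attention weight per residue")
    # Assumption: round down, but always mask at least one residue.
    n_mask = max(1, int(len(peptide) * ratio))
    positions = range(len(peptide))
    if strategy == "max":       # mask residues with the largest attention weights
        chosen = sorted(positions, key=lambda i: attention[i], reverse=True)[:n_mask]
    elif strategy == "min":     # mask residues with the smallest attention weights
        chosen = sorted(positions, key=lambda i: attention[i])[:n_mask]
    elif strategy == "random":  # mask residues chosen uniformly at random
        chosen = (rng or random).sample(list(positions), n_mask)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    chosen = set(chosen)
    return "".join(mask_token if i in chosen else aa
                   for i, aa in enumerate(peptide))
```

For a 9-mer and a ratio of 0.25 this masks two residues; passing a seeded `random.Random` instance as `rng` makes the random strategy reproducible.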
3.3 ACLPHLA Uncovers the Underlying Patterns of pHLA Binding
After pre-training and downstream fine-tuning, we aggregated the attention scores over all data to identify important amino acids and probe the binding rules of pHLA. As shown in Fig. 3, the attention scores were highest for the first, second and last positions of the bound peptide, implying that amino acids at these three positions are critical for peptide binding to HLA. Previous work has shown that the C-terminal, N-terminal and anchor sites are of great biological importance during pHLA binding [24], which agrees with our experimental results.
Fig. 3. Attention score heatmap of 8-14mer peptide sequence.
Next, we computed the cumulative attention scores for all types of amino acids at each position of the 8–14-mer peptides. As shown in Fig. 4, L (Leu) and E (Glu) at position 2, E (Glu) and P (Pro) at position 4, and L (Leu), F (Phe), V (Val) and Y (Tyr) at position 9 were identified in the positive samples. This means that, overall, peptides with these amino acid types at these positions are more likely to bind HLA, a finding consistent with a previous study [25].
Fig. 4. Contribution of amino-acid types and peptide positions to pHLA binding, measured by cumulative attention score.
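The accumulation behind a heatmap like Fig. 4 can be sketched as a sum of per-residue attention scores keyed by amino-acid type and 1-based position. This assumes one attention score per residue is already available for each peptide; how the authors reduce the attention matrices to such scores is not described in this excerpt, and the function names are hypothetical.

```python
from collections import defaultdict

def accumulate_attention(peptides, attentions):
    """Sum attention scores per (amino acid, 1-based position) over all peptides."""
    totals = defaultdict(float)
    for peptide, scores in zip(peptides, attentions):
        for pos, (aa, score) in enumerate(zip(peptide, scores), start=1):
            totals[(aa, pos)] += score
    return dict(totals)

def top_residues(totals, position, k=3):
    """Amino acids with the highest cumulative score at one position."""
    at_pos = [(aa, s) for (aa, p), s in totals.items() if p == position]
    return sorted(at_pos, key=lambda item: item[1], reverse=True)[:k]
```

Restricting the input to positive (binding) samples for a single HLA allele would give per-allele patterns of the kind discussed in the case studies.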
We followed the approach of the existing study ACME [8] and visualised the binding patterns of five HLA alleles to further analyse specific cases. Figure 5 shows that the amino acid distribution patterns identified by ACLPHLA at each position on the binding peptides of these HLA alleles are very similar to those reported previously [8]. On HLA-A*11:01, ACLPHLA identifies a K (Lys) residue at position 9. On HLA-B*40:01, E (Glu) at position 2 and L (Leu) at position 9 are the key residues for pHLA binding. On HLA-B*57:03, L (Leu), W (Trp) and F (Phe) at position 9 have higher cumulative attention scores, and according to the 2BVP structure [26] in the PDB database, these three hydrophobic residues are more likely to form the binding pocket. On HLA-A*68:01, the
4HWZ structure [27] demonstrates that the K (Lys) and R (Arg) residues at position 9 of the peptide contribute significantly to binding to HLA. On HLA-B*44:02, the critical role of E (Glu) at position 2 was similarly verified on the 1M6O structure [28]. These case studies are supported by existing studies, and this agreement largely demonstrates the validity and interpretability of the ACLPHLA model.
Fig. 5. Accumulative attention scores for peptide binders associated with several well-characterized HLA-I alleles.
3.4 Ablation Study
We conducted experiments with different combinations of model parameters to explore how model performance varies and to choose the best model. Different attention heads can focus on different regions within the protein, so the number of attention heads is closely related to how well the model understands the protein. Experimental results for ACLPHLA models with 1–9 attention heads are presented in Table 5. Based on these results, we chose 7 attention heads. In addition, we tested different amino acid masking percentages to investigate their effect on downstream task performance. As the minimal attention masking strategy achieved better performance, we tested masking percentages of 10%, 25% and 50% with it. The results are shown in Table 6, from which it can be seen that the 25% masking percentage achieved the best performance.
Table 5. Effect of number of attention heads on model ACLPHLA performance.
        Independent test set          External test set
Head    AUC    ACC    MCC    F1       AUC    ACC    MCC    F1
1       0.976  0.927  0.856  0.927    0.953  0.884  0.771  0.879
2       0.975  0.928  0.858  0.929    0.951  0.879  0.756  0.867
3       0.975  0.927  0.858  0.930    0.951  0.883  0.767  0.874
4       0.976  0.928  0.859  0.929    0.952  0.885  0.762  0.880
5       0.977  0.929  0.863  0.930    0.953  0.883  0.772  0.881
6       0.976  0.928  0.860  0.931    0.954  0.885  0.769  0.873
7       0.977  0.930  0.864  0.931    0.954  0.885  0.772  0.881
8       0.977  0.929  0.862  0.930    0.954  0.880  0.764  0.871
9       0.976  0.927  0.860  0.931    0.951  0.878  0.771  0.875
Table 6. Effect of masking ratio on the performance of model ACLPHLA.
                     Independent test set          External test set
Masking ratio (%)    AUC    ACC    MCC    F1       AUC    ACC    MCC    F1
10                   0.971  0.923  0.851  0.924    0.941  0.874  0.760  0.876
25                   0.977  0.930  0.864  0.931    0.954  0.885  0.772  0.881
50                   0.965  0.923  0.856  0.922    0.939  0.872  0.761  0.873
4 Conclusion
Accurate prediction of peptide-HLA binding plays a key role in neoantigen identification, immunotherapy development and vaccine design. We present ACLPHLA, a pHLA binding specificity prediction model based on contrastive learning and the Transformer. It is a pan-specific model that is not limited by HLA allele or peptide length. We introduced an attention-aware masking approach to generate different contrast views, allowing pretrained models to focus on amino acids at key positions and extract higher-order semantic information from the sequences. We performed self-supervised representation learning on a large number of peptide sequences and fine-tuned the model on pHLA binding prediction. We performed two types of independent tests, and in both ACLPHLA achieved performance comparable or superior to recently published state-of-the-art methods and other baseline approaches. At the same time, the self-attention mechanism improved the interpretability of the model. For example, we observed attention score patterns for a number of amino acids that may determine pHLA binding specificity. We conclude that specific amino acids located in key regions of the peptide sequence bind strongly to the HLA allele.
Acknowledgements. This work is supported by the National Natural Science Foundation of China (grant nos. 62072384, 61872309, 62072385, 61772441), the Zhejiang Lab (2022RD0AB02), and the National Key R&D Program of China (2017YFE0130600).
References
1. Lundegaard, C., Lund, O., Buus, S., Nielsen, M.: Major histocompatibility complex class I binding predictions as a tool in epitope discovery. Immunology 130(3), 309–318 (2010)
2. Xie, X., Han, Y., Zhang, K.: Mhcherrypan: a novel pan-specific model for binding affinity prediction of class I HLA-peptide. Int. J. Data Min. Bioinform. 24(3), 201–219 (2020)
3. Yang, X., Zhao, L., Wei, F., Li, J.: Deepnetbim: deep learning model for predicting HLA-epitope interactions based on network analysis by harnessing binding and immunogenicity information. BMC Bioinform. 22(1), 1–16 (2021)
4. Jing, J., et al.: Deep learning pan-specific model for interpretable MHC-I peptide binding prediction with improved attention mechanism. Proteins Struct. Funct. Bioinform. 89(7), 866–883 (2021)
5. Chu, Y., et al.: A transformer-based model to predict peptide–HLA class I binding and optimize mutated peptides for vaccine design. Nature Mach. Intell. 4(3), 300–311 (2022)
6. Zhang, H., Lund, O., Nielsen, M.: The pickpocket method for predicting binding specificities for receptors based on receptor pocket similarities: application to MHC-peptide binding. Bioinformatics 25(10), 1293–1299 (2009)
7. Mei, S., et al.: Anthem: a user customised tool for fast and accurate prediction of binding between peptides and HLA class I molecules. Briefings Bioinform. 22(5), bbaa415 (2021)
8. Yan, H., et al.: Acme: pan-specific peptide–MHC class I binding prediction through attention-based deep neural networks. Bioinformatics 35(23), 4946–4954 (2019)
9. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
10. Wang, Q., et al.: Pssm-distil: Protein secondary structure prediction (pssp) on low-quality pssm by knowledge distillation with contrastive learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 617–625 (2021)
11. Fang, Y., Liu, X., Liu, H.: Attention-aware contrastive learning for predicting T cell receptor–antigen binding specificity. Briefings Bioinform. 23(6), bbac378 (2022)
12. Reynisson, B., Alvarez, B., Paul, S., Peters, B., Nielsen, M.: Netmhcpan-4.1 and netmhciipan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 48(W1), W449–W454 (2020)
13. Jurtz, V., Paul, S., Andreatta, M., Marcatili, P., Peters, B., Nielsen, M.: Netmhcpan-4.0: improved peptide–MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J. Immunol. 199(9), 3360–3368 (2017). https://doi.org/10.4049/jimmunol.1700893
14. Larsen, M.V., et al.: An integrative approach to CTL epitope prediction: a combined algorithm integrating MHC class I binding, TAP transport efficiency, and proteasomal cleavage predictions. Eur. J. Immunol. 35(8), 2295–2303 (2005)
15. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
16. Liu, Q., et al.: Deeptorrent: a deep learning-based approach for predicting DNA N4-methylcytosine sites. Briefings Bioinform. 22(3), bbaa124 (2021)
17. Hasegawa, D., Kaneko, N., Shirakawa, S., Sakuta, H., Sumi, K.: Evaluation of speech-to-gesture generation using bi-directional LSTM network. In: Proceedings of the 18th International Conference on Intelligent Virtual Agents, pp. 79–86 (2018)
18. Singh, V., Shrivastava, S., Singh, S.K., Kumar, A., Saxena, S.: Stable-abppred: a stacked ensemble predictor based on bilstm and attention mechanism for accelerated discovery of antibacterial peptides. Briefings Bioinform. 23(1), bbab439 (2022)
19. Sharma, R., Shrivastava, S., Singh, S.K., Kumar, A., Saxena, S., Singh, R.K.: Deep-afppred: identifying novel antifungal peptides using pretrained embeddings from seq2vec with 1dcnn-bilstm. Briefings Bioinform. 23(1), bbab422 (2022)
20. Andreatta, M., Nielsen, M.: Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 32(4), 511–517 (2016)
21. Moutaftsi, M., et al.: A consensus epitope prediction approach identifies the breadth of murine TCD8+-cell responses to vaccinia virus. Nature Biotechnol. 24(7), 817–819 (2006)
22. Karosiene, E., Lundegaard, C., Lund, O., Nielsen, M.: Netmhccons: a consensus method for the major histocompatibility complex class I predictions. Immunogenetics 64, 177–186 (2012)
23. Rasmussen, M., et al.: Pan-specific prediction of peptide–MHC class I complex stability, a correlate of T cell immunogenicity. J. Immunol. 197(4), 1517–1524 (2016)
24. Madden, D.R.: The three-dimensional structure of peptide-MHC complexes. Ann. Rev. Immunol. 13(1), 587–622 (1995)
25. Parker, K.C., Shields, M., DiBrino, M., Brooks, A., Coligan, J.E.: Peptide binding to MHC class I molecules: implications for antigenic peptide prediction. Immunol. Res. 14, 34–57 (1995)
26. Stewart-Jones, G.B.E., et al.: Structures of three HIV-1 HLA-B*5703-peptide complexes and identification of related HLAs potentially associated with long-term nonprogression. J. Immunol. 175(4), 2459–2468 (2005)
27. Niu, L., et al.: Structural basis for the differential classification of HLA-A*6802 and HLA-A*6801 into the A2 and A3 supertypes. Molecul. Immunol. 55(3–4), 381–392 (2013)
28. Macdonald, W.A., et al.: A naturally selected dimorphism within the HLA-B44 supertype alters class I structure, peptide repertoire, and T cell recognition. J. Exper. Med. 198(5), 679–691 (2003). https://doi.org/10.1084/jem.20030066
A Transformer-Based Deep Learning Approach with Multi-layer Feature Processing for Accurate Prediction of Protein-DNA Binding Residues Haipeng Zhao1 , Baozhong Zhu1 , Tengsheng Jiang2 , Zhiming Cui1 , and Hongjie Wu1(B) 1 School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China [email protected] 2 Gusu School, Nanjing Medical University, Suzhou, Jiangsu, China
Abstract. Proteins have significant biological effects when they bind to other substances, with binding to DNA being particularly crucial. Therefore, accurate identification of protein-DNA binding residues is important for further understanding of the protein-DNA interaction mechanism. Most current state-of-the-art methods are two-step approaches: the first step uses a sliding window technique to extract residue features; the second step uses each residue as an input to the model for prediction. This has a negative impact on the efficiency of prediction and ease of use. In this study, we propose a sequence-to-sequence (seq2seq) model that can input the entire protein sequence of variable length and use multiple modules including Transformer Encoder Module, Feature Fusion Module, and Feature Extraction Module for multi-layer feature processing. The Transformer Encoder Module is used to extract global features while the Feature Extraction Module is used to extract local features, further improving the recognition capability of the model. Comparison results on two benchmark datasets PDNA-543 and PDNA-41 demonstrate the effectiveness of our method in identifying protein-DNA binding residues. The code is available at https://github.com/HaipengZZhao/Predictionof-Residues. Keywords: Protein-DNA Binding Residues · Deep Learning · Transformer
1 Introduction
Proteins are very important substances in our body that can bind to many other substances, such as other biological macromolecules (DNA, RNA, nucleotides, etc.) or metal ions (Mn²⁺, Zn²⁺, Fe³⁺, Ca²⁺, Na⁺, etc.), to perform specific life activities [1–3]. Protein-DNA binding is a crucial process in biology that plays a significant role in various fundamental functions of life, including gene regulation, DNA replication, and transcriptional regulation [4]. In addition, studying protein-DNA binding residues can help us further understand the mechanism of protein-DNA interactions [5].
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 556–567, 2023. https://doi.org/10.1007/978-981-99-4749-2_47
A Transformer-Based Deep Learning Approach with Multi-layer Feature Processing
Given the importance of protein-DNA binding, many wet-lab methods have been developed to identify protein-DNA binding residues, including X-ray crystallography [6], Fast ChIP [7], and electrophoretic mobility shift assays (EMSAs) [8, 9]. Although wet-lab methods can yield precise identification outcomes, they are expensive and labor intensive. Moreover, they cannot keep up with the growth rate of protein sequences in the post-genomic era [10]. Therefore, there is a need to develop efficient and convenient computational methods for identifying protein-DNA binding residues. With advances in computing, a number of computational methods have emerged for this purpose. These methods can be broadly categorized into three types: sequence-based, structure-based, and hybrid methods [11].
Bioinformatics research primarily focuses on sequence-based methods, which remain a significant challenge: predicting protein-DNA binding residues from sequence-based features alone may perform poorly because of the limited information contained in protein sequences. However, because the number of protein sequences is increasing day by day, research in this area is still focused on utilizing sequence features. In the past decade, several sequence-based methods have been proposed, including BindN [12], ProteDNA [13], DP-Bind [14], BindN+ [15], MetaDBSite [16], TargetDNA [17], DNABind [18], DNAPred [19] and PredDBR [20]. BindN uses three types of protein sequence features: hydrophobicity, side chain pKa value, and molecular mass of amino acids. These features are input into a support vector machine (SVM) to predict protein-DNA binding residues. DP-Bind uses evolutionary information obtained from protein sequences, specifically the position-specific scoring matrix (PSSM) [21]. To enhance the recognition accuracy of protein-DNA binding residues, it combines three conventional machine learning techniques: penalized logistic regression, SVM, and kernel logistic regression. TargetDNA uses two protein sequence features, solvent accessibility and evolutionary information; it applies an under-sampling technique to divide the raw data into multiple sub-datasets and uses multiple SVMs for ensemble learning to predict protein-DNA binding residues.
Structure-based methods utilize either native or predicted 3D structure information of proteins. The 3D structure of a protein contains a large amount of information, and the structure determines the function of the protein to some extent. Consequently, utilizing protein structure information for predicting protein-DNA binding residues often yields better performance than sequence-based methods. Common structure-based methods include DBD-Hunter [22], DNABINDPROT [23], DR_bind [24], and PreDs [25]. All of these methods use only the structure information of the protein and ignore information in the protein sequence that may help predict protein-DNA binding residues. To enhance prediction accuracy, hybrid methods integrate both sequence and structure information. Common hybrid methods include TargetATP [26], COACH [27], TargetS [28], SVMPred [29] and NsitePred [30]. In DR_bind, the model predicts protein-DNA binding residues by utilizing evolutionary, geometric and electrostatic properties to describe the protein structure. In COACH, the authors designed an algorithm named TM-SITE to infer binding sites from homologous structural templates and an algorithm named S-SITE for sequence profile alignment based on evolutionary information, after which
H. Zhao et al.
the results of both algorithms were combined using an SVM to predict protein-DNA binding residues.
Deep learning has achieved significant success in computer vision and natural language processing. As a result, there is now an extensive body of research that applies deep learning to bioinformatics, including the prediction of transcription factor binding sites [31], the identification of bacteriocins [32], and the prediction of drug-drug interactions [33]. In this study, we introduce a novel sequence-based computational approach to efficiently and conveniently predict protein-DNA binding residues. Taking inspiration from DeepCSeqSite [34], we propose an encoder-decoder model that enables prediction over the entire protein sequence. We conducted experiments on the PDNA-543 and PDNA-41 datasets, comparing our method with existing ones. The comparison results demonstrate that our approach achieves competitive or even superior prediction performance compared to state-of-the-art methods. Our work has two main highlights. First, we propose an encoder-decoder model that can handle the entire protein sequence, allowing for end-to-end prediction of protein-DNA binding residues. Second, we introduce a multi-layer structure to process features from protein residues. This structure is capable of processing both global and local interrelationships between residues.
2 Material and Method

2.1 Data Set
The PDNA-543 and PDNA-41 datasets were constructed by Hu et al. [17]. Hu et al. initially gathered 7186 DNA-binding proteins with clear annotations from the Protein Data Bank (PDB). They then used the CD-hit software [35] to remove redundant sequences so that the identity between any two remaining protein sequences was less than 30%, resulting in 584 sequences that met the requirements. The 584 protein sequences were split into a training set and a test set. The training set, called the PDNA-543 dataset, contains 543 protein sequences, while the test set, known as the PDNA-41 dataset, contains 41 protein sequences. The two datasets are distinct and share no sequences. We trained our model using the optimal hyperparameters identified on PDNA-543 and conducted an independent test on PDNA-41 to validate its generalizability. The details of PDNA-543 and PDNA-41 are shown in Table 1.

2.2 Feature Representation
The features of the input data significantly affect the final performance of the model. In this study, we used two types of features to describe each protein residue: the Position Specific Scoring Matrix (PSSM) and the predicted secondary structure (PSS).
Table 1. Details of the PDNA-543 and PDNA-41 datasets.

Dataset     No. Sequences¹   No. Positive²   No. Negative³   RDNABR⁴ (%)
PDNA-543    543              9549            134995          7.07
PDNA-41     41               734             14021           5.24

¹ No. Sequences: number of protein sequences
² No. Positive: number of DNA binding residues
³ No. Negative: number of non-DNA binding residues
⁴ RDNABR: ratio of DNA binding residues
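The RDNABR column does not say which denominator it uses. As a quick check, and only as an illustration, the sketch below recomputes the ratio from the counts in Table 1 in two ways. The function names are hypothetical; the observation that positives divided by negatives reproduces the tabulated values is an inference from the arithmetic, not a definition stated by the authors.

```python
# Residue counts copied from Table 1: (positive, negative).
counts = {"PDNA-543": (9549, 134995), "PDNA-41": (734, 14021)}


def ratio_pos_to_neg(pos, neg):
    """Binding residues per 100 non-binding residues."""
    return 100.0 * pos / neg


def fraction_of_all(pos, neg):
    """Binding residues as a percentage of all residues."""
    return 100.0 * pos / (pos + neg)


for name, (pos, neg) in counts.items():
    # The first number matches the RDNABR column; the second is the
    # alternative definition, shown for comparison only.
    print(name,
          round(ratio_pos_to_neg(pos, neg), 2),
          round(fraction_of_all(pos, neg), 2))
```

Either way, binding residues make up only a few percent of the data, which is the class imbalance that motivates the use of MCC later in the paper.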
2.2.1 PSSM
The PSSM contains the evolutionary information of the query protein. Previous studies have shown that PSSM has a positive impact on various bioinformatic tasks [36–38]. In this study, we also utilize the PSSM to represent each residue. The PSSM features were generated using the multiple sequence alignment tool PSI-BLAST to search against the UniProt [39] database for three iterations, and the E-value threshold was set to $10^{-3}$. After that, a normalization formula was used to scale the values in the PSSM to the (0,1) interval in order to unify units with different features. The normalization formula for the PSSM is:

$$y = \frac{1}{1 + e^{-x}} \tag{1}$$
where x is each raw score in the PSSM and y is the normalized score. Given a protein sequence of length L, the dimension of the PSSM features is L × 20.

2.2.2 Predicted Secondary Structure
There are three types of protein secondary structure: coil, α-helix, and β-fold. Popular tools for predicting secondary structure, such as PSIPRED [5] and PSSpred, produce 3-dimensional features for each residue, with values ranging from 0 to 1. In this study, we used PSIPRED to predict the secondary structure of the target protein. For a protein sequence of length L, the tool produces secondary structure features of dimension L × 3. The three values for each residue represent the probabilities that the residue belongs to each of the three types of secondary structure: coil, α-helix, and β-fold.

2.3 Model
Traditional binary classification problems, such as the prediction of DNA-binding proteins, classify the entire protein sequence. In contrast, the prediction of protein-DNA binding residues classifies each residue in a protein sequence. Traditional methods therefore use a sliding window to gather features around each residue, so that every residue is fed into the model as a separate sample and classified individually. This kind of approach splits one large problem into many smaller binary classification sub-problems.
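To make the contrast concrete, the following minimal sketch builds per-residue features as described in Sect. 2.2 and then forms sliding-window samples in the way traditional methods do. Several details are assumptions made for illustration only: the PSSM and PSS rows are simply concatenated into 23 values per residue, the window half-width is 2, positions outside the sequence are zero-padded, and the toy scores are made up.

```python
import math


def sigmoid(x):
    """Eq. (1): map a raw PSSM score into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


def residue_features(pssm, pss):
    """Concatenate normalized PSSM rows (20 values) with PSS rows (3 values)."""
    return [[sigmoid(v) for v in p_row] + list(s_row)
            for p_row, s_row in zip(pssm, pss)]


def sliding_windows(features, half_width):
    """One flattened sample per residue; out-of-range positions are zero-padded."""
    dim = len(features[0])
    pad = [0.0] * dim
    samples = []
    for i in range(len(features)):
        window = []
        for j in range(i - half_width, i + half_width + 1):
            window.extend(features[j] if 0 <= j < len(features) else pad)
        samples.append(window)
    return samples


# Toy protein of length 5 with made-up PSSM scores and PSS probabilities.
L = 5
pssm = [[float((i + j) % 7 - 3) for j in range(20)] for i in range(L)]
pss = [[0.2, 0.5, 0.3] for _ in range(L)]

feats = residue_features(pssm, pss)
samples = sliding_windows(feats, half_width=2)
print(len(feats), len(feats[0]))      # sequence length x 23 features
print(len(samples), len(samples[0]))  # one sample per residue, 5 * 23 values each
```

With a window, a protein of length L becomes L independent samples. A sequence-level model can instead consume the L × 23 matrix directly.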
In contrast, we propose an encoder-decoder model inspired by the seq2seq model, which does not have to perform task splitting. We can input one whole protein sequence at a time, and the input protein lengths can be different. The overall framework of the model is shown in Fig. 1.

[Figure 1 panel labels: Protein sequence; PSSM and Predicted Secondary Structure; Transformer Encoder Module (Linear Q, Linear K, Linear V, Self-Attention, Concat, Linear, LayerNorm, MLP; repeated T×); Feature Fusion Module; Feature Extraction Module (LayerNorm, 3×3 Conv, GLU; repeated N×); Decoder Module; Binding probability of each residue]
Fig. 1. Overall framework of the model.

The prediction process of the model is as follows:
(1) Encode the protein sequence using PSSM and PSS, combining the protein evolutionary information with secondary structure features.
(2) Process the protein sequence encoding with the Transformer Encoder, which uses a self-attention mechanism to weigh the key information within the sequence. This mechanism prioritizes important information and reduces attention towards unimportant information, so that global correlations between each residue and all other residues in the protein sequence can be obtained.
(3) Send the processed data to a Bi-directional Long Short-Term Memory (BiLSTM) network to obtain hidden feature information deep in the sequence and explore long-range dependencies, then extract the corresponding hidden units.
(4) After obtaining global information from the protein sequence, apply Layer Normalization (LayerNorm) to address internal covariate shift, and then extract key local sequence information through convolutional layers.
(5) To reduce overfitting, a dropout layer randomly discards part of the information. The final prediction is then produced after the Rectified Linear Unit (ReLU) activation function.
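Step (2) relies on scaled dot-product self-attention, which Sect. 2.3.1 formalizes. As a rough illustration of how every residue attends to every other residue, here is a single-head sketch in plain Python. It is not the authors' implementation: the helper names, the identity projection matrices, and the toy 4 × 3 input are all assumptions chosen to keep the example small.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: 4 residues with 3 features each; identity projections (assumed).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
out, weights = self_attention(x, identity, identity, identity)

print(len(weights), len(weights[0]))  # one weight per residue pair
print(len(out), len(out[0]))          # same length as the input sequence
```

Each row of `weights` sums to 1 and covers the whole sequence, which is why the encoder can relate residues that are far apart without a fixed window.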
2.3.1 Transformer Encoder Module
The Transformer Encoder employs a self-attention mechanism that establishes direct connections between different positions, which enables it to capture crucial information in the input on a global scale. In this work, the protein sequence features enter the Transformer Encoder Module, whose structure is shown in Fig. 1. First, the embedded sequences enter Multi-Head Attention. For each residue in the protein sequence, Self-Attention calculates the attention weights over all residues, including the residue itself. It is calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$

where the query Q has dimension $d_Q$, the key K has the same dimension $d_K$, and the value V has dimension $d_V$ (usually $d_Q = d_K = d_V$). Multi-Head Attention projects Q, K, and V with h different linear transformations and finally concatenates the different attention results:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O} \tag{3}$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{4}$$
where each W is a spatial mapping (linear projection) matrix, and all of them have the same dimension in this work. After Multi-Head Attention, a residual connection adds the input to the Multi-Head Attention output. In addition to Multi-Head Attention, the Transformer Encoder Module usually contains a Multilayer Perceptron (MLP) module, which consists of two fully-connected layers and a nonlinear activation function. In general, models that are too deep suffer from vanishing gradients during training, or the gradients are difficult to propagate to the shallow part of the model during backpropagation, resulting in ineffective parameter updates. To solve this problem, we add residual connections after the Multi-Head Attention module and the MLP module.

2.3.2 Feature Fusion Module
The prediction of protein-DNA binding residues can be influenced by dependencies between different sequence contexts. A BiLSTM learns features in the forward and backward directions separately and then fuses them, which helps capture the contextual relationships in sequence data. We therefore used a BiLSTM to gather additional dependency information between protein sequence contexts. The forward layer of the BiLSTM performs the forward calculation from time 1 to t and obtains the output of the forward hidden layer at each time. From time t to 1, the backward layer performs the reverse calculation to obtain the output of the backward hidden layer at each time. The outputs of the forward layer and the backward layer at each moment are then combined to obtain the final output:

$$C_f = f(w_1 x_t + w_2 C_{f-1}) \tag{5}$$

$$C_b = f(w_3 x_t + w_5 C_{b-1}) \tag{6}$$

$$H_m = g(w_4 C_f + w_6 C_b) \tag{7}$$
where t represents time; x represents the input; $w_i$ are the weights; $C_f$ is the output of the forward layer; $C_b$ is the output of the backward layer; f() calculates the outputs of the forward and backward layers; and g() combines and sums the outputs of the forward and backward layers. Finally, the output ($H_m$) of the BiLSTM layer is generated.

2.3.3 Feature Extraction Module
After the Feature Fusion Module, the sequence features are reshaped so that they can be handled by the convolution module. In proteins, residues that are close in sequence typically share similar properties, so understanding the local features of these residues is crucial. The convolution operation slides a kernel over a feature map to aggregate input features. This process effectively extracts the local features of residues, and we therefore build the Feature Extraction Module on a convolutional neural network. The structure of the Feature Extraction Module is shown in Fig. 1. Gated Linear Units (GLU) can effectively capture the correlations between inputs, thereby improving the accuracy and generalization ability of the model.

The Feature Extraction Module is composed of two LayerNorm-conv-GLU blocks. The main difference between them is that the first LayerNorm-conv-GLU block is followed by a residual connection, while the second one is not. Assuming that the feature dimension is L × 1 × C after the Feature Fusion Module, it becomes L × 1 × 2C after the LayerNorm-conv step. The GLU then restores the feature dimension to L × 1 × C, so that it matches the block input for the residual connection. For consistency between the two LayerNorm-conv-GLU blocks, we apply the same GLU activation function in the second block.
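The dimension bookkeeping above (C channels in, 2C after the convolution, C again after the GLU) can be illustrated with a short sketch. This is not the authors' code: it assumes the common GLU definition, in which the 2C channels are split into halves a and b and the output is a · sigmoid(b), and it uses made-up numbers in place of real LayerNorm and convolution outputs.

```python
import math


def glu(row):
    """Gated Linear Unit: split 2C values into halves a, b; return a * sigmoid(b)."""
    c = len(row) // 2
    a, b = row[:c], row[c:]
    return [ai / (1.0 + math.exp(-bi)) for ai, bi in zip(a, b)]


def residual_glu_block(block_input, conv_output):
    """Apply GLU per residue, then add the block input (residual connection)."""
    return [[x + g for x, g in zip(x_row, glu(c_row))]
            for x_row, c_row in zip(block_input, conv_output)]


# Toy shapes: L = 3 residues, C = 2 channels. conv_output stands in for the
# result of LayerNorm followed by the convolution (values are made up).
block_input = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
conv_output = [[1.0, -2.0, 0.0, 0.0],
               [2.0, 2.0, 0.0, 0.0],
               [4.0, 0.0, 0.0, 0.0]]

gated = [glu(row) for row in conv_output]
block_output = residual_glu_block(block_input, conv_output)
print(len(conv_output[0]), len(gated[0]), len(block_output[0]))  # 2C, C, C
```

Because the gate halves the channel count, the gated features can be added to the block input without any extra projection.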
After the two LayerNorm-conv-GLU blocks, the output of the encoder is obtained. It contains both the global information learned by the Transformer Encoder Module and the local information learned by the convolutional neural network (CNN) blocks. This output is then passed to the decoder to obtain the final prediction.

2.3.4 Decoder Module
The most important requirement for the decoder is to generate a result of the same length as the protein, which is used to determine the type of each residue (whether or not it is a protein-DNA binding residue). A decoder based on a multi-layer CNN can be used here, provided that the last convolutional layer has 2 output channels, which are used to discriminate the residue type. In the decoder, the input data is first subjected to a convolution operation. The output values are then passed through a ReLU activation function to introduce non-linearity. Afterwards, the data enters a dropout layer, which randomly sets some of the neuron outputs to 0 in order to reduce the risk of overfitting. The architecture of the decoder is shown in Fig. 2.
[Figure 2 layout: Hidden Feature → Convolutional Layer1 (channels = (128, 256), kernels = (1, 1), stride = 1, activation function = ReLU) → Dropout (rate = 0.3) → Convolutional Layer2 (channels = (256, 2), kernels = (1, 1), stride = 1, activation function = ReLU) → Binding probability of each residue]
Fig. 2. The CNN-based decoder consists of convolutional layers with a convolutional kernel size of 1 × 1 and a ReLU activation function. With the decoder, we are able to obtain the final prediction for each residue.
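A 1 × 1 convolution applies the same linear map at every sequence position, which is why the decoder output has exactly one 2-channel score per residue. The sketch below illustrates this with toy sizes (4 → 8 → 2 instead of the 128 → 256 → 2 in Fig. 2). The random weights, the helper names, and the inverted-dropout scaling are assumptions for illustration, not details taken from the paper.

```python
import random


def pointwise_conv(x, weight, bias):
    """1x1 convolution over a sequence: one linear map shared by all residues."""
    return [[sum(w * v for w, v in zip(w_row, residue)) + b
             for w_row, b in zip(weight, bias)]
            for residue in x]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def dropout(x, rate, rng, training):
    """Zero activations at random during training (inverted scaling assumed)."""
    if not training:
        return x
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in row] for row in x]


def init(rng, n_out, n_in):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
hidden, mid, length = 4, 8, 6   # toy sizes; Fig. 2 uses 128 -> 256 -> 2
w1, b1 = init(rng, mid, hidden), [0.0] * mid
w2, b2 = init(rng, 2, mid), [0.0] * 2
x = [[rng.uniform(-1.0, 1.0) for _ in range(hidden)] for _ in range(length)]

h = dropout(relu(pointwise_conv(x, w1, b1)), rate=0.3, rng=rng, training=False)
out = relu(pointwise_conv(h, w2, b2))
print(len(out), len(out[0]))  # sequence length, 2 channels per residue
```

Since no position is mixed with its neighbors at this stage, the decoder works for any sequence length without padding.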
2.4 Model Evaluation
Protein-DNA binding residue prediction is a binary classification problem. We used sensitivity (SN), specificity (SP), accuracy (ACC), and the Matthews correlation coefficient (MCC) to assess the prediction performance. Binding residue data are highly imbalanced; hence, MCC is a more suitable indicator for evaluating the performance of the method. The formulas for calculating the metrics are as follows:

$$SN = \frac{TP}{TP + FN} \times 100 \tag{8}$$

$$SP = \frac{TN}{TN + FP} \times 100 \tag{9}$$

$$ACC = \frac{TP + TN}{TP + TN + FN + FP} \times 100 \tag{10}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FN) \times (TN + FP)}} \tag{11}$$
where TP is the number of DNA binding residues (positive samples) that are correctly predicted, TN is the number of non-DNA binding residues (negative samples) that are correctly predicted, FP is the number of non-DNA binding residues that are incorrectly predicted as DNA binding residues, and FN is the number of DNA binding residues that are incorrectly predicted as non-DNA binding residues. Larger values for all four of these metrics indicate better performance of the model.
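The four metrics can be computed directly from the confusion counts, as in the minimal sketch below. The counts are hypothetical and chosen only to show why MCC is informative under imbalance; returning 0 for MCC when the denominator vanishes is a common convention assumed here, not something the paper specifies.

```python
import math


def binding_metrics(tp, tn, fp, fn):
    """Return (SN, SP, ACC, MCC) following Eqs. (8)-(11)."""
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention
    return sn, sp, acc, mcc


# Hypothetical test set: 100 binding residues, 900 non-binding residues.
useful = binding_metrics(tp=40, tn=880, fp=20, fn=60)
all_negative = binding_metrics(tp=0, tn=900, fp=0, fn=100)

print(useful)
print(all_negative)
```

A predictor that labels every residue as non-binding still reaches 90% accuracy on this toy set, yet its MCC is 0, whereas the first predictor has a clearly positive MCC.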
3 Result and Discussion

3.1 Setting of Hyperparameters
Due to the large number of hyperparameters in a model, each experimental run can take several hours up to one day. As a result, it was not feasible to perform experimental comparisons for all hyperparameters. To address this issue, we empirically set certain hyperparameter values as shown in Table 2 and used these same values for subsequent experiments. Specifically, we set the batch size to 1 so that the model could handle longer protein sequences without requiring additional padding operations. In addition, we set a small learning rate to avoid causing significant changes in the weights of the model.

Table 2. Hyperparameter values.

Hyperparameter    Values
Optimizer         Adam
Loss_function     CrossEntropyLoss
Learning rate     0.00005
Epoch             1000
Batch_size        1
3.2 Performance Comparison with Other Predictors
We conducted independent tests on PDNA-41 to verify the effectiveness of our approach. Table 3 presents a comparison between our method and other existing methods, including MetaDBSite [16], BindN [12], COACH [27], DP-Bind [14], BindN+ [15], and TargetDNA [17]. The results in Table 3 show that our method achieved satisfactory results compared with previous approaches. Specifically, we obtained MCC, SP, SN, and ACC values of 0.343, 96.37%, 46.34%, and 94.79%, respectively. Our method is generally superior to the other methods: although some methods score slightly higher on individual indicators, their performance on the remaining indicators is significantly weaker than that of our method. Compared to TargetDNA (with SN set approximately at 95%), our method achieved better results on all evaluation metrics (MCC, SP, SN, and ACC) in independent testing on the PDNA-41 dataset. Specifically, the relative improvements were 14.3% in MCC, 3.3% in SP, 1.8% in SN, and 4.3% in ACC.
Table 3. Performance comparison of different classifiers on PDNA-41 dataset via independent test.

Method                 MCC     SP (%)   SN (%)   ACC (%)
MetaDBSite             0.221   93.35    34.20    90.41
BindN                  0.143   80.90    45.64    79.15
COACH                  0.352   95.10    46.19    92.67
DP-Bind                0.241   82.43    61.72    81.40
BindN+ (SP≈95%)        0.178   95.11    24.11    91.58
BindN+ (SP≈85%)        0.213   85.41    50.81    83.69
TargetDNA (SN≈SP)      0.269   85.79    60.22    84.52
TargetDNA (SN≈95%)     0.300   93.27    45.50    90.89
Our Method             0.343   96.37    46.34    94.79
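The improvement percentages quoted in Sect. 3.2 are relative gains over the second TargetDNA row, not differences in percentage points. The short check below recomputes them from the values in Table 3; the function name is hypothetical and the interpretation as relative gains is inferred from the arithmetic.

```python
# Values copied from Table 3.
ours = {"MCC": 0.343, "SP": 96.37, "SN": 46.34, "ACC": 94.79}
target_dna = {"MCC": 0.300, "SP": 93.27, "SN": 45.50, "ACC": 90.89}


def relative_gain(new, old):
    """Improvement of `new` over `old`, as a percentage of `old`."""
    return 100.0 * (new - old) / old


gains = {k: round(relative_gain(ours[k], target_dna[k]), 1) for k in ours}
print(gains)
```

In absolute terms the gaps are smaller, for example 3.10 percentage points in SP and 0.84 percentage points in SN.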
4 Conclusion
In this study, we propose an encoder-decoder model to predict protein-DNA binding sites. To represent a protein sequence, we use two sequence-based features: the evolutionary feature PSSM and the predicted secondary structure. Unlike current state-of-the-art methods, our model enables end-to-end prediction over an entire protein sequence without pre-extracting features for each residue with a sliding window, which makes the model easy to use. Compared with previous methods, our model achieves respectable performance on the PDNA-41 test set (MCC: 0.343, SP: 96.37%, SN: 46.34%, ACC: 94.79%), which supports the effectiveness of our model.

Although our method can handle variable-length protein sequences, it does so by restricting the model to one protein input at a time. We will therefore explore further models to address the problem of inconsistent protein sequence lengths. Given the success of graph neural networks in bioinformatics, we will try to employ graph structures to represent protein sequences and identify DNA binding residues. In addition, the features used in this work could be improved. With the great achievements in protein structure prediction in recent years, we can use predicted structural information to aid in this task.

Acknowledgement. This paper is supported by the National Natural Science Foundation of China (62073231, 62176175, 61902271), National Research Project (2020YFC2006602), Provincial Key Laboratory for Computer Information Processing Technology, Soochow University (KJS2166), Opening Topic Fund of Big Data Intelligent Engineering Laboratory of Jiangsu Province (SDGC2157).
References
1. Dobson, C.M.: Chemical space and biology. Nature 432(7019), 824–828 (2004)
2. Gao, M., Skolnick, J.: The distribution of ligand-binding pockets around protein-protein interfaces suggests a general mechanism for pocket formation. Proc. Natl. Acad. Sci. 109(10), 3784–3789 (2012)
3. Zhao, J., Cao, Y., Zhang, L.: Exploring the computational methods for protein-ligand binding site prediction. Comput. Struct. Biotechnol. J. 18, 417–426 (2020)
4. Ofran, Y., Mysore, V., Rost, B.: Prediction of DNA-binding residues from sequence. Bioinformatics 23(13), i347–i353 (2007)
5. Jones, S., Van Heyningen, P., Berman, H.M., et al.: Protein-DNA interactions: a structural analysis. J. Mol. Biol. 287(5), 877–896 (1999)
6. Smyth, M.S., Martin, J.H.J.: X Ray crystallography. Mol. Pathol. 53(1), 8 (2000)
7. Nelson, J.D., Denisenko, O., Bomsztyk, K.: Protocol for the fast chromatin immunoprecipitation (ChIP) method. Nat. Protoc. 1(1), 179–185 (2006)
8. Heffler, M.A., Walters, R.D., Kugel, J.F.: Using electrophoretic mobility shift assays to measure equilibrium dissociation constants: GAL4-p53 binding DNA as a model system. Biochem. Mol. Biol. Educ. 40(6), 383–387 (2012)
9. Hellman, L.M., Fried, M.G.: Electrophoretic mobility shift assay (EMSA) for detecting protein–nucleic acid interactions. Nat. Protoc. 2(8), 1849–1861 (2007)
10. Vajda, S., Guarnieri, F.: Characterization of protein-ligand interaction sites using experimental and computational methods. Curr. Opin. Drug Discov. Devel. 9(3), 354 (2006)
11. Ding, Y., Yang, C., Tang, J., et al.: Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model. Appl. Intell. 1–15 (2022)
12. Wang, L., Brown, S.J.: BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 34(suppl_2), W243–W248 (2006)
13. Chu, W.Y., Huang, Y.F., Huang, C.C., et al.: ProteDNA: a sequence-based predictor of sequence-specific DNA-binding residues in transcription factors. Nucleic Acids Res. 37(suppl_2), W396–W401 (2009)
14. Hwang, S., Gou, Z., Kuznetsov, I.B.: DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 23(5), 634–636 (2007)
15. Wang, L., Huang, C., Yang, M.Q., et al.: BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Syst. Biol. 4, 1–9 (2010)
16. Si, J., Zhang, Z., Lin, B., et al.: MetaDBSite: a meta approach to improve protein DNA-binding sites prediction. BMC Syst. Biol. 5(1), 1–7 (2011)
17. Hu, J., Li, Y., Zhang, M., et al.: Predicting protein-DNA binding residues by weightedly combining sequence-based features and boosting multiple SVMs. IEEE/ACM Trans. Comput. Biol. Bioinf. 14(6), 1389–1398 (2016)
18. Liu, R., Hu, J.: DNABind: a hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. PROTEINS: Structure, Function Bioinform. 81(11), 1885–1899 (2013)
19. Zhu, Y.H., Hu, J., Song, X.N., et al.: DNAPred: accurate identification of DNA-binding sites from protein sequence by ensembled hyperplane-distance-based support vector machines. J. Chem. Inf. Model. 59(6), 3057–3071 (2019)
20. Hu, J., Bai, Y.S., Zheng, L.L., et al.: Protein-DNA binding residue prediction via bagging strategy and sequence-based cube-format feature. IEEE/ACM Trans. Comput. Biol. Bioinf. 19(6), 3635–3645 (2021)
21. Altschul, S.F., Madden, T.L., Schäffer, A.A., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)
22. Gao, M., Skolnick, J.: DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions. Nucleic Acids Res. 36(12), 3978–3992 (2008)
23. Ozbek, P., Soner, S., Erman, B., et al.: DNABINDPROT: fluctuation-based predictor of DNA-binding residues within a network of interacting residues. Nucleic Acids Res. 38(suppl_2), W417–W423 (2010)
24. Chen, Y.C., Wright, J.D., Lim, C.: DR_bind: a web server for predicting DNA-binding residues from the protein structure based on electrostatics, evolution and geometry. Nucleic Acids Res. 40(W1), W249–W256 (2012)
25. Tsuchiya, Y., Kinoshita, K., Nakamura, H.: PreDs: a server for predicting dsDNA-binding site on protein molecular surfaces. Bioinformatics 21(8), 1721–1723 (2005)
26. Yu, D.J., Hu, J., Tang, Z.M., et al.: Improving protein-ATP binding residues prediction by boosting SVMs with random under-sampling. Neurocomputing 104, 180–190 (2013)
27. Yang, J., Roy, A., Zhang, Y.: Protein–ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 29(20), 2588–2595 (2013)
28. Yu, D.J., Hu, J., Yang, J., et al.: Designing template-free predictor for targeting protein-ligand binding sites with classifier ensemble and spatial clustering. IEEE/ACM Trans. Comput. Biol. Bioinf. 10(4), 994–1008 (2013)
29. Chen, K., Mizianty, M.J., Kurgan, L.: ATPsite: sequence-based prediction of ATP-binding residues proteome science. BioMed Central 9(1), 1–8 (2011)
30. Chen, K., Mizianty, M.J., Kurgan, L.: Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics 28(3), 331–341 (2012)
31. Zhang, Q., Wang, S., Chen, Z., et al.: Locating transcription factor binding sites by fully convolutional neural network. Brief. Bioinform. 22(5), bbaa435 (2021)
32. Cui, Z., Chen, Z.H., Zhang, Q.H., et al.: Rmscnn: a random multi-scale convolutional neural network for marine microbial bacteriocins identification. IEEE/ACM Trans. Comput. Biol. Bioinf. 19(6), 3663–3672 (2021)
33. Su, X., You, Z.H., Huang, D., et al.: Biomedical knowledge graph embedding with capsule network for multi-label drug-drug interaction prediction. IEEE Trans. Knowl. Data Eng. (2022)
34. Cui, Y., Dong, Q., Hong, D., et al.: Predicting protein-ligand binding residues with deep convolutional neural networks. BMC Bioinform. 20(1), 1–12 (2019)
35. Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
36. Wang, Y., Ding, Y., Guo, F., et al.: Improved detection of DNA-binding proteins via compression technology on PSSM information. PLoS ONE 12(9), e0185587 (2017)
37. Ding, Y., Tang, J., Guo, F.: Identification of protein–ligand binding sites by sequence information and ensemble classifier. J. Chem. Inf. Model. 57(12), 3149–3161 (2017)
38. Ahmad, S., Sarai, A.: PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics 6, 1–6 (2005)
39. UniProt Consortium: UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47(D1), D506–D515 (2019)
TAPE-Pero: Using Deep Representation Learning Model to Identify and Localize Peroxisomal Proteins

Jianan Sui1, Yuehui Chen2(B), Yi Cao3, and Yaou Zhao4

1 School of Information Science and Engineering, University of Jinan, Jinan, China
2 Artificial Intelligence Institute (School of Information Science & Engineering), University of Jinan, No. 336, Jinan, China [email protected]
3 Shandong Provincial Key Laboratory of Network Based Intelligent Computing (School of Information Science & Engineering), University of Jinan, No. 336, Jinan, China
4 Artificial Intelligence Institute (School of Information Science & Engineering), University of Jinan, No. 336, Jinan, China
Abstract. Peroxisomes, organelles containing one or more oxidases within a single lipid bilayer, play a crucial role in various metabolic pathways. Incorrect localization of peroxisomal proteins can lead to severe diseases. In this study, we introduced the TAPE-Pero model to improve the accuracy of peroxisomal protein identification and localization. This model incorporates two deep representation learning models, ProSE and BERT-based TAPE, trained on large protein databases, to extract peroxisomal protein features. The data set is balanced using SMOTE, and the optimal feature vector is selected using analysis of variance (ANOVA). Subsequently, nine machine learning classifiers are utilized to identify and locate peroxisomal proteins. Our model outperforms existing state-of-the-art methods with an overall prediction accuracy of 98.97% and 92.57% from tenfold cross-validation and double cross-validation tests on the data set, respectively. The proposed model provides a novel approach for the identification and localization of peroxisomal proteins.

Keywords: Peroxisomal proteins · SMOTE · Deep representation learning · Feature selection · Machine learning
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 568–577, 2023. https://doi.org/10.1007/978-981-99-4749-2_48

1 Introduction
Organelle proteins, a group of proteins associated with or located within an organelle, play critical roles in various cellular processes and exhibit varying functions based on their localization within the organelle. Accurate identification of organelle proteins can aid in understanding their functions and pave the way for new treatments for related diseases. It is also important to determine the precise localization of organelle proteins to fully characterize their functions. The identification of organelle protein localization has been widely studied using machine learning methods. For instance, Zhou et al.
[1] introduced a method for predicting Golgi protein types that used a combination of pseudo amino acid composition (PseAAC), dipeptide composition (DC), pseudo position-specific scoring matrix (PsePSSM), and encoding based on grouped weight (EBGW) to extract feature vectors, and employed XGBoost as the classifier. This method achieved an accuracy of 92.1%. Lv et al. [2] created rfGPT, a Golgi protein classifier that employed 2-gap dipeptide and split amino acid composition as feature vectors, balanced the data set using SMOTE, and used ANOVA for feature selection before feeding the data into an RF model. rfGPT achieved an independent test accuracy of 90.6%. Yu et al. [3] proposed SubMito-XGBoost, a method that used XGBoost to predict sub-mitochondrial protein types, with prediction accuracies of 97.7% and 98.9% on the two training datasets M317 and M983, respectively, and 94.8% on the independent test set M495. Other studies on organelle protein identification have also been conducted [4–6].

In this study, we examined the identification and subcellular localization of peroxisomal proteins. Peroxisomes, also referred to as microbodies, are membrane-bound organelles that contain one or more oxidases and play a crucial role in various cellular processes, including the regulation of cellular immunity. Aberrant localization of peroxisomal proteins has been linked to diseases such as Alzheimer's disease, X-linked adrenoleukodystrophy (X-ALD), prostate cancer, and bladder cancer [7–10]. Although treatments such as anti-inflammatory and neuroprotective therapies exist, they are not always effective in curing the underlying diseases [11–14]. Hence, accurate recognition and localization of peroxisomal proteins is vital for the timely detection of abnormalities and injuries and could have significant implications for the treatment of related diseases.
Currently, the only tool available for peroxisomal protein recognition and localization is In-Pero [15]. This tool uses deep learning embedding techniques such as UniRep [16] and SeqVec [17] to extract features from peroxisomal protein sequences and employs four machine learning models in combination with five protein embedding methods. It achieved a classification accuracy of 0.92 using cross-validation. That study was the first comprehensive examination of the topic and provides a benchmark for future research. Its authors believe that deep learning methods hold promise in addressing this problem.

In this study, we present a novel approach named TAPE-Pero for identifying and localizing peroxisomal proteins. Our model uses deep representation learning methods, including the BERT-based TAPE [18] and the pre-trained multi-task language model ProSE [19], to extract features from peroxisomal protein sequences. We address the class imbalance issue by employing SMOTE and select optimal features via Analysis of Variance (ANOVA) [20]. Our results indicate that the TAPE model outperforms other feature extraction methods in terms of accuracy. We further feed the optimal feature vector into nine traditional machine learning methods: Gaussian Naive Bayes (GaussianNB), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM), Gradient Tree Boosting (GBDT), Multilayer Perceptron (MLP), K-Nearest Neighbor (KNN), and Extreme Gradient Boosting (XGBoost). The general flow chart of the TAPE-Pero model is depicted in Fig. 1.
Fig. 1. The general flow chart of the TAPE-Pero model.
2 Materials and Methods

2.1 Datasets
In this study, we employed the dataset of peroxisomal proteins created by Anteghini et al. [15] in 2021. This dataset was sourced from the UniprotKB/SwissProt database and underwent a filtering process. Redundant sequences were then removed with the Cd-hit clustering program at a sequence identity threshold of 40%. The final dataset consisted of 132 peroxisome membrane protein sequences and 28 peroxisome matrix protein sequences, a ratio of approximately 5:1. The basic construction process of the peroxisome protein dataset is shown in Fig. 2.
Fig. 2. Flow chart of peroxisome proteins dataset construction.
2.2 Feature Extraction
In recent years, there has been a shift in the approach for feature extraction in protein characterization tasks. Earlier models relied primarily on features such as compositional features, positional features, and physical-chemical properties. With the advancement of deep learning methods, they have become increasingly popular in sequence-based protein characterization tasks. In our work, we adopted two feature extraction techniques based on Natural Language Processing (NLP) pre-trained models, namely ProSE and TAPE.
2.2.1 ProSE
This feature extraction model, which we call ProSE in this paper, employs a three-task deep learning approach built on a three-layer bidirectional LSTM network with skip connections. The model trains a protein language model through self-supervised learning on a large corpus of natural sequence data, in combination with structural supervision on a smaller set of sequences. The three tasks involved in the training process are a masked language modeling task, prediction of residue-residue contacts within protein structures, and prediction of structural similarity. As a result, the ProSE model represents protein sequences as continuous vectors, combining the benefits of self-supervised learning on a large sequence corpus and structural supervision on a smaller set of sequences.

2.2.2 TAPE
In the field of protein representation learning, TAPE was introduced as a novel evaluation benchmark for protein embeddings. It is designed to assess the performance of different protein embedding methods on supervised tasks that are critical for three areas of protein biology. To this end, its authors selected a set of supervised tasks that are likely to benefit from self-supervised learning. In this paper, we focus on a BERT-based TAPE model. Each protein sequence is first converted to an integer sequence according to the following function:

$$f(m_j) = i \tag{1}$$

$$i = 1, 2, \ldots, 20, \quad \text{if } m_j \in \{\text{20 canonical amino acids}\} \tag{2}$$
where $m_j$ is the $j$-th amino acid of the sequence. The integer sequence $f(m_j)$, $j = 1, 2, \ldots, L$ (where $L$ is the length of the protein sequence), was embedded into 6165-dimensional feature vectors by the ProSE model and 768-dimensional feature vectors by the TAPE model.
2.3 Feature Selection
To improve prediction accuracy and mitigate the risk of overfitting, we used ANOVA (Analysis of Variance) [20] to select features, and then used the selected features as input to the classifier. ANOVA is a statistical method that compares the means of multiple groups to determine whether they differ significantly; it tests the null hypothesis that all groups have the same mean. The statistic for one-way ANOVA is:
$$F = \frac{MSB}{MSW} \tag{3}$$
where MSB (Mean Square Between) represents the variance between the group means and MSW (Mean Square Within) represents the variance within each group. The F-statistic calculated using this formula is then compared to a critical value from the F-distribution to determine the significance of the result.
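As an illustration of Eq. (3), the following minimal sketch computes the one-way ANOVA F-statistic for each feature column and keeps the highest-scoring columns. It is not the authors' implementation: the function names `anova_f` and `select_top_features` are hypothetical, only the Python standard library is used, and features are assumed to be stored as rows of a list of lists.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-statistic (MSB / MSW) for a single feature."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    # Between-group sum of squares, with k - 1 degrees of freedom.
    ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Within-group sum of squares, with n - k degrees of freedom.
    ssw = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    msb, msw = ssb / (k - 1), ssw / (n - k)
    if msw == 0:
        # Constant feature scores 0; perfectly separated groups score infinity.
        return 0.0 if msb == 0 else float("inf")
    return msb / msw


def select_top_features(X, y, top_k):
    """Indices of the top_k columns of X ranked by F-statistic (highest first)."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:top_k]
```

Under these assumptions, calling `select_top_features(X, y, 100)` on the embedded features would return the indices of the 100 columns with the largest F-statistics.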
J. Sui et al.
Following feature selection, the dimensionality of the extracted features was reduced from 6165 to 100, thereby eliminating a large amount of redundant information and mitigating the risk of overfitting.
2.4 Balanced Dataset
Because class imbalance in the dataset may degrade the model's performance, we applied the SMOTE algorithm, a popular oversampling method, to mitigate the issue. SMOTE, which stands for Synthetic Minority Over-sampling Technique, is an algorithm for synthesizing new samples for the under-represented class in a binary classification task. Given a set of minority class samples, SMOTE generates a synthetic sample by computing the difference between a minority sample and one of its nearest neighbors, scaling this difference by a random factor, and adding it to the minority sample. The goal is to increase the number and diversity of minority class samples, thus mitigating the class imbalance problem. The formula for generating a synthetic sample can be expressed as:
$$\mathrm{SMOTE}(x) = x_{minority} + r \cdot (x_{neighbor} - x_{minority}) \tag{4}$$
where $x_{minority}$ is the minority sample, $x_{neighbor}$ is one of its $k$ nearest neighbors, and $r$ is a random number between 0 and 1. Each new sample generated by this formula is added to the original dataset, increasing the number of minority samples. This enables us to balance the dataset and enhance the performance of the model.
2.5 Classification Models
In our study, we employed nine traditional machine learning models for classification: Gaussian Naive Bayes (GaussianNB), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM), Gradient Tree Boosting (GBDT), Multilayer Perceptron (MLP), K-Nearest Neighbor (KNN), and Extreme Gradient Boosting Tree (XGBoost), all of which have been widely used in the field of protein identification and localization.
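The interpolation step of Eq. (4) in Sect. 2.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `k_nearest` and `smote_sample` are hypothetical, samples are plain lists of floats, and neighbors are found by brute-force Euclidean distance.

```python
import random


def k_nearest(x, pool, k):
    """The k nearest neighbors of x in pool (Euclidean distance), excluding x itself."""
    others = [p for p in pool if p is not x]
    return sorted(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]


def smote_sample(x_minority, neighbors, rng=random):
    """One synthetic sample on the segment between x_minority and a random neighbor."""
    x_neighbor = rng.choice(neighbors)
    r = rng.random()  # random factor in [0, 1)
    # Eq. (4): x_minority + r * (x_neighbor - x_minority), applied per coordinate.
    return [a + r * (b - a) for a, b in zip(x_minority, x_neighbor)]
```

Repeating `smote_sample(x, k_nearest(x, minority, k))` for randomly chosen minority samples `x` until both classes have the same size would balance the dataset; the choice of `k` is left open here because the text does not state it.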
These models were implemented using scikit-learn, and we tuned their hyperparameters through grid search to achieve the best possible performance.
2.6 Evaluation Metrics and Methods
Accuracy (ACC), balanced accuracy (BACC), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC) and F1-score were used to evaluate the performance of the prediction system. They are calculated as follows:
$$Sp = \frac{TN}{TN + FP} \tag{5}$$
$$Sn = \frac{TP}{TP + FN} \tag{6}$$
$$ACC = \frac{TP + TN}{TP + FN + TN + FP} \tag{7}$$
$$BACC = \frac{Sp + Sn}{2} \tag{8}$$
$$F1 = \frac{2 \times TP}{2 \times TP + FN + FP} \tag{9}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FN) \times (TN + FP)}} \tag{10}$$
In this study, we focus on the binary classification problem of classifying peroxisomal proteins. The classifier's performance is computed from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Sensitivity (Sn) and specificity (Sp) measure the proportion of correct predictions among positive and negative samples, respectively. The F1 score is used to reflect the model's robustness, with higher scores indicating stronger robustness. Accuracy (ACC) measures the overall accuracy of the classifier. When the dataset is imbalanced, however, balanced accuracy (BACC) is preferred as an evaluation metric, and the Matthews correlation coefficient (MCC) is another metric suited to such cases. In addition, a receiver operating characteristic (ROC) curve is plotted, with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis. The area under the ROC curve (AUC) is also used as an evaluation metric, with higher values indicating better model performance.
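Equations (5) to (10) can be evaluated directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the function name `binary_metrics` is hypothetical, and it assumes that both classes are present (so no denominator in Sp or Sn is zero).

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Sp, Sn, ACC, BACC, F1 and MCC from confusion-matrix counts, Eqs. (5)-(10)."""
    sp = tn / (tn + fp)
    sn = tp / (tp + fn)
    acc = (tp + tn) / (tp + fn + tn + fp)
    bacc = (sp + sn) / 2
    f1 = 2 * tp / (2 * tp + fn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fn) * (tn + fp))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sp": sp, "Sn": sn, "ACC": acc, "BACC": bacc, "F1": f1, "MCC": mcc}
```

For example, `binary_metrics(tp=40, fp=10, tn=45, fn=5)` gives an ACC of 0.85 but an MCC of only about 0.70, which shows how MCC penalizes the uneven error distribution that ACC hides.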
3 Results and Discussion
In this study, nine classification algorithms were analyzed and compared. The performance of the features extracted by the ProSE model on these classifiers was evaluated using tenfold cross-validation after dataset balancing and feature selection. The results, presented in Table 1, indicate that the LightGBM model outperformed the other models, with ACC, F1-score, specificity, sensitivity, MCC, and AUC values of 0.9836, 0.9447, 0.9631, 0.9289, 0.8935, and 0.9950, respectively. Fig. 3 shows the ROC curves for all nine classifiers. Table 2 displays the results of the peroxisomal protein sub-localization prediction using the TAPE feature extraction method, again evaluated using tenfold cross-validation after dataset balancing and feature selection. The LightGBM model achieved the best performance. Fig. 4 shows the ROC curves for all nine classifiers. The results suggest that the TAPE method outperformed the other feature extraction method, and that the LightGBM model was the most effective classifier for predicting the sub-localization of peroxisomal proteins. Finally, our proposed TAPE-Pero model was evaluated using a double cross-validation approach on the peroxisomal protein dataset, and its performance was compared to the In-Pero model constructed by Anteghini et al. in 2021 [15]. Results, presented in
Table 1. ProSE + SMOTE + ANOVA.
Model       ACC     F1-score  Sp      Sn      MCC     ROC-AUC
GaussianNB  0.8332  0.8361    0.7909  0.8648  0.6662  0.9273
LR          0.8748  0.8699    0.8763  0.8614  0.7448  0.9524
RF          0.9509  0.9490    0.9678  0.9353  0.9007  0.9905
SVM         0.9014  0.9223    0.9482  0.8998  0.8518  0.9630
LightGBM    0.9836  0.9447    0.9631  0.9289  0.8935  0.9950
GBDT        0.9584  0.9473    0.9692  0.9305  0.8997  0.9924
MLP         0.9470  0.9420    0.9542  0.9381  0.8957  0.9730
KNN         0.9359  0.9349    0.9708  0.9122  0.8855  0.9544
XGBoost     0.9319  0.9316    0.9679  0.9090  0.8798  0.9590
Fig. 3. ProSE + SMOTE + ANOVA.
Table 3, demonstrate that the TAPE-Pero model outperforms the In-Pero model, with higher values in ACC, BACC, F1(inner), F1(outer), and MCC. These results indicate the effectiveness of the TAPE-Pero model.
Table 2. TAPE + SMOTE + ANOVA.
Model       ACC     F1-score  Sp      Sn      MCC     AUC-ROC
GaussianNB  0.8828  0.8778    0.9033  0.8558  0.7598  0.9378
LR          0.9359  0.9276    0.9870  0.8837  0.8743  0.9535
RF          0.9397  0.9362    0.9870  0.8983  0.8832  0.9855
SVM         0.9583  0.9450    0.9905  0.9107  0.8998  0.9775
LightGBM    0.9897  0.9610    0.9913  0.9386  0.9280  0.9823
GBDT        0.9360  0.9533    0.9888  0.9280  0.9151  0.9790
MLP         0.9472  0.9408    0.9870  0.9075  0.8962  0.9494
KNN         0.9130  0.9206    0.9905  0.8691  0.8646  0.9435
XGBoost     0.9020  0.9091    0.9842  0.8548  0.8448  0.9903
Fig. 4. TAPE + SMOTE + ANOVA.
Table 3. Comparison with the In-Pero model.
Model      ACC    BACC   F1(inner)  F1(outer)  MCC
In-Pero    0.919  0.863  0.825      0.859      0.721
TAPE-Pero  0.926  0.924  0.920      0.916      0.854
4 Conclusions
In this study, we evaluated the performance of the ProSE and TAPE methods for the identification and localization of peroxisomal proteins. To our knowledge, this is the first time that these methods have been used for peroxisomal protein identification and localization. Additionally, whereas the state-of-the-art In-Pero model combines the SeqVec and UniRep methods, our proposed TAPE-Pero model uses only the TAPE method for feature extraction, demonstrating the strength of TAPE for peroxisomal protein identification and localization. In future studies, we plan to investigate additional deep representation learning methods for peroxisomal protein feature extraction, to extend our model to the identification of proteins from other organelles, and to expand the peroxisomal protein dataset so that deep learning methods can be applied.
Acknowledgments. This work was supported in part by Shandong Provincial Natural Science Foundation, China (ZR2021MF036), the Key Science & Technology Innovation Project of Shandong Province (2019JZZY010448) and the National Natural Science Foundation of China (31872415).
References
1. Zhou, H., Chen, C., Wang, M., Ma, Q., Yu, B.: Predicting golgi-resident protein types using conditional covariance minimization with XGBoost based on multiple features fusion. IEEE Access 7, 144154–144164 (2019)
2. Lv, Z., Jin, S., Ding, H., Zou, Q.: A random forest sub-Golgi protein classifier optimized via dipeptide and amino acid composition features. Front. Bioeng. Biotechnol. 7, 215 (2019)
3. Yu, B., et al.: SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting. Bioinformatics 36(4), 1074–1081 (2020)
4. Ahmad, J., Hayat, M.: MFSC: multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou’s PseAAC components. J. Theor. Biol. 463, 99–109 (2019)
5. Qiu, W., et al.: Predicting protein submitochondrial locations by incorporating the pseudoposition specific scoring matrix into the general Chou’s pseudo-amino acid composition. J. Theor. Biol. 450, 86–103 (2018)
6. Savojardo, C., Bruciaferri, N., Tartari, G., Martelli, P.L., Casadio, R.: DeepMito: accurate prediction of protein sub-mitochondrial localization using convolutional neural networks. Bioinformatics 36(1), 56–64 (2020)
7. Wanders, R.J.: Metabolic functions of peroxisomes in health and disease. Biochimie 98, 36–44 (2014)
8. Cai, M., et al.: Disruption of peroxisome function leads to metabolic stress, mTOR inhibition, and lethality in liver cancer cells. Cancer Lett. 421, 82–93 (2018)
9. Benjamin, D.I., et al.: Ether lipid generating enzyme AGPS alters the balance of structural and signaling lipids to fuel cancer pathogenicity. Proc. Nat. Acad. Sci. 110(37), 14912–14917 (2013)
10. Zhou, M., Chinnaiyan, A.M., Kleer, C.G., Lucas, P.C., Rubin, M.A.: Alpha-Methylacyl-CoA racemase: a novel tumor marker over-expressed in several human cancers and their precursor lesions. Am. J. Surg. Pathol. 26(7), 926–931 (2002)
11. Hartmann, T., et al.: Alzheimer’s disease βA4 protein release and amyloid precursor protein sorting are regulated by alternative splicing. J. Biological Chem. 271(22), 13208–13214 (1996)
12. Berger, J., Dorninger, F., Forss-Petter, S., Kunze, M.: Peroxisomes in brain development and function. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research 1863(5), 934–955 (2016)
13. Trompier, D., et al.: Brain peroxisomes. Biochimie 98, 102–110 (2014)
14. Ding, H., Liu, L., Guo, F.-B., Huang, J., Lin, H.: Identify Golgi protein types with modified mahalanobis discriminant algorithm and pseudo amino acid composition. Protein Pept. Lett. 18(1), 58–63 (2011)
15. Anteghini, M., Martins dos Santos, V., Saccenti, E.: In-Pero: exploiting deep learning embeddings of protein sequences to predict the localisation of peroxisomal proteins. Int. J. Molecular Sci. 22(12), 6409 (2021)
16. Alley, E.C., Khimulya, G., Biswas, S., AlQuraishi, M., Church, G.M.: Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16(12), 1315–1322 (2019)
17. Heinzinger, M., et al.: Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 20(1), 1–17 (2019)
18. Rao, R., et al.: Evaluating protein transfer learning with TAPE. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
19. Bepler, T., Berger, B.: Learning the protein language: Evolution, structure, and function. Cell Syst. 12(6), 654–669 (2021)
20. Ståhle, L., Wold, S.: Analysis of variance (ANOVA). Chemometrics Intell. Lab. Syst. 6(4), 259–272 (1989)
Classification of Coding and Non-coding Genes in Paeonia Lactiflora Pall Based on Machine Learning
Bolun Yang1, Yuehui Chen2(B), Yaou Zhao2, and Yi Cao3
1 School of Information Science and Engineering, University of Jinan, Jinan, China
2 Artificial Intelligence Institute (School of Information Science and Engineering), University of Jinan, No. 336, Jinan, China
[email protected]
3 Shandong Provincial Key Laboratory of Network Based Intelligent Computing (School of Information Science and Engineering), University of Jinan, No. 336, Jinan, China
Abstract. Paeonia lactiflora is a herb commonly used in the clinical practice of traditional Chinese medicine. Total glucosides of paeony show clear advantages in the treatment of recurrent oral ulcer: long-term use has fewer side effects for patients, which makes it an ideal drug for this condition. Correctly classifying the coding and non-coding genes of Paeonia lactiflora can greatly help scholars who study its medicinal and chemical effects further. In this paper, four feature extraction algorithms, k-mers, RevcKmer, PseDNC and PseKNC (the latter two being pseudo-nucleotide composition, PseNAC, methods), are used to extract features from Paeonia lactiflora DNA sequences. Three machine learning algorithms, support vector machine (SVM), Gaussian Naive Bayes and LightGBM, are used as classification models on the extracted features. The experimental results show that the combination of k-mers features and the SVM classifier achieves the best classification and prediction performance. The final experimental results are Acc 93.445%, F1-score 0.9512, Sn 98.25%, Sp 83.34%, MCC 0.8702 and AUROC 0.9333.
Keywords: Paeonia lactiflora · DNA sequence · Machine learning algorithms
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 578–586, 2023. https://doi.org/10.1007/978-981-99-4749-2_49
1 Introduction
Recurrent oral ulcer is a common oral disease seen in the Department of Rheumatology and Immunology. Patients with recurrent oral ulcers are prone to relapse and respond poorly to local treatment. Most of the ulcers are distributed in the oral mucosa and at the tip of the tongue, and severe ulcers can affect the patient's diet and daily life. At present, there are many treatment measures for recurrent oral ulcers. Studies have shown that the application of total glucosides of paeony can effectively prolong the intermittent period of oral ulcers. At the same time, compared with other drugs, long-term application of total glucosides of paeony has fewer side effects and better patient compliance, which makes it an ideal drug for patients with recurrent oral ulcers. Paeonia Lactiflora Pall is the dried root of Paeonia lactiflora in Ranunculaceae [1]. Total glucosides of paeony is extracted from the root of Paeonia lactiflora. It is a general term for paeoniflorin, hydroxypaeoniflorin, paeonin, albiflorin, and benzoylpaeoniflorin [2]. Among them, paeoniflorin accounts for more than 90% of the total glycosides. The systematic study of total glucosides of paeony by Professor Xu Shuyun's team found that it plays an important role in immune regulation, anti-inflammation, analgesia and liver protection. Therefore, our main work is to study the genomics of Radix Paeoniae Alba, from which total glucosides of paeony are extracted. Starting from the existing DNA sequences of Paeonia lactiflora, we use machine learning methods to design classifiers. Correctly identifying and classifying its coding and non-coding gene sequences can help researchers better understand the pharmacological and chemical effects of Paeonia lactiflora and make it more widely used.
With the continuous development of machine learning methods, they are increasingly used in prediction and classification tasks on biological data. In determining the DNA methylation characteristics of COVID-19 disease, Bowler et al. used a random forest classification model to identify individuals with severe COVID-19 disease, and obtained an AUC-ROC of 0.898 and an AUC-PRC of 0.864 [3]. Leitheiser et al. developed four classifiers based on different machine learning models to predict the primary sites of HNSC tumors from DNA methylation features [4]. Sarkar et al. used a multinomial Naive Bayes classifier and logistic regression with k-mer coding to classify DNA sequences, with accuracies of 93.16% and 93.13%, respectively [5]. Mridha used random forest to distinguish benign and malignant tumors in the diagnosis of breast cancer, with an accuracy of 98.83% [6]. Sun et al. used stacked autoencoders to study sequence-based PPI prediction.
Through 10-fold cross-validation, the average accuracy of the best model reached 97.19% [7]. Tampuu et al. developed ViraMiner, a deep learning-based method that classifies viral genomes with an AUC-ROC of 0.923 [8]. Quang et al. proposed DanQ, a hybrid convolutional and bidirectional long short-term memory recurrent neural network framework for predicting non-coding function from sequence [9]. Mahmoud et al. presented a computationally efficient framework for classifying DNA sequences of living organisms in the image domain. Their strategy relies on a multilayer perceptron trained by a pseudoinverse learning autoencoder (PILAE) algorithm, and the PILAE classifier achieves better performance than other deep neural networks [10].
In order to obtain an accurate classifier of coding and non-coding genes of Radix Paeoniae Alba, our experiment is divided into the following steps:
(1) Features are extracted from the DNA sequences of Paeonia lactiflora by the k-mers, RevcKmer and pseudo-nucleotide composition algorithms.
(2) Three machine learning algorithms, including SVM, are used to classify the extracted DNA sequence features.
(3) The results are analyzed and evaluated by ACC, AUC, MCC and other classifier performance indexes.
(4) After comparison, the best combination we obtained is k-mers with SVM, with ACC 93.445%, MCC 0.8702, F1-score 0.9512 and AUC 0.9333.
The experimental results show that machine learning methods perform well in classifying the DNA sequences of Paeonia lactiflora (Fig. 1).
Fig. 1. Work flow chart
2 Methods and Materials
2.1 Data
The data used in this experiment were obtained by searching the Gene database of NCBI (National Center for Biotechnology Information). Gene is a searchable database that focuses on genomes that have been fully sequenced and that have an active research community contributing gene-specific data. Gene information includes nomenclature, chromosomal localization, gene products and their properties (e.g. protein-protein interactions), markers, phenotypes, interactions, citation links, sequences, mutation details, maps, expression reports, homologues, protein domain content, and external database links. In total, we obtained 92 DNA sequences of Paeonia lactiflora, of which 58 were protein-coding sequences and 34 were non-coding sequences.
2.2 Feature Extraction Method
In this paper, four feature extraction methods are used: k-mers, RevcKmer, and two PseNAC variants (PseDNC and PseKNC).
2.3 k-mers Feature Extraction Method
k-mers is a common feature extraction method in bioinformatics. It divides the DNA sequence into subsequences of length k and counts the occurrences of every possible subsequence of length k [11]. Because a DNA sequence is made up of four bases, A, T, C and G, there are $4^k$ possible k-mers. For a DNA sequence of length M, a window of length k slides along the sequence from the first base, one base at a time, producing M − k + 1 k-mers, where adjacent k-mers differ by a shift of one base. As k grows, the dimension of the feature vector also grows, so that it reflects the structural information in the DNA sequence in more detail, which helps us classify and analyze the sequence. However, too large a k leads to the curse of dimensionality: the feature vectors become sparse and the model is prone to problems such as overfitting. Choosing an appropriate k for the problem at hand is therefore the key point of the algorithm. After k-mers feature extraction, machine learning algorithms can be used to classify and predict biological sequences with good results. Déraspe et al. used the k-mers method to compare prokaryotic genomes and determine the similarity between a strain and entire bacterial genome clusters [12]. In their experiment, they determined the correlation of five important bacterial genome features by comparing the k-mers composition of bacterial genomes, which also indicates that k-mers can be applied to study the importance and correlation of specific gene categories. The k-mers feature extraction method is thus widely used in bioinformatics.
2.4 RevcKmer Feature Extraction Method
The reverse complementary k-mers algorithm (RevcKmer) is an improvement built on the k-mers algorithm [13]. When calculating k-mers features, not only forward k-mers but also their reverse complements are considered. After the k-mers of a DNA sequence are obtained, each k-mer whose reverse complement is already represented is removed, so that a k-mer and its reverse complement are treated as a single feature, and the remaining k-length subsequences are counted. This greatly reduces the dimension of the feature vector and improves the efficiency of the algorithm.
2.5 Pseudo-Nucleotide Composition (PseNAC) Algorithm
Pseudo-nucleotide composition (PseNAC) is a common algorithm for extracting features from DNA sequences [14].
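The k-mers and RevcKmer features of Sects. 2.3 and 2.4 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, windows containing bases other than A, C, G and T are skipped, and RevcKmer is assumed to follow the common convention of merging each k-mer with its reverse complement under the lexicographically smaller of the two.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_counts(seq, k):
    """Counts of all 4**k possible k-mers over the M - k + 1 sliding windows of seq."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
    return counts


def reverse_complement(kmer):
    """Reverse complement of a k-mer, e.g. AAC -> GTT."""
    return kmer.translate(COMPLEMENT)[::-1]


def revc_kmer_counts(seq, k):
    """k-mer counts with each k-mer merged into its reverse-complement pair."""
    merged = {}
    for kmer, c in kmer_counts(seq, k).items():
        canonical = min(kmer, reverse_complement(kmer))
        merged[canonical] = merged.get(canonical, 0) + c
    return merged
```

For k = 2 this reduces the feature dimension from 16 to 10, because four dinucleotides (AT, TA, CG, GC) are their own reverse complements and the remaining twelve collapse into six pairs.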
The general single nucleotide composition (NAC) calculates the frequency of each nucleotide in the DNA sequence as a sequence feature, but does not consider the order information of the DNA sequence. The PseNAC algorithm introduces pseudo nucleotide components that capture the order information of the DNA sequence while keeping the representation of long sequences compact, which makes the calculation more concise. Commonly used PseNAC algorithms include pseudo dinucleotide composition (PseDNC) and pseudo k-nucleotide composition (PseKNC). The PseDNC algorithm is a feature extraction method developed by Chen et al. on the basis of the PseAAC algorithm, which introduces three angular parameters and three translational parameters [15]. For a DNA sequence D, the dinucleotide is used as the basic unit, and the occurrences of each of the 16 dinucleotides in the target sequence are counted. The DNC features of this sequence are obtained after normalization:
$$DNC = [f(AA)\ f(AC)\ f(AG)\ f(AT)\ \cdots\ f(TT)]^{T} \tag{1}$$
In order to merge the order information of the global sequence into the feature vector of the DNA sequence, this method introduces a set of sequence order correlation factors, which are defined as follows:
$$\theta_\lambda = \frac{1}{L-1-\lambda} \sum_{i=1}^{L-1-\lambda} \Theta(R_i R_{i+1}, R_{i+\lambda} R_{i+\lambda+1}) \tag{2}$$
In the formula, $L$ represents the length of the DNA sequence, $R_i$ is the nucleotide at position $i$, and the parameter $\lambda$ is an integer representing the highest order of the DNA sequence order correlation factor; the factors of lower order are computed in the same way. The correlation function is as follows:
$$\Theta(R_i R_{i+1}, R_j R_{j+1}) = \frac{1}{\mu} \sum_{u=1}^{\mu} \left[ P_u(R_i R_{i+1}) - P_u(R_j R_{j+1}) \right]^2 \tag{3}$$
where $\mu$ is the number of physicochemical properties considered and $P_u(\cdot)$ is the value of the $u$-th property for the given dinucleotide. By adding the order correlation factors of the sequence to the DNC feature, we obtain the PseDNC feature as follows:
$$D = [d_1\ d_2\ \cdots\ d_{16}\ d_{16+1}\ \cdots\ d_{16+\lambda}]^{T} \tag{4}$$
where:
$$d_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j} & (1 \le k \le 16) \\[2ex] \dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j} & (17 \le k \le 16 + \lambda) \end{cases} \tag{5}$$
Here $f_k$ is the normalized frequency of the $k$-th dinucleotide, $\theta_j$ is the correlation factor of order $j$ from Eq. (2), and $w$ is a weight factor.
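Equations (1) to (5) can be put together in a short sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `pse_dnc` is hypothetical, the physicochemical property table `props` must be supplied by the caller (no real property values are given here), the sequence is assumed to contain only A, C, G and T, and λ must be smaller than L − 1.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT


def pse_dnc(seq, props, lam, w):
    """PseDNC vector of length 16 + lam.

    props maps each dinucleotide to a list of mu property values
    (caller-supplied; a hypothetical input in this sketch).
    """
    L = len(seq)
    pairs = [seq[i:i + 2] for i in range(L - 1)]
    # Eq. (1): normalized dinucleotide frequencies f_1..f_16.
    f = [pairs.count(d) / len(pairs) for d in DINUCS]

    def corr(a, b):
        # Eq. (3): mean squared difference over the mu properties.
        pa, pb = props[a], props[b]
        return sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)

    # Eq. (2): theta_j averages the correlation of dinucleotides j positions apart.
    theta = [
        sum(corr(pairs[i], pairs[i + j]) for i in range(L - 1 - j)) / (L - 1 - j)
        for j in range(1, lam + 1)
    ]
    # Eq. (5): both parts share the same normalizing denominator.
    denom = sum(f) + w * sum(theta)
    return [fk / denom for fk in f] + [w * t / denom for t in theta]
```

Because both parts of the vector share one denominator, its components always sum to 1, which gives a quick sanity check on any implementation.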
Through the PseDNC algorithm, (16 + λ)-dimensional feature vectors containing sequence order information can be extracted, and DNA sequences of different lengths are converted into feature vectors of the same dimension. The pseudo-nucleotide composition algorithm is simple to compute and yields low-dimensional features. It can quickly extract the characteristics of DNA sequences and is applied to sequence classification and prediction.
2.6 Classification Model
To address the classification problem of whether a DNA sequence of Paeonia lactiflora is coding or non-coding, we use the support vector machine to classify the extracted DNA sequence features, and use two other algorithms, Gaussian Naive Bayes (GaussianNB) and Light Gradient Boosting Machine (LightGBM), for comparison. Support Vector Machine (SVM) is a common machine learning algorithm, which is usually used for classification and regression problems [16]. It is a discriminative model based on statistical learning theory and the structural risk minimization principle. The basic idea of SVM is to map the input data into a high-dimensional feature space and find an optimal hyperplane that separates the data, so that the sample points of different categories are correctly classified. The training sample points closest to the hyperplane on either side are called 'support vectors', and they determine the position of the hyperplane. In classification problems, SVM can deal with linearly separable, linearly inseparable and nonlinear problems [17]. In the linearly separable case, SVM separates the categories by finding the maximal margin hyperplane. In the linearly inseparable and nonlinear cases, SVM uses a kernel function to map the data into a high-dimensional space and constructs an optimal hyperplane in this space to complete the classification task [18]. Traditional machine learning methods mainly approximate the expected risk by minimizing the empirical risk, and are therefore more dependent on the number of training samples [19]. Their training effect is not satisfactory in the case of small samples, whereas the support vector machine follows the structural risk minimization principle instead.
2.7 Evaluation Metrics and Methods
Accuracy (Acc), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC) and F1-score were used to evaluate the performance of the prediction system [20]. They are calculated as follows:
$$Sp = \frac{TN}{TN + FP} \tag{6}$$
$$Sn = \frac{TP}{TP + FN} \tag{7}$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
$$F1 = \frac{2 \times TP}{2 \times TP + FN + FP} \tag{9}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FN) \times (TN + FP)}} \tag{10}$$
For a binary classification problem, each prediction falls into one of four classes: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). TP is a positive sample predicted as positive, FP is a negative sample predicted as positive, TN is a negative sample predicted as negative, and FN is a positive sample predicted as negative. Sn and Sp are the proportions of correct predictions among positive samples and negative samples, respectively. The F1 score reflects the robustness of the model: the higher the score, the more stable the model. Acc reflects the overall accuracy of the classifier. When the data set is unbalanced, Acc cannot reliably evaluate the quality of the classification results, and in this case we use MCC for evaluation. The horizontal axis of the ROC curve is the false positive rate (FPR), that is, the proportion of negative samples that are predicted as positive, and the vertical axis is the true positive rate (TPR), that is, the proportion of positive samples that are predicted as positive. AUC, the area under the ROC curve, is used as an evaluation index. AUC = 1 corresponds to an ideal model, which is difficult to achieve in practice. When 0.5 < AUC < 1, the model performs better than random guessing, and the closer AUC is to 1, the better the model.
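The AUC described above can be computed without drawing the curve, using the equivalent rank interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a minimal illustration; the function name `roc_auc` is hypothetical, labels are assumed to be 1 for positive and 0 for negative, and both classes must be present.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a misordered pair 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form costs O(P × N) comparisons, which is acceptable for a data set of 92 sequences; a sort-based computation would be preferable for large data sets.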
3 Results and Discussion
In order to find the best combination of feature extraction method and classifier, we first use the k-mers, RevcKmer and pseudo-nucleotide composition (PseNAC) algorithms to extract features from the DNA sequences of Paeonia lactiflora. The four sets of extracted features are then used to train the support vector machine and the other classifiers (Table 1 and Table 2).
Table 1. k-mers
Model       Sn(%)   Sp(%)   Acc(%)  MCC     F1      AUROC
SVM         98.25   83.34   93.445  0.8702  0.9521  0.9333
GaussianNB  91.666  88.334  90.001  0.8191  0.9164  0.9194
LightGBM    94.666  84.167  90.334  0.8079  0.9222  0.9717
Table 2. RevcKmer
Model       Sn(%)   Sp(%)   Acc(%)  MCC     F1      AUROC
SVM         87.667  88.334  87.779  0.7725  0.8943  0.9283
GaussianNB  89.999  88.333  88.89   0.7935  0.9075  0.9000
LightGBM    89.666  81.667  86.89   0.7346  0.8956  0.9411
Table 3. PseKNC
Model       Sn(%)   Sp(%)   Acc(%)  MCC     F1      AUROC
SVM         92.666  85.834  90.112  0.8078  0.9172  0.9172
GaussianNB  96.667  85.834  92.223  0.8532  0.9389  0.9122
LightGBM    90.999  93.334  88.001  0.7624  0.9005  0.9300
Table 4. PseDNC
Model       Sn(%)   Sp(%)   Acc(%)  MCC     F1      AUROC
SVM         92.666  85.834  90.112  0.8078  0.9172  0.9172
GaussianNB  96.667  85.834  92.223  0.8532  0.9389  0.9122
LightGBM    90.999  93.334  88.001  0.7624  0.9005  0.9300
Comparing the data in the four tables, we can conclude that the SVM algorithm obtains a high Acc value with all four extracted feature sets. With k-mers features, SVM exceeds the other two algorithms and obtains the highest Acc of all feature and classifier combinations. With the other three feature sets, however, the Gaussian Naive Bayes classifier obtains higher Acc, MCC and F1 values than SVM, which differs from our expectation that SVM would perform better on small-sample data. Nevertheless, all three classifiers achieve the desired results, which is a great help in classifying the DNA sequences of Paeonia lactiflora (Table 3 and Table 4).
4 Conclusion
In this paper, we propose an effective classifier for identifying the coding genes of Paeonia lactiflora. The feature extraction methods used are k-mers, RevcKmer and PseNAC. The combination of k-mers and support vector machine performs best. The final experimental results are ACC 94.534%, Sp 88.334%, Sn 98.333%, MCC 0.89 and AUROC 0.953. Although this paper contributes to the identification of the coding genes of Paeonia lactiflora, it still has many shortcomings. The classifiers used in this paper are classical machine learning models. With the growing popularity of deep learning in the field of machine learning, we will continue by applying deep learning methods to the identification of coding genes. We believe that deep learning has broad prospects in the identification of white peony genes.
Acknowledgments. This work was supported in part by Shandong Provincial Natural Science Foundation, China (ZR2021MF036), and in part by Key Science & Technology Innovation Project of Shandong Province (2019JZZY010448) and National Natural Science Foundation of China (31872415).
References
1. He, D.Y., Dai, S.M.: Anti-inflammatory and immunomodulatory effects of Paeonia lactiflora Pall, a traditional Chinese herbal medicine. Front. Pharmacol. 2, 10 (2011)
2. Lee, S.C., Kwon, Y.S., Son, K.H., et al.: Antioxidative constituents from Paeonia lactiflora. Arch. Pharmacal Res. 28, 775–783 (2005)
3. Bowler, S., Papoutsoglou, G., Karanikas, A., et al.: A machine learning approach utilizing DNA methylation as an accurate classifier of COVID-19 disease severity. Sci. Rep. 12(1), 17480 (2022)
4. Leitheiser, M., Capper, D., Seegerer, P., et al.: Machine learning models predict the primary sites of head and neck squamous cell carcinoma metastases based on DNA methylation. J. Pathol. 256(4), 378–387 (2022)
5. Sarkar, S., Mridha, K., Ghosh, A., et al.: Machine learning in bioinformatics: new technique for DNA sequencing classification. In: Advanced Computing and Intelligent Technologies: Proceedings of ICACIT 2022. Singapore: Springer Nature Singapore, pp. 335–355 (2022)
6. Mridha, K.: Early prediction of breast cancer by using artificial neural network and machine learning techniques. In: 2021 10th IEEE International Conference on Communication Systems and Network Technologies (CSNT). IEEE, pp. 582–587 (2021)
7. Sun, T., Zhou, B., Lai, L., et al.: Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform. 18, 277 (2017)
8. Tampuu, A., Bzhalava, Z., Dillner, J., et al.: ViraMiner: deep learning on raw DNA sequences for identifying viral genomes in human samples. PLoS ONE 14(9), e0222271 (2019)
9. Quang, D., Xie, X.: DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44(11), e107–e107 (2016)
10. Mahmoud, M.A.B., Guo, P.: DNA sequence classification based on MLP with PILAE algorithm. Soft. Comput. 25(5), 4003–4014 (2021)
11. Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 1–7 (2011)
12. Déraspe, M., Raymond, F., Boisvert, S., et al.: Phenetic comparison of prokaryotic genomes using k-mers. Mol. Biol. Evol. 34(10), 2716–2729 (2017)
13. Dao, F.Y., Lv, H., Su, W., et al.: iDHS-Deep: an integrated tool for predicting DNase I hypersensitive sites by deep neural network. Brief. Bioinform. 22(5), bbab047 (2021)
14. Chen, W., Lin, H., Chou, K.C.: Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Mol. BioSyst. 11(10), 2620–2634 (2015)
15. Chen, W., Feng, P.M., Lin, H., et al.: iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 41(6), e68–e68 (2013)
16. Hearst, M.A., Dumais, S.T., Osuna, E., et al.: Support vector machines. IEEE Intell. Syst. Appl. 13(4), 18–28 (1998)
17. Support vector machines applications. New York: Springer (2014)
18. Cherkassky, V., Ma, Y.: Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw. 17(1), 113–126 (2004)
19. Huang, S., Cai, N., Pacheco, P.P., et al.: Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genomics Proteomics 15(1), 41–51 (2018)
20. Wei, L., Xing, P., Su, R., Shi, G., Ma, Z.S., Zou, Q.: CPPred–RF: a sequence-based predictor for identifying cell–penetrating peptides and their uptake efficiency. J. Proteome Res. 16(5), 2044–2053 (2017)
Accurate Identification of Submitochondrial Protein Location Based on Deep Representation Learning Feature Fusion

Jianan Sui1, Yuehui Chen2(B), Yi Cao3, and Yaou Zhao2

1 School of Information Science and Engineering, University of Jinan, Jinan, China
2 Artificial Intelligence Institute (School of Information Science & Engineering), University of Jinan, No. 336, Jinan, China
[email protected]
3 Shandong Provincial Key Laboratory of Network Based Intelligent Computing (School of Information Science & Engineering), University of Jinan, No. 336, Jinan, China
Abstract. Mitochondria, comprising two layers of membranes, are indispensable organelles present in most cells. They perform a vital function in generating cellular energy and facilitating aerobic respiration. Experimentally determining the submitochondrial location of proteins is both time-consuming and costly. Therefore, the development of a reliable method to predict the submitochondrial location of mitochondrial proteins is imperative. In this study, we propose a gradient boosting tree (GBDT) based approach to enhance the accuracy of submitochondrial protein localization. To achieve this, we re-divided the benchmark dataset M317 and used deep representation learning to extract features from mitochondrial protein sequences. Additionally, we used a Generative Adversarial Network (GAN) to balance the dataset. The extracted features were selected using the light gradient boosting machine (LightGBM). We then selected the optimal feature set from the submitochondrial protein features extracted by the TAPE model and combined it with the submitochondrial protein features extracted by the SeqVec model. Subsequently, we fed the fused features into six traditional machine learning models. We performed ten-fold cross-validation experiments on the M317 dataset and achieved high accuracies: 98.34%, 97.16%, and 98.23% for the inner membrane, matrix, and outer membrane partitions, respectively.

Keywords: Submitochondrial localization · Deep representation learning · Generative Adversarial Network · Feature fusion · GBDT
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 587–596, 2023. https://doi.org/10.1007/978-981-99-4749-2_50

1 Introduction

Mitochondria, which are present in most cells, are the main sites of aerobic respiration, specifically its second and third stages, which are responsible for energy production in cells. Mitochondria are double-membraned organelles containing many enzymes involved in aerobic respiration, as well as a small amount of DNA and RNA. They
play crucial roles in various metabolic and cellular pathways, making them a prominent topic of study in the fields of life science and biology. The inner mitochondrial membrane surrounds the mitochondrial matrix, while the outer membrane separates the organelle’s interior from the rest of the cell, and the intermembrane space separates the two membranes. The existence of these internal compartments suggests that proteins located in different mitochondrial compartments have specialized tasks or functions. Submitochondrial protein mislocalization can result in a range of damaging interactions and cause serious diseases such as type II diabetes, Parkinson’s disease, and multifactorial disorders. Therefore, developing a method to accurately predict the submitochondrial location of mitochondrial proteins is crucial for their precise functional characterization [5]. The progress of machine learning has led to the development of several improved models for predicting the subcellular localization of proteins within the mitochondria. Du and Li [6] conducted the initial study on sub-mitochondrial protein localization, where the amino acid composition and dipeptide composition information were fused, followed by the fusion of physicochemical properties and finally input to the support vector machine (SVM) classifier. Lin et al. [7] utilized over-represented tetrapeptides for predicting the submitochondrial location of mitochondrial proteins, achieving 94% accuracy on the M317 dataset. Li et al. [8] proposed a method integrating position-specific scoring matrix (PSSM), gene ontology (GO), and protein features (PROFEAT) into the main features, selecting the optimal features by the recursive feature selection method and inputting them into the SVM classifier. Jiao et al. 
[9] presented a submitochondrial protein feature extraction method combining functional domain enrichment fraction, position-specific physical and chemical properties (PSPCP), and pseudo amino acid composition (PseAAC). They enhanced the PSPCP method and fed the resulting features into an SVM classifier. Qiu et al. [10] combined Chou's pseudo amino acid composition (PseAAC) and pseudo position-specific scoring matrix (PsePSSM) to extract submitochondrial protein sequence features. The extracted features were then denoised, and the optimal feature vector was input into an SVM classifier to predict submitochondrial protein localization. Savojardo et al. [11] introduced DeepMito, a novel approach that uses a convolutional neural network (CNN) to predict the submitochondrial localization of proteins. Yu et al. [12] proposed SubMito-XGBoost, an extreme gradient boosting (XGBoost)-based method for predicting protein submitochondrial localization on two training datasets, M317 and M983. The prediction accuracy of SubMito-XGBoost was 97.7% and 98.9% on M317 and M983, respectively, while its prediction accuracy on the independent test set M495 was 94.8%.

Prior models mainly relied on composition features and chemical properties for feature extraction. However, the advancement and refinement of deep learning methods have facilitated the adoption of these techniques in sequence-based protein characterization tasks. In this paper, we leveraged NLP pre-trained models, namely SeqVec, which is based on the ELMo model, and TAPE, which is based on the BERT model, to extract features from mitochondrial protein sequences. To address the issue of unbalanced data affecting the model's performance, we used a Generative Adversarial Network (GAN) to balance the dataset. Next, we employed LightGBM (Light Gradient Boosting Machine) to select
the most relevant features extracted from the data. Finally, we selected the optimal feature set from the submitochondrial protein features extracted by the TAPE model and fused it with the submitochondrial protein features extracted by the SeqVec model. The fused features were then fed into six traditional machine learning models. The flow chart of our proposed model is presented in Fig. 1.
Fig. 1. Model flow chart.
2 Materials and Methods

2.1 Datasets

The selection of datasets is a critical aspect of classification tasks. In our study, we carefully selected a benchmark dataset, M317 [6]. Du et al. extracted this dataset from the UniProt database (https://www.uniprot.org/). The dataset was constructed using sequences with less than 40% similarity and includes proteins from the outer membrane, inner membrane, and matrix. Specifically, the M317 dataset comprises 41 outer membrane, 131 inner membrane, and 145 matrix proteins. In contrast to previous studies that performed generalized multi-classification on this dataset, we re-divided the dataset into three binary partitions to enhance the precision of submitochondrial protein localization: outer membrane versus non-outer membrane, inner membrane versus non-inner membrane, and matrix versus non-matrix. The number of proteins in the M317 dataset is presented in Table 1.
Table 1. Protein distribution in the M317 dataset.

| Categories of proteins | Number of proteins |
|---|---|
| Inner membrane | 131 |
| Matrix | 145 |
| Outer membrane | 41 |
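The re-division of M317 into three binary problems can be sketched as follows. This is a minimal illustration, assuming each protein carries a single class name; the helper `one_vs_rest_labels` and the `classes` list are hypothetical stand-ins for the real annotations, and only the class sizes come from Table 1.

```python
# Class sizes reported for the M317 dataset in Table 1.
M317_COUNTS = {"inner membrane": 131, "matrix": 145, "outer membrane": 41}


def one_vs_rest_labels(classes, positive):
    """Return 1 for proteins of the positive class and 0 for all others."""
    return [1 if c == positive else 0 for c in classes]


# Hypothetical stand-in for the real annotations: one class name per protein.
classes = [name for name, n in M317_COUNTS.items() for _ in range(n)]

# One binary label vector per partition (e.g. matrix vs. non-matrix).
partitions = {name: one_vs_rest_labels(classes, name) for name in M317_COUNTS}

for name, labels in partitions.items():
    positives = sum(labels)
    print(f"{name}: {positives} positive, {len(labels) - positives} negative")
```

Each partition keeps all 317 proteins, so the outer membrane problem is the most imbalanced of the three.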
2.2 Feature Extraction

Over the course of several decades, efforts in feature extraction have primarily focused on the physicochemical properties of amino acids and on profile-based descriptors such as PSSM and PsePSSM [13, 14]. Recently, deep learning has begun to be applied to sequence-based protein characterization tasks. In this paper, we used two feature extraction methods, SeqVec and TAPE.

2.2.1 SeqVec

This approach employs ELMo, a commonly used deep bidirectional model in natural language processing (NLP), to generate continuous vector representations (embeddings) of protein sequences. By modeling protein sequences, ELMo can effectively capture the biophysical properties of the language of life from large amounts of unlabeled data (UniRef50). ELMo is a probabilistic distribution model that incorporates evolutionary information into the embeddings. Once trained, the acquired knowledge can be transferred to individual protein sequences by predicting relevant sequence features [15].

2.2.2 TAPE

In machine learning-based protein representation learning, researchers have introduced a benchmark for evaluating protein embeddings known as TAPE (Tasks Assessing Protein Embeddings). It is designed to assess the performance of different protein embedding methods on critical supervised tasks in three areas of protein biology. To this end, its authors chose a set of supervised tasks that are likely to benefit from self-supervised learning. The present study uses the BERT-based TAPE model.

2.3 Feature Selection

In protein sequence analysis, feature extraction methods often generate high-dimensional feature vectors containing redundant information that can degrade the predictive performance of models. Feature selection techniques filter these vectors to eliminate unnecessary features. In this study, we first use the TAPE deep learning model to convert submitochondrial protein sequences into 768-dimensional feature vectors. These vectors are then filtered using LightGBM (LGBM) feature selection and reduced to 350 dimensions. The LGBM algorithm selects the optimal feature subset based on the feature importance values calculated by the model.
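The importance-based reduction from 768 to 350 dimensions can be illustrated with a short sketch. It assumes one importance score per feature is already available, for example from the `feature_importances_` attribute of a fitted LightGBM classifier; here the scores and embeddings are random placeholders, and `select_top_k` and `reduce_features` are hypothetical helper names.

```python
import random


def select_top_k(importances, k):
    """Return the indices of the k largest importance scores, in index order."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])


def reduce_features(matrix, keep):
    """Keep only the selected columns of a row-major feature matrix."""
    return [[row[i] for i in keep] for row in matrix]


rng = random.Random(0)
n_features, k = 768, 350

# Placeholder importance scores (assumption): in practice these would come
# from a model fitted on the TAPE embeddings, not from a random generator.
importances = [rng.random() for _ in range(n_features)]

# Placeholder 768-dimensional embeddings for five proteins.
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(5)]

keep = select_top_k(importances, k)
reduced = reduce_features(embeddings, keep)
print(len(reduced), "proteins x", len(reduced[0]), "features")
```

Returning the kept indices in ascending order preserves the original feature order, which makes it easy to apply the same selection to new sequences.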
2.4 Balanced Dataset

As the benchmark dataset used in our study is imbalanced, the imbalance could affect the model's performance. We therefore employed a Generative Adversarial Network (GAN) to balance the submitochondrial protein dataset. Specifically, the generator network of the GAN was trained to generate synthetic samples of the minority classes in the dataset, while the discriminator network was trained to differentiate between the synthetic samples and real samples. The GAN loss function can be formulated as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

where G is the generator network, D is the discriminator network, x is a real sample from the dataset, z is a random vector sampled from the noise distribution $p_z(z)$, and G(z) is a synthetic sample generated by the generator from the noise vector z. The first term in the loss function encourages the discriminator to correctly classify real samples as real, while the second term encourages the discriminator to correctly classify synthetic samples as fake. The generator is trained to minimize the second term in the loss function, which encourages it to generate synthetic samples that are indistinguishable from real samples.

2.5 Classification Model

To identify the most appropriate machine learning algorithm, we evaluated six methods commonly used in submitochondrial protein localization studies: random forest (RF), K-nearest neighbor (KNN), light gradient boosting machine (LightGBM), support vector machine (SVM), gradient boosting tree (GBDT), and extreme gradient boosting (XGBoost). These models were implemented using scikit-learn [16], and we fine-tuned their hyperparameters through grid search to achieve the best possible performance.

2.6 Evaluation Metrics and Methods

In this experiment, accuracy (ACC), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and F1-score were used to evaluate the performance of the prediction system [17–19]. They are calculated as follows:

$$Sp = \frac{TN}{TN + FP} \tag{2}$$

$$Sn = \frac{TP}{TP + FN} \tag{3}$$

$$ACC = \frac{TP + TN}{TP + FN + TN + FP} \tag{4}$$

$$F1 = \frac{2 \times TP}{2 \times TP + FN + FP} \tag{5}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FN) \times (TN + FP)}} \tag{6}$$
To improve the accuracy of submitochondrial protein localization, we transformed the task into binary classification problems. A binary classifier predicts either 0 or 1: a true positive (TP) is a positive instance predicted as positive, a false positive (FP) is a negative instance predicted as positive, a true negative (TN) is a negative instance predicted as negative, and a false negative (FN) is a positive instance predicted as negative. Sn and Sp are the proportions of positive and negative instances, respectively, that are predicted correctly, while the F1 score, the harmonic mean of precision and recall, measures the model's robustness, with a higher score indicating greater robustness. ACC reflects overall predictor accuracy, but it can be misleading for imbalanced datasets; MCC is better suited for evaluation in such cases. The horizontal axis of the ROC curve represents the false positive rate (FPR), i.e., the fraction of negative samples predicted as positive, while the vertical axis represents the true positive rate (TPR), i.e., the fraction of positive samples predicted as positive. A higher area under the ROC curve (AUC) indicates better model performance.
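As a minimal sketch of Eqs. (2)–(6), the function below computes the five metrics from the four confusion-matrix counts. The function name `binary_metrics` and the example counts are illustrative assumptions and do not come from our experiments.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute Sp, Sn, ACC, F1 and MCC from confusion-matrix counts."""
    sp = tn / (tn + fp)                       # Eq. (2): specificity
    sn = tp / (tp + fn)                       # Eq. (3): sensitivity
    acc = (tp + tn) / (tp + fn + tn + fp)     # Eq. (4): accuracy
    f1 = 2 * tp / (2 * tp + fn + fp)          # Eq. (5): F1-score
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fn) * (tn + fp))
    # Eq. (6): MCC; defined as 0 here when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sp": sp, "Sn": sn, "ACC": acc, "F1": f1, "MCC": mcc}


# Hypothetical counts for one binary partition (not experimental values).
m = binary_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this toy case ACC is 0.85 while MCC is about 0.70, which shows how MCC penalizes the ten missed positives more visibly than accuracy does.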
3 Results and Discussion

3.1 First Partitioning (Inner Membranes and Non-Inner Membranes) on the M317 Dataset

To begin with, we divided the dataset into two categories: inner membrane and non-inner membrane protein sequences. We extracted 768-dimensional feature vectors from the mitochondrial protein sequences using the TAPE method, which is based on NLP pre-trained models. These feature vectors were first used as input to six commonly used machine learning models without any data balancing or feature selection. Fig. 2 presents the ten-fold cross-validation results for the GBDT model before and after balancing the dataset. To further improve our model's performance, we applied LightGBM feature selection to the TAPE feature vectors and fused the selected features with those extracted by the SeqVec method. The final outcomes are displayed in Table 2, where the GBDT model achieved the highest accuracy (98.34%).

3.2 Second Partitioning (Matrix and Non-Matrix) on the M317 Dataset

We further partitioned the dataset into submitochondrial matrix protein sequences and submitochondrial non-matrix protein sequences. We employed the LightGBM algorithm to select the optimal features extracted by the TAPE method and fused this feature set with the features extracted by the SeqVec method. The performance of the models is presented in Table 3, where the GBDT model exhibited the highest accuracy of 97.16%.
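The fusion step used in these partitions can be sketched as a per-protein concatenation of the two feature vectors. This is only an illustration under stated assumptions: concatenation is one common way to fuse features, the SeqVec dimension of 1024 is assumed rather than taken from the text, and `fuse_features` and the toy values are hypothetical.

```python
def fuse_features(tape_selected, seqvec):
    """Concatenate the two feature vectors of each protein, sample by sample."""
    if len(tape_selected) != len(seqvec):
        raise ValueError("both feature sets must describe the same proteins")
    return [list(a) + list(b) for a, b in zip(tape_selected, seqvec)]


# Toy data for three proteins: 350 selected TAPE features each, plus
# SeqVec embeddings whose length (1024) is an assumption of this sketch.
tape_selected = [[0.1] * 350 for _ in range(3)]
seqvec = [[0.2] * 1024 for _ in range(3)]

fused = fuse_features(tape_selected, seqvec)
print(len(fused), "proteins x", len(fused[0]), "fused features")
```

The fused matrix would then be passed to each of the six classifiers under ten-fold cross-validation.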
Fig. 2. Comparison of model performance before and after dataset balancing.
Table 2. Performance comparison of different models after feature fusion.

| Model | ACC(%) | F1-score | Sp(%) | Sn(%) | MCC | AUC |
|---|---|---|---|---|---|---|
| RF | 96.53 | 0.9717 | 97.78 | 95.38 | 0.9287 | 0.9904 |
| SVM | 97.79 | 0.9819 | 98.89 | 96.94 | 0.9547 | 0.9952 |
| LightGBM | 98.31 | 0.9862 | 98.89 | 97.76 | 0.9651 | 0.9968 |
| GBDT | 98.34 | 0.9864 | 98.43 | 98.03 | 0.9652 | 0.9974 |
| KNN | 96.83 | 0.9733 | 99.03 | 95.46 | 0.9375 | 0.9956 |
| XGBoost | 97.68 | 0.9805 | 98.98 | 96.78 | 0.9537 | 0.9973 |
Table 3. Performance comparison of different models after feature fusion.

| Model | ACC(%) | F1-score | Sp(%) | Sn(%) | MCC | AUC |
|---|---|---|---|---|---|---|
| RF | 88.65 | 0.8811 | 84.06 | 92.98 | 0.7708 | 0.9678 |
| SVM | 94.32 | 0.9405 | 92.03 | 96.49 | 0.8854 | 0.9839 |
| LightGBM | 96.22 | 0.9604 | 94.69 | 97.66 | 0.9236 | 0.9893 |
| GBDT | 97.16 | 0.9703 | 96.02 | 98.25 | 0.9427 | 0.9920 |
| KNN | 89.66 | 0.8972 | 85.88 | 93.91 | 0.7975 | 0.9511 |
| XGBoost | 93.10 | 0.9315 | 90.58 | 95.94 | 0.8650 | 0.9674 |
3.3 Third Partitioning (Outer Membrane and Non-Outer Membrane) on the M317 Dataset

We conducted a third partition of the dataset, separating it into submitochondrial outer membrane protein sequences and submitochondrial non-outer membrane protein sequences. As before, we fused the selected TAPE features with the features extracted by the SeqVec model and fed the fused features into the six traditional machine learning models. The model performance metrics are summarized in Table 4.

Table 4. Performance comparison of different models after feature fusion.

| Model | ACC(%) | F1-score | Sp(%) | Sn(%) | MCC | AUC |
|---|---|---|---|---|---|---|
| RF | 88.65 | 0.8811 | 84.06 | 92.98 | 0.7708 | 0.9678 |
| SVM | 94.32 | 0.9405 | 92.03 | 96.49 | 0.8854 | 0.9839 |
| LightGBM | 96.22 | 0.9604 | 94.69 | 97.66 | 0.9236 | 0.9893 |
| GBDT | 97.16 | 0.9703 | 96.02 | 98.25 | 0.9427 | 0.9920 |
| KNN | 89.66 | 0.8972 | 85.88 | 93.91 | 0.7975 | 0.9511 |
| XGBoost | 93.10 | 0.9315 | 90.58 | 95.94 | 0.8650 | 0.9674 |
3.4 Comparison with Previous Models

In addition to partitioning the dataset for our own experiments, we compared our method with the recently published approach by Yu et al. [12]. We used the same benchmark dataset as Yu et al., and the values listed for their method are the best results reported in their experiments. As shown in Table 5, our method outperforms the SubMito-XGBoost model proposed by Yu et al. on the M317 dataset in Sn for all three classes and in Sp for the outer membrane class.

Table 5. Comparison with previous method.

| Methods | Datasets | Structure class | Sn(%) | Sp(%) | MCC |
|---|---|---|---|---|---|
| Yu et al. (2020) | M317 | Inner membrane | 95.36 | 99.14 | 0.9524 |
| Yu et al. (2020) | M317 | Matrix | 97.93 | 97.73 | 0.9539 |
| Yu et al. (2020) | M317 | Outer membrane | 96.34 | 97.94 | 0.9414 |
| Our Method | M317 | Inner membrane | 98.03 | 98.43 | 0.9652 |
| Our Method | M317 | Matrix | 98.25 | 96.02 | 0.9920 |
| Our Method | M317 | Outer membrane | 96.50 | 100.0 | 0.9671 |
4 Conclusion

In this study, we aimed to improve the precision of submitochondrial protein localization by partitioning the dataset into three binary problems: outer membrane versus non-outer membrane, inner membrane versus non-inner membrane, and matrix versus non-matrix protein sequences. We used pre-trained NLP-based models, namely the BERT-based TAPE model and the ELMo-based SeqVec model, to extract features from submitochondrial protein sequences. Additionally, we used a generative adversarial network (GAN) for the first time to balance the submitochondrial protein dataset and applied LGBM to select the extracted protein sequence features. Although our proposed method improves the prediction accuracy of submitochondrial protein localization, there is still significant scope for enhancing algorithm efficiency. In the future, we intend to explore additional feature extraction and feature selection techniques to enhance the method's performance, create larger benchmark datasets, and implement deep learning approaches to predict submitochondrial protein localization.

Acknowledgments. This work was supported in part by Shandong Provincial Natural Science Foundation, China (ZR2021MF036), the Key Science & Technology Innovation Project of Shandong Province (2019JZZY010448) and the National Natural Science Foundation of China (31872415).
References
1. Poveda-Huertes, D., Mulica, P., Vögtle, F.N.: The versatility of the mitochondrial presequence processing machinery: cleavage, quality control and turnover. Cell Tissue Res. 367(1), 73–81 (2016). https://doi.org/10.1007/s00441-016-2492-9
2. Gerbitz, K.-D., Gempel, K., Brdiczka, D.: Mitochondria and diabetes: genetic, biochemical, and clinical implications of the cellular energy circuit. Diabetes 45(2), 113–126 (1996)
3. Burbulla, L.F., et al.: Dopamine oxidation mediates mitochondrial and lysosomal dysfunction in Parkinson's disease. Science 357(6357), 1255–1261 (2017)
4. Shi, S.-P., et al.: Identify submitochondria and subchloroplast locations with pseudo amino acid composition: approach from the strategy of discrete wavelet transform feature extraction. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research, 1813(3), 424–430 (2011)
5. Martelli, P.L., Savojardo, C., Fariselli, P., Tasco, G., Casadio, R.: Computer-based prediction of mitochondria-targeting peptides. Mitochondrial Medicine, Springer, pp. 305–320 (2015)
6. Du, P., Li, Y.: Prediction of protein submitochondria locations by hybridizing pseudoamino acid composition with various physicochemical features of segmented sequence. BMC Bioinform. 7(1), 1–8 (2006)
7. Lin, H., Chen, W., Yuan, L.-F., Li, Z.-Q., Ding, H.: Using over-represented tetrapeptides to predict protein submitochondria locations. Acta. Biotheor. 61(2), 259–268 (2013)
8. Li, L., et al.: Protein submitochondrial localization from integrated sequence representation and SVM-based backward feature extraction. Mol. BioSyst. 11(1), 170–177 (2015)
9. Jiao, Y.-S., Du, P.-F.: Predicting protein submitochondrial locations by incorporating the positional-specific physicochemical properties into Chou's general pseudo-amino acid compositions. J. Theor. Biol. 416, 81–87 (2017)
10. Qiu, W., et al.: Predicting protein submitochondrial locations by incorporating the pseudoposition specific scoring matrix into the general Chou's pseudo-amino acid composition. J. Theor. Biol. 450, 86–103 (2018)
11. Savojardo, C., Bruciaferri, N., Tartari, G., Martelli, P.L., Casadio, R.: DeepMito: accurate prediction of protein sub-mitochondrial localization using convolutional neural networks. Bioinformatics 36(1), 56–64 (2020)
12. Yu, B., et al.: SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting. Bioinformatics 36(4), 1074–1081 (2020)
13. Du, P., Yu, Y.: SubMito-PSPCP: predicting protein submitochondrial locations by hybridizing positional specific physicochemical properties with pseudoamino acid compositions. BioMed Res. Int. 2013 (2013)
14. Kumar, R., Kumari, B., Kumar, M.: Proteome-wide prediction and annotation of mitochondrial and sub-mitochondrial proteins by incorporating domain information. Mitochondrion 42, 11–22 (2018)
15. Heinzinger, M., et al.: Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 20(1), 1–17 (2019)
16. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
17. Zeng, X., Lin, W., Guo, M., Zou, Q.: A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput. Biol. 13(6), e1005420 (2017)
18. Wei, L., Xing, P., Su, R., Shi, G., Ma, Z.S., Zou, Q.: CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency. J. Proteome Res. 16(5), 2044–2053 (2017)
19. Wei, L., Xing, P., Zeng, J., Chen, J., Su, R., Guo, F.: Improved prediction of protein–protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med. 83, 67–74 (2017)
Identification of Active and Binding Sites with Multi-dimensional Feature Vectors and K-Nearest Neighbor Classification Algorithm

Baichuan Zhang, Zhuo Wang(B), Wenzheng Bao(B), and Honglin Cheng

School of Information Science and Engineering, University of Jinan, Jinan, China
[email protected], [email protected]
Abstract. Helicobacter pylori is a pathogenic and carcinogenic bacterium that lives mainly in the stomach and duodenum and has been declared a prokaryotic carcinogen by the World Health Organization. The control of gastric cancer has attracted increasing attention. Studying the binding reactions of substrates at different protein sites helps to clarify the relationship between protein structure and function and paves the way for future research on the pathogenesis of Helicobacter pylori and the development of protein-targeted drugs. This paper presents a new identification method for predicting protein sites. It classifies the active sites and binding sites of proteins with a K-nearest neighbor (KNN) classifier that learns the multi-dimensional features of protein sites. First, the protein information of Helicobacter pylori is retrieved, and the Active_site and Binding_site sites are obtained from an existing database. Then, the protein fragment sequences adjacent to the sites are extracted, and a custom correlation function converts them into feature vectors of equal length. Supervised learning is then applied: for the transformed n-dimensional input vectors, the KNN classification algorithm is optimized with a kd-tree, and the NCA algorithm is introduced to learn the distance metric automatically and perform dimensionality reduction. The accuracy on the test set reaches 84.2, which is 6.5 higher than that of the traditional gradient boosting tree algorithm (GBDT). These results indicate that this classification method outperforms the compared GBDT classifier and can identify protein binding sites more effectively.

Keywords: Helicobacter pylori proteins · KNN · Feature selection · Machine learning
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 597–606, 2023. https://doi.org/10.1007/978-981-99-4749-2_51

1 Introduction

The classification of Helicobacter pylori loci has been a topic of interest in the field of microbiology and medical research [1]. H. pylori is a Gram-negative bacterium that colonizes the human gastric mucosa and is associated with a range of gastric diseases, including gastritis, peptic ulcer disease, and gastric cancer. Accurate identification and
classification of H. pylori loci is crucial for understanding the molecular mechanisms of H. pylori-associated diseases and developing effective treatments. Traditionally, H. pylori loci have been classified using culture-based methods, such as culture, microscopy, and biochemical assays. However, these methods have limitations, such as low sensitivity and specificity and the need for specialized equipment and trained personnel. In recent years, molecular methods, such as PCR-based assays, have been widely used for H. pylori locus classification due to their high accuracy and specificity [2–5]. With the rapid development of machine learning and bioinformatics, there has been a growing interest in using machine learning algorithms for H. pylori locus classification. These algorithms have the advantage of being able to process large amounts of data in a short time and accurately identify H. pylori loci based on molecular features [6, 7]. However, the performance of these algorithms is dependent on the quality and quantity of the data used for training, as well as the choice of algorithm and feature selection. In recent years, various machine learning algorithms, such as support vector machines, decision trees, and random forests, have been applied to H. pylori locus classification, with promising results [8–12]. However, there is still room for improvement. Here, we propose a KNN-based protein functional site classifier model to enhance the classification efficacy of binding and activation sites. To compose the initial data, Helicobacter proteins are first crawled. The protein sequences of the binding and the activation sites are then selected and used for training and cross-validation. To prevent the influence of outliers, the data is standardized and preprocessed. Additionally, since the Euclidean distance is not appropriate for all types of data, we used the supervised learning neighbor component analysis algorithm to automatically learn the distance metric. 
We also added the kd-tree algorithm to improve classification efficiency: an index is built, and the nearest-neighbor search is executed against it. Finally, the data are fed into the K-nearest neighbor classifier for training and prediction.
2 Materials and Methods

2.1 Datasets

The experimental dataset was generated by sourcing Helicobacter protein information from the UniProt database [13]. Initially, we retrieved all protein sequences of Helicobacter pylori and extracted the amino acid sequences adjacent to the annotated metal binding and active site locations. These amino acid sequences were then used for the subsequent extraction of dimensional features. To guarantee the dimensional consistency of the features, we discarded marginal amino acids so that all sequences have the same length. In total, we obtained 1094 amino acid sequences, including 620 positive and 474 negative samples.

2.2 Sequence to Feature Vector Conversion

This paper uses autocovariance, from the family of autocorrelation methods for neighboring sequences, for feature extraction,
Identification of Active and Binding Sites
as metal binding and active sites have distinct functions [14, 15]. First, two physicochemical properties, hydrophobicity and hydrophilicity, were selected. Then, for each amino acid sequence, the mean value of these physicochemical properties was calculated with a lag length of 10. The last eight values of the mean were used as the dimensions of the final feature vector, which provides the input for subsequent classification.
2.3 KNN Algorithm Implementation
After preprocessing, the dataset is loaded as equal-length vectors and randomly shuffled. A test set is held out at random, and the remaining samples are used for training with cross-validation. StandardScaler standardization transforms each feature to zero mean and unit standard deviation, which limits the influence of abnormal data points. Next, the NCA transformation (which also supports dimensionality reduction) is computed, and a kd-tree is used for the neighbor search; at this stage all data items are held in an ndarray data matrix called datamatrix. A KNeighborsClassifier instance is then trained to classify active sites and binding sites. Finally, the pipeline mechanism encapsulates and manages all the steps: the pipeline object accepts a list of (name, estimator) tuples, from which the classification model is constructed.
The K-nearest neighbor (KNN) algorithm is a supervised method that classifies samples by partitioning the feature space. It predicts the label of an unlabeled sample from its K nearest neighbors. The appropriate value of K depends largely on the dataset: if K is small, the model is sensitive to outliers and overfitting can occur. Prediction proceeds as follows: preprocess the data and compute the feature vector, measure the distance between the unlabeled sample and every labeled sample, store the distances in a distance array, and sort it to determine the k nearest samples (y_knn).
The sample being labeled is assigned the most frequent label among the samples in y_knn. The traditional Euclidean distance (Minkowski distance with p = 2) can be used for this, and the search can be optimized by storing the training data in a KD tree (k-dimensional tree). The tree, which partitions the K feature dimensions, can significantly reduce the number of distance calculations needed. To do this, store the training data in the kd tree, build the kd tree model, and predict the test set. The algorithm constructs a balanced binary tree by recursively dividing the parameter space into nested axis-aligned regions until no points remain to be split. The process is shown in Fig. 1.
2.4 KNN Classification Rules
First, the division dimension is determined; the values of the data items should be spread as widely as possible along it. To do so, calculate the variance of every dimension and pick the one with the highest variance as the splitting dimension: for the n-dimensional data set, compute the n variances F(x1), F(x2), …, F(xn) and determine the maximum, Max(F(x1), F(x2), …, F(xn)). Subsequently, define a hyperplane perpendicular to the Xi axis that divides the samples into two halves. Finally, sort the samples according to their value on the Xi axis, and take the median sample as the current
B. Zhang et al.
Fig. 1. Steps of Feature Partition in KD-Tree Construction
node of the KD tree. Use the samples on the left of the split as the left subtree and the samples on the right as the right subtree, and recurse on each subset. To implement the nearest neighbor search on the KD tree, construct a bounded priority queue of length K that stores, and updates during the search, the distances from the current K best candidates to the sample being classified.
For a given test sample x, the set of its k nearest training instances is denoted $N_k(x)$, and the classification loss is the 0–1 loss. If the class assigned to the region covered by $N_k(x)$ is $c_j$, the classification error rate is defined as follows:
$$\frac{1}{k}\sum_{x_i \in N_k(x)} I\{y_i \neq c_j\} = 1 - \frac{1}{k}\sum_{x_i \in N_k(x)} I\{y_i = c_j\} \tag{1}$$
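The kd-tree construction, the bounded priority queue search, and the majority vote that minimizes the error rate in Eq. (1) can be sketched in plain Python. This is an illustrative sketch, not the implementation used in the paper: the function names, the dictionary-based node layout, and the toy data are assumptions made for the example.

```python
import heapq
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, labels):
    """Split on the highest-variance dimension at the median, then recurse."""
    if not points:
        return None
    n_dims = len(points[0])

    def variance(d):
        vals = [p[d] for p in points]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)

    axis = max(range(n_dims), key=variance)
    order = sorted(range(len(points)), key=lambda i: points[i][axis])
    mid = len(order) // 2
    left, right = order[:mid], order[mid + 1:]
    return {
        "point": points[order[mid]],
        "label": labels[order[mid]],
        "axis": axis,
        "left": build_kdtree([points[i] for i in left], [labels[i] for i in left]),
        "right": build_kdtree([points[i] for i in right], [labels[i] for i in right]),
    }


def knn_search(node, query, k, heap=None):
    """Collect the k nearest (negated squared distance, label) pairs in a bounded heap."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d = sq_dist(node["point"], query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, node["label"]))
    elif d < -heap[0][0]:  # closer than the current worst candidate
        heapq.heapreplace(heap, (-d, node["label"]))
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn_search(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the worst candidate.
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        knn_search(far, query, k, heap)
    return heap


def predict(tree, query, k):
    """Majority vote over the k nearest neighbours (minimizes the 0-1 loss)."""
    votes = Counter(label for _, label in knn_search(tree, query, k))
    return votes.most_common(1)[0][0]


def error_rate(tree, xs, ys, k):
    """Fraction of samples whose predicted label differs from the true label."""
    return sum(predict(tree, x, k) != y for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: two well-separated clusters.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
tree = build_kdtree(train_x, train_y)
print(predict(tree, (0.05, 0.05), k=3), predict(tree, (1.0, 0.95), k=3))
```

The heap stores negated distances so that its root is always the farthest of the current candidates, which is the element the pruning test needs.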
The KNN model can thus be defined as obtaining the set $N_k(x)$ for any x and choosing the class that minimizes the 0–1 loss, that is, the empirical risk; this is exactly the majority vote.
2.5 NCA Distance Metric
NCA can be combined with the KNN classifier to improve classification accuracy on multi-class problems without increasing model size or adding user-tuned parameters. In the KNN model, the API provides widely applicable distance formulas, but in a preliminary experiment these defaults did not give the best classification results. To address this, Neighborhood Components Analysis (NCA) is introduced. NCA is a supervised algorithm that learns a linear transformation of the training set, in the form of a transformation matrix, that maximizes a stochastic variant of the leave-one-out KNN classification accuracy in the transformed space. The learned transformation defines an appropriate distance measure for the sample data and can also be used for dimensionality reduction, without requiring complex matrix operations. Given a data set of n sample vectors (X1, X2, X3, …, Xn) and their corresponding labels (y1, y2, y3, …, yn), the goal is to learn a metric that improves classification performance.
Essentially, this amounts to learning a transformation matrix A, which induces the positive semidefinite matrix $Q = A^{T}A$ and the distance
$$d(x, y) = (x - y)^{T} Q\,(x - y) = (Ax - Ay)^{T}(Ax - Ay) \tag{2}$$
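The identity in Eq. (2) can be checked numerically: the quadratic form with $Q = A^{T}A$ equals the squared Euclidean distance after mapping both points through A. The following is a small sketch with a hypothetical 2×2 matrix A chosen only for illustration; the helper names are not from the paper.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(M, N):
    """Matrix product M @ N for list-of-rows matrices."""
    cols = transpose(N)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in M]


def quadratic_form_dist(x, y, Q):
    """(x - y)^T Q (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(d * qd for d, qd in zip(diff, matvec(Q, diff)))


def transformed_dist(x, y, A):
    """||Ax - Ay||^2."""
    return sum((a - b) ** 2 for a, b in zip(matvec(A, x), matvec(A, y)))


A = [[2.0, 0.0], [1.0, 1.0]]     # hypothetical transformation matrix
Q = matmul(transpose(A), A)      # Q = A^T A = [[5, 1], [1, 1]]
x, y = (1.0, 2.0), (0.0, 0.0)
print(quadratic_form_dist(x, y, Q), transformed_dist(x, y, A))
```

Because the two forms agree, a KNN search can simply run with the ordinary Euclidean distance on the transformed points Ax.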
When the error is computed with the leave-one-out method, the error function is discontinuous in A. Consequently, a softmax-based stochastic neighbor assignment is used instead: each point $x_i$ selects another point $x_j$ as its neighbor with probability $p_{ij}$ and inherits the class label $y_j$. This makes the objective continuous and differentiable in A.
$$p_{ij} = \frac{\exp\left(-\lVert Ax_i - Ax_j\rVert^{2}\right)}{\sum_{k \neq i}\exp\left(-\lVert Ax_i - Ax_k\rVert^{2}\right)}, \qquad p_{ii} = 0 \tag{3}$$
The probability that $X_i$ is classified correctly is the sum of $p_{ij}$ over the points in the same class as $X_i$. Here, $C_i$ denotes the set of all points that share the class of $X_i$:
$$p_i = \sum_{j \in C_i} p_{ij}, \qquad C_i = \{\, j \mid c_i = c_j \,\} \tag{4}$$
Points of the same class should have a high probability of selecting one another as neighbors. The objective function is therefore the expected number of correctly classified points:
$$f(A) = \sum_{i}\sum_{j \in C_i} p_{ij} = \sum_{i} p_i \tag{5}$$
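To make the objective concrete, the sketch below evaluates the neighbor-selection probabilities, the per-point probabilities $p_i$, and f(A) for a tiny hypothetical data set. It is plain Python with no optimization step; the function names and the four sample points are assumptions made for illustration.

```python
import math


def transform(A, x):
    """Apply the linear map A (list of rows) to the point x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def nca_objective(A, X, labels):
    """Return (f, p): f is the NCA objective, p[i] the probability that
    point i picks a neighbour of its own class."""
    Z = [transform(A, x) for x in X]
    n = len(Z)
    p = []
    for i in range(n):
        # Unnormalized softmax weights; a point never selects itself (p_ii = 0).
        w = [0.0 if k == i else math.exp(-sum((a - b) ** 2 for a, b in zip(Z[i], Z[k])))
             for k in range(n)]
        same_class = sum(wk for k, wk in enumerate(w) if labels[k] == labels[i])
        p.append(same_class / sum(w))
    return sum(p), p


# Hypothetical data: the classes differ along the first coordinate only.
X = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [0, 0, 1, 1]

f_identity, _ = nca_objective([[1.0, 0.0], [0.0, 1.0]], X, labels)
f_keep_x, _ = nca_objective([[1.0, 0.0], [0.0, 0.0]], X, labels)  # keep informative axis
f_keep_y, _ = nca_objective([[0.0, 0.0], [0.0, 1.0]], X, labels)  # keep uninformative axis
print(f_identity, f_keep_x, f_keep_y)
```

Projecting onto the informative axis raises f(A) slightly above the identity map, while projecting onto the uninformative axis lowers it sharply, which is the behavior a gradient-based optimizer exploits.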
Since f(A) is differentiable, maximizing it is an unconstrained optimization problem, and an iterative method such as conjugate gradients can be used to find the optimal A. Writing $x_{ij} = x_i - x_j$, the gradient is
$$\frac{\partial f}{\partial A} = -2A \sum_{i}\sum_{j \in C_i} p_{ij}\left(x_{ij}x_{ij}^{T} - \sum_{k} p_{ik}\,x_{ik}x_{ik}^{T}\right) \tag{6}$$
2.6 K-fold Cross-Validation
When the original data are randomly split once into a training set and a test set, the measured classification accuracy depends strongly on that particular split, which makes the result unreliable. To overcome this issue, K-fold cross-validation is used. The dataset is partitioned randomly into K non-overlapping parts of approximately equal size. In each iteration, one part serves as the validation set and the remaining K−1 parts as the training set, so the model's accuracy is computed K times and then averaged. Cross-validation helps to detect overfitting and underfitting, and thereby gives a more reliable estimate of model performance.
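The partitioning scheme can be sketched in a few lines of plain Python. The helper names and the `score_fn` callback are hypothetical; the sample count of 1094 is the dataset size reported in Sect. 2.1.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal parts
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(n_samples, k, score_fn, seed=0):
    """Average score_fn(train, validation) over the k folds."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# 1094 samples split into 10 folds of 109 or 110 samples each.
sizes = [len(val) for _, val in k_fold_indices(1094, 10)]
print(sizes)
```

Every sample appears in exactly one validation fold, so the averaged score uses each sample once for evaluation.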
2.7 Improved K-Nearest Neighbor
The KNN algorithm admits several improvements beyond the standard implementation. The first is to weight the neighbors of a sample. The default weighting in the model is uniform, but neighbors can be assigned different weights depending on the distribution of the training samples, typically according to their distance. With the KNeighborsClassifier class, we instantiated the model and optimized the hyperparameters k and p using grid search. With the optimal neighbor value of 3, we obtained a training accuracy of 70.3%.
The second is the radius-limited nearest neighbor algorithm, which replaces the k nearest points with the points inside a fixed radius and can perform better when the data are sampled unevenly. To implement this method, we used the RadiusNeighborsClassifier class with the radius set to 500.0 and the neighbor value set to 5. This variant uses fewer neighbor points to classify test points in sparsely populated regions. However, in high-dimensional parameter spaces the training samples suffer from the “curse of dimensionality,” and the accuracy dropped to 68.5%, which is not ideal.
2.8 Evaluation Metrics and Methods
In the classification and identification of protein active sites and binding sites, appropriate evaluation indicators must be selected to assess the model's performance. This experiment primarily uses accuracy, precision, recall, F-score, and the area under the ROC curve (ROC-AUC) as evaluation indicators for the classifier. In this binary classification problem, positive and negative samples represent active and binding sites, respectively. The classifier's predictions fall into four categories: true positives, false positives, true negatives, and false negatives.
A true positive (TP) is a positive sample that is predicted as positive; a false positive (FP) is a negative sample that is predicted as positive. True negatives (TN) are negative samples predicted as negative, while false negatives (FN) are positive samples predicted as negative. The F1-score combines precision (P) and recall (R), both calculated from the confusion matrix. AUC is the area under the ROC curve, usually between 0.5 and 1; the larger the value, the better the classifier. The evaluation indices in this experiment are calculated as follows, where $(x_i, y_i)$, $i = 1, \dots, m$, are the points of the ROC curve ordered by increasing false positive rate:
$$ACC = \frac{TP + TN}{TP + FN + TN + FP} \tag{7}$$
$$F1 = \frac{2PR}{P + R} \tag{8}$$
$$F1 = \frac{2TP}{2TP + FN + FP} \tag{9}$$
$$AUC = \frac{1}{2}\sum_{i=1}^{m-1}(x_{i+1} - x_i)(y_i + y_{i+1}) \tag{10}$$
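Equations (7) to (10) translate directly into code. The sketch below uses made-up labels and made-up ROC points purely to exercise the formulas; none of the numbers are results from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fn + tn + fp)           # Eq. (7)


def f1_from_pr(precision, recall):
    return 2 * precision * recall / (precision + recall)  # Eq. (8)


def f1_from_counts(tp, fp, fn):
    return 2 * tp / (2 * tp + fn + fp)               # Eq. (9)


def auc_trapezoid(fpr, tpr):
    """Trapezoidal area under ROC points sorted by increasing FPR, Eq. (10)."""
    return 0.5 * sum((fpr[i + 1] - fpr[i]) * (tpr[i] + tpr[i + 1])
                     for i in range(len(fpr) - 1))


# Made-up example labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision, recall = tp / (tp + fp), tp / (tp + fn)
print(accuracy(tp, tn, fp, fn), f1_from_pr(precision, recall), f1_from_counts(tp, fp, fn))
print(auc_trapezoid([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0]))
```

The two F1 expressions agree because substituting P = TP/(TP+FP) and R = TP/(TP+FN) into Eq. (8) simplifies to Eq. (9).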
3 Results and Discussion
The KNN classification algorithm is simple and easy to implement. However, when the training dataset is large, a significant amount of storage is needed to hold it, and computing the distance between a query sample and all training samples is time-consuming. In this experiment, the neighborhood components analysis (NCA) algorithm was added. By linearly transforming the samples and learning an appropriate distance measure, the computation and storage burden was reduced and the class separation in the dataset improved. The efficiency of the algorithm can be further improved with a tree data structure such as the kd-tree, which supports n dimensions. However, with a large amount of data and a small division granularity, the overhead of building the tree can be high, which reduces its benefit. Moreover, KNN classifies randomly distributed datasets poorly but performs better on datasets with small intra-class spacing, large inter-class spacing, and irregular boundaries, where it can outperform linear classifiers. Generally, KNN copes comparatively well with unbalanced samples, and each nearest neighbor can be assigned a different weight to improve performance further. However, for this experiment's relatively balanced dataset, no special treatment of weights was needed. The model was evaluated using 10-fold cross-validation (K = 10): the dataset was divided equally into K subsets, with one subset serving as the test set and the remaining K−1 subsets as the training set. The accuracies of the K classifiers were then averaged, which reduces the influence of any particular data split on the measured performance and gives a better estimate of the model's generalization ability.

Table 1. Prediction accuracy of the KNN machine learning model for different feature extraction methods
Model | AC | ACC | CC | DP | DR | KMER | PC-PseAAC
KNN | 0.72 | 0.71 | 0.70 | 0.74 | 0.84 | 0.80 | 0.71

Model | PC-PseAAC-General | PDT | SC-PseAAC | SC-PseAAC-General
KNN | 0.72 | 0.82 | 0.70 | 0.71
To ensure the accuracy and credibility of the results, we used both the algorithm developed in this study and the conventional gradient boosting decision tree (GBDT) algorithm to classify site samples obtained with the various extraction methods. Comparing the two algorithms (Table 1 and Table 2) shows that the KNN algorithm outperforms the gradient boosting algorithm, with an accuracy that is 6.5 percentage points higher for the final models (0.841 versus 0.776, Table 3). This experiment used binary classification, with positive samples representing active sites and negative samples representing binding sites. The True Positive Rate (TPR), i.e., the ratio of the number of correctly classified positive samples to the total number of positive samples, was plotted on the ordinate, while the False Positive Rate (FPR), i.e., the ratio of the number of negative samples falsely classified as positive to the total number of negative samples, was plotted on the abscissa. The relationship between the two was explored by classifying samples at different thresholds (Fig. 2). The results demonstrate that the KNN algorithm exhibits superior classifier performance, as measured by the AUC index (Table 3).

Table 2. Prediction accuracy of the GBDT machine learning model for different feature extraction methods
Model | AC | ACC | CC | DP | DR | KMER | PC-PseAAC
GBDT | 0.65 | 0.70 | 0.65 | 0.74 | 0.71 | 0.77 | 0.66

Model | PC-PseAAC-General | PDT | SC-PseAAC | SC-PseAAC-General
GBDT | 0.65 | 0.70 | 0.69 | 0.67

Fig. 2. The ROC curve of the KNN classifier and the traditional GBDT classifier

Table 3. Comparison of classification effects between the final KNN model and the traditional GBDT model
Model | ACC | AUC | F1-score | recall
KNN | 0.841 | 0.832 | 0.805 | 0.827
GBDT | 0.776 | 0.774 | 0.782 | 0.793
The results of this study show that the KNN classification algorithm with NCA can outperform the conventional GBDT algorithm in site sample classification. These findings provide valuable insights into the potential of KNN algorithms for improving the accuracy and efficiency of site sample classification.
4 Conclusion
This study demonstrates that combining the neighborhood components analysis algorithm with the KNN classification algorithm improves both the efficiency and the accuracy of the classification model. Compared with the conventional gradient boosting decision tree model, the KNN model achieves higher accuracy, a higher AUC, and better F1-score and recall values. Thus, the KNN model is a more effective method for classifying site samples obtained with the various extraction methods. However, the choice of algorithm should take into account the characteristics of the dataset, such as its size and distribution. For datasets with a large amount of data and a small division granularity, building a tree may decrease the efficiency of the algorithm. Further research could explore integrating other algorithms to improve the KNN model's performance and address these limitations. Overall, this study highlights the effectiveness of the KNN model with NCA for site sample classification.
Machine learning offers significant advantages for classifying protein sites because of the complex composition of proteins, and dimensionality reduction methods can further enhance the classification model's performance. This study used Helicobacter protein information from the UniProt database, processed the amino acid sequences, unified sequences of different lengths into fixed-length feature vectors, and addressed the problem of unbalanced positive and negative sample proportions. The K-nearest neighbor model was combined with neighborhood components analysis, and kd-tree optimization was implemented. The experimental results demonstrate that the method effectively recognizes and classifies protein active and binding sites.
In addition to conducting further research on classification and identification methods, this method and idea can serve as a powerful tool for bioinformatics and protein information research. Acknowledgement. This work was supported by the National Natural Science Foundation of China (Grant No. 61902337), Xuzhou Science and Technology Plan Project (KC21047), Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016) and Young Talents of Science and Technology in Jiangsu and ghfund202302026465.
References
1. Ding, H., Liu, L., Guo, F.-B., Huang, J., Lin, H.: Identify Golgi protein types with modified Mahalanobis discriminant algorithm and pseudo amino acid composition. Protein Peptide Lett. 18(1), 58–63 (2011)
2. Zeng, X., Lin, W., Guo, M., Zou, Q.: A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput. Biol. 13(6), e1005420 (2017)
3. Yu, B., et al.: SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting. Bioinformatics 36(4), 1074–1081 (2020)
4. Savojardo, C., Bruciaferri, N., Tartari, G., Martelli, P.L., Casadio, R.: DeepMito: accurate prediction of protein sub-mitochondrial localization using convolutional neural networks. Bioinformatics 36(1), 56–64 (2020)
5. Heinzinger, M., et al.: Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 20(1), 1–17 (2019)
6. Trompier, D., et al.: Brain peroxisomes. Biochimie 98, 102–110 (2014)
7. Cai, M., et al.: Disruption of peroxisome function leads to metabolic stress, mTOR inhibition, and lethality in liver cancer cells. Cancer Lett. 421, 82–93 (2018)
8. Qiu, W., et al.: Predicting protein submitochondrial locations by incorporating the pseudoposition specific scoring matrix into the general Chou's pseudo-amino acid composition. J. Theor. Biol. 450, 86–103 (2018)
9. Sampaio, P.N., Cunha, B., Rosa, F., Sales, K., Lopes, M., Calado, C.R.C.: Molecular fingerprint of human gastric cell line infected by Helicobacter pylori. In: 2015 IEEE 4th Portuguese Meeting on Bioengineering (ENBENG), Porto, Portugal, pp. 1–5 (2015)
10. Runhong, M., Shihe, S., Fang, M.: Construction and identification of hp0532 gene mutant in Helicobacter pylori Cag-PAI. In: Proceedings 2011 International Conference on Human Health and Biomedical Engineering, Jilin, China, pp. 280–284 (2011)
11. Gunasundari, R., Thara, L.: Helicobacter pylori infection and associated stomach diseases: comparative data mining approaches for diagnosis and prevention measures. In: 2016 IEEE International Conference on Advances in Computer Applications (ICACA), Coimbatore, India, pp. 9–13 (2016)
12. Song, T., Rodríguez-Patón, A., Zheng, P., Zeng, X.: Spiking neural P systems with colored spikes. IEEE Trans. Cogn. Dev. Syst. 10(4), 1106–1115 (2017)
13. Morgat, A., et al.: Enzyme annotation in UniProtKB using Rhea. Bioinformatics 36(6), 1896–1901 (2020)
14. Zeng, X., Lin, W., Guo, M., Zou, Q.: A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput. Biol. 13(6), e1005420 (2017)
15. Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
Mit Protein Transformer: Identification Mitochondrial Proteins with Transformer Model Baichuan Zhang, Luying He, Qi Wang, Zhuo Wang(B) , Wenzheng Bao(B) , and Honglin Cheng Xuzhou University of Technology, Xuzhou 221018, China [email protected], [email protected]
Abstract. Mitochondrial proteins carry out unique physiological functions within subcellular localization regions, making accurate subcellular localization prediction essential in uncovering pathological disease mechanisms and guiding drug development. Mitochondria serve as the energy production centers of cells, comprising four major subcellular localization regions, namely matrix, outer membrane, inner membrane, and intermembrane space. While traditional research methods analyze physical and chemical protein sequence properties, coupled with machine learning or deep learning algorithms, such methods necessitate considerable time and effort in data preprocessing. To overcome such challenges, this study proposes a novel approach to efficiently predicting mitochondrial protein subcellular localization by perceiving semantic information of protein sequences directly, using an ESM2-35B pre-trained model based on transformer architecture. The study utilized four datasets, comparing two models - the Transformer-Encoder Only model trained from scratch and the classification predictor centered on ESM2-35B pre-trained model fine-tuning. Results show that fine-tuning the large pre-trained model presents a superior performance in subsequent mitochondrial protein subcellular localization tasks in comparison to the Transformer-Encoder Only model. In conclusion, ESM2-35B pre-trained model based on transformer architecture offers vast application prospects in addressing mitochondrial protein subcellular localization prediction issues. Keywords: Mitochondrial · Fine-tuned Transformer Model · Mitochondrial Intermembrane Space
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 607–616, 2023. https://doi.org/10.1007/978-981-99-4749-2_52
1 Introduction
Mitochondria are crucial membrane-bound organelles found widely in eukaryotic cells; they serve as the center of energy metabolism and are involved in various cellular pathological processes [1]. The normal functioning of mitochondria depends on the cooperation between nuclear and mitochondrial DNA, various signaling molecules, regulatory factors, and important molecules such as enzymes, carrier and transport proteins, and histones. The mitochondrial structure comprises an inner membrane, outer membrane, intermembrane space, and matrix, all rich in proteins influencing
mitochondrial function [2]. These proteins play a critical role in cell life activities such as electron transfer, ATP synthesis, fatty acid oxidation, amino acid degradation, the tricarboxylic acid cycle, and several other biological processes. Furthermore, mitochondrial proteins participate in other important cellular pathological processes such as apoptosis, oxidative stress, and inflammatory reactions. Hence, the correct subcellular localization of mitochondrial proteins is vital, as an imbalance or abnormality can result in various diseases, including cardiovascular diseases, tumors, neurodegenerative diseases, and metabolic diseases. Mitochondrial protein subcellular localization can be predicted with different bioinformatics methods, including traditional techniques based on physical and chemical properties, machine learning algorithms, and deep learning methodologies. These methods help scientists understand the functions of mitochondrial proteins and their involvement in disease occurrence and development, providing a theoretical basis for drug development and disease treatment.
Protein subcellular localization is a significant topic in proteomics, and notable advances have been made in recent years. However, research at the sub-organelle level, particularly within mitochondria, is complex, and progress has been relatively slow. Computational methods for predicting mitochondrial protein subcellular locations have become more common as sequence data have grown. Over the past decade, several reliable methods have been developed with notable results, including the MK-TLM model based on multi-kernel transfer learning by Mei et al. [3], the establishment of the M495 dataset by Lin et al. [4], who used over-represented tetrapeptides to predict mitochondrial protein subcellular location, and the method for predicting mitochondrial proteins and their submitochondrial localization by Kumar et al. [5]. In addition, Qiu et al. [6] used pseudo amino acid composition and pseudo position-specific scoring matrices to extract features, while Yu et al. [7] used extreme gradient boosting to predict proteins' submitochondrial localization. Savojardo et al. [8] employed deep learning algorithms to predict four sub-mitochondrial locations, while Xiao et al. [9] proposed a method that uses deep convolutional neural networks to predict four sub-mitochondrial locations.
The subcellular localization prediction of mitochondrial proteins is a complex multi-label classification problem, primarily because of the limited number of multi-label proteins available for training multi-label predictors. Machine learning algorithms have become the core method for predicting submitochondrial protein localization in recent years. However, traditional machine learning often requires manual extraction of protein features and their conversion into vectors suitable for classification. Although these methods can predict well, the partial and subjective nature of manual feature design significantly limits model performance. The ESM2 protein model developed by Facebook AI Research (FAIR) [10] applies a deep neural network based on the Transformer architecture, pre-trained on vast, diverse protein sequence and structure data. Compared with traditional protein structure prediction techniques, the ESM2 protein model offers greater robustness and accuracy and higher efficiency in predicting almost all protein sequences. Moreover, it is scalable and can handle more extensive protein sequence and structural data, which makes it well suited to the challenges posed by growing volumes of protein data. Hence, we posit that fine-tuning models
based on large-scale biological sequence language models will yield better submitochondrial protein localization predictions. This study trained two kinds of classifiers, one fine-tuned from the ESM2 pre-trained model and one encoder-only classifier based on the Transformer architecture and trained from scratch, to assess the potential of large-scale biological sequence language models in predicting protein mitochondrial sublocalization. The experimental outcomes confirmed that fine-tuning the pre-trained model outperforms the small Transformer model trained from scratch, highlighting the significant potential of large-scale biological sequence language models for this task. Drawing on the ESM2 model, this paper proposes a submitochondrial protein localization predictor that retrieves semantic information from protein sequences directly. The predictor is obtained by fine-tuning the 35B-ESM2 biological sequence language model. The remainder of the article comprises three parts. The first part describes the four datasets, the encoder-only Transformer network, the ESM2 large-scale pre-trained network, the model optimizers, and the result indicators. The second part compares the results obtained by fine-tuning the 35B-ESM2 pre-trained biological sequence language model with those of the Transformer encoder-only network trained from scratch. Finally, the third part summarizes the paper. Fig. 1 illustrates the entire experimental procedure.
Fig. 1. Experimental Workflow
2 Materials and Methods
2.1 Datasets
This study evaluates the proposed method using four benchmark datasets: the M317 dataset, comprising 317 proteins distributed over three sub-mitochondrial locations; the M983 dataset, containing 983 proteins in three sub-mitochondrial locations; the SM424–18 dataset, containing 424 proteins in four sub-mitochondrial locations; and the SubMitoPred dataset, with 570 proteins in four sub-mitochondrial locations. We merged the corresponding sub-mitochondrial locations across the datasets (inner membrane, intermembrane space, matrix, and outer membrane), resulting in a total of 2294 protein sequences according to the counts in Table 1. Detailed dataset information is given in Table 1. The merged dataset was then divided into batches of 32 sequences, and the batches were split into training and testing sets at a ratio of 6:4. The proportion of testing batches was increased to validate the robustness of the fine-tuned large model.

Table 1. Features of the datasets
Compartment | M317 | M983 | SM424–18 | SubMitoPred
Outer membrane | 41 | 145 | 74 | 82
Inner membrane | 131 | 661 | 190 | 282
Intermembrane space | NA | NA | 25 | 32
Matrix | 145 | 177 | 135 | 174
2.2 Tokenizer
Protein sequences must be tokenized before being fed into the Transformer model. A typical tokenizer for protein sequences uses amino acid encoding, which converts a sequence of amino acids into a series of integers.
2.3 Fine-tuned 35B-ESM2 Model
This study employs the ESM2 deep learning model, which implements the Transformer architecture and is trained with a masked language modeling objective to extract biologically relevant information from protein sequences. Developed by the Facebook AI Research (FAIR) team, the pre-trained ESM2 model comprises 35B parameters and has demonstrated high prediction accuracy and robustness in various benchmark tests. To represent a protein sequence and extract biological information, the model assigns a 480-dimensional feature vector to each amino acid residue, resulting in a variable-length representation of shape (max(1024, sequence length), 480). In this work, we designed a custom classification head, which comprises four linear layers and four ReLU activation functions, and appended it to the pre-trained ESM2 model. The classification head outputs a
4-dimensional vector, representing the protein’s subcellular localization category, after sequentially stacking linear layers with ReLU activation functions. The softmax function is utilized to derive the probability distribution of each category, and we calculate the cross-entropy loss between the predicted probability distribution and the actual label probability distribution for optimization purposes. To fine-tune the model for the mitochondrial subcellular localization task, we backpropagate and only train the custom classification head. Figure 2 illustrates the model’s data flow.
Fig. 2. The structure of the fine-tuned ESM2 model
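The forward pass of such a classification head can be sketched in plain Python. Several details here are assumptions, since the text does not specify them: the hidden layer widths (128, 64, 32), the mean pooling that reduces the per-residue vectors to one sequence vector, the random weights, and the random stand-in for the ESM2 output. Only the four linear layers with ReLU, the 480-dimensional input, the 4-dimensional output, and the softmax follow the description above.

```python
import math
import random


def linear(x, W, b):
    """Affine map: W is a list of rows, b the bias vector."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)                       # subtract the max for numerical stability
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def init_layer(rng, n_in, n_out):
    """Random weights for illustration only; a trained head would load learned ones."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def mean_pool(residue_embeddings):
    """Assumed pooling: average the per-residue vectors into one sequence vector."""
    n = len(residue_embeddings)
    return [sum(col) / n for col in zip(*residue_embeddings)]


def head_forward(residue_embeddings, layers):
    h = mean_pool(residue_embeddings)
    for W, b in layers:              # four linear layers, each followed by ReLU
        h = relu(linear(h, W, b))
    return softmax(h)                # probabilities over the four compartments


rng = random.Random(0)
sizes = [480, 128, 64, 32, 4]        # hidden widths are hypothetical
layers = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
fake_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(480)] for _ in range(10)]
probs = head_forward(fake_embeddings, layers)
print(probs)
```

During fine-tuning as described, only the weights in `layers` would receive gradient updates, while the backbone that produces the per-residue embeddings stays fixed.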
2.4 Transformer Encoder-Only Model
The Transformer encoder is a deep neural network built on the self-attention [11] mechanism and is widely used in natural language processing and computer vision. It was designed for modeling long sequence data in tasks such as machine translation and natural language generation [12]. What distinguishes it from traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) is that self-attention captures the dependency relationships in a sequence without processing the positions in order [13]. When the decoder component of the Transformer is omitted, the resulting model is known as the “Transformer encoder-only model” [14]. This model receives a sequence of tokens as input and passes it through multiple encoder layers, producing one representation vector per position that carries that position's information. These representation vectors can serve numerous purposes [15], including text classification, named entity recognition, and sentiment analysis.
To perform a controlled experiment on pre-trained [16] models, we compare training a simplified Transformer encoder-only model from scratch on the same dataset with fine-tuning a larger pre-trained model. As the dataset is limited in size, training a large Transformer network is not feasible. Hence, we constructed a smaller network architecture comprising four attention heads and three stacked encoder layers, producing a 128-dimensional output for each amino acid,
and a final output shape of (max(sequence length,1024), 128). As with the fine-tuned ESM2 model, we feed the output of this encoder-only network into a custom classification head. The classification head stacks several layers, each consisting of a linear layer followed by a ReLU activation, and maps the sequence representation to a 4-dimensional vector. We apply the softmax function to obtain the probability distribution over categories and optimize the cross-entropy loss between the predicted distribution and the label distribution. Finally, using backpropagation, we train the encoder-only network together with the custom classification head for the mitochondrial sublocalization task.
The self-attention mechanism [17] is the key component the Transformer uses to capture global dependencies in sequence data. Compared with traditional RNNs and CNNs, it captures long-range dependencies more efficiently and allows parallel processing, which speeds up computation. Self-attention is computed as follows:
$$A(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$
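As a minimal sketch of Eq. (1), scaled dot-product attention can be written in plain Python. It assumes a single head with no learned projections, masking or batching, whereas the network described here uses four heads. The function names and toy matrices are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q and K are lists of vectors of length d_k; V is a list of value
    vectors with one row per key.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


# Toy input (arbitrary values): 3 positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
```

Each output row is a convex combination of the rows of V, so every position can draw on every other position in a single step, which is the property the text contrasts with RNNs.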
2.5 Loss Function
Both the fine-tuned 35B-ESM2 model and the encoder-only model used in the control experiment are trained with a cross-entropy loss. The custom classification head outputs a probability distribution p(x), and each label is converted by one-hot encoding into a label distribution q(x). The loss is calculated as:
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N} q(x_i)\log\big(p(x_i)\big) \tag{2}$$
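The loss in Eq. (2) can be illustrated with a short sketch. It assumes that N counts the samples in a batch and that the product q(x_i) log p(x_i) is summed over the four classes for each sample, which is one reading of the formula. The small constant `eps` and all names are assumptions added for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes):
    """Label distribution q(x): 1 for the true class, 0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def cross_entropy(batch_logits, labels, num_classes=4, eps=1e-12):
    """Mean cross-entropy between one-hot labels q and softmax outputs p."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        p = softmax(logits)
        q = one_hot(label, num_classes)
        # eps (an assumption) guards against log(0).
        total += -sum(qc * math.log(pc + eps) for qc, pc in zip(q, p))
    return total / len(labels)


# Toy batch of two samples with four class scores each (arbitrary values).
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
labels = [0, 2]
loss = cross_entropy(logits, labels)
```

With a one-hot q, only the log-probability of the true class contributes, so a classifier that outputs a uniform distribution over four classes has a loss of log 4.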
3 Results and Discussions
This section analyzes the fine-tuned ESM2 model, referred to as “tuned-ESM2,” and the Transformer-based encoder-only model trained from scratch, referred to as “encoder-only.” We compare the two models and examine their respective benefits, limitations, and applicability using accuracy, recall, precision, and F1 score. We then discuss the factors that influence the models' performance and suggest directions for future research.
3.1 Performance Comparison
The task in this study is mitochondrial protein subcellular localization: each protein is assigned to one of four compartments, namely the matrix, outer membrane, inner membrane, or intermembrane space. We evaluated the fine-tuned ESM2 and encoder-only classifiers on this task. The tuned-ESM2 model clearly outperformed the encoder-only classifier, reaching 94.5% precision, F1 score, and accuracy and 95.5% recall. The encoder-only classifier was considerably weaker, with 76.5% precision, 77.5% recall, and 76.5% for both F1 score and accuracy. Detailed scores are given in Table 2.
Table 2. Comparison of experimental results for different classification algorithms
Classification | Precision | Recall | F1 | Accuracy
Fine-tune ESM2 | 94.5% | 95.5% | 94.5% | 94.5%
Encoder-only | 76.5% | 77.5% | 76.5% | 76.5%
We also used the Receiver Operating Characteristic (ROC) curve as an additional metric to assess the two classifiers. Because the ROC curve is defined for binary classification, we adopted the “One-vs-All” strategy to apply it to the multi-class outputs: each class in turn is treated as the positive class and the remaining classes as negative, which turns the multi-class problem into several binary problems. Figure 3 plots one ROC curve per class. The left-hand panel shows the encoder-only model's predictions for the four categories, and the right-hand panel shows those of the fine-tuned ESM2 model (tuned-ESM2). The ROC curves in Fig. 3 show that the tuned-ESM2 model outperformed the encoder-only model in all categories, which is further evidence of its higher classification performance on the mitochondrial protein subcellular localization task. Future work could analyze these results in more depth and optimize both models for this type of task.
Fig. 3. The comparison of ROC curves between two models
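The “One-vs-All” procedure behind the per-class ROC curves can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: the predicted probabilities are made up, the area under each curve is computed with the trapezoidal rule, and every class is assumed to have at least one positive and one negative sample.

```python
def roc_points(scores, is_positive):
    """(FPR, TPR) points for one binary problem, lowering the threshold step by step."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(order):
        # Samples with the same score cross the threshold together.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if is_positive[order[j]]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


def one_vs_rest_auc(probs, labels, num_classes):
    """Treat each class in turn as positive and all other classes as negative."""
    result = {}
    for c in range(num_classes):
        scores = [p[c] for p in probs]
        is_pos = [1 if y == c else 0 for y in labels]
        result[c] = auc(roc_points(scores, is_pos))
    return result


# Made-up softmax outputs for six proteins over four compartments.
probs = [
    [0.6, 0.2, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1], [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.2, 0.6], [0.3, 0.4, 0.2, 0.1], [0.2, 0.2, 0.3, 0.3],
]
labels = [0, 1, 2, 3, 0, 2]
per_class_auc = one_vs_rest_auc(probs, labels, num_classes=4)
```

Each entry of `per_class_auc` summarises one of the four binary curves in a panel of Fig. 3 as a single number between 0 and 1.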
3.2 Comparing Fine-tuning Pre-trained Models vs. Training from Scratch
This section briefly compares the performance of the fine-tuned pre-trained model (tuned-ESM2) with that of the model trained from scratch (encoder-only), highlighting their respective advantages and disadvantages and the likely reasons for the difference. As shown above, the fine-tuned ESM2 model outperforms the encoder-only model in accuracy, recall, precision, and F1 score. The knowledge acquired during pre-training may account for this advantage: because the model starts from pre-trained weights, it already encodes semantic and structural knowledge of protein sequences, which allows it to quickly learn the features relevant to sequence-based classification.
4 Conclusion The subcellular localization of proteins holds immense significance in bioinformatics by providing insights into protein function and mechanism within cells [18]. As vital molecules that support cellular activities, proteins exhibit distinct distribution patterns within cells, each tightly linked to its corresponding subcellular location. Predicting protein subcellular localization with precision is, therefore, crucial in unraveling protein function and biological significance. Significantly, subcellular localization information serves as a vital guide in the study of disease mechanisms, drug design, gene therapy, and other related fields. This research compared the performance of the fine-tuned ESM2 model with that of the Encoder-Only model of Transformer trained from scratch in the task of mitochondrial protein subcellular localization. Our results lead to the following critical conclusions: The fine-tuned ESM2 model surpasses the Encoder-Only model in terms of accuracy, recall, precision, and F1 score, revealing its superior performance, primarily attributed to the stronger feature representation capability and better generalization ability of the pre-trained model. These results highlight the need for further research, including an
in-depth investigation of the performance variations between pre-trained models and models trained from scratch concerning different tasks. It is essential to identify the reasons behind the models’ performance variations and provide theoretical support for optimization. The study also underscores the importance of exploring other pre-trained models and models trained from scratch, evaluating their suitability and performance in protein subcellular localization tasks. Such efforts would contribute to the development of efficient and precise solutions in bioinformatics. Acknowledgments. This work was supported by the National Natural Science Foundation of China (Grant No. 61902337), Xuzhou Science and Technology Plan Project (KC21047), Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016) and Young Talents of Science and Technology in Jiangsu and ghfund202302026465.
References 1. Dai, L.S., Zhu, B.J., Zhao, Y., et al.: Author correction: comparative mitochondrial genome analysis of Eligma narcissus and other lepidopteran insects reveals conserved mitochondrial genome organization and phylogenetic relationships. Sci. Rep. 10, 7221 (2020) 2. Dorji, J., Vander Jagt, C.J., Garner, J.B., et al.: Correction to: expression of mitochondrial protein genes encoded by nuclear and mitochondrial genomes correlate with energy metabolism in dairy cattle. BMC Genomics 23, 315 (2022) 3. Mei, S.: Predicting plant protein subcellular multi-localization by Chou’s PseAAC formulation based multi-label homolog knowledge transfer learning. JTBIAP 310, 80–87 (2012) 4. Lin, H., Chen, W., Yuan, L.F., Li, Z.Q., Ding, H.: Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor. 61, 259–268 (2013) 5. Kumar, R., Kumari, B., Kumar, M.: Proteome-wide prediction and annotation of mitochondrial and sub-mitochondrial proteins by incorporating domain information. Mitochondrion 42, 11–22 (2018) 6. Qiu, W., et al.: Predicting protein submitochondrial locations by incorporating the pseudoposition specific scoring matrix into the general Chou’s pseudo-amino acid composition. J. Theor. Biol. 450, 86–103 (2018) 7. Yu, B., et al.: SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting. Bioinformatics 36, 1074–1081 (2020) 8. Jiarui, F., Yang, Y., Chengduo, Z., Jie, Z.: Turbotransformers: an efficient GPU serving system for transformer models. In: PPoPP 2021: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2021) 9. Xiao, W., Yinping, J., Qiuwen, Z.: DeepPred-SubMito: A Novel Submitochondrial Localization Predictor Based on Multi-Channel Convolutional Neural Network and Dataset Balancing Treatment (2020) 10. Lin, Z., et al.: Evolutionary-scale prediction of atomic-level protein structure with a language model. 
Science 379, 1123–1130 (2023) 11. Yi, T., Dara, B., Donald, M., Dacheng, J., Zhe, Z., Che, Z.: Synthesizer: rethinking selfattention in transformer models. In: Proceedings of ICML (2021) 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
13. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) 14. Bapna, A., Chen, M., Firat, O., Cao, Y., Wu, Y.: Training deeper neural machine translation models with transparent attention. In: EMNLP, pp. 3028–3033 (2018) 15. Lan, Z., et al.: Albert: a lite BERT for self-supervised learning of language representations. In: International Conference on Learning Representations (2020) 16. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators. In: International Conference on Learning Representations (2020) 17. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017) 18. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Plant Vacuole Protein Classification with Ensemble Stacking Model Xunguang Ju, Kai Xiao, Luying He, Qi Wang, Zhuo Wang(B) , and Wenzheng Bao Xuzhou University of Technology, Xuzhou 221018, China [email protected]
Abstract. The prediction of subcellular localisation of proteins is one of the main goals of proteome sequencing, and researchers have achieved high classification accuracy with the help of computer technology, but most of the current classification models are not applicable to the classification of plant vacuole proteins, and it is tedious and time-consuming to classify plant vacuole proteins using subcellular localisation methods. In this paper, we focus on the classification of plant vacuole proteins based on an ensemble stacking model. New feature inputs are generated by fusing statistical and physicochemical features of proteins. The data is accurately classified by using an ensemble stacking model based on a number of machine learning algorithms. The results show that the model achieves a classification accuracy of 73%, which is a significant advance compared to other models and is of high significance for studying the classification of plant vacuole proteins. Keywords: plant vacuole proteins · feature extraction · ensemble stacking model · machine learning
1 Introduction
A plant vacuole is a vesicle-like structure in the cytoplasm. It is separated from the cytoplasm by an outer vacuole membrane and contains aqueous cytosol. The vacuole is small in young plant cells and large in mature plant cells [13–15]. Vacuoles are not unique to plant cells, but those in plant cells are larger and more varied. The plant vacuole is a single large structure involved in a variety of functions, such as plant growth and development, maintenance of cellular homeostasis and turgor, and the accumulation of nutrients, ions and secondary metabolites. The plant vacuole is filled with cytosol, the main component of which is a complex aqueous solution containing a wide range of organic and inorganic substances. Some of these substances are storage materials produced by cellular metabolism, such as sugars, organic acids and proteins. Others are salt crystals, which are mostly by-products of cellular metabolism and are often harmful to the cell, such as oxalic acid. The vacuole also contains a variety of enzymes that can break down the stored substances and, under certain conditions, reuse them in various metabolic processes.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 617–626, 2023. https://doi.org/10.1007/978-981-99-4749-2_53
With the availability of whole proteomes of any plant, the rapid and accurate classification of proteins depends on the subcellular location of the protein. Subcellular localisation is a very tedious and time-consuming experiment, so the development of easy and fast algorithms for accurate classification prediction is a priority [1–3]. In the past, many algorithms have been applied to protein subcellular localization [4], but none of them have been specifically applied to the classification of vacuolar proteins, which has resulted in poor classification of the models. Therefore, in this experiment, we develop a classification model for vacuolar protein prediction based on an ensemble stacking model. In this paper, we fuse feature extraction methods based on amino acid composition and physicochemical properties of amino acids for protein sequences, and use an ensemble stacking model consisting of knn, lightgbm, and Random forest algorithms for predictive classification. The flow chart is shown in Fig. 1.
Fig. 1. Experimental flow chart
2 Feature Input
2.1 Dataset
The dataset used in this study was derived from the publicly available database UniprotKB/SwissProt (release 3 July 2019). To create a balanced dataset, we randomly selected 200 proteins from each of the positive and negative datasets and used them in the prediction model. Thus, our final training dataset had 200 vacuolar and 200 non-vacuolar plant proteins.
2.2 Extraction of Autocorrelation Features from Protein Sequences
2.2.1 Amino Acid Composition
Amino acid composition is one of the simplest methods of protein feature extraction. It counts the frequency with which each of the twenty amino acids occurs in the sequence, and the feature extraction formula is as follows:
$$x(i) = \frac{N_i}{N} \tag{1}$$
where $i = 1, 2, 3, \ldots, 20$, $N_i$ is the number of occurrences of the $i$-th amino acid and $N$ is the total number of residues in the sequence. This method yields a 20-dimensional feature vector.
2.2.2 Autocovariance
The autocovariance (AC) measures the correlation between the values of the same physicochemical property at two nucleotide or amino acid positions separated by a specified distance. AC primarily captures proximity effects, that is, the interaction of an amino acid with other amino acids in the sequence. Amino acid interactions are reflected by seven sequence-based physicochemical properties, namely hydrophobicity, hydrophilicity, net charge index, polarity, polarizability, solvent accessible surface area and net charge index of the side chain. AC is denoted as:
$$AC(u, lag) = \sum_{i=1}^{L-lag}\left(P_u(R_i) - \bar{P}_u\right)\left(P_u(R_{i+lag}) - \bar{P}_u\right) / (L - lag) \tag{2}$$
where $u$ is the physicochemical index, $L$ is the length of the protein sequence, $lag$ is the distance between the two residues, $P_u(R_i)$ is the value of physicochemical index $u$ for the amino acid $R_i$ at position $i$, and $\bar{P}_u$ is the average of index $u$ over the whole sequence.
2.2.3 Cross-Covariance
$$CC(u_1, u_2, lag) = \sum_{i=1}^{L-lag}\left(P_{u_1}(R_i) - \bar{P}_{u_1}\right)\left(P_{u_2}(R_{i+lag}) - \bar{P}_{u_2}\right) / (L - lag) \tag{3}$$
where $u_1$ and $u_2$ are two physicochemical indices, $L$ is the length of the protein sequence, $P_{u_1}(R_i)$ is the value of index $u_1$ for the amino acid $R_i$ at position $i$, and $\bar{P}_{u_1}$ and $\bar{P}_{u_2}$ are the sequence averages of indices $u_1$ and $u_2$, respectively.
2.3 Protein Feature Extraction Based on Physicochemical Properties
Besides the composition and ordering of the residues, the physicochemical properties of the standard amino acids are also worth studying. Amino acids differ greatly in hydrophilicity, hydrophobicity and charge, as well as in the mass and volume of their side chains. These differences make physicochemical properties an effective way to distinguish proteins [5, 6].
2.3.1 Pseudo Amino Acid Composition
The pseudo-amino acid composition represents a protein sequence as a 20-dimensional vector of amino acid frequencies and a $\lambda$-dimensional
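vector consisting of sequence correlation factors. Following the standard pseudo-amino acid composition formulation [5], the sequence correlation factor can be defined as:
$$\delta_k = \frac{1}{L-k}\sum_{i=1}^{L-k}\Theta\left(Q_i, Q_{i+k}\right), \quad k = 1, 2, \ldots, \lambda \tag{4}$$
$$\Theta\left(Q_i, Q_j\right) = \frac{1}{3}\left\{\left[F(Q_j) - F(Q_i)\right]^2 + \left[G(Q_j) - G(Q_i)\right]^2 + \left[H(Q_j) - H(Q_i)\right]^2\right\} \tag{5}$$
where $L$ is the sequence length, $Q_i$ is the residue at position $i$, and $F(Q_i)$, $G(Q_i)$ and $H(Q_i)$ represent the hydrophobicity, hydrophilicity and side chain molecular weight values, respectively. Values can be obtained from the database website (http://www.genome.jp/aaindex). Before these three types of values are used, they should be normalised with the following equation, which applies in the same way to $G$ and $H$:
$$F(i) = \frac{F^0(i) - \frac{1}{20}\sum_{j=1}^{20}F^0(j)}{\sqrt{\frac{1}{20}\sum_{j=1}^{20}\left[F^0(j) - \frac{1}{20}\sum_{k=1}^{20}F^0(k)\right]^2}} \tag{6}$$
where $F^0(i)$, $G^0(i)$ and $H^0(i)$ represent the original values of hydrophobicity, hydrophilicity and side chain molecular mass of the $i$-th amino acid, respectively. The components of the feature vector are then:
$$p_u = \begin{cases}\dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{\phi=1}^{\lambda}\delta_\phi}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\delta_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{\phi=1}^{\lambda}\delta_\phi}, & 21 \le u \le 20 + \lambda\end{cases} \tag{7}$$
where $f_i$ ($i = 1, 2, \cdots, 20$) is the normalised frequency of occurrence of the 20 amino acids in the protein, $w$ is a weighting factor taking values between 0 and 1, and $\delta_\phi$ is the $\phi$-th sequence correlation factor. The final feature vector of the protein sequence can be expressed as a $(20 + \lambda)$-dimensional vector:
$$P = \left[p_1, p_2, \ldots, p_{20+\lambda}\right] \tag{8}$$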
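The pseudo-amino acid composition described above can be sketched in a few lines. The sketch assumes the standard formulation of Chou [5]: 20 normalised frequencies followed by λ sequence correlation factors, with the property values standardised over the 20 amino acids. The property tables below are placeholders and not real AAindex values; `lam`, `w`, the function names and the toy sequence are likewise illustrative assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardise(raw):
    """Zero-mean, unit-variance conversion of one property over the 20 amino acids."""
    mean = sum(raw.values()) / len(raw)
    std = math.sqrt(sum((v - mean) ** 2 for v in raw.values()) / len(raw))
    return {aa: (v - mean) / std for aa, v in raw.items()}


def pse_aac(seq, properties, lam=2, w=0.05):
    """Return a (20 + lam)-dimensional pseudo-amino acid composition vector.

    properties: list of dicts mapping amino acid -> raw property value.
    The sequence must be longer than lam.
    """
    props = [standardise(p) for p in properties]
    length = len(seq)
    # First 20 components: plain amino acid composition.
    freqs = [seq.count(aa) / length for aa in AMINO_ACIDS]
    # Correlation factor k averages squared property differences
    # between residues that are k positions apart.
    deltas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(length - k):
            a, b = seq[i], seq[i + k]
            total += sum((p[b] - p[a]) ** 2 for p in props) / len(props)
        deltas.append(total / (length - k))
    denom = sum(freqs) + w * sum(deltas)
    return [f / denom for f in freqs] + [w * d / denom for d in deltas]


# PLACEHOLDER property tables (not real hydrophobicity, hydrophilicity or
# side chain mass values); real values would come from the AAindex database.
PLACEHOLDER_PROPS = [
    {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)},
    {aa: float((i * 7) % 20) for i, aa in enumerate(AMINO_ACIDS)},
    {aa: float((i * 3) % 20) for i, aa in enumerate(AMINO_ACIDS)},
]

vec = pse_aac("MKVLAAGIVGLLLAQ", PLACEHOLDER_PROPS, lam=2, w=0.05)
```

Because both groups of components share one denominator, the vector sums to 1 whenever the sequence contains only the 20 standard amino acids, and `w` controls how much weight the sequence-order terms receive relative to the composition terms.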
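2.3.2 Series Correlation Pseudo Amino Acid Composition
SC-PseAAC is a variant of PC-PseAAC, in which, in its standard formulation:
$$p_u = \begin{cases}\dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{2\lambda}\tau_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{2\lambda}\tau_j}, & 21 \le u \le 20 + 2\lambda\end{cases} \tag{9}$$
where $f_i$ ($i = 1, 2, \cdots, 20$) is the normalised frequency of occurrence of the 20 amino acids in the protein, $w$ is a weighting factor taking values between 0 and 1, and $\tau_j$ is the $j$-th sequence correlation factor, which reflects the sequence order correlation between residues along the protein chain and is defined by the following formula:
$$\tau_{2k-1} = \frac{1}{L-k}\sum_{i=1}^{L-k} H^1_{i,i+k}, \qquad \tau_{2k} = \frac{1}{L-k}\sum_{i=1}^{L-k} H^2_{i,i+k}, \qquad k = 1, 2, \ldots, \lambda \tag{10}$$
Here $H^1_{i,j}$ and $H^2_{i,j}$ are the hydrophobicity and hydrophilicity correlation functions given by the following equations:
$$H^1_{i,j} = h^1(R_i)\cdot h^1(R_j), \qquad H^2_{i,j} = h^2(R_i)\cdot h^2(R_j) \tag{11}$$
where $h^1(R_i)$ and $h^2(R_i)$ in Eq. (11) represent the hydrophobicity and hydrophilicity values of amino acid $R_i$, respectively, and $(\cdot)$ denotes multiplication.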
Note that the hydrophobicity and hydrophilicity values must undergo a standard conversion before use, by means of the following equation, which applies in the same way to $h^2$:
$$h^1(R_i) = \frac{h^1_0(R_i) - \frac{1}{20}\sum_{k=1}^{20} h^1_0(R_k)}{\sqrt{\frac{1}{20}\sum_{y=1}^{20}\left[h^1_0(R_y) - \frac{1}{20}\sum_{k=1}^{20} h^1_0(R_k)\right]^2}} \tag{12}$$
where $R_i$ ($i = 1, 2, 3, \ldots, 20$) denotes the 20 natural amino acids, and $h^1_0$ and $h^2_0$ denote the original hydrophobicity and hydrophilicity values of the amino acid in parentheses. The feature vector of the final protein sequence can be expressed as a $(20 + 2\lambda)$-dimensional vector:
$$P = \left[p_1, p_2, p_3, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+2\lambda}\right]^{T} \tag{13}$$
2.4 Data Feature Combination and PCA
2.4.1 Data Feature Combination
To increase the information content of the data, the individual feature sets are normalised and concatenated into a single higher-dimensional vector. Dimensionality reduction is then applied to ease the training of the classifier.
2.4.2 Normalisation and PCA
Normalisation rescales each feature to the range [0, 1] and removes the differences in magnitude between features, which makes it possible to concatenate them. It is carried out with the following formula:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{14}$$
Because the concatenated data has 242 dimensions, the PCA algorithm was chosen to reduce its dimensionality so that the classifier converges faster. The 242-dimensional data was reduced to 45 dimensions.
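The normalisation and concatenation steps can be sketched as follows, assuming that min-max scaling is applied to each feature column and that constant columns are mapped to 0 to avoid division by zero (an assumption, since the text does not discuss that case). The PCA step from 242 to 45 dimensions is not shown. Function names and toy feature blocks are illustrative.

```python
def min_max_columns(rows):
    """Scale every column of a feature matrix to the range [0, 1]."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column would give 0/0; map it to 0.0 instead (assumption).
        scaled_cols.append([0.0 if span == 0 else (v - lo) / span for v in col])
    return [list(r) for r in zip(*scaled_cols)]


def fuse(*blocks):
    """Normalise each feature block, then concatenate the blocks per sample."""
    scaled = [min_max_columns(b) for b in blocks]
    fused = []
    for sample_parts in zip(*scaled):
        row = []
        for part in sample_parts:
            row.extend(part)
        fused.append(row)
    return fused


# Two toy feature blocks for three proteins (arbitrary values).
block_a = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
block_b = [[5.0], [5.0], [9.0]]
features = fuse(block_a, block_b)
```

Scaling before concatenation keeps a block with large raw values, such as molecular mass, from dominating a block of frequencies when the combined vector is passed to PCA.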
3 Model Construction
3.1 Random Forest
The American scientist Breiman presented a classification algorithm based on a random subspace and an ensemble classifier, called Random Forest [7], which is based on the decision tree.
3.2 Support Vector Machines
Support vector machines [8–10] are a class of generalized linear classifiers that perform binary classification in a supervised manner; the decision boundary is the maximum-margin hyperplane fitted to the training samples. The method minimizes the empirical risk while maximizing the classification margin, which allows good classification accuracy even when the number of samples is small. Support vector machines handle small-scale, high-dimensional nonlinear data well and have been used to solve numerous problems such as pattern recognition and regression prediction.
3.3 K Nearest Neighbors
Knn is a supervised learning algorithm. A sample is assigned to a category if the majority of its K nearest neighbours in the feature space belong to that category. The decision therefore depends only on the categories of the sample or samples nearest to the one being classified.
3.4 Ensemble Stacking Model
Stacking combines models that have each been fitted to the original data [11, 12]. The base learners are first trained on the original data, and each of them outputs its predictions for that data. These outputs are then stacked column-wise to form new (m, p)-dimensional data, where m is the number of samples and p is the number of base learners, and the new data is passed to the second-layer model for fitting. To prevent overfitting, K-fold cross-validation was used rather than training on all of the data at once.
3.5 Model Evaluation
We build the confusion matrix based on the experimental results to judge the actual classification prediction ability of the model.
Based on the experimental results, every prediction falls into one of four cases: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) [10–15]. Four evaluation metrics were used to assess model performance. Accuracy is the percentage of results that the model predicts correctly.
$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$
Precision is the proportion of correct predictions among all samples predicted as positive.
$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{16}$$
Recall is the proportion of actual positive samples that the model predicts correctly.
$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{17}$$
Precision and recall can pull in opposite directions, so the F1 value, which considers them together, is also used as an evaluation indicator. The F1 value is the harmonic mean of precision and recall.
$$F1 = 2 \times \frac{1}{\dfrac{1}{\mathrm{precision}} + \dfrac{1}{\mathrm{recall}}} \tag{18}$$
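The four metrics in Eqs. (15) to (18) follow directly from the confusion-matrix counts. The sketch below computes them for a binary vacuolar versus non-vacuolar problem. The labels are made up for illustration, and returning 0.0 when a denominator is zero is an assumption, since the text does not cover that case.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Zero denominators are mapped to 0.0 (assumption).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean, algebraically equal to 2 / (1/precision + 1/recall).
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = vacuolar, 0 = non-vacuolar.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Calling the function once with `positive=1` and once with `positive=0` gives the per-class rows of a classification report, and averaging those rows gives the macro average.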
4 Model Results and Analysis
In this experiment, KNN, SVM, lightgbm, Random forest and Gaussian NB models were first selected as base learner candidates to classify the data separately, and the models with better prediction performance were then selected as base learners. The results of predicting the data with each single model are shown in Table 1. Lightgbm performed best, with the highest accuracy, and Gaussian NB had the lowest accuracy. Among the remaining models, knn, svm and Random forest had the higher accuracy. We chose these three models as the base learners for the integrated model and used lightgbm as the meta-classifier.
Table 1. Accuracy of single models
Model | Key parameters | Values | Acc (%)
Knn | n_neighbors | 10 | 65
svm | kernel, C, gamma | rbf, 1, 0.5 | 69
Random forest | n_estimators | 70 | 67
lightgbm | nthread, learning_rate | 4, 0.1 | 70
Gaussian NB | - | - | 60
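The stacking scheme, with out-of-fold predictions from the base learners forming an (m, p) matrix for the meta-classifier, can be illustrated with a dependency-free sketch. The toy learners below (two nearest-neighbour variants and a nearest-centroid rule) are stand-ins chosen for brevity and are not the knn, svm, Random forest and lightgbm models used in the experiment. Assigning folds by index modulo K and all names are also assumptions.

```python
def knn_fit(X, y, k=3):
    """Toy k-nearest-neighbour learner: returns a predict(x) function."""
    def predict(x):
        nearest = sorted(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )[:k]
        votes = [y[i] for i in nearest]
        return max(set(votes), key=votes.count)
    return predict


def centroid_fit(X, y):
    """Toy nearest-centroid learner: returns a predict(x) function."""
    centroids = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(c) / len(pts) for c in zip(*pts)]

    def predict(x):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(centroids[lab], x)),
        )
    return predict


def stack_fit(X, y, base_fits, meta_fit, k_folds=3):
    """Fit a stacked model using out-of-fold base predictions as meta-features."""
    m = len(X)
    Z = [[None] * len(base_fits) for _ in range(m)]  # (m, p) meta-feature matrix
    for fold in range(k_folds):
        train = [i for i in range(m) if i % k_folds != fold]
        held = [i for i in range(m) if i % k_folds == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for j, fit in enumerate(base_fits):
            predict = fit(X_tr, y_tr)
            for i in held:
                # Each sample is predicted by a model that never saw it.
                Z[i][j] = predict(X[i])
    meta = meta_fit(Z, y)
    bases = [fit(X, y) for fit in base_fits]  # refit on all data for inference
    return lambda x: meta([b(x) for b in bases])


# Two well-separated toy classes, interleaved so every fold contains both.
class0 = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1]]
class1 = [[5.0, 5.1], [5.2, 5.0], [5.1, 5.3], [5.3, 5.2], [5.0, 5.4], [5.4, 5.1]]
X, y = [], []
for a, b in zip(class0, class1):
    X += [a, b]
    y += [0, 1]

base_learners = [lambda X_, y_: knn_fit(X_, y_, 3), lambda X_, y_: knn_fit(X_, y_, 1), centroid_fit]
model = stack_fit(X, y, base_learners, meta_fit=centroid_fit)
```

Using out-of-fold predictions means the meta-classifier is trained on outputs the base learners produced for unseen samples, which is how K-fold cross-validation limits overfitting in this setup.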
After selecting the classifier, we used the ensemble model built to process the data and perform model evaluation, as shown in Table 2:
Table 2. Accuracy of the final model
Category | Acc | Precision | Recall | F1
macro avg | 0.73 | 0.73 | 0.73 | 0.73
weighted avg | 0.73 | 0.73 | 0.73 | 0.73
vacuole proteins | 0.74 | 0.69 | 0.76 | 0.72
non-vacuole proteins | 0.72 | 0.76 | 0.70 | 0.73
The final classification accuracy of the model reached 73%, which is a significant improvement over most of the current vacuole protein classifiers, and the classification of vacuole proteins has been achieved with good results. A comparison of vacuole and non-vacuole proteins shows that the classification accuracy for vacuole proteins is higher (0.74 versus 0.72), indicating that the model learns the sequence features of vacuolar proteins more easily.
5 Conclusion
With whole-proteome data becoming available in recent years, fast and accurate classification of proteins depends on their subcellular location. Subcellular localisation is a tedious and time-consuming experiment, so the focus is on building easy and fast algorithms for accurate classification prediction. Existing models are not applicable to the classification of vacuolar proteins, and this experiment constructs a protein classification process based on an ensemble stacking model. For feature extraction, we use several protein feature extraction methods and fuse them to characterise the protein adequately. To reduce the difficulty of fitting the model, we normalise the data and reduce its dimensionality. For the model, we used an ensemble stacking model incorporating a variety of machine learning methods and obtained an accuracy of 73%, which demonstrates the effectiveness of the proposed method for the classification of vacuolar proteins. However, some issues remain. The classification performance of the model is still inadequate despite the progress made, and in the future deep learning or other advanced algorithms could be used to build the model and improve its accuracy. More work is needed to help other researchers better differentiate vacuolar proteins.
Acknowledgments. This work was supported by the National Natural Science Foundation of China (Grant No. 61902337), Xuzhou Science and Technology Plan Project (KC21047), Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016) and Young Talents of Science and Technology in Jiangsu and ghfund202302026465.
References 1. Boden, M., Hawkins, J.: Prediction of subcellular localization using sequence-biased recurrent networks. Bioinformatics 21, 2279–2286 (2005) 2. Chou, K.C., Shen, H.B.: Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization. PLoS ONE 5, e11335 (2010) 3. Nakashima, H., Nishikawa, K.: The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Lett. 303 (1992) 4. Guo, J., Lin, Y., Sun, Z.: A novel method for protein subcellular localization: combining residue-couple model and SVM. In: Asia-Pacific Bioinformatics Conference, Singapore, pp. 117–129 (2005) 5. Chou, K.C.: Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct. Funct. Bioinf. 43(3), 246–255 (2001) 6. Chou, K.C., Shen, H.B.: Predicting protein subcellular location by fusing multiple classifiers. J Cell Biochem. 99(2), 517–527 (2006) 7. Breiman, L.: Random forest. Mach. Learn. 45(1), 5–32 (2001) 8. Wang, Y.-C., Wang, Y., Yang, Z.-X., et al.: Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context. BMC Syst. Biol. 5(S1), S6 (2011) 9. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998) 10. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 11. Wei, L., Xing, P., Zeng, J., Chen, J., Su, R., Guo, F.: Improved prediction of protein–protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med. 83, 67–74 (2017) 12. Wei, L., Xing, P., Su, R., Shi, G., Ma, Z.S., Zou, Q.: CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency. J. Proteome Res. 16(5), 2044–2053 (2017) 13. 
Zhang, C., Hicks, G., Raikhel, N.: Molecular composition of plant vacuoles: important but less understood regulations and roles of tonoplast lipids. Plants 4, 320–333 (2015) 14. Zhang, C., Hicks, G.R., Raikhel, N.V.: Plant vacuole morphology and vacuolar trafficking. Front. Plant Sci. 5, 476 (2014) 15. Zhang, L., Zhao, X., Kong, L.: Predict protein structural class for low-similarity sequences by evolutionary difference information into the general form of Chou’s pseudo amino acid composition. J. Theor. Biol. 355, 105–110 (2014)
De Novo Drug Design Using Unified Multilayer Simple Recurrent Unit Model Zonghao Li1 , Jing Hu1,2(B) , and Xiaolong Zhang1,2 1 School of Computer Science and Technology, Wuhan University of Science and Technology,
Wuhan 430065, Hubei, China {hujing,xiaolong.zhang}@wust.edu.cn 2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, China
Abstract. De novo drug design has developed rapidly in recent years. It allows novel drug structures to be designed and has made outstanding contributions to the field of drug design. Improving the quality of generated drugs is the common goal of most published experimental papers, and the sheer volume of available data provides a solid foundation for achieving it. However, as the amount of drug data grows, the time cost of such experiments has to be considered. Conversely, when a small number of drugs with specific properties is studied, the fitting ability of the model becomes critical. As a result, some models perform very differently on different datasets. For SMILES (Simplified Molecular Input Line Entry System) data from different datasets, this paper proposes a “Unified Multilayer SRU De novo drug design acceleration Model” (USD) based on a multi-layer Simple Recurrent Unit (SRU). Because the amount of data strongly affects the outcome of deep learning training, the model is trained on SMILES data from the DrugBank database (small data volume) and from ChEMBL (large data volume), and finally generates novel drug molecules. Both the efficiency of molecular generation and the time cost of model training and data generation are greatly improved. A series of comparative experiments shows that USD strikes a good balance between drug generation quality and time cost, which indicates that this research direction has experimental and reference value.
Keywords: De Novo Drug Design · USD · Simple Recurrent Unit
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 627–638, 2023. https://doi.org/10.1007/978-981-99-4749-2_54

1 Introduction

The term drug design first appeared around the 1920s, together with the discipline of medicinal chemistry. After years of development, the initial qualitative research has turned into quantitative research that draws on the theoretical foundations of medicinal chemistry, molecular biology, statistical mathematics, and related fields, and makes use of electronic computers and other tools. With the in-depth study of drug molecular structure, enzymes, receptors, and related concepts, continuous progress has been made in the field of drug design, especially in the use of computers for drug design. For huge databases, algorithms from machine learning, deep learning, reinforcement learning, and deep reinforcement learning greatly reduce the time cost of data processing and screening and effectively improve the efficiency of drug design.
2 Research Achievements at Home and Abroad

De novo drug design has made significant contributions to the field of drug design. It refers to the use of computational growth algorithms to design novel chemical entities that conform to a set of constraints [1]. “De novo” means “from the beginning”: new molecules can be generated in this way without a starting template [2]. Machine learning and deep learning methods have achieved good results on large datasets in this field and have been successfully applied to the design and development of new drugs. Existing methods use a variety of artificial neural networks: Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), Generative Adversarial Networks (GAN), Autoencoders (AE), and so on [3].

Olivecrona et al. [4] used a recurrent neural network to propose a policy-based reinforcement learning method. A sequence-based generative model is tuned for molecular de novo design and, through an augmented episodic likelihood, learns to generate structures with specified desirable properties. Segler et al. [5] used LSTM networks to generate molecules targeting Staphylococcus aureus and Plasmodium falciparum (malaria) and obtained good results. Merk et al. [6] used a recurrent neural network to capture the constitution of a large number of known bioactive compounds represented as SMILES strings. Through transfer learning, this general model was fine-tuned for the identification of retinoid X and peroxisome proliferator-activated receptor agonists, and five top-ranking compounds designed by the generative model were synthesized. Gupta et al. [7] proposed a generative network based on long short-term memory for molecular de novo design; their results support generative RNN-LSTM systems for high-impact use cases such as low-data drug discovery, fragment-based molecular design, and hit-to-lead optimization for different drug targets. Ståhl et al. [8] proposed a fragment-based reinforcement learning method built on an actor-critic model for generating new molecules with optimal properties. Prykhodko et al. [9] proposed a new deep learning architecture, LatentGAN, which combines an autoencoder and a generative adversarial network for de novo molecular design. Zheng et al. [10] reported a quasi-biogenic molecule generator (QBMG) that composes a virtual library of quasi-biogenic compounds with a gated recurrent unit recurrent neural network. This approach can be used to generate virtual compound libraries for the identification and optimization of drug leads. Li et al. [11] proposed DeepScaffold, a comprehensive deep learning tool for scaffold-based drug discovery. Yasonik J et al. [12] proposed multi-objective drug design based on a recurrent neural network and non-dominated sorting: the non-dominated sorting algorithm selects the best of the generated molecules for transfer learning in order to generate better data.
Based on a broad survey, we found that researchers at home and abroad have mostly used compound datasets with a large amount of data. To make the experimental results accessible, researchers apply various methods, such as transfer learning, to steer the properties of the generated data after it has been generated. Here we found two problems. First, a large amount of data incurs a huge time cost during training. Although this is considerable progress compared with traditional drug design methods, a large amount of data is also a heavy test for the computer hardware. Early drug design experiments commonly used RNN-based or LSTM-based models. In terms of effect and computational performance, LSTM greatly improves on the training effect of RNN for long-sequence contexts. However, LSTM still has a pain point: its computational time cost. Second, when using transfer learning, we found that even with pre-trained models there are still problems with the degree of model fit when training on small datasets. And if a small dataset is applied directly to the initial training model, the final result is far from the result obtained with a large dataset. We have not been able to find a model that is unaffected by the amount of data, while the number of drugs that really target a specific disease or a specific target is limited. In this work, we propose a “Unified Multilayer SRU-Drug Design Acceleration Model”, which improves the quality of the generated drug molecules and greatly reduces the time cost. Finally, we use the non-dominated sorting algorithm to screen the best of the generated drug molecules, which can be used for further experiments.
3 Method

3.1 The Advantage of the SRU

Figure 1 compares the structure of the Simple Recurrent Unit (SRU) with that of the Long Short-Term Memory (LSTM) unit: (a) is the internal structure of LSTM and (b) is the internal structure of SRU. The figure shows that SRU is a simplified form of LSTM. In 2018, Lei et al. [13] applied SRU to classification, question answering, language modeling, translation, and speech recognition. In LSTM, the computation in each cell depends on the output of the previous cell, so it is difficult to run as fast as a CNN and the time cost is relatively large. Lei et al. [13] simplify the state computation so that the gates no longer depend on the output at the previous moment, i.e. $h_{t-1}$ (see formulas (1)–(5)). SRU therefore exhibits the same parallelism as CNNs, attention models, and feed-forward networks, and with CUDA-level optimization it eventually reaches speeds close to conv2d. This solves the long running time of LSTM. Therefore, we believe that SRU has great experimental value for de novo drug design.

$$\tilde{x}_t = W x_t \tag{1}$$

$$f_t = \sigma(W_f x_t + b_f) \tag{2}$$

$$r_t = \sigma(W_r x_t + b_r) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{x}_t \tag{4}$$

$$h_t = r_t \odot g(c_t) + (1 - r_t) \odot x_t \tag{5}$$

where $x_t$ is the input at time $t$ and $\tilde{x}_t$ is its linear transformation; $W$ and $b$ stand for weights and biases; $f_t$ is the forget gate at time $t$; $r_t$ is the reset gate at time $t$; $c_t$ and $h_t$ are the state and the final output at time $t$, respectively. $\sigma$ and $g$ denote the sigmoid function and an activation function (tanh or ReLU), respectively, and $\odot$ denotes element-wise multiplication.

As for CUDA optimization, the matrix multiplications can be batched across all time steps, which significantly improves computational intensity and GPU utilization. The matrix multiplications in formulas (1), (2), and (3) can be combined into one, and the results can subsequently be looked up by index, as shown in formula (6):

$$U^{T} = \begin{pmatrix} W \\ W_f \\ W_r \end{pmatrix} [x_1, x_2, \ldots, x_n] \tag{6}$$

The element-wise operations over the sequence can be compiled and fused into a single kernel function and parallelized over the hidden dimension. With this treatment, Lei et al. [13] improved the running speed of the model without losing accuracy.
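The recurrence in formulas (1)–(5) can be illustrated with a minimal pure-Python sketch. It is not the CUDA implementation of [13] and not the authors' code; it only shows why the projections can be computed for all time steps first (the idea behind formula (6)) while the remaining loop is element-wise. The function names, the zero initial state, and the assumption that input and hidden sizes are equal are choices made for this illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sru_layer(xs, W, Wf, bf, Wr, br, g=math.tanh):
    """Apply formulas (1)-(5) to a sequence xs of d-dimensional vectors.

    Assumption: input and hidden sizes are both d, so the highway term
    (1 - r_t) * x_t in formula (5) is well defined.
    """
    d = len(xs[0])
    # Idea of formula (6): the three projections depend only on x_t, never on
    # h_{t-1}, so they can be computed for every time step before the loop.
    x_tilde = [matvec(W, x) for x in xs]
    f_pre = [matvec(Wf, x) for x in xs]
    r_pre = [matvec(Wr, x) for x in xs]

    c = [0.0] * d  # initial state c_0, assumed to be zero
    outputs = []
    for t, x in enumerate(xs):
        f = [sigmoid(a + b) for a, b in zip(f_pre[t], bf)]  # formula (2)
        r = [sigmoid(a + b) for a, b in zip(r_pre[t], br)]  # formula (3)
        # formula (4): element-wise state update
        c = [fk * ck + (1.0 - fk) * xk for fk, ck, xk in zip(f, c, x_tilde[t])]
        # formula (5): gated output with a highway connection to the input
        h = [rk * g(ck) + (1.0 - rk) * xk for rk, ck, xk in zip(r, c, x)]
        outputs.append(h)
    return outputs, c


# Tiny usage example with hypothetical 2-dimensional inputs.
identity = [[1.0, 0.0], [0.0, 1.0]]
hs, final_state = sru_layer(
    [[1.0, -2.0], [0.5, 4.0]], identity, identity, [0.0, 0.0], identity, [0.0, 0.0]
)
```

Only the short loop over `t` is sequential; on a GPU the three `matvec` passes would become one batched matrix multiplication.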
Fig. 1. Internal structure of generation model. Comparison of LSTM and SRU internal structures.
3.2 Overall Experimental Process

The process of the whole experiment is shown in Fig. 2. The data is processed first, and the processed data is then fed into the training model. The internal structure of the training model is shown in Fig. 3. After training, the generation module is run, and the newly generated molecules are then screened. We use the non-dominated sorting algorithm to select the best generated molecules according to the five drug candidate criteria of the “Rule of Three” (an extension of Lipinski’s Rule of Five) [14].

3.3 Non-dominated Sorting and the “Rule of Three”

Each SMILES sequence is treated as one target, and a target usually has multiple properties. A sort normally optimizes a single criterion, so the goals for multiple properties can contradict each other, which makes ranking complicated and hard to measure. Multi-objective optimization has shifted in recent years from trying to find a single optimal solution to finding a set of Pareto-optimal or non-dominated solutions [15]. In our work, Fonseca and Fleming’s non-dominated ranking algorithm [16] is employed to compare molecules produced by the SRU networks against the “Rule of Three” criteria [14], which comprise five drug candidate criteria. This algorithm is simple and efficient, with computational complexity $O(n^2)$. We use the open-source Python cheminformatics library RDKit [17] to evaluate the five criteria; since RDKit cannot directly measure molecular mass, we temporarily use molecular weight instead. The five criteria are as follows:

• Octanol-water partition coefficient logP ≤ 3.
• Molecular weight ≤ 480 g/mol.
• ≤ 3 hydrogen bond donors.
• ≤ 3 hydrogen bond acceptors.
• ≤ 3 rotatable bonds.
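A minimal sketch of Fonseca and Fleming style ranking is given here for illustration: the rank of a molecule is one plus the number of molecules that dominate it, so rank 1 is the non-dominated front, and the double loop gives the $O(n^2)$ cost. How the five criteria are turned into objectives is not stated in the text; this sketch assumes each objective is the amount by which a property exceeds its limit, to be minimized. The property tuples are hypothetical values, which in practice would come from RDKit.

```python
# Limits for (logP, molecular weight, H-bond donors, H-bond acceptors,
# rotatable bonds), taken from the five criteria listed in the text.
LIMITS = (3.0, 480.0, 3, 3, 3)


def violations(props):
    """Assumed objective vector: how far each property exceeds its limit."""
    return tuple(max(0.0, value - limit) for value, limit in zip(props, LIMITS))


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def fonseca_fleming_rank(objectives):
    """Rank = 1 + number of dominating points; O(n^2) pairwise comparisons."""
    return [1 + sum(dominates(q, p) for q in objectives) for p in objectives]


# Hypothetical property tuples for four generated molecules.
molecules = {
    "A": (2.1, 310.0, 1, 3, 2),
    "B": (3.8, 495.0, 2, 4, 3),
    "C": (2.9, 520.0, 4, 2, 5),
    "D": (4.5, 530.0, 4, 5, 6),
}
objectives = [violations(p) for p in molecules.values()]
ranks = dict(zip(molecules, fonseca_fleming_rank(objectives)))
best = [name for name, rank in ranks.items() if rank == 1]
```

Molecules B and C receive the same rank because neither dominates the other, which is the situation a single sort criterion cannot express.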
Fig. 2. Internal structure of generation model.
Fig. 3. Example of unified data processing unit.
3.4 Data Normalization Unit

This part is a novel unified data processing module that we add to the experimental model. Previous work divides the data without differentiation: each filtered SMILES string is split into single characters. An example of the division performed by the unified data processing module is shown in Fig. 3, where the green part is the area identified as a whole unit. The module divides different SMILES data according to their own characteristics, with the aim of accurately identifying basic structures of different lengths and different writing styles. Such processing also preserves more useful structural and biochemical information for the training phase to learn.
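The difference between character-level division and unit-level division can be sketched as follows. The text does not spell out the rules of the unified data processing unit, so the regular expression here is an assumption: it keeps bracket atoms, two-letter halogens, the `@@` chirality mark, and two-digit ring closures together, which is a common way to tokenize SMILES.

```python
import re

# Assumed token rules (not taken from the paper): bracket atoms such as [nH]
# or [C@@H], the halogens Br and Cl, the chirality mark @@, and two-digit
# ring closures such as %12 are kept as single units; anything else is one
# character per token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|@@|%\d{2}|.")


def split_characters(smiles):
    """Undifferentiated division: one token per character."""
    return list(smiles)


def split_units(smiles):
    """Unit-aware division: multi-character structures stay together."""
    return TOKEN_PATTERN.findall(smiles)


example = "Clc1ccccc1Br"
char_tokens = split_characters(example)  # 'C', 'l', ... splits Cl and Br apart
unit_tokens = split_units(example)       # 'Cl', 'c', '1', ... keeps them whole
```

With character-level division the model has to learn that `C` followed by `l` is chlorine; unit-level division hands it that fact directly, at the price of a larger vocabulary.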
4 Data Processing

4.1 Dataset

Our experiments use two datasets, one large and one small. ChEMBL [18] (large data volume: 1,562,408 SMILES records) is a large-scale, open-access drug discovery database that collects medicinal chemistry data and knowledge from drug research and development. Information on small molecules and their biological activities is derived from full-text articles in several core medicinal chemistry journals and combined with data on approved drugs and clinical development candidates, such as mechanisms of action and therapeutic indications. The DrugBank [21] dataset (small data volume: 8288 SMILES records) was first reported in 2006. It integrates information such as the structure and pharmacology of drug molecules and the protein sequences of their targets, and links multiple databases for detailed analysis of drugs. The database is currently widely used in drug docking, screening, drug retrieval, and other applications.

4.2 Data Preprocessing

First, the SMILES sequences in the dataset are screened, keeping drug molecules whose SMILES string length is between 34 and 75 characters. Next, a “G” character is added at the beginning of each record as a start character, and “\n” is appended at the end as a terminator. Each SMILES “character” is then one-hot encoded, using its integer encoding as the index. Finally, the vocabulary, integer indices, and one-hot encodings are collected into a one-to-one lookup table, as shown in Fig. 4.
Fig. 4. Data encoding matrix.
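The preprocessing steps of Sect. 4.2 can be sketched in a few lines of plain Python. This is an illustration rather than the authors' pipeline: the function names are hypothetical, the length bounds are assumed to be inclusive, and tokens are single characters here for brevity.

```python
def preprocess(smiles_list, min_len=34, max_len=75, start="G", end="\n"):
    """Keep SMILES within the length window and add start/end markers."""
    kept = [s for s in smiles_list if min_len <= len(s) <= max_len]
    return [start + s + end for s in kept]


def build_vocab(records):
    """Map every distinct character to an integer index."""
    return {ch: i for i, ch in enumerate(sorted(set("".join(records))))}


def one_hot(record, vocab):
    """Encode a record as a list of one-hot rows, one row per character."""
    rows = []
    for ch in record:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


# Hypothetical input: only the two middle strings fall inside the window.
raw = ["CCO", "C" * 34, "C" * 75, "C" * 76]
records = preprocess(raw)
vocab = build_vocab(records)
encoded = one_hot(records[0], vocab)
```

The dictionary `vocab` plays the role of the lookup table in Fig. 4: each character maps to an integer, and the integer selects the position of the single 1 in its one-hot row.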
5 Result

After training, we proceed directly to the generation step. For the different datasets, we refine the experiment step by step; the results are shown in the following tables. DrugBank dataset: the original data comprises 8288 SMILES drug records, of which 4624 remain after screening. ChEMBL dataset: the original data comprises 1,562,408 SMILES drug records, of which 481,496 remain after screening.

To demonstrate the advantages of the USD model, we choose the experimental model of Yasonik J et al. [12] for comparison. The experimental results on the DrugBank data are shown in Table 1. The USD model (using a 4-layer SRU structure) has very clear advantages over the model of Yasonik J et al. in terms of the quality of the generated data (“Valid” reaches 70.3%) and the time cost. The experimental results on the ChEMBL data are shown in Table 2. The USD model (using a 5-layer SRU structure) has a slight lead in the quality of the generated data (Valid reaches 77.9%) over the model of Yasonik J et al. (the value in their paper is 77%; our re-run gives 67.4%). In terms of reducing the time cost (Total time is training time plus generation time; in the tables, h stands for hours, m for minutes, and s for seconds), the USD model again shows strong capability. These results indicate that, with the same model parameters, the USD model not only balances data generation quality against time cost but also improves each of the two indicators. At the same time, we found a problem: the amount of data generated by the USD model is relatively small (209 and 208 molecules in Tables 1 and 2), whether the training data is large or small.

We therefore need to increase the amount of generated data and compare the indicators again to support the conclusions above. As shown in Table 3, we increase the amount of data generated by the USD model and adopt a 4-layer SRU structure. The comparison clearly shows the advantages of the USD model: when generating drug data of the same order of magnitude, the data quality retains an advantage (Valid reaches 79.5%), and the total time falls from about 113 h to about 69 h, a reduction of roughly 39%. In addition, we introduce average Tanimoto similarity as an evaluation index; its low value (0.1534) shows that the drug molecules generated by the USD model are also highly diverse.

To explore the effect of the data normalization unit in the USD model, we compare the metrics with and without normalization; the result is shown in Table 4. There is a substantial improvement in the quality of the generated data (the validity of the data generated without normalization is only 64.1%). However, the time cost is slightly higher than in the experiment without normalization. Inspection of the data shows that the data normalization unit produces a larger set of unique “characters”.

Table 1. DrugBank dataset experimental results

| Models | USD | Yasonik J et al. |
|---|---|---|
| Layers | 4 (SRU) | 3 (LSTM) |
| Valid (%) | 70.3 | 1.5 |
| Number valid | 147 | 409 |
| Total | 209 | 26518 |
| Total time | 21 m 17 s | 32 m 02 s |
Table 2. ChEMBL dataset experimental results

| Models | USD | Yasonik J et al. |
|---|---|---|
| Layers | 5 (SRU) | 3 (LSTM) |
| Valid (%) | 77.9 | 77 |
| Number valid | 162 | 13334 |
| Total | 208 | 19774 |
| Total time | 10 h 46 m 08 s | 113 h 24 m 27 s |
To ensure that the SMILES sequences generated by the SRU are not all of one type, the valid generated molecules are represented by Morgan molecular fingerprints, and 1000 molecules are randomly selected to calculate the Tanimoto similarity between each pair of molecules a and b. The formula is given in (7), where $m_a$ and $m_b$ are the fingerprint sets of the two molecules. The index is the ratio of the size of the intersection to the size of the union of the two sets, so its value lies between 0 and 1. The final calculated average is 0.1534, which indicates that the small molecules generated by our SRU experiment have good diversity.

$$T(a, b) = \frac{|m_a \cap m_b|}{|m_a \cup m_b|} \tag{7}$$
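Formula (7) and the averaging over molecule pairs can be sketched with plain Python sets. Each fingerprint is represented here as the set of its on-bit indices; the three fingerprints are hypothetical, and in practice they would be Morgan fingerprints computed with RDKit. Returning 0.0 for two empty fingerprints is a convention chosen for this sketch.

```python
from itertools import combinations


def tanimoto(ma, mb):
    """Formula (7): size of the intersection over size of the union."""
    union = len(ma | mb)
    return len(ma & mb) / union if union else 0.0


def average_pairwise_tanimoto(fingerprints):
    """Mean Tanimoto similarity over all unordered pairs (needs >= 2 items)."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical fingerprints given as sets of on-bit indices.
fps = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}]
avg = average_pairwise_tanimoto(fps)
```

A lower average means the sampled molecules share fewer fingerprint features, which is why a small value is read as high diversity.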
Table 3. Experimental results on the ChEMBL dataset with an increased amount of generated data

| Models | USD | Yasonik J et al. |
|---|---|---|
| Layers | 4 (SRU) | 3 (LSTM) |
| Valid (%) | 79.5 | 77 |
| Number valid | 14862 | 13334 |
| Total | 18704 | 19774 |
| Total time | 69 h 23 m 59 s | 113 h 24 m 27 s |
| Average Tanimoto Similarity | 0.1534 | 0.1608 |
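As a quick arithmetic check on the figures in Table 3, the validity percentages and the relative time saving can be recomputed from the raw counts and durations. The helper names are hypothetical; the numbers are the ones reported in the table.

```python
def valid_percent(n_valid, n_total):
    """Share of generated SMILES that are valid, in percent."""
    return 100.0 * n_valid / n_total


def to_seconds(hours, minutes, seconds):
    return 3600 * hours + 60 * minutes + seconds


usd_valid = valid_percent(14862, 18704)       # USD, 4-layer SRU
baseline_valid = valid_percent(13334, 19774)  # baseline counts from the table

usd_time = to_seconds(69, 23, 59)
baseline_time = to_seconds(113, 24, 27)
time_saving = 100.0 * (1.0 - usd_time / baseline_time)

print(f"USD valid: {usd_valid:.1f}%")
print(f"Baseline valid (from counts): {baseline_valid:.1f}%")
print(f"Time saved by USD: {time_saving:.1f}%")
```

The baseline percentage recomputed from its counts corresponds to the re-run value quoted in the text rather than to the 77% reported in the original paper.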
Table 4. Experimental results on the ChEMBL dataset with an increased amount of generated data

| Models | USD | USD (no specific slicing operation) |
|---|---|---|
| Layers | 4 (SRU) | 4 (SRU) |
| Valid (%) | 79.5 | 64.1 |
| Number valid | 14862 | 11866 |
| Total | 18704 | 18510 |
| Total time | 69 h 23 m 59 s | 60 h 11 m 37 s |
| Average Tanimoto Similarity | 0.1534 | 0.1544 |
Before the next stage of targeted transfer learning, we compare the generated valid, unique, and novel data with existing databases. (Since the existing databases have been updated many times, we consider this comparative screening feasible, and it helps to verify the accessibility of our model and experimental ideas.) We found 29 SMILES records that exist in the ChEMBL database and are also recorded in DrugBank, PubChem, ZINC, and other databases. Among them is the drug GSI-136, shown in Fig. 5, which is recorded in the DrugBank database for the treatment of Alzheimer’s disease and has entered the clinical stage.
Fig. 5. GSI-136 molecular structure.
To verify the actual effect of a generated drug on the corresponding disease, we used the AutoDock [22] tool for molecular docking verification of GSI-136. We selected the gamma-secretase complex (7d5u, a common target for treating Alzheimer’s disease) as the target. The binding site found by the analysis is shown in Fig. 6 (the white sphere represents the drug molecule, and the white banded structure represents the target), and the specific binding information is shown in Fig. 7 and Fig. 8. Fig. 7 shows the site with the lowest binding energy in this docking verification, −5.6, where a hydrogen bond is formed with the No. 245 N atom on chain A of 7d5u. Similarly, Fig. 8 shows that the site with the next lowest binding energy, −5.5, also forms hydrogen bonds.
Fig. 6. Schematic diagram of GSI-136 binding position with 7d5u.
Fig. 7. Schematic diagram of the location of the lowest binding energy of GSI-136 drug on 7d5u.

Fig. 8. Schematic diagram of the second lowest binding energy location of GSI-136 drug on 7d5u.
6 Conclusion

In this work, we propose for the first time an SRU-based de novo drug design model, USD. To verify that the USD model can balance the validity of the generated data against the time cost, we conducted experiments on 8288 SMILES drug molecule records from the DrugBank database and 1,562,408 SMILES drug molecule records from the ChEMBL database, examining the validity of the generated data, the time cost of the experiment, and the average Tanimoto similarity of the generated data. Compared with other models, the validity of the data generated on ChEMBL also improves to a certain extent, but the most significant advantage is the reduction in time cost (from about 113 h to about 69 h on ChEMBL, roughly 39%). After four sets of comparative experiments, we conclude that:

• The unified data processing unit proposed in the USD model effectively learns more biochemical information, improves the validity of the generated data, and helps to generate more diverse drug molecule data.
• Regardless of the amount of data, the USD model retains both its ability to reduce the time cost and its ability to generate valid data, and it compares favorably with other models. The USD model balances these two important metrics well.

After screening the experimental data, we found 29 molecules among the generated valid and novel drug data that already exist in the ChEMBL database. Some of these molecules, such as GSI-136, have entered the clinical stage. In addition, the AutoDock tool was used for further visual verification and for locating binding sites on the target. We can also judge the degree of similarity between existing molecules and generated molecules based on molecular fingerprints and inspect them visually. This not only strengthens our confidence in adding transfer learning experiments but also lays a solid foundation for further experiments in this direction. Our work still has considerable room for improvement. In the last stage of the experiment, we used the non-dominated sorting algorithm to screen the best drug molecule data, which has great data mining value. Further experiments will probably start from specific targets and specific conditions. The number of drugs with specific properties is often limited, and the experiments on the small dataset show that our model copes well with limited data, laying the foundation for subsequent experiments.

Acknowledgment. This work is supported by the National Natural Science Foundation of China (No. 61972299). The authors thank the members of the Machine Learning and Artificial Intelligence Laboratory, School of Computer Science and Technology, Wuhan University of Science and Technology, for their helpful discussions within seminars.
References

1. Mouchlis, V.D., Melagraki, G., Zacharia, L.C., Afantitis, A., et al.: Computer-aided drug design of β-secretase, γ-secretase and anti-tau inhibitors for the discovery of novel Alzheimer’s therapeutics. Int. J. Mol. Sci. 21, 703 (2020)
2. Schneider, P., Schneider, G.: De novo design at the edge of chaos. J. Med. Chem. 59, 4077–4086 (2016)
3. Mouchlis, V.D., Afantitis, A., Serra, A., et al.: Advances in de novo drug design: from conventional to machine learning methods. Int. J. Mol. Sci. 22, 1676 (2021)
4. Olivecrona, M., Blaschke, T., Engkvist, O., Chen, H.: Molecular de-novo design through deep reinforcement learning. J. Cheminform. 48 (2017)
5. Segler, M.H.S., Kogej, T., Tyrchan, C., Waller, M.P.: Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Sci. 4(1), 120–131 (2017)
6. Merk, D., Friedrich, L., Grisoni, F., Schneider, G.: De novo design of bioactive small molecules by artificial intelligence. Mol. Inform. 37 (2018)
7. Gupta, A., Müller, A.T., Huisman, B.J.H., Fuchs, J.A., Schneider, P., Schneider, G.: Generative recurrent networks for de novo drug design. Mol. Inform. 37 (2018)
8. Ståhl, N., Falkman, G., Karlsson, A., Mathiason, G., Boström, J.: Deep reinforcement learning for multiparameter optimization in de novo drug design. J. Chem. Inf. Model. 59, 3166–3176 (2019)
9. Prykhodko, O., et al.: A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminform. 74 (2019)
10. Zheng, S., Yan, X., Gu, Q., et al.: QBMG: quasi-biogenic molecule generator with deep recurrent neural network. J. Cheminform. 11 (2019)
11. Li, Y., Hu, J., Wang, Y., et al.: DeepScaffold: a comprehensive tool for scaffold-based de novo drug discovery using deep learning. J. Chem. Inf. Model. 60, 77–91 (2019)
12. Yasonik, J.: Multiobjective de novo drug design with recurrent neural networks and nondominated sorting. J. Cheminform. 12 (2020)
13. Lei, T., Zhang, Y., Wang, S.I., Dai, H., Artzi, Y.: Simple recurrent units for highly parallelizable recurrence. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)
14. Jhoti, H., Williams, G., Rees, D.C., Murray, C.W.: The ‘rule of three’ for fragment-based drug discovery: where are we now? Nat. Rev. Drug Discov. 12 (2013)
15. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 182–197 (2002)
16. Fonseca, C.M., Fleming, P.J.: Genetic algorithms for multiobjective optimization: formulation discussion and generalization. In: Proceedings of the 5th International Conference on Genetic Algorithms, Urbana-Champaign, IL, USA, June 1993 (1993)
17. RDKit: Open-Source Cheminformatics. https://www.rdkit.org
18. ChEMBL: A manually curated database of bioactive molecules with drug-like properties. https://www.ebi.ac.uk/chembl/
19. PubChem: The world’s largest collection of freely accessible chemical information. https://pubchem.ncbi.nlm.nih.gov/
20. BindingDB: A public, web-accessible database. https://www.bindingdb.org/bind/index.jsp
21. Wishart, D.S., et al.: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucl. Acids Res. (2017). https://doi.org/10.1093/nar/gkx1037
22. Morris, G.M., et al.: Autodock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 16, 2785–2791 (2009)
DTI-MACF: Drug-Target Interaction Prediction via Multi-component Attention Network

Jiejin Deng, Yijia Zhang(B), Jing Zhang, Yaohua Pan, and Mingyu Lu(B)

School of Information Science and Technology, Dalian Maritime University, Dalian 116024, Liaoning, China
{zhangyijia,lumingyu}@dlmu.edu.cn
Abstract. Drug-target interaction (DTI) prediction plays an essential role in drug discovery. Traditional biomedical measurement via in vitro experiments is reliable but can be prohibitively expensive, time-consuming, and inefficient, especially on large-scale datasets. In recent years, deep learning has been increasingly used in the biomedical field, especially for drug-target prediction. However, existing deep-learning-based DTI methods still leave room for improvement in feature extraction. In this paper, we propose a multi-component aggregation model with collaborative filtering for DTI prediction, called DTI-MACF. Our approach constructs a bipartite graph and extracts various potential features through a multi-component module. To improve the accuracy of the feature representation, we design a neighbourhood aggregator module based on the bipartite graph, which fuses abundant historical interaction information. Extensive experiments on three benchmark datasets demonstrate the strong competitiveness of the proposed model.

Keywords: Drug-Target Interaction · Collaborative filtering · Attention network · Multi-components
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 639–650, 2023. https://doi.org/10.1007/978-981-99-4749-2_55

1 Introduction

The investigation of drug-target interaction plays a crucial role in drug discovery and research [1], as it aims to uncover the relationship between drugs and proteins and ultimately guides drug development. In recent years, computer-based techniques have become increasingly prevalent in the prediction of biomolecular correlations [2]. Four primary approaches have emerged: machine learning-based methods, network-based methods, measurement factor-based methods, and deep learning-based methods [3, 4]. Machine learning is frequently employed for predicting DTI [5]. For example, KronRLS [6] employs a Gaussian interaction profile kernel and applies a regularized least squares (RLS) classifier to make predictions. DTINet [7] uses unsupervised methods to learn low-dimensional feature representations of drugs and targets from heterogeneous data and then uses inductive matrix completion to make predictions. Random forest (RF) [8] and support vector machine (SVM) [9] are also commonly used: drug-target pairs in the training set are used as feature vectors for training, and results are predicted on the test set. For example, Yu et al. [10] integrated chemical, genomic, and pharmacological information to predict DTI based on RF and SVM. However, these methods require constant updating and enrichment of large-scale genomic, chemical, and pharmacological data, which can limit their overall performance [11].

Deep learning has shown remarkable success not only on Euclidean data but also on non-Euclidean data, with convolutional neural networks (CNNs), graph neural networks (GNNs), and other types of neural networks being widely employed in drug-target interaction (DTI) prediction. For example, DeepDTI [12], RFDTI [13], DeepDTA [14], and DeepConv-DTI [15] employ deep neural networks to predict DTI. Fu et al. [16] proposed a new multi-view graph convolutional network (MVGCN) framework for link prediction in biomedical bipartite networks, which generalizes across six benchmark datasets involving three typical tasks. Chu et al. [17] proposed a new hierarchical graph representation learning model for DTA prediction, called HGRL-DTA.

Furthermore, recommendation systems are gradually being applied to DTI prediction. They provide users with accurate and personalized recommendations by finding relevant information in vast amounts of data, even when the user’s needs are unclear [18, 19]. For example, Lim et al. [20] proposed a fast and accurate off-target prediction method for DTI based on a dual regularized one-class collaborative filtering algorithm.

In this paper, we aim to explore the potential of fine-grained features by extracting multiple embedded features through different components. We then aggregate these features into a final embedded representation via a neural network to predict interactions. To further improve the learning ability of the feature representation, we design a neighborhood aggregator that fuses historical interaction information based on the bipartite graph.
To summarize, the main contributions of this paper are as follows:

(1) We propose a multi-component aggregation model with collaborative filtering for DTI prediction, called DTI-MACF, which uses a multi-component module to extract more detailed information and improve accuracy.
(2) We design a neighborhood aggregation technique to combine abundant historical interaction information, which enhances the model’s representation learning capability.
2 Methodology

This section provides a detailed description of the DTI-MACF model used for DTI prediction. The overall architecture of the model is presented in Fig. 1; it comprises three parts: drug representation learning, protein representation learning, and drug-protein interaction prediction. The main components of the model are described in detail below.

2.1 Drug-Target Bipartite Graph

We apply the matrix factorization technique from collaborative filtering to convert the drug-target interaction data into two matrices D and P. The matrix D describes the relationship between drugs and their latent features, while the matrix P describes the relationship between proteins and their latent features. These latent features serve as implicit indicators of the associations between drugs and proteins.
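The factorization of the interaction data into drug factors D and protein factors P can be illustrated with a small sketch. The text does not say how the factorization is computed, so plain stochastic gradient descent with L2 regularization is assumed here; the function names, hyperparameters, and the toy interaction matrix are hypothetical. Factors are stored one row per drug or protein, i.e. transposed relative to the $L_d \times T$ layout used in the text.

```python
import random


def predict(D, P, t, n):
    """Predicted interaction score: dot product of the latent vectors."""
    return sum(d * p for d, p in zip(D[t], P[n]))


def mse(R, D, P):
    """Mean squared reconstruction error over all drug-protein pairs."""
    total = sum((R[t][n] - predict(D, P, t, n)) ** 2
                for t in range(len(R)) for n in range(len(R[0])))
    return total / (len(R) * len(R[0]))


def factorize(R, k=2, steps=2000, lr=0.05, reg=0.01, seed=0):
    """Factorize R (drugs x proteins) into D (drugs x k) and P (proteins x k)."""
    rng = random.Random(seed)
    D = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(len(R))]
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(len(R[0]))]
    for _ in range(steps):
        for t in range(len(R)):
            for n in range(len(R[0])):
                err = R[t][n] - predict(D, P, t, n)
                for j in range(k):
                    d, p = D[t][j], P[n][j]
                    D[t][j] += lr * (err * p - reg * d)  # gradient step, drug
                    P[n][j] += lr * (err * d - reg * p)  # gradient step, protein
    return D, P


# Toy interaction matrix: 3 drugs x 3 proteins, 1 = known interaction.
R = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1]]
D, P = factorize(R)
error_after_training = mse(R, D, P)
```

Each row of `D` or `P` is a latent feature vector, which matches the role the text gives to the two matrices as implicit indicators of drug-protein associations.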
Fig. 1. Overview of the proposed DTI-MACF. Drug and protein feature extraction uses the same operation, which mainly includes three parts: multi-component extraction of fine-grained features, and aggregation of neighbor node information based on bipartite graph.
Drugs have the feature matrix $D = (D_1, D_2, \ldots, D_T) \in \mathbb{R}^{L_d \times T}$, where $L_d$ is the dimension of the drug features and $T$ is the number of drugs. Proteins have the feature matrix $P = (P_1, P_2, \ldots, P_N) \in \mathbb{R}^{L_p \times N}$, where $L_p$ is the dimension of the protein features and $N$ is the number of proteins. We use $R$ to denote the interaction between drugs and proteins:

$$R(d_t, p_n) = \begin{cases} 1, & \text{interaction} \\ 0, & \text{no interaction} \end{cases} \tag{1}$$

where $d_t \in D$ and $p_n \in P$. We construct a bipartite graph $G = (D, P, R, E)$ to represent the known drug-target interactions in our datasets. $D$ and $P$ represent the collections of drugs and proteins, respectively. $R$ represents the collection of interactions between drugs and proteins: if a drug interacts with a protein, the value of $R$ is 1; otherwise, the value is 0. Each edge $e = \{d, p, r\} \in E$ represents an interaction from drug $D_i$ to protein $P_j$.

The interaction between drugs and proteins is determined by a range of features. We assume that the drug-protein bipartite graph $G$ is driven by $M$ latent components. Different components capture different features: the $m$-th component captures the $m$-th latent motivation in the drug-protein interactions. Therefore, we first design $M$ component transformation matrices for the drugs and proteins respectively, $W = \{W_1, W_2, W_3, \ldots, W_M\}$ and $Q = \{Q_1, Q_2, Q_3, \ldots, Q_M\}$, to extract the features that correspond to particular components. For the drug $D_i$, its $m$-th drug component $h_m^i$ is extracted as:

$$h_m^i = Q_m D_i \tag{2}$$

For the protein $P_j$, its $m$-th protein component $s_m^j$ is extracted as:

$$s_m^j = W_m P_j \tag{3}$$
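The component extraction in Eqs. (2) and (3) amounts to one linear projection per component. The following is a minimal NumPy sketch under assumed toy dimensions; the sizes, the random initialization, and the component dimension `d` are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L_d, L_p, d, M = 6, 5, 4, 3   # toy feature sizes, component size, number of components

D_i = rng.normal(size=L_d)    # feature vector of one drug
P_j = rng.normal(size=L_p)    # feature vector of one protein

# One transformation matrix per component, following Eqs. (2) and (3).
Q = rng.normal(size=(M, d, L_d))   # Q[m] maps drug features to the m-th component
W = rng.normal(size=(M, d, L_p))   # W[m] maps protein features to the m-th component

h = Q @ D_i   # h[m] = Q_m D_i, shape (M, d)
s = W @ P_j   # s[m] = W_m P_j, shape (M, d)
```

Stacking the matrices along a leading axis lets a single batched product compute all M components at once.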
2.2 Drug Feature Representation
For drug $D_i$, we extract the feature representations of the different components $h_m^i$ as in Eq. (2). The relative importance of a drug to different proteins varies. We therefore design a neighborhood aggregator, which uses an attention mechanism to aggregate the neighborhood (the drug's neighbors in the drug-protein graph) for each component feature of the drug, in order to improve the feature representation:

$$att_m^i = \sigma\big(W \cdot AGG_{drug}(h_m^i \,\|\, P_{ia}, \forall a \in A(i)) + bias\big) \tag{4}$$

$$Z_m^i = att_m^i \cdot P_{ia} \tag{5}$$

where $AGG_{drug}$ denotes the attention-based aggregation of the neighborhood information of drugs and $A(i)$ denotes the set of proteins interacting with drug $D_i$. In this way, we obtain the aggregated features of the $m$-th component of drug $D_i$. Because drugs have different latent characteristics, the importance of each component to the drug-target interaction differs. We therefore integrate all component features into $Z^i$ as follows:

$$\alpha_m^{d_i} = \mathrm{softmax}(Z_m^i), \quad Z^i = \sum_{m=1}^{M} Z_m^i \cdot \alpha_m^{d_i} \tag{6}$$

We fuse the initial latent feature representations $h_m^i$ of drug $D_i$ as follows:

$$\beta_m^{d_i} = \mathrm{softmax}(h_m^i), \quad h^i = \sum_{m=1}^{M} h_m^i \cdot \beta_m^{d_i} \tag{7}$$

The vector $Z^i$ after neighborhood aggregation may lie in a different latent space from $h^i$. Therefore, we further combine the two parts to get the final representation $d_i$ of drug $D_i$, where $g$ is a weight hyperparameter:

$$d_i = g \cdot h^i + (1 - g) \cdot Z^i \tag{8}$$
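The component fusion and final combination in Eqs. (6) to (8) can be sketched as follows. The equations leave open how the softmax is applied to a vector-valued component; this sketch assumes it is taken across the M components, independently for each dimension. The aggregated components are filled with random toy values rather than computed from Eqs. (4) and (5), and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_components(X):
    """Weight M component vectors by a softmax over components and sum them.

    X has shape (M, d). Assumption: the softmax in Eqs. (6)-(7) runs across
    the M components, independently for each of the d dimensions.
    """
    weights = softmax(X, axis=0)      # alpha or beta, shape (M, d)
    return (X * weights).sum(axis=0)  # shape (d,)

rng = np.random.default_rng(0)
M, d = 3, 4
h_components = rng.normal(size=(M, d))  # h_m^i from Eq. (2), toy values
Z_components = rng.normal(size=(M, d))  # Z_m^i from Eq. (5), assumed precomputed

h_i = fuse_components(h_components)     # Eq. (7)
Z_i = fuse_components(Z_components)     # Eq. (6)

g = 0.8                                 # weight hyperparameter (toy value)
d_i = g * h_i + (1 - g) * Z_i           # Eq. (8)
```

Setting `g = 1` keeps only the initial representation, while smaller values give more weight to the neighborhood-aggregated part.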
2.3 Protein Feature Representation
For protein $P_j$, we extract the feature representations of the different components $s_m^j$ as in Eq. (3). The relative importance of a protein to different drugs varies. We use an attention mechanism to aggregate the neighborhood (the protein's neighbors in the drug-protein graph) for each component feature of the protein:

$$att_m^j = \sigma\big(W \cdot AGG_{protein}(s_m^j \,\|\, D_{jb}, \forall b \in B(j)) + bias\big) \tag{9}$$

$$V_m^j = att_m^j \cdot D_{jb} \tag{10}$$

where $AGG_{protein}$ denotes the attention-based aggregation of the neighborhood information of proteins and $B(j)$ denotes the set of drugs interacting with protein $P_j$. In this way, we obtain the aggregated features of the $m$-th component of protein $P_j$. Because proteins have different latent characteristics, the importance of each component to the drug-target interaction differs. We therefore integrate all component features into $V^j$ as follows:

$$\alpha_m^{p_j} = \mathrm{softmax}(V_m^j), \quad V^j = \sum_{m=1}^{M} V_m^j \cdot \alpha_m^{p_j} \tag{11}$$

The initial latent feature representations $s_m^j$ of protein $P_j$ are fused as follows:

$$\beta_m^{p_j} = \mathrm{softmax}(s_m^j), \quad s^j = \sum_{m=1}^{M} s_m^j \cdot \beta_m^{p_j} \tag{12}$$

The vector $V^j$ after neighborhood aggregation may lie in a different latent space from $s^j$. We further combine the two parts to get the final representation $p_j$ of protein $P_j$:

$$p_j = g \cdot s^j + (1 - g) \cdot V^j \tag{13}$$
2.4 Drug-Target Interaction Prediction
To predict drug-target interactions, the decoder takes a pair of drug and target embeddings as input. The embeddings are first multiplied element-wise to capture the interactions between the two entities. The resulting vector is then passed through an MLP that outputs a single scalar value, the predicted interaction score between the drug and the target. Finally, a sigmoid activation function is applied to the output, giving the probability of interaction between the drug and the target.

$$y = d_i \odot p_j \tag{14}$$

$$\hat{R}_{ij} = \mathrm{Sigmoid}(\mathrm{MLP}(y)) \tag{15}$$

The symbol $\odot$ denotes element-wise vector multiplication.
2.5 Loss Function
We treat the drug-target interaction task as a binary classification problem and use the binary cross-entropy loss (BCELoss) to measure the difference between the predicted probabilities and the true labels.

$$Loss_{BCE} = -\big[R_{ij} \cdot \log \hat{R}_{ij} + (1 - R_{ij}) \cdot \log(1 - \hat{R}_{ij})\big] \tag{16}$$

where $R_{ij}$ is the ground truth for drug $i$ and protein $j$, and $\hat{R}_{ij}$ is the predicted value for drug $i$ and protein $j$.
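The decoder and loss in Eqs. (14) to (16) can be illustrated with a small NumPy sketch. The MLP architecture is not specified in the text, so a single ReLU hidden layer with random weights is assumed here; the function names and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(d_i, p_j, W1, b1, w2, b2):
    """Eqs. (14)-(15): element-wise product, then a small MLP and a sigmoid."""
    y = d_i * p_j                          # Eq. (14)
    hidden = np.maximum(0.0, W1 @ y + b1)  # one ReLU hidden layer (assumed)
    return sigmoid(w2 @ hidden + b2)       # Eq. (15), a probability in (0, 1)

def bce(r_true, r_pred, eps=1e-12):
    """Eq. (16) for a single drug-protein pair."""
    r_pred = np.clip(r_pred, eps, 1.0 - eps)  # avoid log(0)
    return -(r_true * np.log(r_pred) + (1.0 - r_true) * np.log(1.0 - r_pred))

rng = np.random.default_rng(0)
dim, hidden_dim = 4, 8
d_i, p_j = rng.normal(size=dim), rng.normal(size=dim)
W1, b1 = rng.normal(size=(hidden_dim, dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=hidden_dim), 0.0

r_hat = predict(d_i, p_j, W1, b1, w2, b2)
loss = bce(1.0, r_hat)   # loss if the pair is a known interaction
```

In training, the per-pair losses would typically be averaged over a batch; the clipping constant `eps` is a common numerical safeguard and not part of Eq. (16).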
To address over-parametrization and overfitting, various regularization techniques can be used. One such technique is $L_0$ regularization, which encourages the model to use only a small subset of the available features or parameters. The final objective function is as follows:

$$Loss = Loss_{BCE} + \lambda \|\theta\|_0 \tag{17}$$

where $\theta = \{W, Q\}$. By sparsifying the multi-component extraction matrices $W$ and $Q$, we avoid unnecessary resource use and alleviate overfitting, because irrelevant degrees of freedom are pruned away. $\lambda$ is a hyperparameter that balances $Loss_{BCE}$ and the regularization term.
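To make Eq. (17) concrete, the sketch below evaluates the penalized objective for toy parameter matrices by counting non-zero entries. It only evaluates the objective: the L0 norm is not differentiable, and the text does not say how it is optimized, so no training procedure is implied. All values and function names are hypothetical.

```python
import numpy as np

def l0_norm(matrices, tol=0.0):
    """Number of entries whose magnitude exceeds tol, summed over all matrices."""
    return int(sum(np.count_nonzero(np.abs(m) > tol) for m in matrices))

def total_loss(loss_bce, theta, lam):
    """Eq. (17): BCE loss plus lambda times the L0 norm of the parameters."""
    return loss_bce + lam * l0_norm(theta)

W = [np.array([[0.5, 0.0], [0.0, -1.2]])]  # toy protein component matrices
Q = [np.array([[0.0, 0.0], [0.3, 0.0]])]   # toy drug component matrices

loss = total_loss(loss_bce=0.40, theta=W + Q, lam=0.01)
```

Here three entries are non-zero, so the penalty adds three times `lam` to the BCE term; a larger `lam` favors sparser matrices.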
3 Experiments and Results
3.1 Parameter Setting
The candidate parameter values for the proposed model are shown in Table 1; the model achieves its best performance by adjusting these parameters. To ensure the robustness of the results, we run the model with five different initializations and report the average value of the evaluation metrics.

Table 1. Parameter setting.

| Parameter | Setting |
| --- | --- |
| batch size | 8, 16, 32, 64, 128, 256 |
| dropout rate | 0.1, 0.2, 0.3, 0.4, 0.5 |
| learning rate | 0.005, 0.001, 0.0015, 0.0005 |
| embedding dimension d | 32, 64, 128, 256 |
| number of components M | 1, 2, 3, 4, 5 |
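The text does not say how the values in Table 1 were searched. Assuming an exhaustive grid, purely for illustration, the candidate configurations could be enumerated as in the sketch below; the dictionary keys are hypothetical names.

```python
from itertools import product

# Candidate values copied from Table 1; the key names are illustrative.
grid = {
    "batch_size": [8, 16, 32, 64, 128, 256],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "learning_rate": [0.005, 0.001, 0.0015, 0.0005],
    "embedding_dim": [32, 64, 128, 256],
    "num_components": [1, 2, 3, 4, 5],
}

# One dictionary per combination of values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

A full grid over these values contains 6 × 5 × 4 × 4 × 5 = 2400 configurations, which is one reason a coordinate-wise or random search is often preferred in practice.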
3.2 Evaluation Parameters
Based on previous work [21], we select the AUROC and AUPR scores as our evaluation metrics. These metrics are particularly suitable for evaluating binary classification models that are trained on unbalanced data. The values of both AUROC and AUPR range from 0 to 1, where a higher score indicates a more effective model. The ROC curve is a graphical representation used to evaluate the accuracy of a binary classification model, and AUROC is the area under this curve. The horizontal axis of the curve represents the false positive rate (FPR), while the vertical axis represents the true positive rate (TPR).

$$FPR = \frac{FP}{TN + FP}, \quad TPR = \frac{TP}{TP + FN} \tag{18}$$

The precision-recall (PR) curve is commonly used to evaluate binary classification models, and AUPR is the area under this curve. It measures the trade-off between precision and recall across classification thresholds. The horizontal axis of the curve represents the recall, while the vertical axis represents the precision.

$$Recall = \frac{TP}{TP + FN}, \quad Precision = \frac{TP}{TP + FP} \tag{19}$$
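The quantities in Eqs. (18) and (19) can be computed directly from confusion-matrix counts, and AUROC follows from sweeping the decision threshold over the sorted scores. The sketch below is a simplified illustration with hypothetical function names; it assumes all scores are distinct, and an established library implementation would normally be used instead.

```python
import numpy as np

def confusion_rates(tp, fp, tn, fn):
    """Return FPR, TPR (recall) and precision from confusion-matrix counts."""
    fpr = fp / (tn + fp)
    tpr = tp / (tp + fn)          # identical to recall
    precision = tp / (tp + fp)
    return fpr, tpr, precision

def auroc(y_true, scores):
    """Area under the ROC curve via a threshold sweep and the trapezoid rule.

    Assumes all scores are distinct; ties are not handled in this sketch.
    """
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

fpr, tpr, precision = confusion_rates(tp=40, fp=10, tn=90, fn=10)
area = auroc([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2])
```

In the toy example, five of the six positive-negative pairs are ranked correctly, which matches the interpretation of AUROC as the fraction of correctly ordered pairs.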
TP represents the true positive samples, FP represents the false positive samples, TN represents the true negative samples, and FN represents the false negative samples.
3.3 Datasets
In our comparative experiments, we primarily use the Davis [22] and BindingDB [23] datasets. For the Davis dataset, we use a Kd value of 7 as the threshold. For the BindingDB dataset, we use a Kd value of 30 as the threshold. In addition to these two datasets, we add several datasets that are more common in DTI tasks, which were collected by Yamanishi et al. [24]. This collection includes four main subsets: enzymes (EN), ion channels (IC), G-protein-coupled receptors (GPCR) and nuclear receptors (NR). Since the NR dataset is too small, we only use three of them in this paper: EN, IC, and GPCR. Similar to the cold start problem in recommendation systems, the model's performance is affected by the small size of a dataset. Table 2 shows the details of our experimental datasets. We divide the Davis and BindingDB datasets into three separate sets in a ratio of 7:1:2 for training, validation, and testing. Since the EN, IC, and GPCR datasets are relatively small, we divide them into two separate sets in a ratio of 8:2 for training and testing. To ensure the reliability of our experimental results, we carry out five distinct runs for each experiment, using a different random split of the dataset for each run.

Table 2. Statistics of the datasets.

| Dataset | Drugs | Targets | Interactions |
| --- | --- | --- | --- |
| Davis | 68 | 442 | 30056 |
| BindingDB | 14643 | 2623 | 49199 |
| EN | 445 | 663 | 2925 |
| IC | 210 | 204 | 1276 |
| GPCR | 223 | 95 | 635 |
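The splitting protocol described above (a 7:1:2 split repeated with five different random splits) could be implemented along the following lines. This is a sketch only: the function name and the seeds are hypothetical, and the sample count is taken from the Davis row of Table 2.

```python
import numpy as np

def split_indices(n_samples, ratios, seed):
    """Shuffle sample indices and cut them into consecutive parts by ratio."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    fractions = np.cumsum(ratios) / np.sum(ratios)
    cuts = (fractions[:-1] * n_samples).astype(int)
    return np.split(perm, cuts)

# Five random 7:1:2 splits of the 30056 Davis entries listed in Table 2.
splits = [split_indices(30056, ratios=(7, 1, 2), seed=s) for s in range(5)]
train, valid, test = splits[0]
```

Passing `ratios=(8, 2)` gives the two-way split used for the smaller EN, IC, and GPCR datasets.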
3.4 Comparison with State-of-the-Art Methods
Our experimental comparison results on the EN, IC, and GPCR datasets are shown in Table 3, which demonstrates the strong competitiveness of our proposed model. Specifically, for the IC dataset, DTI-MACF achieves a significant improvement compared to the baselines. Among the baseline methods, NormMulInf had the best AUC score of 0.939, while IMCHGAN had the best AUPR score of 0.920. Compared to these two methods, DTI-MACF shows improvements of 0.02 and 0.034 in terms of AUC and AUPR, respectively. For the EN dataset, our model is also strongly competitive: DTI-MACF is better than the second-best method by 0.02 in terms of AUPR and matches the best method in terms of AUC. For the GPCR dataset, however, DTI-MACF reaches 0.911 and 0.907 in terms of AUC and AUPR, respectively, and is inferior to some of the comparison models because the GPCR dataset is small. Similar to the cold start problem in recommendation systems, the model's performance is affected by the small size of the dataset. Addressing this will be a direction of our future research.

Table 3. The performance evaluation results of DTI-MACF on EN, IC, and GPCR datasets in terms of AUC and AUPR scores. The best result in each column is marked in bold.

| Method | EN AUC | EN AUPR | IC AUC | IC AUPR | GPCR AUC | GPCR AUPR |
| --- | --- | --- | --- | --- | --- | --- |
| AutoDTI++ [25] | 0.900 | 0.820 | 0.910 | 0.900 | 0.860 | 0.850 |
| IMCHGAN [26] | 0.926 | 0.940 | 0.904 | 0.920 | 0.936 | **0.947** |
| KEMV [27] | 0.950 | 0.937 | 0.883 | 0.757 | 0.880 | 0.883 |
| NormMulInf [28] | **0.958** | 0.932 | 0.939 | 0.913 | **0.948** | 0.879 |
| DTI-MACF | **0.958** | **0.960** | **0.959** | **0.954** | 0.911 | 0.907 |
Table 4. The performance evaluation results of DTI-MACF on Davis and BindingDB datasets in terms of AUC and AUPR scores. The best result in each column is marked in bold.

| Method | Davis AUC | Davis AUPR | BindingDB AUC | BindingDB AUPR |
| --- | --- | --- | --- | --- |
| GNN-CPI [29] | 0.840 | 0.269 | 0.900 | 0.578 |
| DeepDTA [14] | 0.880 | 0.302 | 0.913 | 0.622 |
| DeepConv-DTI [15] | 0.884 | 0.299 | 0.908 | 0.611 |
| MolTrans [30] | 0.907 | 0.404 | 0.914 | 0.622 |
| DTI-MACF | **0.921** | **0.685** | **0.936** | **0.818** |
We also conduct comparative experiments on the Davis and BindingDB datasets. The results, shown in Table 4, again demonstrate the strong competitiveness of our proposed model. DTI-MACF demonstrates a significant improvement over the baselines on the Davis dataset. Among the baseline methods, MolTrans achieved the highest AUC score of 0.907 and the highest AUPR score of 0.404. Compared to MolTrans, DTI-MACF shows improvements of 0.014 and 0.281 in terms of AUC and AUPR, respectively. The model also performs very well on the BindingDB dataset: while MolTrans performed well, DTI-MACF outperforms it with improvements of 0.022 in AUC and 0.196 in AUPR. These results show that our model achieves outstanding performance, particularly in AUPR, where the relative increase over MolTrans is 69% on the Davis dataset and 32% on the BindingDB dataset.
To further demonstrate the effectiveness of our proposed model, we compare it with other drug-target interaction prediction models based on collaborative filtering with matrix factorization on the EN, IC, and GPCR datasets, including MSCMF [31], NRLMF [32], mk-TCMF [33], and DNILMF [34], using AUPR as the evaluation metric. The experimental results, shown in Fig. 2, indicate that our proposed model has significant advantages over the other collaborative-filtering-based DTI prediction models. Although DNILMF demonstrates good performance, our proposed model outperforms it, achieving increases of 4.1%, 1.7%, and 11.7% on the three respective datasets. It is worth noting that our proposed model shows a significant improvement on the GPCR dataset, highlighting the strong performance of DTI-MACF on small datasets compared to other DTI models based on collaborative filtering.
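The relative AUPR gains over MolTrans quoted for Davis and BindingDB can be checked directly from the values in Table 4. The short sketch below does this arithmetic; the helper name is hypothetical.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

davis = relative_gain(0.685, 0.404)      # AUPR, DTI-MACF vs. MolTrans on Davis
bindingdb = relative_gain(0.818, 0.622)  # AUPR, DTI-MACF vs. MolTrans on BindingDB
```

The two ratios come out between 69% and 70% for Davis and between 31% and 32% for BindingDB, in line with the rounded figures reported in the text.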
Fig. 2. Performance comparison with other DTI models based on collaborative filtering on three datasets.
3.5 Component Quantity Analysis
Fig. 3. Impact of the number of components M on three datasets.
We vary the number of components M from 1 to 5 while keeping the other parameters unchanged. The experiments are conducted on three datasets, and the results are presented in Fig. 3. The findings show that the model's performance improves as the number of components increases up to M = 3, where the best AUROC and AUPR are achieved, indicating that multiple components are effective for latent feature extraction. However, when the number of components increases further, the performance saturates and then declines.
3.6 Impact of Different Hyperparameter g
Through Eq. (8) and Eq. (13), we weight the two parts of the embedded representations of drugs and proteins, respectively. Here we conduct experiments on the selection of this weight parameter; the results are shown in Fig. 4. We vary the weight parameter g from 0.4 to 1. The results show that the best value of g differs across datasets. For the EN and IC datasets, the performance is optimal when g = 0.8, which shows that aggregating historical interaction information is effective for improving the learning ability of the feature representation. For the GPCR dataset, the best performance is achieved at g = 0.5. Because the GPCR dataset is small and contains less historical interaction information, the weight of the part that integrates the historical information is not especially high on this dataset.
Fig. 4. Impact of weight parameter g on the model on three datasets.
4 Conclusion
In this study, we propose a multi-component aggregation model with collaborative filtering for DTI prediction. Our method uses a multi-component module to extract different latent characteristics of drugs and proteins, and employs an attention-based neighbor aggregator module to improve the accuracy of node representation learning. Our experimental results show that DTI-MACF outperforms other strong models on large-scale datasets. In future research, we plan to explore the use of advanced neural networks such as GCN and GAT to aggregate neighborhood information, which may yield even better feature representations.
Acknowledgements. This work is supported by a grant from the Natural Science Foundation of China (No. 62072070).
References
1. Petta, I., Lievens, S., Libert, C., Tavernier, J., De Bosscher, K.: Modulation of protein–protein interactions for the development of novel therapeutics. Mol. Ther. 24(4), 707–718 (2016)
2. Bagherian, M., Sabeti, E., Wang, K., Sartor, M.A., Nikolovska-Coleska, Z., Najarian, K.: Machine learning approaches and databases for prediction of drug–target interaction: a survey paper. Brief. Bioinform. 22(1), 247–269 (2021)
3. Luo, H., Li, M., Yang, M., Wu, F.X., Li, Y., Wang, J.: Biomedical data and computational models for drug repositioning: a comprehensive review. Brief. Bioinform. 22(2), 1604–1619 (2021)
4. Xue, H., Li, J., Xie, H., Wang, Y.: Review of drug repositioning approaches and resources. Int. J. Biol. Sci. 14(10), 1232 (2018)
5. Zhang, W., Lin, W., Zhang, D., Wang, S., Shi, J., Niu, Y.: Recent advances in the machine learning-based drug-target interaction prediction. Curr. Drug Metab. 20(3), 194–202 (2019)
6. Van Laarhoven, T., Nabuurs, S.B., Marchiori, E.: Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics 27(21), 3036–3043 (2011)
7. Luo, Y., et al.: A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun. 8(1), 1–13 (2017)
8. Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
9. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector machines. IEEE Intell. Syst. Appl. 13(4), 18–28 (1998)
10. Yu, H., et al.: A systematic prediction of multiple drug-target interactions from chemical, genomic, and pharmacological data. PLoS ONE 7(5), e37608 (2012)
11. Lim, S., et al.: A review on compound-protein interaction prediction methods: data, format, representation and model. Comput. Struct. Biotechnol. J. 19, 1541–1556 (2021)
12. Wen, M., et al.: Deep-learning-based drug–target interaction prediction. J. Proteome Res. 16(4), 1401–1409 (2017)
13. Wang, L., et al.: A computational-based method for predicting drug–target interactions by using stacked autoencoder deep neural network. J. Comput. Biol. 25(3), 361–373 (2018)
14. Öztürk, H., Özgür, A., Ozkirimli, E.: Deepdta: deep drug–target binding affinity prediction. Bioinformatics 34(17), i821–i829 (2018)
15. Lee, I., Keum, J., Nam, H.: Deepconv-dti: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Comput. Biol. 15(6), e1007129 (2019)
16. Fu, H., Huang, F., Liu, X., Qiu, Y., Zhang, W.: MVGCN: data integration through multi-view graph convolutional network for predicting links in biomedical bipartite networks. Bioinformatics 38(2), 426–434 (2022)
17. Chu, Z., et al.: Hierarchical graph representation learning for the prediction of drug-target binding affinity. Inf. Sci. 613, 507–523 (2022)
18. Zhang, Q., Zhu, L., Bao, W., Huang, D.S.: Weakly-supervised convolutional neural network architecture for predicting protein-DNA binding. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(2), 679–689 (2018)
19. Zhang, Q., Zhu, L., Huang, D.S.: High-order convolutional neural network architecture for predicting DNA-protein binding sites. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(4), 1184–1192 (2018)
20. Lim, H., et al.: Large-scale off-target identification using fast and accurate dual regularized one-class collaborative filtering and its application to drug repurposing. PLoS Comput. Biol. 12(10), e1005135 (2016)
21. Peng, J., Li, J., Shang, X.: A learning-based method for drug-target interaction prediction based on feature representation learning and deep neural network. BMC Bioinform. 21(13), 1–13 (2020)
22. Davis, M.I., et al.: Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 29(11), 1046–1051 (2011)
23. Liu, T., Lin, Y., Wen, X., Jorissen, R.N., Gilson, M.K.: BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Res. 35(suppl_1), D198–D201 (2007)
24. Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., Kanehisa, M.: Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24(13), i232–i240 (2008)
25. Sajadi, S.Z., Zare Chahooki, M.A., Gharaghani, S., Abbasi, K.: Autodti++: deep unsupervised learning for dti prediction by autoencoders. BMC Bioinform. 22(1), 1–19 (2021)
26. Li, J., Wang, J., Lv, H., Zhang, Z., Wang, Z.: IMCHGAN: inductive matrix completion with heterogeneous graph attention networks for drug-target interactions prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. (2), 19 (2022)
27. Shen, Y., Zhang, Y., Yuan, K., Li, D., Zheng, H.: A knowledge-enhanced multi-view framework for drug-target interaction prediction. IEEE Trans. Big Data 8(5), 1387–1398 (2021)
28. Peng, L., Liao, B., Zhu, W., Li, Z., Li, K.: Predicting drug–target interactions with multi-information fusion. IEEE J. Biomed. Health Inform. 21(2), 561–572 (2015)
29. Tsubaki, M., Tomii, K., Sese, J.: Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 35(2), 309–318 (2019)
30. Huang, K., Xiao, C., Glass, L.M., Sun, J.: Moltrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics 37(6), 830–836 (2021)
31. Zheng, X., Ding, H., Mamitsuka, H., Zhu, S.: Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1025–1033 (2013)
32. Liu, Y., Wu, M., Miao, C., Zhao, P., Li, X.L.: Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLoS Comput. Biol. 12(2), e1004760 (2016)
33. Ding, Y., Tang, J., Guo, F., Zou, Q.: Identification of drug–target interactions via multiple kernel-based triple collaborative matrix factorization. Brief. Bioinform. 23(2) (2022)
34. Hao, M., Bryant, S.H., Wang, Y.: Predicting drug-target interactions by dual-network integrated logistic matrix factorization. Sci. Rep. 7(1), 1–11 (2017)
Intelligent Computing in Drug Design
Drug-Target Interaction Prediction Based on Knowledge Graph and Convolutional Neural Network Integrated with CBAM Module
Zhongyu He
College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei, China
[email protected]
Abstract. Prediction of drug-target interaction (DTI) with modern computing techniques has become a hot research topic, and a wide variety of prediction methods have emerged. This paper introduces a DTI prediction model based on knowledge graph technology, comprising a knowledge graph embedding model (ConvOSFT) and a downstream classifier model (ConvCSS). A large amount of data was extracted from medical databases to construct our knowledge graph. The proposed ConvOSFT (Convolutional Neural Networks for Knowledge Graph Embedding Completion Considering Original Sequential Features of Triples) is used to perform knowledge graph embedding and knowledge completion, and the extracted embeddings are used as the feature representations of entities and relations. The feature representations of drugs and targets are concatenated in one dimension and fed into our proposed ConvCSS (Convolutional Neural Network Model integrated with CBAM module for classifying Short Sentences composed of drug-target pairs) for binary prediction. Experiments are conducted on the gold standard dataset with five-fold cross-validation, and the overall model is evaluated with a series of metrics. Experimental results show that the proposed method can effectively perform knowledge representation and the downstream drug-target interaction prediction task.
Keywords: Drug-target interaction · Knowledge graph · Attention mechanism · Knowledge completion · Convolutional neural network
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 653–665, 2023. https://doi.org/10.1007/978-981-99-4749-2_56

1 Introduction
Drug-target interaction prediction is an important reference for drug discovery. In recent years, many effective prediction methods have emerged in different fields of modern computer technology. In general, there are two common computational approaches to drug-target interaction prediction: simulating molecular structure docking and using machine learning methods. Compared with docking simulation, machine learning methods do not depend on the 3D structure of proteins and can train models and predict new drug-target pairs using publicly available datasets. This has made machine learning methods widely used in the field of drug-target prediction.
Over the past few years, many effective drug-target interaction prediction methods have been proposed, such as FRnet-Predict [4], LpbyCD [5], etc. These methods represent drug-target pairs in different ways, and all propose novel binary classification models for interaction prediction. Feature representation of drug-target pairs and interaction prediction with a binary classification model are therefore the two key steps of machine-learning-based drug-target interaction prediction. The most common way to represent drugs and targets is to use one-dimensional feature sequences, such as SMILES sequences of drugs, amino acid coding sequences of targets, etc. Most of these feature representation methods treat the characteristics of drugs and targets separately. However, such representations have difficulty capturing the influence of the interaction between drugs and targets, and they also have difficulty exploiting the influence of entities other than drugs and targets.
In this paper, we use knowledge graphs for a structured representation of medical data from different publicly available databases. The entities and relations in the knowledge graph are embedded into a low-dimensional dense vector space through knowledge graph embedding to form the feature representations of entities and relations. Once a drug-target pair is abstracted into low-dimensional features, it can be used as input to the downstream interaction classification model. We propose a new DTI prediction model, in which we make innovations in both the knowledge graph embedding model and the binary classification model. The experimental results show that the proposed model is effective for drug-target interaction prediction. The following diagram shows the details of the model's prediction process (Fig. 1).
Fig. 1. This paper presents a detailed diagram of the process for predicting drug-target interactions.
In the figure above, we divide the proposed method into the following stages: the integrative processing of medical data, the construction of knowledge graph, the knowledge graph embedding process and the process of using binary classification model to predict drug-target interactions.
2 Materials and Methods
2.1 Dataset
Yamanishi et al. [3] compiled four types of target proteins into a gold standard dataset, which is used in this paper to train and test the model. The four types of data in the gold standard dataset are Enzyme, G-protein coupled receptors (GPCR), Ion Channels (IC) and Nuclear Receptors (NR) (Table 1).

Table 1. Statistical information for the four gold-standard datasets: the number of targets, drugs, and interaction pairs.

| Dataset | Targets | Drugs | Positive Interactions |
| --- | --- | --- | --- |
| Enzyme | 664 | 445 | 2926 |
| Ion channel | 204 | 210 | 1467 |
| GPCR | 95 | 223 | 635 |
| Nuclear receptor | 26 | 54 | 90 |
2.2 Knowledge Graph
Each major medical database holds a large amount of valuable medical data that is open to everyone. Researchers can extract the data from these databases with suitable tools and apply it to their own research. The larger the amount of data used to construct the knowledge graph, the richer the types of entities and relations, and the more effective the feature representations of entities and relations obtained through knowledge graph embedding. In this paper, data on entities and interactions covering drugs, diseases, genes and other categories were collected from the KEGG [1] and Drugbank [2] databases, comprising 98924 entities, 34 relationships and 456421 triples. The collected data are integrated and processed, and the integrated data is used to construct the knowledge graph. The constructed knowledge graph is relatively large, and it is difficult to display it completely with visualization tools. We therefore selected one drug in the knowledge graph and drew its corresponding subgraph. This subgraph is centered on the drug, does not distinguish between the categories of different relations, and shows all entities related to the drug at the center of the graph. To distinguish different types of entities, we render them in different colors, and finally visualize the subgraph with a visualization tool. The following figure shows the corresponding subgraph of the drug with drug number D00537 in the knowledge graph (Fig. 2).
Fig. 2. The corresponding subgraph in the knowledge graph for the drug with drug number D00537.
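The drug-centered subgraph described above can be obtained from a list of triples with a few lines of code. The sketch below is illustrative only: apart from the drug number D00537 mentioned in the text, the triples, entity identifiers, and relation names are placeholders and not entries taken from KEGG or DrugBank.

```python
from collections import defaultdict

# Hypothetical (head, relation, tail) triples; only D00537 comes from the text.
triples = [
    ("D00537", "targets", "protein:P1"),
    ("D00537", "treats", "disease:X1"),
    ("gene:G1", "encodes", "protein:P1"),
    ("drug:A", "targets", "protein:P2"),
]

def build_adjacency(triples):
    """Undirected adjacency that ignores the relation type."""
    adj = defaultdict(set)
    for head, _relation, tail in triples:
        adj[head].add(tail)
        adj[tail].add(head)
    return adj

def ego_subgraph(adj, center):
    """All entities directly connected to `center`, regardless of relation."""
    return sorted(adj[center])

neighbors = ego_subgraph(build_adjacency(triples), "D00537")
```

The prefix of each identifier (for example `protein:` or `disease:`) could then be used to pick a color per entity type when drawing the subgraph.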
2.3 Knowledge Graph Embedding
A knowledge graph contains abundant entity and relation data, but these data cannot be directly applied to classification, so entities and relations need to be converted into feature representations by knowledge graph embedding. We draw on recent knowledge graph embedding models that use convolutional neural networks for feature extraction, such as ConvE [6], ConvKB [7], RotatE [8], etc., and propose an improved embedding model.
We believe that knowledge graph embedding through convolutional neural networks can capture deeper features. There are already many models that embed knowledge graphs with convolutional neural networks, and their authors define the scoring rules of the embedding process in different ways. For example, the ConvKB [7] model concatenates the (h, r, t) triple in the horizontal direction, and then uses three 1 × 3 convolution kernels for the convolution operation. Each 1 × 3 convolution kernel focuses on the overall characteristics of the triple's values in the same dimension, and finally the triple is scored by a dot product operation. The CapsE [8] model takes the ConvKB [7] model as its first stage and further extracts features from the resulting feature maps with a capsule network from the field of computer vision, which also achieves good results. Building on the basic idea of the ConvKB [7] model, we propose a new approach. We also score the triple as a whole, but no longer consider the attribute values of the triple in the same dimension. Instead, after pre-embedding, the feature vectors of the triple are reshaped, passed through dropout, and concatenated in the vertical direction. We consider the order of the triple (h, r, t) to be a global feature. Next, we refine the feature extraction with two rounds of convolution, and finally calculate the triple score by a dot product operation. The following figure shows the triple scoring rule of the proposed embedding model (Fig. 3).
Drug-Target Interaction Prediction Based on Knowledge Graph
Fig. 3. Schematic diagram of the proposed embedding model based on convolutional neural network to score triples.
In the above figure, to simplify the example, k denotes the dimension of each pre-embedding vector of a triple. Before the convolution operation, each embedding vector of the triple is reshaped into a (k/2) ∗ 2 matrix, passed through dropout, and then concatenated with the others in the vertical direction. The first round of convolution is carried out by four 1 ∗ 2 convolution kernels, and the intermediate features are merged for the second round of convolution. The second round transforms the intermediate features into a single-column feature so that the fully connected layer can compute the triple score by a dot product. During training, the scoring function of each triple is defined by the following formula.
$$f(h, r, t) = \big(\mathrm{concat}\big(g(\mathrm{concat}(v_h, v_r, v_t) * \omega)\big) * \phi\big) \cdot w \qquad (1)$$
In the formula, (vh, vr, vt) is the feature representation of the triple, obtained through the first layer of knowledge graph embedding. A classical embedding model (such as the Trans series) can be pre-trained to provide this first-layer representation. Each feature vector of the triple is first reshaped and passed through dropout to obtain (vh, vr, vt). ω denotes the parameters of the first convolution operation, and ϕ denotes the parameters of the second convolution operation. g is an activation function (such as ReLU), and w denotes the weights of the final dot product that yields the triple score.
2.4 Classification Model
Many classification models have been developed in different fields of machine learning, each addressing problems in its own field, such as ResNet [12], which is widely used in computer vision, and Word2Vec [10], which is used in text classification. In this paper, we propose a binary classification model based on the basic idea of text analysis combined with an attention mechanism. While designing the model, we came across TextCNN [11], which uses convolutional neural networks for text classification. TextCNN [11] treats the text as a “sentence” consisting of several “words”, where each “word” has its own feature representation. Based on the
Z. He
basic idea of this model, we propose that a drug-target pair can also be regarded as a “sentence” with a strict semantic order, in which the drug and the target are the two “words”. However, with only two words it is difficult to use multiple convolution kernels of different sizes to mine deep features, so we reshape this “sentence” into five words, that is, the features of the drug and the target are divided in half and then spliced again. After the convolution layer, each convolution kernel has been convolved with the “sentence” to generate the corresponding multi-channel features. TextCNN [11] applies max pooling to the intermediate results after convolution and reduces the dimension through pooling. We believe that direct max pooling may ignore differences in the importance of the feature information carried by different channels and lose part of that information. Therefore, we introduce an attention mechanism from the field of computer vision. By constructing a CBAM [13] module, we introduce an attention module that includes both channel attention and spatial attention. We connect this module to the convolutional layer and assign attention weights to the intermediate results obtained from the convolution kernels of different sizes, so that different channels, and different attribute values within each channel, are rescaled according to their attention weights. When different parts of the intermediate result receive different attention weights, the proportion of effective feature information increases. Average pooling is then applied to the intermediate results, and the pooled results are concatenated. These changes alleviate, to a certain extent, the loss of feature information caused by direct max pooling. Finally, the binary classification score is obtained through a fully connected layer.
The following figure shows the working principle of the classification model in detail (Fig. 4).
Fig. 4. Schematic diagram of the proposed classification model based on convolutional neural network with attention mechanism.
In the schematic diagram, to simplify the example, we use convolution kernels of three different sizes (2, 3 and 4), with two kernels of each size. The scoring principle of the proposed classifier is defined precisely by the following formula.
$$\mathrm{Score} = \mathrm{cat}\big(\mathrm{avgPool}\big(\mathrm{CBAM}\big(\mathrm{reshape}(\mathrm{cat}(v_{drug}, v_{target})) * \phi\big)\big)\big) \cdot w \qquad (2)$$
In the above formula, CBAM stands for CBAM attention module, which is composed of channel attention mechanism and spatial attention mechanism. ϕ represents the convolution operation using multiple kernels of different sizes, and w corresponds to the dot product operation performed on the fully connected layer to obtain the binary classification score.
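The reshape step in formula (2) can be illustrated with a minimal sketch. It assumes that the drug and target vectors are simply concatenated and cut into five equal rows, and that kernels slide with stride 1 and no padding; the paper does not state these details, and the helper names `to_words` and `window_count` are hypothetical.

```python
def to_words(v_drug, v_target, n_words=5):
    """Concatenate two feature vectors and cut the result into n_words equal rows."""
    merged = list(v_drug) + list(v_target)
    if len(merged) % n_words != 0:
        raise ValueError("combined length must be divisible by n_words")
    width = len(merged) // n_words
    return [merged[i * width:(i + 1) * width] for i in range(n_words)]


def window_count(n_words, kernel_height):
    """Positions available to a kernel of this height (assumed stride 1, no padding)."""
    return n_words - kernel_height + 1


# Toy vectors: 10 + 10 values become a "sentence" of 5 "words", each of width 4.
sentence = to_words([0.1] * 10, [0.2] * 10)

# Kernel heights 2, 3 and 4 produce feature maps of different lengths,
# which is why pooling is needed before the results are concatenated.
counts = {h: window_count(len(sentence), h) for h in (2, 3, 4)}
```

With five words, a kernel of height 4 only has two valid positions, which shows how tightly the number of words bounds the usable kernel sizes.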
3 Experimental Results
3.1 Evaluation Criteria
This paper uses the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUC) to evaluate performance. The horizontal axis of the ROC curve is the FPR (False Positive Rate), and the vertical axis is the TPR (True Positive Rate).
$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (3)$$
$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (4)$$
AUPR is the area under the PR curve, whose horizontal axis is Recall and whose vertical axis is Precision.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (5)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (6)$$
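As a worked example of Eqs. (3) to (6), the sketch below computes the four rates from confusion-matrix counts. The function name `rates` and the counts are illustrative only and are not taken from the paper; the denominators are assumed to be non-zero.

```python
def rates(tp, fp, tn, fn):
    """Return FPR, TPR, Recall and Precision from confusion-matrix counts."""
    return {
        "FPR": fp / (fp + tn),        # Eq. (3)
        "TPR": tp / (tp + fn),        # Eq. (4)
        "Recall": tp / (tp + fn),     # Eq. (5), identical to TPR
        "Precision": tp / (tp + fp),  # Eq. (6)
    }


# Illustrative counts: 10 positives (8 found), 90 negatives (2 false alarms).
m = rates(tp=8, fp=2, tn=88, fn=2)
```

Sweeping the decision threshold and recomputing these rates gives the points of the ROC and PR curves whose areas are AUC and AUPR.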
We use five-fold cross validation to evaluate our model and compare it with other models in terms of AUC and AUPR. In addition, MR (Mean Rank), MRR (Mean Reciprocal Rank) and Hits@10 are used to evaluate the link prediction performance of the embedding model, and F1-Score is used when tuning the parameters of the classification model.
3.2 Classification Model Parameter Adjustment and Effect Comparison
The classification model in this paper uses multiple convolution kernels of different sizes for feature extraction. To find a number of convolution kernels that improves the performance of the model, we conducted a comparison experiment. To control the experimental variables, we kept the kernel sizes fixed and changed only the number of kernels, ran five-fold cross validation on the gold standard datasets, and compared the results obtained with each setting. The numbers of convolution kernels compared were 2, 128, 256 and 1024. When the number of kernels is within the range [2, 256], both AUC and AUPR increase. When the number of kernels is raised to 1024, the complexity of the model grows greatly, and the result is not as good as that obtained with 128 kernels. We therefore believe that the optimal number of convolution kernels lies in the interval [256, 1024]. However, with 256 or more kernels the model is already very complex, and the performance gain is smaller than at 128 kernels. Based on these experiments, we finally set the number of convolution kernels to 256. The following figure shows the changes of the two indicators under five-fold cross validation as the number of convolution kernels is adjusted (Fig. 5).
Fig. 5. Line plot of the changes of auROC and auPR indices during the adjustment of convolution kernel.
Many options for translating drug-target pairs into “sentences” were considered, for example dividing a drug-target pair into 5, 10 or even 20 “words”. The central idea of these schemes is the same: the combined features of the drug-target pair are divided evenly into multiple “words”, so that convolution can be performed with kernels of different sizes. However, each option requires different parameter settings for the model to work properly. For example, in this paper we divide the pair into 5 “words”, which means that the maximum kernel height is 5 and that there are 10 possible combinations of three different kernel sizes, which makes exhaustive experimental comparison tedious. To explore whether increasing the number of divisions improves the performance of the model, we divided drug-target pairs into 4, 5, 10 and 20 “words” and ran comparative experiments with the kernel sizes and number of kernels selected in this paper. The following figure shows the results of these partitions on the same dataset (Nuclear) with the same kernel sizes and number of kernels (Fig. 6). To control the experimental variables, the three kernel sizes used under each partition are 2, 3 and 4. The results show that the partitioning scheme makes little difference to the final classification performance. But as the number of split “words” increases, there are more possible combinations of sizes for the three kernels. The larger the number of “words”, the more difficult it is to determine the optimal number of partitions by experimental comparison. Considering the time cost of the experiments, we prefer a relatively small number of divisions, which simplifies experimental verification and saves experimental cost. Therefore, in this paper, we partition drug-target pairs into five “words”.
Fig. 6. Bar chart of F1-Score values obtained from experiments performed under the same conditions for different number of partitions.
3.3 Experimental Results and Comparison
The gold standard dataset was used to train and test our model. Classification performance was evaluated by five-fold cross validation, with the area under the ROC curve (AUC) and the area under the PR curve (AUPR) as indicators, and the results were compared with a series of published drug-target interaction prediction models evaluated on the same dataset. The AUC values of our model on the Enzyme, Ion Channel, GPCR and Nuclear datasets were 0.9670, 0.9717, 0.9567 and 0.9307, respectively, and the AUPR values were 0.81, 0.82, 0.74 and 0.71, respectively. On the Enzyme dataset, the AUC of the proposed model is slightly lower than that of FRnet-Predict [4], but its AUPR is the best. On the Ion Channel dataset, the AUC is slightly lower than that of DeepMPF [17], but the AUPR is again the best. On the GPCR dataset, the proposed model performs best in both AUC and AUPR. On the Nuclear dataset, the AUC of the proposed model is better than that of the other models, but its AUPR is slightly lower than that of CFSBoost [14] and FRnet-Predict [4].
A five-fold cross validation experiment was conducted on each of the four datasets, and an ROC curve was drawn from the results of each fold. Panels A, B, C and D in the following figure correspond to the five-fold cross validation results of the proposed model on the Nuclear, GPCR, Ion Channel and Enzyme datasets, respectively (Fig. 7).
Fig. 7. ROC curves of the proposed model obtained by five-fold cross validation on the gold standard dataset (Table 2).
In Table 2, the best result under each indicator is in bold and the second-best result is underlined. Next, the link prediction performance of the knowledge graph embedding part of the model was evaluated separately. The FB15K-237 dataset was used, and the resulting metric values were compared with those of published knowledge graph embedding models. On FB15K-237, the proposed embedding model obtains an MR of 221 (reported as an integer), an MRR of 0.344 and a Hits@10 of 0.537. Because these results were not available for the TransE [9] model, we ran open-source third-party code on this dataset and recorded the relevant indicators. Our embedding model performs better than ConvE [6] and ConvKB [7] on MR. On MRR it is better than ConvE [6] and RotatE [18] but lower than ConvKB [7] (0.344 versus 0.396). On Hits@10 it is better than ConvKB [7] and ConvE [6] and slightly better than RotatE [18], with only a small gap between our model and RotatE (Table 3). In Table 3, the best result under each indicator is in bold and the second-best result is underlined.
Table 2. AUC and AUPR scores of the proposed model and other models by five-fold cross validation under the gold standard dataset.
Dataset        Method               AUC       AUPR
Enzyme         LpbyCD [5]           0.96      0.71
               CFSBoost [14]        0.9563    0.68
               FRnet-Predict [4]    0.9754    0.70
               MSPEDTI [16]         0.9437    –
               MEBoost [15]         0.9404    0.41
               DeepMPF [17]         0.9645    –
               Our Method           0.9670    0.81
Ion Channel    LpbyCD [5]           0.97      0.78
               CFSBoost [14]        0.9377    0.50
               FRnet-Predict [4]    0.9478    0.49
               MSPEDTI [16]         0.9088    –
               MEBoost [15]         0.928     0.329
               DeepMPF [17]         0.9762    –
               Our Method           0.9717    0.82
GPCR           LpbyCD [5]           0.89      0.49
               CFSBoost [14]        0.9278    0.54
               FRnet-Predict [4]    0.9512    0.69
               MSPEDTI [16]         0.8802    –
               MEBoost [15]         0.9075    0.46
               DeepMPF [17]         0.8781    –
               Our Method           0.9567    0.74
Nuclear        LpbyCD [5]           0.82      0.45
               CFSBoost [14]        0.8147    0.73
               FRnet-Predict [4]    0.9241    0.73
               MSPEDTI [16]         0.8663    –
               MEBoost [15]         0.9165    0.23
               DeepMPF [17]         0.8271    –
               Our Method           0.9307    0.71
Table 3. Experimental results of all models on FB15k-237 under the MR, MRR and Hits@10 indicators.
Method                      MR     MRR      Hits@10
TransE [9] (our results)    322    0.297    0.445
ConvE [6]                   244    0.325    0.501
ConvKB [7]                  257    0.396    0.517
RotatE [18]                 –      0.338    0.533
Our Model (ConvOSFT)        221    0.344    0.537
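The three link prediction indicators in Table 3 are all functions of the rank that the model assigns to the correct entity for each test triple. The sketch below uses the standard definitions of MR, MRR and Hits@k; the function name and the example ranks are illustrative and are not taken from the paper's experiments.

```python
def link_prediction_metrics(ranks, k=10):
    """Compute MR, MRR and Hits@k from the 1-based ranks of the correct entities."""
    n = len(ranks)
    mr = sum(ranks) / n                        # Mean Rank: lower is better
    mrr = sum(1.0 / r for r in ranks) / n      # Mean Reciprocal Rank: higher is better
    hits = sum(1 for r in ranks if r <= k) / n  # share of ranks within the top k
    return mr, mrr, hits


# Four illustrative test triples whose correct entities were ranked 1, 2, 5 and 20.
mr, mrr, hits10 = link_prediction_metrics([1, 2, 5, 20])
```

A single badly ranked triple moves MR far more than MRR, which is one reason a model can lead on MR while trailing on MRR, as in Table 3.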
4 Conclusion
In this paper, we propose a drug-target interaction prediction model based on knowledge graph technology. We improve both the knowledge graph embedding process and the downstream classification model. Drawing on the literature, we propose a knowledge graph embedding model based on convolutional neural networks and a convolutional neural network classification model with an attention mechanism. We construct knowledge graphs from integrated medical data, and the feature representation of each entity and relation is obtained by the embedding process. We transform the input drug-target pair into feature vectors, concatenate and reshape them, and treat them as a sentence with five words, which is fed into the proposed classification model for prediction. In the experiments, we conducted five-fold cross validation on the gold standard dataset and evaluated the performance of our model with a series of widely used indicators. The results indicate that our model can contribute to drug-target interaction prediction.
References
1. Kanehisa, M., Miho, F.: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, 353–361 (2017)
2. Wishart, D.S., Knox, C.: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, 668–672 (2006)
3. Yamanishi, Y., et al.: Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24(13), i232–i240 (2008)
4. Rayhan, F., et al.: FRnet-DTI: deep convolutional neural network for drug-target interaction prediction. Heliyon 6(3), e03444 (2020)
5. Koptelov, M., Zimmermann, A., Crémilleux, B., Soualmia, L.F.: LPbyCD: a new scalable and interpretable approach for link prediction via community detection in bipartite networks. Appl. Netw. Sci. 6(1), 1–39 (2021). https://doi.org/10.1007/s41109-021-00415-1
6. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S., et al.: Convolutional 2D knowledge graph embeddings. In: Proceedings of the AAAI Conference on Artificial Intelligence 32(1) (2018)
7. Dai, Q.N., et al.: A novel embedding model for knowledge base completion based on convolutional neural network (2018)
8. Nguyen, D.Q., et al.: A capsule network-based embedding model for search personalization (2018)
9. Bordes, A., et al.: Translating embeddings for modeling multi-relational data. In: Neural Information Processing Systems. Curran Associates Inc. (2013)
10. Ayyadevara, V.K.: Word2vec. In: Enter: Specialized Machine Learning Algorithms. UC Berkeley Apress (2018). https://doi.org/10.1007/978-1-4842-3564-5_8
11. Kim, Y., et al.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014)
12. He, K., et al.: Deep Residual Learning for Image Recognition. IEEE (2016)
13. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_1
14. Rayhan, F., et al.: CFSBoost: cumulative feature subspace boosting for drug-target interaction prediction. J. Theor. Biol. 464, 1–8 (2019)
15. Brown, B., Weaver, T., Wolfson, J.: MEBoost: variable selection in the presence of measurement error. Stat. Med. (2019)
16. Wang, L., et al.: MSPEDTI: prediction of drug–target interactions via molecular structure with protein evolutionary information. In: Biology 2022, vol. 11, p. 740 (2022)
17. Ren, Z.H., You, Z.H., Zou, Q., et al.: DeepMPF: deep learning framework for predicting drug–target interactions based on multi-modal representation with meta-path semantic analysis. J. Transl. Med. 21, 48 (2023)
18. Sun, Z., et al.: RotatE: knowledge graph embedding by relational rotation in complex space (2019)
Deep Learning-Based Prediction of Drug-Target Binding Affinities by Incorporating Local Structure of Protein Runhua Zhang1 , Baozhong Zhu1 , Tengsheng Jiang2 , Zhiming Cui1 , and Hongjie Wu1(B) 1 School of Electronic and Information Engineering, Suzhou University of Science and
Technology, Suzhou 215009, China [email protected] 2 Gusu School, Nanjing Medical University, Suzhou, Jiangsu, China
Abstract. Traditional drug discovery methods are both time-consuming and expensive. Utilizing artificial intelligence to predict drug-target binding affinity (DTA) has become an essential approach for accelerating new drug discovery. While many deep learning methods have been developed for DTA prediction, most of them only consider the primary sequence structure of proteins. However, drug-target interactions occur only in specific regions of the protein, and the primary structure can only represent the global protein features, which fails to fully disclose the relationship between the drug and its target. In this study, we use both the primary and secondary protein structures to represent the protein. The primary structure serves as the global feature, and the secondary structure as the local feature. We use convolutional neural networks (CNNs) and graph neural networks (GNNs) to model proteins and drugs separately, which helps to better capture the interactions between drugs and targets. As a result, our method demonstrates improved performance in predicting DTA compared with the latest methods on two benchmark datasets.
Keywords: drug-target affinity · global and local feature · CNN · GNN
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 666–675, 2023. https://doi.org/10.1007/978-981-99-4749-2_57
1 Introduction
Developing a new drug that can be brought to market costs approximately $2.6 billion, and the approval rate of new drugs that enter clinical trials is less than 12% [1, 2]. Moreover, developing a new drug requires a significant amount of time [3]. Therefore, computer-aided drug development has become a hot research topic in recent years [4]. Accurately identifying drug-target interactions is an essential step in the computational stages of drug development [5]. Currently, there are mainly two categories of computational methods used for predicting drug-target interactions. The first type treats interaction prediction as a binary classification task [6], that is, determining whether a drug and a target interact or not. The other type treats it as a regression task for predicting the binding affinity between the drug and the target. Binding affinity can measure the
Deep Learning-Based Prediction of Drug-Target Binding Affinities
strength of drug-target interactions, and is usually expressed using the inhibition constant (Ki), the dissociation constant (Kd), or the half maximal inhibitory concentration (IC50) [7]. Our method focuses mainly on predicting drug-target binding affinity (DTA).
There are several computational methods for predicting DTA. One approach is the ligand-based method, which compares a query ligand to known ligands based on its target protein. However, if the number of known ligands for the target protein is insufficient [8], the predictions may be unreliable [9]. Another approach is molecular docking [10], which models the binding of compounds and proteins in conformational space based on their 3D structures. However, preparing 3D protein-ligand complexes can be quite challenging [11].
Predicting DTA using computational methods typically involves three main steps. First, drug and target protein data are converted into computationally ready vectors or graphs using various encoding methods [12]. The commonly used representations of drugs include the simplified molecular-input line-entry system (SMILES) [13], molecular fingerprints and graphs. Proteins are usually represented using one-hot encoding of their primary sequences. Second, different feature extraction methods are applied to obtain representative features of drugs and proteins, which then replace their original input features. Finally, a regression process combines the respective representations and predicts binding affinities.
In recent years, deep learning (DL) has made significant progress in the field of computer-aided drug design [14], particularly in the prediction of DTA. Many DL-based methods have been developed to improve DTA prediction performance.
One of the earliest DL-based DTA prediction models, DeepDTA [15], uses one-dimensional (1D) convolutional neural networks (CNN) to extract sequence features of drugs and proteins. It takes the protein primary sequence and the SMILES string of the drug ligand as input, without incorporating any additional input information. WideDTA [16] improves prediction performance by incorporating protein domain information. However, expressing drugs as SMILES strings leads to the loss of their original graph structure, motivating the use of graph neural networks (GNN). GraphDTA [17] represents drugs as graphs, using multiple GNN variants such as the graph convolutional network (GCN) [18], the graph attention network (GAT) [19], and the graph isomorphism network (GIN) [20], while retaining a CNN to represent proteins. This model outperformed existing 1D methods, highlighting the importance of structural information. However, these models only consider the overall interaction between drugs and proteins. MGraphDTA [21] introduces dense connections into the GNN and builds an ultra-deep network structure consisting of 27 layers of GCN. This architecture enables the simultaneous capture of local and global structures of compounds, improving the prediction performance of DTA. Additionally, MGraphDTA proposes a new visualization method to better understand the role of GNN in DTA prediction. DeepAffinity [22] introduces an attention mechanism to learn the binding site information between compounds and proteins, improving model interpretability. These approaches have demonstrated the success of using CNNs for feature extraction from protein sequences. GraphDTA, on the other hand, uses a graph structure to represent drugs and applies GCN for feature extraction, leading to improved prediction performance. This indicates that graph structures can be effectively utilized in DTA prediction. The above methods mainly use the primary structure of the protein,
R. Zhang et al.
that is, the amino acid sequence, as input, and can only extract the global features of the protein, ignoring the local features of individual protein segments. In this paper, we propose a novel deep learning-based method for predicting DTA that integrates both global and local features of proteins. The model comprises three distinct modules: the global protein features module, the local protein features module, and the ligand module. The protein data is one-dimensional and consists of the amino acid sequence and the secondary structure of the protein, while the drug ligand is represented as graph data. We use CNNs to learn the representations of the protein primary and secondary sequences, employ GAT and GCN to learn the graph representation of drugs, and finally concatenate the features obtained from the convolution and max pooling layers of the three modules and feed them into the classification component.
2 Data Set and Data Representation
2.1 Data Set
We evaluate our model on two public datasets, the Davis [23] dataset and the KIBA [7] dataset. Information about both datasets can be found at https://tdcommons.ai/multi_pred_tasks/dti/. Most previous drug-target affinity prediction methods treat these two datasets as benchmarks. Notably, protein secondary structure information is not available in these datasets, so we predict the secondary structure with the MLRC [24] method from the NPS [25] webpage (https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_mlrc.html) and incorporate this information into our model. The Davis dataset contains selectivity assays for kinase protein families and related inhibitors together with their respective Kd values, which range from 5.0 to 10.8, and covers interactions between 442 proteins and 68 ligands. The KIBA dataset uses the KIBA score to represent affinity, which is calculated from three different inhibitor efficacy indicators (Kd, Ki and IC50), and covers interactions between 229 proteins and 2111 ligands.
Table 1. Datasets used in the model.
Datasets        Davis    KIBA
Proteins        442      229
Ligands         68       2111
Interactions    30056    118254
Training        25046    98545
Testing         5010     19709
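The Training and Testing counts in Table 1 are consistent with holding out one of six nearly equal parts as the test set. The sketch below is one way to produce such a split; the function name, the random seed, and the choice to give the leftover samples to the first parts are assumptions, since the paper does not describe the exact procedure.

```python
import random


def six_way_split(n_samples, seed=0):
    """Shuffle sample indices and cut them into six nearly equal parts.

    The first part is returned as the independent test set and the other
    five as cross-validation folds (assumed layout, not the authors' code).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, 6)
    parts, start = [], 0
    for i in range(6):
        size = base + (1 if i < extra else 0)  # the first `extra` parts get one more
        parts.append(indices[start:start + size])
        start += size
    return parts[0], parts[1:]


davis_test, davis_folds = six_way_split(30056)
kiba_test, kiba_folds = six_way_split(118254)
```

Under these assumptions the Davis split yields 5010 test and 25046 training samples, and the KIBA split yields 19709 and 98545, matching Table 1.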
Because protein sequence lengths vary across the two datasets, with the longest being 4128, we set the maximum length of the protein primary sequence and secondary structure to 1000, padding shorter sequences with zeros and trimming the longer
sequences before feeding them into our neural network for training. We randomly divided each dataset into six equal parts, with one part serving as an independent test set and the remaining five used for cross-validation. To tune the hyperparameters, we conduct five-fold cross-validation on these five parts. Once the hyperparameters are optimized, we train the model on all five parts and assess its performance on the separate test set. Table 1 shows the detailed usage of the dataset.
2.2 Data Representation
2.2.1 Global Protein Features Representation
Protein primary sequences, which are composed of amino acids, are used to represent global features. Protein sequences typically consist of 20 different amino acids, and previous studies have used one-hot encoding to represent proteins, as well as other biological sequences such as DNA and RNA [26]. While graphs can also be used to represent proteins, this is challenging because accurate tertiary structures are difficult to obtain [11]. Therefore, we adopted the one-hot encoding approach to represent proteins: each amino acid corresponds to a unique encoding vector, with only one element equal to 1 and all others equal to 0. We utilize a 20D one-hot encoding scheme for the 20 different types of protein sequence residues. Each amino acid type is assigned a unique integer code based on its associated letter notation (e.g., alanine (A) is 1, cysteine (C) is 3, aspartic acid (D) is 4, etc.), which enables proteins to be represented as sequences of integers.
2.2.2 Local Protein Features Representation
The linear sequence of amino acids in a protein can provide information about its overall structure, but it cannot capture the local patterns present in the protein structure.
The secondary structure of a protein is considered a fundamental unit in the spatial organization of proteins and typically includes eight categories: α-helix (H), isolated β-bridge residues (B), extended chains participating in β ladders (E), turn (T), 3₁₀ helix (G), π helix (I), bend (S), and coil (C) [27]. These secondary structures can be viewed as local patterns in protein structures and can be used to characterize local features of proteins. In our approach, we use an 8D one-hot vector to represent the secondary structure, which captures these local features of the protein.
2.2.3 Ligand Representation
In these two datasets, drugs are represented using SMILES, an ASCII string that describes the chemical structure of a drug molecule as a one-dimensional sequence. From this representation, the chemical properties of atoms and their arrangement can be obtained. However, in our method, we represent drug compounds as graphs of interactions between atoms. To preprocess SMILES, we use RDKit to convert them into a graph format with vertex (or node) features and an adjacency matrix, as shown in Fig. 1. Each node in the graph is a multidimensional binary feature vector that expresses 5 pieces of information: the atom symbol, the number of adjacent atoms, the number of
Fig. 1. SMILES and molecular graph
adjacent hydrogen atoms, the implicit valence of the atom, and whether the atom is in an aromatic structure.
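The protein preprocessing described in Sect. 2 (integer codes for residues, an 8D one-hot vector for secondary structure, and padding or trimming to length 1000) can be sketched as follows. The letter ordering in `AMINO_ALPHABET` is an assumption chosen so that A=1, C=3 and D=4 as in the text, and all names here are hypothetical; the authors' actual vocabulary is not given.

```python
# Assumed ordering: it reproduces the quoted codes A=1, C=3, D=4.
AMINO_ALPHABET = "ABCDEFGHIKLMNOPQRSTUVWXYZ"
AA_TO_INT = {ch: i + 1 for i, ch in enumerate(AMINO_ALPHABET)}  # 0 is kept for padding

SS_CLASSES = "HBETGISC"  # the eight secondary structure categories listed above
MAX_LEN = 1000           # maximum sequence length stated in the text


def encode_sequence(seq, max_len=MAX_LEN):
    """Map residues to integer codes, trim to max_len, and pad with zeros."""
    codes = [AA_TO_INT.get(ch, 0) for ch in seq[:max_len]]  # unknown letters become 0
    return codes + [0] * (max_len - len(codes))


def encode_secondary(ss, max_len=MAX_LEN):
    """One-hot encode a secondary structure string into max_len rows of 8 values."""
    rows = []
    for ch in ss[:max_len]:
        row = [0] * len(SS_CLASSES)
        row[SS_CLASSES.index(ch)] = 1  # raises ValueError for an unknown label
        rows.append(row)
    rows += [[0] * len(SS_CLASSES) for _ in range(max_len - len(rows))]
    return rows


primary = encode_sequence("ACD")
secondary = encode_secondary("HC")
```

Both outputs have the same fixed length, so the two CNN branches can consume them in matching batches.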
3 Method
3.1 Proposed Model
Our study treats drug-target interaction prediction as a regression task that predicts specific binding affinities. The proposed model architecture is illustrated in Fig. 2 and consists of three functional modules. Two CNN modules learn the global and local features of the protein, while GAT and GCN learn the ligand features. Each CNN module comprises three convolutional layers serving as feature extractors, with the number of filters increasing from layer to layer: the second layer is twice the size of the first layer, and the third layer is three times the size of the first layer. The convolutional layers are followed by max pooling layers. In DTA prediction, it is crucial to understand the interaction of each node with its neighbors, and GCN can capture the connectivity between graph nodes to produce essential feature representations. Unlike GCN, GAT uses an attention-based architecture that learns hidden representations of the nodes in the graph through a self-attention mechanism. In the ligand feature learning module, we first use a single GAT layer and then three GCN layers to extract ligand features. Finally, the outputs of the max pooling layers of the three modules are concatenated and fed into the classification part, which comprises three fully connected (FC) layers. The first two FC layers have 1024 nodes each, and each is followed by a dropout layer with a rate of 0.2. Dropout is a regularization technique [28] that helps to prevent overfitting by randomly setting the activations of some neurons to 0. The third layer contains 512 nodes and is followed by the output layer. In all experiments, we trained the model for 1000 epochs with a batch size of 512 and a default learning rate of 0.0005.
In summary, our model combines both protein local features and global features to extract richer interaction information. We also represent drug structures as graphs to better extract drug features. This approach allows us to better predict drug-target binding affinity.
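The dropout step used after the first two FC layers in Sect. 3.1 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dropout`, the flat-list input, and the "inverted" rescaling of surviving activations by 1 / (1 − rate) are assumptions made for illustration; only the rate of 0.2 comes from the text.

```python
import random


def dropout(activations, rate=0.2, training=True, rng=None):
    """Inverted dropout over a flat list of activations (illustrative only).

    Each activation is zeroed with probability `rate`. Survivors are scaled
    by 1 / (1 - rate) so the expected value stays the same; this rescaling
    is an assumed convention, not something stated in the paper.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # dropout is switched off at inference time
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


if __name__ == "__main__":
    hidden = [1.0] * 10  # hypothetical FC-layer activations
    print(dropout(hidden, rate=0.2, rng=random.Random(0)))
```

With a rate of 0.2, roughly one activation in five is zeroed on each training pass, which prevents the network from relying on any single neuron.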
Fig. 2. Framework of our model.
3.2 Metrics
During the experiment, we used the Adam optimization algorithm [29] to train the network, with the Rectified Linear Unit (ReLU) [30] as the activation function. This choice reduces the training time and helps prevent overfitting. We approached DTA prediction as a regression task and used the Mean Squared Error (MSE) as the loss function and to evaluate the performance of the model. The goal is to minimize the MSE, which is the average of the squared differences between the predicted and true values:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(P_i - Y_i\right)^2 \tag{1}$$

where $n$ is the sample size, $P_i$ denotes the prediction, and $Y_i$ is the actual output for the $i$-th sample.
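Eq. (1) can be written directly in code. The following is a minimal sketch using only the standard library; the function name `mean_squared_error` and the list-based inputs are illustrative choices, not part of the original work.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions P_i and targets Y_i."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / n


if __name__ == "__main__":
    # Hypothetical affinities, for illustration only.
    print(mean_squared_error([5.0, 6.5, 7.2], [5.1, 6.0, 7.0]))
```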
Another metric we used is the Concordance Index (CI), which measures the ability of the model to correctly rank the predicted binding affinities. Specifically, CI is the fraction of pairs of drug-target interactions whose predicted affinity ranking matches the true ranking:

$$\mathrm{CI} = \frac{1}{Z}\sum_{\delta_i > \delta_j} h\left(b_i - b_j\right) \tag{2}$$

where $b_i$ is the predicted value for the larger affinity $\delta_i$, $b_j$ is the predicted value for the smaller affinity $\delta_j$, and $Z$ is the normalization constant.
Here, $h(x)$ is the step function, which takes the value 1.0, 0.5, or 0.0 depending on whether $x$ is greater than, equal to, or less than 0:

$$h(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0.5, & \text{if } x = 0 \\ 0, & \text{if } x < 0 \end{cases} \tag{3}$$

Concordance index values of interest range from 0.5 to 1.0, with 0.5 corresponding to a random predictor and 1.0 to a perfect ranking of the test data.
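The concordance index of Eqs. (2) and (3) can be sketched as a double loop over sample pairs. This is a straightforward O(n²) illustration; the function names and the choice to take Z as the number of pairs with different true affinities are assumptions consistent with the definition above, not the authors' code.

```python
def step(x):
    """Step function h(x) from Eq. (3)."""
    if x > 0:
        return 1.0
    if x == 0:
        return 0.5
    return 0.0


def concordance_index(true_affinity, predicted):
    """Fraction of pairs whose predicted order matches the true order (Eq. 2)."""
    total = 0.0
    z = 0  # normalization constant: number of pairs with delta_i > delta_j
    for i, delta_i in enumerate(true_affinity):
        for j, delta_j in enumerate(true_affinity):
            if delta_i > delta_j:
                z += 1
                total += step(predicted[i] - predicted[j])
    if z == 0:
        raise ValueError("need at least one pair with different true affinities")
    return total / z


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(concordance_index([5.0, 6.0, 7.0], [5.2, 5.9, 6.1]))
```

A predictor that assigns the same value to every sample scores 0.5 under this definition, because every pair contributes h(0) = 0.5.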
4 Result and Discussion
This study presents a deep learning-based drug-target affinity prediction model. The model uses two CNN blocks to learn global and local features from the protein's primary and secondary structure, respectively, and employs GAT and GCN to learn the features of drug ligands. We compare our model with four other models: DeepDTA, WideDTA, GraphDTA, and AttentionDTA [31]. All four use CNNs to learn protein features from the protein primary sequence and use different methods to learn ligand features, but none of them uses secondary structure. The experimental results on the two datasets, Davis and KIBA, are recorded in Table 2. Our proposed model achieved an MSE of 0.218 and a CI of 0.894 on the Davis dataset, and an MSE of 0.128 and a CI of 0.897 on the KIBA dataset, outperforming the other methods on both metrics and both datasets. This indicates that incorporating both global and local protein features can enhance the accuracy of drug-target binding affinity prediction.
Table 2. CI and MSE scores of our model and baseline models on the Davis and KIBA datasets.

Models         Davis CI   Davis MSE   KIBA CI   KIBA MSE
DeepDTA        0.878      0.261       0.863     0.194
WideDTA        0.886      0.262       0.875     0.179
GraphDTA       0.881      0.245       0.882     0.147
AttentionDTA   0.887      0.245       0.882     0.162
Our model      0.894      0.218       0.897     0.128
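To put the numbers in Table 2 in perspective, the short calculation below derives the relative MSE reduction and the absolute CI gain of the proposed model over the best baseline on each dataset. The values are copied from Table 2; the dictionary layout and function names are illustrative. Under these values the MSE reduction comes to roughly 11% on Davis and 13% on KIBA.

```python
# (CI, MSE) values copied from Table 2.
TABLE_2 = {
    "DeepDTA":      {"Davis": (0.878, 0.261), "KIBA": (0.863, 0.194)},
    "WideDTA":      {"Davis": (0.886, 0.262), "KIBA": (0.875, 0.179)},
    "GraphDTA":     {"Davis": (0.881, 0.245), "KIBA": (0.882, 0.147)},
    "AttentionDTA": {"Davis": (0.887, 0.245), "KIBA": (0.882, 0.162)},
    "Our model":    {"Davis": (0.894, 0.218), "KIBA": (0.897, 0.128)},
}


def relative_mse_reduction(dataset, ours="Our model"):
    """(best baseline MSE - our MSE) / best baseline MSE."""
    best = min(s[dataset][1] for m, s in TABLE_2.items() if m != ours)
    return (best - TABLE_2[ours][dataset][1]) / best


def ci_gain(dataset, ours="Our model"):
    """Our CI minus the best baseline CI."""
    best = max(s[dataset][0] for m, s in TABLE_2.items() if m != ours)
    return TABLE_2[ours][dataset][0] - best


if __name__ == "__main__":
    for name in ("Davis", "KIBA"):
        print(name, round(relative_mse_reduction(name), 3), round(ci_gain(name), 3))
```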
5 Conclusion
Our study introduces a deep learning-based model that leverages protein primary sequence and secondary structure to predict drug-target binding affinity. We also enriched the Davis and KIBA datasets with protein secondary structure data, which can be used in future experiments. Our findings suggest that incorporating local features of proteins improves the ability to learn and represent proteins, leading to more accurate predictions of drug-target binding affinity.
Acknowledgement. This paper is supported by the National Natural Science Foundation of China (62073231, 62176175, 61902271), National Research Project (2020YFC2006602), Provincial Key Laboratory for Computer Information Processing Technology, Soochow University (KJS2166), and the Opening Topic Fund of Big Data Intelligent Engineering Laboratory of Jiangsu Province (SDGC2157).
References
1. DiMasi, J.A., Grabowski, H.G., Hansen, R.W.: Innovation in the pharmaceutical industry: new estimates of R&D costs. J. Health Econ. 47, 20–33 (2016)
2. Mullard, A.: New drugs cost US $2.6 billion to develop. Nat. Rev. Drug Discov. 13(12), 877 (2014)
3. Ding, Y., Tang, J., Guo, F.: Identification of drug–target interactions via dual Laplacian regularized least squares with multiple kernel fusion. Knowl.-Based Syst. 204, 106254 (2020)
4. Sun, M., Tiwari, P., Qian, Y., et al.: MLapSVM-LBS: predicting DNA-binding proteins via a multiple Laplacian regularized support vector machine with local behavior similarity. Knowl.-Based Syst. 250, 109174 (2022)
5. Ding, Y., Tang, J., Guo, F.: Identification of drug–target interactions via fuzzy bipartite local model. Neural Comput. Appl. 32, 10303–10319 (2020)
6. Yamanishi, Y., Kotera, M., Kanehisa, M., et al.: Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics 26(12), i246–i254 (2010)
7. Tang, J., Szwajda, A., Shakyawar, S., et al.: Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J. Chem. Inf. Model. 54(3), 735–743 (2014)
8. Yang, H., Ding, Y., Tang, J., et al.: Drug–disease associations prediction via multiple kernel-based dual graph regularized least squares. Appl. Soft Comput. 112, 107811 (2021)
9. Ding, Y., Tang, J., Guo, F.: Human protein subcellular localization identification via fuzzy model on kernelized neighborhood representation. Appl. Soft Comput. 96, 106596 (2020)
10. Wu, H., Ling, H., Gao, L., et al.: Empirical potential energy function toward ab initio folding G protein-coupled receptors. IEEE/ACM Trans. Comput. Biol. Bioinf. 18(5), 1752–1762 (2020)
11. Karimi, M., Wu, D., Wang, Z., et al.: Explainable deep relational networks for predicting compound–protein affinities and contacts. J. Chem. Inf. Model. 61(1), 46–66 (2020)
12. Ding, Y., Tang, J., Guo, F.: Identification of drug-target interactions via multi-view graph regularized link propagation model. Neurocomputing 461, 618–631 (2021)
13. Weininger, D.: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28(1), 31–36 (1988)
14. Ding, Y., Tang, J., Guo, F.: Identification of drug-side effect association via semisupervised model and multiple kernel learning. IEEE J. Biomed. Health Inform. 23(6), 2619–2632 (2018)
15. Öztürk, H., Özgür, A., Ozkirimli, E.: DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 34(17), i821–i829 (2018)
16. Öztürk, H., Ozkirimli, E., Özgür, A.: WideDTA: prediction of drug-target binding affinity. arXiv preprint arXiv:1902.04166 (2019)
17. Nguyen, T., Le, H., Quinn, T.P., et al.: GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics 37(8), 1140–1147 (2021)
18. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
19. Veličković, P., Cucurull, G., Casanova, A., et al.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
20. Xu, K., Hu, W., Leskovec, J., et al.: How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018)
21. Yang, Z., Zhong, W., Zhao, L., et al.: MGraphDTA: deep multiscale graph neural network for explainable drug–target binding affinity prediction. Chem. Sci. 13(3), 816–833 (2022)
22. Karimi, M., Wu, D., Wang, Z., et al.: DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 35(18), 3329–3338 (2019)
23. Davis, M.I., Hunt, J.P., Herrgard, S., et al.: Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 29(11), 1046–1051 (2011)
24. Guermeur, Y., et al.: Improved performance in protein secondary structure prediction by inhomogeneous score combination. Bioinformatics 15(5), 413–421 (1999)
25. Combet, C., et al.: NPS@: network protein sequence analysis. Trends Biochem. Sci. 25(3), 147–150 (2000)
26. Wang, H., Tang, J., Ding, Y., et al.: Exploring associations of non-coding RNAs in human diseases via three-matrix factorization with hypergraph-regular terms on center kernel alignment. Brief. Bioinform. 22(5), bbaa409 (2021)
27. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1983)
28. Wan, L., Zeiler, M., Zhang, S., et al.: Regularization of neural networks using dropconnect. In: International Conference on Machine Learning, pp. 1058–1066. PMLR (2013)
29. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
30. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)
31. Zhao, Q., Xiao, F., Yang, M., et al.: AttentionDTA: prediction of drug–target binding affinity using attention model. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 64–69. IEEE (2019)
Drug-Target Interaction Prediction Based on Interpretable Graph Transformer Model
Baozhong Zhu1, Runhua Zhang1, Tengsheng Jiang2, Zhiming Cui1, and Hongjie Wu1(B)
1 School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
[email protected]
2 Gusu School, Nanjing Medical University, Suzhou, Jiangsu, China
Abstract. This study proposes a novel architecture for drug-target interaction (DTI) prediction that leverages protein binding sites and self-attention mechanisms. The architecture consists of four modules: Data Preparation, Graph Embedding Learning, Feature Extraction, and Prediction. In the Data Preparation module, protein binding sites are extracted from the 3D structure of proteins using a simulation-based model, which reduces model complexity. Graphs of protein pockets and ligands are then constructed, and Topology Adaptive Graph Convolutional Networks (TAGCN) learn embeddings from them to extract global and local features of the protein pocket and ligand. The protein pocket and ligand features are fused via the Self-attentive Bidirectional Long Short-Term Memory (BiLSTM) block to obtain a representation of the drug-target complex. The resulting concatenated representation is then fed into a binary classifier for predicting DTI. The self-attention mechanism computes its attention output from the concatenated embeddings of drug-target pairs, which enables interpretability by identifying the protein regions that interact with the ligand in a given drug-target pair. The experimental results demonstrate the superiority of the proposed architecture over existing DTI predictive models.
Keywords: drug–target interaction · Self-Attention · transformer · graph neural networks · Binding Sites
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 676–686, 2023. https://doi.org/10.1007/978-981-99-4749-2_58
1 Introduction
Drug discovery is a complex and time-consuming process, and despite significant investments, success rates remain suboptimal [1]. Proteins are the primary targets of drugs, and the identification of drug-target interactions (DTI) has become a crucial task in early-stage drug development and drug repurposing [2]. Since experimental DTI studies are expensive and time-consuming, computational methodologies have been proposed to facilitate the identification of putative DTI, thereby expediting the process of drug discovery [3]. One of the main methods for virtual screening involves predicting potential drugs by screening out drug candidate ligands for receptor proteins of interest from large-scale
compound ligand libraries through extensive computation [4]. Virtual screening methods can be divided into two categories: receptor-based and ligand-based. Receptor-based virtual screening studies the three-dimensional structure of proteins and looks for interactions with small-molecule compounds based on that structure, so it is also known as structure-based virtual screening [5]. However, these methods have practical limitations: they rely heavily on high-quality three-dimensional protein structures and are computationally expensive and inefficient. Ligand-based virtual screening typically begins with ligands and analyzes the molecular structure and activity information of known inhibitors to summarize the structural features that contribute significantly to their binding capacity. This learned knowledge is then used to screen new ligands for compound molecules that meet the requirements. Virtual screening methods often rely on predicting drug-target interactions, which can be expressed as continuous values that reflect the strength of each interaction. With the rapid development of deep learning methods [6], researchers have also used deep learning models to treat drug-target interaction prediction as a binary classification task [7]. These DTI prediction models have been highly successful because they can automatically capture deep features of the data, resulting in models that handle complex molecular data well [8]. Deep learning models for DTI can be divided into two main categories [9]. The first type operates on sequence-based representations of the input data and includes convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
In related work, Peng et al. proposed a method based on convolutional neural networks that extracts drug and protein features from heterogeneous networks and predicts the interaction between drugs and proteins [10]. Karimi et al. proposed a semi-supervised deep learning model that unifies recurrent and convolutional neural networks to jointly encode molecular representations and predict affinity using both unlabeled and labeled data [11]. However, these models usually represent drugs as strings, and one-dimensional sequences are not a natural way of representing molecules. To compensate for the missing molecular structure information, a second type of deep learning model, the graph neural network (GNN), was introduced, and graph convolutional networks have proven beneficial for computational drug discovery [12]. A GNN uses a graph description of molecules in which atoms and chemical bonds correspond to nodes and edges, respectively [13]. The most commonly used GNN-based models today are the graph convolutional neural network (GCNN) [14] and the graph attention network (GAT), a variant of GCNN. In related work, Zhao used a graph convolutional network to learn from constructed drug-protein pairs and improve prediction accuracy [15]. Zhao proposed a graph convolutional DTI prediction model in which the first-order neighbor information of a node is aggregated through GCN, while the high-order neighbor information is learned by a graph embedding method, which improves prediction accuracy [16]. Despite the impressive performance of both CNN-based and graph-based neural network methods in DTI prediction, certain challenges remain unresolved [17]. One significant limitation of most deep learning methods is that they employ only a few CNN layers, resulting in the compression of all feature information into a small area, which
may cause the loss of local features of the original data. Moreover, current graph-based models represent the protein by its amino acid sequence, which cannot capture the 3D structural features that are essential in DTI prediction. Obtaining a high-resolution 3D structure of a protein is difficult because of its complexity and large number of atoms, and a massive, sparse 3D matrix is needed to capture the entire structure. This paper proposes a novel approach for predicting DTI that represents the structural features of small molecules and protein binding sites as graphs. To preserve the influence of molecular structure on the prediction results, a transformer model is introduced to extract global features. Moreover, a self-attentive Bidirectional Long Short-Term Memory (BiLSTM) mechanism is employed to identify the parts of the protein that are most likely to bind to a given drug, thereby enhancing the model's interpretability.
2 Materials and Methods
2.1 Framework of Our Method
Fig. 1. Our proposed framework consists of four main modules: (1) a preprocessing module, which searches for the binding sites of proteins; (2) a graph representation module, in which we construct graph representations of the ligand SMILES and the protein binding sites and build a graph convolutional neural network; (3) a feature extraction module, equipped with a transformer block and a BiLSTM block with a masked self-attention mechanism, which extracts global features from the graph features and learns the relationship between ligand and protein binding sites; (4) a prediction module, which predicts unknown interactions in drug-target pairs and can handle classification and regression tasks.
Our proposed architecture, as illustrated in Fig. 1, consists of four main modules: Data Preparation, Graph Embedding Learning Module, Feature Extraction Module, and Prediction Module. To identify the binding sites of proteins, we utilize the algorithm proposed by Saberi-Fatthi [18] in the Data Preparation module. This simulation-based
approach allows for the extraction of protein binding sites from the 3D structure of proteins prior to entering data into an end-to-end architecture, effectively reducing the model complexity. Despite its simplicity, this approach shows comparable performance to other, more complex simulation-based methods. The constructed graphs of protein pockets and ligands are then fed into the TAGCN, which generates embeddings from each graph while taking the weights of neighboring nodes into account. The resulting global and local features of the protein pocket and ligand are then fed into the transformer-based multi-head attention mechanism. To match the output of the TAGCN, two single-head attention mechanisms are employed to achieve the effect of multi-head attention. By fusing the protein pocket and ligand features, we obtain a representation of the drug-target complex that contains its structural features. This representation is then fed into the Self-attentive BiLSTM block, which employs a mask to obtain relationship features of the drug-target complex. Finally, the concatenated representation is fed into the Prediction Module, which contains a binary classifier used to predict DTI. The self-attention mechanism in the network computes the attention output using the concatenated embeddings of drug-target pairs as inputs, enabling the model to identify which parts of the protein interact with the ligand in a given drug-target pair. Thus, the model achieves interpretability. The Prediction Module contains two fully connected layers and outputs its prediction as a probability using the logistic sigmoid function.
2.2 Dataset
Table 1. Summary of the DUD-E and Human datasets.

Datasets   Drugs   Proteins   DT pairs   Active   Inactive
DUD-E      22886   102        1429790    22645    1407145
Human      2726    2001       6728       3364     3364
Our model was evaluated on two widely used DTI benchmark datasets, the DUD-E dataset and the Human dataset. The DUD-E dataset comprises 102 targets belonging to eight distinct protein families. Each target has roughly 224 active compounds and more than 10,000 decoy molecules. The Human dataset was constructed by combining a highly credible set of negative drug-protein samples with known positive samples using a systematic in silico screening method. The dataset contains 5423 interactions between drug and target molecules. Table 1 presents a summary of the key statistics for these two datasets. All datasets are publicly available: the DUD-E dataset at http://dude.docking.org and the Human dataset at https://github.com/IBMInterpretableDTIP.
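As a quick consistency check on Table 1, the snippet below verifies that active and inactive pairs sum to the total number of drug-target pairs and computes the fraction of active pairs in each dataset. The counts are copied from Table 1; the dictionary layout and function name are illustrative. Under these counts, actives make up about 1.6% of the DUD-E pairs, whereas the Human dataset is balanced, which is one reason different metrics suit the two datasets.

```python
# Pair counts copied from Table 1.
TABLE_1 = {
    "DUD-E": {"pairs": 1429790, "active": 22645, "inactive": 1407145},
    "Human": {"pairs": 6728, "active": 3364, "inactive": 3364},
}


def active_fraction(dataset):
    """Share of drug-target pairs labeled as active."""
    row = TABLE_1[dataset]
    return row["active"] / row["pairs"]


if __name__ == "__main__":
    for name, row in TABLE_1.items():
        consistent = row["active"] + row["inactive"] == row["pairs"]
        print(name, consistent, round(active_fraction(name), 4))
```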
2.3 Input Representation
2.3.1 Protein
Following the extraction of protein binding sites, a separate graph is constructed for the protein in which individual atoms are represented as nodes and inter-atomic connections as edges [19]. Each atom's feature vector is computed using a one-hot encoding that takes into account the atomic type, atomic size, total number of connected hydrogen atoms, and implicit valence. This approach generates a 31-dimensional vector for each node.
2.3.2 Ligand
For each ligand in the DTI dataset, a bidirectional graph is constructed from its representation in the Simplified Molecular Input Line Entry System (SMILES) format [20, 21]. The atoms in the ligand are represented using a one-hot encoding scheme that includes the atomic type, atomic size, formal charge, number of radical electrons, hybridization, aromaticity, and total number of hydrogen atoms. This encoding generates a vector of size 1 × 74 for each node in the ligand graph [22, 23]. The resulting protein and ligand graphs are then fed into the TAGCN to learn their corresponding embeddings.
2.4 Graph Embedding Module
We use a TAGCN [24], a variant of the graph convolutional network. It works by simultaneously sliding a set of fixed-size learnable filters over the input graph to produce a weighted sum of the filters' outputs, representing both the strength of the correlation between graph vertices and the vertex features themselves.
In other words, the output of a convolutional layer is the weighted sum of the feature maps resulting from filters of varying size k, for k = 1, …, K. The graph convolutional layer for TAGCN is defined as:

$$H = \sum_{k=1}^{K}\left[\left(D^{-1/2} A D^{-1/2}\right)^{k} X \Theta_k + b_k\right] \tag{1}$$

where $A$ is the adjacency matrix of the graph, normalized as $D^{-1/2} A D^{-1/2}$, $D$ with $D_{ii} = \sum_{j} A_{ij}$ is its corresponding diagonal degree matrix, $X$ is the input feature matrix of the nodes, and $\Theta_k$ is the vector of linear weights aggregating the results from all the adjacent vertices within a k-hop distance of a given node. Also, $b_k$ is the learnable bias, which is used in the summation after every hop.
2.5 Feature Extraction Module
2.5.1 Transformer
The Transformer is a neural network architecture based on a self-attention mechanism, which allows for better learning of global information. It comprises a multi-head attention
mechanism and a Feed Forward Neural Network [25]. To adapt to the output of TAGCN, we employ two single-head self-attention mechanisms to simulate multi-head attention. The self-attention mechanism selectively focuses on the most relevant parts of the input by mapping a query and a set of key-value pairs to a weighted sum of the values, where the weights are computed from the relationship between the query and the corresponding key. The input consists of a query (Q), a key (K), and a value (V), each obtained through a separate linear layer. The query matrix is multiplied by the transpose of the key matrix, and the resulting elements are scaled by the square root of the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_K}}\right)V \tag{2}$$

where the query Q has dimension $d_Q$, the key K has dimension $d_K$, and the value V has dimension $d_V$ (usually $d_Q = d_K = d_V$). This self-attention mechanism uses a sequence of embeddings as input to extract Q, K, and V.
2.5.2 Self-attentive BiLSTM
We use the BiLSTM architecture to extract sequential features from the interactions between protein binding sites and ligands [26]. The bidirectional nature of the LSTM network enables it to capture both past and future information, making it well-suited for our purposes. Furthermore, we want the model to be interpretable so that it can identify the key contributors to the predicted interactions, which can inform the design or optimization of compounds by chemists. To this end, we employ a self-attention mechanism with masks to prevent protein binding sites from attending to each other. Specifically, we construct the mask such that each binding site attends only to itself and the interacting ligand, resulting in a matrix with the same dimensions as the query and key matrices.
The diagonal elements of this matrix are set to 1, and the last column, which corresponds to the ligand, is also set to 1, while all other values are set to a small value ($9 \times 10^{-15}$) to discourage protein binding sites from attending to each other.
2.6 Prediction Module
2.6.1 Classifier
The extracted features are concatenated into a 1D vector, denoted as I, and then passed to the classification layer. We adopt a two-layer fully connected neural network, a multilayer perceptron with a Rectified Linear Unit (ReLU) activation function, which maps the extracted features to the final classification output. To improve the generalization capability of the model, we employ dropout regularization before each linear layer [27]. The logistic sigmoid function is used in the last layer to output the prediction as a probability. In addition, we employ the average binary cross-entropy loss (Eq. 3) to train the model by propagating the error backwards through the network, and updating
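all parameters of the model in an end-to-end manner:

$$L(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right] \tag{3}$$

where $y_i$ is the true label and $\hat{y}_i$ is the predicted probability for the $i$-th of $N$ drug-target pairs.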
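The average binary cross-entropy loss of Eq. 3 can be sketched in a few lines. This is an illustration with the standard library only, not the training code used in the paper; the function name and the clamping constant `eps`, which guards against log(0), are assumptions added here.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy between labels y_i and probabilities y_hat_i."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # assumed safeguard, not in Eq. 3
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


if __name__ == "__main__":
    # Hypothetical labels and sigmoid outputs, for illustration only.
    print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))
```

A prediction of 0.5 for every pair gives a loss of log 2, about 0.693, which is a convenient baseline when reading training curves.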
3 Results
3.1 Experimental Strategies
Our model was implemented in PyTorch 1.13.1. We used a batch size of 40 and trained the network using the Adam optimizer with a learning rate of 0.001 for 100 epochs. To enhance generalization, we applied dropout with a probability of 0.2 before each fully connected layer. The number of hops in TAGCN was set to four for proteins and two for ligands. To match the output of the graph convolutional layer, we set the hidden state size of the two single-head attention transformers to 31 and used the same hidden state size for the BiLSTM layer. Zero padding was employed to reshape each matrix to the maximum number of binding pockets in the dataset. All relevant hyperparameters are listed in Table 2. To evaluate the effectiveness of our model, we used several metrics that are widely used for DTI classification models: the area under the receiver operating characteristic curve (AUC), precision, and recall for the Human dataset, as well as ROC enrichment (RE) for the DUD-E dataset. We selected different performance metrics for different datasets and benchmark models to facilitate comparisons with results reported in the literature.

Table 2. Training hyperparameters (FC is the number of fully connected layers, L-GCN and P-GCN are the numbers of graph convolutional layers used to embed ligands and protein binding sites, respectively, and TFs is the number of transformers).

TAGCN Hops   FC   L-GCN   P-GCN   LR     TFs   Dropout   Batch size
4            2    5       4       3364   2     0.02      40–100
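The classification metrics named in Sect. 3.1 can be sketched as follows. Precision, recall, and F1 are computed from hard 0/1 predictions, and AUC is computed with the pairwise-ranking formulation, which is equivalent to the area under the ROC curve. Function names and the example data are illustrative; ROC enrichment is not covered because its exact definition is not given in the text.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]          # hypothetical ground truth
    probs = [0.9, 0.4, 0.6, 0.1]   # hypothetical sigmoid outputs
    preds = [1 if p >= 0.5 else 0 for p in probs]
    print(precision_recall_f1(labels, preds), roc_auc(labels, probs))
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch but too slow for a dataset the size of DUD-E; a sort-based implementation would be preferable there.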
3.2 Comparison on the Human Dataset
On the Human dataset, we compared our model with several traditional machine learning (ML) models, including K-nearest neighbors (KNN), random forest (RF), and L2-regularized logistic regression (L2) [28], as well as recently developed graph-based methods such as graph convolutional networks (GCN) [29] and CPI-GNN [30]. We used a uniform experimental setup to ensure a fair comparison with these models, and the baseline results were taken from a previous study [31]. As shown in Table 3, our proposed model achieved better predictive performance than all the ML and GNN-based models tested on the Human dataset.
Table 3. Human Dataset Comparison.

Model     AUC     Precision   Recall   F1 Score
K-NN      0.86    0.798       0.927    0.858
RF        0.940   0.861       0.897    0.879
L2        0.911   0.861       0.913    0.902
GCN       0.956   0.862       0.928    0.894
CPI-GNN   0.970   0.923       0.918    0.920
E2E/GO    0.970   0.893       0.914    0.903
Ours      0.989   0.949       0.949    0.947
3.3 Comparison on the DUD-E Dataset
On the DUD-E dataset, we compared our proposed model with state-of-the-art models from four categories: (1) machine learning-based scoring methods such as NN score [32] and random forest score (RF score) [33]; (2) the molecular docking program AutoDock Vina; (3) deep learning-based 3D-CNN models [34]; and (4) graph-based models such as PocketGCN [35] and GNN [36]. PocketGCN uses two graph CNNs to automatically extract features from the graphs of protein pockets and ligands to capture protein-ligand binding interactions. CPI-GNN [30] is a predictive model that combines a graph neural network for ligands and a CNN for targets. Results are shown in Table 4. Our proposed model outperforms all machine learning and graph-based models.

Table 4. DUD-E Dataset Comparison.

Model       AUC     0.5% RE   1.0% RE   2.0% RE   5.0% RE
NN Score    0.584   4.166     2.980     2.460     1.891
RF-score    0.622   5.628     4.274     3.499     2.678
Vina        0.716   9.139     7.321     5.811     4.444
3D-CNN      0.868   42.559    26.655    19.363    10.710
PocketGCN   0.886   44.406    29.748    19.408    10.735
GNN         0.940   –         –         –         –
Ours        0.956   74.323    54.356    32.982    16.790
4 Discussion
We attribute the improved performance of our proposed model to several factors: (1) Input representation plays a crucial role in predicting the binding affinity of drug-target complexes. More sophisticated input representations, such as structural graphs, help capture crucial structural information about the molecules. (2) The feature extraction technique is an important consideration, and transformer-based architectures provide a robust automatic feature extraction mechanism that can capture high-order nonlinear relationships. Additionally, graph neural networks that use graph representations of drugs and proteins can effectively capture the topological relationships between drug molecules and target proteins, further enhancing performance. (3) To model and interpret the binding relationship of drug-target complexes more effectively, we introduce a self-attentive BiLSTM with masks. This model not only retains past and future information of the input sequence by processing it in both directions but also indicates the degree of binding of drug-target complexes through the attention weights.
5 Conclusion In this study, we propose a novel model for predicting drug-target interactions that incorporates TAGCN, transformer, and self-attentive BiLSTM. Our model leverages the self-attention mechanism to effectively capture any relationship between the binding site of a protein and a drug. Our model achieves high performance in DTI prediction, while also providing interpretability by identifying the specific binding sites of proteins that interact with a given ligand. Acknowledgement. This paper is supported by the National Natural Science Foundation of China (62073231, 62176175, 61902271), National Research Project (2020YFC2006602), Provincial Key Laboratory for Computer Information Processing Technology, Soochow University (KJS2166), Opening Topic Fund of Big Data Intelligent Engineering Laboratory of Jiangsu Province (SDGC2157).
References 1. Abbasi Mesrabadi, H., Faez, K., Pirgazi, J.: Drug–target interaction prediction based on protein features, using wrapper feature selection. Sci. Rep. 13, 3594 (2023). https://doi.org/ 10.1038/s41598-023-30026-y 2. Soh, J., Park, S., Lee, H.: HIDTI: integration of heterogeneous information to predict drugtarget interactions. Sci. Rep. 12, 3793 (2022). https://doi.org/10.1038/s41598-022-07608-3 3. Azuaje, F., Zhang, L., Devaux, Y., et al.: Drug-target network in myocardial infarction reveals multiple side effects of unrelated drugs. Sci. Rep. 1, 52 (2011). https://doi.org/10.1038/sre p00052
Drug-Target Interaction Prediction
685
4. Beroza, P., Crawford, J.J., Ganichkin, O., et al.: Chemical space docking enables large-scale structure-based virtual screening to discover ROCK1 kinase inhibitors. Nat. Commun. 13, 6447 (2022). https://doi.org/10.1038/s41467-022-33981-8 5. Crunkhorn, S.: Novel virtual screening approach. Nat. Rev. Drug. Discov. 16, 18 (2017). https://doi.org/10.1038/nrd.2016.272 6. Ding, Y., Tang, J., Guo, F.: Identification of drug–target interactions via dual Laplacian regularized least squares with multiple kernel fusion. Knowl.-Based Syst. 204, 106254 (2020) 7. Ding, Y., Tang, J., Guo, F.: Identification of drug–target interactions via fuzzy bipartite local model. Neural Comput. Appl. 32(14), 10303–10319 (2019). https://doi.org/10.1007/s00521019-04569-z 8. Ding, Y., Tang, J., Guo, F.: Identification of drug-side effect association via semisupervised model and multiple kernel learning. IEEE J. Biomed. Health Inform. 23(6), 2619–2632 (2018) 9. Ding, Y., Tang, J., Guo, F.: Identification of drug-target interactions via multi-view graph regularized link propagation model. Neurocomputing 461, 618–631 (2021) 10. Peng, J., Li, J., Shang, X.: A learning-based method for drug-target interaction prediction based on feature representation learning and deep neural network. BMC Bioinform. 21(Suppl 13), 394 (2020). https://doi.org/10.1186/s12859-020-03677-1. PMID: 32938374; PMCID: PMC7495825 11. Karimi, M., Wu, D., Wang, Z., Shen, Y.: DeepAffinity: interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 35(18), 3329–3338 (2019). https://doi.org/10.1093/bioinformatics/btz111. PMID: 30768156; PMCID: PMC6748780 12. Veliˇckovi´c, P., et al.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017) 13. Torng, W., Altman, R.B.: Graph convolutional neural networks for predicting drug-target interactions. J. Chem. Inf. Model. 59(10), 4131–4149 (2019) 14. 
Lim, J., et al.: Predicting drug–target interaction using a novel graph neural network with 3D structure-embedded graph representation. J. Chem. Inf. Model. 59(9), 3981–3988 (2019) 15. Zhao, T., Hu, Y., Valsdottir, L.R., Zang, T., Peng, J.: Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief. Bioinform. 22(2), 2141–2150 (2021). https://doi.org/10.1093/bib/bbaa044 16. Zhao, B.W., et al.: A novel method to predict drug-target interactions based on large-scale graph representation learning. Cancers (Basel). 13(9), 2111 (2021). https://doi.org/10.3390/ cancers13092111. PMID: 33925568; PMCID: PMC8123765 17. Ding, Y., Tang, J., Guo, F.: Protein crystallization identification via fuzzy model on linear neighborhood representation. IEEE/ACM Trans. Comput. Biol. Bioinform. 18(5), 1986–1995 (2019) 18. Fathi, S., Majid, S., Tuszynski, J.A.: A simple method for finding a protein’s ligand-binding pockets. BMC Struct. Biol. 14(1), 1–9 (2014) 19. Ding, Y., Tang, J., Guo, F.: Human protein subcellular localization identification via fuzzy model on kernelized neighborhood representation. Appl. Soft Comput. 96, 106596 (2020) 20. Wu, H., et al.: Empirical potential energy function toward ab initio folding G protein-coupled receptors. IEEE/ACM Trans. Comput. Biol. Bioinform. 18(5), 1752–1762 (2020) 21. Wang, H., et al.: Exploring associations of non-coding RNAs in human diseases via three-matrix factorization with hypergraph-regular terms on center kernel alignment. Brief. Bioinform. 22(5), bbaa409 (2021) 22. Yang, H., et al.: Drug–disease associations prediction via multiple kernel-based dual graph regularized least squares. Appl. Soft Comput. 112, 107811 (2021) 23. Sun, M., et al.: MLapSVM-LBS: Predicting DNA-binding proteins via a multiple Laplacian regularized support vector machine with local behavior similarity. Knowl.-Based Syst. 250, 109174 (2022)
686
B. Zhu et al.
24. Du, J., et al.: Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710. 10370 (2017) 25. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 26. Zhou, P., et al.: Attention-based bidirectional long short-term memory networks for relation classification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: Short Papers, vol. 2 (2016) 27. Yazdani-Jahromi, M., et al.: AttentionSiteDTI: an interpretable graph-based model for drug-target interaction prediction using NLP sentence-level relation classification. Brief. Bioinform. 23(4), bbac272 (2022) 28. Liu, H., et al.: Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics 31(12), i221–i229 (2015) 29. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016) 30. Wang, E., et al.: A graph convolutional network–based method for chemical-protein interaction extraction: algorithm development. JMIR Med. Inform. 8(5), e17643 (2020) 31. Wu, Y., et al.: BridgeDPI: a novel graph neural network for predicting drug–protein interactions. Bioinformatics 38(9), 2571–2578 (2022) 32. Durrant, J.D., McCammon, J.A.: NNScore 2.0: a neural-network receptor–ligand scoring function. J. Chem. Inf. Model. 51(11), 2897–2903 (2011) 33. Ballester, P.J., Mitchell, J.B.O.: A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking. Bioinformatics 26(9), 1169–1175 (2010) 34. Ragoza, M., et al.: Protein–ligand scoring with convolutional neural networks. J. Chem. Inf. Model. 57(4), 942–957 (2017) 35. Torng, W., Altman, R.B.: Graph convolutional neural networks for predicting drug-target interactions. J. Chem. Inf. Model. 59(10), 4131–4149 (2019) 36. 
Tsubaki, M., Tomii, K., Sese, J.: Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 35(2), 309–318 (2019)
NIEE: Modeling Edge Embeddings for Drug-Disease Association Prediction via Neighborhood Interactions

Yu Jiang1, Jingli Zhou1, Yong Zhang2, Yulin Wu1,3, Xuan Wang1,3, and Junyi Li1,3(B)

1 School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China
[email protected]
2 Department of Orthopedics, Shenzhen University General Hospital, Shenzhen 518055, China
3 Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China
Abstract. Using computational methods to search for potential drugs for diseases can speed up the drug development process. The majority of current research focuses on obtaining node embedding representations for link prediction using deep learning techniques. These methods use a simple inner product to simulate the association between drug and disease nodes, which is insufficient. We therefore propose an edge embedding model, named NIEE, that performs link prediction based on the interaction between the drug neighborhood and the disease neighborhood. The core idea of NIEE is to simulate the embedding of the edge between source and target nodes using the interaction between their neighborhoods. The model first samples the neighborhoods of nodes on the heterogeneous network according to specially designed meta-paths, and then uses the interaction module to simulate the interaction between the neighborhoods. We designed a hierarchical attention mechanism to aggregate heterogeneous nodes within meta-paths and to perform semantic-level aggregation between meta-paths. Finally, an MLP predicts whether the edge exists. We compared our model with four GNN models, and the experiments show that our model outperforms the other models on all metrics, confirming the effectiveness of NIEE.
Keywords: Heterogeneous information network · Attention mechanism · Network representation method · Drug disease association prediction
Supplementary Information. The online version contains supplementary material available at https://doi.org/10.1007/978-981-99-4749-2_59.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 687–699, 2023. https://doi.org/10.1007/978-981-99-4749-2_59

1 Introduction
Drug-disease association prediction is the process of identifying potential candidate therapeutic drugs for a specific disease by mining the information embedded in clinically validated drug-disease associations [1]. The outbreak of COVID-19 has highlighted the
urgent need to find effective drugs to treat this disease, making drug repurposing for specific diseases a new research focus. The use of computational methods in this field can promote the full utilization of biological information, and a substantial body of research has accumulated [2]. Considering the diversity of biological entities and the complexity of their relationships, networks are the preferred choice for modeling entity relationships, and graph representation learning methods are the natural choice for studying drug-disease associations [3]. These methods [4] usually focus on learning latent, low-dimensional embedding representations of graph vertices while preserving the graph structure, and then use them for subsequent link prediction tasks.
Graph representation learning algorithms can be divided into two categories. One category is unsupervised graph representation algorithms, including DeepWalk [5], LINE [6], Node2vec [7], and SDNE [8]. The other category is semi-supervised graph representation algorithms, including GCN [9] and GAT [10]. These algorithms were generally applied to homogeneous networks. Applying them directly to heterogeneous networks would discard the rich type-specific information that such networks carry. Therefore, algorithms proposed for heterogeneous information networks (HINs) aim to obtain high-quality low-dimensional embedding representations of nodes while preserving the heterogeneous graph structure and heterogeneous node information. Metapath2vec [11], HIN2vec [12] and HAN [13] are the classic methods in this field. Among them, HAN [13] discards the intermediate node information during the extraction of meta-path guided neighborhoods, resulting in the early summarization problem. To address this issue, models such as MAGNN [14] and NIRec [15] preserve the intermediate node information when sampling node neighborhoods, which contributes to improved performance to some extent.
Models designed for HINs typically tackle link prediction by decomposing the network into multiple homogeneous networks (e.g., HAN [13]). Machine learning techniques are then used to capture the network structure, followed by a simple inner product between drug and disease node embeddings to determine whether an edge exists. There are two questionable aspects to this design. First, it is uncertain whether an inner product between node embeddings, even high-quality ones, is sufficient to model the existence of edges between nodes. Second, it is unclear whether the captured structural information is directly beneficial for predicting edges. To address these issues, we propose a model called NIEE, which simulates interactions between node neighborhoods to determine edge existence. Additionally, to ensure that the captured network structure is directly useful for edge prediction, we use a meta-path design that differs from those used in HAN [13] and MAGNN [14]: the meta-path endpoints are the source and target nodes rather than nodes of the same type.
In this study, we propose a heterogeneous graph neural network model, NIEE, which simulates the interactions between the neighborhoods of a node pair to obtain an edge embedding and uses this embedding to predict whether the edge exists. Specifically, we sample the neighborhoods according to different meta-paths, perform interaction operations on the neighborhoods, and aggregate the results at both the intra- and inter-meta-path levels using attention mechanisms, resulting in the edge embedding.
We apply a multilayer perceptron to these embeddings to predict edge existence. Our work can be summarized as follows:
(i) Constructing a HIN that integrates multiple data sources and sampling along the meta-paths. The extracted neighborhoods directly benefit the link prediction task.
(ii) Simulating the interaction between a node pair by modeling the interaction between the neighborhoods of the two nodes. This approach makes full use of the information surrounding the given nodes, which can improve the quality of the edge embeddings.
(iii) Designing an aggregation module to obtain the final edge embedding. Because the meta-paths we design contain different types of nodes, we use intra- and inter-meta-path level aggregation to obtain the final embedding of the edge.
(iv) Designing a link prediction model. We use an MLP to predict whether the edge exists, with the edge embedding as input.
2 Dataset
Our dataset, bioDDG, contains three types of nodes and two types of edges, with detailed statistics presented in Table 1. The drug-disease associations and disease-gene associations were both extracted from Malacards [16], a comprehensive database of human diseases and their annotations. The dataset contains 103099 valid drug-disease associations and 116864 disease-gene associations. For the subsequent experiments, we randomly select negative samples so that the ratio of positive to negative samples is 1:1.

Table 1. BioDDG.

| Edges (node A - node B) | Number of node A | Number of node B | Number of edges | Meta-path pair |
|---|---|---|---|---|
| Drug-Disease | 2011 | 2649 | 103099 | (RI, IR), (RIRI, IRIR) |
| Disease-Gene | 2649 | 14177 | 116864 | (RIGI, IGIR) |
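The 1:1 negative sampling described in this section could be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_negatives`, the rejection-sampling strategy, and the fixed seed are assumptions, and any drug-disease pair absent from the positive set is treated as a negative.

```python
import random


def sample_negatives(positives, drugs, diseases, seed=0):
    """Draw as many unobserved (drug, disease) pairs as there are positives.

    Assumption: every pair not listed in `positives` counts as a negative.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    # Guard against asking for more negatives than unobserved pairs exist.
    max_negatives = len(drugs) * len(diseases) - len(positive_set)
    if len(positive_set) > max_negatives:
        raise ValueError("not enough unobserved pairs for a 1:1 ratio")
    negatives = set()
    while len(negatives) < len(positive_set):
        pair = (rng.choice(drugs), rng.choice(diseases))
        if pair not in positive_set:  # rejection step: skip known positives
            negatives.add(pair)
    return sorted(negatives)
```

With 2011 drugs and 2649 diseases there are 2011 × 2649 = 5327139 possible pairs, of which the 103099 positives are roughly 2%, so a rejection step of this kind would rarely need to retry.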
3 Method
The overall workflow of NIEE is shown in Fig. 1. The heterogeneous network is built from bioDDG. We design special meta-paths for drug nodes and disease nodes, with a one-to-one correspondence between the two sets of meta-paths. The neighborhoods of a given node are obtained by random walks along the chosen meta-paths on the heterogeneous network. The neighborhoods generated by a pair of meta-paths then interact with each other, and multiple meta-path pairs reveal different semantic information. We obtain the final edge embedding through intra-metapath aggregation and inter-metapath aggregation. To reinforce the information of the node pair to be predicted, the edge embedding is concatenated with the initial embeddings of the nodes at both ends of the edge. An MLP is then applied to the concatenated embedding to produce the prediction score of the node pair.
Fig. 1. The workflow of the NIEE
3.1 Network Construction
This study uses a HIN, represented as G = (V, E), where the set of node types A contains drug, gene, and disease nodes and the set of relation types R contains drug-disease and disease-gene associations. The network is heterogeneous, that is, it satisfies |A| + |R| > 2.
3.2 Sampling
Design Meta-path Pairs. Figure 2 presents the meta-paths used in this study, which are important tools for decomposing a HIN and obtaining its semantic information.
Fig. 2. Meta-path pairs
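As a rough illustration of the network defined in Section 3.1, a HIN can be held as a typed adjacency structure. The class `HIN`, its method names, and the toy node identifiers below are hypothetical, and the type labels R, I, and G are assumed to stand for drug, disease, and gene, as the meta-path names in Table 1 suggest.

```python
from collections import defaultdict


class HIN:
    """Minimal heterogeneous information network with typed nodes."""

    def __init__(self):
        self.node_type = {}          # node id -> type label ("R", "I", "G")
        self.adj = defaultdict(set)  # node id -> set of neighbor ids
        self.relations = set()       # unordered pairs of connected node types

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, u, v):
        # Edges are undirected; the relation type is derived from the end types.
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.relations.add(frozenset((self.node_type[u], self.node_type[v])))

    def neighbors_of_type(self, node, ntype):
        return sorted(n for n in self.adj[node] if self.node_type[n] == ntype)

    def is_heterogeneous(self):
        # Condition from Section 3.1: |A| + |R| > 2.
        return len(set(self.node_type.values())) + len(self.relations) > 2


# Toy network (invented identifiers): one drug, one disease, one gene.
hin = HIN()
hin.add_node("r1", "R")
hin.add_node("i1", "I")
hin.add_node("g1", "G")
hin.add_edge("r1", "i1")  # drug-disease association
hin.add_edge("i1", "g1")  # disease-gene association
```

Here the toy network has three node types and two relation types, so `hin.is_heterogeneous()` holds.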
To decouple the heterogeneous network without losing heterogeneous information, we designed asymmetric meta-paths inspired by [17], in which the two ends of a meta-path are the source-type and target-type nodes. In [17], pairwise random walks, used to achieve symmetry, start from the source and target nodes and reach the same middle objects. In our work, random walks guided by the meta-path are used to obtain the neighborhood of a node, rather than to estimate the probability that the source and target nodes meet along the meta-path. We therefore designed one-to-one corresponding meta-paths for the source and target nodes and used one-way random walks, so that the neighborhood of the selected drug node extends toward the disease nodes and the neighborhood of the chosen disease node extends toward the drug nodes. Our meta-path design has two main features. First, each meta-path connects a drug-disease node pair at its two ends, which differs from previous designs where a meta-path connects a pair of nodes of the same type. This approach enables the intuitive
extraction of drug-disease associations and preserves the intermediate nodes during the sampling process. Second, the meta-paths are designed in pairs, which facilitates the simulation of interactions between the neighborhoods of the given nodes. Specifically, the meta-paths designed for the source drug node are p: [RI, RIRI, RIGI], while those designed for the target disease node are q: [IR, IRIR, IGIR]; $p_i$ and $q_i$ are designed in opposite directions.
Neighborhoods Guided by Meta-paths. Given a source drug node o and a specific meta-path $p_i$, the neighborhood of node o consists of all the nodes encountered along the meta-path $p_i$ starting from node o. The neighborhood of node o therefore contains nodes of different types, carrying heterogeneous information and complete semantic information. To illustrate this, Fig. 3 provides an example where (b) highlights the neighborhood of drug node j guided by the RIGI meta-path, and (d) lists the sequences that can be formed by random walks along the RIGI meta-path starting from drug node j. The same principle applies to the disease node k. The amount of information extracted along different meta-paths can vary significantly (8 different sequences can be obtained by walking along the RIGI meta-path starting from j, while only 2 different sequences are obtained starting from k). Feeding these data directly into the model would reduce its precision because of the imbalance of information weights. Hence, we perform the same number of random walks for each meta-path to collect information.
Fig. 3. Neighborhoods guided by meta-paths for given nodes (take RIGI & IGIR as example)
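The meta-path guided sampling described above can be sketched in a few lines. The sketch assumes a plain dictionary adjacency list and a node-type map (both hypothetical data structures), picks the next node uniformly among the neighbors of the required type, and discards walks that hit a dead end. The toy graph is invented for illustration and is not the one in Fig. 3.

```python
import random


def metapath_walks(adj, node_type, start, metapath, num_walks, seed=0):
    """Sample up to `num_walks` node sequences that follow `metapath` from `start`.

    `metapath` is a sequence of type labels such as "RIGI". Fewer walks are
    returned if dead ends exhaust the bounded number of attempts.
    """
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first meta-path type")
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks * 10):  # bounded retries in case of dead ends
        if len(walks) == num_walks:
            break
        walk = [start]
        for next_type in metapath[1:]:
            candidates = sorted(
                n for n in adj[walk[-1]] if node_type[n] == next_type
            )
            if not candidates:
                break  # dead end: this attempt is discarded
            walk.append(rng.choice(candidates))  # uniform over typed neighbors
        if len(walk) == len(metapath):
            walks.append(walk)
    return walks


# Toy graph (invented): drug j, diseases i1 and i2, genes g1 and g2.
node_type = {"j": "R", "i1": "I", "i2": "I", "g1": "G", "g2": "G"}
adj = {
    "j": {"i1", "i2"},
    "i1": {"j", "g1"},
    "i2": {"j", "g1", "g2"},
    "g1": {"i1", "i2"},
    "g2": {"i2"},
}
walks = metapath_walks(adj, node_type, "j", "RIGI", num_walks=16)
```

Using the same `num_walks` for every meta-path gives each neighborhood the same number of rows, which is one simple way to realize the balancing argument made in the text.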
Random Walk. Consider a network $G = (V, E)$ and a meta-path $p_i: A_0, A_1, \ldots, A_k, \ldots, A_{I-1}$, where $I$ is the length of the meta-path, $A_k$ is a node type in the heterogeneous information network, and $\mathcal{A}_k$ is the set of nodes of type $A_k$. The random walk is generated according to Eq. 1:

$$P(n_k = x \mid n_{k-1} = s) = \begin{cases} \dfrac{1}{\operatorname{num}\left(\varepsilon_{A_k}(s)\right)}, & (s, x) \in E \text{ and } s \in \mathcal{A}_{k-1},\ x \in \mathcal{A}_k \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $n_k$ represents the node at step $k$ of the random walk, $\varepsilon_{A_k}(s)$ represents the set of nodes of type $A_k$ connected to node $s$, and $\operatorname{num}(\cdot)$ counts the elements of a set. $N_{p_i}(o)$ represents the set of sequences obtained
by performing random walks along the meta-path $p_i$ starting from node $o$. It is a matrix of size $\mathbb{R}^{C \times I}$, where $C$ is the number of random walks.
3.3 Neighborhood Interaction Module
In a HIN, a separate projection matrix is needed for each node type to map the nodes into a unified vector space. The projection is given by Eq. 2:

$$h_i = M_a \cdot h_i^a \tag{2}$$

where $M_a$ is the projection matrix for nodes of type $a$, $h_i^a$ is the original feature of node $i$ of type $a$, and $h_i$ is the transformed feature of node $i$. According to the sampling strategy, we obtain the neighborhood $N_{p_i}(s)$ of the source drug node $s$ guided by the meta-path $p_i$, and the neighborhood $N_{q_i}(t)$ of the target disease node $t$ guided by the meta-path $q_i$. The embedding matrices of the two neighborhoods are denoted $Z[N_{p_i}(s)]$ and $Z[N_{q_i}(t)]$, respectively. The Hadamard product of the two representation matrices then simulates the interaction between the neighborhoods for the specific meta-path pair. The result of the interaction is denoted $Z[N_{(p_i, q_i)}(st)]$, as shown in Eq. 3:

$$Z[N_{(p_i, q_i)}(st)] = Z[N_{p_i}(s), N_{q_i}(t)] = Z[N_{p_i}(s)] \circ \operatorname{reverse}\left(Z[N_{q_i}(t)]\right) \tag{3}$$

This yields $Z[N_{(p_i, q_i)}(st)]$, a matrix of size $\mathbb{R}^{C \times I \times E}$, where $C$ is the number of random walks, $I$ is the length of the meta-path, and $E$ is the dimensionality of the node embedding. Figure 5 in the additional file describes the interaction process with an example.
3.4 Aggregation Module
The aggregation module consists of two parts: intra-metapath and inter-metapath aggregation. In the previous section, we obtained $Z[N_{(p_i, q_i)}(st)]$, the embedding representation of the interaction between the neighborhoods of the drug-disease pair $(s, t)$. To obtain the embedding matrix $z^{(p_i, q_i)}(st)$, we need to aggregate the heterogeneous nodes within the meta-path. Different meta-path pairs reveal different semantic information; thus we also need to aggregate the embedding matrices generated by the different meta-path pairs to fuse this semantic information, and ultimately obtain the edge embedding $u(st)$ between the drug and disease nodes $(s, t)$.
Intra-metapath Aggregation. To perform the intra-metapath aggregation, we use a self-attention technique to determine the importance of each node within a specific meta-path pair $(p_i, q_i)$. Appendix Fig. 6 provides an illustration of how it works. Specifically, the attention coefficient is calculated by Eq. 4:

$$e_{ij} = \operatorname{LeakyReLU}\left(a^{T}\left[W x_i \,\Vert\, W x_j\right]\right) \tag{4}$$

where $x_i$ and $x_j$ are the representations of two nodes within the meta-path after interaction, $j \in N_i$, and $N_i$ represents all nodes within the meta-path formed by the interaction. $W$ is a linear transformation matrix used to transform the nodes after interaction, $a$ is the
vector used to calculate the weight, $a \in \mathbb{R}^{2 \times E}$, and $\Vert$ is the concatenation operation. $e_{ij}$ is the attention coefficient of node $j$ to node $i$ within the meta-path. Equation 5 is used for normalization:

$$\alpha_{ij} = \operatorname{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})} \tag{5}$$

By applying the attention coefficients to the nodes within the meta-path, we obtain the interaction representation $z^{(p_i, q_i)}(st)$ between the neighborhoods of the drug-disease pair $(s, t)$ guided by the meta-path pair $(p_i, q_i)$. The calculation is shown in Eq. 6:

$$z^{(p_i, q_i)}(st) = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W x_j\Big) \tag{6}$$

We use a multi-head attention mechanism, described in Eq. 7, to increase the stability of the learning outcomes:

$$z^{(p_i, q_i)}(st) = \sigma\Big(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{k} W^{k} x_j\Big) \tag{7}$$

where $K$ is the number of attention heads, $W^{k}$ represents the trainable parameter, and $z^{(p_i, q_i)}(st)$ is an embedding matrix of size $\mathbb{R}^{C \times E}$. At this point, we have the interaction representations under the different semantic interpretations given by the meta-path pairs $\{(p_0, q_0), \cdots, (p_i, q_i)\}$, which can be represented as $\{z^{(p_0, q_0)}(st), \cdots, z^{(p_i, q_i)}(st)\}$.
Inter-metapath Aggregation. Because different meta-path pairs carry different semantic content and contribute differently to the final embedding of a drug-disease edge, we need to assign different weights to different meta-path pairs. We first apply a non-linear transformation to the embedding obtained from a specific meta-path pair, and then apply a semantic-level attention vector to calculate the attention value at the semantic level, as shown in Eq. 8:

$$\omega^{(p_i, q_i)} = w^{T} \cdot \tanh\left(W_q \cdot z^{(p_i, q_i)}(st) + b_q\right) \tag{8}$$

where $w$ is the attention vector at the semantic level, and $\omega^{(p_i, q_i)}$ is the attention value calculated for $(p_i, q_i)$. Equation 9 shows how softmax is used to normalize the relevance of all meta-path pairs:

$$\beta^{(p_i, q_i)} = \operatorname{softmax}\left(\omega^{(p_i, q_i)}\right) = \frac{\exp\left(\omega^{(p_i, q_i)}\right)}{\sum_{k=1}^{K} \exp\left(\omega^{(p_k, q_k)}\right)} \tag{9}$$

where $K$ is the number of meta-path pairs. Finally, we obtain the embedding of the drug-disease edge, which incorporates all semantic information, as shown in Eq. 10:

$$u(st) = \sum_{i=1}^{K} \beta^{(p_i, q_i)} \cdot z^{(p_i, q_i)}(st) \tag{10}$$

where $u$ is a vector with $E$ dimensions.
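Equations 3 to 10 can be illustrated with a small pure-Python sketch. It makes several simplifying assumptions that are not taken from the paper: embeddings are nested lists, `reverse` is interpreted as reversing the order of positions along each target walk so that node types line up (for example IGIR becomes RIGI), the intra-metapath attention uses a single head with W fixed to the identity and omits σ, and the semantic scores ω are passed in directly instead of being computed by Eq. 8. All function names are hypothetical.

```python
import math


def hadamard_interaction(z_src, z_tgt):
    """Eq. 3 (sketch): source walks times position-reversed target walks.

    z_src, z_tgt: C walks x I positions x E dimensions, as nested lists.
    """
    return [
        [
            [a * b for a, b in zip(src_node, tgt_node)]
            for src_node, tgt_node in zip(src_walk, reversed(tgt_walk))
        ]
        for src_walk, tgt_walk in zip(z_src, z_tgt)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats (Eqs. 5 and 9)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def intra_attention(xs, a):
    """Eqs. 4-6 (sketch): single head, W = identity, sigma omitted.

    xs: list of node vectors with E dimensions; a: weight vector with 2E entries.
    Returns one attended vector per node i.
    """
    out = []
    for x_i in xs:
        # e_ij = LeakyReLU(a^T [x_i || x_j]); list `+` is concatenation here.
        scores = [
            leaky_relu(sum(w * v for w, v in zip(a, x_i + x_j))) for x_j in xs
        ]
        alpha = softmax(scores)
        out.append(
            [sum(al * x_j[d] for al, x_j in zip(alpha, xs)) for d in range(len(x_i))]
        )
    return out


def semantic_aggregate(z_per_pair, omega):
    """Eqs. 9-10 (sketch): softmax the scores, then take the weighted sum.

    z_per_pair: one E-dimensional vector per meta-path pair.
    omega: one precomputed semantic score per meta-path pair.
    """
    beta = softmax(omega)
    dim = len(z_per_pair[0])
    return [sum(b * z[d] for b, z in zip(beta, z_per_pair)) for d in range(dim)]
```

When all entries of `a` are zero, every attention coefficient is equal and `intra_attention` reduces to a plain average over the nodes, which is a convenient sanity check for the normalization in Eq. 5.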
3.5 Prediction Module
This module consists of two parts: obtaining the predicted score for the edge between a node pair and designing the loss function.
Predict Score. Based on the nodes' neighborhoods, we create the edge embedding between the source drug node and the target disease node. By concatenating it with the source drug node features and the target disease node features and applying a non-linear projection, we obtain the final prediction for the drug-disease pair, as shown in Eq. 11:

$$y(s, t) = \operatorname{sigmoid}\left(\operatorname{MLP}\left(h_s \,\Vert\, h_t \,\Vert\, u(s, t)\right)\right) \tag{11}$$

Loss Function. We use the binary cross-entropy function, given in Eq. 12, as our loss function:

$$L(Y, \hat{Y}) = -\sum_{i, j \in Y^{+} \cup Y^{-}} \left[ y_{ij} \log \hat{y}_{ij} + \left(1 - y_{ij}\right) \log\left(1 - \hat{y}_{ij}\right) \right] \tag{12}$$

where $\hat{y}_{ij}$ represents the predicted result for a drug-disease pair $(i, j)$, $y_{ij}$ represents the true label of the pair, $Y^{+}$ represents the set of positive samples, and $Y^{-}$ represents the set of negative samples. Our model is end-to-end: because the prediction part backpropagates optimization information to the embedding part, the model's performance is enhanced.
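The scoring and loss steps of Eqs. 11 and 12 might look like the sketch below. It is only an illustration under stated assumptions: the MLP is reduced to a single linear layer with hypothetical `weights` and `bias`, vectors are plain lists, and predictions are clipped by a small `eps` to keep the logarithm finite.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_score(h_s, h_t, u_st, weights, bias):
    """Eq. 11 (sketch): concatenate h_s, h_t, u(s, t) and score the result.

    Assumption: the MLP is a single linear layer followed by the sigmoid.
    """
    features = h_s + h_t + u_st  # list concatenation stands in for ||
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)


def bce_loss(labels, preds, eps=1e-12):
    """Eq. 12 (sketch): summed binary cross-entropy over all samples."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total
```

With all-zero weights and bias the score is exactly 0.5 for any pair, and two such predictions on one positive and one negative sample give a loss of 2 ln 2, the value expected from an uninformed classifier.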
4 Experiment
This section presents a detailed analysis of the results. Compared with other network embedding techniques, our model performs better than all of them on every evaluation metric.
4.1 Baselines
We use the standard measures AUC, AP, F1-score, Accuracy, Precision, and Recall to assess performance. Link prediction tasks are frequently evaluated using AUC and AP. Because our predictions provide a candidate list for later clinical trials, false positives in this list can have significant consequences if they are selected for trials; Precision is therefore an indispensable metric for us. We compute these measures for NIEE and the benchmark models and analyze the experimental data. The benchmark models are:
GAT [10]: This model uses a multi-head attention mechanism to gather information from neighbors and a weighted summation to integrate that information into the embedding of the current node.
HAN [13]: This GNN model uses a hierarchical attention technique for heterogeneous networks.
MAGNN [14]: This GNN model is based on attention and meta-graph convolution. It addresses the "early summarization" issue in HAN by using meta-graph convolution to capture the relationships between various types of nodes within a single meta-path.
FactorHNE [18]: This GNN model is based on the factorization mechanism. It captures various semantic relationships between heterogeneous nodes through factorization and constructs different semantic factor graphs to effectively aggregate these relationships.
For NIEE, the embedding dimensions of both the hidden and output layers are set to 128. We use a three-head attention mechanism and an Adam optimizer with a learning rate of 0.001 and a weight decay of 0.0009. The selected meta-path pairs are {(IR, RI), (IRIR, RIRI), (IGIR, RIGI)}, and 16 random walks are performed along each meta-path. The dataset is divided into training, validation, and test sets in a 7:1:2 ratio. For the other models, we adopt the parameters used in their original papers. To ensure fair comparisons, each model is run three times and the average values are reported.
4.2 Experiment Analysis
Table 2 and Fig. 4 provide the experimental results of NIEE and the baseline models.

Table 2. Results of the link prediction task from the experiment (%)

| Model | AUC | AP | F1-score | Accuracy | Precision | Recall |
|---|---|---|---|---|---|---|
| GAT | 90.37 ± 0.24 | 89.26 ± 0.23 | 83.09 ± 0.40 | 82.75 ± 0.27 | 81.46 ± 0.72 | 84.81 ± 1.43 |
| HAN | 91.74 ± 0.08 | 90.41 ± 0.13 | 84.37 ± 0.48 | 84.30 ± 0.27 | 83.95 ± 0.68 | 84.81 ± 1.62 |
| MAGNN | 92.19 ± 0.08 | 91.00 ± 0.06 | 85.26 ± 0.09 | 84.97 ± 0.05 | 83.66 ± 0.64 | 86.93 ± 0.87 |
| FactorHNE | 94.75 ± 1.45 | 89.81 ± 5.11 | 88.99 ± 1.66 | 89.09 ± 1.44 | 89.66 ± 1.44 | 88.41 ± 3.79 |
| NIEE | 95.92 ± 0.03 | 95.52 ± 0.02 | 89.71 ± 0.06 | 89.68 ± 0.05 | 90.21 ± 0.32 | 89.21 ± 0.38 |
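For readers who want to reproduce the threshold-based columns of Table 2, the four measures can be computed from labels and predicted scores as sketched below. The decision threshold of 0.5 and the function name are assumptions; the paper does not state how scores are binarized, and AUC and AP are omitted because they are computed from the ranking of scores and not from thresholded predictions.

```python
def classification_metrics(labels, scores, threshold=0.5):
    """Precision, Recall, F1-score, and Accuracy from binary labels and scores.

    Assumption: a score at or above `threshold` counts as a predicted edge.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(labels)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }
```

Precision is the fraction of predicted associations that are true, which is why it matters most when the predictions feed a candidate list for costly follow-up experiments.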
Several conclusions can be drawn from the results. Figure 4 compares the ROC and PR curves of the models, and Table 2 lists the six metrics for each model. All the compared GNN models characterize the edge between a node pair by a single real value obtained through a simple inner product of node embeddings, which is a very crude representation of an edge. We therefore propose a strategy based on neighborhood interaction to obtain edge embeddings between node pairs. The proposed algorithm can be computed in parallel for multiple drug-disease pairs, since the pairs are independent and do not affect each other during the computation.
We analyze the time complexity of the aggregation module and the interaction module, which are the core of the model. The time complexity of the aggregation module is $O(KV)$, where $K$ is the number of attention heads and $V$ is the total number of nodes. The time complexity of the interaction module is $O(C^2 I)$, where $C$ is the number of random walks and $I$ is the length of the meta-path. The results show that our model outperforms the competing models on all six metrics, with the largest margin in AP (95.52 versus at most 91.00 for the baselines). This suggests that our model offers a practical and useful method for tackling the edge characterization problem.
Fig. 4. NIEE’s performance in comparison to other benchmark models.
4.3 Parameter Sensitivity Analysis
In this section, we vary two parameters and examine how their values affect the performance of the model, using AUC and AP as metrics. Figure 5 displays the comparative results for each parameter.
The Embedding Size of the Hidden Layer. Figure 5(a) shows that both the AUC and AP scores improve as the hidden layer dimension grows, peaking at 256 dimensions. These results suggest that the vectors can hold more information as the hidden layer dimension rises. Beyond a certain point, however, the vectors become mixed with redundant information, and performance declines.
The Number of Random Walks Guided by Meta-paths. Figure 5(b) shows that both the AUC and AP curves rise steeply at first, then more slowly, and eventually flatten or even decrease slightly. As the number of random walks rises, the neighborhoods of the given nodes contain more information, providing a stronger basis for predicting edges between nodes. Once the information in the neighborhood reaches a threshold, further increases do not significantly improve performance and, because of the added noise, can even reduce it.
4.4 Case Study
To further confirm our model's reliability, this section presents a case study using NIEE to identify potential therapeutic drugs for Alzheimer's disease (AD). The model was trained on all positive samples in the bioDDG dataset and an equal number of negative samples. The 20 drugs with the highest scores predicted by the model were selected as
Fig. 5. Sensitivity analysis of parameters.
potential therapeutic drugs for AD, as shown in Appendix Table 1. Methylphenidate is reported to be a safe and effective drug for apathy in Alzheimer's disease: [19] conducted a multicenter randomized placebo-controlled clinical trial to demonstrate this. Dexamethasone acetate is a corticosteroid, and [20] proposed that intrathecal corticosteroids might become part of a multi-agent regimen for Alzheimer's disease and also have application in other neurodegenerative disorders. In summary, our model mined hidden information from the data associated with known AD-related drugs to predict new potential therapeutic drugs. Half of the predicted drugs have been validated by external authoritative databases such as CTD [21], while the remaining half can serve as a candidate list for pharmacologists to conduct further physical and animal experiments. This accelerates the process of drug repositioning.
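The ranking step of the case study can be sketched as a simple top-k selection over predicted scores. The function name, the drug identifiers, and the choice to exclude drugs already associated with the disease are assumptions made for illustration; the paper only states that the 20 highest-scoring drugs were selected.

```python
def top_k_candidates(scores, known_drugs, k=20):
    """Return the k highest-scoring drugs not already linked to the disease.

    scores: mapping from drug id to predicted score for one disease.
    known_drugs: drug ids with a known association (assumed to be skipped).
    """
    candidates = [(d, s) for d, s in scores.items() if d not in known_drugs]
    # Sort by descending score; ties are broken by drug id for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:k]
```

For example, `top_k_candidates({"d1": 0.9, "d2": 0.8, "d3": 0.95}, {"d3"}, k=2)` keeps `d1` and `d2` and skips the already-known `d3`.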
5 Conclusion

In this study, we propose NIEE, which uses neighborhood interactions to model edge embeddings for link prediction. Unlike previous models that model node-to-node interactions through inner products, our approach based on neighborhood interactions allows for a more comprehensive representation of information, which supports the accuracy of the prediction results. To achieve this, we specifically created meta-paths with the source node type and target node type as their two ends, so that the neighborhood derived by sampling directly aids link prediction. We develop an interaction module to calculate the interaction between neighborhoods, and an aggregation module to combine information within and between meta-paths. The resulting edge embedding includes not only information between heterogeneous nodes within meta-paths, but also the semantic information revealed by different meta-paths.

We evaluate the performance of NIEE against four other graph neural network models and investigate how parameter tuning affects overall performance. We also perform a case study to find possible medications for the treatment of Alzheimer's disease, some of which are verified in external databases. In general, NIEE achieves strong performance and dependability. At present, we sample negative samples randomly, which contributes little to the model. In future work, we will try a negative sampling method that introduces additional information to further improve our model.
Acknowledgements. This work was supported by the grants from the National Key R&D Program of China (2021YFA0910700), Shenzhen science and technology university stable support program (GXWD20220811170225001), Shenzhen Science and Technology Program (JCYJ20200109113201726), basic research general project of Shenzhen Science and technology innovation Commission of China (JCYJ20190808153011417), Guangdong Basic and Applied Basic Research Foundation (2021A1515012461 and 2021A1515220115).

Authors' Contributions. YJ designed the study, performed bioinformatics analysis and drafted the manuscript. All of the authors performed the analysis and participated in the revision of the manuscript. JL conceived of the study, participated in its design and coordination and drafted the manuscript. All authors read and approved the final manuscript.

Additional Files. Clear versions of all images are available in additional file: https://github.com/porvinci/NIEE.

Competing Interests. The authors declare that they have no competing interests.
References

1. Ashburn, T.T., Thor, K.B.: Drug repositioning: identifying and developing new uses for existing drugs. Nat. Rev. Drug Discov. 3(8), 673–683 (2004)
2. Meng, Y., Changcheng, L., Jin, M., Junlin, X., Zeng, X., Yang, J.: A weighted bilinear neural collaborative filtering approach for drug repositioning. Briefings Bioinform. 23(2), bbab581 (2022)
3. Cheng, F., Desai, R.J., Handy, D.E., et al.: Network-based approach to prediction and population-based validation of in silico drug repurposing. Nat. Commun. 9, 2691 (2018)
4. Fu, H., Huang, F., Liu, X., et al.: MVGCN: data integration through multi-view graph convolutional network for predicting links in biomedical bipartite networks. Bioinformatics 38(2), 426–434 (2022)
5. Perozzi, B., et al.: DeepWalk: online learning of social representations. In: Macskassy, S.A., et al. (eds.) The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, August 24–27, pp. 701–710. ACM, New York, NY, USA (2014)
6. Tang, J., Qu, M., et al.: LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. International World Wide Web Conferences Steering Committee, Florence (2015)
7. Grover, A., Leskovec, J.: Node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Association for Computing Machinery, New York (2016)
8. Wang, D., Cui, P., et al.: Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), pp. 1225–1234. Association for Computing Machinery, New York (2016)
9. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017). https://doi.org/10.48550/arXiv.1609.02907
10. Velickovic, P., et al.: Graph attention networks. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings. OpenReview.net (2018)
11. Dong, Y., et al.: metapath2vec: scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144. ACM, Halifax (2017)
12. Fu, T.-Y., Lee, W.-C., Lei, Z.: HIN2Vec: explore meta-paths in heterogeneous information networks for representation learning. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1797–1806. Singapore (2017)
13. Wang, X., et al.: Heterogeneous graph attention network. In: WWW 2019, The Web Conference, pp. 2022–2032. ACM, San Francisco (2019)
14. Fu, X., et al.: MAGNN: metapath aggregated graph neural network for heterogeneous graph embedding. In: WWW 2020: The Web Conference 2020, pp. 2331–2341. ACM/IW3C2, Taipei (2020). https://doi.org/10.1145/3366423.3380297
15. Jin, J., Qin, J., et al.: An efficient neighborhood-based interaction model for recommendation on heterogeneous graph. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 75–84. Virtual Event (2020)
16. Malacards Homepage. https://www.malacards.org/. Accessed 27 Mar 2023
17. Shi, C., Kong, X., Huang, Y., Yu, P.S., Bin, W.: HeteSim: a general framework for relevance measure in heterogeneous networks. IEEE Trans. Knowl. Data Eng. 26(10), 2479–2492 (2014)
18. He, M., et al.: Factor graph-aggregated heterogeneous network embedding for disease-gene association prediction. BMC Bioinf. 22(1), 165 (2021)
19. Fredericks, C.: Methylphenidate for apathy in Alzheimer disease—why should we care? JAMA Neurol. 78(11), 1311 (2021)
20. Alisky, J.M.: Intrathecal corticosteroids might slow Alzheimer's disease progression. Neuropsychiatr. Dis. Treat. 4(5), 831–833 (2008). https://doi.org/10.2147/ndt.s3685. PMID: 19183775; PMCID: PMC2626920
21. The Comparative Toxicogenomics Database | CTD Homepage. http://ctdbase.org/. Accessed 31 Mar 2023
A Novel Descriptor and Molecular Graph-Based Bimodal Contrastive Learning Framework for Drug Molecular Property Prediction

Zhengda He1,2, Linjie Chen2, Hao Lv2, Rui-ning Zhou2, Jiaying Xu2, Yadong Chen2, Jianhua Hu2, and Yang Gao1(B)

1 Nanjing University, Nanjing, Jiangsu, China
[email protected] 2 China Pharmaceutical University, Nanjing, Jiangsu, China
Abstract. In AI drug discovery, molecular property prediction is critical. Two main molecular representation methods in molecular property prediction models, descriptor-based and molecular graph-based, offer complementary information, but face challenges like representation conflicts and training imbalances when combined. To counter these issues, we propose a two-stage training process. The first stage employs a self-supervised contrastive learning scheme based on descriptors and graph representations, which pre-trains the encoders for the two modal representations, reducing bimodal feature conflicts and promoting representational consistency. In the second stage, supervised learning using target attribute labels is applied. Here, we design a multi-branch predictor architecture to address training imbalances and facilitate decision fusion. Our method, compatible with various graph neural network modules, has shown superior performance on most of the six tested datasets. Keywords: Molecular Property Prediction · Graph Neural Networks · Molecular Descriptors · Molecular Contrastive Learning · Drug Discovery
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 700–715, 2023. https://doi.org/10.1007/978-981-99-4749-2_60

1 Introduction

Drug development is costly and time-consuming, and candidates face high failure rates even after entering clinical studies. Accurate molecular property prediction can make drug development faster and cheaper, and it is widely applied in virtual screening, drug repositioning, preclinical studies, and drug design [1, 4]. Artificial intelligence has proven effective in this area [1, 2], which has become a new machine learning research hotspot [3]. The key to accurate molecular property prediction is to secure an efficient molecular representation and to build a dependable machine learning model around it. Enhanced, information-rich representations and well-designed models can considerably improve the accuracy of drug molecule property prediction, thus influencing drug discovery and development. This is a foundational study in drug discovery.

Molecular property prediction utilizes several molecular representations:

1. Molecular descriptor [4]: Quantitative molecular characteristics, including theoretical and experimental descriptors and molecular fingerprints. Descriptor-based machine learning models, like Deep Neural Network (DNN) [5], eXtreme Gradient Boosting (XGBoost) [6], and Random Forest (RF), have found success, especially in QSAR modeling [4].
2. Molecular graph structure representation: A 2D representation reflecting the topological structure and connection information of molecules. A Graph Neural Network (GNN) [7, 8] learns graph embeddings from such representations, aiding molecular property prediction [8].
3. Linear and 3D representations: The Simplified Molecular Input Line Entry System (SMILES) converts a compound structure into text, facilitating property learning through Natural Language Processing (NLP). 3D voxel representations are used to predict properties via 3D CNNs.

The most widely used models are descriptor-based and molecular graph-based. GNNs and descriptor-based methods both have strengths and weaknesses [7, 8, 19, 20]. Descriptors provide global information, carry expert knowledge, require less data, compute quickly, and differentiate isomers. Their drawbacks are the complexity of descriptor selection and the limitation of fixed-length features. Graph representations give 2D topology and atomic details, require less feature engineering, and are highly transferable, but suffer from oversmoothing and are difficult to interpret. Combining these representations is necessary for robustness, as a single representation struggles to characterize the vast molecular space of 10^60 molecules. Most current studies use a single molecular representation. Only a handful of pioneering studies by reputable institutions have attempted simple concatenation of two feature types [25, 26]. These include the widely cited work of MIT [25] and the Technical University of Munich [26].
However, merging these different representations is challenging:

(1) Representation conflicts: The two representations focus on global expert knowledge and molecular topology, respectively, leading to different representation spaces. Simple feature concatenation can therefore produce conflicts between the representation spaces of the two modalities and degrade prediction performance.
(2) Training imbalance of the encoders: The encoders corresponding to the two representations, such as DNN and GNN, differ in training difficulty. Direct concatenation can leave the two encoders unevenly trained, making one modal encoder more effective than the other. The predictor then comes to prefer one modal feature, so that the information from one modality dominates instead of the two modalities complementing each other.

The innovations we propose to address the two challenges above are as follows.

(1) Pre-training stage of bimodal contrastive learning: To address the first challenge, we add a pre-training stage before the traditional molecular supervised learning. We propose a novel self-supervised contrastive learning scheme based on descriptors and molecular graphs to pre-train the encoders of the two modal representations. The objective is to maximize the similarity between the two modal encoder representations of the same molecule in the space of the projection variables. Our proposed scheme overcomes the conflict between the bimodal features and improves their consistency.
(2) Multi-branch predictor architecture: To address the second challenge, we design a bimodal multi-branch predictor architecture for the supervised learning stage of molecular property prediction, which consists of a descriptor branch predictor, a graph neural network branch predictor, and a joint branch predictor. These predictors use independent outputs, which can enhance each branch's generalization ability and solve the training imbalance problem. This also provides a decision fusion that gives more accurate predictions.

Our proposed method is flexible and can employ various graph neural network modules. We benchmarked it on six datasets against various machine learning algorithms and graph neural network models. Experimental results show that our method achieves the best performance on most datasets.

Our contributions:

1. To the best of our knowledge, we propose the first self-supervised contrastive learning scheme based on descriptors and molecular graphs and use it to pre-train the two modal encoders of molecules.
2. We propose a bimodal multi-branch predictor structure that solves the problem of unbalanced training of encoder models for different modal molecular representations in supervised learning of molecular property prediction.
3. The proposed system is end-to-end, with high integration and accuracy, and applies to various GNN modules. Our system is cost-effective and achieves high performance compared to current leading molecular property prediction models.
2 Related Work

2.1 Descriptor Development

Over the past decades, numerous computational and experimental data have been transformed into descriptors. Research on descriptor generation provides an ever-growing resource of expert knowledge and experimental prior knowledge. Many commercial and open-source software packages calculate molecular descriptors [10, 11], such as CDK Descriptor, DRAGON, MOE [12], RDKit [10], and others.

2.2 Molecular Graph Supervised Learning

The Graph Convolutional Network (GCN) applies Convolutional Neural Network (CNN) concepts to graph-structured data and has proven effective in quantitative structure-activity relationship prediction and drug target interaction prediction [13]. The Graph Attention Network (GAT) enhances the expressiveness of graph neural networks by adaptively allocating attention weights to different neighbors, boosting drug prediction performance [14]. AttentiveFP employs atomic-level and molecular-level attention mechanisms for local and global molecular feature learning, enhancing the interpretability of molecular properties [15]. However, graph neural networks are limited by their reliance on atomic-level feature inputs, which hinders performance improvements in molecular property prediction. Although advanced AI technologies are being used, including improved attention mechanisms, Transformer models [9], large-scale pre-training models [9, 16–18] on tens of millions of molecules, and semi-supervised and self-supervised techniques, improvements in prediction performance are still limited.
2.3 Molecular Contrastive Learning

Contrastive learning aims to learn representations by comparing positive example pairs with negative examples. In computer vision and natural language processing, MoCo [23] and SimCLR have caused a research boom. In the molecular property prediction field, however, there are only very few studies [24]. Like the work [24] published in Nature Machine Intelligence, these studies build self-supervised contrastive learning tasks on a single modal molecular representation. They perform data augmentation by adding and removing chemical bonds in the graph structure representation of the same molecule to construct positive sample pairs. However, subtle changes in molecular structure can cause unpredictable changes in properties. The molecular contrastive learning scheme we propose here offers a new idea: it forms contrastive pairs from the descriptor and graph representations of the same molecule, avoiding the unpredictable property changes that come from altering molecular structures.
Fig. 1. Descriptor and molecular graph based bimodal contrastive learning framework diagram. A. Stage 1: Self-supervised contrastive learning pre-training of the bimodal molecular encoders; no molecular label information is used at this stage. B. Stage 2: Supervised learning of molecular attribute prediction based on multi-branch predictors.
3 Method

Our proposed bimodal molecular contrastive learning method is shown in Fig. 1. It is divided into two stages.

The first stage pre-trains the molecular encoders using bimodal contrastive learning, as shown in Fig. 1(A). In this stage, we do not use the label information of the molecule. We construct a bimodal encoder consisting of the descriptor encoder and the graph representation encoder. The two encoders have independent inputs and different feature extraction network models, and they generate molecular features for different modalities. Each encoder is connected to a nonlinear projection head, which maps the features extracted by the encoder to a hidden space. Both modal encoders are trained using only the self-supervised contrastive learning loss function of the two modal representations in this hidden space. This pre-training stage improves the consistency between the two modal representations.

The second stage is the supervised learning stage for molecular attribute prediction, as shown in Fig. 1(B). We perform supervised fine-tuning using the target molecular attribute labels. In this stage, we construct a multi-branch predictor and connect it to the bimodal encoders for supervised molecular learning. Features from the two modalities are fed to three predictors that jointly generate the molecular attribute prediction. This approach addresses the training imbalance between the two different encoder models and the dominance of a single modal representation, while also providing a decision fusion method that gives more accurate attribute predictions.

3.1 Bimodal Molecular Encoders

Descriptor Encoder: A feature vector composed of hand-selected molecular descriptors is used as input, and the DNN module $f_\theta(\cdot)$ extracts the high-level descriptor features $h_D$ from it.
Graph Neural Network Encoder: The molecular graph constructed from the SMILES expression of a molecule is used as input, and the graph-level features $h_G$ are output by the readout layer after processing through the graph neural network $f_\phi(\cdot)$. A molecule can be represented as a graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges. A node $v$ in the graph represents an atom in the molecule, and an edge $e_{u,v}$ represents the chemical bond between atom $u$ and atom $v$. $x_v \in \mathbb{R}^d$ is the feature vector of node $v$, encoded from atomic features: atomic symbol, atomicity, formal charge, radical electrons, hybridization, aromaticity, hydrogen, chirality, and chirality type. $x^e_{uv} \in \mathbb{R}^b$ is the edge feature vector, encoded from bond features (bond type, conjugation, ring, and stereo). These vectors are the input to the graph neural network.

The graph neural network iteratively updates node features and edge features layer by layer to form high-level hidden features of nodes and edges. A node in the $k$th layer receives information from nodes up to $k$ hops away. Each update step includes two operations: an aggregation operation and a combine operation. The aggregation operation transfers messages from the nodes in $v$'s neighbor set $N(v)$ to $v$. The graph neural network variants differ in the aggregation and combine operations they use, which lets them achieve different performance and purposes.

$$a_v^{(l)} = \mathrm{AGGREGATE}^{(l)}\left(\left\{\left(h_v^{(l-1)}, h_u^{(l-1)}, x^e_{uv}\right) : u \in N(v)\right\}\right) \tag{1}$$

$$h_v^{(l)} = \mathrm{COMBINE}^{(l)}\left(h_v^{(l-1)}, a_v^{(l)}\right) \tag{2}$$

The graph readout function reads the hidden features to form graph-level features, which represent the whole molecule.

$$h_G = \mathrm{READOUT}\left(\{h_v \mid v \in V\}\right) \tag{3}$$
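The aggregate, combine, and readout steps of Eqs. (1)–(3) can be illustrated with a minimal sketch. It is not the authors' implementation: the concrete choices (a bond-weighted sum for AGGREGATE, an element-wise ReLU of the sum for COMBINE, a mean for READOUT, a scalar bond feature, and no learned weights) and all function names are assumptions made only to keep the example short.

```python
from typing import Dict, FrozenSet, List

Vector = List[float]

def aggregate(h: Dict[int, Vector], bond: Dict[FrozenSet[int], float],
              neighbors: Dict[int, List[int]], v: int) -> Vector:
    """Eq. (1) with an assumed AGGREGATE: bond-weighted sum of neighbour states."""
    a = [0.0] * len(h[v])
    for u in neighbors[v]:
        w = bond[frozenset((u, v))]  # scalar stand-in for the edge feature x^e_uv
        for i, value in enumerate(h[u]):
            a[i] += w * value
    return a

def combine(h_v: Vector, a_v: Vector) -> Vector:
    """Eq. (2) with an assumed COMBINE: ReLU of the element-wise sum."""
    return [max(0.0, x + y) for x, y in zip(h_v, a_v)]

def readout(h: Dict[int, Vector]) -> Vector:
    """Eq. (3) with an assumed READOUT: mean over all node states."""
    dim = len(next(iter(h.values())))
    return [sum(vec[i] for vec in h.values()) / len(h) for i in range(dim)]

def message_passing(x: Dict[int, Vector], bond: Dict[FrozenSet[int], float],
                    neighbors: Dict[int, List[int]], num_layers: int) -> Vector:
    """Run num_layers rounds of aggregate/combine, then read out a graph vector."""
    h = {v: list(vec) for v, vec in x.items()}
    for _ in range(num_layers):
        a = {v: aggregate(h, bond, neighbors, v) for v in h}
        h = {v: combine(h[v], a[v]) for v in h}
    return readout(h)

# Toy C-C-O chain: two-dimensional node features flag "carbon" and "oxygen".
x = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
bond = {frozenset((0, 1)): 1.0, frozenset((1, 2)): 1.0}
h_G = message_passing(x, bond, neighbors, num_layers=2)
```

With two layers, the first carbon receives a contribution from the oxygen two bonds away, which is the "information from up to $k$ hops" behaviour described above. GCN, GAT, and AttentiveFP replace these fixed operations with learned, and in the latter two cases attention-weighted, ones.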
Graph neural network encoders can be used with different architectures of graph neural networks. Our experiments use three types of networks to test the generality of the joint learning architecture, namely GCN, GAT, and AttentiveFP. In our architecture, the graph neural network is an embedded component that can easily be replaced.

3.2 Self-supervised Contrastive Learning Pre-training Stage of Bimodal Molecular Encoder

This stage pre-trains the bimodal molecular encoder using molecular contrastive learning. It aims to reduce the conflicts between bimodal features, improve the consistency between the two modal representations, and give the bimodal encoder an initial pre-training, as shown in Fig. 1(A).

In the self-supervised contrastive learning pre-training stage, the bimodal encoder $(f_\theta(\cdot), f_\phi(\cdot))$ and the nonlinear projection heads $(g_\theta(\cdot), g_\phi(\cdot))$ are connected and trained using the molecular self-supervised contrastive loss. The descriptor and graph representations $(z_D, z_G)$ of the same molecule should be similar in some latent space, where $z_D = g_\theta(f_\theta(x_D))$, $z_G = g_\phi(f_\phi(x_G))$, and $x_D$ and $x_G$ are the descriptor input and the graph representation input of the molecule, respectively. Therefore, we set the self-supervised contrastive learning loss function as a consistency loss between the two representations of the same molecule. We first project the encoder output features of both branches into the same space through their respective projection layers and then compare the features of the two modalities of the molecules in a batch. We make the descriptor feature and the graph feature of the same molecule as close as possible, and as far as possible from any modal feature of the other molecules. The self-supervised contrastive loss function $L^{self}$ used for training is:

$$L^{self} = \sum_{i=1}^{2N} L_i^{self}, \qquad L_i^{self} = -\log \frac{\exp\left(z_i \cdot z_{j(i)} / \tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{i \neq k} \cdot \exp\left(z_i \cdot z_k / \tau\right)} \tag{4}$$
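To make Eq. (4) concrete, the following sketch evaluates the loss for a small batch of projected features. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the L2 normalization of each projected vector (so that the dot product is a cosine similarity) is an assumption that the text does not state.

```python
import math
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalize(v: Vector) -> Vector:
    """Scale v to unit length (assumed; the text does not say whether z is normalized)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def bimodal_contrastive_loss(z_desc: List[Vector], z_graph: List[Vector],
                             tau: float = 0.5) -> float:
    """Sum of L_i^self over all 2N projected features, following Eq. (4).

    z_desc[m] and z_graph[m] are the two modal features of molecule m, so the
    positive partner j(i) of feature i sits N positions away in the stacked list.
    """
    n = len(z_desc)
    z = [normalize(v) for v in z_desc] + [normalize(v) for v in z_graph]
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # other-modality feature of the same molecule
        positive = math.exp(dot(z[i], z[j]) / tau)
        # The indicator 1_{i != k} removes only the self-similarity term.
        denominator = sum(math.exp(dot(z[i], z[k]) / tau)
                          for k in range(2 * n) if k != i)
        total += -math.log(positive / denominator)
    return total

# Two molecules with 2-D projections: matched modalities versus swapped ones.
aligned = bimodal_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]])
swapped = bimodal_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]])
```

The loss is smaller when the descriptor and graph features of the same molecule point in the same direction (`aligned`) than when they are paired with the wrong molecule (`swapped`), which is the consistency that the pre-training stage optimizes for.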
In Eq. (4), $\tau$ is an adjustable temperature coefficient greater than 0, $z_i$ denotes one modal feature of the $i$th molecule, and $z_{j(i)}$ is the other modal feature of the same molecule. $N$ is the number of molecules. A total of $2N$ features are generated from the descriptor branch and the GNN branch.

3.3 Supervised Learning Stage for Molecular Attribute Prediction Based on Multi-branch Predictors

This stage is the supervised learning stage for molecular property prediction, as shown in Fig. 1(B). We perform supervised fine-tuning using the target molecular property labels. At this stage, we remove the projection heads, construct the multi-branch predictor, and connect it to the bimodal encoder for supervised molecular learning. Instead of directly concatenating the two high-level molecular features and feeding them to a single predictor, we use a three-branch predictor, i.e., a descriptor branch predictor, a molecular graph branch predictor, and a joint branch predictor. The descriptor branch predictor takes as input the high-level descriptor features extracted by the DNN encoder. The molecular graph branch predictor takes as input the graph-level features output by the readout layer of the graph neural network. The joint branch predictor uses joint features obtained by concatenating the high-level descriptor features and the graph-level features; this step implements feature fusion.

Given a molecular sample $i$, $x_D^{(i)}$ represents the descriptor input vector and $x_G^{(i)}$ represents the molecular graph input. The three predictors of the bimodal contrastive learning architecture give three outputs $\hat{y}_D^{(i)}$, $\hat{y}_G^{(i)}$, and $\hat{y}_{Joint}^{(i)}$, respectively. We weight the three outputs to obtain the predicted value:

$$\hat{y}^{(i)} = \lambda_1 \hat{y}_D^{(i)} + \lambda_2 \hat{y}_{Joint}^{(i)} + \lambda_3 \hat{y}_G^{(i)} \tag{5}$$
The weight parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ are obtained by hyperparameter search and vary with the experimental dataset. If only the joint predictor is used, the model faces uneven training of the two modalities, which makes it challenging to achieve high performance. The other two branch predictors impose appropriate constraints on the different representations so that the features extracted by the two branches are effective and uniform. The total loss function of the predictor is defined as follows:

$$Loss_{Predictor} = L(y, \hat{y}) = L\left(y, \lambda_1 \hat{y}_D + \lambda_2 \hat{y}_{Joint} + \lambda_3 \hat{y}_G\right) \tag{6}$$

where $y$ is the label of the sample, $\hat{y}$ is the weighted prediction, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the weight parameters described previously. $L(\cdot)$ uses different types of loss functions depending on the kind of task: cross entropy for classification problems and RMSE for regression problems. The predictor loss function is calculated from the outputs of the three branch predictors and the true labels. When the gradient is back-propagated, the descriptor term and the graph term in the predictor loss function affect only their respective branch predictor and encoder, without affecting the other branch. The joint branch output term affects the joint predictor and both encoders. The weighted combination of the three predictors reflects the idea of decision fusion in ensemble learning methods. With the above architecture, we can effectively use the two representations jointly.
4 Experiments

4.1 Datasets

We test our model on public datasets [19] that are commonly used for molecular property prediction tasks. These datasets include various regression and classification targets, covering quantum mechanics, physical chemistry, biophysics, and physiology. Three datasets in our experiment are used for regression tasks (ESOL, FreeSolv, and Lipop), and the remaining datasets are used for classification tasks (HIV, BACE, and Tox21). The statistics of the datasets are shown in Table 1.

Table 1. Experimental datasets

Datasets | Task | Task Type | Molecules | Nature of forecast
ESOL | 1 | Regression | 1128 | Water solubility
FreeSolv | 1 | Regression | 643 | Hydration free energy
Lipop | 1 | Regression | 4200 | logP
BACE | 1 | Classification | 1522 | Human secretase
HIV | 1 | Classification | 41127 | HIV replication
Tox21 | 12 | Classification | 8014 | Toxicity
4.2 Data Preprocessing

Table 2. Node (atom) feature encoding method

Index | Description
0–15 | One-hot encoding of the atom element: 'B', 'C', 'N', 'O', 'F', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'Te', 'I', 'At', 'unknown'
16–21 | One-hot encoding of the number of covalent bonds of the atom
22 | Indicates whether the atom has a positive charge or not
23 | Number of free radical electrons of the atom
24–29 | One-hot encoding of atomic hybridization: SP, SP2, SP3, SP3D, SP3D2
30 | Aromaticity of the atom
31–35 | One-hot encoding of the number of connected hydrogen atoms
36–38 | Indicates whether the atom is chiral, chiral type, other specific properties
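The layout in Table 2 can be reproduced with a short sketch that builds the 39-dimensional atom vector from already-extracted atom properties. This is not the authors' featurizer (the paper obtains these features through RDKit and DGL); the function names are hypothetical, and three details are assumptions because the table does not spell them out: the sixth hybridization slot is treated as "other", bond and hydrogen counts outside the listed ranges leave their one-hot block empty, and indices 36–38 are read as a chirality flag followed by a one-hot R/S type.

```python
from typing import List, Optional, Sequence

ELEMENTS = ["B", "C", "N", "O", "F", "Si", "P", "S",
            "Cl", "As", "Se", "Br", "Te", "I", "At"]  # 16th slot: 'unknown'
HYBRIDIZATIONS = ["SP", "SP2", "SP3", "SP3D", "SP3D2"]  # 6th slot: 'other' (assumed)

def one_hot(value, choices: Sequence, extra_slot: bool) -> List[float]:
    """One-hot encode value; unmatched values use a trailing slot if extra_slot."""
    vec = [0.0] * (len(choices) + (1 if extra_slot else 0))
    if value in choices:
        vec[list(choices).index(value)] = 1.0
    elif extra_slot:
        vec[-1] = 1.0
    return vec

def atom_features(symbol: str, num_bonds: int, formal_charge: int,
                  num_radical_electrons: int, hybridization: str,
                  is_aromatic: bool, num_hydrogens: int,
                  chirality: Optional[str] = None) -> List[float]:
    """Build the atom feature vector in the index order of Table 2."""
    features: List[float] = []
    features += one_hot(symbol, ELEMENTS, extra_slot=True)               # 0-15
    features += one_hot(num_bonds, range(6), extra_slot=False)           # 16-21
    features.append(1.0 if formal_charge > 0 else 0.0)                   # 22
    features.append(float(num_radical_electrons))                        # 23
    features += one_hot(hybridization, HYBRIDIZATIONS, extra_slot=True)  # 24-29
    features.append(1.0 if is_aromatic else 0.0)                         # 30
    features += one_hot(num_hydrogens, range(5), extra_slot=False)       # 31-35
    features.append(1.0 if chirality in ("R", "S") else 0.0)             # 36 (assumed)
    features += one_hot(chirality, ["R", "S"], extra_slot=False)         # 37-38 (assumed)
    return features

# Hydroxyl oxygen of ethanol, with two covalent bonds counted (C-O and O-H):
# neutral, sp3, non-aromatic, one attached hydrogen, not a stereocentre.
oxygen = atom_features("O", 2, 0, 0, "SP3", False, 1)
```

The resulting vector has length 39, matching indices 0–38 in Table 2, and for this oxygen atom only the element, bond-count, hybridization, and hydrogen-count blocks contain a 1.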
(1) Data Cleaning. The original datasets contain duplicate records, records with inconsistent labels, and compounds with inappropriate structures. We used the sdwash module in MOE to clean the compound data and delete inorganic substances. We then deleted compounds that RDKit cannot recognize, duplicate records, and records with inconsistent labels.
Table 3. Edge (bond) feature encoding method

Index | Description
0–3 | One-hot encoding of the bond type: 'single bond', 'double bond', 'triple bond', 'benzene ring'
4 | Whether the bond is conjugated
5 | Whether the bond is in a ring
6–9 | One-hot encoding of the stereo configuration: no isomerism, other isomerism, cis isomerism, and trans isomerism
(2) Descriptor Calculation Tools. We used MOE (version 2014.0901) to calculate 192 2D descriptors and PaDEL-Descriptor (version 2.1) to calculate 881 PubChem fingerprints (PubchemFP) and 307 substructure fingerprints (SubFP). The descriptor input vector for each molecule therefore has length 1380.

(3) Feature Selection Strategy. Null descriptor data is handled by deleting feature columns that contain null values. For feature selection, descriptors are treated as high-dimensional molecular features, and traditional feature screening is applied with two filtering criteria: feature variance and feature correlation coefficient.

(4) Molecular Graph Feature Calculation. The GNN takes graph structure data derived from molecular SMILES. Standard SMILES are generated using RDKit, while DGL produces the graph structure inputs with atom-level and bond-level features. See Tables 2 and 3 for the encoding methods.

4.3 Experimental Setup

Following [15, 19], we divide each dataset into training, validation, and test sets (8:1:1 ratio) and perform multiple experiments with different random seeds. Hyperparameters are searched using Hyperopt. Most models train for 300 epochs, with early stopping after 50 consecutive epochs without improvement. Each task is repeated 10 times to reduce uncertainty.

4.4 Baseline Methods

Six single-modal models were chosen for comparison with our proposed bimodal contrastive learning models: GCN, GAT, AttentiveFP, DNN, XGBoost, and RF. GCN, GAT, and AttentiveFP use the molecular graph representation, while DNN, XGBoost, and RF use the descriptor representation.

4.5 Metric

For classification tasks, we use the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PRC) for evaluation. For regression tasks, we use root mean square error (RMSE) and mean absolute error (MAE) for evaluation.
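As a reference for how three of these metrics are defined, the sketch below computes RMSE, MAE, and AUC-ROC from scratch. The function names are illustrative, and the pairwise (rank-based) formulation of AUC-ROC is a standard equivalent definition chosen here for brevity; the paper does not state which library was used for evaluation, and AUC-PRC is omitted.

```python
import math
from typing import Sequence

def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error (lower is better)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Small worked example with made-up predictions.
example_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

In the example, three of the four positive-negative pairs are ranked correctly, so the AUC-ROC is 0.75. The quadratic pairwise loop is fine for illustration, but a sorting-based implementation is preferable for datasets as large as HIV.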
Fig. 2. Performance comparison of all models on the classification test sets (higher is better). The bimodal representation model uses our proposed two-stage approach.

Fig. 3. Performance comparison of all models on the regression test sets (lower is better). The bimodal representation model uses our proposed two-stage approach.
5 Results

5.1 Performance Comparison

Tables 4 and 5 report the results of experiments repeated 10 times on 6 datasets for 3 bimodal contrastive learning models based on joint representation input and 6 models based on single representation input. Results in bold indicate that the proposed model outperforms the corresponding single model. Figures 2 and 3 compare the performance of the various methods.

1) The proposed bimodal contrastive learning model wins out across the board. Table 4 shows that the bimodal contrastive learning model achieves the highest AUC-ROC score on all test sets for the classification task. Table 5 shows that it also achieves the lowest error, that is, the smallest RMSE value, on all test sets for the regression task. Each bimodal contrastive learning model has higher predictive performance than the corresponding single modal representation methods that compose it. This performance improvement shows the advantage of a joint representation bimodal contrastive learning method that incorporates more complementary information.

Table 4. Performance (test set) comparison of 3 bimodal contrastive learning architecture models and 6 single-input models on classification datasets (higher is better)

Dataset | Tasks | Model | AUC_ROC | AUC_PRC
BACE | 1 | DNN | 0.878 ± 0.024 | 0.824 ± 0.042
BACE | 1 | XGBoost | 0.846 ± 0.013 | 0.802 ± 0.019
BACE | 1 | RF | 0.761 ± 0.015 | 0.693 ± 0.027
BACE | 1 | GCN | 0.894 ± 0.018 | 0.864 ± 0.032
BACE | 1 | GAT | 0.879 ± 0.015 | 0.833 ± 0.028
BACE | 1 | AttentiveFP | 0.869 ± 0.015 | 0.816 ± 0.029
BACE | 1 | Contrastive DNN + GCN (ours) | 0.904 ± 0.016 | 0.870 ± 0.035
BACE | 1 | Contrastive DNN + GAT (ours) | 0.895 ± 0.017 | 0.844 ± 0.034
BACE | 1 | Contrastive DNN + AttentiveFP (ours) | 0.887 ± 0.024 | 0.836 ± 0.034
HIV | 1 | DNN | 0.777 ± 0.031 | 0.339 ± 0.042
HIV | 1 | XGBoost | 0.787 ± 0.012 | 0.369 ± 0.021
HIV | 1 | RF | 0.773 ± 0.007 | 0.258 ± 0.017
HIV | 1 | GCN | 0.811 ± 0.028 | 0.331 ± 0.028
HIV | 1 | GAT | 0.807 ± 0.024 | 0.327 ± 0.054
HIV | 1 | AttentiveFP | 0.818 ± 0.030 | 0.364 ± 0.049
HIV | 1 | Contrastive DNN + GCN (ours) | 0.826 ± 0.016 | 0.414 ± 0.042
HIV | 1 | Contrastive DNN + GAT (ours) | 0.820 ± 0.021 | 0.411 ± 0.044
HIV | 1 | Contrastive DNN + AttentiveFP (ours) | 0.821 ± 0.024 | 0.436 ± 0.040
Tox21 | 12 | DNN | 0.837 ± 0.015 | 0.426 ± 0.029
Tox21 | 12 | XGBoost | 0.695 ± 0.030 | 0.157 ± 0.008
Tox21 | 12 | RF | 0.759 ± 0.025 | 0.251 ± 0.029
Tox21 | 12 | GCN | 0.830 ± 0.011 | 0.442 ± 0.039
Tox21 | 12 | GAT | 0.836 ± 0.012 | 0.407 ± 0.024
Tox21 | 12 | AttentiveFP | 0.823 ± 0.013 | 0.356 ± 0.031
Tox21 | 12 | Contrastive DNN + GCN (ours) | 0.855 ± 0.014 | 0.480 ± 0.043
Tox21 | 12 | Contrastive DNN + GAT (ours) | 0.855 ± 0.011 | 0.465 ± 0.037
Tox21 | 12 | Contrastive DNN + AttentiveFP (ours) | 0.844 ± 0.011 | 0.451 ± 0.031
Table 5. Performance (test set) comparison of 3 bimodal contrastive learning architecture models and 6 single-input models on regression datasets (lower is better)

| Dataset | Tasks | Model | RMSE | MAE |
|---|---|---|---|---|
| ESOL | 1 | DNN | 0.358 ± 0.046 | 0.261 ± 0.027 |
| ESOL | 1 | XGBoost | 0.361 ± 0.026 | 0.249 ± 0.011 |
| ESOL | 1 | RF | 0.404 ± 0.019 | 0.289 ± 0.013 |
| ESOL | 1 | GCN | 0.338 ± 0.107 | 0.216 ± 0.055 |
| ESOL | 1 | GAT | 0.377 ± 0.092 | 0.245 ± 0.059 |
| ESOL | 1 | AttentiveFP | 0.279 ± 0.039 | 0.175 ± 0.021 |
| ESOL | 1 | Contrastive DNN + GCN (ours) | 0.286 ± 0.042 | 0.183 ± 0.024 |
| ESOL | 1 | Contrastive DNN + GAT (ours) | 0.299 ± 0.051 | 0.187 ± 0.024 |
| ESOL | 1 | Contrastive DNN + AttentiveFP (ours) | 0.275 ± 0.041 | 0.177 ± 0.019 |
| FreeSolv | 1 | DNN | 1.072 ± 0.216 | 0.696 ± 0.128 |
| FreeSolv | 1 | XGBoost | 1.475 ± 0.169 | 0.892 ± 0.072 |
| FreeSolv | 1 | RF | 1.582 ± 0.225 | 0.939 ± 0.064 |
| FreeSolv | 1 | GCN | 1.060 ± 0.254 | 0.663 ± 0.104 |
| FreeSolv | 1 | GAT | 1.257 ± 0.335 | 0.743 ± 0.091 |
| FreeSolv | 1 | AttentiveFP | 1.125 ± 0.127 | 0.743 ± 0.091 |
| FreeSolv | 1 | Contrastive DNN + GCN (ours) | 0.983 ± 0.149 | 0.664 ± 0.098 |
| FreeSolv | 1 | Contrastive DNN + GAT (ours) | 0.972 ± 0.216 | 0.582 ± 0.096 |
| FreeSolv | 1 | Contrastive DNN + AttentiveFP (ours) | 0.899 ± 0.203 | 0.564 ± 0.098 |
| Lipop | 1 | DNN | 0.646 ± 0.036 | 0.465 ± 0.016 |
| Lipop | 1 | XGBoost | 0.724 ± 0.016 | 0.552 ± 0.012 |
| Lipop | 1 | RF | 0.779 ± 0.018 | 0.611 ± 0.013 |
| Lipop | 1 | GCN | 0.663 ± 0.043 | 0.470 ± 0.025 |
| Lipop | 1 | GAT | 1.049 ± 0.339 | 0.857 ± 0.351 |
| Lipop | 1 | AttentiveFP | 0.677 ± 0.034 | 0.508 ± 0.025 |
| Lipop | 1 | Contrastive DNN + GCN (ours) | 0.645 ± 0.051 | 0.448 ± 0.026 |
| Lipop | 1 | Contrastive DNN + GAT (ours) | 0.646 ± 0.035 | 0.463 ± 0.019 |
| Lipop | 1 | Contrastive DNN + AttentiveFP (ours) | 0.609 ± 0.032 | 0.442 ± 0.020 |
2) On the FreeSolv dataset, which contains relatively few molecules, the bimodal contrastive learning model achieves an improvement of more than 10% over the six representative single-representation techniques. This also shows the robustness and advantages of the joint-representation bimodal contrastive learning model on small datasets.
3) The tables also show that it is not easy to reach a unified judgment on the relative merits of the six single-modal representation models across multiple datasets. This is consistent with the literature [19, 20], which shows that there is no unified and simple way to choose a single representation model for different datasets and tasks. Graph neural networks are not the end of the line for predictive models based on traditional molecular descriptor representations.
4) As shown in Fig. 4, we use violin plots to visualize the experimental results on the BACE dataset. According to the violin plots, the proposed bimodal contrastive learning model not only improves performance in most cases but also reduces the dispersion of the results, which indicates better generalization ability and stability.
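As a check on the 10% figure for FreeSolv in item 2), the relative RMSE reduction can be worked out from Table 5. The text does not state which pair of models was compared, so the pairing used here (best single-input model, GCN at 1.060, against the best bimodal model, Contrastive DNN + AttentiveFP at 0.899) is an assumption:

$$
\frac{1.060 - 0.899}{1.060} = \frac{0.161}{1.060} \approx 0.152
$$

That is, an RMSE about 15% lower. Under the same baseline, the weakest bimodal model (Contrastive DNN + GCN, 0.983) gives a reduction of about 7%.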
Fig. 4. Experimental results for all models (including single model, bimodal simple concatenation model, and molecular bimodal contrastive learning model) on the classification dataset BACE, with higher values indicating better results
5.2 Ablation Experiments
Ablation experiments were performed to further illustrate the effectiveness of the two stages of the proposed method. The results are shown in Table 6. The two factors examined are the pre-training of the bimodal encoder with molecular self-supervised contrastive learning and the multi-branch predictor design in the supervised learning stage of molecular property prediction. Due to space limitations, we list only the results on the BACE and ESOL datasets. Different types of graph neural network encoders were used in the ablation experiments. As can be seen from Table 6, both components of our proposed method are effective.
Table 6. Ablation experiments on the BACE dataset and the ESOL dataset

| Model | Contrastive Loss | Multi-branch predictor | Classification dataset | ROC_AUC (higher better) | Regression dataset | MAE (lower better) |
|---|---|---|---|---|---|---|
| DNN+GCN | | | BACE | 0.904±0.016 | ESOL | 0.183±0.024 |
| DNN+GCN | | | BACE | 0.896±0.014 | ESOL | 0.191±0.017 |
| DNN+GCN | | | BACE | 0.888±0.018 | ESOL | 0.198±0.022 |
| DNN+GAT | | | BACE | 0.895±0.017 | ESOL | 0.187±0.024 |
| DNN+GAT | | | BACE | 0.889±0.015 | ESOL | 0.195±0.023 |
| DNN+GAT | | | BACE | 0.884±0.018 | ESOL | 0.211±0.029 |
| DNN+AttentiveFP | | | BACE | 0.887±0.024 | ESOL | 0.177±0.019 |
| DNN+AttentiveFP | | | BACE | 0.885±0.028 | ESOL | 0.177±0.020 |
| DNN+AttentiveFP | | | BACE | 0.884±0.019 | ESOL | 0.180±0.015 |
5.3 Performance and Cost - Comparison with a Large-Scale Pre-trained Model of Unimodal Molecular Representation

Table 7. Performance comparison between the best model of the bimodal contrastive learning architecture and notable large-scale pre-trained models. BACE and Tox21 are classification tasks (ROC, higher is better); ESOL, FreeSolv, and Lipop are regression tasks (RMSE, lower is better).

| Model | BACE | Tox21 | ESOL | FreeSolv | Lipop |
|---|---|---|---|---|---|
| N-GRAM [21] | 0.876(0.035) | 0.769(0.027) | 1.100(0.160) | 2.512(0.190) | 0.876(0.033) |
| Smiles Transformer [22] | 0.719(0.023) | 0.706(0.021) | 1.144(0.118) | 2.246(0.237) | 1.169(0.031) |
| Hu et al. [17] | 0.851(0.027) | 0.811(0.015) | ------- | ------- | ------- |
| GROVERlarge [18] | 0.894(0.028) | 0.831(0.025) | 0.831(0.120) | 1.544(0.397) | 0.560(0.035) |
| Best GNN + DNN (ours) | 0.904(0.016) | 0.855(0.011) | 0.275(0.041) | 0.899(0.203) | 0.609(0.032) |
Recent studies [17, 18] have pre-trained models on millions of unlabeled molecules using cutting-edge graph neural network-based or Transformer-based self-supervised strategies, and then fine-tuned them on small target datasets for better performance. These models use only unimodal molecular representations (molecular graphs or SMILES strings) without descriptor representations, but they are trained on extensive datasets. In contrast, our bimodal self-supervised contrastive pre-training stage is trained only on small-scale datasets. Table 7 compares the performance of our model with that of state-of-the-art large-scale pre-trained models. Our performance is comparable to theirs while requiring significantly fewer computational resources, which demonstrates the cost-efficiency of our method. We hypothesize that our method could yield more accurate results if it were also pre-trained on a large-scale unlabeled molecular dataset, and we leave this to future work.
6 Conclusion and Future Work
Our study highlights the necessity of integrating descriptor and molecular graph models and develops a bimodal contrastive learning architecture to do so. The architecture leverages the expert knowledge encoded in descriptors and the modeling capabilities of graph neural networks, using feature and decision fusion techniques to improve performance. We introduced a novel self-supervised contrastive learning scheme based on descriptors and molecular graphs for pre-training the two modal encoders, and we presented a bimodal multi-branch predictor structure to tackle unbalanced encoder training. The approach can support virtual screening, the prediction of drug-target interactions and ADMET properties, and drug design and repurposing, potentially improving drug development efficiency.
Acknowledgements. Supported by grants from the National Natural Science Foundation of China (No. 81973182); National Science Foundation of China (No. 61806092); Jiangsu Natural Science Foundation (No. BK20180326); “Double First-Class” University project from China Pharmaceutical University (Program No. CPU2018GF02).
References
1. Rajpurkar, P., Chen, E., Banerjee, O., et al.: AI in health and medicine. Nat. Med. 28(1), 31–38 (2022)
2. Rabaan, A.A., Alhumaid, S., Mutair, A.A., et al.: Application of artificial intelligence in combating high antimicrobial resistance rates. Antibiotics 11(6), 784 (2022)
3. Fang, X., Liu, L., Lei, J., et al.: Geometry-enhanced molecular representation learning for property prediction. Nature Mach. Intell. 4(2), 127–134 (2022)
4. Asada, M., Miwa, M., Sasaki, Y.: Using drug descriptions and molecular structures for drug–drug interaction extraction from literature. Bioinformatics 37(12), 1739–1746 (2021)
5. Kurotani, A., Kakiuchi, T., Kikuchi, J.: Solubility prediction from molecular properties and analytical data using an in-phase deep neural network (Ip-DNN). ACS Omega (2021)
6. Alves, A.H.R., Cerri, R.: A two-step model for drug-target interaction prediction with predictive bi-clustering trees and XGBoost. In: 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2022)
7. Wei, Y., Li, S., Li, Z., et al.: Interpretable-ADMET: a web service for ADMET prediction and optimization based on deep neural representation. Bioinformatics 38(10), 2863–2871 (2022)
8. Wieder, O., et al.: A compact review of molecular property prediction with graph neural networks. Drug Discovery Today: Technologies (2020)
9. Rong, Y., Bian, Y., Xu, T., et al.: Self-supervised graph transformer on large-scale molecular data. Adv. Neural. Inf. Process. Syst. 33, 12559–12571 (2020)
10. Lovrić, M., Molero, J.M., Kern, R.: PySpark and RDKit: moving towards big data in cheminformatics. Mol. Inf. 38(6), 1800082 (2019)
11. Yap, C.W.: PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 32(7), 1466–1474 (2011)
12. Abu-Dief, A.M., El-Metwaly, N.M., Alzahrani, S.O., et al.: Structural, conformational and therapeutic studies on new thiazole complexes: drug-likeness and MOE-simulation assessments. Res. Chem. Intermediates 47, 1979–2002 (2021)
13. Li, Z., Liu, F., Yang, W., et al.: A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans. Neural Networks Learn. Syst. (2021)
14. Busbridge, D., Sherburn, D., Cavallo, P., Hammerla, N.Y.: Relational graph attention networks. arXiv preprint arXiv:1904.05811 (2019)
15. Xiong, Z., et al.: Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63(16), 8749–8760 (2019)
16. Chithrananda, S., Grand, G., Ramsundar, B.: Chemberta: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885 (2020)
17. Hu, W., Liu, B., Gomes, J., et al.: Strategies for pre-training graph neural networks. In: International Conference on Learning Representations (ICLR) (2020)
18. Li, P., et al.: Learn molecular representations from large-scale unlabeled molecules for drug discovery. arXiv preprint arXiv:2012.11175 (2020)
19. Jiang, D., et al.: Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J. Cheminform. 13(1), 1–23 (2021)
20. Bai, P., Miljković, F., John, B., et al.: Interpretable bilinear attention network with domain adaptation improves drug–target prediction. Nature Mach. Intell., 1–11 (2023)
21. Liu, S., Demirel, M.F., Liang, Y.: N-gram graph: simple unsupervised representation for graphs, with applications to molecules. Adv. Neural. Inf. Process. Syst. 32 (2019)
22. Honda, S., Shi, S., Ueda, H.R.: Smiles transformer: pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738 (2019)
23. He, K., Fan, H., Wu, Y., et al.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
24. Wang, Y., Wang, J., Cao, Z., et al.: Molecular contrastive learning of representations via graph neural networks. Nature Mach. Intell. 4(3), 279–287 (2022)
25. Yang, K., et al.: Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59(8), 3370–3388 (2019)
26. Rahaman, O., Gagliardi, A.: Deep learning total energies and orbital energies of large organic molecules using hybridization of molecular fingerprints. J. Chem. Inf. Model. 60(12), 5971–5983 (2020)
Multi-objective Optimization-Based Approach for Detection of Breast Cancer Biomarkers
Jiaxin Yang1, Chuanyuan Wang1, Duanchen Sun2, and Zhi-Ping Liu1(B)
1 School of Control Science and Engineering, Shandong University, Jinan 250061, Shandong, China
[email protected]
2 School of Mathematics, Shandong University, Jinan 250100, Shandong, China
Abstract. An increasing number of studies have shown a close link between the development of breast cancer (BRCA) and molecular signatures. Some of these signatures have been identified and confirmed as biomarkers for the early diagnosis and prognosis evaluation of BRCA. Nevertheless, identifying biomarkers with high sensitivity and specificity remains exceedingly challenging. In this paper, we aim to identify BRCA biomarkers from high-throughput data by proposing a multi-objective optimization method. Our method constructs differential gene regulatory networks based on gene expression profiles of different phenotypes. We extract all pathways from BRCA elite genes to differentially expressed genes to capture the information flow between key genes. In addition, we construct a set of virtual nodes, together with virtual edges that connect the differentially expressed genes to them, which enables us to simulate the transmission of genetic information. Using the maximum flow minimum cut theorem, we extract the dysfunctional modules within the identified causal pathways. Finally, we derive a diverse, globally optimal solution set with a multi-objective optimization algorithm, which represents potential biomarker sets for BRCA diagnosis. The experimental results show that the proposed disease diagnosis model is more accurate than previous methods. It is expected to reduce the cost of clinical trials and to help identify therapeutic targets for BRCA.
Keywords: Biomarker Discovery · Breast Cancer · Maximum Flow Minimum Cut Theorem · Multi-objective Optimization
1 Introduction
Breast cancer (BRCA) is a prevalent cancer among women worldwide [1], and its incidence has been rising annually [2]. Unfortunately, early-stage BRCA often lacks evident clinical symptoms, resulting in delayed diagnosis and missed opportunities for optimal treatment, which increases the risk of adverse outcomes [3]. Biomarkers, as one of the most efficient and rapid diagnostic tools for early cancer detection, are essential for the effective treatment and prevention of cancer [4].
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 716–726, 2023. https://doi.org/10.1007/978-981-99-4749-2_61
The identification of biomarkers is a crucial research area in bioinformatics and computational biology. Several approaches have been proposed to discover biomarkers from high-dimensional data [5]. For example, Kong et al. proposed Graph-Embedded Deep Feedforward Networks, which integrate external relational information about features into a deep neural network to enhance classification performance [6]. Cai et al. combined integrated feature selection methods with incremental feature selection methods to select key features [7]. Additionally, Wang et al. introduced a network rewiring-based approach for identifying biomarkers [8]. These methods assess and extract critical features from gene expression profiles from a machine-learning perspective. However, they do not evaluate the features that lead to phenotypic differences from a global perspective, nor do they consider the communication and delivery pathways between biomarkers in real tumor microenvironments.
In this paper, we present a multi-objective optimization-based approach to extract dysfunctional modules and biomarkers from gene regulatory networks (GRNs) with gene expression profiles. The approach aims to accurately identify the key features of BRCA and to improve the interpretability of the proposed model. Firstly, the method constructs a differential gene regulatory network (D-GRN) based on normal and disease samples to evaluate gene circuits. Secondly, all pathways from the cancer elite genes to the differentially expressed genes (DEGs) are calculated to simulate the transmission of genetic information from a systematic perspective. Moreover, a network flow balancing-based algorithm is used to calibrate the delivery of gene information and to further screen critical gene information delivery pathways. These information flows reflect how gene information is transferred from elite genes to DEGs and play a critical role in detecting the development of breast cancer. Finally, multi-objective optimization is employed to identify the most relevant genes as diagnostic biomarkers for breast cancer.
2 Materials and Methods
2.1 Datasets and Data Preprocessing
We obtain raw BRCA RNA sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA) database (https://cancergenome.nih.gov/), which comprises 1,104 BRCA patients and 114 normal controls. In addition, we obtain an external independent validation dataset consisting of 104 BRCA patients and 17 normal controls from the Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE42568. We also download 89 elite genes associated with BRCA from the Malacards Human Disease Database (https://www.malacards.org/). The integrated human GRN is downloaded from RegNetwork [9].
2.2 Identification of Differentially Expressed Genes and Integration of Prior Knowledge of BRCA
We employ the Wilcoxon rank-sum test to identify DEGs in the TCGA data. With thresholds of P-value < 0.01, FDR < 0.01, and |log2FC| > 3.3, we screen 1,267 genes as DEGs, as shown in Fig. 1A. In addition, we use five types of prior-knowledge genes collected by Li et al. [10]: the top 519 genes associated with BRCA in the Gene Ontology Annotation ranking of GO terms, 70 marker genes linked to BRCA in MammaPrint, 128 genes for BRCA diagnosis in the OSbrca webserver, 147 genes in the BRCA pathway in KEGG, and 10 BRCA prognostic genes obtained by scPrognosis from single-cell RNA sequencing data. Next, we integrate these 874 genes of interest, the 1,267 DEGs, and the 89 elite genes of BRCA and extract them from the gene expression data to form a set of 1,958 genes. These genes are then mapped to the integrated human GRN, as depicted in Fig. 1B. Eventually, we obtain 7,099 edges and 1,531 genes for further processing.
2.3 Construction of Differential Gene Regulatory Network
Inferring a GRN that explicitly represents the causal relationships in regulatory or developmental processes is a highly challenging task in systems biology [11, 12]. In this study, we use our previously developed path consistency algorithm based on conditional mutual information (PCA-CMI) to infer the GRN from gene expression data. PCA-CMI assesses the conditional independence between gene pairs based on mutual information (MI) instead of Pearson correlation coefficients and offers high performance and generality [13]. To construct a more specific regulatory network, we separately calculate the MI of gene pairs in the disease and normal states from the inferred GRN, as shown in Fig. 1C. We then subtract the MI of the same gene pair in the two states, take the absolute value, and use it to construct a D-GRN across controls and BRCA samples.
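The differential edge weight described in Sect. 2.3 can be sketched as follows. The MI estimator is not specified in this section, so the sketch assumes the closed form for jointly Gaussian variables, MI = -0.5 ln(1 - r^2), where r is the Pearson correlation; the function names and the dictionary-based data layout are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined for constant vectors


def gaussian_mi(x, y):
    """MI under a Gaussian assumption (an assumed estimator, see lead-in)."""
    r2 = min(pearson(x, y) ** 2, 1.0 - 1e-12)  # clip to avoid log(0)
    return -0.5 * math.log(1.0 - r2)


def differential_weights(edges, expr_disease, expr_normal):
    """|MI_disease - MI_normal| for every edge (u, v) of the inferred GRN."""
    weights = {}
    for u, v in edges:
        mi_d = gaussian_mi(expr_disease[u], expr_disease[v])
        mi_n = gaussian_mi(expr_normal[u], expr_normal[v])
        weights[(u, v)] = abs(mi_d - mi_n)
    return weights
```

An edge whose MI is the same in both states receives weight 0, so large weights mark regulations that change between normal and BRCA samples.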
2.4 Pathway Integration
To model information transmission between the compiled genes, we first identify all pathways from BRCA elite genes to DEGs in the D-GRN. Based on the theory of six degrees of separation in complex networks, we limit the number of genes in each pathway to no more than six, meaning that it takes six or fewer steps to connect any two genes in the D-GRN. Then, we combine all paths from each elite gene to all DEGs into a sub-network. Finally, we create a set of virtual nodes and virtual edges from all DEGs to the virtual nodes and add them to the constructed sub-network, as shown in Fig. 1D.

Fig. 1. A framework for detecting BRCA biomarkers from gene expression data. (A) Identification of DEGs from RNA-seq data. (B) Integration of seven gene sets and mapping them to the RegNetwork knowledgebase for GRN extraction. (C) Calculation of the MI between nodes in disease samples and in normal samples, respectively, based on gene expression data. (D) Extraction of all modules that carry key gene information from the constructed subnets using the maximum flow minimum cut theorem. (E) Application of the multi-objective optimization algorithm to find Pareto optimal solutions with diversity, as well as the validation of their classification.

2.5 Functional Dysfunctional Modules Identification Based on Maximum Flow Minimum Cut Theorem
We employ the maximum flow minimum cut theorem to identify dysfunctional modules in a capacity-constrained network. The corresponding problem maximizes the flow from the source to the sink while ensuring that the flow leaving the source equals the flow entering the sink. The method is executed as follows. Firstly, a weighted directed graph G(V, E, C) without self-loops is constructed, where V represents the set of genes, E represents the regulatory relationships between genes, and C represents the strength of interactions between genes. Secondly, the source and sink are determined in the graph. Finally, the Edmonds-Karp algorithm is used to determine the maximum amount of traffic that can flow through the network at any given time, which in turn identifies all the genes involved in the transmission of this information flow. The maximum flow problem can be described mathematically as:

$$
\max Q \quad \text{s.t.} \quad
\begin{cases}
\sum_{w \in V} f(s, w) = \sum_{w \in V} f(w, t) = Q \\
\sum_{u \notin \{s, t\}} f(u, w) = \sum_{u \notin \{s, t\}} f(w, u) \\
0 \le f(u, w) \le C(u, w)
\end{cases}
$$

where $\forall u, w \in V$, $(u, w) \in E$, $C(u, w)$ is the capacity carried by each edge in this flow network, $f(u, w)$ is the flow assigned to each edge, and $Q$ denotes the maximum amount of traffic that can pass through the network at a given moment.
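The maximum flow step of Sect. 2.5 can be sketched with a small Edmonds-Karp routine (augmenting paths found by breadth-first search). The toy network, the gene names, the single virtual sink "T", and the capacity given to the virtual edges are all assumptions made for illustration. The sketch also assumes that the graph has no pair of antiparallel edges.

```python
from collections import defaultdict, deque


def edmonds_karp(capacity, source, sink):
    """Maximum flow via BFS augmenting paths; capacity is {u: {v: c}}."""
    residual = defaultdict(lambda: defaultdict(float))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
            residual[v][u] += 0.0  # reverse edge, used to undo flow
    total = 0.0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in list(residual[u].items()):
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no augmenting path left: the flow is maximal
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:  # smallest residual capacity on the path
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:  # push the bottleneck along the path
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck
    flow = {}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            if c - residual[u][v] > 1e-12:
                flow[(u, v)] = c - residual[u][v]
    return total, flow


# Toy sub-network: elite gene "E", intermediate genes, DEGs "d1" and "d2" (hypothetical).
capacity = {
    "E": {"g1": 0.6, "g2": 0.4, "g3": 0.2},
    "g1": {"d1": 0.5},
    "g2": {"d1": 0.3, "d2": 0.3},
}
big = sum(c for nbrs in capacity.values() for c in nbrs.values())
for deg in ("d1", "d2"):
    capacity.setdefault(deg, {})["T"] = big  # virtual edges into the virtual sink

max_flow, flow = edmonds_karp(capacity, "E", "T")
module = {node for edge in flow for node in edge} - {"T"}  # genes that carry flow
```

In this toy case the dead-end gene `g3` carries no flow and is left out of `module`, which mirrors the idea of keeping only the genes involved in transmitting the information flow.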
2.6 Multi-objective Optimization for Identifying Biomarkers
The maximum flow minimum cut theorem identifies various modules that can potentially serve as a library of biomarker candidates [14, 15]. To identify the biomarker modules that are most closely associated with BRCA, we use multi-objective optimization, which seeks the best achievable trade-off among multiple objectives that interact or conflict with each other. Here we employ the non-dominated sorting genetic algorithm with elitist strategy (NSGA-II) [16, 17]. NSGA-II identifies significant biomarker modules by maximizing the following three objectives:

$$
\text{NSGA-II}\left(F_{AUC}(M^c),\ F_{AIC}(M^c),\ F_{PAS}(M^c)\right)
$$

$$
F_{AUC}(M^c) = \frac{\sum_{i \in pos(y)} rank_i\left(P(M^c)\right) - \frac{LP_y (LP_y + 1)}{2}}{LP_y \times LN_y}, \quad LP_y = len(pos(y)), \quad LN_y = len(neg(y))
$$

$$
F_{AIC}(M^c) = 2k^c + len(y) \ln \frac{SSR\left(P(M^c), y\right)}{len(y)}
$$

$$
F_{PAS}(M^c) = \frac{\sum_{j=1}^{len(y)} MI\left(M_j^c, y\right)}{len(y)}
$$

where $M^c \in \{M^1, M^2, ..., M^L\}$, $M$ is the set of $L$ identified dysfunctional modules, $M^c = \{gene_1, gene_2, ..., gene_m\}$ represents the expression data of all genes in the module, $y$ is the phenotype label vector corresponding to $M^c$, $len(y)$ is the number of samples, and $P(M^c)$ contains the predicted phenotype probability for each sample.
To evaluate the classification performance of the model, we use the area under the curve (AUC) as the objective function $F_{AUC}$. This metric does not require a manually set threshold and remains usable when positive and negative samples are imbalanced. Here, $rank$ denotes the rank of each sample when the predicted probabilities are sorted in ascending order, $pos(y)$ and $neg(y)$ are the indices of the positive and negative samples in the label vector $y$, and $LP_y$ and $LN_y$ are their respective counts.
We use the negative value of the Akaike Information Criterion (AIC) as the objective $F_{AIC}$ to address overfitting through a penalty term for model complexity. From the model selection perspective, the model with the smallest AIC is the best choice, so the negative of the AIC is taken in order to state this objective as a maximization. Here, $k^c$ is the number of genes in module $M^c$, and $SSR(\cdot)$ is the residual function.
We also introduce the phenotype association score (PAS) [18] as the objective function $F_{PAS}$, which computes the mutual information between the genes of each module and the phenotype vector.
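Two of the objectives above, and the dominance test on which non-dominated sorting is built, can be sketched as follows. This is not the authors' implementation and is not a full NSGA-II (there is no crowding distance, selection, crossover, or mutation). The function names are hypothetical, and SSR is assumed to be the sum of squared differences between labels and predicted probabilities.

```python
import math


def rank_auc(probs, labels):
    """AUC from ascending ranks of the predicted probabilities (ties share the average rank)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    ranks = [0.0] * len(probs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and probs[order[j + 1]] == probs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def neg_aic(probs, labels, k):
    """Negative AIC with k genes; SSR is an assumed squared-error residual (must be > 0)."""
    n = len(labels)
    ssr = sum((y - p) ** 2 for p, y in zip(probs, labels))
    return -(2 * k + n * math.log(ssr / n))


def pareto_front(points):
    """Indices of objective tuples not dominated by any other (all objectives maximized)."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]
```

Each module would be scored as a tuple such as `(rank_auc(...), neg_aic(...), pas)`, and `pareto_front` then keeps the modules that no other module beats on all three objectives at once.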
Fig. 2. Identified dysfunctional modules. (A) A merged network of key modules that transmit information between genes is extracted from each subnet, where different colors represent different modules, and genes that recur in two or more modules are indicated in gray. (B) Pie chart of the number of genes in the module.
3 Results and Discussion
3.1 Constructed Differential Gene Regulatory Network
By integrating the selected BRCA-related genes, the gene expression profiling data, and RegNetwork, we construct a GRN consisting of 1,531 genes and 7,099 edges. To make the GRN specific to BRCA, we remove redundant gene regulations from this prior network using the PCA-CMI algorithm with a threshold of 0.5. Any edge whose MI or conditional mutual information is lower than this threshold is removed, which leaves 4,957 edges and 1,469 genes. We then calculate the MI of gene pairs separately in both states from the gene expression data to obtain the D-GRN.
3.2 Identified Dysfunctional Modules
We start from the 89 BRCA elite genes obtained from the downloaded datasets and the 1,267 screened DEGs. We then connect all pathways between elite genes and DEGs, which include regulatory interactions between genes and regulatory effects on DEGs. From these, we extract eight subnetworks that extend from the elite genes to all DEGs in the D-GRN. Each subnetwork contains between 3,384 and 3,546 edges. The dysfunctional modules identified from each subnetwork are illustrated in Fig. 2. We extract important modules from each of the eight subnetworks using the maximum flow minimum cut theorem, which keeps the flows balanced during information transfer. We merge the obtained modules into a network of 310 edges and 209 nodes, as depicted in Fig. 2A. Nodes of different colors in this network represent different modules, and genes present in two or more modules are shown in gray. Module 3 and Module 6 have more unique genes than the other modules. The number of genes in each module is shown in Fig. 2B, and these modules provide the candidates for further identification of disease biomarker genes. Figure 3A displays the non-dominated solutions selected by the NSGA-II algorithm, which exhibit the least conflict among objectives compared with other solutions.

Fig. 3. (A) The non-dominated solutions identified by the proposed multi-objective optimization-based approach. (B) The ROC curves of the proposed multi-objective optimization methods and two comparison algorithms.

3.3 Identification and Validation of Biomarkers
To further validate our selected biomarker modules, we conduct a comparative study with the feature selection method based on the SVM-RFE algorithm proposed by Guyon et al. [19] and the HLR method proposed by Huang et al. [20]. We evaluate the classification results using ROC curves, which are displayed in Fig. 3B. The results indicate that the AUC values of the modules obtained by our proposed method are higher than those of the “HLR” method, with the exception of Module 6 and Module 7. Furthermore, the AUCs of our proposed method are better than those of the “SVM-RFE” method, except for Modules 6 and 7. We also compute the AUC, accuracy, F1 value, recall, and precision of the three methods over 30 runs and report their means and standard deviations in Table 1. The standard deviations in Table 1 do not exceed 0.001, indicating that the experimental results are stable and reliable. Although the “SVM-RFE” method achieves the highest recall, our method performs better on AUC, accuracy, F1, and precision.

Table 1. Performance of the three methods over 30 runs (mean ± standard deviation).

| Module | AUC | Accuracy | F1 | Recall | Precision |
|---|---|---|---|---|---|
| M1 | 0.997 ± 0.001 | 0.987 ± 0.001 | 0.993 ± 0.001 | 0.993 ± 0.001 | 0.993 ± 0.001 |
| M2 | 0.997 ± 0.001 | 0.987 ± 0.001 | 0.993 ± 0.001 | 0.993 ± 0.001 | 0.993 ± 0.001 |
| M3 | 0.995 ± 0.001 | 0.985 ± 0.001 | 0.992 ± 0.001 | 0.992 ± 0.001 | 0.992 ± 0.001 |
| M4 | 0.996 ± 0.001 | 0.985 ± 0.001 | 0.992 ± 0.001 | 0.992 ± 0.001 | 0.992 ± 0.001 |
| M5 | 0.996 ± 0.001 | 0.984 ± 0.001 | 0.991 ± 0.001 | 0.992 ± 0.001 | 0.991 ± 0.001 |
| M6 | 0.996 ± 0.001 | 0.986 ± 0.001 | 0.992 ± 0.001 | 0.992 ± 0.001 | 0.992 ± 0.001 |
| M7 | 0.996 ± 0.001 | 0.987 ± 0.001 | 0.993 ± 0.001 | 0.993 ± 0.001 | 0.993 ± 0.001 |
| M8 | 0.996 ± 0.001 | 0.986 ± 0.001 | 0.993 ± 0.001 | 0.992 ± 0.001 | 0.993 ± 0.001 |
| SVM-RFE | 0.980 ± 0.001 | 0.969 ± 0.001 | 0.983 ± 0.001 | 0.998 ± 0.001 | 0.969 ± 0.001 |
| HLR | 0.963 ± 0.001 | 0.971 ± 0.001 | 0.984 ± 0.001 | 0.991 ± 0.001 | 0.978 ± 0.001 |

We assess the validity of our method on an external independent validation dataset. The classification performance of the selected biomarker modules on the validation data is shown in Fig. 4A. All eight modules exhibit good classification performance, with the highest AUC value reaching 0.991 and the lowest being 0.946. Furthermore, we compare the classification ability of the selected biomarker modules on the independent validation dataset with that of DEGs and random genes, as shown in Fig. 4B. The classification ability of the module biomarkers is significantly different from that of both DEGs and random genes of the same size (P < 0.05). These results suggest that our selected biomarkers are potentially important molecular markers for the early diagnosis of breast cancer.

Fig. 4. Classification performance of biomarkers, random genes, and DEGs identified in the independent validation dataset. (A) ROC curves of the module biomarkers. (B) Classification ability of modular biomarkers with equal numbers of DEGs and random genes.

We also use the DAVID online database to extract functional annotation information for these gene modules. Table 2 shows ten enriched functional clusters with their associated genes. The transcription and regulation associated with “RNA polymerase II” are significantly enriched, and transcriptional dysregulation is closely related to cancer development. The regulation of gene expression impacts the normal development of cells, while “breast development” can increase the risk of breast cancer. The enrichments explain the biological functions of the identified biomarker genes in terms of biological processes. The functional enrichment analysis provides further validation evidence for our proposed method of identifying BRCA biomarkers.
4 Conclusions

To summarize, we developed a bioinformatics method based on multi-objective optimization for detecting biomarkers from gene expression and GRN data. The optimization objectives include AUC, AIC, and PAS, which enable the population to evolve towards a globally optimal solution in multiple selection directions. Additionally, we developed a computational method based on network flow balancing to identify dysfunctional modules and to elucidate the effectiveness of disease-related genes from a network-level analysis. We applied this framework to identify BRCA biomarkers and obtained a collection of 8 gene modules. These potential biomarkers were validated on both the internal and external validation datasets and are of great importance for the study of cancer pathogenesis and influencing factors. The method is general in design and can also be used to discover biomarkers for other complex diseases.

Table 2. The enriched GO biological processes in the identified BRCA biomarkers.

| GO term | Description | Biomarker gene | Adjusted P-value |
|---|---|---|---|
| GO:0045944 | Positive regulation of transcription from RNA polymerase II promoter | RB1, NFIX, NUFIP1, HDAC1, NR2C2, ETS1, HOXA9, SOX17, EPCAM, E2F1, ABL1, EP300, E2F3, HES1, KDM6B, JUN, STAT1, FOS, KLF4, ESR1, ESR2, KAT2B, SMO, NFIB, CDH13, ATM, TP53 | 5.95E-13 |
| GO:0045893 | positive regulation of transcription, DNA-templated | JUN, STAT1, HDAC1, AXIN1, FOS, PSEN1, KLF4, ETS1, ESR1, ESR2, KAT2B, SOX17, NFIB, E2F1, EP300, MAPK1, TP53 | 3.43E-07 |
| GO:0000122 | negative regulation of transcription from RNA polymerase II promoter | RB1, JUN, NFIX, STAT1, HDAC1, PSEN1, KLF4, NR2C2, ESR1, ESR2, SMO, NFIB, E2F1, EP300, HES1, TP53, JDP2, HSPA1A | 1.93E-06 |
| GO:0010628 | positive regulation of gene expression | CSF1, HDAC1, PSEN1, KLF4, ETS1, SOX17, SMO, E2F1, MAPK1, HES1, ATM, TP53, HSPA1A | 1.91E-05 |
| GO:0006357 | regulation of transcription from RNA polymerase II promoter | RB1, JUN, TCF7L1, NFIX, ZNF581, HDAC1, FOS, KLF4, NR2C2, ETS1, ESR1, ESR2, HOXA9, SOX17, NFIB, E2F1, E2F3, HES1, SOS1, TP53, JDP2 | 3.95E-05 |
| GO:0006366 | transcription from RNA polymerase II promoter | RB1, JUN, NFIX, EP300, FOS, KLF4, ETS1, ESR1 | 0.002239 |
| GO:0048538 | thymus development | ABL1, MAPK1, HES1, ATM, PSEN1 | 0.002239 |
| GO:1902895 | positive regulation of pri-miRNA transcription from RNA polymerase II promoter | JUN, FOS, KLF4, ETS1, TP53 | 0.002239 |
| GO:0090399 | replicative senescence | CHEK1, SERPINE1, ATM, TP53 | 0.002239 |
| GO:0006915 | apoptotic process | ALDH1A3, JUN, SMO, UNC5A, CHEK1, AXIN1, EP300, MAPK1, PSEN1, TP53, BBC3 | 0.002299 |
Acknowledgments. This work was partially supported by National Natural Science Foundation of China (No. 61973190); National Key Research and Development Program of China (Nos. 2022YFA1004801, 2020YFA0712402); the Fundamental Research Funds for the Central Universities (No. 2022JC008) and the program of Qilu Young Scholar of Shandong University.
MOFNet: A Deep Learning Framework of Integrating Multi-omics Data for Breast Cancer Diagnosis

Chunxiao Zhang¹, Pengpai Li¹, Duanchen Sun², and Zhi-Ping Liu¹(✉)

¹ School of Control Science and Engineering, Shandong University, Jinan 250061, Shandong, China
[email protected]
² School of Mathematics, Shandong University, Jinan 250100, Shandong, China
Abstract. With the advancement of technology, annotated multi-omics datasets are becoming increasingly abundant. In this paper, we propose a novel deep learning framework, called multi-omics data fusion network (MOFNet), to integrate multi-omics data for disease diagnosis. MOFNet is a multi-task learning framework that combines multiple deep learning models to learn the complex relationships between multi-omics data and disease label. MOFNet focuses on improving disease classification performance with fewer features extracted from interrelated multi-omics data. We demonstrate that MOFNet outperforms other state-of-the-art supervised multi-omics data integration methods in breast cancer sample classification tasks using mRNA expression, DNA methylation, and microRNA expression profiles. The selected features can be regarded as integrative biomarkers of breast cancer diagnosis and stratification.

Keywords: Multi-omics Data Integration · Graph Convolution Network · Graph Classification · Machine Learning · Breast Cancer
1 Introduction

Breast cancer is currently the most common cancer among women worldwide and ranks second as a cause of cancer-related death [1]. Molecular heterogeneity exists among different subtypes of the same cancer type [2, 3]. Different molecular subtypes of breast cancer typically require different treatment methods, including surgery, chemotherapy, and hormone therapy [4]. Therefore, accurately identifying breast cancer subtypes is of great significance for precise diagnosis, for planning individualized cancer treatment, and for improving patient prognosis [5, 6]. Multi-omics data refers to different types of high-throughput molecular data, such as gene expression, DNA methylation, and microRNA expression, obtained from the same batch of samples [7, 8]. With the development of high-throughput technologies, various types of omics data are collected at an unprecedented level of detail [9], providing strong data support for exploring cancer subtypes and studying cancer molecular mechanisms.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 727–738, 2023. https://doi.org/10.1007/978-981-99-4749-2_62
Although each single type of omics data can only describe biological and physiological features from one dimension, integrating multi-omics data can provide a more comprehensive view of the underlying biological processes [10]. For example, it has been shown that integrating multi-omics data can achieve better accuracy than using a single type of feature data [11]. However, how to effectively integrate and analyze multi-omics data remains a challenging problem. In the early stages of multi-modal omics data integration, many types of omics data did not have sample labels, so many works were based on unsupervised learning [12]. With the rapid development of personalized medicine, datasets with detailed annotation information are becoming ubiquitous. As data availability improves, attention has gradually shifted to supervised data integration methods that make predictions on the same samples, mainly feature concatenation and feature ensemble-based methods. The former trains a classification model by directly concatenating features of different dimensions. The latter uses a different classifier for each feature dimension and trains each model separately. However, these methods do not consider the correlation between features of different dimensions, and are thus more likely to bias the prediction towards some specific omics data. Recently, deep learning techniques have demonstrated powerful learning ability and flexibility in various tasks, leading to the development of more and more methods for multi-omics data integration. For example, Huang et al. [13] combined mRNA and miRNA expression data with additional clinical information in the hidden layer to better predict the prognosis of breast cancer. However, most of these methods are based on fully connected networks. They can effectively learn nonlinear features but do not effectively utilize the correlation between samples.
Moreover, although most data fusion methods currently integrate in the input space or feature learning space [13], different types of omics data may also exhibit unique features in the high-dimensional label space. Therefore, Wang et al. [14] proposed a multi-omics fusion method called MOGONET, which first uses GCN (Graph Convolutional Networks) as the classifier for prediction on each single type of omics data and then fuses the predictions in the high-dimensional label space. However, it still has shortcomings; for example, it does not consider that the features within each type of omics data may also be redundant. To address the issues in the existing methods, we propose a novel multi-omics data fusion method, MOFNet (Multi-Omics Fusion Network), to classify disease samples by integrating multi-omics data. MOFNet uses the SGO (Similarity Graph pOoling with structure learning) method to learn from omics data with different dimensions, and utilizes the initial prediction results from each omics data to construct a cross-omics discovery tensor that reflects the correlation of labels across different omics. In comparison studies on the benchmark dataset, MOFNet achieves better results than the state-of-the-art (SOTA) methods while using only 25% of the features.
2 Materials and Methods

2.1 Datasets

To validate the effectiveness of MOFNet, we use the TCGA breast cancer (BRCA) dataset for PAM50 subtype classification. Three types of omics data, namely mRNA expression data, DNA methylation data and miRNA expression data, are used to provide comprehensive and complementary information on the diseases. Only samples with matching mRNA expression, DNA methylation, and miRNA expression data are considered in our study. This dataset has already undergone data preprocessing. The details of these datasets are listed in Table 1.

Table 1. Summary of dataset.

| Dataset | Categories | Number of features for training (mRNA, Meth, miRNA) |
|---|---|---|
| BRCA | Normal-like: 115, Basal-like: 131, HER2-enriched: 46, Luminal A: 436, Luminal B: 147 | 1000, 1000, 503 |

mRNA refers to mRNA expression data. Meth refers to DNA methylation data. miRNA refers to miRNA expression data. The BRCA dataset is for breast invasive carcinoma PAM50 subtype classification with normal-like, basal-like, human epidermal growth factor receptor 2 (HER2)-enriched, Luminal A, and Luminal B subtypes.
2.2 SGO for Omics-Specific Learning

Since its introduction in 2017, GCN has achieved impressive performance in various challenging tasks [15]. The method has been proven to be both general and effective. Therefore, we chose GCN as the building block of our omics-specific model. Here we briefly introduce GCN. For each omics data type, each sample is regarded as a node in the sample similarity network. The classification task is performed by learning from the sample similarity network and the feature network. The same sample similarity network is shared within each omics. The adjacency matrix $A \in \mathbb{R}^{n \times n}$ and the feature matrix $X \in \mathbb{R}^{n \times f}$ are used as the input of the GCN model. Each layer is defined as:

$$H^{k+1} = f(H^k, A) = \sigma(A H^k W^k), \tag{1}$$
where $H^k$ is the input of the $k$th layer, with $H^0 = X$; $W^k$ is the weight matrix of the $k$th layer, which is a learnable parameter; and $\sigma$ is a nonlinear activation function such as ReLU. To construct a GCN network, the first step is to obtain the initial adjacency matrix $A$. The matrix is constructed by calculating the cosine similarity between each pair of nodes and retaining edges whose cosine similarity is at least a given threshold $\epsilon$. For example, node $i$ and node $j$ in a graph have the following adjacency relationship:

$$A_{ij} = \begin{cases} c(x_i, x_j), & \text{if } i \neq j \text{ and } \epsilon \le c(x_i, x_j), \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where $x_i$ and $x_j$ are the feature vectors of node $i$ and node $j$, and

$$c(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|_2 \, \|x_j\|_2} \tag{3}$$

represents the cosine similarity score between node $i$ and node $j$. The value of $\epsilon$ depends on the parameter $k$, which is the average number of edges retained per node. The edges here include self-connected edges. In this study, we use $k = 10$ to generate the adjacency matrix $A$, that is, each node retains 9 edges to other nodes. Through the above operations, we construct a sample similarity network.

Below is a brief review of how information is passed through each layer of a GCN. The input of the $k$th layer of the original GCN consists of two parts: the adjacency matrix $A$ of the graph $G$ and the hidden feature matrix $H^k$. The output of the $(k+1)$th layer can be expressed as:

$$H^{k+1} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^k W^k\right) \tag{4}$$
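The graph construction of Eqs. 2 and 3 and the propagation rule of Eq. 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the way the threshold is derived from `k` (keep the `(k - 1) * n` largest off-diagonal similarities) is an assumption, ReLU is assumed for the activation, and the kept similarities are assumed to be non-negative so that all degrees stay positive.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity c(x_i, x_j) between the rows of X (Eq. 3)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    return Xn @ Xn.T

def build_adjacency(X, k=10):
    """Thresholded similarity graph (Eq. 2).

    Assumption: epsilon is chosen so that about (k - 1) * n off-diagonal
    entries survive, i.e. k edges per node on average once the
    self-connections of A + I are counted.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    n = X.shape[0]
    C = cosine_similarity_matrix(X)
    off_diag = C[~np.eye(n, dtype=bool)]
    n_keep = min((k - 1) * n, off_diag.size)
    eps = np.sort(off_diag)[-n_keep]      # n_keep-th largest similarity
    A = np.where(C >= eps, C, 0.0)
    np.fill_diagonal(A, 0.0)              # the i != j condition of Eq. 2
    return A, eps

def gcn_layer(A, H, W):
    """One propagation step of Eq. 4, with ReLU assumed as sigma."""
    A_tilde = A + np.eye(A.shape[0])      # add self-connections
    degree = A_tilde.sum(axis=1)          # >= 1 when A is non-negative
    D_inv_sqrt = np.diag(1.0 / np.sqrt(degree))
    return np.maximum(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W, 0.0)
```

Deriving the threshold from a target edge count rather than fixing it by hand keeps graphs of different omics types comparably sparse, which is presumably why the text parameterizes the threshold through `k`.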
This formula describes how the hidden feature matrix $H$ is updated at each layer of the GCN, where $\sigma$ represents a nonlinear activation layer. For the first layer, $H^0 = X$. $\tilde{A} = A + I$ is the adjacency matrix with self-connections, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $W^k \in \mathbb{R}^{d_k \times d_{k+1}}$ is a learnable parameter. For convenience, we set $d_{k+1} = d_k = d$.

The original GCN algorithm has achieved good results in many fields. However, it is not well suited to the field of gene expression regulation. The features used in this study are all differentially expressed genes, but not all genes contribute positively to the correct prediction of labels. Many redundant features may not only slow down training but also reduce prediction accuracy. Therefore, SGO adds two pooling layers to the original GCN to minimize the number of features used. Most previous graph pooling methods have some issues, such as isolated nodes in the subgraph after pooling. Too many isolated nodes hinder the subsequent information propagation. SGO is also optimized to address this issue. SGO mainly consists of two parts:

(1) Graph pooling. Graph pooling measures the information value of each node in the previous layer and retains nodes that are difficult to characterize by their surrounding nodes, while removing nodes that can be easily characterized and whose removal does not cause significant loss. This reduces the number of nodes in the subgraph. In this study, we further optimized the method for multi-omics data. In multi-omics data, the same weighted sample similarity network is shared within a single omics dataset. If the node information score is used directly, different graphs may have completely different remaining nodes after pooling, which is inconsistent with the common-sense expectation that different patients with the same disease have the same effective genes. Therefore, we do not directly use the node information score, but add up the corresponding scores of the same nodes in all graphs to obtain the summarized score for each node, and then send the summarized score to the corresponding nodes in each graph. The pooling is based on the summarized score of each node. Since nodes located at the same position in different graphs receive the same summarized score, the same nodes remain after pooling in each graph. This approach helps to improve the interpretability of the model in the multi-omics data integration domain.

(2) Structure learning. It is important to maintain the graph structure. As an example (Fig. 1), the upper left corner has an isolated node after initial pooling. This isolated node should be connected to the rest of the graph, but its lack of connections hinders the propagation of information to the subsequent layer, especially when the neighboring nodes are aggregated.

A hierarchical graph representation is possible because the whole network stacks convolution and pooling operations. The SGO method first defines a criterion for selecting nodes. We consider a node dispensable if it can be easily reconstructed by its neighbors with almost no information loss. In contrast, if a node cannot be easily reconstructed by its neighbor nodes, we regard it as very important. Accordingly, we define the initial node information score as follows:

$$s_i = \left\| \left(I_i^k - (D_i^k)^{-1} A_i^k\right) H_i^k \right\|_1, \tag{5}$$
where $\|\cdot\|_1$ represents the Manhattan distance, $A_i^k \in \mathbb{R}^{n_i^k \times n_i^k}$ is the adjacency matrix, $I_i^k$ is the identity matrix, $D_i^k$ represents the degree matrix, and $H_i^k \in \mathbb{R}^{n_i^k \times d}$ is the node representation matrix. However, in the case of multiple sets of omics data, the same adjacency matrix is shared within each omics. From a gene regulation perspective, after two pooling operations, the same set of nodes should remain within each omics data. Therefore, SGO adds up the scores for each corresponding node across all graphs to obtain a consolidated score $s$. The calculation method is as follows:

$$s = \begin{pmatrix} \sum_{i=1}^{n} s_i^{1} \\ \sum_{i=1}^{n} s_i^{2} \\ \vdots \\ \sum_{i=1}^{n} s_i^{n_i} \end{pmatrix}, \tag{6}$$

where $s_i^{n_i}$ represents the initial information score for node $n_i$ in the graph $G_i$, and the final score for this series of nodes is $s^{n_i}$:

$$s^{n_i} = \sum_{i=1}^{n} s_i^{n_i}. \tag{7}$$

After obtaining the consolidated score $s^{n_i}$, we define:

$$s_i^{n_i} = s^{n_i}. \tag{8}$$
This ensures that nodes located at the same position in different graphs have the same score. We retain nodes with higher information scores because these nodes cannot be well represented by their neighboring nodes, and thus can provide more information. Specifically, we first re-order the nodes in the graph based on the obtained consolidated scores and select a subset of nodes with higher ranks for the following operations:

$$\begin{aligned} \mathrm{index} &= \text{top-rank}(s, \; r \cdot n_i^k), \\ H_i^{k+1} &= H_i^k(\mathrm{index}, :), \\ A_i^{k+1} &= A_i^k(\mathrm{index}, \mathrm{index}). \end{aligned} \tag{9}$$

The first formula gives the indices of the $r \cdot n_i^k$ elements with the largest values after re-ordering, where $r$ represents the pooling rate. $H_i^{k+1}$ is the hidden representation matrix of the derived subgraph, and $A_i^{k+1}$ is the adjacency matrix of the derived subgraph.

Regarding structure learning, we employ a novel layer that takes the adjacency matrix $A_i^k \in \mathbb{R}^{n_i^k \times n_i^k}$ and the hidden representation $H_i^k \in \mathbb{R}^{n_i^k \times d}$, which carry the graph structural information, as inputs. The objective is to learn the graph structural information, which can encode the underlying pairwise relationships between every pair of nodes. The similarity score between nodes $i$ and $j$ can be expressed as:

$$E_i^k(i, j) = \sigma\left(\vec{\alpha} \left[ H_i^k(i, :) \,\|\, H_i^k(j, :) \right]^{\top}\right) + \beta \cdot A_i^k(i, j), \tag{10}$$

where $\sigma(\cdot)$ is the activation function, $\vec{\alpha} \in \mathbb{R}^{1 \times 2d}$ is the weight vector, and $H_i^k(i, :) \in \mathbb{R}^{1 \times d}$ and $H_i^k(j, :) \in \mathbb{R}^{1 \times d}$ indicate the $i$th and $j$th rows of matrix $H_i^k$. The symbol $\|$ represents the concatenation operation. If nodes $i$ and $j$ are not directly connected, then $A_i^k(i, j) = 0$. We add $A_i^k$ to make the scores of directly connected nodes higher. $\beta$ is a balancing parameter. We retain the edges with high similarity scores between nodes as the outcome of structure learning. Then, we normalize the results using the SparseMax function [16].
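The scoring and pooling steps of Eqs. 5 to 9 can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the graphs are assumed to share one node ordering, isolated nodes are assumed to reconstruct to zero, and rounding `r * n` up with `ceil` is an assumption because the text does not state how fractional node counts are handled.

```python
import numpy as np

def node_information_scores(A, H):
    """Eq. 5: row-wise L1 norm of (I - D^{-1} A) H, one score per node."""
    degree = A.sum(axis=1)
    inv_degree = np.divide(1.0, degree,
                           out=np.zeros_like(degree, dtype=float),
                           where=degree > 0)
    # Each node minus the average of its neighbours' representations.
    residual = H - (inv_degree[:, None] * A) @ H
    return np.abs(residual).sum(axis=1)

def consolidated_scores(adjs, feats):
    """Eqs. 6-8: add up each node's score over all graphs."""
    return sum(node_information_scores(A, H) for A, H in zip(adjs, feats))

def top_rank_pool(adjs, feats, r=0.5):
    """Eq. 9: keep the same ceil(r * n) highest-scoring nodes in every graph."""
    s = consolidated_scores(adjs, feats)
    n_keep = int(np.ceil(r * s.shape[0]))
    index = np.sort(np.argsort(-s, kind="stable")[:n_keep])
    pooled_feats = [H[index, :] for H in feats]
    pooled_adjs = [A[np.ix_(index, index)] for A in adjs]
    return index, pooled_adjs, pooled_feats
```

Because `index` is computed once from the consolidated score and then applied to every graph, all graphs keep the same nodes, which is the property the text relies on for interpretability.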
Fig. 1. Illustration of SGO. The workflow of the SGO section is enclosed in a dashed box. Since multi-modal data share the same adjacency matrix, SGO must ensure that the same nodes remain after processing each graph. This approach is more interpretable for gene regulation. The dashed lines represent edges that were learned through structure learning and did not exist previously.
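The structure learning step (the similarity score of Eq. 10 followed by SparseMax normalization) can be sketched as below. The names `structure_scores` and `learn_structure` are hypothetical, ReLU is assumed for the activation, the weight vector is passed as a flat array of length 2d, and SparseMax is applied row by row; none of these choices is confirmed by the text. The `sparsemax` function follows the closed-form projection of Martins and Astudillo [16].

```python
import numpy as np

def sparsemax(z):
    """Sparsemax of a 1-D vector: Euclidean projection onto the simplex."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    support = 1 + ks * z_sorted > cumsum   # always true for k = 1
    k_z = ks[support][-1]                  # size of the support
    tau = (cumsum[k_z - 1] - 1) / k_z      # threshold
    return np.maximum(z - tau, 0.0)

def structure_scores(H, A, alpha, beta=1.0):
    """Eq. 10: E(i, j) = sigma(alpha [H_i || H_j]^T) + beta * A(i, j)."""
    d = H.shape[1]
    left = H @ alpha[:d]     # part of alpha acting on row i
    right = H @ alpha[d:]    # part of alpha acting on row j
    logits = left[:, None] + right[None, :]
    return np.maximum(logits, 0.0) + beta * A   # ReLU assumed as sigma

def learn_structure(H, A, alpha, beta=1.0):
    """Row-wise SparseMax of the similarity scores gives a sparse adjacency."""
    E = structure_scores(H, A, alpha, beta)
    return np.vstack([sparsemax(row) for row in E])
```

Unlike softmax, SparseMax can assign exactly zero weight to weak candidate edges, so the learned adjacency stays sparse while previously isolated nodes can still gain connections.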
2.3 VCDN for Multi-omics Data Integration

VCDN was originally designed for two-dimensional data. In MOFNet, we further generalize it to adapt to samples of any dimensions. Compared to the commonly used direct concatenation or fusion in low-dimensional feature space, VCDN can leverage higher-level cross-omics correlations in the label space because different types of omics data can provide unique class-level distinctiveness.
In the case study of BRCA, we use three types of omics data: mRNA expression, DNA methylation, and miRNA expression. For the $q$th sample, the prediction results $\hat{y}_q^{(i)}$, $i = 1, 2, 3$, obtained from the different omics are used to construct a cross-omics discovery tensor $T_q \in \mathbb{R}^{c \times c \times c}$, where $c$ represents the number of categories in the classification task:

$$T_{q, a_1, a_2, a_3} = \hat{y}^{(1)}_{q, a_1} \, \hat{y}^{(2)}_{q, a_2} \, \hat{y}^{(3)}_{q, a_3}, \tag{11}$$

where $\hat{y}^{(i)}_{q, a}$ denotes the $a$th entry of $\hat{y}_q^{(i)}$. The obtained cross-omics discovery tensor $T_q$ is reshaped into a vector of $c^3$ dimensions and forwarded to VCDN(·) for the final prediction. VCDN(·) is designed as a fully connected neural network with an output dimension of $c$. Therefore, VCDN(·) can integrate predictive results from diverse dimensions of omics data and uncover potential cross-omics label associations, thus facilitating performance improvement. The final prediction of MOFNet is based on the initial predictions from specific omics and the learned knowledge of cross-omics label associations.
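The tensor of Eq. 11 is the outer product of the three per-omics prediction vectors. A minimal NumPy sketch, with hypothetical function names and with the fully connected VCDN network itself left out, is:

```python
import numpy as np

def cross_omics_tensor(y1, y2, y3):
    """Eq. 11: T[a1, a2, a3] = y1[a1] * y2[a2] * y3[a3]."""
    return np.einsum("a,b,c->abc", y1, y2, y3)

def vcdn_input(y1, y2, y3):
    """Flatten the c x c x c tensor into the c**3 vector fed to VCDN."""
    return cross_omics_tensor(y1, y2, y3).reshape(-1)
```

With the five PAM50 classes of Table 1, each sample yields a 125-dimensional input vector. If each per-omics prediction is a probability vector, the entries of the tensor also sum to one.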
3 Results and Discussion

3.1 Overview of MOFNet

Figure 2 shows an overview of MOFNet. The graph-neural-network-based framework contains two key steps:

(1) The first step is graph pooling. Here, we propose the SGO (Similarity Graph pOoling with structure learning) method to learn from omics data with different dimensions. For each type of omics data, the cosine similarity is calculated among the omics features, and a weighted sample similarity network is constructed by setting a threshold. Then, the omics features and the constructed weighted sample similarity network are used for training. Since using too many genes may decrease prediction accuracy, we use the SGO method to perform two rounds of subsampling on all genes. SGO performs graph pooling by adaptively selecting a subset of nodes to form a derived subgraph for the subsequent layer: it deletes nodes that can be easily replaced by neighboring nodes without losing too much information and retains nodes that are difficult to represent by neighboring nodes. With the addition of graph attention mechanisms, it learns enough graph structure information to retain the connection edges between similar nodes, which effectively alleviates the problem of isolated nodes in the subgraph. During the pooling process, we also improved the way of selecting nodes based on the characteristics of multimodal omics data, ensuring that for the same disease, the effective genes in the same sample are the same within a batch, i.e., the same graph will have the same remaining nodes in the subgraph after pooling. This proposed scoring method ensures the consistency of gene regulatory relationships and enhances the interpretability of graph pooling operations in the field of multimodal omics and gene regulation. The hyperparameters used in the model are: pooling rate = 0.5, learning rate = 0.001, hidden layer = 128, pooling layer = 2.
(2) Second is multi-omics integration based on VCDN (View Correlation Discovery Network). It utilizes the initial prediction results from each omics data to construct a
cross-omics discovery tensor that reflects the correlation of labels across omics. This tensor is reshaped into a vector and forwarded to VCDN for final label prediction. By exploring the potential correlations of different omics data types in a high-level label space, VCDN effectively integrates the initial predictions from each omics network [14]. MOFNet is an end-to-end model that alternately trains omics-specific SGO and VCDN until convergence. In summary, the final prediction of MOFNet is based on the initial predictions generated by SGO for specific omics and the high-dimensional label space prediction generated by VCDN for cross-omics. As far as we know, MOFNet is the first method to explore graph pooling with attention mechanisms in a high-dimensional label space. From the results, MOFNet not only uses significantly fewer features than the SOTA methods using graph convolution (reducing features by 75%), but also achieves better prediction performance.
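The 75% reduction follows from the listed hyperparameters if one assumes that each of the $L = 2$ pooling layers keeps a fraction $r = 0.5$ of the nodes it receives, with rounding ignored. Under that assumption, the retained fraction is:

```latex
r^{L} = 0.5^{2} = 0.25, \qquad 1 - r^{L} = 0.75 .
```

Applied to the feature counts in Table 1, this would leave roughly 250 of the 1000 mRNA features, 250 of the 1000 DNA methylation features, and about 126 of the 503 miRNA features. These counts are an illustration of the arithmetic, not figures reported by the authors.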
Fig. 2. Illustration of MOFNet. MOFNet is an end-to-end model where all networks are jointly learned. MOFNet contains two major steps, i.e., SGO and VCDN. SGO is used for learning specific-dimensional omics data, while VCDN combines results from different omics data in a high-dimensional feature space. Firstly, cosine similarity is calculated for each omics data, and a sample similarity network is constructed by setting a threshold. Then, the sample similarity network is used for learning together with the omics feature network, resulting in the initial prediction by SGO. Afterwards, the initial predictions from multiple omics data are concatenated in a high-dimensional space to calculate the cross-omics discovery tensor for learning in the high-dimensional label space. Finally, a fully connected VCDN is used for the final prediction.
3.2 Multi-omics Classification Performance Evaluation

We compare the classification performance of MOFNet with numerous existing supervised multi-omics ensemble methods. To evaluate and compare the methods, we employ accuracy (ACC), average F1 score weighted by support (F1_weighted), and macro-averaged F1 score (F1_macro) for multi-class classification tasks. The results are shown in Table 2. The HGPSL method is a single-modal method, so we conduct the comparative experiments using modes 1–3 (M1–M3) separately. We can see that the MOFNet method
outperforms the other methods in all three metrics: ACC, F1_weighted, and F1_macro. Even compared to the latest and best-performing MOGONET method, MOFNet still achieves significant advantages, with an increase of 1.5 percentage points in ACC, 2.7 percentage points in F1_weighted, and 5.8 percentage points in F1_macro.

Table 2. Classification results on BRCA dataset.

| Method | ACC | F1_weighted | F1_macro |
|---|---|---|---|
| KNN | 0.753 | 0.732 | 0.704 |
| Random Forest | 0.768 | 0.757 | 0.673 |
| SVM | 0.729 | 0.500 | 0.330 |
| Lasso | 0.715 | 0.702 | 0.656 |
| Ridge | 0.760 | 0.747 | 0.681 |
| Elastic Net | 0.734 | 0.700 | 0.571 |
| HGPSL(M1) | 0.672 | 0.601 | 0.497 |
| HGPSL(M2) | 0.702 | 0.628 | 0.535 |
| HGPSL(M3) | 0.580 | 0.521 | 0.363 |
| NN_NN | 0.796 | 0.784 | 0.723 |
| NN_VCDN | 0.792 | 0.781 | 0.721 |
| MOGONET | 0.829 | 0.825 | 0.774 |
| MOGONET_NN | 0.805 | 0.782 | 0.737 |
| MOFNet (Ours) | 0.844 | 0.852 | 0.832 |
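For readers who want to reproduce the three metrics, the following sketch computes ACC, F1_weighted, and F1_macro from label vectors using NumPy only. The function name is hypothetical, and the convention that a class with no true and no predicted samples gets an F1 of zero is an assumption; it is not taken from the paper's evaluation code.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return ACC, support-weighted F1 and macro-averaged F1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    f1, support = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1.append(2 * tp / denom if denom > 0 else 0.0)  # per-class F1
        support.append(np.sum(y_true == c))              # true count per class
    f1 = np.array(f1, dtype=float)
    support = np.array(support, dtype=float)
    return {
        "ACC": float(np.mean(y_true == y_pred)),
        "F1_weighted": float(np.sum(f1 * support) / np.sum(support)),
        "F1_macro": float(np.mean(f1)),
    }
```

F1_macro gives every subtype equal weight, so it is the metric most sensitive to small classes such as HER2-enriched (46 samples in Table 1), which may explain why the gaps between methods are largest in that column.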
3.3 Performance of MOFNet Under Different Omics Data Types

We conduct ablation experiments to further demonstrate the effectiveness of aggregating multi-omics data in improving classification performance. We compare the classification performance of models using three types of omics data (mRNA + Meth + miRNA), two types of omics data (mRNA + Meth, mRNA + miRNA, Meth + miRNA), and one type of omics data (mRNA, Meth, miRNA) separately. The results are shown in Fig. 3. In all classification tasks, the model using three types of omics data outperforms the models using any two or any one type of omics data. The models using two types of omics data generally perform better than the models using one type of omics data.

Fig. 3. Comparison results. The classification performances are achieved by integrating different combinations of multi-omics data. (Bar chart of ACC, F1_weighted, and F1_macro for mRNA+meth+miRNA, mRNA+meth, mRNA+miRNA, meth+miRNA, mRNA, meth, and miRNA; vertical axis from 0.5 to 0.9.)

3.4 MOFNet Identified Breast Cancer Biomarkers

For BRCA PAM50 subtype classification, MOFNet identifies the top 30 features (10 mRNA features, 10 DNA methylation features, and 10 miRNA features) as biomarkers (shown in Table 3). For example, among the genes identified from mRNA expression features, several GO and KEGG terms related to breast cancer are significantly enriched, including ABC transporters (KEGG:02010, p = 9.96E-3) and regulation of cell cycle (GO:0051726, p = 8.12E-9). ABCC1 was found to be correlated with breast cancer proliferation in siRNA knockdown cell models [17]. Its expression was strongly correlated with high-grade breast carcinomas in BRCA patients [18]. In hormone-receptor-positive breast cancer, estrogen drives cell cycle progression by binding to the ER, leading to its dimerization, translocation to the nucleus, and transcriptional activity at estrogen response elements (EREs) [19]. Due to page limits, a more detailed interpretation of these biomarkers is left to future work.

Table 3. Important omics biomarkers identified by MOFNet.

| Omics data type | Biomarker |
|---|---|
| mRNA expression | ABCA13, ABCC8, ABCG2, ABCG1, ABCC11, C10ORF90, APBB2, AURKB, CDCA8, ANLN |
| DNA methylation | ACSM2A, ABAT, ACSM5, ALDH5A1, ACYP1, ABAT, ABT1, ADAM8, ADAMTS3, AGR2 |
| miRNA expression | hsa-let-7a-1, hsa-let-7a-2, hsa-mir-125b-1, hsa-let-7b, hsa-let-7a-3, hsa-let-7c, hsa-mir-17, hsa-let-7f-2, hsa-let-7d, hsa-let-7g |
4 Conclusion This paper proposed a supervised multi-omics data fusion method, MOFNet, to address the classification problem of breast cancer samples. Currently, the common approach for breast cancer subtype classification using multiple modalities is to directly concatenate
input features to learn a classification model that integrates different omics datasets. However, this approach often fails to consider the correlation between data types and may be biased towards certain data types. Existing methods often employ graph convolutional networks to learn specific omics data types, but this may result in feature redundancy, which slows down training and reduces testing accuracy. The proposed MOFNet method uses cross-omics tensor discovery in the label space to explore cross-omics correlations, effectively improving the integration of multi-omics data. Within each individual omics, we proposed the SGO technique, which improves the general graph pooling step and extends it to multi-dimensional modalities, making it suitable for the characteristics of multiple modalities and reducing feature redundancy and model complexity. Additionally, SGO ensures that the same genes remain after pooling, enhancing the biological interpretability of the model and making it more suitable for the specific biomedical problem of breast cancer sample classification. Finally, MOFNet uses only 25% of the features but achieves higher performance than the SOTA methods. Moreover, MOFNet can also be generally employed to select biomarkers for other diseases. In summary, MOFNet is a novel deep learning-based multi-omics classification algorithm with lower complexity and better performance.

Acknowledgments. This work was partially supported by National Natural Science Foundation of China (No. 61973190); National Key Research and Development Program of China (Nos. 2022YFA1004801, 2020YFA0712402); the Fundamental Research Funds for the Central Universities (No. 2022JC008) and the program of Qilu Young Scholar of Shandong University.
References

1. Hanahan, D., Weinberg, R.A.: Hallmarks of cancer: the next generation. Cell 144(5), 646–674 (2011). https://doi.org/10.1016/j.cell.2011.02.013
2. González-García, I., Solé, R.V., Costa, J.: Metapopulation dynamics and spatial heterogeneity in cancer. Proc. Natl. Acad. Sci. 99(20), 13085–13089 (2002)
3. Shipitsin, M., et al.: Molecular definition of breast tumor heterogeneity. Cancer Cell 11(3), 259–273 (2007)
4. Urruticoechea, A., Alemany, R., Balart, J., Villanueva, A., Vinals, F., Capella, G.: Recent advances in cancer therapy: an overview. Curr. Pharm. Des. 16(1), 3–10 (2010)
5. Toss, A., Cristofanilli, M.: Molecular characterization and targeted therapeutic approaches in breast cancer. Breast Cancer Res. 17(1), 1–11 (2015)
6. Lee, Y.-M., Oh, M.H., Go, J.-H., Han, K., Choi, S.-Y.: Molecular subtypes of triple-negative breast cancer: understanding of subtype categories and clinical implication. Genes Genomics 42, 1381–1387 (2020)
7. Singh, A., et al.: DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics 35(17), 3055–3062 (2019). https://doi.org/10.1093/bioinformatics/bty1054
8. Kim, D., Li, R., Dudek, S.M., Ritchie, M.D.: ATHENA: Identifying interactions between different levels of genomic data associated with cancer clinical outcomes using grammatical evolution neural network. BioData Mining 6(1), 23 (2013). https://doi.org/10.1186/1756-0381-6-23
9. Subramanian, I., Verma, S., Kumar, S., Jere, A., Anamika, K.: Multi-omics data integration, interpretation, and its application. Bioinform. Biol. Insights 14, 1177932219899051 (2020)
10. Günther, O.P., et al.: A computational pipeline for the development of multi-marker biosignature panels and ensemble classifiers. BMC Bioinformatics 13(1), 326 (2012). https://doi.org/10.1186/1471-2105-13-326
11. Wang, L., Ding, Z., Tao, Z., Liu, Y., Fu, Y.: Generative multi-view human action recognition. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 6211–6220. IEEE (2019). https://doi.org/10.1109/ICCV.2019.00631
12. Shen, R., Olshen, A.B., Ladanyi, M.: Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25(22), 2906–2912 (2009). https://doi.org/10.1093/bioinformatics/btp543
13. Huang, Z., et al.: SALMON: survival analysis learning with multi-omics neural networks on breast cancer. Front. Genet. 10, (2019). Accessed 14 Mar 2023. https://www.frontiersin.org/articles/10.3389/fgene.2019.00166
14. Wang, T., et al.: MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat. Commun. 12(1), 3445 (2021). https://doi.org/10.1038/s41467-021-23774-w
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
16. Martins, A., Astudillo, R.: From softmax to sparsemax: a sparse model of attention and multilabel classification. In: International Conference on Machine Learning, PMLR, pp. 1614–1623 (2016)
17. Low, F.G., Shabir, K., Brown, J.E., Bill, R.M., Rothnie, A.J.: Roles of ABCC1 and ABCC4 in proliferation and migration of breast cancer cell lines. Int. J. Mol. Sci. 21(20), 7664 (2020)
18. Hlaváč, V., et al.: The expression profile of ATP-binding cassette transporter genes in breast carcinoma. Pharmacogenomics 14(5), 515–529 (2013)
19. Platet, N., Cathiard, A.M., Gleizes, M., Garcia, M.: Estrogens and their receptors in breast cancer progression: a dual role in cancer proliferation and invasion. Crit. Rev. Oncol. Hematol. 51(1), 55–67 (2004)
EEG Convolutional Sparse Transformer for Epilepsy Detection and Related Drug Classification

Zhengda He1,2, Linjie Chen2, Hao Lv2, Rui-ning Zhou2, Jiaying Xu2, Yadong Chen2, Jianhua Hu2, and Yang Gao1(B)

1 Nanjing University, Nanjing, Jiangsu, China
[email protected]
2 China Pharmaceutical University, Nanjing, Jiangsu, China
Abstract. Epilepsy is one of the most common neurological disorders worldwide and causes significant damage to patients' health. The clinical EEG manifestations of epilepsy are diverse and complex, so it is necessary to study efficient EEG-based automatic epilepsy detection techniques and to use them to monitor and develop epilepsy-related drugs. In this paper, we propose a convolutional sparse Transformer architecture that learns directly from raw EEG data for epilepsy detection and epilepsy-related drug classification. The proposed model uses a channel attention module to capture the correlation between different spatial locations of the signal. We also construct a sparse Transformer and combine it with a convolutional neural network. The resulting model is better suited than the standard Transformer to learning on long sequence data such as EEG, because it avoids the performance degradation caused by dense attention. We perform experiments on epilepsy detection and related drug classification datasets, and the results show that the proposed model achieves leading performance. The proposed model is a unified architecture suitable for epilepsy detection and drug classification and can also be used for other diseases and drug discovery.

Keywords: EEG · Convolutional Sparse Transformer · Epilepsy Detection · Drug Classification
1 Introduction

Epilepsy is a neurological disorder caused by paroxysmal abnormal discharges of neurons in the brain. It can impair mental and cognitive function and can even lead to death [1]. Electroencephalography (EEG) measures the electrical activity of the brain using multiple electrodes placed at different locations, either on the surface of the scalp or implanted for a short time inside the skull [2]. The recorded signal usually contains multiple channels. EEG is a useful clinical tool for determining the onset of epilepsy and can be used to identify, predict, and localize epilepsy [3].

L. Chen: Co-first author.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 739–751, 2023. https://doi.org/10.1007/978-981-99-4749-2_63
In addition to the diagnostic analysis of diseases, EEG can be used to characterize and quantify the effects of drugs on the central nervous system [4]. Pharmaco-EEG can serve as a biomarker to identify the type of drug a patient is taking, to analyze the mechanism of action of psychotropic drugs, and even to support psychotropic drug discovery [5]. Many studies have used Pharmaco-EEG to study epilepsy-related drugs [6].

The two core aspects of building an automatic analysis model for epilepsy-related EEG are feature extractor design and classifier design [14]. Traditional machine learning methods employed in this field include Support Vector Machines (SVM) and K-Nearest Neighbor (KNN) algorithms [7]. However, because these conventional methods require features to be designed manually from expert experience, deep learning-based techniques have received increasing attention in recent years. The leading deep learning-based models are Convolutional Neural Networks (CNN) [8], Recurrent Neural Networks (RNN) [9], and Convolutional Recurrent Networks that fuse the two [10]. These methods have shortcomings of their own. CNNs and RNNs are not good at handling long-range dependencies in long sequence data [11]. RNNs cannot capture spatial information and have low parallel computing efficiency [12].

The Transformer [13] has achieved top performance in natural language processing (NLP) [14] and computer vision (CV) applications [15], and researchers have started to apply it to EEG signal analysis [16]. However, most current work first transforms EEG signals into correlation matrices or interpolates them into images before processing them [17]. This approach can lose a lot of signal information or introduce noise, so it is necessary to learn directly from the original EEG signal. Studies that do learn from the raw signal use only the standard Transformer, which struggles with long EEG sequences because of the high computational cost of self-attention, resulting in poor performance. In addition, learning directly on raw time-domain data does not take the spatial correlation of EEG signals into account.

To solve these problems, we propose the convolutional sparse Transformer, which learns and extracts features from raw EEG data for classification. We design a spatial channel attention module to capture the spatial dependence of the EEG signal, and a sparse Transformer to implement an effective attention mechanism over the time domain of the signal. We also use distilling convolutional layers to reduce the temporal dimension and select essential features. We perform experiments on an epilepsy detection dataset and an epilepsy-related drug classification dataset, and the proposed model obtains leading performance on both.

Our contributions are as follows:
1. We constructed effective end-to-end EEG analysis models for epilepsy, which can learn from the raw EEG data.
2. We designed the channel attention module to capture the correlation of different spatial channels of EEG signals.
3. We combine a sparse Transformer and a CNN for EEG analysis for the first time.
4. The proposed model is a unified architecture applicable to epilepsy detection and drug classification and can also be used for other diseases and related drug discovery processes.
2 Related Work

2.1 Epilepsy Detection and Related Drug Classification

Tapani et al. used an SVM to detect seizures [18]. Nagarajan et al. [19] extracted seven time-domain features and four entropy-domain features from the raw EEG signal, used Principal Component Analysis (PCA) to reduce the feature dimension, and then classified the features with ProtoNN, a KNN-based model. Ansari et al. [20] used a CNN for feature extraction and a decision tree to classify each raw multichannel EEG data segment. O'Shea et al. [21] designed a Fully Convolutional Network (FCN) architecture for neonatal epilepsy detection, using convolutional layers to extract features from the input and perform classification. Frassineti et al. [22] used the Stationary Wavelet Transform (SWT) to remove unnecessary frequency information from the EEG and then used a CNN and an FCN to perform the classification task. Tanveer et al. [8] proposed an end-to-end framework based on 2D CNNs for epilepsy detection that takes time-domain EEG signals as model input. Besides CNNs, RNN and autoencoder architectures are also widely used [23]. Kalitin et al. used an autoencoder architecture to extract drug EEG features for epilepsy-related drug classification [24].

2.2 Transformer for EEG Analysis

More and more researchers have applied the Transformer's attention mechanism to the EEG domain. Shi et al. [25] proposed a fully automated sleep scoring model that uses the Transformer architecture to capture the features of different sleep stages. Sun et al. [16] constructed several Transformer-based models for motor imagery EEG classification, which outperformed CNN, RNN, and other models. Wang et al. [26] proposed a Transformer-based model that learns spatial information hierarchically, from the electrode level to the brain-region level, to discriminate emotional states, and the model achieved outstanding performance. For seizure detection, Pedoeem et al. [27] built a Transformer-based architecture, introducing the Transformer to epilepsy detection for the first time.
3 Method

3.1 Model Architecture

A crucial current research direction for Transformer models is the design of new sparse attention schemes, and many works have been successful [28–30]. Zhou et al. proposed the Informer model, which achieves leading performance on long time-series prediction tasks using a sparse encoder-decoder structure [31]. Inspired by Zhou's work, we propose a convolutional sparse Transformer architecture that, for the first time, combines a sparse Transformer encoder with convolutional layers. The model can learn from raw EEG data for epilepsy detection and classification of related drugs. As shown in Fig. 1, the proposed model consists of four components:
1. A spatial channel attention module for modeling the correlation between EEG signal channels.
2. A sparse Transformer encoder component for attention-based information interaction over the full time domain of the EEG signal.
3. A distilling convolutional layer component for reducing the time dimension and extracting essential features.
4. A predictor module, which uses the extracted features for signal classification.

Our model architecture uses three sparse Transformer layers followed by three standard Transformer layers. The last three layers use standard attention because the time dimension of their input features is already small enough.
Fig. 1. The architecture of the proposed EEG convolutional sparse Transformer
3.2 EEG Input Signal

Let the number of electrode channels of the EEG signal be $C$ and the number of sampled time frames in one epoch be $T$. The EEG signal of this epoch can be expressed as $X \in \mathbb{R}^{T \times C}$, where the $i$-th row of $X$ represents the multichannel signal at the $i$-th time frame, $x_i \in \mathbb{R}^{C}$, $x_i = [x_{i,1}, x_{i,2}, \cdots, x_{i,C}]$, and $x_{i,j}$ is the value of the $j$-th electrode channel in the $i$-th time frame.

3.3 Spatial Channel Attention Module

EEG events such as seizures look different on different electrode channels and may even be restricted to specific channels. In addition, the correlation between signal channels varies from pair to pair. Previous work has paid little attention to the correlation between feature channels, so the extracted features interfere with each other and degrade the model's predictive performance. To capture the spatial dependence of seizure location, we construct a spatial channel attention module that uses an attention mechanism to improve accuracy by reinforcing the essential channel features. This module is designed using a structure similar to CBAM Networks [32] in computer vision, as shown in Fig. 2. First, the EEG signal is processed in the spatial channel dimension using global maximum pooling and average pooling (the squeeze operation) to generate the statistics of each channel. An MLP network is then used to calculate the degree of dependence among channels, which is the channel attention $M_c(X)$. Finally, $M_c(X)$ is multiplied by the input EEG signal $X$ to obtain the new, rescaled features.

$$M_c(X) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(X)) + \mathrm{MLP}(\mathrm{MaxPool}(X))\big) = \sigma\big(W_1(W_0(X^{c}_{avg})) + W_1(W_0(X^{c}_{max}))\big) \tag{1}$$

Here $\sigma$ denotes the sigmoid function, $W_0$ and $W_1$ are the weights of the shared MLP, and $X^{c}_{avg}$ and $X^{c}_{max}$ are the average-pooled and max-pooled channel statistics.
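The channel attention of Eq. (1) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the hidden width of the MLP, the ReLU between $W_0$ and $W_1$ (borrowed from CBAM), and the random weights are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def shared_mlp(v, W0, W1):
    """Two-layer MLP W1(ReLU(W0 v)); the ReLU is an assumption taken from CBAM."""
    hidden = [max(0.0, h) for h in matvec(W0, v)]
    return matvec(W1, hidden)


def channel_attention(X, W0, W1):
    """Return (M_c, X_scaled) for one epoch X given as T rows of C channel values."""
    C = len(X[0])
    columns = [[row[c] for row in X] for c in range(C)]
    x_avg = [sum(col) / len(col) for col in columns]  # AvgPool over time
    x_max = [max(col) for col in columns]             # MaxPool over time
    logits = [a + m for a, m in zip(shared_mlp(x_avg, W0, W1),
                                    shared_mlp(x_max, W0, W1))]
    M_c = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid, Eq. (1)
    # Rescale every time frame channel by channel.
    X_scaled = [[row[c] * M_c[c] for c in range(C)] for row in X]
    return M_c, X_scaled


if __name__ == "__main__":
    random.seed(0)
    T, C, hidden = 8, 4, 2  # toy sizes, not the paper's settings
    X = [[random.gauss(0, 1) for _ in range(C)] for _ in range(T)]
    W0 = [[random.gauss(0, 0.5) for _ in range(C)] for _ in range(hidden)]
    W1 = [[random.gauss(0, 0.5) for _ in range(hidden)] for _ in range(C)]
    M_c, X_scaled = channel_attention(X, W0, W1)
    print([round(m, 3) for m in M_c])
```

Each entry of `M_c` lies strictly between 0 and 1, so the module can only attenuate or preserve a channel relative to the others; it never changes the temporal length of the epoch.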
Fig. 2. The attention mechanism used in Sparse Transformer
3.4 Sparse Transformer Encoder

The time complexity and memory consumption of the vanilla Transformer's self-attention computation is $O(L^2)$ [13], where $L$ is the sequence length. When modeling long sequences, this high computational cost limits the design and scaling of the model. In addition, the vanilla Transformer applies global dense attention to all input embeddings, so many unimportant embeddings introduce signal noise and degrade model performance. In the Informer work [31], Zhou et al. pointed out that only a few critical embeddings in the input sequence, those corresponding to active queries, need global attention, and the remaining embeddings can use average attention. This improvement can enhance prediction performance. The sparse Transformer we constructed differs from Zhou's work in two ways: (1) Zhou uses a content-based sparsity measure to select active queries, while we choose active queries at equal time intervals, and experiments show that this selection method can further improve model performance. (2) We only use the Transformer encoder and do not use the decoder.

In the vanilla Transformer, the input embedding of each encoder layer is updated by a global dense dot-product attention mechanism, expressed as

$$A(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

where $Q \in \mathbb{R}^{L_Q \times d}$, $K \in \mathbb{R}^{L_K \times d}$, and $V \in \mathbb{R}^{L_V \times d}$ are the query, key, and value matrices of the input embedding of this layer, $L_Q$, $L_K$, and $L_V$ are the numbers of rows of the three matrices, and $d$ is the model dimension. The time complexity and memory consumption of computing $A(Q, K, V)$ is $O(L_Q L_K)$.

We designed a sparse Transformer encoder that uses global attention updates for only a few terms. The computational steps are as follows.

(1) $\ln(L)$ terms are selected at equal intervals from the input embedding of the current encoder layer, that is, $i = 1,\ 1 + L/\ln(L),\ 1 + 2L/\ln(L), \ldots$ The attention of the corresponding $i$-th query (the $i$-th row of $Q$) is calculated as

$$A(q_i, K, V) = \sum_{j} \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)}\, v_j \tag{2}$$

where

$$k(q_i, k_j) = \exp\left(\frac{q_i k_j^{T}}{\sqrt{d}}\right) \tag{3}$$

(2) The attention of the remaining terms is updated using the mean of the value matrix $V$:

$$A(q_i^{other}, K, V) = \mathrm{Mean}(v_j) \tag{4}$$

The time complexity and memory consumption of the sparse Transformer is $O(L_K \ln L_Q)$.

3.5 Distilling Convolutional Layer

We add a distilling convolutional layer after each encoder layer. This convolutional layer serves two purposes: (1) It fuses the information of temporally adjacent input features, taking advantage of the CNN architecture. (2) Its maximum pooling operation reduces the temporal dimension of the input features. It also eliminates the redundant average features in the Transformer encoder output and preserves the essential features for the next layer, which can be seen as a distillation process. The feature update for each convolutional layer is

$$X_{j+1} = \mathrm{MaxPool}\big(\mathrm{ELU}\big(\mathrm{Conv1d}(X_j)\big)\big) \tag{5}$$
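Equations (2) to (5) can be sketched in plain Python as one sparse encoder step followed by one distilling step. This is an illustrative sketch under stated assumptions rather than the authors' code: rounding $\ln(L)$ up, the integer step between active queries, the zero padding of the convolution, and a pooling window equal to the stride are all choices the paper does not specify, and the function names are hypothetical.

```python
import math


def softmax_attention_row(q, K, V):
    """Eqs. (2)-(3): kernel-weighted average of the value rows for one query."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, V)) / total
            for c in range(len(V[0]))]


def active_indices(L):
    """About ln(L) query positions at equal intervals (0-based, assumed rounding)."""
    n = max(1, math.ceil(math.log(L)))
    step = max(1, L // n)
    return list(range(0, L, step))[:n]


def sparse_attention(Q, K, V):
    """Active queries use Eq. (2); every other row receives the mean of V, Eq. (4)."""
    mean_v = [sum(v[c] for v in V) / len(V) for c in range(len(V[0]))]
    active = set(active_indices(len(Q)))
    return [softmax_attention_row(q, K, V) if i in active else list(mean_v)
            for i, q in enumerate(Q)]


def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def distill(X, W, pool=2):
    """Eq. (5): MaxPool(ELU(Conv1d(X))) along the time axis.

    X holds L rows of d_in features and W[o][i][k] is a width-3 kernel.
    One step of zero padding per side is an assumption.
    """
    L, d_in = len(X), len(X[0])
    pad = [[0.0] * d_in]
    Xp = pad + X + pad
    conv = [[elu(sum(W[o][i][k] * Xp[t + k][i]
                     for i in range(d_in) for k in range(3)))
             for o in range(len(W))]
            for t in range(L)]
    return [[max(row[o] for row in conv[t:t + pool]) for o in range(len(W))]
            for t in range(0, L - pool + 1, pool)]


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    out = sparse_attention(Q, Q, V)
    identity = [[[0, 1, 0], [0, 0, 0]], [[0, 0, 0], [0, 1, 0]]]
    print(active_indices(len(Q)), len(distill(out, identity)))
```

Only the rows returned by `active_indices` pay for a full pass over the keys, which is where the $O(L_K \ln L_Q)$ cost comes from; the distilling step then halves the number of time steps passed to the next layer.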
4 Experiments

4.1 Datasets

We conducted comparison experiments on two datasets to verify the effectiveness of the proposed model.

(1) Helsinki children's EEG dataset [33]. The dataset contains multichannel EEG recordings of 79 neonates admitted to the neonatal intensive care unit at Helsinki University Hospital, 39 of whom suffered from neonatal seizures. The signals were recorded at a sampling rate of 256 Hz and are stored in EDF files. The recording files contain expert annotations of whether a seizure is present and can therefore be used for seizure detection.

(2) Pharmaco-EEG dataset provided by Kalitin et al. [24]. This dataset contains EEG recordings from experimental rats after administration of the maximum therapeutic dose of epilepsy-related drugs. The EEG data were recorded after the drug reached its peak concentration. The maximum therapeutic dose in rats was calculated from the human dose using a conversion factor. Eleven epilepsy-related drugs are included in the dataset, seven of which are anticonvulsants and four of which are proconvulsants. These drugs correspond to five different effects: 1, calcium channel blockers; 2, sodium channel blockers; 3, γ-aminobutyric acid (GABA) analogs; 4, γ-aminobutyric acid antagonists; and 5, choline analogs.

4.2 Data Preprocessing

Our experiments used the raw EEG signals without any transformation other than filtering and slicing. For the Helsinki dataset, we filtered the raw data using a high-pass filter with a cutoff frequency of 0.5 Hz. We then sliced the dataset with time windows of 4 s, 8 s, and 16 s to form datasets containing samples of different lengths. Each sample contains $t \times f \times ch$ values, where $t$ is the window length in seconds, $f$ is the sampling frequency, and $ch$ is the number of electrode channels. For the Kalitin dataset, we sliced the data with a time window of 2 s. The Kalitin dataset contains three kinds of labels: 1. drug name, 2. drug mechanism of action, and 3. drug efficacy. All EEG recordings can be classified into 11 categories by drug name, 5 categories by mechanism of action, and 2 categories by efficacy.

4.3 Experimental Setup

The parameter settings of the proposed model are shown in Table 1. We used the same model architecture on both datasets, containing 3 sparse attention layers, 3 non-sparse attention layers, 5 convolutional layers, and 5 pooling layers. The number of heads in each attention layer is 8, and the hidden dimension is 128. Since the Kalitin dataset requires a more difficult multi-class classification task, we use more epochs, a smaller batch size, and a lower learning rate. The task on the Helsinki dataset is binary classification, so we use fewer epochs, a larger batch size, and a higher learning rate.
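The slicing step of Sect. 4.2 can be sketched as follows. The sketch assumes consecutive, non-overlapping windows and drops an incomplete final window; the paper does not state either choice, and `slice_windows` is a hypothetical helper name. Filtering is left out.

```python
def slice_windows(recording, fs, window_s):
    """Cut a recording into consecutive windows of window_s seconds.

    recording is a list of time frames, each a list of channel values,
    and fs is the sampling rate in Hz. Non-overlapping windows and a
    dropped incomplete tail are assumptions.
    """
    frames = int(round(fs * window_s))  # time frames per window
    n_windows = len(recording) // frames
    return [recording[i * frames:(i + 1) * frames] for i in range(n_windows)]


if __name__ == "__main__":
    fs, ch = 256, 3  # 256 Hz as in the Helsinki dataset; 3 channels is a toy value
    recording = [[0.0] * ch for _ in range(10 * fs)]  # 10 s of silent toy data
    for window_s in (4, 8, 16):
        windows = slice_windows(recording, fs, window_s)
        values_per_sample = window_s * fs * ch  # the t * f * ch count from the text
        print(window_s, len(windows), values_per_sample)
```

Each returned window is a $T \times C$ block in the notation of Sect. 3.2, with $T$ equal to the window length in seconds times the sampling rate.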
Table 1. Main hyper-parameters of the model.

Dataset                  Helsinki          Kalitin
Epochs                   50                300
Batch size               64                32
Learning rate            $10^{-3}$         $10^{-4}$
Sparse-attention layers  3                 3
Full-attention layers    3                 3
Attention heads          8                 8
$d_{model}$              128               128
$d_{ff}$                 512               512
CNN layers               5                 5
Max-pooling layers       5                 5
Distilling               1x3 conv1d, ELU   1x3 conv1d, ELU
Distilling max pooling   stride = 2        stride = 2
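As a rough check on the settings in Table 1, the sketch below tracks how the temporal length shrinks through the five stride-2 pooling layers and compares a simple count of query-key products for sparse and full attention. It assumes the model sees the full 256 Hz window, that each pooling layer halves the length exactly (rounding down), and that $\ln(L)$ is rounded up; none of these details is stated in the paper, so the numbers are indicative only.

```python
import math


def temporal_lengths(window_s, fs, n_pool=5, stride=2):
    """Sequence length entering each of the six encoder layers (assumed halving)."""
    lengths = [int(window_s * fs)]
    for _ in range(n_pool):
        lengths.append(lengths[-1] // stride)
    return lengths


def attention_cost(L, sparse):
    """Rough number of query-key products: L * ceil(ln L) if sparse, else L * L."""
    return L * math.ceil(math.log(L)) if sparse else L * L


if __name__ == "__main__":
    for window_s in (4, 8, 16):
        lengths = temporal_lengths(window_s, fs=256)
        # The first three encoder layers are sparse, the last three are full.
        costs = [attention_cost(L, sparse=(i < 3)) for i, L in enumerate(lengths)]
        print(window_s, lengths, costs)
```

Under these assumptions a 4 s window enters the first full-attention layer with only 128 time steps, which is consistent with the remark in Sect. 3.1 that the last three layers are short enough for standard attention.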
4.4 Baseline Methods

In the epilepsy detection task, the models proposed in [19, 22] were chosen as the baseline models. Two models are used in [22]: a CNN model and an FCN model. The CNN model consists of convolutional and fully connected layers. The FCN model replaces the fully connected layers of the CNN model with convolutional and pooling layers, so it consists entirely of convolutional and pooling layers. The KNN model and the ProtoNN model are used in [19]. KNN is a common machine learning method. The ProtoNN model uses a sparse projection matrix to project the entire data into a low-dimensional space, which improves the model's flexibility. In the epilepsy-related drug classification task, we chose the autoencoder model proposed in [24] as the baseline model. The encoder of the autoencoder consists of convolutional layers and a pooling layer that extract the features of the drug EEG. The model classifies the drug EEG by passing the output of the encoder to a predictor consisting of a fully connected neural network.

4.5 Metric

Our experiments use five evaluation metrics: accuracy, precision, recall, F1-score, and ROC-AUC. Accuracy is the proportion of samples that the model predicts correctly. Precision is the proportion of correctly predicted positive samples among all samples predicted as positive. Recall is the proportion of correctly predicted positive samples among all positive samples. The F1-score is the harmonic mean of precision and recall, so the higher the precision and recall, the higher the F1-score. ROC-AUC is the area under the receiver operating characteristic curve of the model's predictions; the higher the value, the better the model's performance.
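The threshold-based metrics of Sect. 4.5 can be written out for the binary case as a short sketch. The function names are hypothetical and the toy labels are invented for illustration; ROC-AUC is omitted because it needs prediction scores rather than hard labels.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
    print(binary_metrics(y_true, y_pred))
    # Harmonic mean of the 4 s precision and recall reported in Table 2.
    print(round(f1_from(0.889, 0.901), 3))
```

Because the F1-score is a harmonic mean, it is pulled towards the lower of precision and recall, which matters on datasets where seizure segments are much rarer than non-seizure segments.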
Table 2. Results of epilepsy detection experiments (Comparison with baseline [19, 22]).

Length  Model    Accuracy       Precision      Recall         ROC-AUC        F1-Score
4 s     CNN      0.83           0.41           0.39           0.77           0.40
4 s     FCN      0.84           0.38           0.47           0.79           0.42
4 s     ProtoNN  0.77           0.84           0.81           -              0.82
4 s     KNN      0.78           0.8            0.76           -              0.78
4 s     OURS     0.957 ± 0.004  0.889 ± 0.008  0.901 ± 0.019  0.976 ± 0.002  0.895 ± 0.006
8 s     CNN      0.79           0.34           0.51           0.77           0.41
8 s     FCN      0.82           0.39           0.63           0.81           0.48
8 s     ProtoNN  0.77           0.85           0.83           -              0.84
8 s     KNN      0.80           0.79           0.78           -              0.78
8 s     OURS     0.953 ± 0.004  0.872 ± 0.013  0.888 ± 0.040  0.967 ± 0.006  0.879 ± 0.021
16 s    CNN      0.85           0.73           0.37           0.75           0.49
16 s    FCN      0.80           0.37           0.57           0.79           0.45
16 s    ProtoNN  0.78           0.80           0.85           -              0.82
16 s    KNN      0.78           0.77           0.75           -              0.76
16 s    OURS     0.942 ± 0.006  0.820 ± 0.019  0.872 ± 0.060  0.943 ± 0.008  0.844 ± 0.029
5 Results

5.1 Results of Epilepsy Detection Experiments

In the epilepsy detection experiments, we evaluated our model using 10-fold cross-validation, and the experimental results are shown in Table 2. The proposed model performs well with time windows of 4 s, 8 s, and 16 s and outperforms the baseline models on all five evaluation metrics. The model achieves its best results with a window length of 4 s, with an accuracy of 0.957 and a ROC-AUC of 0.976.

5.2 Results of Epilepsy-Related Drug Classification

In the epilepsy-related drug classification experiments, we evaluated our model using five-fold cross-validation. The experimental results are shown in Table 3, and the confusion matrix is shown in Fig. 3. Our model performs well on the drug efficacy, drug mechanism, and drug name classification tasks, and it is better than the baseline model on all evaluation metrics.

Table 3. Results of epilepsy-related drug classification (Comparison with baseline [24])

Task label  Model        Classes  Accuracy       Precision      Recall         F1-Score
Effect      Autoencoder  2        0.814          0.789          0.803          0.795
Mechanism   Autoencoder  5        0.626          0.598          0.601          0.587
Name        Autoencoder  11       0.434          0.434          0.463          0.435
Effect      OURS         2        0.968 ± 0.005  0.963 ± 0.005  0.968 ± 0.006  0.965 ± 0.005
Mechanism   OURS         5        0.964 ± 0.003  0.960 ± 0.004  0.964 ± 0.003  0.962 ± 0.004
Name        OURS         11       0.832 ± 0.007  0.834 ± 0.007  0.835 ± 0.007  0.832 ± 0.008

Fig. 3. Confusion Matrix of epilepsy-related drug classification

5.3 Ablation Experiments

We conducted ablation experiments on the mechanism-of-action classification task for epilepsy-related drugs, considering the following three factors: 1. spatial channel attention, 2. Transformer sparse attention, and 3. distillation convolution. The results of the experiment are shown in Table 4. It can be seen that all three improvement strategies for the Transformer proposed in this study are effective. Removing the distillation convolution causes by far the largest drop, which shows that this operation is essential for the Transformer to extract EEG signal features and that combining convolution with the Transformer is effective for raw EEG data analysis.

Table 4. Results of Ablation experiments on epilepsy-related drug classification.

Spatial Channel Attention  Transformer Sparse Attention  Distillation  Accuracy  Precision  Recall  F1-Score
√                          √                             √             0.964     0.960      0.964   0.962
×                          √                             √             0.939     0.935      0.940   0.937
√                          ×                             √             0.947     0.942      0.948   0.944
√                          √                             ×             0.753     0.749      0.749   0.749
6 Conclusion

Automatic EEG-based epilepsy detection is an effective means of improving the efficiency of epilepsy treatment, and building high-performance automatic detection models with the latest machine learning techniques has been a focus of research. In this work, we proposed a convolutional sparse Transformer architecture. The model extracts features directly from the original EEG signal to avoid losing signal information or introducing noise. We designed a spatial channel attention module that captures the spatial dependence of seizure location and enhances important channel features to improve accuracy. We also created a sparse Transformer module that avoids both the high computational cost of the standard Transformer and the performance degradation caused by dense global attention, and thereby achieves an effective attention mechanism over the time domain of the signal. Our work combines a sparse Transformer and a CNN for the first time to improve the performance of epilepsy EEG analysis, and the proposed scheme is an effective attempt to integrate these two classical techniques. The results on two publicly available datasets validate the effectiveness of the proposed model. The model presented in this paper is a unified architecture applicable to epilepsy detection and drug classification and can also be used for other diseases and related drug discovery processes. We hope that our work will be helpful for healthcare and the development of related medical technologies.

Acknowledgements. Supported by grants from the National Natural Science Foundation of China (No. 81973182); National Science Foundation of China (No. 61806092); Jiangsu Natural Science Foundation (No. BK20180326); "Double First-Class" University project from China Pharmaceutical University (Program No. CPU2018GF02).
References

1. Engel, J., Jr.: A proposed diagnostic scheme for people with epileptic seizures and with epilepsy: report of the ILAE task force on classification and terminology. Epilepsia 42(6), 796–803 (2001)
2. Kayser, J., Tenke, C.E.: Issues and considerations for using the scalp surface Laplacian in EEG/ERP research: a tutorial review. Int. J. Psychophysiol. 97(3), 189–209 (2015)
3. Cilio, M.R.: EEG and the newborn. J. Pediatric Neurology 7(1), 25–43 (2009)
4. Jobert, M., et al.: Guidelines for the recording and evaluation of Pharmaco-EEG data in man: the International Pharmaco-EEG society (IPEG). Neuropsychobiology 66(4), 201–220 (2012)
5. Skarpaas, T.L., Tcheng, T.K., Morrell, M.J.: Clinical and electrocorticographic response to antiepileptic drugs in patients treated with responsive stimulation. Epilepsy Behav. 83, 192–200 (2018)
6. Höller, Y., Helmstaedter, C., Lehnertz, K.: Quantitative pharmaco-electroencephalography in antiepileptic drug research. CNS Drugs 32(9), 839–848 (2018)
7. Hussain, L.: Detecting epileptic seizure with different feature extracting strategies using robust machine learning classification techniques by applying advance parameter optimization approach. Cogn. Neurodyn. 12(3), 271–294 (2018)
8. Tanveer, M.A., Khan, M.J., Sajid, H., Naseer, N.: Convolutional neural networks ensemble model for neonatal seizure detection. J. Neurosci. Methods 358, 109197 (2021)
9. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)
10. Affes, A., Mdhaffar, A., Triki, C., Jmaiel, M., Freisleben, B.: A convolutional gated recurrent neural network for epileptic seizure prediction. In: Pagán, J., Mokhtari, M., Aloulou, H., Abdulrazak, B., Cabrera, M.F. (eds.) ICOST 2019. LNCS, vol. 11862, pp. 85–96. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32785-9_8
11. Han, D., Liu, Q., Fan, W.: A new image classification method using CNN transfer learning and web data augmentation. Expert Syst. Appl. 95, 43–56 (2018)
12. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
13. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
14. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
15. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
16. Sun, J., Xie, J., Zhou, H.: EEG classification with transformer-based models. In: 2021 IEEE 3rd Global Conference on Life Sciences and Technologies (LifeTech), pp. 92–93. IEEE (2021)
17. Winkler, I., Haufe, S., Tangermann, M.: Automatic classification of artifactual ICA-components for artifact removal in EEG signals. Behav. Brain Funct. 7(1), 1–15 (2011)
18. Tapani, K.T., Vanhatalo, S., Stevenson, N.J.: Incorporating spike correlations into an SVM-based neonatal seizure detector. Presented at the (2018). https://doi.org/10.1007/978-981-10-5122-7_81
19. Nagarajan, V., Muralidharan, A., Sriraman, D., et al.: Scalable machine learning architecture for neonatal seizure detection on ultra-edge devices. arXiv preprint arXiv:2111.15569 (2021)
20. Ansari, A.H., Cherian, P.J., Caicedo, A., Naulaers, G., De Vos, M., Van Huffel, S.: Neonatal seizure detection using deep convolutional neural networks. Int. J. Neural Syst. 29(04), 1850011 (2019)
21. O'Shea, A., Lightbody, G., Boylan, G., Temko, A.: Neonatal seizure detection from raw multi-channel EEG using a fully convolutional architecture. Neural Netw. 123, 12–25 (2020)
22. Frassineti, L., Ermini, D., Fabbri, R., Manfredi, C.: Neonatal seizures detection using stationary wavelet transform and deep neural networks: preliminary results. In: 2020 IEEE 20th Mediterranean Electrotechnical Conference (MELECON), pp. 344–349. IEEE (2020)
23. Louizos, C., Swersky, K., Li, Y., Welling, M., Zemel, R.: The variational fair autoencoder. arXiv preprint arXiv:1511.00830 (2015)
24. Kalitin, K.Y., Nevzorov, A.A., Spasov, A.A., Sotnikov, P.I.: Deep learning-based i-EEG classification with convolutional neural networks for drug-target interaction prediction. arXiv preprint arXiv:2009.12984 (2020)
25. Shi, G., Chen, Z., Zhang, R.: A transformer-based spatial-temporal sleep staging model through raw EEG. In: 2021 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS), pp. 110–115. IEEE (2021)
26. Wang, Z., Wang, Y., Hu, C., Yin, Z., Song, Y.: Transformers for EEG-based emotion recognition: a hierarchical spatial information learning model. IEEE Sens. J. (2022)
27. Pedoeem, J., Abittan, S., Yosef, G.B., Keene, S.: Tabs: transformer based seizure detection. In: 2020 IEEE Signal Processing in Medicine and Biology Symposium (SPMB), pp. 1–6. IEEE (2020)
28. Zaheer, M., et al.: Big bird: transformers for longer sequences. Adv. Neural. Inf. Process. Syst. 33, 17283–17297 (2020)
29. Roy, A., Saffar, M., Vaswani, A., Grangier, D.: Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguist. 9, 53–68 (2021)
EEG Convolutional Sparse Transformer
751
30. Tay, Y., Bahri, D., Yang, L., Metzler, D., Juan, D.-C.: Sparse sinkhorn attention. In International Conference on Machine Learning, pp. 9438–9447. PMLR (2020) 31. Zhou, H., et al.: Informer: beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of AAAI (2021) 32. Qin, Z., Zhang, P., Wu, F., Li, X.: FcaNet: frequency channel attention networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 783–792 (2021) 33. Stevenson, N.J., Tapani, K., Lauronen, L., Vanhatalo, S.: A dataset of neonatal EEG recordings with seizure annotations. Sci. Data 6(1), 1–8 (2019)
Adopting Autodock Koto for Virtual Screening of COVID-19

Zhangfan Yang1,2, Kun Cao1,2, Junkai Ji1,2(B), Zexuan Zhu1,2, and Jianqiang Li1,2

1 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
{2070276085,2210273130}@email.szu.edu.cn, {jijunkai,zhuzx,lijq}@szu.edu.cn
2 National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
Abstract. COVID-19 is a highly contagious respiratory disease caused by the SARS-CoV-2 virus. Responding quickly to such pathogens is crucial to stopping the uncontrolled spread of disease. Repurposing existing drugs through computational approaches is an efficient and effective way to provide treatments. In this study, a powerful docking program named Autodock Koto, proposed in our previous research, is used to virtually screen antiviral drugs for SARS-CoV-2. It identifies new uses for drugs that have established safety profiles and well-characterized pharmacology. Several vital target proteins have been considered in our experiments, such as Spike, 3-chymotrypsin-like protease (3CLpro), and RNA-dependent RNA polymerase (RdRp). Experimental results indicate that Nystatin, Amphotericin B, Hypericin, Ergotamine, Natamycin and Teicoplanin have potential as antiviral drugs for the treatment of SARS-CoV-2 infection and merit further in vitro or clinical trials. In addition, the interactions between the drugs and the corresponding target proteins are analyzed in this study.

Keywords: Molecular Docking · Virtual Screening · SARS-CoV-2 · Drug Repositioning
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 752–763, 2023. https://doi.org/10.1007/978-981-99-4749-2_64

1 Introduction

Coronavirus disease 2019 (COVID-19), a disease with high transmission and mortality rates, has challenged economic and healthcare systems worldwide. As of 2022, the new coronavirus is responsible for approximately 260 million infections and approximately 5.47 million deaths, according to Johns Hopkins University. SARS-CoV-2, a positive-sense single-stranded RNA virus belonging to the Coronaviridae family, is the betacoronavirus that causes COVID-19, and it has accumulated many mutations. Most mutations do not have severe consequences for virus transmission and mortality [3]. However, because RNA viruses replicate within host cells and produce many mutations during replication, the likelihood of new variants emerging increases as more people become infected [34]. To date, variants of the new coronavirus have been identified in five major lineages. The earliest, the α variant, was discovered in the UK
in 2020, the β variant in South Africa in September of the same year, and the γ variant in Brazil the following month [35]. Meanwhile, in November of that year, the η and δ variants were discovered one after the other [9]. B.1.1.529 was detected in Botswana and South Africa in 2021 and reported to the World Health Organization (WHO) on 24 November of that year. It was subsequently defined as a new variant of concern (VOC) by the WHO and named Omicron [30].

Despite advances in technology and in the understanding of human disease, the translation of these benefits into therapeutic advances has been much slower than expected. At the same time, the rapid mutation of viruses makes it difficult to develop new drugs for each variant. Because of the high attrition rates, high costs, and long development cycles associated with new drug discovery and development, repurposing 'old' drugs to treat common and rare diseases is becoming an increasingly attractive approach [22]. Drug repurposing offers the potential benefit of reducing overall development cost and time by using clinically safe compounds. The antiviral efficacy against SARS-CoV-2 of drugs already on the market has been studied in early clinical trials, together with the chemical structures, pharmacological effects, indications, adverse effects, and general use of these drugs [20].

In our previous research, a novel docking program named AutoDock Koto was developed. A series of comparative experiments verified its superior docking performance compared with other commonly used docking programs [12]. In this study, we use Koto to perform virtual screening experiments on important targets of SARS-CoV-2.
Considering the importance of the spike protein, we selected the spike protein of SARS-CoV-2 and its variants for drug screening, including the original spike protein, the spike protein of the δ variant, and the spike protein of the Omicron variant. Then, we selected the SARS-CoV-2 RNA-dependent RNA polymerase (RdRp), which synthesizes the viral RNA, including the mRNAs that encode the viral proteins facilitating virus replication and propagation. The catalytic core of RdRp is encoded by the nsp12 gene in the SARS-CoV-2 genome, and the functional complex consists of one large subunit (nsp12) and two small subunits (nsp7 and nsp8). These three subunits bind together to form a complex, with nsp12 being the major catalytic subunit and nsp7 and nsp8 assisting its function as accessory subunits [33]. Next, we selected the papain-like protease (PLpro) of the new coronavirus. This crucial viral enzyme cleaves and processes the precursors of viral proteins, thereby facilitating the virus's life cycle [18]. PLpro is encoded by the nsp3 gene in the SARS-CoV-2 genome and comprises three parts, namely the nsp3 core structural domain, the Thumb structural domain, and the Palm structural domain. Finally, we also selected the SARS-CoV-2 NSP15, which is involved in viral replication [13].

The remainder of this paper is organized as follows. Section 2 introduces the target and ligand preparation and the docking program. Section 3 presents the experimental results of virtual screening for SARS-CoV-2. Section 4 discusses and analyzes the docking results. Finally, our conclusions and some possible directions for future work are presented in Sect. 5.
2 Methods

2.1 Target Preparation

In this study, all target structures for SARS-CoV-2 were downloaded from the Research Collaboratory for Structural Bioinformatics Protein Data Bank (PDB) [2]. First, three spike protein structures were obtained from the PDB: the original spike protein (6M0J) [14], the Delta-mutated spike protein (7V8B), and the Omicron-mutated spike protein (7T9L) [17]. In each of these PDB entries, the spike protein forms a complex with ACE2. We then obtained the structures of the three remaining SARS-CoV-2 proteins: RdRp (6M71) [8], PLpro (6M71) [24], and NSP15 (6VWW) [13]. Some of these structures contain multiple chains, and most of these chains do not contain the targeted active site. Therefore, the B, C, and D chains of PLpro and the A chain of NSP15 were removed. In the preparation phase of virtual screening, all targets were converted from pdb format to pdbqt format with Open Babel [19]. PyMOL was used to visualize the virtual screening results [5].

2.2 Ligand Preparation

All drug structures used in the virtual screening were extracted from the SUPERDRUG2 database [26]. Since each drug has multiple conformations in SUPERDRUG2, only the first conformation was chosen for docking as the 3D structure of the drug molecule. In addition, drugs from the ZINC15 database, which collects over 230 million compounds with 3D structures [28], were used to enrich the diversity of drugs. Three categories of drugs from ZINC15 were adopted in this study: drugs approved by the US Food and Drug Administration (FDA), drugs that have been approved elsewhere but not yet by the FDA, and drugs that are still in clinical trials, totaling more than 9,000 drugs. To prevent redundancy between the two databases, if the same drug appears twice in the final screening results, only the entry with the lower energy is retained.
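The redundancy rule at the end of Sect. 2.2 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' script: the record layout and the case-insensitive name matching are assumptions, since the paper does not say how identical drugs were detected, and the second Teicoplanin record is made up.

```python
def deduplicate_hits(hits):
    """Keep one record per drug name, preferring the lowest docking energy.

    `hits` is an iterable of (drug_name, database_id, energy_kcal_mol) tuples.
    Matching drugs by lower-cased name is an assumption made for illustration.
    """
    best = {}
    for name, db_id, energy in hits:
        key = name.strip().lower()
        if key not in best or energy < best[key][2]:
            best[key] = (name, db_id, energy)
    # Sort by energy so the strongest predicted binders come first.
    return sorted(best.values(), key=lambda record: record[2])


# Hypothetical example: the same drug retrieved from both databases.
hits = [
    ("Teicoplanin", "SD002319", -9.8),
    ("teicoplanin", "ZINC-EXAMPLE", -9.1),  # made-up ID and score
    ("Rifaximin", "ZINC000334138310", -9.8),
]
unique_hits = deduplicate_hits(hits)
for record in unique_hits:
    print(record)
```

Keying on a structure-based identifier such as a canonical SMILES string would be more robust than keying on names, at the cost of an extra cheminformatics dependency.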
The hydrogen atoms of the ligands were removed, and a Gasteiger charge was assigned to each atom with Open Babel. As with the target proteins, the ligands were converted from sdf format to pdbqt format.

2.3 Docking Program

AutoDock Koto is a novel docking program developed in our previous research. It uses L-SHADE as the global search algorithm and Adam as the local search algorithm, and it has satisfactory docking capability. Experimental results verified that Koto performed better than commonly used docking programs, including Glide, GOLD, Dock, rDock, LeDock, AutoDock and AutoDock Vina [12]. Therefore, it was used as the docking program to virtually screen antiviral drugs for SARS-CoV-2 in this study. The optional parameters of Koto were set as follows: the number of runs of the algorithm was 10, and the maximum number of candidate conformations per output was 9. The entire virtual screening workflow was run with Koto, Open Babel, and custom Python and shell scripts.
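The format conversions described in Sects. 2.1 and 2.2 could be scripted roughly as follows. This sketch only builds Open Babel (`obabel`) command lines and prints them. The exact options the authors used are not given in the paper, so the option choice here is an assumption, and the file names are hypothetical.

```python
import shlex


def receptor_command(pdb_path, pdbqt_path):
    """Build an Open Babel command converting a target from PDB to rigid PDBQT."""
    # -xr writes the receptor as a rigid molecule (no torsion tree).
    return ["obabel", pdb_path, "-O", pdbqt_path, "-xr"]


def ligand_command(sdf_path, pdbqt_path):
    """Build an Open Babel command converting the first SDF record to PDBQT."""
    return [
        "obabel", sdf_path, "-O", pdbqt_path,
        "-f", "1", "-l", "1",                # keep only the first conformation
        "-d",                                # remove hydrogens, as in Sect. 2.2
        "--partialcharge", "gasteiger",      # assign Gasteiger charges
    ]


# Hypothetical file names; print the commands instead of running them.
commands = [
    receptor_command("6M0J.pdb", "6M0J.pdbqt"),
    ligand_command("drug.sdf", "drug.pdbqt"),
]
for command in commands:
    print(shlex.join(command))
```

On a machine with Open Babel installed, each list can be passed to `subprocess.run(command, check=True)` to perform the conversion.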
Table 1. Drugs screened by Koto binding to spike protein variants (binding affinity in kcal/mol)

| Drug ID | Drug Name | Original | Delta | Omicron |
|---|---|---|---|---|
| ZINC000245190612 | Nystatin | −10.8 | −9.8 | −9.2 |
| ZINC000003780340 | Hypericin | N/A | −9.4 | −8.5 |
| ZINC000052955754 | Ergotamine | −9.6 | −9.3 | N/A |
| ZINC000003975327 | Telomestatin | −8.9 | −8.7 | N/A |
| ZINC000255962815 | Amphotericin B | −8.9 | −8.7 | −8.7 |
| ZINC000012358610 | Phthalocyanine | −8.8 | −8.6 | N/A |
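As a quick check of Table 1, the short Python sketch below lists the drugs with a reported affinity for every spike variant and the change in score from the original spike protein to Omicron. The values are copied from Table 1; the dictionary layout is only an illustrative choice.

```python
# Binding affinities (kcal/mol) copied from Table 1; None marks "N/A".
table1 = {
    "Nystatin":       {"Original": -10.8, "Delta": -9.8, "Omicron": -9.2},
    "Hypericin":      {"Original": None,  "Delta": -9.4, "Omicron": -8.5},
    "Ergotamine":     {"Original": -9.6,  "Delta": -9.3, "Omicron": None},
    "Telomestatin":   {"Original": -8.9,  "Delta": -8.7, "Omicron": None},
    "Amphotericin B": {"Original": -8.9,  "Delta": -8.7, "Omicron": -8.7},
    "Phthalocyanine": {"Original": -8.8,  "Delta": -8.6, "Omicron": None},
}

# Drugs with a reported affinity for every variant.
binds_all = [drug for drug, row in table1.items()
             if all(value is not None for value in row.values())]

# Change from the original spike protein to Omicron; positive means weaker binding.
shift = {drug: round(table1[drug]["Omicron"] - table1[drug]["Original"], 1)
         for drug in binds_all}

print(binds_all)
print(shift)
```

A positive shift corresponds to a less negative score, that is, weaker predicted binding to the Omicron spike protein.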
2.4 Target Active Site Identification

The active sites of the target proteins can be identified from the residues that interact with specific molecules. The center of the binding site in the original spike protein is set to (−34.704, 33.025, 1.163) [14]. The centers of the binding sites in the Delta-mutated and Omicron-mutated spike proteins are set to (187.368, 197.840, 279.201) and (232.250, 176.706, 257.997), respectively [31]. Based on the locations of the active site residues, the center of the binding site is set to (113.520, 114.089, 122.780) for SARS-CoV-2 RdRp [33], to (23.250, 70.900, 5.150) for SARS-CoV-2 PLpro [24], and to (−90.300, 21.250, 32.250) for SARS-CoV-2 NSP15 [13].
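The binding-site coordinates listed above lend themselves to a small lookup table. The sketch below stores them and prints search-box center lines for one target. The target labels are hypothetical, the key names follow AutoDock Vina's config format, and whether Koto accepts exactly these keys is an assumption; the box size is omitted because it is not stated here.

```python
# Binding-site centers (x, y, z) as listed in Sect. 2.4.
BINDING_SITE_CENTERS = {
    "spike_original": (-34.704, 33.025, 1.163),
    "spike_delta": (187.368, 197.840, 279.201),
    "spike_omicron": (232.250, 176.706, 257.997),
    "RdRp": (113.520, 114.089, 122.780),
    "PLpro": (23.250, 70.900, 5.150),
    "NSP15": (-90.300, 21.250, 32.250),
}


def center_config(target, num_modes=9):
    """Return Vina-style config lines for one target.

    The key names follow AutoDock Vina's config format; whether Koto accepts
    exactly these keys is an assumption.
    """
    x, y, z = BINDING_SITE_CENTERS[target]
    return "\n".join([
        f"center_x = {x:.3f}",
        f"center_y = {y:.3f}",
        f"center_z = {z:.3f}",
        f"num_modes = {num_modes}",  # at most 9 conformations, as in Sect. 2.3
    ])


print(center_config("spike_original"))
```

Keeping the centers in one table makes it easy to loop over all six targets when generating per-target configuration files.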
3 Experimental Results

For each spike protein, the drugs with the best binding affinity screened by Koto are listed in Table 1. Nystatin and Amphotericin B (AmB) bind to all three spike proteins. However, their binding affinity scores increase (that is, become less negative, indicating weaker binding) from the original spike protein to the latest Omicron variant. This may explain why the effectiveness of some antiviral drugs is reduced against the latest variants of SARS-CoV-2. Figure 1 shows the interactions of these two drugs with the Delta and Omicron variants. In Fig. 1(a), Nystatin forms hydrophobic interactions with the residues LEU126, PHE127, TYR160, and TYR176, and hydrogen bonds with the residues ARG74, GLU77, GLN80, TYR124, and GLY167 of the Delta spike protein. According to Fig. 1(b), AmB forms the same interactions with the residues LEU126 and ARG74. In fact, these residues interact with the human ACE2 protein in a similar way. For instance, the residues TYR124 and GLY167 of the Delta variant interact with human ACE2 through carbon bonds, and TYR160 and TYR176 interact with human ACE2 residues through charged and Pi-alkyl interactions.

Unlike the Delta variant, the Omicron variant carries a large number of mutations in the receptor-binding motif (RBM), which can promote binding to human ACE2. For instance, GLN164ARG, one of the mutated residues, forms hydrogen bonds with Nystatin and AmB and a salt bridge with AmB, as shown in Fig. 1(c) and (d). The residues LEU126 and TYR160 in the Delta variant form hydrophobic interactions with Nystatin and AmB, and TYR124 forms hydrogen bonds with them. All of these interactions can also be found in Omicron. Therefore, it can be concluded that AmB and Nystatin have similar interactions with the residues of the spike protein variants, and both have the potential to be antiviral drugs for SARS-CoV-2.

Fig. 1. Visualization of docked conformations of SARS-CoV-2 spike protein variants with Nystatin and Amphotericin B. (a) and (b) present the docked conformations of Delta variant with Nystatin and Amphotericin B; (c) and (d) present the docked conformations of Omicron variant with Nystatin and Amphotericin B.

Table 2. Drugs binding to the RDRP of SARS-CoV-2 screened by Koto

| Drug ID | Drug Name | Binding affinity (kcal/mol) |
|---|---|---|
| ZINC000255962815 | Amphotericin B | −9.9 |
| ZINC000068014156 | Zalypsis | −9.9 |
| SD002319 | Teicoplanin | −9.8 |
| ZINC000253668332 | Deslanoside | −9.8 |
| ZINC000334138310 | Rifaximin | −9.8 |
| ZINC000072190220 | Elsamitrucin | −9.7 |
| ZINC000245190613 | Nystatin | −9.6 |
| ZINC000169621215 | Rifabutin | −9.6 |
| ZINC000003934128 | Temoporfin | −9.6 |
| ZINC000252474776 | Solamargine | −9.5 |
| ZINC000150609284 | GS-9256 | −9.5 |
| ZINC000084726167 | TMC-647055 | −9.4 |
| ZINC000008220909 | Natamycin | −9.4 |
| ZINC000203757351 | Paritaprevir | −9.4 |
The drug molecules screened by Koto with the best binding affinity for the RdRp of SARS-CoV-2 are listed in Table 2. Interestingly, both Amphotericin B and Nystatin also appear in this table, with Amphotericin B at the top of the list. Figure 2 shows the interactions between the top five drug molecules and RdRp. The residue ASP618 of RdRp forms a hydrophobic interaction with four of the five drug molecules, the exception being AmB. It has been reported that the residues SER759, ASP760, and ASP761 are the active catalytic sites of RdRp [33]. Asp760 is essential because it participates in the complementary pairing of nucleotides and the formation of phosphodiester bonds in the catalytic reaction. It also forms a salt bridge with Teicoplanin. Ser759 and Asp761 of the SDD motif also interact with phosphate groups in the catalytic reaction and participate in the formation of the active catalytic site. According to Fig. 2, these residues form hydrogen bonds with the five drug molecules. As in the RdRp enzymes of other viruses, the residues Gly553, Asp545, and Asp555 of RdRp are the co-catalytic sites for RNA polymerization. The residue Gly553 forms hydrogen bonds with AmB, Zalypsis, Teicoplanin, and Rifaximin, and the residue ARG555 forms hydrogen bonds with Teicoplanin and Rifaximin.

The drugs screened by Koto with the best binding affinity for the PLpro of SARS-CoV-2 are shown in Table 3. Among them, Natamycin and TMC-647055 also appear in the screening results for RdRp, and Hypericin appears in those for the spike proteins.

Table 4 shows the drugs screened by Koto with the lowest binding energies for NSP15. These drugs are quite different from those screened for the other target proteins. The active site of NSP15 is located in a shallow groove between two β-sheets and carries the six key residues conserved in SARS-CoV-2: His235, His250, Lys290, Thr341, Tyr343, and Ser294 [13].
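The overlaps noted above can be checked mechanically. The sketch below groups the drug names from Tables 1 to 4 by target and reports every drug that appears for more than one target. Matching drugs by name is an assumption made for illustration, since some drugs are listed under different database IDs in different tables.

```python
from collections import defaultdict

# Drug names copied from Tables 1-4 (Table 1 merges the three spike variants).
screening_hits = {
    "Spike": ["Nystatin", "Hypericin", "Ergotamine", "Telomestatin",
              "Amphotericin B", "Phthalocyanine"],
    "RdRp": ["Amphotericin B", "Zalypsis", "Teicoplanin", "Deslanoside",
             "Rifaximin", "Elsamitrucin", "Nystatin", "Rifabutin",
             "Temoporfin", "Solamargine", "GS-9256", "TMC-647055",
             "Natamycin", "Paritaprevir"],
    "PLpro": ["Ketotifen-N-glucuronide", "Cephalochromin", "Natamycin",
              "TMC-647055", "3'-demethyletoposide", "Zoliflodacin",
              "Bemcentinib", "Diaplasinin", "Lestaurtinib", "Hypericin"],
    "NSP15": ["Tirilazad", "Exatecan", "Phthalocyanine",
              "Hydromorphone-3-Glucuronide", "Tucatinib", "Elsamitrucin",
              "Dihydroergotoxine", "Paritaprevir", "Dihydroergotamine",
              "VP-14637", "Deleobuvir", "Setrobuvir", "Ergotamine"],
}

# Invert the mapping: drug name -> list of targets it was screened for.
targets_by_drug = defaultdict(list)
for target, drugs in screening_hits.items():
    for drug in drugs:
        targets_by_drug[drug].append(target)

# Keep only the drugs that appear for more than one target.
multi_target = {drug: targets for drug, targets in targets_by_drug.items()
                if len(targets) > 1}
for drug in sorted(multi_target):
    print(f"{drug}: {', '.join(multi_target[drug])}")
```

Drugs that recur across targets, such as Natamycin for RdRp and PLpro, are the ones examined in more detail in the Discussion.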
Fig. 2. Visualization of docked conformations of SARS-CoV-2 RDRP with the top-5 drugs, including Amphotericin B, Zalypsis, Teicoplanin, Deslanoside and Rifaximin.
Table 3. Drugs binding to the PLpro of SARS-CoV-2 screened by Koto

| Drug ID | Drug Name | Binding affinity (kcal/mol) |
|---|---|---|
| ZINC000100053593 | Ketotifen-N-glucuronide | −9.7 |
| ZINC000005328059 | Cephalochromin | −9.6 |
| ZINC000008220909 | Natamycin | −9.4 |
| ZINC000084726167 | TMC-647055 | −9.3 |
| ZINC000095618817 | 3’-demethyletoposide | −9.2 |
| ZINC000145806066 | Zoliflodacin | −9.2 |
| ZINC000051951669 | Bemcentinib | −9.1 |
| ZINC000001554077 | Diaplasinin | −9.1 |
| ZINC000003781738 | Lestaurtinib | −9.0 |
| ZINC000003780340 | Hypericin | −9.0 |
Table 4. Drugs binding to the NSP15 of SARS-CoV-2 screened by Koto

| Drug ID | Drug Name | Binding affinity (kcal/mol) |
|---|---|---|
| ZINC000043133316 | Tirilazad | −9.7 |
| ZINC000003800855 | Exatecan | −9.7 |
| ZINC000012358610 | Phthalocyanine | −9.7 |
| ZINC000100054221 | Hydromorphone-3-Glucuronide | −9.6 |
| ZINC000068250462 | Tucatinib | −9.6 |
| ZINC000072190220 | Elsamitrucin | −9.6 |
| ZINC000014880002 | Dihydroergotoxine | −9.5 |
| ZINC000885764928 | Paritaprevir | −9.5 |
| SD000790 | Dihydroergotamine | −9.4 |
| ZINC000602986377 | VP-14637 | −9.4 |
| ZINC000096928979 | Deleobuvir | −9.4 |
| ZINC000100341584 | Setrobuvir | −9.4 |
| ZINC000052955754 | Ergotamine | −9.4 |
4 Discussion

Because molecular docking programs adopt stochastic optimization algorithms as their sampling methods, the screening results may be inconsistent with the actual values, producing, for example, false positives [16]. This limitation is mainly due to the stochastic nature of the optimization algorithms, inconsistencies in the input ligands and parameter settings, and the approximate nature of the scoring function. The stochastic nature leads to differences between individual docking runs. All scoring functions are based on physics or statistics, but none can currently reflect the precise interactions between proteins and ligands, and the optimization process relies on such an approximate scoring function. Nevertheless, docking provides an a priori guide for simulating target-drug interactions at almost no cost. The structural and interaction information of the targets was obtained in detail
through extensive literature analysis, thus allowing the binding pockets to be accurately identified and the unnecessary search space to be reduced.

Nystatin appears in the screening results for all three spike proteins and RdRp, and all of its binding affinities are below −9 kcal/mol. Nystatin is a polyene ionophore antifungal agent used to treat cutaneous, mucocutaneous, and gastrointestinal fungal infections, especially those caused by Candida spp. It has broad-spectrum fungicidal and fungistatic activity against a variety of yeasts and fungi.

Amphotericin B (AmB) also appears in the screening results for all three spike proteins and RdRp. It is a polyene antibiotic that was first isolated from the fermentation culture of Streptomyces nodosus in 1959 [6], and it is an FDA-approved antifungal drug. The mechanism of action of AmB is based on the binding of the drug to ergosterol in fungal cell membranes, forming channels that allow cytoplasmic contents to leak out and lead to cell death. Most efforts to improve the toxicity profile of AmB have focused on the development of lipid formulations [15]. Similar studies have indicated that AmB and vancomycin are the most promising drugs to block the binding of the SARS-CoV-2 spike protein to human ACE2 [23].

Hypericin binds to the Delta and Omicron spike proteins and to PLpro with low binding energies. Hypericin is a natural antiviral substance found in common St. John's wort (Hypericum spp.). It has been identified as a lead compound against the main protease (Mpro) of SARS-CoV-2 [21].

Ergotamine, an approved non-antiviral drug for the treatment of acute migraine-type headaches [29], appears in the screening results for the original spike protein, the Delta variant and NSP15. Dihydroergotoxine, Ergotamine, and Dihydroergotamine are structurally similar to adrenergic, dopaminergic, and serotonergic neurotransmitters.
They act potently at the 5-HT1B and 5-HT1D antimigraine receptors and have sustained vasoconstrictor effects [25]. Consistent with our docking results, Gul et al. [10] demonstrated that Ergotamine and Dihydroergotamine showed strong interaction potential with both 3CLpro and RdRp and can be regarded as promising drug candidates. Ergot derivatives, such as Dihydroergotamine and Ergotamine, were also indicated as potential ligands for targeting Nsp15 in [27].

Natamycin has the potential to bind to RdRp and PLpro, according to the docking results of Koto. A similar study used AutoDock Vina and Glide to dock it to Mpro, PLpro, and RdRp and obtained satisfactory binding energies for each protein target [11]. According to DrugBank [32], Natamycin has in vitro activity against a variety of yeasts and filamentous fungi.

Some other drugs in our screening results have attracted our attention, although they do not bind to multiple targets simultaneously. The first is Tirilazad, which achieved a binding affinity of −9.7 kcal/mol for NSP15. Tirilazad has been used in trials investigating the treatment of spinal cord injury [1]. In another study, Tirilazad was found to bind to the 'Native Spike Glycoprotein' with an affinity of −11.8 kcal/mol, to the South African (B.1.351) SARS-CoV-2 spike variant with an affinity of −10 kcal/mol, and to the SARS-CoV-2 main protease with an affinity of −10.5 kcal/mol [7]. The second is Teicoplanin, which produces a binding energy of −9.8 kcal/mol with RdRp. Teicoplanin is a commonly used glycopeptide antibiotic for the treatment of bacterial infections. It
has shown efficacy against various viruses, such as the Ebola virus, influenza virus, and human immunodeficiency virus, as well as coronaviruses, such as the Middle East Respiratory Syndrome Coronavirus and SARS-CoV [4].
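Returning to the stochastic sampling issue raised at the start of this section, repeating a docking run and keeping the best score is the usual way to reduce run-to-run variation, which is why Sect. 2.3 uses 10 runs. The sketch below illustrates the idea with a toy one-dimensional energy function and a plain random search. Both are hypothetical stand-ins chosen for brevity and are not Koto's scoring function or its L-SHADE and Adam search.

```python
import random


def toy_energy(x):
    """Toy one-dimensional 'scoring function' with a minimum of -9.0 at x = 2."""
    return (x - 2.0) ** 2 - 9.0


def random_search(seed, steps=200):
    """One stochastic run: sample positions at random and keep the best score."""
    rng = random.Random(seed)
    return min(toy_energy(rng.uniform(-10.0, 10.0)) for _ in range(steps))


# Ten independent runs, mirroring the run count stated in Sect. 2.3.
run_scores = [random_search(seed) for seed in range(10)]
best_score = min(run_scores)
spread = max(run_scores) - best_score

print(f"best of 10 runs: {best_score:.3f}, spread across runs: {spread:.3f}")
```

The spread across runs gives a rough sense of how much a single run could differ from the best of several, which is one reason why small differences between docking scores should be interpreted with care.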
5 Conclusion

Drug repurposing is one of the most attractive options for responding to rapidly spreading and even life-threatening pandemics such as COVID-19. Virtual screening based on molecular docking programs can identify candidate drugs quickly, thereby discovering the potential efficacy of known drugs and achieving the goal of drug repurposing. Considering the powerful docking ability of Autodock Koto, proposed in our previous research, we adopt it to screen antiviral drugs for COVID-19. Three spike protein variants, RdRp, PLpro and NSP15 are selected as the target proteins, and Koto screens a set of drugs for each target that have large potential as antiviral drugs based on their predicted binding affinities. The interactions between the drugs and the target proteins have also been analyzed and discussed in this study. According to the experimental results, Nystatin, Amphotericin B, Hypericin, Ergotamine, Natamycin and Teicoplanin are potential antiviral drugs against SARS-CoV-2 and merit further clinical trials. In our future research, we will continue to develop effective docking programs based on deep learning architectures and apply them to discover novel drugs for specific diseases, such as Parkinson's disease.

Acknowledgements. This work is supported in part by the National Key R&D Program of China under Grant 2020YFA0908700, the National Natural Science Foundation of China under Grants 62106151 and 62073225, and the Shenzhen Science and Technology Program under Grant JCYJ20220531101614031.
References

1. Bracken, M.B., et al.: Methylprednisolone or tirilazad mesylate administration after acute spinal cord injury: 1-year follow up: results of the third national acute spinal cord injury randomized controlled trial. J. Neurosurg. 89(5), 699–706 (1998)
2. Burley, S.K., Berman, H.M., Kleywegt, G.J., Markley, J.L., Nakamura, H., Velankar, S.: Protein Data Bank (PDB): the single global macromolecular structure archive. In: Protein Crystallography: Methods and Protocols, pp. 627–641 (2017)
3. Chen, B., et al.: Overview of lethal human coronaviruses. Signal Transduct. Target. Ther. 5(1), 89 (2020)
4. Colson, P., Raoult, D.: Fighting viruses with antibiotics: an overlooked path. Int. J. Antimicrob. Agents 48(4), 349 (2016)
5. DeLano, W.L., et al.: PyMOL: an open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr. 40(1), 82–92 (2002)
6. Dutcher, J.D., William, G., Pagano, J.F., John, V.: Amphotericin B, its production, and its salts (1959). US Patent 2,908,611
7. Ferrari, I., Di Mario, M.: SARS-CoV-2 proteins, in complex with tirilazad. Int. J. Sci. Res. Comput. Sci. Eng. 10(1) (2022)
8. Gao, Y., et al.: Structure of the RNA-dependent RNA polymerase from COVID-19 virus. Science 368(6492), 779–782 (2020)
9. Greaney, A.J., et al.: Comprehensive mapping of mutations in the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human plasma antibodies. Cell Host Microbe 29(3), 463–476 (2021)
10. Gul, S., Ozcan, O., Asar, S., Okyar, A., Barıs, I., Kavakli, I.H.: In silico identification of widely used and well-tolerated drugs as potential SARS-CoV-2 3C-like protease and viral RNA-dependent RNA polymerase inhibitors for direct use in clinical trials. J. Biomol. Struct. Dyn. 39(17), 6772–6791 (2021)
11. Hosseini, M., Chen, W., Xiao, D., Wang, C.: Computational molecular docking and virtual screening revealed promising SARS-CoV-2 drugs. Precision Clinical Medicine 4(1), 1–16 (2021)
12. Ji, J., Zhou, J., Yang, Z., Lin, Q., Coello, C.A.C.: Autodock Koto: a gradient boosting differential evolution for molecular docking. IEEE Trans. Evol. Comput. (2022)
13. Kim, Y., et al.: Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2. Protein Sci. 29(7), 1596–1605 (2020)
14. Lan, J., et al.: Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature 581(7807), 215–220 (2020)
15. Laniado-Laborín, R., Cabrales-Vargas, M.N.: Amphotericin B: side effects and toxicity. Revista Iberoamericana de Micología 26(4), 223–227 (2009)
16. Leach, A.R., Shoichet, B.K., Peishoff, C.E.: Prediction of protein-ligand interactions. Docking and scoring: successes and gaps. J. Med. Chem. 49(20), 5851–5855 (2006)
17. Mannar, D., et al.: SARS-CoV-2 Omicron variant: antibody evasion and cryo-EM structure of spike protein–ACE2 complex. Science 375(6582), 760–764 (2022)
18. Mielech, A.M., Kilianski, A., Baez-Santos, Y.M., Mesecar, A.D., Baker, S.C.: MERS-CoV papain-like protease has deISGylating and deubiquitinating activities. Virology 450, 64–70 (2014)
19. O’Boyle, N.M., Banck, M., James, C.A., Morley, C., Vandermeersch, T., Hutchison, G.R.: Open Babel: an open chemical toolbox. J. Cheminformat. 3(1), 1–14 (2011)
20. Pan, X., Dong, L., Yang, L., Chen, D., Peng, C.: Potential drugs for the treatment of the novel coronavirus pneumonia (COVID-19) in China. Virus Res. 286, 198057 (2020)
21. Pitsillou, E., Liang, J., Ververis, K., Hung, A., Karagiannis, T.C.: Interaction of small molecules with the SARS-CoV-2 papain-like protease: in silico studies and in vitro validation of protease activity inhibition using an enzymatic inhibition assay. J. Mol. Graph. Model. 104, 107851 (2021)
22. Pushpakom, S., et al.: Drug repurposing: progress, challenges and recommendations. Nat. Rev. Drug Discov. 18(1), 41–58 (2019)
23. Qiao, Z., Zhang, H., Ji, H.F., Chen, Q.: Computational view toward the inhibition of SARS-CoV-2 spike glycoprotein and the 3CL protease. Computation 8(2), 53 (2020)
24. Rut, W., et al.: Activity profiling and crystal structures of inhibitor-bound SARS-CoV-2 papain-like protease: a framework for anti–COVID-19 drug design. Sci. Adv. 6(42), eabd4596 (2020)
25. Silberstein, S.D., McCrory, D.C.: Ergotamine and dihydroergotamine: history, pharmacology, and efficacy. Headache: J. Head Face Pain 43(2), 144–166 (2003)
26. Siramshetty, V.B., et al.: SuperDRUG2: a one stop resource for approved/marketed drugs. Nucleic Acids Res. 46(D1), D1137–D1143 (2018)
27. Sixto-López, Y., Martínez-Archundia, M.: Drug repositioning to target NSP15 protein on SARS-CoV-2 as possible COVID-19 treatment. J. Comput. Chem. 42(13), 897–907 (2021)
28. Sterling, T., Irwin, J.J.: ZINC 15–ligand discovery for everyone. J. Chem. Inf. Model. 55(11), 2324–2337 (2015)
29. Tfelt-Hansen, P., et al.: Ergotamine in the acute treatment of migraine: a review and European consensus. Brain 123(1), 9–18 (2000)
30. Tian, D., Sun, Y., Xu, H., Ye, Q.: The emergence and epidemic characteristics of the highly mutated SARS-CoV-2 Omicron variant. J. Med. Virol. 94(6), 2376–2383 (2022)
31. Vardhan, S., Sahoo, S.K.: Computational studies on the interaction of SARS-CoV-2 Omicron SGp RBD with human receptor ACE2, limonin and glycyrrhizic acid. Comput. Biol. Med. 144, 105367 (2022)
32. Wishart, D.S., et al.: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46(D1), D1074–D1082 (2018)
33. Yin, W., et al.: Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir. Science 368(6498), 1499–1504 (2020)
34. Zhao, Y., Huang, J., Zhang, L., Chen, S., Gao, J., Jiao, H.: The global transmission of new coronavirus variants. Environ. Res. 206, 112240 (2022)
35. Zhou, D., et al.: Evidence of escape of SARS-CoV-2 variant B.1.351 from natural and vaccine-induced sera. Cell 184(9), 2348–2361 (2021)
An Efficient Drug Design Method Based on Drug-Target Affinity

Haoran Liu1,2, Xiaolong Zhang1,2(B), Xiaoli Lin1,2, and Jing Hu1,2

1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China
[email protected]
2 The Hubei Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan University of Science and Technology, Wuhan, China
Abstract. Computer-aided drug design can accelerate drug development and reduce its cost. This study proposes a targeted drug design method based on a long short-term memory (LSTM) neural network and drug-target affinity. The method consists of de novo drug design and targeted drug design. First, the de novo drug design model learns molecular coding rules and broad chemical information by training on a large number of drug-like molecules. Then, based on the affinity score obtained from a drug-target interaction prediction model, the gradient of the model parameters is clipped during training, so that the targeted drug design model can learn target-specific information and efficiently design drugs for a given target. In the experiments, the model efficiently generates new drug-like molecules and designs drugs with higher affinity for 3CLpro of COVID-19 than previous methods. In the docking structures, the designed drug molecules have stable binding conformations and short atomic distances to the amino acid residues of the given target.

Keywords: Targeted Drug Design · Deep Learning · Molecular Docking
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 764–775, 2023. https://doi.org/10.1007/978-981-99-4749-2_65

1 Introduction

Traditional biological and medical drug design methods, such as high-throughput screening, usually require a long time and high cost [1, 2]. Computer-aided drug design can reduce the cost of drug development and promote drug design [3, 4]. It can predict drug-target affinity and virtually screen drugs to improve the efficiency of discovering lead compounds [5, 6]. The development of deep learning [7] in recent years has improved drug-target affinity prediction algorithms [8]. These algorithms can screen a large library of known compounds for lead compounds that may inhibit the activity of the target [9–11]. A relevant study estimated that there are at least 10^60 compounds in chemical space, whereas the number of compounds in any compound library is far smaller [9]. Virtual screening is therefore limited by the lack of drugs in the compound library. Recent research shows that new drug molecules that do not exist in compound libraries
An Efficient Drug Design Method Based on Drug-Target Affinity
765
can be generated through deep learning [11–13]. Yasonik [11] proposed a de novo approach capable of optimizing multiple traits collectively. For situations where the targeted specific ligand dataset is limited or unavailable, Krishnan [12] leverage deep learning and molecular modeling approaches to develop a drug design pipeline. Ramesh [13] introduced a method to generate target-specific molecules using a Generative Adversarial Network (GAN). When designing targeted drugs, deep learning models are trained through known targeted drugs [14, 15]. For example, in this compound library, only few drugs are known to treat COVID-19 [16]. Tao [17] proposed a targeted drug design model based on the gated recurrent unit (GRU) neural network algorithm, which used targeted active compounds as training data. Deep learning algorithms usually require high quantity and quality of data. This study proposes a targeted drug design method for 3CLpro , which does not need target specific drugs as the training data set, but improves the efficiency of targeted drug design. The method can be useful for cases where there is limited or no availability of target-specific ligand datasets.
2 Related Work
Generating new drug-like molecules with recurrent neural networks, in a way similar to natural language processing, has been shown to be feasible [12, 17]. In this study, molecules encoded in the Simplified Molecular Input Line Entry System (SMILES) [18] format are used. LSTM [19] based deep learning algorithms have shown good performance in de novo drug design without a given target [11]. Wang [6] proposed a drug screening model for the 3C-like protease (3CLpro), which gives an affinity score in the range (0,1) based on the activity of the drug. Molecules with high scores have high affinity for 3CLpro. AutoDock Vina, proposed by Trott [20], predicts binding modes with high accuracy in molecular docking [21]. Given the structures of the molecule and the protein to be docked, Vina returns the conformation and the binding affinity of the molecule bound to the target protein. Molecules with lower (more negative) binding affinity values have a better ability to inhibit the activity of the target protein. Therefore, Vina-based molecular docking [22, 23] is used as a validation method for targeted drug discovery. This study selects COVID-19 as the drug discovery case, with the viral main protease 3CLpro as the drug target, given its essential role in viral infection and the absence of a human homologue [24, 25]. Further, a deep learning model based on LSTM and the affinity score is used to capture the chemical information and generate new targeted drugs.
3 Method
The method proposed in this study includes two parts. The first part is a de novo drug design model, which generates new drug-like molecules after training on a large number of drug-like molecules. The second part is a targeted drug design model, which designs targeted drugs for 3CLpro after training under the constraint of drug-target affinity.
3.1 De Novo Drug Design Model
The purpose of de novo drug design is to generate new drug-like molecules that do not exist in the compound library. The model consists of three LSTM layers, one linear layer and one softmax layer. In forward propagation, the model takes the word vector of one SMILES character as input at each time step and predicts the next SMILES character. During training, the input at each time step is the real word vector of a SMILES character. After the LSTM and linear layer calculation, the predicted word vector of the next SMILES character is output. The cross-entropy function calculates the loss between the predicted word vector and the real word vector. Then, the loss is back propagated to obtain the gradient. The Adam optimizer [26] updates the model parameters according to the gradient and the learning rate. During generation, the input at each time step is the predicted word vector of one SMILES character output by the linear layer at the previous time step (at the first time step it is the word vector of the start character). The predicted word vector is sampled into a predicted SMILES character by softmax. Generation of the current SMILES sequence ends when the model generates “\n” or more than 128 SMILES characters. During training and generation, the hidden variables of the LSTM layers are randomly initialized when the start character is input, which improves the uniqueness of the generated SMILES sequences and the diversity of new drug-like molecules. The effect of this initialization method has been demonstrated in a previous study.
3.2 Targeted Drug Design Model
Based on previous research [27, 28], the targeted drug design model can generate drug molecules that better inhibit the activity of 3CLpro. The de novo drug design model cannot discriminate whether a drug molecule has a tight affinity to the target protein, and it does not distinguish between drug molecules during learning.
The molecules generated by the de novo drug design model are not targeted at the given target protein. The targeted drug design model does not need target-specific drugs as the training data set, but learns target information about 3CLpro through the affinity score of each molecule. During training of the targeted drug design model, the molecules in the training data set affect the model differently according to their affinity scores: molecules with lower affinity scores have a smaller effect on the model parameters, while molecules with higher affinity scores may have a greater effect. The parameter update process of the targeted drug design model is shown in Fig. 1. A gradient clipping strategy based on affinity scores is the key to the targeted drug design model. One SMILES sequence of a drug molecule is used as the training data each time. The affinity score of this drug molecule is given by the screening model. The cross-entropy loss is calculated each time the model predicts a SMILES character. The gradient is obtained by back propagation when the model completes the prediction of a SMILES sequence. Then, the gradient is clipped based on affinity
scores of molecules. The SGD optimizer [29] is used to update the parameters of the targeted drug design model. The formulas are given in (1) and (2):

$$G = \lVert g \rVert_2 = \operatorname{norm}(g) = \sqrt{\sum_{i=1}^{n} |g_i|^2} \tag{1}$$

$$g \leftarrow \begin{cases} g, & G \le k \cdot \mathrm{score} \\ \dfrac{k \cdot \mathrm{score}}{G} \cdot g, & G > k \cdot \mathrm{score} \end{cases} \tag{2}$$

In formula (1), g is the gradient with n components and G is the Euclidean norm (L2 norm) of g. In formula (2), score is the affinity score of a molecule, with a range of (0,1), and k is a parameter that is set to a constant before training. Formula (2) is the specific clipping method for g. When G is greater than k ∗ score, g is clipped: k ∗ score/G is then a scalar less than 1, so (k ∗ score/G) ∗ g has a smaller norm than g, namely k ∗ score. In this way, the gradient is clipped based on the affinity score.
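Formulas (1) and (2) can be illustrated with a minimal sketch in plain Python. The function name `clip_by_affinity` and the example numbers are assumptions made for illustration; the paper's own implementation uses PyTorch and is not shown in the text.

```python
import math

def clip_by_affinity(grad, score, k):
    """Rescale grad so that its L2 norm does not exceed k * score."""
    norm = math.sqrt(sum(x * x for x in grad))  # G in formula (1)
    threshold = k * score
    if norm > threshold:                        # second case of formula (2)
        factor = threshold / norm               # scalar smaller than 1
        return [factor * x for x in grad]
    return list(grad)                           # first case: g is unchanged

# Illustrative values only: G = 5 exceeds k * score = 1.5, so g is rescaled.
clipped = clip_by_affinity([3.0, 4.0], score=0.5, k=3)
clipped_norm = math.sqrt(sum(x * x for x in clipped))

# G = 0.5 is below k * score = 2.7, so g is left as it is.
unchanged = clip_by_affinity([0.3, 0.4], score=0.9, k=3)
```

In PyTorch, `torch.nn.utils.clip_grad_norm_` called with `max_norm` set to k ∗ score applies a closely similar rescaling to all parameter gradients at once.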
Fig. 1. Process of updating parameters of targeted drug design model.
When the affinity score of a molecule is small, the clipping threshold k ∗ score is small. Once G is greater than k ∗ score, g is scaled by the factor k ∗ score/G, so the gradient is limited to a small range. The optimizer updates the parameters of the model with the clipped gradient. Therefore, molecules with a low affinity score have less effect on the model. When the affinity score of a molecule is large, the threshold is large, and the gradient is limited to a large range. However, when the original gradient is small, the optimizer still uses a small gradient to update the parameters of the model, even if the molecule has a high affinity score. The original gradient is derived from the back propagation of the cross-entropy loss. The original gradient is small because the cross-entropy loss is small, which means that the SMILES sequence of this molecule has already been well learned by the model. Therefore, molecules well learned by the model have less effect on the model in training, and molecules with high affinity scores are fully learned by the model without being over-learned. Finally, based on the information learned from molecules with different affinity scores, the model can design new molecules with high drug-target affinity. Since each parameter update is affected by the affinity score of only one molecule, the SGD algorithm [29] is used as the optimization algorithm. The specific algorithm is shown in Algorithm 1.
Algorithm 1 Targeted Drug Design Algorithm
Input: X: Drugs, k
Parameter: The model parameters
Output: New targeted drugs
1: M ← Establish deep learning model based on LSTM.
2: for molecule in X:
3:     S ← Compute affinity score of molecule.
4:     P ← M obtains molecule and outputs prediction.
5:     L ← Calculate cross-entropy loss (P, molecule).
6:     G ← L backward and the gradient is obtained.
7:     if ||G||2 > k∙S then
8:         G = (k∙S/||G||2)∙G
9:     end if
10:    Optimizer updates parameters of M.
11: end for
12: M predicts new targeted drugs.
13: return New targeted drugs

Algorithm 1 describes the targeted drug design process. The first step is to establish the targeted drug design model based on LSTM. The molecules in the training data set are used one by one as the training data of the model. For each molecule, the affinity score is computed and the model outputs the predicted SMILES sequence. The cross-entropy loss is calculated from the real SMILES sequence and the predicted SMILES sequence. The loss is back propagated to obtain the original gradient, which is then clipped according to the gradient clipping strategy. The optimizer updates the parameters of the model based on the clipped gradient and the preset learning rate. Finally, the trained model can design new targeted drugs.
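The control flow of Algorithm 1 can be sketched without any deep learning library. In the sketch below, `score_fn` and `grad_fn` are hypothetical callables standing in for the screening model [6] and for back propagation through the LSTM, and the toy quadratic loss is an assumption chosen only so that the example runs.

```python
import math

def train_targeted(params, molecules, score_fn, grad_fn, k, lr):
    """One pass of Algorithm 1 over `molecules` with plain SGD updates."""
    for mol in molecules:
        s = score_fn(mol)                       # step 3: affinity score in (0, 1)
        g = grad_fn(params, mol)                # steps 4-6: gradient of the loss
        norm = math.sqrt(sum(x * x for x in g))
        if norm > k * s:                        # steps 7-9: affinity-based clipping
            g = [(k * s / norm) * x for x in g]
        params = [p - lr * x for p, x in zip(params, g)]  # step 10: SGD update
    return params

# Toy stand-ins (assumptions): both "molecules" pull the parameters towards
# the same point, but they carry different affinity scores.
toy_targets = {"high": [10.0, 0.0], "low": [10.0, 0.0]}
toy_scores = {"high": 0.9, "low": 0.1}

def toy_grad(params, mol):
    # Gradient of 0.5 * ||params - target||^2 with respect to params.
    return [p - t for p, t in zip(params, toy_targets[mol])]

step_high = train_targeted([0.0, 0.0], ["high"], toy_scores.get, toy_grad, k=3, lr=0.1)
step_low = train_targeted([0.0, 0.0], ["low"], toy_scores.get, toy_grad, k=3, lr=0.1)
```

With identical raw gradients, the high-score molecule moves the parameters about nine times further than the low-score one, which is the behaviour the clipping strategy is meant to produce.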
4 Experiment
The de novo drug design experiment uses the de novo drug design model to generate drug-like molecules. Then, the targeted drug design experiment uses the targeted drug design model to generate targeted drugs for the given target protein. In addition, an ablation experiment is carried out for the gradient clipping strategy. Finally, docking experiments are conducted between the newly designed targeted drugs and 3CLpro to verify the performance of the method proposed in this study.
4.1 Data Source and Preprocessing
All data are collected from the ChEMBL [30] (www.ebi.ac.uk/chembl/) and Drug Repurposing Hub [31] (clue.io/repurposing) databases.
The ChEMBL31 data set and the Drug Repurposing Hub data set contain 2,331,700 distinct compounds and 7,934 drugs, respectively. RDKit (www.rdkit.org) is used to standardize all molecules. Then, molecules with a SMILES length of 34 to 74 are selected. Each SMILES sequence is extended with “G” at the beginning and “E” at the end. To give all SMILES sequences the same length of 76, the character “Q” is appended to the end of the sequence. The SMILES sequence is then one-hot encoded, so that each SMILES sequence becomes a matrix of shape [76,61]. After the preprocessing, there are 1,706,161 molecules in the ChEMBL31 data set and 3,128 molecules in the Drug Repurposing Hub data set. The data sets are also shown in Table 1.

Table 1. Training data sets
Data set                 Number of molecules
ChEMBL31                 1,706,161
Drug Repurposing Hub     3,128
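The padding and one-hot encoding described above can be sketched as follows. The helper names are hypothetical, and the vocabulary is built from the single example string for illustration only; the paper uses a fixed vocabulary of 61 characters that is not listed in the text.

```python
START, END, PAD = "G", "E", "Q"
MAX_LEN = 76  # start character + at most 74 SMILES characters + end character

def pad_smiles(smiles, max_len=MAX_LEN):
    """Add the start and end characters, then pad with "Q" up to max_len."""
    if not 34 <= len(smiles) <= 74:
        raise ValueError("SMILES length is outside the selected range 34-74")
    seq = START + smiles + END
    return seq + PAD * (max_len - len(seq))

def one_hot(seq, vocab):
    """Encode seq as a list of rows, one row of len(vocab) entries per character."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [[1 if index[ch] == j else 0 for j in range(len(vocab))] for ch in seq]

# Molecule d from Table 4, used here only as an example input.
smiles = "CC(C)Oc1ccccc1N1CCN(CC(O)CNC(=O)c2cccnc2Nc2ccccc2C#N)CC1"
padded = pad_smiles(smiles)
toy_vocab = sorted(set(padded))      # assumption: stands in for the 61 characters
matrix = one_hot(padded, toy_vocab)  # shape [76, len(toy_vocab)]
```

With the full 61-character vocabulary, the same procedure yields the [76,61] matrix mentioned in the text.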
4.2 De Novo Drug Design Experiment
The purpose of the de novo drug design experiment is to design new drug-like molecules that do not exist in the database. In this study, PyTorch [32] is used to establish the deep learning models. The hyperparameters are shown in Table 2. The input size of the LSTM layers is 61, and the hidden size is 1024. The input and output sizes of the linear layer are 1024 and 61. The predicted word vector is output by the linear layer. Softmax converts the predicted word vectors into SMILES characters one by one. During training, the batch size is set to 16, the learning rate to 0.0001, and the dropout to 0.2. The ChEMBL31 data set is used for training. The de novo drug design model is obtained after 17 training epochs. The model is set to generate 1,000,000 SMILES characters. These characters constitute 13,301 strings, of which 12,701 are valid SMILES sequences, accounting for 95.49%. Of these SMILES sequences, 99.93% are different from each other, and 92.70% are new molecules that do not exist in the training data. The performance of the de novo drug design model is shown in Table 3. Compared with the previous methods [11–13], the proposed de novo drug design method generates new drug-like molecules more efficiently.
4.3 Targeted Drug Design Experiment
The purpose of the targeted drug design experiment is to design new drug molecules with high drug-target affinity. In addition, an ablation experiment is conducted without gradient clipping based on the affinity score.
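Both experiments generate molecules one character at a time, as described in Sect. 3.1. The sampling loop can be sketched as below, assuming a hypothetical callable `next_logits` that stands in for the LSTM and linear layers and keeps any hidden state internally. The three-character vocabulary and the fixed transition table are toy assumptions that make the example deterministic.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_logits, vocab, start="G", end="\n", max_len=128, rng=random):
    """Sample characters until `end` is drawn or max_len characters exist."""
    chars = []
    prev = start
    while len(chars) < max_len:
        probs = softmax(next_logits(prev))
        prev = rng.choices(vocab, weights=probs, k=1)[0]
        if prev == end:
            break
        chars.append(prev)
    return "".join(chars)

TOY_VOCAB = ["C", "O", "\n"]

def toy_next_logits(prev):
    # Hypothetical model: strongly prefers the chain G -> C -> O -> "\n".
    preferred = {"G": "C", "C": "O", "O": "\n"}[prev]
    return [0.0 if ch == preferred else -1000.0 for ch in TOY_VOCAB]

sample = generate(toy_next_logits, TOY_VOCAB)
```

A trained model would return much flatter distributions, so repeated calls would yield different SMILES strings.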
Table 2. Hyperparameters of model
Layer     Input size   Output size   Hidden size   Num layers   Drop out
LSTM      61           1024          1024          3            0.2
Linear    1024         61            –             1            –
Table 3. Performance of methods to generate new molecules.
Method                        Validity   Uniqueness   Novelty   New Molecules
Yasonik’s method [11]         77%        68.1%        –         47.7%
Krishnan’s method [12]        92.9%      84.2%        –         74.7%
Ramesh’s method [13]          53.0%      100%         98%       53%
De novo drug design model     95.49%     99.93%       92.70%    88.4%
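The first three columns of Table 3 can be computed along the following lines. This is a sketch under stated assumptions: the denominators (uniqueness relative to valid strings, novelty relative to unique valid strings) are one common reading of the text, and `balanced_parentheses` is only a toy validity check. In practice a chemistry toolkit would be used, for example RDKit, whose `Chem.MolFromSmiles` returns `None` for strings it cannot parse.

```python
def ratio(part, whole):
    """Return part / whole, or 0.0 when the denominator is empty."""
    return part / whole if whole else 0.0

def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness and novelty of a list of generated strings."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": ratio(len(valid), len(generated)),
        "uniqueness": ratio(len(unique), len(valid)),
        "novelty": ratio(len(novel), len(unique)),
    }

def balanced_parentheses(smiles):
    # Toy validity check (assumption): branch parentheses must be balanced.
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

metrics = generation_metrics(["CCO", "CCO", "C(C", "CCN"], {"CCO"}, balanced_parentheses)

# The validity reported in the text: 12,701 valid strings out of 13,301.
reported_validity = round(100 * 12701 / 13301, 2)
```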
The de novo drug design model is trained further to obtain the targeted drug design model. During training, the L2 norm of the gradient is clipped based on the affinity score. The learning rate is set to 0.0001, the dropout is 0.2, and the affinity score coefficient k is set to 2, 3, and 4, respectively. The ablation model is obtained without gradient clipping. All models are trained for 170 epochs on the Drug Repurposing Hub data set and are set to generate 1,000,000 SMILES characters. The highest affinity scores of molecules designed by the targeted drug design models with k = 2, 3, and 4 are 0.78, 0.86 and 0.80, respectively. The distribution of affinity scores is shown in Fig. 2. When k = 3, the highest affinity score among the generated molecules is the largest. Therefore, the targeted drug design model with k = 3 is used as the result of the targeted drug design experiment. The affinity scores of the molecules in the Drug Repurposing Hub data set, the ablation experiment and the targeted drug design experiment are shown in Fig. 3.
Fig. 2. Distribution of affinity score of molecules generated by three targeted drug design models.
The highest affinity scores of the molecules generated in the targeted drug design experiment and the ablation experiment are 0.86 and 0.76, respectively. The highest affinity score of molecules in the training data is 0.78. As shown in Fig. 3, the affinity score of molecules generated by the targeted drug design model is the highest. This shows that the targeted drug design model proposed in this study can effectively design drug molecules with high drug-target affinity.
Fig. 3. Affinity score of training data and the result of ablation experiment and targeted drug design experiment.
4.4 Molecular Docking Experiment
To further verify the performance of the targeted drug design model proposed in this study, a molecular docking experiment is conducted. Four drug molecules generated in the targeted drug design experiment are selected to dock with 3CLpro. The SMILES sequence, affinity score and binding affinity of the four drug molecules are shown in Table 4. The affinity score is calculated by the screening model [6] and the binding affinity is calculated by Vina [20]. The 2D structures of these four drug molecules are shown in Fig. 4. AutoDock Tools [34] is used to process 3CLpro and the generated molecules. AutoDock Vina [20] is used for molecular docking. The parameters of the grid box of Vina are as follows: center_x = −5.365, center_y = 9.156, center_z = 28.03, size_x = 72.00, size_y = 69.75, size_z = 71.25. Then, the four drug molecules are docked with 3CLpro. In a previous drug screening study, Yu [22] found that luteolin (the main flavonoid in honeysuckle) had good binding affinity for 3CLpro, with −5.37 kcal/mol. Ray [35] found that the three best screened FDA-approved drugs, Velpatasvir, Glecaprevir, and Grazoprevir, have binding affinities of −9.1 kcal/mol, −8.7 kcal/mol, and −8.7 kcal/mol for 3CLpro. In research on de novo design of new chemical entities, the 3CLP_28301 molecule designed by Bung [23] had the best binding affinity for 3CLpro, with −9.1 kcal/mol. The comparison is shown in Fig. 5. The green column, the yellow columns and the blue column represent the binding affinities of luteolin discovered by Yu [22], of Velpatasvir, Glecaprevir, and Grazoprevir discovered by Ray [35], and of 3CLP_28301 designed by Bung [23], respectively.
The red columns represent the four molecules designed in this study. It can be seen that three of them (molecules a, b and c in Table 4) have better binding affinity to 3CLpro than all of the reference molecules, while molecule d matches Glecaprevir and Grazoprevir. Furthermore, the docking results are visualized in 3D to inspect the binding conformation. PyMol is used to analyze the 3D conformation of the docking results. The binding conformation of molecule (a) docked with 3CLpro is shown in Fig. 6.
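For reproducing the docking run, the grid box reported in this section can be written into an AutoDock Vina configuration file, which uses plain `key = value` lines. The sketch below only assembles that text. The receptor and ligand file names are hypothetical placeholders, and any further Vina options are left at their defaults because the paper does not report them.

```python
# Grid box parameters as reported for the docking experiment.
GRID_BOX = {
    "center_x": -5.365, "center_y": 9.156, "center_z": 28.03,
    "size_x": 72.00, "size_y": 69.75, "size_z": 71.25,
}

def vina_config(receptor, ligand, box=GRID_BOX):
    """Return the text of a Vina configuration file for one ligand."""
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    lines += [f"{key} = {value}" for key, value in box.items()]
    return "\n".join(lines) + "\n"

# Hypothetical file names; the paper does not name its input files.
config_text = vina_config("3clpro.pdbqt", "molecule_a.pdbqt")
```

The resulting text can be saved to a file and passed to Vina through its `--config` option.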
Fig. 4. 2D structures of four drug molecules.
Table 4. SMILES sequence, affinity score and binding affinity of four molecules.
No   SMILES                                                                  Affinity Score   Binding Affinity (kcal/mol)
a    CCOc1cc2ncc(C#N)c(Nc3ccc(OCc4ccccc4)c(Cl)c3)c2cc1NC(=O)C(F)=CCN(C)C     0.83             −10.0
b    N#Cc1cnc2ccc(-c3ccnc(NCC(O)CN4CCc5ccccc5C4)c3)cc2c1Nc1ccc(Cl)cc1        0.86             −9.7
c    CNC(=O)COc1cccc(-c2cc(Nc3cccc(Sc4ccc5c(c4)OCCO5)c3C#N)ccc2C(C)C)c1      0.81             −9.3
d    CC(C)Oc1ccccc1N1CCN(CC(O)CNC(=O)c2cccnc2Nc2ccccc2C#N)CC1                0.81             −8.7
In Fig. 6 (a), the molecule and 3CLpro are displayed as a green stick structure and a gray surface structure with 40% transparency, respectively. It can be seen that the molecule docks in the cavity of 3CLpro and has a stable binding conformation. In Fig. 6 (b), 3CLpro is displayed as a cyan cartoon structure as a whole, while the SER-284 amino acid residue, which forms a hydrogen bond with the molecule, is displayed as a gray stick structure. The hydrogen bond is displayed as a yellow dotted line. It can be seen that the molecule forms a hydrogen bond with the SER-284 amino acid residue of 3CLpro with an atomic distance of 2.1 Å.
Fig. 5. Binding affinity of luteolin, Velpatasvir, Glecaprevir, Grazoprevir, 3CLP_28301 and molecules generated by targeted drug design experiment.
Fig. 6. Binding conformation. (a) Surface structure. (b) Binding position.
In conclusion, the drug molecule designed in this study has a strong tendency to bind to the target, can form a stable binding conformation, and can better inhibit the activity of the 3CLpro .
5 Conclusion
This paper has proposed a targeted drug design method. The LSTM-based method extracts broad chemical information from a large number of drug-like molecules and narrow chemical information from some drugs with different affinity scores for 3CLpro, and finally designs new drug molecules that can inhibit the activity of 3CLpro. In targeted drug design model training, a gradient clipping strategy based on affinity scores is used. The clipped gradient, which is used by the optimizer to update the model parameters, plays a key role when the molecular affinity score is large and the molecule is not well learned by the model. The model learns about molecules to different degrees based on their affinity scores. In the experiments, the new drug molecules have high drug-target affinity scores. Compared with the previous methods, the drug molecules designed in this study have better binding affinity and binding conformation. Future work includes generating drug molecules for multi-target proteins.
Acknowledgments. The authors thank the members of Machine Learning and Artificial Intelligence Laboratory, School of Computer Science and Technology, Wuhan University of Science and Technology, for their helpful discussion within seminars. This work was supported by National Natural Science Foundation of China (No. 61972299, 61502356).
References
1. Scannell, J.W., Blanckley, A., Boldon, H., Warrington, B.: Diagnosing the decline in pharmaceutical R&D efficiency. Nat. Rev. Drug. Discov. 11(3), 191–200 (2012)
2. Oke, A., Sahin, D., Chen, X., Shang, Y.: High throughput screening for drug discovery and virus detection. Comb. Chem. High Throughput Screen 25(9), 1518–1533 (2021)
3. Schneider, G., Fechner, U.: Computer-based de novo design of drug-like molecules. Nat. Rev. Drug. Discov. 4(8), 649–663 (2005)
4. Mak, K.K., Pichika, M.R.: Artificial intelligence in drug development: present status and future prospects. Drug. Discov. Today 24(3), 773–780 (2019)
5. Lionta, E., Spyrou, G.M., Vassilatis, D., Cournia, Z.: Structure-based virtual screening for drug discovery: principles, applications and recent advances. Curr. Top. Med. Chem. 14, 1923–1938 (2014)
6. Wang, S., Sun, Q., Xu, Y., Pei, J., Lai, L.: A transferable deep learning approach to fast screen potential antiviral drugs against SARS-CoV-2. Brief Bioinform. 22(6), bbab211 (2021)
7. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
8. Zhang, L., Wang, C., Chen, X.: Predicting drug-target binding affinity through molecule representation block based on multi-head attention and skip connection. Brief Bioinform. 23(6), bbac468 (2022)
9. Reymond, J.L., Ruddigkeit, L., Blum, L., Deursen, R.: The enumeration of chemical space. Wiley Interdiscip Rev. Comput. Mol. 2(5), 717–733 (2012)
10. Ye, Q., Zhang, X., Lin, X.: Drug-target interaction prediction via graph auto-encoder and multi-subspace deep neural networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2022)
11. Yasonik, J.: Multiobjective de novo drug design with recurrent neural networks and nondominated sorting. J. Cheminf. 12(1), 1–9 (2020). https://doi.org/10.1186/s13321-020-00419-6
12. Krishnan, S.R., Bung, N., Bulusu, G., Roy, A.: Accelerating de novo drug design against novel proteins using deep learning. J. Chem. Inf. Model 61(2), 621–630 (2021)
13. Ramesh, A., Rao, A.S., Moudgalya, S., Srinivas, K.S.: GAN based approach for drug design. In: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 825–828 (2021)
14. Lin, X., Zhang, X., Xu, X.: Efficient classification of hot spots and hub protein interfaces by recursive feature elimination and gradient boosting. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(5), 1525–1534 (2020)
15. Lin, X., Zhang, X.: Prediction of hot regions in PPIs based on improved local community structure detecting. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(5), 1470–1479 (2018)
16. Wu, F., Zhao, S., Yu, B., Chen, Y., et al.: A new coronavirus associated with human respiratory disease in China. Nature 579(7798), 265–269 (2020)
17. Tao, J., Zhang, X., Lin, X.: A targeted drug design method based on GRU and TopP sampling strategies. In: Intelligent Computing Theories and Application (2022)
18. Weininger, D.: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model 28, 31–36 (1988)
19. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
20. Trott, O., Olson, A.J.: AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31(2), 455–461 (2010)
21. Eberhardt, J., Santos-Martins, D., Tillack, A.F., Forli, S.: AutoDock Vina 1.2.0: new docking methods, expanded force field, and python bindings. J. Chem. Inf. Model 23, 61(8), 3891–3898 (2021)
22. Yu, R., Chen, L., Lan, R., Shen, R., Li, P.: Computational screening of antagonists against the SARS-CoV-2 (COVID-19) coronavirus by molecular docking. Int. J. Antimicrob Agents 56(2), 106012 (2020)
23. Bung, N., Krishnan, S.R., Bulusu, G., Roy, A.: De novo design of new chemical entities for SARS-CoV-2 using artificial intelligence. Future Med. Chem. 13(6), 575–585 (2021)
24. Zhang, L., Lin, D., Sun, X., Curth, U., et al.: Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors. Science 368(6489), 409–412 (2020)
25. Jin, Z., Du, X., Xu, Y., Deng, Y., Liu, M., et al.: Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors. Nature 582(7811), 289–293 (2020)
26. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International conference for learning representations (2015)
27. Pang, J., Shu, Z., Ding, L., Jiang, C., Liu, C., Zhang, X.: Efficient and exact multigraph matching search. IEEE Trans. Industr. Inf. 17(6), 4141–4149 (2021)
28. Hu, J., Zhou, L., Li, B., Zhang, X., Chen, N.: Improve hot region prediction by analyzing different machine learning algorithms. BMC Bioinform. 22(Suppl3), 522 (2021)
29. Ruder, S.: An overview of gradient descent optimization algorithms. ArXiv, abs/1609.04747 (2016)
30. Mendez, D., Gaulton, A., Bento, A.P., Chambers, J., et al.: ChEMBL: towards direct deposition of bio-assay data. Nucleic Acids Res. 47(D1), D930–D940 (2019)
31. Corsello, S.M., Bittker, J.A., Liu, Z., Gould, J., McCarren, P., et al.: The drug repurposing hub: a next-generation drug library and information resource. Nat. Med. 23(4), 405–408 (2017)
32. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp. 8024–8035 (2019)
33. Santana, M.V.S., Silva-Jr, F.P.: De novo design and bioactivity prediction of SARS-CoV-2 main protease inhibitors using recurrent neural network-based transfer learning. BMC Chem. 15(1), 8 (2021)
34. Morris, G.M., Huey, R., Lindstrom, W., Sanner, M.F., Belew, R.K., et al.: AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30(16), 2785–2791 (2009)
35. Ray, A.K., Sen, G.P.S., Panda, S.K., et al.: Repurposing of FDA-approved drugs as potential inhibitors of the SARS-CoV-2 main protease: molecular insights into improved therapeutic discovery. Comput. Biol. Med. (2022)
Drug-Target Affinity Prediction Based on Self-attention Graph Pooling and Mutual Interaction Neural Network
Xizi Wang1, Jing Hu1,2,3(B), and Xiaolong Zhang1,2,3
1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
{hujing,xiaolong.zhang}@wust.edu.cn
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, China
3 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
Abstract. Predicting drug-target affinity is an important step in drug screening and discovery. This paper proposes a novel prediction model, SAIG-DTA, which is built on SAGPool and MINN and predicts the binding affinity between drugs and their protein targets. Unlike most prior prediction models, this model converts the target and the drug molecule into a protein contact map and a drug molecule map, respectively, and then applies self-attention to the maps to generate effective representations of drugs and targets. Before being aggregated into a molecular representation, the features of each atomic node in the graph are weighted by an attention score. The techniques used to compute the self-attention scores are compared in this study. Furthermore, while the attention mechanism has been widely used to capture the one-way influence between the drug and the target, the interaction between the drug and the target has been underexplored. The MINN module is therefore added to the model; it combines InteractingTransformer (Interformer) with the improved Communicative Message Passing Neural Network (CMPNN) (Inter-CMPNN) to better capture the bidirectional effects between drugs and targets and improve the model. The Davis, KIBA (Kinase Inhibitor Bioactivity), and Metz datasets were used to train the proposed model. The comparative experimental results on regression tasks show that SAIG-DTA outperforms earlier sequence-based and other graph-based methods and generalizes well.
Keywords: Drug-target affinity prediction · Graph convolutional neural network · Attention mechanism · Multi-channel graph convolutional network
1 Introduction The cost of time and money spent in the development of a new drug is positively correlated [1], and research has found that the success rate of academic drug discovery and development is only 50% in the middle and late stages of research [2]. The fast growth © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 776–790, 2023. https://doi.org/10.1007/978-981-99-4749-2_66
Drug-Target Affinity Prediction
777
of computer technology over the last several decades has enabled improved drug design to help experimental drug design and speed up drug development [3]. In many drug design stages, drug-target affinity (DTA) prediction is one of the most important steps involved in computational methods. By refining the properties of possible medications and the search space, accurate and efficient DTA prediction algorithms may substantially speed up the virtual screening process of potential drug molecules, minimizing wasteful biological and chemical tests. Early experiments for feature extraction approaches described medications and proteins using human experience or ingeniously devised mathematical descriptors, i.e., hand-crafted features [4]. The original KronRLS [5] and SimBoost [6] versions both have several custom features in this regard. These approaches have performed well on early DTA prediction challenges, but they rely on chemical insights or expert expertise, which restricts their opportunity for development. Recently, with the increase of molecular experimental data, methods based on machine learning and deep learning [7, 8] have shown obvious advantages. The features obtained by these deep learning methods are different from hand-crafted features, which can input the original and complete molecular representation, and the features can be automatically extracted by deep learning methods, and these features are more effective than artificially set features in the experimental process. Deep learning approaches in DTA may be classified into two types: sequence-based methods and structure-based methods. The former learn feature representations from sequence data (molecular fingerprints and protein amino acid sequences); for example, Öztürk, H et al. [9] used a one-dimensional representation of drug molecules and target proteins: drug SMILES and amino acid sequences. 
Similarly, WideDTA [10] also only relies on a simple onedimensional representation, but it is different from the feature representation of DeepDTA. In DeepDTA, drug SMILES and protein sequences are represented as words instead of characters, which correspond to chemical words and protein motifs and domains in the extracted 3-residual sequences. Thereafter, DeepCPI [11] leverages natural language processing techniques to learn low-dimensional feature representations, including latent semantic analysis for drug embeddings and Work2vec for protein embeddings. On the other hand, structure-based methods utilize two-dimensional topologies (i.e., graphs) [12] or three-dimensional structures [13] for feature extraction, respectively. The molecular graph is a kind of non-Euclidean data with irregular size, which is difficult to be processed by traditional deep learning methods. To process graph data, it is proposed to apply a graph neural network (GNN) to DTA prediction. Graph neural network research has long been at the forefront of research, with the most frequently used graph convolutional network (GCN) [14] and graph attention network (GAT) [15] being employed in the DTA prediction task GNN. There are also many research results in the application of GraphDTA [16], which introduces a graphical representation that expresses the SMILES of drug molecules as a molecular graph to obtain the two-dimensional structure information of the graph, but does not construct a graph for each protein and retains CNN in the protein feature extraction, similar to Deep DTA. In comparison to the baseline approach, GraphDTA produces better results than the baseline one-dimensional method, demonstrating that structural information is superior than sequence information. DGraphDTA [17] takes GraphDTA as a starting
point and proposes that a graph can also be constructed for the protein to capture more structural information, thereby introducing the contact map. The contact map [18] is a 2D (two-dimensional) representation of a 3D (three-dimensional) protein structure and is a common output of protein structure prediction. DGraphDTA outperforms GraphDTA, which demonstrates the value of two-dimensional protein structure information. Meanwhile, DeepGS [19] uses the Smi2Vec and Prot2Vec embedding techniques to capture chemical context in drug SMILES and amino acid sequences and then combines this context with graph-derived features for DTA prediction. The attention mechanism is another important component of neural networks. It allows the network to focus on the relevant parts of the input, which improves the feature representations learned by a graph neural network, and it has been applied successfully to various modeling tasks [20, 21]. Different attention structures lead to different model performance. Lee et al. proposed self-attention graph pooling (SAGPool) [22], which uses a self-attention mechanism to perform node pooling and has achieved state-of-the-art performance on many graph learning tasks. Inspired by this model, this paper integrates SAGPool into the target preprocessing network (TPN) to obtain richer target feature representations. The SAIG-DTA model proposed in this paper is an end-to-end prediction algorithm. It takes the SMILES of the drug molecule and the amino acid sequence of the protein as input and constructs the drug molecular graph and the protein contact map, respectively. The contact map is then passed through the target preprocessing network (TPN) for preliminary feature extraction.
In the general DTA prediction pipeline, the step after feature extraction is to concatenate the drug and target feature vectors and feed them into an MLP that predicts the affinity value. In other words, most existing models process and represent the drug and the target protein separately and therefore ignore the context of the interaction between them. To overcome this shortcoming, the mutual interaction neural network (MINN) from the MINN-DTI [23] model is introduced after the preliminary feature extraction. It combines an interacting transformer (called Interformer) with an improved communicative message passing neural network (called Inter-CMPNN) to capture the interaction between the drug and the target protein and obtain a better representation. On the benchmark datasets Davis and KIBA, the proposed method achieves better performance than existing methods.
2 Materials and Methods
2.1 Overview of Model
The SAIG-DTA model proposed in this paper is shown in Fig. 1. It consists of three modules: the target preprocessing network (TPN), MINN, and the interaction prediction network. MINN is the core component of the model and consists of an Interformer module and an Inter-CMPNN module. With these two modules, latent vectors of targets and drugs can be extracted while taking the context of their interaction into account. The
input protein contact map will first be preprocessed by the target preprocessing network (TPN), and the latent feature vectors of targets and molecules are extracted using MINN. These latent feature vectors are then concatenated, and finally, the drug target affinity value is obtained through the prediction network. The following sections describe these modules in detail.
Fig. 1. The overall architecture of SAIG-DTA.
2.2 Target Preprocessing Network
As shown in Fig. 2, the target preprocessing network (TPN) consists of three self-attention graph pooling modules, each comprising a graph convolution layer and a SAGPooling layer. The outputs of the three modules are read out and merged hierarchically. The result is then passed to fully connected layers to obtain the final target representation.
Fig. 2. The structure of target preprocessing network
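The hierarchical read-out described above can be illustrated with a minimal Python sketch. It uses the mean-and-max readout that the paper defines for its readout layer; how the three per-block summaries are merged is not stated in the text, so the element-wise sum used here is an assumption, and the function names `readout` and `hierarchical_readout` are hypothetical.

```python
def readout(node_features):
    """Concatenate the per-dimension mean and max over all nodes (mean || max)."""
    n = len(node_features)
    dims = range(len(node_features[0]))
    mean = [sum(x[d] for x in node_features) / n for d in dims]
    peak = [max(x[d] for x in node_features) for d in dims]
    return mean + peak


def hierarchical_readout(block_outputs):
    """Merge the readouts of all pooling blocks.

    Assumption: the block summaries are merged by element-wise summation;
    the paper only says they are merged hierarchically.
    """
    summaries = [readout(nodes) for nodes in block_outputs]
    return [sum(values) for values in zip(*summaries)]


# Toy node features after each of three pooling blocks (fewer nodes each time).
blocks = [
    [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]],  # block 1: 3 nodes, 2 features
    [[3.0, 0.0], [5.0, 4.0]],              # block 2: 2 nodes
    [[5.0, 4.0]],                          # block 3: 1 node
]
graph_vector = hierarchical_readout(blocks)
```

The resulting vector has a fixed length (twice the feature dimension) regardless of how many nodes each block retains, which is what allows it to be passed to fully connected layers.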
Graph Convolution Layer
The graph convolution layer is expressed as formula (1):

$$ h^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} \Theta \right) \tag{1} $$

where $\tilde{A} \in \mathbb{R}^{N \times N}$ is the graph adjacency matrix with self-loops, $\tilde{D} \in \mathbb{R}^{N \times N}$ is the diagonal degree matrix of $\tilde{A}$, $h^{(l)} \in \mathbb{R}^{N \times F}$ is the node feature matrix of layer $l$, and $\Theta \in \mathbb{R}^{F \times F'}$ is the trainable convolution weight with input feature dimension $F$ and output feature dimension $F'$. The rectified linear unit (ReLU) is adopted as the activation function $\sigma$.
Self-attention Graph Pooling Layer
The self-attention graph pooling (SAGPool) layer consists of a node scoring step and a subsequent masking operation, as shown in Fig. 3. Simply put, one of several attention scoring formulas is used to compute the self-attention scores $Z$ of all nodes in the graph, the nodes are sorted by score, and the top $\lceil kN \rceil$ nodes are retained. Here $k \in (0, 1]$ is the pooling ratio, i.e., the fraction of nodes that is kept. The masking operation can be expressed as formula (2):

$$ \mathrm{idx} = \operatorname{top-rank}(Z, \lceil kN \rceil), \qquad Z_{mask} = Z_{\mathrm{idx}} \tag{2} $$

where idx is the indexing operation used to obtain the feature attention mask $Z_{mask}$.
Fig. 3. The process of self-attention graph pooling.
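The scoring and masking steps of formulas (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration with GCN-style scores on a toy graph, not the authors' implementation: the function names are hypothetical, the weight `theta` is a fixed toy vector rather than a trained parameter, and returning the kept indices in ascending node order is a design choice of this sketch.

```python
import math


def gcn_scores(adj, feats, theta):
    """Attention scores Z = ReLU(D^-1/2 (A + I) D^-1/2 H theta), theta of shape F x 1."""
    n = len(adj)
    # Adjacency with self-loops and its node degrees.
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    # Project every node feature vector to a scalar: H theta.
    hw = [sum(f * w for f, w in zip(feats[i], theta)) for i in range(n)]
    scores = []
    for i in range(n):
        s = sum(a[i][j] * hw[j] / math.sqrt(deg[i] * deg[j]) for j in range(n))
        scores.append(max(0.0, s))  # ReLU
    return scores


def top_rank(scores, ratio):
    """Indices of the ceil(ratio * N) highest-scoring nodes, in ascending node order."""
    keep = max(1, math.ceil(ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])


def sag_pool(adj, feats, theta, ratio):
    """Return kept indices, masked scores, and the induced sub-graph."""
    scores = gcn_scores(adj, feats, theta)
    idx = top_rank(scores, ratio)
    z_mask = [scores[i] for i in idx]
    pooled_feats = [feats[i] for i in idx]
    pooled_adj = [[adj[i][j] for j in idx] for i in idx]
    return idx, z_mask, pooled_feats, pooled_adj


# Toy path graph 0-1-2-3 with one feature per node.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
feats = [[1.0], [2.0], [3.0], [4.0]]
idx, z_mask, pooled_feats, pooled_adj = sag_pool(adj, feats, theta=[1.0], ratio=0.5)
```

With a pooling ratio of 1.0 the same call keeps every node, so the layer then only computes scores and performs no actual pooling.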
This study evaluates three scoring methods, GNN, GCN, and GAT, which are representative GNN variants that perform well on graph-related tasks.
GNN Scoring Method
The GNN scoring method is defined as formula (3):

$$ Z = \sigma\left( h_v \Theta_1 + \sum_{u \in N(v)} h_u \Theta_2 \right) \tag{3} $$

where $v$ is the node itself, $N(v)$ is the set of all neighbors of node $v$, $h_v$ is the feature of node $v$ at the $l$-th layer, $\Theta_1, \Theta_2 \in \mathbb{R}^{F \times 1}$ are the trainable convolution weights with input feature dimension $F$, and $\sigma(\cdot)$ is the ReLU activation function.
GCN Scoring Method
The GCN scoring method is defined as formula (4):

$$ Z = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} \Theta \right) \tag{4} $$

The right-hand side of formula (4) has the same form as that of formula (1); the difference is that the convolution weight in formula (4) has dimension $\mathbb{R}^{F \times 1}$, so that the output is the attention score $Z$.
GAT Scoring Method
The GAT scoring method is defined as formula (5):

$$ Z = \sigma\left( \left( \alpha_{v,v} h_v + \sum_{u \in N(v)} \alpha_{u,v} h_u \right) \cdot \Theta \right) \tag{5} $$

where $\Theta \in \mathbb{R}^{F \times 1}$ is the trainable convolution weight shared by all nodes and $\alpha_{u,v}$ is the attention coefficient, calculated by formula (6):

$$ \alpha_{u,v} = \frac{\exp\left( \mathrm{LeakyReLU}\left( a^{T} [\Theta h_u \,\|\, \Theta h_v] \right) \right)}{\sum_{k \in N(v) \cup \{v\}} \exp\left( \mathrm{LeakyReLU}\left( a^{T} [\Theta h_k \,\|\, \Theta h_v] \right) \right)} \tag{6} $$

where $a$ is the shared attention operation that maps $\mathbb{R}^{2F}$ to $\mathbb{R}$.
Readout Layer
The readout layer aggregates node features hierarchically according to the pooling architecture. In this paper, the readout is the concatenation of the mean and the maximum of the node features, which can be written as formula (7):

$$ r = \frac{1}{N} \sum_{i=1}^{N} x_i \,\Big\|\, \max_{i=1}^{N} x_i \tag{7} $$

where $N$ is the number of nodes, $x_i$ is the feature vector of the $i$-th node, and $\|$ denotes concatenation.
2.3 Mutual Interaction Neural Network (MINN)
MINN is an interactive neural network that takes the context of the target-drug interaction into account when representing both. It consists of two parts, Interformer and Inter-CMPNN.
Interformer
In this paper, two interacting Transformer decoders, together referred to as Interformer, are used to extract the feature vectors of targets and drugs; the structure is shown in Fig. 4. Each decoder of Interformer consists of one or more identical layers, similar to the Transformer [24]. Each layer of Interformer consists of three sub-layers: a multi-head self-attention
layer, an interaction attention layer, and a fully connected feed-forward network. The multi-head self-attention sub-layer and the feed-forward sub-layer are essentially the same as in the Transformer, except that the mask operation is removed, as in TransformerCPI [25], so that complete drug and target information can be used. The interaction attention layer in each decoder of Interformer employs a multi-head scaled dot-product attention block to receive external information from the other decoder. This is the main difference from the encoder-decoder attention layer of the Transformer, where the external information comes from the encoder. The scaled dot-product attention block can be expressed as formula (8):

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \tag{8} $$

where $Q$ is the linearly transformed output of the multi-head self-attention layer of the decoder, $K$ and $V$ are the linearly transformed outputs of the last layer of the other decoder, and $d_k$ is the dimension of $K$ and $V$. The multi-head attention layer can be expressed as formulas (9) and (10):

$$ \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{9} $$

$$ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O} \tag{10} $$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are parameter matrices.
Fig. 4. The structure of Interformer
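Formula (8) can be made concrete with a small single-head sketch in plain Python. In the interaction attention setting, the queries would come from one decoder (for example, drug atoms) and the keys and values from the other (for example, target residues). The learned projections of formulas (9) and (10) are omitted, and all names and toy values are illustrative assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    out = []
    for q_i in q:
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        weights = softmax(logits)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Two drug-side queries attend over three target-side keys/values.
drug_q = [[0.0, 0.0], [100.0, 0.0]]
target_k = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
target_v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
context = attention(drug_q, target_k, target_v)
```

A zero query attends uniformly and returns the mean of the value rows, whereas a query strongly aligned with the first key returns almost exactly the first value row. Each output row has the width of the value vectors and there is one output row per query.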
Inter-CMPNN
The Inter-CMPNN module is an improved version of the Communicative Message Passing Neural Network (CMPNN). CMPNN is a variant of directed graph-based message passing
neural network [26]. It uses three operations (AGGREGATE, COMMUNICATE, UPDATE) to enhance message interaction between nodes and edges. In this study, the Interformer module is applied to the output of the COMMUNICATE function, so that the network can make full use of the interaction between the target and the drug. For iterations $k = 1, 2, \ldots, K$:

$$
\begin{aligned}
m^{k}(v) &= \mathrm{AGGREGATE}\left(h^{k-1}(e)\right) \\
h^{k}(v) &= \mathrm{COMMUNICATE}\left(m^{k}(v), h^{k-1}(v)\right) \\
h^{k}(v), A_a^{k} &= \mathrm{INTERFORMER}\left(h^{k}(v), A_a^{k-1}\right) \\
h^{k}(e) &= \mathrm{UPDATE}\left(h^{k}(v), h^{0}(e), h^{k-1}(e)\right)
\end{aligned}
$$

where $A_a^{k}$ is the target feature map of the $k$-th iteration. The last integration step is:

$$
\begin{aligned}
m &= \mathrm{AGGREGATE}\left(h^{L}(e)\right) \\
h &= \mathrm{COMMUNICATE}\left(m, h^{L}(v), x\right) \\
h, A_a^{0} &= \mathrm{INTERFORMER}\left(h^{k}(v), A_a^{k-1}\right)
\end{aligned}
$$

where $h$ and $A_a^{0}$ are the finally obtained drug molecular graph features and target graph features. The interaction between Inter-CMPNN and Interformer is shown in Fig. 5. Finally, the last hidden atom representations of the drug molecular graph and the feature vectors of the target graph are each averaged to obtain fixed-size vectors, which are fed into the prediction network to obtain the affinity value.
Fig. 5. Message interaction between Interformer and Inter-CMPNN
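The order in which the four operations are interleaved can be sketched as a control-flow skeleton. The operator bodies below are scalar placeholders invented purely so that the loop runs; only the call schedule reflects the iteration scheme described above, and the helper names and the `ops` dictionary are hypothetical.

```python
def inter_cmpnn(h_v, h_e, target, x, steps, ops):
    """Run `steps` message passing iterations followed by the final integration step."""
    h_e0 = h_e  # initial edge state, reused by every UPDATE call
    for _ in range(steps):
        m = ops["aggregate"](h_e)
        h_v = ops["communicate"](m, h_v)
        # Interformer exchanges information between drug and target states.
        h_v, target = ops["interformer"](h_v, target)
        h_e = ops["update"](h_v, h_e0, h_e)
    # Final integration step: same schedule without UPDATE, with atom features x.
    m = ops["aggregate"](h_e)
    h_v = ops["communicate"](m, h_v, x)
    h_v, target = ops["interformer"](h_v, target)
    return h_v, target


trace = []  # records the call order for inspection


def aggregate(h_e):
    trace.append("AGGREGATE")
    return h_e  # placeholder: pass edge state through


def communicate(m, h_v, x=0.0):
    trace.append("COMMUNICATE")
    return 0.5 * (m + h_v) + x  # placeholder mixing rule


def interformer(h_v, target):
    trace.append("INTERFORMER")
    return h_v + 0.1 * target, target + 0.1 * h_v  # placeholder exchange


def update(h_v, h_e0, h_e):
    trace.append("UPDATE")
    return h_v + h_e0  # placeholder edge update


ops = {"aggregate": aggregate, "communicate": communicate,
       "interformer": interformer, "update": update}
drug_state, target_state = inter_cmpnn(1.0, 1.0, 1.0, 0.0, steps=2, ops=ops)
```

The recorded trace shows that the target representation is refreshed once per iteration and once more in the final step, rather than being computed once up front.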
3 Experimental Results and Evaluation
3.1 Datasets
This paper uses two benchmark datasets for performance evaluation, Davis [27] and KIBA [6]. The Davis dataset contains drug-target pairs from the kinase protein family and the related inhibitors, together with their dissociation constant values $K_d$. The KIBA dataset is derived from biochemical assays of kinase inhibitors from different sources, which are combined into KIBA scores used for training and prediction. The numbers of proteins and drug molecules in the two datasets are shown in Table 1. For benchmarking, each dataset is split into six parts, one for testing and five for cross-validation training.

Table 1. Benchmark datasets

Dataset   Proteins   Compounds   Binding entities
Davis     442        68          30056
KIBA      229        2111        118254
According to the analysis in DeepDTA, the affinity values of the Davis dataset span a very wide range and are unevenly distributed over it. Therefore, the affinity values are transformed into logarithmic space with base 10. For the Davis dataset, the affinity value $K_d$ is converted into $pK_d$ using Eq. (11):

$$ pK_d = -\log_{10}\left( \frac{K_d}{10^{9}} \right) \tag{11} $$

In this paper, these two benchmark datasets are used for model tuning and optimization experiments and for comparison with other drug-target affinity prediction models evaluated on the same datasets, in order to demonstrate the feasibility of the proposed model.
3.2 Metric
The experiments use the concordance index (CI) [28] and the mean square error (MSE), which are also used by other drug-target affinity prediction methods. The CI measures how consistently the predicted values are ordered relative to the actual values: the larger the CI, the better the predicted ranking agrees with the actual ranking. It is defined as follows:

$$ CI = \frac{1}{Z} \sum_{d_x > d_y} h(b_x - b_y) \tag{12} $$
In the formula, $b_x$ is the predicted value for the larger affinity $d_x$, $b_y$ is the predicted value for the smaller affinity $d_y$, $Z$ is a normalization constant, and $h(x)$ is a step function defined as follows:

$$ h(x) = \begin{cases} 1, & x > 0 \\ 0.5, & x = 0 \\ 0, & x < 0 \end{cases} \tag{13} $$

The mean square error is a common indicator of the difference between the predicted values and the actual values: the smaller the MSE, the closer the predictions are to the true values. For $n$ samples, the MSE is the average of the squared differences between the predicted values $p_i$ ($i = 1, 2, \ldots, n$) and the true values $y_i$:

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2 \tag{14} $$
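The label transform of Eq. (11) and the two metrics of Eqs. (12) to (14) can be sketched in plain Python as follows. The sketch assumes that $K_d$ is given in nanomolar, which is what the factor $10^{9}$ suggests, and takes $Z$ to be the number of pairs with $d_x > d_y$. The function names are illustrative, and the quadratic pair loop is written for clarity rather than speed.

```python
import math


def to_pkd(kd_nm):
    """Eq. (11): pKd = -log10(Kd / 1e9), with Kd assumed to be in nM."""
    return -math.log10(kd_nm / 1e9)


def step(x):
    """Eq. (13): step function h(x)."""
    if x > 0:
        return 1.0
    return 0.5 if x == 0 else 0.0


def concordance_index(y_true, y_pred):
    """Eq. (12): fraction of comparable pairs whose predictions are correctly ordered."""
    total, z = 0.0, 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:  # pair with d_x > d_y
                z += 1
                total += step(y_pred[i] - y_pred[j])
    if z == 0:
        raise ValueError("CI is undefined when all true affinities are equal")
    return total / z


def mse(y_true, y_pred):
    """Eq. (14): mean of squared prediction errors."""
    return sum((p - y) ** 2 for p, y in zip(y_pred, y_true)) / len(y_true)


affinities = [1.0, 2.0, 3.0]
predictions = [0.1, 0.3, 0.2]
ci_value = concordance_index(affinities, predictions)
mse_value = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

In this toy example two of the three comparable pairs are ordered correctly, so the CI is 2/3, while a perfectly ordered prediction would give a CI of 1 regardless of its MSE.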
3.3 Setting of the Hyperparameters
The SAIG-DTA model proposed in this paper contains many hyperparameters. The best hyperparameters found through experiments on the Davis dataset are shown in Table 2. Most of them are taken from the baseline model, while the two key factors affecting SAGPool performance, the pooling ratio and the scoring method, are examined in detail using five-fold cross-validation. This section introduces these two hyperparameters and the related experimental results.

Table 2. Hyperparameter settings in SAIG-DTA

Hyperparameter      Value
Epoch               2000
Batch size          256
Optimizer           Adam
Learning rate       0.001
Dropout value       0.1
SAGPooling ratio    0.1, 0.2, 0.3, 0.4, …, 0.9, 1.0
SAGPooling method   GNN, GCN, GAT
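The settings in Table 2 can be written down as a small configuration sketch. The dictionary keys are illustrative names, not taken from the authors' code. The paper does not state whether the two SAGPool hyperparameters were searched jointly or one at a time, so enumerating the full grid below is an assumption.

```python
from itertools import product

# Fixed settings from Table 2 (key names are hypothetical).
BASE_CONFIG = {
    "epochs": 2000,
    "batch_size": 256,
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "dropout": 0.1,
}

# Candidate values for the two SAGPool hyperparameters.
POOLING_RATIOS = [round(0.1 * i, 1) for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0
SCORING_METHODS = ["GNN", "GCN", "GAT"]


def search_space():
    """Yield one full configuration per (pooling ratio, scoring method) pair."""
    for ratio, method in product(POOLING_RATIOS, SCORING_METHODS):
        yield {**BASE_CONFIG, "pooling_ratio": ratio, "scoring_method": method}


configs = list(search_space())
```

Each configuration would then be scored by five-fold cross-validation, with MSE as the main selection criterion.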
Performances of Various Pooling Ratios
The pooling ratio of SAGPool determines the percentage of nodes that are retained, which is a key factor in the model. To determine the optimal pooling ratio, values from 0.1 to 1.0 were evaluated on the Davis dataset, as shown in Fig. 6.
Fig. 6. Fivefold cross-validation results when using different pooling ratios.
The experimental results show that the MSE follows an overall downward trend as the pooling ratio increases, reaching its lowest value of 0.217 at a pooling ratio of 1.0 (i.e., when all nodes are retained). When the pooling ratio is greater than 0.4, the other indicator, CI, oscillates between 0.892 and 0.895. Based on the main indicator MSE, the optimal pooling ratio for this architecture is therefore set to 1.0.
Performances of Various Attention Scoring Methods
The self-attention graph pooling layer assigns each node an attention score. In this section, three GNN variants, namely GNN, GCN, and GAT, are compared as the scoring method using five-fold cross-validation; the results are shown in Fig. 7:
Fig. 7. Fivefold cross-validation results when using different scoring methods.
It can be seen from the figure that the MSE of GNN is 0.217, which is the lowest among the three scoring methods; the differences in the CI values are small. These results indicate that GNN is the most effective of the three scoring methods.
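The five-fold cross-validation protocol behind these hyperparameter experiments can be sketched with the standard library alone. The splitting scheme (shuffle, then deal indices into five folds) and the `evaluate` callback are assumptions made for illustration; the paper does not describe how its folds were generated.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal shuffled indices into k folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average a user-supplied score over the k train/validation splits."""
    scores = [evaluate(train, val) for train, val in k_fold_splits(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(10, k=5, seed=0))
# Hypothetical evaluator: here it simply reports the validation fold size.
mean_score = cross_validate(10, lambda train, val: float(len(val)))
```

In a real experiment `evaluate` would train the model on the training indices and return the validation MSE or CI, and the averaged score would be compared across pooling ratios or scoring methods.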
3.4 Comparisons with Other Baseline Models
The optimal SAIG-DTA hyperparameters obtained above are used for the model evaluation experiments. During tuning, five-fold cross-validation is used: the benchmark training set is shuffled and randomly divided into five folds, four of which are used for training and the remaining one for validation. The model is trained on the four training folds and validated on the validation fold, and this process is repeated five times. The average results are recorded to evaluate model performance. After all hyperparameters have been determined in this way, the model is trained on all five folds and tested on the benchmark test set. In this section, the SAIG-DTA model is compared with traditional machine learning and deep learning baselines.
Performances on the KIBA Dataset
The results on the KIBA dataset are shown in Table 3. Compared with recent DTA prediction methods, the proposed model shows a modest performance improvement. Specifically, SAIG-DTA achieves a CI of 0.910 and an MSE of 0.129. The CI is better than that of all baseline models; the MSE is slightly worse than that of DGraphDTA (0.126) but better than that of the other baseline models. These results demonstrate the effectiveness of the model in DTA prediction.

Table 3. Performances on the KIBA dataset

Method           Proteins and compounds   CI      MSE
KronRLS          S-W & PubChem Sim        0.782   0.411
SimBoost         S-W & PubChem Sim        0.836   0.222
DeepDTA          CNN & CNN                0.863   0.194
WideDTA          PS+PDM & LS+LMCS         0.875   0.179
DeepGS           CNN & Graph              0.860   0.193
GraphDTA (GCN)   CNN & Graph              0.889   0.139
SAG-DTA          CNN & Graph              0.893   0.131
DGraphDTA        GNN & GNN                0.904   0.126
SAIG-DTA         GNN & GNN                0.910   0.129
Performances on the Davis Dataset
Likewise, the overall performance of all models on the Davis dataset, measured by MSE and CI, is shown in Table 4. Compared with recent DTA prediction methods, the proposed model shows a modest performance improvement. SAIG-DTA achieves a CI of 0.909 and an MSE of 0.199, and both values are better than those of all baseline models. These results demonstrate the effectiveness of the model in DTA prediction.

Table 4. Performances on the Davis dataset

Method           Proteins and compounds   CI      MSE
KronRLS          S-W & PubChem Sim        0.871   0.379
SimBoost         S-W & PubChem Sim        0.872   0.282
DeepDTA          CNN & CNN                0.878   0.261
WideDTA          PS+PDM & LS+LMCS         0.886   0.262
DeepGS           CNN & Graph              0.880   0.252
GraphDTA (GIN)   CNN & Graph              0.893   0.229
SAG-DTA          CNN & Graph              0.901   0.212
DGraphDTA        GNN & GNN                0.904   0.202
SAIG-DTA         GNN & GNN                0.909   0.199
4 Conclusion
DTA prediction is a key step in virtual screening for computer-aided drug design, and an accurate DTA algorithm saves both the experimental cost and the time of drug screening. In this paper, drug molecular graphs and protein target graphs are constructed to obtain features, more informative protein features are obtained through self-attention graph pooling, and the information of drugs and targets is fused through the interaction module MINN, which iteratively extracts more closely related information and thereby improves the accuracy of DTA prediction. Evaluations on benchmark datasets show that SAIG-DTA outperforms existing prediction methods on most metrics, which demonstrates the effectiveness of the model in predicting drug-target affinity.
Acknowledgment. This work is supported by the National Natural Science Foundation of China (No. 61972299).
References
1. DiMasi, J.A., Grabowski, H.G., Hansen, R.W.: Innovation in the pharmaceutical industry: new estimates of R&D costs. J. Health Econ. 47, 20–33 (2016)
2. Takebe, T., Imai, R., Ono, S.: The current status of drug discovery and development as originated in United States academia: the influence of industrial and academic collaboration on drug discovery and development. Clin. Transl. Sci. 11(6), 597–606 (2018)
3. Lin, X., Li, X., Lin, X.: A review on applications of computational methods in drug screening and design. Molecules 25(6), 1375 (2020)
4. Ding, Y., Tang, J., Guo, F.: Identification of protein–protein interactions via a novel matrix-based sequence representation model with amino acid contact information. Int. J. Mol. Sci. 17, 1623 (2016)
5. Cichonska, A., et al.: Computational-experimental approach to drug-target interaction mapping: a case study on kinase inhibitors. PLoS Comput. Biol. 13, e1005678 (2017)
6. He, T., Heidemeyer, M., Ban, F., Cherkasov, A., Ester, M.: SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. J. Cheminform. 9, 1–14 (2017)
7. Abbasi, K., Razzaghi, P., Poso, A., Ghanbari-Ara, S., Masoudi-Nejad, A.: Deep learning in drug target interaction prediction: current and future perspective. Curr. Med. Chem. 28, 2100–2113 (2020)
8. Wang, S., et al.: MCN-CPI: multiscale convolutional network for compound-protein interaction prediction. Biomolecules 11, 1119 (2021)
9. Öztürk, H., Özgür, A., Ozkirimli, E.: DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 34, i821–i829 (2018)
10. Öztürk, H., Ozkirimli, E., Özgür, A.: WideDTA: prediction of drug-target binding affinity. arXiv preprint arXiv:1902.04166 (2019)
11. Wan, F., et al.: DeepCPI: a deep learning-based framework for large-scale in silico drug screening. Genom. Proteom. Bioinform. 17, 478–495 (2019)
12. Zhao, T., Hu, Y., Valsdottir, L.R., Zang, T., Peng, J.: Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief. Bioinform. 22, 2141–2450 (2020)
13. Lim, J., Ryu, S., Park, K., Choe, Y.J., Ham, J., Kim, W.Y.: Predicting drug–target interaction using a novel graph neural network with 3D structure-embedded graph representation. J. Chem. Inf. Model. 59, 3981–3988 (2019)
14. Chen, M., Wei, Z., Huang, Z., et al.: Simple and deep graph convolutional networks. In: International Conference on Machine Learning. PMLR 2020, pp. 1725–1735 (2020)
15. Veličković, P., Cucurull, G., Casanova, A., et al.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
16. Nguyen, T., Le, H., Venkatesh, S.: GraphDTA: prediction of drug–target binding affinity using graph convolutional networks. BioRxiv 2019, 684662 (2019)
17. Jiang, M., et al.: Drug–target affinity prediction using graph neural network and contact maps. RSC Adv. 10, 20701–20712 (2020). https://doi.org/10.1039/D0RA02297G
18. Qi, W., Peng, Z., Anishchenko, I., Cong, Q., Baker, D., Yang, J.: Protein contact prediction using metagenome sequence data and residual neural networks. Bioinformatics 36(1), 41–48 (2020)
19. Lin, X.: DeepGS: deep representation learning of graphs and sequences for drug-target binding affinity prediction. arXiv preprint arXiv:2003.13902 (2020)
20. Zhao, Q., Xiao, F., Yang, M., Li, Y., Wang, J.: Attention DTA: prediction of drug-target binding affinity using attention model. In: Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019, pp. 64–69 (2019)
21. Xiong, Z., et al.: Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63, 8749–8760 (2019)
22. Lee, J., Lee, I., Kang, J.: Self-attention graph pooling. In: Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9 June 2019, pp. 3734–3743 (2019)
23. Li, F., et al.: Effective drug-target interaction prediction with mutual interaction neural network. Bioinformatics (Oxford, England) 38(14), 3582–3589 (2022). https://doi.org/10.1093/bioinformatics/btac3772
24. Vaswani, A., et al.: Attention is all you need. In: Proceedings of 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA (2017)
25. Chen, L., et al.: TransformerCPI: improving compound-protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 36, 4406–4414 (2020)
26. Song, Y., et al.: Communicative representation learning on attributed molecular graphs. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), Yokohama, Japan, pp. 2831–2838 (2020)
27. Pahikkala, T., et al.: Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325–337 (2015). https://doi.org/10.1093/bib/bbu010
28. Mithat, G., Glenn, H.: Concordance probability and discriminatory power in proportional hazards regression. Biometrika 92(4), 965–970 (2005)
DU-DANet: Efficient 3D Automatic Brain Tumor Segmentation Based on Dual Attention
Zhenhua Cai1,2,3, Xiaoli Lin1,2,3(B), Xiaolong Zhang1,2,3, and Jing Hu1,2,3
1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
{15391525083,linxiaoli,xiaolong.zhang,Hujing}@wust.edu.cn
2 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
3 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan 430065, Hubei, China
Abstract. Automatic segmentation of brain tumors using multimodal magnetic resonance imaging (MRI) has significant potential for clinical disease assessment. However, how to obtain more detailed features of brain tumors from brain images to further improve the accuracy of segmentation is still a critical issue. This paper proposes a new 3D multimodal brain tumor segmentation model with dual attention (DU-DANet), which uses two main modules: down-sampling with 1D attention convolutional (1D-DAC) and up-sampling with spatial and channel attention (USCA). 1D-DAC can preserve more low-level brain tumor features while reducing the resolution of brain images. USCA can fully fuse low-level feature maps with high-level ones to extract richer contextual information. In addition, USCA combines attention gates (AGs) and leverages channel relationships to effectively suppress irrelevant information, resulting in improved extraction of brain tumor features. Experimental results demonstrate that DU-DANet outperforms other state-of-the-art methods in both whole tumor and tumor core segmentation. Overall, the proposed DU-DANet model with USCA and 1D-DAC modules provides an efficient and accurate solution for automatic brain tumor segmentation, which has potential for practical clinical applications.
Keywords: automatic segmentation · brain tumor · dual attention · 1D attention convolutional
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 791–802, 2023. https://doi.org/10.1007/978-981-99-4749-2_67
1 Introduction
Automated and accurate segmentation is essential for quantitatively assessing brain tumor progression and planning preoperative treatments [1]. The development of automated segmentation models has become increasingly popular, which provides a more reliable reference for studying tumor progression. However, the shape, appearance and location of tumors vary from individual to individual. It is difficult to detect the exact location and boundary of a tumor by human vision. In addition, manual annotation may
miss subtle symptoms or introduce the subjective bias of the annotator, resulting in imprecise boundary delineation [2, 3]. Therefore, accurate segmentation of brain tumors is particularly important for helping physicians assess and treat brain tumors. Currently, the main imaging methods used to examine brain tumors in clinical practice are positron emission tomography (PET), computed tomography (CT), and magnetic resonance imaging (MRI) [4]. Among them, MRI is a popular non-invasive technique that produces many different tissue contrasts across its imaging modalities and is preferred by medical experts for the diagnosis of brain tumors [5]. The measurement of brain tumor-induced tissue changes relies on the complementary biological information provided by multiple MRI modalities, which comprise four sequences: fluid-attenuated inversion recovery (FLAIR), contrast-enhanced T1-weighted (T1c), T1-weighted (T1), and T2-weighted (T2) [6]. Brain tumor segmentation methods fall into two main categories: traditional machine learning methods and deep learning methods. The former mainly include thresholding methods, region-based segmentation methods, and boundary-based segmentation methods. However, most of these methods are semi-automatic, insufficiently stable, time-consuming, and laborious. More recently, the emergence of deep convolutional neural networks has driven the development of medical image segmentation. Researchers have adapted fully convolutional networks (FCNs) [7] to the characteristics of medical images and proposed the classical U-Net [8] and 3D U-Net [9] models. Both connect low-level features to high-level features through skip connections. The difference is that the 3D structure handles 3D data more easily and can focus on details and local features, which makes it suitable for the segmentation of small targets.
The 3D encoder-decoder architecture is also widely used in modern semantic segmentation models [10]. To focus on the regions that matter, a popular approach is to add an attention mechanism between the encoder and the decoder. Hu et al. [11] introduced the SENet channel attention mechanism into CNNs, which learns to use global information to emphasize important features but also adds some complexity. Wang et al. [12] proposed an efficient channel attention mechanism that captures inter-channel dependencies more efficiently and is relatively lightweight. Oktay et al. [13] proposed an attention gate for medical imaging that focuses spatially on the regions of interest and suppresses irrelevant image information. Many models have also been applied to brain tumor segmentation. Hua et al. [14] proposed a new cascade V-Net approach to segment brain tumors. Isensee et al. [15] used 3D-UNet and optimized the training process to achieve good results in the BRATS2018 Challenge. Domenico et al. [16] proposed a 3D volume-to-volume generative adversarial network to segment brain tumors. Ding et al. [17] proposed a region-aware fusion network, considering that different modalities have different sensitivities to different brain tumor regions. Wang et al. [18] introduced the Transformer to extract contextual information and spatial features. Zhou et al. [19] incorporated a channel and spatial attention mechanism to enhance the extraction of low-level features. Despite the results of existing models in the field of computer vision, segmentation of brain tumors from MRI images still faces several challenges. (1) It is difficult to adequately
capture boundary detail features in multimodal brain tumor images. (2) Reducing the resolution of brain tumor images during down-sampling may lead to a loss of local information. (3) Existing models are still not effective enough at fusing the high-level and low-level semantic features of brain tumor images, and they ignore the dependencies between image channels when considering contextual information. To address these issues, we propose a new brain tumor segmentation model called DU-DANet, which combines down-sampling with 1D attention convolutional (1D-DAC) modules and up-sampling with spatial and channel attention (USCA) modules. The 1D-DAC module extracts detailed features from multimodal brain tumor images and can efficiently capture boundary details that are easily overlooked. The USCA module focuses on pixel-to-pixel relationships and fuses low-level features from the encoder to obtain more detailed contextual information. In addition, to make training more stable and convergence faster, the brain tumor images are pre-processed and augmented in this paper. The main contributions of this paper are summarized as follows:
• A down-sampling with 1D attention convolutional (1D-DAC) module is designed for extracting more features. It efficiently extracts intra-slice features and effectively prevents the loss of low-level features.
• An up-sampling with spatial and channel attention (USCA) module is designed that fuses contextual information to obtain multi-scale and multi-level brain tumor features. The module fully extracts the spatial boundary information between multimodal brain tumor slices and the deep semantic information between brain tumor channels.
• Pre-processing and data augmentation techniques are designed for the brain dataset to increase the robustness of the model and avoid over-fitting. This processing improves the performance and accuracy of the model while capturing more information about brain tumors.
2 Method
2.1 Task Definition
This paper proposes a new 3D multimodal brain tumor image segmentation model, DU-DANet, which can extract more details of brain tumor images and improve the accuracy of brain tumor segmentation. To measure the effectiveness of the proposed model and of its two key modules, 1D-DAC and USCA, ablation experiments, comparison experiments and generalization experiments are conducted in this paper. Multimodal brain tumor segmentation aims to segment three brain tumor regions (whole tumor, tumor core, and enhancing tumor) from a combination of Flair, T1c, T1, and T2 multimodal MRI images. The whole tumor (WT) consists of three tumor subregions: necrotic and non-enhancing tumor (NCR/NET), peritumoral edema (ED), and enhancing tumor (ET). The tumor core (TC) consists of NCR/NET and ET, but does not include the peritumoral edema. Among them, ET represents the most active and malignant part of the tumor. NCR/NET, ED and ET are indicated in red, green and yellow in Fig. 1, respectively.
Fig. 1. Visualization of brain tumor images. From left to right: Flair, T1c, T1, and T2 modality images, and labels of three patients.
2.2 3D Brain Segmentation Model
Figure 2 shows the overall framework of the proposed DU-DANet, which consists mainly of an encoder and a decoder. The encoder progressively reduces the resolution of the feature map to obtain global information. The decoder gradually recovers the detail and spatial dimensions of the segmented feature map.
Fig. 2. The framework of our DU-DANet model. The proposed model contains an encoder component and a decoder component. (At each iteration stage, the down-sampling is optimized using the 1D-DAC module. The up-sampling is optimized using the USCA module.)
The encoder mainly consists of a double convolutional layer that controls the input and four down-sampling with 1D attention convolutional (1D-DAC) modules. The double convolutional layer expands the number of channels of the multimodal brain images. Each of the four input modalities corresponds to one channel, for a total of four channels, and the double convolutional layer converts them into 24 channels to obtain richer detailed features. 1D-DAC is used to enlarge the receptive field of the model while retaining more low-level features.
The decoder consists mainly of a convolutional layer that controls the output and four up-sampling with spatial and channel attention (USCA) modules that recover the low-resolution feature maps produced by the encoder. The convolutional layer restores the number of channels of the multimodal brain tumor image. The USCA module helps to combine low-level features and to focus better on spatial information.
Down-Sampling with 1D Attention Convolutional Module. The purpose of down-sampling is to extract more detailed features of the image, which can effectively reduce the complexity of the model and prevent overfitting. However, repeated down-sampling loses some low-level features. The proposed 1D-DAC module uses a 1D attention convolutional module [14] during down-sampling, as shown in Fig. 3. The module generates channel attention by 1D convolution to capture the relationship between image channels, and it effectively prevents the loss of features in multimodal brain tumor images.
Fig. 3. Diagram of a down-sampling with 1D attention convolutional module. (The blue dotted box is the 1D convolutional attention module.)
The process can be summarized as follows. First, global pooling is applied to the brain tumor feature maps of size H × D × W, so that each 3D feature map is compressed to 1 × 1 × 1. Second, the pooled features are reduced in dimension and transposed into a multi-channel 1D feature map, which reduces the number of model parameters and helps prevent overfitting. Then, the compressed feature map is convolved by a 1D convolution for channel feature learning. Finally, the channel attention weights $\omega_c$ are obtained by the Sigmoid activation function:

$$\omega_c = \sigma\left(\mathrm{C1D}_k(c)\right) \tag{1}$$

where $\sigma$ is the Sigmoid activation function, C1D is the 1D convolution, $k$ is the size of the convolution kernel, and $c$ is the compressed channel feature map.
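The steps leading to Eq. (1) can be illustrated with a small, dependency-free sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of average pooling for the global pooling step, the zero padding at the channel borders, and the kernel values are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function used to turn convolution outputs into weights."""
    return 1.0 / (1.0 + math.exp(-z))


def global_average_pool(volume):
    """Compress one channel (nested H x D x W lists) to a single value."""
    voxels = [v for plane in volume for row in plane for v in row]
    return sum(voxels) / len(voxels)


def channel_attention(features, kernel):
    """Return one attention weight per channel, in the spirit of Eq. (1).

    features: list of C channels, each a nested H x D x W list.
    kernel:   list of k 1D-convolution weights (k odd; hypothetical values).
    """
    # Step 1: global pooling, one scalar per channel (assumed to be an average).
    pooled = [global_average_pool(channel) for channel in features]
    # Step 2: 1D convolution across the channel axis (zero padding assumed).
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    weights = []
    for c in range(len(pooled)):
        conv = sum(kernel[j] * padded[c + j] for j in range(len(kernel)))
        # Step 3: Sigmoid gives the attention weight of channel c.
        weights.append(sigmoid(conv))
    return weights


# Four constant channels of size 2 x 2 x 2 with values 0, 1, 2, 3.
example = [[[[float(c)] * 2] * 2] * 2 for c in range(4)]
attention_weights = channel_attention(example, [1.0, 1.0, 1.0])
```

With a kernel of size k = 3, each weight depends only on a channel and its two neighbours, which is why the extra parameter count stays small.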
The value of k has little effect on the prediction results, but a larger k increases the complexity of the model. To keep the model complexity low, k is set to 3 in this paper.
Up-Sampling with Spatial and Channel Attention Module. To obtain multi-scale and multi-level information about multimodal brain tumors, we propose an up-sampling module with dual attention to fuse contextual features, which combines 1D convolutional channel attention and spatial attention. Figure 4 shows the USCA module. This module can efficiently acquire low-resolution features from the encoder and high-resolution features from the decoder. In addition, the module can increase the weights of important regions in the images, promote dependencies between channels, and reduce the weight the model assigns to irrelevant regions such as the background. USCA fuses high-level and low-level features, which helps the segmentation model to focus more on both the global information and the details of the brain tumor.
Fig. 4. Up-sampling module with dual attention module. (The green dashed line is the attention gate.)
The main idea of the USCA module is to match the size of the decoder’s features x with that of the encoder’s features g through interpolation and up-sampling. The two are then summed to obtain hybrid features, which are processed in the following two ways: (1) 1D convolutional attention generates an attention feature map from the hybrid features; this layer adds only a small number of parameters. (2) The hybrid features are passed through the PReLU activation and then convolved in 3D to compress the channel dimension, giving q:

$$q = \psi^{T}\,\sigma_p\left(W_x^{T}x + W_g^{T}g + b_g\right) + b_\psi \tag{2}$$

where $\sigma_p$ is the PReLU activation function, $W_x$, $W_g$, and $\psi$ are all 3D convolutions, $T$ denotes the transpose, and $b_g$ and $b_\psi$ are the bias terms of the convolutions. Then, the contextual attention weights $\omega_s$ are obtained by the Sigmoid activation function and resampling, and are combined with the upper-level features to generate the contextual feature map $\beta$:

$$\beta = \sigma(q) \times x \tag{3}$$
where $\sigma$ is the Sigmoid activation function. The product of the contextual feature map $\beta$ and the 1D convolutional attention feature map $\delta$ is then concatenated with the upper input feature map $x$ to obtain the spatial-channel dual-fused feature map $S$:

$$S = \mathrm{Cat}(\beta \times \delta,\ x) \tag{4}$$
where Cat denotes concatenation. The fused feature map S then passes through a double convolutional layer containing 1D convolutional attention to obtain more high-level features.
2.3 Loss Function
To ensure that the trainable parameters of each layer in the deep neural network are optimized sufficiently, this study proposes a new loss function that combines two common loss functions, Dice Loss [20] and Cross Entropy Loss [21]. Dice Loss helps the model handle class imbalance by measuring the overlap between the predicted results and the true labels, while Cross Entropy Loss quantifies the difference between the predicted and true labels. Combining the two improves the model’s overall performance by accounting for both its segmentation and classification abilities. The loss L is defined as follows:

$$L = -\sum_{k \in K}\left[\frac{2\sum_{i \in I} u_i^k v_i^k}{|K|\left(\sum_{i \in I} u_i^k + \sum_{i \in I} v_i^k\right)} + \sum_{i \in I} v_i^k \log u_i^k\right] \tag{5}$$

where $u$ is the prediction after applying softmax, $v$ is the ground truth, $i \in I$ indexes the pixels in the training patch, and $k \in K$ is the class.
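A combined Dice and cross-entropy loss of the kind described for Eq. (5) can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name, the data layout (one list of voxel values per class), and the small constant `eps` that guards against log(0) and empty classes are assumptions introduced here.

```python
import math


def dice_ce_loss(u, v, eps=1e-8):
    """Combined Dice + cross-entropy loss in the spirit of Eq. (5).

    u: softmax predictions, u[k][i] for class k and voxel i.
    v: one-hot ground truth with the same layout.
    eps: assumed stabiliser, not part of the source formula.
    """
    num_classes = len(u)
    dice_term = 0.0
    ce_term = 0.0
    for k in range(num_classes):
        overlap = sum(p * t for p, t in zip(u[k], v[k]))
        total = sum(u[k]) + sum(v[k])
        # Soft Dice contribution of class k, averaged over the classes.
        dice_term += 2.0 * overlap / (num_classes * (total + eps))
        # Cross-entropy contribution: v * log(u), non-zero only where v = 1.
        ce_term += sum(t * math.log(max(p, eps)) for p, t in zip(u[k], v[k]))
    # The leading minus sign rewards overlap and penalises low probabilities.
    return -(dice_term + ce_term)


# Two classes, four voxels: a perfect and an uninformative prediction.
truth = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
perfect = dice_ce_loss(truth, truth)
uniform = dice_ce_loss([[0.5] * 4, [0.5] * 4], truth)
```

In this sketch a perfect prediction drives the loss toward its minimum of about -1, while the uninformative prediction is penalised by both terms.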
3 Experiment
3.1 Dataset
BRATS2021 (http://braintumorsegmentation.org/): It is composed of mpMRI scans from 2000 patients and contains 1251 training subjects with annotations. In this study, the 1251 training subjects are used as the main dataset and are randomly divided into training, validation and test sets in a ratio of 8:1:1.
BRATS 2020 (https://www.med.upenn.edu/cbica/brats2020/data.html): It consists of mpMRI scans from 660 patients. In this study, the 369 training subjects with annotations were divided following the same method.
BRATS 2018 (https://www.med.upenn.edu/cbica/brats2018/data.html): It includes mpMRI scans from 660 patients. It comprises 285 training subjects with annotations, and this paper uses all of them as the test set.
Subjects in all datasets contain four different MRI modalities, namely Flair, T1c, T1 and T2. These modalities were strictly aligned, resampled to 1 × 1 × 1 mm isotropic resolution, and skull-stripped. In this study, ablation experiments are performed on the BRATS2021 dataset, and the model is compared with other state-of-the-art methods on the BRATS2020 and BRATS2018 datasets.
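The random 8:1:1 division of the annotated subjects can be sketched as follows. The function name, the fixed seed, and the choice to assign the rounding remainder to the test set are assumptions made for illustration; the paper does not state how the remainder is handled.

```python
import random


def split_subjects(subject_ids, ratios=(8, 1, 1), seed=0):
    """Shuffle subject IDs and split them into train/validation/test lists."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # assumption: the remainder goes to the test set
    return train, val, test


# 1251 annotated subjects, identified here by hypothetical integer IDs.
train_ids, val_ids, test_ids = split_subjects(range(1251))
```

Under these assumptions the 1251 subjects are divided into 1000 training, 125 validation and 126 test cases.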
3.2 Image Pre-processing and Augmentation
To improve the contrast of brain tumor images and extract more details, the brain tumor images are pre-processed and augmented before training the model. The main processing steps are described below, and the results are shown in Fig. 5.
Fig. 5. Comparison of four modality images of a brain tumor before and after pre-processing and enhancement.
Image Pre-processing. For the convenience of model training, this paper fuses the four modalities of brain images and the segmentation labels of each case into a 4D image of size 4 × 240 × 240 × 155. To highlight the tumor area, the image intensities are standardized and only the non-background regions are normalized.
Image Augmentation. To increase the robustness of the model and prevent overfitting, this paper also performs data augmentation on the brain images. (1) To make the model more robust to noise, Gaussian noise is added. (2) To reduce the background range and black edges of the images, random and center cropping are performed. The brain image size is converted from 240 × 240 × 155 to 160 × 160 × 128, which also reduces the training time and memory requirement. (3) To strengthen the focus on brain tumors and obtain more information about them, the contrast of the image channels and the brightness of the image are appropriately transformed. (4) To increase the diversity of the brain tumor data, the brain images are randomly flipped. However, to avoid resampling, the images are rotated only by 90°, 180° and 270°, and these rotations do not affect the acquisition of features.
3.3 Implementation Details and Evaluation Indicators
Training the DU-DANet model takes approximately 12 h on one NVIDIA GeForce RTX 3090 GPU, and all models are trained from scratch. During model training, a batch size of 2 is used. The SGD optimizer is chosen with an initial learning rate of 0 and a momentum of 0.9. To prevent overfitting, a weight decay of 5e-4 is set. Cosine learning rate decay is employed in this study with a maximum learning rate of 0.004 and
a minimum learning rate of 0.002. Training runs for at most 60 epochs, with a warm-up period of 10 epochs. This paper primarily measures the segmentation performance of the proposed method using the Dice similarity coefficient (Dice) [22] for three tumor categories: whole tumor (WT), tumor core (TC), and enhancing tumor (ET), defined as:

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \tag{6}$$
where A and B represent the sets of points contained in the two contour regions, which respectively represent the predicted and true labels of the segmentation task.
3.4 Comparisons with the State-of-the-Art
To demonstrate the segmentation performance of DU-DANet, comparative experiments were conducted on the BRATS2020 dataset, as shown in Table 1. The proposed model, DU-DANet, outperforms the state-of-the-art methods Vox2Vox [16] and TransBTS [18] in all respects except the Dice score of the enhancing tumor (ET), which is 1.6% lower than theirs. Compared with the second-best model, TransBTS, the proposed model improves the average Dice scores for whole tumor (WT) and tumor core (TC) by 0.5% and 4.7%, respectively.
Table 1. Comparison of brain tumor segmentation performance on BRATS 2020 test dataset.
Model               ET Dice (%)   WT Dice (%)   TC Dice (%)
RFNet [17]          61.5          87.0          78.2
Vox2Vox [16]        78.7          87.2          81.1
TransBTS [18]       78.7          90.1          81.7
DU-DANet (ours)     77.1          90.6          86.4
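The Dice values reported in this section follow Eq. (6). A minimal sketch of that computation on sets of voxel coordinates is given below; the function name and the convention of returning 1.0 when both masks are empty are assumptions, since the paper does not discuss the empty case.

```python
def dice_score(pred, truth):
    """Dice coefficient of Eq. (6) for two collections of voxel coordinates."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


# Hypothetical masks: three predicted voxels, three true voxels, two shared.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
reference = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
score = dice_score(predicted, reference)  # 2 * 2 / (3 + 3)
```

In practice the score is computed separately for the WT, TC and ET regions and reported as a percentage.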
Table 2. Comparison of brain tumor segmentation performance on BRATS 2018 test dataset.
Model                 ET Dice (%)   WT Dice (%)   TC Dice (%)
RFNet [17]            57.1          76.5          85.7
Hua et al. [14]       72.2          80.4          86.4
Isensee et al. [15]   77.9          80.6          87.8
DU-DANet (ours)       73.7          84.1          91.1
In addition, to demonstrate the generalization ability of the proposed DU-DANet, the model was also tested on the BRATS2018 dataset. The results in Table 2 indicate
that the proposed model achieves significantly higher Dice scores for the WT and TC classes than the second-best model [15], with improvements of 3.5% and 2.3%, respectively. Although DU-DANet obtained a Dice score that was 4.2% lower than the best model in ET segmentation, it still outperformed the third-best model [14] by 1.5%, demonstrating the effectiveness of DU-DANet in accurately segmenting the ET region, which differs only subtly from the surrounding normal tissue.
3.5 Ablation Studies
In this section, ablation studies are performed on the two key modules of the proposed DU-DANet, 1D-DAC and USCA, to analyze their effectiveness. The variant w/o 1D-DAC & USCA corresponds to the 3D structure of AttUnet [13]. The importance of the two modules to the baseline model can be seen in Table 3, especially that of the USCA module.
Table 3. Ablation study of DU-DANet, where w/o means removing the corresponding modeling part in DU-DANet.
Model                 ET Dice (%)   WT Dice (%)   TC Dice (%)
Full model            83.7          91.6          89.3
w/o 1D-DAC            82.3          91.3          87.6
w/o USCA              82.0          91.2          87.3
w/o 1D-DAC & USCA     81.5          90.2          85.7
Fig. 6. Visualization of the segmentation results of the ablation study. (2D images on the left and 3D on the right.)
The segmentation results are visualized in Fig. 6. In the 2D results, the w/o 1D-DAC & USCA variant segments the whole tumor region poorly compared to the original images. The results of the w/o USCA variant are significantly better in the whole tumor
region, but part of the whole tumor region is still misclassified as normal brain tissue (as indicated by the red circle). The w/o 1D-DAC variant produces a segmentation that is closer to the ground truth. However, some of the whole tumor region (as indicated by the green circle) is misclassified as non-enhanced tumor tissue (as indicated by the blue region). In contrast, the image segmented by the full DU-DANet is very similar to the original image. The 3D brain tumor images in Fig. 6 also show that DU-DANet segments brain tumors better than the other models, and its segmented images are the most similar to the original brain tumor.
4 Conclusion
This paper proposes a 3D automatic segmentation model, DU-DANet, to automatically and accurately segment brain tumors. The model combines down-sampling with 1D attention convolutional (1D-DAC) modules and up-sampling with spatial and channel attention (USCA) modules. The method can effectively segment the enhancing tumor, whole tumor and tumor core regions of brain tumors. It can also effectively extract the intra-slice features and the multi-scale features of multimodal brain tumor images. 1D-DAC captures the low-level features in multimodal brain tumor images by making full use of the relationships between their channels. USCA is based on 1D convolutional attention and attention gates, which can effectively extract features from the region of interest. Experimental results show that DU-DANet has good generalization and performance for brain tumor segmentation of MRI images.
Acknowledgements. The authors thank the members of the Machine Learning and Artificial Intelligence Laboratory, School of Computer Science and Technology, Wuhan University of Science and Technology, for their helpful discussions in seminars. This work was supported in part by the National Natural Science Foundation of China (No. 61972299, 62071456).
References
1. Chen, C., et al.: Robust multi-modal brain tumor segmentation via feature disentanglement and gated fusion. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Proceedings, pp. 13–17. Springer, China (2019). https://doi.org/10.1007/978-3-030-32248-9_50
2. Pei, L., Reza, S.M.S., Li, W., Davatzikos, C., Iftekharuddin, K.M.: Improved brain tumor segmentation by utilizing tumor growth model in longitudinal brain MRI. In: Medical Imaging 2017: Image Processing, vol. 10134, pp. 666–674. SPIE (2017)
3. Kamnitsas, K., et al.: Ensembles of multiple models and architectures for robust brain tumour segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Menze, B., Reyes, M. (eds.) MICCAI 2017. LNCS, vol. 10670, pp. 450–462. Springer, Cham (2018)
4. Havaei, M., Davy, A., Warde-Farley, D., Biard, A., Courville, A., Bengio, Y.: Brain tumor segmentation with deep neural networks. Med. Image Anal. 35, 18–31 (2017)
5. Bakas, S., et al.: Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4(1), 1–13 (2017)
6. Hussain, S., Anwar, S., Majid, M.: Segmentation of glioma tumors in brain using deep convolutional neural network. Neurocomputing 282(1), 248–261 (2018)
7. Ben-Cohen, A., Klang, E., Kerpel, A., Konen, E., Amitai, M.M., Greenspan, H.: Fully convolutional network and sparsity-based dictionary learning for liver lesion detection in CT examinations. Neurocomputing 275, 1585–1594 (2018)
8. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
9. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49
10. Shen, D., Wu, G., Suk, H.: Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017)
11. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
12. Wang, Q., Wu, B., Zhu, P., Liang, D., Zhang, Y.: ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11534–11542 (2020)
13. Oktay, O., et al.: Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)
14. Hua, R., et al.: Segmenting brain tumor using cascaded V-Nets in multimodal MR images. Front. Comput. Neurosci. 14, 9 (2020)
15. Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., Maier-Hein, K.H.: No New-Net. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, MICCAI 2018, RSP, pp. 234–244. Springer, BrainLes (2018). https://doi.org/10.1007/978-3-030-11726-9_21
16. Domenico, M., Abramian, D., Eklund, A.: Vox2Vox: 3D-GAN for brain tumour segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, MICCAI 2020, RSP, vol. 6, pp. 274–284. Springer, Peru (2020). https://doi.org/10.1007/978-3-030-72084-1_25
17. Ding, Y., Yu, X., Yang, Y.: RFNet: region-aware fusion network for incomplete multi-modal brain tumor segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3975–3984 (2021)
18. Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J.: TransBTS: multimodal brain tumor segmentation using transformer. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol. 12901, pp. 109–119. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87193-2_11
19. Zhou, T., Ruan, S., Guo, Y., Canu, S.: A multi-modality fusion network based on attention mechanism for brain tumor segmentation. In: 17th International Symposium on Biomedical Imaging (ISBI), pp. 377–380 (2020)
20. Huang, Q., Sun, J., Ding, H., Wang, X., Wang, G.: Robust liver vessel extraction using 3D U-Net with variant dice loss function. Comput. Biol. Med. 101, 153–162 (2018)
21. Zhang, Z., Sabuncu, M.: Generalized cross entropy loss for training deep neural networks with noisy labels. In: Advances in Neural Information Processing Systems, vol. 31 (2018)
22. Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
Drug-Target Interaction Prediction Based on Knowledge Graph Embedding and BiLSTM Networks
Yiwen Zhang(✉) and Mengqi Cheng
College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei, China
{202013407043,cccmq}@wust.edu.cn
Abstract. Predicting drug-target interactions is important for shortening the drug development cycle and reducing the cost of drug development. In this paper, we use a prediction framework based on knowledge graphs and binary classification models. First, a knowledge graph is constructed using a drug database. Then, the entities in the knowledge graph are transformed into embedded vectors. Based on a dataset of drug-target interactions, the embedded vectors corresponding to drugs and targets are used as input data, and whether there is an interaction between the drug and the target is used as the label; both are fed to a binary classification neural network model for training. The experimental results show that the accuracy of drug-target prediction improves when the improved TransR strategy is used to construct the embedding vectors and a BiLSTM binary classification neural network model with an attention mechanism is used as the classifier.
Keywords: knowledge graph · TransR algorithm · BiLSTM model · attention mechanism
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 803–813, 2023. https://doi.org/10.1007/978-981-99-4749-2_68
1 Introduction
Pharmacological effects were traditionally discovered through a primitive trial-and-error process, such as applying plant extracts to living systems and observing the results. Later, the drug development process evolved to elucidate the mechanism of drug action and its impact on the phenotype. Recently, advances in molecular biology and biochemistry have allowed for more complex analysis of drugs, their targets, and their mechanisms of action, and research on drug targets has become very popular [1, 2].
Traditional computational methods for predicting drug-target interactions (DTI) mainly include ligand-based methods, docking methods, and chemogenomics methods. Ligand-based methods require a large amount of data, and the prediction of new interactions is limited to the known connections between ligands and proteins. Docking methods predict whether proteins and drugs can interact based on their three-dimensional structures, so they cannot be applied when the structure is unknown. Chemogenomics methods are currently the most commonly used methods; they can use a wide range of biological datasets to infer possible interactions by representing drugs and other entities in a unified subspace.
The emergence of machine learning methods has promoted the development of the biological field [3, 4]. In recent years, significant progress has been made in representation learning techniques. The semantic information of biological data can be represented as dense low-dimensional vectors, and these feature vectors can also be used for other downstream tasks. In addition, some progress has been made in the DTI field, such as "Discovering DTI and DDI by Knowledge Graph with MHRW and Improved Neural Network" by Shuo Z et al. [5] and "A unified drug–target interaction prediction framework based on knowledge graph and recommendation system" by Ye, Q. et al. [6]. In this paper, we use knowledge graphs to integrate the semantic information of biological data.
2 Materials and Methods
In this paper, the process of predicting whether there is an interaction between drugs and targets includes using databases to construct a knowledge graph, converting entities in the knowledge graph into embedded vectors, and using a binary classification neural network model on a general dataset to verify whether drugs and targets interact. The detailed process is shown in Fig. 1.
2.1 Embedding Vector Representation
In this experiment, we first represent entities in the knowledge graph using embedding vectors, and we use the TransR method to transform the entity embeddings. We implement the TransR algorithm using PyTorch and make the following optimizations to the model. When evaluating the score, we change the metric: the three vectors are projected into the same space, and the cosine value is calculated between the head entity embedding vector and the relation embedding vector and between the tail entity embedding vector and the relation embedding vector. The similarity between these two cosine values (as shown in formula 1) is then calculated and used as the score. In addition, when adjusting the head and tail vectors for each triple, we choose to adjust the entity that is connected to more nodes, in order to prioritize the important parts of the knowledge graph network; this can improve the accuracy of the embedding vectors within a limited number of adjustments. Furthermore, when training the model, the Adam adaptive gradient descent optimization method is used to improve efficiency and performance.

$$\mathrm{cosinesimilarity}(h, r, t) = \frac{\mathrm{dotproduct}(h, r, t)}{\mathrm{norm}(h) \cdot \mathrm{norm}(r) \cdot \mathrm{norm}(t)} \tag{1}$$
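The cosine-based scoring described in Sect. 2.1 can be sketched as follows. The sketch follows the prose description (project head and tail into the relation space, compute each cosine with the relation vector, then compare the two) rather than the compact three-argument notation of formula 1. All function names are hypothetical, and the way the two cosines are compared (one minus their absolute difference) is an assumption, since the paper does not spell it out.

```python
import math


def project(vec, matrix):
    """Project an entity vector (length d_e) with a d_e x d_r relation matrix."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def triple_score(head, relation, tail, rel_matrix):
    """Score a (head, relation, tail) triple; higher means more plausible."""
    head_r = project(head, rel_matrix)   # head in relation space
    tail_r = project(tail, rel_matrix)   # tail in relation space
    cos_head = cosine(head_r, relation)
    cos_tail = cosine(tail_r, relation)
    # Assumption: similarity of the two cosines = 1 - |difference|.
    return 1.0 - abs(cos_head - cos_tail)


# Toy example with an identity projection (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = triple_score([1.0, 0.0], [1.0, 1.0], [2.0, 0.0], identity)
opposed = triple_score([1.0, 0.0], [1.0, 1.0], [0.0, -1.0], identity)
```

In the toy example the first triple receives the higher score because its head and tail make the same angle with the relation vector.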
Fig. 1. Flowchart for predicting the interaction between drugs and targets

2.2 BiLSTM Neural Network Model
In the BiLSTM neural network model, we first calculate the forget gate to select the information to be forgotten (as shown in formula 2). Then we calculate the memory gate to select the information to be remembered (as shown in formulas 3 and 4), and calculate the current cell state (as shown in formula 5). Next, we calculate the output gate and the current hidden state (as shown in formulas 6 and 7), which gives us a sequence of the same length as our input layer. Finally, we add a fully connected layer to the model to reduce the dimensionality of the output layer.

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{2}$$

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{3}$$

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{4}$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{5}$$

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{6}$$

$$h_t = o_t * \tanh(C_t) \tag{7}$$
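Formulas 2 to 7 describe one time step of an LSTM cell; a BiLSTM runs such a cell over the sequence in both directions. The following dependency-free sketch illustrates this under stated assumptions: the parameter layout (a dictionary mapping each gate to a weight matrix and bias), the zero initial states, and the concatenation of forward and backward hidden states are choices made here, not details taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Compute W . vec + b for a row-major weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'c', 'o' to (W, b) pairs."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]        # formula 2
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]        # formula 3
    c_tilde = [math.tanh(a) for a in affine(*params["c"], z)]  # formula 4
    c_t = [f * c + i * g                                       # formula 5
           for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]        # formula 6
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]         # formula 7
    return h_t, c_t


def bilstm(sequence, fwd_params, bwd_params, hidden_size):
    """Run the cell forward and backward; concatenate hidden states per step."""
    def run(seq, params):
        h, c = [0.0] * hidden_size, [0.0] * hidden_size
        outputs = []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, params)
            outputs.append(h)
        return outputs

    forward = run(sequence, fwd_params)
    backward = run(sequence[::-1], bwd_params)[::-1]  # re-align with time order
    return [f + b for f, b in zip(forward, backward)]
```

The output has one vector per input position, which matches the statement that the resulting sequence has the same length as the input.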
2.3 Attention Mechanism
In our model, we add a self-attention layer that takes the input sequence as input and outputs a set of weights that represent the attention values at each position in the sequence. These weights can be used in a weighted sum to generate a fixed-length representation of the sequence. Finally, we use this representation to predict the interactions between drugs and targets. The use of attention mechanisms helps the model capture relevant information in the input sequence more effectively, thereby improving the performance of the model. The model with attention mechanism is shown in Fig. 2.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{8}$$
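Formula 8 is the standard scaled dot-product attention. A minimal sketch on plain Python lists is shown below; the function names are hypothetical, and the learned projections that normally produce Q, K and V from the BiLSTM outputs are omitted, so this only illustrates the arithmetic.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)  # one attention weight per position
        outputs.append([sum(weights[j] * values[j][c] for j in range(len(values)))
                        for c in range(len(values[0]))])
    return outputs


# A zero query attends to both positions equally (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
averaged = attention([[0.0, 0.0]], keys, values)
```

For self-attention, the same sequence would be used to derive the queries, keys and values.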
Fig. 2. Structure diagram of BiLSTM model with attention mechanism
2.4 Interaction Prediction Using Knowledge Graph Embedding and Neural Network Model
In this paper, we take the following steps to predict the interaction between drugs and targets. First, we use the DrugBank and KEGG databases to construct a knowledge graph. Next, we use the knowledge graph embedding strategy to transform entities in the knowledge graph into embedding vectors. Then, we obtain the corresponding relationship between drugs and targets from a general dataset. Finally, we input the drug and target embedding vectors and the corresponding relationships into a binary neural network model that incorporates attention mechanisms for training and validation. The detailed process is shown in Table 1.
Building a Knowledge Graph. To achieve good prediction results, we need to construct a corresponding knowledge graph from databases. Here, we use DrugBank [7] and the KEGG database [8] to build the knowledge graph. We first obtain the corresponding XML files for these two databases, and then use the Bio2RDF script to convert the XML files into corresponding RDF files. From the RDF files, we can easily extract the triple relationships in the knowledge graph. The nodes in this knowledge graph include drugs, chemical properties, diseases, proteins, substructures, sequences, genes, side effects, pathways, etc. Moreover, the drugs and targets in the common dataset are also included in this knowledge graph.
Embedding Representations of Knowledge Graphs. We need to input the entities in the knowledge graph as input data and the existence of
interaction relationships between entities as data labels into the neural network model for training. However, due to the complexity of entity types, they cannot be directly input into the model. Therefore, we try various knowledge graph embedding strategies to construct the model based on the relationships between entities in the knowledge graph triples. Each entity is represented by a vector of a specific dimension, which includes all the features of the entity in the knowledge graph. Finally, we found that using the improved TransR strategy to construct the model gave better results.
Prediction of Interactions Based on Binary Classification Neural Network Model. We use the embedding vectors of each drug-target pair in the general dataset as input data and whether there is an interaction between the drug and target as the label, and train and validate different binary neural network models. We choose the BiLSTM model with attention mechanisms, which shows better results, and further train it. During training, we can set different parameters to select the best training method. Finally, we compare the validation results with those obtained by other studies using the same approach. If some of our results are better than those obtained in other studies, we can conclude that our approach is meaningful.
Table 1. Steps for predicting drug-target interactions
Steps     Description
Step 1    Build a knowledge graph from DrugBank and KEGG databases
Step 2    Convert the entities in the knowledge graph into embedding vectors using various knowledge graph embedding methods
Step 3    Use the embedding vectors generated by the improved TransR strategy as input data
Step 4    Obtain the drug-target correspondences from a general dataset and use them as input labels
Step 5    Train and validate different neural network models using the data and labels
Step 6    Choose the BiLSTM neural network model with attention mechanisms that shows better performance and train and validate it again
Step 7    Set different parameters during training to improve the training effectiveness
Step 8    Compare the model’s predicted results to validate the effectiveness of the model
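Steps 3 and 4 of Table 1 pair the embedding vectors with interaction labels. One possible way to assemble such a training set is sketched below. The function name, the identifiers, and the decision to concatenate the two embeddings and to label every drug-target pair not listed as interacting with 0 are assumptions made for illustration; the paper does not specify these details.

```python
def build_pair_dataset(drug_vectors, target_vectors, interactions):
    """Create (features, label) samples for every drug-target pair.

    drug_vectors / target_vectors: dict mapping an ID to its embedding list.
    interactions: set of (drug_id, target_id) pairs known to interact.
    """
    samples = []
    for drug_id in sorted(drug_vectors):
        for target_id in sorted(target_vectors):
            # Assumption: the classifier input is the concatenated embeddings.
            features = drug_vectors[drug_id] + target_vectors[target_id]
            label = 1 if (drug_id, target_id) in interactions else 0
            samples.append((features, label))
    return samples


# Hypothetical two-dimensional embeddings for two drugs and two targets.
drugs = {"D1": [0.1, 0.2], "D2": [0.3, 0.4]}
targets = {"T1": [0.5, 0.6], "T2": [0.7, 0.8]}
dataset = build_pair_dataset(drugs, targets, {("D1", "T2")})
```

Because known interactions are usually far rarer than non-interacting pairs, a dataset built this way is strongly imbalanced, which is one reason AUPR is a useful metric alongside AUC.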
3 Experimental Results
3.1 Datasets
For ease of comparison, we use the general dataset introduced by Yamanishi et al. [9] in this experiment. The dataset contains drug-target interaction information for four target classes, namely enzymes (E), ion channels (IC), nuclear receptors (NR), and GPCR targets, collected from different public databases including DrugBank, BRITE, BRENDA, SuperTarget, and KEGG. The use of these datasets facilitates cross-comparison of different deep learning models.
3.2 Evaluation Criteria

This paper evaluates the model with five-fold cross-validation, using the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR). The ROC curve plots the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis. FPR is the proportion of negative samples that are incorrectly classified as positive, while TPR is the proportion of positive samples that are correctly classified as positive. The horizontal axis of the PR curve is recall, the proportion of positive samples that are correctly classified as positive (identical to TPR), and the vertical axis is precision, the proportion of true positives among all samples predicted to be positive.

3.3 Using RNN Binary Classification Neural Network Model to Test Embedding Vectors

To verify that our improved embedding vectors outperform those constructed by the original TransE method, we train and validate an RNN model separately on the original and on the improved embedding vectors. The results are shown in Table 2 and Table 3: Table 2 reports the results obtained with the original TransE method, and Table 3 those obtained with the improved TransR method. Each table is divided into four blocks, one for each of the four datasets in the general dataset. With the improved embedding vectors as input, the average AUC is higher on all four datasets, and the average AUPR is markedly higher on the enzyme, ion channel, and GPCR datasets and essentially unchanged on the nuclear receptor dataset (77.03% versus 77.14%). Therefore, in the subsequent experiments, we use the improved embedding vectors as input data.

Table 2. Training and validation results for embedding vectors using the original TransE

Datasets          Test set  Accu. (%)  Prec. (%)  AUC (%)  AUPR (%)
Enzyme            1         97.53      96.21      78.83    24.03
                  2         97.91      96.65      83.82    39.82
                  3         97.94      97.01      84.03    40.68
                  4         98.01      97.03      86.72    44.86
                  5         98.05      97.56      87.25    51.03
                  Average   97.89      96.89      83.97    40.08
Ion channel       1         97.33      96.09      78.66    23.96
                  2         97.63      96.46      83.63    39.69
                  3         97.75      96.91      83.89    40.44
                  4         97.81      96.97      86.43    44.63
                  5         97.83      97.47      87.04    50.92
                  Average   97.67      96.78      83.82    39.93
GPCR              1         97.65      96.37      70.09    12.33
                  2         97.73      97.44      85.44    41.95
                  3         97.92      97.45      87.23    33.02
                  4         98.04      96.84      89.14    36.97
                  5         97.89      97.54      93.22    47.47
                  Average   97.85      97.13      84.94    34.35
Nuclear receptor  1         96.38      90.89      65.78    21.33
                  2         99.24      93.35      92.35    74.12
                  3         99.57      96.78      88.94    90.92
                  4         99.89      95.13      99.96    99.38
                  5         99.85      97.99      99.26    99.97
                  Average   98.99      94.83      87.75    77.14
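As a concrete reading of the evaluation criteria in Section 3.2, the following sketch computes AUC and AUPR for a small set of predictions in plain Python. It is an illustration under assumptions rather than the evaluation code used in the paper: AUC is computed through its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative), AUPR is approximated by average precision, which is one common estimator, and the labels and scores are invented toy values.

```python
def roc_auc(labels, scores):
    """AUC as P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve; assumes no tied scores."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, area = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            area += hits / rank  # precision at the point where recall increases
    return area / total_pos


# Toy example: 1 = interacting drug-target pair, 0 = non-interacting pair.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

print(round(roc_auc(toy_labels, toy_scores), 4))
print(round(average_precision(toy_labels, toy_scores), 4))
```

For these toy values the AUC is 5/9 (about 0.556) and the average precision is about 0.722. With heavily imbalanced data, as is typical for drug-target interaction benchmarks, the two measures can diverge considerably, which is consistent with the gap between the AUC and AUPR columns in the tables.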
Table 3. Training and validation results for constructing embedding vectors using the improved TransR method

Datasets          Test set  Accu. (%)  Prec. (%)  AUC (%)  AUPR (%)
Enzyme            1         98.67      97.96      93.56    63.79
                  2         98.99      97.97      95.82    72.49
                  3         99.15      98.92      95.91    71.82
                  4         98.73      97.23      94.12    73.75
                  5         99.3       98.98      95.18    75.23
                  Average   98.97      98.21      94.82    71.42
Ion channel       1         98.52      97.93      93.51    63.68
                  2         98.59      97.97      95.46    72.38
                  3         98.51      98.32      94.93    71.45
                  4         98.47      98.23      95.29    73.56
                  5         98.41      98.23      93.94    74.91
                  Average   98.51      98.14      94.39    71.19
GPCR              1         98.11      97.49      90.73    45.13
                  2         98.43      97.61      94.79    60.62
                  3         98.39      98.17      94.33    58.52
                  4         98.51      97.96      94.04    60.15
                  5         98.51      98.08      95.15    62.13
                  Average   98.39      97.86      93.67    57.31
Nuclear receptor  1         96.54      95.08      81.32    24.69
                  2         99.52      95.62      95.55    79.35
                  3         99.92      98.07      99.21    88.14
                  4         99.99      99.53      92.81    93.02
                  5         99.99      99.26      99.82    99.95
                  Average   99.19      97.51      94.88    77.03
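The size of the change between the two embedding methods can be read directly off the Average rows of Table 2 and Table 3. The short sketch below only subtracts those reported averages. The dictionary and function names are illustrative, and no new measurements are introduced.

```python
# Average (AUC %, AUPR %) per dataset, copied from the Average rows of
# Table 2 (original TransE) and Table 3 (improved TransR).
transe_avg = {
    "Enzyme": (83.97, 40.08),
    "Ion channel": (83.82, 39.93),
    "GPCR": (84.94, 34.35),
    "Nuclear receptor": (87.75, 77.14),
}
transr_avg = {
    "Enzyme": (94.82, 71.42),
    "Ion channel": (94.39, 71.19),
    "GPCR": (93.67, 57.31),
    "Nuclear receptor": (94.88, 77.03),
}


def gains(before, after):
    """Return {dataset: (AUC change, AUPR change)} in percentage points."""
    return {
        name: (
            round(after[name][0] - before[name][0], 2),
            round(after[name][1] - before[name][1], 2),
        )
        for name in before
    }


for dataset, (auc_gain, aupr_gain) in gains(transe_avg, transr_avg).items():
    print(f"{dataset}: AUC {auc_gain:+.2f}, AUPR {aupr_gain:+.2f}")
```

On the reported averages this gives AUC gains of 10.85, 10.57, 8.73, and 7.13 percentage points and AUPR changes of 31.34, 31.26, 22.96, and -0.11 percentage points for the enzyme, ion channel, GPCR, and nuclear receptor datasets, respectively.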
3.4 Training and Validation Results for Different Models

To obtain better classification results, we train and validate several neural network models in this section: a BiLSTM model, a CNN model, and an RNN model, each trained for 100 iterations with cross-entropy as the loss function. Table 4 shows the results on the four datasets in the general dataset. The BiLSTM model achieves the highest AUPR on the enzyme, ion channel, and GPCR datasets and the highest AUC on the GPCR dataset, while the RNN model has a slightly higher AUC on the other three datasets. Overall, we consider the BiLSTM model to have the best performance.

Table 4. Training and validation results for different models

Dataset           Classifier  AUPR    AUC
Enzymes           CNN         0.3154  0.8856
                  RNN         0.7142  0.9482
                  BiLSTM      0.7715  0.9417
Ion channel       CNN         0.3258  0.8628
                  RNN         0.7119  0.9439
                  BiLSTM      0.8086  0.9154
GPCR              CNN         0.0309  0.5025
                  RNN         0.5731  0.9367
                  BiLSTM      0.7402  0.9579
Nuclear receptor  CNN         0.4285  0.7775
                  RNN         0.7703  0.9488
                  BiLSTM      0.7291  0.9218
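The cross-entropy loss used to train the models in Section 3.4 can be written out for the binary case as follows. This is a minimal sketch of the standard binary cross-entropy, not the training code of the paper. The function name, the clipping constant `eps`, and the toy predictions are assumptions made for illustration.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy example: one interacting pair predicted at 0.9, one non-interacting pair at 0.2.
good = binary_cross_entropy([1, 0], [0.9, 0.2])
# The same pairs with confidently wrong predictions give a much larger loss.
bad = binary_cross_entropy([1, 0], [0.1, 0.8])

print(round(good, 4), round(bad, 4))
```

The first loss is about 0.164. The second is larger because the loss penalizes confident wrong predictions heavily, which is the property that drives the classifier toward well-separated scores for interacting and non-interacting pairs.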
3.5 Improvements and Optimizations to the Model

After selecting the model, we improve and optimize it. Since the BiLSTM model performs best, we use it for our binary classification problem and modify it to achieve a better AUC. We adopt the self-attention mechanism, which captures global information and can be computed in parallel. The number of training iterations is set to 100, and cross-entropy is used as the loss function. We then train the model with and without the self-attention mechanism. Fig. 3 shows the training and validation results on the four datasets with and without the self-attention mechanism. The results show that the BiLSTM binary classification model with the self-attention mechanism achieves a higher AUC.
Fig. 3. Training and validation results for the introduction of the attention mechanism
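To illustrate why self-attention captures global information and parallelizes well, the sketch below implements scaled dot-product self-attention over a short sequence in plain Python. It is a simplified illustration, not the layer used in the paper: the learned query, key, and value projections are replaced by the identity (Q = K = V = input), the input stands in for a sequence of BiLSTM hidden states, and all names and values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with identity projections (Q = K = V)."""
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        # Each position attends to every position in the sequence (global context).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, sequence)) for j in range(dim)]
        )
    return outputs


# Toy sequence of three 2-dimensional hidden states.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(hidden_states):
    print([round(x, 3) for x in row])
```

Each output row depends only on the input sequence, not on the other output rows, so all positions can be computed independently. A recurrent layer, by contrast, must process the positions one after another.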
Table 5. Comparison of results with different methods

Dataset           Classifier                     AUPR    AUC
Enzymes           Deep latent factor model [10]  0.728   0.899
                  FRnet-DTI [11]                 0.7     0.9754
                  GCN-DTI [12]                   0.92    0.97
                  TBSelfNet-DTI                  0.7496  0.9731
Ion channel       Deep latent factor model       0.616   0.884
                  FRnet-DTI                      0.49    0.9512
                  GCN-DTI                        0.79    0.98
                  TBSelfNet-DTI                  0.8387  0.9785
GPCR              Deep latent factor model       0.828   0.942
                  FRnet-DTI                      0.69    0.9512
                  GCN-DTI                        0.82    0.97
                  TBSelfNet-DTI                  0.7804  0.9786
Nuclear receptor  Deep latent factor model       0.125   0.669
                  FRnet-DTI                      0.79    0.9285
                  GCN-DTI                        0.83    0.92
                  TBSelfNet-DTI                  0.7697  0.9271
3.6 Comparison with Existing Results

To assess the effectiveness of our improvements, we collect published drug-target interaction prediction results obtained on the same datasets and compare them with our own. The results are shown in Table 5. Since our prediction method combines the TransR algorithm, a BiLSTM binary classification model, and a self-attention mechanism, we name it TBSelfNet-DTI. The table lists the results reported in different publications for the four datasets in the general dataset. Our method achieves the highest AUC on the GPCR dataset and the highest AUPR on the ion channel dataset, and its AUC is within 0.003 of the best reported value on the other three datasets, which indicates that the approach is competitive with the existing methods.
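The comparison in Table 5 can be summarized by ranking the methods on each dataset. The sketch below does this with the values copied from Table 5. The data structure and function names are illustrative, and the ranking is taken at face value even though the published results are reported with differing numbers of decimal places.

```python
# (AUPR, AUC) per method and dataset, copied from Table 5.
table5 = {
    "Enzymes": {
        "Deep latent factor model": (0.728, 0.899),
        "FRnet-DTI": (0.7, 0.9754),
        "GCN-DTI": (0.92, 0.97),
        "TBSelfNet-DTI": (0.7496, 0.9731),
    },
    "Ion channel": {
        "Deep latent factor model": (0.616, 0.884),
        "FRnet-DTI": (0.49, 0.9512),
        "GCN-DTI": (0.79, 0.98),
        "TBSelfNet-DTI": (0.8387, 0.9785),
    },
    "GPCR": {
        "Deep latent factor model": (0.828, 0.942),
        "FRnet-DTI": (0.69, 0.9512),
        "GCN-DTI": (0.82, 0.97),
        "TBSelfNet-DTI": (0.7804, 0.9786),
    },
    "Nuclear receptor": {
        "Deep latent factor model": (0.125, 0.669),
        "FRnet-DTI": (0.79, 0.9285),
        "GCN-DTI": (0.83, 0.92),
        "TBSelfNet-DTI": (0.7697, 0.9271),
    },
}

AUPR, AUC = 0, 1  # positions inside each (AUPR, AUC) tuple


def best_method(results, metric):
    """Name of the method with the highest value of the chosen metric."""
    return max(results, key=lambda name: results[name][metric])


def rank_of(results, method, metric):
    """1-based rank of `method` (1 = best) for the chosen metric."""
    own = results[method][metric]
    return 1 + sum(1 for values in results.values() if values[metric] > own)


for dataset, results in table5.items():
    print(
        dataset,
        "| best AUC:", best_method(results, AUC),
        "| best AUPR:", best_method(results, AUPR),
        "| TBSelfNet-DTI ranks (AUC, AUPR):",
        (rank_of(results, "TBSelfNet-DTI", AUC), rank_of(results, "TBSelfNet-DTI", AUPR)),
    )
```

On these values TBSelfNet-DTI ranks first or second in AUC on every dataset, and between first and third in AUPR.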
4 Conclusion

This paper proposes a drug-target interaction prediction model based on the TransR algorithm and a BiLSTM binary classification deep learning model. By improving both the TransR and the BiLSTM components, the model predicts drug-target interactions more accurately. By improving the TransR algorithm and using cosine similarity for distance measurement, we can better capture the complex semantic relationships between entities and relations. In addition, when adjusting the vectors corresponding to entities using the TransR algorithm, we adjust the vectors of entities with more corresponding nodes, allowing the vector relationships between entities to reach the desired result more quickly. The experimental results show that the proposed model can predict drug-target interactions more accurately. In future work, we will further improve the accuracy of the model and consider its application to more cases.

Acknowledgements. The authors thank the members of the Machine Learning and Artificial Intelligence Laboratory, School of Computer Science and Technology, Wuhan University of Science and Technology, for their helpful discussions in seminars. This work was supported by the Innovation and Entrepreneurship Training Program for University Students (2022169).
References

1. Mohamed, S.K., Nováček, V., Nounu, A.: Discovering protein drug targets using knowledge graph embeddings. Bioinformatics 36(2), 603–610 (2020)
2. Xiaoli, L., Shuai, X., Xuan, L., Xiaolong, Z., Jing, H.: Detecting drug-target interactions with feature similarity fusion and molecular graphs. Biology 11(7), 967 (2022)
3. Xiaoli, L., Xiaolong, Z.: Efficient classification of hot spots and hub protein interfaces by recursive feature elimination and gradient boosting. IEEE/ACM Trans. Comput. Biol. Bioinf. 17(5), 1525–1534 (2020)
4. Xiaoli, L., Xiaolong, Z.: Prediction of hot regions in PPIs based on improved local community structure detecting. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(5), 1470–1479 (2018)
5. Shuo, Z., Xiaoli, L., Xiaolong, Z.: Discovering DTI and DDI by knowledge graph with MHRW and improved neural network. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM2021) (2021)
6. Ye, Q., Hsieh, C.Y., Yang, Z., et al.: A unified drug–target interaction prediction framework based on knowledge graph and recommendation system. Nat. Commun. 12, 6775 (2021)
7. Wishart, D.S., Knox, C., Guo, A.C., et al.: DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36, D901–D906 (2008)
8. Kanehisa, M.: The KEGG database. In: ‘In Silico’ Simulation of Biological Processes: Novartis Foundation Symposium 247, vol. 247, pp. 91–103. John Wiley & Sons Ltd, Chichester, UK (2002)
9. Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., Kanehisa, M.: Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24, i232–i240 (2008)
10. Mongia, A., Jain, V., Chouzenoux, E., et al.: Deep latent factor model for predicting drug target interactions. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1254–1258. IEEE (2019)
11. Rayhan, F., Ahmed, S., Mousavian, Z., et al.: FRnet-DTI: Deep convolutional neural network for drug-target interaction prediction. Heliyon 6(3), e03444 (2020)
12. Zhao, T., Hu, Y., Valsdottir, L.R., et al.: Identifying drug–target interactions based on graph convolutional network and deep neural network. Briefings Bioinform. 22(2), 2141–2150 (2021)
Correction to: SpliceSCANNER: An Accurate and Interpretable Deep Learning-Based Method for Splice Site Prediction

Rongxing Wang, Junwei Xu, Xiaodi Huang, Wangjing Qi, and Yanju Zhang
Correction to: Chapter “SpliceSCANNER: An Accurate and Interpretable Deep Learning-Based Method for Splice Site Prediction” in: D.-S. Huang et al. (Eds.): Advanced Intelligent Computing Technology and Applications, LNCS 14088, https://doi.org/10.1007/978-981-99-4749-2_38
In the original version of this paper, the link in the abstract section was no longer valid. This has been corrected. The correct link is now “http://www.bioinfo-zhanglab.com/SpliceSCANNER/”.
The updated original version of this chapter can be found at https://doi.org/10.1007/978-981-99-4749-2_38 © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, p. C1, 2023. https://doi.org/10.1007/978-981-99-4749-2_69
Author Index
A
Ahmed, Fatma S. 131
B
Bai, Jun 485; Bao, Wenzheng 597, 607, 617; Bin, Yannan 67
C
Cai, Zhenhua 791; Cao, Kun 752; Cao, Ruifen 67; Cao, Rui-Fen 77; Cao, Weinian 41; Cao, Yangkun 192; Cao, Yi 568, 578, 587; Cao, Zihan 300, 312; Chen, Linjie 700, 739; Chen, Yadong 700, 739; Chen, Yuehui 568, 578, 587; Chen, Yulong 460; Chen, Zheng 221; Cheng, Honglin 597, 607; Cheng, Mengqi 803; Cui, Shuna 119; Cui, Zhiming 556, 666, 676
D
Deng, He 3; Deng, Jiejin 639; Deng, Xun 180; Ding, Changtong 233; Dong, Xinyu 300, 312; Dou, Xu-Ran 291; Du, Zhihua 532
F
Fang, Min 532; Fang, Xiaoyue 144; Feng, Jing 428; Fengcong, 335; Fu, Zi-Ang 473
G
Gan, Haitao 144; Gao, Jie 16, 168; Gao, Mingxiang 371; Gao, Yang 700, 739; Gao, Ying-Lian 291, 324, 359; Gao, Yue 359; Ge, Daohui 268; Ge, Shuguang 395; Geng, Shichao 233; Geng, Yushui 87, 97; Gu, Yujie 192; Guan, Bo-Xin 324
H
Han, Chu 300, 312; Han, Lijun 521; Han, Yuyang 405; He, Luying 607, 617; He, Yonghui 67; He, Zhengda 700, 739; He, Zhongyu 653; Hou, Long 395; Hu, Bin 209; Hu, Jianhua 700, 739; Hu, Jing 246, 627, 764, 776, 791; Hu, Lun 156, 180, 335, 460; Hu, Peng-Wei 156, 180; Hu, Pengwei 335, 460; Huang, Shaohui 383; Huang, Xiaodi 447; Huang, XiaoMei 300, 312; Huang, Xiaoyang 544; Huang, Yu-An 460; Huang, Yuehan 383, 544; Huang, Yuting 209; Huinian, Li 416
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNCS 14088, pp. 815–818, 2023. https://doi.org/10.1007/978-981-99-4749-2
I
Ishdorj, Tseren-Onolt 106, 119
J
Ji, Junkai 752; Jiang, Changzhi 383; Jiang, Likun 131; Jiang, Tengsheng 556, 666, 676; Jiang, Yizhang 258; Jiang, Yu 687; Jiao, Cui-Na 291, 359; Jin, Shuting 383; Ju, Xunguang 617
K
Kong, Xiang-Zhen 324
L
Lai, Jinling 97; Lei, Qian 221; Leyi, Zhang 416; Li, Bo 371; Li, Cheng 144; Li, Dong-Xu 156, 180; Li, Feng 268; Li, Guo-Dong 180; Li, Jiajia 460; Li, Jianqiang 532, 752; Li, Junyi 687; Li, Pengfei 521; Li, Pengpai 727; Li, Rui 209; Li, Shengjun 268; Li, Xiaoning 233; Li, Xuan 192; Li, Xuewei 16; Li, Zonghao 627; Liao, Zhijun 28, 405; Lin, Peixuan 131; Lin, Xiaoli 764, 791; Lin, Yuan 544; Liu, Fangbo 55, 347; Liu, Haoran 764; Liu, Jian 395; Liu, Jin-Xing 291, 359; Liu, Juan 131, 428; Liu, Mengxue 221; Liu, Wenbin 300, 312; Liu, Xiangrong 131, 383, 544; Liu, Xiaoli 335; Liu, Yang 268; Liu, Zaiyi 300, 312; Liu, Zhi-Ping 716, 727; Liu, Zhiqiang 16; Lu, Mingyu 639; Luo, Lijun 41; Luo, Pengyu 544; Lv, Hao 700, 739
M
Ma, Yu 209; Mao, Dongdong 405; Meng, Qingfang 279, 497
N
Nie, Jiafei 509
O
Ouyang, Yuanxin 485
P
Pan, Yaohua 639
Q
Qi, Wangjing 447; Qian, Kun 209; Qin, Dengkang 221; Qin, Yiming 460
R
Rao, Shengxiang 41; Ren, Hongwei 3; Ren, Qianqian 268; Rong, Wenge 485; Ruan, Xinru 131
S
Schuller, Björn W. 209; Shah, Hayat Ali 428; Shang, Junliang 268; Shen, Cong 28, 405; Shen, Jian 209; Shen, Lian 544; Shen, Zhen 87, 97; Shi, Haobo 192; Shi, Zhenwei 300, 312; Song, Quanrun 233; Su, Xiao-Rui 156, 180; Su, Xiaorui 335; Su, Yansen 77, 106, 119; Sui, Jianan 568, 587; Sun, Duanchen 716, 727; Sun, Pan 16; Sun, Shengguo 87
T
Tan, Dayu 106; Tan, Feng 460; Tang, Binhua 509; Tang, Jijun 28, 405
W
Wan, Xiaohua 473; Wang, Chuanyuan 716; Wang, Dan 300, 312; Wang, Jinxuan 405; Wang, Kang 246; Wang, Lin 233; Wang, Minglu 106; Wang, Ning 258; Wang, Pengpeng 106, 119; Wang, Qi 607, 617; Wang, Rong 521; Wang, Rongxing 447; Wang, Xiao 521; Wang, Xingang 87, 97; Wang, Xinyue 28; Wang, Xizi 776; Wang, Xuan 687; Wang, Yi-Ming 324; Wang, Zewen 279, 497; Wang, Zhengzhong 233; Wang, Zhuo 597, 607, 617; Wang, Zikai 460; Wei, Hongwei 87, 97; Wei, Pi-Jing 77; Wu, Hongjie 556, 666, 676; Wu, Ruilin 405; Wu, Tian-Ru 291; Wu, Yankai 405; Wu, Yulin 687
X
Xi, Wen-Yu 291; Xiao, Kai 617; Xiong, Zhang 485; Xu, Chungui 67; Xu, Fei 119; Xu, Jiaying 700, 739; Xu, Jiliang 67; Xu, Junwei 447; Xu, Peng 300, 312; Xu, Zhijie 97; Xu, Zhiyuan 168; Xuan, Guangzhe 209
Y
Yamamoto, Yoshiharu 209; Yang, Bolun 578; Yang, Feng 55, 347; Yang, Jiaxin 716; Yang, Yang 106; Yang, Yue 156; Yang, Zhangfan 752; Yang, Zhi 144; Yang, Zhihui 428; Yin, Changqing 41; Yingbiao, Hu 416; Yingjie, Long 416; You, Zhu-Hong 156, 180; You, Zhuhong 460; Yu, Jian 168; Yu, Mei 168; Yu, Ruiguo 16; Yu, Wendong 87, 97; Yu, Xinyu 383; Yu, Yongzi 209; Yu, Zhengqiu 383; Yuan, Kaixiang 209; Yuan, Lin 87, 97
Z
Zha, Lei 335; Zhang, Baichuan 597, 607; Zhang, Chunxiao 727; Zhang, Dai-Jun 359; Zhang, Fa 473; Zhang, Hengyuan 192; Zhang, Huijuan 41; Zhang, Jiahao 279, 497; Zhang, Jing 639; Zhang, Lejun 106; Zhang, Meng 3; Zhang, Qiang 497; Zhang, Ruixuan 16; Zhang, Runhua 666, 676; Zhang, Xiangwei 233; Zhang, Xiaolong 3, 246, 627, 764, 776, 791; Zhang, Xinyi 544; Zhang, Yanju 447; Zhang, Yijia 639; Zhang, Yiwen 803; Zhang, Yong 687; Zhang, Yuan 335; Zhang, Zhengbo 335; Zhao, Bo-Wei 156, 180; Zhao, Bowei 335; Zhao, Haipeng 556; Zhao, Jiawang 87; Zhao, Junting 485; Zhao, Ling 97; Zhao, Mankun 168; Zhao, Yaou 568, 578, 587; Zhao, Zhihe 300, 312; Zheng, Chun-Hou 67, 77, 324; Zhong, Hengrui 209; Zhong, Xing 532; Zhou, Jingli 687; Zhou, Ran 144; Zhou, Rui-ning 700, 739; Zhou, Shu-Li 77; Zhu, Baozhong 556, 666, 676; Zhu, Zexuan 752