Advances in Intelligent Systems and Computing 1240
Gabriella Panuccio · Miguel Rocha · Florentino Fdez-Riverola · Mohd Saberi Mohamad · Roberto Casado-Vara Editors
Practical Applications of Computational Biology & Bioinformatics, 14th International Conference (PACBB 2020)
Advances in Intelligent Systems and Computing Volume 1240
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered.

The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, perception and vision, DNA and immune based systems, self-organizing and adaptive systems, e-learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia.

The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and worldwide distribution. This permits a rapid and broad dissemination of research results.

** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and SpringerLink **
More information about this series at http://www.springer.com/series/11156
Editors

Gabriella Panuccio
Enhanced Regenerative Medicine, Istituto Italiano di Tecnologia, Genoa, Italy

Miguel Rocha
Departamento de Informática, Universidade do Minho, Braga, Portugal

Florentino Fdez-Riverola
Computer Science Department, University of Vigo, Vigo, Spain

Mohd Saberi Mohamad
Institute for Artificial Intelligence and Big Data (AIBIG), Universiti Malaysia Kelantan, Kampus Kota, Kota Bharu, Malaysia

Roberto Casado-Vara
Biotechnology, Intelligent Systems and Educational Technology (BISITE) Research Group, University of Salamanca, Salamanca, Spain
ISSN 2194-5357  ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-3-030-54567-3  ISBN 978-3-030-54568-0 (eBook)
https://doi.org/10.1007/978-3-030-54568-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
There are diverse sequencing techniques, and new technologies emerge continually, making it possible to obtain large amounts of multi-omics data. Bioscience increasingly relies on computational methods and is progressively turning into a form of computer science. As a result, bioinformatics and computational biology face new challenges in analyzing, processing, assimilating, and gaining insight from data. To overcome these challenges, new algorithms and approaches must be developed in fields such as databases, statistics, data mining, machine learning, optimization, computer science, and artificial intelligence. A new generation of interdisciplinary researchers, with extensive backgrounds in the biological and computational sciences, is working to meet these needs.

The International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB) is an annual international event dedicated to applied research and challenges in bioinformatics and computational biology. Building on the success of previous events, this volume gathers the contributions to the 14th PACBB Conference. All submissions were thoroughly reviewed and selected by an international committee with members from 21 different countries. The PACBB’20 technical program includes 21 papers by authors from many different countries (Australia, Colombia, Egypt, Germany, India, Malaysia, Portugal, Saudi Arabia, Slovakia, South Korea, Spain, Switzerland, Turkey, United Arab Emirates, UK, and USA) and from different subfields of bioinformatics and computational biology. There will be special issues in JCR-ranked journals, such as Interdisciplinary Sciences: Computational Life Sciences, Integrative Bioinformatics, Information Fusion, Neurocomputing, Sensors, Processes, and Electronics.
Therefore, this event will strongly promote the interaction among researchers from international research groups working in diverse fields. The scientific content is innovative, and it will help improve the valuable work being carried out by the participants. This symposium is organized by the University of L’Aquila in collaboration with the Universiti Malaysia Kelantan, the University of Minho, the University of Vigo, and the University of Salamanca. We would like to thank all the
contributing authors, the members of the Program Committee, and the sponsors (IBM, Indra, AEPIA, APPI, AIIS, EurAI, and AIR Institute). We also thank the project “Intelligent and sustainable mobility supported by multi-agent systems and edge computing” (Id. RTI2018-095390-B-C32) for its funding support. Finally, we thank the Local Organization members and the Program Committee members for their valuable work, which is essential to the success of PACBB’20.

Gabriella Panuccio
Miguel Rocha
Florentino Fdez-Riverola
Mohd Saberi Mohamad
Roberto Casado-Vara
Organization
General Co-chairs
Gabriella Panuccio, University of Genoa, Italy
Miguel Rocha, University of Minho, Portugal
Florentino Fdez-Riverola, University of Vigo, Spain
Mohd Saberi Mohamad, Universiti Malaysia Kelantan, Malaysia
Roberto Casado-Vara, University of Salamanca, Spain
Program Committee
Vera Afreixo, University of Aveiro, Portugal
Amparo Alonso-Betanzos, University of A Coruña, Spain
Rene Alquezar, Technical University of Catalonia, Spain
Manuel Álvarez Díaz, University of A Coruña, Spain
Jeferson Arango Lopez, Universidad de Caldas, Colombia
Joel Arrais, University of Coimbra, Portugal
Julio Banga, Instituto de Investigaciones Marinas (C.S.I.C.), Spain
Carlos Bastos, University of Aveiro, Portugal
Carole Bernon, IRIT/UPS, France
Lourdes Borrajo, University of Vigo, Spain
Ana Cristina Braga, University of Minho, Portugal
Boris Brimkov, Rice University, USA
Guillermo Calderon, Autonomous University of Manizales, Colombia
Rui Camacho, University of Porto, Portugal
José Antonio Castellanos Garzón, University of Salamanca, Spain
Luis Fernando Castillo, Universidad de Caldas, Colombia
José Manuel Colom, University of Zaragoza, Spain
Fernanda Brito Correia, DETI/IEETA, University of Aveiro and DEIS/ISEC/Polytechnic Institute of Coimbra, Portugal
Daniela Correia, University of Minho, Portugal
Ángel Martín del Rey, University of Salamanca, Spain
Roberto Costumero, Technical University of Madrid, Spain
Francisco Couto, University of Lisbon, Faculty of Sciences, Portugal
Yingbo Cui, National University of Defense Technology, China
Masoud Daneshtalab, KTH Royal Institute of Technology in Stockholm, Sweden
Javier De Las Rivas, University of Salamanca, Spain
Sergio Deusdado, Technical Institute of Bragança, Portugal
Oscar Dias, University of Minho, Portugal
Fernando Diaz, University of Valladolid, Spain
Ramón Doallo, University of A Coruña, Spain
Xavier Domingo-Almenara, Rovira i Virgili University, Spain
Pedro Ferreira, Ipatimup: Institute of Molecular Pathology and Immunology of the University of Porto, Portugal
João Diogo Ferreira, University of Lisbon, Faculty of Sciences, Portugal
Nuno Filipe, University of Porto, Portugal
Mohd Firdaus-Raih, National University of Malaysia, Malaysia
Nuno A. Fonseca, University of Porto, Portugal
Dino Franklin, Federal University of Uberlandia, Brazil
Alvaro Gaitan, Café de Colombia, Colombia
Narmer Galeano, Universidad Catolica de Manizales, Colombia
Vanessa Maria Gervin, Hathor Group, Brazil
Rosalba Giugno, University of Verona, Italy
Josep Gómez, University Rovira i Virgili, Spain
Patricia Gonzalez, University of A Coruña, Spain
Consuelo Gonzalo-Martin, Universidad Politécnica de Madrid, Spain
David Hoksza, Charles University, Czech Republic
Roberto Casado-Vara, University of Salamanca, Spain
Natthakan Iam-On, Mae Fah Luang University, Thailand
Gustavo Isaza, University of Caldas, Colombia
Paula Jorge, University of Minho, Portugal
Martin Krallinger, National Center for Oncological Research, Spain
Rosalia Laza, Universidad de Vigo, Spain
Thierry Lecroq, University of Rouen, France
Giovani Librelotto, Federal University of Santa Maria, Brazil
Filipe Liu, CEB, University of Minho, Portugal
Ruben Lopez-Cortes, University of Vigo, Spain
Hugo López-Fernández, University of Vigo, Spain
Eva Lorenzo Iglesias, University of Vigo, Spain
Analia Lourenco, University of Vigo, Spain
Sara Madeira, University of Lisbon, Faculty of Sciences, Portugal
Marcelo Maraschin, Federal University of Santa Catarina, Brazil
Marcos Martinez-Romero, Stanford University, USA
Sérgio Matos, IEETA, Universidade de Aveiro, Portugal
Mohd Saberi Mohamad, Universiti Teknologi Malaysia, Malaysia
Loris Nanni, University of Padua, Italy
José Luis Oliveira, University of Aveiro, Portugal
Maria Olivia Pereira, University of Minho, Centre of Biological Engineering, Portugal
Alexandre Perera Lluna, Technical University of Catalonia, Spain
Martin Pérez Pérez, University of Vigo, SING group, Spain
Gael Pérez Rodríguez, University of Vigo, SING group, Spain
Cindy Perscheid, Hasso-Plattner-Institut, Germany
Armando Pinho, University of Aveiro, Portugal
Ignacio Ponzoni, Universidad Nacional del Sur, Argentina
Antonio Prestes Garcia, Universidad Politécnica de Madrid, Spain
Heri Ramampiaro, Norwegian University of Science and Technology, Norway
Juan Ranea, University of Malaga, Spain
Miguel Reboiro-Jato, University of Vigo, Spain
Jose Ignacio Requeno, University of Zaragoza, Spain
João Manuel Rodrigues, DETI/IEETA, University of Aveiro, Portugal
Alejandro Rodriguez, Universidad Politécnica de Madrid, Spain
Alfonso Rodriguez-Paton, Universidad Politécnica de Madrid, Spain
Miriam Rubio Camarillo, National Center for Oncological Research, Spain
Gustavo Santos-Garcia, Universidad de Salamanca, Spain
Pedro Sernadela, University of Aveiro, Portugal
Amin Shoukry, Egypt-Japan University of Science and Technology, Egypt
Naresh Singhal, University of Auckland, New Zealand
Ana Margarida Sousa, University of Minho, Portugal
Niclas Ståhl, University of Skövde, Sweden
Carolyn Talcott, SRI International, USA
Mehmet Tan, TOBB University of Economics and Technology, Turkey
Rita Margarida Teixeira Ascenso, ESTG - IPL, Portugal
Mark Thompson, LUMC, Netherlands
Antonio J. Tomeu-Hardasmal, University of Cadiz, Spain
Alicia Troncoso, Universidad Pablo de Olavide, Spain
Turki Turki, New Jersey Institute of Technology, USA
Eduardo Valente, IPCB, Portugal
Alfredo Vellido, Technical University of Catalonia, Spain
Jorge Vieira, University of Porto, Portugal
Alejandro F. Villaverde, Instituto de Investigaciones Marinas (C.S.I.C.), Spain
Pierpaolo Vittorini, University of L’Aquila, Department of Life, Health, and Environmental Sciences, Italy
Organizing Committee
(members are affiliated with the University of Salamanca and/or the AIR Institute, Spain, except Alfonso González Briones, University Complutense of Madrid, Spain)

Juan M. Corchado Rodríguez
Roberto Casado Vara
Fernando De la Prieta
Sara Rodríguez González
Javier Prieto Tejedor
Pablo Chamoso Santos
Belén Pérez Lancho
Ana Belén Gil González
Ana De Luis Reboredo
Angélica González Arrieta
Emilio S. Corchado Rodríguez
Ángel Martín del Rey
Ángel Luis Sánchez Lázaro
Alfonso González Briones
Yeray Mezquita Martín
Enrique Goyenechea
Javier J. Martín Limorti
Alberto Rivas Camacho
Ines Sitton Candanedo
Elena Hernández Nieves
Beatriz Bellido
María Alonso
Diego Valdeolmillos
Sergio Marquez
Jorge Herrera
Marta Plaza Hernández
David García Retuerta
Guillermo Hernández González
Luis Carlos Martínez de Iturrate, University of Salamanca and AIR Institute, Spain
Ricardo S. Alonso Rincón, University of Salamanca, Spain
Javier Parra, University of Salamanca, Spain
Niloufar Shoeibi, University of Salamanca, Spain
Zakieh Alizadeh-Sani, University of Salamanca, Spain
Local Organizing Committee
Pierpaolo Vittorini, University of L’Aquila, Italy
Tania Di Mascio, University of L’Aquila, Italy
Giovanni De Gasperis, University of L’Aquila, Italy
Federica Caruso, University of L’Aquila, Italy
Alessandra Galassi, University of L’Aquila, Italy

PACBB 2020 Sponsors
Contents
Identification of Antimicrobial Peptides from Macroalgae with Machine Learning
Michela Caprani, Orla Slattery, Joan O’Keeffe, and John Healy ..... 1

A Health-Related Study from Food Online Reviews. The Case of Gluten-Free Foods
Martín Pérez-Pérez, Anália Lourenço, Gilberto Igrejas, and Florentino Fdez-Riverola ..... 12

The Activity of Bioinformatics Developers and Users in Stack Overflow
Roi Pérez-López, Guillermo Blanco, Florentino Fdez-Riverola, and Anália Lourenço ..... 23

ProPythia: A Python Automated Platform for the Classification of Proteins Using Machine Learning
Ana Marta Sequeira, Diana Lousa, and Miguel Rocha ..... 32

Inferences on Mycobacterium Leprae Host Immune Response Escape and Antibiotic Resistance Using Genomic Data and GenomeFastScreen
Hugo López-Fernández, Cristina P. Vieira, Florentino Fdez-Riverola, Miguel Reboiro-Jato, and Jorge Vieira ..... 42

Compi Hub: A Public Repository for Sharing and Discovering Compi Pipelines
Alba Nogueira-Rodríguez, Hugo López-Fernández, Osvaldo Graña-Castro, Miguel Reboiro-Jato, and Daniel Glez-Peña ..... 51

DeepACPpred: A Novel Hybrid CNN-RNN Architecture for Predicting Anti-Cancer Peptides
Nathaniel Lane and Indika Kahanda ..... 60

Preventing Cardiovascular Disease Development Establishing Cardiac Well-Being Indexes
Ana Duarte and Orlando Belo ..... 70

Fuzzy Matching for Cellular Signaling Networks in a Choroidal Melanoma Model
Adrián Riesco, Beatriz Santos-Buitrago, Merrill Knapp, Gustavo Santos-García, Emiliano Hernández Galilea, and Carolyn Talcott ..... 80

Towards A More Effective Bidirectional LSTM-Based Learning Model for Human-Bacterium Protein-Protein Interactions
Huaming Chen, Jun Shen, Lei Wang, and Yaochu Jin ..... 91

Machine Learning for Depression Screening in Online Communities
Alina Trifan, Rui Antunes, and José Luís Oliveira ..... 102

Towards Triclustering-Based Classification of Three-Way Clinical Data: A Case Study on Predicting Non-invasive Ventilation in ALS
Diogo Soares, Rui Henriques, Marta Gromicho, Susana Pinto, Mamede de Carvalho, and Sara C. Madeira ..... 112

Searching RNA Substructures with Arbitrary Pseudoknots
Michela Quadrini ..... 123

An Application of Ontological Engineering for Design and Specification of Ontocancro
Jéssica A. Bonini, Matheus D. Da Silva, Rafael Pereira, Bruno A. Mozzaquatro, Ricardo G. Martini, and Giovani R. Librelotto ..... 134

Evaluation of the Effect of Cell Parameters on the Number of Microtubule Merotelic Attachments in Metaphase Using a Three-Dimensional Computer Model
Maxim A. Krivov, Fazoil I. Ataullakhanov, and Pavel S. Ivanov ..... 144

Reconciliation of Regulatory Data: The Regulatory Networks of Escherichia coli and Bacillus subtilis
Diogo Lima, Fernando Cruz, Miguel Rocha, and Oscar Dias ..... 155

A Hybrid of Bat Algorithm and Minimization of Metabolic Adjustment for Succinate and Lactate Production
Mei Yen Man, Mohd Saberi Mohamad, Yee Wen Choon, and Mohd Arfian Ismail ..... 166

Robustness of Pathway Enrichment Analysis to Transcriptome-Wide Gene Expression Platform
Joanna Zyla, Kinga Leszczorz, and Joanna Polanska ..... 176

Hypoglycemia Prevention Using an Embedded Model Control with a Safety Scheme: In-silico Test
Fabian Leon-Vargas, Andres L. Jutinico, and Andres Molano-Jimenez ..... 186

Bidirectional-Pass Algorithm for Interictal Event Detection
David García-Retuerta, Angel Canal-Alonso, Roberto Casado-Vara, Angel Martin-del Rey, Gabriella Panuccio, and Juan M. Corchado ..... 197

Towards the Reconstruction of the Genome-Scale Metabolic Model of Lactobacillus acidophilus La-14
Emanuel Cunha, Ahmad Zeidan, and Oscar Dias ..... 205

Author Index ..... 215
Identification of Antimicrobial Peptides from Macroalgae with Machine Learning

Michela Caprani¹, Orla Slattery¹, Joan O’Keeffe¹, and John Healy²

¹ Marine and Freshwater Research Centre (MFRC), Galway-Mayo Institute of Technology, Galway, Ireland
[email protected], {orla.slattery,joan.okeeffe}@gmit.ie
² Department of Computer Science and Applied Physics, Galway-Mayo Institute of Technology, Galway, Ireland
[email protected]
Abstract. Antimicrobial peptides (AMPs) are essential components of innate host defense showing a broad spectrum of activity against bacteria, viruses, fungi, and multi-resistant pathogens. Despite their diverse nature, with high sequence similarities in distantly related mammals, invertebrate and plant species, their presence and functional roles in marine macroalgae remain largely unexplored. In recent years, computational tools have successfully predicted and identified encoded AMPs sourced from ubiquitous dual-functioning proteins, including histones and ribosomes, in various aquatic species. In this paper, a computational design is presented that uses machine learning classifiers, artificial neural networks and random forests, to identify putative AMPs in macroalgae. 42,213 protein sequences from five macroalgae were processed by the classifiers which identified 24 putative AMPs. While initial testing with AMP databases positively identifies these sequences as AMPs, an absolute determination cannot be made without in vitro extraction and purification techniques. If confirmed, these AMPs will be the first-ever identified in macroalgae. Keywords: Antimicrobial peptides · Macroalgae · Pseudo Amino Acid Composition (PseAAC) · Machine learning classifiers
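The Pseudo Amino Acid Composition features mentioned in the abstract build on the classical amino acid composition vector. As an illustrative sketch only (not the authors' code; full PseAAC additionally includes sequence-order correlation terms), the first 20 components of such a feature vector can be computed as:

```python
# Sketch: classical amino acid composition (AAC), the first 20 components
# of a PseAAC feature vector. Illustrative helper, not the authors' code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(sequence: str) -> list[float]:
    """Return the 20 normalized amino acid frequencies of a peptide."""
    seq = sequence.upper()
    # Count only standard residues, ignoring any non-standard characters.
    length = sum(seq.count(aa) for aa in AMINO_ACIDS)
    if length == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [seq.count(aa) / length for aa in AMINO_ACIDS]

# Example: a short hypothetical peptide fragment
features = aac_features("GLFDIIKKIAESF")
print(len(features))            # 20
print(round(sum(features), 6))  # 1.0
```

Vectors like these (extended with the PseAAC order terms) are what fixed-input classifiers such as random forests and feed-forward neural networks consume.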
1 Introduction
Since the introduction of antibiotics, the development of microbial resistance to conventional antibiotics has progressed, prompting complications for the treatment of infectious disease. Antimicrobial peptides (AMPs, host defense peptides or innate immune peptides) are recognized as an alternative therapeutic agent to address the emergence of resistant strains [1]. AMPs are gene-coded short amino acid sequences (

…(” at the beginning of a line) from sequence headers of the input files.
3) identify-orthologs: orthologous gene identification using the two-way-blast Compi-based Docker image available at pegi3s/two-way-blast. This step creates the orthologs datasets.
4) translate-reference: translation of the reference sequence using the pegi3s/emboss Docker image.
5) split-translated-reference: split of the translated reference using seqkit [7] to create one FASTA file for each sequence.
6) check-m3f1: for each sequence of each orthologs dataset, find the correct frame and make sequences multiple of three, using the “m3f1” script of the pegi3s/blast_utilities Docker image.
3 https://www.sing-group.org/compihub/explore/5e2eaacce1138700316488c1.
4 https://github.com/pegi3s/pss-genome-fs.
5 https://hub.docker.com/r/pegi3s/pss-genome-fs.
6 https://pegi3s.github.io/dockerfiles/.
7 https://www.sing-group.org/compihub/explore/5e2db6f9e1138700316488be.
Fig. 1. Directed acyclic graph of the GenomeFastScreen pipeline.
7) filter-sequences: filter each orthologs dataset to remove sequences containing in-frame stop codons or N’s using the “fasta_remove_sequences_with_in_frame_stops_or_n” script of the pegi3s/utilities Docker image.
8) remove-stop-codons: process each one of the previous files to remove stop codons using the “batch_fasta_remove_stop_codons” script of the pegi3s/utilities Docker image.
9) after-remove-stop-codons: move the files created in the previous step into a new location to be taken by the next step and remove the intermediate results files generated so far.
10) fast-screen: run the FastScreen image (pegi3s/pss-fs) using as input each one of the orthologs datasets.
11) get-short-list-files: copy the sequence files listed in the FastScreen short list (those that should be the subject of detailed analyses) into the “short_list_dir” directory.
12) orthologs-reference-species: for every gene list produced by FastScreen, perform the identification of orthologous genes in an external reference species (allowing cross-species comparisons), using the two-way-blast Compi-based Docker image available at pegi3s/two-way-blast.

The README of the GenomeFastScreen repositories includes the necessary commands to try the pipeline using the sample data provided by us.
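The filtering criterion of step 7 can be sketched in a few lines of Python. This is a simplified stand-in for the pegi3s/utilities script, not its actual implementation:

```python
# Sketch of the step-7 criterion: discard coding sequences that contain an
# in-frame stop codon (before the final codon) or any ambiguous 'N' base.
# Simplified stand-in for the pegi3s/utilities script, not its real code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def passes_filter(cds: str) -> bool:
    cds = cds.upper()
    if "N" in cds or len(cds) % 3 != 0:
        return False
    # Examine every codon except the last, which may legitimately be a stop.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    return not any(codon in STOP_CODONS for codon in codons)

print(passes_filter("ATGAAATGA"))     # terminal stop only -> True
print(passes_filter("ATGTAAAAATGA"))  # in-frame TAA stop -> False
print(passes_filter("ATGANATGA"))     # ambiguous base -> False
```

Sequences failing this test are exactly those that would later break the codon-level alignment, which is why the pipeline removes them before FastScreen runs.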
2.2 Data Source and Pre-processing

The gene annotations of six Mycobacterium leprae genomes (GCF_000026685, GCF_000195855, GCF_001648835, GCF_001653495, GCF_003253775, and GCF_003584725) were downloaded from the NCBI RefSeq database in December 2019. GCF_000195855 was used as the M. leprae reference. We also downloaded from the NCBI RefSeq database the gene annotation of the Mycobacterium tuberculosis genome GCF_000195955 to be used as an external reference. Although pre-processing of sequence files is not a requisite of the Compi-based Docker image developed here, we pre-processed the files downloaded from RefSeq using SEDA [8] to shorten header names (only accession numbers and gene names are kept). It should, however, be noted that very long headers may lead to errors in the creation of BLAST databases.

2.3 Analyses

The Docker image developed here for the GenomeFastScreen Compi pipeline was used to identify genes that likely show PSS. Detailed analyses of the identified genes were performed using ADOPS [9, 10]. Sequences were aligned at the amino acid level using Muscle [11], and the corresponding nucleotide alignment obtained. Only codons that are aligned with a confidence score of three or higher are used to obtain a phylogeny using MrBayes [12]. The model of sequence evolution used was the GTR (allowing for among-site rate variation and a proportion of invariable sites). Third codon positions were allowed to have a gamma distribution shape parameter different from that for first and second codon positions. Two independent runs of 1,000,000 generations with four chains each (one cold and three heated) were used. Trees were sampled every 100th generation and the first 2500 samples were discarded (burn-in). Positively selected amino acid sites are then inferred using codeML [1]. We have used PANTHER [13] to test for possible functional enrichment in each group.

8 http://www.pantherdb.org/.
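The header-shortening pre-processing can be illustrated in plain Python. The exact fields SEDA keeps are configurable, so the `[gene=...]` pattern below is an assumption for RefSeq CDS-style headers, not SEDA's code:

```python
import re

# Sketch: shorten FASTA headers to "accession gene", similar in spirit to
# the SEDA pre-processing described above (illustrative, not SEDA code).
def shorten_header(header: str) -> str:
    accession = header.lstrip(">").split()[0]
    match = re.search(r"\[gene=([^\]]+)\]", header)
    return f">{accession} {match.group(1)}" if match else f">{accession}"

h = ">lcl|NC_002677.1_cds_1 [gene=dnaA] [protein=chromosomal replication initiator]"
print(shorten_header(h))  # >lcl|NC_002677.1_cds_1 dnaA
```

Trimming headers this way avoids the BLAST database creation errors that, as noted above, very long definition lines can trigger.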
3 Results

When the Mycobacterium leprae dataset is used, the GenomeFastScreen pipeline identifies 1601 sets of orthologous genes (files saved in the “orthologs” directory). Nevertheless, four sets contain a single sequence without in-frame stop codons and ambiguous positions, and thus are not used. Despite these in-built options that automatically remove the problematic sequences, as well as making sure that sequences are a multiple of three and in frame +1, two genes out of 1597 could not be processed by FUBAR (listed in the output file named “files_requiring_attention”). This is because at least one sequence in these files presents non-multiple-of-three indels when compared to the other sequences in the same file, leading to a non-multiple-of-three nucleotide alignment. This can only happen if at least one sequence is annotated wrongly in the corresponding genome, likely due to errors in the genome sequence, or if the gene is a pseudogene in
at least one genome. Such errors are difficult to eliminate in an automated way, and thus user intervention is in this case required before re-analysing the problematic files. When running FUBAR, only three genes are identified as having PSS, but when using codeML model M2a, putative PSS are identified in 528 genes where FUBAR (the first software to run in the FastScreen pipeline) did not. It should be noted, however, that many of these may be false positives, since the log-likelihood of model M2a is not compared with that of model M1a (neutral evolution), for time efficiency purposes. Therefore, although PSS are detected when using model M2a, the likelihood of this model may not be statistically different from that of model M1a, and if so, such genes should have been considered as not showing evidence for PSS. It is thus likely that, at least in this case, the true number of genes showing evidence for PSS is closer to that identified by FUBAR. When running detailed analyses using ADOPS [9], which uses a better sequence alignment strategy, a better tree-building method, and codeML models M1a, M2a, M7 and M8, PSS are detected at 31 genes (dnaA, dnaE, ftsX, glcB, gpsA, leuC, mfd, ML0051, ML0208, ML0240, ML0283, ML0314, ML0606, ML0803, ML0825, ML1119, ML1243, ML1286, ML1652, ML1740, ML1750, ML2053, ML2075, ML2570, ML2597, ML2630, ML2664, ML2692, murE, recG, and tesB) out of the 531 genes inferred by FastScreen as likely showing evidence for PSS. The results of these analyses are deposited at the B+ database [10] under project number BP2020000001 (see footnote 9). One out of the three genes identified by FUBAR as showing PSS was not identified when using ADOPS. In conclusion, by using GenomeFastScreen, it was possible to quickly and effortlessly identify 531 genes out of 1597 (33.2%) as likely showing evidence for PSS. By eliminating 1066 genes from time-consuming detailed analyses, time and computational savings were obtained.

9 http://bpositive.i3s.up.pt/transcriptions?id=30.
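The likelihood-ratio test that FastScreen skips for speed (comparing codeML models M1a and M2a) is itself simple: twice the log-likelihood difference is compared against a chi-square distribution with two degrees of freedom, whose survival function has the closed form exp(-x/2). A sketch with made-up log-likelihood values:

```python
import math

# Sketch of the M1a-vs-M2a likelihood-ratio test that FastScreen omits.
# For a chi-square distribution with 2 degrees of freedom, the p-value is
# simply exp(-LRT/2). The log-likelihoods below are invented examples.
def lrt_pvalue(lnL_m1a: float, lnL_m2a: float) -> float:
    stat = 2.0 * (lnL_m2a - lnL_m1a)  # M2a nests M1a, so lnL_m2a >= lnL_m1a
    return math.exp(-max(stat, 0.0) / 2.0)

print(round(lrt_pvalue(-4321.7, -4315.2), 4))  # LRT = 13.0 -> 0.0015
print(round(lrt_pvalue(-4321.7, -4321.1), 4))  # LRT = 1.2 -> 0.5488, keep M1a
```

Applying this test would prune many of the 528 M2a-only hits discussed above, which is why their number is expected to shrink toward the FUBAR estimate under detailed analysis.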
As implemented in FastScreen (where the log-likelihood of model M2a is not compared with that of model M1a), only 5.9% (31/528) of the genes identified by codeML model M2a as likely showing evidence for PSS show evidence for PSS when detailed analyses are performed. On the other hand, FUBAR identified only 6.5% of the genes that in detailed analyses show evidence for PSS. Therefore, at least for this dataset, running both approaches (FUBAR and codeML), as implemented in FastScreen, was useful. When using the GenomeFastScreen pipeline, it is possible to provide an external reference (in our case Mycobacterium tuberculosis, the causative agent of tuberculosis) to identify the putative orthologs of the genes likely showing PSS. Using this information, the putative orthologs of 29 out of 31 genes showing PSS were identified (dnaA, dnaE1, ftsX, glcB, gpdA2, leuC, mfd, PPE68, Rv3632, rpfB, -, lipU, Rv2410c, Rv3220c, smtB, Rv1277, lipQ, Rv1626, Rv2242, Rv3057c, Rv1354c, adhA, Rv1828, aftD, Rv0177, -, ldtA, ino1, murE, recG, and tesB1; genes are presented in the same order as above and - means that no ortholog was identified). Using this information and PANTHER, 13 genes could be classified as encoding proteins of a given PANTHER protein class. Of these, 6 (46.2%) are hydrolases, but, after correction for multiple comparisons, the list of genes is not statistically significantly enriched for any category listed in the 9 available PANTHER annotation datasets. It should be noted, however, that M. leprae has fewer genes than M. tuberculosis [6], and that by using the available annotation datasets for the latter species as the reference, we are assuming that the percentage of genes in the
H. López-Fernández et al.
different gene categories is similar in the two species, despite the large difference in gene number. None of the genes here identified as showing PSS are in common with those reported by Osório et al. [5], who found evidence for PSS in only five out of 576 M. tuberculosis genes previously associated with drug resistance or encoding membrane proteins. In M. leprae, according to PANTHER, we have found evidence for PSS at only one membrane protein gene (the orthologue of Rv3632). Nevertheless, Osório et al. [5] found evidence of PSS at the M. tuberculosis gene Rv0176, a mammalian cell entry mce1 gene. mce1 genes are transcribed as a 13-gene polycistronic message encompassing Rv0166 to Rv0178 [14], and here we have detected evidence for PSS at the putative M. leprae orthologue of gene Rv0177. In M. tuberculosis, the mce1 operon is likely involved in modulating the host inflammatory response in such a way that the bacterium can enter a persistent state without being eliminated or causing disease in the host [15]. Moreover, we have also found evidence for PSS at the PPE68 orthologous gene. This cell envelope protein plays a major role in M. tuberculosis RD1-associated pathogenesis and may contribute to the establishment and maintenance of infection [16]. In addition, we have detected PSS at the orthologue of the resuscitation-promoting factor RpfB, which is mainly responsible for M. tuberculosis resuscitation from dormancy [17]. We also detected evidence for PSS at the RecG orthologue. This is an interesting observation, because expression of M. tuberculosis RecG in an Escherichia coli recG mutant strain has been reported to provide protection against mitomycin C-, methyl methane sulfonate- and UV-induced cell death [18]. Lipases/esterases are also essential for M. tuberculosis survival and persistence, and even virulence [19], and here we have found evidence for PSS at two lipases (the orthologues of lipQ and lipU) and two esterases (the orthologues of Rv3220c and tesB1). Therefore, the detailed analysis of the genes showing PSS, as well as the PSS themselves, may provide interesting hints about the modulation of the different M. leprae phenotypes. In the near future, we want to analyse every Mycobacterium species for which there are at least five sequenced and annotated genomes. An external reference will be useful to compare the sets of genes showing PSS across species, where orthologous gene names are often not the same.
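The PANTHER over-representation analysis mentioned above is essentially a hypergeometric tail probability followed by a multiple-testing correction. A minimal sketch with invented counts (the background size of annotated hydrolases used here is a placeholder, not a value from the actual analysis):

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): probability of observing
    at least k genes of a category, when K of the N background genes belong
    to that category and n genes are classified."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Invented counts for illustration only: 13 classified genes drawn from a
# background of 1597, of which 500 are assumed to be annotated as
# hydrolases; 6 hydrolases observed among the 13.
p_raw = hypergeom_sf(k=6, N=1597, K=500, n=13)
n_tests = 9                        # one test per PANTHER annotation dataset
p_adj = min(1.0, p_raw * n_tests)  # Bonferroni correction
```

PANTHER's own test and correction method may differ in detail; the sketch only illustrates why an apparently large fraction (6 of 13) can fail to reach significance after correcting for multiple comparisons.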
4 Conclusion

GenomeFastScreen allows the identification of genes likely showing PSS, almost without user intervention, starting from FASTA files, one per genome, containing all annotated coding sequences. We show the usefulness of such a pipeline using Mycobacterium leprae, but in the future, we want to analyse every Mycobacterium species for which there are at least five sequenced and annotated genomes. The availability of a Compi-based Docker image for the GenomeFastScreen pipeline means that even researchers without a background in informatics should be able to run it in an efficient way.

Acknowledgments. The SING group thanks the CITI (Centro de Investigación, Transferencia e Innovación) from the University of Vigo for hosting its IT infrastructure. This work was partially supported by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) under the scope of the strategic funding ED431C2018/55-GRC Competitive Reference Group.
Inferences on Mycobacterium Leprae
References

1. Yang, Z.: PAML: a program package for phylogenetic analysis by maximum likelihood. Bioinformatics 13, 555–556 (1997). https://doi.org/10.1093/bioinformatics/13.5.555
2. Murrell, B., Moola, S., Mabona, A., Weighill, T., Sheward, D., Kosakovsky Pond, S.L., Scheffler, K.: FUBAR: a Fast, Unconstrained Bayesian AppRoximation for Inferring Selection. Mol. Biol. Evol. 30, 1196–1205 (2013). https://doi.org/10.1093/molbev/mst030
3. Wilson, D.J., McVean, G.: Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172, 1411–1425 (2006). https://doi.org/10.1534/genetics.105.044917
4. López-Fernández, H., Duque, P., Vázquez, N., Fdez-Riverola, F., Reboiro-Jato, M., Vieira, C.P., Vieira, J.: Inferring positive selection in large viral datasets. In: Fdez-Riverola, F., Rocha, M., Mohamad, M.S., Zaki, N., Castellanos-Garzón, J.A. (eds.) 13th International Conference on Practical Applications of Computational Biology and Bioinformatics, pp. 61–69. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-23873-5_8
5. Osório, N.S., Rodrigues, F., Gagneux, S., Pedrosa, J., Pinto-Carbó, M., Castro, A.G., Young, D., Comas, I., Saraiva, M.: Evidence for diversifying selection in a set of Mycobacterium tuberculosis genes in response to antibiotic- and nonantibiotic-related pressure. Mol. Biol. Evol. 30, 1326–1336 (2013). https://doi.org/10.1093/molbev/mst038
6. Chavarro-Portillo, B., Soto, C.Y., Guerrero, M.I.: Mycobacterium leprae’s evolution and environmental adaptation. Acta Trop. 197, 105041 (2019). https://doi.org/10.1016/j.actatropica.2019.105041
7. Shen, W., Le, S., Li, Y., Hu, F.: SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016). https://doi.org/10.1371/journal.pone.0163962
8. López-Fernández, H., Duque, P., Henriques, S., Vázquez, N., Fdez-Riverola, F., Vieira, C.P., Reboiro-Jato, M., Vieira, J.: Bioinformatics protocols for quickly obtaining large-scale data sets for phylogenetic inferences. Interdiscip. Sci. Comput. Life Sci. 11, 1–9 (2019). https://doi.org/10.1007/s12539-018-0312-5
9. Reboiro-Jato, D., Reboiro-Jato, M., Fdez-Riverola, F., Vieira, C.P., Fonseca, N.A., Vieira, J.: ADOPS–Automatic Detection Of Positively Selected Sites. J. Integr. Bioinform. 9, 200 (2012). https://doi.org/10.2390/biecoll-jib-2012-200
10. Vázquez, N., Vieira, C.P., Amorim, B.S.R., Torres, A., López-Fernández, H., Fdez-Riverola, F., Sousa, J.L.R., Reboiro-Jato, M., Vieira, J.: Large scale analyses and visualization of adaptive amino acid changes projects. Interdiscip. Sci. Comput. Life Sci. 10, 24–32 (2018). https://doi.org/10.1007/s12539-018-0282-7
11. Edgar, R.C.: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004). https://doi.org/10.1093/nar/gkh340
12. Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D.L., Darling, A., Höhna, S., Larget, B., Liu, L., Suchard, M.A., Huelsenbeck, J.P.: MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61, 539–542 (2012). https://doi.org/10.1093/sysbio/sys029
13. Mi, H., Huang, X., Muruganujan, A., Tang, H., Mills, C., Kang, D., Thomas, P.D.: PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res. 45, D183–D189 (2017). https://doi.org/10.1093/nar/gkw1138
14. Casali, N., White, A.M., Riley, L.W.: Regulation of the Mycobacterium tuberculosis mce1 operon. J. Bacteriol. 188, 441–449 (2006). https://doi.org/10.1128/JB.188.2.441-449.2006
15. Shimono, N., Morici, L., Casali, N., Cantrell, S., Sidders, B., Ehrt, S., Riley, L.W.: Hypervirulent mutant of Mycobacterium tuberculosis resulting from disruption of the mce1 operon. Proc. Natl. Acad. Sci. 100, 15918–15923 (2003). https://doi.org/10.1073/pnas.2433882100
16. Demangel, C., Brodin, P., Cockle, P.J., Brosch, R., Majlessi, L., Leclerc, C., Cole, S.T.: Cell envelope protein PPE68 contributes to Mycobacterium tuberculosis RD1 immunogenicity independently of a 10-kilodalton culture filtrate protein and ESAT-6. Infect. Immun. 72, 2170–2176 (2004). https://doi.org/10.1128/IAI.72.4.2170-2176.2004
17. Squeglia, F., Romano, M., Ruggiero, A., Vitagliano, L., De Simone, A., Berisio, R.: Carbohydrate recognition by RpfB from Mycobacterium tuberculosis unveiled by crystallographic and molecular dynamics analyses. Biophys. J. 104, 2530–2539 (2013). https://doi.org/10.1016/j.bpj.2013.04.040
18. Thakur, R.S., Basavaraju, S., Somyajit, K., Jain, A., Subramanya, S., Muniyappa, K., Nagaraju, G.: Evidence for the role of Mycobacterium tuberculosis RecG helicase in DNA repair and recombination. FEBS J. 280, 1841–1860 (2013). https://doi.org/10.1111/febs.12208
19. Li, C., Li, Q., Zhang, Y., Gong, Z., Ren, S., Li, P., Xie, J.: Characterization and function of Mycobacterium tuberculosis H37Rv Lipase Rv1076 (LipU). Microbiol. Res. 196, 7–16 (2017). https://doi.org/10.1016/j.micres.2016.12.005
Compi Hub: A Public Repository for Sharing and Discovering Compi Pipelines

Alba Nogueira-Rodríguez1,2, Hugo López-Fernández1,2,3(B), Osvaldo Graña-Castro1,4, Miguel Reboiro-Jato1,2,3, and Daniel Glez-Peña1,2,3

1 Department of Computer Science, University of Vigo, ESEI, Campus As Lagoas, 32004 Ourense, Spain
{alnogueira,hlfernandez,mrjato,dgpena}@uvigo.es
2 The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
3 SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
4 Bioinformatics Unit, Structural Biology Programme, Spanish National Cancer Research Centre (CNIO), C/ Melchor Fernández Almagro, 3, 28029 Madrid, Spain
[email protected]
Abstract. Sharing the source code necessary to perform and reproduce any type of data analysis in computational biology and bioinformatics is becoming more and more important nowadays. This includes the publication of complete executable workflows and pipelines in a form that allows the original results to be reproduced easily or the workflows to be re-used to analyze new datasets. Recently, we have developed Compi, an application framework to develop end-user, pipeline-based applications, focused on automatic user interface generation and application packaging and including the most common features of workflow management systems. Here we introduce Compi Hub, a public repository of Compi pipelines that allows the community to explore them interactively. In this work, we aim to demonstrate how to use it to release Compi pipelines efficiently, along with some good-practice recommendations illustrated with our own pipelines.

Keywords: Workflow · Pipeline · Application framework · Public repository · Reproducibility · Docker
1 Introduction

Nowadays, the scientific analysis of massive datasets in fields like bioinformatics and biomedicine relies on the combination of multiple sequential or parallel steps using dedicated software tools to manage their execution [1]. These computational pipelines or workflows are usually published by researchers in the form of protocols, best practices or even ready-to-run executable workflows. By defining all required steps and dependencies, these workflows ensure the reproducibility of the analyses and facilitate job

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 51–59, 2021. https://doi.org/10.1007/978-3-030-54568-0_6
automation. The importance of the reproducibility of scientific results and the interest of the community have led to the development of different Workflow Management Systems (WMS) in recent years. Some of the most well-known examples are Galaxy [2], a web-based platform designed for scientists with little or no programming experience, and Taverna [3], a suite for defining workflows by orchestrating and interconnecting web services. More recently, command-line (CLI) based applications such as Snakemake [4], Nextflow [5], or SciPipe [6] have been proposed as feature-rich workflow engines oriented to bioinformaticians with medium-to-high programming skills, and they are widely used at the present time. Our research group has contributed to the plethora of tools for creating scientific pipelines with Compi1, a tool designed for researchers aiming to take advantage of common WMS features (e.g. automatic job scheduling, restarting from points of failure, etc.) while retaining the simplicity and readability of shell scripts and without the need to learn a new programming language. Moreover, we developed Compi to be an application framework for developing end-user, pipeline-based applications by focusing on two aspects: (i) automatic user interface generation, by dynamically producing a classical CLI for the entire pipeline based on its parameter specifications; (ii) application packaging, by providing a mechanism to package the pipeline application along with its dependencies into a Docker image. Compi has already been used to develop two pipelines presented in the 13th edition of the International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2019): Metatax, a pipeline for metataxonomics in precision medicine [7], and FastScreen, a pipeline for fast screening of positively selected sites [8]. In the last decade, there has been an important shift towards open science and open data, mainly due to the reproducibility crisis [9, 10].
In this scenario, data sharing is now ubiquitous and a requirement for publication in a growing number of journals, and initiatives like FAIR2 are trying to define how to properly share and reuse research data [11]. This philosophy is not limited to sharing the data; it also extends to sharing the source code necessary to perform and reproduce the data analysis, including complete executable workflows and pipelines. While the aforementioned tools help researchers create reproducible analyses, there is also interest in sharing them publicly so that other researchers can reuse them or reproduce research results. For instance, the awesome-pipeline site3 lists Nextflow pipelines developed by the community, the Snakemake-workflows project4 is a joint effort to create workflows for common use cases of the Snakemake WMS, and the ENCODE project5 collects best-practice workflows for different types of genomic data analyses. More formally, the Galaxy toolshed6 is a website where users can find workflows for Galaxy [12]. Recently, the Nextflow developers have
1 https://www.sing-group.org/compi/.
2 https://www.go-fair.org/go-fair-initiative/.
3 https://github.com/pditommaso/awesome-pipeline.
4 https://github.com/snakemake-workflows/docs.
5 https://www.encodeproject.org/pipelines/.
6 https://toolshed.g2.bx.psu.edu/.
released nf-core,7 a community effort to collect a curated set of analysis pipelines built using Nextflow [13]. To encourage developers of Compi pipelines to share them and follow good practices, we have developed Compi Hub8, a public repository of Compi pipelines that allows the community to explore them interactively. Compi Hub takes advantage of the pipelines defined in XML to generate online documentation of the pipeline tasks and parameters dynamically, and uses Compi to generate the task graphs of the pipelines. Compi Hub also shows other useful information, such as the pipeline dependencies, license and associated datasets, as long as the developers provide this information. The goal of this communication is to present Compi Hub and demonstrate how to use it efficiently to share Compi pipelines, along with some good-practice recommendations illustrated with our own pipelines. For those readers not familiar with Compi concepts, we recommend consulting the online documentation9 when needed.
2 Related Tools

The resources most closely related to Compi Hub are the Galaxy toolshed and nf-core. Both aim to provide information about the workflows or pipelines available for the corresponding workflow management systems (Galaxy and Nextflow, respectively), similarly to what Compi Hub does for Compi pipelines. The Galaxy toolshed is the largest repository, since Galaxy has been around for a while, and it also includes individual tools. A repository at the Galaxy toolshed gives only a brief description of the workflow, with different links for users to obtain more information or clone it. However, it does not seem to provide detailed documentation, a listing of the tasks that make up the workflow, or an easy way to try the pipeline (e.g. with example data). On the other hand, nf-core is a much richer environment that serves two purposes: (i) an online repository to explore and discover Nextflow pipelines, and (ii) a command-line tool to interact with the repository and manage the execution of the hosted pipelines. Regarding the first purpose, the nf-core pipelines page10 lists all the pipelines available at the nf-core GitHub repository11. In this sense, nf-core is coupled to GitHub by design: its website acts mainly as a frontend view of the corresponding GitHub repository, showing different pipeline statistics and information obtained through the GitHub API (e.g. stargazers, number of clones, collaborators, and so on). The pipeline view at nf-core shows four sections: (i) Readme, displaying the contents of the README.md file of the repository; (ii) Documentation, showing the contents of the docs folder of the repository; (iii) Statistics, plotting statistics obtained through the GitHub API; and (iv) Releases, showing the GitHub releases. Unlike Compi Hub, it does not necessarily provide an overview of the pipeline or links to test datasets.
In contrast, Compi Hub automatically generates web documentation, including the directed acyclic graph (DAG) of the pipeline as well as task and parameter descriptions.

7 https://nf-co.re/.
8 https://www.sing-group.org/compihub/.
9 https://www.sing-group.org/compi/docs.
10 https://nf-co.re/pipelines.
11 https://github.com/nf-core.
Regarding the second purpose, the nf-core tools package provides a set of utilities to interact with the repository and manage the execution of the pipelines. It allows users to list the pipelines available at the repository and provides commands for launching and downloading them. It also includes commands to create a pipeline and check it against the nf-core guidelines. Since the nf-core repository lists only what is in its GitHub organization, pipeline developers must request access and commit their contributions there. In contrast, compi-dk, the command-line tool associated with Compi and Compi Hub, only provides methods to ease the publication of pipelines at Compi Hub, as explained in more detail later.
3 Releasing a Compi Pipeline

Compi provides the compi-dk tool, a CLI application that allows developers to create an end-user portable application packaged as a Docker image that includes the pipeline XML, the Compi executable, and all the dependencies required to run the pipeline declared in the Dockerfile. Once this has been done, we recommend following the process illustrated in Fig. 1 in order to properly release the pipeline: (i) publishing the source code in a public repository on platforms such as GitHub or GitLab, so that all the files of the compi-dk project directory remain public and any user can re-build the project locally at any moment; (ii) pushing the Docker image to the Docker Hub registry, so that users who just want to run the application only have to pull the image from Docker Hub and follow the instructions; (iii) registering the pipeline at Compi Hub in order to increase its visibility and benefit from the Compi Hub features described in Sect. 2.
Fig. 1. Schematic diagram of releasing a Compi pipeline.
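The three release steps shown in Fig. 1 can be scripted. The sketch below only assembles the commands (a dry run) using hypothetical repository and image names; the compi-dk hub commands are described in Sect. 5, and their exact options are omitted here:

```python
def release_commands(github_repo, image, version):
    """Assemble (but do not run) the shell commands for the three release
    steps: (i) publish sources, (ii) push the Docker image, (iii) register
    the pipeline at Compi Hub. All names are placeholders."""
    return [
        # (i) publish the compi-dk project sources on GitHub/GitLab
        "git push {} main".format(github_repo),
        # (ii) build and push the packaged pipeline image to Docker Hub
        "docker build -t {}:{} .".format(image, version),
        "docker push {}:{}".format(image, version),
        # (iii) register the pipeline and upload the version at Compi Hub
        # using the compi-dk hub commands (options omitted; see Sect. 5)
        "compi-dk hub-init",
        "compi-dk hub-push",
    ]

for cmd in release_commands("https://github.com/example/my-pipeline",
                            "example/my-pipeline", "1.0.0"):
    print(cmd)
```

In practice, each step would be run from the compi-dk project directory after the Docker image has been built and tested locally.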
The combination of these three platforms is essential to guarantee the reproducibility of the pipeline, as the main goal of Compi Hub is to help users find and explore pipelines that may be useful to them; it does not store the complete source code (e.g. scripts included in the pipeline.xml as source files) or any Docker image associated with the pipelines. Therefore, we strongly encourage developers to register the links of the external repositories so that visitors can easily navigate from Compi Hub to them. For instance, the FastScreen pipeline is published at Compi Hub with id 5d5bb64f6d9e31002f3ce30a12, the source code of the project is available at GitHub13, and the Docker image is available at Docker Hub14.
4 Overview of the Compi Hub Repository

When a visitor lands on the Compi Hub home page, a list of the available pipelines is shown along with a text box to rapidly search among them (Fig. 2A). By clicking on a pipeline identifier, the specific pipeline page is shown (Fig. 2B). The top of this page shows a box with the main pipeline information, including title, description and creation date, as well as links to external repositories associated with the pipeline on GitHub, Docker Hub or GitLab. In Compi Hub, each pipeline can have one or more versions associated with it (the version of each pipeline is defined in the Compi XML file), and each version can be explored by selecting it in the corresponding combo box shown in Fig. 2B. This selection changes the information associated with the specific pipeline version shown in the tabs at the bottom of this page.
Fig. 2. Screenshots of Compi Hub. (A) Landing page of Compi Hub, showing the listing of pipelines. (B) Public view of a pipeline.

12 https://www.sing-group.org/compihub/explore/5d5bb64f6d9e31002f3ce30a.
13 https://github.com/pegi3s/pss-fs.
14 https://hub.docker.com/r/pegi3s/pss-fs.
Taking the FastScreen pipeline as an example, Fig. 2B shows the “Overview” tab of its 1.0.0 version. This first tab has the following four sequential parts: (i) the pipeline directed acyclic graph, which is generated in the backend using the “compi export-graph” command and on which visitors can click to navigate to the associated task description; (ii) a table listing the pipeline tasks and their associated descriptions; (iii) a table showing the global parameters of the pipeline; and (iv) a final part with one table for each task, showing their descriptions and specific parameters. The first two parts can be seen in Fig. 2B and the last two in Fig. 3A. All the information in these tables is generated from the pipeline XML automatically. While the tasks and parameters are directly extracted from the pipeline elements, their documentation and descriptions are extracted from the metadata section of the XML. Although this is an optional section, we encourage developers of Compi pipelines to fill it in when planning to share a pipeline at Compi Hub.
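As an illustration of how task and parameter tables can be derived automatically from a pipeline definition, the sketch below parses a minimal, made-up XML with Python's standard library; the element and attribute names are simplified and do not reproduce the exact Compi schema:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up pipeline definition in the spirit of Compi's XML
# (element/attribute names simplified; not the real Compi schema).
PIPELINE_XML = """
<pipeline version="1.0.0">
  <tasks>
    <task id="align" params="input output">Align the input sequences</task>
    <task id="tree" after="align" params="output">Build the phylogenetic tree</task>
  </tasks>
</pipeline>
"""

root = ET.fromstring(PIPELINE_XML)
version = root.get("version")
tasks = {}
for t in root.iter("task"):
    tasks[t.get("id")] = {
        "after": (t.get("after") or "").split(),    # dependency edges of the DAG
        "params": (t.get("params") or "").split(),  # per-task parameters
        "doc": (t.text or "").strip(),              # human-readable description
    }
# From a structure like this, the task/parameter tables and the dependency
# graph rendered by Compi Hub can be generated automatically.
```

The key point is that a single XML file carries enough information to render both the documentation tables and the task graph, which is why filling in the metadata section pays off.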
Fig. 3. Screenshots of the public view of a pipeline at Compi Hub. (A) “Overview” tab showing the tasks and parameters. (B) “Dataset” tab showing the information of the associated datasets.
We also encourage developers to provide the following files in order to enhance the public view of their pipelines at Compi Hub:
– A README.md file containing a comprehensive description of the pipeline, as well as instructions on how to use it, that will be shown in the “Readme” tab. In the case of the FastScreen pipeline, we have included a brief description, the links to the external repositories (GitHub and Docker Hub), instructions on how to use it with sample data, and technical notes for developers regarding the implementation of the pipeline.
– A DEPENDENCIES.md file containing a human-readable description of the pipeline dependencies that will appear in the “Dependencies” tab. We recommend specifying the versions for which the pipeline has been tested in this file.
– A LICENSE file containing the license of the project that will appear in the “License” tab.
Additionally, developers can also provide examples of Compi runners15 that can be used with the pipeline, as well as examples of parameter files. The FastScreen pipeline has few parameters and can work with default values, but such examples can be valuable for pipelines with many configuration parameters, as in the case of the Metatax pipeline16. As stated previously, our recommendation for developers is to provide clear instructions on how to run the pipeline, together with the necessary test datasets. Compi Hub shows a dedicated “Dataset” tab (Fig. 3B) for each dataset associated with a pipeline version. This helps users find test data and motivates them to try the pipelines out, something that can be very relevant when reviewers have to test a pipeline associated with a paper.
5 Publishing Pipelines at Compi Hub

Pipelines and pipeline versions can be published at Compi Hub in two different ways. On the one hand, developers can register at Compi Hub and use their own private area (Fig. 4A) to register new pipelines, publish new versions, and edit existing ones. On the other hand, we have included three specific commands in the compi-dk tool to help developers of Compi-based pipelines publish them at Compi Hub without needing a web browser to register the pipelines. The first of these three commands is “hub-init”, which registers a new pipeline at Compi Hub by establishing an alias, a title
Fig. 4. Screenshots of the private user area at Compi Hub. (A) User’s pipeline list, including buttons to view, edit, and remove them. (B) Dialog to register a new pipeline. (C) Pipeline edition screen, where new pipeline versions and datasets can be registered. 15 https://www.sing-group.org/compi/docs/custom_runners.html. 16 https://www.sing-group.org/compihub/explore/5d807e5590f1ec002fc6dd83.
and its visibility (public or private). This is equivalent to using the “Add pipeline” option (Fig. 4B) of the private web area (Fig. 4A). Secondly, the “hub-metadata” command allows developers to save some pipeline metadata (e.g. links to external repositories) in a local file for later submission to the hub. Finally, the third command is “hub-push”, which publishes a new version of the corresponding pipeline by uploading all the necessary files. This is equivalent to using the “Import version” option of the pipeline edition page at Compi Hub (Fig. 4C). In this web page, the “Add version” option opens a dialog with an assistant that guides users through the process of adding the necessary files.
6 Conclusions and Future Work

Our expectation is that Compi Hub and the main guidelines presented in this work will encourage developers of Compi-based pipelines to release them in the same way that we are doing with our own. Regarding future work, our efforts are centered on two aspects: (i) publishing the new pipelines that we are developing for different purposes at Compi Hub and encouraging other researchers to publish their own pipelines; (ii) improving Compi by implementing new features requested by pipeline developers (e.g. the possibility of defining optional parameters or coupling foreach tasks to speed up the pipeline execution).

Acknowledgments. The SING group thanks the CITI (Centro de Investigación, Transferencia e Innovación) from the University of Vigo for hosting its IT infrastructure. This work was partially supported by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) under the scope of the strategic funding ED431C2018/55-GRC Competitive Reference Group and by the Ministerio de Economía, Industria y Competitividad, Gobierno de España under the scope of the PolyDeep project (DPI2017-87494-R). A. Nogueira-Rodríguez is supported by a pre-doctoral fellowship from Xunta de Galicia (ED481A-2019/299).
References

1. Perkel, J.M.: Workflow systems turn raw data into scientific knowledge. Nature 573, 149–150 (2019). https://doi.org/10.1038/d41586-019-02619-z
2. Afgan, E., Baker, D., Batut, B., van den Beek, M., Bouvier, D., Čech, M., Chilton, J., Clements, D., Coraor, N., Grüning, B.A., Guerler, A., Hillman-Jackson, J., Hiltemann, S., Jalili, V., Rasche, H., Soranzo, N., Goecks, J., Taylor, J., Nekrutenko, A., Blankenberg, D.: The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018). https://doi.org/10.1093/nar/gky379
3. Wolstencroft, K., Haines, R., Fellows, D., Williams, A., Withers, D., Owen, S., Soiland-Reyes, S., Dunlop, I., Nenadic, A., Fisher, P., Bhagat, J., Belhajjame, K., Bacall, F., Hardisty, A., Nieva de la Hidalga, A., Balcazar Vargas, M.P., Sufi, S., Goble, C.: The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res. 41, W557–W561 (2013). https://doi.org/10.1093/nar/gkt328
4. Köster, J., Rahmann, S.: Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012). https://doi.org/10.1093/bioinformatics/bts480
5. Di Tommaso, P., Chatzou, M., Floden, E.W., Barja, P.P., Palumbo, E., Notredame, C.: Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017). https://doi.org/10.1038/nbt.3820
6. Lampa, S., Dahlö, M., Alvarsson, J., Spjuth, O.: SciPipe: a workflow library for agile development of complex and dynamic bioinformatics pipelines. GigaScience 8 (2019). https://doi.org/10.1093/gigascience/giz044
7. Graña-Castro, O., López-Fernández, H., Fdez-Riverola, F., Al-Shahrour, F., Glez-Peña, D.: Proposal of a new bioinformatics pipeline for metataxonomics in precision medicine. In: Fdez-Riverola, F., Rocha, M., Mohamad, M.S., Zaki, N., Castellanos-Garzón, J.A. (eds.) 13th International Conference on Practical Applications of Computational Biology and Bioinformatics, pp. 8–15. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-23873-5_2
8. López-Fernández, H., Duque, P., Vázquez, N., Fdez-Riverola, F., Reboiro-Jato, M., Vieira, C.P., Vieira, J.: Inferring positive selection in large viral datasets. In: Fdez-Riverola, F., Rocha, M., Mohamad, M.S., Zaki, N., Castellanos-Garzón, J.A. (eds.) 13th International Conference on Practical Applications of Computational Biology and Bioinformatics, pp. 61–69. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-23873-5_8
9. Baker, M.: 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a
10. Popkin, G.: Data sharing and how it can benefit your scientific career. Nature 569, 445–447 (2019). https://doi.org/10.1038/d41586-019-01506-x
11. Wilkinson, M.D., Dumontier, M., Aalbersberg, Ij.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L.B., Bourne, P.E., Bouwman, J., Brookes, A.J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C.T., Finkers, R., Gonzalez-Beltran, A., Gray, A.J.G., Groth, P., Goble, C., Grethe, J.S., Heringa, J., ’t Hoen, P.A.C., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S.J., Martone, M.E., Mons, A., Packer, A.L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S.-A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M.A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J., Mons, B.: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
12. Blankenberg, D., Von Kuster, G., Bouvier, E., Baker, D., Afgan, E., Stoler, N., the Galaxy Team, Taylor, J., Nekrutenko, A.: Dissemination of scientific software with Galaxy ToolShed. Genome Biol. 15, 403 (2014). https://doi.org/10.1186/gb4161
13. Ewels, P.A., Peltzer, A., Fillinger, S., Alneberg, J., Patel, H., Wilm, A., Garcia, M.U., Di Tommaso, P., Nahnsen, S.: nf-core: community curated bioinformatics pipelines. Bioinformatics (2019). https://doi.org/10.1101/610741
DeepACPpred: A Novel Hybrid CNN-RNN Architecture for Predicting Anti-Cancer Peptides

Nathaniel Lane(B) and Indika Kahanda

Montana State University, Bozeman, MT 59717, USA
[email protected], [email protected]
Abstract. Anti-cancer peptides (ACPs) are a promising alternative to traditional chemotherapy. To aid wet-lab and clinical research, there is a growing interest in using machine learning techniques to help identify good ACP candidates computationally. In this paper, we describe DeepACPpred, a novel deep learning model composed of a hybrid CNN-RNN architecture for predicting ACPs. Using several gold-standard ACP datasets, we demonstrate that DeepACPpred is highly effective compared to state-of-the-art ACP prediction models.
Keywords: Anti-cancer Peptides · Convolutional Neural Networks · Recurrent Neural Networks

1 Introduction
Chemotherapy, one of the primary treatments for cancer, often has side effects that are debilitating for the patient, including vomiting, hair loss, and fatigue [3]. Additionally, because cancer cells reproduce at an unregulated pace, tumors often develop resistance to chemotherapeutic drugs [10]. These facts, taken together with the existence of many different kinds of cancer, show why many kinds of anti-cancer drugs need to be developed. Recent research shows that anti-cancer peptides (ACPs) may offer a promising alternative to chemotherapy [7,8]. These peptides are typically sequences of 5–30 amino acids that exhibit physicochemical properties that help them target cancer cells [4]. To aid wet-lab and clinical researchers, there has been recent interest in developing machine learning algorithms that can help identify good ACP candidates [2,13,16]. Most of these predictive models use features generated from the physicochemical properties of the amino acids that comprise the ACP. We hypothesize that a model that leverages the actual sequence of amino acids may outperform these other models. Our reasoning is based on the fact that the sequence determines the folding structure, and therefore the function, of the peptide; hence, the sequence itself should be more informative than features generated from sequence properties.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 60–69, 2021. https://doi.org/10.1007/978-3-030-54568-0_7

Our proposed method of ACP prediction, DeepACPpred, is based on a hybrid model composed of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). The intuition for this design is that a CNN may be able to extract information from interactions among nearby amino acids, while an RNN may be more capable of analyzing interactions among distant amino acids. We compare the performance of the proposed model against the models from two state-of-the-art ACP prediction studies on several popular ACP gold-standard datasets. The first of these studies analyzed the performance of both Random Forest (RF) and Support Vector Machine (SVM) techniques and, at the time of writing, appears to be the best model for ACP prediction [13]. The other study is the first (and, to our knowledge at the time of writing, the only) RNN-based approach to predicting ACPs [19]. Our experimental results indicate that our proposed model is highly effective for ACP prediction in comparison to the models from these two studies.

The rest of the paper is organized as follows. Section 2 discusses previous work on the problem of ACP prediction. In Sect. 3, we detail our approach to this problem. Section 4 reports and analyzes the performance of our model and compares it to other state-of-the-art methods. Finally, in Sect. 5, we outline future work to improve the techniques built in this work.
2 Related Work
Manavalan et al. [13] present SVM and RF models for predicting whether or not a given sequence of amino acids represents an ACP. Their proposed solution, MLACP, uses features derived from the physicochemical properties of the amino acids, as well as the raw amino acid composition. MLACP is used as a baseline for evaluating the performance of our model, and we use the same datasets and metrics as their study. Since they compared their model to AntiCP [17] and iACP [2], both of which are SVM-based models, and provided metrics against them, we do the same when evaluating DeepACPpred. Boopathi et al. [1] present a similar approach using SVMs. There are many recent applications of deep learning to biological problems [11,12], but only two studies use deep learning for ACP prediction [18,19]. One of them is a very recent study by Yi et al. [19], which, to our knowledge, is the only other paper to directly use RNNs to predict whether or not an input string of amino acids represents an ACP. In their proposed method, ACP-DL, each amino acid in each sequence is converted into a feature vector, and these vectors are fed into a long short-term memory (LSTM) network. The other directly related study, PTPD by Wu et al., uses Word2vec embeddings with a fully convolutional network to tackle this problem [18]. We attempted to integrate concepts from PTPD into DeepACPpred, but ultimately
settled on the hybrid CNN-RNN model presented in this paper. In two other distantly related studies, Grisoni et al. use RNNs [5] and counter-propagation artificial neural networks (CPANNs) [6] to generate ACPs which were then experimentally verified in a wet-lab setting.
3 Methods

3.1 Data
Six datasets are used for training and testing in this project: the Tyagi, LEE, and HC datasets from Manavalan et al. [13]; the ACP240 and ACP740 datasets from Yi et al. [19]; and the anti-fungal peptide (AFP) dataset from Meher et al. [14]. Table 1 shows the balance of positive to negative examples in each dataset. The interested reader is referred to the corresponding papers mentioned above to learn more about the processes by which each dataset was produced.

Table 1. An analysis of the make-up of each dataset.

Dataset  # of Positives  # of Negatives  % of Positives  Reference
Tyagi    187             399             32%             [13]
LEE      422             422             50%             [13]
HC       126             205             38%             [13]
ACP240   129             111             54%             [19]
ACP740   376             364             51%             [19]
AFP      1,497           1,393           52%             [14]
The lengths of the sequences in Tyagi and HC are similar for both positives and negatives (data not shown). Importantly, however, there is a large disparity between the lengths of the amino acid sequences in LEE and those in Tyagi. In Tyagi, the distribution of positive lengths appears bimodal, with a preference for shorter sequences, while the distribution for the negatives is approximately normal with a right skew. In the LEE dataset, the positives are even more strongly skewed toward shorter sequences than in Tyagi, and the negatives follow a broad, almost perfectly bimodal distribution. This is significantly different from the Tyagi dataset.

3.2 DeepACPpred Model
Figure 1 shows an outline of the neural network structure. This model takes the amino acid sequences and feeds them into two different paths. Each path begins with an embeddings layer and ends with a BiLSTM; the difference is that one
path has a 1D convolutional layer in the middle. We hypothesize that the RNN may be able to consider interactions among all amino acids, while the CNN may specialize in interactions among spatially close ones. Regardless of whether or not this is actually how the system behaves, the results of the CNN-RNN hybrid architecture are better than those of each architecture individually (see Sect. 4.3). To be processed by this neural network, each amino acid must be converted into an integer. To indicate the end of a sequence, we cap it off with a unique identifier. Further, each amino acid sequence must have the same length, so we padded them all with another unique integer. When these sequences enter an embeddings layer, each integer is mapped to a unique vector whose values are adjusted along with the rest of the weights in the neural network. We experimented with other encoding schemes, such as the Word2vec [15] algorithm and using vectors of the physicochemical properties of each amino acid as input, but none produced results as good as the embeddings layer's (data not shown).
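The encoding described above can be sketched as follows. The particular integer assignments (1–20 for the amino acids, 21 for the end marker, 0 for padding) are illustrative assumptions, since the paper does not specify them.

```python
# Integer encoding of peptide sequences: map each amino acid to an integer,
# append an end-of-sequence marker, and pad every sequence to a common length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 1..20
END, PAD = 21, 0  # assumed marker values

def encode(sequences, max_len):
    encoded = []
    for seq in sequences:
        ids = [AA_TO_INT[aa] for aa in seq] + [END]   # cap off with end marker
        ids += [PAD] * (max_len - len(ids))           # pad to a fixed length
        encoded.append(ids)
    return encoded
```

An embeddings layer then maps each of these integers to a trainable vector.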
[Fig. 1. Structure of the DeepACPpred model. The input feeds two parallel paths: Embeddings Layer → Dropout Layer → BiLSTM, and Embeddings Layer → Dropout Layer → 1D Convolution → Dropout Layer → BiLSTM. The outputs of the two BiLSTMs pass through a Concatenation Layer, a Dropout Layer, a Dense Layer, another Dropout Layer, and a final Dense Layer with 1 node.]
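The architecture in Fig. 1 can be sketched with Keras roughly as follows. The embedding dimension, filter count, LSTM sizes, dropout rates, and regularization strength are illustrative assumptions; in the paper these hyperparameters are tuned automatically with hyperas.

```python
# Sketch of the hybrid CNN-RNN architecture in Fig. 1 (layer sizes assumed).
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB = 23      # 20 amino acids + end marker + padding + spare (assumed)
MAX_LEN = 100   # assumed padded sequence length

inp = layers.Input(shape=(MAX_LEN,), dtype="int32")

# Path 1: embeddings -> dropout -> BiLSTM
e1 = layers.Embedding(VOCAB, 32)(inp)
d1 = layers.Dropout(0.3)(e1)
r1 = layers.Bidirectional(layers.LSTM(64))(d1)

# Path 2: embeddings -> dropout -> 1D convolution -> dropout -> BiLSTM
e2 = layers.Embedding(VOCAB, 32)(inp)
d2 = layers.Dropout(0.3)(e2)
c2 = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(d2)
d3 = layers.Dropout(0.3)(c2)
r2 = layers.Bidirectional(layers.LSTM(64))(d3)

# Concatenate the two paths, then dense layers down to a single sigmoid node.
merged = layers.Concatenate()([r1, r2])
x = layers.Dropout(0.3)(merged)
x = layers.Dense(64, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```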
The model was implemented with the Keras1 library. The network is compiled with the Adam optimizer and a binary cross-entropy loss function. To help prevent overfitting, regularization is used in addition to dropout. While batch normalization typically produces better performance than either of these techniques [9], we found that in this case it produced inferior results (data not shown). Because of the large number of hyperparameters, this model is difficult to optimize by hand. Therefore, we employed the hyperas2 library to optimize the hyperparameters automatically.

3.3 Transfer Learning
None of the available ACP datasets is particularly large; the largest is LEE, with 844 sequences. This poses a problem, as neural networks tend to perform best when the training set has a large number of samples to draw from. Our solution was to implement transfer learning, a process wherein the neural network is first trained on a dataset that, while not directly pertaining to the problem at hand, is related to the data being tested. In this case, we created a neural network with the structure outlined above and trained it on the AFP dataset presented by Meher et al. [14]. This dataset was chosen because some anti-microbial peptides have anti-cancer properties [4], and it has 1,496 positive and 1,384 negative entries, for a total of 2,880 peptides, which is far more data than is present in any of the ACP datasets. Once the neural network has been trained, its weights are saved and transferred to a neural network with the same overall structure. This new network is then trained and tested on ACP data.
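The weight-transfer step can be sketched as follows. Here `build_model()` is a small stand-in for the full architecture in Fig. 1, and the layer sizes and data placeholders are assumptions.

```python
# Transfer learning sketch: pretrain a network on the larger AFP dataset, then
# copy its weights into an identically structured network that is subsequently
# fine-tuned on ACP data.
import tensorflow as tf

def build_model(vocab=23, max_len=100):
    m = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab, 16),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy")
    return m

afp_model = build_model()
# afp_model.fit(afp_x, afp_y, ...)              # pretrain on the AFP dataset

acp_model = build_model()                       # same overall structure
acp_model.set_weights(afp_model.get_weights())  # transfer the learned weights
# acp_model.fit(acp_x, acp_y, ...)              # fine-tune on ACP data
```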
3.4 Experimental Setup
Because we are comparing our model to those proposed by two other studies, we use a number of different evaluation schemes to mimic theirs as closely as possible. First, to compare with MLACP [13], we optimize our model's hyperparameters based on 10-fold cross-validation performed on the Tyagi dataset. For testing, the model is trained on the Tyagi dataset and evaluated against the LEE and HC datasets. We then compare performance using the metrics MCC, accuracy, sensitivity, and specificity. Ideally, we would also have used MLACP with other datasets for our comparison; however, we were never afforded access to the code. Nevertheless, throughout the course of this work, we have been able to train and test DeepACPpred on many datasets, so we are confident of its generalizability. All experiments were performed on Windows machines with 16 cores (64-bit, 3.7 GHz), 32 GB of memory, and Titan X GPUs. The combined running time for performing cross-validation is about 4.5 h.

1 https://keras.io/
2 https://maxpumperla.com/hyperas/
When comparing with ACP-DL [19], we use 5-fold cross-validation on the ACP240 and ACP740 datasets. Their performance metrics were accuracy, sensitivity, specificity, precision, and MCC, so we evaluate our performance similarly. Even though they do not present F1 scores in their study, they provide sensitivity and precision, from which F1 can be calculated, so we provide those as well. Additionally, since we had access to the source code of this tool, we were able to use it on other datasets including the Tyagi dataset. As part of our experimentation, we show that the hybrid model is superior to the RNN alone and the CNN alone. We do this by optimizing an RNN, a CNN, and the hybrid via 10-fold cross-validation on the Tyagi dataset, and comparing the results. We will evaluate the effect of transfer learning similarly.
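The evaluation metrics used in these comparisons can all be computed from confusion-matrix counts, with F1 derived from sensitivity and precision as described above:

```python
# MCC, accuracy, sensitivity, specificity, precision, and F1 from the counts of
# true positives (tp), true negatives (tn), false positives (fp), and false
# negatives (fn). F1 is derived from sensitivity (recall) and precision.
import math

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall on the positive (ACP) class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f1 = 2 * sensitivity * precision / (sensitivity + precision)
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, precision=precision,
                mcc=mcc, f1=f1)
```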
4 Results and Discussion

4.1 Comparison to MLACP
Table 2 shows the performance of our model against the RFACP and SVMACP models presented in the MLACP study [13]. Each model was evaluated using 10-fold cross-validation on the Tyagi dataset. DeepACPpred is able to outperform their models on sensitivity.

Table 2. Comparison of DeepACPpred against the MLACP models using the Tyagi training set and 10-fold cross-validation. The best values are indicated in bold text.

Algorithm    MCC    Accuracy  Sensitivity  Specificity
RFACP        0.698  0.872     0.722        0.942
SVMACP       0.697  0.872     0.706        0.95
DeepACPpred  0.655  0.854     0.729        0.913
The MLACP study outlined an additional testing procedure, in which their models were trained on the Tyagi dataset and tested on the HC and LEE datasets. With this approach, the authors were able to compare their results against previous models presented in other studies [2,17]. Tables 3 and 4 compare the performance of DeepACPpred against all models presented in the MLACP paper. Our model outperformed all models that had been described prior to their work, outperformed their RF model in sensitivity when tested with the HC dataset, and was competitive in terms of sensitivity on the LEE dataset. Performance on HC tends to be significantly better than on LEE, likely because HC is more similar to the training set than LEE is. Of particular note, the sequence-length differences between LEE and Tyagi may also explain why DeepACPpred shows a greater dip in performance than other methods when evaluated on LEE. The other models are based on SVM and RF machine learning algorithms; while sequence length may alter some of the input features for those methods, DeepACPpred, being based on RNNs, is perhaps more sensitive to sequence length than the other models.
Table 3. Comparison of DeepACPpred against MLACP and other models using the Tyagi training set and HC testing set. The best values are indicated in bold text.

Algorithm              MCC    Accuracy  Sensitivity  Specificity
RFACP                  0.885  0.946     0.889        0.981
DeepACPpred            0.843  0.924     0.944        0.912
SVMACP                 0.750  0.882     0.841        0.907
AntiCP (Model 2) [17]  0.719  0.869     0.813        0.902
AntiCP (Model 1) [17]  0.062  0.402     0.976        0.049
Table 4. Comparison of DeepACPpred against MLACP and other models using the Tyagi training set and LEE testing set. The best values are indicated in bold text.

Algorithm              MCC    Accuracy  Sensitivity  Specificity
RFACP                  0.674  0.827     0.706        0.948
SVMACP                 0.630  0.814     0.775        0.853
DeepACPpred            0.578  0.789     0.774        0.803
AntiCP (Model 2) [17]  0.505  0.752     0.744        0.761
iACP [2]               0.412  0.706     0.697        0.716
AntiCP (Model 1) [17]  0.096  0.527     0.938        0.116

4.2 Comparison to ACP-DL
Table 5 depicts the performance of DeepACPpred against ACP-DL. To allow as direct a comparison with their study as possible, we used 5-fold cross-validation on their datasets (ACP240 and ACP740) as well as on Tyagi. DeepACPpred outperformed ACP-DL in almost every performance metric except for specificity on the ACP240 dataset. The difference in performance on the ACP740 dataset was statistically significant, with a p-value of 0.027 using a paired t-test.

4.3 Comparing the Hybrid Model to Individual RNN and CNN Models
Table 6 compares the results of 10-fold cross-validation on the Tyagi dataset when using the RNN, the CNN, and the hybrid CNN + RNN model. This demonstrates that the combined model performs better than either network alone and supports our hypothesis that the networks train on separate aspects of the sequences.

4.4 Impact of Transfer Learning
Table 7 shows the results of performing 10-fold cross-validation on the Tyagi dataset with and without transfer learning from the AFP dataset. This confirms our hypothesis that using anti-fungal peptides as a transfer learning set greatly impacts the success of the model.
Table 5. Comparison of DeepACPpred against ACP-DL using the ACP240 and ACP740 training sets with 5-fold cross-validation. The best values for each dataset are indicated in bold text.

Algorithm    Dataset  MCC    Accuracy  Sensitivity  Specificity  Precision  F1
ACP-DL       ACP240   0.714  0.846     0.899        0.803        0.824      0.862
DeepACPpred  ACP240   0.716  0.858     0.880        0.840        0.862      0.869
ACP-DL       ACP740   0.631  0.825     0.854        0.826        0.806      0.824
DeepACPpred  ACP740   0.706  0.850     0.853        0.850        0.856      0.850
ACP-DL       Tyagi    0.504  0.786     0.608        0.886        0.716      0.657
DeepACPpred  Tyagi    0.626  0.815     0.743        0.832        0.724      0.691
Table 6. Comparing base RNN, base CNN, and combined CNN + RNN models using the Tyagi training set and 10-fold cross-validation. The best values are indicated in bold.

Model      MCC    Accuracy  Sensitivity  Specificity
RNN        0.642  0.844     0.783        0.871
CNN        0.575  0.816     0.679        0.883
RNN + CNN  0.655  0.854     0.729        0.913
Table 7. Evaluation of the effects of transfer learning using the Tyagi training set and 10-fold cross-validation. The best values are indicated in bold text.

                  MCC    Accuracy  Sensitivity  Specificity
Without Transfer  0.539  0.793     0.711        0.832
With Transfer     0.655  0.854     0.729        0.913

5 Conclusions and Future Work
In this work, we introduce DeepACPpred, a novel CNN-RNN hybrid architecture for ACP prediction. As far as we know, DeepACPpred is only the second model to use recurrent neural networks to predict ACPs. It offers a great improvement over the previous RNN model and is also competitive with the other current state-of-the-art models. Below, we identify avenues for future research. First, in this project only the amino acid sequence itself was used as input to the neural network. Adding a simple (i.e., neither recurrent nor convolutional) ANN pathway that takes additional features for the overall amino acid sequence may improve the results. One particular set of features that we have in mind is a subset of the annotations from the Gene Ontology (GO). Another potential set of features could be the physicochemical properties of all amino acids in a peptide sequence. Second, it is possible that other transfer learning sets could have more success than the anti-fungal peptides. Finally, for DeepACPpred to be useful to the medical research community, a web server must
be established. At present, no such server exists for this project. We hope to be able to implement one in the future. Acknowledgments. This work was made possible by the publicly available datasets by Balachandran et al. [13], Yi et al. [19], and Meher et al. [14].
References

1. Boopathi, V., Subramaniyam, S., Malik, A., Lee, G., Manavalan, B., Yang, D.C.: mACPpred: a support vector machine-based meta-predictor for identification of anticancer peptides. Int. J. Mol. Sci. 20(8), 1964 (2019). https://doi.org/10.3390/ijms20081964
2. Chen, W., Ding, H., Feng, P., Lin, H., Chou, K.C.: iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 7(13), 16895–16909 (2016). https://www.ncbi.nlm.nih.gov/pubmed/26942877
3. Coates, A., Abraham, S., Kaye, S., Sowerbutts, T., Frewin, C., Fox, R., Tattersall, M.: On the receiving end — patient perception of the side-effects of cancer chemotherapy. Eur. J. Cancer Clin. Oncol. 19(2), 203–208 (1983). https://doi.org/10.1016/0277-5379(83)90418-2
4. Gaspar, D., Veiga, A.S., Castanho, M.A.: From antimicrobial to anticancer peptides. A review. Front. Microbiol. 4, 294 (2013). https://doi.org/10.3389/fmicb.2013.00294
5. Grisoni, F., Neuhaus, C.S., Gabernet, G., Muller, A.T., Hiss, J.A., Schneider, G.: Designing anticancer peptides by constructive machine learning. ChemMedChem 13(13), 1300–1302 (2018). https://doi.org/10.1002/cmdc.201800204
6. Grisoni, F., Neuhaus, C.S., Hishinuma, M., Gabernet, G., Hiss, J.A., Kotera, M., Schneider, G.: De novo design of anticancer peptides by ensemble artificial neural networks. J. Mol. Model. 25(5), 112 (2019). https://doi.org/10.1007/s00894-019-4007-6
7. Harris, F., Dennison, S.R., Singh, J., Phoenix, D.A.: On the selectivity and efficacy of defense peptides with respect to cancer cells. Med. Res. Rev. 33(1), 190–234 (2013). https://doi.org/10.1002/med.20252
8. Hoskin, D.W., Ramamoorthy, A.: Studies on anticancer activities of antimicrobial peptides. Biochim. Biophys. Acta (BBA) - Biomembranes 1778(2), 357–375 (2008). https://doi.org/10.1016/j.bbamem.2007.11.008
9. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift (2015)
10. Longley, D., Johnston, P.: Molecular mechanisms of drug resistance. J. Pathol. 205(2), 275–292 (2005). https://doi.org/10.1002/path.1706
11. Mahmud, M., Kaiser, M.S., Hussain, A., Vassanelli, S.: Applications of deep learning and reinforcement learning to biological data. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2063–2079 (2018)
12. Mahmud, M., Kaiser, M.S., Hussain, A.: Deep learning in mining biological data (2020)
13. Manavalan, B., Basith, S., Shin, T.H., Choi, S., Kim, M.O., Lee, G.: MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget 8(44), 77121–77136 (2017)
14. Meher, P.K., Sahu, T.K., Saini, V., Rao, A.R.: Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou's general PseAAC. Sci. Rep. 7, 42362 (2017)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 3111–3119. Curran Associates, Inc. (2013)
16. Tyagi, A., Kapoor, P., Kumar, R., Chaudhary, K., Gautam, A., Raghava, G.P.S.: In silico models for designing and discovering novel anticancer peptides. Sci. Rep. 3, 2984 (2013)
17. Tyagi, A., Kapoor, P., Kumar, R., Chaudhary, K., Gautam, A., Raghava, G.P.S.: In silico models for designing and discovering novel anticancer peptides. Sci. Rep. 3(1), 2984 (2013). https://doi.org/10.1038/srep02984
18. Wu, C., Gao, R., Zhang, Y., De Marinis, Y.: PTPD: predicting therapeutic peptides by deep learning and word2vec. BMC Bioinform. 20(1), 456 (2019). https://doi.org/10.1186/s12859-019-3006-z
19. Yi, H.C., You, Z.H., Zhou, X., Cheng, L., Li, X., Jiang, T.H., Chen, Z.H.: ACP-DL: a deep learning long short-term memory model to predict anticancer peptides using high-efficiency feature representation. Mol. Ther. Nucleic Acids 17, 1–9 (2019). https://doi.org/10.1016/j.omtn.2019.04.025
Preventing Cardiovascular Disease Development Establishing Cardiac Well-Being Indexes

Ana Duarte and Orlando Belo(B)

ALGORITMI R&D Centre, University of Minho, Campus of Gualtar, 4710-057 Braga, Portugal
[email protected], [email protected]
Abstract. Cardiovascular disease is responsible for an alarming number of deaths worldwide; nowadays, more people die of cardiovascular disease than of any other type of disease. A considerable number of these deaths are caused by preventable risk factors. Thus, it is necessary to invest in cardiovascular disease prevention and take urgent action to reverse this scenario, reducing risky behaviours. The availability of a tool capable of regularly monitoring cardiac well-being indexes can be an important means of sensitizing the population and preventing the appearance of preventable risk factors. This paper presents and discusses the implementation of a semi-automatic system capable of returning cardiac well-being indexes. The system allows for evaluating the individual indexes of each user over time, considering the influence of past values, as well as for observing global statistics, which can be useful for public health decisions.

Keywords: Decision support systems · Data mining · Analytical systems · Heart diseases prevention · Well-being indexes
1 Introduction

It is not particularly new that Cardiovascular Disease (CVD) represents one of the biggest health issues today, as it is one of the main causes of death and many of its risk factors are preventable. However, in the first decades of the last century, very little was known about this disease. The most important medical study in the context of CVD, the Framingham Heart Study, was initiated in 1948 and has since allowed the identification of the main risk factors and their relationship to the onset of the disease [1]. Just over a decade later, in 1961, this study identified age, gender, hypertension, and total cholesterol as some of the main risk factors for this disease [2]. Later, a relationship was also found between the development of the disease and smoking, physical activity, and diabetes mellitus [3]. With the discovery of some of the main risk factors, in 1998, Wilson et al. [4] proposed a calculation methodology, based on the sum of scores associated with risk factors, which allowed for determining the risk of developing a CVD. However, the suggested risk values are calculated from a fixed formula, without considering past measurements, and their meaning is hard for common users to understand.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 70–79, 2021. https://doi.org/10.1007/978-3-030-54568-0_8
Given the urgency of greater awareness of the seriousness of CVD, and in the same way that there is an existing concern with carrying out regular clinical analyses, there should also be mechanisms able to provide, in a simple and intuitive way, a self-assessment and continuous monitoring of the quality of users' cardiovascular health. From this context emerged the idea of implementing a cardiac well-being index, capable of converting clinical data provided by users into an indicator of their cardiovascular health status. This led to a challenging question: how can an index that is easily understandable by the general population be built, in order to effectively depict cardiovascular health and to allow its continuous monitoring? One of the most promising ways of approaching health-related topics today is through Business Intelligence (BI) techniques. According to Chen et al. [5], these techniques already make strong contributions in areas such as clinical decision making, patient-centred therapy, and knowledge bases for health and disease. This paper presents and discusses a decision support system that takes advantage of the potential of BI techniques for establishing cardiac well-being indexes over time according to different perspectives. In this way, the system can create valuable benefits for common users, health professionals, and even public health decision-makers. The rest of the paper is organized as follows: Sect. 2 briefly discusses the limitations of cardiac risk calculators; Sect. 3 describes how we established cardiac well-being indexes, covering their calculation process and the analytical system that supports their storage and analysis; Sect. 4 analyses the results we obtained, presenting some of them using personalised analytical dashboards; and Sect. 5 presents some brief conclusions and reveals some research lines for future work.
2 Limitations of Cardiac Risk Calculators

Cardiac risk calculators, such as [6] and [7], are among the existing tools that most closely resemble the proposed cardiac well-being index. In brief, based on values entered by users, these calculators typically use formulas that consider the main known risk factors related to CVD and return a percentage risk value. However, they are not useful tools for an efficient assessment of users' cardiovascular health status. One of their main disadvantages is that they return results that are meaningless to the general population. For example, for most people, a 10% chance of developing a CVD is not enough to understand whether their risk is high or low. This occurs because calculators rarely return values greater than 25%, which is a small fraction of the scale (0–100%). Additionally, these calculators neglect the past measurements of the same users, considering only the current values, and use "rigid" calculation formulas, which remain the same over time. They also do not enable the establishment of complex relationships between multiple attributes, making them rather limited tools. To overcome these limitations and to create cardiac well-being indexes, the combination of BI techniques, namely Data Mining (DM), Data Warehousing, and dashboards, plays a promising role. First, DM algorithms can be improved over time, offering a flexible calculation methodology, and they are able to consider complex relationships between multiple attributes. In the context of CVD, there are already several studies reported in the literature that prove the efficiency of these techniques. For example,
in 2010, Srinivas et al. [8] conducted a study using Decision Tree, Neural Network, Bayesian Model, and Support Vector Machine techniques to predict heart attacks in coal mining regions. More recently, in 2017, Kim and Kang [9] used the Neural Network technique to build a predictive model for determining the main risk factors of CVD. In turn, a Data Warehouse (DW) repository enables archiving the records of all users in a properly organized way. Thus, the indexes can be calculated considering past values, and it is possible to take advantage of the whole dataset to analyse the cardiovascular health of the population from a global perspective. Finally, for presenting results, dashboards are useful tools in any activity, since they facilitate knowledge apprehension through the creation of intuitive graphs. As an example of their applicability in the health context, Badgeley et al. [10] developed a set of real-time dashboards to simplify the visualization and analysis of biomedical, healthcare, and wellness data.
3 Establishing Cardiac Well-Being Indexes

3.1 Calculating the Indexes

The first step of the implementation was the creation of a mechanism capable of estimating cardiac well-being index values based on DM techniques. In this case, DM techniques are a solid alternative to traditional calculation methods, as they take into account the interaction between multiple attributes. The choice of the index calculation algorithm is a crucial step, since in a clinical context the returned values must simultaneously offer high accuracy and high sensitivity. The dataset used for this process was adapted from Kaggle [11] and comprised a total of 65,000 instances and 18 attributes for training the model, and 5,000 records for the test phase. The classification attribute is included in the 18 parameters and indicates whether a record relates to an individual who suffers from CVD or to one who does not have the disease. Initially, for both numerical and nominal values, their distribution and basic statistics were analysed. Table 1 presents the basic statistics for the numerical attributes.

Table 1. Basic statistics of numerical attributes.

Numeric variable  Missing values  Mean   Standard deviation  Min value  Max value
SmokeYears        34956           5.4    12.0                0          50
Cholesterol       0               170.1  52.5                100        320
FastGlucose       0               119.6  56.4                80         400
Age               0               52.8   6.8                 29         64
NumCigaret        34956           7.1    14.6                0          50
High BP           0               128.9  159.6               −150       16020
Low BP            0               96.6   188.1               −70        11000
BMI               0               27.6   6.1                 3.5        298.7
During the data exploration phase, some errors, incoherencies, and inconsistencies requiring prior treatment were observed. Thus, using the Pentaho Data Integration (PDI) software [12], a set of steps responsible for cleaning the data was defined. The implemented workflow considered all the necessary types of treatment, enabling the execution of an automatic process in case the dataset is updated with new records. The last step uses the Knowledge Flow plugin, which establishes an automatic communication with Weka [13]. It is in this last software that the different DM candidate algorithms supporting the index calculation are modelled and created. Before creating the DM algorithms, two different scenarios were considered to evaluate which set of attributes gives the best results. The first scenario consisted of all the attributes resulting from the data treatment performed, whereas the second one was obtained by filtering those attributes using Weka's Attribute Selection. This filter was configured with the attribute evaluator "CfsSubsetEval", which considers the redundancy between attributes, and the Best-First search method, which returns the list of attributes that present, locally, the greatest predictive capacity [14]. For selecting the techniques, the most commonly used predictive algorithms for CVD were gathered. Although there is no predominance in the results that would enable a hierarchical assessment of the techniques, most studies in this area use the same modelling algorithms, which have already proved efficient in solving this type of problem. Thus, the selected algorithms were J48, Random Forest (RF), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Multilayer Perceptron (MLP).
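As an illustration of the kind of treatment involved, consider the anomalies visible in Table 1 (negative blood-pressure readings, a maximum systolic pressure of 16020). The actual workflow was built in PDI, but an equivalent cleaning step can be sketched in pandas; the column names, thresholds, and the assumption that missing smoking values indicate non-smokers are all illustrative, not taken from the paper.

```python
# Illustrative pandas equivalent of part of the PDI cleaning workflow.
# Column names and plausibility thresholds are assumptions.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # drop physiologically implausible blood-pressure readings
    out = out[out["HighBP"].between(60, 260) & out["LowBP"].between(30, 200)]
    # systolic pressure must exceed diastolic pressure
    out = out[out["HighBP"] > out["LowBP"]]
    # missing smoking attributes are assumed to indicate non-smokers
    out[["SmokeYears", "NumCigaret"]] = out[["SmokeYears", "NumCigaret"]].fillna(0)
    return out
```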
To optimize the parameters of each technique, 870 different simulations were carried out to determine, for each one, the values corresponding to the highest accuracy and the lowest false-negative rate. Table 2 summarizes the optimal parameter values found for each technique.

Table 2. Applied parameters for each DM technique.
J48: ConfidFactor: 0.10; MinNumObj: 20; RedErrorPrun: T; Unpruned: F; NumFolds: Scenario I: 6, Scenario II: 3; MDLcorrection: T
RF: NumTrees: 80; MaxDepth: 25; NumFeat: 0; BreakTies: T
NB: KernelE: F; SupervD: T
KNN: KNN: 20; CrossValid: F; DistWeight: No; SearchA: KDTree
MLP: HiddenLayer: a; TrainTime: 100; NominalToBin: T; Decay: F; LR: 0.1; Momentum: 0.05
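The tuning loop behind those 870 simulations can be sketched as a grid search that keeps the combination with the highest accuracy and, on ties, the lowest false-negative rate. The grids and the toy evaluation function below are illustrative, not the actual Weka runs.

```python
from itertools import product

def tune(grid, evaluate):
    # Enumerate every parameter combination; keep the one with the best
    # (accuracy, -false_negative_rate) pair.
    best_params, best_key = None, None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        accuracy, fn_rate = evaluate(params)
        key = (accuracy, -fn_rate)
        if best_key is None or key > best_key:
            best_params, best_key = params, key
    return best_params

# Toy evaluator that happens to prefer 80 trees and depth 25,
# mirroring the RF row of Table 2; purely illustrative.
def toy_evaluate(p):
    accuracy = (0.70 + (0.05 if p["NumTrees"] == 80 else 0.0)
                     + (0.02 if p["MaxDepth"] == 25 else 0.0))
    fn_rate = 0.30 - (0.05 if p["NumTrees"] == 80 else 0.0)
    return accuracy, fn_rate

best = tune({"NumTrees": [40, 80, 120], "MaxDepth": [10, 25]}, toy_evaluate)
```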
The predictive models were ranked considering the average of their accuracy and sensitivity values. Applying the developed system to the case study, the results revealed that MLP (scenario I) corresponded to the technique with the highest score (0.7265) and, therefore, to the most appropriate model to be used. The ranking obtained, ordered from best to worst, was MLP > RF > J48 > ASJ48 > ASKNN > ASMLP > KNN > NB > ASNB, where the prefix "AS" indicates the models created using the set of attributes from scenario II. For validating the results, the test data
74
A. Duarte and O. Belo
from the dataset, containing 5000 records, were used. For these test data, the sensitivity and accuracy values remained identical, which confirmed the results obtained. Figure 1 provides an overview of the main steps taken.

Fig. 1. Methodology to select the best-fit DM model to support the index calculation. (The figure shows the datasets feeding a workflow: data cleansing; 10-fold cross-validation with attribute selection, the latter applied only in Scenario II; construction of a model for each technique; recording of accuracy and false-negative values; selection of the best-fit model; and evaluation of the selected model with the test data.)
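The ranking step described above can be sketched as scoring each model by the average of its accuracy and sensitivity and sorting in descending order. Only the winning MLP score (0.7265) comes from the text; the other numbers below are invented for illustration.

```python
def rank_models(results):
    # results: model name -> (accuracy, sensitivity)
    scored = [(name, (acc + sens) / 2) for name, (acc, sens) in results.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

results = {
    "MLP": (0.727, 0.726),   # scenario I; average 0.7265, as in the text
    "RF":  (0.72, 0.70),     # illustrative values
    "J48": (0.71, 0.69),
    "NB":  (0.62, 0.60),
}
ranking = rank_models(results)
```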
The selected DM technique returns the probability of having a CVD, but it does not yet return an index value. In this case, it was established that the cardiac well-being indexes vary between −5 and 5 and that each value range corresponds to a colour with the following meaning:

– Green (values between 2.5 and 5) – the risk of developing CVD is reduced. Individuals should maintain their habits.
– Yellow (values between 0 and 2.5) – the risk of developing CVD is moderate and, therefore, some regular control is necessary, together with behaviours that promote the reduction of risk factors.
– Orange (values between −2.5 and 0) – the risk of developing CVD is high. In this situation, individuals should meet a CVD specialist to execute more specific diagnostic tests.
– Red (values between −5 and −2.5) – the risk of developing CVD is very high. CVD specialists should be urgently sought, and diagnostic tests performed.

This scale allows a more intuitive understanding, not only because it comprises a reduced range of values, but also because negative values can be promptly associated with poor cardiovascular health by common users. To convert probabilistic values into indexes, the probabilities 0 and 100% were associated with the index values 5 and −5, respectively, and, after several simulations, the threshold between the colours orange and red (index value of −2.5) was associated with a probability of 55%. Considering these three points, an exponential correspondence curve between the probability of CVD development and the index value was created (Eq. 1). This type of curve was chosen because a small increase in the risk of developing the disease is reflected in a sharp decrease in the value of the index.

f(x) = −7.052755 + 12.05276 × e^(−1.77011 x)    (1)
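Eq. (1) and the colour scale above can be sketched directly; the coefficients and thresholds come from the text, while the function names are ours. The function maps a CVD probability in [0, 1] onto an index in roughly [−5, 5], hitting the three anchor points described above (0% → 5, 100% → −5, 55% → −2.5).

```python
import math

def index(prob):
    # Eq. (1): exponential correspondence between CVD probability and index.
    return -7.052755 + 12.05276 * math.exp(-1.77011 * prob)

def colour(idx):
    if idx >= 2.5:
        return "green"    # reduced risk
    if idx >= 0:
        return "yellow"   # moderate risk
    if idx >= -2.5:
        return "orange"   # high risk
    return "red"          # very high risk
```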
To validate the mechanism developed, considering the dataset under study, it was observed that the majority of people with the disease were associated with red indexes, and that less than 2% of those individuals fell in the green zone.

3.2 Analytical System Implementation

To implement the analytical system, a DW capable of storing cardiac well-being indexes was designed and developed. A set of sequential steps was carried out, beginning with the planning and the identification of the project's requirements. After that, the system was modelled using the Kimball 4-step methodology, and the conceptual schema depicted in Fig. 2 was structured. Subsequently, the sources were identified, and their quality was tested. The sources used contain, respectively, personal and clinical data from different users (Source 1), clinical data with frequent variation, such as blood pressures and daily calories spent (Source 2), and data concerning the administrative division of Portugal (Source 3). Sources 1 and 2 were synthesized specifically for this purpose, whereas Source 3 refers to real data. Contrary to Source 1, data from Source 2 are highly volatile and can be collected without the direct intervention of the users, for example through electronic devices.
Fig. 2. The conceptual view of the multidimensional schema of the DW.
In Fig. 3, we can see a BPMN (Business Process Modelling Notation) scheme that summarizes the Extraction, Transformation and Loading (ETL) steps for each dimension and fact table. The sources associated with the calendar and district dimensions are static and, therefore, their load process is only executed once. On the other hand, in the case of the "User" dimension and the "Well-Being" fact table, beyond the first load, subsequent load processes must also be considered. For extracting data associated with the fact table, a mechanism was implemented that captures only records not extracted in previous ETL processes. In contrast, for the "User" dimension, the change data capture mechanism adopted was based on triggers. As this dimension is a Slowly Changing Dimension (SCD), the use of triggers allows capturing, in addition to new users, all records that have been removed or modified. For this dimension, it was considered relevant to store the records' history, so that queries to the DW could also take into account old data, which can be useful for more detailed analyses
of the behaviour of the users' indexes. Thus, it was found that the best solution would be to consider this SCD as being of type 4 [15], because it allows the storage of historical data and presents an optimized computational performance when compared to the other types. Thus, a new table was added to store the historical records of the attributes that have changed. It is important to note that the fact table can only be loaded after the dimensions have been properly loaded into the DW.
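The type-4 SCD behaviour described above can be sketched as a current table plus a separate history table that receives the old row whenever an attribute changes. The table layout and column names are illustrative, not the actual DW schema.

```python
from datetime import date

users = {}            # current "User" dimension: user_id -> attributes
users_history = []    # history table: (user_id, changed_at, old attributes)

def upsert_user(user_id, attributes, changed_at):
    # On change, move the old version to the history table (the role the
    # triggers play in the text), then overwrite the current row.
    if user_id in users and users[user_id] != attributes:
        users_history.append((user_id, changed_at, users[user_id]))
    users[user_id] = attributes

upsert_user(4, {"district": "Braga", "smoker": True}, date(2020, 1, 1))
upsert_user(4, {"district": "Braga", "smoker": False}, date(2020, 6, 1))
```

The current table keeps only the latest state, so queries stay fast, while the history table preserves every superseded version for the detailed analyses mentioned above.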
Fig. 3. BPMN schema regarding DW population methodology.
The ETL process and the determination of the best DM algorithm can be performed at different execution times. In this case, a daily periodicity for the ETL process and a monthly periodicity for the DM process were considered. Whenever the job responsible for the DM process is executed, the name of the model with the highest ranking is saved as a variable, serving as the basis for the index calculation process, which is carried out in the "Index Calculation" step illustrated in Fig. 3. This periodical process enables a continuous improvement of the model, since data can be updated over time. For displaying the results, the aim was to provide individual and global indexes of cardiac well-being. The individual indexes should reflect a value weighted over time, maintaining the past measurements, so that results are not compromised by the existence of outliers. In this way, extreme values are attenuated and the history of the previous index values is preserved for the determination of the current value. Thus, it was considered that the value that best represents the cardiovascular health status of a given individual is the last weighted index value, which takes into account, for each point, all the previous index values. In order to make this index available through dashboards, it was necessary to create a multidimensional database, an OLAP (On-Line Analytical Processing) cube, following the DW schema presented in Fig. 2 and adding a calculated member that acts as a new measure. For creating the new calculated member, we used the "LinRegPoint" function, which calculates the linear regression line and returns the value of Y (weighted index) for a given X (date). Finally, with respect to the global index, a calculated member returning the average of the users' indexes was created. After its implementation, the constructed cube was imported into Power BI [16] to make the results available.
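The weighted-index measure can be sketched in the spirit of the MDX LinRegPoint function used above: fit an ordinary least-squares line through the past index values and evaluate it at the latest date. The data points below are invented for illustration.

```python
def lin_reg_point(xs, ys, x):
    # Ordinary least-squares fit, evaluated at x (analogous to MDX LinRegPoint).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    intercept = my - slope * mx
    return intercept + slope * x

# Hypothetical daily index measurements (day number, index value); the
# weighted index shown on a dashboard would be the regression value at
# the latest day, smoothing out single-day outliers.
days = [1, 2, 3, 4]
indexes = [3.0, 3.4, 3.2, 3.8]
weighted = lin_reg_point(days, indexes, days[-1])
```

Because every past point contributes to the fitted line, one extreme measurement shifts the reported value only slightly, which is the attenuation behaviour described in the text.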
4 Analysing the Indexes

Figure 4 presents the dashboards that allow monitoring individual index values over time. One of the biggest advantages of this type of dashboard is the high degree of interactivity it provides to its users. For example, if a user wants to visualize the index evolution on an annual basis, the graph itself allows navigating intuitively through the hierarchy, without the need to create another graph for that purpose. Moreover, if, for example, the user clicks on a specific date, the display is instantly updated, returning the value and colour associated with the selected date. In the specific case of Fig. 4, it can be seen that the weighted index of the user under analysis is high (3.68) and is, therefore, associated with a green colour. Over time, it is possible to analyse the evolution of the index values, and this control can also serve as an additional motivation for users to increasingly try to improve their results.
Fig. 4. Dashboards for evaluating the individual’s user index over time.
In addition to the dashboards related to the visualization of the evolution of individual users' indexes, a second type of dashboard was built on a separate page. These dashboards are illustrated in Fig. 5 and are aimed at decision-makers who intervene in health-related matters. By gathering the main information of all users, the dashboards implemented allow for the assessment of cardiac well-being indexes at a global level, which is useful for statistical analysis. The global index can also be presented according to different parameters, such as the global index by "Calendar", "Gender", "Gender and District", or "Location". Similarly to the previous page, if a more detailed analysis is required, by clicking on the parameters or modifying the hierarchical level, all dashboards are immediately updated according to this interactive analysis perspective. Additionally, in certain situations, such as when abrupt variations in the value of individuals' indexes are detected, it may be of interest to professionals to consult the users' history tables, to understand whether any changes that occurred significantly influenced this variation. To that end, a new page was also built that allows these professionals to consult the user history tables with modified records. Figure 6 shows the history table content for a particular user (in this case, the user with id 4).
Fig. 5. Dashboards for evaluating global index values.
Fig. 6. The users’ history table.
5 Conclusions and Future Work

This paper described a tool for the development of indexes able to continuously monitor the cardiovascular health of its users, considering values weighted over time. This work allowed verifying that it is possible to implement a reliable system for the elaboration of cardiac well-being indexes based on clinical data provided by users, using integrated BI tools. The availability of easily understandable results in the form of dashboards can encourage common users to adopt healthier habits, thereby minimizing the incidence of CVD. Furthermore, these tools can also act as decision support systems for doctors and decision-makers in health domains. For example, at the level of public health decision-makers, they can be useful to support the promotion of awareness campaigns focusing on the prevention of CVD in regions with low index values. In terms of limitations, one of the difficulties encountered was the lack of real data. Throughout the implementation process, Web and synthesized data were used, but to reflect the cardiac well-being of the population, real data should have been used instead. In future work, it would be interesting to explore the integration of mobile electronic devices, such as smartbands, for collecting real data from users, such as the daily amount of calories burned and the average blood pressure values. The inclusion of more attributes associated with known risk factors for developing CVD, such as ethnicity and nutrition, could also be an interesting topic to study. In terms of security, different access levels for each user type should also be defined for dashboard visualization. To confirm its validity, it would be important to test the implemented system in a real large-scale application and analyse, for a considerable amount of time, the evolution of
the index values. Hence, the success of its implementation could be verified by assessing whether the provision of this tool significantly increases the indexes. Acknowledgement. This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020.
References
1. Mahmood, S.S., Levy, D., Vasan, R.S., Wang, T.J.: The Framingham Heart Study and the epidemiology of cardiovascular disease: a historical perspective. Lancet 383, 999–1008 (2014)
2. Kannel, W.B., Dawber, T.R., Kagan, A., Revotskie, N., Stokes, J.: Factors of risk in the development of coronary heart disease - six-year follow-up experience. The Framingham study. Ann. Intern. Med. 55, 33–50 (1961)
3. Andersson, C., Johnson, A.D., Benjamin, E.J., Levy, D., Vasan, R.S.: 70-year legacy of the Framingham Heart Study. Nat. Rev. Cardiol. (2019)
4. Wilson, P.W.F., D'Agostino, R.B., Levy, D., Belanger, A.M., Silbershatz, H., Kannel, W.B.: Prediction of coronary heart disease using risk factor categories. Circulation 97, 1837–1847 (1998)
5. Chen, H., Chiang, R.H.L., Storey, V.C.: Business intelligence and analytics: from big data to big impact. MIS Q. 36, 1165–1188 (2012)
6. QRISK3. https://www.qrisk.org/three/. Accessed 13 Mar 2020
7. Istituto Superiore di Sanità: Il Progetto Cuore. https://www.cuore.iss.it/valutazione/calc-rischio. Accessed 13 Mar 2020
8. Srinivas, K., Rao, G.R., Govardhan, A.: Analysis of coronary heart disease and prediction of heart attack in coal mining regions using data mining techniques. In: 5th International Conference on Computer Science and Education, pp. 1344–1349 (2010)
9. Kim, J.K., Kang, S.: Neural network-based coronary heart disease risk prediction using feature correlation analysis. J. Healthc. Eng. 2017, 13 (2017)
10. Badgeley, M.A., Shameer, K., Glicksberg, B.S., Tomlinson, M.S., Levin, M.A., McCormick, P.J., Kasarskis, A., Reich, D.L., Dudley, J.T.: EHDViz: clinical dashboard development using open-source technologies. BMJ Open 6(3), 1–11 (2016)
11. Cardiovascular Disease Dataset. https://www.kaggle.com/sulianova/cardiovascular-disease-dataset. Accessed 05 Feb 2020
12. Hitachi Vantara. https://www.hitachivantara.com/en-us/home.html. Accessed 05 Feb 2020
13. Weka. https://www.cs.waikato.ac.nz/ml/weka/. Accessed 05 Feb 2020
14. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, Burlington (2017)
15. Santos, V., Belo, O.: No need to type slowly changing dimensions. In: Proceedings of the IADIS International Conference Information Systems, pp. 129–136 (2011)
16. Power BI. https://powerbi.microsoft.com/pt-pt/. Accessed 05 Feb 2020
Fuzzy Matching for Cellular Signaling Networks in a Choroidal Melanoma Model

Adrián Riesco1, Beatriz Santos-Buitrago2, Merrill Knapp3, Gustavo Santos-García4(B), Emiliano Hernández Galilea4, and Carolyn Talcott5

1 Universidad Complutense de Madrid, 28040 Madrid, Spain [email protected]
2 Seoul National University, Seoul 08826, South Korea [email protected]
3 SRI International, Menlo Park, CA 94025, USA [email protected]
4 University of Salamanca, 37007 Salamanca, Spain {santos,egalilea}@usal.es
5 Stanford University, Stanford, CA 94305, USA [email protected]
Abstract. Symbolic systems biology aims to explore biological processes as whole systems instead of independent elements. The goal is to define formal models that capture biologists' reasoning. Pathway Logic (PL) is a system for the development of executable formal models of biomolecular processes. PL uses forwards/backwards collections that assemble a connected set of rules from a rule knowledge base in order to model a system specified by an initial state. For this to succeed, the rules must contain component states that have the same level of detail, while at the same time the knowledge base must capture as much detail as possible. In this paper, we propose a new way to perform matching in the rewriting process in the Maude language. We introduce the basic concept of fuzzy match, or fuzzy instantiates to, which we will use to check requirements imposed by controls in the forward collection and to check whether the change part of a rule applies forwards or backwards.

Keywords: Symbolic systems biology · Choroidal melanoma · Signal transduction · Pathway Logic · Rewriting logic · Maude language
1 Introduction
Complex biological processes can be defined through formal models closer to the biologists' mindsets. It is equally important to be able to compute with, analyze, and argue about these networks of biomolecular interactions at multiple levels

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 80–90, 2021. https://doi.org/10.1007/978-3-030-54568-0_9
Fuzzy Matching for Cellular Signaling Networks
81
of detail. A computational analysis of cellular signaling networks has been presented by using detailed kinetic models in order to simulate responses to specific stimuli [1]. Symbolic models provide a language which allows us to represent system states as well as change mechanisms such as reactions. Biological interactions can be handled with rule-based modelling in a natural way. In addition, the underlying combinatorial complexity of rule-based systems can cover all the important subjects of these biological interactions [2,4,10]. An executable model defines rewrite rules and system states that specify the ways in which the state may progress and change in time. From the definition of a model, we can define specific system configurations and carry out many kinds of meta-analyses. Moreover, process characteristics can be specified in associated logical languages and can be verified by using formal analysis tools. Our underlying purpose is to augment PL with algorithms that can do forwards and backwards collections by using fuzzy matching to generate rule instances that construct a connected model from knowledge curated at different levels of detail. The fuzzy matcher would use information about families, about modifications that are more general than others, and about locations that are more general than others. In this way, fuzzy matching will allow us to analyze pathways in a less restrictive way by applying rules to terms that do not strictly match the current term. The advantage of this method is that it allows us to analyze the evolution of similar components. Furthermore, this matching can be extended to any rewriting logic setting. The rest of the paper is organized as follows: in Sect. 2 we introduce PL and the relevant details of the rewriting logic syntax for our study, as well as the problem we try to solve. We describe the new concept of fuzzy matching in Sect. 3. A case study in PL is developed in Sect. 4.
We present some abbreviated notes of the implementation in Sect. 5. Finally, in Sect. 6 we present our conclusions.
2 Preliminaries

2.1 Rewriting Logic and Maude
Rewriting logic constitutes a logic of change or of becoming. Rewriting logic facilitates the specification of the dynamic features of systems and naturally deals with highly nondeterministic concurrent computations. A theory of rewriting logic consists of an equational theory that allows the user to specify sorts, constructors, and function symbols, as well as equality between terms. Rewriting logic extends this equational theory by incorporating the notion of rewrite rules, which define transitions between states. From a computational point of view, each rewriting step can be seen as a parallel local transition in a concurrent system. From a logical point of view, each rewriting step is a logical entailment in a formal system. Maude is a high-level declarative language and a high-performance system that supports equational and rewriting logic computation for a wide range of applications. Maude provides several analysis tools for rewrite theories: rewrite
82
A. Riesco et al.
computation (execution), breadth-first search, and many others [3,8]. By using these tools, it is possible to study the way in which a system behaves, to check if it can reach a certain state from an initial one, and to analyze if our system satisfies some temporal properties.

2.2 Maude Representation of Pathway Logic Cell Signaling Models
We briefly describe in this section the Maude specification of PL and the way in which to use it. We will focus especially on the modelling of the compartments of cell parts and on the definition of their sorts. On the other hand, we will also show some examples of the definition of rewrite rules in PL. In the next section, we will discuss the motivation for creating the new concept of fuzzy matching. PL is a system for the development of executable formal models of biomolecular processes in rewriting logic (http://pl.csl.sri.com). By using the Maude system, the resulting formal models can be executed and analyzed. The PL system has been used to curate models of signal transduction, of protease signaling in bacteria, of glycosylation pathways, and so on [7,9]. The Pathway Logic Assistant, henceforth PLA, provides an interactive visual representation of PL models. Using Petri nets, the PLA provides efficient algorithms for answering reachability queries and natural graphical representations of query results. For example, a model can be visualized as a network of rules and of components connected from reactants to rules, and from rules to products. A pathway is a subnetwork consisting of rules executed to reach a goal. In PL, a global model is a collection of rewrite rules called a rule knowledge base, together with the supporting data type specifications. A model of a specific cellular system consists of a specification of an initial state and a collection of rules derived from the global knowledge base by a symbolic reasoning process that searches for all rules that may be applicable in an execution starting from the initial state. An initial state contains cell components with their locations. Such executable models reflect the possible ways in which a system can evolve.
Logical inference and analysis techniques can: (1) simulate possible ways in which a system can evolve; (2) build pathways in response to queries; and (3) think logically about dynamic assembly of complexes, cascading transmission of signals, feedback-loops, and cross talk between subsystems and larger pathways [11]. Data types like proteins, chemicals, and genes are defined as Maude sorts, while functions on these sorts are defined by means of op. Relations between sorts are stated by means of subsorts. For example, we define a constant for protein Akt1 (where the attribute ctor indicates that this constant is a constructor) and we can define a family AktS of proteins that are a particular case of proteins. We can name a generic member of this family, Akts, and indicate that proteins Akt1, Akt2, and Akt3 are particular members of this family:

sorts Protein Chemical Gene .
sort AktS .
subsort AktS < Protein .
ops Akt1 Akt2 Akt3 : -> AktS [ctor] .
op Akts : -> AktS [ctor metadata "((type Family)(members Akt1 Akt2 Akt3))"] .
Given a multi-set (a Soup) of elements like the proteins, chemicals, and genes above, we define Locations to specify the elements in different places of the cell like the nucleus or the cytoplasm: op {_|_} : LocName Soup -> Location [ctor format (n d d d d d)] .
We can indicate that nucleus NUc contains proteins and genes Maz, Tp53-gene (gene transcription is on), Rb1, Chek1, and other elements: {NUc | Maz [Tp53-gene - on] Rb1 Chek1 Chek2 Myc Tp53 NProteasome}
Finally, dishes are defined as wrappers of Soup, which in this case are not isolated elements but different locations: op PD : Soup -> Dish [ctor]. Different elements could appear in different parts or locations of the cell: outside the cell (XOut), in/across the cell membrane (CLm), attached to the inside of the cell membrane (CLi), in the cytoplasm (CLc), and in the nucleus (NUc). Some proteins are represented by: epidermal growth factor (Egf), PI3 kinase (Pi3k), ERK activator kinase 1 (Mek1), and so on. Some components could appear with different modifiers: activation (act), phosphorylation on tyrosine (Yphos), or binding to GDP (GDP). We can indicate that dish SKMEL133Dish1 contains cell locations (e.g., CLm) and each location contains their elements (e.g., Pi3k) with their corresponding modification states (e.g., [Rheb - GTP]) such as: op SKMEL133Dish : -> Dish . eq SKMEL133Dish = PD( {XOut | empty} {CLm | PIP2} {CLi | Parva Pi3k Pld1} {CLc | [Csnk1a1 - act] [Gsk3s - act] [Ikke - act] [Ilk - act] Akts BrafV600E Bim Eif4ebp1 Erks Mek1 Raptor Akt1 Cdc42 Erbb2 Erk5 ...} {NUc | Maz [Tp53-gene - on] Rb1 Chek1 Chek2 Myc Tp53 NProteasome} {CVc | (Tsc1 : Tsc2) [Rheb - GTP]} ) .
Note that a binding between elements can be defined with operator (_:_). For example, (Tsc1 : Tsc2) indicates a binding between proteins Tsc1 and Tsc2. Biochemical reactions are defined on sets of locations by means of rewrite rules (with syntax rl) that stand for transitions in a concurrent system. For example, we can say that if, in the location CLm (that corresponds to the cell membrane), a protein ADAM17 is phosphorylated on T735, then this protein will be activated: rl[972c.Adam17.act]: {CLm | clm [Adam17 - phos(T 735)] } => {CLm | clm [Adam17 - act] } .
where the variable clm stands for any other element that might appear in the corresponding location. Now, we can use the rew command to ask Maude to apply rules and provide a reachable state from particular dishes. The following code obtains the result after applying 3 rewrite steps to an initial dish: rew [3] SKMEL133Dish.1

1 We show a simplified version of the dish, i.e. some of the elements that are not relevant in the context are indicated with ellipses.

However,
since several different rules can be applied to the same dish to obtain different results, the rew command does not provide much information. To solve this problem, Maude provides the search command which performs a breadth-first search for the pattern given in the command. For example, we can check if in ten steps we can reach from a SKMEL133Dish to a dish composed of a protein Eif4ebp1 phosphorylated on T37 in the cytoplasm, a gene Tp53-gene in the on state in the nucleus, a protein PIP2 in the plasma membrane, and a protein Pi3k stuck to the inside of the plasma membrane: Maude> search [1,10] SKMEL133Dish > =>* PD({CLc | clc:Things [Eif4ebp1 - phos(T 37) modset:ModSet]} > {NUc | nuc:Things [Tp53-gene - on]} {CLm | clm:Things PIP2} > {CLi | cli:Things Pi3k} S:Soup) . Solution 1 (state 37) S:Soup --> {XOut | empty} {CVc | (Tsc1 : Tsc2) [Rheb - GTP]} clm:Things --> empty cli:Things --> Parva Pld1 clc:Things --> Akts Akt1 BrafV600E Erbb2 Erks Mek1 Rac1 [Gsk3s - act] ... modset:ModSet --> phos(T 46) phos(T 70) phos(S 65) nuc:Things --> Chek1 Chek2 Maz Myc NProteasome Rb1 Tp53 .
where the variable S:Soup abstracts the rest of the elements, which are not relevant in this case, and the search option =>* stands for zero or more steps. Maude indicates one possible solution which fulfills this condition and also shows the terms that match in the solution.

2.3 The Problem Motivating Fuzzy Matching
As mentioned above, a PL model of a specific cellular system is obtained by a symbolic reasoning process that derives instances of rules from a rule knowledge base relevant to a given initial state. The ordinary course of PL and Maude is the following: it starts with the initial state, uses Maude's matching function to find all rule instances that can be applied, adds the instances to the model, adds all the products/consequences of the rule instances to the accumulating state, and repeats rewriting until no new rule instances can be found. This works well when the rules are hand curated to work together. Mismatch issues become a problem when we want to add more automation to the process of building models.2 One example is an attempt to automate the inference of rules from a formal representation of experimental findings [6]. We derive constraints on rule pattern variables from experimental findings and use answer-set programming tools to generate minimal models, i.e. rules. The resulting rule sets are disconnected, because the constraint solver finds the most detailed rule supported by the evidence, while experimental data comes in many different levels of detail or specificity. A second example is building models in order to explain the effects of drugs on cellular processes. Here the initial states represent exponentially growing cells rather than resting cells treated with a stimulus. The rule knowledge base is a set of rules called Common Rules. These rules capture the 'normal' interactions that go on inside a cell and also reflect what has been experimentally determined. These components appear at different levels of detail. The case study of Sect. 4 is an example of this kind of application of PL. For these reasons, we present in this paper a notion of "fuzzy" matching. In this context, "fuzzy" refers to a "vagueness" or "permissiveness" in the application of matching, where biology allows Maude to match terms that do not match according to the underlying logic.

2 Some examples of what can go wrong are: (1) the state contains [Akts - act phos] and the rule premise requires [Akts - phos]; (2) a rule requires Akts (the family) to be present and the state contains Akt1 (a member of the family); and (3) the state contains [Akts - Yphos] and the rule premise requires [Akt - phos]. Maude's matching function fails to find rule instances in these cases, despite the fact that biologically the rules do apply given the interpretation of experimental data.
3 Fuzzy Matching
Our goal is to automate the process of adapting rules from a rule knowledge base to form a connected network, thus allowing specification of states and curation of rules to capture the available information and to postpone adjustments until they are needed in a specific model. An intuitive way of adapting rules consists of substituting constants by variables, given that these constants belong to the same family and appear in rules with the same context. If this transformation is introduced into PL it supports a more general rewriting mechanism, hence we can obtain results that are unreachable in other cases. If a proper notion of “generalization” is given in another context, fuzzy matching can be instantiated in this way and adapted to this rewriting setting. Here we present some (meta)rules for adapting/instantiating rules to apply in a given state. Although PL rules are curated in terms of soups of locations, the reasoning tool operates on an occurrence-based representation of rules as triples of occurrences tagged with the rule identifier. An occurrence is a protein, a chemical or a gene that may have been modified together with their location name. The rule components can be reactants, products, and controls.3 For example, in the rule 632c.Akts.by.Ilk: rl {CLc | clc [Ilk-act] Akts} => {CLc | clc [Ilk-act] [Akts-phos(FSY)]}.
There is some hidden information here. Namely, FSY is a symbolic name for sites in the members of the Akts family. Associated meta-data maps specific family members to a site position, so that non-overlap of modifications can be checked in the case of positions that are represented concretely for family members. We write the occurrence-based form of the rule 632c.Akts.by.Ilk as:

rl < Akts, CLc > => < [Akts - phos(FSY)], CLc >
if < [Ilk - act], CLc >
3 Controls are the occurrences that appear in both the rule's premises and conclusions but do not change. Reactants are the occurrences that appear in the premises but not in the conclusions, while products are the occurrences that appear in the conclusions but not in the premises.
86
A. Riesco et al.
We define the fuzzy instantiates-to relation ">>" on occurrences (read as: is satisfied by, or is more general than):

< [P - mods], L > >> < [P' - mods'], L >

if P > P', mods is m1 ... mk, mods' is m1' ... mk' ms, and m1 > m1', ..., mk > mk'. Here each mi is an individual modification such as Yphos or act, while ms stands for zero or more insignificant additional modifications. The relation ">" is defined using knowledge that underlies the controlled vocabulary (the equational signature). P > P' holds if P = P' or if P names a family and P' names a member of that family. This information is part of the meta-data associated with constant declarations. m > m' holds if m = m' or if m' is more specific than m (e.g., phos > Yphos > phos(Y123)).

Fuzzy Forward Matches. In PL, a modification rule has the form:

< [P - mods mods0], L > => < [P - mods mods1], L > if < [Q - qmods], L >
where mods and/or qmods may be empty, and mods0 or mods1 may be empty, but not both. The modification rule fuzzy forward matches a state containing

< [P' - mods' mods0], L >   < [Q' - qmods'], L >

if P > P', mods > mods', Q > Q', qmods > qmods', and mods' is disjoint from mods1. The resulting rule instance is

< [P' - mods' mods0], L > => < [P' - mods' mods1], L > if < [Q' - qmods'], L >
which can be added to the accumulated rule set in a forward collection. Also, < [P' - mods' mods1], L > will be added to the accumulated state occurrence set. Continuing our Akts by Ilk example, if the state contains < [Ilk - act], CLc > and < Akt1, CLc >, then the rule 632c.Akts.by.Ilk has the fuzzy forward match instance:

632c.Akts.by.Ilk: < Akt1, CLc > => < [Akt1 - phos(FSY)], CLc > if < [Ilk - act], CLc > .

Note that because we have used symbolic site names we can specialize Akts to Akt1 without worrying about the consistency of positions, since the symbolic name is interpreted according to the protein that carries the site.

Fuzzy Backwards Matches. We can use fuzzy matching for backwards collection as well. If the collected state contains < [P' - mods' mods1], L >, P > P', mods > mods', and mods' is disjoint from mods0, then the modification rule above fuzzy backwards matches the state with the corresponding rule instance

< [P' - mods' mods0], L > => < [P' - mods' mods1], L > if < [Q - qmods], L >
which can be added to the accumulated rule set in a backwards collection. Also, < [P' - mods' mods0], L > and < [Q - qmods], L > will be added to the accumulated state occurrence set. The fuzzy matching principles can be lifted to complexes by establishing a one-to-one matching correspondence between complex components.
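The generality relations and one fuzzy forward-collection step can be sketched in Python. This is purely illustrative and not part of PL: the family and refinement tables are toy data (real PL derives them from the meta-data of the controlled vocabulary), and the disjointness check against mods1 is omitted for brevity.

```python
# Toy encoding of the PL generality relations and fuzzy forward matching.
FAMILY = {"Akts": {"Akt1", "Akt2", "Akt3"}}
REFINES = {"phos": {"Yphos", "phos(Y123)"}, "Yphos": {"phos(Y123)"}}

def p_geq(p, q):
    """P > P': equal, or P names a family with member P'."""
    return p == q or q in FAMILY.get(p, set())

def m_geq(m, m2):
    """m > m': equal, or m' is a more specific modification (phos > Yphos)."""
    return m == m2 or m2 in REFINES.get(m, set())

def occ_geq(occ, occ2):
    """<[P - mods], L> >> <[P' - mods' ms], L>: every required modification
    must dominate some modification of the more specific occurrence."""
    (p, mods, loc), (q, mods2, loc2) = occ, occ2
    return loc == loc2 and p_geq(p, q) and all(
        any(m_geq(m, m2) for m2 in mods2) for m in mods)

def fuzzy_forward(reactant, added, control, state):
    """Instantiate a modification rule for every state occurrence that the
    reactant fuzzy-matches, provided the control is present in the state."""
    if not any(occ_geq(control, occ) for occ in state):
        return []
    return [(q, ms + added, loc) for (q, ms, loc) in state
            if occ_geq(reactant, (q, ms, loc))]

# Rule 632c.Akts.by.Ilk: < Akts, CLc > => < [Akts - phos(FSY)], CLc >
#                        if < [Ilk - act], CLc >
state = {("Akt1", (), "CLc"), ("Ilk", ("act",), "CLc")}
products = fuzzy_forward(("Akts", (), "CLc"), ("phos(FSY)",),
                         ("Ilk", ("act",), "CLc"), state)
# products == [("Akt1", ("phos(FSY)",), "CLc")]: Akts specializes to Akt1
```

As in the example above, the family constant Akts fuzzy-matches the member Akt1, and the product occurrence is what gets added to the accumulated state set.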
Fuzzy Matching for Cellular Signaling Networks
87
It is worth noting that fuzzy matching has some important shortcomings: (i) it can return many results; (ii) its performance is worse than that of the standard matching implementation; and (iii) it cannot be used in combination with some predefined functions, such as model checking. On the other hand, when the model is too strict, fuzzy matching allows users to obtain automatically results that are not reachable otherwise, so in these cases it greatly improves on the original PL implementation.
4 Case Study
This case study uses fuzzy matching to automate part of the rule collection process. As part of the DARPA Big Mechanism project we defined a PL model to explain changes in protein expression and phosphorylation in response to treatment of SKMEL133 cells with one or more drugs (the data set and results were reported in [5,13]). SKMEL133 cells contain the constitutively active Braf V600E mutation, which causes constitutive activation of the RAS-RAF-MEK-ERK signaling pathway independently of extracellular signals; it is found in choroidal melanoma and other cancers. Using biological insights, a SKMEL133 dish was defined by fixing the locations and modifications of the selected proteins and genes. Ideally the normal forwards collection process would generate the desired model, but some adjustments are necessary to obtain a model that connects the initial state to the measured protein states. We begin by showing how we carried out the first few steps by hand. Starting with the SKMEL133 dish and the unmodified Common Rules, Maude found only four rules that could be applied. We get two more rules in the model by adding a rule that represents the behavior of the Braf mutation:

rl[3808c.BrafV600E.act]: {CLc | clc BrafV600E} => {CLc | clc [Braf - act]} .
We can extend this model by searching for rules with premises/inputs that almost match the leaves, or by searching for rules with consequences/products that almost match the entities being measured. Starting with the leaves, we see that rule 014c has [Mek1 - act] as a premise while the state contains [Mek1 - act phos(SMANS)]. This is a fuzzy forward match. To convince Maude, we add a modification set variable to the rule premise, [Mek1 - ms1 act], which absorbs the modifications we do not care about. There is a rule to activate [Akts - phos(FSY) phos(KTF)], but the state only contains [Akts - phos(FSY)]. The missing phosphorylation can be added by rule 109c if [Pdpk1 - act] is present in the state. There are rules for generating [Pdpk1 - Yphos], but we do not know whether Yphos implies act. So we propose to replace Pdpk1 by [Pdpk1 - act] in the initial state and later search for experimental evidence for a rule leading from Pdpk1 to [Pdpk1 - act]. Rules with [Akt - act] in their premise will fuzzy forward match [Akts - act phos(FSY) phos(KTF)] in the accumulating state. We added extra variables to let Maude know that they match. The resulting model can apply new rules such as 014c or 109c because fuzzy matching offers more flexibility.
The rules for initial formation of the Mtor complexes Mlst8 : Raptor : Mtor and Mlst8 : Sin1 : Rictor : Mtor include the protein Mlst8, while rules that modify elements of Mtor complexes do not include Mlst8 in the complex terms. Under some circumstances it is plausible to lift rules to a more complex term. Note that by adding this "context" variable the new rule can match terms that did not match before (previously they matched only when the context had no extra components). Once the extended rules are added, PL can use them for standard analyses (e.g., reachability), and we obtain new results beyond those reached with standard matching. In our particular example, fuzzy matching allowed us to reach a result that was "hidden" from the standard rewriting mechanism.
5 Implementation
In this section we present our implementation, which transforms the rules in the module introduced by the user in order to take advantage of Maude's standard matching algorithm to simulate fuzzy matching. The complete code is available at https://github.com/ariesco/pathway. The main function in our implementation is finst, which receives a module and returns another module in which the rules have been modified as explained in the previous sections. It is applied to the rules together with the operator declaration set, which is used to extract the meta-data information:

op finst : SModule -> SModule .
eq finst(mod H is IL sorts SS . SSDS ODS MAS EqS RS endm) =
   mod H is IL sorts SS . SSDS ODS MAS EqS finst(RS, ODS) endm .
The following auxiliary function traverses the rules, transforming each of them with finstRule:

op finst : RuleSet OpDeclSet -> RuleSet .
eq finst(none, ODS) = none .
ceq finst(R RS, ODS) = R' finst(RS, ODS) if R' := finstRule(R, ODS) .
The function finstRule uses the auxiliary functions forwards and backwards, which are in charge of computing the forwards and backwards collections described in the previous sections. These functions compute a TermSub, that is, a substitution mapping terms to terms; these substitutions indicate how the different compounds found in the rules must be generalized to support fuzzy matching. We use matching conditions to collect the first substitution; then we apply it to the rule with the auxiliary function sub and compute the backwards collection. Since these substitutions will make both rules equal, it is enough to return just one of the rules after applying both substitutions. Finally, the function simplify removes the duplicated terms that appear after generalizing the terms:

op finstRule : Rule OpDeclSet -> Rule .
ceq finstRule(R, ODS) = simplify(sub(R, TS . TS'))
 if TS := forwards(R, ODS) /\ TS' := backwards(sub(R, TS), ODS) .
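The pipeline of finstRule can be sketched in Python as collecting a substitution, applying it, collecting a second one on the result, and deduplicating. This is purely illustrative: the real implementation manipulates Maude meta-terms, while here terms are plain strings and the two substitutions are made up.

```python
# Pythonic sketch of the finstRule pipeline (illustrative only).
def apply_subst(term_list, subst):
    """Apply a term-to-term substitution to a list of terms."""
    return [subst.get(t, t) for t in term_list]

def simplify(term_list):
    """Remove duplicates introduced by generalization, keeping order."""
    seen, out = set(), []
    for t in term_list:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

# e.g. the forwards collection decides Akt1 generalizes to its family Akts,
# and the backwards collection does the same for Akt2:
forwards_subst = {"Akt1": "Akts"}
backwards_subst = {"Akt2": "Akts"}
lhs = ["Akt1", "Akt2", "Ilk-act"]
result = simplify(apply_subst(apply_subst(lhs, forwards_subst),
                              backwards_subst))
# result == ["Akts", "Ilk-act"]: both members collapse to the family constant
```

The final deduplication mirrors the role of simplify above: after generalizing, two previously distinct member constants can become the same family constant.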
Both forwards and backwards traverse the terms in the rule until they reach the subterms of interest. Below we show a particular case of the forwards collection in which we find a compound T that is part of a family. In this case, we return a substitution mapping the term to the corresponding constant of the family, while we recursively compute the substitution for the rest of the term:

op substForwards : Term OpDeclSet -> TermSub .
ceq substForwards('`[_-_`][T, T'], ODS) = T -> C . substForwards(T', ODS)
 if C := getFamily(ODS, T) .
The auxiliary function getFamily checks whether the meta-data information in the operator indicates that there exists a family containing that element; in that case we build the corresponding constant:

op getFamily : OpDeclSet Constant ~> Constant .
ceq getFamily(op Q : TyL -> Ty [metadata(S) AtS] . ODS, C) = C'
 if N := find(S, "type Family", 0) /\ S1 := string(getName(C)) /\
    find(S, S1, N) =/= notFound /\ C' := qid(string(Q) + "." + string(Ty)) .
This transformation is automatically performed when loading the finst file in the repository above. Once the module has been transformed, the rew and search commands can be used in the usual way so that the transformation becomes transparent to the users.
6 Conclusions
Various models for the computational analysis of cellular signaling networks have been proposed to simulate responses to specific stimuli [1]. Symbolic models are based on formalisms that provide: a language that represents the states of a system; mechanisms to model their changes; and tools for analysis grounded on computational or logical inference. PL is a system for developing symbolic models of cellular signaling processes [12]. One aim is to formalize the cartoon models of signaling processes that biologists build to organize and understand experimental results. The Pathway Logic Assistant (PLA) uses Maude's matching functions (at the meta-level) to assemble networks of rule instances that may be applicable from a given initial state by a forwards collection process. In this paper, we have proposed a new notion of fuzzy matching that captures the fuzziness of experimental results and allows rules to be matched by specializing or generalizing occurrences, where the specialization/generalization relations are based on biological notions of specificity. The notion of fuzzy matching was inspired by issues that arise in developing models of biological processes; in particular, it comes from the need to derive specific models from a general rule knowledge base and to adapt rules to a common level of information so that they can be applied broadly.
Acknowledgement. PL development has been funded in part by NIH BISTI R21/R33 grant (GM068146-01), NIH/NCI P50 grant (CA112970-01), and NSF grants IIS-0513857 and CNS-1318848. Research was supported by Spanish project TRACES TIN2015-67522-C3-3-R and Comunidad de Madrid project BLOQUES-CM (S2018/TCS-4339).
References

1. Bartocci, E., Liò, P.: Computational modeling, formal analysis, and tools for systems biology. PLoS Comput. Biol. 12(1), e1004591 (2016)
2. Chylek, L.A., Harris, L.A., Faeder, J.R., Hlavacek, W.S.: Modeling for (physical) biologists: an introduction to the rule-based approach. Phys. Biol. 12(4), 045007 (2015)
3. Durán, F., Eker, S., Escobar, S., et al.: Programming and symbolic computation in Maude. J. Log. Algebr. Methods Program. 110, 100497 (2020)
4. Hwang, W., Hwang, Y., Lee, S., Lee, D.: Rule-based multi-scale simulation for drug effect pathway analysis. BMC Med. Inform. Decis. Making 13(Suppl 1), S4 (2013)
5. Korkut, A., Wang, W.: Perturbation biology nominates upstream-downstream drug combinations in RAF inhibitor resistant melanoma cells. eLife 18(4) (2015)
6. Nigam, V., Donaldson, R., Knapp, M., McCarthy, T., Talcott, C.: Inferring executable models from formalized experimental evidence. In: Computational Methods in Systems Biology, CMSB 2015, pp. 90–103. Springer (2015)
7. Riesco, A., Santos-Buitrago, B., De Las Rivas, J., Knapp, M., Santos-García, G., Talcott, C.: Epidermal growth factor signaling towards proliferation: modeling and logic inference using forward and backward search. Biomed. Res. Int. 2017, 11 (2017)
8. Riesco, A., Verdejo, A., Martí-Oliet, N., Caballero, R.: Declarative debugging of rewriting logic specifications. J. Log. Algebr. Program. 81(7–8), 851–897 (2012)
9. Santos-Buitrago, B., Hernández-Galilea, E.: Signaling transduction networks in choroidal melanoma: a symbolic model approach. In: 13th International Conference on PACBB 2019, pp. 96–104. Springer (2019)
10. Santos-Buitrago, B., Riesco, A., Knapp, M., Alcantud, J.C.R., Santos-García, G., Talcott, C.: Soft set theory for decision making in computational biology under incomplete information. IEEE Access 7, 18183–18193 (2019)
11. Santos-Buitrago, B., Riesco, A., Knapp, M., Santos-García, G., Talcott, C.: Reverse inference in symbolic systems biology. In: 11th International Conference on PACBB 2017, pp. 101–109. Springer (2017)
12. Talcott, C.: The pathway logic formal modeling system: diverse views of a formal representation of signal transduction. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1468–1476. IEEE (2016)
13. Talcott, C., Knapp, M.: Explaining response to drugs using pathway logic. In: Computational Methods in Systems Biology, CMSB 2017, pp. 249–264. Springer (2017)
Towards A More Effective Bidirectional LSTM-Based Learning Model for Human-Bacterium Protein-Protein Interactions

Huaming Chen1, Jun Shen1, Lei Wang1, and Yaochu Jin2

1 University of Wollongong, Wollongong, NSW 2500, Australia
[email protected], [email protected]
2 University of Surrey, Guildford GU2 7XH, UK
Abstract. The identification of protein-protein interactions (PPI) is one of the most important tasks in understanding biological functions and disease mechanisms. Although numerous databases of biological interactions have been published owing to advanced high-throughput technology, the study of inter-species protein-protein interactions, especially between human and bacterium pathogens, remains an active yet challenging topic for computational models tackling the complex analysis and prediction tasks. In this paper, we comprehensively revisit the prediction task of human-bacterium protein-protein interactions (HB-PPI), which is, to the best of our knowledge, the first empirical evaluation of learning and predicting HB-PPI based on machine learning models. Firstly, we summarise the literature on human-bacterium interaction (HBI) study, carefully examining a vast number of databases published over the last decades. Secondly, a broader and deeper experimental framework is designed for the HB-PPI prediction task, which explores a variety of feature representation algorithms and different computational models to learn from the curated HB-PPI dataset and perform predictions. Furthermore, a bidirectional LSTM-based model is proposed for the prediction task, which demonstrates a more effective performance in comparison with the others. Finally, opportunities for improving the performance and robustness of machine learning models for HB-PPI prediction are also discussed, laying a foundation for future work.

Keywords: Human-bacterium interactions · Protein-protein interactions · Machine learning · Computational model
1 Introduction
Monitoring and curing infectious diseases in humans remain prevalent and intractable problems, and substantial research has focused on understanding infectious mechanisms and developing novel therapeutic solutions. This solicits great efforts in revealing the biological interactions between human and different pathogens [1,12,22]. However, research on the identification of interactions is still in its early stage. Some published data may focus on particular human-pathogen interactions (HPI). Meanwhile, the identification of interactions takes a huge amount of experimental resources and time. As a cost-effective alternative, computational models for the analysis and prediction of HPI systems have been investigated. Although several literature reviews have introduced machine learning-based methods and applications in the HPI domain, little research on the empirical evaluation of the performance of HBI predictions based on machine learning models has been conducted [27,31], and no work focusing on the prediction of human-bacterium interactions has been reported. Meanwhile, most studies of PPI prediction have evaluated predictors on balanced and small datasets. To achieve an extensive empirical evaluation of HB-PPI prediction based on machine learning models, we first build a human-bacterium protein-protein interaction dataset, curated from our dedicated and comprehensive review of databases published over the last two decades. We restrict our data to either expert-annotated interactions or direct experimental outcomes, to build a trustworthy set of positive protein interactions. Furthermore, we collected the unlabelled protein interaction data by accessing the UniProtKB database [8], from which the human proteins (taxonomy ID: 9606) were downloaded and the corresponding proteins for each bacterial species were also acquired. We followed the typical protein interaction data curation process of [9,37], and additionally included an extended dataset curation strategy on top of [13]. The details of data curation will be discussed in Sect. 3.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 91–101, 2021. https://doi.org/10.1007/978-3-030-54568-0_10
By building the human-bacterium protein-protein interaction dataset, we found that prediction of HB-PPI poses three challenges for building robust and efficient machine learning models: (C1) given the data availability and experiment design, a curated HB-PPI dataset suitable for machine learning is required; (C2) the protein sequence information, which is the primary information determining subsequent levels of protein structure, still requires an effective feature representation algorithm to retain protein identities; (C3) different machine learning methods exhibit various performances with respect to C1, and designing a robust and effective model remains challenging. To evaluate solutions tackling challenges C1 and C2, the experimental settings for building the HPI datasets include, firstly, different ratios of positive to negative HB-PPI interactions and, secondly, two different categories of sequence feature representation algorithms. Our evaluation of the traditional machine learning methods and models found in the literature revealed that current techniques could not deliver a robust performance and could not generalise well on the HB-PPI dataset. Thus, to tackle C3, we have proposed a bidirectional long short-term memory-based model, jointly learning with the designed multi-channel feature representation algorithm, a tree-based feature selection algorithm, and the synthetic minority over-sampling technique (SMOTE), for prediction on the HB-PPI dataset. The proposed model demonstrates the best performance.
The contributions of this paper can be summarised as follows:
– (1) A comprehensive and systematic HB-PPI review, collectively presenting different feature representation algorithms and machine learning models for the prediction of HB-PPI (Sect. 2);
– (2) To address challenges C1–C3, a broader and deeper experimental framework revisiting the HB-PPI learning task. The extensive empirical evaluation considering different sequence feature representation algorithms and machine learning methods shows that there is still plenty of room for improvement towards a robust and efficient machine learning method for HB-PPI prediction (Sect. 3);
– (3) A model achieving a more robust and effective performance on the HB-PPI datasets of three different HBI systems, based on a bidirectional LSTM model with the designed multi-channel feature. The proposed model indicates a promising research direction of studying big HB-PPI datasets with deep learning models (Sect. 4).
2 A Comprehensive Re-examination of HPI Study
There have been substantial research interests in applying machine learning methods to the prediction of protein-protein interactions [3,13,16,26,31,32,34,37]. A common feature of these works is that they successfully applied machine learning methods to a given set of positive protein interaction data, while focusing on a balanced dataset obtained by building negative protein interaction data of the same size as the positives. In our work, we explicitly characterise the prediction tasks of HBI systems in terms of the identified challenges. We formulate our empirical evaluation from two different aspects, which have scarcely been investigated in the past.

2.1 Host-Pathogen Protein-Protein Interactions
Prior to conducting the evaluation for HB-PPI prediction, we carefully reviewed the existing literature reviews. Since there is currently no single review dedicated to HB-PPI, several up-to-date reviews on the broader topic of HP-PPI were examined. A wide coverage of HPI study, including prediction and analysis, can be found in [10,31] and [33], while research on computational prediction of HPI was discussed in [27] and [41]. Since these reviews aimed at describing the progress in prediction of host-pathogen interactions without anchoring on specific pathogens, they collectively listed potential computational methods, but no systematic evaluation with sufficient details has been implemented and reported.

2.2 Variety of Host-Pathogen Databases
A systematic literature review has been conducted to screen the abundant HPI resources. Over 4,000 items are returned by the keyword search
of 'pathogen' and 'database' in the NCBI PubMed search engine. The first 400 results, ranked by relevance, were manually examined at the abstract level, and 45 databases were evaluated for availability and contents. Eventually, 11 databases were chosen to curate our dataset of different HPI systems. We focus on those in which human is the host (taxonomy ID: 9606) and a bacterium is the pathogen. These 11 databases are DIP [29], Reactome [20], APID [28], IntAct [21], MINT [24], InnateDB [4], PHISTO [11], PATRIC [35], Mentha [5], HPIDB [2] and BioGRID [6]. Their data are collected via literature and domain-expert manual verification with high confidence. After cleansing the databases, 90 different bacterial pathogens were identified as having interactions with hosts. In this study, we have dedicated the study to the interactions between three bacteria and the human host, because their protein information is sufficiently available to constitute big datasets for the evaluation and comparison with the proposed model. The others could be used for further repeated verification and research, but are not within the scope of this paper.
3 Evaluation Design for HB-PPI Dataset
As mentioned, although several reviews have discussed the research challenges in HPI prediction, neither curated datasets nor evaluation results are available in the research papers. In this section, we introduce the evaluation design for the HB-PPI dataset, as illustrated in Fig. 1.

3.1 The HB-PPI Dataset
Since only a small number of positive protein interactions are catalogued in public databases and the space of remaining unknown protein interactions is very large, most studies of intra-species PPI in the literature adopt a random sampling scheme that selects protein pairs from the unknown data as negative interactions, to constitute a discriminative dataset for model learning [14,15,32,37]. A balanced protein interaction dataset, in which positive and negative interactions are assumed to be present in equal amounts, is normally curated for evaluation.
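The random sampling scheme described above can be sketched as follows. This is a minimal illustration, not code from the paper; the protein identifiers are made up, and a fixed seed stands in for the repeated sampling described later.

```python
# Sample unlabelled human-bacterium protein pairs (not in the positive set)
# at a chosen negative-to-positive ratio to build a labelled dataset.
import itertools
import random

def build_dataset(human, bacterium, positives, ratio, seed=0):
    """Return (pairs, labels) with ratio * len(positives) negatives."""
    rng = random.Random(seed)
    unknown = [p for p in itertools.product(human, bacterium)
               if p not in positives]
    negatives = rng.sample(unknown, ratio * len(positives))
    pairs = list(positives) + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return pairs, labels

# Hypothetical protein IDs, 2 positives, ratio 1:25:
human = [f"H{i}" for i in range(40)]
bact = [f"B{i}" for i in range(40)]
positives = {("H0", "B0"), ("H1", "B2")}
pairs, labels = build_dataset(human, bact, positives, ratio=25)
# 2 positives + 50 sampled negatives = 52 labelled pairs
```

Increasing `ratio` to 50 or 100 reproduces the highly imbalanced settings evaluated in this section.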
Fig. 1. The overall experimental framework (positive protein interaction data from public databases and protein data from UniProtKB are curated into the HB-PPI dataset, split into training and independent datasets, encoded by amino acid composition and evolutionary information feature representations, and fed to the machine learning based model)
Table 1. Datasets statistics

| Taxonomy IDa | Positive interactions | 1:25 Training | 1:25 Independent testing | 1:50 Training | 1:50 Independent testing | 1:100 Training | 1:100 Independent testing |
|---|---|---|---|---|---|---|---|
| 1491 | 57 | 1185 | 297 | 2325 | 582 | 4605 | 1151 |
| 177419 | 1207 | 25105 | 6277 | 49245 | 12312 | 97525 | 24382 |
| 1392 | 2810 | 58448 | 14612 | 114648 | 28662 | 227048 | 56762 |

a '1491' represents Clostridium botulinum, '177419' is Francisella tularensis subsp. tularensis (strain SCHU S4 / Schu 4), and '1392' is Bacillus anthracis.
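The counts in Table 1 are consistent with drawing ratio × positives negatives and splitting the result roughly 80/20 into training and independent testing sets. The sketch below is a reconstruction of that arithmetic under this assumed split, not code from the paper; a cell or two in the table differs by ±1 due to rounding.

```python
# Reconstruct Table 1 cell counts under an assumed 80/20 train/test split
# of positives plus (ratio * positives) sampled negatives.
def split_sizes(positives, ratio, train_frac=0.8):
    total = positives * (1 + ratio)   # all labelled pairs
    train = int(total * train_frac)   # training portion
    return train, total - train       # (training, independent testing)

# Clostridium botulinum (taxonomy ID 1491) has 57 positive interactions:
print(split_sizes(57, 25))     # → (1185, 297), matching the 1:25 columns
print(split_sizes(1207, 25))   # → (25105, 6277)
print(split_sizes(2810, 25))   # → (58448, 14612)
```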
However, considering HPI systems, which concern inter-species interactions, the interaction ratio (i.e. the proportion of positives in a large set of protein pairs between species) is expected to be very low; in practice there may be 25, 50, or even 100 times as many negative interactions as positive interactions [13,23]. In other words, it can be a highly imbalanced dataset. Given this hypothesis, our work evaluates the impact of the amount of negative data by reproducing different ratios to generate HB-PPI datasets. The details of the HB-PPI datasets are shown in Table 1. The taxonomy IDs listed are the bacterial pathogens selected after data pre-processing; they are three different bacterial pathogens actively interacting with the human host. To alleviate the impact of randomness in sampling, we repeated this process five times, resulting in five-fold independent tests for evaluation.

3.2 Interpreting the Sequence Information
Utilizing protein sequence information has become a research trend due to the abundance of available information, and it solicits novel feature representation algorithms for ongoing protein research to improve prediction performance [9,39,40]. In our work, we focus on sequence information; we anticipate that the study can potentially be extended to other related research topics. Thus, mapping the sequence information according to the selected feature representation algorithms is the first step. Because different proteins have amino acid sequences of different lengths, it is difficult to input the sequence information directly into machine learning methods. This raises great interest in developing efficient algorithms that retain the identity of proteins. Two different categories are included, namely amino acid composition methods and evolutionary information methods. Amino acid composition methods derive the feature representation from the amino acid combinations of a given protein sequence; two popular algorithms in this category are the conjoint triad method [32] and the auto covariance algorithm [17]. Evolutionary information methods involve a protein alignment process against a reference protein sequence database, which produces a position-specific scoring matrix (PSSM) to
indicate the probability of each amino acid type at the corresponding position. In our work, we apply two such methods: the Pseudo Position-Specific Scoring Matrix (Pse-PSSM) [7] and Block-PSSM [19].
3.3 Machine Learning Based Methods
It is crucial to select feasible machine learning methods to perform the HBI prediction task, in which challenges C1 and C2 are inherent. In this paper, we evaluate several popular machine learning models, including support vector machine (SVM), random forests (RF), logistic regression (LR), naïve Bayes (GNB), decision tree (DT) and gradient boosting machine (GBM). These machine learning models are still more predominant than deep learning methods in protein interaction studies because, in contrast to computer vision or other AI problems, they usually require less data and have a simpler architecture while achieving reasonable performance. For hyperparameter optimization, a five-fold cross-validation test was adopted to select the best parameters. In addition, two sequence-based machine learning models are included for comparison [9,37].
4 Proposed Bi-LSTM-Based Model and Evaluation

4.1 Our Model
Fig. 2 illustrates the proposed model and its components.1 The bidirectional LSTM (Bi-LSTM) is the critical component of the model; it is a variant of the LSTM deep learning model proposed by [18,30]. The LSTM model and its bidirectional variant have demonstrated superior performance in domains such as natural language processing, transportation and action recognition [36,38]. In a Bi-LSTM model, two layers, namely a forward and a backward layer, are designed to converge into a single layer.
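The bidirectional wiring can be illustrated framework-free as follows. This is a toy sketch only: a linear stand-in cell replaces the actual LSTM gates, and the point is purely that a forward and a backward pass over the sequence are computed independently and their per-step states are concatenated.

```python
# Toy illustration of bidirectional recurrence (NOT a real LSTM cell).
def toy_rnn(seq, reverse=False):
    h, states = 0.0, []
    for x in (reversed(seq) if reverse else seq):
        h = 0.5 * h + x        # stand-in for the LSTM cell update
        states.append(h)
    # re-align backward states with the original time order
    return list(reversed(states)) if reverse else states

def bidirectional(seq):
    fwd = toy_rnn(seq)
    bwd = toy_rnn(seq, reverse=True)
    # per-time-step concatenation of forward and backward hidden states
    return [(f, b) for f, b in zip(fwd, bwd)]

out = bidirectional([1.0, 2.0, 3.0])
# out == [(1.0, 2.75), (2.5, 3.5), (4.25, 3.0)]
```

Each position thus sees context from both ends of the sequence, which is the property the Bi-LSTM exploits for protein sequences.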
Fig. 2. The overall experimental framework
1 The code and data are available at: https://huaming-chen.com/Bi-LSTM-Predictor/.
However, the Bi-LSTM model explicitly suffers from the conventional vanishing gradient problem for the prediction of the highly skewed HB-PPI data. To resolve this problem, we first introduce the focal loss function [25] as the cost function Δ in the Bi-LSTM model, defined in Eq. 1. Normally, cross entropy loss is applied for binary classification, defined as Δ(p, y) = Δ(p_t) = −log(p_t). Alternatively, Eq. 1 is used in our model, where p_t is the estimated output probability and α_t and γ are parameters. In this study, α_t = 0.5 and γ = 2 for all the experiments.

Δ(p_t) = −α_t (1 − p_t)^γ log(p_t)    (1)
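Eq. 1 can be written out directly for a single prediction. This minimal sketch uses the paper's settings α_t = 0.5 and γ = 2 and compares the focal loss against plain cross entropy:

```python
# Focal loss of Eq. 1 for one prediction; p_t is the model's estimated
# probability for the true class.
import math

def focal_loss(p_t, alpha_t=0.5, gamma=2.0):
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t):
    return -math.log(p_t)

# Well-classified examples (p_t near 1) are down-weighted by (1 - p_t)^gamma,
# so hard, typically minority-class examples dominate the training signal:
for p in (0.9, 0.5, 0.1):
    print(p, cross_entropy(p), focal_loss(p))
```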
Table 2. Results of F1-score for Pathogen Taxonomy ID '1491' and '1392'

Model          | Rep.b | '1491'a                                 | '1392'
               |       | 1:25        | 1:50        | 1:100       | 1:25        | 1:50        | 1:100
RF             | 1     | 0.957±0.000 | 0.992±0.016 | 0.984±0.020 | 0.170±0.010 | 0.140±0.009 | 0.068±0.007
RF             | 2     | 0.941±0.075 | 0.959±0.024 | 0.925±0.083 | 0.103±0.020 | 0.079±0.006 | 0.056±0.013
RF             | 3     | 0.985±0.031 | 0.969±0.029 | 0.983±0.021 | 0.207±0.009 | 0.166±0.004 | 0.092±0.014
RF             | 4     | 0.955±0.052 | 0.992±0.016 | 1.000±0.000 | 0.198±0.016 | 0.174±0.008 | 0.104±0.003
SVM            | 1     | 1.000±0.000 | 0.992±0.016 | 0.984±0.020 | 0.000±0.000 | 0.000±0.000 | 0.000±0.000
SVM            | 2     | 0.969±0.029 | 0.991±0.017 | 1.000±0.000 | 0.000±0.000 | 0.000±0.000 | 0.000±0.000
SVM            | 3     | 1.000±0.000 | 0.984±0.020 | 0.957±0.000 | 0.000±0.000 | 0.000±0.000 | 0.000±0.000
SVM            | 4     | 1.000±0.000 | 0.984±0.020 | 0.957±0.000 | 0.048±0.029 | 0.000±0.000 | 0.003±0.006
LR             | 1     | 0.667±0.000 | 0.406±0.071 | 0.278±0.009 | 0.021±0.006 | 0.000±0.000 | 0.007±0.000
LR             | 2     | 0.969±0.029 | 0.992±0.016 | 0.957±0.000 | 0.051±0.006 | 0.012±0.003 | 0.007±0.003
LR             | 3     | 0.954±0.038 | 0.939±0.038 | 0.832±0.100 | 0.031±0.003 | 0.000±0.000 | 0.000±0.000
LR             | 4     | 0.985±0.031 | 0.985±0.031 | 0.984±0.020 | 0.108±0.004 | 0.042±0.005 | 0.016±0.003
Naïve Bayes    | 1     | 0.883±0.025 | 0.759±0.083 | 0.649±0.071 | 0.105±0.002 | 0.057±0.000 | 0.030±0.000
Naïve Bayes    | 2     | 0.911±0.043 | 0.859±0.038 | 0.772±0.076 | 0.109±0.000 | 0.067±0.001 | 0.030±0.000
Naïve Bayes    | 3     | 0.852±0.030 | 0.710±0.093 | 0.509±0.072 | 0.115±0.001 | 0.060±0.001 | 0.038±0.000
Naïve Bayes    | 4     | 0.852±0.029 | 0.708±0.099 | 0.535±0.071 | 0.117±0.001 | 0.063±0.000 | 0.034±0.000
GBM            | 1     | 0.941±0.020 | 0.955±0.052 | 0.911±0.044 | 0.158±0.005 | 0.118±0.004 | 0.142±0.017
GBM            | 2     | 0.921±0.052 | 0.984±0.020 | 0.829±0.120 | 0.152±0.007 | 0.119±0.011 | 0.093±0.012
GBM            | 3     | 0.938±0.055 | 0.939±0.048 | 0.876±0.051 | 0.115±0.009 | 0.096±0.023 | 0.091±0.009
GBM            | 4     | 0.915±0.091 | 0.961±0.034 | 0.856±0.057 | 0.156±0.013 | 0.114±0.012 | 0.101±0.018
DT             | 1     | 0.870±0.016 | 0.867±0.076 | 0.860±0.070 | 0.238±0.014 | 0.039±0.013 | 0.011±0.016
DT             | 2     | 0.768±0.096 | 0.885±0.082 | 0.804±0.063 | 0.085±0.022 | 0.035±0.007 | 0.017±0.007
DT             | 3     | 0.935±0.065 | 0.902±0.063 | 0.891±0.028 | 0.235±0.016 | 0.073±0.009 | 0.006±0.011
DT             | 4     | 0.893±0.075 | 0.933±0.054 | 0.955±0.052 | 0.035±0.034 | 0.187±0.014 | 0.080±0.018
Model1c        | -     | 0.693±0.066 | 0.928±0.023 | 0.604±0.031 | 0.046±0.005 | 0.052±0.004 | 0.017±0.002
Model2d        | -     | 0.950±0.039 | 0.976±0.020 | 0.978±0.044 | 0.199±0.012 | 0.152±0.005 | 0.123±0.015
Proposed Model | -     | 0.939±0.038 | 0.925±0.044 | 0.969±0.029 | 0.281±0.011 | 0.243±0.016 | 0.194±0.011

a '1491' and '1392' represent the taxonomy IDs for the related bacterium pathogen species from Sect. 3.1; b 1–4 are the different feature representation algorithms of ACC, CTM, PsePSSM and BlockPSSM; c Model1 is the method from [37]; d Model2 is the method from [9].

Additionally, we designed a novel three-dimensional tensor as the feature representation, a multi-channel feature in this study. The design of the multi-channel feature benefits from the sequence-based feature representation algorithms. A tree-based feature selection algorithm is first employed to unify the features to be transformed into the multi-channel feature. Once the features are processed, the data are resampled with the SMOTE technique to ease
the imbalanced ratio. The output of the SMOTE step is subsequently stacked horizontally to build the multi-channel feature data, which is then input to the Bi-LSTM.

Table 3. Results of F1-score for Pathogen Taxonomy ID '177419'
Model          | Rep.b | '177419'a
               |       | 1:25        | 1:50        | 1:100
RF             | 1     | 0.040±0.014 | 0.003±0.004 | 0.000±0.000
RF             | 2     | 0.029±0.015 | 0.007±0.003 | 0.008±0.005
RF             | 3     | 0.069±0.014 | 0.015±0.006 | 0.005±0.004
RF             | 4     | 0.043±0.013 | 0.008±0.009 | 0.002±0.003
SVM            | 1     | 0.127±0.014 | 0.052±0.006 | 0.027±0.006
SVM            | 2     | 0.023±0.006 | 0.041±0.013 | 0.052±0.010
SVM            | 3     | 0.122±0.011 | 0.040±0.008 | 0.000±0.000
SVM            | 4     | 0.106±0.014 | 0.020±0.006 | 0.000±0.000
LR             | 1     | 0.008±0.000 | 0.000±0.000 | 0.000±0.000
LR             | 2     | 0.062±0.007 | 0.011±0.004 | 0.000±0.000
LR             | 3     | 0.000±0.000 | 0.000±0.000 | 0.000±0.000
LR             | 4     | 0.145±0.010 | 0.082±0.010 | 0.056±0.005
Naïve Bayes    | 1     | 0.116±0.001 | 0.063±0.001 | 0.036±0.000
Naïve Bayes    | 2     | 0.113±0.001 | 0.056±0.001 | 0.029±0.000
Naïve Bayes    | 3     | 0.123±0.003 | 0.076±0.002 | 0.040±0.000
Naïve Bayes    | 4     | 0.119±0.001 | 0.067±0.000 | 0.035±0.000
GBM            | 1     | 0.076±0.009 | 0.074±0.025 | 0.041±0.007
GBM            | 2     | 0.103±0.024 | 0.045±0.006 | 0.037±0.008
GBM            | 3     | 0.111±0.007 | 0.092±0.009 | 0.048±0.007
GBM            | 4     | 0.122±0.017 | 0.082±0.007 | 0.051±0.012
DT             | 1     | 0.153±0.023 | 0.017±0.012 | 0.000±0.000
DT             | 2     | 0.036±0.036 | 0.020±0.015 | 0.006±0.006
DT             | 3     | 0.164±0.017 | 0.049±0.006 | 0.014±0.012
DT             | 4     | 0.002±0.003 | 0.106±0.010 | 0.020±0.014
Model1c        | -     | 0.029±0.011 | 0.005±0.004 | 0.000±0.000
Model2d        | -     | 0.109±0.016 | 0.068±0.011 | 0.052±0.013
Proposed Model | -     | 0.244±0.012 | 0.186±0.015 | 0.135±0.015

a '177419' represents the taxonomy ID for the related bacterium pathogen species from Sect. 3.1; b, c, d the same as in Table 2.
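The multi-channel construction described above (tree-based feature selection, a SMOTE resampling step, horizontal stacking of the per-algorithm features) can be sketched as follows. This is a simplified, self-contained illustration with a toy interpolation in place of a full SMOTE library, not the authors' implementation:

```python
import numpy as np

def smote_like(x_min, n_new, k=3, seed=0):
    """Toy SMOTE: synthesise minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    new = np.empty((n_new, x_min.shape[1]))
    for m in range(n_new):
        i = rng.integers(len(x_min))
        d = np.linalg.norm(x_min - x_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]     # k nearest neighbours, skipping the sample itself
        j = rng.choice(nn)
        lam = rng.random()              # interpolation factor in [0, 1)
        new[m] = x_min[i] + lam * (x_min[j] - x_min[i])
    return new

def stack_channels(feature_matrices):
    """Stack per-algorithm feature matrices (each n_samples x n_features)
    into a 3-D tensor of shape (n_samples, n_features, n_channels)."""
    return np.stack(feature_matrices, axis=-1)
```

In the paper's pipeline the resampled channels from each feature representation algorithm are stacked this way before being fed to the Bi-LSTM.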
4.2 Evaluation and Discussion
For the evaluation, all data have been preprocessed with the same protocol according to the relevant literature. Due to the space limit, the results for pathogens with taxonomy IDs '1491' and '1392' are collectively included in Table 2 and the results for '177419' are included in Table 3. The first two best performances of each column are indicated by bold font. We can observe that the performances of the different machine learning models vary considerably across datasets. It is not easy to identify which one would achieve the best result in combination with an appropriate feature representation algorithm. In Table 2, the overall performance of Model2 is better than that of Model1. However, neither is the best nor the second best. Across the different columns, the traditional models present different performance capabilities. Our proposed Bi-LSTM-based model achieves a more stable and better performance than the others for the HBI systems of IDs '1392' and '177419'. These two datasets are much bigger than that of ID '1491', for which the Bi-LSTM-based model has not been the best; however, it still yields stable results as the ratio changes. Meanwhile, the Bi-LSTM-based model also shows a strong capability in dealing with the imbalance issue. In the overall comparison, the Bi-LSTM-based model demonstrates the best performance.
5 Conclusion
In this study, we presented an extensive evaluation of HB-PPI prediction. We deliver this work as a first attempt to systematically evaluate machine learning methods for HB-PPI prediction. Three challenges were identified as causing the performance fluctuation across the HBI datasets. Thus, a complete experimental framework for different HBI systems was established to learn and predict from positive and unlabeled protein interaction data. We have also proposed a Bi-LSTM-based model that achieves a more robust and effective performance. Although its performance is better than that of the others, we expect to design more sophisticated learning models for prediction in the future.
References

1. Ahmed, H.R., et al.: Pattern discovery in protein networks reveals high-confidence predictions of novel interactions. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 2938–2945 (2014)
2. Ammari, M.G., et al.: HPIDB 2.0: a curated database for host–pathogen interactions. Database 2016 (2016)
3. Ben-Hur, A., et al.: Kernel methods for predicting protein–protein interactions. Bioinformatics 21(suppl 1), i38–i46 (2005)
4. Breuer, K., et al.: InnateDB: systems biology of innate immunity and beyond–recent updates and continuing curation. Nucleic Acids Res. 41(D1), D1228–D1233 (2013)
5. Calderone, A., et al.: mentha: a resource for browsing integrated protein-interaction networks. Nat. Meth. 10(8), 690–691 (2013)
H. Chen et al.
6. Chatr-Aryamontri, A., et al.: The BioGRID interaction database: 2017 update. Nucleic Acids Res. 45(D1), D369–D379 (2017)
7. Chou, K.C., et al.: MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem. Biophys. Res. Commun. 360(2), 339–345 (2007)
8. Consortium, U., et al.: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 46(5), 2699 (2018)
9. Cui, G., et al.: Prediction of protein-protein interactions between viruses and human by an SVM model. In: BMC Bioinformatics, vol. 13, p. S5. Springer (2012)
10. Durmuş, S., et al.: A review on computational systems biology of pathogen-host interactions. Front. Microbiol. 6, 235 (2015)
11. Durmuş Tekir, S., et al.: PHISTO: pathogen-host interaction search tool. Bioinformatics 29(10), 1357–1358 (2013)
12. Durmuş Tekir, S., et al.: Infection strategies of bacterial and viral pathogens through pathogen–human protein–protein interactions. Front. Microbiol. 3, 46 (2012)
13. Dyer, M.D., et al.: Supervised learning and prediction of physical interactions between human and HIV proteins. Infect. Genet. Evolut. 11(5), 917–923 (2011)
14. Eid, F.E., et al.: DeNovo: virus-host sequence-based protein-protein interaction prediction. Bioinformatics 32(8), 1144–1150 (2016)
15. Emamjomeh, A., et al.: Predicting protein-protein interactions between human and hepatitis C virus via an ensemble learning method. Mol. Biosyst. 10(12), 3147–3154 (2014)
16. Gomez, S.M., et al.: Learning to predict protein-protein interactions from protein sequences. Bioinformatics 19(15), 1875–1881 (2003)
17. Guo, Y., et al.: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 36(9), 3025–3030 (2008)
18. Hochreiter, S., et al.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
19. Cheol Jeong, J., et al.: On position-specific scoring matrix for protein function prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 8(2), 308–315 (2010)
20. Joshi-Tope, G., et al.: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 33(suppl 1), D428–D432 (2005)
21. Kerrien, S., et al.: The IntAct molecular interaction database in 2012. Nucleic Acids Res. 40(D1), D841–D846 (2012)
22. König, R., et al.: Global analysis of host-pathogen interactions that regulate early-stage HIV-1 replication. Cell 135(1), 49–60 (2008)
23. Kshirsagar, M., et al.: Multitask learning for host-pathogen protein interactions. Bioinformatics 29(13), i217–i226 (2013)
24. Licata, L., et al.: MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 40(D1), D857–D861 (2012)
25. Lin, T.Y., et al.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
26. Nanni, L., et al.: An empirical study of different approaches for protein classification. Sci. World J. 2014 (2014)
27. Nourani, E., et al.: Computational approaches for prediction of pathogen-host protein-protein interactions. Front. Microbiol. 6, 94 (2015)
28. Prieto, C., et al.: APID: agile protein interaction data analyzer. Nucleic Acids Res. 34(suppl 2), W298–W302 (2006)
29. Salwinski, L., et al.: The database of interacting proteins: 2004 update. Nucleic Acids Res. 32(suppl 1), D449–D451 (2004)
30. Schuster, M., et al.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
31. Sen, R., et al.: A review on host-pathogen interactions: classification and prediction. Eur. J. Clin. Microbiol. Infect. Dis. 35(10), 1581–1599 (2016)
32. Shen, J., et al.: Predicting protein-protein interactions based only on sequences information. PNAS 104(11), 4337–4341 (2007)
33. Soyemi, J., et al.: Inter-species/host-parasite protein interaction predictions reviewed. Curr. Bioinform. 13(4), 396–406 (2018)
34. Wang, X., et al.: A novel matrix of sequence descriptors for predicting protein-protein interactions from amino acid sequences. PLoS One 14(6), e0217312 (2019)
35. Wattam, A.R., et al.: PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 42(D1), D581–D591 (2014)
36. Wu, J., et al.: Towards a general prediction system for the primary delay in urban railways. In: 2019 IEEE ITSC, pp. 3482–3487. IEEE (2019)
37. Wuchty, S.: Computational prediction of host-parasite protein interactions between P. falciparum and H. sapiens. PLoS One 6(11), e26960 (2011)
38. Yao, Y., et al.: Bi-directional LSTM recurrent neural network for Chinese word segmentation. In: ICONIP, pp. 345–353. Springer (2016)
39. Zhang, J., et al.: Review and comparative assessment of sequence-based predictors of protein-binding residues. Brief. Bioinform. 19(5), 821–837 (2018)
40. Zhang, L.: Sequence-based prediction of protein-protein interactions using random tree and genetic algorithm. In: ICIC, pp. 334–341. Springer (2012)
41. Zhou, H., et al.: Progress in computational studies of host-pathogen interactions. J. Bioinform. Comput. Biol. 11(02), 1230001 (2013)
Machine Learning for Depression Screening in Online Communities

Alina Trifan(B), Rui Antunes, and José Luís Oliveira

DETI/IEETA, University of Aveiro, Aveiro, Portugal
{alina.trifan,ruiantunes,jlo}@ua.pt
Abstract. Social media writings have been explored over the last years, in the context of mental health, as a potential source of information for extending the so-called digital phenotyping of a person. In this paper we present a computational approach for the classification of depressed social media users. We conducted a cross evaluation study based on two public datasets, collected from the same social network, in order to understand the impact of transfer learning when the data source is virtually the same. We hope that the results presented here challenge the research community to address more often the issues of reproducibility and interoperability, two key concepts in the era of computational Big Data.

Keywords: Social media · Data mining · Mental health · Natural language processing

1 Introduction
The widespread use of social media, combined with the rapid development of computational infrastructures to support big data and the maturation of natural language processing and machine learning technologies, offers exciting possibilities for the improvement of both population-level and individual-level health [7]. The Internet and social media have quickly become major sources of health information, providing both broad and targeted exposure to such information as well as facilitating information seeking and sharing. As people increasingly turn to social media for news and information, these platforms can serve as novel sources of observational data for infodemiology and public health surveillance, as well as for tracking health attitudes and behavioral intention. The World Health Organization's Mental Health Atlas 2017 [1] reveals a global shortage of health workers trained in mental health and a lack of investment in community-based mental health facilities. The median number of mental health beds per 100 000 people ranges from below 7 in low and lower middle-income countries to over 50 in high-income countries. Prevention and early identification of mental health diseases by means that are complementary to traditional medical approaches have the ability to mitigate the under-supply of mental health facilities. They can do so by advancing different types of counseling or support

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 102–111, 2021. https://doi.org/10.1007/978-3-030-54568-0_11
for the ones in need, such as connecting a depressed person to resources or peer support when they most need it [9]. Different types of depression or anxiety may be best treated with different types of internet interventions. As the advantages of mining social media for inferring health-related outcomes are slowly being embraced by the research community, Natural Language Processing (NLP) researchers are focusing their attention more and more on inferring risk situations on social media networks. Early prediction or detection of depressed users of social media is one such example. While there have been important breakthroughs in identifying psycholinguistic features that characterize the written speech of people in distress, data-related concerns still hamper a smooth evolution of this research area. Posts written on social media may be extremely personal and, as a result, users often do not feel comfortable granting researchers access to such private information. Because of this ethical apprehension and the difficulty of enrolling a large number of social media users in control trials in which they would agree to have their social media history processed, researchers have been guiding their efforts into collecting social media posts from public social networks. The most used ones have been Twitter1 and, more recently, Reddit2. Their popularity among NLP research groups is also increased by the fact that both social networks provide Application Programming Interfaces (APIs) for collecting posts that are publicly shared. In this paper we evaluate several machine learning algorithms on two different public datasets of social media posts, with the purpose of inferring the mental health status of their authors. More precisely, we are interested in classifying depressed users on social media so as to leverage the potential of social media predictors as possible pre-clinical screening tools.
Both datasets are publicly available, under a user agreement protocol, and both are collections of Reddit posts. We explore not only the binary classification of depressed versus non-depressed users in each of the datasets, but also aim to understand whether cross evaluation can lead to reliable results. To this purpose we experiment with standard NLP pipelines for binary classification of depressed users, along with a rule-based estimator that takes into consideration psycholinguistic features that characterize the writing style of depressed online users. We discuss the cross evaluation results obtained when training a model on one of the datasets and testing its performance on the second dataset. The remainder of this paper is organized as follows: Sect. 2 overviews the state of the art in mining social media for inferring health outcomes. We present the methodology for classifying depressed users in Sect. 3. Results are presented in Sect. 4 and further discussed in Sect. 5. We draw final remarks in Sect. 6.
2 Background
In the digital era, researchers are keen to study how people interact with the digital world to better understand their mood, cognition, and behavior. In this paper we explore the potential of anonymized social media posts as an extension

1 https://www.twitter.com/.
2 https://www.reddit.com/.
A. Trifan et al.
to digital phenotyping [24]. We focus on classifying depressed users in online forums based on their digital written trace, as an addition to traditional clinical procedures that lead to the identification of such a population at risk. Social media platforms serve as novel sources of rich observational data for health research such as infodemiology, infoveillance, and digital disease detection [3,12,16,31]. Patients with chronic health conditions use online health communities to seek support and information to help manage their condition. Automatically identifying forum posts that need validated clinical resources can help online health communities efficiently manage content exchange. This automation can also assist patients in need of clinical expertise in getting proper help [25]. Disease surveillance is one of the longest-running use cases for social media mining. Influenza has been by far the most commonly surveyed disease [5,14,23,30]. Pharmacovigilance is another established use case of social media mining (user posts mentioning adverse reactions and the extraction of drug-adverse reaction association signals) [26,28]. Through an application to birth defects, Klein et al. [13] present a generalizable NLP-based approach to iteratively prepare an annotated dataset of rare health-related events reported on social media. Sentiment analysis has been applied to social media in various ways to understand important public health issues, such as public attitudes towards vaccination or marijuana. Emotion-bearing tweets can be utilized to detect and monitor disease outbreaks, which suggests that emotion classification could help distinguish outbreak-related tweets from other disease discussion [4,17,18]. Social data mining has the potential to improve our understanding of the determinants and consequences of well-being, which is correlated with outcomes of both mental and physical health [20].
Several studies focusing on mental health understanding through social network data have been conducted using Twitter texts. Coppersmith et al. [8] presented a method for gathering data for a range of mental illnesses, along with proof-of-concept results that focus on the analysis of four mental disorders: post-traumatic stress disorder, depression, bipolar disorder, and seasonal affective disorder. Their ultimate goal was to enable the ethical discussion regarding the balance between the utility of such data and the privacy of mental health related information. Later on, Coppersmith et al. [9] released a Twitter dataset of users who have attempted suicide, matched by neurotypical control users. Language modeling techniques were employed to classify these users, along with open government data, to identify quantifiable signals that can relate them to psychometrically validated concepts associated with suicide. Nadeem et al. [19] used the same dataset to predict Major Depressive Disorder among online personas based on a Bag of Words (BoW) approach and several statistical classifiers. More recently, Vioulès et al. [27] combined NLP features with a martingale framework to detect Twitter posts containing suicide-related content. The results were comparable to those of traditional machine learning classifiers. A recent study by Ernala et al. [11] questions the validity of classification results when there is no medical confirmation of the diagnosis and raises a meaningful discussion of the methodologies used so far for identifying patients at risk in online forums.
3 Methodology
In this paper we tackle a binary classification problem of depressed online users based on two publicly available datasets. We explore machine learning algorithms along with psycholinguistic features. Since both datasets were built using posts collected from the same social network, we are interested in a cross evaluation process in order to understand the impact of the data collection process on the classification results, as well as the interoperability of such datasets. We are interested in understanding how well a model trained on one dataset performs on a different dataset, and whether this is an indicator of the generalization or reproducibility of the approach.
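The cross evaluation protocol can be sketched with scikit-learn; the corpora below are toy stand-ins for the actual Reddit collections described in Sect. 3.1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpora: dataset A for training, dataset B for testing.
train_a = ["i feel hopeless and tired", "i cannot sleep and feel empty",
           "great match last night", "new recipe turned out well"]
y_a = [1, 1, 0, 0]
test_b = ["everything feels empty lately", "enjoyed the hiking trip"]
y_b = [1, 0]

model = make_pipeline(TfidfVectorizer(), PassiveAggressiveClassifier(random_state=0))
model.fit(train_a, y_a)        # train on dataset A ...
pred = model.predict(test_b)   # ... evaluate on dataset B
score = f1_score(y_b, pred)
```

The interesting quantity is how much `score` degrades relative to training and testing on the same collection, which is exactly what Tables 3 and 4 compare.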
3.1 Corpus Description
Reddit is a social media network of communities that aggregate users who share a common interest in a given discussion topic. Discussion topics are called subreddits and many of them are public, meaning that anyone can read the posts published within these subreddits. Reddit users are given a randomly generated username, which makes it impossible to identify a person based on the username. Moreover, Reddit provides an API for crawling public posts. Because privacy is guaranteed and public posts can be easily collected by programmatic means, Reddit has become one of the most popular social network data sources among NLP researchers. The datasets used in this work are both composed of Reddit posts and are detailed next. The Reddit Self-reported Depression Dataset (RSDD) proposed by Yates et al. [29] consists of all Reddit users who made a post between January and October 2016. Using high-precision patterns of self-reported diagnoses, 9210 diagnosed users were matched with 107 274 control users. The Losada and Crestani test collection [15] comprises 137 depressed users matched by 755 control users. Users who expressed self-reported depression diagnoses were obtained by running specific searches against Reddit, and their writings were then manually curated by the authors of the dataset. Both datasets contain the writing history of each user, from which posts that explicitly express the diagnosis were removed. We consider this curation relevant for two main reasons. First, it zeroes the possibility of wrongly classifying people engaged in social forums who share experiences of relatives or family members, as well as people who might be seeking to help those who are depressed, such as doctors addressing the topic of depression. Moreover, it is relevant in the context of replicating a scenario in which depressed users do not focus on their disease and might even be unaware of it, which makes this prediction task even more broadly applicable.
This relates to the possibility of identifying people who are unaware of their mental health status through heterogeneous texts.

3.2 Text Preprocessing
Prior to post classification, the preprocessing of the Reddit posts followed a standard pipeline in NLP. Posts were lowercased and tokenized, all non-alphabetic
106
A. Trifan et al.
characters were removed, and words with fewer than 2 characters were filtered out, along with other stopwords. The NLTK stopword list was used for this purpose3.
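A minimal sketch of this preprocessing step, with a tiny illustrative stopword list standing in for the full NLTK list:

```python
import re

# Illustrative excerpt only; the paper uses the full NLTK stopword list.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "i"}

def preprocess(post, stopwords=STOPWORDS):
    """Lowercase and tokenise a post, keeping only alphabetic tokens of at
    least 2 characters that are not stopwords."""
    tokens = re.findall(r"[a-z]+", post.lower())
    return [t for t in tokens if len(t) >= 2 and t not in stopwords]
```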
3.3 Psycholinguistic Features
The background study on the use of social media for mental health status prediction revealed a series of cognitive features that characterize the writings of depressed users. We explored some of these patterns as features for a rule-based estimator, which we developed with the intent of understanding their impact on the classification of depressed users. We describe next the ones on which we focused in this paper.
– Absolutist words: a recent study on absolutist thinking, which is considered a cognitive distortion by most cognitive therapies for anxiety and depression, showed that anxiety, depression, and suicidal ideation forums contain more absolutist words than control forums [2]. The study by Al-Mosaiwi et al. resulted in a validated absolutist words dictionary that was used in this paper.
– Self-related words: depressed users tend to use self-related words (such as I, myself, mine) more often in their writings [6,22].
– Post length: depressed and suicidal people tend to write more words than control users [10].
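These cues can be counted directly over the preprocessed tokens. The word lists below are small illustrative excerpts only; the paper relies on the validated dictionary of Al-Mosaiwi et al. [2] for absolutist words:

```python
# Illustrative excerpts, not the validated dictionaries used in the paper.
ABSOLUTIST = {"always", "never", "completely", "totally", "nothing", "everything"}
SELF_RELATED = {"i", "me", "my", "mine", "myself"}

def psycholinguistic_features(tokens):
    """Counts of absolutist and self-related words, plus post length."""
    return {
        "absolutist": sum(t in ABSOLUTIST for t in tokens),
        "self_related": sum(t in SELF_RELATED for t in tokens),
        "length": len(tokens),
    }
```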
3.4 Binary Classification
As a first approach, we followed a standard natural language processing pipeline for text classification. We considered BoW features with tf-idf feature weighting for different classifiers: Multinomial Naive Bayes, Passive Aggressive Classifier and Support Vector Machine with Stochastic Gradient Descent. A second approach took into consideration the psycholinguistic features previously introduced, which we modelled as features of a rule-based estimator. We then considered a feature union with equal weights for the tf-idf features and the output of the rule-based estimator, combined with a Passive Aggressive classifier. The code was written in Python4 and we used scikit-learn [21] as the machine learning framework.
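The feature union just described can be sketched with scikit-learn. `RuleFeatures` is a simplified stand-in for the rule-based estimator (two illustrative features only), not the authors' exact implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import FeatureUnion, make_pipeline

SELF_RELATED = {"i", "me", "my", "mine", "myself"}  # illustrative excerpt

class RuleFeatures(BaseEstimator, TransformerMixin):
    """Dense psycholinguistic features: self-word ratio and post length."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rows = []
        for doc in X:
            toks = doc.lower().split()
            n = max(len(toks), 1)
            rows.append([sum(t in SELF_RELATED for t in toks) / n, len(toks)])
        return np.array(rows)

# Union of sparse tf-idf features and dense rule-based features.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("rules", RuleFeatures()),
])
clf = make_pipeline(features, PassiveAggressiveClassifier(random_state=0))

docs = ["i feel so alone and my mind races",
        "i never sleep and i cry",
        "the team won the cup",
        "baked bread this morning"]
labels = [1, 1, 0, 0]
clf.fit(docs, labels)
pred = clf.predict(docs)
```

FeatureUnion horizontally stacks both feature blocks before the classifier, which is how the tf-idf and rule-based signals are combined with equal weight.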
4 Results
Because the two datasets were considerably different in size, we opted to use the one proposed by Yates et al. [29] as reference and cross evaluate its performance on the one introduced by Losada and Crestani [15]. For each dataset, we present the statistics of the training and test subsets in Tables 1 and 2, respectively.

3 https://gist.github.com/sebleier/554280.
4 www.python.org.
Table 1. Statistics of the training datasets.

                                  | [29]                 | [15]
                                  | Control | Depressed  | Control | Depressed
Number of subjects                | 36197   | 3112       | 403     | 83
Avg. number of words per user     | 16416   | 20820      | 69556   | 21318
Avg. number of absolutist words   | 189     | 701        | 153     | 154
Avg. number of self-related words | 579     | 2411       | 430     | 731
Table 2. Statistics of the test datasets.

                                  | [29]                 | [15]
                                  | Control | Depressed  | Control | Depressed
Number of subjects                | 36218   | 3112       | 352     | 54
Avg. number of words per user     | 21164   | 70305      | 21933   | 15370
Avg. number of self-related words | 590     | 2435       | 529     | 637
Avg. number of absolutist words   | 189     | 709        | 167     | 145
These statistics show, for each of the two datasets, a sharp difference between control and depressed users in the number of self-related words and the total number of words used per user. This holds for both the training and test collections. When it comes to absolutist words, we note that in the Losada and Crestani dataset there is no clear distinction between the number of absolutist words used by control and depressed users, in either the training or the test collection. This explains the low recall value obtained in the classification when relying on the rule-based estimator. Table 3 presents the classification results for each of the datasets when training was done on the respective training corpus and performance was measured on the corresponding test corpus. For the Yates et al. dataset, the feature union estimator provides the best results in terms of recall and F1 score, while the Support Vector Machine (SVM) leads to an improved precision. In the case of the Losada and Crestani dataset, the best results are obtained by the Multinomial Bayes predictor, while the feature union results are probably negatively influenced by the use of absolutist words as a feature. The results of the cross evaluation are presented in Table 4. For this experiment, we trained the model on the entire Yates et al. dataset and considered the whole Losada and Crestani dataset as the test corpus. The results obtained in the cross evaluation are slightly worse than when training and testing on the same dataset. While these results are preliminary and probably intuitive for most readers, we consider such cross evaluations important for researchers to understand whether their work is useful outside a well-defined scenario. Moreover, these results represent evidence for supporting research into
Table 3. Classification results.

     | Method                   | Prec | Rec  | F1   | Acc
[29] | Support Vector Machine   | 0.76 | 0.62 | 0.68 | 0.95
     | Multinomial Bayes        | 0.61 | 0.47 | 0.53 | 0.94
     | Passive Aggressive       | 0.64 | 0.64 | 0.64 | 0.94
     | Feature Union Rule-based | 0.68 | 0.72 | 0.70 | 0.95
[15] | Support Vector Machine   | 0.91 | 0.20 | 0.33 | 0.89
     | Multinomial Bayes        | 0.52 | 0.64 | 0.57 | 0.87
     | Passive Aggressive       | 0.70 | 0.38 | 0.50 | 0.89
     | Feature Union Rule-based | 0.72 | 0.14 | 0.24 | 0.87

Table 4. Cross evaluation results.

Method                   | Prec | Rec  | F1   | Acc
Support Vector Machine   | 0.46 | 0.38 | 0.42 | 0.83
Multinomial Bayes        | 0.19 | 0.19 | 0.19 | 0.75
Passive Aggressive       | 0.40 | 0.62 | 0.49 | 0.80
understanding how much data collection and curation contribute to prediction or detection biases.
5 Discussion
While this study is not free from limitations, we consider it relevant for understanding how data collection impacts the results of a specific detection model and what the current status of dataset interoperability is when it comes to social media writings. Data availability has always been an issue in the era of Big Data, and while consistent efforts are being made to securely gather social media writings, we would like to understand whether data collection itself is a source of bias, or whether models trained on a given dataset maintain their performance, with genuine knowledge transfer, when used on a different dataset. In this paper we conducted a simple experiment of cross evaluation in the scenario of binary classification of depressed online users of social media. The results obtained show that models trained on a given dataset have the potential to be reused with different datasets. While the cross evaluation results are slightly inferior to the ones obtained when training and testing on the same dataset, it is important to conduct such comparative experiments in order to have a broader understanding of the generalization of such results.
6 Conclusions
Social media mining has the potential to extend the definition of digital phenotyping by contributing new insights on a person's well-being based on their online writings. Several studies focus on developing models for early prediction and high-accuracy classification of depressed online users, but little work has been done so far to ensure study interoperability and reproducibility. Apart from building precise and fast models, we are interested in building models that can be reused even when the data changes. This study presented cross evaluation results obtained when using one public dataset for training and a different one for testing. These results show that cross evaluation scores are lower than when training and testing on the same dataset. Even though this outcome is somewhat intuitive, we hope that this paper can start the discussion on the topic of reusability and encourage scientists to test their approaches "out of the box".

Acknowledgements. This work was supported by the Integrated Programme of SR&TD SOCA (Ref. CENTRO-01-0145-FEDER-000010), co-funded by the Centro 2020 program, Portugal 2020, European Union, through the European Regional Development Fund. Rui Antunes is supported by the Fundação para a Ciência e a Tecnologia (PhD Grant SFRH/BD/137000/2018).
A. Trifan et al.
Towards Triclustering-Based Classification of Three-Way Clinical Data: A Case Study on Predicting Non-invasive Ventilation in ALS

Diogo Soares1(B), Rui Henriques2, Marta Gromicho3, Susana Pinto3, Mamede de Carvalho3, and Sara C. Madeira1

1 LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal {dfsoares,sacmadeira}@ciencias.ulisboa.pt
2 INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal [email protected]
3 Instituto de Medicina Molecular, Instituto de Fisiologia, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal [email protected], {susana.c.pinto,mamedealves}@medicina.ulisboa.pt
Abstract. The importance of learning disease progression patterns from longitudinal clinical data, and of using them effectively to improve prognosis, triggers the need for new approaches to three-way data analysis. In this context, triclustering has been widely researched for its potential in biomedical problems, showing promising results in the discovery of putative biological modules, patient profiles, and disease progression patterns. In this work, we propose a triclustering-based approach for three-way data classification, resulting from a combination of triclustering with random forests, and use it to predict the need for non-invasive ventilation in ALS patients. We analyse ALSFRS-R functional scores together with respiratory function tests collected during patient follow-up. The results are promising, enabling us to understand the potential of triclustering and to pinpoint improvements towards an effective triclustering-based classifier for clinical domains, taking advantage of the benefits of exploring disease progression patterns mined from three-way clinical data.

Keywords: Triclustering · Three-dimensional data · Three-way clinical data · Amyotrophic lateral sclerosis · Prognostic prediction
1 Introduction
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 112–122, 2021. https://doi.org/10.1007/978-3-030-54568-0_12

Given a (real-valued, symbolic or heterogeneous) three-dimensional dataset (three-way data), triclustering aims to discover subsets of observations, attributes, and contexts (triclusters) satisfying certain homogeneity and statistical significance criteria [10]. Promising triclustering applications in clinical domains are: multivariate physiological signal data analysis, where triclusters can capture coherent physiological responses for a group of individuals; neuroimaging data analysis, where triclusters can capture hemodynamic response functions and connectivity between brain regions; and clinical records analysis (patient-feature-time data), where triclusters correspond to groups of patients with correlated clinical features along time [10]. This work focuses on this latter class.

Amyotrophic Lateral Sclerosis (ALS) is a highly heterogeneous neurodegenerative disease characterized by rapidly progressive muscular weakness. Patients with ALS generally die from respiratory failure within 3 to 5 years, although some live for less than one year while others live more than 10 years [9]. Worldwide, ALS affects between 5.9 and 39 people per 100,000 inhabitants [6]. In Portugal, 10 in 100,000 inhabitants suffer from this disease [7]. Most patients develop hypoventilation with hypoxemia and hypercapnia, requiring non-invasive ventilation (NIV) support [9]. In this context, foreseeing the onset of hypoventilation is key to anticipating opportune interventions, such as the start of NIV. NIV has been demonstrated to be effective in prolonging life and improving quality of life in ALS, in particular in patients without major bulbar muscle weakness [2,3]. In clinical practice, the Revised ALS Functional Rating Scale (ALSFRS-R) is broadly used to help clinicians disclose the state of disease progression [2]. In this scenario, Carreiro et al. [4] proposed the first prognostic models based on clinically defined time windows to predict the need for NIV in ALS. Following this work, Pires et al. [12] stratified patients according to their state of disease progression, and proposed specialized learning models based on three ALS progression groups (slow, normal and fast).
Despite the promising results of using patient stratification for prognostic prediction, these prognostic models did not take into account the temporal dependence between features. Matos [11] used biclustering-based classification: biclustering was used to find groups of patients with coherent values in subsets of clinical features (biclusters), which were then used as features together with static demographic data. The results were interesting, but no temporal data were used. In this work, we propose to couple triclustering with Random Forests and train a triclustering-based classifier able to use disease progression patterns as features. The goal is to predict if a given patient will need NIV in the next 90 days using a classifier learned from temporal data from the follow-up of patients. We use the Lisbon ALS clinical dataset (version September 2019), developed at Hospital de Santa Maria (CHULN) in Lisbon since 1995, which was preprocessed and first used for prognostic prediction using time windows by Carreiro et al. [4]. In this case study, we use triclustering to find disease progression patterns in three-way clinical data, corresponding to groups of patients with coherent temporal evolution, which are then used for prognostic prediction. To this aim, we use triCluster [13], a pioneering and highly cited triclustering algorithm, proposed by Zhao and Zaki to mine patterns in three-way gene expression data. Despite not being proposed for clinical data, we believe it is a good starting point due to its algorithmic approach and the type of patterns that it is able to find: a quasi-exhaustive approach, mining arbitrarily positioned, potentially overlapping, scaling and shifting patterns.
In what follows we propose the triclustering-based classification approach for three-way clinical data, present results on the ALS case study, and draw conclusions.
2 Methods
This section describes a new triclustering-based classification approach, where triclusters are discovered and then used as features in a Random Forest classifier. Figure 1 depicts the workflow. In what follows, we first cover triclustering of clinical three-way data, briefly explaining triCluster, the triclustering algorithm used here to mine patterns in three-way data [13]. We then propose the triclustering-based Random Forests approach for three-way clinical data classification.
Fig. 1. Workflow of the proposed triclustering-based classifier.
2.1 Triclustering Three-Way Clinical Data
In this work, we use triCluster to identify triclusters in three-way clinical data from ALS patients. triCluster [13], proposed and implemented by Zhao and Zaki in 2005, is a pioneering and highly cited triclustering approach. It is a quasi-exhaustive approach, able to mine arbitrarily positioned and overlapping triclusters with constant, scaling, and shifting patterns from three-way data. Given that triCluster was proposed to mine coherent triclusters in three-way gene expression data (gene-sample-time), it is important to understand that clinical data can be preprocessed to have a similar structure, in which gene-sample-time data becomes patient-feature-time data. Figure 2 presents the analogy between these two different, but similar, three-way types of data, which enables triclustering three-way clinical data using triCluster.
triCluster has 3 main steps: 1) construct a multigraph with similar value ranges between all pairs of samples; 2) mine maximal biclusters from the multigraph formed for each time point (slices of the 3D dataset); and 3) extract triclusters by merging similar biclusters from different time-points. Optionally, it can delete or merge triclusters, according to the overlapping criteria used.
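As an illustration of these three steps, the sketch below mines triclusters under drastic simplifications: step 1 (the similarity multigraph) is skipped entirely, exact value matching replaces triCluster's similarity ranges, and no maximality or overlap handling is performed. All function names are ours, not from the original implementation.

```python
from itertools import combinations

def constant_biclusters(time_slice, min_rows=2, min_cols=2):
    """Step 2, simplified: groups of rows sharing identical values
    on a subset of columns, within one time slice."""
    n_cols = len(time_slice[0])
    found = set()
    for size in range(n_cols, min_cols - 1, -1):
        for cols in combinations(range(n_cols), size):
            groups = {}
            for r, row in enumerate(time_slice):
                groups.setdefault(tuple(row[c] for c in cols), []).append(r)
            for rows in groups.values():
                if len(rows) >= min_rows:
                    found.add((frozenset(rows), frozenset(cols)))
    return found

def merge_triclusters(tensor, min_times=2, **kwargs):
    """Step 3, simplified: keep (rows, cols) patterns appearing in at
    least min_times time slices (tensor is a list of 2D slices)."""
    per_time = [constant_biclusters(s, **kwargs) for s in tensor]
    triclusters = []
    for bic in set.union(*per_time):
        times = tuple(t for t, bics in enumerate(per_time) if bic in bics)
        if len(times) >= min_times:
            triclusters.append((bic[0], bic[1], times))
    return triclusters

# Toy tensor: 2 time points x 3 patients x 3 features; patients 0 and 1
# share constant values on features 0 and 1 at both time points.
tensor = [
    [[1, 1, 9], [1, 1, 8], [5, 6, 7]],
    [[1, 1, 3], [1, 1, 2], [9, 9, 9]],
]
```

On this toy tensor the sketch recovers one tricluster: patients {0, 1} with features {0, 1} over time points (0, 1).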
Fig. 2. Gene expression and clinical three-way data representation: (a) gene expression time series; (b) electronic health records.
2.2 Triclustering-Based Random Forest
After running triclustering, the goal is to use the triclusters for patient classification. To this end, we consider triclusters as features and construct a matrix of patients × triclusters to be used by the classifier. The approach followed here was to build a binary matrix, where the relation between a patient i and a tricluster j is 1 if patient i is in tricluster j and 0 otherwise. After computing the class for each learning example, a Random Forest is used to learn the predictive model. In our ALS case study, three-way clinical data are composed of observations of patients at different appointments during follow-up (patient snapshots computed as in Carreiro et al. [4]). Since the goal is to predict the need for non-invasive ventilation within a given time window (90 days, corresponding to the next clinical appointment, in our case), the class used for each patient in the learning examples is binary and represents the patient's evolution/non-evolution to a state where NIV is needed within 90 days (see [4] for details on computing patient snapshots and learning examples). The designed experimental pipeline mines triclusters from patients with at least two appointments and uses the binary matrix with labelled patients as input to a Random Forest classifier.
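A minimal sketch of this construction, assuming triclusters are already available as sets of patient indices (the helper name `tricluster_features` and the toy data are illustrative, not from the original pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tricluster_features(n_patients, triclusters):
    """Binary patients x triclusters matrix: entry (i, j) is 1 iff
    patient i belongs to tricluster j."""
    X = np.zeros((n_patients, len(triclusters)), dtype=int)
    for j, patient_set in enumerate(triclusters):
        X[list(patient_set), j] = 1
    return X

# Toy example: 6 patients, 3 triclusters (given as patient index sets);
# binary class = needs NIV within 90 days ('Y' -> 1, 'N' -> 0).
triclusters = [{0, 1, 2}, {2, 3}, {3, 4, 5}]
X = tricluster_features(6, triclusters)
y = np.array([1, 1, 1, 0, 0, 0])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

In the actual pipeline, each row would correspond to a labelled learning example (2, 3 or 4 consecutive snapshots of one patient) rather than this toy layout.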
3 Results and Discussion
This section presents the results and discusses the challenges of learning a triclustering-based classifier, able to use disease progression patterns as features, from three-way clinical data. We used triCluster for triclustering and Random Forests for classification, as described above. The goal is to predict if a given ALS patient will need NIV within 90 days given current and past clinical evaluations. We learn from the Lisbon ALS clinical dataset described below.

3.1 Data
We use the Lisbon ALS clinical dataset containing Electronic Health Records from ALS patients regularly followed in our clinic since 1995, last updated in September 2019. Its current version (updated after the work of Pires et al. [12]) contains 1319 patients. Each patient has a set of static features (demographics, disease severity, co-morbidities, medication, genetic information, habits, trauma/surgery information and occupations) together with temporal features (collected repeatedly at follow-up), such as disease progression tests (ALSFRS-R scale, respiratory tests, etc.) and clinical laboratory investigations. Since the focus of this work is three-way clinical data analysis, we focus on temporal data, discarding static data. We used 10 features per time point: the Functional Scores (ALSFRS-R), briefly described below, and respiratory tests: Forced Vital Capacity (FVC), Maximal Sniff Nasal Inspiratory Pressure (SNIP), Maximal Inspiratory Pressure (MIP) and Maximal Expiratory Pressure (MEP). ALSFRS-R scores for disease progression rating are an aggregation of integers on a scale of 0 to 4 (where 0 is the worst and 4 is the best), providing different evaluations of the patient's functional abilities at a given time point [8]. This functional evaluation is based on 13 questions, listed in Table 1. Different functional scores are then computed using subsets of scores, as shown in Table 2.

Table 1. ALSFRS-R questions.

Q1 - Speech
Q2 - Salivation
Q3 - Swallowing
Q4 - Handwriting
Q5 - Cutting food and handling utensils
Q6 - Dressing and hygiene
Q7 - Turning in bed and adjusting bed clothes
Q8 - Walking
Q9 - Climbing stairs
Q10 - Respiration
QR1 - Dyspnea
QR2 - Orthopnea
QR3 - Respiratory insufficiency
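Since each functional score in Table 2 is a plain sum over a subset of the question values, the computation can be sketched as follows (the function name and dictionary keys are ours; the bulbar/limb interpretations in the comments follow the question subsets above):

```python
def alsfrs_scores(q):
    """Compute ALSFRS-R functional scores and sub-scores from a dict
    mapping question labels ('Q1'..'Q10', 'QR1'..'QR3') to 0-4 values."""
    s = lambda keys: sum(q[k] for k in keys)
    return {
        "ALSFRS":    s([f"Q{i}" for i in range(1, 11)]),
        "ALSFRS-R":  s([f"Q{i}" for i in range(1, 10)]) + s(["QR1", "QR2", "QR3"]),
        "ALSFRSb":   s(["Q1", "Q2", "Q3"]),   # bulbar questions
        "ALSFRSsUL": s(["Q4", "Q5", "Q6"]),   # upper-limb questions
        "ALSFRSsLL": s(["Q7", "Q8", "Q9"]),   # lower-limb questions
        "ALSFRSr":   q["Q10"],                # respiration
        "R":         s(["QR1", "QR2", "QR3"]),
    }
```

For a fully unimpaired patient (all questions at 4), this yields ALSFRS = 40 and ALSFRS-R = 48.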
3.2 Data Preprocessing
The above ALS dataset, with static and temporal features, was preprocessed as described by Carreiro et al. [4] and Pires et al. [12] to obtain patient snapshots and then compute the Evolution class for each snapshot using the NIV administration date: a patient is labelled 'Y' if NIV was administered within 90 days after the snapshot, and 'N' otherwise. In this work, and in order to apply triCluster [13], we performed experiments using training examples computed as follows: 2, 3 and 4 consecutive snapshots for each patient (corresponding to clinical evaluations at 2, 3 and 4 consecutive appointments, respectively) were used as features, and the NIV evolution value of the last snapshot was used as the class. The first challenge was dealing with missing values, since triCluster does not support them. This is certainly a drawback, since missing values are common in clinical data and should be taken into account in the triclustering step. In this work, and in order to test triCluster, we were thus forced to select only features with low levels of missing values for further analysis, leading to a subset of respiratory tests and the ALSFRS-R scores described above. We then removed all patients with missing values in more than 2 snapshots. For each remaining patient, we performed missing value imputation by using values from previous appointments to impute later missing values (Last Observation Carried Forward), when possible, and the mean/mode over all patients' values, otherwise. After tackling missing values, we had to deal with class imbalance. In our case, and due to the time window of 90 days (next appointment) used as case study, the number of patients labelled as 'N', non-evolutions (2179, 1666 and 1283 examples, for 2, 3 and 4 snapshots, respectively), largely outnumbered those labelled as 'Y' (326, 224 and 162 examples, for 2, 3 and 4 snapshots, respectively), which identify the patients requiring NIV within 90 days and are key for the learning task.
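The missing-value imputation described above can be sketched as follows (a simplified per-feature version; `None` marks a missing value and `fallback` stands for the cohort-level mean/mode):

```python
def locf_impute(series, fallback):
    """Last Observation Carried Forward over one patient's appointments:
    a missing value takes the most recent observed value; leading gaps
    fall back to a cohort-level mean/mode."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last if last is not None else fallback)
    return out
```

For example, a feature sequence [None, 3, None, 5, None] with a cohort mean of 2 becomes [2, 3, 3, 5, 5].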
To deal with this issue we first used a Random Undersampler to reduce the number of 'N' examples until obtaining a class proportion of 2/3-1/3 (652, 448 and 324 'N' versus 326, 224 and 162 'Y', for 2, 3 and 4 snapshots, respectively), and then used SMOTE [5] to balance the datasets to a 50%/50% class proportion, leading to 1304, 896 and 648 learning examples, for 2, 3 and 4 snapshots, respectively.

Table 2. Functional scores and sub-scores according to ALSFRS-R.

Functional score   Description
ALSFRS             Sum of Q1 to Q10
ALSFRS-R           Sum of Q1 to Q9 + QR1 + QR2 + QR3
ALSFRSb            Q1 + Q2 + Q3
ALSFRSsUL          Q4 + Q5 + Q6
ALSFRSsLL          Q7 + Q8 + Q9
ALSFRSr            Q10
R                  QR1 + QR2 + QR3
3.3 Model Evaluation
Since the main goal of this work is to evaluate the results of learning models for prognostic prediction in ALS patients using triclusters as features, we compare the performance of the proposed triclustering-based classification approach with a baseline obtained by training Random Forests with the original temporal features (used to compute the triclusters). To evaluate results, we use 5 × 10-fold Stratified Cross-Validation (CV) and compute the Area Under the Curve (AUC), accuracy, sensitivity and specificity, metrics commonly used in clinical applications.

3.4 Baseline Results Using Random Forests and Original Features
Table 3 shows the baseline results using the original features: respiratory tests and ALSFRS-R scores for each appointment, treated as independent features (3 × 10 features). We can observe that the baseline classifiers achieved classification accuracies around 0.78 in CV with low standard deviation. These are good results, considering those obtained by Pires et al. [12] using patient stratification and a large number of features. Sensitivity and specificity show approximately the same values, meaning all classifiers perform well when predicting both classes.

Table 3. Baseline results (Random Forest with original features).

      AUC            Accuracy       Sensitivity    Specificity
2TP   0.87 ± 0.0024  0.79 ± 0.0076  0.81 ± 0.0054  0.77 ± 0.0079
3TP   0.87 ± 0.0019  0.78 ± 0.0042  0.80 ± 0.0021  0.78 ± 0.0020
4TP   0.87 ± 0.0030  0.78 ± 0.0142  0.80 ± 0.0087  0.76 ± 0.0026
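The sensitivity and specificity reported in these tables are the per-class recalls; a minimal computation from prediction pairs (labels: 1 = needs NIV within 90 days, 0 = otherwise):

```python
def clf_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on the positive class) and
    specificity (recall on the negative class) from confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy":    (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

In the paper these metrics are averaged over the 5 × 10 stratified CV folds; the ± values are the corresponding standard deviations.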
3.5 Results Using Random Forests and Triclusters as Features
Since triCluster allows different parameterizations, potentially discovering triclusters with different types of coherence, we ran the algorithm using 3 different parameterizations: Case 1 - unconstrained, to capture all coherent triclusters across the three dimensions (x-patient, y-feature and z-time); Case 2 - δx = δy = δz = 0, to capture triclusters with constant values across the three dimensions; and Case 3 - δx = 0, to force constant coherence on the patient dimension while relaxing the other two. In all cases we set the minimum number of patients, features and time-points in each tricluster to 25, 2 and 3, respectively. triCluster discovered a total of 460 (121, 61 and 278), 179 (22, 32 and 125) and 1250 (392, 459 and 399) triclusters, for 2, 3 and 4 snapshots, respectively; in parentheses we show the triclusters found in Cases 1, 2 and 3. We then used these triclusters as features, for each case independently and for all cases altogether.
Can Triclusters Outperform Original Features? Table 4 shows the results obtained by the triclustering-based classifier using the triclusters obtained in each case above. Constant triclusters (Case 2) are the worst performing of the three cases. As expected, since the different cases capture different progression patterns, performance improved when we trained the classifier using the triclusters from all three cases together. We tried removing the constant triclusters to evaluate whether they are needless, but the results were worse. Unfortunately, these results are not better than those obtained by the baseline, meaning these triclusters cannot outperform the original features. Potential causes may lie in triCluster's limitations: it was designed to analyse gene expression time series (real-valued) and not clinical data (a mix of real-valued and categorical features), and its approach to dealing with highly overlapping triclusters leads to the creation of redundant features, probably preventing other relevant triclusters from being discovered. Nevertheless, we are still interested in knowing whether these triclusters (temporal features) can be used to improve baseline performance.

Table 4. Triclustering-based results using different parameterizations.

             AUC            Accuracy       Sensitivity    Specificity
2TP  Case 1  0.77 ± 0.0018  0.71 ± 0.0113  0.72 ± 0.0068  0.70 ± 0.0077
     Case 2  0.73 ± 0.0016  0.68 ± 0.0028  0.75 ± 0.0042  0.64 ± 0.0026
     Case 3  0.77 ± 0.0004  0.71 ± 0.0062  0.73 ± 0.0069  0.69 ± 0.0043
     All     0.79 ± 0.0023  0.72 ± 0.0054  0.72 ± 0.0047  0.71 ± 0.0054
3TP  Case 1  0.71 ± 0.0013  0.68 ± 0.0016  0.68 ± 0.0013  0.66 ± 0.0014
     Case 2  0.68 ± 0.0010  0.66 ± 0.0019  0.77 ± 0.0011  0.61 ± 0.0013
     Case 3  0.74 ± 0.0011  0.69 ± 0.0037  0.76 ± 0.0023  0.66 ± 0.0028
     All     0.77 ± 0.0008  0.71 ± 0.0043  0.74 ± 0.0031  0.69 ± 0.0021
4TP  Case 1  0.74 ± 0.0016  0.65 ± 0.0016  0.66 ± 0.0165  0.62 ± 0.0093
     Case 2  0.72 ± 0.0014  0.66 ± 0.0073  0.71 ± 0.0031  0.63 ± 0.0020
     Case 3  0.72 ± 0.0022  0.66 ± 0.0030  0.70 ± 0.0059  0.67 ± 0.0072
     All     0.74 ± 0.0008  0.76 ± 0.0031  0.67 ± 0.0072  0.66 ± 0.0056
Can Triclusters Be Used to Improve Baseline Performance? In order to evaluate whether triclusters can improve the results of the baseline classifiers, we trained Random Forests using the triclusters together with the original features. As seen in Table 5, results without feature selection are approximately the same as those obtained at baseline, although we expected that performance would improve when temporal features were added. These results might be misleading, suggesting that triclusters are not important and original features are enough. In this context, we decided to inspect feature importance at baseline (original features) and when triclusters are used together with the original features. Figure 3 depicts the importance of the used features for three snapshots and shows that, as we expected, some triclusters (being temporal features) emerged as more important than some original features. This led us to believe that Random Forests are not dealing well with the 256 features. We thus decided to perform feature selection (FS), selecting the best 120 features (including triclusters) to be used by the classifier. The slight improvement in the FS results in Table 5 confirms our intuition.

Table 5. Classification results with triclusters and original features.

                AUC            Accuracy       Sensitivity    Specificity
2TP  Before FS  0.88 ± 0.0018  0.79 ± 0.0024  0.82 ± 0.0056  0.78 ± 0.0062
     After FS   0.88 ± 0.0026  0.80 ± 0.0041  0.83 ± 0.0046  0.78 ± 0.0037
3TP  Before FS  0.87 ± 0.0026  0.78 ± 0.0092  0.79 ± 0.0025  0.78 ± 0.0019
     After FS   0.87 ± 0.0025  0.79 ± 0.0052  0.79 ± 0.0021  0.77 ± 0.0018
4TP  Before FS  0.85 ± 0.0029  0.76 ± 0.0045  0.77 ± 0.0072  0.75 ± 0.0062
     After FS   0.86 ± 0.0011  0.77 ± 0.0046  0.78 ± 0.0027  0.75 ± 0.0078
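The feature-selection step can be sketched as keeping the top-k columns by importance score (e.g. a Random Forest's `feature_importances_`, with k = 120 in the paper); the helper below is illustrative, not the authors' exact procedure:

```python
import numpy as np

def select_top_k(X, importances, k):
    """Keep the k columns with the highest importance scores,
    preserving the original column order."""
    keep = np.sort(np.argsort(importances)[::-1][:k])
    return X[:, keep], keep
```

For instance, with importance scores [0.10, 0.40, 0.05, 0.45] and k = 2, columns 1 and 3 are kept.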
Fig. 3. Feature ranking (top 15): feature importance for original features (left) and including triclusters (right). Images from Orange3 Data Mining Toolkit [1].
4 Conclusions
We proposed to couple triclustering with Random Forests and train a triclustering-based classifier able to use disease progression patterns as features. The goal was to predict whether a given patient will need NIV in the next 90 days using temporal data from follow-up. In this case study, we used triclustering to find disease progression patterns in three-way clinical data, corresponding to groups of patients with coherent temporal evolution, which were then used for prognostic prediction. The results are promising but pinpoint the limitations of the triclustering algorithm used when dealing with clinical data. In our opinion, a key advantage of triclustering-based classification is the possibility of providing a better understanding of the results, promoting model interpretability (critical in clinical applications) together with potential improvements in classification (by incorporating temporal features). Since the triclusters identify subsets of patients with subsets of features showing coherent evolution patterns over contiguous time-points, we hypothesize they may uncover disease progression patterns that might be key to boosting classification results. We will thus work towards an effective triclustering-based classifier, starting by improving the triclustering results and then making it able to yield improvements over state-of-the-art classifiers.

Acknowledgements. This work was partially supported by FCT funding to the Neuroclinomics2 (PTDC/EEI-SII/1937/2014) and iCare4U (LISBOA-01-0145-FEDER-031474 + PTDC/EME-SIS/31474/2017) research projects, and the LASIGE Research Unit (UIDB/00408/2020).
References

1. Orange3 Data Mining Toolkit. https://orange.biolab.si
2. Andersen, S.A., Borasio, G.D., de Carvalho, M., Chiò, A., Van Damme, P., Hardiman, O., Kollewe, K., Morrison, K.E., et al.: EFNS guidelines on the clinical management of amyotrophic lateral sclerosis (MALS) - revised report of an EFNS task force. Eur. J. Neurol. 19, 360–375 (2011)
3. Bourke, S.C., Tomlinson, M., Williams, T.L., Bullock, R.E., Shaw, P.J., Gibson, G.J.: Effects of non-invasive ventilation on survival and quality of life in patients with amyotrophic lateral sclerosis: a randomised controlled trial. Lancet Neurol. 5(2), 140–147 (2006)
4. Carreiro, A.V., Amaral, P.M., Pinto, S., Tomás, P., de Carvalho, M., Madeira, S.C.: Prognostic models based on patient snapshots and time windows: predicting disease progression to assisted ventilation in amyotrophic lateral sclerosis. J. Biomed. Inform. 58, 133–144 (2015)
5. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
6. Chiò, A., Logroscino, G., Traynor, B., Collins, J., Simeone, J., Goldstein, L., White, L.: Global epidemiology of amyotrophic lateral sclerosis: a systematic review of the published literature. Neuroepidemiology 41(2), 118–130 (2013)
7. Conde, B., Winck, J.C., Azevedo, L.F.: Estimating amyotrophic lateral sclerosis and motor neuron disease prevalence in Portugal using a pharmaco-epidemiological approach and a Bayesian multiparameter evidence synthesis model. Neuroepidemiology 53(1–2), 73–83 (2019)
8. ENCALS: ALS functional rating scale - revised (ALSFRS-R). Version May 2015
9. Heffernan, C., Jenkinson, C., Holmes, T., Macleod, H., Kinnear, W., Oliver, D., Leigh, N., Ampong, M.: Management of respiration in MND/ALS patients: an evidence based review. Amyotroph. Lateral Scler. 7(1), 5–15 (2006)
10. Henriques, R., Madeira, S.C.: Triclustering algorithms for three-dimensional data analysis: a comprehensive survey. ACM Comput. Surv. 51(5), 95 (2019)
11. Matos, J.: Biclustering electronic health records to unravel disease presentation patterns. MSc Thesis (2019)
12. Pires, S., Gromicho, M., Pinto, S., Carvalho, M., Madeira, S.C.: Predicting non-invasive ventilation in ALS patients using stratified disease progression groups. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 748–757. IEEE (2018)
13. Zhao, L., Zaki, M.J.: triCluster: an effective algorithm for mining coherent clusters in 3D microarray data, pp. 694–705 (2005)
Searching RNA Substructures with Arbitrary Pseudoknots

Michela Quadrini(B)

Department of Information Engineering, University of Padova, via Gradenigo 6/A, 35131 Padova, Italy
[email protected]
Abstract. RNA functions depend on the molecule's three-dimensional structure, formed largely from hydrogen bonds between pairs of nucleotides. RNAs with analogous functions exhibit highly similar structures without necessarily showing significant sequence similarity. Understanding the relationships between structure and function has been considered one of the challenges in biology. In this study, we face the problem of identifying a given structural pattern in an RNA secondary structure with arbitrary pseudoknots. We abstract the shape in terms of the secondary structure, formalized by the arc diagram, and we introduce a set of operators necessary and sufficient to describe any arc diagram in terms of relations among loops. With each molecule we uniquely associate a relation matrix, and we address the aforementioned problem in terms of searching for a submatrix. The algorithms work in polynomial time.
Keywords: RNA secondary structures · Relations · Loops · Structural matching

1 Introduction
Ribonucleic acid (RNA) is a single-stranded polymer, with a preferred 5'–3' direction, made of four types of nucleotides, known as Adenine (A), Guanine (G), Cytosine (C) and Uracil (U). Each nucleotide is linked to the next one by a phosphodiester bond, referred to as a strong bond. Moreover, it can interact with at most one other non-contiguous nucleotide, establishing a hydrogen bond, referred to as a weak bond. Such a process, known as the folding process, induces a complex three-dimensional structure (or shape). Such a shape is tied to the molecule's biological function. RNAs play a variety of roles in cellular processes and are directly involved in diseases through their ability to turn genes on and off [1]. Discovering the relationships between the structure and the function has been considered a challenge in biology. Disregarding the spatial configuration of the nucleotides and reducing them to dots, the molecule is abstracted in terms of secondary structures that can be formalized as an arc diagram. In such a representation, the nucleotides are identified by vertices on a straight line (backbone)
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 123–133, 2021. https://doi.org/10.1007/978-3-030-54568-0_13
and the weak bonds are drawn as zigzagged arcs in the upper half-plane; see Fig. 1 for an example. Each arc determines a loop. Therefore, every RNA secondary structure is composed of loops. It is said to be pseudoknot-free if the arc diagram presents no crossings among loops (Fig. 1-a); otherwise, it is called pseudoknotted (Fig. 1-b). The secondary structure represents an intermediate level between the sequence and its shape. Moreover, it is both tractable from a computational point of view and relevant from a biological perspective. As an example, under the action of antibiotics, many 16S ribosomal RNAs preserve the nucleotide sequence but alter their shape [8]. Such changes are detected at the secondary structure level. Therefore, the ability to compare RNA secondary structures and identify common substructures is useful for understanding the relationships between structure and function. Functional RNA families, such as tRNA, rRNA, and RNase P, exhibit a highly conserved secondary structure but little sequence similarity [12]. Therefore, searching for sequence motifs does not work as effectively with RNA, while it has been a powerful tool for the analysis of DNA and proteins [13]. In the literature, several approaches have been introduced for searching common patterns. Algorithms based on tree data structures have been proposed for finding the largest approximately common substructures and local patterns in [20] and [11], respectively. Affix trees have been used for exact and approximate pattern matching and discovery in RNA sequences [14]. Arslan et al. proposed a substructure search algorithm, based on binary search on a suffix array, to find the largest common substructure of given RNA structures [3]. Backofen and Siebert have developed a dynamic programming approach for computing common exact sequential and structural patterns between two RNAs without pseudoknots [4].
Several proposed approaches are based on arc-annotated sequences, also called contact maps, among which we can mention the longest arc-annotated subsequence problem, the arc-preserving subsequence problem, the maximum arc-preserving common subsequence problem, and the edit-distance problem for arc-annotated sequences [5]. As an example, the maximum arc-preserving common subsequence problem was introduced by Blin et al. for comparing arc-annotated sequences in [6], while Evans proposed an algorithm to find common structures excluding some classes of pseudoknots [9]. Recently, Quadrini et al. have faced the problems of
Fig. 1. An example of a secondary structure represented as an arc diagram. In a, the zigzagged arcs do not cross while, in b, pseudoknots are clearly visible as crossings of arcs
identifying substructures considering both the primary and secondary structure, only for pseudoknot-free structures, in [18], while a similarity measure for RNAs with arbitrary pseudoknots has been computed in polynomial time [19]. In this paper, we face the problem of searching for a given structural pattern in an RNA secondary structure. Both the pattern and the structure may contain arbitrary pseudoknots. Based on our previous results [19], we introduce a set of operators, concatenation, nesting, and crossing, that is necessary and sufficient to describe any arc diagram in terms of relations among loops. Briefly, concatenation formalizes that one loop follows another, as illustrated in Fig. 2-a. Nesting corresponds to the insertion of one loop into another (Fig. 2-b), and crossing models the interaction between them (Fig. 2-c). Such a description allows us to uniquely associate with any RNA secondary structure a matrix, called the relation matrix, whose elements represent the relation between the two corresponding arcs. As a consequence, identifying a given structural pattern in an RNA secondary structure is equivalent to searching for a submatrix within a matrix. To reach this aim, we have defined two algorithms, Determination of the Relation Matrix and Structural Relation Matching, that work in polynomial time. The former takes as input an RNA molecule and returns the relation matrix, while the latter searches for the relation matrix associated with the pattern within the relation matrix of the structure. The approach has been tested on structures that include arbitrary pseudoknots.
Fig. 2. Relations between two loops: a. Concatenation, b. Nesting and c. Crossing of two loops
The paper is organized as follows. In Sect. 2, we recall some necessary concepts. In Sect. 3, we introduce the concepts of relations among loops and the relation matrix. In Sect. 4, we face the problem of searching for a structural pattern. The paper ends with some conclusions and future perspectives (Sect. 5).
2 Background and Problem Definition
An arc diagram is a sequence, over a given alphabet, with additional structure described by a set of arcs, possibly empty. Formally,

Definition 1 (Arc Diagram). An arc diagram is a labeled graph over the ordered set of vertices [n] = {1, . . . , n}, in which each vertex has degree ≤ 3, and
the edges are all the segments [i, i + 1] for i = 1, . . . , n − 1 and some semi-circular arcs (i, j) in the upper half-plane, with 1 ≤ i < j ≤ n. The arc diagram is denoted by D = (ω, B), where ω is the string that corresponds to the sequence of labels over the ordered set [n] and B is the set of all arcs (i, j). Note that in the literature the notation (ω, B) indicates an arc-annotated sequence. Roughly speaking, an arc diagram corresponds to an arc-annotated sequence whose nodes have degree less than or equal to 3. In order not to introduce too many symbols, we denote an arc diagram by the pair (ω, B), since only arc diagrams are considered in this work. As an example, the structure illustrated in Fig. 1 is formalized by D = (ω, B), where ω = AUGUGUCUGGUGCCAGCAUUGAGAU and B = {(1, 8), (3, 7), (10, 19), (13, 16), (15, 25), (17, 23)}. We face the problem of searching for a given structural pattern in an RNA secondary structure with arbitrary pseudoknots. Formally, we face the arc-preserving subsequence (APS) problem with a particular restriction. Let D = (ω, B) and D′ = (ω′, B′) be two arc-annotated sequences such that n = |ω| and m = |ω′| with n ≥ m; the APS problem asks whether D′ can be exactly obtained from D by deleting some of its bases together with their incident arcs, if any. The computational complexity of the problem has been studied in [7,9,10]. We face such a problem for arc diagrams without deleting any arc (i, j) whose paired nucleotide j lies within the considered substructure. Furthermore, we impose no restriction on the paired nucleotide i. The reason for this choice concerns the nature of the folding process: a nucleotide can form a hydrogen bond with another, already synthesized one. In our formalism, the nucleotide i of the pair (i, j) is synthesized before nucleotide j.
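As a concrete illustration, the encoding D = (ω, B) of Definition 1 can be represented and validated directly. This is a sketch of ours, not code from the paper; the function name `is_arc_diagram` is an assumption.

```python
def is_arc_diagram(omega, arcs):
    """Check that (omega, arcs) is a valid arc diagram: every arc (i, j)
    satisfies 1 <= i < j <= n, and each vertex belongs to at most one arc
    (so vertex degree stays <= 3 once backbone edges are counted)."""
    n = len(omega)
    seen = set()
    for i, j in arcs:
        if not (1 <= i < j <= n):
            return False
        if i in seen or j in seen:  # a nucleotide in two arcs would exceed degree 3
            return False
        seen.update((i, j))
    return True

# The example structure of Fig. 1:
omega = "AUGUGUCUGGUGCCAGCAUUGAGAU"
B = {(1, 8), (3, 7), (10, 19), (13, 16), (15, 25), (17, 23)}
print(is_arc_diagram(omega, B))  # True
```

The degree check mirrors the constraint of Definition 1: each vertex already has at most two backbone edges, so at most one incident arc is allowed.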
Operationally, we enumerate the loops of the structure starting from the loop whose last nucleotide is the rightmost, and we extract substructures determined by M consecutive loops Li. Figure 3 illustrates the APS problem and the problem under our restriction, respectively. In Fig. 3-b we find only one occurrence of the pattern, graphically identified by bold arcs within the structure. We observe that, without the restriction, the occurrences of the pattern would be two: one composed of the loops determined by the pairs (2, 9) and (6, 11), the other formed by (6, 11) and (10, 16).
3 Relations Between Loops on Arc Diagram
Each RNA secondary structure is characterized by hydrogen bonds that bind one part of the chain to another. Each hydrogen bond, represented by an arc (i, j) over the sequence, determines a loop. Therefore, each arc diagram is composed of loops. Given two loops, Ls and Lt, there are only three possible cases: one loop follows the other, one loop is inside the other, or one loop crosses the other, as illustrated in Fig. 2, respectively. We say that Ls is concatenated to Lt if the vertices of the corresponding pairs (is, js) and (it, jt) satisfy the relation is < js < it < jt. We say that Ls is nested into Lt if it < is < js < jt, and we say that Ls crosses Lt if is < it < js < jt. Without loss of generality, we enumerate the loops of the structure starting from the loop whose last nucleotide is the rightmost. In other words, the first loop
Fig. 3. a. A graphical example of the APS problem, b. An illustration of the problem with our restriction
L1 is formed by the pair (i1, j1) such that j1 is the last paired nucleotide of the structure considering the 5'–3' direction. As a consequence, given two loops Ls and Lt, respectively formed by (is, js) and (it, jt), if s < t then by definition js > jt. As an example, we can consider the structure illustrated in Fig. 4, which consists of four loops, L1 = (9, 15), L2 = (5, 12), L3 = (2, 8), L4 = (3, 7), and six relations among them, one for each unordered pair of loops. Taking advantage of such an enumeration, we impose an order over loops. Each nucleotide can interact with at most one other nucleotide; as a consequence, each nucleotide can be involved in at most one pair. This means that the choice of a loop is unique. Moreover, the three relations, concatenation, nesting and crossing, are necessary and sufficient to describe any RNA secondary structure with arbitrary pseudoknots. In fact, given two loops Ls and Lt with s < t, it is equivalent to consider the two pairs of natural numbers (is, js) and (it, jt) such that is < js, it < jt and js > jt. It follows that js is the greatest number. From the theory of combinations, we have 6 different order relations over is, it, and jt, which become 3 considering the constraint it < jt. With each structure we can uniquely associate a relation matrix. Each element aij of the matrix is the relation between the loops Li and Lj. The relation matrix of the structure in Fig. 4 is given in Table 1.
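The three relations can be sketched as a small Python predicate on the endpoint orderings. This helper is ours, not the paper's code, and the string labels stand in for the paper's operator symbols:

```python
def relation(ls, lt):
    """Classify the relation between two loops, each given as an arc (i, j)."""
    i_s, j_s = ls
    i_t, j_t = lt
    if j_s < i_t or j_t < i_s:
        return "concatenation"  # one loop entirely precedes the other
    if (i_t < i_s and j_s < j_t) or (i_s < i_t and j_t < j_s):
        return "nesting"        # one loop lies inside the other
    return "crossing"           # the arcs cross: a pseudoknot

# Two pairs of loops from the structure of Fig. 4:
print(relation((2, 8), (3, 7)))    # nesting: (3, 7) lies inside (2, 8)
print(relation((9, 15), (5, 12)))  # crossing: 5 < 9 < 12 < 15
```

Because each nucleotide belongs to at most one arc, the two loops can never share an endpoint, so exactly one of the three branches always applies.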
Fig. 4. Example of an RNA secondary structure
Table 1. The relation matrix of the RNA secondary structure shown in Fig. 4

4 Structural Matching
For each RNA secondary structure represented as an arc diagram, we uniquely determine its relation matrix using Algorithm 1, Determination of the Relation Matrix. The algorithm, whose pseudocode is reported in Appendix A, takes as input the set B of the pairs (is, js) and returns a matrix whose element ak,t represents the relation between the loops Lk and Lt. It is computed with time complexity O(N²), where N is the number of loops of the structure, which corresponds to the cardinality of B, i.e., the arc set of the arc diagram D. As an example, we take into account the RNA secondary structures illustrated in Fig. 5. The relation matrices of the two molecules, obtained by Algorithm 1, are shown in Table 2.

Fig. 5. a The structure, b the pattern

Table 2. Relation matrices of the structures in Fig. 5, respectively. The two occurrences of the pattern in the structure are formed by loops L1, L2, L3 and L4, L5, L6

To search for a given structural pattern in an RNA secondary structure with arbitrary pseudoknots taking advantage of relation matrices, we define Algorithm 2, Structural Relation Matching. It searches for the matrix of the pattern within the matrix of the structure. Consequently, it takes as input the two matrices and returns another matrix, whose rows identify the sets of loops that form an occurrence of the pattern in the structure. It is a brute-force search algorithm, and its complexity can be reduced, for example, using dynamic programming techniques. Continuing with the structures illustrated in Fig. 5, we consider the molecule in Fig. 5-b as a pattern to find in the molecule shown in Fig. 5-a. The structure contains the pattern twice: the former occurrence is determined by loops L1, L2, L3, while the latter is formed by loops L4, L5, L6. In general, the output of the Structural Relation Matching Algorithm is a matrix characterized
by M columns and P rows, where M is the number of loops of the pattern and P is the number of occurrences of the pattern in the structure.
5 Conclusions and Future Work
RNA functions are largely determined by the molecule's three-dimensional structure. Understanding the relationship between structure and biological function has been considered one of the challenges in biology. In this work, we have faced the problem of searching for a given structural pattern in an RNA secondary structure. We proposed a method able to identify any pattern in structures characterized by arbitrary pseudoknots. We have implemented the presented methodology in the RNA Relation Pattern open-source Python application, and we have tested our approach by searching a set of patterns in a set of 24 molecules of Archaea 16S ribosomal RNA [16]. We are now working on the development of the tool by improving its computational performance. Moreover, we want to add other molecular encodings as accepted input, i.e., the dot-bracket and CT formats. The tool will be used to analyze real RNAs that are available in public databases such as RNA STRAND [2]. This will be carried out in collaboration with experts of the biological domain in order to test the impact of our approach on the creation of new biological knowledge. As future work, we want to generalize the approach by considering also the sequences of nucleotides. In other words, we want to face the problem of finding a given structural pattern in an RNA with arbitrary pseudoknots taking into account both the primary and secondary structure of the molecules. Although functional RNAs exhibit a highly conserved secondary structure with little sequence similarity, the primary structure plays an important role in the formation of intermolecular hydrogen bonds. In other words, the primary structure is important to study and predict RNA-RNA interaction structures. Motivated by our recent results [15,17] and taking advantage of such an approach, we also intend to propose an alignment-free structural classification based on relations among loops.
On the theoretical side, we also want to extend such a preliminary result to face the APS problem for arc diagrams and arc-annotated sequences.
A Appendix
In this Appendix, we report the pseudocode of the two algorithms mentioned in the paper, Determination of the Relation Matrix and Structural Relation Matching.

Input : B = {(ai, bi) : ai > bi and ai > ai−1, ∀i = 2, . . . , N}, the set of the ordered pairs
Output: r_matrix, the relation matrix

pairs = array of elements of set B;
n = length of pairs;
i = 0;
while i < n do
    pair1 = pairs[i];
    a = pair1[0], b = pair1[1];
    k = i + 1;
    while k < n do
        pair2 = pairs[k];
        c = pair2[0], d = pair2[1];
        if a < d then r_matrix[i][k] = concatenation;
        else if d < b then r_matrix[i][k] = nesting;
        else r_matrix[i][k] = crossing;
        k = k + 1;
    end
    i = i + 1;
end
Algorithm 1: Determination of the Relation Matrix
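A runnable sketch of the relation-matrix construction follows. It is our own rendering, not the paper's code: the function names and the string labels for the three relations are assumptions, and loops are numbered L1..LN by decreasing right endpoint as prescribed in Sect. 3.

```python
def relation(ls, lt):
    """Classify the relation between two loops given as arcs (i, j)."""
    i_s, j_s = ls
    i_t, j_t = lt
    if j_s < i_t or j_t < i_s:
        return "concat"
    if (i_t < i_s and j_s < j_t) or (i_s < i_t and j_t < j_s):
        return "nest"
    return "cross"

def relation_matrix(B):
    """Build the relation matrix of an arc set B with O(N^2) comparisons."""
    # L1 is the loop whose right endpoint is rightmost: sort by j descending.
    loops = sorted(B, key=lambda arc: arc[1], reverse=True)
    n = len(loops)
    m = [[None] * n for _ in range(n)]
    for s in range(n):
        for t in range(s + 1, n):  # only the upper triangle is needed
            m[s][t] = relation(loops[s], loops[t])
    return loops, m

# The structure of Fig. 4: four loops, six pairwise relations.
loops, m = relation_matrix({(9, 15), (5, 12), (2, 8), (3, 7)})
print(loops)    # [(9, 15), (5, 12), (2, 8), (3, 7)]
print(m[2][3])  # 'nest': (3, 7) is nested into (2, 8)
```

The upper-triangular fill matches the pseudocode above: the relation between Lk and Lt is stored once, for k < t.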
Input : r_matrix_s, r_matrix_p, matrices of relations of the structure and the pattern, respectively
Output: pattern_m, matrix that contains the occurrences of the pattern

n = number of rows of r_matrix_s, m = number of rows of r_matrix_p;
if m < n then
    /* search all occurrences of the first relation of the pattern in the relation matrix of the structure */
    k = n − 2;
    while k ≥ 0 do
        j = k + 1;
        while j ≤ n − 1 do
            if r_matrix_s[k][j] = r_matrix_p[m − 2][m − 1] then
                first_o = first_o + [(k, j)];
            end
            j = j + 1;
        end
        k = k − 1;
    end
    /* search all occurrences of the pattern in the structure */
    for i = 0 to length(first_o) do
        element = first_o[i], d = element[1] − element[0];
        max = length(element[1]) − 1, row_s = element[0] − 1;
        N_relations = (m − 1)(m − 2)/2;
        while row_s ≥ 0 do
            col_s = row_s + d;
            while col_s ≥ 0 and col_s ≤ max do
                R = r_matrix_p[row_p][col_p];
                if r_matrix_s[row_s][col_s] = R then
                    pattern = pattern + [(row_s, col_s)];
                    if length(pattern) = N_relations then
                        t = t + 1;
                        pattern_m[t] = pattern;
                    end
                    if col_p = m − 1 then
                        row_p = row_p − 1;
                        col_p = row_p + 1;
                    else
                        col_p = col_p + 1;
                    end
                else
                    col_s = length(element[1]) + 1;
                    row_s = −1;
                    pattern = [ ];
                end
            end
        end
        i = i + 1;
    end
end
Algorithm 2: Structural Relation Matching
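The core idea can also be rendered as a small brute-force Python sketch. This is ours and deliberately simplified: it assumes an occurrence consists of consecutive loops and compares whole windows, whereas the pseudocode above anchors the search on the first relation of the pattern.

```python
def find_pattern(r_matrix_s, r_matrix_p):
    """Return one row of loop indices per occurrence of the pattern matrix."""
    n, m = len(r_matrix_s), len(r_matrix_p)
    hits = []
    for start in range(n - m + 1):  # O((n - m) * m^2) comparisons overall
        if all(r_matrix_s[start + s][start + t] == r_matrix_p[s][t]
               for s in range(m) for t in range(s + 1, m)):
            hits.append(list(range(start, start + m)))
    return hits

# Toy matrices: a 3-loop pattern occurring twice in a 6-loop structure.
P = [[None, "c", "c"], [None, None, "n"], [None, None, None]]
S = [[None] * 6 for _ in range(6)]
for (i, j), r in {(0, 1): "c", (0, 2): "c", (1, 2): "n",
                  (3, 4): "c", (3, 5): "c", (4, 5): "n"}.items():
    S[i][j] = r
for i in range(3):
    for j in range(3, 6):
        S[i][j] = "x"  # relations across the two halves differ from the pattern
print(find_pattern(S, P))  # [[0, 1, 2], [3, 4, 5]]
```

The output shape matches the text: M columns (loops per occurrence) and P rows (occurrences).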
References

1. Alberts, B., Bray, D., Hopkin, K., Johnson, A.D., Lewis, J., Raff, M., Roberts, K., Walter, P.: Essential Cell Biology. Garland Science, New York (2013)
2. Andronescu, M., Bereg, V., Hoos, H.H., Condon, A.: RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinform. 9(1), 340 (2008)
3. Arslan, A.N., Anandan, J., Fry, E., Monschke, K., Ganneboina, N., Bowerman, J.: Efficient RNA structure comparison algorithms. J. Bioinform. Comput. Biol. 15(06), 1740009 (2017)
4. Backofen, R., Siebert, S.: Fast detection of common sequence structure patterns in RNAs. J. Discret. Algorithms 5(2), 212–228 (2007)
5. Blin, G., Crochemore, M., Vialette, S.: Algorithmic aspects of arc-annotated sequences. In: Algorithms in Molecular Biology: Techniques, Approaches, and Applications. Wiley, Hoboken (2011)
6. Blin, G., Fertin, G., Herry, G., Vialette, S.: Comparing RNA structures: towards an intermediate model between the edit and the LAPCS problems. In: Brazilian Symposium on Bioinformatics, pp. 101–112. Springer (2007)
7. Blin, G., Fertin, G., Rizzi, R., Vialette, S.: What makes the arc-preserving subsequence problem hard? In: Transactions on Computational Systems Biology II, pp. 1–36. Springer (2005)
8. Carter, A.P., Clemons, W.M., Brodersen, D.E., Morgan-Warren, R.J., Wimberly, B.T., Ramakrishnan, V.: Functional insights from the structure of the 30S ribosomal subunit and its interactions with antibiotics. Nature 407(6802), 340–348 (2000)
9. Evans, P.A.: Finding common subsequences with arcs and pseudoknots. In: Annual Symposium on Combinatorial Pattern Matching, pp. 270–280. Springer (1999)
10. Gramm, J., Guo, J., Niedermeier, R.: Pattern matching for arc-annotated sequences. In: International Conference on Foundations of Software Technology and Theoretical Computer Science, pp. 182–193. Springer (2002)
11. Hochsmann, M., Toller, T., Giegerich, R., Kurtz, S.: Local similarity in RNA secondary structures. In: Proceedings of the 2003 IEEE Bioinformatics Conference on Computational Systems Bioinformatics, CSB2003, pp. 159–168. IEEE (2003)
12. Höchsmann, M., Voss, B., Giegerich, R.: Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE/ACM Trans. Comput. Biol. Bioinform. 1(1), 53–62 (2004)
13. Li, K., Rahman, R., Gupta, A., Siddavatam, P., Gribskov, M.: Pattern matching in RNA structures. In: Proceedings of the 4th International Conference on Bioinformatics Research and Applications, ISBRA 2008, pp. 317–330. Springer-Verlag (2008)
14. Mauri, G., Pavesi, G.: Algorithms for pattern matching and discovery in RNA secondary structure. Theor. Comput. Sci. 335(1), 29–51 (2005)
15. Quadrini, M., Culmone, R., Merelli, E.: Topological classification of structures via intersection graph. In: Theory and Practice of Natural Computing, TPNC 2017 (2017)
16. Quadrini, M.: RNA relation pattern (2020). https://github.com/michelaquadrini/RNARelationPattern. Accessed 27 Mar 2020
17. Quadrini, M., Merelli, E.: Loop-loop interaction metrics on RNA secondary structures with pseudoknots. In: Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOINFORMATICS, vol. 4, pp. 29–37. SciTePress, Setúbal (2018)
18. Quadrini, M., Merelli, E., Piergallini, R.: Loop grammars to identify RNA structural patterns. In: Proceedings of the 12th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOINFORMATICS, vol. 3, pp. 302–309. SciTePress (2019)
19. Quadrini, M., Tesei, L., Merelli, E.: An algebraic language for RNA pseudoknots comparison. BMC Bioinform. 20(4), 161 (2019)
20. Wang, J.T.L., Shapiro, B.A., Shasha, D., Zhang, K., Currey, K.M.: An algorithm for finding the largest approximately common substructures of two trees. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 889–895 (1998)
An Application of Ontological Engineering for Design and Specification of Ontocancro
Jéssica A. Bonini1, Matheus D. Da Silva1, Rafael Pereira1, Bruno A. Mozzaquatro1, Ricardo G. Martini2, and Giovani R. Librelotto1(B)
1
Universidade Federal de Santa Maria, Santa Maria, Rio Grande do Sul, Brazil {jbonini,mdsilva,brunomozza,librelotto}@inf.ufsm.br, [email protected], [email protected] 2 Universidade do Minho, Braga, Portugal [email protected]
Abstract. In the field of bioinformatics, ontologies prove to be very useful for dealing with the massive amounts of data generated by biological studies. Recently, ontologies have been used to support the gathering, organization, and integration of information in different databases. Biological studies are carried out in different areas, mainly related to diseases such as cancer, Type 2 Diabetes, and Alzheimer's, which need computer support in order to annotate genes, proteins, metabolic pathways, and also transcriptional regulatory models. Thus, this work aims to design and specify Ontocancro, an ontology developed by UFSM researchers to assist in research related to the aforementioned diseases, using the NeOn methodology and reusing the structure of the BioPAX ontology.
Keywords: Bioinformatics · Ontologies · Cancer

1 Introduction
In the last few years, the bioinformatics domain has been growing and becoming increasingly essential for the integration and provision of information about molecular biological studies. The analysis of this information helps researchers to understand the interactions between complex cellular models and their relationship with genetic material, as well as functional products associated with the pathology stages that can affect humans. Researchers around the world have been focused on studying one of the diseases that kills the most people each year: cancer. According to [1], in 2018 approximately 8.8 million people died from this disease. The large amount of data that can be generated from these studies and the need to integrate them make the use of tools capable of dealing with large databases, allowing the generated knowledge to be shared between people and
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 134–143, 2021. https://doi.org/10.1007/978-3-030-54568-0_14
software inevitable. Thus, ontologies arise as a solution to these issues, since they allow the formal representation of knowledge through the definition of concepts and semantic relationships between them, and enable interoperability between systems that share this knowledge structure [2]. The Ontocancro project [3] was developed to aid researchers who work in the field of gene expression and biological networks related to cancer. The ontology is now in its third version and was recently updated to gather biological information related to Cancer, Alzheimer's, Type 2 Diabetes and COPD, and also to the process of chronic inflammation resulting from cellular aging, Inflammaging, which may be at the origin of these diseases [4]. During its evolution, Ontocancro went through changes made by researchers from different areas, where each researcher selected the methodology that met their needs at the time. Therefore, Ontocancro was not designed and built following a single method for the development of ontologies, generating doubts about the quality and completeness of the formalized knowledge. Hence, these problems could be solved by using a single methodology capable of allowing reuse and interoperability between different versions of the ontology, recovering all resources already existing in the domain, and providing clear guidelines on the activities involved in the development process. Thus, the present work aims to design and specify Ontocancro through a single ontology engineering methodology, providing the sharing of the knowledge generated during the three versions and the interoperability with systems and other resources in the bioinformatics domain.
2 Methodologies and Tools Used for the Ontological Engineering Development

In this section, the main concepts of the NeOn methodology and the BioPAX ontology are presented. Moreover, the subsections describe the three versions of the Ontocancro ontology.

2.1 NeOn Methodology
The NeOn methodology proposes to deal with the weaknesses of the main existing approaches, such as the lack of detailed and explicit instructions and of support for reusing knowledge sources that are widely available for several specific domains. In addition, the known methodologies do not describe the process of building ontologies at the same level of detail as software development methodologies, which would lead to an easier understanding and application of these approaches. Finally, the lack of consensus on good practices and processes to develop ontologies leads to a lack of standardization in the terminology of ontology engineering [5]. NeOn provides guidelines following the premise "divide and conquer"; in other words, it divides the problem to solve into different subproblems. The solutions are described through nine scenarios (Fig. 1) that can be combined
with each other; they are composed of several processes and activities defined by the NeOn Glossary [5].
Fig. 1. NeOn methodology activity flow
Amongst the activities described in NeOn's scenarios is the reuse of ontological resources already existing in the domain in question. Thus, the next section provides an overview of the BioPAX ontology, reused in the development of Ontocancro.

2.2 Biological Pathway Exchange (BioPAX)
The BioPAX ontology represents biological pathways at the molecular and cellular level. Furthermore, it aims to facilitate the integration, visualization, and analysis of pathway data, and to provide a way to share these data. The main structure is composed of five basic classes: the first one, called Entity, is the root class of the ontology and represents a discrete biological unit. The second, Pathway, determines a set or a series of interactions grouped by biologists for organizational or biophysical reasons. The third, Interaction, defines a biological relationship between two or more entities. The fourth, PhysicalEntity, represents the set of entities that have a physical structure. The last one, Gene, is responsible for encoding information that can be inherited through replication [6]. BioPAX allows the representation of reactions such as transcription and also brings the concepts of pathways and genes as two of the five basic entities of the
ontology. That way, it allows the clear and focused annotation of the main concepts inserted in Ontocancro.

2.3 Ontocancro
Complex networks of molecular interactions control the machinery of cells. The amount and complexity of the data provided by the several different studies involving these structures require the development of new technologies capable of dealing with the hard task of data integration. Thus, researchers from different areas must work together to make this possible and then extract knowledge from these heterogeneous biological bases. In this scenario, researchers in the areas of physics, biology, and informatics at the Federal University of Santa Maria (UFSM) joined efforts to develop the Ontocancro ontology [3]. Ontocancro 1.0 was developed to help with the study of the gene expression of biological networks involved in cancer. The main objective was to map the largest number of genes involved in the Genomic Stability Maintenance pathways in a single shared database [7]. In this first version of Ontocancro, genes present in several biological pathways were grouped together, hindering the analysis of their expression. On the other hand, part of the pathways included in the study had not been curated in their original repositories, which could result in distortions in the results. It was therefore decided to update the ontology, generating its second version, Ontocancro 2.0 [8]. The problem of the excessive number of genes per pathway was solved by separating pathways into sub-pathways, grouping genes according to their similarities. Thus, the number of genes to be analyzed was reduced, producing a more reliable and more significant statistical analysis [9]. The second issue was solved by including only metabolic pathways that had already undergone curation procedures in their original base. Such measures provided a higher reliability of the analysis of their profile.
To ensure that all metabolic pathways had been curated, it was decided to restrict the number of sources, considering only metabolic pathways from these public bases: NCI-Nature (BioCarta and Reactome), Ontocancro, KEGG and Gene Ontology [8]. Other studies were developed, in addition to Cancer, such as those on Inflammaging, which can be involved in diseases like Alzheimer's and Type 2 Diabetes [10]. These studies have shown the existence of a relationship between the release of proinflammatory cytokines and the abnormal production process of the beta-amyloid protein. That characteristic can be considered one of the main changes presented by patients with Alzheimer's [11]. On the other hand, patients with Type 2 Diabetes have high concentrations of the cytokine IL-6 and other markers in the blood. This fact is the result of a systemic inflammation attributed to the imbalance of the immune system, which characterizes Inflammaging. Taking this into account, it was decided to include Alzheimer's and Type 2 Diabetes in the project, since the same chronic state of inflammation can cause these diseases as well as Cancer. The ontology resulting from the update was called Ontocancro 3.0; Fig. 2 shows this third version of the ontology [12].
J. A. Bonini et al.
Fig. 2. Ontocancro’s 3.0 structure
In that version, the domain of study was extended to include the following diseases: Alzheimer's disease, type 2 diabetes, chronic obstructive pulmonary disease (COPD), and thyroid, pancreatic, colorectal, and adrenocortical cancers. The information on these diseases was retrieved from public databases such as HUGO, which comprises all the genes present in the human body; GEO, which provides tissue samples; and NCI and Reactome, both of which contain data on metabolic pathways.
3 Design and Specification of Ontocancro Based on Ontological Engineering
Ontocancro aims to aid researchers involved in cancer studies and, over the years, has gone through several phases and versions. Each new version was generated according to a methodology chosen by the researcher responsible at that time, raising doubts about the structuring and formalization of the new knowledge defined in the project. To solve this issue, the present work describes how a single methodology can be used to restructure this knowledge and to create a newly refined version of Ontocancro, thereby ensuring interoperability between these ontologies and the sharing of knowledge with other resources in the same domain. Figure 3 presents a diagram with all the activities covered in this work.
Ontocancro’s Redesign and Specification
Fig. 3. Steps during Ontocancro's design and specification.
The NeOn methodology, described in Subsect. 2.1, was chosen because it provides for the reuse of existing ontological resources in the field of bioinformatics and offers clear and concise guidance for the ontology development process. Among the scenarios proposed by NeOn, scenario four was selected because it covers cases in which there are ontological resources that can be reused and are useful for the problem in question. Thereby, it becomes possible to reuse the knowledge formalized in previous versions of Ontocancro and to insert new definitions relevant to the current stage of the project. Following the activities of scenario four, as shown in Fig. 1, development started with a search for ontological resources in available bioinformatics repositories, in order to reuse knowledge already formalized. Among the resources found, BioPAX was chosen because it represents the concepts existing in Ontocancro in a similar way. In addition, it allows the insertion of new information pertinent to the current state of the project, such as the definition of the product encoded by a gene. Next, the reengineering of the selected resource was started to adapt it to the problem at hand. At this stage, activities such as reverse engineering and restructuring of the ontological resources must be carried out. This process works at four levels of abstraction: specification, conceptualization, formalization, and implementation, which are detailed in the next paragraphs. In the specification step, components such as requirements, purpose, and scope are described. The goal of this work is to facilitate access to the data and to retrieve them with higher quality. Thus, the update of the ontology intends to consolidate the existing information, reformulating relationships and properties that had become obsolete, through a single development methodology.
Besides, it aims to add the concepts of gene expression, mapping the final product of each gene, which is important for the current stage of the project. The conceptualization phase describes the characteristics of the ontology, such as its structure and components, following the primitives of knowledge representation. In this phase, we developed the structure of the ontology from the information gathered in the specification stage. The structure model includes, in
addition to the classes already present in BioPAX, the Diseases class. This entity includes the Inflammaging class and all the other diseases already mentioned. The Affymetrics class provides information about DNA microarrays, which are glass slides on which single-stranded segments (probes) are fixed in an orderly manner in specific areas (probe cells). This entity was inserted in previous versions because of the need to identify the genes present in disease samples. This identification was carried out using the IDs provided by the microarrays manufactured by the American company Affymetrix [13]. Thus, the relationship between the Gene and Samples entities was made through the Affymetrics entity, forcing the existence of a microarray to provide the gene ID. During discussions with the domain specialist, the need arose to eliminate the requirement of a microarray to identify a gene, since new technologies are being studied and developed for the study of gene expression. The conceptualization phase was also responsible for discussing the insertion and description of concepts related to the products encoded by genes. For this definition, it was considered that a gene carries part of the genetic information and, consequently, is a region within the DNA that provides instructions for encoding a specific functional product. This product, in turn, can be a protein or an RNA molecule. In the formalization stage, the concepts and relationships mapped in the previous phases were modeled according to the BioPAX structure. Genes and pathways were defined according to the Gene and Pathway classes existing in BioPAX. These classes are disjoint, have no members in common, and relate through the hasGene property.
Below the main BioPAX class, Entity, the following classes were created: Diseases, with a child class Inflammaging that includes a subclass for each disease; Samples, with the disease samples; and Series, with the series to which each sample must be related, for example, samples of carcinomas, adenomas, and inflammation. The relationship between diseases and samples is made through the hasSample property. In turn, the belongsTo property formalizes the relationship between Samples and Series. Finally, the product encoded by each gene was modeled using the TemplateReaction class, which is already part of the BioPAX structure. It defines an interaction in which a macromolecule is polymerized from a template macromolecule. Examples of this type of reaction are DNA-to-RNA transcription, RNA-to-protein translation, and protein coding from DNA. In the last phase of the ontology development, the semi-formal model resulting from the formalization phase was implemented. The chosen formalization language was OWL, and the tool used for editing was Protégé. The first step in building the formal model was the insertion of the new entities (Diseases, Inflammaging, Series, and Samples) into the structure of BioPAX. In addition to the new classes, the hasSample, onlyOneReference, hasGene, and isExpressed object properties were created. The first two refer to the relationships between the classes Diseases and Samples, and Series and Samples, with onlyOneReference being the inverse of hasSample.
The hasGene property and its inverse, isExpressed, describe the relationships between Samples and Gene and between Pathway and Gene. This property was created to enable a direct relationship between samples and genes and to map the genes inserted in one or more pathways during the disease phases. Moreover, relationships among entities, cardinality, and quantification restrictions were defined to guarantee the defined premises. After the structural formalization, instances were created within the ontology. For the Gene and Pathway classes, the instances were built from the analysis of data tables taken from public biological databases in previous works. For genes, a new version of the information was retrieved from the HGNC database to update old information and to add data such as the locus type (e.g., gene with protein product, or pseudogene) and the gene reference in the UniProt database. Pathway instances mapped information such as the name of the pathway, the database of origin, and its type (Apoptosis, CellCycle, DnaDamageResponse, or Inflammation). Furthermore, they were associated with instances of the genes expressed in the pathway through the hasGene property. Next, Protein and TemplateReaction instances were built. The gene-coding reaction is modeled by associating a product and a template; the latter, due to the BioPAX structure, is not directly associated with the gene. For this reason, DnaRegion instances and the geneReference object property were created, the latter establishing the relationship between the DnaRegion and Gene classes, as formalized in the previous step. Among all the genes involved in the study, six are pseudogenes and do not code for a functional product. Two are non-coding RNAs, that is, they do not encode proteins. For the latter, instances of the Rna class, existing in BioPAX, were created and listed as the product of the gene transcription reaction. The structure of Ontocancro after the completion of the implementation stage can be seen in Fig. 4.
The resulting OWL file has instances of 1107 genes, 1099 proteins, and 1099 transcription reactions. The evaluation of the constructed ontology is an essential step to guarantee its use and to verify its efficiency in representing knowledge, ensuring that the initial requirements have been satisfied. Moreover, to determine whether the constructed ontology conforms to the domain knowledge, domain specialists need to verify the formalized structure. These specialists have technical knowledge of and experience in the domain for which the ontology is being modeled, so they can evaluate it, as they deal with information specific to their area [14–16].
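The class-and-property layout described in this section can be sketched in a few lines. The following is a hypothetical illustration in Python, not the project's OWL file: the class and property names (Diseases subclasses, Samples, Series, hasSample, onlyOneReference, hasGene, isExpressed) come from the text, but the tiny triple store and the instance identifiers (e.g. "Sample_01") are invented for the example.

```python
# Minimal triple-store sketch of the Ontocancro relationships described above.
# Entity and property names follow the text; instance IDs are invented.

triples = {
    ("Alzheimer", "hasSample", "Sample_01"),            # Diseases -> Samples
    ("Sample_01", "belongsTo", "Series_Inflammation"),  # Samples  -> Series
    ("Sample_01", "hasGene", "IL6"),                    # Samples  -> Gene
    ("Pathway_Apoptosis", "hasGene", "TP53"),           # Pathway  -> Gene
}

# onlyOneReference and isExpressed are declared as the inverses of
# hasSample and hasGene, respectively (as in the text).
INVERSES = {"hasSample": "onlyOneReference", "hasGene": "isExpressed"}

def materialize_inverses(ts):
    """Add the inverse triple for every property with a declared inverse."""
    out = set(ts)
    for s, p, o in ts:
        if p in INVERSES:
            out.add((o, INVERSES[p], s))
    return out

full = materialize_inverses(triples)
```

With the inverses materialized, a sample can be traced back to its disease through `onlyOneReference`, mirroring the cardinality restriction the text describes; an OWL reasoner performs the equivalent inference over the real ontology.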
Fig. 4. Ontocancro’s structure
4 Results and Final Considerations
Compared with version 3.0 of Ontocancro¹, the structure built in this work restricts the relationship of each sample to a specific disease and series through the insertion of the onlyOneReference property and cardinality restrictions. The new specification also guarantees the relationship between genes and samples regardless of the technology used to capture the expression of the genes in disease tissue samples; in version 3.0, microarray technology was imposed by the Affymetrics class. The inclusion of genes in the study allowed the structure to now include the definition of the reactions coding functional products. The latest OWL file implemented and available for the project basically reflected the structure of version 1.0 of Ontocancro. There, the pathways were separated according to the database of origin (BioCarta, GO, KEGG, NCI, Ontocancro, Prosite, and Reactome) by defining a class for each one below the main Entity class. Now, the pathways are classified into CellCycle, DnaDamageResponse, Apoptosis, and Inflammation, and all are below the Pathway class. Besides that, the Series, Samples, and Diseases classes and their subclasses, inserted in the study in versions 2.0 and 3.0, did not exist in that OWL file. The results obtained bring improvements to the Ontocancro project regarding the formalization of knowledge and the inserted concepts, matching the structure of the ontology to the current status of the project. They make the development process transparent and guided by a comprehensive methodology, allowing upcoming researchers to have a clear view of what was defined and modeled during the research. The domain of bioinformatics can thus count on new knowledge resources.
¹ http://ontocancro.inf.ufsm.br
References
1. World Health Organization: World health statistics 2019: monitoring health for the SDGs, sustainable development goals (2019)
2. Schulz, S., Costa, C.M.: How ontologies can improve semantic interoperability in health care, pp. 1–10. Springer (2013)
3. Falcade, L., Souza, K.R., Librelotto, G.R.: Um comparativo entre Ontologias Relacionadas ao Câncer: a interoperabilidade destas ontologias com a Ontocancro. In: XIV Conferência da Associação Portuguesa de Sistemas de Informação, Santarém, Portugal (2014)
4. Xia, S., Zhang, X., Zheng, S., Khanabdali, R., Kalionis, B., Wu, J., Wan, W., Tai, X.: An update on inflamm-aging: mechanisms, prevention, and treatment. J. Immunol. Res. (2016)
5. Figueroa, M.C.S.: NeOn Methodology for Building Ontology Networks: Specification, Scheduling and Reuse. Universidad Politécnica de Madrid, Madrid, Spain (2010)
6. BioPAX: Biological Pathways Exchange Language Level 3 – Release Version 1 Documentation (2010)
7. Librelotto, G.R., Pereira, R.T., Azevedo, P., Mombach, J.C.M.: Utilizando a Ontocancro para Traçar o Perfil das Vias de Manutenção da Estabilidade Genômica. In: XXXII Congresso da Sociedade Brasileira de Computação, Curitiba, Brazil (2012)
8. Soares, K., Bastiani, E., Librelotto, G.: Ontocancro 2.0: um estudo de caso para a aplicação da ontologia em vias metabólicas ligadas ao processo carcinogênico. Revista do CCEI 16, 177–192 (2012)
9. Pereira, R., Henriques, P., Librelotto, G.R.: Desenvolvimento de uma Ferramenta para a Análise de Vias de Estabilidade Genômica. Universidade do Minho, Braga, Portugal (2013)
10. Franceschi, C., Capri, M., Monti, D., Giunta, S., Olivieri, F., Sevini, F., Panourgia, M.P., Invidia, L., Celani, L., Scurti, M., Cevenini, E., Castellani, C.G., Salvioli, S.: Inflammaging and anti-inflammaging: a systemic perspective on aging and longevity emerged from studies in humans. Mech. Ageing Dev. 128(1), 92–105 (2007)
11. Giunta, B., Fernandez, F., Nikolic, W.V., Obregon, D., Rrapo, E., Town, T., Tan, J.: Inflammaging as a prodrome to Alzheimer's disease. J. Neuroinflammation 5(51) (2008). https://doi.org/10.1186/1742-2094-5-51
12. Bonini, J.A., Stringhini, R.M., Falcade, L., Librelotto, G.R.: Estendendo o Domínio da Ontocancro 3.0 para abordar o Inflammaging. In: XV Congresso Brasileiro de Informática em Saúde, Goiânia, Brazil (2016)
13. Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P.: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31(4) (2003). https://doi.org/10.1093/nar/gng015
14. Araújo, W.J.: Avaliação de Ontologias com Base na Comparação a um Corpus: um estudo da OntoAgroHidro da EMBRAPA. Universidade Federal de Minas Gerais, Brazil (2016)
15. Jain, V., Prasad, S.V.A.V.: Evaluation and validation of ontology using Protégé tool. Int. J. Res. Eng. Technol. 4(4), 21–32 (2016)
16. Rautenberg, S., Todesco, J.L., Gauthier, F.A.O.: Processo de Desenvolvimento de Ontologias: uma proposta e uma ferramenta. Revista Tecnologia 30(1), 133–144 (2009)
Evaluation of the Effect of Cell Parameters on the Number of Microtubule Merotelic Attachments in Metaphase Using a Three-Dimensional Computer Model
Maxim A. Krivov¹, Fazoil I. Ataullakhanov², and Pavel S. Ivanov¹
¹ M. V. Lomonosov Moscow State University, Moscow 119991, Russia
[email protected], [email protected]
² Center for Theoretical Problems of Physico-Chemical Pharmacology, Russian Academy of Sciences, Moscow 199991, Russia
[email protected]
Abstract. Equal chromosome segregation between daughter cells during mitosis is crucial for genome integrity and is mostly regulated by proper attachments of spindle microtubules (MTs) to kinetochores. Abnormalities in this process can lead to chromosome mis-segregation and potentially result in severe developmental disorders, including aneuploidy and cancer. Merotelic attachments, in which tubulin MTs captured by the kinetochore of one chromatid originate from both spindle poles, are considered one of the key molecular causes of such abnormalities. Here we present the first comprehensive three-dimensional model of metaphase, the key stage of mitosis in the context of proper chromosome segregation, and the results of its application to supercomputer simulation of kinetochore–MT attachments in metaphase. It appears that large values of the kinetochore crown angle lead to the preservation of merotelic attachments, while the size of the cell and the probability of MT detachment affect only the rate of their suppression but do not interfere with the process of suppression itself. It has also been demonstrated that the structure and the set of parameters of a model of mitosis have a severe impact on the results of simulations. We also compare the results of supercomputer 3D modeling of mitosis with the outcomes of existing two-dimensional models.
Keywords: Mitosis · Metaphase · Chromosomes · Microtubules · Kinetochore · Merotelic attachment · Computer simulation
1 Introduction
An equal segregation of chromosomes between daughter cells is a key yet non-trivial task during mitosis. Such an outcome depends dramatically on proper kinetochore-microtubule attachments. Merotelic attachments (MAs) correspond to the scenario in which tubulin microtubules (MTs) captured by the kinetochore of one chromatid originate from both spindle poles. Such erroneous attachments regularly take place at the early stages
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 144–154, 2021. https://doi.org/10.1007/978-3-030-54568-0_15
of mitosis. However, if they remain in anaphase, this can lead to mis-segregation of chromosomes and severe developmental abnormalities, including aneuploidy and cancer [1, 2]. Cellular mechanisms that suppress MAs at various stages of mitosis can be divided into two groups, namely, those that prevent the emergence of new MAs and those that reduce the number of existing ones [3, 4]. While protein–protein interactions play a substantial role in both cases, for the mechanisms of the first group the structure of individual cellular organelles, especially the kinetochore, is also important. Due to geometric restrictions, it is more difficult for the kinetochore to capture MTs growing from the opposite spindle pole than those growing from the nearby spindle pole. The influence of geometric factors on MAs is considered in [5]. Using a two-dimensional computer model of the cell, the authors showed the importance of such parameters as the size and thickness of the kinetochore crown, the region that MTs can attach to. In particular, a significant deviation from the values that correspond to the "average" human cell results in an increase in the number of MAs. Here, we present the results of mitosis simulation based on a three-dimensional model of the cell that has many more parameters than the model in [5]. The paper has the following structure. In Sect. 2 we provide a brief overview of existing mathematical models of MT–chromosome interactions. Section 3 contains a description of the computer model we developed. Section 4 describes the methodology of the numerical experiments as well as the parameters of the virtual cells. Finally, in Sect. 5 we present and discuss the results of supercomputer simulations.
2 Similar Works
A detailed analysis of existing mathematical models of a dividing cell and their components can be found in [6, 7]. Below we briefly analyze the works that are closest to our three-dimensional model. The main mechanism that allows MTs to find chromosomes in the space of a cell is called "search-and-capture" [8]. It is assumed that in metaphase, MTs have random directions and, due to their instability, constantly switch from the polymerization to the depolymerization state (the so-called catastrophe) and vice versa (see Fig. 1). When the length of an MT becomes zero, its direction changes randomly; thus, the search for chromosomes is carried out "blindly". It was shown that if the catastrophe is considered as a probabilistic event, the virtual cell begins to correspond to a living cell in such a parameter as the average MT length over time [9]. On the other hand, if the probabilities of these events are set to be constant, the time of detection of the first chromosome is several orders of magnitude greater than the times known from experimental studies [10]. This problem can be solved if the probability of catastrophes is determined through a gradient function, which corresponds to the "fight" of MTs for free tubulin proteins [11]. According to an alternative approach, these times become commensurate with the expected ones if the kinetochore is "allowed" to interact with an MT as a whole, rather than just with its plus-end [12]. Another issue is the formalization of the mechanism of intracellular chromosome motion. The dominating concept of such a mechanism, called the "balance of forces",
Fig. 1. Illustration of the search-and-capture concept used to describe the instability of MTs.
assumes that each pair of chromosomes moves in such a way that the sum of all forces exerted on it is zero [13]. Initially, three types of forces were analyzed, specifically, (i) the forces exerted on MTs from the spindle poles, (ii) the forces of attraction that arise at the plus-ends of MTs captured by the kinetochore, and (iii) the force that counteracts centromere stretching. This approach was further developed by breaking down the forces of the second type into two independent categories, as well as by introducing an additional friction force to account for the viscosity of the cytoplasm [14]. A similar idea of the "balance of forces" is used in many other works. For example, when describing the interaction of MTs with chromosome arms, a repulsive force between the chromokinesin and tubulin proteins was added [15]. Among recently proposed computer models, it is worth mentioning the representation of the kinetochore in the form of a flexible polymer structure, which, in particular, made it possible to evaluate the influence of thermal effects on its shape [16]. In [17], the growth of small auxiliary MTs directly on the kinetochore was reproduced. These MTs could bind to MTs from the spindle pole, thus increasing the efficiency of the search-and-capture mechanism.
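As a concrete illustration of the search-and-capture dynamics discussed above, the following Python sketch evolves the length of a single MT under probabilistic catastrophe and rescue events. It is a toy one-dimensional version: directions are not tracked, and all numeric parameter values are placeholders chosen for illustration, not the calibrated values of [9] or of our model.

```python
import random

# Placeholder parameters (assumed values, for illustration only).
V_POL, V_DEPOL = 0.2, 0.3   # polymerization/depolymerization rates, um/s
F_CAT, F_RES = 0.02, 0.05   # catastrophe/rescue probabilities per step
R_CELL = 6.0                # cell radius, um
DT = 0.1                    # time step, s

def step(length, growing, rng):
    """Advance one MT by DT; returns (new_length, still_growing)."""
    if growing:
        length += V_POL * DT
        if length >= R_CELL:       # hitting the membrane forces depolymerization
            return R_CELL, False
        if rng.random() < F_CAT:   # stochastic catastrophe
            growing = False
    else:
        length -= V_DEPOL * DT
        if length <= 0.0:          # zero length: the MT regrows "blindly"
            return 0.0, True       # (a new random direction in the full model)
        if rng.random() < F_RES:   # stochastic rescue
            growing = True
    return length, growing

rng = random.Random(42)
length, growing = 0.0, True
lengths = []
for _ in range(60_000):            # 100 virtual minutes
    length, growing = step(length, growing, rng)
    lengths.append(length)
mean_len = sum(lengths) / len(lengths)
```

Averaging `mean_len` over many MTs gives the "average MT length over time" statistic against which such models are compared to living cells [9].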
3 Computer Model
A detailed description of the computer model proposed by the authors, which is a development of the two-dimensional model [5], as well as of its software implementation, is provided elsewhere [18]. This model can describe any eukaryotic cell with metacentric chromosomes. The spindle poles are represented by two material points diverging to diametrically opposite sides of the cell during the first 180 s (see Fig. 2). MTs, represented by lines of zero thickness, grow from the spindle poles in random directions within a solid angle of π radians. Similarly to [9], MT growth dynamics is described by four parameters: the rates of polymerization and depolymerization, Vpol and Vdepol, and the probabilities of catastrophe, fcat, and resurrection, fres. MTs do not interact with each other, but respond to the following events:
– reaching the cell membrane, which triggers the transition to the state of depolymerization,
– achieving zero length, which leads to choosing a new direction of growth,
Fig. 2. Divergence of the spindle poles according to the used cell model: (A) schematic representation of model objects; (B) three-dimensional visualization of the model (1500 MTs, 2 pairs of chromosomes).
– collision with the arms of the chromosome, which results in the MT "breaking off" or in its transition to the state of depolymerization, depending on the model parameter,
– collision with the kinetochore, which also results in "breaking off" and the transition to the state of depolymerization,
– collision with the kinetochore crown, which brings about attachment to the kinetochore with probability Kon.
An MT is attached to the kinetochore by its plus-end and then moves with it. The probability Koff corresponds to its detachment and transition to a free state. A pair of sister chromatids is modeled as a construction of six half-cylinders (see Fig. 3A), the dimensions of which are determined by the following parameters: the lengths of the chromosome, Lchr, and of the kinetochore, Lkin; the diameters of the arms of the chromosome, Dchr = 2Rchr, and of the kinetochore, Dkin = 2Rkin; the length of the centromere, SL; and the angle αkin, which determines the size of the kinetochore crown. The centromere, in turn, is modeled either as a rod or as an extensible Hookean spring with elastic coefficient SK. Thus, each pair of sister chromatids has six or seven degrees of freedom, i.e. three spatial coordinates, three rotation angles, and the length of the centromere (provided the latter is represented as a spring). To simulate the chromosome motion, the principle of the "balance of forces" [13] is written for the center of each chromosome pair in the form of the balance equations for forces, $\sum_k \vec{F}_1^k + \vec{F}_2 + \vec{F}_3 + \vec{F}_4 = 0$, and for angular momenta, $\sum_k \vec{M}_1^k + \vec{M}_3 + \vec{M}_4 = 0$
(see Fig. 3B). Denoting the scalar and vector products by $(\cdot,\cdot)$ and $[\cdot,\cdot]$, the forces can be represented as follows:
– $\vec{F}_1^k = \vec{R}_k \big( a - b \cdot (\vec{R}_k, \vec{V} + [\vec{\omega}, \vec{r}_k]) \big)$ and $\vec{M}_1^k = [\vec{r}_k, \vec{F}_1^k]$ define the force and torque exerted by the $k$-th MT attached to the kinetochore. The vectors $\vec{V}$ and $\vec{\omega}$ specify the linear and angular velocities of the pair of chromosomes; the constants $a$ and $b$ characterize the maximum force and its extinction coefficient, respectively. The unit
Fig. 3. The meaning of the key parameters of the model: (A) setting the size of the pair of chromosomes; (B) the directions of vectors of forces and points of their application.
vector $\vec{r}_k$ is directed from the center of the pair of chromosomes to the point of MT attachment, and $\vec{R}_k$ is directed from the point of attachment to the spindle pole.
– $\vec{F}_2 = \vec{n} \cdot (S - S_L) \cdot S_K$ defines the force that arises when the centromere, represented as a Hookean spring, is stretched and that is exerted on each chromosome. The unit vector $\vec{n}$ is directed to the center of the sister chromatid, and the scalar $S$ determines the current extension.
– $\vec{F}_3 = \gamma \vec{V}$ and $\vec{M}_3 = \eta \vec{\omega}$ define the friction force and torque arising due to the viscosity of the cytoplasm. The coefficients $\gamma$ and $\eta$ are constant parameters of the model; the vectors $\vec{V}$ and $\vec{\omega}$ specify the linear and angular velocities of the pair of chromosomes.
– $\vec{F}_4(t)$ and $\vec{M}_4(t)$ correspond to the noise term of the Langevin equation, which statistically reproduces the effect of the Brownian motion of cytoplasm molecules. The rates of translational and rotational motion are characterized by the constants Dtrans and Drot. Random variables are modeled by a normal distribution.
For an unambiguous description of the cell, the following parameters were added. The numbers NMTs and Nchrs specify the total number of MTs growing from one spindle pole and the number of pairs of chromosomes, respectively. The geometrical dimensions of the cell are determined by its radius, Rcell, and by the distance between the spindle poles, Lpoles.
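To make the force balance concrete, the sketch below evaluates the $\vec{F}_1^k$ term for one attached MT and checks that, taking the friction as $-\gamma\vec{V}$ (opposing the motion, a sign convention made explicit here), the translational balance $\vec{F}_1 + \vec{F}_3 = 0$ is satisfied at the drift velocity $v = a/(b+\gamma)$. The numbers ($a$, $b$, $\gamma$, the geometry) are illustrative placeholders, not the model's calibrated parameters.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])

A, B = 10.0, 2.0    # maximum force and extinction coefficient (assumed)
GAMMA = 5.0         # cytoplasm friction coefficient (assumed)

def f1(R_hat, V, omega, r):
    """F1^k = R_hat * (a - b * (R_hat, V + [omega, r]))."""
    v_attach = tuple(x + y for x, y in zip(V, cross(omega, r)))  # velocity of attachment point
    mag = A - B * dot(R_hat, v_attach)
    return tuple(mag * c for c in R_hat)

R_hat = (1.0, 0.0, 0.0)   # unit vector towards the spindle pole
r = (0.0, 0.1, 0.0)       # attachment point relative to the pair's center
omega = (0.0, 0.0, 0.0)   # no rotation in this toy configuration

# Steady state of F1 + F3 = 0 with F3 = -gamma*V along R_hat:
# a - b*v = gamma*v  =>  v = a / (b + gamma)
v_steady = A / (B + GAMMA)
V = tuple(v_steady * c for c in R_hat)
friction = tuple(-GAMMA * v for v in V)
residual = tuple(f + g for f, g in zip(f1(R_hat, V, omega, r), friction))
```

The residual vanishes, confirming that a single attached MT drags the chromosome pair towards its pole at a constant velocity set by the ratio of the pulling and damping coefficients.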
4 Numerical Experiments Technique
For the numerical experiments, we used the open-source software package MiCoSi (Mitosis Computer Simulator, https://github.com/m-krivov/MiCoSi), developed by the authors, which implements the proposed mathematical model of mitosis. Its accuracy and consistency were verified in two ways. First, the MiCoSi software contains automatic tests that simulate trivial scenarios of cell division and track the transition of a simple virtual cell
to the expected state. Second, within each experiment, the evolution of one cell (from a group of identical cells) was tracked in manual mode using the built-in visualizer. The setup of the simulated scenarios and the export of results are carried out by compiling and running an auxiliary program in C#. The conclusions below were obtained using a solver codenamed Experimental, which contains the latest version of the model. As for the implementation of those parts of the algorithm that allow for some alternatives, the choices were made as follows:
– the spatial coordinates of the chromosomes were "frozen" in the equatorial plane to level out a possible side effect of their oscillations. The validity of such a "freeze" requires a separate detailed study and is not discussed in this paper;
– in the case of rotation of a pair of chromosomes, the attached MTs were not allowed to pass through the kinetochore. Instead, they "wound" round it like threads. While this choice did not lead to noticeable differences in the simulation results, we consider it more consistent with reality;
– an MT can attach to the kinetochore not only with its plus-end, but also with any of its points. When passing through a chromosome arm, an MT switches to the state of depolymerization rather than "breaks off".
For each case under consideration, the simulations were performed on an ensemble of 100 cells with a time step of 0.1 s, after which the results were averaged. The parameters of the model were chosen to correspond to a human cell (Table 1), but with only one pair of chromosomes modeled. Each numerical experiment consisted of varying one selected parameter and, unless otherwise indicated, measuring two quantities: the total number of attached MTs and the number of MAs. The calculations were partially performed on ten nodes of the Lomonosov-2 supercomputer equipped with Intel Xeon E5-2697 v3 series CPUs.
In total, 100 cores were used, and parallelization between them was performed using MPI and OpenMP technologies. Due to the independence of the calculations, almost linear scalability was observed [18].
Table 1. The values of cell parameters used in computer simulations by default.
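The ensemble protocol (100 independent cells, a 0.1 s time step, averaging at the end) is embarrassingly parallel, which is why near-linear scaling was observed. A toy Python version of the protocol, with a placeholder per-cell model standing in for MiCoSi and invented event probabilities, looks as follows:

```python
import random

N_CELLS = 100
DT = 0.1        # s, as in the text
T_END = 60.0    # shortened horizon for this sketch

def simulate_cell(seed):
    """Toy stand-in for one cell: returns (total attachments, merotelic count)."""
    rng = random.Random(seed)   # per-cell seed makes runs reproducible
    attached, merotelic = 0, 0
    t = 0.0
    while t < T_END:
        if rng.random() < 0.05:        # placeholder attachment event
            attached += 1
            if rng.random() < 0.2:     # placeholder merotelic fraction
                merotelic += 1
        t += DT
    return attached, merotelic

# Cells are independent, so this loop parallelizes trivially (MPI/OpenMP in
# the paper; multiprocessing.Pool would be the Python analogue).
results = [simulate_cell(seed) for seed in range(N_CELLS)]
mean_attached = sum(a for a, _ in results) / N_CELLS
mean_merotelic = sum(m for _, m in results) / N_CELLS
```

Only the protocol is illustrated here; in the real pipeline each `simulate_cell` call runs the full 3D model of Sect. 3 and the averaged curves are the ones plotted in Figs. 4 and 5.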
5 Results and Discussion
The features of the mathematical model have a significant impact on the results. In Fig. 4, the results of three-dimensional modeling of the beginning of metaphase are compared with similar numerical experiments from [5], conducted on a fairly similar but two-dimensional model. In both cases, the virtual cells were in the same initial states and had similar values of the biophysical parameters. The main difference in the results is the sharp increase in the total number of MT attachments between 20 and 50 s after the start of metaphase in the 3D model (see Fig. 4D), which is primarily due to the possibility of lateral attachments of MTs. After reaching the peak, this number begins to decrease monotonically and eventually stabilizes at a certain level that depends on the cell parameters, whereas in [5] the opposite conclusion was made: the number of MT attachments grows monotonically and reaches a "plateau" only by 10–20 min of metaphase. This tendency also resulted in noticeable differences in the distribution of kinetochores by the types of attachments (see Fig. 4A). Our 3D model predicts that during the first minute there should be a fairly sharp transition of all pairs of kinetochores from the "No KMTs" state to the "Merotelic" one (see Fig. 4B), which means that each of them has at least one MA. In the model from [5], this process takes 2 min (see Fig. 4C), and by the time it is completed, about 20% of the kinetochore pairs have lost their merotelic attachments or do not have them at all. At the same time, it should be recognized that the key mitosis patterns known from experimental work [19] are reproduced within the framework of both models. For example, there is a characteristic increase in the number of MAs at the beginning of metaphase, and they are almost completely suppressed towards its end. The values for the total number of attachments are close to the expected ones.
Thus, we can conclude that the issue of validating the entire variety of mathematical models of mitosis is becoming more and more relevant, especially if new conclusions about the nature of mitosis are made on their basis.
Large values of the kinetochore crown angle lead to the preservation of MAs, while the size of the virtual cell and the probability of MT detachment affect the rate of their suppression. In [5], it was concluded that the initial position and orientation of a pair of chromosomes have a significant impact on the MA dynamics. Our calculations confirmed this statement [18], showing that for some configurations the pair of chromosomes can be rotated by 70°–90° by the end of metaphase, and this position is stable. To reduce the possible impact of the initial cell configuration on the process under study, in this numerical experiment a pair of chromosomes was positioned in the center of the cell so that the kinetochores were equally accessible to MTs growing from each of the poles, rather than being shielded by the chromosome arms (see Fig. 5). If the radius of the cell, Rcell (see Fig. 5C), and the probability of detachment events, Koff (see Fig. 5B), are varied, there is a similar change in the total number of MT attachments. At the same time, the time required for the cell to completely suppress MAs increases or decreases by tens of minutes. This suggests that these two parameters implicitly determine the duration of metaphase.
Evaluation of the Effect of Cell Parameters on the Number
151
Fig. 4. Transition from 2D to 3D: reproducing the results of modeling the beginning of the metaphase, taken from [5], with the package developed by the authors: (A) types of kinetochores depending on the nature of MT attachments; (B) classification of kinetochores based on the 3D model, with cell parameters corresponding to Table 1; (C) classification of kinetochores from [5] (2D model); (D) the average number of MT attachments per kinetochore in the 3D model when the event probabilities are varied; (E) the average number of MT attachments per kinetochore from [5] (2D model), with detachment probabilities estimated according to [13].
Finally, our calculations confirmed the conclusion of [5] that the size of the kinetochore crown, set by α_kin, is indeed the key element of the geometric mechanism for suppressing MAs. Large values of this angle lead not only to a slowdown in the rate of MA suppression, but also to their preservation at the end of the metaphase. The diameter of the kinetochore, D_kin (see Fig. 5A), in contrast, limits only the total number of MT attachments (from ~60 attachments for D_kin = 0.5 μm down to ~20 attachments for D_kin = 2 μm) but does not affect the percentage of MAs. Additionally, it was found that with a kinetochore diameter of about 1 μm, the pair of chromosomes almost completely loses all MT attachments by the end of the metaphase.
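A minimal sketch of the geometric criterion implied here (our simplification, not the authors' exact capture rule): an MT growing straight from a pole can attach only if its approach direction falls within the crown half-angle α_kin of the kinetochore's outward normal, so a larger α_kin accepts MTs from a wider cone, including MTs from the "wrong" pole.

```python
import math

def within_crown(pole, kin_pos, kin_normal, alpha_deg):
    """True if an MT growing straight from `pole` can reach the crown:
    the direction towards the pole must lie within alpha_deg of the
    kinetochore's outward normal."""
    d = [p - q for p, q in zip(pole, kin_pos)]
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_n = math.sqrt(sum(x * x for x in kin_normal))
    cosang = sum(a * b for a, b in zip(d, kin_normal)) / (norm_d * norm_n)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cosang))))
    return angle <= alpha_deg
```

For example, a kinetochore at the origin facing pole A along (1, 0, 0) accepts an MT from a pole at (10, 10, 0) (a 45° approach) only if α_kin exceeds 45°, which is the sense in which a wide crown promotes merotelic attachments.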
152
M. A. Krivov et al.
Fig. 5. The efficiency of suppressing MAs depending on the biophysical parameters of virtual cell during the first hour of the metaphase. Values marked with * correspond to the configuration in Table 1: (A) variation in the size of the kinetochore (μm); (B) variation of the probability of MT detachment from the kinetochore (s−1 ); (C) variation of cell radius (μm) and the distance between the spindle poles (proportional to the radius); (D) variation of the angle of the kinetochore crown (degrees).
It should be emphasized that this conclusion, obtained by mathematical simulation, contradicts some experimentally established facts. In a study of cells of the female Indian muntjac deer [20], which have only three pairs of chromosomes, it was observed that chromosomes with larger kinetochores have more MAs, both in absolute numbers and as a percentage (7.0% vs. 1.6%). As a consequence, the authors claimed that the size of the kinetochore is extremely important for the suppression of MAs and for erroneous chromosome divergence in anaphase, i.e. chromosomes 'missegregate during anaphase'. The reason for this discrepancy may lie both in the features of the proposed mathematical model and in differences in the properties of the studied cells. As already noted, the initial position of the chromosomes has a certain influence on the dynamics of MAs, so a similar simulation conducted with other model settings could refine our conclusions. A certain effect may also be expected from further modifications of the model, such as a transition to a more complex representation of the kinetochore and accounting for the repulsive forces arising from the interaction of microtubules with the chromosomes' arms. Thus, it should be emphasized once again that the choice of the mathematical model of mitosis, unfortunately, has a noticeable impact on the outcomes. Summing up the results of the numerical simulation, it can be argued that large values of the kinetochore crown angle lead to the preservation of MAs at the end of the metaphase. As for the size of the cell and the probability of MT detachment events, they only affect the rate of MA suppression but do not prevent the suppression itself. The diameter of the kinetochore does not have a significant effect on MAs at all.
Acknowledgements. The project was accomplished using the equipment of the Center for Collective Use of Super High-Performance Computing Resources of the M.V. Lomonosov Moscow State University. This work was funded by the Russian Foundation for Basic Research (RFBR), grants 16-07-01064a and 19-07-01164a (to M.K. and P.I.).
References
1. Cimini, D.: Merotelic kinetochore orientation, aneuploidy, and cancer. Biochim. Biophys. Acta (BBA) – Rev. Cancer 1786(1), 32–40 (2008)
2. Heim, S., Mitelman, F.: Cancer Cytogenetics: Chromosomal and Molecular Genetic Aberrations of Tumor Cells, 4th edn. Wiley, Hoboken (2015)
3. Gregan, J.: Merotelic kinetochore attachment: causes and effects. Trends Cell Biol. 21(6), 374–381 (2011)
4. Salmon, E., Cimini, D., Cameron, L., DeLuca, J.: Merotelic kinetochores in mammalian tissue cells. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360(1455), 553–568 (2005)
5. Zaytsev, A., Grishchuk, E.: Basic mechanism for biorientation of mitotic chromosomes is provided by the kinetochore geometry and indiscriminate turnover of kinetochore microtubules. Mol. Biol. Cell 26(22), 3985–3998 (2015)
6. Civelekoglu-Scholey, G., Cimini, D.: Modelling chromosome dynamics in mitosis: a historical perspective on models of metaphase and anaphase in eukaryotic cells. Interface Focus 4(3), 1–9 (2014)
7. McIntosh, R., Molodtsov, M., Ataullakhanov, F.: Biophysics of mitosis. Q. Rev. Biophys. 45(2), 147–207 (2012)
8. Kirschner, M., Mitchison, T.: Beyond self-assembly: from microtubules to morphogenesis. Cell 45(3), 329–342 (1986)
9. Gliksman, N., Skibbens, R., Salmon, E.: How the transition frequencies of microtubule dynamic instability (nucleation, catastrophe, and rescue) regulate microtubule dynamics in interphase and mitosis: analysis using a Monte Carlo computer simulation. Mol. Biol. Cell 4(10), 1035–1050 (1993)
10. Wollman, R., et al.: Efficient chromosome capture requires a bias in the 'search-and-capture' process during mitotic-spindle assembly. Curr. Biol. 15(9), 828–832 (2005)
11. Gregoretti, I., et al.: Insights into cytoskeletal behavior from computational modeling of dynamic microtubules in a cell-like environment. J. Cell Sci. 119(Pt 22), 4781–4788 (2006)
12. Paul, R., et al.: Computer simulations predict that chromosome movements and rotations accelerate mitotic spindle assembly without compromising accuracy. Proc. Natl. Acad. Sci. U.S.A. 106(37), 15708–15713 (2009)
13. Joglekar, A., Hunt, A.: A simple, mechanistic model for directional instability during mitotic chromosome movements. Biophys. J. 83(1), 42–58 (2002)
14. Civelekoglu-Scholey, G., Sharp, D., Mogilner, A., Scholey, J.: Model of chromosome motility in Drosophila embryos: adaptation of a general mechanism for rapid mitosis. Biophys. J. 90(11), 3966–3982 (2006)
15. Campas, O., Sens, P.: Chromosome oscillations in mitosis. Phys. Rev. Lett. 97(12), 128102-1–128102-4 (2006)
16. Lawrimore, J., et al.: ChromoShake: a chromosome dynamics simulator reveals that chromatin loops stiffen centromeric chromatin. Mol. Biol. Cell 27(1), 153–166 (2016)
17. Vasileva, V., et al.: Molecular mechanisms facilitating the initial kinetochore encounter with spindle microtubules. J. Cell Biol. 216(6), 1609–1622 (2017)
18. Krivov, M., Zaytsev, A., Ataullakhanov, F., Ivanov, P.: Simulation of biological cell division in metaphase on a supercomputer 'Lomonosov-2'. Comput. Meth. Soft. Dev.: New Comput. Technol. 19, 327–339 (2018). (in Russian)
19. McIntosh, R. (ed.): Mechanisms of Mitotic Chromosome Segregation. Biology, special issue. MDPI, Basel (2017)
20. Drpic, D., et al.: Chromosome segregation is biased by kinetochore size. Curr. Biol. 28(9), 1344–1356 (2018)
Reconciliation of Regulatory Data: The Regulatory Networks of Escherichia coli and Bacillus subtilis
Diogo Lima, Fernando Cruz, Miguel Rocha, and Oscar Dias(B)
Centre of Biological Engineering, University of Minho, 4710-057 Braga, Portugal
{diogolima,fernando.cruz,odias}@ceb.uminho.pt, [email protected]
Abstract. Multiple efforts have been made to comprehend the regulatory machinery of model prokaryotic organisms such as Escherichia coli and Bacillus subtilis. However, the lack of unification of published regulatory data reduces the potential for reconstructing whole-genome transcriptional regulatory networks. The work discussed here focuses on the retrieval and integration of relevant regulatory data from multiple resources, including databases such as RegulonDB and DBTBS, as well as the available literature. This study presents state-of-the-art, reconciled transcriptional regulatory networks of the previously mentioned model organisms, as well as a topological and functional analysis. Keywords: Transcriptional regulatory networks · Gene regulation · Reconciliation of regulatory data
1 Introduction

Escherichia coli K12 MG1655 and Bacillus subtilis str. 168 have been studied extensively, being the model gram-negative and gram-positive organisms, respectively. Multiple efforts to elucidate the regulatory machinery of these model prokaryotes have been made [1, 2]. However, the lack of reconciliation of published regulatory networks impairs the applicability of this knowledge. This study has shown that although most published networks share a major overlap of regulatory data, the unification of the retrieved data allows reconstructing state-of-the-art regulatory networks, even more so for model organisms. The collection of gene regulatory data has several applications in the field of systems biology, such as the integration of a regulatory layer in Genome-Scale Metabolic (GSM) models to improve phenotypic predictions [3].
2 Materials and Methods

Aiming to understand the regulatory control system of Escherichia coli K12 MG1655 and Bacillus subtilis str. 168, an extensive search was performed for resources containing
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 155–165, 2021. https://doi.org/10.1007/978-3-030-54568-0_16
156
D. Lima et al.
relevant transcriptional regulatory data. Regarding the former, three resources were used to retrieve data: RegulonDB [4], an E. coli-specific database of transcriptional regulatory networks; RegPrecise [5], a major database of prokaryotic regulatory interactions providing data for over 450 species; and finally the work of Fang et al. [1], which reconciled data from RegulonDB with validated chromatin immunoprecipitation data. As shown in Fig. 1, the largest dataset was the one retrieved from the work of Fang et al. However, RegulonDB and RegPrecise were the only resources that provided binding-site data, a valuable asset in transcriptional regulatory network reconstruction [6]. B. subtilis regulatory data was retrieved from five resources: the previously mentioned RegPrecise database; CollecTF [7], a database of experimentally validated transcription factor binding sites for several organisms; DBTBS [8], a B. subtilis-specific repository of regulatory data; and finally Faria et al. [9] and Arrieta-Ortiz et al. [2], as shown in Fig. 1. The former work reconciled data from several databases with gene expression data from GEO [10], while the latter reconstructed a regulatory network from data retrieved from Subtiwiki [11] and further complemented it with two transcriptomics datasets.
Fig. 1. Summary of the reconciliation of transcriptional regulatory networks from multiple resources. The values surrounded by blue shaded boxes represent the number of regulatory interactions retrieved from each resource. Retrieved regulatory interactions may be redundant, with multiple interactions between the same regulator-gene pair, albeit portraying different evidence.
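The core of such a reconciliation is a merge keyed on the regulator-gene pair that pools the evidence while resolving the regulatory effect. A minimal sketch under our own conventions (the resource names, effect labels, and the rule that conflicting effects collapse to "dual" are illustrative, not the databases' actual schemas):

```python
from collections import defaultdict

def reconcile(sources):
    """Merge regulator -> gene interactions from several resources.

    `sources` maps a resource name to a list of
    (regulator, gene, effect) tuples, with effect in
    {"activation", "repression", "dual", "unknown"}.
    Returns one record per regulator-gene pair, keeping every piece
    of evidence and a single consensus effect.
    """
    merged = defaultdict(lambda: {"evidence": set(), "effects": set()})
    for resource, interactions in sources.items():
        for regulator, gene, effect in interactions:
            rec = merged[(regulator, gene)]
            rec["evidence"].add(resource)
            rec["effects"].add(effect)
    for rec in merged.values():
        effects = rec["effects"] - {"unknown"}
        if effects == {"activation"} or effects == {"repression"}:
            rec["effect"] = effects.pop()
        elif effects:
            rec["effect"] = "dual"   # conflicting or explicitly dual reports
        else:
            rec["effect"] = "unknown"
    return dict(merged)
```

For instance, if one resource reports lrp activating tyrR and another reports it repressing tyrR, the merged record keeps both pieces of evidence and marks the effect as dual, exactly the situation shown for lrp and tyrR in Fig. 2.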
The reconciliation process allowed the reconstruction of extensive transcriptional regulatory networks for both model organisms, as illustrated in Fig. 2. The data extraction process was based on data-warehousing procedures and comprised the implementation of Extract-Transform-Load (ETL) tools to retrieve data from the resources mentioned in this work. In addition, data was also retrieved from the National Center for Biotechnology Information (NCBI) Taxonomy database [12] and
Reconciliation of Regulatory Data
157
Fig. 2. Overview of both the tyrR (transcriptional regulatory protein) and ydcR (HTH-type transcriptional regulator) regulatory mechanisms as an illustration of the reconciled Escherichia coli K12 MG1655 transcriptional regulatory network. White, light gray and blue nodes stand for regulated genes, regulators and binding sites, respectively, whereas blue, green, red and black edges stand for dual, activating, repressing and unknown regulatory effects, respectively. Dashed light gray edges represent the cis-acting regulatory site where a regulator binds to control the expression of a given regulated gene. The edges are directed and thus represent interactions between regulators and genes. In this example, the transcriptional regulatory protein tyrR is repressed by the regulators pgrR and lrp. Nevertheless, the regulator lrp can also activate the expression of the tyrR regulator. This regulator controls, in turn, the expression of the yaiA, folA and mtr genes, as well as the aro and tyr operons. Although we have gathered evidence that the ydcR regulator controls the expression of the ylaC gene, the specific regulatory effect is unknown to date.
Universal Protein Resource (UniProt) [13] to further complement the assembled regulatory networks. The results of this research were compiled in a graph database, publicly available through an open-access web platform at https://protrend.bio.di.uminho.pt. This web application contains all the data described in this work, as well as relevant regulatory information for over 450 prokaryotic species. The complete regulatory networks, degree distributions and clustering coefficients are available in the supplementary material (https://zenodo.org/record/3708325). The p-value resulting from the comparison of the average out-degree of the retrieved regulators and the average in-degree of the genes (discussed in Sect. 3.1) was obtained using a two-sample t-test. The out-degree of a regulator is the total number of connected target genes; the in-degree of a gene is the total number of connected regulators. Thus, regulatory interactions are directed connections from a regulator to a target gene. The distribution of a given degree k is the fraction of regulators or genes in the regulatory network with degree k. The local clustering coefficient cc_i of a regulator i, excluding multiple edges between the same regulator-gene pair, is determined as follows:

cc_i = (number of connections between neighbors of i) / (maximum number of connections between neighbors of i)
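These three quantities can be computed directly from an edge list. A stdlib-only sketch (function names are ours; neighbour links are counted ignoring edge direction, matching the definition above):

```python
from itertools import combinations

def network_stats(edges):
    """Out-degree, in-degree and local clustering coefficient for a
    directed regulator -> gene network given as (regulator, gene) pairs."""
    out_deg, in_deg = {}, {}
    und = set()                      # undirected, deduplicated edge set
    for reg, gene in edges:
        out_deg[reg] = out_deg.get(reg, 0) + 1
        in_deg[gene] = in_deg.get(gene, 0) + 1
        und.add(frozenset((reg, gene)))
    neigh = {}
    for a, b in (tuple(e) for e in und if len(e) == 2):
        neigh.setdefault(a, set()).add(b)
        neigh.setdefault(b, set()).add(a)
    cc = {}
    for reg in out_deg:
        nb = neigh.get(reg, set())
        if len(nb) < 2:
            cc[reg] = 0.0
            continue
        links = sum(frozenset(p) in und for p in combinations(nb, 2))
        cc[reg] = links / (len(nb) * (len(nb) - 1) / 2)
    return out_deg, in_deg, cc
```

Using a frozenset per edge both removes multiple edges between the same regulator-gene pair (as the definition requires) and lets neighbour-neighbour links be counted regardless of direction.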
3 Results and Discussion

3.1 Network Topology and Analysis

The reconstruction of extensive transcriptional regulatory networks of E. coli and B. subtilis was followed by a descriptive and topological analysis, providing the results presented in Tables 1 and 2. Although it was possible to gather data regarding sigma factors, RNA riboswitches and other regulatory mechanisms, such as accessory proteins, transcription factors are the best-represented mechanism. The reconciled networks include approximately 64% and 80% of the protein-coding genes of E. coli and B. subtilis, respectively. The assembled regulatory network of E. coli is predominantly comprised of activating regulatory interactions, as described in the literature [14]. While the same can be said for the assembled network of B. subtilis, the difference is not as evident, which can be explained by the significant amount of gathered unknown regulatory interactions (18%).

Table 1. Descriptive statistics of the reconciled regulatory networks. The unique regulatory interaction counts merge all interactions with distinct evidences but the same regulatory effect between a given regulator-gene pair.
Statistics                                            E. coli K12 MG1655   B. subtilis str. 168
Number of protein coding genes                        4240                 4174
Number of regulators                                  235                  340
Number of transcription factors                       217                  227
Number of sigma factors                               6                    16
Number of RNA riboswitches                            12                   73
Number of other regulators                            0                    24
Number of target genes                                2725 (64%)           3319 (80%)
Number of unique regulatory interactions              9369                 7394
Number of unique activating regulatory interactions   5557 (59%)           3058 (41%)
Number of unique repressing regulatory interactions   3170 (34%)           2974 (40%)
Number of unique dual regulatory interactions         618 (7%)             65 (1%)
Number of unique unknown regulatory interactions
24 ( 0.5 (logFC; logarithm of FC between case and control). Separately, pathways (gene sets) were collected from KEGG repository (211 unique human pathways) [21]. 2.2 Enrichment Methods The first generation method does not have numerous solution hence ORA with a hypergeometric test was run for previously defined DEGs. To each KEGG pathway the contingency table was defined. Then hypothesis claiming that the probability of our observed data (DEGs in pathway) or that more extreme under the assumption that there is no association between expression and gene set membership was tested. The second-generation methods are represented by various approaches which advantages and drawbacks are detailed discussed in different studies [12–16, 22, 23]. In presented work, Coincident Extreme Ranks in Numerical Observations (CERNO) algorithm was evaluated. Out of many methods, it is one of the most flexible in terms of input and characterised by good performance [22]. CERNO uses ranked list of genes according to a given ranking measure (in the presented study it is q-value; low q-value leads to low rank). Further, the artificial probabilities are made as a ratio of the rank of the i-th gene over the total amount of analyzed genes. In the last step, the obtained mock probabilities for genes in the pathway are integrated by the Fisher method. From the group of third-generation methods, the Signaling-Pathway Impact Analysis (SPIA) was evaluated as it is one of the most commonly used and effective approaches [24]. Algorithm for each gene set, calculate the global probability by integrating results from ORA method and perturbation probability of pathway topology based on genes relations and log fold-change of DEGs. To integrate those two probabilities Fisher method is used. In the presented research, z-transformation method for probability integration was tested as Fisher approach is not robust to asymmetrical p-values and can overestimate results [25]. Finally, SPIA results
Robustness of Pathway Enrichment Analysis
179
from pathway perturbation probability were integrated with CERNO results. To distinguish results, the original SPIA approach will be label as SPIA with ORA, while second tested results will be label as SPIA with CERNO. All algorithms were runs with their default options suggested by authors on every dataset and KEGG repository pathways. 2.3 Performance Evaluation All evaluation process was performed only on pathways common to all tested enrichment method. This filtration was forced by SPIA method which requires pathway topology available only for 206 human KEGG pathways. Enrichment methods comparison process at first was performed only on paired design datasets outcomes. This will reduce any possible bias between datasets despite transcriptomic profiling platform. Methods were compared in three different ways: clustering, similarity under the various threshold and target pathway detection. The clustering was tested with the usage of a hierarchical approach. First, the Spearman rank correlation was calculated between obtained enrichment results. Further, the hierarchical clustering was executed with Euclidian distance metric. Second evaluation concentrates on enrichment method outcomes similarities under different significance level thresholds. For such a sequence, the percentage of detected pathways were calculated in each tested method and dataset. Finally, the ability to target pathways detection was investigated. Target pathways were selected by the literature study of LUAD cancer. From the following research [27–29] 16 pathways characteristic for analysed cancer were extracted. To each method and its target pathway surrogate sensitivity (p-value from enrichment) was analysed together with prioritisation (rank of target pathway in all analyzed pathways). Both metrics and techniques in details are discussed in [14, 16]. 
Finally, a dataset with unpaired samples from microarrays was incorporate into the described process to check weather experiment design could influence enrichment method outcomes. To evaluate multi-factor results clustering the UMAP data dimension reduction method [26] was applied on −log10 of enrichment p-values. UMAP technique was chosen as, in contrary to PCA, it preserves similarities between features.
3 Results To every analysed dataset common pre-processing was performed and obtained results are summarized in Table 1. As can be observed, similar sample size, number of DEGs, the same number of analysed genes/transcripts and phenotype were kept. Thus it is unlikely that datasets’ properties will impact enrichment method evaluation except for the transcriptomic profiling platform. Next, for paired datasets, the evaluation process was performed. First, the impact of the integration method in SPIA algorithm was investigated. Figure 1 shows the relation between −log10 of enrichment p-value for the same method and different integration technique. As can be observed in Fig. 1 obtained results are highly correlated, however, Fisher integration method gives lower p-values compared to z-transformation. As it was mention previously this fact is one of Fisher’s method drawbacks. Thus in the further evaluation
180
J. Zyla et al. Table 1. Detailed datasets description and its pre-processing results.
ID          Phenotype   Platform      Sample size (Case + Control)   Design     # of genes   # of DEGs
GSE19188    LUAD        Microarrays   153 (91 + 62)                  Unpaired   12 055       3 148
GSE18842    LUAD        Microarrays   88 (44 + 44)                   Paired     12 055       3 870
TCGA-LUAD   LUAD        RNA-Seq       116 (58 + 58)                  Paired     12 055       4 869
Fig. 1. The relation between different integration techniques in SPIA algorithm. Each dot represents the same KEGG pathway. Panel A) and B) show results for RNA-Seq for SPIA with CERNO and SPIA with ORA respectively. Panel C) and D) show results for microarrays for SPIA with CERNO and SPIA with ORA respectively. Axis x shows −log10 of enrichment p-value when z-transformation integration is applied. Axis y shows −log10 of enrichment p-value when Fisher integration is applied. Dashed lines represent significance level 0.05.
process, the results for SPIA with ORA (the original method) and SPIA with CERNO under z-transformation integration were taken, as they do not overestimate the outcomes.
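The difference between the two integration schemes is easy to reproduce. A stdlib-only sketch (function names are ours) for combining k independent p-values: Fisher uses X = −2 Σ ln p_i ~ chi-square(2k), while the z-transformation (Stouffer) method averages the corresponding normal quantiles. With one extreme and one unremarkable p-value, Fisher returns a far smaller combined value, which is exactly the asymmetry penalised above.

```python
import math
from statistics import NormalDist

def fisher_combine(pvals):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi-square with 2k dof
    (survival function in closed form, since the dof is even)."""
    x = -2.0 * sum(math.log(p) for p in pvals)
    k = len(pvals)
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                  for i in range(k))

def z_combine(pvals):
    """Unweighted z-transformation (Stouffer) method."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1.0 - nd.cdf(z)
```

For the pair (1e-6, 0.9), Fisher's combined p-value is orders of magnitude smaller than the z-transformation's, so a single extreme component can dominate Fisher's verdict even when the other evidence is weak.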
For the selected DEGs and the chosen integration approach for the SPIA method, the Spearman rank correlation was calculated for every dataset and its enrichment outcomes (transformed to log10 of p-values). The results of hierarchical clustering of the correlations are presented in Fig. 2.
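The correlation underlying Fig. 2 can be reproduced without external packages: rank both vectors (using average ranks for ties) and take the Pearson correlation of the ranks. A minimal sketch (function names ours), which one would apply to the per-pathway transformed p-value vectors of two methods:

```python
def rankdata(values):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because only ranks matter, any monotone transformation of the p-values (such as taking logarithms) leaves the coefficient unchanged, which makes Spearman a natural choice for comparing enrichment outcomes across methods.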
Fig. 2. Heatmap with hierarchical clustering of Spearman rank correlation between enrichment methods outcomes.
In Fig. 2, it can be observed that the CERNO method and SPIA with CERNO form one cluster without division by transcriptomic profiling platform. Next, the group of ORA and SPIA with ORA (the ORA-based approaches) can be distinguished; here, hierarchical clustering shows a division due to the platform. Based on these observations, it can be concluded that the main division of the enrichment results is between methods (CERNO-based vs ORA-based approaches); in addition, for the ORA-based approaches, the platform can have an impact. Moreover, it can be noticed that all correlation coefficients are above 0.5; hence, similar results are observed for the same type of cancer (in the presented study, LUAD). Further in the evaluation pipeline, the percentage of detected pathways under various significance-level thresholds was investigated (see Fig. 3). Panels A and D in Fig. 3 show that for the CERNO-based approaches (CERNO and SPIA with CERNO) the proportion of significant pathways under various thresholds is similar for both transcriptomic profiling platforms. For the ORA method, it can be noticed that the RNA-Seq dataset shows more pathways at the various levels compared with microarrays. This could be an effect of the binary division into DEGs and non-DEGs in overrepresentation analysis: it has been reported that the RNA-Seq platform gives more DEGs than microarrays [30, 31], which can affect the enrichment results. Looking at SPIA with ORA (Fig. 3, panel C), this dispersion is lower, which can be an effect of the additional information from network perturbation. Finally, the detection of the target pathway was analysed. Results for surrogate sensitivity (the enrichment p-value of the expected pathway) and prioritisation (the ranking of the target pathway among all analysed pathways) are presented in Fig. 4.
Fig. 3. Percentage of detected pathways under various significance levels. Each panel represents one tested method with the two transcriptomic platforms. The vertical black line represents the 0.05 significance level.
Fig. 4. Results of target pathway detection for the various enrichment methods. Panel A) shows the log10 of the enrichment p-value of the target pathway (surrogate sensitivity). Panel B) shows the ranking of the target pathway among all analysed pathways as a percentage (prioritisation). Colours represent the different transcriptomic profiling platforms.
For both measurements, the main differences are between the methods themselves. These observations are in agreement with the previous ones, obtained when all pathways were analysed. Moreover, it can be observed that the target pathways for the ORA-based approaches have
lower enrichment p-values compared with the CERNO-based approaches (Fig. 4, panel A). This phenomenon can also be observed in Fig. 3. Additionally, the ORA-based approaches differ more between transcriptomic platforms than the CERNO-based approaches. For the target pathway prioritisation, these differences are not observed. As the last step of the presented research, the unpaired-design dataset from the microarray transcriptomic profiling platform was added. To demonstrate the clustering of the enrichment results under the various conditions, the UMAP method for dimensionality reduction with preservation of similarity was used. The results are presented in Fig. 5.
Fig. 5. UMAP method results. Each panel represents the first and second component of UMAP. Panel A) shows labelling due to enrichment method. Panel B) shows labelling due to transcriptomic profiling platform. Panel C) shows labelling due to the design of the experiment.
In panel A) of Fig. 5, it can be observed that the results are divided by enrichment method (separate clusters for the ORA- and CERNO-based approaches). Panel B) shows that the CERNO-based approaches are robust to the transcriptomic profiling platform, whereas for the ORA-based approaches differences between microarrays and RNA-Seq can be noticed. Panel C) shows that the design of the experiment does not have an impact on the enrichment outcomes. The results from the extended group of datasets confirm the previous observations, i.e. the most significant difference is observed between enrichment methods, and the ORA-based approaches differentiate between platforms. As mentioned previously, this can be caused by the binary division into DEGs and non-DEGs in the ORA method.
4 Conclusions

A comprehensive comparison of three different enrichment algorithms (one representative of each generation) and transcriptomic expression profiling platforms was performed. After removing the main confounding factors that could affect the comparison, it was shown that the biggest difference between outcomes is due to the enrichment method used rather than to the transcriptomic profiling platform itself. Among the tested methods, the largest dispersion with respect to the high-throughput technique was observed for the ORA method. These differences were, to a certain extent, reduced by combining ORA with the third-generation method SPIA. The ORA method relies on a binary division into DEGs and non-DEGs, whose proportion differs between RNA-Seq and microarrays [30, 31]; hence, this effect can be observed in the enrichment results. The CERNO method, both by itself and in combination with the third-generation SPIA, shows robustness to the transcriptomic platform. CERNO was previously reported as the most reproducible enrichment method [22], and as the same phenotype was investigated here, this outcome was also observed in the presented study. In summary, enrichment methods show robustness to transcriptomic expression profiling platforms, and they mostly differ among themselves. Differences in method performance have been reported previously [12–16].

Acknowledgements. This work was co-financed by the SUT grant for maintaining and developing research potential (JZ), National Science Centre Poland grant BITIMS 2015/19/B/ST6/01736 (JP), and a European Union grant under the European Social Fund, project no. POWR.03.02.0000-I029 (KL). All calculations were carried out using the GeCONiI infrastructure funded by NCBiR project no. POIG.02.03.01-24-099/13.
References
1. Dziuda, D.M.: Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data. Wiley, Hoboken (2010)
2. Schena, M., et al.: Microarrays: biotechnology's discovery platform for functional genomics. Trends Biotechnol. 16(7), 301–306 (1998)
3. Zhang, Z.H., et al.: A comparative study of techniques for differential expression analysis on RNA-Seq data. PLoS ONE 9(8), e103207 (2014)
4. Robertson, G., et al.: De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909–912 (2010)
5. Zhang, W.: Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biol. 16(1), 133 (2015)
6. Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010)
7. Malone, J.H., Oliver, B.: Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol. 9(1), 34 (2011)
8. Khatri, P., Sirota, M., Butte, A.J.: Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 8(2), e1002375 (2012)
9. Huang, D.W., et al.: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37(1), 1–13 (2009)
10. Khatri, P., Draghici, S., Ostermeier, G.C., Krawetz, S.A.: Profiling gene expression using onto-express. Genomics 79(2), 266–270 (2002)
11. Hung, J.H., et al.: Identification of functional modules that correlate with phenotypic difference: the influence of network topology. Genome Biol. 11(2), R23 (2010)
12. Ihnatova, I., Popovici, V., Budinska, E.: A critical comparison of topology-based pathway analysis methods. PLoS ONE 13(1), e0191154 (2018)
13. Maciejewski, H.: Gene set analysis methods: statistical models and methodological differences. Briefings Bioinform. 15(4), 504–518 (2014)
14. Zyla, J., et al.: Ranking metrics in gene set enrichment analysis: do they matter? BMC Bioinform. 18(1), 256 (2017)
15. Hung, J.H., et al.: Gene set enrichment analysis: performance evaluation and usage guidelines. Briefings Bioinform. 13(3), 281–291 (2012)
16. Tarca, A.L., Bhatti, G., Romero, R.: A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS ONE 8(11), e79217 (2013)
17. Geistlinger, L., et al.: Toward a gold standard for benchmarking gene set enrichment analysis. Brief. Bioinform., bbz158 (2020). https://doi.org/10.1093/bib/bbz158
18. Tarca, A.L., Draghici, S., Bhatti, G., Romero, R.: Down-weighting overlapping genes improves gene set analysis. BMC Bioinform. 13(1), 136 (2012)
19. Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., Smyth, G.K.: Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43(7), e47 (2015). https://doi.org/10.1093/nar/gkv007
20. McCarthy, D.J., Chen, Y., Smyth, G.K.: Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40(10), 4288–4297 (2012)
21. Kanehisa, M., et al.: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017)
22. Zyla, J., et al.: Gene set enrichment for reproducible science: comparison of CERNO and eight other algorithms. Bioinformatics 35(24), 5146–5154 (2019)
23. Maleki, F., et al.: Size matters: how sample size affects the reproducibility and specificity of gene set analysis. Hum. Genomics 13(1), 42 (2019)
24. Tarca, A.L., et al.: A novel signaling pathway impact analysis. Bioinformatics 25(1), 75–82 (2009)
25. Whitlock, M.C.: Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 18(5), 1368–1373 (2005)
26. Becht, E., et al.: Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37(1), 38 (2019)
27. Navab, R., et al.: Prognostic gene-expression signature of carcinoma-associated fibroblasts in non-small cell lung cancer. PNAS 108(17), 7160–7165 (2011)
28. Tang, Q., et al.: Hub genes and key pathways of non-small lung cancer identified using bioinformatics. Oncol. Lett. 16(2), 2344–2354 (2018)
29. Shi, W.Y., et al.: Gene expression analysis of lung cancer. Eur. Rev. Med. Pharmacol. Sci. 18(2), 217–228 (2014)
30. Bottomly, D., et al.: Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS ONE 6(3), e17820 (2011)
31. Zhao, S., et al.: Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE 9(1), e78644 (2014)
Hypoglycemia Prevention Using an Embedded Model Control with a Safety Scheme: In-silico Test

Fabian Leon-Vargas, Andres L. Jutinico(B), and Andres Molano-Jimenez

Universidad Antonio Nariño, Bogotá, Colombia
{fabianleon,ajutinico,andres.molano}@uan.edu.co, http://www.uan.edu.co
Abstract. Artificial pancreas systems have been designed and implemented for type 1 diabetes patients to overcome the glucose regulation problems of conventional therapies. However, hypoglycemia is one of the most feared conditions associated with artificial pancreas systems. This paper presents a new control system based on an Embedded Model Control strategy for glucose regulation, including a safety scheme designed to reduce the risk of hypoglycemia events in a full closed-loop system. Insulin on board estimation was used as part of the safety scheme to limit the insulin dose. Simulations were run in an FDA-accepted simulator to test and compare the system performance. Results show that hypoglycemia events are avoided in all virtual patients when the safety scheme is implemented.
Keywords: Glucose control · Type 1 diabetes · Safety layer · Hypoglycemia · In-silico test

1 Introduction
Diabetes is a metabolic disease characterized by elevated plasma glucose levels, corresponding to acute or chronic hyperglycemia, which can lead to long-term micro- or macrovascular complications. In type 1 diabetes there is a lack of insulin secretion by the beta-cells in the islets of Langerhans in the pancreas, while in type 2 diabetes there is a combination of resistance to insulin action and an inadequate compensatory insulin secretory response. Diabetes is one of the most serious diseases requiring artificial regulation. According to the latest data from the International Diabetes Federation, the number of people with diabetes worldwide is estimated to increase from 425 million in 2017 to 628.6 million in 2045 among the world's adult population aged between 20 and 79 years. Currently, there is a large international effort addressed at using technology to prevent or delay the onset of diabetes complications. Several studies presenting clinically validated strategies have shown the importance of diabetes
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 186–196, 2021. https://doi.org/10.1007/978-3-030-54568-0_19
technology in glucose control performance [1–5]. One of the greatest advances in diabetes technology is the well-known artificial pancreas system (APS) concept, a technological combination of three components: a continuous glucose monitoring system, an insulin pump, and a control algorithm, whose interaction allows the glucose levels of type 1 diabetes (T1D) patients to be controlled. Several APS have first been designed and pre-validated in simulators approved by the Food and Drug Administration (FDA) agency of the United States, used as a substitute for clinical trials in animals, and then tested in clinical studies around the world, showing the feasibility of this concept [6,7]. Although commercial developments can now be found [8], such as Medtronic's 670G insulin pump, which includes a PID-based control algorithm regulating the basal insulin dose every five minutes according to the current glucose level, several challenges remain on the way to a complete artificial pancreas. Some of the most important challenges in artificial pancreas development that still trigger hypoglycemia events are:

– Physiological intra-patient variability (exercise, stress, alcohol consumption, etc.)
– Delays in insulin action and glucose measurement due to the non-physiological delivery and sensing routes
– Lack of accuracy of continuous glucose monitoring systems
– Large disturbances that are unmeasurable, or too poorly characterized to predict their effect.

In this paper, an embedded model (EM) control strategy for glucose regulation is presented along with a safety scheme to prevent hypoglycemia events. The embedded model control is based on the glucose-insulin model of Colmegna et al., while the safety scheme is based on an insulin-on-board estimation model and an adaptive algorithm. This strategy is evaluated over a standard meal test scenario in an FDA-approved simulator.
Results show that hypoglycemia events are avoided in all virtual patients when the safety scheme is implemented.
2 The Colmegna Model
The model presented in [9,10] has the following state-space representation:

\dot{x} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -p_1 p_2 p_3 & -p_2 p_3 - p_1(p_2 + p_3) & -(p_1 + p_2 + p_3) \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} u,
y = k \begin{bmatrix} z & 1 & 0 \end{bmatrix} x.   (1)

This model has the parameters k = -1.788 × 10^{-5}, z = 0.1501, p_1 = 0.0035, p_2 = 0.0138, and p_3 = 0.0143, where the only varying parameter is p_1, which depends on the measured glucose value g and the associated parameters of Table 1, and follows the polynomial rule

p_1(g) = q_i g^3 + r_i g^2 + s_i g + t_i,   (2)
188
F. Leon-Vargas et al.
with

i = \begin{cases} 1 & 110 \le g \\ 2 & 65 \le g < 110 \\ 3 & 59 \le g < 65 \\ 4 & g < 59. \end{cases}   (3)
Table 1. Parameter values of p_1(g) of (2).

i | q_i            | r_i            | s_i            | t_i
1 | 0              | 9.0580 × 10^{-8}  | -5.3562 × 10^{-5} | 1.1357 × 10^{-2}
2 | -4.2382 × 10^{-8} | 1.1402 × 10^{-5}  | -9.1676 × 10^{-4} | 2.5849 × 10^{-2}
3 | 0              | 1.7321 × 10^{-4}  | -2.3080 × 10^{-2} | 7.7121 × 10^{-1}
4 | 0              | -2.9126 × 10^{-6} | 2.4514 × 10^{-4}  | 8.0865 × 10^{-3}
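As a quick check of the piecewise polynomial (2)–(3), the parameters above can be evaluated directly. The sketch below (plain Python, not from the paper) returns p_1 for a glucose reading g in mg/dL.

```python
def p1(g):
    """Glucose-dependent pole p1(g) of the Colmegna model, Eqs. (2)-(3).

    Coefficients (q_i, r_i, s_i, t_i) are taken from Table 1; the region
    index i is selected from the glucose value g [mg/dL] as in Eq. (3).
    """
    coeffs = {
        1: (0.0,        9.0580e-8, -5.3562e-5, 1.1357e-2),  # 110 <= g
        2: (-4.2382e-8, 1.1402e-5, -9.1676e-4, 2.5849e-2),  # 65 <= g < 110
        3: (0.0,        1.7321e-4, -2.3080e-2, 7.7121e-1),  # 59 <= g < 65
        4: (0.0,       -2.9126e-6,  2.4514e-4, 8.0865e-3),  # g < 59
    }
    if g >= 110:
        i = 1
    elif g >= 65:
        i = 2
    elif g >= 59:
        i = 3
    else:
        i = 4
    q, r, s, t = coeffs[i]
    return q * g**3 + r * g**2 + s * g + t
```

Around the normoglycemic range the function stays close to the nominal value p_1 = 0.0035 quoted above (e.g. p1(80) ≈ 0.0038), which serves as a sanity check on the tabulated coefficients.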
3 Continuous Time Embedded Model Extracted from Colmegna Model
The transfer function of the model (1) can be put in the form

H(s) = C(sI - A)^{-1} B = \frac{k(s+z)}{(s+p_1)(s+p_2)(s+p_3)} = \frac{M}{1 + \partial H \, M},   (4)

upon the definition of the following model M and neglected output-input interconnection \partial H:

M(s) = \frac{k(s+z)}{s^3},   (5)

\partial H(s) = \frac{s}{s+z} \left( \frac{(p_1 + p_2 + p_3)\, s}{k} + \frac{p_1(p_2 + p_3) + p_2 p_3}{k} + \frac{p_1 p_2 p_3}{k\, s} \right).   (6)

The stability of the uncertainty estimator is guaranteed by the proper choice of the observer sensitivity function S_m so as to satisfy the following restriction [11]:

\| S_m M \partial H \|_\infty \le \eta < 1.   (7)
A state-space representation of the model M is presented in (8), where d(t) must be estimated in real time through a proper disturbance dynamics:

\dot{x}_c = \begin{bmatrix} 0 & 1 & 1/z \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} x_c + \begin{bmatrix} 0 \\ 0 \\ kz \end{bmatrix} u + d(t),
y = k \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} x_c.   (8)
4 The Embedded Model Control

4.1 The State Predictor
The model states in (8) must be estimated using the glucose measurements and the following model of d(t):

\dot{\hat{x}}_c = \begin{bmatrix} 0 & 1 & 1/z \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \hat{x}_c + \begin{bmatrix} 0 \\ 0 \\ kz \end{bmatrix} u + \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix} \hat{x}_d + G_c W_c,
\dot{\hat{x}}_d = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \hat{x}_d + G_d W_d.   (9)

Kalman assumption: assume that all the states are corrected by noise, G_c = I_3 and G_d = I_2; then the noises can be estimated using a static predictor,

W_c = L_c \hat{e}_m = \begin{bmatrix} l_1 \\ l_2 \\ l_3 \end{bmatrix} \hat{e}_m, \qquad W_d = L_d \hat{e}_m = \begin{bmatrix} l_4 \\ l_5 \end{bmatrix} \hat{e}_m,   (10)

where the characteristic polynomial of the state predictor is given by

s^5 + l_1 s^4 + \left( \frac{l_3}{z} + l_2 \right) s^3 + \left( \frac{l_4}{z} + l_3 \right) s^2 + \left( \frac{l_5}{z} + l_4 \right) s + l_5.   (11)
The gains of the observer must satisfy (7). Therefore, the EM control design fixes all the predictor eigenvalues at the same frequency, as shown in Table 2; then

V_m = \frac{s^4 + 0.4 s^3 + 0.08 s^2 + 0.008 s + 0.00032}{s^5 + s^4 + 0.4 s^3 + 0.08 s^2 + 0.008 s + 0.00032}, \qquad S_m = 1 - V_m.   (12)

4.2 Disturbance Rejection Matrices and Feedback Control
The disturbance rejection matrices for the EM control are computed by solving the Francis equation [12]:

Q = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \qquad M = -\begin{bmatrix} \frac{1}{kz} & 0 \end{bmatrix}.   (13)

The feedback gains K = \begin{bmatrix} k_1 & k_2 & k_3 \end{bmatrix} are computed from the desired polynomial

s^3 + kz\, k_3\, s^2 + k(k_1 + k_2 z)\, s + kz\, k_1.   (14)
The closed-loop eigenvalues are tuned to guarantee stability in the presence of disturbances and model variability (see Table 2), achieving η ≈ 0.04 in (7), with the following gain values:
Table 2. EMC eigenvalue values.

# | Predictor eigenvalues [rad/min] | Closed-loop eigenvalues [rad/min]
1 | -0.2 | -0.04
2 | -0.2 | -0.04
3 | -0.2 | -0.016
4 | -0.2 | Not applicable
5 | -0.2 | Not applicable
K = \begin{bmatrix} -10 & -1075 & -38097 \end{bmatrix},
\begin{bmatrix} L_c^T & L_d^T \end{bmatrix} = \begin{bmatrix} 1 & 0.1275 & 0.0409 & 0.0059 & 0.0003 \end{bmatrix}.   (15)
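The numbers in (15) can be cross-checked against the design targets of Table 2. The NumPy sketch below is not part of the paper: it builds one plausible reading of the five-state predictor error dynamics implied by (9)–(11), checks that the gains L_c, L_d place its eigenvalues near −0.2 rad/min, and recomputes the state-feedback gains from the closed-loop poles via (14). The recomputed K lands within a few percent of the published values; the residual differences are presumably due to rounding in the reported figures.

```python
import numpy as np

k, z = -1.788e-5, 0.1501

# Extended predictor dynamics (states [x_c; x_d]) implied by Eq. (9),
# assuming the static correction L = [Lc; Ld] acts on the normalized
# model error, i.e. it is injected through the first state; with that
# reading, det(sI - A + L e1^T) reproduces the polynomial (11).
A = np.array([[0, 1, 1/z, 0, 0],
              [0, 0, 1,   0, 0],
              [0, 0, 0,   1, 0],
              [0, 0, 0,   0, 1],
              [0, 0, 0,   0, 0]], dtype=float)
L = np.array([1.0, 0.1275, 0.0409, 0.0059, 0.0003])   # gains of Eq. (15)
e1 = np.eye(5)[0]
pred_eigs = np.linalg.eigvals(A - np.outer(L, e1))
# The predictor eigenvalues cluster around -0.2, as stated in Table 2.

# Feedback gains from the desired polynomial (14) using the closed-loop
# poles of Table 2: {-0.04, -0.04, -0.016}.
a2, a1, a0 = np.poly([-0.04, -0.04, -0.016])[1:]
k3 = a2 / (k * z)
k1 = a0 / (k * z)
k2 = (a1 / k - k1) / z
K = np.array([k1, k2, k3])   # compare with (-10, -1075, -38097) in (15)
```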
The control law given by the EM controller is

u_{emc}(t) = -K \hat{x}_c(t) - K Q \hat{x}_d(t) + M \hat{x}_d(t).   (16)

5 Safety Scheme

5.1 Insulin on Board Estimation
The insulin on board (IOB) can be defined as the amount of administered insulin that is still active in the body. In an attempt to reduce hypoglycemia events, some insulin pumps estimate the IOB to correct the boluses and prevent excessive insulin stacking, particularly when boluses are given close together. Each patient exhibits their own insulin activity dynamics, usually characterized by the duration of insulin action (DIA), a parameter that clinicians are used to tuning when setting up insulin pumps. Here, the insulin activity is represented by a two-compartment dynamical model:

\frac{dC_1}{dt}(t) = u(t) - K_{DIA} C_1(t),
\frac{dC_2}{dt}(t) = K_{DIA} \left( C_1(t) - C_2(t) \right),   (17)
IOB(t) = C_1(t) + C_2(t),

where C_1 and C_2 are the two compartments and u(t) is the insulin dose. The constant K_{DIA} can be tuned for each patient so that model (17) replicates the corresponding DIA. Particular K_{DIA} values for each virtual patient of the FDA-approved simulator can be found in [13].
5.2 Safety Layer
The safety layer is used to prevent hypoglycemia events; the infused insulin u(t) under this approach is given by

u(t) = u_{emc}(t)\, \omega_f(t),   (18)

where u_{emc} was deduced in (16), and \omega_f is the filtered version of \omega(\sigma):

\omega_f(s) = \frac{\omega(\sigma)}{2s + 1},   (19)

\omega(\sigma) = \begin{cases} 1 & \sigma(t) > 0 \\ 0 & \text{otherwise}, \end{cases}   (20)

\sigma = \overline{IOB} - IOB,   (21)

\overline{IOB} = IOB_b + \frac{CHO}{I2C},   (22)

where IOB_b is the IOB value corresponding to the basal insulin, I2C is the subject-specific insulin-to-carbohydrate ratio, and CHO is a tuning parameter related to the ingested carbohydrates. The CHO parameter was set to 45 g for the corresponding IOB thresholds. Note that insulin dose values are always non-negative; therefore, negative values of u(t) are set to zero.
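The IOB model (17) and the switching logic (18)–(22) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the first-order filter 1/(2s + 1) of (19) is omitted, and a simple forward-Euler integration is used.

```python
import numpy as np

def simulate_iob(u, k_dia, dt=1.0):
    """Integrate the two-compartment IOB model of Eq. (17) (forward Euler).

    u     : insulin dose profile (one value per time step)
    k_dia : patient-specific constant tuned to reproduce the DIA
    dt    : step size in minutes
    """
    c1 = c2 = 0.0
    iob = np.empty(len(u))
    for n, un in enumerate(u):
        dc1 = un - k_dia * c1            # dC1/dt = u - K_DIA * C1
        dc2 = k_dia * (c1 - c2)          # dC2/dt = K_DIA * (C1 - C2)
        c1, c2 = c1 + dt * dc1, c2 + dt * dc2
        iob[n] = c1 + c2                 # IOB = C1 + C2
    return iob

def safe_dose(u_emc, iob, iob_b, cho, i2c):
    """Safety layer of Eqs. (18)-(22), without the low-pass filter of (19).

    The EMC command is passed through while IOB is below the threshold
    IOB_bar = IOB_b + CHO/I2C and cut to zero otherwise; negative doses
    are clipped, since insulin can only be infused.
    """
    iob_bar = iob_b + cho / i2c          # Eq. (22)
    sigma = iob_bar - iob                # Eq. (21)
    omega = 1.0 if sigma > 0 else 0.0    # Eq. (20)
    return max(u_emc, 0.0) * omega       # Eq. (18) plus clipping
```

With a constant infusion u, the model settles at IOB = 2u/K_{DIA}, which gives a quick sanity check when tuning K_{DIA} against a clinician-set DIA.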
6 Results

6.1 Test Scenario

The test scenario corresponds to five and a half days for each virtual patient, and includes three meals totaling 200 g of carbohydrates per day, distributed as follows: 50 g at 7:00, 80 g at 12:00, and 70 g at 19:00. The simulation considers a set of 11 adult virtual patients (including the average one) from the Dalla Man et al. model [14], using the commercial version of the UVa/Padova T1DMS [15]. The performance and robustness of the EM control approach for glucose regulation were tested with and without the safety scheme. Conventional treatment is applied during the first 2680 min of simulation, so each virtual patient receives the corresponding insulin dose of the open-loop treatment according to the information about the meal intake and basal requirements of each patient. The control loop is then plugged in after this period of time, i.e., at 20:20 of the second day. Here, the same K_{DIA} value was used to estimate the IOB in all virtual patients.
6.2 Performance Obtained
Figure 1(a) shows the blood glucose response of all virtual patients for the EM control with the safety scheme, while Fig. 1(b) shows the blood glucose response without the safety layer. Both control strategies provide stability in the test. Notice that the EM control with the safety layer provides proper behavior, similar to the conventional treatment, as shown
Fig. 1. Results for glucose regulation. (a) embedded model control with a safety scheme. (b) embedded model control.
in Fig. 1(a). Conversely, the performance given by the EM controller without the safety layer is not suitable, as seen in Fig. 1(b). Table 3 shows the performance results for the test scenario considered. Hypos is the number of hypoglycemia events below 70 mg/dL, Excursion is the postprandial glucose excursion in mg/dL, and Mean is the mean glucose in mg/dL.

Table 3. Performance results. Mean and Excursion are in mg/dL.

        | EM control with safety layer | EM control
Patient | Hypos | Excursion | Mean     | Hypos | Excursion | Mean
1       | 0     | 119.6     | 128.3    | 13    | 175.9     | 90.3
2       | 0     | 97.9      | 123.2    | 16    | 172.6     | 87.4
3       | 0     | 93.0      | 135.9    | 6     | 126.3     | 113.7
4       | 0     | 121.5     | 133.3    | 10    | 171.1     | 94.2
5       | 0     | 106.7     | 130.7    | 10    | 163.8     | 98.8
6       | 0     | 159.5     | 133.5    | 10    | 179.0     | 105.6
7       | 0     | 144.8     | 128.4    | 19    | 214.2     | 75.5
8       | 0     | 130.2     | 146.4    | 6     | 170.3     | 118.4
9       | 0     | 141.3     | 127.5    | 10    | 188.6     | 83.1
10      | 0     | 110.4     | 119.1    | 6     | 147.1     | 94.4
11      | 0     | 128.8     | 126.9    | 10    | 202.1     | 84.3
Figure 2(a) shows the insulin rate infused by the EM control with the safety layer, while Fig. 2(b) shows the corresponding control action of the EM control scheme. The magnitude of the infused insulin rate in Fig. 2(a) is smaller than the corresponding one in Fig. 2(b), which evidences the action provided by the safety layer to reduce hypoglycemia events. Figure 3 shows the Control-Variability Grid Analysis (CVGA) plot, which summarizes the performance obtained by a control system over a patient (zone A is the best and zone E the worst). For the EM control with the safety layer, ten virtual patients are in the upper B zone and just one in the lower B zone. For the EM control scheme without the safety layer, the performance decreases to zones C and D.
Fig. 2. Infused insulin rate. (a) embedded model control with a safety scheme. (b) embedded model control.
Fig. 3. CVGA result. (a) embedded model control with a safety scheme. (b) embedded model control.
7 Conclusion
An embedded model control for glucose regulation has been designed and evaluated over an in-silico cohort of virtual patients with T1D. Results show increased performance when the safety scheme is added to the main controller, preventing every hypoglycemic event in each patient of the virtual cohort. In addition, a better postprandial excursion was achieved in all cases when the safety scheme was included. A more comprehensive control test scenario is required to verify these results.
Acknowledgments. The authors are supported by Antonio Nariño University project 2018222 and Colciencias project #110180763081.
References

1. Cho, N., Shaw, J.E., Karuranga, S., Huang, Y., da Rocha Fernandes, J.D., Ohlrogge, A.W., Malanda, B.: IDF diabetes atlas: global estimates of diabetes prevalence for 2017 and projections for 2045. Diabetes Res. Clin. Pract. 138, 271–281 (2018)
2. Garg, S., Shah, V., Akturk, H., Beatson, C., Snell-Bergeon, J.: Role of mobile technology to improve diabetes care in adults with type 1 diabetes: the REMOTE-T1D study iBGStar® in type 1 diabetes management. Diabetes Ther. 8(4), 811–819 (2017)
3. Gómez, A.M., Henao, D., Imitola, A., Muñoz, O., Sepúlveda, M., Kattah, L., Guerrero, J., Morros, E., Llano, J., Jaramillo, M., León-Vargas, F.: Efficacy and safety of sensor-augmented pump therapy (SAPT) with predictive low-glucose management in patients diagnosed with type 1 diabetes mellitus previous. Endocrinología, Diabetes y Nutrición 65(8), 451–457 (2018)
4. Hunt, C.W.: Technology and diabetes self-management: an integrative review. World J. Diabetes 6(2), 225–233 (2015)
5. Spearson, C., Mistry, A.: Several aspects of internet and web-based technology in diabetes management. Diabetes Spectr. 29(4), 245–248 (2016)
6. Bekiari, E., Kitsios, K., Thabit, H., Tauschmann, M., Athanasiadou, E., Karagiannis, T., Haidich, A., Hovorka, R., Tsapas, A.: Artificial pancreas treatment for outpatients with type 1 diabetes: systematic review and meta-analysis. BMJ 361 (2018)
7. Choi, S., Hong, E., Noh, Y.: Open artificial pancreas system reduced hypoglycemia and improved glycemic control in patients with type 1 diabetes. Diabetes 67(Supplement 1), 964 (2018)
8. Dadlani, V., Pinsker, J., Dassau, E., Kudva, Y.: Advances in closed-loop insulin delivery systems in patients with type 1 diabetes. Curr. Diabetes Rep. 18(10), 88 (2018)
9. Colmegna, P., Sánchez-Peña, R.S., Gondhalekar, R.: Control-oriented linear parameter-varying model for glucose control in type 1 diabetes. In: 2016 IEEE Conference on Control Applications (CCA), pp. 410–415 (September 2016)
10. Colmegna, P., Sánchez-Peña, R.S., Gondhalekar, R.: Linear parameter-varying model to design control laws for an artificial pancreas. Biomed. Signal Process. Control 40, 204–213 (2018)
11. Canuto, E.: On dynamic uncertainty estimators. In: 2015 American Control Conference (ACC), pp. 3968–3973 (July 2015)
12. Canuto, E.: Embedded model control: outline of the theory. ISA Trans. 46(3), 363–377 (2007)
13. León-Vargas, F., Garelli, F., De Battista, H., Vehí, J.: Postprandial response improvement via safety layer in closed-loop blood glucose controllers. Biomed. Signal Process. Control 16, 80–87 (2015)
14. Dalla Man, C., Rizza, R.A., Cobelli, C.: Meal simulation model of the glucose-insulin system. IEEE Trans. Biomed. Eng. 54(10), 1740–1749 (2007)
15. Dalla Man, C., Micheletto, F., Dayu, L., Breton, M., Kovatchev, B., Cobelli, C.: The UVA/PADOVA type 1 diabetes simulator. J. Diabetes Sci. Technol. 8(1), 26–34 (2014)
Bidirectional-Pass Algorithm for Interictal Event Detection

David García-Retuerta1, Angel Canal-Alonso1, Roberto Casado-Vara1(B), Angel Martin-del Rey3,4, Gabriella Panuccio2, and Juan M. Corchado1,5

1 BISITE Research Group, University of Salamanca, 37008 Salamanca, Spain
{dvid,acanal,rober,corchado}@usal.es
2 Istituto Italiano di Tecnologia, Via Morego, 30, 16163 Genova, Italy
[email protected]
3 Department of Applied Mathematics, University of Salamanca, Calle del Parque 2, 37008 Salamanca, Spain
[email protected]
4 Institute of Fundamental Physics and Mathematics, Department of Applied Mathematics, University of Salamanca, Calle del Parque 2, 37008 Salamanca, Spain
5 Air Institute, IoT Digital Innovation Hub (Spain), Calle Segunda 4, 37188 Salamanca, Spain
Abstract. Epilepsy is one of the most invalidating neurological conditions, affecting 1% of the global population. The main diagnostic tool for epilepsy is electroencephalography (EEG), used to detect local field potentials and discern pathological brain patterns, i.e., ictal activity (the EEG correlate of a clinical seizure) and interictal activity (the pathological brain pattern occurring between seizures). Interictal activity may provide insights into seizure generation mechanisms and may contain information relevant to the identification of the seizure onset zone. Further, interictal activity may be relevant for seizure prediction algorithms. This paper presents an algorithm for accurate detection of interictal events using a combination of mathematical methods.

Keywords: Epilepsy · Interictal events · Algorithm design

1 Introduction
Epilepsy is a chronic and progressive life-threatening neurological condition resulting from uncontrolled brain activity; it manifests itself with loss of control of body functions, up to loss of consciousness (seizure) [1]. Epilepsy is among the most common brain disorders, affecting 1% of the global population and carrying one of the highest global burdens of disease [2]. Along with seizure recurrence, epilepsy leads to progressive cognitive impairment and psychiatric disturbances, making up a neuropsychiatric syndrome that significantly affects the quality of life of stricken patients.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
G. Panuccio et al. (Eds.): PACBB 2020, AISC 1240, pp. 197–204, 2021. https://doi.org/10.1007/978-3-030-54568-0_20
The most common epileptic syndrome is temporal lobe epilepsy (TLE), which accounts for 40% of the epileptic population and is also the least responsive to anti-epileptic drugs. The hippocampus, the brain area responsible for learning and memory, is a key player in TLE manifestation; hippocampal sclerosis is a frequent finding in TLE patients [1] and contributes to the cognitive and psychiatric comorbidities of TLE. One third of the epileptic population does not respond to currently available medications, for which neurosurgery might be an option. However, surgical resection of the involved brain area may not guarantee a seizure-free life; in fact, in diagnosing epilepsy, one major difficulty stems from the identification of the epileptic focus, i.e., the brain region responsible for the onset of seizures, which may be a single brain area or multiple brain areas in the case of multifocal epilepsy [3]. Overall, these issues call for the urgent need to improve the diagnosis and management of epilepsy. Among the core diagnostic tools for epilepsy is electroencephalography (EEG) [4], i.e., the recording of brain electrical activity. Although the EEG instrumentation typically used in the clinical setting lacks enough resolution to detect individual neuron firing, it is able to detect the simultaneous activity of neuronal populations, so-called local field potentials (LFP), which are voltage oscillations generated by the brain tissue. Along with the physiological brain waves, EEG in epileptic patients can reveal the pathological brain patterns typical of epilepsy, i.e., ictal activity (the EEG correlate of clinical seizures) and interictal activity (the pathological brain waves recurring between ictal events).
Interictal activity results from the state of background hyperexcitability of the epileptic brain and has recently emerged as particularly relevant information: the identified correlation between different types of interictal discharges and distinct seizure onset types promises to aid in better addressing the management of epilepsy [5]. Studying brain electrical signals with a physical or mathematical approach would grant a better understanding of their nature and would allow the electrographic features of the epileptic brain to be characterised and classified more accurately, so as to establish more personalised treatments. However, accurate detection and classification of EEG events is still a major challenge in the scientific community. In this perspective, the implementation of machine learning (ML) algorithms would greatly benefit EEG signal analysis. ML provides computers with the ability to learn the best approach to solve a problem on their own. The great advantage of ML techniques is that programs do not need to be hard-coded and a looser implementation is available; therefore, a multi-purpose system can be applied to different scenarios. Nonetheless, EEG recordings from epileptic patients may not always be accessible or may not include signals from the desired brain regions to aid in ML algorithm development and validation. On the other hand, human EEG studies make it difficult to obtain the large experimental groups or data sets needed to work with ML. In this context, animal models have brought fundamental insights into the mechanisms of seizure generation and still represent an invaluable asset to further
our understanding of epilepsy. Specifically, rodent brain slices comprising the hippocampus and the parahippocampal cortices represent an accessible, high-yield, simplified model of the brain circuits involved in TLE. Acute treatment of these brain slices with convulsant drugs induces the generation of LFP that resemble the electrographic features of the EEG recorded from epileptic patients. Among in vitro electrophysiology techniques, microelectrode array (MEA) recordings enable the observation of LFP from multiple sites within the brain slice, making it possible to collect a greater amount of information at a higher resolution than conventional extracellular electrophysiology. As brain slice electrophysiology can generate great amounts of data from which information must be extracted, data mining, combining ML and mathematical methods into hybrid algorithms [6], is an efficient strategy to extract insights from the raw data. In this paper, we present a new algorithm combining ML and mathematical techniques with the objective of measuring LFP recorded using MEA electrophysiology and automatically labelling interictal events in several regions of rodent hippocampus-cortex slices in which epileptiform discharges are induced by acute convulsant treatment.
2 Data Sets and Methodology
The developed proposal has three phases:

1. Event modelling. An example of a single event wave is modelled and stored.
2. Forward-pass. All the events which directly correspond with the model are detected and stored. As a result, several time windows are created in the received signal.
3. Backward-pass. The created time windows are expanded back to the beginning of the event. There is no clear appearance of the model in this area, as it is overlapped with itself in a chaotic manner.

An example of the labels produced using this method can be found in Fig. 1.
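A minimal sketch of the forward/backward two-pass idea is given below (Python/NumPy). This is an illustrative reconstruction, not the authors' implementation: the forward pass here matches the stored event model by normalized cross-correlation against a fixed threshold, and the backward pass extends each window's start while the signal still deviates from baseline; both criteria are assumptions, since the paper does not fix them in this section.

```python
import numpy as np

def forward_pass(signal, template, threshold=0.9):
    """Forward pass: open a detection window wherever the normalized
    cross-correlation with the stored event template exceeds the
    threshold (assumed matching criterion)."""
    m = len(template)
    t = (template - template.mean()) / (template.std() + 1e-12)
    windows = []
    for i in range(len(signal) - m + 1):
        w = signal[i:i + m]
        wn = (w - w.mean()) / (w.std() + 1e-12)
        if np.dot(wn, t) / m > threshold:
            windows.append((i, i + m))
    return merge_windows(windows)

def merge_windows(windows):
    """Fuse overlapping detection windows so each event is reported once."""
    merged = []
    for s, e in sorted(windows):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def backward_pass(signal, windows, n_sd=3.0):
    """Backward pass: walk each window's start back towards the event
    onset while the signal still exceeds n_sd standard deviations,
    covering the chaotic self-overlapped region before the clean match
    (assumed onset criterion)."""
    level = n_sd * signal.std()
    return [(extend_back(signal, s, level), e) for s, e in windows]

def extend_back(signal, s, level):
    while s > 0 and abs(signal[s - 1]) > level:
        s -= 1
    return s
```

On a synthetic trace with a template-shaped event preceded by a high-amplitude irregular onset, the forward pass opens a window on the clean match and the backward pass pulls its start back to the onset of the deviation.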
2.1 Data Acquisition
Male CD1 mice were used to prepare 400 µm thick brain slices. Brain slices were transferred to a holding chamber with room-temperature ACSF and recovered for 60 min. The protocol developed in [7,8] was used.

Brain Slice Preparation and Maintenance. Brain slices, 400 µm thick, were prepared from male CD1 mice 4–8 weeks old. Animals were decapitated under deep isoflurane anesthesia, their brain was removed and placed into ice-cold (2 °C) sucrose-based artificial cerebrospinal fluid (sucrose-ACSF) composed of (mM): Sucrose 208, KCl 2, KH2PO4 1.25, MgCl2 5, MgSO4, CaCl2 0.5, D-glucose 10, NaHCO3 26, L-Ascorbic Acid 1, Pyruvic Acid 3. The brain was let chill for ∼2 min before slicing in ice-cold sucrose-ACSF using a vibratome (Leica VT1000S, Leica, Germany). Brain slices were immediately transferred to a submerged chamber containing room-temperature holding ACSF composed of (mM): NaCl 115, KCl 2, KH2PO4 1.25, MgSO4 1.3, CaCl2 2, D-glucose 25, NaHCO3 26, L-Ascorbic Acid 1. After at least 60 min of recovery, individual slices were pre-warmed at ∼32 °C for 20–30 min in a submerged chamber containing holding ACSF before being incubated in warm ACSF containing the K+ channel blocker 4-aminopyridine (4AP, 250 µM), in which the MgSO4 concentration was lowered to 1 mM (4AP-ACSF). Brain slice treatment with 4AP is known to enhance both excitatory and inhibitory neurotransmission and to induce the acute generation of epileptiform discharges. All brain slices were incubated in 4AP-ACSF for at least 1 h before beginning any recording session. All solutions were constantly equilibrated at pH ∼7.35 with a 95% O2/5% CO2 gas mixture (carbogen) and had an osmolality of 300–305 mOsm/kg. Chemicals were acquired from Sigma-Aldrich. All procedures were approved by the Institutional Animal Welfare Body and by the Italian Ministry of Health (authorizations 860/2015-PR and 176AA.NTN9), in accordance with the National Legislation (D.Lgs. 26/2014) and the European Directive 2010/63/EU. All efforts were made to minimize the number of animals used and their suffering.

Micro-Electrode Array Recording. Extracellular field potentials were acquired using the MC Rack software through a 6 × 10 planar MEA (Ti-iR electrodes, diameter 30 µm, inter-electrode distance 500 µm, impedance
200
D. Garc´ıa-Retuerta et al.
Fig. 1. Example of the forward-pass (yellow) and of the backward-pass (blue).
let chill for ∼2 min before slicing in ice-cold sucrose-ACSF using a vibratome (Leica VT1000S, Leica, Germany). Brain slices were immediately transferred to a submerged chamber containing room-temperature holding ACSF composed of (mM): NaCl 115, KCl 2, KH2PO4, 1.25, MgSO4 1.3, CaCl2 2, D-glucose 25, NaHCO3 26, L-Ascorbic Acid 1. After at least 60 min recovery, individual slices were pre-warmed at ∼32 ◦ C for 20–30 min in a submerged chamber containing holding ACSF before being incubated in warm ACSF containing the K+ channel blocker 4-aminopyridine (4AP, 250 µM), in which MgSO4 concentration was lowered to 1 mM (4AP-ACSF,). Brain slice treatment with 4AP is known to enhance both excitatory and inhibitory neurotransmission and induce the acute generation of epileptiform discharges. All brain slices were incubated in 4APACSF for at least 1 h before beginning any recording session. All solutions were constantly equilibrated at pH = ∼7.35 with 95% O2/5% CO2 gas mixture (carbogen) and had an osmolality of 300–305 mOsm/Kg. Chemicals were acquired from Sigma-Aldrich. All procedures been approved by the Institutional Animal Welfare Body and by the Italian Ministry of Health (authorizations 860/2015-PR and 176AA.NTN9), in accordance with the National Legislation (D.Lgs. 26/2014) and the European Directive 2010/63/EU. All efforts were made to minimize the number of animals used and their suffering. Micro-Electrode Array Recording. Extracellular field potentials were acquired using the Mc Rack software through a 6 × 10 planar MEA (Ti-iR electrodes, diameter 30 µm, inter-electrode distance 500 µm, impedance