Reverse Engineering of Regulatory Networks: Methods and Protocols (Methods in Molecular Biology, 2719) 1071634607, 9781071634608

This volume details the development of updated dry lab and wet lab based methods for the reconstruction of Gene regulato


English Pages 337 [331] Year 2023


Table of contents :
Preface
Contents
Contributors
Chapter 1: Molecular Modeling Techniques and In-Silico Drug Discovery
1 Introduction
1.1 Building the Structures of Molecules
1.2 Binding Interactions Between Molecules and Virtual Screening of Ligand Libraries
2 Methods
2.1 Building the Structures of Molecules
2.1.1 Analysis of the Primary Structure of the Target
2.1.2 Identification of the Suitable Template for the Target
2.1.3 Building the Model of the Target
2.1.4 Assessment of the Quality of the Built Model
MODELLER
2.2 Virtual Screening of Ligand Libraries
3 Notes
4 Conclusion
References
Chapter 2: Systems Biology Approach to Analyze Microarray Datasets for Identification of Disease-Causing Genes: Case Study of ...
1 Introduction
2 Materials
2.1 Databases
2.2 Software
3 Methods
3.1 Microarray Dataset
3.2 Data Processing and Identification of DEGs
3.3 Functional Annotation and KEGG Pathway for DEGs
3.4 Protein-Protein Interactions Analysis
3.5 Identification of Hub Genes
3.6 Construction of the TF-miRNA-target Gene Regulatory Network
3.7 Identification of Protein-Drug Interactions
3.8 Hub Gene Survival and Expression Level Analysis
4 Interpretation of Results
4.1 Gene Expression Analysis
4.2 Pathway and Functional Association Analysis
4.3 Protein-Protein Interactions (PPIs) Analysis
4.4 Hub Proteins Were Identified from Protein-Protein Interaction Analysis
4.5 TF-miRNA Target DEG Regulatory Network Analysis
4.6 Identification of Protein-Drug Interactions
4.7 The Expression Level and Hub Genes Kaplan-Meier Plotter
5 Notes
6 Conclusion and Outlook
References
Chapter 3: Fluorescence Spectroscopy: A Useful Method to Explore the Interactions of Small Molecule Ligands with DNA Structures
1 Introduction
2 Materials
2.1 Principle of Fluorescence Spectroscopy
2.2 Instrumentation of Fluorescence Spectroscopy
2.3 Materials and Sample Preparation
3 Methods
3.1 Steady-state Fluorescence Spectroscopy and Quenching
3.2 Time-Resolved Fluorescence Decay Measurements
3.3 Time-Resolved Fluorescence Anisotropy Decay Measurements
3.4 Fluorescence Correlation Spectroscopy
3.5 FRET and Single-Molecule Fluorescence Spectroscopy
3.6 Information Extraction of Fluorescence Spectra
4 Notes
References
Chapter 4: Inference of Dynamic Growth Regulatory Network in Cancer Using High-Throughput Transcriptomic Data
1 Introduction
2 Materials
3 Methods
3.1 RNA-Seq Data Acquisition
3.2 Quality Control of Raw Reads
3.3 Reads Pre-processing
3.4 Read Alignment Against the Reference Genome
3.5 Read Assignment
3.6 Differential Gene Expression
3.7 Reconstruction of Growth Regulatory Network (GRN)
3.8 Network Visualization and Inference
3.8.1 Integration of Publicly Available Interaction Data
3.8.2 Import Network to Cytoscape
3.8.3 Topological Inference
3.9 Functional Annotation of Deregulated Genes
3.10 Final Remarks
4 Notes
References
Chapter 5: Implementation of Exome Sequencing to Identify Rare Genetic Diseases
1 Introduction
2 Materials
3 Methods
3.1 DNA Extraction
3.1.1 Phenol-Chloroform Extraction of Genomic DNA from Peripheral White Blood Cells
3.1.2 gDNA Extraction with Kits from Commercial Suppliers
3.2 Exome Library Preparation
3.2.1 DNA Fragmentation
3.2.2 Library Construction and Clean-up
3.2.3 End Adenylation (A-tailing)
3.2.4 Adapter Ligation
3.2.5 Target Enrichment
3.3 Variant Annotation
3.4 Variant Calling and Annotation
3.5 Variant Prioritization
3.6 In silico Analysis
3.7 Applications
4 Notes
5 Conclusion
References
Chapter 6: Emerging Trends in Big Data Analysis in Computational Biology and Bioinformatics in Health Informatics: A Case Stud...
1 Introduction
1.1 Big Data Resource Challenges and Promises
1.1.1 Genomic Database Resources
1.1.2 Transcriptomics Big Database Resource
1.1.3 Proteomics Database Resources
1.1.4 Metabolomics Database Resources
1.1.5 Biological Pathway Database Resource
1.2 A Case Study of Epilepsy and Seizures
2 Materials
2.1 DisGeNET Database
2.2 GeneMANIA Prediction Server
2.3 NetworkAnalyst
2.4 MCODE Plugin
2.5 Cytoscape Software
2.6 FunRich Tool
3 Methods
3.1 Generation of Gene-disease-variant-Associated Network
3.2 Genetic Interaction Network
3.3 Cluster Analysis of the Regulatory Network
3.4 Gene-mRNA-TFs Regulatory Network
3.5 Gene Ontology Analysis
4 Notes
References
Chapter 7: New Insights into Clinical Management for Sickle Cell Disease: Uncovering the Significant Pathways Affected by the ...
1 Introduction
2 Materials
2.1 DisGeNET Database
2.2 GEO2R Tool
2.3 Reactome FIViz Plugins
2.4 Cytoscape Software
2.5 BiNGO
3 Methods
3.1 Gene Disease Association Network
3.2 Pathway Enrichment Analysis
3.3 Functional Interaction (FI) Network
3.4 Network Enrichment Analysis
4 Notes
References
Chapter 8: A Review of Computational Approach for S-system-based Modeling of Gene Regulatory Network
1 Introduction
2 History of S-system
3 S-system Based Modeling of GRN
3.1 Preliminary of S-system-based GRN
3.2 Few Major Issues Regarding Optimization for S-system Parameters
3.2.1 Major Issue 1: Computational Complexity
3.2.2 Major Issue 2: Accuracy in the Prediction of Dynamics of Genes
3.2.3 Major Issue 3: Over-fitting Problem
3.3 Proposed Solutions Regarding the Above Issues
3.3.1 Decoupling to Reduce Computational Complexity
3.3.2 Selection of Suitable Optimization Technique to Increase Accuracy
3.3.3 Regularization to Deal with Over-fitting Problem
3.4 How to Validate a New Algorithm for S-system-based GRN Reconstruction?
4 Literature Survey
5 Conclusion
References
Chapter 9: Big Data in Bioinformatics and Computational Biology: Basic Insights
1 Introduction
1.1 Importance of Big Data in Biology
1.2 Big Data Handling: Collection, Storage, and Analysis
2 Tools/Softwares
3 Methods
3.1 Data Collection
3.2 Data Storage
3.2.1 Rules to Store Data
3.2.2 Data Storage Systems
3.2.3 Important Features of a Big Data Database
3.3 Data Analysis
3.3.1 Descriptive Analysis
Analysis of Sequence Data
Gene Expression Analysis
3.3.2 Predictive Analysis
Supervised Learning
Unsupervised Learning
4 Big Data Solutions: The Data Architectures
4.1 MapReduce Architecture
4.2 Fault Tolerance Architecture
4.3 Stream Graph Architecture
5 Conclusion
References
Chapter 10: Identification of Culprit Genes for Different Diseases by Analyzing Microarray Data
1 Introduction
2 Materials
2.1 Dataset
2.2 Software
3 Methods
3.1 Package Installation
3.2 Importing Raw Data
3.3 Quality Control of Raw Data
3.4 Normalization and QC (RMA Method)
3.4.1 PCA Plot for QC
3.5 Differential Expression Analysis
3.6 Heatmap
3.7 Annotation
3.8 Gene Ontology (GO) and Pathway Enrichment (KEGG) Analysis
4 Notes
5 Conclusion
References
Chapter 11: Big Data Analysis in Computational Biology and Bioinformatics
1 Introduction
1.1 Data Acquisition and Storage
1.2 Data Processing and Analysis
1.3 Tools for Handling Big Data Analysis in Computational Biology and Bioinformatics
1.4 Linux Shell Scripts
1.5 Hadoop
1.5.1 Hadoop Modules
1.5.2 Hadoop's Working Mechanism
1.6 R-Programming Language
1.6.1 Data Preprocessing
1.6.2 Data Analysis
1.6.3 Data Visualization
1.6.4 Data Storage
1.7 Python Programming Language
2 Review of Literature
3 Challenges and Opportunities
4 Conclusion
References
Chapter 12: Prediction and Analysis of Transcription Factor Binding Sites: Practical Examples and Case Studies Using R Program...
1 Introduction
2 Materials
3 Methods
3.1 Obtaining Upstream Sequences of a Gene from the Genome Sequence
3.2 Finding Enrichment of TFBS Motifs in a Single Sequence: A Case Study of the EOMES Gene
3.3 Exploratory Analysis of Sequence Report
3.4 Examination and Visualization of Significant TFBS Motifs in the EOMES Promoter
3.5 Validating Significant TFBS Motifs in the EOMES Promoter Against Chance
3.6 Understanding and Interpreting Results in Functional Contexts
3.7 Pointers on Exporting Results
3.8 Finding Motif Enrichment in a Group of Genes
3.9 Limitations
4 Notes
References
Chapter 13: Hubs and Bottlenecks in Protein-Protein Interaction Networks
1 Introduction
1.1 Methods for Identifying Protein-Protein Interactions
1.2 Protein-Protein Interaction Databases
1.3 Protein-Protein Interaction Networks
2 Centrality Measures
3 Characteristic Features of Hubs
3.1 Dichotomy Among Hubs
4 Characteristic Features of Bottlenecks
5 Further Categories Among Hubs and Bottlenecks
6 Summary
References
Chapter 14: Next-Generation Sequencing to Study the DNA Interaction
1 Introduction
2 Sequencing
3 Sanger Sequencing Versus NGS
4 Roche 454 Sequencing Technique
5 SOLiD ABI Platform
6 Illumina (Solexa) Sequencing Platform
7 Other New Sequencing Technologies
7.1 Polony-Based Sequencing Technology
7.1.1 Limitations
7.2 DNA Nanoball Sequencing
7.2.1 Advantages
7.2.2 Limitations
7.3 Nanopore Sequencing Technology
7.3.1 Oxford Nanopore Sequencing
7.3.2 Advantages
7.3.3 Limitations
8 Preparation of Library
9 DNA Sequencing by NGS
9.1 Whole Genome Sequencing (WGS)
9.2 Whole Exome Sequencing (WES)
9.3 Gene Panel
10 Applications of NGS
10.1 Epigenetics
10.2 Prenatal and Postnatal Diagnosis
10.3 To Detect Infectious Disease
10.4 Food and Nutrition
10.5 Cancer Research
10.6 Bioinformatics Analytics
11 Conclusion
References
Chapter 15: Deep Learning for Predicting Gene Regulatory Networks: A Step-by-Step Protocol in R
1 Introduction
2 Material
2.1 Computational and Software Requirements
2.2 Data Requirements
3 Methods
3.1 Loading R Libraries and Preparing Workspace
3.2 Data Preparation and Exploration
3.3 Defining Deep Learning Model Architecture
3.4 Defining Model Compilation Parameters
3.5 Training Deep Learning Model
3.6 Estimating the Accuracy of Deep Learning Model
3.7 Predicting Genome-wide Regulatory Interactions
3.8 Tuning Deep Learning for Improved Performance
4 Notes
References
Chapter 16: Computational Inference of Gene Regulatory Network Using Genome-wide ChIP-X Data
1 Introduction
2 Materials
3 Methods
4 Notes
References
Chapter 17: Reverse Engineering in Biotechnology: The Role of Genetic Engineering in Synthetic Biology
1 Introduction
2 Genetic Engineering
2.1 Materials
3 Applications of Synthetic Biology
4 Methodology: Generalized Protocol for Genetic Engineering in Synthetic Biology
5 Example: Development of Integrase-mediated Differentiation Circuits to Improve Evolutionary Stability in E. coli
5.1 Introduction
5.2 Motivation
5.3 Materials and Methodology
5.4 Computational Modeling
5.5 Model Implementation and Parameters
5.5.1 Cell Growth
5.5.2 Mutations
5.5.3 Terminal Differentiation
5.5.4 Production and Burden
5.5.5 Differentiation Rates
5.6 Computational Simulations of the Model
6 Other Techniques
6.1 PCR
6.2 Gel Electrophoresis
6.3 Restriction Digestion and Ligation
7 Conclusion
References
Index

Methods in Molecular Biology 2719

Sudip Mandal  Editor

Reverse Engineering of Regulatory Networks

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure that is supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Reverse Engineering of Regulatory Networks Edited by

Sudip Mandal Department of Electronics and Communication Engineering, Jalpaiguri Govt. Engineering College, Jalpaiguri, West Bengal, India


ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-3460-8 ISBN 978-1-0716-3461-5 (eBook) https://doi.org/10.1007/978-1-0716-3461-5 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A. Paper in this product is recyclable.

Preface

Inference of regulatory networks is the problem of identifying the interactions among the molecules, mostly proteins and RNAs, that act together to control the expression levels of the various genes in a specific genome. This problem is also known as the reverse engineering of regulatory networks. It is a fascinating task for researchers and scientists in systems biology, biotechnology, bioinformatics, and health informatics, because it not only offers detailed insight into the functioning of cells but can also furnish critical information on optimal intervention strategies for the treatment of human diseases. This has motivated researchers worldwide to develop efficient methodologies for network inference, or reverse engineering of regulatory networks, from gene expression data or lab-based experiments.

In this book, entitled Reverse Engineering of Regulatory Networks, the focus is on the development of improved dry lab- and wet lab-based strategies for the reconstruction of gene regulatory networks (GRNs). A variety of research areas is covered, including reviews as well as original research work. The chapters include topics such as the identification of culprit genes for diseases such as oral squamous cell carcinoma and sickle cell disease from microarray data, and in silico drug discovery techniques. The analysis of genome-wide ChIP-X data and high-throughput transcriptomic data for the inference of regulatory networks is also presented by the respective contributors. Some chapters deal with recent research areas such as exome sequencing and next-generation sequencing to observe DNA interactions. Readers can also learn the detailed process of using fluorescence spectroscopy to explore the interactions of small molecule ligands with DNA structures. Reviews of big data analysis in bioinformatics and computational biology and of S-system-based modeling of GRNs are also included.

Readers will also get an idea of how emerging techniques such as artificial intelligence and deep learning are being applied in this area to uncover regulatory networks more efficiently. Topics such as discovering transcription factor-binding sites, signaling pathways, and synthetic biology are well covered in this book. Moreover, readers will learn how to use web resources for large databases such as NCBI microarray data, software tools such as R and GNW, and several computational approaches in the field of genetic engineering.

Even though only my name appears on the cover of this book as Editor, a great many people have contributed to its production. I owe my gratitude to all the contributors, authors, and researchers who have made this book possible. I would like to thank John M. Walker, Series Editor of the MiMB book series, for giving me the opportunity to work as Editor of this book. I would also like to express my sincere thanks to the publishing team of Springer Nature, as it has been a great experience to work with such a prestigious publisher, and to my employer (WBHED) and my colleagues at Jalpaiguri Government Engineering College for all their support and encouragement. Last but not least, I am deeply appreciative of my beloved family, who have always stood beside me.


I hope that readers, researchers, and enthusiasts interested in molecular and computational biology, especially in the inference of regulatory networks, will benefit greatly from this book.

Jalpaiguri, India

Sudip Mandal

Contents

Preface
Contributors
1 Molecular Modeling Techniques and In-Silico Drug Discovery
    Angshuman Bagchi
2 Systems Biology Approach to Analyze Microarray Datasets for Identification of Disease-Causing Genes: Case Study of Oral Squamous Cell Carcinoma
    Jyotsna Choubey, Olaf Wolkenhauer, and Tanushree Chatterjee
3 Fluorescence Spectroscopy: A Useful Method to Explore the Interactions of Small Molecule Ligands with DNA Structures
    Sagar Bag and Sudipta Bhowmik
4 Inference of Dynamic Growth Regulatory Network in Cancer Using High-Throughput Transcriptomic Data
    Aparna Chaturvedi and Anup Som
5 Implementation of Exome Sequencing to Identify Rare Genetic Diseases
    Prajna Udupa and Debasish Kumar Ghosh
6 Emerging Trends in Big Data Analysis in Computational Biology and Bioinformatics in Health Informatics: A Case Study on Epilepsy and Seizures
    Usha Chouhan, Rakesh Kumar Sahu, Shaifali Bhatt, Sonu Kurmi, and Jyoti Kant Choudhari
7 New Insights into Clinical Management for Sickle Cell Disease: Uncovering the Significant Pathways Affected by the Involvement of Sickle Cell Disease
    Usha Chouhan, Trilok Janghel, Shaifali Bhatt, Sonu Kurmi, and Jyoti Kant Choudhari
8 A Review of Computational Approach for S-system-based Modeling of Gene Regulatory Network
    Sudip Mandal and Pijush Dutta
9 Big Data in Bioinformatics and Computational Biology: Basic Insights
    Aanchal Gupta, Shubham Kumar, and Ashwani Kumar
10 Identification of Culprit Genes for Different Diseases by Analyzing Microarray Data
    Ayushman Kumar Banerjee, Shrayana Ghosh, and Chittabrata Mal
11 Big Data Analysis in Computational Biology and Bioinformatics
    Prakash Kumar, Ranjit Kumar Paul, Himadri Shekhar Roy, Md. Yeasin, Ajit, and Amrit Kumar Paul


12 Prediction and Analysis of Transcription Factor Binding Sites: Practical Examples and Case Studies Using R Programming
    Vijaykumar Yogesh Muley
13 Hubs and Bottlenecks in Protein-Protein Interaction Networks
    Chandramohan Nithya, Manjari Kiran, and Hampapathalu Adimurthy Nagarajaram
14 Next-Generation Sequencing to Study the DNA Interaction
    Nachammai Kathiresan, Srinithi Ramachandran, and Langeswaran Kulanthaivel
15 Deep Learning for Predicting Gene Regulatory Networks: A Step-by-Step Protocol in R
    Vijaykumar Yogesh Muley
16 Computational Inference of Gene Regulatory Network Using Genome-wide ChIP-X Data
    Samayaditya Singh, Manjari Kiran, and Pramod R. Somvanshi
17 Reverse Engineering in Biotechnology: The Role of Genetic Engineering in Synthetic Biology
    Gopikrishnan Bijukumar and Pramod R. Somvanshi
Index


Contributors

AJIT • ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
SAGAR BAG • Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, Kolkata, West Bengal, India
ANGSHUMAN BAGCHI • Department of Biochemistry and Biophysics, University of Kalyani, Kalyani, West Bengal, India
AYUSHMAN KUMAR BANERJEE • Department of Bioinformatics, Maulana Abul Kalam Azad University of Technology, West Bengal, Haringhata, West Bengal, India
SHAIFALI BHATT • Department of Mathematics, Bioinformatics & Computer Applications, Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, India
SUDIPTA BHOWMIK • Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, Kolkata, West Bengal, India; Mahatma Gandhi Medical Advanced Research Institute (MGMARI), Sri Balaji Vidyapeeth (Deemed to be University), Pondicherry, India
GOPIKRISHNAN BIJUKUMAR • Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
TANUSHREE CHATTERJEE • Raipur Institute of Technology, Raipur, Chhattisgarh, India
APARNA CHATURVEDI • Centre of Bioinformatics, Institute of Interdisciplinary Studies, University of Allahabad, Prayagraj, India
JYOTSNA CHOUBEY • Raipur Institute of Technology, Raipur, Chhattisgarh, India
JYOTI KANT CHOUDHARI • Department of Mathematics, Bioinformatics & Computer Applications, Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, India
USHA CHOUHAN • Department of Mathematics, Bioinformatics & Computer Applications, Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, India
PIJUSH DUTTA • Department of Electronics and Communication Engineering, Greater Kolkata College of Engineering and Management, Baruipur, India
DEBASISH KUMAR GHOSH • Enteric Disease Division, Department of Microbiology, Kasturba Medical College, Manipal Academy of Higher Education, Manipal, Karnataka, India
SHRAYANA GHOSH • Amity Institute of Biotechnology, Amity University Kolkata, Kolkata, West Bengal, India
AANCHAL GUPTA • University Institute of Biotechnology, Chandigarh University, Mohali, Punjab, India
TRILOK JANGHEL • Department of Biotechnology, Government V.Y.T. Post Graduate Autonomous College, Durg, Chhattisgarh, India
NACHAMMAI KATHIRESAN • Department of Biotechnology, Alagappa University, Karaikudi, Tamil Nadu, India
MANJARI KIRAN • Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
LANGESWARAN KULANTHAIVEL • Department of Biotechnology, Alagappa University, Karaikudi, Tamil Nadu, India; Molecular Cancer Biology Laboratory, Department of Biomedical Science, Alagappa University, Karaikudi, Tamil Nadu, India
ASHWANI KUMAR • University Institute of Biotechnology, Chandigarh University, Mohali, Punjab, India
PRAKASH KUMAR • ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
SHUBHAM KUMAR • University Institute of Biotechnology, Chandigarh University, Mohali, Punjab, India
SONU KURMI • Department of Mathematics, Bioinformatics & Computer Applications, Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, India
CHITTABRATA MAL • Department of Bioinformatics, Maulana Abul Kalam Azad University of Technology, West Bengal, Haringhata, West Bengal, India
SUDIP MANDAL • Department of Electronics and Communication Engineering, Jalpaiguri Government Engineering College, Jalpaiguri, West Bengal, India
VIJAYKUMAR YOGESH MULEY • Independent Researcher, Hingoli, India; Instituto de Neurobiología, Universidad Nacional Autónoma de México, Querétaro, Mexico
HAMPAPATHALU ADIMURTHY NAGARAJARAM • Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
CHANDRAMOHAN NITHYA • Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
AMRIT KUMAR PAUL • ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
RANJIT KUMAR PAUL • ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
SRINITHI RAMACHANDRAN • Department of Bioinformatics, Alagappa University, Karaikudi, Tamil Nadu, India
HIMADRI SHEKHAR ROY • ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
RAKESH KUMAR SAHU • Department of Biotechnology, Government V.Y.T. Post Graduate Autonomous College, Durg, Chhattisgarh, India
SAMAYADITYA SINGH • Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
ANUP SOM • Centre of Bioinformatics, Institute of Interdisciplinary Studies, University of Allahabad, Prayagraj, India
PRAMOD R. SOMVANSHI • Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
PRAJNA UDUPA • Department of Medical Genetics, Kasturba Medical College, Manipal Academy of Higher Education, Manipal, Karnataka, India
OLAF WOLKENHAUER • Department of Systems Biology & Bioinformatics, University of Rostock, Rostock, Germany
MD. YEASIN • ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India

Chapter 1

Molecular Modeling Techniques and In-Silico Drug Discovery

Angshuman Bagchi

Abstract

Molecular modeling is the technique used to determine the overall structure of an unknown molecule, be it a small one or a macromolecule. The technique also encompasses the screening of ligand libraries for the development of new candidate drug molecules. All these aspects have become essential topics of research. The field is truly interdisciplinary and finds applications in almost all areas of life science research. In this chapter, an overview of the protocol associated with molecular modeling techniques is presented.

Key words Molecular modeling, In-silico virtual screening, Docking, Binding interactions

Sudip Mandal (ed.), Reverse Engineering of Regulatory Networks, Methods in Molecular Biology, vol. 2719, https://doi.org/10.1007/978-1-0716-3461-5_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024

1 Introduction

Molecular modeling techniques have become an integral part of modern-day biological research. Molecular modeling is an interdisciplinary science and utilizes the basics of Physics, Chemistry, Biology, Mathematics, Statistics, and Computer Science. The concepts of molecular modeling are constantly being utilized in Computational Biology, Computational Chemistry, Material Science, and Drug Design. The most important aspect of the technique is the description of the entire molecular system at the atomistic level. The technique encompasses the principles of both molecular mechanics and quantum chemistry. In the molecular mechanics approach, the atoms are considered as small individual units. In the quantum chemistry approach, on the other hand, the details of the entire atomic structure, consisting of the nucleus, the nuclear particles, and the surrounding electronic environment, are considered [1–24].

The techniques of molecular modeling are mainly used in the following cases:

(a) To build the structures of small molecules and macromolecules
(b) To study the binding interactions between different molecules
(c) To analyze the dynamic behaviors of molecules
(d) To screen ligand libraries for the purpose of drug development

1.1 Building the Structures of Molecules

Sometimes it is difficult to obtain the structures of molecules by traditional methods such as X-ray crystallography or NMR. In such cases, the three-dimensional coordinates of the atoms of the molecules are generated with the help of molecular modeling techniques. The most important step is then to identify a suitable template on the basis of which the final structure of the proposed molecule is built. A few basic terms related to the process are worth revisiting:

(a) Primary structure of a protein: It is the sequence of amino acid residues of a protein.
(b) Secondary structure of a protein: It is the local, regular arrangement of the polypeptide backbone of the amino acid residues in a protein. The main types of secondary structural elements include helices, strands, and turns.
(c) Tertiary structure of a protein: It is the three-dimensional structure of the polypeptide chain.
(d) Super-secondary structure: It is the arrangement of the secondary structural units of a protein that forms a compact scaffold of the protein architecture. Some of the common super-secondary structures are the helix hairpin, helix corner, helix-loop-helix, helix-turn-helix, beta-hairpin, beta corner, Greek key motif, beta-alpha-beta motif, and Rossmann fold.
(e) Protein domain: A domain is an independently foldable unit in a protein. Protein domains represent local three-dimensional structures. In general, the domains have specific functionalities.
(f) Target: The protein for which the structure has to be generated. The primary structure of the target must be available.
(g) Template: The protein with a known structure that is used to generate the model of the target. There should be at least 30% sequence identity between the amino acid residues of the target and the template.
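The 30% identity criterion for choosing a template can be illustrated with a short calculation. The following is a minimal sketch and not the chapter's own procedure: it assumes the target and template have already been aligned by some external tool (gaps written as "-"), and it counts identity only over columns where neither sequence has a gap, which is one of several conventions in use. The function names and the two toy sequences are hypothetical and serve only as an illustration.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment,
    so they have equal length and use '-' for gaps.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def is_usable_template(aligned_target: str, aligned_template: str,
                       threshold: float = 30.0) -> bool:
    """Apply the 'at least 30% identity' rule of thumb from the text."""
    return percent_identity(aligned_target, aligned_template) >= threshold


if __name__ == "__main__":
    # Hypothetical, hand-made alignment used only for illustration.
    target = "MKT-AYIAKQR"
    template = "MKSLAYLA-QR"
    print(f"identity: {percent_identity(target, template):.1f}%")
    print("usable template:", is_usable_template(target, template))
```

In practice the identity would be read from the output of an alignment or template search program; the sketch only shows how the threshold is applied once an alignment is available.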

1.2 Binding Interactions Between Molecules and Virtual Screening of Ligand Libraries

Analysis of the binding interactions between molecules is another vital aspect that is dealt with by the techniques of molecular modeling. In this case, different thermodynamic principles are used to study the modes of binding. The atoms of the molecules under study are typed with appropriate charges based on the applied force fields. The force fields are generally obtained using the available structures of the complexes from databases [17–19, 25–36]. The following parameters necessary for this purpose are worth a revisit:


• Enthalpy (H): It is the net energy content of the system, i.e., the sum of the internal energy of the system and the energy associated with pressure-volume work. In thermodynamics, the change in enthalpy (ΔH) is measured. A negative value of ΔH represents energy given out by the system.
• Entropy (S): It is the measure of the randomness of a system. The more random the system is, the more entropy it has. In thermodynamics, the change in entropy (ΔS) is measured. A positive value of ΔS favors spontaneous behavior and vice versa.
• Gibbs Free Energy (G): It is the amount of energy available to a system to do work at constant temperature and pressure. In thermodynamics, the change in Gibbs Free Energy (ΔG) is measured. A negative value of ΔG represents spontaneous behavior and vice versa.
• Active Site: It is a region of a protein containing the amino acid residues that help the protein to bind to its ligands and exert the necessary biochemical functionalities.

In the cellular world, the major interactions are the non-covalent ones:

• Hydrogen Bond: It is a typical dipole-dipole interaction between a partially positive hydrogen atom, covalently bound to a highly electronegative atom like oxygen or nitrogen, and another electronegative atom nearby. This is one of the most prevalent binding interactions in biology.
• Electrostatic Interaction: In this type of interaction, a positively charged part of one molecule faces a negatively charged region of another. This type of interaction is found in many different cases, a predominant example being the interactions between proteins and nucleic acids.
• Van der Waals Interaction: This is a weak dipolar interaction. It arises out of instantaneous accumulations of negatively charged electrons at one part of a molecule, making that part negative while the opposite end becomes positive. Though each interaction is weak, their cumulative effect becomes significant.
• Hydrophobic Interactions: This is a special type of interaction involving the stacking together of the non-polar parts of molecules. The non-polar regions in molecules tend to stay away from the surrounding polar environment. This is a predominant interaction in the protein core.

One of the most vital aspects of molecular modeling is virtual screening of ligand libraries. This technique is used to determine the possible structures of ligand molecules for the purpose of drug design. The results obtained from the study are then tested in the wet lab ahead of the necessary clinical trials [11, 17–19].
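The three thermodynamic quantities defined above are linked by the standard relation ΔG = ΔH − TΔS, where T is the absolute temperature. The following is a minimal sketch of that arithmetic; the function names and the numerical values are illustrative assumptions, not data from this chapter.

```python
def gibbs_free_energy_change(delta_h, delta_s, temperature=298.15):
    """Return dG (kJ/mol) from dH (kJ/mol) and dS (kJ/(mol*K)) at temperature T (K)."""
    return delta_h - temperature * delta_s


def is_spontaneous(delta_g):
    """A process is spontaneous when dG is negative."""
    return delta_g < 0


# Hypothetical binding event: exothermic (dH < 0) but entropically unfavorable (dS < 0).
dg = gibbs_free_energy_change(delta_h=-50.0, delta_s=-0.1, temperature=298.15)
print(round(dg, 3), is_spontaneous(dg))
```

In this made-up case the favorable enthalpy outweighs the entropy penalty, so the binding would still be spontaneous at room temperature.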

2 Methods

2.1 Building the Structures of Molecules

The process requires a target. A target is a protein with a known primary structure (sequence of amino acid residues). However, the details of the dispositions of its secondary structures are not known and are to be determined. The steps of building the model of the target are as follows:

Retrieval of the primary structure of the target: The amino acid sequence of the target protein is to be collected from databases. The commonly used databases are:
(a) NCBI (https://www.ncbi.nlm.nih.gov/)
(b) UniProt (https://www.uniprot.org/)

The amino acid sequences of the target proteins are fetched from these databases either using proper accession codes or in a generic way (for example, by a keyword search). The amino acid sequences are to be checked thoroughly to ascertain the presence of the full-length protein and the absence of any missing amino acid residues.

2.1.1 Analysis of the Primary Structure of the Target

The primary structure of the target, i.e., the amino acid sequence of the protein, is used to check for the presence of various features. The typical features include the following:
(a) Post-translational modifications (PTMs), like sites for phosphorylation, methylation, acetylation, etc.
(b) Active sites, ligand binding sites, etc.
(c) Domains, motifs, etc.
(d) Sequence repeats.

For the analysis of the primary structure of the target, the following databases and tools are frequently used:
(a) InterPro (https://www.ebi.ac.uk/interpro/)
(b) PROSITE (http://www.expasy.org/prosite/)
(c) PRINTS (http://130.88.97.239/PRINTS/index.php)
(d) Pfam (https://pfam.xfam.org/)
(e) PANTHER (http://www.pantherdb.org/)
(f) SCOP (https://scop.mrc-lmb.cam.ac.uk/)
(g) CATH (https://www.cathdb.info/)

2.1.2 Identification of the Suitable Template for the Target

The next step is to choose a suitable template structure to generate the three-dimensional coordinates of the atoms of the amino acid residues of the target (see Note 1). For this purpose, the amino acid sequence of the target is used to search structural databases like the Protein Data Bank (PDB) (https://www.rcsb.org/). The most popular tool for sequence alignment is BLAST (Basic Local Alignment Search Tool) (https://blast.ncbi.nlm.nih.gov/Blast.cgi). BLAST has several important characteristics to be kept under consideration when running the sequence alignment protocol. They are listed below:
(a) Choice of substitution matrix: There are mainly two families of substitution matrices, viz., PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix). A lower number in PAM (like PAM1 as opposed to PAM100) is used to analyze highly conserved, closely related amino acid sequences, whereas a higher PAM number suits distantly related sequence pairs. The BLOSUM numbering runs the other way: a higher BLOSUM (BLOSUM80 as opposed to BLOSUM62) is suitable for the analysis of highly conserved amino acid sequences, and a lower BLOSUM for distantly related sequence pairs.
(b) Choice of database: In order to build the structure of a target sequence, the PDB is generally chosen.
(c) Expect value or E-value: A lower E-value indicates a more reliable sequence alignment.

The amino acid sequences of the target and the template are compared, and an alignment score is assigned to each of the sequence alignments (see Note 2). The template is chosen from the sequence alignment that receives the maximum score.

2.1.3 Building the Model of the Target

The basic criterion for building the model of a target from its amino acid sequence is the percentage of sequence identity between the target and the template. The cut-off value of sequence identity between a target and a template is 30%. The higher the identity, the better the built model. The three-dimensional coordinates of the backbone atoms of the amino acid residues of the target are generated from the sequence alignment. The loop regions are then generated from the calculations of the dihedral angles. The most popular tools used for the purpose are as follows:
(a) MODELLER (https://salilab.org/modeller/)
(b) SWISS-MODEL (https://swissmodel.expasy.org/)
(c) ModLoop (https://modbase.compbio.ucsf.edu/modloop/)
(d) RaptorX (http://raptorx.uchicago.edu/)
(e) Phyre2 (http://www.sbg.bio.ic.ac.uk/phyre2)
(f) I-TASSER (https://zhanggroup.org/I-TASSER/)
(g) ROBETTA (https://robetta.bakerlab.org/)
(h) HHpred (https://toolkit.tuebingen.mpg.de/tools/hhpred)
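The 30% identity criterion can be checked directly on a pairwise alignment of the target and a candidate template. The sketch below is only illustrative: the two aligned strings are made up, and it assumes identity is counted over the columns where both sequences have a residue (alignment programs may use a different denominator).

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        identical += (a == b)
    return 100.0 * identical / compared if compared else 0.0


def passes_cutoff(identity, cutoff=30.0):
    """Apply the 30% sequence-identity cut-off used for template selection."""
    return identity >= cutoff


# Hypothetical aligned fragments of a target and a template
identity = percent_identity("MATSA-KL", "MDTSAGKV")
print(round(identity, 1), passes_cutoff(identity))
```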

Fig. 1 The steps of the generation of three-dimensional coordinates of a protein from the primary structure (flowchart: retrieval of the primary structure of the target → analysis of the primary structure of the target → identification of the suitable template for the target → building the model of the target → assessment of the quality of the built model)

2.1.4 Assessment of the Quality of the Built Model

The model quality is determined by calculating various energy terms. The model with the minimum energy score is considered to be the near-native form of the target. The entire protocol for model building is presented in Fig. 1. In the following section, the step-wise protocol for building a molecular model with MODELLER is presented.

MODELLER

The necessary tool for the purpose is MODELLER 9.24. The necessary files are the template file (represented as template.pdb here) and the alignment file between the target and the template (represented as target_template.ali).

Step 1: The first step in model building is to retrieve the amino acid sequence of the protein from the database. For this, the sequence file with the extension .fasta is to be downloaded. Let the downloaded file be target.fasta. The file target.fasta is to be converted to PIR format as target.ali. The file should look like:

    >P1;target
    sequence:target:::::::0.00: 0.00
    MATSAA................*

Here, P1 marks a protein sequence entry and target is the code of the target protein.
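The FASTA-to-PIR conversion of Step 1 can be scripted. The following is a minimal sketch using only the Python standard library; the function names, the FASTA header, and the short sequence are hypothetical, and the header line written here is the sequence-only form shown in the example file.

```python
def parse_fasta(text):
    """Return (header, sequence) for the first record of a FASTA-formatted string."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # only the first record is used
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def to_pir(code, sequence, width=75):
    """Format a sequence-only PIR entry with ten colon-delimited header fields."""
    lines = [f">P1;{code}", f"sequence:{code}:::::::0.00: 0.00"]
    body = sequence + "*"  # '*' marks the end of the sequence
    lines += [body[i:i + width] for i in range(0, len(body), width)]
    return "\n".join(lines) + "\n"


# Hypothetical FASTA content standing in for target.fasta
fasta = ">hypothetical_target example protein\nMATSAA\nGKLV\n"
_, seq = parse_fasta(fasta)
pir = to_pir("target", seq)
print(pir)
```

Writing `pir` to a file named target.ali would give the input expected in the later steps.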

The second line has ten fields delimited by colons. For a template, these fields describe the structure to be used. In this case, however, only the amino acid sequence of the target is available, and the information pertaining to the target is presented as sequence:target. No other information is available, so the remaining fields are left as they are.

Step 2: The next step is to download the template structure from the PDB. The amino acid sequence of the template is to be extracted for performing the sequence alignment. The downloaded sequence in fasta format is to be converted to ali format. Let the converted file be template.ali. The file should look like:


    >P1;template
    sequence:template:1:A:100:B:undefined:undefined::
    MDFGT................*

Here, template is the code of the template protein; as for the target, the entry starts with >P1;, which marks a protein sequence entry.

1 and A: Residue number and chain ID of the first residue of the template segment.
100 and B: Residue number and chain ID of the last residue of the template segment. The segment can thus run from chain A into a second chain B, for example the chain holding a ligand, if any.
In the next line, the amino acid sequence of the template is presented in single-letter codes. * represents the end of the sequence.

Step 3: The next step is to perform the sequence alignment between the target and the template. For this, the following input file needs to be generated:

File name: align.py

The content of the file is as follows:

    from modeller import *

    env = environ()
    aln = alignment(env)
    # To read the structure file (chains A to B of template.pdb):
    mdl = model(env, file='template', model_segment=('FIRST:A', 'LAST:B'))
    aln.append_model(mdl, align_codes='template', atom_files='template.pdb')
    # To read the sequence file:
    aln.append(file='target.ali', align_codes='target')
    # Align and write the alignment file used in Step 4 (as in the MODELLER basic tutorial):
    aln.align2d()
    aln.write(file='target_template.ali', alignment_format='PIR')

The command to be used is

    python align.py

Step 4: The next step is model building. For this, a template structure, a target sequence, and an alignment file between the target sequence and the template are necessary. The command is

    python build_model.py > build_model.log

In the build_model.py file, there is a section called MyModel. Necessary changes may be made in this section. In general, five models are generated. However, the number of models to be generated can be changed.

Step 5: The next step is to evaluate the quality of the built models. For that, the necessary command is

    python evaluate_model.py

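MODELLER uses DOPE (Discrete Optimized Protein Energy) scores to evaluate model quality. The lower (more negative) the DOPE score, the better the quality of the model; the score is most meaningful when comparing models built for the same target.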
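Steps 4 and 5 can be sketched in a single script. This is not the chapter's build_model.py (which is not reproduced in the text) but a hedged illustration based on MODELLER's automodel class; the file and code names follow the protocol above, the example records are invented, and `build_models` can only run on a machine with a licensed MODELLER installation.

```python
def rank_by_dope(outputs):
    """Sort successfully built models by DOPE score, lowest (best) first."""
    ok = [m for m in outputs if m.get("failure") is None]
    return sorted(ok, key=lambda m: m["DOPE score"])


def build_models(alnfile="target_template.ali", knowns="template",
                 sequence="target", n_models=5):
    """Build n_models comparative models and return MODELLER's output records."""
    # Imported here so that rank_by_dope can be used without MODELLER installed.
    from modeller import environ
    from modeller.automodel import automodel, assess

    env = environ()
    a = automodel(env, alnfile=alnfile, knowns=knowns, sequence=sequence,
                  assess_methods=(assess.DOPE,))
    a.starting_model = 1
    a.ending_model = n_models  # five models by default, as in the protocol
    a.make()
    return a.outputs


# Hypothetical records of the kind reported after model building
example = [
    {"name": "target.B99990001.pdb", "failure": None, "DOPE score": -38100.2},
    {"name": "target.B99990002.pdb", "failure": None, "DOPE score": -38950.7},
    {"name": "target.B99990003.pdb", "failure": "error", "DOPE score": 0.0},
]
best = rank_by_dope(example)[0]
print(best["name"])
```

With real data, `rank_by_dope(build_models())` would list the built models from the lowest DOPE score upward.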

Table 1 The necessary parameters of a candidate drug as determined by Lipinski’s rule of five

Parameters            Lipinski’s rule of drug-likeliness
ALogP                 Less than 5
Molecular_Weight      Less than 500 Da
Num_H_Acceptors       Less than 10
Num_H_Donors          Less than 5
Molar Refractivity    40–130

2.2 Virtual Screening of Ligand Libraries

Another vital aspect of structural bioinformatics is virtual screening of ligand libraries for the development of new drug candidates. The basic idea of the process is to identify compounds that bind specifically to a target protein. A popular tool to perform the task of virtual screening is GOLD (Genetic Optimisation for Ligand Docking). Two important requirements of virtual screening are that the chosen ligands should have acceptable ADME (absorption, distribution, metabolism, and excretion) properties and should adhere to Lipinski’s rule of five as presented in Table 1. The protocol for virtual screening is presented in Fig. 2 as a flowchart.
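The drug-likeliness filter of Table 1 is straightforward to apply once the descriptors of a ligand have been computed by a cheminformatics tool. The sketch below takes the thresholds exactly as they appear in Table 1; the property values of the two ligands are invented for illustration.

```python
# Thresholds copied from Table 1 (strict "less than", as stated there)
RULES = {
    "ALogP": lambda v: v < 5,
    "Molecular_Weight": lambda v: v < 500,  # Da
    "Num_H_Acceptors": lambda v: v < 10,
    "Num_H_Donors": lambda v: v < 5,
    "Molar_Refractivity": lambda v: 40 <= v <= 130,
}


def lipinski_violations(props):
    """Return the names of the Table 1 parameters that a ligand violates."""
    return [name for name, ok in RULES.items() if not ok(props[name])]


def is_drug_like(props, max_violations=0):
    """Accept a ligand with at most max_violations violated parameters."""
    return len(lipinski_violations(props)) <= max_violations


# Hypothetical ligands with precomputed descriptors
ligand_a = {"ALogP": 2.3, "Molecular_Weight": 342.4, "Num_H_Acceptors": 5,
            "Num_H_Donors": 2, "Molar_Refractivity": 95.0}
ligand_b = {"ALogP": 4.1, "Molecular_Weight": 612.0, "Num_H_Acceptors": 8,
            "Num_H_Donors": 6, "Molar_Refractivity": 120.0}
print(is_drug_like(ligand_a), lipinski_violations(ligand_b))
```

The `max_violations` argument is an added convenience; setting it to 0 enforces every row of Table 1.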

3 Notes

1. Selection of the template is also important. It is necessary to check the stereo-chemical features of the template. Any unfavorable interaction in the template is to be removed. It is also necessary to remove from the template, before using it, the information pertaining to heteroatoms that are present only to satisfy the experimental conditions. However, the details of heteroatoms like the cofactors of an enzyme or other biologically relevant ones may be kept if there is information regarding their presence in the target as well.

2. The main source of error in building molecular models is the selection of improper sequence alignments. The technique of homology modeling is heavily dependent on the extent of identity at the sequence level. It is important to choose a proper alignment scheme and scoring matrix. The choice of scoring matrix depends on the nature of the target protein. Sometimes manual intervention is required for performing sequence alignments. In molecular modeling techniques, the amino acid sequence and the corresponding structure of the template protein are known. A substitution of a hydrophobic amino acid in the template by a similar one in the target in the sequence alignment may be accepted in the protein core but not on the protein surface. Therefore, the sequence alignment used for model building is to be manually verified to avoid potential errors.

Fig. 2 The steps of the virtual screening of ligand libraries (flowchart: computational modelling and validation of the target protein and determination of critical residues necessary for the functionality of the protein → identification of the source database of ligands → molecular docking of the ligands with the target protein and scoring of the protein-ligand complexes on the basis of binding free energy values → selection of the screened ligands and estimation of their drug-likeliness properties → final selection of the ligands)

4 Conclusion

The techniques of structural bioinformatics have become very popular alternatives to traditional wet lab techniques. The fundamental aspect of this branch of experimentation is to streamline the extensive search processes. The molecular modeling methods are able to shed light on the structure-function relationships of biomolecules for which the wet-lab-based methods fail. Similarly, the process of virtual screening would be able to pick up the most probable alternative drug candidate from the set of innumerable compounds available.


Acknowledgments

The author would like to thank DBT-funded BIF center (Sanction no.: BT/PR40162/BTIS/137/48/2022) for the infrastructural support. Support from UGC-SAP-DRS-II, DST-PURSE2, and the University of Kalyani are duly acknowledged.

References

1. Blaxter M (2003) In: Barnes MR, Grey IC (eds) Bioinformatics for geneticists. Wiley, Chichester
2. Cherkasov A (2005) In: Baxevanis AD, Ouellette BFF (eds) Bioinformatics: a practical guide to the analysis of genes and proteins, 3rd edn. Wiley, Hoboken
3. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge
4. Xiong J (2006) Essential bioinformatics. Cambridge University Press, Cambridge
5. Pevzner P (2000) Computational molecular biology: an algorithmic approach. MIT Press, Massachusetts
6. Stevens H (2019) Life out of sequence: a data-driven history of bioinformatics. University of Chicago Press, Chicago
7. Pachter L, Sturmfels B (eds) (2005) Algebraic statistics for computational biology, vol 13. Cambridge University Press, Cambridge
8. Cristianini N, Hahn M (2006) Introduction to computational genomics. Cambridge University Press, Cambridge
9. Nisbet R, Elder J, Miner GD (2009) Handbook of statistical analysis and data mining applications. Academic Press, Massachusetts
10. Wong KC (2016) Computational biology and bioinformatics: gene regulation. CRC Press/Taylor & Francis Group, Florida
11. Leach AR (2009) Molecular modelling: principles and applications. Pearson Prentice Hall, New Jersey
12. Heinz H, Ramezani-Dakhel H (2016) Simulations of inorganic-bioorganic interfaces to discover new materials: insights, comparisons to experiment, challenges, and opportunities. Chem Soc Rev 45(2):412–448
13. Parsons J, Holmes JB, Rojas JM, Tsa J, Strauss CE (2005) Practical conversion from torsion space to Cartesian space for in silico protein synthesis. J Comput Chem 26(10):1063–1068
14. Stephan S, Horsch MT, Vrabec J, Hasse H (2019) MolMod – an open access database of force fields for molecular simulations of fluids. Mol Simul 45(10):806–814
15. Eggimann BL, Sunnarborg AJ, Stern HD, Bliss AP, Siepmann JI (2014) An online parameter and property database for the TraPPE force field. Mol Simul 40(1–3):101–105
16. Silakari O, Singh PK (2021) Fundamentals of molecular modeling. In: Concepts and experimental protocols of modelling and informatics in drug design. Academic Press, Massachusetts, pp 1–27
17. Forster MJ (2002) Molecular modelling in structural biology. Micron 33(4):365–384
18. Gu J, Bourne PE (eds) (2009) Structural bioinformatics, vol 44. Wiley-Blackwell, Hoboken, New Jersey
19. Peitsch MC, Schwede T (2008) Computational structural biology: methods and applications. World Scientific, Singapore
20. Stone JE, Phillips JC, Freddolino PL, Hardy DJ, Trabuco LG, Schulten K (2007) Accelerating molecular modeling applications with graphics processors. J Comput Chem 28(16):2618–2640
21. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294(5540):93–96
22. Kaczanowski S, Zielenkiewicz P (2010) Why similar protein sequences encode similar three-dimensional structures? Theor Chem Accounts 125(3–6):643–650
23. Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325
24. Zhang Y, Skolnick J (2005) The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci U S A 102(4):1029–1034
25. Weng G (2002) Exploring protein-protein interactions by peptide docking protocols. Methods Enzymol 344:577–586
26. Ramachandran KI, Deepa G, Krishnan Namboori PK (2008) Computational chemistry and molecular modeling principles and applications. Springer-Verlag GmbH
27. McGreig JE, Uri H, Antczak M, Sternberg MJ, Michaelis M, Wass MN (2022) 3DLigandSite: structure-based prediction of protein–ligand binding sites. Nucleic Acids Res 50(W1):W13–W20
28. Siebenmorgen T, Zacharias M (2020) Computational prediction of protein–protein binding affinities. WIREs Comput Mol Sci 10:e1448
29. Muhammed MT, Aki-Yalcin E (2018) Homology modeling in drug discovery: overview, current applications, and future perspectives. Chem Biol Drug Des 93(1):12–20
30. Pan AC, Borhani DW, Dror RO, Shaw DE (2013) Molecular determinants of drug–receptor binding kinetics. Drug Discov Today 18(13–14):667–673
31. Alberts B, Bray D, Hopkin K, Johnson AD, Lewis J, Raff M, Roberts K, Walter P (2015) Essential cell biology. Garland Science, New York City
32. Csermely P, Palotai R, Nussinov R (2010) Induced fit, conformational selection and independent dynamic segments: an extended view of binding events. Trends Biochem Sci 35(10):539–546
33. Copeland RA (2013) Drug–target residence time. In: Evaluation of enzyme inhibitors in drug discovery. Wiley
34. Miller DC, Lunn G, Jones P, Sabnis Y, Davies NL, Driscoll P (2012) Investigation of the effect of molecular properties on the binding kinetics of a ligand to its biological target. Med Chem Comm 3(4):449–452
35. Allen MP, Tildesley DJ (1989) Computer simulation of liquids. Oxford University Press, Oxford
36. Frenkel D, Smit B, Ratner MA (1996) Understanding molecular simulation: from algorithms to applications, vol 2. Academic Press, San Diego

Chapter 2

Systems Biology Approach to Analyze Microarray Datasets for Identification of Disease-Causing Genes: Case Study of Oral Squamous Cell Carcinoma

Jyotsna Choubey, Olaf Wolkenhauer, and Tanushree Chatterjee

Abstract

The discovery of potential disease-causing genes can aid medical progress. The post-genomic era has made this a more difficult task. Modern high-throughput methods have not solved the problem of identifying disease genes. Conventional methods cannot be used to investigate many rare or lethal diseases. Monitoring gene expression values in different samples using microarray technology is one of the best and most accurate ways to identify disease-causing genes. Microarrays, one of the most recent advances in experimental molecular biology, allow researchers to simultaneously monitor the expression levels of thousands of genes. Statistical analysis of microarray data might aid gene discovery by revealing pathways related to the target gene and facilitating identification of candidate genes. Systems biology, an interdisciplinary approach, has emerged as a crucial analytic tool with the potential to reveal previously unidentified causes and consequences of human illness. Genetic, environmental, immunological, or neurological factors have been implicated in the development of complex disorders like cancer. Because of this, it is important to approach the study of such diseases from a novel perspective. The systems biology approach allows us to rapidly identify disease-causing genes and assess their viability as therapeutic targets. This chapter demonstrates systems biology approaches to identify candidate genes using public databases. Oral squamous cell carcinoma (OSCC) is used as a model disease to show how systems biology can be used successfully to identify and prioritize disease genes.

Key words Microarray, Systems biology, Gene expression, Therapeutic targets, Oral squamous cell carcinoma

Sudip Mandal (ed.), Reverse Engineering of Regulatory Networks, Methods in Molecular Biology, vol. 2719, https://doi.org/10.1007/978-1-0716-3461-5_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024

1 Introduction

In biomedical research, one of the most important and fundamental challenges is isolating disease genes from the human genome. The identification of these genes has the potential to enhance both the diagnosis and the treatment of diseases. It is well known that the progression of certain diseases, such as cancer, is accompanied by changes in the expression values of particular genes. For example, mutations can turn normal cells into cancerous ones, and the level of gene expression is altered as a result of these changes. The pleiotropy of genes, the limited number of confirmed disease genes in the whole genome, and the genetic heterogeneity of diseases all make the discovery of new disease genes a difficult task, in spite of the many studies that have applied machine learning methods to find new disease genes [1]. Microarrays allow researchers to investigate issues once thought intractable by measuring the expression levels of thousands of genes simultaneously [2, 3]. Microarray profiling is an efficient method for investigating cell- or tissue-specific gene expression at the genome level. However, when conducting comparative analyses between control and mutant samples, microarrays frequently identify a large number of genes with differential expression. This, in turn, makes it difficult to screen out “high-priority candidate genes” that are most relevant to a phenotype observed in a mutant.

A systems biology approach to complex disease (such as cancer) now complements traditional experience-based approaches, which are typically invasive and expensive. The rapid progress in biomedical research enables the targeting of disease with precise, proactive, preventive, and personalized therapies. Systems biology integrates and analyzes high-throughput quantitative biological data using mathematical modeling and computational biology. It is an integrative method for understanding the complexity of biological systems by focusing on the interplay between the components of an organism rather than on single molecules [4, 5]. Wet-lab experiments generate multi-omics data that can be comprehensively evaluated using systems biology methods for a more complete picture. Systems biology aims to discover how a disease’s underlying dysfunction can be explained by re-integrating critical elements from multi-modality datasets that have previously been isolated. Hence, systems biology focuses on the interactions among biological elements rather than the elements’ characteristics.

The application of systems biology has been ubiquitous and fruitful in molecular oncology. Realistic predictions of biological conditions are the goal of systems biology approaches, but their success is highly dependent on the accuracy of the data and the sophistication of the computational tools used to analyze it [6]. Human, technical, and environmental factors may cause bias in human tissue samples and biological data [7]. In the last decade, numerous studies have shown that high-throughput data from in vitro or in vivo models and patient tumor samples, when analyzed using systems biology techniques, can yield molecular signatures that predict the malignancy of a tumor [8] and its resistance to conventional or advanced therapy [9, 10], suggest suitable mono and combined therapies, or predict molecular therapy targets [11]. Similarly, these techniques have been successfully employed in various other diseases like glaucoma [12], metabolic syndrome [13], and silicosis [14].

This chapter is mainly intended to help computational biologists become familiar with the wide range of tools and programs available for microarray data processing and analysis using a systems biology approach to prioritize candidate disease genes. To do this, we used R [15], a popular open-source programming language and environment for statistical computing and graphics, and its Bioconductor packages [16]. Here we use the case of oral squamous cell carcinoma (OSCC) to illustrate the procedure in detail.

2 Materials

2.1 Databases

1. Gene Expression Omnibus (GEO): GEO [17] is a database containing a vast amount of data on large-scale gene expression measurement experiments.
2. STRING: The STRING v11.0 database [18] was employed to construct the protein-protein interaction network of DEGs.
3. DAVID: The Database for Annotation, Visualization and Integrated Discovery (DAVID, http://david.abcc.ncifcrf.gov/) (version 6.8) is a public online bioinformatics database [19] that helps to identify the most significantly enriched functional gene groups and biological pathways.
4. GEPIA: Gene Expression Profiling Interactive Analysis (GEPIA; http://gepia.cancer-pku.cn) is an interactive web application [20] that can be used for survival analysis, correlation analysis, and gene expression analysis in different types of cancer and normal tissues.

2.2 Software

1. R version 3.2.2 [15]: A programming language that is particularly strong in statistical analysis but also offers broad general computing capabilities.
2. Bioconductor [16]: A collection of tools for analyzing genomic data with R.
3. Cytoscape v3.9.1 [9]: This software was used for visualization and network parameter calculation.
4. iRegulon Cytoscape plugin [21]: It facilitates motif and track identification in a pre-existing network or in a group of co-regulated genes.
5. miRWalk (http://mirwalk.umm.uni-heidelberg.de/) [22]: It is an open-source platform that predicts and validates miRNA-binding sites of human, mouse, rat, dog, and cow genes. miRWalk predicts miRNA target sites using the random-forest-based software TarPmiR to search the 5′-UTR, CDS, and 3′-UTR. It also integrates miRNA-target interactions from other databases.
6. NetworkAnalyst (http://www.networkanalyst.ca) [23]: It is a comprehensive web-based tool that allows bench researchers to perform various common and complex meta-analyses of gene expression data via an intuitive web interface.

3 Methods

The approach used in this study can be summarized as follows: Using network analysis tools that are based on gene expression data, oral cancer was modeled, and potential compounds that can affect patients’ survival were found. In addition, interactions between noncoding RNAs and pathways were investigated in order to gain a deeper comprehension of the pathological mechanisms underlying the development of oral cancer. A summary of the method used is shown in Fig. 1.

Fig. 1 Workflow of study

3.1 Microarray Dataset

The microarray dataset GSE37991 was downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) [17]. To search GEO for related gene expression profiles, we used the keywords “oral cancer,” “Microarray,” and “Homo sapiens” (see Note 1).


3.2 Data Processing and Identification of DEGs

The limma package of R/Bioconductor was used to identify DEGs [24]. To limit false-positive results arising from multiple testing, adjusted P-values (adj P-value) were used. This study identified genes with an absolute log2 fold change > 2 and an adj P-value < 0.05 as differentially expressed genes.
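The two DEG thresholds can be illustrated outside R with a short standard-library sketch. It does not reproduce limma's moderated statistics; it only assumes that the adjusted P-values are Benjamini-Hochberg corrected (limma's default adjustment) and that the fold-change cut-off applies to the absolute value. Gene names and numbers are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, log2fc, pvalues, lfc_cutoff=2.0, alpha=0.05):
    """Keep genes with |log2 fold change| > lfc_cutoff and adjusted P < alpha."""
    adj = benjamini_hochberg(pvalues)
    return [g for g, lfc, q in zip(genes, log2fc, adj)
            if abs(lfc) > lfc_cutoff and q < alpha]


# Hypothetical results for four genes
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
log2fc = [3.1, -2.6, 0.4, 2.2]
pvalues = [0.005, 0.01, 0.03, 0.04]
degs = select_degs(genes, log2fc, pvalues)
print(degs)
```

GENE_C is dropped because its fold change is too small, even though its adjusted P-value passes the cut-off.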

3.3 Functional Annotation and KEGG Pathway for DEGs

The Database for Annotation, Visualization, and Integrated Discovery (DAVID, https://david.ncifcrf.gov/) was used to perform gene ontology (GO) functional and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses [25]. GO analyses covering cellular component, molecular function, and biological process terms, as well as KEGG pathway enrichment analysis, were performed on these DEGs (see Note 2).
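The idea behind an enrichment analysis can be shown with a plain hypergeometric over-representation test. This is a generic sketch, not DAVID's own scoring (DAVID reports a modified Fisher exact P-value); the function name and the counts are illustrative assumptions.

```python
from math import comb


def enrichment_pvalue(n_background, n_annotated, n_degs, n_overlap):
    """P(X >= n_overlap) when n_degs genes are drawn from n_background genes,
    of which n_annotated carry the GO term or pathway of interest."""
    upper = min(n_annotated, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(comb(n_annotated, k) * comb(n_background - n_annotated, n_degs - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Toy numbers: 10 background genes, 4 in the pathway, 3 DEGs, 2 of them in the pathway
p = enrichment_pvalue(n_background=10, n_annotated=4, n_degs=3, n_overlap=2)
print(round(p, 4))
```

A small P-value would suggest that the pathway contains more DEGs than expected by chance; with real data the P-values of all tested terms would also need multiple-testing correction.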

3.4 Protein-Protein Interactions Analysis

The protein-protein interaction network of DEGs was constructed using the STRING v11.0 database, and Cytoscape v3.9.1 software was used to visualize the network and calculate network parameter values [9]. In the subsequent analysis, the Cytoscape plug-in NetworkAnalyzer was used. The topological properties of the PPI network, namely the node degree and the closeness centrality of each gene in the network, were computed in order to search for hub genes within the PPI network [26]. Following that, a Molecular Complex Detection (MCODE) analysis was carried out in Cytoscape to screen the PPI network’s significant modules. The cut-off parameters used for this analysis were as follows: node score cut-off = 0.2, K-Core = 2, and degree cut-off = 2 (see Note 3).
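The two topological measures mentioned above can be computed by hand on a small network. The sketch below uses a toy undirected graph with placeholder node names and defines closeness as the number of reachable nodes divided by the sum of shortest-path lengths; Cytoscape's plug-in may normalize the value differently.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) interactions."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree(graph, node):
    """Number of direct interaction partners of a node."""
    return len(graph[node])


def closeness(graph, node):
    """Reachable nodes divided by the sum of their shortest-path distances (BFS)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nb in graph[cur]:
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Toy PPI network: a triangle A-B-C with one extra partner D attached to C
ppi = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(degree(ppi, "C"), closeness(ppi, "C"), closeness(ppi, "D"))
```

Node C has both the highest degree and the highest closeness here, which is the pattern expected of a hub candidate.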

3.5 Identification of Hub Genes

Genes with a large number of direct connections to other genes are called hub genes [27]. In the PPI network, genes are represented as nodes and their interactions as edges, with hub genes serving as the most highly connected nodes. These hub genes were identified using the maximal clique centrality (MCC) topological algorithm. CytoHubba, a Cytoscape plugin (available at http://apps.cytoscape.org/apps/cytohubba), was used to run the MCC algorithm on the PPI network [28]. CytoHubba [https://github.com/Cytoscape/CytoHubba] comprises 11 topological algorithms for ranking nodes in a specific network (see Note 4).
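To make the MCC ranking concrete, the sketch below scores each node as the sum of (|C| − 1)! over the maximal cliques C that contain it, which is the definition we assume here from the cytoHubba publication. It is an illustrative re-implementation on a toy graph, not the plugin's code, and the basic Bron-Kerbosch enumeration it uses is only practical for small networks.

```python
from math import factorial


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def mcc_scores(graph):
    """MCC(v) = sum of (|C| - 1)! over all maximal cliques C containing v."""
    scores = {v: 0 for v in graph}
    for clique in maximal_cliques(graph):
        for v in clique:
            scores[v] += factorial(len(clique) - 1)
    return scores


# Toy network: a triangle A-B-C with one extra partner D attached to C
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
scores = mcc_scores(toy)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Node C belongs to both the triangle and the C-D edge, so it receives the top score and would be ranked first as a hub candidate.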

3.6 Construction of the TF-miRNA-target Gene Regulatory Network

The miRNAs targeting the DEGs were predicted using the miRWalk2.0 (http://zmf.umm.uni-heidelberg.de/apps/zmf/mirwalk/) [21], miRTarBase (https://miRTarBase.cuhk.edu.cn/) [29], miRDB (http://www.mirdb.org/) [30], and TargetScan (http://www.targetscan.org/vert_72/) databases [31]. To create the miRNA-target DEG regulatory network, we imported the predictions of these databases and used Cytoscape 3.9.1 to visualize the miRNA-target DEG pairs.


Jyotsna Choubey et al.

The iRegulon plug-in (http://apps.cytoscape.org/apps/iRegulon) of Cytoscape 3.9.1 was used to predict the TFs that regulate miRNAs. This plug-in incorporates TF-target pairs from multiple human databases, including Transfac (http://www.gene-regulation.com/pub/databases.html) and Encode (https://www.Encodeproject.org/) [32, 33]. To screen TF-target pairs, we used the following cut-offs: a normalized enrichment score (NES) > 3, an identity of 0.0 between orthologous genes, and a false discovery rate (FDR) of 0.001 on motif similarity. The miRNA-DEG regulatory network was then integrated with the predicted transcriptional regulation relationships, and the combined network was visualized in Cytoscape (see Note 5).

3.7 Identification of Protein-Drug Interactions

NetworkAnalyst was used to examine protein-drug interactions in the DrugBank database [34] and to identify possible drugs for treating oral cancer. A set of protein-drug interactions was selected based on statistical significance thresholds for drug-protein interactions and the potential role of the targeted protein in oral cancer pathogenesis, and simulations were run to analyze the binding affinities of the identified drugs with their target protein. The network was analyzed using the Cytoscape app.

3.8 Hub Gene Survival and Expression Level Analysis

A comprehensive online platform called Gene Expression Profiling Interactive Analysis (GEPIA2) provides fast and customized delivery of functionalities based on TCGA (The Cancer Genome Atlas) and genotype-tissue expression (GTEx) data. The GEPIA database (http://gepia.cancer-pku.cn/) was employed for survival analysis (see Note 6).

4 Interpretation of Results

The study used a systems biology approach to identify disease genes. The method prioritizes disease gene candidates by combining protein interaction networks and GO annotations.

4.1 Gene Expression Analysis

Based on the cut-off criteria (adjusted P < 0.05 and |log2 FC| > 2), a total of 437 DEGs were identified from the GSE37991 dataset: 114 up-regulated genes and 323 down-regulated genes in OSCC tissues compared to adjacent normal tissues.

4.2 Pathway and Functional Association Analysis

In the GO analysis, the screened DEGs mainly participate in the biological process (BP) of collagen catabolic process, extracellular matrix disassembly, extracellular matrix organization, muscle contraction, and cardiac muscle contraction. As for the molecular function (MF), DEGs are mainly involved in the calcium ion binding, structural constituent of muscle, actin binding, serine-type

Finding Responsible Genes for Oral Squamous Cell Carcinoma


endopeptidase activity, metalloendopeptidase activity, and extracellular matrix structural constituent. For cellular component (CC), the DEGs are located in the extracellular region, extracellular space, and extracellular matrix. The KEGG pathway analysis found that DEGs were significantly enriched in Drug metabolism – cytochrome P450, Metabolism of xenobiotics by cytochrome P450, Metabolic pathways, Human papillomavirus infection, ECM-receptor interaction, IL-17 signaling pathway, and Pathways in cancer (Table 1).

4.3 Protein-Protein Interactions (PPIs) Analysis

The PPI network of DEGs comprises 452 nodes and 882 edges, including 323 up-regulated genes and 129 down-regulated genes. Nodes represent proteins, and edges represent the interactions between them. MCODE analysis of the entire PPI network yielded a significant module consisting of 15 nodes and 85 edges; 10 of these genes were the hub DEGs we identified (Fig. 2).

4.4 Hub Proteins Were Identified from Protein-Protein Interaction Analysis

We analyzed the STRING PPI network and mapped it in Cytoscape to predict typical DEG interactions and adhesion pathways. Hub genes play a central role in the structure of a PPI network. Using Cytoscape’s Cytohubba plugin, we identified the top 10 DEGs as the most influential genes in the PPI network. TNNC2, TNNI2, MYL1, MYLPF, ACTA1, MYH2, MYH6, MYH7, CKM, and ATP2A1 are the hub genes (Fig. 3).

4.5 TF-miRNA Target DEG Regulatory Network Analysis

To identify the potential miRNAs regulating the expression of the target genes, we screened 395 miRNAs from the miRWalk2.0 database, and 405 miRNA-DEG pairs were identified. Each miRNA-target DEG pair was simultaneously predicted by the miRTarBase, miRDB, and TargetScan databases. The miRNA-DEG regulatory network, consisting of 405 nodes and 708 edges, was visualized with Cytoscape (Fig. 4). Using the iRegulon app of Cytoscape, a total of 13 TFs were predicted as regulators of miRNAs in the miRNA-target gene regulatory network (Fig. 5). Gene expression can be regulated in complex feedback or feedforward loops by a variety of mechanisms. For instance, transcription factors (TFs) can initiate the transcription of microRNAs (miRNAs), which can then affect the translation and degradation of messenger RNA (mRNA) transcripts [35–37]. Since miRNAs and TFs can mutually regulate one another’s expression, it can be difficult to disentangle their relative contributions to target gene expression. The integrated network (Fig. 6), consisting of a total of 418 nodes and 548 edges, was constructed using Cytoscape from 10 hub genes, 13 regulating TFs, and 395 miRNAs.

Terms

6.27E-05 6.93E-08 2.27E-07 2.12E-05 5.56E-05 1.20E-04

GO:0006508 Proteolysis

GO:0006936 Muscle contraction

GO:0060048 Cardiac muscle contraction

GO:0045214 Sarcomere organization

GO:0006941 Striated muscle contraction

GO:0002026 Regulation of the force of heart contraction

GO:0005576 Extracellular region

2.07E-11

0.002419625 LHX1, HOXC4, HOXB7, HOXD10, HOXC8, HOXA5, HOXA11

1.13E-05

GO:0009952 Anterior/posterior pattern specification

Cellular function

7.75E-04

GO:0030198 Extracellular matrix organization 2.71E-06

CSRP3, MYL1, MYL2, TNNC1, TNNI2, TCAP, ATP1A2, MYH6, MYH7

CKMT2, MYOM1, ACTA1, MYH2, MYBPC1, TMOD4, MYL1, MYOT, TNNI2, SCN7A, MYH6, MYH7

3.93E-09

LY6K, ADAMDEC1, CXCL9, TNFRSF6B, CCL11, CSF2, LAMA1, NXPH4, SERPINE1, LAMC2, CXCL13, PTHLH, ESM1, ISLR, PLAU, SPP1, COL10A1, EPHB2, CTSC, PLA2G2F, MMP7, MUCL1, MMP1, TNFRSF18, IGFL2, WNT7A, MMP9, MMP10, MMP12, MMP11, CXCL11, MMP13, PROC, MFAP2, COL4A5, PAEP, AGRN, STX1A

0.042942435 CSRP3, MYL2, ATP1A2, MYH6, MYH7

0.024795469 SMPX, PGAM2, MYH6, KLHL41, MYH7

0.012574463 CSRP3, TCAP, CASQ1, MYH6, KLHL41, MYLK3, MYH7

2.02E-04

1.24E-04

0.010729228 MMP12, MMP11, ADAMDEC1, MMP13, PROC, MMP7, PLAU, MMP1, MMP9, CTSC, MMP10

MMP12, MMP11, MMP13, MMP7, MMP1, COL10A1, COL4A5, MMP9, MMP10

MMP12, MMP11, MMP13, MMP7, MMP1, MMP9, MMP10

6.43E-05

1.50E-07

MMP12, MMP11, MMP13, MMP7, MMP1, MMP9, MMP10

Genes

GO:0022617 Extracellular matrix disassembly

6.31E-05

FDR

7.37E-08

P value

GO:0030574 Collagen catabolic process

Biological function

Go ID

Table 1 Gene ID, functions, and respective p-values


0.005688796 0.216174263 PROC, EVA1A, SPP1, COL10A1, WNT7A, COL4A5, CTSC

GO:0005788 Endoplasmic reticulum lumen

1.24E-04 3.05E-04 0.001430067 0.05920478 2.48E-05

5.49E-05 1.05E-04

1.14E-04

GO:0005201 Extracellular matrix structural constituent

GO:0048248 CXCR3 chemokine receptor binding

GO:0004175 Endopeptidase activity

GO:0005509 Calcium ion binding

GO:0008307 Structural constituent of muscle

GO:0003779 Actin binding

GO:0008201 Heparin binding

MMP12, MMP11, ADAMDEC1, MMP13, MMP7, MMP1, MMP9, MMP10

MMP12, MMP11, MMP13, PROC, MMP7, PLAU, MMP1, MMP9, CTSC, MMP10

MMP12, MMP13, MMP7, MMP1, MMP9

CXCL11, CXCL9, CXCL13

(continued)

0.013925651 COMP, GREM2, FGF7, FGF14, PCOLCE2, MSTN, LRTM1, NDNF, PRELP, ELANE, SERPINA5, FGF10

0.013925651 CAMK2B, PACRG, MYBPC1, TNNC2, FAM107A, MYRIP, MYH2, CSRP3, SCIN, MYOT, PIP, NRAP, TNNI2, XIRP2, MYOZ3, MAPT, MYH7

0.013375181 MYOM1, CSRP3, MYBPC1, MYL1, MYOT, MYL2, TCAP

0.012075076 WDR49, DGKB, CALML6, STAB2, ATP2A1, CLGN, COMP, CAPN9, SCIN, CRTAC1, MB, F10, S100A1, TNNC1, PCDH20, PLA2G2A, TNNC2, DLK1, LRP1B, CRNN, MYLPF, PCP4, MYL1, MYL2, CDH10, CDH12, CASQ1

0.0157968

0.008558963 LAMA1, MFAP2, COL10A1, COL4A5, ZPLD1, LAMC2, AGRN

5.07E-04

4.89E-06

GO:0004222 Metalloendopeptidase activity

2.91E-04

1.40E-06

GO:0004252 Serine-type endopeptidase activity

Molecular function

0.002408505 0.114403981 MMP11, PROC, MUCL1, WNT7A, AGRN

MMP12, MMP11, MMP13, MMP7, MMP1, LAMA1, COL10A1, WNT7A, COL4A5, MMP10

GO:0005796 Golgi lumen

6.02E-04

9.50E-06

CXCL9, TNFRSF6B, CCL11, CSF2, LAMA1, SERPINE1, LAMC2, SERPINA9, CXCL13, PTHLH, S100A7A, CST1, PLAU, CREG2, SPP1, COL10A1, ZPLD1, CTSC, MMP7, IGFL2, WNT7A, MMP9, MMP10, MMP12, CXCL11, MMP13, PROC, COL4A5, PAEP

GO:0031012 Extracellular matrix

5.45E-05

5.74E-07

GO:0005615 Extracellular space


Terms

ECM-receptor interaction

IL-17 signaling pathway

Pathways in cancer

Cytokine-cytokine receptor interaction

Drug metabolism – Cytochrome 2.78E-05 P450

Adrenergic signaling in cardiomyocytes

Metabolism of xenobiotics by cytochrome P450

Metabolic pathways

Tyrosine metabolism

hsa04512

hsa04657

hsa05200

hsa04060

hsa00982

hsa04261

hsa00980

hsa01100

hsa00350

3.30E-04

8.89E-05

4.99E-05

4.68E-05

5.72E-04

4.99E-04

3.39E-04

2.49E-04

Human papillomavirus infection

2.45E-04

2.51E-04

P value

hsa05165

KEGG pathway

GO:0051015 Actin filament binding

Go ID

Table 1 (continued) Genes

0.013583174 DCT, ADH1C, ADH1B, ADH1A, TYRP1, TYR

0.004580537 ACSM3, GALNT16, ADH1C, DGKB, ADH1B, IDI2, ADH1A, TYRP1, MGST1, GADL1, PYGM, GPT, COX6A2, ABO, CKMT2, FUT6, ACADL, CYP11A1, CA4, SMYD1, ATP6V0A4, HMGCS2, CA8, ENPP6, BCO1, HMGCLL1, UGT1A7, UGT1A6, CHST9, PTGIS, CKM, PLA2G2A, PGAM2, AMPD1, FMO2, B3GALT5, TYR, GATM, DCT, ALDH1A2, GSTA1, CDO1, ASPA, FBP2, GSTM5, OTC

0.003429173 ADH1C, ADH1B, ADH1A, GSTA1, MGST1, CYP2F1, UGT1A7, UGT1A6, GSTM5

0.003429173 CAMK2B, CACNB4, PPP1R1A, CALML6, MYL2, TNNC1, AGTR1, SCN7A, ATP2A1, ATP1A2, MYH6, MYH7

0.003429173 ADH1C, ADH1B, ADH1A, GSTA1, MGST1, FMO2, UGT1A7, UGT1A6, GSTM5

0.015550033 CXCL11, CXCL9, TNFRSF6B, CCL11, CSF2, TNFRSF12A, TNFRSF18, CXCL13, IL12RB2

0.015550033 GNGT1, CCNA1, MMP1, ITGA3, LAMA1, GNA12, WNT7A, COL4A5, LAMC2, MMP9, IL12RB2, DLL3

0.015356154 S100A7A, MMP13, CCL11, CSF2, MMP1, MMP9

0.015356154 ITGA3, LAMA1, SPP1, COL4A5, LAMC2, AGRN

0.015356154 CCNA1, ITGA3, LAMA1, SPP1, WNT7A, COL4A5, LAMC2, TCIRG1, ATP6V0D2, OASL

0.024495741 MYOM1, TMOD4, TNNC1, TNNC2, PPP1R9A, SLC6A4, MYH2, SCIN, NRAP, GAS2, XIRP2, MYH6, MYH7

FDR


Fig. 2 Protein-protein interaction network and a significant module. (a) Protein-protein interaction network of DEGs. The network consists of 882 edges (interactions) between 436 nodes based on a confidence score of 0.4 and the maximum additional interactors default parameter. Nodes represent proteins, edges represent the interaction between two nodes (proteins). (b) A significant module selected from protein-protein interaction network. DEGs, differentially expressed genes


Fig. 3 Hub genes in the PPI network, determined with the Cytohubba plugin in Cytoscape using the MCC method. Orange nodes indicate the top 10 hub genes, shown with their interactions with other molecules


Fig. 4 Interaction network between hub genes and targeted miRNAs. Hub genes are shown as green circles, whereas targeted miRNAs are shown as purple squares. The network consists of 405 nodes and 708 edges


Fig. 5 Network of transcription factors and the genes they regulate. A DEG is represented by a green circle, and a TF by a pink triangle


Fig. 6 TF-miRNA-target network of DEGs. Green circles represent DEGs, purple circles represent miRNAs, and pink triangles represent TFs. The network consists of 418 nodes and 548 edges. The network was constructed using Cytoscape from 10 hub genes, 13 regulating TFs, and 395 miRNAs. TF, transcription factor; miRNA, microRNA

4.6 Identification of Protein-Drug Interactions

Protein-drug interactions are essential for understanding the features of sensitive receptors [38]. The protein-drug interaction analysis identified drugs that interact with a hub protein. Figure 7 shows the association of the hub protein ACTA1 with 13 therapeutic compounds.

4.7 The Expression Level and Hub Genes Kaplan-Meier Plotter

The GEPIA database was used to validate the expression levels of key hub genes in OSCC tissues vs. normal oral mucosa in the TCGA dataset. Figure 8 shows that all hub genes were downregulated in 519 OSCC samples compared with 44 normal samples, which was consistent with the GEO analysis. All hub genes differed significantly in survival outcome between the high- and low-expression groups. Moreover, the high-expression group had shorter overall survival and worse prognosis than the low-expression group (Fig. 9).


Fig. 7 Protein-drug interaction for drug analysis. It is possible to see the interaction between a hub protein and its drugs. The red circle indicates the hub protein and the green triangle indicates possible drugs

Fig. 8 The expression level of hub genes in OSCC tissues and normal tissues from patients. To further verify the expression level of the hub genes between OSCC tissues and normal tissues, the hub genes were analyzed by the GEPIA2 online database. P < 0.01 was considered statistically significant


Fig. 9 Kaplan-Meier overall survival analysis of the hub genes expressed in OSCC. Curves were generated using Gene Expression Profiling Interactive Analysis based on The Cancer Genome Atlas database ( p _ 0.01)

5 Notes

1. The GEO database from NCBI was used to access the GSE37991 dataset, which contains expression profiles by array. Datasets from various experiments are deposited in this database, and users can download the gene expression profiles stored in GEO. The dataset GSE37991 was based on the GPL6883 platform (Illumina HumanRef-8 v3.0 expression bead chip) and included 40 OSCC samples and 40 normal oral mucosa samples (submission year, 2012; year of last update, 2017) [39]. The gene expression data were downloaded from the public database, and no animal or human experiments were performed by any of the authors in this study.

2. To screen the biological processes involved in the pathogenesis of oral cancer, the online software Database for Annotation, Visualization and Integrated Discovery was used to perform Gene Ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis for the DEGs. These analyses used the hypergeometric distribution test, and the P value

70%

Read Alignment

Sequence coverage

Picard

>70%

Million reads aligned reference

>14 million reads

Percent aligned to rRNA

50%

4. One can also download the paired-end data of a single FASTQ sample using wget in the Ubuntu terminal as (see Notes 4 and 5):
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR102/003/SRR1027983/SRR1027983_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR102/003/SRR1027983/SRR1027983_2.fastq.gz

3.2 Quality Control of Raw Reads

Ideally, quality assessment should be done at every step of data analysis, so that any bias in the data can be corrected for a reliable downstream analysis (Table 2). FastQC, among other tools, is used for quality control of raw reads; it accepts FASTQ files as input and generates quality assessment results in HTML format [25]. FastQC can be executed on a single sample as:
$ fastqc SRR1027983_1.fastq.gz

FastQC can be executed on all samples in the working directory in a batch as:
$ fastqc *.fastq.gz

Inference of Dynamic Growth Regulatory Network


FastQC reports on per base sequence quality, per tile sequence quality, per sequence quality scores, per base sequence content, per sequence GC content, per base N content, sequence length distribution, overrepresented sequences, adapter content, and k-mer content, and each of these modules is classified as fail, warning, or pass. Raw reads need not pass every module: warnings in a few modules, such as per tile sequence quality or per base sequence content, do not mean that the sample must be re-sequenced or discarded altogether. In contrast, per base sequence quality, per sequence GC content, and adapter content are modules that raw reads must pass for reliable transcriptomic data analysis. It is advised to discard reads whose average quality score falls below 30. Additional details on data quality controls and examples of good and bad data can be found at the website of the Babraham Bioinformatics Group (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). All FastQC reports can be summarized into a single file using the MultiQC tool. This step is suggested when the sample size is very large, and it gives a comparative view of the quality assessment across the samples of the dataset so that outlier samples can be identified quickly. Along with the HTML file, FastQC also generates a ZIP file containing graphs and data files that are designed to be easily parsed, allowing a more detailed and automated evaluation of the raw data. MultiQC can be executed on all samples in the working directory as:
$ multiqc *.zip

3.3 Reads Preprocessing

In this step, reads with low-quality bases and adapter sequences should be discarded before further analysis. Trimmomatic is a tool that can detect the adapter sequences that need to be trimmed; the adapter sequences are provided in the “Trimmomatic-0.39/adapters” directory of the tool [26]. Other parameters such as k-mer or GC content are organism- or condition-specific and should be homogeneous across the samples within a dataset, which was checked using the MultiQC tool in the previous step. In the example, PE reads can be trimmed using the following command:
$ java -jar /path/to/trimmomatic-0.39.jar PE -threads 6 SRR1027983_1.fastq.gz SRR1027983_2.fastq.gz SRR1027983_1Paired_clean.fq.gz SRR1027983_1Unpaired_clean.fq.gz SRR1027983_2Paired_clean.fq.gz SRR1027983_2Unpaired_clean.fq.gz ILLUMINACLIP:/path/to/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:30 MINLEN:60


Aparna Chaturvedi and Anup Som

In the command above, the -threads option sets the number of threads used to speed up execution on multiple CPU cores. For PE reads, two input files are required, for forward- and reverse-end reads respectively. Clean reads are generated as outputs (four files in this case): two output files hold the paired and unpaired forward-end reads, and another two hold the paired and unpaired reverse-end reads. Next, the adapter sequence file (TruSeq3-PE-2.fa) is given in the command, followed by the other trimming parameters. SLIDINGWINDOW scans each read from the 5′ end and cuts it once the average quality within the window (4 bases in this case) falls below the threshold (30 in this case). Because the base quality of Illumina reads tends to decrease along the read, trimming shortens the reads, and MINLEN sets the minimum length that a read must retain to be kept. Usually, in the case of RNA-Seq data, the read length should be at least 50 base pairs (bp) so that important information is not lost from the data. Other options that can be considered for trimming reads that fall below the quality threshold are LEADING (trims bases below threshold quality at the start of the reads), TRAILING (trims bases below threshold quality at the end of the reads), CROP (cuts the read to a specified length by removing bases from the end), and HEADCROP (removes a specified number of bases from the start of the read) (see Note 6). The quality check is then repeated on the trimmed reads to confirm that the data are clean and that adapters have been removed before downstream analysis.

3.4 Read Alignment Against the Reference Genome

Once good-quality reads have been obtained, the next step is mapping or alignment of the reads against the reference genome to locate the origin of transcripts within the genome. Alternatively, mapping can be done against a reference transcriptome, but that rules out the identification of any novel transcript. In the case of RNA-Seq, irrespective of whether a transcriptome or genome is used as reference, reads map either uniquely or to multiple positions; the latter are known as multi-mapped reads and should not be discarded in the mapping output [34]. However, RNA-Seq reads can span exon-exon junctions, so how an alignment tool handles the resulting gaps relative to the genome is a key challenge in RNA-Seq data analysis. Hence, one important factor to consider before choosing an alignment tool is whether the tool performs splice-aware alignment or un-spliced alignment [35]. This is also why a well-annotated genome is needed, as annotations carry information about where to expect large intronic gaps. Hisat2 is a tool that can identify splice junctions and indels, and it uses an index of the reference genome for fast alignment [27]. In the example, we have considered Human Reference Genome GRCh38 Ensembl release 105. So first we need to download the reference genome and its annotation (GTF file) using the following commands:


$ wget http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ wget http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz

The downloaded reference genome is then used to build the index, which Hisat2 uses for faster and memory-efficient alignment, using the following command:
$ hisat2-build -p 6 Homo_sapiens.GRCh38.dna.primary_assembly.fa genome
hisat2-build builds indices from DNA sequences and appends “.1.ht2, .2.ht2, ...” suffixes to the base name (genome) of the generated index files. Additional arguments, --ss and --exon, can be given in the command to supply the splice sites and exons within the genome. Since index creation for larger genomes such as the human genome requires a large amount of memory, index files for well-annotated genomes can be downloaded from the Hisat2 website itself (https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz). Once the index files and annotation files are ready, we can proceed to align the reads using the following command:
$ hisat2 -p 12 --dta --rna-strandness RF -x /path/to/reference/genome/index/genome -1 /path/to/SRR1027983_1Paired_clean.fq.gz -2 /path/to/SRR1027983_2Paired_clean.fq.gz -S SRR1027983.sam --un-gz SRR1027983_unaln.fq.gz --summary-file SRR1027983_sum.txt --met-file SRR1027983_met.tsv

--dta reports alignments tailored for downstream transcript assemblers, --rna-strandness specifies the strand-specific library type so that reads are aligned against the correct strand, -x provides the base name (genome) of the index files, and -1 and -2 provide the two files of paired-end reads. The other arguments, -S, --un-gz, --summary-file, and --met-file, are used to write the SAM file, export the reads that failed to align, write the alignment summary, and write the alignment metrics, respectively. The primary output of alignment is in SAM (Sequence Alignment/Map) format, a generic, human-readable nucleotide alignment format that can be compressed to Binary Alignment/Map (BAM) format using SAMtools. BAM files hold the same alignments in binary form, are optimized for fast access, and can be sorted and indexed so that all reads aligning to a locus can be retrieved efficiently without loading the entire file into memory. This can be achieved using SAMtools as:

$ samtools sort -o SRR1027983.bam SRR1027983.sam
$ samtools index SRR1027983.bam

The indexed file can then be used to visualize the mapping of reads onto the genome using the IGV visualization tool (see Note 7). For this, open the IGV tool and load the sorted BAM file (with its .bai index file in the same directory) and the genome annotation file (.GFF file format). The genome annotation for hg38 Ensembl release 105 can be downloaded using the following command:
$ wget http://ftp.ensembl.org/pub/release-105/gff3/homo_sapiens/Homo_sapiens.GRCh38.105.gff3.gz

Before loading the indexed BAM file, the correct genome must first be selected in the top left menu as “Human hg38”. Then select File > Load from File > Homo_sapiens.GRCh38.105.gff3.gz. Similarly, load the indexed BAM file within the IGV environment. To visualize a particular gene (e.g., “EGR1”), type its name in the search box and navigate to the locus. Further, it is advisable to perform a quality check of the read alignment as well, for example the percentage of mapped reads. As a rule of thumb, 70-90% of human RNA-Seq reads mapping to the genome is an indicator of sequencing accuracy; the expected percentage is lower for alignment against a reference transcriptome, as reads corresponding to novel or unannotated transcripts might be lost. Another important parameter is the read distribution, as an accumulation of reads toward the 3’ end might suggest low-quality data. RSeQC [36] or the Picard tool (https://broadinstitute.github.io/picard/) can be used for quality check of read alignment.
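The mapping-rate check described above can be automated with a short script. The sketch below assumes that the alignment summary contains a line of the form "NN.NN% overall alignment rate"; the example text, the sample names, and the 70% cut-off applied here are illustrative and are not real output for SRR1027983.

```python
import re


def overall_alignment_rate(summary_text):
    """Extract the 'overall alignment rate' percentage from a summary."""
    match = re.search(r"([\d.]+)% overall alignment rate", summary_text)
    if match is None:
        raise ValueError("no overall alignment rate line found")
    return float(match.group(1))


def flag_low_mapping(rates, minimum=70.0):
    """Return sample names whose mapping rate is below the minimum."""
    return sorted(name for name, rate in rates.items() if rate < minimum)


# Hypothetical, shortened summary text for illustration only.
example = "1000 reads; of these:\n92.50% overall alignment rate\n"
rates = {"sampleA": overall_alignment_rate(example), "sampleB": 61.3}
low = flag_low_mapping(rates)
```

Samples returned by `flag_low_mapping` are candidates for closer inspection with RSeQC or Picard before they are carried into read assignment.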

3.5 Read Assignment

Once the transcript locations within the genome have been identified, we have the coordinates of the transcripts, but the final goal is to identify the genes to which the transcripts aligned. This process is known as read assignment. It requires an annotation of the reference genome (GTF or GFF file) that contains the start and end coordinates of genes, transcripts, exons, introns, etc., and serves as a guide to all the relevant information about a gene. We can extract the genes of a given biotype, such as protein-coding or non-coding genes (see Note 8), using the “grep” command within the Linux console as:
$ grep -E '^#|gene_biotype "protein_coding"' Homo_sapiens.GRCh38.105.gtf > Homo_sapiens.GRCh38.105.protein_coding.gtf
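For readers who prefer to do this filtering in a script, the same idea can be sketched in Python: keep the header lines and every record whose attribute column carries the requested gene_biotype. The function name and the abbreviated GTF records below are hypothetical.

```python
def filter_gtf_by_biotype(lines, biotype="protein_coding"):
    """Keep header lines (starting with '#') and records of one biotype."""
    needle = 'gene_biotype "%s"' % biotype
    return [ln for ln in lines if ln.startswith("#") or needle in ln]


# Hypothetical, abbreviated GTF records for illustration only.
gtf = [
    "#!genome-build GRCh38",
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
    '1\tsrc\tgene\t950\t990\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";',
]
kept = filter_gtf_by_biotype(gtf)
```

Passing a different biotype, for example `filter_gtf_by_biotype(gtf, "lncRNA")`, selects non-coding records in the same way.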

Further, for assignment of reads, the featureCounts function of the R package Rsubread is employed, which requires the genome annotation file and SAM/BAM files as input [29]. The transcript values are obtained as a counts table that can be used for co-expression network reconstruction or differential analysis, among other downstream analyses. To work within the R environment using the RStudio IDE, we first need to set up the working directory, because all tables or files written through the R script will be saved in the current working directory. To set up the directory, in the menu panel of the RStudio IDE, select Session > Set Working Directory > Choose Directory. Then, from the pop-up window, select the directory where the read assignment outputs should be saved, and load the necessary packages/libraries for read assignment within the R environment as:
>BiocManager::install("Rsubread")
>library(Rsubread)

After loading the package into the R environment, the next step is to provide all the BAM files and metadata of samples as input to featureCounts.
>all.bam ...
>counts ...
>write.table(fc$counts, "counts.tsv", sep="\t", col.names=NA, quote=F)
>write.table(fc$annotation, "annotation.tsv", sep="\t", row.names=F, quote=F)
>write.table(fc$stat, "stats.tsv", sep="\t", col.names=NA, quote=F)


Additionally, one might be interested in isoform quantification, which is often done hand-in-hand with the alignment; however, this step is beyond the scope of this article. Several tools are available for this purpose, such as Cufflinks [37], RSEM [38], and eXpress [39].

3.6 Differential Gene Expression

In this step, gene expression values are compared across samples at different time points or stages. Gene network analysis is prone to spurious correlations caused by differences in sequencing depth and library size, so the counts are normalized, either directly or by taking into account the number of transcripts, which can vary significantly between samples. Which algorithm should be used for normalization depends strongly on the data under consideration. For instance, the within-sample normalizations RPKM (reads per kilobase of exon model per million reads), FPKM (fragments per kilobase of exon model per million mapped reads), and its derivative TPM (transcripts per million) are frequently used for RNA-seq datasets [34]. Several studies have compared in detail the tools available for identification of DEGs from transcriptomic data [40, 41]. DESeq2 provides its own normalization approach, uses the negative binomial as the reference distribution, and is also used for identification of differentially expressed genes [30].
>BiocManager::install("DESeq2")
>library(DESeq2)
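The within-sample normalizations named above can be illustrated with a short calculation. The sketch below assumes one sample, raw counts per gene, and gene lengths in base pairs; the numbers are hypothetical, and the sketch does not reproduce the DESeq2 normalization used in the protocol.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of exon model per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * l) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]


# Hypothetical counts and gene lengths for three genes in one sample.
counts = [100, 300, 600]
lengths_bp = [1000, 3000, 2000]
sample_rpkm = rpkm(counts, lengths_bp)
sample_tpm = tpm(counts, lengths_bp)
```

The first two genes have the same read density (0.1 reads per base) and therefore receive the same RPKM and TPM values, while TPM values always sum to one million within a sample.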

Our approach is to capture the dynamic expression of genes across different time points, which in this case are the progressive stages of cancer: ‘nor’, ‘ais’, and ‘inv’. Users therefore need to perform the DEG analysis in two steps: (1) DEG1, analysis of normal vs adenocarcinoma in situ, and (2) DEG2, analysis of adenocarcinoma in situ vs invasive adenocarcinoma.
#Here, for analysis of DEG1, counts and metadata of only nor and ais samples need to be loaded into counts and factor1 object respectively.
>counts ...
>counts <- counts[c(1,3,4,6,7,9,10,12,13,15,16,18)]
>factor1 ...
>keep ... 0)) >= 12
factor1 contains sample accession numbers in one column and their corresponding conditions in the second column. rowSums in this case removes low count genes across which no reads


were assigned in the count matrix and retains only the genes to which at least one read was assigned in each sample (12 samples in this case). Further, normalization can be done using the DESeqDataSet object provided by DESeq2 as:
>dds ...
>dds ...
>norm_counts ...
>dds ...
>dds
class: DESeqDataSet
dim: 18777 17
metadata(1): version
assays(6): counts mu ... replaceCounts replaceCooks
rownames(18777): ENSG00000000003 ENSG00000000419 ... ENSG00000289629 ENSG00000289642
rowData names(23): baseMean baseVar ... maxCooks replace
colnames(17): SRR1027983 SRR1027984 ... SRR1027999 SRR1028000
colData names(14): SampleType Assay.Type ... sizeFactor replaceable


The results function returns the DGE analysis results as a table with columns for baseMean, log2FoldChange, lfcSE, stat, pvalue, and padj, and with genes (Ensembl gene IDs) in rows.
>res ...
>summary(res)
>write.table(res, file="DEG_nor_ais_.tsv", sep="\t", row.names=FALSE, quote=FALSE)
>res_clean = na.exclude(as.data.frame(res))
>sprintf("Number of genes after NA exclude (DESeq2)")
>dim(res_clean)

An MA plot is an application of the Bland–Altman plot to genomic data. It visualizes the differences between measurements taken under two conditions, here normal and diseased, by plotting M (log fold change) against A (mean of normalized counts).
# The plotMA function, which DESeq2 provides for its results objects, draws a 2-D scatter plot to analyse gene expression patterns across the normal and tumor conditions.
>plotMA(res)
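As a rough illustration of the two axes, the sketch below computes M and A for a single gene from normalized counts in two conditions. It uses a plain log2 ratio with a pseudocount rather than DESeq2's model-based fold-change estimate; the function name and the pseudocount are assumptions.

```python
import math


def ma_values(norm_a, norm_b, pseudocount=1.0):
    """Return (M, A) for one gene.

    norm_a, norm_b: normalized counts in condition A (e.g. normal)
    and condition B (e.g. tumor).
    M is the log2 ratio of the condition means (B over A);
    A is the mean normalized count over all samples.
    """
    mean_a = sum(norm_a) / len(norm_a)
    mean_b = sum(norm_b) / len(norm_b)
    m = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    a = (sum(norm_a) + sum(norm_b)) / (len(norm_a) + len(norm_b))
    return m, a
```

Genes with M near zero lie along the horizontal midline of the plot, while differentially expressed genes appear above it (higher in condition B) or below it (higher in condition A).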

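Further, the DGE analysis results are filtered using thresholds on the absolute log2FoldChange (>1) and on the p-adjusted value to obtain the upregulated and downregulated genes.
>upreg_deg <- … >1 & res_clean$padj …
>write.table(upreg_deg, file="DEG_nor_ais_upreg.tsv", sep="\t", row.names=FALSE, quote=FALSE)
>downreg_deg <- …
The same steps are then repeated for DEG2 (ais vs inv samples).
>factor2 <- …
>keep <- rowSums((counts > 0)) >= 12
# DESeq2 differential expression analysis and normalization
>dds <- …
>dds <- …
>norm_counts <- …
>res <- …
>summary(res)
>write.table(res, file="DEG_ais_inv.tsv", sep="\t", row.names=FALSE, quote=FALSE)
>res_clean = na.exclude(as.data.frame(res))
>sprintf("Number of genes after NA exclude (DESeq2)")
>dim(res_clean)
>plotMA(res)
The results are again filtered for absolute log2FoldChange >1 and p-adjusted value.
>upreg_deg <- … >1 & res_clean$padj …
>write.table(upreg_deg, file="DEG_ais_inv_upreg.tsv", sep="\t", row.names=FALSE, quote=FALSE)
>downreg_deg <- …
For GRN reconstruction, the list of DEGs and their normalized counts are loaded into deg and deg.expr, and the pairwise correlations (deg.cor$r) and their p-values (deg.cor$p) are computed. The lower triangle and the diagonal of both matrices are set to NA so that each gene pair is kept only once.
>deg <- … >1 & res_clean$padj …
>deg.expr <- …
>deg.cor <- …
>deg.cor$p[lower.tri(deg.cor$p, diag=TRUE)]=NA
>Pval.adj <- …
>deg.cor$r[lower.tri(deg.cor$r, diag=TRUE)]=NA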

The adjacency matrix is then reorganized into a table with columns gene1, gene2, correlation value, and adjusted p-value, which can be exported as a tab-delimited file and visualized as a network. Gene pairs are retained only if their correlation value is >0.80 and their adjusted p-value passes the chosen threshold.
>correlation <- …
>deg.cor.table <- …
>colnames(deg.cor.table) <- …
>deg.cor.table.filt <- … >0.80 & deg.cor.table[,4] …
>write.table(deg.cor.table.filt, "CoexpressionNetwork.txt", sep="\t", row.names=F, quote=F)

The list of qualifying gene pairs has been written to the file CoexpressionNetwork.txt, which can be used later for network visualization and inference. The correlation table contains the columns gene1, gene2, cor, and p.adj (Fig. 3a). All the above-mentioned steps for GRN reconstruction are then repeated for the AIS vs INV samples: the normalized counts and the list of DEGs for AIS vs INV are imported into the deg.expr and deg data frames, respectively, followed by all other steps as described for NOR vs AIS. Finally, we obtain two GRNs, one for NOR vs AIS samples (Fig. 4a) and another for AIS vs INV samples (Fig. 4b).
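The construction of the co-expression edge table can be sketched in a few lines of Python. This is a simplified illustration, not the chapter's R code: it computes Pearson correlations for each gene pair once (the equivalent of keeping only the upper triangle) and filters on the correlation value alone. The p-value adjustment and its threshold are omitted, and the function names and the min_r default are assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(expr, min_r=0.80):
    """Return (gene1, gene2, cor) rows with cor > min_r.

    expr maps a gene ID to its normalized counts across samples.
    combinations() visits each unordered pair exactly once.
    """
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r > min_r:
            edges.append((g1, g2, r))
    return edges
```

To retain strongly anti-correlated pairs as well, the comparison can be changed to abs(r) > min_r; whether that is appropriate depends on the biological question.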

3.8 Network Visualization and Inference

Various computational tools are available to visually explore a biological network; Suderman and Hallett [49] discuss many of them, including Cytoscape [31], VisANT [47], and Pathway Studio [48].


Fig. 4 Growth regulatory network constructed for differentially expressed genes for (a) normal (NOR) vs adenocarcinoma in situ (AIS) samples and (b) adenocarcinoma in situ (AIS) vs invasive adenocarcinoma (INV) samples

3.8.1 Integration of Publicly Available Interaction Data

The correlation network contains genes that are related at the functional level. To construct a reliable GRN, evidence for regulation between these genes can be added either by manual literature curation or from publicly available interaction databases (Table 5). In the example, we imported the list of differentially expressed genes, DESeq2_result.tsv, into the STRING database and exported the resulting interactions in tab-delimited format using the export option labeled “TSV: tab-separated values - can be opened in Excel and Cytoscape (lists only one-way edges: A-B)” (Fig. 5). This interaction information uses official gene symbols as gene identifiers and can now be integrated with the constructed co-expression network using the union mode in Cytoscape: in the Cytoscape menu select Tools > Merge > Networks, select both the STRING network and the co-expression network, and click on the Union option. Notably, the identifiers of all genes in the interaction data must be in the same format (Ensembl ID, official gene symbol, etc.). The genes in the co-expression network are identified by Ensembl gene IDs, which can be converted to official gene symbols using a custom gene annotation file prepared from the reference genome GTF file.
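The identifier conversion and the union merge described above can be pictured with the following Python sketch. It only mirrors the idea behind the Cytoscape operation and is not how Cytoscape implements it; the function names, the mapping dictionary, and the placeholder identifiers in the usage note are hypothetical.

```python
def to_symbols(edges, id_to_symbol):
    """Translate node IDs to gene symbols; drop edges with an unmapped node."""
    out = []
    for a, b in edges:
        if a in id_to_symbol and b in id_to_symbol:
            out.append((id_to_symbol[a], id_to_symbol[b]))
    return out


def union_network(*edge_lists):
    """Union of undirected edge lists.

    Returns a dict mapping each edge (as a sorted pair) to the set of
    input indices that contain it, so shared edges remain identifiable.
    """
    merged = {}
    for idx, edges in enumerate(edge_lists):
        for a, b in edges:
            key = tuple(sorted((a, b)))
            merged.setdefault(key, set()).add(idx)
    return merged
```

For example, with a co-expression edge list converted by to_symbols and a STRING edge list, union_network(coexpr, string_edges) returns every edge found in either network; edges tagged with both indices are co-expressed pairs that also have database support.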

3.8.2 Import Network to Cytoscape

The interaction table obtained from the correlation matrix can be imported into Cytoscape for network visualization (Fig. 3). In the menu select File > Import > Network from File and then select the tab-delimited file in the pop-up window. For the given example, select gene1 as the source node, gene2 as the target node, and the correlation (cor) and adjusted p-value (p.adj) columns as edge attributes. For better visualization, node size/color/label or edge thickness/color can be changed from the “Style”


Table 5 List of publicly available interaction databases that can be used for incorporating interaction information into Cytoscape

Database | Description | Link
BioGRID | BioGRID is a biomedical interaction repository for 2,579,588 protein and genetic interactions, 30,725 chemical interactions, and 1,128,339 post-translational modifications from major model organism species | https://thebiogrid.org/
DIP | It currently catalogs 81,923 experimentally determined interactions between proteins | https://dip.doe-mbi.ucla.edu/dip/Main.cgi
HPID | Provides human protein interaction information pre-computed from existing structural and experimental data | http://dept.inha.ac.kr
HPRD | A literature-curated platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks, and disease association for each protein in the human proteome | http://hprd.org
IntAct | All interactions are derived from literature curation or direct user submissions | https://www.ebi.ac.uk/intact
MINT | Stores data on functional interactions between proteins curated from the scientific literature | https://mint.bio.uniroma2.it
PIPs | The full database of interactions predicted by a naïve Bayesian model includes interactions for about 69,965 human proteins | http://www.compbio.dundee.ac.uk/www-pips
STRING | It currently stores >20 billion protein-protein interactions for 14,094 organisms | https://string-db.org

Fig. 5 An example demonstrating (a) criteria to be selected in STRING database for inclusion of publicly available interaction information into the GRN and (b) the way to export the interaction information.


Fig. 6 Cytoscape interface showing (a) Network panel listing all the networks loaded into the Cytoscape environment. (b) The main network view window. (c) The network analyzer results panel. (d) Table panel listing all nodes and edges in the displayed network along with its associated attributes

section in the control panel of Cytoscape. The layout of the network can be changed from the “Layout” section of the menu to arrange the network according to different parameters; the “orthogonal layout” is advised for better visualization of directed networks. Directed networks, such as GRNs, are those in which the direction of interaction (inhibition or activation) is known.

3.8.3 Topological Inference

The GRN can be analyzed topologically within Cytoscape, for which various plug-ins are available. NetworkAnalyzer is used to compute the overall statistics of the network, such as the degree exponent, clustering coefficient, and network density [50] (Fig. 6). The degree distribution can be plotted to identify whether the network is a scale-free or small-world network. Biological networks, including GRNs, are often expected to be scale-free, with a degree distribution that follows a power law with degree exponent 2 < γ < 3 [43]. Additionally, hub genes are an important feature of a network and can be identified based on node degree, betweenness centrality, or closeness centrality; the cytoHubba plug-in is used for hub-gene identification [51]. Clustering of genes, or module detection, to understand whether genes are functionally related or have similar expression patterns is another very important topological inference of a GRN. Gene clusters are sub-networks of highly correlated genes and offer crucial biological insights. Among the plug-ins available in Cytoscape, clusterMaker provides various options for clustering genes. We use the “GLay” clustering algorithm, which is based on the Newman–Girvan method of community detection [52], to perform module detection.
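Some of the topological quantities mentioned above are simple enough to compute by hand. The Python sketch below derives node degrees, a degree-ranked hub list, the local clustering coefficient, and the degree distribution from an undirected edge list. It is a conceptual illustration only; it does not reproduce the algorithms of NetworkAnalyzer or cytoHubba, and all function names are assumptions.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from (node1, node2) pairs; self-loops are ignored."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def top_hubs(adj, k):
    """The k nodes with the highest degree (ties broken alphabetically)."""
    return sorted(adj, key=lambda n: (-len(adj[n]), n))[:k]


def clustering_coefficient(adj, node):
    """Fraction of possible links among a node's neighbours that exist."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))


def degree_distribution(adj):
    """Map each degree value to the number of nodes having it."""
    dist = defaultdict(int)
    for n in adj:
        dist[len(adj[n])] += 1
    return dict(dist)
```

Plotting the output of degree_distribution on log-log axes gives a first visual impression of whether the network is approximately scale-free; a roughly straight line with a negative slope is the usual informal indication.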


3.9 Functional Annotation of Deregulated Genes

Once we have the list of genes regulating each other within the co-expression network, the next step is to find the Gene Ontology (GO) terms and the pathways associated with these genes. This analysis can be performed either on the whole network or on the major modules of the co-expression network. Because the genes in a given module are correlated, functional profiling can help infer the function of uncharacterized genes from their similar expression patterns [22]. This requires adequate functional annotation for the disease or organism concerned. The Database for Annotation, Visualization, and Integrated Discovery (DAVID) [32] has adequate information for the functional annotation of cancer transcriptomics data. Access the tool at https://david.ncifcrf.gov and navigate to the functional annotation section. In the first tab, “Upload”, provide a gene list as input and select the gene identifier, “Official_Gene_Symbol” in this case. In the next tab, “List”, select the species concerned and then click the “Use” icon. This generates the annotation summary results, where expanding the gene ontology tab shows the relevant terms “GOTERM_BP_DIRECT”, “GOTERM_CC_DIRECT”, and “GOTERM_MF_DIRECT”. Similarly, select “KEGG_PATHWAY” under Pathways and click on the “Functional Annotation Chart” button.
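The statistical idea behind functional enrichment can be sketched with a plain hypergeometric test. DAVID reports its own modified Fisher exact statistic (the EASE score), so the values below will not match DAVID's output exactly; the function name and argument names are assumptions made for this illustration.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated genes by chance.

    N: genes in the background
    K: background genes annotated with the term
    n: genes in the submitted list
    k: submitted genes annotated with the term
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., upper.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total
```

A small p-value indicates that the term is carried by more genes in the module than would be expected from the background frequency; in practice the p-values for all tested terms are then corrected for multiple testing.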

3.10 Final Remarks

In this work, simple steps have been illustrated to construct a stage-specific dGRN from RNA-Seq data, beginning with data acquisition and ending with network inference such as module detection, hub gene identification, or betweenness centrality. The work demonstrates how to analyze time-series data to construct a dGRN involved in tissue growth and tumor progression. For a more robust inference of the GRN, protocols for including protein-protein or gene-gene interactions from several public databases have also been described. This protocol serves as a step-by-step guide for the inference of a dGRN from RNA-Seq data and should by no means be considered the only way to infer a dGRN. As already discussed, other statistical methods and software are also used for the construction and inference of dynamic GRNs, depending on the inclusion of more time points and multiple conditions in various biological processes and diseases. Hence, the tools and criteria to be used at each step of the data analysis vary between case studies. Ultimately, this protocol has been prepared to guide biologists at the beginning of their research in analyzing high-throughput data for the inference of a dGRN.

4 Notes

1. The computational specifications mentioned are for the example dataset used in this study. However, analyses of this kind are routinely performed on GPU clusters or servers of higher specifications.
2. The software versions mentioned in this work are those used when the protocol was prepared. Users are advised to check the latest available version and the relevant changes it contains.
3. Raw RNA-Seq reads have been used as the example to explain the detailed handling of RNA-Seq data. However, one might want to consider the normalized counts for high-throughput data available in the above-mentioned databases for data acquisition. The normalized count data can be downloaded in matrix format, in which case the data analysis protocol from Sect. 3.6 onwards can be followed for inference of the GRN.
4. The “$” symbol marks commands to be used in the Linux terminal, while the “>” symbol marks commands to be used in the R environment.
5. The commands shown here have been used for a single paired-end dataset. For batch download and processing of files, a script can be prepared and the given set of commands used accordingly.
6. The data used in the example are paired-end data and have been used only for demonstration purposes; single-end data require other commands for pre-processing. Please refer to the documentation of Trimmomatic [26] for more details on the commands used.
7. This step is optional and may be skipped.
8. In this work, we have covered analysis dealing with protein-coding genes only. To identify the non-coding genes within the genome, “protein coding” can be replaced as the gene biotype in the grep command given in Sect. 3.5 by a non-coding biotype such as “lncRNA” or “miRNA”.
9. One may need to apply the round function, as round(norm_counts), to the input of the DESeqDataSet object in order to compare the findings of the DGE analysis between normalized and un-normalized counts.

Acknowledgments

The authors thank Arindam Ghosh for reviewing the earlier version of the manuscript and for his valuable comments. AC is grateful to the University Grants Commission (India) for providing financial assistance to carry out the research work.


References

1. Baute J, Herman D, Coppens F, De Block J, Slabbinck B, Dell'Acqua M, Pè ME, Maere S, Nelissen H, Inzé D (2016) Combined large-scale phenotyping and transcriptomics in maize reveals a robust growth regulatory network. Plant Physiol 170(3):1848–1867
2. Baena-Lopez LA, Nojima H, Vincent JP (2012) Integration of morphogen signalling within the growth regulatory network. Curr Opin Cell Biol 24(2):166–172
3. Claeys H, De Bodt S, Inzé D (2014) Gibberellins and DELLAs: central nodes in growth regulatory networks. Trends Plant Sci 19(4):231–239
4. Carey M, Ramírez JC, Wu S, Wu H (2018) A big data pipeline: identifying dynamic gene regulatory networks from time-course Gene Expression Omnibus data with applications to influenza infection. Stat Methods Med Res 27(7):1930–1955
5. Hurd PJ, Nelson CJ (2009) Advantages of next-generation sequencing versus the microarray in epigenetic research. Brief Funct Genom Proteom 8(3):174–183
6. Contreras-López O, Moyano TC, Soto DC, Gutiérrez RA (2018) Step-by-step construction of gene co-expression networks from high-throughput Arabidopsis RNA sequencing data. Methods and Protocols, Root Development, pp 275–301
7. Hecker M, Lambeck S, Toepfer S, Van Someren E, Guthke R (2009) Gene regulatory network inference: data integration in dynamic models-a review. Bio Systems 96(1):86–103
8. Stark R, Grzelak M, Hadfield J (2019) RNA sequencing: the teenage years. Nat Rev Genet 20(11):631–656
9. de Jong H (2002) Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9(1):67–103
10. Jänes J, Hu F, Lewin A, Turro E (2015) A comparative study of RNA-seq analysis strategies. Brief Bioinform 16(6):932–940
11. Costa-Silva J, Domingues DS, Menotti D, Hungria M, Lopes FM (2022) Temporal progress of gene expression analysis with RNA-Seq data: a review on the relationship between computational methods. Comput Struct Biotechnol J 21:86–98
12. Ding J, Bar-Joseph Z (2020) Analysis of time-series regulatory networks. Curr Opin Syst Biol 21:16–24
13. Steuer R, Kurths J, Daub CO, Weise J, Selbig J (2002) The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 18(suppl_2):S231–S240
14. Thomas R (1973) Boolean formalization of genetic control circuits. J Theor Biol 42(3):563–585
15. Stuart JM, Segal E, Koller D, Kim SK (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302(5643):249–255
16. Shmulevich I, Dougherty ER, Kim S, Zhang W (2002) Probabilistic Boolean Networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2):261–274
17. Friedman N, Linial M, Nachman I, Pe'er D (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7(3–4):601–620
18. Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'Alché-Buc F (2003) Gene networks inference using dynamic Bayesian networks. Bioinformatics 19(Suppl_2):ii138–ii148
19. Bar-Joseph Z, Gitter A, Simon I (2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet 13(8):552–564
20. Spies D, Ciaudo C (2015) Dynamics in transcriptomics: advancements in RNA-seq time course and downstream analysis. Comput Struct Biotechnol J 13:469–477
21. Oh S, Song S, Grabowski G, Zhao H, Noonan JP (2013) Time series expression analyses using RNA-seq: a statistical approach. BioMed Res Int 2013:203681
22. Van Dam S, Vosa U, van der Graaf A, Franke L, de Magalhaes JP (2018) Gene co-expression analysis for functional classification and gene–disease predictions. Brief Bioinform 19(4):575–592
23. Singh R, Som A (2020) Role of network biology in cancer research. Recent trends in ‘Computational Omics’: concepts and methodology. Nova Science Publishers, New York
24. Morton ML, Bai X, Merry CR, Linden PA, Khalil AM, Leidner RS, Thompson CL (2014) Identification of mRNAs and lincRNAs associated with lung cancer progression using next-generation RNA sequencing from laser micro-dissected archival FFPE tissue specimens. Lung Cancer 85(1):31–39
25. Andrews S (2010) Babraham bioinformatics – FastQC a quality control tool for high throughput sequence data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
26. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120
27. Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12:357–360
28. Thorvaldsdóttir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14(2):178–192
29. Liao Y, Smyth GK, Shi W (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30:923–930
30. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12):1–21
31. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
32. Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, Imamichi T, Chang W (2022) DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res 50(W1):W216–W221
33. Lai Y (2010) Differential expression analysis of digital gene expression data: RNA-tag filtering, comparison of t-type tests and their genome-wide co-expression-based adjustments. Int J Bioinforma Res Appl 6(4):353–365
34. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A (2016) A survey of best practices for RNA-seq data analysis. Genome Biol 17(1):1–9
35. Ghosh A, Som A (2022) Transcriptomic analysis of human naïve and primed pluripotent stem cells. Human Naïve Pluripotent Stem Cells 2022:213–237
36. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments. Bioinformatics 28(16):2184–2185
37. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3):562–578
38. Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinf 12:1–6
39. Roberts A, Feng H, Pachter L (2013) Fragment assignment in the cloud with eXpress-D. BMC Bioinf 14(1):1–9
40. Li D, Zand MS, Dye TD, Goniewicz ML, Rahman I, Xie Z (2022) An evaluation of RNA-seq differential analysis methods. PLoS One 17(9):e0264246
41. Ghosh A, Som A (2021) Decoding molecular markers and transcriptional circuitry of naive and primed states of human pluripotency. Stem Cell Res 53:102334
42. Joehanes R (2018) Network analysis of gene expression. Methods Mol Biol (Clifton, NJ) 1783:325–341
43. Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47
44. Ghosh A, Som A (2020) RNA-Seq analysis reveals pluripotency-associated genes and their interaction networks in human embryonic stem cells. Comput Biol Chem 85:107239
45. Chaturvedi A, Som A (2022) The LCNetWork: an electronic representation of the mRNA-lncRNA-miRNA regulatory network underlying mechanisms of non-small cell lung cancer in humans, and its explorative analysis. Comput Biol Chem 101:107781
46. Singh R, Som A (2020) Identification of common candidate genes and pathways for progression of ovarian, cervical and endometrial cancers. Meta Gene 23:100634
47. Hu Z, Mellor J, Wu J, Yamada T, Holloway D, DeLisi C (2005) VisANT: data-integrating visual framework for biological networks and modules. Nucleic Acids Res 33(suppl_2):W352–W357
48. Nikitin A, Egorov S, Daraselia N, Mazo I (2003) Pathway studio – the analysis and navigation of molecular networks. Bioinformatics 19(16):2155–2157
49. Suderman M, Hallett M (2007) Tools for visually exploring biological networks. Bioinformatics 23(20):2651–2659
50. Assenov Y, Ramírez F, Schelhorn SE, Lengauer T, Albrecht M (2008) Computing topological parameters of biological networks. Bioinformatics 24(2):282–284
51. Chin CH, Chen SH, Wu HH, Ho CW, Ko MT, Lin CY (2014) cytoHubba: identifying hub objects and sub-networks from complex interactome. BMC Syst Biol 8(4):1–7
52. Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE (2011) clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinf 12(1):1–4

Chapter 5 Implementation of Exome Sequencing to Identify Rare Genetic Diseases

Prajna Udupa and Debasish Kumar Ghosh

Abstract

Modern high-throughput genomic testing using next-generation sequencing (NGS) has led to a significant increase in the successful diagnosis of rare genetic disorders. Recent advances in NGS tools and techniques have led to accurate and timely diagnosis of a large proportion of genetic diseases by finding sequence variations in clinical samples. One of the NGS techniques, exome sequencing (ES), is considered a powerful and easily approachable method for genetic disorders in terms of rapid and cost-effective diagnostic yields. In this chapter, we give an overview of exome sequencing in the context of experimental and analytical methodologies. Approaches to ES include the sequencing capture technique, quality control processes at various stages of sequencing analysis, an exome data filtering strategy that incorporates both primary and secondary filtering, and prioritization of candidate variants in diagnosing genetic diseases.

Key words Exome sequencing, Library preparation, Capture kit, Variant annotation, In-silico analysis

1 Introduction

In the human population, rare disorders are likely to affect 0.95–1% of people at different stages of life [1]. Overall, about 85% of these disorders have a genetic basis with monogenic etiology [1]. OMIM (Online Mendelian Inheritance in Man) lists more than 3500 phenotypes describing genetic disorders with a known molecular basis [2]. Modern testing using NGS has led to a significant increase in the diagnosis of rare genetic disorders [3]. In this context, exome sequencing (ES) is considered one of the most powerful tools for identifying the variants causing genetic disorders, with a diagnostic yield of up to 57% [4]. These encouraging results highlight the need for systematic use and correlation of ES screens to develop accurate, faster, and cost-effective detection of rare genetic disorders [5]. A robust understanding of the genetic

Sudip Mandal (ed.), Reverse Engineering of Regulatory Networks, Methods in Molecular Biology, vol. 2719, https://doi.org/10.1007/978-1-0716-3461-5_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024


Fig. 1 Representative overview of the various stages of exome sequencing

cause and the development of an ES-based detection method could help millions of affected individuals in management, family planning, definite counseling, and exploring therapeutic options [6]. Consequently, the combinatorial approach of exome and gene screening could have an immense impact in rare genetic disorders and beyond. ES is an approach for sequencing the coding portion, that is, the exonic regions of the protein-coding genes, of an organism's genome [7]. Although the exome covers only about 1–2% of the entire genome, depending on the species, ES has a wide range of applications in genetic diseases, population genetics, cancer studies, and precision medicine [8]. Exome sequencing consists of three key processes: target enrichment, sequencing, and data analysis (Fig. 1). Enrichment is usually performed by polymerase chain reaction (PCR) amplification or by probe-target hybridization. PCR amplification uses specifically designed primers, while probe-target hybridization uses oligonucleotide probes directed at the target regions. About 4–5 Gb of sequencing is required per exome. Notably, exome sequencing can extend the target content to untranslated regions (UTRs) and microRNAs (miRNAs) to more fully elucidate gene regulation. The final step of ES is analysis; the analysis pipeline includes raw data quality control, pre-processing, sequence alignment, post-alignment processing, variant calling, variant annotation, variant filtering, and prioritization. Because ES uncovers genetic variants by targeting protein-coding regions, where an estimated 85% of known disease-causing mutations are found, exome sequencing significantly reduces sequencing costs and provides a clinically realistic strategy for diagnosing patients [9].


In the last decade, a massive amount of ES data has helped to identify more than 7000 genetic disorders with pathogenic variants in 4761 genes (OMIM, updated February 2023). Although more sophisticated sequencing technologies and approaches may supersede ES in the future, the data generated by ES will continue to be important for disease research. Smaller ES studies will contribute to large-scale consortium studies of very large cohorts to find extremely rare variants, leading to more collaborative and open research.
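The relationship between sequencing yield and expected read depth over the exome can be made concrete with a small calculation. The sketch below is a back-of-the-envelope estimate only; the function name, the target size, and the on-target fraction are assumptions that depend on the capture kit and the run, and none of them are taken from this chapter.

```python
def mean_target_depth(yield_bases, target_bases, on_target_fraction):
    """Approximate mean read depth over the captured target.

    yield_bases: total sequenced bases (e.g. 4.5e9 for 4.5 Gb)
    target_bases: size of the captured region in bases (kit dependent)
    on_target_fraction: share of sequenced bases that fall on target (0 to 1)
    """
    if target_bases <= 0 or not 0 <= on_target_fraction <= 1:
        raise ValueError("target size must be positive and fraction within 0-1")
    return yield_bases * on_target_fraction / target_bases
```

For instance, under the hypothetical values of a 45 Mb target and half of all bases on target, mean_target_depth(4.5e9, 45e6, 0.5) gives a mean depth of 50; real depth is uneven across targets, so the minimum depth at clinically relevant positions will be lower than this average.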

2 Materials

• Red blood cell lysis buffer: 155 mM NH4Cl, 12 mM NaHCO3, 0.1 mM ethylenediaminetetraacetic acid (EDTA) (pH of solution is 7.4).
• 0.5 M EDTA.
• White blood cell lysis buffer: 320 mM sucrose, 10 mM Tris-Cl, 5 mM MgCl2, 1% Triton X-100 (pH of solution is 8.0).
• 20% sodium dodecyl sulfate (SDS).
• 10 mg/mL proteinase K solution.
• 3 M sodium acetate (pH of solution is 5.2).
• Phosphate buffered saline: 137 mM NaCl, 2.7 mM KCl, 8 mM Na2HPO4, and 2 mM KH2PO4 (pH of solution is 7.4).
• Fibroblast cell lysis buffer: 150 mM NaCl, 1% Triton X-100, 0.5% sodium deoxycholate, 0.1% SDS, 50 mM Tris-HCl (pH of solution is 8.0).

3 Methods

The overall method of exome sequencing is shown in Fig. 2.

3.1 DNA Extraction

Genomic DNA (gDNA) is extracted from peripheral white blood cells or skin fibroblasts. 1–2 μg of genomic DNA at a concentration of 25–50 ng/μL and an A260/A280 purity of 1.8 (ratio of the absorbance of the sample at λ = 260 nm and λ = 280 nm in a spectrophotometer) is sufficient to perform whole exome sequencing. While several commercial vendors offer gDNA extraction kits, a conventional laboratory method for gDNA purification from human cells uses phenol-chloroform solvent extraction.

3.1.1 Phenol-Chloroform Extraction of Genomic DNA from Peripheral White Blood Cells

1. 2–3 mL of peripheral blood is collected in an EDTA vacutainer tube (Becton, Dickinson, BD; 366643). Later, this blood is transferred to a 15 mL conical sterile polypropylene centrifuge tube (Thermo Fisher Scientific, TFS; 339650).


Fig. 2 The overall method and different steps of exome sequencing

2. To the blood sample, 10 mL of RBC lysis buffer is added and incubated at 37 °C for 30 min. (Note: If the cells are biopsied skin fibroblasts, add the cell lysis buffer directly to the cells and proceed to step 8.)
3. Centrifuge the cell lysate suspension (in the 15 mL centrifuge tube) at 3000 rpm for 10 min at 25 °C.
4. After centrifugation, collect the clear cell pellet and discard the supernatant.
5. Steps 2–4 are repeated until a clear white pellet remains after discarding the supernatant.
6. The cell pellet is mixed with 1.5–2 mL of WBC lysis buffer and 20 μL of 20% SDS. Mix the suspension well by vortexing until the solution is homogeneous. Add 50 μL Proteinase K (TFS; EO0491) to the suspension and incubate for 15 min at 56 °C.
7. Incubate the suspension in a water bath at 37 °C. After 48 h of incubation, transfer equal volumes (~500 μL in each tube) of the suspension to 1.5 mL microcentrifuge tubes.
8. Add 500 μL/tube of Tris buffer saturated phenol (Sigma Aldrich, SA; P4557) to the suspension and mix well for 10–12 min.
9. Centrifuge at 14,000 rpm for 10 min at 4 °C.
10. After centrifugation, transfer the clear supernatant to a new microcentrifuge tube and add chloroform:isoamyl alcohol (SA; C0549) (in a ratio of 24:1) solvent mixture.
11. The suspension is mixed well for 10 min and then centrifuged at 13,000 rpm for 15 min at 4 °C.


12. Transfer the clear supernatant to a fresh microcentrifuge tube and add 1/10 volume (~40 μL) of 3 M sodium acetate (pH 5.5) (TFS; AM9740) with 1 mL of chilled ethanol (100%) (SA; 1.00983). Mix the final suspension well by inverting.
13. The suspension is kept at -20 °C for 12 h and then centrifuged at 13,000 rpm for 15 min at 4 °C. The supernatant is discarded, and the pellet is retained.
14. Add 500 μL of 70% ethanol to the pellet to wash it.
15. Centrifuge the suspension at 14,000 rpm for 10 min at 4 °C.
16. Air-dry the pellet in the microcentrifuge tubes and add 20–40 μL of Tris-EDTA buffer (TE) (TFS; 12090015) or deionized water. Allow the tube to stand at room temperature for 1–2 h.
17. Determine the quality and concentration of gDNA using a NanoDrop spectrophotometer (TFS) at λ = 260 nm. Note the optical density (OD) ratio at 260 nm/280 nm. After satisfactory isolation, the gDNA can be stored at -20 °C to -80 °C.

3.1.2 gDNA Extraction with Kits from Commercial Suppliers

1. Several gDNA extraction kits are available from commercial vendors (e.g., Qiagen, Promega, Thermo Fisher, Himedia, Bio-Rad, etc.), and these kits require minimal steps and time to obtain gDNA of good quality and quantity.
2. The gDNA is extracted using the silica membrane filter method according to the manufacturer’s protocol.

Note: The best gDNA extraction process should be sensitive, consistent, and rapid. It should prevent cross-contamination of samples. Importantly, the gDNA extraction method should yield DNA samples of good concentration with purity suitable for use in downstream applications.
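The quantity and purity criteria from Sect. 3.1 (1–2 μg of gDNA at 25–50 ng/μL with an A260/A280 ratio of about 1.8) can be checked with simple arithmetic once the absorbance readings are available. The sketch below assumes the common conversion of about 50 ng/μL of double-stranded DNA per A260 unit at a 1 cm path length; the function names and the dilution parameter are illustrative assumptions.

```python
DSDNA_NG_PER_UL_PER_A260 = 50.0  # standard approximation for dsDNA, 1 cm path


def dsdna_concentration(a260, dilution_factor=1.0):
    """Estimated dsDNA concentration in ng/uL from an A260 reading."""
    return a260 * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


def purity_ratio(a260, a280):
    """A260/A280 ratio; values near 1.8 are generally taken as pure DNA."""
    return a260 / a280


def volume_for_mass(mass_ng, concentration_ng_per_ul):
    """Volume in uL that contains the requested mass of DNA."""
    return mass_ng / concentration_ng_per_ul
```

As a usage example, a hypothetical reading of A260 = 0.8 corresponds to roughly 40 ng/μL, so volume_for_mass(1000, 40.0) shows that 25 μL of that sample would contain 1 μg of gDNA.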

3.2 Exome Library Preparation 3.2.1

DNA Fragmentation

1. Before sequencing, gDNA is fragmented, end-repaired, and ligated to adapters for library preparation.
2. There are three fragmentation methods: physical, chemical, and enzymatic. The most practical and commonly used laboratory method is physical fragmentation of gDNA by sonication [10].
3. For the Illumina platform, DNA is fragmented into ~100–200 bp fragments. The appropriate fragment length is always application specific, and the size of the target DNA is an important parameter for library construction.
4. The gDNA is usually stored in TE buffer, pre-chilled on ice for 5–10 min, and then mixed vigorously on a bench shaker.


Table 1 Conditions for ultrasonic treatment of the gDNA

Target size | Cycle condition (on/off cycle time, s) | Cycle number
150 bp | 30″/30″ | 30
200 bp | 30″/30″ | 13
300 bp | 30″/90″ | 6
400 bp | 15″/90″ | 7–8
1000 bp | 5″/90″ | 7–8

Table 2 Components for the DNA sample (perform the bead purification according to the manufacturer’s protocol)

Component | Volume or amount
Sheared DNA | Up to 50.0 μg
End-repair 10× buffer | 9.0 μL
End-repair enzyme mix | 5.0 μL
Nuclease-free water | To 90.0 μL
Total | 90.0 μL

Ultrasonic treatment of the gDNA (concentration range: 15–50 ng/μL) is performed under the conditions listed in Table 1 for DNA fragments of different sizes. (Note: The sonication amplitude is kept at 75% of the maximum amplitude.)

3.2.2 Library Construction and Clean-up

1. Library preparation from the fragmented DNA can be performed using various capture kits according to the manufacturer’s protocol. An example using the NEBNext End-Repair Module (NEB; E6050L) is as follows.
2. Incubate the DNA sample with the components listed in Table 2 and perform the bead purification according to the manufacturer’s protocol.
3. After end repair, DNA can be purified with any commercially available kit. For example, purification with the MinElute column (Qiagen; 28004) includes the following steps.
4. Centrifuge the column at 14,000 × g for 1 min or place it on a vacuum manifold and draw the liquid through it.
5. Add 500 μL of buffer PB mixed with ethanol to the DNA sample. Pull the DNA sample through the column by centrifugation at 14,000 × g for 1 min. Discard the flow-through.


Table 3 Components for the A-tailing reaction (set up using the purified DNA)

Component | Volume (μL) or amount
Purified DNA | Up to 30 μg
Adenylation buffer | 5
Klenow reagent/exonuclease | 3
dATP | 10
Water | To 50
Total volume | 50

6. Add 750 μL of buffer PE and centrifuge the column at 14,000 × g for 1 min, or place it on a vacuum manifold and pull the liquid through.
7. Discard the flow-through. Centrifuge the column at 14,000 × g for 1 min.
8. Place the column in a clean 1.7 mL tube.
9. Add 16 μL of buffer EB to the column and incubate for 1 min at room temperature.
10. Centrifuge the column at 14,000 × g for 1 min.
11. If necessary, add another 16 μL of buffer EB to the column and incubate for 1 min at room temperature, then centrifuge at 14,000 × g for 1 min.
12. Finally, transfer the sample to a 0.2 mL PCR tube and discard the column.

3.2.3 End Adenylation (A-tailing)

1. An A-tailing reaction is set up using the purified DNA and the components listed in Table 3.
2. Mix the contents gently and purify the A-tailed DNA using an appropriate kit according to the manufacturer’s protocol.

3.2.4 Adapter Ligation

1. Combine the purified product from the A-tailing reaction (dA-tailed DNA) and a four- to five-fold excess of dT-tailed adapter in a tube, mix well, and centrifuge briefly.
2. Add the buffer and water to the mix and centrifuge briefly.
3. Add the DNA ligase. Pipette to mix, then centrifuge briefly.
4. Incubate the reaction for 15 min at room temperature. (Note: The incubation period can be increased up to 1 h if the yield of the initial incubation is not satisfactory.)
5. Purify the DNA using a MinElute column by following the steps described earlier, except using 10 μL of EB elution buffer.
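The amount of adapter needed for the four- to five-fold excess in step 1 depends on how many DNA molecules are in the reaction, not on their mass. The following Python sketch shows one way to estimate it, assuming the excess is meant as a molar excess and using the standard average of about 660 g/mol per base pair of double-stranded DNA. The function names and example values are hypothetical.

```python
# Minimal sketch: picomoles of dsDNA fragments and of adapter for a given molar excess.
# Assumption: average molar mass of ~660 g/mol per base pair of dsDNA.
AVG_BP_MASS_G_PER_MOL = 660.0


def dsdna_pmol(mass_ng, length_bp):
    """Convert a mass of dsDNA (ng) of a given mean length (bp) to picomoles."""
    if mass_ng < 0 or length_bp <= 0:
        raise ValueError("mass_ng must be >= 0 and length_bp must be > 0")
    # ng -> g is 1e-9 and mol -> pmol is 1e12, which leaves a net factor of 1000.
    return mass_ng * 1000.0 / (AVG_BP_MASS_G_PER_MOL * length_bp)


def adapter_pmol_needed(insert_ng, insert_bp, fold_excess=5.0):
    """Return the picomoles of adapter giving the requested molar excess."""
    if fold_excess <= 0:
        raise ValueError("fold_excess must be positive")
    return fold_excess * dsdna_pmol(insert_ng, insert_bp)


if __name__ == "__main__":
    insert = dsdna_pmol(mass_ng=500, length_bp=200)
    adapter = adapter_pmol_needed(insert_ng=500, insert_bp=200, fold_excess=5)
    print(f"insert: {insert:.2f} pmol, adapter: {adapter:.2f} pmol")
```

Commercial ligation kits usually specify their own adapter amounts, which take precedence over this estimate.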

3.2.5 Target Enrichment

1. After bead clean-up, 10 μL of DNA is used for initial PCR amplification of the fragmented DNA.
2. After a second clean-up, fragment sizes are assessed using an Agilent Technologies 2100 Bioanalyzer with an Agilent DNA 1000 chip.
3. Currently, capture kits from Life Technologies (SOLiD), Pacific Biosciences (RS), Life Technologies (Ion Proton), Roche (454 Genome Sequencer), and Illumina (HiSeq) are available [11]. Illumina’s Nextera Rapid Capture Exome kit is used for two rounds of capture on the pooled sample set.
4. The captured fragments are amplified in an additional PCR step. After post-PCR clean-up, the library is eluted in 30 μL of resuspension buffer.
5. The Agilent Technologies 2100 Bioanalyzer and an Agilent High Sensitivity DNA Chip are used for quality control of the final library.
6. The Qubit dsDNA HS assay kit and Qubit 2.0 fluorometer are used to determine the library concentration. For sequencing, libraries are diluted to a final concentration of 4 nM with insert sizes ranging from 200 bp to 1 kb. Illumina’s NextSeq 500 sequencer and NextSeq™ 500 High Output Kit are used for sequencing (300 cycles).
7. About 95% of bases are covered at >20×, with a sensitivity of >90%. The average depth of coverage is 130×.
8. The parameter information and exome sequencing results are stored in a local repository. For example, the information can be stored on a PowerEdge R730XD rack server with a 14-core, 2.0 GHz Intel Xeon E5-2660 v4 CPU, 405 GB RAM, and 96 TB of data storage space. For exome sequencing data storage, a tower server (T630) with 256 GB RAM, 16 TB of storage space, and an Intel Xeon E5-2650 v3 processor can be used.

The automated bioinformatics pipeline produces a processed output (.vcf) file in three steps: alignment, variant calling, and variant classification. These steps may change depending on the sequencing methods and capture kits used.
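Step 6 dilutes each library to 4 nM, which requires converting the fluorometric concentration (ng/μL) into molarity using the mean fragment size from the Bioanalyzer trace. The Python sketch below shows the usual conversion, again assuming about 660 g/mol per base pair. The function names, the 20 μL final volume, and the example readings are hypothetical.

```python
# Minimal sketch: convert a library concentration to nM and plan a dilution to 4 nM.
# Assumption: average molar mass of ~660 g/mol per base pair of dsDNA.
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Return the library molarity in nM."""
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment size must be > 0")
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def dilution_volumes(stock_nm, target_nm=4.0, final_volume_ul=20.0):
    """Return (library_ul, diluent_ul) needed to reach target_nm in final_volume_ul."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    library_ul = target_nm * final_volume_ul / stock_nm
    return library_ul, final_volume_ul - library_ul


if __name__ == "__main__":
    stock = library_molarity_nm(conc_ng_per_ul=3.3, mean_fragment_bp=400)
    library_ul, diluent_ul = dilution_volumes(stock)
    print(f"stock: {stock:.2f} nM; mix {library_ul:.2f} uL library "
          f"with {diluent_ul:.2f} uL diluent")
```

Because the library spans 200 bp to 1 kb, the mean fragment size strongly affects the result and should be taken from the measured size distribution rather than assumed.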

3.3 Variant Annotation

1. The raw sequences (reads) generated as FASTQ files after sequencing are subjected to overall quality control using the FastQC toolkit [12]. Quality control (QC) parameters such as sequence quality score, base quality score, read length, GC content, PCR amplification issues, and sequence duplication issues are considered.
2. Pre-processing steps involve removal of 3′-end adapters, sequence trimming, and filtering of low-quality reads with tools such as PRINSEQ [13] and QC3 [14].


3. After pre-processing of the FASTQ files, the sequence reads are aligned to the human reference genome (GRCh38) using BWA-MEM [15], v0.7.15.
4. Picard (v2.5.0) [16] is used to sort and index the aligned reads, which are then post-processed using the Genome Analysis Toolkit (GATK) [17] Best Practices Pipeline v4.1.2. Base quality separation is used to identify single nucleotide variants (SNVs), insertions, and deletions (INDELs).
5. Realignment is performed using GATK RealignerTargetCreator, IndelRealigner, and SnpEff around known INDELs and SNVs, followed by recalibration of base quality values using GATK BaseRecalibrator [18].
6. The genomic VCF (gVCF) file is generated for each sample using GATK HaplotypeCaller with the BED file specific to the exome capture kit used for sequencing [19].
7. The entire cohort is genotyped together to generate a multisample VCF file, followed by GATK variant quality score recalibration (VQSR) and normalization using BCFTOOLS v1.3.1 [20].
8. Customized Perl scripts are used to determine allele status (number of heterozygotes and homozygotes).
9. During downstream analysis, variants with call frequency below 8% are filtered out. KING [21] is used to calculate the pairwise kinship coefficient of the aggregated data, and the suggested threshold of 0.34 is used to identify duplicate samples or monozygotic twins, as indicated in the manual of KING.
10. ANNOVAR [22] is used to annotate the called variants from the multisample VCF file.
11. Variant quality is assessed based on genotype quality and read depth.
12. Variants are annotated using publicly available population databases such as RefGene, gnomad_exome, gnomad_genome, snp138, clinvar_20190305, exac03, and avsnp150.

3.4 Variant Calling and Annotation
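Both the allele-status step of Subheading 3.3 (step 8) and the cohort counts used in this subheading rely on tallying heterozygous and homozygous genotypes per variant. The chapter uses customized Perl scripts for this, and those scripts are not shown. The following Python sketch is only an illustration of the idea for one data line of a multisample VCF, assuming tab-separated fields with a GT entry in the FORMAT column. The function name and the example line are hypothetical.

```python
# Minimal sketch: count genotype classes across samples in one VCF data line.
# This is not the authors' Perl script; it only illustrates the tallying step.
def allele_status_counts(vcf_line):
    """Return counts of het, hom_alt, hom_ref and missing genotypes."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")      # column 9 is FORMAT
    gt_index = format_keys.index("GT")
    counts = {"het": 0, "hom_alt": 0, "hom_ref": 0, "missing": 0}
    for sample in fields[9:]:               # sample columns start at column 10
        genotype = sample.split(":")[gt_index]
        alleles = genotype.replace("|", "/").split("/")  # phased or unphased
        if "." in alleles:
            counts["missing"] += 1
        elif len(set(alleles)) > 1:
            counts["het"] += 1
        elif alleles[0] == "0":
            counts["hom_ref"] += 1
        else:
            counts["hom_alt"] += 1
    return counts


if __name__ == "__main__":
    line = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20\t1/1:15\t0/0:30\t./.:0\t1|0:12"
    print(allele_status_counts(line))
```

In this sketch a genotype with two different non-reference alleles (for example 1/2) is counted as heterozygous, and haploid calls are counted as homozygous; a production script would need to decide how to treat both cases.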

1. ANNOVAR is used to annotate the multisample VCF files for the original and refined cohorts with RefGene, gnomad_exome, gnomad_genome, snp138, clinvar_20190305, exac03, and avsnp150.
2. The ANNOVAR -xref argument is used to include gene-based cross-reference annotations, including different intolerance ratings, tissue-specific expression, and the number of homozygous loss-of-function variants found in gnomAD.
3. The “genericdbfile” option of ANNOVAR is used to include the number of heterozygotes and homozygotes from the cohorts as well as the proportion of variants expressed across transcripts (pext) [23].


4. Perl scripts are used to incorporate disease phenotypes listed in OMIM according to ANNOVAR annotations.
5. ANNOVAR is used to annotate the variations identified in both cohorts.
6. Several BASH and AWK commands, internal Perl scripts, and BCFTOOLS (v1.3.1) are used to create the variant profiles.
7. No hard filtering is applied to the variant datasets, to avoid missing disease-causing variants that might be present in underserved areas.
8. Variants are classified as “common” if their minor allele frequency (MAF) is 1% or greater, “rare” if it is less than 1%, and “extremely rare” if it is less than 0.01%.
9. Population-specific variants are those that are not present in gnomAD, the AVSNP150 build, or the SNP138 build. VariantValidator is used to translate ANNOVAR-derived variants into HGVS nomenclature.

3.5 Variant Prioritization

1. Variant quality is assessed by read depth and genotype quality, and allele frequencies are checked in publicly available datasets such as gnomAD [24], GenomeAsia [25], Singapore Genome Project [26], and the internal variant database. Variants with a minor allele frequency (MAF) of 1% or greater are filtered out.
2. Synonymous variants, variants found in internal databases, and variants found in public databases are eliminated.
3. Exonic and splice variants are considered for additional analysis. In silico methods are used to predict the pathogenicity of clinically significant variants.
4. After these filtration steps, the number of candidate variants can typically be reduced to 15–30 homozygous variants for an autosomal recessive mode of inheritance and 30–50 heterozygous variants for an autosomal dominant mode of inheritance in different disease-causing genes.
5. Calls are verified by requiring a genotype quality > 60 and at least 10–15 supporting reads, and by visualization in Integrative Genomics Viewer (IGV).
6. Variants that computational analysis of pathogenicity and conservation predicts to be deleterious and highly conserved are classified as “pathogenic,” “likely pathogenic,” or “variant of uncertain significance” according to the ACMG guidelines [27].
7. Sanger sequencing is used for validation and segregation analysis of pathogenic variants.
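The frequency classes from Subheading 3.4 (step 8) and the filters in steps 1–5 above can be combined into a simple prioritization pass. The Python sketch below is a simplified illustration under stated assumptions: the dictionary keys (`maf`, `effect`, `region`, `gq`, `reads`) are hypothetical field names, a missing MAF is taken to mean the variant is absent from the population databases, and the lower bound of 10 supporting reads is used. It omits the internal-database check and the IGV inspection.

```python
# Minimal sketch: classify variants by MAF and apply basic prioritization filters.
# Field names are hypothetical; thresholds follow the values quoted in the text.
def classify_by_maf(maf):
    """Return the frequency class for a minor allele frequency given as a fraction."""
    if maf is None:
        return "absent from population databases"
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be a fraction between 0 and 1")
    if maf >= 0.01:          # 1% or greater
        return "common"
    if maf < 0.0001:         # less than 0.01%
        return "extremely rare"
    return "rare"


def passes_call_quality(genotype_quality, supporting_reads, min_gq=60, min_reads=10):
    """Genotype quality must exceed min_gq with at least min_reads supporting reads."""
    return genotype_quality > min_gq and supporting_reads >= min_reads


def prioritize(variants):
    """Keep non-common, non-synonymous exonic or splice variants with good calls."""
    kept = []
    for variant in variants:
        if classify_by_maf(variant.get("maf")) == "common":
            continue
        if variant.get("effect") == "synonymous":
            continue
        if variant.get("region") not in {"exonic", "splicing"}:
            continue
        if not passes_call_quality(variant["gq"], variant["reads"]):
            continue
        kept.append(variant)
    return kept
```

A typical call would pass a list of annotated variant records, for example `prioritize(records)`, and hand the surviving variants to the in silico analysis and ACMG classification steps.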

3.6 In silico Analysis

1. In silico tools such as MutationTaster, CADD, M-CAP, MetaLR, FATHMM-MKL, and HSF are used to predict the pathogenicity of variants detected by Sanger and exome sequencing.
2. Variant conservation in different mammalian species is predicted using GERP.
3. The details of in silico tools and their descriptions are given in Table 4.
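Table 4 notes that higher CADD PHRED values indicate a greater likelihood of deleterious consequences. As a rough guide to reading these values, the sketch below assumes the PHRED-like scaling described in the CADD documentation, in which the scaled score is −10·log10(rank/total) over all ranked substitutions, so that a score of 10 corresponds to the top 10% and a score of 20 to the top 1%. The function names are hypothetical, and the sketch does not compute CADD scores themselves.

```python
# Minimal sketch: relate PHRED-scaled scores to the fraction of top-ranked variants.
# Assumption: scaled score = -10 * log10(rank / total), as in PHRED-like scaling.
import math


def phred_scaled(rank, total):
    """Return the PHRED-like score for a variant ranked `rank` out of `total`."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)


def top_fraction(phred_score):
    """Return the fraction of variants ranked at or above a PHRED-like score."""
    if phred_score < 0:
        raise ValueError("phred_score must be non-negative")
    return 10.0 ** (-phred_score / 10.0)


if __name__ == "__main__":
    for score in (10, 20, 30):
        print(f"score {score}: top {top_fraction(score):.3%} of ranked variants")
```

Any cutoff chosen on this scale is a laboratory decision; the chapter itself does not specify one.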

3.7 Applications

1. Exome sequencing is useful for identifying rare variants, including substitutions, deletions, insertions, duplications, and copy number changes, in Mendelian diseases and complex disorders.
2. It also helps in identifying causative genes or variants for various diseases and provides a low-cost, high-throughput method for discovering relevant variants.
3. ES helps in the identification of somatic mutations in the cancer genome.
4. Compared to whole genome sequencing, WES provides cost-effective analysis. As most disease-causing variants lie in exonic regions, this method is likely to give a high diagnostic yield at lower cost.
5. ES also helps in identifying de novo variants using a trio-exome sequencing approach with probands and their biological parents [28].
6. Copy-number variants (CNVs) can be called from ES data and have been implicated in many rare Mendelian disorders [29].

4 Notes

1. Since exome sequencing covers about 97% of exons, the method does not cover all the protein-coding regions of all genes in the human genome. Therefore, disease-causing mutations in the missing exons cannot be identified.
2. It is important to remember that the goal of any capture kit is to achieve consistent coverage of the desired targets. Therefore, the variability in capturing the non-target regions from run to run is due to technological limitations. It is very difficult to analyze the variants present in the poorly and moderately covered regions of the exome.
3. Consistent and good coverage is also not sufficient to identify significant variants without a robust analysis pipeline.

Table 4 In silico tools for analysis of ES variants

In silico tools | Description | Score interpretation | Web resources

For missense and nonsense variants
GERP++ RS | Genomic evolutionary rate profiling (GERP) uses maximum likelihood evolutionary rate estimation to produce position-specific estimates of evolutionary constraint | Scores range from -12.3 to 6.17; a larger score indicates a more conserved site | -
CADD13 PHRED | CADD version v1.5 | Higher numbers suggest that a variation is more likely to have deleterious consequences | https://cadd.gs.washington.edu/
SIFT | SIFT score predicts whether an amino acid substitution affects protein function | D: deleterious; T: tolerated | https://sift.bii.a-star.edu.sg
Polyphen2 HDIV (polymorphism phenotyping v2) | Predicts possible impact of an amino acid substitution on the structure and function of a human protein | D: damaging; P: possibly damaging; B: benign | http://genetics.bwh.harvard.edu/pph2/
Polyphen2 HVAR (polymorphism phenotyping v2) | Predicts possible impact of an amino acid substitution on the structure and function of a human protein (pp2hvar should be used for diagnostics of Mendelian diseases) | D: damaging; P: possibly damaging; B: benign | http://genetics.bwh.harvard.edu/pph2/
LRT (likelihood ratio test) | LRT for significantly conserved amino acid positions | D: deleterious; N: neutral; U: unknown | -
MutationTaster | Predicts the effect of a mutation/variation | A: disease causing automatic; D: disease causing; N: polymorphism [probably harmless]; P: polymorphism automatic [known to be harmless] | https://www.mutationtaster.org/
FATHMM | FATHMM can estimate the effects of non-synonymous single nucleotide variations (nsSNVs) and coding variants on the human genome’s functionality | D: deleterious; T: tolerated | http://fathmm.biocompute.org.uk/
PROVEAN | PROVEAN predicts how an amino acid change or INDEL will affect a protein’s biological function | D: damaging; N: neutral | https://www.jcvi.org/research/provean
MetaLR | MetaLR uses logistic regression (LR) to give a more accurate and comprehensive evaluation of the deleteriousness of missense mutations. MetaLR scores are calculated by the dbNSFP project | D: deleterious; T: tolerated | https://wglab.org/members/15-memberdetail/36-coco-dong
M-CAP | M-CAP is the first pathogenicity classifier for rare missense variants in the human genome | T: tolerated; D: damaging | http://bejerano.stanford.edu/mcap/
MetaSVM | MetaSVM is a meta-analytic support vector machine (SVM) that can accommodate multiple omics data, making it possible to detect consensus genes associated with diseases across studies | D: deleterious; T: tolerated | https://sites.google.com/site/jpopgen/dbNSFP
dbNSFP | dbNSFP predicts and annotates the functional effects of all probable non-synonymous single-nucleotide variations (nsSNVs) in the human genome | The score ranges from 0 to 1; D: damaging, 0.15 | http://sites.google.com/site/jpopgen/dbNSFP
MutationAssessor | Predicts the effects of amino acid alterations in proteins, such as cancer mutations or missense polymorphisms, on their functional properties | H: high; M: medium; L: low; N: neutral (H/M: functional; L/N: non-functional) | http://mutationassessor.org/r3/
fathmm-MKL | Predicts the functional consequences of non-coding and coding single nucleotide variants (SNVs) | D: damaging; N: neutral | https://fathmm.biocompute.org.uk/fathmmMKL.htm
REVEL | Method for predicting the pathogenicity of missense variants based on a combination of tools such as MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons | The REVEL score can range from 0 to 1, with higher scores reflecting greater likelihood that the variant is disease-causing | https://sites.google.com/site/revelgenomics/
ClinPred | Prediction tool to identify disease-relevant nonsynonymous single-nucleotide variants | Scores range from 0 to 1, with higher scores reflecting greater likelihood that the variant is disease-causing | https://sites.google.com/site/clinpred/
DANN | Deep neural networks (DNNs) are preferable for annotating the pathogenicity of whole-genome genetic variants | The score can range from 0 to 1; higher values are more likely to be deleterious | https://cbcl.ics.uci.edu/public_data/DANN/data/
VEST 3.0 | Variant effect scoring tool (VEST) 3.0 determines the functional significance of missense mutations | - | http://wiki.chasmsoftware.org
phyloP | Computation of p-values for conservation or acceleration, either lineage-specific or across all branches | Positive scores: predicted to be conserved; negative scores: predicted to be fast-evolving | http://compgen.cshl.edu/phast/
phastCons | Conservation scoring and identification of conserved elements | - | http://compgen.cshl.edu/phast/
SiPhy | SiPhy uses exact statistical analyses to identify bases that are being selected from various alignment data | Score ranges between 0 and 1 | https://portals.broadinstitute.org/genome_bio/siphy/
fitCons | fitCons, the fitness consequences of functional annotation, combines INSIGHT-based selective pressure inference with functional tests (like ChIP-Seq) | The score can range from 0 to 1 | http://compgen.cshl.edu/fitCons/
Eigen | Eigen is a spectral approach to the functional annotation of genetic variants in coding and noncoding regions | - | http://www.columbia.edu/~ii2135/eigen.html

Splicing predictions
SpliceAI | Uses pre-computed scores provided by Illumina for SNVs and small INDELs | - | https://spliceailookup.broadinstitute.org/
Human splicing finder | GENOMNIS has developed HSF for splicing prediction analysis | - | http://umd.be/Redirect.html

Combined prediction database
Varsome | Suite of bioinformatics tools for processing and annotation of NGS data and variants | - | https://varsome.com/
Franklin | Franklin helps to analyze the variant | - | https://franklin.genoox.com/clinical-db/home
Varcards | Varcards uses resources including CADD, ClinVar, COSMIC, ICGC, InterPro, DANN, OMIM, Polyphen-2, REVEL, RVIS, VEST 3.0, etc. | - | http://159.226.67.237/sun/varcards/

Disease-related
ClinVar | ClinVar provides clinical significance of the variant | Unknown; Benign; Likely-benign; Likely-pathogenic; Pathogenic; Conflicting interpretations of pathogenicity | https://www.ncbi.nlm.nih.gov/clinvar/
HGMD | The human gene mutation database (HGMD) is an effort to compile all known (published) gene defects linked to inherited human disease | Unknown; Benign; Likely-benign; Likely-pathogenic; Pathogenic; Conflicting interpretations of pathogenicity | https://www.hgmd.cf.ac.uk/ac/index.php
GTEx portal gene | The genotype-tissue expression (GTEx) project is an initiative to create a comprehensive public resource to research the regulation and expression of genes that are unique to different tissues. The gene name linked to the variation is provided by GTEx | - | https://gtexportal.org
OMIM phenotype | OMIM is a publicly accessible, regularly updated database of human genes and genetic traits. It is thorough and authoritative | Disorder associated with the gene, cataloged in OMIM | https://www.omim.org/
HPO terms | The human phenotype ontology (HPO) offers a defined database of phenotypic anomalies seen in human illness | - | https://hpo.jax.org/app/

Population databases
gnomAD | The genome aggregation database (gnomAD) is a coalition of large-scale exome and genome data | - | https://gnomad.broadinstitute.org/
genomeAsia | GenomeAsia 100K is an innovative collaboration between academics, research institutions, and for-profit businesses that combines the rigorousness of academic research | - | https://browser.genomeasia100k.org/
Singapore genome project | This project intends to evaluate the degree of shared variation in the human genome across at least one million single nucleotide polymorphisms (SNPs) for DNA samples from each of Singapore’s three ethnic groups: Chinese, Malays, and Indians | - | https://blog.nus.edu.sg/sshsphphg/singaporegenome-variation/
1000Genomes | In order to provide a comprehensive database on human genetic diversity, the 1000 Genomes Project was the first initiative to sequence the genomes of a significant number of individuals | - | http://www.1000genomes.org/
dbSNP | A large collection of basic genetic polymorphisms that is archived in the public-domain single nucleotide polymorphism database (dbSNP) | - | https://www.ncbi.nlm.nih.gov/snp
Kaviar | Kaviar is a database of SNVs, INDELs, and complex variants found in humans. It is used to assess the novelty and frequency of discovered variants | - | http://db.systemsbiology.net/kaviar/


4. ES is not an appropriate technique for detecting structural variations (SVs), such as large copy number variants (CNVs), inversions, and translocations.
5. Since ES only identifies genetic variations in the exonic regions of the genome, variations in non-coding regions of genes and in non-genic regions (such as promoter regions and intergenic regions) are not detected by the ES process.
6. In developing countries, the cost of ES is somewhat higher than that of other clinical procedures.

5 Conclusion

Advances in sequencing technology and in large-scale data analysis and management will greatly expand the frontiers of exome sequencing applications. Longer reads, identification of longer INDELs, fewer base determination errors, high-quality sequences, and accurate reference genomes will be the result of advances in sequencing technology [30]. Because of advances in sequencing technology and informatics, the study of genetics has made remarkable strides in the last decade. As sequencing costs and turnaround times fall, it will become possible to use reference genomes that are specific to a geographic population, ethnicity, etc., eliminating the need to use a single reference genome in variant detection. In addition to identifying variants associated with rare diseases, ES is also being used in cell-omics such as the transcriptome, epigenome, proteome, and metabolome [31]. With decreasing sequencing costs and increasing use, whole-exome sequencing has become an indispensable tool in the therapeutic field, providing rapid, cost-effective, and more reliable results. However, one of the major obstacles is the translation of the obtained information into clinical practice. To solve all Mendelian diseases, decipher the correlation between phenotype and genotype, understand disease mechanisms and their systems biology, and finally translate the information into improved diagnoses and therapeutics, ES researchers from around the world need to collaborate.

References

1. Warr A, Robert C, Hume D, Archibald A, Deeb N, Watson M (2015) Exome sequencing: current and future perspectives. G3 Genes Genomes Genetics 5(8):1543–1550
2. McKusick VA (2000) Online Mendelian inheritance in man, OMIM™. McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine, Bethesda. World Wide Web URL: https://omim.org
3. Ross JP, Dion PA, Rouleau GA (2020) Exome sequencing in genetic disease: recent advances and considerations. F1000Research 9:336
4. Dillon OJ, Lunke S, Stark Z, Yeung A, Thorne N, Melbourne Genomics Health Alliance, Gaff C, White SM, Tan TY (2018) Exome sequencing has higher diagnostic yield compared to simulated disease-specific panels in children with suspected monogenic disorders. Eur J Hum Genet 26(5):644–651
5. Fernandez-Marmiesse A, Gouveia S, Couce ML (2018) NGS technologies as a turning point in rare disease research, diagnosis and treatment. Curr Med Chem 25(3):404–432
6. Marwaha S, Knowles JW, Ashley EA (2022) A guide for the diagnosis of rare and undiagnosed disease: beyond the exome. Genome Med 14(1):1–22
7. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J (2011) Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet 12(11):745–755
8. Shickh S, Mighton C, Uleryk E, Pechlivanoglou P, Bombard Y (2021) The clinical utility of exome and genome sequencing across clinical indications: a systematic review. Hum Genet 140:1403–1416
9. Seaby EG, Pengelly RJ, Ennis S (2016) Exome sequencing explained: a practical guide to its clinical application. Brief Funct Genomics 5(5):374–384
10. Van Dijk EL, Jaszczyszyn Y, Thermes C (2014) Library preparation methods for next-generation sequencing: tone down the bias. Exp Cell Res 322(1):12–20
11. Chilamakuri CS, Lorenz S, Madoui MA, Vodák D, Sun J, Hovig E, Myklebost O, Meza-Zepeda LA (2014) Performance comparison of four exome capture systems for deep sequencing. BMC Genomics 15:1–4
12. Andrews S (2010) FASTQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
13. Schmieder R, Edwards R (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics 27(6):863–864
14. Guo Y, Zhao S, Sheng Q, Ye F, Li J, Lehmann B, Pietenpol J, Samuels DC, Shyr Y (2014) Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics 103(5–6):323–328
15. Li H (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997
16. Picard Toolkit (2019) Broad Institute, GitHub Repository, https://broadinstitute.github.io/picard/
17. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E (2013) From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics 43:11.10.1–11.10.33
18. Tian S, Yan H, Kalmbach M, Slager SL (2016) Impact of post-alignment processing in variant discovery from whole exome data. BMC Bioinf 17:1–13
19. Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993
20. Ulintz PJ, Wu W, Gates CM (2019) Bioinformatics analysis of whole exome sequencing data. In: Chronic lymphocytic leukemia: methods and protocols. Humana Press, New York, pp 277–318
21. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867–2873
22. Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38(16):e164
23. Cummings J, Lee G, Ritter A, Sabbagh M, Zhong K (2020) Alzheimer’s disease drug development pipeline: 2020. Alzheimer’s Dement Transl Res Clin Intervent 6(1):e12050
24. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD (2020) The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581(7809):434–443
25. GenomeAsia KC (2019) The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576:106–111
26. Wu D, Dou J, Chai X, Bellis C, Wilm A, Shih CC, Soon WW, Bertin N, Lin CB, Khor CC, DeGiorgio M (2019) Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell 179(3):736–749
27. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, Grody WW, Hegde M, Lyon E, Spector E, Voelkerding K (2015) Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 17(5):405–423
28. Retterer K, Juusola J, Cho MT, Vitazka P, Millan F, Gibellini F, Vertino-Bell A, Smaoui N, Neidich J, Monaghan KG, McKnight D (2016) Clinical application of whole-exome sequencing across clinical indications. Genet Med 18(7):696–704
29. Retterer K, Scuffins J, Schmidt D, Lewis R, Pineda-Alvarez D, Stafford A, Schmidt L, Warren S, Gibellini F, Kondakova A, Blair A (2015) Assessing copy number from exome sequencing and exome array CGH based on CNV spectrum in a large clinical cohort. Genet Med 17(8):623–629
30. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q (2020) Opportunities and challenges in long-read sequencing data analysis. Genome Biol 21(1):1–6
31. Manzoni C, Kia DA, Vandrovcova J, Hardy J, Wood NW, Lewis PA, Ferrari R (2018) Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief Bioinform 19(2):286–302

Web Resources

ANNOVAR, http://annovar.openbioinformatics.org/
BWA, http://bio-bwa.sourceforge.net/
DGV, https://clinicalgenome.org/tools
Ensembl, https://asia.ensembl.org/index.html
GATK, https://gatk.broadinstitute.org/
https://genome.ucsc.edu/
http://pcingola.github.io/SnpEff/
https://samtools.github.io/bcftools/bcftools.html
NCBI, https://www.ncbi.nlm.nih.gov/
OMIM, https://www.omim.org/
PanelApp, https://panelapp.genomicsengland.co.uk/

Chapter 6

Emerging Trends in Big Data Analysis in Computational Biology and Bioinformatics in Health Informatics: A Case Study on Epilepsy and Seizures

Usha Chouhan, Rakesh Kumar Sahu, Shaifali Bhatt, Sonu Kurmi, and Jyoti Kant Choudhari

Abstract

Advanced technology innovations allow cost-effective, high-throughput profiling of biological systems. Technologies such as next-generation sequencing, microarrays, and mass spectrometry have made it possible to sequence a genome in days. As these technologies have developed, massive amounts of biological data (e.g., genomics, proteomics) have been produced cheaply, and the resulting “big data” era has created new opportunities to solve medical and biological problems in many disciplines: preventive medicine, biology, personalized medicine, gene sequencing, healthcare, and industry. Computational biology and bioinformatics are interdisciplinary fields that develop and apply computational methods (e.g., analytical methods, mathematical modeling, and simulation) to analyze large collections of biological data, such as genetic sequences, cell populations, or protein samples, to make new predictions or discover new biology. Biological data storage, mining, and analysis are challenging because the data are highly heterogeneous. In this study, the big data resources of genomics, proteomics, and metabolomics have been explored to solve biological problems using big data analysis approaches. The goal is to build a network of relationship-based gene-disease associations to prioritize phenotypes common to epilepsy and seizure disease. Through network analysis, 10 seed genes, 22 associated genes, 132 microRNAs, and 38 transcription factors have been identified that have a direct effect on all forms of epilepsy and seizures. According to a functional analysis of the seed genes, most are involved in the acetylcholine-gated channel complex (10%) and the heterotrimeric G-protein complex (10%) among pathways related to cellular components, followed by a role in the regulation of action potential (20%) and positive regulation of vascular endothelial growth factor production (20%) among epilepsy and seizure pathways related to biological processes. This study might provide insight into the workings of the disease and shows the importance of continued research into epilepsy and other conditions that can trigger seizure activity.

Key words Big data, Genomics, Proteomics, Healthcare, Computational biology, Bioinformatics

Sudip Mandal (ed.), Reverse Engineering of Regulatory Networks, Methods in Molecular Biology, vol. 2719, https://doi.org/10.1007/978-1-0716-3461-5_6, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024

1 Introduction

Big data refers to data sets that are both vast and complicated, making them difficult to handle using conventional techniques of data processing and storage. In 2001, Doug Laney, an analyst at Gartner, introduced three defining characteristics or dimensions of big data: volume, velocity, and variety. Several further “Vs” have since been suggested, including veracity, value, visibility, and variability [2]. A database is a systematic collection of data. Because the data in a database are organized, data management is easy. Over the last several decades, the number of experimental data sets accessible on a genome scale has increased. Databases can be classified by criteria such as accessibility, breadth of data coverage, and data curation technique [11]. Biological databases are created for a wide variety of reasons, include many different kinds of data with varying degrees of coverage, and are curated at various levels using a variety of methodologies. According to the kind of information they store, biological databases may, to a certain extent, be divided into the following categories: (1) genomic database resources, (2) transcriptomic database resources, (3) proteomic database resources, (4) metabolomic database resources, and (5) database resources for biological pathways [31].

1.1 Big Data Resource Challenges and Promises

1.1.1 Genomic Database Resources

In the era that followed the completion of the Human Genome Project, genomic databases became an essential part of human genome informatics. They are expected to improve diagnostic accuracy by linking genetic variations to particular phenotypic patterns and clinical features. The term “human genomic databases” refers to online archives of genomic variants found in humans, predominantly documented for one or more genes or for a specific demographic or ethnic group [31]. Large genomic databases are a valuable way to learn how human genes work and how they vary. Two of the best known are dbVar, which archives genomic structural variation, and dbGaP, which archives studies of genotype and phenotype. Such resources are useful for learning how genetic variants, including single nucleotide polymorphisms (SNPs), are distributed across large populations and how they relate to one another. Another popular database is GEO, which contains information on how genes are expressed in various tissues. The gene expression pattern associated with a particular disease or disorder can be accessed from these genomic database resources. A vast amount of genomic information is available today, and tracking down the correct information can be a daunting task; even knowing where to look, it can be hard to know exactly what to look for. For that reason, we highlight these significant genomic database resources.

Gene-Disease Associations for Epilepsy and Seizure Disease


1.1.2 Transcriptomics Big Database Resource

Transcriptomics has spurred the development of advanced technologies, and these innovations have in turn expanded its capacity to analyze the complexity of eukaryotic transcriptomes. Notably, technologies such as RNA-seq and microarrays have played a pivotal role in pushing the boundaries of transcriptomic research and providing deeper insights into gene expression and regulation. Transcriptomic research aims to identify essential genes that may contribute to various biological processes. Information extracted through transcriptomics in recent years has led to significant discoveries in various scientific fields and has been instrumental in advancing our understanding of gene expression, regulation, and the mechanisms underlying complex biological processes. Researchers around the globe have analyzed and published large amounts of transcriptomic data, and these data sets can be retrieved from big database resources.

Several databases contain information about gene expression in cells, and they fall into two main types. The first type is the microarray (gene expression) database. A microarray is a solid surface, such as a glass slide or chip, on which thousands of nucleic acid probes are arranged (“arrayed”) at known positions; labeled nucleic acid from a sample hybridizes to the probes, and the fluorescent signal at each position reflects the abundance of the corresponding transcript. The second type is the RNA-seq database, which holds sequencing-based measurements of how much RNA cells produce at different times during their life cycle, for example, while they are growing or after they have stopped dividing (post-mitotic). This information can then be compared between two types of cells to determine how similar they are.

Microarray databases include EMAGE, GenNote, M3D, NetAffx, and SOURCE, and RNA sequencing databases include CAGE, Genome RNAi, HPMR, MAMEP, and SEQanswers. Other databases, such as 4DXpress, ArrayExpress, GENSAT, and GEO, hold both microarray and RNA-seq data.

1.1.3 Proteomics Database Resources

Proteomics is a well-established and expanding field that studies protein samples using mass spectrometry (MS). Proteomic research produces large amounts of unprocessed experimental data and inferred biological findings. Centralized data repositories have been built to streamline the sharing of these data and make the data and findings accessible to proteome researchers and biologists alike. This overview covers only publicly accessible, centralized data sources that disseminate or archive experimental MS data and results. The features and functionality that these resources offer users are defined together with their intended use [23]. Protein databases are constructed to collect universal proteins, identify protein families and domains, reconstruct phylogenetic trees, and profile protein structures [31]. Proteomics data come in various formats, and different users may have very different requirements; data repositories for proteomics therefore exist, above all, to make data accessible to all users.

Protein Analysis Through Evolutionary Relationships (PANTHER) is a large curated biological database of gene and protein families that can be used to classify proteins and their genes in high-throughput analyses. Another widely used resource is UniProt, a leading high-quality, comprehensive, and freely accessible protein resource. The UniProt consortium was established in 2002, when three institutes decided to combine their resources. The UniProt Knowledgebase is divided into two sections: one contains records that have been fully manually annotated, drawing on information extracted from the literature and on curator-evaluated computational analysis; the other contains computationally analyzed records that await full manual annotation.

In 1971, the structural biology community established the Protein Data Bank (PDB) as the single worldwide archive for macromolecular structure data. The PDB has embraced a culture of open access, leading to its widespread use by the research community. PDB data are used by millions of users exploring fundamental biology, energy, and biomedicine; one documented example is the impact of the PDB on approvals of antineoplastic drugs. The PDB has proved very useful as a proteomics resource.

1.1.4 Metabolomics Database Resources

The Metabolome Database [13] is a comprehensive, high-quality resource on the small-molecule metabolites found in the human body. Metabolomics is closely associated with chemical biology, which employs chemical synthesis, analytical chemistry, and other tools to study biological systems. The metabolome represents a molecular phenotype that allows us to assess the external influences under which an organism exists and develops dynamically. Steady advances in instrumentation toward high-throughput and high-resolution methods have led to a revival of analytical chemistry methods for measuring and analyzing the metabolome of organisms. The steady growth of metabolomics is leading to an accumulation of big data across laboratories worldwide, as observed in all of the other omics areas. This calls for methods and technologies that can handle such large data sets, distribute them efficiently, and enable their reanalysis.

Metabolic pathway databases generally contain detailed data models that represent a pathway as a series of biochemical reactions, focusing mainly on the chemical modifications made to the small-molecule substrates of enzymes. Many metabolic pathways have been mapped to the molecular level since the 1950s or earlier, and metabolic pathway databases are the earliest and perhaps the best-known examples of pathway databases. However, these databases generally do not represent higher-order cellular processes such as gene regulation. Gene regulation is the control of gene expression, which plays a crucial role in cellular activities including development, response to stimuli, and maintenance of homeostasis. Although metabolic pathway databases are essential for understanding cellular metabolism, gene regulation databases are equally vital for understanding the regulatory mechanisms that govern cellular functions at the genetic level.

1.1.5 Biological Pathway Database Resource

Pathway databases are a means to systematically associate proteins with their functions and link them into networks that describe the reaction space of an organism. They contain biological pathways for metabolic, signaling, and regulatory pathway analysis. From the database point of view, biological pathways are sets of proteins and other biomacromolecules that represent spatiotemporally organized cascades of interactions, involve low-molecular-weight compounds, and are responsible for achieving specific phenotypic biological outcomes. A pathway is usually associated with certain subcellular compartments. Pathway databases are typically accompanied by systems for information retrieval and by tools for mapping user-defined gene sets onto the pathway information. Whereas today’s pathway databases contain almost exclusively qualitative information, the desired trend is toward a quantitative description of interactions and reactions in pathways, which will gradually enable predictive modeling and transform the pathway database into an analytical workbench [18]. A representative example is KEGG PATHWAY, a curated biological pathway resource on molecular interaction and reaction networks [31].
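Mapping a user-defined gene set onto pathway information is commonly done with an over-representation test. The following Python sketch illustrates the idea with a hypergeometric tail probability; it is not the procedure of any particular pathway database. The pathway names, their memberships, and the background size are invented for illustration, and `hypergeom_pvalue` and `map_to_pathways` are hypothetical helper names.

```python
from math import comb


def hypergeom_pvalue(overlap, query_size, pathway_size, background_size):
    """Probability of seeing at least `overlap` pathway genes in a random
    query of `query_size` genes drawn from `background_size` genes."""
    max_overlap = min(query_size, pathway_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def map_to_pathways(gene_set, pathways, background_size):
    """Return (pathway, shared genes, p-value) tuples, most significant first."""
    query = set(gene_set)
    results = []
    for name, members in pathways.items():
        member_set = set(members)
        shared = query & member_set
        p = hypergeom_pvalue(len(shared), len(query), len(member_set), background_size)
        results.append((name, sorted(shared), p))
    return sorted(results, key=lambda row: row[2])


# Hypothetical pathway memberships and background size, for illustration only.
toy_pathways = {
    "toy_GABA_signaling": ["GABRA1", "GABRB3", "GABRA2"],
    "toy_growth_signaling": ["PTEN", "MTOR", "AKT3", "BRAF"],
}
query_genes = ["GABRA1", "GABRB3", "KCNQ2", "PTEN"]

for name, shared, p in map_to_pathways(query_genes, toy_pathways, background_size=1000):
    print(f"{name}: shared={shared}, p={p:.3g}")
```

In practice the p-values would also be corrected for multiple testing across all pathways examined, and the background should match the set of genes that could have been observed in the experiment.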

1.2 A Case Study of Epilepsy and Seizures

Epilepsy is a chronic condition of the brain that affects individuals of all ages. It is a neurological disorder characterized by unexpected, recurrent seizures. A seizure is a sudden burst of atypical electrical activity in the brain, the clinical manifestation of abnormally excitable cortical neurons. During a seizure, electrical activity in the brain is unregulated and neurons begin to fire randomly. Throughout history, the disorder has been known as the sacred disease because those with epileptic seizures were believed to be possessed by devils or evil spirits. The Hippocratic writings from around 400 BC contradicted these ideas and argued that epilepsy was a hereditary illness brought on by an excess of phlegm in the brain. By the eighteenth century, epilepsy was accepted as a chronic cerebral disorder. In the nineteenth century, John Hughlings Jackson theorized that epilepsy was due to overly sensitive gray matter in the brain. He correlated motor seizure symptoms with post-mortem investigations to localize the source of epileptogenic lesions. The invention of the electroencephalogram in 1929 had a great impact on the diagnosis and classification of epileptic seizures.

The prevalence of epilepsy in the United States is 6 or 7 per 1000, and there are 40–50 new cases per 100,000 people every year. The likelihood of suffering from epilepsy increases from 1% at birth to 3% by the age of 75. Epidemiological studies suggest that the incidence of epilepsy increases significantly after the age of 60, to a level higher than in other age groups, including children. In two-thirds of cases, the cause is not determined [10].

Genetics plays an important role in the cause and treatment of epilepsy and seizures. Researchers have identified many genes associated with epilepsy and seizures, including genes associated with ion channels, neurotransmitters, and other proteins involved in neuronal development and excitability. Genetic mutations can also directly cause seizures by producing abnormalities in the brain or nervous system. Some types of epilepsy are associated with genetic syndromes, in which mutations in multiple genes increase the risk of epilepsy or seizures. Other types are inherited in a Mendelian pattern, meaning that the predisposition to epilepsy is passed from parents to their children. Overall, the genetic architecture of the epilepsies is complex, and the underlying causes are still being elucidated.

2 Materials

2.1 DisGeNET Database

The DisGeNET database is a comprehensive, open-access resource that contains information on genes associated with human disease. It includes data on gene-disease associations, gene-gene interactions, and disease-related phenotypes. It is available as a downloadable, open-source data warehouse and is a valuable resource for researchers, clinicians, and patients alike [21]. The DisGeNET Cytoscape App is a tool for visualizing, querying, and analyzing gene-disease and variant-disease networks. The app can generate networks that are restricted to specific data sources, association types, disease classes, diseases, genes, and variants. It can also generate gene-gene and disease-disease networks that show relationships between nodes sharing a neighbor in the original gene-disease network [3]. The app is available for free download from the Cytoscape App Store (https://apps.cytoscape.org/apps/disgenetapp). Once installed, it can be launched from within Cytoscape (http://www.cytoscape.org/) by selecting “DisGeNET Viewer” from the “Apps” menu. Users then select the type of network to generate (gene-disease or variant-disease) from the drop-down menu in the “Network Type” section and choose a DisGeNET data source in the “Data Source” section. Clicking the “Generate Network” button generates the network (see Note 1). The generated network is displayed in a new window, and various options are available for customizing the display (e.g., changing node colors and sizes). The restrictions listed above can be applied singly, as lists, or in combination. In addition, users can specify a score range and an EvidenceIndex range of interest and can filter by Evidence Level category (see Note 2).

2.2 GeneMANIA Prediction Server

GeneMANIA is a tool for predicting the function of genes and gene sets. One of its key features is that it reports weights that indicate the predictive value of each selected data set for the query, which allows users to choose the data sets most likely to give accurate results for their particular gene or gene set. The weights are based on several factors, including the size of the data set, the number of genes in the data set, and the similarity of the genes in the data set to the query gene or gene set. GeneMANIA currently supports six organisms: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Homo sapiens, and Saccharomyces cerevisiae. For each of these organisms, GeneMANIA has collected data sets from a variety of sources, including GEO, BioGRID, Pathway Commons, and I2D, as well as organism-specific functional genomics data sets. It uses these data to predict which genes are likely to share function based on their physical interactions or predicted physical interactions (see Note 3).
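The role of per-data-set weights can be illustrated with a small Python sketch. This is not GeneMANIA's own algorithm, which fits its weights and scores genes with its own methods; here the weights are simply supplied by hand, the edge strengths are invented, and `combine_networks` and `rank_candidates` are hypothetical helper names. The sketch merges several interaction networks into one weighted network and ranks non-query genes by the total weight of their links to the query genes.

```python
from collections import defaultdict


def combine_networks(networks, weights):
    """Sum edge strengths across networks, scaling each network by its weight."""
    combined = defaultdict(float)
    for name, edges in networks.items():
        w = weights.get(name, 0.0)
        for (a, b), strength in edges.items():
            key = tuple(sorted((a, b)))  # treat edges as undirected
            combined[key] += w * strength
    return dict(combined)


def rank_candidates(combined, query):
    """Rank genes outside the query by their summed link weight to query genes."""
    query = set(query)
    scores = defaultdict(float)
    for (a, b), w in combined.items():
        if a in query and b not in query:
            scores[b] += w
        elif b in query and a not in query:
            scores[a] += w
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical edge strengths and data-set weights, for illustration only.
networks = {
    "physical": {("PTEN", "MTOR"): 1.0, ("PTEN", "AKT3"): 0.5},
    "genetic": {("MTOR", "DEPDC5"): 1.0, ("AKT3", "PTEN"): 1.0},
}
weights = {"physical": 0.8, "genetic": 0.2}

combined = combine_networks(networks, weights)
for gene, score in rank_candidates(combined, ["PTEN"]):
    print(f"{gene}: {score:.2f}")
```

A data set with a larger weight contributes more to the ranking, which is the intuition behind reporting the weights alongside the prediction.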

2.3 NetworkAnalyst

NetworkAnalyst is a powerful tool for analyzing gene expression data and identifying potential biomarkers of disease. The platform supports a variety of data analysis methods, including gene set enrichment analysis, correlation analysis, and network analysis. It also provides a visual interface for exploring biological networks, which can be used to identify key genes and pathways involved in disease development (see Note 9). The platform includes a variety of statistical and machine learning methods that can be used to find hidden patterns in data sets. The platform also includes a number of visualization tools that make it easy to explore data sets [29] (see Note 5).

2.4 MCODE Plugin

MCODE finds clusters (highly interconnected regions) in a network. Clusters mean different things in different types of networks. MCODE is a relatively fast clustering method. With an intuitive interface, it suits both computationally and biologically oriented researchers. Current features include fast network clustering, fine-tuning of results with numerous node-scoring and cluster-finding parameters, interactive cluster boundary and content exploration, multiple result set management, cluster sub-network creation, and plain text export (see Note 4).
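What “highly interconnected” means can be made concrete with a density-based score. The Python sketch below is a simplified illustration rather than the plugin's implementation: it assumes an undirected network without self-loops and scores a cluster as its edge density multiplied by its number of nodes. The function names and the toy network are hypothetical.

```python
def cluster_density(nodes, edges):
    """Fraction of possible undirected edges (no self-loops) present among `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Keep only edges with two distinct endpoints that both lie inside the cluster.
    internal = {frozenset(e) for e in edges if len(set(e)) == 2 and set(e) <= nodes}
    return 2 * len(internal) / (n * (n - 1))


def cluster_score(nodes, edges):
    """Density multiplied by cluster size: larger, denser clusters score higher."""
    return cluster_density(nodes, edges) * len(set(nodes))


# Toy network: A-D form a fully connected group; E hangs off D.
toy_edges = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("D", "E"),
]
print(cluster_score(["A", "B", "C", "D"], toy_edges))
print(cluster_score(["B", "C", "D", "E"], toy_edges))
```

Under this definition a fully connected group of n nodes scores n, and the score falls as edges are removed, so sparse groupings rank below tightly connected ones of similar size.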

2.5 Cytoscape Software

Cytoscape is an open-source bioinformatics software platform for visualizing molecular interaction networks and biological pathways and integrating these networks with annotations, gene expression profiles, and other state data. Although Cytoscape was originally designed for biological research, it is now a general platform for complex network analysis and visualization. The Cytoscape core distribution provides a basic set of features for data integration, analysis, and visualization. Additional features are available as Apps (formerly called plugins). Apps are available for network and molecular profiling analyses, new layouts, additional file format support, scripting, and connection with databases. Anyone may develop apps using the Cytoscape open API, which is based on Java™ technology, and community development of apps is encouraged. Most apps are freely available from the Cytoscape App Store (see Note 6).

2.6 FunRich Tool

FunRich is a stand-alone software tool used for functional enrichment and interaction network analysis of genes and proteins. Enrichment analysis is a statistical method used to identify which functions or processes are over-represented in a given set of genes or proteins. This information can be used to better understand the biology of a particular gene or protein and may also reveal new insights into disease etiology and therapeutic targets. FunRich also provides graphical representations of the functional enrichment and interaction networks of genes and proteins, which can help in understanding the results of the enrichment analysis (see Note 7).

3 Methods

3.1 Generation of Gene-Disease-Variant-Associated Network

The network of gene-disease-variant associations for “Epilepsy and Seizures” and their disease-related subtypes was extracted from the DisGeNET database. The DisGeNET platform includes a Cytoscape plugin, which allows all gene-disease interactions to be integrated and shown in a single comprehensive gene-disease network. There are many types of epilepsy and seizures, and they can be categorized by their cause, the symptoms they produce, and how they affect people. In our networks, multiple edges between a gene and a disease represent the multiple pieces of evidence reporting that gene-disease association (see Note 8). We found 871 genes and 1242 variants associated with epilepsy and 1052 genes and 1510 variants associated with seizures (Fig. 1).

Fig. 1 Gene-disease-variant-associated network

3.2 Genetic Interaction Network

A gene regulatory network is a group of molecular regulators that interact with one another and with other components of the cell to control the levels of mRNA and protein produced when genes are expressed, which in turn regulates how well the cell functions. The gene regulatory network was created using the GeneMANIA big database resource. Physical interactions accounted for 81.21% of the network and genetic interactions for 18.79%. The genetic interaction network is shown in Fig. 2.
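Percentages of this kind can be tallied from an exported edge list. The Python sketch below assumes that each edge carries an interaction-type label and that shares are computed from simple edge counts; the server may derive its percentages differently (for example, from network weights), so this is an illustration only. The edge list is invented and `interaction_type_shares` is a hypothetical helper name.

```python
from collections import Counter


def interaction_type_shares(edges):
    """Percentage of edges of each interaction type in (gene_a, gene_b, type) triples."""
    counts = Counter(kind for _, _, kind in edges)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: 100.0 * n / total for kind, n in counts.items()}


# Hypothetical edge list, for illustration only.
toy_edges = [
    ("PTEN", "MTOR", "physical"),
    ("PTEN", "AKT3", "physical"),
    ("MTOR", "DEPDC5", "physical"),
    ("GABRA1", "GABRB3", "physical"),
    ("PTEN", "BRAF", "genetic"),
]

for kind, share in interaction_type_shares(toy_edges).items():
    print(f"{kind}: {share:.2f}%")
```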

Fig. 2 Gene interaction network (physical and genetic)

3.3 Cluster Analysis of the Regulatory Network

A cluster is a subnetwork whose nodes are more closely related to each other than to the rest of the network, so the nodes in a cluster have a high clustering coefficient. Structural motifs within the clusters can provide information about their dynamics and functional capabilities, while intermolecular motifs can link different clusters, allowing information to be transferred and activity to be coordinated among them. To construct the clusters, we used “Molecular Complex Detection” (MCODE) v2.0.0, a Cytoscape plugin that identifies highly interconnected nodes in the form of clusters representing relatively stable, multi-protein complexes that function as a single entity in the network. Clusters in a protein-protein interaction network are usually protein complexes and pathways, while clusters in a protein similarity network are protein families. The algorithm has three stages: (i) weighting: nodes with the most interconnected neighbors receive a higher score; (ii) molecular complex prediction: starting with the highest-weighted node (the seed), nodes whose weights are above a specified threshold are recursively added to the complex; (iii) post-processing: filters (haircut and fluff) are applied to increase cluster quality (see Note 10). Eight clusters were found to be highly connected and to operate as single entities (Fig. 3): the first had 58 nodes and 401 edges (score 13.439), the second 46 nodes (13.289), the third 72 (8.282), the fourth 10 (6.222), the fifth 12 (3.636), the sixth 27 (3.154), the seventh 17 (3), and the eighth 26 nodes (2.96) and 38 edges (Table 1). These clusters form a complex network whose topology suggests a cluster organization that orchestrates the overall functionality of the network. Each such organization is a collection of biological entities that work together to support a particular biological function.

3.4 Gene-miRNA-TFs Regulatory Network

The miRNA-TFs network is made up of miRNAs and transcription factors (TFs) that work together, and it plays an important role in regulating gene expression. The process begins when a gene is transcribed from its DNA template to messenger RNA (mRNA), which is then translated into protein by the ribosome. At the transcriptional level, TFs can either activate or repress the transcription of the corresponding gene. At the posttranscriptional level, miRNAs bind to complementary sequence elements (SEs) located in the 3′ UTR of the mRNA. Together, these processes help to ensure proper gene expression. For this research, we used the web-based application NetworkAnalyst to design a gene regulatory network (GRN). This GRN can be divided into three types: gene-miRNA interaction, TF-gene interaction, and TF-miRNA co-regulatory network. The data were uploaded by selecting the gene list input, specifying the organism as Homo sapiens, and setting the ID type to official gene symbol. The Network Mapping tab was then used to map the GRN with the gene-miRNA interaction, TF-gene interaction, and TF-miRNA co-regulatory network options selected, and the genes of interest (seeds) from the previous analysis were also included. The procedure typically produces one big subnetwork (“continent”). This approach is useful for identifying interactions, reactions, and numerous types of biochemical processes. The subnetwork contains 702 nodes, 5679 edges, and 233 seeds, including 319 miRNAs and 145 transcription factors directly connected to the seeds. Figure 3 depicts the gene-transcription factor regulatory network, providing insight into its structure and complexity.

Fig. 3 Gene cluster from the gene regulatory network

Table 1 Gene clusters

Cluster name | Score value | Genes
Cluster 1 | 13.439 | FBXL3, CHD7, PLCB1, EEF1A2, DYRK1A, MAPT, HDAC9, PTPRK, KCNH5, IGHMBP2, TARDBP, ANK2, CAMSAP2, GSPT2, DIAPH1, SLC1A3, EPHB1, GHR, LGI1, NT5E, OXTR, GNB1, KIF5B, CPT2, UBE4A, CYP2C9, TRPS1, CHRNA4, FARS2, SLC2A1, ANKRD11, TUBA1A, SLC1A1, FZD4, ABCC2, TSPEAR, C5, ARFGEF2, HEPACAM, FBN1, GRM7, LRP8, ADK, SYN3, GNAQ, GABRA2, GABRB3, KCNQ3, ALOX5, CHRM3, ALDH7A1, BRAF, GRM4, PCDH7, APP, FGF14, SOX4, BICD2
Cluster 2 | 13.289 | GRIN2B, GC, GAL, FMN2, FGF12, CUX1, CPS1, COL4A1, CEP128, CDKL5, CAMK2D, CAMK2A, CACNA1C, ASXL1, AKT3, AFF3, ADGRG1, ABAT
Cluster 3 | 8.282 | ABCG2, ACTL6B, ATP1A2, ATP7A, BACE1, BPTF, C1D, CENPT, CHD2, CLN5, COL11A1, CUL4B, CYP1A1, DIMT1, DYNC1H1, G6PC, GABRA1, GABRB1, GLDC, GOSR2, GRIK2, HIVEP1, HUWE1, IER3IP1, INO80, KCNB1, KCNC1, KCNH1, KCNJ10, KCNQ2, KIF2A, KMT2E, LAMC2, LNPEP, LRP12, MED13L, MEF2C, MFSD8, MMP8, MTHFR, MYT1L, NALCN, NIPBL, NSD1, OPHN1, OPRM1, PDCD10, PDXK, PEX26, PIGV, PKD2, PLPBP, POGZ, POMP, PRNP, PRSS23, PURA, RAI1, RELN, RHOBTB2, SCN2A, SORBS1, ST7, STXBP1, SYNJ1, SYT2, TBCK, TLR2, TSHR, UGT2B7, ZMYND11, ZNF804A
Cluster 4 | 6.222 | ALG13, ARID1B, CNTNAP2, CPA6, CYFIP2, FOXP1, GALC, RBFOX1, SORCS2, TLR4
Cluster 5 | 3.636 | ARID1A, CINP, DEAF1, FANCI, GJB2, GLRA1, GNAO1, PPARG, RARS2, RIT1, RORA, SATB2
Cluster 6 | 3.154 | ABCB1, ATP1A3, CAV3, COL3A1, DNM1L, EPHX1, EXT2, GCDH, HEXA, HNF4A, HNRNPU, KCNH2, LMAN2L, MYO5A, NR0B2, POLR3A, PTEN, PTH2R, RALGAPB, SCO2, SEPSECS, SMC1A, SMS, SPAST, SPATA5, THAP11, TRIM8
Cluster 7 | 3 | CHRNA7, CHRNB2, DEPDC5, DRD1, EPM2A, FBXL4, GABRA6, HTRA1, LMNB2, MTOR, NHLRC1, PACS1, PHF6, RSRC2, SCN5A, STRADA, TOMM40
Cluster 8 | 2.96 | AMPD2, ATP6V1A, CLU, CNTN2, COTL1, CYP3A5, DHX30, EPHA3, FLG, GLUD1, GRIN1, HAX1, KCTD3, MPP4, MYH7B, NPRL2, NPRL3, PRDM8, RORB, SCN8A, SH2D1A, SPTAN1, STX7, SZT2, TTR, UNC80

Seed nodes of high degree were identified by network topology analysis; the aim was to find the high-degree genes that could be used to understand the regulatory network. Subnetwork analysis was then performed in Cytoscape, where a network was created using degree and shortest path. Ten seed genes that are highly connected in the network were identified (see Fig. 4). If a variant affects any of these seed genes, the entire network can be affected. Subnetwork topology analysis revealed 10 seed genes (MEF2C, HNF4A, DYRK1A, FOXP1, PPARG, CUX1, PLCB1, PTEN, TRPS1, and MYT1) and their directly associated 22 genes, 38 TFs, and 132 miRNAs related to epilepsy and seizures.

Fig. 4 Sub-network containing 10 seed genes

The MEF2C gene encodes a transcription factor that is a member of the myocyte enhancer factor 2 (MEF2) family, whose members regulate gene expression in various tissues, including muscle, brain, and heart. The current literature on the association of MEF2C with epilepsy is limited, and more research is needed to fully understand this type of epilepsy. For example, the seizures and EEG findings associated with MEF2C-related epilepsy need to be better characterized. Additionally, the involvement of MEF2C in the transcriptional control of the MECP2 and CDKL5 genes may help to explain the variety of seizures seen in patients with this condition. Clinical studies may allow researchers to delineate MEF2C-related epilepsy as a spectrum that includes febrile seizures (typical or atypical), myoclonia, and focal onset or generalized seizures in the context of gross motor delay and severely impaired expressive language [4].

DYRK1A is a protein encoded by the DYRK1A gene and is a member of the dual-specificity tyrosine-(Y)-phosphorylation-regulated kinase (DYRK) family. It is involved in a number of cellular processes, including cell proliferation, cell survival, transcription, and cell cycle regulation. Mutations in the DYRK1A gene have been associated with a number of human diseases, including intellectual disability, autosomal dominant microcephaly, and epilepsy [9].


FOXP1 is a protein that is important for the development and function of the nervous system. Mutations in the FOXP1 gene can cause a variety of neurological disorders, including mental retardation, autism, and cerebral palsy. Similarly, the PPARG is a gene that acts as a transcription factor and mediates the effects of various hormones on gene expression. It is a key regulator of lipid and glucose metabolism, and is also involved in cell proliferation and differentiation. Mutations in PPARG are associated with a number of diseases, including epilepsy and seizures, obesity, type 2 diabetes, and cancer [26]. Another, CUX1 is a gene that is essential for the proper development and function of the nervous system, without it, neurons fail to mature and function properly, and the nervous system cannot develop or function properly. It is also essential for the formation of the blood-brain barrier, which protects the brain from harmful substances in the blood. There have been a lot of reports of mutated CUX1 genes in patients with epilepsy, and it’s been confirmed that these mutations cause alterations in function [28]. The PLCB1 gene is an important gene associated with epilepsy. Mutations in this gene have been linked to a number of different forms of epilepsy, including generalized epilepsy and febrile seizures. The PLCB1 gene encodes for the enzyme phospholipase C beta-1, which is involved in the regulation of cellular signaling pathways. Studies have shown that mutations in the PLCB1 gene can lead to abnormal neuronal excitability, which may contribute to the development of epilepsy. Furthermore, mutations in the PLCB1 gene can affect the proper functioning of the glutamate system, which may also lead to the development of epilepsy. In addition, mutations in the PLCB1 gene have been associated with an increased risk of developing certain types of epileptic syndromes, such as benign focal epilepsy of childhood and childhood absence epilepsy. 
Moreover, PLCB1 is involved in the regulation of other neurological processes, such as memory and learning. Mutations in PLCB1 can therefore have far-reaching consequences for neurological function and may be associated with an increased risk of developing epilepsy [17]. Similarly, mutations in the PTEN gene have been linked to several neurological disorders, including epilepsy [27]. Recent studies suggest that PTEN mutations may be associated with an increased risk of intractable childhood epilepsies such as Dravet syndrome and Lennox-Gastaut syndrome. Furthermore, PTEN mutations have been found to increase neuronal excitability, which is thought to contribute to the epileptic phenotype [1]. PTEN mutations have also been linked to an increased risk of focal cortical dysplasia and mesial temporal sclerosis, two common causes of epilepsy. Additionally, it has been suggested that PTEN mutations can cause abnormal development of neuronal networks, which can lead to seizure activity. It is therefore important to study the PTEN gene in order to better understand the pathogenesis of
epileptic syndromes [15, 24]. Similarly, the TRPS1 gene has been linked to a variety of disorders, including epilepsy. In a study of over 500 patients with genetic epilepsies, mutations in TRPS1 were identified in up to 4% of individuals. This gene is known to be involved in the development and differentiation of neurons, particularly during the development of the nervous system in the embryo. Mutations in TRPS1 can lead to a wide range of neurological disorders, including epilepsy and intellectual disability, and studies have shown that they can cause abnormal electrical activity in the brain, resulting in epilepsy. TRPS1 mutations have also been associated with other neurological conditions, such as ataxia, hypoparathyroidism, and mental retardation. Further research is needed to determine the role of TRPS1 in the development of epilepsy and to identify possible treatments. The MYT1 gene is also believed to contribute to epilepsy. Studies in humans and animals have demonstrated that mutations in MYT1 are associated with an increased risk of developing epilepsy [14]. In humans, a MYT1 mutation has been linked to an increased risk of temporal lobe epilepsy; in animal studies, a MYT1 mutation has been linked to an increased risk of generalized epilepsy. MYT1 mutations have also been associated with an increased risk of seizures in patients with both idiopathic and symptomatic forms of epilepsy, and with an increased risk of focal epilepsy. MYT1 has therefore been identified as a potential risk factor for epilepsy, and further research is necessary to better understand its role [5].

3.5 Gene Ontology Analysis

Gene ontology (GO) describes a gene or gene product in terms of three main aspects: its molecular function, the biological process in which it participates, and its cellular location. We performed the GO analysis in FunRich by selecting Gene ontology in the Manage Database option and then analyzing the seed genes found in the uploaded datasets. The enrichment analysis option was then used to obtain the cellular component, molecular function, and biological process annotations of the seed genes. Under biological process, the seed genes play an important role in cyclic purine nucleotide metabolic process, bilirubin transport, regulation of cholesterol transporter activity, positive regulation of vascular endothelial growth factor production, and regulation of action potential. Under cellular component, the seed genes are associated with heterotrimeric G-protein complex, acetylcholine-gated channel complex, membrane attack complex, nuclear stress granule, intercellular canaliculus, and inward rectifier potassium channel activity. Under molecular function, the seed genes are associated with DBD domain binding, DNA-binding transcription factor activity regulated by binding to a ligand that modulates the transcription of specific gene sets, prostaglandin receptor activity, bilirubin transmembrane transporter activity, ligand-activated sequence-specific DNA binding RNA polymerase II transcription factor activity, and transcription cofactor binding activity.

Based on the common GO terms used for annotation, GO can divide gene sets into coherent functional subclasses. The analysis arranges the differentially expressed genes into three distinct ontologies: biological process, cellular component, and molecular function. Each ontology is a set of well-defined concepts and relationships that can be used to interpret the function of a specific gene, gene product, or gene-product group. The cellular component analysis identifies the location of gene products at the level of subcellular structures and macromolecular complexes, and the results are compared with the P value computation in Epilepsy and Seizures. The results show that the largest shares of genes are active in intercellular canaliculus (10%), followed by nuclear stress granules (10%), membrane attack complex (10%), acetylcholine-gated channel complex (10%), and heterotrimeric G-protein complex (10%), as shown in Fig. 5.

Fig. 5 Gene function analysis of cellular component


Fig. 6 Gene function analysis of biological process

A biological process is a series of events accomplished by one or more organized assemblies of molecular functions. Biological process analysis of the differentially expressed genes showed that the largest shares of genes are involved in regulation of action potential (20%), followed by positive regulation of vascular endothelial growth factor production (20%), in Epilepsy and Seizures, as shown in Fig. 6. The molecular function analysis identifies the molecular activities of gene products in Epilepsy and Seizures, and the results are compared with the P value computation. It was observed that 20% of genes were associated with the Gene Ontology (GO) term “transcription cofactor binding,” while another 20% were associated with “ligand-activated sequence-specific DNA-binding RNA polymerase II transcription factor activity,” highlighting the significance of transcription factors (TFs) in regulating gene expression and their diverse roles in cellular processes, as shown in Fig. 7.


Fig. 7 Gene function analysis of molecular function
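The percentages and P values discussed in this section are produced by FunRich. As a rough illustration of how an enrichment P value of this kind can be computed, the sketch below applies a one-sided hypergeometric test, a common choice for over-representation analysis. Whether it matches the exact statistic and background set used in this analysis is an assumption, and the function names and the numbers in the usage lines are hypothetical rather than taken from the study.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_query, n_overlap):
    """One-sided hypergeometric P value, P(X >= n_overlap).

    n_background: genes in the background set
    n_annotated:  background genes annotated with the GO term
    n_query:      genes in the query (seed) list
    n_overlap:    query genes annotated with the GO term
    """
    upper = min(n_annotated, n_query)
    total = comb(n_background, n_query)
    # Sum the probabilities of observing n_overlap or more annotated genes.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def percent_of_genes(n_overlap, n_query):
    """Share of the query list that carries the term, in percent."""
    return 100.0 * n_overlap / n_query


# Hypothetical numbers, for illustration only: 2 of 10 seed genes carry a
# term that annotates 50 of 20,000 background genes.
p_value = hypergeom_enrichment_p(20000, 50, 10, 2)
print(f"{percent_of_genes(2, 10):.0f}% of seed genes, P = {p_value:.3g}")
```

A term shared by a fixed percentage of a short seed list can still have very different P values depending on how common the term is in the background, which is why the percentages and the P values are reported side by side.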

4 Notes

1. The DisGeNET discovery platform contains a large collection of genes and variants linked with human disorders. DisGeNET combines data from expert-curated sources, GWAS catalogs, animal models, and the scientific literature. DisGeNET data are uniformly annotated using community-driven ontologies and controlled vocabularies. A number of original metrics are also provided to help prioritize genotype-phenotype associations [7].

2. Before downloading data from DisGeNET, one must log in. The data are downloaded as an Excel document with a maximum of 5000 rows. Set the source button in the top left corner to “all” before obtaining the data. This downloads all the disease-related data and makes the examined target more significant. When configuring the network settings, it is essential to set the select source button to “all or XX” and the association type to “any or XX.” This creates a network of gene-disease-variant associations, which is crucial. In addition, the score must be set between 0 and 1, the EI between 0 and 1, and the
disease to be searched. After configuring all the parameters, click Create Network to build the gene-disease-variant network [22].

3. The GeneMANIA prediction server is a powerful tool for building genetic interaction networks. To get accurate results, it is important to specify the parameters of interest correctly. In our research, we focused on predicting the network of genetic and physical interactions. We specified the organism of interest, entered all the disease-related genes for that organism in the gene-of-interest field, and selected the genetic and physical interactions from the available interaction networks to generate the predicted network [16].

4. The MCODE plugin provides a number of parameters (degree cut-off 2, node score cut-off 0.2, k-score 2, max depth 100, and the haircut option checked) that can be used to find clusters in a network. By carefully adjusting each parameter, it is possible to identify the most densely connected regions of the network and remove unnecessary connections [8].

5. We used the NetworkAnalyst web server to build the Gene-miRNA-TF network, which includes the miRNAs and transcription factors (TFs) associated with the genes, following the parameters required by NetworkAnalyst. It is crucial to include TFs and miRNAs in the genetic interaction network, as the accuracy of the results depends entirely on the genes and the TFs and miRNAs connected to them. Usually, once all parameters have been specified, one large subnetwork (a “continent”) and numerous smaller ones (referred to as “islands”) are produced, as illustrated in Fig. 3. This network must be downloaded as a GraphML file and saved for topological analysis in Cytoscape [30].

6. We used Cytoscape to carry out the topological analysis of the networks. The success of topological network analysis depends on measures such as degree, betweenness centrality, closeness centrality, and path length. The topology of the network can only be analyzed accurately when all of these parameters are set correctly and carefully [6].

7. FunRich is a useful tool for analyzing the function of genes, especially those related to disease. It provides graphical representations of gene function and supports GO analysis. This tool is extremely helpful in understanding gene function at the cellular, molecular, and genetic levels [19].

8. There are many databases that may be used for the study of biological networks and that include a wealth of academic
research and factual information. Operating these databases smoothly and obtaining accurate results required a great deal of knowledge [12].

9. Biological networks are quite large, and commonly used applications and web servers require very high-performance systems and high data speeds. The development of more capable systems and high-speed data sources can help solve this issue [20].

10. It is crucial to perform the cluster analysis correctly, because it is one of the most important steps in our study and its accuracy affects our primary result. This can be achieved by checking all the parameters and analyzing the results of the cluster analysis using the specified parameters (the cluster with the highest score is one of the most important and most densely connected regions of the whole network) [25].

References

1. Akhter S (2021) Epilepsy: a common co-morbidity in ASD. In: Autism spectrum disorder-profile, heterogeneity, neurobiology and intervention. InTech
2. Armoogum S, Li X (2019) Big data analytics and deep learning in bioinformatics with hadoop. In: Deep learning and parallel computing environment for bioengineering systems. Elsevier Science & Technology, San Diego, pp 17–36
3. Bauer-Mehren A, Rautschka M, Sanz F, Furlong LI (2010) DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene–disease networks. Bioinformatics 26(22):2924–2926
4. Borlot F, Whitney R, Cohn RD, Weiss SK (2019) MEF2C-related epilepsy: delineating the phenotypic spectrum from a novel mutation and literature review. Seizure 67:86–90
5. Chen Z, Wang L, Wang C, Chen Q, Zhai Q, Guo Y, Zhang Y (2015) Mutational analysis of CHRNB2, CHRNA2 and CHRNA4 genes in Chinese population with autosomal dominant nocturnal frontal lobe epilepsy. Int J Clin Exp Med 8(6):9063–9070
6. Choudhari JK, Verma MK, Sahariah BP (2019) Chronic fatigue syndrome: identification of transcription factor (TFs) associated with gene expression for drug signature prediction. Netw Model Anal Health Inform Bioinform 8:1–8
7. Choudhari JK, Chatterjee T, Gupta S, Garcia-Garcia JG, Vera-González J (2021) Network biology approaches in ophthalmological diseases: a case study of glaucoma.
8. Choudhari JK, Verma MK, Choubey J, Sahariah BP (2021) Investigation of MicroRNA and transcription factor mediated regulatory network for silicosis using systems biology approach. Sci Rep 11(1):1265
9. Courcet JB, Faivre L, Malzac P, Masurel-Paulet A, Lopez E, Callier P, Lambert L, Lemesle M, Thevenon J, Gigot N, Duplomb L (2012) The DYRK1A gene is a cause of syndromic intellectual disability with severe microcephaly and epilepsy. J Med Genet 49(12):731–736
10. Foldvary-Schaefer N, Wyllie E (2007) Epilepsy. In: Textbook of clinical neurology. Elsevier, pp 1213–1244. https://doi.org/10.1016/B978-141603618-0.10052-9
11. Garg P, Jaiswal P (2016) Databases and bioinformatics tools for rice research. Curr Plant Biol 7:39–52
12. Guzzi PH, Roy S (2020) Biological network analysis: trends, approaches, graph theory, and algorithms. Elsevier
13. Haug K, Salek RM, Steinbeck C (2017) Global open data management in metabolomics. Curr Opin Chem Biol 36:58–63
14. Lopez E, Berenguer M, Tingaud-Sequeira A, Marlin S, Toutain A, Denoyelle F, Picard A, Charron S, Mathieu G, de Belvalet H, Arveiler B (2016) Mutations in MYT1, encoding the myelin transcription factor 1, are a rare cause of OAVS. J Med Genet 53(11):752–760
15. Marchese M, Conti V, Valvo G, Moro F, Muratori F, Tancredi R, Santorelli FM, Guerrini R, Sicca F (2014) Autism-epilepsy phenotype with macrocephaly suggests PTEN, but not GLIALCAM, genetic screening. BMC Med Genet 15(1):1–7
16. Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q (2008) GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol 9:1–15
17. Ngoh A, McTague A, Wentzensen IM, Meyer E, Applegate C, Kossoff EH, Batista DA, Wang T, Kurian MA (2014) Severe infantile epileptic encephalopathy due to mutations in PLCB1: expansion of the genotypic and phenotypic disease spectrum. Dev Med Child Neurol 56(11):1124–1128
18. Ooi HS, Schneider G, Lim TT, Chan YL, Eisenhaber B, Eisenhaber F (2010) Biomolecular pathway databases. In: Data mining techniques for the life sciences. Humana Press, New York, pp 129–144
19. Pathan M, Keerthikumar S, Ang CS, Gangoda L, Quek CY, Williamson NA, Mouradov D, Sieber OM, Simpson RJ, Salim A, Bacic A, Hill AF, Stroud DA, Ryan MT, Agbinya JI, Mariadason JM, Burgess AW, Mathivanan S (2015) FunRich: an open access standalone functional enrichment and interaction network analysis tool. Proteomics 15(15):2597–2601
20. Peddemors AJ, Hertzberger LO (1999) A high performance distributed database system for enhanced Internet services. Future Gener Comput Syst 15(3):407–415
21. Piñero J, Queralt-Rosinach N, Bravo A, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI (2015) DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015:bav028
22. Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, García-García J, Sanz F, Furlong LI (2016) DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res, gkw943
23. Riffle M, Eng JK (2009) Proteomics data repositories. Proteomics 9(20):4653–4663
24. Ronzano N, Scala M, Abiusi E, Contaldo I, Leoni C, Vari MS, Pisano T, Battaglia D, Genuardi M, Elia M, Striano P (2022) Phosphatase and tensin homolog (PTEN) variants and epilepsy: a multicenter case series. Seizure 100:82–86
25. Saito R, Smoot ME, Ono K, Ruscheinski J, Wang PL, Lotia S, Pico AR, Bader GD, Ideker T (2012) A travel guide to Cytoscape plugins. Nat Methods 9(11):1069–1076
26. Simeone TA, Matthews SA, Samson KK, Simeone KA (2017) Regulation of brain PPARgamma2 contributes to ketogenic diet anti-seizure efficacy. Exp Neurol 287:54–64
27. Skelton PD, Stan RV, Luikart BW (2020) The role of PTEN in neurodevelopment. Mol Neuropsychiatry 5(Suppl. 1):60–71
28. Wang J, Lin ZJ, Liu L, Xu HQ, Shi YW, Yi YH, He N, Liao WP (2017) Epilepsy-associated genes. Seizure 44:11–20
29. Xia J, Gill EE, Hancock RE (2015) NetworkAnalyst for statistical, visual and network-based meta-analysis of gene expression data. Nat Protoc 10(6):823–844
30. Zhou G, Soufan O, Ewald J, Hancock RE, Basu N, Xia J (2019) NetworkAnalyst 3.0: a visual analytics platform for comprehensive gene expression profiling and meta-analysis. Nucleic Acids Res 47(W1):W234–W241
31. Zou D, Ma L, Yu J, Zhang Z (2015) Biological databases for human research. Genomics Proteomics Bioinformatics 13(1):55–63

Chapter 7

New Insights into Clinical Management for Sickle Cell Disease: Uncovering the Significant Pathways Affected by the Involvement of Sickle Cell Disease

Usha Chouhan, Trilok Janghel, Shaifali Bhatt, Sonu Kurmi, and Jyoti Kant Choudhari

Abstract

Sickle cell disease is one of the most prevalent severe monogenic conditions in the world. Although the significance of chronic anemia, hemolysis, and vasculopathy has been established, hemoglobin polymerization, which results in erythrocyte stiffness and vaso-occlusion, is central to the pathophysiology of this disease. Clinical management remains basic, and reliable data are scant for many treatments. The onset of cerebrovascular disease and cognitive impairment are two of the major issues associated with sickle cell disease in children, and researchers are only now beginning to understand how blood transfusions and hydroxycarbamide can prevent these complications. When vaso-occlusion and inflammation occur repeatedly, most organs are gradually damaged, including the brain, kidneys, lungs, bones, and cardiovascular system, and this damage worsens with age. In our study, we focused on the specific pathways that are affected by the genes involved. First, we retrieved the gene datasets from the publicly available data source DisGeNET. Using literature-based genes, we identified 290 highly regulated genes that are directly associated with sickle cell disease. We subsequently performed a gene expression analysis and extracted a gene set using GEO2R, which was then used to prune the 290 genes. After pruning, we obtained 60 highly expressed genes, the differentially expressed genes (DEGs), which we used for pathway analysis. For the pathway analysis, we used Reactome and found that these DEGs are directly associated with seven pathways: the alpha-beta signaling pathway, ISG15 antiviral mechanism, oligoadenylate synthetase (OAS) antiviral response, interleukin-1 signaling, interleukin-4 and interleukin-13 signaling, interleukin-10 signaling, and aspirin ADME. The pathway analysis shows how sickle cell disease alters gene expression and how these genes affect the different pathways. Additionally, we performed a gene ontology analysis of the 60 genes and identified their biological processes, cellular components, and molecular functions, as described in our results. Our data may make it possible to identify individuals with sickle cell disease earlier, and the genes we identified may serve as biomarkers of sickle cell disease. Our results provide a starting point for work on sickle cell disease, and other researchers can use them as primary data for further research.

Key words Sickle cell disease, Signaling pathway, Gene ontology

Sudip Mandal (ed.), Reverse Engineering of Regulatory Networks, Methods in Molecular Biology, vol. 2719, https://doi.org/10.1007/978-1-0716-3461-5_7, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024

1 Introduction

A pathway is typically defined by a group of functionally related genes. Its composition can be taken from various public pathway databases or defined manually [1]. A biological pathway is a sequence of interactions between molecules in a cell that causes the cell to change or to produce a particular product; such a pathway can, for example, start the formation of new molecules such as proteins or fats. The fast accumulation of pathway resources has made the development of pathway-based techniques easier. The phrase “pathway-based” means that the fundamental unit of analysis is a pathway rather than a gene or SNP [2]. Several pathways are also involved in the regulation of cell migration. Migration pathways allow cells to move, organize, and grow in order to maintain the overall function of the organism to which they belong. For example, a migration pathway in the development of the nervous system allows nerve cells to travel to their intended destination and form connections. A pathway can also be triggered by external signals, such as hormones, that cause a cell to switch genes on and off, which can alter its functions and behavior. Pathways play a critical role in maintaining the normal functioning of cells and organisms because they interconnect a multitude of processes, from gene expression to the movement of cells to the transmission of signals. Pathways are therefore crucial in advanced genomics research [3].

A pathway model usually begins with an extracellular signaling molecule that activates a particular receptor, which sets off a series of chemical interactions. The most common way to visualize a pathway is as a relatively small graph with nodes representing genes, proteins, and/or small molecules connected by edges representing known functional relationships. A simple pathway might look like a chain, but loops and alternative paths are far more common in complicated pathway topologies. For computational analyses, pathways are represented in particular formats. In its most basic form, however, a pathway can be expressed as a list of constituent molecules with no indication of their order or relationships. Pathway analysis techniques in bioinformatics may be used to identify important genes or proteins within a previously understood pathway in relation to a specific experiment or pathological condition, or to create a pathway from scratch using proteins that have been identified as important affected elements. It is possible to investigate a pathway’s biological activity by looking at changes in, for instance, gene expression. However, the term “pathway analysis” most often refers to a technique for the initial characterization and interpretation of a pathological or experimental condition that has been investigated using omics methods or genome-wide association studies. Such research may reveal a long list of altered genes. Because the altered genes are associated with a
wide range of pathways, processes, and molecular activities (with a large fraction of genes lacking any annotation), visual assessment becomes challenging and summarizing the information becomes difficult [1].

Sickle cell disease is a genetic illness that primarily affects people of African descent. The hemoglobin S (HbS) mutation replaces the normal glutamic acid with valine at the sixth position of the β-globin subunit, altering the structure of the hemoglobin molecule and enhancing hemoglobin aggregation under conditions of oxidative stress, dehydration, or hypoxia in cells or tissues [4]. Sickle cell disease, one of the most common severe monogenic conditions in the world, is a multisystem disease that damages organs gradually over time and also produces episodes of acute illness. Knowledge has advanced progressively since 1910, when Herrick first described the characteristic sickle-shaped erythrocytes. In 1949, Pauling and coworkers discovered the abnormal electrophoretic behavior of sickle hemoglobin (HbS) and introduced the concept of “molecular disease.” The extensive research into the genetics and hemoglobin biophysics of the illness has aided the understanding of other molecular disorders. Although some evidence supports the use of blood transfusions and hydroxycarbamide in certain situations, no medications have been developed that specifically target the pathophysiology of this condition, and clinical treatment of sickle cell disease is still at a basic level [4]. The pathophysiological characteristics of sickle cell disease include persistent hemolytic anemia, vaso-occlusion leading to ischemic tissue injury, and painful episodes. Because of their low oxygen tension and reduced blood flow, the spleen, kidney, and bone marrow are the organs at greatest risk.

This study identifies several conditions associated with sickle cell disease, including sickle cell anemia, SS disease, sickle cell trait, sickle cell retinopathy, sickle cell dactylitis, sickle cell nephropathy, sickle cell hepatopathy, and sickle cell thalassemia. Anemia is very common across all of these conditions. This work uses Cytoscape with DisGeNET to find 290 genes and 380 variants for all sickle cell conditions, prunes them with microarray data from GEO, and obtains 60 of those 290 genes. The pathways of sickle cell disease associated with these 60 genes were then identified using Reactome.

2 Materials

2.1 DisGeNET Database

DisGeNET is a discovery platform that hosts one of the largest publicly accessible collections of genes and variants linked to human disorders. It combines data from curated sources, GWAS databases, animal models, and the scientific literature.

2.2 GEO2R Tool

To find the genes that most affect the network, gene expression analysis was performed using the GEO2R tool. The microarray data were retrieved from the NCBI GEO database. GEO is a free, open-access repository for functional genomics data that accepts submissions of MIAME-compliant data.

2.3 ReactomeFIViz Plugin

The ReactomeFIViz plugin is designed to find network patterns and pathways related to cancer and other diseases. The app provides access to the Reactome pathways stored in the database, supports pathway enrichment analysis for a set of genes, enables direct viewing of hit pathways using manually drawn pathway diagrams in Cytoscape, and enables investigation of the functional relationships between genes in hit pathways.

2.4 Cytoscape Software

Cytoscape is an open-source bioinformatics software platform for visualizing molecular interaction networks and integrating them with gene expression profiles and other state data. Additional functionality is available through plugins, including plugins for network and molecular profiling analysis, new layouts, additional file format support, database connections, and searching in large networks.

2.5 BiNGO

BiNGO is a free and open-source software program designed to discover ontology terms in subgraphs of biological networks, while working in partnership with Cytoscape. Developed with Java, BiNGO is user-friendly and provides a powerful method for biologists to explore statistically significant patterns within their networks.

3 Methods

3.1 Gene Disease Association Network

Cytoscape was used to create a gene network for sickle cell disease containing 136 genes and 145 variants. These genes and variants correspond to the various subtypes of the disease, such as Sickle Cell Anemia, Sickle Cell Trait, Sickle Cell-Beta-Thalassemia, Sickle Cell Nephropathy, Sickle Cell Retinopathy, Sickle Cell Dactylitis, and Sickle Cell-SS Disease, sourced from the DisGeNET database, as shown in Fig. 1. In this network, Sickle Cell Anemia is associated with 60 genes and 138 variants, Sickle Cell-SS Disease with 2 genes and 2 variants, Sickle Cell Dactylitis with 4 genes and 7 variants, Sickle Cell Trait with 11 genes and 16 variants, Sickle Cell Nephropathy with 6 genes, Sickle Cell Retinopathy with 2 genes, and Sickle Cell Hepatopathy with 1 gene.
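The per-disease gene counts listed above are node degrees in a bipartite gene-disease network. The following is a minimal sketch of that bookkeeping, assuming the associations are available as (gene, disease) pairs, for example from an exported table. The function names and the placeholder gene identifiers are hypothetical; they do not come from the DisGeNET app or from the study data.

```python
from collections import defaultdict


def build_gene_disease_network(associations):
    """Build both sides of a bipartite network from (gene, disease) pairs."""
    disease_to_genes = defaultdict(set)
    gene_to_diseases = defaultdict(set)
    for gene, disease in associations:
        disease_to_genes[disease].add(gene)
        gene_to_diseases[gene].add(disease)
    return disease_to_genes, gene_to_diseases


def shared_genes(gene_to_diseases, min_diseases=2):
    """Genes linked to at least min_diseases disease subtypes."""
    return sorted(
        gene for gene, diseases in gene_to_diseases.items()
        if len(diseases) >= min_diseases
    )


# Placeholder associations, for illustration only.
pairs = [
    ("GENE_A", "Sickle Cell Anemia"),
    ("GENE_A", "Sickle Cell Trait"),
    ("GENE_B", "Sickle Cell Anemia"),
    ("GENE_C", "Sickle Cell Dactylitis"),
]
disease_to_genes, gene_to_diseases = build_gene_disease_network(pairs)
for disease, genes in sorted(disease_to_genes.items()):
    print(disease, len(genes))
print("Shared between subtypes:", shared_genes(gene_to_diseases))
```

Using sets means that a gene reported for the same subtype by several sources is counted once, which is one reason a per-subtype tally can differ from the raw number of rows in a download.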

Fig. 1 Gene-disease association network

3.2 Pathway Enrichment Analysis

The pathway enrichment analysis carried out using the Reactome database revealed eight significant pathways that are affected by the genes shown in Table 1. The majority of these genes are related to Phase II – Conjugation of compounds, Interleukin-4 and Interleukin-13 signaling, Interleukin-10 signaling, Signaling by Interleukins, Biological oxidations, and Cytokine Signaling in Immune system.
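Table 1 reports an FDR next to each pathway P value. The text does not state which correction was applied, so the sketch below assumes the widely used Benjamini-Hochberg procedure purely for illustration. The function name and the example P values are hypothetical. The FDR values in Table 1 cannot be recomputed from the table alone, because the correction depends on the P values of every pathway that was tested, not only the eight that are listed.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P values for four tested pathways.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"P = {p:.3g}  FDR = {q:.3g}")
```

Two pathways can share the same adjusted value even when their raw P values differ, which is consistent with the repeated FDR entries in Table 1.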

3.3 Functional Interaction (FI) Network

The ReactomeFIViz app has been used to identify pathways and functional relationships related to sickle cell disease. After a careful analysis of the disease-associated pathways, 233 genes were selected and converted into a Functional Interaction (FI) network as shown in Fig. 2. This FI network offers an invaluable resource for further exploration of this complex condition. With the help of this app, researchers can now gain a better understanding of the underlying mechanisms of sickle cell disease and its associated pathways. To find the genes most affecting the network, the gene expression has been performed using the GEIO2R tools. Data from the NCBI GEO database was retrieved, which is a free, open-access repository for functional genomics data that accepts submissions of MIAME-compliant data, including both array and sequence data. Utilizing the tools provided, experiments and gene expression

126

Usha Chouhan et al.

Table 1 List of genes associated with pathways

Pathway name

No. of Pgenes value FDR

Glucuronidation

8

3E- 1.55E- UGT1A10, UGT1A5, UGT1A4, UGT1A3, UGT1A9 10 07 UGT1A8, UGT1A7, UGT1A6

Aspirin ADME

8

2E- 6.09E- UGT1A5, UGT1A4, UGT1A3, UGT1A9, UGT1A8, 08 06 UGT1A7, UGT1A6, SLC16A1

Phase II – Conjugation of compounds

10

2E- 3.76E- UGT1A10, COMT, GSTM2, UGT1A5, UGT1A4, 07 05 UGT1A3, UGT1A9, UGT1A8, UGT1A7, UGT1A6

Interleukin-4 and interleukin-13 signaling

10

3E- 3.76E- TNF, HMOX1, VCAM1, IL6, CXCL8, CD36, IL10, 07 05 NOS2, VEGFA, IL17A

Interleukin-10 signaling

7

7E- 7.11E- CSF3, TNF, IL6, CXCL8, CCR5, IL10, TNFRSF1A 07 05

Signaling by interleukins

18

9E- 8.02E- CSF3, TNF, HMOX1, MAP2K6, VCAM1, IL6, 07 05 TAB1, CXCL8, CD36, CCR5, IL10, IL17RE, TNFRSF1A, PSMA5, NOS2, SOD2, VEGFA, IL17A

Genes

Biological oxidations 10

1E- 7.72E- UGT1A10, COMT, GSTM2, UGT1A5, UGT1A4, 04 03 UGT1A3, UGT1A9, UGT1A8, UGT1A7, UGT1A6

Cytokine signaling in 19 immune system

1E- 7.72E- CSF3, TNF, HLA-G, HMOX1, MAP2K6, VCAM1, 04 03 IL6, TAB1, CXCL8, CD36, CCR5, IL10, IL17RE, TNFRSF1A, PSMA5, NOS2, SOD2, VEGFA, IL17A

profiles were carefully selected. The GSE72999 dataset was chosen for the processing, containing 20 samples enrolled in the study, 7 crisis state (HbSS CS), and 6 healthy controls (HbAA). Whole blood transcriptome was compared between the study groups and were defined and analyzed to find the gene, with crisis and healthy control being selected. Sixty genes were identified that are highly significantly expressed in the disease condition, associated with different pathways, that are thought to affect the pathways and contribute to the onset of the diseases as shown in Table 2. Studies revealed that despite having a high risk, it is necessary to maintain a primary prevention program to identify sickle cell disease at an earlier stage. By preventing the emergence of harmful complications as well as severe anemia, preventive diagnosis and followup would lower infant mortality. Simply put, monitoring for sickle cell disease would prevent the number of years lost to illness, incapacity, or early death a measure of the loss of life. So, this study focused on the genes involved in sickle cell diseases or

Uncovering Significance Pathways of Sickle Cell Disease


Fig. 2 Functional interaction (FI) network from the enriched pathway

responsible genes which cause sickle cell diseases. The DisGeNET data source was first used to examine gene involvement, and the Cytoscape program was then used to obtain the genes and variants linked to sickle cell disorders. In total, 290 genes and 380 variants are directly associated with sickle cell disease. After that, a GEO2R analysis was carried out using the Gene Expression Omnibus (GEO) database. Samples from patients and from healthy individuals provide a gene expression data set, which was then used to prune the 290 genes. Our analysis revealed 60 significant genes after pruning. These genes are involved in different pathways and can affect these pathways, potentially leading to disease. The major findings of this study are the genes and their involvement in pathways, for example, the alpha-beta signaling pathway (10 genes), ISG antiviral mechanism (21 genes), OAS antiviral response (2 genes), interleukin-1 family signaling pathway (24 genes), interleukin-4 and interleukin-13 signaling (2 genes), interleukin-10 signaling pathway (2 genes), and aspirin ADME (2 genes). Finally, gene ontology analysis was performed on the 60 genes to find their involvement in biological processes, cellular components, and molecular functions. The present analysis yielded the most significant genes and the effect of these genes on biological pathways. These data may be helpful for future studies dealing with the pathogenesis and complexity of sickle cell disease (SCD), and this methodology could be applied to other diseases.


Usha Chouhan et al.

Table 2 List of differentially expressed genes in the disease

S. No  Gene      logFC      P value     S. No  Gene      logFC     P value
1      IL36B     -1.72838   2.76E-05    31     PSMC1     0.383233  0.0321
2      IFNA4     -1.30123   0.00621     32     PSMB2     0.423664  0.0252
3      RPS27A    -1.19573   0.00194     33     PPM1B     0.459697  0.0481
4      PLCG1     -1.18642   5.63E-04    34     PTPN1     0.47427   0.0209
5      KPNA5     -1.08312   5.20E-04    35     PSMD11    0.477777  0.0261
6      ABCE1     -1.04563   0.00308     36     PSMD2     0.484961  0.0311
7      IL18BP    -1.03152   0.00341     37     SKP1      0.501075  0.0318
8      POM121    -0.94479   0.0399      38     UBE2V1    0.529217  0.0248
9      TPR       -0.8664    0.011       39     ARIH1     0.609366  0.00739
10     IL1RAP    -0.83231   0.0147      40     EIF4A1    0.62427   0.0346
11     N6AMT1    -0.82415   6.62E-04    41     EIF4E     0.666949  0.0272
12     IL13RA1   -0.81038   0.00624     42     PTPN11    0.692256  0.00242
13     MTR       -0.76223   0.0101      43     PSMD1     0.745295  0.00705
14     SIGIRR    -0.74357   6.23E-04    44     SEC13     0.76065   0.00373
15     NUP85     -0.72036   0.00224     45     MAP2K6    0.793586  0.0308
16     NUP214    -0.57847   0.021       46     STAT3     0.799134  0.0158
17     TRMT112   -0.5596    0.00665     47     PSMF1     0.828157  0.0202
18     NUP43     -0.52144   0.038       48     IL37      0.836597  0.0294
19     NUP98     -0.52006   0.0213      49     TAB3      0.851474  5.68E-04
20     MAP3K3    -0.47884   0.0425      50     IFNA2     0.895354  0.0358
21     PSMB10    -0.46834   0.045       51     FLNA      0.911787  0.00908
22     AAAS      -0.42273   0.0274      52     IRAK3     0.925882  0.00828
23     NUP93     -0.38043   0.0468      53     EIF4G3    0.981188  0.0141
24     IKBKB     -0.37171   0.038       54     PSMD5     1.057259  0.014
25     IFIT1     1.43003    0.0291      55     SOCS3     1.153609  0.0281
26     GCLC      1.457461   0.0071      56     RIPK2     1.182576  0.0289
27     BSG       1.589206   6.49E-05    57     PSME4     1.235109  7.68E-04
28     USP18     1.84615    0.02        58     ISG15     1.287275  0.0125
29     IL10      1.857418   7.44E-05    59     SOCS1     1.331334  0.00842
30     HERC5     1.940078   0.0198      60     SLC16A1   2.094368  5.47E-04


3.4 Network Enrichment Analysis


Network enrichment analysis is a powerful extension of traditional gene enrichment analysis that integrates it with the connectivity information between genes provided by genetic networks. We used BiNGO to undertake functional and pathway enrichment analysis to gain a general understanding of the biological processes and functions of the 60 genes in the gene-disease association network. Gene ontology (GO) takes three different facets of how a gene works into account and is accordingly broken down into three basic groups: biological process (BP), molecular function (MF), and cellular component (CC). Network-based pathway enrichment analysis is a method for finding regulatory pathways or gene ontologies with statistically significant links to a given gene collection. We used BiNGO to perform a two-sided hypergeometric test on the gene set and indexed GO terms. Proteins are constantly being broken down, a process known as protein catabolism, and need to be replenished. Protein catabolism is regulated by a number of different mechanisms, one of which is the ubiquitin-protein ligase system. This system is responsible for the degradation of proteins that are no longer needed by the cell. It is also involved in the regulation of transcription, the process by which DNA is transcribed into RNA. Biological processes are the processes that occur within living organisms in order to maintain life, such as metabolism, growth, reproduction, and response to stimuli. All of these processes are necessary for an organism to stay alive and function properly. Significant BP GO terms linked with sickle cell anemia were identified.
These BP GO terms included negative regulation of cellular protein metabolic process, ligase activity, protein metabolic process, protein modification process, protein ubiquitination, ubiquitin-protein ligase activity, ubiquitin-protein ligase activity involved in mitotic cell cycle, positive regulation of ligase activity, molecular function, protein ubiquitination, ubiquitin-protein ligase activity, ubiquitin-protein ligase activity involved in mitotic cell cycle, regulation of cellular protein metabolic process, ligase activity, protein metabolic process, protein modification process, protein ubiquitination, ubiquitin-protein ligase activity, and ubiquitin-protein ligase activity involved in mitotic cell cycle. Cellular components are the key structures of a cell, responsible for the conversion of energy, transport of materials, and other vital processes. The significant CC GO terms included macromolecular complex, protein complex, and proteasome complex, as well as nuclear and pore terms, etc., as shown in Fig. 3. These are structures made up of multiple proteins that work together to perform specific tasks. Another important cellular component is the nuclear pore complex, which provides a gateway for molecules to pass in and out of the nucleus. All of these components are essential for a cell to function properly. The


Fig. 3 Network enrichment analysis of biological process, cellular component, and molecular function

molecular function is a very important part of the Gene Ontology. Protein binding, a prominent molecular function, is the process by which proteins interact with other proteins or molecules to form complex structures or to carry out specific functions. Protein binding can be used to regulate gene expression or as part of signaling pathways. It can also be an important factor in disease and drug resistance. Protein binding is a complex process that involves multiple steps, including recognition, affinity, and specificity. In the recognition step, two molecules, such as a protein and a ligand, interact to form a complex structure. In the affinity step, the two molecules interact with each other to form a tight bond. Finally, in the specificity step, the two molecules interact so that they can perform their intended function. Protein binding plays an important role in many biological processes; the enriched molecular function terms are shown in Fig. 3.
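The hypergeometric test that BiNGO applies to each GO term can be illustrated with a small sketch. The function name `hypergeom_enrichment_p` and the toy counts are assumptions made for illustration, not BiNGO's own code or API, and only the over-representation (upper) tail is computed here rather than the two-sided variant mentioned above.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Upper-tail hypergeometric p-value P(X >= hits).

    population: number of genes in the reference set
    annotated:  reference genes carrying the GO term
    sample:     size of the gene list being tested (e.g. 60)
    hits:       genes in the list that carry the GO term
    """
    upper = min(annotated, sample)
    # Exact integer count of all draws with at least `hits` annotated genes.
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Toy numbers (hypothetical): 10 reference genes, 5 annotated, list of 5 genes.
p_all_hits = hypergeom_enrichment_p(10, 5, 5, 5)  # every listed gene is annotated
p_no_hits = hypergeom_enrichment_p(10, 5, 5, 0)   # the tail covers every outcome
print(p_all_hits, p_no_hits)
```

In practice the resulting p-values would still need a multiple-testing correction across all GO terms tested.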


4 Notes

1. DisGeNET is a discovery platform featuring one of the largest publicly accessible collections of human disease-associated genes and variants. It gathers data from multiple sources, such as GWAS libraries, animal models, and the scientific literature. Ontologies and controlled vocabularies are used to label DisGeNET data uniformly, while distinctive metrics help prioritize genotype-phenotype links [5].
2. Establishing all recommended parameters is essential to generate a network of gene-disease-variant relationships (Fig. 1). The disease class must be set to any, the association type must be set to any, and the select source button must be set to all. Furthermore, the source and EI should both range between 0 and 1, and the disease to research must be specified. After configuring all the fields, click Create Network to create the gene-disease-variant-linked network. To prevent any network generation-related issues, all fields must be filled out correctly [6].
3. NCBI's GEO2R tool was used to identify genes that are differentially expressed across the experimental conditions (Table 2). For the significance analysis, the P-value adjustment method (Benjamini & Hochberg False Discovery Rate, Benjamini & Yekutieli, Bonferroni, Hochberg, Holm, or Hommel), the log transformation, and a significance level cut-off (in the range of 0–1) must be set [7].
4. The ReactomeFIViz plugin allows users to visualize pathways associated with any disease. Care must be taken when entering the gene of interest, as incorrect input may lead to insignificant findings in the primary pathway analysis [8].
5. The Cytoscape application was used for topological analysis and visualization of the networks, which requires parameters such as degree, betweenness centrality, closeness centrality, path length, etc., to be set accurately [9].

References

1. Jin L, Zuo XY, Su WY, Zhao XL, Yuan MQ, Han LZ, Zhao X, Chen YD, Rao SQ (2014) Pathway-based analysis tools for complex diseases: a review. Genomics Proteomics Bioinformatics 12(5):210–220
2. Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN, Hanspers KA, Miller R, Digles D, Lopes EN, Ehrhart F, Dupuis LJ (2021) WikiPathways: connecting communities. Nucleic Acids Res 49(D1):D613–D621
3. Biological pathways fact sheet (2022). https://www.genome.gov/about-genomics/fact-sheets/Biological-Pathways-Fact-Sheet
4. ter Maaten JC, Arogundade FA (2010) Sickle cell disease. In: Comprehensive clinical nephrology. Elsevier, Philadelphia, pp 596–608
5. Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, García-García J, Sanz F, Furlong LI (2016) DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research 45(D1):D833–D839. https://doi.org/10.1093/nar/gkw943
6. Choudhari JK, Chatterjee T, Gupta S, Garcia-Garcia JG, Vera-González J (2021) Network biology approaches in ophthalmological diseases: a case study of glaucoma (pp 190–202). https://doi.org/10.1016/B978-0-12-801238-3.11586-7
7. Clough E, Barrett T (2016) The gene expression omnibus database. Statistical Genomics: Methods and Protocols 93–110. https://doi.org/10.1007/978-1-4939-3578-9_5
8. Wu G, Feng X, Stein L (2010) A human functional protein interaction network and its application to cancer data analysis. Genome Biology 11(5):R53. https://doi.org/10.1186/gb-2010-11-5-r53
9. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 13(11):2498–2504. https://doi.org/10.1101/gr.1239303

Chapter 8

A Review of Computational Approach for S-system-based Modeling of Gene Regulatory Network

Sudip Mandal and Pijush Dutta

Abstract

Inference of a gene regulatory network (GRN) from time series microarray data remains a fascinating task for computer science researchers seeking to understand the complex biological processes that occur inside a cell. Among the popular models for inferring GRNs, the S-system is considered one of the most promising non-linear mathematical tools for modeling the dynamics of gene expression as well as for inferring the GRN. The S-system is based on biochemical system theory and the power law formalism. By observing the values of the kinetic parameters of the S-system model, it is possible to extract the regulatory relationships among genes. In this review, several existing intelligent methods that have been proposed for inference of S-system-based GRNs are explained. It is observed that finding the most suitable and efficient optimization technique for accurate inference of all kinds of networks, i.e., in-silico, in-vivo, etc., with low computational complexity is still an open research problem. This paper may help beginners or researchers who want to continue their research in the field of computational biology and bioinformatics.

Key words Gene regulatory network, Microarray data, S-system, Bio-chemical system theory, Power law function, Optimization, Regularization, Cardinality, Decoupling

1 Introduction

Genes, which are made up of deoxyribonucleic acid (DNA) strands, are responsible for all kinds of activities inside a living organism. By means of the transcription and translation processes, different proteins are synthesized from genes, and these proteins control the different functions of a cell. Moreover, one gene can control or regulate the protein synthesis activity of another gene, which is a complex biological process. A gene regulatory network (GRN) is a model that consists of a group of DNA segments or genes which interact with each other in a direct or indirect way. The study of GRNs is very important for understanding complex biological phenomena inside cells, for finding the cause of a disease, and for finding a remedy for it (i.e., drug design).

Sudip Mandal (ed.), Reverse Engineering of Regulatory Networks, Methods in Molecular Biology, vol. 2719, https://doi.org/10.1007/978-1-0716-3461-5_8, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024


Normally, a gene regulatory network is represented by a directed graph in which nodes represent genes and edges represent gene regulations, i.e., relations or interactions among genes (e.g., activation or suppression). Depending on the nature of the control, a regulatory relationship may be of two types: activation, where the expression value of the target gene increases, and suppression, where the expression value of the target gene decreases. Thus, a GRN can be represented by a weight matrix $W = [w_{i,j}]_{N \times N}$, where N is the number of genes in the GRN. If there is no regulation, the weight is zero; if there is a regulation, the weight has a finite non-zero value. Nowadays, DNA microarray data [1–3], which contain the expression levels of thousands of genes, are widely used to extract the hidden relationships among genes. Moreover, recent progress in time series DNA microarray technologies has greatly helped biologists to examine the dynamic behaviors of, and the regulations among, different genes by analyzing the expression values of all genes at different time instances. Therefore, reverse engineering of GRNs from time series microarray data, which are readily available online [4–6], is a very challenging but crucial task for computer science researchers. Usually, different artificial intelligence methods are employed for this reverse engineering problem. Many types of linear or non-linear mathematical models have already been proposed to infer gene regulatory networks and their dynamics from time series microarray data. For example, Boolean networks [7, 8] examine binary state transition matrices to search for patterns in gene expression depending on a binary state function.
A dynamic Bayesian network [9, 10] makes conditional probabilistic transitions between network states and merges the features of a hidden Markov model to include feedback. A neural network [11–13] combined with a genetic algorithm (GA) was also proposed to infer GRNs successfully. The recurrent neural network (RNN) [14–16], a closed-loop neural network with a delayed feedback variable, has been widely used for inference of genetic networks from time series data. The S-system [17, 18] is another popular model, based on Biochemical System Theory [19], in which a set of differential equations with power law formalism is used to represent a GRN. In this review, we mainly focus on the S-system as a model of genetic system dynamics from temporal data. Generally, the S-system is used together with an optimization method to infer the GRN, with microarray data as training data and the training error as the objective function of the optimization. In this paper, we revisit the concept of the S-system and review some existing methods for S-system-based modeling of GRNs, which may help beginners who are interested in carrying out research in the field of computational biology/bioinformatics.

Review on S-system based Modeling on GRN


Moreover, we also emphasize some major issues and their solutions during optimization of the S-system model for GRN. Next, some open research problems in this domain are elaborated in the conclusion section.

2 History of S-system

A complex system like a GRN can be modeled by a set of ordinary differential equations (ODEs). Nonlinear differential equation models can model more complicated genetic behaviors and dynamics successfully. If there are N genes of interest, and $X_i$ is the expression level of the i-th gene, then the dynamics of a GRN may be modeled as:

$$\frac{dX_i}{dt} = f_i(X_1, X_2, \ldots, X_N) \qquad (1)$$

The function $f_i$ is non-linear in nature and needs to be estimated from the time series dataset. Biochemical Systems Theory (BST) [20–28], proposed by Savageau, is the foundation for a set of analytical and modeling tools that facilitate the analysis of dynamic biological systems. It was originally developed for biochemical pathways but has subsequently found much wider application in systems biology, gene regulatory systems, and various biomedical and other areas. BST is called “canonical”, which means that model construction, diagnosis, and analysis follow stringent rules. The key ingredient of BST is the power-law representation of all processes in a system. The hallmark of BST models is the formulation of each process $v_i$ as a product of power-law functions [29–31] of the form

$$v_i = \gamma_i \prod_{j=1}^{N} X_j^{f_{i,j}} \qquad (2)$$

where the rate constant $\gamma_i$ is non-negative and the kinetic orders $f_{i,j}$ are positive for variables $X_j$ that have an augmenting or activating effect on $v_i$, negative for variables that have an inhibiting effect, and 0 for variables that do not have a direct effect on $v_i$ at all. In other words, the kinetic-order parameters quantify the strength of the effect that a variable has on a process. Coarsely speaking, one may approximate every process in the system separately, which leads to the Generalized Mass Action (GMA) form [32–34], or one may aggregate some processes first and then approximate the result, which leads to S-systems. In the GMA representation, each process is represented as one power-law term, as shown in Eq. (2), and the result is therefore a set of differential equations whose right-hand sides each consist of a difference between the sum of power-law terms for all influxes and the sum of power-law terms for all effluxes.


The dynamics of a generalized mass action (GMA) model with n dependent variables and m independent variables (the latter are not affected by the dynamics of the system) are given by

$$\frac{dX_i}{dt} = \sum_{k=1}^{T_i} \left( \pm \gamma_{i,k} \prod_{j=1}^{n+m} X_j^{f_{i,j,k}} \right) \qquad (3)$$

where $T_i$ is the number of terms in the i-th equation. The second BST variant is the so-called S-system [17–19, 35–37] format. Here, the focus is on pools (dependent variables) rather than on fluxes: all fluxes entering a pool are represented by only one collective power-law term and all fluxes leaving a pool are represented by one collective power-law term. The S-system representation is built by first aggregating kinetic descriptions for the different processes affecting a metabolite $X_i$; those that account for the synthesis of $X_i$ yield a net rate law $V_i$, and those that account for its degradation yield a net rate law $V_{-i}$. As a consequence, S-systems have at most one positive and one negative term in each equation, and their general format is

$$\frac{dX_i}{dt} = V_i - V_{-i} = \alpha_i \prod_{j=1}^{n+m} X_j^{g_{i,j}} - \beta_i \prod_{j=1}^{n+m} X_j^{h_{i,j}}, \qquad (i = 1, 2, \ldots, n) \qquad (4)$$

where $X_i$ is the state variable, and $\alpha_i$ and $\beta_i$ are the positive rate constants of the increasing and decreasing terms, respectively. $g_{i,j}$ and $h_{i,j}$ are the exponential parameters, also called kinetic orders. If $g_{i,j} > 0$, gene j excites the expression level of gene i; if $g_{i,j} < 0$, gene j inhibits it. $h_{i,j}$ has the inverse effect on gene expression compared to $g_{i,j}$. As the S-system model is characterized by power-law functions, it is capable of capturing various dynamics in many complex biochemical systems. Equation (4) is called an “S-system” because it accurately captures the saturable and synergistic [18, 19] properties intrinsic to biological and other organizationally complex systems. Gene expression in a time series microarray can be considered a saturable system, as the expression value is observed to saturate with time, i.e., it moves gradually from a transient state to a steady state. Moreover, a change in gene expression is a combined effect of all genes, i.e., gene regulation inside a cell follows the synergistic effect of genes. The S-system has the following important characteristics [18, 19]: (1) accuracy in predicting values of the concentration variables at steady states around the nominal operating point, (2) accuracy in predicting values of the flux variables at steady states around the nominal operating point, (3) accuracy in predicting transient responses between two such steady states, and (4) robustness of the representation.
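As a small worked example of the S-system form (an illustration added under simplifying assumptions, not a result taken from the cited works), consider a single variable with no independent variables. Setting the derivative to zero gives the steady state in closed form, and for several variables the steady-state conditions become linear in the logarithms of the variables, provided all $X_j > 0$:

```latex
% One-variable S-system, assuming X > 0, \alpha, \beta > 0 and g \neq h
\frac{dX}{dt} = \alpha X^{g} - \beta X^{h} = 0
\;\Longrightarrow\;
X^{*} = \left(\frac{\alpha}{\beta}\right)^{\frac{1}{h-g}}

% General case: \alpha_i \prod_j X_j^{g_{i,j}} = \beta_i \prod_j X_j^{h_{i,j}}
\sum_{j=1}^{n+m} \left(g_{i,j} - h_{i,j}\right) \ln X_j = \ln\frac{\beta_i}{\alpha_i},
\qquad i = 1, 2, \ldots, n
```

For instance, with $\alpha = 5$, $\beta = 10$, $g = 0$, and $h = 2$ (values chosen only for illustration), $X^{*} = 0.5^{1/2} \approx 0.707$.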


3 S-system Based Modeling of GRN

In this section, we shall discuss the preliminary concept of S-system-based modeling and several issues associated with it.

3.1 Preliminary of S-system-based GRN

The S-system is a popular and well-accepted nonlinear mathematical model for the nonlinear dynamics and complex chemical reactions driven by genes. A GRN based on the S-system model is represented by a set of $\alpha_i$, $\beta_i$, $g_{i,j}$, and $h_{i,j}$, which are called S-system parameters. The number of such parameters is 2N(N + 1), where N is the number of state variables. When the dynamics of the model are computed by numerically solving Eq. (4), the parameters cannot be determined sequentially; all must be determined simultaneously because they affect one another. In the literature, many techniques have been proposed for the inference of S-system parameters; they are briefly described one by one in the next section. Conventionally, a genetic network is inferred with the S-system model by finding the parameter values for which the training error is minimized. However, an S-system may have several different optimal solutions (i.e., local minima), depending on the values of its parameters. Therefore, optimization of the S-system parameters using a suitable optimization technique is a crucial step in gene network reconstruction. All optimization methods use an objective function or fitness function to measure the goodness of a solution. The most common estimation criterion is the squared error, defined as follows:

$$f = \sum_{k=1}^{M} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^{2} \qquad (5)$$

where N is the number of genes in the problem, T is the number of sampling instances of the observed gene expression data, and M is the number of training datasets. $X_{cal,k,i,t}$ is the numerically calculated expression value of the i-th gene at time t in the k-th dataset, obtained using the inferred parameters of the S-system model. $X_{exp,k,i,t}$ is the actual expression level of the i-th gene at time t in the k-th dataset. f denotes the total squared relative error between the calculated and the observed gene expression data. Therefore, S-system modeling is a non-linear function optimization problem: the optimal values of the S-system parameters are found by minimizing the fitness function (the squared error), so that the calculated gene expression data fit the observed gene expression data. Depending on the kinetic parameters (value and sign), the type of regulation can be easily predicted. The rules for inference of a GRN using an S-system are as follows.


• If ($g_{i,j}$ = +ve and $h_{i,j}$ = -ve) or ($g_{i,j}$ = +ve and $h_{i,j}$ = 0) or ($g_{i,j}$ = 0 and $h_{i,j}$ = -ve), then it denotes activation.
• If ($g_{i,j}$ = -ve and $h_{i,j}$ = +ve) or ($g_{i,j}$ = -ve and $h_{i,j}$ = 0) or ($g_{i,j}$ = 0 and $h_{i,j}$ = +ve), then it denotes inhibition.
• If $g_{i,j}$ and $h_{i,j}$ are both positive (or both negative), then there may be one true regulation and one false regulation, or both may be false regulations.
• If $g_{i,j}$ and $h_{i,j}$ are both zero, then it is considered as non-regulation.

3.2 Few Major Issues Regarding Optimization for S-system Parameters

During optimization of S-system-based GRN, some difficulties arise which are described as follows.

3.2.1 Major Issue 1: Computational Complexity

Since 2N(N + 1) parameters must be determined for N genes to solve the set of equations in (4), the parameter search space of the S-system model is 2N(N + 1)-dimensional. Searching this space becomes computationally too expensive for large-scale genetic networks. Optimizing this huge number of parameters in a single program is therefore quite difficult and may lead to erroneous results and long computation times. The complexity needs to be reduced so that runtime is minimized and inference accuracy increases. Moreover, the algorithm should be able to infer the GRN from a small number of time series data in little runtime, without getting stuck at a local optimum.
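To make the size of this search space concrete, the sketch below stores one candidate solution as a flat vector of length 2N(N + 1) and scores it with the squared relative error of Eq. (5). The vector layout (all $g_{i,j}$, then all $h_{i,j}$, then $\alpha_i$, then $\beta_i$) and the function names are assumptions chosen for illustration; the surveyed methods may encode parameters differently.

```python
def n_parameters(n_genes):
    """Number of S-system parameters for N genes: 2N(N + 1)."""
    return 2 * n_genes * (n_genes + 1)


def unpack(theta, n_genes):
    """Split a flat parameter vector into (g, h, alpha, beta).

    Assumed layout: N*N values of g (row-major), N*N values of h,
    then N values of alpha and N values of beta.
    """
    n = n_genes
    if len(theta) != n_parameters(n):
        raise ValueError("theta must have length 2N(N + 1)")
    g = [theta[i * n:(i + 1) * n] for i in range(n)]
    h = [theta[n * n + i * n:n * n + (i + 1) * n] for i in range(n)]
    alpha = theta[2 * n * n:2 * n * n + n]
    beta = theta[2 * n * n + n:]
    return g, h, alpha, beta


def squared_relative_error(x_cal, x_exp):
    """Eq. (5): sum over datasets k, genes i and time points t.

    x_cal and x_exp are nested lists indexed as [k][i][t];
    observed values must be non-zero.
    """
    total = 0.0
    for cal_k, exp_k in zip(x_cal, x_exp):
        for cal_i, exp_i in zip(cal_k, exp_k):
            for c, e in zip(cal_i, exp_i):
                total += ((c - e) / e) ** 2
    return total


# Search-space dimension for the two artificial benchmark sizes.
dims = {n: n_parameters(n) for n in (5, 20)}

# Toy data (hypothetical): 1 dataset, 2 genes, 2 time points.
x_exp = [[[1.0, 2.0], [4.0, 5.0]]]
x_cal = [[[1.0, 1.0], [2.0, 5.0]]]
error = squared_relative_error(x_cal, x_exp)
print(dims, error)
```

For N = 5 this gives 60 parameters and for N = 20 it gives 840, which illustrates how quickly the search space grows with network size.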

3.2.2 Major Issue 2: Accuracy in the Prediction of Dynamics of Genes

Learning the S-system model parameters so that the predicted expression dynamics best fit the training data is, in essence, an optimization problem. Normally, in S-system-based gene regulatory network reconstruction using different metaheuristics, two criteria need to be satisfied. The first is minimization of the training error f in Eq. (5), which leads to correct prediction of the gene expression dynamics. The second concerns the inferred values of the regulatory parameters, since their magnitudes can affect the network connectivity. So, we define another performance measure, the Inferred Parametric Error (IPE), which measures the deviation in magnitude of the inferred S-system parameters from the original ones:

$$IPE = \sum_{i,j=1}^{N} \left| g_{i,j}^{exp} - g_{i,j}^{cal} \right| + \sum_{i,j=1}^{N} \left| h_{i,j}^{exp} - h_{i,j}^{cal} \right| + \sum_{i=1}^{N} \left| \alpha_i^{exp} - \alpha_i^{cal} \right| + \sum_{i=1}^{N} \left| \beta_i^{exp} - \beta_i^{cal} \right| \qquad (6)$$

where $g_{i,j}^{exp}$, $h_{i,j}^{exp}$, $\alpha_i^{exp}$, $\beta_i^{exp}$ are the actual values of the S-system parameters and $g_{i,j}^{cal}$, $h_{i,j}^{cal}$, $\alpha_i^{cal}$, $\beta_i^{cal}$ are the calculated values of the same. Using these two types of performance parameter, we can estimate


the efficiency of an inference algorithm in accurately predicting the dynamics of genes. Both the training error and the IPE should be as small as possible.

3.2.3 Major Issue 3: Over-fitting Problem

Another important criterion to keep in mind is that the structure of the inferred gene regulatory network must be accurate. Real-life genetic networks are sparsely connected [38], i.e., very few connections exist between the genes. Even when the dynamics are predicted correctly, i.e., the training error is very small, the network structure may be completely different (i.e., a different set of g and h) because the metaheuristic has found a different optimal solution (i.e., a local optimum); this is known as the over-fitting problem. As a result, true regulations are missed and many false regulations are included. The performance of an algorithm on the GRN inference problem is therefore measured in terms of its sensitivity ($S_n$), specificity ($S_p$), and F-score, which are defined as follows:

$$S_n = \frac{TP}{TP + FN} \qquad (7)$$

$$S_p = \frac{TN}{TN + FP} \qquad (8)$$

$$F\text{-score} = \frac{2 \times TP}{2 \times TP + FP + FN} \qquad (9)$$
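Equations (6)–(9) and the sign rules of Sect. 3.1 can be combined into a small evaluation routine, sketched below. Here TP and TN count correctly predicted regulations and non-regulations, while FP and FN count wrongly predicted regulations and non-regulations. The helper names, the choice to count the ambiguous same-sign case as a predicted edge, the use of absolute differences in the IPE, and the three-gene toy parameters are all assumptions made for illustration.

```python
def regulation_type(g, h):
    """Classify the effect of gene j on gene i from the signs of g_ij and h_ij."""
    if g == 0 and h == 0:
        return "none"
    if g >= 0 and h <= 0:
        return "activation"   # (+, -), (+, 0) or (0, -)
    if g <= 0 and h >= 0:
        return "inhibition"   # (-, +), (-, 0) or (0, +)
    return "ambiguous"        # same sign: not resolved by the rules


def edge_matrix(g, h):
    """Boolean matrix: True wherever some regulation is predicted."""
    n = len(g)
    return [[regulation_type(g[i][j], h[i][j]) != "none" for j in range(n)]
            for i in range(n)]


def structure_scores(true_edges, pred_edges):
    """Sensitivity, specificity and F-score, Eqs. (7)-(9).

    Assumes the true network has at least one edge and one non-edge.
    """
    tp = tn = fp = fn = 0
    for row_t, row_p in zip(true_edges, pred_edges):
        for t, p in zip(row_t, row_p):
            if t and p:
                tp += 1
            elif not t and not p:
                tn += 1
            elif p:
                fp += 1
            else:
                fn += 1
    return tp / (tp + fn), tn / (tn + fp), 2 * tp / (2 * tp + fp + fn)


def abs_diff_sum(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def ipe(true, pred):
    """Inferred Parametric Error, Eq. (6), with absolute differences."""
    total = 0.0
    for name in ("g", "h"):
        for row_t, row_p in zip(true[name], pred[name]):
            total += abs_diff_sum(row_t, row_p)
    for name in ("alpha", "beta"):
        total += abs_diff_sum(true[name], pred[name])
    return total


# Hypothetical 3-gene "true" and "inferred" parameter sets.
true = {"g": [[0.0, 0.0, 1.0], [2.0, 0.0, 0.0], [0.0, -1.0, 0.0]],
        "h": [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]],
        "alpha": [5.0, 10.0, 10.0], "beta": [10.0, 10.0, 10.0]}
pred = {"g": [[0.0, 0.5, 0.8], [1.5, 0.0, 0.0], [0.0, 0.0, 0.0]],
        "h": [[1.5, 0.0, 0.0], [0.0, 2.5, 0.0], [0.0, 0.0, 2.0]],
        "alpha": [5.0, 9.0, 10.0], "beta": [10.0, 10.0, 10.0]}

sn, sp, f_score = structure_scores(edge_matrix(true["g"], true["h"]),
                                   edge_matrix(pred["g"], pred["h"]))
print(sn, sp, f_score, ipe(true, pred))
```

In this toy case one false regulation is added and one true regulation is missed, so the structural scores fall below 1 even though the inferred parameters are numerically close to the true ones.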

TP (True Positive) denotes the number of correctly predicted regulations, and TN (True Negative) represents the number of correctly predicted non-regulations. FP (False Positive) denotes the number of incorrectly predicted regulations, and FN (False Negative) represents the number of incorrectly predicted non-regulations by the inference algorithm. Sensitivity denotes the fraction of the existing edges in the original network that are correctly predicted in the inferred network. Specificity denotes the fraction of the non-existent edges in the original network that are correctly identified as non-existent in the inferred network. The F-score is calculated to evaluate an algorithm with a single value, without having to examine the trade-off between sensitivity and specificity. Ideally, sensitivity ($S_n$), specificity ($S_p$), and F-score should all equal 1 for accurate inference of a GRN.

3.3 Proposed Solutions Regarding the Above Issues

To overcome the above-mentioned problems during optimization, the following actions may be taken.

3.3.1 Decoupling to Reduce Computational Complexity

To overcome the problem of computational complexity, the genetic network inference problem can be divided or decoupled into several sub-problems, one for each gene. The change in the expression


level of a particular gene at a given time instant depends only on the expression levels of all genes at the previous time instant. Moreover, the changes in expression level of different genes at that time instant are independent of each other. Therefore, a decoupling procedure can be introduced here without losing any vital information. The decoupled S-system was first proposed by Kimura et al. [39] and is discussed in detail in the literature survey section. Moreover, a small number of training data should be used so that the computational time of the algorithm does not become too large.

3.3.2 Selection of Suitable Optimization Technique to Increase Accuracy

The No Free Lunch (NFL) theorem [40] states that no single metaheuristic is best suited for solving all kinds of optimization problems. Therefore, new and suitable metaheuristics must be employed for accurate prediction of the dynamics of gene expression, so that the training error and IPE are minimized. Moreover, it is possible to modify and tune existing metaheuristics for faster convergence and better accuracy.
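As a deliberately simple stand-in for the metaheuristics discussed here, the sketch below fits the parameters of one gene at a time (the decoupling idea of Sect. 3.3.1) with a random hill climber. The one-step Euler prediction, the step size, the perturbation scale, and all function names are assumptions made for illustration; this is not the algorithm of Kimura et al. [39] or of any other surveyed method.

```python
import random


def predict_next(x_prev, i, params, dt):
    """One Euler step for gene i using all genes at the previous time point."""
    n = len(x_prev)
    alpha, beta = params[0], params[1]
    g, h = params[2:2 + n], params[2 + n:]
    prod_g = prod_h = 1.0
    for j in range(n):
        prod_g *= x_prev[j] ** g[j]
        prod_h *= x_prev[j] ** h[j]
    return x_prev[i] + dt * (alpha * prod_g - beta * prod_h)


def gene_error(series, i, params, dt):
    """Squared relative error of gene i only (decoupled form of Eq. 5)."""
    err = 0.0
    for t in range(1, len(series)):
        cal = predict_next(series[t - 1], i, params, dt)
        err += ((cal - series[t][i]) / series[t][i]) ** 2
    return err


def hill_climb(series, i, dt, iterations=2000, scale=0.1, seed=0):
    """Random hill climber over the 2(N + 1) parameters of gene i."""
    rng = random.Random(seed)
    n = len(series[0])
    best = [1.0, 1.0] + [0.0] * (2 * n)      # alpha, beta, g_i1..g_iN, h_i1..h_iN
    start_err = best_err = gene_error(series, i, best, dt)
    for _ in range(iterations):
        cand = [p + rng.gauss(0.0, scale) for p in best]
        cand[0], cand[1] = abs(cand[0]), abs(cand[1])  # rate constants stay non-negative
        err = gene_error(series, i, cand, dt)
        if err < best_err:                   # accept improvements only
            best, best_err = cand, err
    return best, start_err, best_err


# Toy observed series (hypothetical): 4 time points, 2 genes, all positive.
series = [[1.0, 0.5], [0.9, 0.6], [0.85, 0.68], [0.82, 0.74]]
params, start_err, final_err = hill_climb(series, i=0, dt=0.1)
print(start_err, final_err)
```

Each gene is searched in a 2(N + 1)-dimensional space instead of the full 2N(N + 1)-dimensional one; a hill climber like this can stall in local optima, which is one reason more capable metaheuristics are preferred in practice.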

3.3.3 Regularization to Deal with Over-fitting Problem

To overcome the over-fitting problem, researchers normally add different regularizers to the training error or objective function. The modified objective function balances the actual network structure against the dynamics of the genes. A regularizer normally acts as a penalty or pruning term that adds a penalty depending on the architecture of the GRN. In the literature, many kinds of penalty terms have been proposed; they will be discussed in the next section.
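One simple way to express such a regularizer is an L1-style penalty on the kinetic orders, which favours sparse networks. The sketch below is an assumption chosen for illustration (the penalty form, the weight `lam`, and the function names are hypothetical) and is not one of the specific penalty terms from the surveyed literature.

```python
def l1_penalty(g, h):
    """Sum of absolute kinetic orders; smaller for sparser networks."""
    return (sum(abs(v) for row in g for v in row)
            + sum(abs(v) for row in h for v in row))


def regularized_objective(training_error, g, h, lam=0.1):
    """Training error (e.g. Eq. 5) plus a weighted sparsity penalty."""
    return training_error + lam * l1_penalty(g, h)


# Two hypothetical candidates with the same training error.
h_diag = [[2.0, 0.0], [0.0, 2.0]]
g_sparse = [[0.0, 1.0], [0.0, 0.0]]
g_dense = [[0.5, 1.0], [0.5, 0.5]]
f_sparse = regularized_objective(0.01, g_sparse, h_diag)
f_dense = regularized_objective(0.01, g_dense, h_diag)
print(f_sparse, f_dense)
```

With equal training error, the sparser candidate receives the lower objective value, so the optimizer is steered toward networks with fewer connections; the weight `lam` controls how strongly structure is traded against fit.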

3.4 How to Validate a New Algorithm for S-system-based GRN Reconstruction?

For S-system-based modeling of GRN, the performance of a new algorithm must be validated against the following artificial and real-life genetic networks (benchmark problems) to prove its efficiency:

• 5-gene artificial network without and with noise
• 20-gene artificial network without and with noise
• Real-world genetic network like E. coli [41]
• Synthetic network like the GeneNetWeaver network [42]

Tables 1 and 2 show the parameters of the 5-gene small-scale artificial network and the 20-gene medium-scale artificial network, respectively, which are benchmark problems in this domain of research. Several authors have used these parameters to generate artificial time series data and to infer these artificial networks for validation. Authors also add different levels of Gaussian noise to the data to check the robustness of their proposed algorithm. The SOS network for E. coli, first introduced by the Uri Alon group [43], is a benchmark real-life GRN problem for testing the effectiveness of an inference algorithm on a real dataset [44] and network. In the SOS network, 8 genes were considered (uvrD, lexA, umuD, recA, uvrA, uvrY, ruvA, and


Table 1 S-system parameters for the 5-gene artificial network

i    gi,1  gi,2  gi,3  gi,4  gi,5  hi,1  hi,2  hi,3  hi,4  hi,5  αi   βi
1    0     0     1     0     -1    2     0     0     0     0     5    10
2    2     0     0     0     0     0     2     0     0     0     10   10
3    0     -1    0     0     0     0     -1    2     0     0     10   10
4    0     0     2     0     -1    0     0     0     2     0     8    10
5    0     0     0     2     0     0     0     0     0     2     10   10

Table 2 S-system parameters for the 20-gene artificial network

αi, βi    10.0
gi,j      g3,15 = -0.7, g5,1 = 1.0, g6,1 = 2.0, g7,2 = 1.2, g7,3 = -0.8, g7,10 = 1.6, g8,3 = -0.6, g9,4 = 0.5, g9,5 = 0.7, g10,6 = -0.3, g10,14 = 0.9, g11,7 = 0.5, g12,1 = 1.0, g13,10 = -0.4, g13,17 = 1.3, g14,11 = -0.4, g15,8 = 0.5, g15,11 = -1.0, g15,18 = -0.9, g16,12 = 2.0, g17,13 = -0.5, g18,14 = 1.2, g19,12 = 1.4, g19,17 = 0.6, g20,17 = 1.5, other gi,j = 0
hi,j      1.0 if (i = j), 0.0 otherwise

polB). During the experiments, E. coli cells were irradiated with UV light, which damaged their DNA. The damage activates RecA, which promotes cleavage of the LexA repressor, so the SOS genes that LexA normally represses are expressed and the damage is repaired. Once the damage has been repaired, LexA accumulates again and represses the SOS genes, and the cell returns to its initial state. Figure 1 shows the SOS DNA repair network of E. coli. Four experiments were performed at different UV light intensities. Each experiment consists of 50 time steps spaced 6 min apart for each of the eight genes.

Because the yearly DREAM challenge keeps providing all kinds of benchmark networks, inference of a synthetic network generated by GeneNetWeaver (GNW, version 3.1.3 beta) [42] is another interesting benchmark problem for researchers. However, few researchers have used it to demonstrate the efficiency of their algorithms. Figure 2 shows such a synthetic GRN.
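The way artificial benchmark data are typically produced from the Table 1 parameters can be sketched as follows. This is a minimal illustration, not the procedure of any particular cited paper: it assumes the standard S-system form dX_i/dt = α_i ∏ X_j^g_ij − β_i ∏ X_j^h_ij, a hand-written fourth-order Runge-Kutta integrator, and multiplicative Gaussian noise. The initial expression levels, step size, and noise level are hypothetical choices made only for the example.

```python
import random

# Parameters of the 5-gene artificial network (Table 1); index 0 is gene 1.
ALPHA = [5.0, 10.0, 10.0, 8.0, 10.0]
BETA = [10.0, 10.0, 10.0, 10.0, 10.0]
G = [[0, 0, 1, 0, -1],
     [2, 0, 0, 0, 0],
     [0, -1, 0, 0, 0],
     [0, 0, 2, 0, -1],
     [0, 0, 0, 2, 0]]
H = [[2, 0, 0, 0, 0],
     [0, 2, 0, 0, 0],
     [0, -1, 2, 0, 0],
     [0, 0, 0, 2, 0],
     [0, 0, 0, 0, 2]]


def s_system_rhs(x, alpha=ALPHA, beta=BETA, g=G, h=H):
    """Return dX_i/dt = alpha_i * prod_j X_j**g_ij - beta_i * prod_j X_j**h_ij."""
    rates = []
    for i in range(len(x)):
        production, degradation = alpha[i], beta[i]
        for j, xj in enumerate(x):
            production *= xj ** g[i][j]
            degradation *= xj ** h[i][j]
        rates.append(production - degradation)
    return rates


def rk4_step(x, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = s_system_rhs(x)
    k2 = s_system_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
    k3 = s_system_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
    k4 = s_system_rhs([xi + dt * k for xi, k in zip(x, k3)])
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate(x0, n_steps, dt):
    """Return the trajectory [x(0), x(dt), ..., x(n_steps * dt)]."""
    trajectory = [list(x0)]
    for _ in range(n_steps):
        trajectory.append(rk4_step(trajectory[-1], dt))
    return trajectory


def add_gaussian_noise(trajectory, level, seed=0):
    """Perturb every sample multiplicatively: x * (1 + level * N(0, 1))."""
    rng = random.Random(seed)
    return [[v * (1.0 + level * rng.gauss(0.0, 1.0)) for v in row]
            for row in trajectory]


# Hypothetical initial expression levels and sampling grid (not from the chapter).
clean = simulate([0.7, 0.12, 0.14, 0.16, 0.18], n_steps=100, dt=0.005)
noisy = add_gaussian_noise(clean, level=0.05)
```

An inference algorithm would then be given `clean` or `noisy` as its only input and judged on how well it recovers the entries of Table 1.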

4 Literature Survey

This section briefly surveys the different state-of-the-art techniques, one after another.


Fig. 1 The original structure of the SOS DNA Repair network of E. coli (nodes: recA, lexA, uvrD, polB, uvrA, uvrY, ruvA, umuD)

Fig. 2 The graphical representation of the 20 gene network extracted from GNW (nodes: metN, metE, yeiB, metF, metR, metI, metA, metJ, ybjG, metB, yrb1, metC, folE, metL, phoP, metQ, mgrB, rstB, borD, slyB)

In 2000, using the idea of the linear programming (LP)-based method, T. Akutsu et al. [45] developed a simple method (denoted by SSYS-1) for inference of S-systems. The authors assumed that $dX_i/dt > 0$. Taking the logarithm of each side of $\alpha_i \prod_{j=1}^{N} X_j^{g_{i,j}} > \beta_i \prod_{j=1}^{N} X_j^{h_{i,j}}$ gives

$$\log \alpha_i + \sum_{j=1}^{N} g_{i,j} \log X_j(t) > \log \beta_i + \sum_{j=1}^{N} h_{i,j} \log X_j(t) \tag{10}$$
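To make the structure of Eq. (10) concrete, the sketch below builds the coefficient row of one such inequality for a single time point and checks it against the known gene-1 parameters of Table 1. It is only an illustration: the parameter ordering `theta`, the helper names, and the sample expression values are assumptions introduced here, and the LP solver that would consume the rows is deliberately left out.

```python
import math


def inequality_row(x_t, sign):
    """Coefficients c such that Eq. (10) reads sum(c * theta) > 0.

    theta = [log alpha_i, g_i1..g_iN, log beta_i, h_i1..h_iN] (assumed ordering).
    sign is +1 where dX_i/dt > 0 and -1 where dX_i/dt < 0.
    """
    logs = [math.log(v) for v in x_t]
    row = [1.0] + logs + [-1.0] + [-v for v in logs]
    return [sign * c for c in row]


def satisfies(row, theta):
    """True when the parameter vector theta fulfils the inequality."""
    return sum(c * p for c, p in zip(row, theta)) > 0


# Gene 1 of Table 1: alpha = 5, g = (0, 0, 1, 0, -1), beta = 10, h = (2, 0, 0, 0, 0).
theta_gene1 = [math.log(5.0), 0, 0, 1, 0, -1, math.log(10.0), 2, 0, 0, 0, 0]

# Hypothetical expression levels at which gene 1 is decreasing.
sample = [0.7, 0.12, 0.14, 0.16, 0.18]
print(satisfies(inequality_row(sample, -1), theta_gene1))
```

Each observed time point contributes one such row, so the rows form a system of linear inequalities in the log-transformed parameters.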


Since $X_j(t)$ are known data, Eq. (10) is a linear inequality if we treat $\log \alpha_i$ and $\log \beta_i$ as parameters. In the case of $dX_i/dt < 0$, a similar inequality is obtained. Therefore, the parameters can be determined by solving these linear inequalities with LP. However, a major disadvantage of this method is that the parameters cannot be determined uniquely even when many data are given.

Therefore, in 2003, S. Kikuchi et al. [46] proposed a method to predict not only the network structure but also its dynamics using a Genetic Algorithm (GA) and the S-system formalism. However, it could predict only a small number of parameters and could rarely obtain essential structures. Moreover, its evaluation function used an additional pruning term that aims at eliminating futile parameters:

$$f = \sum_{k=1}^{M} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^2 + cNT \left\{ \sum_{i,j} \left| g_{i,j} \right| + \sum_{i,j,\, i \neq j} \left| h_{i,j} \right| \right\} \tag{11}$$

where c is the weighting coefficient that balances the accuracy against the original structure of the GRN. This model was tested successfully for dynamics prediction of a small-scale artificial GRN, i.e., the 5-gene benchmark problem.

In 2004, C. Spieth et al. [47] used a Memetic Algorithm for inference of artificial 5- and 10-dimensional regulatory systems based on the S-system. The authors used the same fitness function as Eq. (5). However, dynamics was the only concern here; no attention was given to the sparse nature of GRNs.

All the above-mentioned methods suffered from high computational complexity and from the over-fitting problem. In 2005, Kimura et al. [39] proposed a novel Decoupled S-system along with cardinality to overcome these problems. The authors used a Cooperative Coevolutionary Algorithm for the optimization. In the Decoupled S-system, the genetic network inference problem is divided (decoupled) into several sub-problems, one per gene, to reduce the computational complexity. The objective of the sub-problem corresponding to the i-th gene is to find the decoupled S-system parameters that minimize the error $f_i^{D}$ for the i-th gene, defined as follows:

$$f_i^{D} = \sum_{k=1}^{M} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^2 \tag{12}$$
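A rough sketch of how the decoupled error of Eq. (12) can be evaluated for one gene is given below. It is a simplification made for illustration, not the procedure of [39]: it assumes a single dataset (M = 1), takes the other genes' values directly from the observed data at the previous time point, and advances gene i with a forward Euler step. The function name and argument layout are hypothetical.

```python
def decoupled_error(i, params, observed, dt):
    """Squared relative error of Eq. (12) for gene i with one dataset (M = 1).

    params   -- (alpha_i, g_i, beta_i, h_i) for gene i only
    observed -- observed[t][j] is the measured expression of gene j at step t
    dt       -- spacing between consecutive time points
    """
    alpha_i, g_i, beta_i, h_i = params
    x_i = observed[0][i]              # start from the measured initial value
    error = 0.0
    for t in range(1, len(observed)):
        # Only gene i is simulated; all other genes come from the data.
        state = list(observed[t - 1])
        state[i] = x_i
        production, degradation = alpha_i, beta_i
        for j, xj in enumerate(state):
            production *= xj ** g_i[j]
            degradation *= xj ** h_i[j]
        # Forward Euler step, kept strictly positive so powers stay defined.
        x_i = max(x_i + dt * (production - degradation), 1e-12)
        relative = (x_i - observed[t][i]) / observed[t][i]
        error += relative * relative
    return error


# Toy 2-gene data in which gene 0 sits at a steady state of its true model.
data = [[1.0, 0.5]] * 4
true_params = (2.0, [0, 1], 1.0, [1, 0])
wrong_params = (3.0, [0, 1], 1.0, [1, 0])
print(decoupled_error(0, true_params, data, dt=0.1))
print(decoupled_error(0, wrong_params, data, dt=0.1))
```

Because each call touches the parameters of one gene only, an optimizer can treat the N genes as N independent searches.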

Hence, only 2(N + 1) parameters need to be determined for the i-th gene of the Decoupled S-system. This decoupling thus divides a 2N(N + 1)-dimensional problem into N sub-problems of 2(N + 1) dimensions, one per gene. By accumulating the 2(N + 1) parameters of all N genes, we obtain the overall structure of the S-system, which in turn describes the GRN. Next, to generate sparse solutions, the authors introduced the concept of in-degree, or cardinality, of genes into the error function. The cardinality of a gene is defined as the allowed number of regulations acting on that gene. In this paper, a penalty term based on the maximum cardinality I is added to the fitness function: it is assumed that, out of the 2N kinetic parameters in the g and h vectors, only I non-zero values are allowed, forcing the other 2N − I values to be zero. If any of these 2N − I elements takes a non-zero value during the optimization process, the solution is penalized in the following way for the decoupled S-system:

$$f_i^{DP} = \sum_{k=1}^{M} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^2 + c \sum_{j=1}^{N-I} \left( \left| G_{i,j} \right| + \left| H_{i,j} \right| \right) \tag{13}$$

where $G_{i,j}$ and $H_{i,j}$ are the vectors containing $g_{i,j}$ and $h_{i,j}$, respectively, sorted in ascending order of absolute value, and c is the weight constant that sets the magnitude of the penalty to balance over-fitting against the actual network structure. The proposed model gave a satisfactory performance for the small 5-gene artificial GRN, but the amount of time series data required was very high. Performance in the presence of 10% noise was very poor for the medium-scale artificial GRN (20 genes). Moreover, the algorithm was tested on a real-life GRN of Thermus thermophilus HB8 strains, but the results were not up to the mark.

In 2005, N. Noman et al. [48] used Evolutionary Computation with a novel pruning term for the Decoupled S-system inference problem. Considering that very few genes affect both the synthesis process and the degradation process of a specific gene, the authors modified the objective function as

$$f_i^{DP} = \sum_{k=1}^{M} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^2 + c \sum_{j=1}^{N-I} \left| K_{i,j} \right| \tag{14}$$

where $K_{i,j}$ are the kinetic orders of gene i sorted in ascending order of their absolute values. This term identifies an increasing number of zero-valued parameters and thus obtains the skeletal network structure more precisely. The model performed very well for the small-scale artificial network in the absence of noise, but a few FPs were included in the presence of 5% Gaussian noise in the data. For a real-life network such as E. coli, it identified 4 TPs and 11 FPs.

In 2006, N. Noman et al. [49] proposed a fitness evaluation based on Akaike's Information Criterion (AIC), instead of the conventional Mean Squared Error (MSE)-based fitness evaluation, for selecting parameters in the decoupled S-system formalism. It was defined in the following way:

$$f_i^{AIC} = -2\Lambda_i + 2\Phi_i + c \sum_{j=1}^{2N-I} \left| K_{i,j} \right| \tag{15}$$

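where $\Phi_i$ is the number of parameters included in the model and $\Lambda_i$ is the log-likelihood, defined in the following way:

$$\Lambda_i(\Omega_i, \sigma_i) = -\frac{1}{2\sigma_i^2} \sum_{k=1}^{T} \left( X_{cal,k,i,t} - X_{exp,k,i,t} \right)^2 - \frac{T}{2} \ln\left( 2\pi\sigma_i^2 \right) \tag{16}$$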

where $\sigma_i$ is the standard deviation of the normally distributed error between the calculated and experimental gene expression values for a set of parameters $\Omega_i$ of gene i. The proposed fitness function was optimized using Trigonometric Differential Evolution (TDE) and was found to be more suitable for identifying the correct network topology and for estimating accurate parameter values of the small-scale artificial GRN. The results also demonstrated the robustness of the model against different noise levels in the data.

In 2007, Wang et al. [50] proposed a unified approach to infer GRNs that considerably increased the speed of S-system inference. A two-step method was proposed in which the range of the parameters was first determined using Genetic Programming and Recursive Least Squares estimation. The exact values of the parameters were then calculated using a multi-dimensional optimization algorithm. Both the downhill simplex algorithm and a modified Powell algorithm were tested for the multi-dimensional optimization. Only the dynamics of a 2-gene artificial network and a 5-gene yeast network were tested, and the method performed poorly with respect to sensitivity and specificity.

In 2007, N. Noman et al. [51] incorporated a hill-climbing local-search method into an evolutionary algorithm to efficiently attain the skeletal architecture that is most frequently observed in biological networks. The authors used the same AIC-based objective function as described in Eqs. (15) and (16). The proposed algorithm performed very well for the small artificial GRN without noise, but the amount of time series data required was still high, and many FPs were included in the presence of noise. For the noiseless medium-scale artificial network, accuracy was very high, whereas in the presence of noise it detected almost all TPs but included 54 FPs. So, the proposed model was not noise immune for the artificial network.
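The AIC-based fitness of Eqs. (15) and (16) can be sketched for one gene as follows. This is a hedged illustration rather than the implementation of [49] or [51]: the function names are hypothetical, the likelihood is evaluated over a single list of time points, and Φ is assumed here to be the number of non-zero kinetic orders, which is one possible reading of "parameters included in the model".

```python
import math


def log_likelihood(calculated, observed, sigma):
    """Eq. (16): Gaussian log-likelihood of the residuals of one gene."""
    n = len(observed)
    sse = sum((c - o) ** 2 for c, o in zip(calculated, observed))
    return -sse / (2.0 * sigma ** 2) - 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)


def aic_fitness(calculated, observed, sigma, kinetic_orders, max_indegree, c,
                tol=1e-9):
    """Eq. (15): -2*Lambda + 2*Phi + c * (sum of the 2N - I smallest |K_ij|).

    kinetic_orders -- the 2N values g_i1..g_iN, h_i1..h_iN of gene i
    max_indegree   -- the maximum cardinality I
    """
    # Assumption: Phi counts the kinetic orders that are effectively non-zero.
    phi = sum(1 for k in kinetic_orders if abs(k) > tol)
    ranked = sorted(abs(k) for k in kinetic_orders)
    n_penalised = max(len(ranked) - max_indegree, 0)
    penalty = c * sum(ranked[:n_penalised])
    return -2.0 * log_likelihood(calculated, observed, sigma) + 2.0 * phi + penalty


# Gene 1 of Table 1 has three non-zero kinetic orders (g_13, g_15, h_11).
orders_gene1 = [0, 0, 1.0, 0, -1.0, 2.0, 0, 0, 0, 0]
print(aic_fitness([1.0, 2.0], [1.0, 2.0], 0.1, orders_gene1, max_indegree=3, c=0.5))
```

With `max_indegree=3` the pruning term vanishes for this parameter set, whereas lowering it to 2 penalizes the smallest of the three non-zero kinetic orders.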
The methodology of [51] was also applied to analyze the cell cycle of budding yeast, and the network of some key regulators was reconstructed.

In 2008, H. Murata et al. [52] proposed a Product Unit Neural Network to represent the decoupled S-system, where the inputs are the gene expression values $X_i$, the output node is $dX_i/dt$, the weights of the edges between the input layer and the hidden layer are the kinetic orders of the power-law formalism, and the weights of the edges between the hidden layer and the output node are $\alpha_i$ and $-\beta_i$, respectively. The authors also used PSO for optimization, along with a novel dynamic objective function that shifts the balance between error minimization and structure finding as the iterations progress:

$$f_{P,i} = C_P(t) \times E'_{P,i} + \left( 1 - C_P(t) \right) \times P_{P,i} \tag{17}$$

where $C_P(t) = \left\{ 1 - \left( 1 - \frac{t}{t_{end}} \right)^2 \right\} \times 0.4 + 0.5$, t is the current iteration number, $t_{end}$ is the maximum number of iterations, $P_{P,i}$ is a penalty term using clustering, and $E'_{P,i}$ is an error function similar to Eq. (5). Details of these terms can be found in the corresponding paper [52]. However, the proposed method was tested only on an artificial genetic network consisting of five genes (N = 5) without any noise, and the results were not very promising.

In 2010, M. Kabir et al. [53] modeled the non-linearity of the S-system model using a linear time-variant model. The estimation criterion was the same as Eq. (5), and Self-Adaptive Differential Evolution was used for optimization. For the noiseless 5-gene small artificial network, it gave high accuracy in dynamics prediction, but there was a large parametric error in the inferred S-system parameters, and performance in the presence of noise was very poor. It was also tested on E. coli and IRMA data. For E. coli, 5 TPs were detected, but many FPs were included.

In 2011, A. R. Chowdhury et al. [54] proposed a method in which a genetic algorithm was used for scoring networks and which contains several useful features for accurate network inference, namely a Prediction Initialization (PI) algorithm to initialize the individuals, a Flip Operation (FO) for better mating of values, and a restricted execution of Hill Climbing Local Search over a few individuals. It also included a novel refinement technique that uses the fit solutions of the genetic algorithm to optimize the sensitivity and specificity of the inferred network. The model performed very well for the small-scale artificial 5-gene network with and without noise (very few FPs at higher noise levels).

In 2011, T. Nakayama et al. [55] applied an Immune Algorithm (IA) to search for the S-system parameters. Only the dynamics of a 5-gene artificial network and of real-life adipocyte differentiation in Mus musculus were tested, and no attention was given to the structure of the network.

In 2012, Li-Zhi Liu et al. [56] proposed a pruning separable parameter estimation algorithm (PSPEA) that combines the separable parameter estimation method (SPEM) with a pruning strategy and is optimized using a Continuous Genetic Algorithm (CGA) for inferring S-systems. The method was very successful for inference of 4- and 5-dimensional artificial GRNs in both dynamics prediction and parameter estimation.

In 2012, A. R. Chowdhury et al. [57] proposed a novel approach for inferring GRNs based on the decoupled S-system model, incorporating the new concept of adaptive regulatory


genes cardinality (ARGC) using a Trigonometric Evolutionary Algorithm. Unlike previous methods that use a maximum in-degree I with a fixed value, the ARGC algorithm adjusts the value of I automatically for every gene. The objective function is defined in the following way:

$$f_i^{ARGC} = \sum_{k=1}^{M} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^2 + C_i \, \frac{2N}{Z_{Count}} \tag{18}$$

where $Z_{Count}$ is the total number of non-regulations for the i-th gene and $C_i$ is a scaling factor that is a function of the maximum in-degree (I), the minimum in-degree (J), and the number of regulations. Its performance was the best reported up to that time for both noiseless and noisy small-scale artificial GRNs. For the IRMA real network, its sensitivity was the best among the compared methods, but its specificity was slightly poorer.

In 2013, A. R. Chowdhury et al. [58] proposed a novel Time-Delayed S-system (TDSS) model, which uses a set of delay differential equations to represent the system dynamics. The ability to incorporate time-delay parameters in the S-system model enables simultaneous modeling of both instantaneous and time-delayed interactions. Moreover, the authors refined the evaluation function as follows:

$$f_i^{TD} = \sum_{k=1}^{M} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^2 + B_i C_i \, \frac{2N}{2N - r_i} \tag{19}$$

where $r_i$ is the number of actual regulators, $B_i$ is a balancing factor between the two terms, and $C_i$ is the penalty factor for gene i. The experimental studies on synthetic networks with various time-delayed regulations demonstrate that the proposed method captures both instantaneous and delayed interactions with high precision. Experiments were also carried out on two well-known real-life networks, namely the IRMA and SOS DNA repair networks. For Escherichia coli, the results showed a significant improvement over other state-of-the-art approaches for GRN modeling.

In 2013, L. Palafox et al. [59, 60] used a novel Dissipative Particle Swarm Optimization along with an L1 regularizer in the objective function, which is given as follows:

$$f_i^{D} = \sum_{k=1}^{M} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^2 + \lambda \sum_{j=1}^{N} \left( \left| g_{i,j} \right| + \left| h_{i,j} \right| \right) \tag{20}$$

where λ is the balancing factor. The model performed satisfactorily for small- and medium-scale artificial GRNs without noise, but its performance degraded in the presence of noise. For the E. coli SOS network, 7 TPs and 3 FPs were detected, which was quite satisfactory.
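The L1-regularized objective of Eq. (20) is simple enough to state directly in code. The sketch below assumes that the calculated and observed expression values of gene i are already available as flat lists; the function names and the numbers in the example are hypothetical and serve only to show how λ trades data fit against sparsity.

```python
def relative_squared_error(calculated, observed):
    """Data-fit term shared by Eqs. (12)-(14) and (18)-(21)."""
    return sum(((c - o) / o) ** 2 for c, o in zip(calculated, observed))


def l1_objective(calculated, observed, g_i, h_i, lam):
    """Eq. (20): squared relative error plus lam * sum(|g_ij| + |h_ij|)."""
    penalty = sum(abs(v) for v in g_i) + sum(abs(v) for v in h_i)
    return relative_squared_error(calculated, observed) + lam * penalty


# Two candidate kinetic-order sets with the same fit but different sparsity.
fit = ([1.1, 2.0], [1.0, 2.0])
sparse = l1_objective(*fit, g_i=[0, 0, 1, 0, -1], h_i=[2, 0, 0, 0, 0], lam=0.1)
dense = l1_objective(*fit, g_i=[0.3, 0.2, 1, 0.1, -1], h_i=[2, 0.2, 0, 0, 0.1], lam=0.1)
print(sparse < dense)
```

When two candidates fit the data equally well, the one with fewer and smaller kinetic orders receives the lower objective value, which is how the regularizer steers the search toward sparse networks.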


In 2013, A. S. Jereesh et al. [61] used a Clonal-Based Algorithm and the MSE-based objective function (Eq. 5) for the inference problem. For the small artificial gene network, the dynamics were predicted correctly. For the E. coli network, a large number of false regulations were inferred.

So far, all attempts had aimed at successful inference of small GRNs. However, in 2013, A. R. Chowdhury et al. [62] proposed a novel S-system-based framework in which the GRN is decomposed into two sub-networks representing TF (Transcription Factor)-TF and TF-TG (Target Gene) interactions. The authors used the following dynamic fitness function:

$$f_i^{REL-GRN} = \sum_{k=1}^{M} \sum_{t=1}^{T} \left( \frac{X_{cal,k,i,t} - X_{exp,k,i,t}}{X_{exp,k,i,t}} \right)^2 + C_i \, \frac{r_{max}}{r_{max} - r_i} \tag{21}$$

where $C_i$ is the modified scaling factor, $r_{max}$ is the number of transcription factors for a single gene, and $r_i$ is the total number of regulations. The proposed method provided satisfactory results for medium and large in-silico networks (20-, 50-, and 100-gene networks).

Again in 2013, A. S. Jereesh et al. [63] used Cuckoo Search optimization and the MSE-based objective function (Eq. 5) for the inference problem. For the small artificial gene network, the dynamics were predicted correctly. For the E. coli network, a large number of false regulations were inferred.

In 2014, A. S. Jereesh et al. [64] proposed a Clono-Hybrid algorithm for the S-system-based GRN reconstruction problem. For the E. coli network, 7 true regulations were detected, but specificity was very low because of the large number of false regulations.

In practical cases, where data are insufficient and only partially observable, some of the parameters may be unidentifiable. Therefore, in 2015, C. Zhan et al. [65] studied the structural and practical identifiability of the S-system to improve the identification quality. An application to the yeast fermentation pathway was conducted successfully.

In 2016, S. Mandal et al. [66] used the Bat Algorithm to optimize the decoupled S-system model parameters. The proposed method was first tested successfully on an artificial network with and without noise. Based on the fact that a real-life genetic network is sparsely connected, a novel Accumulative Cardinality-based decoupled S-system was proposed. The cardinality was varied from zero up to a maximum value, and the model was applied to the reconstruction of the SOS DNA repair network of Escherichia coli. The results showed significant improvements in the detection of a greater number of true regulations and in the minimization of false detections compared with other existing methods.
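The TP and FP counts, sensitivity, and specificity quoted throughout this survey can be computed from a reference network and an inferred network as sketched below. This is a generic illustration under stated assumptions: regulations are represented as (regulator, target) index pairs, self-regulation is optionally excluded from the candidate set, and the function names and the toy 3-gene networks are hypothetical.

```python
def confusion_counts(true_edges, inferred_edges, n_genes, include_self=False):
    """Return (TP, FP, FN, TN) over all candidate (regulator, target) pairs."""
    candidates = {(r, t) for r in range(n_genes) for t in range(n_genes)
                  if include_self or r != t}
    truth = set(true_edges) & candidates
    inferred = set(inferred_edges) & candidates
    tp = len(truth & inferred)      # correctly inferred regulations
    fp = len(inferred - truth)      # inferred regulations that do not exist
    fn = len(truth - inferred)      # true regulations that were missed
    tn = len(candidates) - tp - fp - fn
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of true regulations that were recovered."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tn, fp):
    """Fraction of absent regulations that were correctly left out."""
    return tn / (tn + fp) if tn + fp else 0.0


# Toy 3-gene example with hypothetical reference and inferred networks.
tp, fp, fn, tn = confusion_counts({(0, 1), (1, 2)}, {(0, 1), (2, 0)}, n_genes=3)
print(tp, fp, fn, tn, sensitivity(tp, fn), specificity(tn, fp))
```

A method that reports many true regulations together with many false ones, as several of the E. coli results above do, scores high on sensitivity but low on specificity.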

5 Conclusion

Here, we survey the applications of different optimization techniques and hybrid intelligence methods in bioinformatics, which may help researchers and beginners from both areas to understand each other and foster future collaborations. Accurate inference of all types of GRNs with low computational complexity and few data points remains a fascinating task for computer science researchers. In this chapter, we have given a brief overview of the GRN inference problem, i.e., reverse engineering of genetic networks using the S-system, which is one of the most popular approaches for GRN modeling. The literature survey described different state-of-the-art techniques for this purpose, and it was found that most authors used a metaheuristic to optimize the parameters of the S-system. We investigated some major issues that occur during optimization of the S-system model and proposed some solutions to them. Different metaheuristics have been applied successfully to infer small-scale gene regulatory network structures as well as gene dynamics, with and without noise, but accurate inference of large-scale artificial, DREAM4, and real-life GRNs with low computational time is yet to be accomplished. A few of them were able to find all true regulations, but they also detected some false regulations. Therefore, the following are still open research problems for computer science researchers in the field of S-system modeling of GRNs.

1. Finding the most suitable and efficient optimization techniques for accurate inference of small artificial (without and with noise), large artificial (without and with noise), GNW or DREAM4, and real-world GRNs with low computational complexity.
2. Modifying or improving existing metaheuristics for accurate inference of all kinds of GRNs from less time series data.
3. Modifying the regularization or penalty term to obtain better results (for example, a dynamic penalty whose value changes with the iteration to improve the balance).
4. Employing techniques to reduce the number of FPs in the inferred GRN.
5. Hybridizing different optimization techniques to reduce the search space of the optimization by fixing the regulators (with a regularized cost function, the algorithm still searches a large space for all parameters).


References

1. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32(4):496–501
2. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470
3. Shyamsundar R, Kim YH, Higgins JP, Montgomery K, Jorden M, Sethuraman A, van de Rijn M, Botstein D, Brown PO, Pollack JR (2005) A DNA microarray survey of gene expression in normal human tissues. Genome Biol 6(3):1–9
4. Stanford Microarray Database. http://smd.stanford.edu
5. Gene Expression Omnibus (GEO). http://www.ncbi.nlm.nih.gov/geo
6. AmiGO. http://www.godatabase.org/cgibin/go.cgi
7. Akutsu T, Miyano S, Kuhara S (1999) Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In: Biocomputing’99. 17–28
8. Weaver DC, Workman CT, Stormo GD (1999) Modeling regulatory networks with weight matrices. In: Biocomputing’99. 112–123
9. Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d’Alche-Buc F (2003) Gene networks inference using dynamic Bayesian networks. Bioinformatics 19(suppl_2):ii138–ii148
10. Werhli AV, Grzegorczyk M, Husmeier D (2006) Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics 22(20):2523–2531
11. Keedwell E, Narayanan A (2005) Discovering gene networks with a neural-genetic hybrid. IEEE/ACM Trans Comput Biol Bioinform 2(3):231–242
12. Wahde M, Hertz J (2001) Modeling genetic regulatory dynamics in neural development. J Comput Biol 8(4):429–442
13. Mandal S, Saha G, Pal RK (2016) Inference of gene regulatory networks with neural-cuckoo hybrid. Adv Comput Syst Secur 1:87–99
14. Kentzoglanakis K, Poole M (2011) A swarm intelligence framework for reconstructing gene networks: searching for biologically plausible architectures. IEEE/ACM Trans Comput Biol Bioinform 9(2):358–371
15. Noman N, Palafox L, Iba H (2013) Reconstruction of gene regulatory networks from gene expression data using decoupled recurrent neural network model. Nat Comput Beyond 6:93–103
16. Mandal S, Khan A, Saha G, Pal RK (2016) Large-scale recurrent neural network based modelling of gene regulatory network using cuckoo search-flower pollination algorithm. Adv Bioinforma 2016:1–9
17. Savageau MA, Voit EO (1987) Recasting nonlinear differential equations as S-systems: a canonical nonlinear form. Math Biosci 87:83–115
18. Savageau MA (1988) Introduction to S-systems and the underlying power-law formalism. Math Comput Model 11:546–551
19. Sorribas A, Savageau MA (1989) Strategies for representing metabolic pathways within biochemical systems theory: reversible pathways. Math Biosci 94(2):239–269
20. Savageau MA (1969) Biochemical systems analysis I. Some mathematical properties of the rate law for the component enzymatic reactions. J Theoret Biol 25:365–369
21. Savageau MA (1969) Biochemical systems analysis II. The steady state solutions for an n-pool system using a power-law approximation. J Theoret Biol 25:370–379
22. Savageau MA (1970) Biochemical systems analysis III. Dynamic solutions using a power law approximation. J Theoret Biol 26:215–226
23. Savageau MA (1971) Concepts relating the behavior of biochemical systems to their underlying molecular properties. Arch Biochem Biophys 145:612–621
24. Savageau MA (1972) The behavior of intact biochemical control systems. Curr Top Cell Reg 6:63–130
25. Savageau MA (1976) Biochemical systems analysis: a study of function and design in molecular biology. Addison-Wesley, Reading
26. Savageau MA, Voit EO, Irvine DH (1987) Biochemical systems theory and metabolic control theory: 1. Fundamental similarities and differences. Math Biosci 86(2):127–145
27. Savageau MA, Voit EO, Irvine DH (1987) Biochemical systems theory and metabolic control theory: 2. The role of summation and connectivity relationships. Math Biosci 86(2):147–169
28. Voit EO (2013) Biochemical systems theory (BST): a review. International Scholarly Research Network (ISRN) Biomathematics 2013:1–53
29. Savageau MA, Voit EO (1982) Power-law approach to modeling biological systems: I. Theory. J Ferment Technol 60(3):221–228
30. Savageau MA (1996) Power-law formalism: a canonical nonlinear approach to modeling and analysis. Proc 1st World Congr Nonlinear Anal 4:3323–3334
31. Voit E, Chou IC (2010) Parameter estimation in canonical biological systems models. Int J Syst Synthet Biol 1(1):1–9
32. Horn FJM, Jackson R (1972) General mass action kinetics. Arch Ration Mech 47:81–116
33. Müller S, Regensburger G (2012) Generalized mass action systems: complex balancing equilibria and sign vectors of the stoichiometric and kinetic-order subspaces. SIAM J Appl Math 72:1926–1947
34. Voit EO, Martens HA, Omholt SW (2015) 150 years of the mass action law. PLoS Comput Biol 11(1):e1004012
35. Lewis DC (1991) A qualitative analysis of S-systems: Hopf bifurcations. Canonical nonlinear modeling. S-system approach to understanding complexity. Van Nostrand Reinhold, New York, pp 304–344
36. Voit EO (1993) S-system modeling of complex systems with chaotic input. Environmetrics 4:153–186
37. Savageau MA (1991) 20 years of S-systems. In: Voit EO (ed) Canonical nonlinear modeling S-system approach understanding complexity. Van Nostrand Reinhold, New York, p 1–44
38. Thieffry D, Huerta AM, Pérez-Rueda E, Collado-Vides J (1998) From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli. Bioessays 20(5):433–440
39. Kimura S, Ide K, Kashihara A, Kano M, Hatakeyama M, Masui R, Nakagawa N, Yokoyama S, Kuramitsu S, Konagaya A (2005) Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm. Bioinformatics 21(7):1154–1163
40. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82
41. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5(1):e8
42. Schaffter T, Marbach D, Floreano D (2011) GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics 27(16):2263–2270

43. Ronen M, Rosenberg R, Shraiman BI, Alon U (2002) Assigning numbers to the arrows: parameterizing a gene regulation network by using accurate expression kinetics. Proc Natl Acad Sci 99(16):10555–10560
44. E. Coli SOS Network time series data. http://www.weizmann.ac.il/mcb/UriAlon/download/downloadable-data
45. Akutsu T, Miyano S, Kuhara S (2000) Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics 16(8):727–734
46. Kikuchi S, Tominaga D, Arita M, Takahashi K, Tomita M (2003) Dynamic modeling of genetic networks using genetic algorithm and S-system. Bioinformatics 19(5):643–650
47. Spieth C, Streichert F, Speer N, Zell A (2004) A memetic inference method for gene regulatory networks based on S-Systems. In: Proceedings of the 2004 congress on evolutionary computation. 1:152–157
48. Noman N, Iba H (2005) Reverse engineering genetic networks using evolutionary computation. Genome Inform 16(2):205–214
49. Noman N, Iba H (2006) On the reconstruction of gene regulatory networks from noisy expression profiles. In: IEEE international conference on evolutionary computation. 2543–2550
50. Wang H, Qian L, Dougherty E (2010) Inference of gene regulatory networks using S-system: a unified approach. IET Syst Biol 4(2):145–156
51. Noman N, Iba H (2007) Inferring gene regulatory networks using differential evolution with local search heuristics. IEEE/ACM Trans Comput Biol Bioinform 4(4):634–647
52. Murata H, Koshino M, Mitamura M, Kimura H (2008) Inference of S-system models of genetic networks using product unit neural networks. In: IEEE international conference on systems, man and cybernetics. 1390–1395
53. Kabir M, Noman N, Iba H (2010) Reverse engineering gene regulatory network from microarray data using linear time-variant model. BMC Bioinform 11:1–5
54. Chowdhury AR, Chetty M (2011) An improved method to infer gene regulatory network using s-system. In: IEEE congress of evolutionary computation (CEC). 1012–1019
55. Nakayama T, Seno S, Takenaka Y, Matsuda H (2011) Inference of S-system models of gene regulatory networks using immune algorithm. J Bioinforma Comput Biol 9(supp01):75–86
56. Liu LZ, Wu FX, Zhang WJ (2011) Inference of biological S-system using the separable estimation method and the genetic algorithm. IEEE/ACM Trans Comput Biol Bioinform 9(4):955–965
57. Chowdhury AR, Chetty M, Vinh NX (2012) Adaptive regulatory genes cardinality for reconstructing genetic networks. In: IEEE congress on evolutionary computation. 1–8
58. Chowdhury AR, Chetty M, Vinh NX (2013) Incorporating time-delays in S-system model for reverse engineering genetic networks. BMC Bioinform 14:1–22
59. Palafox L, Noman N, Iba H (2012) Reverse engineering of gene regulatory networks using dissipative particle swarm optimization. IEEE Trans Evol Comput 17(4):577–587
60. Palafox L, Noman N, Iba H (2013) Study on the use of evolutionary techniques for inference in gene regulatory networks. Nat Comput Beyond 6:82–92
61. Jereesh AS, Govindan VK (2013) A clonal based algorithm for the reconstruction of genetic network using S-system. Int J Res Eng Technol 02(08):44–50
62. Chowdhury AR, Chetty M, Vinh NX (2013) Inferring large scale genetic networks with S-system model. In: Proceedings of the 15th annual conference on genetic and evolutionary computation. 271–278
63. Jereesh AS, Govindan VK (2013) Gene regulatory network modelling using cuckoo search and S-system. Int J Adv Res Comput Sci Softw Eng 3(9):1231–1237
64. Jereesh AS, Govindan VK (2014) Clono-hybrid algorithm for the reconstruction of gene regulatory network using S-system. Int J Pure App Biosci 2(6):241–248
65. Zhan C, Li BY, Yeung LF (2015) Structural and practical identifiability analysis of S-system. IET Syst Biol 9(6):285–293
66. Mandal S, Khan A, Saha G, Pal RK (2016) Reverse engineering of gene regulatory networks based on S-systems and bat algorithm. J Bioinforma Comput Biol 14(03):1650010

Chapter 9

Big Data in Bioinformatics and Computational Biology: Basic Insights

Aanchal Gupta, Shubham Kumar, and Ashwani Kumar

Abstract

The human genome was first sequenced in 1994. It took 10 years of cooperation between numerous international research organizations to reveal a preliminary human DNA sequence. Genomics labs can now sequence an entire genome in only a few days. Here, we discuss how the advent of high-performance sequencing platforms has paved the way for Big Data in biology and contributed to the development of modern bioinformatics, which in turn has helped to expand the scope of biology and allied sciences. New technologies and methodologies for the storage, management, analysis, and visualization of big data have been shown to be necessary. Modern bioinformatics has to deal not only with the challenge of processing massive amounts of heterogeneous data, but also with different ways of interpreting and presenting the results, as well as with the use of different software programs and file formats. This chapter attempts to present solutions to these problems. To store massive amounts of data and complete search queries within a reasonable time, new database management systems other than relational ones will be necessary. Emerging advanced programming approaches, such as machine learning, Hadoop, and MapReduce, aim to make it easy to construct one's own scripts for data processing and to address the diversity of genomic and proteomic data formats in bioinformatics.

Key words Big data, NGS, Genome sequencing, Bioinformatics, Hadoop, MapReduce

1 Introduction

We are all living in a data age. Petabytes (10^15 bytes) of data are generated every day. Data sets that are generated at high velocity, in many varieties, and in very large volume are called big data. Big data experts forecast that the global datasphere will reach approximately 175 zettabytes by the year 2025. Medicine and biology contribute substantially to big data, which has driven the evolution of bioinformatics and computational biology.

1.1 Importance of Big Data in Biology

Like engineering, the biological sciences have experienced an IT boom that highlights the potential of computer science to aid biological research. Computational biology, artificial intelligence, and big data are increasingly used in the life sciences. Cutting-edge technologies such as next-generation sequencing and microarrays have generated an enormous amount of data. Many other data-rich areas in the biological sciences, such as macromolecular structure, image processing (including neuroimaging), health records analysis, genomics, proteomics, and the inter-relation of these vast data sets, are developing fast as well. These days, high-throughput data can be found in every subfield of genetics and molecular biology. This includes studies of epigenetic changes in development, transcriptional regulation of genes, and post-translational modifications of proteins, as well as studies of genomic diversity in the context of human disease and evolution. Data has been expanding at a rate that no biologist could have predicted even 20 years ago. It is therefore important not only to manage and store this data but also to analyze it, because analysis is what allows logical conclusions to be drawn about biological problems. Manual analysis of such a huge amount of data would be nearly impossible. Big data handling requires advanced technologies, because traditional technologies can neither manage, integrate, and share data at this scale nor extract useful information from such enormous datasets. This is where big data professionals play an essential role, with the help of computer science and artificial intelligence.

Sudip Mandal (ed.), Reverse Engineering of Regulatory Networks, Methods in Molecular Biology, vol. 2719, https://doi.org/10.1007/978-1-0716-3461-5_9, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024

1.2 Big Data Handling: Collection, Storage, and Analysis

Biological data includes sequences and structures of DNA, RNA, and proteins, metabolic pathways, clinical data, molecular structures, etc. This data is collected using experimental methods like X-ray crystallography and high-throughput methods like NGS, microarrays, and many more. It then needs to be stored in databases, which requires advanced database management systems that can organize large datasets efficiently. Major databases including NCBI, DDBJ, and EMBL-EBI have their own retrieval tools, namely Entrez, Getentry, and EBI Search, respectively. A simple search in NCBI using the keyword “COVID” retrieves 6,345,428 BioSample hits (results), 739,078 nucleotide sequence hits, and 1,010,028 protein sequence hits in less than a second, which indicates how fast the biological data pool is growing. This huge amount of data can then be analyzed using big data analysis tools and techniques. Machine learning techniques, which fall under the “supervised,” “unsupervised,” and “hybrid” umbrellas, are important resources for both predictive and descriptive analytics. Major research foci include mathematical optimization, statistical modeling, and cloud computing. Several kinds of architectures, such as MapReduce, fault-tolerant graph, and streaming graph, have been created for big data analytics. This chapter discusses these methods and tools.

2 Tools/Software

Here are some of the key big data analytics tools:
• Hadoop – supports data storage and analysis.
• MongoDB – is used for datasets that change regularly.
• Talend – is used for data management and integration.
• Cassandra – is a distributed database that handles data chunks.
• Spark – is used for real-time processing and analysis of large amounts of data.
• Storm – is an open-source real-time computation system.
• Kafka – is a distributed streaming platform that is used for fault-tolerant storage.

3 Methods

3.1 Data Collection

With the explosion of high-throughput sequencing technologies and large-scale experimental approaches, researchers are generating vast amounts of data that require sophisticated computational tools. Data collection methods include experimental procedures such as X-ray crystallography and NMR spectroscopy. High-throughput methods like next-generation sequencing (NGS) and microarrays produce enormous amounts of data. Apart from these, population surveys, clinical trials, patient records, etc. also generate biological big data. The collected data comes in various structured and unstructured formats, e.g., maps, sequences, graphs, and images.

3.2 Data Storage

The enormous amount of data collected needs to be stored systematically, so that the stored data is non-redundant, can easily be retrieved, and allows relationships between different datasets to be established. The data should also be kept in its raw format, so that it can be re-analyzed in the future as processing algorithms and computational power improve and new ways of analysis are developed [1]. Software for big data storage includes Apache Hadoop, Apache HBase, and Snowflake.

3.2.1 Rules to Store Data

1. Start by determining how the data will be used, how much data will be collected, over what period of time, what software will be used to analyze it, and what its format will be. Example: FASTA format can be used in sequence alignment software such as CLUSTALW. Metadata, the information required to interpret the data, should also be properly curated and closely integrated with the data. Using community software tools is preferable for data storage.
2. Raw data should be saved in separate files. A cryptographic hash of the raw data should be generated so that silent corruption and manipulation during storage or transfer can be detected. Secure Hash Algorithm 2 (SHA-2) can be used for this purpose. A cryptographic hash function maps the data to a fixed-length digest. It does not encrypt the data, but any change to the data changes the digest, so recomputing the hash and comparing it with the recorded value verifies integrity.
3. Store the data in open formats, which maximize its accessibility and long-term value and are also easier for computers to process. CSV (comma-separated values) for tabular data, PNG (Portable Network Graphics) for images, and Hierarchical Data Format (HDF) and NetCDF for hierarchically structured scientific data are some of the most commonly used open formats.
4. Organize the data in a structured format prior to analysis using database management systems (DBMS). Data can be stored in object-oriented databases, relational databases (in the form of tables), network databases, hierarchical databases, and many more. Online repositories are available to store this large amount of data. NCBI is one of the largest repositories for biological data. Other examples include EMBL, DDBJ, PDB, etc.
5. The stored data requires unique identifiers so that it can be retrieved and the work reproduced easily in the future. Digital Object Identifier (DOI), Archival Resource Key (ARK), accession number, and PMID are some unique identifiers used to retrieve the data.
6. Adopt proper privacy protocols and have a systematic backup plan to ensure the data is protected.

3.2.2 Data Storage Systems

Because of its enormous size, big data cannot be backed up and processed using traditional methods and is handled differently in terms of storage. It requires the specialized storage systems discussed below.

1. Data warehouse: A data warehouse (DW) is an integrated collection of data organized so that decision-makers can easily understand and analyze it. According to Inmon, “a data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data used to aid management choices” [2].
2. Data lakes: A data lake is a central storage repository that houses large volumes of raw, detailed data collected from numerous sources. It is flexible enough to store any kind of information, from raw to sorted. Data in the data lake is given unique identifiers and metadata tags to facilitate speedy retrieval. The term “data lake” refers to a massive collection of largely unstructured data whose purpose has not yet been defined, whereas a data warehouse stores information that has already been transformed for analysis [3].
3. Network attached storage: Data storage that is accessed via a network connection rather than by connecting directly to a computer is known as network attached storage (NAS). NAS devices have their own processors and operating systems, which enable them to run applications and to share files easily with authorized users [3].
4. Cloud storage: Cloud storage is a crucial component of the cloud computing architecture for storing and sharing data. It offers offsite backup, virtually unlimited storage space, efficient and secure file access, and low cost of use [4]. The data is stored in a shared virtual environment. AWS (Amazon Web Services), Microsoft Azure Data Lake, IBM Cloud, and Google Cloud Storage are some of the big names that provide cloud-based storage services [3].
5. Object storage: Instead of a traditional file system, this method stores data as objects in a massive repository spread across many physical storage devices. Metadata is stored with each data object so that it can be accessed without a hierarchy [3].

3.2.3 Important Features of a Big Data Database

1. Data storage: It should accommodate petabytes of data, both structured and unstructured.
2. Data modeling: It should be able to store data in various models such as key-value sets, documents, graphs, wide-column stores, and multi-model stores.
3. Data querying: The database should handle several queries at once and support real-time loading and processing of large data, batch and streaming workloads, and analytical workloads.
4. Database performance: Horizontal scalability, elastic resource configuration, automatic replication of large data across several servers for low latency and high availability, and removal of stale data from tables should all be standard features.
5. Database security and reliability: The database should be secure and reliable. This can be achieved through encryption, user authorization and authentication, and continuous and on-demand backup and restore.
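Rule 2 in Subheading 3.2.1 recommends recording a cryptographic hash of the raw data. The following is a minimal sketch of how that could look in Python, assuming SHA-256 (a member of the SHA-2 family) from the standard hashlib module. The function names sha256_of_file and verify_file and the file name reads.fasta are illustrative assumptions, not part of any tool named in this chapter.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Chunked reading keeps memory use constant even for very large files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_digest):
    """Return True if the file still matches the digest recorded earlier."""
    return sha256_of_file(path) == expected_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "reads.fasta"  # hypothetical raw data file
        raw.write_text(">seq1\nACGTACGT\n")
        recorded = sha256_of_file(raw)       # store this next to the data
        print(verify_file(raw, recorded))    # True: file is unchanged
        raw.write_text(">seq1\nACGTACGA\n")  # simulate silent corruption
        print(verify_file(raw, recorded))    # False: digest no longer matches
```

The digest only detects changes. It does not hide the data, so privacy (rule 6) still has to be handled separately.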

3.3 Data Analysis

Now that the data has been collected and stored systematically, it can be mined for meaningful insights such as unknown correlations and hidden patterns. The analysis of big data can be broadly divided into two parts: descriptive analysis and predictive analysis.

3.3.1 Descriptive Analysis

Descriptive analysis defines and describes the big data. Past data can be analyzed to find patterns, anomalies, similarities, differences, and relationships between different biological entities. Descriptive analysis techniques are chosen according to the type of data available and the research question. In bioinformatics, the five major types of big data are (1) DNA, RNA, and protein sequences, (2) gene expression data, (3) protein-protein interaction data, (4) gene ontology data, and (5) metabolic pathway data [5].

Analysis of Sequence Data

Sequence data can be analyzed to find relationships between the genomes and proteins of multiple organisms, their evolutionary relationships, and their connection to diseases and other phenotypes. Next-generation sequencing, which provides sequence information for whole genomes, produces data of large magnitude. This high-throughput sequencing method requires advanced computational tools and software.

I. Sequence databases: NCBI, GenBank, DDBJ, EMBL-EBI
II. Tools for sequence analysis:
(a) Sequence Alignment Tools
• BLAST: The Basic Local Alignment Search Tool of the National Center for Biotechnology Information (NCBI) is a local sequence alignment tool that can be accessed online. To align two sequences, it first uses a heuristic approach to locate short matches between them [6]. The software assesses the statistical significance of matches between nucleotide or protein sequences and sequence databases. BLAST is useful for inferring the evolutionary relationships of sequences, inferring their functional roles, and locating related genes.
• CLUSTAL OMEGA: It is an online multiple sequence alignment (MSA) tool hosted by EMBL-EBI that aligns three or more sequences. It uses seeded guide trees and HMM profile-profile techniques. However, it is not suitable for big data, because it can align at most 4000 sequences, with a maximum file size of 5 MB [7].
(b) NGS Analysis Tools
• Galaxy: It is one of the most popular data analysis frameworks for NGS data handling. It allows users to design comprehensive workflows and pipelines. It can be used by non-programmers but requires skilled bioinformaticians for complex tasks [8].
• Strand NGS: Alignment, DNA-Seq, RNA-Seq, small RNA-Seq, and ChIP-Seq analysis, a Genome Browser, visualizations, and biological interpretation are just some of the capabilities it offers (Fig. 1). The data can be imported in a variety of formats, including FASTA, FASTQ, and tag count. In addition, pre-aligned data can be imported and analyzed immediately in SAM, BAM, or the Illumina-specific ELAND format. Strand NGS supports alignment of data from Illumina, Ion Torrent, 454 (Roche), and PacBio [9].
• CLC Genomics Workbench: CLC Genomics by QIAGEN includes tools for re-sequencing, workflows, read mapping, de novo assembly, variant identification, RNA-Seq, ChIP-Seq, a Genome Browser, and more. The Main Workbench offers database search (GenBank, BLAST, PubMed), and it is compatible with VCF files from the 1000 Genomes Project and dbSNP tracks. In addition to the aforementioned formats, it accepts FASTA, GFF/GTF/GVF, BED, Wiggle, COSMIC, the UCSC variation database, and Complete Genomics master var files [9].
(c) RNA-Seq Analysis Tools
• MapSplice: Accurate mapping of reads that span splice junctions is essential for any analysis performed on RNA-seq data. MapSplice is a second-generation splice detection method characterized by high sensitivity and specificity in splice identification as well as low computational and memory overhead. It can handle both short reads (shorter than 75 bp) and long reads (75 bp or longer). Because it does not depend on the properties of splice sites or on intron length, MapSplice can detect novel canonical and non-canonical splices. It takes advantage of the quality and diversity of read alignments at a given splice site to boost precision [10].
• MeV: MeV, short for “MultiExperiment Viewer,” is a desktop application written in Java that provides numerous clustering, statistical, and visualization features through a friendly GUI. MeV automatically imports annotation at the transcript or gene level from the UCSC or Ensembl databases, thereby annotating the data for the user. Expression data can be loaded as discrete counts or as RPKM or FPKM values. Raw sequence counts per transcript are converted into RPKM values automatically, and vice versa [11].

Fig. 1 Outline of NGS data analysis

Gene Expression Analysis

Gene expression analysis looks at which genes are turned on or off at different times during a process such as the progression of a disease or the application of a treatment. Microarray-based gene expression profiling is the standard approach for recording expression levels for analysis. Microarray data can be divided into three categories: gene-sample, gene-time, and gene-sample-time. Time-space gene expression profiles document changes in expression across time, whereas sample-space gene expression profiles record changes in expression in response to different environmental conditions [12].

Applications of gene expression analysis: identification of genes that are affected at different stages of particular diseases. Biomarkers for disease diagnosis and prevention can then be suggested from this data.

I. Databases for microarrays: ArrayExpress from EBI and Gene Expression Omnibus (GEO) from NCBI.
II. Tools for microarray analysis: Many programs exist for performing a wide variety of analyses on microarray data. However, not all of them were designed to handle enormous amounts of data. As data volumes grow, it takes an inordinate amount of time to generate samples and sequences for complex identification and to analyze diverse disease queries for relevant complexes.
• BART: The Bioinformatics Array Research Tool (BART) is a free R Shiny application that fully automates downloading and analyzing microarray data from different microarray platforms. It is easy to use, automatically downloads and parses data from GEO, suggests sample groupings for differential expression testing, corrects for batch effects, generates quality control plots, converts probe IDs, and produces full lists of differentially expressed genes and functional enrichment analysis [13].
• Beeline: It manages large amounts of data by performing parallel calculations and reducing the amount of data using adaptive filtering [14].
• caCORRECT: It is a quality assurance tool that removes artifactual noise from high-throughput microarray data. caCORRECT provides a universal quality score for validation, which can be used to improve the authenticity and quality of data that has been replicated or is stored in publicly available microarray archives.
• omniBiomarker: It is a web-based program that uses knowledge-driven algorithms to discover biomarkers from high-throughput gene expression data by locating differentially expressed genes. The procedure requires complex computation and validation, and omniBiomarker aids in the discovery of consistent and trustworthy biomarkers.

3.3.2 Predictive Analysis

Supervised Learning

In predictive analysis, past data is used to draw inferences and build a model that can predict future outcomes. In the life sciences, predictive analysis can help with early identification and prevention of disease in at-risk patients. Predictive analysis relies on advanced machine learning techniques. Machine learning is an area of AI and computer science that focuses on teaching computers to acquire domain-specific expertise through the analysis of large amounts of data and the application of sophisticated algorithms. These computational tools can discover meaningful patterns in enormous datasets. Machine learning techniques are classified as either supervised or unsupervised, depending on the expected outcome and the nature of the input used to train the model. The input to a supervised learning model is a set of labeled training examples (e.g., case and control), and the algorithm constructs an inferred function by analyzing this training data. Semi-supervised algorithms combine labeled and unlabeled data. Because supervised algorithms build upon existing knowledge, they are frequently the most effective method for predicting novel members of pathways, cell lineage specificity, or novel connections. Because they are based on known biology, they can be significantly better at disregarding dataset-specific noise. As a result, they are especially well suited for integrative analysis of large data collections.

Applications of Supervised Learning – The ENCODE Project Consortium
• According to a 2012 study, supervised machine learning techniques have been used to discover tissue-specific enhancer sites in the human genome.
• In their study, Yip et al. split the human genome into 100-bp bins and calculated the average signal of each chromatin feature over the 100 bp of each bin.
• Histone modifications, FAIRE, and DNase I hypersensitivity were among the features. The following approach was used to predict tissue-specific human enhancers.
• First, a supervised Random Forest model was built to predict binding active regions (BARs) across the complete human genome.
• A BAR is a genomic region that has an open chromatin structure, is highly accessible to transcription factors, and is more likely to be a site of transcription-related factor (TRF) binding.
• To train the model, 100-bp bins that overlapped a TRF binding peak were collected as positive BAR examples, and non-positive bins chosen at random from the whole genome served as negative examples.
• The model used chromatin data from the cell lines GM12878 and K562 as predictive features and identified a list of candidate BAR bins in the human genome, most likely in regions of DNA with open chromatin structure.
• Second, predicted BAR bins located near known genes or promoters were excluded from the subsequent analysis.
• The remaining BAR bins were filtered further to retain only the most highly conserved ones.
• Neighboring bins were concatenated into longer regions, and for each resulting region, the number of binding motifs for a set of embryo-specific transcription factors (such as members of the SOX and OCT families) was counted.
• Conserved regions with an excess of TF-binding motifs were predicted to be embryo-specific enhancers.

Unsupervised Learning

The input to an unsupervised learning model consists of instances with no predetermined labels. The purpose of unsupervised learning is to uncover latent signals in the data. It is possible to find clusters by unsupervised learning and subsequently use supervised methods to classify new samples. When a new sample is received, for example, it can be assigned to the most relevant cluster. Unsupervised techniques are employed mainly to reveal the underlying structure of the data, such as when attempting to answer the question, “What patterns exist in cancer gene expression?” These algorithms frequently find the most prevalent recurring features in the data. Principal component analysis (PCA) is an example of an unsupervised technique for locating the most signal-rich hidden features in the data. The first principal component explains the largest share of the variation in the data. Two large studies on breast cancer (Cancer Genome Atlas, 2012; Curtis et al., 2012) are combined, and principal component analysis is performed on the combined data. The study is listed as the system’s primary input.

Applications of Unsupervised Learning: Unsupervised ML can be applied in the following ways.
• Unsupervised methods work well for identifying molecular subtypes.
• Given measurements of the gene expression of a set of tumors, the question of interest is whether there are common gene expression patterns.
• Although many clustering algorithms are available, those that divide the data into a specified number of groups or clusters are most frequently used to identify cancer subtypes.
• The “k-means” clustering algorithm is one of these methods. It divides measurements into groups that display similar patterns, but the number of groups, “k,” must be set in advance.
• The cluster package for the R programming language provides statistical methods, such as Tibshirani’s gap statistic, for determining an appropriate number of clusters [15].
• Tothill et al. (2008) show how k-means clustering can be used to identify molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome [16].
• The most useful metric for evaluating unsupervised analyses is the reproducibility of the subtypes across different datasets. The TCGA ovarian cancer group, for example, reproduced the groupings identified by Tothill et al.
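The k-means procedure described above can be illustrated with a small, dependency-free Python sketch. It is only a toy under stated assumptions: the first k samples are used as starting centroids so that the run is deterministic (real analyses normally use random or k-means++ seeding with several restarts), and the two-gene "expression profiles" are invented values, not data from the studies cited here. The function name kmeans is likewise illustrative.

```python
import math


def kmeans(points, k, iterations=100):
    """Cluster equal-length numeric tuples with Lloyd's k-means algorithm.

    Assumption for this sketch: the first k points seed the centroids.
    Returns (labels, centroids), where labels[i] is the cluster of points[i].
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return labels, centroids


# Invented two-gene expression values for six hypothetical tumors.
tumors = [(1.0, 1.2), (5.0, 5.1), (0.8, 1.0), (5.2, 4.9), (1.1, 0.9), (4.8, 5.0)]
labels, centroids = kmeans(tumors, k=2)
print(labels)
```

Because k has to be chosen before the run, a sketch like this would in practice be repeated for several values of k and the results compared, for example with the gap statistic mentioned above.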

4 Big Data Solutions: The Data Architectures

Various data architectures have been developed to manage and analyze big data. Some of them are described below.

4.1 MapReduce Architecture

Google invented the data-parallel architecture known as MapReduce (Dean and Ghemawat 2008). Parallelism is achieved by having numerous computers, or nodes, work on different parts of the data simultaneously. Apache Hadoop is a popular open-source implementation of MapReduce. A MapReduce daemon runs continuously on each node. One master node manages configuration and control for the duration of the job. The other nodes, referred to as worker nodes, process the data and carry out the computation. The master node also divides the data into key-value pairs, assigns them to worker nodes, and stores them in the global memory [17].
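The map, shuffle, and reduce stages can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's API: the function names map_phase, shuffle, reduce_phase, and run_job are assumptions made for this example, and the three short reads are toy input.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (key, 1) pair for every nucleotide in one sequence."""
    return [(base, 1) for base in record]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce one after another in a single process.

    On a real cluster each map_phase call would run on a worker node and
    the shuffle would move data between nodes.
    """
    mapped = chain.from_iterable(map_phase(record) for record in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


reads = ["ACGT", "AACC", "GGTA"]  # toy input, not real sequencing data
print(run_job(reads))
```

Because each call to map_phase depends only on its own record, the calls can be distributed freely across nodes, which is the source of the parallelism described above.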

4.2 Fault Tolerance Architecture

Other architectures are required to process complicated, iterative problems efficiently while ensuring fault tolerance. Scalability depends on fault tolerance, since it enables the use of unreliable networks like the Internet. To this end, Low et al. (2014) proposed a fault-tolerant graph-based architecture called GraphLab [18], and many other big data solutions subsequently adopted comparable structures. Each node in this design carries out a specific task, and the computation is distributed among the nodes in a heterogeneous manner. The data model is composed of two components: a shared global memory and a (distributed) graph of computational nodes.

4.3 Stream Graph Architecture

The graph-based design described above is not effective for stream data because of its significant disk read/write overhead. Although packages such as Spark Streaming [19] perform analytics on stream data on top of the MapReduce architecture, they internally transform the stream into batches for processing. Stream applications need in-memory processing to achieve high bandwidth. The well-known Message Passing Interface (MPI) works well for this problem. At the application level, MPI and MapReduce share a comparable API, making it possible to implement practically all MapReduce applications using MPI as well. Its architecture keeps pace with rising processing speeds as well as increased network bandwidth and reliability.
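The difference between batch-converted and record-at-a-time stream processing can be made concrete with a short Python sketch. It uses only generators from the standard library and does not reproduce the Spark Streaming or MPI interfaces. The helper names micro_batches and running_gc_per_record, and the toy reads, are assumptions made for illustration.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an iterator into fixed-size lists (micro-batches)."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_gc_per_record(stream):
    """Record-at-a-time processing: update in-memory state on every read
    and yield the GC fraction of everything seen so far."""
    gc_count = total = 0
    for read in stream:
        gc_count += sum(base in "GC" for base in read)
        total += len(read)
        yield gc_count / total


reads = ["GGCC", "ATAT", "GCAT", "AAAA", "CCGG"]  # toy stream of reads

# Batch-style: one result per micro-batch, available only once it is full.
for batch in micro_batches(reads, batch_size=2):
    bases = "".join(batch)
    print(len(batch), sum(base in "GC" for base in bases) / len(bases))

# Stream-style: one result after every incoming record.
print(list(running_gc_per_record(reads)))
```

In the batch-style loop a result has to wait until its batch is complete, whereas the generator in running_gc_per_record keeps its state in memory and responds to each record as it arrives.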

5 Conclusion

Big data analysis is of utmost importance in the biological sciences and has emerged as a very useful technology. However, it is still in the development phase and has many shortcomings. Big data is characterized not only by volume, velocity, and variety; it is also highly dynamic and keeps growing exponentially. Multiple kinds of data are distributed globally and need to be analyzed. Traditional methods are not suited to processing such huge amounts of data. Since this data is huge and heterogeneous, the primary problem is its storage. Highly efficient database management systems need to be developed. Big data should not only be stored; it should also be non-redundant and easily retrievable through queries. Big data may require huge physical storage space, which makes big data analysis a tedious and costly undertaking. The data can be stored and analyzed in parallel to reduce its dimensions and increase the processing speed. To address the major problems of storage, cloud-based servers now provide a virtual environment for the storage and analysis of big data, which significantly reduces the cost of storage. Cloud-based servers are useful for student researchers because they remove the need for large local storage and costly computer systems. Amazon AWS, Google Cloud Services, and Microsoft Azure offer good cloud computing services. Moreover, big data remains relatively unexplored in the biological sciences. There is still a need for powerful analysis tools and algorithms that give accurate results. Tools that can handle heterogeneous data, which is highly varied and present in different data formats, have great scope for development. Simultaneous processing of multivariate data, high-performance supercomputers, and skilled bioinformaticians and big data analysts are some areas for improvement.

References

1. Hart EM, Barmby P, LeBauer D, Michonneau F, Mount S, Mulrooney P, Poisot T, Woo KH, Zimmerman NB, Hollister JW (2016) Ten simple rules for digital data storage. PLoS Comput Biol 12:e1005097
2. Chaudhuri S, Dayal U (1997) An overview of data warehousing and OLAP technology. SIGMOD Rec 26:65–74
3. Julliet R (2022) How to store big data. https://www.bocasay.com/how-to-store-bigdata/
4. Hassan J, Shehzad D, Habib U, Aftab MU, Ahmad M, Kuleev R, Mazzara M (2022) The rise of cloud computing: data protection, privacy, and open research challenges-a systematic literature review (SLR). Comput Intell Neurosci 2022:8303504
5. Kashyap H, Ahmed HA, Hoque N, Roy S, Bhattacharyya DK (2015) Big data analytics in bioinformatics: a machine learning perspective. arXiv preprint arXiv:1506.05101
6. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL (2008) NCBI BLAST: a better web interface. Nucleic Acids Res 36:W5–W9
7. Madeira F, Pearce M, Tivey ARN, Basutkar P, Lee J, Edbali O, Madhusoodanan N, Kolesnikov A, Lopez R (2022) Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Res 50:W276–W279
8. Bianchi V, Ceol A, Ogier AG, De Pretis S, Galeota E, Kishore K, Bora P, Croci O, Campaner S, Amati B, Morelli MJ (2016) Integrated systems for NGS data management and analysis: open issues and available solutions. Front Genet 7:75
9. Prajapati J. List of bioinformatics software tools for next generation sequencing. https://bioinformaticsonline.com/pages/view/26617/list-of-bioinformatics-software-tools-fornext-generation-sequencing
10. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, Liu J (2010) MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 38:e178
11. Howe EA, Sinha R, Schlauch D, Quackenbush J (2011) RNA-Seq analysis in MeV. Bioinformatics 27:3209–3210
12. Kashyap H, Ahmed HA, Hoque N, Roy S, Bhattacharyya DK (2016) Big data analytics in bioinformatics: architectures, techniques, tools and issues. Netw Model Anal Health Inform Bioinform 5:1–28
13. Amaral ML, Erikson GA, Shokhirev MN (2018) BART: bioinformatics array research tool. BMC Bioinform 19:296
14. Illumina (2018) Beeline Illumina (Version 2.0). Illumina, Inc. Retrieved from https://support.illumina.com/downloads/beelinesoftware-2-0.html
15. Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc B Stat Methodol 63:411–423
16. Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett MK, Etemadmoghadam D, Locandro B, Traficante N, Fereday S, Hung JA, Chiew Y-E, Haviv I, Australian Ovarian Cancer Study Group, Gertig D, de Fazio A, Bowtell DDL (2008) Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res 14:5198–5208
17. Khezr SN, Navimipour NJ (2017) MapReduce and its applications, challenges, and architecture: a comprehensive review and directions for future research. J Grid Comput 15:295–321
18. Low Y, Gonzalez JE, Kyrola A, Bickson D, Guestrin CE, Hellerstein J (2014) GraphLab: a new framework for parallel machine learning. arXiv preprint arXiv:1408.2041
19. Apache Software Foundation (2023) Apache Spark (version 3.4.0). Retrieved from https://spark.apache.org/news/spark-3-4-0-released.html

Chapter 10
Identification of Culprit Genes for Different Diseases by Analyzing Microarray Data
Ayushman Kumar Banerjee, Shrayana Ghosh, and Chittabrata Mal

Abstract
The identification of disease-causing genes is the first and most important step toward understanding the biological mechanisms underlying a disease. Microarray analysis is a powerful method that is widely used to identify genes that are expressed differently between two or more conditions (e.g., disease vs. normal). Because of its large library of statistical packages and user-friendly interface, the R programming language provides a convenient platform for microarray analysis. In this chapter, we go over how to identify disease-causing culprit genes from raw microarray data using various R packages. The pipeline covers the main steps of microarray analysis: data pre-processing, normalization, statistical analysis, and visualization with heatmaps, box plots, and so on. To better understand the function of the altered genes, gene ontology and pathway analysis are performed.

Key words Differentially expressed genes, Microarray analysis, R-programming, Gene ontology, Pathway analysis

1 Introduction

The discovery of disease-causing genes can greatly benefit the medical research community because it aids disease diagnosis and the design of treatment strategies. For this reason, differentially expressed genes (DEGs) have become a focus of investigation. DEGs are genes whose expression levels differ between two conditions (e.g., disease vs. normal) [1]. Gene expression is the process by which the information in a gene's DNA is transcribed into RNA, and the expression level of a gene indicates the approximate number of RNA copies of that gene present in a cell, which in turn reflects the amount of the corresponding gene product. Disease-induced changes in a normal cell alter the expression levels of its genes. The investigation of gene expression levels in normal and in disease states could lead to helpful insights into the complex pathogenic mechanisms that underpin a disease and could further lead to a

Sudip Mandal (ed.), Reverse Engineering of Regulatory Networks, Methods in Molecular Biology, vol. 2719, https://doi.org/10.1007/978-1-0716-3461-5_10, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024


better therapeutic approach [2]. Hence, DEGs are potential candidates for biomarkers, therapeutic targets, and early diagnosis and prognosis of diseases.

Microarray technology is one of the effective methods for identifying disease-causing genes. It is a molecular technique, also known as a DNA chip, that quantifies the expression of mRNA in cells or tissues [3]. With microarray technology, the expression levels of thousands of genes can be monitored simultaneously and compared across conditions. Because microarray datasets are commonly large, they must be reduced to the genes that best differentiate the two cases, such as disease and normal. The first step of an in-depth microarray analysis is therefore to identify differential gene expression [1].

A microarray is a glass slide with thousands of microscopic spots at predetermined positions, where biological molecules such as DNA, RNA, or proteins are immobilized on the surface in a defined order. The array is read with an image scanner, the characteristic signal patterns of the molecules are detected, and the resulting data are stored on a computer and used to create gene expression profiles. This method allows the expression of a large number of genes to be analyzed at the same time [4]. Microarray data analysis can therefore help identify the genes that are important and informative and remove the redundant and irrelevant ones [5].
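The idea of reducing a large expression matrix to the genes that best separate two conditions can be illustrated with a minimal base-R sketch. It is only an illustration on simulated log2 expression values and not the limma-based workflow used in this chapter; the matrix size, the ten "spiked" genes, the per-gene t-test, and the cut-offs (|log fold change| > 1, adjusted p < 0.05) are all assumptions chosen for the example.

```r
# Simulated log2 expression matrix: 200 genes x 20 samples (assumed sizes)
set.seed(1)
n_genes <- 200
n_per_group <- 10
group <- factor(rep(c("normal", "disease"), each = n_per_group),
                levels = c("normal", "disease"))
expr <- matrix(rnorm(n_genes * 2 * n_per_group, mean = 8, sd = 0.3),
               nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("S", seq_len(2 * n_per_group))))

# Make the first ten genes truly up-regulated in the disease group
spiked <- 1:10
expr[spiked, group == "disease"] <- expr[spiked, group == "disease"] + 2

# Log fold change: difference of group means on the log2 scale
log_fc <- rowMeans(expr[, group == "disease"]) -
  rowMeans(expr[, group == "normal"])

# Per-gene Welch t-test, then Benjamini-Hochberg correction
p_value <- apply(expr, 1, function(x) {
  t.test(x[group == "disease"], x[group == "normal"])$p.value
})
adj_p <- p.adjust(p_value, method = "BH")

result <- data.frame(gene = rownames(expr), logFC = log_fc,
                     p_value = p_value, adj_p = adj_p)

# Keep only genes that pass both (assumed) thresholds
degs <- result[abs(result$logFC) > 1 & result$adj_p < 0.05, ]
degs$gene
```

In a real analysis, moderated statistics such as those provided by limma are generally preferred over plain t-tests, especially when the number of samples per group is small.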

2 Materials

2.1 Dataset

This chapter shows how to use the R programming language to analyze a microarray gene expression profile matrix. Disease-related microarray data can be obtained from the Gene Expression Omnibus (GEO) or ArrayExpress databases. Profile matrices are generated on several platforms, such as Affymetrix or Illumina microarrays. The metadata associated with a dataset must therefore be thoroughly scrutinized during data collection, because the platform determines which packages and annotation databases must be used in the later steps. In a gene expression matrix, the microarray probe IDs are kept in the leftmost (first) column, and the gene expression values of each sample are provided in the subsequent columns. The probe IDs are mapped to the corresponding genes during the annotation step. An example of a gene expression matrix is shown in Table 1. As an example, the gene expression profiles of gastric-cardia and gastric-non-cardia adenocarcinomas along with their matched normals are used here. This dataset can be downloaded from NCBI GEO (accession number GSE29272). It includes 62 gastric-cardia,


Table 1 Gene expression matrix

Probe_ID     Sample 1   Sample 2   Sample 3   Sample 4
1007_s_at    10.26      10.4       10.11      9.18
1053_at      6.6        6.65       6.56       6.63
117_at       7.22       7.35       7.32       7.78
121_at       9.55       9.37       9.07       9.08
1255_g_at    4.95       4.87       4.94       4.88
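The layout of Table 1 can be reproduced in R as a small sketch: probe IDs sit in the first column of a data frame, and for numerical work they are usually moved into the row names of a numeric matrix. The object names `expr_table` and `expr_matrix` and the column names `Sample1` to `Sample4` are illustrative choices, not names used by the authors.

```r
# Table 1 as a data frame: probe IDs first, one column per sample
expr_table <- data.frame(
  Probe_ID = c("1007_s_at", "1053_at", "117_at", "121_at", "1255_g_at"),
  Sample1  = c(10.26, 6.60, 7.22, 9.55, 4.95),
  Sample2  = c(10.40, 6.65, 7.35, 9.37, 4.87),
  Sample3  = c(10.11, 6.56, 7.32, 9.07, 4.94),
  Sample4  = c( 9.18, 6.63, 7.78, 9.08, 4.88),
  stringsAsFactors = FALSE
)

# Numeric matrix with probe IDs as row names (probes x samples)
expr_matrix <- as.matrix(expr_table[, -1])
rownames(expr_matrix) <- expr_table$Probe_ID

dim(expr_matrix)                 # number of probes and samples
expr_matrix["117_at", ]          # expression of one probe across samples
summary(as.vector(expr_matrix))  # overall range of expression values
```

Keeping the probe IDs as row names rather than as a column means that functions such as `rowMeans()` or `boxplot()` can be applied to the matrix directly.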

Table 2 List of required R packages

Package               Description
affy                  This package is used for analyzing the oligonucleotide arrays that are created by Affymetrix platform [6]
affycoretools         The package is intended to help people easily create useful output from various analyses [7]
arrayQualityMetrics   This package generates microarray quality metrics reports for data in Bioconductor microarray data containers (ExpressionSet, NChannelSet, AffyBatch) [8]
limma                 Gene expression microarray data analysis library. It employs linear models for the evaluation of designed experiments and differential expressions [9]
org.Hs.eg.db          Genome-wide annotation for human, primarily based on mapping using Entrez Gene identifiers [10]
hgu133a.db            Affymetrix HG-U133A Array annotation data (chip hgu133a) assembled using data from public repositories [11]
clusterProfiler       This package supports functional characteristics of both coding and non-coding genomics data for thousands of species with up-to-date gene annotation [12, 13]
pheatmap              An R implementation of heatmaps that offers more control over dimensions and appearance (https://cran.r-project.org/web/packages/pheatmap/index.html)

72 gastric-non-cardia adenocarcinoma, and 134 matched normal expression profiles. This study was performed on the GPL96 [HG-U133A] Affymetrix platform.

2.2 Software

R can be downloaded and installed from http://www.r-project.org/. R has an extensive set of packages for the analysis, representation, and reporting of biological data, and it can be installed on all major operating systems, including macOS, Windows, and Unix/Linux. Most of the R packages required to analyze microarray datasets are obtained from Bioconductor (http://www.bioconductor.org); pheatmap is distributed through CRAN (Table 2).

3 Methods

Microarray data analysis proceeds in several steps. A simple pipeline is shown in Fig. 1.

3.1 Package Installation

After R is installed, the required packages can be installed from either the R console or RStudio. BiocManager must be installed first; the remaining packages are then installed with BiocManager::install(). Package names must be passed as quoted strings.

    # Installing BiocManager
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")

    # Using BiocManager to install required packages
    BiocManager::install("affy")
    BiocManager::install("affycoretools")
    BiocManager::install("arrayQualityMetrics")
    BiocManager::install("limma")
    BiocManager::install("org.Hs.eg.db")
    BiocManager::install("clusterProfiler")
    BiocManager::install("ggplot2")
    BiocManager::install("pheatmap")
    BiocManager::install("hgu133a.db")
    BiocManager::install("R2HTML")
    BiocManager::install("annotate")
    BiocManager::install("dplyr")
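As an alternative to one call per package, the installation can be sketched as a single vectorized call that skips packages already present. The helper names `missing_packages()` and `install_missing()` are hypothetical and introduced only for this example; `installed.packages()` and the character-vector form of `BiocManager::install()` are standard. The final call is left commented out because it requires network access.

```r
# Packages named in the installation step of this chapter
required_pkgs <- c("affy", "affycoretools", "arrayQualityMetrics", "limma",
                   "org.Hs.eg.db", "clusterProfiler", "ggplot2", "pheatmap",
                   "hgu133a.db", "R2HTML", "annotate", "dplyr")

# Hypothetical helper: names of packages that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install everything that is missing in one call
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    # BiocManager::install() accepts a character vector of package names
    BiocManager::install(to_install)
  }
  invisible(to_install)
}

# Report what would be installed without changing the library
missing_packages(required_pkgs)

# install_missing(required_pkgs)  # uncomment to install (needs network)
```

Checking first avoids reinstalling large annotation packages such as org.Hs.eg.db every time the script is rerun.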

Fig. 1 Overall microarray data analysis pipeline


After successful installation, the packages are loaded in the given order with the function library().

    # Loading all the tools using library command
    library(affy)
    library(affycoretools)
    library(arrayQualityMetrics)
    library(limma)
    library(org.Hs.eg.db)
    library(clusterProfiler)
    library(ggplot2)
    library(pheatmap)
    library(hgu133a.db)
    library(R2HTML)
    library(annotate)
    library(dplyr)
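When many packages are loaded, a missing one is easier to spot if loading stops with a clear message. The sketch below wraps `library()` in a loop for that purpose; `load_packages()` is a hypothetical helper, not part of any package, and it is demonstrated on two packages that ship with R so that it runs without Bioconductor installed.

```r
# Hypothetical helper: load packages in the given order, failing early
# with an informative message if one of them is not installed
load_packages <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' is not installed; install it before loading.")
    }
    # character.only = TRUE lets library() take the name from a variable
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  }
  invisible(pkgs)
}

# Demonstration with packages that are part of every R installation
load_packages(c("stats", "utils"))

# For this chapter the call would be, for example:
# load_packages(c("affy", "affycoretools", "arrayQualityMetrics", "limma"))
```

Because the loop preserves the order of the vector, the packages are attached in the same sequence as in the explicit `library()` calls.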

3.2 Importing Raw Data

The function setwd() is used to specify the working directory, where the .CEL files are downloaded. Next, the .CEL files containing the raw data are imported as an AffyBatch object using the function ReadAffy(). To inspect the content of the AffyBatch object, the summary() function can be used. This function provides the user with information such as the size of the array, the number of genes and samples, affy IDs, and annotation.

    # Set working directory, where the .CEL data files are located
    # The full path of the directory must be provided
    setwd("C:/Users/user/Data/GSE29272/")

    # Read in the microarray data (.CEL files)
    raw_data